## Data Cleaning

The data extracted from the website is not yet clean enough to analyze. The reviews column needs to be cleaned of punctuation, spelling issues, and other stray characters.

In [1]:
#imports

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

#regex
import re

In [3]:
#create a dataframe from csv file

cwd = os.getcwd()

df = pd.read_csv(cwd+"/BA_reviews (1).csv", index_col=0)

In [4]:
df.head()

Unnamed: 0,reviews,stars,date,country
0,✅ Trip Verified | Absolutely PATHETIC business...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,23rd December 2023,United States
1,Not Verified | Overall not bad. Staff look ti...,3,21st December 2023,Canada
2,✅ Trip Verified | This was our first flight wi...,6,21st December 2023,Australia
3,✅ Trip Verified | I recently encountered a hig...,10,21st December 2023,United States
4,Not Verified | Beware! BA don't provide any r...,1,20th December 2023,United States


We will also create a column that indicates whether each review is verified.

In [5]:
df['verified'] = df.reviews.str.contains("Trip Verified")

In [6]:
df['verified']

0        True
1       False
2        True
3        True
4       False
        ...  
3716    False
3717    False
3718    False
3719    False
3720    False
Name: verified, Length: 3721, dtype: bool
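
As a quick sanity check, the flag can be tried on a few made-up strings. The sample below is hypothetical; only the `str.contains("Trip Verified")` call comes from the notebook. It illustrates why both "Not Verified" reviews and older reviews without any prefix end up as `False`.

```python
import pandas as pd

# hypothetical sample rows mimicking the prefixes seen in df.head()
sample = pd.Series([
    "✅ Trip Verified | Great flight.",
    "Not Verified | Overall not bad.",
    "LHR to HAM. Purser addresses passengers by name.",
])

# only rows containing the exact substring "Trip Verified" map to True
flags = sample.str.contains("Trip Verified")
print(flags.tolist())
```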

### Cleaning Reviews

We will extract the reviews column into a separate Series and clean it for semantic analysis.

In [11]:
#for lemmatization of words we will use nltk library
import nltk
nltk.download('stopwords')
nltk.download('wordnet')


from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
lemma = WordNetLemmatizer()


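# note: str.strip treats its argument as a set of characters to trim from both
# ends, not as a literal prefix, so a leading "Not Verified |" is left in place
reviews_data = df.reviews.str.strip("✅ Trip Verified |")

#create an empty list to collect cleaned data corpus
corpus =[]

#build the stop word set once instead of once per word
stop_words = set(stopwords.words("english"))

#loop through each review, replace non-letters with spaces, lowercase it,
#drop stop words, lemmatize the rest, join it and add it to corpus
for rev in reviews_data:
    rev = re.sub('[^a-zA-Z]',' ', rev)
    rev = rev.lower()
    rev = rev.split()
    rev = [lemma.lemmatize(word) for word in rev if word not in stop_words]
    rev = " ".join(rev)
    corpus.append(rev)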

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [12]:
# add the corpus to the original dataframe

df['corpus'] = corpus

In [13]:
df.head()

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | Absolutely PATHETIC business...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,23rd December 2023,United States,True,absolutely pathetic business class product ba ...
1,Not Verified | Overall not bad. Staff look ti...,3,21st December 2023,Canada,False,verified overall bad staff look tired overwork...
2,✅ Trip Verified | This was our first flight wi...,6,21st December 2023,Australia,True,first flight british airway year usual fault c...
3,✅ Trip Verified | I recently encountered a hig...,10,21st December 2023,United States,True,recently encountered highly disappointing expe...
4,Not Verified | Beware! BA don't provide any r...,1,20th December 2023,United States,False,verified beware ba provide refund due serious ...
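
Rows 1 and 4 above still begin with "verified" in the corpus column, because the "Not Verified |" marker survives the `str.strip` call and only "not" is later removed as a stop word. One possible alternative, sketched here on made-up strings and assuming the markers always look like the ones in `df.head()`, is to remove the marker with a regular expression. The `prefix` pattern and the sample data are illustrative, not part of the original notebook.

```python
import pandas as pd

# hypothetical sample rows; the real data lives in df.reviews
sample = pd.Series([
    "✅ Trip Verified | Absolutely pathetic product.",
    "Not Verified |  Overall not bad.",
    "LHR to HAM. Purser addresses passengers by name.",
])

# optional check mark, then "Trip Verified" or "Not Verified", then the pipe
prefix = r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*"

# remove the marker only when it appears at the start of the review
cleaned = sample.str.replace(prefix, "", regex=True)
print(cleaned.tolist())
```

Reviews without a marker pass through unchanged, and nothing is trimmed from the end of the text.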


### Cleaning/Formatting the date

In [14]:
df.dtypes

reviews     object
stars       object
date        object
country     object
verified      bool
corpus      object
dtype: object

In [15]:
# convert the date to datetime format

df.date = pd.to_datetime(df.date)

In [16]:
df.date.head()

0   2023-12-23
1   2023-12-21
2   2023-12-21
3   2023-12-21
4   2023-12-20
Name: date, dtype: datetime64[ns]
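
`pd.to_datetime` infers the format here. If an explicit format is preferred, one option is to remove the ordinal suffix first and then parse with a fixed pattern. This is a minimal sketch on hypothetical values written in the same "23rd December 2023" style; it assumes every date follows that day, month name, year layout and that month names are in English.

```python
import pandas as pd

# hypothetical sample in the same style as the date column
dates = pd.Series(["23rd December 2023", "21st December 2023", "2nd August 2012"])

# drop the ordinal suffix (st, nd, rd, th) only when it directly follows a digit,
# so the "st" inside "August" is left alone
no_suffix = dates.str.replace(r"(?<=\d)(st|nd|rd|th)", "", regex=True)

# parse with an explicit day / full month name / year format
parsed = pd.to_datetime(no_suffix, format="%d %B %Y")
print(parsed)
```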

### Cleaning ratings with stars

In [19]:
#check for unique values
df.stars.unique()

array(['\n\t\t\t\t\t\t\t\t\t\t\t\t\t5', '3', '6', '10', '1', '4', '8',
       '9', '7', '5', '2', 'None'], dtype=object)

In [20]:
# remove the \t and \n from the ratings (strip() with no argument trims all surrounding whitespace)
df.stars = df.stars.str.strip()

In [21]:
df.stars.value_counts()

1       868
2       422
3       405
8       365
10      327
7       310
9       308
5       270
4       249
6       192
None      5
Name: stars, dtype: int64

Five rows have the string "None" as their rating. We will drop these five rows.

In [22]:
# drop the rows where the value of ratings is None
df.drop(df[df.stars == "None"].index, axis=0, inplace=True)

In [23]:
#check the unique values again
df.stars.unique()

array(['5', '3', '6', '10', '1', '4', '8', '9', '7', '2'], dtype=object)
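
The ratings are still stored as strings at this point (`dtype=object`). If numeric operations such as averages are needed later, the column could be converted to integers. The sketch below uses hypothetical sample values and shows one way to do the whitespace cleanup, the "None" handling, and the conversion in a single pass.

```python
import pandas as pd

# hypothetical sample reproducing the kinds of raw values seen in df.stars.unique()
stars = pd.Series(["\n\t\t\t5", "3", "10", "None"])

# trim surrounding whitespace, then coerce anything non-numeric ("None") to NaN
numeric = pd.to_numeric(stars.str.strip(), errors="coerce")

# drop the missing ratings and store the rest as integers
numeric = numeric.dropna().astype(int)
print(numeric.tolist())
```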

### Check for null values

In [26]:
df.isnull().value_counts()

reviews  stars  date   country  verified  corpus
False    False  False  False    False     False     3714
                       True     False     False        2
dtype: int64

In [27]:
df.country.isnull().value_counts()

False    3714
True        2
Name: country, dtype: int64

Two reviews are missing a country value, so we simply drop those two rows from the dataframe.

In [28]:
#drop the rows using index where the country value is null
df.drop(df[df.country.isnull()].index, axis=0, inplace=True)

In [29]:
df.shape

(3714, 6)
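
A more compact way to express the same step is `dropna` with the `subset` argument. The miniature dataframe below is hypothetical and only illustrates the call.

```python
import pandas as pd

# hypothetical miniature of the reviews dataframe with one missing country
sample = pd.DataFrame({
    "reviews": ["a", "b", "c"],
    "country": ["Canada", None, "Australia"],
})

# keep only rows that have a country, then renumber the index from 0
sample = sample.dropna(subset=["country"]).reset_index(drop=True)
print(sample)
```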

In [30]:
#resetting the index in place; without inplace=True (or reassignment) the result is discarded
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | Absolutely PATHETIC business...,5,2023-12-23,United States,True,absolutely pathetic business class product ba ...
1,Not Verified | Overall not bad. Staff look ti...,3,2023-12-21,Canada,False,verified overall bad staff look tired overwork...
2,✅ Trip Verified | This was our first flight wi...,6,2023-12-21,Australia,True,first flight british airway year usual fault c...
3,✅ Trip Verified | I recently encountered a hig...,10,2023-12-21,United States,True,recently encountered highly disappointing expe...
4,Not Verified | Beware! BA don't provide any r...,1,2023-12-20,United States,False,verified beware ba provide refund due serious ...
...,...,...,...,...,...,...
3709,LHR-HKG on Boeing 747 - 23/08/12. Much has bee...,3,2012-08-29,United Kingdom,False,lhr hkg boeing much written tired old fleet go...
3710,Just got back from Bridgetown Barbados flying ...,9,2012-08-29,United Kingdom,False,got back bridgetown barbados flying british ai...
3711,LHR-JFK-LAX-LHR. Check in was ok apart from be...,8,2012-08-29,United Kingdom,False,lhr jfk lax lhr check ok apart snapped early c...
3712,LHR to HAM. Purser addresses all club passenge...,2,2012-08-28,United Kingdom,False,lhr ham purser address club passenger name boa...


*****

Our data is now cleaned and ready for visualization and analysis.

In [31]:
# export the cleaned data

df.to_csv(cwd + "/cleaned-BA-reviews.csv")
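
CSV files do not store column types, so the date column comes back as plain strings when the file is reloaded unless it is parsed again. The sketch below shows a round trip on a hypothetical two-row dataframe, using an in-memory buffer in place of the real file.

```python
import io
import pandas as pd

# hypothetical stand-in for the cleaned dataframe
sample = pd.DataFrame({
    "stars": [5, 3],
    "date": pd.to_datetime(["2023-12-23", "2023-12-21"]),
})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index; parse_dates restores the datetime column
restored = pd.read_csv(buffer, index_col=0, parse_dates=["date"])
print(restored.dtypes)
```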