## Data Cleaning

The data extracted from the website is still raw and not ready for analysis. The review text in particular needs cleaning to remove punctuation, stray characters, and other noise.

In [1]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [14]:
#create a dataframe from csv file

df = pd.read_csv("BA_reviews.csv")
df = df.drop("Unnamed: 0", axis=1)

In [15]:
df.head()

,reviews,stars,date,country
0,✅ Trip Verified | My family and I have flown ...,5,9th July 2023,United Kingdom
1,✅ Trip Verified | This has been by far the wo...,4,9th July 2023,United States
2,✅ Trip Verified | In Nov 2022 I booked and pa...,2,8th July 2023,United Kingdom
3,Not Verified | BA is not treating its premium ...,2,6th July 2023,United Kingdom
4,✅ Trip Verified | 24 hours before our departu...,4,5th July 2023,South Africa


- **We will also create a column that records whether each review is marked "Trip Verified".**

In [16]:
df['verified'] = df.reviews.str.contains("Trip Verified")
df['verified']

0        True
1        True
2        True
3       False
4        True
        ...  
3593    False
3594    False
3595    False
3596    False
3597    False
Name: verified, Length: 3598, dtype: bool
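As a quick sanity check on the flag, here is a minimal sketch on a few made-up rows (the `sample` frame and its review text are hypothetical, not taken from `BA_reviews.csv`). It shows that `str.contains("Trip Verified")` is `False` both for "Not Verified" reviews and for reviews that carry no prefix at all, and how `value_counts(normalize=True)` could summarise the split.

```python
import pandas as pd

# Hypothetical sample rows that mimic the prefixes seen in the reviews column
sample = pd.DataFrame({
    "reviews": [
        "✅ Trip Verified | Great flight.",
        "Not Verified | Lost my bag.",
        "Flew to Rome last year.",  # no prefix at all
    ]
})

# regex=False treats the pattern as a literal substring
sample["verified"] = sample["reviews"].str.contains("Trip Verified", regex=False)

# Share of verified vs. unverified reviews
verified_share = sample["verified"].value_counts(normalize=True)
print(sample["verified"].tolist())
print(verified_share)
```

On the real dataframe the same `value_counts(normalize=True)` call on `df['verified']` would give the proportion of verified reviews.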

### Cleaning Reviews

- **We will pull the review text out into a separate Series and clean it for semantic analysis.**

In [22]:
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Download NLTK resources (run this if you haven't done so)
nltk.download('wordnet')
nltk.download('stopwords')

# Initialize lemmatizer and stopwords
lemma = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

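# Remove the verification prefix from each review.
# Caution: str.strip() treats its argument as a set of characters to trim from
# both ends, not as a literal prefix, so "Not Verified" is not removed and
# matching letters at the end of a review can be trimmed as well.
reviews_data = df.reviews.str.strip("✅ Trip Verified |")

# Create an empty list to collect the cleaned corpus
corpus = []

# For each review: keep letters only, lowercase, tokenize, drop stopwords, lemmatize, re-join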
for rev in reviews_data:
    rev = re.sub('[^a-zA-Z]', ' ', rev)
    rev = rev.lower()
    rev = rev.split()
    rev = [lemma.lemmatize(word) for word in rev if word not in stop_words]
    rev = " ".join(rev)
    corpus.append(rev)


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
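One caveat about the prefix removal in the cell above: `str.strip` takes a set of characters rather than a literal prefix. The sketch below compares it with an anchored regular expression on a few made-up reviews. The `sample` text and the `prefix` pattern are assumptions for illustration, not part of the original notebook, and the pattern only covers the two markers seen in the data ("✅ Trip Verified |" and "Not Verified |").

```python
import pandas as pd

# Hypothetical reviews covering the three cases: verified, not verified, no prefix
sample = pd.Series([
    "✅ Trip Verified | My flight was delayed",
    "Not Verified | BA is not treating its customers well",
    "Tried to rebook, no luck",
])

# str.strip removes any of the listed characters from both ends of the string
stripped = sample.str.strip("✅ Trip Verified |")

# An anchored regex removes only the leading status marker, if present
prefix = r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*"
cleaned = sample.str.replace(prefix, "", regex=True)

for before, after in zip(stripped, cleaned):
    print(repr(before), "->", repr(after))
```

With `str.strip`, the first review loses its final "ed", the "Not Verified" marker is left in place, and the third review loses its leading "Tried ". The regex version leaves the review text itself untouched.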


In [23]:
# add the corpus to the original dataframe

df['corpus'] = corpus

In [24]:
df.head()

,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | My family and I have flown ...,5,9th July 2023,United Kingdom,True,family flown mostly british airway last year p...
1,✅ Trip Verified | This has been by far the wo...,4,9th July 2023,United States,True,far worst service plane obvious flying economy...
2,✅ Trip Verified | In Nov 2022 I booked and pa...,2,8th July 2023,United Kingdom,True,nov booked paid return journey new zealand ret...
3,Not Verified | BA is not treating its premium ...,2,6th July 2023,United Kingdom,False,verified ba treating premium economy passenger...
4,✅ Trip Verified | 24 hours before our departu...,4,5th July 2023,South Africa,True,hour departure ba cape town heathrow thursday ...


### Cleaning/Formatting the date

In [25]:
# convert the date to datetime format

df.date = pd.to_datetime(df.date)

In [26]:
df.date.head()

0   2023-07-09
1   2023-07-09
2   2023-07-08
3   2023-07-06
4   2023-07-05
Name: date, dtype: datetime64[ns]
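The dates here carry ordinal suffixes ("9th", "1st"), which `pd.to_datetime` has to infer its way around. If a stricter parse is wanted, one option is to remove the suffix first and then pass an explicit format. This is a sketch on made-up dates, assuming every value follows the "day month-name year" layout seen above and that month names are in English.

```python
import pandas as pd

# Hypothetical dates in the same style as the date column
dates = pd.Series(["9th July 2023", "1st March 2022", "22nd June 2021", "3rd May 2020"])

# Drop the ordinal suffix that directly follows the day number;
# the lookbehind keeps the "st" in "August" and similar words safe
no_suffix = dates.str.replace(r"(?<=\d)(?:st|nd|rd|th)\b", "", regex=True)

# With an explicit format, any value that does not match raises an error
parsed = pd.to_datetime(no_suffix, format="%d %B %Y")
print(parsed)
```

An explicit format fails loudly on an unexpected value instead of guessing, which can be useful when the scraped data is not fully trusted.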

### Cleaning ratings with stars

In [27]:
#check for unique values
df.stars.unique()

array(['5', '4', '2', '1', '3', '10', '7', '9', '8', '6', 'None'],
      dtype=object)

In [28]:
df.stars.value_counts()

1       806
2       409
3       399
8       355
10      319
7       309
9       303
5       265
4       242
6       186
None      5
Name: stars, dtype: int64

- *Five rows have the string "None" as their rating. We will drop these five rows.*

In [29]:
# drop the rows where the value of ratings is None
df.drop(df[df.stars == "None"].index, axis=0, inplace=True)
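Note that the ratings are stored as strings (`dtype=object`), so they remain text even after the "None" rows are gone. If numeric ratings are needed later, for averages or plots, a possible alternative is to coerce the column with `pd.to_numeric`. This is a sketch on a made-up `ratings` frame, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical ratings column with the same kind of values as df.stars
ratings = pd.DataFrame({"stars": ["5", "10", "None", "1", "None", "8"]})

# errors="coerce" turns anything that is not a number (here "None") into NaN
ratings["stars"] = pd.to_numeric(ratings["stars"], errors="coerce")

# Drop the rows that could not be converted, then store the rest as integers
ratings = ratings.dropna(subset=["stars"]).astype({"stars": int})
print(ratings["stars"].tolist())
print(ratings["stars"].mean())
```

With integer values, sorting and aggregation behave numerically; as strings, "10" would sort before "2".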

### Check for null values

In [30]:
df.isnull().sum()

reviews     0
stars       0
date        0
country     2
verified    0
corpus      0
dtype: int64

- *Two reviews have no country. Since only two rows are affected, we simply remove them from the dataframe.*

In [31]:
df.dropna(inplace=True)

In [35]:
df.shape

(3591, 6)
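Here `dropna()` without arguments is safe because `country` is the only column with missing values. If other columns could contain gaps, restricting the check with `subset` avoids dropping more rows than intended. This is a small sketch with a made-up frame; the `notes` column is hypothetical and only there to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one missing country, plus gaps in an optional column
sample = pd.DataFrame({
    "reviews": ["good", "bad", "fine"],
    "country": ["United Kingdom", np.nan, "South Africa"],
    "notes": [np.nan, np.nan, "optional field"],
})

# No arguments: a row is dropped if any column is missing
drop_any = sample.dropna()

# subset: only a missing country causes a row to be dropped
drop_country = sample.dropna(subset=["country"]).reset_index(drop=True)

print(len(drop_any), len(drop_country))
```

`reset_index(drop=True)` is optional; it renumbers the rows so the index has no gaps after the drop.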

In [36]:
# export the cleaned data

df.to_csv("cleaned_BA_reviews.csv")
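By default `to_csv` also writes the row index, which is where a column like the `Unnamed: 0` dropped at the start of this section comes from when the file is read back. The sketch below shows the difference with `index=False`, using an in-memory buffer and a made-up `sample` frame instead of the real file.

```python
import io

import pandas as pd

# Hypothetical cleaned frame with a gap in the index, as left by dropped rows
sample = pd.DataFrame({"reviews": ["good", "bad"], "stars": [5, 1]}, index=[0, 3])

# Default export: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_without_index.columns.tolist())
```

Whether to keep the index depends on what later notebooks expect. If the original row numbers are not needed, `index=False` saves having to drop the extra column again on load.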