# Data Cleaning
The data we scraped from the website is not yet clean and cannot be used for analysis as it is. We therefore clean it first so that it is suitable for analysis.

In [22]:
#Load data manipulation package

import pandas as pd

In [23]:
data = pd.read_csv("data/BA_reviews.csv", index_col = 0)

In [24]:
data.head()

Unnamed: 0,reviews,stars,date,country
0,✅ Trip Verified | British Airways absolutely ...,1,1st September 2023,United Kingdom
1,✅ Trip Verified | My recent experience with B...,1,1st September 2023,United States
2,✅ Trip Verified | This is to express our disp...,1,31st August 2023,United States
3,✅ Trip Verified | I flew London to Malaga on ...,1,30th August 2023,United Kingdom
4,✅ Trip Verified | I arrived at the airport ab...,1,30th August 2023,Germany


## Change Data Type Format

In [25]:
#Check data types
data.dtypes

reviews    object
stars      object
date       object
country    object
dtype: object

Every column is currently stored as a generic object. Converting each column to a type that matches its content makes the data easier to read and analyze at a later stage.

In [26]:
#Convert date to datetime format
data.date = pd.to_datetime(data.date)

In [27]:
data.head()

Unnamed: 0,reviews,stars,date,country
0,✅ Trip Verified | British Airways absolutely ...,1,2023-09-01,United Kingdom
1,✅ Trip Verified | My recent experience with B...,1,2023-09-01,United States
2,✅ Trip Verified | This is to express our disp...,1,2023-08-31,United States
3,✅ Trip Verified | I flew London to Malaga on ...,1,2023-08-30,United Kingdom
4,✅ Trip Verified | I arrived at the airport ab...,1,2023-08-30,Germany
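The conversion above relies on `pd.to_datetime` inferring the format of strings such as "1st September 2023". As a minimal sketch of a more explicit alternative, the ordinal suffix can be removed and the date parsed with a fixed format. The helper name `parse_review_date` is hypothetical, and the sketch assumes English month names and the day-month-year layout seen in the sample rows.

```python
import re
from datetime import datetime

def parse_review_date(text):
    """Parse a date such as '1st September 2023' (hypothetical helper)."""
    # Drop the ordinal suffix (st, nd, rd, th) that directly follows the day number
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    # Assumes day, full English month name, four-digit year
    return datetime.strptime(cleaned, "%d %B %Y")

print(parse_review_date("1st September 2023").date())
print(parse_review_date("31st August 2023").date())
```

A fixed format fails loudly on an unexpected date string instead of silently guessing, which can be useful when the scraped source changes.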


## Removing Null Values

In [28]:
#Count null values in each column
data.isnull().sum()

reviews    0
stars      0
date       0
country    1
dtype: int64

The data contains a null value. The reviews, stars, and date columns are complete, but the country column has 1 null value, which means one review has no country recorded. We remove that row before continuing.

In [29]:
#Removing null value
data = data.dropna()

In [30]:
#Check data after removing value
data.country.isnull().sum()

0
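Note that `dropna()` without arguments drops every row that has a null in any column. Here only the country column has a null, so the result is the same, but the following sketch shows how the `subset` parameter limits the check to chosen columns. The toy DataFrame is invented for illustration and is not part of the scraped data.

```python
import pandas as pd

# Invented toy data: one missing rating and one missing country
toy = pd.DataFrame({
    "reviews": ["Good flight", "Late departure", "Lost luggage"],
    "stars": ["8", None, "1"],
    "country": ["Germany", "Spain", None],
})

# Drop rows only when the country is missing
only_country = toy.dropna(subset=["country"])

# Drop rows with a missing value in any column
all_columns = toy.dropna()

print(len(only_country), len(all_columns))
```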

## Removing None Values

In [31]:
#Count how often each star rating occurs
data.stars.value_counts()

1       831
2       419
3       401
8       364
10      328
7       312
9       309
4       252
5       233
6       183
None      5
Name: stars, dtype: int64

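The stars column contains the value None in 5 rows. These are stored as the text 'None' rather than as true missing values, which is why the null check above did not report them. It is important to remove or filter out these values because they raise a data quality issue for completeness and can skew the results of later analysis.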

In [32]:
#Removing none value
data.drop(data[data['stars']=='None'].index, inplace=True)

In [33]:
#Check data after removing value
data.stars.unique()

array(['1', '3', '8', '4', '7', '2', '10', '9', '6', '5'], dtype=object)
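The remaining ratings are still strings, as the quoted values in the output show. If numeric ratings are needed later (for averages, for example), one possible approach is `pd.to_numeric` with `errors="coerce"`, which also turns text such as 'None' into a missing value. This is a sketch on an invented Series, not a step taken in this notebook.

```python
import pandas as pd

# Invented sample that mimics the stars column, including the text 'None'
stars = pd.Series(["1", "10", "None", "7"], name="stars")

# Unparseable entries become NaN instead of raising an error
numeric = pd.to_numeric(stars, errors="coerce")

# Drop the missing entries and store the ratings as integers
cleaned = numeric.dropna().astype(int)

print(cleaned.tolist())  # [1, 10, 7]
```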

## Create Column Verified Customers

Create a new boolean column that indicates whether the review is marked as "Trip Verified".

In [34]:
data['verified'] = data.reviews.str.contains("Trip Verified")

In [35]:
data['verified']

0        True
1        True
2        True
3        True
4        True
        ...  
3633    False
3634    False
3635    False
3636    False
3637    False
Name: verified, Length: 3632, dtype: bool

In [36]:
data.head()

Unnamed: 0,reviews,stars,date,country,verified
0,✅ Trip Verified | British Airways absolutely ...,1,2023-09-01,United Kingdom,True
1,✅ Trip Verified | My recent experience with B...,1,2023-09-01,United States,True
2,✅ Trip Verified | This is to express our disp...,1,2023-08-31,United States,True
3,✅ Trip Verified | I flew London to Malaga on ...,1,2023-08-30,United Kingdom,True
4,✅ Trip Verified | I arrived at the airport ab...,1,2023-08-30,Germany,True
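`str.contains` returns a missing value for rows where the text itself is missing, which would make the column non-boolean. The reviews column has no nulls here, but as a hedged sketch, the `na=False` parameter keeps the result strictly boolean. The sample reviews below are invented for illustration.

```python
import pandas as pd

# Invented sample reviews, including one missing text
reviews = pd.Series([
    "✅ Trip Verified | Smooth flight.",
    "Delayed again, no explanation.",
    None,
])

# na=False treats missing text as "not verified"
verified = reviews.str.contains("Trip Verified", na=False)

print(verified.tolist())  # [True, False, False]
print(verified.sum())     # number of verified reviews
```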


## Cleaning Reviews Data

In [37]:
#Text processing libraries

import nltk
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

Stop words are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “an”, and “are”. These words generally contribute little to the meaning of a text. By removing them, we strip low-level information from the text so that the analysis can focus on the important words.
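To illustrate the idea, here is a minimal sketch of stop-word filtering. It uses a small hand-written set, `SAMPLE_STOP_WORDS`, which is an assumption made for this example and far shorter than the NLTK list used in the notebook. The function name `remove_stop_words` is likewise hypothetical.

```python
# Tiny illustrative stop-word set (assumption; not the NLTK list)
SAMPLE_STOP_WORDS = {"a", "an", "the", "are", "is", "to", "and"}

def remove_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Return the lowercase words of text that are not stop words."""
    return [word for word in text.lower().split() if word not in stop_words]

print(remove_stop_words("The crew are friendly and the seats are comfortable"))
# ['crew', 'friendly', 'seats', 'comfortable']
```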

In [38]:
#Initialize WordNetLemmatizer and stopwords
#If the corpora are missing, run nltk.download('stopwords') and nltk.download('wordnet') once
stop_words = set(stopwords.words('english'))
lemma = WordNetLemmatizer()

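#Remove the "✅ Trip Verified |" prefix (str.strip would remove any of these characters from both ends)
reviews_data = data.reviews.str.replace(r"^\s*✅?\s*Trip Verified\s*\|\s*", "", regex=True)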

#Create new list for cleaned data from reviews
reviewstext = []

for rev in reviews_data:
    rev = re.sub('[^a-zA-Z]',' ', rev)
    rev = rev.lower()
    rev = rev.split()
    rev = [lemma.lemmatize(word) for word in rev if word not in stop_words]
    rev = " ".join(rev)
    reviewstext.append(rev)
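When removing a fixed prefix from text, keep in mind that `str.strip(chars)` treats its argument as a set of characters to remove from both ends, not as a literal prefix. The sketch below compares it with an anchored regular expression on an invented review text.

```python
import re

PREFIX = "✅ Trip Verified |"
sample = "Tried to upgrade but failed"  # invented text without the prefix

# strip() removes any leading or trailing characters that occur in PREFIX
stripped = sample.strip(PREFIX)

# An anchored pattern removes the prefix only when it is actually present
pattern = r"^✅ Trip Verified \|\s*"
anchored = re.sub(pattern, "", sample)
with_prefix = re.sub(pattern, "", "✅ Trip Verified | Tried to upgrade")

print(repr(stripped))     # 'to upgrade but fail'
print(repr(anchored))     # 'Tried to upgrade but failed'
print(repr(with_prefix))  # 'Tried to upgrade'
```

The first result loses "Tried " and the final "ed" because T, r, i, e, d, and the space all appear in the prefix string.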

In [39]:
# Add the corpus (the cleaned reviews) to the dataset as a new column
data['reviewstext'] = reviewstext

In [40]:
data.head()

Unnamed: 0,reviews,stars,date,country,verified,reviewstext
0,✅ Trip Verified | British Airways absolutely ...,1,2023-09-01,United Kingdom,True,british airway absolutely care reserved seat c...
1,✅ Trip Verified | My recent experience with B...,1,2023-09-01,United States,True,recent experience british airway horrendous ut...
2,✅ Trip Verified | This is to express our disp...,1,2023-08-31,United States,True,express displeasure concern regarding flight i...
3,✅ Trip Verified | I flew London to Malaga on ...,1,2023-08-30,United Kingdom,True,flew london malaga august club europe stood ar...
4,✅ Trip Verified | I arrived at the airport ab...,1,2023-08-30,Germany,True,arrived airport hour takeoff time get checked ...


In [41]:
data.shape

(3632, 6)

## Resetting the Index

In [42]:
# Reset the index and drop the existing one; reset_index returns a new DataFrame, so assign it back
data = data.reset_index(drop=True)
data

Unnamed: 0,reviews,stars,date,country,verified,reviewstext
0,✅ Trip Verified | British Airways absolutely ...,1,2023-09-01,United Kingdom,True,british airway absolutely care reserved seat c...
1,✅ Trip Verified | My recent experience with B...,1,2023-09-01,United States,True,recent experience british airway horrendous ut...
2,✅ Trip Verified | This is to express our disp...,1,2023-08-31,United States,True,express displeasure concern regarding flight i...
3,✅ Trip Verified | I flew London to Malaga on ...,1,2023-08-30,United Kingdom,True,flew london malaga august club europe stood ar...
4,✅ Trip Verified | I arrived at the airport ab...,1,2023-08-30,Germany,True,arrived airport hour takeoff time get checked ...
...,...,...,...,...,...,...
3627,Flew LHR - VIE return operated by bmi but BA a...,10,2012-08-29,United Kingdom,False,flew lhr vie return operated bmi ba aircraft a...
3628,LHR to HAM. Purser addresses all club passenge...,9,2012-08-28,United Kingdom,False,lhr ham purser address club passenger name boa...
3629,My son who had worked for British Airways urge...,5,2011-10-12,United Kingdom,False,son worked british airway urged fly british ai...
3630,London City-New York JFK via Shannon on A318 b...,4,2011-10-11,United States,False,london city new york jfk via shannon really ni...
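`reset_index` returns a new DataFrame and leaves the original untouched unless the result is assigned. The sketch below shows this on an invented DataFrame whose index has gaps, similar to the gaps left after rows are dropped.

```python
import pandas as pd

# Invented toy data with gaps in the index, as left behind by dropped rows
gapped = pd.DataFrame({"stars": [1, 8, 5]}, index=[0, 3, 7])

# The result is discarded here, so the original keeps its old index
gapped.reset_index(drop=True)
unchanged_index = gapped.index.tolist()

# Assigning the result keeps the renumbered index
renumbered = gapped.reset_index(drop=True)

print(unchanged_index)            # [0, 3, 7]
print(renumbered.index.tolist())  # [0, 1, 2]
```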


## Export the Data into a CSV Format

In [43]:
data.to_csv("data/BA_reviews_cleaned.csv")
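A CSV file stores only text, so the datetime type of the date column is not preserved when the file is read back. As a sketch of one way to handle this, the `parse_dates` parameter of `pd.read_csv` restores it, and `index=False` avoids writing the index as an extra column. The example uses an in-memory buffer and invented rows instead of the real file.

```python
import io
import pandas as pd

# Invented toy data standing in for the cleaned reviews
toy = pd.DataFrame({
    "stars": [1, 8],
    "date": pd.to_datetime(["2023-09-01", "2023-08-30"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the date column back to datetime on load
restored = pd.read_csv(buffer, parse_dates=["date"])

print(restored.dtypes)
```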