# Data Collection

The data comes from the airline review site [Skytrax](https://www.airlinequality.com/airline-reviews/british-airways/). For each British Airways review I collected the review text, the overall star rating, the review date and the reviewer's country.

In [1]:
#Imports
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

In [2]:
# Create empty lists to collect the review text, star ratings, review dates and reviewer countries
reviews = []
stars = []
date = []
country = []

In [3]:
# Request pages 1 to 99, 100 reviews per page, newest first
for i in range(1, 100):
    page = requests.get(f"https://www.airlinequality.com/airline-reviews/british-airways/page/{i}/?sortby=post_date%3ADesc&pagesize=100")

    soup = BeautifulSoup(page.content, "html5")

    # review text
    for item in soup.find_all("div", class_="text_content"):
        reviews.append(item.text)

    # star rating (out of 10)
    for item in soup.find_all("div", class_="rating-10"):
        try:
            stars.append(item.span.text)
        except AttributeError:
            # the rating div has no <span>, so store a placeholder
            print(f"Error on page {i}")
            stars.append("None")

    # date
    for item in soup.find_all("time"):
        date.append(item.text)

    # country
    for item in soup.find_all("h3"):
        country.append(item.span.next_sibling.text.strip(" ()"))

Error on page 30
Error on page 31
Error on page 31
Error on page 33
Error on page 34
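
The loop above always requests 99 pages, although 3482 reviews at 100 per page would fill only about 35 of them. A minimal sketch of a loop that stops at the first empty page follows. `collect_reviews` and `fetch_page` are hypothetical names, not part of this notebook; `fetch_page` stands in for the `requests.get` and `BeautifulSoup` steps and is assumed to return the review texts found on one page.

```python
from typing import Callable, List


def collect_reviews(fetch_page: Callable[[int], List[str]], max_pages: int = 100) -> List[str]:
    """Collect reviews page by page, stopping at the first empty page.

    fetch_page is a hypothetical callable: given a 1-based page number,
    it returns the review texts found on that page (empty list if none).
    """
    collected: List[str] = []
    for page_number in range(1, max_pages + 1):
        page_reviews = fetch_page(page_number)
        if not page_reviews:
            # No reviews on this page, so later pages are assumed empty too.
            break
        collected.extend(page_reviews)
    return collected


# Stand-in for the real site: three pages of fake reviews, then nothing.
fake_site = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
requested = []


def fake_fetch(page_number: int) -> List[str]:
    requested.append(page_number)
    return fake_site.get(page_number, [])


all_reviews = collect_reviews(fake_fetch)
print(all_reviews)  # ['a', 'b', 'c', 'd', 'e']
print(requested)    # [1, 2, 3, 4]
```

Passing the fetch step in as a function keeps the stopping logic testable without touching the network.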


In [4]:
len(reviews)

3482

In [5]:
len(country)

3482

In [6]:
len(date)

3482

In [7]:
len(country)

3482

In [8]:
len(stars)


3581

In [9]:
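# All columns must have the same length to build a DataFrame.
# stars has 99 more entries than reviews (3581 vs 3482), so trim it to match.
# Note: this only equalises the lengths; it does not check that each rating
# still lines up with its review.
stars = stars[:len(reviews)]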
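
Truncating `stars` makes the lengths match, but the ratings stay aligned with the reviews only if every surplus entry sits at the end of the list. The toy example below assumes, purely as a hypothesis, that each page contributes one extra rating ahead of its per-review ratings (99 requested pages and 99 surplus entries would be consistent with that). All data and names are made up for illustration.

```python
# Two fake pages. Each page's rating list starts with one extra entry
# ("overall") that belongs to no review. This layout is an assumption.
pages = [
    {"reviews": ["r1", "r2"], "ratings": ["overall", "5", "1"]},
    {"reviews": ["r3", "r4"], "ratings": ["overall", "8", "6"]},
]

reviews, stars_flat, stars_skipped = [], [], []
for page in pages:
    reviews.extend(page["reviews"])
    stars_flat.extend(page["ratings"])         # what a flat find_all would collect
    stars_skipped.extend(page["ratings"][1:])  # drop the extra entry on each page

# Truncation equalises the lengths but shifts ratings onto the wrong reviews.
truncated = stars_flat[: len(reviews)]
print(list(zip(reviews, truncated)))
# [('r1', 'overall'), ('r2', '5'), ('r3', '1'), ('r4', 'overall')]

# Skipping the extra entry per page keeps every rating with its review.
print(list(zip(reviews, stars_skipped)))
# [('r1', '5'), ('r2', '1'), ('r3', '8'), ('r4', '6')]
```

If the real pages behave this way, collecting the ratings per page (or per review container) would be safer than trimming the combined list.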

In [10]:
BA = pd.DataFrame({"reviews":reviews,"stars":stars,"date":date,"country":country})

In [11]:
BA.head(3)

Unnamed: 0,reviews,stars,date,country
0,✅ Trip Verified | First our morning flight wa...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,28th February 2023,Canada
1,✅ Trip Verified | Although it was a bit uncom...,1,27th February 2023,United Kingdom
2,✅ Trip Verified | Boarding was decently organ...,8,27th February 2023,Belgium


In [12]:
BA.shape

(3482, 4)

In [13]:
# Export the raw data
import os
cwd = os.getcwd()  # the current working directory: where the notebook or script is run from
BA.to_csv(os.path.join(cwd, "BA_reviews.csv"))

# Data Cleaning

In [14]:
import re

1. `import re` imports Python's built-in `re` module, which provides support for regular expressions. Regular expressions are a powerful way to match patterns in text and manipulate it.

2. Once `re` is imported, its functions can be used to search strings for patterns, replace text, split strings, and more.
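
As a quick illustration of the three operations mentioned above, here is a small sketch using `re.search`, `re.sub` and `re.split`. The sample sentence is invented for the example.

```python
import re

text = "Flight BA123 was delayed; crew were helpful, food was poor."

# search: find the first match of a pattern (two capitals followed by digits)
match = re.search(r"[A-Z]{2}\d+", text)
flight_code = match.group() if match else None
print(flight_code)  # BA123

# sub: replace every match (here, remove all digits)
no_digits = re.sub(r"\d+", "", text)
print(no_digits)  # Flight BA was delayed; crew were helpful, food was poor.

# split: break the string wherever the pattern matches
parts = re.split(r"[;,]\s*", text)
print(parts)
# ['Flight BA123 was delayed', 'crew were helpful', 'food was poor.']
```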

In [15]:
BA = pd.read_csv(cwd + "/BA_reviews.csv", index_col=0)

In [16]:
BA.head(3)

Unnamed: 0,reviews,stars,date,country
0,✅ Trip Verified | First our morning flight wa...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,28th February 2023,Canada
1,✅ Trip Verified | Although it was a bit uncom...,1,27th February 2023,United Kingdom
2,✅ Trip Verified | Boarding was decently organ...,8,27th February 2023,Belgium


In [17]:
BA['verified'] = BA.reviews.str.contains("Trip Verified")

In [18]:
BA.head(3)

Unnamed: 0,reviews,stars,date,country,verified
0,✅ Trip Verified | First our morning flight wa...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,28th February 2023,Canada,True
1,✅ Trip Verified | Although it was a bit uncom...,1,27th February 2023,United Kingdom,True
2,✅ Trip Verified | Boarding was decently organ...,8,27th February 2023,Belgium,True


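`.strip()` is a string method that removes a set of characters from the beginning and end of a string. Because it treats its argument as a set of characters rather than as a prefix, the label is removed here with an anchored regular expression instead.

In [19]:
# Remove the leading "✅ Trip Verified |" label without touching the rest of the review
reviews_data = BA.reviews.str.replace(r"^\s*✅\s*Trip Verified\s*\|\s*", "", regex=True)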
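
Since `str.strip` treats its argument as a set of characters and not as a prefix, it can remove more than the label. The sketch below shows the difference on an invented review; `drop_prefix` is a hypothetical helper written for this example (it does the same job as `str.removeprefix` in Python 3.9+).

```python
label = "✅ Trip Verified |"
review = "✅ Trip Verified | Tried to upgrade, was denied"

# strip removes any of the label's characters from both ends, so it also eats
# the word "Tried" at the start and the letters "ied" at the end.
stripped = review.strip(label)
print(stripped)  # to upgrade, was den


def drop_prefix(text: str, prefix: str) -> str:
    """Remove prefix once if text starts with it; otherwise return text unchanged."""
    return text[len(prefix):] if text.startswith(prefix) else text


# Removing the label as a prefix leaves the review text intact.
cleaned = drop_prefix(review, label).lstrip()
print(cleaned)  # Tried to upgrade, was denied
```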

In [20]:
# Create an empty list to collect the cleaned corpus
corpus = []

# For each review: replace every non-letter (punctuation and digits) with a space,
# lower-case it, collapse repeated whitespace, and add it to the corpus
for rev in reviews_data:
    rev = re.sub('[^a-zA-Z]', ' ', rev)
    rev = rev.lower()
    rev = rev.split()
    rev = " ".join(rev)
    corpus.append(rev)
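
The same cleaning steps can be wrapped in a function, which makes them easy to try on a single string. This is a sketch only; `clean_review` is a hypothetical name and the sample text is invented. Note that the pattern drops digits as well as punctuation, so aircraft types such as "A318" are reduced to "a".

```python
import re


def clean_review(text: str) -> str:
    """Keep letters only, lower-case the text, and collapse runs of whitespace."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return " ".join(letters_only.lower().split())


sample = "London City-New York JFK via Shannon on A318, 2 hrs late!"
print(clean_review(sample))
# london city new york jfk via shannon on a hrs late
```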

In [21]:
# add the corpus to the original dataframe

BA['corpus'] = corpus

In [22]:
BA.head(3)

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | First our morning flight wa...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,28th February 2023,Canada,True,first our morning flight was cancelled and mov...
1,✅ Trip Verified | Although it was a bit uncom...,1,27th February 2023,United Kingdom,True,although it was a bit uncomfortable flight in ...
2,✅ Trip Verified | Boarding was decently organ...,8,27th February 2023,Belgium,True,boarding was decently organised the a still ha...


In [23]:
BA.dtypes

reviews     object
stars       object
date        object
country     object
verified      bool
corpus      object
dtype: object

In [24]:
# convert the date to datetime format
BA.date = pd.to_datetime(BA.date)
BA.date.head()

0   2023-02-28
1   2023-02-27
2   2023-02-27
3   2023-02-27
4   2023-02-26
Name: date, dtype: datetime64[ns]
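
`pd.to_datetime` infers the format of strings such as "28th February 2023" on its own. If explicit control over the format is preferred, one option is to remove the ordinal suffix and parse with a fixed pattern. The sketch below uses only the standard library; `parse_review_date` is a hypothetical helper, and it assumes English month names (the default C locale).

```python
import re
from datetime import datetime


def parse_review_date(text: str) -> datetime:
    """Parse dates such as '28th February 2023' with an explicit format."""
    # Drop the ordinal suffix (st, nd, rd, th) that follows the day number.
    without_suffix = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    return datetime.strptime(without_suffix, "%d %B %Y")


print(parse_review_date("28th February 2023").date())  # 2023-02-28
print(parse_review_date("1st August 2012").date())     # 2012-08-01
```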

In [25]:
#check for unique values
BA.stars.unique()

array(['\n\t\t\t\t\t\t\t\t\t\t\t\t\t5', '1', '8', '6', '7', '4', '5', '9',
       '10', '2', '3', 'None'], dtype=object)

In [26]:
# Remove the surrounding whitespace (\n and \t) from the ratings
BA.stars = BA.stars.str.strip()
# Drop the rows whose rating could not be scraped
BA.drop(BA[BA.stars == 'None'].index, axis=0, inplace=True)
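
After this step the ratings are clean but still stored as strings (dtype `object`), so numeric operations such as a mean are not available yet. One possible way to convert them is sketched below with `pd.to_numeric`; the sample series is invented and the variable names are illustrative only.

```python
import pandas as pd

raw = pd.Series(["\n\t\t\t5", "1", "10", "None", "8"], name="stars")

# Strip surrounding whitespace, then convert; anything non-numeric becomes NaN.
numeric = pd.to_numeric(raw.str.strip(), errors="coerce")

# Drop the unparseable rows and store the ratings as integers.
stars_int = numeric.dropna().astype(int)
print(stars_int.tolist())  # [5, 1, 10, 8]
```

With `errors="coerce"` the placeholder "None" turns into a missing value, so the same `dropna` call handles it without a separate string comparison.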

In [27]:
BA.stars.unique()

array(['5', '1', '8', '6', '7', '4', '9', '10', '2', '3'], dtype=object)

In [28]:
BA.stars.value_counts()

1     753
2     390
3     385
8     350
10    313
7     304
9     302
5     261
4     235
6     184
Name: stars, dtype: int64

In [29]:
BA.isnull().sum()

reviews     0
stars       0
date        0
country     2
verified    0
corpus      0
dtype: int64

In [30]:
BA.dropna(inplace = True)

In [31]:
BA.isnull().sum()

reviews     0
stars       0
date        0
country     0
verified    0
corpus      0
dtype: int64

In [32]:
BA.shape

(3475, 6)

In [33]:
BA.head(2)

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | First our morning flight wa...,5,2023-02-28,Canada,True,first our morning flight was cancelled and mov...
1,✅ Trip Verified | Although it was a bit uncom...,1,2023-02-27,United Kingdom,True,although it was a bit uncomfortable flight in ...


In [34]:
# reset_index returns a new DataFrame, so assign the result to keep the new index
BA = BA.reset_index(drop=True)
BA

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | First our morning flight wa...,5,2023-02-28,Canada,True,first our morning flight was cancelled and mov...
1,✅ Trip Verified | Although it was a bit uncom...,1,2023-02-27,United Kingdom,True,although it was a bit uncomfortable flight in ...
2,✅ Trip Verified | Boarding was decently organ...,8,2023-02-27,Belgium,True,boarding was decently organised the a still ha...
3,✅ Trip Verified | Boarding on time and departu...,6,2023-02-27,Belgium,True,boarding on time and departure on time for a f...
4,✅ Trip Verified | My original flight was canc...,7,2023-02-26,United Kingdom,True,my original flight was cancelled just over wee...
...,...,...,...,...,...,...
3470,Flew LHR - VIE return operated by bmi but BA a...,7,2012-08-29,United Kingdom,False,flew lhr vie return operated by bmi but ba air...
3471,LHR to HAM. Purser addresses all club passenge...,1,2012-08-28,United Kingdom,False,lhr to ham purser addresses all club passenger...
3472,My son who had worked for British Airways urge...,9,2011-10-12,United Kingdom,False,my son who had worked for british airways urge...
3473,London City-New York JFK via Shannon on A318 b...,8,2011-10-11,United States,False,london city new york jfk via shannon on a but ...


In [36]:
# Export cleaned data
BA.to_csv(cwd + "/cleaned_BA_reviews.csv" )