# British Airways Web Scraping of Skytrax Review Site
### Web Scrape, Data Cleaning, and Data Verification

[Skytrax](https://www.airlinequality.com/) is a website where travellers can read airline reviews and write their own. It has reviews for all major airlines.

Along with a title for their review and a paragraph describing their experience, reviewers gave data on the following:

- Aircraft Type
- Type of Traveller (solo, business, etc)
- Seat Type
- Route
- Date of Flight

They also rated the following categories with stars (from 1 to 5):

- Seat Comfort
- Cabin Staff Service
- Food & Beverages
- Inflight Entertainment
- Ground Service
- Wifi & Connectivity
- Value for Money

Reviewers also gave an overall rating (from 1 to 10).

Lastly, reviewers recommended the airline (Yes/No).

This project scrapes all of this data from the Skytrax website and puts it in tabular form. The data can then be analyzed for insights that can help British Airways.

Here, we will scrape the web, clean the data, and output two .csv files with the relevant data that can be used by anyone interested in analysing the data.

Python will be used for this analysis.

In [None]:
# Import packages
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## Web Scraping

At the time of this writing, there were 3426 British Airways reviews on Skytrax from 2011 to 2022. The reviews were listed in pages of 100 reviews each. After creating a list of the 35 URLs, one for each page of up to 100 reviews, we'll iterate through them and scrape all necessary information into their respective lists.

In [16]:
# Create list of urls for the various scroll pages of results
urls = []
for i in range(35):
    page = i+1
    urls.append(f'https://www.airlinequality.com/airline-reviews/british-airways/page/{page}/?sortby=post_date%3ADesc&pagesize=100')
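
The figure of 35 pages follows from ceiling division of the review count by the page size. Below is a minimal sketch of that calculation, assuming the counts quoted above (3426 reviews, 100 per page). `build_page_urls` and `BASE_URL` are hypothetical names used for illustration only and are not part of the original notebook.

```python
import math

# URL pattern copied from the cell above, with page and pagesize left as placeholders
BASE_URL = ("https://www.airlinequality.com/airline-reviews/british-airways/"
            "page/{page}/?sortby=post_date%3ADesc&pagesize={pagesize}")

def build_page_urls(total_reviews: int, pagesize: int = 100) -> list[str]:
    """Return one URL per results page, numbered from 1."""
    n_pages = math.ceil(total_reviews / pagesize)
    return [BASE_URL.format(page=page, pagesize=pagesize)
            for page in range(1, n_pages + 1)]

urls_sketch = build_page_urls(3426)
print(len(urls_sketch))
```

Deriving the page count this way avoids hard-coding 35, which would go stale as new reviews are posted.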

In [196]:
# This will scrape all necessary information from the skytrax review pages and store in the following lists
# We use the python BeautifulSoup package

title = []
star_fill = []
review_value = []
reviews = []
rating_header = []
rating = []

for url in urls:
    response = requests.get(url, timeout=30)
    # Stop on a failed request so a missing page cannot silently misalign the lists
    response.raise_for_status()
    # Name the parser explicitly so results do not depend on which parsers are installed
    soup = BeautifulSoup(response.text, "html.parser")

    # Title
    title.extend(item.get_text() for item in soup.find_all("h2", class_="text_header"))

    # Overall Rating (skip the first value on each page: it comes from the summary at the top)
    rating_page = [item.get_text() for item in soup.find_all("span", itemprop="ratingValue")]
    rating.extend(rating_page[1:])

    # Star Ratings (skip the first 15 filled stars on each page: they come from the summary at the top)
    star_fill_page = [item.get_text() for item in soup.find_all("span", class_="star fill")]
    star_fill.extend(star_fill_page[15:])

    # Aircraft, Traveller Type, Seat Type, Route, Date Flown, Recommended
    review_value.extend(item.get_text() for item in soup.find_all("td", class_="review-value"))

    # Rating Headers
    rating_header.extend(item.get_text() for item in soup.find_all("td", class_="review-rating-header"))

    # Review Body, Verification
    reviews.extend(item.get_text() for item in soup.find_all("div", class_="text_content"))

## Data Cleaning and Data Frame Creation

Without going into exhaustive detail about the quirks of how the data is displayed on Skytrax's website (all perfectly reasonable choices for Skytrax, though they make web scraping harder), here are a few details that explain some of the following code:

- At the top of each page of 100 results is a summary of five metrics. This information was scraped along with everything else and had to be deleted every 100 entries.
- At some point in 2015, the way reviews were displayed changed, as did the rules about what reviewers were required to enter. Skytrax now handles a missing rating for some aspect of the flight by leaving the category out, but before the change it was sometimes (not always!) shown as 'NA'.

Another difficulty is how the information was scraped. One result list had all of the rating headers (i.e. the column names in a data frame), but not every review had data for each header. Another list had the star values (just the numbers), and another list had the airplane type, seat type, etc.

Our main strategy to create a usable data set is to first make an empty data frame filled with NaN values and then iterate through our header list. For each header, we take the next value from either the stars list or the review value list and write it into the data frame.
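
To make this strategy concrete, here is a small pure-Python sketch of the same idea on toy data. It is an illustration under stated assumptions, not the notebook's actual code: `fill_rows`, the shortened header list, and the sample values are all hypothetical, and plain dictionaries stand in for data frame rows (with `None` playing the role of NaN).

```python
# Shortened, hypothetical header set; the real notebook uses 13 headers
HEADER_ORDER = ["aircraft", "seat type", "seat comfort", "recommended"]
# Headers whose values come from the review-value list rather than the stars list
TEXT_FIELDS = {"aircraft", "seat type", "recommended"}

def fill_rows(headers, text_values, star_values):
    """Walk the flat header list and pull each value from the matching source."""
    text_iter = iter(text_values)
    star_iter = iter(star_values)
    rows = []
    current = dict.fromkeys(HEADER_ORDER)  # every field starts out missing (None)
    for header in headers:
        source = text_iter if header in TEXT_FIELDS else star_iter
        current[header] = next(source)
        if header == "recommended":  # 'recommended' closes every review
            rows.append(current)
            current = dict.fromkeys(HEADER_ORDER)
    return rows

# Two toy reviews: the second one has no aircraft and no seat comfort rating
headers = ["aircraft", "seat type", "seat comfort", "recommended",
           "seat type", "recommended"]
text_values = ["a380", "economy class", "yes", "premium economy", "no"]
star_values = ["4"]

rows = fill_rows(headers, text_values, star_values)
for row in rows:
    print(row)
```

Because every review ends with a `recommended` header, that header doubles as the row delimiter, and fields a reviewer skipped simply stay missing.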

### Rating Headers

In [205]:
# remove the first five summary ratings on the first page
rating_header_t = rating_header[5:]
rating_header_df = pd.DataFrame(dict(rating_header=rating_header_t))
rating_header_df['rating_header'] = rating_header_df.rating_header.str.replace('Cabin Staff Service','Staff Service').str.lower()

# remove the five summary ratings from all of the other results pages
# here we're using a fortunate result that every review has a recommendation
index = 0
count = 0
for i in rating_header_df.rating_header:
    if count == 100:
        rating_header_df.drop(range(index,index+5), axis=0, inplace=True)
        count = 0
    else:
        if i == 'recommended':
            count += 1
    index += 1
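
The summary-removal logic above can be checked on a small list. The following is a sketch only, assuming (as the cell above does) that the first page's summary has already been removed and that every review ends with a `recommended` header. `drop_page_summaries` and its parameters are hypothetical names, and the toy input uses a page size of 2 and a summary length of 2 to keep it readable.

```python
def drop_page_summaries(headers, reviews_per_page=100, summary_len=5):
    """Remove the summary headers that follow every full page of reviews."""
    kept = []
    reviews_on_page = 0  # 'recommended' headers seen on the current page
    to_skip = 0          # summary headers still to discard
    for header in headers:
        if reviews_on_page == reviews_per_page:
            # A new page starts here; its first summary_len headers are the summary
            to_skip = summary_len
            reviews_on_page = 0
        if to_skip:
            to_skip -= 1
            continue
        kept.append(header)
        if header == "recommended":
            reviews_on_page += 1
    return kept

toy_headers = ["route", "recommended", "recommended",
               "summary a", "summary b",
               "route", "recommended"]
cleaned = drop_page_summaries(toy_headers, reviews_per_page=2, summary_len=2)
print(cleaned)
```

Building a new list, rather than dropping rows from a data frame while iterating over it, keeps the positions easy to reason about.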

### Star Count

A quirk here is that the scraping picks up every filled star, not just the highest star value for each rating. We keep the highest star of each rating and delete the rest.

In [299]:
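stars = []
for i in range(len(star_fill)-1):
    if star_fill[i+1] == '1':
        stars.append(star_fill[i])
# the last rating has no following '1', so add its highest star explicitly
if star_fill:
    stars.append(star_fill[-1])
# create an iterator from the star ratings
stars_iterator = iter(stars)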
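
To see the star-collapsing step on a small input, here is a self-contained sketch. It assumes, as described above, that each rating is scraped as a run of filled stars counting up from '1'. `highest_star_per_rating` is a hypothetical helper name, and this version also keeps the final run, which has no following '1' to mark its end.

```python
def highest_star_per_rating(star_fill):
    """Keep the last value of each run; a new run begins whenever '1' appears."""
    stars = []
    for current, following in zip(star_fill, star_fill[1:]):
        if following == "1":
            stars.append(current)
    if star_fill:
        stars.append(star_fill[-1])  # close the final run
    return stars

# Three ratings scraped as runs: 3 stars, 2 stars, 5 stars
print(highest_star_per_rating(["1", "2", "3", "1", "2", "1", "2", "3", "4", "5"]))
# ['3', '2', '5']
```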

### Review Value

In [300]:
# create an iterator from the review values
review_value_iterator = iter([i.lower() for i in review_value])

## Create Information and Ratings DataFrame

In [301]:
# There are 3426 total reviews in our data
# The header order below reflects the order present on the skytrax website
header_order = ['aircraft','type of traveller','seat type','route','date flown','seat comfort',
                'staff service','food & beverages','inflight entertainment','ground service',
                'wifi & connectivity','value for money','recommended']
# Create a dataframe with 3426 rows and the header order for columns.
# Filled with NaN for now; object dtype so string values can be assigned later.
df_headers = pd.DataFrame(np.nan, index=range(3426), columns=header_order, dtype=object)

In [302]:
# Fill in the headers 

row = 0
for item in rating_header_df.rating_header:
    try:
        if item == 'recommended':
            df_headers.loc[row, item] = next(review_value_iterator)
            row += 1
            continue
        elif item in ['aircraft','type of traveller','seat type','route','date flown']:
            df_headers.loc[row, item] = next(review_value_iterator)
        else:
            df_headers.loc[row, item] = next(stars_iterator)
    # the really old data from skytrax was too inconsistent with their stars, and will eventually 
    # give an error. We'll avoid that here knowing we'll drop the oldest data points anyway.
    except StopIteration:
        break

In [330]:
# No missing data for the title or reviews. Straightforward to create the data frame and
# add it to the data frame above.

df = pd.DataFrame(dict(title=title, reviews=reviews))

# Drop the verification prefix that precedes the first '|', keeping the rest of the review intact
reviews_only = []
for i in df.reviews:
    if '|' in i:
        reviews_only.append(i.split('|', 1)[1])
    else:
        reviews_only.append(i)
        
df['reviews'] = reviews_only
    
df['reviews'] = df.reviews.str.strip() \
                          .str.lower()
df['title'] = df.title.str.strip('"') \
                      .str.lower()

ba_data = pd.concat([df_headers, df], axis=1)
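
The prefix-stripping step above can be tried in isolation. This is a sketch under assumptions: the sample strings (including the "Trip Verified" and "Not Verified" labels) are illustrative placeholders rather than scraped data, and `split_verification` is a hypothetical helper that also returns the prefix instead of discarding it.

```python
def split_verification(raw):
    """Return (verification, body); verification is '' when there is no '|' prefix."""
    prefix, sep, body = raw.partition("|")  # splits at the first '|' only
    if not sep:
        return "", raw.strip()
    return prefix.strip(), body.strip()

print(split_verification("Trip Verified | Flight was on time."))
print(split_verification("Flight was late."))
print(split_verification("Not Verified | Good crew | poor food."))
```

Splitting only at the first `|` means a pipe character inside the review body does not truncate the text.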


In [331]:
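# the user's overall rating
# `rating` is 6 entries shorter than ba_data, so trim the last 6 rows (the oldest reviews)
# to make the lengths match before assigning the column
ba_data.drop(ba_data.tail(6).index, axis=0, inplace=True)
ba_data['rating'] = rating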

Skytrax apparently changed how it takes reviews at some point: the format, what reviewers are required to enter, and so on. After creating the above dataset, we visually compared random runs of consecutive rows against the Skytrax website and verified that the first 2649 data points match the information on the website. After that point the format changes and the information in our data frame is incorrect. Our final step is to drop those remaining rows (roughly 800). While this is not ideal, we are lucky that the oldest data is what we have to drop.

We will, in a separate data set, keep all of the written reviews from all the way back to 2011.
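
The manual spot-check described above was done by eye. One way such a check could be organised is sketched below, purely as an assumption about workflow: `sample_sequential_blocks`, the block length, and the number of blocks are hypothetical choices and do not describe how the original verification was actually carried out.

```python
import random

def sample_sequential_blocks(n_rows, block_len=5, n_blocks=3, seed=0):
    """Pick random start rows and return each block of consecutive row indices."""
    rng = random.Random(seed)  # seeded so the same rows can be rechecked later
    starts = sorted(rng.sample(range(n_rows - block_len + 1), n_blocks))
    return [list(range(start, start + block_len)) for start in starts]

# 3420 rows remain after the six trailing rows were trimmed above
blocks = sample_sequential_blocks(3420)
for block in blocks:
    print(block)
```

Each printed block lists consecutive row positions to compare against the same stretch of reviews on the website.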

In [332]:
# keep only verified rows of data
# .copy() makes an independent frame, so the edits below do not trigger SettingWithCopyWarning
ba_data_verified = ba_data.loc[:2648, :].copy()

In [333]:
# final cleaning steps

# create datetime column and drop original date column
ba_data_verified['date_dt'] = pd.to_datetime(ba_data_verified['date flown'], format='%B %Y')
ba_data_verified.drop('date flown', axis=1, inplace=True)

# clean up column names
ba_data_verified.columns = ba_data_verified.columns.str.replace(' ','_').str.replace('&','and')


In [336]:
# save all data to .csv
ba_data_verified.to_csv('british_airways_skytrax_2015-2022_review_data.csv', index=False)
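
The final cleaning steps above (date parsing and column renaming) can be illustrated without pandas. This is a minimal sketch using only the standard library: `parse_date_flown` and `clean_column_name` are hypothetical helper names, and the sample date string is illustrative. It relies on `strptime` matching month names case-insensitively and on an English locale.

```python
from datetime import datetime

def parse_date_flown(value):
    """Parse a 'month year' string such as 'march 2020' into a datetime."""
    return datetime.strptime(value, "%B %Y")  # the day defaults to 1

def clean_column_name(name):
    """Apply the same replacements as the column clean-up step."""
    return name.replace(" ", "_").replace("&", "and")

print(parse_date_flown("march 2020"))
print(clean_column_name("food & beverages"))
print(clean_column_name("wifi & connectivity"))
```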