# Web Scraping to Gain Company Insights

The first thing to do will be to scrape review data from the web. For this, you should use a website called Skytrax.

The team leader wants you to focus on reviews specifically about the airline itself. You should collect as much data as you can in order to improve the output of your analysis. To get started with the data collection, you can use the “Jupyter Notebook” in the Resources section below to run some Python code that will help to collect some data. 

Analyse data
Once you have your dataset, you need to prepare it. The data will be very messy and contain purely text. You will need to perform data cleaning in order to prepare the data for analysis. When the data is clean, you should perform your own analysis to uncover some insights. As a starting point, you could look at topic modelling, sentiment analysis or wordclouds to provide some insight into the content of the reviews. It is recommended to complete this task using Python, however, you can use any tool that you wish. You can use some of the documentation websites provided in the Resources section below to analyse the data.

Please ensure that you have created a folder called "data" and mapped your file path.

In [417]:
# import required libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import random
import json
import string
import concurrent.futures
import re
import os

In [2]:
# define requests header
user_agents = [
    "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)",
    "Opera/9.25 (Windows NT 5.1; U; en)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)",
    "Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9",
]

In [3]:
HOME_URL = "https://www.airlinequality.com"

In [172]:
# obtain all airline names
all_airlines = []


response = requests.get(url=HOME_URL+"/review-pages/a-z-airline-reviews/", 
                        headers={"User-Agent": random.choice(user_agents)}, timeout=5)
bs_object = BeautifulSoup(markup=response.text, 
                          features="html.parser")

# obtain airline names
airline_lists = ["a2z-ldr-" + i for i in string.ascii_uppercase]

for airline_list in airline_lists:
    airlines = bs_object.find("div", attrs={"id": airline_list}).findAll("a")
    for airline in airlines:
        all_airlines.append(airline.text)

---

In [174]:
def scrap_airline_overall_reviews(name):
    airline_name = name.replace(" ", "-")
    url = HOME_URL + "/airline-reviews/" + airline_name
    
    response = requests.get(url=url,
                            headers={"User-Agent": random.choice(user_agents)}, 
                            timeout=5)
    
    bs_object = BeautifulSoup(markup=response.text,
                              features="html.parser")
    
    ratings = bs_object.find(name="table", 
                            attrs={"class": "review-ratings"}).findAll("tr")
    
    scores = list()
    for rate in ratings:
        score = np.nan 
        score_span = rate.findAll("td")[1].findAll(name="span", attrs={"class": "star fill"})

        if score_span:
            score = score_span[-1].text

        scores.append(score)

    attributes = ["Food & Beverage", "Inflight Entertainment", "Seat Comfort", "Staff Service", "Value for Money"]
    ratings = dict(zip(attributes, scores))
    ratings["airline"] = name
    
    review_scores.append(ratings)
    return 

In [175]:
review_scores = list()

In [176]:
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    executor.map(scrap_airline_overall_reviews, all_airlines)

In [193]:
review_scores_df = pd.DataFrame(review_scores)
review_scores_df = review_scores_df[["airline", "Food & Beverage", "Inflight Entertainment", "Seat Comfort", "Staff Service", "Value for Money"]]
review_scores_df.rename(columns={"airline": "Airline"}, inplace=True)
review_scores_df

Unnamed: 0,Airline,Food & Beverage,Inflight Entertainment,Seat Comfort,Staff Service,Value for Money
0,Adria Airways,3,2,3,3,3
1,Aerosur,3,2,2,3,2
2,Aero VIP,,,4,4,4
3,Aeromar,2,1,3,2,2
4,Aerocaribbean,,,,,4
...,...,...,...,...,...,...
482,WOW air,2,1,2,2,2
483,Xiamen Airlines,3,3,3,3,3
484,Zambia Airways,1,,2,4,5
485,Wizz Air,2,1,2,2,2


In [199]:
review_scores_df.fillna(-1, inplace=True)
review_scores_df.astype({"Airline": str,
                         "Food & Beverage": np.int8,
                         "Inflight Entertainment": np.int8,
                         "Seat Comfort": np.int8,
                         "Staff Service": np.int8,
                         "Value for Money": np.int8})
review_scores_df

Unnamed: 0,Airline,Food & Beverage,Inflight Entertainment,Seat Comfort,Staff Service,Value for Money
0,Adria Airways,3,2,3,3,3
1,Aerosur,3,2,2,3,2
2,Aero VIP,-1,-1,4,4,4
3,Aeromar,2,1,3,2,2
4,Aerocaribbean,-1,-1,-1,-1,4
...,...,...,...,...,...,...
482,WOW air,2,1,2,2,2
483,Xiamen Airlines,3,3,3,3,3
484,Zambia Airways,1,-1,2,4,5
485,Wizz Air,2,1,2,2,2


In [200]:
review_scores_df.to_csv("../data/review_scores.csv", index=False)

There are some missing airline data, this is caused by different airline urls and highly likely because there is an inconsistent in database. However, the script still manage to scrap 89% data from the skytrax website.

---

In [426]:
def obtain_review_information(review):
    data = dict()
    data["header"] = review.find("h2", {'class': "text_header"}).text
    data["rating"] = review.find("span", {"itemprop": "ratingValue"}).text if review.find("span", {"itemprop": "ratingValue"}) else None
    data["review_date"] = re.search(pattern=r"\d{4}-\d{2}-\d{2}",
                             string=str(review.find("meta", {"itemprop": "datePublished"}))).group()
    data["comment"] = review.find("div", {"class": "text_content"}).text
    data["trip_verified"] = True if "✅ Trip Verified" in data['comment'] else False

    data["comment"] = data["comment"].replace("✅ Trip Verified | ", "").strip()

    data["aircraft"] = review.find("td", {"class": "aircraft"}).findNext('td').text if review.find("td", {"class": "aircraft"}) else None
    data["type_of_traveller"] = review.find('td', {"class": "type_of_traveller"}).findNext('td').text if review.find('td', {"class": "type_of_traveller"}) else None
    data["seat_type"] = review.find('td', {"class": "cabin_flown"}).findNext('td').text if review.find('td', {"class": "cabin_flown"}) else None
    data["route"] = review.find('td', {"class": 'route'}).findNext('td').text if review.find('td', {"class": 'route'}) else None
    data["date_flown"] = review.find("td", {"class": "date_flown"}).findNext('td').text if review.find("td", {"class": "date_flown"}) else None

    data["seat_comfort"] = None
    if review.find("td", {"class": "seat_comfort"}):
        if review.find("td", {"class": "seat_comfort"}).findNext("td").findAll('span', {"class": "star fill"}):
            data["seat_comfort"] = review.find("td", {"class": "seat_comfort"}).findNext("td").findAll('span', {"class": "star fill"})[-1].text 
    
    data["cabin_staff_service"] = None
    if review.find("td", {"class": "cabin_staff_service"}):
        if review.find("td", {"class": "cabin_staff_service"}).findNext("td").findAll('span', {"class": "star fill"}):
            data["cabin_staff_service"] = review.find("td", {"class": "cabin_staff_service"}).findNext("td").findAll('span', {"class": "star fill"})[-1].text
     
    data["food_and_beverages"] = None
    if review.find("td", {"class": "food_and_beverages"}):
        if review.find("td", {"class": "food_and_beverages"}).findNext("td").findAll('span', {"class": "star fill"}):
            data["food_and_beverages"] = review.find("td", {"class": "food_and_beverages"}).findNext("td").findAll('span', {"class": "star fill"})[-1].text
    
    data["food_and_beverages"] = None
    if review.find("td", {"class": "inflight_entertainment"}):
        if review.find("td", {"class": "inflight_entertainment"}).findNext("td").findAll("span", {"class": "star fill"}):
            data["food_and_beverages"] = review.find("td", {"class": "inflight_entertainment"}).findNext("td").findAll("span", {"class": "star fill"})[-1]
    
    data["ground_service"] = None
    if review.find("td", {"class": "ground_service"}):
        if review.find("td", {"class": "ground_service"}).findNext("td").findAll('span', {"class": "star fill"}):
            data["ground_service"] = review.find("td", {"class": "ground_service"}).findNext("td").findAll('span', {"class": "star fill"})[-1].text 
            
    data["value_for_money"] = None
    if review.find("td", {"class": "value_for_money"}):
        if review.find("td", {"class": "value_for_money"}).findNext("td").findAll('span', {"class": "star fill"}):
            data["value_for_money"] = review.find("td", {"class": "value_for_money"}).findNext("td").findAll('span', {"class": "star fill"})[-1].text
            
    data["wifi_and_connecticity"] = None
    if review.find("td", {"class": "wifi_and_connectivity"}):
        if review.find("td", {"class": "wifi_and_connectivity"}).findNext("td").findAll('span', {"class": "star fill"}):
            data["wifi_and_connectivity"] = review.find("td", {"class": "wifi_and_connectivity"}).findNext("td").findAll('span', {"class": "star fill"})[-1].text
    data["recommend"] = review.find("td", {"class": "recommended"}).findNext("td").text == "yes" if review.find("td", {"class": "recommended"}) else None

    return data

In [427]:
def scrap_airline_reviews(name) -> pd.core.frame.DataFrame:
    airline_name = name.replace(" ", "-").lower()
    url = HOME_URL + "/airline-reviews/" + airline_name + "/page/1/?sortby=post_date%3ADesc&pagesize=100"
    response = requests.get(url=url,
                            headers={"User-Agent": random.choice(user_agents)}, 
                            timeout=5)
    
    bs_object = BeautifulSoup(markup=response.text,
                              features="html.parser")
    
    review_num = int(bs_object.find(name="span", 
                                    attrs={"itemprop": "reviewCount"}).text.replace("\n", "").replace("\t", ""))
    review_num = int(np.round(1270/100))
    
    all_reviews = list()
    for page_num in range(1, review_num):
        url = HOME_URL + "/airline-reviews/" + airline_name + "/page/" + str(page_num) + "/?sortby=post_date%3ADesc&pagesize=100"
        
        response = requests.get(url=url,
                                headers={"User-Agent": random.choice(user_agents)},
                                timeout=5)
        bs_object = BeautifulSoup(markup=response.text, features="html.parser")
        reviews = bs_object.findAll(name="article", attrs={"itemprop": "review"})
        
        for review in reviews:
            data = obtain_review_information(review)
            all_reviews.append(data)
    
    df = pd.DataFrame(all_reviews)
    df.to_csv(f"../data/airline_reviews/{name.replace(' ', '-').lower()}.csv", index=False)
    return 

In [429]:
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    executor.map(scrap_airline_reviews, all_airlines)

In [430]:
data_path = [f for f in os.listdir("../data/airline_reviews") if "csv" in f]

In [432]:
len(data_path)/len(all_airlines)

0.8931860036832413

After scraping review data for each airlines, we are able to obtain 89% airline reviews.

---

Cleanning data

In [451]:
# combine all data into a single dataframe
review_df = pd.read_csv("../data/airline_reviews/" + data_path[0])
review_df["airline"] = data_path[0].replace(".csv", "")

for f in data_path[1::]:
    df = pd.read_csv("../data/airline_reviews/" + f)
    df['airline'] = f.replace(".csv", "")
    review_df = pd.concat([review_df, df], axis=0)
    
review_df

Unnamed: 0,header,rating,review_date,comment,trip_verified,aircraft,type_of_traveller,seat_type,route,date_flown,seat_comfort,cabin_staff_service,food_and_beverages,ground_service,value_for_money,wifi_and_connecticity,recommend,airline,wifi_and_connectivity
0,"""pretty decent airline""",9.0,2019-11-11,Moroni to Moheli. Turned out to be a pretty de...,True,,Solo Leisure,Economy Class,Moroni to Moheli,November 2019,4.0,5.0,,4.0,3.0,,True,ab-aviation,
1,"""Not a good airline""",1.0,2019-06-25,Moroni to Anjouan. It is a very small airline....,True,E120,Solo Leisure,Economy Class,Moroni to Anjouan,June 2019,2.0,2.0,,1.0,2.0,,False,ab-aviation,
2,"""flight was fortunately short""",1.0,2019-06-25,Anjouan to Dzaoudzi. A very small airline and ...,True,Embraer E120,Solo Leisure,Economy Class,Anjouan to Dzaoudzi,June 2019,2.0,1.0,,1.0,2.0,,False,ab-aviation,
0,"""I will never fly again with Adria""",1.0,2019-09-28,Not Verified | Please do a favor yourself and...,False,,Solo Leisure,Economy Class,Frankfurt to Pristina,September 2019,1.0,1.0,,1.0,1.0,,False,adria-airways,
1,"""it ruined our last days of holidays""",1.0,2019-09-24,Do not book a flight with this airline! My fri...,True,,Couple Leisure,Economy Class,Sofia to Amsterdam via Ljubljana,September 2019,1.0,1.0,"<span class=""star fill"">1</span>",1.0,1.0,,False,adria-airways,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14,"""customer service is terrible""",1.0,2022-07-05,Not Verified | Bangkok to Tokyo. I’ve flown ma...,False,,Couple Leisure,Economy Class,Bangkok to Tokyo,June 2022,2.0,1.0,"<span class=""star fill"">1</span>",1.0,1.0,,False,zipair,1.0
15,"""Avoid at all costs""",1.0,2022-06-01,Avoid at all costs. I booked flights to go fro...,True,,Solo Leisure,Economy Class,Singapore to Tokyo,June 2022,,,,,1.0,,False,zipair,
16,"""Will not recommend to anyone""",3.0,2022-05-31,Flight was leaving at 23.15 and after an hour ...,True,,Business,Economy Class,Bangkok to Tokyo,May 2022,2.0,4.0,,1.0,2.0,,False,zipair,
17,"""It was immaculately clean""",6.0,2022-05-23,Zipair is JAL’s budget airline. They don’t hav...,True,Dreamliner,Business,Business Class,Tokyo to Los Angeles,May 2022,3.0,4.0,"<span class=""star fill"">2</span>",1.0,5.0,,True,zipair,5.0


In [452]:
# remove "" wrapper in each header
review_df["header"] = review_df.header.str.replace("\"", "").str.strip()

In [453]:
# remove unexpected values in comment
review_df['comment'] = review_df.comment.str.replace("Not Verified |", "").str.strip()
review_df['comment'] = review_df.comment.str.replace("|", "").str.strip()

  review_df['comment'] = review_df.comment.str.replace("Not Verified |", "").str.strip()
  review_df['comment'] = review_df.comment.str.replace("|", "").str.strip()


In [505]:
def extract_number(x):
    res = re.search(r"\d", str(x))
    if res:
        return res.group()
    else:
        return np.nan

In [507]:
review_df['food_and_beverages'] = review_df.agg(lambda x: extract_number(x['food_and_beverages']), axis=1)

In [509]:
review_df.to_csv("../data/review_comments.csv", index=False)