## Web scraping and analysis of data from Skytrax

We use the `BeautifulSoup` package to scrape British Airways reviews from the Skytrax website (https://www.airlinequality.com/airline-reviews/british-airways). The scraped data is then stored in a `.csv` file.

In [None]:
# importing libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 38  # total number of pages to scrape
page_size = 100  # number of reviews requested per page

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on an HTTP error instead of parsing an error page
    content = response.content
    soup = BeautifulSoup(content, 'html.parser')

    # Parse content
    # Loop through each review article
    review_articles = soup.find_all("article", {"itemprop": "review"})

    for article in review_articles:
        # Extracting review text
        text_tag = article.find("div", {"class": "text_content"})
        review_text = text_tag.get_text(strip=True) if text_tag else ''

        # Extracting the rating value
        rating_tag = article.find("span", {"itemprop": "ratingValue"})
        review_value = rating_tag.get_text(strip=True) if rating_tag else ''

        # Extracting the date the review was posted
        time_tag = article.find("time", {"itemprop": "datePublished"})
        review_time = time_tag.get_text(strip=True) if time_tag else ''

        # Extracting the review title
        title_tag = article.find("h2", {"class": "text_header"})
        review_title = title_tag.get_text(strip=True) if title_tag else ''

        # Extracting the country of the reviewer (the text node that follows the author <span>)
        author_tag = article.find("span", {"itemprop": "author"})
        country_node = author_tag.next_sibling if author_tag else None
        reviewer_country = country_node.strip() if isinstance(country_node, str) else ''
        
        details = {}
        review_details_table = article.find("table", {"class": "review-ratings"})
        if review_details_table:
            for row in review_details_table.find_all("tr"):
                # Look up each cell once; a row without a review-value cell falls back to 'Not specified'
                header_tag = row.find("td", {"class": "review-rating-header"})
                value_tag = row.find("td", {"class": "review-value"})
                header = header_tag.get_text(strip=True) if header_tag else None
                value = value_tag.get_text(strip=True) if value_tag else 'Not specified'
                if header:
                    details[header] = value
        
        reviews.append({
            "review_text": review_text,
            "review_value": review_value,
            "review_time": review_time,
            "review_title": review_title,
            "reviewer_country": reviewer_country,
            **details  # This adds all the aspect ratings into the dictionary
        })

    print(f"   ---> {len(reviews)} total reviews collected so far")

Scraping page 1
   ---> 100 total reviews collected so far
Scraping page 2
   ---> 200 total reviews collected so far
Scraping page 3
   ---> 300 total reviews collected so far
Scraping page 4
   ---> 400 total reviews collected so far
Scraping page 5
   ---> 500 total reviews collected so far
Scraping page 6
   ---> 600 total reviews collected so far
Scraping page 7
   ---> 700 total reviews collected so far
Scraping page 8
   ---> 800 total reviews collected so far
Scraping page 9
   ---> 900 total reviews collected so far
Scraping page 10
   ---> 1000 total reviews collected so far
Scraping page 11
   ---> 1100 total reviews collected so far
Scraping page 12
   ---> 1200 total reviews collected so far
Scraping page 13
   ---> 1300 total reviews collected so far
Scraping page 14
   ---> 1400 total reviews collected so far
Scraping page 15
   ---> 1500 total reviews collected so far
Scraping page 16
   ---> 1600 total reviews collected so far
Scraping page 17
   ---> 1700 total review

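The loop above hard-codes `pages = 38`, which goes stale as new reviews are posted. One possible alternative, sketched here under stated assumptions, is to keep requesting pages until one comes back empty. The names `build_page_url`, `scrape_until_empty` and `fake_fetch` are hypothetical and not part of the original notebook; `fake_fetch` stands in for the real request-and-parse step so the sketch runs without network access.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page, page_size=100):
    """Build the paginated URL; urlencode escapes ':' as %3A, as in the cell above."""
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{BASE_URL}/page/{page}/?{query}"


def scrape_until_empty(fetch_reviews, max_pages=100):
    """Collect reviews page by page and stop at the first empty page.

    fetch_reviews is a hypothetical callable: it takes a URL and returns
    a list of review dicts (an empty list when the page has no reviews).
    max_pages is a safety cap so the loop always terminates.
    """
    collected = []
    for page in range(1, max_pages + 1):
        batch = fetch_reviews(build_page_url(page))
        if not batch:
            break
        collected.extend(batch)
    return collected


def fake_fetch(url):
    """Stand-in fetcher: pretends the site has 3 pages with 2 reviews each."""
    page = int(url.split("/page/")[1].split("/")[0])
    return [{"page": page}, {"page": page}] if page <= 3 else []


collected_reviews = scrape_until_empty(fake_fetch)
print(len(collected_reviews))
```

In a real run, `fetch_reviews` would wrap the `requests.get` call and the `BeautifulSoup` parsing from the cell above. Adding a short pause between requests would also be reasonable.
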
In [3]:
# Convert the list of dictionaries into a DataFrame
df = pd.DataFrame(reviews)

# Display the first few rows to verify
df.head(3)

| | review_text | review_value | review_time | review_title | reviewer_country | Type Of Traveller | Seat Type | Route | Date Flown | Seat Comfort | Cabin Staff Service | Food & Beverages | Ground Service | Value For Money | Recommended | Aircraft | Wifi & Connectivity | Inflight Entertainment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ✅Trip Verified\| Absolutely horrible customer ... | 1 | 12th March 2024 | "cancelled our return flight" | (Canada) | Family Leisure | Economy Class | Toronto to Mumbai via London | February 2024 | Not specified | Not specified | Not specified | Not specified | Not specified | no | | | |
| 1 | Not Verified\| BA is not what it used to be! A... | 7 | 11th March 2024 | "KLM is definitely a league over BA" | (Denmark) | Family Leisure | Economy Class | Copenhagen to Port of Spain via London | February 2024 | Not specified | Not specified | Not specified | Not specified | Not specified | yes | | | |
| 2 | ✅Trip Verified\| BA First, it's not even the b... | 1 | 10th March 2024 | "Service extremely inattentive" | (United Kingdom) | Solo Leisure | First Class | Los Angeles to London | March 2024 | Not specified | Not specified | Not specified | Not specified | Not specified | no | Boeing 777-300ER | Not specified | |

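The aspect columns such as `Seat Comfort` read `Not specified` because the scraper only looks for a `review-value` cell in each row of the ratings table. A minimal sketch of how star ratings might be recovered instead follows. It assumes the stars are rendered as `<span class="star fill">` elements inside a `td` with class `review-rating-stars`. That markup is an assumption that this notebook does not confirm, and `SAMPLE_HTML` and `parse_ratings` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Assumed markup for illustration only; check the live page before relying on it.
SAMPLE_HTML = """
<table class="review-ratings">
  <tr>
    <td class="review-rating-header seat_comfort">Seat Comfort</td>
    <td class="review-rating-stars stars">
      <span class="star fill">1</span><span class="star fill">2</span>
      <span class="star fill">3</span><span class="star">4</span>
      <span class="star">5</span>
    </td>
  </tr>
  <tr>
    <td class="review-rating-header recommended">Recommended</td>
    <td class="review-value rating-no">no</td>
  </tr>
</table>
"""


def parse_ratings(table):
    """Map each rating header to its text value or its count of filled stars."""
    details = {}
    for row in table.find_all("tr"):
        header = row.find("td", class_="review-rating-header")
        if header is None:
            continue
        name = header.get_text(strip=True)
        value = row.find("td", class_="review-value")
        stars = row.find("td", class_="review-rating-stars")
        if value is not None:
            details[name] = value.get_text(strip=True)
        elif stars is not None:
            # Count only the filled stars (assumed class name "fill")
            details[name] = len(stars.find_all("span", class_="fill"))
        else:
            details[name] = "Not specified"
    return details


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
ratings = parse_ratings(soup.find("table", class_="review-ratings"))
print(ratings)
```

If the real markup differs, only the two class names in `parse_ratings` would need to change.
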

##### Removing unwanted columns

In [4]:
df = df.drop(columns=["Seat Comfort", "Cabin Staff Service", "Food & Beverages", "Inflight Entertainment", "Ground Service", "Value For Money", "Wifi & Connectivity"])

df.head()

| | review_text | review_value | review_time | review_title | reviewer_country | Type Of Traveller | Seat Type | Route | Date Flown | Recommended | Aircraft |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ✅Trip Verified\| Absolutely horrible customer ... | 1 | 12th March 2024 | "cancelled our return flight" | (Canada) | Family Leisure | Economy Class | Toronto to Mumbai via London | February 2024 | no | |
| 1 | Not Verified\| BA is not what it used to be! A... | 7 | 11th March 2024 | "KLM is definitely a league over BA" | (Denmark) | Family Leisure | Economy Class | Copenhagen to Port of Spain via London | February 2024 | yes | |
| 2 | ✅Trip Verified\| BA First, it's not even the b... | 1 | 10th March 2024 | "Service extremely inattentive" | (United Kingdom) | Solo Leisure | First Class | Los Angeles to London | March 2024 | no | Boeing 777-300ER |
| 3 | ✅Trip Verified\| The worst business class expe... | 3 | 5th March 2024 | "worst business class experience" | (Australia) | Business | Business Class | Singapore to Sydney | March 2024 | no | Boeing 777-300 |
| 4 | Not Verified\| Quite possibly the worst busine... | 1 | 4th March 2024 | "it's truly awful for short-haul" | (Canada) | Business | Business Class | Cyprus to London | February 2024 | no | |

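The scraped values still carry presentation debris: a verification prefix in `review_text`, ordinal suffixes in `review_time`, quotes around `review_title`, and parentheses around `reviewer_country`. The sketch below shows one way these fields could be tidied before analysis, using only the standard library. The helper names are hypothetical, and the parsing rules are inferred from the rows shown above, so other reviews may need extra cases.

```python
import re
from datetime import date, datetime


def split_verification(text):
    """Split '<status>| <body>' into (is_verified, body); is_verified is None if no prefix."""
    status, sep, body = text.partition("|")
    if not sep:
        return None, text.strip()
    return "not verified" not in status.lower(), body.strip()


def parse_review_date(text):
    """Parse dates such as '12th March 2024' (English month names assumed)."""
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text)
    return datetime.strptime(cleaned, "%d %B %Y").date()


def clean_country(text):
    """Turn '(Canada)' into 'Canada'."""
    return text.strip().strip("()").strip()


def clean_title(text):
    """Drop the quotation marks that wrap each review title."""
    return text.strip().strip('"“”')


# Sample row copied from the output above
row = {
    "review_text": "✅Trip Verified| Absolutely horrible customer ...",
    "review_time": "12th March 2024",
    "review_title": '"cancelled our return flight"',
    "reviewer_country": "(Canada)",
}
verified, body = split_verification(row["review_text"])
cleaned = {
    "verified": verified,
    "review_text": body,
    "review_time": parse_review_date(row["review_time"]),
    "review_title": clean_title(row["review_title"]),
    "reviewer_country": clean_country(row["reviewer_country"]),
}
print(cleaned)
```

On the DataFrame, helpers like these could be applied column by column, for example with `df["review_time"].map(parse_review_date)`.
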

##### Saving to CSV file

In [5]:
df.to_csv("data/BA_reviews.csv", index=False)
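
The call above assumes a `data/` directory already exists next to the notebook. As a hedged sketch, the helper below creates the directory first and reads the file back to confirm the round trip. `save_reviews` and the sample frame are hypothetical, and a temporary directory is used so nothing is written to the project.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_reviews(df, path):
    """Write df to CSV, creating parent directories if they are missing."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # index=False avoids an extra 'Unnamed: 0' column when the file is reloaded
    df.to_csv(path, index=False, encoding="utf-8")
    return path


sample = pd.DataFrame(
    [{"review_value": "1", "reviewer_country": "(Canada)", "Recommended": "no"}]
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = save_reviews(sample, Path(tmp) / "data" / "BA_reviews.csv")
    restored = pd.read_csv(out_path, dtype=str)

print(restored)
```

Reading with `dtype=str` keeps values such as `review_value` as text, matching how they were scraped.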