#### Scraping Our Data

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_past_reviews(base_url, start_page, end_page):
    reviews = []

    for page_num in range(start_page, end_page + 1):
        url = f"{base_url}/page/{page_num}/?sortby=post_date%3ADesc&pagesize=100"
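        response = requests.get(url, timeout=30)
        response.raise_for_status()  # fail fast on HTTP errors instead of parsing an error page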
        soup = BeautifulSoup(response.content, 'html.parser')

        review_elements = soup.find_all('article', itemprop='review')

        for element in review_elements:
            review_data = {}

            review_data['verification_status'] = "Verified" if "✅Trip Verified|" in element.get_text() else "Not Verified"
            review_body = element.find('div', class_='text_content').get_text(strip=True).replace("✅Trip Verified|", "").replace("Not Verified|", "").strip()
            review_data['review_body'] = review_body

            published_date = element.find('time', itemprop='datePublished')['datetime']
            review_data['published_date'] = published_date

            rating_element = element.find('div', itemprop='reviewRating')
            if rating_element:
                rating_value = rating_element.find('span', itemprop='ratingValue').get_text(strip=True)
                best_rating = rating_element.find('span', itemprop='bestRating').get_text(strip=True)
                review_data['rating'] = f"{rating_value}/{best_rating}"

            rows = element.find_all('tr')
            for row in rows:
                header = row.find('td', class_='review-rating-header')
                value = row.find('td', class_='review-value')
                if header and value:
                    review_data[header.get_text(strip=True)] = value.get_text(strip=True)

            reviews.append(review_data)

    df = pd.DataFrame(reviews)
    return df

# Define the base URL and page range
base_url = 'https://www.airlinequality.com/airline-reviews/british-airways'
start_page = 1
end_page = 40

# Scrape the reviews
reviews_df = scrape_past_reviews(base_url, start_page, end_page)

print(reviews_df)


     verification_status                                        review_body  \
0           Not Verified  Very good flight following an equally good fli...   
1           Not Verified  An hour's delay due to late arrival of the inc...   
2           Not Verified  I booked through BA because Loganair don’t hav...   
3           Not Verified  British airways lost bags in LHR then found th...   
4           Not Verified  The check in process and reward/loyalty progra...   
...                  ...                                                ...   
3906        Not Verified  YYZ to LHR - July 2012 - I flew overnight in p...   
3907        Not Verified  LHR to HAM. Purser addresses all club passenge...   
3908        Not Verified  My son who had worked for British Airways urge...   
3909        Not Verified  London City-New York JFK via Shannon on A318 b...   
3910        Not Verified  SIN-LHR BA12 B747-436 First Class. Old aircraf...   

     published_date rating Aircraft Type Of Travell
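Every row in the printed output is `Not Verified`, and the later `describe()` output reports a single unique value for this column. That may mean the exact-string test `"✅Trip Verified|"` never matches the page text, for instance if the marker is rendered with spaces around it. This has not been confirmed against the site. A minimal sketch of a whitespace-tolerant check, where `split_verification` is a hypothetical helper and the marker spellings are assumed from the strings used in the scraper:

```python
import re

# Assumed marker shapes: an optional check mark, then "Trip Verified" or
# "Not Verified", then a pipe, with any amount of whitespace in between.
_MARKER = re.compile(r"^\s*(?:✅\s*)?(Trip Verified|Not Verified)\s*\|\s*")


def split_verification(text: str) -> tuple[str, str]:
    """Return (verification_status, review_body) for one review's text."""
    match = _MARKER.match(text)
    if match is None:
        # No marker found: fall back to the scraper's default label.
        return "Not Verified", text.strip()
    status = "Verified" if match.group(1) == "Trip Verified" else "Not Verified"
    return status, text[match.end():].strip()
```

In the scraper this would take the text of the `text_content` div and replace the two chained `.replace(...)` calls.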

In [2]:
reviews_df.head()

Unnamed: 0,verification_status,review_body,published_date,rating,Aircraft,Type Of Traveller,Seat Type,Route,Date Flown,Recommended
0,Not Verified,Very good flight following an equally good fli...,2025-01-20,9/10,A320,Solo Leisure,Business Class,London Heathrow to Zurich,January 2025,yes
1,Not Verified,An hour's delay due to late arrival of the inc...,2025-01-19,7/10,A319,Family Leisure,Economy Class,London to Lisbon,January 2025,yes
2,Not Verified,I booked through BA because Loganair don’t hav...,2025-01-15,1/10,,Solo Leisure,Economy Class,Manchester to Isle of Man,November 2024,no
3,Not Verified,British airways lost bags in LHR then found th...,2025-01-09,1/10,,Family Leisure,Premium Economy,Houston to cologne via London,December 2024,no
4,Not Verified,The check in process and reward/loyalty progra...,2025-01-05,1/10,A320,Business,Economy Class,London to Basel,January 2025,no


In [3]:
# save this data to the raw data folder
reviews_df.to_parquet('../data/raw/past_reviews.parquet')

In [4]:
# make sure we can read it in
reviews_df = pd.read_parquet('../data/raw/past_reviews.parquet')

In [5]:
reviews_df.head()

Unnamed: 0,verification_status,review_body,published_date,rating,Aircraft,Type Of Traveller,Seat Type,Route,Date Flown,Recommended
0,Not Verified,Very good flight following an equally good fli...,2025-01-20,9/10,A320,Solo Leisure,Business Class,London Heathrow to Zurich,January 2025,yes
1,Not Verified,An hour's delay due to late arrival of the inc...,2025-01-19,7/10,A319,Family Leisure,Economy Class,London to Lisbon,January 2025,yes
2,Not Verified,I booked through BA because Loganair don’t hav...,2025-01-15,1/10,,Solo Leisure,Economy Class,Manchester to Isle of Man,November 2024,no
3,Not Verified,British airways lost bags in LHR then found th...,2025-01-09,1/10,,Family Leisure,Premium Economy,Houston to cologne via London,December 2024,no
4,Not Verified,The check in process and reward/loyalty progra...,2025-01-05,1/10,A320,Business,Economy Class,London to Basel,January 2025,no


### EDA 

All of our columns are stored as **object** dtype, but the rating should be a numeric value. The column names are also inconsistent: the ones scraped from the review table are capitalized and contain spaces, while the ones we created are lowercase with underscores. We will rename them to a standard snake-case format.

In [8]:
# function to rename columns
def rename_cols(reviews_df: pd.DataFrame) -> pd.DataFrame:
    '''Return a copy of the DataFrame with column names in snake case.'''
    # No inplace=True here: with it, rename() returns None, which
    # contradicts the declared return type.
    return reviews_df.rename(columns={
        'Aircraft': 'aircraft',
        'Type Of Traveller': 'type_of_traveler',
        'Seat Type': 'seat_type',
        'Route': 'route',
        'Date Flown': 'date_flown',
        'Recommended': 'recommended',
    })

In [9]:
# rename the columns using the function
reviews_df = rename_cols(reviews_df)
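The explicit mapping works well for a fixed set of columns. If the scraped table ever gains new headers, a generic converter avoids listing each one by hand. The sketch below is an alternative, not the notebook's method, and `to_snake_case` is a hypothetical helper name:

```python
import re


def to_snake_case(name: str) -> str:
    """Lowercase a column name and replace runs of whitespace with '_'."""
    return re.sub(r"\s+", "_", name.strip()).lower()
```

With pandas, `reviews_df.rename(columns=to_snake_case)` applies it to every column, because `rename` accepts a callable. Note that this keeps the scraped spelling, so `Type Of Traveller` becomes `type_of_traveller` rather than the `type_of_traveler` used in the rest of this notebook.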

In [11]:
# check they have been renamed
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3911 entries, 0 to 3910
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   verification_status  3911 non-null   object
 1   review_body          3911 non-null   object
 2   published_date       3911 non-null   object
 3   rating               3906 non-null   object
 4   aircraft             2037 non-null   object
 5   type_of_traveler     3140 non-null   object
 6   seat_type            3909 non-null   object
 7   route                3135 non-null   object
 8   date_flown           3133 non-null   object
 9   recommended          3911 non-null   object
dtypes: object(10)
memory usage: 305.7+ KB


In [12]:
reviews_df.describe()

Unnamed: 0,verification_status,review_body,published_date,rating,aircraft,type_of_traveler,seat_type,route,date_flown,recommended
count,3911,3911,3911,3906,2037,3140,3909,3135,3133,3911
unique,1,3904,1981,10,212,4,4,1633,126,2
top,Not Verified,British Airways from Tampa to Gatwick on Boein...,2015-01-19,1/10,A320,Couple Leisure,Economy Class,London to Johannesburg,August 2015,no
freq,3911,2,26,951,394,1057,2039,21,83,2351


The **review_body** column has 3911 rows but only 3904 unique values, so some reviews appear more than once. We can inspect the duplicates and remove them.

In [14]:
# find the duplicate reviews and inspect them
duplicate_review_body = reviews_df[reviews_df.duplicated(subset=['review_body'], keep=False)]
duplicate_review_body.sort_values('review_body')

Unnamed: 0,verification_status,review_body,published_date,rating,aircraft,type_of_traveler,seat_type,route,date_flown,recommended
2739,Not Verified,British Airways from Tampa to Gatwick on Boein...,2015-12-01,8/10,Boeing 777,Couple Leisure,Business Class,Tampa to Gatwick,November 2015,yes
2767,Not Verified,British Airways from Tampa to Gatwick on Boein...,2015-11-20,8/10,Boeing 777,Couple Leisure,Business Class,Tampa to Gatwick,November 2015,yes
3897,Not Verified,HKG-LHR in New Club World on Boeing 777-300 - ...,2012-08-29,6/10,,,Business Class,,,yes
3905,Not Verified,HKG-LHR in New Club World on Boeing 777-300 - ...,2012-08-29,6/10,,,Business Class,,,yes
3499,Not Verified,I travel to and from Singapore on BA in Club w...,2014-11-20,5/10,,,Business Class,,,yes
3501,Not Verified,I travel to and from Singapore on BA in Club w...,2014-11-20,5/10,,,Business Class,,,yes
3498,Not Verified,Just completed a return trip to Hong Kong on t...,2014-11-20,5/10,,,Economy Class,,,yes
3500,Not Verified,Just completed a return trip to Hong Kong on t...,2014-11-20,5/10,,,Economy Class,,,yes
2740,Not Verified,London Heathrow to Miami on one of British Air...,2015-12-01,6/10,Boeing 747-400,Couple Leisure,Premium Economy,LHR to MIA,November 2015,yes
2769,Not Verified,London Heathrow to Miami on one of British Air...,2015-11-20,6/10,Boeing 747-400,Couple Leisure,Premium Economy,LHR to MIA,November 2015,yes


In [17]:
print(len(duplicate_review_body))

14


Checking for duplicates across entire rows gives a smaller number (10) than checking on **review_body** alone (14). On inspection, two of the duplicated reviews were posted twice on different dates, so those four rows match on the review text but not as complete rows. We will drop duplicates based on **review_body** and keep the copy with the oldest date. The scrape used the site's filter sorted from newest to oldest, so we first sort from oldest to newest and then keep the first occurrence of each review.

In [16]:
# Identify duplicate rows based on the entire dataset
duplicate_rows = reviews_df[reviews_df.duplicated(keep=False)]

# Count the duplicate rows
print(len(duplicate_rows))


10


In [19]:
# remove duplicates and keep the oldest review - sort by published date
reviews_df['published_date'] = pd.to_datetime(reviews_df['published_date'])
# sort by published date
reviews_df.sort_values('published_date', inplace=True)
# remove duplicates
reviews_df.drop_duplicates(subset='review_body', keep='first', inplace=True)
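The difference between the two duplicate counts, and the effect of sorting before `drop_duplicates`, can be seen on a small made-up frame. The data below is invented for illustration only: review "A" is posted twice on different dates, and review "B" is posted twice on the same date.

```python
import pandas as pd

# Toy data (hypothetical): two reposted reviews and one unique review.
toy = pd.DataFrame({
    "review_body": ["A", "A", "B", "B", "C"],
    "published_date": pd.to_datetime(
        ["2015-12-01", "2015-11-20", "2014-11-20", "2014-11-20", "2016-01-05"]
    ),
})

# Rows whose review text appears more than once (both A rows and both B rows).
n_body_dupes = int(toy.duplicated(subset=["review_body"], keep=False).sum())

# Rows that are identical in every column (only the two B rows).
n_row_dupes = int(toy.duplicated(keep=False).sum())

# Sort oldest first, then keep the first occurrence of each review text.
deduped = (
    toy.sort_values("published_date")
       .drop_duplicates(subset="review_body", keep="first")
)

print(n_body_dupes, n_row_dupes, len(deduped))
```

Here the text-based count is 4 and the full-row count is 2, mirroring the 14 versus 10 seen above, and the surviving "A" row is the one dated 2015-11-20.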


In [20]:
# now there should be 3904 reviews since there were 3904 unique reviews
len(reviews_df)

3904

In [22]:
# sort by published date with most recent being first
reviews_df.sort_values('published_date', ascending=False, inplace=True) 
reviews_df.head()

Unnamed: 0,verification_status,review_body,published_date,rating,aircraft,type_of_traveler,seat_type,route,date_flown,recommended
0,Not Verified,Very good flight following an equally good fli...,2025-01-20,9/10,A320,Solo Leisure,Business Class,London Heathrow to Zurich,January 2025,yes
1,Not Verified,An hour's delay due to late arrival of the inc...,2025-01-19,7/10,A319,Family Leisure,Economy Class,London to Lisbon,January 2025,yes
2,Not Verified,I booked through BA because Loganair don’t hav...,2025-01-15,1/10,,Solo Leisure,Economy Class,Manchester to Isle of Man,November 2024,no
3,Not Verified,British airways lost bags in LHR then found th...,2025-01-09,1/10,,Family Leisure,Premium Economy,Houston to cologne via London,December 2024,no
4,Not Verified,The check in process and reward/loyalty progra...,2025-01-05,1/10,A320,Business,Economy Class,London to Basel,January 2025,no


In [23]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3904 entries, 0 to 3910
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   verification_status  3904 non-null   object        
 1   review_body          3904 non-null   object        
 2   published_date       3904 non-null   datetime64[ns]
 3   rating               3899 non-null   object        
 4   aircraft             2035 non-null   object        
 5   type_of_traveler     3138 non-null   object        
 6   seat_type            3902 non-null   object        
 7   route                3133 non-null   object        
 8   date_flown           3131 non-null   object        
 9   recommended          3904 non-null   object        
dtypes: datetime64[ns](1), object(9)
memory usage: 335.5+ KB


In [28]:
reviews_df.describe(include='all')

Unnamed: 0,verification_status,review_body,published_date,rating,aircraft,type_of_traveler,seat_type,route,date_flown,recommended
count,3904,3904,3904,3899,2035,3138,3902,3133,3131,3904
unique,1,3904,,10,212,4,4,1633,126,2
top,Not Verified,MIA-LHR in World Traveller on a 747-400. After...,,1/10,A320,Couple Leisure,Economy Class,London to Johannesburg,August 2015,no
freq,3904,1,,951,394,1055,2038,21,83,2350
mean,,,2018-01-31 14:14:15.737705216,,,,,,,
min,,,2011-10-09 00:00:00,,,,,,,
25%,,,2015-09-15 18:00:00,,,,,,,
50%,,,2017-04-06 12:00:00,,,,,,,
75%,,,2019-09-15 00:00:00,,,,,,,
max,,,2025-01-20 00:00:00,,,,,,,


The **rating** column holds scores on a scale of 1 to 10, so we can convert it to a numeric value. However, 5 ratings are missing (3899 of 3904 are non-null), so we need to check those first.

In [24]:
# Filter the DataFrame to find rows with NaN values in the 'rating' column
nan_reviews = reviews_df[reviews_df['rating'].isna()]
nan_reviews

Unnamed: 0,verification_status,review_body,published_date,rating,aircraft,type_of_traveler,seat_type,route,date_flown,recommended
3335,Not Verified,Cabin crew polite unfortunately BA ran out of ...,2015-02-18,,,,Economy Class,,,no
3463,Not Verified,Phoenix to London - outbound a wonderful and e...,2014-12-10,,,,First Class,,,no
3487,Not Verified,On past experience I chose BA for our long hau...,2014-11-25,,,,Economy Class,,,no
3718,Not Verified,LHR-CPH-LHR Business Class. This is a joke. Sc...,2014-07-31,,,,Business Class,,,no
3752,Not Verified,I flew with British Airways with my mother fro...,2014-07-15,,,,Economy Class,,,no


Although the rating is missing for these 5 reviews, all of them are marked as not recommended and read as bad reviews. This means we can impute a value here, knowing that the rating would not have been pleasant. Looking at the **value_counts** for the ratings, 1/10 is the most common rating, and these reviews seem to fit a 1/10 as well, so we will impute that value for them.

In [25]:
for review in nan_reviews['review_body']:
    print(f'{review}\n')

Cabin crew polite unfortunately BA ran out of chicken ran out of wine ran out of soft drinks. The food was awful. Inflight service was a disaster - did not work properly. 12 hours 20 mins flight everything counts obviously BA overlooked this. To be fair all cabin crew were polite and accommodating. I will never fly again with BA.

Phoenix to London - outbound a wonderful and enjoyable experience. The problem we had started on our return flight home. We boarded the plane and were disappointed to find out that we were in the old style first class. After being in the new cabin on the way out the old configuration is very run down and there was a rip in the seat. I was attempting to work but when I went to plug in my laptop battery I found that there was no adapter for me to use. Since I could no longer work I decided to put a movie on. The tiny screen in the old first class was just pathetic. The flight attendants did their absolute best to make up for the issues and I give them high mark

In [31]:
reviews_df['rating'].value_counts()

rating
1/10     951
2/10     443
3/10     429
8/10     388
10/10    336
9/10     324
7/10     321
4/10     263
5/10     244
6/10     200
Name: count, dtype: int64

In [33]:
# fill the empty ratings with a 1/10
reviews_df['rating'] = reviews_df['rating'].fillna('1/10')

In [34]:
# no more missing values
reviews_df[reviews_df['rating'].isna()]

Unnamed: 0,verification_status,review_body,published_date,rating,aircraft,type_of_traveler,seat_type,route,date_flown,recommended


In [29]:
# Function to convert rating to float
def convert_rating(rating) -> float:
    '''Convert rating to float.'''
    # Split the string on '/' and take the first part
    numerator = rating.split('/')[0]  
    return float(numerator)

# Apply the function to the 'rating' column
reviews_df['rating'] = reviews_df['rating'].apply(convert_rating)

reviews_df.head()


AttributeError: 'NoneType' object has no attribute 'split'
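The `AttributeError` above is what `apply` raises when a rating is still `None`. The execution counts suggest this cell (`In [29]`) was run before the `fillna` cell (`In [33]`), which would explain it. Re-running the conversion after the fill avoids the error. As a sketch of a more defensive version, the hypothetical `convert_rating_safe` below passes missing values through as NaN instead of failing:

```python
import math
from typing import Optional


def convert_rating_safe(rating: Optional[str]) -> float:
    """Convert a 'n/10' string to float; return NaN for missing values."""
    # Missing values can arrive as None or as a float NaN.
    if rating is None or (isinstance(rating, float) and math.isnan(rating)):
        return math.nan
    # Split the string on '/' and take the first part.
    return float(rating.split("/")[0])
```

It can be applied the same way as the original, with `reviews_df['rating'].apply(convert_rating_safe)`. A vectorized alternative that should give the same result is `pd.to_numeric(reviews_df['rating'].str.split('/').str[0])`.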