###### British Airways Virtual Internship

# Scraping Customer Reviews

In the virtual internship programme, my first task is to analyze customer feedback from a third-party source, Skytrax (`https://www.airlinequality.com/`), and draw insights from it.

To do this, I first need to scrape this website to collect the data.

British Airways have 3588 reviews on this website (as of 30 June 2023); the oldest one dates back to 2011.
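
The number of pages to request follows from the review total. A minimal sketch of that arithmetic, assuming the site returns at most 100 reviews per page (the `page_size` used in the request further down); the helper name `pages_needed` is illustrative and not part of the original notebook:

```python
import math

def pages_needed(total_reviews: int, page_size: int) -> int:
    """Number of result pages required to cover every review."""
    return math.ceil(total_reviews / page_size)

total_reviews = 3588  # review count quoted above (30 June 2023)
print(pages_needed(total_reviews, page_size=100))  # 36
```

Rounding up matters here: the last page is only partly filled, but it still has to be fetched.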

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re

In [73]:
ba_url = 'https://www.airlinequality.com/airline-reviews/british-airways'
page_size = 100
pages = 36

scores = []
reviews = []

for page in range(1, pages+1):
    page_scores = []
    page_reviews = []
    
    url = f'{ba_url}/page/{page}/?sortby=post_date%3ADesc&pagesize={page_size}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    for score in soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['rating-10']):
        try:
            page_scores.append(int(re.findall(r"10|[0-9]", score.find('span', attrs={'itemprop':'ratingValue'}).contents[0])[0]))
        except AttributeError:
            page_scores.append(score.text)
    scores += page_scores
    print(f"page #{page}: review scores scraped ({len(page_scores)})")
    
    for review in soup.find_all('div', class_='body'):
        record = {}  # one parsed review; named to avoid shadowing the built-in dict
        record['title'] = review.find('h2', class_='text_header').text
        record['country'] = review.find('h3', class_='text_sub_header userStatusWrapper').find('span').next_sibling.strip().replace('(', '').replace(')', '')
        record['time'] = review.find('time').text

        # Some review bodies start with a verification link (e.g. "Trip Verified").
        body = review.find('div', class_='text_content', attrs={'itemprop':'reviewBody'})
        verified = body.find('a', href='https://www.airlinequality.com/verified-reviews/')
        if verified is not None:
            record['verified_trip'] = verified.text
            record['review_text'] = body.find('strong').next_sibling.strip(' |').strip()
        else:
            record['verified_trip'] = ''
            record['review_text'] = body.text.strip(' |').strip()

        # Text-valued rows of the ratings table; missing rows become ''.
        ratings = review.find('table', class_='review-ratings')
        cat_list = ['type_of_traveller', 'cabin_flown', 'route', 'date_flown']
        for item in cat_list:
            header = ratings.find('td', class_=f'review-rating-header {item}')
            record[item] = header.next_sibling.text if header is not None else ''

        # Star-valued rows: the last filled star carries the rating.
        star_list = [
            'seat_comfort', 'cabin_staff_service', 'food_and_beverages',
            'ground_service', 'value_for_money', 'inflight_entertainment',
            'wifi_and_connectivity']
        for item in star_list:
            header = ratings.find('td', class_=f'review-rating-header {item}')
            if header is None:
                record[item] = ''  # category absent from this review
                continue
            filled = header.parent.find_all('span', class_='star fill')
            record[item] = filled[-1].text if filled else 0  # row present, no filled stars

        record['recommended'] = ratings.find('td', class_='review-rating-header recommended').next_sibling.text
        page_reviews.append(record)
    reviews += page_reviews
    print(f"page #{page}: finished parsing")
        

page #1: review scores scraped (100)
page #1: finished parsing
page #2: review scores scraped (100)
page #2: finished parsing
page #3: review scores scraped (100)
page #3: finished parsing
page #4: review scores scraped (100)
page #4: finished parsing
page #5: review scores scraped (100)
page #5: finished parsing
page #6: review scores scraped (100)
page #6: finished parsing
page #7: review scores scraped (100)
page #7: finished parsing
page #8: review scores scraped (100)
page #8: finished parsing
page #9: review scores scraped (100)
page #9: finished parsing
page #10: review scores scraped (100)
page #10: finished parsing
page #11: review scores scraped (100)
page #11: finished parsing
page #12: review scores scraped (100)
page #12: finished parsing
page #13: review scores scraped (100)
page #13: finished parsing
page #14: review scores scraped (100)
page #14: finished parsing
page #15: review scores scraped (100)
page #15: finished parsing
page #16: review scores scraped (100)
page 
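
The loop above issues one `requests.get` call per page with no timeout or retry, so a single network hiccup would stop the run partway. A minimal sketch of one way to guard against that, not part of the original notebook; the helper name `fetch_with_retries` and its parameters are hypothetical, and the `fetch` callable is left to the caller so the sketch does not depend on any particular HTTP library:

```python
import time

def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying on any exception for up to `retries` attempts.

    `fetch` is any callable that returns the page content or raises,
    e.g. lambda u: requests.get(u, timeout=10).content
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < retries:
                time.sleep(delay * attempt)  # wait a little longer each time
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

In the scraping loop, `response.content` could then be replaced by the return value of `fetch_with_retries(...)`, with the rest of the parsing unchanged.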

In [74]:
print(f'scraped {len(scores)} review scores and {len(reviews)} reviews')

scraped 3589 review scores and 3589 reviews


In [75]:
reviews_df = pd.DataFrame(reviews)
reviews_df['review_score'] = scores

In [76]:
reviews_df

,title,country,time,verified_trip,review_text,type_of_traveller,cabin_flown,route,date_flown,seat_comfort,cabin_staff_service,food_and_beverages,ground_service,value_for_money,inflight_entertainment,wifi_and_connectivity,recommended,review_score
0,"""Luggage are still in Glasgow""",United States,30th June 2023,Trip Verified,Came from Glasgow to London and took connectin...,Family Leisure,Economy Class,Glasgow to London,June 2023,1,1,,1,1,,,no,1
1,"""whole experience was terrible""",United Arab Emirates,29th June 2023,Trip Verified,My flight on on 12 May 2023 got delayed an hou...,Solo Leisure,Economy Class,Dubai to Keflavik via London,May 2023,1,1,,1,1,,,no,1
2,"""preferred to fly on easyJet""",United Kingdom,29th June 2023,Not Verified,Cairo is a 5 hour flight and BA considers it t...,Couple Leisure,Economy Class,Cairo to London,June 2023,1,1,1,2,1,1,1,no,2
3,"""stated it is not their fault""",United Kingdom,27th June 2023,Trip Verified,After travelling London to Madrid with British...,Solo Leisure,Economy Class,London to Madrid,May 2023,3,3,3,1,1,,,no,1
4,"""luggage was mis-tagged in Dallas""",United States,27th June 2023,Trip Verified,My luggage was mis-tagged in Dallas on my way ...,Family Leisure,Economy Class,London to Cairo,June 2023,1,1,,1,1,,,no,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3584,British Airways customer review,Canada,29th August 2012,,YYZ to LHR - July 2012 - I flew overnight in p...,,Premium Economy,,,4,3,3,,4,4,,yes,8
3585,British Airways customer review,United Kingdom,28th August 2012,,LHR to HAM. Purser addresses all club passenge...,,Business Class,,,4,5,4,,3,0,,yes,9
3586,British Airways customer review,United Kingdom,12th October 2011,,My son who had worked for British Airways urge...,,Economy Class,,,,,,,4,,,yes,5
3587,British Airways customer review,United States,11th October 2011,,London City-New York JFK via Shannon on A318 b...,,Premium Economy,,,1,3,5,,1,0,,no,4
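
The `time` column above holds strings such as `30th June 2023`, which are not directly sortable as dates. A minimal sketch of one way to parse them, assuming every value follows the day-with-ordinal, full English month name, year pattern seen in the sample rows and that the process runs under an English locale; `parse_review_date` is a hypothetical helper, not part of the original notebook:

```python
import re
from datetime import datetime

def parse_review_date(text: str) -> datetime:
    """Parse dates such as '30th June 2023' by dropping the ordinal suffix."""
    # Remove st/nd/rd/th only when it directly follows a digit,
    # so the "st" inside "August" is left alone.
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    return datetime.strptime(cleaned, "%d %B %Y")

print(parse_review_date("30th June 2023").date())    # 2023-06-30
print(parse_review_date("29th August 2012").date())  # 2012-08-29
```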


In [77]:
reviews_df.to_csv('data/reviews_raw.csv')