### Data Collection  

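In this phase, we gather customer reviews and ratings from the Skytrax airline quality website. The site hosts airline, seat, and lounge reviews; the code in this section collects the airline reviews for British Airways, keeping the review text, star rating, review date, and reviewer country.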

In [None]:
#imports
import pandas as pd
from bs4 import BeautifulSoup
import requests
import os

In [None]:
# Create empty lists to collect reviews, star ratings, dates, and countries.
reviews, stars, dates, countries = [], [], [], []

We collect airline review data from the Skytrax website by scraping multiple pages. The script extracts customer reviews, star ratings, review dates, and reviewer countries. This data will be used for further analysis of customer feedback.
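
The listing pages follow a fixed URL pattern, with the page number in the path and the sort order and page size in the query string. As a minimal sketch, the 40 page URLs could be built with the standard library as shown here; the helper name `build_page_url` and its default page size are illustrative and not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"

def build_page_url(page, page_size=100):
    """Return the listing URL for one page of reviews (hypothetical helper)."""
    # urlencode percent-encodes the colon in "post_date:Desc" as %3A
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{BASE_URL}/page/{page}/?{query}"

# Pages 1 to 40, matching the range used by the scraping loop
urls = [build_page_url(page) for page in range(1, 41)]
print(urls[0])
print(len(urls))
```

Building the query string from a dictionary keeps the parameters readable and avoids hand-encoding special characters.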

In [None]:
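for i in range(1, 41):
    # Request one page of up to 100 reviews, newest first
    url = f"https://www.airlinequality.com/airline-reviews/british-airways/page/{i}/?sortby=post_date%3ADesc&pagesize=100"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early instead of parsing an error page
    soup = BeautifulSoup(response.content, "html.parser")

    # Review body text
    for item in soup.find_all("div", class_="text_content"):
        reviews.append(item.text.strip())

    # Star rating from each "rating-10" block; some blocks have no value
    for item in soup.find_all("div", class_="rating-10"):
        star = item.span.text.strip() if item.span else "No rating"
        stars.append(star)

    # Review date
    for item in soup.find_all("time"):
        dates.append(item.text.strip())

    # Reviewer country: the text in parentheses after the first <span> of each <h3>
    for item in soup.find_all("h3"):
        has_country = item.span and item.span.next_sibling
        country = item.span.next_sibling.text.strip(" ()") if has_country else "Unknown"
        countries.append(country)

    print(f"✅ Page {i} processed. Collected {len(reviews)} reviews.")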

✅ Page 1 processed. Collected 100 reviews.
✅ Page 2 processed. Collected 200 reviews.
✅ Page 3 processed. Collected 300 reviews.
✅ Page 4 processed. Collected 400 reviews.
✅ Page 5 processed. Collected 500 reviews.
✅ Page 6 processed. Collected 600 reviews.
✅ Page 7 processed. Collected 700 reviews.
✅ Page 8 processed. Collected 800 reviews.
✅ Page 9 processed. Collected 900 reviews.
✅ Page 10 processed. Collected 1000 reviews.
✅ Page 11 processed. Collected 1100 reviews.
✅ Page 12 processed. Collected 1200 reviews.
✅ Page 13 processed. Collected 1300 reviews.
✅ Page 14 processed. Collected 1400 reviews.
✅ Page 15 processed. Collected 1500 reviews.
✅ Page 16 processed. Collected 1600 reviews.
✅ Page 17 processed. Collected 1700 reviews.
✅ Page 18 processed. Collected 1800 reviews.
✅ Page 19 processed. Collected 1900 reviews.
✅ Page 20 processed. Collected 2000 reviews.
✅ Page 21 processed. Collected 2100 reviews.
✅ Page 22 processed. Collected 2200 reviews.
✅ Page 23 processed. Collect
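
The scraping loop collects each field in a separate pass over the page, so a review with a missing element can shift one list against the others. One alternative, sketched here under assumptions, is to extract all four fields from the same review container so that each record stays together. The `<article>` wrapper and the inline `SAMPLE_HTML` are assumptions made for illustration; the real page markup may differ, and `parse_reviews` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Assumed markup: one <article> per review. Placeholder content, not scraped data.
SAMPLE_HTML = """
<article>
  <div class="rating-10"><span>5</span>/<span>10</span></div>
  <h3><span>Reviewer A</span> (United Kingdom) <time>19th March 2025</time></h3>
  <div class="text_content">Sample review one.</div>
</article>
<article>
  <h3><span>Reviewer B</span> (United States) <time>16th March 2025</time></h3>
  <div class="text_content">Sample review two, with no rating block.</div>
</article>
"""

def parse_reviews(html):
    """Return one dict per review container, filling gaps with defaults."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for article in soup.find_all("article"):
        body = article.find("div", class_="text_content")
        rating = article.find("div", class_="rating-10")
        date = article.find("time")
        header = article.find("h3")

        # The country is the text in parentheses right after the first <span>
        country = "Unknown"
        if header and header.span and header.span.next_sibling:
            country = str(header.span.next_sibling).strip(" ()")

        has_rating = rating is not None and rating.span is not None
        records.append({
            "Review": body.get_text(strip=True) if body else "",
            "Stars": rating.span.get_text(strip=True) if has_rating else "No rating",
            "Date": date.get_text(strip=True) if date else "",
            "Country": country,
        })
    return records

for record in parse_reviews(SAMPLE_HTML):
    print(record)
```

Because every record is built from a single container, a missing rating yields a placeholder in that row instead of shortening the `stars` list.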

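Each field is collected by a separate `find_all` pass, so the four lists are not guaranteed to have the same length. To keep the dataset consistent, we first compare the list lengths and then trim each list to the length of the shortest one, which lets us build a DataFrame from them. Trimming only equalizes the lengths; it cannot detect rows that have shifted relative to each other. In this run all four lists already hold 3924 items, so nothing is dropped.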

In [None]:
print(len(reviews), len(stars), len(dates), len(countries))

3924 3924 3924 3924


In [45]:
min_length = min(len(reviews), len(stars), len(dates), len(countries))
reviews, stars, dates, countries = reviews[:min_length], stars[:min_length], dates[:min_length], countries[:min_length]
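
Trimming discards items silently, so it can be useful to see how many entries each list loses. The following is a minimal sketch using toy data; `trim_to_shortest` is a hypothetical helper, not part of the original notebook.

```python
def trim_to_shortest(**columns):
    """Trim every list to the shortest length and report the items dropped."""
    min_length = min(len(values) for values in columns.values())
    dropped = {name: len(values) - min_length for name, values in columns.items()}
    trimmed = {name: values[:min_length] for name, values in columns.items()}
    return trimmed, dropped

# Toy data, not scraped values
trimmed, dropped = trim_to_shortest(
    reviews=["a", "b", "c"],
    stars=["5", "7"],
    dates=["d1", "d2", "d3"],
)
print(dropped)  # {'reviews': 1, 'stars': 0, 'dates': 1}
```

A non-zero count is a hint that the fields may be misaligned and that the extraction step deserves a closer look before analysis.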

In [None]:
# Create a DataFrame from the collected lists
df = pd.DataFrame({
    "Review": reviews,
    "Stars": stars,
    "Date": dates,
    "Country": countries
})

In [None]:
df.head()

| | Review | Stars | Date | Country |
|---|---|---|---|---|
| 0 | ✅ Trip Verified \| Flight mainly let down by ... | 5 | 19th March 2025 | United Kingdom |
| 1 | ✅ Trip Verified \| Another awful experience b... | 7 | 16th March 2025 | United States |
| 2 | ✅ Trip Verified \| The service was rude, full... | 1 | 16th March 2025 | United States |
| 3 | ✅ Trip Verified \| This flight was a joke. Th... | 3 | 16th March 2025 | United States |
| 4 | ✅ Trip Verified \| This time British Airways ... | 1 | 7th March 2025 | United Kingdom |
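
The `Date` column holds strings such as `19th March 2025`, with an ordinal suffix on the day. As a minimal sketch, such a string could be converted to a date with the standard library as shown here. The helper `parse_review_date` is hypothetical, and the code assumes an English locale for month names.

```python
import re
from datetime import datetime

def parse_review_date(text):
    """Convert a date such as '19th March 2025' to a datetime.date."""
    # Remove the ordinal suffix (st, nd, rd, th) that follows the day number
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    return datetime.strptime(cleaned, "%d %B %Y").date()

print(parse_review_date("19th March 2025"))  # 2025-03-19
print(parse_review_date("7th March 2025"))   # 2025-03-07
```

The lookbehind restricts the match to suffixes that directly follow a digit, so the "st" in a month name such as August is left alone.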


In [49]:
df.shape

(3924, 4)

#### Export the data to CSV format

In [50]:
df.to_csv("BA_reviews.csv")
print("Data successfully collected and saved to BA_reviews.csv!")

Data successfully collected and saved to BA_reviews.csv!
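
By default, `DataFrame.to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again with `pd.read_csv`. The sketch below illustrates the difference using an in-memory buffer and a toy DataFrame (`df_demo` is illustrative, not the scraped data). Whether to pass `index=False` depends on how later steps load the file.

```python
import io
import pandas as pd

# Toy frame standing in for the scraped data
df_demo = pd.DataFrame({"Review": ["sample text"], "Stars": ["5"]})

# Default: the row index is written as an unnamed first column
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)     # ['Unnamed: 0', 'Review', 'Stars']

# index=False: only the data columns are written
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['Review', 'Stars']
```

Reading the original file with `pd.read_csv("BA_reviews.csv", index_col=0)` is an equivalent way to avoid the extra column without changing the export.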
