# Web Scraping

In this step we scrape British Airways customer reviews from a website called [Skytrax](https://www.airlinequality.com/airline-reviews/british-airways). The reviews discuss aspects such as Food and Beverages, Inflight Entertainment, Seat Comfort, Staff Service and Value for Money. From each review we collect the review text, the star rating, the date and the reviewer's country. Once the data is collected, we save it to a CSV file for further analysis.

In [45]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [46]:
#create an empty list to collect reviews
reviews = []

#create an empty list to collect rating stars
stars = []

#create an empty list to collect date
date = []

#create an empty list to collect country the reviewer is from
country = []

In [47]:
# Loop over listing pages 1 to 37 (up to 100 reviews per page)
for i in range(1, 37 + 1):
    # Collect HTML data from this page
    page = requests.get(f"https://www.airlinequality.com/airline-reviews/british-airways/page/{i}/?sortby=post_date%3ADesc&pagesize=100")
    
    print(f"Scraping page {i}")
    
    # Parse content
    soup = BeautifulSoup(page.content, "html.parser")
    
    #reviews
    for item in soup.find_all("div", {"class": "text_content"}):
        reviews.append(item.text)
    
    #stars
    for item in soup.find_all("div", {"class": "rating-10"}):
        try:
            stars.append(item.span.text)
        except AttributeError:
            # item.span is None when the rating block has no <span>
            stars.append("None")
            
    #date
    for item in soup.find_all("time"):
        date.append(item.text)
    
    #country
    for item in soup.find_all("h3"):
        country.append(item.span.next_sibling.text.strip(" ()"))

    print(f"   ---> {len(reviews)} total reviews")
    


Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
Scraping page 21
   ---> 2100 total reviews
Scraping page 22
   ---> 2200 total reviews
Scraping page 23
   ---> 2300 total reviews
Scrapi
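
The loop above sends 37 requests back to back and never checks whether a request succeeded. A minimal sketch of a more defensive fetch step follows. It assumes a short pause between pages is acceptable; the helper names `page_url`, `fetch_pages` and `fetch_with_requests`, the 30 second timeout and the one second delay are all illustrative choices, not part of the original notebook.

```python
import time
from typing import Callable, Iterator, Tuple

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def page_url(page: int, page_size: int = 100) -> str:
    """Build the listing URL used in the loop above for a given page number."""
    return f"{BASE_URL}/page/{page}/?sortby=post_date%3ADesc&pagesize={page_size}"


def fetch_pages(
    fetch: Callable[[str], bytes],
    last_page: int,
    delay_seconds: float = 1.0,  # assumed pause between requests
) -> Iterator[Tuple[int, bytes]]:
    """Yield (page number, raw HTML) for pages 1..last_page, pausing in between."""
    for page in range(1, last_page + 1):
        yield page, fetch(page_url(page))
        if page < last_page:
            time.sleep(delay_seconds)


def fetch_with_requests(url: str) -> bytes:
    """Download one page, failing loudly on HTTP errors or a stalled connection."""
    import requests  # imported here so the helpers above work without it

    response = requests.get(url, timeout=30)  # assumed timeout in seconds
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.content
```

With these helpers, the loop header could become `for i, html in fetch_pages(fetch_with_requests, 37):`, followed by `soup = BeautifulSoup(html, "html.parser")` and the same extraction code as before.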

In [48]:
# Keep only the entries that contain no tab characters
stars = [x for x in stars if "\t" not in x]
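
The scraping loop fills `reviews`, `stars`, `date` and `country` independently, so a single missing or extra element on a page shifts one list relative to the others, which is presumably why `stars` needs filtering here. One way to keep the fields aligned is to parse each review container on its own. The sketch below runs on hypothetical markup: the `<article>` wrapper and the sample text are assumptions rather than the actual Skytrax page structure, and `parse_review` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: tag and class names mirror the selectors used in the
# scraping loop, but the <article> wrapper is an assumption about the layout.
SAMPLE_HTML = """
<article>
  <div class="rating-10"><span>1</span>/<span>10</span></div>
  <h3>By <span>A Reviewer</span> (United Kingdom) <time>1st September 2023</time></h3>
  <div class="text_content">Trip Verified | Example review text.</div>
</article>
<article>
  <h3>By <span>B Reviewer</span> (Germany) <time>30th August 2023</time></h3>
  <div class="text_content">Not Verified | A review without a rating.</div>
</article>
"""


def parse_review(article):
    """Extract one row from a single review container; missing fields become None."""
    rating = article.find("div", {"class": "rating-10"})
    header = article.find("h3")
    when = article.find("time")
    body = article.find("div", {"class": "text_content"})
    return {
        "reviews": body.get_text(strip=True) if body else None,
        "stars": rating.span.get_text(strip=True) if rating and rating.span else None,
        "date": when.get_text(strip=True) if when else None,
        # The country is the text node right after the first <span> in the header
        "country": str(header.span.next_sibling).strip(" ()")
        if header and header.span
        else None,
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [parse_review(article) for article in soup.find_all("article")]
```

Because every row is built from one container, a review without a rating yields `None` in `stars` instead of shortening that list, and `pd.DataFrame(rows)` would accept the result directly.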

## Check the length of the collected data

In [49]:
len(reviews)

3638

In [50]:
len(country)

3638

In [51]:
# Truncate stars to the number of reviews so every column has the same length
stars = stars[:len(reviews)]
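
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, so it can help to check all four lists at once rather than calling `len` on them one by one. The following is a small sketch of such a check; `check_equal_lengths` is a hypothetical helper, and the toy lists stand in for the scraped data.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists, or raise if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Toy stand-ins for the scraped lists
columns = {
    "reviews": ["first review", "second review"],
    "stars": ["1", "5"],
    "date": ["1st September 2023", "31st August 2023"],
    "country": ["United Kingdom", "United States"],
}
n_rows = check_equal_lengths(columns)
```

Note that equal lengths are a necessary condition only: truncating a list makes the lengths match but does not guarantee that each rating still belongs to the review in the same row.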

## Create dataframe from the collected data

In [52]:
df = pd.DataFrame({"reviews":reviews, "stars":stars, "date":date, "country":country})

In [53]:
df.head()

|   | reviews | stars | date | country |
|---|---------|-------|------|---------|
| 0 | ✅ Trip Verified \| British Airways absolutely ... | 1 | 1st September 2023 | United Kingdom |
| 1 | ✅ Trip Verified \| My recent experience with B... | 1 | 1st September 2023 | United States |
| 2 | ✅ Trip Verified \| This is to express our disp... | 1 | 31st August 2023 | United States |
| 3 | ✅ Trip Verified \| I flew London to Malaga on ... | 1 | 30th August 2023 | United Kingdom |
| 4 | ✅ Trip Verified \| I arrived at the airport ab... | 1 | 30th August 2023 | Germany |


In [54]:
df.shape

(3638, 4)

## Export the data to a CSV file

In [55]:
df.to_csv("data/BA_reviews.csv")
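
Two details of this export are worth keeping in mind: `to_csv` does not create the `data` directory if it is missing, and by default it also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back with `pd.read_csv`. A minimal sketch of a variant that handles both follows, using a toy DataFrame and a temporary directory as assumptions so that it runs anywhere.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the scraped DataFrame
df = pd.DataFrame(
    {
        "reviews": ["first review", "second review"],
        "stars": ["1", "5"],
        "date": ["1st September 2023", "31st August 2023"],
        "country": ["United Kingdom", "United States"],
    }
)

# A temporary folder replaces the notebook's "data" directory in this sketch
out_dir = os.path.join(tempfile.mkdtemp(), "data")
os.makedirs(out_dir, exist_ok=True)  # create the folder if it does not exist
out_path = os.path.join(out_dir, "BA_reviews.csv")

df.to_csv(out_path, index=False)  # leave out the row index

# Reading the file back returns the original four columns only
round_trip = pd.read_csv(out_path)
```

In the notebook itself the equivalent would be `os.makedirs("data", exist_ok=True)` followed by `df.to_csv("data/BA_reviews.csv", index=False)`.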