# Task 1

---

## Web scraping and analysis

This Jupyter notebook contains the web scraping code. It uses the `BeautifulSoup` package to collect review data from the web and saves it to a local `.csv` file.

### Scraping data from Skytrax

The reviews are listed at [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways). We can use `Python` and `BeautifulSoup` to page through this listing and collect the text, star rating, date, and reviewer country for each review.

In [117]:
#imports

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests 

In [120]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 38
page_size = 100

# Create empty lists to collect review text, star ratings, dates, and countries
reviews = []
stars = []
date = []
country = []
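
The `pages` and `page_size` settings decide how many listing URLs the scraping loop requests. As a minimal sketch, the page count can be derived from the number of reviews instead of being hard-coded, and the query string can be built with the standard library. The helper names `pages_needed` and `build_page_url` are hypothetical and not part of the notebook; the 3752 figure is the review count from the run recorded in this notebook.

```python
import math
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def pages_needed(total_reviews, page_size):
    """Number of listing pages required to cover total_reviews (hypothetical helper)."""
    return math.ceil(total_reviews / page_size)


def build_page_url(base_url, page, page_size):
    """Build the URL of one listing page, newest reviews first (hypothetical helper)."""
    # urlencode percent-encodes the colon, giving sortby=post_date%3ADesc
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"


print(pages_needed(3752, 100))
print(build_page_url(BASE_URL, 1, 100))
```

With 3752 reviews and 100 reviews per page this gives 38 pages, which matches the `pages` value set above.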

In [121]:
# Loop over every page of the paginated review listing
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; fail fast on HTTP errors or a stalled connection
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    soup = BeautifulSoup(content, 'html.parser')
    for item in soup.find_all("div", {"class": "text_content"}):
        reviews.append(item.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")
    
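    # Star rating: the first <span> inside each "rating-10" block
    for item in soup.find_all("div", class_="rating-10"):
        try:
            stars.append(item.span.text)
        except AttributeError:
            # The block has no <span>, so record a placeholder
            print(f"Error on page {i}")
            stars.append("None")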
    # Publication date of each review
    for item in soup.find_all("time"):
        date.append(item.text)

    # Reviewer country: the text in parentheses after the name <span> in each <h3>
    for item in soup.find_all("h3"):
        country.append(item.span.next_sibling.text.strip(" ()"))

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
Scraping page 21
   ---> 2100 total reviews
Scraping page 22
   ---> 2200 total reviews
Scraping page 23
   ---> 2300 total reviews
Scrapi

In [123]:
#check the length of total reviews extracted
len(reviews)

3752

In [124]:
len(country)

3752

In [125]:
# Trim stars to the number of reviews so that every column has the same length
stars = stars[:len(reviews)]
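
Trimming from the end keeps each star in the right row only if the surplus entries sit at the end of the list. If a page also contains a `rating-10` block that does not belong to any review (an assumption about the markup, not verified here), the ratings would shift against the reviews. A minimal sketch of an alternative is to parse each review container as a unit, so the fields of one review always stay together. The sample HTML, the `article` container with `itemprop="review"`, and the function name `parse_reviews` are all illustrative assumptions, not the site's confirmed structure.

```python
from bs4 import BeautifulSoup

# Illustrative markup only: the real page structure is assumed, not verified
SAMPLE_HTML = """
<div class="rating-10"><span>4</span>/10</div>
<article itemprop="review">
  <div class="rating-10"><span>5</span>/10</div>
  <h3><span>A Traveller</span> (United States) <time>17th February 2024</time></h3>
  <div class="text_content">Not Verified | Sample review text.</div>
</article>
<article itemprop="review">
  <h3><span>B Traveller</span> (United Kingdom) <time>16th February 2024</time></h3>
  <div class="text_content">Trip Verified | Another sample.</div>
</article>
"""


def parse_reviews(html):
    """Return one dict per review container (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.find_all("article", itemprop="review"):
        body = article.find("div", class_="text_content")
        rating = article.find("div", class_="rating-10")
        time_tag = article.find("time")
        header = article.find("h3")

        # Country is the text in parentheses right after the name <span>
        country = None
        if header is not None and header.span is not None:
            sibling = header.span.next_sibling
            if sibling is not None:
                country = str(sibling).strip(" ()") or None

        rows.append({
            "reviews": body.get_text(strip=True) if body else None,
            # A missing rating becomes None instead of shifting later rows
            "stars": rating.span.get_text(strip=True) if rating and rating.span else None,
            "date": time_tag.get_text(strip=True) if time_tag else None,
            "country": country,
        })
    return rows


rows = parse_reviews(SAMPLE_HTML)
print(len(rows))
print(rows[0])
```

A list of dicts like this can be passed directly to `pd.DataFrame(rows)`, which removes the need to truncate any column.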

In [126]:
df = pd.DataFrame({"reviews":reviews,"stars": stars, "date":date, "country": country})

In [127]:
df.head()

| | reviews | stars | date | country |
|---|---|---|---|---|
| 0 | Not Verified \| We have flown BA five times fr... | `\n\t\t\t\t\t\t\t\t\t\t\t\t\t5` | 17th February 2024 | United States |
| 1 | ✅ Trip Verified \| London Heathrow to Istanbul... | 3 | 17th February 2024 | United Kingdom |
| 2 | Not Verified \| Jan 30th, I booked a last-minut... | 3 | 16th February 2024 | United States |
| 3 | ✅ Trip Verified \| I am a British Airways Gold ... | 2 | 11th February 2024 | United States |
| 4 | Not Verified \| Another case of reviewing Brit... | 5 | 8th February 2024 | United Kingdom |


In [128]:
df.shape

(3752, 4)

### Export the data into a csv format

In [130]:
import os

# Save the DataFrame as BA_reviews.csv in the current working directory
cwd = os.getcwd()
df.to_csv(os.path.join(cwd, "BA_reviews.csv"), index=False)

Now we have the necessary data. The next step is data cleaning.
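
As a rough sketch of what that cleaning might involve, based only on the values visible in `df.head()`: the `stars` column can carry stray whitespace, the `date` column uses ordinal day suffixes, and each review starts with a verification label before a `|`. The three helper functions below are hypothetical names for illustration and use only the standard library. Parsing the month name with `%B` assumes an English locale.

```python
import re
from datetime import datetime


def clean_stars(raw):
    """Strip whitespace and convert a rating to int, or None if it is not numeric."""
    text = str(raw).strip()
    return int(text) if text.isdigit() else None


def parse_review_date(raw):
    """Parse a date such as '17th February 2024' into a datetime.date."""
    # Remove the ordinal suffix (st, nd, rd, th) that follows the day number
    text = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", raw.strip())
    return datetime.strptime(text, "%d %B %Y").date()


def split_verification(raw):
    """Split 'Trip Verified | text' into (status, text); status is None if absent."""
    status, sep, body = raw.partition("|")
    if not sep:
        return None, raw.strip()
    return status.replace("✅", "").strip(), body.strip()


print(clean_stars("\n\t\t\t5"))
print(parse_review_date("17th February 2024"))
print(split_verification("✅ Trip Verified | London Heathrow to Istanbul"))
```

These functions work on single values, so they could be applied column by column, for example `df["stars"].map(clean_stars)` or `df["date"].map(parse_review_date)`.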