# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you have collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit <https://www.airlinequality.com> you can see that there is a lot of data there. For this task, we are only interested in reviews of British Airways and in data about the airline itself.

If you navigate to <https://www.airlinequality.com/airline-reviews/british-airways> you will see this data. We can use Python and `BeautifulSoup` to page through the review listings and collect the text, star rating, date and reviewer country for each review.

In [1]:
# Import the 'requests' library, which is used for making HTTP requests to retrieve web page content
import requests

# Import the 'BeautifulSoup' class from the 'bs4' (Beautiful Soup) library,
# used for parsing HTML and navigating the parse tree
from bs4 import BeautifulSoup

# Import the 'pandas' library for data manipulation and analysis using DataFrame structures
import pandas as pd

# Import the 'numpy' library for numerical operations and array manipulation
import numpy as np

In [2]:
# Create an empty list to collect all reviews
reviews = []

# Create an empty list to collect rating stars
stars = []

# Create an empty list to collect dates
date = []

# Create an empty list to collect the country the reviewer is from
country = []

In [3]:
# Loop through page numbers from 1 to 35 (the URL requests 100 reviews per page)
for i in range(1, 36):
    # Construct the URL for each page
    url = f"https://www.airlinequality.com/airline-reviews/british-airways/page/{i}/?sortby=post_date%3ADesc&pagesize=100"

    # Make an HTTP request to the page; the timeout keeps a stalled connection from hanging the loop
    page = requests.get(url, timeout=30)

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(page.content, "html.parser")

    # Extract text content from all elements with class "text_content" and append to 'reviews' list
    for item in soup.find_all("div", class_="text_content"):
        reviews.append(item.text)

    # Extract star ratings from elements with class "rating-10" and append to 'stars' list.
    # Note: this selector matches more elements than there are reviews (3535 vs 3500 in the
    # counts printed below), so 'stars' is not guaranteed to line up with 'reviews'.
    for item in soup.find_all("div", class_="rating-10"):
        try:
            stars.append(item.span.text)
        except AttributeError:
            # item.span is None when the element has no <span> child; record a placeholder
            print(f"Error on page {i}")
            stars.append("None")

    # Extract date information from 'time' elements and append to 'date' list
    for item in soup.find_all("time"):
        date.append(item.text)

    # Extract country information from 'h3' elements and append to 'country' list
    for item in soup.find_all("h3"):
        country.append(item.span.next_sibling.text.strip(" ()"))

Error on page 32
Error on page 33
Error on page 34
Error on page 34
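
The page URL used in the loop above has its query string written out by hand, including the percent-encoded `:` in `post_date%3ADesc`. As a minimal sketch, the same URL can be assembled with the standard library so the encoding is handled automatically. The names `BASE_URL` and `page_url` are hypothetical helpers introduced here for illustration; they are not part of the original notebook.

```python
from urllib.parse import urlencode

# Hypothetical constant: the listing URL taken from the scraping loop
BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def page_url(page, page_size=100):
    """Return the listing URL for one page, newest reviews first."""
    # urlencode percent-encodes the ':' in 'post_date:Desc' as '%3A'
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{BASE_URL}/page/{page}/?{query}"


# Build the URLs for pages 1 to 35, matching the range used in the loop
urls = [page_url(i) for i in range(1, 36)]
```

Keeping the page size as a parameter makes it easier to see how the page range and the expected number of reviews relate (35 pages of 100 reviews each).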


In [4]:
# Print the number of reviews in the 'reviews' list
print(f"Number of reviews: {len(reviews)}")

# Print the number of star ratings in the 'stars' list
print(f"Number of stars: {len(stars)}")

# Print the number of dates in the 'date' list
print(f"Number of dates: {len(date)}")

# Print the number of countries in the 'country' list
print(f"Number of countries: {len(country)}")

Number of reviews: 3500
Number of stars: 3535
Number of dates: 3500
Number of countries: 3500
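
The counts above show 3535 star values against 3500 reviews, dates and countries. When columns are gathered into separate lists like this, a length mismatch is a sign that the values may no longer correspond row by row. A minimal sketch of a guard that makes the mismatch explicit before a table is built is shown here; `check_equal_lengths` is a hypothetical helper, not part of the original notebook.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists in `columns`.

    Raises ValueError, listing each column's length, if they differ.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # All lengths are equal here; an empty mapping has length 0
    return next(iter(lengths.values()), 0)


# Small stand-in lists; with the notebook's data this would be
# {"Reviews": reviews, "Stars": stars, "Date": date, "Country": country}
n_rows = check_equal_lengths({"Reviews": ["a", "b"], "Stars": ["1", "2"]})
```

Failing loudly at this point is one design choice; it forces the selector that over-matches to be fixed rather than hiding the difference by slicing.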


In [5]:
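# Create a dictionary with column names as keys and the collected lists as values.
# Each list is cut to 3500 entries because 'stars' holds 3535 values; slicing makes the
# lengths equal but does not guarantee that each star rating belongs to the review in its row.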
df = {
    'Reviews': reviews[:3500],
    'Stars': stars[:3500],
    'Date': date[:3500],
    'Country': country[:3500],
}

In [6]:
# Create a DataFrame using the dictionary 'df'
df = pd.DataFrame(df)

In [7]:
# Display the first 10 rows of the DataFrame
df.head(10)

| | Reviews | Stars | Date | Country |
|---|---|---|---|---|
| 0 | Not Verified \| I did not actually get to fly w... | `\n\t\t\t\t\t\t\t\t\t\t\t\t\t5` | 5th February 2024 | United Kingdom |
| 1 | ✅ Trip Verified \| We had possibly the worse ch... | 1 | 2nd February 2024 | United Kingdom |
| 2 | ✅ Trip Verified \| I flew to LHR from ATH in C... | 6 | 30th January 2024 | Japan |
| 3 | ✅ Trip Verified \| I like the British Airways ... | 9 | 29th January 2024 | United Kingdom |
| 4 | ✅ Trip Verified \| I have come to boarding and... | 8 | 28th January 2024 | Ukraine |
| 5 | ✅ Trip Verified \| Stinking nappies being chang... | 3 | 26th January 2024 | United Kingdom |
| 6 | ✅ Trip Verified \| Worst service ever. Lost bag... | 2 | 23rd January 2024 | Germany |
| 7 | ✅ Trip Verified \| BA 246 21JAN 2023 Did not a... | 1 | 21st January 2024 | United Kingdom |
| 8 | ✅ Trip Verified \| Not a great experience. I co... | 6 | 18th January 2024 | United Kingdom |
| 9 | Not Verified \| I was excited to fly BA as I'd ... | 3 | 18th January 2024 | United Kingdom |
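
The rows above show that the scraped values are still raw strings: one star value carries leading whitespace, dates use ordinal suffixes such as `5th`, and each review starts with a verification label followed by `|`. Before analysis, these could be normalised along the following lines. This is a sketch under assumptions: the helper names are hypothetical, the date format is inferred from the ten rows shown, and only the two labels visible in the table (`Trip Verified`, `Not Verified`) are recognised.

```python
import re
from datetime import datetime


def clean_star(raw):
    """Convert a scraped star string to int, or None if it is not a number."""
    text = raw.strip()
    return int(text) if text.isdigit() else None


def parse_review_date(raw):
    """Parse a date such as '5th February 2024' into a datetime.date."""
    # Drop the ordinal suffix after the day number ('5th' -> '5')
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", raw.strip())
    # %B expects an English month name (assumes the default C/English locale)
    return datetime.strptime(cleaned, "%d %B %Y").date()


def split_verification(review):
    """Split a review into (is_verified, body).

    is_verified is None when no recognised label precedes the first '|'.
    """
    status, sep, body = review.partition("|")
    label = status.replace("✅", "").strip()
    if sep and label in ("Trip Verified", "Not Verified"):
        return label == "Trip Verified", body.strip()
    return None, review.strip()
```

With a DataFrame like the one above, each helper could be applied to its column, for example `df["Stars"].map(clean_star)`.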


In [8]:
# Display the shape of the dataframe
df.shape

(3500, 4)

In [9]:
# Saving the dataframe into a CSV file
df.to_csv('reviews_data.csv', index=False)