# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code for web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once we've collected the data and saved it into a local `.csv` file we will carry out the analysis.

### Scraping data from Skytrax

We will use the website [https://www.airlinequality.com] for our data. For this task, we are only interested in reviews related to British Airways and the airline itself.

To see the data, use this link: [https://www.airlinequality.com/airline-reviews/british-airways]. We can now use `Python` and `BeautifulSoup` to request each paginated listing page and collect the review text it contains.
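
As a minimal sketch of the extraction step, the snippet below parses a small, made-up HTML fragment instead of a live page. The fragment `SAMPLE_HTML` and the helper name `extract_reviews` are assumptions for illustration only; the `div` with class `text_content` is the same selector used in the scraping cell of this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded review page.
SAMPLE_HTML = """
<article>
  <div class="text_content">✅ Trip Verified | Great crew and a smooth flight.</div>
</article>
<article>
  <div class="text_content">Not Verified | Boarding was slow and disorganised.</div>
</article>
"""


def extract_reviews(html):
    """Return the stripped text of every <div class="text_content"> element."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True) for div in soup.find_all("div", class_="text_content")]


sample_reviews = extract_reviews(SAMPLE_HTML)
print(len(sample_reviews), "reviews found")
```

Testing the selector on a static fragment like this makes it easier to tell a parsing problem from a network problem before running the full scrape.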

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

import re  # regular expressions, used later for text cleaning



In [2]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; stop early on a hung connection or an HTTP error
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
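
The pagination logic can be checked without touching the network. The sketch below rebuilds the same URL pattern with `urllib.parse.urlencode`; the helper name `build_page_url` is hypothetical, and the query parameters are the ones already used in the loop above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page, page_size=100, base_url=BASE_URL):
    """Build the URL of one listing page, sorted by post date (newest first)."""
    # urlencode percent-encodes the ':' in the sort key as %3A.
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"


page_urls = [build_page_url(i) for i in range(1, 4)]
for url in page_urls:
    print(url)
```

Keeping URL construction in a small function like this makes it simple to change the page size or sort order in one place.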


In [3]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

| | reviews |
|---|---|
| 0 | ✅ Trip Verified \| Quick bag drop at First Win... |
| 1 | ✅ Trip Verified \| 4 Hours before takeoff we r... |
| 2 | ✅ Trip Verified \| I recently had a delay on B... |
| 3 | Not Verified \| Boarded on time, but it took a... |
| 4 | ✅ Trip Verified \| 5 days before the flight, w... |


In [4]:
import os

# Create a directory named 'data' if it doesn't exist
directory = 'data'
os.makedirs(directory, exist_ok=True)

# Save the DataFrame to a CSV file in the 'data' directory
df.to_csv(os.path.join(directory, "BA_reviews2.csv"), index=False)


Now we have our dataset for this task! The loop above collected 1000 reviews by iterating through the paginated pages on the website.
The next step is to clean this data by removing unnecessary text from each row. For example, "✅ Trip Verified" can be removed from each row where it appears, as it's not relevant to what we want to investigate.
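
To illustrate the idea on plain strings first, here is a minimal sketch that drops the verification prefix. The helper name `strip_status_prefix` is hypothetical, and it assumes the prefix is always separated from the review body by the first `|` character, as in the sample rows above.

```python
def strip_status_prefix(review):
    """Drop everything up to and including the first '|' separator, if present."""
    head, sep, tail = review.partition("|")
    # If there is no separator, keep the whole review rather than discarding it.
    return tail.strip() if sep else review.strip()


print(strip_status_prefix("✅ Trip Verified | Quick bag drop at First Wing"))
print(strip_status_prefix("Not Verified | Boarded on time"))
print(strip_status_prefix("A review with no prefix at all"))
```

Falling back to the full text when no separator is found avoids silently losing reviews that were posted without a verification label.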

In [5]:
# 1. Text preprocessing
# Keep only the text after the first "|" separator, which drops the "Trip Verified" / "Not Verified" prefix.
# Reviews without a separator are kept unchanged.
df["reviews"] = df["reviews"].str.split("|", n=1).str[-1]
df.head()


# re.sub() belongs to Python's regular expressions (re) module.
# It returns a string in which every match of the pattern is replaced by the given string.
def replace(text):
    """Replace every run of non-letter characters (digits, punctuation, symbols) with a single space."""
    text = re.sub(r'[^A-Za-z]+', ' ', str(text))
    return text


# Clean the text in the reviews column
df['reviews'] = df["reviews"].apply(replace)
df.head()

| | reviews |
|---|---|
| 0 | Quick bag drop at First Wing but too many pas... |
| 1 | Hours before takeoff we received a Mail stati... |
| 2 | I recently had a delay on British Airways fro... |
| 3 | Boarded on time but it took ages to get to th... |
| 4 | days before the flight we were advised by BA ... |
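
The effect of the pattern `[^A-Za-z]+` is easiest to see on a single string. The sketch below uses a made-up sample sentence and a hypothetical helper `keep_letters`; the trailing `.strip()` is an addition that is not part of the cell above.

```python
import re


def keep_letters(text):
    """Replace each run of non-letter characters with one space, then trim the ends."""
    return re.sub(r"[^A-Za-z]+", " ", str(text)).strip()


sample = "4 Hours before takeoff, we received a Mail!"
cleaned = keep_letters(sample)
print(cleaned)  # Hours before takeoff we received a Mail
```

Note that digits are removed along with punctuation, which is why row 1 in the output above begins with "Hours" rather than "4 Hours". If numbers matter for the analysis, a pattern such as `[^A-Za-z0-9]+` might be a better fit.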


In [6]:
df['reviews'] = df['reviews'].str.lower()

df.head()

| | reviews |
|---|---|
| 0 | quick bag drop at first wing but too many pas... |
| 1 | hours before takeoff we received a mail stati... |
| 2 | i recently had a delay on british airways fro... |
| 3 | boarded on time but it took ages to get to th... |
| 4 | days before the flight we were advised by ba ... |


1.4. Removing punctuation

In [7]:

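# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df['reviews'] = df['reviews'].str.replace(r'[^\w\s]', '', regex=True)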
df.head()

| | reviews |
|---|---|
| 0 | quick bag drop at first wing but too many pas... |
| 1 | hours before takeoff we received a mail stati... |
| 2 | i recently had a delay on british airways fro... |
| 3 | boarded on time but it took ages to get to th... |
| 4 | days before the flight we were advised by ba ... |
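
For reference, the pattern `[^\w\s]` matches any character that is neither a word character (letters, digits, underscore) nor whitespace. The sketch below applies it with the standard `re` module to made-up strings; `remove_punctuation` is a hypothetical helper name.

```python
import re


def remove_punctuation(text):
    """Delete every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


print(remove_punctuation("don't stop, ba!"))  # dont stop ba
print(remove_punctuation("quick bag drop"))   # quick bag drop
```

Because the earlier cleaning step already replaced every non-letter character with a space, this pass is expected to leave the reviews unchanged, which matches the identical output above.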
