# Task 1

# Web scraping and analysis
This Jupyter notebook includes some code to get you started with web scraping. We will use a package called BeautifulSoup to collect the data from the web. Once you've collected your data and saved it to a local .csv file, you can start your analysis.

## Scraping data from Skytrax
If you visit [https://www.airlinequality.com](https://www.airlinequality.com) you can see that there is a lot of data there. For this task, we are only interested in reviews of British Airways itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways) you will see this data. We can use Python and BeautifulSoup to page through the review listings and collect the title, reviewer details, and text of each review.

In [2]:
pip install wordcloud

Collecting wordcloud
  Downloading wordcloud-1.9.2-cp39-cp39-win_amd64.whl (153 kB)
     -------------------------------------- 153.3/153.3 kB 1.3 MB/s eta 0:00:00
Installing collected packages: wordcloud
Successfully installed wordcloud-1.9.2
Note: you may need to restart the kernel to use updated packages.


In [8]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

from wordcloud import WordCloud 
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
url = "https://www.airlinequality.com/airline-reviews/british-airways"

# Initialize an empty list to store the data that you scrape
data = []

# Setting the initial page number and the increment that you want to use to paginate through the webpage
page_num = 1
page_incr = 1
page_size = 100
# maximum number of pages to be scraped
max_pages = 20

# Set the URL of the webpage to be scraped 
paginated_url = f"{url}/page/{page_num}/?sortby=post_date%3ADesc&pagesize={page_size}"

# A while loop to paginate through the webpage and scrape the data
while page_num <= max_pages:

    print(f"Scraping page {page_num}")

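    # A GET request to the paginated URL; the timeout stops a stalled
    # connection from hanging the loop, and raise_for_status() stops on
    # HTTP errors instead of silently parsing an error page
    response = requests.get(paginated_url, timeout=30)
    response.raise_for_status()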

    # Parsing the response using BeautifulSoup
    parsed_content = BeautifulSoup(response.text, "html.parser")

    # Finding all the elements on the page that contain the data to be scraped
    elements = parsed_content.find_all("div",class_ = "body")

    # Looping through the elements and extracting the data that you want to scrape
    for element in elements:
        header = element.find("h2",class_ = "text_header").text.replace("\n", " ")
        sub_header = element.find("h3",class_ = "text_sub_header").text.replace("\n", " ")
        content = element.find("div",class_ = "text_content").text.replace("\n", " ")
        
        data.append([header,sub_header,content])

    # Increasing the page number and setting the paginated URL to the new page
    page_num += page_incr
    paginated_url = f"{url}/page/{page_num}/?sortby=post_date%3ADesc&pagesize={page_size}"

    print(f"   ---> {len(data)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews


The loop above collected 2000 reviews by iterating through 20 paginated pages of 100 reviews each.
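
The paginated URLs requested by the loop can also be generated up front, which makes it easy to check them before sending any requests. The sketch below is one way to do that using only the standard library; `build_page_url` and `BASE_URL` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(base_url: str, page_num: int, page_size: int = 100) -> str:
    """Return the listing URL for one page of reviews (hypothetical helper)."""
    # urlencode percent-encodes the colon in "post_date:Desc" as %3A,
    # which matches the query string written by hand in the loop above.
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page_num}/?{query}"


# URLs for the same 20 pages that the loop requests
page_urls = [build_page_url(BASE_URL, n) for n in range(1, 21)]
print(page_urls[0])
```

Building the query string with `urlencode` avoids hand-writing escape sequences such as `%3A`, at the cost of one extra helper.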

The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row if it exists, as it's not relevant to what we want to investigate.
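
As a minimal sketch of this cleaning step, the snippet below strips a verification marker from the start of a review using a single regular expression. It assumes the only markers present are the three that appear in this notebook's output ("✅ Trip Verified |", "✅ Verified Review |" and "Not Verified |"). Whether "Not Verified" should also be dropped is a judgment call, and `strip_verification_tag` is a hypothetical helper name.

```python
import re

# Assumption: only these three markers occur at the start of a review.
VERIFICATION_TAG = re.compile(
    r"^\s*(?:✅\s*)?(?:Trip Verified|Verified Review|Not Verified)\s*\|\s*"
)


def strip_verification_tag(text: str) -> str:
    """Remove a leading verification marker, if one is present."""
    return VERIFICATION_TAG.sub("", text, count=1)


samples = [
    "✅ Trip Verified | Online check in worked fine.",
    "Not Verified | Easy check in on the way to Heathrow.",
    "✅ Verified Review | The British Airways experience has declined.",
    "Flew Malta to London. First the plus points.",
]

cleaned = [strip_verification_tag(s) for s in samples]
for line in cleaned:
    print(line)
```

On a DataFrame, the same function could be applied to one column with `df["CONTENT"].map(strip_verification_tag)`, which handles all three markers in one pass.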

In [9]:
# Converting the list data into a dataframe
df = pd.DataFrame(data)
df.columns = ["REVIEW","PERSONAL INFO","CONTENT"]

#Removing unwanted text(first text preprocessing)
df.replace(re.compile(r'\s*✅ Trip Verified \|\s*'), '', inplace=True)
df

Unnamed: 0,REVIEW,PERSONAL INFO,CONTENT
0,"""flights changed with no cost""",William Jackson (Spain) 23rd May 2023,Not Verified | Easy check in on the way to He...
1,"""Cheap, quick and efficient""",A Warten (Chile) 23rd May 2023,Online check in worked fine. Quick security ch...
2,"""the worst major European airline""",E Michaels (United Kingdom) 22nd May 2023,. The BA first lounge at Terminal 5 was a zoo...
3,"""do not think the fare was worth the money""",Steve Bennett (United Kingdom) 22nd May 2023,Not Verified | Paid a quick visit to Nice yest...
4,"""BA is on the skids downhill""",N Mayle (United States) 19th May 2023,Words fail to describe this last awful flight ...
...,...,...,...
1995,"""experience has really declined""",G Mantimo (Canada) 23rd August 2016,✅ Verified Review | The British Airways exper...
1996,"""BA has declined significantly""",Richard Brown (New Zealand) 22nd August 2016,Flew Malta to London. First the plus points. G...
1997,"""First Class is a total wate of money""",Bill Atkins (United Kingdom) 21st August 2016,Philadelphia to London Heathrow with British A...
1998,"""every time I complain about the breakfast""",H Lowe (United Kingdom) 19th August 2016,Upgraded on the outbound flight from London to...


In [10]:
df.to_csv(r"C:\Users\krish\OneDrive\Desktop\ba\BA_reviews.csv")

In [11]:
sentiment_analysis_df = df.drop(["REVIEW","PERSONAL INFO"], axis=1)
sentiment_analysis_df.replace(re.compile(r'\s*✅ Verified Review \|\s*'), '', inplace=True)
sentiment_analysis_df

Unnamed: 0,CONTENT
0,Not Verified | Easy check in on the way to He...
1,Online check in worked fine. Quick security ch...
2,. The BA first lounge at Terminal 5 was a zoo...
3,Not Verified | Paid a quick visit to Nice yest...
4,Words fail to describe this last awful flight ...
...,...
1995,The British Airways experience has really decl...
1996,Flew Malta to London. First the plus points. G...
1997,Philadelphia to London Heathrow with British A...
1998,Upgraded on the outbound flight from London to...


In [12]:
sentiment_analysis_df.to_csv("sentiment_content.csv", index=False)