# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit https://www.airlinequality.com you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to https://www.airlinequality.com/airline-reviews/british-airways you will see this data. We can now use `Python` and `BeautifulSoup` to page through the review listing and collect the text of each review.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [3]:
# Scrape British Airways review data from Skytrax
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 20 # Number of pages to scrape
page_size = 100 # Number of reviews per page

reviews = []

# This loop iterates through pages 1 to 20
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Fetch the page with requests; fail fast on timeouts and HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse the HTML with BeautifulSoup and collect the text of each review
    parsed_content = BeautifulSoup(response.content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
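
To check the selector logic without sending requests to the site, the parsing step can be tried on a small hand-written fragment. This is a minimal sketch: `SAMPLE_HTML` and `extract_review_texts` are hypothetical names, and the fragment is an assumption modelled on the `text_content` class used above, not markup copied from Skytrax.

```python
from bs4 import BeautifulSoup

# Hand-written fragment (an assumption, not real Skytrax markup)
SAMPLE_HTML = """
<article>
  <div class="text_content">✅ Trip Verified | Smooth flight.</div>
</article>
<article>
  <div class="text_content">Not Verified | Delayed by two hours.</div>
</article>
<div class="other">Not a review</div>
"""

def extract_review_texts(html):
    """Return the text of every <div class="text_content"> in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True)
            for div in soup.find_all("div", class_="text_content")]

texts = extract_review_texts(SAMPLE_HTML)
print(texts)
```

Only the two `text_content` blocks are returned; the unrelated `div` is ignored, which is the behaviour the scraping loop relies on.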


In [4]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | After an excellent flight ...
1,✅ Trip Verified | On a recent flight from Cy...
2,✅ Trip Verified | Flight BA 0560 arrived in ...
3,✅ Trip Verified | This was the first time I ...
4,✅ Trip Verified | Pretty good flight but sti...


In [5]:
df.to_csv("BA_reviews.csv")
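
By default `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below illustrates the difference on a two-row stand-in frame, using in-memory buffers instead of a file; the sample rows and variable names are assumptions made for illustration.

```python
import io
import pandas as pd

# Small stand-in for the scraped reviews (illustrative data only)
sample = pd.DataFrame({"reviews": ["First review", "Second review"]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

If the saved file is meant to be reloaded later, passing `index=False` when saving (or `index_col=0` when reading) keeps the columns clean.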

## Cleaning the data

In [22]:
df.describe()

Unnamed: 0,reviews
count,2000
unique,2000
top,"❎ Unverified | Flew Gatwick to San Jose, Cost..."
freq,1


In [6]:
# Clean the reviews column: drop the verification prefix (e.g. "✅ Trip Verified |")
df['cleaned_reviews'] = df['reviews'].apply(lambda x: x.split('|', 1)[-1].strip())
df.head()

Unnamed: 0,reviews,cleaned_reviews
0,✅ Trip Verified | After an excellent flight ...,After an excellent flight on a 777 CPT to LHR ...
1,✅ Trip Verified | On a recent flight from Cy...,On a recent flight from Cyprus BA621 on 23/11/...
2,✅ Trip Verified | Flight BA 0560 arrived in ...,Flight BA 0560 arrived in Rome on 11 December ...
3,✅ Trip Verified | This was the first time I ...,This was the first time I flew British Airways...
4,✅ Trip Verified | Pretty good flight but sti...,Pretty good flight but still some small things...
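
Splitting on `|` works as long as the pipe only appears after the verification marker. A slightly more defensive variant is sketched below; `strip_verification_prefix` is a hypothetical helper, and the assumption that every marker contains the word "verified" is based only on the samples shown in this notebook.

```python
def strip_verification_prefix(text):
    """Drop a leading '... Verified |' marker, leaving the review body intact."""
    head, sep, tail = text.partition("|")
    # Only treat the first segment as a marker if it mentions verification
    if sep and "verified" in head.lower():
        return tail.strip()
    return text.strip()

examples = [
    "✅ Trip Verified | Good flight.",
    "❎ Unverified | LHR | JFK was late.",
    "No marker here.",
]
cleaned = [strip_verification_prefix(e) for e in examples]
print(cleaned)
```

It can be applied in the same way as the lambda, for example `df['reviews'].apply(strip_verification_prefix)`.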


In [7]:
df_cleaned_reviews = df.drop('reviews', axis=1)
df_cleaned_reviews

Unnamed: 0,cleaned_reviews
0,After an excellent flight on a 777 CPT to LHR ...
1,On a recent flight from Cyprus BA621 on 23/11/...
2,Flight BA 0560 arrived in Rome on 11 December ...
3,This was the first time I flew British Airways...
4,Pretty good flight but still some small things...
...,...
1995,Overnight flight from St Lucia to Gatwick. Eff...
1996,Cape Town to London Heathrow in an old but wel...
1997,I am due to fly from Tehran to Vancouver via L...
1998,Malta to Gatwick. My friend arrived at the che...


In [8]:
# Required libraries
import nltk
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# Download required NLTK data
nltk.download('punkt')  # for sentence tokenization
nltk.download('stopwords')  # for stopword removal
nltk.download('wordnet')  # for lemmatization
nltk.download('omw-1.4')  # dependency of wordnet

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\xuziw\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\xuziw\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\xuziw\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\xuziw\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [9]:
# Build these once rather than on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and other non-word, non-space characters
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenization
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [token for token in tokens if token not in STOP_WORDS]
    # Lemmatization
    tokens = [LEMMATIZER.lemmatize(token) for token in tokens]
    return ' '.join(tokens)

# Preprocess text
df_cleaned_reviews['processed_text'] = df_cleaned_reviews['cleaned_reviews'].apply(preprocess_text)


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - 'C:\\Users\\xuziw/nltk_data'
    - 'd:\\software\\envs\\BusinessStatistics\\nltk_data'
    - 'd:\\software\\envs\\BusinessStatistics\\share\\nltk_data'
    - 'd:\\software\\envs\\BusinessStatistics\\lib\\nltk_data'
    - 'C:\\Users\\xuziw\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************
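
The traceback shows that `word_tokenize` looks for the `punkt_tab` resource, which is not among the packages downloaded earlier. One way to resolve this, sketched below, is to download the missing resource only when it is not already present. `ensure_nltk_resource` is a hypothetical helper, and the sketch assumes network access is available.

```python
import nltk

def ensure_nltk_resource(resource_path, package_id):
    """Download an NLTK package only if its resource cannot be found locally."""
    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # nltk.download returns True on success, False otherwise
        return nltk.download(package_id)

# Resource path and package id taken from the error message above
ensure_nltk_resource("tokenizers/punkt_tab", "punkt_tab")
```

After this cell succeeds, re-running the preprocessing cell should get past the tokenization step.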
