# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com), you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways), you will see this data. We can use `Python` and `BeautifulSoup` to walk through the paginated review listing and collect the text of each review.
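As a rough illustration of how `BeautifulSoup` pulls text out of a page, the sketch below parses a small, made-up HTML snippet instead of the live site. The class name `text_content` is the one this notebook's scraping cell targets; the HTML itself is invented for the example and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (not real Skytrax markup)
html = """
<article><div class="text_content">First review text.</div></article>
<article><div class="text_content"> Second review text. </div></article>
<div class="other">Not a review</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the <div> elements whose class is "text_content"
texts = [div.get_text(strip=True) for div in soup.find_all("div", class_="text_content")]
print(texts)
```

Passing `strip=True` to `get_text` trims the surrounding whitespace, which the notebook otherwise has to clean up later.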

In [1]:
pip install beautifulsoup4

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.



In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import os

In [3]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 20
page_size = 100

reviews = []

# Iterate over the paginated review listing, one request per page
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; stop early on a timeout or an HTTP error
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
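
The query string in the scraping loop is written by hand, including the percent-encoded `%3A`. A minimal sketch of building the same URL with the standard library is shown here; `page_url` is a hypothetical helper name, and the parameter names are simply the ones used in the loop above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def page_url(page, page_size=100):
    """Build the URL for one page of the paginated review listing."""
    # urlencode percent-encodes the colon in "post_date:Desc" as %3A
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{BASE_URL}/page/{page}/?{query}"


print(page_url(1))
```

Letting `urlencode` handle the escaping keeps the readable value `post_date:Desc` in the code and avoids mistakes when parameters change.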


In [4]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

| | reviews |
|---|---|
| 0 | Not Verified \| LHR T5 BA Gold Wing worked wel... |
| 1 | Not Verified \| Very good service on this rout... |
| 2 | ✅ Trip Verified \| Flight mainly let down by ... |
| 3 | ✅ Trip Verified \| Another awful experience b... |
| 4 | ✅ Trip Verified \| The service was rude, full... |


In [5]:
df.to_csv("BA_reviews.csv")

Congratulations! Now you have your dataset for this task! The loop above collected 2000 reviews (20 pages of 100 reviews each) by iterating through the paginated pages on the website. If you want to collect more data, try increasing the number of pages!

The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row if it exists, as it's not relevant to what we want to investigate.
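
One possible way to drop the verification prefix is a single regular expression, sketched below. The two prefix forms ("✅ Trip Verified |" and "Not Verified |") are taken from the sample rows shown in this notebook; other variants may exist in the full dataset, and `strip_verification` is a hypothetical helper name.

```python
import re

# Optional check mark, then "Trip Verified" or "Not Verified", then the "|" separator
PREFIX = re.compile(r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*", re.IGNORECASE)


def strip_verification(text):
    """Remove a leading verification marker, if present, and trim whitespace."""
    return PREFIX.sub("", text).strip()


samples = [
    "✅ Trip Verified | Flight mainly let down by ...",
    "Not Verified |  LHR T5 BA Gold Wing worked wel...",
    "No prefix here",
]
cleaned = [strip_verification(s) for s in samples]
print(cleaned)
```

Because the pattern is anchored at the start of the string, the words "verified" or "not" elsewhere in a review are left untouched.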

In [6]:
data = pd.read_csv("BA_reviews.csv")
data.head()

| | Unnamed: 0 | reviews |
|---|---|---|
| 0 | 0 | Not Verified \| LHR T5 BA Gold Wing worked wel... |
| 1 | 1 | Not Verified \| Very good service on this rout... |
| 2 | 2 | ✅ Trip Verified \| Flight mainly let down by ... |
| 3 | 3 | ✅ Trip Verified \| Another awful experience b... |
| 4 | 4 | ✅ Trip Verified \| The service was rude, full... |


In [7]:
data.drop(columns='Unnamed: 0', inplace=True)
data.head()

| | reviews |
|---|---|
| 0 | Not Verified \| LHR T5 BA Gold Wing worked wel... |
| 1 | Not Verified \| Very good service on this rout... |
| 2 | ✅ Trip Verified \| Flight mainly let down by ... |
| 3 | ✅ Trip Verified \| Another awful experience b... |
| 4 | ✅ Trip Verified \| The service was rude, full... |
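
The extra `Unnamed: 0` column appears because `to_csv` writes the DataFrame index by default, and `read_csv` then reads it back as an unnamed column. The sketch below reproduces this with an in-memory buffer and a tiny made-up DataFrame, and shows that writing with `index=False` avoids the column in the first place.

```python
import io

import pandas as pd

# Tiny stand-in for the scraped reviews
df = pd.DataFrame({"reviews": ["first review", "second review"]})

# Default behaviour: the index is written as an extra, unnamed column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index, cols_without_index)
```

Reading an existing file with `pd.read_csv(path, index_col=0)` is another option when the CSV has already been written with its index.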


In [8]:
data.iloc[0]

reviews    Not Verified |  LHR T5 BA Gold Wing worked wel...
Name: 0, dtype: object

## Normalize Casing
- convert all text to lower case so that it is all treated the same
- turn the column into a Python list so that it is easy to loop through

In [9]:
review = data['reviews'].str.lower().to_list()

In [10]:
review[:5]

['not verified |  lhr t5 ba gold wing worked well. pleasant check in and very fast security screening. concorde room service attentive. c gate boarding ok but nothing special for first passengers. latest ba version of first with only 8 suites with privacy doors. comfortable seat with plenty of stowage. good screen and good choice of ife. amenity kit good quality and bedding, pillows cushions and blankets all good. excellent menu and food very well presented. cabin crew could not have been more attentive and helpful without being obtrusive. on time departure and early arrival. bags delivered relatively swiftly and priority tagged bags were first off. all in all one of the best ba first flights i’ve had in many years. whilst not touching the middle eastern carriers ba first on this showing is easily the best way to cross the atlantic.',
 'not verified |  very good service on this route ba2710 30th march. cabin crew worked hard, particularly ivka (?) who was on the go throughout the fligh

Remove unknown characters, punctuation and numbers so that they do not interfere with the text preprocessing.

In [11]:
review = [r.replace('✅', '').strip() for r in review]

In [12]:
review[:5]

['not verified |  lhr t5 ba gold wing worked well. pleasant check in and very fast security screening. concorde room service attentive. c gate boarding ok but nothing special for first passengers. latest ba version of first with only 8 suites with privacy doors. comfortable seat with plenty of stowage. good screen and good choice of ife. amenity kit good quality and bedding, pillows cushions and blankets all good. excellent menu and food very well presented. cabin crew could not have been more attentive and helpful without being obtrusive. on time departure and early arrival. bags delivered relatively swiftly and priority tagged bags were first off. all in all one of the best ba first flights i’ve had in many years. whilst not touching the middle eastern carriers ba first on this showing is easily the best way to cross the atlantic.',
 'not verified |  very good service on this route ba2710 30th march. cabin crew worked hard, particularly ivka (?) who was on the go throughout the fligh

In [13]:
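# Replace punctuation and digits with spaces. The raw string avoids the invalid "\]" escape,
# and "*", "+", "/" and "-" are listed explicitly instead of relying on the accidental ")-:" range.
review = [re.sub(r"[.,|?()*+/\-:='~^0-9\\]", " ", item) for item in review]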

In [14]:
review[:5]

['not verified    lhr t  ba gold wing worked well  pleasant check in and very fast security screening  concorde room service attentive  c gate boarding ok but nothing special for first passengers  latest ba version of first with only   suites with privacy doors  comfortable seat with plenty of stowage  good screen and good choice of ife  amenity kit good quality and bedding  pillows cushions and blankets all good  excellent menu and food very well presented  cabin crew could not have been more attentive and helpful without being obtrusive  on time departure and early arrival  bags delivered relatively swiftly and priority tagged bags were first off  all in all one of the best ba first flights i’ve had in many years  whilst not touching the middle eastern carriers ba first on this showing is easily the best way to cross the atlantic ',
 'not verified    very good service on this route ba       th march  cabin crew worked hard  particularly ivka     who was on the go throughout the fligh
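
Inside a regular-expression character class, a hyphen placed between two characters denotes a range, which is easy to trigger by accident when listing punctuation. The short sketch below, using a made-up sample string, contrasts `[)-:]` (a range from `)` to `:`) with `[)\-:]` (three literal characters).

```python
import re

sample = "a+b*c/d 2-3"

# ")-:" is read as a range from ")" to ":", which also covers * + , - . / and the digits 0-9
range_class = re.sub(r"[)-:]", " ", sample)

# Escaping the hyphen makes it a literal, so only ")", "-" and ":" are replaced
literal_class = re.sub(r"[)\-:]", " ", sample)

print(repr(range_class))
print(repr(literal_class))
```

Escaping the hyphen, or placing it first or last in the class, makes the intended set of characters explicit.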

## Tokenize the reviews

In [24]:
pip install nltk


Defaulting to user installation because normal site-packages is not writeable
Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting click (from nltk)
  Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.11.6-cp311-cp311-win_amd64.whl.metadata (41 kB)
Collecting tqdm (from nltk)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)


In [28]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [15]:
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
review_tokens = [tokenizer.tokenize(item) for item in review]
print(review_tokens[:5])


[['not', 'verified', 'lhr', 't', 'ba', 'gold', 'wing', 'worked', 'well', 'pleasant', 'check', 'in', 'and', 'very', 'fast', 'security', 'screening', 'concorde', 'room', 'service', 'attentive', 'c', 'gate', 'boarding', 'ok', 'but', 'nothing', 'special', 'for', 'first', 'passengers', 'latest', 'ba', 'version', 'of', 'first', 'with', 'only', 'suites', 'with', 'privacy', 'doors', 'comfortable', 'seat', 'with', 'plenty', 'of', 'stowage', 'good', 'screen', 'and', 'good', 'choice', 'of', 'ife', 'amenity', 'kit', 'good', 'quality', 'and', 'bedding', 'pillows', 'cushions', 'and', 'blankets', 'all', 'good', 'excellent', 'menu', 'and', 'food', 'very', 'well', 'presented', 'cabin', 'crew', 'could', 'not', 'have', 'been', 'more', 'attentive', 'and', 'helpful', 'without', 'being', 'obtrusive', 'on', 'time', 'departure', 'and', 'early', 'arrival', 'bags', 'delivered', 'relatively', 'swiftly', 'and', 'priority', 'tagged', 'bags', 'were', 'first', 'off', 'all', 'in', 'all', 'one', 'of', 'the', 'best', '

## NLTK POS tagger
Perform part-of-speech (POS) tagging on each tokenized review using the NLTK POS tagger.

In [44]:
download_dir = r'D:\nltk_data'

# Download the POS tagger model
nltk.download('averaged_perceptron_tagger', download_dir=download_dir)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     D:\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [22]:
# Check that the NLTK data directory exists (this does not confirm the tagger itself is present)
tagger_path = r"D:\nltk_data"
print("Tagger file exists:", os.path.exists(tagger_path))

Tagger file exists: True


In [23]:
# Note: NLTK unzips into "taggers" (plural), so this "tagger" path does not exist
tagger_path2 = r"D:\nltk_data\tagger\averaged_perceptron_tagger"
print("Tagger file exists:", os.path.exists(tagger_path2))

Tagger file exists: False


In [24]:
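# Note: download_dir is treated as the NLTK data root, so pointing it at ...\taggers
# places the files under D:\nltk_data\taggers\taggers
nltk.download('averaged_perceptron_tagger', download_dir=r"D:\nltk_data\taggers")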

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     D:\nltk_data\taggers...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [27]:
nltk.data.path.append(r"D:\nltk_data")

In [33]:
tagger_path2 = r"D:\nltk_data\taggers\taggers\averaged_perceptron_tagger"
print("Tagger file exists:", os.path.exists(tagger_path2))

Tagger file exists: True


In [35]:
from nltk import pos_tag

tagger_path = r"D:\nltk_data\taggers\taggers\averaged_perceptron_tagger"
print("Tagger file exists:", os.path.exists(tagger_path))

# Perform POS tagging on tokenized reviews
review_postage = [pos_tag(item) for item in review_tokens]

# Print the first 5 POS-tagged reviews
print(review_postage[:5])

Tagger file exists: True


LookupError: 
**********************************************************************
  Resource averaged_perceptron_tagger_eng not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger_eng')

  For more information see: https://www.nltk.org/data.html

  Attempted to load taggers/averaged_perceptron_tagger_eng/

  Searched in:
    - 'C:\\Users\\DELL/nltk_data'
    - 'c:\\Program Files\\Python311\\nltk_data'
    - 'c:\\Program Files\\Python311\\share\\nltk_data'
    - 'c:\\Program Files\\Python311\\lib\\nltk_data'
    - 'C:\\Users\\DELL\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'D:\\nltk_data'
    - 'D:\\nltk_data'
    - 'D:\\nltk_data'
**********************************************************************


In [32]:

from nltk import pos_tag

# Assuming review_tokens is already tokenized
review_postage = [pos_tag(item) for item in review_tokens]

# Print the first 5 POS-tagged reviews
print(review_postage[:5])

LookupError: 
**********************************************************************
  Resource averaged_perceptron_tagger_eng not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger_eng')

  For more information see: https://www.nltk.org/data.html

  Attempted to load taggers/averaged_perceptron_tagger_eng/

  Searched in:
    - 'C:\\Users\\DELL/nltk_data'
    - 'c:\\Program Files\\Python311\\nltk_data'
    - 'c:\\Program Files\\Python311\\share\\nltk_data'
    - 'c:\\Program Files\\Python311\\lib\\nltk_data'
    - 'C:\\Users\\DELL\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'D:\\nltk_data'
    - 'D:\\nltk_data'
    - 'D:\\nltk_data'
**********************************************************************
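
The traceback lists a relative resource name (`taggers/averaged_perceptron_tagger_eng/`) and the directories that were searched for it. The sketch below is a simplified, standard-library illustration of that kind of lookup, not NLTK's actual implementation (which also handles zip files, among other things); `find_resource` is a hypothetical helper. It shows why a folder at `taggers\taggers\averaged_perceptron_tagger` can exist on disk and still not satisfy the lookup.

```python
import os
import tempfile


def find_resource(search_paths, resource):
    """Return the first search path that contains `resource`, or None."""
    for root in dict.fromkeys(search_paths):  # skip duplicate entries, keep order
        candidate = os.path.join(root, *resource.split("/"))
        if os.path.exists(candidate):
            return root
    return None


wanted = "taggers/averaged_perceptron_tagger_eng"

with tempfile.TemporaryDirectory() as root:
    # Mimic the nested layout from the notebook: <root>/taggers/taggers/averaged_perceptron_tagger
    os.makedirs(os.path.join(root, "taggers", "taggers", "averaged_perceptron_tagger"))
    nested_found = find_resource([root, root], wanted)

    # Create the layout the error message asks for: <root>/taggers/averaged_perceptron_tagger_eng
    os.makedirs(os.path.join(root, "taggers", "averaged_perceptron_tagger_eng"))
    fixed_found = find_resource([root, root], wanted) == root

print(nested_found, fixed_found)
```

In this simplified model, both the folder name (`_eng` suffix) and its position directly under `taggers` have to match before the lookup succeeds.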


In [42]:
import nltk
from nltk.tokenize import TreebankWordTokenizer

# Set the NLTK data path to include the directory where your POS tagger is located
nltk.data.path.append(r"D:\nltk_data")

# Manually load the PerceptronTagger from your local path
tagger = nltk.data.load('taggers/averaged_perceptron_tagger')

# Tokenize your reviews
tokenizer = TreebankWordTokenizer()
review_tokens = [tokenizer.tokenize(item) for item in review]

# Perform POS tagging using the loaded tagger
review_postage = [tagger.tag(item) for item in review_tokens]

# Print the first 5 POS-tagged reviews
print(review_postage[:5])


ValueError: Could not determine format for nltk:taggers/averaged_perceptron_tagger based on its file
extension; use the "format" argument to specify the format explicitly.

In [43]:
print(nltk.data.path)


['C:\\Users\\DELL/nltk_data', 'c:\\Program Files\\Python311\\nltk_data', 'c:\\Program Files\\Python311\\share\\nltk_data', 'c:\\Program Files\\Python311\\lib\\nltk_data', 'C:\\Users\\DELL\\AppData\\Roaming\\nltk_data', 'C:\\nltk_data', 'D:\\nltk_data', 'E:\\nltk_data', 'D:\\nltk_data', 'D:\\nltk_data', 'D:\\nltk_data', 'D:\\nltk_data', 'D:\\nltk_data', 'D:\\nltk_data', 'D:\\nltk_data']


In [46]:
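# Remove duplicates from the NLTK data paths while keeping the search order
nltk.data.path = list(dict.fromkeys(nltk.data.path))

# Now load the POS tagger
# Note: nltk.data.load infers the format from a file extension, so a bare
# directory name such as this one raises the ValueError shown below
tagger = nltk.data.load('taggers/averaged_perceptron_tagger')
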
ValueError: Could not determine format for nltk:taggers/averaged_perceptron_tagger based on its file
extension; use the "format" argument to specify the format explicitly.
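
Both `LookupError` tracebacks above name the missing resource as `averaged_perceptron_tagger_eng` and suggest downloading it with `nltk.download`. A minimal sketch of following that suggestion is given below, assuming the installed NLTK version (3.9.1 in this notebook) ships `pos_tag` with that resource name. `ensure_tagger` and `tag_reviews` are hypothetical helper names, and the download step needs network access, so it is only shown as commented usage.

```python
import nltk
from nltk import pos_tag


def ensure_tagger(resource="averaged_perceptron_tagger_eng", download_dir=None):
    """Download the tagger resource named in the error message if it is not found."""
    try:
        nltk.data.find(f"taggers/{resource}")
    except LookupError:
        # download_dir, if given, should be the NLTK data root (for example D:\nltk_data),
        # not the "taggers" subfolder
        nltk.download(resource, download_dir=download_dir)


def tag_reviews(token_lists, tagger=pos_tag):
    """Apply a POS tagger to each list of tokens."""
    return [tagger(tokens) for tokens in token_lists]


# Usage in the notebook (needs network access on the first run):
# ensure_tagger()
# review_postage = tag_reviews(review_tokens)
```

Going through `pos_tag` rather than `nltk.data.load` leaves it to NLTK to locate and read the tagger files, which sidesteps the file-format `ValueError` shown above.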