

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you have collected the data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com] you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways] you will see this data. Now we can use `Python` and `BeautifulSoup` to step through the paginated review listing and collect the review text from each page.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; fail fast on HTTP errors or a hung connection
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
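
The paginated URL in the loop above is assembled by hand, with the colon in the sort key already percent-encoded as `%3A`. As a minimal sketch, the same URL can be built with the standard library so the encoding is handled for you. The helper name `build_page_url` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page: int, page_size: int = 100) -> str:
    """Return the URL of one page of the paginated review listing."""
    # urlencode percent-encodes the colon in "post_date:Desc" as %3A
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{BASE_URL}/page/{page}/?{query}"


# Build the URLs for the first three pages and show the first one
urls = [build_page_url(i) for i in range(1, 4)]
print(urls[0])
```

Each URL produced this way can be passed to `requests.get` exactly as in the loop above.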


In [None]:
# Build the DataFrame directly from the list of scraped reviews
df = pd.DataFrame({"reviews": reviews})
df.head(10)

Unnamed: 0,reviews
0,✅ Trip Verified | Old A320 with narrow pitch....
1,✅ Trip Verified | Another BA Shambles. Starte...
2,Not Verified | BA cancelled my flight home to...
3,"Not Verified | BA cancelled my flight home, t..."
4,✅ Trip Verified | Turned up 3.5 hours in advan...
5,Not Verified | Boarding – at gate at LGW they...
6,✅ Trip Verified | Missing baggage customer se...
7,✅ Trip Verified | British Airways are not the...
8,✅ Trip Verified | Stupidly tried BA again aft...
9,Not Verified | Seat horribly narrow; 3-4-3 on...
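
The preview shows that every review starts with a verification label followed by `|`. A natural first cleaning step is to separate that label from the review text. The sketch below assumes only the two labels visible in the output above (`✅ Trip Verified` and `Not Verified`); the function name `split_review` is hypothetical, and the sample strings are shortened from the preview.

```python
from typing import Tuple


def split_review(raw: str) -> Tuple[bool, str]:
    """Split a scraped review into (is_verified, review_text)."""
    status, sep, body = raw.partition("|")
    if not sep:
        # No "|" separator: treat the whole string as unverified review text
        return False, raw.strip()
    # "Not Verified" does not contain "Trip Verified", so a substring check
    # is enough to tell the two labels apart
    return "Trip Verified" in status, body.strip()


samples = [
    "✅ Trip Verified |  Old A320 with narrow pitch.",
    "Not Verified |  BA cancelled my flight home",
]
for sample in samples:
    print(split_review(sample))
```

On the full dataset the same function could be applied with `df["reviews"].map(split_review)`, which yields one `(is_verified, review_text)` tuple per row.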


In [None]:
df.columns

Index(['reviews'], dtype='object')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   reviews  1000 non-null   object
dtypes: object(1)
memory usage: 7.9+ KB


In [None]:
# describe is a method and must be called; without parentheses the notebook only echoes the bound method
df.describe()

In [None]:
# sample is also a method; call it to draw random rows (here 5) instead of echoing the bound method
df.sample(5)

In [None]:
# dropna returns a new DataFrame, so assign the result to keep it
df = df.dropna()
df

Unnamed: 0,reviews
0,✅ Trip Verified | Old A320 with narrow pitch....
1,✅ Trip Verified | Another BA Shambles. Starte...
2,Not Verified | BA cancelled my flight home to...
3,"Not Verified | BA cancelled my flight home, t..."
4,✅ Trip Verified | Turned up 3.5 hours in advan...
...,...
995,✅ Trip Verified | Heathrow to Keflavik. I had...
996,✅ Trip Verified | London to Muscat first clas...
997,✅ Trip Verified | My family and I travelled f...
998,✅ Trip Verified | Gatwick to Madeira. The fli...
999,✅ Trip Verified | London to Casablanca. Their ...


In [None]:
type(df)

pandas.core.frame.DataFrame

In [None]:
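import os

# to_csv does not create missing directories, so make sure the output folder exists first
os.makedirs("data", exist_ok=True)
df.to_csv("data/BA_reviews.csv")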

The loop above collected 1000 reviews by iterating over 10 paginated pages of 100 reviews each. To collect more data, increase the value of `pages`.
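
When the page count is raised, the loop may eventually request pages beyond the last one that holds reviews. One way to handle this, sketched here under the assumption that such a page simply yields no `text_content` blocks, is to stop at the first empty page. The names `collect_reviews` and `fake_fetch` are hypothetical, and the stand-in fetcher returns made-up strings so the control flow can be run without network access.

```python
from typing import Callable, List


def collect_reviews(fetch_page: Callable[[int], List[str]], max_pages: int) -> List[str]:
    """Collect reviews page by page, stopping early at the first empty page."""
    collected: List[str] = []
    for page in range(1, max_pages + 1):
        page_reviews = fetch_page(page)
        if not page_reviews:
            # An empty page is taken to mean we ran past the last page of reviews
            break
        collected.extend(page_reviews)
    return collected


def fake_fetch(page: int) -> List[str]:
    """Stand-in fetcher with made-up data: 3 pages of 2 reviews, then nothing."""
    return [f"review {page}-{n}" for n in (1, 2)] if page <= 3 else []


print(len(collect_reviews(fake_fetch, max_pages=10)))  # 6
```

In the notebook, `fetch_page` would be a function that wraps the `requests.get` and `BeautifulSoup` steps from the scraping loop and returns the list of review texts for one page.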

