# Web scraping and analysis
This Jupyter notebook includes some code to get us started with web scraping. We will use a package called BeautifulSoup to collect the data from the web. Once we've collected the data and saved it to a local .csv file, we can start our analysis.

## Scraping data from Skytrax
If we visit [https://www.airlinequality.com], we can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If we navigate to [https://www.airlinequality.com/airline-reviews/british-airways], we will see this data. We can now use Python and BeautifulSoup to page through the review listings and collect the text of each review.


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 37
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; fail fast on timeouts or HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
Scraping page 21
   ---> 2100 total reviews
Scraping page 22
   ---> 2200 total reviews
Scraping page 23
   ---> 2300 total reviews
Scrapi

In [3]:
# Build a one-column DataFrame from the scraped review texts
df = pd.DataFrame({"reviews": reviews})
df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | First time flying British Ai...
1,Not Verified | I flew London to Cairo and ret...
2,Not Verified | Absolutely the worst experienc...
3,Not Verified | Flew back from Malta after sc...
4,Not Verified | Cabin luggage had to go to carg...


In [4]:
df.to_csv("BA_reviews.csv")

Congratulations! You now have your dataset for this task. The loop above iterates over 37 paginated pages of up to 100 reviews each, which yielded 3,688 reviews in this run. If you want to collect more data, try increasing the number of pages.

The next thing you should do is clean this data to remove any unnecessary text from each row. For example, "✅ Trip Verified" can be removed from each row where it exists, as it's not relevant to what we want to investigate.

# Data Cleaning
Now that we have extracted the data from the website, it still needs to be cleaned before it can be analyzed. The reviews column needs to be cleaned of punctuation, misspellings and other stray characters.

## Step 1: Tokenization

Tokenization is the process of breaking text into smaller pieces called tokens. It can be performed at the sentence level (sentence tokenization) or at the word level (word tokenization).
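
As a rough illustration of the two levels, here is a minimal sketch using only Python's standard `re` module. The sample text is hypothetical, and the regular expressions are a simplification; a dedicated tokenizer such as NLTK's `sent_tokenize` and `word_tokenize` handles abbreviations and other edge cases that this sketch does not.

```python
import re

# Hypothetical review text, used only for illustration.
review = "The crew were friendly. Boarding was slow, but the flight left on time!"

# Sentence tokenization: split after ., ! or ? when followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", review.strip())

# Word tokenization: lower-case the text and keep runs of letters, digits and apostrophes.
words = re.findall(r"[a-z0-9']+", review.lower())

print(sentences)
print(words)
```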

## Step 2: Enrichment – POS tagging

Part-of-speech (POS) tagging is the process of converting each token into a tuple of the form (word, tag). POS tagging is essential for preserving the context of a word, and lemmatization depends on it.
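
The (word, tag) tuples usually need one more step before lemmatization: mapping the tagger's tags to the coarser categories a lemmatizer understands. The sketch below assumes Penn Treebank style tags (the style `nltk.pos_tag` returns) and WordNet's single-letter POS codes. The tagged tuples are hand-written for illustration, and `to_wordnet_pos` is a hypothetical helper name.

```python
# Hand-written (word, tag) tuples in the Penn Treebank style that a tagger
# such as nltk.pos_tag returns; they are illustrative, not real tagger output.
tagged = [("flights", "NNS"), ("were", "VBD"), ("badly", "RB"),
          ("delayed", "VBN"), ("again", "RB")]

# First letter of a Penn Treebank tag -> WordNet part-of-speech code
# (adjective, verb, noun, adverb).
PENN_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def to_wordnet_pos(tag):
    """Return the WordNet POS code for a Penn Treebank tag, or None if unmapped."""
    return PENN_TO_WORDNET.get(tag[:1])


wordnet_tagged = [(word, to_wordnet_pos(tag)) for word, tag in tagged]
print(wordnet_tagged)
```

Tags that map to `None` (determiners, prepositions and so on) are typically words that get dropped in the stopword step anyway.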

## Step 3: Stopwords removal
Stopwords in English are words that carry very little useful information. We need to remove them as part of text preprocessing. nltk provides stopword lists for English and a number of other languages.
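
Stopword removal is essentially a filter over the token list. The sketch below uses a tiny hand-picked stopword set and hypothetical tokens purely for illustration; in practice the set would come from a full list such as `nltk.corpus.stopwords.words("english")`.

```python
# A tiny hand-picked stopword set for illustration; a real list (for example
# nltk.corpus.stopwords.words("english")) is much longer.
STOPWORDS = {"the", "a", "an", "and", "to", "was", "were", "on", "in", "of", "but"}

# Hypothetical tokens from a single review.
tokens = ["the", "crew", "were", "friendly", "but", "boarding", "was", "slow"]

# Keep only tokens that are not in the stopword set.
filtered = [token for token in tokens if token not in STOPWORDS]
print(filtered)  # ['crew', 'friendly', 'boarding', 'slow']
```

Using a `set` rather than a list keeps each membership check fast, which matters when filtering thousands of reviews.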

## Step 4: Obtaining the stem words
A stem is the part of a word responsible for its lexical meaning. The two popular techniques for obtaining the root/stem words are stemming and lemmatization.

The key difference is that stemming often produces meaningless root words because it simply chops some characters off the end. Lemmatization produces meaningful root words, but it requires the POS tags of the words.
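
To make the contrast concrete, here is a deliberately crude sketch. `naive_stem` is a hypothetical suffix chopper (it is not the Porter algorithm), and `naive_lemmatize` is a hypothetical dictionary lookup standing in for a real lemmatizer. In practice one would more likely reach for NLTK's `PorterStemmer` and `WordNetLemmatizer`.

```python
def naive_stem(word):
    """Chop a common suffix off the end of a word (crude, not the Porter algorithm)."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Tiny hand-written (word, POS) -> lemma lookup standing in for a real lemmatizer.
LEMMAS = {("flies", "v"): "fly", ("delayed", "v"): "delay", ("studies", "n"): "study"}


def naive_lemmatize(word, pos):
    """Look up the dictionary form for a (word, POS) pair, falling back to the word."""
    return LEMMAS.get((word, pos), word)


for word, pos in [("flies", "v"), ("delayed", "v"), ("studies", "n")]:
    print(word, "->", naive_stem(word), "|", naive_lemmatize(word, pos))
```

The stemmer turns "flies" into "fli" and "studies" into "stud", neither of which is a real word, while the POS-aware lookup returns "fly" and "study".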



In [5]:
# Remove the "✅ Trip Verified |" or "Not Verified |" prefix if a review starts with one
df['reviews'] = df['reviews'].str.replace(r'^(?:✅ Trip Verified|Not Verified) \|', '', regex=True)


In [6]:
df

Unnamed: 0,reviews
0,First time flying British Airways and I would...
1,I flew London to Cairo and return in October...
2,Absolutely the worst experience ever. Flew ...
3,Flew back from Malta after scattering our s...
4,"Cabin luggage had to go to cargo, even when I..."
...,...
3683,LHR-JFK-LAX-LHR. Check in was ok apart from be...
3684,LHR to HAM. Purser addresses all club passenge...
3685,My son who had worked for British Airways urge...
3686,London City-New York JFK via Shannon on A318 b...


In [7]:
df.shape

(3688, 1)

In [8]:
# Remove every character that is not a word character or whitespace
# (punctuation and symbols such as "✅"; pandas handles the regex itself)
df['reviews'] = df['reviews'].str.replace(r'[^\w\s]', '', regex=True)


In [9]:
# Remove leading and trailing whitespace
df['reviews'] = df['reviews'].str.strip()
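
Two details of the cleaning cells are worth noting. Deleting punctuation outright fuses hyphenated tokens (a route such as "LHR-JFK-LAX-LHR" becomes "LHRJFKLAXLHR"), and `str.strip()` only trims the ends of each string, so repeated spaces inside a review survive. The sketch below shows one possible alternative on a hypothetical string, using the standard `re` module: replace punctuation with a space, then collapse runs of whitespace. It is an optional variation, not the method used in this notebook.

```python
import re

# Hypothetical raw review text, used only for illustration.
raw = "  LHR-JFK in Club World.  Check-in was ok,   boarding was slow!  "

# Replace each non-word, non-space character with a space so "LHR-JFK" stays two tokens.
no_punct = re.sub(r"[^\w\s]", " ", raw)

# Collapse runs of whitespace into one space, then trim both ends.
cleaned = re.sub(r"\s+", " ", no_punct).strip()
print(cleaned)  # LHR JFK in Club World Check in was ok boarding was slow
```

On the DataFrame, the same idea would be `df['reviews'].str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\s+', ' ', regex=True).str.strip()`.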


In [10]:
df

Unnamed: 0,reviews
0,First time flying British Airways and I would ...
1,I flew London to Cairo and return in October 2...
2,Absolutely the worst experience ever Flew int...
3,Flew back from Malta after scattering our sons...
4,Cabin luggage had to go to cargo even when I s...
...,...
3683,LHRJFKLAXLHR Check in was ok apart from being ...
3684,LHR to HAM Purser addresses all club passenger...
3685,My son who had worked for British Airways urge...
3686,London CityNew York JFK via Shannon on A318 bu...


NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.

In [11]:
df

Unnamed: 0,reviews
0,First time flying British Airways and I would ...
1,I flew London to Cairo and return in October 2...
2,Absolutely the worst experience ever Flew int...
3,Flew back from Malta after scattering our sons...
4,Cabin luggage had to go to cargo even when I s...
...,...
3683,LHRJFKLAXLHR Check in was ok apart from being ...
3684,LHR to HAM Purser addresses all club passenger...
3685,My son who had worked for British Airways urge...
3686,London CityNew York JFK via Shannon on A318 bu...


# Sentiment Analysis with Separate Columns

In this section, we perform sentiment analysis on the review data while creating separate columns to store both the sentiment value and sentiment label.

1. **Sentiment Analysis Function**: We define a function, `analyze_sentiment`, which takes a text as input, performs sentiment analysis, and returns both the sentiment value (a numeric score) and sentiment label (positive, negative, or neutral).

2. **Applying Sentiment Analysis**: We apply the `analyze_sentiment` function to the 'stemmed_reviews_as_string' column of our DataFrame. This adds two columns to the DataFrame: 'sentiment_value' and 'sentiment'.

3. **Result Explanation**: The 'sentiment_value' column contains TextBlob's polarity score, a number from -1.0 (most negative) to 1.0 (most positive), while the 'sentiment' column stores the label: positive for scores above 0, negative for scores below 0, and neutral for a score of exactly 0.

4. **Example Output**: The resulting DataFrame allows us to easily analyze and visualize the sentiment of each review, and it provides both a numeric and label representation of sentiment.
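
Once every review has a score and a label, a first summary is simply a count per label and an average score. The sketch below uses hypothetical (score, label) pairs and only the standard library; the values are made up for illustration and are not results from the scraped reviews. On the DataFrame itself, `df['sentiment'].value_counts()` and `df['sentiment_value'].mean()` would give the same kind of summary.

```python
from collections import Counter
from statistics import mean

# Hypothetical (sentiment_value, sentiment) pairs, not real results from the reviews.
scored = [(0.35, "positive"), (-0.20, "negative"), (0.0, "neutral"), (0.10, "positive")]

# Count how many reviews fall under each label.
label_counts = Counter(label for _, label in scored)

# Average polarity across all reviews.
average_polarity = mean(value for value, _ in scored)

print(label_counts)
print(round(average_polarity, 4))
```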

Let's proceed with the code and analysis.


In [None]:
pip install textblob


In [None]:
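# Assumption: 'stemmed_reviews' is not created above, so build it here with
# NLTK's PorterStemmer on lower-cased, whitespace-split words if it is missing.
if 'stemmed_reviews' not in df.columns:
    from nltk.stem import PorterStemmer
    stemmer = PorterStemmer()
    df['stemmed_reviews'] = df['reviews'].apply(lambda t: [stemmer.stem(w) for w in t.lower().split()])

# Join each list of stemmed words into a single string and display both columns
df['stemmed_reviews_as_string'] = df['stemmed_reviews'].apply(' '.join)
print(df[['stemmed_reviews', 'stemmed_reviews_as_string']])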


In [None]:

# Import the required libraries

from textblob import TextBlob

# Function to analyze sentiment and return both sentiment value and label
def analyze_sentiment(text):
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    if sentiment > 0:
        sentiment_label = 'positive'
    elif sentiment < 0:
        sentiment_label = 'negative'
    else:
        sentiment_label = 'neutral'
    return sentiment, sentiment_label

# Apply sentiment analysis and create separate columns for sentiment value and label
df[['sentiment_value', 'sentiment']] = df['stemmed_reviews_as_string'].apply(analyze_sentiment).apply(pd.Series)

# Display the DataFrame with sentiment value and label columns
print(df[['reviews', 'sentiment_value', 'sentiment']])



In [None]:
df