### AIDI 1006 - Assignment 2

## Sentiment Analysis: Customer Reviews on British Airways


#### Data URL: https://www.airlinequality.com/airline-reviews/british-airways/?sortby=post_date%3ADesc&pagesize=100

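The goal is to build a sentiment analysis pipeline for customer reviews of British Airways: scrape the most recent reviews, preprocess the text, score each review with TextBlob, and compare the predicted labels against actual sentiment labels.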


#### Loading Libraries

In [176]:
!pip install requests
!pip install beautifulsoup4
!pip install textblob



In [147]:
import csv
import requests
from bs4 import BeautifulSoup
from textblob import TextBlob
import re
from nltk.tokenize import word_tokenize
import nltk
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

#### Web Scraping Using Beautiful Soup

In [19]:
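# Request the first page of reviews. A status code of 200 only means the request
# succeeded; it does not indicate whether the site permits scraping.
r = requests.get("https://www.airlinequality.com/airline-reviews/british-airways/?sortby=post_date%3ADesc&pagesize=100")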

r.status_code

200
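A 200 status only shows that the page was served. Whether crawlers are welcome is usually stated in the site's `robots.txt`, which the standard library can parse. The sketch below is a minimal illustration: `is_allowed` is a hypothetical helper and `sample_rules` is made-up content, not the actual robots.txt of airlinequality.com.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only.
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "*", "https://example.com/airline-reviews/"))
print(is_allowed(sample_rules, "*", "https://example.com/private/page"))
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` would load the real file instead of the sample text.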

In [21]:
soup = BeautifulSoup(r.content, 'html.parser')

In [22]:
soup

<!DOCTYPE html>

<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7 lt-ie10" lang="en-GB"> <![endif]-->
<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8 lt-ie10" lang="en-GB"> <![endif]-->
<!--[if IE 8]>    <html class="no-js lt-ie9 lt-ie10" lang="en-GB"> <![endif]-->
<!--[if IE 9]>    <html class="no-js lt-ie10" lang="en-GB"> <![endif]-->
<!--[if gt IE 8]><!-->
<html lang="en-GB">
<!--<![endif]-->
<head>
<meta charset="utf-8"/>
<title>British Airways Customer Reviews - SKYTRAX</title>
<!-- Google Chrome Frame for IE -->
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<!-- mobile meta -->
<meta content="True" name="HandheldFriendly"/>
<meta content="320" name="MobileOptimized"/>
<meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport">
<!-- icons & favicons -->
<link href="https://www.airlinequality.com/wp-content/themes/airlinequality2014new/library/images/apple-icon-touch.png" rel="apple-tou

In [33]:
# Extracting reviews HTML elements
reviews = soup.find_all('div', itemprop='reviewBody')

In [34]:
reviews

[<div class="text_content" itemprop="reviewBody">✅ <strong><a href="https://www.airlinequality.com/verified-reviews/"><em>Trip Verified</em></a></strong> | A simple story with an unfortunate outcome that really could happen to anyone. My partner and I recently started working after studying purchased two tickets to travel from London City Airport to Frankfurt. When we purchased the tickets, I mistakenly entered my name twice (e.g. Mr John Smith and Ms John Smith). Little did we know that our 1 simple mistake would cost us over 300 pounds. Upon arriving at the airport we were told there was no way to change the name (apparently they can only change 3 letters where there has been a typo?) and I had no other option to purchase the last remaining ticket if I wanted to board the flight - the price: almost seven times (!) higher than my original ticket. Zero empathy was shown. Zero alternative was offered. Trusting BA's staff and under the pretence that there was apparently no other way we c

In [55]:
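# Extract the plain text of each review and drop the "Trip Verified |" prefix
cleaned_reviews = []
for review in reviews:
    text = review.get_text()
    # find() returns -1 when there is no '|', so the full text is kept in that case
    start_index = text.find('|')
    cleaned_reviews.append(text[start_index + 1:].strip())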


print(cleaned_reviews)

["A simple story with an unfortunate outcome that really could happen to anyone. My partner and I recently started working after studying purchased two tickets to travel from London City Airport to Frankfurt. When we purchased the tickets, I mistakenly entered my name twice (e.g. Mr John Smith and Ms John Smith). Little did we know that our 1 simple mistake would cost us over 300 pounds. Upon arriving at the airport we were told there was no way to change the name (apparently they can only change 3 letters where there has been a typo?) and I had no other option to purchase the last remaining ticket if I wanted to board the flight - the price: almost seven times (!) higher than my original ticket. Zero empathy was shown. Zero alternative was offered. Trusting BA's staff and under the pretence that there was apparently no other way we could board the flight we bought this ticket. Immediately after I purchased the ticket I contacted BA's 'Commercial Change Booking Team' and informed them 
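The prefix removal can also be written with `str.partition`, which makes the "no separator" case explicit. This is a sketch with a hypothetical helper name, `strip_verification_prefix`; like the cell above, it assumes the first `|` in a review always belongs to the verification prefix, so a review without a prefix that happens to contain `|` would lose its leading text.

```python
def strip_verification_prefix(text: str) -> str:
    """Drop everything up to and including the first '|', if there is one."""
    head, sep, tail = text.partition("|")
    # When no '|' is present, partition() returns the whole text in `head`.
    return (tail if sep else head).strip()


print(strip_verification_prefix("✅ Trip Verified | Flight was delayed."))
print(strip_verification_prefix("  Horrible airline.  "))
```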

#### Text Preprocessing

In [66]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shubh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [133]:
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove HTML tags (if any)
    text = re.sub(r'<.*?>', '', text)

    # Tokenize the text into individual words
    words = word_tokenize(text)
    
    # Keep only alphanumeric tokens (drops punctuation tokens)
    words = [word for word in words if word.isalnum()]
    
    # Strip digits from the remaining tokens; purely numeric tokens become empty strings
    words = [re.sub(r'[^a-zA-Z\s]', '', word) for word in words]
    

    # Remove stop words
    stop_words = [
    'a', 'the', 'an','about', 'above', 'across', 'after', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 
    'among', 'an', 'and', 'another', 'any', 'anybody', 'anyone', 'anything', 'anywhere', 'are', 'area', 'areas', 'around', 'as', 'ask',
    'asked', 'asking', 'asks', 'at', 'away', 'back', 'backed', 'backing', 'backs', 'be', 'became', 'because', 'become', 'becomes', 'been', 
    'before', 'began', 'behind', 'being','beings', 'between', 'big', 'both', 'but', 'by', 'came', 'can', 'case', 'cases', 'certain', 
    'certainly', 'clear', 'clearly', 'come', 'could', 'did', 'differ', 'different', 'differently', 'do', 'does', 'done', 'down', 'downed',
    'downing', 'downs', 'during', 'each', 'early', 'either', 'end', 'ended', 'ending', 'ends', 'enough', 'even', 'evenly', 'ever', 'every', 
    'everybody', 'everyone', 'everything', 'everywhere', 'face', 'faces', 'fact', 'facts', 'far', 'felt', 'few', 'find', 'finds', 'first', 
    'for', 'four', 'from', 'full', 'fully', 'further', 'furthered', 'furthering',
    'furthers', 'gave', 'general', 'generally', 'get', 'gets', 'give', 'given', 'gives',
    'go', 'going', 'good', 'goods', 'got', 'great', 'greater', 'greatest', 'group', 'grouped',
    'grouping', 'groups', 'had', 'has', 'have', 'having', 'he', 'her', 'here', 'herself',
    'high', 'higher', 'highest', 'him', 'himself', 'his', 'how', 'however', 'i', 'if',
    'important', 'in', 'interest', 'interested', 'interesting', 'interests', 'into', 'is', 'it',
    'its', 'itself', 'just', 'keep', 'keeps', 'kind', 'knew', 'know', 'known',
    'knows', 'large', 'largely', 'last', 'later', 'latest', 'least', 'less', 'let', 'lets',
    'like', 'likely', 'long', 'longer', 'longest', 'made', 'make', 'making', 'man', 'many',
    'may', 'me', 'member', 'members', 'men', 'might', 'more', 'most', 'mostly', 'mr', 'mrs',
    'much', 'must', 'my', 'myself', 'necessary', 'need', 'needed', 'needing', 'needs',
    'never', 'new', 'newer', 'newest', 'next', 'no', 'nobody', 'non', 'noone', 'not', 'nothing',
    'now', 'nowhere', 'number', 'numbers', 'of', 'off', 'often', 'old', 'older', 'oldest',
    'on', 'once', 'one', 'only', 'open', 'opened', 'opening', 'opens', 'or', 'order', 'ordered',
    'ordering', 'orders', 'other', 'others', 'our', 'out', 'over', 'part', 'parted',
    'parting', 'parts', 'per', 'perhaps', 'place', 'places', 'point', 'pointed', 'pointing',
    'points', 'possible', 'present', 'presented', 'presenting', 'presents', 'problem', 'problems',
    'put', 'puts', 'quite', 'rather', 'really', 'right', 'room', 'rooms', 'said',
    'same', 'saw', 'say', 'says', 'second', 'seconds', 'see', 'seem', 'seemed', 'seeming',
    'seems', 'sees', 'several', 'shall', 'she', 'should', 'show', 'showed', 'showing', 'shows',
    'side', 'sides', 'since', 'small', 'smaller', 'smallest', 'so', 'some', 'somebody', 'someone',
    'something', 'somewhere', 'state', 'states', 'still', 'such', 'sure', 'take', 'taken',
    'than', 'that', 'the', 'their', 'them', 'then', 'there', 'therefore', 'these', 'they', 'thing',
    'things', 'think', 'thinks', 'this', 'those', 'though', 'thought', 'thoughts', 'three',
    'through', 'thus', 'to', 'today', 'together', 'too', 'took', 'toward', 'turn', 'turned',
    'turning', 'turns', 'two', 'u', 'under', 'until', 'up', 'upon', 'us', 'use', 'used', 'uses',
    'very', 'want', 'wanted', 'wanting', 'wants', 'was', 'way', 'ways', 'we', 'well',
    'wells', 'went', 'were', 'what', 'when', 'where', 'whether', 'which', 'while', 'who', 'whole',
    'whose', 'why', 'will', 'with', 'within', 'without', 'work', 'worked', 'working', 'works',
    'would', 'year', 'years', 'yet', 'you', 'young', 'younger', 'youngest', 'your',
    'yours']

    words = [word for word in words if word not in stop_words]

    # Combine the words back into a preprocessed text
    preprocessed_text = ' '.join(words)

    return preprocessed_text
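The preprocessed output below contains double spaces (for example `cost  pounds`) because numeric tokens such as `300` pass `isalnum()` and are then reduced to empty strings by the letter-only substitution. A minimal sketch of one way to avoid this, using a hypothetical helper `keep_alphabetic` that is not part of the notebook:

```python
import re


def keep_alphabetic(tokens):
    """Strip non-letters from each token and drop tokens that end up empty."""
    cleaned = (re.sub(r"[^a-z]", "", token.lower()) for token in tokens)
    return [token for token in cleaned if token]


tokens = ["cost", "us", "over", "300", "pounds"]
print(" ".join(keep_alphabetic(tokens)))  # cost us over pounds
```

Extra whitespace should not change the TextBlob scores, so this is mainly a matter of keeping the preprocessed text tidy.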

In [134]:
preprocessed_text = [preprocess_text(review) for review in cleaned_reviews]

In [135]:
preprocessed_text

['simple story unfortunate outcome happen partner recently started studying purchased tickets travel london city airport frankfurt purchased tickets mistakenly entered name twice john smith ms john smith little  simple mistake cost  pounds arriving airport told change name apparently change  letters typo option purchase remaining ticket board flight price seven times original ticket zero empathy shown zero alternative offered trusting ba staff pretence apparently board flight bought ticket immediately purchased ticket contacted ba change booking team informed situation service representative apologised told changed name cost fee offered cancel original ticket issue partial refund advising claim difference contact customer support told person times claim cost ticket explicitly accepting offer denied claiming cost ticket claim difference fair accepted offer lodged ticket customer support difference days british airways customer response informed unable lodge claim accepted cancellation f

#### Sentiment Analysis using TextBlob

In [139]:
analyzed_data = []

for index in range(0,len(preprocessed_text)):
    blob = TextBlob(str(preprocessed_text[index]))
    polarity_score = blob.sentiment.polarity
    if polarity_score > 0:
        sentiment = "Positive"
    elif polarity_score < 0:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    analyzed_data.append({'Review': cleaned_reviews[index], 'Polarity Score': polarity_score, 'Sentiment': sentiment})

print(analyzed_data)

[{'Review': "A simple story with an unfortunate outcome that really could happen to anyone. My partner and I recently started working after studying purchased two tickets to travel from London City Airport to Frankfurt. When we purchased the tickets, I mistakenly entered my name twice (e.g. Mr John Smith and Ms John Smith). Little did we know that our 1 simple mistake would cost us over 300 pounds. Upon arriving at the airport we were told there was no way to change the name (apparently they can only change 3 letters where there has been a typo?) and I had no other option to purchase the last remaining ticket if I wanted to board the flight - the price: almost seven times (!) higher than my original ticket. Zero empathy was shown. Zero alternative was offered. Trusting BA's staff and under the pretence that there was apparently no other way we could board the flight we bought this ticket. Immediately after I purchased the ticket I contacted BA's 'Commercial Change Booking Team' and inf
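The labelling rule above splits at exactly zero, so a review with a polarity very close to zero is still labelled Positive or Negative. The sketch below isolates that rule in a hypothetical helper, `label_polarity`, with an optional `neutral_band` parameter. The band is an assumption added for illustration (the notebook uses zero), and the value 0.05 is arbitrary, not tuned.

```python
def label_polarity(score: float, neutral_band: float = 0.0) -> str:
    """Map a polarity score in [-1, 1] to a sentiment label.

    Scores within +/- neutral_band of zero are treated as Neutral.
    With the default band of 0.0 this matches the rule used in the cell above.
    """
    if score > neutral_band:
        return "Positive"
    if score < -neutral_band:
        return "Negative"
    return "Neutral"


for score in (0.040931, -0.028704, 0.231548, 0.0):
    print(score, label_polarity(score), label_polarity(score, neutral_band=0.05))
```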

#### Writing Dictionary List to CSV file

In [158]:
# Output file; the same name is read back with pd.read_csv below
filename = 'sentiment_analysis_results.csv'

with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['Review', 'Polarity Score', 'Sentiment']
    # Default comma delimiter, so pd.read_csv can parse the file without extra arguments
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()

    # Each item already has exactly the keys listed in fieldnames
    for item in analyzed_data:
        writer.writerow(item)
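Whatever delimiter the writer uses has to be given to the reader as well (for `pd.read_csv`, through `sep=`); otherwise the columns will not be split as intended. A minimal in-memory round trip, assuming a hypothetical helper `round_trip` and a made-up sample row:

```python
import csv
import io


def round_trip(rows, fieldnames, delimiter=","):
    """Write rows to an in-memory CSV and read them back with the same delimiter."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter=delimiter,
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    writer.writerows(rows)

    buffer.seek(0)
    reader = csv.DictReader(buffer, delimiter=delimiter)
    return list(reader)


# Made-up row; the review text contains both ',' and ';' on purpose.
sample = [{"Review": "Good crew, but late; food was fine.",
           "Polarity Score": 0.25,
           "Sentiment": "Positive"}]
fields = ["Review", "Polarity Score", "Sentiment"]

loaded = round_trip(sample, fields, delimiter=";")
print(loaded[0]["Review"])
```

Fields that contain the delimiter are quoted on write and restored on read. Numeric values come back as strings when read with the `csv` module.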

In [159]:
df = pd.read_csv('sentiment_analysis_results.csv')

In [160]:
df.head()

| Unnamed: 0 | Review | Polarity Score | Sentiment |
|---|---|---|---|
| 0 | A simple story with an unfortunate outcome tha... | 0.040931 | Positive |
| 1 | Flight was delayed due to the inbound flight a... | -0.028704 | Negative |
| 2 | Fast and friendly check in (total contrast to ... | 0.231548 | Positive |
| 3 | I don't understand why British Airways is clas... | 0.243182 | Positive |
| 4 | I'm sure that BA have gradually made their eco... | -0.020833 | Negative |


In [161]:
df.Sentiment.value_counts()

Negative    60
Positive    36
Neutral      4
Name: Sentiment, dtype: int64

In [162]:
df.head(30)

| Unnamed: 0 | Review | Polarity Score | Sentiment |
|---|---|---|---|
| 0 | A simple story with an unfortunate outcome tha... | 0.040931 | Positive |
| 1 | Flight was delayed due to the inbound flight a... | -0.028704 | Negative |
| 2 | Fast and friendly check in (total contrast to ... | 0.231548 | Positive |
| 3 | I don't understand why British Airways is clas... | 0.243182 | Positive |
| 4 | I'm sure that BA have gradually made their eco... | -0.020833 | Negative |
| 5 | Customer Service does not exist. One world eme... | -0.009615 | Negative |
| 6 | Another really great pair of flights, on time,... | 0.406667 | Positive |
| 7 | Our A380 developed a fault taxiing to the runw... | 0.060417 | Positive |
| 8 | Horrible airline. Does not care about their cu... | -0.416667 | Negative |
| 9 | My family and I have flown mostly on British A... | 0.122222 | Positive |


In [163]:
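# Reload the results file. The Actual_Sentiment column shown below is not created
# anywhere in this notebook; it appears to have been added to the CSV separately.
df1 = pd.read_csv("sentiment_analysis_results.csv")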

In [164]:
df1.head()

| Unnamed: 0 | Review | Polarity Score | Sentiment | Actual_Sentiment |
|---|---|---|---|---|
| 0 | A simple story with an unfortunate outcome tha... | 0.040931 | Positive | Negative |
| 1 | Flight was delayed due to the inbound flight a... | -0.028704 | Negative | Negative |
| 2 | Fast and friendly check in (total contrast to ... | 0.231548 | Positive | Neutral |
| 3 | I don't understand why British Airways is clas... | 0.243182 | Positive | Neutral |
| 4 | I'm sure that BA have gradually made their eco... | -0.020833 | Negative | Negative |


In [167]:
sentiment_predictions = df1['Sentiment'].tolist()
actual_sentiment = df1['Actual_Sentiment'].tolist()

In [168]:
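# Convert the sentiment predictions and actual labels to numerical values (0 for Negative, 1 for Neutral, 2 for Positive).
# Note: the metric calls below use the string labels directly, so these numeric lists are not required by them.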

sentiment_mapping = {'Negative': 0, 'Neutral': 1, 'Positive': 2}
sentiment_predictions_numeric = [sentiment_mapping[s] for s in sentiment_predictions]
actual_sentiment_numeric = [sentiment_mapping[s] for s in actual_sentiment]

In [173]:
# Calculate accuracy
accuracy = accuracy_score(actual_sentiment, sentiment_predictions)
print("Accuracy:", accuracy)

# Calculate precision
precision = precision_score(actual_sentiment, sentiment_predictions, average='weighted')
print("Precision:", precision)

# Calculate recall
recall = recall_score(actual_sentiment, sentiment_predictions, average='weighted')
print("Recall:", recall)

# Calculate F1-score
f1 = f1_score(actual_sentiment, sentiment_predictions, average='weighted')
print("F1-score:", f1)

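# Calculate AUC (Area Under the Curve), one-vs-rest.
# Note: this uses one-hot encoded hard predictions rather than predicted probabilities,
# so each class contributes a single operating point and the value is only a coarse AUC.
# It also relies on both one-hot frames having the same three label columns.
auc = roc_auc_score(pd.get_dummies(actual_sentiment), pd.get_dummies(sentiment_predictions), multi_class='ovr')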
print("AUC:", auc)

Accuracy: 0.86
Precision: 0.8840555555555555
Recall: 0.86
F1-score: 0.8630624322747354
AUC: 0.8357752634271421


In [174]:
# Calculate the confusion matrix
conf_matrix = confusion_matrix(actual_sentiment, sentiment_predictions)
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[59  2  8]
 [ 0  2  3]
 [ 1  0 25]]
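The reported metrics can be re-derived by hand from this confusion matrix, which is a useful sanity check. The sketch below assumes the scikit-learn convention (rows are actual labels, columns are predictions, labels in sorted order: Negative, Neutral, Positive). `summarize` is a hypothetical helper. For the AUC it relies on the fact that, with hard 0/1 predictions, each one-vs-rest curve has a single operating point, so the per-class AUC reduces to the mean of sensitivity and specificity.

```python
# Confusion matrix as printed above: rows = actual, columns = predicted,
# label order Negative, Neutral, Positive.
matrix = [[59, 2, 8],
          [0, 2, 3],
          [1, 0, 25]]


def summarize(matrix):
    """Derive accuracy, support-weighted precision/recall/F1 and macro OvR AUC."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    support = [sum(row) for row in matrix]                        # actual count per class
    predicted = [sum(row[c] for row in matrix) for c in range(n)]  # predicted count per class
    correct = [matrix[k][k] for k in range(n)]

    precision = [correct[k] / predicted[k] if predicted[k] else 0.0 for k in range(n)]
    recall = [correct[k] / support[k] if support[k] else 0.0 for k in range(n)]
    f1 = [2 * p * r / (p + r) if p + r else 0.0 for p, r in zip(precision, recall)]

    # Specificity: share of "other class" reviews not predicted as class k.
    specificity = []
    for k in range(n):
        negatives = total - support[k]
        false_positives = predicted[k] - correct[k]
        specificity.append((negatives - false_positives) / negatives)

    def weighted(values):
        return sum(v * s for v, s in zip(values, support)) / total

    return {
        "accuracy": sum(correct) / total,
        "precision": weighted(precision),
        "recall": weighted(recall),
        "f1": weighted(f1),
        "auc": sum((recall[k] + specificity[k]) / 2 for k in range(n)) / n,
    }


metrics = summarize(matrix)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Weighted recall equals accuracy by construction, which is why both are reported as 0.86. The per-class precision values (59/60, 2/4 and 25/36) show that most of the errors come from reviews predicted as Positive or Neutral.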
