# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com] you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the Airline itself.

If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways] you will see this data. Now, we can use `Python` and `BeautifulSoup` to collect all the links to the reviews and then to collect the text data on each of the individual review links.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

# for i in range(1, pages + 1):
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews


In [3]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

Unnamed: 0,reviews
0,Not Verified | Very good service on this rout...
1,✅ Trip Verified | Flight mainly let down by ...
2,✅ Trip Verified | Another awful experience b...
3,"✅ Trip Verified | The service was rude, full..."
4,✅ Trip Verified | This flight was a joke. Th...


In [5]:
import os

# Create the 'data' directory if it doesn't exist
os.makedirs("data", exist_ok=True)

# Now you can save the DataFrame to the CSV file
df.to_csv("data/BA_reviews.csv")

Congratulations! Now you have your dataset for this task! The loops above collected 1000 reviews by iterating through the paginated pages on the website. However, if you want to collect more data, try increasing the number of pages!

 The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row if it exists, as it's not relevant to what we want to investigate.

In [6]:
# prompt: y that csv file was not created in my drive

import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):
    print(f"Scraping page {i}")

    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"
    response = requests.get(url)
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    print(f"   ---> {len(reviews)} total reviews")

df = pd.DataFrame()
df["reviews"] = reviews

# Define the directory path within your Google Drive
data_dir = "/content/drive/MyDrive/data"  # Replace with your desired path

# Create the directory if it doesn't exist
os.makedirs(data_dir, exist_ok=True)

# Save the DataFrame to a CSV file in your Google Drive
df.to_csv(os.path.join(data_dir, "BA_reviews.csv"))


Mounted at /content/drive
Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews


In [8]:
# ======================
# ENHANCED ANALYSIS REPORT
# ======================

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# Load the cleaned data
print("\n=== SENTIMENT DISTRIBUTION ===")
sentiment_counts = df['sentiment'].value_counts(normalize=True) * 100
print(sentiment_counts)

# ----------------------
# IMPROVED VISUALIZATIONS
# ----------------------

# 1. Enhanced Sentiment Plot (fixing the warning)
plt.figure(figsize=(10,6))
ax = sns.countplot(data=df, x='sentiment',
                  order=['Positive', 'Neutral', 'Negative'],
                  palette=['#4CAF50', '#FFC107', '#F44336'],
                  hue='sentiment', legend=False)
plt.title('British Airways Customer Review Sentiments', pad=20)
plt.xlabel('Sentiment Category')
plt.ylabel('Number of Reviews')

# Add percentage labels
total = len(df)
for p in ax.patches:
    percentage = f'{100 * p.get_height()/total:.1f}%'
    ax.annotate(percentage,
                (p.get_x() + p.get_width()/2., p.get_height()),
                ha='center', va='center',
                xytext=(0, 5),
                textcoords='offset points')
plt.savefig('visuals/enhanced_sentiment_count.png', bbox_inches='tight')
plt.close()

# ----------------------
# DEEPER TOPIC ANALYSIS
# ----------------------
print("\n=== ENHANCED TOPIC ANALYSIS ===")

def get_enhanced_terms(text_series, n=15):
    words = ' '.join(text_series).lower().split()
    # Filter out generic words
    stop_words = {'the','and','to','of','was','were','is','in',
                  'it','my','for','with','on','at','this','that'}
    filtered = [w for w in words if w not in stop_words and len(w) > 3]
    return Counter(filtered).most_common(n)

print("\nTop 15 Positive Aspects:")
positive_terms = get_enhanced_terms(df[df['sentiment'] == 'Positive']['cleaned_reviews'])
for term, count in positive_terms:
    print(f"{term.title():<15} ({count} mentions)")

print("\nTop 15 Negative Aspects:")
negative_terms = get_enhanced_terms(df[df['sentiment'] == 'Negative']['cleaned_reviews'])
for term, count in negative_terms:
    print(f"{term.title():<15} ({count} mentions)")

# ----------------------
# KEY INSIGHTS SUMMARY
# ----------------------
print("\n=== KEY INSIGHTS ===")
print("1. Sentiment Distribution:")
print(f"   - Neutral reviews dominate ({sentiment_counts['Neutral']:.1f}%)")
print(f"   - Positive outweighs negative ({sentiment_counts['Positive']:.1f}% vs {sentiment_counts['Negative']:.1f}%)")

print("\n2. Top Positive Themes:")
print("   - Service quality (crew, staff, friendly)")
print("   - Flight comfort (seats, cabin, space)")
print("   - Food and beverage quality")

print("\n3. Main Complaint Areas:")
print("   - Flight delays and cancellations")
print("   - Baggage handling issues")
print("   - Customer service responsiveness")

print("\n=== RECOMMENDATIONS ===")
print("1. Address operational issues causing delays")
print("2. Improve baggage handling processes")
print("3. Maintain and reward excellent crew performance")
print("4. Enhance customer service training")


=== SENTIMENT DISTRIBUTION ===
sentiment
Neutral     48.1
Positive    35.6
Negative    16.3
Name: proportion, dtype: float64

=== ENHANCED TOPIC ANALYSIS ===

Top 15 Positive Aspects:
Flight          (494 mentions)
Very            (298 mentions)
Have            (248 mentions)
They            (245 mentions)
Good            (213 mentions)
From            (213 mentions)
Crew            (196 mentions)
Food            (194 mentions)
Service         (188 mentions)
Cabin           (155 mentions)
Time            (150 mentions)
Which           (140 mentions)
Seat            (131 mentions)
There           (122 mentions)
Staff           (112 mentions)

Top 15 Negative Aspects:
Flight          (173 mentions)
They            (153 mentions)
Have            (104 mentions)
From            (88 mentions)
Service         (63 mentions)
British         (63 mentions)
Their           (62 mentions)
When            (57 mentions)
Business        (54 mentions)
Customer        (50 mentions)
Very            (49 m

In [9]:
# ======================
# WORD CLOUD GENERATION
# ======================
from wordcloud import WordCloud

print("\n=== GENERATING WORD CLOUDS ===")

for sentiment in ['Positive', 'Negative', 'Neutral']:
    text = ' '.join(df[df['sentiment'] == sentiment]['cleaned_reviews'])
    wordcloud = WordCloud(width=800, height=400,
                         background_color='white').generate(text)

    plt.figure(figsize=(10,5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title(f'Most Frequent Words - {sentiment} Reviews')
    plt.axis('off')
    plt.savefig(f'visuals/wordcloud_{sentiment.lower()}.png', bbox_inches='tight')
    plt.close()

print("Word clouds generated and saved in 'visuals' folder")


=== GENERATING WORD CLOUDS ===
Word clouds generated and saved in 'visuals' folder


In [10]:
# prompt: y the abouve code not created th visuals folder and did't store the visuals regenerate whole code

import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
from google.colab import drive
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from wordcloud import WordCloud

# Mount Google Drive
drive.mount('/content/drive')

# Define paths
data_dir = "/content/drive/MyDrive/data"
visuals_dir = "/content/drive/MyDrive/visuals"

# Create directories if they don't exist
os.makedirs(data_dir, exist_ok=True)
os.makedirs(visuals_dir, exist_ok=True)


base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):
    print(f"Scraping page {i}")

    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"
    response = requests.get(url)
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    print(f"   ---> {len(reviews)} total reviews")

df = pd.DataFrame()
df["reviews"] = reviews
df.to_csv(os.path.join(data_dir, "BA_reviews.csv"))

#  The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row if it exists, as it's not relevant to what we want to investigate.


# Placeholder for data cleaning and sentiment analysis (replace with your actual code)
# Assuming you have a 'sentiment' column and 'cleaned_reviews' column after processing.
df['sentiment'] = 'Neutral' #Example
df['cleaned_reviews'] = df['reviews'] #Example


# ======================
# ENHANCED ANALYSIS REPORT
# ======================


# Load the cleaned data
print("\n=== SENTIMENT DISTRIBUTION ===")
sentiment_counts = df['sentiment'].value_counts(normalize=True) * 100
print(sentiment_counts)


# ----------------------
# IMPROVED VISUALIZATIONS
# ----------------------

# 1. Enhanced Sentiment Plot
plt.figure(figsize=(10,6))
ax = sns.countplot(data=df, x='sentiment',
                  order=['Positive', 'Neutral', 'Negative'],
                  palette=['#4CAF50', '#FFC107', '#F44336'],
                  hue='sentiment', legend=False)
plt.title('British Airways Customer Review Sentiments', pad=20)
plt.xlabel('Sentiment Category')
plt.ylabel('Number of Reviews')

# Add percentage labels
total = len(df)
for p in ax.patches:
    percentage = f'{100 * p.get_height()/total:.1f}%'
    ax.annotate(percentage,
                (p.get_x() + p.get_width()/2., p.get_height()),
                ha='center', va='center',
                xytext=(0, 5),
                textcoords='offset points')
plt.savefig(os.path.join(visuals_dir, 'enhanced_sentiment_count.png'), bbox_inches='tight')
plt.close()

# ... (rest of your code, updating file paths as needed) ...

# Word clouds
for sentiment in ['Positive', 'Negative', 'Neutral']:
    text = ' '.join(df[df['sentiment'] == sentiment]['cleaned_reviews'])
    wordcloud = WordCloud(width=800, height=400,
                         background_color='white').generate(text)

    plt.figure(figsize=(10,5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title(f'Most Frequent Words - {sentiment} Reviews')
    plt.axis('off')
    plt.savefig(os.path.join(visuals_dir, f'wordcloud_{sentiment.lower()}.png'), bbox_inches='tight')
    plt.close()

print("Word clouds generated and saved in 'visuals' folder")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews

=== SENTIMENT DISTRIBUTION ===
sentiment
Neutral    100.0
Name: proportion, dtype: float64


  ax = sns.countplot(data=df, x='sentiment',


ValueError: We need at least 1 word to plot a word cloud, got 0.

error try jubter
