```Name: Wong Wen Bing```    
```Admin #: 230436M```  
```PEM GROUP: AA2303```

# **Part 1: Data Scraping Notebook (1/3)** 
This part covers data collection by web scraping textual data. To gather reviews for classifying the respective airlines, I was tasked with scraping reviews from airlinequality.com, a reviews website under Skytrax.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

We will use Beautiful Soup for the scraping, as the website blocks Selenium bots. We start by creating a list to hold the reviews.

In [2]:
reviews=[]
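
As a minimal sketch of how Beautiful Soup fills such a list, the snippet below parses a small static HTML string instead of a live page. The HTML fragment and the `article.review` selector are made up for illustration; only the `h2.text_header` and `div.text_content` class names match the selectors used later in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment for illustration, not copied from the live site.
html = """
<article class="review">
  <h2 class="text_header">"Great flight"</h2>
  <div class="text_content">Friendly crew and on-time arrival.</div>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
reviews = []  # same accumulator pattern as in the notebook

# Each matching <article> becomes one dictionary in the list.
for article in soup.select("article.review"):
    reviews.append({
        "title": article.select_one("h2.text_header").get_text(strip=True),
        "content": article.select_one("div.text_content").get_text(strip=True),
    })

print(reviews)
```

Collecting plain dictionaries keeps each review self-describing, which makes the later conversion to a DataFrame a one-liner.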

Next, we define the airlines that we want to scrape: Singapore Airlines, Scoot, Southwest Airlines, Emirates, Ryanair and British Airways.

In [3]:
airlines=['singapore-airlines', 
          'scoot', 
          'southwest-airlines',
          'emirates',
          'ryanair',
          'british-airways']
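
Each entry in the list is the slug that appears in the airline's review URL. A small sketch of how the slugs map to page URLs is shown below; `BASE_URL` and `page_urls` are hypothetical names introduced here, while the URL pattern itself is the one used by the scraping loop in this notebook.

```python
BASE_URL = "https://www.airlinequality.com/airline-reviews"  # pattern taken from the notebook

def page_urls(airline, n_pages):
    """Return the review-page URLs for pages 1..n_pages of one airline slug."""
    # range() excludes its stop value, so n_pages + 1 is needed to include the last page.
    return [f"{BASE_URL}/{airline}/page/{i}" for i in range(1, n_pages + 1)]

urls = page_urls("scoot", 3)
for u in urls:
    print(u)
```

Building the URLs up front makes it easy to check how many pages will actually be requested before any network traffic happens.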

The code below runs the web scraping script. I have set a standard 200 pages to be scraped from each airline, to keep the dataset as balanced as possible.    
We will obtain the title of the review, the content of the review, the star rating read from the review's ratings table and the airline name.  

In [4]:
for airline in airlines:
    for i in range(1, 201):  # pages 1 to 200 inclusive
        url = f'https://www.airlinequality.com/airline-reviews/{airline}/page/{i}'
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Locate the main review section
        articles = soup.select("article.comp.comp_reviews-airline.querylist article")
        if not articles:
            print(f"No reviews found on page {i}.")
            continue
        for article in articles:
            try:
                # Extract the review title
                title = article.select_one("h2.text_header").text.strip()

                # Extract the review content
                txt = article.select_one("div.text_content").text.strip()

                # Extract the rating; reset it for every review so that a
                # missing table cannot reuse the previous review's value
                rating = None
                table = article.find("table", class_="review-ratings")
                if table:
                    rows = table.find_all("tr")
                    for row in rows:
                        # Count the filled stars; the last row with stars sets the rating
                        try:
                            stars = row.find_all("span")
                            if stars:
                                rating = sum(1 for star in stars if "fill" in star.get("class", []))
                        except Exception as e:
                            print(f"Error extracting rating: {e}")
                # Append the review data
                reviews.append({
                    'title': title,
                    'content': txt,
                    'rating': rating,
                    'airline': airline,
                })
            except Exception as e:
                print(f"Error parsing review: {e}")
                continue

        print(f"{url} done.")
    print(f'{airline} done.')

https://www.airlinequality.com/airline-reviews/singapore-airlines/page/1 done.
https://www.airlinequality.com/airline-reviews/singapore-airlines/page/2 done.
https://www.airlinequality.com/airline-reviews/singapore-airlines/page/3 done.
https://www.airlinequality.com/airline-reviews/singapore-airlines/page/4 done.
https://www.airlinequality.com/airline-reviews/singapore-airlines/page/5 done.
https://www.airlinequality.com/airline-reviews/singapore-airlines/page/6 done.
https://www.airlinequality.com/airline-reviews/singapore-airlines/page/7 done.
https://www.airlinequality.com/airline-reviews/singapore-airlines/page/8 done.
https://www.airlinequality.com/airline-reviews/singapore-airlines/page/9 done.
https://www.airlinequality.com/airline-reviews/singapore-airlines/page/10 done.
https://www.airlinequality.com/airline-reviews/singapore-airlines/page/11 done.
https://www.airlinequality.com/airline-reviews/singapore-airlines/page/12 done.
https://www.airlinequality.com/airline-reviews/si
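
The per-review extraction in the loop above can be exercised offline on a static fragment, which is a convenient way to check the rating logic without hitting the website. The sketch below is an illustration under assumptions: the HTML is invented, and `count_filled_stars` and `parse_review` are hypothetical helper names. One detail worth isolating is that the rating should start as `None` for each review, otherwise a review without any star rows would inherit a stale value.

```python
from bs4 import BeautifulSoup

def count_filled_stars(row):
    """Count <span> elements in a table row whose class list contains 'fill'."""
    return sum(1 for star in row.find_all("span") if "fill" in star.get("class", []))

def parse_review(article, airline):
    """Return one review dict; rating stays None if no star row is found."""
    rating = None  # reset per review so nothing leaks from a previous one
    table = article.find("table", class_="review-ratings")
    if table:
        for row in table.find_all("tr"):
            if row.find_all("span"):
                rating = count_filled_stars(row)  # the last row with stars wins
    return {
        "title": article.select_one("h2.text_header").get_text(strip=True),
        "content": article.select_one("div.text_content").get_text(strip=True),
        "rating": rating,
        "airline": airline,
    }

# Invented fragment: one review with two star rows, one review with no table.
html = """
<article>
  <h2 class="text_header">"Smooth trip"</h2>
  <div class="text_content">Short and pleasant.</div>
  <table class="review-ratings">
    <tr><td>Seat Comfort</td><td><span class="star fill">1</span>
        <span class="star fill">2</span><span class="star">3</span></td></tr>
    <tr><td>Value For Money</td><td><span class="star fill">1</span>
        <span class="star fill">2</span><span class="star fill">3</span>
        <span class="star fill">4</span><span class="star">5</span></td></tr>
  </table>
</article>
<article>
  <h2 class="text_header">"No ratings"</h2>
  <div class="text_content">Reviewer left the table empty.</div>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
parsed = [parse_review(a, "scoot") for a in soup.find_all("article")]
print(parsed)
```

Because the last row with stars determines the value, the stored rating reflects whichever star category appears last in the table rather than an average across rows.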

Once all the reviews are scraped, we convert the list to a pandas DataFrame before saving it as a CSV file.

In [5]:
reviews=pd.DataFrame(reviews)
reviews.head()

Unnamed: 0,title,content,rating,airline
0,"""All I got were apologies""",✅ Trip Verified | I did the automatic 48-hr ...,4,singapore-airlines
1,"""thoughtful, polite and lovely stewardess""",✅ Trip Verified | It makes my day to be serv...,3,singapore-airlines
2,"""The product has changed""",✅ Trip Verified | My first time on SQ since ...,5,singapore-airlines
3,"""did everything to support me after that""",✅ Trip Verified | I want to acknowledge the ...,4,singapore-airlines
4,“Outstanding service”,✅ Trip Verified | Outstanding service from the...,4,singapore-airlines
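
Since the same page cap per airline is meant to keep the dataset balanced, a quick count of reviews per airline is a useful sanity check at this point. The sketch below uses a tiny made-up DataFrame (`sample`) with the same `airline` and `rating` columns; the values are illustrative and are not results from the scrape.

```python
import pandas as pd

# Made-up rows for illustration only; not data from the scrape.
sample = pd.DataFrame({
    "airline": ["scoot", "scoot", "ryanair", "emirates", "emirates", "emirates"],
    "rating": [3, 4, 2, 5, 4, 4],
})

# Number of reviews per airline, to see how balanced the classes are.
counts = sample["airline"].value_counts()
print(counts)

# Number of missing ratings, in case some reviews had no star rows.
missing_ratings = sample["rating"].isna().sum()
print(missing_ratings)
```

On the real data the same two calls can be applied directly to `reviews`.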


In [6]:
reviews.to_csv('wenbing.csv')
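
A column named `Unnamed: 0`, like the one in the preview above, typically appears when a CSV written with the default index is read back with `pd.read_csv`. The sketch below shows the difference using an in-memory buffer and a one-row made-up DataFrame; whether to pass `index=False` here depends on what the later notebooks expect, so treat it as an option rather than a required change.

```python
import io
import pandas as pd

# One made-up row with the same columns as the scraped data.
df = pd.DataFrame([{"title": "t1", "content": "c1", "rating": 4, "airline": "scoot"}])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```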