# Scraping data from Skytrax

I am going to scrape https://www.airlinequality.com, which hosts a large number of airline reviews. For this task, we are only interested in reviews of Air India.
I am going to use Python and BeautifulSoup to request the paginated Air India review listing and collect the rating, header, date and text of each review from those pages.

### Important Libraries

In [13]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

### Web Scraping Code

In [19]:
base_url = "https://www.airlinequality.com/airline-reviews/air-india"  # no trailing slash: the loop adds "/page/..."
pages = 35
page_size = 100

rating = []
reviews = []
headers = []
dates = []

for i in range(1, pages + 1):

    # Build the URL of listing page i, newest reviews first
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse the HTML of the page
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    # print(parsed_content.prettify())

    # Each field is collected page-wide into its own list,
    # so the four lists can end up with different lengths.
    for r in parsed_content.find_all("span", {"itemprop": "ratingValue"}):
        rating.append(r.get_text())
    for h in parsed_content.find_all("h2", {"class": "text_header"}):
        headers.append(h.get_text())
    for t in parsed_content.find_all("time", {"itemprop": "datePublished"}):
        dates.append(t.get_text())
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())


In [20]:
df = pd.DataFrame()
df["date"] = dates
df["summary"] = headers
df["reviews"] = reviews
df["rating"] = rating
df.head()

ValueError: Length of values (1213) does not match length of index (1073)
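
Before fixing the error, it helps to see which lists are out of step. Here is a minimal sketch of such a check. The `collected` dictionary and the `report_lengths` helper are hypothetical names introduced for illustration, and the short lists are stand-ins for the real scraped ones.

```python
# Hypothetical stand-ins for the scraped lists; the real ones come from the loop above.
collected = {
    "date": ["5th March 2023", "4th March 2023"],
    "summary": ["header 1", "header 2", "extra header"],
    "reviews": ["text 1", "text 2"],
    "rating": ["4", "1", "1"],
}

def report_lengths(columns):
    """Return {name: length} and print every list whose length differs from the first one."""
    lengths = {name: len(values) for name, values in columns.items()}
    expected = next(iter(lengths.values()))
    for name, n in lengths.items():
        if n != expected:
            print(f"{name}: {n} items, expected {expected}")
    return lengths

lengths = report_lengths(collected)
```

With the real data, the same call would take `{"date": dates, "summary": headers, "reviews": reviews, "rating": rating}`.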

We will fix this error:
The lists have different lengths, so they cannot all be assigned directly as columns. Wrapping a list or NumPy array in the pandas Series() function before assignment avoids the error, because pandas aligns the Series with the index of the DataFrame: if the list is shorter than the DataFrame, the missing rows are filled with NaN, and if it is longer, the extra values are dropped. Note that this only makes the lengths fit; it does not guarantee that the values in one row belong to the same review.
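
To make this behaviour concrete, here is a small sketch on toy data. The `df_demo` frame and its values are made up for illustration and are not part of the scraped dataset.

```python
import pandas as pd

df_demo = pd.DataFrame()
df_demo["date"] = ["d1", "d2", "d3"]   # the first column fixes the index at 0, 1, 2

# Shorter list: the missing position (index 2) is filled with NaN.
df_demo["short"] = pd.Series(["a", "b"])

# Longer list: the value at index 3 has no matching row and is dropped silently.
df_demo["long"] = pd.Series(["w", "x", "y", "z"])

print(df_demo)
```

The frame keeps three rows in both cases, which is why the ValueError disappears even though the extra or missing values are still unaccounted for.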

In [21]:
df = pd.DataFrame()
df["date"] = dates
df["summary"] = pd.Series(headers)
df["reviews"] = reviews
df["rating"] = pd.Series(rating)
df.head()

Unnamed: 0,date,summary,reviews,rating
0,5th March 2023,"""Horrible service""","Not Verified | Horrible service, the aircraft...",\n\t\t\t\t\t\t\t\t\t\t\t\t\t4
1,4th March 2023,"""I was very disappointed""",✅ Trip Verified | I was very disappointed tha...,1
2,3rd March 2023,"""Service gone so down""",✅ Trip Verified | I was flying with my friend...,1
3,3rd March 2023,"""Food was below average quality""",✅ Trip Verified | Flight out on time Good ser...,2
4,2nd March 2023,"""Not a good value for money""",✅ Trip Verified | My seats were changed as it...,5
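
The rating in the first row above carries a run of newlines and tabs, which suggests that at least one collected value came from a differently formatted element than the others. One way to keep the columns aligned is to parse each review container on its own instead of collecting page-wide lists. The sketch below assumes that every review is wrapped in an `<article itemprop="review">` element. I have not verified that assumption here, and `parse_reviews` and `sample_html` are hypothetical names with hand-written sample markup.

```python
from bs4 import BeautifulSoup

def parse_reviews(html):
    """Return one dict per review container so the fields of a review stay together."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Assumption: each review sits inside <article itemprop="review">.
    for article in soup.find_all("article", {"itemprop": "review"}):

        def text_of(tag, attrs):
            node = article.find(tag, attrs)
            # .strip() removes surrounding newlines and tabs; None marks a missing field
            return node.get_text().strip() if node else None

        rows.append({
            "date": text_of("time", {"itemprop": "datePublished"}),
            "summary": text_of("h2", {"class": "text_header"}),
            "reviews": text_of("div", {"class": "text_content"}),
            "rating": text_of("span", {"itemprop": "ratingValue"}),
        })
    return rows

# Hand-written sample, not real Skytrax markup
sample_html = """
<span itemprop="ratingValue">
            4</span>
<article itemprop="review">
  <span itemprop="ratingValue">1</span>
  <h2 class="text_header">Example header</h2>
  <time itemprop="datePublished">1st January 2023</time>
  <div class="text_content">Example review text</div>
</article>
<article itemprop="review">
  <h2 class="text_header">No rating given</h2>
  <time itemprop="datePublished">2nd January 2023</time>
  <div class="text_content">Another example</div>
</article>
"""
rows = parse_reviews(sample_html)
print(rows)
```

Passing the resulting list to `pd.DataFrame(rows)` would give one row per review, with a missing rating showing up as an empty cell in that row instead of shifting the rows below it.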


Now we will check how many reviews we have collected by scraping.

In [22]:
len(df)

1073

We now save this data to a .csv file.

In [23]:
df.to_csv("AirIndia_reviews.csv")
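
By default, `to_csv` also writes the row index, and reading the file back then produces an extra `Unnamed: 0` column. Here is a small sketch of the difference on toy data. `df_demo` is made up for illustration, and the in-memory buffers stand in for a file on disk.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"summary": ["a", "b"], "rating": ["4", "1"]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the row index out of the file.
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whether to keep the index is a matter of taste. Passing `index_col=0` to `pd.read_csv` is another way to deal with it when the file has already been written.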

Our scraping is done! Now we have a dataset for sentiment analysis.