### Scraping data from Skytrax

The Skytrax site [https://www.airlinequality.com] hosts reviews for many airlines. For this task, we are only interested in the reviews of British Airways.

The British Airways reviews are listed at [https://www.airlinequality.com/airline-reviews/british-airways]. We can use `Python` and `BeautifulSoup` to walk through the paginated review listing and collect, for each review, its text, the reviewer's country, the recommendation, and the rating.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [26]:
reviews = []
recommendation = [] # 'yes' or 'no'
country = []
rating  = []        # 1-10  

In [27]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 35          # number of paginated pages to fetch
page_size = 100     # maximum number of reviews per page


for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page (the timeout avoids hanging on a stalled connection)
    response = requests.get(url, timeout=30)

    # Parse content
    content = response.content
    soup = BeautifulSoup(content, 'html.parser')
    for item in soup.find_all("div", {"class": "text_content"}):
        reviews.append(item.get_text())

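    # Reviewer country: the text right after the <span> in each <h3>,
    # stripped of surrounding spaces and parentheses
    for item in soup.find_all("h3"):
        country.append(item.span.next_sibling.text.strip(" ()"))

    # Recommendation: 'yes' or 'no'
    for item in soup.find_all("td", {"class": {"review-value rating-yes", "review-value rating-no"}}):
        recommendation.append(item.get_text())

    # Rating out of 10; a block without a <span> is recorded as "None"
    for item in soup.find_all("div", class_ = "rating-10"):
        try:
            rating.append(item.span.text)
        except AttributeError:
            print(f"Error on page {i}")
            rating.append("None")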
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
Scraping page 21
   ---> 2100 total reviews
Scraping page 22
   ---> 2200 total reviews
Scraping page 23
   ---> 2300 total reviews
Scrapi
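
The loop above hard-codes `pages = 35`, which matches the 3461 reviews available at the time (35 pages of up to 100 reviews). As a rough sketch of an alternative, the page loop could stop as soon as a page returns no reviews. This assumes that requesting a page past the last one yields no `text_content` blocks, which has not been verified against the site. The names `collect_until_empty`, `fake_fetch`, and `fake_site` are hypothetical, and the fetcher is faked here so that the example runs without network access.

```python
from typing import Callable, List


def collect_until_empty(fetch_reviews: Callable[[int], List[str]],
                        max_pages: int = 100) -> List[str]:
    """Call fetch_reviews(page) for page = 1, 2, ... until a page is empty."""
    collected: List[str] = []
    for page in range(1, max_pages + 1):
        page_reviews = fetch_reviews(page)
        if not page_reviews:
            # Assumed signal that we have run past the last page.
            break
        collected.extend(page_reviews)
    return collected


# Stand-in for the real network call: three fake pages with 2, 2 and 1 reviews.
fake_site = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}


def fake_fetch(page: int) -> List[str]:
    return fake_site.get(page, [])


all_reviews = collect_until_empty(fake_fetch)
print(len(all_reviews))
```

In the real scraper, `fetch_reviews` would wrap the `requests.get` and `soup.find_all` calls for a single page; `max_pages` is kept as a safety limit so the loop cannot run indefinitely.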

In [28]:
print(len(reviews))
reviews[0]

3461


'✅ Trip Verified |  This flight was one of the worst I have ever had in my life. I wanted to pamper myself, so I bought business class. I was looking forward to my new experience. I will not mention the chaos of changing gates several times, as these things may happen. What surprised me was the lack of attention to passengers. The flight was delayed by almost 3 hours. Even though staff offered vouchers, we had no idea where to get them, and we were told that we only had about 10 minutes to use them because boarding had already begun. Firstly, I did not see anyone with the voucher, and secondly, even if we got it, we were not able to use it. When I finally got to the airport, there was another waiting for about 30 minutes after cross check. Meantime, we were told that due to problems, they did not load any food, so the flight will be without any food on board. The only food offered and given to everyone on the plane was a small bag of nuts. As a business class passenger, I was offered d

In [29]:
print(len(recommendation))
recommendation[:5]

3461


['no', 'no', 'no', 'yes', 'yes']

In [30]:
print(len(country))
country[:5]

3461


['United Kingdom',
 'United States',
 'United Kingdom',
 'United Kingdom',
 'United Kingdom']

In [31]:
print(len(rating))
rating[:5]

3496


['\n\t\t\t\t\t\t\t\t\t\t\t\t\t5', '2', '3', '2', '9']

There are only 3461 user reviews, yet we collected 3496 `ratings`. The reason is that every paginated page also shows one `Overall Rating` block at the top, and our selector picks it up along with the per-review ratings. With 35 pages, that adds 35 extra values (3461 + 35 = 3496). The overall rating is the entry prefixed with `\n\t\t\t\t\t\t\t\t\t\t\t\t\t`, so we drop every rating that contains a tab character.

In [33]:
rating = [ x for x in rating if "\t" not in x ]

In [34]:
len(rating)

3461

Now all four lists have the same length: 3461.
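
Filtering on the tab character works here, but collecting each field into its own list still relies on every review having all four fields. One possible alternative, sketched below, is to iterate over each review's container and read the fields from inside it, so the page-level overall rating is never picked up and a missing field becomes `None` instead of shifting the lists. The markup in `sample_html` is hypothetical and simplified (an `article` element with class `review`); the real Skytrax HTML may be structured differently and should be inspected first.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one page-level overall rating followed by
# two review containers, the second of which has no rating.
sample_html = """
<div class="rating-10"><span>
            5</span></div>
<article class="review">
  <div class="rating-10"><span>2</span></div>
  <div class="text_content">First review text</div>
</article>
<article class="review">
  <div class="text_content">Second review text, no rating given</div>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for review in soup.find_all("article", class_="review"):
    # Search only inside this review, so the overall rating is ignored.
    rating_div = review.find("div", class_="rating-10")
    text_div = review.find("div", class_="text_content")

    has_rating = rating_div is not None and rating_div.span is not None
    rows.append({
        "Reviews": text_div.get_text(strip=True) if text_div is not None else None,
        "Rating": rating_div.span.get_text(strip=True) if has_rating else None,
    })

print(rows)
```

Because each row is built from a single container, the resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)` without any length check.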

## Exporting the data

In [36]:
df = pd.DataFrame({'Reviews':reviews, 'Country':country, 'Recommended':recommendation, 'Rating':rating})
df.head()

Unnamed: 0,Reviews,Country,Recommended,Rating
0,✅ Trip Verified | This flight was one of the ...,United Kingdom,no,2
1,Not Verified | It seems that there is a race t...,United States,no,3
2,Not Verified | As a Spanish born individual l...,United Kingdom,no,2
3,✅ Trip Verified | A rather empty and quiet fl...,United Kingdom,yes,9
4,✅ Trip Verified | Easy check in and staff mem...,United Kingdom,yes,9


In [37]:
df.to_csv("data/BA_reviews.csv")
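
Two details of the export are worth noting: `to_csv` does not create the `data/` directory if it is missing, and by default it also writes the DataFrame index, which appears as an `Unnamed: 0` column when the file is read back with `pd.read_csv`. The following is a minimal sketch of an export that handles both, using a small made-up DataFrame and a temporary directory in place of the real data and path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the scraped data, with the same column names.
df = pd.DataFrame({
    "Reviews": ["great flight", "poor service"],
    "Country": ["United Kingdom", "United States"],
    "Recommended": ["yes", "no"],
    "Rating": ["9", "2"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "BA_reviews.csv"

    # Create the target directory first; to_csv will not do it for us.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False leaves the row index out of the file.
    df.to_csv(out_path, index=False)

    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```

Whether to keep the index is a design choice: if the row number is needed later it can stay, and `pd.read_csv(path, index_col=0)` restores it on load.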