# web-scraping-british-airways

Use the "Run" button to execute the code.

Pick a website and describe your objective

  -  Browse through different sites and pick on to scrape. Check the "Project Ideas" section for inspiration.
  -  Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
  -  Summarize your project idea and outline your strategy in a Juptyer notebook. Use the "New" button above


### Project Outline

 >   We're going to scrape https://www.airlinequality.com/airline-reviews/british-airways/
 >   We'll get a list of date, country, reviews, ratings, comments.
 >   For the final data we'll create a CSV file in the following format:
    ['date', 'country', 'reviews', 'ratings', 'comments']


In [1]:
!pip install pandas numpy --quiet

In [2]:
!pip install beautifulsoup4 requests --upgrade --quiet

In [3]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests 

In [4]:
reviews_url = 'https://www.airlinequality.com/airline-reviews/british-airways/?sortby=post_date%3ADesc&pagesize=100'

In [5]:
response = requests.get(reviews_url)

In [6]:
response.status_code

200

In [7]:
len(response.text)

630227

In [8]:
page_contents = response.text

In [9]:
page_contents[0:1000]

'<!doctype html>\n\n<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7 lt-ie10" lang="en-GB"> <![endif]-->\n<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8 lt-ie10" lang="en-GB"> <![endif]-->\n<!--[if IE 8]>    <html class="no-js lt-ie9 lt-ie10" lang="en-GB"> <![endif]-->\n<!--[if IE 9]>    <html class="no-js lt-ie10" lang="en-GB"> <![endif]-->\n<!--[if gt IE 8]><!-->\n<html lang="en-GB">\n<!--<![endif]-->\n\n<head>\n    <meta charset="utf-8">\n\n    <title>British Airways Customer Reviews - SKYTRAX</title>\n\n    <!-- Google Chrome Frame for IE -->\n    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n\n    <!-- mobile meta -->\n    <meta name="HandheldFriendly" content="True">\n    <meta name="MobileOptimized" content="320">\n    <meta name="viewport"\n        content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" />\n    <!-- icons & favicons -->\n    <link rel="apple-touch-icon" href="https://www.airlinequali

In [10]:
with open('webpage.html', 'w') as f:
    f.write(page_contents)

In [11]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [12]:
selection_class = 'text_sub_header userStatusWrapper'

customer_info = doc.find_all('h3', {'class': selection_class})

In [13]:
len(customer_info)

100

In [14]:
customer_info[0:3]

[<h3 class="text_sub_header userStatusWrapper">
 <span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
 <span itemprop="name">Pradeep Madhavan</span></span> (United Kingdom) <time datetime="2023-07-09" itemprop="datePublished">9th July 2023</time></h3>,
 <h3 class="text_sub_header userStatusWrapper">
 <span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
 <span itemprop="name">Jeffrey Rice</span></span> (United States) <time datetime="2023-07-09" itemprop="datePublished">9th July 2023</time></h3>,
 <h3 class="text_sub_header userStatusWrapper">
 <span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
 <span itemprop="name">Bridget Fagan</span></span> (United Kingdom) <time datetime="2023-07-08" itemprop="datePublished">8th July 2023</time></h3>]

In [15]:
customer_info_1 = []
for info in customer_info:
    customer_info_1.append(info.text.strip())

In [16]:
customer_info_1[0:10]

['Pradeep Madhavan (United Kingdom) 9th July 2023',
 'Jeffrey Rice (United States) 9th July 2023',
 'Bridget Fagan (United Kingdom) 8th July 2023',
 'Bervin Hedman (United Kingdom) 6th July 2023',
 'Alastair Cockburn (South Africa) 5th July 2023',
 'S Carlsen (United Kingdom) 5th July 2023',
 'A Diamantopoulos (Greece) 4th July 2023',
 'Carlos Whilhelm (Italy) 3rd July 2023',
 'S Warten (Senegal) 2nd July 2023',
 'Kapil Tyagi (United States) 30th June 2023']

### country

In [17]:
country = []    # create an empty list to collect country the reviewer is from
for info in customer_info:
    country.append(info.span.next_sibling.strip(" ()"))

In [18]:
country[0:10]

['United Kingdom',
 'United States',
 'United Kingdom',
 'United Kingdom',
 'South Africa',
 'United Kingdom',
 'Greece',
 'Italy',
 'Senegal',
 'United States']

### overall info

In [51]:
# from https://github.com/hseju/British-Airways-Good-or-Bad/blob/main/Data%20Collection/Data-collection.ipynb

In [40]:
#create an empty list to collect all reviews
reviews  = []

#create an empty list to collect all comments
comments  = []

#create an empty list to collect rating stars
ratings = []

#create an empty list to collect date
date = []

#create an empty list to collect country the reviewer is from
country = []

In [41]:
for i in range(0, 10):
    page = requests.get(f"https://www.airlinequality.com/airline-reviews/british-airways/page/{i}/?sortby=post_date%3ADesc&pagesize=100")
    doc = BeautifulSoup(page_contents, 'html.parser')

    #review
    review_class = 'text_header'
    for item in doc.find_all('h2', {'class': review_class}):
         reviews.append(item.text)

    #country
    for item in doc.find_all("h3"):
        country.append(item.span.next_sibling.text.strip(" ()"))
    
    #comments
    for item in doc.find_all("div", class_="text_content"):
        comments.append(item.text)
    
    #ratings
    for item in doc.find_all("div", class_ = "rating-10"):
        try:
            ratings.append(item.span.text)
        except:
            print(f"Error on page {i}")
            ratings.append("None")
            
    #date
    for item in doc.find_all("time"):
        date.append(item.text)
        


In [42]:
#check the length of total reviews extracted
len(reviews)

1000

In [43]:
len(country)

1000

In [44]:
len(ratings)

1010

In [45]:
(ratings[0], 
ratings[101],
ratings[202],
ratings[303],
ratings[404],
ratings[505],
ratings[606],
ratings[707],
ratings[808],
ratings[909])

('\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5')

In [46]:
(ratings.pop(0), 
ratings.pop(100),
ratings.pop(200),
ratings.pop(300),
ratings.pop(400),
ratings.pop(500),
ratings.pop(600),
ratings.pop(700),
ratings.pop(800),
ratings.pop(900))

('\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t5')

In [47]:
len(ratings)

1000

In [48]:
#create  a dataframe from these collected lists of data

df = pd.DataFrame({"date":date, "country": country, "reviews": reviews, "ratings": ratings, "comments": comments})

### final data

In [49]:
df

Unnamed: 0,date,country,reviews,ratings,comments
0,9th July 2023,United Kingdom,"""Things have really deteriorated""",4,✅ Trip Verified | My family and I have flown ...
1,9th July 2023,United States,"""I will never fly this airline again""",2,✅ Trip Verified | This has been by far the wo...
2,8th July 2023,United Kingdom,"""asked for an explanation but have received none""",2,✅ Trip Verified | In Nov 2022 I booked and pa...
3,6th July 2023,United Kingdom,"""short-changing passengers""",4,Not Verified | BA is not treating its premium ...
4,5th July 2023,South Africa,"""Economy is absolutely awful""",1,✅ Trip Verified | 24 hours before our departu...
...,...,...,...,...,...
995,18th March 2023,India,"""Very impressive and efficient""",8,✅ Trip Verified | First time flying with Briti...
996,18th March 2023,United States,"""We are done with BA""",3,✅ Trip Verified | The latest affront. Stood i...
997,17th March 2023,United States,"""I was left stranded at the airport""",1,Not Verified | Booked a flight return flight ...
998,17th March 2023,Netherlands,"""I will never fly with them again""",1,✅ Trip Verified | I tried to check in on line...


### Export the data into a csv format

In [50]:
df.to_csv('british_airways_reviews.csv', index = None)