<a href="https://colab.research.google.com/github/irisvanwalraven/odcm/blob/main/KAYAK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <span style="color:orange">Scraping data from KAYAK</span>

For the course Online Data Collection & Management at Tilburg University, our team scrapes data from the website kayak.com. KAYAK is a service that compares airline tickets. On its Explore page, KAYAK suggests a set of destinations based on the airport of departure. The purpose of this scraper is to build a dataset of the destinations that are suggested for each of the ten biggest airports in Europe, for trips in April with a duration of five days.

## Top 10: Biggest airports in Europe
The ten biggest airports in Europe are:
1. London Heathrow Airport, United Kingdom (LHR)
2. Aéroport de Paris-Charles de Gaulle, France (CDG)
3. Amsterdam Airport Schiphol, the Netherlands (AMS)
4. Flughafen Frankfurt am Main, Germany (FRA)
5. Aeropuerto Adolfo Suárez Madrid-Barajas, Spain (MAD)
6. Aeroport de Barcelona-el Prat, Spain (BCN)
7. Istanbul Airport, Turkey (IST)
8. Sheremetyevo International Airport, Russia (SVO)
9. Flughafen München Franz Josef Strauß, Germany (MUC)
10. London Gatwick Airport, United Kingdom (LGW)

In this notebook we will discuss the scraping in three chapters:
1. The preparation before scraping
2. The kayak.com/explore scraper
3. Saving the scraped data in a csv file

# <span style="color:orange">1. The preparation before scraping</span>
Before the Explore page of KAYAK can be scraped, the libraries that the code relies on need to be imported and a browser driver needs to be set up.

## 1.1 Importing libraries
In the first cell we import the libraries needed to run our code:
* The requests library lets us make HTTP requests to KAYAK's web server to retrieve the data from their pages.
* The BeautifulSoup library is easy to use for beginners and allows us to extract data from HTML files.
* The time library is used to pause the execution of the commands. We use this because of the amount of data we gather.
* The csv library is used to store the scraped data in a csv file at the end.
* The datetime library is needed to add the current day to our csv file.


In [None]:
import requests
from bs4 import BeautifulSoup
import time
from time import sleep 
import csv
from datetime import datetime


In [None]:
# To use Chromedriver on Google Colab, run this code
!pip install selenium
!apt-get update
!apt install chromium-chromedriver

from selenium import webdriver

# Run Chrome without a visible window, because Colab has no display
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Selenium 4 takes the options via the `options` keyword; `chrome_options` is deprecated
driver = webdriver.Chrome(options=chrome_options)



In [None]:
# To use Chromedriver on your computer, run this code
import selenium.webdriver

driver = selenium.webdriver.Chrome()


# <span style="color:orange">2. The kayak.com/explore scraper</span>


### 2.1 Define airports and extract airport urls

In [None]:
# Defining base url
base_url = 'https://www.kayak.com/explore'

# Create a list of all the airports we want to extract data from
airports = ["LHR", "CDG", "AMS", "FRA", "MAD", "BCN", "IST", "SVO", "MUC", "LGW"]

# Build one url per airport: dates from 1 to 30 April 2022, trip duration of five days
airport_urls = []  # Make an empty list to save the airport urls

for airport in airports:
    airport_url = base_url + '/' + airport + '-anywhere/20220401,20220430' + '?tripdurationrange=5,5'
    airport_urls.append(airport_url)

airport_urls  # List of urls where we want to extract data from

['https://www.kayak.com/explore/LHR-anywhere/20220401,20220430?tripdurationrange=5,5',
 'https://www.kayak.com/explore/CDG-anywhere/20220401,20220430?tripdurationrange=5,5',
 'https://www.kayak.com/explore/AMS-anywhere/20220401,20220430?tripdurationrange=5,5',
 'https://www.kayak.com/explore/FRA-anywhere/20220401,20220430?tripdurationrange=5,5',
 'https://www.kayak.com/explore/MAD-anywhere/20220401,20220430?tripdurationrange=5,5',
 'https://www.kayak.com/explore/BCN-anywhere/20220401,20220430?tripdurationrange=5,5',
 'https://www.kayak.com/explore/IST-anywhere/20220401,20220430?tripdurationrange=5,5',
 'https://www.kayak.com/explore/SVO-anywhere/20220401,20220430?tripdurationrange=5,5',
 'https://www.kayak.com/explore/MUC-anywhere/20220401,20220430?tripdurationrange=5,5',
 'https://www.kayak.com/explore/LGW-anywhere/20220401,20220430?tripdurationrange=5,5']
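
The url pattern above can also be wrapped in a small helper, so that the date range and the trip duration are easy to change. This is a minimal sketch: the function name `build_explore_url` and its parameters are our own illustration, and the pattern is simply the one used in this notebook, not a documented KAYAK interface.

```python
from datetime import date

BASE_URL = 'https://www.kayak.com/explore'


def build_explore_url(airport, start, end, trip_days):
    """Build an Explore url with the pattern used in this notebook (hypothetical helper)."""
    # Dates are written as YYYYMMDD and separated by a comma
    date_range = start.strftime('%Y%m%d') + ',' + end.strftime('%Y%m%d')
    # The same value twice means a trip of exactly `trip_days` days
    duration = f'{trip_days},{trip_days}'
    return f'{BASE_URL}/{airport}-anywhere/{date_range}?tripdurationrange={duration}'


url = build_explore_url('AMS', date(2022, 4, 1), date(2022, 4, 30), 5)
print(url)
```

Calling the helper once per airport code gives the same list of urls as the loop above.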

### 2.2 HTML code of the city names, country names and prices

The destinations are listed inside the container with id `JoPq-destinations` and class `_ihz _ilc _iai`:
`<div id="JoPq-destinations" class="_ihz _ilc _iai">`

City name class: `_ib0 _igh _ial _1O _iaj City__Name`
For example: `<div class="_ib0 _igh _ial _1O _iaj City__Name">Bucharest</div>`

Country name class: `_iC8 _1W _ib0 _iYh _igh Country__Name`
For example: `<div class="_iC8 _1W _ib0 _iYh _igh Country__Name">Romania</div>`

Price class: `_ib0 _18 _igh _ial _iaj`
For example: `<div class="_ib0 _18 _igh _ial _iaj">from $52</div>`

These classes will be used in the next step.
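
As an illustration of how these classes can be used, the sketch below parses a small hand-written HTML fragment that is assembled from the example elements above. The fragment and the helper `parse_price` are assumptions made for this example; the real page contains more markup, and KAYAK may change these class names at any time.

```python
import re
from bs4 import BeautifulSoup

# Hand-written fragment built from the example elements above (not the real page)
sample_html = """
<div id="JoPq-destinations" class="_ihz _ilc _iai">
  <div class="_ib0 _igh _ial _1O _iaj City__Name">Bucharest</div>
  <div class="_iC8 _1W _ib0 _iYh _igh Country__Name">Romania</div>
  <div class="_ib0 _18 _igh _ial _iaj">from $52</div>
</div>
"""


def parse_price(text):
    """Return the first whole number in a price text such as 'from $52' (hypothetical helper)."""
    match = re.search(r'\d+', text.replace(',', ''))
    return int(match.group()) if match else None


soup = BeautifulSoup(sample_html, 'html.parser')
container = soup.find('div', id='JoPq-destinations')

# City and country have a descriptive class, so one class is enough to find them
city = container.find('div', class_='City__Name').text
country = container.find('div', class_='Country__Name').text
# The price has no descriptive class, so we match the full class string
price_text = container.find('div', attrs={'class': '_ib0 _18 _igh _ial _iaj'}).text

print(city, country, parse_price(price_text))
```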

### 2.3 Extract data from urls

In [None]:
# From airport_urls we want to scrape the 16 destinations and prices per url
destination_city = []
destination_country = []
destination_price = []

# Load the Explore page of a single airport; after the loop in 2.1, `airport` holds the last airport in the list
driver.get('https://www.kayak.com/explore/' + airport + '-anywhere/20220401,20220430' + '?tripdurationrange=5,5')

In [None]:
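from selenium.webdriver.common.by import By

# A class-name lookup accepts only a single class, so the three classes are combined in one CSS selector
destination_containers = driver.find_elements(By.CSS_SELECTOR, '._ihz._ilc._iai')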


In [None]:
import pandas as pd

# Collect all the city names, country names and prices from the loaded page
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')

cities, countries, prices = [], [], []

# The destinations sit inside the container with class "_ihz _ilc _iai"
for container in soup.find_all('div', attrs={'class': '_ihz _ilc _iai'}):
    for city in container.find_all('div', attrs={'class': '_ib0 _igh _ial _1O _iaj City__Name'}):
        cities.append(city.text)  # To get the city names
    for country in container.find_all('div', attrs={'class': '_iC8 _1W _ib0 _iYh _igh Country__Name'}):
        countries.append(country.text)  # To get the country names
    for price in container.find_all('div', attrs={'class': '_ib0 _18 _igh _ial _iaj'}):
        prices.append(price.text)  # To get the prices

# Create a dataframe with all the information scraped from the url
# (this only works when the three lists have the same length)
df = pd.DataFrame({'City': cities, 'Country': countries, 'Price': prices})
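
The code in this chapter handles a single page. One way to cover all ten airports is to repeat the same steps for every url and to pause between page loads, which is what the time library was imported for. The sketch below is an assumption about how this could be organised, not the final scraper: `scrape_all`, `fetch_html` and `parse_page` are hypothetical names, and the page loading and parsing are passed in as functions so that the loop can be tried without a browser.

```python
import time


def scrape_all(urls, fetch_html, parse_page, pause_seconds=5):
    """Fetch and parse every url, pausing between page loads (hypothetical helper)."""
    rows = []
    for index, url in enumerate(urls):
        html = fetch_html(url)          # for example: load the page and return its source
        for row in parse_page(html):    # parse_page returns a list of dictionaries
            row['url'] = url            # remember which page the row came from
            rows.append(row)
        if index < len(urls) - 1:
            time.sleep(pause_seconds)   # pause before loading the next page
    return rows


# Try the loop with stand-in functions instead of a real browser
fake_pages = {'url-a': 'Bucharest', 'url-b': 'Lisbon'}
rows = scrape_all(
    list(fake_pages),
    fetch_html=lambda url: fake_pages[url],
    parse_page=lambda html: [{'City': html}],
    pause_seconds=0,
)
print(rows)
```

With Selenium, `fetch_html` could be a function that calls `driver.get(url)` and returns `driver.page_source`.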

# <span style="color:orange">3. Saving the scraped data in a csv file</span>
