# App to find Cheap Flights

# Introduction:

In 2014, the cheapest fare from New York to Vienna was typically around $800, yet for a select number of dates the advertised fares were between $350 and $450.

It seemed like a good deal, and one might wonder whether it was real. The industry does make occasional mistakes on fares: airlines sometimes accidentally post fares that exclude fuel surcharges. Normally, the advanced algorithms these airlines employ update fares by taking a large number of factors into account; however, because older generations of systems are still in place, mistakes do happen.



# 1 - Import the necessary libraries:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from time import sleep

# 2 - Retrieving the data by scraping the web:

Fare data are obtained from an AJAX-based (Asynchronous JavaScript and XML) webpage, so a browser is required to render the content. This task needs the following two tools: Selenium and ChromeDriver.

- Selenium is a package for automating web browsers.
- ChromeDriver is the executable that lets Selenium control the Chrome browser. Chrome can optionally run headless, meaning without a user interface.

In [2]:
from bs4 import BeautifulSoup
from selenium import webdriver

chromeDriver_file = 'chromedriver'
chromeDriver_file_conda = 'chromedriver-binary alias'

import os
path = os.path.abspath(chromeDriver_file)
print('pathway to ChromeDriver is: ' + '\n' + path)

# Set the ChromeDriver pathway:
chromeDriver_path = path

browser = webdriver.Chrome(chromeDriver_path)

pathway to ChromeDriver is: 
/Users/y.s.lee/OneDrive/Packt - Python Machine Learning Blue Prints/Project 2 - App to find Cheap Flights/chromedriver
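
The cell above assumes that the `chromedriver` executable sits in the current working directory. As a hedged sketch, the driver could be located a little more defensively before the browser is launched. The helper name `find_chromedriver` and the fallback to `PATH` are assumptions made for illustration and are not part of the original notebook.

```python
import os
import shutil

def find_chromedriver(name="chromedriver", search_dir="."):
    """Return an absolute path to the driver, or None if it cannot be found."""
    # First look in the given directory, as the notebook cell does.
    candidate = os.path.abspath(os.path.join(search_dir, name))
    if os.path.isfile(candidate):
        return candidate
    # Otherwise fall back to the directories listed in PATH.
    return shutil.which(name)

driver_path = find_chromedriver()
if driver_path is None:
    print("chromedriver not found; download it or add it to PATH")
else:
    print("pathway to ChromeDriver is:\n" + driver_path)
```

Checking for the file first gives a clearer message than the error raised when the browser fails to start.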


### 2.1 Set the URL (from google flights):

The dates are set to 1 June to 15 June 2020 (these dates can be changed to any others).

NOTE: The Freebase ID of each city or region of interest is needed. For example, m/06y57 is Sydney, m/0f04v is an empty search, and m/02_286 is NYC.

The ID can be found by searching in Google from the link below:
https://www.google.com.mx/travel/guide?q=New+York+City&sa=X&rlz=1C1CHBD_esMX769MX769&output=search&tra=%5B%22AMAbHIJDZRALeKKuHEbLXHGOJ3aS9zzCTg:1579328461567%22,%22syndey%22,%22/m/02_286%22%5D&tcfs=EhUKCS9tLzAyXzI4NhIITmV3IFlvcms&dest_mid=/m/06y57#dest_mid=/m/06y57&tcfs=EiwKCC9tLzA2eTU3EgZTeWRuZXkaGAoKMjAyMC0wMi0wMxIKMjAyMC0wMi0wNw

It can be confirmed with the link below; use Ctrl+F to search the page for 'Freebase ID'.
https://www.wikidata.org/wiki/Q3130
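
Rather than reading the ID out of the link by eye, it can be pulled out with the standard library. This is a minimal sketch, assuming the link carries a `dest_mid` query parameter as the travel-guide link above does. The function name `extract_dest_mid` and the shortened example URL are illustrative only.

```python
from urllib.parse import urlparse, parse_qs

def extract_dest_mid(url):
    """Return the dest_mid query parameter of a travel-guide URL, or None."""
    query = parse_qs(urlparse(url).query)
    values = query.get("dest_mid")
    return values[0] if values else None

# Shortened version of the travel-guide link above, kept only for illustration.
example_url = "https://www.google.com.mx/travel/guide?q=New+York+City&dest_mid=/m/06y57"
print(extract_dest_mid(example_url))  # /m/06y57
```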

In [3]:
# Input webpage as string:
flight_web_sats = 'https://www.google.com/travel/explore?tfs=CBsQAxojagcIARIDU1lEEgoyMDIwLTA2LTAxcgwIBBIIL20vMDJqOXoaI2oMCAQSCC9tLzAyajl6EgoyMDIwLTA2LTE1cgcIARIDU1lEcAFAAUgB&curr=AUD&gl=au&hl=en&authuser=0&origin=https%3A%2F%2Fwww.google.com&dest_mid'

# Retrieve the webpage's content using Selenium:
browser.get(flight_web_sats)

# Check the title of the webpage:
browser.title

'Explore'

In [4]:
# Check to see if the required information from the webpage was captured: Take a Screenshot and save as 'test_flights.png'.
current_work_directory = os.getcwd()
browser.save_screenshot(current_work_directory + '/test_flights.png')

True

### 2.2 Parsing the DOM to extract the individual flight data from the HTML tags:

The Document Object Model (DOM) is the collection of individual elements that make up a webpage. It includes HTML tags such as 'body' and 'div', along with the classes and IDs attached to them.

In [5]:
# Parsing:
soup = BeautifulSoup(browser.page_source, "html.parser")

#### Extract the individual city data:

In [6]:
# Get the city data:
# At the time of scraping, each flight card was inside <div class='MebuN'>
# or, by XPath: //*[@id="flt-app"]/c-wiz/c-wiz/nav/div[1]/nav/div/div[2]/ol/li[1]/div

sleep(7)

flight_cards = soup.select('div[class*=MebuN]')

# Check out a single flight card:
flight_cards[0]


<div class="MebuN"><div class="L32YH" style="background-image: url('//t1.gstatic.com/images?q=tbn:ANd9GcSWD3RIhb5WVm8o0tVuh7Ygbe67MxTKwnYGRJ1cbhLQupYphfzlG37c6WbGvIIwlP7oGh1QoYZq'), url('//www.gstatic.com/flights/app/runway_200.png')"></div><div class="tsAU4e"><div class="wIuJz"><h3 class="W6bZuc YMlIz">London</h3><div class="ZjDced CQYfx"><img alt="China Airlines" class="C5fbBf" data-iml="7574.610000003304" height="16" src="//www.gstatic.com/flights/airline_logos/70px/CI.png" width="16"/><span class="nx0jzf">1 stop</span><span class="qeoz6e U325Rc"></span><span class="Xq1DAb">1 day 3 hr 20 min</span></div></div><div class="Q70fcd sSHqwe"><div class="MJg7fb"><span class="QB2Jof xLPuCe" data-gs="CidHbzBoVUJHLS0tLS0tLS0tcGZkczIyQUFBQUFGNGxCeDRNMkNnQUESATAaCwj0xQYQAhoDQVVE">$1,073</span></div></div></div></div>

From the HTML above, it can be seen that the required information is within the markup.
For example:
- the destination is in 'class="W6bZuc YMlIz"' (London),
- the duration of the flight is in 'class="Xq1DAb"' (1 day 3 hr 20 min),
- and the price is in 'class="QB2Jof xLPuCe"', which also carries a 'data-gs' attribute ($1,073).
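
To illustrate how these class names map to fields, the sketch below pulls the three values out of a trimmed copy of the card markup. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without third-party packages; the notebook itself uses BeautifulSoup. The names `FIELD_BY_CLASS` and `CardParser` are hypothetical, and the class names are those observed at scraping time, which Google may change.

```python
from html.parser import HTMLParser

# Class names observed in the card above; they may change at any time.
FIELD_BY_CLASS = {"W6bZuc": "city", "Xq1DAb": "duration", "QB2Jof": "price"}

class CardParser(HTMLParser):
    """Collect the text of elements whose class marks a field of interest."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field name of the element being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in FIELD_BY_CLASS:
                self._current = FIELD_BY_CLASS[cls]

    def handle_data(self, data):
        # Store the first text found inside a matching element.
        if self._current is not None:
            self.fields[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        # An empty matching element must not capture later text.
        self._current = None

# Trimmed copy of the flight card shown above.
card_html = (
    '<div class="MebuN"><h3 class="W6bZuc YMlIz">London</h3>'
    '<span class="nx0jzf">1 stop</span>'
    '<span class="Xq1DAb">1 day 3 hr 20 min</span>'
    '<span class="QB2Jof xLPuCe">$1,073</span></div>'
)

parser = CardParser()
parser.feed(card_html)
print(parser.fields)
```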

#### Next, obtain the required data from the markup:


In [8]:
# For-loop to extract the relevant information:
# At the time of scraping, the price information was stored in <div class='MJg7fb'>

# for card in flight_cards:
#     print(card.select('h3')[0].text)
#     print(card.select('div[class*=MJg7fb]')[0].text)
#     print('\n')

# Perform clean up of 'Great value' tags before the prices:
for card in flight_cards:
    print(card.select('h3')[0].text)
    print(card.select('div[class*=MJg7fb]')[0].text.replace('Great value',""))
    print('\n')

browser.quit()

print('Testing Complete.')

London
$1,073


Paris
$1,167


Rome
$938


Amsterdam
$907


Athens
$1,332


Dublin
$1,272


Manchester
$1,243


Frankfurt
$1,000


İstanbul
$1,260


Stockholm
$1,465


Barcelona
$1,142


Milan
$1,247


Berlin
$936


Madrid
$1,244


Zürich
$1,255


Munich
$1,396


Copenhagen
$1,157


Reykjavík
$1,788


Vienna
$1,351


Edinburgh
$1,310


Moscow
$1,277


Malta
$1,398


Helsinki
$1,603


Venice
$1,491


Budapest
$1,400


Geneva
$1,391


Lisbon
$1,352


Belgrade
$1,408


Brussels
$1,300


Prague
$1,388


Skopje
$1,406


Oslo
$1,486


Warsaw
$1,462


Glasgow
$1,297


Birmingham
$1,470


Nice
$1,402


Zagreb
$1,465


Düsseldorf
$1,451


Hamburg
$1,474


Newcastle upon Tyne
$1,303


Testing Complete.


From the above, it can be confirmed that it is possible to retrieve the relevant data from the HTML. 
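
The fares are retrieved as strings, so they need to be converted to numbers before they can be compared. The following is a minimal sketch, assuming whole-currency amounts such as those printed above (a fare with cents would need different handling). The helper name `parse_fare` is hypothetical.

```python
import re

def parse_fare(text):
    """Convert a fare string such as '$1,073' or 'A$938' to an integer amount."""
    digits = re.sub(r"[^\d]", "", text)
    if not digits:
        raise ValueError("no digits in fare: " + repr(text))
    return int(digits)

# A few fares copied from the output above.
fares = {"London": "$1,073", "Rome": "$938", "Amsterdam": "$907", "Berlin": "$936"}

# Sort the cities from cheapest to most expensive.
ranked = sorted(fares, key=lambda city: parse_fare(fares[city]))
print(ranked)  # ['Amsterdam', 'Berlin', 'Rome', 'London']
```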

#### Next, retrieve the lowest-cost, non-stop fares from the departure location to the arrival region over a 26-week period. This is done by scraping and parsing a large number of fares.
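
As a sketch of how the 26 weekly date windows could be generated, the generator below shifts a departure and return date forward one week at a time. The function name `weekly_windows` is an assumption made for illustration; the start dates are the ones used in this notebook.

```python
from datetime import date, timedelta

def weekly_windows(first_departure, first_return, weeks):
    """Yield (departure, return) date pairs, each one week later than the last."""
    for week in range(weeks):
        shift = timedelta(weeks=week)
        yield first_departure + shift, first_return + shift

windows = list(weekly_windows(date(2020, 6, 1), date(2020, 6, 15), 26))
print(len(windows))               # 26
print(windows[1][0].isoformat())  # 2020-06-08
```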

#### Import the required libraries:

In [9]:
import datetime
from datetime import date, timedelta
from time import sleep

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

In [11]:
#======= Restart ChromeDriver =========
path = os.path.abspath(chromeDriver_file)
print('pathway to ChromeDriver is: ' + '\n' + path)

# Set the ChromeDriver pathway:
chromeDriver_path = path

browser = webdriver.Chrome(chromeDriver_path)

#======= Scraping the Web Data =========

week_period = 26
start_date = '2020-06-01'
end_date = '2020-06-15'

departure_destination = "Sydney"
arrival_destination = "Europe"

# Parse the flight dates into Python datetime objects.
startFlight_date = datetime.datetime.strptime(start_date, '%Y-%m-%d')
endFlight_date = datetime.datetime.strptime(end_date, '%Y-%m-%d')

# Dictionary for Fares:
flightFare_dict = {}

for idx in range(week_period):
    sat_start = str(startFlight_date).split()[0]
    sat_end = str(endFlight_date).split()[0]
    flightFare_dict.update({sat_start: {}})
    
    # Load webpage:
    sats = "https://www.google.com/flights?hl=en#flt=.." + sat_start + "*.." + sat_end + ";c:AUD;e:1;sd:1;t:h"
    sleep(np.random.randint(3,7))
    browser.get(sats)
    print('Index: ' + str(idx) + ' Starting Browser and searching link: Google ' + browser.title + '. Dates are: ' + sat_start + ' and ' + sat_end + '.' )
    
    # Input information to search for flights:
    wait_10sec = WebDriverWait(browser, 10) # Seconds of Waiting.

    print('Link Loaded, Entering Travel Details now.')

    # Departure Search: input of departure location.
    departureDestination_link = wait_10sec.until(EC.presence_of_element_located((By.XPATH, '//*[@id="flt-app"]/div[2]/main[1]/div[4]/div/div[3]/div/div[2]/div[1]')))
    departureDestination_link.click()
    departureDestination_link = wait_10sec.until(EC.presence_of_element_located((By.XPATH, '//*[@id="sb_ifc50"]/input')))
    sleep(1)
    departureDestination_link.send_keys(departure_destination)
    sleep(2)
    departureDestination_link.send_keys(Keys.ENTER)

    # Arrival Search: input of arrival location.
    arrivalDestination = wait_10sec.until(EC.presence_of_element_located((By.XPATH, '//*[@id="flt-app"]/div[2]/main[1]/div[4]/div/div[3]/div/div[2]/div[2]')))
    arrivalDestination.click()
    arrivalDestination = wait_10sec.until(EC.presence_of_element_located((By.XPATH, '//*[@id="sb_ifc50"]/input')))
    sleep(1)
    arrivalDestination.send_keys(arrival_destination)
    sleep(2)
    arrivalDestination.send_keys(Keys.ENTER)

    # Get new URL:
    sleep(1)
    new_browser_url = browser.current_url
    print('After inputting the destinations and searching, the new URL is: \n' + new_browser_url)

    # Finally, click on the 'Search' button:
    floatingActionButton_click = browser.find_elements(By.XPATH, '//*[@id="flt-app"]/div[2]/main[1]/div[4]/div/div[3]/div/div[4]/floating-action-button')[0]
    sleep(2)
    floatingActionButton_click.click()
    print('Search done. Next is to get a list of the travel information.')
    
    
    # Extract Relevant Data from webpage:
    
    print('Collecting data.')
    sleep(30)
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    flight_cards = soup.select('div[class*=MebuN]')
    
    for card in flight_cards:
        try:
            city = card.select('h3')[0].text
            fare = card.select('div[class*=MJg7fb]')[0].text.replace('Great value', "")
        except IndexError as detail:
            # Skip cards that lack a city heading or a price element.
            print('Skipping card with missing data: ' + str(detail))
            continue
        print(city)
        print(fare)
        print('\n')
        flightFare_dict[sat_start][city] = fare
        
#     sleep(7)
    startFlight_date = startFlight_date + timedelta(days = 7)
    endFlight_date = endFlight_date + timedelta(days = 7)
    print('\n')
    
browser.quit()
print('Quitting Browser, Data Collection Complete.')

pathway to ChromeDriver is: 
/Users/y.s.lee/OneDrive/Packt - Python Machine Learning Blue Prints/Project 2 - App to find Cheap Flights/chromedriver
Index: 0 Starting Browser and searching link: Google Flights. Dates are: 2020-06-01 and 2020-06-15.
Link Loaded, Entering Travel Details now.
After inputting the destinations and searching, the new URL is: 
https://www.google.com/flights?hl=en#flt=..2020-06-01*..2020-06-15;c:AUD;e:1;sd:1;t:h
Search done. Next is to get a list of the travel information.
Collecting data.
London
A$1,073


Paris
A$1,167


Rome
A$938


Amsterdam
A$907


Athens
A$1,332


Dublin
A$1,272


Manchester
A$1,243


Frankfurt
A$1,000


İstanbul
A$1,260


Stockholm
A$1,465


Barcelona
A$1,142


Milan
A$1,247


Berlin
A$936


Madrid
A$1,244


Zürich
A$1,255


Munich
A$1,396


Copenhagen
A$1,157


Reykjavík
A$1,788


Vienna
A$1,351


Edinburgh
A$1,310


Moscow
A$1,277


Malta
A$1,398


Helsinki
A$1,603


Venice
A$1,491


Budapest
A$1,400


Geneva
A$1,391


Lisbon
A$1,352




In [13]:
flightFare_dict

{'2020-06-01': {'London': 'A$1,073',
  'Paris': 'A$1,167',
  'Rome': 'A$938',
  'Amsterdam': 'A$907',
  'Athens': 'A$1,332',
  'Dublin': 'A$1,272',
  'Manchester': 'A$1,243',
  'Frankfurt': 'A$1,000',
  'İstanbul': 'A$1,260',
  'Stockholm': 'A$1,465',
  'Barcelona': 'A$1,142',
  'Milan': 'A$1,247',
  'Berlin': 'A$936',
  'Madrid': 'A$1,244',
  'Zürich': 'A$1,255',
  'Munich': 'A$1,396',
  'Copenhagen': 'A$1,157',
  'Reykjavík': 'A$1,788',
  'Vienna': 'A$1,351',
  'Edinburgh': 'A$1,310',
  'Moscow': 'A$1,277',
  'Malta': 'A$1,398',
  'Helsinki': 'A$1,603',
  'Venice': 'A$1,491',
  'Budapest': 'A$1,400',
  'Geneva': 'A$1,391',
  'Lisbon': 'A$1,352',
  'Belgrade': 'A$1,408',
  'Brussels': 'A$1,300',
  'Prague': 'A$1,388',
  'Skopje': 'A$1,406',
  'Oslo': 'A$1,486',
  'Warsaw': 'A$1,462',
  'Glasgow': 'A$1,297',
  'Birmingham': 'A$1,470',
  'Nice': 'A$1,402',
  'Zagreb': 'A$1,465',
  'Düsseldorf': 'A$1,451',
  'Hamburg': 'A$1,474',
  'Newcastle upon Tyne': 'A$1,303'},
 '2020-06-08': {'Lond