# Outline

- Sourcing airfare pricing on the web (Google Flights)
- Retrieving fare data with advanced web scraping techniques
- Parsing the DOM to extract prices
- Identifying outlier fares with anomaly detection techniques
- Sending real-time text alerts with IFTTT

## Retrieving Airfare Pricing Data

The page is AJAX-based (Asynchronous JavaScript and XML), so fares are loaded by scripts after the initial HTML arrives. We'll therefore need more advanced web scraping tools: Selenium and ChromeDriver.

Selenium - for automating web browsers.

ChromeDriver - the driver executable that lets Selenium control the Chrome browser (Chrome itself can optionally run headless).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [5]:
from bs4 import BeautifulSoup
from selenium import webdriver 

chromedriver_path = "/home/lalah/Downloads/chromedriver"

browser = webdriver.Chrome(chromedriver_path)
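
Passing the driver path as a positional argument, as in the cell above, relies on older Selenium releases. A minimal sketch of the same setup, assuming Selenium 4 or later is installed, is shown below; the helper name `make_chrome_browser` and its `headless` flag are illustrative and not part of the original notebook.

```python
def make_chrome_browser(driver_path, headless=True):
    """Return a Chrome WebDriver built with the Selenium 4 Service/Options API.

    Hypothetical helper: assumes Selenium 4+ and a ChromeDriver binary at
    `driver_path` that matches the installed Chrome version.
    """
    # Imported lazily so that defining the helper does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    if headless:
        # Run Chrome without opening a visible window.
        options.add_argument("--headless")
    return webdriver.Chrome(service=Service(driver_path), options=options)
```

With this helper, the cell above would become `browser = make_chrome_browser(chromedriver_path)`.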

In [6]:
sats = 'https://www.google.com/travel/explore?tfs=CBwQAxopag0IAhIJL20vMDJfMjg2EgoyMDIxLTAyLTEzcgwIBBIIL20vMDJqOXoaKWoMCAQSCC9tLzAyajl6EgoyMDIxLTAyLTIwcg0IAhIJL20vMDJfMjg2cAGCAQsI____________AUABSAGYAQGyAQIYAQ&tfu=GgA&hl=en&gl=NG'
browser.get(sats)

In [7]:
browser.title

'Explore'

In [8]:
browser.save_screenshot('/home/lalah/Downloads/test_flights.png')

True
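
Because the fares are filled in by JavaScript after the page loads, reading the page source immediately after `browser.get` may return the page before any result cards exist. One common way to guard against this is an explicit wait. The sketch below assumes Selenium's `WebDriverWait` and the `tsAU4e` card class that appears in the parsing section; the helper name `wait_for_cards` and the 15-second default are illustrative choices.

```python
def wait_for_cards(browser, selector="div[class*=tsAU4e]", timeout=15):
    """Block until at least one element matches `selector`, then return the HTML.

    Hypothetical helper: raises selenium's TimeoutException if nothing
    matching `selector` appears within `timeout` seconds.
    """
    # Imported lazily so that defining the helper does not require Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )
    return browser.page_source
```

The returned string can be passed to `BeautifulSoup` in place of `browser.page_source`. Note that class names such as `tsAU4e` are generated by the site and may change, so the selector is likely to need updating.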

## Parsing the DOM to extract pricing data

The DOM (Document Object Model) is the collection of elements that make up a webpage.

It includes HTML tags such as `body` and `div`, as well as the classes and IDs embedded in these tags.
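
To make the idea concrete, the sketch below walks a small, made-up HTML fragment and lists each tag with its nesting depth, class, and ID. It uses only the standard library's `html.parser` so it runs without a browser; the fragment, the `TagLister` class, and the class and ID names are all illustrative and not taken from the Google Flights page.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record (depth, tag, class, id) for every opening tag encountered."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.rows = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.rows.append((self.depth, tag, attrs.get("class"), attrs.get("id")))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


# Made-up fragment; it deliberately avoids void tags such as <img>,
# which have no closing tag and would throw off the depth counter.
html = (
    '<body><div id="results"><div class="card">'
    '<h3 class="city">London</h3><span class="price">NGN 233,320</span>'
    "</div></div></body>"
)

lister = TagLister()
lister.feed(html)
for depth, tag, css_class, element_id in lister.rows:
    print("  " * depth + tag, "class=" + str(css_class), "id=" + str(element_id))
```

The indentation in the printed listing mirrors the tree structure that selectors such as `div[class*=...]` navigate.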

### Parsing

In [9]:
soup = BeautifulSoup(browser.page_source, "html5lib")

In [10]:
cards = soup.select('div[class*=tsAU4e]')
cards[0]

<div class="tsAU4e"><div class="wIuJz"><h3 class="W6bZuc YMlIz">London</h3><div class="ZjDced CQYfx"><img alt="Air France, Delta, and Aer Lingus" class="C5fbBf" data-iml="18602.525000000016" height="16" src="//www.gstatic.com/flights/airline_logos/70px/multi.png" width="16"/><span class="nx0jzf">1 stop</span><span aria-label=" " class="qeoz6e U325Rc"></span><span class="Xq1DAb">9 hr 45 min</span></div></div><div class="Q70fcd sSHqwe"><div class="MJg7fb"><span class="QB2Jof xLPuCe"><span aria-label="233320 Nigerian nairas" data-gs="CidHZ003dUJHLS0tLS0tLS0td3Nienc3QUFBQUFHQVVhLThETkZUQUESATAaCwjong4QABoDTkdOKgoyMDIxLTAyLTEzMgoyMDIxLTAyLTIwOCRKBAgBEBpw2N8D" role="text">NGN 233,320</span></span></div></div></div>

In [27]:
for card in cards:
    print(card.select('h3')[0].text)
    if card.select('span'):
        print(card.select('span')[-1].text)
    print('\n')

London
NGN 233,320


Paris
NGN 182,780


Rome
NGN 203,889


Amsterdam
NGN 191,083


Barcelona
NGN 219,260


Madrid
NGN 167,200


Berlin
NGN 187,720


Lisbon
NGN 189,240


Athens
NGN 279,300


Venice
NGN 203,813


Dublin
NGN 208,335


Vienna
NGN 201,419


Prague
NGN 205,561


Budapest
NGN 202,977


Santorini
NGN 2,334,169


Milan
NGN 201,115


Florence


Copenhagen
NGN 198,949


Edinburgh
NGN 246,943


Munich
NGN 180,120


Moscow
NGN 199,880


Stockholm
NGN 207,841


Dubrovnik
NGN 378,309


Seville
NGN 200,640


Zürich
NGN 162,260


Frankfurt
NGN 194,560


Mykonos
NGN 2,334,967


Naples
NGN 200,393


Porto
NGN 207,480


Nice
NGN 205,295


Bath
To London


Kraków
NGN 269,021


Helsinki
NGN 310,274


Saint Petersburg
NGN 211,641


Oslo
NGN 193,591


Bruges
To Brussels


Salzburg
NGN 457,140


Split
NGN 449,065


Zagreb
NGN 353,400


Granada
NGN 325,223




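Since the test parse looks largely successful (a few cards return a label such as "To London", or no fare at all, instead of a price), I'll construct a full scrape and parse for a large number of fares.

I'll retrieve the lowest-cost, non-stop fares from NYC to Europe for a 26-week period.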

In [28]:
cards[30]

<div class="tsAU4e"><div class="wIuJz"><h3 class="W6bZuc YMlIz">Bath</h3><div class="ZjDced CQYfx"><img alt="Air France, Delta, and Aer Lingus" class="C5fbBf" data-iml="18602.055000000066" height="16" src="//www.gstatic.com/flights/airline_logos/70px/multi.png" width="16"/><span class="nx0jzf">1 stop</span><span aria-label=" " class="qeoz6e U325Rc"></span><span class="Xq1DAb">9 hr 45 min</span></div></div><div class="Q70fcd sSHqwe"><div class="MJg7fb"><span class="QB2Jof xLPuCe"><span aria-label="233320 Nigerian nairas" data-gs="CidHZ003dUJHLS0tLS0tLS0td3Nienc3QUFBQUFHQVVhLThETkZUQUESAjMwGgsI6J4OEAAaA05HTioKMjAyMS0wMi0xMzIKMjAyMS0wMi0yMDgkSgQIARAacNjfAw==" role="text">NGN 233,320</span></span></div><span class="IxlmQc sSHqwe">To London</span></div></div>
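
The Bath card shows why the earlier loop printed "To London": its last `<span>` is the `IxlmQc` label, while the fare sits in an inner `<span role="text">`. A sketch of a more targeted lookup follows, assuming the fare is always carried by a span with `role="text"` as in the two cards shown; `card_price_text` and `print_fares` are hypothetical helper names.

```python
def card_price_text(card):
    """Return the fare text of one result card, or None if it has no fare.

    `card` is expected to be a BeautifulSoup Tag such as an item of `cards`.
    """
    # Select the fare span directly instead of relying on it being the
    # last <span>, which may be a label such as "To London".
    price_span = card.select_one('span[role="text"]')
    return price_span.text if price_span is not None else None


def print_fares(cards):
    """Print each destination with its fare text (or None when absent)."""
    for card in cards:
        print(card.select("h3")[0].text, card_price_text(card))
```

For the Bath card this returns `NGN 233,320`, which is the fare to London, so the "To London" label is still worth recording alongside the price.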

In [12]:
soup.select('div[class*=info-container]')

[]