## Who am I?
- Slawomir Tulski (Slaw)
- currently: Big Data Engineer at WorldRemit
- previously: Python Data Programmer at Import.io (web scraping start-up)
- linkedin: https://www.linkedin.com/in/slawomir-tulski-091611116/
- personal website: http://slawomirtulski.com/

### My goals for today
- show how to tackle web scraping in ways other than the "standard" approach
- present useful web-scraping tips and tricks
- avoid making a tutorial on popular HTML parsing / scraping libraries

### Plan
- Quick introduction to scraping
- Stop crawling, investigate your target instead
    + case study 1: getting all the URLs you need from a website
- Look for APIs, even if the service does not provide a (public) one
    + case study 2: getting an API key and using a hidden API in a store locator service
    + case study 3: getting available Airbnb properties in London
    + case study 4: API JSON response embedded in HTML
- Handling JavaScript with Selenium
    + case study 5: handling infinite scroll
- Key notes and advice

## Some basics...

### How does your browser work?
- The World Wide Web operates on a client/server model
- The web browser contacts a web server and requests information or resources
- The server locates the information (HTML, images etc.) and sends it back to the web browser
- The browser displays the results
- The browser can execute JavaScript code to dynamically "do things" (send requests, change the site's appearance and basically everything else)
- There are 4 basic types of HTTP requests: GET, POST, PUT and DELETE (GET and POST are the ones you'll use most often while scraping)
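To make the GET vs POST difference concrete, here is a minimal sketch that only builds the two request objects and sends nothing. It uses the standard library so it runs offline; the URL and parameters are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, used only to build request objects (nothing is sent).
BASE_URL = "http://example.com/search"

# GET: parameters travel in the URL query string.
query = urllib.parse.urlencode({"q": "hardware", "page": 1})
get_request = urllib.request.Request(BASE_URL + "?" + query, method="GET")

# POST: parameters travel in the request body.
payload = json.dumps({"q": "hardware", "page": 1}).encode("utf-8")
post_request = urllib.request.Request(
    BASE_URL,
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

With the `requests` library used in the rest of this notebook, the same two shapes correspond to `requests.get(url, params=...)` and `requests.post(url, data=...)`.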

### How can I see what my browser is doing?
- web browsers usually have some sort of "Developer Tools" (if yours does not, you should think about changing your browser)
- there should be a 'Network' tab which shows what is being sent between your browser and the server
- you can check exactly what type of requests were sent, along with their headers, parameters, cookies etc.
- Developer Tools also include a console where you can execute JavaScript

### "Standard" scraping approach
0. I want DATA!!!!
1. don't scrape... find data somewhere else!
2. don't scrape... they should provide an API!
3. ok.. you're screwed. get HTML and parse it!
4. you need a lot of data from different pages of one web service? - build a crawler and "catch them all"

## Using sitemaps instead of crawling the whole website

### what is a web "crawler"?
* an automated bot which recursively follows every internal link it finds, starting from a start page
* theoretically, it will traverse all URLs on a website
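As a rough illustration of what such a bot does, here is a minimal breadth-first crawler. It walks a made-up in-memory "website" (a dict of page -> links) instead of making real requests, so the structure and page names are assumptions.

```python
from collections import deque

# Hypothetical in-memory "website": each page maps to the internal links found on it.
SITE = {
    "/": ["/about", "/category/a", "/category/b"],
    "/about": ["/"],
    "/category/a": ["/item/1", "/item/2", "/"],
    "/category/b": ["/item/2", "/item/3"],
    "/item/1": [],
    "/item/2": ["/category/a"],
    "/item/3": [],
}

def crawl(start, get_links):
    """Breadth-first traversal; returns pages in the order they were visited."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append(link)
    return visited

pages = crawl("/", lambda page: SITE.get(page, []))
items = [p for p in pages if p.startswith("/item/")]
print(len(pages), "pages fetched to find", len(items), "item pages")
```

Even in this toy case the crawler has to visit every page to be sure it has found all the item pages; on a real website each visit is an HTTP request.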

### why is it not the best idea?
* not precise (it's brute force... a lot of requests made and a lot of garbage scraped)
* you need to write more code and take care of a lot of things (what type of URL did it get? can I go there?)
* it assumes a particular page layout and tests whatever it encounters
* easy to get caught in a trap (honeypots)

### what to use instead?
* very often a sitemap of the whole website is already available!
* very often sitemaps are hidden! if you can't see one on the page, try **/sitemap.xml** [https://www.skipthedishes.com/]
* information about the sitemap can also be found in the **robots.txt** file [https://www.walmart.com/]
* if there is no sitemap, try to follow the pattern **get categories -> get pages -> get listing -> get item**
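The robots.txt tip can be sketched in a few lines. This example parses a made-up robots.txt body, so the URLs in it are placeholders; in practice you would pass in the text downloaded from the target's `/robots.txt`.

```python
def sitemaps_from_robots(robots_txt):
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop comments, then split "Field: value" on the first colon only,
        # because the URL itself contains colons.
        line = line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        if field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls

# Made-up robots.txt content for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /checkout/
Sitemap: https://www.example.com/sitemap_index.xml
sitemap: https://www.example.com/sitemap_products.xml  # product pages
"""

print(sitemaps_from_robots(SAMPLE_ROBOTS))
```

The field name is matched case-insensitively because real files are not consistent about capitalisation.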

In [24]:
# built-in
import json
import random
import re
import time
# 3rd party
from IPython.display import HTML
import pandas as pd
import requests
from selenium import webdriver

In [15]:
"""
Case Study 1: getting all links from sitemap

You want to analyse the housing market in the UK. The data which interests you most is on http://www.rightmove.co.uk/.
Unfortunately, there is no API available and you need to get the data from HTML. As a first step, before putting your hands
on the data, you need to know the URLs of all available properties on the website. Later, you will use those links to extract data.

Find all URLs of properties on rightmove.co.uk. Be as precise as possible. Do not build inefficient crawlers.
"""
main_sitemap_url = 'http://www.rightmove.co.uk/sitemap.xml'
main_sitemap_text = requests.get(main_sitemap_url).text
# the main sitemap is an index; keep only the sub-sitemaps which list property pages
properties_sitemaps = re.findall(r'<loc>(http://www\.rightmove\.co\.uk/sitemap_propertydetails\d+\.xml)</loc>', main_sitemap_text)
limit_pages = 3  # only fetch the first few sub-sitemaps for the demo
all_properties = []
for pmap_url in properties_sitemaps[:limit_pages]:
    print('getting properties from: ', pmap_url)
    pmap_text = requests.get(pmap_url).text
    # each sub-sitemap lists the property URLs themselves
    p_urls = re.findall(r'<loc>(http://www\.rightmove\.co\.uk/[\-a-z]+/property-\d+\.html)</loc>', pmap_text)
    all_properties.extend(p_urls)
print('I\'ve got ' + str(len(all_properties)) + ' of urls with properties.\nSome examples:')
for url in all_properties[:6]:
    print('\n- '+url)

getting properties from:  http://www.rightmove.co.uk/sitemap_propertydetails0.xml
getting properties from:  http://www.rightmove.co.uk/sitemap_propertydetails1.xml
getting properties from:  http://www.rightmove.co.uk/sitemap_propertydetails2.xml
I've got 150000 of urls with properties.
Some examples:

- http://www.rightmove.co.uk/property-to-rent/property-50480715.html

- http://www.rightmove.co.uk/property-to-rent/property-53775521.html

- http://www.rightmove.co.uk/commercial-property-for-sale/property-64919567.html

- http://www.rightmove.co.uk/property-to-rent/property-68904185.html

- http://www.rightmove.co.uk/commercial-property-to-let/property-47279781.html

- http://www.rightmove.co.uk/property-to-rent/property-61726357.html
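The case study pulls `<loc>` values out with regular expressions. As an alternative sketch, the same values can be read with an XML parser from the standard library. The sample document below is made up (example.com URLs), and it assumes the sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def locs_from_sitemap(xml_text):
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", SITEMAP_NS) if loc.text]

# Made-up sitemap index, shaped like the one used in the case study.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://www.example.com/sitemap_propertydetails0.xml</loc></sitemap>
  <sitemap><loc>http://www.example.com/sitemap_propertydetails1.xml</loc></sitemap>
  <sitemap><loc>http://www.example.com/sitemap_other.xml</loc></sitemap>
</sitemapindex>"""

all_locs = locs_from_sitemap(SAMPLE_INDEX)
property_maps = [u for u in all_locs if "sitemap_propertydetails" in u]
print(property_maps)
```

The parser approach does not depend on how the XML is formatted (line breaks, attribute order), while the regex approach is shorter and filters in the same step. Which one is better depends on the target.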


## Look for APIs - even if the service does not provide a (public) one

### why are APIs better? (I know... silly question)
* a website's appearance can change frequently (which will break scrapers dependent on HTML tags), but an API stays the same for longer
* API responses often contain very structured data (e.g. in JSON or XML format)

### but there is no API available for website 'X' ;(
* a lot of modern web services use some kind of API internally [https://www.airbnb.co.uk/s/London/homes]
* to find out if a web service is using an API, track the network traffic in your developer tools (I like Chrome's tools, but Firefox, Opera etc. also have nice ones)
* there are some treasures hidden in requests of type xhr, fetch, json etc.
* often you need to supply additional information with your request (like API keys or tokens)
* API responses can be dynamically embedded in HTML [https://www.walmart.com/]
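When the key or token lives in the page source, it is worth failing loudly if it cannot be found, because that usually means the page has changed. A minimal sketch, assuming the key appears in inline JavaScript as `appkey: '<KEY>'`; the sample page and the key in it are fake.

```python
import re

# Assumed pattern: the key appears in inline JavaScript as  appkey: '<KEY>'
APPKEY_PATTERN = re.compile(r"appkey:\s*'([0-9A-Za-z\-]+)'")

def extract_app_key(html):
    """Return the first app key found in the page source, or raise if none is found."""
    match = APPKEY_PATTERN.search(html)
    if match is None:
        raise ValueError("app key not found; the page layout may have changed")
    return match.group(1)

# Made-up page source with a fake key.
SAMPLE_HTML = """
<script>
  var locator = new Locator({ appkey: 'ABCD1234-0000-FAKE-KEY0-000000000000', lang: 'en' });
</script>
"""

print(extract_app_key(SAMPLE_HTML))
```

An explicit `ValueError` is easier to debug in a scheduled scraper than the `IndexError` you get from indexing an empty `re.findall` result.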

In [16]:
"""
Case Study 2: getting API_KEY from html and then data from API

The True Value Company is an American retailer-owned hardware cooperative with over 4,000 independent retail 
locations worldwide. Create a scraper which gets all available True Value shops for a given postcode. The scraper
should not have any API key hardcoded, as the key can change during the site's lifetime.

Minimum data you should get:
- address
- city
- country
- latitude
- longitude
- name
- postalcode
- state
"""
main_page_url = 'http://hosted.where2getit.com/truevalue/index2015.html'
main_page_text = requests.get(main_page_url).text
# the key is embedded in the page's inline JavaScript, so read it from there instead of hardcoding it
api_key = re.findall(r"appkey: '([0-9A-Z\-]+)', ", main_page_text)[0]
print('Got API KEY from main page: ', api_key)
api_endpoint = 'http://hosted.where2getit.com/truevalue/rest/locatorsearch'
POST_CODE = 20004
body = {
  "request": {
    "appkey": api_key,
    "formdata": {
      "geoip": False,
      "dataview": "store_default",
      "limit": 40,
      "geolocs": {
        "geoloc": [
          {
            "addressline": str(POST_CODE)
          }
        ]
      },
      "searchradius": "40|50|80",
      "where": {
        "and": {
          "giftcard": {
            "eq": ""
          },
          "tvpaint": {
            "eq": ""
          },
          "creditcard": {
            "eq": ""
          },
          "localad": {
            "eq": ""
          },
          "ja": {
            "eq": ""
          },
          "tvr": {
            "eq": ""
          },
          "activeshiptostore": {
            "eq": ""
          },
          "main_id": {
            "eq": ""
          },
          "corronado": {
            "eq": ""
          },
          "tv": {
            "eq": "1"
          }
        }
      },
      "false": "0"
    }
  }
}
# the API expects the JSON document as the raw request body
r = requests.post(api_endpoint, data=json.dumps(body))
data = r.json()
print('Raw response from API: ', data)
shops = [{'name':entry['name'],
          'address':entry['address1'],
          'postalcode':entry['postalcode'],
          'city':entry['city'],
          'state':entry['state'],
          'country':entry['country'],
          'latitude':entry['latitude'],
          'longitude':entry['longitude']
         } for entry in data['response']['collection']]


Got API KEY from main page:  41C97F66-D0FF-11DD-8143-EF6F37ABAA09
Raw response from API:  {'response': {'collectioncount': 16, 'attributes': {'country': 'US', 'province': '', 'postalcode': '20004', 'city': 'WASHINGTON', 'radiusuom': 'mile', 'radius': '40', 'state': 'DC', 'address': '', 'centerpoint': '-77.0255,38.8957'}, 'activeobject': '', 'collection': [{'fri_open_time': '8:00 AM', 'giftcard': '1', 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(202) 462-3146', 'email': 'truevalue17@truevalue.net', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 7:30 PM', 'google_shared': None, 'uid': 1051076976, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 7:30 PM', 'yelp': None, 'sun_open_time': '10:00 AM', 'tvurl': 'http://www.truevalueon17th.com/', 'sun_close_time': '- 6:00 PM', 'tue_open_time': '8:00 AM', 'tvpaint': '1', '_distance': '1.32', 'wed_close_time': '- 7:30 PM', 'mon_open_time': '8:00 AM', 'google_ema

In [17]:
HTML(pd.DataFrame(shops).to_html())

| | address | city | country | latitude | longitude | name | postalcode | state |
|---|---|---|---|---|---|---|---|---|
| 0 | 1623 17th St NW | Washington | US | 38.911917 | -77.03849 | True Value On 17th | 20009-2433 | DC |
| 1 | 1108 24th Street NW | Washington | US | 38.9038748979592 | -77.0514083673469 | District Hardware and Bike | 20037-1432 | DC |
| 2 | 2100 W VIRGINIA AVENUE NE | WASHINGTON | US | 38.91536 | -76.98028 | KAMCO BUILDING SUPPLY | 20002-1834 | DC |
| 3 | 2213 N. Buchanan Street | Arlington | US | 38.89758 | -77.12483 | Bills True Value | 22207-2528 | VA |
| 4 | 7301 Mcarthur Blvd | Bethesda | US | 38.96912 | -77.14004 | Christophers Glen Echo Hardware | 20816 | MD |
| 5 | 5860 FARINGTON AVE | ALEXANDRIA | US | 38.7984625564754 | -77.1362345995378 | KAMCO BLDG SPLY & TRUE VALUE | 22304-4822 | VA |
| 6 | 7902 Fort Hunt Rd | Alexandria | US | 38.7436373198134 | -77.0570290532158 | Hollin Hall Variety Store | 22308-1203 | VA |
| 7 | 11616 LIVINGSTON RD | FORT WASHINGTON | US | 38.730159346246 | -76.9922404673855 | FORD LUMBER COMPANY | 20744-5148 | MD |
| 8 | 500 Olney Sandy Spring Rd | Sandy Spring | US | 39.14891 | -77.02204 | Christophers Hardware | 20860 | MD |
| 9 | 9124 Mathis Ave | Manassas | US | 38.7579993877551 | -77.4656865306122 | J E Rice Co. | 20110 | VA |


In [18]:
"""
Case Study 3: get available airbnb properties in London

You want to visit London and Airbnb looks like a nice option for you. As you are a crazy data geek and you want to run
some fancy algorithms to make a better choice of apartment to rent - you need data! Get all available Airbnb 
properties in London. You are interested in pricing, location, rating, no. of reviews, images and more. 
You also don't like to repeat yourself, so you need to build a scraper which will survive till your next trip.
"""

headers = {'accept-encoding': 'gzip, deflate, br',
           'x-requested-with': 'XMLHttpRequest',
           'accept-language': 'en-US,en;q=0.8,pl;q=0.6',
           'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
           'accept': 'application/json, text/javascript, */*; q=0.01',
           'referer': 'https://www.airbnb.co.uk/s/London/homes',
           'authority': 'www.airbnb.co.uk'}
# get api key embedded in html (yes... same story again :))
html = requests.get('https://www.airbnb.co.uk/s/London/homes', headers=headers).text
# the key is alphanumeric, so match the full 0-9 digit range
api_key = re.findall(r'key\&quot;:\&quot;([a-zA-Z0-9]*)\&quot;},\&quot;deep_link', html)[0]
# get first listing
enpoint='https://www.airbnb.co.uk/api/v2/explore_tabs'
params = {'version':'1.2.8',
          '_format':'for_explore_search_web',
          'items_per_grid':'20',
          'fetch_filters':'true',
          'is_guided_search':'true',
          'is_new_cards_experiment':'false',
          'supports_for_you_v3':'true',
          'screen_size':'small',
          'timezone_offset':'60',
          'auto_ib':'false',
          'luxury_pre_launch':'false',
          'metadata_only':'false',
          'is_standard_search':'true',
          'refinements[]':'homes',
          'selected_tab_id':'home_tab',
          'location':'London',
          'allow_override[]':'',
          's_tag':'DOIPutuT',
          'section_offset':'0',
          '_intents':'p1',
          'key':api_key,
          'currency':'GBP',
          'locale':'en-GB'}
r = requests.get(enpoint, params=params)
print('Got listing from: ', r.url)
ds = json.loads(r.text)

"""
note:
some useful airbnb endpoints
# get listings
https://www.airbnb.co.uk/api/v2/explore_tabs?version=1.2.8&_format=for_explore_search_web&items_per_grid=18&experiences_per_grid=20&guidebooks_per_grid=20&fetch_filters=true&is_guided_search=true&is_new_cards_experiment=false&supports_for_you_v3=true&screen_size=small&timezone_offset=60&auto_ib=false&luxury_pre_launch=false&metadata_only=false&is_standard_search=true&tab_id=home_tab&location=London&allow_override%5B%5D=&ne_lat=51.599363500119274&ne_lng=-0.06168207198925302&sw_lat=51.47626857868991&sw_lng=-0.289648380583003&zoom=12&search_by_map=true&federated_search_session_id=6d72b1e2-cb68-4877-b27c-8614e11fc5b0&_intents=p1&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=&locale=en-GB
# get booking details
https://www.airbnb.co.uk/api/v2/pdp_listing_booking_details?guests=1&listing_id=13575756&_format=for_web_dateless&_interaction_type=pageload&_intents=p3_book_it&_parent_request_uuid=aed9f0ce-1534-4a89-9cf6-c6813adcb95b&_p3_impression_id=p3_1506465875_Q2VDMsV0pLs27%2BtX&show_smart_promotion=0&force_boost_unc_priority_message_type=&number_of_adults=1&number_of_children=0&number_of_infants=0&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=GBP&locale=en-GB
"""

Got listing from:  https://www.airbnb.co.uk/api/v2/explore_tabs?metadata_only=false&items_per_grid=20&version=1.2.8&luxury_pre_launch=false&_intents=p1&screen_size=small&locale=en-GB&timezone_offset=60&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&is_new_cards_experiment=false&is_standard_search=true&_format=for_explore_search_web&fetch_filters=true&currency=GBP&supports_for_you_v3=true&location=London&is_guided_search=true&s_tag=DOIPutuT&selected_tab_id=home_tab&section_offset=0&auto_ib=false&allow_override%5B%5D=&refinements%5B%5D=homes


In [19]:
# paginate and get more properties
# note: the listings of the first page (already held in `ds` from the previous cell) are not added to `props` here
props = []
page_no = 1
page_limit = 3
while ds['explore_tabs'][0]['pagination_metadata']['has_next_page'] and page_no < page_limit:
    params['section_offset'] = str(page_no)
    print('will get page: ', page_no+1)
    r = requests.get(enpoint, params=params)
    ds = json.loads(r.text)
    # every section of the response carries its own list of listings
    for section in ds['explore_tabs'][0]['sections']:
        for prop in section['listings']:
            props.append({'name':prop['listing']['name'],
                          'room_type':prop['listing']['room_type'],
                          'person_capacity':prop['listing']['person_capacity'],
                          'pic':prop['listing']['picture']['picture'],
                          'rating':prop['listing']['star_rating'],
                          'latitude':prop['listing']['lat'],
                          'longitude':prop['listing']['lng'],
                          'price':prop['pricing_quote']['rate']['amount'],
                          'currency':prop['pricing_quote']['rate']['currency'],
                          'price_type':prop['pricing_quote']['rate_type']})
    page_no += 1
    print('No. of properties: ', len(props))
    time.sleep(3)  # be polite: pause between requests

will get page:  2
No. of properties:  20
will get page:  3
No. of properties:  40
will get page:  4
No. of properties:  60
will get page:  5
No. of properties:  80
will get page:  6
No. of properties:  100


In [20]:
HTML(pd.DataFrame(props).to_html())

| | currency | latitude | longitude | name | person_capacity | pic | price | price_type | rating | room_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GBP | 51.516746 | -0.050351 | (HAR-A)PRIVATE ROOM FOR 5PPL CLOSE TO TOWER BR... | 5 | https://a0.muscache.com/im/pictures/786aa625-2... | 25 | nightly | 5.0 | Private room |
| 1 | GBP | 51.524362 | -0.116995 | Double Room nr Soho \| Russell Square \|Kings Cross | 4 | https://a0.muscache.com/im/pictures/affe8de1-a... | 62 | nightly | 4.5 | Private room |
| 2 | GBP | 51.486756 | -0.104479 | 1 double room in Central London | 2 | https://a0.muscache.com/im/pictures/6bfb24c2-c... | 18 | nightly | 4.5 | Private room |
| 3 | GBP | 51.511007 | -0.226281 | THE QUEENS HOSTEL, 6 BED MIXED DORM E | 6 | https://a0.muscache.com/im/pictures/328d9beb-6... | 21 | nightly | 4.5 | Private room |
| 4 | GBP | 51.564283 | -0.120809 | Spacious Double room in Holloway, London | 2 | https://a0.muscache.com/im/pictures/ae707340-f... | 19 | nightly | 4.5 | Private room |
| 5 | GBP | 51.548527 | -0.226324 | Modern room 10 min from Central London | 2 | https://a0.muscache.com/im/pictures/a0e3a2af-8... | 40 | nightly | 5.0 | Private room |
| 6 | GBP | 51.491728 | -0.014746 | Double Room in Canary Wharf hs | 2 | https://a0.muscache.com/im/pictures/ebd8e53d-0... | 25 | nightly | 4.5 | Private room |
| 7 | GBP | 51.452218 | -0.02542 | Comfortable, Clean London room | 2 | https://a0.muscache.com/im/pictures/ae5469b5-9... | 25 | nightly | 5.0 | Private room |
| 8 | GBP | 51.50168 | -0.052571 | Double room, 2min from the station | 2 | https://a0.muscache.com/im/pictures/4cfc5263-5... | 57 | nightly | 4.5 | Private room |
| 9 | GBP | 51.624089 | -0.054654 | En-Suite Bedroom with Bathroom | 3 | https://a0.muscache.com/im/pictures/97921358/8... | 19 | nightly | 5.0 | Private room |


In [12]:
"""
Case Study 4 - JSON embedded in HTML

You want to analyse and compare some retail corporations. One of them is Walmart. You want to get as many 
details about the products they are selling as you can. You have investigated your target well, but unfortunately
you cannot find any hidden API... You need to get data from HTML. You have already found nice sitemaps, got product
pages and now you're ready to scrape.

What will be the most efficient and robust way to get all details about products?
"""

print("will skip code... but you can find a nice surprise in the source code of any Walmart product page :)")

will skip code... but you can find a nice surprise in the source code of any Walmart product page :)
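Since the code is skipped here, below is a generic sketch of the idea: if a page ships its data as JSON inside a `<script>` tag, you can load that JSON directly instead of parsing the visible HTML. The script id `__DATA__`, the sample page and the field names are all made up; this is not Walmart's actual markup.

```python
import json
import re

# Assumption: the page embeds its data in a <script type="application/json"> tag.
# The id "__DATA__" and the field names below are made up for illustration.
SCRIPT_PATTERN = re.compile(
    r'<script id="__DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)

def embedded_json(html):
    """Return the parsed JSON payload embedded in the page, or None if absent."""
    match = SCRIPT_PATTERN.search(html)
    return json.loads(match.group(1)) if match else None

SAMPLE_PAGE = """
<html><body>
<div class="prod"><span>Hammer</span><span>9.97</span></div>
<script id="__DATA__" type="application/json">
{"product": {"name": "Hammer", "price": 9.97, "in_stock": true}}
</script>
</body></html>
"""

data = embedded_json(SAMPLE_PAGE)
print(data["product"]["name"], data["product"]["price"])
```

The JSON payload gives typed, named fields in one step, while the `<div>`/`<span>` version of the same data would need layout-dependent selectors.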


## Handling JavaScript

### Why...?
- Sometimes the content of a webpage is dynamically presented/altered via JavaScript code
- when you download the HTML, it can be completely different from what you see in the browser
- you need to perform some sort of interaction with the page
- your target has some fancy anti-scraping software detecting that you're a bot

###  Selenium+PhantomJS
- Selenium is a browser automation tool, most often used for testing web applications.
- It can be useful while scraping
- PhantomJS is just a headless browser (there is no UI and it works in the background)
- BTW: you can use Selenium with any other browser (Firefox, Opera etc.)

### It's often overkill though!
- Scraping with Selenium+PhantomJS is much heavier than using simple Python libraries!
    + you have to have all additional libraries and software installed
    + it may be slower
    + you have to navigate as if you were a human (e.g. find a button element and click it programmatically)
    + you rely on page layout (not robust at all...)
- Very often you can find a workaround. For example, if you are trying to deal with infinite scroll,
  you can investigate what AJAX requests your browser sends while scrolling (and emulate them)

In [22]:
"""
Case Study 5 - how to (not) handle infinite scroll

You are planning to spam your friends on Facebook with random quotes to show how smart and deep you are.
You found an amazing page with quotes, but... it contains infinite scroll?! Don't worry though! 
As they say: "You have to look through the rain to see the rainbow."

Your task is to get all quotes from http://spidyquotes.herokuapp.com/
"""
spidyquotes_url = 'http://spidyquotes.herokuapp.com/scroll'
# with Selenium+PhantomJS (in general - bad option. but yeah... may be fancy)
# note: PhantomJS support and the find_elements_by_* helpers were removed in Selenium 4 releases,
# so this cell needs an older Selenium release to run as written
driver = webdriver.PhantomJS('/Applications/phantomjs-2.1.1-macosx/bin/phantomjs')
driver.get(spidyquotes_url)
no_of_scrolls = 5
for scroll in range(no_of_scrolls):
    # do a fancy screenshot here
    driver.get_screenshot_as_file('/Users/stulski/Desktop/osobiste/pydata_meetup/shot_{}.jpg'.format(scroll))
    # scroll down and give the page a moment to load the next batch of quotes
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
quote_elements = driver.find_elements_by_class_name('quote')
all_quotes = [element.find_elements_by_class_name('text')[0].text for element in quote_elements]
driver.quit()  # shut down the headless browser process
print('No. of quotes is: ', len(all_quotes))
print(random.choice(all_quotes))

No. of quotes is:  60
“A day without sunshine is like, you know, night.”


In [23]:
# same as above, without unnecessary hassle
p_idx = 1
spidyquotes_better_url = 'http://spidyquotes.herokuapp.com/api/quotes?page='
r = json.loads(requests.get(spidyquotes_better_url+str(p_idx)).text)
time.sleep(1)
all_quotes = []
# note: the loop stops as soon as it sees a page with has_next == False,
# so the quotes from that final page are not collected
while r['has_next'] == True:
    for quote in r['quotes']:
        all_quotes.append(quote['text'])
    p_idx += 1
    r = json.loads(requests.get(spidyquotes_better_url+str(p_idx)).text)
    time.sleep(1)
print('No. of quotes is: ', len(all_quotes))
print(random.choice(all_quotes))

No. of quotes is:  90
“Not all of us can do great things. But we can do small things with great love.”
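The loop in the cell above checks `has_next` before reading a page, so the last page is fetched but never appended. A possible fix is to read the page first and check the flag afterwards. The sketch below does that against a made-up stand-in for the API (`fake_fetch_page`, 3 pages of 2 quotes), so it runs without network access.

```python
def fetch_all_quotes(fetch_page):
    """Collect quote texts from every page, including the final one."""
    quotes = []
    page_no = 1
    while True:
        page = fetch_page(page_no)
        quotes.extend(q["text"] for q in page["quotes"])
        if not page["has_next"]:
            break
        page_no += 1
    return quotes

# Made-up stand-in for the API: 3 pages with 2 quotes each.
def fake_fetch_page(page_no, last_page=3):
    return {
        "has_next": page_no < last_page,
        "quotes": [{"text": "quote {}-{}".format(page_no, i)} for i in (1, 2)],
    }

collected = fetch_all_quotes(fake_fetch_page)
print(len(collected))
```

Against the real endpoint, `fetch_page` could be a small function that calls `requests.get(spidyquotes_better_url + str(page_no)).json()` and sleeps for a second between calls.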


## Key notes and advice
* investigate your target well (sitemaps, hidden APIs, how it works under the hood)
* use incognito mode while exploring
* use developer tools
* think about scraping as a "hacking" activity rather than just parsing and getting HTML elements
* change your user-agent
* add time.sleep if you can afford it
* the same data is often available in different places on a website. find the ones that are easy to scrape!
* if you need to parse HTML and get data from there - try to find something which will not break (avoid finding general elements like DIVs and then picking the Nth of those)
* look for commonalities
* websites are different and there is no one magical way to get your data
* it looks easy when I'm showing it, but sometimes it takes time to reverse engineer websites

* if you want production-level scrapers - use proxies
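The user-agent and time.sleep tips can be combined into two tiny helpers. This is only a sketch: the helper names, the user-agent list and the delay values are arbitrary choices, not recommendations for any specific website.

```python
import random

# Example user-agent strings; any realistic, current browser strings would do.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0",
]

def build_headers(rng=random):
    """Pick a user-agent at random for the next request."""
    return {"user-agent": rng.choice(USER_AGENTS)}

def polite_delay(base=2.0, jitter=1.0, rng=random):
    """Return a pause in seconds between base and base + jitter."""
    return base + rng.uniform(0, jitter)

headers = build_headers()
pause = polite_delay()
print(headers["user-agent"][:40], round(pause, 2))
```

In a scraping loop this would be used as `requests.get(url, headers=build_headers())` followed by `time.sleep(polite_delay())`.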