- also prepare to explain, from the presentation, how you arrived at everything (+ screenshots just in case) instead of jumping straight into the code

## Who am I?
- Slawomir Tulski (Slaw)
- currently: Big Data Engineer at WorldRemit
- previously: Python Data Programmer at Import.io (web scraping start-up)
- linkedin: https://www.linkedin.com/in/slawomir-tulski-091611116/
- personal website: http://slawomirtulski.com/

## Some basic...

### How does your browser work?
- The World Wide Web operates on a client/server model
- The web browser contacts a web server and requests information or resources
- The server locates the information (HTML, images etc.) and sends it back to the web browser
- The browser displays the results
- The browser can execute JavaScript code to dynamically "do things" (send requests, change the site's appearance and basically everything else)
- 4 basic types of HTTP requests: GET and POST (you'll use those most often while scraping), PUT, DELETE
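
As a rough sketch of the difference between GET and POST, the snippet below builds (but does not send) one request of each kind with the standard library's `urllib.request`. The URL `https://example.com/search` and the parameters are placeholders chosen for illustration, not endpoints used in this talk.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder URL, used only for illustration.
BASE_URL = 'https://example.com/search'

# GET: parameters travel in the query string.
get_req = Request(BASE_URL + '?' + urlencode({'q': 'london', 'page': 1}))

# POST: parameters travel in the request body (here encoded as JSON).
post_req = Request(
    BASE_URL,
    data=json.dumps({'q': 'london'}).encode('utf-8'),
    headers={'Content-Type': 'application/json'},
    method='POST',
)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Nothing is sent over the network here; passing either object to `urllib.request.urlopen` would perform the actual request.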

### How to see what my browser is doing?
- web browsers usually have some sort of "Developer Tools" (if not, you should think about changing your browser)
- there should be a 'Network' tab which shows you what is being sent between your browser and the server
- Developer Tools also include a console where you can execute JavaScript

### "Standard" scraping approach
1. don't scrape... find data somewhere else!
2. don't scrape... they should provide an API!
3. ok.. you're screwed. get HTML and parse it!
4. you need a lot of data from different pages of one web service? - build a "crawler" and catch them all


### My goals for today
- show how to tackle the problem of web scraping in ways other than the "standard" approach
- present useful tips and tricks for web scraping
- avoid making a tutorial on popular HTML parsing / scraping libraries

### Plan
- Stop crawling, investigate your target instead
    + case study 1: getting all urls you need from website 
- Look for APIs, even if the service does not provide a (public) one
    + case study 2: getting API KEY and using hidden API in store locator service
    + case study 3: getting available airbnb properties in London
    + case study 4: API JSON response embedded in HTML
- Handling JavaScript with Selenium
    + case study 5: handling infinite scroll
- Key takeaways

## Using sitemaps instead of crawling whole website

### what is a web "crawler"?
* an automated bot which recurses from a start page through every internal link it finds
* theoretically, it will traverse all urls on the website
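
To make the idea concrete, here is a minimal sketch of such a crawler. It assumes a made-up three-page site held in a dict (`FAKE_SITE`) so that it runs without network access; the `fetch` callable is a stand-in for whatever would download a page in a real crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "website": URL -> HTML. A real crawler would
# download each page instead of reading it from this dict.
FAKE_SITE = {
    'http://site.test/': '<a href="/a">A</a> <a href="/b">B</a>',
    'http://site.test/a': '<a href="/b">B</a> <a href="http://other.test/">ext</a>',
    'http://site.test/b': '<a href="/">home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)


def crawl(start_url, fetch):
    """Breadth-first traversal of all internal links reachable from start_url."""
    domain = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    while queue:
        url = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            link = urljoin(url, href)
            # follow internal links only, and never visit a URL twice
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


visited = crawl('http://site.test/', lambda url: FAKE_SITE.get(url, ''))
print(sorted(visited))
```

Even in this toy version the crawler has to fetch every page just to learn which URLs exist.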

### why is it not the best idea?
* not precise (it's brute force... a lot of requests made and a lot of garbage scraped)
* easy to get caught in a trap (honeypots)
* you need to write more code and care about a lot of things (what type of url did it get? can I go there?)
* assumes a particular page layout and tests whatever it encounters

### what to use instead?
* very often there is a sitemap of the whole website already available!
* very often sitemaps are hidden! if you can't see one on the page, try **/sitemap.xml** [http://www.rightmove.co.uk/]
* also, information about the sitemap can be found in the **robots.txt** file [https://www.walmart.com/]
* if there is no sitemap, try to follow the pattern **get categories -> get listing -> get item**
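
A small sketch of both lookups, using only the standard library. The `robots.txt` and sitemap contents below are invented samples (on a fictional `site.test` domain) so the code runs offline; real files would be downloaded first.

```python
import xml.etree.ElementTree as ET

# Made-up sample inputs; real files would be downloaded from the site.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
Sitemap: http://site.test/sitemap.xml
"""

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://site.test/item-1.html</loc></url>
  <url><loc>http://site.test/item-2.html</loc></url>
</urlset>
"""

# Namespace used by the sitemap protocol.
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def sitemaps_from_robots(robots_txt):
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(':')
        if key.strip().lower() == 'sitemap':
            urls.append(value.strip())
    return urls


def locs_from_sitemap(sitemap_xml):
    """Return every <loc> value from a sitemap (or sitemap index) document."""
    root = ET.fromstring(sitemap_xml.encode('utf-8'))
    return [loc.text.strip() for loc in root.iterfind('.//sm:loc', SITEMAP_NS)]


print(sitemaps_from_robots(ROBOTS_TXT))
print(locs_from_sitemap(SITEMAP_XML))
```

Parsing the XML instead of using a regular expression keeps working if the site changes its URL scheme, at the cost of having to filter the URLs afterwards.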

In [1]:
# built-in
import json
import re
import time
# 3rd party
from IPython.display import HTML
import pandas as pd
import requests
from selenium import webdriver

In [32]:
# case study 1: getting all links for properties in http://www.rightmove.co.uk/
main_sitemap_url = 'http://www.rightmove.co.uk/sitemap.xml'
main_sitemap_text = requests.get(main_sitemap_url).text
# the main sitemap lists further sitemaps; keep only those with property details
properties_sitemaps = re.findall(r'<loc>(http://www\.rightmove\.co\.uk/sitemap_propertydetails\d+\.xml)</loc>', main_sitemap_text)
limit_pages = 1  # how many property sitemaps to download
all_properties = []
for pmap_url in properties_sitemaps[:limit_pages]:
    pmap_text = requests.get(pmap_url).text
    # each property sitemap lists the urls of individual properties
    p_urls = re.findall(r'<loc>(http://www\.rightmove\.co\.uk/[\-a-z]+/property-\d+\.html)</loc>', pmap_text)
    all_properties.extend(p_urls)
print('I\'ve got ' + str(len(all_properties)) + ' of urls with properties.\nSome examples:')
for url in all_properties[:6]:
    print('\n- '+url)

I've got 50000 of urls with properties.
Some examples:

- http://www.rightmove.co.uk/property-to-rent/property-61543117.html

- http://www.rightmove.co.uk/property-to-rent/property-50480715.html

- http://www.rightmove.co.uk/property-to-rent/property-68904185.html

- http://www.rightmove.co.uk/commercial-property-to-let/property-47279781.html

- http://www.rightmove.co.uk/property-to-rent/property-61726357.html

- http://www.rightmove.co.uk/property-to-rent/property-50556513.html


## Look for APIs - even if the service does not provide a (public) one

### why APIs are better (I know... silly question)
* web appearance can change frequently (which will break scrapers dependent on HTML tags), but an API stays the same for longer
* often, responses from an API contain very structured data (e.g. in JSON or XML format)

### but there is no API available for website 'X' ;(
* a lot of modern web services use some kind of API internally [https://www.airbnb.co.uk/s/London/homes]
* to find out if a web service is using an API, track the network traffic in your developer tools (I like Chrome's tools, but Firefox, Opera etc. have nice ones too)
* there are some treasures hidden in requests of type xhr, fetch, json etc.
* also, API responses can be dynamically embedded in HTML [https://www.walmart.com/]
* sometimes you need to supply additional information with your request (like API keys or tokens)
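
As an illustration of the last point, the sketch below pulls an API key out of a page's source with a regular expression. The HTML sample, the `appkey` field name and the key value are all invented for this example; on a real site you would first look at the page source to see how (and whether) the key is embedded.

```python
import re

# Made-up page source; the 'appkey' name and the key value are illustrative.
SAMPLE_HTML = """
<script>
  var config = { appkey: 'ABCD1234-EF56-7890', lang: 'en' };
</script>
"""


def extract_api_key(html):
    """Return the first appkey found in the page source, or None."""
    match = re.search(r"appkey:\s*'([0-9A-Za-z\-]+)'", html)
    return match.group(1) if match else None


api_key = extract_api_key(SAMPLE_HTML)
if api_key is None:
    # fail loudly: a missing key usually means the page layout has changed
    raise RuntimeError('API key not found in page source')
print(api_key)
```

Returning `None` instead of indexing the first match directly makes a layout change show up as a clear error rather than an `IndexError`.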

In [30]:
# case study 2: getting API_KEY from html and then data from API
main_page_url = 'http://hosted.where2getit.com/truevalue/index2015.html'
main_page_text = requests.get(main_page_url).text
api_key = re.findall(r"appkey: '([0-9A-Z\-]+)', ", main_page_text)[0]
print('Got API KEY from main page: ', api_key)
api_endpoint = 'http://hosted.where2getit.com/truevalue/rest/locatorsearch'
POST_CODE = 20004
body = {
  "request": {
    "appkey": api_key,
    "formdata": {
      "geoip": False,
      "dataview": "store_default",
      "limit": 40,
      "geolocs": {
        "geoloc": [
          {
            "addressline": str(POST_CODE)
          }
        ]
      },
      "searchradius": "40|50|80",
      "where": {
        "and": {
          "giftcard": {
            "eq": ""
          },
          "tvpaint": {
            "eq": ""
          },
          "creditcard": {
            "eq": ""
          },
          "localad": {
            "eq": ""
          },
          "ja": {
            "eq": ""
          },
          "tvr": {
            "eq": ""
          },
          "activeshiptostore": {
            "eq": ""
          },
          "main_id": {
            "eq": ""
          },
          "corronado": {
            "eq": ""
          },
          "tv": {
            "eq": "1"
          }
        }
      },
      "false": "0"
    }
  }
}
r = requests.post(api_endpoint, data=json.dumps(body))
data = json.loads(r.text)
print('Raw response from API: ', data)
shops = [{'name':entry['name'],
          'address':entry['address1'],
          'postalcode':entry['postalcode'],
          'city':entry['city'],
          'state':entry['state'],
          'country':entry['country'],
          'latitude':entry['latitude'],
          'longitude':entry['longitude']
         } for entry in data['response']['collection']]


Got API KEY from main page:  41C97F66-D0FF-11DD-8143-EF6F37ABAA09
Raw response from API:  {'response': {'activeobject': '', 'collectionname': 'poi', 'attributes': {'centerpoint': '-77.0187,38.9076', 'city': 'WASHINGTON', 'radiusuom': 'mile', 'radius': '40', 'state': 'DC', 'postalcode': '20001', 'province': '', 'address': '', 'country': 'US'}, 'collection': [{'tvadvurl': None, 'localad': None, 'sun_open_time': '10:00 AM', 'csurl': None, 'tvadv': None, 'tv': '1', 'giftcard': '1', 'main_id': 'TV', 'hgurl': None, 'country': 'US', 'city': 'Washington', 'jaurl': None, 'tue_close_time': '- 7:30 PM', 'dsurl': None, 'tvr': '1', 'fax': None, 'corronado': None, 'creditcard': None, 'tue_open_time': '8:00 AM', 'fri_close_time': '- 7:30 PM', 'google': None, 'cs': None, 'hg': None, 'thur_open_time': '8:00 AM', 'activeshiptostore': '1', 'facebookurl': None, 'thur_close_time': '- 7:30 PM', 'ja': None, 'sat_close_time': '- 6:00 PM', 'clientkey': 'L4ZK7Q8W-PA4X-4IS6-587Y-FRJLJ8Z5JF84', 'icon': 'default',

In [31]:
HTML(pd.DataFrame(shops).to_html())

| | address | city | country | latitude | longitude | name | postalcode | state |
|---|---|---|---|---|---|---|---|---|
| 0 | 1623 17th St NW | Washington | US | 38.911917 | -77.03849 | True Value On 17th | 20009-2433 | DC |
| 1 | 1108 24th Street NW | Washington | US | 38.9038748979592 | -77.0514083673469 | District Hardware and Bike | 20037-1432 | DC |
| 2 | 2100 W VIRGINIA AVENUE NE | WASHINGTON | US | 38.91536 | -76.98028 | KAMCO BUILDING SUPPLY | 20002-1834 | DC |
| 3 | 2213 N. Buchanan Street | Arlington | US | 38.89758 | -77.12483 | Bills True Value | 22207-2528 | VA |
| 4 | 7301 Mcarthur Blvd | Bethesda | US | 38.96912 | -77.14004 | Christophers Glen Echo Hardware | 20816 | MD |
| 5 | 5860 FARINGTON AVE | ALEXANDRIA | US | 38.7984625564754 | -77.1362345995378 | KAMCO BLDG SPLY & TRUE VALUE | 22304-4822 | VA |
| 6 | 7902 Fort Hunt Rd | Alexandria | US | 38.7436373198134 | -77.0570290532158 | Hollin Hall Variety Store | 22308-1203 | VA |
| 7 | 11616 LIVINGSTON RD | FORT WASHINGTON | US | 38.730159346246 | -76.9922404673855 | FORD LUMBER COMPANY | 20744-5148 | MD |
| 8 | 500 Olney Sandy Spring Rd | Sandy Spring | US | 39.14891 | -77.02204 | Christophers Hardware | 20860 | MD |
| 9 | 9124 Mathis Ave | Manassas | US | 38.7579993877551 | -77.4656865306122 | J E Rice Co. | 20110 | VA |


In [103]:
# case study 3: get available airbnb properties in London
headers = {'accept-encoding': 'gzip, deflate, br',
           'x-requested-with': 'XMLHttpRequest',
           'accept-language': 'en-US,en;q=0.8,pl;q=0.6',
           'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
           'accept': 'application/json, text/javascript, */*; q=0.01',
           'referer': 'https://www.airbnb.co.uk/s/London/homes',
           'authority': 'www.airbnb.co.uk'}
# get api key embedded in html (yes... same story again :))
html = requests.get('https://www.airbnb.co.uk/s/London/homes', headers=headers).text
api_key = re.findall(r'key\&quot;:\&quot;([a-zA-Z0-9]*)\&quot;},\&quot;deep_link', html)[0]
# get first listing
enpoint='https://www.airbnb.co.uk/api/v2/explore_tabs'
params = {'version':'1.2.8',
          '_format':'for_explore_search_web',
          'items_per_grid':'20',
          'fetch_filters':'true',
          'is_guided_search':'true',
          'is_new_cards_experiment':'false',
          'supports_for_you_v3':'true',
          'screen_size':'small',
          'timezone_offset':'60',
          'auto_ib':'false',
          'luxury_pre_launch':'false',
          'metadata_only':'false',
          'is_standard_search':'true',
          'refinements[]':'homes',
          'selected_tab_id':'home_tab',
          'location':'London',
          'allow_override[]':'',
          's_tag':'DOIPutuT',
          'section_offset':'0',
          '_intents':'p1',
          'key':api_key,
          'currency':'GBP',
          'locale':'en-GB'}
r = requests.get(enpoint, params=params)
print('Got listing from: ', r.url)
ds = json.loads(r.text)

"""
note:
some useful airbnb endpoints
# get listings
https://www.airbnb.co.uk/api/v2/explore_tabs?version=1.2.8&_format=for_explore_search_web&items_per_grid=18&experiences_per_grid=20&guidebooks_per_grid=20&fetch_filters=true&is_guided_search=true&is_new_cards_experiment=false&supports_for_you_v3=true&screen_size=small&timezone_offset=60&auto_ib=false&luxury_pre_launch=false&metadata_only=false&is_standard_search=true&tab_id=home_tab&location=London&allow_override%5B%5D=&ne_lat=51.599363500119274&ne_lng=-0.06168207198925302&sw_lat=51.47626857868991&sw_lng=-0.289648380583003&zoom=12&search_by_map=true&federated_search_session_id=6d72b1e2-cb68-4877-b27c-8614e11fc5b0&_intents=p1&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=&locale=en-GB
# get booking details
https://www.airbnb.co.uk/api/v2/pdp_listing_booking_details?guests=1&listing_id=13575756&_format=for_web_dateless&_interaction_type=pageload&_intents=p3_book_it&_parent_request_uuid=aed9f0ce-1534-4a89-9cf6-c6813adcb95b&_p3_impression_id=p3_1506465875_Q2VDMsV0pLs27%2BtX&show_smart_promotion=0&force_boost_unc_priority_message_type=&number_of_adults=1&number_of_children=0&number_of_infants=0&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=GBP&locale=en-GB
"""

https://www.airbnb.co.uk/api/v2/explore_tabs?fetch_filters=true&allow_override%5B%5D=&refinements%5B%5D=homes&_format=for_explore_search_web&luxury_pre_launch=false&section_offset=0&currency=GBP&is_new_cards_experiment=false&supports_for_you_v3=true&metadata_only=false&is_guided_search=true&auto_ib=false&_intents=p1&locale=en-GB&version=1.2.8&s_tag=DOIPutuT&location=London&selected_tab_id=home_tab&timezone_offset=60&items_per_grid=20&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&is_standard_search=true&screen_size=small


In [105]:
# paginate and get more properties
props = []
page_no = 1
page_limit = 6
# continue only while the API reports another page AND the page limit is not exceeded
while ds['explore_tabs'][0]['pagination_metadata']['has_next_page'] and page_no <= page_limit:
    params['section_offset'] = str(page_no)
    print('will get page: ', page_no+1)
    r = requests.get(enpoint, params=params)
    ds = json.loads(r.text)
    for section in ds['explore_tabs'][0]['sections']:
        for prop in section['listings']:
            props.append({'name':prop['listing']['name'],
                          'room_type':prop['listing']['room_type'],
                          'person_capacity':prop['listing']['person_capacity'],
                          'pic':prop['listing']['picture']['picture'],
                          'rating':prop['listing']['star_rating'],
                          'latitude':prop['listing']['lat'],
                          'longitude':prop['listing']['lng'],
                          'price':prop['pricing_quote']['rate']['amount'],
                          'currency':prop['pricing_quote']['rate']['currency'],
                          'price_type':prop['pricing_quote']['rate_type']})
    page_no += 1
    print('No. of properties: ', len(props))
    # be polite: wait between requests
    time.sleep(3)

will get page:  2
20
{'room_type': 'Private room', 'price': 26, 'price_type': 'nightly', 'latitude': 51.517100765969694, 'person_capacity': 4, 'longitude': -0.05321401070513297, 'name': '(ARM-A)PRIVATE ROOM FOR 4 PPL IN ZONE 1.', 'currency': 'GBP', 'pic': 'https://a0.muscache.com/im/pictures/f1ebef2f-56b5-4a66-ad1e-fae9de4e5316.jpg?aki_policy=large', 'rating': 4.5}
will get page:  3
40
{'room_type': 'Private room', 'price': 26, 'price_type': 'nightly', 'latitude': 51.517100765969694, 'person_capacity': 4, 'longitude': -0.05321401070513297, 'name': '(ARM-A)PRIVATE ROOM FOR 4 PPL IN ZONE 1.', 'currency': 'GBP', 'pic': 'https://a0.muscache.com/im/pictures/f1ebef2f-56b5-4a66-ad1e-fae9de4e5316.jpg?aki_policy=large', 'rating': 4.5}
will get page:  4
60
{'room_type': 'Private room', 'price': 26, 'price_type': 'nightly', 'latitude': 51.517100765969694, 'person_capacity': 4, 'longitude': -0.05321401070513297, 'name': '(ARM-A)PRIVATE ROOM FOR 4 PPL IN ZONE 1.', 'currency': 'GBP', 'pic': 'https:/

KeyboardInterrupt: 

In [106]:
HTML(pd.DataFrame(props).to_html())

| | currency | latitude | longitude | name | person_capacity | pic | price | price_type | rating | room_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GBP | 51.517101 | -0.053214 | (ARM-A)PRIVATE ROOM FOR 4 PPL IN ZONE 1. | 4 | https://a0.muscache.com/im/pictures/f1ebef2f-5... | 26 | nightly | 4.5 | Private room |
| 1 | GBP | 51.479652 | -0.169762 | Luxurious double @ heart of london | 2 | https://a0.muscache.com/im/pictures/1528064c-5... | 36 | nightly | 5.0 | Private room |
| 2 | GBP | 51.510506 | -0.129266 | Trafalgar Square, Peaceful Room.\nFemale frien... | 2 | https://a0.muscache.com/im/pictures/e0265be9-4... | 57 | nightly | 5.0 | Private room |
| 3 | GBP | 51.548527 | -0.226324 | Modern room 10 min from Central London | 2 | https://a0.muscache.com/im/pictures/a0e3a2af-8... | 40 | nightly | 5.0 | Private room |
| 4 | GBP | 51.511007 | -0.226281 | THE QUEENS HOSTEL, 6 BED MIXED DORM E | 6 | https://a0.muscache.com/im/pictures/328d9beb-6... | 21 | nightly | 4.5 | Private room |
| 5 | GBP | 51.564283 | -0.120809 | Spacious Double room in Holloway, London | 2 | https://a0.muscache.com/im/pictures/ae707340-f... | 20 | nightly | 4.5 | Private room |
| 6 | GBP | 51.491728 | -0.014746 | Double Room in Canary Wharf hs | 2 | https://a0.muscache.com/im/pictures/ebd8e53d-0... | 26 | nightly | 4.5 | Private room |
| 7 | GBP | 51.476615 | -0.13297 | SMALL Stockwell station single room-£17 | 1 | https://a0.muscache.com/im/pictures/35f34bb4-1... | 18 | nightly | 4.5 | Private room |
| 8 | GBP | 51.553402 | -0.241072 | Large Balcony Double Bedroom in Dollis Hill! BR5 | 2 | https://a0.muscache.com/im/pictures/64093166-7... | 31 | nightly | 4.5 | Private room |
| 9 | GBP | 51.498859 | -0.086188 | Single box room at london bridge | 1 | https://a0.muscache.com/im/pictures/74d672bc-e... | 23 | nightly | 4.0 | Private room |


In [1]:
# case study 4 - json embedded in html
# will skip code... but you can find a nice surprise in the source of any Walmart product page :)
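
For readers who want to see the general idea anyway, here is a hedged sketch of pulling a JSON object out of a `<script>` tag. The page below is invented, and the variable name `__PAGE_DATA__` is hypothetical; it is not claimed to be what any particular site uses, so check the real page source for the actual name.

```python
import json
import re

# Made-up page; the variable name __PAGE_DATA__ is hypothetical.
SAMPLE_HTML = """
<html><body>
<script>window.__PAGE_DATA__ = {"product": {"name": "Kettle", "price": 19.99}};</script>
</body></html>
"""


def extract_embedded_json(html, var_name='__PAGE_DATA__'):
    """Find `window.<var_name> = {...}` in a page and decode the JSON value."""
    match = re.search(r'window\.' + re.escape(var_name) + r'\s*=\s*', html)
    if match is None:
        return None
    # raw_decode parses one JSON value and ignores whatever follows it,
    # so there is no need to guess where the object ends with a regex.
    obj, _ = json.JSONDecoder().raw_decode(html, match.end())
    return obj


data = extract_embedded_json(SAMPLE_HTML)
print(data['product']['name'], data['product']['price'])
```

Letting the JSON decoder find the end of the object avoids the usual problem of a regex stopping at the first `}` inside nested data.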

## Handling JavaScript

### Selenium+PhantomJS
- Selenium is a browser automation tool most often used for testing web applications.
- It can be extremely useful while scraping
- PhantomJS is just a headless browser (there is no UI and it works in the background)
- BTW: you can use Selenium with any other browser (Firefox, Opera etc.)

### Why...?
- Sometimes the content of a webpage is dynamically presented/altered via JavaScript code
- when you download the HTML, it can be completely different from what you see in the browser
- your target has some fancy anti-scraping software detecting that you're a bot

### It's often overkill though!
- Scraping with Selenium+PhantomJS is much heavier than using simple Python libraries!
    + you have to have all the additional libraries and software installed
    + it may be slower
    + you have to navigate as if you were a human (e.g. find a button element and click it programmatically)
    + you rely on page layout (not robust at all...)
- Very often you can find a workaround. For example, if you are dealing with infinite scroll,
  you can investigate what AJAX requests your browser is sending while scrolling (and emulate them)

In [4]:
# case study 5 - handling infinite scroll
spidyquotes_url = 'http://spidyquotes.herokuapp.com/scroll'
# with Selenium+PhantomJS (in general - bad option. but yeah... fancy)
driver = webdriver.PhantomJS('/Applications/phantomjs-2.1.1-macosx/bin/phantomjs')
driver.get(spidyquotes_url)
no_of_scrolls = 5
scroll = 0
while scroll < no_of_scrolls:
    # take a fancy screenshot here
    driver.get_screenshot_as_file('/Users/stulski/Desktop/osobiste/pydata_meetup/shot_{}.jpg'.format(scroll))
    # scroll down
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    scroll += 1
quote_elements = driver.find_elements_by_class_name('quote')
all_quotes = [element.find_elements_by_class_name('text')[0].text for element in quote_elements]
print('No. of quotes is: ', len(all_quotes))
print(all_quotes[-1])

No. of quotes is:  60
“′Classic′ - a book which people praise and don't read.”


In [7]:
# same as above, without the unnecessary hassle
p_idx = 1
spidyquotes_better_url = 'http://spidyquotes.herokuapp.com/api/quotes?page='
r = json.loads(requests.get(spidyquotes_better_url+str(p_idx)).text)
time.sleep(1)
all_quotes = []
# note: the loop body runs only while has_next is True,
# so quotes from the final page are not collected
while r['has_next']:
    for quote in r['quotes']:
        all_quotes.append(quote['text'])
    p_idx += 1
    r = json.loads(requests.get(spidyquotes_better_url+str(p_idx)).text)
    time.sleep(1)
print('No. of quotes is: ', len(all_quotes))
print(all_quotes[-1])

No. of quotes is:  90
“I believe in Christianity as I believe that the sun has risen: not only because I see it, but because by it I see everything else.”
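
A generic version of this pagination pattern could look like the sketch below. It uses a made-up three-page response set (`FAKE_PAGES`) shaped like the `{'quotes': [...], 'has_next': ...}` payload, and a `fetch_page` callable that is an assumption of this example; with a real API it would wrap the HTTP call and the polite `time.sleep`.

```python
# Made-up three-page API; each page mimics the
# {'quotes': [...], 'has_next': bool} shape of the response.
FAKE_PAGES = {
    1: {'quotes': [{'text': 'q1'}, {'text': 'q2'}], 'has_next': True},
    2: {'quotes': [{'text': 'q3'}], 'has_next': True},
    3: {'quotes': [{'text': 'q4'}], 'has_next': False},
}


def iter_quotes(fetch_page, max_pages=100):
    """Yield quote texts page by page, including the final page."""
    for page_no in range(1, max_pages + 1):
        page = fetch_page(page_no)
        for quote in page['quotes']:
            yield quote['text']
        # check has_next only after the current page has been collected
        if not page['has_next']:
            break


all_quotes = list(iter_quotes(FAKE_PAGES.__getitem__))
print(len(all_quotes), all_quotes[-1])
```

The `max_pages` cap is a safety net so that a misbehaving API which always reports another page cannot keep the loop running forever.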


## Key takeaways and advice
* investigate your target well (sitemaps, hidden APIs, how it works under the hood)
* use incognito mode while exploring
* use developer tools
* think about scraping as a "hacking" activity rather than just parsing HTML elements
* change your user-agent
* the same data is often in different places on the website. find the ones that are easy to scrape!
* if you need to parse HTML and get data from there - try to find something which will not break (avoid finding general elements like DIVs and then picking the Nth of those)
* look for commonalities
* websites are different and there is no one magical way to get your data

* if you want production-level scrapers - use a proxy
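
Two of the tips above (user-agent and proxy) can be sketched with the standard library alone. The user-agent string and the proxy address are placeholders made up for this example, and nothing is actually sent over the network.

```python
from urllib.request import ProxyHandler, Request, build_opener

# All values below are placeholders for illustration.
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0'
PROXY = 'http://127.0.0.1:8080'


def make_request(url, user_agent=USER_AGENT):
    """Build a GET request that announces a browser-like User-Agent."""
    return Request(url, headers={'User-Agent': user_agent})


# An opener that would route HTTP(S) traffic through the proxy.
opener = build_opener(ProxyHandler({'http': PROXY, 'https': PROXY}))

req = make_request('https://example.com/')
print(req.get_header('User-agent'))
# To actually send it: opener.open(req, timeout=10)
```

With `requests`, the same idea is expressed through the `headers=` and `proxies=` arguments of `requests.get`.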