# Week 06 - Advanced scraping: anti-crawler, browser emulation and other nitty gritty

## Objective
- Bypass anti-crawler measures by modifying the user agent
- Handle common glitches: encoding, pagination, ...
- Handle dynamic pages with a headless browser
- Handle login with a headless browser
- Scrape social networks
- Case studies on different websites
- Further strengthen the use of list-of-dict data types; organise multi-layer loops and item-based parsing logic

## Anti-crawling
### User agent

    import requests

    # By default, requests identifies itself as python-requests/<version>
    r = requests.get('https://nghttp2.org/httpbin/user-agent')
    r.text
    # '{"user-agent":"python-requests/2.19.1"}\n'

    # Override the User-Agent header so the request looks like a different client
    r = requests.get('https://nghttp2.org/httpbin/user-agent', headers={'user-agent': 'See, I modified the user agent!!'})
    r.text
    # '{"user-agent":"See, I modified the user agent!!"}\n'

### Rate throttling
- Limit by IP
- Limit by cookie or access token
- Limit by API quota per unit of time, usually implemented with a leaky bucket algorithm
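
To get a feeling for how a server might enforce such a quota, here is a minimal sketch of one common leaky bucket variant. The class name `LeakyBucket`, its parameters, and the timestamps are assumptions made for illustration; real services differ in the details. Each accepted request adds one unit to the bucket, the bucket drains at a constant rate, and a request that would overflow the bucket is rejected.

```python
class LeakyBucket:
    """Each accepted request adds one unit; the bucket drains at a constant rate."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity    # how many requests the bucket can hold
        self.leak_rate = leak_rate  # requests drained per second
        self.level = 0.0            # current fill level
        self.last_time = None       # timestamp of the previous request

    def allow(self, now):
        """Return True if a request arriving at time `now` (in seconds) is accepted."""
        if self.last_time is not None:
            elapsed = now - self.last_time
            self.level = max(0.0, self.level - elapsed * self.leak_rate)
        self.last_time = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False  # bucket is full: the server would reject this request


bucket = LeakyBucket(capacity=3, leak_rate=1.0)
burst = [bucket.allow(now=0.0) for _ in range(5)]  # 5 requests at the same instant
later = bucket.allow(now=2.0)                      # one more request 2 seconds later
print(burst, later)
```

In this run the first three requests of the burst pass, the next two are rejected, and the request two seconds later passes again because the bucket has drained in the meantime. For a scraper, the lesson is to spread requests out (for example with `time.sleep`) instead of sending them in bursts.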

### Hide numeric incremental IDs

#### Bonus: Stateful page transition

#### Bonus: Client authentication




## Common issues
### Encoding

    import requests
    from bs4 import BeautifulSoup

    r = requests.get('http://www.comm.hkbu.edu.hk/comd-www/english/people/m_facutly_dept.htm')
    r.encoding = 'utf-8'  # tell requests how to decode the response body
    # Name the parser explicitly. Calling BeautifulSoup(YOUR_MARKUP) without one
    # triggers a warning that suggests BeautifulSoup(YOUR_MARKUP, "lxml") instead.
    mypage = BeautifulSoup(r.text, 'lxml')
    mypage.find('td', {'class': 'personNameArea'}).text
    # 'HUANG, YU\r\n                              黃煜'
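
Why does `r.encoding = 'utf-8'` matter? The server sends bytes, and requests has to guess how to turn them into text. When the guess is wrong, Chinese characters come out garbled. The sketch below reproduces the effect without any network access. The sample string and the wrong guess of ISO-8859-1 are assumptions chosen for illustration (as far as we know, requests falls back to ISO-8859-1 for text responses that declare no charset).

```python
name = 'HUANG, YU 黃煜'
raw = name.encode('utf-8')        # the bytes as a server would send them

wrong = raw.decode('iso-8859-1')  # decoding with the wrong guess garbles the Chinese part
right = raw.decode('utf-8')       # decoding with the correct encoding

print(repr(wrong))
print(repr(right))
```

The ASCII part of the name survives either way; only the Chinese characters are damaged, which is why this kind of bug is easy to miss on mostly English pages.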

### Network delay and jitter
- `time.sleep`: pause for some time before proceeding
- `.find_element_by_xxx`: check whether the intended element has already loaded

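The two ideas above can be combined: keep checking for the element, and sleep a little between checks. Below is a minimal sketch. `wait_for` and `element_is_loaded` are hypothetical names, and the fake "page" only stands in for a real browser. With Selenium, the condition could be a function that calls one of the `find_elements...` methods, which return an empty list when nothing matches. Selenium also ships its own `WebDriverWait` helper for this purpose.

```python
import time


def wait_for(condition, timeout=10.0, interval=0.5):
    """Call condition() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f seconds' % timeout)
        time.sleep(interval)  # pause before checking again


# Stand-in for a page whose element only shows up on the third check
attempts = {'count': 0}


def element_is_loaded():
    attempts['count'] += 1
    return 'element' if attempts['count'] >= 3 else None


found = wait_for(element_is_loaded, timeout=2.0, interval=0.01)
print(found, attempts['count'])
```

Compared with a single long `time.sleep`, polling returns as soon as the element is there and still gives up after a fixed time.
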
### Network interruption
Wrap each request in `try...except`, so that one failed request does not stop the whole crawl.

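A minimal sketch of `try...except` with retries follows. `fetch_with_retry` and `flaky_fetch` are hypothetical names, and the flaky function only simulates a connection that fails twice before it succeeds. In a real scraper, `fetch` would wrap a `requests.get` call, and it is usually better to catch a narrower exception such as `requests.exceptions.RequestException` than the broad `Exception` used here.

```python
import time


def fetch_with_retry(fetch, retries=3, delay=1.0):
    """Call fetch(); on an exception, wait and try again, up to `retries` attempts."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception as e:
            print('Attempt %d failed: %s' % (attempt, e))
            if attempt == retries:
                raise          # give up: re-raise the last error
            time.sleep(delay)  # wait before the next attempt


# Stand-in for an unstable connection: fails twice, then succeeds
calls = {'count': 0}


def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('connection reset')
    return 'page content'


result = fetch_with_retry(flaky_fetch, retries=3, delay=0.01)
print(result)
```
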
### Firewall

### Browser rendering delay




## Browser emulation

### Why use Browser Emulation
1. Some complicated websites cannot be scraped directly with static methods.
2. Browser emulation can handle more complicated scraping work, such as pages that require you to log in.
3. Some webpages enforce strict anti-scraping rules.

Two libraries for browser emulation: Selenium and Splinter.

### Limitation
The browser has to load the full content of each page every time, so crawling is slow. Browser emulation is therefore not suitable for scraping large volumes of data.

### Selenium
Selenium is a set of different software tools, each with a different approach to supporting browser automation.

The Selenium Python bindings provide a simple API for writing functional/acceptance tests using Selenium WebDriver. Through the Selenium Python API you can access all the functionality of Selenium WebDriver in an intuitive way.

#### Downloading Python bindings for Selenium

The bindings can be installed with `pip install selenium`. With the help of Selenium, a script can navigate to a link, search, scroll, click, and so on.

#### Drivers

#### Navigating

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # find_element(By.X, ...) is the Selenium 4 form of the find_element_by_x shortcuts
    browser = webdriver.Chrome()  # start the WebDriver (opens a Chrome window)
    browser.get('http://google.com/')  # visit the Google home page
    element = browser.find_element(By.NAME, 'q')  # find the search box
    element.send_keys('github python for data and media communication gitbook')  # type the query for our open book
    element.submit()  # submit the search
    # the browser now shows the search results page
    link = browser.find_element(By.PARTIAL_LINK_TEXT, 'GitHub - hupili')  # find our tutorial
    link.click()  # follow the link into our tutorial
    browser.execute_script('window.scrollTo(0,1200);')  # scroll the page: window.scrollTo(x, y), x is horizontal, y is vertical
    notes_link = browser.find_element(By.LINK_TEXT, 'notes-week-06.md')  # find the link to the week 06 notes
    notes_link.click()  # open the week 06 notes
    # browser.close()

The steps above walk through the whole search process.

#### Locating Elements
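Locating elements is similar to the `find` calls we used in the requests-based approach: a simple `find...` call, but with more ways to match.

Selenium provides the following methods to locate elements in a page:

- `find_element(s)_by_id`
- `find_element(s)_by_name`
- `find_element(s)_by_xpath`
- `find_element(s)_by_link_text`
- `find_element(s)_by_partial_link_text`
- `find_element(s)_by_tag_name`
- `find_element(s)_by_class_name`
- `find_element(s)_by_css_selector`

These shortcuts were deprecated in Selenium 4 and removed from later 4.x releases. The equivalent call there is `find_element(By.ID, ...)` or `find_elements(By.ID, ...)`, and likewise for the other locators, with `By` imported from `selenium.webdriver.common.by`.

In our notes, we mainly use the `find_element(s)_by_css_selector` method, because CSS selectors are easy to write and can match a rich variety of elements.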


#### Find_element(s)_by_css_selector
##### Locating elements by attribute

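    <div id="summaryList_mixed" class="summaryList" style="display: block;"></div>

The general form of an attribute selector is `css = element_name[<attribute_name>='<value>']`.

1. Select by id. Use the `#` notation to select the id: `css="div#summaryList_mixed"` or just `css="#summaryList_mixed"`
2. Select by class. Use the `.` notation to select the class: `css="div.summaryList"` or just `css=".summaryList"`
3. Select by multiple attributes. Write the attribute selectors back to back with no space in between (a space would select a descendant element instead), and match the attribute value exactly: `css="div[class='summaryList'][style='display: block;']"`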
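
To see how these pieces combine into one selector string, here is a small sketch. `css_selector` is a hypothetical helper written for these notes, not part of Selenium: it only builds the string that you would then pass to the CSS selector locator.

```python
def css_selector(tag='', element_id=None, classes=(), attrs=None):
    """Build a CSS selector string that describes a single element."""
    parts = [tag]
    if element_id:
        parts.append('#' + element_id)             # '#' selects by id
    for c in classes:
        parts.append('.' + c)                      # '.' selects by class
    for name, value in (attrs or {}).items():
        parts.append("[%s='%s']" % (name, value))  # [attribute='value']
    # Join without spaces: every condition applies to the same element
    return ''.join(parts)


print(css_selector('div', element_id='summaryList_mixed'))
print(css_selector(classes=['summaryList']))
print(css_selector('div', attrs={'class': 'summaryList', 'style': 'display: block;'}))
```

The three calls print `div#summaryList_mixed`, `.summaryList` and `div[class='summaryList'][style='display: block;']`, matching the three cases listed above.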

##### Locating Child Element

    <div id="summaryList_mixed" class="summaryList" style="display: block;">
        <div class="summaryBlock"></div>
        <div class="summaryBlock"></div>
        <div class="summaryBlock"></div>
        <div class="summaryBlock"></div>
    </div>

1. Locate all children: `css="div#summaryList_mixed .summaryBlock"`
2. Locate a specific child with `nth-of-type`. The first one is `nth-of-type(1)`, and the last one is `last-child`: `css="div#summaryList_mixed .summaryBlock:nth-of-type(2)"`

    ## Fundamental: one page
    # Example: scraping CNN articles
    from selenium import webdriver
    from selenium.webdriver.common.by import By  # find_elements(By.CSS_SELECTOR, ...) is the Selenium 4 form of find_elements_by_css_selector

    browser = webdriver.Chrome()
    browser.get('http://money.cnn.com/search/index.html?sortBy=date&primaryType=mixed&search=Search&query=trade%20war')

    articles = []
    # every article sits in an element with class='summaryBlock' under id='summaryList_mixed'
    for block in browser.find_elements(By.CSS_SELECTOR, '#summaryList_mixed .summaryBlock'):
        article = {}
        h = block.find_element(By.CSS_SELECTOR, '.cnnHeadline a')  # the headline link
        article['headline'] = h.text  # headline text
        article['url'] = h.get_attribute('href')  # URL taken from the headline link
        article['date'] = block.find_element(By.CSS_SELECTOR, 'span.cnnDateStamp').text  # date stamp
        articles.append(article)
    articles

Running this code after the browser window has been closed raises an error:

    NoSuchWindowException: Message: no such window: window was already closed
      (Session info: chrome=70.0.3538.77)
      (Driver info: chromedriver=2.43.600229 (3fae4d0cda5334b4f533bede5a4787f7b832d052),platform=Mac OS X 10.13.6 x86_64)


    ## Advanced: all pages

    from selenium import webdriver
    from selenium.webdriver.common.by import By  # find_elements(By.CSS_SELECTOR, ...) is the Selenium 4 form of find_elements_by_css_selector
    import time  # mainly for time.sleep

    def get_articles_from_browser(b):
        articles = []
        # every article sits in an element with class='summaryBlock' under id='summaryList_mixed'
        for block in b.find_elements(By.CSS_SELECTOR, '#summaryList_mixed .summaryBlock'):
            article = {}
            h = block.find_element(By.CSS_SELECTOR, '.cnnHeadline a')  # the headline link
            article['headline'] = h.text  # headline text
            article['url'] = h.get_attribute('href')  # URL taken from the headline link
            article['date'] = block.find_element(By.CSS_SELECTOR, 'span.cnnDateStamp').text  # date stamp
            articles.append(article)

        return articles

    url = 'http://money.cnn.com/search/index.html?sortBy=date&primaryType=mixed&search=Search&query=trade%20war'
    browser = webdriver.Chrome()
    browser.get(url)
    time.sleep(2)  # pause 2 seconds after the call; requests sent too frequently with no pause are likely to get us banned by the website

    all_page_articles = []
    for i in range(10):
        time.sleep(0.5)
        try:
            new_articles = get_articles_from_browser(browser)
            all_page_articles.extend(new_articles)

            # Next, emulate a click on the `Next` button to turn the page.
            # Try 1: just click the link directly.
            #   browser.find_element(By.LINK_TEXT, 'Next').click()
            #   Error: not clickable. After several print() calls along the way, I found that we need to
            #   scroll the window down until the Next button is visible. Selenium browser emulation
            #   really does behave like a human user.
            # Try 2: scroll the whole body down to the bottom.
            #   browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            #   Error: on some pages the navigation bar covers the button once you scroll to the bottom.
            # Try 3: scroll to (document.body.scrollHeight - int).
            #   Failed: subtracting did not work in our tests, but dividing did.

            browser.execute_script('window.scrollTo(0, document.body.scrollHeight/1.5);')  # try several divisors and pick a suitable one
            next_page = browser.find_element(By.LINK_TEXT, 'Next')
            next_page.click()
        except Exception as e:
            print(e)
            print('Error on page %s' % i)


Message: no such window: window was already closed
  (Session info: chrome=70.0.3538.77)
  (Driver info: chromedriver=2.43.600229 (3fae4d0cda5334b4f533bede5a4787f7b832d052),platform=Mac OS X 10.13.6 x86_64)

Error on page 4
Message: no such window: window was already closed
  (Session info: chrome=70.0.3538.77)
  (Driver info: