# Scraping JavaScript

## Ajax

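#### Q: Why does the page you see differ from the source code you fetched?

1. With Ajax, the page sends information to or receives it from the server without reloading or requesting a separate page
2. Or the page has been redirected to another site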

#### A:

1. Scrape the content directly from the JavaScript (without executing it in Python)
2. Use a Python package to execute the JavaScript, then scrape the rendered page
    - e.g. Selenium
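
As a rough illustration of the first option, the sketch below pulls a JSON literal straight out of a `<script>` block using only the standard library. The page source, the variable name `pageData`, and the helper `extract_js_object` are all made up for the example, and the approach only works when the embedded literal happens to be valid JSON.

```python
import json
import re

# Hypothetical page source: the visible text is built by JavaScript from
# a JSON literal, so it never appears as plain HTML.
PAGE_SOURCE = """
<html><body>
<div id="content"></div>
<script>
  var pageData = {"title": "Important text", "items": [1, 2, 3]};
  document.getElementById("content").innerText = pageData.title;
</script>
</body></html>
"""


def extract_js_object(source, var_name):
    """Return the JSON literal assigned to `var_name`, parsed into Python."""
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*"
    match = re.search(pattern, source)
    if match is None:
        return None
    # raw_decode parses one JSON value and ignores whatever follows it
    value, _end = json.JSONDecoder().raw_decode(source, match.end())
    return value


page_data = extract_js_object(PAGE_SOURCE, "pageData")
print(page_data["title"])
```

When the data is built up by more involved JavaScript, this kind of text extraction stops being practical and executing the script (option 2) is usually the simpler route.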

## Selenium

Install Selenium:
```bash
pip install selenium
```

1. Use [**PhantomJS**](http://phantomjs.org/download.html) to run quietly in the background
     - loads the website into memory
     - executes the JavaScript on the page
     - does no graphical rendering of the website for the user
2. Use [**ChromeDriver**](https://sites.google.com/a/chromium.org/chromedriver/) to run it with Chrome

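```python
import time

from selenium import webdriver

# PhantomJS is headless: it loads the page and runs its JavaScript without a window
driver = webdriver.PhantomJS(executable_path='./phantomjs/bin/phantomjs')
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
# Crude wait: give the Ajax call a few seconds to replace the content
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()
```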



Here is some important text you want to retrieve!
A button to click!


#### Q: `sleep(3)` is there to make sure the page is fully loaded, but there is a better approach

#### A: Repeatedly check for the existence of some element that only appears on a fully loaded page


`expected_conditions` can wait for events such as:

1. An alert box pops up
2. An element is put into a "selected" state
3. Text is now displayed on the page
4. An element becomes visible in the DOM, or an element disappears from the DOM

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec

driver = webdriver.PhantomJS(executable_path='./phantomjs/bin/phantomjs')
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")

try:
    # Poll for up to 10 seconds until the element with id "loadedButton" is in the DOM
    element = WebDriverWait(driver, 10).until(
        ec.presence_of_element_located((By.ID, "loadedButton")))
finally:
    print(driver.find_element_by_id("content").text)
    driver.close()
```



Here is some important text you want to retrieve!
A button to click!
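
At heart, an explicit wait like the one above is a polling loop: call a condition repeatedly until it returns something truthy or a deadline passes. The sketch below shows that idea with the standard library only. `wait_until` and `content_loaded` are hypothetical names, and this is not Selenium's actual implementation (which, among other things, also ignores `NoSuchElementException` while polling).

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition()` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose content "arrives" on the third check
attempts = {"count": 0}


def content_loaded():
    attempts["count"] += 1
    return "loadedButton" if attempts["count"] >= 3 else None


found = wait_until(content_loaded, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

Compared with a fixed `sleep(3)`, the loop returns as soon as the condition holds and fails loudly when it never does.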


## Redirects

- A server-side redirect is followed automatically by *urllib*
- A client-side redirect is not handled at all unless something actually executes the JavaScript
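
The difference can be seen without a browser. The sketch below starts a throwaway local server (the paths `/server-redirect`, `/js-redirect` and `/final` are invented for the demo) and fetches both kinds of redirect with *urllib*: the HTTP 302 is followed to the final page, while the JavaScript redirect comes back as an ordinary page whose script never runs.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

JS_REDIRECT_PAGE = b"<html><script>window.location = '/final';</script></html>"


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/server-redirect":
            # Server-side: the redirect lives in the HTTP response itself
            self.send_response(302)
            self.send_header("Location", "/final")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == "/js-redirect":
            # Client-side: a normal 200 page whose script would navigate away
            self._send_html(JS_REDIRECT_PAGE)
        else:
            self._send_html(b"<html>The Destination Page!</html>")

    def _send_html(self, body):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# Ignore any system proxy so the requests reach the local demo server
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

with opener.open(base + "/server-redirect") as resp:
    server_side_url = resp.geturl()
    server_side_body = resp.read()

with opener.open(base + "/js-redirect") as resp:
    client_side_url = resp.geturl()
    client_side_body = resp.read()

server.shutdown()
server.server_close()

print(server_side_url)   # ends in /final: urllib followed the 302
print(client_side_url)   # still /js-redirect: the script was never executed
```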

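#### Q: The biggest problem with redirects is knowing when the page has finished redirecting

#### A: Keep watching an element from the moment the page first loads until a *StaleElementReferenceException* is thrown

This exception means the element is no longer attached to the page's DOM, i.e. the site has redirected.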

```python
import time

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException


def wait_for_load(driver):
    # Grab a handle on the current page's <html> element
    element = driver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 20:
            print("Timing out after 10 seconds and returning")
            return
        time.sleep(.5)
        try:
            # Query the old handle; this raises once the original page is gone
            element.is_enabled()
        except StaleElementReferenceException:
            return


driver = webdriver.PhantomJS(executable_path='./phantomjs/bin/phantomjs')
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
wait_for_load(driver)
print(driver.page_source)
print(driver.get_cookies())
```

<html><head>
<title>The Destination Page!</title>

</head>
<body>
This is the page you are looking for!

</body></html>
[]


## Saving cookies for other scrapers

```python
from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='./phantomjs/bin/phantomjs')
driver.get("http://pythonscraping.com")
driver.implicitly_wait(1)
print("Cookie of Driver 1:")
print(driver.get_cookies())

# Keep the first driver's cookies so a second driver can reuse them
saved_cookie = driver.get_cookies()
driver2 = webdriver.PhantomJS(executable_path='./phantomjs/bin/phantomjs')
# Visit the site first: cookies can only be set for the currently loaded domain
driver2.get("http://pythonscraping.com")

# Clear driver2's own cookies
driver2.delete_all_cookies()
# Copy the saved cookies into driver2
for cookie in saved_cookie:
    driver2.add_cookie(cookie)

# Reload so the request is sent with the copied cookies
driver2.get("http://pythonscraping.com")
driver2.implicitly_wait(1)
print("Cookie of Driver 2:")
print(driver2.get_cookies())
```



Cookie of Driver 1:
[{'domain': '.pythonscraping.com', 'expires': 'Sun, 11 Feb 2018 11:21:22 GMT', 'expiry': 1518348082, 'httponly': False, 'name': '_gat', 'path': '/', 'secure': False, 'value': '1'}, {'domain': '.pythonscraping.com', 'expires': 'Mon, 12 Feb 2018 11:20:22 GMT', 'expiry': 1518434422, 'httponly': False, 'name': '_gid', 'path': '/', 'secure': False, 'value': 'GA1.2.1224189113.1518348022'}, {'domain': '.pythonscraping.com', 'expires': 'Tue, 11 Feb 2020 11:20:22 GMT', 'expiry': 1581420022, 'httponly': False, 'name': '_ga', 'path': '/', 'secure': False, 'value': 'GA1.2.1832632677.1518348022'}, {'domain': 'pythonscraping.com', 'httponly': False, 'name': 'has_js', 'path': '/', 'secure': False, 'value': '1'}]


InvalidCookieDomainException: Message: {"errorMessage":"Can only set Cookies for the current domain","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"243","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:61490","User-Agent":"Python http auth"},"httpVersion":"1.1","method":"POST","post":"{\"cookie\": {\"domain\": \".pythonscraping.com\", \"expires\": \"Sun, 11 Feb 2018 11:21:22 GMT\", \"expiry\": 1518348082, \"httponly\": false, \"name\": \"_gat\", \"path\": \"/\", \"secure\": false, \"value\": \"1\"}, \"sessionId\": \"8e0d88c0-0f1d-11e8-8c9c-150d67a7114b\"}","url":"/cookie","urlParsed":{"anchor":"","query":"","file":"cookie","directory":"/","path":"/cookie","relative":"/cookie","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/cookie","queryKey":{},"chunks":["cookie"]},"urlOriginal":"/session/8e0d88c0-0f1d-11e8-8c9c-150d67a7114b/cookie"}}
Screenshot: available via screen
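
Since `add_cookie` was rejected in the run above, one alternative worth sketching is to hand the saved cookies to a scraper that does not go through Selenium at all. The example below is a minimal sketch under that assumption: it takes two of the cookies printed above (trimmed to a few fields), round-trips them through JSON as if they had been saved to disk, and attaches them to a *urllib* request as a `Cookie` header. The helper name `cookies_to_header` is hypothetical, and the sketch deliberately ignores domain, path and expiry matching.

```python
import json
import urllib.request

# Two of the cookies printed above, trimmed to the fields used here
saved_cookies = [
    {"domain": ".pythonscraping.com", "name": "_gat", "path": "/", "value": "1"},
    {"domain": "pythonscraping.com", "name": "has_js", "path": "/", "value": "1"},
]


def cookies_to_header(cookies):
    """Join Selenium-style cookie dicts into a value for the Cookie header."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# Round-trip through JSON, as if the cookies had been written to disk
restored = json.loads(json.dumps(saved_cookies))

# Build (but do not send) a request that carries the saved cookies
request = urllib.request.Request(
    "http://pythonscraping.com",
    headers={"Cookie": cookies_to_header(restored)},
)
print(request.get_header("Cookie"))
```

Passing `request` to `urllib.request.urlopen` would then send the cookies along with the page request.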
