# Collecting data with Selenium

## Environment Setup 

Selenium is primarily a package for automated testing. It is also widely used for collecting data from dynamically generated web pages, which we will discuss later. To begin, run the following cells to set up the Selenium environment.

In [1]:
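import subprocess
import sys
# pip.main() was removed in pip 10, so call pip through the current interpreter.
subprocess.check_call([sys.executable, "-m", "pip", "install", "selenium"])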



0

In [2]:
from selenium import webdriver        
from selenium.webdriver.common.keys import Keys     

You will also need to download a driver that matches your browser and operating system. Here we are going to use Chrome, so download the driver from https://chromedriver.storage.googleapis.com/index.html?path=2.36/ and put it in the same folder as this Jupyter Notebook. To test whether you have set up the environment correctly, run the code snippet in **Getting started**.

## Getting started

In [3]:
# -*- coding: UTF-8 -*-
driver = webdriver.Chrome()
driver.get("https://www.google.com/")
html = driver.page_source
print("Length of HTML of Google homepage: " + str(len(html)))
driver.close()

Length of HTML of Google homepage: 222774


Congratulations! You just got the source code of the Google homepage. You may wonder why a browser window opened when you ran the code. This is an important feature, and an advantage, of Selenium. Selenium is essentially an automated testing tool that imitates the operations of a real user, so this piece of code imitates a real user opening the browser and visiting the Google homepage. As a result, you do not need to deal with raw HTTP requests and can focus on the web page and the data in it. If you want to collect data from a series of web pages, such as lists of products, you only need to think about how you would collect the data as a normal user and then automate that behavior. Cool, right?  
  
Next, we will talk about how to use Selenium to scrape web pages in detail.

## Scrape Yahoo Finance
Now let's try to collect something. How about the current values of cryptocurrencies? Let's try to collect all of them from https://finance.yahoo.com/cryptocurrencies.  
  
Open the page, and think about how to collect data as a normal user:  
1. Open the browser
2. Open the web page
3. Copy the data row by row
4. Go to the next page
5. Stop collecting data after finishing the 10th page  
6. Close the browser   
  
Let's automate these steps with Selenium. Along the way, we will go through the important Selenium concepts that this example relies on.

### 1. Web driver

The web driver is the core object of Selenium. It is created for a specific browser. For example, if you want to visit a website with Chrome, you need to create a Chrome web driver. Also, remember to close the driver when you finish; otherwise the browser will stay open. You can think of creating a web driver as opening the browser and of closing the driver as closing the browser. 
  
Selenium also supports browsers such as Internet Explorer and Firefox. Later we will talk about headless browsers, which allow you to run Selenium without opening a visible browser window. The following piece of code, which we used at the beginning, creates a Chrome driver. 

In [4]:
driver = webdriver.Chrome()
driver.close()

### 2. Locating elements
Let's start by collecting a single row from the first page of https://finance.yahoo.com/cryptocurrencies. Easy enough for a beginner, right?  
  
#### Web element 
Web elements are the HTML objects returned by Selenium's selectors. The object itself provides ways to inspect it. 
  
You can read the following attributes from an element:
```
element.tag_name
```
```
element.text
```
It also provides a method to read any of its HTML attributes:
```
element.get_attribute(name)
```
Example:
Check if the "active" CSS class is applied to an element:
```
is_active = "active" in target_element.get_attribute("class")
```
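
One caveat with the substring test above: it also matches class names that merely contain the word, such as `inactive`. A minimal sketch of a stricter check, assuming the class attribute is a space-separated list of names (the helper `has_class` is hypothetical and not part of Selenium):

```python
def has_class(class_attribute, class_name):
    """Return True if class_name is one of the space-separated classes."""
    # get_attribute("class") may return None when the attribute is absent.
    if not class_attribute:
        return False
    return class_name in class_attribute.split()


# Plain strings stand in for element.get_attribute("class") here.
print(has_class("btn active", "active"))    # True
print(has_class("btn inactive", "active"))  # False
```

With a real element you would pass `target_element.get_attribute("class")` as the first argument.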
  
#### XPath & Selector
So how does Selenium locate elements? Selenium supports multiple ways to locate elements. The most common way is to use XPath, a language for addressing the location of an element in an HTML page. It looks like this:  
```
//*[@id="scr-res-table"]/table/tbody/tr[1]/td[2]/a
```
You may wonder what this complex string means and how to write it. This XPath says that the element sits under the element with id "scr-res-table", then the table, the tbody, the first row, the second table cell, and finally the link inside that cell (XPath uses one-based indexing). You rarely need to write it yourself: you can copy it from the Chrome developer tools, as the figure shows.
![title](figure 1.png)
<center>Figure 1</center>  
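
To make the one-based indexing concrete, here is a minimal sketch that runs a similar path against a tiny, made-up table using Python's standard `xml.etree.ElementTree` module. The sample markup is invented for illustration, and ElementTree supports only a subset of XPath, so it stands in for the browser rather than reproducing what Selenium does.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up stand-in for the Yahoo Finance table (not the real page).
SAMPLE = """
<body>
  <div id="scr-res-table">
    <table>
      <tbody>
        <tr><td>1</td><td><a>BTC-USD</a></td><td>Bitcoin USD</td></tr>
        <tr><td>2</td><td><a>ETH-USD</a></td><td>Ethereum USD</td></tr>
      </tbody>
    </table>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)

# ElementTree paths are relative to the root element, hence the leading ".".
first = root.find(".//*[@id='scr-res-table']/table/tbody/tr[1]/td[2]/a")
second = root.find(".//*[@id='scr-res-table']/table/tbody/tr[2]/td[2]/a")

print(first.text)   # BTC-USD
print(second.text)  # ETH-USD
```

Changing only the index after `tr` moves from one row to the next, which is the property the scraping loop later in this notebook relies on.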
  
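Next we will use a Selenium **selector** to select the element by XPath. In the following code, we first obtain the **web elements** by XPath, using the driver's XPath lookup (```driver.find_element_by_xpath()``` in Selenium 3, written as ```driver.find_element(By.XPATH, ...)``` in recent Selenium 4 releases, which removed the older helper), and then extract information from the **web elements**. The row we are going to scrape looks like this: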
![title](figure 2.png)
<center>Figure 2</center>  

In [5]:
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

row = {
    "symbol":"",
    "name":"",
    "price":"",
    "change":"",
    "percentage_change":"",
    "market_cap":"",
    "volume_in_currency":"",
    "total_volume_all_currencies":"",
    "circulating_supply":"",
}

driver.get("https://finance.yahoo.com/cryptocurrencies")
# find_element(By.XPATH, ...) works in both Selenium 3 and Selenium 4.
element = driver.find_element(By.XPATH, '//*[@id="scr-res-table"]/table/tbody/tr[1]/td[2]')
row["symbol"] = element.text

element = driver.find_element(By.XPATH, '//*[@id="scr-res-table"]/table/tbody/tr[1]/td[3]')
row["name"] = element.text

element = driver.find_element(By.XPATH, '//*[@id="scr-res-table"]/table/tbody/tr[1]/td[4]')
row["price"] = element.text

element = driver.find_element(By.XPATH, '//*[@id="scr-res-table"]/table/tbody/tr[1]/td[5]')
row["change"] = element.text

element = driver.find_element(By.XPATH, '//*[@id="scr-res-table"]/table/tbody/tr[1]/td[6]')
row["percentage_change"] = element.text

element = driver.find_element(By.XPATH, '//*[@id="scr-res-table"]/table/tbody/tr[1]/td[7]')
row["market_cap"] = element.text

element = driver.find_element(By.XPATH, '//*[@id="scr-res-table"]/table/tbody/tr[1]/td[8]')
row["volume_in_currency"] = element.text

element = driver.find_element(By.XPATH, '//*[@id="scr-res-table"]/table/tbody/tr[1]/td[9]')
row["total_volume_all_currencies"] = element.text

element = driver.find_element(By.XPATH, '//*[@id="scr-res-table"]/table/tbody/tr[1]/td[10]')
row["circulating_supply"] = element.text

print(row)

driver.close()

{'symbol': 'BTC-USD', 'name': 'Bitcoin USD', 'price': '7,909.70', 'change': '+101.22', 'percentage_change': '+1.30%', 'market_cap': '134.025B', 'volume_in_currency': '518.386M', 'total_volume_all_currencies': '749.978M', 'circulating_supply': '3.687B'}


Now the data of the first row has been collected successfully. Next, let's scrape the whole page. Copying all those XPaths by hand is tedious, but if you look carefully you will see that only the indices after ```tr``` (the row) and ```td``` (the column) change, and many websites follow this pattern. Using it, we can crawl all the rows on a single page with a loop. You do not need to specify how many rows there are, because the number of rows may change (for example, the last page may have fewer rows than the first). You only need an infinite loop that breaks when Selenium raises a ```NoSuchElementException```.

In [6]:
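from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def scrapePage(driver):
    """Collect every row of the results table on the current page."""
    rows = []
    keys = [
        "symbol", "name", "price", "change",
        "percentage_change", "market_cap", "volume_in_currency",
        "total_volume_all_currencies", "circulating_supply",
    ]
    row_index = 1
    while True:
        try:
            row = dict()
            # Columns 2 to 10 hold the nine fields listed in keys.
            for data_index in range(2, 11):
                xpath = '//*[@id="scr-res-table"]/table/tbody/tr[' + str(row_index) + ']/td[' + str(data_index) + ']'
                element = driver.find_element(By.XPATH, xpath)
                row[keys[data_index - 2]] = element.text
            rows.append(row)
            row_index = row_index + 1
        except NoSuchElementException:
            # No such row: we have walked past the last row of the table.
            break

    return rows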

In [7]:
driver = webdriver.Chrome()
driver.get("https://finance.yahoo.com/cryptocurrencies")
rows = scrapePage(driver)
driver.close()
print("First five rows of the page: ")
for r in rows[:5]:
    print(r)

First five rows of the page: 
{'symbol': 'BTC-USD', 'name': 'Bitcoin USD', 'price': '7,909.70', 'change': '+101.22', 'percentage_change': '+1.30%', 'market_cap': '134.025B', 'volume_in_currency': '518.386M', 'total_volume_all_currencies': '749.978M', 'circulating_supply': '3.687B'}
{'symbol': 'ETH-USD', 'name': 'Ethereum USD', 'price': '447.68', 'change': '-1.10', 'percentage_change': '-0.25%', 'market_cap': '44.083B', 'volume_in_currency': '149.164M', 'total_volume_all_currencies': '203.905M', 'circulating_supply': '667.67M'}
{'symbol': 'XRP-USD', 'name': 'Ripple USD', 'price': '0.5698', 'change': '-0.0005', 'percentage_change': '-0.0877%', 'market_cap': '21.827B', 'volume_in_currency': '18.882M', 'total_volume_all_currencies': '29.799M', 'circulating_supply': '182.39M'}
{'symbol': 'BCH-USD', 'name': 'Bitcoin Cash / BCC USD', 'price': '867.71', 'change': '-9.98', 'percentage_change': '-1.14%', 'market_cap': '14.788B', 'volume_in_currency': '21.05M', 'total_volume_all_currencies': '28.

### 3. Navigation

Let's move on to collecting data from page to page using the ```scrapePage()``` function. The mechanism by which Selenium operates a page is called **'navigation'**. It includes opening a page, interacting with a page, moving between windows, and so on. Now let's try to scrape from page to page using navigation. 
  
So how can we navigate to the next page? There are multiple ways to do it:
1. Get the URL of the next page and open it with the web driver
2. Click the **'next'** button using the interaction functions provided by Selenium  
  
One problem is how Selenium knows when to stop crawling. This requires us to study the web page carefully. When we reach the end of the list, the **'next'** button is disabled, and the string 'disabled' shows up in the HTML code of the **'next'** button. So after crawling each page, we check whether the string 'disabled' appears in the **'next'** button. If it does, we stop crawling; otherwise we go to the next page. The HTML code can be obtained with ```element.get_attribute("outerHTML")```.  
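
A plain substring test can give a false positive if the word appears somewhere else in the button's markup, for instance inside a class name. As a hedged sketch, assuming the page marks the button with the HTML `disabled` attribute, the outer HTML can be parsed with the standard library instead. The helper `is_disabled` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class _FirstTagAttributes(HTMLParser):
    """Record the attributes of the first start tag encountered."""

    def __init__(self):
        super().__init__()
        self.attributes = None

    def handle_starttag(self, tag, attrs):
        if self.attributes is None:
            self.attributes = dict(attrs)


def is_disabled(outer_html):
    """Return True if the outermost tag carries a `disabled` attribute."""
    parser = _FirstTagAttributes()
    parser.feed(outer_html)
    return parser.attributes is not None and "disabled" in parser.attributes


# Made-up button markup for illustration (not copied from the real page).
print(is_disabled('<button class="next" disabled=""><span>Next</span></button>'))  # True
print(is_disabled('<button class="next btn-disabled-style"><span>Next</span></button>'))  # False
```

On a live element, Selenium's `element.is_enabled()` is another way to ask the same question without inspecting the markup yourself.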
  
Let's try getting the URL of the next page first.

#### First method
By studying the website, we can find that it uses a GET request with the ```offset``` and ```count``` parameters. For example, ```https://finance.yahoo.com/cryptocurrencies?offset=0&count=25``` shows 25 results starting at offset 0. We can therefore obtain the URL of the next page by increasing ```offset``` by ```count```. 
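
As a small illustration of the arithmetic behind these URLs, the sketch below builds the address of any page from its zero-based page number. The helper `page_url` is hypothetical, and the parameter names are taken from the example URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://finance.yahoo.com/cryptocurrencies"


def page_url(page_number, count=25):
    """Build the URL of a zero-based page, with `count` rows per page."""
    query = urlencode({"offset": page_number * count, "count": count})
    return BASE_URL + "?" + query


for page_number in range(3):
    print(page_url(page_number))
# https://finance.yahoo.com/cryptocurrencies?offset=0&count=25
# https://finance.yahoo.com/cryptocurrencies?offset=25&count=25
# https://finance.yahoo.com/cryptocurrencies?offset=50&count=25
```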

In [8]:
from selenium.webdriver.common.by import By

def getPages(driver):
    rows = []
    offset = 0
    while True:
        # Open the page that starts at the current offset.
        driver.get("https://finance.yahoo.com/cryptocurrencies?offset=" + str(offset) + "&count=25")
        next_button_element = driver.find_element(By.XPATH, '//*[@id="fin-scr-res-table"]/div[2]/div[2]/button[3]')
        rows.extend(scrapePage(driver))
        # A disabled 'next' button marks the last page.
        outer_html = next_button_element.get_attribute("outerHTML")
        if "disabled" in outer_html:
            break
        offset = offset + 25
    return rows

In [9]:
driver = webdriver.Chrome()
rows = getPages(driver)
driver.close()
print("Number of entries: ")
print(len(rows))
print("\nFirst row of data: ")
print(rows[0])
print("\nLast row of data: ")
print(rows[-1])

Number of entries: 
112

First row of data: 
{'symbol': 'BTC-USD', 'name': 'Bitcoin USD', 'price': '7,909.70', 'change': '+101.22', 'percentage_change': '+1.30%', 'market_cap': '134.025B', 'volume_in_currency': '518.386M', 'total_volume_all_currencies': '749.978M', 'circulating_supply': '3.687B'}

Last row of data: 
{'symbol': 'PAY-USD', 'name': 'TenX USD', 'price': '1.04077', 'change': '-0.02215', 'percentage_change': '-2.08388%', 'market_cap': '0', 'volume_in_currency': '0', 'total_volume_all_currencies': '0', 'circulating_supply': '4.529M'}


#### Second method
Now let's try the second method, where we use **'navigation'**.  
   
The following code implements it. The crawler clicks the **'next'** button to get the next page. After each click, remember to **refresh** the page using ```driver.refresh()```; otherwise the driver will lose its focus and throw an exception if you continue scraping.

In [10]:
from selenium.webdriver.common.by import By

def getPages2(driver):
    rows = []
    while True:
        next_button_element = driver.find_element(By.XPATH, '//*[@id="fin-scr-res-table"]/div[2]/div[2]/button[3]')
        rows.extend(scrapePage(driver))
        # A disabled 'next' button marks the last page.
        outer_html = next_button_element.get_attribute("outerHTML")
        if "disabled" in outer_html:
            break
        # Move to the next page, then refresh so the driver sees the new content.
        next_button_element.click()
        driver.refresh()
    return rows

In [11]:
driver = webdriver.Chrome()
driver.get("https://finance.yahoo.com/cryptocurrencies")
rows = getPages2(driver)
driver.close()

print("Number of entries: ")
print(len(rows))
print("\nFirst row of data: ")
print(rows[0])
print("\nLast row of data: ")
print(rows[-1])

Number of entries: 
112

First row of data: 
{'symbol': 'BTC-USD', 'name': 'Bitcoin USD', 'price': '7,907.94', 'change': '+99.46', 'percentage_change': '+1.2737%', 'market_cap': '134.007B', 'volume_in_currency': '518.895M', 'total_volume_all_currencies': '750.487M', 'circulating_supply': '3.689B'}

Last row of data: 
{'symbol': 'PAY-USD', 'name': 'TenX USD', 'price': '1.0441', 'change': '-0.0182', 'percentage_change': '-1.7114%', 'market_cap': '0', 'volume_in_currency': '0', 'total_volume_all_currencies': '0', 'circulating_supply': '4.552M'}


Here you may find another advantage of Selenium. Clicking the **'next'** button generates new content with JavaScript in the original frame rather than directing the user to a new page. Data that is generated dynamically like this cannot be collected by simply requesting the page URL when the URL does not change with the content. This brings us to two important concepts: the **static crawler** and the **dynamic crawler**.
* Static crawler: A static crawler can only crawl static information. It cannot obtain dynamic information by interacting with the web page. Crawlers built with Scrapy or Requests are static crawlers.
* Dynamic crawler: Unlike a static crawler, a dynamic crawler can interact with the web page and capture dynamic information. A crawler built with Selenium is a dynamic crawler.  
  
One drawback of a dynamic crawler is that it is slow, because it needs to load a real web page before collecting data. It also needs to refresh the page when an interaction changes the content of the web page. 
  
Comparing the two methods, the first is somewhat faster than the second because it involves fewer interactions with the website. So if you can find a regular pattern in the URLs, the first method is the better choice.

## Advanced techniques & further studies

### 1. Headless browser and cloud deployment

If you are using a Linux system without a graphical user interface, where you cannot run a real browser with a UI, you need a headless browser. This lets you deploy a Selenium crawler in environments such as Docker containers, and it is especially useful on cloud services such as AWS EC2, whose Linux instances typically have no UI. The following code shows how to run a Selenium crawler with headless Chrome. You do not need any extra setup if Chrome is already installed on your device.

In [12]:
from selenium.webdriver.chrome.options import Options

def headlessChromeCrawl():
    chrome_options = Options()
    # Run Chrome without opening a visible window.
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    # The options= keyword replaces the deprecated chrome_options= keyword.
    driver = webdriver.Chrome(options=chrome_options)
    driver.get("http://www.datasciencecourse.org/")
    html = driver.page_source
    driver.close()
    return html

In [13]:
html = headlessChromeCrawl()
print("Length of HTML of course homepage: " + str(len(html)))

Length of HTML of course homepage: 12368


For further reading, please refer to https://developers.google.com/web/updates/2017/04/headless-chrome

### 2. Wait
A wait is a mechanism provided by Selenium that lets a crawler pause until something happens, or pause for a fixed time, before the next action. For example, it takes time for a browser to load some of the data on a web page. If the Selenium crawler starts to collect that data before the browser finishes loading it, the crawler will get nothing or throw an exception. Waits make the crawler more stable.  
  
For further reference, please see http://selenium-python.readthedocs.io/waits.html
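
To show the idea behind waiting until something happens, here is a minimal pure-Python sketch of a polling loop. It is not Selenium's implementation: `wait_until` and `table_is_loaded` are hypothetical names, and the "page" is simulated by a counter. Selenium's own `WebDriverWait(driver, timeout).until(...)` follows the same polling idea against a real browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)


# Simulated page: the "data" only becomes available on the third check.
attempts = {"count": 0}


def table_is_loaded():
    attempts["count"] += 1
    return "loaded" if attempts["count"] >= 3 else None


print(wait_until(table_is_loaded, timeout=2.0, poll_interval=0.01))  # loaded
```

Polling with a deadline is usually preferable to a fixed sleep, because the crawler continues as soon as the data appears instead of always paying the full delay.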

### 3. Profile

Nowadays most websites have both a desktop version and a mobile version. Interestingly, many websites impose fewer restrictions on their mobile version than on their desktop version, although the data is the same. Therefore, you may want to collect the data from the mobile version. To do so, you can install a Firefox plugin called **User Agent Overrider**, which lets you open a web page as if from a mobile browser. To use that plugin in Selenium, you need a **Profile**. A profile is a collection of browser settings that lets Selenium use the plugins installed in Firefox; by default, Selenium starts a browser without any plugins.  
  
For further reading, please refer to:   
* User Agent Overrider for Firefox: https://addons.mozilla.org/en-US/firefox/addon/user-agent-overrider/
* Set up profile on Firefox: https://support.mozilla.org/en-US/kb/profile-manager-create-and-remove-firefox-profiles
* Set up Selenium webdriver with profile: http://toolsqa.com/selenium-webdriver/custom-firefox-profile/ 
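
As an alternative sketch that does not use the plugin described above, Firefox also reads the user agent from the `general.useragent.override` preference, which can be set through Selenium's Firefox options. The user-agent string and the helper names below are assumptions chosen for illustration, and starting the browser requires Firefox and geckodriver to be installed.

```python
# Assumed example user-agent string for a mobile browser; substitute your own.
MOBILE_USER_AGENT = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) "
    "AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 "
    "Mobile/15A372 Safari/604.1"
)


def mobile_preferences(user_agent=MOBILE_USER_AGENT):
    """Return the Firefox preferences that override the user agent."""
    return {"general.useragent.override": user_agent}


def make_mobile_firefox():
    """Start Firefox with the overridden user agent (needs Firefox and geckodriver)."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    for name, value in mobile_preferences().items():
        options.set_preference(name, value)
    return webdriver.Firefox(options=options)


print(mobile_preferences()["general.useragent.override"].startswith("Mozilla/5.0"))  # True
```

Calling `make_mobile_firefox()` returns a driver that can be used like the Chrome drivers earlier in this notebook.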

## When should I use Selenium?

Broadly speaking, there are two types of crawler, static and dynamic, and Selenium belongs to the second type. Compared with static crawlers built with Requests or Scrapy, Selenium has the following advantages:
* It can interact with pages and supports JavaScript very well
* Its workflow is easy to understand  
  
However, Selenium also has significant disadvantages:
* Low speed, not suitable for large datasets
* Projects are not very easy to extend
* Not many related projects or plugins  
  
Therefore, if a website can be crawled by both Selenium and a static crawler, the static crawler is usually the better choice. However, with the rise of single-page applications, an increasing number of websites use JavaScript to generate their content dynamically to improve the user experience. In such circumstances, a static crawler may **not be able to** collect the data. This is when you should use Selenium.

## Reference

Selenium with Python documentation (community-maintained): http://selenium-python.readthedocs.io/