# Web Scraping using Selenium

## Introduction

This tutorial introduces the basic concepts of Selenium and explains how to use it to scrape data from websites. Selenium is a tool that automates actions in a web browser and lets you save them as automated tests that you can replay later. It is also widely used to scrape data from dynamic websites by simulating how a person navigates through them. There are several reasons why Selenium is a good tool for scraping. Firstly, Selenium works with almost every major browser, including Chrome, Firefox, IE, and Safari. Secondly, because it drives a real browser, Selenium can handle pages whose content is generated by JavaScript. Thirdly, Selenium supports multiple languages, such as Python, Perl, PHP, Ruby, C#, and Java.

In this tutorial, we cover the installation of Selenium, a simple example of how to use it, and a real case that collects information about available plane tickets from Expedia (https://www.expedia.com/).

## Tutorial Contents

We will cover the following topics in this tutorial:
    
- [Installation](#Installation)
- [Getting Started](#Getting-Started)
- [Locating Elements](#Locating-Elements)
- [Waits](#Waits)
- [Case: Scraping plane tickets information from Expedia](#Case:-Scraping-plane-tickets-information-from-Expedia)
    
We will first explain how to install Selenium. Then a simple example will show the basic operations in Selenium, such as navigating to websites, locating page elements, and applying waits (explicit and implicit) to make sure pages are fully loaded. Finally, we will use a real case to explain how to obtain plane ticket information from Expedia (https://www.expedia.com/). Now, let's get started!
    

## Installation

Selenium Python bindings provide a simple API to write functional tests using Selenium WebDriver. Through this API you can access Selenium WebDrivers such as Chrome, Firefox, IE, and Safari. The currently supported Python versions are 2.7, 3.5 and above. In this tutorial, we will explain the Selenium 2 WebDriver API with Python 3.5.

Firstly, we need to download Python bindings for Selenium. There are two ways to download Selenium Python bindings: one is to download from the [PyPI page](https://pypi.python.org/pypi/selenium) for Selenium package; the second one, also the better approach, is to install the Selenium package using `pip`. Input the following command in your terminal to install Python bindings for Selenium:
    
    $ pip install selenium
    
Secondly, a web driver is required for Selenium to interface with the selected browser. Web drivers can be downloaded from the following links:

- Chrome [driver](https://sites.google.com/a/chromium.org/chromedriver/downloads)
- Firefox [driver](https://github.com/mozilla/geckodriver/releases)
- Safari [driver](https://webkit.org/blog/6900/webdriver-support-in-safari-10/)
- Edge [driver](https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/)

After downloading a web driver, Linux and macOS users should make sure it is on the PATH, e.g., by placing it in /usr/local/bin or /usr/bin. For Windows users, detailed instructions can be found [here](http://selenium-python.readthedocs.io/installation.html#detailed-instructions-for-windows-users). To check whether Selenium is installed successfully, run the code snippet in the Getting Started section: if you can get the page source, the installation works and we can keep moving!
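
Before launching a browser, it can help to confirm that a driver executable is actually visible on the PATH. The following is a minimal sketch using only the standard library. The executable names in `DRIVER_EXECUTABLES` and the helper `find_drivers` are assumptions made for illustration (they match the usual names of the Chrome and Firefox driver downloads), not part of Selenium.

```python
import shutil

# Assumed executable names for the downloaded drivers (hypothetical mapping;
# adjust it if your driver binary is named differently).
DRIVER_EXECUTABLES = {
    "Chrome": "chromedriver",
    "Firefox": "geckodriver",
}


def find_drivers(executables=DRIVER_EXECUTABLES):
    """Map each browser name to the driver path found on PATH, or None."""
    return {browser: shutil.which(name) for browser, name in executables.items()}


# Report which drivers the current shell environment can see.
for browser, path in find_drivers().items():
    print(f"{browser}: {path or 'not found on PATH'}")
```

If a driver is reported as not found, Selenium will most likely fail to start that browser until the executable is moved to a directory on the PATH.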

## Getting Started

Congratulations! You have installed all required packages of Selenium successfully. Now, we can start a journey to scrape data from websites using Selenium! The following is a simple example to navigate to a URL, locate elements and get results.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

By running the code snippet in the next cell, you can get the page source of the official Python website.

In [2]:
driver = webdriver.Chrome() #create an instance of Chrome WebDriver
driver.get("http://www.python.org") #navigate to a web page given by the URL
html = driver.page_source #get the source code of the website
#print(html)

Now you have the page source of the website. Next, you can find the search field using an element locator. Selenium can then simulate typing a search query and pressing Enter with the send_keys() method. After that, we take a screenshot of the search results and store it on our local machine to check whether we reached the search result page as expected. We also make the script sleep for 5 seconds so that we can see the result page in the web browser. Finally, don't forget to close the driver; otherwise the browser window stays open.

In [3]:
element = driver.find_element_by_name("q") #find the "Search" field in the web page
element.clear() #clear pre-populated text in the input field
element.send_keys("python 3.5") #enter search terms into the input field
element.send_keys(Keys.RETURN) #start searching
driver.get_screenshot_as_file("./img/screenshot1.png") #save a screenshot of the search results (the ./img folder must already exist)
time.sleep(5) #sleep for 5 seconds to see the search results
driver.close() #close the driver
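
Because it is easy to forget the final close, especially when an exception interrupts the script, one option is to wrap the driver in a context manager. The sketch below is an illustration, not part of the original example: `managed_driver` and `open_chrome` are hypothetical helper names, and the sketch calls `quit()`, which ends the whole browser session, whereas `close()` only closes the current window.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always shut it down afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs even if the body of the with-block raises an exception.
        driver.quit()


def open_chrome():
    """Hypothetical factory; Selenium is imported only when a browser is needed."""
    from selenium import webdriver
    return webdriver.Chrome()


# Usage (needs Selenium, Chrome and chromedriver to be installed):
# with managed_driver(open_chrome) as driver:
#     driver.get("http://www.python.org")
#     print(driver.title)
```

With this pattern the browser is shut down whether the scraping code succeeds or fails.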

## Locating Elements

You now have an overview of how Selenium works. Let's look more closely at the methods that Selenium provides to locate elements in a page.

Selenium provides many public methods to locate elements:
    
- find_element_by_id
- find_element_by_name
- find_element_by_xpath
- find_element_by_link_text
- find_element_by_partial_link_text
- find_element_by_tag_name
- find_element_by_class_name
- find_element_by_css_selector

To find multiple elements (these methods return a list):

- find_elements_by_name
- find_elements_by_xpath
- find_elements_by_link_text
- find_elements_by_partial_link_text
- find_elements_by_tag_name
- find_elements_by_class_name
- find_elements_by_css_selector

The following are some examples of how to use the public methods:

In [4]:
driver = webdriver.Chrome() #create an instance of Chrome WebDriver
driver.get("http://www.python.org") #navigate to a web page given by the URL

For instance, consider the following HTML snippet:

    <li id="documentation" class="tier-1 element-3" aria-haspopup="true">
        <a href="/doc/" title="" class="">Documentation</a>
    </li>

The "documentation" element can be located like this:

In [5]:
element_by_id = driver.find_element_by_id("documentation")
print(element_by_id.text)

Documentation


As another example, consider the following HTML snippet:
    
    <button type="submit" name="submit" id="submit" class="search-button" title="Submit this Search" tabindex="3">
        GO
    </button>

The "button" element can be located like this:

In [6]:
element_by_name = driver.find_element_by_name("submit")
print(element_by_name.text)
driver.close()

GO


Besides the public methods mentioned above, there are also two private methods provided by Selenium:
    
- find_element
- find_elements

The examples of how to use the private methods can be found [here](http://selenium-python.readthedocs.io/locating-elements.html).
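
As a rough sketch of how `find_element` can be used: it takes a locator strategy and a value, so a locator can be stored as a plain tuple and reused. The names `DOCUMENTATION`, `SUBMIT_BUTTON` and `element_text` below are hypothetical. The strategy strings `"id"` and `"name"` are the values of `By.ID` and `By.NAME` from `selenium.webdriver.common.by`; with Selenium installed you would normally write `By.ID` instead of the raw string. Selenium 4.x releases deprecate and later remove the `find_element_by_*` helpers, so this generic form is also the one to prefer with newer versions.

```python
# Locator tuples: (strategy, value). "id" and "name" are the string values of
# By.ID and By.NAME; they are written out here so the sketch has no imports.
DOCUMENTATION = ("id", "documentation")
SUBMIT_BUTTON = ("name", "submit")


def element_text(driver, locator):
    """Return the visible text of the first element matching the locator."""
    by, value = locator
    return driver.find_element(by, value).text


# Usage with a real driver that has loaded http://www.python.org:
# print(element_text(driver, DOCUMENTATION))
# print(element_text(driver, SUBMIT_BUTTON))
```

Keeping locators as data makes it easier to update them in one place when the page layout changes.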

## Waits

Now you know how to locate elements in a page. A new problem comes up, however: many websites use AJAX, so elements in a page may load at different times. This makes locating elements difficult. If an element is not yet present in the DOM ([Document Object Model](https://en.wikipedia.org/wiki/Document_Object_Model)), the locate functions raise a NoSuchElementException; if it is present but not yet visible, interacting with it can raise an ElementNotVisibleException. To avoid such exceptions, waits can be used to allow some slack between actions.

Selenium WebDriver provides two types of waits: explicit waits and implicit waits. 

An explicit wait makes Selenium WebDriver wait for a certain condition to occur before proceeding further. The following is an example of an explicit wait. Note: more expected conditions can be found [here](http://selenium-python.readthedocs.io/waits.html#explicit-waits).

In [7]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import selenium.webdriver.support.ui as ui

In [8]:
driver = webdriver.Chrome()
driver.get("https://www.expedia.com/")
try:
    ui.WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, "tab-flight-tab-hp")))
finally:
    driver.close()
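
To make the idea behind an explicit wait concrete, the following is a simplified sketch of a polling loop in plain Python. It is not Selenium's implementation: `wait_until` and `WaitTimeout` are hypothetical names, and the 0.5 second polling interval is chosen to mirror the default polling frequency of `WebDriverWait`.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become true in time (hypothetical)."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f seconds" % timeout)
        time.sleep(poll)


# Example: a condition that becomes true on the third check.
checks = []

def ready():
    checks.append(1)
    return len(checks) >= 3

wait_until(ready, timeout=2.0, poll=0.01)
print("condition met after", len(checks), "checks")
```

The key design point is that the wait returns as soon as the condition holds, so a generous timeout does not slow down the common case the way a fixed `time.sleep` would.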

An implicit wait tells Selenium WebDriver to keep polling the DOM for up to a certain amount of time when it tries to find an element that is not immediately available. The following is an example of an implicit wait.

In [9]:
driver = webdriver.Chrome()
driver.implicitly_wait(10) # wait for 10 seconds
driver.get("https://www.expedia.com/")
driver.find_element_by_id("tab-flight-tab-hp")

<selenium.webdriver.remote.webelement.WebElement (session="cd1e7c721b428bedc350218fc38c459a", element="0.2322158736774933-1")>

## Case: Scraping plane tickets information from Expedia

If you are here, you are ready to collect data from a real website using Selenium by yourself! We have explained the basic concepts with examples in the previous sections; now let's look at a real case that applies Selenium to scrape data from a website. In this case, we want to gather plane ticket information for a round trip from Pittsburgh to Seattle, where the departure date is 4/10/2018 and the return date is 4/12/2018.

Firstly, we need to initialize the web driver. We use Chrome options to ignore certificate errors and SSL errors during the connection.

In [10]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import StaleElementReferenceException
import selenium.webdriver.support.ui as ui
import selenium.webdriver.support.expected_conditions as EC
import pandas as pd
import time

In [11]:
chrome_options = Options()
chrome_options.add_argument("--ignore-certificate-errors")   
chrome_options.add_argument("--ignore-ssl-errors")
driver = webdriver.Chrome(chrome_options=chrome_options)

After initializing the web driver, we can start simulating how a person navigates through the Expedia website. Selenium finds the boxes for the origin, destination, departure date, and return date and fills them in. Then the search form is submitted automatically, and we can find the prices of all the available tickets, the airlines the tickets belong to, the duration of the flights, the number of stops, the departure time, and the arrival time. The implementation can be found in the expediaoperation() method. All the information scraped from the website is stored in a dataframe.
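
One detail worth keeping in mind when the scraped values are collected in separate lists and combined with `zip`: `zip` stops at the shortest list, so a missing value in one column silently drops rows. The sketch below illustrates this with made-up sample data; `airlines`, `prices` and `lengths_match` are hypothetical names used only for this illustration.

```python
from itertools import zip_longest

# Hypothetical scraped columns: the last price is missing.
airlines = ["Delta", "United", "American Airlines"]
prices = ["$342", "$342"]

# zip() truncates to the shortest column, so the third flight disappears.
rows_zip = list(zip(airlines, prices))

# zip_longest() keeps every row and fills the gaps with None instead.
rows_longest = list(zip_longest(airlines, prices, fillvalue=None))


def lengths_match(*columns):
    """Return True when all columns contain the same number of items."""
    return len({len(column) for column in columns}) <= 1


print(len(rows_zip), len(rows_longest), lengths_match(airlines, prices))
```

Checking the column lengths before building the dataframe makes it easier to notice when a field was not found for some flights.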

The following class puts the whole case together.

In [12]:
class expediaTest():
    
    # initialize the web driver
    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument("--ignore-certificate-errors")   
        chrome_options.add_argument("--ignore-ssl-errors")
        self.driver = webdriver.Chrome(chrome_options=chrome_options)
        pass
    
    # sleep for 10 seconds
    def timer(self):
        time.sleep(10)
        pass
    
    # collect plane tickets information from Expedia
    def expediaoperation(self):
        self.driver.get("https://www.expedia.com/")
        ui.WebDriverWait(self.driver, 10).until(EC.visibility_of_element_located((By.ID, "tab-flight-tab-hp")))
        self.driver.find_element_by_id("tab-flight-tab-hp").send_keys(Keys.RETURN)
        ui.WebDriverWait(self.driver, 10).until(EC.visibility_of_element_located((By.ID, "flight-origin-hp-flight")))
        self.driver.find_element_by_id("flight-origin-hp-flight").clear()
        self.driver.find_element_by_id("flight-origin-hp-flight").send_keys("Pittsburgh")
        self.driver.find_element_by_id("flight-destination-hp-flight").clear()
        self.driver.find_element_by_id("flight-destination-hp-flight").send_keys("Seattle")
        self.driver.find_element_by_id("flight-departing-hp-flight").clear()
        self.driver.find_element_by_id("flight-departing-hp-flight").send_keys("4/10/2018")
        self.driver.find_element_by_id("flight-returning-hp-flight").clear()
        self.driver.find_element_by_id("flight-returning-hp-flight").send_keys("4/12/2018")
        self.driver.find_element_by_id("gcw-flights-form-hp-flight").submit()
        time.sleep(10)
        
        alldata = self.driver.find_element_by_id("flightModuleList")
        
        dataset_list = []
        airline_list = []
        from_list = []
        to_list = []
        departure_time = []
        return_time = []
        price_list = []
        duration_list = []
        num_stop_list = []
        departure_time_list = []
        arrival_time_list = []
        
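        # ".//*" restricts the search to descendants of the result list;
        # a bare "//*" would match every element in the whole page
        for i in alldata.find_elements_by_xpath(".//*"):
            dataset_list.append(i)

        for data in dataset_list:
            # defaults in case the element stays stale on every attempt
            str_test_id = None
            str_duration_stop = None
            attempts = 0
            while attempts < 2:
                try:
                    str_test_id = data.get_attribute("data-test-id")
                    str_duration_stop = data.get_attribute("class")
                    break  # both attributes were read, no retry needed
                except StaleElementReferenceException:
                    attempts = attempts + 1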
                
            if str_test_id == "airline-name":
                airline_list.append(data.text)
                
            if str_test_id == "listing-price-dollars":
                price_list.append(data.text)

            if str_duration_stop == "duration-emphasis":
                duration_list.append(data.text)

            if str_duration_stop == "number-stops":
                num_stop_list.append(data.text.strip()[1])
                       
            if str_test_id == "departure-time":
                departure_time_list.append(data.text)
           
            if str_test_id == "arrival-time":
                arrival_time_list.append(data.text)
            
            from_list.append("Pittsburgh")
            to_list.append("Seattle")
            departure_time.append("4/10/2018")
            return_time.append("4/12/2018")
        
        print(pd.DataFrame(list(zip(from_list, to_list, departure_time, return_time, airline_list, price_list, duration_list, num_stop_list, departure_time_list, arrival_time_list)), columns=['from', 'to', 'departure_date', 'return_date', 'airline', 'price', 'duration', 'number_of_stops', 'departure_time', 'arrival_time']))
            
    def stop(self):
        self.driver.close()
        pass
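
The retry around `get_attribute` in the class above can be factored into a small helper. The following is a sketch under stated assumptions: `StaleElementError` is a stand-in for Selenium's `StaleElementReferenceException` so that the example runs without a browser, and `read_attributes` is a hypothetical name. In practice a stale reference often has to be located again rather than simply retried.

```python
class StaleElementError(Exception):
    """Stand-in for selenium's StaleElementReferenceException (assumption)."""


def read_attributes(element, names, retry_on=(StaleElementError,), attempts=2):
    """Read several attributes from an element, retrying on stale references.

    Returns a dict of attribute values, or None if every attempt failed.
    """
    for _ in range(attempts):
        try:
            return {name: element.get_attribute(name) for name in names}
        except retry_on:
            continue  # try again with the same element reference
    return None


# Usage with Selenium (inside the loop over scraped elements):
# attrs = read_attributes(data, ["data-test-id", "class"],
#                         retry_on=(StaleElementReferenceException,))
# if attrs is not None and attrs["data-test-id"] == "airline-name":
#     airline_list.append(data.text)
```

Returning `None` on failure lets the caller skip an element explicitly instead of reusing values left over from a previous iteration.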

In [13]:
expediatest = expediaTest()
#expediatest.timer()
expediatest.expediaoperation()
expediatest.stop()



          from       to departure_date return_date            airline price  \
0   Pittsburgh  Seattle      4/10/2018   4/12/2018              Delta  $342   
1   Pittsburgh  Seattle      4/10/2018   4/12/2018              Delta  $342   
2   Pittsburgh  Seattle      4/10/2018   4/12/2018              Delta  $342   
3   Pittsburgh  Seattle      4/10/2018   4/12/2018             United  $342   
4   Pittsburgh  Seattle      4/10/2018   4/12/2018  American Airlines  $342   
5   Pittsburgh  Seattle      4/10/2018   4/12/2018             United  $342   
6   Pittsburgh  Seattle      4/10/2018   4/12/2018              Delta  $342   
7   Pittsburgh  Seattle      4/10/2018   4/12/2018  American Airlines  $342   
8   Pittsburgh  Seattle      4/10/2018   4/12/2018              Delta  $342   
9   Pittsburgh  Seattle      4/10/2018   4/12/2018             United  $342   
10  Pittsburgh  Seattle      4/10/2018   4/12/2018             United  $386   
11  Pittsburgh  Seattle      4/10/2018   4/12/2018
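
The scraped values are plain strings such as `$342`, so a small cleaning step is usually needed before any analysis. The sketch below shows one possible way to do it. The helpers `parse_price` and `parse_stops` are hypothetical, and the stop labels `(Nonstop)` and `(1 stop)` are assumed formats that may differ from what the site actually displays.

```python
import re


def parse_price(text):
    """Turn a price string such as "$342" or "$1,204" into an int, else None."""
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None


def parse_stops(text):
    """Turn an assumed label like "(Nonstop)" or "(2 stops)" into a number.

    Returns None when the text matches neither pattern.
    """
    cleaned = text.strip().strip("()").lower()
    if cleaned.startswith("nonstop"):
        return 0
    match = re.match(r"(\d+)\s+stop", cleaned)
    return int(match.group(1)) if match else None


print(parse_price("$342"), parse_stops("(1 stop)"), parse_stops("(Nonstop)"))
```

Compared with taking a single character from the label, matching the whole number also handles values with more than one digit.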

## Summary and References

This tutorial explained the basic concepts of Selenium to give novices an overview of how to use it to collect data from websites, and walked through a real case step by step. The tutorial only used the Chrome browser; scraping data with other browsers is similar. More details and references can be found in the following links:
    
- [Selenium Documentation](http://selenium-python.readthedocs.io/index.html)
- [Extract information from website using Selenium](https://www.youtube.com/watch?v=zZjucAn_JYk)

Hope this Selenium tutorial helps!