# Selenium

## What is Selenium?
Selenium is a web browser automation tool.
It is primarily used for automating web applications for testing purposes, but it is certainly not limited to that. It allows you to open a browser of your choice and perform tasks as a human being would, such as:

* Clicking buttons
* Entering information in forms
* Searching for specific information on the web pages
* Scrolling
* Taking a screenshot


At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser, end-to-end testing (acceptance tests).

Now it is still used for testing, but it is also used as a general browser automation platform. And of course, it is used for web scraping!

The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.



![image.png](attachment:image.png)

## Required installations

### Selenium

Install the Selenium Python package, for example with `pip install selenium`.

### Browser driver

You need to install a **browser driver**, chosen according to the browser you use. In my case, I have Chrome, so I installed the Chrome driver. Below are links to some of the more popular browser drivers:

* ChromeDriver – WebDriver for Chrome (https://sites.google.com/chromium.org/driver/)

* Microsoft Edge Driver (https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/)

* WebDriver Support in Safari 10 (https://webkit.org/blog/6900/webdriver-support-in-safari-10/)

Chrome can also run in headless mode (without any graphical user interface), which is useful when you run it on a server.

**ChromeOptions (common arguments and methods)** is a separate class in Selenium that helps to manage options specific to the ChromeDriver. In the Java bindings, ChromeOptions is a class that extends MutableCapabilities. It was introduced with Selenium v3.6.0.

**Why it is required:** The ChromeOptions class is used to customize the settings of the Chrome browser. With this class we can, for example, disable popup blocking, make Chrome the default browser, disable extensions, use incognito mode or check the version. By default, Selenium starts a fresh browser session that doesn't have any settings, cookies or history.

**Frequently Used Methods and Arguments:**

* start-maximized: This argument opens the Chrome browser window maximized.

* setPageLoadStrategy: This method is used to speed up execution. It accepts three strategies: Normal, None, and Eager.
    * Normal: Selenium WebDriver waits until the entire page is loaded.
    * None: Selenium WebDriver only waits until the initial page is downloaded.
    * Eager: Selenium WebDriver waits until the initial HTML document has been completely loaded and parsed.
    
* disable-infobars: This argument was used to remove the information bar/notifications from the browser. It has been deprecated, and a different line of code is now needed to remove the information bar.

* incognito: This argument opens the Chrome browser in incognito mode, which avoids keeping history and cookies. Incognito mode deletes this data as soon as we close the web browser.

* version: This argument is used to get the current version of the browser.

* disable-popup-blocking: This argument controls popup blocking in the Chrome browser. The two lines of Java code given for it are:
    * Code: `options.setExperimentalOption("excludeSwitches", Arrays.asList("disable-popup-blocking"));`
    * Code: `options.addArguments("--disable-popup-blocking");`
    
Examples: https://studysection.com/blog/chromeoptions-class-common-arguments-and-methods/
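
The arguments above are written in the Java style of the linked article. In the Python bindings, a rough equivalent might look like the sketch below. The helper names `chrome_arguments` and `build_options` are hypothetical, and the Selenium import is deferred into `build_options` only so that the list of switches can be inspected without Selenium installed.

```python
def chrome_arguments(headless=False, incognito=False, maximized=False):
    """Return the Chrome command-line switches for the requested settings."""
    arguments = []
    if headless:
        arguments.append("--headless")         # no graphical user interface
    if incognito:
        arguments.append("--incognito")        # history and cookies are not kept
    if maximized:
        arguments.append("--start-maximized")  # open the window maximized
    return arguments


def build_options(arguments, page_load_strategy="normal"):
    """Turn a list of switches into a ChromeOptions object."""
    # Imported here so that chrome_arguments() works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in arguments:
        options.add_argument(argument)
    # One of "normal", "eager" or "none" (the three strategies listed above).
    options.page_load_strategy = page_load_strategy
    return options


arguments = chrome_arguments(headless=True, incognito=True)
print(arguments)
```

With Selenium and a Chrome driver installed, the result would typically be passed to the driver as `webdriver.Chrome(options=build_options(arguments, "eager"))`.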

=> `driver.page_source` returns the full HTML code of the page.

Here are two other interesting WebDriver properties:

- driver.title gets the page's title
- driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)
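
As a small illustration, these three properties can be collected in one place. The function name `page_summary` is hypothetical, and a stand-in object with made-up values replaces the real driver here so that the sketch runs without a browser.

```python
from types import SimpleNamespace


def page_summary(driver):
    """Collect a few properties of the page the driver is currently on."""
    return {
        "title": driver.title,                    # the page's title
        "final_url": driver.current_url,          # URL after any redirections
        "html_length": len(driver.page_source),   # size of the full HTML
    }


# Stand-in for a real driver (made-up values), only so this runs without a browser.
fake_driver = SimpleNamespace(
    title="Example Domain",
    current_url="https://example.org/",
    page_source="<html><head><title>Example Domain</title></head></html>",
)
summary = page_summary(fake_driver)
print(summary["title"], summary["final_url"])
```

With Selenium, the same function would be called as `page_summary(driver)` after `driver.get(url)`.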

## First Steps initialisation

<div class="alert alert-success">
The steps to parse a dynamic page using Selenium are:

1- Initialize a driver (a Python object that controls a browser window).
    
2- Direct the driver to the URL we want to scrape.
    
3- Wait for the driver to finish executing the JavaScript and changing the HTML. The driver is typically a Chrome driver, so the page is treated the same way as if you were visiting it in Chrome.
    
4- Use driver.page_source to get the HTML as it appears after JavaScript has rendered it.
    
5- Use a parser on the returned HTML.
    
</div>
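
The five steps could be sketched as follows. This is only an outline under a few assumptions: `fetch_rendered_html` and `HeaderCellParser` are hypothetical names, `webdriver.Chrome()` assumes a Chrome driver can be found on the machine, and the standard-library `html.parser` is used for step 5 because the steps do not name a particular parser.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url):
    """Steps 1-4: open a browser, load the page, return the rendered HTML."""
    # Imported here so the parsing half of the sketch runs without Selenium.
    from selenium import webdriver

    driver = webdriver.Chrome()    # 1- initialize a driver
    try:
        # 2- and 3-: with the default page load strategy, get() blocks until
        # the page has loaded; content added later may need an explicit wait.
        driver.get(url)
        return driver.page_source  # 4- HTML after JavaScript has run
    finally:
        driver.quit()              # always close the browser window


class HeaderCellParser(HTMLParser):
    """Step 5: collect the text of every <th> cell."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self._in_th = False

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._in_th = True
            self.headers.append("")

    def handle_endtag(self, tag):
        if tag == "th":
            self._in_th = False

    def handle_data(self, data):
        if self._in_th:
            self.headers[-1] += data


# A small literal page stands in for the value returned by fetch_rendered_html().
sample_html = (
    "<table><tr><th>State</th><th>Emissions</th></tr>"
    "<tr><td>World</td><td>48928</td></tr></table>"
)
parser = HeaderCellParser()
parser.feed(sample_html)
print(parser.headers)
```

On a real page, `sample_html` would be replaced by `fetch_rendered_html(url)`.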

### Initialize a driver (a Python object that controls a browser window)

We'll use a Wikipedia page to test scraping on.

We'll scrape the page https://en.wikipedia.org/wiki/List_of_countries_by_greenhouse_gas_emissions to extract the data from the table, save it in a Pandas DataFrame and export it to a CSV file.

![image.png](attachment:image.png)


### Direct the driver to the URL we want to scrape.


In [None]:
#3- Wait for the driver to finish executing the javascript, and changing the HTML. The driver is typically a Chrome driver, 
#so the page is treated the same way as if you were visiting it in Chrome.
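
One possible way to wait is to poll the browser until it reports the document as loaded. This is a sketch, not the only approach: `page_is_ready` and `wait_until_ready` are hypothetical names, and the `FakeDriver` class exists only so that the check can be tried without a browser.

```python
def page_is_ready(driver):
    """Return True once the browser reports the document as fully loaded."""
    return driver.execute_script("return document.readyState") == "complete"


def wait_until_ready(driver, timeout=10):
    """Block until page_is_ready(driver) is true, or raise a timeout error."""
    # Imported here so page_is_ready() can be tried without Selenium installed.
    from selenium.webdriver.support.ui import WebDriverWait

    # until() calls page_is_ready(driver) repeatedly until it returns True.
    WebDriverWait(driver, timeout).until(page_is_ready)


class FakeDriver:
    """Stand-in for a real driver, only so this sketch runs without a browser."""

    def execute_script(self, script):
        return "complete"


ready = page_is_ready(FakeDriver())
print(ready)
```

With a real driver, `wait_until_ready(driver)` would be called right after `driver.get(url)`. For pages that keep loading data after the document is complete, waiting for a specific element (for example with `expected_conditions.presence_of_element_located`) is often more reliable.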

### Get and parse the HTML 

**Principal methods of Selenium** 

There are various strategies to locate elements in a page. You can use the most appropriate one for your case. Older Selenium releases provided one `find_element_by_*` method per strategy (for example `find_element_by_id`). These helpers were deprecated in Selenium 4 and later removed, so current code calls `find_element` with a `By` strategy instead:

* find_element(By.ID, ...)
* find_element(By.NAME, ...)
* find_element(By.XPATH, ...)
* find_element(By.LINK_TEXT, ...)
* find_element(By.PARTIAL_LINK_TEXT, ...)
* find_element(By.TAG_NAME, ...)
* find_element(By.CLASS_NAME, ...)
* find_element(By.CSS_SELECTOR, ...)

To find multiple elements, call `find_elements` with the same strategies. It returns a list:

* find_elements(By.NAME, ...)
* find_elements(By.XPATH, ...)
* find_elements(By.LINK_TEXT, ...)
* find_elements(By.PARTIAL_LINK_TEXT, ...)
* find_elements(By.TAG_NAME, ...)
* find_elements(By.CLASS_NAME, ...)
* find_elements(By.CSS_SELECTOR, ...)

Examples are here: https://selenium-python.readthedocs.io/locating-elements.html

In [None]:
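from selenium.webdriver.common.by import By

h1 = driver.find_element(By.TAG_NAME, 'h1')           # by tag name
h1 = driver.find_element(By.CLASS_NAME, 'someclass')  # by class attribute
h1 = driver.find_element(By.XPATH, '//h1')            # by XPath expression
h1 = driver.find_element(By.ID, 'greatID')            # by id attribute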


XPath is a language that uses path expressions to select a node or a set of nodes in an XML document. These expressions resemble the paths you usually see in a computer file system. The most useful path expressions are:

- nodename selects the nodes with that name
- / selects from the root node
- // selects matching nodes below the current node, wherever they are in the document
- . selects the current node
- .. selects the “parent” of the current node
- @ selects an attribute of a node, such as id or class

For more details: https://www.w3schools.com/xml/xpath_syntax.asp
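
To try these expressions without a browser, the sketch below uses Python's standard-library `xml.etree.ElementTree`, which supports only a limited subset of XPath (for instance, no absolute paths starting with `/`). The small document is made up for illustration. With Selenium, full XPath expressions are passed to `driver.find_element(By.XPATH, ...)` instead.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed document used only for illustration.
DOC = """
<html>
  <body>
    <h1 id="greatID" class="someclass">Emissions</h1>
    <table class="wikitable">
      <tr><th>State</th><th>Total</th></tr>
      <tr><td>World</td><td>48928</td></tr>
    </table>
  </body>
</html>
"""
root = ET.fromstring(DOC)

# nodename: direct children of the current node that are named "body"
bodies = root.findall("body")

# . and //: start at the current node and search all of its descendants
headers = [th.text for th in root.findall(".//th")]

# @: keep only the nodes whose attribute has the given value
h1 = root.find(".//h1[@id='greatID']")

# ..: step up to the parent of each matching node
parents = root.findall(".//td/..")

print(headers, h1.text, [p.tag for p in parents])
```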

In [None]:
# 2- Direct the driver to the URL we want to scrape.
driver.get(url)
# 4- Use driver.page_source to get the HTML as it appears after JavaScript has rendered it.
html = driver.page_source

As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need. A handy shortcut is to press Ctrl + Shift + C (or Cmd + Shift + C on macOS) and then pick the element with your mouse, instead of having to right click and choose Inspect each time.

In [None]:
#5- Use a parser on the returned HTML
# 5.1 extract the header row of the table. With By.CLASS_NAME, find_elements needs only the class name as input.
from selenium.webdriver.common.by import By
titles = driver.find_elements(By.CLASS_NAME, 'headerSort')

![image.png](attachment:image.png)

In [None]:
for t in titles:
    print(t)

<selenium.webdriver.remote.webelement.WebElement (session="e8abb04720dc22e6bf4746f91e0ae3d2", element="f4570160-b1d6-4d84-a5e3-930d1fe0e0fc")>
<selenium.webdriver.remote.webelement.WebElement (session="e8abb04720dc22e6bf4746f91e0ae3d2", element="ccf1ecac-bf26-49f5-8de0-381409c40be2")>
<selenium.webdriver.remote.webelement.WebElement (session="e8abb04720dc22e6bf4746f91e0ae3d2", element="b5857aab-8daa-4840-bea2-8be9c70440e8")>
<selenium.webdriver.remote.webelement.WebElement (session="e8abb04720dc22e6bf4746f91e0ae3d2", element="e71d80e4-cb88-4e1e-8fd1-33e9a2be05ad")>
<selenium.webdriver.remote.webelement.WebElement (session="e8abb04720dc22e6bf4746f91e0ae3d2", element="446b3bfc-06fa-4ff3-bbd1-adbbe5234bef")>


A WebElement is a Selenium object representing an HTML element.

There are many actions that you can perform on those HTML elements. Here are the most useful:

- Accessing the text of the element with the property element.text
- Clicking on the element with element.click()
- Accessing an attribute with element.get_attribute('class')
- Sending text to an input with: element.send_keys('mypassword')

There are some other interesting methods like is_displayed(). This returns True if an element is visible to the user.

In [47]:
import os

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 takes the driver path through a Service object
service = Service(executable_path=os.environ["CHROME_DRIVER_PATH"])
driver = webdriver.Chrome(service=service)
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_greenhouse_gas_emissions'
driver.get(url)
titles = driver.find_elements(by=By.CLASS_NAME, value='headerSort')
# find_element returns only the first <table> found on the page
all_data = driver.find_element(by=By.TAG_NAME, value="table")
print(all_data.text)
# data = all_data.find_elements(by=By.TAG_NAME, value="td")
#print(data.text)




This article needs to be updated. The reason given is: Consumption-based data needs updating. Please help update this to reflect recent events or newly available information. (February 2022)


=> We obtained a list containing all the titles of the table. We can already create an empty DataFrame, specifying the names of the columns.

In [49]:
import pandas as pd
df = pd.DataFrame(columns=[t.text for t in titles])

df.head()

| | State | Production-based emissions\n(MtCO2e) 2018 (CAIT)[1] | Production-based emissions INCLUDING land-use, land-use change and forestry (reported to UNFCCC)\nMtCO2e 2018[5] | Production-based emissions excluding land-use, land-use change and forestry(World Resources Institute)\nMtCO2e 2016[6] | Consumption-based emissions\n(Global Carbon Project)\nMtCO2e 2016[7] |
| --- | --- | --- | --- | --- | --- |


In [None]:
### 5.2 extract the data contained in each column. The <tbody> tag contains the body content of an HTML table,
# so all the cells we want to extract are within these tags.

In [None]:
states = driver.find_elements(By.XPATH, '//table[@class="wikitable sortable plainrowheaders jquery-tablesorter"]/tbody/tr/th')

In [None]:
for idx,s in enumerate(states):
    print('row {}:'.format(idx))
    print('{}'.format(s.text))

=> Extract the content of the other columns. After the column of states, all the remaining columns are contained in the `<td>` tags. The index needs to be specified since we look row by row with the `<tr>` tags.

In [None]:
col2 = #TBD

In [None]:
col3 =   #TBD

In [None]:
col4 =   #TBD

In [None]:
col5 =   #TBD
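
The placeholders above are left open. As a hint, one possible way to build the XPath for a given column is sketched below. It follows the description above (one `<td>` index per row) and reuses the table class from the `states` query. The helper name `column_xpath` is hypothetical, and rows with missing cells may make the columns differ in length.

```python
# Class attribute of the table, copied from the XPath used for `states`.
TABLE_CLASS = "wikitable sortable plainrowheaders jquery-tablesorter"


def column_xpath(index, table_class=TABLE_CLASS):
    """XPath for the index-th <td> of every body row (XPath counts from 1)."""
    return '//table[@class="{}"]/tbody/tr/td[{}]'.format(table_class, index)


first_data_column = column_xpath(1)
print(first_data_column)
```

The resulting string would then be passed to the driver, for example `driver.find_elements(By.XPATH, column_xpath(1))`.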

In [None]:
# Finally, we can add the columns to the DataFrame previously created:
df[df.columns[0]] = [s.text for s in states]
df[df.columns[1]] = [s.text for s in col2]
df[df.columns[2]] = [s.text for s in col3]
df[df.columns[3]] = [s.text for s in col4]
df[df.columns[4]] = [s.text for s in col5]

In [None]:
df.head()

| | State | Production-based emissions\n(MtCO2e) 2018 (CAIT)[1] | Production-based emissions INCLUDING land-use, land-use change and forestry (reported to UNFCCC)\nMtCO2e 2018[5] | Production-based emissions excluding land-use, land-use change and forestry(World Resources Institute)\nMtCO2e 2016[6] | Consumption-based emissions\n(Global Carbon Project)\nMtCO2e 2016[7] |
| --- | --- | --- | --- | --- | --- |
| 0 | World | 48928 | | | |
| 1 | China (see: Greenhouse gas emissions by China) | 11706 | | 12700.0 | 8801.0 |
| 2 | United States (see: Greenhouse gas emissions ... | 5794 | 5903.0 | 6570.0 | 5716.0 |
| 3 | India (see: Greenhouse gas emissions by India) | 3347 | | 2870.0 | 2217.0 |
| 4 | European Union (EU28, including the United Ki... | 3333 | 3951.0 | | 4166.0 |


In [None]:
# Let’s export the dataset into a CSV file:

df.to_csv('greenhouse_gas_emissions.csv')

When automating forms, it is worth avoiding honeypots (for example, by not filling hidden inputs).

**Honeypots** are mechanisms used by website owners to detect bots. For example, a form may contain an HTML input with the attribute type=hidden like this:

```
<input type="hidden" id="custId" name="custId" value="">
```

This input value is supposed to stay blank. If a bot visits a page and fills all of the inputs on a form with random values, it will also fill the hidden input. A legitimate user would never fill the hidden input, because it is not rendered by the browser.
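
A minimal sketch of how a script might skip such inputs is shown below, using the WebElement methods `get_attribute`, `is_displayed` and `send_keys` mentioned earlier. The function name `fill_visible_inputs` is hypothetical, and the `FakeInput` class is only a stand-in so that the sketch runs without a browser.

```python
def fill_visible_inputs(inputs, value):
    """Type value into every input a human could see; return how many were filled."""
    filled = 0
    for element in inputs:
        if element.get_attribute("type") == "hidden":
            continue  # possible honeypot: never rendered by the browser
        if not element.is_displayed():
            continue  # not visible on the page, so a real user could not fill it
        element.send_keys(value)
        filled += 1
    return filled


class FakeInput:
    """Stand-in for a WebElement, only so this sketch runs without a browser."""

    def __init__(self, input_type, displayed=True):
        self.input_type = input_type
        self.displayed = displayed
        self.typed = ""

    def get_attribute(self, name):
        return self.input_type if name == "type" else None

    def is_displayed(self):
        return self.displayed

    def send_keys(self, text):
        self.typed += text


inputs = [
    FakeInput("text"),
    FakeInput("hidden", displayed=False),
    FakeInput("text", displayed=False),
]
filled = fill_visible_inputs(inputs, "hello")
print(filled)
```

With a real driver, the list of inputs could come from `driver.find_elements(By.TAG_NAME, "input")`.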


## Useful resources:

* https://www.scrapingbee.com/blog/practical-xpath-for-web-scraping/