# Selenium Tutorial

**By: Tristan Dewing**

Selenium is a browser automation framework with a Python package. It is used primarily for website testing and automation, though its capabilities also make it a useful web scraping tool. Using Selenium, you can simulate a user’s experience and interact with live websites, performing actions such as clicking buttons, selecting options from dropdown menus, typing entries into search bars, scrolling, and more.

These actions make Selenium a good choice for scraping dynamic websites, that is, interactive sites where the content or data displayed depends on user actions and other factors such as time and day or user demographics. It is also a very flexible tool capable of scraping a wide variety of websites using many different browsers (e.g. Chrome, Firefox, Safari) and platforms.

However, because Selenium is primarily a website automation tool, using it for web scraping requires some background knowledge of HTML code and web development. In addition, Selenium is slower compared to other Python web scraping packages such as BeautifulSoup and Scrapy. The best use case for Selenium is scraping data from a dynamic website that requires clicking buttons or other interactive elements to access the data.

## Getting Started

Setting up Selenium is easy but will take a little bit of time. Because Selenium is used to automate websites through a web browser, it will need access to your web browser. To do this, you will first need to install a Selenium "driver" specific to your browser of choice. For this example, we will use Google Chrome.

### 1. Download ChromeDriver: https://chromedriver.chromium.org/downloads

If you are using ChromeDriver, you will need to install the version that matches your operating system and your current version of Chrome. To check your Chrome version, click the three dots to the far right of the address bar in your Chrome window and select Help -> About Google Chrome. From there, you can select the proper version of ChromeDriver according to your Chrome version and operating system (e.g. mac64). For Chrome 115 and later, ChromeDriver builds are published through the Chrome for Testing availability dashboard, which the download page above links to.
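As a rough illustration of the version matching described above, the sketch below compares the leading (major) number of two version strings. The helper names and the version strings are hypothetical, and the rule of thumb that the major versions should agree is an assumption stated here for illustration, so check the ChromeDriver download page for the exact guidance.

```python
def major_version(version_string):
    """Return the leading (major) number of a dotted version string as an int."""
    return int(version_string.strip().split(".")[0])


def versions_compatible(chrome_version, chromedriver_version):
    """Rule of thumb (assumption): the major versions should be the same."""
    return major_version(chrome_version) == major_version(chromedriver_version)


# Hypothetical version strings, as shown under Help -> About Google Chrome
print(versions_compatible("114.0.5735.198", "114.0.5735.90"))  # True
print(versions_compatible("115.0.5790.102", "114.0.5735.90"))  # False
```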

### 2. In terminal or in IDE: pip install selenium

Once you have installed your ChromeDriver or other driver for your browser, you will then need to install the Selenium package using the command `pip install selenium` in your terminal or IDE.

```
!pip install selenium
```
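To confirm that the installation worked, one option is to look up the installed version from Python. This is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8 or later); the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


selenium_version = installed_version("selenium")
if selenium_version is None:
    print("selenium is not installed in this environment")
else:
    print(f"selenium {selenium_version} is installed")
```

Knowing the version is useful because the way the driver path is passed to `webdriver.Chrome` differs between Selenium 3 and Selenium 4.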



## Importing Libraries and Creating the Driver

For this tutorial, we will be using the website https://www.adamchoi.co.uk/teamgoals/detailed, a database of soccer matches. To start off, you will need to create a driver object that opens a ChromeDriver window, which is a Google Chrome window controlled by Selenium. You will also need variables for the link you are trying to access and the path to the ChromeDriver on your computer. Run the code below to test that your ChromeDriver works:

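```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

website = "https://www.adamchoi.co.uk/teamgoals/detailed" # insert link to website you are scraping

path = "/Users/tristan/chromedriver" # insert path to ChromeDriver on your computer

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(executable_path=path)) # create webdriver (opens ChromeDriver window)

driver.get(website) # loads the website in that window

# driver.quit() # closes ChromeDriver window
```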


Once you run this code chunk, if everything is working properly, a ChromeDriver window should open with the message *“Chrome is being controlled by automated test software”* near the top. That means that Selenium is actively controlling a ChromeDriver window.

To end the ChromeDriver session, you can either close out of the window itself or you can use the `driver.quit()` command, which will automatically end Selenium's control over the window and close it out.

The sequence in the code chunk above stays the same for any web scraping task you perform using Selenium.
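Because every script follows the same open, scrape, quit sequence, it can help to make sure `quit()` runs even when the scraping code raises an error. The sketch below is one possible way to do that with a context manager. The names `managed_driver` and `create_chrome_driver` are hypothetical, and the factory assumes the Selenium 4 `Service` style of passing the driver path.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Create a driver, hand it to the caller, and always quit it afterwards."""
    driver = create_driver()
    try:
        yield driver
    finally:
        driver.quit()  # runs even if the scraping code raises an exception


def create_chrome_driver():
    """Hypothetical factory; selenium is imported here so the helper above has no dependencies."""
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(executable_path="/Users/tristan/chromedriver"))


# Usage (opens a real browser window, so it is left commented out):
# with managed_driver(create_chrome_driver) as driver:
#     driver.get("https://www.adamchoi.co.uk/teamgoals/detailed")
#     print(driver.title)
```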

## Selenium Methods

Selenium has a variety of methods that are used for testing websites, but for the purposes of web scraping, the most important ones you will need to know are `find_element()` and `find_elements()`. Similar to regular expression functions, these two methods find specific parts of a website based on the type of HTML element you specify. The first method `find_element()` finds the first instance that matches the element you specify while the plural version `find_elements()` finds all instances that match the element.

When you want to perform an action such as clicking a button or scraping text from a website, you first need an HTML attribute or other locator that uniquely identifies the button, text block, or other object of interest. This is what Selenium "finds" within the website's nested HTML structure so that you can interact with or scrape specific parts of the website. Some common HTML attributes, such as ID, name, and class, can be used directly. However, if the object you are trying to access does not have one of these attributes, you can locate it with an XPath expression, which describes the element by its tag, its other attributes, or its position relative to other elements.

Aside from `find_element()` and `find_elements()`, you will also need to import the `By` class from the `selenium.webdriver.common.by` module. A `By` attribute is the first argument of `find_element()` or `find_elements()` and specifies what type of locator you are using. The second argument is the value to search for, be it the ID, class, or XPath. The table below lists each `By` attribute, an example of what the matching HTML typically looks like, and how to use that attribute in Selenium code.

| Type                 | Description                                                                    | HTML Example                | Selenium Code                           |
|:----------------------|:--------------------------------------------------------------------------------|:-----------------------------|:--------------------------------------------|
| By.ID                | Searches for elements based on their HTML ID                                   | \<div id="myID">             | find_element(By.ID, "myID")                |
| By.NAME              | Searches for elements based on their name attribute                            | \<input name="myNAME">       | find_element(By.NAME, "myNAME")            |
| By.XPATH             | Searches for elements based on an XPath expression                             | \<span>My \<a>Link\</a>\</span> | find_element(By.XPATH, "//span/a")         |
| By.LINK_TEXT         | Searches for anchor elements based on a match of their text content            | \<a>My Link\</a>              | find_element(By.LINK_TEXT, "My Link")      |
| By.PARTIAL_LINK_TEXT | Searches for anchor elements based on a sub-string match of their text content | \<a>My Link\</a>              | find_element(By.PARTIAL_LINK_TEXT, "Link") |
| By.TAG_NAME          | Searches for elements based on their tag name                                  | \<h1>                        | find_element(By.TAG_NAME, "h1")            |
| By.CLASS_NAME        | Searches for elements based on their HTML classes                              | \<div class="myCLASS">       | find_element(By.CLASS_NAME, "myCLASS")     |
| By.CSS_SELECTOR      | Searches for elements based on a CSS selector                                  | \<span>My\<a>Link\</a>\</span> | find_element(By.CSS_SELECTOR, "span > a")  |

## Clicking on a Button with Selenium

Now we will begin our process of scraping the website.

The first action we will do using Selenium is clicking a button. On the website, we want to click the "All matches" button to reduce the two split tables on the website to one table. To access the button, we will need to "inspect" the website to find the HTML element that references the button. 

If we right click the button and inspect it, we can observe that there is not a clear class name or ID in the HTML code that references the button, but instead an attribute hidden in the middle of a larger body of HTML code: `analytics-event="All matches"`. Because there is no By attribute that can directly access the element, this is an indication that in order to access the button, we will need to reference it indirectly using an XPath. 

Typically, XPaths take the following format: `//tagName[@AttributeName="Value"]`. Here, `tagName` is the HTML tag of the element (e.g. `label`), `AttributeName` is the name of an attribute on that tag, and `Value` is that attribute's value. Together they single out the element we want, in this case the button.

Therefore, the XPath we will use to reference the "All matches" button is  `'//label[@analytics-event="All matches"]'`, which you can check by typing in the XPath into the Inspect element search bar to see if it highlights the correct button.
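To see how this kind of attribute-based XPath picks out one element, the sketch below runs the same expression against a small, hypothetical snippet of markup using Python's built-in `xml.etree.ElementTree`. This is only a stand-in so the example runs without a browser: ElementTree supports a limited subset of XPath, the snippet is invented and far simpler than the real page, and searches need a leading `.` to be relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is larger and may differ.
snippet = """
<div>
  <label analytics-event="Home matches">Home</label>
  <label analytics-event="All matches">All matches</label>
</div>
"""

root = ET.fromstring(snippet)

# Same pattern as //tagName[@AttributeName="Value"], written relative to the root
button = root.find('.//label[@analytics-event="All matches"]')
print(button.text)  # All matches
```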

Now to command Selenium to access the button and click it, we will import `By` from the `selenium.webdriver.common.by` module and then create an object that references the button using the following command: `driver.find_element(By.XPATH, '//label[@analytics-event="All matches"]') `. Lastly, to click the button, we use the `.click()` method on the object we have just created.

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

website = "https://www.adamchoi.co.uk/teamgoals/detailed" # insert link to website you are scraping
path = "/Users/tristan/chromedriver" # insert path to ChromeDriver
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(executable_path=path)) # create webdriver (opens ChromeDriver window)
driver.get(website) # loads the website in that window

# locates the "All matches" button by its XPath
all_matches_button = driver.find_element(By.XPATH, '//label[@analytics-event="All matches"]')

all_matches_button.click() # clicks the button

driver.quit() # closes ChromeDriver window
```

If everything works properly, Selenium should open a ChromeDriver window and then automatically click the "All matches" button, followed by the window closing automatically.

## Extracting Data from a Table

If there is a table of data visible on a website, we can use Selenium to extract the data and load it into Python. Typically, if you view tables in HTML using Inspect, they are built from the tags `tr` and `td`, which mark table rows and the cells (columns) within each row, respectively. In an XPath, you can also use square brackets `[]` and numeric indices to access specific rows and columns.

However, it is important to note that unlike indices in many programming languages such as Python and C++, XPath indices start at 1 rather than 0, so to access the first row of a table, you use 1 as the index rather than 0. Therefore, you can locate individual cells for each row in the table using the general XPath `//tr/td[n]` where `n` is the nth column of a given row.
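The difference between 1-based XPath indices and 0-based Python indices can be checked on a small table without opening a browser. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports a subset of XPath including positional indices. The markup is a simplified, hypothetical stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified table markup
table = ET.fromstring("""
<table>
  <tr><td>05-08-2022</td><td>Crystal Palace</td><td>0 - 2</td></tr>
  <tr><td>13-08-2022</td><td>Arsenal</td><td>4 - 2</td></tr>
</table>
""")

# XPath: the first row is tr[1] and its first cell is td[1]
first_row = table.find("./tr[1]")
print(first_row.find("./td[1]").text)  # 05-08-2022
print(first_row.find("./td[2]").text)  # Crystal Palace

# Python: the same cells in a list start at index 0
cells = [td.text for td in first_row.findall("./td")]
print(cells[0])  # 05-08-2022
```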

To pull data from the table in the website, we will first click the "All matches" button to reduce the two tables to one. Next we will access the rows of the table by finding all elements with the tag name `tr` using the command `driver.find_elements(By.TAG_NAME, 'tr')`.

Then we will create empty lists to hold the entries from the date, team, and score columns. Finally, we will loop through the rows we scraped and add the entries from each row to the corresponding lists using the `match.find_element(By.XPATH, './td[n]').text` command, where `./td[n]` is the nth column of the given row and the `.text` attribute accesses the text of each entry.

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

website = "https://www.adamchoi.co.uk/teamgoals/detailed" # insert link to website you are scraping
path = "/Users/tristan/chromedriver" # insert path to ChromeDriver
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(executable_path=path)) # create webdriver (opens ChromeDriver window)
driver.get(website) # loads the website in that window

# locates the "All matches" button by its XPath
all_matches_button = driver.find_element(By.XPATH, '//label[@analytics-event="All matches"]')

all_matches_button.click() # clicks the button

matches = driver.find_elements(By.TAG_NAME, 'tr') # plural find_elements returns a list of tr elements

date = [] # list to store date values in first column
team = [] # list to store team values in second column
score = [] # list to store score values in third column

for match in matches:
    # print(match.text) prints data row by row with all the data in a row stored in one variable
    date.append(match.find_element(By.XPATH, './td[1]').text)
    # add value from row in first column to date list
    team.append(match.find_element(By.XPATH, './td[2]').text)
    # add value from row in second column to team list
    score.append(match.find_element(By.XPATH, './td[3]').text)
    # add value from row in third column to score list

driver.quit() # closes ChromeDriver window
```
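One thing to keep in mind is that `find_element()` raises a `NoSuchElementException` when nothing matches, while `find_elements()` returns an empty list. If a page contains `tr` rows without `td` cells (for example, a header row), a more tolerant approach is to collect the cell text for each row first and skip rows that are too short. The sketch below shows that idea on plain Python lists. The helper name `rows_to_columns` and the sample rows are hypothetical, and whether the tutorial website actually contains such rows is not something this example assumes.

```python
def rows_to_columns(rows_of_cells, n_columns=3):
    """Split rows of cell text into per-column lists, skipping rows that are too short."""
    columns = [[] for _ in range(n_columns)]
    for cells in rows_of_cells:
        if len(cells) < n_columns:
            continue  # e.g. a header row that holds no <td> cells
        for index in range(n_columns):
            columns[index].append(cells[index])
    return columns


# With a live driver, the input could be built like this (not run here):
# rows_of_cells = [[td.text for td in row.find_elements(By.TAG_NAME, "td")]
#                  for row in driver.find_elements(By.TAG_NAME, "tr")]

# Hypothetical rows standing in for scraped text
sample_rows = [
    [],  # a row without <td> cells
    ["05-08-2022", "Crystal Palace", "0 - 2"],
    ["13-08-2022", "Arsenal", "4 - 2"],
]
date, team, score = rows_to_columns(sample_rows)
print(date)   # ['05-08-2022', '13-08-2022']
print(team)   # ['Crystal Palace', 'Arsenal']
print(score)  # ['0 - 2', '4 - 2']
```

Because each row is either added to all three lists or skipped entirely, the lists always end up with the same length, which is what a DataFrame built from them requires.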

## Exporting Data to a CSV file with Pandas

If you have properly scraped your data, you should be able to load it into a Pandas DataFrame and then write it to a .csv file. For this example, we scraped three lists (`date`, `team`, and `score`), which we will load into a Pandas DataFrame using the code below. There is also a commented-out line that writes the DataFrame to a .csv file.

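```python
import pandas as pd

# build a DataFrame from the three lists scraped in the previous section
df = pd.DataFrame({"Date": date, "Team": team, "Record": score})
# df.to_csv("selenium_demo.csv", index=False) # uncomment to write the csv file
df.head()
```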

|   | Date       | Team           | Record |
|:--|:-----------|:---------------|:-------|
| 0 | 05-08-2022 | Crystal Palace | 0 - 2  |
| 1 | 13-08-2022 | Arsenal        | 4 - 2  |
| 2 | 20-08-2022 | Bournemouth    | 0 - 3  |
| 3 | 27-08-2022 | Arsenal        | 2 - 1  |
| 4 | 31-08-2022 | Arsenal        | 2 - 1  |


## Selecting Elements Within a Simple Dropdown Menu



For the final action of this tutorial, we will select a specific element from a simple dropdown menu. Let's say under the Country dropdown menu we want to select the option for "Spain" so we can see the game statistics for teams from Spain. 

First, we will import `Select` from the `selenium.webdriver.support.ui` module and create a Select object that references the dropdown menu by its ID, "country". Then, using the Select object, which we name `dropdown`, we call the `.select_by_visible_text()` method to select the option "Spain".

Note how we also use the line `time.sleep(3)` in the code below, which pauses the script for three seconds before the next action. Selenium commands can fail if they run before the page has finished updating after the previous action, so pausing between actions with the `sleep()` function from Python's `time` module helps prevent errors.

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
import time

website = "https://www.adamchoi.co.uk/teamgoals/detailed" # insert link to website you are scraping
path = "/Users/tristan/chromedriver" # insert path to ChromeDriver
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(executable_path=path)) # create webdriver (opens ChromeDriver window)
driver.get(website) # loads the website in that window

# locates the "All matches" button by its XPath
all_matches_button = driver.find_element(By.XPATH, '//label[@analytics-event="All matches"]')

all_matches_button.click() # clicks the button

time.sleep(3) # pauses the script for 3 seconds to prevent errors

dropdown = Select(driver.find_element(By.ID, 'country'))
# creates Select object for dropdown menu
dropdown.select_by_visible_text('Spain')
# selects value from dropdown menu

driver.quit() # closes ChromeDriver window
```

If everything works, in the ChromeDriver window, the "All matches" button will be clicked as before, then Selenium will wait about 3 seconds and finally select the option for "Spain" under the dropdown menu for country.
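A fixed pause is simple, but it either waits longer than needed or not long enough. An alternative is to poll until a condition is met. Selenium ships its own tools for this (`WebDriverWait` together with the `expected_conditions` module), so the sketch below is only a hand-written stand-in that shows the idea and runs without a browser. The helper names `wait_until` and `select_when_available` are hypothetical, and `dropdown` is assumed to behave like Selenium's `Select` object, exposing `options` and `select_by_visible_text()`.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


def select_when_available(dropdown, text, timeout=10.0, poll_interval=0.5):
    """Wait until the dropdown lists `text` as an option, then select it."""
    wait_until(
        lambda: text in [option.text for option in dropdown.options],
        timeout=timeout,
        poll_interval=poll_interval,
    )
    dropdown.select_by_visible_text(text)


# Usage with the objects from the example above (not run here):
# dropdown = Select(driver.find_element(By.ID, "country"))
# select_when_available(dropdown, "Spain")
```

With this approach the script continues as soon as the option appears and raises a `TimeoutError` if it never does, rather than failing at the selection step.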

## Conclusion

This tutorial is by no means a comprehensive guide to Selenium. There are other actions you can perform that aren't covered here, such as typing entries into search bars, filling out forms, and scraping images. Still, it should give a general overview of how to set Selenium up and how to use some of its basic functionality: clicking buttons, selecting options from dropdown menus, and pulling data from tables. We use a dynamic website in this tutorial to highlight the capabilities of Selenium, which are best applied to an interactive website.

One important note to remember about Selenium is that it is primarily used for website testing and automation rather than web scraping. Because of this, it may not be an ideal choice for every web scraping scenario, mainly due to its slower speed compared to other web scraping packages. Still, with a little background knowledge of HTML code, Selenium presents a good and precise option to scrape a variety of websites.

## References


- https://www.youtube.com/watch?v=UOsRrxMKJYk&ab_channel=ThePyCoach
- https://selenium-python.readthedocs.io/
- https://www.scrapingbee.com/blog/selenium-python/
- https://towardsdatascience.com/how-to-use-selenium-to-web-scrape-with-example-80f9b23a843a