## DA 320 

| Key         | Value |
| ----------- | ----------- |
| Topic  | Selenium Scraping with Jupyter  |
| Author   | Ted Spence        |
| Date   | 2024-09-28        |

This example notebook contains a brief tutorial on scraping web pages with Selenium within Jupyter.

When using Selenium, remember that you are driving a full web browser. Some web pages run a lot of JavaScript, so loading a page and inspecting its elements can be slow.

Some web pages defer rendering of elements until the user has scrolled to a specific portion of the page. You may need to scroll the viewport to an element before it is fully rendered.
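As a minimal sketch of scrolling to a single element, the helper below asks the browser to bring one element into view. The function name `scroll_into_view` is hypothetical and not part of Selenium; it assumes `driver` is a Selenium WebDriver (constructed as in the cells later in this notebook) and `element` is something returned by `find_element`. It relies only on `execute_script` and the standard DOM method `scrollIntoView`.

```python
def scroll_into_view(driver, element):
    """Ask the browser to scroll until `element` is centered in the viewport.

    `driver` is assumed to be a Selenium WebDriver and `element` a WebElement.
    This helper is illustrative; it is not part of the Selenium API.
    """
    # Selenium passes `element` into the script as arguments[0].
    # scrollIntoView is a standard DOM method on every element.
    driver.execute_script(
        "arguments[0].scrollIntoView({block: 'center'});", element
    )
```

With an open driver, this could be called as `scroll_into_view(driver, driver.find_element(By.ID, "some-id"))`, where `"some-id"` is a placeholder for an id that exists on the page you are scraping.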

***
# Constructing a Selenium instance and fetching a page
***

In this example, we will install the necessary dependencies, then retrieve the source HTML of a web page using Selenium.

In [1]:
%pip install selenium

Note: you may need to restart the kernel to use updated packages.

In [3]:
from selenium import webdriver
from selenium.webdriver.common.by import By

# The page we will fetch
link = "https://en.wikipedia.org/wiki/List_of_English_monarchs"

# Construct a selenium engine to retrieve this page
driver = webdriver.Chrome()
driver.get(link)

# Retrieve the HTML source of the page, e.g. to search it with regular expressions
datastring = driver.page_source
driver.quit()
print(f"Fetched {len(datastring)} characters from {link}.")

Fetched 737275 characters from https://en.wikipedia.org/wiki/List_of_English_monarchs.
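The cell above fetches the page source with regular expressions in mind but does not show one being applied. The following is a small sketch of what that might look like, using only Python's built-in `re` module. The function names and the `sample` HTML string are hypothetical stand-ins; in the notebook you would pass `datastring` instead. Regular expressions are fragile against real-world HTML, so treat this as a quick exploration tool rather than a robust parser.

```python
import re


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def find_year_ranges(html):
    """Return (start, end) pairs for year ranges written with an en dash, like 886–1013."""
    return [(int(a), int(b)) for a, b in re.findall(r"\b(\d{3,4})–(\d{3,4})\b", html)]


# Hypothetical sample input; replace with `datastring` from the cell above.
sample = (
    "<html><head><title>Example page</title></head>"
    "<body><h2>House of Wessex (886–1013)</h2></body></html>"
)
print(extract_title(sample))      # Example page
print(find_year_ranges(sample))   # [(886, 1013)]
```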


***
# Finding elements within a webpage using Selenium
***

In this example, we will fetch a webpage and list all the headings on the page using find_elements.

Since we often need to experiment with our find_elements calls, I will break up the code blocks into three segments: 
* Fetching the web page
* Searching for the elements
* Closing the web page

Because of this, I can run the first code block once, then keep running the second code block over and over until I find the correct search pattern.

In [4]:
from selenium import webdriver
from selenium.webdriver.common.by import By

# The page we will fetch
link = "https://en.wikipedia.org/wiki/List_of_English_monarchs"

# Construct a selenium engine to retrieve this page
driver = webdriver.Chrome()
driver.get(link)

# Note that this first code block leaves the driver open, so I can re-run the second code block over and over!

In [7]:
# Because this code block is separate from the first code block, I can run this as many times as necessary
# while I tinker with the find_elements call to get the things I want
headings = driver.find_elements(By.CLASS_NAME, "mw-heading")
for heading in headings:
    print(heading.text)

House of Wessex (886–1013)[edit]
Disputed claimant[edit]
House of Denmark (1013–1014)[edit]
House of Wessex (restored, first time) (1014–1016)[edit]
House of Denmark (restored) (1016–1042)[edit]
House of Wessex (restored, second time) (1042–1066)[edit]
House of Godwin (1066)[edit]
Disputed claimant (House of Wessex)[edit]
House of Normandy (1066–1135)[edit]
House of Blois (1135–1154)[edit]
Disputed claimants[edit]
House of Plantagenet (1154–1485)[edit]
Angevin kings of England[edit]
Disputed claimant (House of Capet)[edit]
Main line of Plantagenets[edit]
House of Lancaster[edit]
House of York[edit]
House of Lancaster (restored)[edit]
House of York (restored)[edit]
House of Tudor (1485–1603)[edit]
Disputed claimant[edit]
House of Stuart (1603–1649)[edit]
First Interregnum (1649–1660)[edit]
House of Stuart (restored) (1660–1707)[edit]
Second Interregnum 1688–1689[edit]
Houses of Stuart and Orange[edit]
Acts of Union[edit]
Timeline[edit]
Titles[edit]
See also[edit]
Explanatory notes[edit]
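Each heading in the output above ends with "[edit]", because the text of the matched element includes Wikipedia's edit link. One way to tidy this up, sketched below, is to strip that suffix in Python after collecting the text. The helper name `clean_heading` and the `raw_headings` list are hypothetical; in the notebook you would apply the helper to `heading.text` inside the loop.

```python
def clean_heading(text):
    """Strip a trailing "[edit]" label and any surrounding whitespace."""
    text = text.strip()
    if text.endswith("[edit]"):
        text = text[: -len("[edit]")]
    return text.strip()


# Hypothetical sample taken from the kind of output shown above.
raw_headings = ["House of Wessex (886–1013)[edit]", "Timeline[edit]", "See also"]
print([clean_heading(h) for h in raw_headings])
# ['House of Wessex (886–1013)', 'Timeline', 'See also']
```

Cleaning the string afterwards keeps the find_elements call unchanged, which is convenient while you are still experimenting with the search pattern.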

In [None]:
# Remember to close the driver! Otherwise, you'll leave a bunch of windows open on your computer.
driver.quit()


***
# Scrolling the viewport in a webpage using Selenium
***

Some webpages do not render certain elements until the user scrolls them into view.  In this example, we will fetch a webpage and scroll through the page gradually to cause everything to load.

Since we often need to experiment with our scrolling logic, I will break up the code blocks into three segments: 
* Fetching the web page
* Scrolling to the bottom of the page
* Closing the web page

Because of this, I can run the first code block once, then keep running the second code block over and over until I find the desired scrolling logic.

In [8]:
from selenium import webdriver
from selenium.webdriver.common.by import By

# The page we will fetch
link = "https://en.wikipedia.org/wiki/List_of_English_monarchs"

# Construct a selenium engine to retrieve this page
driver = webdriver.Chrome()
driver.get(link)

# Note that this first code block leaves the driver open, so I can re-run the second code block over and over!

In [17]:
import time

# Scroll immediately to the bottom of the page
# (scrollTo uses an absolute position, so this works from anywhere on the page)
driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")

# Scroll immediately to the top of the page
driver.execute_script("window.scrollTo(0,0)")

# Scroll gradually from the top to the bottom
pageHeight = driver.execute_script("return document.body.scrollHeight;")
print(f"Page height is {pageHeight}")
i = 0
step = (pageHeight / 10)
while i < pageHeight:
    i += step
    driver.execute_script(f"window.scrollBy(0, {step})")
    # Let's see what our exact position is! 
    # Note we probably won't move by an exact number of pixels per step.
    position = driver.execute_script("return window.pageYOffset;")
    print(f"At position {position}")
    time.sleep(1)

Page height is 27614
At position 2760.800048828125
At position 5521.60009765625
At position 8282.400390625
At position 11043.2001953125
At position 13804
At position 16564.80078125
At position 19325.599609375
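The positions printed above are fractional because the step is a floating-point tenth of the page height and each scroll is relative to the previous one. A possible alternative, sketched below, is to compute whole-pixel absolute offsets up front and jump to each with `window.scrollTo`. The helper names `scroll_positions` and `scroll_gradually` are hypothetical, and `driver` is assumed to be an open Selenium WebDriver like the one created in the first cell of this section.

```python
import time


def scroll_positions(page_height, steps=10):
    """Return `steps` whole-pixel offsets, the last one equal to page_height."""
    return [page_height * k // steps for k in range(1, steps + 1)]


def scroll_gradually(driver, steps=10, pause=1.0):
    """Scroll from top to bottom in `steps` jumps, pausing after each jump.

    `driver` is assumed to be a Selenium WebDriver; only execute_script is used.
    """
    page_height = driver.execute_script("return document.body.scrollHeight;")
    for y in scroll_positions(page_height, steps):
        # scrollTo takes an absolute position, so rounding does not accumulate.
        driver.execute_script(f"window.scrollTo(0, {y});")
        time.sleep(pause)
    return page_height


print(scroll_positions(100, 4))  # [25, 50, 75, 100]
```

The browser may still stop short of the last offset, since the viewport cannot scroll past the point where its bottom edge meets the end of the page. Pages that load more content as you scroll can also grow while you move, so the height measured at the start may be out of date by the end.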


In [None]:
# Remember to close the driver! Otherwise, you'll leave a bunch of windows open on your computer.
driver.quit()
