# 🗓️ W02 - NB02 - Lab: Selenium (Wikipedia)

**DS205 W02 NB02 – Advanced Data Manipulation (Winter Term 2025/2026)**

<div style="font-family: system-ui; padding: 20px 30px 20px 20px; background-color: #FFFFFF; border-left: 8px solid #ED9255; border-radius: 8px; box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);max-width:650px;color:#212121;">

**Lab Practice Notebook**

- 📅 Date: 27 January 2026
- 👤 Instructor: Dr Jon Cardoso-Silva
- 🎯 Purpose: Repeat the Scrapy selector tasks using Selenium and a live Chromium browser

<span style="display:block;line-height:1.15em;color:#666666;font-size:0.9em;">

🥅 **Learning Goals**

i) Start a Selenium browser and confirm it is working,
ii) Extract the same Wikipedia content you worked with in Scrapy,
iii) Understand how CSS selector thinking carries across tools,
iv) Practise clicking and basic navigation.

</span>

</div>




In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By

WIKI_URL = "https://en.wikipedia.org/wiki/Lists_of_foods"

<div style="border: 1px solid red; padding: 10px; border-radius: 5px;font-size:0.7em">

## Quick comparison: Scrapy vs Selenium

### Get the `<h1>` text

Scrapy:

```python
response.css("h1::text").get()
```

Selenium:

```python
driver.find_element(By.CSS_SELECTOR, "h1").text
```

### Get all links (`href`)

Scrapy:

```python
response.css("a::attr(href)").getall()
```

Selenium:

```python
[a.get_attribute("href") for a in driver.find_elements(By.CSS_SELECTOR, "a")]
```

### Restrict to a container

Scrapy:

```python
response.css("div.div-col a::attr(href)").getall()
```

Selenium:

```python
[a.get_attribute("href") for a in driver.find_elements(By.CSS_SELECTOR, "div.div-col a")]
```

### One match vs all matches

Scrapy:

```python
response.css("...").get()
response.css("...").getall()
```

Selenium:

```python
driver.find_element(By.CSS_SELECTOR, "...")
driver.find_elements(By.CSS_SELECTOR, "...")
```

**Take note:** In Scrapy, `.get()` turns a selector into a string. You cannot keep selecting inside a string.

</div>


## Section 1: Start Chromium (and confirm it works)

Run this cell. You should see a Chromium window open.

If it does not open, raise your hand. Your class teacher will help you fix it.

In [None]:
driver = webdriver.Chrome()
driver.get("https://www.google.com")

# Zoom out to 20% (shows 5x more content)
driver.execute_script("document.body.style.zoom='0.2'")
print(f"✅ Selenium is working. Page title: {driver.title}")

## Section 2: Go to the Wikipedia page

We will use the same page as in the Scrapy section:

- [Lists of foods](https://en.wikipedia.org/wiki/Lists_of_foods)

In [None]:

# Go to a different page
driver.get(WIKI_URL)
driver.execute_script("document.body.style.zoom='0.2'")
print(f"✅ Loaded Wikipedia page. Page title: {driver.title}")

## Section 3: Get the `<h1>` title (solution included)

In Scrapy you used selectors like:

- `response.css("h1 span::text").get()`

In Selenium, you select the element first, then read its `.text` property.

In [None]:
h1_text = driver.find_element(By.CSS_SELECTOR, "h1").text
h1_text

## Section 4: Action Point 1 (Selenium version): find the best container

In the Scrapy section you searched for the best container `<div>`.

**Solution**

The selector we needed was:

- `div.div-col`

Let's confirm it exists and preview its HTML.

In [None]:
container = driver.find_element(By.CSS_SELECTOR, "div.div-col")

# View the HTML
print(container.get_attribute('innerHTML')[:1000])

## Section 5: Extract all Wikipedia links inside the container (solution included)

We want the `<a>` elements inside `div.div-col`.
Then we want their `href` values.

**Take note**

Unlike Scrapy, Selenium has no `::attr()` pseudo-element in its CSS selectors. Once you have the `<a>` elements, call `get_attribute("href")` on each one to read its `href` (the URL the link points to).

In [None]:
links = driver.find_elements(By.CSS_SELECTOR, "div.div-col a")
hrefs = [a.get_attribute("href") for a in links]
hrefs = [h for h in hrefs if h]

print(f"✅ Links found inside div.div-col: {len(hrefs)}")
print("Sample:")
for h in hrefs[:10]:
    print("-", h)
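
One difference worth checking when you compare results across the two tools: Selenium reads `href` from the live DOM, so the values usually come back as absolute URLs, while Scrapy's `::attr(href)` returns the attribute as written in the HTML (often a relative path such as `/wiki/...`). Here is a minimal sketch, using only the standard library, that normalises both styles and keeps Wikipedia article links only. The helper name `wikipedia_article_links`, the constant `BASE_URL`, and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://en.wikipedia.org/wiki/Lists_of_foods"


def wikipedia_article_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url and keep en.wikipedia.org /wiki/ pages."""
    kept = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # absolute URLs pass through unchanged
        parts = urlparse(absolute)
        if parts.netloc == "en.wikipedia.org" and parts.path.startswith("/wiki/"):
            kept.append(absolute)
    return kept


# Illustrative values only, not scraped output
sample_hrefs = [
    "/wiki/List_of_breads",                         # relative, Scrapy style
    "https://en.wikipedia.org/wiki/List_of_soups",  # absolute, Selenium style
    "https://example.org/not-wikipedia",            # external link, dropped
]
print(wikipedia_article_links(sample_hrefs))
```

With Scrapy output you would pass the list from `.getall()`; with Selenium you would pass `hrefs` from the cell above.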

## Section 6: Action Points 2 and 3 (Selenium version): `>` versus descendant selectors

In Scrapy you compared:

- `div > ul > li`
- `div ul li`

In Selenium, we can count how many elements each selector matches.

**Solutions**

- `div.div-col > ul > li` matches only the top-level list items
- `div.div-col ul li` matches top-level and nested list items

In [None]:
top_level = driver.find_elements(By.CSS_SELECTOR, "div.div-col > ul > li")
all_descendants = driver.find_elements(By.CSS_SELECTOR, "div.div-col ul li")

print(f"✅ Top-level items (direct children): {len(top_level)}")
print(f"✅ All list items (descendants): {len(all_descendants)}")
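
To see why the two counts differ without depending on the live page, here is a small offline sketch. It swaps in the standard library's `xml.etree.ElementTree` and its XPath-like paths in place of CSS selectors: `./ul/li` plays the role of `> ul > li`, and `.//li` stands in for the descendant selector (every `<li>` in this snippet sits inside a `<ul>`). The HTML snippet and the `demo_` names are invented for illustration; this is not Wikipedia's markup.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a div.div-col container
demo_html = """
<div class="div-col">
  <ul>
    <li>Breads
      <ul>
        <li>Flatbreads</li>
        <li>Sourdough</li>
      </ul>
    </li>
    <li>Cheeses</li>
  </ul>
</div>
""".strip()

demo_root = ET.fromstring(demo_html)

# Direct children only: div -> ul -> li
demo_top_level = demo_root.findall("./ul/li")

# Any depth: every li below the div, nested ones included
demo_all_items = demo_root.findall(".//li")

print(f"Top-level items: {len(demo_top_level)}")  # 2
print(f"All list items:  {len(demo_all_items)}")  # 4
```

The nested "Flatbreads" and "Sourdough" items are only picked up by the any-depth path, which mirrors the gap between the two Selenium counts.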

## Section 7: Clicking and navigation (solution included)

We will click the first link inside the container and confirm that the page changes.

If your page is zoomed in, zoom out in Chromium using:

- `Ctrl` + `-` (Windows/Linux)
- `Cmd` + `-` (macOS)

You can also zoom from the browser menu.

In [None]:
first_link = driver.find_elements(By.CSS_SELECTOR, "div.div-col a")[0]
first_url = first_link.get_attribute("href")

print(f"Clicking: {first_url}")
first_link.click()

# Zoom out again: the new page loads at the default zoom level
driver.execute_script("document.body.style.zoom='0.2'")
print(f"✅ New page title: {driver.title}")

## Section 8: Clean up

Always close the browser when you are done.

In [None]:
driver.quit()
print("✅ Closed Chromium")
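
In a standalone script, as opposed to a notebook where the browser must stay open across cells, it can help to guarantee the clean-up even when an exception is raised part-way through. Here is a minimal sketch using `contextlib`. `managed_browser` is a hypothetical helper, and `factory` is assumed to be any zero-argument callable that returns an object with a `quit()` method, such as `webdriver.Chrome`.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a browser via factory() and always call quit() afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and when the with-block raises
        driver.quit()


# Usage sketch (assumes Selenium and Chromium are installed):
# from selenium import webdriver
# with managed_browser(webdriver.Chrome) as driver:
#     driver.get("https://en.wikipedia.org/wiki/Lists_of_foods")
#     print(driver.title)
```

The `try`/`finally` is the important part: a scraping error no longer leaves an orphaned Chromium window behind.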