# 🗓️ W02 - NB01 | Lecture Demo: Wikipedia Scraping with Scrapy

**DS205 W02 NB01 – Advanced Data Manipulation (Winter Term 2025/2026)**

<div style="font-family: system-ui; padding: 20px 30px 20px 20px; background-color: #FFFFFF; border-left: 8px solid #ED9255; border-radius: 8px; box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);max-width:600px;color:#212121;">

**Lecture Demonstration Notebook**
- 📅 Date: 26 January 2026
- 👤 Instructor: Dr Jon Cardoso-Silva
- 🎯 Purpose: Demonstrate how to inspect HTML and build selectors before writing a spider

🥅 **Learning Goals**

<ul style="margin: 0.2em 0 0.4em 0; padding-left: 1.25em; font-size:1em; list-style-type:none;font-size:0.85em;color:#666666">

  <li style="margin-bottom:0.15em; padding-left:0.4em; text-indent:-0.4em;">
    <span style="display:inline-block;font-weight:450;width:0.75em">i)</span> Send a simple HTTP request and inspect the response,
  </li>
  <li style="margin-bottom:0.15em; padding-left:0.4em; text-indent:-0.4em;">
    <span style="display:inline-block;font-weight:450;width:0.75em">ii)</span> Build selectors step by step until they are precise,
  </li>
  <li style="margin-bottom:0.15em; padding-left:0.4em; text-indent:-0.4em;">
    <span style="display:inline-block;font-weight:450;width:0.75em">iii)</span> Extract a table with `read_html` and a manual fallback,
  </li>
  <li style="padding-left:0.4em; text-indent:-0.4em;">
    <span style="display:inline-block;font-weight:450;width:0.75em">iv)</span> Move notebook logic into a Scrapy spider.
  </li>
</ul>

</div>

## Reference Links

You will use these frequently:

- [Scrapy selectors overview](https://docs.scrapy.org/en/latest/topics/selectors.html#topics-selectors)
- [Scrapy CSS selector extensions (`::text`, `::attr()`)](https://docs.scrapy.org/en/latest/topics/selectors.html#extensions-to-css-selectors)
- [Scrapy quick overview (runspider flow)](https://docs.scrapy.org/en/latest/intro/overview.html)

## Environment Setup (Conda)

For this lecture, use a dedicated conda environment called `food`. The `environment.yml` file lives next to this notebook.

```bash
conda env create -f environment.yml
conda activate food
```


⚙️ **Importing libraries**

Here are the libraries we are using today:

In [None]:
import requests
import warnings

import pandas as pd

from io import StringIO

from scrapy.selector import Selector

from IPython.display import HTML
from IPython.display import display

warnings.filterwarnings("ignore", message="Consider using IPython.display.IFrame instead", category=UserWarning, module="IPython.core.display")

## Section 1: A Simple Request and Its Response

We will send one request and look at the response. This keeps the focus on HTML inspection.

**What is a User-Agent?**

A User-Agent is a short text string that tells the server what kind of client is making the request.
Some websites block unknown or empty User-Agent headers, so we set a clear one.

See the [Wikimedia policy](https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Foundation_User-Agent_Policy) to understand why I identify myself as a bot in this particular way in the User-Agent header.

In [None]:
LIST_URL = "https://en.wikipedia.org/wiki/List_of_foods"

headers = {
    "User-Agent": (
        "DS205W02LectureBot/1.0 "
        "(https://lse-dsi.github.io/DS205/2025-2026/winter-term/; "
        "J.Cardoso-Silva@lse.ac.uk) "
        "requests/2.x"
    )
}
output = requests.get(LIST_URL, headers=headers, timeout=30)

print(f"Status code: {output.status_code}")
print(f"Content length: {len(output.text)}")
print(output.content[:500])

📋 **Take Note:**

Our response had a status code of [HTTP 200](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) which means `OK` (success) - it worked!

But notice that the content we got back (`output.content`) does NOT look like JSON. The response from a regular website is also plain text, but unlike with APIs (as we saw in last week's notebook), the data comes formatted as [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML) (**H**yper**T**ext **M**arkup **L**anguage) rather than as JSON.
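One quick way to see the difference, as a minimal sketch: the standard-library `json` parser accepts JSON text but rejects HTML. The two sample strings and the helper `is_json` are made up for illustration and are not taken from the Wikipedia response.

```python
import json

# Hypothetical payloads, invented for illustration
api_like_text = '{"name": "Orange juice", "type": "drink"}'
html_like_text = "<!DOCTYPE html><html><head><title>List of foods</title></head></html>"


def is_json(text: str) -> bool:
    """Return True if `text` parses as JSON, False otherwise (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_json(api_like_text))
print(is_json(html_like_text))
```

This is why we need an HTML parser rather than `json.loads` (or `output.json()`) for regular web pages.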


In [None]:
# If you want you can look at the HTML in full by saving it to a file
# You can use VS Code capabilities to navigate the content
with open("wikipedia_list_of_foods.html", "wb") as f:
    f.write(output.content)

### Meet Scrapy

There are many open-source libraries in the Python community that allow us to parse HTML, the most common being [beautifulsoup](https://beautiful-soup-4.readthedocs.io/en/latest/). While it is a great library, **WE WON'T USE BEAUTIFULSOUP IN THIS COURSE!** Instead, we will work with [`scrapy`](https://www.scrapy.org/) or, if needed, [`Selenium`](https://selenium-python.readthedocs.io/). These two libraries, in particular `scrapy`, are more appropriate when we want to write <span style="background-color:#0C56AA;padding:0.05em 0.2em;color:white;border-radius:0.25em;">**production-ready**</span> code.

What you need is to pass that HTML text to [scrapy's Selector](https://docs.scrapy.org/en/latest/topics/selectors.html), which we have already imported:

In [None]:
response = Selector(text=output.content)
response

But then, to find out which information you want to collect, you need to know where the relevant data is.

Here it is important to learn a bit of [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS) (the **C**ascading **S**tyle **S**heets language).

📚 **Recommended Reading:**

* [What is CSS?](https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Styling_basics/What_is_CSS)
* [CSS Getting Started](https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Styling_basics/Getting_started)
* [Basic Selector](https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Styling_basics/Basic_selectors)
* [Attribute Selector](https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Styling_basics/Attribute_selectors)
* [Pseudo-class and elements](https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Styling_basics/Pseudo_classes_and_elements) 

    (*Note: most of these pseudo-classes don't work with scrapy, sadly. Check out [Scrapy selectors overview](https://docs.scrapy.org/en/latest/topics/selectors.html#topics-selectors) and [Scrapy CSS selector extensions (`::text`, `::attr()`)](https://docs.scrapy.org/en/latest/topics/selectors.html#extensions-to-css-selectors) to learn more.*)

**To query the HTML content contained in the `response` object, just pass a CSS selector to it with the `.css()` method.**

For example, to gather `<h1>`'s from the page:

In [None]:
response.css('h1')

☝️ The above shows that we received a list containing a single element, itself another object of the class `Selector`. If we want the _content_ of that, we should use:

* `.css(..some_selector..).get()` (to get just the first element of the list) or
* `.css(..some_selector..).getall()` (if you had multiple matches and you want to extract all)

Compare:

In [None]:
response.css('h1').get()

with:

In [None]:
response.css('h1').getall()

💡 **The type of data returned by `.get()` is a string whereas the type returned by `.getall()` is a list of strings.**

**Note also:** If `getall()` returns an empty list, your selector didn't match anything. First thing to check: does that element actually exist with the class you expected?

### 'Looking' at the page from the notebook

If you just want to get a sense of what the HTML looks like (without its styling), you can render it inside an `<iframe>` using `HTML` and `display` from the IPython library:

In [None]:
html_snippet = response.css('h1').get()

display(HTML(f"<iframe width='800' height='100' srcdoc='{html_snippet}'></iframe>"))
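The f-string above places the snippet inside a single-quoted `srcdoc` attribute, so a snippet that itself contains a single quote would end the attribute early. One possible workaround, sketched here with the standard library's `html.escape`, is to escape the snippet first. The helper name `iframe_for` is hypothetical.

```python
from html import escape


def iframe_for(snippet: str, width: int = 800, height: int = 100) -> str:
    """Build an <iframe> whose srcdoc safely carries `snippet` (hypothetical helper)."""
    # quote=True also escapes " and ', so the snippet cannot close the attribute early
    safe_snippet = escape(snippet, quote=True)
    return f"<iframe width='{width}' height='{height}' srcdoc='{safe_snippet}'></iframe>"


example = iframe_for("<h1 title='Foods'>List of foods</h1>")
print(example)
```

In the notebook you would then call `display(HTML(iframe_for(html_snippet)))`. The browser decodes the escaped attribute value, so the snippet still renders as HTML inside the frame.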

### Collecting ALL matches (for example, all images)

In [None]:
response.css('img').getall()

### Collecting attributes of HTML elements

While the output above is useful, you might not want to collect the entire HTML element but rather just one piece of it (for example, the link where the image is rather than the whole `<img>` element).

If we want the links to where the images are stored, we first need to understand that they live in the `src` attribute of each `<img>` element. To collect the content of `src`, we use the `::attr()` pseudo-element, as explained in [Scrapy CSS selector extensions (`::text`, `::attr()`)](https://docs.scrapy.org/en/latest/topics/selectors.html#extensions-to-css-selectors):

In [None]:
response.css("img::attr(src)").getall()

🤔 **Think about it:**

Why do you think most of these URLs start with a `/` rather than the usual `http://`? If you try to copy these addresses to your browser, the browser will be confused and not understand that address. Why is that?

🎯 **ACTION POINT**

**Without looking at the solution,** try to figure out how to complete the code below such that you display the first image that is captured in the page:

In [None]:
html_snippet = "" # Replace the empty quotes with your string

display(HTML(f"<iframe width='800' height='100' srcdoc='{html_snippet}'></iframe>"))

<details><summary>Click here to view the solution</summary>

```python
html_snippet = f'<img src="https://en.wikipedia.org{response.css("img::attr(src)").get()}" alt="" aria-hidden="true" height="50" width="50">'
```

</details>

## Section 2: The Same Idea in the Terminal

You can test selectors in the terminal with the Scrapy shell. We are not doing it now, but I want you to know what it looks like.

```bash
scrapy shell "https://en.wikipedia.org/wiki/List_of_foods"
```

**Take note**

The shell gives you a `response` object, which behaves like the `Selector` we used in the notebook.

Try these inside the shell:

```python
response.css("title::text").get()
response.css("img::attr(src)").getall()[:5]
```

**Why this matters**

It is a fast way to test a selector without writing a full spider.

## Section 3: Narrowing Down Selectors Step by Step

We will start broad, then narrow. Each step should remove noise.

### Step 1: All links

In [None]:
step_1 = response.css("a::attr(href)").getall()
print(f"Step 1: All links: {len(step_1)}")

**Take note**

This is too broad: it includes links to edit pages, menus, and other site furniture.

### Step 2: Only article-looking links

In [None]:
step_2 = [link for link in step_1 if link.startswith("/wiki/")]
print(f"Step 2: Only /wiki/ links: {len(step_2)}")

**Take note**

This is better, but still includes special pages and internal anchors.

### Step 3: Remove special pages and anchors

In [None]:
step_3 = [link for link in step_2 if ":" not in link and "#" not in link]
print(f"Step 3: Remove non-article links: {len(step_3)}")

**Take note**

We now have a cleaner list, but we can still be more precise.
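Steps 2 and 3 can be folded into a single predicate, which makes the rules easy to test on their own. The sketch below runs on a handful of made-up hrefs rather than on the live page. The function name `is_article_link` is hypothetical.

```python
# Made-up hrefs of the kinds that show up on a Wikipedia page
sample_links = [
    "/wiki/Apple",
    "/wiki/Bread",
    "/wiki/Special:Random",                          # special page (contains ":")
    "/wiki/Cheese#History",                          # anchor inside an article (contains "#")
    "#cite_note-1",                                  # in-page anchor
    "/w/index.php?title=List_of_foods&action=edit",  # edit link
    "https://example.org/page",                      # external site
]


def is_article_link(href: str) -> bool:
    """Apply the Step 2 and Step 3 rules to a single href (hypothetical helper)."""
    return href.startswith("/wiki/") and ":" not in href and "#" not in href


article_links = [href for href in sample_links if is_article_link(href)]
print(article_links)
```

Note that the `":"` rule is a heuristic: it would also drop any genuine article whose title contains a colon.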

### Step 4: Limit to the list content

In [None]:
step_4 = response.css("div.mw-parser-output ul li a::attr(href)").getall()
step_4 = [link for link in step_4 if link.startswith("/wiki/") and ":" not in link and "#" not in link]
print(f"Step 4: Links inside lists: {len(step_4)}")
print("Sample links:", step_4[:5])

**Take note**

This is the first selector that matches our actual intention.

### CSS vs XPath

CSS selectors are shorter and easier to read for most HTML tasks.

XPath is more powerful for navigation and complex conditions.

Example CSS:

```css
div.mw-parser-output ul li a::attr(href)
```

Example XPath:

```xpath
//div[contains(@class, "mw-parser-output")]//ul//li//a/@href
```

In this course, start with CSS. Use XPath only when CSS is awkward.

## Section 4: Known Target Page and `read_html`

Assume we already know the page we want.

**Why `read_html`?**

It scans the HTML and tries to detect tables automatically. This saves time when the table is well structured.

In [None]:
FOOD_URL = "https://en.wikipedia.org/wiki/Orange_juice"

food_response = requests.get(FOOD_URL, headers=headers, timeout=30)
food_html = food_response.text

tables = pd.read_html(StringIO(food_html))
print(f"Tables found: {len(tables)}")

table = tables[0] if tables else pd.DataFrame()
print(table.head(10))

**Take note**

`read_html` returns a list of tables. You choose the one you want.

### Alternative: Manual Extraction Without `read_html`

If `read_html` fails or you want full control, use selectors and navigate the `<table>` manually with its `<tr>` and `<th>` and `<td>` elements.


## Section 5: Move the Logic Into a Spider

The notebook is for exploration. The spider is for repeatable collection.

**Why move to a `.py` file?**

A spider can be run from the terminal and reused later without rerunning every notebook cell.

<details class="special">
<summary>Click to view the spider code</summary>

```python
import scrapy

class WikipediaFoodSpider(scrapy.Spider):
    name = "wikipedia_food"
    start_urls = ["https://en.wikipedia.org/wiki/List_of_foods"]

    def parse(self, response):
        title = response.css("h1 span::text").get()
        image_urls = response.css("img::attr(src)").getall()
        # urljoin resolves each src against the page URL; it also handles
        # protocol-relative URLs such as //upload.wikimedia.org/...
        image_urls = [response.urljoin(url) for url in image_urls]

        yield {
            "title": title,
            "image_urls": image_urls[:5]
        }
```

</details>
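Image `src` values come in more than one shape, and simply prefixing the site domain mangles the ones that start with `//`. The sketch below uses the standard library's `urljoin` to show how each shape resolves against the page URL. The `src` values are made up for illustration. Scrapy responses offer `response.urljoin(url)`, which applies this kind of resolution against the response's own URL.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/List_of_foods"

# Made-up src values covering the three shapes an <img> src can take
examples = [
    "/static/images/icons/example.png",               # root-relative
    "//upload.example.org/images/example.jpg",        # protocol-relative
    "https://upload.example.org/images/example.jpg",  # already absolute
]

absolute_urls = [urljoin(PAGE_URL, src) for src in examples]
for url in absolute_urls:
    print(url)
```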

**How to run the spider**

1. Save the code above to `w02_wikipedia_spider.py`
2. Run it from the terminal:

```bash
scrapy runspider w02_wikipedia_spider.py -O w02-food.jsonl
```

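The output file is JSON Lines: one JSON record per line. Because every line is a complete record, the file can be written and read one record at a time, which suits streaming data and large files. The capital `-O` overwrites the output file if it already exists.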
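Reading such a file back takes only a few lines. The sketch below parses JSON Lines text held in memory so that it runs without the spider output. The two records are invented for illustration and only mimic the fields the spider yields.

```python
import json
from io import StringIO

# Stand-in for the contents of a .jsonl file; the records are invented for illustration
jsonl_text = (
    '{"title": "List of foods", "image_urls": ["https://example.org/a.png"]}\n'
    '{"title": "Orange juice", "image_urls": []}\n'
)

records = []
# With a real file, replace StringIO(...) with open("w02-food.jsonl", encoding="utf-8")
with StringIO(jsonl_text) as handle:
    for line in handle:
        line = line.strip()
        if line:  # ignore blank lines
            records.append(json.loads(line))

print(len(records))
print(records[0]["title"])
```

If you prefer a DataFrame, `pd.read_json("w02-food.jsonl", lines=True)` reads the same format directly.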

**Take Note:** learn more about the parameters you can pass to `runspider` by running: `scrapy runspider --help`