<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Elizabeth Wickes](https://ischool.illinois.edu/people/elizabeth-wickes) for the 2023 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email wickes1@illinois.edu.<br />
____

# Web Scraping Toolkit 1

This is lesson 1 of 3 in the educational series on `Web Scraping`. This notebook is intended to teach the core problem-solving perspectives and tools for web scraping.

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` 

**Difficulty:** Intermediate

**Completion time:** 90 minutes

**Knowledge Required:** 

* Python basics (variables, flow control, functions, lists)
* Basic file operations (open, close, read, write)

**Knowledge Recommended:**

* basic html/websites

**Learning Objectives:**
After this lesson, learners will be able to:

1. Make basic determinations about which tool to use for extracting data from a website.

**Research Pipeline:**

1. You have a research question and data in mind.
2. You've found some data you want to use.
3. **The data is on a website somewhere and you want to get it off the site and into a data file.**
4. You do your analysis or other data prep!


# Required Python Libraries

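* `requests` for downloading things
* `lxml` for parsing HTML and running XPath queries
* `bs4` (installed as `beautifulsoup4`) for cleaning up messy HTML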

## Install Required Libraries

In [None]:
### Install Libraries ###

# Using !pip installs
!pip install requests lxml beautifulsoup4

In [12]:
### Import Libraries ###

# third-party packages
import requests
from lxml import etree
from bs4 import BeautifulSoup

# standard library
import re
import time
import pathlib
import csv

# Introduction


* the purpose of this workshop is to lay a foundation of what web scraping is
* understand the kinds of situations you may find
* review the problem solving approaches
* understand the commonly used technologies and when to use them

This leads into the next workshop, where we focus on exploring XPath and regular expressions. Why are these separate workshops? Those tools can be pretty heavy to learn, and sometimes you really don't need them. Sometimes something like Google Sheets can take care of your needs. The whole purpose of this workshop is to explore the breadth of problem-solving options available for web scraping so you can pick the right one for what you are working with.

## What this is not going to be

To make it quick, the text you want needs to be actual text somewhere on the website. 

This means we aren't going to be talking about:

* extracting text or other content from PDFs
* OCRing data/etc from images
* deep dive into working with APIs
	* although we will talk a bit about APIs

## What I love about web scraping
Aside from getting data you may not be able to otherwise...

There are so many interesting situations that involve some deeply creative problem solving strategies. You have to be a little bit like a private investigator. Sleuthing through the website trying to figure out if there are structures in the pages that we can use to our advantage.

## The hierarchy of your time

Your time and your research time are important. The same goes for the people you might be teaching these things to. When you're first learning programming and other tools, your first thought is often, "hey, let me write a script for this!" Sure, getting more practice can be good, but you've got to respect your own time and needs.

# Where does this come into a project?
Generally you should already have some sort of research question or area in mind (and if you are teaching, ask your learners whether they have theirs). Web scraping comes in during the data gathering or data discovery phase.

This workshop presumes that you have data somewhere that you can put your eyeballs on and say, "I need that data." Once you've gotten to that point, you know the data exists, and the question becomes: how do I get it?

## The first question about web scraping is not about web scraping
Always check first, does this data exist somewhere in a more accessible format? I've done a lot of scraping for things in the past but now other people have put datasets from those things up.

* Pro: you can download a thing and it's already data!
* Con: it may be an older snapshot, may not have all the things you want, etc.

However, this may be enough for a small proof of concept. Don't discount something quick and easy just to explore the vibes!

## So where can this data be found?

Now, this isn't a workshop about how to find data. However, when you are teaching these workshops, this might be a good place to advertise your services etc. Here are some of my recommendations (this is going to be pretty US specific):

1. https://www.re3data.org/
	1. Repository of repositories, you may be surprised by what's already out there!
2. https://commons.datacite.org/
	1. Search metadata for all DOIs registered under datacite, which does contain a good amount of data! Has cool stuff, but the results can be a bit noisy
3. https://www.icpsr.umich.edu/web/pages/ICPSR/index.html
	1. Lots of social science data, amazing search engine
4. https://data.gov/ or local city/metro area data repository
	1. Government data etc
5. Others that are....less of my favorite
	1. https://www.google.com/publicdata/directory
	2. https://www.kaggle.com/datasets
6. Also search if there's an API, and sometimes you need to search for that specifically

Other important factors to stress: 

1. Check your local resources! 
	1. Subject librarian or other library resources. Be sure to ask directly if there might be something available. Not everything makes it to the website perfectly.
2. Lots and lots of googling
	1. Sometimes you can find stuff in odd places or there are little personal data collections that people may have online. 
	2. I like to give myself a good hour or so to dive deep to try and find something, but I usually have to set a timer so I eventually actually stop...

So eventually you get to a point where you can put eyeballs on the data in front of you: it's on a website, you can copy/paste it, etc. However, you can't find it as a nice pretty dataset to download. This is where web scraping fits in.





# Briefly, how web pages do the thing

(Very briefly, very high level)

Web pages use HTML as a markup language to dictate how the content should be displayed. Headers, bold, etc. It also allows for things like hyperlinks, pictures, and other content to be displayed. This markup is also just text in specific formats.

Web browsers "parse" or read this text and attempt to separate the content from the structure of the HTML given to them. The web browser then uses its tools to display it for you.

For example, `<b>This will be bold text.</b>` will display that text as **bold**. You won't (or shouldn't) see the `<b>` tags displayed, just the text rendered as bold. 

Many websites have tons of pages and content. These are likely not saved as separate individual files. The HTML to display those many pages usually comes from some form of template. The tool reads the data from a database and sends it to the template, and the template then generates the HTML with the contents to be displayed. This is a really simplified way of explaining it, and there are many, many tools that do these sorts of things.

The important thing to remember when you're dealing with a website is that the data is being stored somewhere with some structure. You won't always know how or with what, but you can usually get a good idea about it by looking at the HTML being generated to display it. The HTML formatting elements will often carry content-specific labels for what they are displaying. For example, if the page is about a person, the site may want the name displayed in a certain way. There are special ways to define HTML formatting where you can label specific groups of content, like a name. So when you look inside the HTML, you may see the name surrounded by some formatting tag that literally says "name". Not always, but generally, you can use these formatting labels to quite cleanly extract the data that you want.
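
To make the idea of "formatting labels" concrete, here is a minimal sketch using only Python's built-in `html.parser`. The HTML snippet, the `person`/`name`/`role` class names, and the `NameCollector` class are all made up for illustration; real pages will use their own labels, and later lessons use XPath for this kind of extraction instead.

```python
from html.parser import HTMLParser

# Hypothetical HTML, similar in spirit to what a template might generate.
SAMPLE_HTML = """
<div class="person">
  <span class="name">Ada Lovelace</span>
  <span class="role">Mathematician</span>
</div>
<div class="person">
  <span class="name">Grace Hopper</span>
  <span class="role">Computer scientist</span>
</div>
"""


class NameCollector(HTMLParser):
    """Collect the text inside any tag whose class attribute is exactly 'name'."""

    def __init__(self):
        super().__init__()
        self.in_name = False  # True while we are inside a class="name" tag
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (attribute, value) pairs
        if dict(attrs).get("class") == "name":
            self.in_name = True

    def handle_endtag(self, tag):
        # Any closing tag ends the labeled region (fine for this flat sample)
        self.in_name = False

    def handle_data(self, data):
        if self.in_name and data.strip():
            self.names.append(data.strip())


collector = NameCollector()
collector.feed(SAMPLE_HTML)
print(collector.names)
```

The point is not the parser itself but the pattern: once you spot a label like `class="name"` in the page source, you have a hook for pulling out just that piece of data.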

# The core tools
We'll be talking first about simpler techniques, as they will often be enough to get you going. Then we'll move into the larger tools, like using requests to download things, pathlib to manage the files, and then regex and xpath to extract information.

* `requests` a very popular and well supported library for handling http calls. Commonly used to read and download website content or files.
* `pathlib` a modern object oriented way of handling files in python, managing folders, etc. Very much an essential tool even if it lacks some of the big buzz words.
* `re` this is the Python regular expressions module. It is used to match text patterns within free text
* `lxml` this package is used to help parse xml and html files and what we will use to execute some xpath queries
* `bs4` this is the Beautiful Soup package, and it is used for cleaning up messy html. It can be used to extract content if you want, but xpath queries are more powerful
* `csv` and `json` these are two python packages we will use to export our data out

# Simpler forms of web scraping
In many cases, you may just want to copy an HTML table into something that is actionable data. This may be a CSV file, an Excel file, or maybe something that you read directly into Python. There are a variety of tools to help you take a single HTML table and get it into one of those things.

## Copy/paste it into a spreadsheet

### A simple HTML table

* http://www.neoperceptions.com/snakesandfrogs.com/scra/ident/names.htm
	* Open the page and look in the view source. 
	* Looking at the table we can see this HTML comment
```
<!-- The following table was generated by the Internet Assistant Wizard for Microsoft Excel. -->
<!-- ------------------------- -->
<!-- START OF CONVERTED OUTPUT -->
<!-- ------------------------- -->0
```
So this content is likely not being served up to us by another tool, but this is just plain HTML. Say that we want this back into a data file. 

We can select the text within the table, copy, and paste it into Excel or Sheets. However, note that the styling will also get pasted in. 

### More complex table

Let's look at another one. https://threadcolors.com Looking at the scale of this and the HTML all being horizontal, we can safely say that something is generating this html. You can also see the javascript stuff in there as well. 

Let's copy/paste the first part of this table in and see what happens. On my computer, Excel doesn't paste the colors, but Google Sheets does. I've also seen pictures and other things get pasted in, along with font styling and other formatting.

### Suppress styling with "Paste special"

Excel and Sheets have versions of "paste special". There are some really nice extra tools in here if you've never explored. I'll briefly explain where to find these things, but interfaces and versions always change. 

* Excel has a few places for finding this. I like to right click on the top left cell where it should go, then select "paste special".
	* You'll see some options there, including another Paste Special. Choosing that opens up a window where you can choose HTML, Unicode text, or text. 
	* Generally I choose "text" to get just the plain text. 

Microsoft and Google each have data import tools etc. you can also play with. 

## Google sheets importing tools

Type `=IMPORT` into a Google Sheets cell and you'll see a bunch of options. These include tools for reading in data files online etc. Let's look at `importhtml`.

https://support.google.com/docs/answer/3093339?hl=en

Functions like these can be really handy, but you need to work with them closely. Expect to spend some time playing with the arguments to ensure everything is working correctly.

This will take a URL along with other arguments and import the specified table of data into your sheet. 

The second argument is labeled query, but it is asking you to specify whether the function should import a list of data or a table of data. We want to specify table.

The index argument is asking you to specify which table on the website it should import. Some pages may have dozens of tables, so you count from the top down and provide the number (starting at 1) for which table it should be. Always check that the right one has come in, because the order in which the tables appear on the page may not exactly match the order in which they are specified in the HTML, and when there are that many tables on the page you may accidentally hit the wrong one.

Here is our cell argument: `=IMPORTHTML("http://www.neoperceptions.com/snakesandfrogs.com/scra/ident/names.htm","table",1)`

Take a moment to look at how the data has been read into your sheet. The upper left cell, where you originally put the function, will still contain the function, but the other cells will only have the actual text content. You may also notice that some of the formatting, like bold and italics, did not come through correctly.

Tools like this can extend how powerful the idea of copying and pasting into a spreadsheet can be. Say you had a single table with basic formatting and wanted to import it into a spreadsheet regularly. Using a function like this would allow you to incorporate some amount of automation, where the import can be repeated without having to copy/paste the table each time. However, there are several limitations. You shouldn't use this without checking the results, in case the original website structure has changed. The time of access is also not retained.

The list option is also quite interesting. However, it is only importing the top most list that it sees. Meaning that navigating a deeper structure can be harder. 
`=IMPORTHTML("https://en.wikipedia.org/wiki/Lists_of_American_universities_and_colleges", "list", 6)`

Tip: you'll also see `importxml` in this list, allowing you to do easy xpath queries. This won't give you all the power of using xpath with Python, but it is still extremely useful. Xpath will be covered more later on.

`=IMPORTXML("https://en.wikipedia.org/wiki/Lists_of_American_universities_and_colleges", "//div[@class = 'mw-parser-output']/ul/li")` 

Looking at the results, we can still see some limitations.

`=IMPORTXML("https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Illinois", "//table[contains(@class, 'wikitable')]/tbody/tr/th/a")`

If you play around more you can get more, but generally this won't be quite enough.

## Summary

In this section, we saw a few tools to copy and paste or import content from web pages or tables. These tools tend to be really useful for smaller tasks or exploration, but can't (okay, shouldn't) be taken much further than that.

# Downloading things

Now we're getting into a situation where we have a list of things we need to download.

The exact situation will depend on what you're doing. I've done this where each page had 50 PDF links that I needed to download and there were 20 total pages. You may also have a page with 100 images and you want to programmatically download all of the images on that page. There are so many reasons that you might want to do this, but the nice thing is that this sort of task is a great initial web scraping task.

This sort of task also lends itself really nicely to combining both manual work and programmatic work. The situation may dictate which parts get which approach, but there is a lot of flexibility.

Your first step here is to identify the page that has all of the links on it, and then you need to get access to all of them. My general preference is to download and save this page to my computer, because that allows me to experiment with parsing and figure things out on my own time without having to reload the page or hit their server every time I run my script.

Second, you need a proof of concept to check that you can actually get the content out from those pages. Use just one page to experiment.

Tip: you can often be working on these experiments with getting content out while the other pages you need are downloading.

## Before we move on, some considerations

There are a few things to consider before we move on to programmatically downloading things from someone else's server. Not every website wants to be scraped. Some have restrictions, some have blocks, and there's a certain kind of etiquette that we want to follow.

First, we want to keep the rate at which we hit their server to something reasonable. This usually means waiting a minimum of 4 seconds between requests, but I've worked with pages that asked for 30 second delays.

Second, some ask that you only do large scale harvesting or scraping during "off peak" times. This often means overnight.

Third, some pages may just completely ban scraping tools from being used. Usually this is because they have an API they'd prefer you to use (and usually pay for) or because the data is sensitive in some way. Let's look at a few examples.

* LinkedIn has a hard block on programmatic web scraping because their data is really valuable and they want to sell it to you.
* Google will quickly block you from scraping their results because they want you to use an API. Many of theirs are open and reasonable to use, but they don't want HTML scraping.
* Archive of Our Own (AO3) has a block against it because they don't want search engines to index the results. This gives them control over story and author information and the ability to fully take things down as needed. 

But how can you know for sure? This can be hard and there's no single answer. You can often check the `robots.txt` file for the website. You can read about this file here: https://en.wikipedia.org/wiki/Robots.txt Very generally, it will contain information for humans and for bots, and give you an idea about limitations, etc. Not every site will have it, but most with data will. 

* https://en.wikipedia.org/robots.txt
* https://archiveofourown.org/robots.txt
	* my favorite "cruel but efficient"
	* note the crawl delay
* https://www.fanfiction.net/robots.txt

You can ask for permission to go out of bounds for this, especially for research. Just be respectful.
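
Python's standard library can read these rules for you. The sketch below parses a made-up `robots.txt` (the rules, the `example.org` URLs, and the `my-research-bot` user agent name are all hypothetical) with `urllib.robotparser`, so nothing is fetched from a real server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not copied from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# "my-research-bot" is a made-up user agent name for illustration.
agent = "my-research-bot"

# Ask whether a given URL is allowed, and what delay is requested.
can_search = parser.can_fetch(agent, "https://example.org/search/results")
can_read = parser.can_fetch(agent, "https://example.org/works/123")
delay = parser.crawl_delay(agent)

print(can_search, can_read, delay)
```

For a live site you would instead call `parser.set_url(...)` with the address of the site's `robots.txt` and then `parser.read()` to download it. Treat the answer as a starting point: a site's terms of use may say more than its `robots.txt` does.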


## Handling delays

Most programming languages have some ability to "delay" actions. We will use the `time` module in Python to delay our execution.

`time.sleep(seconds)` takes a number of seconds and pauses script execution for that long. Other languages use milliseconds instead, so be mindful if switching!

In [5]:
import time

for _ in range(3):
	print("hello!")
	time.sleep(5)

hello!
hello!
hello!


## Downloading things off one page

Starting with the simplest version, we have one page with a set of links, and we want to download the results of those links. What those files are doesn't really matter, because you're downloading them to disk.

So what I love about this page is that they just have the sql statement right at the top of the page. 

https://calphotos.berkeley.edu/cgi/img_query?where-taxon=Allium+anceps

Let's take a look at the structure here:

* clearly these are coming from a database
* there are multiple pages
* the images are displayed on the page
* there are detail links by each image
* being displayed in a table

 Tip: Chrome XPath Helper tool

I like to use this to preview the structure of the elements.

There are a variety of tools you can use for this part! Our basic goal is to get the URL for each of the pictures. Once we have those collected, we can run through them to download each. I'm going to provide these URLs for now so we can focus on the downloading.

Just a small preview of this xpath we'll be using:

`//td//img/@src`

* we can use `//img` to get all the images on the page, but most pages will have other images. Best practice is to include something more specific to disambiguate. This is why I have `td` in here.
* Using `@src` allows me to request that it return the value for the source property
* the URLs for the images appear to have a specific folder structure, which I could have also used to gather them
* the URLs gathered are relative links, meaning that I'll need to build the full URL when I'm doing my pass over them. 
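
If you'd like to see the idea run before the XPath lesson, here is a rough sketch on a tiny, made-up snippet (the HTML and its paths are hypothetical, not copied from the real page). It uses the standard library's `xml.etree.ElementTree`, which only understands a subset of XPath and needs well-formed markup, so the `@src` step is done with `.get("src")` instead. `lxml` can run the full `//td//img/@src` query directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet shaped like a results table.
SAMPLE = """
<html><body>
  <img src="/logo.png" />
  <table>
    <tr>
      <td><a href="/detail1"><img src="/imgs/128x192/a/1.jpeg" /></a></td>
      <td><a href="/detail2"><img src="/imgs/128x192/a/2.jpeg" /></a></td>
    </tr>
  </table>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# ".//td//img" finds img elements anywhere below a td,
# which skips the logo image that sits outside the table.
srcs = [img.get("src") for img in root.findall(".//td//img")]
print(srcs)
```

Notice that the logo is left out because it is not inside a `td`, which is exactly why the query includes something more specific than `//img`.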

Let's open the text file with the URLs and start building those up. As mentioned, these are relative links, so we will need to do a bit of editing to get them into the full pattern. You can check out a link on the main page to inspect what the full URL should be and what the relative links are. Looking at that, we can work out what the "base" URL should be.

Here's a full link:
`https://calphotos.berkeley.edu/imgs/128x192/0000_0000/1209/2448.jpeg`

And here's the corresponding relative link: 

`/imgs/128x192/0000_0000/1209/2448.jpeg`

This means we'll need to prepend `https://calphotos.berkeley.edu` before each URL to have the full one. There are several ways you can do this and this is a great time to practice your core Python skills. 
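
One of those ways, as an alternative to plain string concatenation, is `urljoin` from the standard library's `urllib.parse`. This is just a sketch of an option, not the approach used in the cells below. Its advantage is that it gives the same result whether or not the relative link starts with a `/`.

```python
from urllib.parse import urljoin

BASE = "https://calphotos.berkeley.edu/"

# The same relative link, written with and without a leading slash.
with_slash = urljoin(BASE, "/imgs/128x192/0000_0000/1209/2448.jpeg")
without_slash = urljoin(BASE, "imgs/128x192/0000_0000/1209/2448.jpeg")

print(with_slash)
print(without_slash)
```

Both lines print the full link shown above, so you don't have to worry about accidentally doubling or dropping a slash.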

Some notes:

* using list comprehension syntax here
* using `readlines` to read it in, which returns a list of strings, each string is a line from the file plus a newline character
* `strip` is needed to take the ending newline character off
* I'm concatenating the base before the url from the line, but note that I didn't include the final / because there's already an opening one from the url. 
* This will result in a list of all the urls.

In [6]:
with open('pictures.txt', 'r', encoding = 'utf-8') as infile:
    # urls = infile.readlines()
    urls = ['https://calphotos.berkeley.edu' + u.strip() for u in infile.readlines()]

In [7]:
urls

['https://calphotos.berkeley.edu/imgs/128x192/0000_0000/0903/0732.jpeg',
 'https://calphotos.berkeley.edu/imgs/128x192/0000_0000/1002/0400.jpeg',
 'https://calphotos.berkeley.edu/imgs/128x192/0000_0000/1102/0790.jpeg',
 'https://calphotos.berkeley.edu/imgs/128x192/0000_0000/1102/0792.jpeg',
 'https://calphotos.berkeley.edu/imgs/128x192/0000_0000/1207/0067.jpeg',
 'https://calphotos.berkeley.edu/imgs/128x192/0000_0000/1207/0083.jpeg',
 'https://calphotos.berkeley.edu/imgs/128x192/0000_0000/1207/0084.jpeg',
 'https://calphotos.berkeley.edu/imgs/128x192/0000_0000/1207/0086.jpeg',
 'https://calphotos.berkeley.edu/imgs/128x192/0000_0000/0408/1095.jpeg',
 'https://calphotos.berkeley.edu/imgs/128x192/0000_0000/0608/2437.jpeg',
 'https://calphotos.berkeley.edu/imgs/128x192/0000_0000/0608/2438.jpeg',
 'https://calphotos.berkeley.edu/imgs/128x192/0000_0000/0608/2439.jpeg',
 'https://calphotos.berkeley.edu/imgs/128x192/0000_0000/0608/2440.jpeg',
 'https://calphotos.berkeley.edu/imgs/128x192/0000_

### Working with `requests`

Let's try something basic!

In [4]:
import requests

url = "https://loripsum.net/api/1/plaintext/short"
result = requests.get(url)

print(result)

<Response [200]>



So what we're seeing here is a successful connection, but not the text. We have to ask for that explicitly from our result object.

We do this with `.text` (no parens!). This allows us to ask for a variable value within our object (versus calling a function). Some objects just work this way, and we know how to do this by looking at the documentation or a tutorial.

But our items are images. What can we do? Just a few tweaks. From Python's perspective, we are moving from data that's text to data that's bytes.

`requests` actually has a bunch of ways to handle this, and those methods may be better for larger files, etc. However, for smaller files like ours and the fact that we are using pathlib... we can pretty easily handle this. 

### Working with `pathlib`

You'll note that I didn't use pathlib for reading in that file. That's okay! Sometimes you don't need to.

`pathlib` is a great module for working with files/folders/etc. For webscraping it is ideal because you can very cleanly handle making folders, checking if things exist, making longer file paths, etc. Honestly, when I started using it vs other tools it was game changing.

We'll be exploring things with `pathlib` as we go, but we do need to cover a few basics. 

You create `Path` objects to represent files and directories. Once these are made you get access to special methods for taking action on them or getting information back. You create a file and a folder object the same way.

`p = pathlib.Path(string of path info etc.)` 

This returns a `Path` object you'll want to save in a variable.

We can use `pathlib.Path('pictures')` to work with our directory and then make the file path objects like `pathlib.Path('filename.jpeg')`. Neither of these things need to actually exist for us to make these objects. 

We can use the `mkdir()` method to create a folder, and then use the `/` concatenation operator to combine them.

`pathlib` has two awesome path object methods to write out content:

* `write_text(text stuff)`
	* for text!
* `write_bytes(a bytes or non-text doodad)`
	* briefly, for stuff that isn't text
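
Here is a small sketch that puts those pieces together. It works inside a temporary folder so nothing is left behind, and the file names and contents are made up for illustration (the bytes are a stand-in, not a real image).

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a folder path with the / operator, then create the folder.
    target = pathlib.Path(tmp) / "pictures"
    target.mkdir(exist_ok=True)

    # Text goes out with write_text ...
    note = target / "notes.txt"
    note.write_text("downloaded from the example site", encoding="utf-8")

    # ... and anything that isn't text goes out with write_bytes.
    fake_image = target / "0903_0732.jpeg"
    fake_image.write_bytes(b"\xff\xd8\xff\xe0 not a real image")

    # Read things back and check that both files exist.
    text_back = note.read_text(encoding="utf-8")
    bytes_back = fake_image.read_bytes()
    both_exist = note.exists() and fake_image.exists()

print(text_back)
print(type(bytes_back).__name__)
print(both_exist)
```

When we download images with `requests`, the `.content` attribute of the response plays the role of the bytes here.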


Before downloading, take a look at the site's `robots.txt`: https://calphotos.berkeley.edu/robots.txt

We have a list of URLs now, so we can loop through those and begin downloading them. There are a few tasks we'll need to accomplish.

* create the file name (from the last parts of the URL)
* create a directory for the new files to go into
* create the full destination path (target folder plus file name)
* open up the requests connection
* access and write the content
* close the connection
* wait for 5 seconds

This is a lot, so we'll build it up bit by bit.

In [9]:
import pathlib
import time
import requests

# create the target folder object
target = pathlib.Path('pictures')
# make the directory if needed
# does nothing if already exists
target.mkdir(exist_ok=True)

for u in urls:
    parts = u.split('/')
    last_two = parts[-2:] # grab the last two parts
    fname = "_".join(last_two)
    # print(fname)
    p = target / pathlib.Path(fname)
    print(p) # this is the full path
    r = requests.get(u) #open connection
    p.write_bytes(r.content) # get content, write bytes
    r.close() # always close your connection!!!
    time.sleep(5) # pause to not anger the server

pictures/0903_0732.jpeg
pictures/1002_0400.jpeg
pictures/1102_0790.jpeg
pictures/1102_0792.jpeg
pictures/1207_0067.jpeg
pictures/1207_0083.jpeg
pictures/1207_0084.jpeg
pictures/1207_0086.jpeg
pictures/0408_1095.jpeg
pictures/0608_2437.jpeg
pictures/0608_2438.jpeg
pictures/0608_2439.jpeg
pictures/0608_2440.jpeg
pictures/0608_2441.jpeg
pictures/0608_2442.jpeg
pictures/0608_2443.jpeg
pictures/0608_2444.jpeg
pictures/0209_0663.jpeg
pictures/0209_0664.jpeg
pictures/0209_0665.jpeg
pictures/0209_0666.jpeg
pictures/0209_0667.jpeg
pictures/0509_0139.jpeg
pictures/1209_2447.jpeg
pictures/1209_2448.jpeg
pictures/0611_1218.jpeg
pictures/0611_1219.jpeg
pictures/0611_1220.jpeg
pictures/0611_1221.jpeg
pictures/0611_1222.jpeg
pictures/0611_1223.jpeg
pictures/0413_3699.jpeg
pictures/1113_3030.jpeg
pictures/1115_2820.jpeg
pictures/1115_2821.jpeg
pictures/1115_3063.jpeg
pictures/1115_3064.jpeg
pictures/1017_1587.jpeg
pictures/1017_1588.jpeg
pictures/1017_1589.jpeg
pictures/0918_2740.jpeg
pictures/0918_27

One thing I always check at this point is the file size for everything that has downloaded. When in Jupyter on a cloud service, that can be hard, but `!` to the rescue.

`!ls -l pictures`
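
The same check can also be done from Python with `pathlib`, which is handy if `ls` isn't available. This is a sketch: the `small_files` helper and its `min_bytes` cutoff are made up for illustration, and the demo runs on a throwaway folder with fake files rather than the real `pictures` folder.

```python
import pathlib
import tempfile


def small_files(folder, min_bytes=1000):
    """Return (name, size) pairs for files smaller than min_bytes."""
    suspicious = []
    for p in sorted(pathlib.Path(folder).iterdir()):
        size = p.stat().st_size  # file size in bytes
        if p.is_file() and size < min_bytes:
            suspicious.append((p.name, size))
    return suspicious


# Demo on a throwaway folder with one normal file and one empty file.
with tempfile.TemporaryDirectory() as tmp:
    folder = pathlib.Path(tmp)
    (folder / "good.jpeg").write_bytes(b"x" * 5000)
    (folder / "empty.jpeg").write_bytes(b"")
    flagged = small_files(folder)

print(flagged)
```

To use it on the downloads, you would call `small_files('pictures')` and pick a cutoff that makes sense for the files you expect.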

Now, what if we had many files, or some downloads got messed up? Using pathlib is awesome here. We can use the `exists()` method to check whether the file we are proposing to make already exists, so a rerun only fetches what is missing.

In [10]:
!ls -l pictures

total 5384
-rw-r--r--  1 wickes1  staff  79560 Jul 24 11:27 0209_0663.jpeg
-rw-r--r--  1 wickes1  staff  77499 Jul 24 11:27 0209_0664.jpeg
-rw-r--r--  1 wickes1  staff  70946 Jul 24 11:27 0209_0665.jpeg
-rw-r--r--  1 wickes1  staff  61068 Jul 24 11:27 0209_0666.jpeg
-rw-r--r--  1 wickes1  staff  72332 Jul 24 11:27 0209_0667.jpeg
-rw-r--r--  1 wickes1  staff  48150 Jul 24 11:26 0408_1095.jpeg
-rw-r--r--  1 wickes1  staff  41317 Jul 24 11:28 0413_3699.jpeg
-rw-r--r--  1 wickes1  staff  53845 Jul 24 11:27 0509_0139.jpeg
-rw-r--r--  1 wickes1  staff  92166 Jul 24 11:26 0608_2437.jpeg
-rw-r--r--  1 wickes1  staff  81895 Jul 24 11:26 0608_2438.jpeg
-rw-r--r--  1 wickes1  staff  79842 Jul 24 11:26 0608_2439.jpeg
-rw-r--r--  1 wickes1  staff  74976 Jul 24 11:26 0608_2440.jpeg
-rw-r--r--  1 wickes1  staff  71964 Jul 24 11:26 0608_2441.jpeg
-rw-r--r--  1 wickes1  staff  65093 Jul 24 11:26 0608_2442.jpeg
-rw-r--r--  1 wickes1  staff  61289 Jul 24 11:26 0608_2443.jpeg
-rw-r--r--  1 wickes1  staff 

In [11]:
target = pathlib.Path('pictures')
target.mkdir(exist_ok=True)

for u in urls:
    parts = u.split('/')
    last_two = parts[-2:] # grab the last two parts
    fname = "_".join(last_two)
    p = target / pathlib.Path(fname)
    # use .exists to check
    if p.exists():
        print("already done!")
    else:
        print(p)
        r = requests.get(u)
        p.write_bytes(r.content)
        r.close()
        time.sleep(5)

already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!
already done!

