# Web Scraping #
#### CAS Applied Data Science 2025 ####

In today's tutorial, we will learn how to collect our own data from the internet through **web scraping** and **APIs**. This is a very broad topic that relates to many technologies you may not know about yet. Don't get discouraged if you encounter many things that are new for you. Even if you do not fully understand how everything works in detail, you will soon be able to extract data from the internet and work with it – and over time, you will develop a better understanding of the things that happen behind the scenes!

In this tutorial, we will only be able to scratch the surface and cover the most important concepts. However, there are many helpful books and online resources if you would like to dive deeper into the topic.

## Getting help

We are selective in this tutorial and only discuss elements that we believe are most important to get started with acquiring data from the web. If you want more details, you can consult, for example, the **Python Standard Library Reference** at https://docs.python.org/3/library/ or the **Language Reference** at https://docs.python.org/3/reference/. But be warned: the amount of detail in these sources can be overwhelming. For **quick and easy-to-understand overviews** of different topics see, for example, https://www.w3schools.com/python/. For comprehensive information about web technologies visit https://developer.mozilla.org/. Here are some specific references for today's tutorial:

Requests:
* https://docs.python-requests.org/en/master/
* https://www.w3schools.com/python/module_requests.asp
* https://www.w3schools.com/python/ref_requests_response.asp
* https://realpython.com/python-requests/

BeautifulSoup:
* https://beautiful-soup-4.readthedocs.io/
* https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/ (includes good general introduction to webscraping with Python)
* https://programminghistorian.org/en/lessons/intro-to-beautiful-soup



If you get stuck or don't remember how to do something, it is usually a good idea to **Google** your problem or to ask a **chatbot**. However, although chatbots are very helpful, you must be able to critically assess the quality of their responses. Python has a large (and fast-growing) community and you will probably find answers to most of your questions online (e.g. on **Stack Overflow** or in a **YouTube tutorial**).

## Introduction

### What is web scraping?

Imagine you have a list of domain names of e.g. companies or organisations and would like to download a certain number of pages from each domain to analyze the content of websites. You may even want to do this every month to monitor the websites. Similarly, you could be interested in tracking data from an online source over time. For example, you may want to collect data on the weather, current health conditions or stock markets.  

Of course, you could try to manually call all the relevant pages and copy the data to a file on your computer. However, this will be very time-consuming and error-prone and may simply not be feasible for vast amounts of pages.

Web scraping allows you to **retrieve the contents of web pages automatically**. This can be extremely useful if you need data from many different (sub-)pages or if you need to repeatedly download a page over time. Instead of downloading all the pages manually, you can write a program that does all that work for you. Python has several very useful libraries that allow you to do this and is thus a good choice for a web scraping project.

Before we get started with these libraries, there are a few things about the internet/web technologies you should know. We will briefly walk you through (1) the basics of data transmission over the internet, (2) the components of a web page and (3) the HTML language.

### What happens if you type a URL into your browser?

When you open your browser and type a URL (e.g. www.google.com) you will usually get directed to a page within milliseconds. Have you ever wondered how this works? Well, the whole process is incredibly complicated – but we can try to provide a simplified answer:






You can think of the internet as a network that connects two types of computers: **clients** and **servers**. The clients are the ones requesting the information. For example, your computer or your smartphone are clients. Servers are where the information is stored. They are also computers (e.g. a single computer or a whole data center), but they are set up to store and deliver data for the clients.

![Client Server Model](https://upload.wikimedia.org/wikipedia/commons/c/c9/Client-server-model.svg)



When you type ``www.google.com`` into your browser, your computer (i.e. a client) requests information from the Google server. But how does it know where to find this server? Each server (and also each client) is identified through a unique number, the so-called **IP (Internet Protocol) address**. Assume the IP address for the Google server is ``8.8.8.8``.




How does your computer figure out that it needs to connect with the server ``8.8.8.8`` when you type ``www.google.com``? This is done through the so-called **Domain Name System (DNS)**. You can think of it as the telephone book of the internet. It consists of servers that store information on all *domain names* (e.g. google.com) and their corresponding *IP addresses*. Your computer will first have to request the IP address from one of these servers (unless it already knows the IP address from previous requests).
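
You can ask the DNS for an IP address from Python as well. The following is a minimal sketch using the standard library's ``socket`` module; the helper name ``lookup_ip`` is made up for this illustration, and the address you get for a real domain depends on your location and changes over time.

```python
import socket

def lookup_ip(hostname):
    """Return one IPv4 address for hostname, or None if the lookup fails."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror (a subclass) is raised for unknown names or when offline
        return None

print(lookup_ip("www.google.com"))        # address varies by location and over time
print(lookup_ip("no-such-host.invalid"))  # the .invalid top-level domain is reserved and never resolves
```

Returning ``None`` instead of raising an error is just one possible design; it keeps the example short and lets you check the result with a simple ``if``.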




Once your computer has figured out the correct IP address, it will send a so-called **HTTP (Hypertext Transfer Protocol) request** to the corresponding server. HTTP is a protocol that defines how clients and servers have to communicate with each other (i.e. how requests and responses should be written). You can think of an HTTP request as a set of simple instructions stating what the client wants the server to do. An HTTP request for the Google start page could look as follows:



```http
GET / HTTP/2
Host: www.google.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: de,en-US;q=0.7,en;q=0.3
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Cache-Control: max-age=0
TE: Trailers
```




The HTTP request will be passed through several layers of protocols (see https://en.wikipedia.org/wiki/OSI_model for further details) and eventually be transformed into signals that can be sent through the telecommunication infrastructure that spans the globe (e.g. fiber cables, satellites etc.). At the destination, the signals will be converted back into an HTTP request. If everything worked well, the server will then produce an **HTTP response** containing the contents (i.e. the source code) of the page the client requested and send it back to your computer (otherwise, it will send back an error message). Your browser will know how to interpret this response and convert it into an appropriately styled web page – for example the Google start page!
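
To get a feeling for the parts of such a request, you can build one in Python without sending it. The sketch below uses the ``requests`` library (introduced later in this tutorial) and its ``Request``/``prepare()`` mechanism. The header value is copied from the example above. Note that a request built this way only contains the headers you set yourself; further headers such as ``User-Agent`` are added when the request is actually sent.

```python
import requests

# Build the request object without sending it over the network.
request = requests.Request(
    "GET",
    "https://www.google.com/",
    headers={"Accept-Language": "de,en-US;q=0.7,en;q=0.3"},
)
prepared = request.prepare()

print(prepared.method)    # the HTTP method, here GET
print(prepared.url)       # the full URL
print(prepared.path_url)  # the path that appears in the first line of the request
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```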

### What are the components of a website?

We said that when we request a page, the responsible server will send us an HTTP response containing the source code of the page. But what does this code look like? What are web pages made of?

Websites are usually a combination of documents in three languages: HTML, CSS and JavaScript. Let's take a brief look at what each of them is responsible for:

* **HTML** – *HyperText Markup Language*: Structured contents of the page (e.g. What are the headings of the page? What do the different paragraphs say? What images do we have?). Additional information at https://developer.mozilla.org/en-US/docs/Web/HTML
* **CSS** – *Cascading Style Sheets*: Styling of the page (e.g. What font size and type should headings have? How should the paragraphs be styled?). Additional information at https://developer.mozilla.org/en-US/docs/Web/CSS
* **JS** – *JavaScript*: Interactivity of the page (e.g. Show or hide more information with the click of a button; Slide through a carousel of images). Additional information at https://developer.mozilla.org/en-US/docs/Web/JavaScript

When we scrape a page, we are usually only interested in its content and will thus only work with the HTML part of a page. So for most tasks, you do not need to worry too much about CSS and JavaScript and can focus on understanding HTML.

><font color = 4e1585> SIDENOTE: Note that some understanding of CSS can be helpful in extracting detailed information from a page (see section on CSS selectors).

### How does HTML work?

When you scrape a page, you will usually fetch the HTML code of that page and parse it (e.g. with Python's ``BeautifulSoup`` library). To be able to make sense of the code you will have to work with, it is useful to know a few things about HTML. Let's take a brief look at it.

HTML is a **markup language**, meaning that it uses **tags** to define **elements** within a document. Let's look at an example:



```html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>This is a heading</h1>
<p>This is a paragraph.</p>

</body>
</html>
```



Each element **starts with ``<*name of tag*>`` and ends with ``</*name of tag*>``**. For example, the ``body`` element starts with ``<body>`` and ends ``</body>`` and everything in between belongs to this element.

Elements are often **nested within each other** (i.e. they contain other elements). For example, the ``h1`` and the ``p`` element are nested within the ``body`` element. This means that ``body`` is a **parent** of ``h1`` and ``p``. Conversely, ``h1`` and ``p`` are **children** of ``body`` (and **siblings** to each other). Understanding the nested structure of HTML elements will be important for extracting specific information from a web page!
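
The nesting can be made visible with a few lines of Python. The following sketch uses the standard library's ``html.parser`` module to print each tag of the example document, indented by its depth. The class name ``OutlineParser`` is made up for this illustration, and the simple depth counter assumes well-formed HTML in which every opening tag has a closing tag (it would miscount elements such as ``<img>`` or ``<br>`` that have no closing tag).

```python
from html.parser import HTMLParser

HTML_DOC = """<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>This is a heading</h1>
<p>This is a paragraph.</p>
</body>
</html>"""

class OutlineParser(HTMLParser):
    """Record every opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag name) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # everything until the closing tag is nested one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1

parser = OutlineParser()
parser.feed(HTML_DOC)

for depth, tag in parser.outline:
    print("  " * depth + tag)
```

In the printed outline, ``h1`` and ``p`` appear at the same indentation directly below ``body``, which is exactly the parent/children/siblings relationship described above.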



There are many different HTML tags (more than 100). You do not need to know all of them, but it is useful to take a look at the most common ones you may encounter (see here for more detailed information on them: https://www.elated.com/first-10-html-tags/):

* ``<html>``: Opening tag of every HTML document
* ``<head>``: Document head
* ``<title>``: Page title (child of ``head`` tag)
* ``<body>``: Page content (parent of the main content of the page)
* ``<h1>``:  Section heading (``<h2>``, ``<h3>`` etc. for lower level headings)
* ``<p>``:  Paragraph
* ``<a>``: Link
* ``<div>``: Block-level container for content
* ``<table>``: Table (with ``<tr>`` tags for each row, a ``<td>`` for each cell and a ``<th>`` tag for table headings)
* ``<img>``: Image


See https://developer.mozilla.org/en-US/docs/Web/HTML/Element for a comprehensive list of HTML elements

Tags can also have **attributes**. Consider the following example:

* ``<a href="https://www.someaddress.com" target="_blank"> Click on this link</a>``

The ``<a>`` tag above has two attributes. The ``href`` attribute defines the URL and the ``target`` attribute specifies how it should be opened (`"_blank"` means that it will be opened in a new tab). Each tag has its own set of attributes, but there are also global attributes such as `id` or `class` that can be used with any tag. See the following links, if you would like to know more:
* https://www.w3schools.com/html/html_attributes.asp
* https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes





---
### **<font color='teal'>In-class exercise**

>  <font color='teal'> Check out this simple page to see how all of this works: http://farys.org/daten/example-page.html. If you right-click on the page, you should be able to select "View source code". You can also click on "Inspect" (German: "Element untersuchen") to see which part of the code refers to which part on the web page! If you need to scrape a particular element from a page, this is usually a good starting point.
>
>  <font color='teal'> Try to answer the following questions:
>
>*  Can you find the element with the main heading?
* What are the parent, children and siblings of the ``table`` element?
* In which element is the "Bongo Cat" link? What attributes does the tag have?
* Extra (after-class) task: Go to the Bongo Cat page and learn to play "Happy Birthday" with the Marimba! ;-)






---



## Scraping web pages with ``requests``

### Making simple requests

So let's get started with web scraping. How can we use Python to **retrieve a page from the internet**? There are several libraries that allow you to do this. The most popular one is the **``requests`` library** (https://docs.python-requests.org/en/master/). We can import it as follows:

In [None]:
import requests

Now let's request the cat page from the example above using the **``get`` function**. It will send an HTTP request to the responsible server and fetch the response for you:

In [None]:
r = requests.get("http://farys.org/daten/example-page.html")
type(r)

As you can see, we got some kind of response object. What can we do with it? A first useful thing to do is to **check the status code** (by calling the ``status_code`` attribute):

In [None]:
r.status_code

What does this mean? Simply put, the 2xx codes (e.g. 200, 201) indicate success, the 3xx codes are redirects, the 4xx codes are client errors and the 5xx codes are server errors (see here for a list of all HTTP status codes: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes or https://developer.mozilla.org/en-US/docs/Web/HTTP/Status). So in our case, everything seems to have gone well, as we got a status code of 200.
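
The grouping of status codes can be written down as a small helper function. This is only an illustrative sketch: the function name ``describe_status`` is made up, and the standard library's ``http.HTTPStatus`` is used to look up the official short description of each code.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the broad category of an HTTP status code."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"  # e.g. the informational 1xx codes

for code in (200, 301, 404, 429, 503):
    # HTTPStatus(code).phrase is the standard short description of the code
    print(code, HTTPStatus(code).phrase, "->", describe_status(code))
```

In practice you rarely need such a function yourself: a ``requests`` response object offers ``r.ok`` (``True`` for status codes below 400) and ``r.raise_for_status()``, which raises an exception for 4xx and 5xx codes.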

Usually, we are interested in the source code of the page. We can access it through the **``text`` attribute**:

In [None]:
r.text

In [None]:
type(r.text)

As you can see, we now have a string with the HTML code of the page – it is the same code you saw when clicking on "View source code" on the page!

### Setting query parameters*

<font color='gray'>This section is for self-study and will not be discussed in class.

You will often encounter URLs that look somewhat like this:

* https://en.wikipedia.org/w/index.php?title=Cat&action=info (information page about the "Cat" page on Wikipedia)

The part after the ``?`` is called a **query string** (https://en.wikipedia.org/wiki/Query_string; For a general overview on the structure of URLs see https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL). It is a way to send additional information to the server, for example search terms when you use Google. The server will still deliver an HTML page, but the information in the query string might be used to alter this page before it is sent to you.

Instead of passing in the entire URL to the ``get`` function of the `requests` library, you can also provide the parameters for the query string as a separate argument. This is done by passing a dictionary as the argument of the **``params`` parameter**:

In [None]:
# Same result as:
# cat = requests.get("https://en.wikipedia.org/w/index.php?title=Cat&action=info")

my_params = {"title" : "Cat",
             "action" : "info"}
cat = requests.get("https://en.wikipedia.org/w/index.php",
                   params=my_params)

print(cat)
print(cat.url) # Look at url that was used for the request

Why would this be useful? Suppose you would like to scrape not only the "Cat" info page, but also the ones for many other animals. Of course, you could write a loop that builds the different URL strings for you. But it is much cleaner and easier to use the ``params`` parameter:

In [None]:
animals = ["Cat", "Dog", "Bird"]

L=[]
for animal in animals:
  my_params = {"title" : animal, "action" : "info"}
  res = requests.get("https://en.wikipedia.org/w/index.php", params=my_params)
  L.append(res)

L # List with response objects for the three animal info pages

><font color = 4e1585> SIDENOTE: The ``get`` function also allows you to specify other useful arguments. See here for an introduction: https://docs.python-requests.org/en/master/user/quickstart/.
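
If you are curious how a dictionary of parameters turns into a query string, you can reproduce the step with the standard library's ``urllib.parse`` module. The sketch below is for illustration only. The page title "Maine Coon" is a hypothetical example chosen because it contains a space, which must be encoded before it can be part of a URL.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://en.wikipedia.org/w/index.php"
my_params = {"title": "Maine Coon", "action": "info"}  # hypothetical example values

# Build the query string: pairs are joined with & and special characters are encoded
query_string = urlencode(my_params)
full_url = f"{base_url}?{query_string}"
print(full_url)

# The reverse direction: split a URL and read the parameters back
parts = urlsplit(full_url)
print(parts.path)
print(parse_qs(parts.query))
```

Letting a library do this encoding is one reason why the ``params`` argument is safer than gluing URL strings together by hand.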



---
### **<font color='teal'>In-class exercise**

>  <font color='teal'> Retrieve the Wikipedia page about cats (https://en.wikipedia.org/wiki/Cat) and assign the response to a variable named ``resp``. Print the status code and the text of the response.







---

## Extracting data from HTML with ``BeautifulSoup``

We now have a string with the HTML code of the page. Usually we are not interested in the entire code, but would like to **extract specific information** from it. For example, we might be interested in the headings, the links, some paragraph, a table etc. How can we access these things?
If you wanted to do this all by yourself, you would have to write a lot of complicated code searching for text patterns in the HTML string. Luckily, extracting information from HTML code (or other markup languages) is much more convenient thanks to libraries such as ``BeautifulSoup``.


The ``BeautifulSoup`` library allows you to **parse data from HTML** (and XML) files. This means that you can convert the HTML string into an object that will be easier to search. Moreover, it provides you with many useful **functions and methods to locate the information** you are interested in. We will only be able to cover a few of these functions and methods. Check out the documentation if you want to learn more:
https://beautiful-soup-4.readthedocs.io/

### Parsing HTML strings

Let's get started and import the ``BeautifulSoup`` library:

In [None]:
from bs4 import BeautifulSoup

We will continue to work with the HTML string from the cat example page:

In [None]:
r = requests.get("http://farys.org/daten/example-page.html")
r.text

In [None]:
type(r.text)

First, we need to parse it to convert it into a ``BeautifulSoup`` object:

In [None]:
soup = BeautifulSoup(r.text, "html.parser")  # creates an object of the BeautifulSoup class; naming the parser avoids a warning
type(soup)

It looks as follows:

In [None]:
soup

Another way to print out the content of the object is using the ``prettify()`` method. It allows you to print the HTML code in a way that makes it easier to read; indentation is used to show how the tags are nested within each other:

In [None]:
print(soup.prettify())

### Accessing elements with ``find`` and ``find_all``

Now, how could we **access different elements** within this "soup"? Suppose, we would like to access the ``head`` element. You can do this using the **``find``** method:

In [None]:
soup.find("head")

Similarly, we could access the ``title`` element:

In [None]:
soup.find("title")

As the ``title`` and the ``head`` elements can also be reached as **attributes** of the ``soup`` object, we can access them like this:

In [None]:
soup.head

In [None]:
soup.title

The return object will be a *BeautifulSoup tag*:

In [None]:
type(soup.head)  # Or: soup.find("head")

This means that we can continue to search it. Let's try to get the ``title`` element within the ``head`` element:

In [None]:
soup.head.title # Or: soup.find("head").find("title")

*(Note that in this case, just typing ``soup.title`` returns the same result. However, this will not always be the case as we are now explicitly searching for the title element **within** the head element.)*

Let's now try to access a link. Links are within ``a`` tags:

In [None]:
soup.find("a")

But wait, there are several links on the website! Why did we get only one? The **``find`` method always returns the first element** that matches the search criteria. If we would like to get **all the matching elements, we can use the ``find_all`` method**:

In [None]:
soup.find_all("a")

Now all instances of the `a` tag are returned in a list.

><font color = 4e1585> SIDENOTE: Because ``find_all`` is such a commonly used method, there is a shortcut for it. For example, instead of writing ``soup.find_all("a")``, you could also just write ``soup("a")``.

We can also **refine our search by specifying arguments**. For example, we can use the **attrs** parameter to search tags based on HTML attributes (see section on HTML). Let's fetch the table with the id "famous_cats_table" and the class "cat_table":

In [None]:
soup.find_all("table", attrs={"id": "famous_cats_table", "class": "cat_table"})

Most HTML attributes can also be addressed directly by using the name of the attribute as the argument key in ``find_all()``:


In [None]:
soup.find_all("table", id="famous_cats_table", class_="cat_table")
# Note that you need to write class_, not class (class is a reserved keyword)!
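
The following self-contained sketch repeats these search patterns on a small HTML snippet, so you can run it without an internet connection. The snippet is made up for this illustration (it only borrows the id and class names from the example page), and the variable name ``snippet_soup`` is chosen so that it does not overwrite the ``soup`` object from above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, not the real example page
html = """
<html><body>
  <h1>Cats</h1>
  <table id="famous_cats_table" class="cat_table">
    <tr><th>Name</th></tr>
    <tr><td><a href="https://en.wikipedia.org/wiki/Grumpy_Cat">Grumpy Cat</a></td></tr>
  </table>
  <table class="other_table"><tr><td>unrelated</td></tr></table>
</body></html>
"""

snippet_soup = BeautifulSoup(html, "html.parser")

all_tables = snippet_soup.find_all("table")                      # both tables
cat_tables = snippet_soup.find_all("table", class_="cat_table")  # only the first table
by_id = snippet_soup.find(id="famous_cats_table")                # search by id, any tag name
missing = snippet_soup.find("table", id="no_such_table")         # no match

print(len(all_tables))
print(len(cat_tables))
print(by_id["class"])  # class values come back as a list, since a tag can have several classes
print(missing)
```

Note the last line: ``find`` returns ``None`` when nothing matches, whereas ``find_all`` returns an empty list. It is therefore a good idea to check the result of ``find`` before calling methods on it.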

### Extracting text and attribute values of an element

So far, so good – but the outputs we got so far are not completely satisfactory. They still have a lot of HTML markup around them that we do not want. How can we get rid of it? We can use the **``get_text`` method** to get only the text of a tag:

In [None]:
soup.find("title").get_text()  # or using attribute notation: soup.title.text

Let's try the same for the first URL:

In [None]:
link = soup.find("a")
print(link)

In [None]:
link.get_text()

That's not quite what we wanted. How can we access the URL?

As you may have noticed, the URL is part of the **HTML attributes of the ``a`` tag** (see section on HTML). In ``BeautifulSoup``, these attributes can be **treated like a dictionary**. In our case, we have two key:value pairs: the ``href`` key with a value of ``"https://bongo.cat/"`` and the ``target`` key with a value of ``"_blank"``. To get the URL, you can thus simply type:

In [None]:
link["href"]

So what if we wanted to get all the URLs in the document? Let's try:

In [None]:
links = soup.find_all("a")
type(links)

In [None]:
links

In [None]:
# links["href"] # this will return an error!

Why didn't this work? Our ``links`` variable does not contain a single element, but a **BeautifulSoup ``ResultSet`` containing several elements**. The set can be treated like a list. For example, you could extract the link for the first element in the ``ResultSet`` like this:

In [None]:
links[0]["href"]

If you want to extract the link for all the elements, you can write a **loop or a list comprehension**:

In [None]:
# Loop to print all the URLs
for tag in links:
  print(tag["href"])

In [None]:
L = []
for tag in links:
  L.append(tag["href"])
L

In [None]:
# Better: List comprehension to write all the URLs into a list
L = [tag["href"] for tag in links]
L
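
On real pages, two things often go wrong with this pattern: some ``a`` tags have no ``href`` attribute at all (so ``tag["href"]`` raises a ``KeyError``), and many links are relative (e.g. ``/about.html``). The sketch below shows one possible way to handle both cases. The HTML snippet and the page address ``page_url`` are made up for this illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = "http://example.org/cats/index.html"  # hypothetical address of the page
html = """
<a href="https://bongo.cat/" target="_blank">Bongo Cat</a>
<a href="/about.html">About</a>
<a name="top">Anchor without href</a>
"""

link_soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the a tags that actually have an href attribute
tags_with_href = link_soup.find_all("a", href=True)

# urljoin turns relative links into absolute ones, based on the page's own URL
absolute_urls = [urljoin(page_url, tag["href"]) for tag in tags_with_href]
print(absolute_urls)

# tag.get("href") returns None instead of raising a KeyError
all_hrefs = [tag.get("href") for tag in link_soup.find_all("a")]
print(all_hrefs)
```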

### Accessing elements with ``select``/CSS selectors*

<font color='gray'>This section is for self-study and will not be discussed in class.

The ``BeautifulSoup`` library also features the ``select()`` method that allows you to select elements in a more sophisticated way: through so-called **CSS selectors**. CSS selectors are patterns that are used in CSS to select the elements (within the HTML file) you want to style (see here for an overview: https://www.w3schools.com/cssref/css_selectors.asp). You can use these selectors to extract elements through BeautifulSoup's ``select`` method.

The ``select`` method may be a bit harder to learn, but it is very flexible and powerful. Let's look at some examples so you get a feeling of what is possible:

In [None]:
soup.select("body h2") # Select all h2 tags within the body tag

In [None]:
soup.select("body table td") # Select all td tags within a table tag within a body tag

In [None]:
soup.select("tbody > tr") # Select all tr tags where the parent is a tbody tag

In [None]:
soup.select("h2 ~ p") # Select all p tags that come after an h2 tag and share its parent

In [None]:
soup.select("#famous_cats_table") # Select the tag with the id "famous_cats_table"

In [None]:
soup.select(".cat_table") # Select all tags with the class "cat_table"

For the documentation on the ``select`` method, see here: https://beautiful-soup-4.readthedocs.io/en/latest/index.html#css-selectors
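
The difference between the selectors is easiest to see on a tiny document. The following sketch uses a made-up HTML snippet (not the example page) to compare the descendant selector (a space), the child selector (``>``) and the sibling selector (``~``). It also shows ``select_one``, which returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet for illustration
html = """
<body>
  <h2>Famous cats</h2>
  <p>First paragraph.</p>
  <div><p>Nested paragraph.</p></div>
  <p class="note">Second paragraph.</p>
</body>
"""

css_soup = BeautifulSoup(html, "html.parser")

descendants = css_soup.select("body p")     # every p anywhere inside body
children = css_soup.select("body > p")      # only p tags that are direct children of body
siblings = css_soup.select("h2 ~ p")        # p tags that follow an h2 and share its parent
first_note = css_soup.select_one("p.note")  # first p tag with the class "note", or None

print(len(descendants), len(children), len(siblings))
print(first_note.get_text())
```

The nested paragraph inside the ``div`` is counted by ``body p`` but not by ``body > p`` or ``h2 ~ p``, because it is neither a direct child of ``body`` nor a sibling of the ``h2`` element.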

To know how to specify the selection criteria, it can be useful to first explore the nested structure of the code. There are many useful functions that allow you to **navigate through the "family tree" of HTML elements**.

In [None]:
list(soup.head.children) # Get all children of the head tag

In [None]:
soup.title.parent # Get the (direct) parent of the title tag

In [None]:
list(soup.title.next_siblings) # Get all siblings that come after the title tag

In [None]:
# and many more

---
### **<font color='teal'>In-class exercise**
>  <font color='teal'>
We will continue to work with the response object from the example page:

>  <font color='teal'> Convert the response into a BeautifulSoup object called ``cats_soup``.

>  <font color='teal'> Now try to extract the following:
* The ``h1`` element
* The second ``h2`` element
* The text of the ``h1`` element
* All elements of the ``class`` "cat_table"

In [None]:
# h1


In [None]:
# second h2


In [None]:
# text of h1


In [None]:
# all cat tables




---



## Extracting HTML tables with Pandas*

<font color='gray'>This section is for self-study and will not be discussed in class.

Now suppose you would like to extract a table from a webpage and write it to a Pandas dataframe so that you can continue to work with it. The table element of the example page looks like this:

In [None]:
soup.find("table")

Of course, you could continue to use BeautifulSoup and write a complicated loop to convert this HTML code into a Pandas dataframe.

Luckily, there is an easier solution: Pandas provides the ``read_html`` function that does this automatically for you! You only need to provide the URL and it will fetch you all the tables and return them as a list of Pandas dataframes (https://pandas.pydata.org/docs/reference/api/pandas.read_html.html):

In [None]:
import pandas as pd

In [None]:
tables_list = pd.read_html("http://farys.org/daten/example-page.html") # request tables from example page
type(tables_list) # return object is a list

In [None]:
len(tables_list) # one list element for each table on the page

As there is only one table on our example page, our return object will be a list with one element. Let's fetch the first (and only) element in this list:

In [None]:
cat_table = tables_list[0] # access first element
cat_table

In [None]:
type(cat_table) # Pandas dataframe representing the table from the website

><font color = 4e1585> SIDENOTE: As you have seen, the ``read_html`` function can fetch the tables from a website and return them as Pandas dataframes. It can also be applied to an HTML string; recent Pandas versions expect such a string to be wrapped in ``io.StringIO``.

## Web scraping in practice

### Scraping many pages at once

Retrieving information from a single page through web scraping is often not very useful – you might as well just go to the page and copy the content you need. The power of web scraping starts when we **retrieve many pages automatically**. How could we do this?

The answer to this question depends on your project:

1. In some cases, you have a pre-defined list of URLs you would like to scrape. You could then loop through all of them to fetch all the pages and retrieve the contents you are interested in.

2. If you are scraping many pages of the same domain you can try to figure out the structure of the different URLs you want to fetch. You can then write a loop to generate all the strings of these URLs and fetch the respective pages. An example would be a website where pages are available for every day, e.g.: example.org/data/2000-01-01, ... ,example.org/data/2023-08-24. In this example you would need to figure out the earliest and latest dates.

3. In some cases, you can start on one page, retrieve the URLs you find on this page, then request the respective pages and retrieve the URLs on them ... and so on. This is called **web crawling** *(think of the example in the introduction where we started with a domain name and wanted to retrieve a certain number of pages)*.
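
The second strategy (generating URLs from a known pattern) can be sketched as follows for the daily pages mentioned above. The address ``https://example.org/data/`` is a placeholder, and the function name ``daily_urls`` is made up for this illustration.

```python
from datetime import date, timedelta

def daily_urls(start, end, base="https://example.org/data/"):
    """Return one URL per day from start to end (both included)."""
    n_days = (end - start).days + 1
    return [base + (start + timedelta(days=i)).isoformat() for i in range(n_days)]

urls = daily_urls(date(2023, 8, 22), date(2023, 8, 24))
print(urls)
```

Working with ``date`` objects instead of building the strings by hand means that month lengths and leap years are handled for you.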

Let's take the list of URLs we had on our cat example page to build a very simple web scraper:



In [None]:
# Get list of links to cat pages
wiki_links = soup.table.find_all("a")
wiki_links = [link["href"] for link in wiki_links]
wiki_links

Suppose you would like to fetch the page title and the number of outgoing links from each of these pages. Let us write a loop that does this:

In [None]:
# Write simple web scraper
L = []

for link in wiki_links:
  resp = requests.get(link).text
  s = BeautifulSoup(resp, "html.parser")

  title = s.title.get_text()
  nr_links = len(s.find_all("a"))

  L.append([title, nr_links])

L

With this, we have already built a simple web scraper!

When you work on larger web scraping projects, **things will often not go so smoothly**. For example, you might run into different types of server (or client) errors when requesting a page. Moreover, different pages are often structured differently and the tags you want to extract may not exist (or contain something else) on some pages. To make your web scraper more robust, you may have to **define conditions or exceptions** to make sure that it can handle all the problems that might occur.
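
As a minimal sketch of what such conditions and exceptions could look like, the two helper functions below handle a failed request and a missing ``title`` element. The function names ``fetch_html`` and ``extract_title`` and the choice to return ``None`` on failure are assumptions made for this illustration; in a real project you may prefer to log errors or retry.

```python
import requests
from bs4 import BeautifulSoup

def fetch_html(url, timeout=10):
    """Return the page's HTML as a string, or None if the request fails."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    except requests.exceptions.RequestException as err:
        print(f"Skipping {url}: {err}")
        return None
    return resp.text

def extract_title(html):
    """Return the text of the title element, or None if there is none."""
    s = BeautifulSoup(html, "html.parser")
    return s.title.get_text() if s.title else None

print(extract_title("<html><head><title>Cat</title></head></html>"))
print(extract_title("<html><body><p>No title here</p></body></html>"))
print(fetch_html("not-a-valid-url"))  # fails before anything is sent, so None is returned
```

In a scraping loop, you would then skip a page whenever ``fetch_html`` returns ``None`` instead of letting the whole program stop at the first error.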



Moreover, it is often a good idea to **divide your project into two separate tasks**:

1. *Data collection*: Fetch the pages and save the data to a file (or several files) on your computer.
2. *Information extraction*: Read the file(s) and extract the information you need from them.

Let's adapt our code to follow this paradigm:

In [None]:
# Retrieve the pages and save them to your computer/drive
for i, link in enumerate(wiki_links):
  resp = requests.get(link).text
  filename = f"wiki_cat_file_{i}.html"

  with open(filename, "w", encoding="utf-8") as file:  # open the file for writing ("w")
    file.write(resp)

# Load data from the files and retrieve elements
L = []
for i in range(len(wiki_links)):
  filename = f"wiki_cat_file_{i}.html"
  with open(filename, encoding="utf-8") as file:  # the with block closes the file again
    s = BeautifulSoup(file, "html.parser")

  title = s.title.get_text()
  nr_links = len(s.find_all("a"))

  L.append([title, nr_links])

L

For a simple example like ours, this does not make much sense and only makes things unnecessarily complicated. But for larger projects where you may run into many kinds of problems and need to handle large amounts of data, this may actually make things easier!

If you feel like going full nerd on a rainy Sunday, check out these two talks about web scraping projects:
* Spiegel Online Mining: https://www.youtube.com/watch?v=-YpwsdRKt8Q
* Deutsche Bahn Mining: https://www.youtube.com/watch?v=0rb9CfOvojk

English versions:
* https://www.youtube.com/watch?v=bYviBstTUwo
* https://www.youtube.com/watch?v=AGCmPLWZKd8

They are fun to watch, but they also show you very typical web scraping pitfalls and solutions (scaling issues, umlauts, legal issues, ...).

### Be nice!

Webscraping allows you to request large numbers of pages within seconds. This can **impose considerable load on the servers** – for them it is as if many people used the pages in their browsers. If things go badly, you can make a server slow down considerably (or even crash).

Many 'free' resources on the web (e.g. Wikipedia) are shared by all of us and their infrastructure needs to be paid for somehow (e.g. by donations). So limit your requests to a polite rate and don't scrape more information than you need to!

Some websites protect themselves against web scraping/web crawling by imposing **rate limits**. For example, you may only be allowed to call a page every 2 seconds. When you get rate limited, you will usually get a status code of ``429`` and not be able to retrieve the page.

You can **make your program scrape more slowly** using the **``sleep``  function** of the ``time`` module. This function works as follows:

In [None]:
import time
print("Printed immediately.")
time.sleep(2)
print("Printed after 2 seconds.")

Let's make our Wikipedia scraper retrieve the pages more slowly:

In [None]:
L = []

for link in wiki_links:
  time.sleep(1) # sleep for one second
  resp = requests.get(link).text
  s = BeautifulSoup(resp, "html.parser")

  title = s.title.get_text()
  nr_links = len(s.find_all("a"))

  L.append([title, nr_links])

L
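
A fixed pause works well in many cases. If a server nevertheless answers with status code ``429``, one possible reaction is to wait and try again. The sketch below illustrates the idea. The function name ``get_politely`` and its parameters are made up for this example, and it assumes that the server's optional ``Retry-After`` header, if present, contains a number of seconds (it can also contain a date, in which case the sketch simply falls back to ``default_wait``).

```python
import time
import requests

def get_politely(url, max_retries=3, default_wait=2.0, fetch=requests.get):
    """Request url; on status code 429, wait and try again up to max_retries times."""
    resp = fetch(url, timeout=10)
    for _ in range(max_retries):
        if resp.status_code != 429:
            break
        # Retry-After may hold a number of seconds; fall back to default_wait otherwise
        retry_after = resp.headers.get("Retry-After", "")
        wait = float(retry_after) if retry_after.isdigit() else default_wait
        time.sleep(wait)
        resp = fetch(url, timeout=10)
    return resp

# Possible usage inside the scraper loop (sends a real request):
# resp = get_politely("https://en.wikipedia.org/wiki/Cat")
```

The ``fetch`` parameter only exists so that the function can be tried out with a stand-in for ``requests.get``; in normal use you can ignore it.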

Finally, there are some websites that disallow web scraping. Many websites have a ``robots.txt`` file that specifies which parts of the site automated programs should not access. For example, this is the ``robots.txt`` file of Wikipedia: https://en.wikipedia.org/robots.txt.

For further details see:
* https://en.wikipedia.org/wiki/Robots_exclusion_standard
* https://developers.google.com/search/docs/advanced/robots/robots_txt
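
Python's standard library can interpret such rules for you through the ``urllib.robotparser`` module. The following sketch parses a small set of made-up rules (not the ones of any real website) and checks which URLs a scraper would be allowed to fetch. The scraper name ``my-scraper`` and the ``example.org`` addresses are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, written the way they would appear in a robots.txt file
rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

allowed_public = rp.can_fetch("my-scraper", "https://example.org/public/page.html")
allowed_private = rp.can_fetch("my-scraper", "https://example.org/private/data.html")
delay = rp.crawl_delay("my-scraper")  # requested pause between requests, if specified

print(allowed_public)
print(allowed_private)
print(delay)
```

To check the rules of a real website, you can call ``rp.set_url(...)`` with the address of its ``robots.txt`` file followed by ``rp.read()`` instead of ``rp.parse(...)``.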

### Other tools and data sources

Apart from ``BeautifulSoup``, there are three other popular web scraping libraries for Python:

* ``Scrapy`` (https://scrapy.org)
* ``Selenium`` (https://selenium-python.readthedocs.io)
* ``Playwright`` (https://playwright.dev)

These will take more time to learn, but depending on your project, it may be worth the effort. Watch this video for a short comparison: https://youtu.be/zucvHSQsKHA

In this tutorial we worked with 'live' versions of webpages. If you are interested in scraping pages over time (especially for the past), you could work with so-called web archives. The following are two well-known archives:

* http://commoncrawl.org
* https://archive.org

Note, however, that accessing the web archives can be rather involved and you may need some time to find out how you can retrieve the archived pages you are interested in.