# Web Scraping (08.11.2022)
by Thomas Jurczyk (Dr. Eberle Zentrum, Universität Tübingen)

# Overview

1. What is web scraping? Why is it important for data science?
2. Prerequisites
3. Ethical/legal questions
4. Web scraping tools in Python
5. Example / Hands-On

# What is web scraping?

Web scraping means collecting data from websites in an automated way. Often, this involves parsing the HTML tree of a page to reach the information you are interested in.

For instance, you might want to check the titles of a news website every day. To do so, you could set up a web scraper that collects the titles once a day and writes them to a CSV file. Note, however, that you need to be aware of legal and ethical aspects. We will talk about that in a moment.
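The daily title check described above could look roughly like the following sketch. It uses only the standard library and a hard-coded HTML snippet, so `SAMPLE_HTML`, the assumption that titles sit in `<h2>` elements, and the helper names `extract_titles` and `write_titles` are all hypothetical. A real scraper would download the page first and adapt the tag to the site's actual markup.

```python
import csv
import io
from datetime import date
from html.parser import HTMLParser

# Hypothetical snippet standing in for a downloaded news front page.
SAMPLE_HTML = """
<html><body>
  <h2 class="title">First headline</h2>
  <p>Some teaser text.</p>
  <h2 class="title">Second headline</h2>
</body></html>
"""


class TitleParser(HTMLParser):
    """Collect the text content of every <h2> element (assumed title tag)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def extract_titles(html):
    """Return the stripped text of all <h2> elements in the given HTML."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


def write_titles(titles, day, out):
    """Write one CSV row (ISO date, title) per title to the open file `out`."""
    writer = csv.writer(out)
    for title in titles:
        writer.writerow([day.isoformat(), title])


titles = extract_titles(SAMPLE_HTML)
buffer = io.StringIO()  # stand-in for a file opened in append mode
write_titles(titles, date(2022, 11, 8), buffer)
print(buffer.getvalue())
```

Storing the date next to each title keeps the CSV file usable when the scraper appends new rows every day.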

Instead of checking a single website for information, you can also set up a so-called **web crawler** that jumps from website to website. For instance, you could check if a website contains a title field, then parse the title, check if the website contains any external links and follow these links to check if the linked websites also include titles, etc.
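The crawling loop described above can be sketched as a breadth-first traversal. To keep the example runnable without network access, it walks a hypothetical in-memory "web" (`FAKE_WEB`, a dict from URL to HTML). The URLs, the `crawl` function, and the `max_pages` limit are illustrative assumptions, not part of any library.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real crawler would download
# each page (for example with requests.get) instead of reading from a dict.
FAKE_WEB = {
    "https://a.example/": (
        "<html><head><title>Page A</title></head><body>"
        '<a href="https://b.example/">B</a>'
        '<a href="https://c.example/">C</a>'
        "</body></html>"
    ),
    "https://b.example/": (
        "<html><head><title>Page B</title></head><body>"
        '<a href="https://a.example/">back to A</a>'
        "</body></html>"
    ),
    "https://c.example/": "<html><body><p>No title here.</p></body></html>",
}


class TitleAndLinkParser(HTMLParser):
    """Record the <title> text (if any) and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.title = ""
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, pages, max_pages=10):
    """Visit pages breadth-first and return a dict of URL -> title (or None)."""
    seen = {start_url}          # URLs already queued, to avoid loops
    queue = deque([start_url])
    titles = {}
    while queue and len(titles) < max_pages:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:        # unknown page: nothing to parse
            continue
        parser = TitleAndLinkParser()
        parser.feed(html)
        parser.close()
        titles[url] = parser.title
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return titles


found = crawl("https://a.example/", FAKE_WEB)
print(found)
```

The `seen` set matters because pages often link back to each other; without it, the crawler would bounce between page A and page B indefinitely.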

## Static vs. dynamic websites

**Static** websites are websites written in HTML that a web server delivers to every user in the same way. This means that whenever you request such a site via a URL, you will receive the same page. **Dynamic websites**, on the other hand, are generated on the fly depending on the user, the time of day, etc. Shopping websites are a good example. The main page of a shopping website might look different to you than to any other user because certain products might appear based on your previous purchases, you might see different ads, etc.

In this course, we will only be dealing with static websites. Even though the structure of a static website might also change (for instance, because it has been replaced by a different version), it is usually much easier to scrape static websites than dynamic websites.

Sometimes, it is not even possible to scrape dynamic websites at all, at least with the tools we are using here. For example, some websites only serve some JavaScript code which is then executed by the browser. If we were to scrape such a website, we would only receive the JavaScript code but not the content we are interested in.
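The difference can be illustrated with two hypothetical pages: one that ships its content as HTML, and one that only ships a script. The sketch below extracts the text a plain HTML parser can see. `STATIC_PAGE`, `JS_ONLY_PAGE`, and `visible_text` are made-up names for this illustration, and the parser is the standard library's `html.parser`, not the tool used in the hands-on part.

```python
from html.parser import HTMLParser

# Hypothetical static page: the content is part of the HTML itself.
STATIC_PAGE = "<html><body><h1>Opening hours</h1><p>Mon-Fri, 9-17</p></body></html>"

# Hypothetical JavaScript-only page: the content would only appear after a
# browser has executed the script.
JS_ONLY_PAGE = (
    '<html><body><div id="app"></div>'
    "<script>document.getElementById('app').textContent = 'Mon-Fri, 9-17';</script>"
    "</body></html>"
)


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the non-empty text fragments a parser finds outside scripts."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


print(visible_text(STATIC_PAGE))   # text is present in the HTML
print(visible_text(JS_ONLY_PAGE))  # nothing to scrape without running the script
```

An empty result for a page that clearly shows content in the browser is a typical sign that the content is loaded by JavaScript.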

# Why is it important for data science?

Overall, web scraping should only be the second-best option if you want to collect data from the web. There are simply too many pitfalls:

* legal/ethical issues
* malformed HTML
* dynamic websites/JavaScript
* changing HTML tree structures of the websites you are interested in
* etc.

If there is an API, always use the API. If there is no API, first make sure that you are allowed to acquire the data, and only then program a web scraper.
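One technical signal that can feed into this check is a site's `robots.txt` file, which Python can evaluate with `urllib.robotparser` from the standard library. The sketch below parses a hypothetical `robots.txt` from a string instead of downloading one. The rules, the user agent name `course-scraper`, and the `example.org` URLs are assumptions for illustration. Passing this check is not the same as having legal permission; a site's terms of use and applicable law still apply.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network call."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = build_rules(ROBOTS_TXT)
AGENT = "course-scraper"  # hypothetical user agent name

allowed_public = rules.can_fetch(AGENT, "https://example.org/scraping/info.html")
allowed_private = rules.can_fetch(AGENT, "https://example.org/private/data.html")
delay = rules.crawl_delay(AGENT)  # requested pause between requests, if any

print(allowed_public, allowed_private, delay)
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads and parses the file instead of `parse()`.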

# Prerequisites

# Ethical/Legal questions

# Web scraping tools in Python

## Requests

## Beautiful Soup

# Example / Hands-On

In [2]:
import requests
from bs4 import BeautifulSoup

In [4]:
WEBSITE = "https://tjurczyk.de/scraping/info.html"

In [9]:
page = requests.get(WEBSITE)

In [10]:
page.ok

True

In [11]:
soup = BeautifulSoup(page.content, "html.parser")

In [24]:
results = soup.find_all("p")

In [26]:
for res in results:
    # Collect all <a> tags inside this paragraph; an empty result is falsy.
    links = res.find_all("a")
    if links:
        print(links)

[<a href="https://www.projekt-gutenberg.org/sinsheim/shylock/shylock.html">Projekt Gutenberg</a>]
[<a href="https://www.merriam-webster.com/" target="_blank">Meriam Webster</a>]


In [30]:
soup.select(".para-text")

[<p class="para-text">Ich habe dieses Buch in den Jahren 1936 und 1937 geschrieben – in einer Welt, die es heute
         nicht mehr gibt: in der ehemaligen Hauptstadt des Deutschen Reiches, Berlin, in der Welt der Nazis, der
         Konzentrationslager und Pogrome, der Folterungen und Morde. Kein Wunder, daß dieses Buch, wenn es jetzt
         erscheint, bereits eine Geschichte hinter sich hat, die zu erzählen sich lohnt.</p>,
 <p class="para-text">Ich hatte gerade einen zwei Jahre währenden Kampf um die Befreiung eines älteren Bruders hinter
         mir, den das Nazi-Regime wegen angeblichen Landesverrats eingesperrt hatte. Der von den Nazis im April 1934
         eingesetzte Volksgerichtshof üblen Angedenkens sprach meinen Bruder so gut wie frei. Ja, jener Volksgerichtshof
         tat sogar das Seine, um ihn vor der Gestapo zu bewahren – am 22. Dezember 1935! Ich werde diesen Tag so wie die
         vorausgegangenen zwei Jahre des Kampfes nie vergessen. <a href="https://www.merri

# Literature