Web Scraping
============

- HTTP requests
- HTML, XML, DOM, CSS selectors, XPath
- browser automation
- cleanse and export extracted data

Web-based (or browser-based) user interfaces are ubiquitous:

- web browser as universal platform to run software (at least, the user interface)
- if a human can access information on the WWW using a web browser,
  a computer program can also access the same information and extract it automatically
- challenges: navigate a web page, execute user interaction (mouse clicks, forms)
- real challenges: login forms, captchas, IP blocking, etc.
  - not covered here
  - also: ethical considerations about whether to circumvent access blocking
- well-defined technology stack
  - HTTP
  - HTML / XML
  - DOM
  - CSS
  - XPath
  - JavaScript


## Web Browser

- render HTML page to make it readable for humans
- basic navigation in the WWW (follow links)
- [text-based browsers](https://en.wikipedia.org/wiki/Text-based_web_browser)
  ```
  lynx https://www.bundestag.de/parlament/fraktionen/cducsu
  ```
- modern graphical browsers
  - interpret JavaScript
  - show multi-media content
  - run “web applications”
- headless vs. headful browsers
  - headful: graphical user interface attached
  - [headless](https://en.wikipedia.org/wiki/Headless_browser)
    - controlled programmatically or via command-line
    - interaction but no mandatory page rendering (saves resources: CPU, RAM)

### Tip: Extract Text and Links Using a Text-Based Browser

Text-based browsers usually have an option to "dump" the rendered text and/or the list of links to standard output, which can be redirected into a file, e.g.

```
lynx -dump https://www.bundestag.de/parlament/fraktionen/cducsu \
    >data/bundestag/fraktionen.cducsu.txt
```
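
The same kind of dump can be approximated in Python with the standard library only. The following is a minimal sketch, not a replacement for `lynx`: the class name `TextAndLinkDumper`, the sample page and the base URL `https://example.org/start` are made up for illustration, and no HTTP request is sent.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TextAndLinkDumper(HTMLParser):
    """Collect visible text and link targets, roughly like `lynx -dump`."""

    SKIPPED = {'script', 'style'}  # content of these elements is not visible

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href:
                # resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep non-empty text outside of <script> and <style>
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


# hypothetical sample page, used instead of a live HTTP request
sample_html = """
<html><head><title>Example</title><style>p {color: red}</style></head>
<body><p>Hello <a href="/about">about us</a></p>
<script>console.log("ignored")</script></body></html>
"""

dumper = TextAndLinkDumper('https://example.org/start')
dumper.feed(sample_html)
print('\n'.join(dumper.text_parts))
print(dumper.links)
```

Unlike a text-based browser, this sketch does no layout at all; it only separates text and links from the markup.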

### Tip: Explore Web Pages and Web Technologies Using the Developer Tools of Your Web Browser

Modern web browsers (Firefox, Chromium, IE, etc.) include a set of [web development tools](https://en.wikipedia.org/wiki/Web_development_tools). Originally intended for web developers to test and debug the code (HTML, CSS, JavaScript) of a web site, these tools are also the easiest way to explore and understand the technologies a web site is built with. This initial exploration later helps when scraping data from the site.

### Browser Automation

- load a page by URL including page dependencies (CSS, JavaScript, images, media)
- simulate user interaction (clicks, input, scrolling)
- take screenshots
- access the DOM tree or the HTML modified by executed JavaScript and user interactions
  from/in the browser to extract data

## Process HTML Pages in Python

- [requests](https://pypi.org/project/requests/) to fetch pages via HTTP
- [beautifulsoup](https://pypi.org/project/beautifulsoup4/) to parse HTML

In [1]:
import requests

request_url = 'https://www.bundestag.de/parlament/fraktionen/cducsu'
response = requests.get(request_url)

response

<Response [200]>

In [2]:
response.headers

{'date': 'Tue, 13 Jul 2021 13:51:15 GMT', 'content-type': 'text/html;charset=UTF-8', 'content-length': '23620', 'vary': 'Origin,Access-Control-Request-Method,Access-Control-Request-Headers,Accept-Encoding', 'x-content-type-options': 'nosniff', 'x-xss-protection': '1; mode=block', 'x-frame-options': 'DENY', 'content-language': 'de', 'content-encoding': 'gzip', 'x-ua-compatible': 'IE=edge', 'x-varnish': '73764308 72440053', 'age': '771', 'via': '1.1 Varnish', 'accept-ranges': 'bytes', 'cache-control': 'max-age=900', 'strict-transport-security': 'max-age=604800'}

In [3]:
response.status_code

200

In [4]:
!pip install beautifulsoup4



In [5]:
from bs4 import BeautifulSoup

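# name the parser explicitly: without it, beautifulsoup picks the "best"
# installed parser and emits a warning
html = BeautifulSoup(response.text, 'html.parser')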

html.head.title  # tree-style path addressing of HTML elements

<title>Deutscher Bundestag - CDU/CSU-Fraktion</title>

Note: the HTML document can be represented as a tree structure, also known as the [DOM tree](https://en.wikipedia.org/wiki/Document_Object_Model):
```
html
├── head
│   ├── meta
│   │   └── @charset=utf-8
│   └── title
│       └── ...(text)
└── body
    └── ...
```

The tree above is an equivalent representation of the HTML snippet:

```html
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8" />
  <title>Deutscher Bundestag - CDU/CSU-Fraktion</title>
</head>
<body>
  ...
</body>
</html>
```
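
To make the correspondence between markup and tree concrete, the following sketch walks the parsed snippet recursively and emits one indented line per element, attribute and text node. The helper name `tree_lines` is made up for this illustration; the body is left empty because its content is elided above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = """<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8" />
  <title>Deutscher Bundestag - CDU/CSU-Fraktion</title>
</head>
<body>
</body>
</html>"""


def tree_lines(node, depth=0):
    """Return one indented line per element, attribute and text node."""
    lines = ['    ' * depth + node.name]
    for key, value in node.attrs.items():
        lines.append('    ' * (depth + 1) + f'@{key}={value}')
    for child in node.children:
        if isinstance(child, Tag):
            # element node: descend one level
            lines.extend(tree_lines(child, depth + 1))
        elif child.strip():
            # text node: skip whitespace-only strings between tags
            lines.append('    ' * (depth + 1) + '(text) ' + child.strip())
    return lines


soup = BeautifulSoup(snippet, 'html.parser')
print('\n'.join(tree_lines(soup.html)))
```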

In [6]:
# access the plain text of an HTML element
# (inside the opening and closing tag)
html.head.title.text

'Deutscher Bundestag - CDU/CSU-Fraktion'

In [7]:
# beautifulsoup also allows selecting elements by tag name without a tree-like path

html.find('title').text

'Deutscher Bundestag - CDU/CSU-Fraktion'

In [8]:
# or if a tag is expected to appear multiple times:
# select all `a` elements and show the first three
html.findAll('a')[0:3]

[<a class="sr-only sr-only-focusable" href="#main" title="Direkt zum Hauptinhalt springen">Direkt zum Hauptinhalt springen</a>,
 <a class="sr-only sr-only-focusable" href="#main-menu" title="Direkt zum Hauptmenü springen">Direkt zum Hauptmenü springen</a>,
 <a href="https://www.bundestag.de/webarchiv" hreflang="de" lang="de" title="Archiv" xml:lang="de">
 <span class="sr-only-sm-down">Archiv</span>
 <span class="visible-xs-inline">Archiv</span>
 </a>]

In [9]:
# selection by CSS class name
html.find(class_='bt-standard-content')

<article class="bt-artikel col-xs-12 col-md-6 bt-standard-content">
<h3 class="bt-artikel__title">CDU/CSU-Fraktion</h3>
<div class="bt-bild-standard bt-bild-max" data-nosnippet="true">
<img alt="Beschilderung einer Tür im Bereich der CDU/CSU-Fraktion auf der Fraktionsebene im Reichstagsgebäude." class="img-responsive" data-img-md-normal="/resource/image/221096/16x9/570/321/531a1f0d2ca39975478dc1986d66ca19/BV/cducsu_logo.jpg" data-img-md-retina="/resource/image/221096/16x9/1140/642/531a1f0d2ca39975478dc1986d66ca19/qB/cducsu_logo.jpg" data-img-sm-normal="/resource/image/221096/16x9/926/522/531a1f0d2ca39975478dc1986d66ca19/CJ/cducsu_logo.jpg" data-img-sm-retina="/resource/image/221096/16x9/1852/1044/531a1f0d2ca39975478dc1986d66ca19/lc/cducsu_logo.jpg" data-img-xs-normal="/resource/image/221096/16x9/730/411/531a1f0d2ca39975478dc1986d66ca19/mf/cducsu_logo.jpg" data-img-xs-retina="/resource/image/221096/16x9/1460/822/531a1f0d2ca39975478dc1986d66ca19/YO/cducsu_logo.jpg" src="/resource/blob/45

In [10]:
html.find(class_='bt-standard-content').findAll('a')

[<a href="/abgeordnete/biografien/B/brinkhaus_ralph-518692" target="_self">Ralph Brinkhaus</a>,
 <a href="/abgeordnete/biografien/D/dobrindt_alexander-519076" target="_self">Alexander Dobrindt</a>,
 <a href="/abgeordnete/biografien/C/connemann_gitta-518902" target="_self">Gitta Connemann</a>,
 <a href="/abgeordnete/biografien/F/frei_thorsten-519532" rel="noopener" target="_blank">Thorsten Frei</a>,
 <a href="/abgeordnete/biografien/G/groehe_hermann-519870" target="_self">Hermann Gröhe</a>,
 <a href="https://www.bundestag.de/webarchiv/abgeordnete/biografien18/J/jung_andreas-258538" target="_self">Andreas Jung</a>,
 <a href="/abgeordnete/biografien/L/lange_ulrich-521486" target="_self">Ulrich Lange</a>,
 <a href="/abgeordnete/biografien/L/leikert_katja-521554" target="_self">Dr. Katja Leikert</a>,
 <a href="/abgeordnete/biografien/L/linnemann_carsten-521654" target="_self">Dr. Carsten Linnemann</a>,
 <a href="/abgeordnete/biografien/S/schoen_nadine-523428" target="_self">Nadine Schön</a>

In [11]:
# but we are also interested in the function of the members:
html.find(class_='bt-standard-content').findAll('h4')

[<h4>Fraktionsvorsitzender:</h4>,
 <h4>Erster Stellvertretender Fraktionsvorsitzender:</h4>,
 <h4>Stellvertretende Fraktionsvorsitzende:</h4>,
 <h4>Erster Parlamentarischer Geschäftsführer:</h4>,
 <h4>Stellvertreter des Ersten Parlamentarischen Geschäftsführers:</h4>,
 <h4>Parlamentarische Geschäftsführer:</h4>,
 <h4>Sprecher der CDU-Landesgruppen:</h4>,
 <h4>Vorsitzende der Arbeitsgruppen/Sprecher/Obleute:</h4>,
 <h4>Vorsitzende der sechs soziologischen Gruppen:</h4>,
 <h4>Beisitzer:</h4>]
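
The selections above can also be written as CSS selectors, which beautifulsoup supports via `select()` and `select_one()`. The sketch below runs on a heavily simplified, hypothetical stand-in for the faction page (the `<footer>` link in particular is invented), so it works without a network connection.

```python
from bs4 import BeautifulSoup

# hypothetical, heavily simplified stand-in for the faction page
sample_html = """
<article class="bt-artikel bt-standard-content">
  <h3 class="bt-artikel__title">CDU/CSU-Fraktion</h3>
  <h4>Fraktionsvorsitzender:</h4>
  <p><a href="/abgeordnete/biografien/B/brinkhaus_ralph-518692">Ralph Brinkhaus</a></p>
</article>
<footer><a href="/impressum">Impressum</a></footer>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# `.bt-standard-content a`: all <a> descendants of elements with that class
names = [a.get_text() for a in soup.select('.bt-standard-content a')]

# `article > h4`: <h4> elements that are direct children of <article>
roles = [h4.get_text() for h4 in soup.select('article > h4')]

# select_one() returns the first match (or None if nothing matches)
title = soup.select_one('h3.bt-artikel__title').get_text()

print(names, roles, title)
```

A single selector string expresses what otherwise takes chained `find` calls, and the same selector can be tried out first in the developer tools of the browser.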

In [12]:
from urllib.parse import urljoin

for role_node in html.find(class_='bt-standard-content').findAll('h4'):

    role = role_node.text.rstrip(':')

    for link_node in role_node.next_sibling.findAll('a'):
        name = link_node.text
        link = urljoin(request_url, link_node.get('href'))
        print(role, name, link)


Fraktionsvorsitzender Ralph Brinkhaus https://www.bundestag.de/abgeordnete/biografien/B/brinkhaus_ralph-518692
Erster Stellvertretender Fraktionsvorsitzender Alexander Dobrindt https://www.bundestag.de/abgeordnete/biografien/D/dobrindt_alexander-519076
Stellvertretende Fraktionsvorsitzende Gitta Connemann https://www.bundestag.de/abgeordnete/biografien/C/connemann_gitta-518902
Stellvertretende Fraktionsvorsitzende Thorsten Frei https://www.bundestag.de/abgeordnete/biografien/F/frei_thorsten-519532
Stellvertretende Fraktionsvorsitzende Hermann Gröhe https://www.bundestag.de/abgeordnete/biografien/G/groehe_hermann-519870
Stellvertretende Fraktionsvorsitzende Andreas Jung https://www.bundestag.de/webarchiv/abgeordnete/biografien18/J/jung_andreas-258538
Stellvertretende Fraktionsvorsitzende Ulrich Lange https://www.bundestag.de/abgeordnete/biografien/L/lange_ulrich-521486
Stellvertretende Fraktionsvorsitzende Dr. Katja Leikert https://www.bundestag.de/abgeordnete/biografien/L/leikert_katja
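
The loop above relies on `next_sibling`, which returns the immediately following node. Depending on the markup this can be a whitespace text node rather than an element. A slightly more defensive variant, sketched here on hypothetical markup (the assumption is that each `<h4>` is followed by a `<p>` holding the links), uses `find_next_sibling()` and skips headings without members:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = 'https://www.bundestag.de/parlament/fraktionen/cducsu'

# hypothetical markup: each <h4> role heading is followed by a <p> with links
sample_html = """
<article class="bt-standard-content">
  <h4>Fraktionsvorsitzender:</h4>
  <p><a href="/abgeordnete/biografien/B/brinkhaus_ralph-518692">Ralph Brinkhaus</a></p>
  <h4>Beisitzer:</h4>
</article>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
rows = []
for role_node in soup.find(class_='bt-standard-content').find_all('h4'):
    role = role_node.get_text().strip().rstrip(':')
    # find_next_sibling('p') skips whitespace text nodes between the tags
    links_node = role_node.find_next_sibling('p')
    if links_node is None:
        continue  # heading without a following list of members
    for link_node in links_node.find_all('a'):
        # urljoin resolves the relative link against the page URL
        link = urljoin(base_url, link_node.get('href'))
        rows.append((role, link_node.get_text(), link))

print(rows)
```

Note that `find_next_sibling('p')` returns the next `<p>` on the same level even if another heading lies in between, so the sketch only fits pages where every heading with members has its own `<p>`.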

Now we put everything together, so that we can run this for all factions of the parliament:
- we use a function to
  - fetch the page of the faction and
  - extract the members from the page content
- iterate over all factions
- store the list of faction roles and MPs in a data frame and CSV

In [13]:
%%script false --no-raise-error
# remove or comment out the instruction above to run this code
# note: the cell is not run by default
# because sending 6 HTTP requests (with pauses in between) takes a while

import requests
from time import sleep
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import pandas as pd

request_base_url = 'https://www.bundestag.de/parlament/fraktionen/'
factions = 'cducsu spd fdp linke gruene afd'.split()

def get_members_of_faction(faction):
    url = request_base_url + faction

    response = requests.get(url)
    if not response.ok:
        # return an empty list so that the caller can always extend its result
        return []

    result = []

    html = BeautifulSoup(response.text, 'html.parser')

    for role_node in html.find(class_='bt-standard-content').findAll('h4'):

        role = role_node.text.strip().rstrip(':')

        if not role_node.next_sibling:
            continue

        for link_node in role_node.next_sibling.findAll('a'):
            name = link_node.text
            link = urljoin(url, link_node.get('href'))
            result.append([name, faction, role, link])

    return result


faction_roles = []

for faction in factions:
    if faction_roles:
        # be polite and wait before the next request
        sleep(5)
    faction_roles += get_members_of_faction(faction)

df_faction_roles = pd.DataFrame(faction_roles, columns=['name', 'faction', 'role', 'link'])

df_faction_roles.to_csv('data/bundestag/faction_roles.csv')

In [14]:
import pandas as pd

df_faction_roles = pd.read_csv('data/bundestag/faction_roles.csv')
df_faction_roles.value_counts('faction')

faction
cducsu    64
spd       37
linke     22
gruene    12
afd       12
fdp       11
dtype: int64

In [15]:
# only MPs who hold a role in their faction are listed on the
# landing page of the faction, so the table does not cover all MPs
df_faction_roles.shape

(158, 5)
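
The shape reports five columns although the data frame was created with four. The extra column is the row index, which `to_csv` writes by default and `read_csv` reads back as an unnamed column. A minimal sketch with a one-row toy frame (not the real data) shows the effect and how `index=False` avoids it:

```python
import io

import pandas as pd

df = pd.DataFrame([['Ralph Brinkhaus', 'cducsu']], columns=['name', 'faction'])

# default: the row index is written as an unnamed first column
csv_with_index = df.to_csv()
columns_with_index = pd.read_csv(io.StringIO(csv_with_index)).columns.tolist()
print(columns_with_index)

# index=False: only the data columns are written
csv_without_index = df.to_csv(index=False)
columns_without_index = pd.read_csv(io.StringIO(csv_without_index)).columns.tolist()
print(columns_without_index)
```

Alternatively, an index column that is already in the file can be restored on reading with `pd.read_csv(..., index_col=0)`.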

In [16]:
df_faction_roles[df_faction_roles['role'].str.startswith('Fraktionsvorsitz')]

Unnamed: 0.1,Unnamed: 0,name,faction,role,link
0,0,Ralph Brinkhaus,cducsu,Fraktionsvorsitzender,https://www.bundestag.de/abgeordnete/biografie...
64,64,Rolf Mützenich,spd,Fraktionsvorsitzender,https://www.bundestag.de/abgeordnete/biografie...
101,101,Christian Lindner,fdp,Fraktionsvorsitzender,https://www.bundestag.de/abgeordnete/biografie...
112,112,Amira Mohamed Ali,linke,Fraktionsvorsitzende,https://www.bundestag.de/abgeordnete/biografie...
113,113,Dr. Dietmar Bartsch,linke,Fraktionsvorsitzende,https://www.bundestag.de/abgeordnete/biografie...
134,134,Katrin Göring-Eckardt,gruene,Fraktionsvorsitzende,https://www.bundestag.de/abgeordnete/biografie...
135,135,Dr. Anton Hofreiter,gruene,Fraktionsvorsitzende,https://www.bundestag.de/abgeordnete/biografie...
146,146,Dr. Alexander Gauland,afd,Fraktionsvorsitzende,https://www.bundestag.de/abgeordnete/biografie...
147,147,Dr. Alice Weidel,afd,Fraktionsvorsitzende,https://www.bundestag.de/abgeordnete/biografie...


In [17]:
# now let's try whether we can fetch the biography and other information of a single MP

member_url = df_faction_roles.loc[df_faction_roles['name']=='Andreas Jung','link'].values[0]
member_response = requests.get(member_url)
member_html = BeautifulSoup(member_response.text, 'html.parser')

# let's first try the CSS class "bt-standard-content"
for node in member_html.findAll(class_='bt-standard-content'):
    print(node.text)




Abgeordnetenbüro
Deutscher BundestagPlatz der Republik 111011 Berlin









Geboren am 13. Mai 1975 in Freiburg im Breisgau; aufgewachsen in Stockach am Bodensee; katholisch; verheiratet; ein Kind.1981 bis 1985 Grundschule Stockach; 1985 bis 1994 Nellenburggymnasium Stockach; 1994 bis 2000 Studium der Rechtswissenschaft an der Universität Konstanz; Juli 2000 Erstes juristisches Staatsexamen; 2000 bis 2002 Rechtsreferendariat am Landgericht Freiburg; Oktober 2002 Zweites juristisches Staatsexamen in Freiburg; Februar 2003 Zulassung als Rechtsanwalt.1990 bis 2010 Mitglied der Jungen Union; 1991 bis 1993 Ortsvorsitzender der Jungen Union Stockach; 1993 bis 1999 Kreisvorsitzender der Jungen Union Konstanz; 2000 bis 2002 Mitglied im Bundesvorstand der Jungen Union Deutschlands; Umweltpolitischer Sprecher; 2002 bis 2006 Bezirksvorsitzender der Jungen Union Südbaden.Seit 1993 Mitglied der CDU; 1995 bis 2011 gewähltes Mitglied im Kreisvorstand des CDU Kreisverbandes Konstanz; 2007 bis 201

### Automatic Cleansing of Text

Naively extracting all text in the body of a web page would include a lot of unwanted content (navigation menus, header, footer, side bars); the "main" content may be only a small part in the middle of the page. There are heuristics and algorithms for the automatic removal of such "boilerplate" content:

- Mozilla Readability: the [reader view](https://support.mozilla.org/en-US/kb/firefox-reader-view-clutter-free-web-pages) of the Firefox browser
  - originally implemented in JavaScript, see [Readability.js](https://github.com/mozilla/readability/blob/master/Readability.js)
  - there is also a Python wrapper, [ReadabiliPy](https://github.com/alan-turing-institute/ReadabiliPy) or [ReadabiliPy on pypi](https://pypi.org/project/readabilipy/), which calls Readability.js via Node.js or falls back to a pure-Python extractor
- [jusText](https://nlp.fi.muni.cz/projects/justext/) or [jusText on pypi](https://pypi.org/project/jusText/)
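
To illustrate the idea only, here is a deliberately naive heuristic: drop elements that by convention hold navigation or page furniture and keep the remaining text. This is not what Readability or jusText do (both use more elaborate scoring), and the tag list and the sample page are assumptions made for this sketch.

```python
from bs4 import BeautifulSoup

# hypothetical page with navigation, main content and footer
sample_html = """
<html><body>
  <nav><a href="/">Home</a> <a href="/news">News</a></nav>
  <main><p>Geboren am 13. Mai 1975 in Freiburg im Breisgau.</p></main>
  <footer>Impressum</footer>
</body></html>
"""

# assumption: these elements never contain the "main" content
BOILERPLATE_TAGS = ['nav', 'header', 'footer', 'aside', 'script', 'style']

soup = BeautifulSoup(sample_html, 'html.parser')

# remove one matching element at a time (including everything inside it)
# until none is left; this also copes with nested boilerplate elements
node = soup.find(BOILERPLATE_TAGS)
while node is not None:
    node.decompose()
    node = soup.find(BOILERPLATE_TAGS)

main_text = soup.get_text(separator=' ', strip=True)
print(main_text)
```

Real pages often do not use these semantic elements consistently, which is why the dedicated libraries look at text length, link density and similar signals instead.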

Here is an example of using ReadabiliPy on the most recently fetched page (without any manual selection of elements by CSS class):

In [18]:
!pip install readabilipy

from readabilipy import simple_json_from_html_string

article = simple_json_from_html_string(member_response.text, use_readability=True)



In [19]:
for paragraph in article['plain_text']:
    print(paragraph['text'])
    print()

Geboren am 13. Mai 1975 in Freiburg im Breisgau; aufgewachsen in Stockach am Bodensee; katholisch; verheiratet; ein Kind.

1981 bis 1985 Grundschule Stockach; 1985 bis 1994 Nellenburggymnasium Stockach; 1994 bis 2000 Studium der Rechtswissenschaft an der Universität Konstanz; Juli 2000 Erstes juristisches Staatsexamen; 2000 bis 2002 Rechtsreferendariat am Landgericht Freiburg; Oktober 2002 Zweites juristisches Staatsexamen in Freiburg; Februar 2003 Zulassung als Rechtsanwalt.

1990 bis 2010 Mitglied der Jungen Union; 1991 bis 1993 Ortsvorsitzender der Jungen Union Stockach; 1993 bis 1999 Kreisvorsitzender der Jungen Union Konstanz; 2000 bis 2002 Mitglied im Bundesvorstand der Jungen Union Deutschlands; Umweltpolitischer Sprecher; 2002 bis 2006 Bezirksvorsitzender der Jungen Union Südbaden.

Seit 1993 Mitglied der CDU; 1995 bis 2011 gewähltes Mitglied im Kreisvorstand des CDU Kreisverbandes Konstanz; 2007 bis 2011 Kreisvorsitzender des CDU Kreisverbandes Konstanz; seit 2001 Mitglied im 

In [20]:
# but there's also a "readable" and simple HTML snippet
# (shown as rendered HTML in the output)
from IPython.display import HTML

HTML(article['plain_content'])

## Processing XML

The [Open Data](https://www.bundestag.de/services/opendata) portal of the German parliament offers a zip file "Stammdaten aller Abgeordneten seit 1949 im XML-Format (Stand 12.03.2021)" for free download. This is most likely the better source for information about all MPs. But how do we process XML?

Assuming the zip archive has been downloaded and unzipped, and the files are all placed in `data/bundestag/`, we can simply read the file and pass it to beautifulsoup, which will parse it. We request a specific parser feature (`lxml-xml`) so that the casing of XML element names is preserved.

In [21]:
from bs4 import BeautifulSoup

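# open in binary mode so that the parser can honour the encoding
# declared in the XML file; the file is closed after parsing
with open('data/bundestag/MDB_STAMMDATEN.XML', 'rb') as xml_file:
    xml = BeautifulSoup(xml_file, features='lxml-xml')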

xml.MDB

<MDB>
<ID>11000001</ID>
<NAMEN>
<NAME>
<NACHNAME>Abelein</NACHNAME>
<VORNAME>Manfred</VORNAME>
<ORTSZUSATZ/>
<ADEL/>
<PRAEFIX/>
<ANREDE_TITEL>Dr.</ANREDE_TITEL>
<AKAD_TITEL>Prof. Dr.</AKAD_TITEL>
<HISTORIE_VON>19.10.1965</HISTORIE_VON>
<HISTORIE_BIS/>
</NAME>
</NAMEN>
<BIOGRAFISCHE_ANGABEN>
<GEBURTSDATUM>20.10.1930</GEBURTSDATUM>
<GEBURTSORT>Stuttgart</GEBURTSORT>
<GEBURTSLAND/>
<STERBEDATUM>17.01.2008</STERBEDATUM>
<GESCHLECHT>männlich</GESCHLECHT>
<FAMILIENSTAND>keine Angaben</FAMILIENSTAND>
<RELIGION>katholisch</RELIGION>
<BERUF>Rechtsanwalt, Wirtschaftsprüfer, Universitätsprofessor</BERUF>
<PARTEI_KURZ>CDU</PARTEI_KURZ>
<VITA_KURZ/>
<VEROEFFENTLICHUNGSPFLICHTIGES/>
</BIOGRAFISCHE_ANGABEN>
<WAHLPERIODEN>
<WAHLPERIODE>
<WP>5</WP>
<MDBWP_VON>19.10.1965</MDBWP_VON>
<MDBWP_BIS>19.10.1969</MDBWP_BIS>
<WKR_NUMMER>174</WKR_NUMMER>
<WKR_NAME/>
<WKR_LAND>BWG</WKR_LAND>
<LISTE/>
<MANDATSART>Direktwahl</MANDATSART>
<INSTITUTIONEN>
<INSTITUTION>
<INSART_LANG>Fraktion/Gruppe</INSART_LANG>
<INS_L

In [22]:
len(xml.findAll('MDB'))

4089
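
XPath, listed at the top but not used so far, offers another way to address elements. The standard library module `xml.etree.ElementTree` supports a limited XPath subset. The sketch below runs on a tiny excerpt in the structure shown above; the root element name `DOCUMENT` and the second record are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# hypothetical two-record excerpt in the structure of MDB_STAMMDATEN.XML
sample_xml = """<DOCUMENT>
  <MDB>
    <ID>11000001</ID>
    <NAMEN><NAME><NACHNAME>Abelein</NACHNAME><VORNAME>Manfred</VORNAME></NAME></NAMEN>
  </MDB>
  <MDB>
    <ID>99999999</ID>
    <NAMEN><NAME><NACHNAME>Muster</NACHNAME><VORNAME>Erika</VORNAME></NAME></NAMEN>
  </MDB>
</DOCUMENT>"""

root = ET.fromstring(sample_xml)

# './/MDB' selects all MDB elements at any depth below the root
mdb_count = len(root.findall('.//MDB'))

# steps separated by '/' descend the tree one level at a time
last_names = [e.text for e in root.findall('./MDB/NAMEN/NAME/NACHNAME')]

# a predicate in [...] filters elements by the text of a child element
first_names = [e.text for e in root.findall(".//NAME[NACHNAME='Abelein']/VORNAME")]

print(mdb_count, last_names, first_names)
```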

In [23]:
from collections import Counter

mp_acad_title = Counter()

mp_with_acad_title, mp_total = 0, 0

for mp in xml.findAll('MDB'):

    mp_total += 1

    has_academic_title = False
    for nn in mp.findAll("NAME"):
        if nn.AKAD_TITEL.text:
            has_academic_title = True
            mp_acad_title[nn.AKAD_TITEL.text] += 1

    if has_academic_title:
        # count the MP only once (in case of multiple names)
        mp_with_acad_title += 1

mp_with_acad_title / mp_total

0.2582538517975055

In [24]:
mp_acad_title.most_common()

[('Dr.', 930),
 ('Prof. Dr.', 81),
 ('Dr. h. c.', 42),
 ('Dr. Dr. h. c.', 17),
 ('Prof.', 13),
 ('Dr. - Ing.', 11),
 ('Prof. Dr. h. c.', 3),
 ('Dipl. - Ing.', 3),
 ('Dr. Dr.', 3),
 ('Prof. Dr. Dr. h. c.', 3),
 ('Dr. - Ing. e. h.', 2),
 ('Prof. Dr. Dr.', 2),
 ('Dr. - Ing. Dr. h. c.', 1),
 ('Prof. h. c.', 1),
 ('Prof. Dr. - Ing.', 1),
 ('Dr. h. c. Dr. - Ing. e. h.', 1),
 ('Dr. - Ing. Dr. - Ing. e. h. Dr. h. c.', 1),
 ('Dr. h. c. Dr. e. h.', 1),
 ('Prof. h. c. Dr.', 1),
 ('Dr. h. c. (Univ Kyiv)', 1),
 ('HonD', 1),
 ('Dr. h. c. (NUACA)', 1)]

A final note: reading the XML file describing the members of the German parliament into a tabular data structure is painful (as with JSON data sources) because of
- the nested structure
- some list-like data, for example the fact that one MP can have multiple names
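
One possible way to deal with both points in Python is to decide what a row stands for, for example one row per MP and name, and to repeat the MP's ID on every row. The following sketch does this for a tiny hypothetical excerpt (root element name and second record are invented); the choice of columns is an assumption as well.

```python
import csv
import io
import xml.etree.ElementTree as ET

# hypothetical excerpt: the second MP has two names
sample_xml = """<DOCUMENT>
  <MDB>
    <ID>11000001</ID>
    <NAMEN>
      <NAME><NACHNAME>Abelein</NACHNAME><VORNAME>Manfred</VORNAME><AKAD_TITEL>Prof. Dr.</AKAD_TITEL></NAME>
    </NAMEN>
  </MDB>
  <MDB>
    <ID>99999999</ID>
    <NAMEN>
      <NAME><NACHNAME>Muster</NACHNAME><VORNAME>Erika</VORNAME><AKAD_TITEL/></NAME>
      <NAME><NACHNAME>Beispiel</NACHNAME><VORNAME>Erika</VORNAME><AKAD_TITEL>Dr.</AKAD_TITEL></NAME>
    </NAMEN>
  </MDB>
</DOCUMENT>"""

fields = ['NACHNAME', 'VORNAME', 'AKAD_TITEL']
rows = []
for mdb in ET.fromstring(sample_xml).iter('MDB'):
    mp_id = mdb.findtext('ID')
    for name in mdb.iter('NAME'):
        # one row per name; missing elements yield None, so fall back to ''
        rows.append([mp_id] + [name.findtext(field) or '' for field in fields])

# write the flattened rows as CSV (to a string buffer in this sketch)
out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
writer.writerow(['ID'] + fields)
writer.writerows(rows)
print(out.getvalue())
```

Further nested lists, such as the legislative periods, would need their own table linked by the ID, which is what makes the conversion tedious.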

Instead of coding the conversion in Python, one can use [XSLT](https://en.wikipedia.org/wiki/XSLT), a dedicated language for transforming XML documents into other document formats.
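
As a rough sketch of what this looks like, a small XSLT 1.0 stylesheet can be applied from Python with `lxml` (already required for the `lxml-xml` parser). The stylesheet, the sample document and its root element name are made up for illustration; the stylesheet emits one comma-separated line per name.

```python
from lxml import etree

# hypothetical two-record excerpt in the structure of MDB_STAMMDATEN.XML
sample_xml = """<DOCUMENT>
  <MDB>
    <ID>11000001</ID>
    <NAMEN><NAME><NACHNAME>Abelein</NACHNAME><VORNAME>Manfred</VORNAME></NAME></NAMEN>
  </MDB>
  <MDB>
    <ID>99999999</ID>
    <NAMEN><NAME><NACHNAME>Muster</NACHNAME><VORNAME>Erika</VORNAME></NAME></NAMEN>
  </MDB>
</DOCUMENT>"""

# stylesheet: for every NAME, output "ID,NACHNAME,VORNAME" and a line break
stylesheet = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//MDB/NAMEN/NAME">
      <xsl:value-of select="../../ID"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="NACHNAME"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="VORNAME"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>"""

# compile the stylesheet once, then apply it to the parsed document
transform = etree.XSLT(etree.fromstring(stylesheet))
result = transform(etree.fromstring(sample_xml))
print(str(result))
```

The stylesheet does no CSV quoting, so values containing commas would need extra handling.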

The [Open Discourse](https://opendiscourse.de/) project hosts the proceedings of the German parliament and also a list of MPs in easy-to-consume data formats. See the [Open Discourse data sets](https://dataverse.harvard.edu/dataverse/opendiscourse) page.

## Browser Automation with Python

- [Selenium](https://pypi.org/project/selenium/)
  - nice example: [impf-botpy](https://github.com/alfonsrv/impf-botpy)
- [Playwright](https://playwright.dev/python/docs/intro)
  - [Playwright on pypi](https://pypi.org/project/playwright/) including nice examples (some cited below)
  - [Python API docs](https://playwright.dev/python/docs/api/class-playwright)
  
Note: Playwright's synchronous API, which is used below, does not run in a Jupyter notebook because the notebook already runs an asyncio event loop. We'll run the scripts directly with the Python interpreter.

Installation:
```
pip install playwright
playwright install
```

Take a screenshot using two different browsers:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    for browser_type in [p.chromium, p.firefox]:
        browser = browser_type.launch()
        page = browser.new_page()
        page.goto('http://whatsmyuseragent.org/')
        _ = page.screenshot(path=f'figures/example-{browser_type.name}.png')
        browser.close()
```

Just run the script [scripts/playwright_whatsmyuseragent_screenshot.py](scripts/playwright_whatsmyuseragent_screenshot.py) in the console / shell:

```
python ./scripts/playwright_whatsmyuseragent_screenshot.py
```

The screenshots are then found in the folder `figures/` for [chromium](./figures/example-chromium.png) and [firefox](./figures/example-firefox.png).

Playwright can record user interactions (mouse clicks, keyboard input) and create Python code to replay the recorded actions:

```
playwright codegen https://www.bundestag.de/abgeordnete/biografien
```

The created Python code is then modified, here to loop over all overlays showing the members of the parliament:

```python
from time import sleep

from playwright.sync_api import sync_playwright

def run(playwright):
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context(viewport={'height': 1080, 'width': 1920})
    page = context.new_page()
    page.goto("https://www.bundestag.de/abgeordnete/biografien")
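    while True:
        try:
            # pause between clicks to let the next overlay load
            sleep(3)
            page.click("button:has-text(\"Vor\")")
        except Exception:
            # click() raised (e.g. it timed out because there is
            # no further "Vor" button): stop the loop
            break
    browser.close()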

with sync_playwright() as p:
    run(p)
```

Again, it is best to run the [replay script](./scripts/playwright_replay.py) in the console:

```
python ./scripts/playwright_replay.py
```