# Python Workshop
# Session 4: Web Scraping

Stefan Scholz

In this fourth session we will learn how to crawl websites. By doing this, we will get familiar with HTTP requests, data extraction and browser automation.

## 4.1 Websites

Another kind of web resource on the World Wide Web is the webpage. Every day we use many websites like [Google](https://www.google.com/), [Wikipedia](https://frr.wikipedia.org/wiki/), [Youtube](https://www.youtube.com/), [Facebook](https://www.facebook.com/), [StackOverflow](https://stackoverflow.com/), and [GitHub](https://github.com/) to get information. These websites usually consist of several webpages written in HTML or a comparable markup language. So why should we not also be able to stream data from webpages?

Yes, in general we can also stream data from webpages with Python. But in comparison to REST APIs it requires much more effort, time and tears, because the desired information has not been collected, cleaned or structured by a provider. Still, if no suitable API is available or it is unreasonably expensive, it is a good idea to implement a so-called web scraper. Web scrapers extract data from webpages automatically, typically by means of a bot or web crawler. Usually these webpages are scraped repeatedly to observe changes and generate data streams.

In a first step, we want to extract data from single webpages. For this we can use the package `requests`, because the communication is again based on HTTP. We will explain step by step how to make requests to webpages, which is very similar to making requests to REST APIs. As examples we will use the webpages behind the news articles from the [BBC](https://www.bbc.com/) which we collected in the previous session. Then we will show you how to extract information from these webpages.

### URLs

Uniform resource locators (URLs) are references to all kinds of web resources. In the case of APIs, we called them endpoints. With webpages, we will not go into detail about how they include parameters in their URLs. This is mainly because many websites use parameters differently and do not follow a standard the way REST APIs do. Instead we will assume that our URLs are already given.

In the following you can see the schema of a URL and two examples.

```
scheme:[//authority]path[?query][#fragment]
```

```
https://www.bbc.com/news/world-middle-east-67084141
https://en.wikipedia.org/wiki/Web_scraping#Techniques
```

### Requests

There are several ways to send a request to a webpage. One of them is the package `requests`. As mentioned in the previous session, this package allows you to make HTTP requests very easily and quickly. It provides all functions and methods to write your parameters into requests, send your requests and work with your responses.

Let us first import the package or install it if necessary.

In [None]:
import requests

In our example with the articles of the [BBC](https://www.bbc.com/) we are interested in getting the full texts of recent news articles. The best way to understand requests is to prepare an exemplary request on some news articles.

Let us define one article's URL in the variable `url`. Then we pass this variable into the function `requests.get()` to make the corresponding request. The request is best wrapped in a `try` statement, e.g. to handle an HTTP error `requests.exceptions.HTTPError`. Such an exception is raised by the method `raise_for_status()` when the response carries an error status code (4xx or 5xx).

In [None]:
# define article
url = "https://www.bbc.com/news/world-middle-east-67084141"

try:
    # make request (the timeout keeps the call from waiting indefinitely)
    response = requests.get(url, timeout=10)
    # check response
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print("HTTPError: {}".format(e))
except Exception as e:
    print("Error: {}".format(e))

If the request works and raises no exception, you can have a first look into the response with the attribute `content`, which holds the body as raw bytes.

In [None]:
# inspect response
print(response.content)

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Crawl the corresponding webpages of the news articles we collected in the previous session.
</div>

### Responses

After a request has been sent to a webpage, you will receive a response to it. This response consists of several pieces of information, namely a status code, a header and a body. If you have made your request with the package `requests`, then you can easily access this information. With the attribute `content` on the response, we can access its body again, but this time it is not in JSON but in HTML format, a markup with opening tags `<tag>` and closing tags `</tag>`.
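The three parts of a response can be read from a `requests.Response` object via the attributes `status_code`, `headers` and `text` (the body decoded to a string, in contrast to the raw bytes in `content`). The following is a minimal sketch; the helper name `summarize_response` is hypothetical and not part of the package.

```python
import requests


def summarize_response(response: requests.Response) -> dict:
    """Collect status code, one header and the start of the body in a dictionary."""
    return {
        # numeric status code, e.g. 200 for success or 404 for not found
        "status_code": response.status_code,
        # headers behave like a dictionary with case-insensitive keys
        "content_type": response.headers.get("Content-Type"),
        # first characters of the body, decoded to a string
        "body_start": response.text[:80],
    }
```

Called as `summarize_response(response)` on the response from the request above, it should show at a glance whether the request succeeded and whether the body is indeed HTML.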

What follows is a very simple HTML document.

```
<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <h1> Title </h1>
        <p> Full Text </p>
    </body>
</html>
```

In Python, you can parse HTML-formatted strings with the package `BeautifulSoup`. The package provides idiomatic ways of navigating, searching, and modifying HTML. In this way, we become aware of the structure and can extract certain information.

Let us first import the package or install it if necessary.

In [None]:
from bs4 import BeautifulSoup

To convert an HTML string into a `BeautifulSoup` object, we have to pass the string and the corresponding parser `html.parser` into the class `BeautifulSoup`. On this object we will have various methods available to work with the HTML format. First, we will print out a structured form of the HTML with the method `prettify()`.

Let us parse one webpage with `BeautifulSoup`.

In [None]:
# parse html (webpages is the list of HTML strings collected in the exercise above)
soup = BeautifulSoup(webpages[2], "html.parser")

# print structured html
print(soup.prettify())

Suppose you are now interested in a certain piece of information in your HTML string. You can search for it by its tag name and attributes. For this, you can use the method `find()`, which returns the first tag that matches the given name and attributes, or `None` if there is no match. If you want all matching tags instead, use the method `find_all()`, which returns a list. Afterwards, you can access the actual text inside these tags with the method `get_text()`.
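These three methods can be tried out on a small, self-contained HTML string before turning to a real webpage. The following is a minimal sketch; the document, its `id` and `class` values and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# small hand-written document (hypothetical, not taken from a real website)
demo_html = """
<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <h1 id="headline"> Title </h1>
        <p class="text"> First paragraph </p>
        <p class="text"> Second paragraph </p>
    </body>
</html>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

# find() returns the first matching tag
headline = demo_soup.find("h1", {"id": "headline"})
print(headline.get_text(strip=True))

# find_all() returns a list of all matching tags
demo_paragraphs = demo_soup.find_all("p", {"class": "text"})
print(len(demo_paragraphs))

# get_text() returns the text inside a tag, strip=True removes surrounding whitespace
demo_text = " ".join(p.get_text(strip=True) for p in demo_paragraphs)
print(demo_text)

# find() returns None if no tag matches
missing = demo_soup.find("h2")
print(missing)
```

Because `find()` can return `None`, it is worth checking its result before calling `get_text()` on it when scraping real webpages.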

But before we can call these methods on our `BeautifulSoup` object, we first have to find out under which tags and attributes our information is hidden. The best way to do this is to open the corresponding webpage and search for the desired information. Once you have found it, you can right-click on it in your browser and select `Inspect` to see all the information about the underlying tag. If this method does not work for you, then you have to look into the HTML of the webpage and find the desired information by yourself. This can be done with `BeautifulSoup` together with `prettify()` or your own browser.

Modern web browsers (Firefox, Chromium, IE, etc.) include a set of [web development tools](https://en.wikipedia.org/wiki/Web_development_tools). Originally aimed at web developers for testing and debugging the code (HTML, CSS, JavaScript) used to build a website, these tools are also the easiest way to explore and understand the technologies behind a website. This initial exploration later helps to scrape data from the website.

Let us try to access the full text of one article.

In [None]:
# find tag with certain tag and id
text = soup.find("main", {"id": "main-content"})

# print tag
print(text)

In [None]:
# find all tags with certain tag and attribute
paragraphs = soup.find_all("div", {"data-component": "text-block"})

# print number of tags
print(len(paragraphs))

# join paragraphs to full text
full_text = " ".join(paragraph.get_text() for paragraph in paragraphs)

# inspect full text
print(full_text)

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Parse the headlines and full texts from your list of webpages.
</div>

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Save the headlines and full texts in a JSON file.
</div>

## 4.2 Browser Automation

Browser automation lets you execute actions automatically in a web browser, for testing, for web scraping or to perform repetitive tasks faster. These tasks include for example:

- Load a page by URL including page dependencies (CSS, JavaScript, images, media)
- Simulate user interaction (clicks, input, scrolling)
- Take screenshots
- Access the DOM tree in the browser, i.e. the HTML as modified by executed JavaScript and user interactions, to extract data

There are two very popular libraries for browser automation:

- [Selenium](https://pypi.org/project/selenium/)
- [Playwright](https://playwright.dev/python/docs/intro)

Note: Playwright's synchronous API, which we use here, does not run in a Jupyter notebook. We will run the scripts directly with the Python interpreter.

Installation:

```
pip install playwright
playwright install
```

To test what Playwright can do, we will create a script that opens a web page in two different browsers and takes a screenshot of each browser window.

Let us have a look at this script.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    for browser_type in [p.chromium, p.firefox]:
        browser = browser_type.launch()
        page = browser.new_page()
        page.goto('https://www.whatismybrowser.com/')
        _ = page.screenshot(path=f'figures/example-{browser_type.name}.png')
        browser.close()
```

Just run the script [scripts/playwright_whatsmyuseragent_screenshot.py](scripts/playwright_whatsmyuseragent_screenshot.py) in the console / shell:

```
python ./scripts/playwright_whatsmyuseragent_screenshot.py
```

The screenshots are then found in the folder `figures/` for [chromium](./figures/example-chromium.png) and [firefox](./figures/example-firefox.png).
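The last task in the list above, reading the HTML after JavaScript has run, can be combined with `BeautifulSoup` from section 4.1. The following is a minimal sketch under the assumption that both packages are installed; the helper names `fetch_rendered_html` and `extract_headings` are hypothetical, and which headings are found depends entirely on the page.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Return the HTML of a page after the browser has executed its JavaScript."""
    # imported here so that the parsing helper below also works without Playwright
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # content() serializes the current DOM, including changes made by JavaScript
        html = page.content()
        browser.close()
    return html


def extract_headings(html):
    """Return the text of all <h1> and <h2> tags in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all(["h1", "h2"])]
```

In a script, the two helpers could be chained as `extract_headings(fetch_rendered_html("https://www.whatismybrowser.com/"))`. Like the other Playwright examples, this is meant to be run with the Python interpreter rather than in a notebook.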

Playwright can record user interactions (mouse clicks, keyboard input) and create Python code to replay the recorded actions:

```
playwright codegen https://www.bundestag.de/abgeordnete/biografien
```

The created Python code is then modified, here to loop over all overlays showing the members of the parliament:

```python
from time import sleep

from playwright.sync_api import sync_playwright

def run(playwright):
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context(viewport={'height': 1080, 'width': 1920})
    page = context.new_page()
    page.goto("https://www.bundestag.de/abgeordnete/biografien")
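    while True:
        try:
            # give the overlay time to load
            sleep(3)
            # click the "Vor" (German for "forward") button to open the next overlay
            page.click("button:has-text(\"Vor\")")
        except Exception:
            # click() raises once no such button is found within the timeout
            break
    # release the browser resources
    context.close()
    browser.close()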

with sync_playwright() as p:
    run(p)
```

It's best to run the [replay script](./scripts/playwright_replay.py) in the console:

```
python ./scripts/playwright_replay.py
```
