# BeautifulSoup

In [1]:
import requests
    
url = 'https://de.wikipedia.org/wiki/Liste_der_Stra%C3%9Fen_und_Pl%C3%A4tze_in_Berlin-Mitte'

r = requests.get(url)

1. Install:

    With [Spack](../productive/envs/spack/index.rst) you can provide BeautifulSoup in your kernel:

    ``` bash
    $ spack env activate python-374
    $ spack install py-beautifulsoup4 ^python@3.7.4%gcc@9.1.0
    ```

    Alternatively, you can install BeautifulSoup with another package manager, for example:

    ``` bash
    $ pipenv install beautifulsoup4
    ```

2. `r.content` contains the HTML of the page as bytes.

3. Next, we parse this content into a Python representation of the page using BeautifulSoup:

In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')

4. To structure the code, let’s create a new function `get_dom` (**D**ocument **O**bject **M**odel) that includes all of the preceding code:

In [4]:
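def get_dom(url):
    """Fetch url and return the parsed page as a BeautifulSoup object."""
    r = requests.get(url)
    r.raise_for_status()  # raise an exception for 4xx/5xx responses
    return BeautifulSoup(r.content, 'html.parser')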
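
The parsing step can also be tried without a network request. The following is a minimal sketch that assumes a small hand-written HTML snippet (`sample_html`, a hypothetical stand-in for `r.content`) instead of the Wikipedia page; the names `sample_soup`, `title` and `intro` are likewise only illustrative:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.content: a tiny HTML document as bytes.
sample_html = b"""
<html>
  <head><title>Streets</title></head>
  <body><p class="intro">Hello</p></body>
</html>
"""

# Same call as in get_dom, only with local bytes instead of a response.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Text inside the <title> element.
title = sample_soup.title.string
# Text inside the first <p class="intro"> element.
intro = sample_soup.find("p", class_="intro").get_text()

print(title)
print(intro)
```

Working on a local snippet like this can make it easier to check selectors and attribute access before running them against the live page.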

Individual elements can be filtered out with CSS selectors, for example. To determine a selector in Firefox, right-click one of the table cells in the first column of the table and open the *Inspector*. There you can right-click the element again and select *Copy → CSS Selector*. The clipboard then contains, for example, `table.wikitable:nth-child(13) > tbody:nth-child(2) > tr:nth-child(1)`. We now clean up this CSS selector, because we do not want to restrict the match to `table.wikitable` as the 13th child element or to `tbody` as the 2nd child element, but only to the first column within `tbody`.

Finally, with `limit=3`, we display only the first three results in this notebook as an example:

In [5]:
links = soup.select('table.wikitable > tbody > tr > td:nth-child(1) > a', limit=3)
print(links)

[<a href="/wiki/Ackerstra%C3%9Fe" title="Ackerstraße">Ackerstraße</a>, <a href="/wiki/Alexanderplatz" title="Alexanderplatz">Alexanderplatz</a>, <a href="/wiki/Almstadtstra%C3%9Fe" title="Almstadtstraße">Almstadtstraße</a>]
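
How the cleaned-up selector behaves can be checked on a small example. The sketch below assumes a hand-written table (`table_html`, not the real Wikipedia markup) with an explicit `tbody`, and shows that only links in the first column are matched, with `limit` and `select_one` as two ways to shorten the result:

```python
from bs4 import BeautifulSoup

# Hypothetical table; the second row also has a link in the second column.
table_html = """
<table class="wikitable">
  <tbody>
    <tr><td><a href="/wiki/A">A-Straße</a></td><td>1 km</td></tr>
    <tr><td><a href="/wiki/B">B-Platz</a></td><td><a href="/wiki/X">X</a></td></tr>
    <tr><td><a href="/wiki/C">C-Weg</a></td><td>3 km</td></tr>
  </tbody>
</table>
"""

table_soup = BeautifulSoup(table_html, "html.parser")
selector = "table.wikitable > tbody > tr > td:nth-child(1) > a"

all_links = table_soup.select(selector)           # every first-column link
first_two = table_soup.select(selector, limit=2)  # stop after two matches
first = table_soup.select_one(selector)           # only the first match, or None

print([a.get_text() for a in all_links])
print(len(first_two))
print(first["href"])
```

Note that `html.parser` does not add a missing `tbody`, so the `> tbody >` step only matches when the element is present in the markup, as assumed here.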


However, we don’t want the entire HTML link, just its text content:

In [6]:
for content in links:
    print(content.text)

Ackerstraße
Alexanderplatz
Almstadtstraße
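
Besides the text, the link target is often of interest. As a minimal sketch, assuming two of the links from above pasted into a string (`links_html`) and the list page as base URL, the relative `href` values can be turned into absolute URLs with `urljoin` from the standard library; the name `streets` is only illustrative:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://de.wikipedia.org/wiki/Liste_der_Stra%C3%9Fen_und_Pl%C3%A4tze_in_Berlin-Mitte"

# Two of the links shown above, assumed here as a standalone snippet.
links_html = (
    '<a href="/wiki/Ackerstra%C3%9Fe" title="Ackerstraße">Ackerstraße</a>'
    '<a href="/wiki/Alexanderplatz" title="Alexanderplatz">Alexanderplatz</a>'
)

sample_links = BeautifulSoup(links_html, "html.parser").select("a")

# Pair each link text with its absolute URL.
streets = [
    (a.get_text(strip=True), urljoin(base_url, a["href"]))
    for a in sample_links
]

for name, target in streets:
    print(name, target)
```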


## See also:

* [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)