In [1]:
# imports
import numpy as np
import pandas as pd

import requests

from bs4 import BeautifulSoup

# SMA Portfolio Exam 2
## Part 1 - Scraping

# Objective

As a social media analytics agency, we have been tasked by a political think tank with researching the communication behaviour of the German parliament. Due to the nature of the work, our client prefers to remain anonymous. The work order is to find out which topics are generally represented, how important they are, and how they are distributed over time.

The first step is to scrape all accessible press releases of the German parliament. These can be accessed here: https://www.bundestag.de/presse/pressemitteilungen

The available timeframe is from 01.01.2021 to 16.01.2023.

# Scraping

Since it is not possible to scrape the press releases directly from 'https://www.bundestag.de/presse/pressemitteilungen', we have to recreate the 89 pagination links to the pages that actually contain the press releases. The pagination links differ only in the parameter 'offset', as follows:
- https://www.bundestag.de/ajax/filterlist/de/presse/pressemitteilungen/454504-454504?limit=20&noFilterSet=true&offset=0
- https://www.bundestag.de/ajax/filterlist/de/presse/pressemitteilungen/454504-454504?limit=20&noFilterSet=true&offset=10
- ....
- https://www.bundestag.de/ajax/filterlist/de/presse/pressemitteilungen/454504-454504?limit=20&noFilterSet=true&offset=880
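
The following sketch illustrates why 89 offsets are enough and why the same press release can appear on two neighbouring pages. It rests on an assumption that is not confirmed by the site: that the endpoint returns up to `limit` items starting at position `offset`. The item count of 889 is the number of unique press releases reported further down in this notebook.

```python
# Sketch only: simulates the pagination under the ASSUMPTION that the endpoint
# returns up to `limit` items starting at index `offset`.
LIMIT = 20          # value of the `limit` parameter in the URLs above
STEP = 10           # distance between two consecutive offsets
TOTAL_ITEMS = 889   # unique press releases found later in this notebook

# same offsets as np.arange(0, 890, 10): 0, 10, ..., 880
offsets = list(range(0, 890, STEP))

# count on how many pages each item would appear
seen_counts = {}
for offset in offsets:
    for item in range(offset, min(offset + LIMIT, TOTAL_ITEMS)):
        seen_counts[item] = seen_counts.get(item, 0) + 1

print(len(offsets))               # 89 pagination links
print(len(seen_counts))           # 889 distinct items covered
print(max(seen_counts.values()))  # 2: an item can show up on two pages
```

Under this assumption, stepping by 10 with a page size of 20 makes consecutive pages overlap, which would explain why duplicate links have to be filtered out later.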

In [2]:
# list of links to the pages that list the press releases

# the url without the number for the offset
pm_page_stem = 'https://www.bundestag.de/ajax/filterlist/de/presse/pressemitteilungen/454504-454504?limit=20&noFilterSet=true&offset='

# all required offsets
off_sets = np.arange(0, 890, 10)

# empty list to store links
pm_page_links = []

# loop to create a link for each value of offsets
for step in off_sets:
    
    # appending each link to empty list
    # by combining press release page stem with offsets
    pm_page_links.append(pm_page_stem + str(step))

In [3]:
# printing number of pagination links
len(pm_page_links)

89

Now we have a list containing all 89 pagination links. This allows us to open each of these links and grab the links to the actual press releases. These links can all be selected with the CSS selector '.bt-link-intern'.

In [4]:
# getting links to all press releases

# initializing empty set to store urls to press releases
# some links occur more than once, the set prevents any duplicate entries
pm_urls = set()

# url stem
url_stem = 'https://www.bundestag.de'

# loop to iterate over each pagination link
for page_link in pm_page_links:

    # configuring BeautifulSoup
    soup = BeautifulSoup(requests.get(page_link).text, 'html.parser')

    # grabbing the links
    links = soup.select('.bt-link-intern')

    # adding each link to the set
    # by combining url stem with individual page link
    for link in links:
        pm_urls.add(url_stem + link.get('href'))
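
The cell above builds each URL by string concatenation, which works as long as every `href` is present and root-relative. As a hedged alternative, the sketch below uses `urllib.parse.urljoin` from the standard library, which also leaves already absolute links untouched, and skips anchors without an `href` (for which `link.get('href')` returns `None`). The `href` values are hypothetical and only serve as illustration.

```python
from urllib.parse import urljoin

url_stem = 'https://www.bundestag.de'

# HYPOTHETICAL href values for illustration, not taken from the site
hrefs = [
    '/presse/pressemitteilungen/example-a',                          # root-relative
    'https://www.bundestag.de/presse/pressemitteilungen/example-a',  # already absolute
    '/presse/pressemitteilungen/example-b',
    None,                                                            # anchor without href
]

# urljoin resolves relative links against the stem and keeps absolute ones;
# the set removes the duplicate, `if href` skips the missing attribute
urls = {urljoin(url_stem, href) for href in hrefs if href}

print(sorted(urls))
```

In the loop above, this would amount to replacing `url_stem + link.get('href')` with `urljoin(url_stem, href)` after checking that `href` is not `None`.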

In [5]:
print(f'Number of press releases: {len(pm_urls)}')

Number of press releases: 889


This leaves us with a set of all available direct links to press releases of the German parliament.

In the next step we will scrape the desired content from each press release. Each page typically consists of:
- date
- title
- several text paragraphs

We are going to extract all of these for each press release and store that content in a dataframe.

In [6]:
# extracting content of press releases

# empty list to store dicts containing content
list_of_dicts = []
for url in pm_urls:

    # fetching and parsing the press release page
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    
    # pulling dates
    date = soup.select('span.bt-date')[0].text

    # pulling titles
    title = soup.select('h3.bt-artikel__title')[0].text
    
    # selecting all paragraphs of the article body
    paragraphs = soup.select('.col-md-6 p')
    
    # joining the text of all paragraphs into one string
    article_text = ' '.join([paragraph.text for paragraph in paragraphs])
    
    # writing elements to dict
    article_dict = {
        'date': date,
        'title': title,
        'text': article_text,
        'url': url}

    # appending dict to list
    list_of_dicts.append(article_dict)

In [7]:
# forming dataframe from list of dicts
df_raw = pd.DataFrame(list_of_dicts)

In [None]:
# printing length of dataframe
len(df_raw)

889

The dataframe has 889 rows, which corresponds to the number of links. So we can assume that we have scraped all available links.

In [8]:
# displaying first three rows of dataframe
df_raw.head(3)

Unnamed: 0,date,title,text,url
0,\n 8. September 2022,Tagung der Vorsitzenden und stellvertretenden ...,"\nZeit:\n Montag, 12. September 2022,\n ...",https://www.bundestag.de/presse/pressemitteilu...
1,\n 24. Juni 2022,Delegation des Ausschusses für Klimaschutz und...,Norwegen ist vom 27. bis zum 29. Juni 2022 das...,https://www.bundestag.de/presse/pressemitteilu...
2,\n 11. Februar 2021,Kinderkommission zum Red Hand Day am 12. Febru...,Der Red Hand Day am 12. Februar ist in vielen ...,https://www.bundestag.de/presse/pressemitteilu...
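
The preview shows leading line breaks and runs of spaces in the `date` and `text` columns. A minimal sketch of how such whitespace could be collapsed with plain string methods follows; `normalize_whitespace` is a hypothetical helper and not part of this notebook, and the sample strings are only modelled on the preview above.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces and line breaks into single spaces."""
    # str.split() without arguments splits on any whitespace
    # and drops leading and trailing whitespace
    return ' '.join(text.split())


# sample strings modelled on the preview above (shortened)
raw_date = '\n        8. September 2022'
raw_text = '\nZeit:\n  Montag, 12. September 2022'

print(normalize_whitespace(raw_date))  # 8. September 2022
print(normalize_whitespace(raw_text))  # Zeit: Montag, 12. September 2022
```

Applied to the dataframe, this could look like `df_raw['text'].map(normalize_whitespace)`. The notebook itself only strips the `date` column and leaves `text` unchanged.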



The date column does not have a machine readable format, so we are going to fix that next.

In [10]:
# date cleaning

# stripping white spaces and line breaks
df_raw['date'] = df_raw['date'].str.strip()

# converting date column (parsing German month names requires a German locale)
import locale
locale.setlocale(locale.LC_ALL, 'de_DE')
df_raw['date'] = pd.to_datetime(df_raw['date'], format='%d. %B %Y')
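
The conversion above depends on the `de_DE` locale being installed; if it is not, `locale.setlocale` raises `locale.Error`. As a hedged alternative, the sketch below parses the German dates without touching the locale. `GERMAN_MONTHS` and `parse_german_date` are hypothetical names introduced for this illustration.

```python
from datetime import date

# mapping of German month names to month numbers
GERMAN_MONTHS = {
    'Januar': 1, 'Februar': 2, 'März': 3, 'April': 4, 'Mai': 5, 'Juni': 6,
    'Juli': 7, 'August': 8, 'September': 9, 'Oktober': 10,
    'November': 11, 'Dezember': 12,
}


def parse_german_date(text):
    """Parse a string such as '8. September 2022' into a datetime.date."""
    # split() also removes surrounding whitespace and line breaks
    day, month_name, year = text.split()
    return date(int(year), GERMAN_MONTHS[month_name], int(day.rstrip('.')))


print(parse_german_date('8. September 2022'))      # 2022-09-08
print(parse_german_date('\n   11. Februar 2021'))  # 2021-02-11
```

On the dataframe, this could be used as `pd.to_datetime(df_raw['date'].map(parse_german_date))`, which should give the same result as the locale-based conversion for dates in this format.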

In [11]:
# reordering columns
df = df_raw[['date', 'title', 'text', 'url']]

In [12]:
df

Unnamed: 0,date,title,text,url
0,2022-09-08,Tagung der Vorsitzenden und stellvertretenden ...,"\nZeit:\n Montag, 12. September 2022,\n ...",https://www.bundestag.de/presse/pressemitteilu...
1,2022-06-24,Delegation des Ausschusses für Klimaschutz und...,Norwegen ist vom 27. bis zum 29. Juni 2022 das...,https://www.bundestag.de/presse/pressemitteilu...
2,2021-02-11,Kinderkommission zum Red Hand Day am 12. Febru...,Der Red Hand Day am 12. Februar ist in vielen ...,https://www.bundestag.de/presse/pressemitteilu...
3,2021-06-04,Öffentliche Sitzung des 3. Untersuchungsaussch...,"\nZeit:\n Dienstag, 8. Juni 2021,\n 14...",https://www.bundestag.de/presse/pressemitteilu...
4,2022-11-08,Konstituierung des Gremiums „Sondervermögen Bu...,Der Haushaltsausschuss des Deutschen Bundestag...,https://www.bundestag.de/presse/pressemitteilu...
...,...,...,...,...
884,2021-11-26,Bundestagspräsidentin Bas entzündet das erste ...,"\nZeit:\n Sonntag, 28. November 2021,\n ...",https://www.bundestag.de/presse/pressemitteilu...
885,2022-09-30,Der Freundeskreis Berlin-Taipei besucht Taiwan,Vom 1. bis 7. Oktober 2022 reist eine Delegati...,https://www.bundestag.de/presse/pressemitteilu...
886,2021-05-14,Öffentliche Sitzung des Innenausschusses zum T...,"\nZeit:\n Montag, 17. Mai 2021,\n 15.0...",https://www.bundestag.de/presse/pressemitteilu...
887,2021-03-17,Öffentliche Sitzung des Umweltausschusses zum ...,"\nZeit:\n Montag, 22. März 2021,\n 14....",https://www.bundestag.de/presse/pressemitteilu...


In [13]:
# storing dataframe for further processing
df.to_pickle('./bundestag_PMs.pkl')

The analysis continues in 02_RF_filtering.ipynb.