# Topic Modelling

## Speeches in the German Parliament (Bundestag)

*Repository: [github.com/raphaelw/nlp-bundestag](https://github.com/raphaelw/nlp-bundestag)*

## Project Overview (Blackbox)

![Image](doc/figures/pipeline_blackbox.png)

- An algorithm discovers topics in the speeches
- The topics are then inspected and annotated manually

## Explore Dataset

> [opendiscourse.de](https://opendiscourse.de/): Richter, F.; Koch, P.; Franke, O.; Kraus, J.; Kuruc, F.; Thiem, A.; Högerl, J.; Heine, S.; Schöps, K. (2020). Open Discourse. https://doi.org/10.7910/DVN/FIKIBO. Harvard Dataverse. V3.

### Exploration: Amount of text by parliamentary group

![Image](doc/figures/plot_text_mass.png)

*"not found" covers speeches by members of the government (chancellor, ministers), for whom no parliamentary group is recorded.*

### Exploration: Speech lengths

![Image](doc/figures/plot_speech_lengths.png)

The **longest speech** was given by **Fritz Schäffer**, the first minister of finance, on December 7, 1956 (*[PDF, page 2](https://dserver.bundestag.de/btp/02/02178.pdf)*).
- About 160k characters and 21k words, with an estimated reading time of over 2.5 hours.
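
The reading-time estimate can be reproduced with simple arithmetic. The sketch below assumes a pace of 130 words per minute; that rate is an illustrative assumption and is not taken from the project.

```python
# Rough reading-time estimate for the longest speech.
WORDS = 21_000           # rounded word count reported above
WORDS_PER_MINUTE = 130   # assumption: plausible reading-aloud pace, not from the source


def reading_time_minutes(words: int, words_per_minute: float) -> float:
    """Return the estimated number of minutes needed to read `words` words."""
    return words / words_per_minute


minutes = reading_time_minutes(WORDS, WORDS_PER_MINUTE)
hours, rest = divmod(round(minutes), 60)
print(f"about {hours} h {rest} min")  # prints "about 2 h 42 min"
```

A faster assumed pace of 140 words per minute gives exactly 2.5 hours, so the figure depends strongly on the chosen rate.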

## Project Overview (Greybox)

![Image](doc/figures/pipeline_greybox.png)

Preprocessing ([spaCy](https://spacy.io/), [Gensim](https://radimrehurek.com/gensim/)):
- Lemmatization (_better_ → _good_; _walking_ → _walk_)
- Remove stop words (*the, of, and, ...*)
- Generate bigrams (word pairs)
- Vectorization (bag-of-words)
- Remove rare and common words
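
The steps above can be sketched in plain Python on a toy corpus. This is a simplified illustration, not the project's code: the documents are assumed to be already tokenized and lemmatized (the spaCy step), the stop word list is a tiny placeholder, bigrams are detected by raw pair counts (Gensim's phrase detection scores pairs rather than just counting them), and all function names and thresholds are hypothetical.

```python
from collections import Counter

# Toy corpus, assumed to be tokenized and lemmatized already.
docs = [
    ["the", "federal", "government", "support", "family"],
    ["the", "federal", "government", "sign", "trade", "agreement"],
    ["family", "policy", "and", "the", "federal", "government"],
    ["trade", "agreement", "of", "the", "union"],
]
STOP_WORDS = {"the", "of", "and"}  # tiny illustrative list


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def find_bigrams(corpus, min_count=2):
    """Return adjacent word pairs that occur at least `min_count` times."""
    counts = Counter(pair for doc in corpus for pair in zip(doc, doc[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def merge_bigrams(tokens, bigrams):
    """Join known pairs into single tokens such as 'trade_agreement'."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(f"{tokens[i]}_{tokens[i + 1]}")
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def filter_vocabulary(corpus, no_below=2, no_above=0.5):
    """Keep words found in >= no_below docs and in <= no_above of all docs."""
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    limit = no_above * len(corpus)
    return {w for w, df in doc_freq.items() if no_below <= df <= limit}


cleaned = [remove_stop_words(doc) for doc in docs]
bigrams = find_bigrams(cleaned)
cleaned = [merge_bigrams(doc, bigrams) for doc in cleaned]
vocabulary = filter_vocabulary(cleaned)
# Bag-of-words: one word-count mapping per document.
bag_of_words = [Counter(t for t in doc if t in vocabulary) for doc in cleaned]
print(sorted(vocabulary))  # prints ['family', 'trade_agreement']
```

In this toy run, `federal_government` is dropped as too common (3 of 4 documents) and words such as `union` are dropped as too rare (1 document).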

## Tech Details
- Number of topics (fixed in advance): 50
- Computation time:
    - Preprocessing: 10 hours
    - LDA: 7 hours
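
To give an idea of what LDA does with the bag-of-words data, here is a minimal sketch on a toy corpus. It uses collapsed Gibbs sampling, which is a different inference method from the one Gensim implements, and it is far too slow for the real dataset. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy documents are assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return (topic_word, doc_topic) counts."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]  # tokens per (document, topic)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics                # tokens per topic
    assignments = []

    # Start with a random topic for every token.
    for d, doc in enumerate(docs):
        topics = []
        for w in doc:
            k = rng.randrange(num_topics)
            topics.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(topics)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the token out of the counts ...
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # ... and resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word, doc_topic


docs = [
    ["family", "child", "parent", "family"],
    ["child", "parent", "school"],
    ["trade", "export", "tariff", "trade"],
    ["export", "tariff", "market"],
]
topic_word, doc_topic = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top_words = sorted(counts, key=counts.get, reverse=True)[:3]
    print(f"Topic {k}: {top_words}")
```

The most frequent words per topic are the kind of data that the word clouds in the results visualize. Topic numbers carry no meaning by themselves, which is why the manual annotation step is needed.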

## Tech Used
- Hardware: Intel i7 (3rd Gen), 16 GB RAM
- Scientific Python
    - [JupyterLab](https://jupyter.org/), [pandas](https://pandas.pydata.org/), [matplotlib](https://matplotlib.org/)
    - [Apache Arrow](https://arrow.apache.org/) (load serialized pandas DataFrames)
- Natural Language Processing
    - [spaCy](https://spacy.io/) (preprocessing)
    - [Gensim](https://radimrehurek.com/gensim/) (preprocessing and LDA)
- Visualization
    - [WordCloud](https://github.com/amueller/word_cloud)

## Results

Visit: [github.com/raphaelw/nlp-bundestag/blob/main/topics.md](https://github.com/raphaelw/nlp-bundestag/blob/main/topics.md)

### Topic 8: German reunification (Deutsche Wiedervereinigung)

![Topic 8](wordclouds/wordcloud_8.png "Topic 8")

### Topic 9: Trade agreements (Handelsabkommen)

![Topic 9](wordclouds/wordcloud_9.png "Topic 9")

### Topic 16: Family policy (Familienpolitik)

![Topic 16](wordclouds/wordcloud_16.png "Topic 16")

### Topic 29: Military missions (Militäreinsätze)

![Topic 29](wordclouds/wordcloud_29.png "Topic 29")

*jawohl*: "Yes, sir!"

### Topic 34: Gender equality (Gleichberechtigung)

![Topic 34](wordclouds/wordcloud_34.png "Topic 34")

## Final thoughts and room for improvement

- Where are the speeches on data protection and privacy?
- Tune the LDA parameters
- Try different numbers of topics
- Improve lemmatization, or use stemming instead
- Split long German compound words
- Find stop words specific to this dataset
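
As an illustration of the compound-splitting idea, the sketch below splits a word into parts found in a small lexicon, allowing common German linking elements between the parts. It is a naive dictionary-based approach under stated assumptions: the lexicon, the list of linking elements, and the function name `split_compound` are hypothetical, and a real solution would need a much larger lexicon and some way to rank competing splits.

```python
LINKING_ELEMENTS = ("", "s", "n", "en", "es")  # common German linking elements


def split_compound(word, lexicon, min_len=3):
    """Return the lexicon parts of `word`, or [word] if no full split exists."""

    def helper(rest):
        if rest in lexicon:
            return [rest]
        # Try every prefix that leaves at least `min_len` characters.
        for end in range(min_len, len(rest) - min_len + 1):
            head = rest[:end]
            if head not in lexicon:
                continue
            for link in LINKING_ELEMENTS:
                if rest.startswith(link, end):
                    tail = helper(rest[end + len(link):])
                    if tail:
                        return [head] + tail
        return None

    return helper(word.lower()) or [word]


lexicon = {"familie", "politik", "handel", "abkommen"}  # toy lexicon
print(split_compound("Familienpolitik", lexicon))  # prints ['familie', 'politik']
print(split_compound("Handelsabkommen", lexicon))  # prints ['handel', 'abkommen']
print(split_compound("Bundestag", lexicon))        # prints ['Bundestag']
```

Splitting compounds this way would let words like *Familienpolitik* contribute to the same vocabulary entries as *Familie* and *Politik*.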

# Thanks for listening!