Political German Word Embeddings

German word embeddings computed from a corpus of parliamentary transcripts (2017-2019).

Methodology

A detailed description can be found on Medium. This is a corpus-centered approach: the models are trained to analyze language change in plenary sittings.

Algorithm

Implementation of word2vec skip-gram with gensim

Parameters

  • size = 300: Dimensionality of the word vectors
  • window = 5: Maximum distance between the current and predicted word within a sentence
  • sg = 1: Training algorithm: 1 for skip-gram; otherwise CBOW
  • hs = 0: If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used
  • negative = 5: If > 0, negative sampling will be used; the int for negative specifies how many “noise words” should be drawn (usually between 5 and 20)
  • iter = 5: Number of iterations (epochs) over the corpus
  • min_count = 2: Ignores all words with total frequency lower than this
  • workers = 4: Use this many worker threads to train the model

Corpus and Preprocessing

The German parliament (Bundestag) publishes the documents of all 19 legislative periods as machine-readable XML files. Lemmatization was performed with the software TreeTagger, written by Helmut Schmid at Ludwig Maximilian University of Munich. Further preprocessing steps were sentence segmentation and the removal of hidden control characters.

Some words, such as "migrants", were not recognized by the lemmatizer. In such cases we mapped the word back to its lemma, turning "migrants" into "migrant" so that it fits the rest of the data.
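The two fallback steps described above (stripping hidden control characters and patching lemmas the tagger missed) could be sketched like this; the `LEMMA_FIXES` table is a hypothetical illustration, not the project's actual mapping.

```python
import unicodedata

def strip_control_chars(text: str) -> str:
    """Remove hidden control/format characters (Unicode categories Cc and Cf)."""
    return "".join(ch for ch in text if unicodedata.category(ch) not in ("Cc", "Cf"))

# Hand-maintained fixes for tokens TreeTagger left unlemmatized (illustrative).
LEMMA_FIXES = {"migrants": "migrant"}

def fix_lemma(token: str) -> str:
    """Fall back to a manual lemma mapping; pass known lemmas through unchanged."""
    return LEMMA_FIXES.get(token, token)

# Soft hyphen (U+00AD) and zero-width space (U+200B) are hidden format characters.
print(strip_control_chars("Bundes\u00adtag\u200b"))  # -> "Bundestag"
print(fix_lemma("migrants"))                          # -> "migrant"
```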

Models

I provide 30 models trained on the transcripts of the current electoral term. Why more than one? Because word embeddings are unstable; read more about it on Medium. Each model is saved as a KeyedVectors instance, so you can start right away with various syntactic/semantic NLP word tasks.

Published stories (in German)
