Political German Word Embeddings

German word embeddings computed from a corpus of parliamentary transcripts (2017-2019).

Methodology

A detailed description can be found on Medium. This is a corpus-centered approach: the models are trained to analyze language change in plenary sittings.

Algorithm

Implementation of word2vec skip-gram with gensim

Parameters

  • size = 300: Dimensionality of the word vectors
  • window = 5: Maximum distance between the current and predicted word within a sentence
  • sg = 1: Training algorithm: 1 for skip-gram; otherwise CBOW
  • hs = 0: If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used
  • negative = 5: If > 0, negative sampling will be used; the int for negative specifies how many “noise words” should be drawn (usually between 5 and 20)
  • iter = 5: Number of iterations (epochs) over the corpus
  • min_count = 2: Ignores all words with total frequency lower than this
  • workers = 4: Use this many worker threads to train the model

Corpus and Preprocessing

The German parliament (Bundestag) publishes the documents of all 19 legislative periods as machine-readable XML files. Lemmatization was performed with the software TreeTagger, written by Helmut Schmid at Ludwig Maximilian University of Munich. Further preprocessing steps were sentence segmentation and the removal of hidden control characters.

Some words, such as "migrants", were not recognized by the lemmatizer. In such cases we mapped the word back to its lemma, turning "migrants" into "migrant" so that it fits the rest of the data.
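The two fallback steps described above (stripping hidden control characters and patching lemmas the tagger missed) could be sketched like this; the `LEMMA_FIXES` table is a hypothetical illustration, not the project's actual mapping.

```python
import unicodedata

def strip_control_chars(text: str) -> str:
    """Remove hidden control/format characters (Unicode categories Cc and Cf)."""
    return "".join(ch for ch in text if unicodedata.category(ch) not in ("Cc", "Cf"))

# Hand-maintained fixes for tokens TreeTagger left unlemmatized (illustrative).
LEMMA_FIXES = {"migrants": "migrant"}

def fix_lemma(token: str) -> str:
    """Fall back to a manual lemma mapping; pass known lemmas through unchanged."""
    return LEMMA_FIXES.get(token, token)

# Soft hyphen (U+00AD) and zero-width space (U+200B) are hidden format characters.
print(strip_control_chars("Bundes\u00adtag\u200b"))  # -> "Bundestag"
print(fix_lemma("migrants"))                          # -> "migrant"
```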

Models

I provide 30 models trained on the transcripts of the current electoral term. Why more than one? Because word embeddings are unstable; read more about it on Medium. Each model is saved as a KeyedVectors instance, so you can start right away with various syntactic/semantic NLP word tasks.

Published stories (in German)
