### Sources
**Unless otherwise stated, the code in this notebook comes from the material we received from our lecturers during the NLP course of my Masters in Data Science and AI @UAL.** Other sources, or when I wrote the code myself (or modified it), are marked with **##** at the beginning of the corresponding cell.

#### Citation for the Hanover Tagger used in this notebook:
For an explanation of the underlying ideas see: Christian Wartena (2019). A Probabilistic Morphology Model for German Lemmatization. In: Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers. Pp. 40-49, Erlangen. https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/KONVENS2019_paper_10.pdf https://doi.org/10.25968/opus-1527

#### Stopword List from:
https://github.com/solariz/german_stopwords

Marco Götze and Steffen Geyer (2016), Source and more Information: https://solariz.de/de/downloads/6/german-enhanced-stopwords.htm
____


# Topic Modelling: Testing different numbers of topics
I added this notebook to test different numbers of topics separately from the main topic model notebook. This keeps the topic model notebook (04.1) a little easier to follow.

### Preparation: Load data, define tokeniser

In [2]:
#Import all necessary packages
import numpy as np
import pandas as pd 
import re
import nltk
from HanTa import HanoverTagger as ht
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import LatentDirichletAllocation
from nltk.tokenize.casual import casual_tokenize

In [3]:
#load in data
df = pd.read_csv('data/sorted_manifestos.csv', encoding='utf-8')

In [4]:
## I wrote the code in this cell

#add a column with party names
partyid_greens = [41111,41112,41113]
partyid_lefts = [41221, 41222, 41223]
partynames = []
#loop through all indices of the dataset
for i in range(len(df)):
    #get the party ID of this index/row
    party_id = df.at[i, 'party']
    #add the corresponding party name
    if party_id in partyid_greens:
        partynames.append('Greens')
    elif party_id in partyid_lefts:
        partynames.append('Lefts')
    elif party_id == 41420:
        partynames.append('FDP')
    elif party_id == 41521:
        partynames.append('CDU')
    elif party_id == 41320:
        partynames.append('SPD')
    elif party_id == 41620:
        partynames.append('DP')
    elif party_id == 41953:
        partynames.append('AFD')

#add column with party names to df
df['partyname'] = partynames
#shorten the df
df = df[['date','partyname','text']]
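
The loop above has no fallback branch, so a party ID outside the listed ones would leave `partynames` shorter than `df`, and the column assignment would then fail with a length mismatch. A minimal sketch of an alternative, assuming the same ID-to-name pairs as in the cell above: a dictionary lookup with `Series.map`, which leaves unknown IDs as `NaN` instead. The names `party_names` and `toy_df`, and the ID `99999`, are hypothetical and only used for illustration.

```python
import pandas as pd

# Party IDs and names copied from the cell above
party_names = {
    41111: 'Greens', 41112: 'Greens', 41113: 'Greens',
    41221: 'Lefts', 41222: 'Lefts', 41223: 'Lefts',
    41420: 'FDP', 41521: 'CDU', 41320: 'SPD', 41620: 'DP', 41953: 'AFD',
}

# Toy frame (assumption) standing in for the manifesto data;
# 99999 is a made-up ID that is not in the mapping
toy_df = pd.DataFrame({'party': [41113, 41320, 99999]})

# map() looks up every ID; IDs missing from the dictionary become NaN
toy_df['partyname'] = toy_df['party'].map(party_names)
print(toy_df)
```

Rows with a missing name can then be inspected with `toy_df[toy_df['partyname'].isna()]` before going further.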

In [5]:
## Code in this cell is from the author NNK (2022), 'Pandas Combine Two Columns of Text in DataFrame'
## link: https://sparkbyexamples.com/pandas/pandas-combine-two-columns-of-text-in-dataframe/

#add a column that combines date and name, for easier access later
df["name & date"] = df['partyname'] +", "+ df["date"].astype(str)
df.sample(3)

Unnamed: 0,date,partyname,text,name & date
10,202109,Greens,Deutschland. Alles ist drin. Bundestagswahlpro...,"Greens, 202109"
65,196909,CDU,Sicher in die 70er Jahre Kurt Georg Kiesinger...,"CDU, 196909"
42,195709,FDP,AKTIONSPROGRAMM 1957 Verkündet auf dem Wahlko...,"FDP, 195709"


In [6]:
## Code in this cell is from GeeksForGeeks (2021), 'How to Read Text File Into List in Python?', Example 1
## link: https://www.geeksforgeeks.org/how-to-read-text-file-into-list-in-python/
## modified to read my data and split the text on spaces instead of dots

#open .txt file that contains the long stopword list
with open('stopwords/german_stopwords/german_stopwords_full_edit.txt', "r") as file_full:
    data_full = file_full.read()

#create a list
german_stopwords_full = data_full.replace('\n', ' ').split(" ")

In [7]:
## I've modified this cell to use my lemmatiser and stopword list

tagger = ht.HanoverTagger('morphmodel_ger.pgz')

#create tokeniser
def my_tokeniser(doc):
    tokens = re.split(r'[\s.,;!?/"()#»«„“”:&–-]+', doc)
    #remove stopwords
    tokens_sw = [t for t in tokens if t.lower() not in german_stopwords_full and t != '']
    #lemmatiser
    tags = [tagger.analyze(word) for word in tokens_sw]
    tokens_sw_lemma = [lemma.lower() for (lemma,pos) in tags]
    return tokens_sw_lemma
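
To see what the first two steps of the tokeniser do on their own, here is a small sketch that reuses the same separator pattern but skips the lemmatisation step, so it runs without the HanTa model. The stopword set `toy_stopwords`, the helper `split_and_filter` and the example sentence are hypothetical stand-ins, not part of the original pipeline.

```python
import re

# Hypothetical mini stopword set, standing in for german_stopwords_full
toy_stopwords = {"wir", "mehr", "und"}

def split_and_filter(doc, stopwords):
    # Same separator pattern as in my_tokeniser
    tokens = re.split(r'[\s.,;!?/"()#»«„“”:&–-]+', doc)
    # Drop stopwords (case-insensitive) and the empty strings
    # that re.split leaves at the start or end of the text
    return [t for t in tokens if t != '' and t.lower() not in stopwords]

example_tokens = split_and_filter(
    "Wir fordern: mehr Bildung und mehr Klimaschutz!", toy_stopwords
)
print(example_tokens)
```

In the real tokeniser each remaining token is then passed to `tagger.analyze` and the lower-cased lemma is kept.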

In [8]:
## I wrote the code in this cell

#get all indices for the different parties
greens = [df.loc[i, 'name & date'] for i in range(len(df)) if df.loc[i, 'partyname'] == 'Greens']
lefts = [df.loc[i, 'name & date'] for i in range(len(df)) if df.loc[i, 'partyname'] == 'Lefts']
spd = [df.loc[i, 'name & date'] for i in range(len(df)) if df.loc[i, 'partyname'] == 'SPD']
fdp = [df.loc[i, 'name & date'] for i in range(len(df)) if df.loc[i, 'partyname'] == 'FDP']
cdu = [df.loc[i, 'name & date'] for i in range(len(df)) if df.loc[i, 'partyname'] == 'CDU']
afd = [df.loc[i, 'name & date'] for i in range(len(df)) if df.loc[i, 'partyname'] == 'AFD']
dp = [df.loc[i, 'name & date'] for i in range(len(df)) if df.loc[i, 'partyname'] == 'DP']

## LSA / SVD
Increasing the number of topics only adds more topics; it doesn't change the already existing ones. For example, topics 0-4 always remain the same in the following results.
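
This behaviour can be checked on a toy matrix. The sketch below is an illustration under assumptions, not the notebook's actual data: `X` is a made-up matrix with clearly separated singular values, and the `arpack` solver is used instead of the default randomised solver so that the comparison is not affected by approximation error. Because the sign of a component is arbitrary, the comparison uses the absolute cosine similarity.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Toy matrix (assumption) built from orthonormal factors and
# clearly separated singular values
U, _ = np.linalg.qr(rng.normal(size=(30, 12)))
V, _ = np.linalg.qr(rng.normal(size=(40, 12)))
singular_values = np.array([20, 15, 11, 8, 6, 4.5, 3.5, 2.5, 2, 1.5, 1, 0.5])
X = U @ np.diag(singular_values) @ V.T

# Fit once with 3 and once with 6 components
svd_small = TruncatedSVD(n_components=3, algorithm="arpack", random_state=0).fit(X)
svd_large = TruncatedSVD(n_components=6, algorithm="arpack", random_state=0).fit(X)

# Rows of components_ have unit length, so the row-wise dot product
# is the cosine similarity; abs() ignores sign flips
overlap = np.abs(np.sum(svd_small.components_ * svd_large.components_[:3], axis=1))
print(overlap.round(3))
```

Values close to 1 mean that the first three components are the same directions in both fits, which matches what the topic lists below show for topics 0-4.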

In [9]:
#get TFIDF
tfidf_vectoriser = TfidfVectorizer(tokenizer=my_tokeniser)
tfidf = tfidf_vectoriser.fit_transform(df["text"])
#Save list of unique tokens (vocab) for later
vocab = tfidf_vectoriser.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf.todense(), columns = vocab)
print(tfidf.todense().shape)

(86, 55414)


In [10]:
#Subtract mean
tfidf_df = tfidf_df - tfidf_df.mean()

### 5 Topics

In [11]:
#How many topics?
num_topics = 5
pd.options.display.max_columns=num_topics
labels = ['topic{}'.format(i) for i in range(num_topics)]

In [12]:
#Calculate topics
svd = TruncatedSVD(n_components = num_topics, n_iter = 100) 
svd_topic_vectors = svd.fit_transform(tfidf_df.values)

In [13]:
#Most relevant tokens for each topic
topic_weights = pd.DataFrame(svd.components_.T, index=vocab, columns=labels)
num_terms = 20
for i in range(num_topics):
    print("___topic " + str(i) + "___")
    topicName = "topic" + str(i)
    weightedlist = topic_weights.get(topicName).sort_values()[-num_terms:]
    print(weightedlist.index.values)

___topic 0___
['frau' 'unterstützen' 'nachhaltig' 'international' 'linke' 'öffentlich'
 'kommune' 'schaffen' 'euro' 'ökologisch' 'europäisch' 'land'
 'unternehmen' 'digital' 'kind' 'deutschland' 'grün' 'eu' 'mensch' 'stark']
___topic 1___
['schaffen' 'international' 'zusammenarbeit' 'cdu' 'wirtschaftlich'
 'sozialdemokrat' 'aufgabe' 'fdp' 'arbeitsplatz' 'csu' 'spd' 'chance'
 'ziel' 'land' 'staat' 'verbessern' 'europäisch' 'bürger' 'liberal'
 'müssen']
___topic 2___
['solidarisch' 'mann' 'demokratisierung' 'ostdeutschland' 'bundesrepublik'
 'gesellschaft' 'demokratisch' 'bündnis' 'ddr' 'gesellschaftlich' '90'
 'brd' 'öffentlich' 'linke' 'frau' 'sozial' 'pds' 'ökologisch' 'grün'
 'müssen']
___topic 3___
['regelung' 'individuell' 'schule' 'bildung' 'bürgergeld' 'privat'
 'staatlich' 'staat' 'einzeln' 'europäisch' 'digital' 'euro' 'wettbewerb'
 'für' 'afd' 'müssen' 'eu' 'fdp' 'demokrat' 'liberal']
___topic 4___
['abrüstungsschritt' 'mädchen' 'international' 'gesellschaftsvertrag'
 'natur'

### 10 Topics

In [14]:
#How many topics?
num_topics = 10
pd.options.display.max_columns=num_topics
labels = ['topic{}'.format(i) for i in range(num_topics)]

In [15]:
#Calculate topics
svd = TruncatedSVD(n_components = num_topics, n_iter = 100) 
svd_topic_vectors = svd.fit_transform(tfidf_df.values)

In [16]:
#Most relevant tokens for each topic
topic_weights = pd.DataFrame(svd.components_.T, index=vocab, columns=labels)
num_terms = 20
for i in range(num_topics):
    print("___topic " + str(i) + "___")
    topicName = "topic" + str(i)
    weightedlist = topic_weights.get(topicName).sort_values()[-num_terms:]
    print(weightedlist.index.values)

___topic 0___
['frau' 'unterstützen' 'nachhaltig' 'international' 'linke' 'öffentlich'
 'kommune' 'schaffen' 'euro' 'ökologisch' 'europäisch' 'land'
 'unternehmen' 'digital' 'kind' 'deutschland' 'grün' 'eu' 'mensch' 'stark']
___topic 1___
['schaffen' 'international' 'zusammenarbeit' 'cdu' 'wirtschaftlich'
 'sozialdemokrat' 'aufgabe' 'fdp' 'arbeitsplatz' 'csu' 'spd' 'chance'
 'ziel' 'land' 'staat' 'verbessern' 'europäisch' 'bürger' 'liberal'
 'müssen']
___topic 2___
['solidarisch' 'mann' 'demokratisierung' 'ostdeutschland' 'bundesrepublik'
 'gesellschaft' 'demokratisch' 'bündnis' 'ddr' 'gesellschaftlich' '90'
 'brd' 'öffentlich' 'linke' 'frau' 'sozial' 'pds' 'ökologisch' 'grün'
 'müssen']
___topic 3___
['regelung' 'individuell' 'schule' 'bildung' 'bürgergeld' 'privat'
 'staatlich' 'staat' 'einzeln' 'europäisch' 'digital' 'euro' 'wettbewerb'
 'für' 'afd' 'müssen' 'eu' 'fdp' 'demokrat' 'liberal']
___topic 4___
['abrüstungsschritt' 'mädchen' 'international' 'gesellschaftsvertrag'
 'natur'

### 20 Topics

In [17]:
#How many topics?
num_topics = 20
pd.options.display.max_columns=num_topics
labels = ['topic{}'.format(i) for i in range(num_topics)]

In [18]:
#Calculate topics
svd = TruncatedSVD(n_components = num_topics, n_iter = 100) 
svd_topic_vectors = svd.fit_transform(tfidf_df.values)

In [19]:
#Most relevant tokens for each topic
topic_weights = pd.DataFrame(svd.components_.T, index=vocab, columns=labels)
num_terms = 20
for i in range(num_topics):
    print("___topic " + str(i) + "___")
    topicName = "topic" + str(i)
    weightedlist = topic_weights.get(topicName).sort_values()[-num_terms:]
    print(weightedlist.index.values)

___topic 0___
['frau' 'unterstützen' 'nachhaltig' 'international' 'linke' 'öffentlich'
 'kommune' 'schaffen' 'euro' 'ökologisch' 'europäisch' 'land'
 'unternehmen' 'digital' 'kind' 'deutschland' 'grün' 'eu' 'mensch' 'stark']
___topic 1___
['schaffen' 'international' 'zusammenarbeit' 'cdu' 'wirtschaftlich'
 'sozialdemokrat' 'aufgabe' 'fdp' 'arbeitsplatz' 'csu' 'spd' 'chance'
 'ziel' 'land' 'staat' 'verbessern' 'europäisch' 'bürger' 'liberal'
 'müssen']
___topic 2___
['solidarisch' 'mann' 'demokratisierung' 'ostdeutschland' 'bundesrepublik'
 'gesellschaft' 'demokratisch' 'bündnis' 'ddr' 'gesellschaftlich' '90'
 'brd' 'öffentlich' 'linke' 'frau' 'sozial' 'pds' 'ökologisch' 'grün'
 'müssen']
___topic 3___
['regelung' 'individuell' 'schule' 'bildung' 'bürgergeld' 'privat'
 'staatlich' 'staat' 'einzeln' 'europäisch' 'digital' 'euro' 'wettbewerb'
 'für' 'afd' 'müssen' 'eu' 'fdp' 'demokrat' 'liberal']
___topic 4___
['abrüstungsschritt' 'mädchen' 'international' 'gesellschaftsvertrag'
 'natur'

**Translation of topics:**

    ___topic 0___
    ['woman' 'support' 'sustainable' 'international' 'left' 'public' 'commune' 'create' 'euro' 'ecological' 
    'european' 'country' 'company' 'digital' 'child' 'germany' 'green' 'eu' 'human' 'strong']
    ___topic 1___
    ['create' 'international' 'cooperation' 'cdu' 'economic' 'social democrat' 'task' 'fdp' 'job' 'csu' 
    'spd' 'opportunity' 'goal' 'country' 'state' 'improve' 'european' 'citizen' 'liberal' 'must']
    ___topic 2___
    ['solidarity' 'man' 'democratisation' 'east germany' 'federal republic' 'society' 'democratic' 
    'alliance' 'gdr' 'social' '90' 'frg' 'public' 'left' 'woman' 'social' 'pds' 'ecological' 'green' 'must']
    ___topic 3___
    ['regulation' 'individual' 'school' 'education' 'basic income scheme' 'private' 'governmental' 'state' 
    'individual' 'european' 'digital' 'euro' 'competition' 'for' 'afd' 'must' 'eu' 'fdp' 'democrat' 'liberal']
    ___topic 4___
    ['disarmament step' 'girl' 'international' 'deed of association' 'nature' 'environment' 'world' 'german' 
    'life' 'people' 'product' 'third' 'man' '...' 'must' 'alliance' 'woman' 'ecological' '90' 'green']
    ___topic 5___
    ['corona' '1972' 'pandemic' ''' 'time' 'cf' 'employee' 'citizen' 'people/folk' 'digitalisation' 'eec' 'spd' 
    'public' 'federal government' 'left' 'must' 'social democrat' 'social democratic' 'digital' 'democrat']
    ___topic 6___
    ['year' 'josef' 'about' 'gdr' 'ask' 'spd' '1969' 'pds' 'country' 'voter' 'dm' '1971' '1972' 'liberal' 'cdu' 
    'social democrat' 'csu' 'democrat' 'afd' 'for']
    ___topic 7___
    ['1975' 'human' '1971' 'party' 'schmidt' 'right-wing coalition' '1969' 'government' '1972' 'worker' 
    'year' 'federal government' 'for' 'left' 'euro' 'afd' 'social democrat' 'spd' 'social democrat' 'fdp']
    ___topic 8___
    ['scheel (name of a politician)' 'society' 'josef' 'green' 'politics' 'consistency' 'federal republic' 'human' 
    'manifest' 'world' 'voter' 'social' 'democratic' 'peace' 'christian democratic' 'left' 'citizen' 'union' 
    'freedom' 'liberal']
    ___topic 9___
    ['led' 'eec' 'eu' 'world good' 'woman' 'east germany' 'employee' 'modern' 'left party' 'citizen' 'government'
    'federal republic' 'politics' 'federal government' 'social democratic' 'spd' 'social democrat' 'pds' 'fdp'
    'democrat']
    ___topic 10___
    ['support/funding' 'atlantic' 'must' 'peace' 'manifest' 'prosperity' 'workers' 'about' 'state' 
    'social democrat' 'security' 'citizen' 'family' 'spd' 'freedom' 'christian democratic' 'union' 'social' 
    'afd' 'for']
    ___topic 11___
    ['basic demands' 'program point' 'settlement' 'farmersdo' 'capital regeneration' 
    'collective guilt' 'soldiering' 'civilian prisoner' 'diversity' 'industrial restriction' 
    'promotion of agriculture' 'monopoly capitalism' 'educator's right' 'protracted' 'social democrat' 
    'expelled' 'denazification' 'recognition' 'rejection' 'emphasis']
    ___topic 12___
    ['development' 'ertl' 'left party' 'eurocrisis' 'scheel' 'common' 'federation' 'ask' 'european' 'people' 
    'bür' 'reform' 'party' 'strong' 'democratic' 'voter' 'federal government' 'social democratic' 'liberal' 'pds']
    ___topic 13___
    ['take into account' 'big investors' 'risk' 'cameron' 'atone' 'pri' 'milieu background' 'kirchhofsch' 
    'david' 'blo' 'pastaat' 'bailout' 'democrat' 'must' 'ver' 'eurocrisis' 'left' 'bür' 'germany' 'euro']
    ___topic 14___
    ['90' 'düsseldorf' 'expansion' "'" 'dm' 'gdr' '1961' 'promotion' 'green' 'csu' 'union' 'left' '1970' 'cdu' 
    'federation' '1971' 'fdp' '1972' 'christiandemocrat' 'eec']
    ___topic 15___
    ['democrat' 'lie' 'refugee problem' 'equalisation of burdens' 'displaced person' 'country' 'indisputable' 
    'social democratic' 'allied' 'liberal' 'frankfurter' 'struggle' 'union' 'left' 'east zone' 'socialisation' 
    'social democrat' 'christian democratic' 'germany' 'social democracy']
    ___topic 16___
    ['chance' 'work' 'farming' 'give' 'fatherland' 'led' 'enterprise' 'workplace' 'reunification' 'democracy'
    'ecological' 'social democratic' 'german' '1961' 'one-party rule' 'dm' 'social democrat' 'liberal' 
    'christian democratic' 'federal government']
    ___topic 17___
    ['collective guilt' 'federal constitution' 'liberal' 'bit' 'nuclear' 'government team' '1961' 'mindset' 
    'eg' 'grune' 'left' 'spd' 'world' 'dissent' 'government' 'ewg' 'emphasis' 'social democratic' 'germany' 
    'federal government']
    ___topic 18___
    ['pds' 'colour' 'spd' 'lebenszusammenhang' 'pandemic' 'land' '131' 'heimatvertriebener' 'this time' 'support' 
    'corona' '>' 'riding' 'fdp' 'bit' 'christian democratic' 'must' 'bürger*inn' 'grune' 'digital']
    ___topic 19___
    ['educational opportunity' '1970' '>' 'indisputable' 'right-wing party' 'corona' '1971' 'frankfurter' 
    'state issue' '1957' 'must' '1972' 'bürger*inn' 'dm' 'purchasing power' 'state' 'social democrat' 'digital' 
    'one-party rule' 'social democracy']

**Conclusion:**
The topics are generally difficult to interpret because the combinations of terms seem fairly random. From topic 6 onwards, the top terms also start to include years and names of politicians, which makes the topics even harder to interpret. This is why I went with only 6 topics.

![alt text](data/images/lsa.png)

## LDA
When generating more than 9 topics, topics start to repeat themselves. This first shows up with 10 topics, where topics 4 and 7 contain exactly the same words.

In [20]:
#We calculate LDA on the Bag Of Words, NOT TFIDF
count_vectoriser = CountVectorizer(tokenizer=my_tokeniser)
bag_of_words = count_vectoriser.fit_transform(df['text'])
vocab = count_vectoriser.get_feature_names_out()

print(bag_of_words.todense().shape)

(86, 55414)


### 5 Topics

In [21]:
#LDA
lda = LatentDirichletAllocation(n_components=5,
                                random_state=123,
                                learning_method='batch')
lda_topics = lda.fit_transform(bag_of_words)

In [22]:
#Most relevant tokens for each topic
for i, topic in enumerate(lda.components_):
    print("topic " + str(i) + ":")
    #Get last n tokens (highest values)
    print(vocab[topic.argsort()[-num_terms:]])

topic 0:
['ideologisch' 'partei' 'corona' 'beenden' 'staat' 'volk' 'abschaffung'
 'familie' '12' 'staatlich' 'deutschland' 'ezb' 'eu' 'lehnen' 'bürger'
 'müsse' 'über' 'deutsch' 'afd' 'für']
topic 1:
['zukunft' 'gesellschaft' 'groß' 'international' 'jahr' 'unternehmen'
 'ziel' 'gemeinsam' 'fördern' 'eu' 'deutsch' 'kind' 'europa' 'sozial'
 'schaffen' 'europäisch' 'land' 'mensch' 'stark' 'deutschland']
topic 2:
['wirtschaft' 'gesellschaft' 'groß' 'öffentlich' 'bundesrepublik' 'ziel'
 'europäisch' 'bürger' 'wirtschaftlich' 'frau' 'jahr' 'deutschland'
 'staat' 'politisch' 'politik' 'mensch' 'land' 'deutsch' 'sozial' 'müssen']
topic 3:
['fördern' 'jahr' 'linke' 'unternehmen' 'deutschland' 'europäisch' 'eu'
 'gesellschaft' 'land' 'arbeit' 'frau' 'schaffen' 'kind' 'leben' 'grün'
 'ökologisch' 'stark' 'öffentlich' 'sozial' 'mensch']
topic 4:
['freiheit' 'schaffen' 'kind' 'öffentlich' 'staatlich' 'unternehmen'
 'wettbewerb' 'mensch' 'sozial' 'deutsch' 'international' 'stark' 'land'
 'bürger

**Translation:**
    
    topic 0:
    ['ideological' 'party' 'corona' 'end' 'state' 'people' 'abolition' 'family' '12' 'state' 'germany' 'ecb' 
    'eu' 'lean' 'citizen' 'must' 'over' 'german' 'afd' 'for']
    topic 1:
    ['future' 'society' 'big' 'international' 'year' 'company' 'goal' 'together' 'promote' 'eu' 'german' 
    'child' 'europe' 'social' 'create' 'european' 'country' 'human' 'strong' 'germany']
    topic 2:
    ['economy' 'society' 'great' 'public' 'federal republic' 'goal' 'european' 'citizen' 'economic' 'woman' 
    'year' 'germany' 'state' 'political' 'politics' 'human' 'country' 'german' 'social' 'must']
    topic 3:
    ['promote' 'year' 'left' 'business' 'germany' 'european' 'eu' 'society' 'country' 'work' 'woman' 'create' 
    'child' 'live' 'green' 'ecological' 'strong' 'public' 'social' 'human']
    topic 4:
    ['freedom' 'create' 'child' 'public' 'state' 'business' 'competition' 'human' 'social' 'german' 'international'
    'strong' 'country' 'citizen' 'state' 'european' 'liberal' 'germany' 'fdp' 'must']

### 10 Topics

In [23]:
#LDA
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')
lda_topics = lda.fit_transform(bag_of_words)

In [24]:
#Most relevant tokens for each topic
for i, topic in enumerate(lda.components_):
    print("topic " + str(i) + ":")
    #Get last n tokens (highest values)
    print(vocab[topic.argsort()[-num_terms:]])

topic 0:
['identität' 'abschaffung' '12' 'beenden' 'kind' 'national' 'staat'
 'staatlich' 'ezb' 'euro' 'bürger' 'familie' 'lehnen' 'müsse' 'über'
 'eu' 'deutschland' 'deutsch' 'afd' 'für']
topic 1:
['vermassung' 'alliierter' 'erziehungswesen' 'besatzungszone'
 'frankfurter' 'christlichdemokratisch' 'rechtspartei' 'sozialisierung'
 'reich' 'mittelschicht' 'vaterland' 'bauerntum' 'einparteienherrschaft'
 'flüchtling' 'gemeinschaftsschule' 'kaufkraft' 'volk' 'begeben' 'kampf'
 'sozialdemokratie']
topic 2:
['verbesserung' 'rechtskoalition' 'leistung' 'reform' 'milliarde' 'ausbau'
 'müssen' 'million' 'sozialdemokrat' '000' 'bund' 'prozent' 'liberal'
 '1969' 'gesetz' 'jahr' '1970' 'dm' '1971' '1972']
topic 3:
['gesellschaft' 'deutschland' 'demokratisch' 'prozent' 'eu' 'schaffen'
 'ökologisch' 'jahr' 'kind' 'land' 'unternehmen' 'leben' 'euro' 'arbeit'
 'beschäftigter' 'stark' 'linke' 'mensch' 'öffentlich' 'sozial']
topic 4:
['koalitionsaussage' 'bundesernährungsminister' 'denkarbeit'
 'ge

**Translation:**

    topic 0:
    ['identity' 'abolition' '12' 'end' 'child' 'national' 'state' 'governmental' 'ecb' 'euro' 'citizen' 'family'
    'lean' 'must' 'over' 'eu' 'germany' 'german' 'afd' 'for']
    topic 1:
    ['massification' 'allied' 'education' 'occupation zone' 'frankfurter' 'christian democratic' 'right-wing party'
    'socialisation' 'rich' 'middle class' 'fatherland' 'peasantry' 'one-party rule' 'refugee' 'community school' 
    'purchasing power' 'people' 'give' 'struggle' 'social democracy']
    topic 2:
    ['improvement' 'right-wing coalition' 'performance' 'reform' 'billion' 'expansion' 'must' 'million' 
    'social democrat' '000' 'federal' 'percent' 'liberal' '1969' 'law' 'year' '1970' 'dm' '1971' '1972']
    topic 3:
    ['society' 'germany' 'democratic' 'percent' 'eu' 'create' 'ecological' 'year' 'child' 'country' 'company' 
    'life' 'euro' 'work' 'employed' 'strong' 'left' 'human' 'public' 'social']
    topic 4:
    ['coalition statement' 'federal food minister' 'thought work' 'legislative package' 'legislative work' 
    'fear propaganda' '28september1969' 'ministerial team' 'regeneration' 'mature' 'entitlement system' 'three hundred'
    'barzel' 'voter mandate' 'ninety' 'secondary chancellor' 'coalition capable' 'dreaming' 'just' 'overslept']
    topic 5:
    ['support' 'together' 'international' 'europe' 'promote' 'eu' 'public' 'society' 'ecological' 'life' 'woman'
    'child' 'create' 'european' 'country' 'germany' 'social' 'green' 'strong' 'human']
    topic 6:
    ['freedom' 'social' 'competition' 'child' 'chance' 'create' 'international' 'eu' 'business' 'german' 'state'
    'liberal' 'citizen' 'must' 'country' 'fdp' 'european' 'human' 'strong' 'germany']
    topic 7:
    ['coalition statement' 'federal food minister' 'thought work' 'legislative package' 'legislative work' 
    'fear propaganda' '28september1969' 'ministerial team' 'regeneration' 'mature' 'entitlement system' 'three hundred'
    'barzel' 'voter mandate' 'ninety' 'secondary chancellor' 'coalition capable' 'dreaming' 'just' 'overslept']
    topic 8:
    ['education' 'goal' 'europe' 'german' 'chance' 'international' 'great' 'state' 'promote' 'enable' 'human' 
    'create' 'country' 'company' 'germany' 'eu' 'digital' 'european' 'strong' 'democrat']
    topic 9:
    ['strong' 'society' 'economy' 'goal' 'big' 'public' 'cdu' 'citizen' 'economic' 'european' 'year' 'political' 
    'state' 'politics' 'human' 'german' 'germany' 'country' 'must' 'social']

### 20 Topics

In [25]:
#LDA
lda = LatentDirichletAllocation(n_components=20,
                                random_state=123,
                                learning_method='batch')
lda_topics = lda.fit_transform(bag_of_words)

In [26]:
#Most relevant tokens for each topic
for i, topic in enumerate(lda.components_):
    print("topic " + str(i) + ":")
    #Get last n tokens (highest values)
    print(vocab[topic.argsort()[-num_terms:]])

topic 0:
['geburtshilfe' 'geburtshilfeabteilung' 'geburtshilfegipfel'
 'geburtshilfeklinik' 'geburtsvorbereitung' 'geburtshilfestärkungsgesetz'
 'geburtsjahr' 'geburtsjahrgang' 'geburtsort' 'geburtsortprinzip'
 'geburtsortsprinzip' 'geburtsrecht' 'geburtsstark' 'geburtsstation'
 'geburtstag' 'geburtshilflich' '\uf02fv' "zukunft''" 'genossin'
 'anrechnungsverfahren']
topic 1:
['besatzungszone' 'lüge' 'zwangswirtschaft' 'währungsreform' 'planung'
 'reich' 'alliierter' 'gemeinschaftsschule' 'kirche' 'bodenreform'
 'eigentum' 'sozialisierung' 'einheit' 'deutsch' 'flüchtling' 'volk' 'fdp'
 'kampf' 'frei' 'sozialdemokratie']
topic 2:
['schicht' 'deutsch' "marktwirtschaft'" 'ware' 'erhaltung' 'industriell'
 'krieg' 'politisch' 'partei' 'grundlage' 'leistungswettbewerb' 'sozial'
 'wirtschaftspolitik' 'monopolkontrolle' 'produktion' 'planwirtschaft'
 'echt' 'preis' 'müssen' 'volk']
topic 3:
['lohn' 'eu' 'solidarisch' 'land' 'kind' 'schaffen' 'jahr' 'ökologisch'
 'prozent' 'unternehmen' 'leben' 

**Translation:**
       
    topic 0:
    ['obstetrics' 'obstetric department' 'obstetric summit' 'obstetric clinic' 'obstetric preparation' 
    'obstetric strengthening act' 'birth year' 'birth cohort' 'birthplace' 'birthplace principle' 'birth right' 
    'birth strong' 'birth ward' 'birth day' 'obstetric' '\uf02fv' 'future'' 'genossin' 'crediting procedure']
    topic 1:
    ['occupation zone' 'lie' 'forced economy' 'currency reform' 'planning' 'reich' 'allied' 'community school'
    'church' 'land reform' 'property' 'socialisation' 'unity' 'german' 'refugee' 'volk' 'fdp' 'struggle' 'free' 
    'social democracy']
    topic 2:
    ['stratum' 'german' 'market economy'' 'commodity' 'conservation' 'industrial' 'war' 'political' 'party' 
    'basis' 'meritocracy' 'social' 'economic policy' 'monopoly control' 'production' 'planned economy' 'real' 
    'price' 'must' 'people']
    topic 3:
    ['wage' 'eu' 'solidarity' 'country' 'child' 'create' 'year' 'ecological' 'percent' 'enterprise' 'life' 
    'democratic' 'strong' 'euro' 'work' 'employed' 'human' 'left' 'public' 'social']
    topic 4:
    ['obstetrics' 'obstetrics department' 'obstetrics summit' 'obstetrics clinic' 'obstetrics preparation' 
    'obstetrics strengthening act' 'birth year' 'birth cohort' 'birthplace' 'birthplace principle' 'birth right'
    'birth strong' 'birth ward' 'birth day' 'obstetric' '\uf02fv' 'future'' 'genossin' 'imputation procedure']
    topic 5:
    ['international' 'company' 'big' 'europe' 'together' 'society' 'support' 'public' 'promote' 'life' 'eu' 
    'green' 'child' 'create' 'country' 'germany' 'social' 'european' 'human' 'strong']
    topic 6:
    ['liberal' 'year' 'euro' 'europe' 'democracy' 'international' 'chance' 'child' 'citizen' 'german' 'state' 
    'create' 'eu' 'company' 'country' 'european' 'human' 'fdp' 'strong' 'germany']
    topic 7:
    ['obstetrics' 'obstetrics department' 'obstetrics summit' 'obstetrics clinic' 'obstetrics preparation' 
    'obstetrics strengthening act' 'birth year' 'birth cohort' 'birthplace' 'birthplace principle' 'birth right'
    'birth strong' 'birth ward' 'birth day' 'obstetric' '\uf02fv' 'future'' 'genossin' 'crediting procedure']
    topic 8:
    ['social' 'goal' 'education' 'opportunity' 'german' 'great' 'promote' 'international' 'enable' 'create' 
    'state' 'digital' 'country' 'human' 'company' 'european' 'germany' 'eu' 'strong' 'democrat']
    topic 9:
    ['citizen' 'economy' 'create' 'family' 'goal' 'great' 'future' 'year' 'economic' 'europe' 'european' 
    'politics' 'csu' 'cdu' 'strong' 'german' 'human' 'social' 'country' 'germany']
    topic 10:
    ['development' 'private' 'improve' 'create' 'goal' 'strong' 'state' 'freedom' 'necessary' 'german' 
    'international' 'public' 'germany' 'country' 'social' 'european' 'citizen' 'state' 'liberal' 'must']
    topic 11:
    ['obstetrics' 'obstetric department' 'obstetric summit' 'obstetric clinic' 'obstetric preparation' 
    'obstetric strengthening act' 'birth year' 'birth cohort' 'birthplace' 'birthplace principle' 'birth right' 
    'birth strong' 'birth ward' 'birth day' 'obstetric' '\uf02fv' 'future'' 'genossin' 'imputation procedure']
    topic 12:
    ['solidarity' 'life' 'armed forces' 'health' 'climate neutral' 'example' 'conversion' 'violence' 'year' 
    'queer' 'coronakris' 'corporation' 'digital' 'public' 'social' 'socio-ecological' 'chapter' 'employed' 
    'cf' 'human']
    topic 13:
    ['social' 'country' 'strong' 'year' 'national' 'citizen' 'euro' 'state' 'european' 'child' 'must' 'lean' 
    'over' 'state' 'family' 'eu' 'germany' 'deutsch' 'afd' 'for']
    topic 14:
    ['atomic energy' 'autocracy' 'social democracy' 'nuclear' 'nuclear weapon' 'full' 'adenauer' 'price' 
    'realistic' 'state' 'people' 'creative' 'reunification' 'security system' 'arms race' 'unfolding' 'demand' 
    'freedom' 'conclusion' 'party']
    topic 15:
    ['woman' 'federation' 'achievement' 'security' 'economic' 'green' 'dm' 'human' 'public' 'state' 'citizen' 
    'reform' '90' 'social democrat' 'politics' 'alliance' 'country' 'year' 'must' 'social']
    topic 16:
    ['development' 'democratic' 'child' 'economy' 'germany' 'european' 'social' 'green' 'german' 'society' 
    'federal government' 'political' 'politics' 'public' 'country' 'human' 'ecological' 'woman' 'must' 'social']
    topic 17:
    ['war' 'person' 'nature' 'chemical' 'resistance' 'girl' 'christian democratic' 'decide' 'people' 
    'population' 'ban' 'nato' 'life' 'production' 'world' 'military' 'third' 'federal republic' 'woman' 'green']
    topic 18:
    ['obstetrics' 'obstetrics department' 'obstetrics summit' 'obstetrics clinic' 'obstetrics preparation' 
    'obstetrics strengthening act' 'birth year' 'birth cohort' 'birthplace' 'birthplace principle' 'birth right'
    'birth strong' 'birth ward' 'birth day' 'obstetric' '\uf02fv' 'future'' 'genossin' 'imputation procedure']
    topic 19:
    ['awake' 'collegial' 'decisive' 'dreaming' 'environmental protection programme' 'unpredictability' 'hans' 
    'walter' 'reason' 'genscher' 'dietrich' 'domestic' 'utopia' 'solidification' 'right of way' 'josef' 'scheel'
    'ertl' 'ask' 'voter']

### 9 Topics
Because the repeated topics first appear with 10 topics, I tested whether they disappear with 9 topics.
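
Repeated topics can also be found programmatically instead of by reading the printed lists. The following is a minimal sketch, assuming that two topics count as repeated when their sets of top terms are identical. The helper `find_duplicate_topics` and the matrix `toy_components` are hypothetical; in the notebook, `lda.components_` would take the place of the toy matrix.

```python
import numpy as np

def find_duplicate_topics(components, num_terms=20):
    """Return index pairs of topics whose top-term sets are identical."""
    # For each topic (row), collect the column indices of its highest weights
    top_terms = [frozenset(np.argsort(row)[-num_terms:]) for row in components]
    pairs = []
    for i in range(len(top_terms)):
        for j in range(i + 1, len(top_terms)):
            if top_terms[i] == top_terms[j]:
                pairs.append((i, j))
    return pairs

# Toy stand-in for lda.components_ (assumption): rows 1 and 3 are identical
toy_components = np.array([
    [5.0, 1.0, 0.1, 0.1, 3.0, 0.2],
    [0.1, 0.1, 4.0, 6.0, 0.1, 2.0],
    [2.0, 7.0, 0.3, 0.1, 0.1, 5.0],
    [0.1, 0.1, 4.0, 6.0, 0.1, 2.0],
])

duplicate_pairs = find_duplicate_topics(toy_components, num_terms=3)
print(duplicate_pairs)
```

Called as `find_duplicate_topics(lda.components_, num_terms)` after each fit, this would list the repeated topic pairs for that number of topics.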

In [27]:
#LDA
lda = LatentDirichletAllocation(n_components=9,
                                random_state=123,
                                learning_method='batch')
lda_topics = lda.fit_transform(bag_of_words)

In [28]:
#Most relevant tokens for each topic
for i, topic in enumerate(lda.components_):
    print("topic " + str(i) + ":")
    #Get last n tokens (highest values)
    print(vocab[topic.argsort()[-num_terms:]])

topic 0:
['politisch' 'maßnahme' 'volk' 'europäisch' 'ezb' 'euro' 'national'
 'staatlich' 'kind' 'bürger' 'staat' 'lehnen' 'müsse' 'familie' 'über'
 'eu' 'deutschland' 'deutsch' 'afd' 'für']
topic 1:
['maßnahme' 'politisch' 'partei' 'sozial' 'regierung' 'europäisch' 'staat'
 'freiheit' 'groß' 'gemeinde' 'bund' 'bundesrepublik' 'müssen' 'aufgabe'
 'sozialdemokratisch' 'land' 'deutschland' 'bundesregierung' 'volk'
 'deutsch']
topic 2:
['groß' 'europäisch' 'gesellschaft' 'wirtschaft' 'frau' 'freiheit'
 'öffentlich' 'ziel' 'bundesrepublik' 'jahr' 'wirtschaftlich' 'bürger'
 'mensch' 'politik' 'land' 'deutsch' 'politisch' 'staat' 'sozial' 'müssen']
topic 3:
['prozent' 'pds' 'schaffen' 'gesellschaft' 'unternehmen' 'jahr' 'euro'
 'deutschland' 'kind' 'demokratisch' 'leben' 'land' 'ökologisch' 'arbeit'
 'beschäftigter' 'stark' 'linke' 'mensch' 'öffentlich' 'sozial']
topic 4:
['hallstei' 'wahlalter' 'hauptstadt' 'ausschuß' 'aufstiegschance'
 'gesamthochschul' 'christ' 'almosen' 'aktionsprogr

**Translation:**
   
    topic 0:
    ['political' 'measure' 'people' 'european' 'ecb' 'euro' 'national' 'state-owned' 'child' 'citizen' 'state' 
    'lean' 'must' 'family' 'about' 'eu' 'germany' 'german' 'afd' 'for']
    topic 1:
    ['measure' 'political' 'party' 'social' 'government' 'european' 'state' 'freedom' 'big' 'community' 
    'federation' 'federal republic' 'must' 'task' 'social democratic' 'country' 'germany' 'federal government'
    'people' 'german']
    topic 2:
    ['big' 'european' 'society' 'economy' 'woman' 'freedom' 'public' 'goal' 'federal republic' 'year' 'economic' 
    'citizen' 'human' 'politics' 'country' 'german' 'political' 'state' 'social' 'must']
    topic 3:
    ['percent' 'pds' 'create' 'society' 'company' 'year' 'euro' 'germany' 'child' 'democratic' 'life' 'country'
    'ecological' 'work' 'employed' 'strong' 'left' 'human' 'public' 'social']
    topic 4:
    ['hallstei' 'voting age' 'capital' 'committee' 'opportunity for advancement' 'comprehensive university' 
    'christian' 'handout' 'action programme' 'comrade (female form)' 'comrade' 
    'defence of the country' 'small shareholder' 'formula' 'commentary' 'these' 'shareholder' 'extensive' 
    'basic industry' 'sielen']
    topic 5:
    ['goal' 'support' 'year' 'big' 'promote' 'eu' 'society' 'public' 'life' 'child' 'germany' 'create' 'european'
    'woman' 'country' 'ecological' 'social' 'strong' 'green' 'human']
    topic 6:
    ['promote' 'society' 'fdp' 'goal' 'europe' 'must' 'year' 'state' 'child' 'citizen' 'company' 'international'
    'create' 'german' 'social' 'european' 'human' 'strong' 'country' 'germany']
    topic 7:
    ['people' 'christian' 'freedom' 'year' 'unity' 'great' 'spd' 'family' 'task' 'düsseldorf'
    'europe' 'world' 'union' 'future' 'politics' 'country' 'germany' 'german' 'csu' 'cdu']
    topic 8:
    ['support' 'german' 'goal' 'big' 'international' 'child' 'state' 'together' 'company' 'promote' 'europe'
    'create' 'eu' 'human being' 'country' 'digital' 'european' 'germany' 'strong' 'democracy']


**Conclusion**: Using 9 topics reduces the repetition of terms across topics, but I couldn't make sense of topic 4, so I decided to go with 8 topics.
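The "repetition" between topics was judged by eye here. As a rough sketch (not part of the original analysis), it could be quantified as the mean pairwise Jaccard overlap of the top-term lists. The function name `top_term_overlap` and the toy term lists are hypothetical and only for illustration.

```python
from itertools import combinations

def top_term_overlap(topics):
    """Mean pairwise Jaccard overlap between the top-term lists of all topics.

    topics: list of term lists, one list per topic.
    Returns a value between 0 (no shared terms) and 1 (identical term lists).
    """
    term_sets = [set(terms) for terms in topics]
    pairs = list(combinations(term_sets, 2))
    if not pairs:
        return 0.0  # fewer than two topics: nothing to compare
    overlaps = []
    for a, b in pairs:
        union = a | b
        # Guard against two empty term lists
        overlaps.append(len(a & b) / len(union) if union else 0.0)
    return sum(overlaps) / len(overlaps)

# Hypothetical toy example, not real model output
toy_topics = [
    ["staat", "sozial", "land"],
    ["staat", "sozial", "mensch"],
    ["eu", "euro", "kind"],
]
print(top_term_overlap(toy_topics))
```

With the notebook's variables, the input could be built as `[list(vocab[topic.argsort()[-num_terms:]]) for topic in lda.components_]`. A lower value would indicate less repetition between topics.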

### 8 Topics

In [29]:
#LDA
lda = LatentDirichletAllocation(n_components=8,
                                random_state=123,
                                learning_method='batch')
lda_topics = lda.fit_transform(bag_of_words)

In [30]:
#Most relevant tokens for each topic
for i, topic in enumerate(lda.components_):
    print("topic " + str(i) + ":")
    #Get last n tokens (highest values)
    print(vocab[topic.argsort()[-num_terms:]])

topic 0:
['corona' 'abschaffung' 'euro' 'beenden' '12' 'kind' 'national' 'ezb'
 'staatlich' 'staat' 'bürger' 'lehnen' 'familie' 'eu' 'müsse' 'über'
 'deutschland' 'deutsch' 'afd' 'für']
topic 1:
['regierung' 'staat' 'politik' 'freiheit' 'gemeinde' 'wirtschaftlich'
 'müssen' 'bundesrepublik' 'bund' 'politisch' 'sozial' 'partei' 'groß'
 'aufgabe' 'sozialdemokratisch' 'land' 'deutschland' 'bundesregierung'
 'volk' 'deutsch']
topic 2:
['frau' 'gesellschaft' 'wirtschaft' 'deutschland' 'freiheit' 'europäisch'
 'öffentlich' 'bundesrepublik' 'ziel' 'wirtschaftlich' 'jahr' 'bürger'
 'politisch' 'politik' 'mensch' 'deutsch' 'staat' 'land' 'sozial' 'müssen']
topic 3:
['schaffen' 'gesellschaftlich' 'jahr' 'unternehmen' 'deutschland'
 'gesellschaft' 'pds' 'euro' 'kind' 'land' 'leben' 'stark' 'demokratisch'
 'arbeit' 'ökologisch' 'beschäftigter' 'linke' 'mensch' 'öffentlich'
 'sozial']
topic 4:
['arbeitsplatz' 'marktwirtschaftlich' 'privatisierung' 'aufgabe' 'mittel'
 'ökologisch' 'staatlich' 'v

**Translation of topics:**
   
    topic 0:
    ['corona' 'abolition' 'euro' 'end' '12' 'child' 'national' 'ecb' 'state-owned' 'state' 'citizen' 
    'lean' 'family' 'eu' 'must' 'about' 'germany' 'german' 'afd' 'for']
    topic 1:
    ['government' 'state' 'politics' 'freedom' 'community' 'economic' 'must' 'federal republic' 'federation' 
    'political' 'social' 'party' 'big' 'task' 'social democratic' 'country' 'germany' 'federal government'
    'people' 'german']
    topic 2:
    ['woman' 'society' 'economy' 'germany' 'freedom' 'european' 'public' 'federal republic' 'goal' 'economic' 
    'year' 'citizen' 'political' 'politics' 'human' 'german' 'state' 'country' 'social' 'must']
    topic 3:
    ['create' 'societal' 'year' 'company' 'germany' 'society' 'pds' 'euro' 'child' 'country' 'life' 'strong'
    'democratic' 'work' 'ecological' 'employee' 'left' 'human' 'public' 'social']
    topic 4:
    ['job' 'market-based' 'privatisation' 'task' 'medium/means' 'ecological' 'state-owned' 'improve' 'high' 
    'private' 'tax' 'state' 'necessary' 'public' 'international' 'citizen' 'european' 'state' 'liberal' 'must']
    topic 5:
    ['goal' 'year' 'international' 'great' 'eu' 'promote' 'society' 'public' 'life' 'germany' 'child' 'create'
    'european' 'woman' 'country' 'ecological' 'social' 'strong' 'green' 'human']
    topic 6:
    ['together' 'citizen' 'great' 'goal' 'state' 'year' 'promote' 'international' 'eu' 'europe' 'child' 'company'
    'german' 'create' 'social' 'european' 'country' 'human' 'strong' 'germany']
    topic 7:
    ['people' 'mark' (I think this refers to the Deutsche Mark, the currency before the euro) 'great' 'year' 'family' 
    'christian' 'task' 'europe' 'unity' 'düsseldorf' 'world' 'spd' 'future' 'union' 'politics' 'country' 'german'
    'germany' 'csu' 'cdu']


![alt text](data/images/lda.png)

## Comment – coherence score
In hindsight (and for future work), a coherence score could have helped to pick a suitable number of topics, instead of judging the top terms of each model by eye.
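As a minimal sketch of what such a score could look like, the function below computes the UMass coherence of a single topic from document co-occurrence counts. This is one of several coherence measures and was not used in this notebook. The function name `umass_coherence` and the toy corpus are hypothetical and only for illustration.

```python
import math

def umass_coherence(top_terms, documents):
    """UMass coherence of one topic (values closer to 0 mean the terms co-occur more often).

    top_terms: topic terms ordered from most to least relevant.
    documents: iterable of token lists (e.g. the lemmatised manifestos).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*terms):
        # Number of documents that contain all of the given terms
        return sum(1 for doc in doc_sets if all(t in doc for t in terms))

    score = 0.0
    for m in range(1, len(top_terms)):
        for l in range(m):
            freq_l = doc_freq(top_terms[l])
            if freq_l == 0:
                continue  # skip terms that never occur in the reference corpus
            co_freq = doc_freq(top_terms[m], top_terms[l])
            # +1 smoothing avoids log(0) for term pairs that never co-occur
            score += math.log((co_freq + 1) / freq_l)
    return score

# Hypothetical toy corpus, not the manifesto data
toy_docs = [["staat", "sozial"], ["staat", "sozial"], ["staat", "eu"], ["eu", "euro"]]
print(umass_coherence(["staat", "sozial"], toy_docs))  # terms that often co-occur
print(umass_coherence(["staat", "euro"], toy_docs))    # terms that never co-occur
```

With the notebook's variables, the top terms of a topic in descending order of relevance could be obtained as `vocab[topic.argsort()[-num_terms:]][::-1]`. Averaging the score over all topics for each candidate `n_components` would then give one number per model to compare.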