### Sources
**Unless otherwise stated, the code in this notebook is from "NLP Week 2.2 Lecture" and "NLP Week 2.2 Task".** Other sources, or when I wrote the code myself (or modified it), are marked with **##** at the beginning of the corresponding cell.

#### Citation for the Hanover Tagger used in this notebook:
For an explanation of the underlying ideas see: Christian Wartena (2019). A Probabilistic Morphology Model for German Lemmatization. In: Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers. Pp. 40-49, Erlangen. https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/KONVENS2019_paper_10.pdf https://doi.org/10.25968/opus-1527

#### Stopword List from:
https://github.com/solariz/german_stopwords

Marco Götze and Steffen Geyer (2016), Source and more Information: https://solariz.de/de/downloads/6/german-enhanced-stopwords.htm
____


# Topic Modelling
Note: The reasoning behind my choice of the number of topics used here can be found in the 04.2_Topic-Numbers notebook.
I separated it from this one to keep this notebook a little easier to follow.

### Preparation: Load data, define tokeniser

In [1]:
#Import all necessary packages
import numpy as np
import pandas as pd 
import re
import nltk
from HanTa import HanoverTagger as ht
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import LatentDirichletAllocation
from nltk.tokenize.casual import casual_tokenize

In [2]:
#load in data
df = pd.read_csv('data/sorted_manifestos.csv', encoding='utf-8')

In [3]:
## I wrote the code in this cell

#add a column with party names
partyid_greens = [41111,41112,41113]
partyid_lefts = [41221, 41222, 41223]
partynames = []
#loop through all indices of the dataset
for i in range(len(df)):
    #get the party ID of this index/row
    party_id = df.at[i, 'party']
    #add the corresponding party name
    if party_id in partyid_greens:
        partynames.append('Greens')
    elif party_id in partyid_lefts:
        partynames.append('Lefts')
    elif party_id == 41420:
        partynames.append('FDP')
    elif party_id == 41521:
        partynames.append('CDU')
    elif party_id == 41320:
        partynames.append('SPD')
    elif party_id == 41620:
        partynames.append('DP')
    elif party_id == 41953:
        partynames.append('AFD')

#add column with party names to df
df['partyname'] = partynames
#shorten the df
df = df[['date','partyname','text']]
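
The if/elif chain above amounts to a lookup table. Here is a minimal sketch of the same mapping as a plain dictionary, using the party IDs from the cell above. The names `PARTY_NAMES` and `party_name` and the `'unknown'` fallback are my own assumptions and not part of the lecture code.

```python
# Party IDs copied from the cell above; the dict name is hypothetical.
PARTY_NAMES = {
    41111: 'Greens', 41112: 'Greens', 41113: 'Greens',
    41221: 'Lefts', 41222: 'Lefts', 41223: 'Lefts',
    41420: 'FDP',
    41521: 'CDU',
    41320: 'SPD',
    41620: 'DP',
    41953: 'AFD',
}

def party_name(party_id):
    #return the party name, or 'unknown' for an ID that is not in the table (assumed fallback)
    return PARTY_NAMES.get(party_id, 'unknown')

print([party_name(i) for i in (41113, 41320, 99999)])
```

With a fallback like this the list of names always has one entry per row. The loop above instead skips any ID it does not know, which would make the list shorter than the dataframe and the column assignment would then fail.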

In [4]:
## code in this cell is from the author NNK (2022), 'Pandas Combine Two Columns of Text in DataFrame'
## link: https://sparkbyexamples.com/pandas/pandas-combine-two-columns-of-text-in-dataframe/

#add a column that combines date and name, for easier access later
df["name & date"] = df['partyname'] +", "+ df["date"].astype(str)
df.sample(3)

Unnamed: 0,date,partyname,text,name & date
42,195709,FDP,AKTIONSPROGRAMM 1957 Verkündet auf dem Wahlko...,"FDP, 195709"
19,202109,Lefts,"Zeit zu handeln! Für soziale Sicherheit, Fried...","Lefts, 202109"
80,194908,DP,Programmpunkte der Deutschen Partei 1. Erneu...,"DP, 194908"


In [5]:
## code in this cell is from GeeksForGeeks (2021), 'How to Read Text File Into List in Python?', Example 1
## link: https://www.geeksforgeeks.org/how-to-read-text-file-into-list-in-python/
## modified to read my data and split the text on spaces instead of dots

#open .txt file that contains the long stopword list ('with' closes the file again afterwards)
with open('stopwords/german_stopwords/german_stopwords_full_edit.txt', "r") as file_full:
    data_full = file_full.read()

#create a list
german_stopwords_full = data_full.replace('\n', ' ').split(" ")

In [6]:
## I've modified this cell to use my lemmatiser and stopword list

tagger = ht.HanoverTagger('morphmodel_ger.pgz')

#create tokeniser
def my_tokeniser(doc):
    tokens = re.split(r'[\s.,;!?/"()#»«„“”:&–-]+', doc)
    #remove stopwords
    tokens_sw = [t for t in tokens if t.lower() not in german_stopwords_full and t != '']
    #lemmatiser
    tags = [tagger.analyze(word) for word in tokens_sw]
    tokens_sw_lemma = [lemma.lower() for (lemma,pos) in tags]
    return tokens_sw_lemma
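
To make the first two steps of the tokeniser easier to follow, here is a small sketch that runs the same split pattern and stopword filter on one sentence. It is only an illustration: the stopword list is a tiny stand-in, the example sentence is made up, and the HanTa lemmatiser is replaced by plain lowercasing so the cell runs without the morphology model. `toy_tokeniser` and `toy_stopwords` are hypothetical names.

```python
import re

# Tiny stand-in stopword list (assumption); the notebook loads a much longer one.
toy_stopwords = ['wir', 'für', 'und', 'die']

def toy_tokeniser(doc, stopwords):
    #same split pattern as in my_tokeniser
    tokens = re.split(r'[\s.,;!?/"()#»«„“”:&–-]+', doc)
    #drop stopwords (case-insensitive) and empty strings
    kept = [t for t in tokens if t.lower() not in stopwords and t != '']
    #lowercase only; the real tokeniser lemmatises with HanTa at this point
    return [t.lower() for t in kept]

print(toy_tokeniser('Wir kämpfen für Frieden, Freiheit und die Umwelt!', toy_stopwords))
```

Because the hyphen is part of the split pattern, hyphenated compounds such as 'Ost-West-Konflikt' are broken into their parts as well.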

## LSA / SVD

In [7]:
#get TFIDF
tfidf_vectoriser = TfidfVectorizer(tokenizer=my_tokeniser)
tfidf = tfidf_vectoriser.fit_transform(df["text"])
#Save list of unique tokens (vocab) for later
vocab = tfidf_vectoriser.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf.todense(), columns = vocab)
print(tfidf.todense().shape)

(86, 55414)


In [8]:
#Subtract mean
tfidf_df = tfidf_df - tfidf_df.mean()
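
`tfidf_df.mean()` returns one mean per column (term), and the subtraction removes that mean from every manifesto's value for the term. As far as I understand it, `TruncatedSVD` does not centre the data itself, so this step is what makes the decomposition behave like a PCA. A minimal sketch of the centring in plain Python, with made-up numbers:

```python
# Toy TF-IDF matrix: 3 documents (rows) x 2 terms (columns); the values are made up.
toy_tfidf = [
    [0.2, 0.0],
    [0.4, 0.3],
    [0.6, 0.9],
]

n_docs = len(toy_tfidf)
n_terms = len(toy_tfidf[0])
#mean of each column (term) across all documents
col_means = [sum(row[j] for row in toy_tfidf) / n_docs for j in range(n_terms)]
#subtract the column mean from every entry
centred = [[value - col_means[j] for j, value in enumerate(row)] for row in toy_tfidf]

print([round(m, 3) for m in col_means])
print([[round(v, 3) for v in row] for row in centred])
```

After centring, every column sums to (approximately) zero, so a value now says whether a term is used more or less than in the average manifesto.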

In [9]:
#How many topics?
num_topics = 6
pd.options.display.max_columns=num_topics
labels = ['topic{}'.format(i) for i in range(num_topics)]

In [10]:
#Calculate topics
svd = TruncatedSVD(n_components = num_topics, n_iter = 100) 
svd_topic_vectors = svd.fit_transform(tfidf_df.values)

In [11]:
#Most relevant tokens for each topic
topic_weights = pd.DataFrame(svd.components_.T, index=vocab, columns=labels)
num_terms = 20
for i in range(num_topics):
    print("___topic " + str(i) + "___")
    topicName = "topic" + str(i)
    weightedlist = topic_weights.get(topicName).sort_values()[-num_terms:]
    print(weightedlist.index.values)

___topic 0___
['frau' 'unterstützen' 'nachhaltig' 'international' 'linke' 'öffentlich'
 'kommune' 'schaffen' 'euro' 'ökologisch' 'europäisch' 'land'
 'unternehmen' 'digital' 'kind' 'deutschland' 'grün' 'eu' 'mensch' 'stark']
___topic 1___
['schaffen' 'international' 'zusammenarbeit' 'cdu' 'wirtschaftlich'
 'sozialdemokrat' 'aufgabe' 'fdp' 'arbeitsplatz' 'csu' 'spd' 'chance'
 'ziel' 'land' 'staat' 'verbessern' 'europäisch' 'bürger' 'liberal'
 'müssen']
___topic 2___
['solidarisch' 'mann' 'demokratisierung' 'ostdeutschland' 'bundesrepublik'
 'gesellschaft' 'demokratisch' 'bündnis' 'ddr' 'gesellschaftlich' '90'
 'brd' 'öffentlich' 'linke' 'frau' 'sozial' 'pds' 'ökologisch' 'grün'
 'müssen']
___topic 3___
['regelung' 'individuell' 'schule' 'bildung' 'bürgergeld' 'privat'
 'staatlich' 'staat' 'einzeln' 'europäisch' 'digital' 'euro' 'wettbewerb'
 'für' 'afd' 'müssen' 'eu' 'fdp' 'demokrat' 'liberal']
___topic 4___
['abrüstungsschritt' 'mädchen' 'international' 'gesellschaftsvertrag'
 'natur'

**Translation of topics:**

    ___topic 0___
    ['woman' 'support' 'sustainable' 'international' 'left' 'public' 'commune' 'create' 'euro' 'ecological' 
    'european' 'country' 'company' 'digital' 'child' 'germany' 'green' 'eu' 'human' 'strong']
    ___topic 1___
    ['create' 'international' 'cooperation' 'cdu' 'economic' 'social democrat' 'task' 'fdp' 'job' 'csu' 
    'spd' 'opportunity' 'goal' 'country' 'state' 'improve' 'european' 'citizen' 'liberal' 'must']
    ___topic 2___
    ['solidarity' 'man' 'democratisation' 'east germany' 'federal republic' 'society' 'democratic' 
    'alliance' 'gdr' 'societal' '90' 'frg' 'public' 'left' 'woman' 'social' 'pds' 'ecological' 'green' 'must']
    ___topic 3___
    ['regulation' 'individual' 'school' 'education' 'basic income scheme' 'private' 'state-owned' 'state' 
    'single/individual' 'european' 'digital' 'euro' 'competition' 'for' 'afd' 'must' 'eu' 'fdp' 'democrat' 'liberal']
    ___topic 4___
    ['disarmament step' 'girl' 'international' 'social contract' 'nature' 'environment' 'world' 'german' 
    'life' 'people' 'product' 'third' 'man' '...' 'must' 'alliance' 'woman' 'ecological' '90' 'green']
    ___topic 5___
    ['corona' '1972' 'pandemic' ''' 'time' 'cf' 'employee' 'citizen' 'people/folk' 'digitalisation' 'eec' 'spd' 
    'public' 'federal government' 'left' 'must' 'social democrat' 'social democratic' 'digital' 'democrat']
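
A note on how to read these lists: `sort_values()` sorts in ascending order and the slice keeps the last `num_terms` entries, so the term with the highest weight is the last one in each list. Terms with strongly negative weights never show up. A minimal sketch of that selection in plain Python, with made-up weights for a single hypothetical topic:

```python
# Hypothetical weights of a few terms for one topic (the numbers are made up).
toy_weights = {'grün': 0.21, 'steuer': -0.30, 'mensch': 0.25, 'stark': 0.27, 'eu': 0.24}
num_top = 3

#sort the terms ascending by weight and keep the last num_top entries,
#which mirrors sort_values()[-num_terms:] in the cell above
ranked = sorted(toy_weights, key=toy_weights.get)
top_terms = ranked[-num_top:]

print(top_terms)
```

Here 'steuer' has the largest absolute weight but is dropped because its weight is negative.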

In [12]:
## I wrote the code in this cell

#get all indices for the different parties
greens = [df.loc[i, 'name & date'] for i in range(len(df)) if df.loc[i, 'partyname'] == 'Greens']
lefts = [df.loc[i, 'name & date'] for i in range(len(df)) if df.loc[i, 'partyname'] == 'Lefts']
spd = [df.loc[i, 'name & date'] for i in range(len(df)) if df.loc[i, 'partyname'] == 'SPD']
fdp = [df.loc[i, 'name & date'] for i in range(len(df)) if df.loc[i, 'partyname'] == 'FDP']
cdu = [df.loc[i, 'name & date'] for i in range(len(df)) if df.loc[i, 'partyname'] == 'CDU']
afd = [df.loc[i, 'name & date'] for i in range(len(df)) if df.loc[i, 'partyname'] == 'AFD']
dp = [df.loc[i, 'name & date'] for i in range(len(df)) if df.loc[i, 'partyname'] == 'DP']

In [13]:
## code in this cell is from kanoki (2019), 'Color Columns, Rows & Cells of Pandas Dataframe'
## link: https://kanoki.org/2019/01/02/pandas-trick-for-the-day-color-code-columns-rows-cells-of-dataframe/

#function to highlight all values over 0.1
def color(val):
    #an empty string leaves the cell unstyled
    return 'background-color: peachpuff' if val > 0.1 else ''

In [14]:
#How much does each topic apply to each manifesto of the CDU?
names = [t for t in df["name & date"]]
svd_topic_vectors_df = pd.DataFrame(svd_topic_vectors, columns=labels, index = names)
svd_topic_vectors_df.loc[cdu].style.applymap(color)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5
"CDU, 194908",-0.309774,-0.042604,0.013607,0.017922,0.008894,0.106317
"CDU, 195309",-0.360817,-0.069148,-0.055315,-0.022354,0.01062,0.052857
"CDU, 195709",-0.374518,-0.305102,-0.20408,-0.100667,0.044893,-0.165958
"CDU, 196109",-0.314959,-0.298721,-0.188831,-0.058389,0.047358,-0.142975
"CDU, 196509",-0.261657,-0.053693,-0.208686,-0.313375,0.027896,-0.08842
"CDU, 196909",-0.27094,0.080388,-0.043406,-0.054807,0.002368,0.083322
"CDU, 197211",-0.230471,0.143681,0.028214,-0.073892,-0.053329,0.089838
"CDU, 197610",-0.109763,0.159873,-0.04459,-0.155112,-0.087067,-0.00586
"CDU, 198010",-0.123987,0.118223,-0.069966,-0.20223,-0.034885,-0.078487
"CDU, 198303",-0.139362,0.050364,-0.075433,-0.310861,-0.018889,-0.025255


In [15]:
#How much does each topic apply to each manifesto of the SPD?
svd_topic_vectors_df.loc[spd].style.applymap(color)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5
"SPD, 194908",-0.278067,-0.255175,-0.100378,-0.03047,0.058436,0.030341
"SPD, 195309",-0.2742,-0.187717,-0.026855,-0.038405,0.033514,0.101135
"SPD, 195709",-0.26987,-0.040947,-0.095069,-0.132273,-0.005639,0.066374
"SPD, 196109",-0.301698,-0.068778,-0.060421,-0.110602,0.042954,0.173813
"SPD, 196509",-0.235024,0.060695,-0.047813,-0.089604,-0.008214,0.230121
"SPD, 196909",-0.273943,0.1677,0.034969,-0.16674,-0.07779,0.203965
"SPD, 197211",-0.164216,0.087404,0.030428,-0.115895,-0.120974,0.150265
"SPD, 197610",-0.15899,0.1567,0.028716,-0.23861,-0.127035,0.167826
"SPD, 198010",-0.130621,0.221173,0.115848,-0.211313,-0.059037,0.157176
"SPD, 198303",-0.125362,0.151447,0.11911,-0.211347,-0.044691,0.137774


In [16]:
svd_topic_vectors_df.loc[fdp].style.applymap(color)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5
"FDP, 194908",-0.33229,-0.22981,-0.077193,0.081672,0.059958,0.014327
"FDP, 195309",-0.331092,-0.182736,-0.118707,0.160474,0.018487,-0.157575
"FDP, 195709",-0.337953,-0.199455,-0.106743,-0.015281,0.052512,-0.064256
"FDP, 196109",-0.32029,-0.047746,-0.052808,0.137762,0.007917,0.175957
"FDP, 196509",-0.320356,0.113735,0.051453,0.194845,-0.001209,0.096786
"FDP, 196909",-0.274353,0.17284,0.093031,0.134064,-0.032272,0.129315
"FDP, 197211",-0.266167,-0.031976,-0.092037,0.137904,-0.060893,-0.162677
"FDP, 197610",-0.213103,0.350723,0.085947,0.210968,-0.098484,-0.007711
"FDP, 198010",-0.125114,0.383492,0.111909,0.236229,-0.045865,0.059923
"FDP, 198303",-0.168573,0.309659,0.111413,0.240241,-0.064873,-0.0445


In [17]:
svd_topic_vectors_df.loc[greens].style.applymap(color)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5
"Greens, 198303",-0.144388,-0.201227,0.200227,-0.048468,0.23185,0.000838
"Greens, 198701",0.027523,-0.143053,0.410976,-0.026315,0.459437,-0.037855
"Greens, 199012",0.005579,-0.078876,0.485268,-0.028134,0.373659,-0.049975
"Greens, 199410",0.081024,0.059184,0.425998,-0.019562,0.282791,-0.010497
"Greens, 199809",0.092828,0.044607,0.377229,0.014761,0.396095,-0.024556
"Greens, 200209",0.343845,-0.01491,0.154193,-0.021318,0.183333,-0.076062
"Greens, 200509",0.378684,-0.003262,0.074042,-0.0463,0.196549,-0.087401
"Greens, 200909",0.380871,-0.103994,0.111328,0.007176,0.24108,0.009873
"Greens, 201309",0.454915,-0.076593,0.057747,0.023493,0.188924,0.055849
"Greens, 201709",0.438377,-0.117955,0.018862,0.014365,0.239478,0.09539


In [18]:
svd_topic_vectors_df.loc[lefts].style.applymap(color)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5
"Lefts, 199012",-0.023965,-0.288039,0.360324,-0.000319,-0.330633,-0.128732
"Lefts, 199410",0.028277,-0.151221,0.40406,-0.040288,-0.211984,-0.198603
"Lefts, 199809",0.024705,-0.172066,0.390716,-0.026612,-0.295079,-0.260652
"Lefts, 200209",0.176434,-0.166197,0.241454,-0.056446,-0.282034,-0.232042
"Lefts, 200509",0.170221,-0.201493,0.215458,-0.047521,-0.215991,-0.206111
"Lefts, 200909",0.276855,-0.226897,0.206028,-0.006427,-0.250379,0.108715
"Lefts, 201309",0.305373,-0.262144,0.208498,0.030452,-0.300402,0.146221
"Lefts, 201709",0.350335,-0.26447,0.17429,0.023157,-0.267093,0.180095
"Lefts, 202109",0.362956,-0.255877,0.140123,0.042415,-0.200843,0.219646


In [19]:
svd_topic_vectors_df.loc[afd].style.applymap(color)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5
"AFD, 201309",-0.105097,-0.299689,-0.165148,0.157798,0.011575,-0.042117
"AFD, 201709",-0.039946,-0.239583,-0.170381,0.151127,0.020953,-0.050431
"AFD, 202109",-0.174302,-0.356465,-0.144868,0.18959,0.05154,-0.063805


In [20]:
svd_topic_vectors_df.loc[dp].style.applymap(color)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5
"DP, 194908",-0.329732,-0.361448,-0.056242,0.131652,0.022065,-0.074925
"DP, 195309",-0.342716,-0.191646,-0.059746,0.08021,0.065475,0.022693
"DP, 195709",-0.385395,-0.10009,-0.056786,0.073827,0.067488,0.047589


**Conclusion**

The topics are generally difficult to interpret because the combinations of terms seem fairly random. Increasing or decreasing the number of topics does not help either: the existing topics stay the same, and topics are only added or dropped at the end.


Nevertheless, we can see some patterns in the data: 

The 'older' parties that have been around since 1949 (CDU, SPD, FDP) all score well in topics 0 and 1, but the SPD also reaches high values in topics 2 and 5, and the FDP especially in topic 3.

The Greens score well in topic 4 (alongside topics 0 and 2), making them the only party to score above the highlighting threshold of 0.1 in this topic. The Lefts score high in topics 0, 2 and 5. 

The right-wing parties AFD and DP both score highest in topic 3.

-----
## LDA

In [21]:
#We calculate LDA on the Bag Of Words
count_vectoriser = CountVectorizer(tokenizer=my_tokeniser)
bag_of_words = count_vectoriser.fit_transform(df['text'])
vocab = count_vectoriser.get_feature_names_out()

print(bag_of_words.todense().shape)

(86, 55414)


In [22]:
#How many topics?
num_topics = 8
pd.options.display.max_columns=num_topics
labels = ['topic{}'.format(i) for i in range(num_topics)]

In [23]:
#LDA
lda = LatentDirichletAllocation(n_components=num_topics,
                                random_state=123,
                                learning_method='batch')

In [24]:
lda_topics = lda.fit_transform(bag_of_words)

In [25]:
#Most relevant tokens for each topic
for i, topic in enumerate(lda.components_):
    print("topic " + str(i) + ":")
    #Get last n tokens (highest values)
    print(vocab[topic.argsort()[-num_terms:]])

topic 0:
['corona' 'abschaffung' 'euro' 'beenden' '12' 'kind' 'national' 'ezb'
 'staatlich' 'staat' 'bürger' 'lehnen' 'familie' 'eu' 'müsse' 'über'
 'deutschland' 'deutsch' 'afd' 'für']
topic 1:
['regierung' 'staat' 'politik' 'freiheit' 'gemeinde' 'wirtschaftlich'
 'müssen' 'bundesrepublik' 'bund' 'politisch' 'sozial' 'partei' 'groß'
 'aufgabe' 'sozialdemokratisch' 'land' 'deutschland' 'bundesregierung'
 'volk' 'deutsch']
topic 2:
['frau' 'gesellschaft' 'wirtschaft' 'deutschland' 'freiheit' 'europäisch'
 'öffentlich' 'bundesrepublik' 'ziel' 'wirtschaftlich' 'jahr' 'bürger'
 'politisch' 'politik' 'mensch' 'deutsch' 'staat' 'land' 'sozial' 'müssen']
topic 3:
['schaffen' 'gesellschaftlich' 'jahr' 'unternehmen' 'deutschland'
 'gesellschaft' 'pds' 'euro' 'kind' 'land' 'leben' 'stark' 'demokratisch'
 'arbeit' 'ökologisch' 'beschäftigter' 'linke' 'mensch' 'öffentlich'
 'sozial']
topic 4:
['arbeitsplatz' 'marktwirtschaftlich' 'privatisierung' 'aufgabe' 'mittel'
 'ökologisch' 'staatlich' 'v

**Translation of topics:**
   
    topic 0:
    ['corona' 'abolition' 'euro' 'to end smth.' '12' 'child' 'national' 'ecb' 'state-owned' 'state' 'citizen' 
    'lean' 'family' 'eu' 'must' 'about' 'germany' 'german' 'afd' 'for']
    topic 1:
    ['government' 'state' 'politics' 'freedom' 'community' 'economic' 'must' 'federal republic' 'federation' 
    'political' 'social' 'party' 'big' 'task' 'social democratic' 'country' 'germany' 'federal government'
    'people' 'german']
    topic 2:
    ['woman' 'society' 'economy' 'germany' 'freedom' 'european' 'public' 'federal republic' 'goal' 'economic' 
    'year' 'citizen' 'political' 'politics' 'human' 'german' 'state' 'country' 'social' 'must']
    topic 3:
    ['create' 'societal' 'year' 'company' 'germany' 'society' 'pds' 'euro' 'child' 'country' 'life' 'strong'
    'democratic' 'work' 'ecological' 'employee' 'left' 'human' 'public' 'social']
    topic 4:
    ['job' 'market-based' 'privatisation' 'task' 'medium/means' 'ecological' 'state-owned' 'improve' 'high' 
    'private' 'tax' 'state' 'necessary' 'public' 'international' 'citizen' 'european' 'state' 'liberal' 'must']
    topic 5:
    ['goal' 'year' 'international' 'great' 'eu' 'promote' 'society' 'public' 'life' 'germany' 'child' 'create'
    'european' 'woman' 'country' 'ecological' 'social' 'strong' 'green' 'human']
    topic 6:
    ['together' 'citizen' 'great' 'goal' 'state' 'year' 'promote' 'international' 'eu' 'europe' 'child' 'company'
    'german' 'create' 'social' 'european' 'country' 'human' 'strong' 'germany']
    topic 7:
    ['people' 'mark'(I think this refers to the German mark, the currency before the euro) 'great' 'year' 'family' 
    'christian' 'task' 'europe' 'unity' 'düsseldorf' 'world' 'spd' 'future' 'union' 'politics' 'country' 'german'
    'germany' 'csu' 'cdu']


In [26]:
#How much does each topic apply to each manifesto of the CDU?
lda_topic_vectors_df = pd.DataFrame(lda_topics, columns=labels, index = names)
lda_topic_vectors_df.loc[cdu].style.applymap(color)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7
"CDU, 194908",3.8e-05,3.8e-05,0.999733,3.8e-05,3.8e-05,3.8e-05,3.8e-05,3.8e-05
"CDU, 195309",6.5e-05,0.458442,0.541166,6.5e-05,6.5e-05,6.5e-05,6.5e-05,6.5e-05
"CDU, 195709",0.000448,0.99686,0.000449,0.000448,0.000448,0.000449,0.000449,0.000449
"CDU, 196109",0.000551,0.740699,0.255993,0.000551,0.000551,0.000551,0.000552,0.000552
"CDU, 196509",2.6e-05,2.6e-05,0.423323,2.6e-05,2.6e-05,2.6e-05,2.6e-05,0.576521
"CDU, 196909",0.000104,0.080288,0.660709,0.000104,0.000104,0.000104,0.258482,0.000104
"CDU, 197211",6.8e-05,0.272883,0.726711,6.8e-05,6.8e-05,6.8e-05,6.8e-05,6.8e-05
"CDU, 197610",3.9e-05,0.147373,0.785146,3.9e-05,3.9e-05,3.9e-05,0.067285,3.9e-05
"CDU, 198010",2.4e-05,2.4e-05,0.808579,2.4e-05,2.4e-05,2.4e-05,2.4e-05,0.191275
"CDU, 198303",5.4e-05,5.4e-05,0.915501,5.4e-05,5.4e-05,5.4e-05,5.4e-05,0.084176


In [27]:
lda_topic_vectors_df.loc[spd].style.applymap(color)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7
"SPD, 194908",0.000133,0.878029,0.121173,0.000133,0.000133,0.000133,0.000133,0.000133
"SPD, 195309",5.7e-05,0.966968,0.03269,5.7e-05,5.7e-05,5.7e-05,5.7e-05,5.7e-05
"SPD, 195709",9.8e-05,0.12951,0.8699,9.8e-05,9.8e-05,9.8e-05,9.8e-05,9.8e-05
"SPD, 196109",4.1e-05,0.865705,0.13405,4.1e-05,4.1e-05,4.1e-05,4.1e-05,4.1e-05
"SPD, 196509",1e-05,0.754414,0.245523,1e-05,1e-05,1e-05,1e-05,1e-05
"SPD, 196909",7.5e-05,7.5e-05,0.999476,7.5e-05,7.5e-05,7.5e-05,7.5e-05,7.5e-05
"SPD, 197211",1.9e-05,1.9e-05,0.999865,1.9e-05,1.9e-05,1.9e-05,1.9e-05,1.9e-05
"SPD, 197610",1.6e-05,1.6e-05,0.999891,1.6e-05,1.6e-05,1.6e-05,1.6e-05,1.6e-05
"SPD, 198010",2.8e-05,2.8e-05,0.912195,2.8e-05,2.8e-05,2.8e-05,2.8e-05,0.087637
"SPD, 198303",2.6e-05,2.6e-05,0.999818,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05


In [28]:
lda_topic_vectors_df.loc[fdp].style.applymap(color)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7
"FDP, 194908",3.5e-05,0.010403,0.989385,3.5e-05,3.5e-05,3.5e-05,3.5e-05,3.5e-05
"FDP, 195309",0.000147,0.732475,0.266644,0.000147,0.000147,0.000147,0.000147,0.000147
"FDP, 195709",0.000182,0.578907,0.420001,0.000182,0.000182,0.000182,0.000182,0.000182
"FDP, 196109",0.0001,0.581061,0.418341,0.0001,0.0001,0.0001,0.0001,0.0001
"FDP, 196509",3.5e-05,0.120964,0.610073,0.268787,3.5e-05,3.5e-05,3.5e-05,3.5e-05
"FDP, 196909",5.9e-05,5.9e-05,0.809721,5.9e-05,0.189926,5.9e-05,5.9e-05,5.9e-05
"FDP, 197211",0.097996,0.000299,0.900212,0.000299,0.000299,0.000299,0.000299,0.000299
"FDP, 197610",3.2e-05,3.2e-05,0.999776,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05
"FDP, 198010",1.1e-05,1.1e-05,0.999925,1.1e-05,1.1e-05,1.1e-05,1.1e-05,1.1e-05
"FDP, 198303",3.3e-05,3.3e-05,0.999768,3.3e-05,3.3e-05,3.3e-05,3.3e-05,3.3e-05


In [29]:
lda_topic_vectors_df.loc[greens].style.applymap(color)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7
"Greens, 198303",6.4e-05,6.4e-05,0.868586,0.020194,6.4e-05,0.110901,6.4e-05,6.4e-05
"Greens, 198701",1.4e-05,1.4e-05,0.722364,0.061605,1.4e-05,0.215959,1.4e-05,1.4e-05
"Greens, 199012",1.2e-05,1.2e-05,0.305433,0.021953,1.2e-05,0.672552,1.2e-05,1.2e-05
"Greens, 199410",8e-06,8e-06,0.2578,0.002095,8e-06,0.740064,8e-06,8e-06
"Greens, 199809",8e-06,8e-06,0.204824,8e-06,8e-06,0.795129,8e-06,8e-06
"Greens, 200209",1.2e-05,1.2e-05,0.01978,1.2e-05,1.2e-05,0.837348,0.142814,1.2e-05
"Greens, 200509",1e-05,1e-05,0.001963,1e-05,1e-05,0.833168,0.164821,1e-05
"Greens, 200909",5e-06,5e-06,5e-06,5e-06,5e-06,0.997262,0.002705,5e-06
"Greens, 201309",3e-06,3e-06,3e-06,3e-06,3e-06,0.999598,0.000383,3e-06
"Greens, 201709",5e-06,5e-06,5e-06,5e-06,5e-06,0.999968,5e-06,5e-06


In [30]:
lda_topic_vectors_df.loc[lefts].style.applymap(color)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7
"Lefts, 199012",2.5e-05,2.5e-05,0.309937,0.689915,2.5e-05,2.5e-05,2.5e-05,2.5e-05
"Lefts, 199410",3.1e-05,3.1e-05,0.224936,0.774876,3.1e-05,3.1e-05,3.1e-05,3.1e-05
"Lefts, 199809",1.6e-05,1.6e-05,0.255841,0.744064,1.6e-05,1.6e-05,1.6e-05,1.6e-05
"Lefts, 200209",1.8e-05,1.8e-05,0.105411,0.579356,1.8e-05,1.8e-05,0.315141,1.8e-05
"Lefts, 200509",3e-05,3e-05,0.080887,0.580067,3e-05,0.128722,0.210204,3e-05
"Lefts, 200909",1.2e-05,1.2e-05,0.022936,0.395606,1.2e-05,0.480363,0.101046,1.2e-05
"Lefts, 201309",7e-06,7e-06,7e-06,0.964963,7e-06,0.000381,0.034621,7e-06
"Lefts, 201709",5e-06,5e-06,5e-06,0.996897,5e-06,5e-06,0.003074,5e-06
"Lefts, 202109",4e-06,4e-06,4e-06,0.474418,4e-06,0.131028,0.394536,4e-06


In [31]:
lda_topic_vectors_df.loc[afd].style.applymap(color)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7
"AFD, 201309",0.009179,0.319899,0.000321,0.000321,0.000321,0.219764,0.449875,0.000321
"AFD, 201709",0.215567,0.171846,0.000132,0.019395,1.6e-05,0.111014,0.482013,1.6e-05
"AFD, 202109",0.634989,9e-06,0.004396,9e-06,9e-06,9e-06,0.360569,9e-06


In [32]:
lda_topic_vectors_df.loc[dp].style.applymap(color)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7
"DP, 194908",0.000687,0.100065,0.89581,0.000688,0.000687,0.000687,0.000687,0.000687
"DP, 195309",0.000133,0.147311,0.85189,0.000133,0.000133,0.000133,0.000133,0.000133
"DP, 195709",0.000203,0.595173,0.403606,0.000204,0.000203,0.000204,0.000204,0.000203


**Conclusion**

It is interesting to see that the 'older' parties that have been around since 1949 (CDU, SPD, FDP) score high on the same topics, namely 2 and 6. Their scores for topic 2 are high until about 1990/94, after which the weight shifts to topic 6. The DP also scores high on topic 2 (and on topic 1), which fits the time period (its manifestos are from 1949-1957).
All 'newer' parties have more or less their own topic: the Greens score high on topic 5, the Lefts on topic 3, and the AFD on topic 0 (but also on topics 1 and 6).
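
The LDA weights of a manifesto behave like proportions: each row in the tables above adds up to roughly 1, so the largest value can be read as the manifesto's dominant topic. The sketch below checks this for four rows copied from the Greens table. The 'dominant topic' lookup is my own addition for illustration; in pandas, `lda_topic_vectors_df.idxmax(axis=1)` should give the same information for all manifestos at once.

```python
labels = ['topic{}'.format(i) for i in range(8)]

# Rows copied from the Greens table above (LDA document-topic weights).
greens_rows = {
    'Greens, 198303': [6.4e-05, 6.4e-05, 0.868586, 0.020194, 6.4e-05, 0.110901, 6.4e-05, 6.4e-05],
    'Greens, 198701': [1.4e-05, 1.4e-05, 0.722364, 0.061605, 1.4e-05, 0.215959, 1.4e-05, 1.4e-05],
    'Greens, 199012': [1.2e-05, 1.2e-05, 0.305433, 0.021953, 1.2e-05, 0.672552, 1.2e-05, 1.2e-05],
    'Greens, 201709': [5e-06, 5e-06, 5e-06, 5e-06, 5e-06, 0.999968, 5e-06, 5e-06],
}

dominant = {}
for name, weights in greens_rows.items():
    #index of the largest weight in this row
    best = max(range(len(weights)), key=lambda k: weights[k])
    dominant[name] = labels[best]
    print(name, labels[best], 'row sum:', round(sum(weights), 3))
```

For these rows the dominant topic moves from topic 2 (1983, 1987) to topic 5 (1990 onwards), which matches the shift described above.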

We can see similar results if we create a query searching for the party names. Since parties are likely to name themselves often in their election manifestos, this makes sense. One exception is the DP, which appears the most in topic 1, although it also had high scores for topic 2 in the previous analysis.

In [33]:
## I've modified this cell to use it for LDA instead of LSA

#query
topic_weights = pd.DataFrame(lda.components_.T, index=vocab, columns=labels)
df_topics = topic_weights.T["cdu spd fdp linke grün afd dp".split()]
df_topics.style.background_gradient(cmap='Greens')

Unnamed: 0,cdu,spd,fdp,linke,grün,afd,dp
topic0,0.125032,0.12504,0.125,2.901631,0.125003,264.078224,0.125
topic1,13.25221,57.618115,16.732354,1.175226,0.125212,0.125044,2.070777
topic2,241.703522,315.098351,78.349614,3.107902,49.599389,0.125,0.179223
topic3,12.569619,28.626849,7.412257,661.455957,18.14339,6.106603,0.125
topic4,0.125042,0.125037,0.125006,0.125,0.173206,0.125,0.125
topic5,76.19595,34.028073,17.456994,2.681094,1474.035474,1.190067,0.125
topic6,347.405818,192.901949,775.015656,6.42819,121.989955,0.125062,0.125
topic7,286.622808,50.476585,4.783119,0.125,0.808371,0.125,0.125
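
The values in this table are not probabilities. As far as I understand it, `lda.components_` holds unnormalised pseudo-counts, and the recurring value 0.125 looks like the default prior of 1/8 for 8 topics, meaning the term was practically never assigned to that topic. To compare a rare term like 'dp' with a frequent one like 'grün', it can help to convert a column into shares. A minimal sketch using the 'dp' column copied from the table above (the share calculation is my own addition):

```python
# 'dp' column copied from the query table above (topic0 ... topic7).
dp_weights = [0.125, 2.070777, 0.179223, 0.125, 0.125, 0.125, 0.125, 0.125]

total = sum(dp_weights)
#share of the term's total weight that falls into each topic
shares = [w / total for w in dp_weights]

for i, share in enumerate(shares):
    print('topic{}: {:.3f}'.format(i, share))
```

Roughly two thirds of the weight of 'dp' falls into topic 1, but the total weight is only about 3, so this rests on very few occurrences of the token.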
