# Kosningaprof
> Exploration of the data from the kosningaprof

- toc: true 
- badges: true
- comments: false
- categories: [election, data-science, machine-learning]

# Introduction

The Icelandic parliamentary election is on 25 September. Before every election, media outlets set up a quiz/questionnaire in which candidates are given statements such as *"Iceland should be a part of NATO"* and *"The Icelandic government should put more money into the healthcare system"* and answer whether they agree with, disagree with, or are neutral towards each statement. Users can then answer the same questions to find out which candidates and political parties are "closest to" their own political beliefs.

These are mostly for fun and should only serve as an indicator, but it's an enjoyable process to go through and it's always interesting to see which candidates are "most similar" to oneself.

As a whole, this collection of data (candidates and their answers to a set of questions) is interesting and offers many opportunities for data exploration. The purpose of this post is to take the data from the [RUV quiz](https://www.ruv.is/x21/kosningaprof), explore it, and try to answer some questions about it.

Similar (and definitely more rigorous) analysis has been done before by people who design the tests and actually work with the data; see for example this great thread [here](https://twitter.com/hafsteinneinars/status/1435268582053711881) on this [quiz](https://egkys.is/kosningavitinn/). Since this should not be taken too seriously, the analysis in this post is more about generating plausible hypotheses and doing some ad-hoc analysis.

# The Data

In [1]:
# hide
import numpy as np
import pandas as pd

If you want the data for yourself, e.g. to run this notebook locally, follow the instructions [here](https://github.com/roberttorfason/kosningaprof) to set up an environment and fetch the data.

Let's load the data, pre-process it and set up some helper objects

In [115]:
df_results, df_questions = pd.read_csv("results_2021.csv"), pd.read_csv("questions_2021.csv")

In [116]:
# collapse-hide

# Pre-processing
df_results["party"] = df_results["party"].astype("category")
df_results["gender"] = df_results["gender"].astype("category")

# Bin the ages. `pd.cut` returns intervals that are annoying to work with so we just use the
# left age of each bin e.g. 30 to represent the interval [30, 40)
age_binned_series = pd.cut(df_results["age"], bins=[-10, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100], right=False)
df_results.insert(df_results.columns.get_loc("age") + 1, "age_binned", age_binned_series)

df_results["age_binned"] = df_results["age_binned"].map(lambda x: x.left).astype("category")

# Most of the analysis centers around the political party, so we drop the candidates that don't have
# a party specified
df_results = df_results[~df_results["party"].isna()]
df_results = df_results.reset_index(drop=True)
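
The binning step in the cell above can be tried in isolation. The following is a minimal sketch on three made-up ages (the values and the names `ages`, `binned`, and `left_edges` are illustrative assumptions, not part of the dataset); it shows how `right=False` produces left-closed intervals and how each interval is replaced by its left edge.

```python
import pandas as pd

# Hypothetical ages, only for illustration
ages = pd.Series([46, 55, 35])

# right=False makes each bin closed on the left: [30, 40), [40, 50), ...
binned = pd.cut(ages, bins=list(range(0, 101, 10)), right=False)

# Replace each interval by its left edge, as in the pre-processing cell
left_edges = binned.map(lambda interval: interval.left).astype("category")
left_edge_list = left_edges.tolist()
print(left_edge_list)
```

Note that with left-closed bins an age equal to the last bin edge falls outside every interval and would come out as missing.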

In [88]:
# collapse-hide
cols_questions = [c for c in df_results.columns if c.startswith("question_")]
cols_meta = [c for c in df_results.columns if c not in cols_questions]

question_id_to_string = dict(zip(df_questions["question_number"], df_questions["question"]))

and take a look at the structure

In [5]:
df_questions.head(3)

Unnamed: 0,question_number,id,question
0,question_0,5719843612393472,"Íslenskt samfélag einkennist af réttlæti, sann..."
1,question_1,5692341510733824,Það á að leyfa kórónuveirunni að ganga án sótt...
2,question_2,5284747587616768,Stjórnvöld eiga að setja strangar takmarkanir ...


`df_questions` has all the questions and their ids/numbers.

In [6]:
df_results.head(3)

Unnamed: 0,name,party,age,age_binned,gender,timestamp,constituency,answering_done,timestamp_dt,question_0,...,question_21,question_22,question_23,question_24,question_25,question_26,question_27,question_28,question_29,question_30
0,Elín Tryggvadóttir,Samfylkingin,46,40,Kona,1631051439022,Reykjavíkurkjördæmi suður,1.0,2021-09-07 21:50:39.022,51,...,83,3,100,93,100,100,100,8,100,63
1,Ágústa Anna Ómarsdóttir,Sósíalistaflokkurinn,55,50,Kona,1631547337578,Norðvesturkjördæmi,1.0,2021-09-13 15:35:37.578,0,...,0,27,61,61,96,100,100,0,100,32
2,María Lilja Þrastardóttir Kemp,Sósíalistaflokkurinn,35,30,Kona,1631030960104,Reykjavíkurkjördæmi suður,1.0,2021-09-07 16:09:20.104,22,...,100,2,100,58,100,100,100,38,100,24


`df_results` has one row per candidate, holding metadata (`age`, `party`, `name`) and the answers to all the questions, **where each answer is on a scale from 0 to 100, 0 meaning that the candidate strongly disagrees with the statement and 100 meaning that the candidate strongly agrees with it**.

In [7]:
# collapse-hide
"""This text is in a code cell because it's not possible to collapse markdown cells in fastpages

Additionally each question in `df_results` has a mapping back to `df_questions` via the column name.
Note that the way the questions are indexed there is an easy correspondence between the (numeric) index
of each column and `df_questions`. This means that when we later transform the data to numpy arrays, 
where we don't have named columns, and do something like `x[:, 3]`, it will correspond to `df_questions.iloc[3]` 
so going back and forth between the data and the actual questions is easy.
""";

# Interactive Histogram of Questions and Answers

Below we visualize a histogram of the answers the candidates gave to the questions. The x-axis, Answer Value, is the binned value of the answer to each question, and the y-axis is simply the count of those values. Again, these answers are on a scale from 0 to 100, 0 meaning strongly disagree with the statement and 100 meaning strongly agree with it. There are also two dropdown menus: one filters by political party and one filters by question, so you can see the distribution of answers for each party and each question.

The questions are ordered from most "interesting" to least "interesting", where the standard deviation is used as a proxy for how interesting a question is.

In [8]:
# collapse-hide
"""This text is in a code cell because it's not possible to collapse markdown cells in fastpages

Why does standard deviation make sense as a proxy for how interesting a question is?
As a very informal argument, thinking about the different scenarios:

1. Everyone answers the same or a similar value -> the std. will be low
2. The answers are (roughly) uniformly distributed over possible values -> std. will be a "medium" value
3. If there is a strong split (bi-modal distribution) where candidates either agree or disagree with the
   statements -> std. is high

Visual inspection of the plots also supports this.

One might be inclined to use entropy to measure how interesting a question is, but in that case the ordering
would be 1. < 3. < 2., which is not the desired outcome for this problem, so the std. is more appropriate here.
""";

In [9]:
# hide
import altair as alt

We need to pre-process the data for this plot, transforming it from a wide dataframe (one column per question) to a long dataframe (one row per candidate and question). See a good discussion on why that's useful [here](https://altair-viz.github.io/user_guide/data.html#long-form-vs-wide-form-data).

In [10]:
# collapse-hide
df_results_melt = pd.melt(df_results, id_vars=cols_meta, value_vars=cols_questions)
df_results_melt = df_results_melt[["party", "variable", "value"]]
df_results_melt = df_results_melt.rename(columns={"variable": "question", "value": "Answer Value"})
df_results_melt["question"] = df_results_melt["question"].replace(question_id_to_string)
df_results_melt["question"] = df_results_melt["question"].astype("category")
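
To make the reshaping concrete, here is a small sketch of `pd.melt` on a made-up frame with two candidates and two questions. The frame `wide` and its values are assumptions for illustration only; the column names mirror the ones used in this post.

```python
import pandas as pd

# Hypothetical wide-form data: one row per candidate, one column per question
wide = pd.DataFrame({
    "name": ["Candidate 1", "Candidate 2"],
    "party": ["A", "B"],
    "question_0": [10, 90],
    "question_1": [50, 40],
})

# Long form: one row per (candidate, question) pair
long = pd.melt(
    wide,
    id_vars=["name", "party"],
    value_vars=["question_0", "question_1"],
    var_name="question",
    value_name="Answer Value",
)
print(long)
```

Each candidate now appears once per question, which is the shape Altair expects when a single column (`question`) drives a filter or an encoding.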

In [12]:
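# Select the numeric column explicitly; `.std()` on the whole frame can fail on the
# non-numeric `party` column in recent pandas versions
df_questions_std = df_results_melt.groupby("question")["Answer Value"].std().sort_values(ascending=False)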
questions_sorted = df_questions_std.index.to_list()
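
The informal argument for preferring the standard deviation over entropy can be checked numerically. The sketch below uses three synthetic answer distributions (all of them assumptions made up for this illustration) and a hypothetical helper, `histogram_entropy`, that computes the Shannon entropy of the answers after binning them into ten bins.

```python
import numpy as np

def histogram_entropy(values, n_bins=10):
    """Shannon entropy (in nats) of answers binned into equal-width bins on [0, 100]."""
    counts, _ = np.histogram(values, bins=n_bins, range=(0, 100))
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

# Three hypothetical answer distributions, 100 candidates each
cases = {
    "same": np.full(100, 50),            # 1. everyone gives the same answer
    "uniform": np.linspace(0, 100, 100), # 2. answers spread evenly over the scale
    "split": np.repeat([0, 100], 50),    # 3. half strongly disagree, half strongly agree
}

stds = {name: float(np.std(v)) for name, v in cases.items()}
entropies = {name: histogram_entropy(v) for name, v in cases.items()}
print(stds)
print(entropies)
```

On these toy inputs the standard deviation ranks the split question highest, while entropy ranks the uniform question highest, which matches the ordering described above.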

In [13]:
# hide_input
alt.data_transformers.disable_max_rows()

parties_list = df_results_melt["party"].cat.categories.to_list()
questions_list = df_results_melt["question"].cat.categories.to_list()

# Base histogram of answer values, coloured by party
chart = alt.Chart(df_results_melt).mark_bar().encode(
    x=alt.X('Answer Value:Q', bin=alt.Bin(extent=[0, 100], step=10), scale=alt.Scale(domain=(0, 100))),
    y=alt.Y('count()'),
    color='party',
    tooltip=['party', alt.Tooltip('count()', title='count')]
).interactive()
    
# A dropdown filter
question_dropdown = alt.binding_select(options=[None] + questions_sorted, labels=["All"] + questions_sorted)
question_select = alt.selection_single(fields=["question"], bind=question_dropdown, name="Question")

chart_filter_question = chart.add_selection(
    question_select
).transform_filter(
    question_select
).properties(title="Question Result Histogram")
 
# A dropdown filter
party_dropdown = alt.binding_select(options=[None] + parties_list, labels=["All"] + parties_list)
party_select = alt.selection_single(fields=["party"], bind=party_dropdown, name="Party")

chart_filter_party = chart_filter_question.add_selection(
    party_select
).transform_filter(
    party_select
)

chart_filter_party

# Dimensionality reduction and embeddings

In [14]:
# hide
from sklearn.decomposition import PCA, NMF

Let's pick out only the question columns, convert them to a numpy array to be used with `sklearn`, and normalize the values to the range [0, 1].

In [15]:
df_questions_only = df_results.filter(like="question_")
x = df_questions_only.to_numpy()
x = x.astype(float) / 100
x.shape

(311, 31)
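
Once the answers are in a plain array the column names are gone, so the column position is the only link back to the questions. A minimal sketch of that correspondence, using made-up miniature versions of the two dataframes (the `_toy` names and their contents are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the two dataframes
df_questions_toy = pd.DataFrame({
    "question_number": ["question_0", "question_1", "question_2"],
    "question": ["Statement A", "Statement B", "Statement C"],
})
df_results_toy = pd.DataFrame({
    "name": ["Candidate 1", "Candidate 2"],
    "question_0": [10, 90],
    "question_1": [50, 40],
    "question_2": [100, 0],
})

x_toy = df_results_toy.filter(like="question_").to_numpy()

# Column j of the array holds the answers to the question in row j of df_questions_toy
j = 2
answers_to_j = x_toy[:, j]
question_j = df_questions_toy.iloc[j]["question"]
print(question_j, answers_to_j)
```

This only works because the question columns appear in the same order as the rows of the questions table, so it is worth keeping that ordering intact.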

In [53]:
from typing import List

def numpy_to_dataframe(_x: np.ndarray, _df: pd.DataFrame, cols_to_use: List[str]) -> pd.DataFrame:
    """Concatenate a numpy array with selected columns from a dataframe to be used with altair plotting"""
    df_out = pd.DataFrame(_x)
    df_out.columns = df_out.columns.astype(str)
    df_out = pd.concat([_df[cols_to_use].reset_index(drop=True), df_out], axis=1)
    return df_out

In [58]:
pca = PCA(n_components=10)
x_pca = pca.fit_transform(x)
x_pca.shape

(311, 10)

In [74]:
np.sort(pca.components_[0])

array([-0.26669372, -0.25154715, -0.24204881, -0.23889107, -0.23458142,
       -0.23248551, -0.19411324, -0.19282375, -0.19150731, -0.18925264,
       -0.17339732, -0.13577824, -0.12519995, -0.11957096, -0.11932713,
       -0.10620978, -0.1041671 , -0.08134275, -0.0644498 , -0.02672937,
        0.02933018,  0.0868938 ,  0.12218708,  0.13568071,  0.13698755,
        0.1488656 ,  0.1787404 ,  0.18264165,  0.21316296,  0.26772219,
        0.31950847])

In [75]:
df_questions[["question"]].iloc[np.argsort(np.abs(pca.components_[0]))[::-1]]

Unnamed: 0,question
3,Auka á vægi einkareksturs í heilbrigðiskerfinu.
8,Ísland á að eiga aðild að NATO.
14,Hækka á skatta á tekjuhæsta fólkið.
25,Alþingi á að vinna markvisst að endurskoðun st...
23,Ríkið á að styðja og taka þátt í fjármögnun Bo...
10,Stjórnvöld eiga með beinum hætti að stuðla að ...
7,Efna á til þjóðaratkvæðagreiðslu um framhald a...
13,Beita á skattkerfinu til að minnka bilið milli...
22,Reykjavíkurflugvöllur á að vera í Vatnsmýri um...
16,Sjávarútvegsfyrirtæki ættu að greiða meira til...


In [76]:
df_questions[["question"]].iloc[np.argsort(np.abs(pca.components_[1]))[::-1]]

Unnamed: 0,question
28,Leyfa á smásölu áfengis í matvöruverslunum.
22,Reykjavíkurflugvöllur á að vera í Vatnsmýri um...
7,Efna á til þjóðaratkvæðagreiðslu um framhald a...
29,Varsla neysluskammta af kannabisefnum ætti að ...
23,Ríkið á að styðja og taka þátt í fjármögnun Bo...
21,Innheimta á veggjöld í auknum mæli til að fjár...
8,Ísland á að eiga aðild að NATO.
10,Stjórnvöld eiga með beinum hætti að stuðla að ...
3,Auka á vægi einkareksturs í heilbrigðiskerfinu.
9,Ísland á að taka á móti fleiri flóttamönnum.


In [57]:
df_pca = numpy_to_dataframe(x_pca, df_results, cols_meta)

alt.Chart(df_pca).mark_circle(size=60).encode(
    x='0',
    y='1',
    color='party',
    tooltip=['party', "name"]
).interactive()

In [107]:
pca = PCA(n_components=2)
x_pca_q = pca.fit_transform(x.T)
df_pca_q = numpy_to_dataframe(x_pca_q, df_questions, df_questions.columns)


# Think about interpretation of this. Correlation in users!
alt.Chart(df_pca_q).mark_circle(size=60).encode(
    x='0',
    y='1',
    #color='party',
    tooltip=['question']
).interactive()

In [110]:
clf = PCA(n_components=10)
y_val = "party"

x_pca_questions = clf.fit_transform(x.T)
x_pca_questions.shape
clf.components_[0]  # Sum over parties
for i, val in enumerate(df_results[y_val].cat.categories):
    print(val)
    print(np.sum(clf.components_[1][df_results["party"].cat.codes.to_numpy() == i]))

Flokkur Fólksins
-0.508887677598457
Framsóknarflokkurinn
-1.5549118019331027
Frjálslyndi Lýðræðisflokkurinn
-0.5189058744773785
Miðflokkurinn
-3.071598084520553
Píratar
-0.2355947718647991
Samfylkingin
0.011816785420933805
Sjálfstæðisflokkurinn
-2.712445380495281
Sósíalistaflokkurinn
0.15353288988465097
Vinstri Græn
0.05635031516497418
Viðreisn
-1.2524405438297752
Ábyrg Framtíð
0.0010726311138651644


In [111]:
nmf = NMF(n_components=2, init='random', max_iter=400, alpha=0.5, l1_ratio=0.5)
W = nmf.fit_transform(x)
H = nmf.components_

np.mean(np.abs(W @ H - x))

# What users/questions is it hard to reconstruct?

0.17285265612318518

- Classification to find the "most important" questions
- Biggest outlier per party
- Do any later components in PCA explain something else than party? E.g. age?

In [121]:
import altair as alt

# Candidate weights from the NMF factorization (rows of W), coloured by party
source = pd.DataFrame({"Component0": W[:, 0], "Component1": W[:, 1], "party": df_results["party"], "name": df_results["name"]})

chart_results = alt.Chart(source).mark_circle(size=60).encode(
    x='Component0',
    y='Component1',
    color='party',
    tooltip=['party', "name"]
).interactive()

In [122]:
import altair as alt

source = pd.DataFrame({"Component0": H.T[:, 0], "Component1": H.T[:, 1], "question": df_questions["question"], "num": range(len(H.T[:, 0]))})

chart_questions = alt.Chart(source).mark_circle(size=60).encode(
    x='Component0',
    y='Component1',
    #color='party',
    tooltip=['question', 'num']
).interactive()

In [125]:
chart_questions | chart_results 

The factorization is only determined up to a positive scale: a constant can be moved arbitrarily between the two matrices without changing their product. How about scaling the users (`W`) to [0, 1] and the questions (`H`) to [0, 100]?
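
A short numerical sketch of that scale ambiguity, using made-up non-negative factors (`W_toy`, `H_toy`, and the scale values are assumptions for illustration): multiplying each column of the first matrix by a positive constant and dividing the matching row of the second by the same constant leaves the product unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-negative factors: 5 candidates, 2 components, 4 questions
W_toy = rng.uniform(0, 1, size=(5, 2))
H_toy = rng.uniform(0, 1, size=(2, 4))

# One positive constant per component
scale = np.array([100.0, 0.5])
W_scaled = W_toy * scale            # scales the columns of W
H_scaled = H_toy / scale[:, None]   # scales the rows of H by the inverse

same_product = bool(np.allclose(W_toy @ H_toy, W_scaled @ H_scaled))
print(same_product)
```

Because of this, the absolute values in either factor carry no meaning on their own; fixing a convention (for example a maximum of 1 per column of `W`) makes the two plots comparable.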

Fill in NaNs using KNN: find the k closest users, take the mean of their answers, and use that to fill in the missing value.
Matrix factorization is also interesting. Is the feature vector for questions not that interesting? Not sure; it is probably also possible to calculate it using correlation, or implicitly from the results of the PCA.
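
One possible way to try the KNN idea is scikit-learn's `KNNImputer`, which fills a missing value with the mean of that feature over the k nearest rows. The sketch below runs it on a made-up answer matrix (`x_missing` and its values are assumptions; the real data may not contain NaNs in this form).

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical answers scaled to [0, 1]; candidate 0 skipped the third question
x_missing = np.array([
    [0.9, 0.1, np.nan],
    [1.0, 0.0, 0.8],
    [0.8, 0.2, 0.6],
    [0.0, 1.0, 0.1],
])

# Distances are computed on the answers both rows have in common
imputer = KNNImputer(n_neighbors=2)
x_filled = imputer.fit_transform(x_missing)
print(x_filled[0])
```

Here candidates 1 and 2 are the two nearest neighbours of candidate 0, so the gap is filled with the mean of their answers to the third question.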

Which question has the most predictive value for
- Age
- Party
- Gender

We know a priori what the targets are; which clustering method is best?



Mean median of a question!

Density of points to see conformity? How? Variance? Something else? Mean distance of everyone to everyone?

Compare ranking based on metrics: Cosine, correlation, l2, l1

In [None]:
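# Assumption: `df_users` was an earlier name for `df_results`, which is the frame defined above
df_users_filter = df_results[~df_results["party"].isna()]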
df_users_filter = df_users_filter[df_users_filter["party"] != "Ábyrg Framtíð"]

print(df_users_filter.party.cat.categories)


x = df_users_filter.filter(like="question_").to_numpy()
x = x.astype(float) / 100

# Try gender
y = df_users_filter["party"].cat.codes.to_numpy()

x.shape, y.shape

In [127]:
y = df_results["age_binned"].cat.codes.to_numpy()

In [128]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

In [130]:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(clf, x, y, cv=5)
cv_results



{'fit_time': array([0.51878071, 0.74593043, 0.21286488, 0.14841795, 0.14580989]),
 'score_time': array([0.03884315, 0.01217532, 0.00981164, 0.01013565, 0.00973415]),
 'test_score': array([0.36507937, 0.25806452, 0.25806452, 0.37096774, 0.24193548])}

In [None]:
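from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import KFold

# NOTE: this cell assumes `x` and `y` come from the party-classification cell above
# (i.e. `y` holds party codes), not from the `age_binned` target
kf = KFold(n_splits=3, random_state=None, shuffle=True)
train_index, test_index = next(kf.split(x))

clf.fit(x[train_index, :], y[train_index])
# `plot_confusion_matrix` was removed in scikit-learn 1.2; this is its replacement
ConfusionMatrixDisplay.from_estimator(
    clf, x[test_index, :], y[test_index],
    display_labels=df_users_filter["party"].cat.categories.to_list()[:-1],
)

len(test_index), len(train_index), df_users_filter["party"].cat.categories
# Who is confused?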

In [None]:
np.unique(y[test_index], return_counts=True)

In [None]:
# Assumption: `df_users` was an earlier name for `df_results`
df_results["party"].isna().sum()

clf.feature_importances_

In [None]:
df_questions.iloc[7]

# Feature selection


In [79]:
from sklearn.feature_selection import f_classif, SelectKBest, chi2, mutual_info_classif

In [100]:
y = df_results["age_binned"].cat.codes.to_numpy()

f_classif(x, y)

(array([3.81490248, 2.36112975, 1.10983725, 0.54580214, 0.89662667,
        0.56225249, 0.90254074, 0.66569223, 0.18893098, 1.56250092,
        0.35002768, 0.76207119, 0.81723808, 0.71496848, 0.60556133,
        0.52064862, 1.10717252, 0.72531261, 1.17537611, 2.26078233,
        0.45343125, 1.32735415, 1.1998307 , 1.50466006, 1.78091835,
        0.41269344, 0.83224074, 0.78323649, 5.62639072, 1.83757647,
        0.36755519]),
 array([5.46476468e-04, 2.31405227e-02, 3.56638692e-01, 7.99367947e-01,
        5.09362658e-01, 7.86367359e-01, 5.04712540e-01, 7.01084840e-01,
        9.87605647e-01, 1.46075858e-01, 9.30048461e-01, 6.19572714e-01,
        5.73523601e-01, 6.59390163e-01, 7.51264776e-01, 8.18809684e-01,
        3.58338630e-01, 6.50627439e-01, 3.16600221e-01, 2.95347680e-02,
        8.67526366e-01, 2.36766734e-01, 3.02534985e-01, 1.65047223e-01,
        9.06040545e-02, 8.94202169e-01, 5.61177132e-01, 6.01802866e-01,
        4.06340931e-06, 7.97422584e-02, 9.20710959e-01]))
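
`f_classif` returns two arrays, the ANOVA F-statistic and the corresponding p-value for each feature, in column order. A minimal sketch on synthetic data (the arrays below are assumptions built so that one feature depends on the class and the other does not) shows how the pair can be turned into a ranking:

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Synthetic data: 3 classes with 50 samples each
y_toy = np.repeat([0, 1, 2], 50)
informative = y_toy + 0.1 * rng.normal(size=150)  # depends on the class
noise = rng.normal(size=150)                      # unrelated to the class
x_toy = np.column_stack([informative, noise])

f_stat, p_val = f_classif(x_toy, y_toy)

# Most significant feature first
ranking = np.argsort(p_val)
print(ranking, f_stat, p_val)
```

Applied to the real output above, the same `np.argsort` on the p-values would give the question indices ordered by how strongly their mean answer differs between age bins. With 31 questions tested at once, a few small p-values are expected by chance, so the ranking is better read as a list of hypotheses than as a set of findings.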

In [101]:
top_vals = mutual_info_classif(x, y)

In [102]:
idx_sort = np.argsort(top_vals)[::-1]

In [103]:
df_questions.iloc[idx_sort,:]["question"].to_list()

['Ísland á að eiga aðild að NATO.',
 'Leyfa á smásölu áfengis í matvöruverslunum.',
 'Stjórnvöld ættu að flytja fleiri ríkisstofnanir til landsbyggðanna.',
 'Ísland á að taka á móti fleiri flóttamönnum.',
 'Tryggja þarf að örorkubætur og ellilífeyrir séu ekki lægri en lægstu laun á vinnumarkaði.',
 'Alþingi á að vinna markvisst að endurskoðun stjórnarskrárinnar á næsta kjörtímabili.',
 'Sjávarútvegsfyrirtæki ættu að greiða meira til ríkisins fyrir aðgang að auðlindum.',
 'Efna á til þjóðaratkvæðagreiðslu um framhald aðildarviðræðna við ESB.',
 'Það ætti að lækka tryggingagjald á fyrirtæki.',
 'Stjórnvöld eiga að setja strangar takmarkanir við landamæri til að tryggja að smit berist ekki til landsins.',
 'Það á að leyfa kórónuveirunni að ganga án sóttvarnatakmarkana.',
 'Reykjavíkurflugvöllur á að vera í Vatnsmýri um ókomna framtíð.',
 'Stjórnvöld eiga að halda áfram að styðja við atvinnulífið í kjölfar heimsfaraldursins til að koma til móts við fyrirtæki sem lentu í rekstrarvanda í far

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances

dist_mat = pairwise_distances(x, metric="cosine")

dist_mat


for m in ["cosine", "l2", "l1"]:
    dist_mat = pairwise_distances(x[100:101, :], x, metric=m).ravel()

    idx_sort = np.argsort(dist_mat)
    print(m)
    print(idx_sort[:10])
    print(dist_mat[idx_sort[:10]])

plt.figure()
plt.plot(x[100])
plt.plot(x[25])
plt.plot(x[5])

dist_mat

dist_mat = pairwise_distances(x, metric="cosine")
dist_mat

def _party_idx():
    """Maps the name of the party to indices"""
    ...

for i, val in enumerate(df_users_filter["party"].cat.categories):
    dist_mat_party = dist_mat[y == i, :][:, y == i]
    print(val)
    print(np.mean(dist_mat_party))

dist_mat_party += np.diag(10000 * np.ones(dist_mat_party.shape[0]))
np.argmax(np.min(dist_mat_party, axis=0))

df_users_filter.loc[y == 5].iloc[28]

# Thoughts

Look at age, gender, and party as targets.

This is not data in a Euclidean space; use KernelPCA with cosine, or even mutual information? Or cross entropy.


A motivating example for cosine: one candidate is reserved and always answers close to the middle, another always answers far from it. The reserved candidate will still be closer to someone who is reserved in the opposite direction. Does the distance then measure reservedness vs. assertiveness more than anything else?
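
The reserved-versus-assertive scenario can be written down with three made-up candidates. This is a sketch under two assumptions that are not in the notes: answers are scaled to [0, 1], and the cosine distance is computed after subtracting the midpoint 0.5 so that "agree" and "disagree" point in opposite directions. The helper `cosine_distance` is hypothetical.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pattern of agreement (+1) and disagreement (-1) over four questions
direction = np.array([1.0, -1.0, 1.0, -1.0])

reserved = 0.5 + 0.1 * direction           # mild version of the pattern
assertive = 0.5 + 0.5 * direction          # strong version of the same pattern
reserved_opposite = 0.5 - 0.1 * direction  # mild version of the opposite pattern

# Euclidean distance on the raw answers
euclid_same = float(np.linalg.norm(reserved - assertive))
euclid_opposite = float(np.linalg.norm(reserved - reserved_opposite))

# Cosine distance on answers centred at the midpoint of the scale
center = 0.5
cos_same = cosine_distance(reserved - center, assertive - center)
cos_opposite = cosine_distance(reserved - center, reserved_opposite - center)
print(euclid_same, euclid_opposite, cos_same, cos_opposite)
```

With Euclidean distance the reserved candidate is nearer to the candidate holding the opposite views, while the centred cosine distance puts the two candidates with the same views at distance zero.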



In [None]:
# hide
# Discussion on entropy vs variance?

In [None]:
df_results_melt.groupby(["party", "question"]).std().groupby("party").mean().sort_values("Answer Value", ascending=False).reset_index()

In [None]:
from sklearn.decomposition import FastICA

transformer = FastICA(n_components=2)
x_ica = transformer.fit_transform(x)

In [None]:
x_ica.shape

In [None]:
# Compare to the PCA components
transformer.components_