# Kosningaprof
> Exploration of the data from the kosningaprof

- toc: true 
- badges: true
- comments: false
- categories: [election, data-science, machine-learning]

# Introduction

The Icelandic parliamentary election is on 25 September. Before every election, media outlets set up a quiz/questionnaire in which candidates are given statements such as *"Iceland should be a part of NATO"* and *"The Icelandic government should put more money into the healthcare system"* and answer whether they agree with, disagree with, or are neutral towards each statement. Users can then answer the same questions and find out which candidates and political parties are "closest to" their own political beliefs, based on the answers to the questions.

These are mostly for fun and should only serve as an indicator, but it's an enjoyable process to go through and it's always interesting to see which candidates are "most similar" to oneself.
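
How the actual quizzes score similarity is not specified here. As a rough sketch of the idea only, assuming answers on a 0-100 scale and using the mean absolute difference as an (assumed) distance, matching a user to candidates could look like this. All names and numbers are made up.

```python
import numpy as np

# Hypothetical answers on a 0-100 scale: rows are candidates, columns are statements
candidate_names = ["Candidate A", "Candidate B", "Candidate C"]
candidate_answers = np.array([
    [100, 80, 10],
    [20, 30, 90],
    [60, 50, 50],
])
user_answers = np.array([90, 70, 20])

# Assumed similarity measure: mean absolute difference (lower means closer)
distances = np.abs(candidate_answers - user_answers).mean(axis=1)
closest = candidate_names[int(np.argmin(distances))]
print(closest, distances)
```

Other distance measures (cosine, l2) would give different rankings, which is one of the things explored later in the post.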

As a whole, this collection of data (candidates and their answers to a set of questions) is interesting and offers many opportunities for data exploration. The purpose of this post is to take the data from the [RUV quiz](https://www.ruv.is/x21/kosningaprof), explore it and try to answer some questions about it.

Similar (and definitely more rigorous) analysis has been done before by people who design these tests and actually work with the data; see for example this great thread [here](https://twitter.com/hafsteinneinars/status/1435268582053711881) on this [quiz](https://egkys.is/kosningavitinn/). Since this post should not be taken too seriously, the analysis will be more about generating plausible hypotheses and doing some ad-hoc analysis.

# The Data

In [1]:
# hide
import numpy as np
import pandas as pd

If you want the data for yourself, e.g. to run this notebook locally, follow the instructions [here](https://github.com/roberttorfason/kosningaprof) to set up an environment and fetch the data.

Let's load the data, pre-process it and set up some helper objects

In [2]:
df_results, df_questions = pd.read_csv("results_2021.csv"), pd.read_csv("questions_2021.csv")

In [3]:
# collapse-hide

# Pre-processing
df_results["party"] = df_results["party"].astype("category")
df_results["gender"] = df_results["gender"].astype("category")

# Bin the ages. `pd.cut` returns intervals that are annoying to work with so we just use the
# left age of each bin e.g. 30 to represent the interval [30, 40)
age_binned_series = pd.cut(df_results["age"], bins=[-10, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100], right=False)
df_results.insert(df_results.columns.get_loc("age") + 1, "age_binned", age_binned_series)

df_results["age_binned"] = df_results["age_binned"].map(lambda x: x.left).astype(int)

# Most of the analysis centers around the political party, so we drop the candidates that
# don't have a party specified
df_results = df_results[~df_results["party"].isna()]
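
To make the binning step concrete, here is a small sketch on made-up ages. It mirrors the approach used above (left-closed bins, represented by their left edge) and also shows `labels=` as an alternative way to get the same result; the variable names are illustrative only.

```python
import pandas as pd

ages = pd.Series([18, 30, 39, 40, 67])  # made-up ages
bins = list(range(0, 101, 10))

# right=False makes the bins left-closed: [30, 40) contains 30 but not 40
binned = pd.cut(ages, bins=bins, right=False)
left_edges = binned.map(lambda interval: interval.left).astype(int)

# Alternative: let pd.cut label each bin by its left edge directly
left_edges_alt = pd.cut(ages, bins=bins, right=False, labels=bins[:-1]).astype(int)

print(left_edges.to_list())
print(left_edges_alt.to_list())
```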

In [4]:
# collapse-hide
cols_questions = [c for c in df_results.columns if c.startswith("question_")]
cols_meta = [c for c in df_results.columns if c not in cols_questions]

question_id_to_string = dict(zip(df_questions["question_number"], df_questions["question"]))

and take a look at the structure

In [5]:
df_questions.head(3)

| | question_number | id | question |
|---|---|---|---|
| 0 | question_0 | 5719843612393472 | Íslenskt samfélag einkennist af réttlæti, sann... |
| 1 | question_1 | 5692341510733824 | Það á að leyfa kórónuveirunni að ganga án sótt... |
| 2 | question_2 | 5284747587616768 | Stjórnvöld eiga að setja strangar takmarkanir ... |


`df_questions` has all the questions and their ids/numbers.

In [6]:
df_results.head(3)

| | name | party | age | age_binned | gender | timestamp | constituency | answering_done | timestamp_dt | question_0 | ... | question_21 | question_22 | question_23 | question_24 | question_25 | question_26 | question_27 | question_28 | question_29 | question_30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Elín Tryggvadóttir | Samfylkingin | 46 | 40 | Kona | 1631051439022 | Reykjavíkurkjördæmi suður | 1.0 | 2021-09-07 21:50:39.022 | 51 | ... | 83 | 3 | 100 | 93 | 100 | 100 | 100 | 8 | 100 | 63 |
| 1 | Ágústa Anna Ómarsdóttir | Sósíalistaflokkurinn | 55 | 50 | Kona | 1631547337578 | Norðvesturkjördæmi | 1.0 | 2021-09-13 15:35:37.578 | 0 | ... | 0 | 27 | 61 | 61 | 96 | 100 | 100 | 0 | 100 | 32 |
| 2 | María Lilja Þrastardóttir Kemp | Sósíalistaflokkurinn | 35 | 30 | Kona | 1631030960104 | Reykjavíkurkjördæmi suður | 1.0 | 2021-09-07 16:09:20.104 | 22 | ... | 100 | 2 | 100 | 58 | 100 | 100 | 100 | 38 | 100 | 24 |


Each row of `df_results` represents one candidate: metadata (`age`, `party`, `name`) followed by the answers to all the questions, where each answer is on a scale from 0 to 100.
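
The question columns share their names with the `question_number` column of `df_questions`, so a column position in a plain numpy array can be traced back to the question text. A minimal sketch on tiny made-up stand-ins for the two tables (the statement texts and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for df_questions and df_results (made-up values)
questions = pd.DataFrame({
    "question_number": ["question_0", "question_1", "question_2"],
    "question": ["Statement A", "Statement B", "Statement C"],
})
results = pd.DataFrame({
    "name": ["X", "Y"],
    "question_0": [10, 90],
    "question_1": [50, 40],
    "question_2": [100, 0],
})

question_cols = [c for c in results.columns if c.startswith("question_")]
x_small = results[question_cols].to_numpy()

# Column j of the array should belong to row j of the questions table
assert question_cols == questions["question_number"].to_list()
j = 2
print(questions["question"].iloc[j], x_small[:, j])
```

Running a check like the `assert` above on the real data is a cheap way to confirm the ordering before relying on positional indexing.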

In [7]:
# collapse-hide
"""
This text is in a code cell because it's not possible to collapse markdown cells in fastpages

Additionally each question in `df_results` has a mapping back to `df_questions` via the column name.
Note that, because of how the questions are indexed, there is an easy correspondence between the (numeric) index 
of each column and `df_questions`. This means that when we later transform the data to numpy arrays, 
where we don't have named columns, and do something like `x[:, 3]`, it will correspond to `df_questions.iloc[3]` 
so going back and forth between the data and the actual questions is easy.
""";

# Interactive Histogram of Questions and Answers

Below is a histogram of the answers, where you can select individual parties and questions to see how they differ. The questions are ordered by how "interesting" they are, using the standard deviation of the answers to each question as a proxy. If everyone answers the same, the standard deviation will be low. If the answers are spread evenly across the scale it will be medium, and if there is a strong split where candidates either strongly agree or strongly disagree with the statement it will be high.
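
To see why the standard deviation works as a proxy, here is a small sketch with three made-up answer patterns on the 0-100 scale. It uses `np.std` (population standard deviation); pandas' `.std()` uses the sample version, which differs slightly but orders the patterns the same way.

```python
import numpy as np

# Three made-up answer patterns on the 0-100 scale
unanimous = np.full(100, 50.0)                   # everyone gives the same answer
spread = np.linspace(0, 100, 100)                # answers spread evenly over the scale
polarized = np.array([0.0] * 50 + [100.0] * 50)  # half fully disagree, half fully agree

stds = {
    "unanimous": float(np.std(unanimous)),
    "spread": float(np.std(spread)),
    "polarized": float(np.std(polarized)),
}
print(stds)
```

The polarized pattern gets the largest value, so strongly divisive questions end up at the top of the dropdown.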

In [8]:
# hide
import altair as alt

In [9]:
# collapse-hide
df_results_melt = pd.melt(df_results, id_vars=cols_meta, value_vars=cols_questions)
df_results_melt = df_results_melt[["party", "variable", "value"]]
df_results_melt = df_results_melt.rename(columns={"variable": "question", "value": "Answer Value"})
df_results_melt["question"] = df_results_melt["question"].replace(question_id_to_string)
df_results_melt["question"] = df_results_melt["question"].astype("category")

In [10]:
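# Select the numeric column explicitly so the non-numeric `party` column is not aggregated
df_questions_std = df_results_melt.groupby("question")[["Answer Value"]].std().sort_values("Answer Value", ascending=False)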
questions_sorted = df_questions_std.index.to_list()

In [11]:
# hide
# Discussion on entropy vs variance?

In [12]:
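# Mean per-question standard deviation within each party (higher means less internal agreement)
df_results_melt.groupby(["party", "question"])[["Answer Value"]].std().groupby("party").mean().sort_values("Answer Value", ascending=False).reset_index()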

| | party | Answer Value |
|---|---|---|
| 0 | Miðflokkurinn | 28.339318 |
| 1 | Sjálfstæðisflokkurinn | 23.921914 |
| 2 | Frjálslyndi Lýðræðisflokkurinn | 23.890085 |
| 3 | Framsóknarflokkurinn | 21.14051 |
| 4 | Viðreisn | 20.839005 |
| 5 | Flokkur Fólksins | 20.393239 |
| 6 | Vinstri Græn | 19.294324 |
| 7 | Píratar | 18.837223 |
| 8 | Sósíalistaflokkurinn | 18.455783 |
| 9 | Samfylkingin | 16.492775 |


In [13]:
# hide_input
alt.data_transformers.disable_max_rows()

parties_list = df_results_melt["party"].cat.categories.to_list()
questions_list = df_results_melt["question"].cat.categories.to_list()

# Base histogram of the answer values, colored by party
chart = alt.Chart(df_results_melt).mark_bar().encode(
    x=alt.X('Answer Value:Q', bin=alt.Bin(extent=[0, 100], step=10), scale=alt.Scale(domain=(0, 100))),
    y=alt.Y('count()'),
    color='party',
    tooltip=['party', alt.Tooltip('count()', title='count')]
).interactive()
    
# A dropdown filter
question_dropdown = alt.binding_select(options=[None] + questions_sorted, labels=["All"] + questions_sorted)
question_select = alt.selection_single(fields=["question"], bind=question_dropdown, name="Question")

chart_filter_question = chart.add_selection(
    question_select
).transform_filter(
    question_select
).properties(title="Question Result Histogram")
 
# A dropdown filter
party_dropdown = alt.binding_select(options=[None] + parties_list, labels=["All"] + parties_list)
party_select = alt.selection_single(fields=["party"], bind=party_dropdown, name="Party")

chart_filter_party = chart_filter_question.add_selection(
    party_select
).transform_filter(
    party_select
)

chart_filter_party

# Dimensionality reduction and embeddings

In [14]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, NMF

Let's pick out only the question columns, convert them to a numpy array for use with `sklearn`, and normalize the values to the range [0, 1].

In [15]:
df_questions_only = df_results.filter(like="question_")
x = df_questions_only.to_numpy()
x = x.astype(float) / 100
x.shape

(311, 31)

In [None]:
pca = PCA(n_components=10)
x_pca = pca.fit_transform(x)
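
A natural first check after fitting a PCA is how much of the variance each component captures. The sketch below uses synthetic data as a stand-in for `x` (two hidden factors plus noise, which is an assumption made purely for illustration) and reads `explained_variance_ratio_` from the fitted model.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for x: 200 "candidates", 31 "questions", driven by 2 hidden factors
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 31))
x_fake = latent @ loadings + 0.1 * rng.normal(size=(200, 31))

pca_fake = PCA(n_components=10)
x_fake_pca = pca_fake.fit_transform(x_fake)

# Fraction of the total variance captured by each component, and the running total
ratios = pca_fake.explained_variance_ratio_
print(np.round(ratios, 3))
print(np.round(np.cumsum(ratios), 3))
```

On the real data the same two `print` calls on `pca` would show whether the first couple of components (the ones plotted) carry most of the structure.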

In [None]:
clf = PCA(n_components=10)
y_val = "party"

# PCA on the transposed matrix: each question is a sample, each candidate a feature
x_pca_questions = clf.fit_transform(x.T)

# Sum the candidate loadings of the second component within each party
party_codes = df_results[y_val].cat.codes.to_numpy()
for i, val in enumerate(df_results[y_val].cat.categories):
    print(val)
    print(np.sum(clf.components_[1][party_codes == i]))

In [None]:
df_pca_q = pd.DataFrame(x_pca_questions)
df_pca_q.columns = df_pca_q.columns.astype(str)
df_pca_q = pd.concat([df_questions, df_pca_q], axis=1)
df_pca_q.head(5)

# Plot the questions (interactive dropdown)

In [None]:
# Think about interpretation of this. Correlation in users!
alt.Chart(df_pca_q).mark_circle(size=60).encode(
    x='0',
    y='1',
    #color='party',
    tooltip=['question']
).interactive()

In [None]:
# Map back to questions
#plt.figure()
#plt.plot(clf.components_[0])
#plt.plot(clf.components_[1])
#plt.plot(clf.components_[2])
#plt.plot(clf.components_[3])

In [None]:
df_pca = pd.DataFrame(x_pca)
df_pca.columns = df_pca.columns.astype(str)
# Reset the index so the metadata rows line up with the rows of x_pca
df_pca = pd.concat([df_results[cols_meta].reset_index(drop=True), df_pca], axis=1)
df_pca = df_pca.drop(columns=["age_binned"])
df_pca.head(5)

In [None]:
alt.Chart(df_pca).mark_circle(size=60).encode(
    x='0',
    y='1',
    color='party',
    tooltip=['party', "name"]
).interactive()

In [None]:
df_results["party"].cat.categories

In [None]:
df_results["party"].head(), df_results["party"].cat.codes.head()

In [None]:
model = NMF(n_components=2, init='random', max_iter=400, alpha=0.5, l1_ratio=0.5)
W = model.fit_transform(x)
H = model.components_

np.mean(np.abs(W @ H - x))

# What users/questions is it hard to reconstruct?
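
One way to approach this question is to average the absolute reconstruction error per row and per column instead of over the whole matrix. A minimal sketch on made-up non-negative data (the `_fake` names are hypothetical and the NMF settings are illustrative, not the ones used above):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
x_fake = rng.random((50, 8))  # made-up non-negative data in [0, 1]

model_fake = NMF(n_components=2, init="random", random_state=0, max_iter=1000)
W_fake = model_fake.fit_transform(x_fake)
H_fake = model_fake.components_

abs_err = np.abs(W_fake @ H_fake - x_fake)
row_err = abs_err.mean(axis=1)  # one value per row ("candidate")
col_err = abs_err.mean(axis=0)  # one value per column ("question")

# Indices of the rows and columns that are hardest to reconstruct
print(np.argsort(row_err)[::-1][:3], np.argsort(col_err)[::-1][:3])
```

With the real `W`, `H` and `x`, the row indices map to candidates and the column indices to rows of `df_questions`.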

In [None]:
df_results["age_binned"] = pd.cut(df_results["age"], bins=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

In [None]:
df_results["age_binned"].cat.codes  # needs a gradual color scheme, i.e. old is dark and young is a lighter shade of the same color

In [None]:
y_val = "party"
y = df_results[y_val].cat.codes.to_numpy()

plt.figure(figsize=(12, 12))
for i, val in enumerate(df_results[y_val].cat.categories):
    plt.scatter(W[y == i, 0], W[y == i, 1], label=val)
plt.legend()

- Classification to find the "most important" questions
- Biggest outlier per party
- Do any later components in PCA explain something else than party? E.g. age?

In [None]:
# Need to see the questions, will answer what the latent factors mean?
# I think the orthogonal direction means: Which questions do most people agree on
plt.figure(figsize=(15, 15))
for i in list(range(len(H.T)))[:10]:
    plt.scatter(H.T[i, 0], H.T[i, 1], label=df_questions["question"].iloc[i])
plt.legend()

# Interesting: an internationalist axis vs. a nationalist axis? Healthcare questions form a cluster

In [None]:
source = pd.DataFrame({"Component0": W[:, 0], "Component1": W[:, 1], "party": df_results["party"], "name": df_results["name"]})

alt.Chart(source).mark_circle(size=60).encode(
    x='Component0',
    y='Component1',
    color='party',
    tooltip=['party', "name"]
).interactive()

In [None]:
# Make sure df_questions lines up with H
source = pd.DataFrame({"Component0": H.T[:, 0], "Component1": H.T[:, 1], "question": df_questions["question"], "num": range(len(H.T[:, 0]))})

alt.Chart(source).mark_circle(size=60).encode(
    x='Component0',
    y='Component1',
    #color='party',
    tooltip=['question', 'num']
).interactive()

Note that the factorization is only determined up to scale: a constant can be moved arbitrarily between the two matrices `W` and `H`. How about scaling users to [0, 1] and questions to [0, 100]?
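
The scale ambiguity is easy to verify numerically. In this sketch the factor matrices are random stand-ins for `W` and `H`, and each component gets its own (arbitrary) scale factor:

```python
import numpy as np

rng = np.random.default_rng(0)
W_fake = rng.random((5, 2))  # stand-in for the user factors
H_fake = rng.random((2, 4))  # stand-in for the question factors

# Divide each column of W by a constant and multiply the matching row of H by it
scales = np.array([100.0, 0.5])
W_scaled = W_fake / scales
H_scaled = H_fake * scales[:, None]

# Both factorizations give the same reconstruction
print(np.allclose(W_fake @ H_fake, W_scaled @ H_scaled))
```

So any convention (for example fixing the range of one of the matrices) can be imposed after fitting without changing the reconstruction.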

In [None]:
from sklearn.decomposition import FastICA

transformer = FastICA(n_components=2)
x_ica = transformer.fit_transform(x)

In [None]:
x_ica.shape

In [None]:
# Compare to the PCA components
transformer.components_

In [None]:
# Make sure the rows of x_ica line up with the candidates in df_results
source = pd.DataFrame({"Component0": x_ica[:, 0], "Component1": x_ica[:, 1], "name": df_results["name"], "party": df_results["party"]})

alt.Chart(source).mark_circle(size=60).encode(
    x='Component0',
    y='Component1',
    color='party',
    tooltip=['name']
).interactive()

In [None]:
# Similarity to all in the 

Fill in NaNs using KNN: find the k closest users, take the mean of their answers and use that to fill in the missing value.
Matrix factorization is also interesting. Is the feature vector for questions not that interesting? Dunno; it is probably also possible to calculate it using correlation, or implicitly from the results of the PCA.
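
The KNN idea can be sketched with scikit-learn's `KNNImputer`, which fills a missing entry with the mean of that feature over the nearest rows (distances are computed on the coordinates that are present). The data below is made up and `n_neighbors=2` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up answers scaled to [0, 1]; np.nan marks a skipped question
x_missing = np.array([
    [1.0, 0.9, 0.1],
    [0.9, np.nan, 0.2],
    [1.0, 0.7, 0.0],
    [0.1, 0.2, 0.9],
])

# Each missing value is replaced by the mean of that question over the k nearest rows
imputer = KNNImputer(n_neighbors=2)
x_filled = imputer.fit_transform(x_missing)
print(x_filled[1])
```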

Which question has the most predictive value for
- Age
- Party
- Gender

We know a priori what the targets are, so which clustering method is best?



Mean median of a question!

Density of points to see conformity? How? Variance? Something else? Mean distance of everyone to everyone?

Compare ranking based on metrics: Cosine, correlation, l2, l1

In [None]:
df_results.party.unique()

In [None]:
df_users_filter = df_results[~df_results["party"].isna()]
df_users_filter = df_users_filter[df_users_filter["party"] != "Ábyrg Framtíð"]

print(df_users_filter.party.cat.categories)


x = df_users_filter.filter(like="question_").to_numpy()
x = x.astype(float) / 100

# Try gender
y = df_users_filter["party"].cat.codes.to_numpy()

x.shape, y.shape

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

In [None]:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(clf, x, y, cv=5)
cv_results

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import ConfusionMatrixDisplay

kf = KFold(n_splits=3, random_state=None, shuffle=True)
train_index, test_index = next(kf.split(x))

clf.fit(x[train_index, :], y[train_index])
# `[:-1]` drops the last category, assumed to be the party that was filtered out above
ConfusionMatrixDisplay.from_estimator(clf, x[test_index, :], y[test_index], display_labels=df_users_filter["party"].cat.categories.to_list()[:-1])


len(test_index), len(train_index), df_users_filter["party"].cat.categories
# Who is confused?

In [None]:
np.unique(y[test_index], return_counts=True)

In [None]:
df_results["party"].isna().sum()

clf.feature_importances_

In [None]:
df_questions.iloc[7]

# Feature selection

In [None]:
from sklearn.feature_selection import f_classif, SelectKBest, chi2, mutual_info_classif

In [None]:
chi2(x, y)

In [None]:
top_vals = mutual_info_classif(x, y)

In [None]:
idx_sort = np.argsort(top_vals)[::-1]

In [None]:
df_questions.iloc[idx_sort,:]["question"].to_list()

In [None]:

from sklearn.metrics import pairwise_distances

dist_mat = pairwise_distances(x, metric="cosine")

dist_mat


for m in ["cosine", "l2", "l1"]:
    dist_mat = pairwise_distances(x[100:101, :], x, metric=m).ravel()

    idx_sort = np.argsort(dist_mat)
    print(m)
    print(idx_sort[:10])
    print(dist_mat[idx_sort[:10]])

plt.figure()
plt.plot(x[100])
plt.plot(x[25])
plt.plot(x[5])

dist_mat

dist_mat = pairwise_distances(x, metric="cosine")
dist_mat

def _party_idx():
    """Maps the name of the party to indices"""
    ...

for i, val in enumerate(df_users_filter["party"].cat.categories):
    dist_mat_party = dist_mat[y == i, :][:, y == i]
    print(val)
    print(np.mean(dist_mat_party))

dist_mat_party += np.diag(10000 * np.ones(dist_mat_party.shape[0]))
np.argmax(np.min(dist_mat_party, axis=0))

df_users_filter.loc[y == 5].iloc[28]

# Thoughts

Look at age, gender and party as targets

This is not data in a Euclidean space; use KernelPCA with cosine, or even mutual information? Or cross entropy
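
As a starting point for the KernelPCA idea, scikit-learn's `KernelPCA` accepts `kernel="cosine"`. The sketch below runs on made-up data; centering on the neutral answer 0.5 first is an assumption of this example, so that direction encodes agree vs. disagree.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
x_fake = rng.random((40, 31))  # made-up answers in [0, 1]

# Assumption: center on the neutral answer so that direction means agree vs. disagree
x_centered = x_fake - 0.5

kpca = KernelPCA(n_components=2, kernel="cosine")
x_kpca = kpca.fit_transform(x_centered)
print(x_kpca.shape)
```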


A motivating example for cosine: one person is reserved and always answers close to the middle, while another always answers far from it. The reserved one will still be closer to someone who is reserved in the opposite direction. Does that measure reservedness vs. decisiveness more than actual agreement?
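
The motivating example can be made concrete with three made-up candidates. The sketch assumes answers scaled to [0, 1] with 0.5 as the neutral point, and computes cosine distance on answers centered at that neutral point (an assumption of this example, not something done elsewhere in the post).

```python
import numpy as np
from sklearn.metrics import pairwise_distances

neutral = 0.5
direction = np.array([1.0, -1.0, 1.0, 1.0])  # made-up pattern of agree/disagree

reserved = neutral + 0.1 * direction           # mild opinions
decisive = neutral + 0.5 * direction           # same opinions, held strongly
reserved_opposite = neutral - 0.1 * direction  # mild opinions in the other direction

points = np.vstack([reserved, decisive, reserved_opposite])

euclid = pairwise_distances(points, metric="euclidean")
cosine = pairwise_distances(points - neutral, metric="cosine")

# Row 0 holds the distances from the reserved candidate to all three
print(euclid[0])
print(cosine[0])
```

With Euclidean distance the reserved candidate ends up closer to the one with opposite views, while cosine distance on centered answers puts it next to the decisive candidate who shares its opinions.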


In [None]:
df_results.isna().sum()
