# Election Questionnaire (Kosningapróf)
> Exploration of the data from the kosningaprof

- toc: true 
- badges: true
- comments: false
- categories: [data-science, election, machine-learning]

# Introduction

The Icelandic parliamentary election is on 25 September. Before every election, media outlets set up a quiz/questionnaire where candidates respond to statements such as *"Iceland should be a part of NATO"* and *"The Icelandic government should put more money into the healthcare system"*, indicating whether they agree with, disagree with, or are neutral towards each statement. Users can then answer the same questions and see which candidates and political parties are "closest to" their own political beliefs.

These are mostly for fun and should only serve as an indicator, but it's an enjoyable process to go through and it's always interesting to see which candidates are "most similar" to oneself.

Taken as a whole, this collection of data (candidates and their answers to a set of questions) is interesting and offers a lot of opportunities for data exploration. The purpose of this post is to take the data from the [RUV quiz](https://www.ruv.is/x21/kosningaprof), explore it, and try to answer some questions about it.

Similar (and definitely more rigorous) analysis has been done before by people designing the tests and actually working with the data; see for example this great thread [here](https://twitter.com/hafsteinneinars/status/1435268582053711881) on this [quiz](https://egkys.is/kosningavitinn/). Since this should not be taken too seriously, the analysis in this post will be more about generating plausible hypotheses and doing some ad-hoc analysis.

**If you want to fetch the data for yourself, e.g. to run this notebook locally, follow the instructions [here](https://github.com/roberttorfason/kosningaprof).**

# The Data

In [None]:
# hide
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

Let's load the data, pre-process it and set up some helper objects

In [None]:
df_results, df_questions = pd.read_csv("data/results_2021.csv"), pd.read_csv("data/questions_2021.csv")

In [None]:
# collapse-hide

# Pre-processing
df_results["party"] = df_results["party"].astype("category")
df_results["gender"] = df_results["gender"].astype("category")

# Bin the ages. `pd.cut` returns intervals that are annoying to work with so we just use the
# left age of each bin e.g. 30 to represent the interval [30, 40)
age_binned_series = pd.cut(df_results["age"], bins=[-10, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100], right=False)
df_results.insert(df_results.columns.get_loc("age") + 1, "age_binned", age_binned_series)

df_results["age_binned"] = df_results["age_binned"].map(lambda x: x.left).astype("category")

# Most of the analysis centers around the political party, so we drop the candidates that don't have
# a party specified
df_results = df_results[~df_results["party"].isna()]
df_results = df_results.reset_index(drop=True)
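To make the binning step concrete, here is a small sketch on made-up ages (the values and the shorter bin list are illustrative only, not taken from the dataset). With `right=False` each interval is closed on the left, so an age of exactly 30 lands in `[30, 40)`, and mapping each interval to its `.left` edge gives a plain number to work with.

```python
import pandas as pd

# Hypothetical ages, only for illustration
ages = pd.Series([29, 30, 39, 40, 67])

# right=False makes the intervals closed on the left: [20, 30), [30, 40), ...
binned = pd.cut(ages, bins=[20, 30, 40, 50, 60, 70], right=False)

# Represent each interval by its left edge, e.g. [30, 40) -> 30
left_edges = binned.map(lambda interval: interval.left).astype(int)
print(left_edges.tolist())
```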

In [None]:
# collapse-hide
cols_questions = [c for c in df_results.columns if c.startswith("question_")]
cols_meta = [c for c in df_results.columns if c not in cols_questions]

question_id_to_string = dict(zip(df_questions["question_number"], df_questions["question"]))

and take a look at the structure

In [None]:
df_questions.head(3)

`df_questions` has all the questions and their ids/numbers.

In [None]:
df_results.head(3)

`df_results` has one row per candidate, containing metadata (`age`, `party`, `name`) and the answers to all the questions, **where each answer is on a scale from 0 to 100: 0 means the candidate strongly disagrees with the statement and 100 means the candidate strongly agrees with the statement**

In [None]:
# collapse-hide
"""This text is in a code cell because it's not possible to collapse markdown cells in fastpages

Additionally each question in `df_results` has a mapping back to `df_questions` via the column name.
Note that, because of the way the questions are indexed, there is an easy correspondence between the (numeric) index
of each column and `df_questions`. This means that when we later transform the data to numpy arrays, 
where we don't have named columns, and do something like `x[:, 3]`, it will correspond to `df_questions.iloc[3]` 
so going back and forth between the data and the actual questions is easy.
""";

# Interactive Histogram of Questions and Answers

Below we visualize a histogram of the answers the candidates gave to the questions. The x-axis, Answer Value, is the binned value of the answer to each question, and the y-axis is simply the count of those values. Again, these answers are on a scale from 0 to 100, 0 meaning strongly disagree with the statement and 100 meaning strongly agree with the statement. There are also two dropdown menus: one filters by political party and one filters by question, so you can see the distribution of answers for each party and each question.

The questions are ordered from most "interesting" to least "interesting", where the standard deviation is used as a proxy for how interesting a question is.

In [None]:
# collapse-hide
"""This text is in a code cell because it's not possible to collapse markdown cells in fastpages

Why does standard deviation make sense as a proxy for how interesting a question is?
As a very informal argument, thinking about the different scenarios:

1. Everyone answers the same or a similar value -> the std. will be low
2. The answers are (roughly) uniformly distributed over possible values -> std. will be a "medium" value
3. If there is a strong split (bi-modal distribution) where candidates either agree or disagree with the
   statements -> std. is high

Visual inspection of the plots also supports this.

One might be inclined to use entropy to measure how interesting a question is, but in that case the ordering
would be 1. < 3. < 2., which is not the desired outcome for this problem, so the std. is more appropriate here.
""";

In [None]:
# hide
import altair as alt

We need to pre-process the data for this plot, transforming it from a wide dataframe (one column per question) to a long (tall) dataframe (one row per candidate and question). See a good discussion on why that's useful [here](https://altair-viz.github.io/user_guide/data.html#long-form-vs-wide-form-data)
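As a small illustration of what `pd.melt` does, the sketch below reshapes a made-up two-candidate table. The party labels and answer values are hypothetical; only the `question_` column prefix mirrors the real data.

```python
import pandas as pd

# Hypothetical wide table: one row per candidate, one column per question
wide = pd.DataFrame(
    {"party": ["A", "B"], "question_0": [0, 100], "question_1": [50, 75]}
)

# Long form: one row per (candidate, question) pair
long = pd.melt(
    wide,
    id_vars=["party"],
    value_vars=["question_0", "question_1"],
    var_name="question",
    value_name="Answer Value",
)
print(long)
```

Each question column becomes a block of rows, which is the shape Altair expects when a single field (here `question`) is used for filtering or faceting.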

In [None]:
# collapse-hide
df_results_melt = pd.melt(df_results, id_vars=cols_meta, value_vars=cols_questions)
df_results_melt = df_results_melt[["party", "variable", "value"]]
df_results_melt = df_results_melt.rename(columns={"variable": "question", "value": "Answer Value"})
df_results_melt["question"] = df_results_melt["question"].replace(question_id_to_string)
df_results_melt["question"] = df_results_melt["question"].astype("category")

In [None]:
# Select the numeric answer column explicitly so the categorical "party" column is not aggregated
df_questions_std = df_results_melt.groupby("question")["Answer Value"].std().sort_values(ascending=False)
questions_sorted = df_questions_std.index.to_list()
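The choice of standard deviation over entropy as the "interestingness" proxy can be checked on three synthetic answer sets: full consensus, a uniform spread, and a hard agree/disagree split. This is only a sketch; the answer values and the helper `histogram_entropy` are made up for illustration.

```python
import numpy as np

def histogram_entropy(values, bins=np.arange(0, 111, 10)):
    """Shannon entropy (in nats) of the binned answer distribution."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

# Three hypothetical questions, 110 answers each
consensus = np.full(110, 50)                      # everyone answers 50
uniform = np.repeat(np.arange(0, 101, 10), 10)    # evenly spread over 0..100
split = np.repeat([0, 100], 55)                   # half disagree, half agree

for name, answers in [("consensus", consensus), ("uniform", uniform), ("split", split)]:
    print(f"{name:>9}: std = {np.std(answers):6.2f}, entropy = {histogram_entropy(answers):.3f}")
```

The standard deviation ranks the polarized question highest, while entropy ranks the uniformly spread one highest, which is why the std. is the better fit for surfacing divisive questions.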

In [None]:
# hide_input
alt.data_transformers.disable_max_rows()

parties_list = df_results_melt["party"].cat.categories.to_list()
questions_list = df_results_melt["question"].cat.categories.to_list()

# Base histogram of answer values, colored by party
chart = alt.Chart(df_results_melt).mark_bar().encode(
    x=alt.X('Answer Value:Q', bin=alt.Bin(extent=[0, 100], step=10), scale=alt.Scale(domain=(0, 100))),
    y=alt.Y('count()'),
    color='party',
    tooltip=['party', alt.Tooltip('count()', title='count')]
).interactive()
    
# A dropdown filter
question_dropdown = alt.binding_select(options=[None] + questions_sorted, labels=["All"] + questions_sorted)
question_select = alt.selection_single(fields=["question"], bind=question_dropdown, name="Question")

chart_filter_question = chart.add_selection(
    question_select
).transform_filter(
    question_select
).properties(title="Question Result Histogram")
 
# A dropdown filter
party_dropdown = alt.binding_select(options=[None] + parties_list, labels=["All"] + parties_list)
party_select = alt.selection_single(fields=["party"], bind=party_dropdown, name="Party")

chart_filter_party = chart_filter_question.add_selection(
    party_select
).transform_filter(
    party_select
)

chart_filter_party

# Dimensionality Reduction and Embeddings

## Principal Component Analysis (PCA)

In [None]:
# hide
from sklearn.decomposition import PCA, NMF

Next we want to plot the data in a lower dimension. We run PCA to get the 2 components that explain the most variance, and then plot each candidate's location in the space spanned by these new basis vectors.

First we pick out the questions from the dataframe and transform the extracted questions to a numpy array to be used with `sklearn` functions. Finally we normalize it to be in the range `[0, 1]`

In [None]:
df_questions_only = df_results.filter(like="question_")
x = df_questions_only.to_numpy()
x = x.astype(float) / 100
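The sketch below shows, on made-up data, how a column of the resulting array lines up with a row of the questions table. The toy dataframes, names and statements are hypothetical; the assumption is that the `question_` columns appear in the same order as the rows of the questions table, as described earlier.

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes
df_questions_toy = pd.DataFrame(
    {
        "question_number": ["question_0", "question_1", "question_2"],
        "question": ["Statement A", "Statement B", "Statement C"],
    }
)
df_results_toy = pd.DataFrame(
    {
        "name": ["Anna", "Jon"],
        "question_0": [0, 100],
        "question_1": [50, 75],
        "question_2": [100, 25],
    }
)

# Same extraction and scaling to [0, 1] as in the cell above
x_toy = df_results_toy.filter(like="question_").to_numpy().astype(float) / 100

col = 2
answers = x_toy[:, col]                              # all answers to the third question
question_text = df_questions_toy.iloc[col]["question"]
print(question_text, answers)
```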

In [None]:
# hide
from typing import List

def numpy_to_dataframe(_x: np.ndarray, _df: pd.DataFrame, cols_to_use: List[str]) -> pd.DataFrame:
    """Concatenate a numpy array with selected columns from a dataframe to be used with altair plotting"""
    df_out = pd.DataFrame(_x)
    df_out.columns = df_out.columns.astype(str)
    df_out = pd.concat([_df[cols_to_use].reset_index(drop=True), df_out], axis=1)
    return df_out

In [None]:
pca = PCA(n_components=4)
x_pca = pca.fit_transform(x)
print(f"Explained variance of each component {pca.explained_variance_ratio_}.\n"
      f"Total explained variance of first 4 components {np.sum(pca.explained_variance_ratio_):.4f}\n"
      f"Total explained variance of first 2 components {np.sum(pca.explained_variance_ratio_[:2]):.4f}")

The first 2 components only explain roughly 50% of the variance in the data, but the ones that come after do not individually add a lot of explanatory power.
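For readers curious where the explained-variance numbers come from, here is a minimal NumPy sketch that computes them from a singular value decomposition of the centered data. It uses random toy data rather than the questionnaire answers, and it is an illustration of the idea rather than a description of what `sklearn` does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
x_toy = rng.random((50, 6))          # hypothetical (n_candidates, n_questions) data

# PCA works on mean-centered data
x_centered = x_toy - x_toy.mean(axis=0)

# Rows of vt are the principal directions; s holds the singular values
_, s, vt = np.linalg.svd(x_centered, full_matrices=False)

# Variance captured by each component and its share of the total
explained_variance = s**2 / (len(x_toy) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Coordinates of each row in the first 2 components
x_proj = x_centered @ vt[:2].T

print(explained_variance_ratio.round(3))
print(f"First 2 components: {explained_variance_ratio[:2].sum():.3f}")
```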

We plot these components where you can see a breakdown by party and a tooltip overlay indicating which candidate corresponds to which point on the plot. 

In [None]:
# hide_input
df_pca = numpy_to_dataframe(x_pca, df_results, cols_meta)
df_pca = df_pca.rename(columns={"0": "PCA Component 0", "1": "PCA Component 1"})

alt.Chart(df_pca).mark_circle(size=60).encode(
    x='PCA Component 0',
    y='PCA Component 1',
    color='party',
    tooltip=['party', "name"]
).interactive()

I don't want to interpret the components for the readers, since that is a subjective process. But because each component is a linear combination of the question vectors, we can look at which questions contribute most strongly to each component and use that to help us interpret them.

Note that for negative (red) questions, we need to negate the statement to get the direction that aligns with the components that have positive (green) questions.
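A short sketch of the ranking idea, using invented loadings and placeholder statements: questions are ordered by the absolute value of their weight, and the sign tells us whether agreeing or disagreeing with the statement pushes a candidate in the positive direction of the component.

```python
import numpy as np

# Hypothetical statements and component weights (loadings)
questions = ["Statement A", "Statement B", "Statement C", "Statement D"]
component = np.array([0.10, -0.70, 0.50, -0.05])

# Rank by absolute weight so strongly negative questions are not ignored
order = np.argsort(-np.abs(component))

for i in order[:2]:
    direction = "agreeing" if component[i] > 0 else "disagreeing"
    print(f"{questions[i]}: weight {component[i]:+.2f}, "
          f"{direction} moves a candidate in the positive direction")
```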

In [None]:
# collapse-hide
df_questions_and_components = pd.DataFrame(
    {"Question": df_questions["question"], "Component 0": pca.components_[0], "Component 1": pca.components_[1]}
)
# We need to sort by the absolute value of each component s.t. we don't disregard components with a negative sign
df_questions_and_components["Component 0 abs"] = df_questions_and_components["Component 0"].abs()
df_questions_and_components["Component 1 abs"] = df_questions_and_components["Component 1"].abs()

In [None]:
# hide
def style_positive_and_negative(v):
    if isinstance(v, str):
        return None
    if v < 0:
        return 'color:red;'
    else:
        return 'color:green;'

The top contributing questions for PCA component 0:

In [None]:
# hide_input
questions_sorted_component_0 = df_questions_and_components.sort_values(by="Component 0 abs", ascending=False)
questions_sorted_component_0[["Component 0", "Question"]].head(6).style.applymap(style_positive_and_negative)

The top contributing questions for PCA component 1:

In [None]:
# hide_input
questions_sorted_component_1 = df_questions_and_components.sort_values(by="Component 1 abs", ascending=False)
questions_sorted_component_1[["Component 1", "Question"]].head(6).style.applymap(style_positive_and_negative)

## Non-negative Matrix Factorization (NMF)

PCA is not the only way to do dimensionality reduction. Another method is non-negative matrix factorization, whose purpose is not necessarily dimensionality reduction, but the process gives us a natural lower dimensional representation of the candidates **and** the questions in 2 different spaces that have a similar structure.

Very briefly, NMF seeks to find 2 low rank non-negative matrices $W$ and $H$ such that $W \cdot H \approx X$, where $X$ in this case is our data matrix with the shape `(n_candidates, n_questions)`, $W$ has the shape `(n_candidates, n_components)` and $H$ has the shape `(n_components, n_questions)`. For each candidate and each question we get a representation in `n_components` dimensions, in our case we get a 2-dimensional vector.
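To make the factorization concrete, the sketch below runs a bare-bones NMF with classic multiplicative updates on random toy data. This is a swapped-in, simplified solver for illustration only: it is not the algorithm `sklearn.decomposition.NMF` uses by default, it has no regularization, and all sizes and names (`X_toy`, `W_toy`, `H_toy`) are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates, n_questions, n_components = 30, 8, 2

# Hypothetical non-negative data matrix and random non-negative starting factors
X_toy = rng.random((n_candidates, n_questions))
W_toy = rng.random((n_candidates, n_components))
H_toy = rng.random((n_components, n_questions))

eps = 1e-9  # avoids division by zero
error_start = np.linalg.norm(X_toy - W_toy @ H_toy)

# Multiplicative updates keep every entry non-negative while
# reducing the Frobenius reconstruction error
for _ in range(200):
    H_toy *= (W_toy.T @ X_toy) / (W_toy.T @ W_toy @ H_toy + eps)
    W_toy *= (X_toy @ H_toy.T) / (W_toy @ H_toy @ H_toy.T + eps)

error_end = np.linalg.norm(X_toy - W_toy @ H_toy)

print(f"W shape: {W_toy.shape}, H shape: {H_toy.shape}")
print(f"Reconstruction error: {error_start:.3f} -> {error_end:.3f}")
```

Each row of `W_toy` is the 2-dimensional representation of a candidate and each column of `H_toy` is the 2-dimensional representation of a question.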

In [None]:
# collapse-hide
"""This text is in a code cell because it's not possible to collapse markdown cells in fastpages

We could also try doing PCA on x.T. The problem is that the basis functions are not interpretable (linear
combinations of candidates) and the structure of the space does not necessarily have to correspond to the
structure of the space found when doing PCA on the candidates.

""";

In [None]:
# Note: newer scikit-learn versions replace `alpha` with `alpha_W` and `alpha_H`
nmf = NMF(n_components=2, init='random', max_iter=400, alpha=0.5, l1_ratio=0.5)
W = nmf.fit_transform(x)
H = nmf.components_

Inspect the shapes of the results as a sanity check

In [None]:
# hide_input
print(f"x.shape = (n_candidates, n_questions) = {x.shape}")
print(f"W.shape = (n_candidates, n_components) = {W.shape}")
print(f"H.shape = (n_components, n_questions) = {H.shape}")
print(f"(W @ H).shape = (n_candidates, n_questions) = {(W @ H).shape}")

In [None]:
# hide_input
source = pd.DataFrame({"NMF Component 0": W[:, 0], "NMF Component 1": W[:, 1], "Party": df_results["party"], "Name": df_results["name"]})

chart_results = alt.Chart(source).mark_circle(size=60).encode(
    x='NMF Component 0',
    y='NMF Component 1',
    color='Party',
    tooltip=["Name", 'Party']
).interactive().properties(title="Candidates")

source = pd.DataFrame({"NMF Component 0": H.T[:, 0], "NMF Component 1": H.T[:, 1], "Question": df_questions["question"], "Number": range(len(H.T[:, 0]))})

chart_questions = alt.Chart(source).mark_circle(size=60).encode(
    x='NMF Component 0',
    y='NMF Component 1',
    tooltip=['Question', 'Number']
).interactive().properties(title="Questions")

chart_results | chart_questions

We get a space that seems to capture variability mostly along a single dimension, and the components have a pretty clear meaning. But while this single dimension of variability has a clear interpretation, what does the smaller variability in the direction orthogonal to it (distance from the origin) mean?

Each element $x_{ij}$ in $X$ is approximated by the inner product of a candidate vector and a question vector, i.e. $x_{ij} \approx w_i \cdot h_j$. The higher this inner product, the higher the resulting answer value (higher agreement). If we focus on a single question, its radial distance can therefore be interpreted as roughly proportional to the share of candidates who agree with the statement.
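A tiny numeric sketch of this argument, with invented candidate and question vectors: two questions point in the same direction but sit at different distances from the origin, and the reconstructed answers of every candidate scale with that distance.

```python
import numpy as np

# Hypothetical candidate embeddings (rows of W), all non-negative
W_toy = np.array([[0.2, 0.8],
                  [0.5, 0.5],
                  [0.9, 0.1]])

# Two hypothetical question embeddings with the same direction, different radius
direction = np.array([0.6, 0.8])      # unit-length direction
h_near = 0.3 * direction              # close to the origin
h_far = 1.0 * direction               # far from the origin

# Reconstructed answers are inner products w_i . h_j
pred_near = W_toy @ h_near
pred_far = W_toy @ h_far

print("near:", pred_near.round(3), "mean:", pred_near.mean().round(3))
print("far: ", pred_far.round(3), "mean:", pred_far.mean().round(3))
```

Since the candidate vectors are non-negative, moving a question further out along the same direction raises the reconstructed agreement for every candidate at once.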

To support the hypothesis, let's select two questions, one whose embedding is close to the origin and one whose embedding is far away from it, and look at the histograms they produce.

In [None]:
# hide_input
question_numbers = [11, 19]

alt.data_transformers.disable_max_rows()

# Filter on those 2 questions
idxs = df_results_melt["question"].isin(df_questions.iloc[question_numbers]["question"])
df_results_melt_filter = df_results_melt[idxs].copy()
# A quick hack to get a line break in the question for the plot
# (cast to str first so the new label is not rejected as an unknown category)
df_results_melt_filter["question"] = df_results_melt_filter["question"].astype(str).str.replace("umsvifum ", "umsvifum\n")
df_results_melt_filter = df_results_melt_filter.rename(columns={"question": "Question"})

chart = alt.Chart(df_results_melt_filter).mark_bar().encode(
    x=alt.X('Answer Value:Q', bin=alt.Bin(extent=[0, 100], step=10), scale=alt.Scale(domain=(0, 100))),
    y=alt.Y('count()'),
    color='party',
    facet="Question",
    tooltip=['party', alt.Tooltip('count()', title='count')]
).interactive().configure(lineBreak="\n").properties(width=300)

chart

# Interpreting Questions by Classifying Targets

Thus far we have not used the "target" information i.e. "party" or "age" in our analysis. Both dimensionality reduction techniques learned a representation of the matrix $X$ without using the target data, except for visualization.

Another way to look at the questions is to train a classifier and see which questions are most important for distinguishing between values of the target variable, "party" or "age". Note that this is not the same as calculating the variance of the questions as we did before. Even though the two are related, a question might have high variance and still not help a classifier separate parties, because each party could be internally divided on the question. This is not very likely, but it could happen in theory to some extent.
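The difference between "high variance" and "useful for telling parties apart" can be seen in a made-up example with two parties and two questions. Both questions have the same overall spread, but only one of them differs between the parties. The party labels and answers are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "party": ["A"] * 4 + ["B"] * 4,
        # Every party is internally divided on this question
        "split_within": [0, 100, 0, 100, 0, 100, 0, 100],
        # The parties disagree with each other on this question
        "split_between": [0, 0, 0, 0, 100, 100, 100, 100],
    }
)
question_cols = ["split_within", "split_between"]

# Overall spread is identical for both questions
overall_std = toy[question_cols].std()

# Spread of the per-party means: zero when the parties look the same on average
party_means = toy.groupby("party")[question_cols].mean()
between_party_std = party_means.std()

print(overall_std)
print(between_party_std)
```

A classifier can only exploit the second question, even though a variance-based ranking would treat the two as equally interesting.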

In [None]:
# hide
from typing import Any, Tuple

def target_from_col_name(_df: pd.DataFrame, target_name: str, val_to_remove: Any = None) -> Tuple[np.ndarray, np.ndarray]:
    col = _df[target_name]
    idxs_to_keep = ~col.isna()
    if val_to_remove is not None:
        idxs_to_keep = idxs_to_keep & ~(col == val_to_remove)
    return col.cat.codes.to_numpy(), idxs_to_keep

Let's look at classifying between the different parties using the questions

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

clf = RandomForestClassifier()

We use the default `RandomForestClassifier`. We are not really interested in maximizing performance; we just want to see if there is a signal present.

Filter the data and check the performance over 5 splits.

In [None]:
y, idxs = target_from_col_name(df_results, "party")
x_filter, y_filter = x[idxs, :], y[idxs]

cv_results = cross_validate(clf, x_filter, y_filter, cv=5)
f"Accuracy = {np.mean(cv_results['test_score']):.3f}, Std. = {np.std(cv_results['test_score']):.4f}"

We see there is a strong signal present, as expected since these questions were designed explicitly to differentiate between political parties! We'd expect a random classifier to have an accuracy of $1/11 \approx 0.09$ (assuming uniform distribution of number of candidates per party for 11 parties) so this is clearly way better.
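What counts as a "random classifier" is a modelling choice, and the sketch below compares three common chance baselines. The party sizes are hypothetical, and `chance_baselines` is a helper written for this illustration. With equally sized parties all three baselines coincide at $1/k$, which is the case the estimate above assumes.

```python
import numpy as np

def chance_baselines(class_counts):
    """Expected accuracy of three naive strategies for the given class sizes."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return {
        "uniform_guess": 1 / len(p),           # pick any class with equal probability
        "stratified_guess": float(np.sum(p**2)),  # guess in proportion to class sizes
        "majority_class": float(p.max()),      # always predict the largest class
    }

# Hypothetical, unbalanced party sizes
print(chance_baselines([30, 25, 20, 10, 5]))

# 11 equally sized parties: every baseline equals 1/11
print(chance_baselines([10] * 11))
```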

Let's take a look at the questions that are most important for classification between parties using `RandomForestClassifier.feature_importances_`. We don't really care about the numbers themselves, just the rank ordering.

In [None]:
# hide
def most_important_questions(_clf, _x: np.ndarray, _y: np.ndarray, _df_questions: pd.DataFrame) -> pd.DataFrame:
    _clf = _clf.fit(_x, _y)
    _df_questions_with_importance = _df_questions.copy()
    _df_questions_with_importance["importance"] = _clf.feature_importances_
    return _df_questions_with_importance.sort_values(by="importance", ascending=False)
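As a sanity check of the approach, the following sketch builds synthetic data where only one feature carries any signal and confirms that the random forest ranks it first. The data, the fixed `random_state` and the smaller `n_estimators` are choices made for this toy example, not settings used in the analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)

# Hypothetical data: 200 "candidates", 5 "questions" scaled to [0, 1]
x_toy = rng.random((200, 5))
# The target depends only on the question in column 2
y_toy = (x_toy[:, 2] > 0.5).astype(int)

clf_toy = RandomForestClassifier(n_estimators=50, random_state=0)

# Cross-validated accuracy, as in the analysis
cv_toy = cross_validate(clf_toy, x_toy, y_toy, cv=5)
print(f"Accuracy = {np.mean(cv_toy['test_score']):.3f}")

# Fit on all data and rank the features by importance
clf_toy.fit(x_toy, y_toy)
ranking = np.argsort(-clf_toy.feature_importances_)
print("Features ranked by importance:", ranking.tolist())
```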

In [None]:
# hide
# `None` means no truncation; passing -1 is deprecated in newer pandas versions
pd.set_option('display.max_colwidth', None)

In [None]:
df_questions_with_importance = most_important_questions(clf, x_filter, y_filter, df_questions)
df_questions_with_importance[["importance", "question"]].head(6)

And the least important

In [None]:
df_questions_with_importance.tail(6)[["importance", "question"]]

We can do the same but using "age" as our target

In [None]:
y, idxs = target_from_col_name(df_results, "age_binned", -10)
x_filter, y_filter = x[idxs, :], y[idxs]

cv_results = cross_validate(clf, x_filter, y_filter, cv=5)
f"Accuracy = {np.mean(cv_results['test_score']):.3f}, Std. = {np.std(cv_results['test_score']):.4f}"

The signal is not nearly as strong, but there is still some present. We'd expect a random classifier to have accuracy of $1 / 7 \approx 0.14$.

The most important questions to classify between age groups are

In [None]:
# hide_input
df_questions_with_importance = most_important_questions(clf, x_filter, y_filter, df_questions)[["importance", "question"]]
df_questions_with_importance.head(6)

# Conclusion

Hopefully you enjoyed this slightly scattered analysis of the data and learned something new, or at least got a chance to play with the interactive plots. There are still a lot of things to explore in the data, so I recommend [fetching](https://github.com/roberttorfason/kosningaprof) it for yourself and playing around with it.

In [None]:
# hide
# Outliers, density of party, difference between metrics, confusion matrix, mean vs median of question