In this script, we continue the winner prediction, but use the education spending subcategories as features instead of the total education spending. Additionally, we evaluate how accurately the model predicts wins for each of the 3 major parties.

We start by creating (or retrieving) a Spark session.

In [11]:
#!/usr/bin/env python3
import json
import re
from unidecode import unidecode # https://pypi.org/project/Unidecode/

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.functions import col, when, regexp_replace, udf
from pyspark.sql.types import StringType, IntegerType, FloatType
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier, OneVsRest
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

spark = SparkSession.builder \
    .appName("MunicipalSpendingAndElectionAnalysis") \
    .config("spark.network.timeout", "600s") \
    .config("spark.executor.heartbeatInterval", "60s") \
    .getOrCreate()

print("Done")

Done


Now, we will load the data. First, we load the election/spending dataset (data/merged_data.csv)

In [12]:
merged_data_path = "data/merged_data.csv"
merged_df = (
    spark.read
         .option("header", True)
         .option("inferSchema", True)
         .option("sep", ",")
         .csv(merged_data_path)
         .drop("_c0")  # Drop unnecessary column if exists
)

# ensure proper types
merged_df = merged_df.withColumn("Year", F.col("Year").cast("integer"))

# ensure col name consistency.
if "Municipality" in merged_df.columns:
    merged_df = merged_df.withColumnRenamed("Municipality", "Municipality_Name")
if "Municipality_Lowercase" not in merged_df.columns:
    merged_df = merged_df.withColumn(
        "Municipality_Lowercase",
        F.regexp_replace(F.lower(F.trim(col("Municipality_Name"))), r"\s+", "-")
    )

print("Election data:")
merged_df.show(5, truncate=False)

# we will load the spending data again to get the subcategory-level spending.
spend_raw = spark.read.csv("data/Bildungsausgaben_Gemeinden_Oberösterreich_data_2007_bis_2019.csv",
                           header=True, sep=",", inferSchema=True)

# rename cols for clarity ("Year" already has the desired name)
spend_raw = spend_raw.withColumnRenamed("Gemeinde", "Municipality")\
                     .withColumnRenamed("Abschnitt", "Subcategory")\
                     .withColumnRenamed("Betrag in Euro", "Spending")

# Add a lowercase municipality column for joining later
spend_raw = spend_raw.withColumn(
    "Municipality_Lowercase",
    F.regexp_replace(F.lower(F.trim(col("Municipality"))), r"\s+", "-")
)

spend_raw = spend_raw.withColumn("Year", col("Year").cast("int"))\
                     .withColumn("Spending", col("Spending").cast("float"))

# pivot the spending data so that each subcat. becomes its own col.
pivot_spend = spend_raw.groupBy("Municipality_Lowercase", "Year") \
                       .pivot("Subcategory") \
                       .agg(F.sum("Spending"))
pivot_spend = pivot_spend.na.fill(0)

# transliterate special chars, replace non-alphanumerics, add SubCat_ prefix.
def clean_column_name(col_name):
    """
    Turn a raw subcategory name into a safe column name:
      1. transliterate special characters to ASCII
      2. replace each run of non-alphanumeric characters with an underscore
      3. add the SubCat_ prefix

    Side note:
        unidecode replaces special chars with similar ASCII chars, so Förderung -> Forderung.
        While this changes the word's meaning, the context helps avoid misunderstandings.

    :param col_name: column name to clean
    :return: new column name
    """
    col_name = unidecode(col_name)  # special char -> ascii
    col_name = re.sub(r"[^a-zA-Z0-9]+", "_", col_name.strip())  # not alphanum -> "_"
    return f"SubCat_{col_name}"  # add SubCat_

# apply cleaning to subcategory columns
for old_col in pivot_spend.columns:
    if old_col not in ["Municipality_Lowercase", "Year"]:
        new_col = clean_column_name(old_col)
        pivot_spend = pivot_spend.withColumnRenamed(old_col, new_col)

print("Subcategory spending:")
pivot_spend.show(5, truncate=False)

Election data:
+----------------------+----+---------------+-----------------+---------------+------------------+----------------+---------------+-------------+-------------------+---------------------+-----------------+-----------------------+
|Municipality_Lowercase|Year|Municipality_ID|Municipality_Name|Wahlberechtigte|abgegebene_Stimmen|gueltige_Stimmen|Wahlbeteiligung|Winning_Party|Spending_Summe     |Education_Spending_PC|Total_Spending_PC|Edu_Spending_Percentage|
+----------------------+----+---------------+-----------------+---------------+------------------+----------------+---------------+-------------+-------------------+---------------------+-----------------+-----------------------+
|linz                  |2008|40101          |Linz             |142125         |96209             |94496           |67.69          |SPO          |1.203534975E8      |319.61816451293      |3378.0490982967  |9.461619864379408      |
|steyr                 |2008|40201          |Steyr            |28

Now we will merge the election results dataset (merged_df) with the spending dataset (pivot_spend). After that, we will clean and prepare the merged dataset for modeling.
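
As a rough illustration of the per-capita conversion performed during this preparation step (`SubCat_*_PC = subcategory spending / SubCat_Summe * Education_Spending_PC`), the following plain-Python sketch applies the same formula to one made-up municipality-year. The figures and the helper name `subcat_per_capita` are hypothetical and only serve to show the arithmetic, not the Spark implementation.

```python
# Hypothetical figures for one municipality-year (not taken from the dataset).
subcat_spending = {
    "SubCat_Allgemeinbildender_Unterricht": 600_000.0,
    "SubCat_Vorschulische_Erziehung": 300_000.0,
    "SubCat_Erwachsenenbildung": 100_000.0,
}
subcat_summe = sum(subcat_spending.values())  # plays the role of SubCat_Summe
education_spending_pc = 320.0                 # plays the role of Education_Spending_PC


def subcat_per_capita(amount, total, edu_pc):
    """Scale the per-capita education spending by the subcategory's share of the total."""
    if total == 0:  # same guard as the F.when(... == 0, 0) branch
        return 0.0
    return amount / total * edu_pc


per_capita = {
    name + "_PC": subcat_per_capita(amount, subcat_summe, education_spending_pc)
    for name, amount in subcat_spending.items()
}

for name, value in per_capita.items():
    print(f"{name}: {value:.2f}")
```

Because each subcategory receives its share of `Education_Spending_PC`, the per-capita values of all subcategories add up to `Education_Spending_PC` again whenever `SubCat_Summe` is nonzero.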

In [13]:
merged_df = merged_df.join(pivot_spend, on=["Municipality_Lowercase", "Year"], how="inner")
merged_df = merged_df.na.drop(subset=["Winning_Party"]) # if winning party is missing, drop

# fill NA with 0 for numeric columns
# note: the type filter only matches IntegerType and FloatType columns;
#  columns inferred as other numeric types (e.g. double) are left untouched
numeric_cols = [f.name for f in merged_df.schema.fields if f.dataType in [IntegerType(), FloatType()]]
merged_df = merged_df.na.fill(0, subset=numeric_cols)

print("Merged election + spending data:")
merged_df.cache() # cache because we want quick access to merged_df
merged_df.select("Wahlbeteiligung").show(5, truncate=False)

# now we will clean Wahlbeteiligung col values

# For our current datasets, this is mostly redundant
#  but we will clean it to make sure the script works with other datasets as well

# we use a single nested regexp_replace instead of doing each of these individually
#  to make it more efficient and reduce spark overhead
merged_df = merged_df.withColumn(
    "Wahlbeteiligung",
    regexp_replace(
        regexp_replace(
            regexp_replace(col("Wahlbeteiligung"), "%", ""),  # remove %
            ",", "."                                          # replace commas with dots
        ),
        "[^\\d.]", ""                                        # remove non-numeric chars except dots
    ).cast("float")                                          # convert to float
)

# Ensure the existing numeric feature columns are available and cast to float
for col_name in ["Wahlbeteiligung", "Education_Spending_PC", "Edu_Spending_Percentage"]:
    merged_df = merged_df.withColumn(col_name, F.col(col_name).cast("float"))

# identify subcategory columns (columns that start with "SubCat_")
# we exclude SubCat_Summe because it is just the sum of the subcategories and
#  including it would thus distort the results
# (the loop variable is named c so that it does not shadow pyspark's col function)
subcat_cols = [c for c in merged_df.columns if c.startswith("SubCat_") and c != "SubCat_Summe"]

# for each subcategory, we compute its per capita value:
# SubCat_*_PC = (absolute subcat spending / SubCat_Summe) * Education_Spending_PC
for c in subcat_cols:
    new_col_name = c + "_PC"
    merged_df = merged_df.withColumn(
        new_col_name,
        F.when(F.col("SubCat_Summe") == 0, 0)
         .otherwise((F.col(c) / F.col("SubCat_Summe")) * F.col("Education_Spending_PC"))
    )
# drop absolute subcategory spendings
merged_df = merged_df.drop(*subcat_cols)

# new per capita subcat cols
subcat_pc_cols = [c + "_PC" for c in subcat_cols]
# add following to feature cols, we will use them in model training
feature_cols = subcat_pc_cols  + ["Wahlbeteiligung", "Education_Spending_PC", "Edu_Spending_Percentage"]

print("Feature columns used for modeling:", feature_cols)

# Ensure that all feature columns are of type float
for f in feature_cols:
    merged_df = merged_df.withColumn(f, col(f).cast("float"))

merged_df.cache()
print("Merged data (post cleaning):")
merged_df.show(5, truncate=False)

Merged election + spending data:
+---------------+
|Wahlbeteiligung|
+---------------+
|82.12          |
|68.22          |
|67.95          |
|67.12          |
|62.09          |
+---------------+
only showing top 5 rows

Feature columns used for modeling: ['SubCat_Allgemeinbildender_Unterricht_PC', 'SubCat_Ausserschulische_Jugenerziehung_PC', 'SubCat_Berufsb_Unterricht_PC', 'SubCat_Erwachsenenbildung_PC', 'SubCat_Forschung_und_Wissenschaft_PC', 'SubCat_Forderung_des_Unterrichtes_PC', 'SubCat_Gesonderte_Verwaltung_PC', 'SubCat_Sport_und_ausserschulische_Leibeserziehung_PC', 'SubCat_Vorschulische_Erziehung_PC', 'Wahlbeteiligung', 'Education_Spending_PC', 'Edu_Spending_Percentage']
Merged data (post cleaning):
+----------------------+----+---------------+-----------------+---------------+------------------+----------------+---------------+-------------+------------------+---------------------+-----------------+-----------------------+------------+---------------------------------------+---

Now that we have the dataframe, we will build the spark pipeline and train the random forest model.
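
To make the label indices in the output easier to read: `StringIndexer` by default orders labels by descending frequency (`stringOrderType="frequencyDesc"`), so the most frequent winning party receives index 0.0. The sketch below mimics that ordering in plain Python on a made-up list of winners. It is an illustration of the ordering rule only, not Spark's implementation, and the alphabetical tie-breaking shown here is an assumption about recent Spark versions.

```python
from collections import Counter

# Hypothetical sample of winning parties (not the real data).
winners = ["OVP", "SPO", "OVP", "FPO", "OVP", "SPO"]

counts = Counter(winners)
# Most frequent label first; ties are broken alphabetically (assumed behavior).
ordered = sorted(counts, key=lambda party: (-counts[party], party))
label_of = {party: float(index) for index, party in enumerate(ordered)}

print(label_of)
```

With this sample, `OVP` maps to 0.0, `SPO` to 1.0 and `FPO` to 2.0.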

In [None]:
# we convert the winning party values into numeric labels (0.0, 1.0 etc.)
label_indexer = StringIndexer(inputCol="Winning_Party", outputCol="label", handleInvalid="skip")

# we will assemble the feature cols into a single feature vector
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features", handleInvalid="skip")

# we define the RF classifier and then create the pipeline
rf_classifier = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=30, maxDepth=5, seed=1)
pipeline = Pipeline(stages=[label_indexer, assembler, rf_classifier])

# we split the data, train the model
train_df, test_df = merged_df.randomSplit([0.8, 0.2], seed=1)
model = pipeline.fit(train_df)

# we make predictions on the test_set
predictions = model.transform(test_df)

# for debug, display winning party - label index pairs.
party_label_list = [(party, str(index)) for index, party in enumerate(model.stages[0].labels)] # model.stages[0].labels = label names
party_label_df = spark.createDataFrame(party_label_list, ["Winning_Party", "Label"])
party_label_df.show(truncate=False)

print("Predictions:")
predictions.select("Municipality_Name", "Year", "Winning_Party", "prediction").show(5, truncate=False)

# Now we evaluate model accuracy and get the feature importances.
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
print(f"\nTest Accuracy = {evaluator.evaluate(predictions) * 100:.2f}%\n")

importances = model.stages[-1].featureImportances
for idx, feature_name in enumerate(assembler.getInputCols()):
    print(f"{feature_name}: {importances[idx]:.4f}")

**Discussion:**

Our model predicted the winning party with ~71.6% accuracy. We consider this a good result, given how little data on previous elections is available.
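
One way to put this accuracy into context is a majority-class baseline, i.e. the accuracy of always predicting the most frequent winner. The sketch below uses the total win counts that appear in the turnout table near the end of this notebook. Those counts come from `data/merged_data.csv` before the join and the train/test split, so the class balance in the actual test set may differ. Treat the result as an approximate reference point only.

```python
# Total wins per party as reported in the turnout table of this notebook.
total_wins = {"OVP": 911, "SPO": 242, "FPO": 87}

n_elections = sum(total_wins.values())
majority_party = max(total_wins, key=total_wins.get)
baseline_accuracy = total_wins[majority_party] / n_elections

print(f"Always predicting {majority_party}: {baseline_accuracy * 100:.2f}% accuracy")
```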

Having analyzed the feature importances, we can deduce that `SubCat_Erwachsenenbildung_PC` has the highest importance in predicting the winning party (11.85%). `Wahlbeteiligung` and `SubCat_Gesonderte_Verwaltung_PC` are also relatively important.

Next, we analyze how accurately the model predicts wins for each specific party, along with the feature importances per party. For this, we use one-vs-rest classification. While this could also be done manually (with a binary label indicating whether the winning party is the specified party or not), we use Spark's built-in `OneVsRest` estimator, which trains one binary classifier per party and can train them in parallel via its `parallelism` parameter.

Docs: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.OneVsRest.html
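
The idea behind one-vs-rest can be sketched in plain Python: each party gets its own binary target (party vs. all others), and the final prediction is the party whose binary classifier is most confident. The labels and confidence scores below are made up for illustration, and the helper name `binarize` is hypothetical.

```python
parties = ["OVP", "SPO", "FPO"]  # label index = position in this list
labels = [0, 1, 0, 2, 1]         # hypothetical indexed labels for five rows


def binarize(labels, positive_index):
    """Return 1 where the row belongs to the chosen party, 0 for every other party."""
    return [1 if label == positive_index else 0 for label in labels]


# One binary target vector per party, as used to train each binary classifier.
binary_targets = {party: binarize(labels, i) for i, party in enumerate(parties)}

# Hypothetical confidence scores of the three binary classifiers for a single row.
scores = {"OVP": 0.55, "SPO": 0.70, "FPO": 0.10}
predicted_party = max(scores, key=scores.get)  # most confident classifier wins

print(binary_targets["SPO"], predicted_party)
```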

In [5]:
# index the labels into numeric labels
label_indexer = StringIndexer(
    inputCol="Winning_Party",
    outputCol="label",
    handleInvalid="skip"
)
label_indexer_model = label_indexer.fit(merged_df) # fit on existing data
merged_df = label_indexer_model.transform(merged_df) # apply the mapping to the dataset
parties = label_indexer_model.labels  # store the winning party labels

# assemble feature vector
assembler_ovr = VectorAssembler(
    inputCols=feature_cols,
    outputCol="features_ovr",
    handleInvalid="skip"
)
merged_df = assembler_ovr.transform(merged_df) # transform to include feature vector

# define random forest classifier
# this is the base classifier for the OvR. All one-vs-rest classifiers will use random forest.
base_classifier = RandomForestClassifier(
    labelCol="label",
    featuresCol="features_ovr",
    numTrees=30,
    maxDepth=5,
    seed=1
)

# define OVR classifier
# it will train a binary classifier (party X or not) for each party.
ovr = OneVsRest(classifier=base_classifier, labelCol="label", featuresCol="features_ovr")

# split data, fit OvR, make predictions, test overall accuracy.
# note: seed=42 here, so this split differs from the seed=1 split used for the plain random forest
train_df, test_df = merged_df.randomSplit([0.8, 0.2], seed=42)
ovr_model = ovr.fit(train_df)
predictions = ovr_model.transform(test_df)
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
print(f"\nOne-Vs-Rest Test Accuracy = {evaluator.evaluate(predictions) * 100:.2f}%\n")

# since pyspark's OneVsRest does not directly provide AUC or feature importances,
# we will manually compute them for each binary classifier.

# parties from labelindexer
for i, party in enumerate(parties):
    print(f"\n=== One-vs-Rest for: {party} vs. ALL ===")

    # get the corresponding binary model for the party
    binary_model = ovr_model.models[i]

    # create a binary test set for the current party (1 if current party, 0 if not)
    test_binary = test_df.withColumn("binary_label", when(col("label") == i, 1).otherwise(0))

    # make predictions for AUC evaluation and then compute AUC (area under ROC curve)
    binary_preds = binary_model.transform(test_binary)
    evaluator_bin = BinaryClassificationEvaluator(labelCol="binary_label", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
    print(f"Area Under ROC for {party} vs. All: {evaluator_bin.evaluate(binary_preds):.3f}")

    # extract feature importances
    importances = binary_model.featureImportances
    for j, feature_name in enumerate(feature_cols):
        print(f"  {feature_name}: {importances[j]:.4f}")

print("\nDone\n")


One-Vs-Rest Test Accuracy = 78.04%


=== One-vs-Rest for: OVP vs. ALL ===
Area Under ROC for OVP vs. All: 0.772
  SubCat_Allgemeinbildender_Unterricht_PC: 0.0816
  SubCat_Ausserschulische_Jugenerziehung_PC: 0.2043
  SubCat_Berufsb_Unterricht_PC: 0.0566
  SubCat_Erwachsenenbildung_PC: 0.1429
  SubCat_Forschung_und_Wissenschaft_PC: 0.0951
  SubCat_Forderung_des_Unterrichtes_PC: 0.0452
  SubCat_Gesonderte_Verwaltung_PC: 0.0890
  SubCat_Sport_und_ausserschulische_Leibeserziehung_PC: 0.0564
  SubCat_Vorschulische_Erziehung_PC: 0.0461
  Wahlbeteiligung: 0.0851
  Education_Spending_PC: 0.0414
  Edu_Spending_Percentage: 0.0563

=== One-vs-Rest for: SPO vs. ALL ===
Area Under ROC for SPO vs. All: 0.839
  SubCat_Allgemeinbildender_Unterricht_PC: 0.0767
  SubCat_Ausserschulische_Jugenerziehung_PC: 0.1500
  SubCat_Berufsb_Unterricht_PC: 0.0608
  SubCat_Erwachsenenbildung_PC: 0.1249
  SubCat_Forschung_und_Wissenschaft_PC: 0.1016
  SubCat_Forderung_des_Unterrichtes_PC: 0.0349
  SubCat_Gesonderte_Ve

**Discussion:**

The One-Vs-Rest model achieved an overall test accuracy of 78.04%, which is relatively strong. However, the Area Under ROC (AUC) scores vary significantly between parties, indicating that the model predicts some parties better than others.

SPÖ (AUC = 0.839) and ÖVP (AUC = 0.772) are well-predicted.

FPÖ (AUC = 0.633) is much weaker, closer to random guessing (0.5).

This suggests that the features used (spending categories, turnout, education spending %) do not explain FPÖ victories as well as they do for SPÖ and ÖVP.
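
For intuition on why an AUC of 0.5 corresponds to random guessing: the AUC equals the probability that a randomly chosen positive example (a win for the party) receives a higher score than a randomly chosen negative example. The sketch below computes it that way on made-up scores. It is an equivalent interpretation of the metric, not the way `BinaryClassificationEvaluator` computes it internally, and the function name `auc_from_scores` is hypothetical.

```python
def auc_from_scores(binary_labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(binary_labels, scores) if y == 1]
    negatives = [s for y, s in zip(binary_labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


y = [1, 0, 1, 0, 0]                           # hypothetical binary labels
perfect_scores = [0.9, 0.2, 0.7, 0.4, 0.1]    # every win outranks every non-win
mixed_scores = [0.9, 0.8, 0.3, 0.4, 0.1]      # some wins rank below non-wins
constant_scores = [0.5, 0.5, 0.5, 0.5, 0.5]   # uninformative classifier

print(auc_from_scores(y, perfect_scores))
print(auc_from_scores(y, mixed_scores))
print(auc_from_scores(y, constant_scores))
```

A classifier that cannot separate wins from non-wins ends up at 0.5, while a perfect ranking reaches 1.0.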


Interpretations:

**ÖVP vs. ALL (AUC = 0.772)**
- The most influential features are extracurricular youth education (20.43%) and adult education (14.29%)

**SPÖ vs. ALL (AUC = 0.839)**
- Extracurricular youth education (15.00%) and `Gesonderte_Verwaltung`, i.e. administrative costs (14.70%), are the strongest factors. Adult education (12.49%) is also important.

Both of these parties are known for their stance supporting public infrastructure (including education) spending, so these results are expected.

Voter turnout plays a slightly smaller role for SPÖ (8.10%) than for ÖVP (8.51%). Because feature importance carries no sign, this only shows that turnout matters for both parties; it does not tell us whether they benefit from high or from low turnout.

**FPÖ vs. ALL (AUC = 0.633)**
- The low AUC suggests that spending and voter turnout are not very good predictors of FPÖ wins.
- General education (13.63%) and the ratio of education spending to total spending (13.19%) are more relevant here, which is relatively unexpected given FPÖ's right-wing, anti-establishment stance.
- Turnout (10.78%) is more influential than for ÖVP/SPÖ. This might indicate that FPÖ benefits from either low or high turnouts.
- The low AUC score is expected for FPÖ, since the party benefits from other external factors not captured in this dataset, e.g. anti-establishment sentiment or sociocultural issues.

Although the analysis below is unrelated to our research topic, we will analyze the relationship between voter turnout and party wins out of interest.

In [22]:
# load the data
merged_data = spark.read.csv("data/merged_data.csv", header=True, inferSchema=True)
merged_data = merged_data.withColumn("Wahlbeteiligung", col("Wahlbeteiligung").cast("float"))

# compute mean and stdev
turnout_stats = merged_data.select(
    F.mean("Wahlbeteiligung").alias("mean_turnout"),
    F.stddev("Wahlbeteiligung").alias("stddev_turnout")
).collect()[0]

mean_turnout = turnout_stats["mean_turnout"]
stddev_turnout = turnout_stats["stddev_turnout"]

print(f"Mean Turnout: {mean_turnout:.2f}%, Standard Deviation: {stddev_turnout:.2f}%")

# compute deviation from mean
merged_data = merged_data.withColumn("Turnout_Deviation", (col("Wahlbeteiligung") - mean_turnout).cast("float"))

# assign turnout categories based on the standard deviation (σ) of turnout
# |deviation| <= 1σ: near mean
# 1σ < |deviation| <= 2σ: moderate high/low (depending on the direction of the deviation from the mean)
# |deviation| > 2σ: extreme high/low

merged_data = merged_data.withColumn(
    "Turnout_Category",
    when(F.abs(col("Turnout_Deviation")) <= stddev_turnout, "Near Mean")
    .when((col("Turnout_Deviation") > stddev_turnout) & (col("Turnout_Deviation") <= 2 * stddev_turnout), "Moderate High")
    .when((col("Turnout_Deviation") < -stddev_turnout) & (col("Turnout_Deviation") >= -2 * stddev_turnout), "Moderate Low")
    .when(col("Turnout_Deviation") > 2 * stddev_turnout, "Extreme High")
    .otherwise("Extreme Low")
)

# count party wins across turnout categories
party_wins = merged_data.groupBy("Winning_Party", "Turnout_Category").count()

# calculate total wins per party and calculate turnout category wins / total wins
total_wins_per_party = party_wins.groupBy("Winning_Party").agg(F.sum("count").alias("total_wins"))
party_wins = party_wins.join(total_wins_per_party, on="Winning_Party", how="left") \
                       .withColumn("Percentage", F.round((col("count") / col("total_wins")) * 100, 2))

print("Party Win Distribution by Turnout Level:")
party_wins.orderBy("Winning_Party", "Turnout_Category").show()

print("\nDone")


Mean Turnout: 71.77%, Standard Deviation: 6.66%
Party Win Distribution by Turnout Level:
+-------------+----------------+-----+----------+----------+
|Winning_Party|Turnout_Category|count|total_wins|Percentage|
+-------------+----------------+-----+----------+----------+
|          FPO|     Extreme Low|    5|        87|      5.75|
|          FPO|   Moderate High|    3|        87|      3.45|
|          FPO|    Moderate Low|   15|        87|     17.24|
|          FPO|       Near Mean|   64|        87|     73.56|
|          OVP|    Extreme High|   14|       911|      1.54|
|          OVP|     Extreme Low|   28|       911|      3.07|
|          OVP|   Moderate High|  136|       911|     14.93|
|          OVP|    Moderate Low|  135|       911|     14.82|
|          OVP|       Near Mean|  598|       911|     65.64|
|          SPO|    Extreme High|    1|       242|      0.41|
|          SPO|     Extreme Low|    4|       242|      1.65|
|          SPO|   Moderate High|   43|       242|     17.

**Discussion**

As suspected, the distribution of wins across turnout levels differs between the parties. The percentages below are shares of each party's own wins.
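
For reference, the category boundaries implied by the reported mean (71.77%) and standard deviation (6.66%) can be worked out with a small plain-Python version of the same rules. This sketch uses the rounded values printed above, so rows very close to a boundary could be classified differently than in the Spark computation. The function name `turnout_category` is hypothetical.

```python
MEAN_TURNOUT = 71.77  # reported mean turnout, in percent (rounded)
STD_TURNOUT = 6.66    # reported standard deviation, in percent (rounded)


def turnout_category(turnout, mean=MEAN_TURNOUT, std=STD_TURNOUT):
    """Assign the same five categories as the when/otherwise chain in the cell above."""
    deviation = turnout - mean
    if abs(deviation) <= std:
        return "Near Mean"
    if deviation > 2 * std:
        return "Extreme High"
    if deviation > std:
        return "Moderate High"
    if deviation >= -2 * std:
        return "Moderate Low"
    return "Extreme Low"


print(f"Near Mean:     {MEAN_TURNOUT - STD_TURNOUT:.2f}% to {MEAN_TURNOUT + STD_TURNOUT:.2f}%")
print(f"Moderate band: down to {MEAN_TURNOUT - 2 * STD_TURNOUT:.2f}%, up to {MEAN_TURNOUT + 2 * STD_TURNOUT:.2f}%")
for example in (82.12, 67.95, 62.09):
    print(example, "->", turnout_category(example))
```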

**ÖVP (Austrian People's Party)**

ÖVP, a center-right, traditionally conservative party, records most of its wins in Near Mean turnout areas (65.64%), while its shares in Moderate High (14.93%) and Moderate Low (14.82%) turnout areas are nearly identical.

Interestingly, ÖVP does not perform significantly better in extreme turnout scenarios, with only 1.54% of wins in Extreme High turnout and 3.07% in Extreme Low turnout.

This suggests that ÖVP's voter base is relatively stable across different turnout levels, reinforcing the idea that the party appeals to an established, consistent voter base, irrespective of voter turnout.

**SPÖ (Social Democratic Party of Austria)**

SPÖ, a center-left party, also sees most of its victories in Near Mean turnout areas (72.73%), but with a relatively strong presence in Moderate High turnout areas (17.77%).

SPÖ underperforms in Extreme High turnout (0.41%) and Extreme Low turnout (1.65%), indicating that it does not significantly benefit from electoral volatility.

The party's strong relative performance in Moderate High turnout areas suggests that higher turnout slightly favors SPÖ, which aligns with expectations that left-leaning parties benefit when voter mobilization efforts succeed.

However, SPÖ's lower success in Moderate Low (7.44%) and Extreme Low turnout areas (1.65%) suggests that it struggles when voter participation drops significantly.

**FPÖ (Freedom Party of Austria)**

FPÖ, a right-wing populist and nationalist party, shows an overwhelming concentration of wins in Near Mean turnout areas (73.56%), but its second-highest win category is Moderate Low turnout (17.24%). Its share of wins in Extreme Low turnout areas (5.75%) is also higher than that of the other parties. Its share in Moderate High turnout areas (3.45%) is small, and it has no wins in Extreme High turnout areas.

Unlike ÖVP, which remains relatively stable across turnout conditions, FPÖ shows a clear bias toward lower turnout conditions, reinforcing the idea that it benefits more from disengaged voters than from high-turnout mobilization efforts.
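
When reading these shares, it helps to know how many observations would fall into each category regardless of party. As a rough reference, the sketch below computes the shares expected if turnout were normally distributed. The notebook does not check this assumption, so the numbers are only an orientation, not a result derived from the data.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


near_mean = normal_cdf(1) - normal_cdf(-1)      # within 1 standard deviation
moderate_one_side = normal_cdf(2) - normal_cdf(1)  # between 1 and 2, on one side
extreme_one_side = 1.0 - normal_cdf(2)          # beyond 2, on one side

print(f"Near Mean:              {near_mean * 100:.2f}%")
print(f"Moderate High (or Low): {moderate_one_side * 100:.2f}%")
print(f"Extreme High (or Low):  {extreme_one_side * 100:.2f}%")
```

Under that assumption, roughly two thirds of all observations land in the Near Mean category, so a large Near Mean share for every party is expected. The differences between parties show up mainly in how the remaining wins are split between the high and low categories.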

In [7]:
spark.stop()