# Predicting Election Turnout in Poland (2030 Perspective) - End-to-End Data Science Project + Visualization

## About the Project
The goal of this project is to analyze historical election data and demographic-economic indicators in Poland (2000-2025) to predict voter turnout for the 2030 presidential elections at the county (*powiat*) level.

My objective was to build a model that not only describes the past but also generalizes to future elections. The project addresses complex challenges such as administrative boundary changes over the years and gaps in historical data.

## Data Sources & Licenses
The project relies on publicly available government data, processed in accordance with "Open Data" principles.

* **Election Data (2000-2025):** [National Electoral Commission (PKW)](https://pkw.gov.pl/)
* **Demographic & Economic Indicators:** [Statistics Poland - Local Data Bank (GUS)](https://bdl.stat.gov.pl/)
    * Variables include: GDP per capita, unemployment rate, average gross salary, population density, urbanization rate, demographic dependency ratio.
* **Geospatial Data:** [Head Office of Geodesy and Cartography (GUGiK)](https://www.geoportal.gov.pl/)

**Data License:** The data used in this project is available under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license, allowing for free use (including commercial) provided the source is credited.

## Project Structure (Notebooks)

The analysis is divided into 4 key stages, represented by sequential Jupyter Notebooks:

### 1. `01_ETL_process.ipynb` (Extract, Transform, Load)
* **Goal:** Aggregation and cleaning of raw data.
* **Key Challenges:** Handling administrative changes in Poland over the last 25 years. I implemented logic to map old county boundaries (e.g., the division of the Warsaw county, changes in Wałbrzych or Tychy) to the current administrative state, ensuring time-series consistency.
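
A minimal sketch of the boundary-harmonization idea, not the notebook's actual code. It assumes a simple many-to-one crosswalk from historical county codes to current ones and covers only the merge and rename cases. A county that was split would need extra apportioning weights or finer-grained data. The codes, column names (`county_code`, `eligible`, `ballots`) and the helper `harmonize_counties` are hypothetical.

```python
import pandas as pd

# Hypothetical crosswalk: historical county code -> current county code.
# These codes are made up for illustration; they are not real TERYT identifiers.
CODE_CROSSWALK = {
    "OLD_A": "NEW_1",  # merged into NEW_1
    "OLD_B": "NEW_1",  # merged into NEW_1
    "NEW_2": "NEW_2",  # boundaries unchanged
}


def harmonize_counties(votes: pd.DataFrame) -> pd.DataFrame:
    """Re-aggregate raw vote counts onto current county codes."""
    out = votes.copy()
    out["county_code"] = out["county_code"].map(CODE_CROSSWALK)
    if out["county_code"].isna().any():
        raise ValueError("county code missing from crosswalk")
    # Sum raw counts first, then recompute turnout. Averaging percentages
    # would weight small and large counties equally.
    agg = out.groupby(["year", "county_code"], as_index=False)[
        ["eligible", "ballots"]
    ].sum()
    agg["turnout_pct"] = 100 * agg["ballots"] / agg["eligible"]
    return agg


# Toy input with invented numbers.
raw = pd.DataFrame({
    "year": [2000, 2000, 2000],
    "county_code": ["OLD_A", "OLD_B", "NEW_2"],
    "eligible": [1000, 3000, 2000],
    "ballots": [500, 2100, 1200],
})
harmonized = harmonize_counties(raw)
print(harmonized)
```

Aggregating counts before computing the rate keeps turnout for a merged county consistent with what the combined electorate actually did.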

### 2. `02_feature_engineering_and_visualization.ipynb`
* **Goal:** Feature engineering and data imputation.
* **Methodology:**
    * Utilized **ARIMA** models and linear extrapolation to fill missing historical data (e.g., for 1999) and to forecast feature values up to the year 2030.
    * Created "Delta" variables (change in indicators over time) to allow the model to learn the dynamics of change.
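
The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation: it shows only the linear-extrapolation variant (the ARIMA part is omitted), and the helpers `extrapolate_linear` and `add_delta`, the column names, and the numbers are hypothetical.

```python
import numpy as np
import pandas as pd


def extrapolate_linear(years, values, target_year):
    """Fit a straight line to (years, values) and evaluate it at target_year."""
    slope, intercept = np.polyfit(years, values, deg=1)
    return float(slope * target_year + intercept)


def add_delta(df, column, lag_years):
    """Add a delta feature: value minus the same county's value lag_years earlier."""
    suffix = "year" if lag_years == 1 else "years"
    # Shift the year forward so a merge lines each row up with its lagged value.
    lagged = df[["county_code", "year", column]].copy()
    lagged["year"] = lagged["year"] + lag_years
    lagged = lagged.rename(columns={column: "_lagged"})
    out = df.merge(lagged, on=["county_code", "year"], how="left")
    out[f"{column}_delta_{lag_years}_{suffix}"] = out[column] - out["_lagged"]
    return out.drop(columns="_lagged")


# Toy panel for one county with invented values.
panel = pd.DataFrame({
    "county_code": ["A", "A", "A"],
    "year": [2000, 2005, 2010],
    "gdp_per_capita": [10.0, 14.0, 18.0],
})
panel = add_delta(panel, "gdp_per_capita", lag_years=5)
gdp_2030 = extrapolate_linear(panel["year"], panel["gdp_per_capita"], 2030)
print(panel)
print(gdp_2030)
```

Joining on a shifted year, rather than using a positional shift, leaves the delta empty when the lagged year is missing instead of silently comparing non-adjacent periods.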

### 3. `03_main_model_training_and_prediction.ipynb`
* **Goal:** Training the predictive model.
* **Model:** Employed the **XGBoost Regressor** due to its high performance with tabular data.
* **Validation:** Used `GridSearchCV` for hyperparameter optimization. The model was trained on data from 2000-2020 and evaluated on held-out data for the year 2025, which lies in the "future" from the training perspective.
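
The time-based hold-out described above can be sketched like this. It is a simplified illustration: the helper `temporal_split`, the column names, and the values are hypothetical, and the model fitting itself is left out.

```python
import pandas as pd


def temporal_split(df, target, train_until, test_year):
    """Split by election year so no test-year rows reach the training set."""
    train = df[df["year"] <= train_until]
    test = df[df["year"] == test_year]
    # Identifier columns and the target are not used as features.
    features = [c for c in df.columns if c not in (target, "year", "county_code")]
    return train[features], train[target], test[features], test[target]


# Toy panel with invented values.
data = pd.DataFrame({
    "county_code": ["A", "A", "A", "B", "B", "B"],
    "year": [2015, 2020, 2025, 2015, 2020, 2025],
    "average_gross_salary": [4000.0, 5200.0, 8000.0, 3800.0, 4900.0, 7500.0],
    "turnout_pct": [48.0, 64.0, 67.0, 45.0, 60.0, 63.0],
})
X_train, y_train, X_test, y_test = temporal_split(
    data, "turnout_pct", train_until=2020, test_year=2025
)
print(X_train.shape, X_test.shape)
```

If the grid search itself should also respect time order, the training rows would need to be sorted by year and a time-aware splitter passed through the `cv` parameter of `GridSearchCV`.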

### 4. `04_folium_visualization.ipynb`
* **Goal:** Presentation of results.
* **Outcome:** An interactive map of Poland (using the Folium library) visualizing the predicted turnout for both election rounds in 2030 for every county.

## Results & Model Success

I consider the results achieved for the test set (year 2025) a significant success, especially given the stochastic nature of human behavior.

**Scores for 2025 (XGBoost):**
* **MAE (Mean Absolute Error):** 5.1619
* **RMSE (Root Mean Squared Error):** 6.1663

A mean absolute error of approximately 5 percentage points when predicting a phenomenon as complex as voter turnout demonstrates the quality of the prepared data and the effectiveness of the chosen model.
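
For reference, the two metrics can be computed as in the sketch below. The helper names and the toy turnout values are made up for illustration; they are not the project's predictions.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors, in the target's units."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))


def rmse(y_true, y_pred):
    """Root mean squared error: like MAE, but penalizes large errors more."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Invented turnout values in percent for three counties.
actual = [50.0, 60.0, 70.0]
predicted = [53.0, 56.0, 70.0]
print(mae(actual, predicted), rmse(actual, predicted))
```

RMSE is never smaller than MAE, and the gap between the two grows when a few counties are predicted much worse than the rest.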

### Insights - Key Factors (Feature Importance)
The feature importance analysis revealed that the model relies heavily on the **dynamics of economic and demographic changes**, rather than just static values. The 5 most important variables were:

1.  **`gdp_per_capita_delta_5_years`**: The change in wealth (GDP per capita) over a 5-year perspective. This suggests that the pace of regional development (or stagnation) predicts voter mobilization more strongly than the level of wealth itself.
2.  **`average_gross_salary`**: The average salary level remains a key indicator of voters' material status, strongly correlating with participation.
3.  **`demographic_dependency_ratio`**: The ratio of the non-working-age population to the working-age population.
4.  **`demographic_dependency_ratio_delta_5_years`**: The shift in demographic structure over time – rapid societal aging in a region significantly impacts turnout.
5.  **`population_70_plus_delta_1_year`**: Short-term growth of the oldest population group (70+). This group exhibits specific, typically highly disciplined voting behaviors.

## Model Limitations & Disclaimer
It is important to note that this model relies exclusively on measurable (quantitative) data. The project **did not account for unmeasurable factors**, such as:
* Real-time influence of media and election campaigns.
* Political scandals and social sentiment during the election week.
* The charisma of specific candidates.

While these factors undoubtedly have a massive, often decisive impact on turnout, they are impossible to reliably capture in a model based on statistical data (GUS/PKW) with a 5-year lead time. Therefore, the model indicates the **structural potential** for turnout resulting from the region's socio-economic condition.

---
Wojciech Kiełbowicz