## Introduction
Elections are critical events that determine policy directions and the allocation of public resources. Governments and citizens alike are interested in understanding which factors contribute to a successful ballot measure (e.g., tax levies, bond measures). With the growth of publicly available election data, we can leverage Machine Learning to predict election outcomes based on historical patterns and feature relationships.
This project aims to develop an end-to-end predictive modeling pipeline that can forecast the pass/fail outcome of election measures based on various features like vote percentages, measure type, election type, amount of tax/bond proposed, and more. The final solution is deployed as an interactive Streamlit app for easy usability by non-technical stakeholders.

## Problem Statement
Can we predict whether a given ballot measure will pass or fail in an election, based on its attributes such as location, tax amount, election type, and vote percentage?

Specifically, given features such as:

- % Yes and % No votes
- Amount of Bond/Tax
- Agency County, Election Type, Type of Tax/Debt, etc.

We aim to predict the Result (Pass/Fail) using machine learning classification models.



## Project Outcome
The project followed a complete ML pipeline:

1. Data Preparation:
   - Cleaned missing values and inconsistent formats (e.g., dollar signs in tax amounts)
   - Encoded categorical variables using one-hot encoding
   - Converted the target (Result (Pass/Fail)) into a binary label

2. Model Training:
   - Used Random Forest Classifier due to its robustness and interpretability
   - Applied train-test split with stratification to ensure balanced class distribution
   - Evaluated model using accuracy and classification report

3. Streamlit App Deployment:
   - Built an interactive frontend using Streamlit
   - Allowed users to input key features through sidebar UI
   - App preprocessed the input, scaled it, and predicted the election result
   - Displayed the prediction and class probabilities (Pass/Fail)



In [16]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

import joblib


The deployed machine learning model accurately predicts election outcomes based on key features, enabling data-driven decision-making through an interactive Streamlit interface.

In [17]:
# Load and Explore Dataset
df = pd.read_csv("/Users/vishwaashwin/Desktop/Sem-6/SI/Predictive Modeling and Forecasting for Election Outcomes,End-to-End Predictive Modeling Pipeline with Streamlit Deployment/election_data.csv") 
df

Unnamed: 0,Agency County,Agency Name,Type of Tax/Debt,Amount of Bond/Tax,Purpose,Measure,% Yes,% No,Result (Pass/Fail),Threshold,Election Year,Election Type,Election Date
0,Los Angeles,Los Angeles County Flood Control District,PLF Debt,Parcel Tax: Enact a rate of $.025/sq. ft. of l...,"Water Supply, Storage, Distribution",W,69.45,30.55,Pass,two-thirds,2018,General,11/6/2018 0:00
1,Multiple,State of California,General Obligation Bond,8877000000,"Water Supply, Storage, Distribution",Prop. 3,49.30,50.70,Fail,Majority,2018,General,11/6/2018 0:00
2,Los Angeles,Culver City,PLF Debt,"Parcel Tax: Impose a $99/single-family parcel,...","Water Supply, Storage, Distribution",CW,74.14,25.86,Pass,Two-thirds,2016,General,11/8/2016 0:00
3,Santa Cruz,Santa Cruz County CFD No 2,PLF Debt,Parcel Tax: Increase parcel tax to $517 and $1...,"Water Supply, Storage, Distribution",N,33.53,66.47,Fail,two-thirds,2015,Local,2/24/2015 0:00
4,Los Angeles,Claremont,Revenue Bond,"$135,000,000","Water Supply, Storage, Distribution",W,72.00,28.00,Pass,two-thirds,2014,General,11/4/2014 0:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5972,Kings,Corcoran,General Obligation Bond,"$2,000,000",Bridges and Highways,C,57.00,43.00,Fail,two-thirds,1986,General,11/4/1986 0:00
5973,Solano,Suisun City,General Obligation Bond,"$4,250,000",Bridges and Highways,B,69.00,31.00,Pass,two-thirds,1986,General,11/4/1986 0:00
5974,El Dorado,Cameron Park Airport District,PLF Debt,"Parcel Tax: increase from $300 to $1,200 per p...",Airport,P,62.62,37.38,Fail,two-thirds,2020,General,11/3/2020 0:00
5975,Kings,Corcoran Hospital District,,Not available,,H,63.29,36.71,,Not available,2001,Local,6/5/2001 0:00


The dataset contains election-related information including vote percentages, tax amounts, and measure details, which are essential for building a predictive model of election outcomes.

In [18]:
df.head()

Unnamed: 0,Agency County,Agency Name,Type of Tax/Debt,Amount of Bond/Tax,Purpose,Measure,% Yes,% No,Result (Pass/Fail),Threshold,Election Year,Election Type,Election Date
0,Los Angeles,Los Angeles County Flood Control District,PLF Debt,Parcel Tax: Enact a rate of $.025/sq. ft. of l...,"Water Supply, Storage, Distribution",W,69.45,30.55,Pass,two-thirds,2018,General,11/6/2018 0:00
1,Multiple,State of California,General Obligation Bond,8877000000,"Water Supply, Storage, Distribution",Prop. 3,49.3,50.7,Fail,Majority,2018,General,11/6/2018 0:00
2,Los Angeles,Culver City,PLF Debt,"Parcel Tax: Impose a $99/single-family parcel,...","Water Supply, Storage, Distribution",CW,74.14,25.86,Pass,Two-thirds,2016,General,11/8/2016 0:00
3,Santa Cruz,Santa Cruz County CFD No 2,PLF Debt,Parcel Tax: Increase parcel tax to $517 and $1...,"Water Supply, Storage, Distribution",N,33.53,66.47,Fail,two-thirds,2015,Local,2/24/2015 0:00
4,Los Angeles,Claremont,Revenue Bond,"$135,000,000","Water Supply, Storage, Distribution",W,72.0,28.0,Pass,two-thirds,2014,General,11/4/2014 0:00


The first few rows of the dataset reveal structured information on election measures, including agency details, vote percentages, tax amounts, and pass/fail results.

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5977 entries, 0 to 5976
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Agency County       5977 non-null   object 
 1   Agency Name         5977 non-null   object 
 2   Type of Tax/Debt    5974 non-null   object 
 3   Amount of Bond/Tax  5973 non-null   object 
 4   Purpose             5975 non-null   object 
 5   Measure             5974 non-null   object 
 6   % Yes               5977 non-null   float64
 7   % No                5977 non-null   float64
 8   Result (Pass/Fail)  5976 non-null   object 
 9   Threshold           5977 non-null   object 
 10  Election Year       5977 non-null   int64  
 11  Election Type       5977 non-null   object 
 12  Election Date       5977 non-null   object 
dtypes: float64(2), int64(1), object(10)
memory usage: 607.2+ KB


The dataset contains multiple categorical and numerical columns with some missing values, and data types need preprocessing (e.g., converting object types and handling nulls) before model training.

## Data Preprocessing

In [20]:
# Clean column names (remove leading/trailing spaces)
df.columns = df.columns.str.strip()

Column names were stripped of leading and trailing whitespace so that later column lookups match exactly.

In [21]:
# Drop rows where target is missing
df = df.dropna(subset=['Result (Pass/Fail)'])

Rows with a missing target value were removed to ensure reliable model training; this leaves 5,976 of the original 5,977 rows.

In [22]:
# Clean 'Amount of Bond/Tax': remove $ and commas, convert to float
df['Amount of Bond/Tax'] = df['Amount of Bond/Tax'].astype(str).str.replace(r'[\$,]', '', regex=True)
df['Amount of Bond/Tax'] = pd.to_numeric(df['Amount of Bond/Tax'], errors='coerce')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Amount of Bond/Tax'] = df['Amount of Bond/Tax'].astype(str).str.replace(r'[\$,]', '', regex=True)


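The 'Amount of Bond/Tax' column was cleaned by removing dollar signs and commas, then converted to numeric format for analysis and modeling. Entries that are not plain numbers, such as parcel tax descriptions or 'Not available', become NaN because of `errors='coerce'`. The warning above is pandas' SettingWithCopyWarning, raised because `df` is the result of `dropna`; appending `.copy()` to that call is the usual way to avoid it.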
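To make the effect of this cleaning step concrete, here is a minimal plain-Python sketch of the same idea. The helper `parse_amount` is hypothetical (it is not part of the notebook) and only approximates what `str.replace` followed by `pd.to_numeric(..., errors='coerce')` does; the sample strings are taken or shortened from the dataset preview.

```python
import re
from typing import Optional


def parse_amount(raw: object) -> Optional[float]:
    """Strip '$' and ',' and return a float, or None if the text is not a plain number.

    Hypothetical helper for illustration; None plays the role of NaN here.
    """
    cleaned = re.sub(r"[\$,]", "", str(raw)).strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


samples = [
    "$135,000,000",                                   # formatted dollar amount
    "8877000000",                                     # already numeric text
    "Not available",                                  # placeholder text
    "Parcel Tax: Impose a $99/single-family parcel",  # free-text description
]
parsed = [parse_amount(s) for s in samples]
print(parsed)
```

Only the first two samples survive as numbers; free-text amounts are lost and later need imputation or a separate parsing rule.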

In [23]:
# Fill missing numeric columns properly (avoid inplace)
df['Amount of Bond/Tax'] = df['Amount of Bond/Tax'].fillna(df['Amount of Bond/Tax'].mean())
df['% Yes'] = pd.to_numeric(df['% Yes'], errors='coerce')
df['% Yes'] = df['% Yes'].fillna(df['% Yes'].mean())
df['% No'] = pd.to_numeric(df['% No'], errors='coerce')
df['% No'] = df['% No'].fillna(df['% No'].mean())


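Missing values in 'Amount of Bond/Tax' were filled with the column mean (about 2.0e+08), which therefore also replaces every amount that was originally a text description. '% Yes' and '% No' have no missing values in this dataset, so their conversion and fill steps act only as a safeguard.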
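The following toy example sketches how mean imputation behaves. The values are made up for illustration, and the `was_imputed` flag is an optional addition that the notebook does not use; it simply records which rows received the fill value.

```python
import pandas as pd

# Toy amounts; None stands in for values that could not be parsed as numbers.
amounts = pd.Series([135_000_000.0, None, 2_000_000.0, None])

# Series.mean() skips NaN, so the fill value is the mean of the observed amounts.
fill_value = amounts.mean()
filled = amounts.fillna(fill_value)

# Optional indicator recording which rows were imputed (not in the original notebook).
was_imputed = amounts.isna()

print(filled.tolist())
print(was_imputed.tolist())
```

Because a few very large bonds can pull the mean upward, the median (`amounts.median()`) may be worth comparing as an alternative fill value.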

In [24]:
# Encode target variable: Pass=1, Fail=0
df['Result'] = df['Result (Pass/Fail)'].str.lower().map({'pass': 1, 'fail': 0})



The target variable 'Result (Pass/Fail)' was successfully encoded into binary format, with 'pass' as 1 and 'fail' as 0, for classification modeling.
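One detail of `Series.map` with a dictionary is that any label missing from the dictionary becomes NaN. The short sketch below shows a way to check for that; the label "Withdrawn" is hypothetical and is used only to trigger the unmapped case.

```python
import pandas as pd

# "Withdrawn" is a hypothetical label, included only to show an unmapped value.
raw = pd.Series(["Pass", "FAIL", "Withdrawn"])

# Lower-casing first makes the mapping insensitive to capitalization.
encoded = raw.str.lower().map({"pass": 1, "fail": 0})

# Labels that did not match either key end up as NaN.
unmapped = raw[encoded.isna()]
print(unmapped.tolist())
```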

In [25]:
# Define X and y
y = df['Result']

The target variable `y` is defined as the binary-encoded election result for model training.

In [26]:
# Drop non-feature columns (target, dates, threshold, election year)
cols_to_drop = ['Result', 'Result (Pass/Fail)', 'Election Date', 'Threshold', 'Election Year']

The specified columns including target, dates, threshold, and election year were identified for removal to isolate only the relevant features for modeling.

In [27]:
X = df.drop(columns=cols_to_drop)

The feature matrix `X` was created by dropping non-predictive columns to prepare data for model training.

In [28]:
# Convert categorical columns to 'category' dtype
categorical_cols = ['Agency County', 'Agency Name', 'Type of Tax/Debt', 'Purpose', 'Measure', 'Election Type']
for col in categorical_cols:
    X[col] = X[col].astype('category')

Categorical columns were converted to 'category' dtype to optimize memory usage and prepare for encoding.

In [29]:
# One-hot encode categorical features
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

Categorical features were one-hot encoded to convert them into numeric format suitable for machine learning algorithms.
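As a small illustration of what `drop_first=True` does, the sketch below encodes a toy frame. The category "Primary" is an assumption added for the example; only "General" and "Local" appear in the preview above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Election Type": ["General", "Local", "Primary", "General"],
        "% Yes": [69.45, 33.53, 55.0, 72.0],
    }
)

# drop_first=True drops the first category in sorted order ("General"),
# so a row with all indicator columns False means "General".
encoded = pd.get_dummies(toy, columns=["Election Type"], drop_first=True)
print(encoded.columns.tolist())
```

With high-cardinality columns such as 'Agency Name' and 'Measure', the same mechanism is what produces the 2,487 feature columns reported below.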

In [30]:
# Check shapes and preview data
print(f"Features shape: {X.shape}")

Features shape: (5976, 2487)


The feature matrix shape was printed to verify the number of samples and encoded feature columns before modeling.

In [31]:
print(f"Target shape: {y.shape}")

Target shape: (5976,)


The target vector shape was displayed to confirm it matches the number of samples in the feature set.

In [32]:
print(X.head())

   Amount of Bond/Tax  % Yes   % No  Agency County_Alpine  \
0        2.005021e+08  69.45  30.55                 False   
1        8.877000e+09  49.30  50.70                 False   
2        2.005021e+08  74.14  25.86                 False   
3        2.005021e+08  33.53  66.47                 False   
4        1.350000e+08  72.00  28.00                 False   

   Agency County_Amador  Agency County_Butte  Agency County_Calaveras  \
0                 False                False                    False   
1                 False                False                    False   
2                 False                False                    False   
3                 False                False                    False   
4                 False                False                    False   

   Agency County_Colusa  Agency County_Contra Costa  Agency County_Del Norte  \
0                 False                       False                    False   
1                 False           

The preview of the first rows in the feature matrix confirms successful encoding and readiness of the data for modeling.

In [33]:
print(y.head())

0    1
1    0
2    1
3    0
4    1
Name: Result, dtype: int64


The first few values of the target variable were displayed to verify correct binary encoding of election outcomes.


## Train-Test Split

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [35]:
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Distribution in y_train:", y_train.value_counts(normalize=True))
print("Distribution in y_test:", y_test.value_counts(normalize=True))

Shape of X_train: (4780, 2487)
Shape of X_test: (1196, 2487)
Distribution in y_train: Result
1    0.650628
0    0.349372
Name: proportion, dtype: float64
Distribution in y_test: Result
1    0.650502
0    0.349498
Name: proportion, dtype: float64
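The near-identical class proportions in the two splits come from `stratify=y`. A minimal sketch on synthetic labels (65 passes and 35 fails, chosen to resemble the ratio above) shows the effect:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic labels roughly matching the 65/35 pass/fail ratio of the dataset.
labels = [1] * 65 + [0] * 35
rows = list(range(len(labels)))

train_rows, test_rows, y_tr, y_te = train_test_split(
    rows, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratification, each split keeps the 65/35 ratio.
print(Counter(y_tr), Counter(y_te))
```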


## Model Training 

In [36]:
scaler = StandardScaler()

A StandardScaler object was initialized to standardize numerical features by removing the mean and scaling to unit variance.

In [37]:
#Fit scaler on training data and transform both train and test
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

The scaler was fitted on the training data and applied to both training and test sets to ensure consistent feature scaling for model training and evaluation.
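To show what "fit on train, transform both" means numerically, this sketch compares `StandardScaler` with the manual formula on toy values. The numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

# The scaler learns the mean and standard deviation from the training data only.
scaler_demo = StandardScaler().fit(train)

# Manual equivalent: subtract the training mean, divide by the training
# (population) standard deviation.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaler_demo.transform(test), manual)
```

Reusing the training statistics for the test set keeps information about the test distribution out of the preprocessing step.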

In [38]:
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

RandomForestClassifier parameters:
Parameter,Value
n_estimators,100
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


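A Random Forest classifier was instantiated with default hyperparameters and trained on the unscaled training data (`X_train`); the scaled arrays from the previous step are not used by this cell. Tree-based models split on feature thresholds and do not require standardized inputs, so inputs passed to this model at prediction time should likewise be left unscaled.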

In [39]:
y_pred = model.predict(X_test)

The trained model generated predictions on the test set to evaluate its performance on unseen data.

In [40]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9498327759197325
              precision    recall  f1-score   support

           0       0.95      0.90      0.93       418
           1       0.95      0.98      0.96       778

    accuracy                           0.95      1196
   macro avg       0.95      0.94      0.94      1196
weighted avg       0.95      0.95      0.95      1196



On the 1,196 held-out measures the model reaches about 95% accuracy. Recall is lower for failed measures (0.90) than for passed ones (0.98), so most of the errors are failed measures predicted as passing.
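For readers unfamiliar with the columns of the classification report, here is a minimal plain-Python sketch of how precision, recall, and F1 are computed for one class. The helper name and the toy labels are hypothetical; they are not the notebook's predictions.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels: 1 = Pass, 0 = Fail.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred_demo = [1, 1, 1, 0, 0, 0, 1, 1]

pass_metrics = precision_recall_f1(y_true_demo, y_pred_demo, positive=1)
fail_metrics = precision_recall_f1(y_true_demo, y_pred_demo, positive=0)
print(pass_metrics, fail_metrics)
```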

## Save the model for deployment

In [41]:
joblib.dump(model, 'election_model.pkl')
joblib.dump(scaler, 'scaler.pkl')
joblib.dump(X.columns, 'features.pkl') 

['features.pkl']

The trained model, scaler, and feature columns are successfully saved using `joblib`, enabling efficient reuse for prediction without retraining.
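Saving the feature columns matters because a single user input, once one-hot encoded, will not have the same columns as the training matrix. The sketch below shows one way an app could realign its input. The helper `align_to_training_columns` and the short column list are hypothetical stand-ins; a real app would load the list from `features.pkl`.

```python
import pandas as pd


def align_to_training_columns(raw_row, feature_columns, categorical_cols):
    """One-hot encode a single input row and match it to the training columns.

    Hypothetical helper: columns unseen at training time are dropped, and
    training columns absent from the input are filled with 0.
    """
    frame = pd.DataFrame([raw_row])
    encoded = pd.get_dummies(frame, columns=categorical_cols)
    return encoded.reindex(columns=feature_columns, fill_value=0)


# Stand-in for the saved column list (the real one has 2,487 entries).
feature_columns = [
    "Amount of Bond/Tax",
    "% Yes",
    "% No",
    "Election Type_Local",
    "Election Type_Primary",
]
row = {
    "Amount of Bond/Tax": 2_000_000.0,
    "% Yes": 57.0,
    "% No": 43.0,
    "Election Type": "Local",
}
aligned = align_to_training_columns(row, feature_columns, ["Election Type"])
print(aligned)
```

The aligned frame can then be passed to the loaded model's `predict` or `predict_proba` in the same column order used for training.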

## Conclusion
This project demonstrates how machine learning can provide actionable insights into election outcomes. With just a few features such as vote percentages, type of tax measure, and election context, the model predicts whether a measure will pass or fail with about 95% accuracy on the held-out test set.

The deployment via Streamlit makes it accessible and usable for analysts, decision-makers, and political strategists. This model can be further improved by:

- Adding demographic or historical voting data
- Using time-series features (e.g., trends from past years)
- Visualizing feature importance and decision paths
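As a starting point for the feature importance idea, the sketch below ranks features of a small Random Forest by `feature_importances_`. It uses synthetic data from `make_classification` and placeholder feature names, so it is an assumption-laden illustration rather than an analysis of the election model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder names; not the election dataset.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, random_state=0
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```

For the trained election model, the same pattern would apply with `model.feature_importances_` and `X.columns`, and the top entries could be drawn with `importances.head(20).plot.barh()`.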