# Project Proposal and Technical Report
## CPSC 322, Fall 2024
### Team: Matt S && Ryan St. Mary

### Project Title: PJames Predicting U.S. House Election Winners

## Introduction
In this project, we aim to predict the winners of U.S. House elections based on polling data and past election results. We have chosen a dataset that merges U.S. House polling data with actual election outcomes for recent years (2018–2022). The classification task is to predict the political party of the winning candidate (e.g., Democrat, Republican) based on attributes from aggregated polls and historical election data.

We will perform several preprocessing steps, carry out exploratory data analysis (EDA), and compare at least three classifiers: a random forest classifier, a kNN classifier, and a Naive Bayes classifier, with a decision tree classifier for baseline comparison. The random forest classifier will be implemented as described in the project instructions. Finally, we will deploy a Flask web app that uses our best-performing model.

Our analysis will include feature selection and hyperparameter tuning. We will explore N, M, and F parameters for random forests and attempt to find a configuration that leads to improved predictive accuracy.
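
As a rough illustration of the pieces these parameters would control, the sketch below assumes N is the number of trees trained on bootstrap samples, M is the number of best-performing trees kept for voting, and F is the number of attributes randomly considered at each split. These meanings are assumptions made for this example, and the helper names (`bootstrap_sample`, `random_attribute_subset`, `majority_vote`) are hypothetical rather than taken from the project instructions.

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement; also return indexes never drawn."""
    n = len(X)
    chosen = [rng.randrange(n) for _ in range(n)]
    chosen_set = set(chosen)
    # Rows that were never drawn can serve as a validation set for one tree.
    left_out = [i for i in range(n) if i not in chosen_set]
    X_boot = [X[i] for i in chosen]
    y_boot = [y[i] for i in chosen]
    return X_boot, y_boot, left_out

def random_attribute_subset(attribute_indexes, F, rng):
    """Pick at most F candidate attributes for one split (assumed meaning of F)."""
    k = min(F, len(attribute_indexes))
    return rng.sample(attribute_indexes, k)

def majority_vote(votes):
    """Return the most common label among the kept trees' predictions."""
    return Counter(votes).most_common(1)[0][0]

# Made-up rows of [c1_pct, c2_pct] for illustration only.
rng = random.Random(0)
X_demo = [[52.0, 44.0], [47.0, 49.0], [60.0, 35.0], [41.0, 55.0]]
y_demo = ["DEMOCRAT", "REPUBLICAN", "DEMOCRAT", "REPUBLICAN"]

X_boot, y_boot, left_out = bootstrap_sample(X_demo, y_demo, rng)
candidates = random_attribute_subset([0, 1], F=1, rng=rng)
print(len(X_boot), left_out, candidates)
print(majority_vote(["DEMOCRAT", "REPUBLICAN", "DEMOCRAT"]))
```

Under these assumptions, tuning would amount to repeating this procedure for several (N, M, F) combinations and comparing the resulting accuracy.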

## Dataset Description
### Source
The dataset is derived from publicly available U.S. House election data and polling data. The election results come from the MIT Election Data and Science Lab (https://electionlab.mit.edu/data), which provides comprehensive historical U.S. election data. The polling data was compiled from publicly available polling sources and merged into a CSV in a previous data collection step. The combined dataset consists of:
- `1976-2022-house.csv` (full historical data)
- `2018-2022-house.csv` (recent years election data)
- `house_polls_merged.csv` (polling data for U.S. House races)

We have merged these files to create `joined_polls_elections.csv`, which combines polling metrics and election results for various state/district/year combinations.

### Data Format
The data is in CSV format. The final merged dataset `joined_polls_elections.csv` includes attributes such as:
- `year`: Election year.
- `state`: State abbreviation.
- `district`: Congressional district number.
- `c1`, `c2`: Top two candidates in a race.
- `c1_party`, `c2_party`: Political parties of these candidates.
- `c1_pct`, `c2_pct`: Polling percentages for the top two candidates.
- `winner`, `winner_party`: Actual election winner and their party.

We will use `winner_party` as the class label to predict. Attributes like `c1_pct`, `c2_pct`, `c1_party`, `c2_party`, and other features from the polls and elections data will serve as input features.

## Implementation / Technical Merit
- **MyDummyClassifier**: A baseline classifier that always predicts the majority class.
- **MyKNeighborsClassifier**: A custom k-Nearest Neighbors classifier.
- **MyNaiveBayesClassifier**: A custom Naive Bayes classifier.
- **MyRandomForestClassifier**: A custom Random Forest classifier implemented using test-driven development, following the specifications provided.

### Potential Impact of Results
These results could be useful for political analysts, campaign strategists, and researchers interested in understanding how polling data relates to actual election outcomes. Stakeholders may include political parties, non-partisan election analysts, the media, and voters.

### Citations
- MIT Election Data and Science Lab: https://electionlab.mit.edu/data
- Polling data sources aggregated from public domain polling results.
- Code and classifiers inspired by course materials and standard Python libraries.
  Any additional code or data taken from external sources will be properly cited in final submission.


## Data Loading and Initial Exploration

In [None]:
# Cell: Import Libraries and Load Initial Data
import pandas as pd
import matplotlib.pyplot as plt

# Load polls and elections data
polls = pd.read_csv('data/house_polls_merged.csv')
elections = pd.read_csv('data/1976-2022-house.csv')

print("Polls Data Columns:", polls.columns.tolist())
print("Elections Data Columns:", elections.columns.tolist())

## Processing Election Data (2018-2022)

In [None]:
# Cell: Load and Prepare Elections Data (2018-2022)
elections_2018_2022 = pd.read_csv('data/2018-2022-house.csv')
elections_2018_2022.reset_index(drop=True, inplace=True)

election_groupby = elections_2018_2022.groupby(['year', 'state', 'district'])
print(f"Total Groups: {len(election_groupby)}")

Check unique stages in the elections data.

In [None]:
print(set(elections_2018_2022['stage']))

### Process Elections Data into Structured DataFrame

In [None]:
election_rows = []
for group_key, group_df in election_groupby:
    if 'totalvotes' not in group_df.columns:
        print(f"Missing 'totalvotes' in group: {group_key}")
        continue
    sorted_group = group_df.sort_values(by='candidatevotes', ascending=False)
    c1 = sorted_group.iloc[0]
    c1_name = c1['candidate']
    c1_party = c1['party']
    c1_votes = c1['candidatevotes']
    if len(sorted_group) > 1:
        c2 = sorted_group.iloc[1]
        c2_name = c2['candidate']
        c2_party = c2['party']
        c2_votes = c2['candidatevotes']
    else:
        c2_name, c2_party, c2_votes = None, None, None
    winner = c1_name
    winner_party = c1_party
    election_rows.append([
        group_key[0], group_key[1], group_key[2],
        c1_name, c2_name, c1_party, c2_party, c1_votes, c2_votes, winner, winner_party
    ])

election_df = pd.DataFrame(
    election_rows,
    columns=[
        'year', 'state', 'district', 'c1', 'c2', 'c1_party', 'c2_party',
        'c1_votes', 'c2_votes', 'winner', 'winner_party'
    ]
)
print("Elections DataFrame Preview:")
print(election_df.head())
election_df.to_csv('data/elections-2018-2022.csv', index=False)
print("Processed elections data saved to 'data/elections-2018-2022.csv'")

## Processing Polls Data

In [None]:
polls = pd.read_csv('data/house_polls_merged.csv')
# Normalize names and states to upper case (the .str accessor also tolerates missing values)
polls['candidate_name'] = polls['candidate_name'].str.upper()
polls['state'] = polls['state'].str.upper()

groupby = polls.groupby(['state', 'seat_number', 'cycle', 'poll_id'])
groups = list(groupby)

poll_df = pd.DataFrame(columns=['year', 'state', 'district', 'c1', 'c2', 'c1_party', 'c2_party', 'c1_pct', 'c2_pct'])
elections_done = []

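for group_key, df in groups:
    race_key = list(group_key[:3])  # [state, seat_number, cycle]
    # Only the first poll encountered for each race is kept.
    if race_key in elections_done:
        continue

    # Leading candidate in this poll
    c1_idx = df['pct'].idxmax()
    c1 = df.loc[c1_idx, 'candidate_name']
    c1_party = df.loc[c1_idx, 'party']
    c1_pct = df.loc[c1_idx, 'pct']

    # Runner-up, if the poll lists a second candidate with a usable percentage
    remaining = df.drop(index=c1_idx)
    try:
        c2_idx = remaining['pct'].idxmax()
        c2 = remaining.loc[c2_idx, 'candidate_name']
        c2_party = remaining.loc[c2_idx, 'party']
        c2_pct = remaining.loc[c2_idx, 'pct']
    except (ValueError, KeyError):
        c2, c2_party, c2_pct = None, None, None

    new_row = [group_key[2], group_key[0], group_key[1], c1, c2, c1_party, c2_party, c1_pct, c2_pct]
    poll_df.loc[len(poll_df)] = new_row
    elections_done.append(race_key)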


Check how many unique sets we processed and the length of poll_df.

In [None]:
print(len(elections_done))
print(len(poll_df))
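
The same "top two candidates from the first poll of each race" idea can also be expressed without an explicit loop. The sketch below is one possible vectorized version, run on a small made-up frame (`polls_demo` and its values are hypothetical). It assumes the column names used in the loop above and that "first poll" means the lowest `poll_id` within a race. Its output columns are named `c1_candidate_name`, `c2_candidate_name`, and so on, rather than `c1` and `c2`.

```python
import pandas as pd

# Tiny stand-in for house_polls_merged.csv; values are made up for illustration.
polls_demo = pd.DataFrame({
    "state": ["OHIO"] * 5,
    "seat_number": [1] * 5,
    "cycle": [2022] * 5,
    "poll_id": [10, 10, 10, 11, 11],
    "candidate_name": ["A", "B", "C", "A", "B"],
    "party": ["DEM", "REP", "LIB", "DEM", "REP"],
    "pct": [48.0, 45.0, 3.0, 44.0, 50.0],
})

race_keys = ["state", "seat_number", "cycle"]

# Keep only the lowest poll_id per race (assumed meaning of "first poll").
first_poll = polls_demo.groupby(race_keys)["poll_id"].transform("min")
one_poll = polls_demo[polls_demo["poll_id"] == first_poll]

# Rank candidates within each race by polling percentage (1 = leader).
ranked = one_poll.sort_values("pct", ascending=False).copy()
ranked["rank"] = ranked.groupby(race_keys).cumcount() + 1
top_two = ranked[ranked["rank"] <= 2]

# Pivot so each race becomes one row with c1_* and c2_* columns.
wide = top_two.pivot(index=race_keys, columns="rank",
                     values=["candidate_name", "party", "pct"])
wide.columns = [f"c{rank}_{name}" for name, rank in wide.columns]
wide = wide.reset_index()
print(wide)
```

If no race in the data has a second candidate, the `c2_*` columns would not be created by the pivot, so a real pipeline would need to handle that case.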

Check columns of poll_df and election_df.

In [None]:
print(poll_df.columns)
print(election_df.columns)
print(len(poll_df.values))

## Merging Polls and Elections Data

In [None]:
print(len(election_df))
print(len(poll_df))

joined_df = pd.merge(election_df, poll_df, on=['year', 'state', 'district'], how='inner')
print(len(joined_df.values))

joined_df.to_csv('data/joined_polls_elections.csv', index=False)
print("Joined polls and elections data saved to 'data/joined_polls_elections.csv'")

joined_df['winner_party'].value_counts().plot(kind='bar', title='Winner Party Distribution')
plt.tight_layout()
plt.show()
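
One detail of this merge matters for later feature selection: `election_df` and `poll_df` both contain `c1`, `c2`, `c1_party`, and `c2_party`, so pandas appends its default suffixes. The election-side columns become `*_x` and the poll-side columns become `*_y`. Because `election_df` sets `winner_party` from its own `c1_party`, the column `c1_party_x` equals the class label by construction. The sketch below shows the suffix behavior on made-up rows (`election_demo` and `poll_demo` are hypothetical), along with the optional `suffixes` argument as one way to make the column origins explicit.

```python
import pandas as pd

# Made-up rows; only the column names matter for this illustration.
election_demo = pd.DataFrame({
    "year": [2022], "state": ["OHIO"], "district": [1],
    "c1_party": ["REPUBLICAN"], "winner_party": ["REPUBLICAN"],
})
poll_demo = pd.DataFrame({
    "year": [2022], "state": ["OHIO"], "district": [1],
    "c1_party": ["REP"], "c1_pct": [50.0],
})

keys = ["year", "state", "district"]

# Default suffixes: the left (election) column becomes c1_party_x and the
# right (poll) column becomes c1_party_y.
default_join = pd.merge(election_demo, poll_demo, on=keys, how="inner")
print(default_join.columns.tolist())

# Explicit suffixes make the origin of each column easier to read.
named_join = pd.merge(election_demo, poll_demo, on=keys, how="inner",
                      suffixes=("_election", "_poll"))
print(named_join.columns.tolist())
```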

## Exploratory Data Analysis (EDA)
We start with summary statistics of the joined data; distributions and more detailed visualizations will be added in later iterations.


In [None]:
# Summary statistics of joined data
print(joined_df.describe(include='all'))

## Classifier Implementations and Evaluations
We will import our custom classifiers and evaluate them on the dataset.

In [None]:
from classifiers.myclassifiers import MyKNeighborsClassifier as knn
from classifiers.myclassifiers import MyNaiveBayesClassifier as bayes
from classifiers.myclassifiers import MyDummyClassifier as dummy
from classifiers.myclassifiers import DecisionTreeClassifier, RandomForestClassifier

import numpy as np

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import pandas as pd

# Load and preprocess data
data = pd.read_csv("data/joined_polls_elections.csv")
y = data['winner_party'].apply(lambda x: 'DEMOCRAT' if x == 'DEMOCRATIC-FARMER-LABOR' else x)
X = data.drop('winner_party', axis=1)

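# Convert categorical features if present.
# After the merge, the *_x columns come from the election results and the *_y
# columns come from the polls. c1_party_x is the winner's party by construction,
# so using it as a feature would leak the label; use the poll-side columns instead.
party_cols = ['c1_party_y', 'c2_party_y']
for col in party_cols:
    if col in X.columns:
        X[col] = X[col].astype('category').cat.codes

selected_features = ['c1_pct', 'c2_pct'] + [col for col in party_cols if col in X.columns]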

X_sel = X[selected_features]

X_train, X_test, y_train, y_test = train_test_split(X_sel, y, test_size=0.4, random_state=42)



In [None]:

#####################################
# Dummy Classifier
###################################
d = dummy()
d.fit(X_train.values.tolist(), y_train.tolist())
y_pred_dummy = d.predict(X_test.values.tolist())
print("Dummy classifier accuracy:", accuracy_score(y_test, y_pred_dummy))
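
For reference, a majority-class baseline of the kind described for `MyDummyClassifier` can be sketched in a few lines. The class below (`MajorityClassSketch`) is a hypothetical stand-in written for illustration; the actual `MyDummyClassifier` in `classifiers.myclassifiers` may be implemented differently.

```python
from collections import Counter

class MajorityClassSketch:
    """Hypothetical stand-in for a majority-class baseline."""

    def fit(self, X_train, y_train):
        # Remember the single most frequent training label.
        self.most_common_label = Counter(y_train).most_common(1)[0][0]
        return self

    def predict(self, X_test):
        # Ignore the features entirely and repeat the stored label.
        return [self.most_common_label for _ in X_test]

# Made-up rows of [c1_pct, c2_pct] for illustration only.
baseline = MajorityClassSketch().fit(
    [[50, 45], [40, 52], [55, 41]],
    ["DEMOCRAT", "REPUBLICAN", "DEMOCRAT"],
)
print(baseline.predict([[48, 47], [30, 60]]))
```

Any classifier that cannot beat this baseline's accuracy is not learning anything useful from the polling features.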

In [None]:
###################################
# K-Nearest Neighbors Classifier
###################################
# KNN classifier
kn = knn()
kn.fit(X_train.values.tolist(), y_train.tolist())
y_pred_knn = kn.predict(X_test.values.tolist(), categorical=True)
print("KNN classifier accuracy:", accuracy_score(y_test, y_pred_knn))
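
For readers unfamiliar with the method, the sketch below shows a textbook k-nearest-neighbors prediction using Euclidean distance and a majority vote. The function `knn_predict` and the default `k=3` are assumptions for this example; the sketch does not model the `categorical` option passed to `MyKNeighborsClassifier.predict` above, whose exact behavior is defined in the project code.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new from its k closest training rows."""
    # Pair each training row's distance to x_new with its label, closest first.
    distances = sorted(
        (math.dist(row, x_new), label) for row, label in zip(X_train, y_train)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up rows of [c1_pct, c2_pct] for illustration only.
X_demo = [[52, 44], [50, 46], [41, 55], [43, 53], [60, 35]]
y_demo = ["DEMOCRAT", "DEMOCRAT", "REPUBLICAN", "REPUBLICAN", "DEMOCRAT"]

print(knn_predict(X_demo, y_demo, [42, 54]))
print(knn_predict(X_demo, y_demo, [55, 40]))
```

Because the vote depends on raw distances, features on very different scales (for example, percentages next to integer-coded parties) would usually be normalized first.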

In [None]:
###################################
# Naive Bayes Classifier Example

###################################

nb = bayes()
nb.fit(X_train.values.tolist(), y_train.tolist())

nb_preds = nb.predict(X_test.values.tolist())
print("Naive Bayes classifier accuracy:", accuracy_score(y_test, nb_preds))

In [None]:
###################################
# Decision Tree Classifier Example

###################################

# Use positional access: the split shuffles rows, so label 0 may not be in X_train
print(X_train['c1_pct'].iloc[0])

tree = DecisionTreeClassifier()
tree.fit(X_train.values.tolist(), y_train.values.tolist())

tree_preds = tree.predict(X_test.values.tolist())
print("Decision tree classifier accuracy:", accuracy_score(y_test, tree_preds))

In [None]:
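import importlib
import classifiers.myclassifiers
# Reload the submodule itself, then re-import the name, so edits to
# myclassifiers.py are picked up without restarting the kernel.
importlib.reload(classifiers.myclassifiers)
from classifiers.myclassifiers import MyDecisionTreeClassifier
mtree = MyDecisionTreeClassifier()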



# for i in range(len(X_train['c1_pct'])):
#   if int(X_train['c1_pct'][i])> 50:
#     X_train['c1_pct'][i] = "OVER"
#   else:
#     X_train['c1_pct'][i] = "UNDER"

#   if(int(X_train['c2_pct'[i]])> 50):
#     X_train['c2_pct'][i] = "OVER"
#   else:
#     X_train['c2_pct'][i] = "UNDER"

#   if int(X_test['c1_pct'][i])> 50:
#     X_test['c1_pct'][i] = "OVER"
#   else:
#     X_test['c1_pct'][i] = "UNDER"

#   if(int(X_test['c2_pct'[i]])> 50):
#     X_test['c2_pct'][i] = "OVER"
#   else:
#     X_test['c2_pct'][i] = "UNDER"

mtree.fit(X_train.values.tolist(), y_train.tolist())

mtree_preds = mtree.predict(X_test.values.tolist())
print(mtree_preds)
print(len(mtree_preds))
print(len(y_test))
# Fraction of test instances predicted correctly (no hard-coded test-set size)
acc = sum(pred == actual for pred, actual in zip(mtree_preds, y_test.tolist())) / len(y_test)
print("Decision tree classifier accuracy:", acc)


In [None]:

###################################
# Random Forest Classifier

###################################



## Deploying the Best Classifier via Flask


In [None]:
from flask import Flask
app = Flask(__name__)

@app.route("/")
def index():
    return "<h1>Hello!</h1>"

if __name__ == "__main__":
    app.run()
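
The route above is only a placeholder. As a possible next step, the sketch below adds a hypothetical `/predict` endpoint that reads `c1_pct` and `c2_pct` from the query string and returns JSON. The endpoint name, the parameter names, and the `placeholder_predict` rule are all assumptions for illustration; in the real app the rule would be replaced by the best-performing trained classifier.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def placeholder_predict(c1_pct, c2_pct):
    # Stand-in rule, NOT a trained model: report which polled candidate leads.
    return "c1" if c1_pct >= c2_pct else "c2"

@app.route("/predict")
def predict():
    # With type=float, a missing or non-numeric value comes back as None.
    c1_pct = request.args.get("c1_pct", type=float)
    c2_pct = request.args.get("c2_pct", type=float)
    if c1_pct is None or c2_pct is None:
        return jsonify({"error": "c1_pct and c2_pct are required numbers"}), 400
    return jsonify({"predicted_leader": placeholder_predict(c1_pct, c2_pct)})

# Exercise the route with Flask's built-in test client instead of starting a server.
with app.test_client() as client:
    response = client.get("/predict?c1_pct=51.5&c2_pct=46")
    print(response.status_code, response.get_json())
```

Using the test client keeps the example runnable inside a notebook cell, where calling `app.run()` would block the kernel.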

## Conclusion
In this notebook, we described our project, loaded and merged the dataset, performed initial exploration, and tested baseline classifiers. In subsequent steps, we will refine our approach, implement the random forest, run parameter tuning, perform EDA with visualizations, and finally deploy a web app.

## Acknowledgments
- MIT Election Data and Science Lab for election datasets: https://electionlab.mit.edu/data
- Polling data compiled from public domain sources.
- Any external code or libraries beyond standard Python/pandas/matplotlib/sklearn will be cited.

All code and approaches for the classifiers (Dummy, KNN, Naive Bayes, Random Forest) are inspired by course materials from CPSC 322, Fall 2024.
