<h2>Election Result Prediction for US Counties</h2>

Vaed Prasad

<h3>Introduction:</h3>


Economic and sociological factors have been widely used when making predictions on the voting results of US elections. Economic and sociological factors vary a lot among counties in the United States. In addition, as observed from the election map of recent elections, neighbor counties show similar patterns in terms of the voting results. In this project we will bring the power of machine learning to make predictions for the county-level election results using Economic and sociological factors and the geographic structure of US counties. </p>


In [1]:
import os
import pandas as pd
import numpy as np
import csv
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

<h3>1.2 Weighted Accuracy:</h3><p>
Since our dataset labels are heavily biased, we will use the following function to compute weighted accuracy throughout our training and validation process.
<p>

In [2]:
def weighted_accuracy(pred, true):
    assert(len(pred) == len(true))
    num_labels = len(true)
    num_pos = sum(true)
    num_neg = num_labels - num_pos
    frac_pos = num_pos/num_labels
    weight_pos = 1/frac_pos
    weight_neg = 1/(1-frac_pos)
    num_pos_correct = 0
    num_neg_correct = 0
    for pred_i, true_i in zip(pred, true):
        num_pos_correct += (pred_i == true_i and true_i == 1)
        num_neg_correct += (pred_i == true_i and true_i == 0)
    weighted_accuracy = ((weight_pos * num_pos_correct) 
                         + (weight_neg * num_neg_correct))/((weight_pos * num_pos) + (weight_neg * num_neg))
    return weighted_accuracy

<h2>Part 2: Baseline Solution</h2><p>

<h3>2.1 Preprocessing and Feature Extraction:</h3>

In [4]:
# Creates a dataframe from the csv files
# to run the code, all relavant files must be placed in a folder named 'data'
file_dir = 'data/'
test = 'test_2016_no_label.csv'
train = 'train_2016.csv'
graph = 'graph.csv'
train_2012 = 'train_2012.csv'
test_2012 = 'test_2012_no_label.csv'

train_df = pd.read_csv(file_dir + train)
test_df = pd.read_csv(file_dir + test)

neighboring_counties = pd.read_csv(file_dir + graph)

df_2012_test = pd.read_csv(file_dir + test_2012)
df_2012_train = pd.read_csv(file_dir + train_2012)

In [5]:
# Normalizes the numpy array by the given mean and standard deviation
def normalize_by_param(X, mean, std):
  """Return normalized dataset using given mean and standard deviation"""
  X = np.divide(np.subtract(X, mean), std)
  return X

# Normalizes the numpy array, returning also the mean and standard deviation
def normalize(X):
  """Return normalized dataset, mean, and standard deviation"""
  mean = np.mean(X, axis=0)
  std = np.std(X, axis=0)
  X = normalize_by_param(X, mean, std)
  return X, mean, std

# Normalizes the dataframe, returning also the mean and standard deviation
def normalize_df(df):
  """Return normalized dataframe, mean, and standard deviation"""
  mean = df.mean()
  std = df.std()
  return normalize_df_by_param(df, mean, std), mean, std

# Normalizes the dataframe by the given mean and standard deviation
def normalize_df_by_param(df, mean, std):
  """Return normalized dataframe using given mean and standard deviation"""
  return (df - mean) / std

# Converts the 'MedianIncome' field in the dataframe to float from string
def parse_income(df):
  """Return parsed dataframe for 'MedianIncome' by removing commas and converting to type float"""
  df["MedianIncome"] = df["MedianIncome"].replace(',','', regex=True).astype('float')
  return df

# Returns a numpy array of the labels
def get_labels(train_df):
  """Return dataframe with binary 'winner' feature for county's winner"""
  train_df["winner"] = (train_df["DEM"] > train_df["GOP"]) + 0
  return train_df["winner"].to_numpy()

# Returns the state code of a county based on its fips code
def get_state_code(fips):
  """Return first two digits of fips code, representing the state code"""
  if len(str(fips)) == 4:
    return "0" + str(fips)[0:1]
  else:
    return str(fips)[0:2]

# Returns the set of the fips code of all pf a counties neighbors
def get_neighbors(fips):
  """Return set of neighboring counties for county with [fips]"""
  fipses = set()
  for _, row in neighboring_counties[neighboring_counties["SRC"] == fips].iterrows():
    if (fips != row["DST"]):
      fipses.add(row["DST"])

  return fipses

# Return the class weights based on the proportions of labelings
def get_class_weights(Y):
  """Return proportion of values for binary class"""
  n = Y.shape[0]
  return {1: (n - Y.sum()) / n, 0: Y.sum() / n}

# Indices of instance variables in the dataframe
indices = ["MedianIncome","MigraRate","BirthRate", "DeathRate", "BachelorRate", "UnemploymentRate"]
neighboring_indices = ["*MedianIncome","*MigraRate","*BirthRate", "*DeathRate", "*BachelorRate", "*UnemploymentRate"]

In [6]:
# Our Basic training set involves preproccessing the test and training dataframes
# imported from the csv files by converting the income values into a float data
# type and normalizing the data

def generateTestandTrain(train_df, test_df):
  '''
  Parameters: train_df and test_df are the dataframes corresponding to the
  training and testing data
  Returns: X, xTe, and Y where X and xTe are the normalized and preprocessed
  Training and testing set. Y is the set of labeks for training set X 
  '''
  
  train_df = parse_income(train_df)
  X = train_df[indices].to_numpy()
  X, mean, std = normalize(X)
  Y = get_labels(train_df)
  
  test_df = parse_income(test_df)
  xTe = test_df[indices].to_numpy()
  xTe = normalize_by_param(xTe, mean, std)
  
  return X, xTe, Y

X_Basic, xTe_Basic, Y = generateTestandTrain(train_df, test_df)

<h3>2.2.1 K - Nearest Neighbors:</h3><p>

In [7]:
def knn(X,Y):
  '''
  Parameters: X is the training data set of datatype 2D np.array
  and Y is the corresponding labels of type 1D np.array
  Returns: Runs K-Fold Validation on the training set run on the k nearest neighbor
  model and prints the highest validation accuracy. 
  Returns the knn model trained on the set with the highest validation accuracy.
  '''
  kf = KFold(n_splits=10)
  kf.get_n_splits(X)
  neigh = KNeighborsClassifier(n_neighbors=5)
  splits = kf.split(X)

  accuracies = []
  indices = []
  for train_index, valid_index in splits:
    neigh.fit(X[train_index], Y[train_index])
    acc = weighted_accuracy(neigh.predict(X[valid_index]), Y[valid_index])
    accuracies.append(acc)
    indices.append(train_index)

  best_index = indices[np.argmax(accuracies)]
  neigh.fit(X[best_index], Y[best_index])
  print(max(accuracies))
  return neigh


<h3>2.2.2 Support Vector Machine:</h3><p>

In [8]:
def svc(X, Y):
  '''
  Parameters: X is the training data set of datatype 2D np.array
  and Y is the corresponding labels of type 1D np.array
  Returns: Runs K-Fold Validation on the training set run on an kernelized SVM
  model with weighted classes. Returns the SVM model trained on the set with
  the highest validation accuracy.
  '''

  svc = SVC(kernel='rbf', gamma='auto', class_weight=get_class_weights(Y))
  kf = KFold(n_splits=10)
  kf.get_n_splits(X)

  accuracies = []
  indices = []
  for train_index, valid_index in kf.split(X):
    svc.fit(X[train_index], Y[train_index])
    acc = weighted_accuracy(svc.predict(X[valid_index]), Y[valid_index])
    accuracies.append(acc)
    indices.append(train_index)

  best_index = indices[np.argmax(accuracies)]
  svc.fit(X[best_index], Y[best_index])
  print(max(accuracies))
  return svc


<h3>2.2.3 Neural Network:</h3><p>

In [9]:
def neural_network(X,Y):
  '''
  Parameters: X is the training data set of datatype 2D np.array
  and Y is the corresponding labels of type 1D np.array
  Returns: Runs K-Fold Validation on the training set run on the neural network
  model and prints the highest validation accuracy. 
  Returns the neural network model trained on the set with the highest 
  validation accuracy.
  '''

  kf = KFold(n_splits=10)
  kf.get_n_splits(X)
  nn = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(10, 10),random_state=1, max_iter = 400, activation = 'tanh')
  splits = kf.split(X)

  accuracies = []
  indices = []
  for train_index, valid_index in splits:
    nn.fit(X[train_index], Y[train_index])
    acc = weighted_accuracy(nn.predict(X[valid_index]), Y[valid_index])
    accuracies.append(acc)
    indices.append(train_index)
  
  print(max(accuracies))
  best_index = indices[np.argmax(accuracies)]
  nn.fit(X[best_index], Y[best_index])

  return nn


<h3>2.3 Training, Validation and Model Selection:</h3><p>

In [10]:
'''
Our model functions (knn, svc, and nn) run a K-Fold validation on the training
set and returns the model that was trained on the split that produce the highest
accuracies. Here, we print out the accuracies of each of these models and choose
the model with the highest accuracy. The accuracy of each model is commented on 
the side.

''' 
print("------------------BASIC---------------------")
knn_basic = knn(X_Basic, Y) #0.7713333333333333
svc_basic = svc(X_Basic, Y) #0.8647342995169083
nn_basic = neural_network(X_Basic, Y) #0.8115384615384615


------------------BASIC---------------------
0.7713333333333333
0.8647342995169083


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("

0.8115384615384615


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


<h3>2.4 Initial Results:</h3><p>


2.4.1 *How did you preprocess the dataset and features?*

In order to clean the dataset and features, we first parsed the “MedianIncome” by removing commas and converting the data type from a string to a float. Next, we calculated each column’s mean and standard deviation in order to normalize the training data frame by subtracting the mean and dividing by the standard deviation for each entry in a given column. We also created a binary “winner” feature column by comparing the number of “GOP” votes and “DEM” votes in a given county, where 1 represents that the Democrats won the county and 0 represents that the Republicans won the county. Finally, we also normalized our test set with the training set’s mean and standard deviation.

2.4.2 *Which two learning methods from class did you choose and why did you made the choices?*

The two learning methods we chose from class was K-Nearest Neighbors and Soft-Margin Support Vector Machine. We selected K-Nearest Neighbors because we believed that counties with similar political voting patterns will be clustered near each other in the feature space. Similarly, we selected a Soft-Margin Support Vector Machine because we believed that the features would separate Republican counties from Democratic counties far enough to build an accurate support vector machine.

2.4.3 *How did you do the model selection?*

For both our K-Nearest Neighbors and Soft-Margin Support Vector Machine learning methods we leveraged k-fold validation by making 10 splits and selecting the trained model with the best accuracy. Then, we compared the accuracy of the best performing K-Nearest Neighbors model and Soft-Margin Support Vector Machine model and selected the SVM model as it yielded a better accuracy. Our model had a test performance of 70.36%.

<h2>Part 3: Feature Adjustments</h2><p>

<h3>3.1.1 Considering 2012 Election Data:</h3><p>

In [11]:

# Our 2012 training set involves preproccessing the test and training dataframes
# imported from the csv files by converting the income values into a float data
# type and normalizing the data. We also take it one step further than the basic
# training set by adding in new features that represent the change in features
# from 2012 to 2016. The feature changes that we looked at were change in 
# Migration Rate, Birth Rate, Death Rate, Unemployment Rate, and Median Income

def generateTestandTrain2012(train_creative, test_creative, df_2012_train, df_2012_test):
  
  '''
  Parameters: train_creative and test_creative are the dataframes corresponding 
  to the 2016 training and testing data; df_2012_train and df_2012_test are the 
  dataframes corresponding to the 2012 training and testing data
  Returns: X_Creative, xTe_Creative, Y where X_Creative andxTe_Creative are the 
  normalized and preprocessed training and testing set. Y is the set of labels 
  for training set X_Creative
  '''

  df_2012_train = parse_income(df_2012_train)
  train_creative = parse_income(train_creative)
  train_creative["MigraRateChange"] = df_2012_train["MigraRate"] - train_creative["MigraRate"]
  train_creative["BirthRateChange"] = df_2012_train["BirthRate"] - train_creative["BirthRate"]
  train_creative["DeathRateChange"] = df_2012_train["DeathRate"] - train_creative["DeathRate"]
  train_creative["UnemploymentChange"] = df_2012_train["UnemploymentRate"] - train_creative["UnemploymentRate"]
  train_creative["MedianIncomeChange"] = df_2012_train["MedianIncome"] - train_creative["MedianIncome"]

  train_creative = parse_income(train_creative)

  X_Creative = train_creative[indices + ["MigraRateChange","BirthRateChange", "DeathRate", "UnemploymentChange","MedianIncomeChange"]].to_numpy()
  X_Creative, mean_creative, std_creative = normalize(X_Creative)

  df_2012_test = parse_income(df_2012_test)
  test_creative = parse_income(test_creative)

  test_creative["MigraRateChange"] = df_2012_test["MigraRate"] - test_creative["MigraRate"]
  test_creative["BirthRateChange"] = df_2012_test["BirthRate"] - test_creative["BirthRate"]
  test_creative["DeathRateChange"] = df_2012_test["DeathRate"] - test_creative["DeathRate"]
  test_creative["UnemploymentChange"] = df_2012_test["UnemploymentRate"] - test_creative["UnemploymentRate"]
  test_creative["MedianIncomeChange"] = df_2012_test["MedianIncome"] - test_creative["MedianIncome"]

  xTe_Creative = test_creative[indices + ["MigraRateChange","BirthRateChange", "DeathRate", "UnemploymentChange","MedianIncomeChange"]].to_numpy()
  xTe_Creative = normalize_by_param(xTe_Creative, mean_creative, std_creative)

  Y = get_labels(train_df)

  return X_Creative, xTe_Creative, Y

X_2012, xTe_2012, Y = generateTestandTrain2012(train_df, test_df, df_2012_train, df_2012_test)


<h3>3.1.2 Considering Neighboring Counties:</h3><p>

In [12]:
# Our neighbors training set involves preproccessing the test and training dataframes
# imported from the csv files by converting the income values into a float data
# type and normalizing the data. We also take it one step further than the basic
# training set by adding in a new feature that represents the number of neighbors 
# a county has. We too this data from the graph.csv file. 

def generateTestandTrainneighbors(train_creative, test_creative, neighboring_counties):
  '''
  Parameters: train_creative and test_creative are the dataframes corresponding 
  to the 2016 training and testing data; neighboring_counties is the dataframe
  corresponding to the number of neighboring counties each county has
  Returns: X_Creative, xTe_Creative, Y where X_Creative andxTe_Creative are the 
  normalized and preprocessed training and testing set. Y is the set of labels 
  for training set X_Creative
  '''

  #creates a dictionary that counts how many times a FIPS number occurs in the 
  #neighboring_counties dataframe 

  train_creative["neighbors"] = 0
  number_neighbors = []
  frequencies = neighboring_counties["SRC"].value_counts()
  for i in train_creative["FIPS"]:
    number_neighbors.append(frequencies[i])

  # adds this new feature to our training set
  train_creative["neighbors"] = number_neighbors
  train_creative = parse_income(train_creative)
  X_creative = train_creative[indices].to_numpy()
  X_creative, mean, std = normalize(X_creative)
  
  test_creative["neighbors"] = 0
  number_neighbors = []

  for i in test_creative["FIPS"]:
    number_neighbors.append(frequencies[i])

  # adds this new feature to our testing set
  test_creative["neighbors"] = number_neighbors
  test_creative = parse_income(test_creative)
  xTeCreative = test_creative[indices].to_numpy()
  xTeCreative = normalize_by_param(xTeCreative, mean, std)

  Y = get_labels(train_creative)

  return X_creative, xTeCreative, Y

X_neighbors, xTe_neighbors, Y = generateTestandTrainneighbors(train_df, test_df, neighboring_counties)


In [18]:
  # Our neighbors training set involves preproccessing the test and training dataframes
  # imported from the csv files by converting the income values into a float data
  # type and normalizing the data. We also take it one step futher and add a binary
  # instance variable for every single state, where a value of 1 would indicate
  # that a certain county is located in that state 
  def generateTestAndTrainStates(train_creative, test_creative):
    '''
    Parameters: train_df and test_df are the dataframes corresponding to the
    training and testing data
    Returns: X, xTe, and Y where X and xTe are the normalized and preprocessed
    Training and testing set. Y is the set of labeks for training set X 
    '''

    states = []

    train_creative = parse_income(train_creative)
    train_creative[indices], mean, std = normalize_df(train_creative[indices])

    test_creative = parse_income(test_creative)
    test_creative[indices] = normalize_df_by_param(test_creative[indices], mean, std)

    states = []
    for _, row in train_creative.iterrows():
      s = get_state_code(row["FIPS"])
      if (s not in states):
        states.append(s)

    train_creative[states] = 0
    test_creative[states] = 0

    for i, row in train_creative.iterrows():
      s = get_state_code(row["FIPS"])
      train_creative.loc[i, s] = 1
    for i, row in test_creative.iterrows():
      s = get_state_code(row["FIPS"])
      test_creative.loc[i, s] = 1

    X_creative = train_creative[indices + states].to_numpy()
    xTeCreative = test_creative[indices + states].to_numpy()

    Y = get_labels(train_creative)

    return X_creative, xTeCreative, Y

X_states, xTe_states, Y = generateTestAndTrainStates(train_df, test_df)

In [20]:
  # Our neighbors training set involves preproccessing the test and training dataframes
  # imported from the csv files by converting the income values into a float data
  # type and normalizing the data. We also take it one step futher and add a duplicate
  # instance variable for all the once used in our basic training data, which we assign
  # to be a weighted average of all of the nearby counties (in this case, all counties that
  # are less than 2 border crossings away from the original)
  def generateTestAndTrainNeighborsAverage(train_creative, test_creative):
    '''
    Parameters: train_df and test_df are the dataframes corresponding to the
    training and testing data
    Returns: X, xTe, and Y where X and xTe are the normalized and preprocessed
    Training and testing set. Y is the set of labeks for training set X 
    '''

    train_creative = parse_income(train_creative)
    train_creative[indices], mean, std = normalize_df(train_creative[indices])

    test_creative = parse_income(test_creative)
    test_creative[indices] = normalize_df_by_param(test_creative[indices], mean, std)

    combined_creative = pd.concat([train_creative, test_creative])
  
    train_creative[neighboring_indices] = 0
    test_creative[neighboring_indices] = 0

    for index, row in train_creative.iterrows():
      priority_sum = 0
      covered = {row["FIPS"]}
      neighbors = set()
      for i in range(2):
        for fips in covered:
          neighbors.update(get_neighbors(fips))
        neighbors = neighbors.difference(covered)
        for neighbor in neighbors:
          query = combined_creative[combined_creative["FIPS"] == neighbor]
          try:
            res = query.iloc[0]
            weight = 1 / (i + 1)
            row["*MedianIncome"] += weight * res["MedianIncome"]
            row["*MigraRate"] += weight *  res["MigraRate"]
            row["*BirthRate"] += weight * res["BirthRate"]
            row["*DeathRate"] += weight * res["DeathRate"]
            row["*BachelorRate"] += weight * res["BachelorRate"]
            row["*UnemploymentRate"] += weight * res["UnemploymentRate"]
            priority_sum += weight
          except Exception:
            pass
        covered.update(neighbors)
        neighbors.clear()
      if (priority_sum > 0):
        row[neighboring_indices] /= priority_sum
      train_creative.loc[index, neighboring_indices] = row[neighboring_indices]

    for index, row in test_creative.iterrows():
      priority_sum = 0
      covered = {row["FIPS"]}
      neighbors = set()
      for i in range(2):
        for fips in covered:
          neighbors.update(get_neighbors(fips))
        neighbors = neighbors.difference(covered)
        for neighbor in neighbors:
          query = combined_creative[combined_creative["FIPS"] == neighbor]
          try:
            res = query.iloc[0]
            weight = 1 / (i + 1)
            row["*MedianIncome"] += weight * res["MedianIncome"]
            row["*MigraRate"] += weight *  res["MigraRate"]
            row["*BirthRate"] += weight * res["BirthRate"]
            row["*DeathRate"] += weight * res["DeathRate"]
            row["*BachelorRate"] += weight * res["BachelorRate"]
            row["*UnemploymentRate"] += weight * res["UnemploymentRate"]
            priority_sum += weight
          except Exception:
            pass
        covered.update(neighbors)
        neighbors.clear()
      if (priority_sum > 0):
        row[neighboring_indices] /= priority_sum
      test_creative.loc[index, neighboring_indices] = row[neighboring_indices]

    X_creative = train_creative[indices + neighboring_indices].to_numpy()
    xTeCreative = test_creative[indices + neighboring_indices].to_numpy()

    Y = get_labels(train_creative)

    return X_creative, xTeCreative, Y

X_neighborsaverage, xTe_neighborsaverage, Y = generateTestAndTrainNeighborsAverage(train_df, test_df)


In [21]:
'''
Our model functions (knn, svc, and nn) run a K-Fold validation on the training
set and returns the model that was trained on the split that produce the highest
accuracies. Here, we run the models on different datasets which each have a 
different set of features. We also print out the accuracies of each of these 
models and choose the model with the highest accuracy. The accuracy of each 
model is commented on the side.

''' 

print("------------------2012---------------------")
knn_2012 = knn(X_2012, Y) #0.7661498708010337
svc_2012 = svc(X_2012, Y) #0.8369565217391306
nn_2012 = neural_network(X_2012, Y) #0.7799310938845824

print("----------------NEIGHBORS-------------------")
knn_neighbors = knn(X_neighbors, Y) #0.7713333333333333
svc_neighbors = svc(X_neighbors, Y) #0.8647342995169083
nn_neighbors = neural_network(X_neighbors, Y) #0.8115384615384615

print("---------------NEIGHBORS AVERAGE-------------")
knn_neighborsaverage = knn(X_neighborsaverage, Y) #0.7297619047619048
svc_neighborsaverage = svc(X_neighborsaverage, Y) #0.8937198067632851
nn_neighborsaverage = neural_network(X_neighborsaverage, Y) #0.7940476190476191


------------------2012---------------------
0.7661498708010337
0.8369565217391306


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("

0.7799310938845824


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


----------------NEIGHBORS-------------------
0.7713333333333333
0.8647342995169083


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("

0.8115384615384615


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


---------------NEIGHBORS AVERAGE-------------
0.7297619047619048
0.8937198067632851


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("

0.7940476190476191


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


<h3>3.2 Final Results:</h3><p>

Our basic submission (which was running the k nearest neighbor model looking at 5 neighbors with the basic feature set) gave an accuracy of 0.70361. Our creative model (which was running an svm model with a weighted accuracy and a feature set that took into account the neighboring counties features) gave an accuracy of 0.77655, so our model improved by about 7%.



For our final creative model, we used an kernelized SVM with 10-fold validation and weighted class-labels to train the training dataset generated by `generateTestAndTrainNeighborsAverage`, which duplicates every instance variable in the basic dataset, and sets those variables equal to the weighted average of the values of those instance variables for each county's neighbors. This extra feature was added to take into account the features of counties in proximity to the county we are prediction. We wanted to add this extra feature since we had a hypothesis that counties are likely to vote in a similar manner to their neighbors. 

We also wanted to use weighted class-labels because when training the data set, we wanted to penalize incorrect predictions of Democratic counties over Republican ones because there are alot more Republican counties in the training set.  

Another feature we tried out, but had an insignificant effect on improving our prediction accuracy, was looking at data from 2012 and seeing whether the changes in certain features over the years had an effect on which counties were more likely to vote Democtatic or Republican. When looking at the training data, we observed certain correlations to a county's change in median income/unemplyoment rate with a change in party affiliation. To encompass this feature in our training set, we incorporated several additional columns that represented the change in median income and unemployment from 2012 to 2016. 

Another feature we attempted to leverage in training our data was observing the number of direct neighbors a particular county had. We hypothesized that this feature might have an impact since costal counties tend to vote Democratic and also have less neighbors. However, encoding this feature had a minimal effect on our accuracies so we ended up pursuing other approaches of feature extraction.

We also tried using a neural network model with `sklearn` that gave promising results. But, during the backpropagation portion of the algorithm, we wanted to used a weighted class-labels method to update the weights, but the `sklearn` package for the neural network model did not give us that option. Therefore, we instead implemented the weighted class-labels method in the svm model which ultimately gave better results. 

<h2>Part 4: Model Output and Post-Processing</h2><p>

In [22]:
# This function runs a model on our training dataset and outputs a 
# csv file with the predicted labels.

def create_csv(model, fileName, xTe):
  '''
  Parameters: model is the trained model we want to
  run on the test set, fileName is the chosen fileName 
  to write the predictions, xTe is the preprocessed 
  test set
  Returns: Function returns a csv file with the test
  set predictions
  '''
  data = test_df['FIPS']
  df = pd.DataFrame(data , columns = ['FIPS']) 
  df["Result"] = model.predict(xTe)
  df.to_csv (fileName, index = False, header=True)

create_csv(svc_neighborsaverage, "svc_creative.csv", xTe_neighborsaverage)


<h2>Part 5: References</h2><p>



* Documentation On Neural Networks: https://scikit-learn.org/stable/modules/neural_networks_supervised.html

*  Documentation on Neural Networks: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
* Documentation on K Nearest Neighbors: https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
* Documentation on skLearn: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

