<a href="https://colab.research.google.com/github/taliafabs/sta496/blob/main/STA496_RaceDep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Race Depolarization and Vote Choice Modeling

This mini-paper/assignment will examine whether race should be used in a logistic regression vote choice model. Due to the nature of the two-party system, there are only ever two candidates who can realistically win the U.S. presidential election.

I will use data from the CCES (obtained via the Harvard Dataverse) to train logistic regression models that predict vote choice, treated as a binary choice between the Democratic and Republican nominees.

Examining the strength of race as a predictor:


* I'll start by fitting logistic regression models that include race as a categorical predictor to CCES data from the 2016, 2020, and 2024 U.S. presidential elections (the three elections in which Trump was the GOP nominee).
* I'll then test whether removing race from the model (or its interaction with gender and/or education) improves model performance, measured by validation accuracy.
* I'll also run some classical statistical tests (for example, a likelihood ratio test) to determine whether removing the race variable is appropriate.

I'll do this separately for 2016, 2020, and 2024 to answer a few questions:
1. Did the significance of race as a predictor of vote choice decrease between 2016 and 2024?
2. Do the effects of gender and education on vote choice vary by race? (e.g., does a college education change the likelihood of supporting Trump differently for white voters than for Black or Hispanic voters?)
3. Does the CCES data show evidence of race depolarization ("RaceDep")?
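
The with/without-race comparison described above could be set up roughly as follows. This is only a sketch: the `validation_accuracy` helper and the `toy` data frame are hypothetical stand-ins (the data are simulated, not CCES responses), and survey weights are ignored.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def validation_accuracy(df, predictors, target="vote_trump", seed=42):
    """Fit a one-hot + logistic regression pipeline; return accuracy on a held-out validation set."""
    X_train, X_val, y_train, y_val = train_test_split(
        df[predictors], df[target], test_size=0.25, random_state=seed
    )
    pipeline = Pipeline([
        ("preprocessor", ColumnTransformer(
            [("cat", OneHotEncoder(handle_unknown="ignore"), predictors)]
        )),
        ("classifier", LogisticRegression(max_iter=2000)),
    ])
    pipeline.fit(X_train, y_train)
    return pipeline.score(X_val, y_val)


# Simulated stand-in for a cleaned survey frame (NOT real CCES data).
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "race": rng.choice(["White", "Black", "Hispanic", "Asian"], size=n),
    "educ": rng.choice(["High school graduate", "4-year", "Post-grad"], size=n),
    "urbancity": rng.choice(["City", "Suburb", "Town", "Rural area"], size=n),
})
# Hypothetical data-generating process, chosen only so the example has some signal.
eta = (
    -0.3
    + 1.0 * (toy["urbancity"] == "Rural area")
    - 0.8 * (toy["educ"] == "Post-grad")
    - 1.0 * (toy["race"] == "Black")
).to_numpy()
toy["vote_trump"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))

all_predictors = ["race", "educ", "urbancity"]
no_race = [p for p in all_predictors if p != "race"]

acc_with_race = validation_accuracy(toy, all_predictors)
acc_without_race = validation_accuracy(toy, no_race)
print(f"validation accuracy with race:    {acc_with_race:.3f}")
print(f"validation accuracy without race: {acc_without_race:.3f}")
```

Running the same helper on each year's cleaned frame would give one pair of accuracies per election, which is one way to approach question 1. A single split is noisy, so a small accuracy gap should not be over-interpreted.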






In [97]:
# workspace setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pymc as pm
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

## Introduction

#### Motivation
* The racial divide as a decisive factor in vote choice may be receding relative to other factors, such as the rural-urban divide and the educational divide (Kuriwaki et al., 2023).
* The congressional districts with the highest levels of racial polarization are in the South and Midwest (Kuriwaki et al., 2023).
* This suggests that the effects of geography (region, state, urban vs rural vs suburban vs rurban area, etc.) may vary across different racial groups.

## Data

In [None]:
# loading the data
from google.colab import drive
drive.mount('/content/drive')

# 2024 data
ces24 = pd.read_stata("/content/drive/MyDrive/STA496/Datasets/CES24_Common.dta")

# 2020
ces20 = pd.read_stata("/content/drive/MyDrive/STA496/Datasets/CES20_Common_OUTPUT_vv.dta")

# 2016
ces16 = pd.read_stata("/content/drive/MyDrive/STA496/Datasets/CCES16_Common_OUTPUT_Feb2018_VV.dta")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# cleaning the data
ces24.head()

Unnamed: 0,caseid,tookpost,commonweight,commonpostweight,CCEStake,add_confirm,inputzip,birthyr,gender4,gender4_t,...,page_CC24_445_timing,page_gunown_timing,page_numchildren_timing,page_gigwork_timing,page_edloan_timing,page_student_timing,starttime,endtime,starttime_post,endtime_post
0,1853651564,Yes,0.418897,0.180057,Yes,Yes,,1978,Woman,__NA__,...,13.454,0.0,4.736,5.564,0.0,0.0,2043432000000.0,2043433000000.0,2046740000000.0,2046741000000.0
1,1853655732,Yes,0.94665,0.700897,Yes,,11236.0,1968,Man,__NA__,...,4.454,0.0,15.087,9.487,0.0,0.0,2043433000000.0,2043434000000.0,2046898000000.0,2046899000000.0
2,1852716424,Yes,0.194303,0.046275,Yes,Yes,,1946,Woman,__NA__,...,21.204,0.0,16.13,9.494,0.0,0.0,2043432000000.0,2043434000000.0,2046916000000.0,2046917000000.0
3,1853644254,Yes,0.083223,0.008897,Yes,,90039.0,2001,Woman,__NA__,...,64.011,0.0,4.617,55.219,0.0,2.347,2043432000000.0,2043434000000.0,2049248000000.0,2049251000000.0
4,1853644132,Yes,0.596598,0.555395,Yes,Yes,,1955,Woman,__NA__,...,164.473,8.722,6.011,37.659,5.242,0.0,2043432000000.0,2043435000000.0,2046556000000.0,2046558000000.0


In [None]:
ces20.head()

Unnamed: 0,caseid,commonweight,commonpostweight,vvweight,vvweight_post,tookpost,CCEStake,birthyr,gender,educ,...,CL_2020ppep,CL_2020ppvm,CL_2020pep,CL_2020pvm,CL_state,CL_party,starttime,endtime,starttime_post,endtime_post
0,1232319000.0,0.78251,0.665971,0.850917,0.606593,Yes,Yes,1966.0,Male,2-year,...,,absentee,,absentee,CT,REP,2020-09-29 21:22:42,2020-10-14 19:54:26,2020-11-17 20:34:25,2020-11-17 21:14:23
1,1231395000.0,1.344424,1.435594,,,Yes,Yes,1955.0,Female,Post-grad,...,,,,,MO,UNK,2020-09-30 00:15:19,2020-10-19 17:45:07,2020-12-04 19:33:08,2020-12-04 19:46:31
2,1232452000.0,0.40552,0.342454,,,Yes,Yes,1946.0,Female,4-year,...,,,,,,,2020-09-29 23:31:57,2020-10-01 19:59:20,2020-11-26 16:29:54,2020-11-26 16:54:39
3,1232495000.0,0.957734,0.822106,1.041459,1.002495,Yes,Yes,1962.0,Female,4-year,...,,,DEM,unknown,MA,NPA,2020-09-30 00:07:57,2020-10-02 18:01:22,2020-11-16 00:54:31,2020-11-16 01:16:09
4,1232495000.0,0.194665,0.161725,,,Yes,Yes,1967.0,Male,4-year,...,,,,,,,2020-09-30 00:08:14,2020-09-30 23:51:24,2020-11-13 17:00:19,2020-11-13 17:11:25


In [None]:
ces16['CC16_410a'].head()

Unnamed: 0,CC16_410a
0,Donald Trump (Republican)
1,Donald Trump (Republican)
2,
3,
4,Hillary Clinton (Democrat)



In [100]:
ces24, ces20, ces16 = pd.DataFrame(ces24), pd.DataFrame(ces20), pd.DataFrame(ces16)

# subset to only include relevant columns
ces24_subset = ces24.copy()[['gender4', 'race', 'hispanic', 'educ', 'marstat', 'inputstate', 'region', 'birthyr',
                  'ownhome', 'urbancity', 'industry',
                  'religpew', 'pew_religimp', 'pew_churatd', 'pew_prayer',
                  'CC24_361b', 'CC24_363', 'CC24_364a', 'CC24_364b', 'presvote20post', 'pid3', 'pid7'
                  ]]

ces20_subset = ces20.copy()[['gender', 'race', 'hispanic', 'educ', 'marstat', 'inputstate', 'region', 'birthyr',
                      'ownhome', 'urbancity', 'industryclass',
                      'religpew', 'pew_religimp', 'pew_churatd', 'pew_prayer',
                      'votereg', 'votereg_f','CC20_364a', 'CC20_364b', 'presvote16post', 'pid3', 'pid7']]

ces16_subset = ces16.copy()[['gender', 'race', 'hispanic', 'educ', 'marstat', 'inputstate', 'birthyr',
                      'ownhome', 'industryclass',
                      'religpew', 'pew_religimp', 'pew_churatd', 'pew_prayer',
                      'votereg', 'CC16_410a', 'CC16_410a_nv', 'CC16_326', 'pid3', 'pid7']]

# only include trump and clinton/biden/harris voters
# (.copy() makes each filtered frame independent, so adding columns below
#  does not trigger pandas' SettingWithCopyWarning)
ces24_subset = ces24_subset[
    ces24_subset['CC24_364a'].isin(["Kamala Harris (Democrat)", "Donald Trump (Republican)"])
].copy()

ces20_subset = ces20_subset[
    ces20_subset['CC20_364a'].isin(["Joe Biden (Democrat)", "Donald Trump (Republican)"])
].copy()

ces16_subset = ces16_subset[
    ces16_subset['CC16_410a'].isin(["Hillary Clinton (Democrat)", "Donald Trump (Republican)"])
].copy()

# create vote_trump binary variable
ces24_subset['vote_trump'] = np.where(ces24_subset['CC24_364a'] == 'Donald Trump (Republican)', 1, 0)
ces20_subset['vote_trump'] = np.where(ces20_subset['CC20_364a'] == 'Donald Trump (Republican)', 1, 0)
ces16_subset['vote_trump'] = np.where((ces16_subset['CC16_410a'] == 'Donald Trump (Republican)'), 1, 0)

# 2024: derive age and age bracket, then keep only the modeling columns
ces24_subset['age'] = 2024 - ces24_subset['birthyr']

ces24_subset['age_bracket'] = pd.cut(
    ces24_subset['age'],
    bins=[17, 24, 34, 44, 54, 64, 74, 100],
    labels=[
        '18–24', '25–34', '35–44', '45–54', '55–64', '65–74', '75+'
    ]
)

ces24_subset = ces24_subset.copy()[[
    'vote_trump',
    'age_bracket',
    'gender4',
    'race',
    'hispanic',
    'educ',
    'marstat',
    'inputstate',
    'region',
    'urbancity',
    'religpew',
    'pew_religimp',
    'pew_churatd'
]]

# 2020
ces20_subset['age'] = 2020 - ces20_subset['birthyr']

ces20_subset['age_bracket'] = pd.cut(
    ces20_subset['age'],
    bins=[17, 24, 34, 44, 54, 64, 74, 100],
    labels=[
        '18–24', '25–34', '35–44', '45–54', '55–64', '65–74', '75+'
    ]
)

ces20_subset = ces20_subset.copy()[[
    'vote_trump',
    'age_bracket',
    'gender',
    'race',
    'hispanic',
    'educ',
    'marstat',
    'inputstate',
    'region',
    'urbancity',
    'religpew',
    'pew_religimp',
    'pew_churatd'
]]

# categorical predictors
predictors24 = ces24_subset.columns.drop('vote_trump')
ces24_subset[predictors24] = ces24_subset[predictors24].astype('category')
predictors20 = ces20_subset.columns.drop('vote_trump')
ces20_subset[predictors20] = ces20_subset[predictors20].astype('category')



In [117]:
ces24_clean = ces24_subset.copy().dropna()
ces20_clean = ces20_subset.copy().dropna()
ces24_clean

Unnamed: 0,vote_trump,age_bracket,gender4,race,hispanic,educ,marstat,inputstate,region,urbancity,religpew,pew_religimp,pew_churatd
7,0,55–64,Woman,Black,No,4-year,Divorced,Michigan,Midwest,Suburb,Protestant,Very important,More than once a week
86,0,75+,Woman,Black,No,Some college,Married,Kentucky,South,Town,Protestant,Very important,Once a week
144,0,65–74,Woman,White,No,Post-grad,Domestic / civil partnership,Vermont,Northeast,Rural area,Protestant,Not too important,Seldom
202,0,18–24,Man,White,No,Some college,Never married,New York,Northeast,Suburb,Agnostic,Not too important,Once or twice a month
228,0,75+,Woman,White,No,4-year,Divorced,Delaware,South,Town,Nothing in particular,Not at all important,Never
...,...,...,...,...,...,...,...,...,...,...,...,...,...
59986,0,25–34,Woman,Asian,No,Post-grad,Married,Michigan,Midwest,Suburb,Something else,Somewhat important,Never
59990,0,25–34,Woman,Asian,No,4-year,Married,Virginia,South,Suburb,Muslim,Somewhat important,A few times a year
59991,0,18–24,Woman,White,No,High school graduate,Never married,Texas,South,City,Nothing in particular,Not too important,Never
59996,0,35–44,Man,White,No,Post-grad,Married,Illinois,Midwest,City,Agnostic,Not at all important,Seldom


In [123]:
# create train and test splits for 2024 data
# Separate response and predictors
# I used ChatGPT for these steps
y_24 = ces24_clean.iloc[:, 0]     # vote_trump (binary target)
X_24 = ces24_clean.iloc[:, 1:]    # all categorical predictors
categorical_cols_24 = X_24.columns.tolist()

ohe_24 = OneHotEncoder(handle_unknown='ignore', drop='first')
preprocessor_24 = ColumnTransformer([('cat', ohe_24, categorical_cols_24)])

pipeline_24 = Pipeline([
    ('preprocessor', preprocessor_24),
    ('classifier', LogisticRegression(max_iter=2000))
])

# split into train (60%), validation (20%), and test (20%) sets;
# splitting X and y in the same call keeps rows and labels aligned
X_train_val24, X_test24, y_train_val24, y_test24 = train_test_split(
    X_24, y_24, test_size=0.2, random_state=42
)
X_train24, X_val24, y_train24, y_val24 = train_test_split(
    X_train_val24, y_train_val24, test_size=0.25, random_state=42
)

# could also just do a single train-test split??
# X_train24, X_test24, y_train24, y_test24 = train_test_split(X_24, y_24, test_size=0.2, random_state=42)


In [110]:
# train and test splits for 2020
y_20 = ces20_clean.iloc[:, 0]
X_20 = ces20_clean.iloc[:, 1:]
categorical_cols_20 = X_20.columns.tolist()

ohe_20 = OneHotEncoder(handle_unknown='ignore', drop='first')
preprocessor_20 = ColumnTransformer([('cat', ohe_20, categorical_cols_20)])

pipeline_20 = Pipeline([
    ('preprocessor', preprocessor_20),
    ('classifier', LogisticRegression(max_iter=1000))
])

# splits: train (60%), validation (20%), test (20%), with X and y split together
X_train_val20, X_test20, y_train_val20, y_test20 = train_test_split(
    X_20, y_20, test_size=0.2, random_state=42
)
X_train20, X_val20, y_train20, y_val20 = train_test_split(
    X_train_val20, y_train_val20, test_size=0.25, random_state=42
)

## Models

Based on existing vote choice modeling research, I started out with the following logistic regression models to predict support for Trump:
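
The specification itself is not written out in this notebook. Judging from the pipelines defined in the Data section, the baseline appears to be a main-effects model in which every predictor is categorical and one-hot encoded with its first level dropped as the reference. Under that assumption, with $y_i = 1$ if respondent $i$ reports voting for Trump, it can be written as:

$$
\operatorname{logit}\,\Pr(y_i = 1) = \beta_0 + \sum_{j=1}^{J} \sum_{k=2}^{K_j} \beta_{jk}\, \mathbf{1}\{x_{ij} = k\}
$$

Here $j$ indexes the $J$ categorical predictors (age bracket, gender, race, Hispanic origin, education, marital status, state, region, urbanicity, religion, importance of religion, church attendance), $K_j$ is the number of levels of predictor $j$, and $x_{ij}$ is respondent $i$'s level on predictor $j$. Note that scikit-learn's `LogisticRegression` applies an L2 penalty by default, so the fitted $\beta_{jk}$ are shrunk toward zero rather than being plain maximum-likelihood estimates.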


### Machine learning model evaluations

In [131]:
# Fit the model
pipeline_24.fit(X_train24, y_train24)
y_pred24 = pipeline_24.predict(X_test24)
print(classification_report(y_test24, y_pred24))
# recall for Trump voters (class 1) is only 0.54, so the model misses almost half of them

              precision    recall  f1-score   support

           0       0.82      0.88      0.85      1309
           1       0.64      0.54      0.58       540

    accuracy                           0.78      1849
   macro avg       0.73      0.71      0.72      1849
weighted avg       0.77      0.78      0.77      1849



In [128]:
pipeline_20.fit(X_train20, y_train20)
y_pred20 = pipeline_20.predict(X_test20)
print(classification_report(y_test20, y_pred20))
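# Trump voters are only about 22% of this test set (475 of 2124), so they might be undersampled
# precision and recall are high for Biden voters (class 0), but recall for Trump voters is just 0.40
# a larger or better-balanced sample might make this model more accurate for Trump voters???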

              precision    recall  f1-score   support

           0       0.84      0.92      0.88      1649
           1       0.59      0.40      0.47       475

    accuracy                           0.80      2124
   macro avg       0.72      0.66      0.68      2124
weighted avg       0.79      0.80      0.79      2124



### Classical statistics model evaluations

Can we remove some predictors? Are any predictors correlated with one another? Should interaction terms be used?
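
One way to approach the first and third questions is a likelihood ratio test between nested models. The sketch below uses simulated data (the `toy` frame and its coefficients are made up, not CCES results), a hypothetical `likelihood_ratio_test` helper, and unweighted `statsmodels` fits, so it only illustrates the mechanics.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2


def likelihood_ratio_test(full, reduced):
    """LRT for two nested models fit by maximum likelihood on the same rows."""
    stat = 2 * (full.llf - reduced.llf)           # test statistic
    df_diff = full.df_model - reduced.df_model    # number of extra parameters in the full model
    p_value = chi2.sf(stat, df_diff)              # upper tail of the chi-squared distribution
    return stat, df_diff, p_value


# Simulated stand-in data (NOT real survey responses).
rng = np.random.default_rng(1)
n = 3000
toy = pd.DataFrame({
    "race": rng.choice(["White", "Black", "Hispanic"], size=n),
    "educ": rng.choice(["No college", "College"], size=n),
    "gender": rng.choice(["Man", "Woman"], size=n),
})
eta = (
    0.2
    - 1.2 * (toy["race"] == "Black")
    - 0.5 * (toy["educ"] == "College")
    - 0.3 * (toy["gender"] == "Woman")
).to_numpy()
toy["vote_trump"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))

# Nested specifications: no race, race as a main effect, race interacted with education.
no_race = smf.logit("vote_trump ~ C(educ) + C(gender)", data=toy).fit(disp=False)
main = smf.logit("vote_trump ~ C(race) + C(educ) + C(gender)", data=toy).fit(disp=False)
interact = smf.logit("vote_trump ~ C(race) * C(educ) + C(gender)", data=toy).fit(disp=False)

lr_race = likelihood_ratio_test(main, no_race)     # can race be dropped?
lr_inter = likelihood_ratio_test(interact, main)   # does the race x education interaction help?

print("drop race:        stat=%.2f, df=%d, p=%.4g" % lr_race)
print("race x education: stat=%.2f, df=%d, p=%.4g" % lr_inter)
```

A small p-value in the first test suggests race should stay in the model; a small p-value in the second suggests the effect of education differs by race. When applying this to the real data, note that `region` is fully determined by `inputstate`, so an unpenalized model containing both will have a rank-deficient design matrix.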

In [139]:
import statsmodels.api as sm
y_train24_  = y_train24.astype(int)  # or float
# dtype=float matters here: in pandas >= 2.0 get_dummies returns bool columns, and
# statsmodels then most likely fails with
# "ValueError: Pandas data cast to numpy dtype of object" when the model is built
X_train_24_ = pd.get_dummies(X_train24, drop_first=True, dtype=float)
X_train_24_ = sm.add_constant(X_train_24_)

## How has the strength of race as a vote choice predictor changed over time?



## What could this mean?

## Next steps

## Scratch cells

In [90]:
# One-hot encode categorical columns
# X_train24 = pd.get_dummies(X_train24, drop_first=True)
# X_val24 = pd.get_dummies(X_val24, drop_first=True)
# X_test24 = pd.get_dummies(X_test24, drop_first=True)

# # # Add constant for the model
# X_train24 = sm.add_constant(X_train24)
# X_val24 = sm.add_constant(X_val24)
# X_test24 = sm.add_constant(X_test24)

# # 2024 initial logistic model
# result1 = model1.fit()
# print(result1.summary())