<a href="https://colab.research.google.com/github/taliafabs/sta496/blob/main/STA496_LearningDiary1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vote Choice Modeling Learning Process
I document my vote choice modeling learning process in this notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## 2024 CCES data from Harvard Dataverse
I'll use this dataset (and previous versions) a lot throughout my research.

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X11EP6

In [None]:
# Loading the 2024 CCES data
from google.colab import drive
drive.mount('/content/drive')

ces24_dta = pd.read_stata("/content/drive/MyDrive/CES24_Common.dta")
ces24_csv = pd.read_csv("/content/drive/MyDrive/CES24_Common.csv")

ces24_csv.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0.1,Unnamed: 0,caseid,tookpost,commonweight,commonpostweight,CCEStake,add_confirm,inputzip,birthyr,gender4,...,page_CC24_445_timing,page_gunown_timing,page_numchildren_timing,page_gigwork_timing,page_edloan_timing,page_student_timing,starttime,endtime,starttime_post,endtime_post
0,1,1853651564,2,0.418897,0.180057,1,1.0,,1978,2,...,13.454,0.0,4.736,5.564,0.0,0.0,2024-10-01 20:07:13,2024-10-01 20:23:35,2024-11-09 03:00:05,2024-11-09 03:09:39
1,2,1853655732,2,0.94665,0.700897,1,,11236.0,1968,1,...,4.454,0.0,15.087,9.487,0.0,0.0,2024-10-01 20:10:39,2024-10-01 20:26:32,2024-11-10 22:46:35,2024-11-10 22:56:59
2,3,1852716424,2,0.194303,0.046275,1,1.0,,1946,2,...,21.204,0.0,16.13,9.494,0.0,0.0,2024-10-01 20:01:14,2024-10-01 20:33:21,2024-11-11 03:52:52,2024-11-11 04:11:38
3,4,1853644254,2,0.083223,0.008897,1,,90039.0,2001,2,...,64.011,0.0,4.617,55.219,0.0,2.347,2024-10-01 20:00:55,2024-10-01 20:38:52,2024-12-08 03:31:26,2024-12-08 04:30:16
4,5,1853644132,2,0.596598,0.555395,1,1.0,,1955,2,...,164.473,8.722,6.011,37.659,5.242,0.0,2024-10-01 19:59:59,2024-10-01 20:51:51,2024-11-06 23:47:48,2024-11-07 00:27:52
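
Since only a handful of the CES columns get used later on, one option is to read just those columns from the `.dta` file instead of the whole thing. Below is a minimal sketch of the `columns` argument of `pd.read_stata`, run on a small toy file written to a temporary directory. The toy values and the file name `toy.dta` are made up for illustration and are not CES data.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for CES24_Common.dta (hypothetical values, not real CES data)
toy = pd.DataFrame({
    "caseid": [1, 2, 3],
    "birthyr": [1978, 1968, 1946],
    "page_gunown_timing": [0.0, 0.0, 8.7],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.dta")
    toy.to_stata(path, write_index=False)
    # columns= restricts which variables are read from the Stata file
    subset = pd.read_stata(path, columns=["caseid", "birthyr"])

print(subset.columns.tolist())
```

The same `columns=[...]` list could be passed when reading the real file, assuming the variable names match the ones selected in the cleaning cell.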


In [None]:
# cleaning the data and selecting columns
ces24 = ces24_dta.copy()

# subset to demographic, religion, and vote columns
# CC24_364a asks who they voted for for president in 2024; CC24_364b asks which candidate they prefer
# these are classic predictors of vote choice (similar to ones I previously used)

# choose only those who responded to the post survey
# ces24 = ces24[ces24['tookpost'] == 2]

ces24 = ces24[['gender4', 'race', 'hispanic', 'educ', 'marstat', 'inputstate', 'birthyr',
                  'ownhome', 'urbancity', 'industry',
                  'religpew', 'pew_religimp', 'pew_churatd', 'pew_prayer',
                  'CC24_361b', 'CC24_363', 'CC24_364a', 'CC24_364b',
                  ]]

# # number of people per response
# # why are there 41k+ NA presidential votes
# ces24_rd["CC24_364a"].value_counts(dropna=False)
# ces24_rd["CC24_364b"].value_counts(dropna=False)

# # create a presvote24 column
# # if they prefer harris, voted for harris, are a democrat, or strongly approve of her

# ces24_rd_pres = ces24_rd[(ces24_rd['CC24_364a'] == 1.0) | (ces24_rd['CC24_364b'] == 2.0)
#                        | (ces24_rd['CC24_364b'] == 1.0) | (ces24_rd['CC24_364b'] == 2.0)]

ces24.head()

Unnamed: 0,gender4,race,hispanic,educ,marstat,inputstate,birthyr,ownhome,urbancity,industry,religpew,pew_religimp,pew_churatd,pew_prayer,CC24_361b,CC24_363,CC24_364a,CC24_364b
0,Woman,Black,No,High school graduate,Divorced,Pennsylvania,1978,Own,City,,Protestant,Not too important,Never,Never,Democratic Party,"Yes, definitely",,Kamala Harris (Democrat)
1,Man,Hispanic,Yes,4-year,Married,New York,1968,Own,City,Transportation and warehousing,Roman Catholic,Very important,Once a week,Once a day,Republican Party,"Yes, definitely",,Kamala Harris (Democrat)
2,Woman,White,No,2-year,Widowed,Pennsylvania,1946,Own,Suburb,Health care and social assistance,Something else,Not at all important,Never,Once a day,"No Party, Independent, Declined to State","Yes, definitely",,Kamala Harris (Democrat)
3,Woman,White,Yes,High school graduate,Domestic / civil partnership,California,2001,Own,Suburb,,Something else,Somewhat important,Seldom,A few times a month,Democratic Party,Probably,,Kamala Harris (Democrat)
4,Woman,White,No,High school graduate,Widowed,Montana,1955,Rent,Town,Other services,Protestant,Very important,More than once a week,Several times a day,,"Yes, definitely",,Other


In [None]:
ces24['CC24_364a'].value_counts(dropna=False)

CC24_364a,count
NaN,50432
Kamala Harris (Democrat),6392
Donald Trump (Republican),2860
Someone else,261
I'm not sure,41
I didn't vote in this election,14
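
Most respondents are missing on `CC24_364a`. In the rows printed earlier, the respondents with a missing `CC24_364a` do have an answer on `CC24_364b`, which suggests each person answers only one of the two items. Assuming that reading is right (I have not checked it against the codebook), here is a minimal sketch of coalescing the two into one column on toy data. The names `pres_pref` and `pref_source` are hypothetical.

```python
import pandas as pd

# Toy data mimicking the pattern above: each row answers only one item
toy = pd.DataFrame({
    "CC24_364a": [None, "Kamala Harris (Democrat)", None, "Donald Trump (Republican)"],
    "CC24_364b": ["Kamala Harris (Democrat)", None, "Other", None],
})

# Use the reported vote where present, otherwise fall back to the stated preference
toy["pres_pref"] = toy["CC24_364a"].combine_first(toy["CC24_364b"])

# Keep track of which item each value came from
toy["pref_source"] = toy["CC24_364a"].notna().map({True: "voted", False: "prefers"})

print(toy)
```

In the real data both columns are categorical with different label sets, so casting them to plain objects first (for example with `.astype(object)`) is probably the safer route before combining.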


## "RaceDep" and what that could mean for vote choice modeling

This section contains some data cleaning, exploratory data analysis, and logistic regression modeling to explore "racedep".


* 2024 post-election surveys suggest that President Donald Trump gained support among non-white voters, especially Latino voters and, to a lesser extent, Black voters (Sides 2024).
* Trump gained support among Black and Latino voters in 2020 (compared to 2016) (Ghitza and Robinson 2020; Igelnik et al. 2020).

This raises some interesting questions:
* Is race alone a strong predictor of vote choice?
* How does race interact with other predictors of vote choice, such as gender, education, and religion?

In this vote choice modeling colab, I will build some simple logistic regression models and train them on 2024 CES post-election data to predict vote preference.

I'll start by using classical predictors: race, gender, education, state, type of area the individual resides in, civic engagement, and religion.

This research is preliminary; I'll have to dive into it deeper.

#### Ideas for studying "racedep"
In my midterm paper, I will build two logistic regression vote choice models using 2024 CES post-election survey data to determine the extent to which race predicts vote choice and how it interacts with other predictors such as age, education, income, and religion.





#### data cleaning for studying "racedep"

In [None]:
# presidential vote data
ces24_pres = ces24[
    (ces24['CC24_364a'] == 'Kamala Harris (Democrat)') |
    (ces24['CC24_364a'] == 'Donald Trump (Republican)')
]

# create the binary outcome and make the predictors categorical
ces24_pres['vote_trump'] = np.where(ces24_pres['CC24_364a'] == 'Donald Trump (Republican)', 1, 0)
ces24_pres['gender4'] = ces24_pres['gender4'].astype('category')
ces24_pres['race'] = ces24_pres['race'].astype('category')
ces24_pres['hispanic'] = ces24_pres['hispanic'].astype('category')
ces24_pres['educ'] = ces24_pres['educ'].astype('category')
ces24_pres['marstat'] = ces24_pres['marstat'].astype('category')
ces24_pres['state'] = ces24_pres['inputstate'].astype('category')
ces24_pres['age'] = 2024 - ces24_pres['birthyr']
ces24_pres['age_bracket'] = pd.cut(
    ces24_pres['age'],
    bins=[17, 24, 34, 44, 54, 64, 74, 100],
    labels=[
        '18–24', '25–34', '35–44', '45–54', '55–64', '65–74', '75+'
    ]
)
ces24_pres['urbancity'] = ces24_pres['urbancity'].astype('category')
ces24_pres['pew_religimp'] = ces24_pres['pew_religimp'].astype('category')
ces24_pres['religpew'] = ces24_pres['religpew'].astype('category')


# select relevant columns for modeling
ces24_pres = ces24_pres[[
    'vote_trump',
    'CC24_364a',
    'gender4',
    'race',
    'hispanic',
    'educ',
    'marstat',
    'state',
    'age_bracket',
    'urbancity',
    # 'pew_religimp',
    'religpew'
]]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ces24_pres['vote_trump'] = np.where(ces24_pres['CC24_364a'] == 'Donald Trump (Republican)', 1, 0)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ces24_pres['gender4'] = ces24_pres['gender4'].astype('category')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ces24_pres['race'] = ces24_pres['race'].as
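
The `SettingWithCopyWarning` messages above show up because `ces24_pres` was created by filtering `ces24`, and new columns were then assigned to that filtered result. A common remedy is to take an explicit `.copy()` right after filtering. Here is a minimal sketch on toy data, where the name `two_party` is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy data with the same vote labels as CC24_364a
toy = pd.DataFrame({
    "CC24_364a": [
        "Kamala Harris (Democrat)",
        "Donald Trump (Republican)",
        "Someone else",
    ],
    "birthyr": [1978, 1968, 2001],
})

# Filter to two-party voters, then copy so the result is an independent DataFrame
two_party = toy[
    toy["CC24_364a"].isin(["Kamala Harris (Democrat)", "Donald Trump (Republican)"])
].copy()

# These assignments now modify two_party only, without the warning
two_party["vote_trump"] = np.where(
    two_party["CC24_364a"] == "Donald Trump (Republican)", 1, 0
)
two_party["age"] = 2024 - two_party["birthyr"]

print(two_party)
```

The original `toy` frame is left untouched, which is also what I want for `ces24`.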

In [None]:
type(ces24_pres)

In [None]:
ces24_pres.dropna(inplace=True)

# get the shape
ces24_pres.shape

(9246, 11)
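
Before dropping incomplete rows, it can help to see which columns the missing values come from. This is a minimal sketch on toy data (the values are made up), counting missing values per column and the number of rows that `dropna` removes:

```python
import pandas as pd

# Toy data with a few missing predictor values
toy = pd.DataFrame({
    "vote_trump": [0, 1, 0, 1],
    "educ": ["4-year", None, "2-year", "Post-grad"],
    "religpew": ["Protestant", "Agnostic", None, "Protestant"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Rows that survive when any missing value disqualifies a row
complete = toy.dropna()

print(missing_per_column)
print(len(toy) - len(complete), "rows dropped")
```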

In [None]:
# build a logistic regression model predicting vote_trump
# note: vote_trump is binary

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import statsmodels.formula.api as smf

In [None]:
ces24_pres.head()

Unnamed: 0,vote_trump,CC24_364a,gender4,race,hispanic,educ,marstat,state,age_bracket,urbancity,religpew
7,0,Kamala Harris (Democrat),Woman,Black,No,4-year,Divorced,Michigan,55–64,Suburb,Protestant
86,0,Kamala Harris (Democrat),Woman,Black,No,Some college,Married,Kentucky,75+,Town,Protestant
144,0,Kamala Harris (Democrat),Woman,White,No,Post-grad,Domestic / civil partnership,Vermont,65–74,Rural area,Protestant
202,0,Kamala Harris (Democrat),Man,White,No,Some college,Never married,New York,18–24,Suburb,Agnostic
228,0,Kamala Harris (Democrat),Woman,White,No,4-year,Divorced,Delaware,75+,Town,Nothing in particular


In [None]:
ces24_pres['CC24_364a'].value_counts(dropna=False)

CC24_364a,count
Kamala Harris (Democrat),6388
Donald Trump (Republican),2858
Someone else,0
I'm not sure,0
I didn't vote in this election,0
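
The zero-count rows appear because filtering rows does not remove levels from a categorical column. The same leftover levels may exist in predictors such as `race` or `religpew`, and they could carry into the model's design matrix as empty columns. A minimal sketch, on toy data, of dropping them with `cat.remove_unused_categories()`:

```python
import pandas as pd

# Categorical series that still carries levels with no observations
votes = pd.Series(
    [
        "Kamala Harris (Democrat)",
        "Donald Trump (Republican)",
        "Kamala Harris (Democrat)",
    ],
    dtype=pd.CategoricalDtype([
        "Kamala Harris (Democrat)",
        "Donald Trump (Republican)",
        "Someone else",
        "I'm not sure",
    ]),
)

# value_counts lists unused levels with a count of 0
print(votes.value_counts())

# Drop the levels that never occur
trimmed = votes.cat.remove_unused_categories()
print(trimmed.value_counts())
```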


In [None]:
formula = """
vote_trump ~ gender4 + race + educ + marstat + state + age_bracket + urbancity + religpew
            + race:gender4 + race:educ
"""


In [None]:
formula_norace = """
vote_trump ~ gender4 + educ + marstat + state + age_bracket + urbancity + religpew
"""

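One way to use the two formulas is to fit both with `smf.logit` and compare them with a likelihood ratio test, since the no-race model is nested in the race model. The sketch below does this on synthetic data with simplified formulas. The data-generating process, the effect sizes, and the shortened predictor list are all assumptions for illustration, not CES results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Synthetic respondents (hypothetical, not CES data)
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "race": rng.choice(["White", "Black", "Hispanic"], size=n),
    "gender4": rng.choice(["Man", "Woman"], size=n),
    "educ": rng.choice(["No college", "College"], size=n),
})

# Assumed log-odds of voting Trump: shifts by race and gender only
log_odds = (
    -0.2
    + toy["race"].map({"White": 0.6, "Black": -1.2, "Hispanic": -0.2})
    + toy["gender4"].map({"Man": 0.3, "Woman": -0.3})
).to_numpy()
toy["vote_trump"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Model with race (plus a race-by-gender interaction) and model without race
full = smf.logit(
    "vote_trump ~ gender4 + race + educ + race:gender4", data=toy
).fit(disp=False)
reduced = smf.logit("vote_trump ~ gender4 + educ", data=toy).fit(disp=False)

# Likelihood ratio test: twice the log-likelihood gap, chi-square with
# degrees of freedom equal to the number of extra parameters
lr_stat = 2 * (full.llf - reduced.llf)
df_diff = full.df_model - reduced.df_model
p_value = chi2.sf(lr_stat, df_diff)

print(f"LR statistic = {lr_stat:.1f}, df = {df_diff:.0f}, p = {p_value:.3g}")
```

A small p-value here only says that the race terms improve fit on this synthetic data. On the real data I would also want to check held-out accuracy, since `train_test_split` and `classification_report` are already imported.
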
#### exploratory data analysis