Model Evaluation for Kavanaugh Data
Overview
This documentation summarizes the methods used to evaluate data from the paper "They Saw a Hearing: Democrats’ and Republicans’ Perceptions of and Responses to the Ford-Kavanaugh Hearings" by Grisham et al. (2023). The analysis fits a linear regression model and a multinomial logistic regression model to the survey data, in each case predicting the response Q2_1 from the remaining survey items and demographic variables.

Data Preparation
Step 1: Load the Dataset
The dataset is read from a CSV file named kavanaughdata.csv into a pandas DataFrame.

python
import pandas as pd
df = pd.read_csv('kavanaughdata.csv')
Step 2: One-Hot Encoding
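For linear regression:

The notebook computes a one-hot encoding of Q2_2 (Q2_2_one_hot), but the result is not added to the feature matrix, so the model is fit on the original integer-coded columns.
For logistic regression:

One-hot encode the categorical columns Q2_2, Q2_3, Q2_4, Q2_5, PARTYID7, GENDER, RACETHNICITY, EDUC4, and INCOME, dropping the first level of each (drop_first=True).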
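As a minimal sketch of what this encoding does, the toy frame below (invented values, not the survey data) is passed through pd.get_dummies with drop_first=True, the same call the notebook uses for the logistic regression inputs. The frame name toy and its contents are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are invented for illustration.
toy = pd.DataFrame({
    "Q2_2": [1, 3, 1, 2],
    "GENDER": [1, 2, 2, 1],
    "AGE": [41, 59, 70, 68],
})

# drop_first=True drops the lowest level of each encoded column,
# so a column with k levels becomes k-1 indicator columns.
encoded = pd.get_dummies(toy, columns=["Q2_2", "GENDER"], drop_first=True)
print(list(encoded.columns))
```

Columns that are not listed in columns (AGE here) pass through unchanged, which is why numeric variables such as AGE still appear as single coefficients in the logistic regression output.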
Step 3: Feature Selection
For both models, the following columns are excluded:

Q12 (open-ended text)
CaseId (identifier)
The target variable for prediction is Q2_1.

Linear Regression
Step 4: Data Splitting
Split the data into training and testing sets with an 80-20 split.

python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
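To make the split sizes concrete, here is a small sketch using placeholder arrays with the same number of rows as the survey data (2474, as shown in the notebook output). The arrays X_demo and y_demo are hypothetical; only their shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 2474 rows; the contents are arbitrary.
X_demo = np.zeros((2474, 3))
y_demo = np.zeros(2474)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The test-set size is rounded up; the remaining rows go to training.
print(len(X_tr), len(X_te))
```

Fixing random_state makes the shuffle reproducible, so the same rows land in the test set on every run.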
Step 5: Model Initialization and Training
Initialize and train a linear regression model.

python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
Step 6: Prediction and Evaluation
Predict the target variable on the test set and evaluate the model using Mean Squared Error (MSE).

python
from sklearn.metrics import mean_squared_error
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
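An MSE value is easier to judge next to two reference points: its square root (RMSE), which is on the same scale as the target, and the MSE of a trivial baseline that always predicts the mean. The sketch below illustrates both on invented numbers; y_true and y_hat are hypothetical, and for simplicity the baseline uses the mean of the same targets, whereas in practice it would typically come from the training set.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented targets and predictions, for illustration only.
y_true = np.array([1, 1, 2, 3, 5, 1])
y_hat = np.array([1.2, 0.8, 2.5, 2.0, 3.5, 1.5])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)  # back on the scale of the target

# Baseline: always predict the mean of the targets.
baseline_pred = np.full_like(y_hat, y_true.mean())
baseline_mse = mean_squared_error(y_true, baseline_pred)

print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}, baseline MSE: {baseline_mse:.3f}")
```

A model is only informative to the extent that its MSE falls below the baseline's. For the notebook's reported MSE of about 2.05, the RMSE is roughly 1.43 points on the Q2_1 scale.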
Step 7: Model Coefficients
Display the coefficients of the trained linear regression model.

python
coef_df = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
print(coef_df)
Logistic Regression
Step 8: Data Splitting
Split the encoded data into training and testing sets with an 80-20 split.

python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 9: Model Initialization and Training
Initialize and train a logistic regression model with multinomial classification and a maximum of 10,000 iterations.

python
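from sklearn.linear_model import LogisticRegression
# Note: multi_class is deprecated in scikit-learn 1.5+; with the default lbfgs solver, multiclass targets are already fit as multinomial
model = LogisticRegression(max_iter=10000, multi_class='multinomial')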
model.fit(X_train, y_train)
Step 10: Prediction and Evaluation
Predict the target variable on the test set and evaluate the model using accuracy score.

python
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
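Accuracy can look high simply because one response dominates, so it helps to compare it with a majority-class baseline. The sketch below uses scikit-learn's DummyClassifier on invented, imbalanced labels; the arrays and class proportions are assumptions and say nothing about the actual distribution of Q2_1.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Invented, imbalanced labels standing in for a target like Q2_1.
y_train_demo = np.array([1] * 70 + [2] * 20 + [3] * 10)
y_test_demo = np.array([1] * 14 + [2] * 4 + [3] * 2)

# DummyClassifier ignores the features, so zeros are enough here.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)
baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print(f"Majority-class accuracy: {baseline_acc:.2f}")
```

Running the same comparison on the real split would show how much of the reported accuracy comes from the model rather than from class imbalance.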
Step 11: Model Coefficients
Display the coefficients of the trained logistic regression model.

python
coef_df = pd.DataFrame(model.coef_, columns=X.columns)
print(coef_df)
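For a multiclass model, coef_ holds one row per class, ordered as in the fitted model's classes_ attribute. The printed frame above labels the rows 0, 1, 2, and so on, which are positions rather than response values. A minimal sketch of labeling the rows by class, on a synthetic three-class problem with placeholder column names f1 and f2:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Small synthetic three-class problem; names and values are placeholders.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(150, 2)), columns=["f1", "f2"])
y_demo = np.repeat([1, 2, 3], 50)
X_demo["f1"] += y_demo  # shift f1 by class so the classes are separable

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# One row of coefficients per class, in the order given by clf.classes_.
coef_demo = pd.DataFrame(clf.coef_, index=clf.classes_, columns=X_demo.columns)
coef_demo.index.name = "class"
print(coef_demo)
```

Applied to the notebook's model, the same index=model.classes_ argument would tie each coefficient row to the Q2_1 value it belongs to.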

In [1]:
import pandas as pd
import numpy as np

In [15]:
# Read the CSV file into a DataFrame
df = pd.read_csv('kavanaughdata.csv')

In [16]:
df

,CaseId,VOTEDT,Q2_1,Q2_2,Q2_3,Q2_4,Q2_5,Q6_1,Q6_2,Q6_3,...,Q7_1,Q8_1,Q9_1,Q12,PARTYID7,GENDER,AGE,RACETHNICITY,EDUC4,INCOME
0,63,1,1,1,1,2,1,0,0,0,...,1,1,1,i think this is a distraction this nonsense is...,6,1,41,1,4,9
1,66,2,1,3,1,4,3,1,1,1,...,2,2,2,i have engaged in behaviors i deeply regret an...,2,1,59,1,2,15
2,67,1,1,1,1,1,1,0,0,0,...,2,2,2,i think it is terrible for someone to accuse a...,7,2,70,1,2,11
3,76,1,1,2,2,1,1,1,0,2,...,2,2,2,i feel that the judge should not be placed on ...,2,2,68,2,4,14
4,88,1,1,1,1,2,1,3,2,2,...,2,2,2,the amount of coverage this issue is getting i...,1,1,64,1,4,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2469,13550,2,3,3,3,3,3,5,0,0,...,1,1,1,it is sad to see how badly drford was treated ...,2,2,56,2,3,6
2470,13558,2,1,1,1,1,1,5,0,0,...,2,2,2,i agree with the decision,6,1,79,1,4,15
2471,13563,2,1,1,1,1,1,1,3,2,...,2,2,2,the fbi investigation was rushed constrained a...,3,1,67,1,4,18
2472,13564,2,2,3,2,2,2,1,2,1,...,2,2,2,i believe in dr ford,4,1,53,6,4,8


In [20]:
df['Q2_2']

0       1
1       3
2       1
3       2
4       1
       ..
2469    3
2470    1
2471    1
2472    3
2473    5
Name: Q2_2, Length: 2474, dtype: int64

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

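# One-hot encode Q2_2 (note: this result is not used when building X below,
# so the model is fit on the raw integer codes of Q2_2)
Q2_2_one_hot = pd.get_dummies(df['Q2_2'], prefix='Q2_2')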

# Drop Q12 (open-ended text) and non-feature columns (CaseId)
X = df.drop(columns=['Q2_1', 'Q12', 'CaseId'])
y = df['Q2_1']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Display the coefficients of the model
coef_df = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
print(coef_df)


Mean Squared Error: 2.046846330448367
              Coefficient
VOTEDT           0.033429
Q2_2             0.153140
Q2_3             0.211909
Q2_4             0.003262
Q2_5             0.217835
Q6_1             0.004654
Q6_2            -0.000806
Q6_3            -0.002675
Q6_4             0.005542
Q6_5            -0.004796
Q7_1             0.002575
Q8_1             0.001471
Q9_1            -0.004190
PARTYID7        -0.034130
GENDER           0.076470
AGE              0.001849
RACETHNICITY     0.033591
EDUC4            0.022268
INCOME          -0.007932


In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Convert specified columns into one-hot encoded vectors
columns_to_encode = ['Q2_2', 'Q2_3', 'Q2_4', 'Q2_5', 'PARTYID7', 'GENDER', 'RACETHNICITY', 'EDUC4', 'INCOME']
df_encoded = pd.get_dummies(df, columns=columns_to_encode, drop_first=True)

# Drop unnecessary columns for training
X = df_encoded.drop(columns=['Q2_1', 'Q12', 'CaseId'])
y = df['Q2_1']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the logistic regression model
model = LogisticRegression(max_iter=10000, multi_class='multinomial')
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Display the coefficients of the model
coef_df = pd.DataFrame(model.coef_, columns=X.columns)
print(coef_df)


Accuracy: 0.8222222222222222
     VOTEDT      Q6_1      Q6_2      Q6_3      Q6_4      Q6_5      Q7_1  \
0 -0.132437  0.061311  0.004491  0.010011  0.071029  0.068740  0.494551   
1 -0.112940  0.069115  0.002262  0.022216  0.064211  0.056736  0.188671   
2 -0.033040  0.072687 -0.002818  0.012492  0.074651  0.072916  0.528066   
3  0.137827  0.045014  0.031447 -0.014581  0.078692  0.051149 -0.510160   
4 -0.024853  0.078338 -0.029215  0.068499  0.086323  0.037261 -0.416434   
5  0.165443 -0.326466 -0.006166 -0.098637 -0.374905 -0.286803 -0.284695   

       Q8_1      Q9_1       AGE  ...  INCOME_9  INCOME_10  INCOME_11  \
0  0.172072  0.282503 -0.007788  ... -0.288896  -0.102483   0.067090   
1  0.050214  0.193053 -0.007376  ... -0.025711  -0.073061   0.213003   
2 -0.269141 -0.077394 -0.002759  ...  0.047378  -0.351414  -0.050280   
3  0.200985 -0.238879  0.002526  ... -0.538686   0.596718   0.639951   
4 -0.299034  0.177332 -0.017111  ...  0.320098   0.015170  -0.821980   
5  0.144905 -