In [2]:
import pandas as pd
import sys, os
sys.path.insert(0, os.path.abspath(".."))
from src.assessment_utilities import compute_features
from src.assessment_utilities import predict_increased_trump_approval

In [3]:
df = pd.read_csv('../data/full_dataset.csv')

In [4]:
missing_counts = df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

video_id        1383
text            1413
maxdiff_mean    1413
sample_size     1413
dtype: int64


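The counts don't match. video_id is missing in 1,383 rows, but text, maxdiff_mean, and sample_size are each missing in 1,413. The extra 30 rows belong to video 55, which has no text, maxdiff, or sample size.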
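One way to confirm which videos account for the gap is to filter for rows that have a video_id but no text. The sketch below uses a small made-up frame (`toy`, with hypothetical values) rather than the real dataset; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for full_dataset.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "video_id": [np.nan, 12, 55, 55, 7],
    "text": [np.nan, "ad a", np.nan, np.nan, "ad b"],
    "maxdiff_mean": [np.nan, 0.51, np.nan, np.nan, 0.47],
})

# Rows that were assigned a video but carry no text for it.
orphaned = toy[toy["video_id"].notna() & toy["text"].isna()]

# Count the affected rows per video to see which videos cause the mismatch.
gap_by_video = orphaned["video_id"].value_counts()
print(gap_by_video)
```

Run against the real df, the same filter should isolate the rows behind the 1,383 vs. 1,413 difference.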

In [5]:
for col in df.columns:
    if col not in ['video_id', 'text', 'maxdiff_mean', 'sample_size', 'persuadability_score']:
        print(f"Value counts for column '{col}':")
        print(df[col].value_counts())
        print("\n")


Value counts for column 'treated':
treated
1    4310
0    1383
Name: count, dtype: int64


Value counts for column 'trump_approval':
trump_approval
0     3301
1     2390
11       2
Name: count, dtype: int64


Value counts for column 'vote_pres_2020':
vote_pres_2020
Democrat Joe Biden         2306
Republican Donald Trump    2069
other_did_not_vote         1318
Name: count, dtype: int64


Value counts for column 'vote_pres_2024':
vote_pres_2024
Democrat Kamala Harris     2255
Republican Donald Trump    2095
other_did_not_vote         1343
Name: count, dtype: int64


Value counts for column 'democratic_party_fav':
democratic_party_fav
Very unfavorable           1807
Somewhat favorable         1515
Somewhat unfavorable       1199
Very favorable              869
Never heard of/Not sure     303
Name: count, dtype: int64


Value counts for column 'republican_party_fav':
republican_party_fav
Very unfavorable           2024
Somewhat favorable         1430
Somewhat unfavorable       1021
Very fa

Quick thoughts:
- trump_approval has two rows miscoded as '11'. Based on eyeballing those respondents' other variables, they should be '1'.
- Vote history can be encoded as categorical.
- Favorability can be encoded as ordinal. The wording isn't consistent, so I need to account for that.
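The three notes above could be sketched roughly as follows. This is an illustration on a made-up frame, not the actual compute_features implementation: the `FAV_SCALE` mapping, the choice to put "Never heard of/Not sure" at the midpoint, and the lowercase/strip normalization are all assumptions.

```python
import pandas as pd

# Toy rows using category labels seen in the value counts above.
toy = pd.DataFrame({
    "trump_approval": [0, 1, 11],
    "vote_pres_2020": [
        "Democrat Joe Biden",
        "Republican Donald Trump",
        "other_did_not_vote",
    ],
    "democratic_party_fav": [
        "Very favorable",
        "Somewhat unfavorable",
        "Never heard of/Not sure",
    ],
})

# 1. Recode the miscoded 11 as 1.
toy["trump_approval"] = toy["trump_approval"].replace({11: 1})

# 2. One-hot encode vote history.
vote_dummies = pd.get_dummies(toy["vote_pres_2020"], prefix="vote_2020")

# 3. Ordinal-encode favorability. Normalizing case and whitespace first is one
#    way to absorb inconsistent wording (assumed scale, neutral midpoint at 0).
FAV_SCALE = {
    "very unfavorable": -2,
    "somewhat unfavorable": -1,
    "never heard of/not sure": 0,
    "somewhat favorable": 1,
    "very favorable": 2,
}
toy["democratic_party_fav_ord"] = (
    toy["democratic_party_fav"].str.strip().str.lower().map(FAV_SCALE)
)

print(toy[["trump_approval", "democratic_party_fav_ord"]])
print(vote_dummies.columns.tolist())
```

Any label that is not in the mapping comes out as NaN, which makes leftover wording variants easy to spot with isnull().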

In [6]:
for col in ['maxdiff_mean', 'sample_size', 'persuadability_score']:
    print(f"Stats for column '{col}':")
    print(f"Min: {df[col].min()}")
    print(f"Median: {df[col].median()}")
    print(f"Mean: {df[col].mean()}")
    print(f"Max: {df[col].max()}")
    print("\n")


Stats for column 'maxdiff_mean':
Min: 0.228064707197437
Median: 0.500027755246306
Mean: 0.499196602665806
Max: 0.748381805398991


Stats for column 'sample_size':
Min: 904.0
Median: 1026.0
Mean: 1023.6056074766356
Max: 1120.0


Stats for column 'persuadability_score':
Min: 0.0001315880146184
Median: 0.5007855862581203
Mean: 0.4986549429068211
Max: 0.99998759580745




I'm not sure yet how sample_size will come into play, but thankfully every value falls between 904 and 1,120.

In [3]:
features = compute_features(df)

Corrected 2 trump_approval values from 11 to 1.
Dropping video 55 rows due to missing text/maxdiff/sample_size (30 rows).


In [4]:
predict_increased_trump_approval(df)

[predict_increased_trump_approval] start
[compute_features] start: input shape=(5693, 14)
Corrected 2 trump_approval values from 11 to 1.
Dropping video 55 rows due to missing text/maxdiff/sample_size (30 rows).
[compute_features] encoding vote history (one-hot)
[compute_features] encoding favorability (ordinal)
[compute_features] computing control means by partisanship
[compute_features] treated rows=4,280, control rows=1,383
[compute_features] aggregating to video_id × partisanship
[compute_features] grouped shape=(378, 18), positives=180
[compute_features] building per-video metadata + text PCA features
[compute_text_embedding_pca] start: n_components=5
[compute_text_embedding_pca] embedding 126 texts
[compute_text_embedding_pca] running PCA with 5 components on dim=384
[compute_text_embedding_pca] done
[compute_features] merging aggregated outcomes with video metadata/features
[compute_features] done: output shape=(378, 26)
[predict_increased_trump_approval] compute_features done:

array([0.87897444, 0.02901279])
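The log mentions control means by partisanship, an aggregation to video_id × partisanship, and a count of positives. One plausible reading is that a cell counts as positive when its treated mean approval exceeds the control mean for the same partisanship group. The sketch below illustrates that reading on made-up data. It is a guess at the logic, not the code in src.assessment_utilities; the `partisanship` column and the `>` rule are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data; control rows have no video_id, as in the real dataset.
toy = pd.DataFrame({
    "video_id":       [np.nan, np.nan, 1, 1, 1, 2, 2, 2],
    "treated":        [0, 0, 1, 1, 1, 1, 1, 1],
    "partisanship":   ["D", "R", "D", "D", "R", "D", "R", "R"],  # hypothetical column
    "trump_approval": [0, 1, 1, 0, 1, 0, 1, 1],
})

# Baseline approval per partisanship group, from control rows only.
control_mean = (
    toy.loc[toy["treated"] == 0]
    .groupby("partisanship")["trump_approval"]
    .mean()
    .rename("control_mean")
)

# Mean approval per video × partisanship cell, from treated rows only.
grouped = (
    toy.loc[toy["treated"] == 1]
    .groupby(["video_id", "partisanship"], as_index=False)["trump_approval"]
    .mean()
    .rename(columns={"trump_approval": "treated_mean"})
    .merge(control_mean, left_on="partisanship", right_index=True)
)

# Assumed labeling rule: positive if the treated mean beats the control mean.
grouped["positive"] = grouped["treated_mean"] > grouped["control_mean"]
print(grouped)
```

With three partisanship groups, 126 videos would yield 378 such cells, which matches the grouped shape in the log.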