# Keyness Analysis of Trump vs Kamala Interviews and Podcasts

Keyness analysis is an NLP technique that measures the relative difference in token frequency between corpora. In this notebook we use it to see which words Trump and Kamala Harris use more frequently in their podcast and interview appearances. We build keyness tables that compare word frequencies between two corpora (labeled A and B) and include the following metrics:

- WORD:  The word being analyzed.
- A Freq: The number of times the word appears in corpus A
- B Freq: The number of times the word appears in corpus B
- Keyness: A statistical measure (e.g., log-likelihood) that quantifies the significance of the difference in word frequency between the two datasets.  A higher Keyness value indicates a stronger, more statistically significant difference. This value accounts for the total size of each corpus to ensure that differences aren’t just due to unequal corpus sizes.
- %DIFF: The percentage difference between the word's normalized frequency in corpus A and in corpus B, which accounts for the relative sizes of the two corpora. Positive values mean the word is more frequent in A. Higher percentages reflect stronger disparities.
- BIC (Bayesian Information Criterion): A model-selection statistic that penalizes overly complex models. In this context, it indicates how much evidence there is that the word's frequency difference is real rather than noise. Higher BIC values indicate more robust differences; negative or small positive values suggest less meaningful differences.
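
The three metrics above can be sketched for a single word as follows. This is a minimal illustration that assumes the standard log-likelihood (G2) formulation, %DIFF computed on normalized frequencies, and the BIC approximation `LL - ln(N)` with one degree of freedom. The function name `keyness_row` and its parameters are hypothetical; `calculate_key` is defined in `functions.ipynb` and is not shown here, so this should not be read as its implementation. The gap between the Keyness and BIC columns is constant within each table in this notebook, which is consistent with this form of BIC.

```python
import math

def keyness_row(freq_a, freq_b, size_a, size_b):
    """Return (log_likelihood, pct_diff, bic) for one word.

    freq_a / freq_b: raw counts of the word in corpus A / B.
    size_a / size_b: total token counts of corpus A / B.
    """
    total = size_a + size_b
    # Expected counts if the word were equally likely in both corpora
    expected_a = size_a * (freq_a + freq_b) / total
    expected_b = size_b * (freq_a + freq_b) / total

    # Log-likelihood (G2); a zero count contributes nothing to the sum
    ll = 0.0
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    ll *= 2

    # %DIFF: difference in normalized frequency, relative to corpus B
    norm_a = freq_a / size_a
    norm_b = freq_b / size_b
    pct_diff = (norm_a - norm_b) / norm_b * 100 if norm_b > 0 else float("inf")

    # BIC approximation with one degree of freedom (assumption)
    bic = ll - math.log(total)
    return ll, pct_diff, bic
```

Because expected counts and normalized frequencies both depend on corpus size, a word that appears 20 times in a large corpus and 10 times in a small one can still be key for the smaller corpus.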

This notebook is organized to guide the analysis step-by-step, starting with data preparation and progressing through multiple comparative analyses. Each section focuses on a specific comparison to highlight differences in word usage. Here's an overview of the structure:

- Loading and Structuring Data: Prepares the datasets for analysis by cleaning, tokenizing, and organizing them into the required format for keyness analysis. Includes details on how the corpora (e.g., podcasts and interviews) are selected and processed.
- Keyness Analysis of Trump Interviews vs. Trump Podcasts: Compares word frequencies between Trump's interview appearances and his podcast appearances, highlighting differences in language use across these two formats.
- Keyness Analysis of Kamala Interviews vs. Kamala Podcasts: Examines the same comparison for Kamala, analyzing word frequency differences between her interviews and podcast appearances.
- Keyness Analysis of Trump Podcasts vs. Kamala Podcasts: Investigates how word frequencies differ between Trump's and Kamala's podcast appearances, identifying notable contrasts in their language.
- Keyness Analysis of Trump Interviews vs. Kamala Interviews: Focuses on differences in word use between Trump's and Kamala's interview appearances, offering insights into how their language varies in similar settings.

Each section builds upon the previous one, employing the same keyness analysis methodology while providing a targeted comparison. Together, these sections offer a comprehensive view of linguistic patterns across different speakers and formats.

## Loading and Structuring Data

In [2]:
import json
from collections import Counter
from nltk.corpus import stopwords

In [3]:
%run functions.ipynb

In [4]:
with open('../data/master_list.json') as f:
    master_list = json.load(f)
len(master_list)

24

In [5]:
# Separate lists for each type of interview or podcast
harris_interviews_toks = []
harris_podcasts_toks = []
trump_interviews_toks = []
trump_podcasts_toks = []

# Route each item's tokens to the list that matches its medium
for item in master_list:
    if item['medium'] == 'harris_interviews':
        harris_interviews_toks.extend(item['tokens'])
    elif item['medium'] == 'harris_podcasts':
        harris_podcasts_toks.extend(item['tokens'])
    elif item['medium'] == 'trump_interviews':
        trump_interviews_toks.extend(item['tokens'])
    elif item['medium'] == 'trump_podcasts':
        trump_podcasts_toks.extend(item['tokens'])

# Count token frequencies for each list
harris_interviews_freq = Counter(harris_interviews_toks)
harris_podcasts_freq = Counter(harris_podcasts_toks)
trump_interviews_freq = Counter(trump_interviews_toks)
trump_podcasts_freq = Counter(trump_podcasts_toks)

# Print the most common tokens for each type
print("Harris Interviews:", harris_interviews_freq.most_common(10))
print("Harris Podcasts:", harris_podcasts_freq.most_common(10))
print("Trump Interviews:", trump_interviews_freq.most_common(10))
print("Trump Podcasts:", trump_podcasts_freq.most_common(10))

Harris Interviews: [('the', 452), ('to', 353), ('and', 332), ('of', 275), ('that', 274), ('i', 210), ('a', 196), ('in', 161), ('is', 152), ('we', 142)]
Harris Podcasts: [('the', 1194), ('and', 1162), ('to', 1083), ('of', 770), ('i', 755), ('that', 722), ('you', 618), ('a', 575), ('is', 478), ('in', 467)]
Trump Interviews: [('i', 670), ('the', 608), ('and', 526), ('to', 504), ('a', 449), ('you', 442), ('of', 313), ('that', 301), ('in', 274), ('it', 261)]
Trump Podcasts: [('i', 4033), ('and', 3861), ('the', 3479), ('a', 3148), ('you', 3121), ('to', 2318), ('it', 2241), ('they', 1854), ('that', 1833), ('of', 1780)]
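
The top-10 lists above are dominated by function words, so they say little about content. A minimal sketch of filtering stopwords before printing the most common tokens follows. The helper `top_content_words` and the small hand-written `STOPWORDS` set are assumptions made so the example runs offline; the list from `nltk.corpus.stopwords` (imported at the top of this notebook) could be substituted. The counts in `example` are taken from the Harris interview figures reported in this notebook.

```python
from collections import Counter

# Assumption: a small hand-written set stands in for the NLTK English
# stopword list so that the example has no download step.
STOPWORDS = {"the", "to", "and", "of", "that", "i", "a", "in", "is", "we", "you", "it"}

def top_content_words(freq, n=10, stopwords=STOPWORDS):
    """Return the n most common tokens that are not in the stopword set."""
    filtered = Counter({tok: c for tok, c in freq.items() if tok not in stopwords})
    return filtered.most_common(n)

example = Counter({"the": 452, "to": 353, "border": 30, "american": 55})
print(top_content_words(example, n=2))
# [('american', 55), ('border', 30)]
```

This filtering is only for inspecting the raw counts. The keyness tables below keep function words, since differences in words like "yeah" or "they" are part of what the analysis is looking for.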


## Comparison Between Trump's Interview and Podcast Appearances

In [5]:
kdf=calculate_key(trump_podcasts_freq, trump_interviews_freq, top=100)

WORD                     A Freq.   B Freq.   Keyness   %DIFF     BIC
yeah                     482       16        64.977    382.021   53.208
gonna                    257       10        30.596    311.218   18.828
know                     1407      145       28.676    55.262    16.907
it                       2241      261       25.645    37.385    13.876
get                      380       28        19.677    117.152   7.909
hes                      450       37        18.336    94.603    6.568
great                    403       33        16.583    95.402    4.815
was                      1502      183       13.008    31.328    1.239
and                      3861      526       12.473    17.45     0.705
but                      1422      175       11.495    30.017    -0.273
problem                  107       5         10.709    242.415   -1.060
when                     352       33        10.004    70.674    -1.765
he                       1224      152       9.277     28.848    -2.492


### Observations: 
- Conversational tone: Words like "yeah," "gonna," "know," and "it" are significantly more frequent in podcasts. These markers suggest the casual, spontaneous tone typical of the format.
- Trumpian Vocabulary: As anticipated, Trump's typical linguistic patterns are amplified on podcasts, where he is freer to speak his mind. We see words like "great", "terrible", "crazy", "big", and "fight".
- Topic Focus: Trump talks noticeably more about "election" and "vote" on podcasts, as well as "russia" and "war".
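
The BIC column in the table above can be read against evidence bands. A minimal sketch follows, assuming the bands commonly cited for BIC in corpus keyness work (below 0: no evidence of a difference; 0 to 2: weak; 2 to 6: positive; 6 to 10: strong; above 10: very strong). The function name `bic_evidence` and the exact handling of boundary values are illustrative choices, not part of this notebook.

```python
def bic_evidence(bic):
    """Map a BIC value to a rough evidence label (bands are an assumption)."""
    if bic < 0:
        return "no evidence"
    if bic < 2:
        return "weak"
    if bic < 6:
        return "positive"
    if bic < 10:
        return "strong"
    return "very strong"

# BIC values taken from the podcast-vs-interview table above
rows = [("yeah", 53.208), ("great", 4.815), ("and", 0.705), ("but", -0.273)]
for word, bic in rows:
    print(f"{word:<6} {bic:>8.3f}  {bic_evidence(bic)}")
```

Read this way, rows near the bottom of the table, such as "but" and "problem", carry little evidential weight even though their %DIFF values look large.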

In [6]:
kdf=calculate_key(trump_interviews_freq, trump_podcasts_freq)

WORD                     A Freq.   B Freq.   Keyness   %DIFF     BIC
—                        179       70        434.147   1498.146  422.378
our                      98        224       59.024    173.426   47.255
companies                30        29        45.692    546.524   33.924
going                    137       430       43.393    99.119    31.625
percent                  26        22        43.332    638.605   31.564
to                       504       2318      36.417    35.887    24.649
tariff                   18        13        33.009    765.348   21.240
tariffs                  22        23        31.629    497.801   19.860
black                    21        21        31.21     524.973   19.442
country                  90        297       24.961    89.386    13.192
case                     19        25        22.522    374.98    10.753
into                     51        139       22.282    129.307   10.513
inflation                18        26        19.499    332.674   7

### Observations:
- Trade policy: In the interviews, trade policy is a big theme, as we get words like "trade", "sell", "tariff", "china", "germany", "korea", and "cars".
- Economy: Economy-focused words like "companies", "jobs", and "tax" are used more in interviews by Trump.
- The Border: Trump discusses "border" and "mexico" relatively more in the interviews than in the podcasts.
- Identity groups: Words like "women" and "black" appear more in interviews.
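
Several words cited in these observations do not appear in the truncated table output, so it can help to check them directly against the frequency counters. A minimal sketch follows, assuming the counters are `collections.Counter` objects like those built earlier; the helper names `per_thousand` and `compare_words` and the toy counters are hypothetical and exist only for illustration.

```python
from collections import Counter

def per_thousand(freq, word):
    """Occurrences of `word` per 1,000 tokens in a frequency Counter."""
    total = sum(freq.values())
    return 1000 * freq[word] / total if total else 0.0

def compare_words(freq_a, freq_b, words):
    """Return {word: (rate_in_a, rate_in_b)} for the requested words."""
    return {w: (per_thousand(freq_a, w), per_thousand(freq_b, w)) for w in words}

# Toy counters for illustration only, not the notebook's data
toy_a = Counter({"tariff": 2, "the": 98})
toy_b = Counter({"tariff": 1, "the": 199})
print(compare_words(toy_a, toy_b, ["tariff", "china"]))
# {'tariff': (20.0, 5.0), 'china': (0.0, 0.0)}
```

In the notebook this could be called as `compare_words(trump_interviews_freq, trump_podcasts_freq, ["trade", "china", "border"])` to see the normalized rates behind a given observation.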

## Comparison Between Harris' Interview and Podcast Appearances

In [7]:
kdf=calculate_key(harris_podcasts_freq, harris_interviews_freq, top=100)

WORD                     A Freq.   B Freq.   Keyness   %DIFF     BIC
you                      618       108       34.648    77.716    24.045
know                     272       34        31.554    148.457   20.950
right                    166       16        27.512    222.218   16.909
so                       289       44        23.011    103.989   12.408
dont                     111       9         22.027    283.038   11.424
your                     138       14        21.518    206.135   10.915
like                     101       10        16.22     213.677   5.617
up                       104       11        15.385    193.631   4.781
just                     142       19        14.656    132.111   4.052
because                  141       19        14.367    130.477   3.763
but                      187       29        14.279    100.265   3.676
mean                     72        6         13.919    272.686   3.316
look                     66        7         9.729     192.825   -0.875

### Observations

- Conversational Tone: Words like "you," "know," "right," and "so" indicate a more informal, conversational tone. These fillers and discourse markers are typical in relaxed, unscripted dialogue.
- Gendered Language: Gendered pronouns and terms like "women", "woman", "she", and "her", as well as "he" and "him", are more frequent in Harris's podcast vocabulary, suggesting a focus on gender dynamics, equality, and inclusivity.
- Empathic Language: The terms "life", "care", "child", and "small" in Harris's podcast language reflect a strong use of empathic language, which conveys compassion, understanding, and a focus on individual experiences.
- No clear topic focus emerges.

In [8]:
kdf=calculate_key(harris_interviews_freq, harris_podcasts_freq, top=100)

WORD                     A Freq.   B Freq.   Keyness   %DIFF     BIC
american                 55        18        86.568    883.849   75.965
border                   30        19        31.228    408.401   20.625
bill                     26        17        26.353    392.451   15.749
has                      59        79        24.23     140.471   13.627
president                59        86        20.461    120.898   9.858
clear                    22        17        19.124    316.689   8.521
must                     13        7         15.324    497.976   4.720
forward                  12        6         14.886    543.974   4.283
states                   35        48        13.733    134.782   3.130
leader                   12        7         13.334    451.978   2.730
immigration              11        6         12.847    490.31    2.244
would                    49        81        12.656    94.782    2.053
donald                   33        46        12.541    130.991   1.938

### Observations

- Nationalist rhetoric: In interviews Harris more frequently uses terms such as "America", "American", "President", "Leader", "states", and "country", which emphasize unity, patriotism, and shared national identity and appeal broadly to citizens across political divides.
- Discussion of legislation: Words like "bill", "plan", "fix", and "congress" highlight her engagement with actionable governance, signaling progress or advocacy for specific initiatives in news interviews.
- Immigration, law, and drugs: Words like "border", "immigration", "agents", "criminal", and "fentanyl" suggest a stronger focus on law and order, specifically in her immigration rhetoric.
- Opponent mentions: Harris seems more willing to discuss her opponent in interviews with words like "former", "Trump", and "Donald". 

## Podcast Comparison Between Candidates

In [9]:
kdf=calculate_key(harris_podcasts_freq, trump_podcasts_freq, top=100)

WORD                     A Freq.   B Freq.   Keyness   %DIFF     BIC
my                       234       192       223.714   341.557   211.850
who                      170       115       192.204   435.578   180.340
to                       1083      2318      190.325   69.273    178.461
work                     113       48        173.229   752.922   161.365
black                    88        21        172.869   1418.222  161.005
about                    304       411       156.105   167.981   144.241
your                     138       128       116.589   290.608   104.725
community                47        5         113.435   3305.649  101.571
small                    51        8         113.235   2209.682  101.371
is                       478       936       110.802   85.022    98.938
of                       770       1780      101.696   56.727    89.833
donald                   46        8         99.452    1983.243  87.588
plan                     42        7         91.829    207

### Observations

- Personal Pronouns & Possession: Harris uses "my," "your," and "we" more often than Trump, suggesting a more personal, conversational tone. This reflects a style that is likely more self-reflective and focused on individual connections, potentially framing her message in a way that feels more inclusive and relatable to her audience.
- Themes of Race and Community: Harris mentions words like "black," "community," and "justice" more frequently, indicating a stronger emphasis on racial issues and social justice. This could reflect her focus on policies aimed at racial equality and supporting marginalized communities, aligning with her political stance on social change and advocacy for racial justice.
- Children and Families: Harris uses terms like "kids," "young," "children," "mother," and "family" more often, suggesting her podcast places a greater emphasis on family and child-related issues. This could highlight her focus on family welfare, childcare, and family-centered policies.
- Discussion of Legislation: Words like "bill," "plan," "fix," and "congress" appear more often in Harris's podcast, pointing to a stronger focus on policy proposals and legislative action. This suggests she may discuss specific initiatives and her approach to policy change more extensively than Trump.
- Labor and Economy: Harris's increased use of words like "economy," "businesses," "working," and "work" points to a stronger focus on labor and economic issues. This may reflect discussions around job creation, economic policies, and the role of businesses in supporting communities and workers' rights.
- Opponent Mentions: Harris references her opponent more often with words like "former," "trump," and "donald," indicating a greater willingness to directly engage with or critique her opponent in her podcast. This may suggest she is more vocal about contrasting her views with Trump’s, positioning herself as an alternative on key political issues.

In [10]:
kdf=calculate_key(trump_podcasts_freq, harris_podcasts_freq, top=100)

WORD                     A Freq.   B Freq.   Keyness   %DIFF     BIC
they                     1854      164       268.397   212.028   256.533
said                     712       28        194.528   601.859   182.664
but                      1422      187       109.491   109.887   97.627
had                      621       46        108.808   272.616   96.944
i                        4033      755       104.549   47.438    92.685
he                       1224      153       104.492   120.81    92.629
great                    403       18        102.871   517.96    91.007
a                        3148      575       90.958    51.111    79.094
it                       2241      377       88.761    64.07     76.898
was                      1502      223       86.769    85.906    74.905
very                     627       60        82.217    188.433   70.353
its                      1096      148       79.672    104.398   67.808
hes                      450       34        77.334    265.31    

### Observations

- Frequent use of "they", "he", and "shes" indicates that Trump often references external actors or individuals, emphasizing a third-person narrative to frame others' actions or viewpoints. This aligns with his tendency to focus on opponents or external forces. However, "I" also appears more often, possibly indicating a more narrative-driven linguistic style.
- Words like "great", "good", "bad", "dangerous", and "problem" suggest a preference for strong adjectives and hyperbole.
- Words like "guy", "yeah", and "gonna" indicate a more conversational and informal speech style, likely aimed at establishing a relatable and down-to-earth persona.
- Terms such as "didn't" and "can't" show a tendency to emphasize limitations or what others haven't done, underscoring a rhetorical strategy that critiques or highlights perceived failures.
- Trump talks more about Russia in the podcasts than Harris does.

## Interview Comparison Between Candidates

In [11]:
kdf=calculate_key(trump_interviews_freq, harris_interviews_freq, top=100)

WORD                     A Freq.   B Freq.   Keyness   %DIFF     BIC
they                     244       39        64.553    235.179   54.337
said                     134       10        63.404    617.887   53.187
you                      442       108       61.868    119.255   51.651
theyre                   98        5         54.611    950.043   44.394
i                        670       210       50.06     70.925    39.843
but                      175       29        44.42     223.289   34.203
dont                     102       9         44.003    507.168   33.786
because                  122       19        33.217    243.999   23.000
its                      134       25        29.284    187.155   19.068
was                      183       43        27.64     127.999   17.423
were                     141       29        26.744    160.479   16.527
she                      69        9         22.37     310.731   12.153
know                     145       34        21.986    128.476   11

### Observations

- Pronouns like "they", "they're", "she", "he", and "him" again appear far more often in Trump's interviews, reinforcing how frequently he references external actors or individuals.
- "I" comes up significantly more in Trump's interview appearances, once again underscoring his tendency toward personal anecdotes.
- Trump talks more about inflation in the interviews than Harris does.

In [12]:
kdf=calculate_key(harris_interviews_freq, trump_interviews_freq, top=100)

WORD                     A Freq.   B Freq.   Keyness   %DIFF     BIC
american                 55        5         85.713    1953.249  75.497
who                      58        19        52.419    469.801   42.202
has                      59        20        52.035    450.644   41.818
my                       64        27        47.302    342.451   37.086
that                     274       301       39.524    69.915    29.308
what                     104       79        36.567    145.728   26.351
trump                    30        6         35.895    833.295   25.679
of                       275       313       35.094    63.998    24.878
is                       152       142       34.758    99.804    24.541
president                59        34        31.32     223.908   21.103
the                      452       608       27.269    38.766    17.053
need                     26        7         26.663    593.305   16.446
economy                  22        5         24.753    721.3     14

### Observations

- Opponent focus: Harris talks about Trump a great deal in the interviews, with words like "president" and "trump".
- Issue focus: Words like "election", "border", "military", "war", and "tax" show that Harris's rhetoric is far more issue-focused than Trump's.
- Nationalist Rhetoric: Words like "american", "united", and "states" show that Harris talks about American values and American unity more in her news interview appearances.

## Final Observations:

This analysis contrasts the linguistic and thematic differences in Trump’s and Harris’s podcast and interview appearances. Here's a breakdown of the key distinctions:

### Trump
- **Podcasts**:  
  - *Tone*: Casual, with markers like “yeah,” “gonna.”  
  - *Vocabulary*: Hyperbolic and charged words (“great,” “terrible”).  
  - *Topics*: Focus on election issues (“election,” “vote”), international relations (“Russia,” “war”).  

- **Interviews**:  
  - *Tone*: More formal, less spontaneous.  
  - *Vocabulary*: Economic and trade-related terms (“tariff,” “China”).  
  - *Topics*: Trade, economy (“jobs,” “tax”), border security.  

### Harris
- **Podcasts**:  
  - *Tone*: Personal, empathic. Frequent use of pronouns (“my,” “we”) and words reflecting family, community, and social justice themes (“kids,” “justice”).  
  - *Topics*: Broader focus on empathy and inclusion rather than legislation, relative to her interviews.  

- **Interviews**:  
  - *Tone*: Issue-driven with a formal tone.  
  - *Vocabulary*: Nationalistic and legislative terms (“America,” “bill,” “Congress”).  
  - *Topics*: Immigration, law enforcement, and critiques of Trump.

### Candidate Comparison
- **Podcasts**:  
  - *Harris*: Emphasizes empathy, social justice, and legislative plans.  
  - *Trump*: Focuses on personal narratives, informal speech, and critiques of opponents or external threats.  

- **Interviews**:  
  - *Harris*: Stronger focus on opponents and issue-based rhetoric (immigration, economy).  
  - *Trump*: More anecdotal, discussing his achievements or critiques of inflation.

This reflects their differing rhetorical strategies: Harris leans towards a detailed, policy-focused narrative, while Trump often opts for broad, emotive appeals.