RQ4: Is there a correlation between the framing of candidates’ tweets and the framing of news stories containing those tweets in the 2016 and 2020 coverage? 
-	Correlations between presidential candidates’ tweets and the news stories about them
-	Comparison of 2016 and 2020

The analysis should proceed as follows:
Take the first frame (say, opponent attack). Select all tweets with this frame and test whether it is associated with the frame of the stories in which those tweets were used. Split the stories by outlet (CNN/Fox) and run two separate analyses, one for 2016 and one for 2020.

Repeat for all remaining frames.
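
The per-frame procedure described above could be sketched roughly as follows. This is only an illustration under assumptions: it treats "the frame in the story" as the same frame appearing as the story's dominant frame, the frame labels "attack" and "policy" are made up, and the helper name frame_association is hypothetical. Only the column names are taken from the data set.

```python
import pandas as pd
from scipy import stats

TWEET_COL = "TWEETGENERICFRAMESDOMINANT"
STORY_COL = "STORYGENERICFRAMESDOMINANT"

def frame_association(df, frame):
    """2x2 test: do tweets with `frame` appear in stories with the same frame?"""
    tweet_has = (df[TWEET_COL] == frame).rename("tweet_has_frame")
    story_has = (df[STORY_COL] == frame).rename("story_has_frame")
    table = pd.crosstab(tweet_has, story_has)
    if table.shape != (2, 2):
        return None  # the frame is absent or constant in this subset
    chi2, p, _, _ = stats.chi2_contingency(table, correction=False)
    # For a 2x2 table, phi is the same quantity as (uncorrected) Cramer's V.
    phi = (chi2 / table.to_numpy().sum()) ** 0.5
    return {"frame": frame, "chi2": chi2, "p": p, "phi": phi}

# Toy data: the frame labels are hypothetical, not the study's coding scheme.
frames = ["attack", "attack", "policy", "policy"]
toy = pd.DataFrame({
    "YEAR": [2016] * 8 + [2020] * 8,
    "PUBLISHER": (["cnn"] * 4 + ["fox"] * 4) * 2,
    TWEET_COL: frames * 4,
    STORY_COL: frames * 4,
})

# One analysis per year and outlet, repeated for every frame in the subset.
for (year, publisher), subset in toy.groupby(["YEAR", "PUBLISHER"]):
    for frame in sorted(subset[TWEET_COL].unique()):
        print(year, publisher, frame_association(subset, frame))
```

With real data, many of the year-by-outlet subsets will be small, so the chi-square approximation behind the p value should be read with caution.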

In [1]:
 #using mystats env
import pandas as pd
import numpy as np
import pingouin as pg
import association_metrics as am
import scipy.stats as stats
from scipy.stats import chisquare
import pyreadstat
import statsmodels.api as sm

In [2]:
#open file
df=pd.read_csv('all.csv')
#cases STORYNUMBER 191 and 195 are removed since twitter had missing values

In [3]:
#unit of analysis is tweet
dftweet=df[['STORYNUMBER', 'CODER', 'YEAR', 'PUBLICATIONDATE', 'URL', 'PUBLISHER', "PUBLISHERYEAR",
       'NUMBEROFTWEETSBYCANDIDATES', 'NUMBEROFTWEETSFROMOTHERS', 'TWEETBY',
       'TWEETPRESENTATION', 'TWEETGENERICFRAMESDOMINANT','TWEETEMOTIONALFRAMES',
       'TWEETCONTAINSFALSESTATEMENT', 'TWEETCLAIMSSOMETHINGELSEISFALSE',
       'TWEETTOPIC', 'TWEETROLETOTHESTORY', 'STORYROLETOTWEET']]
#unit of analysis is story
dfstory=df[['STORYNUMBER', 'CODER', 'YEAR', 'PUBLICATIONDATE', 'URL', 'PUBLISHER', "PUBLISHERYEAR",
       'STORYTYPE', 'STORYTOPIC', 'STORYGENERICFRAMESDOMINANT', 'STORYEMOTIONALFRAMES',
       'STORYCONTAINSFALSESTATEMENT', 'STORYCLAIMSANOTHERSOURCEISFALSE',
       'STORYFACTCHECKSTWEET', 'SENSATIONALISM', 'src_expert', 'src_opponent',
       'src_otherpolitician', 'src_privatecitizen', 'src_media', 'src_other']]
#remove duplicates
# dfstory.drop_duplicates(subset=['STORYNUMBER'], keep='first', inplace=True)
#dv list for tweets and stories
DVtweet=['TWEETTOPIC','TWEETGENERICFRAMESDOMINANT', 'TWEETEMOTIONALFRAMES',
       #  'TWEETCONTAINSFALSESTATEMENT', 'TWEETCLAIMSSOMETHINGELSEISFALSE'
        ]
DVstory=['STORYTOPIC', 'STORYGENERICFRAMESDOMINANT', 'STORYEMOTIONALFRAMES',
       # 'STORYCONTAINSFALSESTATEMENT', 'STORYCLAIMSANOTHERSOURCEISFALSE'
       ]
DVnamesTweet=['the topic of the tweet','the generic frames of the tweets','the emotional frames of the tweets']
DVnamesStory=['the topic of the story','the generic frames of the stories','the emotional frames of the stories']

In [4]:
#slicing datasets
Trump=df[df.TWEETBY=="trump"]
TrumpFOX=Trump[Trump.PUBLISHER=="fox"]
TrumpCNN=Trump[Trump.PUBLISHER=="cnn"]
Trump2016=Trump[Trump.YEAR==2016]
Trump2020=Trump[Trump.YEAR==2020]
Trump2016FOX=Trump2016[Trump2016.PUBLISHER=="fox"]
Trump2016CNN=Trump2016[Trump2016.PUBLISHER=="cnn"]
Trump2020FOX=Trump2020[Trump2020.PUBLISHER=="fox"]
Trump2020CNN=Trump2020[Trump2020.PUBLISHER=="cnn"]
Democrat=df[df.TWEETBY=="biden"]
Hillary2016=Democrat[Democrat.YEAR==2016]
Biden2020=Democrat[Democrat.YEAR==2020]
Hillary2016FOX=Hillary2016[Hillary2016.PUBLISHER=="fox"]
Hillary2016CNN=Hillary2016[Hillary2016.PUBLISHER=="cnn"]
Biden2020FOX=Biden2020[Biden2020.PUBLISHER=="fox"]
Biden2020CNN=Biden2020[Biden2020.PUBLISHER=="cnn"]
dfList=[Trump, TrumpFOX, TrumpCNN, Trump2016, Trump2016FOX, Trump2016CNN, Trump2020, Trump2020FOX, Trump2020CNN, Hillary2016, Hillary2016FOX, Hillary2016CNN, Biden2020, Biden2020FOX, Biden2020CNN]
dfName=["Trump overall", "Trump on FOX in both terms", "Trump on CNN in both terms", "Trump in 2016", "Trump in 2016 on FOX", "Trump in 2016 on CNN", "Trump in 2020", "Trump in 2020 on FOX", "Trump in 2020 on CNN", "Hillary", "Hillary on FOX", "Hillary on CNN", "Biden", "Biden on FOX", "Biden on CNN"]
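
As an alternative to maintaining two parallel lists, the subsets and their labels could be kept in one mapping so they cannot drift out of sync. The sketch below is an assumption-laden illustration: build_subsets is a hypothetical helper, the toy rows are invented, and only the column names and the values "trump", "fox" and "cnn" come from the cell above.

```python
import pandas as pd

def build_subsets(df, candidate, label):
    """Return {description: subset} for one candidate, overall and by year/outlet."""
    base = df[df["TWEETBY"] == candidate]
    subsets = {f"{label} overall": base}
    for publisher in ("fox", "cnn"):
        subsets[f"{label} on {publisher.upper()}"] = base[base["PUBLISHER"] == publisher]
    for year in (2016, 2020):
        in_year = base[base["YEAR"] == year]
        subsets[f"{label} in {year}"] = in_year
        for publisher in ("fox", "cnn"):
            key = f"{label} in {year} on {publisher.upper()}"
            subsets[key] = in_year[in_year["PUBLISHER"] == publisher]
    return subsets

# Invented rows, only to show the shape of the result.
toy = pd.DataFrame({
    "TWEETBY": ["trump"] * 4 + ["biden"] * 2,
    "YEAR": [2016, 2016, 2020, 2020, 2020, 2020],
    "PUBLISHER": ["fox", "cnn", "fox", "cnn", "fox", "cnn"],
})

trump_subsets = build_subsets(toy, "trump", "Trump")
for name, subset in trump_subsets.items():
    print(name, len(subset))
```

Iterating over `trump_subsets.items()` then yields each label together with its data frame, which is what the results loop needs.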


In [5]:
# Cramers V calculator
def cramersV(x,y):
    """Calculate Cramer's V for categorical-categorical association.

    Uses the bias correction from Bergsma (2013),
    Journal of the Korean Statistical Society 42: 323-328.
    Returns NaN when either variable is constant.
    """
    # NaN (a float) instead of -1 so the caller always gets a numeric scalar
    result = np.nan
    if x.nunique() == 1:
        print("First variable is constant")
    elif y.nunique() == 1:
        print("Second variable is constant")
    else:
        conf_matrix = pd.crosstab(x, y)

        # Yates' correction only affects 2x2 tables, and the original logic
        # disabled it exactly there, so it is simply switched off here.
        chi2 = stats.chi2_contingency(conf_matrix, correction=False)[0]

        n = conf_matrix.to_numpy().sum()   # total number of observations
        phi2 = chi2 / n
        r, k = conf_matrix.shape
        # bias-corrected phi squared and table dimensions
        phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
        rcorr = r - ((r-1)**2)/(n-1)
        kcorr = k - ((k-1)**2)/(n-1)
        result = np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
    return np.round(result, 2)
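
To make the effect of the bias correction concrete, the following sketch computes both the uncorrected and the corrected Cramér's V on two toy tables. The function name cramers_v_pair and the toy series are illustrative assumptions; the formulas mirror the ones used in the cell above.

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramers_v_pair(x, y):
    """Return (uncorrected V, bias-corrected V) for two categorical series."""
    table = pd.crosstab(x, y)
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    phi2 = chi2 / n
    v_raw = np.sqrt(phi2 / min(r - 1, k - 1))
    # Bias correction: shrink phi squared and the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    v_corr = np.sqrt(phi2_corr / min(r_corr - 1, k_corr - 1))
    return v_raw, v_corr

# Perfect association: every tweet category maps to exactly one story category.
x_perfect = pd.Series(list("abc") * 20, name="tweet")
y_perfect = pd.Series(list("abc") * 20, name="story")

# No association: every combination of categories occurs equally often.
x_indep = pd.Series(np.repeat(list("abc"), 30), name="tweet")
y_indep = pd.Series(np.tile(list("abc"), 30), name="story")

print(cramers_v_pair(x_perfect, y_perfect))
print(cramers_v_pair(x_indep, y_indep))
```

The two versions agree at the extremes (perfect association and exact independence). The correction matters in between, especially for small samples and large tables, where the uncorrected statistic tends to overstate the association.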

In [6]:
def CorCramer(x):
    """Print one result sentence per tweet/story variable pair for subset x."""
    for i, j, k, l in zip(DVtweet, DVstory, DVnamesTweet, DVnamesStory):
        data = x[[i, j]]
        table = sm.stats.Table.from_data(data)
        # print(table.table_orig)
        # chi-square test of independence gives the p value
        p = table.test_nominal_association().pvalue
        pair = "The correlation between " + k + " and " + l
        # conditional p value report; the stricter threshold is checked first
        # so that p == 0.01 is no longer reported as "not significant"
        if p < 0.01:
            print(f"{pair} was significant (p < 0.01, Cramer's V = {cramersV(x[i], x[j])}).")
        elif p < 0.05:
            print(f"{pair} was significant (p < 0.05, Cramer's V = {cramersV(x[i], x[j])}).")
        else:
            print(f"{pair} was not significant (p = {round(p, 3)}).")
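
The reporting logic can also be separated from the statistics so that the threshold handling is easy to check on its own. The helper below is a hypothetical sketch, not part of the notebook; note that it pads V to two decimals (0.50 rather than 0.5), which differs slightly from the printed output of the cell above.

```python
def describe_association(p_value, cramers_v, tweet_label, story_label):
    """Format one result sentence; thresholds run from strictest to loosest."""
    pair = f"The correlation between {tweet_label} and {story_label}"
    if p_value < 0.01:
        return f"{pair} was significant (p < 0.01, Cramer's V = {cramers_v:.2f})."
    if p_value < 0.05:
        return f"{pair} was significant (p < 0.05, Cramer's V = {cramers_v:.2f})."
    return f"{pair} was not significant (p = {p_value:.3f})."

# Boundary and ordinary cases with made-up numbers.
for p in (0.001, 0.01, 0.03, 0.2):
    print(describe_association(p, 0.3, "the topic of the tweet", "the topic of the story"))
```

Because each branch returns immediately, a p value of exactly 0.01 falls into the "p < 0.05" sentence instead of slipping through to "not significant".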

In [7]:
# for i, j in zip(DVtweet, DVstory):
#     data = dfTrump2016[[i, j]]
#     table = sm.stats.Table.from_data(data)
#     # print(table.table_orig)
#     #calculating the p value
#     rslt = table.test_nominal_association()
#     # print('p = ' + rslt.pvalue.round(3).astype(str))
#     #conditional p value report
#     if rslt.pvalue < 0.05 and rslt.pvalue > 0.01:
#         print("The correlation between "+i+" and "+j+" was significant (p < 0.05, Cramer's V = " + cramersV(dfTrump2016[i], dfTrump2016[j]).astype(str)+ ").")
#     elif rslt.pvalue < 0.01:
#         print("The correlation between "+i+" and "+j+" was significant (p < 0.01, Cramer's V = " + cramersV(dfTrump2016[i], dfTrump2016[j]).astype(str)+ ").")
#     else:
#         print("The correlation between "+i+" and "+j+" was not significant "+ '(p = '+rslt.pvalue.round(3).astype(str)+").")

In [8]:
#Results
for x, y in zip(dfList, dfName):
    print(y)
    CorCramer(x)
    print("\n")

Trump overall
The correlation between the topic of the tweet and the topic of the story was significant (p < 0.01, Cramer's V = 0.49).
The correlation between the generic frames of the tweets and the generic frames of the stories was significant (p < 0.01, Cramer's V = 0.43).
The correlation between the emotional frames of the tweets and the emotional frames of the stories was significant (p < 0.01, Cramer's V = 0.35).


Trump on FOX in both terms
The correlation between the topic of the tweet and the topic of the story was significant (p < 0.01, Cramer's V = 0.68).
The correlation between the generic frames of the tweets and the generic frames of the stories was significant (p < 0.01, Cramer's V = 0.55).
The correlation between the emotional frames of the tweets and the emotional frames of the stories was significant (p < 0.01, Cramer's V = 0.45).


Trump on CNN in both terms
The correlation between the topic of the tweet and the topic of the story was significant (p < 0.01, Cramer's 