**Hypothesis Testing of Results**

BAMP 2022 - MCT 4 - Ante Jelavic, Franziskus Perkhofer, Manuel Mencher, Melissa Ewering, Tim Ritzheimer

Short Description: In the previous script "VisualizationOfResults", sentiments and emotions were visualized by demographic characteristic and by related topic. However, the differences, e.g. between demographic groups, have not yet been tested statistically. This script tests the observed differences with the Chi-square test of independence. Where the test indicates that the variables are dependent, the effect size is measured with Cramer's V.

The null hypothesis for the Chi-square test is always: Both variables are independent.
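To make the test concrete, the following is a minimal sketch of what the Chi-square test of independence computes: expected counts under independence, and the sum of squared deviations from them. The table `observed` is a made-up example, not data from this project.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table (rows: two groups, columns: three categories)
observed = np.array([[30, 50, 20],
                     [20, 40, 40]])

n = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# Expected counts if both variables were independent
expected = row_totals * col_totals / n

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# The same quantities as returned by SciPy
chi2, p, dof, expected_scipy = chi2_contingency(observed)
print(chi2_manual, chi2, p, dof)
```

The manual statistic and the SciPy result agree here because the table has more than one degree of freedom, so no continuity correction is applied.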

In [1]:
# Loading all necessary packages

import pandas as pd
import csv
import numpy as np
from scipy.stats import chi2_contingency
import scipy.stats as stats

In [2]:
df = pd.read_csv('FinalResults.csv') #See Output_Data - Needs to be saved in "Scripts" folder for runtime



In [3]:
#Function to calculate Cramer's V
def cramers_v(cross_tabs):
    # getting the chi sq. stat
    chi2 = stats.chi2_contingency(cross_tabs)[0]
    # calculating the total number of observations
    n = cross_tabs.sum().sum()
    # min(rows, columns) - 1: the normalizing term of Cramer's V (not the test's degrees of freedom)
    dof = min(cross_tabs.shape)-1
    # calculating cramer's v
    v = np.sqrt(chi2/(n*dof))
    # printing results
    print(f'V = {v}')
    
    return v
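One detail worth knowing about the function above: `chi2_contingency` applies Yates' continuity correction by default for 2x2 tables, which slightly lowers the Chi-square statistic and therefore V. The sketch below illustrates this with a hypothetical helper `cramers_v_from_table` and a made-up table with perfect association; neither is part of the original script.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v_from_table(table, correction=True):
    """Cramer's V for a contingency table (hypothetical helper for illustration)."""
    table = np.asarray(table)
    chi2 = chi2_contingency(table, correction=correction)[0]
    n = table.sum()
    k = min(table.shape) - 1  # min(rows, columns) - 1
    return np.sqrt(chi2 / (n * k))

# Made-up 2x2 table in which the row fully determines the column
perfect = np.array([[10, 0],
                    [0, 10]])

v_corrected = cramers_v_from_table(perfect)                      # default: Yates' correction
v_uncorrected = cramers_v_from_table(perfect, correction=False)  # plain Pearson Chi-square
print(v_corrected, v_uncorrected)
```

Without the correction, a perfectly associated table yields V = 1; with the default correction the value is somewhat lower. For the large samples used in this script the difference is negligible.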

In [4]:
# Create dataframe to store p-values and Cramer's V effect size
results = pd.DataFrame(columns=['Variable 1','Variable 2','P-Value - Chi-square test','Effect size - Cramers V'])
results

Unnamed: 0,Variable 1,Variable 2,P-Value - Chi-square test,Effect size - Cramers V


**1st test:** Check independence between Age and Sentiment

In [5]:
# crosstab 
pd.crosstab(df["Age_Group"], df["label"])

label,NEGATIVE,POSITIVE
Age_Group,Unnamed: 1_level_1,Unnamed: 2_level_1
30-39,7659,9192
40-49,24051,33789
50-59,15842,32114
60-69,16905,26103
<30,2351,6673
>70,7632,16119


In [6]:
# Chi-square test of independence.
x = chi2_contingency(pd.crosstab(df["Age_Group"],df["label"]))
p = x[1]
x

(2129.719068634005,
 0.0,
 5,
 array([[ 6321.566497  , 10529.433503  ],
        [21698.38028524, 36141.61971476],
        [17990.44821852, 29965.55178148],
        [16134.23131583, 26873.76868417],
        [ 3385.30746359,  5638.69253641],
        [ 8910.06621983, 14840.93378017]]))

In [7]:
# Compute effect size - Cramers V
v = cramers_v(pd.crosstab(df["Age_Group"], df["label"]))

V = 0.1035994604307967


In [8]:
row = ["Age", "Sentiment", p, v]
results.loc[len(results)] = row
results

Unnamed: 0,Variable 1,Variable 2,P-Value - Chi-square test,Effect size - Cramers V
0,Age,Sentiment,0.0,0.103599


**Conclusion:** The null hypothesis can be rejected because the p-value (reported as 0.0, i.e. too small to be represented as a floating-point number) is smaller than 0.01 (1% significance level, 99% confidence level). Therefore, it can be assumed that the variables are dependent.
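The same four steps (crosstab, Chi-square test, Cramer's V, storing the result) are repeated for every variable pair in this script. As a sketch, they could be bundled into one function. The name `independence_test` and the toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def independence_test(data, var_1, var_2):
    """Chi-square test of independence plus Cramer's V for two columns (illustrative helper)."""
    table = pd.crosstab(data[var_1], data[var_2])
    chi2, p, dof, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
    return {"chi2": chi2, "p": p, "dof": dof, "v": v}

# Made-up toy data: two groups with clearly different shares of positive labels
toy = pd.DataFrame({
    "group": ["a"] * 50 + ["b"] * 50,
    "label": ["POSITIVE"] * 40 + ["NEGATIVE"] * 10
             + ["POSITIVE"] * 15 + ["NEGATIVE"] * 35,
})

outcome = independence_test(toy, "group", "label")
print(outcome)
```

With the real data, a call such as `independence_test(df, "Age_Group", "label")` would return the values that the cells above compute step by step.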

**2nd test:** Check independence between Gender and Sentiment

In [9]:
# crosstab 
pd.crosstab(df["Gender"], df["label"])

label,NEGATIVE,POSITIVE
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1
m,46684,78866
w,27756,45124


In [10]:
# Chi-square test of independence.
x = chi2_contingency(pd.crosstab(df["Gender"],df["label"]))
p = x[1]
x

(15.928532693036649,
 6.577958104337594e-05,
 1,
 array([[47099.44060878, 78450.55939122],
        [27340.55939122, 45539.44060878]]))

In [11]:
# Compute effect size - Cramers V
v = cramers_v(pd.crosstab(df["Gender"], df["label"]))

V = 0.00895950919327114


In [12]:
row = ["Gender", "Sentiment", p, v]
results.loc[len(results)] = row
results

Unnamed: 0,Variable 1,Variable 2,P-Value - Chi-square test,Effect size - Cramers V
0,Age,Sentiment,0.0,0.103599
1,Gender,Sentiment,6.6e-05,0.00896


**Conclusion:** The null hypothesis can be rejected because the p-value (6.6e-05) is smaller than 0.01 (1% significance level, 99% confidence level). Therefore, it can be assumed that the variables are dependent.

**3rd test (group):** Check independence between Topics and Sentiment

In [13]:
# crosstab 
pd.crosstab(df["covid_hashtags"], df["label"])

label,NEGATIVE,POSITIVE
covid_hashtags,Unnamed: 1_level_1,Unnamed: 2_level_1
0,73889,123247
1,551,743


In [14]:
# Chi-square test of independence.
x = chi2_contingency(pd.crosstab(df["covid_hashtags"],df["label"]))
p = x[1]
x

(14.047222455555765,
 0.00017827694220970481,
 1,
 array([[ 73954.56251575, 123181.43748425],
        [   485.43748425,    808.56251575]]))

In [15]:
# Compute effect size - Cramers V
v = cramers_v(pd.crosstab(df["covid_hashtags"], df["label"]))

V = 0.008413787977396878


In [16]:
row = ["Topic - Covid", "Sentiment", p, v]
results.loc[len(results)] = row
results

Unnamed: 0,Variable 1,Variable 2,P-Value - Chi-square test,Effect size - Cramers V
0,Age,Sentiment,0.0,0.103599
1,Gender,Sentiment,6.6e-05,0.00896
2,Topic - Covid,Sentiment,0.000178,0.008414


In [17]:
# crosstab 
pd.crosstab(df["brexit_hashtags"], df["label"])

label,NEGATIVE,POSITIVE
brexit_hashtags,Unnamed: 1_level_1,Unnamed: 2_level_1
0,74184,123789
1,256,201


In [18]:
# Chi-square test of independence.
x = chi2_contingency(pd.crosstab(df["brexit_hashtags"],df["label"]))
p = x[1]
x

(66.11089555954123,
 4.2625298739149127e-16,
 1,
 array([[ 74268.55878647, 123704.44121353],
        [   171.44121353,    285.55878647]]))

In [19]:
# Compute effect size - Cramers V
v = cramers_v(pd.crosstab(df["brexit_hashtags"], df["label"]))

V = 0.018252941165965292


In [20]:
row = ["Topic - Brexit", "Sentiment", p, v]
results.loc[len(results)] = row
results

Unnamed: 0,Variable 1,Variable 2,P-Value - Chi-square test,Effect size - Cramers V
0,Age,Sentiment,0.0,0.103599
1,Gender,Sentiment,6.577958e-05,0.00896
2,Topic - Covid,Sentiment,0.0001782769,0.008414
3,Topic - Brexit,Sentiment,4.26253e-16,0.018253


In [21]:
# crosstab 
pd.crosstab(df["esg_hashtags"], df["label"])

label,NEGATIVE,POSITIVE
esg_hashtags,Unnamed: 1_level_1,Unnamed: 2_level_1
0,74110,122910
1,330,1080


In [22]:
# Chi-square test of independence.
x = chi2_contingency(pd.crosstab(df["esg_hashtags"],df["label"]))
p = x[1]
x

(120.01085602186295,
 6.291543296024961e-28,
 1,
 array([[ 73911.04570881, 123108.95429119],
        [   528.95429119,    881.04570881]]))

In [23]:
# Compute effect size - Cramers V
v = cramers_v(pd.crosstab(df["esg_hashtags"], df["label"]))

V = 0.024592722005054978


In [24]:
row = ["Topic - ESG", "Sentiment", p, v]
results.loc[len(results)] = row
results

Unnamed: 0,Variable 1,Variable 2,P-Value - Chi-square test,Effect size - Cramers V
0,Age,Sentiment,0.0,0.103599
1,Gender,Sentiment,6.577958e-05,0.00896
2,Topic - Covid,Sentiment,0.0001782769,0.008414
3,Topic - Brexit,Sentiment,4.26253e-16,0.018253
4,Topic - ESG,Sentiment,6.291543e-28,0.024593


**Conclusion:** The null hypothesis can be rejected because the p-value of all three tests is smaller than 0.01 (1% significance level, 99% confidence level). Therefore, it can be assumed that the variables are dependent.

**4th test:** Check independence between Age and Emotion

In [25]:
# crosstab 
pd.crosstab(df["Age_Group"], df["emotion_label"])

emotion_label,anger,disgust,fear,joy,neutral,sadness,surprise
Age_Group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
30-39,1883,170,6016,2657,2905,1438,1782
40-49,3702,285,17864,10303,13222,4909,7555
50-59,2490,291,15783,11714,8984,3884,4810
60-69,3213,267,15359,7260,9183,3921,3805
<30,662,30,3836,1834,1314,698,650
>70,1526,121,9100,5215,3967,1744,2078


In [26]:
# Chi-square test of independence.
x = chi2_contingency(pd.crosstab(df["Age_Group"],df["emotion_label"]))
p = x[1]
x

(3710.541814784197,
 0.0,
 30,
 array([[ 1144.40395102,    98.84878295,  5771.10446001,  3310.50009071,
          3360.77369853,  1409.18960843,  1756.17940836],
        [ 3928.09474374,   339.29224412, 19808.95388802, 11363.08380789,
         11535.64481177,  4836.95489593,  6027.97560853],
        [ 3256.84148566,   281.31222093, 16423.89683012,  9421.30095248,
          9564.3738346 ,  4010.39088847,  4997.88378773],
        [ 2920.80737792,   252.28701305, 14729.31343043,  8449.23078164,
          8577.54170236,  3596.60712594,  4482.21256866],
        [  612.84797662,    52.93522149,  3090.52558585,  1772.82967293,
          1799.75205362,   754.64524517,   940.46424432],
        [ 1613.00446505,   139.32451746,  8134.20580557,  4666.05469435,
          4736.91389911,  1986.21223605,  2475.2843824 ]]))

In [27]:
# Compute effect size - Cramers V
v = cramers_v(pd.crosstab(df["Age_Group"], df["emotion_label"]))

V = 0.06115472205823401


In [28]:
row = ["Age", "Emotion", p, v]
results.loc[len(results)] = row
results

Unnamed: 0,Variable 1,Variable 2,P-Value - Chi-square test,Effect size - Cramers V
0,Age,Sentiment,0.0,0.103599
1,Gender,Sentiment,6.577958e-05,0.00896
2,Topic - Covid,Sentiment,0.0001782769,0.008414
3,Topic - Brexit,Sentiment,4.26253e-16,0.018253
4,Topic - ESG,Sentiment,6.291543e-28,0.024593
5,Age,Emotion,0.0,0.061155


**Conclusion:** The null hypothesis can be rejected because the p-value (reported as 0.0, i.e. too small to be represented as a floating-point number) is smaller than 0.01 (1% significance level, 99% confidence level). Therefore, it can be assumed that the variables are dependent.

**5th test:** Check independence between Gender and Emotion

In [29]:
# crosstab 
pd.crosstab(df["Gender"], df["emotion_label"])

emotion_label,anger,disgust,fear,joy,neutral,sadness,surprise
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
m,8051,865,39403,25290,27817,9806,14318
w,5425,299,28555,13693,11758,6788,6362


In [30]:
# Chi-square test of independence.
x = chi2_contingency(pd.crosstab(df["Gender"],df["emotion_label"]))
p = x[1]
x

(2274.673711856102,
 0.0,
 6,
 array([[ 8526.4919619 ,   736.48238674, 42998.17013556, 24665.20007055,
         25039.76843219, 10499.30302878, 13084.58398428],
        [ 4949.5080381 ,   427.51761326, 24959.82986444, 14317.79992945,
         14535.23156781,  6094.69697122,  7595.41601572]]))

In [31]:
# Compute effect size - Cramers V
v = cramers_v(pd.crosstab(df["Gender"], df["emotion_label"]))

V = 0.10706706264356


In [32]:
row = ["Gender", "Emotion", p, v]
results.loc[len(results)] = row
results

Unnamed: 0,Variable 1,Variable 2,P-Value - Chi-square test,Effect size - Cramers V
0,Age,Sentiment,0.0,0.103599
1,Gender,Sentiment,6.577958e-05,0.00896
2,Topic - Covid,Sentiment,0.0001782769,0.008414
3,Topic - Brexit,Sentiment,4.26253e-16,0.018253
4,Topic - ESG,Sentiment,6.291543e-28,0.024593
5,Age,Emotion,0.0,0.061155
6,Gender,Emotion,0.0,0.107067


**Conclusion:** The null hypothesis can be rejected because the p-value (reported as 0.0, i.e. too small to be represented as a floating-point number) is smaller than 0.01 (1% significance level, 99% confidence level). Therefore, it can be assumed that the variables are dependent.

**6th test (group):** Check independence between Topics and Emotion

In [33]:
# crosstab 
pd.crosstab(df["covid_hashtags"], df["emotion_label"])

emotion_label,anger,disgust,fear,joy,neutral,sadness,surprise
covid_hashtags,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,13419,1164,66850,38932,39570,16526,20675
1,57,0,1108,51,5,68,5


In [34]:
# Chi-square test of independence.
x = chi2_contingency(pd.crosstab(df["covid_hashtags"],df["emotion_label"]))
p = x[1]
x

(1576.702935557135,
 0.0,
 6,
 array([[1.33881204e+04, 1.15640933e+03, 6.75148329e+04, 3.87287844e+04,
         3.93169239e+04, 1.64857874e+04, 2.05451418e+04],
        [8.78795747e+01, 7.59066673e+00, 4.43167122e+02, 2.54215602e+02,
         2.58076148e+02, 1.08212649e+02, 1.34858237e+02]]))

In [35]:
# Compute effect size - Cramers V
v = cramers_v(pd.crosstab(df["covid_hashtags"], df["emotion_label"]))

V = 0.08913972130093012


In [36]:
row = ["Topic - Covid", "Emotion", p, v]
results.loc[len(results)] = row
results

Unnamed: 0,Variable 1,Variable 2,P-Value - Chi-square test,Effect size - Cramers V
0,Age,Sentiment,0.0,0.103599
1,Gender,Sentiment,6.577958e-05,0.00896
2,Topic - Covid,Sentiment,0.0001782769,0.008414
3,Topic - Brexit,Sentiment,4.26253e-16,0.018253
4,Topic - ESG,Sentiment,6.291543e-28,0.024593
5,Age,Emotion,0.0,0.061155
6,Gender,Emotion,0.0,0.107067
7,Topic - Covid,Emotion,0.0,0.08914


In [37]:
# crosstab 
pd.crosstab(df["brexit_hashtags"], df["emotion_label"])

emotion_label,anger,disgust,fear,joy,neutral,sadness,surprise
brexit_hashtags,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,13457,1164,67551,38966,39575,16582,20678
1,19,0,407,17,0,12,2


In [38]:
# Chi-square test of independence.
x = chi2_contingency(pd.crosstab(df["brexit_hashtags"],df["emotion_label"]))
p = x[1]
x

(621.5082697186715,
 5.343095288235906e-131,
 6,
 array([[1.34449637e+04, 1.16131922e+03, 6.78014873e+04, 3.88932191e+04,
         3.94838556e+04, 1.65557827e+04, 2.06323723e+04],
        [3.10362949e+01, 2.68078416e+00, 1.56512654e+02, 8.97809353e+01,
         9.11443582e+01, 3.82172958e+01, 4.76276773e+01]]))

In [39]:
# Compute effect size - Cramers V
v = cramers_v(pd.crosstab(df["brexit_hashtags"], df["emotion_label"]))

V = 0.05596542287673583


In [40]:
row = ["Topic - Brexit", "Emotion", p, v]
results.loc[len(results)] = row
results

Unnamed: 0,Variable 1,Variable 2,P-Value - Chi-square test,Effect size - Cramers V
0,Age,Sentiment,0.0,0.103599
1,Gender,Sentiment,6.577958e-05,0.00896
2,Topic - Covid,Sentiment,0.0001782769,0.008414
3,Topic - Brexit,Sentiment,4.26253e-16,0.018253
4,Topic - ESG,Sentiment,6.291543e-28,0.024593
5,Age,Emotion,0.0,0.061155
6,Gender,Emotion,0.0,0.107067
7,Topic - Covid,Emotion,0.0,0.08914
8,Topic - Brexit,Emotion,5.343095e-131,0.055965


In [41]:
# crosstab 
pd.crosstab(df["esg_hashtags"], df["emotion_label"])

emotion_label,anger,disgust,fear,joy,neutral,sadness,surprise
esg_hashtags,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,13360,1164,66817,38891,39566,16551,20671
1,116,0,1141,92,9,43,9


In [42]:
# Chi-square test of independence.
x = chi2_contingency(pd.crosstab(df["esg_hashtags"],df["emotion_label"]))
p = x[1]
x

(1484.137289939724,
 0.0,
 6,
 array([[1.33802425e+04, 1.15572887e+03, 6.74751054e+04, 3.87059954e+04,
         3.92937887e+04, 1.64760867e+04, 2.05330525e+04],
        [9.57574963e+01, 8.27112836e+00, 4.82894623e+02, 2.77004636e+02,
         2.81211258e+02, 1.17913320e+02, 1.46947538e+02]]))

In [43]:
# Compute effect size - Cramers V
v = cramers_v(pd.crosstab(df["esg_hashtags"], df["emotion_label"]))

V = 0.08648352292569428


In [44]:
row = ["Topic - ESG", "Emotion", p, v]
results.loc[len(results)] = row
results

Unnamed: 0,Variable 1,Variable 2,P-Value - Chi-square test,Effect size - Cramers V
0,Age,Sentiment,0.0,0.103599
1,Gender,Sentiment,6.577958e-05,0.00896
2,Topic - Covid,Sentiment,0.0001782769,0.008414
3,Topic - Brexit,Sentiment,4.26253e-16,0.018253
4,Topic - ESG,Sentiment,6.291543e-28,0.024593
5,Age,Emotion,0.0,0.061155
6,Gender,Emotion,0.0,0.107067
7,Topic - Covid,Emotion,0.0,0.08914
8,Topic - Brexit,Emotion,5.343095e-131,0.055965
9,Topic - ESG,Emotion,0.0,0.086484


**Conclusion:** The null hypothesis can be rejected because the p-value of all three tests is smaller than 0.01 (1% significance level, 99% confidence level). Therefore, it can be assumed that the variables are dependent.
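A common rule of thumb for the Chi-square approximation is that expected cell counts should not fall below about 5. The topic tables contain very small cells (e.g. no "disgust" entries for any topic), so it may be worth checking this. The sketch below re-enters the Brexit/emotion crosstab shown above by hand; the variable names are chosen for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the brexit_hashtags x emotion_label crosstab above
# (columns: anger, disgust, fear, joy, neutral, sadness, surprise)
observed = np.array([
    [13457, 1164, 67551, 38966, 39575, 16582, 20678],  # brexit_hashtags = 0
    [19, 0, 407, 17, 0, 12, 2],                        # brexit_hashtags = 1
])

expected = chi2_contingency(observed)[3]

min_expected = expected.min()            # smallest expected cell count
share_below_5 = (expected < 5).mean()    # fraction of cells with expected count < 5
print(min_expected, share_below_5)
```

Here only one cell (Brexit and "disgust") has an expected count below 5, so the approximation is unlikely to change the conclusion, but the check is cheap to repeat for the other topic tables.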

In [45]:
results

Unnamed: 0,Variable 1,Variable 2,P-Value - Chi-square test,Effect size - Cramers V
0,Age,Sentiment,0.0,0.103599
1,Gender,Sentiment,6.577958e-05,0.00896
2,Topic - Covid,Sentiment,0.0001782769,0.008414
3,Topic - Brexit,Sentiment,4.26253e-16,0.018253
4,Topic - ESG,Sentiment,6.291543e-28,0.024593
5,Age,Emotion,0.0,0.061155
6,Gender,Emotion,0.0,0.107067
7,Topic - Covid,Emotion,0.0,0.08914
8,Topic - Brexit,Emotion,5.343095e-131,0.055965
9,Topic - ESG,Emotion,0.0,0.086484


**Overall conclusion:** For all tests, the null hypothesis could be rejected at the 1% significance level (99% confidence level). Therefore, it can be assumed that sentiment and emotion depend on all collected demographic characteristics and on the 3 identified topic clusters. However, Cramer's V indicates only a weak dependency in all cases. Therefore, the effect size is measured again for groups that differed strongly in the visualizations.
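Why can a test be highly significant while the effect is weak? The p-value depends strongly on the sample size, whereas Cramer's V does not. The sketch below illustrates this with the Gender/Sentiment counts from the 2nd test and a hypothetical sample that is 100 times smaller but has (almost) the same proportions. The helper `chi2_p_and_v` is an assumption for illustration, and the continuity correction is switched off so that both tables are treated alike.

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi2_p_and_v(table):
    """Return p-value and Cramer's V for a contingency table (illustrative helper)."""
    chi2, p, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
    return p, v

# Gender x Sentiment counts from the 2nd test (rows: m, w; columns: NEGATIVE, POSITIVE)
full = np.array([[46684, 78866],
                 [27756, 45124]])

# Hypothetical sample: same proportions, but only about 1% of the observations
small = np.round(full / 100).astype(int)

p_full, v_full = chi2_p_and_v(full)
p_small, v_small = chi2_p_and_v(small)
print(p_full, v_full)
print(p_small, v_small)
```

The effect size is nearly identical in both cases, but only the full sample leads to a rejection at the 1% level. With roughly 200,000 observations, even very small differences become statistically significant, which is why the effect size is reported alongside the p-value.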

Example 1: Sentiment & "Age_Group = <30" or "Age_Group = 30-39"

In [46]:
# Data selection & cross tab
df1 = df[df['Age_Group']=="<30"]
df2 = df[df['Age_Group']=="30-39"]
df3 = pd.concat([df1, df2], ignore_index=True)
pd.crosstab(df3["Age_Group"], df3["label"])

label,NEGATIVE,POSITIVE
Age_Group,Unnamed: 1_level_1,Unnamed: 2_level_1
30-39,7659,9192
<30,2351,6673


In [47]:
# Chi-square test of independence.
x = chi2_contingency(pd.crosstab(df3["Age_Group"],df3["label"]))
p = x[1]
x

(931.5130262603652,
 1.3848318780580499e-204,
 1,
 array([[ 6518.97623188, 10332.02376812],
        [ 3491.02376812,  5532.97623188]]))

In [48]:
# Compute effect size - Cramers V
v = cramers_v(pd.crosstab(df3["Age_Group"], df3["label"]))

V = 0.18973798626092847


In [49]:
row = ["Age_Group <30 & 30-39", "Sentiment",p, v]
results.loc[len(results)] = row
results

Unnamed: 0,Variable 1,Variable 2,P-Value - Chi-square test,Effect size - Cramers V
0,Age,Sentiment,0.0,0.103599
1,Gender,Sentiment,6.577958e-05,0.00896
2,Topic - Covid,Sentiment,0.0001782769,0.008414
3,Topic - Brexit,Sentiment,4.26253e-16,0.018253
4,Topic - ESG,Sentiment,6.291543e-28,0.024593
5,Age,Emotion,0.0,0.061155
6,Gender,Emotion,0.0,0.107067
7,Topic - Covid,Emotion,0.0,0.08914
8,Topic - Brexit,Emotion,5.343095e-131,0.055965
9,Topic - ESG,Emotion,0.0,0.086484


Example 2: Sentiment & "Topic cluster - Brexit" or "Topic cluster - ESG"

In [50]:
# Data selection
df1 = df[df['brexit_hashtags']== 1]
df2 = df[df['esg_hashtags']== 1]
df3 = pd.concat([df1, df2], ignore_index=True)
print(len(df3))
# Delete entries which mention both topics (after the concat they appear twice, so both copies are dropped)
df3 = df3.drop(df3[(df3.brexit_hashtags == 1) & (df3.esg_hashtags == 1)].index)
print(len(df3))
# Relabel the column so that it states which topic cluster each entry belongs to
df3['brexit_hashtags'] = df3['brexit_hashtags'].map({0: 'ESG', 1: 'Brexit'})

1867
1865


In [51]:
# Cross tab - brexit_hashtags now contains the information whether an entry belongs to the ESG or the Brexit cluster
pd.crosstab(df3["brexit_hashtags"], df3["label"])

label,NEGATIVE,POSITIVE
brexit_hashtags,Unnamed: 1_level_1,Unnamed: 2_level_1
Brexit,255,201
ESG,329,1080


In [52]:
# Chi-square test of independence.
x = chi2_contingency(pd.crosstab(df3["brexit_hashtags"],df3["label"]))
p = x[1]
x

(168.41473625391367,
 1.6421513686098202e-38,
 1,
 array([[142.79034853, 313.20965147],
        [441.20965147, 967.79034853]]))

In [53]:
# Compute effect size - Cramers V
v = cramers_v(pd.crosstab(df3["brexit_hashtags"], df3["label"]))

V = 0.30050425561520616


In [54]:
row = ["Brexit or ESG", "Sentiment",p, v]
results.loc[len(results)] = row
results

Unnamed: 0,Variable 1,Variable 2,P-Value - Chi-square test,Effect size - Cramers V
0,Age,Sentiment,0.0,0.103599
1,Gender,Sentiment,6.577958e-05,0.00896
2,Topic - Covid,Sentiment,0.0001782769,0.008414
3,Topic - Brexit,Sentiment,4.26253e-16,0.018253
4,Topic - ESG,Sentiment,6.291543e-28,0.024593
5,Age,Emotion,0.0,0.061155
6,Gender,Emotion,0.0,0.107067
7,Topic - Covid,Emotion,0.0,0.08914
8,Topic - Brexit,Emotion,5.343095e-131,0.055965
9,Topic - ESG,Emotion,0.0,0.086484


Example 3: "Emotion = Fear" & Gender

In [55]:
# Work on a copy so that the original emotion labels in df are not overwritten
df1 = df.copy()
df1.loc[df1.emotion_label != 'fear', 'emotion_label'] = 'Not fear'
df1.loc[df1.emotion_label == 'fear', 'emotion_label'] = 'Fear'

In [56]:
# Cross tab
pd.crosstab(df1["Gender"], df1["emotion_label"])

emotion_label,Fear,Not fear
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1
m,39403,86147
w,28555,44325


In [57]:
# Chi-square test of independence.
x = chi2_contingency(pd.crosstab(df1["Gender"],df1["emotion_label"]))
p = x[1]
x

(1244.3916676160832,
 1.3737127375249912e-272,
 1,
 array([[42998.17013556, 82551.82986444],
        [24959.82986444, 47920.17013556]]))

In [58]:
# Compute effect size - Cramers V
v = cramers_v(pd.crosstab(df1["Gender"], df1["emotion_label"]))

V = 0.07919082748188042


In [59]:
row = ["Gender", "Fear or not fear",p, v]
results.loc[len(results)] = row
results

Unnamed: 0,Variable 1,Variable 2,P-Value - Chi-square test,Effect size - Cramers V
0,Age,Sentiment,0.0,0.103599
1,Gender,Sentiment,6.577958e-05,0.00896
2,Topic - Covid,Sentiment,0.0001782769,0.008414
3,Topic - Brexit,Sentiment,4.26253e-16,0.018253
4,Topic - ESG,Sentiment,6.291543e-28,0.024593
5,Age,Emotion,0.0,0.061155
6,Gender,Emotion,0.0,0.107067
7,Topic - Covid,Emotion,0.0,0.08914
8,Topic - Brexit,Emotion,5.343095e-131,0.055965
9,Topic - ESG,Emotion,0.0,0.086484


In [60]:
results.to_csv("Tests.csv") #see Output_files