Chi-Square Test for Independence

Check whether there is a significant difference in performance among the GPT models.

- Significance level ($\alpha$): 0.05 (5%). Confidence level: 0.95 (95%).

In [1]:
import numpy as np
from scipy.stats import chi2, chi2_contingency

Contingency table:
|  | GPT 3.5 Turbo | GPT 4 Turbo | GPT 4o |
|------|-----|-----|-----|
| Aligned With Nurses | 318 | 373 | 394 |
| Different From Nurses | 127 | 72 | 51 |
| Column Total | 445 | 445 | 445 |

In [2]:
obs = np.array([[318, 373, 394], [127, 72, 51]])
# correction=True requests Yates' correction for continuity, which adjusts each
# observed value by 0.5 towards the corresponding expected value.
# SciPy applies it only when df == 1 (a 2x2 table), so it has no effect on this 2x3 table.
chi2_score, P_value, df, expected_value_array = chi2_contingency(obs, correction=True)

print(f'chi2_score: {chi2_score}')
print(f'P_value: {P_value}')
print(f'df: {df}')
print(f'expected_value_array: {expected_value_array}')

chi2_score: 45.48597235023042
P_value: 1.3269256893842978e-10
df: 2
expected_value_array: [[361.66666667 361.66666667 361.66666667]
 [ 83.33333333  83.33333333  83.33333333]]
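
As a cross-check on the output above, the expected counts and the statistic can be reproduced by hand. This is a minimal sketch, not part of the original notebook; the names `row_totals`, `col_totals`, `expected`, `chi2_manual`, and `df_manual` are introduced here for illustration. Each expected count is (row total × column total) / grand total, and the statistic is the sum of (observed − expected)² / expected over all cells.

```python
import numpy as np

# Observed counts from the contingency table above
obs = np.array([[318, 373, 394], [127, 72, 51]])

# Marginal totals, kept 2-D so they broadcast into a full table
row_totals = obs.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = obs.sum(axis=0, keepdims=True)   # shape (1, 3)
grand_total = obs.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / grand_total

# Pearson chi-square statistic, with no continuity correction
chi2_manual = ((obs - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
df_manual = (obs.shape[0] - 1) * (obs.shape[1] - 1)

print(f'expected:\n{expected}')
print(f'chi2_manual: {chi2_manual}')
print(f'df_manual: {df_manual}')
```

The uncorrected statistic agrees with `chi2_score` from `chi2_contingency`, which is consistent with the continuity correction not being applied when there are 2 degrees of freedom.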


Get the critical value at the 95% confidence level with 2 degrees of freedom.

In [3]:
critical_value = chi2.ppf(0.95, df=df)
print('The critical value is the point on the chi-square axis that chi2_score is compared against.')
print(f'Critical value at cumulative probability 0.95: {critical_value:.3f}')

The critical value is the point on the chi-square axis that chi2_score is compared against.
Critical value at cumulative probability 0.95: 5.991
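
The same decision can be reached from the p-value instead of the critical value. The sketch below is an illustration rather than part of the original notebook: it hard-codes the statistic and degrees of freedom printed earlier, and the names `alpha`, `p_from_sf`, `reject_by_p`, and `reject_by_critical` are assumptions introduced here. `chi2.sf` gives the upper-tail probability of the statistic, and rejecting when that probability is below `alpha` is equivalent to rejecting when the statistic exceeds `chi2.ppf(1 - alpha, df)`.

```python
from scipy.stats import chi2

alpha = 0.05                      # significance level stated at the top
chi2_score = 45.48597235023042    # statistic printed by chi2_contingency above
df = 2                            # degrees of freedom printed above

# Upper-tail probability of observing a statistic at least this large
p_from_sf = chi2.sf(chi2_score, df)

# Critical value with upper-tail area equal to alpha
critical_value = chi2.ppf(1 - alpha, df)

# The two decision rules always agree
reject_by_p = p_from_sf < alpha
reject_by_critical = chi2_score > critical_value

print(f'p_from_sf: {p_from_sf}')
print(f'critical_value: {critical_value:.3f}')
print(f'reject_by_p: {reject_by_p}, reject_by_critical: {reject_by_critical}')
```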


Chi-square test: compare chi2_score with critical_value.

In [4]:
null_hypothesis = 'The two categorical variables are independent. That means the performances of GPT models are similar.'
alternative_hypothesis = 'The two categorical variables are dependent. That means at least one GPT\'s performance differs from the others.'

# A non-significant result does not prove the null hypothesis, so we "fail to reject" it.
if chi2_score > critical_value:
    print('Reject null hypothesis')
    print(alternative_hypothesis)
else:
    print('Fail to reject null hypothesis')
    print(null_hypothesis)

Reject null hypothesis
The two categorical variables are dependent. That means at least one GPT's performance differs from the others.
