# Chi-Square Test for Dependence Between Categorical Variables


***
One of the most common problems in **machine learning** is determining whether the input features are relevant to the outcome to be predicted. This is the problem of feature selection.
***

In classification problems where the input variables are also categorical, we can use statistical tests to determine whether the output variable is dependent on or independent of the input variables.
 
"A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values."
 
The Chi-Squared test is a statistical hypothesis test whose null hypothesis is that the **observed frequencies** for a categorical variable match the **expected frequencies** for that variable. The test works on a contingency table: it first calculates the expected frequencies for the groups, then determines whether the actual division of the groups, called the observed frequencies, matches those expected frequencies.
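As a rough illustration of the first step, the expected frequencies under independence can be computed from the row and column totals of the contingency table. This is a minimal sketch assuming NumPy is available; the table is the sample table from the worked example in this document, and the variable names are illustrative only.

```python
import numpy as np

# Observed counts (the sample contingency table from the worked example).
observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

row_totals = observed.sum(axis=1)   # one total per row
col_totals = observed.sum(axis=0)   # one total per column
grand_total = observed.sum()

# Under independence: expected[i, j] = row_total[i] * col_total[j] / grand_total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round(3))
```

The expected table keeps the same row and column totals as the observed table; only the way the counts are spread across the cells changes.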
    
The **result** of the test is a **test statistic** that follows a chi-squared distribution under the null hypothesis. It is used to reject or fail to reject the null hypothesis that the observed and expected frequencies are the same.
    
The statistic is a sum over all cells of the table: for each cell, the squared difference between the observed and expected frequency is divided by the expected frequency. When an observed frequency is far from its expected frequency, the corresponding term in the sum is large; when the two are close, the term is small. Large values of **Chi-square** therefore indicate that the observed and expected frequencies are far apart, and small values indicate that they are close.
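The sum described above can be written out cell by cell. The sketch below assumes NumPy and SciPy are available, takes the expected frequencies from `chi2_contingency`, and checks that the hand-computed sum agrees with the statistic SciPy reports. SciPy applies a continuity correction by default only for tables with one degree of freedom, so it does not affect this 2 x 4 example.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Sample contingency table from the worked example.
observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

stat, p, dof, expected = chi2_contingency(observed)

# One term per cell: (observed - expected)^2 / expected
terms = (observed - expected) ** 2 / expected
stat_manual = terms.sum()

print(terms.round(3))
print('manual=%.3f, scipy=%.3f' % (stat_manual, stat))
```

Printing `terms` shows which cells contribute most to the statistic, which can help when looking for where the observed counts depart from independence.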
   
***"The variables are considered independent if the observed and expected frequencies are similar, that the levels of the variables do not interact, are not dependent."***

We can interpret the dependency of the variables in two ways:

1. Using the test statistic
2. Using the p-value

**1. Using the test statistic**

We can interpret the test statistic in the context of the chi-squared distribution with the requisite number of degrees of freedom as follows:

•	**If Statistic >= Critical Value:** significant result, reject null hypothesis (H0), dependent.

•	**If Statistic < Critical Value:** not significant result, fail to reject null hypothesis (H0), independent.

The degrees of freedom for the chi-squared distribution are calculated from the size of the contingency table:

degrees of freedom = (rows - 1) * (cols - 1)
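To make the degrees-of-freedom formula concrete, the sketch below computes it for a few table shapes and looks up the matching critical value at a 95% probability level with `scipy.stats.chi2.ppf`. The helper name `degrees_of_freedom` and the list of shapes are hypothetical and chosen only for illustration.

```python
from scipy.stats import chi2

def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for an n_rows x n_cols contingency table."""
    return (n_rows - 1) * (n_cols - 1)

# A few example table shapes (illustrative only).
for n_rows, n_cols in [(2, 2), (2, 4), (3, 3)]:
    dof = degrees_of_freedom(n_rows, n_cols)
    # Critical value: the point below which 95% of the distribution lies.
    critical = chi2.ppf(0.95, dof)
    print('%dx%d table: dof=%d, critical=%.3f' % (n_rows, n_cols, dof, critical))
```

Larger tables have more degrees of freedom and therefore a higher critical value, so the same statistic is weaker evidence of dependence in a larger table.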

**2. Using the p-value**

In terms of a p-value and a chosen significance level (alpha), the test can be interpreted as follows:

•	**If p-value <= alpha:** significant result, reject null hypothesis (H0), dependent.

•	**If p-value > alpha:** not significant result, fail to reject null hypothesis (H0), independent.
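The two interpretations are two views of the same comparison: the p-value is the upper-tail probability of the chi-squared distribution beyond the observed statistic. The sketch below, assuming SciPy is available, recomputes the p-value with `chi2.sf` and checks that both decision rules agree for the sample table used in this document.

```python
from scipy.stats import chi2, chi2_contingency

# Sample contingency table from the worked example.
table = [[60, 54, 46, 41],
         [40, 44, 53, 57]]
stat, p, dof, expected = chi2_contingency(table)

# Survival function: probability of a value at least as large as stat.
p_manual = chi2.sf(stat, dof)

alpha = 0.05
critical = chi2.ppf(1.0 - alpha, dof)

reject_by_p = bool(p_manual <= alpha)
reject_by_stat = bool(stat >= critical)
print('p=%.3f, reject by p-value: %s, reject by statistic: %s'
      % (p_manual, reject_by_p, reject_by_stat))
```

Because the critical value is defined from the same distribution and the same alpha, the two rules reach the same decision.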

As a rule of thumb, the test is reliable only when every cell of the contingency table has an expected frequency of at least five.
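One way to check this requirement before trusting the result is to look at the smallest expected frequency. The sketch below assumes the threshold of five is applied to expected counts, which is how the rule of thumb is usually stated; the helper `min_expected_count` and the small 2 x 2 table are hypothetical and included only for contrast.

```python
import numpy as np
from scipy.stats import chi2_contingency

def min_expected_count(table):
    """Return the smallest expected cell frequency under independence."""
    _, _, _, expected = chi2_contingency(table)
    return float(np.min(expected))

# Sample table from the worked example, plus a deliberately small table.
table = [[60, 54, 46, 41],
         [40, 44, 53, 57]]
small_table = [[3, 2],
               [1, 4]]

for name, t in [('sample table', table), ('small table', small_table)]:
    smallest = min_expected_count(t)
    status = 'ok' if smallest >= 5 else 'below the rule-of-thumb threshold'
    print('%s: smallest expected count = %.2f (%s)' % (name, smallest, status))
```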


***


In [7]:
```python
# import the chi-squared test for contingency tables and the chi-squared distribution
from scipy.stats import chi2_contingency
from scipy.stats import chi2
```
In [8]:
```python
# sample contingency table: 2 rows x 4 columns of observed counts
table = [[60, 54, 46, 41],
         [40, 44, 53, 57]]
print(table)
```

```
[[60, 54, 46, 41], [40, 44, 53, 57]]
```


In [9]:
```python
# calculate the test statistic, p-value, degrees of freedom and expected frequencies
stat, p, dof, expected = chi2_contingency(table)
print('dof=%d' % dof)
print(expected)
```

```
dof=3
[[50.88607595 49.86835443 50.37721519 49.86835443]
 [49.11392405 48.13164557 48.62278481 48.13164557]]
```


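In [10]:
```python
# interpret using the test statistic
prob = 0.95
# critical value: the point below which 95% of the chi-squared distribution lies
critical = chi2.ppf(prob, dof)
print('probability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, stat))
# the chi-squared statistic is never negative, so no abs() is needed
if stat >= critical:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
```

```
probability=0.950, critical=7.815, stat=8.006
Dependent (reject H0)
```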


In [11]:
```python
# interpret using the p-value
alpha = 1.0 - prob
print('significance=%.3f, p=%.3f' % (alpha, p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
```

```
significance=0.050, p=0.046
Dependent (reject H0)
```
