# Chi-square test (χ2)

* The chi-square (χ2) test is a statistical method for making inferences about the distribution of a variable, or for deciding whether a relationship exists between two variables of a population. The inference relies on the χ2 distribution, whose shape depends on the number of degrees of freedom (d.f.).

## Chi-Squared Distribution with different degrees of freedom

<b>Note:</b> The χ2 distribution curve is right-skewed; as the number of degrees of freedom becomes larger, the curve becomes more symmetric and more similar to a normal distribution.

![](https://miro.medium.com/max/720/1*-VtI50wyyTXzOdW2svxR4Q.png)
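
The shrinking skew can also be checked numerically. The snippet below is a minimal sketch, assuming SciPy is installed; the degrees of freedom in `dofs` are arbitrary illustration values, not read from the figure. It prints the skewness of the χ2 distribution, which for k degrees of freedom equals sqrt(8/k) and therefore moves toward 0 (the skewness of a normal distribution) as k grows.

```python
from scipy import stats

# Arbitrary degrees of freedom chosen for illustration
dofs = [1, 2, 5, 10, 50, 200]

# Skewness of the chi-square distribution for each d.f.
skewness = {k: float(stats.chi2.stats(k, moments="s")) for k in dofs}

for k, s in skewness.items():
    print(f"d.f. = {k:>3}: skewness = {s:.3f}")
```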




## χ2 test of Independence

* It is used to decide whether a relationship exists between two categorical variables of a population. It is useful, for example, when analyzing survey results for 2 categorical variables.

### Null and alternative hypotheses

* H₀: The two categorical variables have no relationship
* H₁: There is a relationship between the two categorical variables

Note: The number of degrees of freedom of the χ2 independence test statistic is:

$d.f. = (\text{number of rows} - 1) \times (\text{number of columns} - 1)$

## Test statistic of the chi-squared test
* Below is the contingency table for the two variables X1 and X2.
* R1, R2, ..., Rr are the r categories of variable X1.
* C1, C2, ..., Cc are the c categories of variable X2.
* O11, O21, O31, ..., Or1 are the observed counts of category C1 within R1, R2, ..., Rr.
* Similarly, O1c, O2c, O3c, ..., Orc are the observed counts of category Cc within R1, R2, ..., Rr.
* C1 Total, C2 Total, C3 Total, ..., Cc Total are the total counts of C1, C2, ..., Cc respectively.
* R1 Total, R2 Total, R3 Total, ..., Rr Total are the total counts of R1, R2, ..., Rr respectively.
* Grand Total is the total number of observations; it equals the sum of the column totals (C1 to Cc), which is also the sum of the row totals (R1 to Rr).

![](https://miro.medium.com/max/720/1*dFFtBknV9CEZ3ioHBmyi-Q.png)

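* Under the null hypothesis ("The two categorical variables have no relationship"), each column category should make up the same proportion of every row, so the expected count of a cell is (row total × column total) / grand total. These are the expected values under the null hypothesis: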

![](https://miro.medium.com/max/720/1*ujyHU_Kj_xLozEd6X-ZKTw.png)
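
To make the expected counts concrete, here is a minimal sketch using NumPy. The 2×2 table `observed` is a made-up example (not data from this notebook), and the variable names are illustrative assumptions. Each expected count is the row total times the column total divided by the grand total, which `np.outer` computes for all cells at once.

```python
import numpy as np

# Hypothetical 2x2 table of observed counts (illustration only)
observed = np.array([[10, 20],
                     [30, 40]])

row_totals = observed.sum(axis=1)   # totals of R1, R2
col_totals = observed.sum(axis=0)   # totals of C1, C2
grand_total = observed.sum()

# Expected count of each cell under H0: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

Note that the expected table has the same row and column totals as the observed one; only the way the counts are spread across the cells differs.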

* The test statistic is the sum, over all cells, of the squared difference between the observed and expected values divided by the expected value.

* Intuitively, it measures how far the observed values are from what we would expect under the assumption of no relationship.

![](https://miro.medium.com/max/720/1*SG-NUDfemymJUoo_XyM--A.png)
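
Written out in symbols, with $O_{ij}$ the observed count in row $i$ and column $j$, $R_i$ the total of row $i$, $C_j$ the total of column $j$ and $N$ the grand total (the symbols $R_i$, $C_j$ and $N$ are shorthand introduced here for the totals in the contingency table), the expected counts and the test statistic can be stated as:

$$E_{ij} = \frac{R_i \, C_j}{N}, \qquad \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$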

## Let's understand by an example:

The table below is an exit poll which displays the joint responses to 2 categorical variables: the age group of the respondents (18–29, 30–44, 45–64, and 65 and older) and their political affiliation (“Conservative”, “Socialist” or “Other”). Is there any evidence of a relationship between age group and political affiliation at the 5% significance level?

![](https://miro.medium.com/max/640/1*RhRyvc_638pybH-dqiKRag.png)

Following the five-step process of hypothesis testing:

* H₀: Age group and political affiliation are independent, i.e. there is no relationship
* H₁: Age group and political affiliation are dependent, i.e. there is a relationship
* α = 0.05



In [None]:
import pandas as pd
import scipy.stats as stats

# create sample data according to survey
data = [['18-29', 'Conservative'] for i in range(141)] + \
        [['18-29', 'Socialist'] for i in range(68)] + \
        [['18-29', 'Other'] for i in range(4)] + \
        [['30-44', 'Conservative'] for i in range(179)] + \
        [['30-44', 'Socialist'] for i in range(159)] + \
        [['30-44', 'Other'] for i in range(7)] + \
        [['45-64', 'Conservative'] for i in range(220)] + \
        [['45-64', 'Socialist'] for i in range(216)] + \
        [['45-64', 'Other'] for i in range(4)] + \
        [['65 & older', 'Conservative'] for i in range(86)] + \
        [['65 & older', 'Socialist'] for i in range(101)] + \
        [['65 & older', 'Other'] for i in range(4)]
df = pd.DataFrame(data, columns = ['Age Group', 'Political Affiliation']) 

# create contingency table
data_crosstab = pd.crosstab(df['Age Group'],
                            df['Political Affiliation'],
                           margins=True, margins_name="Total")

# significance level
alpha = 0.05

# Calculation of the chi-square statistic
chi_square = 0
rows = df['Age Group'].unique()
columns = df['Political Affiliation'].unique()
for i in columns:
    for j in rows:
        # observed count for this (age group, affiliation) cell
        O = data_crosstab[i][j]
        # expected count under H0: column total * row total / grand total
        E = data_crosstab[i]['Total'] * data_crosstab['Total'][j] / data_crosstab['Total']['Total']
        chi_square += (O-E)**2/E

# The p-value approach
print("Approach 1: The p-value approach to hypothesis testing in the decision rule")
p_value = 1 - stats.chi2.cdf(chi_square, (len(rows)-1)*(len(columns)-1))
conclusion = "Failed to reject the null hypothesis."
if p_value <= alpha:
    conclusion = "Null Hypothesis is rejected."
        
print("chisquare-score is:", chi_square, " and p value is:", p_value)
print(conclusion)
    
# The critical value approach
print("\n--------------------------------------------------------------------------------------")
print("Approach 2: The critical value approach to hypothesis testing in the decision rule")
critical_value = stats.chi2.ppf(1-alpha, (len(rows)-1)*(len(columns)-1))
conclusion = "Failed to reject the null hypothesis."
if chi_square > critical_value:
    conclusion = "Null Hypothesis is rejected."
        
print("chisquare-score is:", chi_square, " and critical value is:", critical_value)
print(conclusion)

Approach 1: The p-value approach to hypothesis testing in the decision rule
chisquare-score is: 24.367421717305202  and p value is: 0.0004469083391495099
Null Hypothesis is rejected.

--------------------------------------------------------------------------------------
Approach 2: The critical value approach to hypothesis testing in the decision rule
chisquare-score is: 24.367421717305202  and critical value is: 12.591587243743977
Null Hypothesis is rejected.
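
As a cross-check, the same result can be reproduced with SciPy's built-in `scipy.stats.chi2_contingency`. This is a minimal sketch rather than part of the original walkthrough: the array `observed` simply re-enters the counts used above, and the `*_check` variable names are illustrative. SciPy applies Yates' continuity correction only when there is 1 degree of freedom, so for this 4×3 table it computes the plain Pearson statistic.

```python
import numpy as np
from scipy import stats

# Observed counts from the exit poll
# rows: 18-29, 30-44, 45-64, 65 & older
# columns: Conservative, Socialist, Other
observed = np.array([
    [141,  68, 4],
    [179, 159, 7],
    [220, 216, 4],
    [ 86, 101, 4],
])

# Returns the statistic, the p-value, the degrees of freedom
# and the table of expected counts under H0
stat_check, p_check, dof_check, expected_check = stats.chi2_contingency(observed)

print("chi-square:", stat_check)
print("p-value:", p_check)
print("degrees of freedom:", dof_check)
```

The statistic and p-value should agree with the manual calculation above up to floating-point rounding, with (4 - 1) * (3 - 1) = 6 degrees of freedom.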


## χ2 Goodness-Of-Fit Test

It is used to make inferences about the distribution of a variable.

* H₀: The variable has the specified distribution
* H₁: The variable does not have the specified distribution

* The number of degrees of freedom of the χ2 Goodness-Of-Fit test statistic is:

$d.f. = \text{number of categories} - 1$

* It compares the observed frequencies O of a sample with the expected frequencies E.

$E = \text{probability of the category} \times \text{total sample size}$

Let's understand with an example:

> The table below summarizes the result of the 2013 German Federal Election, in which more than 44 million people voted: 41.5% of the votes went to the Christian Democratic Union (CDU), 25.7% to the Social Democratic Party (SPD) and the remaining 32.8% to Others.

> Assume a researcher takes a random sample of 123 students of FU Berlin and asks them about their party affiliation. Of these, 57 vote for CDU, 26 for SPD and 40 for Others. These numbers correspond to the observed frequencies.

![](https://miro.medium.com/max/640/1*PpwdP65Gd9TTC6TY-hByGQ.png)


* H₀: The variable has the specified distribution, i.e. the observed and expected frequencies are roughly equal
* H₁: The variable does not have the specified distribution, i.e. the observed and expected frequencies differ
* α = 0.05
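
Worked by hand from the figures in the example (vote shares of 41.5%, 25.7% and 32.8%, and a sample of 123 students), the expected frequencies and the test statistic come out as follows; the intermediate terms are rounded to three decimals:

$$E_{\text{CDU}} = 0.415 \times 123 = 51.045, \quad E_{\text{SPD}} = 0.257 \times 123 = 31.611, \quad E_{\text{Others}} = 0.328 \times 123 = 40.344$$

$$\chi^2 = \frac{(57 - 51.045)^2}{51.045} + \frac{(26 - 31.611)^2}{31.611} + \frac{(40 - 40.344)^2}{40.344} \approx 0.695 + 0.996 + 0.003 \approx 1.694$$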

The χ2 Goodness-Of-Fit test statistic is computed as follows:

In [None]:
# Create the data: category, hypothesized probability, observed frequency
data = [['CDU', 0.415, 57], ['SPD', 0.257, 26], ['Others', 0.328, 40]] 
df = pd.DataFrame(data, columns = ['Varname', 'prob_dist', 'observed_freq']) 
df['expected_freq'] = df['observed_freq'].sum() * df['prob_dist']

# significance level
alpha = 0.05

# Calculation of the chi-square statistic
chi_square = 0
for i in range(len(df)):
    O = df.loc[i, 'observed_freq']
    E = df.loc[i, 'expected_freq']
    chi_square += (O-E)**2/E

# The p-value approach
print("Approach 1: The p-value approach to hypothesis testing in the decision rule")
p_value = 1 - stats.chi2.cdf(chi_square, df['Varname'].nunique() - 1)
conclusion = "Failed to reject the null hypothesis."
if p_value <= alpha:
    conclusion = "Null Hypothesis is rejected."
        
print("chisquare-score is:", chi_square, " and p value is:", p_value)
print(conclusion)
    
# The critical value approach
print("\n--------------------------------------------------------------------------------------")
print("Approach 2: The critical value approach to hypothesis testing in the decision rule")
critical_value = stats.chi2.ppf(1-alpha, df['Varname'].nunique() - 1)
conclusion = "Failed to reject the null hypothesis."
if chi_square > critical_value:
    conclusion = "Null Hypothesis is rejected."
        
print("chisquare-score is:", chi_square, " and critical value is:", critical_value)
print(conclusion)

Approach 1: The p-value approach to hypothesis testing in the decision rule
chisquare-score is: 1.693614940576721  and p value is: 0.42878164729702506
Failed to reject the null hypothesis.

--------------------------------------------------------------------------------------
Approach 2: The critical value approach to hypothesis testing in the decision rule
chisquare-score is: 1.693614940576721  and critical value is: 5.991464547107979
Failed to reject the null hypothesis.
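
The manual calculation can be cross-checked against SciPy's built-in `scipy.stats.chisquare`. This is a minimal sketch and not part of the original walkthrough: the lists re-enter the figures from the example, and the `*_check` names are illustrative. By default `chisquare` uses (number of categories - 1) degrees of freedom, which matches the test above.

```python
from scipy import stats

# Observed frequencies and hypothesized probabilities (CDU, SPD, Others)
observed_freq = [57, 26, 40]
prob_dist = [0.415, 0.257, 0.328]

# Expected frequency = probability of the category * total sample size
n = sum(observed_freq)
expected_freq = [p * n for p in prob_dist]

# Returns the chi-square statistic and the p-value
stat_check, p_check = stats.chisquare(f_obs=observed_freq, f_exp=expected_freq)

print("chi-square:", stat_check)
print("p-value:", p_check)
```

Both values should agree with the output above up to floating-point rounding.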
