# Chi-Square Test

### Table of Contents
* What are Categorical Variables?
* What is a Chi-Square Test and Why Do We Use It?
* Assumptions of the Chi-Square Test
* Types of Chi-Square Tests (With implementation in Python)
* Chi-Square Goodness of Fit Test
* Chi-Square Test of Association between Two Variables

### Case study:


Let's start with our case study. Think of your favorite restaurant. Suppose you can predict how many people will arrive for lunch on each of the five days it is open per week. At the end of the week, you observe that the expected footfall was different from the actual footfall.

So, how will you check whether the difference between the observed and the expected footfall is statistically significant? Remember, this is a categorical variable, ‘Days of the week’, with 5 categories [Monday, Tuesday, Wednesday, Thursday, Friday].
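One way to answer this question is a chi-square goodness-of-fit test. The sketch below uses `scipy.stats.chisquare`; the footfall numbers are invented purely for illustration and do not come from any real restaurant.

```python
from scipy.stats import chisquare

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

# Hypothetical counts, invented for illustration only.
expected = [20, 20, 30, 40, 60]   # predicted lunch guests per day
observed = [30, 14, 34, 45, 47]   # guests actually counted

# chisquare expects observed and expected counts to sum to the same total.
assert sum(observed) == sum(expected)

# Returns the chi-square statistic and its p-value (k - 1 = 4 degrees of freedom).
statistic, p_value = chisquare(f_obs=observed, f_exp=expected)

print(f"statistic = {statistic:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Observed footfall differs significantly from the prediction.")
else:
    print("No significant difference between observed and predicted footfall.")
```

With these made-up numbers the statistic is 10.775 on 4 degrees of freedom, which is above the 5% critical value of 9.488, so the difference would be judged significant.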

## What are Categorical Variables?

* Categorical variables are variables whose values fall into a finite number of categories.
* These categories are generally names or labels.
* Labels may be present in both the **feature (X, independent, input)** and the **target (y, dependent, output)**.
* These variables are also called qualitative variables because they describe a quality or characteristic.
* You have almost certainly encountered categorical variables before, even if you did not recognize them as such.


**For example:**

The variable `Movie Genre` in a list of movies could contain the categories `Action`, `Fantasy`, `Comedy`, `Romance`, etc.



###  Two types of categorical variables:
1. Nominal Variable
2. Ordinal Variable

#### Nominal Variable:
*  A nominal variable has **no natural ordering** to its categories.
* They have two or more categories.
* For example:
    * Marital Status (Single, Married, Divorcee)
    * Gender (Male, Female, Transgender)

#### Ordinal Variable:
* A variable whose categories can be placed in an order.
* An ordinal variable has a **natural ordering** to its categories.
* Ordinal data is a categorical data type where the categories have a natural order but the distances between the categories are not known.
* For Example:
    * Customer Satisfaction (Excellent, Very Good, Good, Average, Bad)
    * Movie Rating
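The difference between the two types can be made explicit in pandas. This is a minimal sketch with made-up values; the series names `marital` and `satisfaction` are chosen here for illustration.

```python
import pandas as pd

# Nominal: the categories carry no order.
marital = pd.Series(["Single", "Married", "Divorcee", "Married"], dtype="category")

# Ordinal: declare the order explicitly, from worst to best.
levels = ["Bad", "Average", "Good", "Very Good", "Excellent"]
satisfaction = pd.Series(
    pd.Categorical(["Good", "Excellent", "Bad", "Good"],
                   categories=levels, ordered=True)
)

print(marital.cat.ordered)                     # False
print(satisfaction.cat.ordered)                # True
print(satisfaction.min(), satisfaction.max())  # Bad Excellent

# Comparisons are only meaningful for ordered categories.
print((satisfaction >= "Good").tolist())       # [True, True, False, True]
```

Declaring the order lets pandas sort and compare the labels by rank rather than alphabetically, while still making no claim about the distance between ranks.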

## What is a Chi-Square Test and Why Do We Use It?

* `A Chi-Square test is a test of statistical significance for categorical variables.`
* In hypothesis testing, the chi-square test is used to test a hypothesis about how observations (frequencies) are distributed across categories.
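
For reference, the statistic behind the test is usually written as below. The symbols are introduced here and do not appear elsewhere in this notebook: $O_i$ is the observed count in category $i$, $E_i$ is the count expected under the null hypothesis, and $k$ is the number of categories (or cells).

$$
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
$$

The larger the gap between observed and expected counts, the larger $\chi^2$ becomes and the smaller the resulting p-value.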




### Assumptions of the Chi-Square Test

* Just like any other statistical test, the chi-square test comes with a few assumptions of its own:
    * The $\chi^2$ test assumes that the data for the study were obtained through random selection, i.e., they were randomly picked from the population.
    * The categories are mutually exclusive, i.e., each subject fits in only one category.
        **Example:** the number of people who lunched in your restaurant on Monday can’t be counted in the Tuesday category.
    * The data should be in the form of frequencies or counts for each category, not percentages.
    * The data should not consist of paired samples or groups; in other words, the observations should be independent of each other.
    * When more than **20% of the expected frequencies** have a value of **less than 5**, the chi-square test should not be used.
    * To tackle this problem, either combine categories (only where that is meaningful) or obtain more data.
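
The expected-frequency rule above can be checked before running the test. The sketch below is one possible way to do it with NumPy; the helper names `expected_counts` and `share_of_small_cells` and the example table are hypothetical.

```python
import numpy as np

def expected_counts(observed):
    """Expected cell counts under independence:
    row total * column total / grand total."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)   # shape (rows, 1)
    col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, cols)
    return row_totals @ col_totals / observed.sum()

def share_of_small_cells(observed, threshold=5):
    """Fraction of cells whose expected count is below the threshold."""
    return float((expected_counts(observed) < threshold).mean())

# Hypothetical 2 x 3 table of counts, invented for illustration.
table = [[30, 12, 3],
         [20, 8, 2]]

share = share_of_small_cells(table)
print(expected_counts(table))
print(f"{share:.0%} of expected counts are below 5")
if share > 0.20:
    print("Rule of thumb violated: merge categories or collect more data.")
```

In this made-up table two of the six expected counts are below 5, so the rule of thumb is violated and the last column would be a candidate for merging with a neighbouring category.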


In [1]:
url = "https://raw.githubusercontent.com/reddyprasade/DataSet-for-ML-and-Data-Science/master/DataSets/day.csv"

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv(url)
df.head()

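| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 |
| 1 | 2 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 |
| 2 | 3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 |
| 3 | 4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.2 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 |
| 4 | 5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.22927 | 0.436957 | 0.1869 | 82 | 1518 | 1600 |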


In [4]:
df.columns

Index(['instant', 'dteday', 'season', 'yr', 'mnth', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed',
       'casual', 'registered', 'cnt'],
      dtype='object')

In [5]:
categorical_col = ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday',
       'weathersit']

In [6]:
chisqt = pd.crosstab(df.holiday,df.weathersit,margins=True)

In [7]:
chisqt

| holiday \ weathersit | 1 | 2 | 3 | All |
|---|---|---|---|---|
| 0 | 448 | 241 | 21 | 710 |
| 1 | 15 | 6 | 0 | 21 |
| All | 463 | 247 | 21 | 731 |


In [8]:
# chi2_contingency() takes a contingency table and returns the test statistic,
# the p-value, the degrees of freedom and the expected frequencies
from scipy.stats import chi2_contingency

In [11]:
chisqt.iloc[0][0:5]

weathersit
1      448
2      241
3       21
All    710
Name: 0, dtype: int64

In [14]:
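# Rows for holiday = 0 and holiday = 1.
# Note: the slice 0:5 also picks up the 'All' margin column.
value = np.array([chisqt.iloc[0][0:5].values,
                  chisqt.iloc[1][0:5].values])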

In [16]:
chi2_contingency(value)[0:3]

# 1.02 is the test statistic
# 0.80 is the p-value
# 3 is the degrees of freedom

(1.0187958822970438, 0.796704036292652, 3)

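From the output above, 1.02 is the test statistic, about 0.80 is the p-value and 3 is the reported degrees of freedom. Note that the array passed to `chi2_contingency` includes the `All` margin column. Those cells add nothing to the statistic, but they raise the degrees of freedom from 2 to 3, so the p-value for the 2 × 3 table itself is about 0.60 rather than 0.80. Either way the p-value is greater than 0.05, so we fail to reject the null hypothesis: the data give no evidence that ‘holiday’ and ‘weathersit’ are associated.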
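
The same test can be rerun on just the six inner cells of the crosstab, leaving out the `All` margins. The sketch below hard-codes the counts from the table shown earlier so that it runs without downloading the data; it is an illustration, not one of the notebook's original cells.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the crosstab above, without the "All" margins.
# Rows: holiday = 0, 1. Columns: weathersit = 1, 2, 3.
observed = np.array([[448, 241, 21],
                     [15, 6, 0]])

statistic, p_value, dof, expected = chi2_contingency(observed)

print(f"statistic = {statistic:.4f}, p-value = {p_value:.4f}, dof = {dof}")
print(expected.round(2))  # expected counts under independence
```

With the `chisqt` frame built using `margins=True`, the same cells could be selected with `chisqt.iloc[:-1, :-1]`, or the margins could be avoided altogether by calling `pd.crosstab` without `margins=True`. The degrees of freedom come out as (2 - 1) × (3 - 1) = 2.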
