# Chi-Squared Test
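The Chi-Squared ($\chi^2$) distribution is the distribution of a sum of squares of independent standard Gaussian (normal) random variables; the number of terms in the sum is the degrees of freedom. The Chi-Squared Test compares a test statistic against this distribution.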
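
As a quick numerical check of the sum-of-squares description above, the sketch below simulates sums of squared standard normal draws and compares their mean and variance with the theoretical values of a Chi-Squared distribution. The seed, the sample size and the choice of 3 degrees of freedom are arbitrary assumptions made for illustration.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
k = 3                           # degrees of freedom (arbitrary choice)

# Draw 100,000 samples, each the sum of k squared standard normal values.
z = rng.standard_normal(size=(100_000, k))
sum_of_squares = (z ** 2).sum(axis=1)

# A chi-squared variable with k degrees of freedom has mean k and variance 2k.
print("Sample mean:", sum_of_squares.mean(), "theoretical:", chi2.mean(df=k))
print("Sample variance:", sum_of_squares.var(), "theoretical:", chi2.var(df=k))
```

The simulated mean and variance should land close to 3 and 6 respectively, with small differences due to sampling noise.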

The Chi-Squared Test is a statistical test for categorical data, most often used to determine whether there is a significant association between two categorical variables. It is a non-parametric test, meaning it does not assume that the data follow a particular distribution (such as the normal distribution). The Chi-Squared Test comes in two forms:
1. Chi-Squared Test for independence:
    - Objective: To assess whether there is a significant association between two categorical variables.
    - H0: The two variables are independent.
    - H1: The two variables are not independent; there is an association.
    - Degrees of Freedom: $(r - 1) * (c - 1)$, where $r$ = number of rows, $c$ = number of columns in the contingency table.
2. Chi-Squared Test for goodness of fit:
    - Objective: To test whether the observed frequency distribution fits a theoretical (expected) distribution.
    - H0: The observed frequencies follow the expected distribution.
    - H1: The observed frequencies do not follow the expected distribution.
    - Degrees of Freedom: $(k - 1)$, where $k$ is the number of categories.

In both forms of Chi-Squared test, the test statistic is compared to a critical value from the Chi-Squared distribution to determine the statistical significance. If the test statistic is larger than the critical value, the null hypothesis is rejected.
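
The decision rule above can be sketched with `scipy.stats.chi2`. The helper name `reject_h0` and the two statistic values passed to it are hypothetical, chosen only to illustrate one case on each side of the critical value; the degrees of freedom follow the two formulas listed earlier.

```python
from scipy.stats import chi2

def reject_h0(chi2_stat, dof, alpha=0.05):
    """Return the critical value and whether the statistic exceeds it."""
    critical_value = chi2.ppf(1 - alpha, df=dof)
    return critical_value, bool(chi2_stat > critical_value)

# Independence test on a 2 x 3 table: dof = (2 - 1) * (3 - 1) = 2
dof_independence = (2 - 1) * (3 - 1)
crit_ind, reject_ind = reject_h0(0.277, dof_independence)
print("Independence: critical value", crit_ind, "reject H0:", reject_ind)

# Goodness of fit with 2 categories: dof = 2 - 1 = 1
dof_fit = 2 - 1
crit_fit, reject_fit = reject_h0(8.0, dof_fit)
print("Goodness of fit: critical value", crit_fit, "reject H0:", reject_fit)
```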

The Chi-Squared Test is widely used in various fields, such as social sciences, biology and market research to assess the associations between categorical variables or goodness of fit.

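Under the null hypothesis, the observed counts stay close to the expected counts, so the Chi-Squared statistic ($\chi^2$) should be small.

Under the alternative hypothesis, the observed counts deviate from the expected counts, so $\chi^2$ should be large.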
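
A small simulation can make this concrete. The sketch below tosses a coin 50 times in each of 10,000 simulated experiments, once with a fair coin (H0 true) and once with a hypothetical heads probability of 0.7 (H0 false), and compares the average statistic. All of these numbers are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for repeatability
n_tosses, n_experiments = 50, 10_000
expected = n_tosses / 2          # 25 heads and 25 tails expected under H0

def simulated_statistics(p_heads):
    """Chi-squared statistic of each simulated experiment for a given heads probability."""
    heads = rng.binomial(n_tosses, p_heads, size=n_experiments)
    tails = n_tosses - heads
    return (heads - expected) ** 2 / expected + (tails - expected) ** 2 / expected

stats_fair = simulated_statistics(0.5)    # H0 is true
stats_biased = simulated_statistics(0.7)  # H0 is false (0.7 is a hypothetical bias)

print("Mean statistic, fair coin:  ", stats_fair.mean())
print("Mean statistic, biased coin:", stats_biased.mean())
```

The fair-coin statistics should average close to 1 (the degrees of freedom), while the biased-coin statistics should be several times larger.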

### Assumptions of Chi-Squared Test
1. Variables are categorical.
2. Observations are independent.
3. Categories are mutually exclusive: each observation falls into exactly one cell.
4. The expected count in each cell should be at least 5 (a common rule of thumb).
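
The expected-count assumption can be checked before running the test. This is a minimal sketch on an illustrative 2 x 3 table, using `scipy.stats.contingency.expected_freq` and the common rule-of-thumb threshold of 5; the table values are assumptions, not data from a real study.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Illustrative contingency table (rows and columns are two categorical variables)
observed = np.array([[10, 20, 30],
                     [15, 25, 35]])

# Expected counts under independence: row total * column total / grand total
expected = expected_freq(observed)
print(expected)
print("Smallest expected count:", expected.min())
print("All expected counts at least 5:", bool((expected >= 5).all()))
```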

# Chi-Squared Test For Independence

In [1]:
import numpy as np
from scipy.stats import chi2_contingency

observed_data = np.array([[10, 20, 30],
                          [15, 25, 35]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed_data)

print("Chi-squared Statistic:", chi2_stat)
print("P-value:", p_value)
print("Degrees of Freedom:", dof)
print("Expected Frequencies:\n", expected)

Chi-squared Statistic: 0.27692307692307694
P-value: 0.870696738961232
Degrees of Freedom: 2
Expected Frequencies:
 [[11.11111111 20.         28.88888889]
 [13.88888889 25.         36.11111111]]


# Chi-Squared Test For Goodness Of Fit

In [2]:
import numpy as np
from scipy.stats import chisquare

# observed frequencies
observed_data = np.array([30, 45, 25])

# expected frequencies (must have the same total as the observed frequencies)
expected_data = np.array([20, 50, 30])

# chisquare() runs the goodness-of-fit test; chi2_contingency() would instead
# treat the two arrays as rows of a contingency table and test independence
chi2_stat, p_value = chisquare(f_obs=observed_data, f_exp=expected_data)

print("Chi-squared Statistic:", round(chi2_stat, 4))
print("P-value:", round(p_value, 4))

Chi-squared Statistic: 6.3333
P-value: 0.0421


# Examples Showing How To Conduct A Chi-Squared Test

### Example 1
Conduct a Chi-Squared Test to see if a coin is fair or biased.

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chisquare  # returns the chi-squared statistic and the p-value

# H0: coin is fair
# Ha: coin is biased

# the chisquare() function takes two lists: the observed and the expected frequencies
# 50 tosses: 28 and 22 observed for the two faces, 25 each expected for a fair coin
alpha = 0.05
chi_stat, p_value = chisquare([28, 22], [25, 25])
print(chi_stat)
print(p_value)

if p_value < alpha:
  print("Reject H0")
  print("Coin is biased")
else:
  print("Fail to reject H0")
  print("No evidence that the coin is biased")

# repeat the above for 5000 tosses with observed counts 2600 and 2400
chi_stat, p_value = chisquare([2600, 2400], [2500, 2500])
print(chi_stat)
print(p_value)

if p_value < alpha:
  print("Reject H0")
  print("Coin is biased")
else:
  print("Fail to reject H0")
  print("No evidence that the coin is biased")

0.72
0.3961439091520741
Fail to reject H0
No evidence that the coin is biased
8.0
0.004677734981047265
Reject H0
Coin is biased


### Example 2
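Compute the Chi-Squared statistic for the first coin experiment directly from the formula $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ are the observed and $E_i$ the expected frequencies,

In [4]:
((28 - 25) ** 2 / 25) + ((22 - 25) ** 2 / 25)

0.72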
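
The statistic is the sum of (observed - expected)² / expected over all categories. As a cross-check, the sketch below computes it with NumPy arrays for the 50-toss coin data and compares it with the value returned by `scipy.stats.chisquare`; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([28, 22])
expected = np.array([25, 25])

# Square the difference first, then divide by the expected count, then sum
manual_stat = ((observed - expected) ** 2 / expected).sum()

# chisquare() should return the same statistic
scipy_stat, scipy_p = chisquare(observed, expected)

print("Manual statistic:", manual_stat)
print("SciPy statistic: ", scipy_stat)
```

Writing the computation on arrays keeps it unchanged when there are more than two categories.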

### Example 3
How can the Chi-Squared distribution be used to find the critical value, that is, the value with a probability of 0.05 to its right?

In [5]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chisquare
from scipy.stats import chi2

# find the p_value
# p_value = 1 - chi2.cdf(chi_stat, dof) # dof: degrees of freedom

# calculate the critical value from the chi-squared distribution for a given
# significance level (alpha = 0.05, so the 0.95 quantile) and degrees of freedom
chi2.ppf(0.95, df = 1)
# meaning, if the chi_stat is less than this value, H0 is not rejected
# once the chi_stat value exceeds this critical value, H0 is rejected

np.float64(3.8414588206941205)
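
The p-value and the critical value are two views of the same right-tail area. The sketch below, assuming the two coin-toss statistics from Example 1 (0.72 and 8.0) and 1 degree of freedom, computes the p-values with `chi2.sf` (the survival function, equal to `1 - chi2.cdf`) and shows that comparing the p-value with alpha gives the same decision as comparing the statistic with the critical value.

```python
from scipy.stats import chi2

alpha, dof = 0.05, 1
critical_value = chi2.ppf(1 - alpha, df=dof)

decisions = []
for chi_stat in (0.72, 8.0):  # statistics from the two coin experiments
    p_value = chi2.sf(chi_stat, df=dof)  # right-tail probability
    beyond_critical = bool(chi_stat > critical_value)
    below_alpha = bool(p_value < alpha)
    decisions.append((beyond_critical, below_alpha))
    print(chi_stat, p_value, beyond_critical, below_alpha)

# The right-tail area at the critical value is alpha itself
print(chi2.sf(critical_value, df=dof))
```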

### Example 4
Consider a survey that records the impact of gender on online versus offline purchases. Questions that can be asked,
- Do men prefer purchasing online or offline?
- Do women prefer purchasing online or offline?
- Is the preference independent of the gender?

Observed values,

|  | Men | Women |  |
| :-: | :-: | :-: | :-: |
| Offline | 527 | 72 | 599 (66%) |
| Online | 206 | 102 | 308 (34%) |
|  | 733 | 174 | 907 |

Null and alternative hypotheses
- H0: Gender has no effect on the medium used to purchase.
- H1: Gender has an effect on the medium used to purchase.

No expected values are given in this situation, so they have to be derived from the observed values under the assumption that H0 is true.

Questions,
- What are the expected values under the assumption of H0?
- What percentage of people prefer to purchase offline? 599 out of 907, about 66%.
- If gender had no effect, how many of the 733 men would be expected to prefer purchasing offline? About 66% of 733, which is approximately 484.
- Is it possible to compute the Chi-Squared statistic here, using the above interpretation?

Expected values,

|  | Men | Women |  |
| :-: | :-: | :-: | :-: |
| Offline | 484 | 115 | 599 (66%) |
| Online | 249 | 59 | 308 (34%) |
|  | 733 | 174 | 907 |

The expected values have been computed from the observed values.
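
The expected table can be reproduced from the row and column totals of the observed table: under H0, each expected count is (row total × column total) / grand total. A minimal sketch with NumPy, where the variable names are illustrative:

```python
import numpy as np

# Rows: Offline, Online; columns: Men, Women
observed = np.array([[527, 72],
                     [206, 102]])

row_totals = observed.sum(axis=1)   # [599, 308]
col_totals = observed.sum(axis=0)   # [733, 174]
grand_total = observed.sum()        # 907

# Outer product of the margins divided by the grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round())
```

Rounding to whole numbers gives the values in the expected table: 484, 115, 249 and 59.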

There is a Python function, chi2_contingency(), that computes these expected values for us. It also returns the Chi-Squared statistic.

$\chi^2 = \frac{(527 - 484)^2}{484} + \frac{(72 - 115)^2}{115} + \frac{(206 - 249)^2}{249} + \frac{(102 - 59)^2}{59}$

Difference between the above example and the Coin Toss example:
- Coin Toss: Testing goodness of fit against an expected distribution.
- Preference vs Gender: Testing for independence.

These are the two flavors of the Chi-Squared Test.

In [6]:
# preference vs gender
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chisquare
from scipy.stats import chi2
from scipy.stats import chi2_contingency

# H0: Gender and preference are independent.
# Ha: Preference depends on the gender.

observed = [[527, 72], [206, 102]]
chi_stat, p_value, dof, expected_freq = chi2_contingency(observed)
print(chi_stat)
print(p_value)
print(dof)
print(expected_freq)

if p_value < 0.05:
  print("Reject H0")

57.04098674049609
4.268230756875866e-14
1
[[484.08710033 114.91289967]
 [248.91289967  59.08710033]]
Reject H0
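
The statistic printed above (about 57.04) is somewhat smaller than the plain sum of (observed - expected)² / expected for the same table. According to SciPy's documentation, `chi2_contingency` applies Yates' continuity correction by default when the degrees of freedom equal 1, which is the case for a 2 x 2 table. The sketch below compares the default result with `correction=False` and with a manual computation; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[527, 72],
                     [206, 102]])

# Default: Yates' continuity correction is applied because dof == 1
stat_corrected, p_corrected, dof, expected = chi2_contingency(observed)

# correction=False gives the plain sum of (O - E)^2 / E
stat_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)
manual_stat = ((observed - expected) ** 2 / expected).sum()

print("With correction:   ", stat_corrected)
print("Without correction:", stat_plain)
print("Manual formula:    ", manual_stat)
```

Either way the p-value is far below 0.05 for this table, so the conclusion does not change.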


Consider the aerofit dataset,

In [7]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chisquare
from scipy.stats import chi2
from scipy.stats import chi2_contingency

df_aerofit = pd.read_csv("aerofit.csv")
df_aerofit.head()

gender_product = pd.crosstab(index = df_aerofit["Gender"], columns = df_aerofit["Product"])
gender_product

# H0: gender has no impact on the purchases
# H1: gender has an impact on the purchases

# check for independence
chi_stat, p_value, dof, expected_freq = chi2_contingency(gender_product)
print(chi_stat)
print(p_value)
print(dof)
print(expected_freq)

if p_value < 0.05:
   print("Reject H0")
   print("Gender has an impact on the purchases")

# check for independence on a hard-coded 2 x 2 table of counts
chi_stat, p_value, dof, expected_freq = chi2_contingency([[40, 29], [40, 31]])
print(chi_stat)
print(p_value)
print(dof)
print(expected_freq)

if p_value < 0.05:
   print("Reject H0")
   print("Gender has an impact on the purchases")
else:
   print("Fail to reject H0")

12.923836032388664
0.0015617972833158714
2
[[33.77777778 25.33333333 16.88888889]
 [46.22222222 34.66666667 23.11111111]]
Reject H0
Gender has an impact on the purchases
0.0005953595971967067
0.9805335549105975
1
[[39.42857143 29.57142857]
 [40.57142857 30.42857143]]
Fail to reject H0
