<a href="https://colab.research.google.com/github/sakrl0413/Data_Analysis/blob/main/13_ChiSquareTests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🐹🐾  **Chi-square tests**

- **Goodness of fit**: tests whether your sample data are likely to come from a specific theoretical distribution.
- **Test of independence** of **categorical variables**: tests whether two categorical variables are associated.
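The goodness-of-fit use is not demonstrated elsewhere in this notebook, so here is a minimal sketch of it. The die-roll counts are hypothetical, made up purely for illustration; `scipy.stats.chisquare` compares the observed counts with the counts expected under the theoretical distribution (here, a fair die).

```python
from scipy.stats import chisquare

# Hypothetical counts from 120 rolls of a die (illustrative data, not from this notebook)
observed = [18, 22, 16, 25, 19, 20]
# A fair die predicts 120 / 6 = 20 rolls per face
expected = [20] * 6

gof_stat, gof_p = chisquare(f_obs=observed, f_exp=expected)

print("Chi-square statistic:", gof_stat)
print("p-value:", gof_p)
```

A large p-value here would mean the counts are compatible with a fair die; the rest of this notebook focuses on the test of independence instead.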

## 🐹🐾🐾  **Contingency Table**
    contingency. n. chance, an unforeseen event (phrase: "in the supposed contingency" = in case such a thing should happen)
    contingent. a. dependent, **conditional (on/upon)**, accidental

- The table shows how the frequency of one variable is contingent on the frequency of another variable.

- The chi-square test can analyze two-way contingency tables. (cf. Three-way or four-way contingency tables can also be analyzed in principle, but log-linear models are typically used for them.)
    - **Log-Linear Models**: These are often used for analyzing multi-way contingency tables. They help in understanding the interaction effects between the variables. [see the last code cell]
    - Log-linear models are closely related to Generalized Linear Models (GLMs): a log-linear model is a specific type of GLM in which the response variable (the cell count) follows a Poisson distribution and the log link function is used. This is why the last code cell fits a GLM.

>
## <font color = 'red'> **Now, back to two-way contingency tables.**</font>
>

|       |Category A |Category B|Category C|
|:--:   |:--:|:--:|:--:|
|Group 1|30|20|10|
|Group 2|15|25|20|
|Group 3|25|15|30|

>

||dependent variable1|dependent variable2|dependent variable3|row total|
|:--:|:--:|:--:|:--:|:--:|
|independent variable1|r1c1|r1c2|r1c3|<font color = 'blue'>**r1 total**</font>|
|independent variable2|r2c1|r2c2|r2c3|<font color = 'blue'>**r2 total**</font>|
|independent variable3|r3c1|r3c2|r3c3|<font color = 'blue'>**r3 total**</font>|
|column total|<font color = 'green'>**c1 total**</font>|<font color = 'green'>**c2 total**</font>|<font color = 'green'>**c3 total**</font>| <font color = 'red'>**Total**</font> |
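As a rough sketch of how the row totals, column totals, and grand total in this layout can be obtained, the snippet below uses the Group/Category example table from above. The variable names (`table`, `with_margins`, and so on) are only illustrative choices.

```python
import pandas as pd

# Counts copied from the Group/Category example table above
table = pd.DataFrame(
    [[30, 20, 10], [15, 25, 20], [25, 15, 30]],
    index=["Group 1", "Group 2", "Group 3"],
    columns=["Category A", "Category B", "Category C"],
)

row_totals = table.sum(axis=1)             # one total per row
column_totals = table.sum(axis=0)          # one total per column
grand_total = int(table.to_numpy().sum())  # total number of observations

# Append the margins so the printout matches the schematic layout
with_margins = table.copy()
with_margins["row total"] = row_totals
with_margins.loc["column total"] = with_margins.sum(axis=0)

print(with_margins)
```

These three kinds of totals are all that is needed to compute expected frequencies later on.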


🅰️ In the following, we work with a contingency table of observed frequencies that shows people's preferences among three social networking services (SNS): Instagram, YouTube, and Facebook.
>

|            |Instagram|YouTube|Facebook|Row total|
|:--:        |:--:|:--:|:--:|:--:|
|In their 20s|125|119|56|300|
|In their 30s|268|147|85|500|
|In their 40s|210|75|315|600|
|Column total|603|341|456|1400|




🅰️-1️⃣ How to calculate Expected Frequency?

For a table with 3 rows and 3 columns, the degrees of freedom are:

    df = (3-1) * (3-1) = 4

- A higher chi-square statistic indicates a greater discrepancy between observed and expected frequencies and, for a given df, a lower p-value.
- A lower p-value means that data this extreme would be less likely if the variables were independent, which is stronger evidence that the variables are associated.
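To illustrate the relation between the statistic and the p-value, here is a small sketch using `scipy.stats.chi2.sf` (the survival function, i.e. the probability of a value at least as large as the statistic). The four statistic values are arbitrary examples chosen for illustration.

```python
from scipy.stats import chi2

dof = 4  # (3 - 1) * (3 - 1) for a 3x3 table
example_statistics = [1.0, 5.0, 9.49, 20.0]  # arbitrary illustrative values

# p-value = P(chi-square with `dof` degrees of freedom >= statistic)
p_values = [chi2.sf(stat, dof) for stat in example_statistics]

for stat, p in zip(example_statistics, p_values):
    print(f"chi2 = {stat:5.2f}  ->  p = {p:.4f}")
```

With df = 4, a statistic of about 9.49 corresponds to a p-value of about 0.05, so larger statistics fall below the usual 0.05 threshold.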

### 🐹🐾🐾🐾 **Formula** for estimating **Expected Frequency**
### [1. Row totals; Column totals; Total numbers](https://github.com/ms624atyale/Data_Analysis/blob/main/RowTotals_ColumnTotals_TotalN.png)

### [2. Calculating expected frequencies](https://github.com/ms624atyale/Data_Analysis/blob/main/Fomula_ExpectedFrequencies.png)

**Explanation of the formula**

    Expected Frequency (row i, column j) = row i total * column j total / Total

    where i is the row number and j is the column number.

>

|ObsFreq     |Instagram|YouTube|Facebook|Row total|
|:--:        |:--:|:--:|:--:|:--:|
|In their 20s|125|119|56|300|
|In their 30s|268|147|85|500|
|In their 40s|210|75|315|600|
|Column total|603|341|456|1400|
>

|Formula     |Instagram|YouTube|Facebook|Row total|
|:--:        |:--:|:--:|:--:|:--:|
|In their 20s|300*603/1400|300*341/1400|300*456/1400|300|
|In their 30s|500*603/1400|500*341/1400|500*456/1400|500|
|In their 40s|600*603/1400|600*341/1400|600*456/1400|600|
|Column total|603|341|456|1400|

>

|Exp Freq (rounded)|Instagram|YouTube|Facebook|Row total|
|:--:        |:--:|:--:|:--:|:--:|
|In their 20s|129|73|98|300|
|In their 30s|215|122|163|500|
|In their 40s|258|146|195|600|
|Column total|603|341|456|1400|
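The expected-frequency formula can be applied to every cell at once. The sketch below assumes the observed counts used in the Python cell of this notebook (row totals 300, 500, 600; grand total 1400) and uses `np.outer` to multiply each row total by each column total.

```python
import numpy as np

# Observed counts as used in the Python cell of this notebook
observed = np.array([[125, 119, 56],
                     [268, 147, 85],
                     [210, 75, 315]])

row_totals = observed.sum(axis=1)     # totals for the 20s, 30s, 40s rows
column_totals = observed.sum(axis=0)  # totals for Instagram, YouTube, Facebook
grand_total = observed.sum()

# Expected frequency of each cell = row total * column total / grand total
expected = np.outer(row_totals, column_totals) / grand_total

print(np.round(expected, 2))
```

The expected-frequency table above shows these values rounded to whole numbers, which is why a rounded row may add up to one less than its row total.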

>

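🅰️-2️⃣ Formula of chi^2

$$\chi^2 = \sum_{i=1}^{3}\sum_{j=1}^{3}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \frac{(O_{11}-E_{11})^2}{E_{11}} + \frac{(O_{12}-E_{12})^2}{E_{12}} + \cdots + \frac{(O_{33}-E_{33})^2}{E_{33}}$$

where $O_{ij}$ is the observed frequency (ObsFreq) and $E_{ij}$ is the expected frequency (ExpFreq) in row $i$, column $j$.

    ⤵️ Formula of chi^2 applied:

    chi^2 = 0.14 + 29 + 18 + 12.9 + 5.2 + 37 + 9.08 + 35 + 73 ≈ 218.99
    (the terms shown are rounded; 218.99 is the sum of the unrounded terms)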
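The sum above can be reproduced term by term. This is a minimal sketch that assumes the observed counts from the Python cell of this notebook and computes each cell's contribution `(observed - expected)^2 / expected` with NumPy.

```python
import numpy as np

# Observed counts as used in the Python cell of this notebook
observed = np.array([[125, 119, 56],
                     [268, 147, 85],
                     [210, 75, 315]])

# Expected counts under independence: row total * column total / grand total
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Contribution of each cell to the chi-square statistic
contributions = (observed - expected) ** 2 / expected
chi2_by_hand = contributions.sum()

print(np.round(contributions, 2))
print("Chi-square statistic:", round(chi2_by_hand, 2))
```

The largest contributions come from the cells where the observed count is furthest from the expected count, relative to the size of the expected count.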

🅰️-3️⃣ Degrees of Freedom (df) = (Total number of rows - 1) * (Total number of columns - 1) = (3 - 1) * (3 - 1) = 4
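As a sketch of how the degrees of freedom are used, the snippet below looks up the critical value of the chi-square distribution with `scipy.stats.chi2.ppf` and compares it with the statistic from the worked example. The 0.05 significance level is the same one assumed in the code cells of this notebook.

```python
from scipy.stats import chi2

n_rows, n_cols = 3, 3
dof = (n_rows - 1) * (n_cols - 1)

alpha = 0.05
# Value that a chi-square variable with `dof` degrees of freedom
# exceeds with probability alpha
critical_value = chi2.ppf(1 - alpha, dof)

chi2_statistic = 218.99  # value from the worked example above
reject_h0 = chi2_statistic > critical_value

print("df:", dof)
print("Critical value:", round(critical_value, 3))
print("Reject H0:", reject_h0)
```

Comparing the statistic with the critical value leads to the same decision as comparing the p-value with alpha.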



## **<font color = 'red'> 🌱 🐣 Chi-Square Tests using Python code</font>**

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

observed_freq = np.array([[125, 119, 56], [268, 147, 85], [210, 75, 315]])

df = pd.DataFrame(observed_freq, index = ['In their 20s', 'In their 30s', 'In their 40s'], columns = ['Instagram', 'YouTube', 'Facebook'])

print(df)
print('\n')

# Perform the chi-square test of independence
chi2_statistic, p_value, degrees_of_freedom, expected_freq = chi2_contingency(observed_freq)

print('H0: SNS preference (Instagram, YouTube, Facebook) is independent of age group.')

print("\n", "Chi-square statistic:", chi2_statistic)

print("\n", "p-value:", p_value)

print("\n", "Degrees of freedom:", degrees_of_freedom)

print("\n", "Expected frequencies:\n", expected_freq)

# Set the significance level (alpha)
alpha = 0.05

# Interpret the result
if p_value > alpha:
    print("\n", "SNS preference (Instagram, YouTube, Facebook) does not differ significantly among age groups. (fail to reject H0)")
else:
    print("\n", "SNS preference (Instagram, YouTube, Facebook) differs among age groups. (reject H0)")

              Instagram  YouTube  Facebook
In their 20s        125      119        56
In their 30s        268      147        85
In their 40s        210       75       315


H0: SNS preference (Instagram, YouTube, Facebook) is independent of age group.

 Chi-square statistic: 218.98992136307027

 p-value: 3.0923258542677724e-46

 Degrees of freedom: 4

 Expected frequencies:
 [[129.21428571  73.07142857  97.71428571]
 [215.35714286 121.78571429 162.85714286]
 [258.42857143 146.14285714 195.42857143]]

 SNS preference (Instagram, YouTube, Facebook) differs among age groups. (reject H0)


### 1️⃣ **Testing speaking fluency with chi-square tests**

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

# Define the contingency table
observed_freq = np.array([[30, 85],
                          [95, 68]])

df = pd.DataFrame(observed_freq, index = ['Male/Female after Puberty', 'Kids under 5'], columns = ['Interruptions', 'Fillers'])

print(df)
print('\n')

# Perform the chi-square test of independence
chi2_statistic, p_value, degrees_of_freedom, expected_freq = chi2_contingency(observed_freq)

# Print the results

print('H0: The type of disfluency (interruptions vs. fillers) is independent of age group.')

print("\n", "Chi-square statistic:", chi2_statistic)

print("\n", "p-value:", p_value)

print("\n", "Degrees of freedom:", degrees_of_freedom)

print("\n", "Expected frequencies:\n", expected_freq)

# Set the significance level (alpha)
alpha = 0.05

# Interpret the result
if p_value > alpha:
    print("\n", "The age groups did not differ in speaking fluency. (fail to reject H0)")
else:
    print("\n", "The age groups differ in speaking fluency. (reject H0)")

                           Interruptions  Fillers
Male/Female after Puberty             30       85
Kids under 5                          95       68


H0: The type of disfluency (interruptions vs. fillers) is independent of age group.

 Chi-square statistic: 26.957080592820397

 p-value: 2.0802369135102012e-07

 Degrees of freedom: 1

 Expected frequencies:
 [[51.70863309 63.29136691]
 [73.29136691 89.70863309]]

 The age groups differ in speaking fluency. (reject H0)
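One detail worth noting for 2x2 tables like this one: when df = 1, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the reported statistic is slightly smaller than the value given by the plain chi-square formula. The sketch below reuses the same observed counts to compare the two settings.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Same 2x2 table as in the cell above
observed_freq = np.array([[30, 85],
                          [95, 68]])

# Default behaviour: Yates' continuity correction (only applied when df = 1)
chi2_corrected, p_corrected, _, _ = chi2_contingency(observed_freq, correction=True)

# Plain Pearson chi-square, matching the hand formula sum((O - E)^2 / E)
chi2_plain, p_plain, _, _ = chi2_contingency(observed_freq, correction=False)

print("With Yates' correction:   ", chi2_corrected, p_corrected)
print("Without Yates' correction:", chi2_plain, p_plain)
```

In this example both versions lead to the same decision, but the difference matters when checking a 2x2 result by hand.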


### 2️⃣ **Testing speakers' use of grammatical function words with chi-square tests**

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

observed_freq = np.array([[125, 59, 80],[100, 12, 50]])

df = pd.DataFrame(observed_freq, index = ['Native after puberty', 'Non-native after puberty'], columns = ['Pronouns', 'Relative Clause', 'Subordinate Clause'])

print(df)
print('\n')

# Perform chi-square test
chi2_statistic, p_value, degrees_of_freedom, expected_freq = chi2_contingency(observed_freq)

print('H0: The use of grammatical function words is similar in frequency between native and non-native after-puberty speakers of English.')

print("\n", "Chi-square statistic:", chi2_statistic)

print("\n", "p-value:", p_value)

print("\n", "Degrees of freedom:", degrees_of_freedom)

print("\n", "Expected frequencies:\n", expected_freq)

# Set the significance level (alpha)
alpha = 0.05

# Interpret the result
if p_value > alpha:
    print("\n", "Native and non-native speakers did not differ in their use of grammatical function words. (fail to reject H0)")
else:
    print("\n", "Native and non-native speakers differ in their use of grammatical function words. (reject H0)")

                          Pronouns  Relative Clause  Subordinate Clause
Native after puberty           125               59                  80
Non-native after puberty       100               12                  50


H0: The use of grammatical function words is similar in frequency between native and non-native after-puberty speakers of English.

 Chi-square statistic: 17.38783849894961

 p-value: 0.00016760186379635635

 Degrees of freedom: 2

 Expected frequencies:
 [[139.43661972  44.          80.56338028]
 [ 85.56338028  27.          49.43661972]]

 Native and non-native speakers differ in their use of grammatical function words. (reject H0)


### 3️⃣ **Cancer type among cigarette and e-cigarette users**

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

observed_freq = np.array([[100, 150],[120, 160]])

df = pd.DataFrame(observed_freq, index = ['Cigarette', 'e-Cigarette'], columns = ['Lung Cancer', 'Bladder Cancer'])

print(df)
print('\n')

# Perform chi-square test
chi2_statistic, p_value, degrees_of_freedom, expected_freq = chi2_contingency(observed_freq)

print('H0: The type of cancer (lung vs. bladder) is independent of the type of cigarette used.')

print("\n", "Chi-square statistic:", chi2_statistic)

print("\n", "p-value:", p_value)

print("\n", "Degrees of freedom:", degrees_of_freedom)

print("\n", "Expected frequencies:\n", expected_freq)

# Set the significance level (alpha)
alpha = 0.05

# Interpret the result
if p_value > alpha:
    print("\n", "The distribution of lung and bladder cancer does not differ significantly between cigarette users and e-cigarette users. (fail to reject H0)")
else:
    print("\n", "The distribution of lung and bladder cancer differs between cigarette users and e-cigarette users. (reject H0)")

             Lung Cancer  Bladder Cancer
Cigarette            100             150
e-Cigarette          120             160


H0: The type of cancer (lung vs. bladder) is independent of the type of cigarette used.

 Chi-square statistic: 0.3341892019271051

 p-value: 0.5632026843691589

 Degrees of freedom: 1

 Expected frequencies:
 [[103.77358491 146.22641509]
 [116.22641509 163.77358491]]

 The distribution of lung and bladder cancer does not differ significantly between cigarette users and e-cigarette users. (fail to reject H0)


### **For your information later! (This part is not part of the final exam)**

The following cell analyzes three-way categorical data with a log-linear model, fitted as a GLM (generalized linear model).


In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Create a sample dataset
np.random.seed(0)
gender = np.random.choice(['Male', 'Female'], size=90)
smoking_status = np.random.choice(['Smoker', 'Non-Smoker'], size=90)
exercise_level = np.random.choice(['Low', 'Medium', 'High'], size=90)

# Combine into a DataFrame
data = pd.DataFrame({
    'Gender': gender,
    'Smoking_Status': smoking_status,
    'Exercise_Level': exercise_level
})

# Create a three-way contingency table (Gender x Smoking_Status x Exercise_Level)
contingency_table = pd.crosstab(index=[data['Gender'], data['Smoking_Status']], columns=data['Exercise_Level'])

# Melt the contingency table to long format: one row per cell, with its count in 'Frequency'
data_long = contingency_table.reset_index().melt(id_vars=['Gender', 'Smoking_Status'], var_name='Exercise_Level', value_name='Frequency')

# Fit a Poisson log-linear model with main effects only (no interaction terms)
model = smf.glm("Frequency ~ Gender + Smoking_Status + Exercise_Level", data=data_long, family=sm.families.Poisson()).fit()

# Print the summary of the model
print(model.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:              Frequency   No. Observations:                   12
Model:                            GLM   Df Residuals:                        7
Model Family:                 Poisson   Df Model:                            4
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -25.951
Date:                Wed, 05 Jun 2024   Deviance:                       6.0284
Time:                        01:50:02   Pearson chi2:                     5.85
No. Iterations:                     4   Pseudo R-squ. (CS):             0.2466
Covariance Type:            nonrobust                                         
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept               

### **For your information**

**Generalized Linear Models (GLMs)** are used when you need to model a dependent variable that has a non-normal distribution. GLMs extend linear models to allow for response variables whose error distribution is not normal. They are useful for various types of dependent variables, including binary, count, and categorical data.

Here are some scenarios in which you might use a GLM:

1. Binary Data:

   - When your dependent variable is binary (0/1, yes/no, success/failure), you would use a GLM with a binomial distribution. For example, logistic regression is a type of GLM used for binary outcomes.
   - Example: Predicting whether a customer will buy a product or not.

2. Count Data:

   - When your dependent variable is a count (non-negative integers), such as the number of times an event occurs, you would use a GLM with a Poisson or negative binomial distribution.
   - Example: Modeling the number of times a website is visited in a day.

3. Proportional Data:

   - When your dependent variable is a proportion (ranging between 0 and 1), such as the fraction of successes in a fixed number of trials, you would use a GLM with a binomial distribution.
   - Example: Modeling the proportion of defective items in a batch.

**4. Categorical Data with More Than Two Categories:**

   - When your dependent variable is categorical with more than two categories, such as multinomial outcomes, you can use a multinomial logistic regression (a type of GLM).
   - Example: Predicting the type of car a customer will buy (sedan, SUV, truck).

5. Continuous Data with Non-Normal Errors:

   - When your dependent variable is continuous but the residuals are not normally distributed, you might use a GLM with a suitable distribution (e.g., gamma distribution for positively skewed data).
