<a href="https://colab.research.google.com/github/sundaybest3/Spring2024/blob/main/Seminar/Chi_Squared_Independence.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📙**Part 1. Chi-Squared Test of Independence**

### This is a statistical method used to determine if there is a significant association between two categorical variables.
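
As a rough illustration of what the test computes, the sketch below builds the expected counts under independence and sums $(O - E)^2 / E$ over all cells. The 2x3 table of counts is made up for illustration and is not the survey data used later in this notebook.

```python
import numpy as np

# Hypothetical observed counts (NOT the survey data):
# rows = 2 groups, columns = 3 response categories
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
grand_total = observed.sum()

# Expected count per cell under independence:
# (row total * column total) / grand total
expected = row_totals * col_totals / grand_total

# Chi-squared statistic: sum over all cells of (observed - expected)^2 / expected
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (number of rows - 1) * (number of columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f"chi2 = {chi2_stat:.4f}, dof = {dof}")
```

A larger statistic means the observed counts sit further from what independence would predict; `scipy.stats.chi2_contingency`, used in Step [3], performs the same calculation and also returns the p-value.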


+ Note: There are 3 datasets for you to practice with. Complete them by 5/23 and save this file to your GitHub repository.

# Sample data analysis (Kim MR)

## 0.1 Example data: Self-Perception of English Proficiency by Gender

+ Description: This dataset represents a survey conducted among a group of 200 students to explore **how they perceive their own English proficiency**. The aim of the study is to investigate whether there are noticeable differences in self-assessment of English skills between genders. Note that the actual test scores of the two groups did not differ significantly.

+ Each participant in the survey was asked to classify their English language skills into one of three categories:

  + **Beginner**: The student feels that they are just starting to learn English and have limited ability to use the language.
  + **Intermediate**: The student has a basic command of the English language but might struggle with complex grammar structures and extensive vocabulary.
  + **Advanced**: The student feels confident in using the English language fluently and accurately on all levels.


## 0.2 Objective of the Analysis

+ The primary objective for analyzing this dataset is to determine **if there is a statistically significant association between gender and self-perceived English proficiency levels.**
+ By understanding these patterns, educators can better address the confidence and educational needs of students based on their self-perceived language abilities.

## 0.3 Dataset preview

|Gender|Proficiency|
|--|--|
|Female|Intermediate|
|Male|Beginner|
|Male|Beginner|
|Female|Advanced|
|...|...|

+ 💾 [1. Sample data in GitHub](https://github.com/MK316/Spring2024/blob/main/Seminar/data/statsample.csv)
+ 💾 [2. Raw data](https://raw.githubusercontent.com/MK316/Spring2024/main/Seminar/data/statsample.csv)


### 🔎 Null hypothesis: _Gender (Male, Female) is independent of self-perceived English proficiency levels (Beginner, Intermediate, Advanced)._
### 🔎 Alternative hypothesis: _There is a dependence between gender and self-perceived English proficiency levels._

## Step [1] Read data and examine the data structure

In [None]:
# Read csv using a web link (The file is saved in my github account)

import pandas as pd

url = "https://raw.githubusercontent.com/MK316/Spring2024/main/Seminar/data/statsample.csv"
df = pd.read_csv(url)
df.tail()
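
Beyond `df.tail()`, a few more calls can help examine the data structure. The sketch below uses a small made-up DataFrame named `demo` (an assumption for illustration, not the real file) with the same column names as the preview table; the same calls should work on `df` once it is loaded.

```python
import pandas as pd

# Small made-up sample with the same column names as the preview table
# (NOT the real statsample.csv)
demo = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
    "Proficiency": ["Intermediate", "Beginner", "Beginner",
                    "Advanced", "Beginner", "Intermediate"],
})

print(demo.shape)                          # (number of rows, number of columns)
print(demo.dtypes)                         # data type of each column
print(demo.isna().sum())                   # number of missing values per column
print(demo["Gender"].value_counts())       # how many rows fall in each gender
print(demo["Proficiency"].value_counts())  # how many rows fall in each level
```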

## Step [2] Generate the contingency table

In [None]:
# Contingency table

# import pandas as pd

# Generating the contingency table
contingency_table = pd.crosstab(df['Gender'], df['Proficiency'])

# Display the contingency table
print(contingency_table)


In [None]:
# Contingency table with total counts

contingency_table_total = pd.crosstab(df['Gender'], df['Proficiency'], margins = True, margins_name="Total")
print(contingency_table_total)

In [None]:
# Save the contingency table to a CSV file

contingency_table_total.to_csv('contingency_table_sample.csv')
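
Raw counts can be hard to compare when the groups differ in size, so it may help to look at row percentages as well. This is a minimal sketch on a made-up DataFrame `demo` (hypothetical data, not the survey file); `normalize="index"` asks `pd.crosstab` to divide each row by its row total.

```python
import pandas as pd

# Made-up sample with the same column names as the preview table (NOT the real data)
demo = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
    "Proficiency": ["Intermediate", "Beginner", "Beginner",
                    "Advanced", "Beginner", "Intermediate"],
})

# Share of each proficiency level within each gender, as percentages
row_pct = pd.crosstab(demo["Gender"], demo["Proficiency"], normalize="index") * 100
print(row_pct.round(1))
```

Each row of `row_pct` sums to 100, which makes the two gender profiles directly comparable.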

## Step [3] Conducting Chi-Squared test of independence

In [None]:
import pandas as pd
from scipy.stats import chi2_contingency

# Generating the contingency table
contingency_table = pd.crosstab(df['Gender'], df['Proficiency'])

# Perform the Chi-squared test of independence
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Convert the expected frequencies into a DataFrame
expected_df = pd.DataFrame(expected, columns=contingency_table.columns, index=contingency_table.index)

# Output the results
print(f"Chi-squared Statistic: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of Freedom: {dof}")
print("="*50)
print("Contingency table:")
print("="*50)
print(contingency_table_total)
print("="*50)
print("Expected Frequencies Table:")
print("="*50)
print(expected_df)
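
The chi-squared approximation is usually considered reliable only when the expected counts are not too small; a common rule of thumb is that every expected count should be at least 5. The sketch below shows one way to check this from the `expected` array returned by `chi2_contingency`. The table of counts is hypothetical, not the survey data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table of observed counts (NOT the survey data)
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

chi2, p, dof, expected = chi2_contingency(observed)

# Rule of thumb (an assumption, not a strict requirement):
# every expected count should be at least 5
min_expected = expected.min()
share_below_5 = (expected < 5).mean()

print(f"Smallest expected count: {min_expected:.2f}")
print(f"Share of cells with expected count below 5: {share_below_5:.0%}")
if min_expected < 5:
    print("Some expected counts are small; the chi-squared approximation may be unreliable.")
else:
    print("All expected counts are at least 5.")
```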

## Step [4] Reporting

+ **[General description]** The objective of this analysis was to determine whether there is a significant association between gender (Female, Male) and self-reported English proficiency levels (Advanced, Beginner, Intermediate). The survey data gathered responses from 200 participants (F=102, M=98).
+ **[Data summary]** A contingency table was constructed from the survey responses to cross-tabulate the frequencies of reported English proficiency by gender. The contingency table and expected frequencies were as follows: (You should include observed and expected frequency tables)
+ **[Chi-squared test result]** The Chi-squared test of independence was applied to assess whether the distribution of proficiency levels differed across genders. The observed differences were not statistically significant ($\chi^2 = 2.5768$, df = 2, $p = 0.2757$).
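
When writing up several analyses, a small helper can keep the reported statistics consistent. The function `format_chi2` below is a hypothetical helper written for this illustration (it is not part of pandas or SciPy), and the rounding choices are one possible convention rather than a required format.

```python
def format_chi2(chi2, dof, p, n=None):
    """Return a one-line summary of a chi-squared test result."""
    # Very small p-values are reported as an inequality instead of a long decimal
    p_text = "p < 0.001" if p < 0.001 else f"p = {p:.3f}"
    # Include the sample size only when it is supplied
    n_text = f", N = {n}" if n is not None else ""
    return f"χ²({dof}{n_text}) = {chi2:.2f}, {p_text}"

# Values reported above for the sample data
print(format_chi2(2.5768, 2, 0.2757, n=200))
```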


## Step [5] Interpret the result and draw a conclusion

+ **Interpretation**: The Chi-squared statistic of 2.5768 with 2 degrees of freedom and a p-value of 0.2757 indicates no statistically significant association between gender and self-perceived English proficiency levels. Because the p-value exceeds the conventional significance level of 0.05, the observed frequencies do not deviate from the expected frequencies by enough to reject the null hypothesis.
+ **Conclusion**: The test provides **no evidence that gender is associated with how participants perceive their English proficiency**. The distribution of self-perceived proficiency levels does not differ significantly between genders in this survey, although a non-significant result does not prove that the distributions are identical. Based on this dataset, there is no indication that educational programs and resources aimed at enhancing English proficiency need gender-specific modifications.
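
The decision rule behind this interpretation can be sketched in a few lines. The code below takes the statistic and degrees of freedom reported above, recomputes the p-value from the chi-squared distribution, and compares it with the significance level. Using `scipy.stats.chi2` directly like this is one way to check a reported result, not a step the analysis requires.

```python
from scipy.stats import chi2

alpha = 0.05        # conventional significance level
chi2_stat = 2.5768  # statistic reported for the sample data
dof = 2             # (2 genders - 1) * (3 proficiency levels - 1)

# p-value: probability of a statistic at least this large if H0 is true
p_value = chi2.sf(chi2_stat, dof)

# Critical value: the statistic would need to exceed this to reject H0 at alpha
critical_value = chi2.ppf(1 - alpha, dof)

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"p = {p_value:.4f}, critical value = {critical_value:.3f}, decision: {decision}")
```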

---
# ⏰ Data set 1 (Jung WC)

+ **Description**: This dataset represents a survey conducted to understand preferences for electronic bicycles across different age groups. The survey targeted three main demographic groups: Teens, Adults, and Seniors. Participants in each age group were asked whether they like or dislike electronic bicycles. The survey aimed to capture trends and preferences that could influence market strategies and product development for electronic bicycles.

+ **Data Collection Methodology**:
The data was collected through an online survey distributed via email and social media platforms. Respondents were classified into three age groups:

  + Teens: Ages 13-19
  + Adults: Ages 20-59
  + Seniors: Ages 60 and above
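
The dataset already contains the age group labels, but if only raw ages were available, the classification above could be reproduced with `pd.cut`. This is a sketch under that assumption; the `age` column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical raw ages (the real dataset stores only the group label)
demo = pd.DataFrame({"age": [13, 19, 20, 45, 59, 60, 72]})

# right=False makes each bin include its left edge and exclude its right edge:
# [13, 20) -> Teen, [20, 60) -> Adult, [60, inf) -> Senior
demo["Age_group"] = pd.cut(
    demo["age"],
    bins=[13, 20, 60, float("inf")],
    labels=["Teen", "Adult", "Senior"],
    right=False,
)
print(demo)
```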

+ Data preview

|Age_group|Ebikes_preference|
|--|--|
|Teen|Likes|
|Teen|Likes|
|Adult|Likes|
|Adult|Dislikes|
|Senior|Likes|
|...|...|

+ 💾[data set01: rawdata](https://raw.githubusercontent.com/MK316/Spring2024/main/Seminar/data/statdata01.csv)

---
## To do

Your task is to determine whether there is a statistically significant association between age group and preference for E-Bikes. You are expected to apply the Chi-squared test of independence to this contingency table to assess this relationship.

### 🔎 Null hypothesis: _Age group (Teen, Adult, Senior) is independent of preference for electronic bicycles (Likes, Dislikes)._
### 🔎 Alternative hypothesis: _There is a dependence between age group and preference for electronic bicycles._
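
Since the same steps repeat for each practice dataset, they can be wrapped in a small function. The helper `chi2_independence` below is a hypothetical convenience function written for this illustration, and the toy data (`group`, `answer`) are made up; it is a sketch of the workflow shown in the sample analysis, not a required solution.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_independence(data, row_var, col_var):
    """Cross-tabulate two categorical columns and run the chi-squared test of independence."""
    observed = pd.crosstab(data[row_var], data[col_var])
    chi2, p, dof, expected = chi2_contingency(observed)
    expected_df = pd.DataFrame(expected, index=observed.index, columns=observed.columns)
    return {"observed": observed, "expected": expected_df,
            "chi2": chi2, "p": p, "dof": dof}

# Made-up toy data: three groups of 50 with different Yes/No splits
toy = pd.DataFrame({
    "group": ["A"] * 50 + ["B"] * 50 + ["C"] * 50,
    "answer": (["Yes"] * 30 + ["No"] * 20
               + ["Yes"] * 25 + ["No"] * 25
               + ["Yes"] * 20 + ["No"] * 30),
})

result = chi2_independence(toy, "group", "answer")
print(result["observed"])
print(f"chi2 = {result['chi2']:.3f}, dof = {result['dof']}, p = {result['p']:.4f}")
```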

## Step [1] Read data and examine the data structure

## Step [2] Generate the contingency table

## Step [3] Conducting Chi-Squared test of independence

## Step [4] Reporting

Write here....

## Step [5] Interpret the result and draw a conclusion

Write here...

---
# ⏰ Data set 2 (Sohn HS)

## 2.1 Data Description
This dataset examines the association between the frequency of exercise and self-perceived health status among a sample population. The study seeks to understand how often individuals engage in physical activity and how this affects their perception of their own health. The respondents were categorized based on their regularity of exercise into three groups: those who exercise regularly, those who seldom exercise, and those who never exercise.

## Dataset Composition
The responses were collected from a total of 600 participants, divided evenly among the exercise frequency categories:

+ Exercise Regularly: 200 participants
+ Seldom Exercise: 200 participants
+ Never Exercise: 200 participants

## Data preview 💾[dataset02](https://github.com/MK316/Spring2024/blob/main/Seminar/data/statdata02.csv)

|Exercise_Frequency|Health_Perception|
|--|--|
|Exercise Regularly|Healthy|
|Exercise Regularly|Healthy|
|Exercise Regularly|Unhealthy|
|Exercise Regularly|Unhealthy|
|...|...|
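
The dataset link above points to the GitHub page for the file rather than to the raw CSV, and `pd.read_csv` generally needs the raw address. The sketch below rewrites a GitHub "blob" URL into the raw form, following the same pattern as the sample-data links earlier in this notebook. The helper name `github_blob_to_raw` is hypothetical, and it is an assumption that the file is available at the resulting address.

```python
def github_blob_to_raw(url):
    """Rewrite a github.com '/blob/' file URL into its raw.githubusercontent.com form."""
    raw = url.replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
    return raw.replace("/blob/", "/", 1)

page_url = "https://github.com/MK316/Spring2024/blob/main/Seminar/data/statdata02.csv"
raw_url = github_blob_to_raw(page_url)
print(raw_url)
```

The resulting `raw_url` can then be passed to `pd.read_csv` in Step [1].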


## Objective
Your task is to apply the Chi-squared test of independence to determine if there is a statistically significant relationship between the frequency of exercise and health perception.

### 🔎 Null hypothesis: (write this down for yourself)
### 🔎 Alternative hypothesis: (write this down for yourself)

## Step [1] Read data and examine the data structure

## Step [2] Generate the contingency table

## Step [3] Conducting Chi-Squared test of independence

## Step [4] Reporting

## Step [5] Interpret the result and draw a conclusion

---
# ⏰ Data set 3 (Choi JM)

## 3.1 Data Description

This dataset investigates the relationship **between students' academic majors and their interest in enrolling in online courses**. The aim of this study is to determine **if there are significant differences in online course interest among students from various academic backgrounds**. Students from three major academic fields—Science, Engineering, and Humanities—were surveyed to assess their interest levels.

## 3.2 Data Composition

The survey gathered responses from a total of 600 students, distributed equally across the different majors:

+ Science: 200 students
+ Engineering: 200 students
+ Humanities: 200 students

## 3.3 Objective

Your task is to apply the Chi-squared test of independence to explore if there is a statistically significant relationship between students' study majors and their interest in online courses.

## Data preview: 💾[dataset03](https://github.com/MK316/Spring2024/blob/main/Seminar/data/statdata03.csv)

|Major|Interest_OnlineCourses|
|--|--|
|Science|Interested|
|Science|Not Interested|
|Engineering|Interested|
|Humanities|Not Interested|
|...|...|

### 🔎 Null hypothesis: _Students' academic majors (Science, Engineering, Humanities) are independent of their interest in enrolling in online courses (Interested, Not Interested)._
### 🔎 Alternative hypothesis: _There is a dependence between students' academic majors and their interest in enrolling in online courses._

## Step [1] Read data and examine the data structure

In [None]:
# Read csv using a web link (The file is saved in my github account)

import pandas as pd

url = "https://raw.githubusercontent.com/MK316/Spring2024/main/Seminar/data/statdata03.csv"
df = pd.read_csv(url)
df.tail()

## Step [2] Generate the contingency table

In [None]:
# Contingency table

# import pandas as pd

# Generating the contingency table
contingency_table = pd.crosstab(df['Major'], df['Interest_OnlineCourses'])

# Display the contingency table
print(contingency_table)

contingency_table_total = pd.crosstab(df['Major'], df['Interest_OnlineCourses'], margins = True, margins_name="Total")
print(contingency_table_total)


In [None]:
# Save the contingency table to a CSV file
# (a separate file name keeps the table saved for the sample data from being overwritten)

contingency_table_total.to_csv('contingency_table_dataset03.csv')

## Step [3] Conducting Chi-Squared test of independence

In [None]:
import pandas as pd
from scipy.stats import chi2_contingency

# Generating the contingency table
contingency_table = pd.crosstab(df['Major'], df['Interest_OnlineCourses'])

# Perform the Chi-squared test of independence
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Convert the expected frequencies into a DataFrame
expected_df = pd.DataFrame(expected, columns=contingency_table.columns, index=contingency_table.index)

# Output the results
print(f"Chi-squared Statistic: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of Freedom: {dof}")
print("="*50)
print("Contingency table:")
print("="*50)
print(contingency_table_total)
print("="*50)
print("Expected Frequencies Table:")
print("="*50)
print(expected_df)

## Step [4] Reporting

+ **[General description]** The objective of this analysis was to determine whether there is a significant association between students' academic majors and their interest in enrolling in online courses. The survey data gathered responses from 600 participants (E=200, H=200, S=200).
+ **[Data summary]** A contingency table was constructed from the survey responses to cross-tabulate the frequencies of reported interest in enrolling in online courses by students' academic major. The contingency table and expected frequencies were as follows:

+ **[Chi-squared test result]** The Chi-squared test of independence was applied to assess whether interest in enrolling in online courses differed across students' academic majors. The difference was statistically significant ($\chi^2 = 65.406$, df = 2, $p \approx 6.27 \times 10^{-15}$, i.e. $p < .001$).

## Step [5] Interpret the result and draw a conclusion

+ **Interpretation**: The Chi-squared statistic of 65.406 with 2 degrees of freedom and a p-value of about $6.27 \times 10^{-15}$ indicates a statistically significant association between students' academic majors and their interest in enrolling in online courses. The observed frequencies of interest and non-interest differ markedly from the expected frequencies, so some majors show more or less interest in online courses than would be expected if major and interest were independent. For example, in Engineering, 140 students are interested (observed) compared to 96.67 (expected), and 60 are not interested (observed) compared to 103.33 (expected).
+ **Conclusion**: The test results indicate that **students' interest in enrolling in online courses is significantly associated with their academic major**. Because the data come from a survey, this result shows that interest varies by major; it does not by itself show that a student's major causes that interest.
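
To see how much a single row drives the overall statistic, each cell's contribution $(O - E)^2 / E$ and its Pearson residual $(O - E) / \sqrt{E}$ can be computed. The sketch below does this for the Engineering row only, using the observed and expected counts and the overall statistic quoted above; the other rows are left out because their counts are not listed in the text.

```python
# Observed and expected counts for the Engineering row, as reported above
observed = {"Interested": 140, "Not Interested": 60}
expected = {"Interested": 96.67, "Not Interested": 103.33}
chi2_total = 65.406  # overall statistic reported above

# Contribution of each cell to the chi-squared statistic: (O - E)^2 / E
contributions = {k: (observed[k] - expected[k]) ** 2 / expected[k] for k in observed}

# Pearson residual of each cell: (O - E) / sqrt(E); the sign shows the direction
residuals = {k: (observed[k] - expected[k]) / expected[k] ** 0.5 for k in observed}

engineering_contribution = sum(contributions.values())
share = engineering_contribution / chi2_total

for k in observed:
    print(f"{k}: contribution = {contributions[k]:.2f}, residual = {residuals[k]:+.2f}")
print(f"Engineering row: {engineering_contribution:.2f} of {chi2_total} ({share:.0%})")
```

A positive residual means more students than expected fell in that cell, and a negative residual means fewer.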

---
The End