### Statistical Analysis

In [1]:
import pandas as pd

In [2]:
student_data = pd.read_csv("../data/cleaned_student_data.csv")

In [3]:
student_data.head()

Unnamed: 0,gender,race/ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


### Descriptive Statistics

In [4]:
student_data.describe()

Unnamed: 0,math_score,reading_score,writing_score
count,1000.0,1000.0,1000.0
mean,66.089,69.169,68.054
std,15.16308,14.600192,15.195657
min,0.0,17.0,10.0
25%,57.0,59.0,57.75
50%,66.0,70.0,69.0
75%,77.0,79.0,79.0
max,100.0,100.0,100.0


In [5]:
print(student_data['gender'].value_counts())


gender
female    518
male      482
Name: count, dtype: int64


In [6]:
print(student_data['race/ethnicity'].value_counts())


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64


In [7]:
print(student_data['lunch'].value_counts())


lunch
standard        645
free/reduced    355
Name: count, dtype: int64


In [8]:
print(student_data['parental_level_of_education'].value_counts())


parental_level_of_education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: count, dtype: int64


In [9]:
print(student_data['test_preparation_course'].value_counts())


test_preparation_course
none         642
completed    358
Name: count, dtype: int64


### T-Test

### 1. Does Test Preparation Make a Difference in Scores?

In [10]:
import pandas as pd
from scipy.stats import ttest_ind

In [11]:

completed_preparation = student_data[student_data['test_preparation_course'] == 'completed']
no_preparation = student_data[student_data['test_preparation_course'] == 'none']

subjects = ['math_score', 'reading_score', 'writing_score']
for subject in subjects:
    t_stat, p_val = ttest_ind(completed_preparation[subject], no_preparation[subject], equal_var=False)
    print(f"{subject} - t-statistic: {t_stat:.3f}, p-value: {p_val:.4f}")


math_score - t-statistic: 5.787, p-value: 0.0000
reading_score - t-statistic: 8.004, p-value: 0.0000
writing_score - t-statistic: 10.753, p-value: 0.0000


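All three p-values are below 0.05 (they print as 0.0000 only because of rounding to four decimals), so we reject the null hypothesis of equal mean scores for each subject.

#### Students who completed the test preparation course scored significantly higher in Math, Reading, and Writing.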
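A significant p-value says the two groups differ, but not by how much. The sketch below reports the group means, their difference, and Cohen's d next to the same Welch t-test used above. The helper name `summarize_gap` and the `toy` DataFrame are made up for illustration and are not part of the student dataset; with the real data one would pass `student_data` instead.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def summarize_gap(df, group_col, value_col, a, b):
    """Hypothetical helper: compare value_col between groups a and b."""
    x = df.loc[df[group_col] == a, value_col]
    y = df.loc[df[group_col] == b, value_col]
    # Welch's t-test (unequal variances), as in the cell above
    t_stat, p_val = ttest_ind(x, y, equal_var=False)
    # Cohen's d based on the pooled sample standard deviation
    pooled_var = ((len(x) - 1) * x.var(ddof=1) + (len(y) - 1) * y.var(ddof=1)) / (
        len(x) + len(y) - 2
    )
    mean_diff = x.mean() - y.mean()
    return {
        "mean_a": x.mean(),
        "mean_b": y.mean(),
        "mean_diff": mean_diff,
        "cohens_d": mean_diff / np.sqrt(pooled_var),
        "t_stat": t_stat,
        "p_value": p_val,
    }


# Made-up toy scores, NOT the student dataset
toy = pd.DataFrame({
    "test_preparation_course": ["completed"] * 4 + ["none"] * 4,
    "math_score": [70, 75, 80, 85, 60, 65, 70, 75],
})
result = summarize_gap(toy, "test_preparation_course", "math_score", "completed", "none")
print(result)
```

On the real data, a call such as `summarize_gap(student_data, 'test_preparation_course', subject, 'completed', 'none')` inside the loop over `subjects` would give the size of each gap in score points.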

### Chi-Square Test

### 2. Who Needs the Test Preparation Course: Male or Female Students?

In [12]:
import pandas as pd
from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(student_data['gender'], student_data['test_preparation_course'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"Chi2 Statistic: {chi2:.4f}")
print(f"P-Value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print("\nExpected Frequencies Table:")
print(expected)


Chi2 Statistic: 0.0155
P-Value: 0.9008
Degrees of Freedom: 1

Expected Frequencies Table:
[[185.444 332.556]
 [172.556 309.444]]


P-value = 0.9008 > 0.05, so we fail to reject the null hypothesis.

Based on the Chi-Square test, there is no statistically significant association between gender and test preparation course completion.
The data are consistent with gender and test preparation being independent: male and female students complete the course at similar rates.
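The test output only shows expected frequencies, so the completion rates themselves are not visible. A minimal sketch of how they could be inspected is a row-normalized cross tabulation. The helper name `completion_rates` and the `toy` rows are illustrative assumptions, not values from the student dataset.

```python
import pandas as pd


def completion_rates(df, group_col, course_col="test_preparation_course"):
    """Hypothetical helper: share of each course status within each group."""
    # normalize="index" divides every row by its row total
    return pd.crosstab(df[group_col], df[course_col], normalize="index")


# Made-up toy rows, NOT the student dataset
toy = pd.DataFrame({
    "gender": ["female"] * 4 + ["male"] * 4,
    "test_preparation_course": [
        "completed", "none", "none", "none",
        "completed", "completed", "none", "none",
    ],
})
rates = completion_rates(toy, "gender")
print(rates.round(3))
```

Calling `completion_rates(student_data, 'gender')` on the real data would show the two completion rates side by side.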



### Chi-Square Test

### 3. Which Ethnicity Needs the Test Preparation Course?

Null Hypothesis:
There is no association between ethnicity and test preparation course.

Alternative Hypothesis:
There is an association between ethnicity and test preparation course.



In [13]:
import pandas as pd
from scipy.stats import chi2_contingency

table = pd.crosstab(student_data['race/ethnicity'], student_data['test_preparation_course'])

print("Cross Tabulation Table:\n", table)

chi2, p_value, dof, expected = chi2_contingency(table)

print("\nChi-Square Test Results:")
print(f"Chi2 Statistic: {chi2}")
print(f"p-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("\nExpected Frequencies Table:\n", expected)


Cross Tabulation Table:
 test_preparation_course  completed  none
race/ethnicity                          
group A                         31    58
group B                         68   122
group C                        117   202
group D                         82   180
group E                         60    80

Chi-Square Test Results:
Chi2 Statistic: 5.4875148857070695
p-value: 0.24082911295018397
Degrees of Freedom: 4

Expected Frequencies Table:
 [[ 31.862  57.138]
 [ 68.02  121.98 ]
 [114.202 204.798]
 [ 93.796 168.204]
 [ 50.12   89.88 ]]


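Since the p-value (0.241) is greater than 0.05, we fail to reject the null hypothesis.
Therefore, there is no statistically significant association between ethnicity and test preparation course completion.

The observed counts differ somewhat from the expected ones (group D completes the course less often than expected, group E more often), but differences of this size are consistent with chance.
This test alone does not show that any ethnic group participates less, so any targeted support aimed at giving everyone a fair academic chance should rest on further evidence.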
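The cross tabulation printed above is enough to recompute the per-group completion rates and re-run the test without the CSV file. The sketch below copies the observed counts from that table; the variable names are illustrative. It should reproduce the chi-square statistic and p-value reported above, and it also checks the common rule of thumb that every expected count is at least 5.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts copied from the cross tabulation printed above
observed = pd.DataFrame(
    {"completed": [31, 68, 117, 82, 60], "none": [58, 122, 202, 180, 80]},
    index=["group A", "group B", "group C", "group D", "group E"],
)

# Share of each group that completed the course
completion_rate = observed["completed"] / observed.sum(axis=1)
print(completion_rate.round(3))

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.4f}, p={p_value:.4f}, dof={dof}")

# Rule of thumb: the chi-square approximation is usually considered
# acceptable when every expected count is at least 5
all_expected_ok = bool((expected >= 5).all())
print("All expected counts >= 5:", all_expected_ok)
```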

