# Initial Analysis

Here I analyze a Kaggle dataset, available at https://www.kaggle.com/datasets/spscientist/students-performance-in-exams. The dataset records exam scores alongside student demographics and test-preparation status. My objective is to produce some analyses that act as controls and others that act as bad-faith analyses, so that I can later see how participants react to each.

In [94]:
import pandas as pd
import scipy
from scipy.stats import ttest_ind
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, chi2, f_classif, f_regression
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("StudentsPerformance.csv")

df

,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77


In [95]:
z = 1.96  # normal-approximation critical value for a 95% confidence interval
tests = ['math score', 'writing score', 'reading score']
for test in tests:
	male = df[df['gender'] == 'male'][test]
	female = df[df['gender'] == 'female'][test]
	print("\nStats for " + test)
	# Welch's t-test (equal_var=False), which does not assume equal variances
	print(ttest_ind(male, female, equal_var=False))
	# 95% confidence interval for each group's mean: mean +/- z * standard error
	print("Male Confidence Interval", male.mean() - z*male.sem(), male.mean() + z*male.sem())
	print("Female Confidence Interval", female.mean() - z*female.sem(), female.mean() + z*female.sem())


print("\nReading Score\n", df.groupby('gender')['reading score'].describe())


Stats for math score
Ttest_indResult(statistic=5.398000564160736, pvalue=8.420838109090413e-08)
Male Confidence Interval 67.4465511724309 70.0098803628388
Female Confidence Interval 62.29912009128336 64.96728917512591

Stats for writing score
Ttest_indResult(statistic=-9.997718973491885, pvalue=1.711809371849699e-22)
Male Confidence Interval 62.051183170132006 64.57122346887215
Female Confidence Interval 71.18878146060051 73.74558147376241

Stats for reading score
Ttest_indResult(statistic=-7.9683565184844, pvalue=4.3762967534977204e-15)
Male Confidence Interval 64.22925702465336 66.71680106663295
Female Confidence Interval 71.36989020182624 73.84632601438999

Reading Score
         count       mean        std   min    25%   50%   75%    max
gender                                                             
female  518.0  72.608108  14.378245  17.0  63.25  73.0  83.0  100.0
male    482.0  65.473029  13.931832  23.0  56.00  66.0  75.0  100.0
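
The intervals above use a fixed multiplier of 1.96, which is the normal approximation for a 95% interval. As a minimal sketch, the same interval can be rebuilt from the summary statistics alone and compared with a Student-t interval. The helper `mean_ci` and its parameters are hypothetical names introduced here for illustration; the numbers are the male reading-score row of the table above.

```python
import math
from scipy import stats

def mean_ci(mean, std, n, confidence=0.95, use_t=True):
    """Confidence interval for a mean from summary statistics (hypothetical helper)."""
    sem = std / math.sqrt(n)      # standard error of the mean
    upper_tail = 0.5 + confidence / 2
    if use_t:
        # Student-t critical value with n - 1 degrees of freedom
        crit = stats.t.ppf(upper_tail, df=n - 1)
    else:
        # Normal critical value, roughly 1.96 for a 95% interval
        crit = stats.norm.ppf(upper_tail)
    return mean - crit * sem, mean + crit * sem

# Male reading scores, copied from the describe() output
male_z = mean_ci(65.473029, 13.931832, 482, use_t=False)
male_t = mean_ci(65.473029, 13.931832, 482, use_t=True)
print("normal approximation:", male_z)
print("Student-t:", male_t)
```

With several hundred observations per group the two intervals should be nearly identical, so the choice of 1.96 makes little practical difference here.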


Despite the fact that girls and boys both start off with the same innate ability, the data here seem clear: boys do better in math, and girls do better in reading and writing. Or is it that simple? In the next section I will investigate whether there is any bias in the data between girls and boys.

I have also used the "Reading Score" table above to ask a question in my form: did girls do statistically significantly better than boys on reading? Here I hope to identify how deep participants go. Will they just look at the difference in means, will they mistakenly assume that the large standard deviations leave room for overlap, or will they go as far as running their own t-tests, as I have, which show a significant difference?
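
A participant who only sees the "Reading Score" table can still run the test, because Welch's t-test needs only the mean, standard deviation, and count of each group. The following is a minimal sketch using `scipy.stats.ttest_ind_from_stats`, with the values typed in by hand from that table. The dictionary names are illustrative and not part of the original analysis.

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics copied from the "Reading Score" describe() table
female = {"mean": 72.608108, "std": 14.378245, "n": 518}
male = {"mean": 65.473029, "std": 13.931832, "n": 482}

# equal_var=False selects Welch's t-test, matching the earlier ttest_ind call
result = ttest_ind_from_stats(
    mean1=female["mean"], std1=female["std"], nobs1=female["n"],
    mean2=male["mean"], std2=male["std"], nobs2=male["n"],
    equal_var=False,
)
t_stat, p_value = result.statistic, result.pvalue
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
```

Because female is passed as the first group here, the statistic should match the reading-score result reported earlier in magnitude but with the opposite sign. The standard deviations describe the spread of individual scores, whereas the test depends on the standard errors of the means, which are far smaller with about 500 students per group.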

In [99]:
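features = pd.get_dummies(df.drop(tests, axis=1), drop_first=True)
print(features.shape)
for test in tests:
	# Score each one-hot encoded feature with a univariate F-test and keep the top 2
	kBest = SelectKBest(f_regression, k=2)
	# Passing y as a column vector via reshape(-1, 1) triggers the column_or_1d warning seen in the output
	kBest.fit_transform(features, df[test].values.reshape(-1, 1))

	print("\n" + test)
	print(kBest.pvalues_)
	print(kBest.scores_)
	print(kBest.get_feature_names_out())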

(1000, 12)

math score
[9.12018555e-08 7.68444284e-03 2.02909329e-02 1.13561198e-01
 4.96229401e-11 1.17342014e-02 4.45432565e-05 5.61457485e-02
 2.41695206e-01 1.15364775e-02 2.41319560e-30 1.53591346e-08]
[ 28.9793361    7.13451779   5.40396562   2.50835083  44.16280406
   6.37404324  16.81569209   3.65623668   1.37230874   6.40440608
 140.11884155  32.54264847]
['race/ethnicity_group E' 'lunch_standard']

writing score
[2.01987771e-22 1.33120283e-02 7.47264828e-01 9.45349400e-03
 4.81828157e-03 4.72444951e-05 6.49581827e-09 6.73109822e-05
 3.76603509e-01 2.06187612e-03 3.18618958e-15 3.68529174e-24]
[9.95915761e+01 6.14909848e+00 1.03901216e-01 6.76125439e+00
 7.98214323e+00 1.67020952e+01 3.42723807e+01 1.60201446e+01
 7.82457983e-01 9.54371738e+00 6.41566429e+01 1.08350892e+02]
['gender_male' 'test preparation course_none']

reading score
[4.68053874e-15 5.66925193e-02 9.22646471e-01 2.66424543e-01
 7.24966510e-04 2.36767876e-03 1.59636982e-06 7.47116053e-04
 7.33449636e-01 2.4012

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


In [100]:
# Compare how each remaining categorical column is distributed across genders
for col in df.columns:
	if col == 'gender' or col in tests: continue
	colM = df[df['gender'] == 'male'][col]
	colF = df[df['gender'] == 'female'][col]
	# Join the per-category counts on the category label: the _x column is male, the _y column is female
	print(pd.merge(colM.value_counts(), colF.value_counts(), right_index=True, left_index=True))

         race/ethnicity_x  race/ethnicity_y
group C               139               180
group D               133               129
group B                86               104
group E                71                69
group A                53                36
                    parental level of education_x  \
some college                                  108   
associate's degree                            106   
high school                                   102   
some high school                               88   
bachelor's degree                              55   
master's degree                                23   

                    parental level of education_y  
some college                                  118  
associate's degree                            116  
high school                                    94  
some high school                               91  
bachelor's degree                              63  
master's degree                                36  
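
One way to check whether the gender groups differ in composition is a chi-square test of independence on counts like those above. This is a sketch only, using the race/ethnicity counts typed in by hand from the printed output, where the `_x` column is male and the `_y` column is female, following the argument order of `pd.merge`. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Race/ethnicity counts by gender, copied from the printed merge output
groups = ["group C", "group D", "group B", "group E", "group A"]
male_counts = [139, 133, 86, 71, 53]      # the _x column
female_counts = [180, 129, 104, 69, 36]   # the _y column

# Rows are genders, columns are race/ethnicity groups
observed = np.array([male_counts, female_counts])

# Null hypothesis: group membership is independent of gender
chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p = {p_value:.3g}")
```

A small p-value would suggest that the two genders are not spread across the groups in the same proportions, which would be one possible source of bias in the score comparison. The same call can be repeated for the other categorical columns.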
