# Standardized Testing
## 1 Setup

This dataset records students' test scores alongside their demographics and test preparation. My objective is to produce some analyses of the data that act as controls and others that act as bad-faith analyses, so that I can later see how participants react to each.
## 1.1 Supporting Packages

In [256]:
import pandas as pd
import scipy
from scipy.stats import ttest_ind
from matplotlib import pyplot as plt

from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, chi2, f_classif, f_regression
from sklearn.tree import DecisionTreeRegressor, plot_tree



## 1.2 Getting The Data
The dataset I am using is from Kaggle and can be found at https://www.kaggle.com/datasets/spscientist/students-performance-in-exams. It covers 1000 students, recording 5 demographic observations for each (gender, race/ethnicity, parental level of education, lunch, and test preparation course) and 3 test scores (math, reading, and writing).

In [308]:
df = pd.read_csv("StudentsPerformance.csv")
print(df.shape)

df

(1000, 8)


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77


## 1.2.1 Sanitization / Cleanup
The data comes in a fairly usable format from Kaggle. The only thing I need to do is one-hot encode the observations, as they are all categorical. To reduce the number of redundant columns, I drop the first category of each observation when encoding.

In [316]:
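# Score columns; these are the targets, so they are excluded from the features.
tests = ['math score', 'writing score', 'reading score']
features = pd.get_dummies(df.drop(tests, axis=1), drop_first=True)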
print(features.shape)

features

(1000, 12)


Unnamed: 0,gender_male,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E,parental level of education_bachelor's degree,parental level of education_high school,parental level of education_master's degree,parental level of education_some college,parental level of education_some high school,lunch_standard,test preparation course_none
0,0,1,0,0,0,1,0,0,0,0,1,1
1,0,0,1,0,0,0,0,0,1,0,1,0
2,0,1,0,0,0,0,0,1,0,0,1,1
3,1,0,0,0,0,0,0,0,0,0,0,1
4,1,0,1,0,0,0,0,0,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,1,0,0,1,0,0,1,0
996,1,0,1,0,0,0,1,0,0,0,0,1
997,0,0,1,0,0,0,1,0,0,0,0,0
998,0,0,0,1,0,0,0,0,1,0,1,0
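
To see why the encoded frame ends up with 12 columns, note that `drop_first=True` keeps one fewer indicator than each column has categories. Below is a minimal sketch on a made-up four-row frame. The `toy` data is hypothetical and only borrows two column names from the dataset, and the category counts are read off the encoded column names above.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard", "standard"],
    "test preparation course": ["none", "completed", "none", "completed"],
})

# drop_first=True drops the first category (in sorted order) of each column,
# so a two-category column becomes a single indicator column.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))

# Number of categories per demographic column in the real dataset.
n_categories = {
    "gender": 2,
    "race/ethnicity": 5,
    "parental level of education": 6,
    "lunch": 2,
    "test preparation course": 2,
}
# Each column contributes (categories - 1) indicator columns.
expected_columns = sum(k - 1 for k in n_categories.values())
print(expected_columns)  # 12
```

The dropped category becomes the implicit baseline: a row with `lunch_standard` equal to 0 is a free/reduced lunch student.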


## 2 Initial Analysis
Before I dive into drawing conclusions from the specifics of the data, it is best to take a step back and look at the properties of my data frame so that we can better understand the consequences these properties may have later on.
## 2.1 Duplicate Feature Combinations

In [None]:
# One-hot encode the categorical columns, dropping the first level of each.
features = pd.get_dummies(df.drop(tests, axis=1), drop_first=True)
print("Shape of features: ", features.shape)
print("Unique feature rows: ", features.drop_duplicates().shape)

# Count how many distinct feature rows occur exactly k times (index = k).
double_counts = features.groupby(features.columns.tolist(), as_index=False).size()['size'].value_counts()
print("Breakdown of number of duplicates:", double_counts.sort_index(), sep='\n')
# Sanity check: k times the number of groups of size k should sum to all 1000 rows.
rows_per_size = double_counts.multiply(double_counts.index).sort_index()
print(rows_per_size, rows_per_size.sum())

for test in tests:
    # With radius=0 only students with an identical feature row count as neighbors.
    rnr = RadiusNeighborsRegressor(radius=0, p=1)
    rnr.fit(features, df[test])
    print(test + " R^2 is " + str(rnr.score(features, df[test])))

Shape of features:  (1000, 12)
Unique feature rows:  (211, 12)
Breakdown of number of duplicates:
1     29
2     37
3     32
4     26
5     23
6     14
7     13
8     11
9      7
10     1
11     5
12     2
13     4
14     2
15     2
17     1
19     1
21     1
Name: size, dtype: int64
1      29
2      74
3      96
4     104
5     115
6      84
7      91
8      88
9      63
10     10
11     55
12     24
13     52
14     28
15     30
17     17
19     19
21     21
dtype: int64 1000
math score R^2 is 0.383747836030655
writing score R^2 is 0.44360602438994856
reading score R^2 is 0.36593437622917757
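
The duplicate breakdown above can be tallied to see how many students share their exact feature row with others. The sketch below copies the group-size counts from the printed output into a plain dictionary. The names `groups_by_size` and `rows_with_at_least` are mine and not part of the notebook.

```python
# Number of distinct feature rows (values) that occur exactly k times (keys),
# copied from the "Breakdown of number of duplicates" output.
groups_by_size = {
    1: 29, 2: 37, 3: 32, 4: 26, 5: 23, 6: 14, 7: 13, 8: 11, 9: 7,
    10: 1, 11: 5, 12: 2, 13: 4, 14: 2, 15: 2, 17: 1, 19: 1, 21: 1,
}

def rows_with_at_least(k):
    """Count students whose feature row is shared by at least k students."""
    return sum(size * n for size, n in groups_by_size.items() if size >= k)

print(sum(groups_by_size.values()))  # distinct feature rows: 211
print(rows_with_at_least(1))         # all students: 1000
print(rows_with_at_least(3))         # 897
print(rows_with_at_least(5))         # 697
```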


By creating a Radius Neighbors Regressor with a radius of 0, we can identify how well a perfectly overfit model could do: each student's prediction is the average score of the students with exactly the same feature row. As shown by the low R^2 values, more than half of the variance in each test score could not be explained in the original dataset even if we were perfectly overfit. This makes sense, as a lot of students share the same background: there are only 211 distinct feature rows, and of those 211 only 29 have a single observation. In fact, 897 rows sit in a neighborhood of at least 3 rows and 697 in a neighborhood of at least 5 rows.
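
To make this ceiling concrete: as I understand it, a radius of 0 with the default uniform weights means each prediction is the mean score of the students who share that exact feature row, so the in-sample R^2 is whatever those group means can explain. Here is a minimal pure-Python sketch on made-up numbers. The rows and scores are hypothetical, not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical (feature row, score) pairs; identical tuples stand for
# students who look the same on every encoded feature.
rows = [
    ((0, 0), 60), ((0, 0), 80),
    ((1, 1), 70), ((1, 1), 90),
    ((1, 0), 50),
]

# Group scores by exact feature row, which is what a radius of 0 amounts to.
scores_by_row = defaultdict(list)
for feature_row, score in rows:
    scores_by_row[feature_row].append(score)
group_mean = {key: sum(vals) / len(vals) for key, vals in scores_by_row.items()}

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
overall_mean = sum(score for _, score in rows) / len(rows)
ss_res = sum((score - group_mean[key]) ** 2 for key, score in rows)
ss_tot = sum((score - overall_mean) ** 2 for _, score in rows)
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

Students with identical rows must receive identical predictions from any model that sees only these features, and the group mean is the single value that minimizes squared error within a group, so no such model can score higher in-sample.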

In [317]:
z = 1.96  # normal critical value for a 95% confidence interval
tests = ['math score', 'writing score', 'reading score']
male = df[df['gender'] == 'male']
female = df[df['gender'] == 'female']
for test in tests:
    print("\nStats for " + test)
    # Welch's t-test (unequal variances); the statistic is male minus female.
    print(ttest_ind(male[test], female[test], equal_var=False))
    print("Male Confidence Interval", male[test].mean() - z*male[test].sem(), male[test].mean() + z*male[test].sem())
    print("Female Confidence Interval", female[test].mean() - z*female[test].sem(), female[test].mean() + z*female[test].sem())


print("\nReading Score\n", df.groupby('gender')['reading score'].describe())


Stats for math score
Ttest_indResult(statistic=5.398000564160736, pvalue=8.420838109090413e-08)
Male Confidence Interval 67.4465511724309 70.0098803628388
Female Confidence Interval 62.29912009128336 64.96728917512591

Stats for writing score
Ttest_indResult(statistic=-9.997718973491885, pvalue=1.711809371849699e-22)
Male Confidence Interval 62.051183170132006 64.57122346887215
Female Confidence Interval 71.18878146060051 73.74558147376241

Stats for reading score
Ttest_indResult(statistic=-7.9683565184844, pvalue=4.3762967534977204e-15)
Male Confidence Interval 64.22925702465336 66.71680106663295
Female Confidence Interval 71.36989020182624 73.84632601438999

Reading Score
         count       mean        std   min    25%   50%   75%    max
gender                                                             
female  518.0  72.608108  14.378245  17.0  63.25  73.0  83.0  100.0
male    482.0  65.473029  13.931832  23.0  56.00  66.0  75.0  100.0


Despite the fact that girls and boys both start off with the same innate ability, the data here seems clear that boys do better on math and girls do better on reading and writing. Or is it? In the next section I will investigate whether the boys and girls in this data differ in their other demographics, which would bias the comparison.

I have also used the "Reading Score" table above to ask a question in my form: did girls do statistically significantly better than boys on reading? Here I hope to identify how deep participants go. Will they just look at the difference in means, will they mistakenly assume that the large standard deviations leave room for overlap, or will they go as far as running their own t-tests, as I have, which show a significant difference?
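
A participant who only has the "Reading Score" table can still check significance, because Welch's t statistic and the confidence intervals depend only on the count, mean, and standard deviation of each group. The sketch below uses just the standard library. The dictionaries and the helpers `sem` and `ci95` are my own names, and the numbers are copied from the table.

```python
import math

# Summary statistics copied from the "Reading Score" table.
female = {"n": 518, "mean": 72.608108, "std": 14.378245}
male = {"n": 482, "mean": 65.473029, "std": 13.931832}

def sem(group):
    """Standard error of the mean: std / sqrt(n)."""
    return group["std"] / math.sqrt(group["n"])

def ci95(group, z=1.96):
    """Normal-approximation 95% confidence interval for the mean."""
    half_width = z * sem(group)
    return group["mean"] - half_width, group["mean"] + half_width

# Welch's t statistic (male minus female); it does not assume equal variances.
t_stat = (male["mean"] - female["mean"]) / math.sqrt(sem(male) ** 2 + sem(female) ** 2)

print(round(t_stat, 2))  # -7.97
print(ci95(male))
print(ci95(female))
```

The standard deviations of about 14 describe the spread of individual scores, which do overlap heavily. The comparison of means instead rests on the standard errors, which are roughly 0.63 points per group, so a gap of about 7 points is nearly 8 standard errors wide.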

In [319]:
# Compare how each remaining demographic is distributed among boys (_x) and girls (_y).
for col in df.columns:
    if col == 'gender' or col in tests:
        continue
    colM = df[df['gender'] == 'male'][col]
    colF = df[df['gender'] == 'female'][col]
    print(pd.merge(colM.value_counts(), colF.value_counts(), right_index=True, left_index=True))

         race/ethnicity_x  race/ethnicity_y
group C               139               180
group D               133               129
group B                86               104
group E                71                69
group A                53                36
                    parental level of education_x  parental level of education_y
some college                                  108                            118
associate's degree                            106                            116
high school                                   102                             94
some high school                               88                             91
bachelor's degree                              55                             63
master's degree                                23                             36
