## Aim: Predicting student performance based on demographic/socioeconomic information

## Methods: Ridge regression with cross-validation
* Predicted variable: the average of the math, reading, and writing scores.
* Predictors: gender, parental education level (low to high), lunch type (standard or free/reduced), race/ethnicity (5 groups), and whether the student took the test preparation course.
* Because the demographic/socioeconomic variables are likely correlated with each other, we used regression with regularization.
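To illustrate why regularization helps when predictors are correlated, here is a minimal sketch on synthetic data (not the student dataset). The names `x1`, `x2`, `y_toy`, the noise levels, and `alpha=1.0` are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)  # nearly a copy of x1 (strong collinearity)
X = np.column_stack([x1, x2])
y_toy = x1 + x2 + rng.normal(size=n)

# Ordinary least squares vs. ridge (L2-penalized) regression on the same data
ols = LinearRegression().fit(X, y_toy)
ridge = Ridge(alpha=1.0).fit(X, y_toy)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)

# The L2 penalty shrinks the coefficient vector relative to OLS
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("coefficient norms (OLS, ridge):", ols_norm, ridge_norm)
```

With nearly duplicated predictors, the unpenalized fit can split the shared effect between the two columns in an unstable way, while the ridge penalty keeps the coefficients small and similar.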

## Results: 
* Prediction accuracy, measured by the r-squared score, is significantly better than chance (p<10e-4).
* Several predictors are significant (see Result summary).

## Conclusion:
* Better performance is related to (ordered by effect size): having standard lunch (instead of free/reduced), having taken the preparation course, being in race/ethnicity group D or E, being female, and having parents with a higher education level.
* The amount of explained variance is relatively low despite being significant. The biggest limitation is the small amount of information provided by the dataset; nonetheless, the model's prediction is clearly significant. Additional variables will likely increase the amount of explained variance.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
student_data = pd.read_csv("../input/students-performance-in-exams/StudentsPerformance.csv")

# Construct dataset:
## 1. Define predicted variable: average score of the three subjects
## 2. Define predictors:
* Convert "parental level of education" to continuous
* Dummy coding categorical variables
## 3. Drop unnecessary columns
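The three steps above can be illustrated on a tiny hand-made frame before applying them to the real data. This is only a sketch: the `toy` frame and its three rows are made up, and `edu_order` simply mirrors the low-to-high ordering described above.

```python
import pandas as pd

# Hypothetical three-row frame using two of the dataset's column names
toy = pd.DataFrame({
    "parental level of education": ["high school", "master's degree", "some college"],
    "lunch": ["standard", "free/reduced", "standard"],
})

# Ordinal coding: education level from low (0) to high (5)
edu_order = {"some high school": 0, "high school": 1, "some college": 2,
             "associate's degree": 3, "bachelor's degree": 4, "master's degree": 5}
toy["edu_parent"] = toy["parental level of education"].map(edu_order)

# Dummy coding: drop_first=True keeps one column per two-level variable,
# so "free/reduced" becomes the reference level for lunch
toy = pd.get_dummies(toy, columns=["lunch"], drop_first=True)

# Drop the original text column now that it has a numeric version
toy = toy.drop(columns=["parental level of education"])
print(toy)
```

Because `drop_first=True` removes the first level of each categorical variable, every remaining dummy coefficient is interpreted relative to that dropped reference level.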

In [None]:
# Preparing the dataset, including:
## 1. averaging the three scores since they are highly correlated
## 2. convert parent education to numerical (low to high)
student_data['avg_score'] = (student_data['math score']+student_data['reading score']+student_data['writing score'])/3
mapping_edu = {"some high school":0, "high school":1, "some college":2, "associate's degree":3, \
               "bachelor's degree":4, "master's degree":5} 
student_data['edu_parent'] = student_data['parental level of education'].map(lambda x: mapping_edu[x])
student_data.head()

In [None]:
ds = pd.get_dummies(student_data,
                    columns=['gender','race/ethnicity','test preparation course','lunch'], drop_first=True)
y = ds['avg_score']
ds.drop(columns=["parental level of education", 'math score','reading score','writing score','avg_score'],inplace=True)
n_cols = len(ds.columns)
n_cols

In [None]:
y.hist()

# Run regression analysis with cross-validation
* Using regularization given collinearity in the dataset (the variables are correlated)
* Ridge regression is implemented here. 
    * Lasso and elastic net gave similar results
* Estimating the significance of the prediction result and the regression coefficients with 10K permutations
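For the prediction score alone, scikit-learn offers `permutation_test_score`, which shuffles the target and re-runs the cross-validation for each permutation. The sketch below runs it on synthetic data; `X_toy`, `y_toy`, the true coefficients, and `n_permutations=200` are assumptions for illustration, not the settings used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, permutation_test_score

# Synthetic regression problem with a clear linear signal
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

# Cross-validated r2 on the real target, plus r2 on shuffled targets
score, perm_scores, p_value = permutation_test_score(
    Ridge(), X_toy, y_toy,
    cv=KFold(n_splits=5), scoring="r2",
    n_permutations=200, random_state=0,
)
print("r2=%f, p=%f" % (score, p_value))
```

The returned p-value is computed as (C + 1) / (n_permutations + 1), where C counts permuted scores at least as good as the real one, so it is never exactly zero. This helper does not return permuted coefficients, so a manual loop is still needed for per-coefficient p-values.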

In [None]:
from sklearn import model_selection, linear_model, metrics

In [None]:
n_it = 5
# similar with n_it=10/20, use 5 to save time
kf = model_selection.KFold(n_splits=n_it)
clf = linear_model.Ridge()
#Note: results are similar with parameter selection (RidgeCV). 
print(clf)

In [None]:
accs = np.zeros(n_it)
labels_pred = np.zeros(len(y))
coefs = []
for it, (tr,te) in enumerate(kf.split(ds, y )):
    clf.fit(ds.iloc[tr], y.iloc[tr])
    y_true = y.iloc[te]
    y_pred = clf.predict(ds.iloc[te])
    accs[it] = metrics.r2_score(y_true=y_true, y_pred=y_pred)
    print('Iter#%d acc=%f, alpha=%f' % (it,accs[it],clf.alpha))
    coefs.append(clf.coef_)
    labels_pred[te] = y_pred
print('Acc: %f+/-%f' % (np.mean(accs),np.std(accs)))
coefs = np.asarray(coefs)

In [None]:
## 10k permutation test to obtain the significance of the accuracy (r squared) and the coefficients.
n_perm = 10000
acc_rand = np.zeros(n_perm)
coefs_rand = np.zeros((n_perm,kf.n_splits,n_cols))
for iperm in range(n_perm):
    y_rand = np.random.permutation(y)
    accs_tmp = np.zeros(n_it)
    for it, (tr,te) in enumerate(kf.split(ds, y)):
        clf.fit(ds.iloc[tr], y_rand[tr])
        y_true = y_rand[te]
        y_pred = clf.predict(ds.iloc[te])
        accs_tmp[it] = metrics.r2_score(y_true=y_true, y_pred=y_pred)
        coefs_rand[iperm, it] = clf.coef_
    acc_rand[iperm] = np.mean(accs_tmp)
    if np.mod(iperm,1000)==0:
        print('#%d Acc: %f+/-%f' % (iperm, np.mean(accs_tmp),np.std(accs_tmp)))
coefs_rand = np.nanmean(coefs_rand,axis=1)
coefs_rand.shape

In [None]:
realAcc = np.mean(accs)
plt.hist(acc_rand,50)
plt.plot(realAcc,10,'rd')
plt.show()
print('actual acc=%f, p=%f' % (realAcc, np.sum(acc_rand>realAcc)/(n_perm*1.)))

In [None]:
print('r2=%f' % metrics.r2_score(y, labels_pred))
plt.plot(y, labels_pred, 'ro')
#plt.xlim([0,101]); plt.ylim([0,101])
plt.ylabel('predicted')
plt.xlabel('true')
plt.show()

In [None]:
# avg coef values across cv:
coefs_avg = np.nanmean(coefs,axis=0)
plt.barh(range(n_cols),coefs_avg,color='r')
# error bars use the standard deviation so they are in the same units as the coefficients
plt.barh(range(n_cols),np.mean(coefs_rand,axis=0),xerr=np.std(coefs_rand,axis=0),color='k',capsize=6)
plt.yticks(range(n_cols),ds.columns)
plt.axvline(x=0,color='k')
plt.show()
for ic,col in enumerate(ds.columns):
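    # two-sided p-value: compare the magnitude of the real coefficient with the magnitudes of the permuted ones
    print('#%d %s coef=%f, p=%f' % (ic,col,coefs_avg[ic], np.sum(abs(coefs_avg[ic]) < abs(coefs_rand[:,ic])) /(n_perm*1.)))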

# Result summary
## Significant positive variables (being in the category, or having a higher value, means a HIGHER test score)
1. parent education level
2. being in race/ethnicity group D or E
3. standard lunch

## Significant negative variables (being in the category means a LOWER test score)
1. being male
2. not having taken the test preparation course