# Gelato Exercise - Predict Test Scores of students.  
  
## Predicting the posttest scores of students  
  
### 1) Import packages and dataset

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import matplotlib.lines as lines
import seaborn as sns
from sklearn.metrics import mean_squared_error as MSE
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import OrdinalEncoder, StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.tree import DecisionTreeRegressor
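from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
# Note: the code below targets scikit-learn releases before 1.2. Later releases remove the
# normalize argument of Lasso/Ridge, OneHotEncoder.get_feature_names and criterion='mse',
# and rename OneHotEncoder's sparse and AdaBoost's base_estimator arguments.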

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

df = pd.read_csv('../input/predict-test-scores-of-students/test_scores.csv')

### 2) Overview of data

In [None]:
#Return summaries of data and top rows
print(df.info())
print()
print(df.describe())
print()
print(df.head())

In [None]:
#Return unique values of object (categorical) variables
for col in df:
    if df[col].dtype == "object":
        print(col + " - " + str(len(np.unique(df[col]))))
        print(str(np.unique(df[col])))

### Summary  
Data loaded successfully.  
8 categorical variables including student_id.  
3 continuous variables including the target posttest.  
No evidence of any zeros or missing values in the continuous variables.  
No unexpected values in any categorical variables.  
Correct number of unique student_id values.  
  
Based on this investigation we will assume that the data is correct, with no missing values and there will be no need for any further cleaning.
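
The checks behind this summary can be made explicit. The following is a minimal sketch, assuming a hypothetical helper `quality_report` and a small toy frame in place of `df`; in the notebook the call would be `quality_report(df, "student_id")`.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame, id_col: str) -> dict:
    """Collect the basic data-quality checks described in the summary."""
    numeric = frame.select_dtypes(include="number")
    return {
        # Total count of missing cells across all columns
        "missing_values": int(frame.isna().sum().sum()),
        # Total count of exact zeros in the numeric columns
        "zeros_in_numeric": int((numeric == 0).sum().sum()),
        # True when every ID appears exactly once
        "ids_unique": bool(frame[id_col].is_unique),
    }


# Toy stand-in for df (hypothetical values)
toy = pd.DataFrame({
    "student_id": ["A1", "B2", "C3"],
    "pretest": [52.0, 61.0, 47.0],
    "posttest": [60.0, 70.0, 55.0],
})
report = quality_report(toy, "student_id")
print(report)
```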

### 2b) Initial Data Investigation
We will now produce some basic visualisations of the data.
This will provide a second check that everything looks in order.
It will also give information that could be useful for the model building stage.

In [None]:
plot1 = sns.PairGrid(df)
plot1.map_diag(sns.histplot)
plot1.map_offdiag(sns.scatterplot)

There is a strong relationship between pretest and posttest.
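
The strength of this relationship can also be put into a number. A minimal sketch, using a toy frame with made-up scores; in the notebook the same call would be `df["pretest"].corr(df["posttest"])`.

```python
import pandas as pd

# Toy stand-in for df (hypothetical scores)
toy = pd.DataFrame({
    "pretest": [40, 50, 60, 70, 80],
    "posttest": [52, 60, 71, 79, 92],
})

# Series.corr computes the Pearson correlation by default
r = toy["pretest"].corr(toy["posttest"])
print(f"Pearson r = {r:.3f}")
```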

In [None]:
plot2 = sns.FacetGrid(df, col="school", col_wrap=6)
plot2.map(sns.histplot, "posttest")

There is strong evidence that test results (posttest) are affected by school.
In some schools the highest-performing student scores below the lowest-performing student in another school.
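
The non-overlap between schools can be checked directly rather than read off the histograms. This is a sketch on a toy frame with invented scores; on the real data the `groupby` would run on `df`.

```python
import pandas as pd

# Toy stand-in for df (hypothetical schools and scores)
toy = pd.DataFrame({
    "school": ["A", "A", "B", "B", "C", "C"],
    "posttest": [40, 55, 60, 75, 80, 95],
})

# Lowest and highest posttest score per school
ranges = toy.groupby("school")["posttest"].agg(["min", "max"])

# Two schools do not overlap if the smallest per-school maximum
# is below the largest per-school minimum
no_overlap = bool(ranges["max"].min() < ranges["min"].max())
print(ranges)
print("At least one pair of schools does not overlap:", no_overlap)
```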

In [None]:
plot3 = sns.FacetGrid(df, col="school_setting")
plot3.map(sns.histplot, "posttest")

In [None]:
plot4 = sns.FacetGrid(df, col="school_type")
plot4.map(sns.histplot, "posttest")


In [None]:
plot5 = sns.FacetGrid(df, col="classroom", col_wrap = 12)
plot5.map(sns.histplot, "posttest")


Classroom has a large number of values.  
However, some classrooms do appear to perform significantly better than others.  

I will attempt to confirm later whether the classroom feature is a large number of general types, or the ID of a particular room.

In [None]:
plot6 = sns.FacetGrid(df, col="teaching_method")
plot6.map(sns.histplot, "posttest")


In [None]:
plot7 = sns.FacetGrid(df, col="gender")
plot7.map(sns.histplot, "posttest")


In [None]:
plot8 = sns.FacetGrid(df, col="lunch")
plot8.map(sns.histplot, "posttest")


In [None]:
#crosstab of classroom and school to see which classrooms are in each school
df_count = pd.crosstab(df.classroom, df.school)
print("number of classrooms in each school")
#count the classrooms with at least one student in each school
print(df_count[df_count > 0].count())
print("")

#similar crosstab to see how many schools have each classroom
df_count2 = pd.crosstab(df.school, df.classroom)
print("max no of schools with each classroom - " + str(df_count2[df_count2 > 0].count().max()))

This confirms that each school has several classrooms, but each classroom only exists within one school.  
I will assume that classroom is an ID for a unique classroom, not a category of classrooms.  
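
The same conclusion can be reached with a `groupby`, which avoids building the full crosstab. A minimal sketch on a toy frame with made-up labels; on the real data the grouping would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for df (hypothetical school and classroom labels)
toy = pd.DataFrame({
    "school": ["S1", "S1", "S1", "S2", "S2"],
    "classroom": ["c1", "c1", "c2", "c3", "c3"],
})

# Number of distinct schools each classroom appears in
schools_per_classroom = toy.groupby("classroom")["school"].nunique()
# Number of distinct classrooms in each school
classrooms_per_school = toy.groupby("school")["classroom"].nunique()

print("max schools per classroom:", schools_per_classroom.max())
print(classrooms_per_school)
```

A maximum of 1 in the first result means every classroom label belongs to exactly one school.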

### Summary
Several of the variables appear to have strong relationships with the target posttest.  
There are no significant concerns about any of the variable distributions or weightings.  


### 3) Data Preparation
Normalizing, encoding and splitting out the target variable

In [None]:
#Converting student_id to an index
df.set_index('student_id', inplace=True)

#Splitting data into X and y
X_pre = df.drop('posttest', axis=1)
y_pre = df[['posttest']]

#List of category variables
cat_list = ['school', 'school_setting', 'school_type', 'classroom', 'teaching_method', 'gender', 'lunch']
   
#Split into category and continuous variables
X_cont = X_pre[X_pre.columns[~X_pre.columns.isin(cat_list)]]
X_cat = X_pre[cat_list]

#Apply StandardScaler
X_cont_col = list(X_cont.columns)
X_scaler = StandardScaler().fit(X_cont)
X_scale_cont = pd.DataFrame(X_scaler.transform(X_cont))
X_scale_cont.columns = X_cont_col
#test = pd.DataFrame(X_scaler.inverse_transform(X_scale_cont))

#Apply StandardScaler to target variable
y_scaler = StandardScaler().fit(y_pre)
y_df = pd.DataFrame(y_scaler.transform(y_pre))
#Retrieve Series from DataFrame and name it (a Series has a name, not columns)
y = y_df.iloc[:,0]
y.name = 'posttest'

#Apply OneHotEncoder to convert categories into dummies, drop one value when binary to remove redundant columns
ohe = OneHotEncoder(drop = 'if_binary', sparse=False)
X_encoded_cat = pd.DataFrame(ohe.fit_transform(X_cat), columns=ohe.get_feature_names(X_cat.columns))

#Combine category and continuous feature dataframes
X = pd.concat([X_scale_cont, X_encoded_cat], axis=1)

#Split into test and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=22, shuffle=True)
Results  = {}
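
The preparation steps above (scaling the continuous features and one-hot encoding the categorical ones) could also be bundled with a model into a single scikit-learn pipeline. The sketch below is one possible arrangement, not the approach used in this notebook: it uses a toy frame with a reduced, hypothetical column set, leaves the target unscaled, and picks `Ridge(alpha=1.0)` arbitrarily.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for df (hypothetical values, only a few of the real columns)
toy = pd.DataFrame({
    "school_type": ["Public", "Private"] * 10,
    "teaching_method": ["Standard", "Standard", "Experimental", "Experimental"] * 5,
    "pretest": [float(40 + i) for i in range(20)],
    "posttest": [float(50 + i + (5 if i % 4 >= 2 else 0)) for i in range(20)],
})
cat_cols = ["school_type", "teaching_method"]
num_cols = ["pretest"]

# Scale the continuous columns and one-hot encode the categorical ones
preprocess = make_column_transformer(
    (StandardScaler(), num_cols),
    (OneHotEncoder(drop="if_binary"), cat_cols),
)

# Preprocessing and model are fitted together on the training rows only
model = make_pipeline(preprocess, Ridge(alpha=1.0))

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns="posttest"), toy["posttest"], test_size=0.3, random_state=22
)
model.fit(X_tr, y_tr)
demo_preds = model.predict(X_te)
print("Test R squared:", model.score(X_te, y_te))
```

One side effect of this design is that the scaler and encoder only ever see the training rows, whereas the cell above fits them on the full dataset before splitting.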

### 4) Model Building
### 4a) Regularised Regression
We will use cross-validated Lasso and Ridge regularised regression to predict posttest.

### Lasso Regression

In [None]:
#Create the range of possible values for alpha
lasso_space = np.logspace(-10, 0, 30)
#Set parameter space for GridSearchCV
param_grid = {"alpha": lasso_space}

#Instantiate Model
lassoreg = Lasso(normalize = True, tol = 0.01)

#Create GridSearch Cross Validation with 10 folds
lassoreg_cv = GridSearchCV(lassoreg, param_grid, cv = 10)
#Fit to training data
lassoreg_cv.fit(X_train,y_train)

print('Best Parameter Set: '+format(lassoreg_cv.best_params_))
print('Best Estimator Training Score: '+format(lassoreg_cv.best_score_))

#Best model, test set prediction and scores
lasso_best = lassoreg_cv.best_estimator_
lasso_best_pred = lasso_best.predict(X_test)
lasso_test_score = lasso_best.score(X_test, y_test)
lasso_best_RMSE = MSE(y_test,lasso_best_pred)**(1/2)
Results['Lasso'] = [lasso_test_score, lasso_best_RMSE]

print('Test set RSquared: ' + str(lasso_test_score))
print('Test set RMSE: ' + str(lasso_best_RMSE))
print('')
#Check Coefficient List
lasso_coeffs = pd.DataFrame(lasso_best.coef_, X_train.columns, columns=['Coefficients'])
print(lasso_coeffs[(lasso_coeffs.Coefficients != 0)].sort_values(by=['Coefficients'], ascending=False))


In [None]:
#Simple scatter of performance
plt.scatter(y_test, lasso_best_pred)
plt.plot(plt.xlim(), plt.xlim(), linestyle='--', color='k', lw=3, scalex=False, scaley=False)
plt.xlabel('Actual Score')
plt.ylabel('Predicted score')
plt.show()

### Lasso Regression Results
RSquared of 0.958  
RMSE of 0.204  
No evidence of overfitting.  
Note that the model has chiefly assigned non-zero coefficients to classrooms and some schools, with individual student performance captured through the pretest feature.  
There is no evidence of non-linear relationships and no major outliers.  
No obvious trends or heteroscedasticity in the residuals.  
No additional transformation of features will be required.

### Ridge Regression

In [None]:
#Create the range of possible values for alpha
ridge_space = np.logspace(-10, 0, 30)
param_grid2 = {"alpha": ridge_space}

#Instantiate Model
ridgereg = Ridge(normalize = True)

#Create GridSearch Cross Validation with 10 folds
ridgereg_cv = GridSearchCV(ridgereg, param_grid2, cv = 10)
#Fit to training data
ridgereg_cv.fit(X_train,y_train)


print('Best Parameter Set: '+format(ridgereg_cv.best_params_))
print('Best Estimator Training Score: '+format(ridgereg_cv.best_score_))

#Best model, test set prediction and scores
ridge_best = ridgereg_cv.best_estimator_
ridge_best_pred = ridge_best.predict(X_test)
ridge_test_score = ridge_best.score(X_test, y_test)
ridge_best_RMSE = MSE(y_test,ridge_best_pred)**(1/2)
Results['Ridge'] = [ridge_test_score, ridge_best_RMSE]

print('Test set RSquared: ' + str(ridge_test_score))
print('Test set RMSE: ' + str(ridge_best_RMSE))

#Check Coefficient List
ridge_coeffs = pd.DataFrame(ridge_best.coef_, X_train.columns, columns=['Coefficients'])
print(ridge_coeffs[(ridge_coeffs.Coefficients != 0)].sort_values(by=['Coefficients'], ascending=False))


In [None]:
plt.scatter(y_test, ridge_best_pred)
plt.plot(plt.xlim(), plt.xlim(), linestyle='--', color='k', lw=3, scalex=False, scaley=False)
plt.xlabel('Actual Score')
plt.ylabel('Predicted score')
plt.show()

### Ridge Regression Results
RSquared of 0.958  
RMSE of 0.204  
No evidence of overfitting.  
Performance was very slightly better than Lasso, although the two are identical to the three decimal places shown.

### Summary
Lasso and Ridge regularised regression modelled the problem well and produced good accuracy, with no obvious problems.
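
Because the target was standardised during data preparation, the RMSE values quoted above are in units of posttest standard deviations rather than test points. The sketch below shows how such a value could be converted back using the scaler's `scale_` attribute. The raw scores here are invented for illustration, so the printed figure is not the real conversion for this dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical raw scores standing in for the posttest column
raw_scores = np.array([[50.0], [60.0], [70.0], [80.0], [90.0]])
demo_scaler = StandardScaler().fit(raw_scores)

rmse_scaled = 0.204  # RMSE reported above, in standard-deviation units

# Standardising divides by the standard deviation, so an error on the
# scaled target is multiplied by that standard deviation to recover points
rmse_points = rmse_scaled * demo_scaler.scale_[0]
print(f"RMSE in original units (toy data): {rmse_points:.3f}")
```

In the notebook the equivalent would use `y_scaler.scale_[0]` from the data preparation cell.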

### 4b) Tree-based regressors
We will now test a tree-based approach, including ensemble models, to see if they are able to improve on this accuracy.

### Decision Tree

In [None]:
#Create parameter space for Decision Tree Regressor
param_grid3 = {'criterion': ['mse'],
               'min_samples_leaf':[ 0.003, 0.005, 0.01, 0.03],
               'max_depth':[6,8,10,12,14]}

#Instantiate Regressor
treereg = DecisionTreeRegressor(random_state = 22)

#Create Grid Search and fit to training data
treereg_cv = GridSearchCV(treereg, param_grid3, cv = 10)
treereg_cv.fit(X_train,y_train)

print('Best Parameter Set: '+format(treereg_cv.best_params_))
print('Best Estimator Training Score: '+format(treereg_cv.best_score_))

#Best model, test set prediction and scores
tree_best = treereg_cv.best_estimator_
tree_best_pred = tree_best.predict(X_test)
tree_test_score = tree_best.score(X_test, y_test)
tree_best_RMSE = MSE(y_test,tree_best_pred)**(1/2)
Results['Decision Tree'] = [tree_test_score, tree_best_RMSE]

print('Test set RSquared: ' + str(tree_test_score))
print('Test set RMSE: ' + str(tree_best_RMSE))

#Check Variable Importances
tree_importances = pd.DataFrame(tree_best.feature_importances_, X_train.columns, columns=['Importance'])
print(tree_importances[(tree_importances.Importance != 0)].sort_values(by=['Importance'], ascending=False))


In [None]:
plt.scatter(y_test, tree_best_pred)
plt.plot(plt.xlim(), plt.xlim(), linestyle='--', color='k', lw=3, scalex=False, scaley=False)
plt.xlabel('Actual Score')
plt.ylabel('Predicted score')
plt.show()

### Decision Tree Results  
RSquared of 0.946  
RMSE of 0.232  
Slightly worse than the two regression models.  
No trend in the residuals, although the predictions fall into visible strata because a single tree can only output one value per leaf.  
We will now try some ensemble techniques to reduce the error.
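
The strata arise because a regression tree predicts a single constant per leaf, so the number of distinct predictions cannot exceed the number of leaves. A small sketch on synthetic data (the data and `max_depth=3` are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic, roughly linear data (hypothetical)
rng = np.random.default_rng(22)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 1, size=200)

# A depth-3 tree has at most 2**3 = 8 leaves
tree_demo = DecisionTreeRegressor(max_depth=3, random_state=22).fit(X_demo, y_demo)

# Count how many different values the tree can actually predict
n_distinct = len(np.unique(tree_demo.predict(X_demo)))
print("leaves:", tree_demo.get_n_leaves(), "distinct predictions:", n_distinct)
```

Averaging many trees, as the ensembles below do, smooths these steps out.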

### Random Forest

In [None]:
#Create Parameter Space for Random Forest
param_grid4 = {'criterion': ['mse'],
               'n_estimators':[500, 1000, 1500],
               'min_samples_leaf':[0.001, 0.003, 0.005],
               'max_depth':[8,10,12,14],
               'max_features':['sqrt']}

#Instantiate Forest
forestreg = RandomForestRegressor(random_state = 22)

#Create Gridsearch for Forest and fit to training data
forestreg_cv = GridSearchCV(forestreg, param_grid = param_grid4, cv = 3, n_jobs = 1)
forestreg_cv.fit(X_train,y_train)

print('Best Parameter Set: '+format(forestreg_cv.best_params_))
print('Best Estimator Training Score: '+format(forestreg_cv.best_score_))

#Best model, test set prediction and scores
forest_best = forestreg_cv.best_estimator_
forest_best_pred = forest_best.predict(X_test)
forest_test_score = forest_best.score(X_test, y_test)
forest_best_RMSE = MSE(y_test,forest_best_pred)**(1/2)
Results['Random Forest'] = [forest_test_score, forest_best_RMSE]

print('Test set RSquared: ' + str(forest_test_score))
print('Test set RMSE: ' + str(forest_best_RMSE))



In [None]:
#Feature importances (not coefficients) and plot of performance
forest_importances = pd.DataFrame(forest_best.feature_importances_, X_train.columns, columns=['Importance'])
print(forest_importances[(forest_importances.Importance != 0)].sort_values(by=['Importance'], ascending=False))
plt.scatter(y_test, forest_best_pred)
plt.plot(plt.xlim(), plt.xlim(), linestyle='--', color='k', lw=3, scalex=False, scaley=False)
plt.xlabel('Actual Score')
plt.ylabel('Predicted score')
plt.show()

### Random Forest Results  
RSquared of 0.952  
RMSE of 0.220  
The Random Forest optimises to a high maximum depth, small leaves and a large number of estimators.  
Although performance has improved versus the single tree, this is very resource intensive and is still not as accurate as the regression models.
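
The cost of this search can be estimated before running it. The sketch below counts the fits and trees implied by `param_grid4` (copied from the cell above) with 3-fold cross-validation, using scikit-learn's `ParameterGrid`. The extra fit assumes `GridSearchCV`'s default `refit=True`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the Random Forest cell above
param_grid4 = {'criterion': ['mse'],
               'n_estimators': [500, 1000, 1500],
               'min_samples_leaf': [0.001, 0.003, 0.005],
               'max_depth': [8, 10, 12, 14],
               'max_features': ['sqrt']}
cv_folds = 3

candidates = list(ParameterGrid(param_grid4))
n_candidates = len(candidates)

# One fit per candidate per fold, plus one final refit on all training data
n_fits = n_candidates * cv_folds + 1

# Trees grown during cross-validation (excluding the final refit)
n_trees_cv = sum(p['n_estimators'] for p in candidates) * cv_folds

print(n_candidates, "candidates,", n_fits, "fits,", n_trees_cv, "trees in CV")
```

With `n_jobs = 1` these fits run one after another; `GridSearchCV` also accepts `n_jobs = -1` to spread them across the available cores.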


### AdaBoost

In [None]:
#Create parameter space
param_grid5 = {'n_estimators': [500,750],
                 'learning_rate' : [ 0.1, 0.25, 0.5, 0.75 ],
                 'loss' : ['square', 'exponential'],
                 'base_estimator__max_depth' : [6,8,10]
                }
#Instantiate decision tree estimator
treereg2 = DecisionTreeRegressor( random_state = 22)

#Instantiate AdaBoost
adareg = AdaBoostRegressor(treereg2)

#Create GridSearch and fit to training data
adareg_cv = GridSearchCV(adareg, param_grid = param_grid5, cv = 3, n_jobs = 1)
adareg_cv.fit(X_train,y_train)

print('Best Parameter Set: '+format(adareg_cv.best_params_))
print('Best Estimator Score: '+format(adareg_cv.best_score_))

#Best model, test set prediction and scores
ada_best = adareg_cv.best_estimator_
ada_best_pred = ada_best.predict(X_test)
ada_test_score = ada_best.score(X_test, y_test)
ada_best_RMSE = MSE(y_test,ada_best_pred)**(1/2)
Results['AdaBoost'] = [ada_test_score, ada_best_RMSE]

print('Test set RSquared: ' + str(ada_test_score))
print('Test set RMSE: ' + str(ada_best_RMSE))

In [None]:
#Importance of features and plot of performance
ada_importances = pd.DataFrame(ada_best.feature_importances_, X_train.columns, columns=['Importance'])
print(ada_importances[(ada_importances.Importance != 0)].sort_values(by=['Importance'], ascending=False))
#Reuse the predictions already computed in the previous cell
plt.scatter(y_test, ada_best_pred)
plt.plot(plt.xlim(), plt.xlim(), linestyle='--', color='k', lw=3, scalex=False, scaley=False)
plt.xlabel('Actual Score')
plt.ylabel('Predicted score')
plt.show()

### AdaBoost Results  
RSquared of 0.949  
RMSE of 0.227  
Once again this ensemble approach was very resource intensive, taking a long time to run, but did not produce better results than previous models.  
 

### Stochastic Gradient Boost

In [None]:
#Create parameter space
param_grid6 = {'n_estimators': [250,500],
                 'subsample' : [ 0.7, 0.8, 0.9, 1 ],
                 'max_features' : [0.7,0.8,0.9, 1]
                }

#Instantiate Regressor
gbreg = GradientBoostingRegressor( random_state = 22)

#create GridSearch and fit to training data
gbreg_cv = GridSearchCV(gbreg, param_grid = param_grid6, cv = 3, n_jobs = 1)
gbreg_cv.fit(X_train,y_train)

print('Best Parameter Set: '+format(gbreg_cv.best_params_))
print('Best Estimator Score: '+format(gbreg_cv.best_score_))

#Best model, test set prediction and scores
gb_best = gbreg_cv.best_estimator_
gb_best_pred = gb_best.predict(X_test)
gb_test_score = gb_best.score(X_test, y_test)
gb_best_RMSE = MSE(y_test,gb_best_pred)**(1/2)
Results['Gradient Boost'] = [gb_test_score, gb_best_RMSE]

print('Test set RSquared: ' + str(gb_test_score))
print('Test set RMSE: ' + str(gb_best_RMSE))


In [None]:
gb_importances = pd.DataFrame(gb_best.feature_importances_, X_train.columns, columns=['Importance'])
print(gb_importances[(gb_importances.Importance != 0)].sort_values(by=['Importance'], ascending=False))
#Reuse the predictions already computed in the previous cell
plt.scatter(y_test, gb_best_pred)
plt.plot(plt.xlim(), plt.xlim(), linestyle='--', color='k', lw=3, scalex=False, scaley=False)
plt.xlabel('Actual Score')
plt.ylabel('Predicted score')
plt.show()

### Stochastic Gradient Boost Results  
RSquared of 0.952  
RMSE of 0.219  
Stochastic Gradient Boost was one of the stronger ensemble methods on this data and took less time than some of the others, but still did not outperform the linear regression models.  
Several other boosting algorithms are available, but the performance of these first models doesn't suggest that an exhaustive search would be worthwhile.   

### 5) Results  
Regularised Linear Regression appears to be the best predictor on this dataset, of the methods that I have explored.  
It is also simpler and less expensive in time and resource than the ensemble approaches.  
There is little to choose between the two regression approaches, but Ridge performed slightly better in this case.


In [None]:
Models = pd.Series(list(Results.keys()))
Scores = pd.DataFrame(list(Results.values()))

Final_Results = pd.concat([Models, Scores], axis = 1)
Final_Results.columns = ['Model', 'RSquared', 'RMSE']
print(Final_Results)
sns.catplot(data=Final_Results, kind="bar", x="RSquared", y="Model")
plt.xlim(0.9, 1)
sns.catplot(data=Final_Results, kind="bar", x="RMSE", y="Model")

### Model considerations
All the models that I looked at predicted the posttest score with a good level of accuracy.  
However, if we were building this model as a project, there are other considerations that would need to be included, specifically what decisions the model is intended to inform. 
  
For example, if you wanted to achieve the best outcome for a pupil, my most accurate model would suggest putting them into the best classroom. But if you wanted to improve the outcomes for all of the pupils, you couldn't put all two thousand pupils in that classroom.
  
Similarly, using the pretest score to predict the posttest score is a good way to get an accurate prediction. But it doesn't carry any information on why the scores are high. That feature only tells you that students who got high scores before will likely get high scores again in the future.
  
Depending on the specific application it might be necessary to get more data relating to the underlying causes of performance, or possibly accept a lower level of accuracy.


### Next Steps and Builds  
If I were to spend more time on this project, I think that it could be worthwhile to build my pre-processing as a pipeline, so that each model could be set up independently and could run new data more easily.  
As the pre-processing was the same for all the models and there is no additional data, it is less impactful in this case.  
Similarly, although it is only a few lines of code, the RMSE, scoring and plotting are reused repeatedly and I could wrap them in one or more functions. 
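
As an illustration of the last point, the repeated scoring code could be wrapped roughly as follows. The function name `evaluate` and the synthetic data are hypothetical; the body mirrors the R squared and RMSE lines used for each model above and leaves plotting out.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error


def evaluate(name, fitted_model, X_test, y_test, results):
    """Score a fitted model on the test set and record [R squared, RMSE]."""
    preds = fitted_model.predict(X_test)
    r2 = fitted_model.score(X_test, y_test)
    rmse = mean_squared_error(y_test, preds) ** 0.5
    results[name] = [r2, rmse]
    return preds


# Synthetic data (hypothetical) so the sketch runs on its own
rng = np.random.default_rng(22)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

demo_results = {}
ridge_demo = Ridge(alpha=1.0).fit(X_demo[:70], y_demo[:70])
demo_preds = evaluate("Ridge", ridge_demo, X_demo[70:], y_demo[70:], demo_results)
print(demo_results)
```

In the notebook, a call such as `evaluate('Lasso', lasso_best, X_test, y_test, Results)` would replace the five scoring lines in each model cell.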