In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import seaborn as sns
import random

np.random.seed(1001)
random.seed(1001)

## Load data

In [None]:
data = pd.read_csv('/kaggle/input/predict-test-scores-of-students/test_scores.csv')
data.head()

Loaded fine. Also, all attributes are reasonably well balanced and there are no missing values. Nice tidy data.
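The balance and missing-value claims can be checked directly. A minimal sketch, assuming a small made-up frame (`toy`) standing in for `data`; the column names follow the dataset but the rows are invented for illustration:

```python
import pandas as pd

# Toy stand-in for `data`: invented rows, column names borrowed from the dataset
toy = pd.DataFrame({
    'school_type': ['Public', 'Public', 'Non-public', 'Non-public'],
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'pretest': [55.0, 62.0, 70.0, 48.0],
})

# Count of missing values per column (all zeros means nothing is missing)
missing = toy.isna().sum()
print(missing)

# Share of rows taken by each value of the discrete columns
for col in ['school_type', 'gender']:
    print(toy[col].value_counts(normalize=True))
```

On the real frame the same two calls would be run on `data` with the full list of discrete columns.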

## Low-hanging fruit: the pre-test scores

We've got a pre-test score. These are usually very predictive (they are even sometimes used as a proxy for test results when a final test can't be sat).

Let's see how well it correlates:

In [None]:
sns.scatterplot(data=data, x='posttest',y='pretest')

That is highly correlated. It's also one of the few real-valued attributes and it would be good to factor that out so any models we build can stick to fitting parameters to the difficult stuff.

Let's move to predicting the difference from pre- to post-test results. Happily, minimising the MAD on the difference in the scores gives the same loss as MAD on the final score itself. This means we can build a delta-score predictor and just add each of its predictions to the pre-test value.
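The equivalence holds because the pre-test term cancels: the error on the delta, `(posttest - pretest) - prediction`, is the same number as the error on the reconstructed score, `posttest - (pretest + prediction)`. A small numerical sketch with invented scores and hypothetical predictions:

```python
import numpy as np

# Invented scores for illustration only
pretest = np.array([50.0, 62.0, 71.0, 45.0])
posttest = np.array([63.0, 70.0, 85.0, 58.0])
delta_pred = np.array([12.0, 11.0, 13.0, 12.5])  # hypothetical delta predictions

# Error measured on the delta target
mad_delta = np.mean(np.abs((posttest - pretest) - delta_pred))
# Error measured on the reconstructed post-test score
mad_post = np.mean(np.abs(posttest - (pretest + delta_pred)))

print(mad_delta, mad_post)
```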

If we end up using decision trees, this also means the trees don't have to split on the pre-test score with hard thresholds, which can be a weakness on a continuous value.


In [None]:
data = data.assign(testdelta=data['posttest']-data['pretest'])
data.head()

In [None]:
data.testdelta.mean()

So on average, the pre-test is 12 points lower than the final test result.

### An initial baseline model
We can now make a super-simple baseline model: add the average delta to the pre-test.
Note that the average is calculated across all data, so we're evaluating on the same rows we fitted to, but it still gives a good feeling for where we start from:

In [None]:
# mean absolute deviation from the average delta (computed rather than hard-coded)
(data.testdelta - data.testdelta.mean()).abs().mean()

So we should be looking to do (significantly) better than a MAD of 3.5.
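For a baseline that doesn't peek at the evaluation rows, the mean can be estimated on one part of the data and scored on the rest. A rough sketch using scikit-learn's `DummyRegressor`; the deltas here are synthetic stand-ins for `data.testdelta`, not the real scores:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic deltas standing in for data.testdelta (invented, not the real scores)
rng = np.random.default_rng(0)
delta = rng.normal(loc=12.0, scale=4.0, size=500)
X = np.zeros((len(delta), 1))  # DummyRegressor ignores the features

X_train, X_test, y_train, y_test = train_test_split(
    X, delta, train_size=0.8, random_state=0
)

# Predict the training-set mean delta for every held-out row
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
baseline_mad = mean_absolute_error(y_test, baseline.predict(X_test))
print(baseline_mad)
```

Switching to `strategy='median'` would give the constant that minimises absolute error on the training part.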

## Other real-valued attributes
Let's take a look at the other real-valued attribute: the number of students. We'd expect this to be negatively correlated with final score (small classes doing better) but it's less clear whether this will correlate with the delta-scores.

In [None]:
sns.scatterplot(data=data, x='testdelta',y='n_student', alpha=0.2)

The scatter plot shows no obvious correlations. Some care must be taken not to overfit on this attribute: consider the unusual distributions at class sizes 26 and 29.

These are clearly just artifacts as they shouldn't differ much from those at the adjacent class sizes. It's probably information leakage from the classroom attribute.

We'll leave this attribute out to begin with. Also note we're cheating a bit here as these plots are on all data. Once we've looked at the possible values of the discrete attributes we'll make the test/train split to ensure we're doing any further investigations properly.
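The visual impression can be backed up with a number. A minimal sketch, assuming synthetic data in which class size and delta are drawn independently (so the correlation should be near zero); on the real frame the same calls would use `data`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: class sizes and deltas drawn independently (invented data)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'n_student': rng.integers(14, 32, size=400),
    'testdelta': rng.normal(12.0, 4.0, size=400),
})

# Pearson correlation; values near 0 suggest no linear relationship
r = toy['n_student'].corr(toy['testdelta'])
print(round(r, 3))

# Mean delta per class size, with counts, to spot sizes backed by very few rows
by_size = toy.groupby('n_student')['testdelta'].agg(['mean', 'count'])
print(by_size.head())
```

The per-size counts are a quick way to see whether an odd-looking class size is supported by only one or two classrooms.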

## Discrete attributes

In [None]:
data[['school_setting','school_type','teaching_method','gender','lunch']].describe()

In [None]:
data.school_setting.unique()

Two attributes have been left out here.

The *classroom* attribute feels a bit too fine-grained. Also, it wouldn't be useful in a real model in which we'd want to predict scores on later years (rather than predicting classmates' scores on the exact same test).

The *school* attribute may be informative. At this point I'm assuming that the school setting+type+teaching method sufficiently captures what the school does. The plan is to first work without this attribute, then compare a final model with it in.

## Upper bound model

We have a baseline of 3.5, but we can also find a reasonable limit on how well we could do, i.e. the lowest MAD these attributes can support.

We'll maximally overfit to the data. In this case that means breaking the data up into the smallest subsets that have identical attribute values (so there will be 3 x 2 x 2 x 2 x 2 = 48 subsets). Using the mean of each subset as the prediction for its members gives close to the lowest MAD achievable from these attributes (strictly, the per-subset median is what minimises absolute error, but the mean is what the code below uses).

If there were more features we could do this by learning a single decision tree with high depth, split nodes down to size 2, and permitting leaves of single items. But in this case we can just do the binning ourselves:

In [None]:
uniques = [list(data[x].unique()) for x in ['school_setting','school_type','teaching_method','gender','lunch']]
uniques

In [None]:
subsets = [[]]
for feats in uniques:
    subsets = [ fs + [x] for x in feats for fs in subsets]
len(subsets)

In [None]:
subsets[:10]

Now we'll sub-divide our data into the 48 sub-sets, get a mean, MAD, and number of items.

In [None]:
sub_data = [data[(data.school_setting==x[0]) & (data.school_type==x[1]) & (data.teaching_method==x[2]) & (data.gender==x[3]) & (data.lunch==x[4])] for x in subsets]

And do a quick sanity check that the number of items in the subsets matches the original data:

In [None]:
print(len(data), sum(len(d) for d in sub_data))

In [None]:
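for d in sub_data[:5]:
    # Series.mad() was removed in pandas 2.0, so compute the mean absolute deviation directly
    print(len(d), d.testdelta.mean(), (d.testdelta - d.testdelta.mean()).abs().mean())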

The MAD we'd get for predicting across the whole dataset is just the weighted mean of the MAD of each sub-set of the data:

In [None]:
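# weighted mean of the per-subset MADs, written as summed absolute deviations
# (Series.mad() no longer exists in pandas 2.0+)
sum((d.testdelta - d.testdelta.mean()).abs().sum() for d in sub_data)/len(data)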

So we'd like to approach 2.6 MAD. Let's see how close we can get to it. 
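The same per-subset calculation can be sketched with `groupby`, which only visits attribute combinations that actually occur. The frame below is synthetic, with two invented discrete columns and generic labels; it also compares the per-group mean against the per-group median as the prediction:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `data` with two discrete attributes (invented rows and labels)
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    'gender': rng.choice(['Male', 'Female'], size=300),
    'lunch': rng.choice(['yes', 'no'], size=300),
    'testdelta': rng.normal(12.0, 4.0, size=300),
})

groups = toy.groupby(['gender', 'lunch'])['testdelta']

# Absolute error when each row is predicted by its group's mean or median
mad_mean = (toy['testdelta'] - groups.transform('mean')).abs().mean()
mad_median = (toy['testdelta'] - groups.transform('median')).abs().mean()
print(mad_mean, mad_median)
```

The median version can never score worse on the data it was computed from, though the gap is typically small when the deltas are roughly symmetric.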

It's possible that with the *school* and *n_student* attributes you could do a bit better, but we'll leave that for later.

## Test/train split

Our upper and lower bounds have been determined based on all data, as has the decision to predict the delta scores. These decisions could have been made based on training data only so it isn't a big issue. From now on we'll be more careful though.


In [None]:
from sklearn.model_selection import train_test_split

ys = data.testdelta
train, test, train_y, test_y = train_test_split(data,ys,train_size=0.8)
train.school_setting.unique()

If three values aren't present for school setting, run it again with a new random seed to get a new split.

Otherwise the training data we've got is badly skewed. The one-hot encoding below is inferred automatically from the training data, so we want all values present.
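An alternative to re-rolling the seed is a stratified split, which keeps every school setting in both parts by construction. A minimal sketch on an invented frame; the `stratify` and `random_state` arguments are the only additions to the call used above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with three settings, ten rows each (invented data)
toy = pd.DataFrame({
    'school_setting': ['Urban'] * 10 + ['Suburban'] * 10 + ['Rural'] * 10,
    'testdelta': range(30),
})

# stratify keeps each setting's share the same in both parts of the split
train_toy, test_toy = train_test_split(
    toy, train_size=0.8, stratify=toy['school_setting'], random_state=1001
)
print(sorted(train_toy['school_setting'].unique()))
```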

## Random forest with limited features
We'll try to keep the hyperparameter search space small -- we've only got about 2000 instances and they'll get over-used fast.

In [None]:
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer

train_x = train[['school_setting','school_type','teaching_method','gender','lunch']]
test_x = test[['school_setting','school_type','teaching_method','gender','lunch']]


# one-hot encode the attribute with three values, the rest can use a single binary feature
feature_transform = ColumnTransformer(transformers=[
    ('onehot', preprocessing.OneHotEncoder(handle_unknown='ignore'), ['school_setting']),
    ('binary', preprocessing.OneHotEncoder(handle_unknown='error', drop='first'), ['school_type','teaching_method','gender','lunch']),
])

# consider forests with many small trees - not a very large feature space
params = {
  'predictor__n_estimators':[20,80,240],
  'predictor__min_samples_leaf':[8,16,32],
  'predictor__min_samples_split':[8,16,32],
  'predictor__max_depth':[2,3,4,5]
}
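To get a feel for how much work this grid implies, the number of candidate settings can be counted with scikit-learn's `ParameterGrid`. The sketch repeats the grid so it runs on its own, and assumes `GridSearchCV`'s default of 5 folds:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
params = {
    'predictor__n_estimators': [20, 80, 240],
    'predictor__min_samples_leaf': [8, 16, 32],
    'predictor__min_samples_split': [8, 16, 32],
    'predictor__max_depth': [2, 3, 4, 5],
}

n_candidates = len(ParameterGrid(params))
n_folds = 5  # GridSearchCV's default when cv is not given
print(n_candidates, n_candidates * n_folds)  # 108 540
```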


For the random forest we'll use the mean absolute error criterion since that's what we want to optimise at the leaves. 

We'll also bootstrap, since only toggling features won't give that many unique trees. Re-sampling the data will hopefully improve diversity.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn import ensemble

predictor = Pipeline(steps=[
    ('feature_transform',feature_transform),
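    # 'absolute_error' is the criterion name in scikit-learn >= 1.0 (older releases called it 'mae')
    ('predictor',ensemble.RandomForestRegressor(criterion='absolute_error', bootstrap=True, n_estimators=20, min_samples_leaf=8,min_samples_split=8,max_depth=2))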
  ])


In [None]:
from sklearn.model_selection import GridSearchCV
meta_fit = GridSearchCV(predictor, params, scoring='neg_mean_absolute_error',n_jobs=-1)
meta_fit.fit(train_x,train_y)
print('Best score:',meta_fit.best_score_)
print('Best params',meta_fit.best_params_)

In [None]:
print(meta_fit.score(train_x,train_y), meta_fit.score(test_x,test_y))

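`meta_fit.score` returns the negated mean absolute error, so these values correspond to a MAD of 2.6 on both the training and test data, which is as good as we could hope for given the bound found above. Normally I'd be concerned about overfitting since we got right down to 2.6 on the training data, but since it holds on the test data the model appears to be generalising ok.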

We may have an "easy" test set here (e.g. with few outliers) as it actually performed slightly better than on the training data.
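Because the search is scored with `neg_mean_absolute_error`, `score()` returns negative numbers, and flipping the sign gives the MAD. A small self-contained sketch on synthetic data; the pipeline here is a cut-down, hypothetical stand-in for the one above, not the same model:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (invented): one categorical feature, noisy target
rng = np.random.default_rng(3)
X = pd.DataFrame({'gender': rng.choice(['Male', 'Female'], size=120)})
y = np.where(X['gender'] == 'Male', 11.0, 13.0) + rng.normal(0.0, 3.0, size=120)

pipe = Pipeline(steps=[
    ('encode', OneHotEncoder(handle_unknown='ignore')),
    ('predictor', RandomForestRegressor(n_estimators=10, random_state=0)),
])
search = GridSearchCV(pipe, {'predictor__max_depth': [1, 2]},
                      scoring='neg_mean_absolute_error', cv=3)
search.fit(X, y)

# score() returns the negated MAE, so flip the sign to read it as a MAD
mad = -search.score(X, y)
print(mad, mean_absolute_error(y, search.predict(X)))
```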

## Full features
We'll now throw in n_student, school, and pretest as additional features.

Since this will then be feature-rich we'll try out some larger tree sizes. Bootstrapping is left at scikit-learn's default (on), as the predictor below doesn't override it.

We'll use the same test/training split as above, but with the extra features in.

In [None]:
train_x = train[['school','n_student','pretest','school_setting','school_type','teaching_method','gender','lunch']]
test_x = test[['school','n_student','pretest','school_setting','school_type','teaching_method','gender','lunch']]

feature_transform = ColumnTransformer(transformers=[
    ('onehot', preprocessing.OneHotEncoder(handle_unknown='ignore'), ['school_setting','school']),
    ('binary', preprocessing.OneHotEncoder(handle_unknown='error', drop='first'), ['school_type','teaching_method','gender','lunch']),
  ],remainder='passthrough')

params = {
  'predictor__n_estimators':[20,80,240],
  'predictor__min_samples_leaf':[4,8,16],
  'predictor__min_samples_split':[8,16,32],
  'predictor__max_depth':[2,4,8,16]
}

predictor = Pipeline(steps=[
    ('feature_transform',feature_transform),
    ('predictor',ensemble.RandomForestRegressor(criterion='absolute_error'))  # named 'mae' before scikit-learn 1.0
])


In [None]:
meta_fit = GridSearchCV(predictor, params, scoring='neg_mean_absolute_error',n_jobs=-1)
meta_fit.fit(train_x,train_y)
print('Best score:',meta_fit.best_score_)
print('Best params',meta_fit.best_params_)

In [None]:
print(meta_fit.score(train_x,train_y), meta_fit.score(test_x,test_y))

## Results

The full set of attributes reaches 2.4 MAD on the test data, a fair step up from the 2.01 on the training data. This is a good uplift over the 2.6 from the simple discrete attributes and looks like a reasonable generalisation.

To find the r2 metric we have to convert from predicted delta scores into posttest values by adding the pretest back on.

In [None]:
import sklearn.metrics as metrics

# Convert deltas back into post-test scores by adding the pre-test score
preds = meta_fit.predict(test_x) + test_x.pretest.to_numpy()
post_ys = test_y.to_numpy() + test_x.pretest.to_numpy()

# r2_score expects (y_true, y_pred); the score is not symmetric in its arguments
metrics.r2_score(post_ys, preds)
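Note that `r2_score` takes the true values first and the predictions second, and the two orderings generally give different numbers because the variance in the denominator comes from the first argument. A short sketch with invented values:

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented values for illustration
pretest = np.array([50.0, 62.0, 71.0, 45.0])
true_delta = np.array([13.0, 8.0, 14.0, 13.0])
pred_delta = np.array([12.0, 11.0, 13.0, 12.5])

# Add the pre-test back to get post-test scores
true_post = pretest + true_delta
pred_post = pretest + pred_delta

r2 = r2_score(true_post, pred_post)          # (y_true, y_pred)
r2_swapped = r2_score(pred_post, true_post)  # arguments reversed
print(r2, r2_swapped)
```

The difference is small when predictions are close to the truth, but only the first ordering matches the usual definition.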

So overall:

2.4 Mean absolute error

95.0% R2 score