# About
Data analysis and prediction of student test scores based on the Kaggle [Predict test scores of students](https://www.kaggle.com/kwadwoofosu/predict-test-scores-of-students) dataset.

The main task is to accurately and efficiently predict the post-test scores using info about the students.

# Imports and setup
Basic libraries and config.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import sklearn

In [None]:
sns.set_theme(style="whitegrid")

# Data exploration

In [None]:
data_path = "/kaggle/input/predict-test-scores-of-students/test_scores.csv"
df = pd.read_csv(data_path)
df

There are 2133 students taking the test, each with 11 features, including the prediction target variable *posttest*. That is few enough features that we can inspect them manually, one by one:

In [None]:
# Print some info about the features
for c in df:
    print("="*80)
    print("*"*3, c, "*"*3)
    print("dtype", df[c].dtype)
    print("Number of unique values:", df[c].nunique())
    print("Frequencies of unique values:")
    print(df[c].value_counts())
    print("NaNs:", df[c].isna().sum())

Based on common sense, most features seem to have some predictive value. The student id, however, is not relevant and shouldn't be used for predictions.

In [None]:
df.drop(columns=['student_id'], inplace=True)

To better understand what kind of grouping *classroom* represents, I check whether *n_student* can be reproduced simply by counting the students that share a classroom label:

In [None]:
# Count the number of students that have the same classroom label
classroom_count_students = df.groupby('classroom')['n_student'].count()
n_student_xy = df.merge(classroom_count_students, on='classroom')[['n_student_x', 'n_student_y']]
# The number of students obtained this way by counting and the n_student variable are the same. 
assert (n_student_xy['n_student_y'] == n_student_xy['n_student_x']).all()

Indeed, *classroom* simply corresponds to the group of students that are in the same class.

One would expect that the pre-test score *pretest* is very informative, perhaps the most informative, about the post-test score *posttest*. My understanding of the pretest score is that before the real test there was a trial test to help the students assess how much effort they need to put into studying to get their desired score. Indeed, the plot below shows a very strong correlation between the two, and the relationship looks fairly linear.

In [None]:
sns.relplot(data=df, x='pretest', y='posttest');
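
The strength of the linear relationship can also be summarized with a Pearson correlation coefficient. The sketch below runs on a small synthetic frame; the numbers and the helper name `pretest_posttest_corr` are made up for illustration and are not taken from the dataset. On the real data one would pass `df` instead.

```python
import numpy as np
import pandas as pd

def pretest_posttest_corr(frame, x="pretest", y="posttest"):
    """Pearson correlation between two columns (hypothetical helper)."""
    return frame[x].corr(frame[y])

# Synthetic stand-in: posttest is pretest plus a shift and some noise
rng = np.random.default_rng(0)
pre_demo = rng.integers(20, 95, size=200)
corr_demo_df = pd.DataFrame({
    "pretest": pre_demo,
    "posttest": pre_demo + 12 + rng.normal(0, 3, size=200),
})

r_demo = pretest_posttest_corr(corr_demo_df)
print("Pearson r:", r_demo)
```

A value close to 1 indicates an almost perfectly linear, increasing relationship.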

From a distributional plot it becomes clearer that the *posttest* values are on average higher than the *pretest* values.

In [None]:
# Put pretest and posttest values in one column ("value"), labels in column "variable"
data = df.melt(value_vars=['pretest', 'posttest'])
sns.displot(data=data, x='value', hue='variable', kind='kde');

The only other numerical feature is the number of students in the class, *n_student*, and there seems to be a correlation between the number of students and *posttest*: scores decrease with larger classes.

In [None]:
sns.scatterplot(data=df, x='n_student', y='posttest')

We can also quickly create bar plots for all the non-numerical features to get an overview of the data:

In [None]:
# Bar plots for all non-numerical features
for c in df:
    if df[c].dtype == "object":
        plt.figure()
        # Group by c, take the mean over posttest, sort and output c values
        order = df.groupby(c)['posttest'].mean().reset_index().sort_values(by='posttest')[c]
        sns.barplot(data=df, x=c, y='posttest', order=order)

In general, the variables look like they have predictive power: different categories have different *posttest* mean values. Gender does not show obvious differences, but there could be correlations that aren't visible in a bar chart, and it's only one categorical variable, so we might as well keep it.

Clearly both the schools and the classrooms show performance differences, but it's unclear to me from the plots above how schools and classes relate to each other. E.g. are the differences in class performance mainly due to being at a particular school? To visualize this better, I plot the *posttest* class average as a function of the classroom, but now grouped by school and sorted by the school's performance:

In [None]:
import itertools

def plot_classroom_sorted_by_scores(df):
    """Bar plot of the classrooms sorted by posttest scores.
    
    Sorts first by school score, then by classroom score.
    """
    # Get the right plot order of classrooms.
    # Mean posttest per classroom, and the (unique) school each classroom belongs to.
    df1 = df.groupby(['classroom']).agg(school=('school', 'first'),
                                        classroom_mean_posttest=('posttest', 'mean')).reset_index()
    # Mean posttest per school, to use for sorting the classrooms
    df2 = df.groupby(['school']).agg(school_mean_posttest=('posttest', 'mean')).reset_index()
    df12 = df1.merge(df2, on='school')
    # Sort the classrooms by the school mean, then by classroom mean
    order = df12.sort_values(['school_mean_posttest', 'classroom_mean_posttest'])['classroom']

    # Display the school name instead of the classroom 
    xtick_labels = df12.sort_values(['school_mean_posttest', 'classroom_mean_posttest'])['school'].reset_index(drop=True)
    indices_to_zero = []
    for i, l in enumerate(xtick_labels):
        indices = xtick_labels[xtick_labels == l].index.to_list()
        middle_index = indices[len(indices)//2]
        not_middle_index = (i != middle_index)
        indices_to_zero.append(not_middle_index)
    # Erase labels
    xtick_labels[indices_to_zero] = ''
    # abbreviate labels to fit better
    #xtick_labels = xtick_labels.apply(lambda x: x[:3] if len(x) > 0 else x)

    # Color by school
    palette = itertools.cycle(sns.color_palette())
    # Give colors in the same order as they'll be plotted
    schools_ordered_by_value = df.groupby('school')['posttest'].mean().sort_values().index.values
    school_colors = {school:next(palette) for i, school in  enumerate(schools_ordered_by_value)}
    # Map classroom to the school
    classroom_to_school = df12[['classroom', 'school']].set_index('classroom').to_dict()['school']
    classroom_colors_by_school = {classroom: school_colors[classroom_to_school[classroom]]\
                                  for i, classroom in  enumerate(set(df['classroom']))}

    # Plot finally 
    fig, ax = plt.subplots(figsize=(24, 4))
    ax = sns.barplot(data=df, x='classroom', y='posttest',
                     # esthetics
                     order=order, palette=classroom_colors_by_school, ax=ax)
    ax.set_xticklabels(xtick_labels);
    ax.set_xlabel("classroom (grouped and labeled by school)");
    return df12

In [None]:
df12 = plot_classroom_sorted_by_scores(df)

Here we see more clearly that there are also significant differences between classrooms within a school (maybe they represent different specializations within a school?). For instance, the best classroom of "QOQTS" is as good as the best of "OJOBU", even though "OJOBU" ranks higher on average. So it makes sense to use *classroom* as a feature for prediction.

# Data transformations

In [None]:
target = ["posttest"]
features = [c for c in df.columns if c not in target]

## One-hot encoding of categorical variables

In [None]:
# Extract features, encode categorical
df_features = pd.get_dummies(df[features], drop_first=True)
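
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame (the values are invented for illustration). Each categorical column with $k$ categories becomes $k-1$ indicator columns, because the dropped category is implied when all the others are zero.

```python
import pandas as pd

# Toy frame, not taken from the dataset
dummies_demo_df = pd.DataFrame({
    "pretest": [55, 62, 48],
    "gender": ["Male", "Female", "Male"],
})

encoded_demo = pd.get_dummies(dummies_demo_df, drop_first=True)
# Numerical columns are passed through; "gender" becomes a single indicator column
print(encoded_demo.columns.to_list())
```

The first category in sorted order ("Female" here) is the one that is dropped.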

In [None]:
df_features.shape

## Split into train/validation and test set
Note: if we were not interested in getting an accurate final estimate, we could skip the testing and use all data for training.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df_features.to_numpy()
y = df[target].to_numpy()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

## Normalize numerical

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Scale features to [0,1]
x_scaler = MinMaxScaler().fit(X_train)

X_train = x_scaler.transform(X_train)
X_test = x_scaler.transform(X_test)
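
The scaler is fitted on the training set only, so the test set is transformed with the training minimum and maximum. A small sketch with invented numbers shows the consequence: scaled test values are not guaranteed to lie in [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented single-feature data for illustration
X_train_demo = np.array([[10.0], [20.0], [30.0]])
X_test_demo = np.array([[15.0], [40.0]])

scaler_demo = MinMaxScaler().fit(X_train_demo)
train_scaled_demo = scaler_demo.transform(X_train_demo)
test_scaled_demo = scaler_demo.transform(X_test_demo)

print(train_scaled_demo.ravel())  # training values span exactly [0, 1]
print(test_scaled_demo.ravel())   # 40 exceeds the training maximum, so it maps above 1
```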

In [None]:
# Skip target scaling for simplicity, to interpret output more directly
#y_scaler = MinMaxScaler().fit(y_train)

#y_train = y_scaler.transform(y_train)
#y_test = y_scaler.transform(y_test)

# Metrics

In [None]:
def mean_absolute_error(y, ypred):
    return np.mean(np.abs(y.ravel() - ypred.ravel()))
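
The `ravel()` calls matter because `y` is kept as a column of shape `(n, 1)` here, whereas some models predict a flat array of shape `(n,)`. The sketch below, with invented numbers, shows what happens without flattening and checks the function against `sklearn.metrics.mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error as sk_mae

def mean_absolute_error(y, ypred):
    return np.mean(np.abs(y.ravel() - ypred.ravel()))

# Invented values: a column-shaped target and a flat prediction
y_true_demo = np.array([[60.0], [72.0], [85.0]])
y_hat_demo = np.array([58.0, 75.0, 85.0])

# Without ravel(), (3, 1) - (3,) broadcasts to a (3, 3) matrix of all pairwise differences
naive_shape_demo = np.abs(y_true_demo - y_hat_demo).shape

mae_custom_demo = mean_absolute_error(y_true_demo, y_hat_demo)
mae_sklearn_demo = sk_mae(y_true_demo.ravel(), y_hat_demo)
print(naive_shape_demo, mae_custom_demo, mae_sklearn_demo)
```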

# Cross-validation config
I use k-fold cross-validation to optimize model hyperparameters (if any) and obtain better estimates for the model performance.

In [None]:
# Only if need more metrics
# from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score

In [None]:
from sklearn.model_selection import KFold

In [None]:
n_splits = 10
cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
scoring = 'neg_mean_absolute_error'
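
As a quick illustration of what `KFold` does, the sketch below splits 20 dummy samples into 5 folds (both numbers are arbitrary) and collects the validation indices. Every sample ends up in a validation fold exactly once, so each one contributes to the performance estimate.

```python
import numpy as np
from sklearn.model_selection import KFold

X_kfold_demo = np.arange(20).reshape(-1, 1)
cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)

val_indices_demo = []
fold_sizes_demo = []
for train_idx, val_idx in cv_demo.split(X_kfold_demo):
    # Each fold trains on 16 samples and validates on the remaining 4
    fold_sizes_demo.append((len(train_idx), len(val_idx)))
    val_indices_demo.extend(val_idx.tolist())

print(fold_sizes_demo)
```

Note that scikit-learn scorers follow a "higher is better" convention, which is why the MAE is requested as `'neg_mean_absolute_error'` and negated again when reported.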

# Prediction baseline, 1D linear regression
As a (strong) baseline I'll use a linear model, with only one feature, the *pretest* variable.

In [None]:
from sklearn import linear_model

In [None]:
regr = linear_model.LinearRegression()

In [None]:
col_pretest = df_features.columns.to_list().index('pretest')

In [None]:
X1_train = X_train[:,col_pretest].reshape(-1,1)
X1_test = X_test[:, col_pretest].reshape(-1,1)

In [None]:
# Estimate model performance
scores = cross_val_score(regr, X1_train, y_train, cv=cv, scoring=scoring)
# Score = neg MAE
mae, mae_std = np.mean(-scores), np.std(-scores)
print("Cross-validation MAE:", mae, mae_std)

In [None]:
# Fit and print train test MAE
# Fit on train set
regr.fit(X1_train, y_train)

# Train error
ypred = regr.predict(X1_train)
print("Train MAE:", mean_absolute_error(y_train, ypred))

# Test error 
ypred = regr.predict(X1_test)
print("Test MAE:", mean_absolute_error(y_test, ypred))

In [None]:
# Plot model and data
plt.plot(X1_test.ravel(), y_test, 'o', label='data')
plt.plot(X1_test.ravel(), ypred, '-', label='model')
plt.legend()
plt.xlabel('pretest')
plt.ylabel('posttest');

# Linear regression using all features
Simply using all features as-is yields a more complex model but with significantly lowered MAE.

In [None]:
regr = linear_model.LinearRegression()

In [None]:
# Estimate model performance
scores = cross_val_score(regr, X_train, y_train, cv=cv, scoring=scoring)
# Score = neg MAE
mae, mae_std = np.mean(-scores), np.std(-scores)
print("Cross-validation MAE:", mae, mae_std)

In [None]:
# Fit and print train test MAE
# Fit on train set
regr.fit(X_train, y_train)

# Train error
ypred = regr.predict(X_train)
print("Train MAE:", mean_absolute_error(y_train, ypred))

# Test error 
ypred = regr.predict(X_test)
print("Test MAE:", mean_absolute_error(y_test, ypred))

In [None]:
# Plot model and data
plt.plot(X_test[:,col_pretest], y_test, 'o', label='data')
# Predictions depend on all features, so draw them as points rather than a connected line
plt.plot(X_test[:,col_pretest], ypred, '.', label='model')
plt.legend()
plt.xlabel('pretest')
plt.ylabel('posttest');

We can also note that the gap between training and test error has increased, which indicates a stronger tendency to overfit.

# Feature engineering

## Reduction, replace classroom and school with statistics
The *school* and especially the *classroom* variables lead to a fairly high-dimensional feature space (>100) because of their many categories. To obtain a more compact representation, I experiment with deriving numerical features from these variables, similarly to how *n_student* counts the number of students in a class. I can try adding the school and classroom statistics:
* the *pretest* average/standard deviation per *school*
* the *pretest* average/standard deviation per *classroom*

Then e.g. students from different schools but with similar school averages will be recognized as similar in this sense.

In [None]:
df['pretest_school'] = df.groupby('school')['pretest'].transform('mean')
df['pretest_classroom'] = df.groupby('classroom')['pretest'].transform('mean')
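
For readers less familiar with `groupby(...).transform(...)`: unlike an aggregation, it returns one value per original row, so the group mean is repeated for every student in the group. A minimal sketch on an invented five-row frame:

```python
import pandas as pd

# Invented values for illustration
transform_demo_df = pd.DataFrame({
    "school": ["A", "A", "B", "B", "B"],
    "pretest": [50, 70, 40, 60, 50],
})

# Same length as the frame: each row gets the mean of its own school
transform_demo_df["pretest_school"] = (
    transform_demo_df.groupby("school")["pretest"].transform("mean")
)
print(transform_demo_df)
```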

In [None]:
# Could consider the standard deviation too
#df['pretest_school_std'] = df.groupby('school')['pretest'].transform('std')
#df['pretest_classroom_std'] = df.groupby('classroom')['pretest'].transform('std')

In [None]:
features_red = [c for c in df.columns if c not in target and c not in ['classroom', 'school']]
# One-hot encode
df_features_red = pd.get_dummies(df[features_red])
df_features_red

### Test linear regression

In [None]:
X = df_features_red.to_numpy()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

In [None]:
# Scale features to [0,1]
x_scaler = MinMaxScaler().fit(X_train)

X_train = x_scaler.transform(X_train)
X_test = x_scaler.transform(X_test)

In [None]:
regr = linear_model.LinearRegression()

In [None]:
# Estimate model performance
scores = cross_val_score(regr, X_train, y_train, cv=cv, scoring=scoring)
# Score = neg MAE
mae, mae_std = np.mean(-scores), np.std(-scores)
print("Cross-validation MAE:", mae, mae_std)

In [None]:
# Fit and print train test MAE
# Fit on train set
regr.fit(X_train, y_train)

# Train error
ypred = regr.predict(X_train)
print("Train MAE:", mean_absolute_error(y_train, ypred))

# Test error 
ypred = regr.predict(X_test)
print("Test MAE:", mean_absolute_error(y_test, ypred))

The MAE is slightly lower, albeit within the error bars. Since the feature space is smaller, easier to work with, and can also generalize to new, unseen schools and classrooms, at first sight I'd say this is a better representation for continued modeling.

## Try adding mix terms
To make the model slightly more complex, we can add polynomial mix (interaction) terms, i.e. for each feature pair $x_1$, $x_2$ add a feature $x_{1,2} = x_1 \cdot x_2$.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
poly = PolynomialFeatures(2, include_bias=False, interaction_only=True)

In [None]:
poly.fit(X_train)

In [None]:
X_train_ = poly.transform(X_train)
X_test_ = poly.transform(X_test)
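
To see what this transformation produces, here is a small sketch on a single invented sample with three features. With `interaction_only=True` the squares are left out, so $n$ input features yield $n + n(n-1)/2$ output features, which would be 120 for a 15-dimensional input.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly_demo = PolynomialFeatures(2, include_bias=False, interaction_only=True)

# One invented sample with features (x0, x1, x2) = (2, 3, 5)
X_poly_demo = np.array([[2.0, 3.0, 5.0]])
X_poly_out_demo = poly_demo.fit_transform(X_poly_demo)
# Output order: x0, x1, x2, x0*x1, x0*x2, x1*x2
print(X_poly_out_demo)

# Number of output features for a 15-dimensional input
n_out_15_demo = PolynomialFeatures(
    2, include_bias=False, interaction_only=True
).fit(np.zeros((1, 15))).n_output_features_
print(n_out_15_demo)
```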

In [None]:
# Estimate model performance
scores = cross_val_score(regr, X_train_, y_train, cv=cv, scoring=scoring)
# Score = neg MAE
mae, mae_std = np.mean(-scores), np.std(-scores)
print("Cross-validation MAE:", mae, mae_std)

Since this has no advantage in performance and is a more complex model, I don't pursue this further here.

# Model selection, grid search

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
import warnings
from sklearn.exceptions import ConvergenceWarning

In order to compare models I perform a grid search over a few candidate models as well as over hyperparameters of those models.

In [None]:
models = {
    'linear': LinearRegression(),
    # Linear model with L2 penalty term
    'ridge': Ridge(random_state=0),
    # Linear model with L1 penalty term
    'lasso': Lasso(random_state=0),
    # Linear model with L1, L2 penalty terms
    'elastic': ElasticNet(random_state=0),
    # Decision tree
    'decision-tree': DecisionTreeRegressor(random_state=0),
    # Neural network, fix some of the optimization parameters
    'neural-network': MLPRegressor(random_state=0,
                                   max_iter=1000,
                                   solver='sgd',
                                   learning_rate='constant',
                                   momentum=0,
                                   nesterovs_momentum=False)
}
alphas = (0.1, 1, 10, 100)
params = {
    'linear':{},
    'ridge': {'alpha': alphas},
    'lasso': {'alpha': alphas},
    'elastic': {'alpha': alphas, 'l1_ratio': (0.25, 0.5, 0.75)},
    'decision-tree': {'max_depth': (2, 3, 5, 7, 9, 12)},
    'neural-network': {'alpha': [0.1, 1, 10], 
                       'hidden_layer_sizes': [(10,), (100,), (10, 10)],
                       'learning_rate_init': [0.001, 0.01, 0.1]}
}

In [None]:
Xs = {
    # Pretest only
    '1-dim': df_features.iloc[:,col_pretest].to_numpy().reshape(-1,1),
    # Reduce school, classroom to pretest averages
    '15-dim': df_features_red.to_numpy(),
    # All features
    '126-dim': df_features.to_numpy()
}

In [None]:
search_results = {}
for x_name, X in Xs.items():
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
    x_scaler = MinMaxScaler().fit(X_train)
    X_train = x_scaler.transform(X_train)
    X_test = x_scaler.transform(X_test)
    for model_name, regr in models.items():
        print(f"{model_name}, {x_name}")
        search = GridSearchCV(regr, param_grid=params[model_name], scoring=scoring, cv=cv, verbose=1)
        # Ignore warnings from combinations that don't converge
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn")
            search_results[model_name+'_'+x_name] = search.fit(X_train, y_train.ravel())

In [None]:
# Collect best results
best_results = dict()
for name, search in search_results.items():
    cv_results = search.cv_results_
    means, stds, cv_params = cv_results['mean_test_score'], cv_results['std_test_score'], cv_results['params']
    # Sort by mean score (negated MAE), best first; separate names keep the `params` grid intact
    means_stds_params_sorted = sorted(zip(means, stds, cv_params), key=(lambda t: t[0]), reverse=True)
    mean, std, best_params = means_stds_params_sorted[0]
    mean = -mean # Negate scoring
    best_results[name] = mean, std, best_params
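
As an aside, a fitted `GridSearchCV` object also exposes the best configuration directly through `best_score_`, `best_params_` and `best_index_`, which gives the same information as sorting `cv_results_` by hand. A minimal sketch on synthetic data (the data, the coefficients and the small alpha grid are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression problem, unrelated to the test-score data
rng = np.random.default_rng(0)
X_gs_demo = rng.normal(size=(120, 3))
y_gs_demo = X_gs_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=120)

search_demo = GridSearchCV(
    Ridge(),
    param_grid={"alpha": (0.1, 1, 10)},
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search_demo.fit(X_gs_demo, y_gs_demo)

# The score is the negated MAE, so flip the sign to report the MAE
best_mae_demo = -search_demo.best_score_
best_std_demo = search_demo.cv_results_["std_test_score"][search_demo.best_index_]
print(best_mae_demo, best_std_demo, search_demo.best_params_)
```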

In [None]:
best_results = dict(sorted(best_results.items(), key=lambda x: x[1][0]))
print(*best_results.items(), sep='\n')

In [None]:
# Pick the simplest model as baseline
baseline = 'linear_1-dim'
baseline_val = best_results[baseline][0]
plt.axhline(baseline_val, linestyle='--', color='gray', linewidth=2, label=baseline)
plt.legend()
labels=[]
for i, label in enumerate(best_results):
    mae, std = best_results[label][:2]
    plt.errorbar(i, mae, yerr=std, color='C0', marker='o', capsize=3)
    labels.append(label)
plt.xticks(np.arange(len(labels)), labels, rotation=90);
plt.ylabel('Mean absolute error');
plt.title("Model performance");

The winners in this case are the relatively simple models on the 15-dimensional feature space: Ridge and unregularized linear regression perform essentially equally well. The neural network performs on par with them, but since it is a more complex model, it offers no clear benefit for this particular dataset.

# Final evaluation
Linear and ridge regression performed the best. Linear was already evaluated on the test data above (MAE $\approx 2.3$). Below, I evaluate Ridge and expect essentially the same test set performance.

In [None]:
first = next(iter(best_results.keys()))
best_params = best_results[first][-1]
print(first, best_params)

In [None]:
regr = Ridge(random_state=0, **best_params)

In [None]:
X = Xs['15-dim']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
x_scaler = MinMaxScaler().fit(X_train)
X_train = x_scaler.transform(X_train)
X_test = x_scaler.transform(X_test)

In [None]:
# Fit and print train test MAE
# Fit on train set
regr.fit(X_train, y_train)

# Train error
ypred = regr.predict(X_train)
print("Train MAE:", mean_absolute_error(y_train, ypred))

# Test error 
ypred = regr.predict(X_test)
print("Test MAE:", mean_absolute_error(y_test, ypred))

# Summary
- A grid search over linear, tree and neural network models, combined with feature spaces of different dimensionality, was performed to find a model to predict *posttest*.
- Result: linear models, Ridge or unregularized, with a reduced feature space (15-dim) performed the best (test MAE $\approx 2.3$) and are in addition simple and fast to evaluate.