In [8]:
import pandas as pd
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
import numpy as np

In [9]:
X_train = pd.read_csv("sets/X_train.csv")
X_test = pd.read_csv("sets/X_test.csv")
y_train = pd.read_csv("sets/y_train.csv")
y_test = pd.read_csv("sets/y_test.csv")

### Decision trees

Decision trees are a simple ML model that we can use for the regression task of predicting a student's score in an assessment.
Unlike linear regression, a decision tree can capture non-linear relationships between the features and the target, so it should achieve better results. However, it is prone to overfitting.
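
To make both points concrete, here is a minimal sketch on synthetic data (not the student dataset). The quadratic target, the noise level and the `*_demo` names are assumptions chosen purely for illustration. It compares a linear model with a constrained tree, and then with an unconstrained tree to show how overfitting appears.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, purely illustrative data: a quadratic target with Gaussian noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(0, 0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# A straight line cannot follow the parabola
linear = LinearRegression().fit(X_tr, y_tr)

# A tree with a minimum leaf size approximates the curve piecewise
small_tree = DecisionTreeRegressor(min_samples_leaf=15, random_state=0).fit(X_tr, y_tr)

# An unconstrained tree keeps splitting until it memorises the training noise
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

r2_linear = linear.score(X_te, y_te)
r2_small_test = small_tree.score(X_te, y_te)
r2_deep_train = deep_tree.score(X_tr, y_tr)
r2_deep_test = deep_tree.score(X_te, y_te)

print("linear, test R2:", r2_linear)
print("tree (min_samples_leaf=15), test R2:", r2_small_test)
print("tree (unconstrained), train R2:", r2_deep_train)
print("tree (unconstrained), test R2:", r2_deep_test)
```

A large gap between the training and test scores of the unconstrained tree is the overfitting signature that parameters such as `min_samples_leaf` are meant to reduce.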

In [10]:
#using the scikit-learn implementation of the model
from sklearn.tree import DecisionTreeRegressor

min_samples_leaf = 15 #minimum number of samples required in a leaf -> larger values smooth the model
min_samples_split = 10 #minimum number of samples to split an internal node (no effect here: a split already needs at least 2*15 = 30 samples)

decTree = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf, min_samples_split=min_samples_split)

decTree.fit(X_train, y_train)

We evaluate the model using the root mean squared error (RMSE) and the R² score.
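
As a reminder of what the two metrics measure, the sketch below computes them by hand on a tiny made-up example and compares the result with scikit-learn. The arrays `y_true_demo` and `y_hat_demo` are hypothetical values, not data from this notebook. RMSE is the square root of the mean squared residual, and R² is one minus the ratio between the residual sum of squares and the total sum of squares around the mean.

```python
import numpy as np
from sklearn import metrics

# Hypothetical targets and predictions, for illustration only
y_true_demo = np.array([10.0, 20.0, 30.0, 40.0])
y_hat_demo = np.array([12.0, 18.0, 33.0, 37.0])

residuals = y_true_demo - y_hat_demo

# RMSE: square root of the mean squared residual (same unit as the target)
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R2: 1 - SS_res / SS_tot, where SS_tot is the error of always predicting the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Same quantities from scikit-learn
rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true_demo, y_hat_demo))
r2_sklearn = metrics.r2_score(y_true_demo, y_hat_demo)

print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```

An R² of 0 corresponds to a model that is no better than predicting the mean score, and 1 corresponds to perfect predictions.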

In [11]:
y_pred = decTree.predict(X_test)
y_pred_train = decTree.predict(X_train)

#RMSE computed as sqrt(MSE): the squared=False argument was deprecated in scikit-learn 1.4 and removed in later releases
RMSE = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
R2 = metrics.r2_score(y_test, y_pred)

RMSE_train = np.sqrt(metrics.mean_squared_error(y_train, y_pred_train))
R2_train = metrics.r2_score(y_train, y_pred_train)

In [12]:
print(f"RMSE: {RMSE}")
print(f"R2: {R2}")

print(f"RMSE for training: {RMSE_train}")
print(f"R2 for training: {R2_train}")

RMSE: 16.58295192809045
R2: 0.2137573513177421
RMSE for training: 13.194100446708545
R2 for training: 0.5067471305736341


We can see that the model loses accuracy on the test set compared to the training set: R² drops from about 0.51 to about 0.21 and RMSE rises from about 13.2 to about 16.6, which indicates overfitting. However, it still performs better than the linear models, which is consistent with a non-linear relationship between the features and the target.
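
One possible way to reduce this gap is to choose the tree's size parameters by cross-validation instead of fixing them by hand. The sketch below is an illustration on synthetic data, not the procedure used in this notebook. The data, the parameter grid and the `*_cv` names are assumptions; with the real data, `X_train` and `y_train` would take the place of the synthetic arrays.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training set (illustrative only)
rng = np.random.default_rng(0)
X_cv = rng.uniform(-3, 3, size=(400, 2))
y_cv = X_cv[:, 0] ** 2 + rng.normal(0, 1.0, size=400)

# Hypothetical grid: larger leaves and shallower trees give smoother models
param_grid = {
    "min_samples_leaf": [1, 5, 15, 30, 60],
    "max_depth": [3, 5, None],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,  # 5-fold cross-validation on the training data only
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises scores, so RMSE is negated
)
search.fit(X_cv, y_cv)

best_params = search.best_params_
best_cv_rmse = -search.best_score_  # flip the sign back to a positive RMSE

print("best parameters:", best_params)
print("cross-validated RMSE:", best_cv_rmse)
```

Because the search only uses folds of the training data, the test set stays untouched and can still give an unbiased estimate of the final model's error.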