This dataset was obtained from a survey of students in math and Portuguese language courses at a secondary school. It contains social, gender and study information about the students. It can be used for EDA or to predict students' final grades. I used the publication by Paulo Cortez and Alice Silva as input to try out some regression and classification methods.

I will perform some basic EDA, prepare the data, perform some feature selection and try out some regression and classification methods. Finally I will reflect on the central question of Cortez & Silva: can we predict a student's final grade?  

I am beginning my data science journey, so there will be mistakes in this notebook. Constructive criticism and suggestions for improvement are very welcome. I used some code snippets from others and list them in the sources at the end.



In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


For this notebook I will only use the math dataset. Let's start by getting some insight into the data.

In [None]:
data_full = pd.read_csv('../input/student-alcohol-consumption/student-mat.csv')

In [None]:
data_full.head()

In [None]:
data_full.shape

Let's see which columns are categorical and which are numerical.

In [None]:
data_full.dtypes

The heatmap below shows the correlation between the numerical features. The features most strongly correlated with the final grade G3 are G1 (first period grade) and G2 (second period grade). To a lesser extent, other features such as 'failures' and the level of education of the student's mother are also correlated with the exam results. Alcohol consumption does not seem to have much influence in this dataset.



In [None]:
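# Correlations are only defined for numeric columns, so drop the text columns first
corr = data_full.select_dtypes(include='number').corr()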
plt.subplots(figsize=(15,12))
sns.heatmap(corr, vmax=0.9, cmap="Blues", square=True, annot=True)
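
With many features the heatmap can be hard to read, so it may help to list the correlations with G3 as a sorted column. The sketch below uses a tiny made-up frame (`toy`) in place of `data_full`; the values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for data_full (values are made up).
toy = pd.DataFrame({
    "G1": [5, 8, 10, 12, 15, 18],
    "G2": [6, 7, 11, 12, 14, 19],
    "failures": [3, 2, 1, 0, 0, 0],
    "school": ["GP", "GP", "MS", "MS", "GP", "GP"],  # non-numeric, excluded below
    "G3": [5, 8, 10, 13, 15, 19],
})

# Keep numeric columns only, then correlate every feature with the target.
numeric = toy.select_dtypes(include="number")
corr_with_g3 = numeric.corr()["G3"].drop("G3")

# Sort by absolute strength so strong negative correlations are not buried.
order = corr_with_g3.abs().sort_values(ascending=False).index
ranked = corr_with_g3.reindex(order)
print(ranked)
```

On the real data the same three lines after the toy frame should work with `data_full` substituted for `toy`.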



Let's check if there is any missing data.

In [None]:
data_full.isnull().sum()

Next, we will analyse the distribution of the dependent variable, G3 (final grade). It looks roughly normal. However, almost 40 students have a grade of 0, which could be meaningful.

In [None]:
sns.set_style("white")
sns.set_color_codes(palette='deep')
f, ax = plt.subplots(figsize=(8, 7))
# Plot the distribution of the final grade
sns.histplot(data_full['G3'], color="b");
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="final grade")
ax.set(title="Final grade distribution")
sns.despine(trim=True, left=True)
plt.show()
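
The size of the spike at zero can be checked directly rather than read off the plot. A minimal sketch, using a short made-up series (`grades`) in place of `data_full['G3']`:

```python
import pandas as pd

# Hypothetical grades standing in for data_full['G3'] (values are made up).
grades = pd.Series([0, 0, 8, 10, 11, 12, 14, 15, 0, 9], name="G3")

n_zero = int((grades == 0).sum())               # students with a final grade of 0
share_zero = n_zero / len(grades)               # the same, as a fraction of all students
mean_without_zero = grades[grades > 0].mean()   # average of the non-zero grades

print(f"zero grades: {n_zero} ({share_zero:.0%})")
print(f"mean of non-zero grades: {mean_without_zero:.2f}")
```

Comparing the mean with and without the zeros gives a first impression of how much these students pull the distribution down.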

Now let's split the dataset into the dependent variable (G3) and the independent variables. I think we can use this dataset to test both regression and classification algorithms. I will use some methods to predict students' scores and also a few algorithms to predict whether a student will pass or fail the final exam. Therefore, I will prepare a binary version of the y-values for classification.

In [None]:
y = data_full['G3']

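# Binary target for classification: 1 = pass (grade of 10 or more), 0 = fail.
# Built from y without overwriting data_full['G3'], so the cell can be re-run safely.
y_bool = (y >= 10).astype(int)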

data = data_full.drop(['G3'], axis=1)
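
Before training classifiers it helps to know how the pass/fail classes are balanced, because a model can reach a high accuracy simply by always predicting the larger class. A sketch on made-up grades (`g3` is hypothetical), using the same threshold of 10:

```python
import pandas as pd

# Hypothetical final grades (made up); the notebook uses data_full['G3'].
g3 = pd.Series([0, 4, 9, 10, 11, 12, 13, 15, 16, 18])

# Same rule as in the notebook: a grade of 10 or more counts as a pass (1).
passed = (g3 >= 10).astype(int)

# Share of each class. The largest share equals the accuracy of a model
# that always predicts the majority class.
class_share = passed.value_counts(normalize=True)
majority_baseline = class_share.max()

print(class_share)
print("majority-class accuracy:", majority_baseline)
```

Any classifier trained later should beat this number to be considered useful.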



In [None]:
# One-hot encode the data 
data = pd.get_dummies(data).reset_index(drop=True)
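
To make clear what the one-hot encoding step does, here is a small sketch on a made-up frame (`mini` is hypothetical and only borrows two column names from the dataset):

```python
import pandas as pd

# Hypothetical mini-frame with one categorical and one numeric column.
mini = pd.DataFrame({"sex": ["F", "M", "F"], "age": [15, 16, 17]})

encoded = pd.get_dummies(mini)

# Numeric columns pass through unchanged; each category gets its own indicator column.
print(list(encoded.columns))
print(encoded)
```

For binary categories the two indicator columns carry the same information. `pd.get_dummies` has a `drop_first=True` option that drops one of them, which can be worth trying for linear models.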


As we have seen in the heatmap, not all features seem to be relevant, especially since G1 and G2 are much more strongly correlated with G3 than the other variables.
We can use Lasso to select the most relevant variables.

In [None]:
# Select features with lasso

feature_sel_model = SelectFromModel(Lasso(alpha=0.03, random_state=0)) 
feature_sel_model.fit(data, y)
feature_sel_model.get_support()

selected_feat = data.columns[(feature_sel_model.get_support())]

print('total features: {}'.format((data.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
print('features with coefficients shrunk to zero: {}'.format(
    np.sum(feature_sel_model.estimator_.coef_ == 0)))
data = data[selected_feat]
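
Lasso adds an L1 penalty that pushes the coefficients of weak features to exactly zero, and `SelectFromModel` keeps the features whose coefficient survives. The sketch below illustrates this on synthetic data where the answer is known in advance; the column names and the value `alpha=0.1` are assumptions chosen for the illustration, not values from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data (made up): the target depends on x1 and x2 only.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)),
                 columns=["x1", "x2", "noise_a", "noise_b", "noise_c"])
target = 3.0 * X["x1"] - 2.0 * X["x2"] + rng.normal(scale=0.1, size=200)

selector = SelectFromModel(Lasso(alpha=0.1, random_state=0))
selector.fit(X, target)

# Pair each column with its fitted coefficient and with the keep/drop decision.
coefs = pd.Series(selector.estimator_.coef_, index=X.columns)
kept = X.columns[selector.get_support()]

print(coefs.round(3))
print("kept:", list(kept))
```

Printing the coefficients next to the column names, as done here with `coefs`, is also a quick way to see which features the notebook's selection keeps and how strongly each one is weighted.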

In [None]:
data.head()

I will use the four regression methods that are mentioned in the paper by Paulo Cortez and Alice Silva. These are Decision Trees, Random Forest, Neural Network and Linear Regression. I add XGBoost, as it was used in Kaggle's Intermediate Machine Learning course.

In [None]:
# Regression

X_train, X_valid, Y_train, Y_valid = train_test_split(data, y, test_size=0.2) 

DT_model = tree.DecisionTreeRegressor()
RF_model = RandomForestRegressor(n_estimators=500, random_state=0) # value of n_estimators based on Cortez & Silva 
LinR_model = LinearRegression()
NN_model = MLPRegressor(random_state=0, hidden_layer_sizes=(393, 395, 395), max_iter=1500, early_stopping=True)
# Cortez & Silva use 1 layer and 100 epochs of the BFGS algorithm, but this does not perform well in my case
XGB_model = XGBRegressor(n_estimators=100, learning_rate=0.08)

DT_model.fit(X_train, Y_train)
RF_model.fit(X_train, Y_train)
LinR_model.fit(X_train, Y_train)
NN_model.fit(X_train, Y_train)
XGB_model.fit(X_train, Y_train)

# Note: .score() returns R^2, here computed on the training data;
# the MAE is computed on the validation data.
DT_prediction = DT_model.predict(X_valid)
print("Decision Tree R^2 (train): ", DT_model.score(X_train, Y_train))
mae_DT_prediction = mean_absolute_error(Y_valid, DT_prediction)
print("Mean Absolute Error (validation): ", mae_DT_prediction)

RF_prediction = RF_model.predict(X_valid)
print("Random Forest R^2 (train): ", RF_model.score(X_train, Y_train))
mae_RF_prediction = mean_absolute_error(Y_valid, RF_prediction)
print("Mean Absolute Error (validation): ", mae_RF_prediction)

LinR_prediction = LinR_model.predict(X_valid)
print("Linear Regression R^2 (train): ", LinR_model.score(X_train, Y_train))
mae_LinR_prediction = mean_absolute_error(Y_valid, LinR_prediction)
print("Mean Absolute Error (validation): ", mae_LinR_prediction)

NN_prediction = NN_model.predict(X_valid)
print("Neural Network R^2 (train): ", NN_model.score(X_train, Y_train))
mae_NN_prediction = mean_absolute_error(Y_valid, NN_prediction)
print("Mean Absolute Error (validation): ", mae_NN_prediction)

XGB_prediction = XGB_model.predict(X_valid)
print("XGBoost R^2 (train): ", XGB_model.score(X_train, Y_train))
mae_XGB_prediction = mean_absolute_error(Y_valid, XGB_prediction)
print("Mean Absolute Error (validation): ", mae_XGB_prediction)
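
One thing to keep in mind when reading these numbers: `model.score(X_train, Y_train)` is the R² on the training data, and a fully grown decision tree will usually score close to 1.0 there regardless of how well it generalises. The sketch below, on synthetic data with made-up names, compares training and validation R² and adds a mean-predicting baseline (`DummyRegressor`) as a reference point. It is an illustration of the idea, not a replacement for the cell above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (made up): one informative feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 10 + 3 * X[:, 0] + rng.normal(scale=1.0, size=300)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "decision tree": DecisionTreeRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "r2_train": model.score(X_train, y_train),
        "r2_valid": model.score(X_valid, y_valid),
        "mae_valid": mean_absolute_error(y_valid, model.predict(X_valid)),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

A large gap between `r2_train` and `r2_valid` points to overfitting, and a validation MAE close to the baseline's means the model has learned little.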


Now let's see if we can use some classification algorithms to predict failure or success.

In [None]:
# Classification

X_train, X_valid, Y_train, Y_valid = train_test_split(data, y_bool, test_size=0.2) 

LR_model = LogisticRegression(max_iter=500)
SVC_model = SVC()
KNN_model = KNeighborsClassifier(n_neighbors=5)

LR_model.fit(X_train, Y_train)
SVC_model.fit(X_train, Y_train)
KNN_model.fit(X_train, Y_train)

LR_prediction = LR_model.predict(X_valid)
SVC_prediction = SVC_model.predict(X_valid)
KNN_prediction = KNN_model.predict(X_valid)

# print classification report
# classification_report expects the true labels first, then the predictions
print('LR_prediction \n', classification_report(Y_valid, LR_prediction))
print('SVC_prediction \n', classification_report(Y_valid, SVC_prediction))
print('KNN_prediction \n', classification_report(Y_valid, KNN_prediction))
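
Because the train/validation split is random and the validation set is small, these scores can shift from run to run. Cross-validation averages over several splits and gives a more stable estimate. A minimal sketch on synthetic data (the data and the choice of 5 folds are assumptions made for the illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary data (made up): the label depends mainly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# 5-fold cross-validation: five accuracy scores, one per held-out fold.
scores = cross_val_score(LogisticRegression(max_iter=500), X, y,
                         cv=5, scoring="accuracy")

print("fold accuracies:", scores.round(3))
print("mean +/- std: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

In the notebook the same call could be tried with `data` and `y_bool` in place of `X` and `y`.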

The algorithms seem to work well. The neural network is outperformed by the others. Cortez & Silva explain that this is due to the presence of irrelevant data. In my case it could also be a lack of understanding of how to tune MLPRegressor correctly.

As pointed out by Cortez & Silva, student achievement is highly influenced by previous performance, in particular by the variables G2 and G1. However, I would argue that this does not explain well what makes a student successful or not. It is obvious that a successful student will pass all the exams, but what makes this person stand out? I would argue that exams are results, not inputs. Passing the G1, G2 and G3 exams is the result of other factors. Otherwise, no additional study effort would be necessary after the first exam. Does this dataset provide some insight into other variables that could predict a student's final mark? To find out, let's drop the exam results, focus on the other features, run the same pipeline and see whether the remaining data can predict the final exam outcomes.


In [None]:
data_clean = data_full.drop(['G3', 'G2', 'G1'], axis=1)

In [None]:
# One-hot encode the data 
data_clean = pd.get_dummies(data_clean).reset_index(drop=True)

In [None]:
# Select features with lasso

feature_sel_model = SelectFromModel(Lasso(alpha=0.01, random_state=0)) 
feature_sel_model.fit(data_clean, y)
feature_sel_model.get_support()

selected_feat = data_clean.columns[(feature_sel_model.get_support())]

print('total features: {}'.format((data_clean.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
print('features with coefficients shrunk to zero: {}'.format(
    np.sum(feature_sel_model.estimator_.coef_ == 0)))
data = data_clean[selected_feat]

This time Lasso uses a lower alpha (0.01 instead of 0.03), so it keeps a somewhat larger set of features.

In [None]:
data.head()

In [None]:
# Regression

# Train on the Lasso-selected features, as in the first regression run
X_train, X_valid, Y_train, Y_valid = train_test_split(data, y, test_size=0.2)

DT_model = tree.DecisionTreeRegressor()
RF_model = RandomForestRegressor(n_estimators=500, random_state=0) 
LinR_model = LinearRegression()
NN_model = MLPRegressor(random_state=0, hidden_layer_sizes=(395, 395, 395), max_iter=1500, early_stopping=True)
XGB_model = XGBRegressor(n_estimators=100, learning_rate=0.05)

DT_model.fit(X_train, Y_train)
RF_model.fit(X_train, Y_train)
LinR_model.fit(X_train, Y_train)
NN_model.fit(X_train, Y_train)
XGB_model.fit(X_train, Y_train)

# Note: .score() returns R^2, here computed on the training data;
# the MAE is computed on the validation data.
DT_prediction = DT_model.predict(X_valid)
print("Decision Tree R^2 (train): ", DT_model.score(X_train, Y_train))
mae_DT_prediction = mean_absolute_error(Y_valid, DT_prediction)
print("Mean Absolute Error (validation): ", mae_DT_prediction)

RF_prediction = RF_model.predict(X_valid)
print("Random Forest R^2 (train): ", RF_model.score(X_train, Y_train))
mae_RF_prediction = mean_absolute_error(Y_valid, RF_prediction)
print("Mean Absolute Error (validation): ", mae_RF_prediction)

LinR_prediction = LinR_model.predict(X_valid)
print("Linear Regression R^2 (train): ", LinR_model.score(X_train, Y_train))
mae_LinR_prediction = mean_absolute_error(Y_valid, LinR_prediction)
print("Mean Absolute Error (validation): ", mae_LinR_prediction)

NN_prediction = NN_model.predict(X_valid)
print("Neural Network R^2 (train): ", NN_model.score(X_train, Y_train))
mae_NN_prediction = mean_absolute_error(Y_valid, NN_prediction)
print("Mean Absolute Error (validation): ", mae_NN_prediction)

XGB_prediction = XGB_model.predict(X_valid)
print("XGBoost R^2 (train): ", XGB_model.score(X_train, Y_train))
mae_XGB_prediction = mean_absolute_error(Y_valid, XGB_prediction)
print("Mean Absolute Error (validation): ", mae_XGB_prediction)


In [None]:
# Classification

# Train on the Lasso-selected features, as in the first classification run
X_train, X_valid, Y_train, Y_valid = train_test_split(data, y_bool, test_size=0.2)

LR_model = LogisticRegression(max_iter=500)
SVC_model = SVC()
KNN_model = KNeighborsClassifier(n_neighbors=5)

LR_model.fit(X_train, Y_train)
SVC_model.fit(X_train, Y_train)
KNN_model.fit(X_train, Y_train)

LR_prediction = LR_model.predict(X_valid)
SVC_prediction = SVC_model.predict(X_valid)
KNN_prediction = KNN_model.predict(X_valid)

# print classification report
# classification_report expects the true labels first, then the predictions
print('LR_prediction \n', classification_report(Y_valid, LR_prediction))
print('SVC_prediction \n', classification_report(Y_valid, SVC_prediction))
print('KNN_prediction \n', classification_report(Y_valid, KNN_prediction))
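
A confusion matrix gives the raw counts behind a classification report and makes it easy to spot a model that mostly predicts one class. A small sketch with made-up labels (`y_true` and `y_pred` are hypothetical, not output from this notebook):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels (made up): 1 = pass, 0 = fail.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]  # a model that leans heavily towards "pass"

# Rows are true classes, columns are predicted classes (order: 0, then 1).
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

recall_fail = recall_score(y_true, y_pred, pos_label=0)

print(cm)
print("precision (pass):", precision_score(y_true, y_pred))
print("recall (pass):   ", recall_score(y_true, y_pred))
print("recall (fail):   ", recall_fail)
```

In this toy case the recall for "pass" is perfect while most failing students are missed, which is the pattern to look for in the reports above.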


Without the previous exam results, the algorithms struggle to predict well. In the case of classification, the predictions are strongly imbalanced: the models predominantly predict that students will pass the exam (especially the SVM). The regression algorithms show the same problem. Here, the MAE is much worse than with G1 and G2 included, making this dataset less usable for predictions.
It seems to me that the dataset does not help very much to explain what makes a student successful or not. To predict student performance better, other variables probably need to be considered, such as the quality of the school, the teachers and learning materials used, the motivation and attitude of the student, their IQ, the composition of the class and socio-economic factors.



Sources:

USING DATA MINING TO PREDICT SECONDARY SCHOOL STUDENT PERFORMANCE, Paulo Cortez and Alice Silva
http://www3.dsi.uminho.pt/pcortez/student.pdf

Lasso: https://github.com/krishnaik06/Advanced-House-Price-Prediction-

Kaggle Intermediate Machine Learning Course: https://www.kaggle.com/learn/intermediate-machine-learning

Classification: https://stackabuse.com/overview-of-classification-methods-in-python-with-scikit-learn/

Image: https://odinland.vn/a-comprehensive-guide-about-the-education-system-in-portugal/?lang=en
https://www.youtube.com/watch?v=1JXrxCJoHuw&list=PLZoTAELRMXVMcRQwR5_J8k9S7cffVFq_U&index=6