# Student Grade Prediction

This notebook contains data analysis and grade prediction for the Student Performance dataset from the UCI Machine Learning Repository. [Link to the dataset](https://archive.ics.uci.edu/ml/datasets/student+performance)

The dataset's attributes include student grades, demographic, social and school-related features. The data was collected by using school reports and questionnaires.

There are three columns for the grades:
- G1 - first-period grade
- G2 - second-period grade
- G3 - the final grade

All grades are numeric, from 0 to 20.

The goal of the notebook is to predict the final grade (G3). G1 and G2 are highly correlated with the final grade. For this reason, it was decided to follow a suggestion on the Kaggle competition page and predict the final grade without using the period grades. This decision makes the results of this notebook more valuable.


#### This notebook consists of:
- Data load
- Data analysis
- Data preparation
- Regression using classical machine learning
- Impact analysis
- Regression using deep learning

### Data load

The first step of the research was loading the data and dropping the G1 and G2 columns.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

In [None]:
cd ../input/student-grade-prediction/

In [None]:
data = pd.read_csv('student-mat.csv')

In [None]:
data.drop(columns=['G1', 'G2'], inplace=True)

The next step was checking if there is any missing data.

In [None]:
data.isnull().sum(axis=0).sum()

In [None]:
len(data)

The number of samples is relatively low. For this reason, the skewness of the columns was checked, along with their correlation to the target.

Note: to allow the calculation of skewness and correlation to the final grade, the categorical data was label-encoded with LabelEncoder. The appropriate transformation (after investigation) will be performed later.

In [None]:
labeled_data = data.apply(LabelEncoder().fit_transform)
summary = pd.concat([data.dtypes, data.nunique(), labeled_data.skew().abs(), labeled_data.corr()['G3']], axis=1).sort_values(2, ascending=False).head(10)
summary.columns=['type', 'unique_values', 'skewness', 'correlation_to_G3']
summary

Later the average skewness of all the columns was calculated.

In [None]:
labeled_data.skew().abs().mean()

Most of the features are highly skewed. This was expected, as the number of samples is quite low. Therefore, all the features with skewness higher than 2 were inspected.

In [None]:
for column in summary[summary['skewness']>2].index:
    sns.countplot(x=data[column])
    plt.ylabel('Count')
    plt.xlabel(column.capitalize())
    plt.title('Distribution of the {} column'.format(column))
    plt.show()

In [None]:
data['higher'].value_counts()

The higher column indicates whether a student wants to pursue higher education. Since the feature is highly skewed (one option is ten times more common than the other), it will be dropped, as it might lead to false generalization.
The rest of the boolean categories are somewhat better balanced, so no further action was needed.
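
To make the link between class imbalance and skewness concrete, here is a minimal sketch for a 0/1 column. The helper names and the ten-to-one toy column are assumptions for illustration only. The sketch uses the population skewness, whereas `Series.skew()` in pandas applies a bias correction, so values on real data will differ slightly.

```python
import math

def bernoulli_skewness(p):
    """Population skewness of a 0/1 variable whose share of ones is p."""
    return (1 - 2 * p) / math.sqrt(p * (1 - p))

def population_skewness(values):
    """Third standardized moment computed directly from the data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical column: ten "yes" answers for every "no"
answers = [1] * 10 + [0]

print(round(abs(bernoulli_skewness(10 / 11)), 2))   # 2.85
print(round(abs(population_skewness(answers)), 2))  # 2.85
print(bernoulli_skewness(0.5))                      # 0.0 for a balanced column
```

A ten-to-one split therefore already lands above the threshold of 2 used in this notebook, while a balanced column has zero skewness.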

The Dalc (workday alcohol consumption) and failures (number of past class failures) columns have underrepresented options, so these will be grouped into more general options to reduce skewness.

In [None]:
data.drop(['higher'], axis=1, inplace=True)

The number of failures was regrouped into two options to reduce skewness.

In [None]:
sns.countplot(x=data['failures'])
plt.title('Distribution of the number of failed past classes')

In [None]:
data['failures'] = data['failures'].apply(lambda x: 'No' if x == 0 else 'Yes')

In [None]:
sns.countplot(x=data['failures'])
plt.title('Distribution of the number of failed past classes')

In [None]:
sns.countplot(x=data['Dalc'])
plt.title('Distribution of workday alcohol consumption by students')

In [None]:
def regroup_dalc(x):
    if x == 1:
        return 'very low'
    elif x == 2:
        return 'low'
    else:
        return 'considerable'
    
data['Dalc'] = data['Dalc'].apply(regroup_dalc)

In [None]:
sns.countplot(x=data['Dalc'])
plt.title('Distribution of workday alcohol consumption by students')

The final skewness values are as follows:

In [None]:
data[['Dalc', 'failures']].apply(LabelEncoder().fit_transform).skew().abs()

And the age distribution is as follows:

In [None]:
sns.countplot(x=data['age'])
plt.title("Distribution of students' age")

In [None]:
data['age'].value_counts()

Because very few students are older than 19, they were removed from the dataset:

In [None]:
data = data[data['age'] < 20]

### Data analysis

The data analysis started with an investigation of the final grade (target) distribution.

In [None]:
from scipy.stats import norm

plt.figure(figsize=(8, 6))
# histplot replaces the deprecated distplot (requires seaborn >= 0.11)
sns.histplot(x=data['G3'], kde=True, stat='density')
mu, sigma = norm.fit(data['G3'])

plt.xlabel('G3 score')
plt.ylabel('Frequency')
plt.title('Final grade distribution')
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

The plot shows a gap just above the score of 0, which makes the distribution bimodal. The column will be transformed in the data preparation section to improve the performance of the ML models.

The dataset contains data from two schools, so their score distributions were compared.

In [None]:
data['school'].value_counts()

In [None]:
ax = sns.violinplot(x='school', y='G3', data=data)
ax.set_xticklabels(['Gabriel Pereira', 'Mousinho da Silveira'])
plt.title('Comparison of scores distribution between the schools')
plt.xlabel('School')
plt.ylabel('Final grade')

The number of students in the two schools differs considerably, but the score distributions look similar. Next, the scores were compared by student sex.

In [None]:
sns.swarmplot(x='sex', y='G3', data=data)
plt.xlabel('Sex')
plt.ylabel('Final grade')
plt.title('Comparison between male and female students')

In [None]:
data['sex'].value_counts()

Women tend to have better grades and have a lower number of 0 scores.

The next part of the study was a comparison of the scores as a function of student age.

In [None]:
sns.violinplot(x='age', y='G3', data=data)
plt.xlabel('Age')
plt.ylabel('Final grade')
plt.title('Comparison of final grades by age')

In [None]:
data['age'].value_counts()

The number of students, mean score, and dispersion of scores all decrease with age.

The next step was the analysis of the three attributes most correlated with the G3 score.

In [None]:
labeled_data = data.apply(LabelEncoder().fit_transform)
summary = pd.concat([labeled_data.corr()['G3']], axis=1).sort_values('G3', ascending=False).head(4)
summary.columns=['Correlation to G3']
summary

The Medu and Fedu columns contain information about the education of the student's mother and father, respectively, whereas the reason column contains the reason for selecting a particular school.

In [None]:
sns.swarmplot(x='Medu', y='G3', data=data)
plt.xlabel("Mother's education")
plt.ylabel('Final grade')
plt.title("Impact of mother's education on the student's final grade")

In [None]:
data_group_Medu = data.groupby('Medu')['G3']
data_Medu = pd.DataFrame([data_group_Medu.count(), data_group_Medu.mean()])
data_Medu = data_Medu.T
data_Medu.columns = ['Count', 'Average']
data_Medu

In [None]:
sns.swarmplot(x='Fedu', y='G3', data=data)
plt.xlabel("Father's education")
plt.ylabel('Final grade')
plt.title("Impact of father's education on the student's final grade")

In [None]:
data_group_Fedu = data.groupby('Fedu')['G3']
data_Fedu = pd.DataFrame([data_group_Fedu.count(), data_group_Fedu.mean()])
data_Fedu = data_Fedu.T
data_Fedu.columns = ['Count', 'Average']
data_Fedu

Student grades clearly increase with the parents' education. The high average for education level 0 can be ignored, as the number of samples is low.

In [None]:
# The figure has to be created before plotting for figsize to take effect
plt.figure(figsize=(10, 8))
ax = sns.swarmplot(x='reason', y='G3', data=data, order=['course', 'home', 'reputation', 'other'])
ax.set_xticklabels(['Courses', 'Close to home', "School's rep.", 'Other'])
plt.xlabel("Reason for choosing the school")
plt.ylabel('Final grade')
plt.title("The student's final grade as a function of the reason for selecting a particular school")

In [None]:
data_group_reason = data.groupby('reason')['G3']
data_reason = pd.DataFrame([data_group_reason.count(), data_group_reason.mean()])
data_reason = data_reason.T
data_reason.columns = ['Count', 'Average']
data_reason

Students who selected the school because it was close to home or because it offered a preferred course have lower scores.

The prepared correlation matrix is shown below.

In [None]:
labeled_data = data.apply(LabelEncoder().fit_transform)
plt.figure(figsize=(14, 10))
sns.heatmap(labeled_data.corr().abs(), vmax=0.4)

The heatmap legend has the upper boundary set to 0.4 to increase readability.

The heatmap shows a high correlation between features: 
- mother's and father's education and job
- alcohol consumption (both workday and weekend) and amount of going out with friends
- address (urban or rural) and travel time
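
The pairs read off a heatmap can also be ranked numerically. The sketch below is one possible way to do that; `top_correlated_pairs` is a hypothetical helper and the toy frame stands in for the label-encoded data.

```python
from itertools import combinations

import pandas as pd

def top_correlated_pairs(frame, n=3):
    """Return the n column pairs with the highest absolute Pearson correlation."""
    pairs = {
        (a, b): abs(frame[a].corr(frame[b]))
        for a, b in combinations(frame.columns, 2)
    }
    return sorted(pairs.items(), key=lambda item: item[1], reverse=True)[:n]

# Toy data (made up): 'b' is an exact multiple of 'a', 'c' is only loosely related
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': [2, 4, 6, 8, 10, 12],
    'c': [3, 1, 4, 1, 5, 9],
})

for (left, right), value in top_correlated_pairs(toy, n=3):
    print(left, right, round(value, 2))
```

Each pair is evaluated once, so the diagonal and the mirrored half of the matrix do not appear in the ranking.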

### Data preparation for ML models

First, it was checked whether all numerical attributes are ordered.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
numerical_columns = data.dtypes[data.dtypes != 'object'].index
for column in numerical_columns.drop('G3'):
    sns.countplot(x=data[column])
    plt.show()

All these columns are ordered (double-checked with the dataset description).

The distances between the values might not be exactly equal for the fields where students selected values from 1 to 5. However, given the number of samples, it was decided to keep these values numeric.

Next, it was checked whether the object features are ordered. Unordered features should be one-hot encoded.

In [None]:
object_columns = data.dtypes[data.dtypes == 'object'].index
for column in object_columns:
    sns.countplot(x=data[column])
    plt.show()

None of these columns except Dalc is ordered. It was decided to one-hot encode the Dalc column as well, because the distance between low and considerable is larger than between low and very low.

So all the object columns can be one-hot encoded.

In [None]:
data = pd.get_dummies(data, drop_first=True)

In [None]:
data.head(5)

In the following step, the goal distribution was analyzed. Furthermore, an attempt was made to normalize the data.

In [None]:
plt.figure(figsize=(8, 6))
# histplot replaces the deprecated distplot (requires seaborn >= 0.11)
sns.histplot(x=data['G3'], kde=True, stat='density')
mu, sigma = norm.fit(data['G3'])

plt.xlabel('G3 score')
plt.ylabel('Frequency')
plt.title('Final grade distribution')
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

In [None]:
data['G3'] = np.log1p(data['G3'])

In [None]:
plt.figure(figsize=(8, 6))
# histplot replaces the deprecated distplot (requires seaborn >= 0.11)
sns.histplot(x=data['G3'], kde=True, stat='density')
mu, sigma = norm.fit(data['G3'])

plt.xlabel('G3 score')
plt.ylabel('Frequency')
plt.title('Final grade distribution')
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

The log1p transformation succeeded in bringing the data closer to a normal distribution. However, the gap remains, as a transformation of this kind cannot remove it.
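
As a small illustration of the transform and its inverse, the sketch below uses the standard library equivalents of `np.log1p` and `np.expm1` on a few made-up grades. It shows that predictions made on the transformed scale can be mapped back exactly, and that the empty stretch above 0 survives the transform.

```python
import math

# Hypothetical grades: 0 marks a failed class, 4 is the lowest positive score
grades = [0, 4, 10, 20]

transformed = [math.log1p(g) for g in grades]
restored = [math.expm1(t) for t in transformed]

# log1p is strictly increasing, so the zeros stay separated from
# the first positive grade after the transform.
gap_after = transformed[1] - transformed[0]

print([round(t, 3) for t in transformed])
print([round(r, 6) for r in restored])
print(round(gap_after, 3))
```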

The data was divided into train and test sets:
- 85% - train
- 15% - test

A CV set will be split from the train set when it is needed.

In [None]:
train_X, test_X, train_Y, test_Y = train_test_split(data.drop(['G3'], axis=1), data['G3'], train_size=0.85, shuffle=True, random_state=1000)

Having all features in a similar range can improve the performance of most ML models.

In [None]:
scaler = StandardScaler()
train_X = scaler.fit_transform(train_X)
test_X = scaler.transform(test_X)
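
The reason the scaler is fitted on the train set only can be shown with a hand-rolled version of the same standardization. This is a minimal sketch with made-up values; `fit_scaler` and `apply_scaler` are hypothetical helpers that mirror `fit` and `transform`.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(column), statistics.pstdev(column)

def apply_scaler(column, mean, std):
    """Standardize values with statistics learned elsewhere."""
    return [(v - mean) / std for v in column]

train_ages = [15, 16, 17, 18]  # hypothetical training values
test_ages = [16, 19]           # hypothetical held-out values

mean, std = fit_scaler(train_ages)
train_scaled = apply_scaler(train_ages, mean, std)
# The test set reuses the training statistics, so no information
# about the held-out samples leaks into the preprocessing step.
test_scaled = apply_scaler(test_ages, mean, std)

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

The scaled test values need not have zero mean or unit variance, which is expected.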

### Regression using classical machine learning

This chapter starts with a simple linear regression before checking more robust models.

In [None]:
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()
linear_model.fit(train_X, train_Y)

Next, the scorer was created. MSE (mean squared error) was selected because it penalizes large errors more heavily.

In [None]:
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

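# This scorer reports the plain (positive) MSE, so lower values are better.
# The squared argument is omitted: squared error is the default behaviour.
scorer = make_scorer(mean_squared_error, greater_is_better=True)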

In [None]:
scorer(linear_model, test_X, test_Y)
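
To see why MSE punishes large errors more than a linear metric does, compare two made-up sets of prediction errors with the same mean absolute error. The helper names `mae` and `mse` are only for this sketch.

```python
def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    """Mean squared error of a list of residuals."""
    return sum(e ** 2 for e in errors) / len(errors)

even_errors = [2, 2, 2, 2]   # every prediction is off by two points
spiky_errors = [0, 0, 0, 8]  # three exact predictions and one large miss

print(mae(even_errors), mae(spiky_errors))  # 2.0 2.0
print(mse(even_errors), mse(spiky_errors))  # 4.0 16.0
```

Both sets look equally good under MAE, while MSE rates the set with the single large miss four times worse.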

In [None]:
def regression_graph(y_true, y_pred, model_name):
    plt.figure(figsize=(12, 8))
    n = len(y_true)
    ax = sns.scatterplot(x=range(n), y=y_true)
    ax = sns.scatterplot(x=range(n), y=y_pred, marker="s", s=45)
    ax = plt.vlines(range(n), y_true, y_pred, linestyles='dotted')

    plt.legend(title='', loc='upper right', labels=['Actual values', 'Predicted values'])
    plt.xlabel('Number of the sample')
    plt.ylabel('Final score')
    plt.title('Predicted final scores using the {} model'.format(model_name))

In [None]:
regression_graph(np.expm1(test_Y), np.expm1(linear_model.predict(test_X)), 'linear')

The errors are quite high in most cases, but there are some accurate predictions as well. The model avoids predicting 0s because doing so leads to a high penalty when the student's actual score is positive.

The next move was to try more robust models. The best hyperparameter configurations were found using grid search.
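
One detail matters when an error metric drives model selection: scikit-learn's search utilities keep the candidate with the highest score, which is why `make_scorer(..., greater_is_better=False)` and the built-in `'neg_mean_squared_error'` negate the metric. The sketch below mimics that convention in plain Python; the candidate names and MSE values are made up.

```python
def as_score(metric_value, greater_is_better):
    """Mimic the sign convention: errors are negated so that higher is better."""
    return metric_value if greater_is_better else -metric_value

def select(scores):
    """Pick the candidate with the highest score, as a grid search does."""
    return max(scores, key=scores.get)

# Hypothetical cross-validated MSE for three candidate settings
cv_mse = {'alpha=0.1': 0.90, 'alpha=10': 0.55, 'alpha=1000': 0.70}

picked_raw = select({k: as_score(v, True) for k, v in cv_mse.items()})
picked_neg = select({k: as_score(v, False) for k, v in cv_mse.items()})

print(picked_raw)  # alpha=0.1, the candidate with the largest error
print(picked_neg)  # alpha=10, the candidate with the smallest error
```

Maximizing the raw MSE selects the worst candidate, so a positive-MSE scorer is suitable for reporting but not for selection.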

In [None]:
model_scores = pd.DataFrame(columns=['CV score', 'Test score'])

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn import ensemble
from sklearn.linear_model import RidgeCV, ElasticNetCV, LassoCV


# Model selection maximizes the score, so the negated MSE is used to pick alpha
ridge = RidgeCV(alphas = [0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 10, 30, 60, 100, 300, 600, 1000], scoring = 'neg_mean_squared_error')
ridge.fit(train_X, train_Y)
result = cross_val_score(ridge, train_X, train_Y, scoring = scorer, cv = 10, n_jobs=-1).mean()
ridge.fit(train_X, train_Y)
result_test = scorer(ridge, test_X, test_Y)
model_scores.loc['ridge'] = [result, result_test]

lasso = LassoCV(alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1], 
                max_iter = 50000)
lasso.fit(train_X, train_Y)
result = cross_val_score(lasso, train_X, train_Y, scoring = scorer, cv = 10, n_jobs=-1).mean()
lasso.fit(train_X, train_Y)
result_test = scorer(lasso, test_X, test_Y)
model_scores.loc['lasso'] = [result, result_test]


# min_samples_split must be at least 2, so the invalid value 1 was removed from the grid
params_GBR_grid = {'n_estimators': (100, 500), 'max_depth': (2, 4), 'min_samples_split': (2,), 'learning_rate': (0.05,), 'min_samples_leaf': (3, 5, 10), 'max_features': ('sqrt',), 'loss': ('huber',)}
GBR = ensemble.GradientBoostingRegressor()
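# GridSearchCV keeps the highest-scoring configuration, so the negated MSE is used
GBR_GS = GridSearchCV(GBR, params_GBR_grid, scoring='neg_mean_squared_error', n_jobs=-1)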
GBR_GS.fit(train_X, train_Y)
GBR = ensemble.GradientBoostingRegressor(**GBR_GS.best_params_)
GBR.fit(train_X, train_Y)
result = cross_val_score(GBR, train_X, train_Y, scoring = scorer, cv = 10, n_jobs=-1).mean()
result_test = scorer(GBR, test_X, test_Y)
model_scores.loc['GBR'] = [result, result_test]



elastic_net = ElasticNetCV(l1_ratio = [0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1],
                          alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 
                                    0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6], 
                          max_iter = 50000, cv = 10)
elastic_net.fit(train_X, train_Y)
result = cross_val_score(elastic_net, train_X, train_Y, scoring = scorer, cv = 10, n_jobs=-1).mean()
elastic_net.fit(train_X, train_Y)
result_test = scorer(elastic_net, test_X, test_Y)
model_scores.loc['elastic net'] = [result, result_test]

In [None]:
GBR_GS.best_params_

In [None]:
GBR.get_params()

In [None]:
model_scores

The GBR model gave the best predictions on the test set. The rest of the models generalize only slightly better than the linear regression model.

Regression graphs for all of the models are presented below.

In [None]:
for name, model in {'Ridge': ridge, 'Lasso': lasso, 'Gradient Boosting Regressor': GBR, 'Elastic Net': elastic_net}.items():
    regression_graph(np.expm1(test_Y), np.expm1(model.predict(test_X)), name)

The GBR model avoids the largest errors, which are heavily penalized by the MSE loss function. It predicted multiple samples correctly, but at the same time many of its errors are still large.

The ridge model has many accurate predictions, but it also has a few predictions that are far from the true values, which worsens its test score.

The ridge, lasso and elastic net models generally predict scores in a narrow range around the average grade. This is expected, as these models include parameter regularization.
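
The narrowing effect of regularization can be reproduced with the closed-form ridge solution for a single feature. This is a sketch on made-up data, with the intercept left unpenalized; `ridge_slope` is a hypothetical helper.

```python
def ridge_slope(x, y, alpha):
    """Closed-form ridge slope for one feature with an unpenalized intercept."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    return sxy / (sxx + alpha)

def prediction_spread(x, y, alpha):
    """Distance between the largest and smallest prediction on x."""
    slope = ridge_slope(x, y, alpha)
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    predictions = [y_mean + slope * (a - x_mean) for a in x]
    return max(predictions) - min(predictions)

x = [1, 2, 3, 4, 5]       # hypothetical feature values
y = [8, 9, 12, 13, 15]    # hypothetical grades

for alpha in (0, 10, 100):
    print(alpha, round(ridge_slope(x, y, alpha), 3), round(prediction_spread(x, y, alpha), 3))
```

As alpha grows the slope shrinks towards zero, so every prediction moves closer to the mean grade.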

### Feature importance

The linear model was used to visualize feature importance, as its coefficients are easier to retrieve than those of the GBR model.

In [None]:
# Fetching the coefficients and their names, and rescaling to percent
importance = pd.Series(linear_model.coef_)
importance = importance / importance.abs().sum() * 100
graph_data = pd.concat([pd.Series(data.drop(['G3'], axis=1).columns), importance], axis=1)
# Features ranked by absolute impact (last expression, so the cell displays it)
pd.concat([pd.Series(data.drop(['G3'], axis=1).columns), importance.abs()], axis=1).sort_values(1, ascending=False)

In [None]:
# Ordering data
graph_data = graph_data.sort_values(1, ascending=True)
graph_data = graph_data.reset_index()
graph_data[0] = pd.Categorical(graph_data[0])

In [None]:
# Plotting the graph
fig = plt.figure(figsize=(12,14), dpi= 100)
ax = fig.add_subplot(111)
ax.yaxis.tick_right()
plt.hlines(y=graph_data.index, xmin=0, xmax=graph_data[1], color='black', alpha=.8, linewidth=.8)
plt.scatter(graph_data[1], graph_data.index, color='black')
plt.yticks(graph_data.index, graph_data[0])
plt.xticks(fontsize=12)
plt.xlabel('Impact [%]', fontsize=12)

# Decorate
plt.title('Impact of the features on the final grade', fontdict={'size':20})
plt.grid(linestyle='dotted', alpha=0.5)
plt.show()
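
The percent rescaling used for this chart can be checked on a handful of made-up coefficients. The feature names and values are placeholders, and `to_percent_impact` is a hypothetical helper.

```python
def to_percent_impact(coefficients):
    """Rescale coefficients so that their absolute values sum to 100."""
    total = sum(abs(c) for c in coefficients)
    return [c / total * 100 for c in coefficients]

# Placeholder coefficients of a linear model fitted on standardized features
coefficients = {'feature_a': 0.06, 'feature_b': -0.12, 'feature_c': -0.02}

impacts = dict(zip(coefficients, to_percent_impact(list(coefficients.values()))))
for name, impact in impacts.items():
    print(name, round(impact, 1))
```

The sign is kept, so a negative impact marks a feature that lowers the predicted grade. The comparison between features is meaningful only because all inputs were standardized before fitting.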

### Regression using deep learning

The next step was to check whether a neural network can beat the GBR model's score. Due to the small number of samples, different NN configurations were compared using the test dataset. This is not best practice, but splitting a CV set from the train data led to results that varied strongly with the randomness of the split.

Because neural networks have more configuration options than the previous models, the grid search was performed in multiple steps. The search consisted of finding the optimum:
- Number of layers
- Number of nodes in the layers
- Dropout ratio
- Optimizer
- Activation function


The last iteration of the grid search can be found below:

In [None]:
import tensorflow.keras as tf
from sklearn.model_selection import ParameterGrid

In [None]:
def create_deep_model(first_layer=50, second_layer=0, third_layer=0, dropout=0.1, optimizer='Adamax', activation='relu'):
    n = len(train_X[0])
    
    if second_layer > 0:
        deep_model = tf.models.Sequential([
        tf.layers.Dense(first_layer, input_shape=(n,), activation=activation),
        tf.layers.Dropout(dropout),
        tf.layers.Dense(second_layer, activation=activation),
        tf.layers.Dropout(dropout),
        tf.layers.Dense(1, activation=None)
        ])
    else:
        deep_model = tf.models.Sequential([
        tf.layers.Dense(first_layer, input_shape=(n,), activation=activation),
        tf.layers.Dropout(dropout),
        tf.layers.Dense(1, activation=None)
        ])
    
    deep_model.compile(optimizer=optimizer, loss='mse')
    callbacks = [tf.callbacks.EarlyStopping(patience=4)]
    iter_history = deep_model.fit(train_X, train_Y, validation_split=0.1, epochs=100, verbose=0, callbacks=callbacks)
    # MSE on the test set, computed directly from the network's predictions
    val_score = mean_squared_error(test_Y, deep_model.predict(test_X)[:, 0])
    
    
    return deep_model, val_score

In [None]:
deep_results = pd.DataFrame(columns=['Test score', 'Params'])

params = dict(first_layer=[30, 20, 10, 5], second_layer=[20, 15, 10, 5, 0], third_layer=[0, ], dropout=[0.0, 0.1], 
              optimizer=['RMSprop', 'Adamax', 'Adam'], activation=['relu', 'selu', 'elu'])

deep_params_grid = ParameterGrid(params)


for param in deep_params_grid:
    _, result = create_deep_model(**param)
    deep_results.loc[len(deep_results)] = [result, param]

pd.set_option('display.max_colwidth', None)
deep_results.sort_values('Test score').head(10)
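
The size of this last search follows directly from the grid definition. The sketch below expands the same dictionary with `itertools.product`, which is roughly what `ParameterGrid` does internally, and counts the configurations.

```python
from itertools import product

params = dict(
    first_layer=[30, 20, 10, 5],
    second_layer=[20, 15, 10, 5, 0],
    third_layer=[0],
    dropout=[0.0, 0.1],
    optimizer=['RMSprop', 'Adamax', 'Adam'],
    activation=['relu', 'selu', 'elu'],
)

# One dictionary per combination of values, with keys in sorted order
names = sorted(params)
grid = [dict(zip(names, values)) for values in product(*(params[name] for name in names))]

print(len(grid))  # 360
print(grid[0])
```

Each of the 360 configurations trains one network, which is why the search was split into several smaller steps.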

In [None]:
best_parameters_deep = deep_results[deep_results['Test score']==deep_results['Test score'].min()]['Params'].values[0]

deep_model, res = create_deep_model(**best_parameters_deep)
print(res)

regression_graph(np.expm1(test_Y), np.expm1(deep_model.predict(test_X)[:, 0]), 'deep learning')

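The small number of samples caused the neural networks to produce different results each time a model was trained. The variation comes from the random weight initialization, the shuffling of samples in each epoch of training, and dropout. The validation subset itself is fixed, as Keras takes `validation_split` from the last samples of the training data. As a result, a different test score is obtained each time the model is trained. This randomness makes the NN model less reliable than the GBR model.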
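
One way to quantify this run-to-run variation is to repeat the training with several seeds and report the mean and standard deviation of the test score. The sketch below only simulates the idea: `noisy_test_score` is a stand-in that returns made-up numbers, and in a real notebook it would seed the framework and train an actual network.

```python
import random
import statistics

def noisy_test_score(seed):
    """Stand-in for one training run; returns a made-up MSE that depends on the seed."""
    rng = random.Random(seed)
    return 0.6 + rng.gauss(0, 0.05)

def summarize_runs(train_fn, seeds):
    """Run train_fn once per seed and return the mean and standard deviation."""
    scores = [train_fn(seed) for seed in seeds]
    return statistics.fmean(scores), statistics.stdev(scores)

mean_score, score_std = summarize_runs(noisy_test_score, range(10))
print(round(mean_score, 3), round(score_std, 3))
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two configurations is larger than the noise.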

#### Summary:
Taking into account:
- the type of data (most of the available data has only an indirect impact on the final score)
- removal of the periodic grades data
- the small number of samples

the regression results are satisfactory.


The models generally tend to avoid predicting that a student fails the class, as predicting a zero score for a student who actually passed incurs a high penalty. The next step could be to build the gap between scores of 0 and 4 into the model architecture to produce better results.


Collecting more data, especially for the students with the less represented selections, would improve model and regression performance.