# Prototype Score-Predicting Model Based on Portuguese Secondary Student Performance Data

This notebook contains the code for a prototype ML-based score-predicting feature, designed for the main product of Focus Project as part of my internship. The dataset includes attributes such as past student grades and demographic, social, and school-related factors. After some basic visualizations, data pre-processing, and feature engineering, the data is modelled as a binary classification task (pass/fail), a five-level classification task (Erasmus grade), and a regression task (final grade G3). Some models' hyperparameters were fine-tuned to optimize performance, whereas other models that already gave competitive results were left untuned to save computation time. Towards the end of the notebook, SMOTE (Synthetic Minority Oversampling Technique) is applied to tackle class imbalance, yielding more competitive results.

I am still learning every day and am always open to new ideas that can help me improve my code. If you have any feedback, queries, or concerns regarding this notebook, please feel free to email me at aryanmsr@gmail.com.

- Link to dataset: https://archive.ics.uci.edu/ml/datasets/Student+Performance
- Link to the original research paper on which this notebook is based: http://www3.dsi.uminho.pt/pcortez/student.pdf

## Importing the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Importing the dataset

In [None]:
dataset = pd.read_csv("/kaggle/input/student-performance-data-set/student-por.csv")

In [None]:
dataset.head()

In [None]:
dataset.tail()

## Creating Modified Labels 

In [None]:
grades_pass_fail = []
for index, row in dataset.iterrows():
    if row['G3'] >= 10:
        grades_pass_fail.append(1) #pass
    else:
        grades_pass_fail.append(0) #fail
        
grades_pass_fail_series = pd.Series(grades_pass_fail)
dataset["Pass/Fail"] = grades_pass_fail_series
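
The row-by-row loop above can also be expressed as a single vectorized comparison. The following is a minimal sketch on a small made-up frame (`toy` and its grades are hypothetical, not taken from the dataset); it assumes the same pass threshold of 10 on `G3`.

```python
import pandas as pd

# Hypothetical toy grades; the notebook itself works on dataset['G3'].
toy = pd.DataFrame({"G3": [0, 9, 10, 15, 20]})

# The comparison yields True/False, which astype(int) turns into 1 (pass) / 0 (fail).
toy["Pass/Fail"] = (toy["G3"] >= 10).astype(int)

print(toy)
```

On a dataset of this size the speed difference is negligible; the main benefit is that the rule sits on one line.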

In [None]:
grades_erasmus_label_encoded = []
for index, row in dataset.iterrows():
    if row['G3'] >= 16:
        grades_erasmus_label_encoded.append(1) #A
    elif row['G3'] == 15 or row['G3'] == 14:
        grades_erasmus_label_encoded.append(2) #B
    elif row['G3'] == 12 or row['G3'] == 13:
        grades_erasmus_label_encoded.append(3) #C
    elif row['G3'] == 10 or row['G3'] == 11:
        grades_erasmus_label_encoded.append(4) #D
    elif row['G3'] <= 9:
        grades_erasmus_label_encoded.append(5) #F   
          
grades_erasmus_label_encoded_series = pd.Series(grades_erasmus_label_encoded)
dataset["Erasmus Grade Label Encoded"] = grades_erasmus_label_encoded_series

In [None]:
grades_erasmus = []
for index, row in dataset.iterrows():
    if row['G3'] >= 16:
        grades_erasmus.append('A') 
    elif row['G3'] == 15 or row['G3'] == 14:
        grades_erasmus.append('B')
    elif row['G3'] == 12 or row['G3'] == 13:
        grades_erasmus.append('C')
    elif row['G3'] == 10 or row['G3'] == 11:
        grades_erasmus.append('D')
    elif row['G3'] <= 9:
        grades_erasmus.append('F')    
          
grades_erasmus_series = pd.Series(grades_erasmus)
dataset["Erasmus Grade"] = grades_erasmus_series
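
The two Erasmus loops above apply the same grade bands twice. As a sketch of an alternative, the bands can be defined once with `pd.cut` and the numeric code derived from the letter. The frame `toy`, the bin edges, and the `letter_to_code` mapping are illustrative names; the sketch assumes `G3` holds integer grades between 0 and 20.

```python
import pandas as pd

# Hypothetical toy grades covering every band boundary.
toy = pd.DataFrame({"G3": [0, 9, 10, 11, 12, 13, 14, 15, 16, 20]})

# Bins are right-inclusive: (-1, 9] -> F, (9, 11] -> D, (11, 13] -> C,
# (13, 15] -> B, (15, 20] -> A.
bins = [-1, 9, 11, 13, 15, 20]
letters = ["F", "D", "C", "B", "A"]
toy["Erasmus Grade"] = pd.cut(toy["G3"], bins=bins, labels=letters).astype(str)

# Same numeric codes as the loop version: A=1, B=2, C=3, D=4, F=5.
letter_to_code = {"A": 1, "B": 2, "C": 3, "D": 4, "F": 5}
toy["Erasmus Grade Label Encoded"] = toy["Erasmus Grade"].map(letter_to_code)

print(toy)
```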

In [None]:
dataset.head()

In [None]:
dataset.tail()

In [None]:
X = dataset.iloc[:, :-4].values #All feature columns before G3 (G3 itself is excluded)
y = dataset.iloc[:, -4].values #Column G3

In [None]:
print(X)

In [None]:
print(y)

## Data Visualizations

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.style.use('ggplot')
dataset['G3'].plot.hist(title='Histogram of G3 Grades', bins=20)
plt.xlabel('Grades - G3')

In [None]:
plt.style.use('bmh')
dataset['G2'].plot.hist(title='Histogram of G2 Grades',bins=20)
plt.xlabel('Grades - G2')

In [None]:
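# Older matplotlib releases call this style 'seaborn'; newer ones renamed it to 'seaborn-v0_8'.
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')
plt.figure(figsize=(4.30,3), dpi=100)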
dataset['G1'].plot.hist(title='Histogram of G1 Grades',bins=20)
plt.xlabel('Grades - G1')

### Visualizing the Relationship Between the Number of Absences and G3 Grades

In [None]:
Absences = dataset.iloc[:, -7].values
G3 = dataset.iloc[:, -4].values

In [None]:
plt.figure(figsize=(10,3), dpi=100)
plt.style.use('bmh')
plt.xlabel('Number of Absences')
plt.ylabel('Grades - G3')
plt.title('Scatter Plot of Absences and G3 Grades')
plt.scatter(Absences,G3)

## Verifying if there is missing data

In [None]:
dataset.isnull().values.any()
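
The check above returns a single True/False for the whole frame. If it ever returned True, a per-column count would show where the gaps are. A minimal sketch on a hypothetical frame (`toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in 'age'.
toy = pd.DataFrame({"age": [15, 16, np.nan], "G3": [10, 12, 14]})

# Number of missing entries in each column.
missing_per_column = toy.isnull().sum()

# Show only the columns that actually contain missing values.
print(missing_per_column[missing_per_column > 0])
```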

## Encoding categorical data

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
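# Positions of the categorical (string-valued) columns in X; all other columns are passed through unchanged.
categorical_columns = [0, 1, 3, 4, 5, 8, 9, 10, 11, 15, 16, 17, 18, 19, 20, 21, 22]
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), categorical_columns)], remainder="passthrough")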
X = np.array(ct.fit_transform(X))

In [None]:
print(X)

## Splitting the Dataset Into a Training Set and Test Set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [None]:
print(X_train)

In [None]:
print(X_test)

In [None]:
print(y_train)

In [None]:
print(y_test)

In [None]:
#To print the whole array
# with np.printoptions(threshold=np.inf):
#     print(X_test) 

## Model 1: Linear Regression 

In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

## Linear Regression Feature Importance 

In [None]:
importance = regressor.coef_
for i,v in enumerate(importance):
    v = "{:.2f}".format(v)
    print(f'Feature: {i}, Score: {v}')
# plot feature importance
plt.bar([i for i in range(len(importance))], importance)
plt.show()

In [None]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

In [None]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(f'Score (R2): {r2}')

## Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator = regressor, X = X_train, y = y_train, cv = 10, scoring='r2')
print("Score (R2): {:.2f}".format(scores.mean()))
print("Standard Deviation: {:.2f}".format(scores.std()))
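
To make explicit what `cross_val_score` does with `cv = 10`, here is a sketch that reproduces it with a manual `KFold` loop. It runs on synthetic data from `make_regression` (the names `X_toy`, `y_toy`, and `manual_scores` are illustrative), not on the student dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the real training set.
X_toy, y_toy = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# For a regressor, an integer cv means plain, unshuffled K-fold splitting.
kfold = KFold(n_splits=10)
manual_scores = []
for train_idx, val_idx in kfold.split(X_toy):
    # Fit on nine folds, score on the remaining one.
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(r2_score(y_toy[val_idx], model.predict(X_toy[val_idx])))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=10, scoring="r2")

print("Manual mean R2:  {:.2f}".format(manual_scores.mean()))
print("Library mean R2: {:.2f}".format(library_scores.mean()))
```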

## Model 2A: Decision-Tree Based Classification to predict pass/fail

In [None]:
y = dataset.iloc[:, -3].values #Column Pass/Fail

In [None]:
print(y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
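
The split above is purely random, so the share of passing students in the test set can drift from the share in the full data. A possible refinement, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels (`X_toy`, `y_toy`) with an 80/20 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 passes (1) and 20 fails (0).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 80 + [0] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both the training and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print("Passes in test set:", int(y_te.sum()), "of", len(y_te))
```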

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

## Decision Tree Feature Importance

In [None]:
importance = classifier.feature_importances_
for i,v in enumerate(importance):
    v = "{:.2f}".format(v)
    print(f'Feature: {i}, Score: {v}')
# plot feature importance
plt.bar([i for i in range(len(importance))], importance)
plt.show()

In [None]:
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)

In [None]:
print("Accuracy: {:.2f} %".format(accuracy_score(y_test, y_pred)*100))

## Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

## Applying Grid Search to find the best model and the best parameters

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {'criterion':['gini','entropy'],'max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50,70,90,120,150]}
grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best Accuracy: {:.2f} %".format(best_accuracy * 100))
print("Best Parameters:", best_parameters)
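
The best accuracy reported by the grid search is a cross-validation score on the training data. A natural follow-up is to check the tuned model on the held-out test set through `best_estimator_`. The sketch below shows the pattern on synthetic data; the names `X_toy`, `search`, and `tuned_tree` and the small parameter grid are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem standing in for the pass/fail task.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"criterion": ["gini", "entropy"], "max_depth": [2, 4, 6]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on all of X_tr.
tuned_tree = search.best_estimator_
test_accuracy = tuned_tree.score(X_te, y_te)

print("Best parameters:", search.best_params_)
print("Held-out accuracy: {:.2f} %".format(test_accuracy * 100))
```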

## Model 2B: Decision-Tree Based Classification to predict encoded Erasmus grade

In [None]:
y = dataset.iloc[:, -2].values #Column Erasmus Grade Label Encoded

In [None]:
print(y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

## Decision Tree Feature Importance 2

In [None]:
importance = classifier.feature_importances_
for i,v in enumerate(importance):
    v = "{:.2f}".format(v)
    print(f'Feature: {i}, Score: {v}')
# plot feature importance
plt.bar([i for i in range(len(importance))], importance)
plt.show()

In [None]:
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)

In [None]:
print("Accuracy: {:.2f} %".format(accuracy_score(y_test, y_pred)*100))

## Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

## Applying Grid Search to find the best model and the best parameters

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {'criterion':['gini','entropy'],'max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50,70,90,120,150]}
grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best Accuracy: {:.2f} %".format(best_accuracy * 100))
print("Best Parameters:", best_parameters)

## Model 3A: Random-Forest Classification to predict pass/fail

In [None]:
y = dataset.iloc[:, -3].values #Column Pass/Fail

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

## Random Forest Feature Importance 1

In [None]:
importance = classifier.feature_importances_
for i,v in enumerate(importance):
    v = "{:.2f}".format(v)
    print(f'Feature: {i}, Score: {v}')
# plot feature importance
plt.bar([i for i in range(len(importance))], importance)
plt.show()

In [None]:
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Accuracy: {:.2f} %".format(accuracy_score(y_test, y_pred)*100))

## Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

## Model 3B: Random-Forest Classification to predict encoded Erasmus grade

In [None]:
y = dataset.iloc[:, -2].values #Column Erasmus Grade Label Encoded

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

## Random Forest Feature Importance 2

In [None]:
importance = classifier.feature_importances_
for i,v in enumerate(importance):
    v = "{:.2f}".format(v)
    print(f'Feature: {i}, Score: {v}')
# plot feature importance
plt.bar([i for i in range(len(importance))], importance)
plt.show()

In [None]:
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Accuracy: {:.2f} %".format(accuracy_score(y_test, y_pred)*100))

## Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

## Random Search Cross Validation in Scikit-Learn

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from pprint import pprint
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# 'auto' was an alias of 'sqrt' for classifiers and is rejected by recent scikit-learn releases.
max_features = ['sqrt', 'log2']
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(random_grid)

In [None]:
rf = RandomForestClassifier()
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 10, cv = 3, verbose=2, random_state=42, n_jobs = -1)
rf_random.fit(X_train, y_train)

In [None]:
rf_random.best_params_

In [None]:
y_pred = rf_random.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

In [None]:
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Accuracy: {:.2f} %".format(accuracy_score(y_test, y_pred)* 100))

## Model 4A: XGBoost to predict pass/fail

In [None]:
y = dataset.iloc[:, -3].values #Column Pass/Fail
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [None]:
import sys
!{sys.executable} -m pip install xgboost
from xgboost import XGBClassifier
classifier = XGBClassifier(eval_metric='logloss', use_label_encoder=False) #'logloss' is the binary counterpart of 'mlogloss'
classifier.fit(X_train, y_train)

## XGBoost Feature Importance

In [None]:
importance = classifier.feature_importances_
for i,v in enumerate(importance):
    v = "{:.2f}".format(v)
    print(f'Feature: {i}, Score: {v}')
# plot feature importance
plt.bar([i for i in range(len(importance))], importance)
plt.show()

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Accuracy: {:.2f} %".format(accuracy_score(y_test, y_pred)*100))

## Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

## Model 4B: XGBoost to predict encoded Erasmus grade

In [None]:
y = dataset.iloc[:, -2].values #Column Erasmus Grade Label Encoded

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [None]:
classifier = XGBClassifier()
classifier.fit(X_train, y_train)
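
The Erasmus labels run from 1 to 5, whereas, to my knowledge, recent XGBoost releases expect class labels numbered 0 to n-1 and may reject other values in `fit`. If that happens, one possible workaround is to re-encode the target with scikit-learn's `LabelEncoder` and map predictions back afterwards. The sketch below only demonstrates the encoding round trip on made-up labels (`y_toy`, `pred_zero_based`); it does not call XGBoost.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical Erasmus codes in the notebook's 1..5 convention.
y_toy = np.array([1, 5, 3, 2, 4, 5, 1])

encoder = LabelEncoder()
y_zero_based = encoder.fit_transform(y_toy)  # 1..5 becomes 0..4

# Pretend a model trained on y_zero_based returned these predictions.
pred_zero_based = np.array([0, 4, 2])
pred_original = encoder.inverse_transform(pred_zero_based)  # back to 1..5

print(y_zero_based)
print(pred_original)
```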

## XGBoost Feature Importance 2

In [None]:
importance = classifier.feature_importances_
for i,v in enumerate(importance):
    v = "{:.2f}".format(v)
    print(f'Feature: {i}, Score: {v}')
# plot feature importance
plt.bar([i for i in range(len(importance))], importance)
plt.show()

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Accuracy: {:.2f} %".format(accuracy_score(y_test, y_pred)*100))

## Applying k-Fold Cross Validation

In [None]:
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

## Using SMOTE to mitigate the effects of unbalanced classes 

In [None]:
!{sys.executable} -m pip install delayed
from imblearn.over_sampling import SMOTE
sm = SMOTE(sampling_strategy = 'auto', random_state=27)
# Generate synthetic samples from the training set only, so the test set still contains only real, unseen data.
X_train_smote, y_train_smote = sm.fit_resample(X_train, y_train)
# Let's now fit our classifier on the resampled training set!

In [None]:
assert len(X_train_smote) !=  len(X_train)
assert len(y_train_smote) != len(y_train) #confirming that we have a resampled dataset with synthetic values
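
For intuition about where the extra rows come from: SMOTE creates each synthetic sample by picking a minority-class point, choosing one of its nearest minority-class neighbours, and placing a new point somewhere on the line segment between them. The sketch below is a simplified NumPy illustration of that interpolation step only. It is not imblearn's implementation, and the function `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(27)

# Hypothetical minority-class points in two dimensions (no duplicates).
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def smote_like_samples(points, n_new, k, rng):
    """Create n_new points by interpolating towards one of the k nearest neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        distances = np.linalg.norm(points - points[i], axis=1)
        # Position 0 of the sorted order is the point itself (distance 0), so skip it.
        neighbour_ids = np.argsort(distances)[1:k + 1]
        j = rng.choice(neighbour_ids)
        gap = rng.random()  # random position along the segment, in [0, 1)
        synthetic.append(points[i] + gap * (points[j] - points[i]))
    return np.array(synthetic)

new_points = smote_like_samples(minority, n_new=6, k=2, rng=rng)
print(new_points)
```

Because every synthetic point lies between two real minority points, the new rows stay inside the region the minority class already occupies.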

In [None]:
smote_xgb = classifier.fit(X_train_smote, y_train_smote)
smote_pred_xg = smote_xgb.predict(X_test)

In [None]:
cm_slr = confusion_matrix(y_test, smote_pred_xg)

In [None]:
print(cm_slr)

In [None]:
print("Accuracy: {:.2f} %".format(accuracy_score(y_test, smote_pred_xg)*100))
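
With imbalanced classes, accuracy alone can look good even when the minority class is predicted poorly. Per-class precision and recall, or a macro-averaged F1 score, give a fuller picture. The sketch below uses made-up labels (`y_true_toy`, `y_pred_toy`) purely to illustrate the gap between the two metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Hypothetical labels: 8 majority-class samples (1) and 2 minority-class samples (0).
y_true_toy = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
# The model misses one of the two minority samples.
y_pred_toy = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

accuracy = accuracy_score(y_true_toy, y_pred_toy)
# Macro averaging gives each class equal weight, regardless of its size.
macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro")

print("Accuracy: {:.2f}".format(accuracy))
print("Macro F1: {:.2f}".format(macro_f1))
print(classification_report(y_true_toy, y_pred_toy, zero_division=0))
```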

In [None]:
# DT = Decision Tree, RF = Random Forest, XGB = XGBoost; A = pass/fail, B = Erasmus grade
Models = ['DTA', 'DTB','RFA', 'RFB', 'XGBA', 'XGBB']
Scores = [0.9287, 0.7360, 0.9249, 0.7615, 0.9229, 0.7846]
barlist = plt.bar(Models, Scores)
# Colour the five-level (Erasmus grade) models red to set them apart from the pass/fail models
for i in range(1,6,2):
    barlist[i].set_color('r')
plt.xlabel('Models')
plt.ylabel('Scores')
plt.title('Model Comparison - Classification')