This notebook predicts a student's score on a mathematics exam from their socio-economic and demographic information.

Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Data Reading

In [None]:
db = pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv')

In [None]:
db.head()

In [None]:
db.info()

Features (every column except the target)

In [None]:
X = db.drop(['math score'], axis=1)

Target

In [None]:
y = db['math score']

Transformation of categorical variables

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X['test preparation course'] = le.fit_transform(X['test preparation course'])
X['gender'] = le.fit_transform(X['gender'])
X['race/ethnicity'] = le.fit_transform(X['race/ethnicity'])
X['parental level of education'] = le.fit_transform(X['parental level of education'])
X['lunch'] = le.fit_transform(X['lunch'])
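
To make the encoding step concrete: `LabelEncoder` assigns integer codes following the sorted order of the distinct labels. The sketch below reproduces that mapping in plain Python on made-up values; the `encode_labels` helper and the `lunch_values` list are hypothetical and not part of the notebook. Keep in mind that the resulting integers imply an order that nominal categories such as `race/ethnicity` do not really have.

```python
# Minimal sketch of label encoding, assuming codes follow the sorted order of
# the distinct labels (as sklearn's LabelEncoder does). Names and values are
# illustrative only.

def encode_labels(values):
    """Map each distinct label to an integer code in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

lunch_values = ["standard", "free/reduced", "standard", "standard"]
codes, mapping = encode_labels(lunch_values)
print(mapping)  # label -> integer code
print(codes)    # encoded column
```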

Scaling the features with RobustScaler, which is less sensitive to outliers

In [None]:
from sklearn.preprocessing import RobustScaler
rs = RobustScaler()
X = rs.fit_transform(X)
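
As a rough illustration of what robust scaling does to a single feature, the sketch below subtracts the median and divides by the interquartile range. It assumes the default 25th to 75th percentile range and uses made-up numbers; `robust_scale` and `scores` are hypothetical names, not part of the notebook.

```python
# Minimal sketch of robust scaling for one feature: (x - median) / IQR.
# Assumes the 25th-75th percentile range and a non-zero IQR; values are made up.
from statistics import median, quantiles

def robust_scale(values):
    """Center on the median and scale by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    center = median(values)
    return [(v - center) / iqr for v in values]

scores = [1, 2, 3, 4, 100]  # 100 plays the role of an outlier
scaled = robust_scale(scores)
print(scaled)
```

The outlier stays far from the rest, but it does not affect the median or the IQR, so the other values keep a sensible spread.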

Splitting the Data into Training and Test Sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

In [None]:
from sklearn.ensemble import RandomForestRegressor

Method Used: Random Forest Regressor

In [None]:
RF = RandomForestRegressor()

In [None]:
RF.fit(X_train, y_train)

In [None]:
RFpred = RF.predict(X_test)

Model Evaluation

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [None]:
# R² on the training set and on the test set
lineX = [RF.score(X_train, y_train), RF.score(X_test, y_test)]
lineX

R² on the training and test sets: the model scores higher on the training set, which indicates slight overfitting.


Metrics Evaluation

In [None]:
MSE = mean_squared_error(y_test, RFpred)
MAE = mean_absolute_error(y_test, RFpred)
# RMSE via NumPy, since newer scikit-learn releases no longer accept `squared`
RMSE = np.sqrt(MSE)
R_squared = r2_score(y_test,RFpred)
print("MSE:", MSE)
print("MAE:", MAE)
print("RMSE:", RMSE)
print("R-squared:", R_squared)
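
For reference, the four metrics printed above can be computed by hand. The sketch below does so in plain Python on made-up numbers; `regression_metrics` is a hypothetical helper, not part of the notebook or of scikit-learn.

```python
# Minimal sketch of MSE, MAE, RMSE and R² computed from their definitions.
# The inputs are illustrative values, not results from this notebook.
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)          # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MSE": mse, "MAE": mae, "RMSE": rmse, "R2": r2}

metrics = regression_metrics([3, 5, 7], [2, 5, 9])
print(metrics)
```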

Searching for the best max_depth (the maximum depth of each tree) for the random forest

In [None]:
RFdic = {}  # max_depth -> [R² on train, R² on test]
for md in range(1, 16):
    RFtest = RandomForestRegressor(max_depth=md)
    RFtest.fit(X_train, y_train)
    RFdic[md] = [RFtest.score(X_train, y_train), RFtest.score(X_test, y_test)]
RFdic

Based on these results, the best tree depth is 6: it gives good results on the test data at a lower cost than a depth of 9.
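
The selection rule used here (prefer a shallower tree when its test score is close to the best) can be written down explicitly. The sketch below is one possible formulation, assuming a dictionary shaped like `RFdic`; the `smallest_good_depth` helper, the `tolerance` value, and the scores are all hypothetical and are not the notebook's results. Choosing a hyperparameter on the test set tends to give an optimistic estimate, so a separate validation set or cross-validation may be preferable.

```python
# Minimal sketch: pick the smallest max_depth whose test R² is within a
# tolerance of the best test R². Scores below are invented for illustration.

def smallest_good_depth(results, tolerance=0.01):
    """results maps max_depth -> [train_score, test_score]."""
    best_test = max(test for _, test in results.values())
    return min(
        depth
        for depth, (_, test) in results.items()
        if test >= best_test - tolerance
    )

example_scores = {
    3: [0.80, 0.780],
    6: [0.93, 0.850],
    9: [0.97, 0.853],
    12: [0.98, 0.840],
}
chosen = smallest_good_depth(example_scores)
print(chosen)
```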

In [None]:
RFdb = pd.DataFrame(data=RFdic) # set results to dataframe

Formatting the DataFrame

In [None]:
RFdb.rename(index={0: 'train', 1: 'test'}, inplace=True)
RFdb = RFdb.T

Max Depth Results Visualized with Seaborn

In [None]:

plt.figure(figsize=(16, 6))
sns.lineplot(data = RFdb)
plt.xlabel('max_depth')
plt.ylabel('R2 Score')