# Introduction

In this notebook, we will try to predict the average score of a student based on their socioeconomic status, family background, and gender. Below are the steps that we will perform:

1. Exploratory Data Analytics
    * Checking for missing values
    * Visualization of variables & correlations
2. Data Engineering
    * Hypothesis testing on the proportion of the scores, to further validate the use of students' average score
    * Converting qualitative variables to dummy variables
3. Data Modelling with:
    * Linear Regression
    * K Nearest Neighbour Regression
    * Support Vector Regression (SVR)
    * Neural Networks
4. Comparison and Conclusion

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import seaborn as sns
import scipy
from scipy import stats
import statsmodels.formula.api as smf 
from sklearn.model_selection import train_test_split
from sklearn import neighbors
from sklearn.metrics import mean_squared_error 
from math import sqrt
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
students = pd.read_csv("/kaggle/input/students-performance-in-exams/StudentsPerformance.csv")

# Exploratory Data Analytics
We first take a look at the data

In [None]:
students.head()

The dataframe columns are renamed for easier accessibility

In [None]:
students.columns = "gender","race","parental_edu","lunch","test_prep","math","reading","writing"

We also check if there are any missing data in the dataset

In [None]:
students.isna().sum()
# No missing data in this dataset

We then plot bar plots and histograms to visualize the distribution of the data for each variable

In [None]:
f, axs = plt.subplots(3,3,figsize=(15,15))
students['gender'].value_counts().plot(kind='bar', ax=axs[0,0])
axs[0,0].title.set_text('Gender')
students['race'].value_counts().plot(kind='bar', ax=axs[0,1])
axs[0,1].title.set_text('Race')
students['parental_edu'].value_counts().plot(kind='bar', ax=axs[0,2])
axs[0,2].title.set_text('Parental Education')
students['lunch'].value_counts().plot(kind='bar', ax=axs[1,0])
axs[1,0].title.set_text('Lunch')
students['test_prep'].value_counts().plot(kind='bar', ax=axs[1,1])
axs[1,1].title.set_text('Test Prep')
axs[1,2].hist(students['math'])
axs[1,2].title.set_text('Math')
axs[2,0].hist(students['reading'])
axs[2,0].title.set_text('Reading')
axs[2,1].hist(students['writing'])
axs[2,1].title.set_text('Writing')

f.delaxes(axs[2][2])
f.tight_layout()
plt.show()

From this, we observe the following regarding the data:

* Qualitative variables are distributed rather evenly between the classes, with no sparse classes.
* The quantitative variables 'Math', 'Reading', and 'Writing' are approximately normally distributed. They also take acceptable values within the range of 0 to 100 (i.e., no outliers due to typos or data entry errors)

Besides that, we also study the relationship between the quantitative variables

In [None]:
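# pairplot only draws the numeric columns (the three scores)
sns.pairplot(students)
# Restrict corr() to the score columns; newer pandas raises on text columns
students[['math','reading','writing']].corr()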

From this, it can be observed that the 3 score variables are quite highly correlated, with highest correlation between reading & writing. As this may affect the model output, we may remove certain variables during modelling or combine them during the data modelling stage
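
As a minimal sketch of how the most strongly correlated pair could be read off programmatically, the snippet below builds a small made-up frame (the values are illustrative, not taken from the dataset), keeps only numeric columns, and compares every pair of score columns. The names `toy`, `pairs`, and `best_pair` are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "math": [50, 60, 70, 80, 90],
    "reading": [55, 63, 74, 79, 95],
    "writing": [54, 62, 75, 78, 96],
})

# Keep numeric columns only so text columns are never passed to corr().
corr = toy.select_dtypes(include="number").corr()

# Collect the correlation for each unordered pair of columns.
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

# The pair with the highest Pearson correlation.
best_pair = max(pairs, key=pairs.get)
print(best_pair, round(pairs[best_pair], 3))
```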

# Data & Feature Engineering
**Target Variable**

Based on the high correlation between the 3 scores above, we hypothesize that the proportions of the three scores are equal across the samples (students). We will perform a hypothesis test to verify this:

* H0: Proportion of math scores = Proportion of writing scores = Proportion of reading scores across all students
* H1: Proportions are not equal

We will perform the Pearson Chi-Squared test for proportion similarity

In [None]:
scores = students.loc[:,['math','reading','writing']].transpose()

Chisquares_results=scipy.stats.chi2_contingency(scores)
print("Chi-Squared test returns P-value of", Chisquares_results[1])

Based on the P-value, we fail to reject H0 and conclude that the proportions of the scores are similar across all students. Thus, we will calculate the average score, to be used as the prediction target
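
To illustrate how the test output is read, here is a small sketch on a made-up table in which every student has exactly the same score proportions. It treats the scores as if they were counts, as the notebook does. The significance level `alpha = 0.05` is an assumption, since the notebook does not state one.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Toy table: rows are subjects, columns are students.
# Values are made up so that each column is a scaled copy of the first.
scores = np.array([
    [60, 66, 63],   # math
    [80, 88, 84],   # reading
    [40, 44, 42],   # writing
])

chi2, p_value, dof, expected = chi2_contingency(scores)

alpha = 0.05                  # assumed significance level
reject_h0 = p_value < alpha   # True would mean the proportions differ

print("chi2:", chi2, "p-value:", p_value, "dof:", dof, "reject H0:", reject_h0)
```

With perfectly proportional columns, the expected table matches the observed one, so the statistic is essentially zero and H0 is not rejected.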

In [None]:
students["average_score"] = students.loc[:,['math','reading','writing']].mean(axis=1).round(1)
students

We are also interested in the shape of the data for average score, thus we will plot a histogram

In [None]:
plt.hist(students['average_score'])
plt.title('Distribution of Student Average Scores')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()

**Independent Variables**

As the columns "gender", "race", "parental_edu", "lunch" and "test_prep" are qualitative variables, we will create dummy variables for them, dropping one dummy variable per original variable to avoid the dummy variable trap (perfect multicollinearity)

`pd.get_dummies` replaces the original columns with the dummy columns in a single step, and the result is stored back in the `students` dataframe

The column names containing spaces are then renamed to omit the spaces
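
The encoding step can be sketched on a toy frame as follows. The data is made up, and the `dtype=int` argument and the lambda-based rename are choices made for this illustration rather than the notebook's exact code.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard", "standard"],
    "race": ["group A", "group B", "group C", "group A"],
    "score": [70.0, 55.0, 80.0, 65.0],
})

# drop_first=True drops the first category of each variable
# ("free/reduced" and "group A"), which becomes the baseline level.
encoded = pd.get_dummies(toy, columns=["lunch", "race"], drop_first=True, dtype=int)

# Remove the space-containing prefix so the names are valid in formulas.
encoded = encoded.rename(columns=lambda name: name.replace("race_group ", "race_"))

print(encoded)
```

A row with zeros in every dummy of a variable belongs to that variable's dropped baseline category.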

In [None]:
# dtype=int keeps the dummies as 0/1 integers (newer pandas defaults to booleans)
students = pd.get_dummies(students, columns=['gender', 'race', 'parental_edu', 'lunch', 'test_prep'],
               drop_first=True, prefix=['gender', 'race', 'parental_edu', 'lunch', 'test_prep'], prefix_sep='_', dtype=int)

students.rename(columns={'race_group B': 'race_B', 'race_group C': 'race_C', 'race_group D': 'race_D', 
                             'race_group E': 'race_E', 'parental_edu_bachelor\'s degree': 'parental_bachelor', 
                             'parental_edu_high school': 'parental_hs', 'parental_edu_master\'s degree': 'parental_masters',
                             'parental_edu_some college': 'parental_somecol', 'parental_edu_some high school': 'parental_somehs'}, inplace=True)


# Scores Prediction

For score prediction, we will try and compare several models, namely:
* Linear Regression
* K Nearest Neighbour Regression
* SVR
* Neural Networks

# Linear Regression Model
For model validation, we will use the validation set approach. For this, we first perform a train-test split on the data in a 70:30 ratio
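
As a small sketch of the 70:30 split, the snippet below splits a made-up 10-row frame. Passing `random_state` is an assumption added here to make the split repeatable; the notebook's own call does not fix a seed, so its split changes between runs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 10 rows, made up for illustration only.
toy = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# test_size=0.3 holds out 30% of the rows for testing.
train_a, test_a = train_test_split(toy, test_size=0.3, random_state=0)

# Repeating the call with the same seed reproduces the same split.
train_b, test_b = train_test_split(toy, test_size=0.3, random_state=0)

print(len(train_a), len(test_a))
```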

In [None]:
train_data,test_data = train_test_split(students, test_size = 0.3)

In [None]:
ml1 = smf.ols('average_score ~ gender_male+race_B+race_C+race_D+race_E+parental_bachelor+parental_hs+parental_masters+parental_somecol+parental_somehs+lunch_standard+test_prep_none', data=train_data).fit()

ml1.summary()

From the model summary, the R-squared value is extremely low. This is likely due to the low number of predictors that are all categorical variables.

I have tried dropping variables, with no improvement on the R-squared value. Thus, we will leave the model as is. Besides that, transformations on the predictors will not work either as they are all coded as binary values only.

We are also interested to have a visualization of the predicted values. We will plot a histogram for this

In [None]:
test_pred = ml1.predict(test_data)

plt.hist(test_pred)
plt.title('Distribution of Predicted Scores using Linear Regression')
plt.xlabel('Scores')
plt.ylabel('Frequency')
plt.show()

From this, it can be observed that the model is predicting all score values between 50 and 90 only, and not able to handle extreme cases. 

However, the distribution of the data is rather similar to the average score, showing that the model is not blindly predicting certain scores all the time

We will calculate and store the RMSE values for final tabulation and comparison
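
For reference, RMSE is the square root of the mean squared difference between predicted and actual values. A minimal sketch, with a hypothetical helper `rmse` and made-up numbers:

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


# Errors are 2, -1 and -4, so the mean squared error is (4 + 1 + 16) / 3 = 7.
example = rmse([70, 80, 90], [72, 79, 86])
print(example)  # square root of 7
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones.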

In [None]:
test_resid  = test_pred - test_data.average_score

# RMSE value for test data 
lr_test_rmse = np.sqrt(np.mean(test_resid*test_resid))

print("Test RMSE using linear regression is:",lr_test_rmse)

# K Nearest Neighbours Regression
We will fit the KNN regression algorithm on the same train & test split

In [None]:
train_X = train_data.drop(['average_score','math','writing','reading'], axis=1)
train_Y = train_data.loc[:,'average_score']
test_X = test_data.drop(['average_score','math','writing','reading'], axis=1)
test_Y = test_data.loc[:,'average_score']

We fit the model for values of k from 1 to 150 and calculate the resulting test RMSE for each

Note that this is only done for learning purposes, as the dataset is small. Running this many values of k sequentially is not recommended on large datasets

In [None]:
rmse_val = [] #to store rmse values for different k
for K in range(1, 151):
    model = neighbors.KNeighborsRegressor(n_neighbors = K)

    model.fit(train_X, train_Y)  #fit the model
    pred=model.predict(test_X) #make prediction on test set
    error = sqrt(mean_squared_error(test_Y,pred)) #calculate rmse
    rmse_val.append(error) #store rmse values

Saving the data as a dataframe

In [None]:
data = {"k": range(1,151),"RMSE": rmse_val}
final_rmse=pd.DataFrame(data)
min_index = final_rmse.iloc[:,1].idxmin()

We plot the RMSE values against the number of neighbours, k, and mark the value of k that returns the minimum RMSE

In [None]:
plt.plot(final_rmse.k, final_rmse.RMSE)
plt.plot(final_rmse.k[min_index],final_rmse.RMSE[min_index],'ro')
plt.title('Plot of Test RMSE vs K number of neighbours')
plt.xlabel('K')
plt.ylabel('Test RMSE')
plt.show()


From the elbow plot, we will select the smallest value of K that provides the largest reduction in RMSE. From the plot, we select K = 35

For this value of K, we will plot a histogram to look at the distribution of predictions

In [None]:
model = neighbors.KNeighborsRegressor(n_neighbors = 35)
model.fit(train_X, train_Y) 
pred=model.predict(test_X)

plt.hist(pred)
plt.title('Distribution of Predicted Scores for KNN Regression')
plt.xlabel('Scores')
plt.ylabel('Frequency')
plt.show()

From the histogram, it can be seen that all the predicted scores fall on a very short range, from 55 to 80 only. 

This is because the KNN algorithm averages the scores of the nearest neighbours, thus the predictions tend to fall closer to the average value of the data. 

In other words, KNN regression is not able to give very accurate predictions of student scores (especially those on the lower ranges)
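
The averaging effect described above can be seen in a small sketch on made-up data: as k grows, each prediction averages over more neighbours, and the spread of predictions shrinks toward the overall mean. The names `spreads` and `full_average` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data, made up for illustration: targets rise steadily from 10 to 100.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10, 101, 10, dtype=float)

spreads = {}
for k in (1, 3, 10):
    model = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    pred = model.predict(X)
    # Range of predicted values for this k.
    spreads[k] = pred.max() - pred.min()

# With k equal to the number of samples, every prediction is the global mean.
full_average = KNeighborsRegressor(n_neighbors=10).fit(X, y).predict(X)

print(spreads)
```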

We will store the rmse values for final tabulation and comparison

In [None]:
knn_test_rmse = sqrt(mean_squared_error(test_Y,pred))
print("Test RMSE using KNN regression is:",knn_test_rmse)

# Support Vector Regression (SVR)
We will use the same train & test split to fit the SVR models

We will model the data for the following kernels to select the best one:
* Linear
* Polynomial
* Sigmoid
* Gaussian (rbf)

In [None]:
# kernel = linear
model_linear = SVR(kernel="linear")
model_linear.fit(train_X, train_Y)
pred_linear = model_linear.predict(test_X)
linear_rmse = sqrt(mean_squared_error(test_Y,pred_linear))

# kernel = poly
model_poly = SVR(kernel="poly")
model_poly.fit(train_X, train_Y)
pred_poly = model_poly.predict(test_X)
poly_rmse = sqrt(mean_squared_error(test_Y,pred_poly))

# kernel = sigmoid
model_sigmoid = SVR(kernel="sigmoid")
model_sigmoid.fit(train_X, train_Y)
pred_sigmoid = model_sigmoid.predict(test_X)
sigmoid_rmse = sqrt(mean_squared_error(test_Y,pred_sigmoid))

# kernel = rbf
model_rbf = SVR(kernel="rbf")
model_rbf.fit(train_X, train_Y)
pred_rbf = model_rbf.predict(test_X)
rbf_rmse = sqrt(mean_squared_error(test_Y,pred_rbf))

data = {"kernel":pd.Series(["linear","polynomial","sigmoid","rbf"]),
            "Test RMSE":pd.Series([linear_rmse,poly_rmse,sigmoid_rmse,rbf_rmse]),
            "Pred":pd.Series([pred_linear,pred_poly,pred_sigmoid,pred_rbf])}
table_rmse=pd.DataFrame(data)
table_rmse


So far, it seems like SVR is giving the best (lowest) RMSE out of the models. We will try to further improve the model by doing a grid search to find the best parameters
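
As a rough sketch of the grid search idea, the snippet below tunes an SVR on synthetic binary features. The data, the reduced parameter grid, `cv=3`, and the built-in `"neg_root_mean_squared_error"` scorer are all assumptions made for this illustration and differ from the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic data: four binary features (like dummy variables) and a noisy target.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 4)).astype(float)
y = 60 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(0, 2, size=60)

# Every combination of these values is scored with 3-fold cross-validation.
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
search = GridSearchCV(SVR(), param_grid, cv=3, scoring="neg_root_mean_squared_error")
search.fit(X, y)

# Scores are negated errors, so flip the sign to recover the RMSE.
best_rmse = -search.best_score_

# With the default refit=True, this model is already refit on all of X, y.
best_model = search.best_estimator_

print(search.best_params_, best_rmse)
```

Using `best_estimator_` avoids rebuilding the model by hand and keeps any fixed settings that were passed to the estimator given to the search.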

In [None]:
K = 15
parameters = [{'kernel': ['linear','sigmoid','rbf'], 'gamma': [2e-3,2e-2, 2e-1, 1, 2, 4, 8, 16],'C': [2e-5,2e-4,2e-3,2e-2, 2e-1, 1, 2, 4, 8, 16]}]
scorer = make_scorer(mean_squared_error, greater_is_better=False)
svr_gs = GridSearchCV(SVR(epsilon = 0.01), parameters, cv = K, scoring=scorer)

svr_gs.fit(train_X, train_Y)
print(svr_gs.best_params_)

We perform prediction for test data using the best parameters from grid search, and append this test RMSE to the table of results for SVR

In [None]:
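# Refit with the best parameters, keeping the epsilon used during the grid search
regressor = SVR(epsilon=0.01, **svr_gs.best_params_)
regressor.fit(train_X,train_Y)
pred=regressor.predict(test_X)

error = sqrt(mean_squared_error(test_Y,pred))
data = {"kernel":pd.Series(["GS Output"]),"Test RMSE":pd.Series([error]),"Pred":pd.Series([pred])}
# DataFrame.append was removed in pandas 2.0; concatenate with a fresh index instead
table_rmse = pd.concat([table_rmse, pd.DataFrame(data)], ignore_index=True)
print(table_rmse)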

For the model that returns the lowest RMSE value, we plot a histogram to look at the distribution of the prediction values

In [None]:
# Use the row position of the lowest RMSE so the lookup does not depend on index labels
plt.hist(table_rmse.Pred.iloc[table_rmse["Test RMSE"].to_numpy().argmin()])
plt.title('Distribution of Predicted Scores for SVR')
plt.xlabel('Scores')
plt.ylabel('Frequency')
plt.show()

From the histogram, it can be seen that all the predicted scores fall on a short range, from 55 to 85. Similar to the previous models, it is also unable to predict scores on the upper and lower ranges

We then store the lowest RMSE value for final tabulation and comparison

In [None]:
svr_test_rmse = table_rmse["Test RMSE"].min()
print("Test RMSE using SVR is:",svr_test_rmse)

# Neural Networks

We will train an artificial neural network on the same train & test split

Note that multiple activation functions and neural network structures are tested, but only the final network is shown here

In [None]:
import tensorflow as tf

# Importing necessary models for implementation of ANN
from keras.models import Sequential
from keras.layers import Dense

cont_model = Sequential()
cont_model.add(Dense(100, input_dim=train_X.shape[1], activation="softmax"))  # one input per feature column
cont_model.add(Dense(60, activation="relu"))
cont_model.add(Dense(1, kernel_initializer="normal"))
cont_model.compile(loss="mean_squared_error", optimizer = "adam", metrics = ["mse"])

model = cont_model
model.fit(np.array(train_X), np.array(train_Y), epochs=300)

# On Test dataset
pred = model.predict(np.array(test_X))
pred = pd.Series([i[0] for i in pred])

We plot a histogram to look at the distribution of the prediction values

In [None]:
plt.hist(pred)
plt.title('Distribution of Predicted Scores for Neural Network')
plt.xlabel('Scores')
plt.ylabel('Frequency')
plt.show()

From the histogram, it can be seen that all the predicted scores fall on a short range, from 55 to 85. Note that this is a similar theme across all models built

We store the RMSE value for final tabulation and comparison

In [None]:
nn_test_rmse = sqrt(mean_squared_error(test_Y,pred))
print("Test RMSE using ANN is:",nn_test_rmse)

# Comparison & Conclusion
We will tabulate the test RMSE values for the 4 algorithms

In [None]:
data = {"Model":pd.Series(["Linear Regression","KNN Regression","SVR","Neural Network"]),
            "Test RMSE":pd.Series([lr_test_rmse,knn_test_rmse,svr_test_rmse,nn_test_rmse])}
table_final=pd.DataFrame(data)
table_final

From the final results, it is observed that linear regression and SVR provide the best models to predict student score. 

However, in all 4 models, the range of predicted scores is quite narrow, and the models are unable to predict scores at the higher and lower ends of the spectrum. This is likely because the dataset is relatively small and the input variables are all qualitative. A more accurate model could be built given more data and quantitative variables, for example:
* hours spent studying a week
* average score on previous exams
* attendance in school
* enrollment in extra classes