# Student Grade Prediction

* Name: Ikhwanul Muslimin

* Dataset: [Student Grade Prediction - Kaggle](https://www.kaggle.com/dipam7/student-grade-prediction)

* Dataset information:
<p>This dataset describes student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and they were collected from school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 is strongly correlated with attributes G2 and G1. This is because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. Predicting G3 without G2 and G1 is more difficult, but such a prediction is much more useful (see the paper for more details).</p>

* Relevant papers: [P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.](http://www3.dsi.uminho.pt/pcortez/student.pdf).

# 1. Import Libraries

In [None]:
# import EDA library
import pandas as pd
import numpy as np

In [None]:
# import sklearn library
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
# import stats library
from statsmodels.stats.outliers_influence import variance_inflation_factor

# 2. Reading the data

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

The same file is loaded twice: <code>df</code> is used for regression and <code>df2</code> for classification.

In [None]:
# read the data
df = pd.read_csv('/kaggle/input/student-grade-prediction/student-mat.csv')
df2 = pd.read_csv('/kaggle/input/student-grade-prediction/student-mat.csv')

# 3. Exploring the data

In [None]:
# display the first 5 rows of the data
df.head()

The data has 395 rows and 33 columns without any null values.

In [None]:
# simple data checking - get dataframe general information
df.info()

Before fitting the regression, we drop the rows whose target <code>G3</code> is 0. Otherwise the percentage <code>Difference</code> computed in the performance check would divide by zero and become <code>Inf</code>.

In [None]:
df.drop(df[df['G3'] < 1].index, inplace = True)
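
As a quick illustration of why the zero targets matter, here is a minimal sketch with made-up numbers (the arrays `target` and `prediction` are hypothetical, not taken from the dataset). A target of 0 makes the percentage difference infinite, and excluding it keeps the mean finite.

```python
import numpy as np

# hypothetical targets and predictions; the second target is 0
target = np.array([10.0, 0.0, 15.0])
prediction = np.array([9.0, 1.0, 12.0])

# absolute percentage difference; silence the divide-by-zero warning
with np.errstate(divide="ignore"):
    pct_diff = np.abs((target - prediction) / target * 100)

print(pct_diff)  # the entry for the zero target is inf

# keeping only the non-zero targets gives a finite mean
nonzero = target != 0
print(pct_diff[nonzero].mean())
```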

For classification, we need each student's average grade over the three periods.

In [None]:
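# compute the average from df2 itself; df has had rows dropped above,
# so mixing the two frames would leave NaN in Gavg for those rows
df2['Gavg'] = ((df2['G1'] + df2['G2'] + df2['G3']) / 3).round(2)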
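
One thing to watch when two dataframes are combined: pandas aligns on the index when a Series is assigned as a column, so rows that are missing from the source become NaN. A small sketch with a hypothetical three-row frame (the names `full` and `filtered` are illustrative only):

```python
import pandas as pd

# hypothetical grades; the second student has G3 == 0
full = pd.DataFrame({"G1": [10, 8, 14], "G2": [11, 7, 15], "G3": [12, 0, 16]})

# a filtered copy that no longer contains the G3 == 0 row
filtered = full[full["G3"] >= 1]

# average computed on the filtered frame, then attached to the full frame
avg_from_filtered = ((filtered["G1"] + filtered["G2"] + filtered["G3"]) / 3).round(2)
aligned = full.assign(Gavg=avg_from_filtered)

print(aligned)  # the row missing from `filtered` has Gavg = NaN
```

Computing the average from the same frame that receives the column avoids the gap.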

# 4. Regression

## Make dummy variables

In [None]:
df = pd.get_dummies(df, drop_first=True)
df.head()
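
To see what `drop_first=True` does, here is a small sketch on a hypothetical two-column frame (`df_demo` is made up for illustration). A text column with k categories becomes k-1 indicator columns, which avoids a redundant column that is fully determined by the others.

```python
import pandas as pd

# hypothetical frame with one text column and one numeric column
df_demo = pd.DataFrame({"school": ["GP", "MS", "GP"], "age": [15, 16, 17]})

# "GP" is the first category and is dropped; only school_MS remains
encoded = pd.get_dummies(df_demo, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```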

## Do the regression

I use <code>G3</code> as the output variable and all other columns as inputs.

In [None]:
# target (output) and features (inputs)
out = df['G3']
inp = df.drop(['G3'], axis=1)

In [None]:
# split the data into train and test by 80:20
x_train, x_test, y_train, y_test = train_test_split(inp, out, test_size=0.2, random_state=29)

In [None]:
# load the algorithm
model = LinearRegression()

In [None]:
# train the model on the training data
model.fit(x_train, y_train)

In [None]:
# predict the y using trained model
y_train_pred = model.predict(x_train)
y_test_pred = model.predict(x_test)

## Check the result

In [None]:
# model result
print('Coefficients:\n',model.coef_)
print('\n')
print('Intercept:',model.intercept_)

The fit is good: on the test set, $R^2 \approx 0.92$.

In [None]:
# MSE and R^2
print("MSE :", metrics.mean_squared_error(y_test,y_test_pred))
print("R squared :", metrics.r2_score(y_test,y_test_pred))
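
For reference, both metrics can be computed by hand. The sketch below uses made-up values (`y_true` and `y_pred` are hypothetical) and follows the usual definitions: MSE is the mean squared residual, and $R^2 = 1 - SS_{res}/SS_{tot}$.

```python
import numpy as np

# hypothetical targets and predictions
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 12.0, 13.0, 16.0])

# mean squared error: average of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)

# R^2: one minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE :", mse)
print("R squared :", r2)
```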

## Check our model's performance
I build a new dataframe that consists of:
* Test prediction (our result)
* Target data (real result)
* Difference in %

In [None]:
# - Test prediction
performance = pd.DataFrame(y_test_pred, columns=['Prediction'])
# - Target data
y_test = y_test.reset_index(drop=True)
performance['Target'] = y_test
# - The difference in %
performance['Difference (%)']= np.absolute((performance['Target'] 
                                            - performance['Prediction'])/
                                           performance['Target']*100)
performance.head()

The mean difference on the test set is only $7.69\%$.

In [None]:
# check the summary statistics
performance.describe()

# 5. Classification

## Make target value

In [None]:
# make a new target value, which is Passed
df2['Passed']= np.where(df2['Gavg'] > 10, 1, 0)
df2.head()
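
Note that the comparison is strict, so an average of exactly 10 counts as not passed. A minimal sketch with made-up averages (`gavg_demo` is hypothetical):

```python
import numpy as np

# hypothetical average grades around the threshold
gavg_demo = np.array([9.67, 10.0, 10.33])

# 1 where the average is strictly above 10, otherwise 0
passed_demo = np.where(gavg_demo > 10, 1, 0)

print(passed_demo)
```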

## Do the classification: Logistic Regression

In [None]:
# choose the columns for x and y
y = df2['Passed']
x = df2.select_dtypes(include='number').drop(['Gavg','Passed'], axis=1)

# split the data into test and train
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.2, random_state=29)

# load the algorithm
model = LogisticRegression(max_iter=1000)

# train the model on the training data
model.fit(x_train, y_train)

# predict the y using trained model
y_train_pred = model.predict(x_train)
y_test_pred = model.predict(x_test)

## Check the result

The result looks perfect: the accuracy is $100\%$. Wait, what?!

In [None]:
# evaluate classification model - accuracy
accuracy_test = metrics.accuracy_score(y_test,y_test_pred)
print('Accuracy Test Data: {}'.format(accuracy_test))

In [None]:
# classification report
print(classification_report(y_test,y_test_pred))
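
The numbers in a classification report can be reproduced by hand from the four confusion-matrix counts. A minimal sketch with made-up labels (these are not the notebook's predictions):

```python
# hypothetical true labels and predictions
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 1, 0, 1]

pairs = list(zip(y_true_demo, y_pred_demo))

# confusion-matrix counts for the positive class (1)
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(accuracy, precision, recall)
```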

## Is our model correct?

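Something seems wrong with this model, possibly because we used every numeric feature in the data, and some of those variables are highly correlated with each other. In particular, <code>G1</code>, <code>G2</code>, and <code>G3</code> are all inputs, and <code>Passed</code> is computed directly from their average.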
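
One way to probe this suspicion, sketched here under assumptions: because `Passed` is defined from the average of `G1`, `G2`, and `G3`, a simple linear rule on those three columns reproduces it exactly, and a linear classifier such as logistic regression can learn a rule of that shape. The grades below are randomly generated, not the real data.

```python
import numpy as np

# hypothetical G1, G2, G3 grades in the range 0..20
rng = np.random.default_rng(0)
grades = rng.integers(0, 21, size=(1000, 3))

# target built the same way as in the notebook: rounded average > 10
gavg_demo = np.round(grades.sum(axis=1) / 3, 2)
passed_demo = np.where(gavg_demo > 10, 1, 0)

# the same target expressed as a linear rule on the inputs
linear_rule = (grades.sum(axis=1) > 30).astype(int)

agreement = (passed_demo == linear_rule).mean()
print(agreement)
```

When the inputs determine the target this directly, a perfect score says little about how well the model would predict passing from other information.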

# 6. Classification - but accounting for Multicollinearity

## Multicollinearity

The likely reason for this absurdly perfect result is multicollinearity.

Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model. Multicollinearity can lead to skewed or misleading results when a researcher or analyst attempts to determine how well each independent variable can be used most effectively to predict or understand the dependent variable in a statistical model [(source)](https://www.investopedia.com/terms/m/multicollinearity.asp).

## Correlation between columns

In [None]:
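# correlation matrix of the numeric columns (numeric_only needs pandas >= 1.5)
corr = df2.corr(numeric_only=True)
# Styler.format(precision=...) replaces set_precision, which was removed in pandas 2.0
corr.style.background_gradient(cmap='coolwarm').format(precision=2)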

## Check the Variance Inflation Factor (VIF)

Variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables. Mathematically, the VIF for a regression model variable is equal to the ratio of the overall model variance to the variance of a model that includes only that single independent variable [(source)](https://www.investopedia.com/terms/v/variance-inflation-factor.asp).

In [None]:
# indicate which variables to compute VIF for (copy so that x itself is not modified)
new_x = x.copy()

# add intercept
new_x['intercept'] = 1

# compute VIF
vif = pd.DataFrame()
vif["variables"] = new_x.columns
vif["VIF"] = [variance_inflation_factor(new_x.values, i) for i in range(new_x.shape[1])]

# output
vif
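
VIF is commonly written as $VIF_i = 1 / (1 - R_i^2)$, where $R_i^2$ comes from regressing variable $i$ on all the other variables. The sketch below computes it with plain NumPy on synthetic data; the helper `vif_by_hand` and the columns `a`, `b`, `c` are hypothetical and are not part of statsmodels.

```python
import numpy as np

def vif_by_hand(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# synthetic data: c is nearly a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
X_demo = np.column_stack([a, b, c])

vifs = [vif_by_hand(X_demo, i) for i in range(X_demo.shape[1])]
print(vifs)  # a and c get large values, b stays close to 1
```

Here the two nearly identical columns end up far above the threshold of 5, while the independent column does not.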

## Drop the columns which have VIF > 5

In [None]:
# drop the columns
df2.drop(columns=['G2','G3'], inplace=True)
df2.head()

## Do the classification: Logistic Regression

In [None]:
y = df2['Passed']
x = df2.select_dtypes(include='number').drop(['Gavg','Passed'], axis=1)

In [None]:
# split the data into test and train
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.2, random_state=29)

# load the algorithm
model = LogisticRegression(max_iter=1000)

# train the model on the training data
model.fit(x_train, y_train)

# predict the y using trained model
y_train_pred = model.predict(x_train)
y_test_pred = model.predict(x_test)

## Check the result

The accuracy is a little lower than before: $92.4\%$.

In [None]:
# evaluate classification model - accuracy
accuracy_test = metrics.accuracy_score(y_test,y_test_pred)
print('Accuracy Test Data: {}'.format(accuracy_test))

In [None]:
# classification report
print(classification_report(y_test,y_test_pred))

# Conclusion
We fitted a linear regression with $R^2 \approx 0.92$ and ran the Logistic Regression classification twice. The first run reached $100\%$ accuracy, but it did not take the multicollinearity effect into account. After accounting for it, the model performs worse than before ($100\%$ down to around $92.4\%$), but the result is more reliable.