# Student Grade Prediction

* Name: Ikhwanul Muslimin

* Dataset: [Student Grade Prediction - Kaggle](https://www.kaggle.com/dipam7/student-grade-prediction)

* Dataset information:
<p> This data approaches student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social, and school-related features, and they were collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).</p>

* Relevant papers: [P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.](http://www3.dsi.uminho.pt/pcortez/student.pdf).

# 1. Import Libraries

In [1]:
# import EDA library
import pandas as pd
import numpy as np

In [18]:
# import sklearn library
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import sklearn.neighbors as neighbors
import sklearn.metrics as metrics

# 2. Reading the data

In [3]:
# mount Google Drive
import google.colab as gc
gc.drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# change folder
%cd '/content/drive/My Drive/datasets'

/content/drive/My Drive/datasets


The same file is loaded twice: <code>df</code> is used for regression and <code>df2</code> for classification.

In [5]:
# read the data
df = pd.read_csv('student-mat.csv')
df2 = pd.read_csv('student-mat.csv')

# 3. Exploring the data

In [6]:
# display the first 5 rows of the data
df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,3,yes,no,yes,no,yes,yes,yes,no,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,yes,yes,yes,yes,yes,yes,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,yes,no,yes,yes,no,no,4,3,2,1,2,5,4,6,10,10


The data has 395 rows and 33 columns without any null values.

In [7]:
# simple data checking - get dataframe general information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

For the regression, we remove the rows whose target <code>G3</code> is 0, so that the <code>Difference</code> computed later (which divides by the target) is never <code>Inf</code>.

In [8]:
df.drop(df[df['G3'] < 1].index, inplace = True)
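
To see why the zero grades have to go, here is a small sketch on made-up values (not rows from the dataset). The percentage difference divides by the target, so a target of 0 produces `inf`; the frame `toy` and its numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up targets and predictions, for illustration only
toy = pd.DataFrame({'Target': [10, 0, 15],
                    'Prediction': [9.0, 1.5, 15.0]})

# Same formula as the Difference (%) column used later in this notebook
toy['Difference (%)'] = np.absolute((toy['Target'] - toy['Prediction'])
                                    / toy['Target'] * 100)

# Rows whose target is 0 end up with an infinite difference
has_inf = np.isinf(toy['Difference (%)'])
print(toy)

# Dropping them the same way as above leaves only finite values
toy_clean = toy.drop(toy[toy['Target'] < 1].index)
print(toy_clean)
```

A single infinite value would also make the mean of the column infinite, which is why the rows are removed before evaluating.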

For classification, we need each student's average score over <code>G1</code>, <code>G2</code>, and <code>G3</code>.

In [9]:
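# average of the three period grades, rounded to 2 decimals
# note: df has already lost the G3 == 0 rows, so Gavg is NaN for those students in df2
df2['Gavg'] = round((df['G1'] + df['G2'] + df['G3']) / 3, 2)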
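
One detail worth checking in the cell above: pandas aligns Series by index label, and `df` no longer has the rows dropped in the previous step. A minimal sketch with hypothetical frames `a` and `b` (toy values, not the dataset) shows what that does to the assigned column and to a later threshold test:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: `a` plays the role of df, `b` the role of df2
b = pd.DataFrame({'G1': [5, 7, 16], 'G2': [6, 0, 17], 'G3': [6, 0, 18]})
a = b.drop(b[b['G3'] < 1].index)   # row 1 is removed from `a` only

# Assigning a Series built from `a` aligns on the index,
# so the row missing from `a` becomes NaN in `b`
b['Gavg'] = round((a['G1'] + a['G2'] + a['G3']) / 3, 2)
print(b)

# NaN > 15 evaluates to False, so such rows are labelled 0
b['Passed'] = np.where(b['Gavg'] > 15, 1, 0)
print(b['Passed'].tolist())
```

Computing the average from the classification frame itself would avoid the NaN values; whether that changes anything downstream depends on which rows were dropped.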

# 4. Regression

## Make dummy variable

In [10]:
df = pd.get_dummies(df, drop_first=True)
df.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3,school_MS,sex_M,address_U,famsize_LE3,Pstatus_T,Mjob_health,Mjob_other,Mjob_services,Mjob_teacher,Fjob_health,Fjob_other,Fjob_services,Fjob_teacher,reason_home,reason_other,reason_reputation,guardian_mother,guardian_other,schoolsup_yes,famsup_yes,paid_yes,activities_yes,nursery_yes,higher_yes,internet_yes,romantic_yes
0,18,4,4,2,2,0,4,3,4,1,1,3,6,5,6,6,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,1,0,0
1,17,1,1,1,2,0,5,3,3,1,1,3,4,5,5,6,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0
2,15,1,1,1,2,3,4,3,2,2,3,3,10,7,8,10,0,0,1,1,1,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,1,0,1,1,1,0
3,15,4,2,1,3,0,3,2,2,1,1,5,2,15,14,15,0,0,1,0,1,1,0,0,0,0,0,1,0,1,0,0,1,0,0,1,1,1,1,1,1,1
4,16,3,3,1,2,0,4,3,2,1,2,5,4,6,10,10,0,0,1,0,1,0,1,0,0,0,1,0,0,1,0,0,0,0,0,1,1,0,1,1,0,0
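
As a small illustration of what `drop_first=True` does, the sketch below encodes two toy categorical columns (the level names are borrowed from the table above, but the frame itself is made up). Each column with k levels turns into k-1 indicator columns, with the first level in sorted order acting as the implicit baseline.

```python
import pandas as pd

toy = pd.DataFrame({'age': [18, 17, 15],
                    'sex': ['F', 'M', 'F'],
                    'Mjob': ['at_home', 'health', 'other']})

# Numeric columns pass through; object columns are one-hot encoded
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
# ['age', 'sex_M', 'Mjob_health', 'Mjob_other']
```

Dropping one level per column avoids perfectly collinear indicator columns, which matters for a linear model with an intercept.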


## Do the regression

I will use <code>G3</code> as the output variable and all other columns as inputs.

In [11]:
out = df['G3']
inp = df.drop(['G3'], axis=1) 

In [12]:
# split the data into train and test by 80:20
x_train, x_test, y_train, y_test = train_test_split(inp, out, test_size=0.2, random_state=29)

In [13]:
# load the algorithm
model = LinearRegression()

In [14]:
# train the model
model.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [15]:
# predict the y using trained model
y_train_pred = model.predict(x_train)
y_test_pred = model.predict(x_test)

## Check the result

In [16]:
# model result
print('Coefficients:\n',model.coef_)
print('\n')
print('Intercept:',model.intercept_)

Coefficients:
 [ 0.00846679 -0.03854367 -0.00405499  0.02275569  0.04568779 -0.03128242
  0.1521829  -0.06889094 -0.09737496  0.04455523  0.00415405 -0.09150857
 -0.00225146  0.10441835  0.86662191  0.0432138   0.05206804  0.157063
  0.0071835  -0.2321538   0.22291949 -0.19148592  0.05332517  0.11635028
  0.25015302  0.28577438  0.12607162  0.36524252  0.13318228 -0.07263221
  0.14716473 -0.04349625 -0.26658121 -0.38235645  0.14020916 -0.23049683
  0.02780167 -0.15704109 -0.05318095 -0.03837345  0.01490872]


Intercept: 0.5937471350238344
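
For reference, these numbers are used as follows: a prediction is the intercept plus the dot product of the coefficients with the input row. The sketch below checks that relationship on made-up data, fitting by ordinary least squares with NumPy rather than with the notebook's `LinearRegression` object, so it is an illustration of the formula and not a reproduction of the model above.

```python
import numpy as np

# Made-up inputs (two features) and targets generated from a known line
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y = 0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# Ordinary least squares with an explicit column of ones for the intercept
A = np.column_stack([np.ones(len(X)), X])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coef = solution[0], solution[1:]

# A prediction is intercept + sum(coef_j * x_j)
y_pred = intercept + X @ coef
print(intercept, coef)
```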


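We get a good result: $R^2 \approx 0.92$ on the test data. Keep in mind that <code>G1</code> and <code>G2</code> are among the inputs, and they are strongly correlated with <code>G3</code>.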

In [19]:
# MSE and R^2
print("MSE :", metrics.mean_squared_error(y_test,y_test_pred))
print("R squared :", metrics.r2_score(y_test,y_test_pred))

MSE : 0.9754670615373
R squared : 0.9191270911110324
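
As a reminder of what these two metrics measure, here is a sketch that computes them by hand on made-up numbers. The formulas are the standard ones, $\text{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ and $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$; the arrays are illustrative, not the notebook's test set.

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([11.0, 12.0, 13.0, 16.0])

# Sum of squared residuals and total sum of squares around the mean
residual_ss = np.sum((y_true - y_hat) ** 2)
total_ss = np.sum((y_true - y_true.mean()) ** 2)

mse = residual_ss / len(y_true)
r2 = 1 - residual_ss / total_ss
print(mse, r2)
```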


## Check our model's performance
I will make a new dataframe that consists of:
* Test prediction (our result)
* Target data (real result)
* Difference in %

In [22]:
# - Test prediction
performance = pd.DataFrame(y_test_pred, columns=['Prediction'])
# - Target data
y_test = y_test.reset_index(drop=True)
performance['Target'] = y_test
# - The difference in %
performance['Difference (%)']= np.absolute((performance['Target'] 
                                            - performance['Prediction'])/
                                           performance['Target']*100)
performance.head()

Unnamed: 0,Prediction,Target,Difference (%)
0,12.889541,13,0.849687
1,11.964955,11,8.772321
2,13.188101,13,1.446934
3,16.253536,18,9.702577
4,12.328043,12,2.733689
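
The first rows of this table can be reproduced by hand, which is a quick sanity check on the formula. The predictions below are copied from the output above, where they are shown rounded to six decimals, so the last digits of the result may differ slightly.

```python
# (prediction, target) pairs copied from the first rows of the table
rows = [(12.889541, 13), (11.964955, 11), (13.188101, 13)]

# Difference (%) = |target - prediction| / target * 100
differences = [abs(target - pred) / target * 100 for pred, target in rows]

for (pred, target), diff in zip(rows, differences):
    print(f"prediction={pred:.2f} target={target} difference={diff:.2f}%")
```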


The mean difference between prediction and target on the test data is only $7.69\%$.

In [23]:
# check the summary statistics
performance.describe()

Unnamed: 0,Prediction,Target,Difference (%)
count,72.0,72.0,72.0
mean,11.011651,11.222222,7.686405
std,3.362387,3.497372,7.205985
min,4.663673,5.0,0.297671
25%,8.442313,9.0,2.165567
50%,10.611898,11.0,6.359291
75%,13.252459,13.25,10.183016
max,19.559877,20.0,32.915161


# 5. Classification

## Make target value

In [25]:
# make a new target value, Passed: 1 if the average grade is above 15, else 0
df2['Passed']= np.where(df2['Gavg'] > 15, 1, 0)
df2.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3,Gavg,Passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6,5.67,0
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,4,5,5,6,5.33,0
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,3,yes,no,yes,no,yes,yes,yes,no,4,3,2,2,3,3,10,7,8,10,8.33,0
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,yes,yes,yes,yes,yes,yes,3,2,2,1,1,5,2,15,14,15,14.67,0
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,yes,no,yes,yes,no,no,4,3,2,1,2,5,4,6,10,10,8.67,0


## Correlation between columns

In [26]:
corr = df2.corr()
corr.style.background_gradient(cmap='coolwarm').set_precision(2)

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3,Gavg,Passed
age,1.0,-0.16,-0.16,0.07,-0.0,0.24,0.05,0.02,0.13,0.13,0.12,-0.06,0.18,-0.06,-0.14,-0.16,-0.11,-0.09
Medu,-0.16,1.0,0.62,-0.17,0.06,-0.24,-0.0,0.03,0.06,0.02,-0.05,-0.05,0.1,0.21,0.22,0.22,0.19,0.17
Fedu,-0.16,0.62,1.0,-0.16,-0.01,-0.25,-0.0,-0.01,0.04,0.0,-0.01,0.01,0.02,0.19,0.16,0.15,0.17,0.13
traveltime,0.07,-0.17,-0.16,1.0,-0.1,0.09,-0.02,-0.02,0.03,0.14,0.13,0.01,-0.01,-0.09,-0.15,-0.12,-0.1,-0.05
studytime,-0.0,0.06,-0.01,-0.1,1.0,-0.17,0.04,-0.14,-0.06,-0.2,-0.25,-0.08,-0.06,0.16,0.14,0.1,0.13,0.06
failures,0.24,-0.24,-0.25,0.09,-0.17,1.0,-0.04,0.09,0.12,0.14,0.14,0.07,0.06,-0.35,-0.36,-0.36,-0.31,-0.15
famrel,0.05,-0.0,-0.0,-0.02,0.04,-0.04,1.0,0.15,0.06,-0.08,-0.11,0.09,-0.04,0.02,-0.02,0.05,0.01,-0.01
freetime,0.02,0.03,-0.01,-0.02,-0.14,0.09,0.15,1.0,0.29,0.21,0.15,0.08,-0.06,0.01,-0.01,0.01,-0.01,0.02
goout,0.13,0.06,0.04,0.03,-0.06,0.12,0.06,0.29,1.0,0.27,0.42,-0.01,0.04,-0.15,-0.16,-0.13,-0.17,-0.05
Dalc,0.13,0.02,0.0,0.14,-0.2,0.14,-0.08,0.21,0.27,1.0,0.65,0.08,0.11,-0.09,-0.06,-0.05,-0.14,-0.1


## Choose the most correlated features

In [66]:
column_highest = ['Medu','Fedu','Dalc','Walc']
y = df2['Passed']
x = df2[column_highest]
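
One way to make this kind of selection reproducible is to rank the numeric columns by the absolute value of their correlation with the target. The sketch below does that on a small made-up frame. The column names mirror the dataset, but the values and the choice of `top_k` are assumptions for illustration, and this is not necessarily how the list above was chosen.

```python
import pandas as pd

# Made-up values; only the column names come from the dataset
toy = pd.DataFrame({
    'Medu':   [4, 1, 3, 4, 2, 1],
    'Dalc':   [1, 3, 2, 1, 1, 4],
    'famrel': [4, 3, 4, 4, 5, 4],
    'Passed': [1, 0, 0, 1, 0, 0],
})

# Correlation of every column with the target, excluding the target itself
corr_with_target = toy.corr()['Passed'].drop('Passed')

# Keep the top_k columns with the largest absolute correlation
top_k = 2
ranked = corr_with_target.abs().sort_values(ascending=False)
selected = ranked.index[:top_k].tolist()
print(selected)
```

Using the absolute value keeps strongly negative correlations in the ranking, since they are as informative as positive ones.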

## Do the classification: K-neighbors

In [67]:
# split the data into test and train
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.2, random_state=29)

In [72]:
# load the algorithm
model = neighbors.KNeighborsClassifier(n_neighbors=10)

In [73]:
# train the model
model.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform')

In [74]:
# predict the y using trained model
y_train_pred = model.predict(x_train)
y_test_pred = model.predict(x_test)
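
To make the role of `n_neighbors` concrete, here is a minimal pure-Python sketch of the idea behind a k-nearest-neighbours classifier: find the k training points closest to a query (Euclidean distance, matching `p=2` above) and take a majority vote. It is an illustration on made-up points, not scikit-learn's implementation, and `knn_predict` is a hypothetical helper name.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest points."""
    distances = [(math.dist(point, query), label)
                 for point, label in zip(train_x, train_y)]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up training points with two features each
train_x = [(1, 1), (1, 2), (2, 1), (4, 4), (4, 5), (5, 4)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_x, train_y, query=(1.5, 1.5), k=3))  # 0
print(knn_predict(train_x, train_y, query=(4.5, 4.5), k=3))  # 1
```

A larger k smooths the decision by averaging over more neighbours, at the cost of blurring small clusters.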

## Check the result

We get a good result: the accuracy on the test data is $89.87\%$.

In [75]:
# evaluate classification model - accuracy
accuracy_test = metrics.accuracy_score(y_test,y_test_pred)
print('Accuracy Test Data: {}'.format(accuracy_test))

Accuracy Test Data: 0.8987341772151899
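
Accuracy is the share of test labels predicted correctly. When one class is much more common than the other, it helps to compare it with the accuracy of always predicting the majority class. The sketch below uses made-up labels (not the notebook's `y_test`) to show both numbers; `accuracy` is a hypothetical helper.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels: 8 negatives and 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Baseline: always predict the most common label in y_true
majority_label = Counter(y_true).most_common(1)[0][0]
baseline_pred = [majority_label] * len(y_true)

print(accuracy(y_true, y_pred))         # 0.8
print(accuracy(y_true, baseline_pred))  # 0.8
```

In this toy case the classifier is no better than the baseline even though its accuracy looks high, so the same comparison may be worth running on the real test labels.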


# Conclusion
We have fitted a linear regression with $R^2 \approx 0.92$ on the test data and a K-Neighbors classifier with a test accuracy of $89.87\%$.