# Predicting Student Performance

In this project, I evaluate the performance and predictive power of a model trained and tested on data about student performance in secondary education at two Portuguese schools. A model trained on this data that proves to be a good fit could then be used to predict a student's final grade.

The dataset for this project originates from the <a href="https://archive.ics.uci.edu/ml/datasets/Student+Performance">UCI Machine Learning Repository</a>. The student performance data was collected in 2008, and each of the 395 entries describes one student using 33 features.

## Import libraries and read the data

I start by reading the data and then select, from the 33 features, the ones needed for the model.

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import sklearn
# model_selection is imported explicitly because train_test_split is used later
from sklearn import linear_model, model_selection

In [2]:
#Load the data
data = pd.read_csv("student-mat.csv", sep=";")
data

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,20,U,LE3,A,2,2,services,services,...,5,5,4,4,5,4,11,9,9,9
391,MS,M,17,U,LE3,T,3,1,services,services,...,2,4,5,3,4,2,3,14,16,16
392,MS,M,21,R,GT3,T,1,1,other,other,...,5,5,3,3,3,3,3,10,8,7
393,MS,M,18,R,LE3,T,3,2,services,other,...,4,4,1,3,4,5,0,11,12,10


## Select features

Since 33 features are too many for this simple model, I chose six columns that I consider the most important: five input features and the target, G3.
The columns are:


* G1 - first period grade (numeric: from 0 to 20)
* G2 - second period grade (numeric: from 0 to 20)
* G3 - final grade (numeric: from 0 to 20, output target)
* studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
* failures - number of past class failures (numeric: n if 1<=n<3, else 4)
* absences - number of school absences (numeric: from 0 to 93)

In [3]:
#Features selection for the model
data = data[["G1","G2","G3","studytime","failures","absences"]]
data

Unnamed: 0,G1,G2,G3,studytime,failures,absences
0,5,6,6,2,0,6
1,5,5,6,2,0,4
2,7,8,10,2,3,10
3,15,14,15,3,0,2
4,6,10,10,2,0,4
...,...,...,...,...,...,...
390,9,9,9,2,2,11
391,14,16,16,1,0,3
392,10,8,7,1,3,3
393,11,12,10,1,0,0


## Determine inputs (X) and output (y)

The method used in this model is linear regression. In statistics, linear regression is a linear approach to modeling the relationship between a dependent variable and one or more explanatory (independent) variables. The independent variables are the inputs, and the dependent variable is the output.

The equation looks like this:
<img src="https://i2.wp.com/contentsimplicity.com/wp-content/uploads/2019/05/18d7e-1eieyrsqib85cpa32zapqwq.png?w=1080&ssl=1" width=300>



The "G3" feature is the student's final grade, which is the measure of student performance used here. So I will use G3 as the dependent variable (y) and the remaining features as the independent variables (X).

In [4]:
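predict = "G3"
# Use the columns keyword: the positional axis argument is not accepted by recent pandas versions
X = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])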

Before training the model, we first split the data into training data and test data. The model uses the training data to find the pattern between the training inputs and the training outputs. We then apply the model to the test inputs to make predictions, compare those predictions with the test outputs, and assess the model's accuracy.

## Split the data into train and test sets

In [5]:
#split the data into train and test. The proportion is 90% for train and 10% for test
x_train, x_test, y_train,y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1)
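
Because `train_test_split` shuffles the rows randomly, every run produces a different split and therefore slightly different scores and coefficients. The sketch below, which uses small made-up arrays (`X_demo`, `y_demo`) and an arbitrary seed of 42, shows how passing `random_state` makes the split repeatable. It illustrates the option and is not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data for illustration only: 20 samples with 2 features each.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# The same random_state yields the same split on every call.
split_a = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)

x_train, x_test, y_train, y_test = split_a
print(x_train.shape, x_test.shape)  # 18 training rows and 2 test rows
```

With a fixed seed, the score, coefficients, and prediction tables would stay the same each time the notebook is rerun.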

## Train the model

In [6]:
linear = linear_model.LinearRegression()

In [7]:
# Train the model with the training data
linear.fit(x_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

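The model's score on the test data, computed below, is about 0.87. For `LinearRegression`, `score` returns the coefficient of determination (R²), so the model explains roughly 87% of the variance in the test-set final grades. Because the split is random, this value changes from run to run.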

In [8]:
# Compute the R² score on the test data
linear.score(x_test, y_test)

0.8748582083256659
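
To make the meaning of this score concrete, here is a minimal sketch that computes R² by hand as 1 - SS_res / SS_tot and compares it with `score`. It uses synthetic data (`X_demo`, `y_demo`) generated with an arbitrary seed rather than the student data, so the numbers will not match the 0.87 above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only: a noisy linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

# Fit on the first 80 rows, evaluate on the last 20.
model = LinearRegression().fit(X_demo[:80], y_demo[:80])
X_eval, y_eval = X_demo[80:], y_demo[80:]

y_pred = model.predict(X_eval)
ss_res = np.sum((y_eval - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_eval - y_eval.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = model.score(X_eval, y_eval)
print(r2_manual, r2_sklearn)  # the two values agree
```

A score of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the test targets.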

The numbers below are the coefficients of the independent variables, in the column order of X: G1, G2, studytime, failures, absences.

In [9]:
linear.coef_

array([ 0.15456365,  0.98236359, -0.2015389 , -0.24037257,  0.03309301])

The number below is the intercept (the constant term).

In [10]:
linear.intercept_

-1.5305857397120075
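
Putting the coefficients and the intercept together, the fitted model from this particular run can be written out as follows. The values are rounded to three decimals, and the coefficient order is assumed to follow the column order of `X` (G1, G2, studytime, failures, absences):

$$
\widehat{G3} \approx -1.531 + 0.155\,G1 + 0.982\,G2 - 0.202\,\text{studytime} - 0.240\,\text{failures} + 0.033\,\text{absences}
$$

G1 and G2 share the same 0 to 20 scale, and in this run G2 carries far more weight than G1.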

## Test the model with the test inputs

In [11]:
predictions = linear.predict(x_test)
results = pd.DataFrame(data = predictions, columns = ['Prediction Final Score'])
# The numbers below are the predictions
results

Unnamed: 0,Prediction Final Score
0,11.458127
1,9.258854
2,8.698779
3,15.252617
4,14.877304
5,14.858093
6,4.763257
7,11.778218
8,9.347231
9,15.605739


Next, we compare the predictions with the actual final grade of each student.

In [12]:
results['Actual Score'] = y_test
results

Unnamed: 0,Prediction Final Score,Actual Score
0,11.458127,11
1,9.258854,10
2,8.698779,10
3,15.252617,15
4,14.877304,16
5,14.858093,15
6,4.763257,0
7,11.778218,12
8,9.347231,10
9,15.605739,15


## Calculating the difference

In [13]:
results["Residual"] = results["Prediction Final Score"] - results["Actual Score"]
results

Unnamed: 0,Prediction Final Score,Actual Score,Residual
0,11.458127,11,0.458127
1,9.258854,10,-0.741146
2,8.698779,10,-1.301221
3,15.252617,15,0.252617
4,14.877304,16,-1.122696
5,14.858093,15,-0.141907
6,4.763257,0,4.763257
7,11.778218,12,-0.221782
8,9.347231,10,-0.652769
9,15.605739,15,0.605739


In [14]:
results['Difference%'] = np.absolute(results['Residual']/results['Actual Score']*100)
results

Unnamed: 0,Prediction Final Score,Actual Score,Residual,Difference%
0,11.458127,11,0.458127,4.164789
1,9.258854,10,-0.741146,7.411464
2,8.698779,10,-1.301221,13.012214
3,15.252617,15,0.252617,1.684114
4,14.877304,16,-1.122696,7.016852
5,14.858093,15,-0.141907,0.946047
6,4.763257,0,4.763257,inf
7,11.778218,12,-0.221782,1.848183
8,9.347231,10,-0.652769,6.527688
9,15.605739,15,0.605739,4.038258
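
Row 6 shows `inf` because the actual score is 0, so the percentage difference divides by zero. One possible way around this, sketched below on the ten rows displayed above, is to treat the percentage as undefined (NaN) for zero scores and to summarize the errors with metrics that never divide by the actual value, such as the mean absolute error (MAE) and root mean squared error (RMSE). These metrics are a suggested addition, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# The ten rows displayed above (predictions and actual grades).
results = pd.DataFrame({
    "Prediction Final Score": [11.458127, 9.258854, 8.698779, 15.252617, 14.877304,
                               14.858093, 4.763257, 11.778218, 9.347231, 15.605739],
    "Actual Score": [11, 10, 10, 15, 16, 15, 0, 12, 10, 15],
})
results["Residual"] = results["Prediction Final Score"] - results["Actual Score"]

# Treat a zero actual score as "percentage undefined" instead of dividing by zero.
nonzero_actual = results["Actual Score"].replace(0, np.nan)
results["Difference%"] = (results["Residual"] / nonzero_actual * 100).abs()

# Error metrics in grade points; neither divides by the actual score.
mae = results["Residual"].abs().mean()
rmse = np.sqrt((results["Residual"] ** 2).mean())
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```

For these ten rows, the single student with an actual score of 0 accounts for most of the squared error, which is why the RMSE is noticeably larger than the MAE.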


In [15]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [16]:
results.sort_values(by=['Difference%'])

Unnamed: 0,Prediction Final Score,Actual Score,Residual,Difference%
22,13.98,14,-0.02,0.12
31,12.91,13,-0.09,0.67
17,8.94,9,-0.06,0.69
5,14.86,15,-0.14,0.95
19,18.22,18,0.22,1.23
3,15.25,15,0.25,1.68
7,11.78,12,-0.22,1.85
10,18.58,18,0.58,3.21
23,16.32,17,-0.68,3.98
39,15.6,15,0.6,4.02


Let's see what final grades the model predicts for three example students.

In [17]:
# Column order: G1, G2, studytime, failures, absences
student_data = [[15,19,4,0,3], # Student 1: G1=15, G2=19, studytime=4, failures=0, absences=3
                [18,10,3,1,3], # Student 2: G1=18, G2=10, studytime=3, failures=1, absences=3
                [5,5,1,2,2]    # Student 3: G1=5, G2=5, studytime=1, failures=2, absences=2
               ]

In [18]:
student_data_result = pd.DataFrame(data = linear.predict(student_data), columns = ['Predicted Score'])
student_data_result

Unnamed: 0,Predicted Score
0,18.75
1,10.33
2,3.54
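
These predictions can be reproduced by hand, since a linear model's output is the dot product of the inputs and the coefficients plus the intercept. The sketch below copies the coefficient and intercept values from the outputs printed earlier and assumes they are in the order G1, G2, studytime, failures, absences.

```python
import numpy as np

# Coefficients and intercept copied from the outputs printed earlier,
# assumed to be in the column order G1, G2, studytime, failures, absences.
coef = np.array([0.15456365, 0.98236359, -0.2015389, -0.24037257, 0.03309301])
intercept = -1.5305857397120075

# The three students passed to linear.predict above.
student_data = np.array([
    [15, 19, 4, 0, 3],
    [18, 10, 3, 1, 3],
    [5, 5, 1, 2, 2],
])

# Prediction = inputs . coefficients + intercept
manual_predictions = student_data @ coef + intercept
print(np.round(manual_predictions, 2))  # approximately 18.75, 10.33, 3.54
```

The values match the table above, which confirms that `predict` applies exactly this formula.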


## References

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
<a href="http://www3.dsi.uminho.pt/pcortez/student.pdf">Web Link</a>