<h1>Student Performance</h1>

## Dataset Name : “Student Performance Data Set”.

Source : UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Student+Performance)

### Information :
This data set covers student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and they were collected from school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (<a href='http://www3.dsi.uminho.pt/pcortez/student.pdf'>see paper source for more details</a>).
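The note about G1, G2 and G3 can be checked with a quick correlation table. The sketch below is only an illustration: the grades in `toy` are invented, and `grade_correlations` is a hypothetical helper that does not appear elsewhere in this notebook. On the real data you would pass the loaded DataFrame instead.

```python
import pandas as pd

# Toy grades, invented purely for illustration (not taken from the data set).
toy = pd.DataFrame({
    "G1": [5, 7, 15, 6, 12, 10],
    "G2": [6, 8, 14, 10, 12, 9],
    "G3": [6, 10, 15, 10, 13, 9],
})

def grade_correlations(df, target="G3"):
    """Pearson correlation of every numeric column with the target column."""
    corr = df.select_dtypes("number").corr()[target]
    # Drop the target's correlation with itself and list the strongest first.
    return corr.drop(target).sort_values(ascending=False)

g3_corr = grade_correlations(toy)
print(g3_corr)
```

If G1 and G2 dominate this table on the real data, a model trained with them is mostly reading off the earlier grades, which is why predictions made without them are harder but more useful.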

In [7]:
# import libraries
import numpy as np
import matplotlib
import pandas as pd
import sklearn

In [8]:
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC 

In [9]:
# Load Dataset
url = "data/"
matDS=pd.read_csv(url+'student-mat.csv',sep=';')
porDS=pd.read_csv(url+'student-por.csv',sep=';')
# Merging two datasets 
dataSet=pd.concat([matDS,porDS])
dataSet.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10
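Concatenating the two files as above keeps each file's own row numbers, so index values repeat, and the combined frame no longer records which subject a row came from. A minimal sketch of one way to handle both, assuming a new `subject` column is acceptable (the column name and the helper `combine_subjects` are hypothetical, and the toy frames are invented):

```python
import pandas as pd

def combine_subjects(mat_df, por_df):
    """Stack both frames, tag each row with its subject, and renumber the index."""
    tagged = [mat_df.assign(subject="mat"), por_df.assign(subject="por")]
    return pd.concat(tagged, ignore_index=True)

# Tiny invented frames standing in for student-mat.csv and student-por.csv.
mat_toy = pd.DataFrame({"school": ["GP", "GP"], "G3": [6, 10]})
por_toy = pd.DataFrame({"school": ["GP", "MS", "MS"], "G3": [11, 12, 9]})

combined = combine_subjects(mat_toy, por_toy)
print(combined)
```

If a `subject` column is added this way, it is a string feature and would need the same encoding as the other binary fields before modeling.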


### Features:

In [10]:
dataSet.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

1. school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2. sex - student's sex (binary: 'F' - female or 'M' - male)
3. age - student's age (numeric: from 15 to 22)
4. address - student's home address type (binary: 'U' - urban or 'R' - rural)
5. famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6. Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7. Medu - mother's education (numeric):
<ul>
    <li>0 - none</li>
    <li>1 - primary education (4th grade)</li>
    <li>2 - 5th to 9th grade</li>
    <li>3 - secondary education </li>
    <li>4 - higher education</li>
</ul>
8. Fedu - father's education (numeric):
<ul>
    <li>0 - none</li>
    <li>1 - primary education (4th grade)</li>
    <li>2 - 5th to 9th grade</li>
    <li>3 - secondary education </li>
    <li>4 - higher education</li>
</ul>
9. Mjob - mother's job (nominal): 
<ul>
    <li>'teacher'</li>
    <li>'health' care related</li>
    <li>civil 'services' (e.g. administrative or police)</li>
    <li>'at_home'</li>
    <li> 'other'</li>
</ul>
10. Fjob - father's job (nominal): 
<ul>
    <li>'teacher'</li>
    <li>'health' care related</li>
    <li>civil 'services' (e.g. administrative or police)</li>
    <li>'at_home'</li>
    <li> 'other'</li>
</ul>
11. reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
12. guardian - student's guardian (nominal: 'mother', 'father' or 'other')
13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15. failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16. schoolsup - extra educational support (binary: yes or no)
17. famsup - family educational support (binary: yes or no)
18. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19. activities - extra-curricular activities (binary: yes or no)
20. nursery - attended nursery school (binary: yes or no)
21. higher - wants to take higher education (binary: yes or no)
22. internet - Internet access at home (binary: yes or no)
23. romantic - with a romantic relationship (binary: yes or no)
24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29. health - current health status (numeric: from 1 - very bad to 5 - very good)
30. absences - number of school absences (numeric: from 0 to 93)
31. G1 - first period grade (numeric: from 0 to 20)
32. G2 - second period grade (numeric: from 0 to 20)
#### Output target
33. G3 - final grade (numeric: from 0 to 20)

In [11]:
# remove output target and assign it to y
y=dataSet['G3']
dataSet=dataSet.drop(columns=['G3'])
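Used as-is, `y` holds raw grades from 0 to 20, so a classifier sees each distinct grade as its own class. The information section mentions a binary formulation; the sketch below shows one way to derive such a target. The pass mark of 10 is an assumption made for this example, and `to_pass_fail` and the toy grades are hypothetical.

```python
import pandas as pd

PASS_MARK = 10  # assumed threshold on the 0-20 scale; adjust to your own definition

def to_pass_fail(grades, pass_mark=PASS_MARK):
    """Map numeric grades to 1 (pass) or 0 (fail)."""
    return (grades >= pass_mark).astype(int)

toy_g3 = pd.Series([6, 6, 10, 15, 10, 0, 19], name="G3")  # made-up grades
y_binary = to_pass_fail(toy_g3)
print(y_binary.tolist())
```

With a target like this, accuracy measures how often pass/fail is predicted correctly rather than how often the exact grade is matched.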

## Preprocessing features
In this data set we have three types of features: <b>binary</b> features, which take one of two values (yes or no, or, as in the sex column, F or M for female or male); <b>numeric</b> features, which contain number values; and <b>nominal</b> features, which are filled with string values. Numeric data needs no further processing here, but binary and nominal data must be transformed.
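The three feature types can also be detected from the data instead of being listed by hand. This is a rough sketch under a simple assumption: a non-numeric column with exactly two distinct values is binary, and any other non-numeric column is nominal. `split_feature_types` is a hypothetical helper and the toy frame is invented.

```python
import pandas as pd

def split_feature_types(df):
    """Group column names into binary, nominal, and numeric lists."""
    binary, nominal, numeric = [], [], []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            numeric.append(col)
        elif df[col].nunique() == 2:
            binary.append(col)
        else:
            nominal.append(col)
    return binary, nominal, numeric

toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "Mjob": ["teacher", "health", "other", "at_home"],
    "age": [18, 17, 15, 16],
})
binary, nominal, numeric = split_feature_types(toy)
print(binary, nominal, numeric)
```

It is worth comparing the detected lists with the feature descriptions above, since a nominal column that happens to show only two values in a sample would be classed as binary by this rule.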

In [12]:
def feature_transfer_label_encoder(fieldName):
    before=dataSet.groupby(fieldName).size()
    leVar = preprocessing.LabelEncoder()
    dataSet[fieldName] =leVar.fit_transform(dataSet[fieldName]) 
    after=dataSet.groupby(fieldName).size()  
    print(fieldName,' :')
    print('    -',before.index[0],'->',after.index[0],', Count: ',after[0])
    print('    -',before.index[1],'->',after.index[1],', Count: ',after[1])

In [13]:
# Convert binary data to 0 and 1
binaryFields=['sex','address','school','famsize','Pstatus',   'schoolsup', 'famsup',
       'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic']
for f in binaryFields:
    feature_transfer_label_encoder(f)

sex  :
    - F -> 0 , Count:  591
    - M -> 1 , Count:  453
address  :
    - R -> 0 , Count:  285
    - U -> 1 , Count:  759
school  :
    - GP -> 0 , Count:  772
    - MS -> 1 , Count:  272
famsize  :
    - GT3 -> 0 , Count:  738
    - LE3 -> 1 , Count:  306
Pstatus  :
    - A -> 0 , Count:  121
    - T -> 1 , Count:  923
schoolsup  :
    - no -> 0 , Count:  925
    - yes -> 1 , Count:  119
famsup  :
    - no -> 0 , Count:  404
    - yes -> 1 , Count:  640
paid  :
    - no -> 0 , Count:  824
    - yes -> 1 , Count:  220
activities  :
    - no -> 0 , Count:  528
    - yes -> 1 , Count:  516
nursery  :
    - no -> 0 , Count:  209
    - yes -> 1 , Count:  835
higher  :
    - no -> 0 , Count:  89
    - yes -> 1 , Count:  955
internet  :
    - no -> 0 , Count:  217
    - yes -> 1 , Count:  827
romantic  :
    - no -> 0 , Count:  673
    - yes -> 1 , Count:  371


In [14]:
def feature_transfer_nominal_convertor(fieldName):
    dummy_data=pd.get_dummies(dataSet[fieldName], prefix=fieldName)
    dummy_data=dummy_data.drop(columns=[fieldName+'_other']) # remove other field from dummy data
    res=pd.concat([dataSet, dummy_data], axis=1, sort=False) # concat dummy data with main data set
    res=res.drop(columns=[fieldName]) # remove original field from result
    return res


In [15]:
nominalFields=['Mjob', 'Fjob','reason','guardian']
for f in nominalFields:
    tr=feature_transfer_nominal_convertor(f)
    dataSet=tr # fill main data set with converted data
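Dropping the `_other` dummy makes 'other' the baseline category: a row whose value was 'other' is represented by zeros in all remaining dummy columns for that field, which avoids carrying a redundant column. The same idea can be written as a single call that does not rely on the global `dataSet`. This is a sketch only; `one_hot_drop_other` is a hypothetical helper and the toy frame is invented.

```python
import pandas as pd

def one_hot_drop_other(df, fields, baseline="other"):
    """One-hot encode the given fields and drop each field's baseline dummy."""
    encoded = pd.get_dummies(df, columns=fields, dtype=int)
    # Only drop baseline columns that actually exist for a field.
    drop_cols = [f"{f}_{baseline}" for f in fields
                 if f"{f}_{baseline}" in encoded.columns]
    return encoded.drop(columns=drop_cols)

toy = pd.DataFrame({
    "age": [18, 17, 15],
    "Mjob": ["teacher", "other", "health"],
    "guardian": ["mother", "father", "other"],
})
encoded = one_hot_drop_other(toy, ["Mjob", "guardian"])
print(encoded)
```

The input frame is left unchanged, so the function can be rerun safely in a notebook cell.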

<hr/>

In [17]:
# Split-out validation dataset
X=dataSet
validation_size=0.20
seed=7
X_train,X_validation,y_train,y_validation=model_selection.train_test_split(
    X,y,test_size=validation_size,random_state=seed)

In [18]:
# Test option and evaluation metric
scoring = 'accuracy'

In [19]:
models=[]
models.append(('LR',LogisticRegression()))
models.append(('KNN',KNeighborsClassifier()))
models.append(('SVM',SVC()))

In [20]:
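# evaluate each model
results=[]
names=[]

for name,model in models:
    # random_state only takes effect with shuffle=True (recent scikit-learn raises
    # an error otherwise), so it is omitted to keep the original unshuffled folds
    kfold=model_selection.KFold(n_splits=10)
    cv_results=model_selection.cross_val_score(model,X_train,y_train,cv=kfold,scoring=scoring)
    results.append(cv_results) # keep per-fold scores for later comparison
    names.append(name)
    msg= "%s: %f (%f)" %(name,cv_results.mean(),cv_results.std())
    print(msg)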

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

LR: 0.336647 (0.032151)
KNN: 0.299182 (0.057896)
SVM: 0.362808 (0.048741)
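The convergence warning above suggests scaling the data or raising `max_iter`. A minimal sketch of both, using a scikit-learn pipeline so the scaler is fitted inside each cross-validation fold. The data here is synthetic and generated only so the example runs; `X_toy`, `y_toy` and the `max_iter` value of 1000 are assumptions, and no claim is made about how the scores on the student data would change.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (stand-in for X_train / y_train).
rng = np.random.default_rng(7)
X_toy = rng.normal(size=(200, 5)) * np.array([1, 10, 100, 1, 50])
y_toy = (X_toy[:, 0] + X_toy[:, 1] / 10 > 0).astype(int)

# Standardize features, then fit logistic regression with a higher iteration cap.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# shuffle=True is required for random_state to have an effect.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(scaled_lr, X_toy, y_toy, cv=kfold, scoring="accuracy")
print("LR (scaled): %f (%f)" % (scores.mean(), scores.std()))
```

The same pipeline object could be appended to `models` in place of the bare `LogisticRegression()`; KNN and SVM are also distance or margin based, so they may benefit from the same scaling step.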
