# Supervised Learning: Building a Student Intervention System

## Objective

The goal of this project is to identify students who might need early intervention. We want to predict whether a given student will pass or
fail based on information about their life and habits. We therefore approach this task as a classification problem with two classes, pass and fail.

In [1]:
# Import libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline



## Reading the data

In what follows we work with part of the [_Student Performance Data Set_](https://archive.ics.uci.edu/ml/datasets/student+performance)
from the UCI Machine Learning Repository. It is composed of 395 data points with 30 attributes each. The 31st attribute indicates whether
the student passed or failed. Here is a brief description of each feature:

Attributes for student-data.csv:

 * school - student's school (binary: "GP" or "MS")
 * sex - student's sex (binary: "F" - female or "M" - male)
 * age - student's age (numeric: from 15 to 22)
 * address - student's home address type (binary: "U" - urban or "R" - rural)
 * famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
 * Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
 * Medu - mother's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
 * Fedu - father's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
 * Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
 * Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
 * reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
 * guardian - student's guardian (nominal: "mother", "father" or "other")
 * traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
 * studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
 * failures - number of past class failures (numeric: n if 1<=n<3, else 4)
 * schoolsup - extra educational support (binary: yes or no)
 * famsup - family educational support (binary: yes or no)
 * paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
 * activities - extra-curricular activities (binary: yes or no)
 * nursery - attended nursery school (binary: yes or no)
 * higher - wants to take higher education (binary: yes or no)
 * internet - Internet access at home (binary: yes or no)
 * romantic - with a romantic relationship (binary: yes or no)
 * famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
 * freetime - free time after school (numeric: from 1 - very low to 5 - very high)
 * goout - going out with friends (numeric: from 1 - very low to 5 - very high)
 * Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
 * Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
 * health - current health status (numeric: from 1 - very bad to 5 - very good)
 * absences - number of school absences (numeric: from 0 to 93)
 * passed - did the student pass the final exam (binary: yes or no)


In [4]:
# Read student data
student_data = pd.read_csv("student-data.csv")

We first compute the following summary statistics of the data:

 * Total number of students
 * Number of students who passed
 * Number of students who failed
 * Graduation rate of the class (%)
 * Number of features

In [5]:
n_students = student_data.shape[0]
n_features = student_data.shape[1] - 1
n_passed = (student_data['passed'] == 'yes').sum()  # students labelled 'yes'
n_failed = (student_data['passed'] == 'no').sum()   # students labelled 'no'
grad_rate = 100.*n_passed/(n_passed + n_failed)

print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%


## Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns


In [6]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print "Target column: {}".format(target_col)

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col]  # corresponding targets/labels
# print "\nFeature values:-"
# print X_all.head()  # print the first 5 rows

Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed


### Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted. Many of them are simply `yes`/`no`, e.g. `internet`. These can
be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such
a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of
them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the
[`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies)
function to perform this transformation.
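To make the two conversions concrete, here is a minimal sketch on a made-up three-row frame (the toy values are assumptions for illustration, not rows from `student-data.csv`). It uses `Series.map` for the `yes`/`no` column, which is one of several ways to do this, and `pandas.get_dummies` for the multi-valued column.

```python
import pandas as pd

# Toy data (made up for illustration), mimicking two columns of the data set
toy = pd.DataFrame({
    "internet": ["yes", "no", "yes"],
    "Fjob": ["teacher", "other", "services"],
})

# yes/no column -> 1/0
internet = toy["internet"].map({"yes": 1, "no": 0})

# Multi-valued column -> one indicator column per distinct value.
# astype(int) normalizes the result to 0/1 regardless of the pandas version.
fjob = pd.get_dummies(toy["Fjob"], prefix="Fjob").astype(int)

# Collect everything in one numeric frame
encoded = fjob.assign(internet=internet)
print(encoded)
```

Each row ends up with exactly one `1` among the `Fjob_*` columns, which is the property described above.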

In [7]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

X_all = preprocess_features(X_all)
print "Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Split data into training and test sets

So far, we have converted all _categorical_ features into numeric values. In this next step, we split the data (both features and corresponding
labels) into training and test sets.

In [8]:
# First, decide how many training vs test samples you want
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train

# Select features (X) and corresponding labels (y) for the training and test sets
# Note: train_test_split shuffles the data, which avoids any bias due to ordering in the dataset
from sklearn.model_selection import train_test_split

# Pass the test-set size as a sample count so the split follows num_train/num_test above
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=num_test, random_state=0)

print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])
# Note: If you need a validation set, extract it from within training data

Training set: 300 samples
Test set: 95 samples


## Training and Evaluating Models
Next we perform a baseline comparison of 5 different classification models out of the box.

 - DecisionTreeClassifier
 - SVC
 - KNeighborsClassifier
 - GaussianNB
 - AdaBoostClassifier

We fit each model to the training data, predict labels for both the training and test sets, and measure the F<sub>1</sub> score. We then
repeat this process with different training set sizes (100, 200, 300), keeping the test set constant.
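For reference, the F<sub>1</sub> score is the harmonic mean of precision and recall, F<sub>1</sub> = 2 · precision · recall / (precision + recall). The sketch below computes it from raw counts, treating `yes` (passed) as the positive class. The function name `f1_from_counts` is hypothetical, and the worked example scores an imagined predictor that answers `yes` for all 395 students, using the pass/fail counts reported earlier.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # no correct positive predictions: define F1 as 0
    precision = tp / float(tp + fp)  # share of predicted 'yes' that are correct
    recall = tp / float(tp + fn)     # share of actual 'yes' that were found
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical always-'yes' predictor on the full data set:
# 265 students passed (true positives), 130 failed (false positives),
# and no passing student is missed (zero false negatives).
always_yes_f1 = f1_from_counts(tp=265, fp=130, fn=0)
print("F1 of an always-'yes' predictor: {:.3f}".format(always_yes_f1))
```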

* * *
**Model Training**  

K-fold cross-validation makes the most of a small data set like the one here. To get my bearings without prejudging the outcome, I tried
most of the classifiers seen in class out of the box. I excluded neural networks, since their recommended use requires the features to be
scaled (one of the disadvantages of that algorithm). After separating the data into a training set (size 300) and a test set (size 95), I
ran 10-fold cross-validation on the training set to get a basic handle on how the models perform on average. The result of this can be seen
in the table below. The decision tree trails all others in its F<sub>1</sub> test score while scoring perfectly on the training set (which
no other model does), so it appears to be overfitting. The SVC tops all others with an average test F<sub>1</sub> of 0.808. KNN is not far
behind with 0.781. Naive Bayes and AdaBoost are essentially tied, with test F<sub>1</sub> scores of 0.768 and 0.767 respectively.
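To illustrate what 10-fold cross-validation does with the 300 training samples, here is a minimal pure-Python sketch of the index bookkeeping. It is not the code used for the numbers above: the helper name `k_fold_indices` is hypothetical, and the folds are contiguous and unshuffled for simplicity.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs with contiguous, nearly equal folds."""
    fold_sizes = [n_samples // n_folds] * n_folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1  # spread any remainder over the first folds
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(300, 10))
for train_idx, val_idx in folds[:2]:
    print("train on {} samples, validate on {}".format(len(train_idx), len(val_idx)))
```

Every sample is used for validation exactly once and for training in the other nine rounds, which is why the averaged score is a reasonable estimate even on a small data set.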

In [13]:
# Helper Functions
import time
from IPython.display import display, HTML, Image, display_pretty
from sklearn.metrics import f1_score

# Return the classifier's training time
def timeTraining(clf, X_train, y_train):
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    return (end - start)

# Return the classifier's predictions and prediction time
def predictAndTime(clf, features):
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    return y_pred, (end - start)

# Return the f1 score for the target values and predictions
def F1(target, prediction):
    return f1_score(target.values, prediction, pos_label='yes')



In [10]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier

In [12]:
# Chosen Classifiers
chosen_clfs = [DecisionTreeClassifier(criterion = "entropy"),
        SVC(C = 1.0, kernel="rbf"),
        KNeighborsClassifier(n_neighbors = 3)]

# Test Classifiers with increasing data set size
training_sizes = [50,100,150,200,250,300]
# Choosing random index sets for each size in training_sizes 
index_list = [getIndices(size, X_train.shape[0]-1) for size in training_sizes]

for clf in chosen_clfs:
    print clf.__class__.__name__
    table = makeTable(clf, index_list, X_train, X_test, y_train, y_test)
    display_pretty(table)



NameError: name 'getIndices' is not defined
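
The `NameError` arises because `getIndices` and `makeTable` are never defined in this notebook. Their original definitions are not available, so the following is only a sketch of what such helpers might look like, inferred from how they are called: `getIndices(size, max_index)` is assumed to draw distinct random row positions, and `makeTable` is assumed to return one row of timings and F<sub>1</sub> scores per training subset. The `seed` parameter, the column names, and the synthetic demo data are all assumptions.

```python
import time

import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier


def getIndices(size, max_index, seed=0):
    """Hypothetical helper: `size` distinct random row positions in [0, max_index]."""
    rng = np.random.RandomState(seed)
    return rng.choice(max_index + 1, size=size, replace=False)


def makeTable(clf, index_list, X_train, X_test, y_train, y_test):
    """Hypothetical helper: timings and F1 scores for each training subset."""
    rows = []
    for idx in index_list:
        X_sub, y_sub = X_train.iloc[idx], y_train.iloc[idx]

        start = time.time()
        clf.fit(X_sub, y_sub)
        train_time = time.time() - start

        start = time.time()
        test_pred = clf.predict(X_test)
        predict_time = time.time() - start

        rows.append({
            "train_size": len(idx),
            "train_time": train_time,
            "predict_time": predict_time,
            "f1_train": f1_score(y_sub, clf.predict(X_sub), pos_label="yes"),
            "f1_test": f1_score(y_test, test_pred, pos_label="yes"),
        })
    columns = ["train_size", "train_time", "predict_time", "f1_train", "f1_test"]
    return pd.DataFrame(rows, columns=columns)


# Tiny synthetic demo (made-up data): the label simply mirrors feature "a"
X_demo = pd.DataFrame({"a": [0, 1] * 10, "b": [0, 0, 1, 1] * 5})
y_demo = pd.Series(["no", "yes"] * 10)
demo_indices = [getIndices(size, len(X_demo) - 1) for size in (10, 20)]
demo_table = makeTable(DecisionTreeClassifier(random_state=0), demo_indices,
                       X_demo, X_demo, y_demo, y_demo)
print(demo_table)
```

With definitions along these lines in scope, the loop over `chosen_clfs` could run as written, although the resulting numbers would depend on the random subsets drawn.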

SVC showed consistently higher f1 test scores. The DecisionTree was very fast at predicting. KNN was very fast at training.

## Choosing the Best Model

- Based on the experiments you performed earlier, in 1-2 paragraphs explain to the board of supervisors what single model you chose as the best
  model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?
- In 1-2 paragraphs explain to the board of supervisors in layman's terms how the final model chosen is supposed to work (for example if you
  chose a Decision Tree or Support Vector Machine, how does it make a prediction).
- Fine-tune the model. Use Gridsearch with at least one important parameter tuned and with at least 3 settings. Use the entire training set for this.
- What is the model's final F<sub>1</sub> score?

In [54]:
# TODO: Fine-tune your model and report the best F1 score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

scorer = make_scorer(F1)

tree_param = { 'criterion': ["entropy", "gini"], 'max_features':["sqrt", "log2"], 'max_depth': range(2,11),
              'min_samples_split':range(2,9), 'min_samples_leaf':range(1,9) }

neigh_param = {'n_neighbors' : [3,5,10,20,25,30,40], 'weights' : ['uniform', 'distance'], 'p':[1,2,3,5,10],
              'algorithm' : ['ball_tree', 'kd_tree', 'brute']}

#Perform grid Search
def gridIt(clf, params):
    #Grid search folds = 10, for consistency with previous computations
    grid_clf = GridSearchCV(clf, params,
                            scoring=scorer, n_jobs=4, cv=10)  # pass the scorer by keyword
    print clf.__class__.__name__
    print "Grid search time:", timeTraining(grid_clf, X_train, y_train)
    print "Parameters of tuned model: ", grid_clf.best_params_
    y_pred, predict_t = predictAndTime(grid_clf, X_test)
    print "f1_score and prediction time on X_test, y_test: "
    print F1(y_test, y_pred), predict_t
    print '------------------\n'
    
gridIt(KNeighborsClassifier(), neigh_param)
gridIt(DecisionTreeClassifier(), tree_param)

DecisionTreeClassifier
Grid search time: 90.4000577927
Parameters of tuned model:  {'max_features': 'sqrt', 'min_samples_split': 6, 'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 2}
f1_score and prediction time on X_test, y_test: 
0.778523489933 0.00148892402649
------------------

KNeighborsClassifier
Grid search time: 86.8100130558
Parameters of tuned model:  {'n_neighbors': 25, 'weights': 'uniform', 'algorithm': 'ball_tree', 'p': 2}
f1_score and prediction time on X_test, y_test: 
0.786666666667 0.00957107543945
------------------
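
The grid search times are driven by the number of parameter combinations, each of which is fitted once per fold. The sketch below counts them for the two grids defined in the cell above; the helper name `count_candidates` is hypothetical, and the count ignores the one extra refit on the full training set that `GridSearchCV` performs by default.

```python
def count_candidates(param_grid):
    """Number of parameter combinations in an exhaustive grid."""
    total = 1
    for values in param_grid.values():
        total *= len(values)
    return total


# Same grids as in the tuning cell
tree_param = {'criterion': ["entropy", "gini"], 'max_features': ["sqrt", "log2"],
              'max_depth': range(2, 11), 'min_samples_split': range(2, 9),
              'min_samples_leaf': range(1, 9)}
neigh_param = {'n_neighbors': [3, 5, 10, 20, 25, 30, 40], 'weights': ['uniform', 'distance'],
               'p': [1, 2, 3, 5, 10], 'algorithm': ['ball_tree', 'kd_tree', 'brute']}

n_folds = 10  # cv=10 in the grid search
tree_fits = count_candidates(tree_param) * n_folds
neigh_fits = count_candidates(neigh_param) * n_folds

print("Decision tree: {} candidates, {} fits".format(count_candidates(tree_param), tree_fits))
print("KNN: {} candidates, {} fits".format(count_candidates(neigh_param), neigh_fits))
```

The tree grid has roughly ten times as many candidates as the KNN grid, yet the two searches took a similar amount of time, which suggests that a single KNN candidate is considerably more expensive to evaluate here.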

In [60]:
tree_classifier = DecisionTreeClassifier(max_features='sqrt', min_samples_split=6, criterion='entropy', max_depth=2, min_samples_leaf=2)
tree_classifier.fit(X_train, y_train)
y_pred, predict_t = predictAndTime(tree_classifier, X_test)
print F1(y_test, y_pred), predict_t


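# Visualize the fitted tree (requires the pydotplus package and Graphviz)
from sklearn.tree import export_graphviz
import pydotplus

# export_graphviz is a function in sklearn.tree, not a method of the classifier
dot_data = export_graphviz(tree_classifier, out_file=None,
                     feature_names=list(X_train.columns),
                     class_names=list(tree_classifier.classes_),
                     filled=True, rounded=True,
                     special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data)

Image(graph.create_png())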

ImportError: No module named pydotplus

0.778523489933 0.000586986541748


## Conclusions
**Algorithm Selection:**  
* * *
KNeighborsClassifier was chosen mostly for its low computational cost. Given a battery of
possible tuning parameters, it was the only one of the algorithms that could feasibly be tuned with limited resources.
It performed better than other cheap alternatives such as Naive Bayes, so it performed reasonably
well for its cost. As seen in the benchmark table in cell [82], out of the box it was outperformed only
by the SVC, which is more costly to train. Still, its improvement after tuning was only marginal.

**Layman Explanation: KNeighborsClassifier**  
* * *
The way our model decides whether a student will pass or fail is very intuitive. Each student has 30 attributes associated with them. These
range from basic descriptors like age, sex, and health to behavioral descriptors like whether they are in a romantic relationship, how much
time they devote to study, and whether they take part in extracurricular activities. Some attributes describe things outside their control,
like whether they have internet access, the size of their family, and what kind of neighborhood they live in. Our model assigns a number to
every one of these attributes, just as you can describe a house by its longitude, its latitude, and, if it sits on a hill, its altitude. In
the same way that longitude, latitude, and altitude allow us to decide how far apart two houses are, attributes like free time and age allow
us to decide how "far apart" two students are. Since we know which students have passed and which have failed, our model basically answers
the question: is this student "closer" to students who passed or to students who failed? We could compare them to the _single_ closest
student to determine how likely they are to pass or fail. Usually, however, we tune the model to find a _group_ of students closest to them,
maybe the 4 closest students, or maybe the 10 closest. The exact number is determined while tuning. Since every student in this group is
known to have passed or failed, we can determine whether the student in question is closer to those who pass or to those who fail.
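
The idea can be sketched in a few lines of plain Python. This is an illustration only, not the tuned scikit-learn model: the function names are hypothetical, the two attributes (study time and past failures) and all six "students" are made up, and distance is plain Euclidean distance.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two students' attribute lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k):
    """Label `query` by majority vote among its k nearest known students."""
    order = sorted(range(len(train_points)),
                   key=lambda i: euclidean(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up students described by (studytime, failures), with known outcomes
known_students = [(4, 0), (3, 0), (3, 1), (1, 3), (1, 2), (2, 3)]
known_outcomes = ["yes", "yes", "yes", "no", "no", "no"]

# Two new students whose outcome we want to predict from their 3 nearest neighbors
prediction_a = knn_predict(known_students, known_outcomes, query=(4, 1), k=3)
prediction_b = knn_predict(known_students, known_outcomes, query=(1, 3), k=3)
print(prediction_a)
print(prediction_b)
```

An odd `k` avoids tied votes when there are only two classes.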

**Final F<sub>1</sub> score:**  
0.787