# Project 2: Supervised Learning
### Building a Student Intervention System

## 1. Classification vs Regression

Your goal is to identify students who might need early intervention - which type of supervised machine learning problem is this, classification or regression? Why?

>This is a classification problem, because we're trying to put students into one of two categories: those who need early intervention, and those who will perform adequately without it. It could be _phrased_ as a regression problem if, for example, we were trying to predict each student's graduation GPA from existing data, but as the problem stands we don't care about a particular point on a continuum.

## 2. Exploring the Data

Let's go ahead and read in the student dataset first.

_To execute a code cell, click inside it and press **Shift+Enter**._

In [1]:
# Import libraries
import numpy as np
import pandas as pd

In [2]:
# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"
# Note: The last column 'passed' is the target/label, all others are feature columns

Student data read successfully!


Now, can you find out the following facts about the dataset?
- Total number of students
- Number of students who passed
- Number of students who failed
- Graduation rate of the class (%)
- Number of features

_Use the code block below to compute these values. Instructions/steps are marked using **TODO**s._

In [3]:
n_students = student_data['school'].count()
n_features = student_data.dtypes.count() - 1 # because the last column is the target label
n_passed = student_data[student_data['passed']=='yes']['passed'].count()
n_failed = student_data[student_data['passed']=='no']['passed'].count()
grad_rate = np.float32(n_passed)/np.float32(n_students) * 100

print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%
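
As a quick sanity check on the figures above, the graduation rate follows directly from the pass and fail counts. The sketch below hard-codes the counts printed by the cell rather than reading the CSV, so it only verifies the arithmetic.

```python
# Counts copied from the notebook output above (not recomputed from the CSV)
n_passed = 265
n_failed = 130
n_students = n_passed + n_failed

# Graduation rate as a percentage of all students
grad_rate = 100.0 * n_passed / n_students

print("Total number of students: {}".format(n_students))
print("Graduation rate of the class: {:.2f}%".format(grad_rate))
```

Roughly two thirds of the students passed, so the two classes are not balanced.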


## 3. Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Let's first separate our data into feature and target columns, and see if any features are non-numeric.<br/>
**Note**: For this dataset, the last column (`'passed'`) is the target or label we are trying to predict.

In [4]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print "Target column: {}".format(target_col)

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col]  # corresponding targets/labels
print "\nFeature values:-"
print X_all.head()  # print the first 5 rows

Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed

Feature values:-
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...    

### Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create one new column per possible value (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and for each row assign a `1` to the column matching its value and `0` to all the others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation.
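
To make the two conversions concrete, here is a minimal pure-Python sketch of what they do to a single column. The helper names `binary_encode` and `one_hot` and the sample values are hypothetical and chosen for illustration only; the notebook itself relies on `replace()` and `pandas.get_dummies()`.

```python
def binary_encode(values):
    """Map 'yes'/'no' strings to 1/0."""
    mapping = {'yes': 1, 'no': 0}
    return [mapping[v] for v in values]


def one_hot(values, prefix):
    """Return {column name: list of 0/1 flags}, one column per distinct value."""
    categories = sorted(set(values))
    return {
        "{}_{}".format(prefix, cat): [1 if v == cat else 0 for v in values]
        for cat in categories
    }


# Made-up sample columns (not read from student-data.csv)
internet = ['no', 'yes', 'yes']
fjob = ['teacher', 'other', 'other', 'services']

print(binary_encode(internet))

dummies = one_hot(fjob, 'Fjob')
for name in sorted(dummies):
    print("{}: {}".format(name, dummies[name]))
```

Each row ends up with exactly one `1` across the generated `Fjob_*` columns, which is what makes them dummy variables.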

In [5]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

X_all = preprocess_features(X_all)
print "Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Split data into training and test sets

So far, we have converted all _categorical_ features into numeric values. In this next step, we split the data (both features and corresponding labels) into training and test sets.

In [6]:
# First, decide how many training vs test samples you want
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train

from sklearn.cross_validation import StratifiedShuffleSplit
splitter = StratifiedShuffleSplit(y_all, 1, test_size=num_test, random_state=29)

for train_index, test_index in splitter:
    X_train = X_all.iloc[train_index]
    y_train = y_all.iloc[train_index]
    X_test = X_all.iloc[test_index]
    y_test = y_all.iloc[test_index]

print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])


Training set: 300 samples
Test set: 95 samples
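
The split above is stratified, meaning the pass/fail proportions are kept roughly equal in the training and test sets. The following is a simplified pure-Python illustration of that idea, not scikit-learn's actual algorithm; `stratified_split` is a hypothetical helper, and the labels are synthesized from the class counts reported earlier rather than read from the data file.

```python
import random


def stratified_split(labels, n_test, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    n_total = len(labels)
    test_idx = []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        # Each class contributes test rows in proportion to its frequency.
        # Rounding can leave the total off by one in general; it sums to 95 here.
        k = int(round(float(n_test) * len(idx) / n_total))
        test_idx.extend(idx[:k])

    test_set = set(test_idx)
    train_idx = [i for i in range(n_total) if i not in test_set]
    return train_idx, sorted(test_idx)


# Synthetic labels matching the class counts printed earlier (265 passed, 130 failed)
labels = ['yes'] * 265 + ['no'] * 130
train_idx, test_idx = stratified_split(labels, n_test=95, seed=29)

n_test_passed = sum(1 for i in test_idx if labels[i] == 'yes')
print("Training set: {} samples".format(len(train_idx)))
print("Test set: {} samples, {} of them passed".format(len(test_idx), n_test_passed))
```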


## 4. Training and Evaluating Models
Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:

- What is the theoretical O(n) time & space complexity in terms of input size?
- What are the general applications of this model? What are its strengths and weaknesses?
- Given what you know about the data so far, why did you choose this model to apply?
- Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F<sub>1</sub> score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.

Produce a table showing training time, prediction time, F<sub>1</sub> score on training set and F<sub>1</sub> score on test set, for each training set size.

Note: You need to produce 3 such tables - one for each model.
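
The F<sub>1</sub> score requested above is the harmonic mean of precision and recall for the positive class (`'yes'` in this project). The following is a minimal pure-Python sketch of that definition on made-up labels; `f1_for_label` and the toy lists are illustrative only, and the notebook itself uses `sklearn.metrics.f1_score`.

```python
def f1_for_label(y_true, y_pred, pos_label='yes'):
    """Compute F1 = 2 * precision * recall / (precision + recall) for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = float(tp) / (tp + fp)
    recall = float(tp) / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 2 true positives, 1 false positive, 1 false negative
y_true = ['yes', 'yes', 'yes', 'no', 'no']
y_pred = ['yes', 'yes', 'no', 'yes', 'no']

toy_f1 = f1_for_label(y_true, y_pred)
print("F1 score: {:.3f}".format(toy_f1))
```

With precision and recall both equal to 2/3 in this toy case, the F<sub>1</sub> score is also 2/3.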

In [7]:
# Train a model
import time

def train_classifier(clf, X_train, y_train):
    print "Training {}...".format(clf.__class__.__name__)
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    print "Done!\nTraining time (secs): {:.3f}".format(end - start)

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()

# Fit model to training data
train_classifier(clf, X_train, y_train)  # note: using entire training set here
print clf  # you can inspect the learned model by printing it

Training KNeighborsClassifier...
Done!
Training time (secs): 0.005
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')


In [8]:
# Predict on training set and compute F1 score
from sklearn.metrics import f1_score

def predict_labels(clf, features, target):
    print "Predicting labels using {}...".format(clf.__class__.__name__)
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    print "Done!\nPrediction time (secs): {:.3f}".format(end - start)
    return f1_score(target.values, y_pred, pos_label='yes')

train_f1_score = predict_labels(clf, X_train, y_train)
print "F1 score for training set: {}".format(train_f1_score)

Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.006
F1 score for training set: 0.856492027335


In [9]:
# Predict on test data
print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test))

Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.002
F1 score for test set: 0.742857142857


In [10]:
# Train and predict using different training set sizes
def train_predict(clf, X_train, y_train, X_test, y_test):
    print "------------------------------------------"
    print "Training set size: {}".format(len(X_train))
    train_classifier(clf, X_train, y_train)
    print "F1 score for training set: {}".format(predict_labels(clf, X_train, y_train))
    print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test))

train_predict(clf, X_train[0:50], y_train[0:50], X_test, y_test)
train_predict(clf, X_train[0:100], y_train[0:100], X_test, y_test)
train_predict(clf, X_train[0:150], y_train[0:150], X_test, y_test)
train_predict(clf, X_train[0:200], y_train[0:200], X_test, y_test)
train_predict(clf, X_train[0:250], y_train[0:250], X_test, y_test)
train_predict(clf, X_train, y_train, X_test, y_test)
# Note: Keep the test set constant

------------------------------------------
Training set size: 50
Training KNeighborsClassifier...
Done!
Training time (secs): 0.001
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.9
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.805369127517
------------------------------------------
Training set size: 100
Training KNeighborsClassifier...
Done!
Training time (secs): 0.000
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.853333333333
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.772413793103
------------------------------------------
Training set size: 150
Training KNeighborsClassifier...
Done!
Training time (secs): 0.000
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.002
F1 score for training set: 0.8744

In [11]:
from sklearn.tree import DecisionTreeClassifier
d_clf = DecisionTreeClassifier(random_state=29)
train_classifier(d_clf, X_train, y_train)

train_predict(d_clf, X_train[0:50], y_train[0:50], X_test, y_test)
train_predict(d_clf, X_train[0:100], y_train[0:100], X_test, y_test)
train_predict(d_clf, X_train[0:150], y_train[0:150], X_test, y_test)
train_predict(d_clf, X_train[0:200], y_train[0:200], X_test, y_test)
train_predict(d_clf, X_train[0:250], y_train[0:250], X_test, y_test)
train_predict(d_clf, X_train, y_train, X_test, y_test)


Training DecisionTreeClassifier...
Done!
Training time (secs): 0.004
------------------------------------------
Training set size: 50
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.001
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 1.0
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.739130434783
------------------------------------------
Training set size: 100
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.001
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for training set: 1.0
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.728682170543
------------------------------------------
Training set size: 150
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.001
Predicting labels using DecisionTreeClassifie

In [12]:
from sklearn.naive_bayes import GaussianNB
nb_clf = GaussianNB()
train_classifier(nb_clf, X_train, y_train)

train_predict(nb_clf, X_train[0:50], y_train[0:50], X_test, y_test)
train_predict(nb_clf, X_train[0:100], y_train[0:100], X_test, y_test)
train_predict(nb_clf, X_train[0:150], y_train[0:150], X_test, y_test)
train_predict(nb_clf, X_train[0:200], y_train[0:200], X_test, y_test)
train_predict(nb_clf, X_train[0:250], y_train[0:250], X_test, y_test)
train_predict(nb_clf, X_train, y_train, X_test, y_test)

Training GaussianNB...
Done!
Training time (secs): 0.001
------------------------------------------
Training set size: 50
Training GaussianNB...
Done!
Training time (secs): 0.000
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.62962962963
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.325581395349
------------------------------------------
Training set size: 100
Training GaussianNB...
Done!
Training time (secs): 0.000
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.436781609195
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.2
------------------------------------------
Training set size: 150
Training GaussianNB...
Done!
Training time (secs): 0.001
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.415384615385
Predicting labe

In [13]:
from sklearn import svm
svm_clf = svm.SVC(random_state=29)
train_classifier(svm_clf, X_train, y_train)

train_predict(svm_clf, X_train[0:50], y_train[0:50], X_test, y_test)
train_predict(svm_clf, X_train[0:100], y_train[0:100], X_test, y_test)
train_predict(svm_clf, X_train[0:150], y_train[0:150], X_test, y_test)
train_predict(svm_clf, X_train[0:200], y_train[0:200], X_test, y_test)
train_predict(svm_clf, X_train[0:250], y_train[0:250], X_test, y_test)
train_predict(svm_clf, X_train, y_train, X_test, y_test)

Training SVC...
Done!
Training time (secs): 0.012
------------------------------------------
Training set size: 50
Training SVC...
Done!
Training time (secs): 0.001
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.90243902439
Predicting labels using SVC...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.812903225806
------------------------------------------
Training set size: 100
Training SVC...
Done!
Training time (secs): 0.001
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.888888888889
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.807692307692
------------------------------------------
Training set size: 150
Training SVC...
Done!
Training time (secs): 0.002
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.891891891892
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001


### K Nearest Neighbors

| Training Set Size | Training Time (secs) | Prediction Time (secs) | F1 Score (train) | F1 Score (test) |
| ----------------- |:-------------:| ---------------:|  ---------------:|  --------------:|
| 50                | 0.001         | 0.001           | 0.9              | 0.805369127517  |
| 100               | 0.000         | 0.001           | 0.853333333333   | 0.772413793103  |
| 150               | 0.000         | 0.002           | 0.874418604651   | 0.794520547945  |
| 200               | 0.001         | 0.002           | 0.88             | 0.757142857143  |
| 250               | 0.001         | 0.004           | 0.854794520548   | 0.765957446809  |
| 300               | 0.001         | 0.004           | 0.856492027335   | 0.742857142857  |


### Decision Tree


| Training Set Size | Training Time (secs) | Prediction Time (secs) | F1 Score (train) | F1 Score (test) |
| ----------------- |:-------------:| ---------------:|  ---------------:|  --------------:|
| 50                | 0.001         | 0.001           | 1.0              | 0.739130434783  |
| 100               | 0.001         | 0.000           | 1.0              | 0.728682170543  |
| 150               | 0.001         | 0.000           | 1.0              | 0.702290076336  |
| 200               | 0.001         | 0.000           | 1.0              | 0.754098360656  |
| 250               | 0.001         | 0.000           | 1.0              | 0.725806451613  |
| 300               | 0.002         | 0.000           | 1.0              | 0.761194029851  |



### Naive Bayes


| Training Set Size | Training Time (secs) | Prediction Time (secs) | F1 Score (train) | F1 Score (test) |
| ----------------- |:-------------:| ---------------:|  ---------------:|  --------------:|
| 50                | 0.000         | 0.000           | 0.62962962963    | 0.325581395349  |
| 100               | 0.000         | 0.000           | 0.436781609195   | 0.2             |
| 150               | 0.001         | 0.000           | 0.415384615385   | 0.302325581395  |
| 200               | 0.001         | 0.000           | 0.807272727273   | 0.781954887218  |
| 250               | 0.001         | 0.000           | 0.810495626822   | 0.746031746032  |
| 300               | 0.001         | 0.000           | 0.799043062201   | 0.781954887218  |


### Support Vector Machine


| Training Set Size | Training Time (secs) | Prediction Time (secs) | F1 Score (train) | F1 Score (test) |
| ----------------- |:-------------:| ---------------:|  ---------------:|  --------------:|
| 50                | 0.001         | 0.001           | 0.90243902439    | 0.812903225806  |
| 100               | 0.001         | 0.001           | 0.888888888889   | 0.807692307692  |
| 150               | 0.002         | 0.001           | 0.891891891892   | 0.8             |
| 200               | 0.003         | 0.002           | 0.880258899676   | 0.797385620915  |
| 250               | 0.004         | 0.003           | 0.870466321244   | 0.797385620915  |
| 300               | 0.008         | 0.004           | 0.868421052632   | 0.797385620915  |


In [14]:
from sklearn.grid_search import GridSearchCV

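params_map = [{
    'C': [0.5, 1.0, 2.0, 5.0, 10.0, 100.0],
    'kernel': ['rbf', 'linear'],
    'gamma': ['auto', 1.0, 5.0, 10.0]
}]

# Custom scorer. Note: it ignores the x and y it is given and always
# evaluates on the held-out test set (X_test, y_test).
def f1_score_on_test(estimator, x, y):
    y_pred = estimator.predict(X_test)
    return f1_score(y_test.values, y_pred, pos_label='yes')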

gsv = GridSearchCV(svm.SVC(random_state=29), params_map, scoring=f1_score_on_test)
gsv.fit(X_train, y_train)
print("BEST PARAMS SET")
print(gsv.best_params_)

BEST PARAMS SET
{'kernel': 'rbf', 'C': 0.5, 'gamma': 'auto'}
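
scikit-learn calls a callable passed as `scoring` with the signature `scorer(estimator, X, y)`, handing it the data it wants scored for each cross-validation fold. Because `f1_score_on_test` scores on `X_test` instead, the test set influences which parameters are selected, and the test F<sub>1</sub> reported afterwards may be optimistic. The following is a sketch of a scorer that uses the arguments it receives; `f1_for_yes`, `f1_scorer`, and the `AlwaysYes` stand-in estimator are hypothetical names used only to keep the example self-contained.

```python
def f1_for_yes(y_true, y_pred):
    """F1 score treating 'yes' as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 'yes' and p == 'yes')
    fp = sum(1 for t, p in pairs if t != 'yes' and p == 'yes')
    fn = sum(1 for t, p in pairs if t == 'yes' and p != 'yes')
    if tp == 0:
        return 0.0
    precision = float(tp) / (tp + fp)
    recall = float(tp) / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_scorer(estimator, X, y):
    """Score the estimator on the X and y it is handed, not on a fixed external set."""
    return f1_for_yes(list(y), list(estimator.predict(X)))


class AlwaysYes(object):
    """Hypothetical stand-in estimator that predicts 'yes' for every row."""

    def predict(self, X):
        return ['yes' for _ in X]


# Made-up validation fold: three students passed, one did not
X_fold = [[0], [1], [2], [3]]
y_fold = ['yes', 'yes', 'no', 'yes']

fold_score = f1_scorer(AlwaysYes(), X_fold, y_fold)
print("F1 on this fold: {:.3f}".format(fold_score))
```

A function shaped like `f1_scorer` could be passed as `scoring` so that parameter selection depends only on the training folds, leaving the test set for a final, independent check.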


In [15]:
svm_optimized_clf = svm.SVC(random_state=29, kernel='rbf', C=0.5, gamma='auto')

train_classifier(svm_optimized_clf, X_train, y_train)

train_predict(svm_optimized_clf, X_train, y_train, X_test, y_test)

Training SVC...
Done!
Training time (secs): 0.006
------------------------------------------
Training set size: 300
Training SVC...
Done!
Training time (secs): 0.006
Predicting labels using SVC...
Done!
Prediction time (secs): 0.004
F1 score for training set: 0.832298136646
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.810126582278
