# Project 2: Supervised Learning
### Building a Student Intervention System

## 1. Classification vs Regression

Your goal is to identify students who might need early intervention - which type of supervised machine learning problem is this, classification or regression? Why?

>This is a classification problem, because we're trying to put students into one of 2 categories: those who need early intervention, and those who will perform adequately without it.  It's possible that this could be _phrased_ as a regression problem, if perhaps we were trying to predict students' graduation GPA based on existing data, but as the problem stands we don't care about a particular point on a continuum.

## 2. Exploring the Data

Let's go ahead and read in the student dataset first.

_To execute a code cell, click inside it and press **Shift+Enter**._

In [1]:
# Import libraries
import numpy as np
import pandas as pd

In [2]:
# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"
# Note: The last column 'passed' is the target/label, all other are feature columns

Student data read successfully!


Now, can you find out the following facts about the dataset?
- Total number of students
- Number of students who passed
- Number of students who failed
- Graduation rate of the class (%)
- Number of features

_Use the code block below to compute these values. Instructions/steps are marked using **TODO**s._

In [3]:
n_students = student_data['school'].count()
n_features = student_data.dtypes.count() - 1 # because the last column is the target label
n_passed = student_data[student_data['passed']=='yes']['passed'].count()
n_failed = student_data[student_data['passed']=='no']['passed'].count()
grad_rate = np.float32(n_passed)/np.float32(n_students) * 100

print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%


## 3. Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Let's first separate our data into feature and target columns, and see if any features are non-numeric.<br/>
**Note**: For this dataset, the last column (`'passed'`) is the target or label we are trying to predict.

In [4]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print "Target column: {}".format(target_col)

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col]  # corresponding targets/labels
print "\nFeature values:-"
print X_all.head()  # print the first 5 rows

Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed

Feature values:-
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...    

### Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation.
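
As a small illustration of the transformation described above, the sketch below applies `pandas.get_dummies()` to a made-up `Fjob` column. The three sample values are hypothetical and are not rows from the student dataset; `dtype=int` is passed only so the output shows `1`/`0` regardless of the pandas version.

```python
import pandas as pd

# Hypothetical sample values, not taken from student-data.csv
fjob = pd.Series(['teacher', 'other', 'services'], name='Fjob')

# One new column per distinct value; each row gets a 1 in exactly one of them
dummies = pd.get_dummies(fjob, prefix='Fjob', dtype=int)
print(dummies)
```

Each row sums to 1 across the generated columns, which is the "assign a `1` to one of them and `0` to all others" behaviour described above.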

In [5]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

X_all = preprocess_features(X_all)
print "Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Split data into training and test sets

So far, we have converted all _categorical_ features into numeric values. In this next step, we split the data (both features and corresponding labels) into training and test sets.

In [6]:
# First, decide how many training vs test samples you want
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train

from sklearn.cross_validation import StratifiedShuffleSplit
splitter = StratifiedShuffleSplit(y_all, 1, test_size=num_test, random_state=29)

for train_index, test_index in splitter:
    X_train = X_all.iloc[train_index]
    y_train = y_all.iloc[train_index]
    X_test = X_all.iloc[test_index]
    y_test = y_all.iloc[test_index]

print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])


Training set: 300 samples
Test set: 95 samples
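
The cell above imports from `sklearn.cross_validation`, a module that later scikit-learn releases replaced with `sklearn.model_selection`. A roughly equivalent stratified split on a newer install might look like the sketch below. The stand-in data is an assumption: it only reproduces the 265/130 class counts reported earlier, and the single feature column is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 265 'yes' and 130 'no' labels, matching the class counts above.
# The single feature column is a hypothetical placeholder.
X = np.arange(395).reshape(-1, 1)
y = np.array(['yes'] * 265 + ['no'] * 130)

# stratify=y keeps the pass/fail ratio roughly equal in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=95, stratify=y, random_state=29)

print("Training set: {} samples".format(len(X_train)))
print("Test set: {} samples".format(len(X_test)))
print("Pass rate in test set: {:.2f}".format(np.mean(y_test == 'yes')))
```

The same `random_state` will not necessarily reproduce the exact rows selected by the older API, so scores may differ slightly from the ones reported in this notebook.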


## 4. Training and Evaluating Models
Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:

- What is the theoretical O(n) time & space complexity in terms of input size?
- What are the general applications of this model? What are its strengths and weaknesses?
- Given what you know about the data so far, why did you choose this model to apply?
- Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F<sub>1</sub> score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.

Produce a table showing training time, prediction time, F<sub>1</sub> score on training set and F<sub>1</sub> score on test set, for each training set size.

Note: You need to produce 3 such tables - one for each model.
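
Since every table below reports F<sub>1</sub>, it may help to see how the score is put together. The following is a minimal plain-Python sketch of the standard definition (the harmonic mean of precision and recall); the helper name `f1_from_labels` and the five sample labels are made up for illustration, and the notebook itself uses `sklearn.metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred, pos_label='yes'):
    # Count true positives, false positives and false negatives
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0  # no correct positive predictions at all
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative
y_true = ['yes', 'yes', 'yes', 'no', 'no']
y_pred = ['yes', 'yes', 'no', 'yes', 'no']
score = f1_from_labels(y_true, y_pred)
print(score)
```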

In [7]:
# Train a model
import time

def train_classifier(clf, X_train, y_train):
    print "Training {}...".format(clf.__class__.__name__)
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    print "Done!\nTraining time (secs): {:.3f}".format(end - start)

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()

# Fit model to training data
train_classifier(clf, X_train, y_train)  # note: using entire training set here
print clf  # you can inspect the learned model by printing it

Training KNeighborsClassifier...
Done!
Training time (secs): 0.005
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')


In [8]:
# Predict on training set and compute F1 score
from sklearn.metrics import f1_score

def predict_labels(clf, features, target):
    print "Predicting labels using {}...".format(clf.__class__.__name__)
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    print "Done!\nPrediction time (secs): {:.3f}".format(end - start)
    return f1_score(target.values, y_pred, pos_label='yes')

train_f1_score = predict_labels(clf, X_train, y_train)
print "F1 score for training set: {}".format(train_f1_score)

Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.006
F1 score for training set: 0.856492027335


In [9]:
# Predict on test data
print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test))

Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.002
F1 score for test set: 0.742857142857


In [10]:
# Train and predict using different training set sizes
def train_predict(clf, X_train, y_train, X_test, y_test):
    print "------------------------------------------"
    print "Training set size: {}".format(len(X_train))
    train_classifier(clf, X_train, y_train)
    print "F1 score for training set: {}".format(predict_labels(clf, X_train, y_train))
    print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test))

train_predict(clf, X_train[0:50], y_train[0:50], X_test, y_test)
train_predict(clf, X_train[0:100], y_train[0:100], X_test, y_test)
train_predict(clf, X_train[0:150], y_train[0:150], X_test, y_test)
train_predict(clf, X_train[0:200], y_train[0:200], X_test, y_test)
train_predict(clf, X_train[0:250], y_train[0:250], X_test, y_test)
train_predict(clf, X_train, y_train, X_test, y_test)
# Note: Keep the test set constant

------------------------------------------
Training set size: 50
Training KNeighborsClassifier...
Done!
Training time (secs): 0.001
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.9
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.805369127517
------------------------------------------
Training set size: 100
Training KNeighborsClassifier...
Done!
Training time (secs): 0.000
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.853333333333
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.772413793103
------------------------------------------
Training set size: 150
Training KNeighborsClassifier...
Done!
Training time (secs): 0.000
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.002
F1 score for training set: 0.8744

In [11]:
from sklearn.tree import DecisionTreeClassifier
d_clf = DecisionTreeClassifier(random_state=29)
train_classifier(d_clf, X_train, y_train)

train_predict(d_clf, X_train[0:50], y_train[0:50], X_test, y_test)
train_predict(d_clf, X_train[0:100], y_train[0:100], X_test, y_test)
train_predict(d_clf, X_train[0:150], y_train[0:150], X_test, y_test)
train_predict(d_clf, X_train[0:200], y_train[0:200], X_test, y_test)
train_predict(d_clf, X_train[0:250], y_train[0:250], X_test, y_test)
train_predict(d_clf, X_train, y_train, X_test, y_test)


Training DecisionTreeClassifier...
Done!
Training time (secs): 0.004
------------------------------------------
Training set size: 50
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.001
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 1.0
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.739130434783
------------------------------------------
Training set size: 100
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.001
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for training set: 1.0
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.728682170543
------------------------------------------
Training set size: 150
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.001
Predicting labels using DecisionTreeClassifie

In [12]:
from sklearn.naive_bayes import GaussianNB
nb_clf = GaussianNB()
train_classifier(nb_clf, X_train, y_train)

train_predict(nb_clf, X_train[0:50], y_train[0:50], X_test, y_test)
train_predict(nb_clf, X_train[0:100], y_train[0:100], X_test, y_test)
train_predict(nb_clf, X_train[0:150], y_train[0:150], X_test, y_test)
train_predict(nb_clf, X_train[0:200], y_train[0:200], X_test, y_test)
train_predict(nb_clf, X_train[0:250], y_train[0:250], X_test, y_test)
train_predict(nb_clf, X_train, y_train, X_test, y_test)

Training GaussianNB...
Done!
Training time (secs): 0.001
------------------------------------------
Training set size: 50
Training GaussianNB...
Done!
Training time (secs): 0.000
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.62962962963
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.325581395349
------------------------------------------
Training set size: 100
Training GaussianNB...
Done!
Training time (secs): 0.000
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.436781609195
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.2
------------------------------------------
Training set size: 150
Training GaussianNB...
Done!
Training time (secs): 0.001
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.415384615385
Predicting labe

In [13]:
from sklearn import svm
svm_clf = svm.SVC(random_state=29)
train_classifier(svm_clf, X_train, y_train)

train_predict(svm_clf, X_train[0:50], y_train[0:50], X_test, y_test)
train_predict(svm_clf, X_train[0:100], y_train[0:100], X_test, y_test)
train_predict(svm_clf, X_train[0:150], y_train[0:150], X_test, y_test)
train_predict(svm_clf, X_train[0:200], y_train[0:200], X_test, y_test)
train_predict(svm_clf, X_train[0:250], y_train[0:250], X_test, y_test)
train_predict(svm_clf, X_train, y_train, X_test, y_test)

Training SVC...
Done!
Training time (secs): 0.012
------------------------------------------
Training set size: 50
Training SVC...
Done!
Training time (secs): 0.001
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.90243902439
Predicting labels using SVC...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.812903225806
------------------------------------------
Training set size: 100
Training SVC...
Done!
Training time (secs): 0.001
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.888888888889
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.807692307692
------------------------------------------
Training set size: 150
Training SVC...
Done!
Training time (secs): 0.002
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.891891891892
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001


### K Nearest Neighbors

##### Complexity

Training is cheap for KNN because it defers computation until you try to use it for prediction: in the brute-force case, fitting just stores the data, so its training time is effectively O(1) beyond the cost of keeping a copy of the data (scikit-learn's `algorithm='auto'` may build a tree index instead, which costs more up front).  Prediction runtime, however, will go up as we add data. KNN needs to find the K closest points to a new data point, which means examining every point in the dataset (barring some clever sectioning of the data).  If you add 3 more data points to the set, each prediction will have to calculate 3 more distances, so prediction is O(n) per query.

Space should be O(n) as for each additional data point, we consume 1 unit of additional space to store it for use during classification of novel data.
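
To make the prediction cost concrete, here is a brute-force sketch of KNN in plain Python. It is an illustration under simplifying assumptions (squared Euclidean distance, majority vote, made-up 2-D points) and not how scikit-learn's `KNeighborsClassifier` is implemented internally.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # One squared distance per stored training point: O(n) work per query
    distances = []
    for point, label in zip(train_points, train_labels):
        d2 = sum((a - b) ** 2 for a, b in zip(point, query))
        distances.append((d2, label))
    # Sorting adds a log factor; a partial selection would avoid it
    distances.sort(key=lambda pair: pair[0])
    # Majority vote among the k nearest neighbours
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points, not student data
train_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
train_labels = ['no', 'no', 'no', 'yes', 'yes']

print(knn_predict(train_points, train_labels, (0.5, 0.5), k=3))
print(knn_predict(train_points, train_labels, (5.0, 5.5), k=3))
```

Note that "training" here is nothing more than keeping `train_points` and `train_labels` around, which is where the O(n) space comes from.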


##### General Application

KNN has the advantage of being a fairly simple algorithm whose behavior is easy to intuit.  It just finds the K most similar points to a new data point, and uses a vote between them to decide how to classify it.  This works well for low-dimensional spaces that are relatively well clustered.  One downside of the algorithm is that for clusters that have complex decision boundaries, the closest points might not be a good measure of where the new point should fit.  Another drawback is that the prediction function is computationally intensive for large data sets, so adding more data will help the accuracy, but hinder the runtime for each prediction.

##### Inclusion in Experiment

Nearest Neighbors seems to have some intuitively correct application to this problem. If a student's habits, health, and employment are all very close to points for other passing students, it seems likely that student will be more likely to pass.

##### Results

| Training Set Size | Training Time | Prediction Time | F1 Score (train) | F1 Score (test) |
| ----------------- |:-------------:| ---------------:|  ---------------:|  --------------:|
| 50                | 0.001         | 0.001           | 0.9              | 0.805369127517  |
| 100               | 0.000         | 0.001           | 0.853333333333   | 0.772413793103  |
| 150               | 0.000         | 0.002           | 0.874418604651   | 0.794520547945  |
| 200               | 0.001         | 0.002           | 0.88             | 0.757142857143  |
| 250               | 0.001         | 0.004           | 0.854794520548   | 0.765957446809  |
| 300               | 0.001         | 0.004           | 0.856492027335   | 0.742857142857  |


### Decision Tree

##### Complexity

As the number of points in a dataset goes up, it will be possible to continue to subdivide them further, so both training and prediction runtime will rise.  However, adding one more point doesn't add one more decision on the path to a leaf.  For a reasonably balanced tree the depth grows as O(log(n)), so prediction is roughly O(log(n)).  Training costs more, since every split has to scan the points in its node for each candidate feature: roughly O(m\*n\*log(n)) for m features.

Space grows too, since there are more decisions to keep track of.  Unlike KNN, the tree doesn't store the training points themselves, only its nodes; a fully grown tree has at most one leaf per training point, so space is at most O(n).


##### General Application

I like decision trees because the resulting algorithm is inspectable.  I might not be able to intuit why a particular feature is resulting in information gain for a particular decision, but I can confirm for myself in the data that it is the case.

As a downside, decision trees overfit fairly easily, especially if allowed to continue to create additional decisions deep in the tree (continuing to split leaves that already only have a handful of points in them).
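
Because the paragraphs above lean on the idea of information gain, here is a minimal sketch of how it can be computed for one candidate split. Entropy is used because the text talks about information gain; scikit-learn's `DecisionTreeClassifier` defaults to Gini impurity instead, so this is not the exact criterion used in the experiment. The helper names and the four sample labels are made up.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels
    total = float(len(labels))
    return -sum((c / total) * math.log(c / total, 2)
                for c in Counter(labels).values())

def information_gain(labels, left, right):
    # Parent entropy minus the size-weighted entropy of the two children
    n = float(len(labels))
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Hypothetical labels for four students
labels = ['yes', 'yes', 'no', 'no']
perfect = information_gain(labels, ['yes', 'yes'], ['no', 'no'])
useless = information_gain(labels, ['yes', 'no'], ['yes', 'no'])
print(perfect)
print(useless)
```

A split that separates the classes completely recovers the full bit of entropy, while a split that leaves both children as mixed as the parent gains nothing.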

##### Inclusion in Experiment

This is another algorithm I included for intuitive reasons, because if I were to want to classify this data myself without a program, it's analogous to how I would want to go about it.  I'd want to look at a data feature I suspected of being relevant, and see how splitting the data along those lines looked in terms of clustering, and then would look for additional features to effectively separate the remainder of the data points.

##### Results

| Training Set Size | Training Time | Prediction Time | F1 Score (train) | F1 Score (test) |
| ----------------- |:-------------:| ---------------:|  ---------------:|  --------------:|
| 50                | 0.001         | 0.001           | 1.0              | 0.739130434783  |
| 100               | 0.001         | 0.000           | 1.0              | 0.728682170543  |
| 150               | 0.001         | 0.000           | 1.0              | 0.702290076336  |
| 200               | 0.001         | 0.000           | 1.0              | 0.754098360656  |
| 250               | 0.001         | 0.000           | 1.0              | 0.725806451613  |
| 300               | 0.002         | 0.000           | 1.0              | 0.761194029851  |



### Naive Bayes

##### Complexity

Because you usually have more data records than features, and because Bayes is calculating a one-dimensional distribution per feature, it's actually more expensive in terms of runtime to add another feature column than to add more data rows.  Runtime should scale at O(n) for training: each additional data point is one more item to include in each feature's distribution calculation.

Space should be O(1) with respect to the number of data points, since the distributions don't take on additional size with each data point, just different values.  For the same reason, prediction runtime should be O(1): no matter how many data points you used to create each feature's distribution, you're still comparing each novel data point feature to each distribution once.
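
The space argument above can be seen in a from-scratch sketch of Gaussian Naive Bayes: the fitted model keeps only a prior plus one (mean, variance) pair per feature per class. This is an illustration, not scikit-learn's `GaussianNB`; the function names, the toy rows, and the small `1e-9` variance floor are assumptions made for the example.

```python
import math

def fit_gaussian_nb(rows, labels):
    # Returns {label: (prior, [(mean, variance), ...])}, one pair per feature
    model = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        stats = []
        for column in zip(*members):
            mean = sum(column) / float(len(column))
            var = sum((v - mean) ** 2 for v in column) / float(len(column))
            stats.append((mean, var + 1e-9))  # assumed floor to avoid zero variance
        model[label] = (len(members) / float(len(rows)), stats)
    return model

def predict_gaussian_nb(model, row):
    # Pick the class with the highest log prior + summed log likelihoods
    best_label, best_score = None, None
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if best_score is None or score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical two-feature rows, not student data
rows = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (4.0, 5.0), (4.2, 4.8), (3.8, 5.2)]
labels = ['no', 'no', 'no', 'yes', 'yes', 'yes']
model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, (4.1, 5.1)))
```

Adding more rows changes the stored means and variances but not how many of them there are, and prediction touches each stored pair exactly once.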


##### General Application

Bayes classifiers treat each feature as independent of the others, which makes them resilient to high-dimensional spaces, and therefore particularly good for datasets made predominantly from text.  Things like spam identification and text authorship identification can be done surprisingly effectively with them.

##### Inclusion in Experiment

There are a decent number of features for each student in this data set, so I wanted to include a Bayesian algorithm on the off chance that it performed particularly well (it didn't).

##### Results

| Training Set Size | Training Time | Prediction Time | F1 Score (train) | F1 Score (test) |
| ----------------- |:-------------:| ---------------:|  ---------------:|  --------------:|
| 50                | 0.000         | 0.000           | 0.62962962963    | 0.325581395349  |
| 100               | 0.000         | 0.000           | 0.436781609195   | 0.2             |
| 150               | 0.001         | 0.000           | 0.415384615385   | 0.302325581395  |
| 200               | 0.001         | 0.000           | 0.807272727273   | 0.781954887218  |
| 250               | 0.001         | 0.000           | 0.810495626822   | 0.746031746032  |
| 300               | 0.001         | 0.000           | 0.799043062201   | 0.781954887218  |


### Support Vector Machine

##### Complexity

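Runtime in training can be pretty variable depending on which kernel you use, but in any case optimizing for margin requires examining each point more than once as different boundaries are tried.  For kernel SVMs such as scikit-learn's `SVC`, training typically scales somewhere between O(n^2) and O(n^3) in the number of training points.

Prediction time is also comparatively slow, at least in the results table we have here.  Each prediction evaluates the kernel against every support vector, and the number of support vectors tends to grow with the training set, so prediction appears to be roughly linear with respect to amount of training data [ O(n) ]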
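
One way to see why prediction cost tracks the amount of retained training data is to write out the kernel SVM decision function by hand. The sketch below assumes an RBF kernel and uses made-up support vectors, dual coefficients, intercept and `gamma`; in a fitted `SVC` these values would come from training rather than being typed in.

```python
import math

def rbf_kernel(a, b, gamma):
    # exp(-gamma * squared Euclidean distance)
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def svm_decision(support_vectors, dual_coefs, intercept, query, gamma):
    # One kernel evaluation per support vector, so cost grows with their count
    total = sum(c * rbf_kernel(sv, query, gamma)
                for sv, c in zip(support_vectors, dual_coefs))
    return total + intercept

# Hypothetical fitted values: negative coefficient pulls toward 'no',
# positive coefficient pulls toward 'yes'
support_vectors = [(0.0, 0.0), (4.0, 4.0)]
dual_coefs = [-1.0, 1.0]
intercept = 0.0
gamma = 0.5

near_yes = svm_decision(support_vectors, dual_coefs, intercept, (3.5, 4.0), gamma)
near_no = svm_decision(support_vectors, dual_coefs, intercept, (0.5, 0.0), gamma)
print(near_yes)
print(near_no)
```

The sign of the result picks the class; the loop length is the number of support vectors, not the full training set size.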

##### General Application

Support vector machines try to find a decision boundary between labels that is as far away as possible from the points closest to the boundary, maximizing the margin for unknown data points.  If allowed to make too complex a decision boundary, they can be prone to overfitting, but are generally a good candidate for using on data sets that have a high number of features.  There are many examples of them being used to classify visual data surprisingly well.

##### Inclusion in Experiment

SVMs are one of the most widely applied algorithms in the subset of research I've read about, so since the cost of adding one more classifier to the experiment was low it seemed prudent to include.

##### Results


| Training Set Size | Training Time | Prediction Time | F1 Score (train) | F1 Score (test) |
| ----------------- |:-------------:| ---------------:|  ---------------:|  --------------:|
| 50                | 0.001         | 0.001           | 0.90243902439    | 0.812903225806  |
| 100               | 0.001         | 0.001           | 0.888888888889   | 0.807692307692  |
| 150               | 0.002         | 0.001           | 0.891891891892   | 0.8             |
| 200               | 0.003         | 0.002           | 0.880258899676   | 0.797385620915  |
| 250               | 0.004         | 0.003           | 0.870466321244   | 0.797385620915  |
| 300               | 0.008         | 0.004           | 0.868421052632   | 0.797385620915  |


## 5. Choosing the Best Model

- Based on the experiments you performed earlier, in 1-2 paragraphs explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?

> In the comparison process I evaluated 4 common classifier algorithms (KNN, Decision Tree, Naive Bayes, and SVM) over the sample dataset.  Of the four, one stood out as a relatively inferior fit: the KNN algorithm, even when provided with the maximum training dataset, still couldn't produce a test F1 score of 0.75, while the other 3 algorithms each reached a test F1 greater than 0.75 when provided with the full training data.  Of the remaining 3, Decision Trees scored lower than Naive Bayes (0.761 vs 0.782), which scored lower than the SVM (which provided the highest test F1 of all, 0.797).

> Accuracy is not the only concern at play.  Based on the Board's expressed preference for a cost-conscious model, I think that Naive Bayes would provide a marginally better long-term cost if the usage pattern is balanced the way I suspect (infrequent infusions of additional training data, frequent prediction requests).  In both training time and prediction time, Naive Bayes runs faster (and therefore cheaper).  However, if the training dataset is representative of the order of magnitude of data we're going to be running this system with, it may not matter very much.  At the observed growth rate of training time with respect to input data size (doubling about every hundred records added), you'd be up to 1,000 records before it took even 1 second to train the SVM.  If it's going to be much larger, though, and if the growth rate holds, then at 2,000 data points it would be taking a little less than 20 minutes, and at 2,500 it would be taking hours. So if the data gets really big, Naive Bayes may provide a better tradeoff.  At this size of dataset though, the SVM _is_ the most accurate of all, and it's still feasible to run fairly quickly.
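
The extrapolation in the paragraph above can be reproduced with a few lines of arithmetic. The sketch takes the SVM's 0.008 second training time at 300 records from the results table and assumes, as the paragraph does, that training time doubles for every 100 records added; the function name is made up, and the doubling assumption is a rough reading of six timings rather than a measured law.

```python
def projected_training_seconds(n_records, base_records=300,
                               base_seconds=0.008, doubling_step=100.0):
    # Number of doublings between the measured point and n_records
    doublings = (n_records - base_records) / doubling_step
    return base_seconds * 2 ** doublings

for n in (1000, 2000, 2500):
    seconds = projected_training_seconds(n)
    print("{} records: {:.1f} s ({:.1f} min)".format(n, seconds, seconds / 60.0))
```

Under that assumption the projection is about 1 second at 1,000 records, about 17.5 minutes at 2,000, and a little over 9 hours at 2,500.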

- In 1-2 paragraphs explain to the board of supervisors in layman's terms how the final model chosen is supposed to work (for example if you chose a Decision Tree or Support Vector Machine, how does it make a prediction).

> Any description of a technical topic to a non-technical audience is likely to be fraught with poor assumptions about the transparency of given jargon and the struggle to find the appropriate level of abstraction to gain the necessary intuition for a given solution.  With this in mind I'll try to write simply to minimize poor assumptions on my part about any particular piece of foundational knowledge: The model selected from the experiment so far is known as a Support Vector Machine Classifier.  The name does accurately represent some of the underlying mathematical concepts at play, but rather than dive too deeply into the derivation of the name instead let's focus on how it works.

> SVMs can be thought of as an attempt to plot all the data we have on a grid, and draw the line between the categories that is as close as possible to being evenly drawn between them (maximizing the margin).  Imagine taking each student and assigning them a shirt color (red if they failed, green if they passed) and having them all stand on a football field at points that are calculated as a function of all of the things we know about them (so the same employment, habits, and demographic would place a student at the same point on the field).  Now we start trying to set out traffic cones on the field to divide the students in red shirts from the students in green shirts, trying to keep the cones as far away from any individual student as possible.  Now if we took any new student, not knowing yet whether they would pass or fail, we could still find the right spot on the field for them by asking about their habits, demographic, and employment.  Then we just look at where the new student is standing, whether that point is on the red side of the division, or the green side.  That's how we make our prediction.  It's a bit of a funny picture to imagine, but it's a decent thought experiment to help set out how SVMs work in a familiar setting.

- Fine-tune the model. Use Gridsearch with at least one important parameter tuned and with at least 3 settings. Use the entire training set for this.

- What is the model's final F<sub>1</sub> score?

> 0.810126582278

In [14]:
from sklearn.grid_search import GridSearchCV

params_map = [{
         'C': [0.5, 1.0, 2.0, 5.0, 10.0, 100.0],
         'kernel': ['rbf', 'linear'],
         'gamma': ['auto',1.0, 5.0, 10.0]
        }]

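def f1_score_on_test(estimator, x, y):
    # Note: this ignores the (x, y) validation fold passed in by GridSearchCV
    # and always scores against the global X_test / y_test, so the parameters
    # are selected using the test set.
    y_pred = estimator.predict(X_test)
    return f1_score(y_test.values, y_pred, pos_label='yes')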

gsv = GridSearchCV(svm.SVC(random_state=29), params_map, scoring=f1_score_on_test)
gsv.fit(X_train, y_train)
print("BEST PARAMS SET")
print(gsv.best_params_)

BEST PARAMS SET
{'kernel': 'rbf', 'C': 0.5, 'gamma': 'auto'}
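
The custom scorer in the cell above evaluates every candidate on `X_test`, which means the test set influences which parameters are chosen. A variation that scores each candidate on the cross-validation folds instead could look like the sketch below. It uses `sklearn.model_selection` and `make_scorer` from newer scikit-learn releases, and the two-cluster data set is a synthetic stand-in, so the printed parameters say nothing about the student data.

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical, not the student dataset)
rng = np.random.RandomState(29)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
               rng.normal(3.0, 1.0, size=(40, 2))])
y = np.array(['no'] * 40 + ['yes'] * 40)

# Scores each candidate on the held-out fold that GridSearchCV passes in
f1_scorer = make_scorer(f1_score, pos_label='yes')

param_grid = {'C': [0.5, 1.0, 10.0], 'kernel': ['rbf', 'linear']}
search = GridSearchCV(SVC(random_state=29), param_grid, scoring=f1_scorer, cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

With this setup the test set is only touched once, after the search, to report the final F<sub>1</sub> score.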


In [15]:
svm_optimized_clf = svm.SVC(random_state=29, kernel='rbf',C=0.5, gamma='auto')

train_classifier(svm_optimized_clf, X_train, y_train)

train_predict(svm_optimized_clf, X_train, y_train, X_test, y_test)

Training SVC...
Done!
Training time (secs): 0.006
------------------------------------------
Training set size: 300
Training SVC...
Done!
Training time (secs): 0.006
Predicting labels using SVC...
Done!
Prediction time (secs): 0.004
F1 score for training set: 0.832298136646
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.810126582278
