# Machine Learning Engineer Nanodegree
## Supervised Learning
## Project 2: Building a Student Intervention System

Welcome to the second project of the Machine Learning Engineer Nanodegree! In this notebook, some template code has already been provided for you, and it will be your job to implement the additional functionality necessary to successfully complete this project. Sections that begin with **'Implementation'** in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a `'TODO'` statement. Please be sure to read the instructions carefully!

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a **'Question X'** header. Carefully read each question and provide thorough answers in the following text boxes that begin with **'Answer:'**. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.  

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can be edited by typically double-clicking the cell to enter edit mode.

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail to graduate. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer: ** Since the graduation status of the student is discrete (passed or failed), it is a classification problem.

## Exploring the Data
Run the code cell below to load necessary Python libraries and load the student data. Note that the last column from this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [34]:
# Import libraries
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score

# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"

Student data read successfully!


In [35]:
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))
import sklearn; print sklearn.__version__

0.18.dev0


### Implementation: Data Exploration
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [36]:
dims = student_data.shape

# TODO: Calculate number of students
n_students = dims[0]

# TODO: Calculate number of features
n_features = dims[1] - 1

# TODO: Calculate passing students
n_passed = len(student_data[student_data['passed'] =='yes'])

# TODO: Calculate failing students
n_failed = n_students - n_passed

# TODO: Calculate graduation rate
grad_rate = n_passed / (n_students + 0.0) * 100

# Print the results
print "Total number of students: {}".format(n_students)
print "Number of features: {}".format(n_features)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)


Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%


## Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Run the code cell below to separate the student data into feature and target columns to see if any features are non-numeric.

In [37]:
# Extract feature columns
feature_cols = list(student_data.columns[:-1])

# Extract target column 'passed'
target_col = student_data.columns[-1] 

# Show the list of columns
print "Feature columns:\n{}".format(feature_cols)
print "\nTarget column: {}".format(target_col)

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]

# Show the feature information by printing the first five rows
print "\nFeature values:"
print X_all.head()

Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       

### Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.

In [38]:
from sklearn.preprocessing import MinMaxScaler as MMS

def preprocess_features(X):
    ''' Preprocesses the student data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    for col, col_data in X.iteritems():
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)
        output = output.join(col_data)
    
    return output


def preprocess_features_2(X):
    ''' Preprocesses the student data and converts non-numeric binary variables into
        binary (0/1) variables. Converts binary categories to binary variables while
        renaming the column name and normalizes the numerical data to [0,1].''' 
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    for col, col_data in X.iteritems():
        # print col
        # If data type is non-numeric, replace all yes/no values with 1/0
        uniq = np.sort(col_data.unique())
        if col_data.dtype == object and len(uniq) == 2:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
            if uniq[0] != "no":
                col_data.name = col + "_" + uniq[0]
                col_data = col_data.replace(uniq, [1, 0])
        # If data type is categorical, convert to dummy variables
        elif col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)
        else:
            # Normalize number columns
            m = col_data.min()
            M = col_data.max()
            if (M - m) != 0:
                col_data = (col_data - m) / (M - m)
            # print col_data   
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

def preprocess_features_3(X):
    ''' Preprocesses the student data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables.
        Normalizes the rest between [0,1].'''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    for col, col_data in X.iteritems():
        # print col
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)
        else:
            # Normalize number columns
            m = col_data.min()
            M = col_data.max()
            if (M - m) != 0:
                col_data = (col_data - m) / (M - m)
            # print col_data   
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

X_all = preprocess_features(student_data[list(student_data.columns[:-1])])
print "Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns))
# print X_all['age']

Processed feature columns (48 total features):
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. In the following code cell below, you will need to implement the following:
- Randomly shuffle and split the data (`X_all`, `y_all`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [39]:
# TODO: Import any additional functionality you may need here
from sklearn.cross_validation import ShuffleSplit,train_test_split

# TODO: Set the number of training points
num_train = 300

# Set the number of testing points
num_test = X_all.shape[0] - num_train

# TODO: Shuffle and split the dataset into the number of training and testing points above
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, 
                                                    test_size = num_test, 
                                                    train_size = num_train, 
                                                    stratify = y_all,
                                                    random_state = 42)

# Show the results of the split
print "Training set has {} samples.".format(X_train.shape[0])
print "Testing set has {} samples.".format(X_test.shape[0])

Training set has 300 samples.
Testing set has 95 samples.


## Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data (100 data points, 200 data points, and 300 data points) and measure the F<sub>1</sub> score. You will need to produce three tables (one for each model) that shows the training set size, training time, prediction time, F<sub>1</sub> score on the training set, and F<sub>1</sub> score on the testing set.

### Question 2 - Model Application
*List three supervised learning models that are appropriate for this problem. What are the general applications of each model? What are their strengths and weaknesses? Given what you know about the data, why did you choose these models to be applied?*

*Answer:*  The three learning models I would use in this classification project would be Naive Bayes (NB), SVM and Decision Tree.

#### NB

NB is a technique that is usually applied to text characterization (e.g. sentiment analysis, medical diagnosis) but depending on the normalization of data, it can compete with SVM. Since it does not use many parameters to tweak during the learning process, it makes the model really hard to overfit the data at hand. In contrast, due to the same feature, the susceptibility to noise is high. This technique is applicable to our data since overfitting being less likely helps to deal with unbalanced and small sized data. Also, since these are student records, it is rather safe to assume that there is not much noise involved in the collection of it.

#### SVM

SVM (with the linear kernel) wants to find separating lines for different classes of features while maximizing the margin. It is widely used in binary classification problems with fewer features, in image classification where feature extraction is in a sense embedded in the kernel choice, in cursive classification, etc. Using the kernel trick, one can also use separating hyperplanes in higher dimensions to reveal non-linear separation boundaries in the original feature space, which adds tremendous flexibility to the approach and which ultimately adds new features to the data at hand. SVM is a strong choice for this data since there are many features (but not too much compared to sample size). Also as an advantage, the choice of the kernel in the algorithm adds versatility to the model. The downside is that our data has a lot of exclusive binary features (that came from non-numeric columns) and my intuition tells me that these features may affect the optimization poorly.

#### Decision Tree

Decision tree regression method uses classification trees, where a binary tree is constructed using the features and the end labels are placed at the terminating leaves. Using this tree, the method then comes up with the label given new features. It is used in data mining applications to reveal how to partition the data, in astronomy to classify Hubble's images to count several object(e.g. galaxies, quasars), in text classification in e.g. medicine, etc. In contrast with SVM's single weakness, decision tree algorithms handles the binary features extremely well, thus fitting model to use with our data. On the other hand, they are very susceptible to overfitting and learning might take much longer than the other algorithms, so we must be careful while choosing parameters for the algorithm.
    

### Setup
Run the code cell below to initialize three helper functions which you can use for training and testing the three supervised learning models you've chosen above. The functions are as follows:
- `train_classifier` - takes as input a classifier and training data and fits the classifier to the data.
- `predict_labels` - takes as input a fit classifier, features, and a target labeling and makes predictions using the F<sub>1</sub> score.
- `train_predict` - takes as input a classifier, and the training and testing data, and performs `train_clasifier` and `predict_labels`.
 - This function will report the F<sub>1</sub> score for both the training and testing data separately.

In [40]:
def train_classifier(clf, X_train, y_train, verbose=False):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Print the results
    if verbose:
        print "Trained model in {:.4f} seconds".format(end - start)
    return end - start

    
def predict_labels(clf, features, target, verbose=False):
    ''' Makes predictions using a fit classifier based on F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    
    # Print and return results
    if verbose:
        print "Made predictions in {:.4f} seconds.".format(end - start)
    return {'score': f1_score(target.values, y_pred, pos_label='yes'), 'dt': end - start}


def train_predict(clf, X_train, y_train, X_test, y_test, verbose=False):
    ''' Train and predict using a classifer based on F1 score. '''
    
    # Indicate the classifier and the training set size
    if verbose:
        print "Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train))
    
    # Train the classifier
    t_tr = train_classifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    scores = {'dt': t_tr, 'train': predict_labels(clf, X_train, y_train), 'test': predict_labels(clf, X_test, y_test)}
    
    if verbose:
        print "F1 score for training set: {:.4f}.".format(scores.train.score)
        print "F1 score for test set: {:.4f}.".format(scores.test.score)
    
    # Print markdown table row after header
    return scores

### Implementation: Model Performance Metrics
With the predefined functions above, you will now import the three supervised learning models of your choice and run the `train_predict` function for each one. Remember that you will need to train and predict on each classifier for three different training set sizes: 100, 200, and 300. Hence, you should expect to have 9 different outputs below — 3 for each model using the varying training set sizes. In the following code cell, you will need to implement the following:
- Import the three supervised learning models you've discussed in the previous section.
- Initialize the three models and store them in `clf_A`, `clf_B`, and `clf_C`.
 - Use a `random_state` for each model you use, if provided.
 - **Note:** Use the default settings for each model — you will tune one specific model in a later section.
- Create the different training set sizes to be used to train each model.
 - *Do not reshuffle and resplit the data! The new training points should be drawn from `X_train` and `y_train`.*
- Fit each model with each training set size and make predictions on the test set (9 in total).  
**Note:** Three tables are provided after the following code cell which can be used to store your results.

In [41]:
# TODO: Import the three supervised learning models from sklearn
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

model_names = ["GaussianNB","SVC", "Linear SVC", "Decision Tree"]
# TODO: Initialize the three models
classifiers = [GaussianNB(),
               SVC(random_state = 42),
               LinearSVC(random_state = 42),
               DecisionTreeClassifier(random_state = 42)]

from sklearn import __version__ as skversion
if skversion > '0.18':
    classifiers += [MLPClassifier(algorithm='l-bfgs', batch_size=50, random_state = 42)]
    model_names += ["Multi-Layered Perceptron NN"]
    
    

table = """
**{3}**

| Training Set Size | Training Time | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100 |{0[dt]:.4f} |{0[train][dt]:.4f} |{0[test][dt]:.4f} |{0[train][score]:.4f} | {0[test][score]:.4f} |
| 200 |{1[dt]:.4f} |{1[train][dt]:.4f} |{1[test][dt]:.4f} |{1[train][score]:.4f} | {1[test][score]:.4f} |
| 300 |{2[dt]:.4f} |{2[train][dt]:.4f} |{2[test][dt]:.4f} |{2[train][score]:.4f} | {2[test][score]:.4f} |

"""

# TODO: Set up the training set sizes
X_train_100 = X_train[:100]
y_train_100 = y_train[:100]

X_train_200 = X_train[:200]
y_train_200 = y_train[:200]

X_train_300 = X_train
y_train_300 = y_train

# TODO: Execute the 'train_predict' function for each classifier and each training set size
printout = ""
for i in range(len(classifiers)):
    r = [""] * 4
    r[3] = "Classifier {} - {}".format(i + 1,classifiers[i].__class__.__name__)
    r[0] = train_predict(classifiers[i], X_train_100, y_train_100, X_test, y_test)
    r[1] = train_predict(classifiers[i], X_train_200, y_train_200, X_test, y_test)
    r[2] = train_predict(classifiers[i], X_train_300, y_train_300, X_test, y_test)
    printout += table.format(*r)

### Tabular Results
Edit the cell below to see how a table can be designed in [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#tables). You can record your results from above in the tables provided.

In [42]:
printmd("I wanted to learn to print in Markdown using Code cells and also I wanted to try Perceptron NN and LinearSVC for curiosity.")    
printmd(printout)

I wanted to learn to print in Markdown using Code cells and also I wanted to try Perceptron NN and LinearSVC for curiosity.


**Classifier 1 - GaussianNB**

| Training Set Size | Training Time | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100 |0.0017 |0.0003 |0.0003 |0.7752 | 0.6457 |
| 200 |0.0008 |0.0003 |0.0003 |0.8060 | 0.7218 |
| 300 |0.0009 |0.0004 |0.0003 |0.8134 | 0.7761 |


**Classifier 2 - SVC**

| Training Set Size | Training Time | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100 |0.0018 |0.0010 |0.0009 |0.8354 | 0.8025 |
| 200 |0.0050 |0.0043 |0.0022 |0.8431 | 0.8105 |
| 300 |0.0094 |0.0064 |0.0024 |0.8664 | 0.8052 |


**Classifier 3 - LinearSVC**

| Training Set Size | Training Time | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100 |0.0111 |0.0002 |0.0001 |0.8732 | 0.6880 |
| 200 |0.0285 |0.0004 |0.0003 |0.8227 | 0.7571 |
| 300 |0.0371 |0.0010 |0.0003 |0.8256 | 0.6935 |


**Classifier 4 - DecisionTreeClassifier**

| Training Set Size | Training Time | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100 |0.0013 |0.0003 |0.0003 |1.0000 | 0.6452 |
| 200 |0.0021 |0.0004 |0.0002 |1.0000 | 0.7258 |
| 300 |0.0019 |0.0004 |0.0007 |1.0000 | 0.6838 |


**Classifier 5 - MLPClassifier**

| Training Set Size | Training Time | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100 |0.2610 |0.0006 |0.0005 |1.0000 | 0.6822 |
| 200 |0.4789 |0.0010 |0.0005 |1.0000 | 0.6613 |
| 300 |0.6756 |0.0014 |0.0005 |1.0000 | 0.6870 |



## Choosing the Best Model
In this final section, you will choose from the three supervised learning models the *best* model to use on the student data. You will then perform a grid search optimization for the model over the entire training set (`X_train` and `y_train`) by tuning at least one parameter to improve upon the untuned model's F<sub>1</sub> score. 

### Question 3 - Chosing the Best Model
*Based on the experiments you performed earlier, in one to two paragraphs, explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?*

**Answer: ** The first model candidate to eliminate would be Decision Tree, since the model clearly overfits and performs poorly as the training set increases, the indicators being the perfect score for the training and poor testing score. Also, it seems like prediction times are also higher than the other ones. Choosing between SVM and Gaussian NB is challenging. As you can see in the table above, both models increasingly fit the training data better and also performs rather well on the testing set. One thing to notice here is that SVM has significantly higher training scores; thus, after fine-tuning its parameters, it is expected to perform better than Gaussian NB. On the other hand, refering to the elapsed times for the models, it is apparent that SVM gets computationally costly as number of samples increase. Therefore, depending on the size of the data, we might prefer to train a Gaussian NB model.

For the data available at hand I would recommend going forward with SVM model to fit the data. If we had much larger set of students to consider, I might have preferred Gaussian NB.

### Question 4 - Model in Layman's Terms
*In one to two paragraphs, explain to the board of directors in layman's terms how the final model chosen is supposed to work. For example if you've chosen to use a decision tree or a support vector machine, how does the model go about making a prediction?*

**Answer: **

Upon considering several other, I have chosen _Support Vector Machine (SVM)_ model to predict student graduation for this project. SVM constructs hyperplanes, i.e. higher dimensional counterparts of lines/planes in 2D/3D, that separates features according to graduation status. This separation is optimal in the following sense: the algorithm chooses the separator that maximizes the separation margin. 
<img src="separation.png" width=50% title="Courtesy of sklearn documentation">
As seen in the graph above, the full line is chosen over the dotted lines, which separate the data as well, since the former has a larger separation margin, i.e. in average more distant to the points that are close to it on either side. The learning phase consists of coming up with these separators and storing them for prediction. In the predictive phase, given the student data, the algorithm computes which region it falls under and decides if she/he is likely to graduate or not.

Of course, real world data is rarely this easily separable, e.g. consider the following
<img src="data1_2D.png" width=50% title="Courtesy of Eric Kim at www.eric-kim.net">
As you can see there is no line suitable to separate the two colors, but our human eye can detect that there is actually a pattern: reds and blues are separated by a circle rather than a line. Even though it seems like SVM wouldn't be successful in this case, there is a peculiar functionality of it called the _kernel trick_. What it does is that through a transformation in space (possibly while increasing the dimensionality), it reveals a linear separator. For our example above, as we noticed that the points are separated by distance, we can come up with a transformation that encodes not only 2 axis but also a new third one that encodes the distance to the origin.
<img src="data1_3D.png" width=50% title="Courtesy of Eric Kim at www.eric-kim.net">
Now we can see the linear separation evidently (here linear is a plane since we are in 3D)
<img src="data1_2D_3D.png" width=75% title="Courtesy of Eric Kim at www.eric-kim.net">
And as you can see, the separator in 2D is now a circle instead of a line! This example shows only the tip of the iceberg of the advantages of using kernel trick in SVMs. Here is a bunch of other data that is not linearly separable, but after a kernel trick SVM reveals a complex separation boundary.
- Here we see the end result of a transformation that revealed the complex separator (red curve on the left) after transforming and linearly separating.
<img src="kernel_trick1.png" width=50% title="Courtesy of Wikipedia">
- Here is another sketch of 2D-to-3D transformation that reveals the plane separator.
<img src="kernel_trick2.png" width=50% title="">
- Lastly, another application of SVM that reveals a highly complex separator.
<img src="kernel_trick3.png" width=50% title="Courtesy of Quora">

### Implementation: Model Tuning
Fine tune the chosen model. Use grid search (`GridSearchCV`) with at least one important parameter tuned with at least 3 different values. You will need to use the entire training set for this. In the code cell below, you will need to implement the following:
- Import [`sklearn.grid_search.gridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html) and [`sklearn.metrics.make_scorer`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html).
- Create a dictionary of parameters you wish to tune for the chosen model.
 - Example: `parameters = {'parameter' : [list of values]}`.
- Initialize the classifier you've chosen and store it in `clf`.
- Create the F<sub>1</sub> scoring function using `make_scorer` and store it in `f1_scorer`.
 - Set the `pos_label` parameter to the correct value!
- Perform grid search on the classifier `clf` using `f1_scorer` as the scoring method, and store it in `grid_obj`.
- Fit the grid search object to the training data (`X_train`, `y_train`), and store it in `grid_obj`.

In [43]:
# Data scaling for SVM 

from sklearn import preprocessing


X_tr = [pd.DataFrame(X_train), pd.DataFrame(X_train), pd.DataFrame(X_train)]
X_te = [pd.DataFrame(X_test), pd.DataFrame(X_test), pd.DataFrame(X_test)]

norm_names = [""]*len(X_tr)

# idx 0 for whitened data (mean 0, std dev 1)
scaler = preprocessing.StandardScaler().fit(X_train.values)
X_tr[0][X_tr[0].columns] = scaler.transform(X_tr[0])
X_te[0][X_te[0].columns] = scaler.transform(X_te[0])
norm_names[0] = "mean 0, std.dev 1"

# idx 0 for data scaled to [0,1] interval
train_trans = lambda x: (x - X_train[x.name].min()) / (X_train[x.name].max() - X_train[x.name].min())
X_tr[1] = X_tr[1].apply(train_trans)
X_te[1] = X_te[1].apply(train_trans)
norm_names[1] = "scaled to [0,1]"

norm_names[2] = "no scaling, as-is"


In [44]:
# TODO: Import 'GridSearchCV' and 'make_scorer'
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.cross_validation import StratifiedShuffleSplit, StratifiedKFold

# TODO: Create the parameters list you wish to tune
# parameters = {'kernel': ['rbf','poly'], 'degree': range(1,6), 'C': [pow(2,-i) for i in xrange(2,9)]}
# parameters = {'kernel': ['linear','rbf','poly','sigmoid'], 'degree': range(0,5), 'C': [pow(2,-i) for i in xrange(-1,15)]}
parameters = {'kernel': ['linear','rbf','sigmoid'],'gamma' : [ 0.0001, 0.001, 0.1, 10, 100 ], 'C': [1] + [pow(2,-i) for i in xrange(1,10)], 'tol': [pow(10,-i) for i in xrange(2,8)]}
# parameters = {'kernel': ['rbf'], 'gamma': [ 0.0001, 0.001, 0.1, 10, 100 ], 'C': [pow(10,i) for i in range(-3,4)]}
# parameters = { 'C': [float(pow(2,-i)) for i in xrange(-1,15)]} #, 'class_weight' : [None, 'balanced']}
# parameters = {'n_neighbors': range(3,8), 'weights': ['uniform','distance']}

# TODO: Initialize the classifier
# clf = KNeighborsClassifier()
clf = SVC(random_state = 42)
ssscv = StratifiedShuffleSplit( y_train, n_iter=10, test_size=15, random_state = 42)

# TODO: Make an f1 scoring function using 'make_scorer' 
f1_scorer = make_scorer(f1_score, pos_label='yes')

# TODO: Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj_scv = [GridSearchCV(clf, parameters, cv = ssscv, scoring=f1_scorer) for i in [0, 1, 2]] # 10-fold
# grid_obj_lvc = [GridSearchCV(LinearSVC(random_state=42), { 'C': parameters['C']}, cv = ssscv, scoring=f1_scorer) for i in [0, 1, 2]] 

# TODO: Fit the grid search object to the training data and find the optimal parameters

print "Untuned model has a training F1 score of {:.2f}%.".format(predict_labels(classifiers[1], X_train, y_train)['score']*100)
print "Untuned model has a testing F1 score of {:.2f}%.".format(predict_labels(classifiers[1], X_test, y_test)['score']*100)
print classifiers[1]
print ""

for i in [0, 1, 2]:
    grid_obj_scv[i].fit(X_tr[i], y_train)
    # grid_obj_lvc[i].fit(X_tr[i], y_train)

    # Get the estimator
    # clf = grid_obj.best_estimator_

    # Report the final F1 score for training and testing after parameter tuning
    print "Data Type: {}\nModel: General".format(norm_names[i])
    print "Tuned model ({}-fold) has a training F1 score of {:.2f}%.".format(len(ssscv), predict_labels(grid_obj_scv[i].best_estimator_, X_tr[i], y_train)['score']*100)
    print "Tuned model ({}-fold) has a testing F1 score of {:.2f}%.".format(len(ssscv), predict_labels(grid_obj_scv[i].best_estimator_, X_te[i], y_test)['score']*100)
    print grid_obj_scv[i].best_estimator_
    if False:
        print "Model: Linear"
        print "Tuned model ({}-fold) has a training F1 score of {:.2f}%.".format(len(ssscv), predict_labels(grid_obj_lvc[i].best_estimator_, X_tr[i], y_train)['score']*100)
        print "Tuned model ({}-fold) has a testing F1 score of {:.2f}%.".format(len(ssscv), predict_labels(grid_obj_lvc[i].best_estimator_, X_te[i], y_test)['score']*100)
        print grid_obj_lvc[i].best_estimator_
    print ""

Untuned model has a training F1 score of 80.24%.
Untuned model has a testing F1 score of 80.50%.
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=42, shrinking=True,
  tol=0.001, verbose=False)

Data Type: mean 0, std.dev 1
Model: General
Tuned model (10-fold) has a training F1 score of 83.83%.
Tuned model (10-fold) has a testing F1 score of 80.00%.
SVC(C=0.015625, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.0001, kernel='linear',
  max_iter=-1, probability=False, random_state=42, shrinking=True,
  tol=0.01, verbose=False)

Data Type: scaled to [0,1]
Model: General
Tuned model (10-fold) has a training F1 score of 85.35%.
Tuned model (10-fold) has a testing F1 score of 79.22%.
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, p

### Question 5 - Final F<sub>1</sub> Score
*What is the final model's F<sub>1</sub> score for training and testing? How does that score compare to the untuned model?*

**Answer: **

I have based my results below for the fitted model using data scaled to the interval $[0,1]$, since upon the analysis above, the best results over the training set is achieved in that scaling. Here's the final table of $F_1$ scores:

|     Model Type    | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: |
| Untuned | 80.24% | 80.50% |
| Tuned | 85.35% | 79.22% |

As we can see, while the tuned model does well better than untuned model in the training set, untuned model performs better in the testing set.

> **Note**: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to  
**File -> Download as -> HTML (.html)**. Include the finished document along with this notebook as your submission.

## Notes

Dear grader, thanks a lot for your previous comments, they were very helpful and constructive. If you have time and patience, could you review the extra work below too? Thanks again!

## Extra Work

Here, I'll try a different approach to the data at hand. The work I'll do will be mostly on the preprocessing level rather than modeling. I want to start by using `preprocess_features_2` which treats binary categories as single feature rather than two separate one like `get_dummies` does, and normalizes numeric features to the interval $[0,1]$.

In [47]:
X_all_2 = preprocess_features_2(student_data[list(student_data.columns[:-1])])
print "Processed feature columns ({} total features):\n{}".format(len(X_all_2.columns), list(X_all_2.columns))

Processed feature columns (43 total features):
['school_GP', 'sex_F', 'age', 'address_R', 'famsize_GT3', 'Pstatus_A', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


As you can see the result above, now we have less features (e.g. `sex` now became only `sex_F` rather than `sex_F` and `sex_M`). Now we use this data to make predictions. Let's start by splitting into training and testing sets.

In [48]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_all_2, y_all, 
                                                    test_size = num_test, 
                                                    train_size = num_train, 
                                                    stratify = y_all,
                                                    random_state = 42)

# Show the results of the split
print "Training set has {} samples.".format(X_train.shape[0])
print "Testing set has {} samples.".format(X_test.shape[0])

Training set has 300 samples.
Testing set has 95 samples.


Now, we fit the 3 chosen classifiers below and make a table out of the results.

In [55]:
classifiers_2 = [GaussianNB(),
               SVC(random_state = 42),
               DecisionTreeClassifier(random_state = 42)]

table_2 = """
**{3}**

| Training Set Size | Training Time | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100 |{0[dt]:.4f} |{0[train][dt]:.4f} |{0[test][dt]:.4f} |{0[train][score]:.4f} | {0[test][score]:.4f} |
| 200 |{1[dt]:.4f} |{1[train][dt]:.4f} |{1[test][dt]:.4f} |{1[train][score]:.4f} | {1[test][score]:.4f} |
| 300 |{2[dt]:.4f} |{2[train][dt]:.4f} |{2[test][dt]:.4f} |{2[train][score]:.4f} | {2[test][score]:.4f} |

"""

# TODO: Set up the training set sizes
X_train_2_100 = X_train_2[:100]
y_train_2_100 = y_train_2[:100]

X_train_2_200 = X_train_2[:200]
y_train_2_200 = y_train_2[:200]

X_train_2_300 = X_train_2
y_train_2_300 = y_train_2

# TODO: Execute the 'train_predict' function for each classifier and each training set size
printout_2 = ""
for i in range(len(classifiers_2)):
    r_2 = [""] * 4
    r_2[3] = "Classifier {} - {}".format(i + 1,classifiers_2[i].__class__.__name__)
    r_2[0] = train_predict(classifiers_2[i], X_train_2_100, y_train_2_100, X_test_2, y_test_2)
    r_2[1] = train_predict(classifiers_2[i], X_train_2_200, y_train_2_200, X_test_2, y_test_2)
    r_2[2] = train_predict(classifiers_2[i], X_train_2_300, y_train_2_300, X_test_2, y_test_2)
    printout_2 += table.format(*r_2)
    
printmd(printout_2)


**Classifier 1 - GaussianNB**

| Training Set Size | Training Time | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100 |0.0011 |0.0004 |0.0008 |0.8088 | 0.6912 |
| 200 |0.0015 |0.0005 |0.0004 |0.7926 | 0.7519 |
| 300 |0.0013 |0.0005 |0.0004 |0.8038 | 0.7761 |


**Classifier 2 - SVC**

| Training Set Size | Training Time | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100 |0.0018 |0.0012 |0.0012 |0.7952 | 0.8050 |
| 200 |0.0054 |0.0038 |0.0019 |0.7879 | 0.8050 |
| 300 |0.0118 |0.0079 |0.0033 |0.8089 | 0.8050 |


**Classifier 3 - DecisionTreeClassifier**

| Training Set Size | Training Time | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100 |0.0014 |0.0004 |0.0003 |1.0000 | 0.6667 |
| 200 |0.0022 |0.0012 |0.0003 |1.0000 | 0.7154 |
| 300 |0.0033 |0.0005 |0.0004 |1.0000 | 0.6721 |



Again, SVM looks like the best. Let's fine-tune parameters.

In [54]:
# TODO: Create the parameters list you wish to tune
# parameters = {'kernel': ['rbf','poly'], 'degree': range(1,6), 'C': [pow(2,-i) for i in xrange(2,9)]}
# parameters = {'kernel': ['linear','rbf','poly','sigmoid'], 'degree': range(0,5), 'C': [pow(2,-i) for i in xrange(-1,15)]}
parameters_2 = {'kernel': ['linear','rbf'],'gamma' : [ 0.0001, 0.001, 0.1, 10, 100 ], 'C': [1] + [pow(2,-i) for i in xrange(1,5)], 'tol': [pow(10,-i) for i in xrange(2,6)]}
# parameters = {'kernel': ['rbf'], 'gamma': [ 0.0001, 0.001, 0.1, 10, 100 ], 'C': [pow(10,i) for i in range(-3,4)]}
# parameters = { 'C': [float(pow(2,-i)) for i in xrange(-1,15)]} #, 'class_weight' : [None, 'balanced']}
# parameters = {'n_neighbors': range(3,8), 'weights': ['uniform','distance']}

# TODO: Initialize the classifier
# clf = KNeighborsClassifier()
clf_2 = SVC(random_state = 42)
ssscv_2 = StratifiedShuffleSplit( y_train_2, n_iter=10, test_size=15, random_state = 42)

# TODO: Make an f1 scoring function using 'make_scorer' 
f1_scorer_2 = make_scorer(f1_score, pos_label='yes')

# TODO: Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj_scv_2 = GridSearchCV(clf_2, parameters_2, cv = ssscv_2, scoring=f1_scorer_2) # 10-fold
# grid_obj_lvc = [GridSearchCV(LinearSVC(random_state=42), { 'C': parameters['C']}, cv = ssscv, scoring=f1_scorer) for i in [0, 1, 2]] 

# TODO: Fit the grid search object to the training data and find the optimal parameters

print "Untuned model has a training F1 score of {:.2f}%.".format(predict_labels(classifiers_2[1], X_train_2, y_train_2)['score']*100)
print "Untuned model has a testing F1 score of {:.2f}%.".format(predict_labels(classifiers_2[1], X_test_2, y_test_2)['score']*100)
print classifiers_2[1]
print ""

grid_obj_scv_2.fit(X_train_2, y_train_2)
# grid_obj_lvc[i].fit(X_tr[i], y_train)

# Get the estimator
# clf = grid_obj.best_estimator_

# Report the final F1 score for training and testing after parameter tuning

print "Tuned model ({}-fold) has a training F1 score of {:.2f}%.".format(len(ssscv_2), predict_labels(grid_obj_scv_2.best_estimator_, X_train_2, y_train_2)['score']*100)
print "Tuned model ({}-fold) has a testing F1 score of {:.2f}%.".format(len(ssscv_2), predict_labels(grid_obj_scv_2.best_estimator_, X_test_2, y_test_2)['score']*100)
print grid_obj_scv_2.best_estimator_

Untuned model has a training F1 score of 80.89%.
Untuned model has a testing F1 score of 80.50%.
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=42, shrinking=True,
  tol=0.001, verbose=False)

Tuned model (10-fold) has a training F1 score of 85.53%.
Tuned model (10-fold) has a testing F1 score of 80.52%.
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=42, shrinking=True,
  tol=0.01, verbose=False)


As you can see below, our final tuned model has a better (not extremely, but still higher) final training _and_ testing score than the original one . I think I would conclude that it's better to convert binary categories into single binary feature rather than two. Similarly, given a feature with $n$ categories, instead of replacing it with $n$ features, we might want to try making $\lceil \lg n \rceil$ features (e.g. if we have a feature $F$ with choices, say, $\{0,1,2,3\}$, I would have two binary features $F_1$ and $F_2$ such that the correspondence is

| $F$ | $F_1$ | $F_2$ |
| :--: | :--: | :--: |
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 2 | 1 | 0 |
| 3 | 1 | 1 |

which looks like bit-encoding.