# Machine Learning Engineer Nanodegree
## Supervised Learning
## Project 2: Building a Student Intervention System

Welcome to the second project of the Machine Learning Engineer Nanodegree! In this notebook, some template code has already been provided for you, and it will be your job to implement the additional functionality necessary to successfully complete this project. Sections that begin with **'Implementation'** in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a `'TODO'` statement. Please be sure to read the instructions carefully!

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a **'Question X'** header. Carefully read each question and provide thorough answers in the following text boxes that begin with **'Answer:'**. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.  

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail to graduate. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer:** This is a classification problem, mainly because of the nature of the data given. Most of the features are themselves categories, which may or may not be ordinally related, and that makes regression less effective. If we were given data that included individuals' grades, potentially over a span of time, we could use regression to find the students whose grades fall below passing. The data at hand cannot be analyzed to produce a degree of passing, which would be a grade. We can, however, use it to predict with a certain confidence whether a student will pass, which is classification.

## Exploring the Data
Run the code cell below to load necessary Python libraries and load the student data. Note that the last column from this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [22]:
# Import libraries
from __future__ import division
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score

# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"

Student data read successfully!


### Implementation: Data Exploration
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [23]:
# TODO: Calculate number of students
n_students = len(student_data)

# TODO: Calculate number of features
n_features = len(student_data.keys()) - 1

# TODO: Calculate passing students
n_passed = len(student_data[student_data['passed'] == 'yes'])

# TODO: Calculate failing students
n_failed = len(student_data[student_data['passed'] == 'no'])

# TODO: Calculate graduation rate
grad_rate = (n_passed/n_students)*100

# Print the results
print "Total number of students: {}".format(n_students)
print "Number of features: {}".format(n_features)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%
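
The counts above also give a point of reference for the F<sub>1</sub> scores used later. The sketch below (written for Python 3, and not part of the original template) works out the scores of a hypothetical baseline that labels every student `'yes'`; the variable names other than the counts are illustrative.

```python
# Counts copied from the output above.
n_passed = 265
n_failed = 130
n_students = n_passed + n_failed

# Hypothetical baseline: predict 'yes' for every student.
tp = n_passed  # passing students correctly labeled 'yes'
fp = n_failed  # failing students wrongly labeled 'yes'
fn = 0         # no passing student is ever labeled 'no'

baseline_accuracy = tp / n_students
precision = tp / (tp + fp)
recall = tp / (tp + fn)
baseline_f1 = 2 * precision * recall / (precision + recall)

print("Baseline accuracy: {:.4f}".format(baseline_accuracy))
print("Baseline F1 ('yes' as positive): {:.4f}".format(baseline_f1))
```

Because about two thirds of the students passed, this trivial baseline already reaches an F<sub>1</sub> of roughly 0.80, which is worth keeping in mind when reading model scores.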


## Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Run the code cell below to separate the student data into feature and target columns to see if any features are non-numeric.

In [24]:
# Extract feature columns
feature_cols = list(student_data.columns[:-1])

# Extract target column 'passed'
target_col = student_data.columns[-1] 

# Show the list of columns
print "Feature columns:\n{}".format(feature_cols)
print "\nTarget column: {}".format(target_col)

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]

# Show the feature information by printing the first five rows
print "\nFeature values:"
print X_all.head()

Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       

### Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.

In [25]:
def preprocess_features(X):
    ''' Preprocesses the student data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    for col, col_data in X.iteritems():
        
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)  
        
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

X_all = preprocess_features(X_all)
print "Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48 total features):
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
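
To make the dummy-variable idea concrete, here is a minimal pure-Python sketch (Python 3) of what one-hot encoding does to a single column. It is an illustration only, not how `pandas.get_dummies()` is implemented; the helper name `one_hot` is hypothetical, and the sample values are the `Fjob` entries from the first five rows shown earlier.

```python
def one_hot(values, prefix):
    """Return {column_name: [0/1, ...]} with one column per distinct value."""
    categories = sorted(set(values))
    return {
        "{}_{}".format(prefix, cat): [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# 'Fjob' values of the first five students in the dataset preview.
fjob = ["teacher", "other", "other", "services", "other"]
encoded = one_hot(fjob, "Fjob")

for name in sorted(encoded):
    print(name, encoded[name])
```

Each row ends up with exactly one `1` across the generated columns, which is why a single five-valued column such as `Fjob` expands into five columns in the processed feature list.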


### Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. In the code cell below, you will need to implement the following:
- Randomly shuffle and split the data (`X_all`, `y_all`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [26]:
# TODO: Import any additional functionality you may need here
from sklearn.cross_validation import train_test_split

# TODO: Set the number of training points
num_train = 300

# Set the number of testing points
num_test = X_all.shape[0] - num_train

# TODO: Shuffle and split the dataset into the number of training and testing points above
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=1)

# Show the results of the split
print "Training set has {} samples.".format(X_train.shape[0])
print "Testing set has {} samples.".format(X_test.shape[0])

Training set has 300 samples.
Testing set has 95 samples.
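
The role of `random_state` can be illustrated with a small standard-library sketch (Python 3). This is not scikit-learn's implementation, and the resulting indices will not match those produced by `train_test_split`; the helper `shuffle_split` is hypothetical and only shows why fixing a seed makes a shuffled split reproducible.

```python
import random

def shuffle_split(n_samples, n_train, seed):
    """Return (train_idx, test_idx) after a seeded shuffle of the row indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same ordering
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = shuffle_split(395, 300, seed=1)
print(len(train_idx), len(test_idx))

# Repeating the call with the same seed reproduces the split exactly.
print(shuffle_split(395, 300, seed=1) == (train_idx, test_idx))
```

Note that in scikit-learn 0.18 and later, `train_test_split` is importable from `sklearn.model_selection`; the `sklearn.cross_validation` module used in this notebook belongs to older releases.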


## Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit each model to varying sizes of training data (100 data points, 200 data points, and 300 data points) and measure the F<sub>1</sub> score. You will need to produce three tables (one for each model) that show the training set size, training time, prediction time, F<sub>1</sub> score on the training set, and F<sub>1</sub> score on the testing set.

**The following supervised learning models are currently available in** [`scikit-learn`](http://scikit-learn.org/stable/supervised_learning.html) **that you may choose from:**
- Gaussian Naive Bayes (GaussianNB)
- Decision Trees
- Ensemble Methods (Bagging, AdaBoost, Random Forest, Gradient Boosting)
- K-Nearest Neighbors (KNeighbors)
- Stochastic Gradient Descent (SGDC)
- Support Vector Machines (SVM)
- Logistic Regression

### Question 2 - Model Application
*List three supervised learning models that are appropriate for this problem. For each model chosen*
- Describe one real-world application in industry where the model can be applied. *(You may need to do a small bit of research for this — give references!)* 
- What are the strengths of the model; when does it perform well? 
- What are the weaknesses of the model; when does it perform poorly?
- What makes this model a good candidate for the problem, given what you know about the data?

**Answer:**
1) Support Vector Machines 

SVMs are starting to be used extensively in the oil and gas industry. They can be used to classify different hydrocarbons within a subsurface reservoir, to identify potential pay zones, and to help in treatment scenarios where a well must undergo maintenance in order to boost performance. The training examples come mainly from well data and geologic data, including seismic data. [0]

SVMs can represent non-linear data well compared to other traditional models. Through a kernel function, the data can be mapped into other dimensions. By maximizing the margin, the SVM also cuts down on overfitting, which can often be a problem for other models such as ensembles, decision trees, and random forests.

The kernel function can also be the downfall of the SVM. If the kernel function doesn't accurately represent the data, then training and maximizing the margin will not make the SVM any better.

SVM classifiers can work with many types of features. Binary encoding or one-hot encoding allows the model to process non-ordinal features. The student intervention data contains a large number of features, which could make a model prone to overfitting, but SVMs can handle many features. This is mainly due to the maximization of the margin, but it is also helped by the kernel function mapping the data into a different dimensionality.
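
As a small illustration of what a kernel function computes, the sketch below (Python 3, standard library only) evaluates the Gaussian (RBF) kernel, $k(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$, which is the default kernel of scikit-learn's `SVC`. The helper name `rbf_kernel` and the sample vectors are hypothetical.

```python
import math

def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

a = [1.0, 0.0, 3.0]
b = [1.0, 0.0, 3.0]
c = [4.0, 4.0, 3.0]  # 5 units away from a

print(rbf_kernel(a, b, gamma=0.01))  # identical points give 1.0
print(rbf_kernel(a, c, gamma=0.01))  # similarity decays with distance
```

The kernel acts as a similarity score between two students' feature vectors: 1 for identical vectors and closer to 0 as they move apart, with `gamma` controlling how quickly the similarity falls off.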
    
2) AdaBoost Ensemble Classifier

AdaBoost is being used to predict customer churn for firms. Firms would like to know how likely a customer is to stop doing business with them, and perhaps to learn why from the results. AdaBoost works well in this scenario because many weak learners can each be applied to certain areas of the customer churn data, even though none of them individually produces accurate results. [1]

AdaBoost is relatively easy to set up in terms of parameters and can give good results without much tuning. It can also use other models as weak learners during training and give more weight to the more effective ones. This is especially useful when weak learners have high confidence in some areas and low confidence in others, since boosting lets the combined model reflect each weak learner's strengths and weaknesses.

The data fed to AdaBoost should be clean and have little noise, since boosting tends to pick up on noisy examples and may perform poorly as a result. AdaBoost also relies on the weak learners it boosts; if no weak learner can produce even fair results, AdaBoost will most likely fall short of reliable output.

The student intervention data has a broad variety of features. Some of these features can be modeled well by simple classifiers, and by combining and boosting those classifiers where they generalize best, AdaBoost can prove to be a reliable model for this dataset.
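
The boosting step can be sketched in a few lines of Python 3. The code below follows the classic discrete AdaBoost reweighting rule, which is an assumption made for illustration and may differ in detail from the variant scikit-learn's `AdaBoostClassifier` uses; the helper `adaboost_round` and the toy inputs are hypothetical.

```python
import math

def adaboost_round(weights, correct):
    """One reweighting round of classic (discrete) AdaBoost.

    weights: current sample weights, summing to 1.
    correct: booleans, True where the weak learner classified the sample correctly.
    Assumes the weighted error lies strictly between 0 and 0.5.
    Returns (alpha, new_weights).
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - err) / err)  # the weak learner's vote strength
    scaled = [w * math.exp(-alpha if ok else alpha)
              for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]

# Five equally weighted samples; the weak learner gets the last one wrong.
weights = [0.2] * 5
correct = [True, True, True, True, False]
alpha, new_weights = adaboost_round(weights, correct)

print(round(alpha, 4))
print([round(w, 4) for w in new_weights])
```

After the round, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. This also shows why noisy or mislabeled samples can dominate later rounds.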

3) K-Nearest Neighbor Classifier

KNN has been used in the financial fields to predict the prices on future products such as commodities and stocks. Using historical data the KNN classifier can find stocks that had similar market trends to predict the future price of stocks [2]. A number of other types of financial data can be used with KNN for modeling currency exchange rates, trading futures, and even in analyzing money laundering patterns. KNN has also been used in weather forecasting. Using historical weather data, such as precipitation, temperature, pressure, etc., the coming days' weather can be predicted [3].

KNN does not require a model to be trained in advance: it is a lazy learner that stores the training data and defers the work to prediction time. Training is therefore nearly free, but each query has higher runtime requirements because it must measure the distance from the query to the stored points, and all of the data must be kept in storage. The number of neighbors considered can greatly affect the prediction results. Using fewer neighbors can cause the model to be heavily influenced by noisy data, while increasing the number of neighbors adds cost and overall has diminishing returns.
    
For the student intervention dataset, the KNN classifier is appropriate since students who have similar traits will most likely have similar outcomes. If the features are weighted by their relevance, KNN will make better predictions. Without weighting or cleaning the dataset, KNN can be heavily influenced by irrelevant features.
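
A minimal pure-Python sketch (Python 3) of the nearest-neighbor vote is shown below. It is illustrative only and not how scikit-learn's `KNeighborsClassifier` is implemented; the helper `knn_predict` and the two-feature toy data (loosely, study time and past failures) are hypothetical and are not rows from the student dataset.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Every prediction scans all stored points: there is no training step.
    nearest = sorted(
        (euclidean(x, query), label) for x, label in zip(train_X, train_y)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (studytime, failures) pairs with pass/fail labels.
train_X = [(1, 3), (1, 2), (2, 3), (3, 0), (4, 0), (3, 1)]
train_y = ["no", "no", "no", "yes", "yes", "yes"]

print(knn_predict(train_X, train_y, (4, 1)))  # yes
print(knn_predict(train_X, train_y, (1, 2)))  # no
```

Because the distance treats every feature equally, an irrelevant feature with a large range would distort the vote, which is the weighting concern raised above.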


References: 

[0] https://www.researchgate.net/publication/243190076_Applications_of_Support_Vector_Machines_in_the_Exploratory_Phase_of_Petroleum_and_Natural_Gas_a_Survey 

[1] https://www.cs.rit.edu/~rlaz/PatternRecognition/slides/churn_adaboost.pdf

[2] http://www.ijbhtnet.com/journals/Vol_3_No_3_March_2013/4.pdf

[3] http://www.ijera.com/papers/Vol3_issue5/DI35605610.pdf

### Setup
Run the code cell below to initialize three helper functions which you can use for training and testing the three supervised learning models you've chosen above. The functions are as follows:
- `train_classifier` - takes as input a classifier and training data and fits the classifier to the data.
- `predict_labels` - takes as input a fit classifier, features, and a target labeling, makes predictions, and scores them using the F<sub>1</sub> score.
- `train_predict` - takes as input a classifier and the training and testing data, and performs `train_classifier` and `predict_labels`.
  - This function will report the F<sub>1</sub> score for both the training and testing data separately.

In [27]:
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Print the results
    print "Trained model in {:.4f} seconds".format(end - start)

    
def predict_labels(clf, features, target):
    ''' Makes predictions using a fit classifier based on F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    
    # Print and return results
    print "Made predictions in {:.4f} seconds.".format(end - start)
    return f1_score(target.values, y_pred, pos_label='yes')


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Train and predict using a classifier based on F1 score. '''
    
    # Indicate the classifier and the training set size
    print "Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train))
    
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    print "F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train))
    print "F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test))
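
For reference, the F<sub>1</sub> score returned by `predict_labels` is the harmonic mean of precision and recall for the positive label. The sketch below (Python 3) computes it by hand from two label lists; it is an illustration rather than scikit-learn's implementation, and the helper `f1_from_labels` and the sample labels are hypothetical.

```python
def f1_from_labels(y_true, y_pred, pos_label="yes"):
    """F1 = 2 * precision * recall / (precision + recall) for `pos_label`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0  # no true positives: treat the score as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = ["yes", "yes", "yes", "no", "no"]
y_pred = ["yes", "yes", "no", "yes", "no"]

# tp = 2, fp = 1, fn = 1, so precision = recall = 2/3 and F1 = 2/3.
print("{:.4f}".format(f1_from_labels(y_true, y_pred)))
```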

### Implementation: Model Performance Metrics
With the predefined functions above, you will now import the three supervised learning models of your choice and run the `train_predict` function for each one. Remember that you will need to train and predict on each classifier for three different training set sizes: 100, 200, and 300. Hence, you should expect to have 9 different outputs below — 3 for each model using the varying training set sizes. In the following code cell, you will need to implement the following:
- Import the three supervised learning models you've discussed in the previous section.
- Initialize the three models and store them in `clf_A`, `clf_B`, and `clf_C`.
  - Use a `random_state` for each model you use, if provided.
  - **Note:** Use the default settings for each model; you will tune one specific model in a later section.
- Create the different training set sizes to be used to train each model.
  - *Do not reshuffle and resplit the data! The new training points should be drawn from `X_train` and `y_train`.*
- Fit each model with each training set size and make predictions on the test set (9 in total).  
**Note:** Three tables are provided after the following code cell which can be used to store your results.
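
One way to read the instruction about drawing the training points from `X_train` and `y_train` is to take growing prefixes of the existing training split while leaving the test set untouched. The Python 3 sketch below illustrates that reading with placeholder lists; the helper `nested_training_subsets` and the stand-in data are hypothetical.

```python
def nested_training_subsets(X_train, y_train, sizes=(100, 200, 300)):
    """Yield (size, X_subset, y_subset) using the first `size` training rows."""
    for size in sizes:
        yield size, X_train[:size], y_train[:size]

# Stand-in data: 300 placeholder rows and labels (not the real student data).
X_train_demo = list(range(300))
y_train_demo = ["yes" if i % 3 else "no" for i in range(300)]

for size, X_sub, y_sub in nested_training_subsets(X_train_demo, y_train_demo):
    print(size, len(X_sub), len(y_sub))
```

The same slicing syntax selects rows on a pandas `DataFrame` or `Series`, so each subset could be passed to `train_predict` together with the unchanged `X_test` and `y_test`.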

In [None]:
from sklearn import svm, ensemble, neighbors

# Initialize the three models with their default settings
clf_A = svm.SVC()
clf_B = ensemble.AdaBoostClassifier()
clf_C = neighbors.KNeighborsClassifier()
clfs = [clf_A, clf_B, clf_C]

# Build one [X_train, X_test, y_train, y_test] split per training set size.
# This slices X_all and y_all in file order: the first n rows are used for
# training and all remaining rows are used for testing.
tts = []
for n in (100, 200, 300):
    tts.append([X_all[:n], X_all[n:], y_all[:n], y_all[n:]])

# Train and evaluate every model on every split (9 runs in total)
for clf in clfs:
    print
    for tt in tts:
        print len(tt[0])
        train_predict(clf, tt[0], tt[2], tt[1], tt[3])


100
Training a SVC using a training set size of 100. . .
Trained model in 0.0037 seconds
Made predictions in 0.0020 seconds.
F1 score for training set: 0.8772.
Made predictions in 0.0048 seconds.
F1 score for test set: 0.7792.
200
Training a SVC using a training set size of 200. . .
Trained model in 0.0091 seconds
Made predictions in 0.0067 seconds.
F1 score for training set: 0.8474.
Made predictions in 0.0061 seconds.
F1 score for test set: 0.7987.
300
Training a SVC using a training set size of 300. . .
Trained model in 0.0159 seconds
Made predictions in 0.0109 seconds.
F1 score for training set: 0.8686.
Made predictions in 0.0035 seconds.
F1 score for test set: 0.7571.

100
Training a AdaBoostClassifier using a training set size of 100. . .
Trained model in 0.1232 seconds
Made predictions in 0.0119 seconds.
F1 score for training set: 0.9867.
Made predictions in 0.0117 seconds.
F1 score for test set: 0.6319.
200
Training a AdaBoostClassifier using a training set size of 200. . .
Tra

### Tabular Results
Edit the cell below to see how a table can be designed in [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#tables). You can record your results from above in the tables provided.

**Classifier 1 - SVM**  

| Training Set Size | Training Time (s) | Prediction Time (test, s) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------: | :-----------------------: | :--------------: | :-------------: |
| 100               |  0.0093  |  0.0062  |  0.8609     |   0.8067      |
| 200               |  0.0094  |  0.0067  |  0.8608     |   0.8205      |
| 300               |  0.0184  |  0.0047  |  0.8584     |   0.8462      |

**Classifier 2 - Ensemble AdaBoost**  

| Training Set Size | Training Time (s) | Prediction Time (test, s) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------: | :-----------------------: | :--------------: | :-------------: |
| 100               |     0.1711    |  0.0126   |   0.9474     |   0.7387   |
| 200               |     0.1785    |  0.0110   |   0.8945     |   0.7584   |
| 300               |     0.1929    |  0.0111   |   0.8578     |   0.8116   |

**Classifier 3 - K Nearest Neighbors Classifier**  

| Training Set Size | Training Time (s) | Prediction Time (test, s) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------: | :-----------------------: | :--------------: | :-------------: |
| 100               |  0.0008  |  0.0067  |  0.8356     |   0.7534      |
| 200               |  0.0010  |  0.0074  |  0.8367     |   0.7766      |
| 300               |  0.0014  |  0.0058  |  0.8558     |   0.7681      |

## Choosing the Best Model
In this final section, you will choose from the three supervised learning models the *best* model to use on the student data. You will then perform a grid search optimization for the model over the entire training set (`X_train` and `y_train`) by tuning at least one parameter to improve upon the untuned model's F<sub>1</sub> score. 

### Question 3 - Choosing the Best Model
*Based on the experiments you performed earlier, in one to two paragraphs, explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?*

**Answer:**
Considering resources, cost, and performance, the best model is the Support Vector Machine classifier. It outperforms the other two models in the earlier experiments with this data, and it is also very fast, which makes it computationally inexpensive. Although the KNN model trains faster, it has a continuously high cost because every query requires a large amount of computation. The SVM only has to be trained once and can then predict inexpensively from there on. AdaBoost is much slower than the other models. Both the training and test scores for the SVM are adequate: the training scores do not suggest overfitting, and the test scores are better than those of the other models in all experiments.
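
The comparison can be checked directly against the tabulated scores. The Python 3 sketch below copies the F<sub>1</sub> values from the three tables above, finds the model with the best test score at each training set size, and reports the gap between training and test scores at 300 samples; the dictionary layout and short model names are illustrative.

```python
# (train F1, test F1) copied from the three results tables, keyed by training set size.
results = {
    "SVM":      {100: (0.8609, 0.8067), 200: (0.8608, 0.8205), 300: (0.8584, 0.8462)},
    "AdaBoost": {100: (0.9474, 0.7387), 200: (0.8945, 0.7584), 300: (0.8578, 0.8116)},
    "KNN":      {100: (0.8356, 0.7534), 200: (0.8367, 0.7766), 300: (0.8558, 0.7681)},
}

# Which model has the highest test F1 at each training set size?
for size in (100, 200, 300):
    best = max(results, key=lambda name: results[name][size][1])
    print("Best test F1 at {} samples: {}".format(size, best))

# A large train/test gap is a sign of overfitting.
for name in ("SVM", "AdaBoost", "KNN"):
    train_f1, test_f1 = results[name][300]
    print("{}: train-test gap at 300 samples = {:.4f}".format(name, train_f1 - test_f1))
```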

### Question 4 - Model in Layman's Terms
*In one to two paragraphs, explain to the board of directors in layman's terms how the final model chosen is supposed to work. Be sure that you are describing the major qualities of the model, such as how the model is trained and how the model makes a prediction. Avoid using advanced mathematical or technical jargon, such as describing equations or discussing the algorithm implementation.*

**Answer:**
The Support Vector Machine model was chosen as the final model due to its performance and efficiency.

An SVM model classifies whether a student will pass or fail by trying to draw a line between the characteristics of students who passed and those of students who failed. This effectively creates a threshold that suggests whether a student will fail. What makes the SVM classifier so good at finding these thresholds is its ability to separate groups of students distinctly. There is a margin of difference between failing and passing students. Maximizing this margin places the threshold at the fairest dividing line between the characteristics that regularly indicate failure and those that indicate passing. Essentially, when classifying students, the model uses the input indicators as scales that become unbalanced if the model uses the data inappropriately and is overzealous in classifying, which is what we call overfitting.

In terms of classifying students on success/failure, generalization amongst the population needs to occur. Without being able to generalize we will not be able to take new students and accurately tell if they are in need of intervention. When the SVM separates classifications of students, it also becomes better at generalizing and thus predicting from new data; in the end producing better results.

The SVM is also good at looking at data from outside the box by changing its own perspective. Like the classic puzzle of finding the shortest distance between two positions on a piece of paper, where the answer is to fold the paper so that the two points touch with practically no space in between, the SVM can see trends within the students that may not appear straightforward at first. From this new perspective, student characteristics can give clear new insights where other models may fail or misrepresent the data.

Another way to think about how an SVM solves this problem is to consider how we might cut a clean, precise shape out of the middle of a piece of paper with normal scissors. A clever way to do this is to fold the paper, exposing the center of the paper as a new edge that can be cut. Going even further, cutting on this new edge creates patterns that would be extremely difficult to make without folding once the paper is unfolded again. This is similar to the SVM bending its view of the characteristics, which allows it to make clean cuts with patterns that would have been very hard to reproduce without bending. Using this technique, the SVM can pinpoint and accentuate the differences in characteristics that lead to passing or failing grades. Once the SVM has made this cut between characteristics, it makes sure the cut treats both sides equally, moving and adjusting it as needed. This becomes useful later, when new students are classified and must be placed on one side of the cut. Since the cut is as even as possible, each new student will be placed on the most reliable side.

Once the SVM is trained on the data the computational costs will be extremely slim. Other classifiers such as the Nearest Neighbors classifier require much more on-demand computational power, and thus much more cost over a long period of time. The SVM can also be adjusted so that it becomes better as it sees and classifies more students, giving it continuous performance improvements.
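
For readers who want the mechanics behind the "cut" analogy, the Python 3 sketch below shows how a linear SVM turns a learned dividing line into a prediction, and why prediction is cheap once training is done. The weights `w` and offset `b` are hypothetical values chosen for illustration, not values learned from the student data, and the helper names are assumptions.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b: which side of the dividing line x falls on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Students on the non-negative side of the line are labeled 'yes' (pass)."""
    return "yes" if decision_value(w, b, x) >= 0 else "no"

def margin_width(w):
    """Width of the band the SVM tries to maximize: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

w = [3.0, -4.0]  # hypothetical learned weights for two features
b = 1.0          # hypothetical learned offset

print(classify(w, b, [2.0, 1.0]))  # 3*2 - 4*1 + 1 = 3  -> yes
print(classify(w, b, [0.0, 1.0]))  # 3*0 - 4*1 + 1 = -3 -> no
print(margin_width(w))             # 2 / 5 = 0.4
```

A prediction for a linear model is a single weighted sum, whereas a kernel SVM evaluates the kernel against each stored support vector; either way, no pass over the full training set is needed at prediction time.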

### Implementation: Model Tuning
Fine tune the chosen model. Use grid search (`GridSearchCV`) with at least one important parameter tuned with at least 3 different values. You will need to use the entire training set for this. In the code cell below, you will need to implement the following:
- Import [`sklearn.grid_search.GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html) and [`sklearn.metrics.make_scorer`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html).
- Create a dictionary of parameters you wish to tune for the chosen model.
  - Example: `parameters = {'parameter' : [list of values]}`.
- Initialize the classifier you've chosen and store it in `clf`.
- Create the F<sub>1</sub> scoring function using `make_scorer` and store it in `f1_scorer`.
  - Set the `pos_label` parameter to the correct value!
- Perform grid search on the classifier `clf` using `f1_scorer` as the scoring method, and store it in `grid_obj`.
- Fit the grid search object to the training data (`X_train`, `y_train`), and store it in `grid_obj`.

In [None]:
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import make_scorer


# TODO: Create the parameters list you wish to tune
parameters = [{'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
               'C': [1, 10, 100, 1000]},
              {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# TODO: Initialize the classifier
clf = svm.SVC()

# TODO: Make an f1 scoring function using 'make_scorer' 
f1_scorer = make_scorer(f1_score, pos_label='yes')

# TODO: Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf, parameters, scoring=f1_scorer)

# TODO: Fit the grid search object to the training data and find the optimal parameters
grid_obj.fit(X_train, y_train)

# Get the estimator
clf = grid_obj.best_estimator_

# Report the final F1 score for training and testing after parameter tuning
print "Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train))
print "Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test))
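
To get a feel for how much work the grid search above does, the Python 3 sketch below expands the same parameter grid into individual candidates using only the standard library. It mirrors the idea of an exhaustive grid search rather than scikit-learn's internals; the helper `expand_grid` is hypothetical, and the fold count `n_folds` is an assumed value for illustration.

```python
from itertools import product

# Same grid as in the tuning cell above.
parameters = [
    {"kernel": ["rbf"], "gamma": [1e-2, 1e-3, 1e-4, 1e-5], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

def expand_grid(param_grids):
    """Yield one dict per parameter combination, one grid at a time."""
    for grid in param_grids:
        names = sorted(grid)
        for values in product(*(grid[name] for name in names)):
            yield dict(zip(names, values))

candidates = list(expand_grid(parameters))
n_folds = 3  # assumed number of cross-validation folds, for illustration only

print(len(candidates))            # 16 rbf + 4 linear = 20 candidates
print(len(candidates) * n_folds)  # model fits under the assumed fold count
```

Each candidate is fit once per cross-validation fold, so widening any list in the grid multiplies the total training cost.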

### Question 5 - Final F<sub>1</sub> Score
*What is the final model's F<sub>1</sub> score for training and testing? How does that score compare to the untuned model?*

**Answer:**
The testing F1 score was 85.7%, compared to 84.6% previously. The training score was 82.3%, down from 85.8% before tuning.

The test F1 score for the tuned model increased, while the training score decreased. Raising the test score was the goal, but the increase is not significant. The slight increase in test score combined with the decrease in training score could show that the model generalizes better and would perform more accurately with other random states of the data. These were not the best

> **Note**: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to  
**File -> Download as -> HTML (.html)**. Include the finished document along with this notebook as your submission.