# Machine Learning Engineer Nanodegree
## Supervised Learning
## Project 2: Building a Student Intervention System

Welcome to the second project of the Machine Learning Engineer Nanodegree! In this notebook, some template code has already been provided for you, and it will be your job to implement the additional functionality necessary to successfully complete this project. Sections that begin with **'Implementation'** in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a `'TODO'` statement. Please be sure to read the instructions carefully!

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a **'Question X'** header. Carefully read each question and provide thorough answers in the following text boxes that begin with **'Answer:'**. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.  

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail to graduate. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer:** Classification applies when the label takes discrete values; regression applies when the label takes continuous numeric values. Here the answer to "Does this student need intervention?" is either "yes" or "no" (a discrete value), so this is a classification problem.

## Exploring the Data
Run the code cell below to load necessary Python libraries and load the student data. Note that the last column from this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [44]:
# Import libraries
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score

# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"
print student_data.head()

Student data read successfully!
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

   ...   internet romantic  famrel  freetime  goout Dalc Walc health absences  \
0  ...         no       no       4         3      4    1    1      3        6   
1  ...        yes       no       5         3      3    1    1      3        4   
2  ...        yes       no       4         3      2    2    3      3       10   
3  ...        yes      yes       3         2      2    1    1      5        2   
4  ...         no       no       4         3      2    1    2      5        4   

  passed  
0     no  
1 

### Implementation: Data Exploration
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [45]:
# TODO: Calculate number of students
n_students = student_data.shape[0]

# TODO: Calculate number of features
n_features = student_data.columns[:-1].shape[0]

# TODO: Calculate passing students
n_passed = student_data[student_data.passed == 'yes']['passed'].count()

# TODO: Calculate failing students
n_failed = student_data[student_data.passed == 'no']['passed'].count()

# TODO: Calculate graduation rate
grad_rate = (n_passed * 100.00)/(n_passed + n_failed)

# Print the results
print "Total number of students: {}".format(n_students)
print "Number of features: {}".format(n_features)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%


## Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Run the code cell below to separate the student data into feature and target columns to see if any features are non-numeric.

In [46]:
# Extract feature columns
feature_cols = list(student_data.columns[:-1])

# Extract target column 'passed'
target_col = student_data.columns[-1] 

# Show the list of columns
print "Feature columns:\n{}".format(feature_cols)
print "\nTarget column: {}".format(target_col)

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]

# Show the feature information by printing the first five rows
print "\nFeature values:"
print X_all.head()

Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       

### Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.

In [47]:
def preprocess_features(X):
    ''' Preprocesses the student data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    for col, col_data in X.iteritems():
        
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)  
        
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

X_all = preprocess_features(X_all)
print "Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48 total features):
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. In the code cell below, you will need to implement the following:
- Randomly shuffle and split the data (`X_all`, `y_all`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [48]:
# TODO: Import any additional functionality you may need here
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler

# TODO: Set the number of training points
num_train = 300

# Set the number of testing points
num_test = X_all.shape[0] - num_train

# TODO: Shuffle and split the dataset into the number of training and testing points above
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, stratify=y_all, test_size=num_test, random_state=42)

# Feature scaling (standardization) so that features with large numeric ranges do not dominate
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Show the results of the split
print "Training set has {} samples.".format(X_train.shape[0])
print "Testing set has {} samples.".format(X_test.shape[0])

Training set has 300 samples.
Testing set has 95 samples.


## Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data (100 data points, 200 data points, and 300 data points) and measure the F<sub>1</sub> score. You will need to produce three tables (one for each model) that show the training set size, training time, prediction time, F<sub>1</sub> score on the training set, and F<sub>1</sub> score on the testing set.

### Question 2 - Model Application
*List three supervised learning models that are appropriate for this problem. What are the general applications of each model? What are their strengths and weaknesses? Given what you know about the data, why did you choose these models to be applied?*

**Answer:** Three supervised learning models appropriate for this problem are 1) decision trees, 2) K-nearest neighbors, and 3) Naive Bayes.

1) Decision tree models can be used in most classification tasks; they are versatile and can handle most discontinuous datasets. They resemble the way people reason through a decision, which makes them intuitive and simple. Decision trees are prone to overfitting: if not constrained (by maximum depth or by a minimum information gain), they can produce complicated classifiers that generalize poorly. The tree is built level by level, choosing at each node the attribute with the highest information gain (i.e., the largest reduction in entropy, or uncertainty). The information gain is typically highest at the root node and shrinks as the tree grows deeper, because the "best" attribute is chosen at each node. We should stop splitting at a node if the information gain does not meet an imposed threshold. Because the best attribute is used at each node, the most relevant attributes automatically filter up towards the root.

2) K-nearest neighbors is a lazy learner: it classifies by comparing a new data point to the training data it has stored, instead of inducing a rule that maps data to labels. It suits applications where the data cluster naturally and points within a cluster are strongly similar to each other (for example, speech recognition). K-nearest neighbors is intuitive, simple, and easy to understand, yet it does well in many applications. It has almost no training time. However, it needs a lot of space to store all the training data points, and prediction is slow because for every new data point we have to compute its distance to all the stored training points. Finally, like all classifiers, it suffers from the curse of dimensionality, and it is affected more than most because it has to store the training data: if we use more features, we need more training data, all of which must be stored. For these reasons K-NN classifiers are usually used for small classification problems.

3) The Naive Bayes classifier is a very simple classifier based on Bayes' rule. It is fast to train and often achieves scores comparable to those of more complicated classifiers. Its main assumption is that the input features are all independent of each other. This is usually not strictly true, but when the features are only weakly correlated they can be treated as independent to a first approximation, and in these cases Naive Bayes works very well. When the features are strongly dependent, Naive Bayes may not work well because its basic assumption is violated; in such cases it may make sense to run Principal Component Analysis on the input features first. Naive Bayes is simple, fast, easy to understand, and robust in most use cases. It has found use in applications such as spam filtering.
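To make the information-gain criterion from the decision tree discussion concrete, here is a minimal sketch in plain Python. The parent counts (265 passed, 130 failed) come from the data exploration above; the split into `left` and `right` is a made-up example and is not computed from the dataset. The helper names `entropy` and `information_gain` are illustrative and are not part of scikit-learn.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = float(sum(counts))
    return -sum((c / total) * math.log(c / total, 2) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = float(sum(parent_counts))
    weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Class counts from the data exploration above: 265 passed, 130 failed.
parent = [265, 130]

# Hypothetical split, for illustration only (not computed from the dataset).
left = [225, 70]
right = [40, 60]

gain = information_gain(parent, [left, right])
print("Root entropy: {:.4f} bits".format(entropy(parent)))
print("Information gain of the hypothetical split: {:.4f} bits".format(gain))
```

A tree learner evaluates a quantity like `gain` for each candidate attribute at a node and splits on the one with the largest value.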


We chose these classifiers because they are intuitive, simple, and easy to use, and all are available directly in scikit-learn. They are also easy to explain, even to a layperson: decision trees and K-nearest neighbors mirror how people make choices in real life. Naive Bayes is a little harder to explain to a layperson, but the algorithm still makes intuitive sense.

Edit: Added two more enhancements based on the feedback in the last review. We added stratified sampling so that both the training and test sets are representative of the overall label distribution. We also added feature scaling (standardization). Both enhancements are good practice in data preprocessing anyway, but here they also noticeably improved the model that we ended up choosing.
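As a rough illustration of what the feature scaling step does, the sketch below standardizes a single column in plain Python. The statistics are fitted on the training values only and then reused for the test values, which mirrors the `fit_transform` / `transform` pattern used with `StandardScaler` earlier. The `absences` values and the helper names `fit_standardizer` and `standardize` are hypothetical, and the sketch assumes the column is not constant (non-zero standard deviation).

```python
import math

def fit_standardizer(column):
    """Return the mean and population standard deviation of a training column."""
    n = float(len(column))
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return mean, std

def standardize(column, mean, std):
    """Rescale values using previously fitted statistics."""
    return [(x - mean) / std for x in column]

# Hypothetical 'absences' values, for illustration only.
train_absences = [6, 4, 10, 2, 4, 0, 16, 8]
test_absences = [2, 12]

mean, std = fit_standardizer(train_absences)
train_scaled = standardize(train_absences, mean, std)
test_scaled = standardize(test_absences, mean, std)  # reuse training statistics

print("Training mean: {:.2f}, std: {:.2f}".format(mean, std))
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step.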

### Setup
Run the code cell below to initialize three helper functions which you can use for training and testing the three supervised learning models you've chosen above. The functions are as follows:
- `train_classifier` - takes as input a classifier and training data and fits the classifier to the data.
- `predict_labels` - takes as input a fit classifier, features, and a target labeling and makes predictions using the F<sub>1</sub> score.
- `train_predict` - takes as input a classifier, and the training and testing data, and performs `train_classifier` and `predict_labels`.
 - This function will report the F<sub>1</sub> score for both the training and testing data separately.

In [49]:
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Print the results
    print "Trained model in {:.4f} seconds".format(end - start)

    
def predict_labels(clf, features, target):
    ''' Makes predictions using a fit classifier based on F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    
    # Print and return results
    print "Made predictions in {:.4f} seconds.".format(end - start)
    return f1_score(target.values, y_pred, pos_label='yes')


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Train and predict using a classifier based on F1 score. '''
    
    # Indicate the classifier and the training set size
    print "Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train))
    
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    print "F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train))
    print "F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test))

### Implementation: Model Performance Metrics
With the predefined functions above, you will now import the three supervised learning models of your choice and run the `train_predict` function for each one. Remember that you will need to train and predict on each classifier for three different training set sizes: 100, 200, and 300. Hence, you should expect to have 9 different outputs below — 3 for each model using the varying training set sizes. In the following code cell, you will need to implement the following:
- Import the three supervised learning models you've discussed in the previous section.
- Initialize the three models and store them in `clf_A`, `clf_B`, and `clf_C`.
 - Use a `random_state` for each model you use, if provided.
 - **Note:** Use the default settings for each model — you will tune one specific model in a later section.
- Create the different training set sizes to be used to train each model.
 - *Do not reshuffle and resplit the data! The new training points should be drawn from `X_train` and `y_train`.*
- Fit each model with each training set size and make predictions on the test set (9 in total).  
**Note:** Three tables are provided after the following code cell which can be used to store your results.

In [50]:
# TODO: Import the three supervised learning models from sklearn
# from sklearn import model_A
# from sklearn import model_B
# from sklearn import model_C

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB


# TODO: Initialize the three models
clf_A = DecisionTreeClassifier(random_state=0)
clf_B = KNeighborsClassifier()
clf_C = GaussianNB()

# TODO: Set up the training set sizes
X_train_100 = X_train[0:100]
y_train_100 = y_train[0:100]

X_train_200 = X_train[0:200]
y_train_200 = y_train[0:200]

X_train_300 = X_train[0:300]
y_train_300 = y_train[0:300]

# TODO: Execute the 'train_predict' function for each classifier and each training set size
train_predict(clf_A, X_train_100, y_train_100, X_test, y_test)
train_predict(clf_A, X_train_200, y_train_200, X_test, y_test)
train_predict(clf_A, X_train_300, y_train_300, X_test, y_test)
train_predict(clf_B, X_train_100, y_train_100, X_test, y_test)
train_predict(clf_B, X_train_200, y_train_200, X_test, y_test)
train_predict(clf_B, X_train_300, y_train_300, X_test, y_test)
train_predict(clf_C, X_train_100, y_train_100, X_test, y_test)
train_predict(clf_C, X_train_200, y_train_200, X_test, y_test)
train_predict(clf_C, X_train_300, y_train_300, X_test, y_test)

Training a DecisionTreeClassifier using a training set size of 100. . .
Trained model in 0.0014 seconds
Made predictions in 0.0002 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0002 seconds.
F1 score for test set: 0.6667.
Training a DecisionTreeClassifier using a training set size of 200. . .
Trained model in 0.0017 seconds
Made predictions in 0.0001 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0001 seconds.
F1 score for test set: 0.7097.
Training a DecisionTreeClassifier using a training set size of 300. . .
Trained model in 0.0023 seconds
Made predictions in 0.0002 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0001 seconds.
F1 score for test set: 0.6557.
Training a KNeighborsClassifier using a training set size of 100. . .
Trained model in 0.0006 seconds
Made predictions in 0.0017 seconds.
F1 score for training set: 0.7973.
Made predictions in 0.0018 seconds.
F1 score for test set: 0.7483.
Training a KNeighborsClassifier us

### Tabular Results
Edit the cell below to see how a table can be designed in [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#tables). You can record your results from above in the tables provided.

**Note:** The values in the tables will be close to, but not exactly the same as, the output above, since I ran the classifiers again after creating the tables and did not update them each time.

**Classifier 1 - DecisionTreeClassifier**  

| Training Set Size | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               |        0.0004 seconds   |    0.0005 seconds      |     1.0000       |    0.6667       |
| 200               |        0.0004 seconds   |    0.0005 seconds      |     1.0000       |    0.7097       |
| 300               |        0.0006 seconds   |    0.0005 seconds      |     1.0000       |    0.6557       |

**Classifier 2 - KNeighborsClassifier**  

| Training Set Size | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               |        0.0024 seconds   |    0.0016 seconds      |     0.7973       |    0.7483       |
| 200               |        0.0045 seconds   |    0.0024 seconds      |     0.8125       |    0.7413       |
| 300               |        0.0088 seconds   |    0.0030 seconds      |     0.8326       |    0.7755       | 

**Classifier 3 - GaussianNB**  

| Training Set Size | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               |        0.0004 seconds   |    0.0004 seconds      |     0.7752       |     0.6457      |
| 200               |        0.0005 seconds   |    0.0004 seconds      |     0.8060       |     0.7218      |
| 300               |        0.0007 seconds   |    0.0004 seconds      |     0.8134       |     0.7761      |

## Choosing the Best Model
In this final section, you will choose from the three supervised learning models the *best* model to use on the student data. You will then perform a grid search optimization for the model over the entire training set (`X_train` and `y_train`) by tuning at least one parameter to improve upon the untuned model's F<sub>1</sub> score. 

### Question 3 - Choosing the Best Model
*Based on the experiments you performed earlier, in one to two paragraphs, explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?*

**Answer:** DecisionTreeClassifier: This model suffers from overfitting. The F1 score is 1.0 on the training set but considerably lower on the test set. Since we used the default parameters (i.e., `max_depth=None`), the tree has grown until it classifies every data point in the training set correctly, but it fails to generalize to the test set.

KNeighborsClassifier: The model shows a high F1 score on both training and testing data, and the score on the testing data is not much lower than on the training data, so the model does not appear to overfit. A potential concern is that the F1 score is still increasing as the number of data points increases, which suggests we do not have enough data points to get the maximum value out of the model.

Naive Bayes: The model also shows a high F1 score on both training and testing data, and the score on the testing data is not much lower than on the training data, so the model does not appear to overfit. In the tabulated runs its F1 score rises with the training set size on both sets (from 0.6457 to 0.7761 on the test set). One caveat remains: Naive Bayes assumes that the features are independent, and where this assumption is violated the model may be less accurate.

With this discussion in mind, we choose the K-nearest neighbors classifier for further tuning. The K-NN model shows close to the highest F1 score among the three models when using all 300 data points for training. Further, its test F1 score increases monotonically with the number of data points (treating 0.7483 and 0.7413 as roughly equal), which suggests the model will keep improving if and when it gets more data to train on. The DecisionTreeClassifier's F1 score fluctuates as data points are added. The Naive Bayes classifier has a strong showing as well, but we are using K-NN here with default parameters and can tune it to get better performance. K-NN responds robustly on this training set and hence is recommended for use. As expected, K-NN prediction times are larger than those of the other two methods, but since this is a small dataset and is likely to remain small (< 10K data points) in the future, we can still use K-NN. Its runtime will be a few seconds longer than that of the other algorithms, which should be acceptable.

### Question 4 - Model in Layman's Terms
*In one to two paragraphs, explain to the board of directors in layman's terms how the final model chosen is supposed to work. For example if you've chosen to use a decision tree or a support vector machine, how does the model go about making a prediction?*

**Answer:** "Birds of a feather flock together" pretty much sums up how the K-nearest neighbors algorithm works. To predict how a student will do in school and whether they will graduate, we look at other students in similar situations and see what the outcome was for them. More concretely, we need a body of data on past students: their attributes (as defined in the 30 columns in our data) and whether they graduated or not. Once we have these data, we can take a new student who has data on the same attributes and try to predict whether he or she will graduate. We do this by finding the group of past students who are most similar to the new student (our algorithm found that a group of 41 such students gives the best results). For this group we find the majority outcome (whether they graduated or not), and our algorithm predicts that the new student will probably have the same outcome. For example, if the majority of students in this group failed to graduate, then our algorithm says that the new student will fail to graduate without an intervention.
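The majority-vote idea described above can be sketched in a few lines of plain Python. This is only an illustration of the technique, not the scikit-learn implementation: the student records are invented, only two features are used, and `euclidean` and `knn_predict` are hypothetical helper names.

```python
from collections import Counter
import math

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k):
    """Label the query by majority vote among its k nearest training points."""
    distances = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, already-scaled students described by two features.
students = [(0.1, 0.2), (0.0, 0.4), (0.3, 0.1), (1.9, 2.1), (2.2, 1.8), (2.0, 2.0)]
outcomes = ["yes", "yes", "yes", "no", "no", "no"]

new_student = (1.8, 1.9)
prediction = knn_predict(students, outcomes, new_student, k=3)
print("Predicted outcome: {}".format(prediction))
```

Note that every prediction scans all stored students, which is why prediction time grows with the size of the training set.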

### Implementation: Model Tuning
Fine tune the chosen model. Use grid search (`GridSearchCV`) with at least one important parameter tuned with at least 3 different values. You will need to use the entire training set for this. In the code cell below, you will need to implement the following:
- Import [`sklearn.grid_search.GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html) and [`sklearn.metrics.make_scorer`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html).
- Create a dictionary of parameters you wish to tune for the chosen model.
 - Example: `parameters = {'parameter' : [list of values]}`.
- Initialize the classifier you've chosen and store it in `clf`.
- Create the F<sub>1</sub> scoring function using `make_scorer` and store it in `f1_scorer`.
 - Set the `pos_label` parameter to the correct value!
- Perform grid search on the classifier `clf` using `f1_scorer` as the scoring method, and store it in `grid_obj`.
- Fit the grid search object to the training data (`X_train`, `y_train`), and store it in `grid_obj`.

In [63]:
# TODO: Import 'GridSearchCV' and 'make_scorer'
from sklearn.grid_search import GridSearchCV 
from sklearn.metrics import make_scorer 
from sklearn.metrics import classification_report

# TODO: Create the parameters list you wish to tune
parameters = {'n_neighbors' : range(1,50)}
#parameters = {}

# TODO: Initialize the classifier
clf = KNeighborsClassifier()
#clf = GaussianNB()

# TODO: Make an f1 scoring function using 'make_scorer' 
f1_scorer = make_scorer(f1_score, pos_label='yes')

# TODO: Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf, parameters, scoring=f1_scorer)


# TODO: Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train, y_train)

# Get the estimator
clf = grid_obj.best_estimator_

# Report the final F1 score for training and testing after parameter tuning
print "Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train))
print "Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test))
print "Parameter 'n_neighbors' is {} for the optimal model.".format(clf.get_params()['n_neighbors'])


print "\n\nF1 score for predicting all \"yes\" on test set: {:.4f}\n\n".format(
    f1_score(y_test, ['yes']*len(y_test), pos_label='yes', average='binary'))
print classification_report(y_test, grid_obj.predict(X_test))

Made predictions in 0.0122 seconds.
Tuned model has a training F1 score of 0.8056.
Made predictions in 0.0043 seconds.
Tuned model has a testing F1 score of 0.8101.
Parameter 'n_neighbors' is 41 for the optimal model.


F1 score for predicting all "yes" on test set: 0.8050


             precision    recall  f1-score   support

         no       1.00      0.03      0.06        31
        yes       0.68      1.00      0.81        64

avg / total       0.78      0.68      0.57        95



### Question 5 - Final F<sub>1</sub> Score
*What is the final model's F<sub>1</sub> score for training and testing? How does that score compare to the untuned model?*

**Answer:** The tuned model's F1 score is 0.8056 on the training set and 0.8101 on the testing set, with an optimal `n_neighbors` of 41. This compares well with the untuned model (0.8326 and 0.7755 with 300 training points): the F1 score is lower on the training set but higher on the test set. Finally, on the test set our model does slightly better than blindly predicting "yes" for every new data point (0.8050).
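The all-"yes" baseline can be checked by hand from the class support in the classification report (64 "yes" and 31 "no" students in the test set). The sketch below computes F1 directly from true-positive, false-positive, and false-negative counts; `f1_from_counts` is an illustrative helper, not a scikit-learn function.

```python
def f1_from_counts(true_pos, false_pos, false_neg):
    """F1 score: harmonic mean of precision and recall."""
    precision = true_pos / float(true_pos + false_pos)
    recall = true_pos / float(true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Test-set support from the classification report: 64 'yes' and 31 'no'.
n_yes, n_no = 64, 31

# Predicting 'yes' for everyone: every 'yes' is a true positive,
# every 'no' is a false positive, and there are no false negatives.
baseline_f1 = f1_from_counts(true_pos=n_yes, false_pos=n_no, false_neg=0)
print("All-'yes' baseline F1: {:.4f}".format(baseline_f1))  # 0.8050
```

Because the baseline already reaches 0.8050, a tuned score of 0.8101 is only a marginal improvement on this metric.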

Finally, looking at the classification report, we see that there are many false positives (students predicted to graduate who actually do not): recall for the "no" class is only 0.03. This means that some of the students who need intervention may not get it. The model is very good at predicting the "yes" labels but poor at predicting the "no" labels, and we need to be much better at identifying the "no" labels (i.e., the students who will not graduate). This is an imbalanced dataset ("yes" labels are about twice as common as "no" labels), so we expect the F1 score to be lower for the "no" labels: with fewer "no" examples, the model has less room for error on that class. For completeness, we also run Naive Bayes through the grid search.

### Second Model

In [62]:
# TODO: Import 'GridSearchCV' and 'make_scorer'
from sklearn.grid_search import GridSearchCV 
from sklearn.metrics import make_scorer 
from sklearn.metrics import classification_report



# TODO: Create the parameters list you wish to tune
#parameters = {'n_neighbors' : range(1,50)}
parameters = {}

# TODO: Initialize the classifier
#clf = KNeighborsClassifier()
clf = GaussianNB()

# TODO: Make an f1 scoring function using 'make_scorer' 
f1_scorer = make_scorer(f1_score, pos_label='yes')

# TODO: Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf, parameters, scoring=f1_scorer)


# TODO: Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train, y_train)

# Get the estimator
clf = grid_obj.best_estimator_

# Report the final F1 score for training and testing after parameter tuning
print "Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train))
print "Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test))
#print "Parameter 'n_neighbors' is {} for the optimal model.".format(clf.get_params()['n_neighbors'])


print "\n\nF1 score for predicting all \"yes\" on test set: {:.4f}\n\n".format(
    f1_score(y_test, ['yes']*len(y_test), pos_label='yes', average='binary'))
print classification_report(y_test, grid_obj.predict(X_test))

Made predictions in 0.0005 seconds.
Tuned model has a training F1 score of 0.8134.
Made predictions in 0.0003 seconds.
Tuned model has a testing F1 score of 0.7761.


F1 score for predicting all "yes" on test set: 0.8050


             precision    recall  f1-score   support

         no       0.52      0.42      0.46        31
        yes       0.74      0.81      0.78        64

avg / total       0.67      0.68      0.67        95



**Comment:** For this model the F1 score for the "no" labels is much better than for K-NN. The F1 score for the "yes" labels has gone down but is still high. This means we are now more likely to find the students who need intervention. The downside is that the model will also flag some students who would have graduated without intervention.

> **Note**: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to  
**File -> Download as -> HTML (.html)**. Include the finished document along with this notebook as your submission.