# Machine Learning Engineer Nanodegree
## Supervised Learning
## Project: Building a Student Intervention System

Welcome to the second project of the Machine Learning Engineer Nanodegree! In this notebook, some template code has already been provided for you, and it will be your job to implement the additional functionality necessary to successfully complete this project. Sections that begin with **'Implementation'** in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a `'TODO'` statement. Please be sure to read the instructions carefully!

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a **'Question X'** header. Carefully read each question and provide thorough answers in the following text boxes that begin with **'Answer:'**. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.  

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can be edited by typically double-clicking the cell to enter edit mode.

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail to graduate. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer:** This is a classification problem because the goal of the student intervention program is to understand the factors that lead to two binary outcomes among students - passing and failing. In order to increase the graduation rate at this school, it is vital to understand what features about students increase the likelihood that they will either pass or fail. The training data allows the model to discover patterns in the data that promote the particular outcomes. After training, the final model that will be built to solve this problem will ultimately take in a variety of features about a student and predict an outcome. Using this model, school administrators can intervene when they notice a certain pattern arising in a student that the model predicts would lead to failure.

## Exploring the Data
Run the code cell below to load necessary Python libraries and load the student data. Note that the last column from this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [4]:
# Import libraries
import numpy as np
import pandas as pd
from IPython.display import display
from time import time
from sklearn.metrics import f1_score

# Read student data
student_data = pd.read_csv("student-data.csv")
print("Student data read successfully!")
display(student_data.head())

Student data read successfully!


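| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | passed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 6 | no |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 4 | no |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 10 | yes |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 2 | yes |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 4 | yes |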


### Implementation: Data Exploration
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [5]:
# TODO: Calculate number of students
n_students = student_data.shape[0]

# TODO: Calculate number of features
n_features = student_data.shape[1] - 1 

# TODO: Calculate passing students
n_passed = len(student_data[(student_data['passed']=='yes')])

# TODO: Calculate failing students
n_failed = len(student_data[(student_data['passed']=='no')])

# TODO: Calculate graduation rate
grad_rate = (n_passed/n_students)*100

# Print the results
print("Total number of students: {}".format(n_students))
print("Number of features: {}".format(n_features))
print("Number of students who passed: {}".format(n_passed))
print("Number of students who failed: {}".format(n_failed))
print("Graduation rate of the class: {:.2f}%".format(grad_rate))

Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%
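The same counts can also be read off in a single pass with `Series.value_counts`. The sketch below is only an illustration: it uses a small, made-up `passed` column rather than the real CSV, so the printed numbers are not the project's results.

```python
import pandas as pd

# Hypothetical stand-in for student_data['passed']; not the real dataset.
passed = pd.Series(["yes", "no", "yes", "yes", "no"], name="passed")

# One pass over the column gives the count of every label.
counts = passed.value_counts()
n_passed = int(counts.get("yes", 0))
n_failed = int(counts.get("no", 0))
grad_rate = 100.0 * n_passed / len(passed)

print("passed={}, failed={}, rate={:.2f}%".format(n_passed, n_failed, grad_rate))
```

Using `counts.get(label, 0)` avoids a `KeyError` if one of the labels happens to be absent from the column.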


## Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Run the code cell below to separate the student data into feature and target columns to see if any features are non-numeric.

In [6]:
# Extract feature columns
feature_cols = list(student_data.columns[:-1])

# Extract target column 'passed'
target_col = student_data.columns[-1] 

# Show the list of columns
print("Feature columns:\n{}".format(feature_cols))
print("\nTarget column: {}".format(target_col))

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]

# Show the feature information by printing the first five rows
print("\nFeature values:")
print(X_all.head())

Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       

### Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.

In [7]:
def preprocess_features(X):
    ''' Preprocesses the student data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    # Note: DataFrame.iteritems() was removed in pandas 2.0; items() is the equivalent
    for col, col_data in X.items():
        
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)  
        
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

X_all = preprocess_features(X_all)
print("Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns)))

Processed feature columns (48 total features):
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
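To make the two conversions concrete, here is a minimal sketch on a three-row toy frame. The column names mirror the dataset, but the values are made up, and the approach (an explicit `map` plus `get_dummies` with `columns=`) is one possible variant rather than the routine used in the cell above.

```python
import pandas as pd

# Hypothetical three-row frame; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "internet": ["yes", "no", "yes"],
    "Fjob": ["teacher", "other", "services"],
    "age": [18, 17, 15],
})

# Binary yes/no column: map explicitly to 1/0.
toy["internet"] = toy["internet"].map({"yes": 1, "no": 0})

# Multi-valued column: one indicator column per category.
# dtype=int keeps 0/1 integers (recent pandas versions default to booleans).
toy = pd.get_dummies(toy, columns=["Fjob"], prefix="Fjob", dtype=int)

print(toy)
```

Each row ends up with exactly one `1` among the `Fjob_*` columns, which is the dummy-variable encoding described above.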


### Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. In the code cell below, you will need to implement the following:
- Randomly shuffle and split the data (`X_all`, `y_all`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [22]:
from sklearn.model_selection import train_test_split

# TODO: Set the number of training points
num_train = 300

# Set the number of testing points
num_test = X_all.shape[0] - num_train

# TODO: Shuffle and split the dataset into the number of training and testing points above

X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=40)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 300 samples.
Testing set has 95 samples.
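Because roughly two thirds of the students passed, a purely random split can leave the training and test sets with slightly different pass rates. `train_test_split` accepts a `stratify` argument that keeps the class proportions aligned. The project instructions do not ask for this, so treat the following as an optional sketch on synthetic labels that only reproduce the class counts reported earlier.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the same class balance as the dataset (265 'yes', 130 'no').
y_demo = np.array(["yes"] * 265 + ["no"] * 130)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the yes/no ratio (almost) identical in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=95, stratify=y_demo, random_state=40
)

print("train pass rate: {:.3f}".format(np.mean(y_tr == "yes")))
print("test pass rate:  {:.3f}".format(np.mean(y_te == "yes")))
```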


## Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit each model to varying sizes of training data (100 data points, 200 data points, and 300 data points) and measure the F<sub>1</sub> score. You will need to produce three tables (one for each model) that show the training set size, training time, prediction time, F<sub>1</sub> score on the training set, and F<sub>1</sub> score on the testing set.
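The F<sub>1</sub> score is the harmonic mean of precision and recall. As a rough illustration, the sketch below computes it by hand and checks the result against `sklearn.metrics.f1_score`. The "always predict yes" classifier is a hypothetical baseline, not one of the project's models; the labels simply reuse the class counts reported above.

```python
import numpy as np
from sklearn.metrics import f1_score

# Labels with the class counts reported above; the "always yes" predictor is a hypothetical baseline.
y_true = np.array(["yes"] * 265 + ["no"] * 130)
y_pred = np.array(["yes"] * len(y_true))

# Confusion counts for the positive class 'yes'.
tp = np.sum((y_true == "yes") & (y_pred == "yes"))
fp = np.sum((y_true == "no") & (y_pred == "yes"))
fn = np.sum((y_true == "yes") & (y_pred == "no"))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

f1_sklearn = f1_score(y_true, y_pred, pos_label="yes")
print("manual F1 = {:.4f}, sklearn F1 = {:.4f}".format(f1_manual, f1_sklearn))
```

With these counts the baseline reaches an F<sub>1</sub> of about 0.80, which is a useful reference point: on the full dataset, a model scoring near that value is doing no better than predicting that every student passes.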

**The following supervised learning models are currently available in** [`scikit-learn`](http://scikit-learn.org/stable/supervised_learning.html) **that you may choose from:**
- Gaussian Naive Bayes (GaussianNB)
- Decision Trees
- Ensemble Methods (Bagging, AdaBoost, Random Forest, Gradient Boosting)
- K-Nearest Neighbors (KNeighbors)
- Stochastic Gradient Descent (SGDC)
- Support Vector Machines (SVM)
- Logistic Regression

### Question 2 - Model Application
*List three supervised learning models that are appropriate for this problem. For each model chosen*
- Describe one real-world application in industry where the model can be applied. *(You may need to do a small bit of research for this — give references!)* 
- What are the strengths of the model; when does it perform well? 
- What are the weaknesses of the model; when does it perform poorly?
- What makes this model a good candidate for the problem, given what you know about the data?


### I. Logistic Regression

**1.)** Logistic regression is often used in predicting weather events, for example, whether or not it will rain. Weather forecasts usually depict rain as a binary event (rain or no rain), with the corresponding probability that it will rain. This is a textbook example of logistic regression. This model is used to predict rainfall in ***A.H.M. Rahmatullah Imon, Manos C. Roy, S. K. Bhattacharjee (2012) Prediction of Rainfall Using Logistic Regression. Pakistan Journal of Statistics and Operation Research, 8(3), ISSN: 2220-5810.*** After carefully preprocessing the data and removing outliers, the researchers in this study were able to use the model to predict days that it would rain with 95.25% success. For days with no rain, the success rate was 84.48%. 


    
**2.)** The primary strength of logistic regression is that it not only classifies data by label, but also returns the probability of it belonging to the positive class as well as its complement. It combines the rigidness of unit-step classification with the flexibility of linear regression. Furthermore, the results are typically easy to interpret and explain (the probabilities of the labels always add up to 1). Most importantly, because logistic regression does not attempt to create the complex decision boundaries that other algorithms do, it is not as prone to overfitting. Moreover, logistic regression is one of the fastest machine learning algorithms in both training and testing, typically running side by side with decision trees in this measure. On large datasets, it beats decision trees in training and prediction speed. Logistic regression performs well when there isn't too much noise in the data and when there aren't too many outliers. It also benefits from having all the data on the same scale. The more linearly separable the data is, the easier the model is to set up and the more accurate its predictions are. In situations where the data does not have one clear or at least rough decision boundary, logistic regression is not the best choice.

    
**3.)** Because logistic regression is a parametric model that calculates the weight coefficients using all feature vectors in the dataset, it can be heavily influenced by outliers and noise in the data. This algorithm can overcome some of these issues using L2 regularization, but the problem can still weigh on the results. These reasons make logistic regression a fickle model to use. In order to get the most out of it, features have to be scaled, categorical variables have to be transformed into numerical ones, outliers need to be removed, and so on. If any one of these points is not addressed, results could be heavily biased and performance could be poor. Logistic regression performs poorly when the data cannot be reasonably separated with a linear hyperplane. It isn't as weak as the perceptron in this regard, but the less defined the decision boundary, the poorer the predictions will be. Too many clusters can cause issues for logistic regression as well. It can't create rectangular decision regions like decision trees, so unless the boundary can be approximated by one hyperplane, performance will be less than ideal. Lastly, without regularization, logistic regression can put so much weight on one or two features that it ends up essentially ignoring the rest in its predictions. Decision trees, which prioritize features in their question-asking scheme, aren't subject to this problem as much.

    
**4.)** The fact that the output values in the dataset are categorical and binary makes this a classic case for logistic regression. There are no missing values, so that won't cause problems. The data in its current form needs to be transformed first if it is to be used with logistic regression, but that isn't a huge issue. Looking at the data, one of the most obvious aspects is the high number of features relative to the number of samples. This makes logistic regression a good choice as it can avoid the "curse of dimensionality" that comes with a large number of features. Instead of trying to fit the data precisely using decision rules, logistic regression seeks to find the best decision boundary that minimizes error by approximating a linear hyperplane. It accomplishes this by assigning varying weight parameters to each feature, with higher weights getting more importance. Beyond the data itself, logistic regression makes for a good candidate for this problem in general. The school is trying to predict whether students will pass or fail and they have a limited budget for those that are headed down the latter path. Because logistic regression does not make absolute predictions, but instead probabilistic ones, the support staff at the school can further classify students that the model predicts will fail into ones that are highly likely (85%+), moderately likely (65-85%) and just barely likely (51-65%) to fail. They can subsequently devote more resources to students in the first category before allocating them to those in the second and third categories. They may even decide that students with a 95% or greater chance of failure just aren't worth wasting resources on. True to its name, logistic regression can help the school work out its resource logistics in a logically cold manner. This sounds ruthless, but by doing so they can find a balance between boosting graduation rates and maximizing resource efficiency. 
Besides, isn't machine learning sort of a ruthless undertaking to begin with?
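The tiering idea can be sketched with `predict_proba`. Everything below is illustrative: the data is synthetic (class `1` plays the role of "fail"), and `risk_tier` is a hypothetical helper that simply encodes the cut-offs mentioned in the answer above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data standing in for the student features; 1 plays the role of "fail".
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=40)
clf = LogisticRegression(max_iter=1000, random_state=40).fit(X_demo, y_demo)

# Column order of predict_proba follows clf.classes_, so look up the "fail" column explicitly.
proba = clf.predict_proba(X_demo)
fail_col = list(clf.classes_).index(1)
p_fail = proba[:, fail_col]

def risk_tier(p):
    """Hypothetical tiers using the cut-offs named in the text."""
    if p >= 0.85:
        return "high"
    if p >= 0.65:
        return "moderate"
    if p > 0.50:
        return "low"
    return "predicted to pass"

tiers = [risk_tier(p) for p in p_fail]
for name in ["high", "moderate", "low", "predicted to pass"]:
    print("{:>18}: {}".format(name, tiers.count(name)))
```

The probabilities are computed on the training data here purely for brevity; in practice the tiers would be assigned to students the model has not seen.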
    
    
### II. Decision Trees

**1.)** Researchers at Caltech contracted by NASA have used decision trees for image classification, specifically to distinguish objects recorded in the night sky. In ***Nicholas Weir, Usama M. Fayyad, and S. Djorgovski. Automated star/galaxy classification for digitized POSS-II. The Astronomical Journal, 109(6):2401, 1995.***, the researchers applied four different decision tree induction algorithms to 3 terabytes worth of astronomical images to classify the objects in the images into stars, stars with fuzz, galaxies, and artifacts. They reported a maximum accuracy of 94.2% on unseen data that was too faint for visual classification using the RULER induction system.


**2.)** Decision trees are easy to set up, train, and use for prediction. Furthermore, they can be interpreted and visualized easily. Data does not have to be scaled, transformed, cleaned, etc., to be used by a decision tree. Missing values, outliers, and noisy data don't cause major issues. Their non-parametric nature allows them to find simple linear decision boundaries as well as more complex ones involving many borders. The result of all this is that they make sense as a logical approach to classify data, even to a non-technical audience. However, all this convenience creates a tradeoff with performance. Without "pruning" and tuning, decision trees by default will overfit to the training data, to the point of classifying it perfectly. The issue is exacerbated by small datasets when trying to use decision trees to make predictions on unseen data. More features increase the degree of overfitting as well. Conversely, decision trees achieve great performance when they are given a large number of samples, and when the number of features is managed through pruning. Although more data is better for all machine learning algorithms, this is especially true with decision trees. In addition, decision trees work best when the decision boundaries are parallel to the feature axes. When visualizing their results, one will find the decision boundary will typically consist of a complex arrangement of rectangles with sides parallel to the axes. If the decision boundary is instead diagonal, logistic regression is the better choice. Strengths are described further in the fourth section.


**3.)** One of decision trees' biggest strengths - their ease of interpretability - quickly degrades when the number of features becomes too large. In addition to overfitting the training data, the tree can become so deep and complex that trying to understand it is not worth the effort. Once again, managing the depth of the tree through pruning can reduce this issue to some extent, however if the tree is pruned too much, it can become subject to bias and fail to generalize well on new data. Finding the right balance between bias, variance, and interpretability is thus a major challenge when using decision trees. Furthermore, when the decision boundary between classes is not parallel to the axes, decision trees will have a hard time discovering it without overfitting. As mentioned previously, it's much better to just use logistic regression in these cases. Lastly, due to their rule-based approach to learning, decision trees need lots of samples to learn any meaningful patterns that exist in the larger population. With a small number of samples, they will almost certainly overfit and perform poorly in predicting unseen samples.


**4.)** The dataset has 30 features, which come in a variety of formats. Some are binary (sex), some are categorical (reason for attending the school), some are numerical (age), some are ordinal (quality of family relationships) and so on. This alone makes a strong case for decision trees because their property of asking a series of questions can overcome different data formats, as well as scale differences. Although there are no missing values for any features of the samples in the dataset, this is highly ideal and unlikely to always be the case. Thankfully, this does not pose a major problem for decision trees, as the tree will be built with or without data in every cell. If the school staff wanted to get maximum performance out of other algorithms, they would likely have to normalize and scale the data, transform it, and fill in missing values. All of these additional steps open up new possibilities for error and waste of time and resources that the school needs most for the primary task, which is helping students.
- Two additional factors that make decision trees a strong candidate for this problem are their speed and their ease of being interpreted and explained to a non-technical audience. Once a decision tree is trained, it can make new predictions in a fraction of the time that more complex models such as support vector machines require. Since the software will be used by staff that don't necessarily have data expertise, it would be a bonus for them to have a conceptual understanding of it as well, beyond merely being able to use it for prediction. Given that decision trees divide and conquer the data by the amount of information gained at each split, they can also be used to identify the relative importance of features. In a nutshell, decision trees automate the process of feature selection. For example, the most crucial features that separate students that pass from those that fail will occur at the top of the tree, while more inconsequential features will occur further down. Considering the school has limited resources to help failing students, it would make sense that they would want to focus them intensively where they can get the most bang for their buck. Going on this, decision trees simplify the choices the schools face as they attempt to boost the graduation rate. 
- Using the convenient `export_graphviz` function that comes in `sklearn.tree`, I was able to visualize a simple decision tree, which shows `failures` at the root node, followed by `absences` and `age` at the second nodes. This makes intuitive sense. Interesting patterns emerge as well using this function. For example, all three students in the training data that failed at least one class, were age 16 or younger, and whose parents had higher education ultimately failed the class. This on the other hand is not so intuitive, but nevertheless is demonstrated in the model. Easy to read patterns such as these and many more are all possible to identify and visualize with decision trees.
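As a rough sketch of the feature-ranking and rule-reading ideas above, the example below fits a shallow tree on synthetic data and prints both a text rendering of its rules and the impurity-based feature importances. The data is made up: the label is constructed to depend only on `failures` and `absences`, with `age` as pure noise, so the output says nothing about the real students. `export_text` is used here as a plain-text alternative to `export_graphviz`.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: the label depends only on 'failures' and 'absences'; 'age' is noise.
rng = np.random.RandomState(40)
demo = pd.DataFrame({
    "failures": rng.randint(0, 4, size=300),
    "absences": rng.randint(0, 30, size=300),
    "age": rng.randint(15, 20, size=300),
})
label = np.where((demo["failures"] == 0) & (demo["absences"] < 15), "yes", "no")

tree = DecisionTreeClassifier(max_depth=3, random_state=40).fit(demo, label)

# Text rendering of the learned rules (export_graphviz produces the graphical equivalent).
print(export_text(tree, feature_names=list(demo.columns)))

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(tree.feature_importances_, index=demo.columns)
print(importances.sort_values(ascending=False))
```

Because `age` carries no information about the synthetic label, the tree never splits on it, which is the "automatic feature selection" behaviour described above in miniature.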
    
### III. Support Vector Machines

**1.)** Support vector machines have found considerable success in character and digit recognition: ***Medeiros, S., SVM Applied to License Plate Recognition, Technical Report, Federal University of Rio de Janeiro, Computer Science Department, Brazil, 2011.*** In this paper, researchers used support vector machines to classify letters and digits on Brazilian license plates. The average accuracy for classifications of digits 0 to 8 was 99.61%. The corresponding accuracy for letters A to Z was 98.60%. Law enforcement could use such an application to read and record cars' license plates as drivers commit traffic violations, such as speeding or running a red light. This could automate ticketing and reduce the need for traffic cops, freeing up police department resources for addressing more menacing forms of crime. Customs officials could use it to automatically scan and record license plates of vehicles that cross over national borders.


**2.)** Perhaps the most obvious benefit of support vector machines is their near immunity to outliers and noise. Because SVMs learn only using training samples that lie closest to the margin that they attempt to maximize during the training stage, outliers are ignored entirely. Since SVM is a parametric model like logistic regression, it works best when all the data is on the same scale, when all data is numerical, and when there are no missing values. The biggest strength of SVMs is their ability to make short work of any kind of decision boundary using the kernel trick. Through automatic feature creation, SVMs can project data into high-dimensional space, make virtual slices, and then bring them back to the original space. Using this method, they can discover all kinds of strange and complex decision boundaries, linear and non-linear, that other algorithms would struggle with. Unlike decision trees, SVMs are not hindered by small datasets and a large number of features. Datasets with samples and features of all shapes and sizes can be conquered with SVMs, although sometimes they are not the most feasible choice. All of the above makes SVMs the most badass algorithm to me, although their usefulness remains to be seen.


**3.)** Not all is glamorous with SVMs though. They typically trade off good performance for complexity and difficulty in interpretation. They are tricky to set up, and require tuning at least three hyperparameters (C, kernel, and gamma) to find optimal performance. The math underlying the training process is exceedingly complex and would be difficult to translate to a non-technical audience. Even if the SVM model that the school staff are using is performing well and making accurate predictions, the chances that they could make sense of the results are pretty slim. To some extent, they would have to just take them at face value. As mentioned previously, the kernel trick that SVMs use is excellent for elucidating complex decision boundaries, but it is computationally quite expensive. When comparing various algorithms, SVMs tend to rank among the slowest. In real-world situations, SVMs need significant memory and processing power to do their job. In some sense, this algorithm is the Lamborghini of machine learning. It is deadly, sexy, and powerful. On the downside though, it is expensive, it requires tuning many parameters to find optimal performance, and it's not for the layman. When it breaks down, debugging is not as easy as with decision trees. A Lamborghini is awesome for a ride around town, but you wouldn't want to drive it across the country. Likewise, the feasibility of support vector machines starts to break down after too many (10,000+) training examples. They are too expensive to train at this point. For large datasets, it's usually better to just go with decision trees.
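To give a feel for what tuning C, kernel, and gamma involves, here is a minimal sketch that wraps scaling and an `SVC` in a pipeline and searches a deliberately tiny grid. The data is synthetic and the grid values are hypothetical choices for illustration, not recommendations for the student dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the student features.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=40)

# Scale inside the pipeline so each CV fold is scaled using only its own training part.
pipe = make_pipeline(StandardScaler(), SVC(random_state=40))

# A deliberately small, hypothetical grid over the three hyperparameters named above.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": ["scale", 0.01],
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_demo, y_demo)

print("best params:", search.best_params_)
print("best CV F1: {:.4f}".format(search.best_score_))
```

Even this small grid trains 12 candidate settings across 5 folds, which hints at why SVM tuning becomes expensive as the grid or the dataset grows.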


**4.)** SVM is a good candidate for this problem because the dataset has many features and the number of samples is relatively small. SVMs have been shown to generalize well in these kinds of situations, even in cases where the number of features exceeds the number of samples ***(http://www.brainvoyager.com/bvqx/doc/UsersGuide/MVPA/SupportVectorMachinesSVMs.html)***. That isn't the case here, but nevertheless with only 300 samples to train on and 95 samples for testing, the data is sufficiently sparse to warrant concern. Furthermore, SVMs can efficiently overcome the "curse of dimensionality" that negatively affects models that use a nearest-neighbors approach in high-dimensional space. Given the way that SVMs seek to separate the data linearly and then maximize the margin, 30 features shouldn't be as much of an issue as they would be when using KNN. Lastly, SVMs are well-suited for binary classification problems like this one. 
    

### Setup
Run the code cell below to initialize three helper functions which you can use for training and testing the three supervised learning models you've chosen above. The functions are as follows:
- `train_classifier` - takes as input a classifier and training data and fits the classifier to the data.
- `predict_labels` - takes as input a fitted classifier, features, and the target labels, makes predictions, and returns the F<sub>1</sub> score.
- `train_predict` - takes as input a classifier, and the training and testing data, and performs `train_classifier` and `predict_labels`.
 - This function will report the F<sub>1</sub> score for both the training and testing data separately.

In [24]:
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Print the results
    print("Trained model in {:.4f} seconds.\n".format(end - start))

    
def predict_labels(clf, features, target):
    ''' Makes predictions using a fit classifier based on F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    
    # Print and return results
    print("Made predictions in {:.4f} seconds.".format(end - start))
    return f1_score(target.values, y_pred, pos_label='yes')


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Train and predict using a classifier based on F1 score. '''
    
    # Indicate the classifier and the training set size
    print("Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train)))
    
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    print("F1 score for training set: {:.4f}.\n".format(predict_labels(clf, X_train, y_train)))
    print("F1 score for test set: {:.4f}.\n".format(predict_labels(clf, X_test, y_test)))
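Since the results eventually need to go into tables, it can be convenient to return the timings and scores instead of only printing them. The sketch below shows one way to do that. `timed_scores` is a hypothetical helper, not part of the project template, and it runs on synthetic 0/1 data; with the real `'yes'`/`'no'` labels, `f1_score` would need `pos_label='yes'` as in the helpers above.

```python
from time import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def timed_scores(clf, X_train, y_train, X_test, y_test):
    """Hypothetical variant of train_predict that returns one table row instead of printing."""
    start = time()
    clf.fit(X_train, y_train)
    train_time = time() - start

    start = time()
    test_pred = clf.predict(X_test)
    predict_time = time() - start

    return {
        "train_size": len(X_train),
        "train_time_s": train_time,
        "predict_time_s": predict_time,
        "f1_train": f1_score(y_train, clf.predict(X_train)),
        "f1_test": f1_score(y_test, test_pred),
    }

# Synthetic 0/1 data with the same 300/95 split sizes as the project.
X_demo, y_demo = make_classification(n_samples=395, n_features=20, random_state=40)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=95, random_state=40)

# One row per training set size, all drawn from the same training split.
rows = [timed_scores(DecisionTreeClassifier(random_state=40), X_tr[:n], y_tr[:n], X_te, y_te)
        for n in (100, 200, 300)]
results = pd.DataFrame(rows)
print(results)
```

The resulting DataFrame has one row per training set size, which maps directly onto the table layout requested for each model.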

### Implementation: Model Performance Metrics
With the predefined functions above, you will now import the three supervised learning models of your choice and run the `train_predict` function for each one. Remember that you will need to train and predict on each classifier for three different training set sizes: 100, 200, and 300. Hence, you should expect to have 9 different outputs below — 3 for each model using the varying training set sizes. In the following code cell, you will need to implement the following:
- Import the three supervised learning models you've discussed in the previous section.
- Initialize the three models and store them in `clf_A`, `clf_B`, and `clf_C`.
 - Use a `random_state` for each model you use, if provided.
 - **Note:** Use the default settings for each model — you will tune one specific model in a later section.
- Create the different training set sizes to be used to train each model.
 - *Do not reshuffle and resplit the data! The new training points should be drawn from `X_train` and `y_train`.*
- Fit each model with each training set size and make predictions on the test set (9 in total).  
**Note:** Three tables are provided after the following code cell which can be used to store your results.

In [26]:
# TODO: Import the three supervised learning models from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# TODO: Initialize the three models
clf_A = LogisticRegression(random_state=40)
clf_B = DecisionTreeClassifier(random_state=40)
clf_C = SVC(random_state=40)

classifiers = [clf_A, clf_B, clf_C]

# TODO: Set up the training set sizes

training_sizes = [100, 200, 300]

# TODO: Execute the 'train_predict' function for each classifier and each training set size

for clf in classifiers:
    print("\n{}: \n".format(clf.__class__.__name__))
    for n in training_sizes:
        train_predict(clf, X_train[:n], y_train[:n], X_test, y_test)
    print('*-------------------------------------------------------*')


LogisticRegression: 

Training a LogisticRegression using a training set size of 100. . .
Trained model in 0.0024 seconds.

Made predictions in 0.0004 seconds.
F1 score for training set: 0.8859.

Made predictions in 0.0006 seconds.
F1 score for test set: 0.7714.

Training a LogisticRegression using a training set size of 200. . .
Trained model in 0.0036 seconds.

Made predictions in 0.0005 seconds.
F1 score for training set: 0.8255.

Made predictions in 0.0076 seconds.
F1 score for test set: 0.8414.

Training a LogisticRegression using a training set size of 300. . .
Trained model in 0.0078 seconds.

Made predictions in 0.0006 seconds.
F1 score for training set: 0.8227.

Made predictions in 0.0008 seconds.
F1 score for test set: 0.8169.

*-------------------------------------------------------*

DecisionTreeClassifier: 

Training a DecisionTreeClassifier using a training set size of 100. . .
Trained model in 0.0022 seconds.

Made predictions in 0.0007 seconds.
F1 score for training se

### Tabular Results
Edit the cell below to see how a table can be designed in [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#tables). You can record your results from above in the tables provided.

**Classifier 1 - Logistic Regression**  

| Training Set Size | Training Time | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               |        0.0024s          |        0.0006s         |     0.8859       |    0.7714       |
| 200               |        0.0036s          |        0.0076s         |     0.8255       |    0.8414       |
| 300               |        0.0078s          |        0.0008s         |     0.8227       |    0.8169       |

**Classifier 2 - Decision Tree Classifier**  

| Training Set Size | Training Time | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               |         0.0022s         |      0.0012s           |       1.0        |    0.6281       |
| 200               |         0.0026s         |      0.0011s           |       1.0        |    0.7538       |
| 300               |         0.0102s         |      0.0004s           |       1.0        |    0.7692       |

**Classifier 3 - Support Vector Machine**  

| Training Set Size | Training Time | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               |        0.0111s          |        0.0010s         |      0.8750      |      0.8199     |
| 200               |        0.0056s          |        0.0019s         |      0.8664      |      0.7895     |
| 300               |        0.0117s          |        0.0117s         |      0.8786      |     0.8366     |

## Choosing the Best Model
In this final section, you will choose from the three supervised learning models the *best* model to use on the student data. You will then perform a grid search optimization for the model over the entire training set (`X_train` and `y_train`) by tuning at least one parameter to improve upon the untuned model's F<sub>1</sub> score. 

### Question 3 - Choosing the Best Model
*Based on the experiments you performed earlier, in one to two paragraphs, explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?*

- After deploying three different machine learning algorithms on your student dataset, I have determined that logistic regression is the most suitable for your support team's needs. Logistic regression strikes a nice balance between the strong generalization ability of support vector machines and the low computational cost of decision trees, the two other algorithms that did not make the cut. Most importantly, it suits the demands of this specific problem best, as I will explain. This algorithm is inherently a probabilistic classification method. Using it, your support team can easily differentiate students with a 90% chance of failing from those with a 55% chance of failing, for example. This will allow you to channel your school district's limited resources first to the students whose likelihood of failing is greatest, before directing them to those whose risk is lower. Taking your cost considerations into account, this algorithm is one of the least computationally expensive of the three, performing only slightly behind decision trees but far ahead of support vector machines. The time complexity of both training and testing with logistic regression scales linearly with the number of samples (`O(n_features*n_samples)`). On the other hand, the cost of training support vector machines scales between quadratically (`O(n_features*n_samples^2)`) and cubically (`O(n_features*n_samples^3)`) with the number of samples. Therefore on larger datasets, the computation costs for logistic regression may increase, but its relative advantage over support vector machines in this regard will remain. In any case, you will not need to make any large investments in new computational infrastructure to employ this model efficiently within your organization. Where logistic regression really shines over decision trees, on the other hand, is in its ability to generalize to new data, detailed below.


- Although the dataset is relatively small, logistic regression performs quite well compared to decision trees. In a nutshell, the F1 score combines the true positives, false positives, and false negatives that the model produces into a single number between 0 and 1. The training and test scores converge at F1 scores above 0.8 as the number of training samples increases, suggesting that the model neither underfits nor overfits the data. On the other hand, the results from the decision tree algorithm show clear evidence of overfitting: its training F1 score is a perfect 1.0 at every training set size, while its test score never exceeds 0.7692. Even with only 200 training examples to learn from, logistic regression obtained an F1 score of 0.8414 on unseen students, well above the decision tree's score of 0.7538 on the same data. Furthermore, its F1 scores were more stable than those of support vector machines between 200 and 300 training examples, leading me to believe that they are the most trustworthy as well. As for the complexity of decision trees, training growth is linearithmic (`O(n_samples*n_features*log(n_samples))`), while testing growth is logarithmic (`O(log(n_samples))`). Therefore on large datasets, if testing speed is given greater importance, then decision trees would be the best algorithm to use, despite having a slower training time. One concern that I do have is your stated goal of reaching a 95% graduation rate by the end of the next decade. To help ensure that you reach this goal, I recommend that your team gather more student data in the future to feed to the model. When it reaches a score of, say, 0.95 or greater, you can be more confident that your goal can be fulfilled. However, in the near term, the model is wholly sufficient for helping your team boost the graduation rate significantly from where it stands now at 67%. Please let me know if you have any questions or concerns, or if I can be of further assistance.
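
To make the F1 score used throughout these comparisons concrete, here is a minimal sketch that computes it by hand from prediction counts. The helper names `confusion_counts` and `f1_from_counts` and the eight student labels are hypothetical, made up for illustration; the project itself relies on `sklearn.metrics.f1_score` with `pos_label='yes'`.

```python
def confusion_counts(y_true, y_pred, pos_label="yes"):
    """Count true positives, false positives, and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall; true negatives are not used."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # share of predicted passes that really passed
    recall = tp / (tp + fn)     # share of real passes that were predicted
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for eight students ('yes' = passed)
y_true = ["yes", "yes", "yes", "no", "no", "yes", "no", "yes"]
y_pred = ["yes", "yes", "no", "yes", "no", "yes", "no", "yes"]

tp, fp, fn = confusion_counts(y_true, y_pred)
score = f1_from_counts(tp, fp, fn)
print("TP={}, FP={}, FN={}, F1={:.4f}".format(tp, fp, fn, score))
```

Because the score only rewards correctly identified passes relative to the two kinds of mistakes, a model that predicts 'yes' for every student in an imbalanced dataset can still look reasonable, which is one reason to compare models rather than read a single score in isolation.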

### Question 4 - Model in Layman's Terms
*In one to two paragraphs, explain to the board of directors in layman's terms how the final model chosen is supposed to work. Be sure that you are describing the major qualities of the model, such as how the model is trained and how the model makes a prediction. Avoid using advanced mathematical or technical jargon, such as describing equations or discussing the algorithm implementation.*

- In the state that I have delivered it to your team, the logistic regression model takes the features of a sample as input and outputs the class to which the sample belongs. In this particular case, the samples are the students, their features are their school, sex, age, address, family size, etc., and the classes are fail or pass. An intermediate step in logistic regression that is not currently visible is the calculation of the probability that the sample belongs to the positive class (passing). The probabilities that the algorithm produces internally always add up to 1. The probability of the negative class (failing) is thus 1 minus the probability of passing. For example, if the internal output was (0.01, 0.99), then that could be interpreted as the student having a 1% chance of failing and a 99% chance of passing. An output of (0.5, 0.5) means that the algorithm gives a sample a 50/50 chance of belonging to either class. Due to the way the algorithm calculates its probabilities, a chance of exactly 100% for either class is impossible; therefore, a certain degree of uncertainty is always tolerated and quantified, unlike in many other models. The output that you ultimately see (fail or pass) is simply the class that has the greater probability of being true.


- The mechanism that allows logistic regression to "learn" is a set of weights that the algorithm multiplies with the features of each sample that it sees. In this case, there are 30 features being multiplied with 30 corresponding weights. During the training process, it is these internal weights that are constantly being modified with each training sample that the model is exposed to. In order to learn the "true" weights, the logistic regression attempts to minimize a cost, or error, function. For each prediction, the function penalizes the weights if the prediction is wrong, and leaves them nearly unchanged if the prediction is confidently correct. The further the predicted probability is from the true value, the more the weights are penalized. Penalizing in this sense means nudging the weights in the direction that reduces the error. For example, if the weights are too small, they will be increased. If they are too large, they will be decreased. With each new training example, the algorithm moves the weights closer and closer to the "true" weights. It follows naturally that the more data the model has to train on, the more accurate the weights will be. Using the final weights, the algorithm can multiply them with the features of new students to predict their probabilities of failing or passing. As previously stated, the greater probability is the final "choice" that the model makes.
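
The training loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how scikit-learn's `LogisticRegression` is implemented: it uses simple per-sample gradient descent on the log loss, with no regularization. The function names, the two scaled features, and the six toy students are all hypothetical.

```python
import math


def sigmoid(z):
    """Squash any real number into a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def class_probabilities(weights, bias, features):
    """Return (P(fail), P(pass)) for one student; the pair always sums to 1."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p_pass = sigmoid(z)
    return 1.0 - p_pass, p_pass


def train(samples, labels, learning_rate=0.1, epochs=200):
    """Adjust the weights after every sample, in proportion to the error."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            _, p_pass = class_probabilities(weights, bias, features)
            error = p_pass - label  # near 0 when the prediction is confidently right
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical toy data: [absences (scaled), study time (scaled)], label 1 = pass
samples = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
           [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train(samples, labels)
p_fail, p_pass = class_probabilities(weights, bias, [0.15, 0.85])
print("P(fail)={:.3f}, P(pass)={:.3f}".format(p_fail, p_pass))
print("Predicted class:", "pass" if p_pass > p_fail else "fail")
```

In this toy setup the weight on absences ends up negative and the weight on study time positive, which mirrors the intuition that the learned weights encode how each feature pushes a student toward passing or failing.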

### Implementation: Model Tuning
Fine tune the chosen model. Use grid search (`GridSearchCV`) with at least one important parameter tuned with at least 3 different values. You will need to use the entire training set for this. In the code cell below, you will need to implement the following:
- Import [`sklearn.grid_search.GridSearchCV`](http://scikit-learn.org/0.17/modules/generated/sklearn.grid_search.GridSearchCV.html) and [`sklearn.metrics.make_scorer`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html).
- Create a dictionary of parameters you wish to tune for the chosen model.
 - Example: `parameters = {'parameter' : [list of values]}`.
- Initialize the classifier you've chosen and store it in `clf`.
- Create the F<sub>1</sub> scoring function using `make_scorer` and store it in `f1_scorer`.
 - Set the `pos_label` parameter to the correct value!
- Perform grid search on the classifier `clf` using `f1_scorer` as the scoring method, and store it in `grid_obj`.
- Fit the grid search object to the training data (`X_train`, `y_train`), and store it in `grid_obj`.

In [309]:
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

# TODO: Create the parameters list you wish to tune
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
parameters = {'C': param_range}
# Taken from Raschka, Sebastian (2015-09-23). Python Machine Learning (p. 186). Packt Publishing. Kindle Edition.

# TODO: Initialize the classifier
clf = LogisticRegression()

# TODO: Make an f1 scoring function using 'make_scorer' 
f1_scorer = make_scorer(f1_score, pos_label='yes')

# TODO: Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf, parameters, scoring=f1_scorer)

# TODO: Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train, y_train)

# Get the estimator
clf = grid_obj.best_estimator_
# Report the final F1 score for training and testing after parameter tuning
print("Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train)))
print("Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test)))

Made predictions in 0.0003 seconds.
Tuned model has a training F1 score of 0.8075.
Made predictions in 0.0005 seconds.
Tuned model has a testing F1 score of 0.8228.
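
Beyond the final training and testing scores, it can help to look at what the grid search actually found for each value of `C`. The following is a sketch on a synthetic stand-in dataset built with `make_classification`, since the student data is not reproduced here; the sample count, feature count, `cv=5`, and `max_iter=1000` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the student data (an assumption, not the real dataset)
X, y_numeric = make_classification(n_samples=300, n_features=10, random_state=40)
y = np.where(y_numeric == 1, "yes", "no")  # mimic the 'yes'/'no' target labels

f1_scorer = make_scorer(f1_score, pos_label="yes")
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring=f1_scorer,
    cv=5,
)
grid.fit(X, y)

# Best setting and its mean F1 across the held-out folds
print("Best C:", grid.best_params_["C"])
print("Mean cross-validated F1 for best C: {:.4f}".format(grid.best_score_))

# Per-candidate summary: mean and spread of the fold scores
results = grid.cv_results_
for params, mean, std in zip(results["params"],
                             results["mean_test_score"],
                             results["std_test_score"]):
    print("C={:<8} mean F1={:.4f} (+/- {:.4f})".format(params["C"], mean, std))
```

If the mean scores for neighbouring values of `C` differ by less than their spread across folds, the choice of `C` is probably not driving performance much.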


### Question 5 - Final F<sub>1</sub> Score
*What is the final model's F<sub>1</sub> score for training and testing? How does that score compare to the untuned model?* 

- The tuned model's training and testing F1 scores are 0.8075 and 0.8228 respectively. This compares with 0.8227 and 0.8169 for the default model trained on 300 samples; therefore, using grid search to tune the `C` parameter gave a small edge on the testing data, but actually weakened performance slightly on the training data. Overall, the differences are small. It appears that the scarcity of the data and the relative class imbalance might be causing some unpredictability in the results. Taking into account the performance of the default model on 100, 200, and 300 training samples, it seems reasonable to conclude that any differences are due more to random chance in the way the data is split than to anything else. By altering the `random_state` parameter in the `train_test_split` function, I was able to obtain radically different confusion matrices and F1 scores from the logistic regression, lending support to the idea that the particular split used is noticeably influential in small datasets with class imbalance. However, because grid search selects `C` using k-fold cross-validation, the tuned model is less likely to be tailored to one particular split, so I consider its 0.8075 training score more realistic and trustworthy than the 0.8227 obtained with a simple train-test split. Overall, I would not take any scores that these models generate at face value quite yet. More data would help determine which model could ultimately emerge as the winner.
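
The sensitivity to the split described above can be checked with a short experiment that repeats the train-test split under different seeds. This sketch uses synthetic data as a stand-in; the total of 395 samples, the 95-sample test set, the roughly 67% positive class, and the stratified split are all assumptions chosen to resemble a small, imbalanced dataset, not values taken from the student data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in (assumed sizes and class balance)
X, y = make_classification(n_samples=395, n_features=10,
                           weights=[0.33, 0.67], random_state=0)

scores = []
for seed in range(20):
    # Same data, same model; only the split changes from run to run
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=95, stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(f1_score(y_te, clf.predict(X_te), pos_label=1))

scores = np.array(scores)
print("Test F1 over 20 splits: mean={:.4f}, std={:.4f}, min={:.4f}, max={:.4f}"
      .format(scores.mean(), scores.std(), scores.min(), scores.max()))
```

If the gap between the minimum and maximum is comparable to the difference between the tuned and untuned models, that difference should not be read as a real improvement.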

> **Note**: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to  
**File -> Download as -> HTML (.html)**. Include the finished document along with this notebook as your submission.