# Machine Learning Engineer Nanodegree
## Supervised Learning
## Project: Building a Student Intervention System

The goal of building a student intervention system is to divide students into two distinct groups: those who need early intervention before they fail to graduate, and those who do not. Therefore, this is a binary classification problem.

## Exploring the Data
Run the code cell below to load necessary Python libraries and load the student data. Note that the last column from this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score

# Read student data
student_data = pd.read_csv("student-data.csv")
print("Student data read successfully!")

Student data read successfully!


In [2]:
# check what the data look like
student_data.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes


### Implementation: Data Exploration
Let's begin by investigating the dataset to determine how many students we have information on, and to learn about the graduation rate among these students. We compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [3]:
# Calculate number of students
n_students = student_data.shape[0]

# Calculate number of features
n_features = student_data.shape[1] - 1 # one of the columns is target variable

# Calculate passing students
n_passed = sum(student_data['passed'] == 'yes')

# Calculate failing students
n_failed = sum(student_data['passed'] == 'no')

# Calculate graduation rate
grad_rate = n_passed/n_students*100

# Print the results
print("Total number of students: {}".format(n_students))
print("Number of features: {}".format(n_features))
print("Number of students who passed: {}".format(n_passed))
print("Number of students who failed: {}".format(n_failed))
print("Graduation rate of the class: {:.2f}%".format(grad_rate))

Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%


## Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Run the code cell below to separate the student data into feature and target columns to see if any features are non-numeric.

In [4]:
# Extract feature columns
feature_cols = list(student_data.columns[:-1])

# Extract target column 'passed'
target_col = student_data.columns[-1] 

# Show the list of columns
print("Feature columns:\n{}".format(feature_cols))
print("\nTarget column: {}".format(target_col))

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]

# Show the feature information by printing the first five rows
print("\nFeature values:")
print(X_all.head())

Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       

### Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.
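Before applying the routine to the full dataset, a toy illustration of the two encodings may help. This is only a sketch on made-up values (the three rows below are hypothetical, not taken from `student-data.csv`), and it uses `Series.map` for the yes/no column purely for brevity, which is an assumption and not the approach of the preprocessing routine itself.

```python
import pandas as pd

# Toy data (hypothetical values, not rows from student-data.csv)
toy = pd.DataFrame({
    'internet': ['yes', 'no', 'yes'],
    'Fjob': ['teacher', 'other', 'services'],
})

# Binary yes/no column -> a single 1/0 column
internet = toy['internet'].map({'yes': 1, 'no': 0})

# Multi-valued categorical column -> one indicator column per distinct value
fjob = pd.get_dummies(toy['Fjob'], prefix='Fjob').astype(int)

# Combine the encoded columns into one numeric DataFrame
encoded = fjob.assign(internet=internet)
print(encoded)
```

Each row has exactly one `1` among the `Fjob_*` columns, which is what "assign a `1` to one of them and `0` to all others" means in practice.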

In [5]:
def preprocess_features(X):
    ''' Preprocesses the student data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    for col, col_data in X.items():  # iteritems() was removed in pandas 2.0
        
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)  
        
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

X_all = preprocess_features(X_all)
print("Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns)))

Processed feature columns (48 total features):
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. In the code cell below, we implement the following:
- Randomly shuffle and split the data (`X_all`, `y_all`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [6]:
from sklearn.model_selection import train_test_split

# Set the number of training points
num_train = 300

# Set the number of testing points
num_test = X_all.shape[0] - num_train

# Shuffle and split the dataset into the number of training and testing points above
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, 
                                                    train_size=num_train, random_state=16)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 300 samples.
Testing set has 95 samples.


## Training and Evaluating Models
In this section, we choose three supervised learning models that are appropriate for this problem and available in `scikit-learn`. We fit each model to training sets of varying size (100, 200, and 300 data points) and measure the F<sub>1</sub> score. We then produce three tables (one per model) that show the training set size, training time, prediction time, F<sub>1</sub> score on the training set, and F<sub>1</sub> score on the testing set.
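As a reference point for reading the F<sub>1</sub> scores, the sketch below computes F<sub>1</sub> directly from true positives, false positives, and false negatives, with `'yes'` as the positive label. The helper `f1_from_counts` and the "always predict `'yes'`" baseline are illustrative assumptions, and the baseline is evaluated on the full dataset counts reported earlier (265 passed, 130 failed) instead of on the test split.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical baseline: predict 'yes' for every one of the 395 students.
# Every passing student is a true positive, every failing student a false
# positive, and there are no false negatives.
n_passed, n_failed = 265, 130
baseline_f1 = f1_from_counts(tp=n_passed, fp=n_failed, fn=0)
print("Always-'yes' baseline F1: {:.4f}".format(baseline_f1))
```

Because about two thirds of the students pass, a classifier that ignores the features entirely already scores about 0.80 on this metric, so model scores are best judged against that level.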

**The following supervised learning models are currently available in** [`scikit-learn`](http://scikit-learn.org/stable/supervised_learning.html) :
- Gaussian Naive Bayes (GaussianNB)
- Decision Trees
- Ensemble Methods (Bagging, AdaBoost, Random Forest, Gradient Boosting)
- K-Nearest Neighbors (KNeighbors)
- Stochastic Gradient Descent (SGDC)
- Support Vector Machines (SVM)
- Logistic Regression

### Model Application
Three supervised learning models that are appropriate for this problem are:


1. Logistic Regression
    - Logistic regression is a common method for binary classification problems. In the real world, credit card and loan companies can use logistic regression to predict whether a customer will default, given the account balance and other features (James et al., "An Introduction to Statistical Learning with Applications in R").
    - Strength: simple, fast, provides a probabilistic interpretation, works well on small datasets, and does not assume multivariate normality or equal dispersion.
    - Weakness: it assumes a linear decision boundary, which may not hold; parameter estimation can also be complicated by multicollinearity among the features.
    - It should work here since this is a binary classification problem with both quantitative and qualitative data.
2. Support Vector Machines (SVM)   
    - The support vector machine is a very powerful and versatile machine learning model, capable of classification, regression and outlier detection. One industrial application of SVM is the diagnosis of mechanical faults (Baccarini et al., "SVM practical industrial application for mechanical faults diagnostic").
    - Strength: powerful and versatile; can model non-linear decision boundaries with many available kernels; fairly robust against overfitting, especially in high-dimensional space.
    - Weakness: it is memory intensive and does not scale well to large datasets; it can be tricky to tune.
    - It could be useful for this problem since we have a classification problem with a relatively small dataset.
3. Ensemble Methods (e.g., random forest) 
    - One application of random forests is land classification in remote sensing (Gislason et al. "Decision fusion for the classification of urban remote sensing images.")
    - Strength: reduces variance by combining randomized decision trees with bagging, gives high classification accuracy, is a good out-of-the-box model, and automatically learns feature interactions.
    - Weakness: can be computationally expensive, is not easy to interpret, and needs a relatively large dataset.
    - It could be a good candidate in this case, as the default model usually performs well and bagging reduces the risk of overfitting.

### Setup
Run the code cell below to initialize three helper functions which you can use for training and testing the three supervised learning models you've chosen above. The functions are as follows:
- `train_classifier` - takes as input a classifier and training data and fits the classifier to the data.
- `predict_labels` - takes as input a fit classifier, features, and a target labeling, makes predictions, and returns the F<sub>1</sub> score.
- `train_predict` - takes as input a classifier, and the training and testing data, and performs `train_classifier` and `predict_labels`.
 - This function will report the F<sub>1</sub> score for both the training and testing data separately.

In [7]:
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Print the results
    print("Trained model in {:.4f} seconds".format(end - start))

    
def predict_labels(clf, features, target):
    ''' Makes predictions using a fit classifier and returns the F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    
    # Print and return results
    print("Made predictions in {:.4f} seconds.".format(end - start))
    return f1_score(target.values, y_pred, pos_label='yes')


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Train and predict using a classifier, reporting the F1 score. '''
    
    # Indicate the classifier and the training set size
    print("Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train)))
    
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    print("F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train)))
    print("F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test)))
    print("------------------------------------------")

### Implementation: Model Performance Metrics
With the predefined functions above, we can run the `train_predict` function for each model we selected. 

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Initialize the three models
clf_A = LogisticRegression(random_state=16)
clf_B = SVC(random_state=16)
clf_C = RandomForestClassifier(random_state=16)

# Set up the training set sizes
X_train_100 = X_train[:100]
y_train_100 = y_train[:100]

X_train_200 = X_train[:200]
y_train_200 = y_train[:200]

X_train_300 = X_train
y_train_300 = y_train

# Execute the 'train_predict' function for each classifier and each training set size
# train_predict(clf, X_train, y_train, X_test, y_test)
for clf in [clf_A, clf_B, clf_C]:
    for X_train_i, y_train_i in zip([X_train_100, X_train_200, X_train_300], 
                                   [y_train_100, y_train_200, y_train_300]):
        train_predict(clf, X_train_i, y_train_i, X_test, y_test)

Training a LogisticRegression using a training set size of 100. . .
Trained model in 0.0015 seconds
Made predictions in 0.0003 seconds.
F1 score for training set: 0.9362.
Made predictions in 0.0003 seconds.
F1 score for test set: 0.7746.
------------------------------------------
Training a LogisticRegression using a training set size of 200. . .
Trained model in 0.0018 seconds
Made predictions in 0.0002 seconds.
F1 score for training set: 0.8674.
Made predictions in 0.0002 seconds.
F1 score for test set: 0.7273.
------------------------------------------
Training a LogisticRegression using a training set size of 300. . .
Trained model in 0.0023 seconds
Made predictions in 0.0003 seconds.
F1 score for training set: 0.8274.
Made predictions in 0.0002 seconds.
F1 score for test set: 0.7591.
------------------------------------------
Training a SVC using a training set size of 100. . .
Trained model in 0.0011 seconds
Made predictions in 0.0006 seconds.
F1 score for training set: 0.8662.
M

### Tabular Results


**Classifier 1 - Logistic Regression**

| Training Set Size | Training Time (s) | Prediction Time (s, test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------: | :-----------------------: | :--------------: | :-------------: |
| 100               | 0.0015            | 0.0003                    | 0.9362           | 0.7481          |
| 200               | 0.0018            | 0.0002                    | 0.8674           | 0.8029          |
| 300               | 0.0023            | 0.0002                    | 0.8274           | 0.7770          |

**Classifier 2 - Support Vector Machines**

| Training Set Size | Training Time (s) | Prediction Time (s, test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------: | :-----------------------: | :--------------: | :-------------: |
| 100               | 0.0011            | 0.0006                    | 0.8662           | 0.8250          |
| 200               | 0.0024            | 0.0010                    | 0.8746           | 0.7571          |
| 300               | 0.0049            | 0.0013                    | 0.8673           | 0.8138          |

**Classifier 3 - Random Forest**

| Training Set Size | Training Time (s) | Prediction Time (s, test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------: | :-----------------------: | :--------------: | :-------------: |
| 100               | 0.0107            | 0.0015                    | 0.9927           | 0.7910          |
| 200               | 0.0108            | 0.0011                    | 0.9922           | 0.7097          |
| 300               | 0.0112            | 0.0011                    | 0.9975           | 0.7500          |

## Choosing the Best Model
In this final section, we choose the *best* of the three supervised learning models to use on the student data. We then perform a grid search optimization for that model over the entire training set (`X_train` and `y_train`), tuning at least one parameter to improve upon the untuned model's F<sub>1</sub> score.

In [13]:
print('Mean test F1 score for logistic regression: {}'.format(np.mean([0.7481, 0.8029, 0.7770])))
print('Mean test F1 score for support vector machine: {}'.format(np.mean([0.8250, 0.7571, 0.8138])))
print('Mean test F1 score for random forest: {}'.format(np.mean([0.7910, 0.7097, 0.7500])))

Mean test F1 score for logistic regression: 0.7759999999999999
Mean test F1 score for support vector machine: 0.7986333333333334
Mean test F1 score for random forest: 0.7502333333333334


The best model for building a student intervention system in this case is the support vector machine.

In terms of F1 score, the support vector machine gives the best average test F1 score (0.7986) over the three training set sizes, compared with 0.7760 for logistic regression and 0.7502 for the random forest. The support vector machine therefore shows the strongest overall performance on the test set, whereas the random forest shows high variance: its training F1 score stays above 0.99 while its test F1 score ranges from 0.7097 to 0.7910.
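The comparison above can be reproduced with a few lines of standard-library Python. This is a minimal sketch that hard-codes the test F1 scores from the tables above; the `spread` figure (maximum minus minimum) is an assumed, rough stand-in for variability, since each model has only three data points.

```python
from statistics import mean

# Test F1 scores copied from the tables above (training sizes 100, 200, 300)
test_f1 = {
    'Logistic Regression': [0.7481, 0.8029, 0.7770],
    'Support Vector Machine': [0.8250, 0.7571, 0.8138],
    'Random Forest': [0.7910, 0.7097, 0.7500],
}

# Mean and spread (max - min) of the test F1 score for each model
summary = {}
for name, scores in test_f1.items():
    summary[name] = {'mean': mean(scores), 'spread': max(scores) - min(scores)}
    print("{:<24} mean = {:.4f}, spread = {:.4f}".format(
        name, summary[name]['mean'], summary[name]['spread']))
```

On these numbers the random forest has both the lowest mean and the largest spread. The support vector machine has the highest mean, although its spread is slightly larger than that of logistic regression.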

In terms of speed, the training and prediction times of the support vector machine grow with the training set size (training time roughly doubles with each additional 100 points), but on a dataset this small every measured time is only a few milliseconds. Training time is of the same order as that of logistic regression (0.0049 s versus 0.0023 s at 300 points) and shorter than that of the random forest (0.0112 s).

Overall, the support vector machine shows robust performance on the test set and reasonable speed given the relatively small dataset in this case.

### Model in Layman's Terms


The chosen model, a support vector machine, predicts whether a student belongs to one class (passed the graduation requirements) or the other (did not pass the graduation requirements), given 30 features of the student such as age, sex, family size, and number of absences.

The model is first trained on a training set that contains the features together with the class labels (records of whether each student passed) of past students. The model constructs a decision boundary that separates the classes while maximizing the distance to the closest members of each class. It is like fitting the widest possible street between the two classes. When we want to determine which class a new data point (student) belongs to (whether early intervention is needed or not), we simply check which side of the decision boundary it falls on, given its known features.
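The "widest street" picture can be made concrete on a tiny example. The sketch below is an illustration under simplifying assumptions: the four points are hypothetical, there are only two features, and a linear kernel with a very large `C` is used to approximate a hard margin, which is not the configuration selected for the student data.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: two features per "student"
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array(['no', 'no', 'yes', 'yes'])

# A linear kernel with a very large C approximates a hard-margin SVM
clf = SVC(kernel='linear', C=1e6)
clf.fit(X, y)

# For a linear SVM with weight vector w, the street width is 2 / ||w||
w = clf.coef_[0]
street_width = 2.0 / np.linalg.norm(w)
print("Street width: {:.3f}".format(street_width))

# A new point is classified by the side of the boundary it falls on:
# a negative decision value means 'no', a positive one means 'yes'
new_students = np.array([[0.5, 0.5], [1.8, 0.2]])
print(clf.decision_function(new_students))
print(clf.predict(new_students))
```

Here the two classes sit two units apart along the first feature, so the widest street that fits between them is about two units wide, with the decision boundary running down its middle.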

### Implementation: Model Tuning
Fine tune the chosen model. Use grid search (`GridSearchCV`) with at least one important parameter tuned with at least 3 different values. 

In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Create the parameters list you wish to tune
parameters = {'C': np.logspace(-2, 2, 13), 
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid']} 

# Initialize the classifier
clf = SVC(random_state=16)

# Make an f1 scoring function using 'make_scorer' 
f1_scorer = make_scorer(f1_score, pos_label='yes')

# Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf, parameters, scoring=f1_scorer, cv=10)

# Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train, y_train)

# Get the estimator
clf = grid_obj.best_estimator_

# Report the final F1 score for training and testing after parameter tuning
print("Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train)))
print("Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test)))

Made predictions in 0.0034 seconds.
Tuned model has a training F1 score of 0.8337.
Made predictions in 0.0013 seconds.
Tuned model has a testing F1 score of 0.8228.


In [20]:
# best parameters
grid_obj.best_params_

{'C': 0.46415888336127775, 'kernel': 'rbf'}
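To put the search in context, the sketch below counts the candidates in the grid used above and locates the selected `C` within it. It rebuilds the grid from the same `np.logspace(-2, 2, 13)` call and kernel list; the variable names are illustrative.

```python
from itertools import product
import numpy as np

# Same grid as in the tuning cell above
C_grid = np.logspace(-2, 2, 13)
kernels = ['linear', 'poly', 'rbf', 'sigmoid']

# Every (C, kernel) pair is one candidate; each is fit once per CV fold
n_candidates = len(list(product(C_grid, kernels)))
n_fits = n_candidates * 10  # cv=10, not counting the final refit
print("Candidates: {}, cross-validation fits: {}".format(n_candidates, n_fits))

# Position of the selected C within the grid
best_C = 0.46415888336127775
best_index = int(np.argmin(np.abs(C_grid - best_C)))
print("Best C is grid point {} of {}".format(best_index + 1, len(C_grid)))
```

The selected value lies in the interior of the grid and not at either end, which suggests the range of `C` values searched was wide enough for this problem.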

### Final F<sub>1</sub> Score


The final F1 score is 0.8337 for training and 0.8228 for testing. Compared to the untuned model trained on all 300 points (0.8673 for training and 0.8138 for testing), the F1 score for training decreased but the F1 score for testing increased. By optimizing the hyperparameters `C` and `kernel`, the model suffers less from overfitting the training set and generalizes better to the test set.