# Machine Learning 
## Supervised Learning
## Project: Building a Student Intervention System

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail to graduate. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer:

This is a binary (two-class) classification problem. The output variable we want to predict is a discrete label, such as "student needs intervention" or "student doesn't need intervention". Classification predicts which class a data point belongs to, which is exactly what we need.
Regression, by contrast, predicts continuous values by analysing the relationship between variables.

Examples of Classification type problems would be:

- medical diagnosis,
- image classification,
- identity fraud detection.

Examples of Regression type problems would be:

- popularity prediction,
- weather forecasting,
- price forecasting.
 
**

## Exploring the Data

In [1]:
# Importing libraries
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score

# Reading student data
student_data = pd.read_csv("student-data.csv")
print student_data
print "Student data read successfully!"

    school sex  age address famsize Pstatus  Medu  Fedu      Mjob      Fjob  \
0       GP   F   18       U     GT3       A     4     4   at_home   teacher   
1       GP   F   17       U     GT3       T     1     1   at_home     other   
2       GP   F   15       U     LE3       T     1     1   at_home     other   
3       GP   F   15       U     GT3       T     4     2    health  services   
4       GP   F   16       U     GT3       T     3     3     other     other   
5       GP   M   16       U     LE3       T     4     3  services     other   
6       GP   M   16       U     LE3       T     2     2     other     other   
7       GP   F   17       U     GT3       A     4     4     other   teacher   
8       GP   M   15       U     LE3       A     3     2  services     other   
9       GP   M   15       U     GT3       T     3     4     other     other   
10      GP   F   15       U     GT3       T     4     4   teacher    health   
11      GP   F   15       U     GT3       T     2   

### Implementation: Data Exploration

In [2]:

# TODO: Calculating number of students
n_students = student_data.shape[0]

# TODO: Calculating number of features
n_features = np.shape(student_data)[1] - 1

# TODO: Calculating passing students
n_passed = len(student_data[student_data['passed'] == 'yes'])

# TODO: Calculating failing students
n_failed = len(student_data[student_data['passed'] == 'no'])

# TODO: Calculating graduation rate
grad_rate = (float(n_passed) / float(n_students)) * 100

# Printing the results
print "Total number of students: {}".format(n_students)
print "Number of features: {}".format(n_features)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)


Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%


## Preparing the Data
### Identify feature and target columns

In [3]:
# Extracting feature columns
feature_cols = list(student_data.columns[:-1])

# Extracting target column 'passed'
target_col = student_data.columns[-1] 

# Showing the list of columns
print "Feature columns:\n{}".format(feature_cols)
print "\nTarget column: {}".format(target_col)

# Separating the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]

# Showing the feature information by printing the first five rows
print "\nFeature values:"
print X_all.head()



Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       

### Preprocess Feature Columns

In [4]:
def preprocess_features(X):
    ''' Preprocesses the student data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables. '''
    
    # Initializing new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigating each feature column for the data
    for col, col_data in X.iteritems():
        
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)  
        
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

X_all = preprocess_features(X_all)
print "Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48 total features):
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
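
The dtype checks in `preprocess_features` rely on `replace` turning a yes/no column into a numeric one, which may not hold in every pandas version. Below is a minimal sketch of a variant that inspects the values directly. The name `preprocess_features_sketch` and the toy rows are assumptions made for illustration; they are not part of the project code.

```python
import pandas as pd

def preprocess_features_sketch(X):
    """Encode yes/no columns as 1/0 and other non-numeric columns as dummies."""
    pieces = []
    for col in X.columns:
        col_data = X[col]
        if not pd.api.types.is_numeric_dtype(col_data):
            if set(col_data.unique()) <= {'yes', 'no'}:
                # Binary yes/no column: map explicitly to integers
                col_data = col_data.map({'yes': 1, 'no': 0})
            else:
                # Other categorical column, e.g. 'school' => 'school_GP', 'school_MS'
                col_data = pd.get_dummies(col_data, prefix=col).astype(int)
        pieces.append(col_data)
    # Concatenate column-wise, keeping the original row index
    return pd.concat(pieces, axis=1)

# Toy data (hypothetical rows that follow the student data's column conventions)
toy = pd.DataFrame({
    'school': ['GP', 'MS', 'GP'],
    'age': [18, 17, 15],
    'paid': ['yes', 'no', 'yes'],
})
encoded = preprocess_features_sketch(toy)
print(list(encoded.columns))
```

Checking the values rather than the dtype keeps the yes/no branch and the dummy-variable branch mutually exclusive, so a binary column cannot accidentally be expanded into dummy columns.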


### Implementation: Training and Testing Data Split

In [5]:
# TODO: Importing any additional functionality you may need here
import numpy as np
from sklearn.model_selection import train_test_split

# TODO: Setting the number of training points
num_train = 300

# Setting the number of testing points
num_test = X_all.shape[0] - num_train

# TODO: Splitting the dataset into the number of training and testing points above
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, train_size = num_train, test_size = num_test, random_state = 2) 

# Showing the results of the split
print "Training set has {} samples.".format(X_train.shape[0])
print "Testing set has {} samples.".format(X_test.shape[0])
X_train.head()


Training set has 300 samples.
Testing set has 95 samples.


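|     | school_GP | school_MS | sex_F | sex_M | age | address_R | address_U | famsize_GT3 | famsize_LE3 | Pstatus_A | ... | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 171 | 1 | 0 | 0 | 1 | 16 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 1 | 1 | 4 | 3 | 2 | 1 | 1 | 3 | 2 |
| 12 | 1 | 0 | 0 | 1 | 15 | 0 | 1 | 0 | 1 | 0 | ... | 1 | 1 | 0 | 4 | 3 | 3 | 1 | 3 | 5 | 2 |
| 13 | 1 | 0 | 0 | 1 | 15 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 1 | 0 | 5 | 4 | 3 | 1 | 2 | 3 | 2 |
| 151 | 1 | 0 | 0 | 1 | 16 | 0 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 4 | 4 | 4 | 3 | 5 | 5 | 6 |
| 310 | 1 | 0 | 1 | 0 | 19 | 0 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 4 | 2 | 4 | 2 | 2 | 3 | 0 |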
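
Because about two thirds of the students passed, a purely random split can leave the training and testing sets with somewhat different pass rates. One option, sketched below, is the `stratify` argument of `train_test_split`, which keeps the class proportions similar in both sets. The stand-in labels and placeholder features are assumptions built only from the class counts reported earlier (265 passed, 130 failed); this is not the split used in this project.

```python
from sklearn.model_selection import train_test_split

# Stand-in labels with the reported class counts (265 passed, 130 failed)
y_demo = ['yes'] * 265 + ['no'] * 130
X_demo = [[i] for i in range(len(y_demo))]  # placeholder features

# Same sizes as the project split, but stratified on the label
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=300, test_size=95,
    stratify=y_demo, random_state=2)

train_rate = y_tr.count('yes') / float(len(y_tr))
test_rate = y_te.count('yes') / float(len(y_te))
print("Pass rate in training set: {:.2%}".format(train_rate))
print("Pass rate in testing set: {:.2%}".format(test_rate))
```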


## Training and Evaluating Models

### Question 2 - Model Application
*List three supervised learning models that are appropriate for this problem. For each model chosen*
- Describe one real-world application in industry where the model can be applied. *(You may need to do a small bit of research for this — give references!)* 
- What are the strengths of the model; when does it perform well? 
- What are the weaknesses of the model; when does it perform poorly?
- What makes this model a good candidate for the problem, given what you know about the data?

**Answer:


The three supervised learning models that are appropriate for this problem, given what I know about the data, are:

- Decision Trees
- Gaussian Naive Bayes (GaussianNB)
- Ensemble Method (Random Forest)


DECISION TREES

Real-world application:

- detecting spam email messages based on the words in the text.

The strengths:

- performs implicit feature selection,
- robust to outliers and needs little data preparation (no feature scaling),
- clarity of interpretation.

The weaknesses:

- easily overfits (without a limit on tree growth it fits the training data too closely).

Why this model is a good candidate:

- A Decision Tree shows clearly how each feature contributes to the prediction.
- The model selects the important features.

What characteristics of the data make the model appropriate:

- The relationship between the features and the outcome is unlikely to be linear, and a Decision Tree can classify data that is not linearly separable.



GAUSSIAN NAIVE BAYES (GAUSSIAN NB)

Real-world application:

- classifying text, spam filtering.

The strengths:

- can be updated incrementally as new data arrives.

The weaknesses:

- cannot learn interactions between features.

Why this model is a good candidate:

- Gaussian Naive Bayes is fast to train and to test.
- Gaussian Naive Bayes adapts quickly to changes in the data set, so it performs well on dynamic data.

What characteristics of the data make the model appropriate:

- Gaussian Naive Bayes assumes that the features are independent of each other given the class. Because it fits a separate distribution to each class, its decision boundary need not be a straight line, so it can capture some of the nonlinearity in our data.




ENSEMBLE METHOD (RANDOM FOREST)

Real-world application:

- colorectal cancer detection based on patients' blood plasma samples.

The strengths:

- considers a random subset of features at each split, which makes the trees differ from one another,
- less prone to overfitting than a single Decision Tree,
- some implementations can deal with missing values,
- can work with categorical values,
- has many parameters and great flexibility.

The weaknesses:

- the model has to be rebuilt every time something changes,
- slower to train and to predict than a single tree,
- harder to interpret, since it doesn't describe relationships in the data as directly as a single tree.

Why this model is a good candidate:

- Random Forest has many parameters to tune to improve its predictive capability.
- Random Forest reduces overfitting by combining many trees and gives good predictions.
- Random Forest performs well on static data.

What characteristics of the Student Intervention data set make the model appropriate:

- The dependency of the output on the predictors is not linear. Random Forest performs well at discovering non-linear dependencies.
 
**
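
The statement that a Decision Tree selects important features can be checked on a fitted tree through its `feature_importances_` attribute. The sketch below uses synthetic data from `make_classification` as a stand-in, so the feature indices it prints say nothing about the student data; the sample size and feature counts are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 samples, 10 features, only 3 informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    random_state=0)

tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_demo, y_demo)

# Importances are non-negative and sum to 1; features never used in a split get 0
ranked = sorted(enumerate(tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for index, importance in ranked[:3]:
    print("feature {}: importance {:.3f}".format(index, importance))
```

The same attribute is available on a fitted `RandomForestClassifier`, where it averages the importances over all trees.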

### Setup

In [6]:
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Starting the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Printing the results
    print "Trained model in {:.4f} seconds".format(end - start)

    
def predict_labels(clf, features, target):
    ''' Makes predictions using a fit classifier based on F1 score. '''
    
    # Starting the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    
    # Printing and returning the results
    print "Made predictions in {:.4f} seconds.".format(end - start)
    return f1_score(target.values, y_pred, pos_label='yes')


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Train and predict using a classifier based on F1 score. '''
    
    # Indicating the classifier and the training set size
    print "Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train))
    
    # Training the classifier
    train_classifier(clf, X_train, y_train)
    
    # Printing the results of prediction for both training and testing
    print "F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train))
    print "F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test))
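
For reference, the F1 score returned by `predict_labels` is the harmonic mean of precision and recall for the positive label `'yes'`. The sketch below computes it from scratch; the helper names `f1_from_counts` and `f1_for_labels` are illustrative assumptions, and zero-division cases (for example, no positive predictions) are not handled. As a point of comparison, it also evaluates a trivial rule that predicts `'yes'` for every student, using the class counts of the full data set.

```python
def f1_from_counts(true_pos, false_pos, false_neg):
    """F1 score as the harmonic mean of precision and recall."""
    precision = true_pos / float(true_pos + false_pos)
    recall = true_pos / float(true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

def f1_for_labels(y_true, y_pred, pos_label='yes'):
    """Count outcomes for the positive label, then compute F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    return f1_from_counts(tp, fp, fn)

# Reference point: predicting 'yes' for all 395 students in the full data set
# (265 true positives, 130 false positives, 0 false negatives)
baseline = f1_from_counts(265, 130, 0)
print("All-'yes' baseline F1 on the full data set: {:.4f}".format(baseline))
```

This baseline works out to 530/660, roughly 0.80, which is a useful yardstick when reading the F1 scores of the trained models.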
   

### Implementation: Model Performance Metrics

In [7]:
# TODO: Importing the three supervised learning models from sklearn

# from sklearn import model_A - Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# from sklearn import model_B - Gaussian Naive Bayes Classifier
from sklearn.naive_bayes import GaussianNB

# from sklearn import model_C - Random Forest Classifier
from sklearn import ensemble
from sklearn.ensemble import RandomForestClassifier

# TODO: Initializing the three models
clf_A = DecisionTreeClassifier(max_depth=5)
clf_B = GaussianNB()
clf_C = RandomForestClassifier(max_depth=10, n_estimators = 10, random_state = 42)

# TODO: Setting up the training set sizes
X_train_100 = X_train[:100]
y_train_100 = y_train[:100]

X_train_200 = X_train[:200]
y_train_200 = y_train[:200]

X_train_300 = X_train[:300]
y_train_300 = y_train[:300]


# TODO: Executing the 'train_predict' function for each classifier and each training set size      
for clf in [clf_A, clf_B, clf_C]:
    print(' ') 
    print('___________________________')
    for size in [100, 200, 300]:
        train_predict( clf, X_train[:size], y_train[:size], X_test, y_test)
        print(' ') 

 
___________________________
Training a DecisionTreeClassifier using a training set size of 100. . .
Trained model in 0.0034 seconds
Made predictions in 0.0006 seconds.
F1 score for training set: 0.9173.
Made predictions in 0.0002 seconds.
F1 score for test set: 0.7770.
 
Training a DecisionTreeClassifier using a training set size of 200. . .
Trained model in 0.0010 seconds
Made predictions in 0.0002 seconds.
F1 score for training set: 0.8659.
Made predictions in 0.0002 seconds.
F1 score for test set: 0.7385.
 
Training a DecisionTreeClassifier using a training set size of 300. . .
Trained model in 0.0011 seconds
Made predictions in 0.0003 seconds.
F1 score for training set: 0.8765.
Made predictions in 0.0002 seconds.
F1 score for test set: 0.7826.
 
 
___________________________
Training a GaussianNB using a training set size of 100. . .
Trained model in 0.0007 seconds
Made predictions in 0.0003 seconds.
F1 score for training set: 0.4103.
Made predictions in 0.0003 seconds.
F1 score 

### Tabular Results 



**Classifier 1 / Model A / Decision Tree Classifier**

| Training Set Size | Training Time           | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               |        0.0012 seconds   |     0.0006 seconds     |      0.9173      |     0.7536      |
| 200               |        0.0016 seconds   |     0.0003 seconds     |      0.8659      |     0.7442      |
| 300               |        0.0017 seconds   |     0.0004 seconds     |      0.8765      |     0.7941      |


**Classifier 2 / Model B / Gaussian NB**

| Training Set Size | Training Time           | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               |        0.0015 seconds   |     0.0007 seconds     |      0.4103      |     0.2955      |
| 200               |        0.0022 seconds   |     0.0012 seconds     |      0.3544      |     0.2651      |
| 300               |        0.0016 seconds   |     0.0005 seconds     |      0.7839      |     0.7692      |


**Classifier 3 / Model C / Random Forest Classifier**

| Training Set Size | Training Time           | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               |        0.0283 seconds   |     0.0054 seconds     |       0.9841     |      0.7391     |
| 200               |        0.0282 seconds   |     0.0048 seconds     |       0.9692     |      0.7917     |
| 300               |        0.0285 seconds   |     0.0050 seconds     |       0.9709     |      0.7714     |



## Choosing the Best Model

### Question 3 - Choosing the Best Model
*Based on the experiments you performed earlier, in one to two paragraphs, explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?*

**Answer:

I have chosen the Random Forest Classifier as the best model for predictions on the student data. Given the available data, limited resources, cost, and performance, it is the most appropriate of the three. Its test F1 scores (0.7391, 0.7917, and 0.7714 for 100, 200, and 300 training samples) are in the same range as the Decision Tree's and well above those of Gaussian Naive Bayes at the two smaller training sizes, and it offers more parameters to tune. The Random Forest Classifier takes longer to train and to predict (about 0.03 seconds of training time versus about 0.002 seconds for the other two), but with a data set of only 395 student records this delay is not important.

**

### Question 4 - Model in Layman's Terms
*In one to two paragraphs, explain to the board of directors in layman's terms how the final model chosen is supposed to work. Be sure that you are describing the major qualities of the model, such as how the model is trained and how the model makes a prediction. Avoid using advanced mathematical or technical jargon, such as describing equations or discussing the algorithm implementation.*

**Answer:

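A Random Forest Classifier is a widely used classification model. It is an ensemble: many Decision Tree classifiers combined to make one prediction. Each Decision Tree works like a flowchart of yes/no questions about a student (for example, about past failures or absences) that branches until it reaches an answer. During training, every tree is built from a different random sample of the student records and considers a random subset of the features at each question, so the trees differ from one another. To make a prediction for a new student, every tree gives its own answer and the forest returns the answer chosen by the majority of trees. Combining many trees in this way usually gives a more reliable prediction than any single tree.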

**
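
The voting idea described above can be sketched in a few lines. The three "trees" below are hand-written stand-ins whose rules and thresholds are invented for illustration; they were not learned from the student data, and a real Random Forest learns its trees during training.

```python
from collections import Counter

def forest_predict(trees, student):
    """Ask every tree for a vote and return the most common answer."""
    votes = [tree(student) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical single-question "trees" (invented rules, for illustration only)
def tree_failures(student):
    return 'no' if student['failures'] > 0 else 'yes'

def tree_absences(student):
    return 'no' if student['absences'] > 10 else 'yes'

def tree_goout(student):
    return 'no' if student['goout'] >= 4 else 'yes'

student = {'failures': 0, 'absences': 12, 'goout': 2}
prediction = forest_predict([tree_failures, tree_absences, tree_goout], student)
print("Predicted 'passed' label: {}".format(prediction))
```

Here two of the three stand-in trees vote `'yes'`, so the combined prediction is `'yes'` even though one tree disagrees.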

### Implementation: Model Tuning

### Question 5 - Final F<sub>1</sub> Score
*What is the final model's F<sub>1</sub> score for training and testing? How does that score compare to the untuned model?*

In [8]:
# TODO: Importing 'GridSearchCV' and 'make_scorer'
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# TODO: Creating the parameters list you wish to tune
parameters = {'n_estimators':[14, 20, 50],'max_features':[0.2, 0.3, 0.5], 'min_samples_leaf':[50, 60, 70]}

# TODO: Initialize the classifier
clf = RandomForestClassifier(n_estimators = 50, max_features = 0.4, min_samples_leaf = 55, random_state = 40)

# TODO: Making an f1 scoring function using 'make_scorer' 
f1_scorer = make_scorer(f1_score,pos_label='yes')

# TODO: Performing grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf,parameters,scoring=f1_scorer)

# TODO: Fitting the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train, y_train)

# Getting the estimator
clf = grid_obj.best_estimator_

# Reporting the final F1 score for training and testing after parameter tuning
print "Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train))
print "Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test))




Made predictions in 0.0065 seconds.
Tuned model has a training F1 score of 0.8000.
Made predictions in 0.0072 seconds.
Tuned model has a testing F1 score of 0.8125.
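
After fitting, a grid search object also records which parameter combination won and its mean cross-validated score, which helps when interpreting the tuned result. The sketch below shows this on synthetic stand-in data with a deliberately small grid; the data, the grid values, and the variable names are assumptions and do not reproduce the search run in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 'yes'/'no' labels (not the student data)
X_demo, y_num = make_classification(n_samples=300, n_features=10, random_state=0)
y_demo = ['yes' if label == 1 else 'no' for label in y_num]

# A deliberately small grid so the sketch runs quickly
param_grid = {'n_estimators': [10, 20], 'min_samples_leaf': [1, 20]}
f1_scorer = make_scorer(f1_score, pos_label='yes')

grid = GridSearchCV(RandomForestClassifier(random_state=40),
                    param_grid, scoring=f1_scorer, cv=3)
grid.fit(X_demo, y_demo)

# best_params_ holds the winning combination; best_score_ is its mean
# cross-validated F1, which is a separate figure from train/test F1 scores
print("Best parameters: {}".format(grid.best_params_))
print("Mean cross-validated F1: {:.4f}".format(grid.best_score_))
```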


**Answer:


The Untuned Random Forest Classifier Model:

| Training Set Size | Training Time           | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               |        0.0283 seconds   |     0.0054 seconds     |       0.9841     |      0.7391     |
| 200               |        0.0282 seconds   |     0.0048 seconds     |       0.9692     |      0.7917     |
| 300               |        0.0285 seconds   |     0.0050 seconds     |       0.9709     |      0.7714     |


-------------


The Final Tuned Random Forest Classifier Model: 

- Training F1 score: 0.8000 (predictions made in 0.0066 seconds)
- Testing F1 score: 0.8125 (predictions made in 0.0069 seconds)

-----------

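The tuned model's test F1 score rose from 0.7714 (untuned, 300 training samples) to 0.8125, a modest rather than significant improvement. Its training F1 score fell from 0.9709 to 0.8000, so the gap between training and test performance narrowed, which suggests that the tuned model overfits less than the untuned one.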



**