## Project 2: Building a Student Intervention System
### Supervised Learning
### Machine Learning Engineer Nanodegree


This notebook contains extended answers and tips that go beyond what was taught and what is required; the extra material should still be useful for future projects. Feel free to fork my repository on GitHub [here](https://github.com/ritchieng/machine-learning-nanodegree/tree/master/supervised_learning/student_intervention).

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail to graduate. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer:**
- This is a classification problem.
- There are two possible discrete outcomes, which is typical of a classification problem:
    1. Students who need early intervention.
    2. Students who do not need early intervention.
- We can classify accordingly with a binary outcome such as:
    1. Yes, 1, for students who need early intervention.
    2. No, 0, for students who do not need early intervention.
- Evidently, we are not trying to predict a continuous outcome, hence this is not a regression problem.
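
As a minimal sketch of that binary encoding, a yes/no column can be mapped to 1/0 with pandas. The small `passed` series below is made-up illustration data, not the project dataset. Note that the dataset used later records whether a student *passed*, so "needs intervention" corresponds to `passed == 'no'`:

```python
import pandas as pd

# Hypothetical toy labels standing in for the 'passed' column
passed = pd.Series(['yes', 'no', 'yes', 'no', 'no'], name='passed')

# 1 = needs early intervention (did not pass), 0 = does not need it
needs_intervention = (passed == 'no').astype(int)

print(needs_intervention.tolist())
```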

## Exploring the Data
Run the code cell below to load necessary Python libraries and load the student data. Note that the last column from this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV

student_data = pd.read_csv("student-data.csv")

student_data.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes


In [2]:
# This is a 395 x 31 DataFrame
student_data.shape

(395, 31)

### Implementation: Data Exploration
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [3]:
# TODO: Calculate number of students
n_students = student_data.shape[0]

# TODO: Calculate number of features
n_features = student_data.shape[1] - 1

# TODO: Calculate passing students
# Filter rows with a boolean mask on the 'passed' column
passed = student_data[student_data.passed == 'yes']
n_passed = passed.shape[0]

# TODO: Calculate failing students
failed = student_data[student_data.passed == 'no']
n_failed = failed.shape[0]

# TODO: Calculate graduation rate
grad_rate = n_passed * 100 / n_students

# Print the results
print("Number of students: {}".format(n_students))
print("Number of features: {}".format(n_features))
print("Number of students who passed: {}".format(n_passed))
print("Number of students who failed: {}".format(n_failed))
print("Graduation rate of the class: {:.2f}%".format(grad_rate))

Number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%
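
The same counts can be cross-checked in one step with `value_counts`. The sketch below uses a small made-up frame (`toy`) rather than `student-data.csv`, so its numbers are illustrative only:

```python
import pandas as pd

# Hypothetical toy frame standing in for student_data
toy = pd.DataFrame({'age': [15, 16, 17, 18],
                    'passed': ['yes', 'no', 'yes', 'yes']})

counts = toy['passed'].value_counts()                     # absolute count per class
rates = toy['passed'].value_counts(normalize=True) * 100  # percentage per class

n_passed_toy = counts.get('yes', 0)
n_failed_toy = counts.get('no', 0)
grad_rate_toy = rates.get('yes', 0.0)

print(n_passed_toy, n_failed_toy, "{:.2f}%".format(grad_rate_toy))
```

Looking at the class percentages early also shows how unbalanced the two classes are, which matters when choosing an evaluation metric.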


In [4]:
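# Binarize categorical columns with 2 values (e.g. yes/no -> 1/0)
from sklearn.preprocessing import LabelBinarizer

for col in student_data.select_dtypes(exclude=[np.number]).columns:
    if student_data[col].nunique() == 2:
        lbl = LabelBinarizer()
        # fit_transform returns an (n, 1) array; ravel() flattens it to one column
        student_data[col] = lbl.fit_transform(student_data[col]).ravel()

# One-hot encode the remaining categorical columns. The default prefix is the
# source column name, which keeps e.g. Mjob_other and Fjob_other distinct.
# dtype=int keeps the dummies numeric so select_dtypes below retains them.
student_data = pd.get_dummies(student_data, dtype=int)

# Keep numeric columns only
student_data = student_data.select_dtypes(include=[np.number])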

student_data.shape

(395, 44)
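
Several of the multi-valued columns share category values (`Mjob` and `Fjob` both contain `at_home`, `other`, `teacher`, and so on), so the prefix given to `get_dummies` decides whether the resulting column names stay unique. The toy frame below is invented for illustration; it compares a single shared prefix with the default per-column prefix. `dtype=int` is passed so the indicator columns are integers rather than booleans:

```python
import pandas as pd

# Hypothetical toy frame with two columns that share the value 'teacher'
toy = pd.DataFrame({'Mjob': ['at_home', 'teacher', 'teacher'],
                    'Fjob': ['teacher', 'other', 'teacher']})

# One shared prefix: both 'teacher' indicators get the same column name
shared = pd.get_dummies(toy, prefix='dummy', dtype=int)

# Default prefix (the source column name): every indicator name is unique
distinct = pd.get_dummies(toy, dtype=int)

print(list(shared.columns))
print(list(distinct.columns))
```

Duplicate column names do not change the shape of the frame, but they make it hard to tell afterwards which feature a model is using.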

In [6]:
from sklearn.model_selection import train_test_split, cross_val_score

X_all = student_data.drop('passed', axis=1)
y_all = student_data['passed']

xtrain, xtest, ytrain, ytest = train_test_split(
    X_all, y_all, test_size=.2)

print("Training set has {} samples.".format(xtrain.shape[0]))
print("Testing set has {} samples.".format(xtest.shape[0]))

Training set has 316 samples.
Testing set has 79 samples.
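
The split above is unseeded, so the exact rows in each set change from run to run. One possible variation, shown here on synthetic data (`X_demo` and `y_demo` are invented for illustration), fixes `random_state` and passes `stratify` so that both sets keep roughly the same class ratio. This is a suggestion, not the setup that produced the results in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 3 features, 70% positive labels
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = np.array([1] * 70 + [0] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,   # fixed seed for a reproducible split
    stratify=y_demo)   # preserve the class ratio in both sets

print(X_tr.shape[0], X_te.shape[0])
print(y_tr.mean(), y_te.mean())
```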


## Training and Evaluating Models

In [7]:
from sklearn.dummy import DummyClassifier

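# Set the strategy explicitly; the default differs between scikit-learn versions
dum = DummyClassifier(strategy='stratified')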

cv = cross_val_score(dum, xtrain, ytrain, cv=4, scoring='roc_auc')

print(dum, '\n')
print('Mean score:', cv.mean())
print('Std Dev:   ', cv.std())

DummyClassifier(constant=None, random_state=None, strategy='stratified') 

Mean score: 0.489206445197
Std Dev:    0.0570994343835


In [8]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()
params = {
    'n_neighbors': np.arange(1,35,2),
    'leaf_size': np.arange(1,35,2),
    'p': np.arange(1,9,1),
}
grid = GridSearchCV(clf, param_grid=params, cv=4, scoring='roc_auc').fit(xtrain, ytrain)
clf = grid.best_estimator_

cv = cross_val_score(clf, xtrain, ytrain, cv=4, scoring='roc_auc')

print(clf, '\n')
print('Mean score:', cv.mean())
print('Std Dev:   ', cv.std())

KNeighborsClassifier(algorithm='auto', leaf_size=3, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=29, p=4,
           weights='uniform') 

Mean score: 0.650896696769
Std Dev:    0.0264962040969


In [9]:
from sklearn.linear_model import LogisticRegression

# liblinear supports both the 'l1' and 'l2' penalties searched below
clf = LogisticRegression(solver='liblinear')
params = {
    'penalty': ['l1', 'l2'],
    'C': [.01, .1, .5, 1, 10]
}

grid = GridSearchCV(clf, param_grid=params, cv=4, scoring='roc_auc').fit(xtrain, ytrain)
clf = grid.best_estimator_

cv = cross_val_score(clf, xtrain, ytrain, cv=4, scoring='roc_auc')

print(clf, '\n')
print('Mean score:', cv.mean())
print('Std Dev:   ', cv.std())

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False) 

Mean score: 0.668729774768
Std Dev:    0.0454280191782


In [None]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
params = {
    'max_depth': list(np.arange(2,13)) + [None],
    'min_samples_leaf': np.arange(1,11,1),
    'min_samples_split': np.arange(2,21,1),
}

grid = GridSearchCV(clf, param_grid=params, cv=4, scoring='roc_auc').fit(xtrain, ytrain)
clf = grid.best_estimator_

cv = cross_val_score(clf, xtrain, ytrain, cv=4, scoring='roc_auc')

print(clf, '\n')
print('Mean score:', cv.mean())
print('Std Dev:   ', cv.std())

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=5,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best') 

Mean score: 0.650570942859
Std Dev:    0.0319657112988


In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
params = {
    'max_depth': list(np.arange(2,13)) + [None],
    'min_samples_leaf': np.arange(1,11,1),
    'min_samples_split': np.arange(2,21,1),
    'n_estimators': [10,50,100]
}

grid = GridSearchCV(clf, param_grid=params, cv=4, scoring='roc_auc').fit(xtrain, ytrain)
clf = grid.best_estimator_

cv = cross_val_score(clf, xtrain, ytrain, cv=4, scoring='roc_auc')

print(clf, '\n')
print('Mean score:', cv.mean())
print('Std Dev:   ', cv.std())

In [None]:
from xgboost import XGBClassifier

clf = XGBClassifier()
params = {
    'max_depth': list(np.arange(2, 7)),
    'reg_lambda': np.arange(1, 2.51, .25),
    'n_estimators': [50, 100, 200]
}

grid = GridSearchCV(clf, param_grid=params, cv=4, scoring='roc_auc').fit(xtrain, ytrain)
clf = grid.best_estimator_

cv = cross_val_score(clf, xtrain, ytrain, cv=4, scoring='roc_auc')

print(clf, '\n')
print('Mean score:', cv.mean())
print('Std Dev:   ', cv.std())
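
Every model cell above repeats the same three steps: run a grid search, re-score the best estimator with cross-validation, and print a summary. A possible refactoring is sketched below. `tune_and_score` is a hypothetical helper name, and the data is synthetic (`make_classification`) so the block runs on its own. Because the same training data is used both to pick the hyperparameters and to score them, the reported mean is likely to be somewhat optimistic; the held-out `xtest`/`ytest` split remains the cleaner check.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score


def tune_and_score(estimator, params, X, y, cv=4, scoring='roc_auc'):
    """Grid-search `estimator`, then cross-validate the best candidate."""
    grid = GridSearchCV(estimator, param_grid=params, cv=cv, scoring=scoring)
    grid.fit(X, y)
    best = grid.best_estimator_
    scores = cross_val_score(best, X, y, cv=cv, scoring=scoring)
    return best, scores


# Synthetic stand-in for xtrain / ytrain
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     random_state=0)

best, scores = tune_and_score(
    LogisticRegression(solver='liblinear'),
    {'penalty': ['l1', 'l2'], 'C': [.01, .1, 1, 10]},
    X_demo, y_demo)

print(best, '\n')
print('Mean score:', scores.mean())
print('Std Dev:   ', scores.std())
```

With a helper like this, each model cell reduces to an estimator, a parameter grid, and one call.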