Choose one of the following methods for classification model evaluation. 

Describe the details of the approach and write the code for applying it for the evaluation of both SVM and decision tree learning algorithms in Python using Sklearn and your choice of data. 

1) Holdout – Reserve 2/3 for training and 1/3 for testing 

2) Random subsampling – Repeated holdout 

3) Cross validation – Partition data into k disjoint subsets – k-fold: train on k-1 partitions, test on the remaining one – Leave-one-out: k=n 

4) Stratified sampling – oversampling vs undersampling 

5) Bootstrap – Sampling with replacement

### Get Data

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### HR Analytics Data Set

https://www.kaggle.com/ckeller/hra-modeling-with-decision-tree/notebook

Why are our best and most experienced employees leaving prematurely?

The goal is to predict which valuable employees will leave next.

Fields in the dataset include:

- Satisfaction Level
- Last evaluation
- Number of projects
- Average monthly hours
- Time spent at the company
- Whether they have had a work accident
- Whether they have had a promotion in the last 5 years
- Departments (column sales)
- Salary
- Whether the employee has left

In [3]:
hra = pd.read_csv("/Users/racheldyap/Desktop/CDA/HR_comma_sep.csv")
hra.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


In [4]:
hra.shape

(14999, 10)

In [5]:
hra.rename(columns={"sales":"department"}, inplace=True)
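# One-hot encode the categorical columns; name them via columns= (the 2nd positional argument is prefix)
hra_new = pd.get_dummies(hra, columns=["department", "salary"], drop_first=True)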
hra_new.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,department_RandD,department_accounting,department_hr,department_management,department_marketing,department_product_mng,department_sales,department_support,department_technical,salary_low,salary_medium
0,0.38,0.53,2,157,3,0,1,0,0,0,0,0,0,0,1,0,0,1,0
1,0.8,0.86,5,262,6,0,1,0,0,0,0,0,0,0,1,0,0,0,1
2,0.11,0.88,7,272,4,0,1,0,0,0,0,0,0,0,1,0,0,0,1
3,0.72,0.87,5,223,5,0,1,0,0,0,0,0,0,0,1,0,0,1,0
4,0.37,0.52,2,159,3,0,1,0,0,0,0,0,0,0,1,0,0,1,0
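
To make the effect of drop_first easier to see, here is a minimal sketch on a tiny made-up frame. The `toy` DataFrame and its values are assumptions for illustration and are not rows from the HR data set. Each categorical column loses its alphabetically first level, which then acts as the reference category.

```python
import pandas as pd

# Hypothetical toy frame (not taken from HR_comma_sep.csv).
toy = pd.DataFrame({
    "department": ["sales", "hr", "IT", "sales"],
    "salary": ["low", "medium", "high", "low"],
    "left": [1, 0, 0, 1],
})

# One indicator column per level, minus the first level of each encoded column.
encoded = pd.get_dummies(toy, columns=["department", "salary"], drop_first=True)
print(list(encoded.columns))
```

The dropped levels here are "IT" and "high", which matches the encoded HR frame above, where there is no department_IT or salary_high column.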


In [8]:
hra_new.dtypes

satisfaction_level        float64
last_evaluation           float64
number_project              int64
average_montly_hours        int64
time_spend_company          int64
Work_accident               int64
left                        int64
promotion_last_5years       int64
department_RandD            uint8
department_accounting       uint8
department_hr               uint8
department_management       uint8
department_marketing        uint8
department_product_mng      uint8
department_sales            uint8
department_support          uint8
department_technical        uint8
salary_low                  uint8
salary_medium               uint8
dtype: object

In [9]:
hra_new.describe()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,department_RandD,department_accounting,department_hr,department_management,department_marketing,department_product_mng,department_sales,department_support,department_technical,salary_low,salary_medium
count,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0
mean,0.612834,0.716102,3.803054,201.050337,3.498233,0.14461,0.238083,0.021268,0.05247,0.051137,0.04927,0.042003,0.057204,0.060137,0.276018,0.14861,0.181345,0.487766,0.429762
std,0.248631,0.171169,1.232592,49.943099,1.460136,0.351719,0.425924,0.144281,0.222981,0.220284,0.216438,0.200602,0.232239,0.237749,0.447041,0.355715,0.385317,0.499867,0.495059
min,0.09,0.36,2.0,96.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.44,0.56,3.0,156.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.64,0.72,4.0,200.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.82,0.87,5.0,245.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
max,1.0,1.0,7.0,310.0,10.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [10]:
hra_new.isnull().sum()

satisfaction_level        0
last_evaluation           0
number_project            0
average_montly_hours      0
time_spend_company        0
Work_accident             0
left                      0
promotion_last_5years     0
department_RandD          0
department_accounting     0
department_hr             0
department_management     0
department_marketing      0
department_product_mng    0
department_sales          0
department_support        0
department_technical      0
salary_low                0
salary_medium             0
dtype: int64

In [11]:
hra_new["left"].value_counts()

0    11428
1     3571
Name: left, dtype: int64
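
Since the classes are imbalanced, it helps to know what accuracy a trivial classifier would reach before judging any model. The short check below uses only the two counts printed above; the variable names are my own.

```python
# Class counts copied from the value_counts() output above.
stayed, left = 11428, 3571
total = stayed + left

# Accuracy of a trivial model that always predicts the majority class (0 = stayed).
baseline_accuracy = stayed / total

print(f"Share of employees who left: {left / total:.3f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any accuracy reported later should be read against this baseline of roughly 0.76 rather than against 0.5.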

### 1 Hold Out

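Reserve 30% of the data as a holdout test set and use the remaining 70% as the training set.

Issues:
- It is wasteful: the held-out rows are never used for training.
- The error-rate estimate can be misleading, because it depends on which rows happen to land in the single test split.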
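
The second issue can be illustrated with a small sketch that repeats the same 70/30 holdout with different seeds. It assumes synthetic data from make_classification as a stand-in for the HR data, and it passes stratify so that each split keeps the class ratio; neither choice comes from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (an assumption; not the HR data set).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.76, 0.24], random_state=0
)

holdout_scores = []
for seed in range(5):
    # stratify=y_demo keeps the class ratio the same in the train and test parts.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.30, random_state=seed, stratify=y_demo
    )
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    holdout_scores.append(tree.score(X_te, y_te))

print("Holdout accuracy per seed:", [round(s, 3) for s in holdout_scores])
print("Spread (max - min):", round(max(holdout_scores) - min(holdout_scores), 3))
```

The spread between seeds gives a rough idea of how much a single holdout number can move by chance.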

In [52]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.svm import SVC

In [14]:
X = hra_new.drop('left',axis=1)
y = hra_new["left"]

In [73]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(10499, 18) (10499,)
(4500, 18) (4500,)


### SVM

In [68]:
model = SVC()
print(model.fit(X_train, y_train))
print("\nThe score for the train set is: ",model.score(X_train, y_train))

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

The score for the train set is:  0.954471854462


In [45]:
predictions = model.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

[[4459  172]
 [ 123 1246]]
             precision    recall  f1-score   support

          0       0.97      0.96      0.97      4631
          1       0.88      0.91      0.89      1369

avg / total       0.95      0.95      0.95      6000



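The test accuracy is already fairly good at 0.95. (The report above comes from a run with a 6,000-row test set, so its support counts differ from the 4,500-row split shown earlier.)
Let's see whether a grid search can find parameters that give an even better score.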

In [53]:
from sklearn.model_selection import GridSearchCV

In [57]:
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']} 
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=3)
grid.fit(X_train,y_train)

Fitting 3 folds for each of 25 candidates, totalling 75 fits
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ....... C=0.1, gamma=1, kernel=rbf, score=0.767000, total=   2.1s
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.2s remaining:    0.0s


[CV] ....... C=0.1, gamma=1, kernel=rbf, score=0.769000, total=   2.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.4s remaining:    0.0s


[CV] ....... C=0.1, gamma=1, kernel=rbf, score=0.766589, total=   2.1s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ..... C=0.1, gamma=0.1, kernel=rbf, score=0.920333, total=   1.2s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ..... C=0.1, gamma=0.1, kernel=rbf, score=0.912333, total=   1.2s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ..... C=0.1, gamma=0.1, kernel=rbf, score=0.922307, total=   1.3s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] .... C=0.1, gamma=0.01, kernel=rbf, score=0.866667, total=   1.1s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] .... C=0.1, gamma=0.01, kernel=rbf, score=0.854667, total=   1.0s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] .... C=0.1, gamma=0.01, kernel=rbf, score=0.856619, total=   1.1s
[CV] C=0.1, gamma=0.001, kernel=rbf ..................................
[CV] .

[CV] ...... C=1000, gamma=1, kernel=rbf, score=0.941000, total=   2.3s
[CV] C=1000, gamma=1, kernel=rbf .....................................
[CV] ...... C=1000, gamma=1, kernel=rbf, score=0.932333, total=   2.4s
[CV] C=1000, gamma=1, kernel=rbf .....................................
[CV] ...... C=1000, gamma=1, kernel=rbf, score=0.939647, total=   2.4s
[CV] C=1000, gamma=0.1, kernel=rbf ...................................
[CV] .... C=1000, gamma=0.1, kernel=rbf, score=0.952667, total=   0.8s
[CV] C=1000, gamma=0.1, kernel=rbf ...................................
[CV] .... C=1000, gamma=0.1, kernel=rbf, score=0.950000, total=   0.8s
[CV] C=1000, gamma=0.1, kernel=rbf ...................................
[CV] .... C=1000, gamma=0.1, kernel=rbf, score=0.957653, total=   0.8s
[CV] C=1000, gamma=0.01, kernel=rbf ..................................
[CV] ... C=1000, gamma=0.01, kernel=rbf, score=0.958000, total=   3.8s
[CV] C=1000, gamma=0.01, kernel=rbf ..................................
[CV] .

[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:  2.3min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 'kernel': ['rbf']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=3)

In [58]:
grid.best_estimator_

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [69]:
grid_predictions = grid.predict(X_test)
print(confusion_matrix(y_test,grid_predictions))
print(classification_report(y_test,grid_predictions))

[[3372   59]
 [  37 1032]]
             precision    recall  f1-score   support

          0       0.99      0.98      0.99      3431
          1       0.95      0.97      0.96      1069

avg / total       0.98      0.98      0.98      4500



After running the grid search to find a better combination of the C and gamma parameters, the refitted best estimator (C=10, gamma=0.1) reaches a much better score of 0.98 on the test set.
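
As the log shows ("Fitting 3 folds for each of 25 candidates"), the grid search itself cross-validates on the training split. The sketch below shows how the chosen parameters and their cross-validated score could be read back and compared with the holdout score. It assumes a small synthetic data set and a reduced grid so that it runs quickly; `small_grid` and `search` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the HR data set).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)

# Reduced grid for illustration; the notebook searches 5 x 5 values.
small_grid = {"C": [0.1, 1, 10], "gamma": [0.1, 0.01], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), small_grid, cv=3)
search.fit(X_tr, y_tr)  # tuning only ever sees the training split

print("Best parameters:", search.best_params_)
print("Mean CV accuracy of the best parameters:", round(search.best_score_, 3))
print("Holdout test accuracy:", round(search.score(X_te, y_te), 3))
```

Keeping the test split out of the search means the final holdout score is not influenced by the parameter tuning.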

### Decision Tree

In [132]:
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [134]:
predictions = dtree.predict(X_test)
print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          0       0.99      0.98      0.98      3426
          1       0.93      0.97      0.95      1074

avg / total       0.98      0.97      0.97      4500



In [135]:
print(confusion_matrix(y_test,predictions))

[[3343   83]
 [  32 1042]]


### 3 Cross-Validation 
With a single holdout split there is a risk, especially on a small dataset, that the split does not represent each class well; here relatively few employees have left the company. The resulting estimate of the generalization error can then be unreliable, and overfitting may go unnoticed. Cross-validation helps to avoid this.

Cross-validation alternates which part of the data is used for training and which for testing, so that every row is used for testing exactly once. This gives a more reliable error estimate when the dataset is relatively small.

K-Fold Cross-Validation
1. Split the data into k equally sized subsets (folds).
2. For each fold, train on the other k-1 folds and test on the fold that was left out.
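
The two steps can be made concrete with a sketch of the index bookkeeping alone, using KFold on ten toy rows. The array `X_toy` is a made-up placeholder, and shuffling with a fixed seed is my choice.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy rows; only their indices matter here.
X_toy = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
test_folds = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_toy), start=1):
    # Each iteration trains on 8 rows and tests on the 2 rows left out.
    test_folds.append(test_idx)
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
```

Across the five folds, every row index appears in a test fold exactly once.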

In [98]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics

In [95]:
X = hra_new.drop('left',axis=1)
y = hra_new["left"]

In [105]:
clf = SVC(kernel='rbf', C=10)  # SVC with the C found by the grid search (gamma is left at its default)
scores = cross_val_score(clf, X, y, cv=5)  # one accuracy score per fold; cv sets the number of folds
scores

array([ 0.96467844,  0.95433333,  0.956     ,  0.96898966,  0.96965655])

In [106]:
# the mean over all folds is the final accuracy estimate; the +/- term is two standard deviations
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.96 (+/- 0.01)


Cross-validated predictions: cross_val_predict returns, for each input row, the prediction made when that row was in the test fold.

In [107]:
predicted= cross_val_predict(clf,X,y,cv=10)
metrics.accuracy_score(y,predicted)

0.96673111540769385
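
Because cross_val_predict returns one out-of-fold prediction per row, those predictions can also be fed into a confusion matrix, not just an accuracy score. The following is a minimal sketch under the assumption of synthetic stand-in data; `oof_pred` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (an assumption; not the HR data set).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.76, 0.24], random_state=0
)

# Every row is predicted by a model that did not see it during training.
oof_pred = cross_val_predict(SVC(kernel="rbf", C=10), X_demo, y_demo, cv=5)

cm = confusion_matrix(y_demo, oof_pred)
print(cm)
```

The matrix covers all 300 rows, since each row is in a test fold exactly once.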

CV steps:

1. Decide on the number of folds you want (k).
2. Subdivide your dataset into k folds.
3. Use k-1 folds as the training set to build a tree.
4. Use the remaining fold as the test set to estimate the error of your tree.
5. Save your results for later.
6. Repeat steps 3-5 k times, leaving out a different fold as the test set each time.
7. Average the errors across your iterations to estimate the overall error.
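
The steps above can be sketched as an explicit loop. This is an illustration only, assuming synthetic data and a StratifiedKFold splitter with shuffling; for a classifier and an integer cv, cross_val_score also uses stratified folds, but without shuffling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (an assumption; not the HR data set).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.76, 0.24], random_state=0
)

k = 5  # step 1: choose the number of folds
splitter = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)  # step 2

fold_errors = []
for train_idx, test_idx in splitter.split(X_demo, y_demo):  # step 6: k iterations
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_demo[train_idx], y_demo[train_idx])  # step 3: train on k-1 folds
    error = 1.0 - tree.score(X_demo[test_idx], y_demo[test_idx])  # step 4
    fold_errors.append(error)  # step 5: save the result

# Step 7: average the per-fold errors.
print("Error per fold:", np.round(fold_errors, 3))
print("Mean error:", round(float(np.mean(fold_errors)), 3))
```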

In [137]:
print('Score:', dtree.score(X_train, y_train))

Score: 1.0


In [140]:
print('Cross validation score, 10-fold cv: \n', cross_val_score(dtree, X_train, y_train, cv=10))

Cross validation score, 10-fold cv: 
 [ 0.97716461  0.97811608  0.97428571  0.98095238  0.96666667  0.98095238
  0.96380952  0.97998093  0.9742612   0.97998093]


In [146]:
print('Mean Cross validation train score: \n', cross_val_score(dtree, X_train, y_train, cv=10).mean())

Mean Cross validation train score: 
 0.975521984752


In [143]:
print('Score:', dtree.score(X_test, y_test))

Score: 0.974444444444


In [144]:
print('Cross validation score, 10-fold cv: \n', cross_val_score(dtree, X_test, y_test, cv=10))

Cross validation score, 10-fold cv: 
 [ 0.96674058  0.97117517  0.96452328  0.96452328  0.97111111  0.96888889
  0.98663697  0.96659243  0.9688196   0.96659243]


In [145]:
print('Mean Cross validation score\n', cross_val_score(dtree, X_test, y_test, cv=10).mean())

Mean Cross validation score
 0.971112477373


#### In the example above, the decision tree gets a better score using the cross-validation method

##### When to use: 

For small datasets, the ideal choices are k-fold cross-validation with a large value of k (but smaller than the number of instances) or leave-one-out cross-validation. For bigger datasets, holdout validation is usually the first choice, since each extra fold adds another full round of training.
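
Leave-one-out is not run in this notebook, so here is a minimal sketch of it on a deliberately tiny synthetic data set (an assumption, chosen because leave-one-out needs one model fit per row).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic data set (an assumption; not the HR data set).
X_small, y_small = make_classification(n_samples=40, n_features=5, random_state=0)

# k = n: each row is the test set once, so 40 rows mean 40 fits.
loo_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_small, y_small, cv=LeaveOneOut()
)

print("Number of fits:", len(loo_scores))
print("Leave-one-out accuracy:", loo_scores.mean())
```

Each individual score is either 0 or 1, because every test fold holds a single row; only the mean is informative. On the full 14,999-row HR data this would require 14,999 fits per model.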