# Model Comparison Lab

In this lab we will compare the performance of all the models we have learned about so far, using the car evaluation dataset.

## 1. Prepare the data

The [car evaluation dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/car/) is in the assets/datasets folder. By now you should be very familiar with this dataset.

1. Load the data into a pandas dataframe
- Encode the categorical features properly: define a map that preserves the scale (assigning smaller numbers to words indicating smaller quantities)
- Separate features from target into X and y

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
df = pd.read_csv('./../../assets/datasets/car.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying           1728 non-null object
maint            1728 non-null object
doors            1728 non-null object
persons          1728 non-null object
lug_boot         1728 non-null object
safety           1728 non-null object
acceptability    1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB


In [12]:
df.acceptability.value_counts()

unacc    1210
acc       384
good       69
vgood      65
Name: acceptability, dtype: int64

In [5]:
map_maint = {'vhigh': 4,'high': 3,'med': 2,'low':1}
map_doors = {'5more': 4,'4': 3,'3': 2,'2':1}
map_persons = {'more': 3,'4': 2,'2':1}
map_lug_boot = {'big': 3,'med': 2,'small':1}
map_safety = {'high': 3,'med': 2,'low':1}

df.maint = df.maint.map(map_maint)
df.buying = df.buying.map(map_maint)  # buying has the same levels as maint, so the same map applies
df.doors = df.doors.map(map_doors)
df.persons = df.persons.map(map_persons)
df.lug_boot = df.lug_boot.map(map_lug_boot)
df.safety = df.safety.map(map_safety)
df.head()

   buying  maint  doors  persons  lug_boot  safety acceptability
0       4      4      1        1         1       1         unacc
1       4      4      1        1         1       2         unacc
2       4      4      1        1         1       3         unacc
3       4      4      1        1         2       1         unacc
4       4      4      1        1         2       2         unacc


## 2. Useful preparation

Since we will compare several models, let's write a couple of helper functions.

1. Separate X and y between a train and test set, using 30% test set, random state = 42
    - make sure that the data is shuffled and stratified
2. Define a function called `evaluate_model` that trains the model on the train set, tests it on the test set, and calculates:
    - accuracy score
    - confusion matrix
    - classification report
3. Initialize a global dictionary to store the various models for later retrieval
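
Step 3 could look something like the sketch below. It is only an illustration, not the lab's solution: the `models` dictionary, the `register_model` helper and the synthetic data from `make_classification` are assumptions introduced here so that the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption); in the lab X and y come from the car dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=3, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      stratify=y_demo, random_state=42)

# Hypothetical global registry: name -> fitted model and its test accuracy.
models = {}

def register_model(model, name):
    """Fit the model, score it on the test split and store both in `models`."""
    model.fit(Xtr, ytr)
    models[name] = {'model': model, 'score': model.score(Xte, yte)}
    return models[name]['score']

register_model(KNeighborsClassifier(), 'knn')
print(sorted(models))
```

Keeping the fitted model next to its score makes it easy to retrieve the best one later, for example with `max(models, key=lambda k: models[k]['score'])`.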


In [8]:
X = df.drop('acceptability', axis=1)
y = df.acceptability

In [7]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, stratify = y, random_state =42)

In [14]:
def evaluate_model(model, name):
    # 3-fold cross-validation on the full dataset (stratified folds, not shuffled)
    s = cross_val_score(model, X, y, cv=3, n_jobs=-1)
    print("{} Cross Val Score:\t{:0.3} ± {:0.3}".format(name, s.mean().round(3), s.std().round(3)))

    # Fit on the train split and score on the held-out test split
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print("Accuracy:  {}".format(model.score(X_test, y_test)))

    labels = ['unacc', 'acc', 'good', 'vgood']
    cm = confusion_matrix(y_test, y_pred, labels=labels)
    print(pd.DataFrame(cm, index=['True ' + l for l in labels],
                       columns=['Pred ' + l for l in labels]))
    print(classification_report(y_test, y_pred))

## 3.a KNN

Let's start with `KNeighborsClassifier`.

1. Initialize a KNN model
- Evaluate its performance with the function you previously defined
- Find the optimal value of K using grid search
    - Be careful about how you perform the cross validation in the grid search
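
One way to act on the cross validation warning, sketched under assumptions: the rows of the car dataset appear to be stored in a systematic order (see the output of `df.head()`), so unshuffled folds may each cover a different region of the feature space. Passing a `StratifiedKFold` with `shuffle=True` as `cv` avoids that. The synthetic, deliberately sorted data below is a stand-in, not the lab's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), sorted by the first feature to mimic
# a dataset whose rows are stored in a systematic order.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=3, random_state=42)
order = np.argsort(X_demo[:, 0])
X_demo, y_demo = X_demo[order], y_demo[order]

# Shuffled, stratified folds: each fold is a random sample with balanced classes.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

gs_demo = GridSearchCV(KNeighborsClassifier(),
                       {'n_neighbors': list(range(1, 11))},
                       cv=cv)
gs_demo.fit(X_demo, y_demo)
print(gs_demo.best_params_, round(gs_demo.best_score_, 3))
```

Passing an integer such as `cv=5` for a classifier also gives stratified folds, but without shuffling, which is why the splitter object is built explicitly here.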

In [15]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

evaluate_model(knn, 'knn')

knn Cross Val Score:	0.739 ± 0.123
Accuracy:  0.946050096339
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         355         8          0           0
True acc            12       103          0           0
True good            0         2         19           0
True vgood           0         4          2          14
             precision    recall  f1-score   support

        acc       0.88      0.90      0.89       115
       good       0.90      0.90      0.90        21
      unacc       0.97      0.98      0.97       363
      vgood       1.00      0.70      0.82        20

avg / total       0.95      0.95      0.95       519



In [16]:
from sklearn.model_selection import GridSearchCV
parameters = {"n_neighbors": [1,2,3,4,5,6,7,8,9,10]}

gs = GridSearchCV(knn, parameters, cv=5, n_jobs=4)
gs.fit(X, y)
print(gs.best_score_)
gs.best_params_

0.773148148148


{'n_neighbors': 9}

In [94]:
knn = KNeighborsClassifier(n_neighbors=9)

evaluate_model(knn, 'knn with grid Search')

knn with grid Search Cross Val Score:	0.775 ± 0.103
Accuracy:  0.942196531792
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         357         6          0           0
True acc            13       102          0           0
True good            2         4         15           0
True vgood           1         3          1          15
             precision    recall  f1-score   support

        acc       0.89      0.89      0.89       115
       good       0.94      0.71      0.81        21
      unacc       0.96      0.98      0.97       363
      vgood       1.00      0.75      0.86        20

avg / total       0.94      0.94      0.94       519



## 3.b Bagging + KNN

Now that we have found the optimal K, let's wrap `KNeighborsClassifier` in a BaggingClassifier and see if the score improves.

1. Wrap the KNN model in a Bagging Classifier
- Evaluate performance
- Do a grid search only on the bagging classifier params

In [30]:
from sklearn.ensemble import BaggingClassifier

knn = KNeighborsClassifier()
Bknn = BaggingClassifier(knn)

evaluate_model(Bknn, 'knn with bagging')



knn with bagging Cross Val Score:	0.744 ± 0.129
Accuracy:  0.940269749518
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         357         6          0           0
True acc            13       100          2           0
True good            4         2         15           0
True vgood           0         2          2          16
             precision    recall  f1-score   support

        acc       0.91      0.87      0.89       115
       good       0.79      0.71      0.75        21
      unacc       0.95      0.98      0.97       363
      vgood       1.00      0.80      0.89        20

avg / total       0.94      0.94      0.94       519



In [104]:
parameters = {"n_estimators": [1, 3, 5, 7, 9, 11],
              'max_features': [1, 2, 3, 4, 5],
              "bootstrap": [True, False],
              "bootstrap_features": [True, False]}
gs = GridSearchCV(Bknn, parameters, cv=5, n_jobs=1)
gs.fit(X, y)
print(gs.best_score_)
gs.best_params_

0.751157407407


{'bootstrap': False,
 'bootstrap_features': False,
 'max_features': 5,
 'n_estimators': 7}

In [108]:
Bknn = BaggingClassifier(knn,bootstrap = False,
 bootstrap_features = False,
 max_features = 5,
 n_estimators = 7)

evaluate_model(Bknn, 'knn with bagging and best params')

knn with bagging and best params Cross Val Score:	0.704 ± 0.119
Accuracy:  0.867052023121
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         356         7          0           0
True acc            38        76          1           0
True good           12         2          7           0
True vgood           5         2          2          11
             precision    recall  f1-score   support

        acc       0.87      0.66      0.75       115
       good       0.70      0.33      0.45        21
      unacc       0.87      0.98      0.92       363
      vgood       1.00      0.55      0.71        20

avg / total       0.87      0.87      0.86       519



## 4. Logistic Regression

Let's see if logistic regression performs better.

1. Initialize LR and test on Train/Test set
- Find optimal params with Grid Search
- See if Bagging improves the score

In [115]:
from sklearn.linear_model import LogisticRegression

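# liblinear was the default solver in older scikit-learn releases and supports both l1 and l2 penalties
LR = LogisticRegression(solver='liblinear')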

evaluate_model(LR, 'Log reg')

Log reg Cross Val Score:	0.707 ± 0.075
Accuracy:  0.782273603083
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         351         9          0           3
True acc            64        49          2           0
True good            4        10          3           4
True vgood           0        17          0           3
             precision    recall  f1-score   support

        acc       0.58      0.43      0.49       115
       good       0.60      0.14      0.23        21
      unacc       0.84      0.97      0.90       363
      vgood       0.30      0.15      0.20        20

avg / total       0.75      0.78      0.75       519



In [116]:
parameters = {
    'penalty': ['l1', 'l2'],
    'C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]}

gs = GridSearchCV(LR, parameters, cv=5, n_jobs=1)
gs.fit(X, y)
print(gs.best_score_)
gs.best_params_

0.754050925926


{'C': 1.0, 'penalty': 'l1'}

In [117]:
LR = LogisticRegression(C=1.0, penalty='l1', solver='liblinear')  # the l1 penalty needs a solver that supports it
BLR = BaggingClassifier(LR)
evaluate_model(BLR, 'Bagging Log reg')



Bagging Log reg Cross Val Score:	0.702 ± 0.096
Accuracy:  0.805394990366
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         350        11          1           1
True acc            60        53          2           0
True good            5        10          6           0
True vgood           0        11          0           9
             precision    recall  f1-score   support

        acc       0.62      0.46      0.53       115
       good       0.67      0.29      0.40        21
      unacc       0.84      0.96      0.90       363
      vgood       0.90      0.45      0.60        20

avg / total       0.79      0.81      0.79       519



## 5. Decision Trees

Let's see if Decision Trees perform better.

1. Initialize DT and test on Train/Test set
- Find optimal params with Grid Search
- See if Bagging improves the score

In [119]:
from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier()
evaluate_model(DT, "Decision Tree")

Decision Tree Cross Val Score:	0.815 ± 0.01
Accuracy:  0.982658959538
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         360         3          0           0
True acc             3       110          2           0
True good            0         0         21           0
True vgood           0         1          0          19
             precision    recall  f1-score   support

        acc       0.96      0.96      0.96       115
       good       0.91      1.00      0.95        21
      unacc       0.99      0.99      0.99       363
      vgood       1.00      0.95      0.97        20

avg / total       0.98      0.98      0.98       519



In [120]:
parameters = {
    'criterion': ['gini', 'entropy'],
    'max_features': [None, 1, 2, 3,4,5],
    'max_depth': [None, 4, 5, 6, 7, 8, 9],
    'max_leaf_nodes': [None, 4, 5, 6, 7, 8, 9]
}
gs = GridSearchCV(DT, parameters, cv=5, n_jobs=1)
gs.fit(X, y)
print(gs.best_score_)
gs.best_params_

0.857638888889


{'criterion': 'entropy',
 'max_depth': 9,
 'max_features': 4,
 'max_leaf_nodes': None}

In [124]:
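DT = DecisionTreeClassifier()
BDT = BaggingClassifier(DT)
# Evaluate the bagged model. The output recorded below matches the plain
# decision tree because the original run passed DT here instead of BDT.
evaluate_model(BDT, 'Bagging Dec Tree')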

Bagging Dec Tree Cross Val Score:	0.818 ± 0.013
Accuracy:  0.982658959538
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         360         3          0           0
True acc             3       110          2           0
True good            0         0         21           0
True vgood           0         1          0          19
             precision    recall  f1-score   support

        acc       0.96      0.96      0.96       115
       good       0.91      1.00      0.95        21
      unacc       0.99      0.99      0.99       363
      vgood       1.00      0.95      0.97        20

avg / total       0.98      0.98      0.98       519



## 6. Support Vector Machines

Let's see if SVMs perform better.

1. Initialize SVM and test on Train/Test set
- Find optimal params with Grid Search
- See if Bagging improves the score

In [18]:
from sklearn.svm import SVC

sv = SVC()

evaluate_model(sv, "Support vector machine")

Support vector machine Cross Val Score:	0.764 ± 0.113
Accuracy:  0.957610789981
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         352        11          0           0
True acc             7       108          0           0
True good            0         2         19           0
True vgood           0         1          1          18
             precision    recall  f1-score   support

        acc       0.89      0.94      0.91       115
       good       0.95      0.90      0.93        21
      unacc       0.98      0.97      0.98       363
      vgood       1.00      0.90      0.95        20

avg / total       0.96      0.96      0.96       519



In [128]:
bsv = BaggingClassifier(sv)
evaluate_model(bsv, "Bagging Support vector machine")



Bagging Support vector machine Cross Val Score:	0.766 ± 0.109
Accuracy:  0.959537572254
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         353        10          0           0
True acc             6       109          0           0
True good            0         4         17           0
True vgood           0         0          1          19
             precision    recall  f1-score   support

        acc       0.89      0.95      0.92       115
       good       0.94      0.81      0.87        21
      unacc       0.98      0.97      0.98       363
      vgood       1.00      0.95      0.97        20

avg / total       0.96      0.96      0.96       519



In [25]:
parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

gs = GridSearchCV(sv, parameters, n_jobs=2)
gs.fit(X, y)
print(gs.best_score_)
gs.best_params_

0.707175925926


{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}

In [38]:
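# C is set by hand here rather than to the grid-search optimum (C=10) reported above;
# degree only affects the 'poly' kernel, so it has no effect with kernel='rbf'.
sv = SVC(gamma=0.001, C=100000.0, kernel='rbf', degree=5)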

bsv = BaggingClassifier(sv)
bsv.fit(X_train,y_train)

bsv.score(X_test,y_test)

0.95568400770712914

## 7. Random Forest & Extra Trees

Let's see if Random Forest and Extra Trees perform better.

1. Initialize RF and ET and test on Train/Test set
- Find optimal params with Grid Search

In [40]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier()

evaluate_model(RFC, "Random Forest")

Random Forest Cross Val Score:	0.808 ± 0.07
Accuracy:  0.967244701349
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         358         5          0           0
True acc             2       111          2           0
True good            0         1         20           0
True vgood           0         5          2          13
             precision    recall  f1-score   support

        acc       0.91      0.97      0.94       115
       good       0.83      0.95      0.89        21
      unacc       0.99      0.99      0.99       363
      vgood       1.00      0.65      0.79        20

avg / total       0.97      0.97      0.97       519



In [48]:
param = {
    'bootstrap': [True, False],
    'criterion':['gini', 'entropy'],
    'max_features': [1, 2, 3, 4,5,6],
    'max_depth': [None, 3, 5, 7, 9, 11,13, 15,17]
}

gs = GridSearchCV(RFC, param,cv=5, n_jobs=2)
gs.fit(X, y)
print(gs.best_score_)
gs.best_params_

0.883101851852


{'bootstrap': True, 'criterion': 'entropy', 'max_depth': 11, 'max_features': 5}

In [50]:
RFC = RandomForestClassifier(bootstrap = True,
 criterion= 'entropy',
 max_depth = 11,
 max_features = 5)

evaluate_model(RFC, "Random Forest")

Random Forest Cross Val Score:	0.854 ± 0.045
Accuracy:  0.97880539499
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         358         5          0           0
True acc             3       111          1           0
True good            0         0         21           0
True vgood           0         0          2          18
             precision    recall  f1-score   support

        acc       0.96      0.97      0.96       115
       good       0.88      1.00      0.93        21
      unacc       0.99      0.99      0.99       363
      vgood       1.00      0.90      0.95        20

avg / total       0.98      0.98      0.98       519



In [47]:
from sklearn.ensemble import ExtraTreesClassifier

ETC = ExtraTreesClassifier()

evaluate_model(ETC, "Extra Tree Classifier")

Extra Tree Classifier Cross Val Score:	0.861 ± 0.015
Accuracy:  0.959537572254
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         355         8          0           0
True acc            10       105          0           0
True good            0         3         18           0
True vgood           0         0          0          20
             precision    recall  f1-score   support

        acc       0.91      0.91      0.91       115
       good       1.00      0.86      0.92        21
      unacc       0.97      0.98      0.98       363
      vgood       1.00      1.00      1.00        20

avg / total       0.96      0.96      0.96       519



In [51]:
param = {
    'bootstrap': [True, False],
    'criterion':['gini', 'entropy'],
    'max_features': [1, 2, 3, 4,5,6],
    'max_depth': [None, 3, 5, 7, 9,11,13, 15,17]
}

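# Search over the Extra Trees model. The recorded run passed RFC here, so the
# output below reflects a second random-forest search.
gs = GridSearchCV(ETC, param, cv=5, n_jobs=2)
gs.fit(X, y)
print(gs.best_score_)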
gs.best_params_

0.867476851852


{'bootstrap': True,
 'criterion': 'entropy',
 'max_depth': None,
 'max_features': 6}

In [54]:
ETC = ExtraTreesClassifier(bootstrap = True,max_depth= None,
 criterion= 'entropy',
 max_features = 6)

evaluate_model(ETC, "Extra Tree Classifier")

Extra Tree Classifier Cross Val Score:	0.835 ± 0.035
Accuracy:  0.976878612717
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         357         6          0           0
True acc             3       111          1           0
True good            0         0         20           1
True vgood           0         1          0          19
             precision    recall  f1-score   support

        acc       0.94      0.97      0.95       115
       good       0.95      0.95      0.95        21
      unacc       0.99      0.98      0.99       363
      vgood       0.95      0.95      0.95        20

avg / total       0.98      0.98      0.98       519



## 8. Model comparison

Let's compare the scores of the various models.

1. Do a bar chart of the scores of the best models. Which model wins on the train/test split?
- Re-test all the models using a 3-fold stratified shuffled cross validation
- Do a bar chart with error bars of the cross validation average scores. Is the winner the same?
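
The re-test and the error-bar chart could be sketched roughly as follows. This is a minimal illustration under assumptions: the `candidates` dictionary, the three models picked for it and the synthetic data are stand-ins chosen so the snippet runs on its own; in the lab you would loop over your own tuned models with the car dataset's `X` and `y`.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=3, random_state=42)

# Hypothetical set of models to compare.
candidates = {
    'knn': KNeighborsClassifier(n_neighbors=9),
    'decision tree': DecisionTreeClassifier(random_state=42),
    'random forest': RandomForestClassifier(n_estimators=50, random_state=42),
}

# 3-fold stratified, shuffled cross validation, shared by every model.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

names, means, stds = [], [], []
for name, model in sorted(candidates.items()):
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    names.append(name)
    means.append(scores.mean())
    stds.append(scores.std())

# Bar height is the mean accuracy; the error bar is one standard deviation.
fig, ax = plt.subplots()
ax.bar(range(len(names)), means, yerr=stds, capsize=4)
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names)
ax.set_ylabel('3-fold CV accuracy')
```

Using the same `cv` object for every model means all of them are scored on identical folds, so differences between bars are not caused by different splits.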


In [56]:
print("The best model is RandomForestClassifier with Parameters from GridSearch\n")
evaluate_model(RFC, "Random Forest")

The best model is RandomForestClassifier with Parameters from GridSearch

Random Forest Cross Val Score:	0.83 ± 0.052
Accuracy:  0.976878612717
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         358         5          0           0
True acc             3       112          0           0
True good            0         1         20           0
True vgood           0         2          1          17
             precision    recall  f1-score   support

        acc       0.93      0.97      0.95       115
       good       0.95      0.95      0.95        21
      unacc       0.99      0.99      0.99       363
      vgood       1.00      0.85      0.92        20

avg / total       0.98      0.98      0.98       519
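
For the first task, a bar chart of the train/test scores could be drawn as in the sketch below. The accuracies are copied (rounded to three decimals) from the outputs recorded earlier in this lab; the choice of which run represents each model family is an assumption made here. On these numbers the single decision tree scores marginally highest on the test split, while the random forest had the highest grid-search score (0.883), so the two criteria do not pick the same winner.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt

# Test-split accuracies copied from the outputs recorded in this lab (rounded).
test_accuracy = {
    'KNN': 0.946,
    'Bagged LR': 0.805,
    'Decision tree': 0.983,
    'Bagged SVM': 0.960,
    'Random forest': 0.979,
    'Extra trees': 0.977,
}
names = list(test_accuracy)
values = [test_accuracy[n] for n in names]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(range(len(names)), values)
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names, rotation=30, ha='right')
ax.set_ylim(0.7, 1.0)  # zoom in so the small differences are visible
ax.set_ylabel('Accuracy on the 30% test split')
fig.tight_layout()

# Model with the highest recorded test accuracy.
best = max(test_accuracy, key=test_accuracy.get)
print(best)
```

The gaps between the top four models are a few test samples out of 519, so a single split is weak evidence for ranking them.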



## Bonus

We have encoded the data using a map that preserves the scale.
Would our results have changed if we had encoded the categorical data as binary variables using `pd.get_dummies` or `OneHotEncoder` instead?

1. Repeat the analysis for this scenario. Is it better?
- Experiment with other models or other parameters. Can you beat your classmates' best score?
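
As a starting point for the bonus, here is a small sketch of the binary encoding with `pd.get_dummies`. The three-row `sample` frame is made up for illustration; it only borrows the dataset's column names and category labels.

```python
import pandas as pd

# Tiny hand-made sample (assumption) using the dataset's column names and labels.
sample = pd.DataFrame({
    'buying': ['vhigh', 'low', 'med'],
    'safety': ['low', 'high', 'med'],
    'acceptability': ['unacc', 'acc', 'unacc'],
})

# One binary column per (feature, category) pair; the target is left untouched.
X_onehot = pd.get_dummies(sample.drop('acceptability', axis=1))
y_sample = sample['acceptability']
print(sorted(X_onehot.columns))
```

Applied to the raw, unmapped car data, the six features would expand to 4 + 4 + 4 + 3 + 3 + 3 = 21 binary columns, and the ordering between levels such as `low` and `high` would no longer be encoded.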