# Model Comparison Lab

In this lab we will compare the performance of all the models we have learned about so far, using the car evaluation dataset.

## 1. Prepare the data

The [car evaluation dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/car/) is in the assets/datasets folder. By now you should be very familiar with this dataset.

1. Load the data into a pandas dataframe
2. Encode the categorical features properly: define a map that preserves the scale (assigning smaller numbers to words indicating smaller quantities)
3. Separate features from target into X and y

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [43]:
from sklearn import tree, ensemble, preprocessing, model_selection, metrics, neighbors, linear_model, svm

In [19]:
df = pd.read_csv('../../assets/datasets/car.csv')

In [20]:
df.head()

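| Unnamed: 0 | buying | maint | doors | persons | lug_boot | safety | acceptability |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |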


In [21]:
df.shape

(1728, 7)

In [22]:
le = preprocessing.LabelEncoder()

In [23]:
y = le.fit_transform(df['acceptability'])

In [24]:
X = pd.get_dummies(df.iloc[:,:-1])
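
The cell above one-hot encodes the features with `pd.get_dummies`, while the task asks for a map that preserves the scale. A minimal sketch of such an ordinal encoding follows. The numeric codes (for example `'more': 5` for `persons`) and the names `df_demo`, `ordinal_maps`, `X_ord` and `y_ord` are illustrative assumptions, and the four toy rows stand in for `car.csv` so the snippet runs on its own. It also leaves out the `Unnamed: 0` column, which appears to be a row counter rather than a car attribute.

```python
import pandas as pd

# Toy rows in the same format as car.csv (a stand-in, not the real data).
df_demo = pd.DataFrame({
    'buying': ['vhigh', 'high', 'med', 'low'],
    'maint': ['vhigh', 'high', 'med', 'low'],
    'doors': ['2', '3', '4', '5more'],
    'persons': ['2', '4', 'more', '4'],
    'lug_boot': ['small', 'med', 'big', 'med'],
    'safety': ['low', 'med', 'high', 'high'],
    'acceptability': ['unacc', 'unacc', 'acc', 'vgood'],
})

# Assumed ordinal codes: smaller numbers for smaller quantities.
price_map = {'low': 0, 'med': 1, 'high': 2, 'vhigh': 3}
ordinal_maps = {
    'buying': price_map,
    'maint': price_map,
    'doors': {'2': 2, '3': 3, '4': 4, '5more': 5},
    'persons': {'2': 2, '4': 4, 'more': 5},
    'lug_boot': {'small': 0, 'med': 1, 'big': 2},
    'safety': {'low': 0, 'med': 1, 'high': 2},
}

# Map every feature column; astype(str) guards against columns parsed as integers.
X_ord = pd.DataFrame({
    col: df_demo[col].astype(str).map(mapping)
    for col, mapping in ordinal_maps.items()
})

# The target is ordered as well, from unacceptable to very good.
y_ord = df_demo['acceptability'].map({'unacc': 0, 'acc': 1, 'good': 2, 'vgood': 3})
```

Any value missing from a map becomes `NaN`, so checking `X_ord.isnull().sum()` after mapping is a cheap way to catch a typo in the maps.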

## 2. Useful preparation

Since we will compare several models, let's write a couple of helper functions.

1. Split X and y into a train set and a test set, using a 30% test set and random state = 42
    - make sure that the data is shuffled and stratified
2. Define a function called `evaluate_model` that trains the model on the train set, tests it on the test set, and calculates:
    - accuracy score
    - confusion matrix
    - classification report
3. Initialize a global dictionary to store the various models for later retrieval


In [25]:
skf = model_selection.StratifiedKFold(shuffle=True)

In [26]:
for train, test in skf.split(X,y):
    X_train, X_test = X.iloc[train], X.iloc[test]
    y_train, y_test = y[train], y[test]
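
The loop above leaves `X_train` and `X_test` set to the last fold of the `StratifiedKFold`, which holds out roughly a third of the rows (575 in the outputs below) rather than the 30% the task asks for, and it does not fix the random state. One way to follow the task literally is `train_test_split`; the sketch below uses made-up stand-in arrays (`X_demo`, `y_demo`) so that it runs on its own, and in the lab you would pass the real `X` and `y` instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 200 rows, 6 integer features, 4 imbalanced classes.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 4, size=(200, 6))
y_demo = np.repeat([0, 1, 2, 3], [140, 40, 10, 10])

# 30% test set, shuffled, stratified on the target, reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,
    shuffle=True,
    stratify=y_demo,
    random_state=42,
)

# Class proportions should be close to identical in both splits.
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
```

Results from this split will not match the outputs printed in this notebook, which come from the fold-based split.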

In [27]:
dt = tree.DecisionTreeClassifier()
dt.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [28]:
le.classes_

array(['acc', 'good', 'unacc', 'vgood'], dtype=object)

In [29]:
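def evaluate_model(model, name):
    # Fit on the training split and predict on the held-out test split
    mod = model.fit(X_train, y_train)
    y_pred = mod.predict(X_test)
    print('{} accuracy score: {}'.format(name, mod.score(X_test, y_test)))

    # Confusion matrix: rows are true classes, columns are predicted classes
    conmat = metrics.confusion_matrix(y_test, y_pred)
    conmat = pd.DataFrame(conmat, columns=le.classes_, index=le.classes_)
    print()
    print(conmat)

    # Per-class precision, recall and F1 (classes appear as encoded integers)
    class_rep = metrics.classification_report(y_test, y_pred)
    print()
    print(class_rep)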
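
Step 3 of the task list, the global dictionary of models, is not covered by the function above. A minimal sketch of one way to add it follows. The names `models` and `evaluate_and_store` are hypothetical, the splits are passed in explicitly instead of being read from globals, and the iris data is only a stand-in so the example runs on its own.

```python
from sklearn import datasets, metrics, model_selection, neighbors

# Hypothetical registry: model name -> fitted model and its test accuracy.
models = {}

def evaluate_and_store(model, name, X_train, X_test, y_train, y_test):
    """Fit the model, score it on the test split, and record both in `models`."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    score = metrics.accuracy_score(y_test, y_pred)
    models[name] = {'model': model, 'score': score}
    return score

# Stand-in data (assumption): iris instead of the car dataset.
X_demo, y_demo = datasets.load_iris(return_X_y=True)
split = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

# train_test_split returns X_train, X_test, y_train, y_test in that order.
knn_score = evaluate_and_store(neighbors.KNeighborsClassifier(), 'kNN', *split)
```

Storing the scores next to the fitted models makes the final comparison chart a matter of iterating over `models`.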

## 3.a KNN

Let's start with `KNeighborsClassifier`.

1. Initialize a KNN model
2. Evaluate its performance with the function you previously defined
3. Find the optimal value of K using grid search
    - Be careful about how you perform the cross validation in the grid search

In [30]:
evaluate_model(neighbors.KNeighborsClassifier(), 'kNN')

kNN accuracy score: 0.857391304348

       acc  good  unacc  vgood
acc     82     1     45      0
good    15     6      2      0
unacc    3     0    400      0
vgood   12     1      3      5

             precision    recall  f1-score   support

          0       0.73      0.64      0.68       128
          1       0.75      0.26      0.39        23
          2       0.89      0.99      0.94       403
          3       1.00      0.24      0.38        21

avg / total       0.85      0.86      0.84       575
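
The grid search over K is not shown in this section. A sketch of how it might look follows, with iris as stand-in data and a hypothetical grid (`n_neighbors` from 1 to 20, both weighting schemes). The point about cross validation is handled by passing an explicit shuffled `StratifiedKFold` and by fitting the search on the training split only, since the rows of the car dataset appear to be sorted rather than randomly ordered.

```python
from sklearn import datasets, model_selection, neighbors

# Stand-in data (assumption): iris instead of the car dataset.
X_demo, y_demo = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

# Shuffled, stratified folds so each fold sees every class.
cv = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

gs_knn = model_selection.GridSearchCV(
    neighbors.KNeighborsClassifier(),
    {'n_neighbors': list(range(1, 21)), 'weights': ['uniform', 'distance']},
    cv=cv,
)
gs_knn.fit(X_tr, y_tr)  # the test split is never seen during the search

best_k = gs_knn.best_params_['n_neighbors']
test_accuracy = gs_knn.best_estimator_.score(X_te, y_te)
```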



## 3.b Bagging + KNN

Now that we have found the optimal K, let's wrap `KNeighborsClassifier` in a BaggingClassifier and see if the score improves.

1. Wrap the KNN model in a Bagging Classifier
2. Evaluate performance
3. Do a grid search only on the bagging classifier params

In [31]:
evaluate_model(ensemble.BaggingClassifier(neighbors.KNeighborsClassifier()), 'Bagged kNN')

Bagged kNN accuracy score: 0.918260869565

       acc  good  unacc  vgood
acc    101     4     23      0
good     5    14      2      2
unacc    8     0    395      0
vgood    3     0      0     18

             precision    recall  f1-score   support

          0       0.86      0.79      0.82       128
          1       0.78      0.61      0.68        23
          2       0.94      0.98      0.96       403
          3       0.90      0.86      0.88        21

avg / total       0.92      0.92      0.92       575



In [33]:
gs = model_selection.GridSearchCV(ensemble.BaggingClassifier(neighbors.KNeighborsClassifier()),
                                 {'n_estimators': np.arange(1,10,1), 
                                  'max_samples': np.arange(0.5,1.0,0.1), 
                                  'max_features': np.arange(0.5,1.0,0.1)},
                                 cv=5)
gs.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingClassifier(base_estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': array([1, 2, 3, 4, 5, 6, 7, 8, 9]), 'max_samples': array([ 0.5,  0.6,  0.7,  0.8,  0.9]), 'max_features': array([ 0.5,  0.6,  0.7,  0.8,  0.9])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [34]:
evaluate_model(gs.best_estimator_, 'GridSearched Bagged kNN')

GridSearched Bagged kNN accuracy score: 0.779130434783

       acc  good  unacc  vgood
acc     51     0     77      0
good     8     3     12      0
unacc   15     1    384      3
vgood    4     0      7     10

             precision    recall  f1-score   support

          0       0.65      0.40      0.50       128
          1       0.75      0.13      0.22        23
          2       0.80      0.95      0.87       403
          3       0.77      0.48      0.59        21

avg / total       0.76      0.78      0.75       575



## 4. Logistic Regression

Let's see if logistic regression performs better.

1. Initialize LR and test on Train/Test set
2. Find optimal params with Grid Search
3. See if Bagging improves the score

In [35]:
evaluate_model(linear_model.LogisticRegression(), 'Logistic Regression')

Logistic Regression accuracy score: 0.867826086957

       acc  good  unacc  vgood
acc     96     6     26      0
good    15     7      0      1
unacc   13     1    389      0
vgood   14     0      0      7

             precision    recall  f1-score   support

          0       0.70      0.75      0.72       128
          1       0.50      0.30      0.38        23
          2       0.94      0.97      0.95       403
          3       0.88      0.33      0.48        21

avg / total       0.86      0.87      0.86       575



In [36]:
# solver='liblinear' supports both the 'l1' and 'l2' penalties in the grid
gs2 = model_selection.GridSearchCV(linear_model.LogisticRegression(solver='liblinear'),
                                   {'C': [0.001,0.01,0.1,0.2,0.3,0.5,1.0],
                                   'penalty': ['l1', 'l2']},
                                   cv=5)
gs2.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 0.2, 0.3, 0.5, 1.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [37]:
evaluate_model(gs2.best_estimator_, 'GridSearched Logistic Regression')

GridSearched Logistic Regression accuracy score: 0.838260869565

       acc  good  unacc  vgood
acc     85     0     43      0
good    23     0      0      0
unacc    6     0    397      0
vgood   21     0      0      0

             precision    recall  f1-score   support

          0       0.63      0.66      0.65       128
          1       0.00      0.00      0.00        23
          2       0.90      0.99      0.94       403
          3       0.00      0.00      0.00        21

avg / total       0.77      0.84      0.80       575



In [38]:
evaluate_model(ensemble.BaggingClassifier(gs2.best_estimator_), 'Bagged GridSearched Logistic Regression')

Bagged GridSearched Logistic Regression accuracy score: 0.834782608696

       acc  good  unacc  vgood
acc     84     0     44      0
good    23     0      0      0
unacc    7     0    396      0
vgood   21     0      0      0

             precision    recall  f1-score   support

          0       0.62      0.66      0.64       128
          1       0.00      0.00      0.00        23
          2       0.90      0.98      0.94       403
          3       0.00      0.00      0.00        21

avg / total       0.77      0.83      0.80       575



## 5. Decision Trees

Let's see if Decision Trees perform better.

1. Initialize DT and test on Train/Test set
2. Find optimal params with Grid Search
3. See if Bagging improves the score

In [39]:
evaluate_model(tree.DecisionTreeClassifier(), "Decision Tree")

Decision Tree accuracy score: 0.96

       acc  good  unacc  vgood
acc    113     3     12      0
good     0    22      1      0
unacc    6     0    397      0
vgood    1     0      0     20

             precision    recall  f1-score   support

          0       0.94      0.88      0.91       128
          1       0.88      0.96      0.92        23
          2       0.97      0.99      0.98       403
          3       1.00      0.95      0.98        21

avg / total       0.96      0.96      0.96       575



In [40]:
gs3 = model_selection.GridSearchCV(tree.DecisionTreeClassifier(),
                                   {'criterion': ['gini', 'entropy'],
                                    'max_depth': [None, 5,10,20], 
                                    'min_samples_split': np.arange(10,50,10), 
                                    'min_samples_leaf': np.arange(5,20,1)},
                                   cv=5)
gs3.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'min_samples_split': array([10, 20, 30, 40]), 'criterion': ['gini', 'entropy'], 'max_depth': [None, 5, 10, 20], 'min_samples_leaf': array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [41]:
evaluate_model(gs3.best_estimator_, 'GridSearched Decision Trees')

GridSearched Decision Trees accuracy score: 0.935652173913

       acc  good  unacc  vgood
acc    111     3     12      2
good     5    17      1      0
unacc    9     0    394      0
vgood    0     5      0     16

             precision    recall  f1-score   support

          0       0.89      0.87      0.88       128
          1       0.68      0.74      0.71        23
          2       0.97      0.98      0.97       403
          3       0.89      0.76      0.82        21

avg / total       0.94      0.94      0.94       575



In [42]:
evaluate_model(ensemble.BaggingClassifier(gs3.best_estimator_), 'Bagged GridSearched Decision Trees')

Bagged GridSearched Decision Trees accuracy score: 0.96347826087

       acc  good  unacc  vgood
acc    123     2      1      2
good     0    20      0      3
unacc   13     0    390      0
vgood    0     0      0     21

             precision    recall  f1-score   support

          0       0.90      0.96      0.93       128
          1       0.91      0.87      0.89        23
          2       1.00      0.97      0.98       403
          3       0.81      1.00      0.89        21

avg / total       0.97      0.96      0.96       575



## 6. Support Vector Machines

Let's see if SVMs perform better.

1. Initialize SVM and test on Train/Test set
2. Find optimal params with Grid Search
3. See if Bagging improves the score

In [44]:
evaluate_model(svm.SVC(), 'Support Vector')

Support Vector accuracy score: 0.913043478261

       acc  good  unacc  vgood
acc    124     0      4      0
good    17     5      0      1
unacc   16     0    387      0
vgood   12     0      0      9

             precision    recall  f1-score   support

          0       0.73      0.97      0.84       128
          1       1.00      0.22      0.36        23
          2       0.99      0.96      0.97       403
          3       0.90      0.43      0.58        21

avg / total       0.93      0.91      0.90       575
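
Only the default `SVC` is evaluated above. The remaining two steps could be sketched as follows; the parameter grid is an assumption chosen for illustration, and iris again stands in for the car data so the snippet runs on its own.

```python
from sklearn import datasets, ensemble, model_selection, svm

# Stand-in data (assumption): iris instead of the car dataset.
X_demo, y_demo = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

cv = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Hypothetical grid over the regularization strength, kernel and kernel width.
gs_svc = model_selection.GridSearchCV(
    svm.SVC(),
    {'C': [0.1, 1, 10, 100],
     'kernel': ['rbf', 'linear'],
     'gamma': ['scale', 0.01, 0.1]},
    cv=cv,
)
gs_svc.fit(X_tr, y_tr)
svc_accuracy = gs_svc.best_estimator_.score(X_te, y_te)

# Wrap the tuned SVC in a bagging ensemble and compare on the same test split.
bagged_svc = ensemble.BaggingClassifier(
    gs_svc.best_estimator_, n_estimators=10, random_state=42)
bagged_svc.fit(X_tr, y_tr)
bagged_accuracy = bagged_svc.score(X_te, y_te)
```

Bagging tends to help high-variance models most, so it would not be surprising if it changes little for a well-regularized SVM.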



## 7. Random Forest & Extra Trees

Let's see if Random Forest and Extra Trees perform better.

1. Initialize RF and ET and test on Train/Test set
2. Find optimal params with Grid Search
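
No solution cell is given for this section. A minimal sketch under the same assumptions as the earlier examples (iris as stand-in data, a small hypothetical grid) might look like this:

```python
from sklearn import datasets, ensemble, model_selection

# Stand-in data (assumption): iris instead of the car dataset.
X_demo, y_demo = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

cv = model_selection.StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Hypothetical grid shared by both ensembles.
param_grid = {
    'n_estimators': [25, 50],
    'max_depth': [None, 5],
    'min_samples_leaf': [1, 3],
}

forest_results = {}
for name, estimator in [
    ('Random Forest', ensemble.RandomForestClassifier(random_state=42)),
    ('Extra Trees', ensemble.ExtraTreesClassifier(random_state=42)),
]:
    gs = model_selection.GridSearchCV(estimator, param_grid, cv=cv)
    gs.fit(X_tr, y_tr)
    forest_results[name] = {
        'best_params': gs.best_params_,
        'cv_score': gs.best_score_,                            # mean CV accuracy
        'test_score': gs.best_estimator_.score(X_te, y_te),   # held-out accuracy
    }
```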

## 8. Model comparison

Let's compare the scores of the various models.

1. Do a bar chart of the scores of the best models. Which model wins on the train/test split?
2. Re-test all the models using a 3 fold stratified shuffled cross validation
3. Do a bar chart with error bars of the cross validation average scores. Is the winner the same?
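
The cross-validated comparison could be sketched as below. The model list, the dictionary names (`candidates`, `cv_scores`) and the use of iris as stand-in data are assumptions; in the lab you would plug in the tuned estimators from the earlier sections and the real `X` and `y`.

```python
import matplotlib.pyplot as plt
from sklearn import (datasets, ensemble, linear_model, model_selection,
                     neighbors, svm, tree)

# Stand-in data (assumption): iris instead of the car dataset.
X_demo, y_demo = datasets.load_iris(return_X_y=True)

# Hypothetical shortlist; replace with the best estimator of each family.
candidates = {
    'kNN': neighbors.KNeighborsClassifier(),
    'LogReg': linear_model.LogisticRegression(max_iter=1000),
    'Tree': tree.DecisionTreeClassifier(random_state=42),
    'SVC': svm.SVC(),
    'RF': ensemble.RandomForestClassifier(n_estimators=50, random_state=42),
}

# 3-fold stratified, shuffled cross validation, identical folds for every model.
cv = model_selection.StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_scores = {
    name: model_selection.cross_val_score(model, X_demo, y_demo, cv=cv)
    for name, model in candidates.items()
}

names = list(cv_scores)
means = [cv_scores[n].mean() for n in names]
stds = [cv_scores[n].std() for n in names]

# Bar chart of mean accuracy with one standard deviation as the error bar.
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(names, means, yerr=stds, capsize=4)
ax.set_ylabel('Accuracy (3-fold CV)')
ax.set_title('Cross-validated model comparison')
fig.tight_layout()
```

When the error bars of two models overlap, the ranking between them on a single train/test split is not strong evidence that one is better.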


## Bonus

We have encoded the data using a map that preserves the scale.
Would our results have changed if we had encoded the categorical data using `pd.get_dummies` or `OneHotEncoder` to encode them as binary variables instead?

1. Repeat the analysis for this scenario. Is it better?
2. Experiment with other models or other parameters. Can you beat your classmates' best score?
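
One way to set up the encoding comparison is sketched below. The feature grid reproduces the attribute values of the car dataset, but the target `y_toy` is a made-up rule used only so the snippet runs without `car.csv`; the scores it produces say nothing about the real data. With the real dataframe, only the two encoding steps and the cross-validation call carry over.

```python
import itertools
import pandas as pd
from sklearn import linear_model, model_selection

# Attribute levels, listed from smallest to largest quantity.
levels = {
    'buying': ['low', 'med', 'high', 'vhigh'],
    'maint': ['low', 'med', 'high', 'vhigh'],
    'doors': ['2', '3', '4', '5more'],
    'persons': ['2', '4', 'more'],
    'lug_boot': ['small', 'med', 'big'],
    'safety': ['low', 'med', 'high'],
}
grid = pd.DataFrame(list(itertools.product(*levels.values())),
                    columns=list(levels))

# Made-up target (assumption): "acceptable" unless safety is low or seats are 2.
y_toy = ((grid['safety'] != 'low') & (grid['persons'] != '2')).astype(int)

# Encoding A: ordinal codes that follow the order of `levels`.
X_ordinal = grid.apply(
    lambda col: col.map({v: i for i, v in enumerate(levels[col.name])}))

# Encoding B: one binary column per category value.
X_onehot = pd.get_dummies(grid)

cv = model_selection.StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
encoding_scores = {}
for name, X_enc in [('ordinal', X_ordinal), ('one-hot', X_onehot)]:
    scores = model_selection.cross_val_score(
        linear_model.LogisticRegression(max_iter=1000), X_enc, y_toy, cv=cv)
    encoding_scores[name] = scores.mean()
```

Linear models are usually the most sensitive to this choice, because an ordinal code forces a single slope per feature, while tree-based models can split on either representation.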