<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Evaluating SVM on Multiple Datasets

_Authors: Kiefer Katovich (SF)_

---

In this lab you will explore several datasets, comparing SVM classifiers against logistic regression and kNN classifiers.

Your datasets folder has the following datasets to choose from for the lab:

**Breast cancer**

    ./datasets/breast_cancer_wisconsin

**Spambase**

    ./datasets/spam

**Car evaluation**

    ./datasets/car_evaluation


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display

from sklearn.svm import SVC

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### 1. Load the breast cancer data

- Are there any missing values? Impute or clean if so.
- Select a classification target and predictors.

In [2]:
breast = pd.read_csv('./datasets/breast_cancer_wisconsin/breast_cancer.csv')

In [3]:
breast.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
Sample_code_number             699 non-null int64
Clump_Thickness                699 non-null int64
Uniformity_of_Cell_Size        699 non-null int64
Uniformity_of_Cell_Shape       699 non-null int64
Marginal_Adhesion              699 non-null int64
Single_Epithelial_Cell_Size    699 non-null int64
Bare_Nuclei                    699 non-null object
Bland_Chromatin                699 non-null int64
Normal_Nucleoli                699 non-null int64
Mitoses                        699 non-null int64
Class                          699 non-null int64
dtypes: int64(10), object(1)
memory usage: 60.1+ KB


In [4]:
breast.head()

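| | Sample_code_number | Clump_Thickness | Uniformity_of_Cell_Size | Uniformity_of_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |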


In [5]:
breast.columns = [i.lower() for i in breast.columns]

In [6]:
breast.bare_nuclei = [int(i) for i in breast.bare_nuclei]

ValueError: invalid literal for int() with base 10: '?'

In [7]:
breast[breast.bare_nuclei == '?'].bare_nuclei.count()

16

In [8]:
median_bare_nuclei = np.median([int(i) for i in breast[breast.bare_nuclei != '?'].bare_nuclei])

In [9]:
median_bare_nuclei

1.0

In [10]:
(breast.loc[breast[breast.bare_nuclei.str.isdigit()].index.values, 'bare_nuclei']).astype(int).dtypes

dtype('int64')

In [11]:
median_bare_nuclei = np.median(breast.loc[(breast[breast.bare_nuclei.str.isdigit()].index.values,'bare_nuclei')].astype(int))

In [12]:
median_bare_nuclei

1.0

In [13]:
breast.loc[breast[breast.bare_nuclei == '?'].index.values, 'bare_nuclei'] = median_bare_nuclei

In [14]:
breast.bare_nuclei = breast.bare_nuclei.astype(int)
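
The cleaning steps above can also be condensed with `pd.to_numeric`. The following is a minimal sketch on a small made-up Series (the values are illustrative, not taken from the dataset), assuming `'?'` is the only non-numeric marker in the column:

```python
import pandas as pd

# Toy stand-in for the bare_nuclei column; values are illustrative only.
bare_nuclei = pd.Series(['1', '10', '?', '2', '1', '?', '4'])

# errors='coerce' turns anything that cannot be parsed (here '?') into NaN.
numeric = pd.to_numeric(bare_nuclei, errors='coerce')
n_missing = int(numeric.isnull().sum())

# Impute with the median of the observed values, then cast back to int.
median_value = numeric.median()
cleaned = numeric.fillna(median_value).astype(int)

print(n_missing)
print(median_value)
print(cleaned.tolist())
```

Coercing first has the side benefit that the missing entries can be counted with `isnull()` instead of comparing against the `'?'` string.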

In [15]:
breast.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
sample_code_number             699 non-null int64
clump_thickness                699 non-null int64
uniformity_of_cell_size        699 non-null int64
uniformity_of_cell_shape       699 non-null int64
marginal_adhesion              699 non-null int64
single_epithelial_cell_size    699 non-null int64
bare_nuclei                    699 non-null int64
bland_chromatin                699 non-null int64
normal_nucleoli                699 non-null int64
mitoses                        699 non-null int64
class                          699 non-null int64
dtypes: int64(11)
memory usage: 60.1 KB


In [16]:
breast.describe()

| | sample_code_number | clump_thickness | uniformity_of_cell_size | uniformity_of_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 1071704.0 | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.486409 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 617095.7 | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 3.621929 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 61634.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 870688.5 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 1171710.0 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 1238298.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 13454350.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |


In [17]:
breast['class'].value_counts()

2    458
4    241
Name: class, dtype: int64

In [18]:
breast['class'] = [0 if i == 2 else 1 for i in breast['class']]

# or breast['class'] = breast['class'].map(lambda x: 1 if x == 4 else 0 )

In [19]:
breast['class'].value_counts()

0    458
1    241
Name: class, dtype: int64

In [20]:
breast.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
sample_code_number             699 non-null int64
clump_thickness                699 non-null int64
uniformity_of_cell_size        699 non-null int64
uniformity_of_cell_shape       699 non-null int64
marginal_adhesion              699 non-null int64
single_epithelial_cell_size    699 non-null int64
bare_nuclei                    699 non-null int64
bland_chromatin                699 non-null int64
normal_nucleoli                699 non-null int64
mitoses                        699 non-null int64
class                          699 non-null int64
dtypes: int64(11)
memory usage: 60.1 KB


In [21]:
X = breast.drop(['sample_code_number','class'],axis = 1)
y = breast['class']

### 2. Build an SVM classifier on the data

For details on the SVM classifier, see here:

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

- What's the baseline for the accuracy?
- Initialize and train a linear SVM. What's the average accuracy score with 5-fold cross-validation?
- Repeat using a radial basis function (RBF) kernel. Compare the scores. Which one is better?
- Print a confusion matrix and classification report for your best model using training & testing data.

Classification report:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

Confusion matrix:

```python
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
```
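
As a quick illustration of the `pd.crosstab` call above, here is a minimal sketch on made-up labels (the `y_true` and `y_pred` values are invented for the example). Rows hold the actual classes, columns hold the predicted classes, and `margins=True` adds an `All` row and column with the totals:

```python
import pandas as pd

# Made-up labels for illustration only.
y_true = pd.Series([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = pd.Series([0, 0, 1, 1, 1, 0, 1, 0])

df_confusion = pd.crosstab(y_true, y_pred,
                           rownames=['Actual'], colnames=['Predicted'],
                           margins=True)
print(df_confusion)
```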

In [22]:
breast['class'].value_counts()

0    458
1    241
Name: class, dtype: int64

In [25]:
baseline = breast['class'].value_counts()[0]/float(len(breast['class']))
baseline

0.65522174535050071
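
The same baseline can be cross-checked with scikit-learn's `DummyClassifier`, which always predicts the majority class. This is a minimal sketch that rebuilds the labels from the class counts printed above; `X_demo` and `y_demo` are placeholder names, and the feature matrix is a dummy because this strategy ignores the features:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts from the value_counts() output: 458 of class 0, 241 of class 1.
y_demo = np.array([0] * 458 + [1] * 241)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features, ignored by the strategy

# 'most_frequent' always predicts the majority class.
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_demo, y_demo)
baseline_accuracy = dummy.score(X_demo, y_demo)
print(baseline_accuracy)
```

Any classifier worth keeping should beat this score by a clear margin.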

In [26]:
from sklearn.preprocessing import StandardScaler

In [27]:
ss = StandardScaler()
Xs = ss.fit_transform(X)

In [30]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

In [43]:
accuracy_5_folds_linear = cross_val_score(SVC(kernel = 'linear'), X, y, cv = 5, n_jobs = -1)
accuracy_5_folds_linear

array([ 0.94326241,  0.94285714,  0.97857143,  0.97841727,  0.98561151])

In [44]:
accuracy_5_folds_rbf = cross_val_score(SVC(), X, y, cv = 5, n_jobs=-1)
accuracy_5_folds_rbf

array([ 0.90070922,  0.90714286,  0.96428571,  0.98561151,  0.98561151])

In [56]:
test = np.array((1,2,3))
np.append(test,4)

array([1, 2, 3, 4])

In [58]:
accuracy_5_folds_linear = np.append(accuracy_5_folds_linear, np.mean(accuracy_5_folds_linear))
accuracy_5_folds_rbf = np.append(accuracy_5_folds_rbf, np.mean(accuracy_5_folds_rbf))

In [59]:
compare = pd.DataFrame({'linear':accuracy_5_folds_linear,'rbf':accuracy_5_folds_rbf},index=[0,1,2,3,4,'avg'])
compare

| | linear | rbf |
|---|---|---|
| 0 | 0.943262 | 0.900709 |
| 1 | 0.942857 | 0.907143 |
| 2 | 0.978571 | 0.964286 |
| 3 | 0.978417 | 0.985612 |
| 4 | 0.985612 | 0.985612 |
| avg | 0.965744 | 0.948672 |
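
SVMs are sensitive to feature scale, so one common variation is to put the scaler and the classifier in a pipeline. That way the scaler is refit on the training portion of every fold rather than on the full dataset. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the breast cancer data), so its scores are not comparable to the table above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 9 features, like the predictors above.
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     n_informative=5, random_state=0)

mean_scores = {}
fold_scores = {}
for kernel in ['linear', 'rbf']:
    # Scaling happens inside each fold, so test folds never influence the scaler.
    pipe = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    fold_scores[kernel] = cross_val_score(pipe, X_demo, y_demo, cv=5)
    mean_scores[kernel] = fold_scores[kernel].mean()
    print(kernel, fold_scores[kernel].round(3), round(mean_scores[kernel], 3))
```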




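In [70]:
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size = 0.33, stratify = y)
svm_linear = SVC(kernel = 'linear')
svm_linear.fit(Xs_train, y_train)
y_predict = svm_linear.predict(Xs_test)
# confusion_matrix puts the true classes on the rows and the predicted classes on the columns
pd.DataFrame(confusion_matrix(y_test, y_predict, labels = [0,1]),
             index = ['true_0', 'true_1'], columns = ['pred_0', 'pred_1'])

| | pred_0 | pred_1 |
|---|---|---|
| true_0 | 148 | 3 |
| true_1 | 2 | 78 |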


In [63]:
svm_linear = SVC(kernel = 'linear')
svm_rbf = SVC()

In [79]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

def print_cm_cr(model, Xs, y):
    Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size = 0.33, stratify = y)
    model.fit(Xs_train, y_train)
    y_predict = model.predict(Xs_test)
    
    cr = classification_report(y_test, y_predict)
    # confusion_matrix puts the true classes on the rows and the predicted classes on the columns
    cm = pd.DataFrame(confusion_matrix(y_test, y_predict, labels = [0,1]), index = ['true_0', 'true_1'],
                      columns = ['pred_0','pred_1'])

    print(cr)
    print(cm)

In [81]:
print_cm_cr(svm_linear, Xs, y)

             precision    recall  f1-score   support

          0       0.98      0.99      0.98       151
          1       0.97      0.96      0.97        80

avg / total       0.98      0.98      0.98       231

        pred_0  pred_1
true_0     149       2
true_1       3      77


In [82]:
print_cm_cr(svm_rbf, Xs, y)

             precision    recall  f1-score   support

          0       0.98      0.98      0.98       151
          1       0.96      0.96      0.96        80

avg / total       0.97      0.97      0.97       231

        pred_0  pred_1
true_0     148       3
true_1       3      77


#### 2.2 Are there more false positives or false negatives? Is this good or bad?


In [None]:
# A:
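
One way to count the two error types directly is to unpack the confusion matrix. The sketch below rebuilds labels that reproduce the linear-kernel counts printed earlier. It assumes that rows are actual classes and columns are predicted classes, which is how `confusion_matrix` lays out its result and which agrees with the support column of the classification report (151 and 80):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Labels rebuilt from the linear-kernel counts (assumed layout: rows = actual):
# 149 true negatives, 2 false positives, 3 false negatives, 77 true positives.
y_true = np.array([0] * 151 + [1] * 80)
y_pred = np.array([0] * 149 + [1] * 2 + [0] * 3 + [1] * 77)

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
recall_positive = tp / float(tp + fn)
print(tn, fp, fn, tp)
print(round(recall_positive, 4))
```

Whether the balance is good or bad depends on the cost of each error type. In a screening setting, a false negative (a missed positive case) is usually the more serious mistake.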


### 3. Perform the steps above with a different dataset.

Repeat each step.

In [87]:
# A:
car = pd.read_csv('./datasets/car_evaluation/car.csv')

In [88]:
car.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying           1728 non-null object
maint            1728 non-null object
doors            1728 non-null object
persons          1728 non-null object
lug_boot         1728 non-null object
safety           1728 non-null object
acceptability    1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB


In [89]:
car.head()

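| | buying | maint | doors | persons | lug_boot | safety | acceptability |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |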


In [93]:
car.isnull().sum()

buying           0
maint            0
doors            0
persons          0
lug_boot         0
safety           0
acceptability    0
dtype: int64

In [94]:
car.acceptability.unique()

array(['unacc', 'acc', 'vgood', 'good'], dtype=object)

In [97]:
# Use `in` for membership: comparing a string to a list with `==` is always False
car.acceptability = car.acceptability.map(lambda x: 1 if x in ['vgood','good'] else 0)

In [98]:
import patsy

In [99]:
X = patsy.dmatrix('~ buying + maint + doors + persons + lug_boot + safety -1',
                  data=car, return_type='dataframe')

In [103]:
car.buying.unique()

array(['vhigh', 'high', 'med', 'low'], dtype=object)

In [102]:
X

Unnamed: 0,buying[high],buying[low],buying[med],buying[vhigh],maint[T.low],maint[T.med],maint[T.vhigh],doors[T.3],doors[T.4],doors[T.5more],persons[T.4],persons[T.more],lug_boot[T.med],lug_boot[T.small],safety[T.low],safety[T.med]
0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
5,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
7,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
8,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0
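
An alternative to patsy for the encoding step is `pd.get_dummies`. The following is a minimal sketch on a made-up four-row frame (`car_demo` and its rows are invented for illustration and only mimic the column names above). It builds a binary target with `isin` and one-hot encodes the predictors:

```python
import pandas as pd

# Made-up rows that mimic the car data's columns; not taken from the file.
car_demo = pd.DataFrame({
    'buying': ['vhigh', 'low', 'med', 'low'],
    'safety': ['low', 'high', 'med', 'high'],
    'acceptability': ['unacc', 'vgood', 'acc', 'good'],
})

# Binary target: 1 for 'good' or 'vgood', 0 otherwise.
y_demo = car_demo['acceptability'].isin(['good', 'vgood']).astype(int)

# One-hot encode the predictors; drop_first removes one redundant level per feature.
X_demo = pd.get_dummies(car_demo[['buying', 'safety']], drop_first=True)

print(y_demo.tolist())
print(list(X_demo.columns))
```

Dropping the first level per feature matters for logistic regression without regularization, where fully redundant columns make the coefficients non-unique.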


### 4. Compare SVM, kNN and logistic regression using a dataset.

You should:

- Grid search the optimal parameters for each model (for SVM, just grid search C and kernel).
- Cross-validate scores.
- Examine confusion matrices and classification reports.
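
The grid search step for the SVM might look like the sketch below. It runs on synthetic data from `make_classification` (an assumption, standing in for whichever dataset you choose), and the candidate values for `C` are arbitrary starting points rather than recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())

# make_pipeline names the step 'svc', so its parameters are prefixed 'svc__'.
svm_params = {
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf'],
}

grid = GridSearchCV(pipe, svm_params, cv=5)
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

The same pattern applies to kNN and logistic regression by swapping the estimator and the parameter grid.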

**Bonus:**

Plot "learning curves" for the best model of each type. This is a great way to see how training/testing size affects the scores. Look at the documentation for how to use this function in sklearn.

http://scikit-learn.org/stable/modules/learning_curve.html#learning-curves
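
As a starting point, here is a minimal sketch of `learning_curve` on synthetic data (an assumption, not one of the lab datasets). It prints the mean training and validation scores at each training size instead of plotting them, so the plotting is left to you:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Evaluate the model at 5 training-set sizes, each with 5-fold cross-validation.
train_sizes, train_scores, test_scores = learning_curve(
    SVC(kernel='linear'), X_demo, y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

# Each score array has one row per training size and one column per fold.
for n, tr, te in zip(train_sizes,
                     train_scores.mean(axis=1),
                     test_scores.mean(axis=1)):
    print(n, round(tr, 3), round(te, 3))
```

A large gap between the training and validation scores that narrows as the training size grows suggests the model would benefit from more data.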

In [104]:
# A:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn_params = {
    
}