<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Evaluating SVM on Multiple Datasets

_Authors: Kiefer Katovich (SF)_

---

In this lab you can explore several datasets with SVM classifiers compared to logistic regression and kNN classifiers. 

Your datasets folder has these four datasets to choose from for the lab:

**Breast cancer**

    ./datasets/breast_cancer_wisconsin

**Spambase**

    ./datasets/spam

**Car evaluation**

    ./datasets/car_evaluation


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display

from sklearn.svm import SVC

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### 1. Load the breast cancer data

- Are there any missing values? Impute or clean if so.
- Select a classification target and predictors.

In [2]:
# A:
breast = pd.read_csv('./datasets/breast_cancer_wisconsin/breast_cancer.csv')

breast.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
Sample_code_number             699 non-null int64
Clump_Thickness                699 non-null int64
Uniformity_of_Cell_Size        699 non-null int64
Uniformity_of_Cell_Shape       699 non-null int64
Marginal_Adhesion              699 non-null int64
Single_Epithelial_Cell_Size    699 non-null int64
Bare_Nuclei                    699 non-null object
Bland_Chromatin                699 non-null int64
Normal_Nucleoli                699 non-null int64
Mitoses                        699 non-null int64
Class                          699 non-null int64
dtypes: int64(10), object(1)
memory usage: 60.1+ KB


In [3]:
breast.head()

Unnamed: 0,Sample_code_number,Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


In [4]:
breast.columns = [i.lower() for i in breast.columns]

In [5]:
breast.bare_nuclei.unique()

array(['1', '10', '2', '4', '3', '9', '7', '?', '5', '8', '6'], dtype=object)

In [6]:
breast.loc[breast[breast.bare_nuclei == '?'].index.values, 'bare_nuclei'] = np.median(breast.loc[breast[breast.bare_nuclei.str.isdigit()].index.values, 'bare_nuclei'].astype(int))

In [7]:
breast.bare_nuclei = breast.bare_nuclei.astype(int)

In [8]:
breast.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
sample_code_number             699 non-null int64
clump_thickness                699 non-null int64
uniformity_of_cell_size        699 non-null int64
uniformity_of_cell_shape       699 non-null int64
marginal_adhesion              699 non-null int64
single_epithelial_cell_size    699 non-null int64
bare_nuclei                    699 non-null int64
bland_chromatin                699 non-null int64
normal_nucleoli                699 non-null int64
mitoses                        699 non-null int64
class                          699 non-null int64
dtypes: int64(11)
memory usage: 60.1 KB


In [9]:
breast['class'].value_counts()

2    458
4    241
Name: class, dtype: int64

In [10]:
# class 2 = 1, class 4 = 0
breast['class'] = [1 if i == 2 else 0 for i in breast['class']]
breast['class'].value_counts()

1    458
0    241
Name: class, dtype: int64

In [11]:
y = breast['class']
X = breast[[i for i in breast.columns if i not in ['sample_code_numnber','class']]]

### 2. Build an SVM classifier on the data

For details on the SVM classifier, see here:

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

- What's the baseline for the accuracy?
- Initialize and train a linear svm. What's the average accuracy score with a 5-fold cross validation?
- Repeat using a radial basis function (rbf) classifier. Compare the scores. Which one is better?
- Print a confusion matrix and classification report for your best model using training & testing data.

Classification report:

```python
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
```

Confusion matrix:

```python
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
```

In [12]:
# A:
breast['class'].value_counts()

1    458
0    241
Name: class, dtype: int64

In [13]:
baseline = float(breast['class'].value_counts()[1])/np.sum(breast['class'].value_counts())
baseline

0.6552217453505007

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
clf = SVC(kernel = 'linear', C = 1.0)
clf.fit(X_train, y_train)
print clf.score(X_train, y_train)
print clf.score(X_test, y_test)
print cross_val_score(clf, X, y, cv = 5, n_jobs = -1)

0.681297709924
0.622857142857


In [None]:
clf = SVC()
clf.fit(X_train, y_train)
print clf.score(X_train, y_train)
print clf.score(X_test, y_test)
print cross_val_score(clf, X, y, cv = 5, n_jobs = -1)

#### 2.2 Are there more false positives or false negatives? Is this good or bad?


In [None]:
# A:

### 3. Perform the steps above with a different dataset.

Repeat each step.

In [None]:
# A:

### 4. Compare SVM, kNN and logistic regression using a dataset.

You should:

- Gridsearch optimal parameters for both (for SVM, just gridsearch C and kernel).
- Cross-validate scores.
- Examine confusion matrices and classification reports.

**Bonus:**

Plot "learning curves" for the best models of each. This is a great way see how training/testing size affects the scores. Look at the documentation for how to use this function in sklearn.

http://scikit-learn.org/stable/modules/learning_curve.html#learning-curves

In [None]:
# A: