# Support Vector Machines Lab

In this lab we will explore several datasets with SVMs. The assets folder contains several datasets (in order of complexity):

1. Breast cancer

For each of these a `.names` file is provided with details on the origin of data.

In [66]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline



In [42]:
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
X.head()
y = data.target
y_as_df = pd.DataFrame(data.target, columns=['benign'])
train, test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

# Exercise 1: Breast Cancer



## 1.a: Load the Data
- Are there any missing values? (how are they encoded? do we impute them?)
- Are the features categorical or numerical?
- Are the values normalized?
- How many classes are there in the target?

Perform what's necessary to get to a point where you have a feature matrix `X` and a target vector `y`, both with only numerical entries.
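
The questions above can be answered with a few pandas and NumPy checks. The following is a minimal sketch rather than part of the original lab; the names `n_missing`, `all_numeric`, and `std_ratio` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Total number of NaN cells across the whole feature matrix
n_missing = int(X.isnull().sum().sum())

# True only if every column has a numeric dtype
all_numeric = all(np.issubdtype(dt, np.number) for dt in X.dtypes)

# Distinct target values and how often each occurs
classes, counts = np.unique(y, return_counts=True)
class_counts = {str(name): int(c) for name, c in zip(data.target_names, counts)}

# Ratio of the largest to the smallest feature standard deviation;
# a value far from 1 indicates the features are not on a common scale
std_ratio = float(X.std().max() / X.std().min())

print("missing values:", n_missing)
print("all numeric:", all_numeric)
print("class counts:", class_counts)
print("largest/smallest feature std:", std_ratio)
```

A large `std_ratio` is one quick signal that the features have not been normalized, which matters for the kernel comparisons in the next section.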

In [43]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
mean radius                569 non-null float64
mean texture               569 non-null float64
mean perimeter             569 non-null float64
mean area                  569 non-null float64
mean smoothness            569 non-null float64
mean compactness           569 non-null float64
mean concavity             569 non-null float64
mean concave points        569 non-null float64
mean symmetry              569 non-null float64
mean fractal dimension     569 non-null float64
radius error               569 non-null float64
texture error              569 non-null float64
perimeter error            569 non-null float64
area error                 569 non-null float64
smoothness error           569 non-null float64
compactness error          569 non-null float64
concavity error            569 non-null float64
concave points error       569 non-null float64
symmetry error             569 

## 1.b: Model Building

- What's the baseline for the accuracy?
- Initialize and train a linear svm. What's the average accuracy score with a 3-fold cross validation?
- Repeat using an rbf classifier. Compare the scores. Which one is better?
- Are your features normalized? if not, try normalizing and repeat the test. Does the score improve?
- What's the best model?
- Print a confusion matrix and classification report for your best model using:
        train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

**Check:** To decide which model is best, look at the average cross-validation score. Are the scores significantly different from one another?

In [70]:
#baseline accuracy
y_as_df.sum()/float(len(y_as_df))

benign    0.627417
dtype: float64
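
The baseline above is the share of the majority class. As a hedged alternative, the same number can be obtained through the cross-validation machinery with scikit-learn's `DummyClassifier`, so that it is directly comparable to the SVM scores. The names `baseline` and `baseline_scores` are assumptions for this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Always predicts the most frequent class seen in the training folds
baseline = DummyClassifier(strategy="most_frequent")

# Integer cv with a classifier uses stratified folds, so each fold
# keeps roughly the same class balance as the full dataset
baseline_scores = cross_val_score(baseline, X, y, cv=3)
print(baseline_scores.mean())
```

The mean should land close to the 0.627 computed above; any real model has to beat this to be useful.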

In [71]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

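model = SVC(kernel='linear')
# 3-fold cross validation on the training split; cv is set explicitly
# because the default number of folds differs between scikit-learn versions
print(np.mean(cross_val_score(model, X=train, y=y_train, cv=3)))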

0.952918971066


In [72]:
model = SVC(kernel='rbf')
# Same 3-fold cross validation, now with the rbf kernel
print(np.mean(cross_val_score(model, X=train, y=y_train, cv=3)))

0.963294627234


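Linear is better than rbf on the unscaled features. (The two scores printed above are identical to the scaled-data scores further down, which suggests these cells were re-run after `train` had already been standardized in place, so they do not show the unscaled comparison.)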

In [73]:
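# Fit the scaler on the training split only and keep it, so the same
# transformation can later be applied to the test split
scaler = StandardScaler().fit(train)
train = scaler.transform(train)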

In [74]:
model = SVC(kernel='linear')
# 3-fold cross validation on the standardized training split
print(np.mean(cross_val_score(model, X=train, y=y_train, cv=3)))

0.952918971066


In [75]:
model = SVC(kernel='rbf')
# 3-fold cross validation on the standardized training split
print(np.mean(cross_val_score(model, X=train, y=y_train, cv=3)))

0.963294627234


With normalization, the rbf model is the best (0.963 vs. 0.953 for linear).
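
Scaling the whole training set before cross validation lets each validation fold influence the scaler. A common way around this, shown here as a sketch and not as the lab's own method, is to wrap the scaler and the classifier in a pipeline so the scaler is refit inside every fold. The names `scaled_rbf` and `unscaled_rbf` are made up for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The pipeline standardizes using statistics from the training folds only
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
unscaled_rbf = SVC(kernel="rbf")

scaled_scores = cross_val_score(scaled_rbf, X, y, cv=3)
unscaled_scores = cross_val_score(unscaled_rbf, X, y, cv=3)

print("scaled rbf:   %.3f" % scaled_scores.mean())
print("unscaled rbf: %.3f" % unscaled_scores.mean())
```

Because the raw data is never modified, both variants can be compared in the same run without worrying about which cells were executed first.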

In [76]:
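# Note: this re-splits the (scaled) training set and overwrites X and y
# with the smaller subset, which later cells then use
X, X_test, y, y_test = train_test_split(train, y_train, stratify=y_train, test_size=0.33, random_state=42)
model.fit(X, y)
tn, fp, fn, tp = confusion_matrix(y_true=y_test, y_pred=model.predict(X_test)).ravel()
tn, fp, fn, tp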

(42, 5, 0, 79)

**Check:** Are there more false positives or false negatives? Is this good or bad?

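There are more false positives (5) than false negatives (0). Whether that is good depends on which class counts as positive. In clinical terms, where positive means "has cancer", this would be the preferable kind of error: it is better for someone to think they have cancer when they actually don't than for someone to think they don't have cancer when they actually do. A false positive leads to further investigation, whereas a false negative leaves a cancer undetected.

Note, however, that this dataset encodes benign as 1 (see the `benign` column above), so scikit-learn treats benign as the positive class. The 5 "false positives" here are therefore malignant tumors predicted as benign, which is the costly kind of error.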
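
When reading `tn, fp, fn, tp`, it helps to check which label scikit-learn treats as positive. The sketch below uses made-up toy labels (not results from this lab) to show the ordering: `load_breast_cancer` encodes malignant as 0 and benign as 1, so label 1 (benign) is the positive class.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix

data = load_breast_cancer()
target_names = [str(n) for n in data.target_names]
print(target_names)  # index 0 is 'malignant', index 1 is 'benign'

# Toy labels, invented for illustration: one malignant case (0)
# is wrongly predicted as benign (1)
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])

# Rows are true labels, columns are predictions, in the order given by `labels`
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)  # 1 1 0 2
```

The malignant case predicted as benign shows up in `fp`, so with this encoding a false positive is a missed cancer.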


## 1.c: Grid Search

Use grid search (`GridSearchCV`) to explore different kernels and values for the `C` parameter.

- Can you improve on your best previous score?
- Print the best parameters and the best score

In [88]:
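from sklearn.model_selection import GridSearchCV, StratifiedKFold

parameters = {'kernel': ('linear', 'rbf'), 'C': [0.01, 1, 100]}

# StratifiedKFold from sklearn.model_selection takes the number of splits;
# the labels are supplied when fit() is called
model = GridSearchCV(SVC(), parameters, cv=StratifiedKFold(n_splits=5))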
model.fit(X, y)
print("Best Params:", model.best_params_)
print("Best Score:", model.best_score_)

model.best_estimator_

('Best Params:', {'kernel': 'linear', 'C': 0.01})
('Best Score:', 0.9725490196078431)


SVC(C=0.01, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
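
To judge whether the best combination is meaningfully better than the others, the per-combination scores in `cv_results_` can be laid out as a table. This is a sketch under assumptions: it searches over a scaler-plus-SVC pipeline on the full dataset (which also keeps the linear kernel fast), so its numbers will not match the output above, and the names `pipe`, `param_grid`, and `search` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# make_pipeline names the SVC step "svc", hence the "svc__" parameter prefix
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [0.01, 1, 100]}

search = GridSearchCV(pipe, param_grid, cv=StratifiedKFold(n_splits=5))
search.fit(X, y)

# One row per parameter combination, best mean score first
results = pd.DataFrame(search.cv_results_)
cols = ["param_svc__kernel", "param_svc__C", "mean_test_score", "std_test_score"]
print(results[cols].sort_values("mean_test_score", ascending=False))
```

Comparing `mean_test_score` differences against `std_test_score` gives a rough sense of whether two settings are distinguishable on this dataset.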

# Exercise 2
Now let's encapsulate a few things into functions so that it's easier to repeat the analysis.

## 2.a: Cross Validation
Implement a function `do_cv(model, X, y, cv)` that does the following:
- Calculates the cross validation scores
- Prints the model
- Prints and returns the mean and the standard deviation of the cross validation scores

> Answer: see the `do_cv` implementation below.

## OPTIONAL
## 2.b: Confusion Matrix and Classification report
Implement a function `do_cm_cr(model, X, y, names)` that automates the following:
- Split the data using `train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)`
- Fit the model
- Print the confusion matrix and classification report in a nice format

**Hint:** names is the list of target classes


In [89]:
def do_cv(model, X, y, cv):
    scores = cross_val_score(model, X, y, cv=cv)
    print(model)
    sm = scores.mean()
    ss = scores.std()
    res = (sm, ss)
    print("Average score: {:0.3}+/-{:0.3}".format(*res))
    # {:0.3} formats each value with 3 significant digits
    # str.format is like the % string operator, but with more flexible format specs

    return res


from sklearn.metrics import classification_report

def do_cm_cr(model, X, y, names):
    # Hold out a stratified third of the data for evaluation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.33, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Rows are true classes, columns are predicted classes
    cm = pd.DataFrame(confusion_matrix(y_test, y_pred), index=names, columns=names)
    print(cm)
    print(classification_report(y_test, y_pred, target_names=names))
