# <center>Breast cancer biopsy diagnosis using Support Vector Machine classifier</center>


<center>Histopathology images of cell nuclei. Biopsy samples taken from breast tissue using FNA.</center>


<img src="FNA_breast_tissue.jpeg"/>




<font size=2 color=grey>Sizilio, Glaucia & Leite, Cicilia & Mg Guerreiro, Ana & Neto, Adriao Duarte. (2012). Fuzzy method for pre-diagnosis of breast cancer from the Fine Needle Aspirate analysis. Biomedical engineering online. 11. 83. </font>

# <font color=blue>Aim:</font>

We want a model to learn to predict **Benign** or **Malignant** for histopathology breast tumour biopsies.





# <font color=blue>Dataset:</font>

Dataset located at: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic%29

Dataset contains 10 measurements (_features_) giving info about the cell nuclei obtained from the biopsy:

    a) radius (mean of distances from center to points on the perimeter) 
    b) texture (standard deviation of gray-scale values) 
    c) perimeter 
    d) area 
    e) smoothness (local variation in radius lengths) 
    f) compactness (perimeter^2 / area - 1.0) 
    g) concavity (severity of concave portions of the contour) 
    h) concave points (number of concave portions of the contour) 
    i) symmetry 
    j) fractal dimension ("coastline approximation" - 1)
    
Each of the 10 measurements is summarised per image by three statistics (mean, standard error, and "worst", i.e. the mean of the three largest values), giving a total of 30 features.
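
As a rough illustration of how the 30 column names arise, here is a minimal sketch. It assumes the naming used by scikit-learn's copy of the dataset (`mean radius`, `radius error`, `worst radius`); the variable names `measurements` and `feature_names` are only illustrative.

```python
# Ten base measurements taken for each cell nucleus.
measurements = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]

# Three summary statistics per measurement give 10 x 3 = 30 features.
feature_names = (
    [f"mean {m}" for m in measurements]
    + [f"{m} error" for m in measurements]
    + [f"worst {m}" for m in measurements]
)

print(len(feature_names))
print(feature_names[:3])
```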

All of the examples in the dataset are _labelled_ with the "correct" diagnosis: B for benign and M for malignant.

# <font color=blue>Data science approach:</font>

    1) Import libraries
    2) Import data
    3) Exploratory data analysis
    4) Create model
    5) Evaluate model
    Go back to steps 3 or 4

-----------

## 1) Import libraries

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## 2) Get data

In [None]:
from sklearn.datasets import load_breast_cancer

In [None]:
cancer = load_breast_cancer()

## 3) Exploratory data analysis

### 3.1) What format does our dataset currently exist in 

In [None]:
type(cancer)

From the above we see that the sklearn version of the dataset is a `Bunch` object, a 'dictionary-like' container whose entries can be read by key or as attributes.


### 3.2) What features (column headings) have we got?

We can see parent level elements using .keys()... 

In [None]:
cancer.keys()

We can access the value stored under a key using dictionary-style indexing...

In [None]:
cancer['target_names']

Alternatively, we can use attribute access (possible when the key contains no spaces)...

In [None]:
cancer.target_names

Let's look at the dataset by means of the keys...

In [None]:
cancer.keys() 

We wrap each key with print() function to display info nicely...

In [None]:
print(cancer['DESCR']) # long output; in Jupyter, Shift+O toggles output scrolling

In [None]:
print(cancer['data'])

In [None]:
print(cancer['feature_names'])

In [None]:
print(cancer['target']) # malignant = 0, benign = 1

In [None]:
print(cancer['target_names'])

From the above we would assume that malignant = 0 and benign = 1.  (This info isn't explicitly stated in the dataset description.)  Let's verify.  We know from the description that there are 357 benign cases.  If benign = 1, then summing the target values will give us 357.  Let's confirm this is the case...

In [None]:
sum(cancer.target)
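
The same check can be made for both classes at once. The sketch below counts each label and pairs the count with the class name at the same index; `class_counts` is an illustrative name, not part of the dataset.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# np.bincount counts how many times each integer label (0, 1) occurs.
counts = np.bincount(cancer.target)

# Pair each count with the class name stored at the same index.
class_counts = {str(name): int(n) for name, n in zip(cancer.target_names, counts)}
print(class_counts)
```

If label 1 is benign, the benign count here should match the 357 quoted in the description.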

### 3.3) Put data into dataframe format

Let's create a dataframe for the dataset feature variables (leaving out the target data) ...

In [None]:
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names']) 

### 3.4) Look at structure of data

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.info()

From the above, we can verify: 

    There are 30 features.
    All are floats.
    There are no missing values.
    There are no NaNs.
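
These observations can also be checked programmatically rather than by eye. A minimal sketch (the variable names are illustrative):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer["data"], columns=cancer["feature_names"])

n_rows, n_features = df.shape

# Total number of missing entries (NaN) across the whole dataframe.
n_missing = int(df.isna().sum().sum())

# True only if every column has a floating-point dtype.
all_float = all(pd.api.types.is_float_dtype(t) for t in df.dtypes)

print(n_rows, n_features, n_missing, all_float)
```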

In [None]:
df.describe()

Looking at the above, the features span very different ranges, so we might ask whether some of them should be rescaled (normalised) to help the algorithm converge.  However, for now we'll take the simplest approach and just use the data 'as is'.
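
To put a number on how different the feature scales are, one option is to compare standard deviations. This is only a sketch; `spread` and `ratio` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer["data"], columns=cancer["feature_names"])

# Standard deviation of each feature, sorted from largest to smallest.
spread = df.std().sort_values(ascending=False)
print(spread.head(3))
print(spread.tail(3))

# Ratio between the most and least spread-out features.
ratio = spread.iloc[0] / spread.iloc[-1]
print(f"largest / smallest standard deviation: {ratio:.0f}")
```

A large ratio means distance-based methods such as an RBF SVM will be dominated by the features with the largest numeric range (the area features here).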

In [None]:
data_and_labels_df = df.assign(Benign = cancer['target']) # make a new df that has the labels
data_and_labels_df.head()

In [None]:
g = sns.pairplot(data_and_labels_df, vars=['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension'], hue='Benign')

In [None]:
coeffs = df.corr() # Find out correlation coefficients
coeffs
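
A 30 x 30 table is hard to read, so it can help to list the most strongly correlated pairs directly. A minimal sketch, assuming the same `df` as above (rebuilt here so the cell runs on its own):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer["data"], columns=cancer["feature_names"])
coeffs = df.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(coeffs.shape, dtype=bool), k=1)

# Masked-out entries become NaN; stack to a (feature, feature) -> value Series.
pairs = coeffs.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs.head(5))
```

Pairs such as radius, perimeter and area are expected to appear near the top, since they are geometrically related.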

In [None]:
plt.rcParams['figure.figsize']=14,10
ax = sns.heatmap(coeffs, annot=False)


In [None]:
plt.rcParams['figure.figsize']=6,6
# Bivariate density of mean area vs mean perimeter.
# Recent seaborn requires keyword x=/y= and uses fill= (older releases used positional args and shade=).
sns.kdeplot(data=data_and_labels_df, x='mean area', y='mean perimeter', fill=True, cmap='plasma')

In [None]:
sns.jointplot(data=data_and_labels_df, x='mean area', y='mean perimeter', kind='kde', color='red')

In [None]:
# lmplot creates its own figure, so its size is set with height= rather than plt.figure()
sns.lmplot(x='mean area', y='mean perimeter', hue='Benign', data=data_and_labels_df, fit_reg=False, scatter_kws={'alpha':0.1}, height=8)

In [None]:
# lmplot creates its own figure, so its size is set with height= rather than plt.figure()
sns.lmplot(x='mean texture', y='mean perimeter', hue='Benign', data=data_and_labels_df, fit_reg=False, scatter_kws={'alpha':0.2}, height=8)

## 4) Create model 

### 4.1) Pull data into suitable structure for building model

Let's create our X matrix and y target data...

In [None]:
X = df # We know this is a dataframe
y = cancer.target # Let's see what type this is
type(y)

### 4.2) Split data into separate training and test datasets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
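
It is worth checking how many samples land in each subset and whether the class balance is similar in both. The sketch below repeats the split above and, as an optional variant that this notebook does not use, shows `stratify=y`, which keeps the class ratio nearly identical in train and test.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same split as in the notebook: 30% held out, fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)
print(X_train.shape, X_test.shape)
print("benign fraction, train:", y_train.mean().round(3))
print("benign fraction, test: ", y_test.mean().round(3))

# Optional variant (not used in the notebook): stratify on the labels.
_, _, ys_train, ys_test = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)
print("stratified benign fraction, train/test:",
      ys_train.mean().round(3), ys_test.mean().round(3))
```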

### <font color=green>4.3) Call our machine learning model, in this case Support Vector Machine (SVM)</font>

<font color=green>Import the support vector machine class and instantiate it...</font>

In [None]:
from sklearn.svm import SVC
classifier = SVC()

### <font color=green>4.4) Train model</font>

<font color=green>Fit machine learning model to training data...</font>

In [None]:
classifier.fit(X_train, y_train)

## 5) Evaluate model

### 5.1) Make predictions on test set (model has not seen this data yet)



In [None]:
y_pred = classifier.predict(X_test)

In [None]:
y_pred

### 5.2) See how well predictions match our known results

Call tools...

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

Assess performance using confusion matrix...

In [None]:
print(confusion_matrix(y_test, y_pred)) # col headings malignant (Class 0) benign (Class 1)

From the confusion matrix, we note the following:

    There are 66 actual class 0 (malignant) 
    There are 105 actual class 1 (benign)
    
It seems that our SVM is classifying everything as benign.  Taking benign as the positive class, all actual class 0 (malignant) samples are coming out as false positives.  Let's look at the precision and recall...

Assess performance using Precision, Recall and F1 score metrics...

In [None]:
print(classification_report(y_test, y_pred))

As expected, for class 1 (benign) recall = 1 (100%) because the SVM is finding _all_ the benign cases, but it only does this by classifying every sample as benign, i.e. it is also pulling in _all_ malignant cases (lots of false positives), so its precision for this class is poor.

Recall for class 0 (malignant) is zero because the SVM is unable to find any of these cases.

Obviously this is a very serious failing for a cancer detection system.  Where positive corresponds to benign, we would prefer false negatives (incorrectly diagnosing benign as malignant) to false positives (incorrectly classifying malignant cancers as benign, i.e. failing to spot malignant cancer).
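
To make the link between the confusion matrix and these metrics explicit, here is a small worked sketch. The matrix is the one implied by the text (66 malignant and 105 benign samples, all predicted benign); `per_class_metrics` is a hypothetical helper written for this illustration, not an sklearn function.

```python
import numpy as np

# Rows = actual class, columns = predicted class,
# class order [malignant (0), benign (1)]; every sample is predicted benign.
cm = np.array([[0, 66],
               [0, 105]])

def per_class_metrics(cm):
    """Return (precision, recall) arrays with one entry per class."""
    true_pos = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual = cm.sum(axis=1)     # row sums: how often each class really occurs
    # Define precision as 0 when a class is never predicted (avoids 0/0).
    precision = np.divide(true_pos, predicted,
                          out=np.zeros_like(true_pos), where=predicted > 0)
    recall = true_pos / actual
    return precision, recall

precision, recall = per_class_metrics(cm)
print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
```

Benign precision is 105 / (105 + 66), roughly 0.61, which is simply the share of benign samples in the test set.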

We need to adjust the parameters of the SVM (and perhaps consider normalising data).


--------------

# <font color=blue>Data science is an exploratory and iterative process</font><br>

## <font color=blue> -  Standard process is:   First, establish a functional model</font> 

## <font color=blue> -  This will give you a steer on what to focus on next</font>

---------------

Next steps, in this case... search for the best model hyperparameters using sklearn's GridSearchCV

## 4 +) Create model 

In [None]:
from sklearn.model_selection import GridSearchCV

GridSearchCV takes a dict (or a list of dicts).
The dict contains parameter names as keys and lists of candidate settings as values.
GridSearchCV then trains an SVM for every combination of these hyperparameter values.

So...
keys = params, values = list of settings to test.

We know the parameters from looking at help for our instantiation call: classifier = SVC() 

Here we will try a Grid based on a range of C and gamma values.

C in SVM is the penalty for misclassifying training examples. A high C value corresponds to a strict (narrow) margin.  Relaxing it may help. <br>
**Large C = low bias, high variance**, so the model is fitted tightly to the training data (overfitted) and may not generalise to test data.

Gamma is the kernel coefficient (or it is when the kernel is the default, the radial basis function).  It defines how far the influence of a single training example reaches: low values mean 'far' and high values mean 'close'.  Intuitively, if gamma is high then each example influences only its immediate neighbourhood, so the decision boundary can wrap around individual points and there is potential for overfitting (high variance).  Thus gamma works in the same direction as C: <br> 
**Large gamma = low bias, high variance** (overfitting); **small gamma = high bias, low variance**, i.e. a smoother boundary that is biased towards the model rather than the data.
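
The role of gamma is easiest to see from the RBF kernel formula itself, K(x, x') = exp(-gamma * ||x - x'||^2). The sketch below evaluates it for two made-up points one unit apart; `rbf_kernel_value` is an illustrative helper, not the sklearn implementation.

```python
import numpy as np

def rbf_kernel_value(x, x_prime, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Two hypothetical points one unit apart.
a, b = [0.0, 0.0], [1.0, 0.0]

for gamma in [0.0001, 0.01, 1, 100]:
    print(f"gamma={gamma:<7} similarity={rbf_kernel_value(a, b, gamma):.4f}")
```

The similarity falls as gamma grows, so with a large gamma a training example only affects points very close to it. On unscaled data, where distances between samples are large, even a moderate gamma can make every pair of samples look dissimilar.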

Let's define a parameter grid to explore the function of the SVM given a range of C and gamma parameters...

In [None]:
param_grid ={'C':[0.1, 1, 10, 100, 1000],
            'gamma':[1, 0.1, 0.01, 0.001, 0.0001]}
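
To get a feel for how much work this grid implies, we can enumerate the combinations by hand. This is a sketch using only the standard library; the fold count `n_folds = 5` is an assumption (it is the default in recent scikit-learn releases, while older releases used 3).

```python
from itertools import product

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

# Every (C, gamma) pair that the grid search will try.
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]
print(len(combinations))
print(combinations[:2])

# Assumed number of cross-validation folds (see note above).
n_folds = 5
print("model fits:", len(combinations) * n_folds)
```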

Instantiate a GridSearch object with our classifier and parameter grid...

In [None]:
grid = GridSearchCV(SVC(), param_grid, verbose=3)

### 4.1 +) Train (multiple) models

Train multiple SVMs...

In [None]:
grid.fit(X_train, y_train)

Let's see what the grid search believes are the best hyperparameter values...

In [None]:
grid.best_params_ # Here C = 10 (default is 1) and gamma = 0.0001 (the older default 'auto' is 1/n_features; recent scikit-learn defaults to 'scale')

We can also pull out the best score (GridSearchCV records a mean cross-validated score, accuracy by default for classifiers, for each combination in the grid)...

In [None]:
grid.best_score_

So this is what the classifier (estimator) looks like...

In [None]:
grid.best_estimator_  

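_grid_ itself can be used as our new (tuned) classifier.  By default (`refit=True`) it retrains the highest scoring combination on the whole training set, and calls such as `grid.predict` are passed on to that best estimator.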
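
The grid search also keeps the scores of every combination in `cv_results_`, which is useful for seeing how close the runners-up are. The sketch below rebuilds the search so that it runs on its own (without `verbose`) and tabulates the results; the selected column names are standard `cv_results_` keys.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}
grid = GridSearchCV(SVC(), param_grid)
grid.fit(X_train, y_train)

# One row per (C, gamma) combination, best mean cross-validation score first.
columns = ['param_C', 'param_gamma', 'mean_test_score',
           'std_test_score', 'rank_test_score']
results = pd.DataFrame(grid.cv_results_)[columns]
results = results.sort_values('rank_test_score').reset_index(drop=True)
print(results.head(3))
```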

## 5 +) Re-evaluation - based on our gridsearch tuned classifier

In [None]:
grid_pred = grid.predict(X_test)

In [None]:
grid_pred

In [None]:
print(confusion_matrix(y_test, grid_pred))

We see a marked improvement comparing the above confusion matrix to our previous one. 9 out of 171 samples are misclassified (~5% misclassification)

As before, we note:

    There are 60 + 6 = 66 actual class 0 (malignant)
    There are 3 + 102 = 105 actual class 1 (benign)

However now we see that our SVM correctly finds 60 of the malignant tumours, a big improvement over our earlier SVM but unfortunately it classifies 6 malignant tumours as benign (false positives).  For cancer detection, this would be a worrying performance issue.

There are 3 false negatives, which, given our labeling, corresponds to: benigns classed as malignant.  This would be acceptable in cancer detection. 

In [None]:
print( classification_report(y_test, grid_pred))

**Class 0 = Malignant** <br>
**Class 1 = Benign**

For class 0 (malignant), we have a recall of 0.91.  Whilst this is an improvement over the earlier SVM, maximising this metric would be a priority in cancer detection.  Precision for this class is high (0.95): only a few benigns have been classed as malignant.

For class 1 (benign), we have a recall of 0.97.  Most of the benigns are found.  Unfortunately, as mentioned, **we are also pulling in some malignants here**, hence our precision is not as high as we would like.  We would aspire to precision = 1 here (and recall = 1 for class 0), an aspiration which may require us to accept more benigns classed as malignant.
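
One way to explore this trade-off is to move the decision threshold so that a sample needs a larger score before it is called benign. The sketch below is an assumption-laden illustration, not part of the original analysis: it uses C = 10 and gamma = 0.0001 (the values noted in the grid-search comment) and some arbitrary thresholds, and in practice a threshold should be chosen on validation data rather than on the test set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Hyperparameters assumed for illustration; substitute grid.best_params_ in practice.
clf = SVC(C=10, gamma=0.0001).fit(X_train, y_train)

# Signed distance from the decision boundary; positive means "benign" (class 1).
scores = clf.decision_function(X_test)

recalls = {}
for threshold in [0.0, 0.5, 1.0]:
    # Demand a larger score before calling a sample benign.
    y_pred = (scores > threshold).astype(int)
    malignant_recall = recall_score(y_test, y_pred, pos_label=0)
    benign_recall = recall_score(y_test, y_pred, pos_label=1)
    recalls[threshold] = (malignant_recall, benign_recall)
    print(f"threshold={threshold}: malignant recall={malignant_recall:.2f}, "
          f"benign recall={benign_recall:.2f}")
```

Raising the threshold can only keep or increase malignant recall, at the cost of keeping or lowering benign recall.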

# Next iteration

- Further work to improve the recall for class 0 based on revisiting the second and third best SVMs from the grid.  (The grid's scoring system may have different priorities to our requirements.)  

- Test effect of scaling the features. 
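
As a starting point for the scaling experiment, a minimal sketch is given below. It assumes standardisation (zero mean, unit variance) via `StandardScaler` and an SVC with default hyperparameters; other scalers or a tuned SVC may behave differently.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=101
)

# The pipeline learns the scaling from the training data only, then applies
# the same transformation to the test data before prediction.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_train, y_train)

scaled_pred = scaled_svm.predict(X_test)
scaled_accuracy = scaled_svm.score(X_test, y_test)
print(classification_report(y_test, scaled_pred,
                            target_names=cancer.target_names))
```

Wrapping the scaler and classifier in one pipeline also means it can be passed to GridSearchCV directly, so that scaling is refitted inside each cross-validation fold.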

## --> Return to step 3