<a href="https://colab.research.google.com/github/tillaczel/Machine-learning-workshop/blob/master/Cancer_model_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting probability of cancer with k-fold cross valiadtion and f1 score
This notebook builds upon the previus excercise (https://github.com/tillaczel/Machine-learning-workshop/blob/master/Cancer_excercise.ipynb). It is extended with k-fold cross valiadtion and the f1 score

## Install and import
First let's upgrade tensorflow to 2.0, then import all the nescecary libraries.

In [1]:
!pip install tensorflow --upgrade

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.utils import shuffle

Requirement already up-to-date: tensorflow in /usr/local/lib/python3.6/dist-packages (2.0.0)


## Importing and understanding the dataset

We are using the breast cancer dataset from sklearn.
The description of the dataset is printed out.

In [2]:
dataset = load_breast_cancer()
print(dataset.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

## K-fold cross validation

To avoid bias in the training data, the data needs to be shuffled before the train split. In k-fold cross validation the data is split into k segments. For each segment the model is trained on the rest and evaulated on the segment.

In [0]:
x, y = shuffle(dataset.data, dataset.target, random_state=1)

def k_fold(k, i, x, y):
    valid_start_i = int(len(y)/k*i)
    valid_end_i = int(len(y)/k*(i+1))
    x_train = np.concatenate((x[:valid_start_i], x[valid_end_i:]), axis=0)
    y_train = np.concatenate((y[:valid_start_i], y[valid_end_i:]))
    x_test = x[valid_start_i:valid_end_i]
    y_test = y[valid_start_i:valid_end_i]

    mean = np.mean(x_train)
    std = np.std(x_train)

    x_train_norm, x_test_norm = (x_train-mean)/(std+1e-6), (x_test-mean)/(std+1e-6)

    return x_train_norm, y_train, x_test_norm, y_test

## Building model

Complete the build_model() function!

In [0]:
def build_model(x_train_norm, y_train,x_test_norm, y_test):
    model = Sequential()
    model.add(Dense(128, input_dim=30, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer='sgd',
                loss='mse',
                metrics=['accuracy'])

    history = model.fit(x_train_norm, y_train, validation_data=(x_test_norm, y_test), epochs=10, batch_size=32)
    return model, history

## Training

In [5]:
k = 5

accuracy = np.zeros((k))
precision = np.zeros((k))
recall = np.zeros((k))
f1_score = np.zeros((k))

for i in range(k):
    print(f'Iteration {i} from {k}.')

    x_train_norm, y_train, x_test_norm, y_test = k_fold(k, i, x, y)

    model, history = build_model(x_train_norm, y_train,x_test_norm, y_test)
    decision_boundary = 0.5
    prediction = np.round(model.predict(x_test_norm)[:,0]+0.5-decision_boundary,0).astype(int)

    accuracy[i] = (np.sum(np.multiply(prediction==1, y_test==1))+np.sum(np.multiply(prediction==0, y_test==0)))/len(y_test)
    precision[i] = np.sum(np.multiply(prediction==1, y_test==1))/np.sum(prediction==1)
    recall[i] = np.sum(np.multiply(prediction==0, y_test==0))/np.sum(prediction==0)
    f1_score[i] = 2*(precision[i]*recall[i])/(precision[i]+recall[i])

Iteration 0 from 5.
Train on 456 samples, validate on 113 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Iteration 1 from 5.
Train on 455 samples, validate on 114 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Iteration 2 from 5.
Train on 455 samples, validate on 114 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Iteration 3 from 5.
Train on 455 samples, validate on 114 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Iteration 4 from 5.
Train on 455 samples, validate on 114 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
accuracy: 0.9015836050302749
precision: 0.9141062891360688
recall: 0.8893052475979306
f1_score: 0.8978544465401026


## Validation

The accuracy is the percentage of correctly classified datapoints. If the dataset is illbalanced it is not a good metric, because a model, which always predict one class can achieve high value.

The precision is the percentage of true positive in all positive.
The recall is the percentage of true negative in all negative.

The f1 score combines the precision and recall into one score. It has its maximum, when both precision and recall is 1. If either of those decresaes it decreases as well. Other functions can be defined to combine the two metrics.

In [0]:
print(f'accuracy: {np.mean(accuracy)}')
print(f'precision: {np.mean(precision)}')
print(f'recall: {np.mean(recall)}')
print(f'f1_score: {np.mean(f1_score)}')