<a href="https://colab.research.google.com/github/tillaczel/Machine-learning-workshop/blob/resturcture/Keras_basics/Cancer_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting probability of cancer
In this notebook, your task is to build a model that predicts the probability of cancer for a given sample.
All code is complete except the build_model() function.

## Install and import
First, let's upgrade TensorFlow to version 2, then import all the necessary libraries.

In [0]:
!pip install tensorflow --upgrade

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.utils import shuffle

## Importing and understanding the dataset

We use the breast cancer dataset that ships with scikit-learn.
Its description is printed below.

In [0]:
dataset = load_breast_cancer()
print(dataset.DESCR)

## Prepare data

To avoid ordering bias in the training data, the data is shuffled before the train/validation split. We use 80% of the samples for training and the remaining 20% for validation.

In [0]:
x, y = shuffle(dataset.data, dataset.target, random_state=1)

train_ratio = 0.8
x_train, y_train = x[:int(train_ratio*len(x))], y[:int(train_ratio*len(y))]
x_test, y_test = x[int(train_ratio*len(x)):], y[int(train_ratio*len(y)):]
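
As a quick sanity check on the slicing above, the following minimal sketch applies the same ratio-based split to a toy array. The helper name `split_by_ratio` and the zero-filled toy arrays are illustrative assumptions, not part of the notebook; the toy arrays only mimic the shape of the breast cancer data (569 samples, 30 features).

```python
import numpy as np

def split_by_ratio(x, y, train_ratio):
    """Split already-shuffled arrays into a leading train part and a trailing test part."""
    n_train = int(train_ratio * len(x))  # int() truncates, e.g. 0.8 * 569 -> 455
    return x[:n_train], y[:n_train], x[n_train:], y[n_train:]

# Toy stand-in for the shuffled dataset (hypothetical data)
x_toy = np.zeros((569, 30))
y_toy = np.zeros(569, dtype=int)

x_tr, y_tr, x_te, y_te = split_by_ratio(x_toy, y_toy, 0.8)
print(x_tr.shape, x_te.shape)  # (455, 30) (114, 30)
```

Computing the split index once keeps the feature and label slices aligned by construction.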

Printing out the dataset shows that the features span very different scales, so the input needs to be normalized.

In [0]:
print(x_train)
print(y_train)

To avoid information leakage, the mean and standard deviation used for the normalization need to be calculated on the training dataset only. For numerical stability, 1e-6 is added to the standard deviation.

In [0]:
# Per-feature statistics (axis=0), computed on the training data only
mean = np.mean(x_train, axis=0)
std = np.std(x_train, axis=0)

x_train_norm, x_test_norm = (x_train-mean)/(std+1e-6), (x_test-mean)/(std+1e-6)
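
The sketch below illustrates the idea on toy data: the statistics come from the training rows only and are then reused unchanged for the test rows. It assumes per-feature statistics (`axis=0`), a common choice when features live on different scales. The toy arrays and the helper `normalize` are hypothetical.

```python
import numpy as np

def normalize(x, mean, std, eps=1e-6):
    """Standardize x with precomputed statistics; eps guards against division by zero."""
    return (x - mean) / (std + eps)

# Toy data (hypothetical): two features on very different scales
x_train_toy = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
x_test_toy = np.array([[2.0, 4000.0]])

# Statistics are computed on the training rows only, one value per feature
mean = x_train_toy.mean(axis=0)
std = x_train_toy.std(axis=0)

x_train_toy_norm = normalize(x_train_toy, mean, std)
x_test_toy_norm = normalize(x_test_toy, mean, std)

print(x_train_toy_norm.mean(axis=0))  # close to 0 for every feature
print(x_train_toy_norm.std(axis=0))   # close to 1 for every feature
print(x_test_toy_norm)                # test rows reuse the training statistics
```

The test rows are not guaranteed to have zero mean or unit standard deviation after this step, which is expected: they are scaled with the training statistics on purpose.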

In [0]:
print(x_train_norm)

## Task

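Complete the build_model() function! It should build and train a model on the training data, validate it on the test data, and return the trained model together with the training history. Track accuracy as a metric so that the plots below can read the 'accuracy' and 'val_accuracy' entries of the history.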

In [0]:
def build_model(x_train_norm, y_train, x_test_norm, y_test):
    # TODO: build, compile, and fit the model here.
    # `history` is the object returned by model.fit().
    return model, history

In [0]:
model, history = build_model(x_train_norm, y_train, x_test_norm, y_test)

## Training visualization

To spot overfitting and underfitting, the training and validation loss and accuracy are plotted.

In [0]:
fig = plt.figure(figsize=(16,8))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['loss', 'val_loss'])
plt.title('Losses')
plt.show()

In [0]:
fig = plt.figure(figsize=(16,8))
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['accuracy', 'val_accuracy'])
plt.title('Accuracy')
plt.show()
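
The same curves can also be read numerically. As a rough sketch, the snippet below finds the epoch with the lowest validation loss and measures how far the validation loss sits above the training loss at the final epoch; a large positive gap is a typical sign of overfitting. The `toy_history` dictionary is made-up data that only mimics the layout of `history.history`.

```python
import numpy as np

# Made-up curves in the same layout as history.history (one value per epoch)
toy_history = {
    'loss':     [0.70, 0.50, 0.35, 0.25, 0.18, 0.12],
    'val_loss': [0.72, 0.55, 0.42, 0.40, 0.45, 0.52],
}

loss = np.array(toy_history['loss'])
val_loss = np.array(toy_history['val_loss'])

best_epoch = int(np.argmin(val_loss))  # epoch index with the lowest validation loss
final_gap = val_loss[-1] - loss[-1]    # positive gap: validation is worse than training

print(f'best epoch: {best_epoch}, final gap: {final_gap:.2f}')
```

In this toy example the validation loss starts rising after epoch 3 while the training loss keeps falling, which is the pattern to look for in the plots.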

## Validation

The task is to maximize the accuracy on the validation dataset.

In [0]:
decision_boundary = 0.5

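# Predicted probability of class 1 for each test sample
probability = model.predict(x_test_norm)[:, 0]
# Classify as 1 when the probability exceeds the decision boundary
prediction = (probability > decision_boundary).astype(int)

# Fraction of test samples classified correctly
accuracy = np.mean(prediction == y_test)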

print(f'accuracy: {accuracy}')
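
To see what the decision boundary does, here is a small sketch on hand-picked probabilities. The arrays `toy_prob` and `toy_label` and the helper `accuracy_at` are hypothetical; in the notebook the probabilities would come from model.predict().

```python
import numpy as np

def accuracy_at(prob, label, decision_boundary):
    """Classify as 1 when prob exceeds the boundary, then return the fraction correct."""
    prediction = (prob > decision_boundary).astype(int)
    return float(np.mean(prediction == label))

# Hand-picked predicted probabilities and true labels (hypothetical data)
toy_prob = np.array([0.10, 0.40, 0.55, 0.60, 0.80, 0.95])
toy_label = np.array([0, 0, 1, 1, 1, 1])

for boundary in (0.3, 0.5, 0.7):
    print(boundary, accuracy_at(toy_prob, toy_label, boundary))
```

A lower boundary labels more samples as class 1 and a higher boundary labels fewer, so the accuracy changes with the chosen value.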