# Using tf.keras to Predict Breast Cancer

The University of California, Irvine (UCI) hosts a large number of openly available datasets that are particularly useful for machine learning projects. One of them is the **Breast Cancer Wisconsin (Original) Data Set**. In this Colab, we will use this dataset to train a tf.keras Sequential model that predicts whether a patient's tumour is *malignant* (cancerous) or *benign* (non-cancerous) based on a set of cell measurements. Let's take a deeper look into the data. Start by downloading the *breast-cancer-wisconsin.data* and *breast-cancer-wisconsin.names* files and opening them in a spreadsheet application or text editor.

![alt text](https://github.com/boronhub/breast_cancer_classifier/blob/master/images/data.jpg?raw=true)



As soon as we open up the .data file, we see that there are no column headers. Thus, we can't really infer much from the data. So, for reference, we turn to the .names file.



![alt text](https://github.com/boronhub/breast_cancer_classifier/blob/master/images/names.png?raw=true)

We can see the class distribution of benign and malignant tumours, and the total number of entries. But most importantly, we see the attribute names, such as clump thickness, which help us understand how the data is to be used when preparing it for input to the model.


Let's get started with creating this model.

**Step 1: Importing necessary libraries**

Make sure to use TensorFlow 2.x 

In [0]:
%tensorflow_version 2.x

TensorFlow 2.x selected.


In [0]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential

**Step 2: Prepare the data**

First, we import the .data file as csv from the official UCI Datasets Repository. Then, we need to manually define the column headers for the dataset using the .names file for reference. 

In [0]:
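# Note: the file has no header row, so with the default settings read_csv
# uses the first record as column names, leaving 698 of the 699 records.
# Passing header=None would keep that first record.
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data')

df.columns = ['id', 'clump_thickness', 'unif_cell_size', 'unif_cell_shape',
              'marg_adhesion', 'single_epith_size', 'bare_nuclei',
              'bland_chrom', 'norm_nucleoli', 'mitoses', 'class']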
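
One detail worth checking when loading a headerless file is that `pd.read_csv` treats the first line as column names by default. The sketch below uses a tiny made-up three-column CSV (not the UCI data) to show the difference between the default behaviour and passing `header=None` together with `names`. The column names `a`, `b`, `c` and the values are placeholders.

```python
import io

import pandas as pd

# Three made-up records with no header line (illustrative only).
raw = "1,5,2\n2,3,4\n3,8,6\n"

# Default: the first record is consumed as the header, so only 2 rows remain.
df_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every record; names supplies the column labels.
df_named = pd.read_csv(io.StringIO(raw), header=None, names=["a", "b", "c"])

print(len(df_default), len(df_named))
```

Applied to the UCI file, the same two arguments should keep every record in the DataFrame.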

We drop the 'id' column, as it is only a sample identifier and carries no information about the tumour. The dataset also has a few missing values (denoted by '?'), which we replace with a large negative sentinel value (-99999) so that the affected entries are treated as outliers.


In [0]:
df.drop(['id'], inplace=True, axis=1)
df.replace('?', -99999, inplace=True)
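
An alternative to a sentinel value, sketched here on a small made-up frame rather than the real dataset, is to convert the column to numbers with `pd.to_numeric(..., errors="coerce")` so that '?' becomes NaN, and then either drop or impute those rows. The frame `sample` and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows that mimic a column containing the '?' placeholder.
sample = pd.DataFrame({
    "bare_nuclei": ["1", "10", "?", "4"],
    "class": [2, 4, 2, 4],
})

# '?' cannot be parsed as a number, so errors="coerce" turns it into NaN.
sample["bare_nuclei"] = pd.to_numeric(sample["bare_nuclei"], errors="coerce")

# Option 1: drop incomplete rows.
dropped = sample.dropna()

# Option 2: fill the gap with the column median.
imputed = sample.fillna({"bare_nuclei": sample["bare_nuclei"].median()})

print(len(dropped), imputed["bare_nuclei"].tolist())
```

Either option leaves a purely numeric column, which avoids feeding an extreme value such as -99999 into the network.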


We see from the .names file that in the 'class' column, 2 is used for benign and 4 for malignant. We remap these to 0 and 1 respectively for easier interpretation of the output.

In [0]:
df['class'] = df['class'].map(lambda x: 1 if x == 4 else 0)

Let's take a look at our dataset so far.

In [0]:
df.head()

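|   | clump_thickness | unif_cell_size | unif_cell_shape | marg_adhesion | single_epith_size | bare_nuclei | bland_chrom | norm_nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 0 |
| 1 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 0 |
| 2 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 0 |
| 3 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 0 |
| 4 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 1 |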


**Step 3: Splitting the dataset**


Once the data is ready, we need to separate it into the prediction labels (here, the 'class' column) and the input features, which are the remaining columns of the dataset.

In [0]:
x = np.array(df.drop(['class'], axis=1))
y = np.array(df['class'])
x.shape  # (number of samples, number of features)

(698, 9)

As with any model, we need both a training set and a validation (test) set. We can easily create these using the train_test_split() function from sklearn.model_selection.

In [0]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
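
Because the two classes are not equally frequent, it can help to keep their ratio the same in both splits and to make the split reproducible. The following is a minimal sketch on made-up data (the arrays `x_demo` and `y_demo` and the seed 42 are assumptions), using the `stratify` and `random_state` parameters of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 9 features, 35% positive labels.
rng = np.random.default_rng(0)
x_demo = rng.integers(1, 11, size=(100, 9))
y_demo = np.array([0] * 65 + [1] * 35)

# stratify keeps the class ratio similar in both splits;
# random_state makes the shuffle repeatable.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(x_tr.shape, x_te.shape, y_tr.mean(), y_te.mean())
```

Without `random_state`, every run produces a different split, so the reported numbers will vary from run to run.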

tf.keras models accept NumPy arrays directly, which is what our input consists of so far. Here we nevertheless convert the arrays explicitly with tf.convert_to_tensor(), which also casts every value to float32.

In [0]:
y_test = tf.convert_to_tensor(y_test, dtype=tf.float32)
y_train = tf.convert_to_tensor(y_train, dtype=tf.float32)
x_train = tf.convert_to_tensor(x_train, dtype=tf.float32)
x_test = tf.convert_to_tensor(x_test, dtype=tf.float32)

Let's look at a single entry from our training Tensor.

In [0]:
x_train[0]

<tf.Tensor: shape=(9,), dtype=float32, numpy=array([ 5.,  6.,  6.,  8.,  6., 10.,  4., 10.,  4.], dtype=float32)>

**Step 4: Training the model**

In [0]:
model = Sequential()
model.add(Dense(9, activation='sigmoid', input_shape=(9,)))
model.add(Dense(27, activation='sigmoid'))
model.add(Dropout(0.25))
model.add(Dense(54, activation='sigmoid'))
model.add(Dropout(0.25))
model.add(Dense(27, activation='sigmoid'))
model.add(Dropout(0.25))
model.add(Dense(1, activation='sigmoid'))
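
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: each Dense layer has one weight per input-output pair plus one bias per unit, and Dropout layers add none. The sketch below does that arithmetic in plain Python; the helper `dense_params` is a hypothetical name introduced here.

```python
# Layer widths as defined in the model above: 9 inputs, then 9, 27, 54, 27, 1 units.
layer_sizes = [9, 9, 27, 54, 27, 1]


def dense_params(n_in, n_out):
    # One weight per input-output pair plus one bias per output unit.
    return n_in * n_out + n_out


per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)
```

Calling model.summary() on the model should report the same per-layer counts.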

In [0]:
model.compile(optimizer='adam', loss='mean_squared_logarithmic_error')
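
For binary classification, a commonly used alternative configuration is the `binary_crossentropy` loss together with `metrics=['accuracy']`, so that Keras reports accuracy directly. The sketch below is not the setup used in this Colab: it trains a small stand-in network (`demo_model`) on synthetic data, purely to show how the two values come back from `evaluate()`.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data: 200 samples, 9 features in [0, 1], binary labels.
rng = np.random.default_rng(0)
x_demo = rng.random((200, 9)).astype("float32")
y_demo = (x_demo.sum(axis=1) > 4.5).astype("float32")

# A deliberately small network, used only for this illustration.
demo_model = tf.keras.Sequential([
    tf.keras.Input(shape=(9,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# With a metric configured, evaluate() returns [loss, accuracy].
demo_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
demo_model.fit(x_demo, y_demo, batch_size=32, epochs=2, verbose=0)

demo_loss, demo_acc = demo_model.evaluate(x_demo, y_demo, verbose=0)
print(demo_loss, demo_acc)
```

Note that once a metric is configured, `evaluate()` returns a list rather than a single number, so any code that consumes its result has to unpack it accordingly.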

In [0]:
model.fit(x_train,y_train,batch_size=64,epochs=5,verbose=1, validation_data=(x_test, y_test))

Train on 558 samples, validate on 140 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f17d431e908>

**Step 5: Finishing Up**

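Using model.evaluate() with the validation set, we obtain the final loss of our model. Because the model was compiled without an accuracy metric, the loss alone does not tell us the accuracy, so we compute it by thresholding the predicted probabilities at 0.5 and comparing them with the true labels.

In [0]:
loss = model.evaluate(x_test, y_test, verbose=1, batch_size=30)

In [0]:
# Threshold the sigmoid outputs at 0.5 and compare with the true labels
accuracy = np.mean((model.predict(x_test) > 0.5).flatten() == np.asarray(y_test))
print("Final accuracy of the model is {:.2f}%".format(accuracy * 100))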
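
A loss value and an accuracy measure different things, so one cannot be converted into the other by subtraction. As a minimal NumPy sketch with made-up labels and probabilities (not results from this model), the snippet below computes the mean squared logarithmic error, which averages the squared difference of log(1 + y) and log(1 + p), alongside the fraction of correct predictions.

```python
import numpy as np

# Made-up labels and predicted probabilities (illustrative only).
y_true = np.array([0.0, 0.0, 1.0, 1.0])
y_prob = np.array([0.4, 0.6, 0.6, 0.4])

# Mean squared logarithmic error: mean((log(1 + y) - log(1 + p))^2).
msle = np.mean((np.log1p(y_true) - np.log1p(y_prob)) ** 2)

# Accuracy: fraction of thresholded predictions that match the labels.
accuracy = np.mean((y_prob > 0.5) == (y_true == 1.0))

print(100 - msle * 100, accuracy * 100)
```

In this toy case the loss-derived figure is about 87, while only half of the predictions are actually correct.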