This project implements machine learning algorithms to diagnose breast cancer. The dataset is the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset. The project is part of my ML portfolio.

# **Medical Diagnosis**
In this project, a logistic regression model (as a baseline) and a neural network are used to classify breast tumours as malignant or benign. The dataset contains several numeric features of the tissue, computed from a digitized medical image of a breast mass.


Import TensorFlow

In [18]:
# Run on TensorFlow 2.x
%tensorflow_version 2.x
from __future__ import absolute_import, division, print_function, unicode_literals

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.


Import the necessary modules to prepare data etc.

In [19]:
# Import relevant modules
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from tensorflow.keras import layers
from tensorflow.keras import regularizers

# The following lines adjust the granularity of reporting.
pd.options.display.float_format = "{:.1f}".format

Load the dataset.

In [20]:
from sklearn.datasets import load_breast_cancer
df = load_breast_cancer()
dataset= pd.DataFrame(df['data'], columns=df['feature_names'])
dataset['target']= df['target']
dataset.columns = dataset.columns.str.replace(' ', '_')

First let's have a quick look at the dataset.

In [21]:
dataset.head()

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,mean_fractal_dimension,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,target
0,18.0,10.4,122.8,1001.0,0.1,0.3,0.3,0.1,0.2,0.1,...,17.3,184.6,2019.0,0.2,0.7,0.7,0.3,0.5,0.1,0
1,20.6,17.8,132.9,1326.0,0.1,0.1,0.1,0.1,0.2,0.1,...,23.4,158.8,1956.0,0.1,0.2,0.2,0.2,0.3,0.1,0
2,19.7,21.2,130.0,1203.0,0.1,0.2,0.2,0.1,0.2,0.1,...,25.5,152.5,1709.0,0.1,0.4,0.5,0.2,0.4,0.1,0
3,11.4,20.4,77.6,386.1,0.1,0.3,0.2,0.1,0.3,0.1,...,26.5,98.9,567.7,0.2,0.9,0.7,0.3,0.7,0.2,0
4,20.3,14.3,135.1,1297.0,0.1,0.1,0.2,0.1,0.2,0.1,...,16.7,152.2,1575.0,0.1,0.2,0.4,0.2,0.2,0.1,0


The summary statistics for the dataset are also worth a look.

In [22]:
dataset.describe()

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,mean_fractal_dimension,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,target
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.1,19.3,92.0,654.9,0.1,0.1,0.1,0.0,0.2,0.1,...,25.7,107.3,880.6,0.1,0.3,0.3,0.1,0.3,0.1,0.6
std,3.5,4.3,24.3,351.9,0.0,0.1,0.1,0.0,0.0,0.0,...,6.1,33.6,569.4,0.0,0.2,0.2,0.1,0.1,0.0,0.5
min,7.0,9.7,43.8,143.5,0.1,0.0,0.0,0.0,0.1,0.0,...,12.0,50.4,185.2,0.1,0.0,0.0,0.0,0.2,0.1,0.0
25%,11.7,16.2,75.2,420.3,0.1,0.1,0.0,0.0,0.2,0.1,...,21.1,84.1,515.3,0.1,0.1,0.1,0.1,0.3,0.1,0.0
50%,13.4,18.8,86.2,551.1,0.1,0.1,0.1,0.0,0.2,0.1,...,25.4,97.7,686.5,0.1,0.2,0.2,0.1,0.3,0.1,1.0
75%,15.8,21.8,104.1,782.7,0.1,0.1,0.1,0.1,0.2,0.1,...,29.7,125.4,1084.0,0.1,0.3,0.4,0.2,0.3,0.1,1.0
max,28.1,39.3,188.5,2501.0,0.2,0.3,0.4,0.2,0.3,0.1,...,49.5,251.2,4254.0,0.2,1.1,1.3,0.3,0.7,0.2,1.0


Before beginning with the modeling it's good practice to find the biggest correlations between features so that the dataset can be understood better.

In [23]:
dataset.corr(method='pearson')

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,mean_fractal_dimension,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,target
mean_radius,1.0,0.3,1.0,1.0,0.2,0.5,0.7,0.8,0.1,-0.3,...,0.3,1.0,0.9,0.1,0.4,0.5,0.7,0.2,0.0,-0.7
mean_texture,0.3,1.0,0.3,0.3,-0.0,0.2,0.3,0.3,0.1,-0.1,...,0.9,0.4,0.3,0.1,0.3,0.3,0.3,0.1,0.1,-0.4
mean_perimeter,1.0,0.3,1.0,1.0,0.2,0.6,0.7,0.9,0.2,-0.3,...,0.3,1.0,0.9,0.2,0.5,0.6,0.8,0.2,0.1,-0.7
mean_area,1.0,0.3,1.0,1.0,0.2,0.5,0.7,0.8,0.2,-0.3,...,0.3,1.0,1.0,0.1,0.4,0.5,0.7,0.1,0.0,-0.7
mean_smoothness,0.2,-0.0,0.2,0.2,1.0,0.7,0.5,0.6,0.6,0.6,...,0.0,0.2,0.2,0.8,0.5,0.4,0.5,0.4,0.5,-0.4
mean_compactness,0.5,0.2,0.6,0.5,0.7,1.0,0.9,0.8,0.6,0.6,...,0.2,0.6,0.5,0.6,0.9,0.8,0.8,0.5,0.7,-0.6
mean_concavity,0.7,0.3,0.7,0.7,0.5,0.9,1.0,0.9,0.5,0.3,...,0.3,0.7,0.7,0.4,0.8,0.9,0.9,0.4,0.5,-0.7
mean_concave_points,0.8,0.3,0.9,0.8,0.6,0.8,0.9,1.0,0.5,0.2,...,0.3,0.9,0.8,0.5,0.7,0.8,0.9,0.4,0.4,-0.8
mean_symmetry,0.1,0.1,0.2,0.2,0.6,0.6,0.5,0.5,1.0,0.5,...,0.1,0.2,0.2,0.4,0.5,0.4,0.4,0.7,0.4,-0.3
mean_fractal_dimension,-0.3,-0.1,-0.3,-0.3,0.6,0.6,0.3,0.2,0.5,1.0,...,-0.1,-0.2,-0.2,0.5,0.5,0.3,0.2,0.3,0.8,0.0


## Dataset preparation
Here the dataset is split into the input features and the output target. For the input features, four features with a strong correlation to the target are selected: mean_area, mean_radius, mean_perimeter, and mean_concave_points. Among the mean features shown above these are some of the most strongly correlated with the target (about -0.7 to -0.8); the correlations are negative because the target is encoded as 0 for malignant and 1 for benign.
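The feature choice above can be checked by ranking every column by the absolute value of its Pearson correlation with the target. The snippet below is a minimal sketch of that check; the helper name `rank_by_target_correlation` is hypothetical and not part of the original notebook. Note that the printed correlation table truncates its columns with `...`, so the full ranking also includes the `worst_*` and `*_error` features and may not match the four mean features used here.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer


def rank_by_target_correlation(frame, target="target"):
    """Sort features by absolute Pearson correlation with the target column."""
    corr = frame.corr(method="pearson")[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# Rebuild the same DataFrame as in the loading cell.
raw = load_breast_cancer()
frame = pd.DataFrame(raw["data"], columns=raw["feature_names"])
frame.columns = frame.columns.str.replace(" ", "_")
frame["target"] = raw["target"]

ranking = rank_by_target_correlation(frame)
print(ranking.head(10))
```

Sorting by absolute value matters here because all the strong correlations with the target are negative.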



In [24]:
y_dataset = dataset.pop('target')
X_dataset = dataset[['mean_area', 'mean_radius', 'mean_perimeter', 'mean_concave_points']]

The features are standardized with a Z-score (subtract the mean, then divide by the standard deviation) so that they share a comparable scale.

In [25]:
X_dataset_mean = X_dataset.mean()
X_dataset_std = X_dataset.std()
X_dataset_norm = (X_dataset - X_dataset_mean)/X_dataset_std

Dataset normalized.


Now that all the features are normalized, the dataset is split into a training set and a test set. An 80/20 split is used for this project, and the random state is set to 100 for reproducibility.

In [26]:
from sklearn.model_selection import train_test_split
X_train_norm, X_test_norm, y_train, y_test = train_test_split(X_dataset_norm , y_dataset , test_size=0.2 , random_state=100)

Dataset split.
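In this notebook the Z-score statistics are computed on the full dataset before the split, so the test rows contribute to the mean and standard deviation. A common alternative is to compute them on the training rows only and reuse them for the test rows. The snippet below is a minimal sketch of that variant on synthetic stand-in data; the helper names `zscore_fit` and `zscore_apply` and the array values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np


def zscore_fit(train):
    """Compute per-column mean and standard deviation from the training rows."""
    mean = train.mean(axis=0)
    std = train.std(axis=0, ddof=1)  # ddof=1 matches the default of pandas .std()
    return mean, std


def zscore_apply(values, mean, std):
    """Standardize values with previously computed statistics."""
    return (values - mean) / std


# Synthetic stand-in for two features on very different scales.
rng = np.random.default_rng(100)
features = rng.normal(loc=[14.0, 650.0], scale=[3.5, 350.0], size=(50, 2))
train, test = features[:40], features[40:]

mean, std = zscore_fit(train)
train_scaled = zscore_apply(train, mean, std)
test_scaled = zscore_apply(test, mean, std)  # reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0, ddof=1))
```

With only 569 rows and an 80/20 split the difference is likely small, but fitting on the training rows keeps the test set fully unseen.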


## Logistic regression
A logistic regression model can be used as the baseline. Here we define and train a logistic regression model.

The model is trained on the training data and then used to predict the test set.

In [27]:
# Create a logistic regression object and train it

model_LR = LogisticRegression()
model_LR.fit(X_train_norm, y_train)
y_predict_LR = model_LR.predict(X_test_norm)

Now we evaluate the performance of the LR model using the test set. Since this is a classification task, accuracy (the fraction of correctly classified samples) is used as the metric here.

In [28]:
# Evaluate the trained model against the test set.

print("\n Evaluate the logistic regression model against the test set:")
accuracy_score(y_test, y_predict_LR)


 Evaluate the logistic regression model against the test set:


0.9210526315789473

## Neural network
Now the neural network is defined. We use three hidden layers with 16, 8, and 6 nodes. The hidden layers use the ReLU activation function and an L2 kernel regularizer with a factor of 0.001. The output layer is a single sigmoid unit.

In [34]:
# Find the number of input features
n_features = X_train_norm.shape[1]

# Create the neural network
model_NN = tf.keras.Sequential([
    # The factor is passed positionally because the keyword name (l vs. l2) differs between Keras versions.
    layers.Dense(16, activation='relu', kernel_regularizer=regularizers.l2(0.001), input_shape=(n_features,)),
    layers.Dense(8, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(6, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(1, activation='sigmoid')  # Outputs a value between 0 and 1
])
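To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: each dense layer has one weight per input-output pair plus one bias per unit. The snippet below is a small sketch of that arithmetic for four input features; the helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_sizes:
        total += fan_in * units + units  # weights plus one bias per unit
        fan_in = units
    return total


# 4 input features, hidden layers of 16, 8 and 6 units, 1 output unit.
print(dense_param_count(4, [16, 8, 6, 1]))  # 80 + 136 + 54 + 7 = 277
```

The same total should appear in the output of `model_NN.summary()`.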

The neural network model is now trained using the training data. Here we set the number of epochs (100), the optimization algorithm ('adam'), and the loss function (binary cross-entropy, BCE).
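For reference, binary cross-entropy averages the negative log-likelihood of the true labels under the predicted probabilities. The snippet below is a minimal NumPy sketch of that formula, not the Keras implementation; the function name `binary_cross_entropy` and the clipping constant `eps` are assumptions chosen for illustration.

```python
import numpy as np


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of 0/1 labels given predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for predictions of exactly 0 or 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))


# Confident correct predictions give a small loss, uncertain ones a larger loss.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
```

A prediction of 0.5 for every sample gives a loss of ln 2, roughly 0.693, which is a useful reference point when reading the training loss.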

In [30]:
# compile the model
#model_NN.compile(optimizer='adam', loss='mse', metrics=['mse'])
model_NN.compile(optimizer='adam', loss='binary_crossentropy', metrics=['binary_accuracy'])

# Fit the model
model_NN.fit(x=X_train_norm, y=y_train, epochs=100)

Epoch 1/100
...
(per-epoch training output truncated)

<keras.src.callbacks.History at 0x7c8e9c55bc70>

Now the performance will be evaluated.

In [31]:
# Evaluate the trained model against the test set.
print("\n Evaluate the neural network model against the test set:")
model_NN.evaluate(x = X_test_norm, y = y_test)


 Evaluate the neural network model against the test set:


[0.19265033304691315, 0.9122806787490845]

In [32]:
# Re-run both evaluations on the test set.
print("\n Evaluate the neural network model against the test set:")
model_NN.evaluate(x = X_test_norm, y = y_test)

print("\n Evaluate the logistic regression model against the test set:")
accuracy_score(y_test, y_predict_LR)

# Test set. Note: evaluate() returns [loss, metric] as configured in compile().
# With the compile call above that is [binary cross-entropy, binary accuracy];
# it is the MSE only if the model is compiled with loss='mse', metrics=['mse'].
mse_nn = model_NN.evaluate(X_test_norm, y_test)
print("Mean Squared Error of the Neural Network: ",mse_nn)

# predict() returns hard 0/1 labels, so this MSE equals the misclassification rate.
mse_lr = mean_squared_error(y_test, y_predict_LR)
print("Mean Squared Error of the Logistic Regression: ", mse_lr)

# Training set, with the same caveats as above.
mse_nn = model_NN.evaluate(X_train_norm, y_train)
print("Mean Squared Error of the Neural Network: ",mse_nn)

y_predict_train = model_LR.predict(X_train_norm)
mse_lr = mean_squared_error(y_train, y_predict_train)
print("Mean Squared Error of the Logistic Regression: ", mse_lr)



 Evaluate the neural network model against the test set:

 Evaluate the logistic regression model against the test set:
Mean Squared Error of the Neural Network:  [0.19265033304691315, 0.9122806787490845]
Mean Squared Error of the Logistic Regression:  0.07894736842105263
Mean Squared Error of the Neural Network:  [0.19548872113227844, 0.9230769276618958]
Mean Squared Error of the Logistic Regression:  0.08791208791208792


Accuracy LR: 0.9210526315789473  
Accuracy NN: 0.9210526347160339  
MSE LR test: 0.07894736842105263  
MSE NN test: 0.054480426013469696  
MSE LR train: 0.08791208791208792  
MSE NN train: 0.05716453865170479  
To get these values I added the evaluation code above and changed the NN's compile step so that it reports MSE. As can be seen, the two models are very similar in accuracy and MSE. The accuracies are identical up to floating-point precision, while the NN has a lower MSE on both the training and test data, so I would still call it the better model. One caveat is that the LR MSE is computed from hard 0/1 predictions (so it equals 1 minus the accuracy), whereas the NN MSE is computed from predicted probabilities, so the two MSE values are not strictly comparable. The MSE is also very similar between the test and training data, so overfitting appears to be minimal. When evaluating models it is also important to take execution time into account, and in this case the logistic regression model is much faster. The NN will probably improve as more data is provided, although it will also take more time to compute.
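When comparing the MSE values above, it helps to remember that the logistic regression MSE is computed from hard 0/1 labels returned by `predict`, while Keras computes the NN MSE from the sigmoid probabilities. On hard labels the MSE is simply the misclassification rate, which is why 0.0789 equals 1 - 0.9211. The snippet below is a small sketch on made-up labels and probabilities (all values are illustrative assumptions, not results from this project).

```python
import numpy as np

# Made-up ground truth and predicted probabilities for eight samples.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.95])
y_label = (y_prob >= 0.5).astype(int)  # hard predictions at a 0.5 threshold

accuracy = np.mean(y_true == y_label)
mse_labels = np.mean((y_true - y_label) ** 2)  # MSE on hard labels
mse_probs = np.mean((y_true - y_prob) ** 2)    # MSE on probabilities

print("accuracy:", accuracy)
print("MSE on labels:", mse_labels, "= 1 - accuracy:", 1 - accuracy)
print("MSE on probabilities:", mse_probs)
```

For a like-for-like comparison, one option would be to compute the logistic regression MSE from `model_LR.predict_proba(X_test_norm)[:, 1]` instead of the hard labels.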