# **Introduction**

Breast cancer remains a major health problem all over the world. It can be detected in several ways, including mammography, ultrasound and biopsy. Detecting breast cancer early is vital because it increases the chance of successful treatment and ultimately saves lives. An image classification model that helps detect breast cancer early is therefore worth developing. This notebook builds on previous work.

The aim of this project is:
- To create an image classification model that can accurately distinguish images of cancerous breast tissue from non-cancerous ones.

The objectives are:
- To use a suitable dataset of breast cancer images.
- To implement a deep learning architecture, here a Convolutional Neural Network (CNN), to build the image classification model.
- To optimize the model relative to previously developed models and evaluate it.

# Data Collection

## *Description of dataset used*

Breast cancer can develop in any part of the breast. The most common form of breast cancer is Invasive Ductal Carcinoma (IDC). IDC can be detected through various methods such as mammography, ultrasound and biopsy. Histopathology images are obtained from biopsy samples.

The Breast Histopathology Images dataset will be used to train and test the image classification model.

The dataset was originally uploaded on the Gleason Case website: http://gleason.case.edu/webdata/jpi-dl-tutorial/IDC_regular_ps50_idx5.zip

However, that website is not accessible at the time of writing, so this notebook uses the copy of the dataset uploaded by Paul Mooney.

The dataset consists of a total of 277,524 image patches of size 50 x 50, which were extracted from 162 whole mount images. Of these patches, 198,738 are IDC negative and 78,786 are IDC positive.
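
As a quick sanity check on the figures quoted above, the class counts can be turned into proportions. This is only an illustrative sketch: the variable names are made up here, and the majority-class baseline is an added reference point, not something computed in the original notebook.

```python
# Patch counts quoted in the dataset description
idc_negative = 198_738
idc_positive = 78_786
total = idc_negative + idc_positive

# Share of each class in the full dataset
negative_share = idc_negative / total
positive_share = idc_positive / total

# Accuracy of a trivial classifier that always predicts "IDC negative"
majority_baseline = max(negative_share, positive_share)

print(f"Total patches: {total}")
print(f"IDC negative share: {negative_share:.1%}")
print(f"IDC positive share: {positive_share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

The two counts add up to the stated total, and the classes are imbalanced at roughly 72% to 28%. Any accuracy figure is easier to judge against that baseline. The ratio in a subset of the data may differ from the full dataset.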


## *Importing Necessary Modules*

In [None]:
# Basic Libraries
import numpy as np
import random
from os import listdir
from PIL import Image

# Preprocessing/Visualization
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from tensorflow.keras.utils import to_categorical  # keras.utils.np_utils is not available in recent Keras releases

# Model Creation
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Evaluation Metrics
import seaborn as sns
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

## *Importing Data*

The base path of the data is breast-histopathology-images, but all the images are also collected in a single folder named IDC_regular_ps50_idx5. For this project, the data is read directly from the IDC_regular_ps50_idx5 folder. It contains one subfolder per patient, named after the patient's id, and each patient folder holds that patient's IDC positive and IDC negative images.

In [None]:
# Import the dataset into 'files'

base_path = "../input/breast-histopathology-images/IDC_regular_ps50_idx5/"
files = listdir(base_path)

In [None]:
# Find the total length of data/Find out how many patients are there

print("Total Number of Patients: "+ str(len(files)))

Based on the output above, there are 279 patients, and each patient folder contains both IDC positive and IDC negative images. This tells us how the files need to be traversed. Choosing an appropriate data structure is important because it makes the data easier to work with later. In this notebook, the image paths and labels are first collected in a list, and the images and labels are later stored in arrays.

In [None]:
# Save the data into a list of [image_path, class] pairs

dataset = []

for i in range(len(files)):
    patient_id = files[i]
    for c in [0,1]:
        patient_path = base_path + patient_id
        class_path = patient_path + '/' + str(c) + '/'
        subfiles = listdir(class_path)
        for pic in subfiles:
            image_path = class_path + pic
            dataset.append([image_path,c])
        

In [None]:
print("Total Number of Images: " + str(len(dataset)))

As shown above, there are 277,524 images in total. They are stored in a single list in which each entry pairs an image path with its class, 0 or 1, indicating IDC negative or IDC positive respectively.

In [None]:
# How each data is stored

dataset[0]

Each entry in the dataset is a list consisting of the image path and its class, 0 being IDC negative and 1 being IDC positive.

The full dataset might be too big for a Kaggle notebook to process, so it is reduced to a quarter of its size.

In [None]:
total_length = len(dataset)
limit = total_length/4
dataset = dataset[:int(limit)]

len(dataset)
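
Because the list was built patient by patient, slicing off the first quarter keeps only the patients that happen to be listed first. One possible alternative, shown here purely as a sketch on a toy list, is to draw a random quarter so that every patient can contribute. The `subsample` helper, the toy paths and the seed value are hypothetical and not part of the notebook.

```python
import random

def subsample(pairs, fraction, seed=42):
    """Return a random subset holding `fraction` of the [path, label] pairs."""
    k = int(len(pairs) * fraction)
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    return rng.sample(pairs, k)

# Toy stand-in for the notebook's list of [image_path, class] pairs
toy_dataset = [[f"patient_{p}/{c}/img_{i}.png", c]
               for p in range(4) for c in (0, 1) for i in range(5)]

quarter = subsample(toy_dataset, 0.25)
print(len(toy_dataset), len(quarter))
```

Sampling without replacement keeps each pair at most once, and fixing the seed makes the reduced dataset repeatable between runs.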

## *Data Visualisation*

What do the images look like, and what is the ratio of IDC positive to IDC negative patches? This section answers those questions.

In [None]:
# Get the size

# Load the image

image_path = dataset[0][0]
label = dataset[0][1]
image = Image.open(image_path)

# Get the size (dimensions) of the image

image_width, image_height = image.size

print(f"Image Width: {image_width} pixels")
print(f"Image Height: {image_height} pixels")

So this image is 50 by 50 pixels, as stated in the dataset description. Here is what the first image looks like.

In [None]:
# Show the first image in the dataset

plt.figure(figsize=(12, 8))

plt.imshow(image)
plt.title("IDC Positive" if label == 1 else "IDC Negative")  # title follows the stored label

plt.show()

The dataset will be separated into NCdata and Cdata for the purpose of visualisation.

In [None]:
# Separate the data by class

NCdata = [img for img, label in dataset if label == 0]
Cdata = [img for img, label in dataset if label == 1]

NClabels = [label for img, label in dataset if label == 0]
Clabels = [label for img, label in dataset if label == 1]

A sample of images will be taken from each data array for display.

In [None]:
# Get a sample of images from each type of dataset

negativeSample = random.sample(NCdata, 50)
positiveSample = random.sample(Cdata, 50)

### Healthy Patches

In [None]:
# Display 5x10 Grid of Healthy Patches

fig, ax = plt.subplots(5,10,figsize=(20,10))
for n in range(5):
    for m in range(10):
        idx = negativeSample[m + 10*n]
        image = Image.open(idx)
        ax[n,m].imshow(image)
        ax[n,m].grid(False)


### Cancer Patches

In [None]:
# Display 5x10 Grid of Cancer Patches

fig, ax = plt.subplots(5,10,figsize=(20,10))
for n in range(5):
    for m in range(10):
        idx = positiveSample[m + 10*n]
        image = Image.open(idx)
        ax[n,m].imshow(image)
        ax[n,m].grid(False)

Observations:
* Some images may not be exactly 50x50 pixels.
* Compared with the healthy patches, the cancer patches tend to look more purple.

### Display Class Distribution

In [None]:
# Get the class distribution

labels = ["Non-Cancer", "Cancer"]
counts = [len(NCdata), len(Cdata)]

total_samples = sum(counts)
percentages = [(count / total_samples) * 100 for count in counts]


In [None]:
plt.figure(figsize=(8, 6))
plt.bar(labels, counts)
plt.xlabel("Class")
plt.ylabel("Count")
plt.title("Class Distribution")
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
plt.pie(percentages, labels=labels, autopct='%1.1f%%', startangle=140)
plt.title("Class Distribution (Percentage)")
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

# Data Preprocessing

As noted previously, some images may not be exactly 50x50 pixels. To avoid shape mismatches during training, all images are resized to 50x50 so that every input has the same dimensions. After resizing, the data is normalized, shuffled and split.

In [None]:
# Resizing using PIL Image

desired_size = (50,50)
resizedNC = []
resizedC = []

for image_path in NCdata:
    image = Image.open(image_path)
    nimage = image.resize(desired_size, Image.LANCZOS)  # Resize with anti-aliasing for better quality
    resizedNC.append(nimage)
    
for image_path in Cdata:
    image = Image.open(image_path)
    cimage = image.resize(desired_size, Image.LANCZOS)  # Resize with anti-aliasing for better quality
    resizedC.append(cimage)


In [None]:
# Normalize the Dataset pixel values to [0, 1] range

NCdataset = np.array([np.array(image) / 255.0 for image in resizedNC])
Cdataset = np.array([np.array(image) / 255.0 for image in resizedC])
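
To make the normalization step concrete, here is a minimal sketch on a toy array. The pixel values are invented for illustration, and casting to `float32` is an optional assumption added here to reduce memory use; the cell above keeps NumPy's default floating-point type.

```python
import numpy as np

# Toy 2x2 RGB "image" with 8-bit pixel values
pixels = np.array([[[0, 128, 255], [64, 64, 64]],
                   [[255, 255, 255], [10, 20, 30]]], dtype=np.uint8)

# Dividing by 255.0 maps the range [0, 255] onto [0.0, 1.0]
scaled = pixels.astype(np.float32) / 255.0

print(scaled.shape, scaled.dtype)
print(scaled.min(), scaled.max())
```

The shape is unchanged; only the value range and the data type differ after scaling.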

In [None]:
# Shuffle the dataset

NCdataset = shuffle(NCdataset, random_state=42)
Cdataset = shuffle(Cdataset, random_state=42)

In [None]:
# Get the Shape of all dataset

print('NCdataset shape : {}' .format(NCdataset.shape))
print('Cdataset shape : {}' .format(Cdataset.shape))

In [None]:
# Split the data

# Split each dataset into training data and temporary data - 70:30

NCtrain, NCtemp, NCtrain_labels, NCtemp_labels = train_test_split(
    NCdataset, NClabels, test_size=0.3, stratify=NClabels, random_state=42
)

# Split the Cancer data
Ctrain, Ctemp, Ctrain_labels, Ctemp_labels = train_test_split(
    Cdataset, Clabels, test_size=0.3, stratify=Clabels, random_state=42
)

# Use the temporary data to split into Validation and Testing Data - 15:15
NCval, NCtest, NCval_labels, NCtest_labels = train_test_split(
    NCtemp, NCtemp_labels, test_size=0.5, stratify=NCtemp_labels, random_state=42
)

Cval, Ctest, Cval_labels, Ctest_labels = train_test_split(
    Ctemp, Ctemp_labels, test_size=0.5, stratify=Ctemp_labels, random_state=42
)

# Combine the two Non-Cancer Data and the Cancer Data to make one train_data, val_data, test_data
train_data = np.concatenate((NCtrain, Ctrain), axis=0)
train_labels = np.concatenate((NCtrain_labels, Ctrain_labels), axis=0)
val_data = np.concatenate((NCval, Cval), axis=0)
val_labels = np.concatenate((NCval_labels, Cval_labels), axis=0)
test_data = np.concatenate((NCtest, Ctest), axis=0)
test_labels = np.concatenate((NCtest_labels, Ctest_labels), axis=0)
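
The two-stage split above first holds out 30% of each class and then halves that hold-out, which gives roughly 70% training, 15% validation and 15% test data. The sketch below reproduces the same proportions with a plain shuffled index list. `split_indices` is a hypothetical helper written for illustration and does not replicate the internals of `train_test_split`.

```python
import random

def split_indices(n, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle range(n) and cut it into train / validation / test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The three index lists are disjoint and together cover every sample, so no image can end up in more than one subset.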

In [None]:
# Reformat the shape for the labels

train_labels = to_categorical(train_labels, 2)
val_labels = to_categorical(val_labels, 2)
test_labels = to_categorical(test_labels, 2)
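
For readers unfamiliar with one-hot encoding, the following sketch shows what the conversion above does to integer labels, using plain NumPy on a toy label vector. It is an illustration of the idea rather than the `to_categorical` implementation.

```python
import numpy as np

labels = np.array([0, 1, 1, 0])

# Row i of the 2x2 identity matrix is the one-hot vector for class i
one_hot = np.eye(2, dtype=np.float32)[labels]

# argmax along the class axis recovers the original integer labels
recovered = np.argmax(one_hot, axis=1)

print(one_hot)
print(recovered)
```

Each label becomes a row with a single 1 in the column of its class, which matches the two-unit softmax output of the models defined later.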

In [None]:
print('train_data shape : {}' .format(train_data.shape))
print('train_labels shape : {}' .format(train_labels.shape))
print('val_data shape : {}' .format(val_data.shape))
print('val_labels shape : {}' .format(val_labels.shape))
print('test_data shape : {}' .format(test_data.shape))
print('test_labels shape : {}' .format(test_labels.shape))

# Model Architecture

The model used for this project is a custom Convolutional Neural Network. The base model consists of 11 layers: four convolutional layers, four max-pooling layers, one flatten layer and two fully connected layers.

In [None]:
model = tf.keras.Sequential([
    # Convolutional Layers
    tf.keras.layers.Conv2D(32, (3, 3), padding = 'same', activation = 'relu', input_shape = (50, 50, 3)),
    tf.keras.layers.MaxPooling2D(strides = 2),
    tf.keras.layers.Conv2D(64, (3, 3), padding = 'same', activation = 'relu'),
    tf.keras.layers.MaxPooling2D((3, 3),strides = 2),
    tf.keras.layers.Conv2D(128, (3, 3), padding = 'same', activation = 'relu'),
    tf.keras.layers.MaxPooling2D((3, 3),strides =2),
    tf.keras.layers.Conv2D(128, (3, 3), padding = 'same', activation = 'relu'),
    tf.keras.layers.MaxPooling2D((3, 3),strides =2),
    
    # Flatten Layer
    tf.keras.layers.Flatten(),
    
    # Fully Connected Layers
    tf.keras.layers.Dense(128, activation = 'relu'),
    tf.keras.layers.Dense(2, activation = 'softmax')
])

In [None]:
model.summary()
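
The spatial sizes reported by the summary can be reproduced by hand. The sketch below assumes the settings used in the model above: `'same'` padding in every convolution (which keeps the spatial size), unpadded pooling with stride 2, and a pool size of 2 for the first pooling layer (the Keras default when none is given) and 3 for the others. `pool_output` is a hypothetical helper for illustration.

```python
def pool_output(size, pool, stride):
    """Output length along one axis of an unpadded pooling window."""
    return (size - pool) // stride + 1

size = 50  # input patches are 50x50
shapes = []
# (filters, pool size) for each Conv2D + MaxPooling2D pair in the base model
for filters, pool in [(32, 2), (64, 3), (128, 3), (128, 3)]:
    # The convolution leaves `size` unchanged; only pooling shrinks it
    size = pool_output(size, pool, stride=2)
    shapes.append((size, size, filters))

flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)      # feature-map shape after each pooling layer
print(flattened)   # number of inputs to the first Dense layer
```

Under these assumptions the feature maps shrink from 50x50 to 25x25, 12x12, 5x5 and finally 2x2, so the flatten layer passes 512 values to the first fully connected layer.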

# Hyperparameter Tuning

The optimizer used for this model is Adam with a learning rate of 0.001, and the evaluation metric is accuracy.

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),loss='binary_crossentropy',metrics=['accuracy'])
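
The compile call above pairs `binary_crossentropy` with a two-unit softmax output and one-hot labels. For that particular combination the loss value should coincide with categorical cross-entropy, because the two softmax probabilities sum to one. The sketch below checks this numerically in plain Python. Both helper functions are simplified stand-ins written for illustration (for example, they skip the numerical clipping a deep learning library would apply) and are not Keras code.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy between a one-hot label and a probability vector."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

def binary_crossentropy(y_true, y_pred):
    """Element-wise binary cross-entropy, averaged over the output units."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)

y_true = [0.0, 1.0]  # one-hot label for class 1
y_pred = [0.2, 0.8]  # softmax-style output that sums to 1

cce = categorical_crossentropy(y_true, y_pred)
bce = binary_crossentropy(y_true, y_pred)
print(cce, bce)
```

Both values equal the negative log of the probability assigned to the true class, so either loss leads to the same training signal in this two-class setup.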

# Model Training

In [None]:
history = model.fit(train_data, train_labels, validation_data = (val_data, val_labels), epochs = 25 , batch_size = 75)

# Model Evaluation

In [None]:
model.evaluate(test_data,test_labels)

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')  # second curve is val_accuracy
plt.show()

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')  # second curve is val_loss
plt.show()

In [None]:
predict_data = model.predict(test_data)
predict_labels = np.argmax(predict_data, axis=1)

In [None]:
def convert_to_single_label(one_hot_labels):
    return np.argmax(one_hot_labels, axis=1)

# Convert train_labels
true_train_labels = convert_to_single_label(train_labels)

# Convert val_labels
true_val_labels = convert_to_single_label(val_labels)

# Convert test_labels
true_test_labels = convert_to_single_label(test_labels)

In [None]:
# Calculate accuracy
accuracy = accuracy_score(true_test_labels, predict_labels)
print(f'Accuracy: {accuracy:.2f}')

# Calculate precision
precision = precision_score(true_test_labels, predict_labels)
print(f'Precision: {precision:.2f}')

# Calculate recall
recall = recall_score(true_test_labels, predict_labels)
print(f'Recall: {recall:.2f}')

# Calculate F1-score
f1 = f1_score(true_test_labels, predict_labels)
print(f'F1-score: {f1:.2f}')

# Calculate confusion matrix
conf_matrix = confusion_matrix(true_test_labels, predict_labels)
f,ax = plt.subplots(figsize=(8, 8))
sns.heatmap(conf_matrix, annot=True, linewidths=0.01, cmap="BuPu", linecolor="gray", fmt='d', ax=ax)  # counts are integers
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()
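
To show how the four scores above relate to the confusion matrix, here is a minimal pure-Python sketch on made-up labels. `binary_metrics` is a hypothetical helper for illustration; the notebook itself relies on the scikit-learn functions, which treat class 1 (IDC positive) as the positive class by default.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cancer found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy confirmed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cancer missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "confusion": [[tn, fp], [fn, tp]],  # rows: true label, columns: predicted
    }

# Made-up example labels, not results from the model
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

Recall is the score to watch most closely in a screening setting, since a false negative corresponds to a missed cancer patch.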


# Fine Tuning

A second model is created based on the evaluation of the first one. It widens the last convolutional layer to 256 filters, adds a 256-unit fully connected layer, and is trained with a larger batch size and Early Stopping.


In [None]:
model2 = keras.Sequential([
    # Convolutional Layers
    layers.Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=(50, 50, 3)),
    layers.MaxPooling2D(strides=2),
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((3, 3), strides=2),
    layers.Conv2D(128, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((3, 3), strides=2),
    layers.Conv2D(256, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((3, 3), strides=2),
    
    # Flatten Layer
    layers.Flatten(),
    
    # Fully Connected Layers
    layers.Dense(256, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(2, activation='softmax')
])

In [None]:
model2.summary()

In [None]:
model2.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),loss='binary_crossentropy',metrics=['accuracy'])


In [None]:
early_stopping = keras.callbacks.EarlyStopping(
    monitor='val_loss',  # Metric to monitor (e.g., validation loss)
    patience=5,           # Number of epochs with no improvement to wait before stopping
    restore_best_weights=True  # Restore model weights to the best epoch
)
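
The patience rule configured above can be pictured with a small pure-Python sketch. It is a simplified illustration of the idea, not the Keras implementation: `early_stopping_epoch` is a hypothetical helper, the loss values are invented, and epochs are counted from zero.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Invented validation losses: the best value appears at epoch 2
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.46, 0.48, 0.49, 0.44]
stopped_at, best = early_stopping_epoch(losses, patience=5)
print(stopped_at, best)
```

In this toy run training stops after five epochs without improvement, so the later, lower loss is never reached. With `restore_best_weights=True`, the weights from the best epoch would be the ones kept.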

In [None]:
history2 = model2.fit(train_data, train_labels, validation_data = (val_data, val_labels), epochs = 25 , batch_size = 256, callbacks=[early_stopping])

In [None]:
model2.evaluate(test_data,test_labels)

In [None]:
plt.plot(history2.history['accuracy'])
plt.plot(history2.history['val_accuracy'])
plt.title('Model Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')  # second curve is val_accuracy
plt.show()

In [None]:
plt.plot(history2.history['loss'])
plt.plot(history2.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')  # second curve is val_loss
plt.show()

In [None]:
predict_data = model2.predict(test_data)
predict_labels = np.argmax(predict_data, axis=1)

In [None]:
# Calculate accuracy
accuracy = accuracy_score(true_test_labels, predict_labels)
print(f'Accuracy: {accuracy:.2f}')

# Calculate precision
precision = precision_score(true_test_labels, predict_labels)
print(f'Precision: {precision:.2f}')

# Calculate recall
recall = recall_score(true_test_labels, predict_labels)
print(f'Recall: {recall:.2f}')

# Calculate F1-score
f1 = f1_score(true_test_labels, predict_labels)
print(f'F1-score: {f1:.2f}')

# Calculate confusion matrix
conf_matrix = confusion_matrix(true_test_labels, predict_labels)
f,ax = plt.subplots(figsize=(8, 8))
sns.heatmap(conf_matrix, annot=True, linewidths=0.01, cmap="BuPu", linecolor="gray", fmt='d', ax=ax)  # counts are integers
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()


# Analysis

Based on the accuracy score alone, the second model performs better than the first. Because the classes are imbalanced, precision, recall and F1-score should also be compared before drawing a firm conclusion.


# Discussion

Interpret the results obtained from the image classification model

Analyze the implications of the findings in the context of the research questions or objectives

Address any limitations or constraints of the model and potential areas for improvement 


# Conclusion

Summarize the key findings and contributions of the image classification model

Discuss the implications of the results for the broader field of image classification and its potential applications
