# Breast Cancer Image Classification using Convolutional Neural Networks
Authors: Dereck Riley, Maria Santiago, Samuel Scott

Due: December 16, 2024

## Introduction

In this project, we explore a dataset of breast tissue images focused on Invasive Ductal Carcinoma (IDC), the most common type of breast cancer in women. The dataset consists of labeled images, each marked as either IDC (-) or IDC (+), indicating the absence or presence of cancerous tissue in the sample. The images are organized into one folder per patient ID, each containing two directories: folder 0 holds IDC-negative images (no cancer present) and folder 1 holds IDC-positive images (breast cancer present).

The objective of our project is to build a model that accurately differentiates between IDC-negative and IDC-positive images. By comparing the characteristics of these two categories, we aim to develop a reliable approach for identifying breast cancer cells. This analysis will let us explore patterns in the data that could improve diagnostic accuracy and support better detection of breast cancer in medical practice.

Our dataset can be found here: https://www.kaggle.com/datasets/paultimothymooney/breast-histopathology-images


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, ParameterGrid
import tensorflow as tf
from tensorflow.keras import models, layers, Input, Model
from tensorflow.keras.layers import Lambda
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
import tensorflow.keras.backend as K
from tensorflow.keras.preprocessing.image import load_img
from IPython.display import display, HTML

# Importing Data
import kagglehub

# Used for Preprocessing
import glob
import cv2
import random
from tensorflow.keras.utils import to_categorical

## Data Retrieval


In [None]:
# Download for the Data Set
path = kagglehub.dataset_download("paultimothymooney/breast-histopathology-images")
print("Path to dataset files:", path)


## Data Preprocessing

For the preprocessing stage of our project, we:

*   categorize the images as IDC(-) or IDC(+)
*   resize the images to a standard size (50 x 50 pixels)
*   reduce the number of IDC(-) images
*   label each image according to its class
*   split the data into training and testing sets

Preprocessing the data this way helps the model distinguish between the two classes, train on a mixture of both, make accurate predictions, and generalize to new data.
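
The step that reduces the number of IDC(-) images can be sketched as random undersampling of the larger class. This is a minimal illustration on made-up file names, not the notebook's exact procedure (the notebook keeps the first N paths rather than a random sample); the helper `undersample` and the seed are assumptions introduced here.

```python
import random

def undersample(majority, minority_size, seed=0):
    """Randomly keep `minority_size` items from the majority class."""
    rng = random.Random(seed)  # local generator so the sample is reproducible
    k = min(len(majority), minority_size)
    return rng.sample(majority, k)

# Hypothetical file names standing in for the real image paths
negatives = [f"neg_{i}.png" for i in range(10)]
positives = [f"pos_{i}.png" for i in range(4)]

balanced_negatives = undersample(negatives, len(positives))
print(len(balanced_negatives), len(positives))
```

Sampling at random instead of slicing avoids favoring whichever patients happen to come first in the file listing.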

Source Used for Segments of Data Preprocessing Code:

https://www.kaggle.com/code/thesnak/breast-cancer-classification-96-89

In [None]:
images = glob.glob(path + "/**/*.png", recursive=True)

In [None]:
N_IDC = []
P_IDC = []

for img in images:
    if img[-5] == '0' :
        N_IDC.append(img)

    elif img[-5] == '1' :
        P_IDC.append(img)

num_non = min(len(N_IDC), 18)
num_can = min(len(P_IDC), 18)
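
The check above relies on the class digit sitting five characters from the end of each file name. A slightly more defensive sketch, assuming the file names end in `class0.png` or `class1.png`, parses the label with a regular expression instead. The example paths are made up for illustration and `label_from_path` is a hypothetical helper.

```python
import os
import re

# Assumed naming pattern: "..._class0.png" or "..._class1.png"
CLASS_PATTERN = re.compile(r"class([01])\.png$")

def label_from_path(path):
    """Return 0 or 1 parsed from the file name, or None if it does not match."""
    match = CLASS_PATTERN.search(os.path.basename(path))
    return int(match.group(1)) if match else None

# Hypothetical example paths, for illustration only
example_paths = [
    "10253/0/10253_idx5_x1001_y1001_class0.png",
    "10253/1/10253_idx5_x501_y351_class1.png",
    "10253/notes.txt",
]
labels = [label_from_path(p) for p in example_paths]
print(labels)
```

Files that do not match the pattern return `None`, so they can be skipped explicitly instead of being silently ignored.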

In [None]:
non_img_arr = []
can_img_arr = []
# Downsample the IDC(-) class by keeping only the first 78,786 image paths
NewN_IDC = N_IDC[:78786]

for img in NewN_IDC:
    n_img = cv2.imread(img, cv2.IMREAD_COLOR)
    n_img_size = cv2.resize(n_img, (50, 50), interpolation = cv2.INTER_LINEAR)
    non_img_arr.append([n_img_size, 0])

for img in P_IDC:
    c_img = cv2.imread(img, cv2.IMREAD_COLOR)
    c_img_size = cv2.resize(c_img, (50, 50), interpolation = cv2.INTER_LINEAR)
    can_img_arr.append([c_img_size, 1])

In [None]:
X = []
y = []

# Extract image arrays and labels separately
non_img_data = [item[0] for item in non_img_arr]
non_img_labels = [item[1] for item in non_img_arr]
can_img_data = [item[0] for item in can_img_arr]
can_img_labels = [item[1] for item in can_img_arr]

# Concatenate image data and labels separately
X = np.concatenate((non_img_data, can_img_data))
y = np.concatenate((non_img_labels, can_img_labels))

# Shuffle data and labels together
combined_data = list(zip(X, y))
random.shuffle(combined_data)
X, y = zip(*combined_data)

X = np.array(X)
y = np.array(y)
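
Shuffling images and labels together can also be done with a single index permutation, which avoids building a Python list of (image, label) pairs for a large array. The sketch below uses tiny stand-in arrays (`X_demo`, `y_demo`) and an arbitrary seed; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # the seed is an arbitrary choice

# Tiny stand-ins for the image array and label vector
X_demo = np.arange(6 * 2 * 2 * 3).reshape(6, 2, 2, 3)
y_demo = np.array([0, 0, 0, 1, 1, 1])

# One permutation applied to both arrays keeps each image with its label
order = rng.permutation(len(X_demo))
X_shuffled = X_demo[order]
y_shuffled = y_demo[order]

print(X_shuffled.shape, y_shuffled)
```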

In [None]:
def describeData(a,b):
    print('Total number of images: {}'.format(len(a)))
    print('Number of IDC(-) Images: {}'.format(np.sum(b==0)))
    print('Number of IDC(+) Images: {}'.format(np.sum(b==1)))
    print('Image shape (Height, Width, Channels): {}'.format(a[0].shape))
describeData(X,y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

y_train = to_categorical(y_train, num_classes = 2)
y_test = to_categorical(y_test, num_classes = 2)

print("Training Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)
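
`to_categorical` turns integer class labels into one-hot rows. The NumPy sketch below reproduces that encoding for a made-up label vector (`y_demo`) to show what the model's two-unit softmax output is compared against; it is an illustration only and does not replace the Keras call.

```python
import numpy as np

y_demo = np.array([0, 1, 1, 0])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(2, dtype="float32")[y_demo]
print(one_hot)

# argmax along the class axis recovers the original integer labels
recovered = one_hot.argmax(axis=1)
```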

## Data Visualization

We display a random sample of IDC(-) and IDC(+) images from our dataset.

In [None]:
def display_random_images(X, y, num_images=5):

    # Select IDC(-) and IDC(+) indices
    negative_indices = [i for i in range(len(y)) if y[i] == 0]
    positive_indices = [i for i in range(len(y)) if y[i] == 1]

    random_negative_indices = random.sample(negative_indices, num_images)
    random_positive_indices = random.sample(positive_indices, num_images)

    # Set up the plot
    fig, axs = plt.subplots(2, num_images, figsize=(15, 6))

    # Display IDC(-) images
    for i, idx in enumerate(random_negative_indices):
        ax = axs[0, i]
        ax.imshow(cv2.cvtColor(X[idx], cv2.COLOR_BGR2RGB))  # Convert BGR to RGB for correct display
        ax.set_title("IDC(-)")
        ax.axis('off')

    # Display IDC(+) images
    for i, idx in enumerate(random_positive_indices):
        ax = axs[1, i]
        ax.imshow(cv2.cvtColor(X[idx], cv2.COLOR_BGR2RGB))
        ax.set_title("IDC(+)")
        ax.axis('off')

    plt.tight_layout()
    plt.show()

display_random_images(X, y, num_images=5)

## Machine Learning

Our baseline model serves as a reference point for evaluating the performance of our more advanced model. This simple approach gives us a sense of the accuracy that can be achieved and helps identify any issues with the data. Based on the baseline results, we can then make improvements in the advanced model by changing the number of convolutional layers, the activation functions, and other layer features in order to reach a higher accuracy.

### Baseline Model

In [None]:
K.clear_session()

In [None]:
input_shape = (50, 50, 3)

model = models.Sequential()
model.add(layers.Input(shape=input_shape))
model.add(layers.Conv2D(32, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D(2, 2))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(2, activation='softmax'))
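
To see how large the baseline is, the layer output sizes and parameter counts can be worked out by hand. The sketch below assumes the Keras defaults for `Conv2D` (stride 1, no padding) and non-overlapping 2x2 pooling; the helper names are made up for this illustration.

```python
def conv_out(size, kernel=3):
    """Output size of an unpadded convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output size of non-overlapping max pooling (any remainder is dropped)."""
    return size // pool

size = 50
size = conv_out(size)   # Conv2D(32, 3x3): 50 -> 48
size = pool_out(size)   # MaxPooling2D(2, 2): 48 -> 24
size = conv_out(size)   # Conv2D(64, 3x3): 24 -> 22
flat_units = size * size * 64

# weights + biases per layer
layer_params = {
    "conv1": 3 * 3 * 3 * 32 + 32,
    "conv2": 3 * 3 * 32 * 64 + 64,
    "dense1": flat_units * 64 + 64,
    "dense2": 64 * 2 + 2,
}
total_params = sum(layer_params.values())
print(flat_units, total_params)  # 30976 2002050
```

Almost all of the parameters sit in the first dense layer, because the flattened feature map is large when only one pooling layer is used.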

In [None]:
early_stopping = EarlyStopping(monitor='val_loss', patience=8,
                              restore_best_weights=True, verbose=1)
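
The idea behind `patience` can be illustrated with a few lines of plain Python. This is a simplified sketch of the concept on made-up validation losses, not Keras's actual implementation; `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience=8):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of losses."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # training stops here
    return len(val_losses) - 1, best_epoch  # ran out of epochs

# Made-up validation losses: improves for three epochs, then stalls
losses = [0.60, 0.50, 0.45, 0.47, 0.46, 0.48]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

With `restore_best_weights=True`, the weights kept after stopping are those from the best epoch, not from the last one.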

In [None]:
optimizer = tf.keras.optimizers.Adam()
model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.3,
                    callbacks=[early_stopping])

In [None]:
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f'test accuracy: {test_accuracy:.3g}')

The baseline test accuracy above serves as the benchmark that our complex model should exceed.

### Complex Model

In [None]:
K.clear_session()

We report the mean validation accuracy for each model tried in the random search in order to identify the model with the highest value.

In [None]:
def scores_to_dataframe(scores):
    """ Return hyperparameters and scores in a data frame. """

    params_list, acc_list = scores

    params = pd.DataFrame(params_list)
    accs = pd.Series(acc_list)
    scores_df = pd.concat([params, accs], axis=1)

    return scores_df

In [None]:
def get_model(conv_layers=3, num_filters=32, act_fun='relu', dropout_rate=0.5, dense_layers=3):
  input_shape = (50, 50, 3)
  inputs = Input(input_shape)
  x = inputs
  num_filters_doubled = num_filters * 2

  for i in range(conv_layers):
    if (i == 0):
      x = layers.Conv2D(num_filters, (3, 3), padding='same')(x)
      x = layers.BatchNormalization()(x)
      x = layers.Activation(act_fun)(x)
      x = layers.MaxPooling2D(2, 2)(x)
      x = layers.Dropout(dropout_rate)(x)

    else:
      x = layers.Conv2D(num_filters_doubled, (3, 3), padding='same')(x)
      x = layers.BatchNormalization()(x)
      x = layers.Activation(act_fun)(x)
      x = layers.MaxPooling2D(2, 2)(x)
      x = layers.Dropout(dropout_rate)(x)

  x = layers.Flatten()(x)

  for j in range(dense_layers):
    if (j == 0):
      x = layers.Dense(num_filters_doubled)(x)
      x = layers.BatchNormalization()(x)
      x = layers.Activation(act_fun)(x)
      x = layers.Dropout(dropout_rate)(x)

    else:
      x = layers.Dense(num_filters)(x)
      x = layers.BatchNormalization()(x)
      x = layers.Activation(act_fun)(x)
      x = layers.Dropout(dropout_rate)(x)

  x = layers.Dense(2, activation='softmax')(x)
  return Model(inputs, x)
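
Because each convolutional block ends in 2x2 max pooling, the number of blocks determines how small the feature map becomes before `Flatten`. The sketch below works this out for a 50x50 input, assuming `padding='same'` convolutions (size unchanged) and pooling that drops any remainder; the helper names are made up for illustration.

```python
def feature_map_sizes(conv_layers, input_size=50):
    """Spatial size after each conv block ('same' conv, then 2x2 max pooling)."""
    sizes = []
    size = input_size
    for _ in range(conv_layers):
        size = size // 2  # 'same' conv keeps the size; pooling halves it (floor)
        sizes.append(size)
    return sizes

def flatten_units(conv_layers, num_filters, input_size=50):
    """Number of values fed to the first Dense layer."""
    final_size = feature_map_sizes(conv_layers, input_size)[-1]
    # The first block uses num_filters; later blocks use num_filters * 2
    channels = num_filters if conv_layers == 1 else num_filters * 2
    return final_size * final_size * channels

print(feature_map_sizes(5))   # [25, 12, 6, 3, 1]
print(flatten_units(4, 64))   # 1152
```

With five blocks the feature map shrinks to 1x1, so five is the deepest setting that still makes sense for 50x50 inputs with this layout.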

In [None]:
default_params = {
    'conv_layers': 3,
    'num_filters': 32,
    'act_fun': 'relu',
    'dropout_rate': 0.5,
    'dense_layers': 3,
    'optimizer': 'adam',
}

In [None]:
def evaluate_params(params, verbose=1):
  # default parameters are used if not supplied
  pars = default_params.copy()
  pars.update(params)

  conv_layers = pars['conv_layers']
  num_filters = pars['num_filters']
  act_fun = pars['act_fun']
  dropout_rate = pars['dropout_rate']
  dense_layers = pars['dense_layers']
  optimizer = pars['optimizer']

  early_stopping = EarlyStopping(monitor='val_loss', patience=8,
                                restore_best_weights=True, verbose=1)
  reduce_lr_on_plateau = ReduceLROnPlateau(monitor='val_loss',
                                          patience=4, min_lr=0.000001, verbose=1)

  model = get_model(conv_layers=conv_layers, num_filters=num_filters, act_fun=act_fun,
                    dropout_rate=dropout_rate, dense_layers=dense_layers)

  model.compile(optimizer=optimizer,
                loss="binary_crossentropy",
                metrics=["accuracy"])

  history = model.fit(X_train, y_train, epochs=20,
                      callbacks=[early_stopping, reduce_lr_on_plateau],
                      batch_size=32, validation_split=0.3, verbose=verbose)

  mean_acc = np.mean(history.history['val_accuracy'][-2:])
  return pars, mean_acc, history

We set `num_tests=36`, which covers 50% of the 72 possible parameter combinations.
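
The count of possible combinations can be checked by enumerating the grid directly. The sketch below copies the search grid from this notebook into a local variable (`grid`) and uses `itertools.product` in place of scikit-learn's `ParameterGrid`, purely for illustration.

```python
from itertools import product

# Copy of the search grid used in this notebook
grid = {
    "conv_layers": [3, 4, 5],
    "num_filters": [32, 64],
    "act_fun": ["relu", "elu"],
    "dropout_rate": [0.5],
    "dense_layers": [2, 3],
    "optimizer": ["adam", "rmsprop", "nadam"],
}

# One dict per combination, built from the Cartesian product of the value lists
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
total = len(combinations)
fraction = 36 / total
print(total, fraction)  # 72 0.5
```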

In [None]:
def random_search(param_grid, num_tests=36, random_state=None, verbose=1):
  # create a list of unique parameter combinations
  param_combs = list(ParameterGrid(param_grid))
  if len(param_combs) < num_tests:
    num_tests = len(param_combs)
  # sample without replacement; passing random_state makes the draw reproducible
  rng = np.random.default_rng(random_state)
  random_combs = rng.choice(param_combs, size=num_tests, replace=False)
  # evaluate each of the combinations
  params_list = []
  acc_list = []
  for params in random_combs:
    print(f"params: {params}")
    pars, acc, history = evaluate_params(params, verbose=verbose)
    params_list.append(pars)
    acc_list.append(acc)
  return params_list, acc_list

In [None]:
param_grid = {
  "conv_layers": [3, 4, 5],
  "num_filters": [32, 64],
  "act_fun": ['relu', 'elu'],
  "dropout_rate": [0.5],
  "dense_layers": [2, 3],
  "optimizer": ['adam', 'rmsprop', 'nadam'],
}

results = random_search(param_grid)

In [None]:
scores_to_dataframe(results)

Best Model: Test 4

In [None]:
# Assign the result so the cells below train the complex model, not the baseline
model = get_model(conv_layers=4, num_filters=64, act_fun='relu', dropout_rate=0.5, dense_layers=3)

In [None]:
early_stopping = EarlyStopping(monitor='val_loss', patience=8, restore_best_weights=True, verbose=1)
reduce_lr_on_plateau = ReduceLROnPlateau(monitor='val_loss', patience=4, min_lr=0.000001, verbose=1)

In [None]:
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])

In [None]:
history = model.fit(X_train, y_train, epochs=50,
                    callbacks=[early_stopping, reduce_lr_on_plateau],
                    batch_size=32, validation_split=0.3, verbose=1)

In [None]:
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f'test accuracy: {test_accuracy:.3g}')

## Conclusion



We found that the best parameters for our complex model are {conv_layers = 4, num_filters = 64, act_fun = 'relu', dropout_rate = 0.5, dense_layers = 3}, achieving an accuracy of 88.54%. In our complex model, the first convolutional layer uses `num_filters` filters and every subsequent convolutional layer uses twice that number (64 and 128 for the best model). To improve performance and reduce overfitting, we incorporated BatchNormalization(), MaxPooling2D(), and Dropout() layers. To tune the hyperparameters, we performed a random search that sampled 36 of the 72 unique combinations in our parameter grid. Additionally, we used two callbacks, EarlyStopping and ReduceLROnPlateau, which helped reduce overfitting and provided incremental improvements in accuracy.

Our model significantly outperformed the baseline model, which matters for a dataset involving IDC, the most common form of breast cancer. The model predicts 0 for "no breast cancer" and 1 for "breast cancer." Given the life-and-death implications of early and accurate breast cancer detection, achieving high accuracy is crucial. Our results suggest that this complex model could potentially assist in reliable IDC detection and help in the fight against breast cancer.

One potential improvement is data augmentation, since we trained on only a sample of the whole dataset. Applying transformations such as rotations or zooming would increase the diversity of the training set and help the model generalize better. In addition, instead of random search for hyperparameter tuning, we could have used Bayesian optimization, which explores the parameter space more efficiently, or an exhaustive grid search, either of which might have yielded better accuracy.
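
The augmentation idea mentioned above can be sketched with NumPy alone. This is a minimal illustration on a random stand-in patch, limited to flips and rotations by multiples of 90 degrees (zooming is not shown); the `augment` helper and the seed are assumptions and were not part of our experiments.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped and quarter-turn-rotated copy of a square image."""
    if rng.random() < 0.5:
        image = np.fliplr(image)          # horizontal flip
    if rng.random() < 0.5:
        image = np.flipud(image)          # vertical flip
    k = int(rng.integers(0, 4))           # 0, 1, 2, or 3 quarter turns
    return np.rot90(image, k=k, axes=(0, 1)).copy()

rng = np.random.default_rng(0)  # arbitrary seed

# Random stand-in for a 50x50 RGB image patch
patch = rng.integers(0, 256, size=(50, 50, 3), dtype=np.uint8)
augmented = augment(patch, rng)
print(augmented.shape)
```

These transformations only rearrange pixels, so the label of the patch is unchanged and the augmented copy can be added to the training set with the same class.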
