# Breast Cancer Detection Using Pre-trained Convolutional Neural Networks

## Introduction

This project explores whether deep learning can help radiologists detect breast cancer in mammograms more accurately. We use the **CBIS-DDSM dataset**, a curated subset of the Digital Database for Screening Mammography, which contains annotated images of breast tissue with known outcomes (benign or malignant).

Instead of building a deep learning model from scratch, we use **MobileNetV2**, a convolutional neural network pre-trained on ImageNet, a large collection of general-purpose images. We apply **transfer learning**: the lower layers, which extract generic patterns such as edges and textures, are kept frozen, and only the final classification layers are re-trained on the mammogram images.

MobileNetV2 is a lightweight architecture (roughly 3.5 million parameters in its standard ImageNet configuration), which makes it well suited to fast experimentation and to small datasets such as CBIS-DDSM. This approach lets us reuse existing computer vision knowledge while tailoring the model to the specific task of medical image classification.
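
The transfer-learning setup described above can be sketched as follows. This is a minimal illustration, not the project's actual model: it assumes TensorFlow/Keras as the framework, 224x224 inputs with three channels (grayscale mammograms would need to be replicated across channels), pixel values in [0, 255], and a hypothetical helper name `build_transfer_model`. The dropout rate and learning rate are placeholder values.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_transfer_model(input_shape=(224, 224, 3), weights="imagenet"):
    """Binary classifier on top of a frozen MobileNetV2 feature extractor."""
    # Pre-trained convolutional base without its ImageNet classification head.
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=weights
    )
    base.trainable = False  # keep the pre-trained feature extractor fixed

    inputs = tf.keras.Input(shape=input_shape)
    # MobileNetV2 expects pixel values in [-1, 1]; inputs are assumed to be in [0, 255].
    x = layers.Rescaling(1.0 / 127.5, offset=-1.0)(inputs)
    x = base(x, training=False)  # run batch-norm layers in inference mode
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.2)(x)  # placeholder rate, not a tuned value
    outputs = layers.Dense(1, activation="sigmoid")(x)  # estimated P(malignant)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model
```

Calling `build_transfer_model()` downloads the ImageNet weights on first use. With the base frozen, only the final dense layer's kernel and bias are updated during training.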

The goal is to build a **binary classifier** that can distinguish between benign and malignant findings. If successful, this prototype could serve as the foundation for clinical decision support tools that improve diagnostic accuracy and reduce human error in breast cancer screening.


## Dataset Description

The Curated Breast Imaging Subset of DDSM (CBIS-DDSM) provides mammographic images with expert radiologist annotations. The dataset includes two types of abnormalities:

- **Calcifications**: Small calcium deposits appearing as bright spots
- **Masses**: Larger tissue abnormalities with varying shapes and densities

Each case carries a pathology-based ground truth label, which makes supervised learning possible.
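
Turning those pathology labels into a binary target might look like the sketch below. It assumes the label column is named `pathology` and takes the values `BENIGN`, `BENIGN_WITHOUT_CALLBACK`, and `MALIGNANT`. Collapsing both benign variants into one class is a modelling choice made here for illustration. The helper name and the toy rows are hypothetical.

```python
import pandas as pd

# Assumed mapping: both benign variants collapse into the negative class.
PATHOLOGY_TO_LABEL = {
    "BENIGN": 0,
    "BENIGN_WITHOUT_CALLBACK": 0,
    "MALIGNANT": 1,
}


def add_binary_label(cases: pd.DataFrame, column: str = "pathology") -> pd.DataFrame:
    """Return a copy of `cases` with an integer `label` column (0 = benign, 1 = malignant)."""
    labels = cases[column].str.strip().str.upper().map(PATHOLOGY_TO_LABEL)
    # Fail loudly on values the mapping does not cover (including missing values).
    unknown = cases.loc[labels.isna(), column].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unmapped pathology values: {list(unknown)}")
    return cases.assign(label=labels.astype("int64"))


# Toy rows for illustration only; not real CBIS-DDSM records.
toy_cases = pd.DataFrame(
    {"pathology": ["MALIGNANT", "BENIGN", "BENIGN_WITHOUT_CALLBACK"]}
)
labeled = add_binary_label(toy_cases)
print(labeled["label"].tolist())
```

Raising on unmapped values, rather than silently producing missing labels, keeps labelling errors from reaching the training set.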

## Data Loading and Initial Setup

The first phase involves loading the necessary libraries and establishing the dataset structure. The CBIS-DDSM data is organized into CSV metadata files containing case information and JPEG directories containing the actual mammographic images.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from pathlib import Path
from IPython.display import display  # explicit import so display() also works outside notebooks

# Configure display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("Required libraries loaded successfully")

In [None]:
# Define data paths
BASE_PATH = Path('../data/kaggle')
CSV_PATH = BASE_PATH / 'csv'
JPEG_PATH = BASE_PATH / 'jpeg'

print("Data path configuration:")
print(f"Base directory: {BASE_PATH}")
print(f"CSV files: {CSV_PATH}")
print(f"Image files: {JPEG_PATH}")

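# Verify that every configured path exists, not only the base directory
missing_paths = [path for path in (BASE_PATH, CSV_PATH, JPEG_PATH) if not path.exists()]
if not missing_paths:
    print("Data paths verified")
else:
    print(f"Warning: missing paths: {[str(path) for path in missing_paths]}")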

## Dataset Structure Exploration

Understanding the dataset organization is essential before processing. This section examines the available files and their structure to inform subsequent data handling decisions.

In [None]:
# Examine available CSV metadata files
csv_files = list(CSV_PATH.glob('*.csv'))

print("Available metadata files:")
for file in csv_files:
    file_size = file.stat().st_size / 1024
    print(f"- {file.name} ({file_size:.1f} KB)")

print(f"\nTotal files: {len(csv_files)}")

In [None]:
# Load DICOM information file
dicom_info = pd.read_csv(CSV_PATH / 'dicom_info.csv')

print("DICOM Information Dataset:")
print(f"Shape: {dicom_info.shape}")
print(f"Columns: {list(dicom_info.columns)}")

# Display first few rows
print("\nFirst 5 rows:")
display(dicom_info.head())

In [None]:
# Analyze image series types
print("Series Description Analysis:")
series_counts = dicom_info['SeriesDescription'].value_counts()

for series_type, count in series_counts.items():
    percentage = (count / len(dicom_info)) * 100
    print(f"- {series_type}: {count} images ({percentage:.1f}%)")

print("\nStudy Description Analysis:")
study_counts = dicom_info['StudyDescription'].value_counts()
for study_type, count in study_counts.items():
    print(f"- {study_type}: {count} images")
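
The count-and-percentage breakdown above can also be wrapped in a small reusable helper, sketched here under the assumption that a tabular summary is preferred over printed lines. The function name `summarize_categories` and the toy metadata are hypothetical.

```python
import pandas as pd


def summarize_categories(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Count and percentage share of each value in `column`, most frequent first."""
    counts = frame[column].value_counts(dropna=False)
    summary = pd.DataFrame({"count": counts})
    summary["percent"] = (counts / len(frame) * 100).round(1)
    return summary


# Toy metadata for illustration only.
toy_info = pd.DataFrame(
    {
        "SeriesDescription": [
            "cropped images",
            "ROI mask images",
            "full mammogram images",
            "cropped images",
        ]
    }
)
series_summary = summarize_categories(toy_info, "SeriesDescription")
print(series_summary)
```

Passing `dropna=False` keeps rows with a missing description visible in the summary instead of silently excluding them.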

## Data Cleaning and Path Correction

The raw metadata needs two fixes before use: dropping columns that are not needed for this analysis (leftover index columns and descriptive fields) and rewriting the image paths to match the local directory layout.

In [None]:
# Clean DICOM information dataset
print("Data cleaning:")
print(f"Original shape: {dicom_info.shape}")

# Remove unnecessary columns
columns_to_drop = [
    'Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1', 'Unnamed: 0.1.1.1',
    'Modality', 'StudyDescription'
]

existing_columns_to_drop = [col for col in columns_to_drop if col in dicom_info.columns]
dicom_info_clean = dicom_info.drop(columns=existing_columns_to_drop)

print(f"Cleaned shape: {dicom_info_clean.shape}")
print(f"Removed columns: {existing_columns_to_drop}")
print(f"Remaining columns: {list(dicom_info_clean.columns)}")
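
As an alternative to listing each leftover index column by name, the `Unnamed: ...` columns can be detected by prefix. The sketch below assumes that every column whose name starts with `Unnamed:` is an export artifact. The helper name `drop_index_artifacts` and the toy frame are hypothetical.

```python
import pandas as pd


def drop_index_artifacts(frame: pd.DataFrame, extra=()) -> pd.DataFrame:
    """Drop 'Unnamed: ...' columns plus any `extra` columns that are present."""
    unnamed = [col for col in frame.columns if str(col).startswith("Unnamed:")]
    # errors="ignore" skips names in `extra` that the frame does not contain.
    return frame.drop(columns=unnamed + list(extra), errors="ignore")


# Toy frame for illustration only.
toy_frame = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1],
        "Unnamed: 0.1": [0, 1],
        "Modality": ["MG", "MG"],
        "PatientID": ["A", "B"],
    }
)
cleaned = drop_index_artifacts(toy_frame, extra=["Modality", "StudyDescription"])
print(list(cleaned.columns))
```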

In [None]:
# Correct image paths for local directory structure
print("Path correction:")

# Show original path format
original_path = dicom_info_clean['image_path'].iloc[0]
print(f"Original: {original_path}")

# Update paths to match local structure (vectorized, literal match, tolerant of missing values)
dicom_info_clean['image_path_corrected'] = dicom_info_clean['image_path'].str.replace(
    'CBIS-DDSM/jpeg', str(JPEG_PATH), regex=False
)

corrected_path = dicom_info_clean['image_path_corrected'].iloc[0]
print(f"Corrected: {corrected_path}")

# Spot-check the corrected paths on the first rows (not an exhaustive check)
sample_size = min(50, len(dicom_info_clean))
existing_paths = sum(
    Path(path).exists()
    for path in dicom_info_clean['image_path_corrected'].head(sample_size)
)

print(f"Verification: {existing_paths}/{sample_size} sampled paths exist")
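
The check above samples only the first rows. A fuller check, sketched below, returns every entry that does not resolve to a file, so missing images can be inspected or dropped before training. The helper name `find_missing_files` and the temporary files are hypothetical and stand in for the real image paths.

```python
import tempfile
from pathlib import Path

import pandas as pd


def find_missing_files(paths: pd.Series) -> pd.Series:
    """Return the entries of `paths` that do not point to an existing file."""
    # Non-string entries (for example missing values) count as missing.
    exists = paths.map(lambda p: isinstance(p, str) and Path(p).is_file())
    return paths[~exists.astype(bool)]


# Toy check in a temporary directory; the file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "present.jpg"
    present.touch()
    toy_paths = pd.Series([str(present), str(Path(tmp) / "absent.jpg"), None])
    missing = find_missing_files(toy_paths)
    print(f"{len(missing)} of {len(toy_paths)} paths are missing")
```

The returned Series keeps the original index, so it can be used directly to filter the metadata table.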

In [None]:
# Keep only the cropped abnormality images (tight crops around each finding)
TARGET_IMAGE_TYPE = 'cropped images'
cropped_images = dicom_info_clean[dicom_info_clean['SeriesDescription'] == TARGET_IMAGE_TYPE].copy()

print("Image filtering results:")
print(f"Original dataset: {len(dicom_info_clean):,} images")
print(f"Cropped images: {len(cropped_images):,} images")
print(f"Retention rate: {len(cropped_images)/len(dicom_info_clean)*100:.1f}%")

print("\nFiltered dataset sample:")
display(cropped_images[['PatientID', 'image_path_corrected']].head())
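
The `PatientID` field in this metadata may hold a per-case or per-image identifier rather than a bare patient code, in which case `nunique()` on it would overcount patients. The sketch below assumes that identifiers embed a token of the form `P_` followed by digits. This assumption should be checked against the actual values. The helper name and the toy identifiers are made up for illustration.

```python
import pandas as pd


def extract_patient_key(patient_ids: pd.Series) -> pd.Series:
    """Pull a 'P_<digits>' token out of each identifier; NaN when absent."""
    return patient_ids.str.extract(r"(P_\d+)", expand=False)


# Hypothetical identifier formats, for illustration only.
toy_ids = pd.Series(
    [
        "Mass-Training_P_01265_RIGHT_MLO_1",
        "Calc-Test_P_00038_LEFT_CC_1",
        "P_01265",
    ]
)
patient_keys = extract_patient_key(toy_ids)
print(patient_keys.tolist())
print(f"Unique patients: {patient_keys.nunique()}")
```

Counting patients on the extracted key rather than the raw identifier also helps later when splitting data so that images from one patient do not land in both training and validation sets.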

## Clinical Case Information Loading

The clinical case description files contain the pathology information required for supervised learning. The cell below loads the training-split files for both abnormality types, which supply the ground truth labels for model training.

In [None]:
# Load clinical case descriptions
print("Loading clinical case information:")

# Load calcification cases
calc_cases = pd.read_csv(CSV_PATH / 'calc_case_description_train_set.csv')
print(f"Calcification cases: {calc_cases.shape}")

# Load mass cases
mass_cases = pd.read_csv(CSV_PATH / 'mass_case_description_train_set.csv')
print(f"Mass cases: {mass_cases.shape}")

print(f"Total clinical cases: {len(calc_cases) + len(mass_cases):,}")

print("\nCalcification cases sample:")
display(calc_cases.head(3))

print("\nMass cases sample:")
display(mass_cases.head(3))
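
Because the calcification and mass tables describe their findings with different columns, one way to work with them together is to stack only the columns they share and tag each row with its source. The sketch below assumes both tables expose `patient_id` and `pathology` columns. The helper name `combine_cases`, the `case_source` column, and the toy rows are hypothetical.

```python
import pandas as pd


def combine_cases(calc: pd.DataFrame, mass: pd.DataFrame) -> pd.DataFrame:
    """Stack both case tables, keeping shared columns and tagging the source table."""
    shared = [col for col in calc.columns if col in mass.columns]
    return pd.concat(
        [
            calc[shared].assign(case_source="calcification"),
            mass[shared].assign(case_source="mass"),
        ],
        ignore_index=True,
    )


# Toy tables for illustration only; column names are assumptions.
toy_calc = pd.DataFrame(
    {
        "patient_id": ["P_1", "P_2"],
        "pathology": ["MALIGNANT", "BENIGN"],
        "calc type": ["A", "B"],
    }
)
toy_mass = pd.DataFrame(
    {"patient_id": ["P_3"], "pathology": ["MALIGNANT"], "mass shape": ["OVAL"]}
)
combined = combine_cases(toy_calc, toy_mass)
# Class balance per abnormality type.
print(pd.crosstab(combined["case_source"], combined["pathology"]))
```

The cross-tabulation gives a quick view of class balance per abnormality type, which is worth checking before training a binary classifier.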

## Data Loading Summary

The dataset has been successfully loaded and prepared for analysis. The following components are now available:

- **Image metadata**: Cleaned DICOM information with corrected file paths
- **Filtered images**: Cropped images suitable for CNN training
- **Clinical labels**: Pathological information for both calcification and mass cases

This foundation enables the next phase of data preprocessing and model development.

In [None]:
# Final data loading summary
print("Data Loading Complete")
print("=" * 40)
print(f"Cropped images available: {len(cropped_images):,}")
print(f"Unique patients: {cropped_images['PatientID'].nunique():,}")
print(f"Calcification cases: {len(calc_cases):,}")
print(f"Mass cases: {len(mass_cases):,}")
print(f"Total clinical cases: {len(calc_cases) + len(mass_cases):,}")

print("\nDataset ready for:")
print("- Label creation and validation")
print("- Image preprocessing and augmentation")
print("- Model training pipeline development")

print("\nNext phase: Data preprocessing and model architecture design")