# 🔬 Skin Cancer Detection System - Complete Analysis

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/thedatadudech/skin-cancer-detection/blob/main/skincancer_detector.ipynb)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-blue?logo=github)](https://github.com/thedatadudech/skin-cancer-detection)

This notebook walks through the development of a skin lesion classifier built with PyTorch, from data exploration through model training to saving the best model.

## 📋 What You'll Learn
- **Data Analysis**: Explore the HAM10000 medical imaging dataset
- **Model Development**: Compare EfficientNet, ResNet, and custom CNN architectures
- **Training Pipeline**: Implement robust training with data augmentation
- **Performance Evaluation**: Analyze model performance across 7 skin lesion types
- **Model Export**: Save the best-performing model for later use

## 🎯 Medical Context
Early detection of skin cancer, particularly melanoma, is crucial for successful treatment. This system assists healthcare professionals in preliminary screening by analyzing dermoscopic images.

**⚠️ Medical Disclaimer**: This tool is for educational and research purposes only. Always consult qualified healthcare professionals for medical diagnosis.

## 🚀 Quick Start
1. Click the "Open in Colab" button above
2. Run the setup cells to install dependencies
3. Download the HAM10000 dataset (instructions below)
4. Execute the training pipeline

---

## 🔧 Google Colab Setup

If running on Google Colab, execute the following cells to set up the environment:

In [None]:
# Install dependencies for Google Colab
import sys

if 'google.colab' in sys.modules:
    print("🚀 Setting up Google Colab environment...")
    
    # Install required packages (specifiers are quoted so the shell does not treat ">" as a redirect)
    !pip install -q "torch>=2.5.1" "torchvision>=0.20.1"
    !pip install -q "streamlit>=1.41.1" "pandas>=2.2.3" "numpy>=2.2.2"
    !pip install -q "scikit-learn>=1.6.1" "Pillow==10.0.0" "tqdm>=4.67.1"
    !pip install -q matplotlib seaborn
    
    # Clone the repository
    !git clone https://github.com/thedatadudech/skin-cancer-detection.git
    %cd skin-cancer-detection
    
    print("✅ Environment setup complete!")
else:
    print("📝 Running in local environment")

In [None]:
# Dataset setup for Google Colab
import os
import sys  # needed for the Colab check when this cell runs on its own

if 'google.colab' in sys.modules:
    print("📊 Setting up dataset for Colab...")
    
    # Create data directories
    os.makedirs('data/images', exist_ok=True)
    
    print("""📥 Dataset Download Instructions:
    
    To use this notebook, you need to download the HAM10000 dataset:
    
    1. Visit: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T
    2. Download:
       - HAM10000_images_part1.zip
       - HAM10000_images_part2.zip  
       - HAM10000_metadata.csv
    3. Upload to Colab and extract:
       - Extract zip files to data/images/
       - Place HAM10000_metadata.csv in data/
    
    Or use the sample dataset for demonstration purposes.""")
    
    # Option to create sample data for demonstration
    create_sample = input("Create sample data for demonstration? (y/n): ")
    if create_sample.lower() == 'y':
        # Create minimal sample dataset for demonstration
        import pandas as pd
        import numpy as np
        from PIL import Image
        
        # Create sample metadata
        sample_data = {
            'image_id': [f'sample_{i:03d}' for i in range(50)],
            'dx': np.random.choice(['nv', 'mel', 'bkl', 'bcc', 'akiec', 'vasc', 'df'], 50),
            'age': np.random.randint(20, 80, 50),
            'sex': np.random.choice(['male', 'female'], 50)
        }
        
        df_sample = pd.DataFrame(sample_data)
        df_sample.to_csv('data/HAM10000_metadata.csv', index=False)
        
        # Create sample images
        for img_id in sample_data['image_id']:
            # Create random RGB image
            img_array = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
            img = Image.fromarray(img_array)
            img.save(f'data/images/{img_id}.jpg')
        
        print("✅ Sample dataset created for demonstration")
else:
    print("📂 Using local dataset")
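The manual "upload and extract" step can also be scripted. The sketch below is one possible way to do it, assuming the archives were uploaded to the working directory under the names listed above and contain `.jpg` files. `extract_archives` and `missing_images` are hypothetical helper names, not part of the project.

```python
import csv
import zipfile
from pathlib import Path


def extract_archives(archives, image_dir="data/images"):
    """Extract every .jpg from the given zip archives into image_dir (flattened)."""
    image_dir = Path(image_dir)
    image_dir.mkdir(parents=True, exist_ok=True)
    extracted = 0
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            for member in zf.infolist():
                if member.is_dir() or not member.filename.lower().endswith(".jpg"):
                    continue
                # Drop any folder prefix so every image lands directly in image_dir
                target = image_dir / Path(member.filename).name
                target.write_bytes(zf.read(member))
                extracted += 1
    return extracted


def missing_images(metadata_csv, image_dir="data/images"):
    """Return image_ids listed in the metadata that have no matching .jpg file."""
    image_dir = Path(image_dir)
    with open(metadata_csv, newline="") as fh:
        ids = [row["image_id"] for row in csv.DictReader(fh)]
    return [i for i in ids if not (image_dir / f"{i}.jpg").exists()]
```

A typical call would be `extract_archives(["HAM10000_images_part1.zip", "HAM10000_images_part2.zip"])` followed by `missing_images("data/HAM10000_metadata.csv")`; an empty list from the second call means every metadata row has an image on disk.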

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import os
import sys
from pathlib import Path

# PyTorch imports
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import torchvision.transforms as transforms
from torchvision import models

# ML utilities
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from tqdm import tqdm

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)

# Configure matplotlib
plt.style.use('default')
sns.set_palette("husl")

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 1. Data Loading and Initial Exploration

In [None]:
# Load and explore metadata
try:
    df = pd.read_csv('data/HAM10000_metadata.csv')
    print(f"✅ Dataset loaded successfully!")
    print(f"Dataset shape: {df.shape}")
    print(f"\nColumns: {list(df.columns)}")
    
    # Display first few rows
    display(df.head())
    
except FileNotFoundError:
    print("❌ Dataset not found. Please follow the dataset setup instructions above.")
    df = None

In [None]:
if df is not None:
    # Define class mappings
    class_names = {
        'akiec': 'Actinic keratoses',
        'bcc': 'Basal cell carcinoma', 
        'bkl': 'Benign keratosis',
        'df': 'Dermatofibroma',
        'mel': 'Melanoma',
        'nv': 'Melanocytic nevi',
        'vasc': 'Vascular lesions'
    }
    
    # Display class distribution
    print("📊 Class Distribution:")
    class_counts = df['dx'].value_counts()
    for code, count in class_counts.items():
        print(f"{class_names.get(code, code)}: {count} ({count/len(df)*100:.1f}%)")
    
    # Visualize class distribution
    plt.figure(figsize=(12, 6))
    ax = sns.countplot(data=df, x='dx', order=class_counts.index)
    plt.title('Distribution of Skin Lesion Types in HAM10000 Dataset', fontsize=14, fontweight='bold')
    plt.xlabel('Lesion Type')
    plt.ylabel('Number of Images')
    
    # Add count labels on bars
    for i, v in enumerate(class_counts.values):
        ax.text(i, v + 50, str(v), ha='center', va='bottom')
    
    # Update x-axis labels with full names
    ax.set_xticklabels([class_names.get(code, code) for code in class_counts.index], 
                       rotation=45, ha='right')
    
    plt.tight_layout()
    plt.show()
    
    # Additional statistics
    print(f"\n📈 Dataset Statistics:")
    print(f"Total images: {len(df):,}")
    print(f"Number of classes: {df['dx'].nunique()}")
    print(f"Age range: {df['age'].min():.0f} - {df['age'].max():.0f} years")
    print(f"Gender distribution: {df['sex'].value_counts().to_dict()}")
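If the distribution printed above turns out to be skewed toward one class, as is commonly reported for HAM10000, one common mitigation is to weight the loss by inverse class frequency. The sketch below shows that calculation in plain Python. `balanced_class_weights` is a hypothetical helper and the labels are toy values, not real dataset counts.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}


# Toy labels for illustration only (not real HAM10000 counts)
toy_labels = ["nv"] * 6 + ["mel"] * 3 + ["df"] * 1
weights = balanced_class_weights(toy_labels)
print(weights)
```

With the notebook's data this could be called as `balanced_class_weights(df['dx'].tolist())`. The resulting values, ordered by label index and converted to a tensor, can then be passed to the `weight` argument of `nn.CrossEntropyLoss` so that rare classes contribute more to the loss.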

## 2. Image Analysis

In [None]:
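def analyze_image_properties(data_dir, sample_size=100):
    """Return the widths and heights of a random sample of images in data_dir."""
    image_files = [f for f in os.listdir(data_dir)
                   if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
    # Sample without replacement, capped at the number of available images
    sample_size = min(sample_size, len(image_files))
    sampled = np.random.choice(image_files, sample_size, replace=False)
    widths, heights = [], []

    for img_file in sampled:
        # The context manager closes each file once its size has been read
        with Image.open(os.path.join(data_dir, img_file)) as img:
            widths.append(img.size[0])
            heights.append(img.size[1])

    return widths, heights

widths, heights = analyze_image_properties('data/images')

plt.figure(figsize=(12, 4))
plt.subplot(121)
plt.hist(widths)
plt.title('Image Width Distribution')
plt.xlabel('Width (pixels)')
plt.subplot(122)
plt.hist(heights)
plt.title('Image Height Distribution')
plt.xlabel('Height (pixels)')
plt.tight_layout()
plt.show()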

## 3. Data Preparation

In [None]:
# Use project modules if available, otherwise implement inline
try:
    from src.data_loader import DataLoader as ProjectDataLoader  # alias keeps torch's DataLoader intact
    from src.model import create_model
    from src.preprocessing import get_data_transforms
    print("✅ Using project modules")
    use_project_modules = True
except ImportError:
    print("📝 Using inline implementations for Colab compatibility")
    use_project_modules = False

if df is not None:
    # Create label mapping (sorted so the class indices are stable across runs)
    label_mapping = {label: idx for idx, label in enumerate(sorted(df['dx'].unique()))}
    print(f"\n🏷️ Label mapping: {label_mapping}")
    
    # Add numeric labels to dataframe
    df['label'] = df['dx'].map(label_mapping)
    
    # Split data
    train_df, temp_df = train_test_split(df, test_size=0.3, stratify=df['label'], random_state=42)
    val_df, test_df = train_test_split(temp_df, test_size=0.5, stratify=temp_df['label'], random_state=42)
    
    print(f"\n📊 Data splits:")
    print(f"Training: {len(train_df)} images")
    print(f"Validation: {len(val_df)} images")
    print(f"Test: {len(test_df)} images")
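One caveat with the row-level split above: the full HAM10000 metadata includes a `lesion_id` column, and some lesions were photographed more than once, so images of the same lesion can end up in both the training and test sets. The sketch below shows one way to split by lesion instead. It is not what the notebook does, `split_by_group` is a hypothetical helper, it does not stratify by class, and it does not apply to the generated sample dataset, which has no `lesion_id` column.

```python
import random


def split_by_group(group_ids, val_frac=0.15, test_frac=0.15, seed=42):
    """Assign each row index to train/val/test so that no group spans two splits."""
    groups = sorted(set(group_ids))
    random.Random(seed).shuffle(groups)
    n_test = round(len(groups) * test_frac)
    n_val = round(len(groups) * val_frac)
    test_groups = set(groups[:n_test])
    val_groups = set(groups[n_test:n_test + n_val])
    splits = {"train": [], "val": [], "test": []}
    for idx, gid in enumerate(group_ids):
        if gid in test_groups:
            splits["test"].append(idx)
        elif gid in val_groups:
            splits["val"].append(idx)
        else:
            splits["train"].append(idx)
    return splits


# Toy lesion ids: lesions "A" and "C" each have more than one image
lesions = ["A", "A", "B", "C", "C", "C", "D", "E", "F", "G"]
splits = split_by_group(lesions)
print({name: len(rows) for name, rows in splits.items()})
```

On the real metadata this could be used as `splits = split_by_group(df['lesion_id'].tolist())`, followed by `train_df = df.iloc[splits['train']]` and likewise for the other two splits.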

## 4. Model Development and Training

We'll implement and compare three different architectures:
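- **EfficientNet-B0**: Compact ImageNet-pretrained network, fine-tuned with a new 7-class head
- **ResNet-50**: Widely used ImageNet-pretrained residual network, fine-tuned the same way
- **Custom CNN**: Small three-block convolutional baseline trained from scratch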

In [None]:
# Define model architectures
def create_efficientnet_model(num_classes=7):
    """Create EfficientNet-B0 model with transfer learning"""
    model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)  # ImageNet weights; pretrained=True is deprecated
    model.classifier[1] = nn.Linear(model.classifier[1].in_features, num_classes)
    return model

def create_resnet_model(num_classes=7):
    """Create ResNet-50 model with transfer learning"""
    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)  # ImageNet weights; pretrained=True is deprecated
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

class CustomCNN(nn.Module):
    """Custom CNN for comparison"""
    def __init__(self, num_classes=7):
        super(CustomCNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d((7, 7)),
            nn.Flatten(),
            nn.Linear(128 * 7 * 7, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )
    
    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Model configurations
NUM_CLASSES = 7
BATCH_SIZE = 16  # Reduced for better Colab compatibility
EPOCHS = 5  # Reduced for demonstration
LEARNING_RATE = 1e-4

print(f"🔧 Training Configuration:")
print(f"Number of classes: {NUM_CLASSES}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Epochs: {EPOCHS}")
print(f"Learning rate: {LEARNING_RATE}")
print(f"Device: {device}")
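For a sense of scale, the size of `CustomCNN` can be worked out by hand from its layer shapes. The sketch below repeats that arithmetic in plain Python, without PyTorch; the helper names are illustrative only.

```python
def conv2d_params(in_ch, out_ch, kernel):
    """Weights (in_ch * k * k per filter) plus one bias per output channel."""
    return in_ch * kernel * kernel * out_ch + out_ch


def linear_params(in_features, out_features):
    """Weight matrix plus bias vector."""
    return in_features * out_features + out_features


# Layer shapes copied from the CustomCNN definition above
conv_total = (conv2d_params(3, 32, 3)
              + conv2d_params(32, 64, 3)
              + conv2d_params(64, 128, 3))
head_total = linear_params(128 * 7 * 7, 512) + linear_params(512, 7)

print(f"Convolutional layers: {conv_total:,}")               # 93,248
print(f"Classifier head:      {head_total:,}")               # 3,215,367
print(f"Total:                {conv_total + head_total:,}")  # 3,308,615
```

Almost all of the parameters sit in the first fully connected layer. The same total should be reported by `sum(p.numel() for p in CustomCNN().parameters())`, since the activation, pooling, flatten, and dropout layers have no parameters.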

In [None]:
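# Training configuration (PyTorch)
from torch.utils.data import Dataset, DataLoader  # re-import guarantees torch's DataLoader is used

IMAGENET_MEAN, IMAGENET_STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

# Augmentation is applied to training images only
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(20),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

class SkinLesionDataset(Dataset):
    """Loads the images listed in a metadata frame from <image_dir>/<image_id>.jpg"""
    def __init__(self, frame, image_dir='data/images', transform=None):
        self.frame = frame.reset_index(drop=True)
        self.image_dir = Path(image_dir)
        self.transform = transform

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]
        image = Image.open(self.image_dir / f"{row['image_id']}.jpg").convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
        return image, int(row['label'])

train_loader = DataLoader(SkinLesionDataset(train_df, transform=train_transform),
                          batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(SkinLesionDataset(val_df, transform=eval_transform),
                        batch_size=BATCH_SIZE)
test_loader = DataLoader(SkinLesionDataset(test_df, transform=eval_transform),
                         batch_size=BATCH_SIZE)

criterion = nn.CrossEntropyLoss()
PATIENCE = 5  # early-stopping patience, in epochs

# Model factories; each model is built right before it is trained
model_builders = {
    'EfficientNet-B0': create_efficientnet_model,
    'ResNet-50': create_resnet_model,
    'Custom CNN': CustomCNN,
}
trained_models = {}

# Dictionary to store training histories
histories = {}

In [None]:
# Train and evaluate each model
import copy

def run_epoch(model, loader, optimizer=None):
    """Run one pass over loader; trains if an optimizer is given, otherwise evaluates."""
    training = optimizer is not None
    model.train(training)
    total_loss, correct, seen = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * labels.size(0)
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            seen += labels.size(0)
    return total_loss / seen, correct / seen

for name, build in model_builders.items():
    print(f"\nTraining {name}...")
    model = build(num_classes=NUM_CLASSES).to(device)
    optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
    # Reduce the learning rate when validation loss stops improving
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.2, patience=3)
    history = {'accuracy': [], 'val_accuracy': [], 'loss': [], 'val_loss': []}
    best_val_loss, best_state, stale_epochs = float('inf'), None, 0

    for epoch in range(EPOCHS):
        train_loss, train_acc = run_epoch(model, train_loader, optimizer)
        val_loss, val_acc = run_epoch(model, val_loader)
        scheduler.step(val_loss)
        history['accuracy'].append(train_acc)
        history['val_accuracy'].append(val_acc)
        history['loss'].append(train_loss)
        history['val_loss'].append(val_loss)
        print(f"Epoch {epoch + 1}/{EPOCHS} - loss: {train_loss:.4f} - "
              f"val_loss: {val_loss:.4f} - val_accuracy: {val_acc:.4f}")

        # Early stopping: keep the weights with the lowest validation loss
        if val_loss < best_val_loss:
            best_val_loss, stale_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            stale_epochs += 1
            if stale_epochs >= PATIENCE:
                print(f"Early stopping after epoch {epoch + 1}")
                break

    if best_state is not None:
        model.load_state_dict(best_state)  # restore the best weights
    trained_models[name] = model
    histories[name] = history

## 5. Model Comparison and Evaluation

In [None]:
# Plot training histories
plt.figure(figsize=(15, 5))

plt.subplot(121)
for name, history in histories.items():
    plt.plot(history['accuracy'], label=f'{name} (train)')
    plt.plot(history['val_accuracy'], label=f'{name} (val)')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.subplot(122)
for name, history in histories.items():
    plt.plot(history['loss'], label=f'{name} (train)')
    plt.plot(history['val_loss'], label=f'{name} (val)')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()
plt.show()

In [None]:
# Evaluate models on test set
test_results = {}
for name, model in trained_models.items():
    test_loss, test_acc = run_epoch(model, test_loader)  # no optimizer, so evaluation only
    test_results[name] = {
        'accuracy': test_acc,
        'loss': test_loss
    }
    print(f"\n{name} Test Accuracy: {test_acc:.4f}")

## 6. Save Best Model

In [None]:
# Find and save the best model
best_model_name = max(test_results, key=lambda k: test_results[k]['accuracy'])
best_model = trained_models[best_model_name]

os.makedirs('models', exist_ok=True)
# Store the weights together with the label mapping needed to decode predictions
torch.save({
    'model_name': best_model_name,
    'state_dict': best_model.state_dict(),
    'label_mapping': label_mapping,
}, 'models/best_model.pth')
print(f"Best model ({best_model_name}) saved with test accuracy: {test_results[best_model_name]['accuracy']:.4f}")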
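The best model above is selected by overall test accuracy. On an imbalanced dataset that number can look good even when rare classes are mostly missed, so it may be worth checking per-class recall as well. The sketch below illustrates the difference in plain Python with toy labels. `per_class_recall` and `balanced_accuracy` are hypothetical helpers, not part of the project.

```python
def per_class_recall(y_true, y_pred, num_classes):
    """Recall per class: correct predictions / true instances of that class."""
    hits = [0] * num_classes
    totals = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return [h / n if n else float("nan") for h, n in zip(hits, totals)]


def balanced_accuracy(y_true, y_pred, num_classes):
    """Mean recall over the classes that actually occur in y_true."""
    recalls = [r for r in per_class_recall(y_true, y_pred, num_classes) if r == r]
    return sum(recalls) / len(recalls)


# Toy example: a model that always predicts the majority class 0
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred, 3)
print(f"accuracy={accuracy:.2f}, balanced accuracy={balanced:.2f}")
```

Here the plain accuracy is 0.80 while the balanced accuracy is 0.33, because classes 1 and 2 are never predicted. Selecting the best model by balanced accuracy, or by recall on the melanoma class, is a reasonable alternative when missed detections matter more than overall accuracy.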