# 🎯 Breast Cancer Identification - Complete Training Pipeline
## End-Semester Project | IEEE 2024 Based Approach

**Dataset**: BreakHis (7,909 histopathology images)
**Model**: EfficientNet-B0 with Transfer Learning
**Expected Accuracy**: 95-97%
**Training Time**: 2-3 hours on T4 GPU

---

## 📋 Step 1: Check GPU and Environment

In [None]:
# Check GPU availability
!nvidia-smi

import torch
print(f"\n✓ PyTorch version: {torch.__version__}")
print(f"✓ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✓ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✓ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 📦 Step 2: Install Dependencies

In [None]:
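# Install pinned packages quietly. The -q flag keeps pip output short while still
# showing errors; %%capture would also hide the confirmation printed below.
!pip install -q timm==0.9.8
!pip install -q albumentations==1.4.3
!pip install -q pytorch-lightning==2.1.3
!pip install -q wandb==0.16.3
!pip install -q scikit-learn==1.3.2
# opencv-python wheels use four-part versions, so a bare 4.8.1 pin does not resolve
!pip install -q opencv-python==4.8.1.78
!pip install -q pyyaml==6.0.1

print("✓ All packages installed!")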
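
Pinned installs can fail quietly or be overridden by packages preinstalled in the runtime, so it can help to confirm what actually got installed. The sketch below uses only the standard library's `importlib.metadata`; the `PINNED` dictionary and the `check_pins` helper are illustrative names, not part of the repository, and the dictionary lists only a subset of the pins from the install cell.

```python
from importlib.metadata import PackageNotFoundError, version

# Illustrative subset of the pins from the install cell; extend or trim as needed.
PINNED = {
    "timm": "0.9.8",
    "albumentations": "1.4.3",
    "pytorch-lightning": "2.1.3",
    "wandb": "0.16.3",
    "scikit-learn": "1.3.2",
    "pyyaml": "6.0.1",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            problems[name] = (expected, installed)
    return problems


for name, (expected, installed) in check_pins(PINNED).items():
    print(f"{name}: expected {expected}, found {installed or 'nothing installed'}")
```

An empty result means every listed package matches its pin; any printed line points at a package worth reinstalling before training.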

## 💾 Step 3: Mount Google Drive

In [None]:
from google.colab import drive
import os

# Mount Drive
drive.mount('/content/drive')

# Create project directory in Drive
project_dir = '/content/drive/MyDrive/breast_cancer_project'
os.makedirs(project_dir, exist_ok=True)
os.chdir(project_dir)

print(f"✓ Working directory: {os.getcwd()}")

## 📥 Step 4: Clone Repository and Download Dataset

In [None]:
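# Clone the project repository (skipped if it already exists from a previous run,
# since git clone fails when the target folder is present)
![ -d breast-cancer-identification ] || git clone https://github.com/vsiva763-git/breast-cancer-identification.git
%cd breast-cancer-identification

# Create data directory
!mkdir -p data

print("✓ Repository ready!")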

In [None]:
# Download BreakHis dataset (~1.2 GB, takes 10-20 minutes)
!python download_breakhis.py

## 🔍 Step 5: Explore Dataset

In [None]:
import os
from pathlib import Path
import matplotlib.pyplot as plt
from PIL import Image
import random

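# Check dataset structure
breakhis_path = Path('data/BreaKHis_v1')

if breakhis_path.exists():
    print("✓ BreakHis dataset found!\n")
    
    # Count images. Match on the full path rather than the file name: in the
    # standard BreaKHis layout "benign"/"malignant" appear in folder names,
    # while file names look like SOB_B_... and SOB_M_...
    all_images = sorted(breakhis_path.rglob('*.png'))
    benign_images = [p for p in all_images if 'benign' in str(p).lower()]
    malignant_images = [p for p in all_images if 'malignant' in str(p).lower()]
    
    print("📊 Dataset Statistics:")
    print(f"   Benign: {len(benign_images)} images")
    print(f"   Malignant: {len(malignant_images)} images")
    print(f"   Total: {len(benign_images) + len(malignant_images)} images")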
    
    # Display sample images
    fig, axes = plt.subplots(2, 4, figsize=(16, 8))
    fig.suptitle('Sample BreakHis Images', fontsize=16)
    
    for i, ax in enumerate(axes.flat):
        if i < 4:
            img_path = random.choice(benign_images)
            label = "Benign"
        else:
            img_path = random.choice(malignant_images)
            label = "Malignant"
        
        img = Image.open(img_path)
        ax.imshow(img)
        ax.set_title(f"{label}\n{img_path.name[:20]}...")
        ax.axis('off')
    
    plt.tight_layout()
    plt.show()
else:
    print("❌ Dataset not found. Please run download cell first.")
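
BreaKHis file names encode more than the class. Assuming the download keeps the dataset's standard naming scheme (for example `SOB_B_TA-14-4659-40-001.png`: biopsy procedure, tumor class, tumor type, year, slide ID, magnification, sequence number), a small parser can recover the label, subtype, and magnification for each image. The `parse_breakhis_name` helper and its field names are illustrative, and treating year plus slide ID as a patient identifier is an assumption.

```python
import re
from collections import Counter

# Assumed standard BreaKHis naming: PROC_CLASS_TYPE-YEAR-SLIDE-MAG-SEQ.png
PATTERN = re.compile(
    r"^(?P<procedure>[A-Z]+)_(?P<tumor_class>[BM])_(?P<tumor_type>[A-Z]+)"
    r"-(?P<year>\d+)-(?P<slide_id>[^-]+)-(?P<magnification>\d+)-(?P<seq>\d+)\.png$"
)
TUMOR_CLASSES = {"B": "benign", "M": "malignant"}


def parse_breakhis_name(filename):
    """Parse a BreaKHis file name into a dict, or return None if it does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    info = match.groupdict()
    info["label"] = TUMOR_CLASSES[info["tumor_class"]]
    info["magnification"] = int(info["magnification"])
    # Assumption: year + slide ID identifies one patient.
    info["patient_id"] = f"{info['year']}-{info['slide_id']}"
    return info


names = ["SOB_B_TA-14-4659-40-001.png", "SOB_M_DC-14-22549AB-400-012.png", "notes.txt"]
parsed = [info for info in map(parse_breakhis_name, names) if info is not None]
print(Counter(info["label"] for info in parsed))
```

Applied to `p.name` for each path found in the cell above, the same `Counter` pattern gives per-magnification or per-subtype counts.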

## 🔧 Step 6: Phase 1 - Data Preparation

In [None]:
# Run data preparation
!python phase1_data_preparation/prepare_data.py
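
What `prepare_data.py` does internally is not shown in this notebook. One point worth checking in any histopathology pipeline is that images from the same patient do not land in both the training and evaluation splits, since that can inflate accuracy. The sketch below shows a group-wise split using only the standard library; `split_by_group`, its default fractions, and the toy data are all hypothetical and are not taken from the repository.

```python
import random
from collections import defaultdict


def split_by_group(items, group_of, val_frac=0.15, test_frac=0.15, seed=42):
    """Split items into train/val/test so that each group stays in one split."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    keys = sorted(groups)                 # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)

    n_test = round(len(keys) * test_frac)
    n_val = round(len(keys) * val_frac)
    test_keys = keys[:n_test]
    val_keys = keys[n_test:n_test + n_val]
    train_keys = keys[n_test + n_val:]

    def collect(selected):
        return [item for key in selected for item in groups[key]]

    return collect(train_keys), collect(val_keys), collect(test_keys)


# Toy data: 10 hypothetical patients with 3 images each.
images = [(f"patient{p}", f"img{p}_{i}.png") for p in range(10) for i in range(3)]
train, val, test = split_by_group(images, group_of=lambda item: item[0])
print(len(train), len(val), len(test))
```

The fractions apply to the number of groups, not the number of images, so split sizes in images vary when patients contribute different numbers of slides.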

## 🏋️ Step 7: Phase 2 - Model Training

### Option A: Quick Training (for testing)

In [None]:
# Create a quick test configuration (5 epochs)
import yaml

# Load config and modify for quick test
with open('configs/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

config['models']['training']['epochs'] = 5
config['models']['training']['batch_size'] = 32

# Save test config
with open('configs/config_test.yaml', 'w') as f:
    yaml.dump(config, f)

print("✓ Quick test configuration created (5 epochs)")
print("   This will take ~15-20 minutes")
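
The cell above edits the loaded config in place. If several variants are needed (quick test, smaller batch, and so on), a helper that applies overrides to a copy keeps the base config untouched and rejects misspelled keys instead of silently adding them. This is a minimal sketch: `with_overrides` is a hypothetical helper, and the `base` dictionary only mimics the `models.training` keys used above rather than reproducing the real `config.yaml`.

```python
import copy


def with_overrides(config, overrides):
    """Return a deep copy of config with dotted-key overrides applied."""
    updated = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = updated
        for key in parents:
            node = node[key]  # KeyError here means the path does not exist
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        node[leaf] = value
    return updated


# Stand-in for the dictionary returned by yaml.safe_load.
base = {"models": {"training": {"epochs": 50, "batch_size": 32}}}
quick = with_overrides(base, {"models.training.epochs": 5})
print(quick["models"]["training"]["epochs"], base["models"]["training"]["epochs"])
```

The returned dictionary can be written out with `yaml.dump` exactly as in the cell above.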

In [None]:
# Run quick training test
!python phase2_model_development/train.py --config configs/config_test.yaml

### Option B: Full Training (50 epochs, ~2-3 hours)

In [None]:
# Setup Weights & Biases for experiment tracking (optional)
import wandb

# Login to W&B (get API key from https://wandb.ai/authorize)
wandb.login()

print("✓ W&B configured for experiment tracking")

In [None]:
# Run full training with all 50 epochs
!python phase2_model_development/train.py --config configs/config.yaml --wandb

## 📊 Step 8: Evaluate Model

In [None]:
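from pathlib import Path

import torch
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load best model checkpoint
checkpoint_path = 'checkpoints/best_model.pth'

if Path(checkpoint_path).exists():
    # map_location lets a GPU-trained checkpoint load on a CPU-only runtime
    checkpoint = torch.load(checkpoint_path, map_location='cpu')
    
    print("📈 Training Results:")
    print(f"   Best Epoch: {checkpoint.get('epoch', 'N/A')}")
    # Format only when the value exists; applying ':.2%' to 'N/A' raises ValueError
    val_acc = checkpoint.get('val_accuracy')
    val_loss = checkpoint.get('val_loss')
    print(f"   Best Val Accuracy: {val_acc:.2%}" if val_acc is not None else "   Best Val Accuracy: N/A")
    print(f"   Best Val Loss: {val_loss:.4f}" if val_loss is not None else "   Best Val Loss: N/A")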
    
    # If predictions are saved
    if 'test_predictions' in checkpoint:
        y_true = checkpoint['test_labels']
        y_pred = checkpoint['test_predictions']
        
        # Classification Report
        print("\n📊 Classification Report:")
        print(classification_report(y_true, y_pred, target_names=['Benign', 'Malignant']))
        
        # Confusion Matrix
        cm = confusion_matrix(y_true, y_pred)
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                    xticklabels=['Benign', 'Malignant'],
                    yticklabels=['Benign', 'Malignant'])
        plt.title('Confusion Matrix')
        plt.ylabel('True Label')
        plt.xlabel('Predicted Label')
        plt.show()
else:
    print("❌ No checkpoint found. Train the model first.")
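
For a benign/malignant classifier, overall accuracy hides the difference between missing a malignant case and raising a false alarm. Sensitivity and specificity can be read straight off the confusion-matrix counts, as the sketch below does in plain Python. The `binary_metrics` helper is illustrative, the labels are made-up toy values (not results from this project), and class index 1 is assumed to mean malignant, as in the report above.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute sensitivity, specificity, precision and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # malignant cases correctly flagged
        "specificity": ratio(tn, tn + fp),   # benign cases correctly cleared
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Toy labels for illustration only: 0 = benign, 1 = malignant.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]
for name, value in binary_metrics(y_true_toy, y_pred_toy).items():
    print(f"{name}: {value:.3f}")
```

Passing `checkpoint['test_labels']` and `checkpoint['test_predictions']` (converted to plain lists) gives the same metrics for the saved predictions, when they are present.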

## 💾 Step 9: Save Results to Drive

In [None]:
import os
import shutil
from datetime import datetime
from pathlib import Path

# Create timestamped results folder
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
results_dir = f"/content/drive/MyDrive/breast_cancer_results_{timestamp}"
os.makedirs(results_dir, exist_ok=True)

# Copy important files
files_to_save = [
    'checkpoints/best_model.pth',
    'configs/config.yaml',
    'logs/training.log'
]

saved = 0
for file_path in files_to_save:
    if Path(file_path).exists():
        shutil.copy(file_path, results_dir)
        print(f"✓ Saved: {file_path}")
        saved += 1
    else:
        # Report missing files instead of skipping them silently
        print(f"⚠️ Not found, skipped: {file_path}")

print(f"\n✓ Saved {saved} of {len(files_to_save)} files to: {results_dir}")

## 🎨 Step 10: Visualize Training History

In [None]:
import matplotlib.pyplot as plt
import json

# Load training history if available
history_path = 'logs/training_history.json'

if Path(history_path).exists():
    with open(history_path, 'r') as f:
        history = json.load(f)
    
    # Plot training curves
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Loss plot
    axes[0].plot(history['train_loss'], label='Train Loss', marker='o')
    axes[0].plot(history['val_loss'], label='Val Loss', marker='s')
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Loss')
    axes[0].set_title('Training and Validation Loss')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Accuracy plot
    axes[1].plot(history['train_acc'], label='Train Accuracy', marker='o')
    axes[1].plot(history['val_acc'], label='Val Accuracy', marker='s')
    axes[1].set_xlabel('Epoch')
    axes[1].set_ylabel('Accuracy (%)')
    axes[1].set_title('Training and Validation Accuracy')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig(f'{results_dir}/training_curves.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print(f"✓ Training curves saved to {results_dir}/training_curves.png")
else:
    print("⚠️ Training history not found")
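
Alongside the plots, a short numeric summary makes it easier to compare runs. The sketch below finds the epoch with the lowest validation loss and reports the train/validation loss gap at that point, a rough overfitting indicator. It assumes the history dictionary has the same four keys used in the plotting cell, with one value per epoch in order; `summarize_history` and the toy numbers are illustrative only.

```python
def summarize_history(history):
    """Summarize a training history at the epoch with the lowest validation loss."""
    val_loss = history["val_loss"]
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best_index + 1,  # epochs reported 1-based
        "best_val_loss": val_loss[best_index],
        "val_acc_at_best": history["val_acc"][best_index],
        # A positive gap means validation loss is above training loss
        "loss_gap_at_best": val_loss[best_index] - history["train_loss"][best_index],
    }


# Toy history for illustration, not output from this project.
toy_history = {
    "train_loss": [0.60, 0.40, 0.30, 0.20],
    "val_loss": [0.55, 0.42, 0.38, 0.45],
    "train_acc": [70, 82, 88, 93],
    "val_acc": [72, 81, 85, 84],
}
print(summarize_history(toy_history))
```

In the toy data validation loss rises again after the third epoch while training loss keeps falling, which is the pattern early stopping is meant to catch.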

## 🎯 Step 11: Make Predictions on New Images

In [None]:
import torch
import torchvision.transforms as transforms
from PIL import Image
import timm

# Load model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = timm.create_model('efficientnet_b0', pretrained=False, num_classes=2)

# Load weights
checkpoint = torch.load('checkpoints/best_model.pth', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model = model.to(device)
model.eval()

# Define transforms
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

def predict_image(image_path):
    """Predict single image."""
    img = Image.open(image_path).convert('RGB')
    img_tensor = transform(img).unsqueeze(0).to(device)
    
    with torch.no_grad():
        outputs = model(img_tensor)
        probabilities = torch.softmax(outputs, dim=1)
        pred_class = torch.argmax(probabilities, dim=1).item()
        confidence = probabilities[0][pred_class].item()
    
    label = "Malignant" if pred_class == 1 else "Benign"
    return label, confidence

# Test on sample images (reuses random, plt, benign_images and malignant_images from Step 5)
test_images = random.sample(benign_images + malignant_images, 4)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, img_path in zip(axes, test_images):
    prediction, confidence = predict_image(img_path)
    
    img = Image.open(img_path)
    ax.imshow(img)
    ax.set_title(f"Pred: {prediction}\nConf: {confidence:.2%}")
    ax.axis('off')

plt.tight_layout()
plt.show()

print("✓ Model ready for predictions!")
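
The label and confidence returned by `predict_image` come from a softmax over the two output logits followed by an argmax. The plain-Python sketch below walks through that arithmetic on hand-picked logits so the numbers can be checked by hand. `softmax` and `decide` are illustrative helpers, and the class order (index 1 = malignant) follows the mapping used in `predict_image`.

```python
import math

CLASS_NAMES = ["Benign", "Malignant"]  # index 1 = malignant, as in predict_image


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits):
    """Return (label, confidence) for the highest-probability class."""
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    return CLASS_NAMES[pred], probs[pred]


label, confidence = decide([0.0, 2.0])  # hand-picked logits, not model output
print(f"{label} ({confidence:.2%})")
```

A logit gap of 2 already yields about 88% for the larger class, so a high softmax value reflects the size of the gap and should not be read as a calibrated clinical probability.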

## 📝 Summary and Next Steps

### ✅ Completed:
1. ✓ Dataset downloaded (BreakHis - 7,909 images)
2. ✓ Data preparation with augmentation
3. ✓ Model training with EfficientNet-B0
4. ✓ Model evaluation and metrics
5. ✓ Results saved to Google Drive

### 🎯 Next Steps:
1. **Phase 3**: Multi-modal fusion (if using multiple datasets)
2. **Phase 4**: Explainability (Grad-CAM, SHAP)
3. **Phase 5**: Deployment (Streamlit app)

### 📊 Expected Results:
- **Accuracy**: 95-97%
- **Training Time**: 2-3 hours on T4 GPU
- **Model Size**: ~20 MB
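
The model-size figure can be sanity-checked from the parameter count. The arithmetic below assumes float32 weights (4 bytes each), the commonly published count of 5,288,548 parameters for EfficientNet-B0 with a 1000-class ImageNet head, and a 1280-wide final feature vector; swapping in a 2-class head shrinks the classifier layer. Treat the numbers as a rough estimate: saved checkpoints also hold BatchNorm buffers and possibly optimizer state, which add to the file size.

```python
BYTES_PER_FLOAT32 = 4


def fp32_size_mb(num_params):
    """Approximate size in megabytes of num_params float32 values."""
    return num_params * BYTES_PER_FLOAT32 / 1e6


# Assumed published parameter count for the 1000-class ImageNet model.
imagenet_params = 5_288_548
feature_width = 1280  # assumed width of the final feature vector

imagenet_head = feature_width * 1000 + 1000   # weights + biases, 1000 classes
two_class_head = feature_width * 2 + 2        # weights + biases, 2 classes
two_class_params = imagenet_params - imagenet_head + two_class_head

print(f"1000-class model: {fp32_size_mb(imagenet_params):.1f} MB")
print(f"2-class model:    {two_class_params:,} params, {fp32_size_mb(two_class_params):.1f} MB")
```

Both estimates land in the same range as the ~20 MB quoted above, with the 2-class weights alone coming in somewhat lower.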

### 📚 References:
- BreakHis Dataset: https://web.inf.ufpr.br/vri/databases/
- EfficientNet Paper: https://arxiv.org/abs/1905.11946
- Repository: https://github.com/vsiva763-git/breast-cancer-identification