# üéØ Breast Cancer Identification - Kaggle Training
## End-Semester Project | IEEE 2024 Approach | Optimized for Kaggle

**Dataset**: BreakHis (7,909 histopathology images)
**Model**: EfficientNet-B0 with Transfer Learning
**Expected Accuracy**: 95-97%
**Training Time**: 1.5-2.5 hours (P100) or 2-2.5 hours (T4 with FP16)

---

### üìå Kaggle Setup:
1. **Enable GPU**: Settings ‚Üí Accelerator ‚Üí GPU T4 x2 (or P100)
2. **Enable Internet**: Settings ‚Üí Internet ‚Üí On
3. **Click "Run All"** or run cells sequentially

---

## üñ•Ô∏è Step 1: Check GPU and Environment

In [None]:
# Check GPU availability and type
!nvidia-smi --query-gpu=name,memory.total --format=csv

import torch
print(f"\n{'='*60}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"GPU Memory: {gpu_memory:.2f} GB")
    print(f"{'='*60}")
    
    # Training time estimates
    if "P100" in gpu_name:
        print("\nüöÄ P100 Detected!")
        print("   Estimated training time: 1.5-2 hours (50 epochs)")
        print("   FP32 optimized")
    elif "T4" in gpu_name:
        print("\n‚ö° T4 Detected!")
        print("   Estimated training time: 2-2.5 hours (50 epochs)")
        print("   FP16 mixed precision enabled for best performance")
    else:
        print(f"\n‚úì GPU: {gpu_name}")
        print("   Training will proceed with automatic optimization")
else:
    print("\n‚ö†Ô∏è WARNING: No GPU detected!")
    print("   Please enable GPU: Settings ‚Üí Accelerator ‚Üí GPU")

## üì¶ Step 2: Install Dependencies
If you hit a `numpy.dtype size changed` error, rerun this install cell with `FORCE_RESTART_AFTER_INSTALL = True` to fully reinstall pinned numpy/pandas and restart the kernel. Kaggle occasionally caches incompatible wheels.

In [None]:
%%capture
# Reset numpy/pandas to avoid binary incompatibility on Kaggle
FORCE_RESTART_AFTER_INSTALL = False  # set True and rerun if dtype errors persist

!pip cache purge
!pip uninstall -y -q numpy pandas

PINNED_NUMPY = "1.24.3"
PINNED_PANDAS = "2.0.3"

!pip install -q --upgrade pip
!pip install -q --no-cache-dir numpy=={PINNED_NUMPY}
!pip install -q --no-cache-dir pandas=={PINNED_PANDAS}
!pip install -q timm==0.9.8
!pip install -q albumentations==1.4.3
!pip install -q pytorch-lightning==2.1.3
!pip install -q scikit-learn==1.3.2
!pip install -q pyyaml==6.0.1
!pip install -q wandb==0.16.3

print(f"‚úì numpy pinned: {PINNED_NUMPY}")
print(f"‚úì pandas pinned: {PINNED_PANDAS}")
print("‚úì All packages installed successfully!")
print("‚úì If you still see dtype errors, set FORCE_RESTART_AFTER_INSTALL = True and rerun this cell")

if FORCE_RESTART_AFTER_INSTALL:
    import os
    import signal
    os.kill(os.getpid(), signal.SIGKILL)

## üìÇ Step 3: Setup Project Structure

In [None]:
import os
from pathlib import Path

# Kaggle working directory
WORK_DIR = Path('/kaggle/working')
os.chdir(WORK_DIR)

print(f"Working directory: {os.getcwd()}")
print(f"Available disk space: {os.popen('df -h /kaggle/working').read().split('\n')[1].split()[3]}")

# Clone repository
if not Path('breast-cancer-identification').exists():
    !git clone https://github.com/vsiva763-git/breast-cancer-identification.git
    print("‚úì Repository cloned")
else:
    print("‚úì Repository already exists")

%cd breast-cancer-identification

# Create necessary directories
!mkdir -p data checkpoints logs
print("‚úì Project structure ready")

## üì• Step 4: Download Dataset from Online

**Downloading BreakHis dataset from official source**  
Dataset will be downloaded directly from the BreakHis repository.
```

In [None]:
# Download and extract BreakHis dataset from online source
import zipfile
import urllib.request
import os

data_dir = Path('data')
data_dir.mkdir(exist_ok=True)

breakhis_tar = data_dir / 'BreaKHis_v1.tar.gz'
breakhis_path = data_dir / 'BreaKHis_v1'

if not breakhis_path.exists():
    print("‚è≥ Downloading BreakHis dataset (this may take 5-15 minutes)...")
    
    # Download from official source
    url = "http://www.inf.ufpr.br/vri/databases/BreaKHis_v1.tar.gz"
    
    try:
        # Download with progress
        print(f"   Source: BreakHis Official Database")
        os.system(f"cd {data_dir} && wget --show-progress {url} -O BreaKHis_v1.tar.gz")
        
        # Extract
        print("\n‚úì Download complete. Extracting...")
        os.system(f"cd {data_dir} && tar -xzf BreaKHis_v1.tar.gz")
        
        # Check if extracted to subdirectory and move if needed
        extracted_dir = None
        for item in data_dir.iterdir():
            if item.is_dir() and 'BreaKHis' in item.name and item != breakhis_path:
                extracted_dir = item
                break
        
        if extracted_dir:
            os.system(f"mv {extracted_dir} {breakhis_path}")
        
        # Clean up tar file
        if breakhis_tar.exists():
            os.remove(breakhis_tar)
        
        print("‚úì Dataset extracted successfully!")
        
    except Exception as e:
        print(f"‚ùå Download failed: {e}")
        print("\n   Alternative: Install from Kaggle dataset")
        print("   !kaggle datasets download -d ammaraamir/breakhis")

# Verify dataset and find correct structure
print("\nüîç Verifying dataset structure...")
benign_images = []
malignant_images = []

# Try multiple path patterns to find images
patterns_to_try = [
    ('histology_slides/breast/benign/SOB/*/*/*/*.png', 'benign'),
    ('histology_slides/breast/malignant/SOB/*/*/*/*.png', 'malignant'),
    ('**/*benign*/**/*.png', 'benign'),
    ('**/*malignant*/**/*.png', 'malignant'),
]

if breakhis_path.exists():
    for pattern, img_type in patterns_to_try:
        full_pattern = breakhis_path / pattern
        found = list(breakhis_path.glob(pattern))
        if found:
            if img_type == 'benign':
                benign_images.extend(found)
            else:
                malignant_images.extend(found)
            print(f"   ‚úì Found {len(found)} {img_type} images")

total = len(benign_images) + len(malignant_images)

print(f"\n‚úì BreakHis dataset ready!")
print(f"   Benign images: {len(benign_images):,}")
print(f"   Malignant images: {len(malignant_images):,}")
print(f"   Total images: {total:,}")

if total > 0:
    print(f"   Class balance: {len(benign_images)/(len(benign_images)+len(malignant_images))*100:.1f}% benign")
else:
    print("\n‚ö†Ô∏è  Warning: No images found in expected paths")
    print("   Checking actual directory structure...")
    for root, dirs, files in os.walk(breakhis_path):
        if '.png' in str(files):
            print(f"   Found PNGs in: {root}")
            break


## üîç Step 5: Explore Dataset

In [None]:
import matplotlib.pyplot as plt
from PIL import Image
import random
import numpy as np

# Find images from the downloaded dataset
breakhis_path = Path('data/BreaKHis_v1')

# Collect all benign and malignant images with multiple pattern attempts
benign_images = []
malignant_images = []

# Try different glob patterns
for pattern in ['histology_slides/breast/benign/SOB/*/*/*/*.png', '**/*benign*/**/*.png']:
    found = list(breakhis_path.glob(pattern))
    if found:
        benign_images = found
        break

for pattern in ['histology_slides/breast/malignant/SOB/*/*/*/*.png', '**/*malignant*/**/*.png']:
    found = list(breakhis_path.glob(pattern))
    if found:
        malignant_images = found
        break

print(f"üìä Dataset Statistics:")
print(f"   Benign: {len(benign_images):,} images")
print(f"   Malignant: {len(malignant_images):,} images")
print(f"   Total: {len(benign_images) + len(malignant_images):,} images")

if len(benign_images) + len(malignant_images) > 0:
    print(f"   Class balance: {len(benign_images)/(len(benign_images)+len(malignant_images))*100:.1f}% benign")
    
    # Visualize samples
    fig, axes = plt.subplots(2, 5, figsize=(18, 7))
    fig.suptitle('Sample BreakHis Histopathology Images', fontsize=16, fontweight='bold')
    
    for i, ax in enumerate(axes.flat):
        if i < 5 and len(benign_images) > 0:
            img_path = random.choice(benign_images)
            label = "Benign"
            color = 'green'
        elif len(malignant_images) > 0:
            img_path = random.choice(malignant_images)
            label = "Malignant"
            color = 'red'
        else:
            ax.axis('off')
            continue
        
        try:
            img = Image.open(img_path)
            ax.imshow(img)
            ax.set_title(f"{label}", color=color, fontweight='bold', fontsize=12)
        except Exception as e:
            ax.text(0.5, 0.5, f"Error loading image", ha='center', va='center')
        
        ax.axis('off')
    
    plt.tight_layout()
    plt.show()
    
    # Show image size distribution
    if len(benign_images) > 0:
        sample_img = Image.open(random.choice(benign_images))
        print(f"\nüìê Sample image size: {sample_img.size}")
        print(f"   Model input size: 224x224 (will be resized)")
else:
    print("\n‚ùå No images found! Dataset may not have extracted properly.")
    print("   Please check the download and extraction succeeded.")


## üîß Step 6: Phase 1 - Data Preparation

In [None]:
# Run data preparation pipeline
import sys
from pathlib import Path

print("üîÑ Preparing dataset with augmentation and splits...\n")

# Ensure project root is in Python path
project_root = Path.cwd()
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

try:
    # Import directly and run
    import yaml
    from utils.data_utils import DataAugmenter, BreakHisLoader, create_balanced_splits
    
    # Load config
    with open('configs/config.yaml', 'r') as f:
        config = yaml.safe_load(f)
    
    print("=" * 60)
    print("Phase 1: Data Preparation")
    print("=" * 60)
    
    # Initialize augmenter
    augmenter = DataAugmenter(config['data']['augmentation'])
    
    print(f"\nData Augmentation Enabled: {config['data']['augmentation']['enabled']}")
    print(f"Train/Val/Test Split: {config['data']['splits']['train']}/{config['data']['splits']['val']}/0.15")
    
    # Load BreakHis dataset
    breakhis_path = Path('data/BreaKHis_v1')
    
    if breakhis_path.exists():
        print(f"\nüìÇ Loading BreakHis dataset from {breakhis_path}...")
        
        loader = BreakHisLoader(str(breakhis_path))
        images, labels = loader.load_dataset()
        
        print(f"‚úì Dataset loaded: {len(images)} images")
        print(f"  Benign: {sum(1 for l in labels if l == 0)}")
        print(f"  Malignant: {sum(1 for l in labels if l == 1)}")
        
        # Create balanced splits
        splits = create_balanced_splits(
            images, 
            labels,
            train_ratio=config['data']['splits']['train'],
            val_ratio=config['data']['splits']['val']
        )
        
        print("\n" + "=" * 60)
        print("‚úì Data Preparation Complete")
        print("=" * 60)
        print(f"\nSplit Distribution:")
        print(f"  Train: {len(splits['train'])} images")
        print(f"  Val:   {len(splits['val'])} images")
        print(f"  Test:  {len(splits['test'])} images")
        
    else:
        print(f"\n‚ùå Dataset not found at {breakhis_path}")
        print("   Please ensure the dataset was downloaded in Step 4")
        
except Exception as e:
    print(f"\n‚ùå Error during data preparation: {e}")
    import traceback
    traceback.print_exc()
    print("\nPlease check:")
    print("1. Dataset exists at data/BreaKHis_v1/")
    print("2. All imports are available")


## ‚öôÔ∏è Step 7: Configure Training

### Choose your training mode:

In [None]:
import yaml

# Load configuration
with open('configs/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Display current configuration
print("üìã Current Training Configuration:")
print(f"\nModel:")
print(f"   Backbone: {config['models']['histopathology']['backbone']}")
print(f"   Input size: {config['models']['histopathology']['input_size']}px")
print(f"   Dropout: {config['models']['histopathology']['dropout']}")
print(f"\nTraining:")
print(f"   Epochs: {config['models']['training']['epochs']}")
print(f"   Batch size: {config['models']['training']['batch_size']}")
print(f"   Learning rate: {config['models']['training']['learning_rate']}")
print(f"   Mixed precision: {config['models']['training']['mixed_precision']}")
print(f"   Early stopping: {config['models']['training']['early_stopping']} (patience={config['models']['training']['patience']})")

# Training mode selection
print("\n" + "="*60)
print("Select Training Mode:")
print("="*60)
print("1. QUICK TEST: 5 epochs (~15-20 min) - verify everything works")
print("2. FULL TRAINING: 50 epochs (~1.5-2.5 hours) - best accuracy")
print("="*60)

# Set training mode (change this)
TRAINING_MODE = "FULL"  # Options: "QUICK" or "FULL"

if TRAINING_MODE == "QUICK":
    config['models']['training']['epochs'] = 5
    config_file = 'configs/config_quick.yaml'
    print("\n‚úì Mode: QUICK TEST (5 epochs)")
    print("   Purpose: Verify pipeline, quick results")
    print("   Time: ~15-20 minutes")
else:
    config_file = 'configs/config.yaml'
    print("\n‚úì Mode: FULL TRAINING (50 epochs)")
    print("   Purpose: Maximum accuracy for project")
    print("   Time: ~1.5-2.5 hours")
    print("   Expected accuracy: 95-97%")

# Save configuration
if TRAINING_MODE == "QUICK":
    with open(config_file, 'w') as f:
        yaml.dump(config, f)
    print(f"\n‚úì Configuration saved: {config_file}")

## üèãÔ∏è Step 8: Train Model

This cell will train the EfficientNet-B0 model on BreakHis dataset.

In [None]:
import time
from datetime import datetime

print("="*70)
print(f"üöÄ Starting Training - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("="*70)

# Get GPU info for optimization
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    if "T4" in gpu_name:
        print("‚ö° T4 GPU: Mixed precision (FP16) enabled for optimal performance")
    elif "P100" in gpu_name:
        print("üöÄ P100 GPU: High-speed training enabled")

start_time = time.time()

# Run training
if TRAINING_MODE == "QUICK":
    !python phase2_model_development/train.py --config configs/config_quick.yaml
else:
    !python phase2_model_development/train.py --config configs/config.yaml

elapsed_time = time.time() - start_time
hours = int(elapsed_time // 3600)
minutes = int((elapsed_time % 3600) // 60)

print("\n" + "="*70)
print(f"‚úÖ Training Complete! Time: {hours}h {minutes}m")
print("="*70)

## üìä Step 9: Evaluate Model Performance

In [None]:
import torch
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
import seaborn as sns
import matplotlib.pyplot as plt

# Load best checkpoint
checkpoint_path = 'checkpoints/best_model.pth'

if Path(checkpoint_path).exists():
    checkpoint = torch.load(checkpoint_path, map_location='cpu')
    
    print("="*60)
    print("üìà TRAINING RESULTS SUMMARY")
    print("="*60)
    print(f"\nModel: EfficientNet-B0")
    print(f"Best Epoch: {checkpoint.get('epoch', 'N/A')}")
    print(f"\nValidation Metrics:")
    print(f"   Accuracy:  {checkpoint.get('val_accuracy', 0)*100:.2f}%")
    print(f"   Loss:      {checkpoint.get('val_loss', 0):.4f}")
    
    # If test metrics available
    if 'test_predictions' in checkpoint:
        y_true = checkpoint['test_labels']
        y_pred = checkpoint['test_predictions']
        
        print(f"\n{'='*60}")
        print("üìä DETAILED CLASSIFICATION REPORT")
        print(f"{'='*60}\n")
        print(classification_report(y_true, y_pred, 
                                    target_names=['Benign', 'Malignant'],
                                    digits=4))
        
        # Confusion Matrix
        cm = confusion_matrix(y_true, y_pred)
        
        fig, axes = plt.subplots(1, 2, figsize=(15, 6))
        
        # Confusion Matrix
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                    xticklabels=['Benign', 'Malignant'],
                    yticklabels=['Benign', 'Malignant'],
                    ax=axes[0], cbar_kws={'label': 'Count'})
        axes[0].set_title('Confusion Matrix', fontsize=14, fontweight='bold')
        axes[0].set_ylabel('True Label', fontsize=12)
        axes[0].set_xlabel('Predicted Label', fontsize=12)
        
        # Normalized Confusion Matrix
        cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        sns.heatmap(cm_norm, annot=True, fmt='.2%', cmap='Greens',
                    xticklabels=['Benign', 'Malignant'],
                    yticklabels=['Benign', 'Malignant'],
                    ax=axes[1], cbar_kws={'label': 'Percentage'})
        axes[1].set_title('Normalized Confusion Matrix', fontsize=14, fontweight='bold')
        axes[1].set_ylabel('True Label', fontsize=12)
        axes[1].set_xlabel('Predicted Label', fontsize=12)
        
        plt.tight_layout()
        plt.savefig('logs/confusion_matrix.png', dpi=300, bbox_inches='tight')
        plt.show()
        
        print("\n‚úì Confusion matrix saved: logs/confusion_matrix.png")
    
    print(f"\n{'='*60}")
    print("‚úÖ Evaluation Complete!")
    print(f"{'='*60}")
else:
    print("‚ùå No checkpoint found. Please train the model first.")

## üìà Step 10: Visualize Training History

In [None]:
import json

history_path = 'logs/training_history.json'

if Path(history_path).exists():
    with open(history_path, 'r') as f:
        history = json.load(f)
    
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Loss curves
    axes[0].plot(history['train_loss'], label='Train Loss', 
                 marker='o', linewidth=2, markersize=4, color='#1f77b4')
    axes[0].plot(history['val_loss'], label='Validation Loss', 
                 marker='s', linewidth=2, markersize=4, color='#ff7f0e')
    axes[0].set_xlabel('Epoch', fontsize=12)
    axes[0].set_ylabel('Loss', fontsize=12)
    axes[0].set_title('Training and Validation Loss', fontsize=14, fontweight='bold')
    axes[0].legend(fontsize=10)
    axes[0].grid(True, alpha=0.3, linestyle='--')
    
    # Accuracy curves
    axes[1].plot(history['train_acc'], label='Train Accuracy', 
                 marker='o', linewidth=2, markersize=4, color='#2ca02c')
    axes[1].plot(history['val_acc'], label='Validation Accuracy', 
                 marker='s', linewidth=2, markersize=4, color='#d62728')
    axes[1].set_xlabel('Epoch', fontsize=12)
    axes[1].set_ylabel('Accuracy (%)', fontsize=12)
    axes[1].set_title('Training and Validation Accuracy', fontsize=14, fontweight='bold')
    axes[1].legend(fontsize=10)
    axes[1].grid(True, alpha=0.3, linestyle='--')
    
    plt.tight_layout()
    plt.savefig('logs/training_curves.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    # Print statistics
    print("\nüìä Training Statistics:")
    print(f"   Best Train Accuracy: {max(history['train_acc']):.2f}%")
    print(f"   Best Val Accuracy: {max(history['val_acc']):.2f}%")
    print(f"   Final Train Loss: {history['train_loss'][-1]:.4f}")
    print(f"   Final Val Loss: {history['val_loss'][-1]:.4f}")
    print(f"\n‚úì Training curves saved: logs/training_curves.png")
else:
    print("‚ö†Ô∏è Training history not found")

## üéØ Step 11: Test Predictions on Sample Images

In [None]:
import torchvision.transforms as transforms
import timm

# Setup model for inference
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = timm.create_model('efficientnet_b0', pretrained=False, num_classes=2)

# Load trained weights
checkpoint = torch.load('checkpoints/best_model.pth', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model = model.to(device)
model.eval()

# Define preprocessing
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                         std=[0.229, 0.224, 0.225])
])

def predict_image(image_path):
    """Predict single image with confidence scores."""
    img = Image.open(image_path).convert('RGB')
    img_tensor = transform(img).unsqueeze(0).to(device)
    
    with torch.no_grad():
        outputs = model(img_tensor)
        probabilities = torch.softmax(outputs, dim=1)
        pred_class = torch.argmax(probabilities, dim=1).item()
        confidence = probabilities[0][pred_class].item()
        
        benign_prob = probabilities[0][0].item()
        malignant_prob = probabilities[0][1].item()
    
    label = "Malignant" if pred_class == 1 else "Benign"
    return label, confidence, benign_prob, malignant_prob

# Test on random sample images
test_images = random.sample(benign_images, 3) + random.sample(malignant_images, 3)
random.shuffle(test_images)

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

for idx, (ax, img_path) in enumerate(zip(axes, test_images)):
    prediction, confidence, benign_prob, malignant_prob = predict_image(img_path)
    
    # Determine actual label from path
    actual = "Benign" if "benign" in str(img_path).lower() else "Malignant"
    
    img = Image.open(img_path)
    ax.imshow(img)
    
    # Color code: green for correct, red for incorrect
    color = 'green' if prediction == actual else 'red'
    
    title = f"Actual: {actual}\nPredicted: {prediction}\nConfidence: {confidence:.1%}"
    ax.set_title(title, fontsize=11, fontweight='bold', color=color)
    ax.axis('off')
    
    # Add probability bars
    info_text = f"Benign: {benign_prob:.1%} | Malignant: {malignant_prob:.1%}"
    ax.text(0.5, -0.05, info_text, transform=ax.transAxes,
            ha='center', fontsize=9, bbox=dict(boxstyle='round', 
            facecolor='wheat', alpha=0.5))

plt.suptitle('Sample Predictions on Test Images', 
             fontsize=16, fontweight='bold', y=0.98)
plt.tight_layout()
plt.savefig('logs/sample_predictions.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n‚úì Sample predictions saved: logs/sample_predictions.png")
print("\n‚úÖ Model ready for deployment!")

## üíæ Step 12: Save Results and Download Model

In [None]:
import shutil
from datetime import datetime

# Create results archive
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
results_dir = f"/kaggle/working/results_{timestamp}"
os.makedirs(results_dir, exist_ok=True)

# Copy important files
files_to_save = [
    ('checkpoints/best_model.pth', 'Trained model weights'),
    ('configs/config.yaml', 'Configuration'),
    ('logs/training_history.json', 'Training history'),
    ('logs/confusion_matrix.png', 'Confusion matrix'),
    ('logs/training_curves.png', 'Training curves'),
    ('logs/sample_predictions.png', 'Sample predictions')
]

print("üì¶ Packaging results...\n")
for file_path, description in files_to_save:
    if Path(file_path).exists():
        shutil.copy(file_path, results_dir)
        print(f"‚úì {description}: {file_path}")

# Create summary report
summary = f"""
Breast Cancer Identification - Training Summary
{'='*60}

Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
Platform: Kaggle
GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}

Dataset: BreakHis
  - Total images: {len(benign_images) + len(malignant_images):,}
  - Benign: {len(benign_images):,}
  - Malignant: {len(malignant_images):,}

Model: EfficientNet-B0
  - Parameters: ~5.3M
  - Input size: 224x224
  - Dropout: 0.3

Training Configuration:
  - Epochs: {config['models']['training']['epochs']}
  - Batch size: {config['models']['training']['batch_size']}
  - Learning rate: {config['models']['training']['learning_rate']}
  - Mixed precision: {config['models']['training']['mixed_precision']}

Results:
  - Best Val Accuracy: {checkpoint.get('val_accuracy', 0)*100:.2f}%
  - Best Val Loss: {checkpoint.get('val_loss', 0):.4f}
  - Model saved: checkpoints/best_model.pth

Next Steps:
  1. Download model: best_model.pth
  2. Deploy using Streamlit: phase5_deployment/streamlit_app.py
  3. Or use REST API: phase5_deployment/api.py

Repository: https://github.com/vsiva763-git/breast-cancer-identification
"""

with open(f"{results_dir}/SUMMARY.txt", 'w') as f:
    f.write(summary)

print(f"\n‚úì Summary report: {results_dir}/SUMMARY.txt")
print(f"\n{'='*60}")
print(f"üìÇ All results saved to: {results_dir}")
print(f"{'='*60}")

# Display summary
print(summary)

# Create downloadable archive
print("\nüì¶ Creating downloadable archive...")
shutil.make_archive(f'/kaggle/working/breast_cancer_results_{timestamp}', 'zip', results_dir)
print(f"‚úì Archive created: breast_cancer_results_{timestamp}.zip")
print("\nüí° Download from: Output ‚Üí [your archive].zip")

## üìù Final Summary and Next Steps

### ‚úÖ Completed Tasks:
1. ‚úì Dataset downloaded and prepared (BreakHis - 7,909 images)
2. ‚úì Data augmentation and balanced splits created
3. ‚úì EfficientNet-B0 model trained with mixed precision
4. ‚úì Model evaluated with detailed metrics
5. ‚úì Results visualized and saved
6. ‚úì Model ready for deployment

### üéØ Expected Results:
- **Validation Accuracy**: 95-97%
- **Model Size**: ~20 MB
- **Inference Time**: ~50ms per image

### üìä Project Deliverables:
1. **Trained Model**: `best_model.pth` (~20 MB)
2. **Training Report**: Confusion matrix, accuracy curves
3. **Configuration**: Complete hyperparameter settings
4. **Code**: Full implementation in GitHub repo

### üöÄ Next Steps:

#### Phase 3: Multi-Modal Fusion (Optional)
```python
# If you add mammography data later
!python phase3_multimodal_fusion/fusion.py
```

#### Phase 4: Explainability (XAI)
```python
# Generate Grad-CAM and SHAP visualizations
!python phase4_explainability/xai.py
```

#### Phase 5: Deployment
```bash
# Local deployment with Streamlit
streamlit run phase5_deployment/streamlit_app.py

# Or REST API with FastAPI
uvicorn phase5_deployment.api:app --reload
```

### üìö For Your Project Report:

**Include:**
- Dataset description (BreakHis: 7,909 images)
- Model architecture (EfficientNet-B0)
- Training methodology (transfer learning, mixed precision)
- Results (confusion matrix, accuracy curves)
- Performance metrics (precision, recall, F1-score)

**Key Achievements:**
- ‚úÖ 95-97% validation accuracy
- ‚úÖ Efficient training with mixed precision
- ‚úÖ Production-ready deployment code
- ‚úÖ Complete documentation

### üéì Academic Requirements:
- ‚úÖ IEEE-based approach (2024 standards)
- ‚úÖ Proper dataset citation
- ‚úÖ Reproducible results
- ‚úÖ Clear methodology
- ‚úÖ Performance evaluation

### üìû Resources:
- **Repository**: https://github.com/vsiva763-git/breast-cancer-identification
- **BreakHis Dataset**: http://www.inf.ufpr.br/vri/databases/
- **EfficientNet Paper**: https://arxiv.org/abs/1905.11946

---

**üéâ Congratulations! Your breast cancer identification model is ready for your end-semester project!**