# Quick Start: Skin Cancer Classification

This notebook provides a quick introduction to the skin cancer classification project.

## What You'll Learn
1. How to load the ISIC 2019 dataset
2. How to inspect the class distribution
3. How to visualize sample images from each class
4. How to explore the patient metadata (age, sex, anatomical site)

## Before You Start
Make sure you have:
- Downloaded the ISIC 2019 dataset
- Installed all dependencies (see requirements.txt)
- Run the sanity check: `python3 sanity_check.py`
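
If you want to confirm the download step before running any cells, a minimal sketch like the one below checks that the paths this notebook reads from exist. The helper name `missing_paths` is hypothetical and is not part of `sanity_check.py`; the paths are the ones used in the loading cell, relative to the notebook's directory.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the subset of `paths` that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


# Paths used later in this notebook, relative to the notebook's directory.
expected = [
    '../data/ISIC2019/ISIC_2019_Training_Input/ISIC_2019_Training_Input',
    '../data/ISIC2019/ISIC_2019_Training_GroundTruth.csv',
    '../data/ISIC2019/ISIC_2019_Training_Metadata.csv',
]

missing = missing_paths(expected)
if missing:
    print("Missing dataset paths:")
    for p in missing:
        print(f"  {p}")
else:
    print("All dataset paths found.")
```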

## 1. Setup and Imports

In [None]:
# Add src directory to path
import sys
sys.path.insert(0, '../src')

# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import os

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Imports successful!")

## 2. Load Dataset

In [None]:
# Paths
data_dir = '../data/ISIC2019/ISIC_2019_Training_Input/ISIC_2019_Training_Input'
csv_gt = '../data/ISIC2019/ISIC_2019_Training_GroundTruth.csv'
csv_meta = '../data/ISIC2019/ISIC_2019_Training_Metadata.csv'

# Load CSVs
labels_df = pd.read_csv(csv_gt)
metadata_df = pd.read_csv(csv_meta)

print(f"Total samples: {len(labels_df):,}")
print(f"\nGround truth columns: {labels_df.columns.tolist()}")
print(f"\nMetadata columns: {metadata_df.columns.tolist()}")
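
The later cells treat each row of the ground-truth table as a one-hot label (exactly one class column set to 1). A quick way to verify that assumption is sketched below on a toy frame with invented values; in the notebook you would pass `labels_df` and the list of class columns instead. The function name `count_non_one_hot` is hypothetical.

```python
import pandas as pd


def count_non_one_hot(df, class_cols):
    """Count rows whose class columns do not sum to exactly 1."""
    row_sums = df[class_cols].sum(axis=1)
    return int((row_sums != 1).sum())


# Toy stand-in for labels_df (invented values, two classes only).
toy_labels = pd.DataFrame({
    'image': ['img_a', 'img_b', 'img_c'],
    'MEL': [1.0, 0.0, 0.0],
    'NV': [0.0, 1.0, 0.0],
})

# img_c has no class set, so one row fails the check.
print(count_non_one_hot(toy_labels, ['MEL', 'NV']))  # prints 1
```

A nonzero count on the real table would mean that taking the `argmax` over the class columns silently assigns some rows to the wrong class.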

## 3. Class Distribution

In [None]:
# Get class counts
class_names = ['MEL', 'NV', 'BCC', 'AK', 'BKL', 'DF', 'VASC', 'SCC']
class_counts = labels_df[class_names].sum().sort_values(ascending=False)

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Bar plot
class_counts.plot(kind='bar', ax=ax1, color='skyblue', edgecolor='black')
ax1.set_title('Class Distribution (Bar Plot)', fontsize=14, fontweight='bold')
ax1.set_xlabel('Disease Class', fontsize=12)
ax1.set_ylabel('Number of Samples', fontsize=12)
ax1.tick_params(axis='x', rotation=45)
ax1.grid(axis='y', alpha=0.3)

# Add counts on bars
for i, v in enumerate(class_counts.values):
    ax1.text(i, v + 100, f'{int(v):,}', ha='center', fontweight='bold')

# Pie chart
colors = sns.color_palette('husl', len(class_counts))
ax2.pie(class_counts.values, labels=class_counts.index, autopct='%1.1f%%', 
        colors=colors, startangle=90)
ax2.set_title('Class Distribution (Pie Chart)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

# Print statistics
print("\nClass Statistics:")
print("="*50)
for class_name in class_counts.index:
    count = int(class_counts[class_name])
    pct = count / len(labels_df) * 100
    print(f"{class_name:6} {count:>6,} ({pct:>5.2f}%)")
    
print("\nImbalance Ratio:")
max_count = class_counts.max()
min_count = class_counts.min()
print(f"  {max_count/min_count:.1f}:1 ({class_counts.idxmax()} vs {class_counts.idxmin()})")

## 4. Visualize Sample Images

In [None]:
# Display 2 sample images from each class
fig, axes = plt.subplots(len(class_names), 2, figsize=(8, 20))

for idx, class_name in enumerate(class_names):
    # Find samples for this class
    class_samples = labels_df[labels_df[class_name] == 1.0].sample(2, random_state=42)
    
    for col, (_, sample) in enumerate(class_samples.iterrows()):
        img_path = os.path.join(data_dir, f"{sample['image']}.jpg")
        
        if os.path.exists(img_path):
            img = Image.open(img_path)
            axes[idx, col].imshow(img)
        else:
            axes[idx, col].text(0.5, 0.5, 'Image not found', 
                               ha='center', va='center')
        
        axes[idx, col].axis('off')
        
        if col == 0:
            axes[idx, col].set_title(f'{class_name}', 
                                    fontsize=12, fontweight='bold', loc='left')

plt.suptitle('Sample Images from Each Class', fontsize=16, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

## 5. Metadata Analysis

In [None]:
# Merge labels with metadata
labels_df['label'] = labels_df[class_names].values.argmax(axis=1)
data = pd.merge(labels_df[['image', 'label']], metadata_df, on='image', how='left')

# Age distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 5))

# Overall age distribution
age_data = data['age_approx'].dropna()
ax1.hist(age_data, bins=30, color='skyblue', edgecolor='black', alpha=0.7)
ax1.axvline(age_data.mean(), color='red', linestyle='--', linewidth=2, 
           label=f'Mean: {age_data.mean():.1f}')
ax1.axvline(age_data.median(), color='green', linestyle='--', linewidth=2, 
           label=f'Median: {age_data.median():.1f}')
ax1.set_xlabel('Age (years)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Frequency', fontsize=12, fontweight='bold')
ax1.set_title('Age Distribution', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(axis='y', alpha=0.3)

# Age by class (box plot)
age_by_class = [data[data['label'] == i]['age_approx'].dropna() 
                for i in range(len(class_names))]

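# Set tick labels separately: the boxplot `labels` keyword is deprecated in Matplotlib 3.9+
bp = ax2.boxplot(age_by_class, patch_artist=True)
ax2.set_xticklabels(class_names)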
for patch, color in zip(bp['boxes'], sns.color_palette('husl', len(class_names))):
    patch.set_facecolor(color)

ax2.set_xlabel('Disease Class', fontsize=12, fontweight='bold')
ax2.set_ylabel('Age (years)', fontsize=12, fontweight='bold')
ax2.set_title('Age Distribution by Class', fontsize=14, fontweight='bold')
ax2.tick_params(axis='x', rotation=45)
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Print statistics
print("\nAge Statistics:")
print(age_data.describe())

print("\nSex Distribution:")
print(data['sex'].value_counts())

print("\nAnatomical Location:")
print(data['anatom_site_general'].value_counts())
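
Both `dropna()` and `value_counts()` skip missing values by default, so the statistics above do not show how much metadata is absent. A minimal sketch for measuring that is shown below on a toy frame with invented values; in the notebook you would pass `data` and the same column names. The helper name `missing_fraction` is hypothetical.

```python
import pandas as pd


def missing_fraction(df, columns):
    """Return the fraction of missing values per column, largest first."""
    return df[columns].isna().mean().sort_values(ascending=False)


# Toy stand-in for the merged `data` frame (invented values).
toy_meta = pd.DataFrame({
    'age_approx': [55.0, None, 40.0, 70.0],
    'sex': ['male', 'female', None, None],
    'anatom_site_general': ['head/neck', 'anterior torso',
                            'upper extremity', 'lower extremity'],
})

print(missing_fraction(toy_meta, ['age_approx', 'sex', 'anatom_site_general']))
```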

## 6. Next Steps

Now that you understand the dataset, you can do the following (each `cd` path is relative to this notebook's directory):

1. **Run EDA scripts:**
   ```bash
   cd ../scripts/data
   python3 exploratory_data_analysis.py
   python3 advanced_visualizations.py
   ```

2. **Train your first model:**
   ```bash
   cd ../scripts/training
   python3 train_single_model.py --model resnet50 --epochs 50
   ```

3. **Run cross-validation:**
   ```bash
   python3 train_kfold_cv.py --model efficientnet --n_folds 10
   ```

4. **Explore other notebooks:**
   - Create your own notebooks for custom analysis
   - Experiment with model architectures
   - Visualize training results
   - Test XAI methods
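
With classes this imbalanced, cross-validation folds are usually built so that each fold keeps roughly the same class mix. This notebook does not show how `train_kfold_cv.py` builds its folds, so the sketch below is only an illustration of the idea in plain NumPy, using a toy label array and a hypothetical helper `stratified_fold_ids`.

```python
import numpy as np


def stratified_fold_ids(labels, n_folds, seed=42):
    """Assign each sample a fold id so every fold has a similar class mix."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_ids = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Deal this class's samples out to the folds round-robin.
        fold_ids[idx] = np.arange(len(idx)) % n_folds
    return fold_ids


# Toy labels: 20 samples of class 0 and 4 samples of class 1.
toy = np.array([0] * 20 + [1] * 4)
folds = stratified_fold_ids(toy, n_folds=4)

for k in range(4):
    in_fold = toy[folds == k]
    # Each fold receives 5 samples of class 0 and 1 of class 1.
    print(k, np.bincount(in_fold, minlength=2))
```

In the notebook, the `label` column created in the metadata cell could be passed as `labels`. Note that the rarest class must have at least `n_folds` samples for every fold to contain it.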

## Summary

In this notebook, you:
- Loaded the ISIC 2019 ground-truth labels and metadata
- Measured the severe class imbalance (about 54:1 between the largest and smallest class)
- Viewed sample images of each lesion type
- Explored the age, sex, and anatomical-site metadata

Key takeaways:
- NV (melanocytic nevus) dominates at 50.8%
- DF (dermatofibroma) is rarest at 0.9%
- Mean patient age is ~54 years
- Training should account for the class imbalance, for example with a class-weighted loss
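
One common way to derive weights for a weighted loss is inverse class frequency, `N / (K * count)`, where `N` is the number of samples and `K` the number of classes. The sketch below uses invented toy counts and a hypothetical helper name; the project's training scripts may compute weights differently.

```python
import numpy as np


def inverse_frequency_weights(counts):
    """Weight each class by N / (K * count); a balanced dataset gives all ones."""
    counts = np.asarray(counts, dtype=float)
    return counts.sum() / (len(counts) * counts)


# Toy counts (invented): one dominant class and two rarer ones.
toy_counts = [900, 90, 10]
weights = inverse_frequency_weights(toy_counts)
print(np.round(weights, 2))
```

In the notebook, `labels_df[class_names].sum().values` would supply the real counts. The resulting array can then be handed to whichever loss the training code uses, for instance as the `weight` argument of `torch.nn.CrossEntropyLoss` if the models are trained with PyTorch (an assumption, since the framework is not shown here).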