# 2025 Kaggle Competition of AI Applied to Medicine at UC3M





Welcome to the **2025 Kaggle competition of AI applied to medicine at UC3M**. This project is set up as an **internal Kaggle competition** in which all students will participate. Our real-world challenge for this course will revolve around the **ISIC 2024** dataset, a large collection of skin images used for research in dermatology.


This simple **ResNet50-based** starter notebook walks through the following steps:
1. **Define** a PyTorch `Dataset` that loads images from HDF5 files.
2. **Load** the training and test metadata and split the training set 80/20 (no further preprocessing).
3. **Load** a pretrained **ResNet50** model and replace its final layer with a single-output binary head.
4. **Train** briefly (5 epochs of 1,000 randomly sampled images each), purely for demonstration.
5. **Predict** on the test samples and **generate** a simple `submission.csv` file with the required format.

> **Note**: This is a minimal example to help you set up your environment. The training run is deliberately short, so do not expect meaningful predictions. Feel free to modify it to perform actual classification (e.g., add custom layers, train on more of the dataset, etc.).

## ISIC 2024 Competition Overview

The **International Skin Imaging Collaboration (ISIC)** has launched this competition to advance automated skin cancer detection by:
- **Improving accuracy** in distinguishing malignant from benign lesions  
- **Enhancing efficiency** in clinical workflows  
- **Developing algorithms** that prioritize high-risk lesions  
- **Reducing mortality rates** by enabling earlier detection  

### Primary Task
You need to **classify skin lesions** as **benign** or **malignant**. For each lesion image (identified by `isic_id`), predict a **probability** in the range [0, 1] indicating the chance that the lesion is malignant.

### High-Level Data Summary
- The dataset is called **SLICE-3D**, containing **skin lesion images** (JPEG files) cropped from 3D Total Body Photography (TBP).  
- Each image has metadata in a corresponding `.csv` file, including:  
  - **Binary diagnostic label** (`target` = 0 or 1)  
  - **Patient data** (e.g., `age_approx`, `sex`, `anatom_site_general`)  
  - **Additional attributes** (image source, diagnosis type)

![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4972760%2F349a3ae1149d15dc5642063a2d742c88%2Fimage%20type_noexif_240425.jpg?generation=1714060307710359&alt=media)

This challenge dataset mimics **non-dermoscopic images** using standardized 15x15 mm “tiles” of lesions from a 3D TBP system. Thousands of patients from multiple continents are represented, creating a broad, diverse dataset.

### Task Description & Clinical Context
- **Why it matters**: Skin cancer can be deadly if not detected early. Many people lack access to dermatologic care, so accurate AI systems for image-based triage can improve outcomes.  
- **Key goal**: Develop a binary classifier that identifies malignant lesions from a set of smartphone-quality images.  
- **Impact**: This technology could help prioritize suspicious lesions (top K) for clinical review, especially in low-resource settings, potentially **saving lives** through earlier detection.
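
The top-K prioritization mentioned above can be sketched in a few lines. This is only an illustration, not part of the competition pipeline: the function name `top_k_lesions`, the example IDs, and the probabilities are all hypothetical.

```python
from typing import Dict, List, Tuple


def top_k_lesions(probs: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k lesions with the highest predicted malignancy probability."""
    # Sort by probability (descending); break ties by ID so the order is deterministic.
    ranked = sorted(probs.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical model outputs: isic_id -> predicted probability of malignancy.
example_probs = {
    "ISIC_A": 0.02,
    "ISIC_B": 0.91,
    "ISIC_C": 0.40,
    "ISIC_D": 0.07,
}

for isic_id, p in top_k_lesions(example_probs, k=2):
    print(f"{isic_id}: {p:.2f}")
```

Ranking by probability rather than thresholding means a reviewer with limited time always sees the most suspicious lesions first, whatever the model's calibration.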

### Importance of 3D TBP
The **3D Total Body Photography (TBP)** approach captures the entire skin surface in macro resolution. Each lesion on the patient’s body is automatically cropped as a 15x15 mm image tile. These images more closely resemble photos taken by a regular smartphone camera, as opposed to specialized dermoscopy devices.

![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4972760%2F169b1f691322233e7b31aabaf6716ff3%2Fex-tiles.png?generation=1717700538524806&alt=media)

### Clinical Background
1. **Major skin cancer types**: Basal Cell Carcinoma (BCC), Squamous Cell Carcinoma (SCC), and Melanoma (most lethal).  
2. **Early detection** is crucial: Minor surgery can cure many skin cancers if caught in time.  
3. **Telemedicine implications**: With the rise in remote healthcare, patients often submit low-quality images captured at home. Robust AI models are needed to handle this variability.

### Summary
- You will build a model to **classify skin lesions** (benign vs. malignant) with probabilities.  
- The dataset includes **every lesion** from thousands of patients, reflecting real-world diversity.  
- **3D TBP** and the “ugly duckling sign” (a lesion that looks unlike the patient’s other lesions deserves closer attention) illustrate the importance of comparing each lesion against the patient’s total lesion landscape.  
- **Your work** can help improve early detection, prioritizing high-risk cases for clinical evaluation and potentially saving lives.
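
One possible way to turn the “ugly duckling” idea into a feature is to score each lesion relative to the other lesions of the same patient. The sketch below computes a per-patient z-score for a single numeric feature. The helper name `patient_relative_scores` and the toy values are hypothetical; with the real metadata you might group by `patient_id` and use a column such as `clin_size_long_diam_mm`.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def patient_relative_scores(rows: List[Tuple[str, str, float]]) -> Dict[str, float]:
    """Map each lesion ID to its z-score relative to the same patient's lesions.

    rows: (patient_id, lesion_id, feature_value) tuples.
    """
    by_patient = defaultdict(list)
    for patient_id, _, value in rows:
        by_patient[patient_id].append(value)

    scores = {}
    for patient_id, lesion_id, value in rows:
        values = by_patient[patient_id]
        mu, sigma = mean(values), pstdev(values)
        # A patient with one lesion (or identical values) has no spread to compare against.
        scores[lesion_id] = 0.0 if sigma == 0 else (value - mu) / sigma
    return scores


# Hypothetical lesion diameters (mm) for two patients.
rows = [
    ("P1", "L1", 2.0),
    ("P1", "L2", 2.0),
    ("P1", "L3", 2.0),
    ("P1", "L4", 6.0),  # the outlier for patient P1
    ("P2", "L5", 5.0),  # only lesion for patient P2
]
scores = patient_relative_scores(rows)
for lesion_id, z in scores.items():
    print(f"{lesion_id}: {z:+.2f}")
```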


In [31]:
METADATA_COL2DESC = {
    "isic_id": "Unique identifier for each image case.",
    "target": "Binary class label (0 = benign, 1 = malignant).",
    "patient_id": "Unique identifier for each patient.",
    "age_approx": "Approximate age of the patient at time of imaging.",
    "sex": "Sex of the patient (male or female).",
    "anatom_site_general": "General location of the lesion on the patient's body.",
    "clin_size_long_diam_mm": "Maximum diameter of the lesion (mm).",
    "image_type": "Type of image captured, as defined in the ISIC Archive.",
    "tbp_tile_type": "Lighting modality of the 3D Total Body Photography (TBP) source image.",
    "tbp_lv_A": "Color channel A (green-red axis in LAB space) inside the lesion.",
    "tbp_lv_Aext": "Color channel A outside the lesion.",
    "tbp_lv_B": "Color channel B (blue-yellow axis in LAB space) inside the lesion.",
    "tbp_lv_Bext": "Color channel B outside the lesion.",
    "tbp_lv_C": "Chroma value inside the lesion.",
    "tbp_lv_Cext": "Chroma value outside the lesion.",
    "tbp_lv_H": "Hue value inside the lesion (LAB color space).",
    "tbp_lv_Hext": "Hue value outside the lesion.",
    "tbp_lv_L": "Luminance inside the lesion (LAB color space).",
    "tbp_lv_Lext": "Luminance outside the lesion.",
    "tbp_lv_areaMM2": "Area of the lesion in mm².",
    "tbp_lv_area_perim_ratio": "Ratio of the lesion's perimeter to its area (border jaggedness).",
    "tbp_lv_color_std_mean": "Mean color irregularity within the lesion.",
    "tbp_lv_deltaA": "Average contrast in color channel A between inside and outside.",
    "tbp_lv_deltaB": "Average contrast in color channel B between inside and outside.",
    "tbp_lv_deltaL": "Average contrast in luminance between inside and outside.",
    "tbp_lv_deltaLB": "Combined contrast between the lesion and surrounding skin.",
    "tbp_lv_deltaLBnorm": "Normalized contrast (LAB color space).",
    "tbp_lv_eccentricity": "Eccentricity of the lesion (how elongated it is).",
    "tbp_lv_location": "Detailed anatomical location (e.g., Upper Arm).",
    "tbp_lv_location_simple": "Simplified anatomical location (e.g., Arm).",
    "tbp_lv_minorAxisMM": "Smallest diameter of the lesion in mm.",
    "tbp_lv_nevi_confidence": "Confidence score (0-100) for the lesion being a nevus.",
    "tbp_lv_norm_border": "Normalized border irregularity (0-10 scale).",
    "tbp_lv_norm_color": "Normalized color variation (0-10 scale).",
    "tbp_lv_perimeterMM": "Perimeter of the lesion in mm.",
    "tbp_lv_radial_color_std_max": "Color asymmetry within the lesion, measured radially.",
    "tbp_lv_stdL": "Std. deviation of luminance inside the lesion.",
    "tbp_lv_stdLExt": "Std. deviation of luminance outside the lesion.",
    "tbp_lv_symm_2axis": "Asymmetry about a second axis of symmetry.",
    "tbp_lv_symm_2axis_angle": "Angle of that second axis of symmetry.",
    "tbp_lv_x": "X-coordinate in the 3D TBP model.",
    "tbp_lv_y": "Y-coordinate in the 3D TBP model.",
    "tbp_lv_z": "Z-coordinate in the 3D TBP model.",
    "attribution": "Image source or institution.",
    "copyright_license": "License information.",
    "lesion_id": "Unique ID for lesions of interest.",
    "iddx_full": "Full diagnosis classification.",
    "iddx_1": "First-level (broad) diagnosis.",
    "iddx_2": "Second-level diagnosis.",
    "iddx_3": "Third-level diagnosis.",
    "iddx_4": "Fourth-level diagnosis.",
    "iddx_5": "Fifth-level diagnosis.",
    "mel_mitotic_index": "Mitotic index of invasive malignant melanomas.",
    "mel_thick_mm": "Thickness of melanoma invasion in mm.",
    "tbp_lv_dnn_lesion_confidence": "Lesion confidence score (0-100) from a DNN classifier."
}

METADATA_COL2NAME = {
    "isic_id": "Unique Case Identifier",
    "target": "Binary Lesion Classification",
    "patient_id": "Unique Patient Identifier",
    "age_approx": "Approximate Age",
    "sex": "Sex",
    "anatom_site_general": "General Anatomical Location",
    "clin_size_long_diam_mm": "Clinical Size (Longest Diameter in mm)",
    "image_type": "Image Type",
    "tbp_tile_type": "TBP Tile Type",
    "tbp_lv_A": "Color Channel A (Inside)",
    "tbp_lv_Aext": "Color Channel A (Outside)",
    "tbp_lv_B": "Color Channel B (Inside)",
    "tbp_lv_Bext": "Color Channel B (Outside)",
    "tbp_lv_C": "Chroma (Inside)",
    "tbp_lv_Cext": "Chroma (Outside)",
    "tbp_lv_H": "Hue (Inside)",
    "tbp_lv_Hext": "Hue (Outside)",
    "tbp_lv_L": "Luminance (Inside)",
    "tbp_lv_Lext": "Luminance (Outside)",
    "tbp_lv_areaMM2": "Lesion Area (mm²)",
    "tbp_lv_area_perim_ratio": "Area-to-Perimeter Ratio",
    "tbp_lv_color_std_mean": "Mean Color Irregularity",
    "tbp_lv_deltaA": "Delta A",
    "tbp_lv_deltaB": "Delta B",
    "tbp_lv_deltaL": "Delta L",
    "tbp_lv_deltaLB": "Delta LB",
    "tbp_lv_deltaLBnorm": "Normalized Delta LB",
    "tbp_lv_eccentricity": "Eccentricity",
    "tbp_lv_location": "Detailed Location",
    "tbp_lv_location_simple": "Simplified Location",
    "tbp_lv_minorAxisMM": "Smallest Diameter (mm)",
    "tbp_lv_nevi_confidence": "Nevus Confidence Score",
    "tbp_lv_norm_border": "Normalized Border Irregularity",
    "tbp_lv_norm_color": "Normalized Color Variation",
    "tbp_lv_perimeterMM": "Lesion Perimeter (mm)",
    "tbp_lv_radial_color_std_max": "Radial Color Deviation",
    "tbp_lv_stdL": "Std. Dev. Luminance (Inside)",
    "tbp_lv_stdLExt": "Std. Dev. Luminance (Outside)",
    "tbp_lv_symm_2axis": "Symmetry (Second Axis)",
    "tbp_lv_symm_2axis_angle": "Symmetry Angle (Second Axis)",
    "tbp_lv_x": "X-Coordinate",
    "tbp_lv_y": "Y-Coordinate",
    "tbp_lv_z": "Z-Coordinate",
    "attribution": "Image Source",
    "copyright_license": "Copyright",
    "lesion_id": "Unique Lesion ID",
    "iddx_full": "Full Diagnosis",
    "iddx_1": "Diagnosis Level 1",
    "iddx_2": "Diagnosis Level 2",
    "iddx_3": "Diagnosis Level 3",
    "iddx_4": "Diagnosis Level 4",
    "iddx_5": "Diagnosis Level 5",
    "mel_mitotic_index": "Mitotic Index (Melanoma)",
    "mel_thick_mm": "Melanoma Thickness (mm)",
    "tbp_lv_dnn_lesion_confidence": "Lesion DNN Confidence"
}
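
One way to use these two dictionaries is to print a readable summary of whichever columns a DataFrame contains. The sketch below assumes a hypothetical helper, `describe_columns`, and repeats a three-entry excerpt of the dictionaries (under the shortened names `COL2NAME` and `COL2DESC`) so that it runs on its own.

```python
from typing import Dict, Iterable, List

# Short excerpts of the dictionaries defined above, repeated so this runs standalone.
COL2NAME = {
    "isic_id": "Unique Case Identifier",
    "target": "Binary Lesion Classification",
    "age_approx": "Approximate Age",
}
COL2DESC = {
    "isic_id": "Unique identifier for each image case.",
    "target": "Binary class label (0 = benign, 1 = malignant).",
    "age_approx": "Approximate age of the patient at time of imaging.",
}


def describe_columns(columns: Iterable[str],
                     col2name: Dict[str, str],
                     col2desc: Dict[str, str]) -> List[str]:
    """Return one 'column (Name): description' line per column."""
    lines = []
    for col in columns:
        # Fall back to placeholders for columns missing from the dictionaries,
        # e.g. features you engineer yourself.
        name = col2name.get(col, col)
        desc = col2desc.get(col, "No description available.")
        lines.append(f"{col} ({name}): {desc}")
    return lines


summary = describe_columns(["isic_id", "target", "my_new_feature"], COL2NAME, COL2DESC)
for line in summary:
    print(line)
```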

In [32]:
import os
import h5py
import cv2
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split
import torchvision.transforms as T
import matplotlib.pyplot as plt

from torchvision import models
from tqdm.auto import tqdm

# If using GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# ---------------------------
# 1. Custom Dataset for HDF5
# ---------------------------
class ISIC_HDF5_Dataset(Dataset):
    """
    A PyTorch Dataset that loads images from an HDF5 file given a DataFrame of IDs.
    Applies image transforms suitable for ResNet50.
    """
    def __init__(self, df: pd.DataFrame, hdf5_path: str, transform=None, is_labelled: bool = True):
        """
        Args:
            df (pd.DataFrame): DataFrame containing 'isic_id' and optionally 'target'.
            hdf5_path (str): Path to the HDF5 file containing images.
            transform (callable): Optional transforms to be applied on a sample.
            is_labelled (bool): Whether the dataset includes labels (for train/val).
        """
        self.df = df.reset_index(drop=True)
        self.hdf5_path = hdf5_path
        self.transform = transform
        self.is_labelled = is_labelled

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        isic_id = row["isic_id"]
        
        # Load image from HDF5
        image_rgb = self._load_image_from_hdf5(isic_id)
        
        # Apply transforms (PIL-style transforms require converting np array to PIL, or we can do tensor transforms)
        if self.transform is not None:
            # Convert NumPy array (H x W x C) to a PIL Image
            import torchvision.transforms.functional as F_v
            image_pil = F_v.to_pil_image(image_rgb)
            image = self.transform(image_pil)
        else:
            # By default, convert it to a tensor (C x H x W)
            image = torch.from_numpy(image_rgb).permute(2, 0, 1).float()

        if self.is_labelled:
            label = row["target"]
            label = torch.tensor(label).float()
            return image, label, isic_id
        else:
            return image, isic_id

    def _load_image_from_hdf5(self, isic_id: str):
        """
        Loads and decodes an image from HDF5 by isic_id.
        Returns a NumPy array in RGB format (H x W x 3).
        """
        with h5py.File(self.hdf5_path, 'r') as hf:
            encoded_bytes = hf[isic_id][()]  # uint8 array

        # Decode the image bytes with OpenCV (returns BGR)
        image_bgr = cv2.imdecode(encoded_bytes, cv2.IMREAD_COLOR)
        # Convert to RGB
        image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
        return image_rgb

# ----------------------------------------------------
# 2-3. Data Setup (Train/Valid/Test) and Basic Transforms
# ----------------------------------------------------
TRAIN_METADATA_CSV = "data/new-train-metadata.csv"
TEST_METADATA_CSV  = "data/students-test-metadata.csv"
TRAIN_HDF5         = "data/train-image.hdf5"
TEST_HDF5          = "data/test-image.hdf5"

train_df = pd.read_csv(TRAIN_METADATA_CSV)
test_df  = pd.read_csv(TEST_METADATA_CSV)

print(f"train_df shape: {train_df.shape}")
print(f"test_df shape:  {test_df.shape}")

# Example: split train_df into 80% train / 20% valid
train_size = int(0.8 * len(train_df))
valid_size = len(train_df) - train_size
train_subset, valid_subset = random_split(
    train_df, 
    [train_size, valid_size],
    generator=torch.Generator().manual_seed(42)
)

train_df_sub = train_df.iloc[train_subset.indices].reset_index(drop=True)
valid_df_sub = train_df.iloc[valid_subset.indices].reset_index(drop=True)

print(f"Train samples: {len(train_df_sub)}, Valid samples: {len(valid_df_sub)}")

# Basic transforms for ResNet
resnet_transforms = T.Compose([
    T.Resize((224,224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

# Create Datasets
train_dataset = ISIC_HDF5_Dataset(
    df=train_df_sub, 
    hdf5_path=TRAIN_HDF5,
    transform=resnet_transforms,
    is_labelled=True
)

valid_dataset = ISIC_HDF5_Dataset(
    df=valid_df_sub,
    hdf5_path=TRAIN_HDF5,
    transform=resnet_transforms,
    is_labelled=True
)

test_dataset = ISIC_HDF5_Dataset(
    df=test_df,
    hdf5_path=TEST_HDF5,
    transform=resnet_transforms,
    is_labelled=False
)

print("Created train/valid/test datasets.")




Using device: cuda


train_df shape: (400959, 55)
test_df shape:  (100, 44)
Train samples: 320767, Valid samples: 80192
Created train/valid/test datasets.
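
The shapes printed above show 55 columns in the training metadata but only 44 in the test metadata. Columns that are missing from the test file cannot be used as model inputs at inference time, so it is worth listing them. The sketch below does this with a simple membership check. The column lists are small hypothetical excerpts; with the real data you would pass `train_df.columns` and `test_df.columns` instead.

```python
from typing import Iterable, List


def train_only_columns(train_cols: Iterable[str], test_cols: Iterable[str]) -> List[str]:
    """Return columns present in the training metadata but absent from the test metadata."""
    test_set = set(test_cols)
    # Preserve the training column order for readability.
    return [col for col in train_cols if col not in test_set]


# Hypothetical excerpts of the column lists, for illustration only.
train_cols = ["isic_id", "target", "age_approx", "sex", "iddx_full"]
test_cols = ["isic_id", "age_approx", "sex"]

print(train_only_columns(train_cols, test_cols))
```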


In [33]:
# -----------------------------
# 4. RandomSampler for 1000 Samples per Epoch
# -----------------------------
# Instead of weighting for class imbalance, we simply draw 1000 random samples each epoch.
# Students can discover the imbalance issue themselves!

from torch.utils.data import RandomSampler

# A RandomSampler with replacement=True can pick num_samples=1000 each epoch
sampler = RandomSampler(
    data_source=train_dataset,
    replacement=True,
    num_samples=1000
)

BATCH_SIZE = 8

train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    sampler=sampler,  # No shuffle needed, RandomSampler handles randomness
    num_workers=2
)

valid_loader = DataLoader(
    valid_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=2
)

test_loader = DataLoader(
    test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=2
)

print(f"Train loader: {len(train_loader)} batches (total = 1000 samples / batch_size = {BATCH_SIZE})")
print(f"Valid loader: {len(valid_loader)} batches")
print(f"Test loader:  {len(test_loader)} batches")



Train loader: 125 batches (total = 1000 samples / batch_size = 8)
Valid loader: 10024 batches
Test loader:  13 batches
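
The batch counts printed above follow from ceiling division, because `DataLoader` keeps the last, smaller batch by default (`drop_last=False`). The sketch below reproduces them from the sample counts reported earlier; the helper name `num_batches` is hypothetical.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a loader yields for the given sample count."""
    if drop_last:
        # The incomplete final batch is discarded.
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


BATCH_SIZE = 8
print(num_batches(1000, BATCH_SIZE))   # train: 1000 sampled images per epoch -> 125
print(num_batches(80192, BATCH_SIZE))  # valid: 80192 images -> 10024
print(num_batches(100, BATCH_SIZE))    # test: 100 images -> 13
```

Note that one "epoch" here covers only 1,000 of the 320,767 training images, so five epochs see at most 5,000 samples.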


In [34]:
# -----------------------------
# 5. Load & Modify ResNet50 for Binary Classification
# -----------------------------
import torch.optim as optim
from torchvision import models

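# Load ImageNet-pretrained weights. The `weights=` argument replaces the
# deprecated `pretrained=True` (requires torchvision >= 0.13).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Replace the final layer with a single-logit head for binary classification
model.fc = nn.Linear(in_features=model.fc.in_features, out_features=1)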

model = model.to(device)

criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)
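
`nn.BCEWithLogitsLoss` applies the sigmoid and the binary cross-entropy in a single step, which is why the model's final layer outputs a raw logit rather than a probability. The pure-Python sketch below illustrates a numerically stable per-sample form, `max(x, 0) - x*y + log(1 + exp(-|x|))`, and checks it against the naive sigmoid-then-log version. The function names are hypothetical, and this illustrates the math rather than PyTorch's actual implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce_with_logits(logit: float, label: float) -> float:
    """Binary cross-entropy computed directly from a logit (stable form)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def bce_naive(logit: float, label: float) -> float:
    """Same loss via an explicit sigmoid; reliable for moderate logits only."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


examples = [(0.0, 1.0), (2.0, 1.0), (2.0, 0.0), (-3.0, 0.0)]
for logit, label in examples:
    print(f"logit={logit:+.1f} label={label:.0f} loss={bce_with_logits(logit, label):.4f}")
```

The naive version fails for extreme logits (for example, `log(1 - p)` hits `log(0)` when the sigmoid saturates to 1.0), whereas the stable form stays finite.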




In [35]:
# -----------------------------
# 6. Training Loop (5 Epochs, Only 1000 Samples/Epoch)
# -----------------------------
EPOCHS = 5
for epoch in range(1, EPOCHS+1):
    model.train()
    running_loss = 0.0
    
    for images, labels, _ in tqdm(train_loader, desc=f"Epoch {epoch}", leave=False):
        images, labels = images.to(device), labels.to(device)
        
        optimizer.zero_grad()
        logits = model(images).view(-1)  # [batch_size]
        
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
    
    avg_train_loss = running_loss / len(train_loader)
    print(f"Epoch {epoch}/{EPOCHS} | Train Loss: {avg_train_loss:.4f}")


Epoch 1/5 | Train Loss: 0.0766
Epoch 2/5 | Train Loss: 0.0164
Epoch 3/5 | Train Loss: 0.0015
Epoch 4/5 | Train Loss: 0.0075
Epoch 5/5 | Train Loss: 0.0008
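
The loop above reports only the training loss; `valid_loader` is created but never used. To judge the model on held-out data, you would collect the validation labels and predicted probabilities and compute a ranking metric. The sketch below computes ROC AUC in pure Python using the rank-sum (Mann-Whitney) formulation. AUC is an assumption here, so check the competition page for the official metric. The function name `roc_auc` and the toy data are hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC: probability that a random positive outscores a random negative.

    Tied scores count as half. Uses average ranks, so it runs in O(n log n).
    """
    n = len(labels)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores and give each its average 1-based rank.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present.")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy validation data: 1 = malignant, 0 = benign.
y_true = [0, 0, 1, 0, 1]
y_prob = [0.10, 0.40, 0.35, 0.20, 0.80]
print(f"AUC = {roc_auc(y_true, y_prob):.3f}")
```

A ranking metric is more informative than the loss here: a model that assigns near-zero probability to every lesion can reach a very low loss while being unable to separate the two classes.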


In [36]:
# -----------------------------
# 7. Inference on Test Set & Submission
# -----------------------------
model.eval()
predictions = []

with torch.no_grad():
    for images, isic_ids in tqdm(test_loader, desc="Inference on Test"):
        images = images.to(device)
        logits = model(images).view(-1)  # shape [batch_size]
        probs = torch.sigmoid(logits)    # shape [batch_size], in [0,1]
        
        probs = probs.cpu().numpy()
        
        for isic_id, p in zip(isic_ids, probs):
            predictions.append({"isic_id": isic_id, "target": float(p)})

submission_df = pd.DataFrame(predictions)
submission_df = submission_df.sort_values(by="isic_id").reset_index(drop=True)

submission_file = "submission.csv"
submission_df.to_csv(submission_file, index=False)

print(f"Saved submission with {len(submission_df)} rows to {submission_file}")
display(submission_df.head(10))


Inference on Test: 100%|██████████| 13/13 [00:00<00:00, 13.08it/s]

Saved submission with 100 rows to submission.csv





|   | isic_id | target |
|---|---|---|
| 0 | ISIC_0082829 | 0.000349 |
| 1 | ISIC_0114227 | 0.000398 |
| 2 | ISIC_0157465 | 0.000469 |
| 3 | ISIC_0197356 | 0.000960 |
| 4 | ISIC_0275647 | 0.000363 |
| 5 | ISIC_0332355 | 0.000440 |
| 6 | ISIC_0528190 | 0.000527 |
| 7 | ISIC_0576478 | 0.000472 |
| 8 | ISIC_0719839 | 0.000478 |
| 9 | ISIC_0968965 | 0.000827 |
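
Before uploading, it can help to sanity-check the file. The sketch below assumes the required format is exactly the two columns written above (`isic_id`, `target`), with one probability in [0, 1] per test image. The helper `check_submission` and the sample rows are hypothetical.

```python
import csv
import io
from typing import List


def check_submission(csv_text: str, expected_ids: List[str]) -> List[str]:
    """Return a list of problems found in a submission; an empty list means it looks valid."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["isic_id", "target"]:
        return [f"Unexpected header: {reader.fieldnames}"]

    problems = []
    seen = set()
    for row in reader:
        isic_id = row["isic_id"]
        if isic_id in seen:
            problems.append(f"Duplicate id: {isic_id}")
        seen.add(isic_id)
        try:
            p = float(row["target"])
        except (TypeError, ValueError):
            problems.append(f"Non-numeric target for {isic_id}")
            continue
        # This check also rejects NaN, because comparisons with NaN are always False.
        if not 0.0 <= p <= 1.0:
            problems.append(f"Target out of [0, 1] for {isic_id}: {p}")

    missing = sorted(set(expected_ids) - seen)
    if missing:
        problems.append(f"Missing ids: {missing}")
    return problems


# Hypothetical submissions: one valid, one with an out-of-range value and a missing row.
ids = ["ISIC_0000001", "ISIC_0000002"]
good = "isic_id,target\nISIC_0000001,0.01\nISIC_0000002,0.75\n"
bad = "isic_id,target\nISIC_0000001,1.5\n"

print(check_submission(good, ids))
print(check_submission(bad, ids))
```

With the real files, you could read `submission.csv` as text and pass `test_df["isic_id"].tolist()` as the expected IDs.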
