# SIIM-COVID19 Detection: Plan and EDA

## 1. Problem Understanding

This is a medical imaging competition with two related tasks:
1.  **Image-level:** Detect and localize COVID-19 related opacities with bounding boxes for a single class: `opacity`.
2.  **Study-level:** Classify each study into one of four classes: `Negative for Pneumonia`, `Typical Appearance`, `Indeterminate Appearance`, `Atypical Appearance`.

The evaluation metric is a blended **mean Average Precision (mAP)**, averaging the image-level and study-level mAP scores. The submission format requires predictions for both tasks.

## 2. Revised Plan (Based on Expert Advice)

### Core Strategy: Two-Model Pipeline
Following expert advice, I will build two separate, specialized models:
1.  **Object Detector:** A YOLOv5 model trained to detect a single `opacity` class.
2.  **Image Classifier:** An EfficientNet model (via `timm`) trained to classify images into the four study-level categories.
3.  **Fusion:** At inference, predictions will be fused. Study-level predictions will be an aggregation of the classifier's outputs across all images in a study. Detector meta-features (e.g., max confidence score) can be used to refine this.
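
The study-level aggregation described in the fusion step could look roughly like the sketch below. The function name `aggregate_study_probs`, the dictionary-based inputs, and the `"mean"`/`"max"` options are illustrative assumptions; the plan does not fix any of them.

```python
from collections import defaultdict

def aggregate_study_probs(image_probs, image_to_study, how="mean"):
    """Combine per-image class probabilities into one vector per study.

    image_probs:    {image_id: [p_negative, p_typical, p_indeterminate, p_atypical]}
    image_to_study: {image_id: study_id}
    how:            "mean" or "max" (hypothetical options for this sketch)
    """
    grouped = defaultdict(list)
    for image_id, probs in image_probs.items():
        grouped[image_to_study[image_id]].append(probs)

    if how == "max":
        reducer = max
    else:
        reducer = lambda values: sum(values) / len(values)

    # zip(*rows) iterates class by class across all images of the study
    return {
        study_id: [reducer(column) for column in zip(*rows)]
        for study_id, rows in grouped.items()
    }

example_probs = {"img_a": [0.1, 0.7, 0.1, 0.1], "img_b": [0.3, 0.5, 0.1, 0.1]}
example_map = {"img_a": "study_1", "img_b": "study_1"}
mean_result = aggregate_study_probs(example_probs, example_map, how="mean")
max_result = aggregate_study_probs(example_probs, example_map, how="max")
```

Note that taking the per-class maximum does not produce a vector that sums to 1, which may or may not matter depending on how the scores are used downstream.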

### Phase 1: Setup, EDA & Preprocessing (Hours 0-3)
1.  **Environment Setup:** Install necessary packages: `pandas`, `pydicom`, `scikit-learn`, `timm`, `albumentations`, and clone the YOLOv5 repository.
2.  **Metadata Exploration:** Analyze `train_study_level.csv` and `train_image_level.csv`. Merge them to create a unified dataframe for training.
3.  **Validation Strategy:** Implement a `StratifiedGroupKFold` split on `StudyInstanceUID`. The stratification will be based on a combination of the study-level class and a `has_opacity` flag to ensure balanced folds for both tasks.
4.  **DICOM Preprocessing Pipeline:** This is a critical, make-or-break step.
    *   Load DICOM files using `pydicom`.
    *   Apply `Rescale Slope/Intercept` and `VOI LUT`.
    *   Handle `Photometric Interpretation` (invert `MONOCHROME1` images).
    *   Resize images (e.g., to 640x640 or 1024x1024) using letterboxing to preserve aspect ratio.
    *   Normalize pixel values.
    *   **Cache processed images as PNGs** to accelerate training experiments.
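
If images are letterboxed while caching, the bounding boxes have to be mapped with the same scale and padding. The following is a minimal sketch of that geometry only (no pixel resampling); the helper names `letterbox_params` and `box_to_letterbox` and the centered-padding convention are assumptions made for illustration.

```python
def letterbox_params(src_w, src_h, target=640):
    """Return (scale, pad_left, pad_top) for fitting an image into a target x target square."""
    scale = min(target / src_w, target / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return scale, pad_left, pad_top

def box_to_letterbox(box, scale, pad_left, pad_top):
    """Map (xmin, ymin, xmax, ymax) from source pixels to letterboxed pixels."""
    xmin, ymin, xmax, ymax = box
    return (
        xmin * scale + pad_left,
        ymin * scale + pad_top,
        xmax * scale + pad_left,
        ymax * scale + pad_top,
    )

# A 2000x1000 image scaled into a 640x640 canvas
scale, pad_left, pad_top = letterbox_params(2000, 1000, target=640)
mapped = box_to_letterbox((500, 250, 1500, 750), scale, pad_left, pad_top)
```

If the PNGs are cached at their original resolution instead, this mapping is not needed, because normalized YOLO labels are independent of the image size.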

### Phase 2: Baseline Model Training (Hours 3-10)
1.  **Detector Training (YOLOv5):**
    *   Prepare data in YOLOv5 format (`.txt` label files).
    *   Train a `YOLOv5s` or `YOLOv5m` model on a single fold to establish a working pipeline.
    *   Use augmentations from `albumentations` (H-flip, small rotations, brightness/contrast).
2.  **Classifier Training (EfficientNet):**
    *   Create a PyTorch `Dataset` for the cached PNG images.
    *   Train an `EfficientNet-B3` model on a single fold.
3.  **Inference & Submission:**
    *   Build an inference script that runs both models.
    *   Implement the logic for the specific submission format: `opacity conf xmin ymin xmax ymax` for detections (corner coordinates, matching the `label` column in `train_image_level.csv`) and `none 1 0 0 1 1` for images with no predicted boxes.
    *   Aggregate image-level classifications to the study level (e.g., by taking the `max` of probabilities).
    *   Generate a baseline `submission.csv` and verify its format.
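
A small helper for the image-level prediction string might look like the sketch below. It assumes the submission uses the same corner-coordinate layout as the `label` column of `train_image_level.csv`; the function name, the tuple layout of `detections`, and the number of decimals are illustrative choices.

```python
def format_image_prediction(detections):
    """Build the prediction string for one image.

    detections: list of (confidence, xmin, ymin, xmax, ymax) tuples in
    original-image pixel coordinates (this layout is an assumption of the sketch).
    """
    if not detections:
        # Placeholder used in the training labels for images without boxes
        return "none 1 0 0 1 1"
    parts = [
        f"opacity {conf:.4f} {xmin:.1f} {ymin:.1f} {xmax:.1f} {ymax:.1f}"
        for conf, xmin, ymin, xmax, ymax in detections
    ]
    return " ".join(parts)

empty_prediction = format_image_prediction([])
single_prediction = format_image_prediction([(0.9, 10, 20, 110, 220)])
```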

### Phase 3: Full CV Training & Ensembling (Hours 10-18)
1.  **Full CV Training:** Train both the detector and classifier on all folds.
2.  **Test-Time Augmentation (TTA):** Implement horizontal flip TTA for both models during inference.
3.  **Ensembling:**
    *   **Detector:** Use **Weighted Boxes Fusion (WBF)** to ensemble predictions from different fold models.
    *   **Classifier:** Average the logits/probabilities from the fold models.
4.  **Threshold Tuning:** Optimize the detector's confidence threshold and the WBF IoU threshold on the out-of-fold validation sets to maximize the CV mAP score.
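
For horizontal-flip TTA on the detector, boxes predicted on the flipped image have to be mirrored back before they are fused with the un-flipped predictions. A minimal sketch, assuming boxes are `(xmin, ymin, xmax, ymax)` in pixels; the helper name `unflip_boxes` is hypothetical.

```python
def unflip_boxes(boxes, image_width):
    """Map boxes predicted on a horizontally flipped image back to the original frame.

    boxes: list of (xmin, ymin, xmax, ymax) in pixels.
    """
    # Mirroring swaps the roles of the left and right edges
    return [
        (image_width - xmax, ymin, image_width - xmin, ymax)
        for xmin, ymin, xmax, ymax in boxes
    ]

flipped_back = unflip_boxes([(10, 20, 110, 220)], image_width=1000)
round_trip = unflip_boxes(flipped_back, image_width=1000)
```

Applying the function twice returns the original boxes, which makes for a cheap sanity check before ensembling.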

### Phase 4: Final Submission (Hours 18-24)
1.  **Final Inference:** Run the full, ensembled pipeline with TTA on the test set.
2.  **Sanity Checks:** Perform final checks on the `submission.csv` file, ensuring all study/image IDs are present and the format is perfect.
3.  **Submit**.

# Phase 1: Setup, EDA & Preprocessing

## 1.1 Environment Setup

First, let's install the necessary libraries and clone the YOLOv5 repository.

In [None]:
import subprocess
import sys
import os

print("--- Installing packages ---")
packages = [
    "pydicom", "timm", "albumentations", "scikit-image", "pycocotools",
    "pylibjpeg", "pylibjpeg-libjpeg", "python-gdcm"
]
try:
    # Removed -q flag for verbose output
    subprocess.run([sys.executable, '-m', 'pip', 'install'] + packages, check=True)
    print("Packages installed successfully.")
except subprocess.CalledProcessError as e:
    print(f"Pip install failed: {e}")

print("\n--- Cloning yolov5 repository ---")
if not os.path.exists('yolov5'):
    try:
        subprocess.run(['git', 'clone', 'https://github.com/ultralytics/yolov5.git'], check=True)
        print("YOLOv5 cloned successfully.")
    except subprocess.CalledProcessError as e:
        print(f"Git clone failed: {e}")
else:
    print("YOLOv5 directory already exists.")

print("\n--- Installing yolov5 requirements ---")
yolov5_req_path = os.path.join('yolov5', 'requirements.txt')
if os.path.exists(yolov5_req_path):
    try:
        # Removed -q flag for verbose output
        subprocess.run([sys.executable, '-m', 'pip', 'install', '-r', yolov5_req_path], check=True)
        print("YOLOv5 requirements installed successfully.")
    except subprocess.CalledProcessError as e:
        print(f"YOLOv5 requirements install failed: {e}")
else:
    print(f"Could not find {yolov5_req_path}")

print("\nSetup cell finished.")

## 1.2 Metadata Exploration

Now that the environment is set up, let's load the training metadata and inspect its structure. We have two main files: `train_study_level.csv` and `train_image_level.csv`.

In [1]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_columns', 100)

DATA_DIR = './'

df_study = pd.read_csv(os.path.join(DATA_DIR, 'train_study_level.csv'))
df_image = pd.read_csv(os.path.join(DATA_DIR, 'train_image_level.csv'))

print("Study Level Data:")
display(df_study.head())
print(f"Shape: {df_study.shape}")
print("\nImage Level Data:")
display(df_image.head())
print(f"Shape: {df_image.shape}")

Study Level Data:


| | id | Negative for Pneumonia | Typical Appearance | Indeterminate Appearance | Atypical Appearance |
|---|---|---|---|---|---|
| 0 | 00086460a852_study | 0 | 1 | 0 | 0 |
| 1 | 00292f8c37bd_study | 1 | 0 | 0 | 0 |
| 2 | 005057b3f880_study | 1 | 0 | 0 | 0 |
| 3 | 0051d9b12e72_study | 0 | 0 | 0 | 1 |
| 4 | 00792b5c8852_study | 1 | 0 | 0 | 0 |


Shape: (5448, 5)

Image Level Data:


| | id | boxes | label | StudyInstanceUID |
|---|---|---|---|---|
| 0 | 000a312787f2_image | [{'x': 789.28836, 'y': 582.43035, 'width': 102... | opacity 1 789.28836 582.43035 1815.94498 2499.... | 5776db0cec75 |
| 1 | 000c3a3f293f_image | | none 1 0 0 1 1 | ff0879eb20ed |
| 2 | 0012ff7358bc_image | [{'x': 677.42216, 'y': 197.97662, 'width': 867... | opacity 1 677.42216 197.97662 1545.21983 1197.... | 9d514ce429a7 |
| 3 | 001398f4ff4f_image | [{'x': 2729, 'y': 2181.33331, 'width': 948.000... | opacity 1 2729 2181.33331 3677.00012 2785.33331 | 28dddc8559b2 |
| 4 | 001bd15d1891_image | [{'x': 623.23328, 'y': 1050, 'width': 714, 'he... | opacity 1 623.23328 1050 1337.23328 2156 opaci... | dfd9fdd85a3e |


Shape: (5696, 4)
Error in callback <function _enable_matplotlib_integration.<locals>.configure_once at 0x70127d9c1e40> (for post_run_cell):

AttributeError: module 'matplotlib' has no attribute 'backend_bases'

In [None]:
# Strip the '_study' / '_image' suffixes so both tables share clean keys
df_study['StudyInstanceUID'] = df_study['id'].str.replace('_study', '', regex=False)
df_image['image_id'] = df_image['id'].str.replace('_image', '', regex=False)

# Merge on the study key; drop the study-level 'id' so the merge does not create id_x / id_y
df_merged = df_image.merge(df_study.drop(columns='id'), on='StudyInstanceUID', how='left')

# Create a 'has_opacity' flag for easier analysis (NaN in 'boxes' means no annotated opacity)
df_merged['has_opacity'] = df_merged['boxes'].notna().astype(int)

print("Merged Dataframe:")
display(df_merged.head())
print(f"Shape: {df_merged.shape}")

In [None]:
# The plotting code is failing due to an environment issue with matplotlib.
# For now, I will just print the numerical summaries to understand the distribution.

print("--- Analyzing Data Distribution ---")

# Study-level classification distribution
print("\nDistribution of Study-Level Classes:")
df_study_labels = df_merged.drop_duplicates('StudyInstanceUID')
study_counts = df_study_labels[['Negative for Pneumonia', 'Typical Appearance', 'Indeterminate Appearance', 'Atypical Appearance']].sum()
print(study_counts)

# Image-level opacity distribution
print("\nDistribution of Images with/without Opacity:")
opacity_counts = df_merged['has_opacity'].value_counts()
print(opacity_counts)

## 1.3 Validation Strategy

As recommended by the experts, I will use `StratifiedGroupKFold` to create the cross-validation folds. 

*   **Groups:** Splitting will be grouped by `StudyInstanceUID` to ensure that all images from a single study belong to the same fold (either train or validation), preventing data leakage.
*   **Stratification:** The stratification is based on the study-level classification label alone, so that each fold has a similar distribution of the four classes. The `has_opacity` flag mentioned in the plan is not yet part of the stratification key in this baseline.
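
Two small helpers can support this step: one builds a combined stratification key from the study class and an opacity flag (as outlined in the plan), and one checks that no study ends up in more than one fold. Both function names are hypothetical, and how a per-study opacity flag is derived (for example, whether any image in the study has a box) is an assumption left to the caller.

```python
from collections import defaultdict

def make_stratify_key(study_label, has_opacity):
    """Combine the study-level class and an opacity flag into one stratification label."""
    return f"{study_label}|opacity={int(bool(has_opacity))}"

def find_leaking_groups(groups, folds):
    """Return the groups (studies) whose rows are spread over more than one fold."""
    seen = defaultdict(set)
    for group, fold in zip(groups, folds):
        seen[group].add(fold)
    return {group: sorted(found) for group, found in seen.items() if len(found) > 1}

key = make_stratify_key("Typical Appearance", 1)
clean = find_leaking_groups(["s1", "s1", "s2"], [0, 0, 1])
leaky = find_leaking_groups(["s1", "s1", "s2"], [0, 1, 1])
```

After the folds are assigned, calling `find_leaking_groups(df_merged['StudyInstanceUID'], df_merged['fold'])` should return an empty dictionary.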

In [None]:
from sklearn.model_selection import StratifiedGroupKFold

N_SPLITS = 5

# Prepare data for splitting. We need one row per study.
# .reset_index(drop=True) is crucial to align integer indices from k-fold split with dataframe rows.
df_folds = df_merged.drop_duplicates('StudyInstanceUID').reset_index(drop=True)

# Create a single target column for stratification
df_folds['stratify_col'] = df_folds[['Negative for Pneumonia', 'Typical Appearance', 'Indeterminate Appearance', 'Atypical Appearance']].idxmax(axis=1)

# Get groups and stratification targets
groups = df_folds['StudyInstanceUID']
y_stratify = df_folds['stratify_col']

df_folds['fold'] = -1

sgkf = StratifiedGroupKFold(n_splits=N_SPLITS, shuffle=True, random_state=42)

# Use .loc with the integer indices from split() on the reset-index df_folds
for fold, (train_idx, val_idx) in enumerate(sgkf.split(df_folds, y_stratify, groups)):
    df_folds.loc[val_idx, 'fold'] = fold

# Merge the fold and stratification info back into the main dataframe
df_merged = df_merged.merge(df_folds[['StudyInstanceUID', 'fold', 'stratify_col']], on='StudyInstanceUID', how='left')

print("Fold distribution:")
print(df_merged['fold'].value_counts())

print("\nValidation set stratification check (normalized counts per fold):")
# Use print instead of display to avoid matplotlib errors
print(df_merged.groupby('fold')['stratify_col'].value_counts(normalize=True).unstack())

In [None]:
# --- Save the dataframe with fold information ---
print("Saving the merged dataframe with fold information to 'df_train_folds.csv'...")
df_merged.to_csv('df_train_folds.csv', index=False)
print("File saved successfully.")

## 1.4 DICOM Preprocessing and Caching

This is a critical step. I will now process the raw DICOM files into a more usable format (PNG) and cache them to disk. This will significantly speed up training and experimentation.

The preprocessing pipeline will:
1.  Read the DICOM file.
2.  Apply the VOI (Value of Interest) LUT if available, which is crucial for correct windowing.
3.  Invert the image if `Photometric Interpretation` is `MONOCHROME1`.
4.  Convert the pixel data to a standard 8-bit format (0-255).
5.  Save the processed image as a PNG file.

I will process all images and save them to a new `train_png` directory.
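
Steps 3 and 4 can be sketched in isolation with NumPy, independent of which DICOM reader supplies the pixel array. The function name `to_uint8` and the 0.5/99.5 percentile bounds are assumptions chosen for illustration, not tuned values.

```python
import numpy as np

def to_uint8(pixels, invert=False, lower_pct=0.5, upper_pct=99.5):
    """Clip to robust percentiles, optionally invert, and scale to the 0-255 range."""
    data = np.asarray(pixels, dtype=np.float32)
    lo, hi = np.percentile(data, [lower_pct, upper_pct])
    data = np.clip(data, lo, hi)
    if invert:
        # MONOCHROME1 stores high values as dark, so mirror the intensities
        data = hi + lo - data
    data = (data - lo) / max(hi - lo, 1e-6)
    return (data * 255).round().astype(np.uint8)

# Synthetic 12-bit ramp standing in for a decoded DICOM pixel array
raw = np.arange(4096).reshape(64, 64)
normal = to_uint8(raw)
inverted = to_uint8(raw, invert=True)
```

Clipping at percentiles rather than at the absolute minimum and maximum keeps a few extreme pixels (burned-in markers, collimator edges) from compressing the useful intensity range.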

In [None]:
import sys
import os

# --- Fix 1: Add pip target directory to Python path ---
pip_target_path = '/app/.pip-target'
if pip_target_path not in sys.path:
    print(f"Adding '{pip_target_path}' to sys.path.")
    sys.path.insert(0, pip_target_path)
else:
    print(f"'{pip_target_path}' is already in sys.path.")

# --- Fix 2: Add GDCM's C++ library path to LD_LIBRARY_PATH ---
# This is crucial for pydicom to find the underlying decompression codecs.
gdcm_lib_path = '/app/.pip-target/python_gdcm.libs'
if os.path.exists(gdcm_lib_path):
    print(f"Found GDCM library path: {gdcm_lib_path}")
    current_ld_path = os.environ.get('LD_LIBRARY_PATH', '')
    if gdcm_lib_path not in current_ld_path:
        print("Adding GDCM library path to LD_LIBRARY_PATH.")
        os.environ['LD_LIBRARY_PATH'] = f"{gdcm_lib_path}:{current_ld_path}"
    else:
        print("GDCM library path already in LD_LIBRARY_PATH.")
    print(f"Current LD_LIBRARY_PATH: {os.environ.get('LD_LIBRARY_PATH', '')}")
else:
    print(f"Warning: GDCM library path '{gdcm_lib_path}' not found.")

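# Note: the dynamic linker reads LD_LIBRARY_PATH when a process starts, so changing it here
# only affects child processes. For the kernel itself, it must be set before the kernel launches.
print("\nEnvironment setup cell complete. LD_LIBRARY_PATH is read at process start: this change only reaches child processes, so set the variable before launching the kernel for it to apply here.")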

In [None]:
import pydicom
import sys
import os
import importlib.metadata

print("--- Environment Diagnostics ---")

# 1. Check LD_LIBRARY_PATH
print(f"LD_LIBRARY_PATH: {os.environ.get('LD_LIBRARY_PATH', 'Not Set')}")

# 2. Check package versions
print("\nPackage Versions:")
# Check for 'python-gdcm' which is the correct pip package name
packages_to_check = ['pydicom', 'python-gdcm', 'pylibjpeg', 'pylibjpeg-libjpeg', 'packaging', 'setuptools']
for package in packages_to_check:
    try:
        version = importlib.metadata.version(package)
        print(f"  - {package}: {version}")
    except importlib.metadata.PackageNotFoundError:
        print(f"  - {package}: NOT FOUND")

# 3. Check pydicom's available handlers
print("\n--- pydicom.config.pixel_data_handlers ---")
from pydicom import config
print(config.pixel_data_handlers)

# 4. Try importing gdcm directly to verify it's accessible
print("\n--- Attempting to import gdcm ---")
try:
    import gdcm
    print("Successfully imported gdcm")
    # Check the version of the underlying GDCM library
    print(f"gdcm library version: {gdcm.Version.GetVersion()}")
except ImportError as e:
    print(f"Failed to import gdcm: {e}")
except Exception as e:
    print(f"An error occurred while importing gdcm: {e}")

In [None]:
import subprocess
import sys

# Pivoting to SimpleITK as per expert's fallback advice due to persistent pydicom errors.
print("--- Installing SimpleITK ---")
try:
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'SimpleITK'], check=True)
    print("SimpleITK installed successfully.")
except subprocess.CalledProcessError as e:
    print(f"Pip install failed: {e}")

In [None]:
# --- Pivoting to SimpleITK for DICOM Preprocessing ---
import glob
import os

import cv2
import numpy as np
import pandas as pd
from tqdm import tqdm
import SimpleITK as sitk
import pydicom

print("--- Preprocessing with SimpleITK Fallback ---")

# --- 1. Map image IDs to their file paths (already done, but good to have here) ---
if 'dcm_path' not in df_merged.columns or df_merged['dcm_path'].isna().all():
    print("Re-building file path map for training images...")
    all_dcm_files = glob.glob('train/*/*/*.dcm')
    image_id_to_path = {os.path.basename(p).replace('.dcm', ''): p for p in all_dcm_files}
    df_merged['dcm_path'] = df_merged['image_id'].map(image_id_to_path)
    print(f"Found {len(all_dcm_files)} DICOM files.")
    print(f"Mapped {df_merged['dcm_path'].notna().sum()} of {len(df_merged)} image IDs to paths.")

# --- 2. Define the processing function using SimpleITK ---
def process_dicom_with_sitk(row, output_dir):
    image_id = row['image_id']
    dcm_path = row['dcm_path']
    
    if pd.isna(dcm_path):
        return
        
    save_path = os.path.join(output_dir, f"{image_id}.png")
    
    if os.path.exists(save_path):
        return

    try:
        # Read metadata with pydicom (fast, avoids pixel data)
        dicom_meta = pydicom.dcmread(dcm_path, stop_before_pixels=True)
        
        # Read image data with SimpleITK (robust)
        sitk_image = sitk.ReadImage(dcm_path)
        data = sitk.GetArrayFromImage(sitk_image).squeeze()
        
        # Apply Rescale Slope/Intercept from metadata
        if 'RescaleSlope' in dicom_meta and 'RescaleIntercept' in dicom_meta:
            slope = float(dicom_meta.RescaleSlope)
            intercept = float(dicom_meta.RescaleIntercept)
            data = data * slope + intercept
            
        # Invert MONOCHROME1 images
        if dicom_meta.PhotometricInterpretation == "MONOCHROME1":
            data = np.amax(data) - data
        
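        # Clip to robust percentiles instead of a fixed CT lung window (center -600, width 1500).
        # A Hounsfield-unit window only makes sense for CT data; applied to radiograph pixel
        # values it would clip away most of the image. The 0.5/99.5 bounds are a heuristic.
        min_val, max_val = np.percentile(data, [0.5, 99.5])
        data = np.clip(data, min_val, max_val)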
        
        # Normalize to 8-bit (0-255)
        data = data - np.min(data)
        data = data / (np.max(data) + 1e-6)
        data = (data * 255).astype(np.uint8)
        
        cv2.imwrite(save_path, data)
    except Exception as e:
        print(f"Error processing {dcm_path} (Image ID: {image_id}): {e}")

# --- 3. Process and save all training images ---
output_dir = 'train_png'
os.makedirs(output_dir, exist_ok=True)
print(f"\nProcessing DICOMs and saving to '{output_dir}'...")

for _, row in tqdm(df_merged.iterrows(), total=len(df_merged), desc="Processing All Train DICOMs with SITK"):
    process_dicom_with_sitk(row, output_dir)

print(f"\nFinished processing and caching training images. Check '{output_dir}' for PNG files.")

# Phase 2: Baseline Model Training

With the EDA and preprocessing underway, I'll now prepare the data for the first model: the YOLOv5 object detector.

## 2.1 Prepare Data for YOLOv5

YOLOv5 requires a specific directory structure and label format:
1.  **Directory Structure:** A root directory containing `images` and `labels` subdirectories. Each of these will have `train` and `val` splits.
2.  **Label Format:** For each image, there must be a corresponding `.txt` file with the same name. Each line in the file represents one bounding box in the format: `<class_index> <x_center_norm> <y_center_norm> <width_norm> <height_norm>`.
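
As a worked example of the label format, the sketch below converts one box from the `boxes` column (top-left corner plus width and height, in pixels) into a YOLO label line. The helper name `to_yolo_line` and the example numbers are made up for illustration.

```python
def to_yolo_line(box, img_w, img_h, class_id=0):
    """Convert a {'x', 'y', 'width', 'height'} pixel box into a YOLO label line."""
    x_center = (box["x"] + box["width"] / 2) / img_w
    y_center = (box["y"] + box["height"] / 2) / img_h
    width = box["width"] / img_w
    height = box["height"] / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

line = to_yolo_line({"x": 100, "y": 200, "width": 300, "height": 400}, img_w=1000, img_h=2000)
```

For this box the center is at (250, 400) pixels, which normalizes to (0.25, 0.2), and the size normalizes to (0.3, 0.2).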

I will now create these files and directories based on the folds defined earlier.

In [None]:
import os
import ast
from tqdm import tqdm
import pandas as pd
import pydicom

# --- Configuration ---
YOLO_DATA_DIR = 'yolov5_data'
FOLD_TO_VALIDATE = 0 # Use fold 0 for the validation set

# --- Create YOLOv5 directory structure ---
print(f"Creating directory structure under '{YOLO_DATA_DIR}'...")
os.makedirs(os.path.join(YOLO_DATA_DIR, 'images/train'), exist_ok=True)
os.makedirs(os.path.join(YOLO_DATA_DIR, 'images/val'), exist_ok=True)
os.makedirs(os.path.join(YOLO_DATA_DIR, 'labels/train'), exist_ok=True)
os.makedirs(os.path.join(YOLO_DATA_DIR, 'labels/val'), exist_ok=True)
print("Directory structure created.")

# --- Get image dimensions if not already present ---
if 'img_height' not in df_merged.columns or df_merged['img_height'].sum() == 0:
    print("Reading DICOM metadata to get image dimensions...")
    df_merged['img_height'] = 0
    df_merged['img_width'] = 0
    
    for index, row in tqdm(df_merged.iterrows(), total=len(df_merged), desc="Reading DICOM metadata"):
        try:
            dicom_meta = pydicom.dcmread(row['dcm_path'], stop_before_pixels=True)
            df_merged.loc[index, 'img_height'] = dicom_meta.Rows
            df_merged.loc[index, 'img_width'] = dicom_meta.Columns
        except Exception as e:
            print(f"Could not read dimensions for {row['image_id']}: {e}")
    print("Finished getting image dimensions.")
else:
    print("Image dimensions already present in dataframe.")

# --- Create symlinks and label files ---
print("Creating YOLO symlinks and label files...")
for _, row in tqdm(df_merged.iterrows(), total=len(df_merged), desc="Creating YOLO files"):
    split = 'val' if row['fold'] == FOLD_TO_VALIDATE else 'train'
    image_id = row['image_id']
    
    # 1. Create symlink to the cached PNG image
    src_png_path = os.path.abspath(os.path.join('train_png', f"{image_id}.png"))
    dst_img_path = os.path.join(YOLO_DATA_DIR, 'images', split, f"{image_id}.png")
    
    if os.path.exists(src_png_path):
        if not os.path.lexists(dst_img_path):
            os.symlink(src_png_path, dst_img_path)
    else:
        # Skip if the source PNG doesn't exist (might have failed preprocessing)
        continue

    # 2. Create the label file
    label_path = os.path.join(YOLO_DATA_DIR, 'labels', split, f"{image_id}.txt")
    
    with open(label_path, 'w') as f:
        if row['has_opacity'] == 1 and isinstance(row['boxes'], str):
            try:
                boxes = ast.literal_eval(row['boxes'])
            except (ValueError, SyntaxError):
                continue

            img_h = row['img_height']
            img_w = row['img_width']
            
            if img_h == 0 or img_w == 0:
                continue # Skip if dimensions are invalid

            for box in boxes:
                class_id = 0 # single class 'opacity'
                
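                # Clip the box to the image bounds so the normalized values stay in [0, 1];
                # YOLOv5's label check rejects negative or out-of-bounds coordinates.
                x_min = max(0.0, box['x'])
                y_min = max(0.0, box['y'])
                x_max = min(float(img_w), box['x'] + box['width'])
                y_max = min(float(img_h), box['y'] + box['height'])
                if x_max <= x_min or y_max <= y_min:
                    continue  # Skip boxes that fall entirely outside the image
                
                x_center_norm = (x_min + x_max) / 2 / img_w
                y_center_norm = (y_min + y_max) / 2 / img_h
                width_norm = (x_max - x_min) / img_w
                height_norm = (y_max - y_min) / img_h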
                
                f.write(f"{class_id} {x_center_norm:.6f} {y_center_norm:.6f} {width_norm:.6f} {height_norm:.6f}\n")

print("YOLOv5 data preparation finished.")

## 2.2 Create YOLOv5 Dataset Config

Before training, I need to create a YAML file that tells the YOLOv5 training script where to find the images and what the class names are.

In [None]:
import yaml
import os

YOLO_DATA_DIR = 'yolov5_data'
config = {
    'path': os.path.abspath(YOLO_DATA_DIR), # dataset root dir
    'train': 'images/train',  # train images (relative to 'path')
    'val': 'images/val',  # val images (relative to 'path')
    'nc': 1,  # number of classes
    'names': ['opacity']  # class names
}

config_path = os.path.join('yolov5', 'data', 'siim_covid19.yaml')
os.makedirs(os.path.dirname(config_path), exist_ok=True)

with open(config_path, 'w') as f:
    yaml.dump(config, f, default_flow_style=False)

print(f"YOLOv5 config file created at: {config_path}")
print("\n--- Config Content ---")
with open(config_path, 'r') as f:
    print(f.read())

## 2.3 Train YOLOv5 Detector

All data preparation is complete. I will now train the YOLOv5s model on `fold 0` for a few epochs to establish a baseline. I'll use pretrained weights to speed up convergence.

The training command is launched from the notebook with `subprocess.run` rather than IPython's `!python` magic. The key parameters are:
- `--img 640`: Image size.
- `--batch 16`: Batch size, chosen to fit on the available GPU.
- `--epochs 15`: Number of training epochs for this baseline run.
- `--data yolov5/data/siim_covid19.yaml`: The dataset configuration file.
- `--weights yolov5s.pt`: Pretrained weights to start from.
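- `--project yolov5_runs/train`: The output directory for training results.
- `--name baseline_fold0`: The run name, used as the subdirectory under the project directory.
- `--exist-ok`: Reuse an existing run directory instead of creating an incremented one.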

In [None]:
import subprocess
import sys

command = [
    sys.executable,
    'yolov5/train.py',
    '--img', '640',
    '--batch', '16',
    '--epochs', '15',
    '--data', 'yolov5/data/siim_covid19.yaml',
    '--weights', 'yolov5s.pt',
    '--project', 'yolov5_runs/train',
    '--name', 'baseline_fold0',
    '--exist-ok'
]

print(f"Running command: {' '.join(command)}")
try:
    # Using subprocess.run to avoid issues with IPython's '!' magic
    subprocess.run(command, check=True)
    print("\n--- YOLOv5 training finished successfully. ---")
except subprocess.CalledProcessError as e:
    print(f"\n--- YOLOv5 training failed with exit code {e.returncode}. ---")
    # The output from the command will be printed to stderr/stdout automatically.