# Plant Pathology 2021 - FGVC8: Plan to Medal

Objectives:
- Build a strong, fast baseline for multilabel image classification (micro-F1 metric).
- Establish trustworthy CV and iterate with augmentations, better backbones, and ensembling if time allows.

Data & Metric:
- Images: train_images/, test_images/.
- Labels: train.csv has space-separated labels per image.
- Evaluation: micro-F1; threshold tuning likely needed.
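
For reference, micro-F1 pools true positives, false positives, and false negatives over every (image, class) cell before computing a single F1 score. The sketch below illustrates this in pure Python on invented toy labels; the helper name `micro_f1` is hypothetical and not part of the notebook code.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-hot rows (lists of 0/1 of equal length)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1          # label present and predicted
            elif t == 0 and p == 1:
                fp += 1          # label predicted but absent
            elif t == 1 and p == 0:
                fn += 1          # label present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy data (invented for illustration): 3 images, 3 classes
y_true = [[1, 0, 0], [0, 1, 1], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
score = micro_f1(y_true, y_pred)  # TP=3, FP=1, FN=1 -> 2*3 / (2*3 + 1 + 1)
print(score)  # 0.75
```

True negatives never enter the formula, so a model cannot raise this score simply by predicting the many absent labels correctly.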

Validation:
- Use iterative stratified K-Fold (e.g., 5 folds) on multilabel targets.
- Fix a single CV protocol and reuse across experiments.

Baseline Model:
- timm pretrained CNN (efficientnet_b0/b3 or convnext_tiny) with BCEWithLogitsLoss.
- Input size 384 (start with 256 for speed), mixed precision, cosine schedule, warmup, AdamW.
- Augs: A.HorizontalFlip, A.RandomResizedCrop, ColorJitter, Normalize.
- Inference: TTA (hflip) if useful.
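
One way to combine the warmup and cosine schedule listed above is a single per-step learning-rate multiplier. This is a minimal sketch under assumed settings: the function name `warmup_cosine_multiplier` and the step counts are hypothetical, and the plan does not fix a warmup length.

```python
import math

def warmup_cosine_multiplier(step, warmup_steps, total_steps):
    """LR multiplier in [0, 1]: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, clipped at 1.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 1000 optimizer steps, the first 100 are warmup
schedule = [warmup_cosine_multiplier(s, 100, 1000) for s in (0, 50, 100, 550, 1000)]
print([round(m, 3) for m in schedule])
```

A function like this could be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR` (for example via a lambda that fixes the two step counts) and stepped once per batch. Note that the multiplier is 0 at step 0; using `step + 1` during warmup is a common variant if a non-zero first step is preferred.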

Pipeline:
1) Environment check: install torch, torchvision, timm, albumentations, iterative-stratification; verify GPU.
2) Load train.csv; parse unique classes; build multilabel binarizer.
3) Create folds (iterative stratification).
4) Dataset/Dataloader with on-the-fly augmentations (albumentations, cv2).
5) Train per fold with AMP + early stopping; log per-epoch micro-F1 on val.
6) Save OOF logits and test logits per fold; average logits across folds.
7) Threshold tuning on OOF (global threshold and optionally per-class).
8) Generate submission.csv; verify format matches sample_submission.csv.
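
Step 7 can be sketched as a grid search over one global threshold that maximizes micro-F1 on the OOF predictions. The code below is an illustration, not the notebook's implementation: the helper names, the toy probabilities, and the threshold grid are all assumptions, and the at-least-one rule is included because every image is expected to carry at least one label.

```python
import numpy as np

def binarize(probs, th):
    """Threshold probabilities; rows with no positive get their top-1 class."""
    preds = (probs >= th).astype(int)
    empty_rows = np.where(preds.sum(axis=1) == 0)[0]
    preds[empty_rows, probs[empty_rows].argmax(axis=1)] = 1
    return preds

def micro_f1(y_true, y_pred):
    """Micro-F1 from TP/FP/FN pooled over all (sample, class) cells."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def tune_global_threshold(probs, y_true, grid):
    """Return (threshold, score) with the highest micro-F1 on the grid."""
    scores = [micro_f1(y_true, binarize(probs, th)) for th in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), scores[best]

# Toy OOF probabilities and targets (invented for illustration)
oof_probs = np.array([[0.90, 0.33], [0.20, 0.80], [0.70, 0.52]])
oof_true = np.array([[1, 0], [0, 1], [1, 1]])
best_th, best_f1 = tune_global_threshold(
    oof_probs, oof_true, grid=np.array([0.1, 0.3, 0.5, 0.7])
)
print(best_th, best_f1)
```

A threshold tuned and scored on the same OOF predictions is somewhat optimistic, so small gains from a fine grid or from per-class thresholds should be treated with caution.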

Iteration Targets:
- Baseline: effnet_b0, img_size=256, 5 folds, 5-8 epochs -> get solid OOF.
- Improve: bigger backbone (b3/convnext_tiny), better augs, img_size=384, more epochs.
- Ensembling: blend diverse seeds/backbones/resolutions.
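
A simple form of the blending mentioned above is a weighted mean of logits from several models (or folds) before applying the sigmoid and the threshold. The sketch below assumes each model produces a logit array of the same shape; `blend_logits` and the example weights are hypothetical.

```python
import numpy as np

def blend_logits(logit_list, weights=None):
    """Weighted mean of same-shaped logit arrays (weights normalized to sum to 1)."""
    stacked = np.stack(logit_list, axis=0)           # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(logit_list))           # default: plain average
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # Contract the model axis: result has shape (n_samples, n_classes)
    return np.tensordot(weights, stacked, axes=1)

# Toy logits from two hypothetical models for one image and two classes
model_a = np.array([[2.0, -2.0]])
model_b = np.array([[0.0, 2.0]])
print(blend_logits([model_a, model_b]))                   # equal weights
print(blend_logits([model_a, model_b], weights=[3, 1]))   # favor model_a
```

Blend weights, if tuned at all, would normally be chosen on OOF predictions rather than on the test set.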

Milestones (request expert review at each):
A) After environment + data sanity checks.
B) After CV split + baseline training setup.
C) After first OOF results (analyze error buckets + thresholding).
D) After improved backbone / resolution or blends.

Risks & Mitigations:
- GPU issues: verify before training.
- Overfitting/leakage: strict fold discipline; fit transforms inside folds.
- Slow iterations: start with 2 folds/2-3 epochs smoke test before full runs.

In [3]:
# Environment check + installs (idempotent) + data sanity
import sys, subprocess, os, warnings
warnings.filterwarnings('ignore')
print('Python:', sys.version)
print('CWD:', os.getcwd())

def ensure(pkg, import_name=None):
    import importlib
    name = import_name or pkg
    try:
        importlib.import_module(name)
        print(f'OK: {pkg}')
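    except Exception:
        print(f'Installing {pkg} ...')
        # check=True makes a failed pip install raise here instead of surfacing later as an import error
        subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', pkg], check=True)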
        importlib.import_module(name)
        print(f'Installed: {pkg}')

# Core deps (torch likely preinstalled) - install only if missing
ensure('torch')
ensure('torchvision')
ensure('timm')
ensure('albumentations')
ensure('iterative-stratification', 'iterstrat')
ensure('opencv-python', 'cv2')
ensure('scikit-learn', 'sklearn')
ensure('pandas')
ensure('numpy')

import torch, torchvision
import pandas as pd, numpy as np, cv2
print(f'GPU Available: {torch.cuda.is_available()}')
print(f'GPU Count: {torch.cuda.device_count()}')
if torch.cuda.is_available():
    try:
        print(f'GPU Name: {torch.cuda.get_device_name(0)}')
        props = torch.cuda.get_device_properties(0)
        print(f'GPU Memory: {props.total_memory / 1024**3:.2f} GB')
    except Exception as e:
        print('GPU query error:', e)

# Load CSVs
train_csv = 'train.csv'
sub_csv = 'sample_submission.csv'
train_df = pd.read_csv(train_csv)
sub_df = pd.read_csv(sub_csv)
print('Train shape:', train_df.shape)
print('Sample submission shape:', sub_df.shape)
print(train_df.head(3))
print(sub_df.head(3))

# Submission format is two columns: image, labels (space-separated)
id_col = sub_df.columns[0]
assert id_col == 'image', f'Unexpected ID column: {id_col}'
print('ID column:', id_col)

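# Derive class list from training labels (whitespace-separated tokens).
# str.split() with no argument drops empty tokens, so a blank or double-spaced
# label string cannot yield a '' token that would raise KeyError in the loop below.
tokens = train_df['labels'].fillna('').str.split()
all_labels = sorted({t for lst in tokens for t in lst})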
class_cols = all_labels
class_to_idx = {c:i for i,c in enumerate(class_cols)}
print('Classes (derived from train):', class_cols)

# Build multi-hot matrix Y in class order
Y = np.zeros((len(train_df), len(class_cols)), dtype=np.int8)
for i, labs in enumerate(tokens):
    for l in labs:
        Y[i, class_to_idx[l]] = 1
print('Label matrix shape:', Y.shape, 'Positives:', int(Y.sum()))
print('Per-class positives:', dict(zip(class_cols, Y.sum(0).astype(int))))

# Quick file existence checks
train_dir = 'train_images'
test_dir = 'test_images'
n_train_imgs = len([f for f in os.listdir(train_dir) if f.lower().endswith('.jpg')]) if os.path.isdir(train_dir) else -1
n_test_imgs = len([f for f in os.listdir(test_dir) if f.lower().endswith('.jpg')]) if os.path.isdir(test_dir) else -1
print('Train images count:', n_train_imgs, 'Test images count:', n_test_imgs)

# Verify a few paths
for p in train_df['image'].head(5).tolist():
    fp = os.path.join(train_dir, p)
    print(p, 'exists' if os.path.isfile(fp) else 'MISSING')

print('Setup complete.')

Python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
CWD: /var/lib/simon/agent_run_states/plant-pathology-2021-fgvc8-20250921-232435
OK: torch
OK: torchvision
OK: timm
OK: albumentations
OK: iterative-stratification
OK: opencv-python
OK: scikit-learn
OK: pandas
OK: numpy
GPU Available: True
GPU Count: 1
GPU Name: NVIDIA A10-24Q
GPU Memory: 23.72 GB
Train shape: (14905, 2)
Sample submission shape: (3727, 2)
                  image                      labels
0  95cb4b75ad1d842a.jpg                     complex
1  be80d450dac87d7c.jpg  frog_eye_leaf_spot complex
2  80cac5abcfa9c9b4.jpg                     healthy
                  image   labels
0  ca6a50c5d2adb8ae.jpg  healthy
1  b686d217a1e2e3a5.jpg  healthy
2  c9a5345ec78b4ac5.jpg  healthy
ID column: image
Classes (derived from train): ['complex', 'frog_eye_leaf_spot', 'healthy', 'powdery_mildew', 'rust', 'scab']
Label matrix shape: (14905, 6) Positives: 16153
Per-class positives: {'complex': 1711, 'frog_eye_leaf_spot': 3511, '

In [4]:
# Create multilabel stratified folds (seed=42); save to disk for reuse
import os, numpy as np, pandas as pd
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

# Expect train_df, Y, id_col, class_cols from previous cell
assert 'train_df' in globals() and 'Y' in globals(), 'Run env/data cell first'

n_splits = 5
seed = 42
mskf = MultilabelStratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
folds = np.full(len(train_df), -1, dtype=int)
for f, (_, val_idx) in enumerate(mskf.split(train_df, Y)):
    folds[val_idx] = f

train_df_folds = train_df.copy()
train_df_folds['fold'] = folds
out_path = 'train_folds.csv'
train_df_folds.to_csv(out_path, index=False)
print('Saved folds to', out_path)

# Per-fold label distribution sanity
for f in range(n_splits):
    idx = np.where(folds == f)[0]
    y_sum = Y[idx].sum(0)
    print(f'Fold {f}: n={len(idx)}, positives per class:', dict(zip(class_cols, map(int, y_sum))))
print('Folds ready.')

Saved folds to train_folds.csv
Fold 0: n=2977, positives per class: {'complex': 342, 'frog_eye_leaf_spot': 702, 'healthy': 735, 'powdery_mildew': 202, 'rust': 330, 'scab': 920}
Fold 1: n=2992, positives per class: {'complex': 342, 'frog_eye_leaf_spot': 702, 'healthy': 734, 'powdery_mildew': 202, 'rust': 330, 'scab': 920}
Fold 2: n=2975, positives per class: {'complex': 343, 'frog_eye_leaf_spot': 703, 'healthy': 735, 'powdery_mildew': 201, 'rust': 330, 'scab': 920}
Fold 3: n=2993, positives per class: {'complex': 342, 'frog_eye_leaf_spot': 702, 'healthy': 734, 'powdery_mildew': 202, 'rust': 330, 'scab': 920}
Fold 4: n=2968, positives per class: {'complex': 342, 'frog_eye_leaf_spot': 702, 'healthy': 735, 'powdery_mildew': 201, 'rust': 330, 'scab': 920}
Folds ready.


In [None]:
# Smoke test training: Dataset, Model, 1-fold quick run (convnext_tiny @ 256, 2 epochs)
import os, time, math, random, gc
import numpy as np, pandas as pd, cv2, torch, timm
from torch import nn
from torch.utils.data import Dataset, DataLoader
import albumentations as A
from albumentations.pytorch import ToTensorV2
from sklearn.metrics import f1_score

def set_seed(seed=42):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = False
    torch.backends.cudnn.benchmark = True

IMG_SIZE = 256  # 256 for smoke test; later 384+
BATCH_SIZE = 64
EPOCHS = 2
LR = 2e-4
WD = 1e-4
NUM_CLASSES = len(class_cols)
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
ID_COL = id_col
TRAIN_DIR = 'train_images'
TEST_DIR = 'test_images'

train_tfms = A.Compose([
    A.RandomResizedCrop(size=(IMG_SIZE, IMG_SIZE), scale=(0.7, 1.0), ratio=(0.75, 1.33), p=1.0),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.2),
    A.RandomRotate90(p=0.2),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=20, p=0.5, border_mode=cv2.BORDER_REFLECT101),
    A.ColorJitter(0.2,0.2,0.2,0.1,p=0.5),
    A.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)),
    ToTensorV2(),
])
val_tfms = A.Compose([
    A.Resize(height=IMG_SIZE, width=IMG_SIZE),
    A.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)),
    ToTensorV2(),
])

class PlantDataset(Dataset):
    def __init__(self, df, labels=None, img_dir=TRAIN_DIR, tfms=None):
        self.df = df.reset_index(drop=True)
        self.labels = labels
        self.img_dir = img_dir
        self.tfms = tfms
        self.has_labels = labels is not None
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_path = os.path.join(self.img_dir, row['image'])
        img = cv2.imread(img_path, cv2.IMREAD_COLOR)
        if img is None:
            raise FileNotFoundError(f'Image not found: {img_path}')
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        if self.tfms: 
            img = self.tfms(image=img)['image']
        if self.has_labels:
            target = self.labels[idx].astype(np.float32)
            return img, torch.from_numpy(target)
        else:
            return img, row['image']

def build_model():
    model = timm.create_model('convnext_tiny', pretrained=True, num_classes=NUM_CLASSES)
    return model

def get_loaders(fold=0):
    df = pd.read_csv('train_folds.csv')
    trn_idx = df.index[df.fold != fold].tolist()
    val_idx = df.index[df.fold == fold].tolist()
    trn_ds = PlantDataset(df.loc[trn_idx, ['image']], labels=Y[trn_idx], img_dir=TRAIN_DIR, tfms=train_tfms)
    val_ds = PlantDataset(df.loc[val_idx, ['image']], labels=Y[val_idx], img_dir=TRAIN_DIR, tfms=val_tfms)
    trn_ld = DataLoader(trn_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=4, pin_memory=True, drop_last=True)
    val_ld = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=4, pin_memory=True)
    return trn_ld, val_ld, val_idx

def micro_f1_from_logits(logits, targets, th=0.3):
    # Accept tensors or arrays; work in NumPy from here on
    probs = torch.sigmoid(torch.as_tensor(logits)).cpu().numpy()
    preds = (probs >= th).astype(int)
    t = targets.cpu().numpy() if torch.is_tensor(targets) else np.asarray(targets)
    # enforce at-least-one rule: rows with no prediction get their top-1 class
    rows_all_zero = preds.sum(1) == 0
    if rows_all_zero.any():
        top1 = probs.argmax(1)
        preds[rows_all_zero, top1[rows_all_zero]] = 1
    # Pass 2D indicator matrices: on flattened 0/1 vectors, average='micro' also pools
    # the negative class and reduces to plain accuracy rather than micro-F1 over labels
    return f1_score(t.astype(int), preds, average='micro')

def train_one_fold(fold=0):
    set_seed(42)
    trn_ld, val_ld, val_idx = get_loaders(fold)
    model = build_model().to(DEVICE)
    optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=WD)
    total_steps = EPOCHS * len(trn_ld)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    scaler = torch.cuda.amp.GradScaler(enabled=(DEVICE=='cuda'))
    criterion = nn.BCEWithLogitsLoss()
    best_f1, best_state = -1.0, None
    print(f'Start fold {fold} | steps/epoch={len(trn_ld)} | val_batches={len(val_ld)}', flush=True)
    start = time.time()
    for epoch in range(EPOCHS):
        model.train()
        running = 0.0
        for bi, (imgs, tgts) in enumerate(trn_ld):
            imgs = imgs.to(DEVICE, non_blocking=True)
            tgts = tgts.to(DEVICE, non_blocking=True)
            optimizer.zero_grad(set_to_none=True)
            with torch.cuda.amp.autocast(enabled=(DEVICE=='cuda')):
                logits = model(imgs)
                loss = criterion(logits, tgts)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()
            running += loss.item()
            if (bi+1) % 50 == 0:
                elapsed = time.time() - start
                print(f'Epoch {epoch+1}/{EPOCHS} | step {bi+1}/{len(trn_ld)} | loss {running/(bi+1):.4f} | {elapsed/60:.1f} min', flush=True)
        # validate
        model.eval()
        val_logits, val_targets = [], []
        with torch.no_grad():
            for imgs, tgts in val_ld:
                imgs = imgs.to(DEVICE, non_blocking=True)
                logits = model(imgs).detach().cpu()
                val_logits.append(logits)
                val_targets.append(tgts)
        val_logits = torch.cat(val_logits, 0)
        val_targets = torch.cat(val_targets, 0)
        f1 = micro_f1_from_logits(val_logits, val_targets, th=0.3)
        print(f'Fold {fold} Epoch {epoch+1}: val micro-F1 @0.3 = {f1:.4f}', flush=True)
        if f1 > best_f1:
            best_f1 = f1
            # clone() so the snapshot stays independent of the live weights even on CPU
            best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    print(f'Best val micro-F1: {best_f1:.4f}')
    # Restore the best weights; return the model and the validation row indices
    model.load_state_dict(best_state)
    return model, val_idx

def predict_test(model):
    test_df = pd.read_csv('sample_submission.csv')[[ID_COL]].copy()
    ds = PlantDataset(test_df, labels=None, img_dir=TEST_DIR, tfms=val_tfms)
    dl = DataLoader(ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=4, pin_memory=True)
    model.eval()
    preds = []
    with torch.no_grad():
        for imgs, names in dl:
            imgs = imgs.to(DEVICE, non_blocking=True)
            logits = model(imgs)
            preds.append(logits.detach().cpu())
    return torch.cat(preds, 0).numpy(), test_df[ID_COL].tolist()

def logits_to_labels_str(logits, th=0.3):
    probs = 1/(1+np.exp(-logits))
    bin_ = (probs >= th).astype(int)
    # at-least-one rule
    rows_zero = bin_.sum(1) == 0
    if rows_zero.any():
        top1 = probs.argmax(1)
        bin_[rows_zero, top1[rows_zero]] = 1
    labels = []
    for r in bin_:
        labels.append(' '.join([class_cols[i] for i in np.where(r==1)[0]]))
    return labels

if __name__ == '__main__':
    model, _ = train_one_fold(fold=0)
    test_logits, test_names = predict_test(model)
    labels_str = logits_to_labels_str(test_logits, th=0.3)
    sub = pd.DataFrame({ID_COL: test_names, 'labels': labels_str})
    sub.to_csv('submission.csv', index=False)
    print('Wrote submission.csv with shape', sub.shape)

Start fold 0 | steps/epoch=186 | val_batches=47
