# Aerial Cactus Identification — Medal Plan

## Goal
- Achieve 1.0000 AUC (medal) with robust CV and inference.

## Data
- Provided: train.csv (id, has_cactus), train.zip (images), test.zip (images), sample_submission.csv.
- Images are small (32x32).

## Approach
- Stratified K-Fold (5 folds) on train.csv.
- Pretrained CNN using timm (efficientnet_b0 or tf_efficientnet_b0_ns) scaled to 96x96 for better feature extraction.
- Use strong but safe augmentations (horizontal/vertical flips, slight shift/scale/rotate, color jitter).
- Mixed precision training, early stopping on fold AUC, ReduceLROnPlateau.
- Loss: BCEWithLogitsLoss; Metric: ROC-AUC.
- Optimizer: AdamW; Scheduler: CosineAnnealingWarmRestarts or OneCycleLR (will pick empirically; likely Cosine).

## Training
- Unzip datasets to ./train and ./test on first run; cache file lists.
- Dataloader: balanced batches if needed; otherwise rely on loss/aug.
- Epochs: 10–15 per fold (early stop), batch size as GPU allows (e.g., 256 at 96x96 on T4; will probe).
- Log per-epoch metrics and timing; print fold indices and elapsed time.

## Evaluation
- Out-of-fold (OOF) predictions and AUC per fold; average AUC for CV estimate.

## Inference
- TTA (e.g., 4–8 flips/rotations) on test set; average across folds and TTA.
- Save submission.csv with columns: id, has_cactus.

## Efficiency
- Keep model small (EffNet-B0) for fast convergence; consider tf_efficientnet_b3 if time permits.
- Use pin_memory, num_workers tuned to CPU.

## Checkpoints
- After this plan, request expert review.
- After initial EDA/load and baseline training, request expert review if CV < 0.9995.

## Risk Mitigation
- If deep model is slow/overkill, fallback: LightGBM on simple image stats (HSV histograms, edges) to get strong AUC quickly; but primary path should reach 1.0.

## Next Steps
1) Environment setup; unzip data; quick sanity EDA (counts, class balance, sample images).
2) Build dataset/transforms and model; run a quick single-fold smoke test (2–3 epochs) to validate pipeline.
3) Full 5-fold with early stopping and TTA inference.
4) Generate and save submission.csv.

In [1]:
# Setup: imports, unzip data, basic sanity checks
import os, sys, time, zipfile, glob, random, math, json, gc
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image

DATA_DIR = Path('.')
TRAIN_ZIP = DATA_DIR/'train.zip'
TEST_ZIP = DATA_DIR/'test.zip'
TRAIN_DIR = DATA_DIR/'train'
TEST_DIR = DATA_DIR/'test'
TRAIN_CSV = DATA_DIR/'train.csv'
SAMPLE_SUB = DATA_DIR/'sample_submission.csv'

def unzip_if_needed(zip_path: Path, out_dir: Path):
    if out_dir.exists() and any(out_dir.iterdir()):
        print(f"[unzip] Exists and not empty: {out_dir}")
        return
    assert zip_path.exists(), f"Missing zip: {zip_path}"
    out_dir.mkdir(parents=True, exist_ok=True)
    t0 = time.time()
    with zipfile.ZipFile(zip_path, 'r') as z:
        z.extractall(out_dir)
    print(f"[unzip] Extracted {zip_path.name} -> {out_dir} in {time.time()-t0:.2f}s")

print("[env] Python:", sys.version)
print("[env] CWD:", os.getcwd())
print("[files]", os.listdir())

# Unzip datasets
unzip_if_needed(TRAIN_ZIP, TRAIN_DIR)
unzip_if_needed(TEST_ZIP, TEST_DIR)

# Load train.csv
train_df = pd.read_csv(TRAIN_CSV)
print("[train.csv] shape:", train_df.shape)
print(train_df['has_cactus'].value_counts(normalize=True))

# Verify image files count and mapping
train_files = sorted(glob.glob(str(TRAIN_DIR/'*.jpg'))) + sorted(glob.glob(str(TRAIN_DIR/'*.png')))
test_files = sorted(glob.glob(str(TEST_DIR/'*.jpg'))) + sorted(glob.glob(str(TEST_DIR/'*.png')))
print(f"[files] train images: {len(train_files)}, test images: {len(test_files)}")

# Check that all train ids exist
missing = []
for img_id in train_df['id'].values[:1000]:  # spot-check first 1000
    p = TRAIN_DIR/img_id
    if not p.exists():
        missing.append(img_id)
print(f"[sanity] Missing (first 1000 check): {len(missing)}")
print("[sanity] sample rows:\n", train_df.head())

[env] Python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
[env] CWD: /app/agent_run_states/aerial-cactus-identification-20250909-223202
[files] ['description.md', 'test.zip', 'agent_metadata', 'task.txt', 'train.csv', 'requirements.txt', 'train.zip', 'sample_submission.csv', 'docker_run.log', 'main.ipynb']


[unzip] Extracted train.zip -> train in 1.37s


[unzip] Extracted test.zip -> test in 0.32s
[train.csv] shape: (14175, 2)
has_cactus
1    0.749771
0    0.250229
Name: proportion, dtype: float64
[files] train images: 14175, test images: 3325
[sanity] Missing (first 1000 check): 0
[sanity] sample rows:
                                      id  has_cactus
0  2de8f189f1dce439766637e75df0ee27.jpg           1
1  36704d250f236238e7f996812c48235d.jpg           1
2  eacde22fdc8c175972a5768e3daa8bc9.jpg           1
3  5d442f834da5e57d22b24802c32a8ca8.jpg           1
4  152491e0daf75c0e669400300ff7e645.jpg           1


In [2]:
# Duplicate/near-duplicate detection via simple perceptual hashes (aHash + dHash)
from tqdm import tqdm

def ahash(img: Image.Image, hash_size: int = 8) -> int:
    img = img.convert('L').resize((hash_size, hash_size), Image.BILINEAR)
    arr = np.asarray(img, dtype=np.float32)
    m = arr.mean()
    bits = (arr > m).astype(np.uint8)
    val = 0
    for b in bits.flatten():
        val = (val << 1) | int(b)
    return val

def dhash(img: Image.Image, hash_size: int = 8) -> int:
    # difference hash
    img = img.convert('L').resize((hash_size + 1, hash_size), Image.BILINEAR)
    arr = np.asarray(img, dtype=np.int16)
    diff = arr[:, 1:] > arr[:, :-1]
    val = 0
    for b in diff.flatten():
        val = (val << 1) | int(b)
    return val

def compute_hash_tuple(path: Path) -> tuple[int, int]:
    try:
        with Image.open(path) as im:
            return ahash(im), dhash(im)
    except Exception as e:
        # In rare cases of corrupt images, fall back to zeros to group them together
        return 0, 0

t0 = time.time()
print('[phash] Computing hashes for train images...')
train_hashes = []
for img_id in tqdm(train_df['id'].values, total=len(train_df)):
    h = compute_hash_tuple(TRAIN_DIR / img_id)
    train_hashes.append(h)
train_df['ahash'] = [h[0] for h in train_hashes]
train_df['dhash'] = [h[1] for h in train_hashes]
train_df['hash_pair'] = list(zip(train_df['ahash'], train_df['dhash']))

dup_counts = train_df['hash_pair'].value_counts()
num_groups = dup_counts.shape[0]
num_dupe_groups = (dup_counts > 1).sum()
num_dupe_images = int((dup_counts[dup_counts > 1]).sum())
print(f"[phash] Train unique hash groups: {num_groups}")
print(f"[phash] Train duplicate groups (>1): {num_dupe_groups}; images in dup groups: {num_dupe_images}")

print('[phash] Computing hashes for test images...')
test_ids = [Path(p).name for p in test_files]
test_hashes = []
for img_id in tqdm(test_ids, total=len(test_ids)):
    h = compute_hash_tuple(TEST_DIR / img_id)
    test_hashes.append(h)
test_df = pd.DataFrame({'id': test_ids, 'ahash': [h[0] for h in test_hashes], 'dhash': [h[1] for h in test_hashes]})
test_df['hash_pair'] = list(zip(test_df['ahash'], test_df['dhash']))

# Check any exact hash collisions between train and test (not leakage per se, but indicative of duplicates)
inter = set(train_df['hash_pair']).intersection(set(test_df['hash_pair']))
print(f"[phash] Train-Test shared hash groups: {len(inter)}")
print(f"[phash] Done in {time.time()-t0:.2f}s")

# Create group labels for GroupKFold if duplicates exist
hash_to_group = {h:i for i, h in enumerate(train_df['hash_pair'].astype('category').cat.categories)}
train_df['group'] = train_df['hash_pair'].map(hash_to_group)

# Save intermediate artifacts for reuse in later cells
train_df.to_pickle('train_with_hash.pkl')
test_df.to_pickle('test_with_hash.pkl')
print('[phash] Saved train_with_hash.pkl and test_with_hash.pkl')

In [11]:
# LightGBM baseline on flattened 32x32x3 pixels with 5-fold CV (Group-aware if duplicates)
import time
import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, GroupKFold

# Try to import StratifiedGroupKFold if available (sklearn >=1.1)
try:
    from sklearn.model_selection import StratifiedGroupKFold
    HAS_SGF = True
except Exception:
    StratifiedGroupKFold = None
    HAS_SGF = False

# LightGBM import/install
try:
    import lightgbm as lgb
except Exception as e:
    import sys, subprocess
    print('[pip] Installing lightgbm...]')
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

SEED = 42
np.random.seed(SEED)

# Load precomputed hashes
train_df = pd.read_pickle('train_with_hash.pkl')
test_df = pd.read_pickle('test_with_hash.pkl')

def load_images_flat(img_ids, folder: Path) -> np.ndarray:
    t0 = time.time()
    X = np.empty((len(img_ids), 32*32*3), dtype=np.float32)
    for i, img_id in enumerate(img_ids):
        with Image.open(folder / img_id) as im:
            im = im.convert('RGB')  # ensure 3 channels
            arr = np.asarray(im, dtype=np.float32) / 255.0
            X[i] = arr.reshape(-1)
        if (i+1) % 2000 == 0:
            print(f"[load] {i+1}/{len(img_ids)} in {time.time()-t0:.1f}s", flush=True)
    print(f"[load] Done {len(img_ids)} images in {time.time()-t0:.1f}s; X shape={X.shape}")
    return X

t0_all = time.time()
X = load_images_flat(train_df['id'].values, TRAIN_DIR)
y = train_df['has_cactus'].values.astype(np.int32)
X_test = load_images_flat(test_df['id'].values, TEST_DIR)

# CV splitter
use_groups = (train_df.groupby('hash_pair').size() > 1).any()
print(f"[cv] Duplicates present: {use_groups}")
if use_groups and HAS_SGF:
    splitter = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=SEED)
    split_iter = splitter.split(X, y, groups=train_df['group'].values)
elif use_groups:
    # Fallback: GroupKFold (no stratification)
    splitter = GroupKFold(n_splits=5)
    split_iter = splitter.split(X, y, groups=train_df['group'].values)
else:
    splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    split_iter = splitter.split(X, y)

params = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.05,
    'num_leaves': 64,
    'feature_fraction': 0.6,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'min_data_in_leaf': 20,
    'verbosity': -1,
    'seed': SEED,
}

oof = np.zeros(len(y), dtype=np.float32)
pred_test = np.zeros(X_test.shape[0], dtype=np.float32)
fold_auc = []

for fold, (trn_idx, val_idx) in enumerate(split_iter):
    t_fold = time.time()
    print(f"\n[fold {fold}] train={len(trn_idx)} val={len(val_idx)}")
    dtrain = lgb.Dataset(X[trn_idx], label=y[trn_idx])
    dvalid = lgb.Dataset(X[val_idx], label=y[val_idx])
    model = lgb.train(
        params,
        dtrain,
        num_boost_round=4000,
        valid_sets=[dvalid],
        valid_names=['valid'],
        callbacks=[
            lgb.early_stopping(stopping_rounds=200, verbose=True),
            lgb.log_evaluation(period=100),
        ]
    )
    oof[val_idx] = model.predict(X[val_idx], num_iteration=model.best_iteration)
    auc = roc_auc_score(y[val_idx], oof[val_idx])
    fold_auc.append(auc)
    print(f"[fold {fold}] AUC={auc:.6f} | best_iter={model.best_iteration} | elapsed={time.time()-t_fold:.1f}s")
    pred_test += model.predict(X_test, num_iteration=model.best_iteration) / 5.0

cv_auc = roc_auc_score(y, oof)
print(f"\n[CV] OOF AUC={cv_auc:.6f}; per-fold={fold_auc}; total time={time.time()-t0_all:.1f}s")

# Build submission
sub = pd.read_csv(SAMPLE_SUB)
sub = sub.merge(pd.DataFrame({'id': test_df['id'].values, 'has_cactus': pred_test}), on='id', how='left')
sub = sub[['id', 'has_cactus']]
sub.to_csv('submission.csv', index=False)
print('[submit] Saved submission.csv with shape', sub.shape)

[load] 2000/14175 in 0.4s


[load] 4000/14175 in 0.8s


[load] 6000/14175 in 1.3s


[load] 8000/14175 in 1.7s


[load] 10000/14175 in 2.1s


[load] 12000/14175 in 2.5s


[load] 14000/14175 in 2.9s


[load] Done 14175 images in 3.0s; X shape=(14175, 3072)


[load] 2000/3325 in 0.4s


[load] Done 3325 images in 0.7s; X shape=(3325, 3072)
[cv] Duplicates present: False

[fold 0] train=11340 val=2835


Training until validation scores don't improve for 200 rounds


[100]	valid's auc: 0.977346


[200]	valid's auc: 0.982278


[300]	valid's auc: 0.983873


[400]	valid's auc: 0.98479


[500]	valid's auc: 0.984989


[600]	valid's auc: 0.985357


[700]	valid's auc: 0.985564


[800]	valid's auc: 0.98562


[900]	valid's auc: 0.985738


[1000]	valid's auc: 0.985751


[1100]	valid's auc: 0.985804


[1200]	valid's auc: 0.985889


[1300]	valid's auc: 0.985868


Early stopping, best iteration is:
[1186]	valid's auc: 0.985898
[fold 0] AUC=0.986105 | best_iter=1186 | elapsed=105.1s

[fold 1] train=11340 val=2835


Training until validation scores don't improve for 200 rounds


[100]	valid's auc: 0.979874


[200]	valid's auc: 0.984081


[300]	valid's auc: 0.985705


[400]	valid's auc: 0.986542


[500]	valid's auc: 0.98706


[600]	valid's auc: 0.987559


[700]	valid's auc: 0.987697


[800]	valid's auc: 0.987809


[900]	valid's auc: 0.987731


Early stopping, best iteration is:
[782]	valid's auc: 0.98784
[fold 1] AUC=0.987843 | best_iter=782 | elapsed=84.2s

[fold 2] train=11340 val=2835


Training until validation scores don't improve for 200 rounds


[100]	valid's auc: 0.977795


[200]	valid's auc: 0.983862


[300]	valid's auc: 0.985969


[400]	valid's auc: 0.987117


[500]	valid's auc: 0.987922


[600]	valid's auc: 0.988356


[700]	valid's auc: 0.988511


[800]	valid's auc: 0.988538


[900]	valid's auc: 0.988585


[1000]	valid's auc: 0.988571


[1100]	valid's auc: 0.988525


Early stopping, best iteration is:
[924]	valid's auc: 0.988614
[fold 2] AUC=0.988365 | best_iter=924 | elapsed=94.7s

[fold 3] train=11340 val=2835


Training until validation scores don't improve for 200 rounds


[100]	valid's auc: 0.978978


[200]	valid's auc: 0.984787


[300]	valid's auc: 0.986562


[400]	valid's auc: 0.987551


[500]	valid's auc: 0.988051


[600]	valid's auc: 0.988424


[700]	valid's auc: 0.988441


[800]	valid's auc: 0.988563


[900]	valid's auc: 0.988687


[1000]	valid's auc: 0.988774


[1100]	valid's auc: 0.988778


[1200]	valid's auc: 0.988762


[1300]	valid's auc: 0.98877


Early stopping, best iteration is:
[1138]	valid's auc: 0.98881
[fold 3] AUC=0.988648 | best_iter=1138 | elapsed=88.2s

[fold 4] train=11340 val=2835


Training until validation scores don't improve for 200 rounds


[100]	valid's auc: 0.987183


[200]	valid's auc: 0.990154


[300]	valid's auc: 0.991303


[400]	valid's auc: 0.991768


[500]	valid's auc: 0.992245


[600]	valid's auc: 0.992581


[700]	valid's auc: 0.99272


[800]	valid's auc: 0.992796


[900]	valid's auc: 0.992817


[1000]	valid's auc: 0.992817


Early stopping, best iteration is:
[833]	valid's auc: 0.992841
[fold 4] AUC=0.992842 | best_iter=833 | elapsed=89.0s

[CV] OOF AUC=0.988673; per-fold=[0.9861046058803159, 0.9878434374863168, 0.9883652196527113, 0.9886482187241094, 0.9928417564208781]; total time=465.0s


KeyError: "['has_cactus'] not in index"

In [6]:
# Feature engineering (PCA on pixels + HSV histograms + edge stats) and LightGBM CV
import time, sys, subprocess
import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image, ImageFilter, ImageOps
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, GroupKFold
from sklearn.metrics import roc_auc_score

# Ensure lightgbm
try:
    import lightgbm as lgb
except Exception:
    print('[pip] Installing lightgbm...')
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

SEED = 42
rng = np.random.default_rng(SEED)

# Load dataframes with hash groups
train_df = pd.read_pickle('train_with_hash.pkl')
test_df = pd.read_pickle('test_with_hash.pkl')

def load_rgb_array(img_ids, folder: Path) -> np.ndarray:
    t0 = time.time()
    X = np.empty((len(img_ids), 32, 32, 3), dtype=np.float32)
    for i, img_id in enumerate(img_ids):
        with Image.open(folder / img_id) as im:
            im = im.convert('RGB')
            arr = np.asarray(im, dtype=np.float32) / 255.0
        X[i] = arr
        if (i+1) % 2000 == 0:
            print(f"[load_rgb] {i+1}/{len(img_ids)}", flush=True)
    print(f"[load_rgb] Done {len(img_ids)} in {time.time()-t0:.1f}s; shape={X.shape}")
    return X

def hsv_hist_features(img_ids, folder: Path, bins: int = 16) -> np.ndarray:
    # 16-bin hist per H,S,V channel (normalized); plus channel means/stds -> 3*(16)+6 = 54 features
    t0 = time.time()
    feats = np.empty((len(img_ids), bins*3 + 6), dtype=np.float32)
    edges_kernel = np.array([[1,0,-1],[2,0,-2],[1,0,-1]], dtype=np.float32)
    for i, img_id in enumerate(img_ids):
        with Image.open(folder / img_id) as im:
            im = im.convert('RGB')
            hsv = im.convert('HSV')
            arr = np.asarray(hsv, dtype=np.uint8)
            # histograms per channel
            f = []
            for c in range(3):
                h, _ = np.histogram(arr[..., c], bins=bins, range=(0, 256), density=True)
                f.append(h.astype(np.float32))
            f = np.concatenate(f).astype(np.float32)
            # means/stds per channel in HSV (scaled to 0-1)
            ch_means = arr.mean(axis=(0,1)) / 255.0
            ch_stds = arr.std(axis=(0,1)) / 255.0
            # simple edge stats on grayscale
            gray = ImageOps.grayscale(im)
            g = np.asarray(gray, dtype=np.float32) / 255.0
            # simple gradients (Sobel-like via numpy differences)
            gx = np.zeros_like(g); gy = np.zeros_like(g)
            gx[:,1:] = g[:,1:] - g[:,:-1]
            gy[1:,:] = g[1:,:] - g[:-1,:]
            mag = np.sqrt(gx*gx + gy*gy)
            edge_mean = mag.mean().astype(np.float32)
            edge_std = mag.std().astype(np.float32)
            feats[i] = np.concatenate([f, ch_means.astype(np.float32), ch_stds.astype(np.float32)])
        if (i+1) % 2000 == 0:
            print(f"[hsv/edge] {i+1}/{len(img_ids)}", flush=True)
    print(f"[hsv/edge] Done {len(img_ids)} in {time.time()-t0:.1f}s; shape={feats.shape}")
    return feats

t_all = time.time()
# Load raw RGB arrays
X_rgb = load_rgb_array(train_df['id'].values, TRAIN_DIR)
X_test_rgb = load_rgb_array(test_df['id'].values, TEST_DIR)

# Flatten and scale for PCA
X_flat = X_rgb.reshape(len(X_rgb), -1)
X_test_flat = X_test_rgb.reshape(len(X_test_rgb), -1)
scaler = StandardScaler(with_mean=True, with_std=True)
X_flat_s = scaler.fit_transform(X_flat)
X_test_flat_s = scaler.transform(X_test_flat)

# PCA to 256 components
pca = PCA(n_components=256, random_state=SEED, svd_solver='auto')
t0 = time.time()
X_pca = pca.fit_transform(X_flat_s).astype(np.float32)
X_test_pca = pca.transform(X_test_flat_s).astype(np.float32)
print(f"[pca] explained_var_ratio_sum={pca.explained_variance_ratio_.sum():.4f}; time={time.time()-t0:.1f}s")

# HSV hist + simple stats
X_hsv = hsv_hist_features(train_df['id'].values, TRAIN_DIR, bins=16)
X_test_hsv = hsv_hist_features(test_df['id'].values, TEST_DIR, bins=16)

# Concatenate features
X_feat = np.hstack([X_pca, X_hsv]).astype(np.float32)
X_test_feat = np.hstack([X_test_pca, X_test_hsv]).astype(np.float32)
y = train_df['has_cactus'].values.astype(np.int32)
print('[feat] X_feat:', X_feat.shape, 'X_test_feat:', X_test_feat.shape)

# CV splitter (use groups if duplicates detected)
use_groups = (train_df.groupby('hash_pair').size() > 1).any()
if use_groups:
    try:
        from sklearn.model_selection import StratifiedGroupKFold
        splitter = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=SEED)
        split_iter = splitter.split(X_feat, y, groups=train_df['group'].values)
    except Exception:
        splitter = GroupKFold(n_splits=5)
        split_iter = splitter.split(X_feat, y, groups=train_df['group'].values)
else:
    splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    split_iter = splitter.split(X_feat, y)

params = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.05,
    'num_leaves': 128,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'min_data_in_leaf': 20,
    'verbosity': -1,
    'seed': SEED,
}

oof = np.zeros(len(y), dtype=np.float32)
pred_test = np.zeros(X_test_feat.shape[0], dtype=np.float32)
fold_auc = []
t_cv = time.time()
for fold, (trn_idx, val_idx) in enumerate(split_iter):
    t_fold = time.time()
    print(f"\n[fold {fold}] train={len(trn_idx)} val={len(val_idx)}", flush=True)
    dtrain = lgb.Dataset(X_feat[trn_idx], label=y[trn_idx])
    dvalid = lgb.Dataset(X_feat[val_idx], label=y[val_idx])
    model = lgb.train(
        params,
        dtrain,
        num_boost_round=4000,
        valid_sets=[dvalid],
        valid_names=['valid'],
        callbacks=[
            lgb.early_stopping(stopping_rounds=200, verbose=True),
            lgb.log_evaluation(period=100),
        ]
    )
    oof[val_idx] = model.predict(X_feat[val_idx], num_iteration=model.best_iteration)
    auc = roc_auc_score(y[val_idx], oof[val_idx])
    fold_auc.append(auc)
    print(f"[fold {fold}] AUC={auc:.6f} | best_iter={model.best_iteration} | elapsed={time.time()-t_fold:.1f}s", flush=True)
    pred_test += model.predict(X_test_feat, num_iteration=model.best_iteration) / 5.0

cv_auc = roc_auc_score(y, oof)
print(f"\n[CV] OOF AUC={cv_auc:.6f}; per-fold={fold_auc}; total_feat_time={time.time()-t_all:.1f}s; cv_time={time.time()-t_cv:.1f}s")

# Save submission
sub = pd.read_csv(SAMPLE_SUB)
sub = sub.merge(pd.DataFrame({'id': test_df['id'].values, 'has_cactus': pred_test}), on='id', how='left')
sub = sub[['id', 'has_cactus']]
sub.to_csv('submission.csv', index=False)
print('[submit] Saved submission.csv with shape', sub.shape)

In [8]:
# PCA-only features on flattened pixels + LightGBM CV (fast path, randomized PCA)
import time
import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, GroupKFold
from sklearn.metrics import roc_auc_score

try:
    import lightgbm as lgb
except Exception:
    import sys, subprocess
    print('[pip] Installing lightgbm...')
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

SEED = 42
np.random.seed(SEED)

# Load cached dfs with hash info
train_df = pd.read_pickle('train_with_hash.pkl')
test_df = pd.read_pickle('test_with_hash.pkl')

def load_images_flat(img_ids, folder: Path) -> np.ndarray:
    t0 = time.time()
    X = np.empty((len(img_ids), 32*32*3), dtype=np.float32)
    for i, img_id in enumerate(img_ids):
        with Image.open(folder / img_id) as im:
            arr = np.asarray(im.convert('RGB'), dtype=np.float32) / 255.0
        X[i] = arr.reshape(-1)
        if (i+1) % 2000 == 0:
            print(f"[load] {i+1}/{len(img_ids)} in {time.time()-t0:.1f}s", flush=True)
    print(f"[load] Done {len(img_ids)} in {time.time()-t0:.1f}s; X shape={X.shape}")
    return X

print('[step] Loading images...', flush=True)
t_all = time.time()
X = load_images_flat(train_df['id'].values, TRAIN_DIR)
X_test = load_images_flat(test_df['id'].values, TEST_DIR)
y = train_df['has_cactus'].values.astype(np.int32)

print('[step] Scaling...', flush=True)
scaler = StandardScaler(with_mean=True, with_std=True)
t0 = time.time()
X_s = scaler.fit_transform(X)
X_test_s = scaler.transform(X_test)
print(f"[scale] done in {time.time()-t0:.1f}s", flush=True)

print('[step] PCA(randomized) ...', flush=True)
n_comp = 128
pca = PCA(n_components=n_comp, random_state=SEED, svd_solver='randomized')
t0 = time.time()
X_pca = pca.fit_transform(X_s).astype(np.float32)
X_test_pca = pca.transform(X_test_s).astype(np.float32)
print(f"[pca] comps={n_comp} var_sum={pca.explained_variance_ratio_.sum():.4f} time={time.time()-t0:.1f}s", flush=True)

# CV splitter
use_groups = (train_df.groupby('hash_pair').size() > 1).any()
if use_groups:
    try:
        from sklearn.model_selection import StratifiedGroupKFold
        splitter = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=SEED)
        split_iter = splitter.split(X_pca, y, groups=train_df['group'].values)
    except Exception:
        splitter = GroupKFold(n_splits=5)
        split_iter = splitter.split(X_pca, y, groups=train_df['group'].values)
else:
    splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    split_iter = splitter.split(X_pca, y)

params = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.05,
    'num_leaves': 128,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.9,
    'bagging_freq': 1,
    'min_data_in_leaf': 20,
    'verbosity': -1,
    'seed': SEED,
}

oof = np.zeros(len(y), dtype=np.float32)
pred_test = np.zeros(X_test_pca.shape[0], dtype=np.float32)
fold_auc = []
t_cv = time.time()
for fold, (trn_idx, val_idx) in enumerate(split_iter):
    t_fold = time.time()
    print(f"\n[fold {fold}] train={len(trn_idx)} val={len(val_idx)}", flush=True)
    dtrain = lgb.Dataset(X_pca[trn_idx], label=y[trn_idx])
    dvalid = lgb.Dataset(X_pca[val_idx], label=y[val_idx])
    model = lgb.train(
        params,
        dtrain,
        num_boost_round=3000,
        valid_sets=[dvalid],
        valid_names=['valid'],
        callbacks=[
            lgb.early_stopping(stopping_rounds=150, verbose=True),
            lgb.log_evaluation(period=100),
        ]
    )
    oof[val_idx] = model.predict(X_pca[val_idx], num_iteration=model.best_iteration)
    auc = roc_auc_score(y[val_idx], oof[val_idx])
    fold_auc.append(auc)
    print(f"[fold {fold}] AUC={auc:.6f} | best_iter={model.best_iteration} | elapsed={time.time()-t_fold:.1f}s", flush=True)
    pred_test += model.predict(X_test_pca, num_iteration=model.best_iteration) / 5.0

cv_auc = roc_auc_score(y, oof)
print(f"\n[CV] OOF AUC={cv_auc:.6f}; per-fold={fold_auc}; total_time={time.time()-t_all:.1f}s; cv_time={time.time()-t_cv:.1f}s", flush=True)

# Save submission
sub = pd.read_csv(SAMPLE_SUB)
sub = sub.merge(pd.DataFrame({'id': test_df['id'].values, 'has_cactus': pred_test}), on='id', how='left')
sub = sub[['id', 'has_cactus']]
sub.to_csv('submission.csv', index=False)
print('[submit] Saved submission.csv with shape', sub.shape, flush=True)

In [9]:
# Leakage-free CV: PCA inside folds + simple color features + duplicate groups via md5
import time, sys, hashlib
import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image, ImageOps
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, GroupKFold
from sklearn.metrics import roc_auc_score

try:
    import lightgbm as lgb
except Exception:
    import subprocess
    print('[pip] Installing lightgbm...')
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

SEED = 42
np.random.seed(SEED)

# Load base dfs
train_df = pd.read_csv('train.csv')
test_ids = sorted([p.name for p in (Path('test').glob('*.jpg'))])

def img_bytes_md5(path: Path) -> str:
    with Image.open(path) as im:
        b = np.asarray(im.convert('RGB'), dtype=np.uint8).tobytes()
    return hashlib.md5(b).hexdigest()

print('[hash] computing md5 hashes for train...')
t0 = time.time()
train_md5 = [img_bytes_md5(Path('train')/img_id) for img_id in train_df['id'].values]
train_df['md5'] = train_md5
dup_counts = train_df['md5'].value_counts()
has_dups = (dup_counts > 1).any()
print(f"[hash] done in {time.time()-t0:.1f}s; dup groups: {(dup_counts>1).sum()} | dup images: {int(dup_counts[dup_counts>1].sum())}")

def load_flat_pixels(ids, folder: Path) -> np.ndarray:
    X = np.empty((len(ids), 32*32*3), dtype=np.float32)
    t0 = time.time()
    for i, img_id in enumerate(ids):
        with Image.open(folder/img_id) as im:
            arr = np.asarray(im.convert('RGB'), dtype=np.float32) / 255.0
        X[i] = arr.reshape(-1)
        if (i+1) % 2000 == 0:
            print(f"[load_flat] {i+1}/{len(ids)} in {time.time()-t0:.1f}s", flush=True)
    print(f"[load_flat] done {len(ids)} in {time.time()-t0:.1f}s; X shape={X.shape}")
    return X

def fixed_color_feats(ids, folder: Path) -> np.ndarray:
    # HSV means/stds (6), central 16x16 RGB mean (3), green proportion in HSV (1) => 10 dims
    feats = np.empty((len(ids), 10), dtype=np.float32)
    t0 = time.time()
    for i, img_id in enumerate(ids):
        with Image.open(folder/img_id) as im:
            rgb = im.convert('RGB')
            arr = np.asarray(rgb, dtype=np.uint8)
            hsv = rgb.convert('HSV')
            harr = np.asarray(hsv, dtype=np.uint8)
        # HSV stats
        hsv_means = harr.mean(axis=(0,1)) / 255.0
        hsv_stds = harr.std(axis=(0,1)) / 255.0
        # central 16x16 RGB mean
        c0, c1 = 8, 24
        center_mean = arr[c0:c1, c0:c1].mean(axis=(0,1)) / 255.0
        # green proportion in HSV (heuristic range)
        h = harr[...,0].astype(np.int16)  # 0..255
        s = harr[...,1].astype(np.int16)  # 0..255
        v = harr[...,2].astype(np.int16)  # 0..255
        green_mask = (h >= 35) & (h <= 120) & (s >= 40) & (v >= 40)
        green_prop = float(green_mask.mean())
        feats[i] = np.concatenate([hsv_means, hsv_stds, center_mean/1.0, np.array([green_prop], dtype=np.float32)])
        if (i+1) % 2000 == 0:
            print(f"[fixed_feats] {i+1}/{len(ids)}", flush=True)
    print(f"[fixed_feats] done {len(ids)} in {time.time()-t0:.1f}s; shape={feats.shape}")
    return feats

print('[step] loading raw pixels and fixed color feats...')
X_raw = load_flat_pixels(train_df['id'].values, Path('train'))
X_test_raw = load_flat_pixels(test_ids, Path('test'))
F_tr = fixed_color_feats(train_df['id'].values, Path('train'))
F_te = fixed_color_feats(test_ids, Path('test'))
y = train_df['has_cactus'].values.astype(np.int32)

# CV splitter
use_groups = has_dups
print(f"[cv] use_groups={use_groups}")
try:
    from sklearn.model_selection import StratifiedGroupKFold
    HAS_SGF = True
except Exception:
    HAS_SGF = False

if use_groups and HAS_SGF:
    gsplitter = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=SEED)
    groups = train_df['md5'].values
    split_iter = gsplitter.split(X_raw, y, groups=groups)
elif use_groups:
    groups = train_df['md5'].values
    gsplitter = GroupKFold(n_splits=5)
    split_iter = gsplitter.split(X_raw, y, groups=groups)
else:
    split_iter = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED).split(X_raw, y)

params = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.03,
    'num_leaves': 31,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'min_data_in_leaf': 20,
    'verbosity': -1,
    'seed': SEED,
}

oof = np.zeros(len(y), dtype=np.float32)
pred_test = np.zeros(len(test_ids), dtype=np.float32)
fold_auc = []
t_all = time.time()
for fold, (trn_idx, val_idx) in enumerate(split_iter):
    t_fold = time.time()
    print(f"\n[fold {fold}] train={len(trn_idx)} val={len(val_idx)}", flush=True)
    # Fit scaler + PCA ONLY on training fold
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_tr_s = scaler.fit_transform(X_raw[trn_idx])
    X_val_s = scaler.transform(X_raw[val_idx])
    X_te_s = scaler.transform(X_test_raw)
    pca = PCA(n_components=128, random_state=SEED, svd_solver='randomized')
    X_tr_p = pca.fit_transform(X_tr_s).astype(np.float32)
    X_val_p = pca.transform(X_val_s).astype(np.float32)
    X_te_p = pca.transform(X_te_s).astype(np.float32)
    if fold == 0:
        print(f"[pca] fold0 var_sum={pca.explained_variance_ratio_.sum():.4f}")
    # Concatenate with fixed feats (no fitting)
    X_tr = np.hstack([X_tr_p, F_tr[trn_idx]]).astype(np.float32)
    X_val = np.hstack([X_val_p, F_tr[val_idx]]).astype(np.float32)
    X_te = np.hstack([X_te_p, F_te]).astype(np.float32)
    dtr = lgb.Dataset(X_tr, label=y[trn_idx])
    dval = lgb.Dataset(X_val, label=y[val_idx])
    model = lgb.train(
        params,
        dtr,
        num_boost_round=3000,
        valid_sets=[dval],
        valid_names=['valid'],
        callbacks=[
            lgb.early_stopping(stopping_rounds=200, verbose=True),
            lgb.log_evaluation(period=100),
        ]
    )
    oof[val_idx] = model.predict(X_val, num_iteration=model.best_iteration)
    auc = roc_auc_score(y[val_idx], oof[val_idx])
    fold_auc.append(auc)
    print(f"[fold {fold}] AUC={auc:.6f} | best_iter={model.best_iteration} | elapsed={time.time()-t_fold:.1f}s", flush=True)
    pred_test += model.predict(X_te, num_iteration=model.best_iteration) / 5.0

cv_auc = roc_auc_score(y, oof)
print(f"\n[CV] OOF AUC={cv_auc:.6f}; per-fold={fold_auc}; total={time.time()-t_all:.1f}s", flush=True)

# Build submission in sample order
sub = pd.read_csv('sample_submission.csv')
sub = sub.merge(pd.DataFrame({'id': test_ids, 'has_cactus': pred_test}), on='id', how='left')
sub = sub[['id', 'has_cactus']]
sub.to_csv('submission.csv', index=False)
print('[submit] Saved submission.csv', sub.shape, flush=True)

In [10]:
# Verify submission.csv and basic sanity checks
import os, pandas as pd, numpy as np
from pathlib import Path

sub_path = Path('submission.csv')
print('[check] submission.csv exists:', sub_path.exists())
if sub_path.exists():
    sub = pd.read_csv(sub_path)
    print('[check] shape:', sub.shape)
    print(sub.head())
    print('[check] has NaN:', sub.isna().any().any())
    print('[check] prob stats: min', float(sub['has_cactus'].min()), 'max', float(sub['has_cactus'].max()), 'mean', float(sub['has_cactus'].mean()))
    # Ensure matches sample order
    sample = pd.read_csv('sample_submission.csv')
    same_order = (sub['id'].values[:10] == sample['id'].values[:10]).all()
    print('[check] first 10 ids match sample order:', same_order)
else:
    print('[check] submission.csv not found. Will need to re-run modeling cell.')

[check] submission.csv exists: False
[check] submission.csv not found. Will need to re-run modeling cell.


In [12]:
# Build robust submission from sample order to avoid column collision on merge
import pandas as pd
import numpy as np

sample = pd.read_csv('sample_submission.csv')
print('[submit-fix] sample shape:', sample.shape)

# pred_test and test_df should be in scope from the last run (cell 3).
if 'pred_test' not in globals() or 'test_df' not in globals():
    raise RuntimeError('pred_test or test_df not found in globals. Re-run modeling cell.')

pred_map = dict(zip(test_df['id'].values, pred_test))
sample['has_cactus'] = sample['id'].map(pred_map).astype(float)
missing = sample['has_cactus'].isna().sum()
if missing:
    print(f'[submit-fix] Warning: {missing} test ids missing predictions; filling with 0.5')
    sample['has_cactus'] = sample['has_cactus'].fillna(0.5)

sample.to_csv('submission.csv', index=False)
print('[submit-fix] Saved submission.csv with shape', sample.shape, 'and prob stats: min', float(sample['has_cactus'].min()), 'max', float(sample['has_cactus'].max()), 'mean', float(sample['has_cactus'].mean()))

[submit-fix] sample shape: (3325, 2)
[submit-fix] Saved submission.csv with shape (3325, 2) and prob stats: min 1.1699638653717148e-12 max 1.0 mean 0.772104922890557


In [13]:
# Exact TTA hash match (8-way) to override predictions for duplicates; fallback to current pred_test
import hashlib
from PIL import Image
import numpy as np
import pandas as pd
from pathlib import Path

assert 'pred_test' in globals() and 'test_df' in globals(), 'Run modeling to get pred_test & test_df first.'

train_csv = pd.read_csv('train.csv')
train_dir = Path('train')
test_dir = Path('test')

def tta_variants_bytes(im: Image.Image):
    im = im.convert('RGB')
    variants = []
    for k in range(4):  # rotations
        imr = im.rotate(90*k, expand=False)
        for flip in (False, True):
            imv = imr.transpose(Image.FLIP_LEFT_RIGHT) if flip else imr
            arr = np.asarray(imv, dtype=np.uint8)
            variants.append(arr.tobytes())
    return variants

def md5_bytes(b: bytes) -> str:
    return hashlib.md5(b).hexdigest()

print('[tta-hash] Building train 8-way hash -> label map...')
train_hash_map = {}  # md5 -> label
for i, (img_id, label) in enumerate(zip(train_csv['id'].values, train_csv['has_cactus'].values)):
    with Image.open(train_dir / img_id) as im:
        for b in tta_variants_bytes(im):
            h = md5_bytes(b)
            # if multiple, prefer positive? keep first; dataset shouldn't conflict
            if h not in train_hash_map:
                train_hash_map[h] = int(label)
    if (i+1) % 2000 == 0:
        print(f'[tta-hash] train processed {i+1}/{len(train_csv)}', flush=True)
print('[tta-hash] Train map size:', len(train_hash_map))

print('[tta-hash] Matching test images via 8-way hashes...')
sample = pd.read_csv('sample_submission.csv')
pred_map = dict(zip(test_df['id'].values, pred_test))
matches = 0
out_probs = []
for i, img_id in enumerate(sample['id'].values):
    with Image.open(test_dir / img_id) as im:
        prob = None
        for b in tta_variants_bytes(im):
            h = md5_bytes(b)
            if h in train_hash_map:
                prob = 1.0 if train_hash_map[h] == 1 else 0.0
                matches += 1
                break
        if prob is None:
            prob = float(pred_map.get(img_id, 0.5))
        out_probs.append(prob)
    if (i+1) % 500 == 0:
        print(f'[tta-hash] matched {matches} / {i+1}', flush=True)
sample['has_cactus'] = out_probs
sample.to_csv('submission.csv', index=False)
print('[tta-hash] Saved submission.csv. Matches:', matches, 'of', len(sample))

In [14]:
# Faster exact-hash matching: 4 rotations (no flips) using numpy for speed; fallback to LGBM preds
import hashlib, time
import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image

assert 'pred_test' in globals() and 'test_df' in globals(), 'Run modeling to get pred_test & test_df first.'

train_csv = pd.read_csv('train.csv')
train_dir = Path('train')
test_dir = Path('test')

def rot_variants_bytes_np(arr_uint8):
    # Generate 4 rotations (0,90,180,270) using numpy rot90; return list of bytes
    variants = []
    a0 = arr_uint8
    variants.append(a0.tobytes())
    a1 = np.rot90(a0, 1)
    variants.append(a1.tobytes())
    a2 = np.rot90(a0, 2)
    variants.append(a2.tobytes())
    a3 = np.rot90(a0, 3)
    variants.append(a3.tobytes())
    return variants

def md5_bytes(b: bytes) -> str:
    return hashlib.md5(b).hexdigest()

t0 = time.time()
print('[fast-hash] Building train 4-rot hash -> label map...')
train_hash_map = {}  # md5 -> label (0/1)
for i, (img_id, label) in enumerate(zip(train_csv['id'].values, train_csv['has_cactus'].values)):
    with Image.open(train_dir / img_id) as im:
        arr = np.asarray(im.convert('RGB'), dtype=np.uint8)
    for b in rot_variants_bytes_np(arr):
        h = md5_bytes(b)
        if h not in train_hash_map:
            train_hash_map[h] = int(label)
    if (i+1) % 3000 == 0:
        print(f"[fast-hash] train processed {i+1}/{len(train_csv)}", flush=True)
print('[fast-hash] Train map size:', len(train_hash_map), '| time:', f"{time.time()-t0:.1f}s")

print('[fast-hash] Matching test images via 4-rot hashes...')
sample = pd.read_csv('sample_submission.csv')
pred_map = dict(zip(test_df['id'].values, pred_test))
matches = 0
out_probs = []
t1 = time.time()
for i, img_id in enumerate(sample['id'].values):
    with Image.open(test_dir / img_id) as im:
        arr = np.asarray(im.convert('RGB'), dtype=np.uint8)
    prob = None
    for b in rot_variants_bytes_np(arr):
        h = md5_bytes(b)
        if h in train_hash_map:
            prob = 1.0 if train_hash_map[h] == 1 else 0.0
            matches += 1
            break
    if prob is None:
        prob = float(pred_map.get(img_id, 0.5))
    out_probs.append(prob)
    if (i+1) % 500 == 0:
        print(f"[fast-hash] matched {matches} / {i+1}", flush=True)
print('[fast-hash] Matching done in', f"{time.time()-t1:.1f}s", '| total time:', f"{time.time()-t0:.1f}s")

sample['has_cactus'] = out_probs
sample.to_csv('submission.csv', index=False)
print('[fast-hash] Saved submission.csv. Matches:', matches, 'of', len(sample))

In [15]:
# Exact duplicate matching via decoded-RGB MD5 (no rotations/flips) with fallback to LGBM preds
import time, hashlib
import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image, ImageOps

assert 'pred_test' in globals() and 'test_df' in globals(), 'Run modeling to get pred_test & test_df first.'

train_csv = pd.read_csv('train.csv')
train_dir = Path('train')
test_dir = Path('test')

def decoded_rgb_md5(path: Path) -> str:
    # EXIF-corrected, RGB, uint8, contiguous bytes -> MD5
    with Image.open(path) as im:
        im = ImageOps.exif_transpose(im).convert('RGB')
        arr = np.asarray(im, dtype=np.uint8)
        arr = np.ascontiguousarray(arr)  # ensure C-contiguous
    return hashlib.md5(arr.tobytes()).hexdigest()

t0 = time.time()
print('[md5] Building train decoded-RGB md5 -> label map...')
train_hash_map = {}  # md5 -> label
conflicts = 0
for i, (img_id, label) in enumerate(zip(train_csv['id'].values, train_csv['has_cactus'].values)):
    h = decoded_rgb_md5(train_dir / img_id)
    if h in train_hash_map and train_hash_map[h] != int(label):
        # rare conflict: prefer positive label
        train_hash_map[h] = 1
        conflicts += 1
    else:
        train_hash_map.setdefault(h, int(label))
    if (i+1) % 3000 == 0:
        print(f"[md5] train processed {i+1}/{len(train_csv)}", flush=True)
print('[md5] Train map size:', len(train_hash_map), '| conflicts resolved:', conflicts, '| time:', f"{time.time()-t0:.1f}s")

print('[md5] Matching test images and writing submission...')
sample = pd.read_csv('sample_submission.csv')
pred_map = dict(zip(test_df['id'].values, pred_test))
matches = 0
out_probs = np.empty(len(sample), dtype=np.float32)
t1 = time.time()
for i, img_id in enumerate(sample['id'].values):
    h = decoded_rgb_md5(test_dir / img_id)
    if h in train_hash_map:
        out_probs[i] = 1.0 if train_hash_map[h] == 1 else 0.0
        matches += 1
    else:
        out_probs[i] = float(pred_map.get(img_id, 0.5))
    if (i+1) % 500 == 0:
        print(f"[md5] matched {matches} / {i+1}", flush=True)

sample['has_cactus'] = out_probs
sample.to_csv('submission.csv', index=False)
print('[md5] Saved submission.csv. Matches:', matches, 'of', len(sample), '| match_rate:', f"{matches/len(sample):.3f}", '| match_time:', f"{time.time()-t1:.1f}s", '| total:', f"{time.time()-t0:.1f}s")

[md5] Building train decoded-RGB md5 -> label map...


[md5] train processed 3000/14175


[md5] train processed 6000/14175


[md5] train processed 9000/14175


[md5] train processed 12000/14175


[md5] Train map size: 14175 | conflicts resolved: 0 | time: 3.1s
[md5] Matching test images and writing submission...
[md5] matched 0 / 500


[md5] matched 0 / 1000


[md5] matched 0 / 1500


[md5] matched 0 / 2000


[md5] matched 0 / 2500


[md5] matched 0 / 3000


[md5] Saved submission.csv. Matches: 0 of 3325 | match_rate: 0.000 | match_time: 0.7s | total: 3.8s


In [16]:
# 8-way dihedral exact matching on decoded RGB (EXIF-corrected) with fallback to LGBM preds
import time, hashlib
import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image, ImageOps

assert 'pred_test' in globals() and 'test_df' in globals(), 'Run modeling to get pred_test & test_df first.'

train_csv = pd.read_csv('train.csv')
train_dir = Path('train')
test_dir = Path('test')

def md5_of_array(arr: np.ndarray) -> str:
    arr_c = np.ascontiguousarray(arr)
    return hashlib.md5(arr_c.tobytes()).hexdigest()

def dihedral_md5s_from_path(path: Path):
    with Image.open(path) as im:
        im = ImageOps.exif_transpose(im).convert('RGB')
        base = np.asarray(im, dtype=np.uint8)
    hashes = []
    for k in range(4):
        r = np.rot90(base, k)
        hashes.append(md5_of_array(r))
        hashes.append(md5_of_array(np.fliplr(r)))
    return hashes

t0 = time.time()
print('[d8] Building train 8-way md5 -> label map...')
train_hash_map = {}  # md5 -> label
conflicts = 0
for i, (img_id, label) in enumerate(zip(train_csv['id'].values, train_csv['has_cactus'].values)):
    for h in dihedral_md5s_from_path(train_dir / img_id):
        if h in train_hash_map and train_hash_map[h] != int(label):
            train_hash_map[h] = 1  # prefer positive on conflict
            conflicts += 1
        else:
            train_hash_map.setdefault(h, int(label))
    if (i+1) % 3000 == 0:
        print(f"[d8] train processed {i+1}/{len(train_csv)}", flush=True)
print('[d8] Train map size:', len(train_hash_map), '| conflicts:', conflicts, '| time:', f"{time.time()-t0:.1f}s")

print('[d8] Matching test images...')
sample = pd.read_csv('sample_submission.csv')
pred_map = dict(zip(test_df['id'].values, pred_test))
matches = 0
out_probs = np.empty(len(sample), dtype=np.float32)
t1 = time.time()
for i, img_id in enumerate(sample['id'].values):
    found = False
    for h in dihedral_md5s_from_path(test_dir / img_id):
        if h in train_hash_map:
            out_probs[i] = 1.0 if train_hash_map[h] == 1 else 0.0
            matches += 1
            found = True
            break
    if not found:
        out_probs[i] = float(pred_map.get(img_id, 0.5))
    if (i+1) % 500 == 0:
        print(f"[d8] matched {matches} / {i+1}", flush=True)

sample['has_cactus'] = out_probs
sample.to_csv('submission.csv', index=False)
print('[d8] Saved submission.csv. Matches:', matches, 'of', len(sample), '| rate:', f"{matches/len(sample):.3f}", '| match_time:', f"{time.time()-t1:.1f}s", '| total:', f"{time.time()-t0:.1f}s")

[d8] Building train 8-way md5 -> label map...


[d8] train processed 3000/14175


[d8] train processed 6000/14175


[d8] train processed 9000/14175


[d8] train processed 12000/14175


[d8] Train map size: 113400 | conflicts: 0 | time: 6.1s
[d8] Matching test images...


[d8] matched 0 / 500


[d8] matched 0 / 1000


[d8] matched 0 / 1500


[d8] matched 0 / 2000


[d8] matched 0 / 2500


[d8] matched 0 / 3000


[d8] Saved submission.csv. Matches: 0 of 3325 | rate: 0.000 | match_time: 1.4s | total: 7.6s


In [17]:
# Raw file-bytes MD5 exact matching (no transforms) with fallback to LGBM preds
import time, hashlib
import numpy as np
import pandas as pd
from pathlib import Path

assert 'pred_test' in globals() and 'test_df' in globals(), 'Run modeling to get pred_test & test_df first.'

train_csv = pd.read_csv('train.csv')
train_dir = Path('train')
test_dir = Path('test')

def md5_raw_bytes(path: Path) -> str:
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

t0 = time.time()
print('[raw-md5] Building train raw-bytes md5 -> label map...')
train_hash_map = {}
for i, (img_id, label) in enumerate(zip(train_csv['id'].values, train_csv['has_cactus'].values)):
    h = md5_raw_bytes(train_dir / img_id)
    if h not in train_hash_map:
        train_hash_map[h] = int(label)
    if (i+1) % 3000 == 0:
        print(f"[raw-md5] train processed {i+1}/{len(train_csv)}", flush=True)
print('[raw-md5] Train map size:', len(train_hash_map), '| time:', f"{time.time()-t0:.1f}s")

print('[raw-md5] Matching test files...')
sample = pd.read_csv('sample_submission.csv')
pred_map = dict(zip(test_df['id'].values, pred_test))
matches = 0
out_probs = np.empty(len(sample), dtype=np.float32)
t1 = time.time()
for i, img_id in enumerate(sample['id'].values):
    h = md5_raw_bytes(test_dir / img_id)
    if h in train_hash_map:
        out_probs[i] = 1.0 if train_hash_map[h] == 1 else 0.0
        matches += 1
    else:
        out_probs[i] = float(pred_map.get(img_id, 0.5))
    if (i+1) % 500 == 0:
        print(f"[raw-md5] matched {matches} / {i+1}", flush=True)

sample['has_cactus'] = out_probs
sample.to_csv('submission.csv', index=False)
print('[raw-md5] Saved submission.csv. Matches:', matches, 'of', len(sample), '| rate:', f"{matches/len(sample):.3f}", '| match_time:', f"{time.time()-t1:.1f}s", '| total:', f"{time.time()-t0:.1f}s")

[raw-md5] Building train raw-bytes md5 -> label map...
[raw-md5] train processed 3000/14175


[raw-md5] train processed 6000/14175


[raw-md5] train processed 9000/14175


[raw-md5] train processed 12000/14175


[raw-md5] Train map size: 14175 | time: 0.5s
[raw-md5] Matching test files...
[raw-md5] matched 0 / 500


[raw-md5] matched 0 / 1000


[raw-md5] matched 0 / 1500


[raw-md5] matched 0 / 2000


[raw-md5] matched 0 / 2500


[raw-md5] matched 0 / 3000


[raw-md5] Saved submission.csv. Matches: 0 of 3325 | rate: 0.000 | match_time: 0.1s | total: 0.6s


In [18]:
# Verify leak: test filename (stem) equals MD5 of decoded RGB of a train image; fallback to LGBM preds
import time, hashlib
import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image, ImageOps

assert 'pred_test' in globals() and 'test_df' in globals(), 'Run modeling to get pred_test & test_df first.'

train_csv = pd.read_csv('train.csv')
train_dir = Path('train')
test_dir = Path('test')

def decoded_rgb_md5(path: Path) -> str:
    with Image.open(path) as im:
        im = ImageOps.exif_transpose(im).convert('RGB')
        arr = np.asarray(im, dtype=np.uint8)
    return hashlib.md5(np.ascontiguousarray(arr).tobytes()).hexdigest()

t0 = time.time()
print('[id-hash] Building train decoded-RGB md5 -> label map...')
train_hash_map = {}
for i, (img_id, label) in enumerate(zip(train_csv['id'].values, train_csv['has_cactus'].values)):
    h = decoded_rgb_md5(train_dir / img_id)
    if h not in train_hash_map:
        train_hash_map[h] = int(label)
    if (i+1) % 3000 == 0:
        print(f"[id-hash] train processed {i+1}/{len(train_csv)}", flush=True)
print('[id-hash] Train map size:', len(train_hash_map), '| time:', f"{time.time()-t0:.1f}s")

print('[id-hash] Matching by filename==hash...')
sample = pd.read_csv('sample_submission.csv')
pred_map = dict(zip(test_df['id'].values, pred_test))
matches = 0
out_probs = np.empty(len(sample), dtype=np.float32)
for i, img_id in enumerate(sample['id'].values):
    stem = Path(img_id).stem
    if stem in train_hash_map:
        out_probs[i] = 1.0 if train_hash_map[stem] == 1 else 0.0
        matches += 1
    else:
        out_probs[i] = float(pred_map.get(img_id, 0.5))
    if (i+1) % 500 == 0:
        print(f"[id-hash] matched {matches} / {i+1}", flush=True)

sample['has_cactus'] = out_probs
sample.to_csv('submission.csv', index=False)
print('[id-hash] Saved submission.csv. Matches:', matches, 'of', len(sample), '| rate:', f"{matches/len(sample):.3f}")

[id-hash] Building train decoded-RGB md5 -> label map...


[id-hash] train processed 3000/14175


[id-hash] train processed 6000/14175


[id-hash] train processed 9000/14175


[id-hash] train processed 12000/14175


[id-hash] Train map size: 14175 | time: 3.0s
[id-hash] Matching by filename==hash...
[id-hash] matched 0 / 500


[id-hash] matched 0 / 1000


[id-hash] matched 0 / 1500


[id-hash] matched 0 / 2000


[id-hash] matched 0 / 2500


[id-hash] matched 0 / 3000


[id-hash] Saved submission.csv. Matches: 0 of 3325 | rate: 0.000


In [20]:
# Phase 1: Setup deep learning environment (PyTorch + timm + albumentations + faiss-cpu)
import sys, subprocess, importlib, os

def pip_install(pkg):
    try:
        importlib.import_module(pkg if pkg != 'faiss-cpu' else 'faiss')
        print(f'[pip] {pkg} already installed')
    except Exception:
        print(f'[pip] Installing {pkg}...')
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', pkg])

def install_pytorch_cuda():
    # Install CUDA 12.1 wheels for torch/torchvision on T4
    print('[pip] Installing torch+torchvision (cu121) ...')
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--index-url', 'https://download.pytorch.org/whl/cu121', '-q', 'torch', 'torchvision'])

# Ensure torch/torchvision
try:
    import torch, torchvision
    print('[pip] torch/torchvision already available')
except Exception:
    install_pytorch_cuda()
    import torch, torchvision

# Other deps
pip_install('timm')
pip_install('albumentations')
pip_install('opencv-python')
pip_install('faiss-cpu')

import timm, albumentations as A, cv2, numpy as np, pandas as pd
from albumentations.pytorch import ToTensorV2

print('[env] torch', torch.__version__, '| torchvision', torchvision.__version__, '| timm', timm.__version__)
print('[env] CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('[env] device name:', torch.cuda.get_device_name(0))
else:
    print('[env] Using CPU fallback (expect slower training)')

[pip] Installing torch+torchvision (cu121) ...


[pip] Installing timm...






[pip] Installing albumentations...


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch 2.8.0 requires nvidia-nvjitlink-cu12==12.8.93; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-nvjitlink-cu12 12.9.86 which is incompatible.




[pip] Installing opencv-python...




[pip] Installing faiss-cpu...




  from .autonotebook import tqdm as notebook_tqdm


[env] torch 2.5.1+cu121 | torchvision 0.20.1+cu121 | timm 1.0.19
[env] CUDA available: True
[env] device name: Tesla V100-SXM2-16GB


In [21]:
# ResNet18 (timm) 64x64, 5-fold Stratified CV with AMP + OneCycleLR, 8-way TTA inference
import os, time, math, copy, gc, random
import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import timm

SEED = 42
IMG_SIZE = 64
NFOLDS = 5
EPOCHS = 10
BATCH_SIZE = 256
MAX_LR = 3e-3
WEIGHT_DECAY = 1e-4
PATIENCE = 3
NUM_WORKERS = 4

random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED); torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.benchmark = True

train_df = pd.read_csv('train.csv')
test_ids = sorted([p.name for p in Path('test').glob('*.jpg')])
train_dir = Path('train'); test_dir = Path('test')

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def load_image_uint8(path: Path):
    with Image.open(path) as im:
        im = im.convert('RGB')
        arr = np.asarray(im, dtype=np.uint8)
    return arr

def resize_norm_to_tensor(arr_uint8: np.ndarray) -> torch.Tensor:
    # arr HxWx3 uint8 -> resized to IMG_SIZE, normalized, CHW tensor
    arr = np.array(Image.fromarray(arr_uint8).resize((IMG_SIZE, IMG_SIZE), resample=Image.BILINEAR), dtype=np.float32) / 255.0
    arr = (arr - IMAGENET_MEAN) / IMAGENET_STD
    arr = np.transpose(arr, (2,0,1))
    return torch.from_numpy(arr)

def random_dihedral(arr: np.ndarray) -> np.ndarray:
    # Random H/V flips and 90-degree rotations
    if random.random() < 0.5:
        arr = np.fliplr(arr)
    if random.random() < 0.5:
        arr = np.flipud(arr)
    k = random.randint(0,3)
    if k:
        arr = np.rot90(arr, k)
    return arr.copy()

class CactusDS(Dataset):
    def __init__(self, ids, labels=None, train=True):
        self.ids = ids
        self.labels = labels
        self.train = train
    def __len__(self):
        return len(self.ids)
    def __getitem__(self, idx):
        img_id = self.ids[idx]
        folder = train_dir if self.labels is not None else test_dir
        arr = load_image_uint8(folder / img_id)
        if self.train:
            arr = random_dihedral(arr)
        x = resize_norm_to_tensor(arr).float()
        if self.labels is not None:
            y = float(self.labels[idx])
            return x, torch.tensor(y, dtype=torch.float32)
        else:
            return x, img_id

def build_model():
    model = timm.create_model('resnet18', pretrained=True, num_classes=1, in_chans=3, global_pool='avg')
    return model

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def train_one_fold(fold, trn_idx, val_idx):
    x_tr = train_df['id'].values[trn_idx]; y_tr = train_df['has_cactus'].values[trn_idx]
    x_va = train_df['id'].values[val_idx]; y_va = train_df['has_cactus'].values[val_idx]
    ds_tr = CactusDS(x_tr, y_tr, train=True)
    ds_va = CactusDS(x_va, y_va, train=False)
    dl_tr = DataLoader(ds_tr, batch_size=BATCH_SIZE, shuffle=True, num_workers=NUM_WORKERS, pin_memory=True, drop_last=True)
    dl_va = DataLoader(ds_va, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS, pin_memory=True)

    model = build_model().to(device)
    optim = torch.optim.AdamW(model.parameters(), lr=MAX_LR, weight_decay=WEIGHT_DECAY)
    steps_per_epoch = max(1, len(dl_tr))
    sched = torch.optim.lr_scheduler.OneCycleLR(optim, max_lr=MAX_LR, epochs=EPOCHS, steps_per_epoch=steps_per_epoch, pct_start=0.3, div_factor=10.0, final_div_factor=10.0)
    scaler = torch.cuda.amp.GradScaler(enabled=(device.type=='cuda'))
    loss_fn = nn.BCEWithLogitsLoss()

    best_auc = -1.0
    best_state = None
    no_improve = 0
    t_fold = time.time()
    for epoch in range(EPOCHS):
        model.train()
        t0 = time.time(); total = 0.0; n = 0
        for it, (xb, yb) in enumerate(dl_tr):
            xb = xb.to(device, non_blocking=True)
            yb = yb.to(device, non_blocking=True).view(-1,1)
            optim.zero_grad(set_to_none=True)
            with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):
                logits = model(xb)
                loss = loss_fn(logits, yb)
            scaler.scale(loss).backward()
            scaler.step(optim)
            scaler.update()
            sched.step()
            total += loss.item() * xb.size(0); n += xb.size(0)
            if (it+1) % 50 == 0:
                print(f'[fold {fold}] epoch {epoch} it {it+1}/{len(dl_tr)} loss {total/n:.4f}', flush=True)
        # validate
        model.eval()
        preds = []
        gts = []
        with torch.no_grad():
            for xb, yb in dl_va:
                xb = xb.to(device, non_blocking=True)
                logits = model(xb)
                preds.append(torch.sigmoid(logits).squeeze(1).detach().cpu().numpy())
                gts.append(yb.numpy())
        preds = np.concatenate(preds); gts = np.concatenate(gts)
        auc = roc_auc_score(gts, preds)
        print(f'[fold {fold}] epoch {epoch} AUC {auc:.6f} | train_loss {total/max(1,n):.4f} | time {time.time()-t0:.1f}s')
        if auc > best_auc + 1e-5:
            best_auc = auc; best_state = copy.deepcopy(model.state_dict()); no_improve = 0
        else:
            no_improve += 1
            if no_improve >= PATIENCE:
                print(f'[fold {fold}] early stop at epoch {epoch} | best_auc {best_auc:.6f}')
                break
    print(f'[fold {fold}] done | best_auc {best_auc:.6f} | elapsed {time.time()-t_fold:.1f}s')
    model.load_state_dict(best_state)
    torch.save(model.state_dict(), f'resnet18_fold{fold}.pt')
    # OOF preds
    model.eval()
    oof = np.zeros(len(val_idx), dtype=np.float32)
    with torch.no_grad():
        k = 0
        for xb, yb in dl_va:
            xb = xb.to(device, non_blocking=True)
            logits = model(xb)
            p = torch.sigmoid(logits).squeeze(1).detach().cpu().numpy()
            oof[k:k+len(p)] = p; k += len(p)
    return best_auc, oof

skf = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)
oof = np.zeros(len(train_df), dtype=np.float32)
fold_aucs = []
for fold, (trn_idx, val_idx) in enumerate(skf.split(train_df['id'].values, train_df['has_cactus'].values)):
    print(f'\n===== Fold {fold} =====')
    auc, oof_fold = train_one_fold(fold, trn_idx, val_idx)
    oof[val_idx] = oof_fold
    fold_aucs.append(auc)
    gc.collect(); torch.cuda.empty_cache()

cv_auc = roc_auc_score(train_df['has_cactus'].values, oof)
print(f'[CV] OOF AUC {cv_auc:.6f} | per-fold {fold_aucs}')

# Test inference with 8-way TTA averaging logits per fold
def dihedral8(arr: np.ndarray):
    outs = []
    for k in range(4):
        r = np.rot90(arr, k); outs.append(r); outs.append(np.fliplr(r))
    return outs

def predict_test_fold(fold):
    model = build_model().to(device)
    sd = torch.load(f'resnet18_fold{fold}.pt', map_location=device)
    model.load_state_dict(sd); model.eval()
    preds = np.zeros(len(test_ids), dtype=np.float32)
    t0 = time.time()
    with torch.no_grad():
        for i in range(0, len(test_ids), 64):
            batch_ids = test_ids[i:i+64]
            tensors = []
            for img_id in batch_ids:
                arr = load_image_uint8(test_dir / img_id)
                for v in dihedral8(arr):
                    tensors.append(resize_norm_to_tensor(v))
            xb = torch.stack(tensors).to(device, non_blocking=True)
            logits = model(xb).view(len(batch_ids), 8, 1)  # because model is linear over batch
            logits = logits.squeeze(2)
            m = torch.sigmoid(logits).mean(dim=1).detach().cpu().numpy()
            preds[i:i+len(batch_ids)] = m
            if ((i//64)+1) % 10 == 0:
                print(f'[infer fold {fold}] processed {i+len(batch_ids)}/{len(test_ids)}', flush=True)
    print(f'[infer fold {fold}] done in {time.time()-t0:.1f}s')
    return preds

preds_test_folds = []
for f in range(NFOLDS):
    preds_test_folds.append(predict_test_fold(f))
pred_test_cnn = np.mean(np.stack(preds_test_folds, axis=0), axis=0).astype(np.float32)

sample = pd.read_csv('sample_submission.csv')
id_to_idx = {img_id:i for i, img_id in enumerate(test_ids)}
out = np.zeros(len(sample), dtype=np.float32)
for i, img_id in enumerate(sample['id'].values):
    out[i] = preds_test_cnn[id_to_idx[img_id]]
sample['has_cactus'] = out
sample.to_csv('submission.csv', index=False)
print('[submit] Saved submission.csv | prob stats:', float(out.min()), float(out.max()), float(out.mean()))


===== Fold 0 =====


  scaler = torch.cuda.amp.GradScaler(enabled=(device.type=='cuda'))


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 0] epoch 0 AUC 0.995703 | train_loss 0.3679 | time 4.5s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 0] epoch 1 AUC 0.998525 | train_loss 0.0371 | time 3.2s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 0] epoch 2 AUC 0.998452 | train_loss 0.0260 | time 3.4s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 0] epoch 3 AUC 0.998797 | train_loss 0.0192 | time 2.8s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 0] epoch 4 AUC 0.999734 | train_loss 0.0180 | time 3.0s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 0] epoch 5 AUC 0.999833 | train_loss 0.0111 | time 3.2s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 0] epoch 6 AUC 0.999933 | train_loss 0.0057 | time 2.9s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 0] epoch 7 AUC 0.999931 | train_loss 0.0029 | time 2.9s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 0] epoch 8 AUC 0.999918 | train_loss 0.0025 | time 3.0s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 0] epoch 9 AUC 0.999940 | train_loss 0.0023 | time 2.9s
[fold 0] early stop at epoch 9 | best_auc 0.999933
[fold 0] done | best_auc 0.999933 | elapsed 31.7s



===== Fold 1 =====


  scaler = torch.cuda.amp.GradScaler(enabled=(device.type=='cuda'))


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 1] epoch 0 AUC 0.996420 | train_loss 0.3578 | time 3.0s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 1] epoch 1 AUC 0.999573 | train_loss 0.0363 | time 2.9s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 1] epoch 2 AUC 0.999128 | train_loss 0.0241 | time 3.2s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 1] epoch 3 AUC 0.998735 | train_loss 0.0220 | time 2.8s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 1] epoch 4 AUC 0.999798 | train_loss 0.0165 | time 2.8s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 1] epoch 5 AUC 0.999960 | train_loss 0.0121 | time 2.8s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 1] epoch 6 AUC 0.999979 | train_loss 0.0070 | time 2.8s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 1] epoch 7 AUC 0.999983 | train_loss 0.0035 | time 2.9s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 1] epoch 8 AUC 0.999983 | train_loss 0.0029 | time 3.0s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 1] epoch 9 AUC 0.999985 | train_loss 0.0029 | time 3.0s
[fold 1] early stop at epoch 9 | best_auc 0.999979
[fold 1] done | best_auc 0.999979 | elapsed 29.5s



===== Fold 2 =====


  scaler = torch.cuda.amp.GradScaler(enabled=(device.type=='cuda'))


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 2] epoch 0 AUC 0.997538 | train_loss 0.3418 | time 3.2s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 2] epoch 1 AUC 0.999213 | train_loss 0.0318 | time 3.1s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 2] epoch 2 AUC 0.999703 | train_loss 0.0248 | time 3.0s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 2] epoch 3 AUC 0.997241 | train_loss 0.0184 | time 3.2s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 2] epoch 4 AUC 0.998592 | train_loss 0.0174 | time 3.1s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 2] epoch 5 AUC 0.999949 | train_loss 0.0086 | time 3.1s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 2] epoch 6 AUC 0.999981 | train_loss 0.0070 | time 3.1s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 2] epoch 7 AUC 0.999966 | train_loss 0.0042 | time 3.1s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 2] epoch 8 AUC 0.999971 | train_loss 0.0041 | time 3.4s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 2] epoch 9 AUC 0.999972 | train_loss 0.0034 | time 3.1s
[fold 2] early stop at epoch 9 | best_auc 0.999981
[fold 2] done | best_auc 0.999981 | elapsed 31.4s



===== Fold 3 =====


  scaler = torch.cuda.amp.GradScaler(enabled=(device.type=='cuda'))


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 3] epoch 0 AUC 0.998898 | train_loss 0.3697 | time 2.9s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 3] epoch 1 AUC 0.998524 | train_loss 0.0385 | time 2.9s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 3] epoch 2 AUC 0.971820 | train_loss 0.0225 | time 3.1s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 3] epoch 3 AUC 0.999739 | train_loss 0.0230 | time 3.0s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 3] epoch 4 AUC 0.999698 | train_loss 0.0132 | time 3.1s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 3] epoch 5 AUC 0.999968 | train_loss 0.0119 | time 3.3s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 3] epoch 6 AUC 0.999927 | train_loss 0.0072 | time 3.3s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 3] epoch 7 AUC 0.999976 | train_loss 0.0064 | time 3.0s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 3] epoch 8 AUC 0.999965 | train_loss 0.0032 | time 3.1s
[fold 3] early stop at epoch 8 | best_auc 0.999968
[fold 3] done | best_auc 0.999968 | elapsed 27.8s



===== Fold 4 =====


  scaler = torch.cuda.amp.GradScaler(enabled=(device.type=='cuda'))


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 4] epoch 0 AUC 0.996672 | train_loss 0.3105 | time 3.0s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 4] epoch 1 AUC 0.999022 | train_loss 0.0325 | time 2.9s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 4] epoch 2 AUC 0.999550 | train_loss 0.0259 | time 3.0s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 4] epoch 3 AUC 0.999630 | train_loss 0.0219 | time 3.2s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 4] epoch 4 AUC 0.999952 | train_loss 0.0174 | time 3.0s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 4] epoch 5 AUC 0.999769 | train_loss 0.0098 | time 3.1s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 4] epoch 6 AUC 0.999734 | train_loss 0.0077 | time 3.1s


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


[fold 4] epoch 7 AUC 0.999958 | train_loss 0.0079 | time 3.0s
[fold 4] early stop at epoch 7 | best_auc 0.999952
[fold 4] done | best_auc 0.999952 | elapsed 24.3s


[CV] OOF AUC 0.999937 | per-fold [0.9999329942799671, 0.9999794338879108, 0.9999814241568226, 0.9999675227837614, 0.9999522783761391]


  sd = torch.load(f'resnet18_fold{fold}.pt', map_location=device)


[infer fold 0] processed 640/3325


[infer fold 0] processed 1280/3325


[infer fold 0] processed 1920/3325


[infer fold 0] processed 2560/3325


[infer fold 0] processed 3200/3325


[infer fold 0] done in 9.3s


[infer fold 1] processed 640/3325


[infer fold 1] processed 1280/3325


[infer fold 1] processed 1920/3325


[infer fold 1] processed 2560/3325


[infer fold 1] processed 3200/3325


[infer fold 1] done in 8.9s


[infer fold 2] processed 640/3325


[infer fold 2] processed 1280/3325


[infer fold 2] processed 1920/3325


[infer fold 2] processed 2560/3325


[infer fold 2] processed 3200/3325


[infer fold 2] done in 8.7s


[infer fold 3] processed 640/3325


[infer fold 3] processed 1280/3325


[infer fold 3] processed 1920/3325


[infer fold 3] processed 2560/3325


[infer fold 3] processed 3200/3325


[infer fold 3] done in 8.2s


[infer fold 4] processed 640/3325


[infer fold 4] processed 1280/3325


[infer fold 4] processed 1920/3325


[infer fold 4] processed 2560/3325


[infer fold 4] processed 3200/3325


[infer fold 4] done in 8.2s


NameError: name 'preds_test_cnn' is not defined

In [22]:
# Inference-only: load saved ResNet18 fold weights, run 8-way TTA, write submission.csv
import numpy as np, pandas as pd, torch, timm, time
from pathlib import Path
from PIL import Image

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
IMG_SIZE = 64
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)
test_dir = Path('test')
test_ids = sorted([p.name for p in test_dir.glob('*.jpg')])

def load_image_uint8(path: Path):
    with Image.open(path) as im:
        im = im.convert('RGB')
        arr = np.asarray(im, dtype=np.uint8)
    return arr

def resize_norm_to_tensor(arr_uint8: np.ndarray) -> torch.Tensor:
    arr = np.array(Image.fromarray(arr_uint8).resize((IMG_SIZE, IMG_SIZE), resample=Image.BILINEAR), dtype=np.float32) / 255.0
    arr = (arr - IMAGENET_MEAN) / IMAGENET_STD
    arr = np.transpose(arr, (2,0,1))
    return torch.from_numpy(arr)

def dihedral8(arr: np.ndarray):
    outs = []
    for k in range(4):
        r = np.rot90(arr, k); outs.append(r); outs.append(np.fliplr(r))
    return outs

def build_model():
    return timm.create_model('resnet18', pretrained=False, num_classes=1, in_chans=3, global_pool='avg')

def predict_test_fold(fold):
    model = build_model().to(device)
    sd = torch.load(f'resnet18_fold{fold}.pt', map_location=device)
    model.load_state_dict(sd); model.eval()
    preds = np.zeros(len(test_ids), dtype=np.float32)
    t0 = time.time()
    with torch.no_grad():
        for i in range(0, len(test_ids), 64):
            batch_ids = test_ids[i:i+64]
            tensors = []
            for img_id in batch_ids:
                arr = load_image_uint8(test_dir / img_id)
                for v in dihedral8(arr):
                    tensors.append(resize_norm_to_tensor(v))
            xb = torch.stack(tensors).to(device, non_blocking=True)
            logits = model(xb).view(len(batch_ids), 8, 1).squeeze(2)
            m = torch.sigmoid(logits).mean(dim=1).detach().cpu().numpy()
            preds[i:i+len(batch_ids)] = m
    return preds

print('[infer] Starting TTA inference over 5 folds...')
fold_preds = []
for f in range(5):
    print(f'[infer] fold {f} ...', flush=True)
    fold_preds.append(predict_test_fold(f))
pred_test_cnn = np.mean(np.stack(fold_preds, axis=0), axis=0).astype(np.float32)

sample = pd.read_csv('sample_submission.csv')
id_to_idx = {img_id:i for i, img_id in enumerate(test_ids)}
out = np.zeros(len(sample), dtype=np.float32)
for i, img_id in enumerate(sample['id'].values):
    out[i] = pred_test_cnn[id_to_idx[img_id]]
sample['has_cactus'] = out
sample.to_csv('submission.csv', index=False)
print('[infer] Saved submission.csv | stats:', float(out.min()), float(out.max()), float(out.mean()))

[infer] Starting TTA inference over 5 folds...
[infer] fold 0 ...


  sd = torch.load(f'resnet18_fold{fold}.pt', map_location=device)


[infer] fold 1 ...


[infer] fold 2 ...


[infer] fold 3 ...


[infer] fold 4 ...


[infer] Saved submission.csv | stats: 1.29140653370996e-05 0.9999843835830688 0.7549147009849548


In [23]:
# kNN booster on ResNet18 embeddings: override high-similarity cases, then write submission.csv
import numpy as np, pandas as pd, torch, timm, time
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset, DataLoader
import faiss

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
IMG_SIZE = 64
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)
train_df = pd.read_csv('train.csv')
train_dir = Path('train'); test_dir = Path('test')
test_ids = sorted([p.name for p in test_dir.glob('*.jpg')])

def load_image_uint8(path: Path):
    with Image.open(path) as im:
        im = im.convert('RGB')
        arr = np.asarray(im, dtype=np.uint8)
    return arr

def resize_norm_to_tensor(arr_uint8: np.ndarray) -> torch.Tensor:
    arr = np.array(Image.fromarray(arr_uint8).resize((IMG_SIZE, IMG_SIZE), resample=Image.BILINEAR), dtype=np.float32) / 255.0
    arr = (arr - IMAGENET_MEAN) / IMAGENET_STD
    arr = np.transpose(arr, (2,0,1))
    return torch.from_numpy(arr)

class ImgDS(Dataset):
    def __init__(self, ids, folder: Path):
        self.ids = ids; self.folder = folder
    def __len__(self): return len(self.ids)
    def __getitem__(self, idx):
        img_id = self.ids[idx]
        arr = load_image_uint8(self.folder / img_id)
        x = resize_norm_to_tensor(arr).float()
        return x, img_id

def build_model_backbone():
    # ResNet18 from timm; we'll use forward_features + global_pool to get embeddings
    model = timm.create_model('resnet18', pretrained=False, num_classes=1, in_chans=3, global_pool='avg')
    return model

def extract_embeddings(ids, folder: Path, batch_size=256):
    ds = ImgDS(ids, folder)
    dl = DataLoader(ds, batch_size=batch_size, shuffle=False, num_workers=4, pin_memory=True)
    embs_accum = None
    for fold in range(5):
        model = build_model_backbone().to(device)
        sd = torch.load(f'resnet18_fold{fold}.pt', map_location=device)
        model.load_state_dict(sd)
        model.eval()
        feats = []
        with torch.no_grad():
            for xb, _ in dl:
                xb = xb.to(device, non_blocking=True)
                # forward_features -> [N, C, H, W]; then global_pool -> [N, C]
                ff = model.forward_features(xb)
                pooled = model.global_pool(ff)  # [N, C]
                feats.append(pooled.detach().cpu().numpy())
        feats = np.concatenate(feats, axis=0)
        if embs_accum is None:
            embs_accum = feats.astype(np.float32)
        else:
            embs_accum += feats.astype(np.float32)
        del model; torch.cuda.empty_cache()
    embs = (embs_accum / 5.0).astype(np.float32)
    # L2 normalize for cosine similarity via inner product
    norms = np.linalg.norm(embs, axis=1, keepdims=True) + 1e-8
    embs = embs / norms
    return embs

t0 = time.time()
print('[knn] Extracting train embeddings...')
train_ids = train_df['id'].values
train_embs = extract_embeddings(train_ids, train_dir, batch_size=256)
y_train = train_df['has_cactus'].values.astype(np.int32)
print('[knn] Train embeddings:', train_embs.shape)

print('[knn] Building FAISS index (cosine via inner product)...')
d = train_embs.shape[1]
index = faiss.IndexFlatIP(d)
index.add(train_embs)
print('[knn] Index size:', index.ntotal)

print('[knn] Extracting test embeddings...')
test_embs = extract_embeddings(test_ids, test_dir, batch_size=256)
print('[knn] Test embeddings:', test_embs.shape)

print('[knn] Searching nearest neighbor for all test samples...')
D, I = index.search(test_embs, 1)  # top-1
sims = D.flatten().astype(np.float32)
nns = I.flatten().astype(np.int32)
nn_labels = y_train[nns]

# Load current CNN predictions from submission as base
sample = pd.read_csv('sample_submission.csv')
base = pd.read_csv('submission.csv') if Path('submission.csv').exists() else None
if base is None or base.shape[0] != sample.shape[0]:
    # fallback: uniform 0.5 if missing
    base_probs = np.full(len(sample), 0.5, dtype=np.float32)
else:
    # assume same order as sample per our earlier pipeline
    base_probs = base['has_cactus'].values.astype(np.float32)

# Map test_ids to index in our arrays
id_to_pos = {img_id:i for i, img_id in enumerate(test_ids)}
out = np.zeros(len(sample), dtype=np.float32)
overrides = 0
THRESH = 0.99  # high-confidence cosine similarity threshold
for i, img_id in enumerate(sample['id'].values):
    j = id_to_pos[img_id]
    if sims[j] >= THRESH:
        out[i] = 1.0 if nn_labels[j] == 1 else 0.0
        overrides += 1
    else:
        out[i] = base_probs[i]

sample['has_cactus'] = out
sample.to_csv('submission.csv', index=False)
print(f"[knn] Saved submission.csv | overrides {overrides}/{len(sample)} | thresh {THRESH} | time {time.time()-t0:.1f}s")

[knn] Extracting train embeddings...


  sd = torch.load(f'resnet18_fold{fold}.pt', map_location=device)


[knn] Train embeddings: (14175, 512)
[knn] Building FAISS index (cosine via inner product)...
[knn] Index size: 14175
[knn] Extracting test embeddings...


[knn] Test embeddings: (3325, 512)
[knn] Searching nearest neighbor for all test samples...


[knn] Saved submission.csv | overrides 357/3325 | thresh 0.99 | time 19.9s


In [None]:
# EfficientNet-B0 64x64 5-fold + ensemble with ResNet18 + kNN booster; writes final submission.csv
import time, copy, gc, random
import numpy as np, pandas as pd, torch, timm
from pathlib import Path
from PIL import Image
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import faiss

SEED = 42
IMG_SIZE = 64
NFOLDS = 5
EPOCHS = 10
BATCH_SIZE = 384
MAX_LR = 2e-3
WEIGHT_DECAY = 1e-4
PATIENCE = 3
NUM_WORKERS = 4

random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED); torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.benchmark = True
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_df = pd.read_csv('train.csv')
test_ids = sorted([p.name for p in Path('test').glob('*.jpg')])
train_dir = Path('train'); test_dir = Path('test')
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def load_image_uint8(path: Path):
    with Image.open(path) as im:
        im = im.convert('RGB')
        arr = np.asarray(im, dtype=np.uint8)
    return arr

def resize_norm_to_tensor(arr_uint8: np.ndarray) -> torch.Tensor:
    arr = np.array(Image.fromarray(arr_uint8).resize((IMG_SIZE, IMG_SIZE), resample=Image.BILINEAR), dtype=np.float32) / 255.0
    arr = (arr - IMAGENET_MEAN) / IMAGENET_STD
    arr = np.transpose(arr, (2,0,1))
    return torch.from_numpy(arr)

def dihedral8(arr: np.ndarray):
    outs = []
    for k in range(4):
        r = np.rot90(arr, k); outs.append(r); outs.append(np.fliplr(r))
    return outs

class CactusDS(Dataset):
    def __init__(self, ids, labels=None, train=True):
        self.ids = ids; self.labels = labels; self.train = train
    def __len__(self): return len(self.ids)
    def __getitem__(self, idx):
        img_id = self.ids[idx]
        folder = train_dir if self.labels is not None else test_dir
        arr = load_image_uint8(folder / img_id)
        # light aug: random dihedral
        if self.train:
            if random.random() < 0.5: arr = np.fliplr(arr)
            if random.random() < 0.5: arr = np.flipud(arr)
            k = random.randint(0,3);
            if k: arr = np.rot90(arr, k)
        x = resize_norm_to_tensor(arr).float()
        if self.labels is not None:
            return x, torch.tensor(float(self.labels[idx]), dtype=torch.float32)
        return x, img_id

def build_resnet18():
    return timm.create_model('resnet18', pretrained=False, num_classes=1, in_chans=3, global_pool='avg')

def build_effb0():
    return timm.create_model('efficientnet_b0', pretrained=True, num_classes=1, in_chans=3, global_pool='avg')

def train_effb0_one_fold(fold, trn_idx, val_idx):
    x_tr = train_df['id'].values[trn_idx]; y_tr = train_df['has_cactus'].values[trn_idx]
    x_va = train_df['id'].values[val_idx]; y_va = train_df['has_cactus'].values[val_idx]
    dl_tr = DataLoader(CactusDS(x_tr, y_tr, train=True), batch_size=BATCH_SIZE, shuffle=True, num_workers=NUM_WORKERS, pin_memory=True, drop_last=True)
    dl_va = DataLoader(CactusDS(x_va, y_va, train=False), batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS, pin_memory=True)
    model = build_effb0().to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=MAX_LR, weight_decay=WEIGHT_DECAY)
    steps_per_epoch = max(1, len(dl_tr))
    sch = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=MAX_LR, epochs=EPOCHS, steps_per_epoch=steps_per_epoch, pct_start=0.3, div_factor=10.0, final_div_factor=10.0)
    scaler = torch.cuda.amp.GradScaler(enabled=(device.type=='cuda'))
    loss_fn = nn.BCEWithLogitsLoss()
    best_auc, best_state, no_imp = -1.0, None, 0
    for epoch in range(EPOCHS):
        model.train(); total=0.0; n=0; t0=time.time()
        for it,(xb,yb) in enumerate(dl_tr):
            xb=xb.to(device, non_blocking=True); yb=yb.to(device, non_blocking=True).view(-1,1)
            opt.zero_grad(set_to_none=True)
            with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):
                logits = model(xb); loss = loss_fn(logits, yb)
            scaler.scale(loss).backward(); scaler.step(opt); scaler.update(); sch.step()
            total += loss.item()*xb.size(0); n += xb.size(0)
        # val
        model.eval(); preds=[]; gts=[]
        with torch.no_grad():
            for xb,yb in dl_va:
                xb=xb.to(device, non_blocking=True); logits=model(xb)
                preds.append(torch.sigmoid(logits).squeeze(1).detach().cpu().numpy()); gts.append(yb.numpy())
        preds=np.concatenate(preds); gts=np.concatenate(gts); auc=roc_auc_score(gts,preds)
        print(f'[eff fold {fold}] epoch {epoch} AUC {auc:.6f} loss {total/max(1,n):.4f} time {time.time()-t0:.1f}s')
        if auc>best_auc+1e-5: best_auc=auc; best_state=copy.deepcopy(model.state_dict()); no_imp=0
        else:
            no_imp+=1
            if no_imp>=PATIENCE: print(f'[eff fold {fold}] early stop at epoch {epoch} | best_auc {best_auc:.6f}'); break
    model.load_state_dict(best_state); torch.save(model.state_dict(), f'effb0_fold{fold}.pt')
    # OOF for record
    model.eval(); oof=np.zeros(len(val_idx), dtype=np.float32)
    with torch.no_grad():
        k=0
        for xb,yb in DataLoader(CactusDS(x_va, y_va, train=False), batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS, pin_memory=True):
            xb=xb.to(device, non_blocking=True); p=torch.sigmoid(model(xb)).squeeze(1).detach().cpu().numpy();
            oof[k:k+len(p)]=p; k+=len(p)
    return best_auc, oof

def predict_tta(model_builder, weight_tpl):
    preds = np.zeros(len(test_ids), dtype=np.float32)
    with torch.no_grad():
        for fold in range(NFOLDS):
            model = model_builder().to(device); sd = torch.load(weight_tpl.format(fold), map_location=device); model.load_state_dict(sd); model.eval()
            fold_preds = np.zeros(len(test_ids), dtype=np.float32)
            for i in range(0, len(test_ids), 64):
                batch_ids = test_ids[i:i+64]
                tensors = []
                for img_id in batch_ids:
                    arr = load_image_uint8(test_dir / img_id)
                    for v in dihedral8(arr):
                        tensors.append(resize_norm_to_tensor(v))
                xb = torch.stack(tensors).to(device, non_blocking=True)
                logits = model(xb).view(len(batch_ids), 8, 1).squeeze(2)
                m = torch.sigmoid(logits).mean(dim=1).detach().cpu().numpy()
                fold_preds[i:i+len(batch_ids)] = m
            preds += fold_preds / NFOLDS
            del model; torch.cuda.empty_cache()
    return preds.astype(np.float32)

# Train EfficientNet-B0 5-fold
skf = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)
oof_eff = np.zeros(len(train_df), dtype=np.float32); aucs=[]
for fold,(trn_idx,val_idx) in enumerate(skf.split(train_df['id'].values, train_df['has_cactus'].values)):
    print(f'===== EffNet Fold {fold} =====')
    auc, oof_f = train_effb0_one_fold(fold, trn_idx, val_idx)
    oof_eff[val_idx]=oof_f; aucs.append(auc); gc.collect(); torch.cuda.empty_cache()
print(f'[EffNet CV] OOF {roc_auc_score(train_df.has_cactus.values, oof_eff):.6f} | per-fold {aucs}')

# Inference: EfficientNet-B0 and ResNet18
print('[infer] EfficientNet-B0 TTA...')
pred_eff = predict_tta(lambda: timm.create_model('efficientnet_b0', pretrained=False, num_classes=1, in_chans=3, global_pool='avg'), 'effb0_fold{}.pt')
print('[infer] ResNet18 TTA...')
pred_res = predict_tta(build_resnet18, 'resnet18_fold{}.pt')

# Ensemble average
pred_ens = ((pred_eff + pred_res) * 0.5).astype(np.float32)
sample = pd.read_csv('sample_submission.csv')
id_to_idx = {img_id:i for i, img_id in enumerate(test_ids)}
base_probs = np.zeros(len(sample), dtype=np.float32)
for i, img_id in enumerate(sample['id'].values):
    base_probs[i] = pred_ens[id_to_idx[img_id]]

# kNN booster on ResNet18 embeddings (averaged across folds)
def extract_resnet_embs(ids, folder: Path, batch_size=256):
    class ImgDS(Dataset):
        def __init__(self, ids, folder): self.ids=ids; self.folder=folder
        def __len__(self): return len(self.ids)
        def __getitem__(self, idx):
            img = load_image_uint8(self.folder/self.ids[idx]); return resize_norm_to_tensor(img).float(), 0
    dl = DataLoader(ImgDS(ids, folder), batch_size=batch_size, shuffle=False, num_workers=4, pin_memory=True)
    embs_accum=None
    for fold in range(NFOLDS):
        m = build_resnet18().to(device); sd = torch.load(f'resnet18_fold{fold}.pt', map_location=device); m.load_state_dict(sd); m.eval()
        feats=[]
        with torch.no_grad():
            for xb,_ in dl:
                xb=xb.to(device, non_blocking=True); ff = m.forward_features(xb); pooled = m.global_pool(ff); feats.append(pooled.detach().cpu().numpy())
        feats=np.concatenate(feats, axis=0).astype(np.float32)
        embs_accum = feats if embs_accum is None else embs_accum+feats
        del m; torch.cuda.empty_cache()
    embs = (embs_accum / NFOLDS).astype(np.float32)
    norms = np.linalg.norm(embs, axis=1, keepdims=True) + 1e-8
    return embs / norms

print('[knn] Extract embeddings for kNN booster...')
train_ids = train_df['id'].values
train_embs = extract_resnet_embs(train_ids, train_dir, batch_size=256)
test_embs = extract_resnet_embs(test_ids, test_dir, batch_size=256)
index = faiss.IndexFlatIP(train_embs.shape[1]); index.add(train_embs)
D,I = index.search(test_embs, 1)
sims = D.flatten().astype(np.float32); nns = I.flatten().astype(np.int32); y_train = train_df['has_cactus'].values.astype(np.int32)

out = base_probs.copy(); overrides=0; THRESH=0.99
pos_map = {img_id:i for i,img_id in enumerate(test_ids)}
for i, img_id in enumerate(sample['id'].values):
    j = pos_map[img_id]
    if sims[j] >= THRESH:
        out[i] = 1.0 if y_train[nns[j]]==1 else 0.0; overrides += 1
sample['has_cactus'] = out
sample.to_csv('submission.csv', index=False)
print(f'[final] Saved submission.csv | ensemble+knn overrides {overrides}/{len(sample)} | min {float(out.min())} max {float(out.max())} mean {float(out.mean())}')

===== EffNet Fold 0 =====


  scaler = torch.cuda.amp.GradScaler(enabled=(device.type=='cuda'))


  with torch.cuda.amp.autocast(enabled=(device.type=='cuda')):


