# SIIM-ISIC Melanoma Classification – Medal-Driven Plan & Experiment Log (Updated per Audit)

## Objective
- Build a high-AUC melanoma classifier and iterate to medal thresholds (Bronze ≥0.937, Silver ≥0.940, Gold ≥0.945).

## Data Locations & Formats
- CSVs: train.csv, test.csv in CWD.
- Images (primary): jpeg/train/*.jpg, jpeg/test/*.jpg (use these for modeling).
- DICOM: train/*.dcm, test/*.dcm (optional; ignore initially).
- TFRecords: tfrecords/ (for TF data pipeline experiments).

## Reproducibility & Tracking
- Global seeding: seed=42 across random, numpy, TensorFlow; enable TF deterministic ops when possible.
- Mixed precision on GPU (TF): tf.keras.mixed_precision.set_global_policy('mixed_float16').
- Persist artifacts per experiment:
  - folds.csv (image_name, patient_id, fold).
  - oof.csv (image_name, fold, target, pred_modelX).
  - config_*.json (params, seed, features, aug).
  - model weights (*.h5) and training logs.
  - CV report (mean±std AUC) in results.md.
- Notebook backups: save agent_notebook_BACKUP_YYYYMMDD_HHMM.ipynb before major edits.

## Leakage Prevention & Duplicates
- CV: 5-fold StratifiedGroupKFold (groups=patient_id, stratify=target, seed=42, shuffle=True).
- Duplicate control: compute perceptual hash (phash) over 224px thumbnails; group identical/near-duplicates (Hamming distance ≤3) and force them into the same fold via group mapping.
- Any patient-level statistics computed strictly on fold-train only (no leakage).

## Strategy (Incremental to Gold)
1) Baseline sanity checks:
   - Load train/test CSVs; verify columns, image path joins, target distribution, and missingness.
   - Build and save folds.csv (with duplicate grouping).

2) Metadata-only baseline (explicit plan):
   - Features:
     - Categorical: sex, anatom_site_general_challenge, image_type (if present), dataset_source (if present).
     - Numeric: age_approx, image_width, image_height, aspect_ratio, file_size_bytes.
     - Patient context (within fold-train only): per-patient count of images, melanoma prevalence among patient images (target-encoded safely), patient age mean/std.
   - Encoding:
     - One-hot for low-cardinality categoricals (sex, anatom_site).
     - Target encoding for any high-cardinality fields strictly within fold-train.
     - Impute: age_approx median by sex/site; others with simple strategies; add missing flags.
   - Model: LightGBM (or CatBoost as alt). Handle imbalance via scale_pos_weight ≈ (N_neg/N_pos) and/or AUC-optimized hyperparams.
   - CV: StratifiedGroupKFold(5). Save OOF predictions and CV AUC.

3) Image baseline (EfficientNetB0 @ 320–384px):
   - Pipeline: tf.data from JPEGs (or TFRecords), resize, center crop, normalize; aug: flips, random resized crop, mild color jitter.
   - Loss: BCEWithLogits (from tf.keras with label_smoothing=0–0.05). Class weighting or focal loss γ=2 as variant.
   - Optimizer: AdamW; schedule: warmup → cosine decay; gradient clipping; EMA of weights.
   - Training: freeze backbone 1–2 epochs, then fine-tune 8–12 epochs; early stopping by AUC on val fold.
   - CV: patient-grouped; save fold weights, OOF preds.

4) Strong image modeling:
   - Backbones: EfficientNetB3/B4 @ 448–512px; stronger augmentation (brightness/contrast, hue/sat, ±15° rotation, CoarseDropout).
   - TTA: 5–8 views averaging at inference.
   - Maintain mixed precision and EMA. Monitor OOF AUC. Target ≥0.935.

5) Ensembling:
   - Blend metadata OOF with image OOF (weighted average; image weight 0.8–0.9).
   - Optional stacking: logistic regression on OOF logits, with patient-level features as additional inputs.

## Advanced Enhancements (for Silver/Gold)
- External pretraining: pretrain on ISIC 2017–2019 dermoscopy images, then fine-tune on challenge data.
- Architectural diversity: add a second family (e.g., ConvNeXt or ViT) for lower-correlated ensemble; weight by fold AUC.
- Pseudo-labeling: use high-confidence test predictions to augment training, re-train with patient-grouped CV.
- Patient-level post-processing: within-patient ranking of logits; add within-patient z-score/logit rank to blender (computed non-leaky from fold-train stats).
- Optional preprocessing: hair removal and mild color normalization; evaluate impact via CV only.

## Experiment Log & Targets
- EXP0: Data load + EDA + CV + duplicates; produce folds.csv. (Audit checkpoint)
- EXP1: Metadata LGBM; OOF AUC target ≥0.80; save oof.csv, config. (Audit)
- EXP2: EFF-B0 @ 352px; OOF AUC target ≥0.90. (Audit)
- EXP3: B3/B4 @ 448–512 + EMA + TTA; OOF AUC ≥0.935 → First submission. (Audit)
- EXP4: Blend metadata + image; OOF AUC ≥0.94. (Audit)
- EXP5: External pretrain + second backbone + multi-seed; aim OOF ≥0.945. (Audit)

## Checkpoints
- Plan (this cell).
- After Data Loading & EDA + folds.csv.
- After each major modeling milestone (EXP1–EXP5).

## Housekeeping
- Delete stale cells after documenting results.
- Ensure submission.csv exists before exit (columns: image_name, target).

In [97]:
# GLOBAL SUBMISSION WRITE GUARD
import os
from pathlib import Path
import pandas as pd
import contextlib

if not getattr(pd, '_SUBMISSION_GUARD_INSTALLED', False):
    _orig_to_csv = pd.DataFrame.to_csv
    def _guarded_to_csv(self, path_or_buf=None, *args, **kwargs):
        name = None
        try:
            p = path_or_buf
            if isinstance(p, (str, Path)):
                name = Path(p).name
            else:
                name = getattr(p, 'name', None)
                if name is not None:
                    name = Path(name).name
            if name == 'submission.csv':
                if os.environ.get('ALLOW_SUBMISSION_WRITE', '0') != '1':
                    raise SystemExit("GLOBAL SUBMISSION GUARD: blocked write to submission.csv. Set ALLOW_SUBMISSION_WRITE=1 only inside hardened, gated pipelines (Cells 40/45/46/47/48).")
        except SystemExit:
            raise
        except Exception:
            # Be conservative: if filename not determinable, allow write (non-submission) to proceed.
            pass
        return _orig_to_csv(self, path_or_buf, *args, **kwargs)
    pd.DataFrame.to_csv = _guarded_to_csv
    pd._SUBMISSION_GUARD_INSTALLED = True
    print('GLOBAL SUBMISSION WRITE GUARD installed. Default: block writes to submission.csv unless ALLOW_SUBMISSION_WRITE=1.')

@contextlib.contextmanager
def allow_submission_writes():
    prev = os.environ.get('ALLOW_SUBMISSION_WRITE')
    os.environ['ALLOW_SUBMISSION_WRITE'] = '1'
    try:
        yield
    finally:
        if prev is None:
            os.environ.pop('ALLOW_SUBMISSION_WRITE', None)
        else:
            os.environ['ALLOW_SUBMISSION_WRITE'] = prev

GLOBAL SUBMISSION WRITE GUARD installed. Default: block writes to submission.csv unless ALLOW_SUBMISSION_WRITE=1.


In [2]:
# EXP0: Data loading, EDA, and fold construction (skip heavy duplicate hashing for speed)
import os, sys, json, random, math, gc, time, hashlib, itertools, warnings, subprocess
from pathlib import Path
import numpy as np
import pandas as pd
from tqdm import tqdm
warnings.filterwarnings('ignore')

SEED = 42
random.seed(SEED); np.random.seed(SEED)

DATA_DIR = Path('.')
TRAIN_CSV = DATA_DIR / 'train.csv'
TEST_CSV  = DATA_DIR / 'test.csv'
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
JPEG_TEST  = DATA_DIR / 'jpeg' / 'test'
FOLDS_CSV  = DATA_DIR / 'folds.csv'

print('Files exist:', TRAIN_CSV.exists(), TEST_CSV.exists(), JPEG_TRAIN.exists(), JPEG_TEST.exists())
train = pd.read_csv(TRAIN_CSV)
test  = pd.read_csv(TEST_CSV)
print('train shape:', train.shape, 'test shape:', test.shape)
print('train columns:', list(train.columns))
print('test columns:', list(test.columns))

# Attach JPEG paths
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train['exists'] = train['image_path'].apply(os.path.exists)
test['exists']  = test['image_path'].apply(os.path.exists)
print('Missing train images:', (~train['exists']).sum(), 'Missing test images:', (~test['exists']).sum())

# Basic EDA
pos_rate = train['target'].mean() if 'target' in train.columns else np.nan
print('Target positive rate:', pos_rate)
if 'patient_id' in train.columns:
    n_patients = train['patient_id'].nunique()
    print('Unique patients:', n_patients)
    print('Images per patient (head):')
    print(train['patient_id'].value_counts().head())

# Lightweight file size metadata
def safe_stat(p):
    try:
        return os.stat(p).st_size
    except Exception:
        return np.nan
train['file_size_bytes'] = train['image_path'].apply(safe_stat)
test['file_size_bytes']  = test['image_path'].apply(safe_stat)

# Duplicate control (SKIPPED for speed now; revisit later)
SKIP_PHASH = True
if SKIP_PHASH:
    train['phash'] = None
    train['dup_group'] = np.nan
    n_dup_groups = 0
    n_dups = 0
    print('Skipping perceptual hash duplicate detection for speed.')
else:
    def ensure_pkg(pkg, import_name=None):
        try:
            __import__(import_name or pkg)
            return True
        except ImportError:
            print(f'Installing {pkg} ...')
            subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])
            return True
    ensure_pkg('Pillow', 'PIL'); ensure_pkg('ImageHash', 'imagehash')
    from PIL import Image; import imagehash
    def compute_phash(path, size=224):
        try:
            with Image.open(path) as im:
                im = im.convert('RGB').resize((size, size))
                return str(imagehash.phash(im))
        except Exception:
            return None
    t0 = time.time()
    print('Computing perceptual hashes for duplicates...')
    train['phash'] = [compute_phash(p) for p in tqdm(train['image_path'].tolist())]
    dup_groups = train.groupby('phash')['image_name'].transform('count')
    train['dup_group'] = np.where((train['phash'].notna()) & (dup_groups > 1), train['phash'], np.nan)
    n_dup_groups = train['dup_group'].nunique(dropna=True)
    n_dups = train['dup_group'].notna().sum()
    print(f'Duplicate groups detected: {n_dup_groups}, images in duplicate groups: {n_dups}, time: {time.time()-t0:.1f}s')

# Build group key combining patient and duplicate group to keep dup images together when enabled
if 'patient_id' in train.columns:
    train['group_key'] = train.apply(lambda r: f"{r['patient_id']}|{r['dup_group']}" if pd.notna(r['dup_group']) else str(r['patient_id']), axis=1)
else:
    train['group_key'] = train['dup_group'].fillna(train['image_name'])

# StratifiedGroupKFold (with fallback if not available)
try:
    from sklearn.model_selection import StratifiedGroupKFold
    sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=SEED)
    folds = np.full(len(train), -1, dtype=int)
    for fold, (_, val_idx) in enumerate(sgkf.split(train, y=train['target'], groups=train['group_key'])):
        folds[val_idx] = fold
except Exception as e:
    print('StratifiedGroupKFold not available or failed, using custom balanced GroupKFold fallback. Err:', e)
    dfg = train.groupby('group_key').agg(n=('target','size'), pos=('target','sum')).reset_index()
    dfg = dfg.sort_values(['pos','n'], ascending=[False, False]).reset_index(drop=True)
    pos_per_fold = [0]*5; count_per_fold = [0]*5; group2fold = {}
    for _, row in dfg.iterrows():
        scores = [(pos_per_fold[i] + row['pos'], count_per_fold[i] + row['n'], i) for i in range(5)]
        scores.sort(); i = scores[0][2]
        group2fold[row['group_key']] = i
        pos_per_fold[i] += row['pos']; count_per_fold[i] += row['n']
    folds = train['group_key'].map(group2fold).astype(int).values

train['fold'] = folds
assert (train['fold']>=0).all(), 'Fold assignment failed'

# Persist folds.csv
cols = ['image_name','patient_id','group_key','fold'] if 'patient_id' in train.columns else ['image_name','group_key','fold']
train[cols].to_csv(FOLDS_CSV, index=False)
print('Saved folds to', FOLDS_CSV.resolve())

# Save a minimal results.md entry to start tracking
with open('results.md','a') as f:
    f.write(f"EXP0 | pos_rate={pos_rate:.5f} | patients={train['patient_id'].nunique() if 'patient_id' in train.columns else 'NA'} | dup_imgs={int(n_dups)} | time={time.strftime('%Y-%m-%d %H:%M:%S')}\n")

# Preview
print(train[['image_name','patient_id','target','fold']].head() if 'patient_id' in train.columns else train[['image_name','target','fold']].head())

Files exist: True True True True
train shape: (28984, 8) test shape: (4142, 5)
train columns: ['image_name', 'patient_id', 'sex', 'age_approx', 'anatom_site_general_challenge', 'diagnosis', 'benign_malignant', 'target']
test columns: ['image_name', 'patient_id', 'sex', 'age_approx', 'anatom_site_general_challenge']


Missing train images: 0 Missing test images: 0
Target positive rate: 0.017699420369859234
Unique patients: 2056
Images per patient (head):
patient_id
IP_4382720    104
IP_0656529    102
IP_4938382    101
IP_7279968     98
IP_4479736     97
Name: count, dtype: int64
Skipping perceptual hash duplicate detection for speed.


Saved folds to /app/agent_run_states/siim-isic-melanoma-classification/folds.csv
     image_name  patient_id  target  fold
0  ISIC_2637011  IP_7279968       0     4
1  ISIC_0015719  IP_3075186       0     4
2  ISIC_0068279  IP_6890425       0     3
3  ISIC_0074268  IP_8723313       0     4
4  ISIC_0074311  IP_2950485       0     0


In [3]:
# EXP0-FIX: Regenerate leak-proof folds with exact-duplicate MD5 grouping + patient union-find components
import os, sys, time, hashlib, multiprocessing as mp
import numpy as np
import pandas as pd
from pathlib import Path
from collections import defaultdict
from sklearn.model_selection import StratifiedGroupKFold

SEED = 42
DATA_DIR = Path('.')
TRAIN_CSV = DATA_DIR / 'train.csv'
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
FOLDS_CSV  = DATA_DIR / 'folds.csv'

train = pd.read_csv(TRAIN_CSV)
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))

# MD5 helper
def md5_file(path, chunk=1024*1024):
    try:
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for b in iter(lambda: f.read(chunk), b''):
                h.update(b)
        return h.hexdigest()
    except Exception:
        return None

# Compute MD5 only for files that share the same size (speed-up), else mark unique
def file_size(p):
    try: return os.stat(p).st_size
    except: return -1
train['file_size_bytes'] = train['image_path'].map(file_size)
sizes = train['file_size_bytes'].values
dup_size_mask = train['file_size_bytes'].isin(train['file_size_bytes'].value_counts()[lambda s: s>1].index)
paths = train['image_path'].tolist()
t0 = time.time()
print('Computing MD5 for candidate duplicate sizes...')
md5_list = [None]*len(train)
idxs = np.where(dup_size_mask.values)[0]
with mp.Pool(processes=min(20, mp.cpu_count())) as pool:
    for i, h in zip(idxs, pool.imap(md5_file, [paths[i] for i in idxs], chunksize=64)):
        md5_list[i] = h
train['md5'] = md5_list
n_md5_groups = train['md5'].nunique(dropna=True)
n_md5_dups = train['md5'].value_counts().loc[lambda s: s>1].sum() if 'md5' in train and train['md5'].notna().any() else 0
print(f'MD5 groups: {n_md5_groups}, images in md5-dup groups: {int(n_md5_dups)}; time: {time.time()-t0:.1f}s')

# Union-Find
class DSU:
    def __init__(self, n):
        self.p = list(range(n))
        self.r = [0]*n
    def find(self, x):
        while self.p[x]!=x:
            self.p[x]=self.p[self.p[x]]
            x=self.p[x]
        return x
    def union(self, a,b):
        ra, rb = self.find(a), self.find(b)
        if ra==rb: return
        if self.r[ra]<self.r[rb]: ra, rb = rb, ra
        self.p[rb]=ra
        if self.r[ra]==self.r[rb]: self.r[ra]+=1

N = len(train)
dsu = DSU(N)

# 1) Union all images of the same patient
print('Union by patient_id ...')
for _, idxs in train.groupby('patient_id').groups.items():
    idxs = list(idxs)
    if len(idxs)>1:
        base = idxs[0]
        for j in idxs[1:]: dsu.union(base, j)

# 2) Union all exact duplicate images by MD5
print('Union by exact MD5 duplicates ...')
for md5, idxs in train[train['md5'].notna()].groupby('md5').groups.items():
    idxs = list(idxs)
    if len(idxs)>1:
        base = idxs[0]
        for j in idxs[1:]: dsu.union(base, j)

# Component IDs
comp = [dsu.find(i) for i in range(N)]
train['component'] = pd.Series(comp).astype(int)
train['component'] = train['component'].map({v:i for i,v in enumerate(pd.unique(train['component']))})

# Build stratified group folds on component
print('Building StratifiedGroupKFold on component ...')
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=SEED)
folds = np.full(N, -1, dtype=int)
for f, (_, val_idx) in enumerate(sgkf.split(train, y=train['target'], groups=train['component'])):
    folds[val_idx] = f
train['fold'] = folds

# Validations
print('Validating folds integrity ...')
# No patient across folds
pat_folds = train.groupby('patient_id')['fold'].nunique()
assert pat_folds.max()==1, f'Patient spans multiple folds: {pat_folds[pat_folds>1].index[:5].tolist()}'
# No MD5 duplicate across folds
if train['md5'].notna().any():
    md5_ok = train[train['md5'].notna()].groupby('md5')['fold'].nunique().max()
    assert md5_ok==1, 'MD5 duplicate spans multiple folds'

# Per-fold stats
stats = []
for f in range(5):
    df = train[train['fold']==f]
    pos = int(df['target'].sum()); n = len(df); rate = pos/max(1,n)
    pats = df['patient_id'].nunique()
    stats.append((f, n, pats, pos, rate))
print('Fold stats: (fold, n, n_patients, n_pos, pos_rate)')
for s in stats: print(s)

# Save folds.csv (overwrite)
cols = ['image_name','patient_id','component','fold']
train[cols].to_csv(FOLDS_CSV, index=False)
print('Rewrote folds to', FOLDS_CSV.resolve())

Computing MD5 for candidate duplicate sizes...


MD5 groups: 1809, images in md5-dup groups: 668; time: 0.5s
Union by patient_id ...
Union by exact MD5 duplicates ...
Building StratifiedGroupKFold on component ...


Validating folds integrity ...
Fold stats: (fold, n, n_patients, n_pos, pos_rate)
(0, 5436, 412, 103, 0.01894775570272259)
(1, 5969, 409, 102, 0.017088289495727927)
(2, 5545, 415, 100, 0.018034265103697024)
(3, 6220, 407, 106, 0.017041800643086816)
(4, 5814, 413, 102, 0.017543859649122806)
Rewrote folds to /app/agent_run_states/siim-isic-melanoma-classification/folds.csv


In [4]:
# EXP0-FIX2: Leak-proof folds with perceptual near-duplicate (pHash<=3) + MD5 + patient components
import os, sys, time, hashlib, multiprocessing as mp, itertools, math, subprocess, warnings
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import StratifiedGroupKFold
warnings.filterwarnings('ignore')

SEED = 42
DATA_DIR = Path('.')
TRAIN_CSV = DATA_DIR / 'train.csv'
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
FOLDS_CSV  = DATA_DIR / 'folds.csv'

train = pd.read_csv(TRAIN_CSV)
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))

# Ensure dependencies
def ensure_pkg(pkg, import_name=None):
    try:
        __import__(import_name or pkg)
        return True
    except ImportError:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])
        return True

ensure_pkg('Pillow','PIL'); ensure_pkg('ImageHash','imagehash')
from PIL import Image
import imagehash

# Helpers
def md5_file(path, chunk=1024*1024):
    try:
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for b in iter(lambda: f.read(chunk), b''):
                h.update(b)
        return h.hexdigest()
    except Exception:
        return None

def file_size(p):
    try: return os.stat(p).st_size
    except: return -1

def compute_phash_hex(path, size=224, hash_size=8):
    try:
        with Image.open(path) as im:
            im = im.convert('RGB').resize((size, size))
            h = imagehash.phash(im, hash_size=hash_size)
            return h.__str__()  # 16 hex chars for 64-bit
    except Exception:
        return None

def hamming_hex(a, b):
    # a,b are 16-hex strings
    if a is None or b is None: return 64
    return bin(int(a,16) ^ int(b,16)).count('1')

class DSU:
    def __init__(self, n):
        self.p = list(range(n)); self.r = [0]*n
    def find(self, x):
        while self.p[x]!=x:
            self.p[x]=self.p[self.p[x]]; x=self.p[x]
        return x
    def union(self, a,b):
        ra, rb = self.find(a), self.find(b)
        if ra==rb: return False
        if self.r[ra]<self.r[rb]: ra, rb = rb, ra
        self.p[rb]=ra
        if self.r[ra]==self.r[rb]: self.r[ra]+=1
        return True

# Compute file sizes and MD5 for candidates
t0 = time.time()
train['file_size_bytes'] = train['image_path'].map(file_size)
dup_size_mask = train['file_size_bytes'].isin(train['file_size_bytes'].value_counts()[lambda s: s>1].index)
paths = train['image_path'].tolist()
idxs = np.where(dup_size_mask.values)[0]
md5_list = [None]*len(train)
print('MD5: computing for candidate duplicate sizes ...')
with mp.Pool(processes=min(20, mp.cpu_count())) as pool:
    for i, h in zip(idxs, pool.imap(md5_file, [paths[i] for i in idxs], chunksize=64)):
        md5_list[i] = h
train['md5'] = md5_list
n_md5_dups = int(train['md5'].value_counts().loc[lambda s: s>1].sum()) if train['md5'].notna().any() else 0
print(f'MD5 done in {time.time()-t0:.1f}s | images in md5-dup groups: {n_md5_dups}')

# Compute perceptual hash (phash) in parallel
t1 = time.time()
print('pHash: computing 64-bit perceptual hash for all images (multi-process) ...')
phash_list = [None]*len(train)
with mp.Pool(processes=min(20, mp.cpu_count())) as pool:
    for i, h in enumerate(pool.imap(compute_phash_hex, paths, chunksize=64)):
        phash_list[i] = h
train['phash'] = phash_list
n_phash_missing = int(pd.isna(train['phash']).sum())
print(f'pHash done in {time.time()-t1:.1f}s | missing: {n_phash_missing}')

# Build DSU components over patient + md5 exact dup + phash near-dup (Hamming<=3) using multi-bucket blocking
N = len(train)
dsu = DSU(N)

# 1) Union by patient
for _, idxs in train.groupby('patient_id').groups.items():
    idxs = list(idxs);
    if len(idxs)>1:
        b = idxs[0]
        for j in idxs[1:]: dsu.union(b,j)

# 2) Union by MD5 exact duplicates
for _, idxs in train[train['md5'].notna()].groupby('md5').groups.items():
    idxs = list(idxs);
    if len(idxs)>1:
        b = idxs[0]
        for j in idxs[1:]: dsu.union(b,j)

# 3) Union by pHash near-duplicates with blocking
# Strategy: bucket by 4 hex segments (each 16 bits), compare within each bucket to find pairs with Hamming<=3
segments = [ (0,4), (4,8), (8,12), (12,16) ]
near_links = 0
images_in_near_clusters = set()
for s0,s1 in segments:
    buckets = {}
    for i, hx in enumerate(train['phash']):
        if hx is None: continue
        key = hx[s0:s1]
        buckets.setdefault(key, []).append(i)
    for key, idxs in buckets.items():
        if len(idxs) < 2: continue
        # pairwise compare within bucket
        for a_i in range(len(idxs)):
            ia = idxs[a_i]; ha = train.at[ia, 'phash']
            for b_i in range(a_i+1, len(idxs)):
                ib = idxs[b_i]; hb = train.at[ib, 'phash']
                # quick early continue: if same, it's exact perceptual match
                if ha == hb:
                    if dsu.union(ia, ib): near_links += 1; images_in_near_clusters.update((ia,ib))
                    continue
                # full hamming check
                if hamming_hex(ha, hb) <= 3:
                    if dsu.union(ia, ib): near_links += 1; images_in_near_clusters.update((ia,ib))
    # free memory per segment
    buckets.clear()

print(f'pHash near-duplicate unions created: {near_links} | images touched: {len(images_in_near_clusters)}')

# Final components
comp = [dsu.find(i) for i in range(N)]
train['component'] = pd.Series(comp).astype(int)
train['component'] = train['component'].map({v:i for i,v in enumerate(pd.unique(train['component']))})

# Build folds on components
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=SEED)
folds = np.full(N, -1, dtype=int)
for f, (_, val_idx) in enumerate(sgkf.split(train, y=train['target'], groups=train['component'])):
    folds[val_idx] = f
train['fold'] = folds

# Validations
print('Validating folds integrity ...')
# No patient across folds
pat_folds = train.groupby('patient_id')['fold'].nunique()
assert pat_folds.max()==1, 'Patient spans multiple folds'
# No MD5 duplicate across folds
if train['md5'].notna().any():
    assert train[train['md5'].notna()].groupby('md5')['fold'].nunique().max()==1, 'MD5 duplicate spans multiple folds'
# No phash near-dup cluster across folds: check that any connected component stays in one fold
component_folds = train.groupby('component')['fold'].nunique()
assert component_folds.max()==1, 'Leakage detected: A component spans multiple folds.'

# Per-fold stats
print('Fold stats: (fold, n, n_patients, n_components, n_pos, pos_rate)')
for f in range(5):
    df = train[train['fold']==f]
    n = len(df); pats = df['patient_id'].nunique(); comps = df['component'].nunique(); pos = int(df['target'].sum()); rate = pos/max(1,n)
    print((f, n, pats, comps, pos, rate))

# Save folds.csv (overwrite)
cols = ['image_name','patient_id','component','fold']
train[cols].to_csv(FOLDS_CSV, index=False)
print('Rewrote folds to', FOLDS_CSV.resolve())

# Log summary
with open('results.md','a') as f:
    f.write(f"EXP0-FIX2 | md5_dup_imgs={n_md5_dups} | phash_links={near_links} | phash_imgs={len(images_in_near_clusters)} | time={time.strftime('%Y-%m-%d %H:%M:%S')}\n")

MD5: computing for candidate duplicate sizes ...


MD5 done in 1.2s | images in md5-dup groups: 668
pHash: computing 64-bit perceptual hash for all images (multi-process) ...


pHash done in 498.7s | missing: 0


pHash near-duplicate unions created: 358 | images touched: 533


Validating folds integrity ...
Fold stats: (fold, n, n_patients, n_components, n_pos, pos_rate)
(0, 308, 34, 34, 13, 0.04220779220779221)
(1, 5098, 430, 417, 117, 0.022950176539819536)
(2, 5053, 424, 416, 95, 0.018800712448050663)
(3, 4649, 419, 415, 112, 0.02409120240912024)
(4, 13876, 749, 416, 176, 0.012683770539060248)
Rewrote folds to /app/agent_run_states/siim-isic-melanoma-classification/folds.csv


In [5]:
# EXP0-FIX3: Balanced, leak-proof folds with refined pHash policy and greedy component binning
import os, time, math
import numpy as np
import pandas as pd
from PIL import Image

SEED = 42
np.random.seed(SEED)

# Reuse 'train' with precomputed phash, md5, file_size_bytes from previous cell if available
if 'train' not in globals():
    from pathlib import Path
    TRAIN_CSV = Path('.') / 'train.csv'
    JPEG_TRAIN = Path('.') / 'jpeg' / 'train'
    train = pd.read_csv(TRAIN_CSV)
    train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
    # Minimal fallbacks if phash/md5 missing (will be slow to recompute; expect present from EXP0-FIX2)
    if 'file_size_bytes' not in train.columns:
        train['file_size_bytes'] = train['image_path'].map(lambda p: os.stat(p).st_size if os.path.exists(p) else -1)

N = len(train)

class DSU:
    def __init__(self, n):
        self.p = list(range(n)); self.r = [0]*n
    def find(self, x):
        while self.p[x]!=x:
            self.p[x]=self.p[self.p[x]]; x=self.p[x]
        return x
    def union(self, a,b):
        ra, rb = self.find(a), self.find(b)
        if ra==rb: return False
        if self.r[ra]<self.r[rb]: ra, rb = rb, ra
        self.p[rb]=ra
        if self.r[ra]==self.r[rb]: self.r[ra]+=1
        return True

def hamming_hex(a, b):
    if pd.isna(a) or pd.isna(b): return 64
    return bin(int(a,16) ^ int(b,16)).count('1')

# Cache for image dimensions (lazy, on-demand)
dim_cache = {}
def get_dims(idx):
    if idx in dim_cache: return dim_cache[idx]
    path = train.at[idx, 'image_path']
    try:
        with Image.open(path) as im:
            dim_cache[idx] = im.size  # (width, height)
            return dim_cache[idx]
    except Exception:
        dim_cache[idx] = (-1, -1)
        return dim_cache[idx]

dsu = DSU(N)

# 1) Union by patient (always)
for _, idxs in train.groupby('patient_id').groups.items():
    idxs = list(idxs)
    if len(idxs)>1:
        base = idxs[0]
        for j in idxs[1:]: dsu.union(base, j)

# 2) Union by exact MD5 duplicates (always)
if 'md5' in train.columns:
    for _, idxs in train[train['md5'].notna()].groupby('md5').groups.items():
        idxs = list(idxs)
        if len(idxs)>1:
            base = idxs[0]
            for j in idxs[1:]: dsu.union(base, j)

# 3) Union near-duplicates (refined policy):
#    - Within-patient: Hamming<=3 (safe, desirable)
#    - Cross-patient: Hamming<=1 AND identical dimensions AND file size within 1KB
near_links_within = 0
near_links_cross = 0

# 3a) Within-patient unions (Hamming<=3) using simple pairwise per patient (patients are small groups)
if 'phash' in train.columns:
    for pid, idxs in train.groupby('patient_id').groups.items():
        idxs = list(idxs)
        if len(idxs) < 2: continue
        # small n, direct pairwise
        for i in range(len(idxs)):
            ia = idxs[i]; ha = train.at[ia, 'phash']
            if pd.isna(ha): continue
            for j in range(i+1, len(idxs)):
                ib = idxs[j]; hb = train.at[ib, 'phash']
                if pd.isna(hb): continue
                if hamming_hex(ha, hb) <= 3:
                    if dsu.union(ia, ib): near_links_within += 1

# 3b) Cross-patient unions (very strict): Hamming<=1 + identical dims + ~equal file size
if 'phash' in train.columns:
    # Bucket by exact phash to quickly union identical
    for _, idxs in train.groupby('phash').groups.items():
        idxs = list(idxs)
        if len(idxs) < 2: continue
        # separate by patient to avoid redundant unions
        for i in range(len(idxs)):
            ia = idxs[i]
            for j in range(i+1, len(idxs)):
                ib = idxs[j]
                if train.at[ia,'patient_id'] == train.at[ib,'patient_id']:
                    continue  # already handled in within-patient
                # identical phash implies hamming=0, check dims and near-equal size
                wa, ha_ = get_dims(ia); wb, hb_ = get_dims(ib)
                if (wa,ha_) == (wb,hb_):
                    if abs(train.at[ia,'file_size_bytes'] - train.at[ib,'file_size_bytes']) <= 1024:
                        if dsu.union(ia, ib): near_links_cross += 1
    # For hamming==1: create buckets by 15/16 hex prefix to reduce comparisons
    # Generate 4 segment buckets (each 16 bits) but enforce hamming<=1 with rapid rejection
    segments = [(0,4),(4,8),(8,12),(12,16)]
    for s0,s1 in segments:
        buckets = {}
        for i, hx in enumerate(train['phash']):
            if pd.isna(hx): continue
            key = hx[s0:s1]
            buckets.setdefault(key, []).append(i)
        for key, idxs in buckets.items():
            if len(idxs) < 2: continue
            for a_i in range(len(idxs)):
                ia = idxs[a_i]; ha = train.at[ia,'phash']
                for b_i in range(a_i+1, len(idxs)):
                    ib = idxs[b_i]; hb = train.at[ib,'phash']
                    if train.at[ia,'patient_id'] == train.at[ib,'patient_id']:
                        continue
                    # fast check: if ha==hb handled above; else check hamming and guards
                    hamm = hamming_hex(ha, hb)
                    if hamm <= 1:
                        wa, ha_ = get_dims(ia); wb, hb_ = get_dims(ib)
                        if (wa,ha_) == (wb,hb_) and abs(train.at[ia,'file_size_bytes'] - train.at[ib,'file_size_bytes']) <= 1024:
                            if dsu.union(ia, ib): near_links_cross += 1
        buckets.clear()

print(f'Near-duplicate unions: within-patient={near_links_within}, cross-patient={near_links_cross}')

# Final components
comp = [dsu.find(i) for i in range(N)]
train['component'] = pd.Series(comp).astype(int)
train['component'] = train['component'].map({v:i for i,v in enumerate(pd.unique(train['component']))})

# Guardrails: component size sanity
comp_sizes = train.groupby('component').size().sort_values(ascending=False)
max_comp = int(comp_sizes.iloc[0])
print('Top component sizes:', comp_sizes.head().tolist())
assert max_comp <= 2000, f'Mega-component detected (size={max_comp})'

# Build balanced folds via greedy bin-packing over components to balance positives and size
dfg = train.groupby('component').agg(n=('target','size'), pos=('target','sum')).reset_index()
dfg = dfg.sort_values(['pos','n'], ascending=[False, False]).reset_index(drop=True)
F = 5
fold_counts = [0]*F
fold_pos = [0]*F
comp2fold = {}
for _, row in dfg.iterrows():
    # choose fold minimizing (pos, count) tuple
    choices = [(fold_pos[i], fold_counts[i], i) for i in range(F)]
    choices.sort()
    i = choices[0][2]
    comp2fold[row['component']] = i
    fold_pos[i] += int(row['pos'])
    fold_counts[i] += int(row['n'])
train['fold'] = train['component'].map(comp2fold).astype(int)

# Integrity checks: no leakage across folds by component/patient
assert train.groupby('component')['fold'].nunique().max()==1, 'Component spans multiple folds'
assert train.groupby('patient_id')['fold'].nunique().max()==1, 'Patient spans multiple folds'

# Post-fold balance assertions
fold_sizes = train['fold'].value_counts().sort_index().tolist()
pos_rates = [train.loc[train['fold']==i, 'target'].mean() for i in range(F)]
size_ratio = max(fold_sizes)/max(1,min(fold_sizes))
rate_diff = max(pos_rates) - min(pos_rates)
print('Fold sizes:', fold_sizes)
print('Pos rates:', pos_rates)
print('Size ratio:', size_ratio, 'Rate diff:', rate_diff)
assert size_ratio <= 1.3, f'Fold size imbalance too high: {size_ratio:.3f}'
assert rate_diff <= 0.005, f'Fold positive-rate imbalance too high: {rate_diff:.4f}'

# Save folds.csv
out_cols = ['image_name','patient_id','component','fold']
train[out_cols].to_csv('folds.csv', index=False)
print('Saved balanced, leak-proof folds to folds.csv')

Near-duplicate unions: within-patient=0, cross-patient=0
Top component sizes: [104, 102, 101, 98, 97]
Fold sizes: [1109, 12829, 1109, 12828, 1109]
Pos rates: [0.09287646528403967, 0.007950736612362615, 0.09287646528403967, 0.007951356407857811, 0.09287646528403967]
Size ratio: 11.568079350766457 Rate diff: 0.08492572867167705


AssertionError: Fold size imbalance too high: 11.568

In [6]:
# EXP0-FIX3b: Re-balance folds using cost-based assignment to match size and positive rate
import numpy as np
import pandas as pd

assert 'train' in globals(), 'train DataFrame not found. Run previous cells first.'
assert 'component' in train.columns, 'component not computed. Run FIX steps first.'

# Prepare component table
dfg = train.groupby('component').agg(n=('target','size'), pos=('target','sum')).reset_index()
global_n = int(dfg['n'].sum())
global_pos = int(dfg['pos'].sum())
global_rate = global_pos / max(1, global_n)
F = 5
ideal_size = global_n / F

# Sort components by difficulty: higher pos first, then larger size
dfg = dfg.sort_values(['pos','n'], ascending=[False, False]).reset_index(drop=True)

# Initialize fold stats
fold_counts = np.zeros(F, dtype=int)
fold_pos = np.zeros(F, dtype=int)
comp2fold = {}

def assignment_cost(n_i, p_i, add_n, add_p, ideal_size, global_rate, lam_size=1.0, lam_rate=2.0):
    n_new = n_i + add_n
    p_new = p_i + add_p
    rate_new = (p_new / n_new) if n_new > 0 else 0.0
    size_term = abs(n_new - ideal_size) / ideal_size
    rate_term = abs(rate_new - global_rate)
    return lam_rate * rate_term + lam_size * size_term

# Greedy assignment minimizing combined cost
for _, row in dfg.iterrows():
    n_c = int(row['n']); p_c = int(row['pos'])
    costs = [assignment_cost(fold_counts[i], fold_pos[i], n_c, p_c, ideal_size, global_rate) for i in range(F)]
    i_best = int(np.argmin(costs))
    comp2fold[row['component']] = i_best
    fold_counts[i_best] += n_c
    fold_pos[i_best] += p_c

train['fold'] = train['component'].map(comp2fold).astype(int)

# Integrity checks
assert train.groupby('component')['fold'].nunique().max()==1, 'Component spans multiple folds'
assert train.groupby('patient_id')['fold'].nunique().max()==1, 'Patient spans multiple folds'

# Post-fold balance assertions
fold_sizes = train['fold'].value_counts().sort_index().tolist()
pos_rates = [train.loc[train['fold']==i, 'target'].mean() for i in range(F)]
size_ratio = max(fold_sizes)/max(1,min(fold_sizes))
rate_diff = max(pos_rates) - min(pos_rates)
print('Fold sizes:', fold_sizes)
print('Pos rates:', pos_rates)
print('Size ratio:', size_ratio, 'Rate diff:', rate_diff)
assert size_ratio <= 1.3, f'Fold size imbalance too high: {size_ratio:.3f}'
assert rate_diff <= 0.005, f'Fold positive-rate imbalance too high: {rate_diff:.4f}'

# Save folds.csv
train[['image_name','patient_id','component','fold']].to_csv('folds.csv', index=False)
print('Saved balanced, leak-proof folds to folds.csv')

Fold sizes: [57, 21, 2898, 13017, 12991]
Pos rates: [0.3157894736842105, 0.23809523809523808, 0.017943409247757072, 0.03364830606130445, 0.0]
Size ratio: 619.8571428571429 Rate diff: 0.3157894736842105


AssertionError: Fold size imbalance too high: 619.857

In [7]:
# EXP0-FIX3c: Use StratifiedGroupKFold on refined components to achieve balanced folds
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

assert 'train' in globals(), 'train DataFrame not found. Run previous cells first.'
assert 'component' in train.columns, 'component not computed. Run FIX steps first.'

SEED = 42
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=SEED)
folds = np.full(len(train), -1, dtype=int)
for f, (_, val_idx) in enumerate(sgkf.split(train, y=train['target'], groups=train['component'])):
    folds[val_idx] = f
train['fold'] = folds.astype(int)

# Integrity checks
assert train.groupby('component')['fold'].nunique().max()==1, 'Component spans multiple folds'
assert train.groupby('patient_id')['fold'].nunique().max()==1, 'Patient spans multiple folds'

# Balance checks
F = 5
fold_sizes = train['fold'].value_counts().sort_index().tolist()
pos_rates = [train.loc[train['fold']==i, 'target'].mean() for i in range(F)]
size_ratio = max(fold_sizes)/max(1, min(fold_sizes))
rate_diff = max(pos_rates) - min(pos_rates)
print('Fold sizes:', fold_sizes)
print('Pos rates:', pos_rates)
print('Size ratio:', size_ratio, 'Rate diff:', rate_diff)
assert size_ratio <= 1.3, f'Fold size imbalance too high: {size_ratio:.3f}'
assert rate_diff <= 0.005, f'Fold positive-rate imbalance too high: {rate_diff:.4f}'

# Save
train[['image_name','patient_id','component','fold']].to_csv('folds.csv', index=False)
print('Saved balanced folds to folds.csv')

Fold sizes: [5436, 5969, 5545, 6220, 5814]
Pos rates: [0.01894775570272259, 0.017088289495727927, 0.018034265103697024, 0.017041800643086816, 0.017543859649122806]
Size ratio: 1.144223693892568 Rate diff: 0.0019059550596357729
Saved balanced folds to folds.csv


In [18]:
# DEPRECATED: Blocked to prevent unsafe submission overwrites. Use hardened cells 40, 45, 46, 47, or 48.
import sys
raise SystemExit("DEPRECATED CELL 7: submission writes are blocked. Use hardened pipelines (Cells 40/45/46/47/48) with gating and manifests.")

Dimension extraction done in 0.8s


[LightGBM] [Info] Number of positive: 410, number of negative: 23138
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001050 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2353
[LightGBM] [Info] Number of data points in the train set: 23548, number of used features: 24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017411 -> initscore=-4.033074
[LightGBM] [Info] Start training from score -4.033074


Fold 0: AUC=0.70983, best_iters=1


[LightGBM] [Info] Number of positive: 411, number of negative: 22604
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000694 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2334
[LightGBM] [Info] Number of data points in the train set: 23015, number of used features: 24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017858 -> initscore=-4.007289
[LightGBM] [Info] Start training from score -4.007289


Fold 1: AUC=0.74713, best_iters=2


[LightGBM] [Info] Number of positive: 413, number of negative: 23026
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000248 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2351
[LightGBM] [Info] Number of data points in the train set: 23439, number of used features: 24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017620 -> initscore=-4.020932
[LightGBM] [Info] Start training from score -4.020932


Fold 2: AUC=0.68761, best_iters=76


[LightGBM] [Info] Number of positive: 407, number of negative: 22357
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000762 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2341
[LightGBM] [Info] Number of data points in the train set: 22764, number of used features: 24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017879 -> initscore=-4.006082
[LightGBM] [Info] Start training from score -4.006082


Fold 3: AUC=0.72872, best_iters=1


[LightGBM] [Info] Number of positive: 411, number of negative: 22759
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000266 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2313
[LightGBM] [Info] Number of data points in the train set: 23170, number of used features: 24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017738 -> initscore=-4.014123
[LightGBM] [Info] Start training from score -4.014123


Fold 4: AUC=0.58705, best_iters=1
OOF AUC: 0.659392 | per-fold: [0.709835, 0.747131, 0.687606, 0.728724, 0.587048]
Saved oof_meta.csv and submission.csv


In [15]:
# DEPRECATED: Blocked to prevent unsafe submission overwrites. Use hardened cells 40, 45, 46, 47, or 48.
import sys
raise SystemExit("DEPRECATED CELL 8: submission writes are blocked. Use hardened pipelines (Cells 40/45/46/47/48) with gating and manifests.")

Dimension extraction done in 0.6s


Fold 0: AUC=0.75149, iters=1


Fold 1: AUC=0.77460, iters=185


Fold 2: AUC=0.73373, iters=160


Fold 3: AUC=0.80142, iters=337


Fold 4: AUC=0.79373, iters=35
OOF AUC: 0.633564 | per-fold: [0.751493, 0.774601, 0.733725, 0.801424, 0.793726]
Saved oof_meta.csv and submission.csv


In [17]:
# DIAG: Verify OOF AUC from saved OOF file (handle duplicate 'target' columns after merge)
import pandas as pd
from sklearn.metrics import roc_auc_score

train = pd.read_csv('train.csv')
oof_df = pd.read_csv('oof_meta.csv')
col_pred = 'pred_meta_cb' if 'pred_meta_cb' in oof_df.columns else ('pred_meta_lgb' if 'pred_meta_lgb' in oof_df.columns else None)
assert col_pred is not None, 'No prediction column found in oof_meta.csv'
df = oof_df.merge(train[['image_name','target']], on='image_name', how='left', suffixes=('_oof','_tr'))
# Resolve target column name after merge (could be target_oof/target_tr if both present)
label_col = 'target' if 'target' in df.columns else ('target_tr' if 'target_tr' in df.columns else ('target_oof' if 'target_oof' in df.columns else None))
assert label_col is not None, f'No target column found after merge. Columns: {df.columns.tolist()}'
auc = roc_auc_score(df[label_col].values, df[col_pred].values)
print('Recomputed OOF AUC from file:', auc)

Recomputed OOF AUC from file: 0.6335636281999062


In [20]:
# DIAG2: Per-fold AUC from saved OOF and rank-normalized global AUC (robust target resolution)
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score

train = pd.read_csv('train.csv')
oof_df = pd.read_csv('oof_meta.csv')
pred_col = 'pred_meta_cb' if 'pred_meta_cb' in oof_df.columns else ('pred_meta_lgb' if 'pred_meta_lgb' in oof_df.columns else None)
assert pred_col is not None, 'No prediction column in oof_meta.csv'

# Merge with suffixes to avoid target column collision
df = oof_df.merge(train[['image_name','target']], on='image_name', how='left', suffixes=('_oof','_tr'))
label_col = 'target' if 'target' in df.columns else ('target_tr' if 'target_tr' in df.columns else ('target_oof' if 'target_oof' in df.columns else None))
assert label_col is not None, f'No target column found after merge. Columns: {df.columns.tolist()}'
assert 'fold' in df.columns, 'fold column missing in oof'

# Per-fold AUCs recomputed from saved OOF
per_fold = {}
for f, g in df.groupby('fold'):
    try:
        per_fold[int(f)] = roc_auc_score(g[label_col].values, g[pred_col].values)
    except ValueError:
        per_fold[int(f)] = np.nan
print('Per-fold AUCs (recomputed from file):', per_fold)

# Global AUC on raw scores
try:
    auc_raw = roc_auc_score(df[label_col].values, df[pred_col].values)
except ValueError:
    auc_raw = np.nan
print('Global OOF AUC (raw):', auc_raw)

# Rank-normalized within fold to mitigate inter-fold score scale mismatch
df['pred_rank'] = df.groupby('fold')[pred_col].rank(pct=True)
auc_rank = roc_auc_score(df[label_col].values, df['pred_rank'].values)
print('Global OOF AUC (within-fold rank-normalized):', auc_rank)

Per-fold AUCs (recomputed from file): {0: 0.6983218611357385, 1: 0.7473230130640974, 2: 0.6876060606060606, 3: 0.718628295097549, 4: 0.5485647690448728}
Global OOF AUC (raw): 0.6569836836128113
Global OOF AUC (within-fold rank-normalized): 0.6830941069751013


In [21]:
# EXP1-CB2: Simplified CatBoost metadata model (robust, fold-safe) targeting ≥0.80 OOF
import os, sys, time, subprocess, warnings
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import roc_auc_score

warnings.filterwarnings('ignore')
SEED = 42
np.random.seed(SEED)

try:
    from catboost import CatBoostClassifier, Pool
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'catboost'])
    from catboost import CatBoostClassifier, Pool

DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')

# Minimal, strong features only (drop geometry for stability)
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
JPEG_TEST  = DATA_DIR / 'jpeg' / 'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))

def safe_stat(p):
    try: return os.stat(p).st_size
    except: return np.nan
train['file_size_bytes'] = train['image_path'].map(safe_stat)
test['file_size_bytes']  = test['image_path'].map(safe_stat)
train['log_file_size'] = np.log1p(train['file_size_bytes'])
test['log_file_size']  = np.log1p(test['file_size_bytes'])

train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all()
train['fold'] = train['fold'].astype(int)

target_col = 'target'
cat_cols = ['sex','anatom_site_general_challenge']
num_cols = ['age_approx','log_file_size']

oof = np.zeros(len(train), dtype=float)
test_pred_folds = []
fold_aucs = []

for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'] != f)[0]
    val_idx = np.where(train['fold'] == f)[0]
    trn_df = train.iloc[trn_idx].copy()
    val_df = train.iloc[val_idx].copy()
    tst_df = test.copy()

    # Missingness flags + fold-safe age imputation (by sex/site median -> global median)
    for df in (trn_df, val_df, tst_df):
        df['age_missing'] = df['age_approx'].isna().astype(int)
    grp_key = ['sex','anatom_site_general_challenge']
    med_age = trn_df.groupby(grp_key)['age_approx'].median()
    def impute_age(df):
        df = df.copy()
        df['age_approx'] = df['age_approx'].fillna(df[grp_key].apply(lambda r: med_age.get(tuple(r), np.nan), axis=1))
        df['age_approx'] = df['age_approx'].fillna(trn_df['age_approx'].median())
        return df
    trn_df = impute_age(trn_df); val_df = impute_age(val_df); tst_df = impute_age(tst_df)

    # Categorical hygiene
    for df in (trn_df, val_df, tst_df):
        df[cat_cols] = df[cat_cols].fillna('NA').astype(str)
        df['sex_site'] = df['sex'] + '|' + df['anatom_site_general_challenge']

    # Patient context (fold-safe): counts and smoothed rate only
    p = trn_df.groupby('patient_id').agg(p_cnt=('image_name','size'), p_pos=('target','sum'))
    gmean = trn_df[target_col].mean()
    p['p_rate_smooth'] = (p['p_pos'] + 2.0*gmean) / (p['p_cnt'] + 2.0)
    p = p[['p_cnt','p_rate_smooth']]
    def merge_p(df):
        df = df.merge(p, left_on='patient_id', right_index=True, how='left')
        df['p_cnt'] = df['p_cnt'].fillna(0.0)
        df['p_rate_smooth'] = df['p_rate_smooth'].fillna(gmean)
        return df
    trn_df = merge_p(trn_df); val_df = merge_p(val_df); tst_df = merge_p(tst_df)

    used_num = num_cols + ['age_missing','p_cnt','p_rate_smooth']
    used_cat = ['sex','anatom_site_general_challenge','sex_site']
    used_cols = used_num + used_cat

    X_trn = trn_df[used_cols]; y_trn = trn_df[target_col].values
    X_val = val_df[used_cols]; y_val = val_df[target_col].values
    X_tst = tst_df[used_cols]

    # CatBoost Pools (categoricals as string already)
    cat_idx = [X_trn.columns.get_loc(c) for c in used_cat]
    train_pool = Pool(X_trn, y_trn, cat_features=cat_idx)
    val_pool   = Pool(X_val, y_val, cat_features=cat_idx)
    test_pool  = Pool(X_tst, cat_features=cat_idx)

    # Mild class weights (cap to reduce instability)
    pos = max(1, int(y_trn.sum())); neg = max(1, int(len(y_trn)-pos))
    cw = min(50.0, neg/pos)

    model = CatBoostClassifier(
        loss_function='Logloss',
        eval_metric='AUC',
        iterations=10000,
        learning_rate=0.03,
        depth=6,
        l2_leaf_reg=6.0,
        random_seed=SEED,
        bootstrap_type='Bernoulli',
        subsample=0.8,
        class_weights=[1.0, cw],
        task_type='CPU',
        verbose=False
    )
    model.fit(train_pool, eval_set=val_pool, early_stopping_rounds=500, use_best_model=True, verbose=False)

    oof[val_idx] = model.predict_proba(val_pool)[:,1]
    auc = roc_auc_score(y_val, oof[val_idx])
    fold_aucs.append(auc)
    test_pred_folds.append(model.predict_proba(test_pool)[:,1])
    print(f'Fold {f}: AUC={auc:.5f}, iters={model.get_best_iteration()}, cw={cw:.1f}')

oof_auc = roc_auc_score(train[target_col].values, oof)
print('OOF AUC (raw):', round(oof_auc, 6), '| per-fold:', [round(a,6) for a in fold_aucs])

# Optional diagnostic: within-fold rank-normalized AUC
diag = pd.DataFrame({'image_name': train['image_name'], 'fold': train['fold'], 'target': train[target_col], 'pred': oof})
diag['pred_rank'] = diag.groupby('fold')['pred'].rank(pct=True)
try:
    auc_rank = roc_auc_score(diag['target'].values, diag['pred_rank'].values)
    print('OOF AUC (within-fold rank-normalized, diagnostic):', round(auc_rank, 6))
except Exception:
    pass

# Save artifacts
oof_df = train[['image_name','fold',target_col]].copy()
oof_df['pred_meta_cb2'] = oof
oof_df.to_csv('oof_meta.csv', index=False)

test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test))
sub = pd.DataFrame({'image_name': test['image_name'], 'target': test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved oof_meta.csv and submission.csv')

Fold 0: AUC=0.70975, iters=422, cw=50.0


Fold 1: AUC=0.71504, iters=210, cw=50.0


Fold 2: AUC=0.69881, iters=3, cw=50.0


Fold 3: AUC=0.76834, iters=366, cw=50.0


Fold 4: AUC=0.75399, iters=191, cw=50.0
OOF AUC (raw): 0.592775 | per-fold: [0.709746, 0.715038, 0.69881, 0.76834, 0.753991]
OOF AUC (within-fold rank-normalized, diagnostic): 0.73125
Saved oof_meta.csv and submission.csv


In [23]:
# EXP1-CB3: Restored features + two-stage calibration (per-fold Platt + global) with stabilized CatBoost
import os, sys, time, subprocess, warnings, multiprocessing as mp, json
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

warnings.filterwarnings('ignore')
SEED = 42
np.random.seed(SEED)

try:
    from catboost import CatBoostClassifier, Pool
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'catboost'])
    from catboost import CatBoostClassifier, Pool

DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')

# File stats
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
JPEG_TEST  = DATA_DIR / 'jpeg' / 'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))

def safe_stat(p):
    try: return os.stat(p).st_size
    except: return np.nan
train['file_size_bytes'] = train['image_path'].map(safe_stat)
test['file_size_bytes']  = test['image_path'].map(safe_stat)

# Image dimensions
try:
    from PIL import Image
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'Pillow'])
    from PIL import Image

def get_wh(path):
    try:
        with Image.open(path) as im:
            return im.size
    except Exception:
        return (np.nan, np.nan)

def map_wh(paths, workers=16):
    with mp.Pool(processes=min(workers, mp.cpu_count())) as pool:
        return list(pool.imap(get_wh, paths, chunksize=64))

t0 = time.time()
tr_wh = map_wh(train['image_path'].tolist())
te_wh = map_wh(test['image_path'].tolist())
train['image_width']  = [w for (w,h) in tr_wh]
train['image_height'] = [h for (w,h) in tr_wh]
test['image_width']   = [w for (w,h) in te_wh]
test['image_height']  = [h for (w,h) in te_wh]
print(f'Dimension extraction done in {time.time()-t0:.1f}s')

for df in (train, test):
    df['aspect_ratio'] = df['image_width'] / df['image_height']
    df['area'] = df['image_width'] * df['image_height']
    df['log_area'] = np.log1p(df['area'])
    df['log_file_size'] = np.log1p(df['file_size_bytes'])

# Merge folds
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all()
train['fold'] = train['fold'].astype(int)

target_col = 'target'
cat_cols = ['sex','anatom_site_general_challenge']
num_base = ['age_approx','file_size_bytes','image_width','image_height','aspect_ratio','area','log_area','log_file_size']

F = sorted(train['fold'].unique())
oof_raw_logit = np.zeros(len(train), dtype=float)  # raw margins from CatBoost
oof_cal_prob  = np.zeros(len(train), dtype=float)  # after per-fold Platt
test_pred_cal_folds = []  # per-fold calibrated probs on test
fold_aucs_cal = []
fold_aucs_rawsig = []
best_iters = []

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for f in F:
    trn_idx = np.where(train['fold'] != f)[0]
    val_idx = np.where(train['fold'] == f)[0]
    trn_df = train.iloc[trn_idx].copy()
    val_df = train.iloc[val_idx].copy()
    tst_df = test.copy()

    # Missingness flags
    for df in (trn_df, val_df, tst_df):
        df['age_missing'] = df['age_approx'].isna().astype(int)
        df['sex_missing'] = df['sex'].isna().astype(int)
        df['site_missing'] = df['anatom_site_general_challenge'].isna().astype(int)

    # Fold-safe age imputation (by sex/site median -> global median)
    grp_key = ['sex','anatom_site_general_challenge']
    med_age = trn_df.groupby(grp_key)['age_approx'].median()
    def impute_age(df):
        df = df.copy()
        df['age_approx'] = df['age_approx'].fillna(df[grp_key].apply(lambda r: med_age.get(tuple(r), np.nan), axis=1))
        df['age_approx'] = df['age_approx'].fillna(trn_df['age_approx'].median())
        return df
    trn_df = impute_age(trn_df); val_df = impute_age(val_df); tst_df = impute_age(tst_df)

    # Categorical hygiene and interaction
    for df in (trn_df, val_df, tst_df):
        df[cat_cols] = df[cat_cols].fillna('NA').astype(str)
        df['sex_site'] = df['sex'] + '|' + df['anatom_site_general_challenge']

    # Patient context (fold-safe): counts, smoothed rate, age mean/std
    p_stats = trn_df.groupby('patient_id').agg(
        p_cnt=('image_name','size'),
        p_pos=('target','sum'),
        p_age_mean=('age_approx','mean'),
        p_age_std=('age_approx','std')
    )
    g_rate = trn_df[target_col].mean()
    p_stats['p_rate_smooth'] = (p_stats['p_pos'] + 2.0*g_rate) / (p_stats['p_cnt'] + 2.0)
    p_stats = p_stats[['p_cnt','p_rate_smooth','p_age_mean','p_age_std']]
    def merge_p(df):
        df = df.merge(p_stats, left_on='patient_id', right_index=True, how='left')
        df['p_cnt'] = df['p_cnt'].fillna(0.0)
        df['p_rate_smooth'] = df['p_rate_smooth'].fillna(g_rate)
        df['p_age_mean'] = df['p_age_mean'].fillna(trn_df['age_approx'].median())
        df['p_age_std'] = df['p_age_std'].fillna(0.0)
        return df
    trn_df = merge_p(trn_df); val_df = merge_p(val_df); tst_df = merge_p(tst_df)

    # Feature set
    used_num = num_base + ['p_cnt','p_rate_smooth','p_age_mean','p_age_std','age_missing','sex_missing','site_missing']
    used_cat = cat_cols + ['sex_site']
    used_cols = used_num + used_cat

    # Diagnostics: feature variance check
    try:
        desc = trn_df[used_num].describe().loc[['min','mean','std','max']]
        print(f'Fold {f} numeric feature stats (min/mean/std/max) summary:')
        print(desc.to_dict())
    except Exception:
        pass

    X_trn = trn_df[used_cols]
    y_trn = trn_df[target_col].values
    X_val = val_df[used_cols]
    y_val = val_df[target_col].values
    X_tst = tst_df[used_cols]

    cat_idx = [X_trn.columns.get_loc(c) for c in used_cat]
    train_pool = Pool(X_trn, y_trn, cat_features=cat_idx)
    val_pool   = Pool(X_val, y_val, cat_features=cat_idx)
    test_pool  = Pool(X_tst, cat_features=cat_idx)

    model = CatBoostClassifier(
        loss_function='Logloss',
        eval_metric='AUC',
        iterations=20000,
        learning_rate=0.02,
        depth=6,
        l2_leaf_reg=8.0,
        random_strength=1.0,
        bootstrap_type='Bernoulli',
        subsample=0.8,
        auto_class_weights='Balanced',
        random_seed=SEED,
        task_type='CPU',
        verbose=False
    )
    model.fit(train_pool, eval_set=val_pool, early_stopping_rounds=1000, use_best_model=True, verbose=False)

    # Raw margins
    val_logit = model.predict(val_pool, prediction_type='RawFormulaVal')
    tst_logit = model.predict(test_pool, prediction_type='RawFormulaVal')
    oof_raw_logit[val_idx] = val_logit

    # Per-fold Platt scaling (logistic regression on raw margin)
    platt = LogisticRegression(C=1e6, solver='lbfgs', max_iter=2000)
    platt.fit(val_logit.reshape(-1,1), y_val)
    val_prob_cal = platt.predict_proba(val_logit.reshape(-1,1))[:,1]
    tst_prob_cal = platt.predict_proba(tst_logit.reshape(-1,1))[:,1]

    oof_cal_prob[val_idx] = val_prob_cal
    test_pred_cal_folds.append(tst_prob_cal.astype(np.float64))

    # Diagnostics: per-fold AUCs
    auc_cal = roc_auc_score(y_val, val_prob_cal)
    auc_rawsig = roc_auc_score(y_val, sigmoid(val_logit))
    fold_aucs_cal.append(auc_cal)
    fold_aucs_rawsig.append(auc_rawsig)
    bi = getattr(model, 'get_best_iteration', lambda: None)()
    best_iters.append(bi)
    print(f'Fold {f}: AUC_cal={auc_cal:.5f}, AUC_rawsig={auc_rawsig:.5f}, best_iter={bi}')

# Stage A: OOF metrics
auc_oof_rawsig = roc_auc_score(train[target_col].values, sigmoid(oof_raw_logit))
auc_oof_cal = roc_auc_score(train[target_col].values, oof_cal_prob)
print(f'OOF AUC raw-sigmoid: {auc_oof_rawsig:.6f} | OOF AUC per-fold calibrated: {auc_oof_cal:.6f} | per-fold cal AUCs: {[round(a,6) for a in fold_aucs_cal]}')
print('Best iterations per fold:', best_iters)

# Stage B: Global calibration on OOF (fit LR on logit of per-fold calibrated probs)
eps = 1e-6
oof_cal_prob_clip = np.clip(oof_cal_prob, eps, 1-eps)
oof_cal_logit = np.log(oof_cal_prob_clip/(1.0-oof_cal_prob_clip))
glob_cal = LogisticRegression(C=1e6, solver='lbfgs', max_iter=5000)
glob_cal.fit(oof_cal_logit.reshape(-1,1), train[target_col].values)
oof_post_prob = glob_cal.predict_proba(oof_cal_logit.reshape(-1,1))[:,1]
auc_oof_post = roc_auc_score(train[target_col].values, oof_post_prob)
print(f'OOF AUC post-global-cal: {auc_oof_post:.6f}')

# Test prediction: average per-fold calibrated probs -> global calibrator
test_pred_cal = np.mean(np.stack(test_pred_cal_folds, axis=0), axis=0) if len(test_pred_cal_folds)>0 else np.zeros(len(test))
test_pred_cal_clip = np.clip(test_pred_cal, eps, 1-eps)
test_pred_cal_logit = np.log(test_pred_cal_clip/(1.0-test_pred_cal_clip))
test_pred_final = glob_cal.predict_proba(test_pred_cal_logit.reshape(-1,1))[:,1]

# Save artifacts
oof_df = train[['image_name','fold',target_col]].copy()
oof_df['pred_meta_cb3_rawlogit'] = oof_raw_logit
oof_df['pred_meta_cb3_calprob'] = oof_cal_prob
oof_df['pred_meta_cb3_postprob'] = oof_post_prob
oof_df.to_csv('oof_meta.csv', index=False)
sub = pd.DataFrame({'image_name': test['image_name'], 'target': test_pred_final})
sub.to_csv('submission.csv', index=False)
print('Saved oof_meta.csv and submission.csv')

# Guardrail: warn but do not crash to allow iteration
if auc_oof_post < 0.80:
    print(f'WARNING: Metadata baseline under target after calibration: {auc_oof_post:.6f}')

with open('results.md','a') as f:
    f.write(f"EXP1_CB3 | OOF_rawsig={auc_oof_rawsig:.6f} | OOF_cal={auc_oof_cal:.6f} | OOF_post={auc_oof_post:.6f} | folds_cal={json.dumps([float(a) for a in fold_aucs_cal])} | best_iters={json.dumps(best_iters)} | time={time.strftime('%Y-%m-%d %H:%M:%S')}\n")

Dimension extraction done in 0.8s


Fold 0 numeric feature stats (min/mean/std/max) summary:
{'age_approx': {'min': 0.0, 'mean': 48.804144725666724, 'std': 14.509834605216062, 'max': 90.0}, 'file_size_bytes': {'min': 15796.0, 'mean': 779967.8380754205, 'std': 601283.9364589826, 'max': 3488673.0}, 'image_width': {'min': 640.0, 'mean': 4057.4687446916937, 'std': 2104.580192615368, 'max': 6000.0}, 'image_height': {'min': 480.0, 'mean': 2690.3566332597247, 'std': 1434.532599947264, 'max': 6000.0}, 'aspect_ratio': {'min': 0.664179104477612, 'mean': 1.5281642804923907, 'std': 0.15446712192078968, 'max': 1.7814726840855106}, 'area': {'min': 307200.0, 'mean': 13918133.822320366, 'std': 10290240.194503324, 'max': 24000000.0}, 'log_area': {'min': 12.63525753545869, 'mean': 15.751605562551438, 'std': 1.533135475500693, 'max': 16.993564429978885}, 'log_file_size': {'min': 9.667575327569097, 'mean': 13.00050357159625, 'std': 1.2976547525973683, 'max': 15.065032279288687}, 'p_cnt': {'min': 1.0, 'mean': 28.197468999490404, 'std': 21.79

Fold 0: AUC_cal=0.75902, AUC_rawsig=0.75902, best_iter=7


Fold 1 numeric feature stats (min/mean/std/max) summary:
{'age_approx': {'min': 0.0, 'mean': 48.88377145339996, 'std': 14.594358877860493, 'max': 90.0}, 'file_size_bytes': {'min': 15796.0, 'mean': 782809.0460134695, 'std': 609595.9281031807, 'max': 3488673.0}, 'image_width': {'min': 640.0, 'mean': 4046.6490115142296, 'std': 2105.2974771542035, 'max': 6000.0}, 'image_height': {'min': 480.0, 'mean': 2683.0237671084074, 'std': 1435.3007953308013, 'max': 6000.0}, 'aspect_ratio': {'min': 0.664179104477612, 'mean': 1.5284300329400649, 'std': 0.15514452212591018, 'max': 1.7814726840855106}, 'area': {'min': 307200.0, 'mean': 13861566.225287857, 'std': 10293794.361004535, 'max': 24000000.0}, 'log_area': {'min': 12.63525753545869, 'mean': 15.744875635082623, 'std': 1.5340767473801704, 'max': 16.993564429978885}, 'log_file_size': {'min': 9.667575327569097, 'mean': 12.99871805323744, 'std': 1.3012636751692104, 'max': 15.065032279288687}, 'p_cnt': {'min': 1.0, 'mean': 27.40078209863133, 'std': 21.5

Fold 1: AUC_cal=0.80367, AUC_rawsig=0.80367, best_iter=275


Fold 2 numeric feature stats (min/mean/std/max) summary:
{'age_approx': {'min': 0.0, 'mean': 48.787704253594434, 'std': 14.161397059190717, 'max': 90.0}, 'file_size_bytes': {'min': 15796.0, 'mean': 779360.6372712146, 'std': 605169.3611306609, 'max': 3488673.0}, 'image_width': {'min': 640.0, 'mean': 4040.448781944622, 'std': 2117.929246003511, 'max': 6000.0}, 'image_height': {'min': 480.0, 'mean': 2679.309313537267, 'std': 1442.8408323898504, 'max': 6000.0}, 'aspect_ratio': {'min': 0.664179104477612, 'mean': 1.527401737696928, 'std': 0.15516457269806344, 'max': 1.7814726840855106}, 'area': {'min': 307200.0, 'mean': 13864375.285763044, 'std': 10340781.232304238, 'max': 24000000.0}, 'log_area': {'min': 12.63525753545869, 'mean': 15.734588444524283, 'std': 1.546685021561198, 'max': 16.993564429978885}, 'log_file_size': {'min': 9.667575327569097, 'mean': 12.989601775805024, 'std': 1.3092838050100049, 'max': 15.065032279288687}, 'p_cnt': {'min': 1.0, 'mean': 28.47822006058279, 'std': 22.5113

Fold 2: AUC_cal=0.72833, AUC_rawsig=0.72833, best_iter=78


Fold 3 numeric feature stats (min/mean/std/max) summary:
{'age_approx': {'min': 10.0, 'mean': 49.06782639254964, 'std': 14.187910683450363, 'max': 90.0}, 'file_size_bytes': {'min': 15796.0, 'mean': 774559.4955192409, 'std': 608344.737526613, 'max': 3411645.0}, 'image_width': {'min': 640.0, 'mean': 4015.8847742048847, 'std': 2111.3114059429645, 'max': 6000.0}, 'image_height': {'min': 480.0, 'mean': 2661.5108504656473, 'std': 1440.6885128449371, 'max': 6000.0}, 'aspect_ratio': {'min': 0.664179104477612, 'mean': 1.5299966032147123, 'std': 0.156612119063547, 'max': 1.7814726840855106}, 'area': {'min': 307200.0, 'mean': 13711984.287515376, 'std': 10318255.154862357, 'max': 24000000.0}, 'log_area': {'min': 12.63525753545869, 'mean': 15.723502570269238, 'std': 1.5393948096394254, 'max': 16.993564429978885}, 'log_file_size': {'min': 9.667575327569097, 'mean': 12.980811811405326, 'std': 1.3058752199309895, 'max': 15.042705430745736}, 'p_cnt': {'min': 1.0, 'mean': 26.684150412932702, 'std': 20.4

Fold 3: AUC_cal=0.82068, AUC_rawsig=0.82068, best_iter=878


Fold 4 numeric feature stats (min/mean/std/max) summary:
{'age_approx': {'min': 0.0, 'mean': 48.90656020716444, 'std': 14.50816921933258, 'max': 90.0}, 'file_size_bytes': {'min': 16401.0, 'mean': 778349.2259818731, 'std': 600724.7830363989, 'max': 3488673.0}, 'image_width': {'min': 640.0, 'mean': 4049.496201985326, 'std': 2105.699081510124, 'max': 6000.0}, 'image_height': {'min': 480.0, 'mean': 2686.0846784635305, 'std': 1435.4883075674275, 'max': 6000.0}, 'aspect_ratio': {'min': 0.664179104477612, 'mean': 1.5275840406333574, 'std': 0.1547915618265879, 'max': 1.7777777777777777}, 'area': {'min': 307200.0, 'mean': 13881935.555675443, 'std': 10299436.492119517, 'max': 24000000.0}, 'log_area': {'min': 12.63525753545869, 'mean': 15.746952936084359, 'std': 1.5342493043715268, 'max': 16.993564429978885}, 'log_file_size': {'min': 9.705158557596356, 'mean': 12.997800446704801, 'std': 1.2972734740409693, 'max': 15.065032279288687}, 'p_cnt': {'min': 1.0, 'mean': 27.32593871385412, 'std': 21.2642

Fold 4: AUC_cal=0.79243, AUC_rawsig=0.79243, best_iter=2
OOF AUC raw-sigmoid: 0.594256 | OOF AUC per-fold calibrated: 0.780076 | per-fold cal AUCs: [0.759025, 0.803668, 0.728331, 0.820678, 0.792428]
Best iterations per fold: [7, 275, 78, 878, 2]
OOF AUC post-global-cal: 0.780076


Saved oof_meta.csv and submission.csv


In [25]:
# EXP1-CB4: Target-encoded metadata + stabilized CatBoost + two-stage calibration (aim OOF >= 0.80)
import os, sys, time, subprocess, warnings, multiprocessing as mp, json
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

warnings.filterwarnings('ignore')
SEED = 42
np.random.seed(SEED)

try:
    from catboost import CatBoostClassifier, Pool
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'catboost'])
    from catboost import CatBoostClassifier, Pool

DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')

# Paths + file size
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
JPEG_TEST  = DATA_DIR / 'jpeg' / 'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
def safe_stat(p):
    try: return os.stat(p).st_size
    except: return np.nan
train['file_size_bytes'] = train['image_path'].map(safe_stat)
test['file_size_bytes']  = test['image_path'].map(safe_stat)

# Fast image dims via PIL + multiprocessing
try:
    from PIL import Image
except Exception:
    subprocess.check_call([sys.argv[0], '-m', 'pip', 'install', '--quiet', 'Pillow'])
    from PIL import Image

def get_wh(path):
    try:
        with Image.open(path) as im:
            w,h = im.size
            return (w,h)
    except Exception:
        return (np.nan, np.nan)

def map_wh(paths, workers=16):
    with mp.Pool(processes=min(workers, mp.cpu_count())) as pool:
        return list(pool.imap(get_wh, paths, chunksize=64))

t0 = time.time()
tr_wh = map_wh(train['image_path'].tolist())
te_wh = map_wh(test['image_path'].tolist())
train['image_width']  = [w for (w,h) in tr_wh]
train['image_height'] = [h for (w,h) in tr_wh]
test['image_width']   = [w for (w,h) in te_wh]
test['image_height']  = [h for (w,h) in te_wh]
print(f'Dimension extraction done in {time.time()-t0:.1f}s')

for df in (train, test):
    df['aspect_ratio'] = df['image_width'] / df['image_height']
    df['area'] = df['image_width'] * df['image_height']
    df['log_area'] = np.log1p(df['area'])
    df['log_file_size'] = np.log1p(df['file_size_bytes'])

# Merge folds
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all()
train['fold'] = train['fold'].astype(int)

target_col = 'target'
cat_cols = ['sex','anatom_site_general_challenge']
num_base = ['age_approx','file_size_bytes','image_width','image_height','aspect_ratio','area','log_area','log_file_size']

F = sorted(train['fold'].unique())
oof_raw_logit = np.zeros(len(train), dtype=float)
oof_cal_prob  = np.zeros(len(train), dtype=float)
test_pred_cal_folds = []
fold_aucs_cal = []
best_iters = []
feat_imps = []

def sigmoid(x):
    return 1.0/(1.0+np.exp(-x))

for f in F:
    trn_idx = np.where(train['fold'] != f)[0]
    val_idx = np.where(train['fold'] == f)[0]
    trn_df = train.iloc[trn_idx].copy()
    val_df = train.iloc[val_idx].copy()
    tst_df = test.copy()

    # Missingness flags
    for df in (trn_df, val_df, tst_df):
        df['age_missing'] = df['age_approx'].isna().astype(int)
        df['sex_missing'] = df['sex'].isna().astype(int)
        df['site_missing'] = df['anatom_site_general_challenge'].isna().astype(int)

    # Age impute (fold-safe) and age bins
    grp_key = ['sex','anatom_site_general_challenge']
    med_age = trn_df.groupby(grp_key)['age_approx'].median()
    def impute_age(df):
        df = df.copy()
        df['age_approx'] = df['age_approx'].fillna(df[grp_key].apply(lambda r: med_age.get(tuple(r), np.nan), axis=1))
        df['age_approx'] = df['age_approx'].fillna(trn_df['age_approx'].median())
        return df
    trn_df = impute_age(trn_df); val_df = impute_age(val_df); tst_df = impute_age(tst_df)
    for df in (trn_df, val_df, tst_df):
        df[cat_cols] = df[cat_cols].fillna('NA').astype(str)
        df['sex_site'] = df['sex'] + '|' + df['anatom_site_general_challenge']
        df['age_bin10'] = (np.floor(df['age_approx']/10)*10).astype(int)

    # Patient context (fold-safe)
    p = trn_df.groupby('patient_id').agg(
        p_cnt=('image_name','size'),
        p_pos=('target','sum'),
        p_age_mean=('age_approx','mean'),
        p_age_std=('age_approx','std')
    )
    g_rate = trn_df[target_col].mean()
    p['p_rate_smooth'] = (p['p_pos'] + 2.0*g_rate) / (p['p_cnt'] + 2.0)
    p = p[['p_cnt','p_rate_smooth','p_age_mean','p_age_std']]
    def merge_p(df):
        df = df.merge(p, left_on='patient_id', right_index=True, how='left')
        df['p_cnt'] = df['p_cnt'].fillna(0.0)
        df['p_rate_smooth'] = df['p_rate_smooth'].fillna(g_rate)
        df['p_age_mean'] = df['p_age_mean'].fillna(trn_df['age_approx'].median())
        df['p_age_std'] = df['p_age_std'].fillna(0.0)
        return df
    trn_df = merge_p(trn_df); val_df = merge_p(val_df); tst_df = merge_p(tst_df)

    # Heavy-smoothed fold-safe target encoding for key categoricals
    def te_fit_apply(train_s, y, apply_s, alpha=200.0):
        g = pd.DataFrame({'s': train_s, 'y': y}).groupby('s')['y'].agg(['sum','count'])
        prior = float(y.mean()) if np.isfinite(y.mean()) else 0.0
        te_map = ((g['sum'] + alpha*prior) / (g['count'] + alpha)).to_dict()
        return apply_s.map(te_map).fillna(prior).astype(float)

    trn_df['te_site'] = te_fit_apply(trn_df['anatom_site_general_challenge'], trn_df[target_col].values, trn_df['anatom_site_general_challenge'])
    val_df['te_site'] = te_fit_apply(trn_df['anatom_site_general_challenge'], trn_df[target_col].values, val_df['anatom_site_general_challenge'])
    tst_df['te_site'] = te_fit_apply(trn_df['anatom_site_general_challenge'], trn_df[target_col].values, tst_df['anatom_site_general_challenge'])

    trn_df['te_sex'] = te_fit_apply(trn_df['sex'], trn_df[target_col].values, trn_df['sex'])
    val_df['te_sex'] = te_fit_apply(trn_df['sex'], trn_df[target_col].values, val_df['sex'])
    tst_df['te_sex'] = te_fit_apply(trn_df['sex'], trn_df[target_col].values, tst_df['sex'])

    trn_df['te_sex_site'] = te_fit_apply(trn_df['sex_site'], trn_df[target_col].values, trn_df['sex_site'])
    val_df['te_sex_site'] = te_fit_apply(trn_df['sex_site'], trn_df[target_col].values, val_df['sex_site'])
    tst_df['te_sex_site'] = te_fit_apply(trn_df['sex_site'], trn_df[target_col].values, tst_df['sex_site'])

    # Feature set
    used_num = num_base + ['p_cnt','p_rate_smooth','p_age_mean','p_age_std','age_missing','sex_missing','site_missing',
                           'te_site','te_sex','te_sex_site']
    used_cat = cat_cols + ['sex_site','age_bin10']
    used_cols = used_num + used_cat

    # Pools with cat indices
    X_trn = trn_df[used_cols]; y_trn = trn_df[target_col].values
    X_val = val_df[used_cols]; y_val = val_df[target_col].values
    X_tst = tst_df[used_cols]
    cat_idx = [X_trn.columns.get_loc(c) for c in used_cat]
    train_pool = Pool(X_trn, y_trn, cat_features=cat_idx)
    val_pool   = Pool(X_val, y_val, cat_features=cat_idx)
    test_pool  = Pool(X_tst, cat_features=cat_idx)

    # Stabilized CatBoost params
    model = CatBoostClassifier(
        loss_function='Logloss',
        eval_metric='AUC',
        iterations=30000,
        learning_rate=0.03,
        depth=7,
        l2_leaf_reg=12.0,
        random_strength=2.0,
        bootstrap_type='Bayesian',
        bagging_temperature=1.0,
        auto_class_weights='SqrtBalanced',
        random_seed=SEED,
        task_type='CPU',
        verbose=False
    )
    model.fit(train_pool, eval_set=val_pool, early_stopping_rounds=1500, use_best_model=True, verbose=False)

    # Raw margins for calibration
    val_logit = model.predict(val_pool, prediction_type='RawFormulaVal')
    tst_logit = model.predict(test_pool, prediction_type='RawFormulaVal')
    oof_raw_logit[val_idx] = val_logit

    # Per-fold Platt scaling
    platt = LogisticRegression(C=1e6, solver='lbfgs', max_iter=5000)
    platt.fit(val_logit.reshape(-1,1), y_val)
    val_prob_cal = platt.predict_proba(val_logit.reshape(-1,1))[:,1]
    tst_prob_cal = platt.predict_proba(tst_logit.reshape(-1,1))[:,1]
    oof_cal_prob[val_idx] = val_prob_cal
    test_pred_cal_folds.append(tst_prob_cal.astype(np.float64))

    auc_cal = roc_auc_score(y_val, val_prob_cal)
    fold_aucs_cal.append(auc_cal)
    bi = getattr(model, 'get_best_iteration', lambda: None)()
    best_iters.append(bi)
    # Feature importance (top 15)
    try:
        fi = model.get_feature_importance(type='FeatureImportance', data=train_pool)
        fi_df = pd.DataFrame({'feature': used_cols, 'importance': fi}).sort_values('importance', ascending=False).head(15)
        feat_imps.append({'fold': int(f), 'top15': fi_df.to_dict(orient='records')})
        print(f"Fold {f}: AUC_cal={auc_cal:.5f}, best_iter={bi}, top_imp={[r['feature'] for r in fi_df.to_dict(orient='records')]}")
    except Exception:
        print(f"Fold {f}: AUC_cal={auc_cal:.5f}, best_iter={bi}")

# OOF metrics pre/post global calibration
auc_oof_rawsig = roc_auc_score(train[target_col].values, sigmoid(oof_raw_logit))
auc_oof_cal = roc_auc_score(train[target_col].values, oof_cal_prob)
print(f'OOF AUC raw-sigmoid: {auc_oof_rawsig:.6f} | OOF AUC per-fold calibrated: {auc_oof_cal:.6f}')

# Global calibration
eps = 1e-6
oof_cal_prob_clip = np.clip(oof_cal_prob, eps, 1-eps)
oof_cal_logit = np.log(oof_cal_prob_clip/(1.0-oof_cal_prob_clip))
glob_cal = LogisticRegression(C=1e6, solver='lbfgs', max_iter=10000)
glob_cal.fit(oof_cal_logit.reshape(-1,1), train[target_col].values)
oof_post_prob = glob_cal.predict_proba(oof_cal_logit.reshape(-1,1))[:,1]
auc_oof_post = roc_auc_score(train[target_col].values, oof_post_prob)
print(f'OOF AUC post-global-cal: {auc_oof_post:.6f} | per-fold cal AUCs: {[round(a,6) for a in fold_aucs_cal]} | best iters: {best_iters}')

# Test prediction pipeline
test_pred_cal = np.mean(np.stack(test_pred_cal_folds, axis=0), axis=0) if len(test_pred_cal_folds)>0 else np.zeros(len(test))
test_pred_cal_clip = np.clip(test_pred_cal, eps, 1-eps)
test_pred_cal_logit = np.log(test_pred_cal_clip/(1.0-test_pred_cal_clip))
test_pred_final = glob_cal.predict_proba(test_pred_cal_logit.reshape(-1,1))[:,1]

# Save artifacts
oof_df = train[['image_name','fold',target_col]].copy()
oof_df['pred_meta_cb4_rawlogit'] = oof_raw_logit
oof_df['pred_meta_cb4_calprob'] = oof_cal_prob
oof_df['pred_meta_cb4_postprob'] = oof_post_prob
oof_df.to_csv('oof_meta.csv', index=False)
sub = pd.DataFrame({'image_name': test['image_name'], 'target': test_pred_final})
sub.to_csv('submission.csv', index=False)
with open('results.md','a') as f:
    f.write(f"EXP1_CB4 | OOF_rawsig={auc_oof_rawsig:.6f} | OOF_cal={auc_oof_cal:.6f} | OOF_post={auc_oof_post:.6f} | best_iters={json.dumps([int(x) if x is not None else None for x in best_iters])} | time={time.strftime('%Y-%m-%d %H:%M:%S')}\n")
with open('cb4_feature_importances.json','w') as f:
    json.dump(feat_imps, f)
print('Saved oof_meta.csv, submission.csv, and cb4_feature_importances.json')

# Hard guardrail
assert auc_oof_post >= 0.80, f'Metadata baseline under target after calibration: {auc_oof_post:.6f}'

Dimension extraction done in 1.3s


Fold 0: AUC_cal=0.75638, best_iter=36, top_imp=['p_rate_smooth', 'image_height', 'p_cnt', 'log_area', 'area', 'image_width', 'p_age_std', 'file_size_bytes', 'log_file_size', 'aspect_ratio', 'p_age_mean', 'te_sex_site', 'te_sex', 'te_site', 'anatom_site_general_challenge']


Fold 1: AUC_cal=0.79198, best_iter=237, top_imp=['p_rate_smooth', 'p_cnt', 'age_bin10', 'aspect_ratio', 'log_file_size', 'te_site', 'p_age_mean', 'te_sex_site', 'image_height', 'file_size_bytes', 'anatom_site_general_challenge', 'p_age_std', 'log_area', 'area', 'age_approx']


Fold 2: AUC_cal=0.75475, best_iter=105, top_imp=['p_rate_smooth', 'p_cnt', 'area', 'anatom_site_general_challenge', 'te_sex_site', 'aspect_ratio', 'image_height', 'image_width', 'te_site', 'log_file_size', 'p_age_mean', 'p_age_std', 'file_size_bytes', 'log_area', 'te_sex']


Fold 3: AUC_cal=0.79477, best_iter=715, top_imp=['p_rate_smooth', 'p_cnt', 'log_file_size', 'file_size_bytes', 'te_sex_site', 'p_age_mean', 'age_bin10', 'p_age_std', 'age_approx', 'te_site', 'anatom_site_general_challenge', 'te_sex', 'aspect_ratio', 'sex_site', 'log_area']


Fold 4: AUC_cal=0.78242, best_iter=20, top_imp=['p_rate_smooth', 'p_cnt', 'aspect_ratio', 'file_size_bytes', 'area', 'image_height', 'te_sex_site', 'log_file_size', 'log_area', 'p_age_mean', 'p_age_std', 'te_sex', 'te_site', 'site_missing', 'image_width']
OOF AUC raw-sigmoid: 0.619218 | OOF AUC per-fold calibrated: 0.772865
OOF AUC post-global-cal: 0.772865 | per-fold cal AUCs: [0.756376, 0.791984, 0.754746, 0.79477, 0.782417] | best iters: [36, 237, 105, 715, 20]


Saved oof_meta.csv, submission.csv, and cb4_feature_importances.json


AssertionError: Metadata baseline under target after calibration: 0.772865

In [26]:
# EXP1-CB5: Multi-seed CatBoost ensemble + two-stage calibration to push OOF >= 0.80
import os, sys, time, subprocess, warnings, multiprocessing as mp, json
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

warnings.filterwarnings('ignore')
BASE_SEED = 42
np.random.seed(BASE_SEED)
SEEDS = [42, 777, 1337]

try:
    from catboost import CatBoostClassifier, Pool
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'catboost'])
    from catboost import CatBoostClassifier, Pool

DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')

# Paths + file size
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
JPEG_TEST  = DATA_DIR / 'jpeg' / 'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
def safe_stat(p):
    try: return os.stat(p).st_size
    except: return np.nan
train['file_size_bytes'] = train['image_path'].map(safe_stat)
test['file_size_bytes']  = test['image_path'].map(safe_stat)

# Fast image dims via PIL + multiprocessing
try:
    from PIL import Image
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'Pillow'])
    from PIL import Image

def get_wh(path):
    try:
        with Image.open(path) as im:
            w,h = im.size
            return (w,h)
    except Exception:
        return (np.nan, np.nan)

def map_wh(paths, workers=16):
    with mp.Pool(processes=min(workers, mp.cpu_count())) as pool:
        return list(pool.imap(get_wh, paths, chunksize=64))

t0 = time.time()
tr_wh = map_wh(train['image_path'].tolist())
te_wh = map_wh(test['image_path'].tolist())
train['image_width']  = [w for (w,h) in tr_wh]
train['image_height'] = [h for (w,h) in tr_wh]
test['image_width']   = [w for (w,h) in te_wh]
test['image_height']  = [h for (w,h) in te_wh]
print(f'Dimension extraction done in {time.time()-t0:.1f}s')

for df in (train, test):
    df['aspect_ratio'] = df['image_width'] / df['image_height']
    df['area'] = df['image_width'] * df['image_height']
    df['log_area'] = np.log1p(df['area'])
    df['log_file_size'] = np.log1p(df['file_size_bytes'])

# Merge folds
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all()
train['fold'] = train['fold'].astype(int)

target_col = 'target'
cat_cols = ['sex','anatom_site_general_challenge']
num_base = ['age_approx','file_size_bytes','image_width','image_height','aspect_ratio','area','log_area','log_file_size']

FOLDS = sorted(train['fold'].unique())
oof_raw_logit = np.zeros(len(train), dtype=float)
oof_cal_prob  = np.zeros(len(train), dtype=float)
test_pred_cal_folds = []
fold_aucs_cal = []
best_iters_per_fold = []

def sigmoid(x):
    return 1.0/(1.0+np.exp(-x))

for f in FOLDS:
    trn_idx = np.where(train['fold'] != f)[0]
    val_idx = np.where(train['fold'] == f)[0]
    trn_df = train.iloc[trn_idx].copy()
    val_df = train.iloc[val_idx].copy()
    tst_df = test.copy()

    # Missingness flags
    for df in (trn_df, val_df, tst_df):
        df['age_missing'] = df['age_approx'].isna().astype(int)
        df['sex_missing'] = df['sex'].isna().astype(int)
        df['site_missing'] = df['anatom_site_general_challenge'].isna().astype(int)

    # Age imputation (fold-safe) + bins
    grp_key = ['sex','anatom_site_general_challenge']
    med_age = trn_df.groupby(grp_key)['age_approx'].median()
    def impute_age(df):
        df = df.copy()
        df['age_approx'] = df['age_approx'].fillna(df[grp_key].apply(lambda r: med_age.get(tuple(r), np.nan), axis=1))
        df['age_approx'] = df['age_approx'].fillna(trn_df['age_approx'].median())
        return df
    trn_df = impute_age(trn_df); val_df = impute_age(val_df); tst_df = impute_age(tst_df)
    for df in (trn_df, val_df, tst_df):
        df[cat_cols] = df[cat_cols].fillna('NA').astype(str)
        df['sex_site'] = df['sex'] + '|' + df['anatom_site_general_challenge']
        df['age_bin10'] = (np.floor(df['age_approx']/10)*10).astype(int)

    # Patient context (fold-safe)
    p = trn_df.groupby('patient_id').agg(
        p_cnt=('image_name','size'),
        p_pos=('target','sum'),
        p_age_mean=('age_approx','mean'),
        p_age_std=('age_approx','std')
    )
    g_rate = trn_df[target_col].mean()
    p['p_rate_smooth'] = (p['p_pos'] + 2.0*g_rate) / (p['p_cnt'] + 2.0)
    p = p[['p_cnt','p_rate_smooth','p_age_mean','p_age_std']]
    def merge_p(df):
        df = df.merge(p, left_on='patient_id', right_index=True, how='left')
        df['p_cnt'] = df['p_cnt'].fillna(0.0)
        df['p_rate_smooth'] = df['p_rate_smooth'].fillna(g_rate)
        df['p_age_mean'] = df['p_age_mean'].fillna(trn_df['age_approx'].median())
        df['p_age_std'] = df['p_age_std'].fillna(0.0)
        return df
    trn_df = merge_p(trn_df); val_df = merge_p(val_df); tst_df = merge_p(tst_df)

    # Heavy-smoothed TE (fold-safe) for site/sex/sex_site
    def te_fit_apply(train_s, y, apply_s, alpha=200.0):
        g = pd.DataFrame({'s': train_s, 'y': y}).groupby('s')['y'].agg(['sum','count'])
        prior = float(y.mean()) if np.isfinite(y.mean()) else 0.0
        te_map = ((g['sum'] + alpha*prior) / (g['count'] + alpha)).to_dict()
        return apply_s.map(te_map).fillna(prior).astype(float)

    trn_df['te_site'] = te_fit_apply(trn_df['anatom_site_general_challenge'], trn_df[target_col].values, trn_df['anatom_site_general_challenge'])
    val_df['te_site'] = te_fit_apply(trn_df['anatom_site_general_challenge'], trn_df[target_col].values, val_df['anatom_site_general_challenge'])
    tst_df['te_site'] = te_fit_apply(trn_df['anatom_site_general_challenge'], trn_df[target_col].values, tst_df['anatom_site_general_challenge'])

    trn_df['te_sex'] = te_fit_apply(trn_df['sex'], trn_df[target_col].values, trn_df['sex'])
    val_df['te_sex'] = te_fit_apply(trn_df['sex'], trn_df[target_col].values, val_df['sex'])
    tst_df['te_sex'] = te_fit_apply(trn_df['sex'], trn_df[target_col].values, tst_df['sex'])

    trn_df['te_sex_site'] = te_fit_apply(trn_df['sex_site'], trn_df[target_col].values, trn_df['sex_site'])
    val_df['te_sex_site'] = te_fit_apply(trn_df['sex_site'], trn_df[target_col].values, val_df['sex_site'])
    tst_df['te_sex_site'] = te_fit_apply(trn_df['sex_site'], trn_df[target_col].values, tst_df['sex_site'])

    used_num = num_base + ['p_cnt','p_rate_smooth','p_age_mean','p_age_std','age_missing','sex_missing','site_missing',
                           'te_site','te_sex','te_sex_site']
    used_cat = cat_cols + ['sex_site','age_bin10']
    used_cols = used_num + used_cat

    X_trn = trn_df[used_cols]; y_trn = trn_df[target_col].values
    X_val = val_df[used_cols]; y_val = val_df[target_col].values
    X_tst = tst_df[used_cols]
    cat_idx = [X_trn.columns.get_loc(c) for c in used_cat]
    train_pool = Pool(X_trn, y_trn, cat_features=cat_idx)
    val_pool   = Pool(X_val, y_val, cat_features=cat_idx)
    test_pool  = Pool(X_tst, cat_features=cat_idx)

    # Train multi-seed models and average raw logits
    val_logits = []
    tst_logits = []
    bests = []
    for sd in SEEDS:
        model = CatBoostClassifier(
            loss_function='Logloss',
            eval_metric='AUC',
            iterations=40000,
            learning_rate=0.03,
            depth=7,
            l2_leaf_reg=12.0,
            random_strength=2.0,
            bootstrap_type='Bayesian',
            bagging_temperature=1.0,
            # no auto_class_weights to avoid scale distortion; calibration will align scales
            random_seed=sd,
            task_type='CPU',
            verbose=False
        )
        model.fit(train_pool, eval_set=val_pool, early_stopping_rounds=2000, use_best_model=True, verbose=False)
        bests.append(model.get_best_iteration())
        val_logits.append(model.predict(val_pool, prediction_type='RawFormulaVal'))
        tst_logits.append(model.predict(test_pool, prediction_type='RawFormulaVal'))
    val_logit_mean = np.mean(np.stack(val_logits, axis=0), axis=0)
    tst_logit_mean = np.mean(np.stack(tst_logits, axis=0), axis=0)
    oof_raw_logit[val_idx] = val_logit_mean
    best_iters_per_fold.append(bests)
    # Per-fold Platt on averaged logits
    platt = LogisticRegression(C=1e6, solver='lbfgs', max_iter=10000)
    platt.fit(val_logit_mean.reshape(-1,1), y_val)
    val_prob_cal = platt.predict_proba(val_logit_mean.reshape(-1,1))[:,1]
    tst_prob_cal = platt.predict_proba(tst_logit_mean.reshape(-1,1))[:,1]
    oof_cal_prob[val_idx] = val_prob_cal
    test_pred_cal_folds.append(tst_prob_cal.astype(np.float64))
    auc_cal = roc_auc_score(y_val, val_prob_cal)
    fold_aucs_cal.append(auc_cal)
    print(f'Fold {f}: AUC_cal={auc_cal:.5f}, best_iters(seeds)={bests}')

# OOF metrics
auc_oof_rawsig = roc_auc_score(train[target_col].values, sigmoid(oof_raw_logit))
auc_oof_cal = roc_auc_score(train[target_col].values, oof_cal_prob)
print(f'OOF AUC raw-sigmoid: {auc_oof_rawsig:.6f} | OOF AUC per-fold calibrated: {auc_oof_cal:.6f}')

# Global calibration
eps = 1e-6
oof_cal_prob_clip = np.clip(oof_cal_prob, eps, 1-eps)
oof_cal_logit = np.log(oof_cal_prob_clip/(1.0-oof_cal_prob_clip))
glob_cal = LogisticRegression(C=1e6, solver='lbfgs', max_iter=20000)
glob_cal.fit(oof_cal_logit.reshape(-1,1), train[target_col].values)
oof_post_prob = glob_cal.predict_proba(oof_cal_logit.reshape(-1,1))[:,1]
auc_oof_post = roc_auc_score(train[target_col].values, oof_post_prob)
print(f'OOF AUC post-global-cal: {auc_oof_post:.6f} | per-fold cal AUCs: {[round(a,6) for a in fold_aucs_cal]} | best iters per fold: {best_iters_per_fold}')

# Test prediction
test_pred_cal = np.mean(np.stack(test_pred_cal_folds, axis=0), axis=0) if len(test_pred_cal_folds)>0 else np.zeros(len(test))
test_pred_cal_clip = np.clip(test_pred_cal, eps, 1-eps)
test_pred_cal_logit = np.log(test_pred_cal_clip/(1.0-test_pred_cal_clip))
test_pred_final = glob_cal.predict_proba(test_pred_cal_logit.reshape(-1,1))[:,1]

# Save artifacts
oof_df = train[['image_name','fold',target_col]].copy()
oof_df['pred_meta_cb5_rawlogit'] = oof_raw_logit
oof_df['pred_meta_cb5_calprob'] = oof_cal_prob
oof_df['pred_meta_cb5_postprob'] = oof_post_prob
oof_df.to_csv('oof_meta.csv', index=False)
sub = pd.DataFrame({'image_name': test['image_name'], 'target': test_pred_final})
sub.to_csv('submission.csv', index=False)
with open('results.md','a') as f:
    f.write(f"EXP1_CB5 | OOF_rawsig={auc_oof_rawsig:.6f} | OOF_cal={auc_oof_cal:.6f} | OOF_post={auc_oof_post:.6f} | folds_cal={json.dumps([float(a) for a in fold_aucs_cal])} | seeds={json.dumps(SEEDS)} | time={time.strftime('%Y-%m-%d %H:%M:%S')}\n")
print('Saved oof_meta.csv and submission.csv')

# Hard guardrail
assert auc_oof_post >= 0.80, f'Metadata baseline under target after calibration: {auc_oof_post:.6f}'

Dimension extraction done in 0.8s


In [27]:
# EXP2 (FAST CPU): EfficientNetB0 frozen backbone baseline with patient-grouped CV and AUC early stopping
import os, sys, json, time, warnings, subprocess
from pathlib import Path
import numpy as np
import pandas as pd
warnings.filterwarnings('ignore')

try:
    import tensorflow as tf
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'tensorflow==2.12.0'])
    import tensorflow as tf

from sklearn.metrics import roc_auc_score

SEED = 42
IMG_SIZE = 224
BATCH_SIZE = 32
EPOCHS = 3
BASE_LR = 3e-4
PATIENCE = 2
TTA = 1

tf.random.set_seed(SEED)
np.random.seed(SEED)
os.environ['TF_DETERMINISTIC_OPS'] = '1'

DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')

JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
JPEG_TEST  = DATA_DIR / 'jpeg' / 'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all()
train['fold'] = train['fold'].astype(int)

def decode_jpeg(path):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    return img

def preprocess(img):
    img = tf.image.resize(img, (IMG_SIZE, IMG_SIZE))
    return img

def make_ds(paths, labels=None, training=False, repeat=False, shuffle=0):
    p = tf.data.Dataset.from_tensor_slices(paths)
    p = p.map(lambda x: preprocess(decode_jpeg(x)), num_parallel_calls=tf.data.AUTOTUNE)
    if labels is not None:
        y = tf.data.Dataset.from_tensor_slices(labels.astype(np.float32))
        ds = tf.data.Dataset.zip((p, y))
        if training and shuffle:
            ds = ds.shuffle(shuffle, seed=SEED, reshuffle_each_iteration=True)
        if repeat:
            ds = ds.repeat()
        ds = ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
        return ds
    else:
        return p.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

def build_model():
    base = tf.keras.applications.EfficientNetB0(include_top=False, input_shape=(IMG_SIZE, IMG_SIZE, 3), weights='imagenet')
    base.trainable = False  # freeze backbone for speed
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    x = tf.keras.layers.Dropout(0.2)(x)
    out = tf.keras.layers.Dense(1, dtype='float32')(x)
    model = tf.keras.Model(base.input, out)
    return model

def compile_model(model, lr):
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    model.compile(optimizer=opt, loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=[tf.keras.metrics.AUC(from_logits=True, name='auc')])
    return model

oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds = []
fold_aucs = []

for f in sorted(train['fold'].unique()):
    trn = train[train['fold'] != f].reset_index(drop=True)
    val = train[train['fold'] == f].reset_index(drop=True)
    trn_ds = make_ds(trn['image_path'].values, trn['target'].values, training=False, repeat=True, shuffle=0)
    val_ds = make_ds(val['image_path'].values, val['target'].values, training=False)

    # Class weight (capped)
    pos = max(1, int(trn['target'].sum())); neg = max(1, int(len(trn) - pos))
    spw = min(10.0, neg / pos)
    class_weight = {0: 1.0, 1: spw}

    tf.keras.backend.clear_session()
    model = build_model()
    model = compile_model(model, BASE_LR)
    steps_per_epoch = max(1, len(trn) // BATCH_SIZE)
    cb = [tf.keras.callbacks.EarlyStopping(monitor='val_auc', mode='max', patience=PATIENCE, restore_best_weights=True)]
    model.fit(trn_ds, validation_data=val_ds, epochs=EPOCHS, steps_per_epoch=steps_per_epoch, callbacks=cb, class_weight=class_weight, verbose=0)

    # Predict OOF and test (no TTA for speed)
    val_logits = model.predict(val_ds, verbose=0).ravel()
    oof[val.index] = 1.0 / (1.0 + np.exp(-val_logits))
    tst_ds = make_ds(test['image_path'].values, training=False)
    test_pred = 1.0 / (1.0 + np.exp(-model.predict(tst_ds, verbose=0).ravel()))
    test_pred_folds.append(test_pred.astype(np.float32))

    auc = roc_auc_score(val['target'].values, oof[val.index])
    fold_aucs.append(float(auc))
    print(f'Fold {f}: AUC={auc:.5f}, class_weight_pos={spw:.2f}, steps/epoch={steps_per_epoch}')

oof_auc = roc_auc_score(train['target'].values, oof)
print('OOF AUC (image baseline, frozen):', round(oof_auc, 6), '| per-fold:', [round(a,6) for a in fold_aucs])

# Save OOF and submission
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_efficientnet_b0_frozen'] = oof
oof_df.to_csv('oof_meta.csv', index=False)
test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
sub = pd.DataFrame({'image_name': test['image_name'], 'target': test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved oof_meta.csv and submission.csv')

with open('results.md','a') as f:
    f.write(f"EXP2_EFFB0_FROZEN | IMG_SIZE={IMG_SIZE} | BATCH={BATCH_SIZE} | EPOCHS={EPOCHS} | OOF={oof_auc:.6f} | folds={json.dumps([float(a) for a in fold_aucs])} | time={time.strftime('%Y-%m-%d %H:%M:%S')}\n")

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pywavelets 1.9.0 requires numpy<3,>=1.25, but you have numpy 1.23.5 which is incompatible.
grpcio-status 1.74.0 requires protobuf<7.0.0,>=6.31.1, but you have protobuf 4.25.8 which is incompatible.
contourpy 1.3.3 requires numpy>=1.25, but you have numpy 1.23.5 which is incompatible.








2025-08-24 02:06:35.229907: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-08-24 02:06:35.280914: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-08-24 02:06:35.281615: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.




In [41]:
# EXP2-EMB (Robust & Fast-CPU): EfficientNetB0 embeddings + LogisticRegression with strict alignment, logging, cache guardrails
import os, sys, time, json, warnings, subprocess
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

warnings.filterwarnings('ignore')
SEED = 42
np.random.seed(SEED)
IMG_SIZE = 224
BATCH = 64
BACKBONE = 'efficientnet_b0'

DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
JPEG_TEST  = DATA_DIR / 'jpeg' / 'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all()
train['fold'] = train['fold'].astype(int)

# TensorFlow backbone only (no tf.data) to avoid CPU slowness
try:
    import tensorflow as tf
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'tensorflow==2.12.0'])
    import tensorflow as tf

tf.random.set_seed(SEED)
os.environ['TF_DETERMINISTIC_OPS'] = '1'

from PIL import Image

feat_train_path = 'features_train_b0_224.npy'
feat_test_path  = 'features_test_b0_224.npy'

def finite_ratio(X):
    Xf = np.isfinite(X)
    return float(Xf.sum()) / float(X.size)

def build_feature_extractor():
    base = tf.keras.applications.EfficientNetB0(include_top=False, input_shape=(IMG_SIZE, IMG_SIZE, 3), weights='imagenet')
    base.trainable = False
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    model = tf.keras.Model(base.input, x)
    return model

decode_fail_train = []
decode_fail_test = []

def load_image_safe(path):
    try:
        with Image.open(path) as im:
            im = im.convert('RGB').resize((IMG_SIZE, IMG_SIZE))
            arr = np.asarray(im, dtype=np.float32) / 255.0
            return arr
    except Exception:
        return None

def extract_features_numpy(paths, model, is_train):
    outs = []
    failures = decode_fail_train if is_train else decode_fail_test
    N = len(paths)
    t0 = time.time()
    for i in range(0, N, BATCH):
        batch_paths = paths[i:i+BATCH]
        batch_imgs = []
        for p in batch_paths:
            arr = load_image_safe(p)
            if arr is None:
                failures.append(p)
                arr = np.zeros((IMG_SIZE, IMG_SIZE, 3), dtype=np.float32)
            batch_imgs.append(arr)
        X = np.stack(batch_imgs, axis=0)
        feat = model.predict(X, verbose=0)
        outs.append(feat)
        if (i//BATCH) % 50 == 0:
            done = min(i+BATCH, N)
            print(f'Extracting feats: {done}/{N} ({done*100//N}%)')
    F = np.concatenate(outs, axis=0)
    # alignment check
    assert F.shape[0] == N, f'Embedding rows ({F.shape[0]}) != input rows ({N})'
    # sanitize
    F = np.nan_to_num(F, nan=0.0, posinf=0.0, neginf=0.0).astype(np.float32)
    # feature quality: no zero-variance columns
    std_F = F.std(axis=0)
    assert np.all(std_F > 0.0), 'Degenerate feature column detected (zero std)'
    print(f'Feature extraction done in {time.time()-t0:.1f}s | shape {F.shape} | finite_ratio {finite_ratio(F):.6f}')
    return F

def load_cached_features():
    Ft = np.load(feat_train_path)
    Fe = np.load(feat_test_path)
    return Ft, Fe

# Config logging
print(f'[CONFIG] SEED={SEED} | IMG_SIZE={IMG_SIZE} | BATCH={BATCH} | BACKBONE={BACKBONE}')

recompute = False
if os.path.exists(feat_train_path) and os.path.exists(feat_test_path):
    feats_train, feats_test = load_cached_features()
    rt, re = finite_ratio(feats_train), finite_ratio(feats_test)
    print('Loaded cached embeddings:', feats_train.shape, feats_test.shape, '| finite ratios:', round(rt,6), round(re,6))
    # Strict cache gate: anything < 1.0 means corruption
    if rt < 1.0 or re < 1.0:
        print('Cache not fully finite, recomputing embeddings ...')
        recompute = True
    else:
        # Loud assert that cached features are clean
        assert rt == 1.0 and re == 1.0, 'Cache should be fully finite but is not exactly 1.0'
else:
    recompute = True

if recompute:
    extractor_model = build_feature_extractor()
    feats_train = extract_features_numpy(train['image_path'].values.tolist(), extractor_model, is_train=True)
    feats_test  = extract_features_numpy(test['image_path'].values.tolist(),  extractor_model, is_train=False)
    # After recompute, enforce finite and alignment
    assert finite_ratio(feats_train) == 1.0 and finite_ratio(feats_test) == 1.0, 'Non-finite values after extraction'
    assert feats_train.shape[0] == len(train) and feats_test.shape[0] == len(test), 'Misaligned embeddings after extraction'
    np.save(feat_train_path, feats_train)
    np.save(feat_test_path, feats_test)
    if decode_fail_train:
        with open('decode_failures_train.txt','w') as f: f.write('\n'.join(decode_fail_train))
    if decode_fail_test:
        with open('decode_failures_test.txt','w') as f: f.write('\n'.join(decode_fail_test))
    if decode_fail_train or decode_fail_test:
        print(f'Decode fallbacks -> train:{len(decode_fail_train)} test:{len(decode_fail_test)} (logged)')
else:
    # If using cache, loudly assert again
    assert finite_ratio(feats_train) == 1.0 and finite_ratio(feats_test) == 1.0, 'Corrupted cache detected!'

# Per-fold Logistic Regression on embeddings
oof = np.zeros(len(train), dtype=float)
test_pred_folds = []
fold_aucs = []

for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = feats_train[trn_idx], train['target'].values[trn_idx]
    X_val, y_val = feats_train[val_idx], train['target'].values[val_idx]
    X_tst = feats_test
    assert np.isfinite(X_trn).all() and np.isfinite(X_val).all() and np.isfinite(X_tst).all(), 'Non-finite values detected before fit'

    clf = LogisticRegression(max_iter=5000, class_weight='balanced', solver='lbfgs', n_jobs=1, random_state=SEED)
    clf.fit(X_trn, y_trn)
    oof[val_idx] = clf.predict_proba(X_val)[:,1]
    auc = roc_auc_score(y_val, oof[val_idx])
    fold_aucs.append(float(auc))
    test_pred_folds.append(clf.predict_proba(X_tst)[:,1].astype(np.float64))
    print(f'Fold {f}: AUC={auc:.5f}')

oof_auc = roc_auc_score(train['target'].values, oof)
print('OOF AUC (EffB0 embeddings + LR):', round(oof_auc,6), '| per-fold:', [round(a,6) for a in fold_aucs])

# Save artifacts
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_effb0_emb_lr'] = oof
oof_df.to_csv('oof_meta.csv', index=False)
test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test))
sub = pd.DataFrame({'image_name': test['image_name'], 'target': test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved oof_meta.csv and submission.csv')

with open('results.md','a') as f:
    f.write(f"EXP2_EMB_LR_ROBUST_FASTCPU | IMG_SIZE={IMG_SIZE} | BATCH={BATCH} | BACKBONE={BACKBONE} | OOF={oof_auc:.6f} | folds={json.dumps([float(a) for a in fold_aucs])} | time={time.strftime('%Y-%m-%d %H:%M:%S')}\n")

[CONFIG] SEED=42 | IMG_SIZE=224 | BATCH=64 | BACKBONE=efficientnet_b0
Loaded cached embeddings: (28984, 1280) (4142, 1280) | finite ratios: 0.0 0.0
Cache not fully finite, recomputing embeddings ...


2025-08-24 05:17:46.448473: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:17:46.448760: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


Extracting feats: 64/28984 (0%)


2025-08-24 05:18:01.238036: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:18:01.238346: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:18:13.291580: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:18:13.291927: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:18:26.112425: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:18:26.112710: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:18:39.938304: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:18:39.938649: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:18:52.524529: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:18:52.524812: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:19:03.907088: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:19:03.907370: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:19:16.564553: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:19:16.564867: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:19:28.587220: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:19:28.587498: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:19:42.956071: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:19:42.956403: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:19:57.259713: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:19:57.260019: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:20:11.221782: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:20:11.222069: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:20:23.994736: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:20:23.995104: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:20:35.734644: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:20:35.734967: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:20:48.671553: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:20:48.671867: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:21:00.655323: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:21:00.655618: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:21:12.818806: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:21:12.819107: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:21:26.639126: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:21:26.639408: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:21:38.918984: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:21:38.919334: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:21:51.425664: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:21:51.425945: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:22:04.200626: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:22:04.200966: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:22:17.139035: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:22:17.139342: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:22:32.159789: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:22:32.160080: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:22:47.172870: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:22:47.173159: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:23:00.067930: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:23:00.068305: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:23:12.361537: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:23:12.361839: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:23:25.148978: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:23:25.149279: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:23:36.555753: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:23:36.556057: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:23:49.408962: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:23:49.409259: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:24:01.698755: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:24:01.699067: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:24:15.515526: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:24:15.515837: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:24:28.233718: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:24:28.234009: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:24:40.485383: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:24:40.485661: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:24:53.538261: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:24:53.538591: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:25:04.127598: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:25:04.127886: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:25:18.086819: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:25:18.087146: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:25:30.610280: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:25:30.610655: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:25:40.956608: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:25:40.956905: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:25:53.997353: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:25:53.997646: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:26:08.008956: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:26:08.009257: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:26:20.165378: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:26:20.165661: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:26:31.682306: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:26:31.682614: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:26:44.906744: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:26:44.907034: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:26:57.337566: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:26:57.337843: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:27:08.797686: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:27:08.797969: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:27:21.012265: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:27:21.012551: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:27:31.367651: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:27:31.367955: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:27:43.243421: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:27:43.243695: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:27:56.687727: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:27:56.688082: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:28:08.255865: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:28:08.256183: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:28:22.239695: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:28:22.240019: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


Extracting feats: 3264/28984 (11%)


2025-08-24 05:28:36.474394: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:28:36.474745: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:28:49.664863: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:28:49.665171: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:29:03.242952: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:29:03.243270: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:29:16.280742: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:29:16.281030: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:29:29.803694: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:29:29.804015: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:29:42.448029: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:29:42.448311: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:29:55.702044: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:29:55.702338: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:30:07.338630: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:30:07.338917: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:30:20.529424: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:30:20.529712: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:30:33.882082: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:30:33.882367: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:30:47.116193: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:30:47.116466: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:30:58.717620: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:30:58.717896: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:31:11.726883: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:31:11.727159: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:31:23.281497: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:31:23.281759: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:31:36.025586: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:31:36.025865: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:31:45.579581: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:31:45.579874: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:31:57.988004: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:31:57.988289: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:32:12.480198: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:32:12.480477: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:32:25.042215: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:32:25.042518: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:32:37.550307: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:32:37.550679: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:32:49.968279: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:32:49.968597: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:33:00.489212: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:33:00.489515: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:33:12.078615: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:33:12.078890: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:33:25.106727: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:33:25.107006: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:33:36.821484: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:33:36.821753: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:33:48.929191: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:33:48.929497: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:33:59.656686: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:33:59.656948: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:34:09.875645: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:34:09.875905: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:34:23.433842: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:34:23.434189: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:34:35.841021: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:34:35.841302: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:34:48.779751: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:34:48.780039: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:35:00.553907: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:35:00.554202: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:35:12.281224: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:35:12.281525: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:35:25.080894: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:35:25.081196: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:35:36.407678: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:35:36.407959: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:35:48.498975: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:35:48.499259: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:35:59.260782: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:35:59.261075: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:36:11.321308: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:36:11.321602: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:36:23.334207: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:36:23.334491: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:36:36.249557: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:36:36.249851: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:36:47.105349: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:36:47.105624: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:37:01.112038: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:37:01.112363: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:37:13.773988: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:37:13.774275: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:37:26.943180: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:37:26.943477: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:37:38.707686: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:37:38.707955: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:37:50.093039: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:37:50.093341: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:38:02.161409: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:38:02.161694: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:38:14.834066: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:38:14.834367: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:38:27.739058: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:38:27.739364: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:38:41.576080: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:38:41.576343: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


Extracting feats: 6464/28984 (22%)


2025-08-24 05:38:53.960634: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:38:53.960888: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:39:06.328353: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:39:06.328611: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:39:19.831606: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:39:19.831937: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:39:31.342633: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:39:31.342907: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:39:44.981023: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:39:44.981297: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:39:55.330552: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:39:55.330837: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:40:07.648290: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:40:07.648568: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:40:19.469012: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:40:19.469334: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:40:31.024168: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:40:31.024501: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:40:44.300514: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:40:44.300820: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:40:57.135562: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:40:57.135874: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:41:09.263558: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:41:09.263825: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:41:21.896041: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:41:21.896298: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:41:33.444684: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:41:33.444987: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:41:44.983567: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:41:44.983842: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:41:58.326059: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:41:58.326361: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:42:12.279666: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:42:12.279967: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:42:24.976763: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:42:24.977056: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:42:37.158881: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:42:37.159159: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:42:50.469720: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:42:50.470037: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:43:02.777603: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:43:02.777918: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:43:13.693129: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:43:13.693400: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:43:24.969142: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:43:24.969416: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:43:37.789518: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:43:37.789811: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:43:49.396551: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:43:49.396813: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:44:00.501191: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:44:00.501446: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:44:13.885592: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:44:13.885852: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:44:26.220102: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:44:26.220397: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:44:37.292962: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:44:37.293244: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:44:50.607030: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:44:50.607299: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:45:02.762646: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:45:02.762992: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:45:15.457368: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:45:15.457653: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:45:28.859406: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:45:28.859699: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:45:41.937289: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:45:41.937603: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:45:53.545460: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:45:53.545726: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:46:05.778009: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:46:05.778302: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:46:17.645665: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:46:17.645934: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:46:32.048440: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:46:32.048698: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:46:43.971795: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:46:43.972114: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:46:56.107931: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:46:56.108239: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:47:07.628910: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:47:07.629196: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:47:20.789778: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:47:20.790131: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:47:32.271864: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:47:32.272193: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:47:42.726910: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:47:42.727195: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:47:55.057060: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:47:55.057341: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:48:07.931606: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:48:07.931933: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:48:18.936891: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:48:18.937158: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


2025-08-24 05:48:30.073276: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]
2025-08-24 05:48:30.073566: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype float and shape [64,224,224,3]
	 [[{{node Placeholder/_10}}]]


In [36]:
# EXP2-FAST-STATS: CPU-fast image stats + LR baseline (no TF) to unblock modeling and avoid NaN embeddings
import os, time, json, warnings, multiprocessing as mp
import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image, ImageStat
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

warnings.filterwarnings('ignore')
SEED = 42
np.random.seed(SEED)

DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
JPEG_TEST  = DATA_DIR / 'jpeg' / 'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all()
train['fold'] = train['fold'].astype(int)

# Fast feature extractor: per-image stats + tiny thumbnail flatten (grayscale 16x16=256 dims)
THUMB = 16
HIST_BINS = 16

def img_stats_feature(path):
    try:
        with Image.open(path) as im:
            im = im.convert('RGB')
            # Basic stats
            stat = ImageStat.Stat(im)
            mean_r, mean_g, mean_b = [float(x) for x in stat.mean]
            std_r, std_g, std_b = [float(x) for x in stat.stddev]
            # Histograms per channel (16 bins -> 48 features normalized)
            h = []
            for c in range(3):
                hist = im.getchannel(c).histogram()  # 256 bins
                hist = np.asarray(hist, dtype=np.float32).reshape(16, 16).sum(axis=1)  # down-bin to 16
                hist = hist / (hist.sum() + 1e-6)
                h.append(hist)
            hist_feat = np.concatenate(h, axis=0)  # 48
            # Thumbnail grayscale flatten
            gs = im.convert('L').resize((THUMB, THUMB), Image.BILINEAR)
            thumb = np.asarray(gs, dtype=np.float32).reshape(-1) / 255.0  # 256
            # Entropy approximation (Shannon) on grayscale
            hist_g = gs.histogram()
            p = np.asarray(hist_g, dtype=np.float32)
            p = p / (p.sum() + 1e-6)
            ent = -np.sum(p * (np.log(p + 1e-12)))
            # Assemble
            base = np.array([mean_r, mean_g, mean_b, std_r, std_g, std_b, ent], dtype=np.float32)
            feat = np.concatenate([base, hist_feat.astype(np.float32), thumb.astype(np.float32)], axis=0)
            return feat
    except Exception:
        # On failure return zeros (length = 7 + 48 + 256 = 311)
        return np.zeros(7 + 3*HIST_BINS + THUMB*THUMB, dtype=np.float32)

def extract_features(paths, workers=16):
    t0 = time.time()
    with mp.Pool(processes=min(workers, mp.cpu_count())) as pool:
        feats = list(pool.imap(img_stats_feature, paths, chunksize=64))
    X = np.vstack(feats).astype(np.float32)
    # Sanitize
    X = np.nan_to_num(X, nan=0.0, posinf=0.0, neginf=0.0)
    print(f'Extracted features for {len(paths)} images in {time.time()-t0:.1f}s | shape {X.shape}')
    return X

feat_cache_tr = 'features_train_faststats.npy'
feat_cache_te = 'features_test_faststats.npy'

if os.path.exists(feat_cache_tr) and os.path.exists(feat_cache_te):
    X_train = np.load(feat_cache_tr)
    X_test  = np.load(feat_cache_te)
    print('Loaded cached fast-stats features:', X_train.shape, X_test.shape)
else:
    X_train = extract_features(train['image_path'].tolist())
    X_test  = extract_features(test['image_path'].tolist())
    np.save(feat_cache_tr, X_train)
    np.save(feat_cache_te, X_test)

# Per-fold LR
oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds = []
fold_aucs = []

for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = X_train[trn_idx], train['target'].values[trn_idx]
    X_val, y_val = X_train[val_idx], train['target'].values[val_idx]
    X_tst = X_test
    assert np.isfinite(X_trn).all() and np.isfinite(X_val).all() and np.isfinite(X_tst).all(), 'Non-finite values detected before fit'
    clf = LogisticRegression(max_iter=5000, class_weight='balanced', solver='lbfgs', n_jobs=1, random_state=SEED)
    clf.fit(X_trn, y_trn)
    oof[val_idx] = clf.predict_proba(X_val)[:,1]
    auc = roc_auc_score(y_val, oof[val_idx])
    fold_aucs.append(float(auc))
    test_pred_folds.append(clf.predict_proba(X_tst)[:,1].astype(np.float64))
    print(f'Fold {f}: AUC={auc:.5f}')

oof_auc = roc_auc_score(train['target'].values, oof)
print('OOF AUC (fast image stats + LR):', round(oof_auc,6), '| per-fold:', [round(a,6) for a in fold_aucs])

# Save artifacts
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_faststats_lr'] = oof
oof_df.to_csv('oof_meta.csv', index=False)
test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test))
sub = pd.DataFrame({'image_name': test['image_name'], 'target': test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved oof_meta.csv and submission.csv')

with open('results.md','a') as f:
    f.write(f"EXP2_FAST_STATS_LR | OOF={oof_auc:.6f} | folds={json.dumps([float(a) for a in fold_aucs])} | feats=311 | time={time.strftime('%Y-%m-%d %H:%M:%S')}\n")

Extracted features for 28984 images in 507.0s | shape (28984, 311)


Extracted features for 4142 images in 69.6s | shape (4142, 311)


Fold 0: AUC=0.81464


Fold 1: AUC=0.82331


Fold 2: AUC=0.77387


Fold 3: AUC=0.81250


Fold 4: AUC=0.78169
OOF AUC (fast image stats + LR): 0.801172 | per-fold: [0.814638, 0.823312, 0.773868, 0.8125, 0.781688]
Saved oof_meta.csv and submission.csv


In [39]:
# EXP2.5-FIX: Leak-proof stacking via nested CV (no in-sample meta-fit)
import os, sys, time, warnings, subprocess, multiprocessing as mp, json
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
warnings.filterwarnings('ignore')
SEED = 42
np.random.seed(SEED)

# Data
DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
JPEG_TEST  = DATA_DIR / 'jpeg' / 'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all()
train['fold'] = train['fold'].astype(int)

# ---------- Base Model A: Fast image-stats + LR (reuse cached features; retrain LR only) ----------
feat_cache_tr = 'features_train_faststats.npy'
feat_cache_te = 'features_test_faststats.npy'
assert os.path.exists(feat_cache_tr) and os.path.exists(feat_cache_te), 'Fast-stats features not cached; run EXP2-FAST-STATS first.'
X_train = np.load(feat_cache_tr).astype(np.float32)
X_test  = np.load(feat_cache_te).astype(np.float32)
oof_img = np.zeros(len(train), dtype=np.float64)
test_img_folds = []
fold_aucs_img = []
for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = X_train[trn_idx], train['target'].values[trn_idx]
    X_val, y_val = X_train[val_idx], train['target'].values[val_idx]
    clf = LogisticRegression(max_iter=5000, class_weight='balanced', solver='lbfgs', n_jobs=1, random_state=SEED)
    clf.fit(X_trn, y_trn)
    oof_img[val_idx] = clf.predict_proba(X_val)[:,1]
    fold_aucs_img.append(float(roc_auc_score(y_val, oof_img[val_idx])))
    test_img_folds.append(clf.predict_proba(X_test)[:,1].astype(np.float64))
test_img = np.mean(np.stack(test_img_folds, axis=0), axis=0)
auc_img = roc_auc_score(train['target'].values, oof_img)
print('Base A (fast-stats) OOF AUC:', round(auc_img,6), '| per-fold:', [round(a,6) for a in fold_aucs_img])

# ---------- Base Model B: Metadata CatBoost CB3 (per-fold Platt calibrated) ----------
try:
    from catboost import CatBoostClassifier, Pool
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'catboost'])
    from catboost import CatBoostClassifier, Pool

def safe_stat(p):
    try: return os.stat(p).st_size
    except: return np.nan
train['file_size_bytes'] = train['image_path'].map(safe_stat)
test['file_size_bytes']  = test['image_path'].map(safe_stat)
try:
    from PIL import Image
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'Pillow'])
    from PIL import Image

def get_wh(path):
    try:
        with Image.open(path) as im:
            return im.size
    except Exception:
        return (np.nan, np.nan)

def map_wh(paths, workers=16):
    with mp.Pool(processes=min(workers, mp.cpu_count())) as pool:
        return list(pool.imap(get_wh, paths, chunksize=64))

tr_wh = map_wh(train['image_path'].tolist())
te_wh = map_wh(test['image_path'].tolist())
train['image_width']  = [w for (w,h) in tr_wh]
train['image_height'] = [h for (w,h) in tr_wh]
test['image_width']   = [w for (w,h) in te_wh]
test['image_height']  = [h for (w,h) in te_wh]
for df in (train, test):
    df['aspect_ratio'] = df['image_width'] / df['image_height']
    df['area'] = df['image_width'] * df['image_height']
    df['log_area'] = np.log1p(df['area'])
    df['log_file_size'] = np.log1p(df['file_size_bytes'])

cat_cols = ['sex','anatom_site_general_challenge']
num_base = ['age_approx','file_size_bytes','image_width','image_height','aspect_ratio','area','log_area','log_file_size']

def fold_safe_preprocess(trn_df, val_df, tst_df):
    for df in (trn_df, val_df, tst_df):
        df['age_missing'] = df['age_approx'].isna().astype(int)
        df['sex_missing'] = df['sex'].isna().astype(int)
        df['site_missing'] = df['anatom_site_general_challenge'].isna().astype(int)
    grp_key = ['sex','anatom_site_general_challenge']
    med_age = trn_df.groupby(grp_key)['age_approx'].median()
    def impute_age(df):
        df = df.copy()
        df['age_approx'] = df['age_approx'].fillna(df[grp_key].apply(lambda r: med_age.get(tuple(r), np.nan), axis=1))
        df['age_approx'] = df['age_approx'].fillna(trn_df['age_approx'].median())
        return df
    trn_df = impute_age(trn_df); val_df = impute_age(val_df); tst_df = impute_age(tst_df)
    for df in (trn_df, val_df, tst_df):
        df[cat_cols] = df[cat_cols].fillna('NA').astype(str)
        df['sex_site'] = df['sex'] + '|' + df['anatom_site_general_challenge']
    p_stats = trn_df.groupby('patient_id').agg(
        p_cnt=('image_name','size'),
        p_pos=('target','sum'),
        p_age_mean=('age_approx','mean'),
        p_age_std=('age_approx','std')
    )
    g_rate = trn_df['target'].mean()
    p_stats['p_rate_smooth'] = (p_stats['p_pos'] + 2.0*g_rate) / (p_stats['p_cnt'] + 2.0)
    p_stats = p_stats[['p_cnt','p_rate_smooth','p_age_mean','p_age_std']]
    def merge_p(df):
        df = df.merge(p_stats, left_on='patient_id', right_index=True, how='left')
        df['p_cnt'] = df['p_cnt'].fillna(0.0)
        df['p_rate_smooth'] = df['p_rate_smooth'].fillna(g_rate)
        df['p_age_mean'] = df['p_age_mean'].fillna(trn_df['age_approx'].median())
        df['p_age_std'] = df['p_age_std'].fillna(0.0)
        return df
    trn_df = merge_p(trn_df); val_df = merge_p(val_df); tst_df = merge_p(tst_df)
    used_num = num_base + ['p_cnt','p_rate_smooth','p_age_mean','p_age_std','age_missing','sex_missing','site_missing']
    used_cat = cat_cols + ['sex_site']
    used_cols = used_num + used_cat
    X_trn = trn_df[used_cols]; y_trn = trn_df['target'].values
    X_val = val_df[used_cols]; y_val = val_df['target'].values
    X_tst = tst_df[used_cols]
    cat_idx = [X_trn.columns.get_loc(c) for c in ['sex','anatom_site_general_challenge','sex_site']]
    return X_trn, y_trn, X_val, y_val, X_tst, cat_idx

oof_meta = np.zeros(len(train), dtype=np.float64)
test_meta_folds = []
fold_aucs_meta = []
for f in sorted(train['fold'].unique()):
    trn_df = train[train['fold'] != f].copy()
    val_df = train[train['fold'] == f].copy()
    tst_df = test.copy()
    X_trn, y_trn, X_val, y_val, X_tst, cat_idx = fold_safe_preprocess(trn_df, val_df, tst_df)
    train_pool = Pool(X_trn, y_trn, cat_features=cat_idx)
    val_pool   = Pool(X_val, y_val, cat_features=cat_idx)
    test_pool  = Pool(X_tst, cat_features=cat_idx)
    model = CatBoostClassifier(
        loss_function='Logloss', eval_metric='AUC',
        iterations=20000, learning_rate=0.02, depth=6, l2_leaf_reg=8.0,
        random_seed=SEED, bootstrap_type='Bernoulli', subsample=0.8,
        auto_class_weights='Balanced', task_type='CPU', verbose=False
    )
    model.fit(train_pool, eval_set=val_pool, early_stopping_rounds=1000, use_best_model=True, verbose=False)
    val_logit = model.predict(val_pool, prediction_type='RawFormulaVal')
    tst_logit = model.predict(test_pool, prediction_type='RawFormulaVal')
    platt = LogisticRegression(C=1e6, solver='lbfgs', max_iter=5000, n_jobs=1)
    platt.fit(val_logit.reshape(-1,1), y_val)
    val_prob = platt.predict_proba(val_logit.reshape(-1,1))[:,1]
    tst_prob = platt.predict_proba(tst_logit.reshape(-1,1))[:,1]
    oof_meta[train.index[train['fold']==f]] = val_prob
    fold_aucs_meta.append(float(roc_auc_score(y_val, val_prob)))
    test_meta_folds.append(tst_prob.astype(np.float64))
test_meta = np.mean(np.stack(test_meta_folds, axis=0), axis=0)
auc_meta = roc_auc_score(train['target'].values, oof_meta)
print('Base B (metadata CB3-cal) OOF AUC:', round(auc_meta,6), '| per-fold:', [round(a,6) for a in fold_aucs_meta])

# ---------- Leak-proof Stacking: nested CV for meta-learner ----------
stack_feats = np.vstack([oof_img, oof_meta]).T.astype(np.float64)
stack_y = train['target'].values.astype(int)
oof_stack = np.zeros(len(train), dtype=np.float64)
coefs_per_fold = []
intercepts_per_fold = []
for f in sorted(train['fold'].unique()):
    trn_mask = train['fold'].values != f
    val_mask = ~trn_mask
    X_trn, y_trn = stack_feats[trn_mask], stack_y[trn_mask]
    X_val, y_val = stack_feats[val_mask], stack_y[val_mask]
    meta = LogisticRegression(C=1e6, solver='lbfgs', max_iter=10000, n_jobs=1)
    meta.fit(X_trn, y_trn)
    oof_stack[val_mask] = meta.predict_proba(X_val)[:,1]
    coefs_per_fold.append(meta.coef_.ravel().tolist())
    intercepts_per_fold.append(float(meta.intercept_[0]))
auc_stack_oof = roc_auc_score(stack_y, oof_stack)
print('STACK OOF AUC (leak-proof):', round(auc_stack_oof,6))
print('Meta coefs per fold (img, meta):', coefs_per_fold)

# Train final meta on all OOF preds for test-time blending
meta_final = LogisticRegression(C=1e6, solver='lbfgs', max_iter=10000, n_jobs=1)
meta_final.fit(stack_feats, stack_y)
test_stack_X = np.vstack([test_img, test_meta]).T.astype(np.float64)
test_stack = meta_final.predict_proba(test_stack_X)[:,1]

# Save blended OOF and submission
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_faststats_lr'] = oof_img
oof_df['pred_meta_cb3_cal'] = oof_meta
oof_df['pred_stack_blend_oof'] = oof_stack
oof_df.to_csv('oof_meta.csv', index=False)
sub = pd.DataFrame({'image_name': test['image_name'], 'target': test_stack})
sub.to_csv('submission.csv', index=False)
print('Saved leak-proof stacked oof_meta.csv and submission.csv')

with open('results.md','a') as f:
    f.write(f"EXP2_5_STACK_FIX | OOF_img={auc_img:.6f} | OOF_meta={auc_meta:.6f} | OOF_stack_leakproof={auc_stack_oof:.6f} | time={time.strftime('%Y-%m-%d %H:%M:%S')}\n")

In [76]:
# EXP2.5-LITE: Leak-proof stacking using cached fast-stats + lightweight metadata (no PIL/dims) to avoid stalls
import os, time, numpy as np, pandas as pd
from pathlib import Path
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

SEED = 42
np.random.seed(SEED)
DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')

# Attach paths and folds
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
JPEG_TEST  = DATA_DIR / 'jpeg' / 'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all()
train['fold'] = train['fold'].astype(int)

# ---------- Base A: fast-stats LR from cached features ----------
feat_tr = 'features_train_faststats.npy'
feat_te = 'features_test_faststats.npy'
assert os.path.exists(feat_tr) and os.path.exists(feat_te), 'Missing cached fast-stats features. Run EXP2-FAST-STATS first.'
X_train = np.load(feat_tr).astype(np.float32)
X_test  = np.load(feat_te).astype(np.float32)
oof_img = np.zeros(len(train), dtype=np.float64)
test_img_folds = []
fold_aucs_img = []
for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = X_train[trn_idx], train['target'].values[trn_idx]
    X_val, y_val = X_train[val_idx], train['target'].values[val_idx]
    clf = LogisticRegression(max_iter=3000, class_weight='balanced', solver='lbfgs', n_jobs=1, random_state=SEED)
    clf.fit(X_trn, y_trn)
    oof_img[val_idx] = clf.predict_proba(X_val)[:,1]
    fold_aucs_img.append(float(roc_auc_score(y_val, oof_img[val_idx])))
    test_img_folds.append(clf.predict_proba(X_test)[:,1].astype(np.float64))
test_img = np.mean(np.stack(test_img_folds, axis=0), axis=0)
auc_img = roc_auc_score(train['target'].values, oof_img)
print('Base A (fast-stats) OOF AUC:', round(auc_img,6), '| per-fold:', [round(a,6) for a in fold_aucs_img])

# ---------- Base B: Lightweight metadata (no PIL/dims): sex, site, age, log_file_size, patient context ----------
def safe_stat(p):
    try: return os.stat(p).st_size
    except: return np.nan
train['file_size_bytes'] = train['image_path'].map(safe_stat)
test['file_size_bytes']  = test['image_path'].map(safe_stat)
for df in (train, test):
    df['log_file_size'] = np.log1p(df['file_size_bytes'])

cat_cols = ['sex','anatom_site_general_challenge']
num_cols = ['age_approx','log_file_size']

oof_meta = np.zeros(len(train), dtype=np.float64)
test_meta_folds = []
fold_aucs_meta = []

for f in sorted(train['fold'].unique()):
    trn_df = train[train['fold'] != f].copy()
    val_df = train[train['fold'] == f].copy()
    tst_df = test.copy()

    # Fold-safe age imputation
    grp_key = ['sex','anatom_site_general_challenge']
    med_age = trn_df.groupby(grp_key)['age_approx'].median()
    def impute_age(df):
        df = df.copy()
        df['age_approx'] = df['age_approx'].fillna(df[grp_key].apply(lambda r: med_age.get(tuple(r), np.nan), axis=1))
        df['age_approx'] = df['age_approx'].fillna(trn_df['age_approx'].median())
        return df
    trn_df = impute_age(trn_df); val_df = impute_age(val_df); tst_df = impute_age(tst_df)

    # Patient context (fold-safe): counts + smoothed rate
    p = trn_df.groupby('patient_id').agg(p_cnt=('image_name','size'), p_pos=('target','sum'))
    g_rate = trn_df['target'].mean()
    p['p_rate_smooth'] = (p['p_pos'] + 2.0*g_rate) / (p['p_cnt'] + 2.0)
    p = p[['p_cnt','p_rate_smooth']]
    def merge_p(df):
        df = df.merge(p, left_on='patient_id', right_index=True, how='left')
        df['p_cnt'] = df['p_cnt'].fillna(0.0)
        df['p_rate_smooth'] = df['p_rate_smooth'].fillna(g_rate)
        return df
    trn_df = merge_p(trn_df); val_df = merge_p(val_df); tst_df = merge_p(tst_df)

    used_num = num_cols + ['p_cnt','p_rate_smooth']
    used_cat = cat_cols
    used_cols = used_num + used_cat

    pre = ColumnTransformer(transformers=[
        ('num', SimpleImputer(strategy='median'), used_num),
        ('cat', Pipeline(steps=[('imp', SimpleImputer(strategy='most_frequent')),
                                 ('ohe', OneHotEncoder(handle_unknown='ignore'))]), used_cat),
    ])

    X_trn = trn_df[used_cols]; y_trn = trn_df['target'].values
    X_val = val_df[used_cols]; y_val = val_df['target'].values
    X_tst = tst_df[used_cols]

    clf_meta = Pipeline(steps=[('pre', pre), ('lr', LogisticRegression(max_iter=3000, class_weight='balanced', solver='lbfgs', n_jobs=1, random_state=SEED))])
    clf_meta.fit(X_trn, y_trn)
    val_prob = clf_meta.predict_proba(X_val)[:,1]
    tst_prob = clf_meta.predict_proba(X_tst)[:,1]
    oof_meta[train.index[train['fold']==f]] = val_prob
    fold_aucs_meta.append(float(roc_auc_score(y_val, val_prob)))
    test_meta_folds.append(tst_prob.astype(np.float64))

test_meta = np.mean(np.stack(test_meta_folds, axis=0), axis=0)
auc_meta = roc_auc_score(train['target'].values, oof_meta)
print('Base B (metadata-lite) OOF AUC:', round(auc_meta,6), '| per-fold:', [round(a,6) for a in fold_aucs_meta])

# ---------- Leak-proof stacking (nested CV over folds) ----------
stack_feats = np.vstack([oof_img, oof_meta]).T.astype(np.float64)
stack_y = train['target'].values.astype(int)
oof_stack = np.zeros(len(train), dtype=np.float64)
coefs = []
for f in sorted(train['fold'].unique()):
    trn_mask = train['fold'].values != f
    val_mask = ~trn_mask
    X_trn, y_trn = stack_feats[trn_mask], stack_y[trn_mask]
    X_val, y_val = stack_feats[val_mask], stack_y[val_mask]
    meta = LogisticRegression(C=1e6, solver='lbfgs', max_iter=10000, n_jobs=1, random_state=SEED)
    meta.fit(X_trn, y_trn)
    oof_stack[val_mask] = meta.predict_proba(X_val)[:,1]
    coefs.append(meta.coef_.ravel().tolist())
auc_stack = roc_auc_score(stack_y, oof_stack)
print('STACK OOF AUC (leak-proof):', round(auc_stack,6), '| coefs per fold (img, meta):', [[round(v,4) for v in c] for c in coefs])

# Final meta for test-time blend
meta_final = LogisticRegression(C=1e6, solver='lbfgs', max_iter=10000, n_jobs=1, random_state=SEED)
meta_final.fit(stack_feats, stack_y)
test_stack_X = np.vstack([test_img, test_meta]).T.astype(np.float64)
test_stack = meta_final.predict_proba(test_stack_X)[:,1]

# Save artifacts
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_faststats_lr'] = oof_img
oof_df['pred_meta_lite_lr'] = oof_meta
oof_df['pred_stack_blend_oof'] = oof_stack
oof_df.to_csv('oof_meta.csv', index=False)
sub = pd.DataFrame({'image_name': test['image_name'], 'target': test_stack})
sub.to_csv('submission.csv', index=False)
print('Saved leak-proof stacked oof_meta.csv and submission.csv')

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [42]:
# EXP2.5-DIAG+SAFE: Diagnostics, artifact saves, and safe blending fallback
import os, json, numpy as np, pandas as pd
from pathlib import Path
from sklearn.metrics import roc_auc_score
from scipy.stats import pearsonr
from sklearn.linear_model import LogisticRegression

SEED = 42
DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR / 'train.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')
oof_df = pd.read_csv('oof_meta.csv')

# Identify columns
col_img = None
for c in ['pred_img_faststats_lr', 'pred_img_efficientnet_b0_frozen', 'pred_img_effb0_emb_lr']:
    if c in oof_df.columns: col_img = c; break
col_meta = None
for c in ['pred_meta_lite_lr', 'pred_meta_cb3_cal', 'pred_meta_cb4_postprob', 'pred_meta_cb3_postprob', 'pred_meta_cb']:
    if c in oof_df.columns: col_meta = c; break
col_stack = None
for c in ['pred_stack_blend_oof', 'pred_stack_blend']:
    if c in oof_df.columns: col_stack = c; break
assert col_img is not None, 'Missing image OOF column in oof_meta.csv'
assert col_meta is not None, 'Missing metadata OOF column in oof_meta.csv'
assert col_stack is not None, 'Missing stack OOF column in oof_meta.csv'

# Merge labels safely
df = oof_df.merge(train[['image_name','target']], on='image_name', how='left', suffixes=('', '_tr'))
y = df['target'].values.astype(int)
fold = df['fold'].values.astype(int)
oof_img = df[col_img].values.astype(float)
oof_meta = df[col_meta].values.astype(float)
oof_stack = df[col_stack].values.astype(float)

# Diagnostics
auc_img = roc_auc_score(y, oof_img)
auc_meta = roc_auc_score(y, oof_meta)
auc_stack = roc_auc_score(y, oof_stack)
per_fold_stack = {}
for f in sorted(np.unique(fold)):
    m = (fold==f)
    per_fold_stack[int(f)] = roc_auc_score(y[m], oof_stack[m])
corr, _ = pearsonr(oof_img, oof_meta)
print(f'DIAG | OOF AUCs -> IMG:{auc_img:.6f} META:{auc_meta:.6f} STACK:{auc_stack:.6f}')
print('DIAG | Stack per-fold AUCs:', {k: round(v,6) for k,v in per_fold_stack.items()})
print(f'DIAG | Pearson corr (IMG vs META): {corr:.6f}')

# Save artifacts
np.save('oof_base_A_faststats.npy', oof_img)
np.save('oof_base_B_meta.npy', oof_meta)
np.save('oof_stack.npy', oof_stack)
print('Saved OOF artifacts: oof_base_A_faststats.npy, oof_base_B_meta.npy, oof_stack.npy')

# Safe blending: fall back to best single if stack underperforms
best_name = 'STACK'
best_auc = auc_stack
best_is_stack = True
if auc_img >= best_auc:
    best_name = 'IMG_FASTSTATS'
    best_auc = auc_img
    best_is_stack = False

print(f'SAFE BLEND DECISION | Best OOF source: {best_name} (AUC={best_auc:.6f})')

if not best_is_stack:
    # Recompute test predictions for fast-stats IMG base quickly (per-fold LR averaging)
    feat_tr = 'features_train_faststats.npy'
    feat_te = 'features_test_faststats.npy'
    assert os.path.exists(feat_tr) and os.path.exists(feat_te), 'Missing cached fast-stats features.'
    X_tr = np.load(feat_tr).astype(np.float32)
    X_te = np.load(feat_te).astype(np.float32)
    # Attach folds
    df_fold = train.merge(folds[['image_name','fold']], on='image_name', how='left')
    assert df_fold['fold'].notna().all()
    df_fold['fold'] = df_fold['fold'].astype(int)
    test_pred_folds = []
    for f in sorted(df_fold['fold'].unique()):
        trn_idx = np.where(df_fold['fold'].values != f)[0]
        y_trn = df_fold['target'].values[trn_idx]
        clf = LogisticRegression(max_iter=3000, class_weight='balanced', solver='lbfgs', n_jobs=1, random_state=SEED)
        clf.fit(X_tr[trn_idx], y_trn)
        test_pred_folds.append(clf.predict_proba(X_te)[:,1].astype(np.float64))
    test_img = np.mean(np.stack(test_pred_folds, axis=0), axis=0)
    sub = pd.read_csv('submission.csv') if os.path.exists('submission.csv') else pd.DataFrame({'image_name': pd.read_csv('test.csv')['image_name']})
    sub['target'] = test_img
    sub.to_csv('submission.csv', index=False)
    np.save('test_base_A_faststats.npy', test_img)
    print('Overwrote submission.csv with best IMG base predictions and saved test_base_A_faststats.npy')
else:
    print('Keeping existing stacked submission.csv (stack was best by OOF)')

DIAG | OOF AUCs -> IMG:0.800667 META:0.671955 STACK:0.798588
DIAG | Stack per-fold AUCs: {0: 0.815261, 1: 0.823625, 2: 0.771792, 3: 0.810228, 4: 0.776719}
DIAG | Pearson corr (IMG vs META): 0.426809
Saved OOF artifacts: oof_base_A_faststats.npy, oof_base_B_meta.npy, oof_stack.npy
SAFE BLEND DECISION | Best OOF source: IMG_FASTSTATS (AUC=0.800667)


Overwrote submission.csv with best IMG base predictions and saved test_base_A_faststats.npy


In [43]:
# EXP3-CPU-TIMM: EfficientNet-B3 (384px) embeddings on CPU + GBDT (LightGBM) meta (target OOF > 0.90)
import os, sys, time, warnings, subprocess, math
from pathlib import Path
import numpy as np
import pandas as pd
warnings.filterwarnings('ignore')
SEED = 42
np.random.seed(SEED)

# ---------------- Install CPU PyTorch + timm + LightGBM ----------------
def ensure(pkg, extra_args=None):
    try:
        __import__(pkg)
    except Exception:
        cmd = [sys.executable, '-m', 'pip', 'install', '--quiet']
        if extra_args: cmd += extra_args
        subprocess.check_call(cmd + [pkg])

# CPU wheels for torch/torchvision
ensure('torch', extra_args=['--index-url', 'https://download.pytorch.org/whl/cpu'])
ensure('torchvision', extra_args=['--index-url', 'https://download.pytorch.org/whl/cpu'])
ensure('timm')
try:
    import lightgbm as lgb
except Exception:
    ensure('lightgbm')
    import lightgbm as lgb

import torch
import timm
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from timm.data import resolve_data_config, create_transform
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedGroupKFold

torch.set_num_threads(max(1, os.cpu_count() - 1))

# ---------------- Data ----------------
DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
JPEG_TEST  = DATA_DIR / 'jpeg' / 'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all()
train['fold'] = train['fold'].astype(int)

# ---------------- TIMM embedding extractor (EffNet-B3 @ 384px) ----------------
BACKBONE = 'efficientnet_b3'  # 1536-dim
IMG_SIZE = 384
BATCH = 128  # tune down to 64 if OOM
NUM_WORKERS = min(16, max(2, os.cpu_count() - 2))

class TimmEmbedDataset(Dataset):
    def __init__(self, paths, transform):
        self.paths = paths
        self.t = transform
    def __len__(self):
        return len(self.paths)
    def __getitem__(self, i):
        p = self.paths[i]
        try:
            with Image.open(p) as im:
                im = im.convert('RGB')
                x = self.t(im)
        except Exception:
            # fallback as zeros image
            x = torch.zeros(3, IMG_SIZE, IMG_SIZE, dtype=torch.float32)
        return x

@torch.no_grad()
def extract_embeddings(paths):
    model = timm.create_model(BACKBONE, pretrained=True, num_classes=0, global_pool='avg')
    model.eval().to('cpu')
    cfg = resolve_data_config({'input_size': (3, IMG_SIZE, IMG_SIZE)}, model=model)
    # no-aug eval transform at target size
    transform = create_transform(input_size=(3, IMG_SIZE, IMG_SIZE), crop_pct=cfg.get('crop_pct', 1.0),
                                 mean=cfg['mean'], std=cfg['std'], interpolation='bilinear')
    ds = TimmEmbedDataset(paths, transform)
    dl = DataLoader(ds, batch_size=BATCH, shuffle=False, num_workers=NUM_WORKERS, pin_memory=False, persistent_workers=True)
    feats = []
    use_amp = hasattr(torch.amp, 'autocast')
    t0 = time.time()
    for i, xb in enumerate(dl):
        if use_amp:
            with torch.amp.autocast('cpu', enabled=True, dtype=torch.bfloat16):
                fb = model(xb).to(torch.float32)
        else:
            fb = model(xb)
        feats.append(fb.cpu().numpy())
        if (i+1) % 50 == 0:
            done = min(len(ds), (i+1)*BATCH)
            print(f'Emb {done}/{len(ds)} ({(done*100)//len(ds)}%)')
    F = np.vstack(feats).astype(np.float32)
    print(f'Embeddings: shape={F.shape} | took {time.time()-t0:.1f}s')
    return F

feat_tr_path = f'features_train_timm_b3_{IMG_SIZE}.npy'
feat_te_path = f'features_test_timm_b3_{IMG_SIZE}.npy'

if os.path.exists(feat_tr_path) and os.path.exists(feat_te_path):
    Ft = np.load(feat_tr_path)
    Fe = np.load(feat_te_path)
    print('Loaded cached TIMM embeddings:', Ft.shape, Fe.shape)
else:
    print(f'[TIMM CONFIG] backbone={BACKBONE} | size={IMG_SIZE} | batch={BATCH} | workers={NUM_WORKERS}')
    Ft = extract_embeddings(train['image_path'].tolist())
    Fe = extract_embeddings(test['image_path'].tolist())
    # basic sanity
    assert Ft.shape[0] == len(train) and Fe.shape[0] == len(test)
    assert np.isfinite(Ft).all() and np.isfinite(Fe).all()
    np.save(feat_tr_path, Ft); np.save(feat_te_path, Fe)
    print('Saved TIMM embeddings to disk.')

# ---------------- GBDT on embeddings (5-fold StratifiedGroupKFold, leak-proof) ----------------
y = train['target'].values.astype(int)
groups = train['patient_id'].astype(str).values if 'patient_id' in train.columns else train['fold'].values
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=SEED)
oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds = []
fold_aucs = []

# Robust LightGBM params
base_params = dict(
    objective='binary',
    learning_rate=0.05,
    n_estimators=20000,
    num_leaves=255,
    max_bin=511,
    min_data_in_bin=1,
    feature_fraction=0.9,
    bagging_fraction=0.8,
    subsample=0.8,
    subsample_freq=1,
    colsample_bytree=0.9,
    reg_lambda=1.0,
    reg_alpha=0.0,
    min_child_samples=20,
    n_jobs=8,
    random_state=SEED
)

for f, (trn_idx, val_idx) in enumerate(sgkf.split(Ft, y, groups)):
    X_trn, X_val = Ft[trn_idx], Ft[val_idx]
    y_trn, y_val = y[trn_idx], y[val_idx]
    pos = max(1, int(y_trn.sum())); neg = max(1, int(len(y_trn)-pos))
    spw = neg / pos
    params = dict(base_params); params['scale_pos_weight'] = spw; params['is_unbalance'] = False
    clf = lgb.LGBMClassifier(**params)
    clf.fit(
        X_trn, y_trn,
        eval_set=[(X_val, y_val)],
        eval_metric='auc',
        callbacks=[lgb.early_stopping(1000, verbose=False)]
    )
    pv = clf.predict_proba(X_val)[:,1]
    oof[val_idx] = pv.astype(np.float32)
    auc = roc_auc_score(y_val, pv)
    fold_aucs.append(float(auc))
    test_pred_folds.append(clf.predict_proba(Fe)[:,1].astype(np.float32))
    print(f'Fold {f}: AUC={auc:.5f}, best_iter={getattr(clf, "best_iteration_", None)}')

oof_auc = roc_auc_score(y, oof)
print('OOF AUC (TIMM-B3 emb + LGBM):', round(oof_auc,6), '| per-fold:', [round(a,6) for a in fold_aucs])

# Save OOF and submission
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_timm_b3_emb_lgbm'] = oof
oof_df.to_csv('oof_meta.csv', index=False)
test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
sub = pd.DataFrame({'image_name': test['image_name'], 'target': test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved oof_meta.csv and submission.csv')

# Persist features + diagnostics
np.save('features_train_timm_b3_384.npy', Ft)
np.save('features_test_timm_b3_384.npy', Fe)
with open('results.md','a') as f:
    f.write(f"EXP3_TIMM_B3_384 | OOF={oof_auc:.6f} | folds={[float(a) for a in fold_aucs]} | time={time.strftime('%Y-%m-%d %H:%M:%S')}\n")

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.26.4 which is incompatible.






ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.26.3 which is incompatible.


In [45]:
# EXP2.5-NN-STACK+VER: Non-negative fold-wise stacker (grid w in [0,1]) + true artifact versioning manifest
import os, json, time, numpy as np, pandas as pd
from pathlib import Path
from sklearn.metrics import roc_auc_score
from scipy.stats import pearsonr

ts = time.strftime('%Y%m%d_%H%M%S')
DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR / 'train.csv')
oof_df = pd.read_csv('oof_meta.csv')

# Identify available columns for base A (image) and base B (metadata)
col_img = None
for c in ['pred_img_faststats_lr', 'pred_img_efficientnet_b0_frozen', 'pred_img_effb0_emb_lr', 'pred_img_timm_b3_emb_lgbm']:
    if c in oof_df.columns: col_img = c; break
col_meta = None
for c in ['pred_meta_lite_lr', 'pred_meta_cb3_cal', 'pred_meta_cb4_postprob', 'pred_meta_cb3_postprob', 'pred_meta_cb', 'pred_meta_lgb']:
    if c in oof_df.columns: col_meta = c; break
assert col_img is not None and col_meta is not None, 'Required OOF columns missing'

# Merge labels safely (handle possible target duplication in oof file)
df = oof_df.merge(train[['image_name','target']], on='image_name', how='left', suffixes=('', '_tr'))
label_col = 'target' if 'target' in df.columns else ('target_tr' if 'target_tr' in df.columns else None)
assert label_col is not None, f"No target column found after merge. Columns: {df.columns.tolist()}"
assert 'fold' in df.columns, 'fold column missing in oof_meta.csv'
y = df[label_col].values.astype(int)
fold = df['fold'].values.astype(int)
oof_img = df[col_img].values.astype(float)
oof_meta = df[col_meta].values.astype(float)

# Diagnostics
auc_img = roc_auc_score(y, oof_img)
auc_meta = roc_auc_score(y, oof_meta)
corr, _ = pearsonr(oof_img, oof_meta)

# Non-negative convex blend per fold: find w in [0,1] maximizing AUC on held-out fold
ws = np.linspace(0.0, 1.0, 101)
oof_stack_nn = np.zeros_like(oof_img, dtype=float)
w_per_fold = {}
auc_per_fold = {}
for f in sorted(np.unique(fold)):
    m = (fold == f)
    best_auc = -1.0
    best_w = 1.0  # default to image-only
    for w in ws:
        blend = w*oof_img[m] + (1.0-w)*oof_meta[m]
        try:
            a = roc_auc_score(y[m], blend)
        except ValueError:
            a = -1.0
        if a > best_auc:
            best_auc = a
            best_w = float(w)
    oof_stack_nn[m] = best_w*oof_img[m] + (1.0-best_w)*oof_meta[m]
    w_per_fold[int(f)] = best_w
    auc_per_fold[int(f)] = best_auc

auc_stack_nn = roc_auc_score(y, oof_stack_nn)
print(f'OOF AUCs | IMG:{auc_img:.6f} META:{auc_meta:.6f} NN-STACK:{auc_stack_nn:.6f} | corr={corr:.6f}')
print('Per-fold NN-STACK AUCs:', {k: round(v,6) for k,v in auc_per_fold.items()})
print('Per-fold blend weights (w for IMG, 1-w for META):', {k: round(v,3) for k,v in w_per_fold.items()})

# Safe-blend decision: if NN-STACK underperforms best base, fall back to best base for submission
best_name = 'NN_STACK'
best_auc = auc_stack_nn
if auc_img >= best_auc and auc_img >= auc_meta:
    best_name, best_auc = 'IMG_BASE', auc_img
elif auc_meta >= best_auc and auc_meta >= auc_img:
    best_name, best_auc = 'META_BASE', auc_meta
print(f'SAFE BLEND | Best by OOF: {best_name} ({best_auc:.6f})')

# Attempt to form test predictions depending on best source and availability of cached test arrays
sub = pd.read_csv('submission.csv') if os.path.exists('submission.csv') else pd.DataFrame({'image_name': pd.read_csv('test.csv')['image_name']})
updated_submission = False

def try_load(path):
    return np.load(path) if os.path.exists(path) else None

test_img = try_load('test_base_A_faststats.npy')  # from Cell 20 when IMG was best
test_meta = try_load('test_base_B_meta.npy')      # may be None

if best_name == 'IMG_BASE' and test_img is not None:
    sub['target'] = test_img.astype(float)
    updated_submission = True
elif best_name == 'META_BASE' and test_meta is not None:
    sub['target'] = test_meta.astype(float)
    updated_submission = True
elif best_name == 'NN_STACK' and (test_img is not None) and (test_meta is not None):
    w_mean = float(np.mean(list(w_per_fold.values()))) if len(w_per_fold)>0 else 0.5
    sub['target'] = (w_mean*test_img + (1.0-w_mean)*test_meta).astype(float)
    updated_submission = True

if updated_submission:
    sub.to_csv('submission.csv', index=False)
    print('submission.csv updated based on safe-blend decision.')
else:
    print('submission.csv kept as-is (missing needed test arrays for chosen blend).')

# True artifact versioning: save OOF arrays and manifest with timestamped names
oofA_path = f'oof_base_A_{ts}.npy'
oofB_path = f'oof_base_B_{ts}.npy'
oofS_path = f'oof_stack_nn_{ts}.npy'
np.save(oofA_path, oof_img)
np.save(oofB_path, oof_meta)
np.save(oofS_path, oof_stack_nn)

# Save a manifest JSON with metrics and decisions
manifest = {
    'timestamp': ts,
    'columns': {'img': col_img, 'meta': col_meta},
    'metrics': {
        'auc_img': float(auc_img),
        'auc_meta': float(auc_meta),
        'auc_stack_nn': float(auc_stack_nn),
        'pearson_corr_img_meta': float(corr),
        'per_fold_auc_stack_nn': {int(k): float(v) for k,v in auc_per_fold.items()}
    },
    'stacker': {
        'type': 'convex_weight_grid_0_1',
        'weights_per_fold_img': {int(k): float(v) for k,v in w_per_fold.items()}
    },
    'safe_blend': {
        'best_source': best_name,
        'best_auc': float(best_auc),
        'test_arrays_present': {'img': bool(test_img is not None), 'meta': bool(test_meta is not None)}
    },
    'artifacts': {
        'oof_img': oofA_path,
        'oof_meta': oofB_path,
        'oof_stack_nn': oofS_path,
        'submission': 'submission.csv'
    }
}
with open(f'manifest_{ts}.json','w') as f:
    json.dump(manifest, f, indent=2)
print('Versioned artifacts saved:', oofA_path, oofB_path, oofS_path, f'manifest_{ts}.json')

OOF AUCs | IMG:0.800667 META:0.671955 NN-STACK:0.798677 | corr=0.426809
Per-fold NN-STACK AUCs: {0: 0.817176, 1: 0.823855, 2: 0.781405, 3: 0.811835, 4: 0.789059}
Per-fold blend weights (w for IMG, 1-w for META): {0: 1.0, 1: 0.91, 2: 0.58, 3: 0.82, 4: 0.45}
SAFE BLEND | Best by OOF: IMG_BASE (0.800667)
submission.csv updated based on safe-blend decision.
Versioned artifacts saved: oof_base_A_20250824_061012.npy oof_base_B_20250824_061012.npy oof_stack_nn_20250824_061012.npy manifest_20250824_061012.json


In [59]:
# EXP3-ISOLATED (venv, streamlined): Use existing .venv_timm to extract TIMM embeddings (resnet18 @224) with manual preprocessing + LGBM head
import os, sys, json, time, subprocess, textwrap
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

SEED = 42
np.random.seed(SEED)
DATA_DIR = Path('.')

# Safer, simpler backbone for CPU stability
BACKBONE = os.environ.get('TIMM_BACKBONE', 'resnet18')
IMG_SIZE = int(os.environ.get('TIMM_IMG_SIZE', '224'))
BATCH = 1  # strictly sequential
PILOT_N = 200  # small pilot subset

train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
JPEG_TEST  = DATA_DIR / 'jpeg' / 'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all()
train['fold'] = train['fold'].astype(int)

if PILOT_N > 0:
    train = train.iloc[:PILOT_N].copy()
    test = test.iloc[:min(PILOT_N, len(test))].copy()
    print(f'[PILOT] Using subset of train={len(train)} test={len(test)}')

# Paths
VENV_DIR = Path('.venv_timm')
PY_BIN = VENV_DIR / 'bin' / 'python'
assert VENV_DIR.exists(), 'Missing .venv_timm. Run venv install cell first.'

suffix = f'pilot{PILOT_N}_' if PILOT_N > 0 else ''
ft_tr = f"features_train_timm_{suffix}{BACKBONE}_{IMG_SIZE}.npy"
ft_te = f"features_test_timm_{suffix}{BACKBONE}_{IMG_SIZE}.npy"

def clean_env():
    env = os.environ.copy()
    for k in ['PIP_TARGET','PYTHONPATH','PYTHONUSERBASE','PIP_PREFIX','PIP_SRC','PIP_BUILD','PIP_CONFIG_FILE']:
        env.pop(k, None)
    # threading/env guards to avoid segfaults
    env.setdefault('OMP_NUM_THREADS', '1')
    env.setdefault('MKL_NUM_THREADS', '1')
    env.setdefault('OPENBLAS_NUM_THREADS', '1')
    env.setdefault('NUMEXPR_NUM_THREADS', '1')
    env.setdefault('MKL_THREADING_LAYER', 'GNU')
    env.setdefault('MALLOC_ARENA_MAX', '1')
    # Torch guards
    env.setdefault('TORCH_NUM_THREADS', '1')
    env.setdefault('TORCH_CPP_LOG_LEVEL', 'ERROR')
    return env

def run(cmd, env=None):
    print('>', ' '.join(cmd))
    subprocess.check_call(cmd, env=(env or clean_env()))

# Write path lists
Path('train_paths.txt').write_text('\n'.join(train['image_path'].tolist()))
Path('test_paths.txt').write_text('\n'.join(test['image_path'].tolist()))

# Embedding extractor script (runs inside venv): manual PIL resize + mean/std normalize, no torchvision/timm transforms
embed_script = Path('embed_extract.py')
embed_code = '''
import os, sys, time, numpy as np
from PIL import Image
import torch, timm

BACKBONE = os.environ.get('BACKBONE', '__BACKBONE__')
IMG_SIZE = int(os.environ.get('IMG_SIZE', '__IMG_SIZE__'))
torch.set_num_threads(1)
torch.set_num_interop_threads(1)
try:
    torch.backends.mkldnn.enabled = False
except Exception:
    pass

# Standard ImageNet mean/std
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def read_paths(p):
    with open(p, 'r') as f:
        return [ln.strip() for ln in f if ln.strip()]

def load_img_tensor(path):
    try:
        with Image.open(path) as im:
            im = im.convert('RGB').resize((IMG_SIZE, IMG_SIZE), Image.BILINEAR)
            x = np.asarray(im, dtype=np.float32) / 255.0  # HWC
            x = (x - MEAN) / STD
            x = np.transpose(x, (2,0,1))  # CHW
            xb = torch.from_numpy(x).unsqueeze(0)  # 1xCxHxW
            return xb
    except Exception:
        return torch.zeros(1, 3, IMG_SIZE, IMG_SIZE, dtype=torch.float32)

@torch.no_grad()
def extract(paths):
    model = timm.create_model(BACKBONE, pretrained=True, num_classes=0, global_pool='avg')
    model.eval()
    feats = []
    t0 = time.time()
    for i, p in enumerate(paths):
        xb = load_img_tensor(p)
        fb = model(xb).cpu().numpy().astype(np.float32)
        feats.append(fb)
        if (i+1) % 50 == 0:
            print(f'Emb {i+1}/{len(paths)} ({((i+1)*100)//len(paths)}%)', flush=True)
    F = np.vstack(feats).astype(np.float32)
    print('Embeddings shape:', F.shape, 'took', round(time.time()-t0,1), 's', flush=True)
    return F

train_list, test_list, out_tr, out_te = sys.argv[1:5]
tr_paths = read_paths(train_list)
te_paths = read_paths(test_list)
Ft = extract(tr_paths)
Fe = extract(te_paths)
assert Ft.shape[0]==len(tr_paths) and Fe.shape[0]==len(te_paths)
assert np.isfinite(Ft).all() and np.isfinite(Fe).all()
np.save(out_tr, Ft); np.save(out_te, Fe)
print('Saved:', out_tr, out_te)
'''
embed_code = embed_code.replace('__BACKBONE__', BACKBONE).replace('__IMG_SIZE__', str(IMG_SIZE))
embed_script.write_text(textwrap.dedent(embed_code))

# Extract embeddings if not cached
if not (Path(ft_tr).exists() and Path(ft_te).exists()):
    env = clean_env()
    env.update({'BACKBONE': BACKBONE, 'IMG_SIZE': str(IMG_SIZE)})
    # Ensure numpy available in venv runtime
    run([str(PY_BIN), '-m', 'pip', 'install', '--quiet', 'numpy==1.26.4'], env=env)
    run([str(PY_BIN), str(embed_script), 'train_paths.txt', 'test_paths.txt', ft_tr, ft_te], env=env)
else:
    print('Cached TIMM embeddings found:', ft_tr, ft_te)

# Train LightGBM on embeddings (pilot: holdout only)
import lightgbm as lgb
Ft = np.load(ft_tr)
Fe = np.load(ft_te)
y = train['target'].values.astype(int)

if PILOT_N > 0:
    m = int(len(train)*0.8)
    X_trn, X_val = Ft[:m], Ft[m:]
    y_trn, y_val = y[:m], y[m:]
    pos = max(1, int(y_trn.sum())); neg = max(1, int(len(y_trn)-pos)); spw = neg/pos
    params = dict(objective='binary', learning_rate=0.05, n_estimators=2000, num_leaves=127, max_bin=255,
                  feature_fraction=0.9, bagging_fraction=0.8, subsample=0.8, subsample_freq=1, colsample_bytree=0.9,
                  reg_lambda=1.0, reg_alpha=0.0, min_child_samples=20, n_jobs=4, random_state=SEED)
    clf = lgb.LGBMClassifier(**params, scale_pos_weight=spw, is_unbalance=False)
    clf.fit(X_trn, y_trn, eval_set=[(X_val, y_val)], eval_metric='auc', callbacks=[lgb.early_stopping(200, verbose=False)])
    from sklearn.metrics import roc_auc_score
    pv = clf.predict_proba(X_val)[:,1]
    try:
        auc = roc_auc_score(y_val, pv)
        print('PILOT holdout AUC:', round(auc,6))
    except Exception:
        print('PILOT AUC could not be computed (insufficient positives).')
else:
    print('Full CV is disabled in pilot-mode script.')

with open('results.md','a') as f:
    f.write(f"EXP3_ISO_TIMM_SEQ_SAFE | backbone={BACKBONE} | size={IMG_SIZE} | pilot={PILOT_N} | time={time.strftime('%Y-%m-%d %H:%M:%S')}\n")
print('Done EXP3-ISOLATED sequential extractor (safe preprocessing).')

[PILOT] Using subset of train=200 test=200
> .venv_timm/bin/python -m pip install --quiet numpy==1.26.4


> .venv_timm/bin/python embed_extract.py train_paths.txt test_paths.txt features_train_timm_pilot200_resnet18_224.npy features_test_timm_pilot200_resnet18_224.npy


CalledProcessError: Command '['.venv_timm/bin/python', 'embed_extract.py', 'train_paths.txt', 'test_paths.txt', 'features_train_timm_pilot200_resnet18_224.npy', 'features_test_timm_pilot200_resnet18_224.npy']' died with <Signals.SIGSEGV: 11>.

In [56]:
# PHASE 2a: Create clean venv and install CPU torch/torchvision + timm (per audit directive)
import os, sys, subprocess, time
from pathlib import Path

VENV_DIR = Path('.venv_timm')
PY = VENV_DIR / 'bin' / 'python'

def run(cmd, env=None):
    print('>', ' '.join(cmd));
    subprocess.check_call(cmd, env=env)

def clean_env():
    env = os.environ.copy()
    for k in ['PIP_TARGET','PYTHONPATH','PYTHONUSERBASE','PIP_PREFIX','PIP_SRC','PIP_BUILD']:
        env.pop(k, None)
    env['PIP_DISABLE_PIP_VERSION_CHECK'] = '1'
    env['PIP_NO_CACHE_DIR'] = '1'
    env['PIP_USER'] = 'no'
    return env

if not VENV_DIR.exists():
    run([sys.executable, '-m', 'venv', str(VENV_DIR)], env=clean_env())

# Upgrade pip
run([str(PY), '-m', 'pip', 'install', '--upgrade', 'pip'], env=clean_env())

# Install numpy
run([str(PY), '-m', 'pip', 'install', '--upgrade', 'numpy==1.26.4'], env=clean_env())

# Install CPU torch/torchvision from official index
run([str(PY), '-m', 'pip', 'install', '--upgrade', '--index-url', 'https://download.pytorch.org/whl/cpu', 'torch', 'torchvision'], env=clean_env())

# Install timm and pillow
run([str(PY), '-m', 'pip', 'install', '--upgrade', 'timm==0.9.16', 'pillow'], env=clean_env())

# Sanity check
run([str(PY), '-c', 'import numpy, torch, timm, PIL; print("venv ok", numpy.__version__, torch.__version__, timm.__version__)'], env=clean_env())
print('Venv ready at', VENV_DIR.resolve())

> .venv_timm/bin/python -m pip install --upgrade pip




> .venv_timm/bin/python -m pip install --upgrade numpy==1.26.4




> .venv_timm/bin/python -m pip install --upgrade --index-url https://download.pytorch.org/whl/cpu torch torchvision


Looking in indexes: https://download.pytorch.org/whl/cpu


Collecting torch
  Downloading https://download.pytorch.org/whl/cpu/torch-2.8.0%2Bcpu-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (29 kB)


Collecting torchvision
  Downloading https://download.pytorch.org/whl/cpu/torchvision-0.23.0%2Bcpu-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (6.1 kB)
Downloading https://download.pytorch.org/whl/cpu/torch-2.8.0%2Bcpu-cp311-cp311-manylinux_2_28_x86_64.whl (184.1 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/184.1 MB[0m [31m?[0m eta [36m-:--:--[0m

[2K   [91m━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.0/184.1 MB[0m [31m258.4 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━[0m [32m110.4/184.1 MB[0m [31m282.0 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━[0m [32m172.0/184.1 MB[0m [31m290.3 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m184.1/184.1 MB[0m [31m290.7 MB/s[0m  [33m0:00:00[0m
[?25hDownloading https://download.pytorch.org/whl/cpu/torchvision-0.23.0%2Bcpu-cp311-cp311-manylinux_2_28_x86_64.whl (1.9 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/1.9 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m228.4 MB/s[0m  [33m0:00:00[0m
[?25h

Installing collected packages: torch, torchvision
[?25l[2K  Attempting uninstall: torch
[2K    Found existing installation: torch 2.8.0
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K    Uninstalling torch-2.8.0:
   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K      Successfully uninstalled torch-2.8.0
   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K  Attempting uninstall: torchvision
   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K    Found existing installation: torchvision 0.23.0
   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K    Uninstalling torchvision-0.23.0:
   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K      Successfully uninstalled torchvision-0.23.0
   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0/2[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━[0m [32m1/2[0m [torchvision]

[2K   [91m━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━[0m [32m1/2[0m [torchvision][2K   [91m━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━[0m [32m1/2[0m [torchvision][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2/2[0m [torchvision]
[?25h[1A[2KSuccessfully installed torch-2.8.0+cpu torchvision-0.23.0+cpu


> .venv_timm/bin/python -m pip install --upgrade timm==0.9.16 pillow






> .venv_timm/bin/python -c import numpy, torch, timm, PIL; print("venv ok", numpy.__version__, torch.__version__, timm.__version__)


CalledProcessError: Command '['.venv_timm/bin/python', '-c', 'import numpy, torch, timm, PIL; print("venv ok", numpy.__version__, torch.__version__, timm.__version__)']' died with <Signals.SIGSEGV: 11>.

In [60]:
# EXP3-INPROC: Import torch/timm from .venv_timm in-process (no subprocess) and extract resnet18 embeddings sequentially
import os, sys, time, json
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image
from sklearn.metrics import roc_auc_score

SEED = 42
np.random.seed(SEED)
DATA_DIR = Path('.')
VENV_DIR = Path('.venv_timm')
assert VENV_DIR.exists(), 'Missing .venv_timm. Run installer cell (Cell 24) first.'

# Add venv site-packages to current interpreter to avoid spawning a subprocess (which segfaults)
pyver = f"python{sys.version_info.major}.{sys.version_info.minor}"
site_dir = VENV_DIR / 'lib' / pyver / 'site-packages'
assert site_dir.exists(), f'Site-packages not found at {site_dir}'
if str(site_dir) not in sys.path:
    sys.path.insert(0, str(site_dir))

# Threading guards to avoid OpenMP/MKL conflicts
os.environ.setdefault('OMP_NUM_THREADS', '1')
os.environ.setdefault('MKL_NUM_THREADS', '1')
os.environ.setdefault('OPENBLAS_NUM_THREADS', '1')
os.environ.setdefault('NUMEXPR_NUM_THREADS', '1')
os.environ.setdefault('MKL_THREADING_LAYER', 'GNU')
os.environ.setdefault('MALLOC_ARENA_MAX', '1')

import torch
import timm
from timm.data import resolve_data_config, create_transform

torch.set_num_threads(1)

BACKBONE = 'resnet18'
IMG_SIZE = 224
PILOT_N = 200  # small pilot to validate end-to-end

train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
JPEG_TEST  = DATA_DIR / 'jpeg' / 'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all()
train['fold'] = train['fold'].astype(int)

if PILOT_N > 0:
    train = train.iloc[:PILOT_N].copy()
    test = test.iloc[:min(PILOT_N, len(test))].copy()
    print(f'[INPROC PILOT] train={len(train)} test={len(test)}')

@torch.no_grad()
def build_and_transform():
    model = timm.create_model(BACKBONE, pretrained=True, num_classes=0, global_pool='avg')
    model.eval()
    cfg = resolve_data_config({'input_size': (3, IMG_SIZE, IMG_SIZE)}, model=model)
    tfm = create_transform(input_size=(3, IMG_SIZE, IMG_SIZE), crop_pct=cfg.get('crop_pct', 1.0),
                           mean=cfg['mean'], std=cfg['std'], interpolation='bilinear')
    return model, tfm

def load_img(path, tfm):
    try:
        with Image.open(path) as im:
            im = im.convert('RGB')
            x = tfm(im).unsqueeze(0)  # 1xCxHxW
            return x
    except Exception:
        return torch.zeros(1, 3, IMG_SIZE, IMG_SIZE, dtype=torch.float32)

@torch.no_grad()
def extract_seq(paths):
    model, tfm = build_and_transform()
    feats = []
    t0 = time.time()
    for i, p in enumerate(paths):
        xb = load_img(p, tfm)
        fb = model(xb).cpu().numpy().astype(np.float32)
        feats.append(fb)
        if (i+1) % 50 == 0:
            print(f'Emb {i+1}/{len(paths)} ({((i+1)*100)//len(paths)}%)')
    F = np.vstack(feats).astype(np.float32)
    print(f'Embeddings shape: {F.shape} | took {time.time()-t0:.1f}s')
    assert np.isfinite(F).all(), 'Non-finite values in embeddings'
    return F

ft_tr = f'features_train_timm_inproc_pilot{PILOT_N}_{BACKBONE}_{IMG_SIZE}.npy'
ft_te = f'features_test_timm_inproc_pilot{PILOT_N}_{BACKBONE}_{IMG_SIZE}.npy'

if not (Path(ft_tr).exists() and Path(ft_te).exists()):
    Ft = extract_seq(train['image_path'].tolist())
    Fe = extract_seq(test['image_path'].tolist())
    np.save(ft_tr, Ft); np.save(ft_te, Fe)
    print('Saved:', ft_tr, ft_te)
else:
    Ft = np.load(ft_tr); Fe = np.load(ft_te)
    print('Loaded cached:', Ft.shape, Fe.shape)

# Simple holdout AUC on pilot
y = train['target'].values.astype(int)
m = int(len(train)*0.8)
if m > 0 and m < len(train):
    X_trn, X_val = Ft[:m], Ft[m:]
    y_trn, y_val = y[:m], y[m:]
    # tiny logistic regression head to score quickly
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline
    clf = Pipeline([('sc', StandardScaler(with_mean=False)),
                    ('lr', LogisticRegression(max_iter=2000, class_weight='balanced', n_jobs=1, random_state=SEED))])
    clf.fit(X_trn, y_trn)
    pv = clf.predict_proba(X_val)[:,1]
    try:
        auc = roc_auc_score(y_val, pv)
        print('INPROC PILOT holdout AUC:', round(auc, 6))
    except Exception as e:
        print('AUC unavailable:', e)
else:
    print('Pilot too small for holdout split.')

print('Done EXP3-INPROC pilot.')

KernelDied: Kernel died unexpectedly.

In [64]:
# EXP3-ONNX (Local build): EfficientNetB0 -> ONNX via tf2onnx, then ORT logits — sequential, stable CPU pipeline
import os, sys, time, json, subprocess
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image

# Threading guards
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['NUMEXPR_NUM_THREADS'] = '1'
os.environ['MALLOC_ARENA_MAX'] = '1'

SEED = 42
np.random.seed(SEED)
DATA_DIR = Path('.')
MODEL_PATH = Path('efficientnetb0_imagenet_224.onnx')
IMG_SIZE = 224
RESIZE_SHORT = 256
PILOT_N = 200

def pip_install(pkgs):
    cmd = [sys.executable, '-m', 'pip', 'install', '--quiet'] + pkgs
    print('> pip', ' '.join(pkgs))
    subprocess.check_call(cmd)

# Minimal deps
pip_install(['onnxruntime==1.18.0', 'onnx==1.16.2', 'tf2onnx==1.16.1'])
import onnxruntime as ort

# Build ONNX locally from Keras if missing
if not MODEL_PATH.exists() or MODEL_PATH.stat().st_size == 0:
    print('Building EfficientNetB0 Keras model and converting to ONNX ...')
    import tensorflow as tf
    import tf2onnx
    tf.random.set_seed(SEED)
    # Keras EfficientNetB0 with ImageNet top for 1000-dim logits
    model = tf.keras.applications.EfficientNetB0(include_top=True, weights='imagenet', input_shape=(IMG_SIZE, IMG_SIZE, 3))
    spec = (tf.TensorSpec((None, IMG_SIZE, IMG_SIZE, 3), tf.float32, name='input'),)
    onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=13)
    with open(MODEL_PATH, 'wb') as f:
        f.write(onnx_model.SerializeToString())
    print('Saved ONNX to', MODEL_PATH)

def make_session(model_path):
    so = ort.SessionOptions()
    so.intra_op_num_threads = 1
    so.inter_op_num_threads = 1
    return ort.InferenceSession(str(model_path), sess_options=so, providers=['CPUExecutionProvider'])

def resize_shorter_side(im, short=RESIZE_SHORT):
    w, h = im.size
    if w <= 0 or h <= 0:
        return im.resize((short, short), Image.BILINEAR)
    if w < h:
        nw, nh = short, int(round(h * short / w))
    else:
        nh, nw = short, int(round(w * short / h))
    return im.resize((nw, nh), Image.BILINEAR)

# EfficientNet (tf mode) preprocessing: scale to [-1, 1]
def preprocess_efficientnet(path):
    try:
        with Image.open(path) as im:
            im = im.convert('RGB')
            im = resize_shorter_side(im, RESIZE_SHORT)
            w, h = im.size
            left = (w - IMG_SIZE) // 2
            top  = (h - IMG_SIZE) // 2
            im = im.crop((left, top, left + IMG_SIZE, top + IMG_SIZE))
            arr = np.asarray(im, dtype=np.float32)  # 0..255
            arr = (arr / 127.5) - 1.0  # [-1, 1]
            nhwc = arr[np.newaxis, ...]  # 1,H,W,3
            return nhwc
    except Exception:
        return np.zeros((1, IMG_SIZE, IMG_SIZE, 3), dtype=np.float32)

def extract_logits(paths):
    sess = make_session(MODEL_PATH)
    inp_name = sess.get_inputs()[0].name
    out_name = sess.get_outputs()[0].name
    feats = np.zeros((len(paths), 1000), dtype=np.float32)
    t0 = time.time()
    for i, p in enumerate(paths):
        x = preprocess_efficientnet(p)
        y = sess.run([out_name], {inp_name: x})[0]  # (1,1000)
        feats[i] = y.reshape(-1).astype(np.float32)
        if (i+1) % 100 == 0:
            print(f'Emb {i+1}/{len(paths)} ({(i+1)*100//len(paths)}%)')
    print(f'ONNX logits extraction done: shape={feats.shape} | {time.time()-t0:.1f}s')
    assert np.isfinite(feats).all(), 'Non-finite values in ONNX features'
    return feats

# Data
train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
JPEG_TEST  = DATA_DIR / 'jpeg' / 'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all(); train['fold'] = train['fold'].astype(int)

# Pilot subset
tri = train.iloc[:min(PILOT_N, len(train))].copy()
tei = test.iloc[:min(PILOT_N, len(test))].copy()
print(f'[ONNX LOCAL PILOT] train={len(tri)} test={len(tei)}')

ft_tr = f'features_train_onnx_effb0_logits_pilot{len(tri)}_{IMG_SIZE}.npy'
ft_te = f'features_test_onnx_effb0_logits_pilot{len(tei)}_{IMG_SIZE}.npy'

if not (Path(ft_tr).exists() and Path(ft_te).exists()):
    Ft = extract_logits(tri['image_path'].tolist())
    Fe = extract_logits(tei['image_path'].tolist())
    np.save(ft_tr, Ft); np.save(ft_te, Fe)
    print('Saved:', ft_tr, ft_te)
else:
    Ft = np.load(ft_tr); Fe = np.load(ft_te)
    print('Loaded cached:', Ft.shape, Fe.shape)

# Pilot AUC (holdout 80/20)
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
y = tri['target'].values.astype(int)
m = max(1, int(len(tri)*0.8))
if m < len(tri):
    X_trn, X_val = Ft[:m], Ft[m:]
    y_trn, y_val = y[:m], y[m:]
    pos = max(1, int(y_trn.sum())); neg = max(1, int(len(y_trn)-pos)); spw = neg/pos
    params = dict(objective='binary', learning_rate=0.05, n_estimators=3000, num_leaves=255, max_bin=511,
                  feature_fraction=0.9, bagging_fraction=0.8, subsample=0.8, subsample_freq=1, colsample_bytree=0.9,
                  reg_lambda=1.0, reg_alpha=0.0, min_child_samples=20, n_jobs=4, random_state=SEED)
    clf = lgb.LGBMClassifier(**params, scale_pos_weight=spw, is_unbalance=False)
    clf.fit(X_trn, y_trn, eval_set=[(X_val, y_val)], eval_metric='auc', callbacks=[lgb.early_stopping(200, verbose=False)])
    pv = clf.predict_proba(X_val)[:,1]
    try:
        auc = roc_auc_score(y_val, pv)
        print('ONNX LOCAL PILOT holdout AUC:', round(auc, 6))
    except Exception as e:
        print('AUC unavailable:', e)
else:
    print('Pilot too small for holdout split.')

with open('results.md','a') as f:
    f.write(f"EXP3_ONNX_LOCAL_EFFB0_PILOT | img={IMG_SIZE} | pilot={len(tri)} | time={time.strftime('%Y-%m-%d %H:%M:%S')}\n")
print('Done EXP3-ONNX local pilot.')

> pip onnxruntime==1.18.0 onnx==1.16.2 tf2onnx==1.16.1


[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.26.4 which is incompatible.
grpcio-status 1.74.0 requires protobuf<7.0.0,>=6.31.1, but you have protobuf 3.20.3 which is incompatible.[0m[31m
[0m

Building EfficientNetB0 Keras model and converting to ONNX ...


2025-08-24 07:09:12.541976: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-08-24 07:09:12.640641: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-08-24 07:09:12.642338: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.




2025-08-24 07:09:15.727820: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2025-08-24 07:09:15.730023: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2025-08-24 07:09:15.732029: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysf

Downloading data from https://storage.googleapis.com/keras-applications/efficientnetb0.h5


    8192/21834768 [..............................] - ETA: 0s

 2588672/21834768 [==>...........................] - ETA: 0s









2025-08-24 07:09:19.728435: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2025-08-24 07:09:19.730945: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2025-08-24 07:09:19.733050: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysf

2025-08-24 07:09:23.461632: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2025-08-24 07:09:23.463292: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2025-08-24 07:09:23.464698: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysf

Saved ONNX to efficientnetb0_imagenet_224.onnx


[ONNX LOCAL PILOT] train=200 test=200


Emb 100/200 (50%)


Emb 200/200 (100%)
ONNX logits extraction done: shape=(200, 1000) | 41.6s


Emb 100/200 (50%)


Emb 200/200 (100%)
ONNX logits extraction done: shape=(200, 1000) | 37.4s
Saved: features_train_onnx_effb0_logits_pilot200_224.npy features_test_onnx_effb0_logits_pilot200_224.npy
[LightGBM] [Info] Number of positive: 1, number of negative: 159
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003848 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 54955


[LightGBM] [Info] Number of data points in the train set: 160, number of used features: 1000
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006250 -> initscore=-5.068904
[LightGBM] [Info] Start training from score -5.068904
AUC unavailable: Only one class present in y_true. ROC AUC score is not defined in that case.
Done EXP3-ONNX local pilot.


In [65]:
# DEPRECATED: Blocked to prevent unsafe submission overwrites. Use hardened cells 40, 45, 46, 47, or 48.
import sys
raise SystemExit("DEPRECATED CELL 27: submission writes are blocked. Use hardened pipelines (Cells 40/45/46/47/48) with gating and manifests.")

Emb 500/28984 (1%)


Emb 1000/28984 (3%)


Emb 1500/28984 (5%)


Emb 2000/28984 (6%)


Emb 2500/28984 (8%)


Emb 3000/28984 (10%)


Emb 3500/28984 (12%)


Emb 4000/28984 (13%)


Emb 4500/28984 (15%)


Emb 5000/28984 (17%)


Emb 5500/28984 (18%)


Emb 6000/28984 (20%)


Emb 6500/28984 (22%)


Emb 7000/28984 (24%)


Emb 7500/28984 (25%)


Emb 8000/28984 (27%)


Emb 8500/28984 (29%)


Emb 9000/28984 (31%)


Emb 9500/28984 (32%)


Emb 10000/28984 (34%)


Emb 10500/28984 (36%)


Emb 11000/28984 (37%)


Emb 11500/28984 (39%)


Emb 12000/28984 (41%)


Emb 12500/28984 (43%)


Emb 13000/28984 (44%)


Emb 13500/28984 (46%)


Emb 14000/28984 (48%)


Emb 14500/28984 (50%)


Emb 15000/28984 (51%)


Emb 15500/28984 (53%)


Emb 16000/28984 (55%)


Emb 16500/28984 (56%)


Emb 17000/28984 (58%)


Emb 17500/28984 (60%)


Emb 18000/28984 (62%)


Emb 18500/28984 (63%)


Emb 19000/28984 (65%)


Emb 19500/28984 (67%)


Emb 20000/28984 (69%)


Emb 20500/28984 (70%)


Emb 21000/28984 (72%)


Emb 21500/28984 (74%)


Emb 22000/28984 (75%)


Emb 22500/28984 (77%)


Emb 23000/28984 (79%)


Emb 23500/28984 (81%)


Emb 24000/28984 (82%)


Emb 24500/28984 (84%)


Emb 25000/28984 (86%)


Emb 25500/28984 (87%)


Emb 26000/28984 (89%)


Emb 26500/28984 (91%)


Emb 27000/28984 (93%)


Emb 27500/28984 (94%)


Emb 28000/28984 (96%)


Emb 28500/28984 (98%)


ONNX logits extraction done: shape=(28984, 1000) | 5260.2s


Emb 500/4142 (12%)


Emb 1000/4142 (24%)


Emb 1500/4142 (36%)


Emb 2000/4142 (48%)


Emb 2500/4142 (60%)


Emb 3000/4142 (72%)


Emb 3500/4142 (84%)


Emb 4000/4142 (96%)


ONNX logits extraction done: shape=(4142, 1000) | 776.6s
Saved: features_train_onnx_effb0_logits_224.npy features_test_onnx_effb0_logits_224.npy


[LightGBM] [Info] Number of positive: 410, number of negative: 23138
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.166087 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 511000


[LightGBM] [Info] Number of data points in the train set: 23548, number of used features: 1000
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017411 -> initscore=-4.033074
[LightGBM] [Info] Start training from score -4.033074




































































































































































































































































































































































































































































































































































































































































































































































































































































































































































Fold 0: AUC=0.78757, best_iter=70


[LightGBM] [Info] Number of positive: 411, number of negative: 22604
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.111907 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 511000


[LightGBM] [Info] Number of data points in the train set: 23015, number of used features: 1000


[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017858 -> initscore=-4.007289
[LightGBM] [Info] Start training from score -4.007289




























































































































































































































































































































































































































































































































































































































































































































































































































































































































































































Fold 1: AUC=0.78828, best_iter=88


[LightGBM] [Info] Number of positive: 413, number of negative: 23026
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.149466 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 511000


[LightGBM] [Info] Number of data points in the train set: 23439, number of used features: 1000
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017620 -> initscore=-4.020932
[LightGBM] [Info] Start training from score -4.020932








































































































































































































































































































































































































































































































































































































































































































































































































































































































































































Fold 2: AUC=0.77163, best_iter=88


[LightGBM] [Info] Number of positive: 407, number of negative: 22357
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.113949 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 511000


[LightGBM] [Info] Number of data points in the train set: 22764, number of used features: 1000
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017879 -> initscore=-4.006082
[LightGBM] [Info] Start training from score -4.006082






























































































































































































































































































































































































































































































































































































































































































































































Fold 3: AUC=0.83810, best_iter=93


[LightGBM] [Info] Number of positive: 411, number of negative: 22759
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.110718 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 511000


[LightGBM] [Info] Number of data points in the train set: 23170, number of used features: 1000


[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017738 -> initscore=-4.014123
[LightGBM] [Info] Start training from score -4.014123
















































































































































































































































































































































































































































































































































































































































































































































































































































































































































































Fold 4: AUC=0.79413, best_iter=74
OOF AUC (ONNX EffB0 logits + LGBM): 0.796123 | per-fold: [0.787566, 0.788282, 0.771625, 0.838104, 0.79413]


Saved oof_meta.csv and submission.csv


In [69]:
# DEPRECATED: Blocked to prevent unsafe submission overwrites. Use hardened cells 40, 45, 46, 47, or 48.
import sys
raise SystemExit("DEPRECATED CELL 28: submission writes are blocked. Use hardened pipelines (Cells 40/45/46/47/48) with gating and manifests.")

> pip onnxruntime==1.18.0 onnx==1.16.2 tf2onnx==1.16.1


[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.26.4 which is incompatible.
grpcio-status 1.74.0 requires protobuf<7.0.0,>=6.31.1, but you have protobuf 3.20.3 which is incompatible.[0m[31m
[0m

Loaded cached ONNX embeddings: (28984, 1280) (4142, 1280)


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Fold 0: AUC=0.77305


In [67]:
# EXP3-ONNX-B3: EfficientNetB3 penultimate embeddings (1536D) + LogisticRegression CV (target OOF > 0.92)
import os, sys, time, subprocess
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score

# Threading guards
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['NUMEXPR_NUM_THREADS'] = '1'
os.environ['MALLOC_ARENA_MAX'] = '1'

SEED = 42
IMG_SIZE = 300  # EfficientNetB3 default
RESIZE_SHORT = 342  # ~1.14 * IMG_SIZE (ImageNet convention)
MODEL_PATH = Path('efficientnetb3_imagenet_300_embed.onnx')  # penultimate GAP output (1536D)

def pip_install(pkgs):
    cmd = [sys.executable, '-m', 'pip', 'install', '--quiet'] + pkgs
    print('> pip', ' '.join(pkgs))
    subprocess.check_call(cmd)

# Ensure deps
pip_install(['onnxruntime==1.18.0', 'onnx==1.16.2', 'tf2onnx==1.16.1'])
import onnxruntime as ort

# Build ONNX (EfficientNetB3 include_top=False + GAP) if missing
if not MODEL_PATH.exists() or MODEL_PATH.stat().st_size == 0:
    print('Building EfficientNetB3 (penultimate GAP) and converting to ONNX ...')
    import tensorflow as tf
    import tf2onnx
    tf.random.set_seed(SEED)
    base = tf.keras.applications.EfficientNetB3(include_top=False, weights='imagenet', input_shape=(IMG_SIZE, IMG_SIZE, 3))
    x = tf.keras.layers.GlobalAveragePooling2D(name='gap')(base.output)  # 1536D
    model = tf.keras.Model(base.input, x)
    spec = (tf.TensorSpec((None, IMG_SIZE, IMG_SIZE, 3), tf.float32, name='input'),)
    onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=13)
    with open(MODEL_PATH, 'wb') as f:
        f.write(onnx_model.SerializeToString())
    print('Saved ONNX penultimate model to', MODEL_PATH)

def make_session(model_path):
    so = ort.SessionOptions()
    so.intra_op_num_threads = 1
    so.inter_op_num_threads = 1
    return ort.InferenceSession(str(model_path), sess_options=so, providers=['CPUExecutionProvider'])

def resize_shorter_side(im, short=RESIZE_SHORT):
    w, h = im.size
    if w <= 0 or h <= 0:
        return im.resize((short, short), Image.BILINEAR)
    if w < h:
        nw, nh = short, int(round(h * short / w))
    else:
        nh, nw = short, int(round(w * short / h))
    return im.resize((nw, nh), Image.BILINEAR)

# EfficientNet TF-mode preprocessing to [-1, 1]
def preprocess_efficientnet(path):
    try:
        with Image.open(path) as im:
            im = im.convert('RGB')
            im = resize_shorter_side(im, RESIZE_SHORT)
            w, h = im.size
            left = (w - IMG_SIZE) // 2
            top  = (h - IMG_SIZE) // 2
            im = im.crop((left, top, left + IMG_SIZE, top + IMG_SIZE))
            arr = np.asarray(im, dtype=np.float32)  # 0..255
            arr = (arr / 127.5) - 1.0  # [-1,1]
            return arr[np.newaxis, ...]  # 1,H,W,3
    except Exception:
        return np.zeros((1, IMG_SIZE, IMG_SIZE, 3), dtype=np.float32)

def extract_embeddings(paths):
    sess = make_session(MODEL_PATH)
    inp_name = sess.get_inputs()[0].name
    out_name = sess.get_outputs()[0].name
    t0 = time.time()
    feats = []
    for i, p in enumerate(paths):
        x = preprocess_efficientnet(p)
        y = sess.run([out_name], {inp_name: x})[0]  # (1,1536)
        feats.append(y.reshape(-1).astype(np.float32))
        if (i+1) % 500 == 0:
            print(f'Emb {i+1}/{len(paths)} ({(i+1)*100//len(paths)}%)')
    F = np.vstack(feats).astype(np.float32)
    print(f'ONNX B3 penultimate extraction done: shape={F.shape} | {time.time()-t0:.1f}s')
    assert np.isfinite(F).all(), 'Non-finite values in ONNX embeddings'
    return F

# Data
DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')
JPEG_TRAIN = DATA_DIR / 'jpeg' / 'train'
JPEG_TEST  = DATA_DIR / 'jpeg' / 'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all(); train['fold'] = train['fold'].astype(int)

# Feature cache
ft_tr = 'features_train_onnx_effb3_embed_300.npy'
ft_te = 'features_test_onnx_effb3_embed_300.npy'

if Path(ft_tr).exists() and Path(ft_te).exists():
    Ft = np.load(ft_tr); Fe = np.load(ft_te)
    print('Loaded cached ONNX B3 embeddings:', Ft.shape, Fe.shape)
else:
    Ft = extract_embeddings(train['image_path'].tolist())
    Fe = extract_embeddings(test['image_path'].tolist())
    np.save(ft_tr, Ft); np.save(ft_te, Fe)
    print('Saved:', ft_tr, ft_te)

# Logistic Regression head
y = train['target'].values.astype(int)
oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds = []
fold_aucs = []
for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = Ft[trn_idx], y[trn_idx]
    X_val, y_val = Ft[val_idx], y[val_idx]
    X_tst = Fe
    clf = Pipeline([
        ('sc', StandardScaler(with_mean=False)),
        ('lr', LogisticRegression(max_iter=5000, class_weight='balanced', solver='lbfgs', n_jobs=1, random_state=SEED))
    ])
    clf.fit(X_trn, y_trn)
    pv = clf.predict_proba(X_val)[:,1]
    oof[val_idx] = pv.astype(np.float32)
    auc = roc_auc_score(y_val, pv)
    fold_aucs.append(float(auc))
    test_pred_folds.append(clf.predict_proba(X_tst)[:,1].astype(np.float32))
    print(f'Fold {f}: AUC={auc:.5f}')

oof_auc = roc_auc_score(y, oof)
print('OOF AUC (ONNX EffB3 penultimate + LR):', round(oof_auc,6), '| per-fold:', [round(a,6) for a in fold_aucs])

# Save artifacts
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_onnx_effb3_embed_lr'] = oof
oof_df.to_csv('oof_meta.csv', index=False)
test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
sub = pd.DataFrame({'image_name': test['image_name'], 'target': test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved oof_meta.csv and submission.csv')

> pip onnxruntime==1.18.0 onnx==1.16.2 tf2onnx==1.16.1


[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.26.4 which is incompatible.
grpcio-status 1.74.0 requires protobuf<7.0.0,>=6.31.1, but you have protobuf 3.20.3 which is incompatible.[0m[31m
[0m

Building EfficientNetB3 (penultimate GAP) and converting to ONNX ...


Downloading data from https://storage.googleapis.com/keras-applications/efficientnetb3_notop.h5


    8192/43941136 [..............................] - ETA: 0s

 6037504/43941136 [===>..........................] - ETA: 0s











2025-08-24 10:52:33.580017: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2025-08-24 10:52:33.581592: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2025-08-24 10:52:33.582958: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysf

2025-08-24 10:52:37.248953: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2025-08-24 10:52:37.250517: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2025-08-24 10:52:37.251873: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysf

Saved ONNX penultimate model to efficientnetb3_imagenet_300_embed.onnx


Emb 500/28984 (1%)


Emb 1000/28984 (3%)


Emb 1500/28984 (5%)


Emb 2000/28984 (6%)


Emb 2500/28984 (8%)


Emb 3000/28984 (10%)


Emb 3500/28984 (12%)


Emb 4000/28984 (13%)


Emb 4500/28984 (15%)


Emb 5000/28984 (17%)


Emb 5500/28984 (18%)


Emb 6000/28984 (20%)


Emb 6500/28984 (22%)


Emb 7000/28984 (24%)


Emb 7500/28984 (25%)


Emb 8000/28984 (27%)


Emb 8500/28984 (29%)


Emb 9000/28984 (31%)


Emb 9500/28984 (32%)


Emb 10000/28984 (34%)


Emb 10500/28984 (36%)


Emb 11000/28984 (37%)


Emb 11500/28984 (39%)


Emb 12000/28984 (41%)


Emb 12500/28984 (43%)


Emb 13000/28984 (44%)


Emb 13500/28984 (46%)


Emb 14000/28984 (48%)


Emb 14500/28984 (50%)


Emb 15000/28984 (51%)


Emb 15500/28984 (53%)


Emb 16000/28984 (55%)


Emb 16500/28984 (56%)


Emb 17000/28984 (58%)


Emb 17500/28984 (60%)


Emb 18000/28984 (62%)


Emb 18500/28984 (63%)


Emb 19000/28984 (65%)


Emb 19500/28984 (67%)


Emb 20000/28984 (69%)


Emb 20500/28984 (70%)


Emb 21000/28984 (72%)


Emb 21500/28984 (74%)


Emb 22000/28984 (75%)


Emb 22500/28984 (77%)


Emb 23000/28984 (79%)


Emb 23500/28984 (81%)


Emb 24000/28984 (82%)


Emb 24500/28984 (84%)


Emb 25000/28984 (86%)


Emb 25500/28984 (87%)


Emb 26000/28984 (89%)


Emb 26500/28984 (91%)


Emb 27000/28984 (93%)


Emb 27500/28984 (94%)


Emb 28000/28984 (96%)


Emb 28500/28984 (98%)


ONNX B3 penultimate extraction done: shape=(28984, 1536) | 9030.5s


Emb 500/4142 (12%)


Emb 1000/4142 (24%)


Emb 1500/4142 (36%)


Emb 2000/4142 (48%)


Emb 2500/4142 (60%)


Emb 3000/4142 (72%)


Emb 3500/4142 (84%)


Emb 4000/4142 (96%)


ONNX B3 penultimate extraction done: shape=(4142, 1536) | 1278.4s


Saved: features_train_onnx_effb3_embed_300.npy features_test_onnx_effb3_embed_300.npy


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Fold 0: AUC=0.75589


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Fold 1: AUC=0.77439


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Fold 2: AUC=0.81068


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Fold 3: AUC=0.81134


In [68]:
# EXP3-ONNX-B3-REFIT: Refit stronger LR head on cached B3 embeddings with proper scaling and higher max_iter
import os, numpy as np, pandas as pd
from pathlib import Path
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

SEED = 42
np.random.seed(SEED)

DATA_DIR = Path('.')
Ft_path = 'features_train_onnx_effb3_embed_300.npy'
Fe_path = 'features_test_onnx_effb3_embed_300.npy'
assert Path(Ft_path).exists() and Path(Fe_path).exists(), 'Cached B3 embeddings not found. Ensure Cell 29 finished extraction.'
Ft = np.load(Ft_path).astype(np.float32)
Fe = np.load(Fe_path).astype(np.float32)
train = pd.read_csv(DATA_DIR/'train.csv')
test  = pd.read_csv(DATA_DIR/'test.csv')
folds = pd.read_csv(DATA_DIR/'folds.csv')
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all() and len(Ft)==len(train) and len(Fe)==len(test)
train['fold'] = train['fold'].astype(int)
y = train['target'].values.astype(int)

# Stronger LR head with proper scaling
oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds = []
fold_aucs = []

for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = Ft[trn_idx], y[trn_idx]
    X_val, y_val = Ft[val_idx], y[val_idx]
    X_tst = Fe

    clf = Pipeline([
        ('sc', StandardScaler(with_mean=True, with_std=True)),
        ('lr', LogisticRegression(
            C=2.0, max_iter=20000, class_weight='balanced',
            solver='lbfgs', n_jobs=1, random_state=SEED
        ))
    ])
    clf.fit(X_trn, y_trn)
    pv = clf.predict_proba(X_val)[:,1]
    oof[val_idx] = pv.astype(np.float32)
    fold_aucs.append(float(roc_auc_score(y_val, pv)))
    test_pred_folds.append(clf.predict_proba(X_tst)[:,1].astype(np.float32))
    print(f'Fold {f}: AUC={fold_aucs[-1]:.5f}')

oof_auc = roc_auc_score(y, oof)
print('OOF AUC (EffB3 emb + LR refit):', round(oof_auc,6), '| per-fold:', [round(a,6) for a in fold_aucs])

# Save artifacts
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_onnx_effb3_embed_lr_refit'] = oof
oof_df.to_csv('oof_meta.csv', index=False)
test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
sub = pd.DataFrame({'image_name': test['image_name'], 'target': test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved oof_meta.csv and submission.csv')

Fold 0: AUC=0.75707


Fold 1: AUC=0.77956


In [70]:
# EXP3-ALIGN-B0: Enforce feature/label alignment with sidecar order for EffB0 embeddings, restore LR baseline
import os, json, numpy as np, pandas as pd
from pathlib import Path
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

SEED = 42
np.random.seed(SEED)

DATA_DIR = Path('.')
ft_tr_path = 'features_train_onnx_effb0_embed_224.npy'
ft_te_path = 'features_test_onnx_effb0_embed_224.npy'
order_tr_path = 'order_train_effb0_embed_224.npy'
order_te_path = 'order_test_effb0_embed_224.npy'

assert Path(ft_tr_path).exists() and Path(ft_te_path).exists(), 'Missing cached EffB0 embeddings. Run ONNX B0 embed extraction.'
Ft = np.load(ft_tr_path)
Fe = np.load(ft_te_path)
train = pd.read_csv(DATA_DIR/'train.csv')
test  = pd.read_csv(DATA_DIR/'test.csv')
folds = pd.read_csv(DATA_DIR/'folds.csv')
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all(), 'folds merge failed'
train['fold'] = train['fold'].astype(int)

# Sidecar order: create if not present, assuming current order was used at extraction time
if not Path(order_tr_path).exists():
    np.save(order_tr_path, train['image_name'].values)
    print('Saved sidecar order for train (assumed current train order).')
else:
    order_tr = np.load(order_tr_path, allow_pickle=True)
    # Validate length
    assert len(order_tr) == len(Ft), f'Train sidecar length {len(order_tr)} != Ft rows {len(Ft)}'
    # Check if matches current DataFrame order
    cur = train['image_name'].values
    if not np.array_equal(order_tr, cur):
        # Cannot permute Ft without a mapping from names to rows, so fail fast with actionable error
        raise RuntimeError('Alignment mismatch: sidecar order != current train order. Regenerate embeddings with sidecar in this order or reorder train to match sidecar before CV.')

if not Path(order_te_path).exists():
    np.save(order_te_path, test['image_name'].values)
    print('Saved sidecar order for test (assumed current test order).')
else:
    order_te = np.load(order_te_path, allow_pickle=True)
    assert len(order_te) == len(Fe), f'Test sidecar length {len(order_te)} != Fe rows {len(Fe)}'
    if not np.array_equal(order_te, test['image_name'].values):
        raise RuntimeError('Alignment mismatch: sidecar order != current test order. Regenerate embeddings with sidecar.')

# Train LR head on aligned features
y = train['target'].values.astype(int)
oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds = []
fold_aucs = []
for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = Ft[trn_idx], y[trn_idx]
    X_val, y_val = Ft[val_idx], y[val_idx]
    X_tst = Fe
    clf = Pipeline([
        ('scaler', StandardScaler(with_mean=True)),
        ('lr', LogisticRegression(solver='lbfgs', C=1.0, max_iter=10000, class_weight='balanced', n_jobs=1, random_state=SEED))
    ])
    clf.fit(X_trn, y_trn)
    pv = clf.predict_proba(X_val)[:,1]
    oof[val_idx] = pv.astype(np.float32)
    auc = roc_auc_score(y_val, pv)
    fold_aucs.append(float(auc))
    test_pred_folds.append(clf.predict_proba(X_tst)[:,1].astype(np.float32))
    print(f'Fold {f}: AUC={auc:.5f}')

oof_auc = roc_auc_score(y, oof)
print('OOF AUC (EffB0 emb + LR, alignment-guarded):', round(oof_auc,6), '| per-fold:', [round(a,6) for a in fold_aucs])

# Save artifacts
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_onnx_effb0_embed_lr_aligned'] = oof
oof_df.to_csv('oof_meta.csv', index=False)
test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
sub = pd.DataFrame({'image_name': test['image_name'], 'target': test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved oof_meta.csv and submission.csv')

Saved sidecar order for train (assumed current train order).
Saved sidecar order for test (assumed current test order).


Fold 0: AUC=0.77277


Fold 1: AUC=0.81623


Fold 2: AUC=0.80639


Fold 3: AUC=0.81996


In [71]:
# EXP3-ALIGN-FIX-B0: Realign cached EffB0 embeddings using saved sidecar orders, then refit LR
import os, numpy as np, pandas as pd
from pathlib import Path
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

SEED = 42
np.random.seed(SEED)

ft_tr_path = 'features_train_onnx_effb0_embed_224.npy'
ft_te_path = 'features_test_onnx_effb0_embed_224.npy'
order_tr_path = 'order_train_effb0_embed_224.npy'
order_te_path = 'order_test_effb0_embed_224.npy'

assert Path(ft_tr_path).exists() and Path(ft_te_path).exists(), 'Missing cached EffB0 embeddings.'
assert Path(order_tr_path).exists() and Path(order_te_path).exists(), 'Missing sidecar order files; re-extract embeddings with sidecar.'

Ft = np.load(ft_tr_path)
Fe = np.load(ft_te_path)
order_tr = np.load(order_tr_path, allow_pickle=True).astype(str)
order_te = np.load(order_te_path, allow_pickle=True).astype(str)

train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')
folds = pd.read_csv('folds.csv')
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all(), 'folds merge failed'
train['fold'] = train['fold'].astype(int)

# Build index maps from sidecar to positions
assert len(order_tr) == Ft.shape[0], f'Sidecar train length {len(order_tr)} != Ft rows {Ft.shape[0]}'
assert len(order_te) == Fe.shape[0], f'Sidecar test length {len(order_te)} != Fe rows {Fe.shape[0]}'
idx_tr_map = {name:i for i, name in enumerate(order_tr)}
idx_te_map = {name:i for i, name in enumerate(order_te)}

# Verify coverage and build reindex arrays
cur_tr_names = train['image_name'].astype(str).values
cur_te_names = test['image_name'].astype(str).values
missing_tr = [n for n in cur_tr_names if n not in idx_tr_map]
missing_te = [n for n in cur_te_names if n not in idx_te_map]
assert len(missing_tr)==0 and len(missing_te)==0, f'Missing names in sidecar maps: train={len(missing_tr)} test={len(missing_te)}'
perm_tr = np.array([idx_tr_map[n] for n in cur_tr_names], dtype=np.int64)
perm_te = np.array([idx_te_map[n] for n in cur_te_names], dtype=np.int64)

# Reorder features to match current DataFrame order
Ft_aligned = Ft[perm_tr]
Fe_aligned = Fe[perm_te]

# Quick sanity: identical if already aligned
same_tr = np.array_equal(perm_tr, np.arange(len(perm_tr)))
same_te = np.array_equal(perm_te, np.arange(len(perm_te)))
print('Alignment needed? train:', not same_tr, 'test:', not same_te)

# Refit LR head on aligned features
y = train['target'].values.astype(int)
oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds = []
fold_aucs = []
for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = Ft_aligned[trn_idx], y[trn_idx]
    X_val, y_val = Ft_aligned[val_idx], y[val_idx]
    X_tst = Fe_aligned
    clf = Pipeline([
        ('sc', StandardScaler(with_mean=True)),
        ('lr', LogisticRegression(solver='lbfgs', C=1.0, max_iter=20000, class_weight='balanced', n_jobs=1, random_state=SEED))
    ])
    clf.fit(X_trn, y_trn)
    pv = clf.predict_proba(X_val)[:,1]
    oof[val_idx] = pv.astype(np.float32)
    auc = roc_auc_score(y_val, pv)
    fold_aucs.append(float(auc))
    test_pred_folds.append(clf.predict_proba(X_tst)[:,1].astype(np.float32))
    print(f'Fold {f}: AUC={auc:.5f}')

oof_auc = roc_auc_score(y, oof)
print('OOF AUC (EffB0 emb aligned + LR):', round(oof_auc,6), '| per-fold:', [round(a,6) for a in fold_aucs])

# Save artifacts
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_onnx_effb0_embed_lr_aligned'] = oof
oof_df.to_csv('oof_meta.csv', index=False)
test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
sub = pd.DataFrame({'image_name': test['image_name'], 'target': test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved oof_meta.csv and submission.csv')

Alignment needed? train: False test: False


Fold 0: AUC=0.77277


Fold 1: AUC=0.81623


Fold 2: AUC=0.80639


In [72]:
# EXP3-ONNX-B0-REEXTRACT-ORDERED: Re-extract EffNetB0 penultimate embeddings with explicit sidecar order (current train/test order)
import os, sys, time, subprocess
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image

os.environ['OMP_NUM_THREADS'] = '1'; os.environ['MKL_NUM_THREADS'] = '1'; os.environ['OPENBLAS_NUM_THREADS'] = '1'; os.environ['NUMEXPR_NUM_THREADS'] = '1'; os.environ['MALLOC_ARENA_MAX'] = '1'

SEED = 42
IMG_SIZE = 224
RESIZE_SHORT = 256
MODEL_PATH = Path('efficientnetb0_imagenet_224_embed.onnx')

def pip_install(pkgs):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet'] + pkgs)

# Ensure deps and ONNX model
pip_install(['onnxruntime==1.18.0', 'onnx==1.16.2', 'tf2onnx==1.16.1'])
import onnxruntime as ort
if not MODEL_PATH.exists() or MODEL_PATH.stat().st_size == 0:
    import tensorflow as tf, tf2onnx
    tf.random.set_seed(SEED)
    base = tf.keras.applications.EfficientNetB0(include_top=False, weights='imagenet', input_shape=(IMG_SIZE, IMG_SIZE, 3))
    x = tf.keras.layers.GlobalAveragePooling2D(name='gap')(base.output)
    model = tf.keras.Model(base.input, x)
    spec = (tf.TensorSpec((None, IMG_SIZE, IMG_SIZE, 3), tf.float32, name='input'),)
    onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=13)
    with open(MODEL_PATH, 'wb') as f: f.write(onnx_model.SerializeToString())

def make_session(mpath):
    so = ort.SessionOptions(); so.intra_op_num_threads = 1; so.inter_op_num_threads = 1
    return ort.InferenceSession(str(mpath), sess_options=so, providers=['CPUExecutionProvider'])

def resize_shorter_side(im, short=RESIZE_SHORT):
    w, h = im.size
    if w <= 0 or h <= 0: return im.resize((short, short), Image.BILINEAR)
    if w < h: nw, nh = short, int(round(h * short / w))
    else: nh, nw = short, int(round(w * short / h))
    return im.resize((nw, nh), Image.BILINEAR)

def preprocess_efficientnet(path):
    try:
        with Image.open(path) as im:
            im = im.convert('RGB')
            im = resize_shorter_side(im, RESIZE_SHORT)
            w, h = im.size
            left = (w - IMG_SIZE) // 2; top = (h - IMG_SIZE) // 2
            im = im.crop((left, top, left + IMG_SIZE, top + IMG_SIZE))
            arr = np.asarray(im, dtype=np.float32)
            arr = (arr / 127.5) - 1.0
            return arr[np.newaxis, ...]
    except Exception:
        return np.zeros((1, IMG_SIZE, IMG_SIZE, 3), dtype=np.float32)

def extract_embeddings(paths):
    sess = make_session(MODEL_PATH)
    inp = sess.get_inputs()[0].name
    out = sess.get_outputs()[0].name
    feats = np.zeros((len(paths), 1280), dtype=np.float32)
    t0 = time.time()
    for i, p in enumerate(paths):
        x = preprocess_efficientnet(p)
        y = sess.run([out], {inp: x})[0]
        feats[i] = y.reshape(-1).astype(np.float32)
        if (i+1) % 500 == 0:
            print(f'Emb {i+1}/{len(paths)} ({(i+1)*100//len(paths)}%)')
    print(f'ONNX B0 penultimate extraction: shape={feats.shape} | {time.time()-t0:.1f}s')
    assert np.isfinite(feats).all(), 'Non-finite values in embeddings'
    return feats

# Load data and establish current order
DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR/'train.csv')
test  = pd.read_csv(DATA_DIR/'test.csv')
JPEG_TRAIN = DATA_DIR/'jpeg'/'train'; JPEG_TEST = DATA_DIR/'jpeg'/'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))

# Extract in CURRENT DataFrame order and save sidecars with explicit v2 suffix
ft_tr_v2 = 'features_train_onnx_effb0_embed_224_v2.npy'
ft_te_v2 = 'features_test_onnx_effb0_embed_224_v2.npy'
order_tr_v2 = 'order_train_effb0_embed_224_v2.npy'
order_te_v2 = 'order_test_effb0_embed_224_v2.npy'

Ft = extract_embeddings(train['image_path'].tolist())
Fe = extract_embeddings(test['image_path'].tolist())
np.save(ft_tr_v2, Ft); np.save(ft_te_v2, Fe)
np.save(order_tr_v2, train['image_name'].values.astype(str))
np.save(order_te_v2, test['image_name'].values.astype(str))
print('Saved embeddings and sidecars:', ft_tr_v2, ft_te_v2, order_tr_v2, order_te_v2)

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.26.4 which is incompatible.
grpcio-status 1.74.0 requires protobuf<7.0.0,>=6.31.1, but you have protobuf 3.20.3 which is incompatible.[0m[31m
[0m

Emb 500/28984 (1%)


Emb 1000/28984 (3%)


Emb 1500/28984 (5%)


Emb 2000/28984 (6%)


Emb 2500/28984 (8%)


Emb 3000/28984 (10%)


Emb 3500/28984 (12%)


Emb 4000/28984 (13%)


Emb 4500/28984 (15%)


Emb 5000/28984 (17%)


Emb 5500/28984 (18%)


Emb 6000/28984 (20%)


Emb 6500/28984 (22%)


Emb 7000/28984 (24%)


Emb 7500/28984 (25%)


Emb 8000/28984 (27%)


Emb 8500/28984 (29%)


Emb 9000/28984 (31%)


Emb 9500/28984 (32%)


Emb 10000/28984 (34%)


Emb 10500/28984 (36%)


Emb 11000/28984 (37%)


Emb 11500/28984 (39%)


Emb 12000/28984 (41%)


Emb 12500/28984 (43%)


Emb 13000/28984 (44%)


Emb 13500/28984 (46%)


Emb 14000/28984 (48%)


Emb 14500/28984 (50%)


Emb 15000/28984 (51%)


Emb 15500/28984 (53%)


Emb 16000/28984 (55%)


Emb 16500/28984 (56%)


Emb 17000/28984 (58%)


Emb 17500/28984 (60%)


Emb 18000/28984 (62%)


Emb 18500/28984 (63%)


Emb 19000/28984 (65%)


Emb 19500/28984 (67%)


Emb 20000/28984 (69%)


Emb 20500/28984 (70%)


Emb 21000/28984 (72%)


Emb 21500/28984 (74%)


Emb 22000/28984 (75%)


Emb 22500/28984 (77%)


Emb 23000/28984 (79%)


Emb 23500/28984 (81%)


Emb 24000/28984 (82%)


Emb 24500/28984 (84%)


Emb 25000/28984 (86%)


Emb 25500/28984 (87%)


Emb 26000/28984 (89%)


Emb 26500/28984 (91%)


Emb 27000/28984 (93%)


Emb 27500/28984 (94%)


Emb 28000/28984 (96%)


Emb 28500/28984 (98%)


ONNX B0 penultimate extraction: shape=(28984, 1280) | 4362.6s


Emb 500/4142 (12%)


Emb 1000/4142 (24%)


Emb 1500/4142 (36%)


Emb 2000/4142 (48%)


Emb 2500/4142 (60%)


Emb 3000/4142 (72%)


Emb 3500/4142 (84%)


Emb 4000/4142 (96%)


ONNX B0 penultimate extraction: shape=(4142, 1280) | 614.0s
Saved embeddings and sidecars: features_train_onnx_effb0_embed_224_v2.npy features_test_onnx_effb0_embed_224_v2.npy order_train_effb0_embed_224_v2.npy order_test_effb0_embed_224_v2.npy


In [73]:
# EXP3-ONNX-B0-REFIT-V2: Use re-extracted, order-verified B0 embeddings (_v2) to restore baseline
import numpy as np, pandas as pd
from pathlib import Path
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

SEED = 42
ft_tr = 'features_train_onnx_effb0_embed_224_v2.npy'
ft_te = 'features_test_onnx_effb0_embed_224_v2.npy'
ord_tr = 'order_train_effb0_embed_224_v2.npy'
ord_te = 'order_test_effb0_embed_224_v2.npy'
assert Path(ft_tr).exists() and Path(ft_te).exists() and Path(ord_tr).exists() and Path(ord_te).exists(), 'Missing _v2 features or sidecars.'
Ft = np.load(ft_tr); Fe = np.load(ft_te)
otr = np.load(ord_tr, allow_pickle=True).astype(str); ote = np.load(ord_te, allow_pickle=True).astype(str)
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv'); folds = pd.read_csv('folds.csv')
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all(); train['fold'] = train['fold'].astype(int)
assert len(Ft)==len(train) and len(Fe)==len(test), 'Shape mismatch vs current CSVs'
assert np.array_equal(otr, train['image_name'].astype(str).values), 'Train order mismatch for _v2 sidecar'
assert np.array_equal(ote, test['image_name'].astype(str).values),  'Test order mismatch for _v2 sidecar'

y = train['target'].values.astype(int)
oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds, fold_aucs = [], []
for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = Ft[trn_idx], y[trn_idx]
    X_val, y_val = Ft[val_idx], y[val_idx]
    clf = Pipeline([
        ('sc', StandardScaler(with_mean=True, with_std=True)),
        ('lr', LogisticRegression(solver='lbfgs', C=1.0, max_iter=20000, class_weight='balanced', n_jobs=1, random_state=SEED))
    ])
    clf.fit(X_trn, y_trn)
    pv = clf.predict_proba(X_val)[:,1]
    oof[val_idx] = pv.astype(np.float32)
    fold_aucs.append(float(roc_auc_score(y_val, pv)))
    test_pred_folds.append(clf.predict_proba(Fe)[:,1].astype(np.float32))
print('Per-fold AUCs:', [round(a,6) for a in fold_aucs])
oof_auc = roc_auc_score(y, oof)
print('OOF AUC (EffB0 _v2 emb + LR):', round(oof_auc,6))

# Save artifacts
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_onnx_effb0_embed_lr_v2'] = oof
oof_df.to_csv('oof_meta.csv', index=False)
test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
pd.DataFrame({'image_name': test['image_name'], 'target': test_pred}).to_csv('submission.csv', index=False)
print('Saved oof_meta.csv and submission.csv')

Per-fold AUCs: [0.772772, 0.816227, 0.806393, 0.819959, 0.779595]
OOF AUC (EffB0 _v2 emb + LR): 0.799246
Saved oof_meta.csv and submission.csv


In [75]:
# EXP3-ONNX-B0-TTA2: 2-view TTA (orig + horizontal flip) EfficientNetB0 embeddings + LR CV
import os, sys, time, subprocess
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

os.environ['OMP_NUM_THREADS'] = '1'; os.environ['MKL_NUM_THREADS'] = '1'; os.environ['OPENBLAS_NUM_THREADS'] = '1'; os.environ['NUMEXPR_NUM_THREADS'] = '1'; os.environ['MALLOC_ARENA_MAX'] = '1'

SEED = 42
IMG_SIZE = 224
RESIZE_SHORT = 256
MODEL_PATH = Path('efficientnetb0_imagenet_224_embed.onnx')  # 1280D penultimate

def pip_install(pkgs):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet'] + pkgs)

# Ensure deps
pip_install(['onnxruntime==1.18.0', 'onnx==1.16.2', 'tf2onnx==1.16.1'])
import onnxruntime as ort

# Build ONNX model if missing
if not MODEL_PATH.exists() or MODEL_PATH.stat().st_size == 0:
    import tensorflow as tf, tf2onnx
    tf.random.set_seed(SEED)
    base = tf.keras.applications.EfficientNetB0(include_top=False, weights='imagenet', input_shape=(IMG_SIZE, IMG_SIZE, 3))
    x = tf.keras.layers.GlobalAveragePooling2D(name='gap')(base.output)
    model = tf.keras.Model(base.input, x)
    spec = (tf.TensorSpec((None, IMG_SIZE, IMG_SIZE, 3), tf.float32, name='input'),)
    onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=13)
    with open(MODEL_PATH, 'wb') as f: f.write(onnx_model.SerializeToString())

def make_session(mpath):
    so = ort.SessionOptions(); so.intra_op_num_threads = 1; so.inter_op_num_threads = 1
    return ort.InferenceSession(str(mpath), sess_options=so, providers=['CPUExecutionProvider'])

def resize_shorter_side(im, short=RESIZE_SHORT):
    w, h = im.size
    if w <= 0 or h <= 0: return im.resize((short, short), Image.BILINEAR)
    if w < h: nw, nh = short, int(round(h * short / w))
    else: nh, nw = short, int(round(w * short / h))
    return im.resize((nw, nh), Image.BILINEAR)

def preprocess_arr(im: Image.Image):
    im = im.convert('RGB')
    im = resize_shorter_side(im, RESIZE_SHORT)
    w, h = im.size
    left = (w - IMG_SIZE) // 2; top = (h - IMG_SIZE) // 2
    im = im.crop((left, top, left + IMG_SIZE, top + IMG_SIZE))
    arr = np.asarray(im, dtype=np.float32)  # H,W,3 in 0..255
    arr = (arr / 127.5) - 1.0  # [-1,1]
    return arr  # H,W,3

def extract_tta2_avg(paths):
    sess = make_session(MODEL_PATH)
    inp = sess.get_inputs()[0].name
    out = sess.get_outputs()[0].name
    feats = np.zeros((len(paths), 1280), dtype=np.float32)
    t0 = time.time()
    for i, p in enumerate(paths):
        try:
            with Image.open(p) as im:
                arr = preprocess_arr(im)  # H,W,3
        except Exception:
            arr = np.zeros((IMG_SIZE, IMG_SIZE, 3), dtype=np.float32)
        arr_flip = arr[:, ::-1, :]
        x = np.stack([arr, arr_flip], axis=0)  # 2,H,W,3
        y = sess.run([out], {inp: x})[0]  # (2,1280)
        feats[i] = y.mean(axis=0).astype(np.float32)
        if (i+1) % 500 == 0:
            print(f'TTA2 Emb {i+1}/{len(paths)} ({(i+1)*100//len(paths)}%)')
    print(f'TTA2 extraction: shape={feats.shape} | {time.time()-t0:.1f}s')
    assert np.isfinite(feats).all(), 'Non-finite values in TTA embeddings'
    return feats

# Data
DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR/'train.csv')
test  = pd.read_csv(DATA_DIR/'test.csv')
folds = pd.read_csv(DATA_DIR/'folds.csv')
JPEG_TRAIN = DATA_DIR/'jpeg'/'train'; JPEG_TEST = DATA_DIR/'jpeg'/'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all(); train['fold'] = train['fold'].astype(int)

# Cache paths
ft_tr = 'features_train_onnx_effb0_embed_224_tta2.npy'
ft_te = 'features_test_onnx_effb0_embed_224_tta2.npy'
ord_tr = 'order_train_effb0_embed_224_tta2.npy'
ord_te = 'order_test_effb0_embed_224_tta2.npy'

if Path(ft_tr).exists() and Path(ft_te).exists() and Path(ord_tr).exists() and Path(ord_te).exists():
    Ft = np.load(ft_tr); Fe = np.load(ft_te)
    otr = np.load(ord_tr, allow_pickle=True).astype(str); ote = np.load(ord_te, allow_pickle=True).astype(str)
    assert np.array_equal(otr, train['image_name'].astype(str).values), 'Train sidecar mismatch for TTA2'
    assert np.array_equal(ote, test['image_name'].astype(str).values),  'Test sidecar mismatch for TTA2'
    print('Loaded cached TTA2 embeddings:', Ft.shape, Fe.shape)
else:
    Ft = extract_tta2_avg(train['image_path'].tolist())
    Fe = extract_tta2_avg(test['image_path'].tolist())
    np.save(ft_tr, Ft); np.save(ft_te, Fe)
    np.save(ord_tr, train['image_name'].values.astype(str)); np.save(ord_te, test['image_name'].values.astype(str))
    print('Saved TTA2 embeddings and sidecars.')

# LR head CV
y = train['target'].values.astype(int)
oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds, fold_aucs = [], []
for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = Ft[trn_idx], y[trn_idx]
    X_val, y_val = Ft[val_idx], y[val_idx]
    clf = Pipeline([
        ('sc', StandardScaler(with_mean=True, with_std=True)),
        ('lr', LogisticRegression(solver='lbfgs', C=1.0, max_iter=20000, class_weight='balanced', n_jobs=1, random_state=SEED))
    ])
    clf.fit(X_trn, y_trn)
    pv = clf.predict_proba(X_val)[:,1]
    oof[val_idx] = pv.astype(np.float32)
    fold_aucs.append(float(roc_auc_score(y_val, pv)))
    test_pred_folds.append(clf.predict_proba(Fe)[:,1].astype(np.float32))
print('Per-fold AUCs (B0 TTA2):', [round(a,6) for a in fold_aucs])
oof_auc = roc_auc_score(y, oof)
print('OOF AUC (EffB0 TTA2 emb + LR):', round(oof_auc,6))

# Save artifacts
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_onnx_effb0_embed_tta2_lr'] = oof
oof_df.to_csv('oof_meta.csv', index=False)
test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
pd.DataFrame({'image_name': test['image_name'], 'target': test_pred}).to_csv('submission.csv', index=False)
print('Saved oof_meta.csv and submission.csv')

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.26.4 which is incompatible.
grpcio-status 1.74.0 requires protobuf<7.0.0,>=6.31.1, but you have protobuf 3.20.3 which is incompatible.[0m[31m
[0m

TTA2 Emb 500/28984 (1%)


TTA2 Emb 1000/28984 (3%)


TTA2 Emb 1500/28984 (5%)


TTA2 Emb 2000/28984 (6%)


TTA2 Emb 2500/28984 (8%)


TTA2 Emb 3000/28984 (10%)


TTA2 Emb 3500/28984 (12%)


TTA2 Emb 4000/28984 (13%)


TTA2 Emb 4500/28984 (15%)


TTA2 Emb 5000/28984 (17%)


TTA2 Emb 5500/28984 (18%)


TTA2 Emb 6000/28984 (20%)


TTA2 Emb 6500/28984 (22%)


TTA2 Emb 7000/28984 (24%)


TTA2 Emb 7500/28984 (25%)


TTA2 Emb 8000/28984 (27%)


TTA2 Emb 8500/28984 (29%)


TTA2 Emb 9000/28984 (31%)


TTA2 Emb 9500/28984 (32%)


TTA2 Emb 10000/28984 (34%)


In [78]:
# EXP3-ONNX-B0-TTA2-BATCHED: Fast 2-view TTA (orig+HFlip) with batched ONNX inference + LR CV
import os, sys, time, subprocess
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

os.environ['OMP_NUM_THREADS'] = '1'; os.environ['MKL_NUM_THREADS'] = '1'; os.environ['OPENBLAS_NUM_THREADS'] = '1'; os.environ['NUMEXPR_NUM_THREADS'] = '1'; os.environ['MALLOC_ARENA_MAX'] = '1'

SEED = 42
IMG_SIZE = 224
RESIZE_SHORT = 256
BATCH = 32  # effective 64 inputs per run with TTA2
MODEL_PATH = Path('efficientnetb0_imagenet_224_embed.onnx')

def pip_install(pkgs):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet'] + pkgs)

pip_install(['onnxruntime==1.18.0', 'onnx==1.16.2', 'tf2onnx==1.16.1'])
import onnxruntime as ort

if not MODEL_PATH.exists() or MODEL_PATH.stat().st_size == 0:
    import tensorflow as tf, tf2onnx
    tf.random.set_seed(SEED)
    base = tf.keras.applications.EfficientNetB0(include_top=False, weights='imagenet', input_shape=(IMG_SIZE, IMG_SIZE, 3))
    x = tf.keras.layers.GlobalAveragePooling2D(name='gap')(base.output)
    model = tf.keras.Model(base.input, x)
    spec = (tf.TensorSpec((None, IMG_SIZE, IMG_SIZE, 3), tf.float32, name='input'),)
    onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=13)
    with open(MODEL_PATH, 'wb') as f: f.write(onnx_model.SerializeToString())

def make_session(mpath):
    so = ort.SessionOptions(); so.intra_op_num_threads = 1; so.inter_op_num_threads = 1
    return ort.InferenceSession(str(mpath), sess_options=so, providers=['CPUExecutionProvider'])

def resize_shorter_side(im, short=RESIZE_SHORT):
    w, h = im.size
    if w <= 0 or h <= 0: return im.resize((short, short), Image.BILINEAR)
    if w < h: nw, nh = short, int(round(h * short / w))
    else: nh, nw = short, int(round(w * short / h))
    return im.resize((nw, nh), Image.BILINEAR)

def preprocess_arr(im: Image.Image):
    im = im.convert('RGB')
    im = resize_shorter_side(im, RESIZE_SHORT)
    w, h = im.size
    left = (w - IMG_SIZE) // 2; top = (h - IMG_SIZE) // 2
    im = im.crop((left, top, left + IMG_SIZE, top + IMG_SIZE))
    arr = np.asarray(im, dtype=np.float32)  # H,W,3 in 0..255
    arr = (arr / 127.5) - 1.0  # [-1,1]
    return arr

def extract_tta2_avg_batched(paths):
    sess = make_session(MODEL_PATH)
    inp = sess.get_inputs()[0].name
    out = sess.get_outputs()[0].name
    N = len(paths)
    feats = np.zeros((N, 1280), dtype=np.float32)
    t0 = time.time()
    i = 0; k = 0
    while i < N:
        j = min(i + BATCH, N)
        batch = []
        for p in paths[i:j]:
            try:
                with Image.open(p) as im:
                    arr = preprocess_arr(im)
            except Exception:
                arr = np.zeros((IMG_SIZE, IMG_SIZE, 3), dtype=np.float32)
            arr_flip = arr[:, ::-1, :]
            batch.append(arr); batch.append(arr_flip)
        X = np.stack(batch, axis=0)  # (2*B, H, W, 3)
        Y = sess.run([out], {inp: X})[0]  # (2*B, 1280)
        # average pairs
        Y = Y.reshape(-1, 2, 1280).mean(axis=1)
        feats[i:j] = Y.astype(np.float32)
        i = j; k += 1
        if k % 20 == 0:
            done = j
            print(f'TTA2 Batched {done}/{N} ({(done*100)//N}%)')
    print(f'TTA2 batched extraction: shape={feats.shape} | {time.time()-t0:.1f}s')
    assert np.isfinite(feats).all(), 'Non-finite values in TTA embeddings'
    return feats

# Data
DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR/'train.csv')
test  = pd.read_csv(DATA_DIR/'test.csv')
folds = pd.read_csv(DATA_DIR/'folds.csv')
JPEG_TRAIN = DATA_DIR/'jpeg'/'train'; JPEG_TEST = DATA_DIR/'jpeg'/'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all(); train['fold'] = train['fold'].astype(int)

# Cache paths
ft_tr = 'features_train_onnx_effb0_embed_224_tta2_batched.npy'
ft_te = 'features_test_onnx_effb0_embed_224_tta2_batched.npy'
ord_tr = 'order_train_effb0_embed_224_tta2_batched.npy'
ord_te = 'order_test_effb0_embed_224_tta2_batched.npy'

if Path(ft_tr).exists() and Path(ft_te).exists() and Path(ord_tr).exists() and Path(ord_te).exists():
    Ft = np.load(ft_tr); Fe = np.load(ft_te)
    otr = np.load(ord_tr, allow_pickle=True).astype(str); ote = np.load(ord_te, allow_pickle=True).astype(str)
    assert np.array_equal(otr, train['image_name'].astype(str).values)
    assert np.array_equal(ote, test['image_name'].astype(str).values)
    print('Loaded cached TTA2-batched embeddings:', Ft.shape, Fe.shape)
else:
    Ft = extract_tta2_avg_batched(train['image_path'].tolist())
    Fe = extract_tta2_avg_batched(test['image_path'].tolist())
    np.save(ft_tr, Ft); np.save(ft_te, Fe)
    np.save(ord_tr, train['image_name'].values.astype(str)); np.save(ord_te, test['image_name'].values.astype(str))
    print('Saved TTA2-batched embeddings and sidecars.')

# LR head CV
y = train['target'].values.astype(int)
oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds, fold_aucs = [], []
for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = Ft[trn_idx], y[trn_idx]
    X_val, y_val = Ft[val_idx], y[val_idx]
    clf = Pipeline([
        ('sc', StandardScaler(with_mean=True, with_std=True)),
        ('lr', LogisticRegression(solver='lbfgs', C=1.0, max_iter=20000, class_weight='balanced', n_jobs=1, random_state=SEED))
    ])
    clf.fit(X_trn, y_trn)
    pv = clf.predict_proba(X_val)[:,1]
    oof[val_idx] = pv.astype(np.float32)
    fold_aucs.append(float(roc_auc_score(y_val, pv)))
    test_pred_folds.append(clf.predict_proba(Fe)[:,1].astype(np.float32))
print('Per-fold AUCs (B0 TTA2 batched):', [round(a,6) for a in fold_aucs])
oof_auc = roc_auc_score(y, oof)
print('OOF AUC (EffB0 TTA2-batched emb + LR):', round(oof_auc,6))

# Save artifacts
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_onnx_effb0_embed_tta2_batched_lr'] = oof
oof_df.to_csv('oof_meta.csv', index=False)
test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
pd.DataFrame({'image_name': test['image_name'], 'target': test_pred}).to_csv('submission.csv', index=False)
print('Saved oof_meta.csv and submission.csv')

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.26.4 which is incompatible.
grpcio-status 1.74.0 requires protobuf<7.0.0,>=6.31.1, but you have protobuf 3.20.3 which is incompatible.[0m[31m
[0m

TTA2 Batched 640/28984 (2%)


TTA2 Batched 1280/28984 (4%)


TTA2 Batched 1920/28984 (6%)


TTA2 Batched 2560/28984 (8%)


TTA2 Batched 3200/28984 (11%)


TTA2 Batched 3840/28984 (13%)


TTA2 Batched 4480/28984 (15%)


TTA2 Batched 5120/28984 (17%)


TTA2 Batched 5760/28984 (19%)


TTA2 Batched 6400/28984 (22%)


TTA2 Batched 7040/28984 (24%)


TTA2 Batched 7680/28984 (26%)


TTA2 Batched 8320/28984 (28%)


TTA2 Batched 8960/28984 (30%)


TTA2 Batched 9600/28984 (33%)


TTA2 Batched 10240/28984 (35%)


TTA2 Batched 10880/28984 (37%)


TTA2 Batched 11520/28984 (39%)


TTA2 Batched 12160/28984 (41%)


TTA2 Batched 12800/28984 (44%)


TTA2 Batched 13440/28984 (46%)


TTA2 Batched 14080/28984 (48%)


TTA2 Batched 14720/28984 (50%)


TTA2 Batched 15360/28984 (52%)


TTA2 Batched 16000/28984 (55%)


TTA2 Batched 16640/28984 (57%)


TTA2 Batched 17280/28984 (59%)


TTA2 Batched 17920/28984 (61%)


TTA2 Batched 18560/28984 (64%)


TTA2 Batched 19200/28984 (66%)


TTA2 Batched 19840/28984 (68%)


TTA2 Batched 20480/28984 (70%)


TTA2 Batched 21120/28984 (72%)


TTA2 Batched 21760/28984 (75%)


TTA2 Batched 22400/28984 (77%)


TTA2 Batched 23040/28984 (79%)


TTA2 Batched 23680/28984 (81%)


TTA2 Batched 24320/28984 (83%)


TTA2 Batched 24960/28984 (86%)


TTA2 Batched 25600/28984 (88%)


TTA2 Batched 26240/28984 (90%)


TTA2 Batched 26880/28984 (92%)


TTA2 Batched 27520/28984 (94%)


TTA2 Batched 28160/28984 (97%)


TTA2 Batched 28800/28984 (99%)


TTA2 batched extraction: shape=(28984, 1280) | 5651.8s


TTA2 Batched 640/4142 (15%)


TTA2 Batched 1280/4142 (30%)


TTA2 Batched 1920/4142 (46%)


TTA2 Batched 2560/4142 (61%)


TTA2 Batched 3200/4142 (77%)


TTA2 Batched 3840/4142 (92%)


TTA2 batched extraction: shape=(4142, 1280) | 777.4s
Saved TTA2-batched embeddings and sidecars.


Per-fold AUCs (B0 TTA2 batched): [0.801398, 0.825566, 0.826825, 0.842334, 0.782712]
OOF AUC (EffB0 TTA2-batched emb + LR): 0.816025
Saved oof_meta.csv and submission.csv


# Fact Sheet: SIIM-ISIC Melanoma Classification (Current State, Revised per Audit)

## 1. Goal & Metric
- Objective: Achieve medal on AUC-ROC (Bronze ≥ 0.937, Silver ≥ 0.9401, Gold ≥ 0.9455).

## 2. Cross-Validation & Data Leaks
- Final CV: 5-fold StratifiedGroupKFold grouped by patient (with refined duplicate handling).
- folds.csv validated; leak-proof design.

## 3. Baselines & Models Tried
- Metadata models (LGBM/CatBoost): OOF ~0.63–0.78 after calibration.
- Fast image stats + LR (CPU-only): OOF 0.801 (current single-best OOF among simple baselines).
- ONNX EfficientNet-B0 penultimate (1280D) + LR:
  - Alignment risk corrected via explicit sidecars and re-extraction (_v2).
  - Verified baseline OOF = 0.799246 (per-fold ~0.773–0.820).
- ONNX EfficientNet-B3 penultimate (1536D) + LR: preliminary per-fold ~0.76–0.82 (not yet optimized).
- Strategic Baseline for DL path: ONNX B0+LR (0.799 OOF) is the official baseline to improve upon with TTA/backbone upgrades.

## 4. Root Cause Correction
- Implemented feature/label alignment guardrails via sidecar order files and strict equality asserts.
- Re-extracted B0 embeddings in DataFrame order with sidecars (_v2).
- 0.799 OOF is the verified ground-truth baseline for ONNX B0+LR.

## 5. Current Best Submission & Provenance (Critical Fix)
- Current best OOF: Fast-stats image model (LR) at ~0.801.
- Important: submission.csv is overwritten by experiments. The final leaderboard submission must be regenerated from the best-performing model’s test predictions corresponding to the highest OOF.
- Provenance policy: Record source model, OOF, and timestamp for every submission and avoid overwriting a better submission with a lower-OOF model.

## 6. Approved, Staged Action Plan
1) B0 2-view TTA Probe (low-cost):
   - Views: {orig, horizontal flip} at embedding time; average per-image 1280D embeddings.
   - Gate: ΔOOF ≥ +0.005 over 0.799 → PASS.
   - If +0.001 ≤ ΔOOF < +0.005 → Marginal; flag for review (likely apply 2-view TTA to B3 later).
   - If ΔOOF < +0.001 → FAIL; pivot immediately to B3 single-view.
2) If Gate PASSED: Scale to 8-view TTA on B0 (target OOF ≥ 0.82).
3) Backbone Upgrade to B3 (300px):
   - Extract penultimate embeddings (apply TTA if Step 2 passed).
   - Train LR head with StandardScaler; target OOF > 0.85.
4) Ensemble (only after >0.85 single model):
   - Blend best image model with metadata and/or fast-stats via leak-proof stacking or convex per-fold blend.
   - Push OOF toward medal range.

## 7. Engineering Guardrails
- Sidecar order files for every feature cache; assert equality on load.
- Clear artifact names (e.g., _tta2, _tta8, _b3_300) and save per-step sidecars.
- Batch ONNX inference (threads=1) for stability/speed; prefer batched TTA extraction.
- LR head: StandardScaler(with_mean=True) + LogisticRegression(lbfgs, class_weight='balanced', max_iter=20k).
- Submission Provenance: For each submission, log {model, features, TTA, OOF, timestamp}; do not overwrite a higher-OOF submission.

## 8. Next Immediate Step
- Execute B0 2-view TTA (batched ORT), refit LR, and report OOF delta vs 0.799 baseline according to the refined gate logic.

In [83]:
# EXP3-ONNX-B3-SV-BATCHED: EfficientNetB3 (300px) single-view batched ONNX embeddings + LR CV
import os, sys, time, subprocess
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

os.environ['OMP_NUM_THREADS'] = '1'; os.environ['MKL_NUM_THREADS'] = '1'; os.environ['OPENBLAS_NUM_THREADS'] = '1'; os.environ['NUMEXPR_NUM_THREADS'] = '1'; os.environ['MALLOC_ARENA_MAX'] = '1'

SEED = 42
IMG_SIZE = 300
RESIZE_SHORT = 342  # ~1.14*IMG_SIZE (ImageNet conv.)
BATCH = 16  # B3 is heavier; keep conservative batch
MODEL_PATH = Path('efficientnetb3_imagenet_300_embed.onnx')  # 1536D penultimate

def pip_install(pkgs):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet'] + pkgs)

# Ensure deps
pip_install(['onnxruntime==1.18.0', 'onnx==1.16.2', 'tf2onnx==1.16.1'])
import onnxruntime as ort

# Build ONNX (EffNetB3 penultimate GAP) if missing
if (not MODEL_PATH.exists()) or MODEL_PATH.stat().st_size == 0:
    import tensorflow as tf, tf2onnx
    tf.random.set_seed(SEED)
    base = tf.keras.applications.EfficientNetB3(include_top=False, weights='imagenet', input_shape=(IMG_SIZE, IMG_SIZE, 3))
    x = tf.keras.layers.GlobalAveragePooling2D(name='gap')(base.output)  # 1536D
    model = tf.keras.Model(base.input, x)
    spec = (tf.TensorSpec((None, IMG_SIZE, IMG_SIZE, 3), tf.float32, name='input'),)
    onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=13)
    with open(MODEL_PATH, 'wb') as f: f.write(onnx_model.SerializeToString())

def make_session(mpath):
    so = ort.SessionOptions(); so.intra_op_num_threads = 1; so.inter_op_num_threads = 1
    return ort.InferenceSession(str(mpath), sess_options=so, providers=['CPUExecutionProvider'])

def resize_shorter_side(im, short=RESIZE_SHORT):
    w, h = im.size
    if w <= 0 or h <= 0: return im.resize((short, short), Image.BILINEAR)
    if w < h: nw, nh = short, int(round(h * short / w))
    else: nh, nw = short, int(round(w * short / h))
    return im.resize((nw, nh), Image.BILINEAR)

def preprocess_arr(im: Image.Image):
    im = im.convert('RGB')
    im = resize_shorter_side(im, RESIZE_SHORT)
    w, h = im.size
    left = (w - IMG_SIZE) // 2; top = (h - IMG_SIZE) // 2
    im = im.crop((left, top, left + IMG_SIZE, top + IMG_SIZE))
    arr = np.asarray(im, dtype=np.float32)  # H,W,3 in 0..255
    arr = (arr / 127.5) - 1.0  # [-1,1] (TF EfficientNet preprocess)
    return arr

def extract_singleview_batched(paths):
    sess = make_session(MODEL_PATH)
    inp = sess.get_inputs()[0].name
    out = sess.get_outputs()[0].name
    N = len(paths)
    feats = np.zeros((N, 1536), dtype=np.float32)
    t0 = time.time()
    i = 0; k = 0
    while i < N:
        j = min(i + BATCH, N)
        batch = []
        for p in paths[i:j]:
            try:
                with Image.open(p) as im:
                    arr = preprocess_arr(im)
            except Exception:
                arr = np.zeros((IMG_SIZE, IMG_SIZE, 3), dtype=np.float32)
            batch.append(arr)
        X = np.stack(batch, axis=0)  # (B, H, W, 3)
        Y = sess.run([out], {inp: X})[0]  # (B, 1536)
        feats[i:j] = Y.astype(np.float32)
        i = j; k += 1
        if k % 20 == 0:
            done = j
            print(f'B3 SV Batched {done}/{N} ({(done*100)//N}%)')
    print(f'B3 single-view batched extraction: shape={feats.shape} | {time.time()-t0:.1f}s')
    assert np.isfinite(feats).all(), 'Non-finite values in B3 embeddings'
    return feats

# Data
DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR/'train.csv')
test  = pd.read_csv(DATA_DIR/'test.csv')
folds = pd.read_csv(DATA_DIR/'folds.csv')
JPEG_TRAIN = DATA_DIR/'jpeg'/'train'; JPEG_TEST = DATA_DIR/'jpeg'/'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all(); train['fold'] = train['fold'].astype(int)

# Cache paths (versioned, single-view batched)
ft_tr = 'features_train_onnx_effb3_embed_300_sv_batched.npy'
ft_te = 'features_test_onnx_effb3_embed_300_sv_batched.npy'
ord_tr = 'order_train_onnx_effb3_embed_300_sv_batched.npy'
ord_te = 'order_test_onnx_effb3_embed_300_sv_batched.npy'

if Path(ft_tr).exists() and Path(ft_te).exists() and Path(ord_tr).exists() and Path(ord_te).exists():
    Ft = np.load(ft_tr); Fe = np.load(ft_te)
    otr = np.load(ord_tr, allow_pickle=True).astype(str); ote = np.load(ord_te, allow_pickle=True).astype(str)
    assert np.array_equal(otr, train['image_name'].astype(str).values)
    assert np.array_equal(ote, test['image_name'].astype(str).values)
    print('Loaded cached B3 single-view batched embeddings:', Ft.shape, Fe.shape)
else:
    Ft = extract_singleview_batched(train['image_path'].tolist())
    Fe = extract_singleview_batched(test['image_path'].tolist())
    np.save(ft_tr, Ft); np.save(ft_te, Fe)
    np.save(ord_tr, train['image_name'].values.astype(str)); np.save(ord_te, test['image_name'].values.astype(str))
    print('Saved B3 single-view batched embeddings and sidecars.')

# LR head CV
y = train['target'].values.astype(int)
oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds, fold_aucs = [], []
for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = Ft[trn_idx], y[trn_idx]
    X_val, y_val = Ft[val_idx], y[val_idx]
    clf = Pipeline([
        ('sc', StandardScaler(with_mean=True, with_std=True)),
        ('lr', LogisticRegression(solver='lbfgs', C=1.0, max_iter=20000, class_weight='balanced', n_jobs=1, random_state=SEED))
    ])
    clf.fit(X_trn, y_trn)
    pv = clf.predict_proba(X_val)[:,1]
    oof[val_idx] = pv.astype(np.float32)
    fold_aucs.append(float(roc_auc_score(y_val, pv)))
    test_pred_folds.append(clf.predict_proba(Fe)[:,1].astype(np.float32))
print('Per-fold AUCs (B3 single-view batched):', [round(a,6) for a in fold_aucs])
oof_auc = roc_auc_score(y, oof)
print('OOF AUC (EffB3 single-view batched emb + LR):', round(oof_auc,6))

# Save artifacts
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_onnx_effb3_embed300_svbatched_lr'] = oof
oof_df.to_csv('oof_meta.csv', index=False)
test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
pd.DataFrame({'image_name': test['image_name'], 'target': test_pred}).to_csv('submission.csv', index=False)
print('Saved oof_meta.csv and submission.csv')

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.26.4 which is incompatible.
grpcio-status 1.74.0 requires protobuf<7.0.0,>=6.31.1, but you have protobuf 3.20.3 which is incompatible.[0m[31m
[0m

B3 SV Batched 320/28984 (1%)


B3 SV Batched 640/28984 (2%)


B3 SV Batched 960/28984 (3%)


B3 SV Batched 1280/28984 (4%)


B3 SV Batched 1600/28984 (5%)


B3 SV Batched 1920/28984 (6%)


B3 SV Batched 2240/28984 (7%)


B3 SV Batched 2560/28984 (8%)


B3 SV Batched 2880/28984 (9%)


B3 SV Batched 3200/28984 (11%)


B3 SV Batched 3520/28984 (12%)


B3 SV Batched 3840/28984 (13%)


B3 SV Batched 4160/28984 (14%)


B3 SV Batched 4480/28984 (15%)


B3 SV Batched 4800/28984 (16%)


B3 SV Batched 5120/28984 (17%)


B3 SV Batched 5440/28984 (18%)


B3 SV Batched 5760/28984 (19%)


B3 SV Batched 6080/28984 (20%)


B3 SV Batched 6400/28984 (22%)


B3 SV Batched 6720/28984 (23%)


B3 SV Batched 7040/28984 (24%)


B3 SV Batched 7360/28984 (25%)


B3 SV Batched 7680/28984 (26%)


B3 SV Batched 8000/28984 (27%)


B3 SV Batched 8320/28984 (28%)


B3 SV Batched 8640/28984 (29%)


B3 SV Batched 8960/28984 (30%)


B3 SV Batched 9280/28984 (32%)


B3 SV Batched 9600/28984 (33%)


B3 SV Batched 9920/28984 (34%)


B3 SV Batched 10240/28984 (35%)


B3 SV Batched 10560/28984 (36%)


B3 SV Batched 10880/28984 (37%)


B3 SV Batched 11200/28984 (38%)


B3 SV Batched 11520/28984 (39%)


B3 SV Batched 11840/28984 (40%)


B3 SV Batched 12160/28984 (41%)


B3 SV Batched 12480/28984 (43%)


B3 SV Batched 12800/28984 (44%)


B3 SV Batched 13120/28984 (45%)


B3 SV Batched 13440/28984 (46%)


B3 SV Batched 13760/28984 (47%)


B3 SV Batched 14080/28984 (48%)


B3 SV Batched 14400/28984 (49%)


B3 SV Batched 14720/28984 (50%)


B3 SV Batched 15040/28984 (51%)


B3 SV Batched 15360/28984 (52%)


B3 SV Batched 15680/28984 (54%)


B3 SV Batched 16000/28984 (55%)


B3 SV Batched 16320/28984 (56%)


B3 SV Batched 16640/28984 (57%)


B3 SV Batched 16960/28984 (58%)


B3 SV Batched 17280/28984 (59%)


B3 SV Batched 17600/28984 (60%)


B3 SV Batched 17920/28984 (61%)


B3 SV Batched 18240/28984 (62%)


B3 SV Batched 18560/28984 (64%)


B3 SV Batched 18880/28984 (65%)


B3 SV Batched 19200/28984 (66%)


B3 SV Batched 19520/28984 (67%)


B3 SV Batched 19840/28984 (68%)


B3 SV Batched 20160/28984 (69%)


B3 SV Batched 20480/28984 (70%)


B3 SV Batched 20800/28984 (71%)


B3 SV Batched 21120/28984 (72%)


B3 SV Batched 21440/28984 (73%)


B3 SV Batched 21760/28984 (75%)


B3 SV Batched 22080/28984 (76%)


B3 SV Batched 22400/28984 (77%)


B3 SV Batched 22720/28984 (78%)


B3 SV Batched 23040/28984 (79%)


B3 SV Batched 23360/28984 (80%)


B3 SV Batched 23680/28984 (81%)


B3 SV Batched 24000/28984 (82%)


B3 SV Batched 24320/28984 (83%)


B3 SV Batched 24640/28984 (85%)


B3 SV Batched 24960/28984 (86%)


B3 SV Batched 25280/28984 (87%)


B3 SV Batched 25600/28984 (88%)


B3 SV Batched 25920/28984 (89%)


B3 SV Batched 26240/28984 (90%)


B3 SV Batched 26560/28984 (91%)


B3 SV Batched 26880/28984 (92%)


B3 SV Batched 27200/28984 (93%)


B3 SV Batched 27520/28984 (94%)


B3 SV Batched 27840/28984 (96%)


B3 SV Batched 28160/28984 (97%)


B3 SV Batched 28480/28984 (98%)


B3 SV Batched 28800/28984 (99%)


B3 single-view batched extraction: shape=(28984, 1536) | 8808.2s


B3 SV Batched 320/4142 (7%)


B3 SV Batched 640/4142 (15%)


B3 SV Batched 960/4142 (23%)


B3 SV Batched 1280/4142 (30%)


B3 SV Batched 1600/4142 (38%)


B3 SV Batched 1920/4142 (46%)


B3 SV Batched 2240/4142 (54%)


B3 SV Batched 2560/4142 (61%)


B3 SV Batched 2880/4142 (69%)


B3 SV Batched 3200/4142 (77%)


B3 SV Batched 3520/4142 (84%)


B3 SV Batched 3840/4142 (92%)


B3 single-view batched extraction: shape=(4142, 1536) | 1465.2s


Saved B3 single-view batched embeddings and sidecars.


Per-fold AUCs (B3 single-view batched): [0.760391, 0.780485, 0.812231, 0.810926, 0.751279]
OOF AUC (EffB3 single-view batched emb + LR): 0.782867
Saved oof_meta.csv and submission.csv


In [80]:
# EXP3-B3-REFIT-FAST: Refit LR on cached EfficientNetB3 (300px, 1536D) embeddings with CV and save submission
import numpy as np, pandas as pd
from pathlib import Path
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

SEED = 42
Ft_path, Fe_path = 'features_train_onnx_effb3_embed_300.npy', 'features_test_onnx_effb3_embed_300.npy'
assert Path(Ft_path).exists() and Path(Fe_path).exists(), 'B3 cached features not found'
Ft = np.load(Ft_path).astype(np.float32)
Fe = np.load(Fe_path).astype(np.float32)

train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')
folds = pd.read_csv('folds.csv')
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all() and len(Ft)==len(train) and len(Fe)==len(test)
train['fold'] = train['fold'].astype(int)
y = train['target'].values.astype(int)

oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds, fold_aucs = [], []

for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = Ft[trn_idx], y[trn_idx]
    X_val, y_val = Ft[val_idx], y[val_idx]
    clf = Pipeline([
        ('sc', StandardScaler(with_mean=True, with_std=True)),
        ('lr', LogisticRegression(solver='lbfgs', C=1.0, max_iter=20000, class_weight='balanced', n_jobs=1, random_state=SEED))
    ])
    clf.fit(X_trn, y_trn)
    pv = clf.predict_proba(X_val)[:,1]
    oof[val_idx] = pv.astype(np.float32)
    fold_aucs.append(float(roc_auc_score(y_val, pv)))
    test_pred_folds.append(clf.predict_proba(Fe)[:,1].astype(np.float32))
    print(f'Fold {f}: AUC={fold_aucs[-1]:.5f}')

oof_auc = roc_auc_score(y, oof)
print('OOF AUC (EffB3 emb 300px + LR):', round(oof_auc,6), '| per-fold:', [round(a,6) for a in fold_aucs])

# Save artifacts
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_onnx_effb3_embed300_lr'] = oof
oof_df.to_csv('oof_meta.csv', index=False)
test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
pd.DataFrame({'image_name': test['image_name'], 'target': test_pred}).to_csv('submission.csv', index=False)
print('Saved oof_meta.csv and submission.csv')

Fold 0: AUC=0.76039


Fold 1: AUC=0.78049


Fold 2: AUC=0.81223


Fold 3: AUC=0.81093


Fold 4: AUC=0.75128
OOF AUC (EffB3 emb 300px + LR): 0.782867 | per-fold: [0.760391, 0.780485, 0.812231, 0.810926, 0.751279]
Saved oof_meta.csv and submission.csv


In [94]:
# EXP3-SUBMIT-BEST (HARDENED): B0 TTA2-batched embeddings + LR CV with sidecar checks, submission gating, and manifest
import numpy as np, pandas as pd, json, time
from pathlib import Path
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

SEED = 42
ts = time.strftime('%Y%m%d_%H%M%S')
DATA_DIR = Path('.')
best_registry = DATA_DIR / 'best_submission.json'

ft_tr = DATA_DIR / 'features_train_onnx_effb0_embed_224_tta2_batched.npy'
ft_te = DATA_DIR / 'features_test_onnx_effb0_embed_224_tta2_batched.npy'
ord_tr = DATA_DIR / 'order_train_effb0_embed_224_tta2_batched.npy'
ord_te = DATA_DIR / 'order_test_effb0_embed_224_tta2_batched.npy'
assert ft_tr.exists() and ft_te.exists() and ord_tr.exists() and ord_te.exists(), 'Missing cached B0 TTA2 features or sidecars.'
Ft = np.load(ft_tr).astype(np.float32); Fe = np.load(ft_te).astype(np.float32)
otr = np.load(ord_tr, allow_pickle=True).astype(str); ote = np.load(ord_te, allow_pickle=True).astype(str)
train = pd.read_csv(DATA_DIR / 'train.csv'); test = pd.read_csv(DATA_DIR / 'test.csv'); folds = pd.read_csv(DATA_DIR / 'folds.csv')
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all(); train['fold'] = train['fold'].astype(int)
assert np.array_equal(otr, train['image_name'].astype(str).values) and np.array_equal(ote, test['image_name'].astype(str).values), 'Sidecar order mismatch; aborting.'
assert Ft.shape[0]==len(train) and Fe.shape[0]==len(test), 'Feature/CSV shape mismatch'

y = train['target'].values.astype(int)
oof = np.zeros(len(train), dtype=np.float32); test_pred_folds, fold_aucs = [], []
for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = Ft[trn_idx], y[trn_idx]
    X_val, y_val = Ft[val_idx], y[val_idx]
    clf = Pipeline([
        ('sc', StandardScaler(with_mean=True, with_std=True)),
        ('lr', LogisticRegression(solver='lbfgs', C=1.0, max_iter=20000, class_weight='balanced', n_jobs=1, random_state=SEED))
    ])
    clf.fit(X_trn, y_trn)
    pv = clf.predict_proba(X_val)[:,1]
    oof[val_idx] = pv.astype(np.float32)
    fold_aucs.append(float(roc_auc_score(y_val, pv)))
    test_pred_folds.append(clf.predict_proba(Fe)[:,1].astype(np.float32))
oof_auc = roc_auc_score(y, oof)
print('OOF AUC (B0 TTA2-batched + LR):', round(oof_auc,6), '| folds:', [round(a,6) for a in fold_aucs])

# Persist OOF always
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_effb0_tta2_batched_lr'] = oof
oof_df.to_csv('oof_meta.csv', index=False)

# Submission gating via best_submission.json
entry = {
    'timestamp': ts,
    'model': 'EffB0_TTA2_emb + LR',
    'features': ft_tr.name,
    'sidecars': {'train': ord_tr.name, 'test': ord_te.name},
    'folds_csv': 'folds.csv',
    'oof_auc': float(oof_auc),
    'per_fold_auc': [float(a) for a in fold_aucs]
}

def load_best():
    if best_registry.exists():
        try:
            with open(best_registry,'r') as f: return json.load(f)
        except Exception:
            return None
    return None

def save_best(e):
    with open(best_registry,'w') as f: json.dump(e, f, indent=2)

best = load_best()
allow_write = False
if best is None:
    allow_write = True
else:
    try:
        allow_write = (float(entry['oof_auc']) > float(best.get('oof_auc', -1.0)))
    except Exception:
        allow_write = True

if allow_write:
    test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
    pd.DataFrame({'image_name': test['image_name'], 'target': test_pred}).to_csv('submission.csv', index=False)
    save_best(entry)
    print(f'submission.csv written (OOF improved to {oof_auc:.6f}). best_submission.json updated.')
else:
    print(f'submission.csv NOT overwritten (OOF {oof_auc:.6f} <= best {best.get("oof_auc")} ).')

# Manifest
manifest = {
    'timestamp': ts,
    'pipeline': 'EffB0 TTA2 embeddings -> LR CV',
    'artifacts': {
        'features_train': ft_tr.name,
        'features_test': ft_te.name,
        'sidecar_train': ord_tr.name,
        'sidecar_test': ord_te.name,
        'folds_csv': 'folds.csv',
        'oof_file': 'oof_meta.csv',
        'submission': 'submission.csv'
    },
    'metrics': {
        'oof_auc': float(oof_auc),
        'per_fold_auc': [float(a) for a in fold_aucs]
    }
}
with open(f'manifest_{ts}.json','w') as f: json.dump(manifest, f, indent=2)
print('Manifest saved:', f'manifest_{ts}.json')

In [82]:
# CLEANUP: Delete stale B3 cached embeddings to prevent misalignment reuse (per audit immediate correction)
from pathlib import Path
targets = [
    'features_train_onnx_effb3_embed_300.npy',
    'features_test_onnx_effb3_embed_300.npy'
]
removed = []
for p in targets:
    fp = Path(p)
    if fp.exists():
        try:
            fp.unlink()
            removed.append(p)
        except Exception as e:
            print(f'Failed to remove {p}:', e)
print('Removed stale files:', removed if removed else 'None (already absent)')
print('Next: run Cell 38 to generate alignment-proof B3 single-view batched embeddings.')

Removed stale files: ['features_train_onnx_effb3_embed_300.npy', 'features_test_onnx_effb3_embed_300.npy']
Next: run Cell 38 to generate alignment-proof B3 single-view batched embeddings.


In [84]:
# DIAG-B3-vs-B0: LightGBM head on B0 TTA2 vs B3 SV-batched embeddings (no submission writes)
import os, sys, subprocess, numpy as np, pandas as pd
from pathlib import Path
from sklearn.metrics import roc_auc_score

# Ensure LightGBM
try:
    import lightgbm as lgb
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'lightgbm'])
    import lightgbm as lgb

SEED = 42

def run_lgbm_cv(Ft, y, Fe, folds):
    oof = np.zeros(len(y), dtype=np.float32)
    test_pred_folds = []
    fold_aucs = []
    params = dict(
        objective='binary', learning_rate=0.05, n_estimators=20000,
        num_leaves=255, max_bin=511, feature_fraction=0.9, bagging_fraction=0.8,
        subsample=0.8, subsample_freq=1, colsample_bytree=0.9,
        reg_lambda=1.0, reg_alpha=0.0, min_child_samples=20, n_jobs=8, random_state=SEED
    )
    for f in sorted(np.unique(folds)):
        trn_idx = np.where(folds != f)[0]
        val_idx = np.where(folds == f)[0]
        X_trn, X_val = Ft[trn_idx], Ft[val_idx]
        y_trn, y_val = y[trn_idx], y[val_idx]
        pos = max(1, int(y_trn.sum())); neg = max(1, int(len(y_trn) - pos))
        spw = neg / pos
        clf = lgb.LGBMClassifier(**params, scale_pos_weight=spw, is_unbalance=False)
        clf.fit(
            X_trn, y_trn,
            eval_set=[(X_val, y_val)],
            eval_metric='auc',
            callbacks=[lgb.early_stopping(1000, verbose=False)]
        )
        pv = clf.predict_proba(X_val)[:,1]
        oof[val_idx] = pv.astype(np.float32)
        fold_aucs.append(float(roc_auc_score(y_val, pv)))
        if Fe is not None:
            test_pred_folds.append(clf.predict_proba(Fe)[:,1].astype(np.float32))
    oof_auc = roc_auc_score(y, oof)
    test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else None
    return oof_auc, fold_aucs, oof, test_pred

# Load CSVs and folds
train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')
folds_df = pd.read_csv('folds.csv')
train = train.merge(folds_df[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all()
folds = train['fold'].astype(int).values
y = train['target'].astype(int).values

# ---------- B0 TTA2-batched embeddings ----------
b0_tr = 'features_train_onnx_effb0_embed_224_tta2_batched.npy'
b0_te = 'features_test_onnx_effb0_embed_224_tta2_batched.npy'
b0_ord_tr = 'order_train_effb0_embed_224_tta2_batched.npy'
b0_ord_te = 'order_test_effb0_embed_224_tta2_batched.npy'
assert Path(b0_tr).exists() and Path(b0_te).exists() and Path(b0_ord_tr).exists() and Path(b0_ord_te).exists(), 'Missing B0 TTA2 artifacts'
Ft_b0 = np.load(b0_tr).astype(np.float32)
Fe_b0 = np.load(b0_te).astype(np.float32)
otr_b0 = np.load(b0_ord_tr, allow_pickle=True).astype(str)
ote_b0 = np.load(b0_ord_te, allow_pickle=True).astype(str)
assert np.array_equal(otr_b0, train['image_name'].astype(str).values) and np.array_equal(ote_b0, test['image_name'].astype(str).values), 'B0 sidecar mismatch'

# ---------- B3 single-view batched embeddings ----------
b3_tr = 'features_train_onnx_effb3_embed_300_sv_batched.npy'
b3_te = 'features_test_onnx_effb3_embed_300_sv_batched.npy'
b3_ord_tr = 'order_train_onnx_effb3_embed_300_sv_batched.npy'
b3_ord_te = 'order_test_onnx_effb3_embed_300_sv_batched.npy'
assert Path(b3_tr).exists() and Path(b3_te).exists() and Path(b3_ord_tr).exists() and Path(b3_ord_te).exists(), 'Missing B3 SV-batched artifacts'
Ft_b3 = np.load(b3_tr).astype(np.float32)
Fe_b3 = np.load(b3_te).astype(np.float32)
otr_b3 = np.load(b3_ord_tr, allow_pickle=True).astype(str)
ote_b3 = np.load(b3_ord_te, allow_pickle=True).astype(str)
assert np.array_equal(otr_b3, train['image_name'].astype(str).values) and np.array_equal(ote_b3, test['image_name'].astype(str).values), 'B3 sidecar mismatch'

# Run diagnostics
auc_b0, folds_b0, oof_b0, _ = run_lgbm_cv(Ft_b0, y, Fe_b0, folds)
auc_b3, folds_b3, oof_b3, _ = run_lgbm_cv(Ft_b3, y, Fe_b3, folds)
print('LGBM OOF AUC | B0 TTA2-batched:', round(auc_b0,6), '| per-fold:', [round(a,6) for a in folds_b0])
print('LGBM OOF AUC | B3 single-view batched:', round(auc_b3,6), '| per-fold:', [round(a,6) for a in folds_b3])

# Decision gate (no file writes, only print):
if auc_b3 > auc_b0:
    print('DECISION: B3>LGBM beats B0>LGBM. APPROVED to proceed with B3 TTA2 + LGBM next.')
else:
    print('DECISION: B3<=B0 under same head. ABANDON B3 for now; pivot to B0 TTA4/8 improvements.')

print('Note: submission.csv was not modified by this diagnostic run.')



[LightGBM] [Info] Number of positive: 410, number of negative: 23138
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.177287 seconds.
You can set `force_col_wise=true` to remove the overhead.


[LightGBM] [Info] Total Bins 654080


[LightGBM] [Info] Number of data points in the train set: 23548, number of used features: 1280


[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017411 -> initscore=-4.033074
[LightGBM] [Info] Start training from score -4.033074








































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































[LightGBM] [Info] Number of positive: 411, number of negative: 22604
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.190165 seconds.
You can set `force_col_wise=true` to remove the overhead.


[LightGBM] [Info] Total Bins 654080


[LightGBM] [Info] Number of data points in the train set: 23015, number of used features: 1280


[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017858 -> initscore=-4.007289
[LightGBM] [Info] Start training from score -4.007289


























































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































[LightGBM] [Info] Number of positive: 413, number of negative: 23026
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.159290 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 654080


[LightGBM] [Info] Number of data points in the train set: 23439, number of used features: 1280


[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017620 -> initscore=-4.020932
[LightGBM] [Info] Start training from score -4.020932






































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































[LightGBM] [Info] Number of positive: 407, number of negative: 22357
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.152459 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 654080


[LightGBM] [Info] Number of data points in the train set: 22764, number of used features: 1280


[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017879 -> initscore=-4.006082
[LightGBM] [Info] Start training from score -4.006082




































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































In [87]:
# EXP3-ONNX-B0-TTA4-BATCHED: 4-view TTA (orig, hflip, vflip, hvflip) with batched ONNX inference + LR CV + gated submission
import os, sys, time, subprocess
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

os.environ['OMP_NUM_THREADS']='1'; os.environ['MKL_NUM_THREADS']='1'; os.environ['OPENBLAS_NUM_THREADS']='1'; os.environ['NUMEXPR_NUM_THREADS']='1'; os.environ['MALLOC_ARENA_MAX']='1'

SEED = 42
IMG_SIZE = 224
RESIZE_SHORT = 256
BATCH = 16  # effective 4*B per run
MODEL_PATH = Path('efficientnetb0_imagenet_224_embed.onnx')  # 1280D penultimate
BEST_OOF_GATE = 0.816025  # current best

def pip_install(pkgs):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet'] + pkgs)

# Ensure deps
pip_install(['onnxruntime==1.18.0', 'onnx==1.16.2', 'tf2onnx==1.16.1'])
import onnxruntime as ort

# Build ONNX model if missing
if (not MODEL_PATH.exists()) or MODEL_PATH.stat().st_size == 0:
    import tensorflow as tf, tf2onnx
    tf.random.set_seed(SEED)
    base = tf.keras.applications.EfficientNetB0(include_top=False, weights='imagenet', input_shape=(IMG_SIZE, IMG_SIZE, 3))
    x = tf.keras.layers.GlobalAveragePooling2D(name='gap')(base.output)
    model = tf.keras.Model(base.input, x)
    spec = (tf.TensorSpec((None, IMG_SIZE, IMG_SIZE, 3), tf.float32, name='input'),)
    onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=13)
    with open(MODEL_PATH, 'wb') as f: f.write(onnx_model.SerializeToString())

def make_session(mpath):
    so = ort.SessionOptions(); so.intra_op_num_threads=1; so.inter_op_num_threads=1
    return ort.InferenceSession(str(mpath), sess_options=so, providers=['CPUExecutionProvider'])

def resize_shorter_side(im, short=RESIZE_SHORT):
    w, h = im.size
    if w <= 0 or h <= 0: return im.resize((short, short), Image.BILINEAR)
    if w < h: nw, nh = short, int(round(h * short / w))
    else: nh, nw = short, int(round(w * short / h))
    return im.resize((nw, nh), Image.BILINEAR)

def preprocess_arr(im: Image.Image):
    im = im.convert('RGB')
    im = resize_shorter_side(im, RESIZE_SHORT)
    w, h = im.size
    left = (w - IMG_SIZE)//2; top = (h - IMG_SIZE)//2
    im = im.crop((left, top, left+IMG_SIZE, top+IMG_SIZE))
    arr = np.asarray(im, dtype=np.float32)  # H,W,3 0..255
    arr = (arr / 127.5) - 1.0  # [-1,1]
    return arr

def tta4_views(arr):
    # orig, hflip, vflip, hvflip
    h = arr[:, ::-1, :]
    v = arr[::-1, :, :]
    hv = v[:, ::-1, :]
    return np.stack([arr, h, v, hv], axis=0)  # 4,H,W,3

def extract_tta4_avg_batched(paths):
    sess = make_session(MODEL_PATH)
    inp = sess.get_inputs()[0].name
    out = sess.get_outputs()[0].name
    N = len(paths)
    feats = np.zeros((N, 1280), dtype=np.float32)
    t0 = time.time(); i = 0; k = 0
    while i < N:
        j = min(i + BATCH, N)
        batch = []
        for p in paths[i:j]:
            try:
                with Image.open(p) as im:
                    arr = preprocess_arr(im)
            except Exception:
                arr = np.zeros((IMG_SIZE, IMG_SIZE, 3), dtype=np.float32)
            batch.append(tta4_views(arr))
        X = np.concatenate(batch, axis=0)  # (4*B, H, W, 3)
        Y = sess.run([out], {inp: X})[0]  # (4*B, 1280)
        Y = Y.reshape(-1, 4, 1280).mean(axis=1)  # (B, 1280)
        feats[i:j] = Y.astype(np.float32)
        i = j; k += 1
        if k % 20 == 0:
            done = j
            print(f'TTA4 Batched {done}/{N} ({(done*100)//N}%)')
    print(f'TTA4 batched extraction: shape={feats.shape} | {time.time()-t0:.1f}s')
    assert np.isfinite(feats).all(), 'Non-finite values in TTA4 embeddings'
    return feats

# Data
DATA_DIR = Path('.')
train = pd.read_csv(DATA_DIR/'train.csv')
test  = pd.read_csv(DATA_DIR/'test.csv')
folds = pd.read_csv(DATA_DIR/'folds.csv')
JPEG_TRAIN = DATA_DIR/'jpeg'/'train'; JPEG_TEST = DATA_DIR/'jpeg'/'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all(); train['fold'] = train['fold'].astype(int)

# Cache paths
ft_tr = 'features_train_onnx_effb0_embed_224_tta4_batched.npy'
ft_te = 'features_test_onnx_effb0_embed_224_tta4_batched.npy'
ord_tr = 'order_train_effb0_embed_224_tta4_batched.npy'
ord_te = 'order_test_effb0_embed_224_tta4_batched.npy'

if Path(ft_tr).exists() and Path(ft_te).exists() and Path(ord_tr).exists() and Path(ord_te).exists():
    Ft = np.load(ft_tr); Fe = np.load(ft_te)
    otr = np.load(ord_tr, allow_pickle=True).astype(str); ote = np.load(ord_te, allow_pickle=True).astype(str)
    assert np.array_equal(otr, train['image_name'].astype(str).values)
    assert np.array_equal(ote, test['image_name'].astype(str).values)
    print('Loaded cached TTA4-batched embeddings:', Ft.shape, Fe.shape)
else:
    Ft = extract_tta4_avg_batched(train['image_path'].tolist())
    Fe = extract_tta4_avg_batched(test['image_path'].tolist())
    np.save(ft_tr, Ft); np.save(ft_te, Fe)
    np.save(ord_tr, train['image_name'].values.astype(str)); np.save(ord_te, test['image_name'].values.astype(str))
    print('Saved TTA4-batched embeddings and sidecars.')

# LR head CV
y = train['target'].values.astype(int)
oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds, fold_aucs = [], []
for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = Ft[trn_idx], y[trn_idx]
    X_val, y_val = Ft[val_idx], y[val_idx]
    clf = Pipeline([
        ('sc', StandardScaler(with_mean=True, with_std=True)),
        ('lr', LogisticRegression(solver='lbfgs', C=1.0, max_iter=20000, class_weight='balanced', n_jobs=1, random_state=SEED))
    ])
    clf.fit(X_trn, y_trn)
    pv = clf.predict_proba(X_val)[:,1]
    oof[val_idx] = pv.astype(np.float32)
    fold_aucs.append(float(roc_auc_score(y_val, pv)))
    test_pred_folds.append(clf.predict_proba(Fe)[:,1].astype(np.float32))
print('Per-fold AUCs (B0 TTA4 batched):', [round(a,6) for a in fold_aucs])
oof_auc = roc_auc_score(y, oof)
print('OOF AUC (EffB0 TTA4-batched emb + LR):', round(oof_auc,6))

# Save OOF artifacts always
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_onnx_effb0_embed_tta4_batched_lr'] = oof
oof_df.to_csv('oof_meta.csv', index=False)

# Gate submission: only overwrite if strictly better than BEST_OOF_GATE
if oof_auc > BEST_OOF_GATE:
    test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
    pd.DataFrame({'image_name': test['image_name'], 'target': test_pred}).to_csv('submission.csv', index=False)
    print(f'submission.csv updated from TTA4 (OOF={oof_auc:.6f} > {BEST_OOF_GATE:.6f}).')
else:
    print(f'submission.csv kept (OOF={oof_auc:.6f} <= {BEST_OOF_GATE:.6f}).')

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.26.4 which is incompatible.
grpcio-status 1.74.0 requires protobuf<7.0.0,>=6.31.1, but you have protobuf 3.20.3 which is incompatible.[0m[31m
[0m

In [86]:
# EXP3-B3-SV-REFIT: Corrected LR head on alignment-proof B3 single-view batched embeddings (no submission write)
import numpy as np, pandas as pd
from pathlib import Path
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

SEED = 42
# Correct artifacts produced by Cell 38
ft_tr = 'features_train_onnx_effb3_embed_300_sv_batched.npy'
ft_te = 'features_test_onnx_effb3_embed_300_sv_batched.npy'
ord_tr = 'order_train_onnx_effb3_embed_300_sv_batched.npy'
ord_te = 'order_test_onnx_effb3_embed_300_sv_batched.npy'
assert Path(ft_tr).exists() and Path(ft_te).exists() and Path(ord_tr).exists() and Path(ord_te).exists(), 'Missing B3 SV-batched artifacts'
Ft = np.load(ft_tr).astype(np.float32)
Fe = np.load(ft_te).astype(np.float32)
otr = np.load(ord_tr, allow_pickle=True).astype(str)
ote = np.load(ord_te, allow_pickle=True).astype(str)

train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')
folds = pd.read_csv('folds.csv')
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all(); train['fold'] = train['fold'].astype(int)
assert np.array_equal(otr, train['image_name'].astype(str).values), 'Train sidecar mismatch for B3 SV-batched'
assert np.array_equal(ote, test['image_name'].astype(str).values),  'Test sidecar mismatch for B3 SV-batched'
assert len(Ft)==len(train) and len(Fe)==len(test), 'Shape mismatch vs CSVs'

y = train['target'].values.astype(int)
oof = np.zeros(len(train), dtype=np.float32)
fold_aucs = []
for f in sorted(train['fold'].unique()):
    trn_idx = np.where(train['fold'].values != f)[0]
    val_idx = np.where(train['fold'].values == f)[0]
    X_trn, y_trn = Ft[trn_idx], y[trn_idx]
    X_val, y_val = Ft[val_idx], y[val_idx]
    clf = Pipeline([
        ('sc', StandardScaler(with_mean=True, with_std=True)),
        ('lr', LogisticRegression(solver='lbfgs', C=1.0, max_iter=20000, class_weight='balanced', n_jobs=1, random_state=SEED))
    ])
    clf.fit(X_trn, y_trn)
    pv = clf.predict_proba(X_val)[:,1].astype(np.float32)
    oof[val_idx] = pv
    fold_aucs.append(float(roc_auc_score(y_val, pv)))

oof_auc = roc_auc_score(y, oof)
print('Per-fold AUCs (B3 SV-batched, corrected):', [round(a,6) for a in fold_aucs])
print('OOF AUC (B3 SV-batched + LR, corrected):', round(oof_auc,6))

# Save OOF only; do not modify submission
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_onnx_effb3_embed300_svbatched_lr_corrected'] = oof
oof_df.to_csv('oof_meta.csv', index=False)
print('Saved corrected OOF to oof_meta.csv; submission.csv left unchanged (B0 TTA2 retained).')

In [88]:
# EXP3-ONNX-B0-TTA2-LGBM (HARDENED): LGBM head on EfficientNetB0 TTA2-batched embeddings with sidecar checks, submission gating, and manifest
import os, sys, json, time, subprocess, numpy as np, pandas as pd
from pathlib import Path
from sklearn.metrics import roc_auc_score

# Ensure LightGBM
try:
    import lightgbm as lgb
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'lightgbm'])
    import lightgbm as lgb

SEED = 42
ts = time.strftime('%Y%m%d_%H%M%S')
DATA_DIR = Path('.')
best_registry = DATA_DIR / 'best_submission.json'

# Load features + sidecars
ft_tr = DATA_DIR / 'features_train_onnx_effb0_embed_224_tta2_batched.npy'
ft_te = DATA_DIR / 'features_test_onnx_effb0_embed_224_tta2_batched.npy'
ord_tr = DATA_DIR / 'order_train_effb0_embed_224_tta2_batched.npy'
ord_te = DATA_DIR / 'order_test_effb0_embed_224_tta2_batched.npy'
assert ft_tr.exists() and ft_te.exists() and ord_tr.exists() and ord_te.exists(), 'Missing B0 TTA2-batched features or sidecars.'
Ft = np.load(ft_tr).astype(np.float32)
Fe = np.load(ft_te).astype(np.float32)
otr = np.load(ord_tr, allow_pickle=True).astype(str)
ote = np.load(ord_te, allow_pickle=True).astype(str)

# Load data + folds and verify strict alignment
train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all(), 'Folds missing'
train['fold'] = train['fold'].astype(int)
assert np.array_equal(otr, train['image_name'].astype(str).values), 'Train sidecar order mismatch'
assert np.array_equal(ote, test['image_name'].astype(str).values),  'Test sidecar order mismatch'
assert Ft.shape[0] == len(train) and Fe.shape[0] == len(test), 'Feature shape mismatch vs CSVs'

y = train['target'].values.astype(int)
fold_vals = train['fold'].values.astype(int)

# LGBM params (robust) 
base_params = dict(
    objective='binary', learning_rate=0.05, n_estimators=20000,
    num_leaves=255, max_bin=511, feature_fraction=0.9, bagging_fraction=0.8,
    subsample=0.8, subsample_freq=1, colsample_bytree=0.9,
    reg_lambda=1.0, reg_alpha=0.0, min_child_samples=20, n_jobs=8, random_state=SEED
)

oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds = []
fold_aucs = []

for f in sorted(np.unique(fold_vals)):
    trn_idx = np.where(fold_vals != f)[0]
    val_idx = np.where(fold_vals == f)[0]
    X_trn, X_val = Ft[trn_idx], Ft[val_idx]
    y_trn, y_val = y[trn_idx], y[val_idx]
    pos = max(1, int(y_trn.sum())); neg = max(1, int(len(y_trn) - pos))
    spw = neg / pos
    params = dict(base_params)
    clf = lgb.LGBMClassifier(**params, scale_pos_weight=spw, is_unbalance=False)
    clf.fit(
        X_trn, y_trn,
        eval_set=[(X_val, y_val)],
        eval_metric='auc',
        callbacks=[lgb.early_stopping(1000, verbose=False)]
    )
    pv = clf.predict_proba(X_val)[:, 1]
    oof[val_idx] = pv.astype(np.float32)
    auc = roc_auc_score(y_val, pv)
    fold_aucs.append(float(auc))
    test_pred_folds.append(clf.predict_proba(Fe)[:, 1].astype(np.float32))
    print(f'Fold {f}: AUC={auc:.6f}, best_iter={getattr(clf, "best_iteration_", None)}')

oof_auc = roc_auc_score(y, oof)
print('OOF AUC (B0 TTA2-batched emb + LGBM):', round(oof_auc, 6), '| per-fold:', [round(a,6) for a in fold_aucs])

# Save OOF always
oof_df = train[['image_name', 'fold', 'target']].copy()
oof_df['pred_img_onnx_effb0_embed_tta2_batched_lgbm'] = oof
oof_df.to_csv('oof_meta.csv', index=False)

# Submission gating via best_submission.json
entry = {
    'timestamp': ts,
    'model': 'EffB0_TTA2_emb + LGBM',
    'features': ft_tr.name,
    'sidecars': {'train': ord_tr.name, 'test': ord_te.name},
    'folds_csv': 'folds.csv',
    'oof_auc': float(oof_auc),
    'per_fold_auc': [float(a) for a in fold_aucs]
}

def load_best():
    if best_registry.exists():
        try:
            with open(best_registry,'r') as f: return json.load(f)
        except Exception:
            return None
    return None

def save_best(e):
    with open(best_registry,'w') as f: json.dump(e, f, indent=2)

best = load_best()
allow_write = False
if best is None:
    allow_write = True
else:
    try:
        allow_write = (float(entry['oof_auc']) > float(best.get('oof_auc', -1.0)))
    except Exception:
        allow_write = True

if allow_write:
    test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
    pd.DataFrame({'image_name': test['image_name'], 'target': test_pred}).to_csv('submission.csv', index=False)
    save_best(entry)
    print(f'submission.csv written (OOF improved to {oof_auc:.6f}). best_submission.json updated.')
else:
    print(f'submission.csv NOT overwritten (OOF {oof_auc:.6f} <= best {best.get("oof_auc")} ).')

# Manifest for provenance
manifest = {
    'timestamp': ts,
    'pipeline': 'EffB0 TTA2 embeddings -> LGBM CV',
    'artifacts': {
        'features_train': ft_tr.name,
        'features_test': ft_te.name,
        'sidecar_train': ord_tr.name,
        'sidecar_test': ord_te.name,
        'folds_csv': 'folds.csv',
        'oof_file': 'oof_meta.csv',
        'submission': 'submission.csv'
    },
    'metrics': {
        'oof_auc': float(oof_auc),
        'per_fold_auc': [float(a) for a in fold_aucs]
    }
}
with open(f'manifest_{ts}.json','w') as f:
    json.dump(manifest, f, indent=2)
print('Manifest saved:', f'manifest_{ts}.json')



[LightGBM] [Info] Number of positive: 410, number of negative: 23138
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.165355 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 654080


[LightGBM] [Info] Number of data points in the train set: 23548, number of used features: 1280


[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017411 -> initscore=-4.033074
[LightGBM] [Info] Start training from score -4.033074






















































































































































































































































































































































































In [89]:
# EXP4-GLOBAL-SAFEGUARDS + LOGITS+LGBM (mandated): sidecar-aligned CV with submission gating and manifest
import os, sys, json, time, subprocess
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Ensure LightGBM
try:
    import lightgbm as lgb
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'lightgbm'])
    import lightgbm as lgb

SEED = 42
ts = time.strftime('%Y%m%d_%H%M%S')

# Paths
DATA_DIR = Path('.')
ft_tr = DATA_DIR / 'features_train_onnx_effb0_logits_224.npy'
ft_te = DATA_DIR / 'features_test_onnx_effb0_logits_224.npy'
ord_tr = DATA_DIR / 'order_train_onnx_effb0_logits_224.npy'
ord_te = DATA_DIR / 'order_test_onnx_effb0_logits_224.npy'
best_registry = DATA_DIR / 'best_submission.json'

# Load CSVs
train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all(), 'folds missing'
train['fold'] = train['fold'].astype(int)

# Load logits features
assert ft_tr.exists() and ft_te.exists(), 'Missing ONNX logits features. Run the ONNX logits extraction cell first.'
Ft = np.load(ft_tr).astype(np.float32)
Fe = np.load(ft_te).astype(np.float32)

# Create or validate sidecars
if not ord_tr.exists():
    np.save(ord_tr, train['image_name'].values.astype(str))
if not ord_te.exists():
    np.save(ord_te, test['image_name'].values.astype(str))
otr = np.load(ord_tr, allow_pickle=True).astype(str)
ote = np.load(ord_te, allow_pickle=True).astype(str)
assert len(otr)==Ft.shape[0] and len(ote)==Fe.shape[0], 'Sidecar length mismatch to feature rows'
assert np.array_equal(otr, train['image_name'].astype(str).values), 'Train sidecar order mismatch for logits'
assert np.array_equal(ote, test['image_name'].astype(str).values),  'Test sidecar order mismatch for logits'
assert Ft.shape[0]==len(train) and Fe.shape[0]==len(test), 'Feature shape mismatch vs CSVs'

# LGBM params (robust, quiet)
base_params = dict(
    objective='binary', learning_rate=0.05, n_estimators=20000,
    num_leaves=255, max_bin=511, feature_fraction=0.9, bagging_fraction=0.8,
    subsample=0.8, subsample_freq=1, colsample_bytree=0.9,
    reg_lambda=1.0, reg_alpha=0.0, min_child_samples=20, n_jobs=8, random_state=SEED
)

y = train['target'].values.astype(int)
fold_vals = train['fold'].values.astype(int)
oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds = []
fold_aucs = []

for f in sorted(np.unique(fold_vals)):
    trn_idx = np.where(fold_vals != f)[0]
    val_idx = np.where(fold_vals == f)[0]
    X_trn, X_val = Ft[trn_idx], Ft[val_idx]
    y_trn, y_val = y[trn_idx], y[val_idx]
    pos = max(1, int(y_trn.sum())); neg = max(1, int(len(y_trn)-pos))
    spw = neg / pos
    params = dict(base_params)
    clf = lgb.LGBMClassifier(**params, scale_pos_weight=spw, is_unbalance=False)
    clf.fit(
        X_trn, y_trn,
        eval_set=[(X_val, y_val)],
        eval_metric='auc',
        callbacks=[lgb.early_stopping(1000, verbose=False)]
    )
    pv = clf.predict_proba(X_val)[:,1]
    oof[val_idx] = pv.astype(np.float32)
    auc = roc_auc_score(y_val, pv)
    fold_aucs.append(float(auc))
    test_pred_folds.append(clf.predict_proba(Fe)[:,1].astype(np.float32))
    print(f'Fold {f}: AUC={auc:.6f}, best_iter={getattr(clf, "best_iteration_", None)}')

oof_auc = roc_auc_score(y, oof)
print('GLOBAL OOF AUC (EffB0 logits + LGBM):', round(oof_auc,6))
print('Per-fold AUCs:', [round(a,6) for a in fold_aucs])

# Persist OOF always
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_onnx_effb0_logits_lgbm'] = oof
oof_df.to_csv('oof_meta.csv', index=False)

# Submission gating via best_submission.json
current_entry = {
    'timestamp': ts,
    'model': 'EffB0_ONNX_logits + LGBM',
    'features': ft_tr.name,
    'sidecars': {'train': ord_tr.name, 'test': ord_te.name},
    'folds_csv': 'folds.csv',
    'oof_auc': float(oof_auc),
    'per_fold_auc': [float(a) for a in fold_aucs]
}

def load_best():
    if best_registry.exists():
        try:
            with open(best_registry,'r') as f: return json.load(f)
        except Exception:
            return None
    return None

def save_best(entry):
    with open(best_registry,'w') as f: json.dump(entry, f, indent=2)

best = load_best()
allow_write = False
if best is None:
    allow_write = True
else:
    try:
        allow_write = (float(current_entry['oof_auc']) > float(best.get('oof_auc', -1.0)))
    except Exception:
        allow_write = True

if allow_write:
    test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
    pd.DataFrame({'image_name': test['image_name'], 'target': test_pred}).to_csv('submission.csv', index=False)
    save_best(current_entry)
    print(f'submission.csv written (OOF improved to {oof_auc:.6f}). best_submission.json updated.')
else:
    print(f'submission.csv NOT overwritten (OOF {oof_auc:.6f} <= best {best.get("oof_auc")}).')

# Save manifest for provenance
manifest = {
    'timestamp': ts,
    'pipeline': 'EffB0 ONNX logits -> LGBM CV',
    'artifacts': {
        'features_train': ft_tr.name,
        'features_test': ft_te.name,
        'sidecar_train': ord_tr.name,
        'sidecar_test': ord_te.name,
        'folds_csv': 'folds.csv',
        'oof_file': 'oof_meta.csv',
        'submission': 'submission.csv'
    },
    'metrics': {
        'oof_auc': float(oof_auc),
        'per_fold_auc': [float(a) for a in fold_aucs]
    }
}
with open(f'manifest_{ts}.json','w') as f:
    json.dump(manifest, f, indent=2)
print('Manifest saved:', f'manifest_{ts}.json')



[LightGBM] [Info] Number of positive: 410, number of negative: 23138
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.145014 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 511000


[LightGBM] [Info] Number of data points in the train set: 23548, number of used features: 1000


[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017411 -> initscore=-4.033074
[LightGBM] [Info] Start training from score -4.033074










































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































Fold 0: AUC=0.787566, best_iter=70




[LightGBM] [Info] Number of positive: 411, number of negative: 22604
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.125122 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 511000


[LightGBM] [Info] Number of data points in the train set: 23015, number of used features: 1000


[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017858 -> initscore=-4.007289
[LightGBM] [Info] Start training from score -4.007289






























































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































Fold 1: AUC=0.788282, best_iter=88


[LightGBM] [Info] Number of positive: 413, number of negative: 23026
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.142545 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 511000


[LightGBM] [Info] Number of data points in the train set: 23439, number of used features: 1000


[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017620 -> initscore=-4.020932
[LightGBM] [Info] Start training from score -4.020932
























































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































Fold 2: AUC=0.771625, best_iter=88




[LightGBM] [Info] Number of positive: 407, number of negative: 22357
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.123550 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 511000


[LightGBM] [Info] Number of data points in the train set: 22764, number of used features: 1000
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017879 -> initscore=-4.006082
[LightGBM] [Info] Start training from score -4.006082




































































































































































































































































































































































































































































































































































































































































































































































































































































































Fold 3: AUC=0.838104, best_iter=93


[LightGBM] [Info] Number of positive: 411, number of negative: 22759
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.117796 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 511000
[LightGBM] [Info] Number of data points in the train set: 23170, number of used features: 1000


[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017738 -> initscore=-4.014123
[LightGBM] [Info] Start training from score -4.014123


























































































































































































































































































































































































































































































































































































































































































































































































































































































































































































Fold 4: AUC=0.794130, best_iter=74
GLOBAL OOF AUC (EffB0 logits + LGBM): 0.796123
Per-fold AUCs: [0.787566, 0.788282, 0.771625, 0.838104, 0.79413]
submission.csv written (OOF improved to 0.796123). best_submission.json updated.
Manifest saved: manifest_20250824_224755.json


In [91]:
# EXP4-TUNED: LGBM head on B0 TTA2-batched embeddings (raw features) with robust params, gating, and manifest
import os, sys, json, time, subprocess
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

try:
    import lightgbm as lgb
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'lightgbm'])
    import lightgbm as lgb

SEED = 42
ts = time.strftime('%Y%m%d_%H%M%S')
DATA_DIR = Path('.')

# Artifacts (B0 TTA2-batched embeddings already extracted)
ft_tr = DATA_DIR / 'features_train_onnx_effb0_embed_224_tta2_batched.npy'
ft_te = DATA_DIR / 'features_test_onnx_effb0_embed_224_tta2_batched.npy'
ord_tr = DATA_DIR / 'order_train_effb0_embed_224_tta2_batched.npy'
ord_te = DATA_DIR / 'order_test_effb0_embed_224_tta2_batched.npy'
best_registry = DATA_DIR / 'best_submission.json'

assert ft_tr.exists() and ft_te.exists() and ord_tr.exists() and ord_te.exists(), 'Missing B0 TTA2-batched features/sidecars.'

train = pd.read_csv(DATA_DIR / 'train.csv')
test  = pd.read_csv(DATA_DIR / 'test.csv')
folds = pd.read_csv(DATA_DIR / 'folds.csv')
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all(), 'folds missing'
train['fold'] = train['fold'].astype(int)

Ft = np.load(ft_tr).astype(np.float32)
Fe = np.load(ft_te).astype(np.float32)
otr = np.load(ord_tr, allow_pickle=True).astype(str)
ote = np.load(ord_te, allow_pickle=True).astype(str)
assert Ft.shape[0]==len(train) and Fe.shape[0]==len(test), 'Feature shape mismatch vs CSVs'
assert np.array_equal(otr, train['image_name'].astype(str).values), 'Train sidecar order mismatch (TTA2)'
assert np.array_equal(ote, test['image_name'].astype(str).values),  'Test sidecar order mismatch (TTA2)'

# Tuned LightGBM params for dense CNN features
params = dict(
    objective='binary',
    boosting_type='gbdt',
    learning_rate=0.03,
    n_estimators=20000,
    num_leaves=127,
    max_bin=255,
    min_child_samples=50,
    min_sum_hessian_in_leaf=1e-3,
    feature_fraction=0.8,
    reg_lambda=2.0,
    reg_alpha=0.1,
    feature_pre_filter=False,
    force_col_wise=True,
    n_jobs=-1,
    random_state=SEED,
    is_unbalance=True
)

y = train['target'].values.astype(int)
fold_vals = train['fold'].values.astype(int)
oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds = []
fold_aucs = []

for f in sorted(np.unique(fold_vals)):
    trn_idx = np.where(fold_vals != f)[0]
    val_idx = np.where(fold_vals == f)[0]
    X_trn, X_val = Ft[trn_idx], Ft[val_idx]
    y_trn, y_val = y[trn_idx], y[val_idx]
    clf = lgb.LGBMClassifier(**params)
    clf.fit(
        X_trn, y_trn,
        eval_set=[(X_val, y_val)],
        eval_metric='auc',
        callbacks=[lgb.early_stopping(300, verbose=False)]
    )
    pv = clf.predict_proba(X_val)[:,1]
    oof[val_idx] = pv.astype(np.float32)
    auc = roc_auc_score(y_val, pv)
    fold_aucs.append(float(auc))
    test_pred_folds.append(clf.predict_proba(Fe)[:,1].astype(np.float32))
    print(f'Fold {f}: AUC={auc:.6f}, best_iter={getattr(clf, "best_iteration_", None)}')

oof_auc = roc_auc_score(y, oof)
print('GLOBAL OOF AUC (B0 TTA2 emb + Tuned LGBM):', round(oof_auc,6))
print('Per-fold AUCs:', [round(a,6) for a in fold_aucs])

# Persist OOF always
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_effb0_tta2_emb_lgbm_tuned'] = oof
oof_df.to_csv('oof_meta.csv', index=False)

# Submission gating
entry = {
    'timestamp': ts,
    'model': 'EffB0_TTA2_emb + LGBM_tuned',
    'features': ft_tr.name,
    'sidecars': {'train': ord_tr.name, 'test': ord_te.name},
    'folds_csv': 'folds.csv',
    'oof_auc': float(oof_auc),
    'per_fold_auc': [float(a) for a in fold_aucs]
}

def load_best():
    if best_registry.exists():
        try:
            with open(best_registry,'r') as f: return json.load(f)
        except Exception:
            return None
    return None

def save_best(e):
    with open(best_registry,'w') as f: json.dump(e, f, indent=2)

best = load_best()
allow_write = False
if best is None:
    allow_write = True
else:
    try:
        allow_write = (float(entry['oof_auc']) > float(best.get('oof_auc', -1.0)))
    except Exception:
        allow_write = True

if allow_write:
    test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
    pd.DataFrame({'image_name': test['image_name'], 'target': test_pred}).to_csv('submission.csv', index=False)
    save_best(entry)
    print(f'submission.csv written (OOF improved to {oof_auc:.6f}). best_submission.json updated.')
else:
    print(f'submission.csv NOT overwritten (OOF {oof_auc:.6f} <= best {best.get("oof_auc")} ).')

# Manifest
manifest = {
    'timestamp': ts,
    'pipeline': 'EffB0 TTA2 embeddings -> Tuned LGBM',
    'artifacts': {
        'features_train': ft_tr.name,
        'features_test': ft_te.name,
        'sidecar_train': ord_tr.name,
        'sidecar_test': ord_te.name,
        'folds_csv': 'folds.csv',
        'oof_file': 'oof_meta.csv',
        'submission': 'submission.csv'
    },
    'metrics': {
        'oof_auc': float(oof_auc),
        'per_fold_auc': [float(a) for a in fold_aucs]
    }
}
with open(f'manifest_{ts}.json','w') as f:
    json.dump(manifest, f, indent=2)
print('Manifest saved:', f'manifest_{ts}.json')



[LightGBM] [Info] Number of positive: 410, number of negative: 23138
[LightGBM] [Info] Total Bins 326400


[LightGBM] [Info] Number of data points in the train set: 23548, number of used features: 1280


[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.017411 -> initscore=-4.033074
[LightGBM] [Info] Start training from score -4.033074








In [96]:
# EXP5: RAW LOGITS FIX — Build B0 ONNX with classifier_activation=None, extract raw logits, tuned LGBM, gated submission + manifest
import os, sys, time, json, subprocess
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image
from sklearn.metrics import roc_auc_score

SEED = 42
IMG_SIZE = 224
RESIZE_SHORT = 256
BATCH = 32
ts = time.strftime('%Y%m%d_%H%M%S')
DATA_DIR = Path('.')
best_registry = DATA_DIR / 'best_submission.json'

def pip_install(pkgs):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet'] + pkgs)

# Ensure deps
pip_install(['onnxruntime==1.18.0', 'onnx==1.16.2', 'tf2onnx==1.16.1', 'lightgbm'])
import onnxruntime as ort
import lightgbm as lgb

# Build ONNX model with raw logits (classifier_activation=None)
MODEL_RAW = Path('efficientnetb0_imagenet_224_logits_raw.onnx')
if (not MODEL_RAW.exists()) or MODEL_RAW.stat().st_size == 0:
    import tensorflow as tf, tf2onnx
    tf.random.set_seed(SEED)
    base = tf.keras.applications.EfficientNetB0(include_top=True, weights='imagenet', input_shape=(IMG_SIZE, IMG_SIZE, 3), classifier_activation=None)
    spec = (tf.TensorSpec((None, IMG_SIZE, IMG_SIZE, 3), tf.float32, name='input'),)
    onnx_model, _ = tf2onnx.convert.from_keras(base, input_signature=spec, opset=13)
    with open(MODEL_RAW, 'wb') as f: f.write(onnx_model.SerializeToString())
    print('Saved RAW logits ONNX to', MODEL_RAW)

def make_session(mpath):
    so = ort.SessionOptions(); so.intra_op_num_threads = 1; so.inter_op_num_threads = 1
    return ort.InferenceSession(str(mpath), sess_options=so, providers=['CPUExecutionProvider'])

def resize_shorter_side(im, short=RESIZE_SHORT):
    w, h = im.size
    if w <= 0 or h <= 0: return im.resize((short, short), Image.BILINEAR)
    if w < h: nw, nh = short, int(round(h * short / w))
    else: nh, nw = short, int(round(w * short / h))
    return im.resize((nw, nh), Image.BILINEAR)

def preprocess_arr(im: Image.Image):
    im = im.convert('RGB')
    im = resize_shorter_side(im, RESIZE_SHORT)
    w, h = im.size
    left = (w - IMG_SIZE)//2; top = (h - IMG_SIZE)//2
    im = im.crop((left, top, left+IMG_SIZE, top+IMG_SIZE))
    arr = np.asarray(im, dtype=np.float32)  # 0..255
    arr = (arr / 127.5) - 1.0  # TF EfficientNet preprocess [-1,1]
    return arr

@np.errstate(all='ignore')
def extract_logits_raw(paths, mpath, batch=BATCH):
    sess = make_session(mpath)
    inp = sess.get_inputs()[0].name
    out = sess.get_outputs()[0].name
    N = len(paths)
    F = np.zeros((N, 1000), dtype=np.float32)
    t0 = time.time(); i = 0; k = 0
    while i < N:
        j = min(i + batch, N)
        batch_arr = []
        for p in paths[i:j]:
            try:
                with Image.open(p) as im:
                    arr = preprocess_arr(im)
            except Exception:
                arr = np.zeros((IMG_SIZE, IMG_SIZE, 3), dtype=np.float32)
            batch_arr.append(arr)
        X = np.stack(batch_arr, axis=0)  # (B, H, W, 3)
        Y = sess.run([out], {inp: X})[0]  # (B, 1000) RAW logits
        F[i:j] = Y.astype(np.float32)
        i = j; k += 1
        if k % 20 == 0:
            done = j; print(f'RAW logits {done}/{N} ({(done*100)//N}%)')
    print(f'RAW logits extraction: shape={F.shape} | {time.time()-t0:.1f}s')
    assert np.isfinite(F).all(), 'Non-finite values in RAW logits'
    return F

# Data
train = pd.read_csv(DATA_DIR/'train.csv')
test  = pd.read_csv(DATA_DIR/'test.csv')
folds = pd.read_csv(DATA_DIR/'folds.csv')
JPEG_TRAIN = DATA_DIR/'jpeg'/'train'; JPEG_TEST = DATA_DIR/'jpeg'/'test'
train['image_path'] = train['image_name'].apply(lambda x: str(JPEG_TRAIN / f'{x}.jpg'))
test['image_path']  = test['image_name'].apply(lambda x: str(JPEG_TEST  / f'{x}.jpg'))
train = train.merge(folds[['image_name','fold']], on='image_name', how='left')
assert train['fold'].notna().all(); train['fold'] = train['fold'].astype(int)

# Feature cache paths (RAW logits)
ft_tr = DATA_DIR / 'features_train_onnx_effb0_logits_raw_224.npy'
ft_te = DATA_DIR / 'features_test_onnx_effb0_logits_raw_224.npy'
ord_tr = DATA_DIR / 'order_train_onnx_effb0_logits_raw_224.npy'
ord_te = DATA_DIR / 'order_test_onnx_effb0_logits_raw_224.npy'

if ft_tr.exists() and ft_te.exists() and ord_tr.exists() and ord_te.exists():
    Ft = np.load(ft_tr).astype(np.float32); Fe = np.load(ft_te).astype(np.float32)
    otr = np.load(ord_tr, allow_pickle=True).astype(str); ote = np.load(ord_te, allow_pickle=True).astype(str)
    print('Loaded cached RAW logits:', Ft.shape, Fe.shape)
else:
    Ft = extract_logits_raw(train['image_path'].tolist(), MODEL_RAW)
    Fe = extract_logits_raw(test['image_path'].tolist(),  MODEL_RAW)
    np.save(ft_tr, Ft); np.save(ft_te, Fe)
    np.save(ord_tr, train['image_name'].values.astype(str)); np.save(ord_te, test['image_name'].values.astype(str))
    print('Saved RAW logits features and sidecars.')
    otr = train['image_name'].astype(str).values; ote = test['image_name'].astype(str).values

# Strict sidecar checks
assert Ft.shape[0]==len(train) and Fe.shape[0]==len(test), 'RAW logits shape mismatch vs CSVs'
assert np.array_equal(otr, train['image_name'].astype(str).values), 'Train RAW sidecar mismatch'
assert np.array_equal(ote, test['image_name'].astype(str).values),  'Test RAW sidecar mismatch'

# Tuned LGBM for dense CNN features
params = dict(
    objective='binary', boosting_type='gbdt',
    learning_rate=0.03, n_estimators=20000,
    num_leaves=127, max_bin=255,
    min_child_samples=50, min_sum_hessian_in_leaf=1e-3,
    feature_fraction=0.8,
    reg_lambda=2.0, reg_alpha=0.1,
    feature_pre_filter=False,
    force_col_wise=True,
    n_jobs=-1, random_state=SEED,
    is_unbalance=True
)

y = train['target'].values.astype(int)
fold_vals = train['fold'].values.astype(int)
oof = np.zeros(len(train), dtype=np.float32)
test_pred_folds = []; fold_aucs = []

for f in sorted(np.unique(fold_vals)):
    trn_idx = np.where(fold_vals != f)[0]
    val_idx = np.where(fold_vals == f)[0]
    X_trn, X_val = Ft[trn_idx], Ft[val_idx]
    y_trn, y_val = y[trn_idx], y[val_idx]
    clf = lgb.LGBMClassifier(**params)
    clf.fit(
        X_trn, y_trn,
        eval_set=[(X_val, y_val)],
        eval_metric='auc',
        callbacks=[lgb.early_stopping(400, verbose=False)]
    )
    pv = clf.predict_proba(X_val)[:,1]
    oof[val_idx] = pv.astype(np.float32)
    auc = roc_auc_score(y_val, pv); fold_aucs.append(float(auc))
    test_pred_folds.append(clf.predict_proba(Fe)[:,1].astype(np.float32))
    print(f'Fold {f}: AUC={auc:.6f}, best_iter={getattr(clf, "best_iteration_", None)}')

oof_auc = roc_auc_score(y, oof)
print('GLOBAL OOF AUC (EffB0 RAW logits + tuned LGBM):', round(oof_auc,6))
print('Per-fold AUCs:', [round(a,6) for a in fold_aucs])

# Persist OOF always
oof_df = train[['image_name','fold','target']].copy()
oof_df['pred_img_onnx_effb0_logits_raw_lgbm'] = oof
oof_df.to_csv('oof_meta.csv', index=False)

# Submission gating
entry = {
    'timestamp': ts,
    'model': 'EffB0_ONNX_RAWlogits + LGBM_tuned',
    'features': ft_tr.name,
    'sidecars': {'train': ord_tr.name, 'test': ord_te.name},
    'folds_csv': 'folds.csv',
    'oof_auc': float(oof_auc),
    'per_fold_auc': [float(a) for a in fold_aucs]
}

def load_best():
    if best_registry.exists():
        try:
            with open(best_registry,'r') as f: return json.load(f)
        except Exception:
            return None
    return None

def save_best(e):
    with open(best_registry,'w') as f: json.dump(e, f, indent=2)

best = load_best()
allow_write = False
if best is None:
    allow_write = True
else:
    try:
        allow_write = (float(entry['oof_auc']) > float(best.get('oof_auc', -1.0)))
    except Exception:
        allow_write = True

if allow_write:
    test_pred = np.mean(np.stack(test_pred_folds, axis=0), axis=0) if len(test_pred_folds)>0 else np.zeros(len(test), dtype=np.float32)
    pd.DataFrame({'image_name': test['image_name'], 'target': test_pred}).to_csv('submission.csv', index=False)
    save_best(entry)
    print(f'submission.csv written (OOF improved to {oof_auc:.6f}). best_submission.json updated.')
else:
    print(f'submission.csv NOT overwritten (OOF {oof_auc:.6f} <= best {best.get("oof_auc")} ).')

# Manifest for provenance
manifest = {
    'timestamp': ts,
    'pipeline': 'EffB0 ONNX RAW logits -> Tuned LGBM CV',
    'artifacts': {
        'features_train': ft_tr.name,
        'features_test': ft_te.name,
        'sidecar_train': ord_tr.name,
        'sidecar_test': ord_te.name,
        'folds_csv': 'folds.csv',
        'oof_file': 'oof_meta.csv',
        'submission': 'submission.csv'
    },
    'metrics': {
        'oof_auc': float(oof_auc),
        'per_fold_auc': [float(a) for a in fold_aucs]
    }
}
with open(f'manifest_{ts}.json','w') as f: json.dump(manifest, f, indent=2)
print('Manifest saved:', f'manifest_{ts}.json')

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.26.4 which is incompatible.
grpcio-status 1.74.0 requires protobuf<7.0.0,>=6.31.1, but you have protobuf 3.20.3 which is incompatible.[0m[31m
[0m

## Submission Safety Notice (Legacy Cells Deprecated)

To prevent accidental downgrades of submission.csv and ensure provenance integrity, the following legacy cells are marked as DEPRECATED for submission writing:

- Do NOT execute for submissions: Cells 7, 8, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 38, 39, 41, 42, 43, 44.
- These cells either predate gating or were exploratory; some still contain unconditional writes.

Approved submission pipelines (gated, with manifests and sidecar asserts):
- Cell 40: EffB0 TTA2 embeddings + LR (HARDENED, gated).
- Cell 45: EffB0 TTA2 embeddings + LGBM (HARDENED, gated).
- Cell 46: EffB0 ONNX logits + LGBM (HARDENED, gated).
- Cell 47: EffB0 TTA2 embeddings + Tuned LGBM (HARDENED, gated).
- Cell 48: EffB0 RAW logits (classifier_activation=None) + Tuned LGBM (HARDENED, gated) — run end-to-end for validation.

All future submission-writing cells must:
- Use best_submission.json gating (overwrite only on OOF improvement).
- Save a timestamped manifest with features, sidecars, folds, OOF, and submission path.
- Assert exact sidecar order alignment for train/test before CV and prediction.

This deprecation ensures only the hardened cells above can produce final submissions.