# Plan to WIN A MEDAL

## High-level strategy
- Quickly inspect provided artifacts (train2.zip, test2.zip, sampleSubmission.csv).
- Identify training labels and feature representation (likely audio features or precomputed features).
- Build a strong baseline fast: stratified CV + LightGBM/XGBoost or an audio CNN if features are spectrograms already provided.
- Iterate with feature cleaning, normalization, target leakage checks, and robust CV aligned with competition splits.
- Ensembling/stacking if multiple feature sets exist inside zips.
- Submit an early baseline, then iterate to improve AUC with better features and ensembles (AUC is rank-based, so threshold calibration does not change it).
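
One property of the metric worth keeping in mind while iterating: ROC AUC depends only on how the predictions rank the clips, not on their absolute values. The sketch below illustrates this on synthetic data; the labels and scores are made up for the example and have nothing to do with the whale clips.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
# Hypothetical scores: a noisy signal correlated with the label
scores = y_true + rng.normal(0.0, 1.0, size=1000)

auc_raw = roc_auc_score(y_true, scores)
# A strictly increasing transform (here a sigmoid) keeps the ranking intact
auc_sigmoid = roc_auc_score(y_true, 1.0 / (1.0 + np.exp(-scores)))
print(auc_raw, auc_sigmoid)
```

Both printed values should agree, which is why effort is better spent on ranking quality (features, model, ensembling) than on rescaling the probabilities.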

## Milestones
1) Data discovery and schema mapping.
2) Baseline model with proper CV; log the AUC.
3) Feature engineering/selection or data augmentation if images/spectrograms.
4) Model tuning and ensembling.
5) Generate submission.csv and validate format.
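
For the last milestone, a minimal sketch of a format check against the sample file. `validate_submission` is a hypothetical helper, and the column names `clip` and `probability` are taken from sampleSubmission.csv; the tiny frames below stand in for the real files.

```python
import pandas as pd

def validate_submission(sub, template):
    """Return a list of problems found; an empty list means all checks passed."""
    if list(sub.columns) != list(template.columns):
        return [f"columns {list(sub.columns)} != {list(template.columns)}"]
    problems = []
    if len(sub) != len(template):
        problems.append(f"row count {len(sub)} != {len(template)}")
    elif not (sub["clip"].values == template["clip"].values).all():
        problems.append("clip order differs from template")
    p = sub["probability"]
    if p.isna().any():
        problems.append("NaN probabilities")
    if ((p < 0) | (p > 1)).any():
        problems.append("probabilities outside [0, 1]")
    return problems

# Tiny in-memory stand-ins for sampleSubmission.csv and submission.csv
template = pd.DataFrame({"clip": ["a.aif", "b.aif"], "probability": [0, 0]})
good = pd.DataFrame({"clip": ["a.aif", "b.aif"], "probability": [0.1, 0.9]})
bad = pd.DataFrame({"clip": ["b.aif", "a.aif"], "probability": [0.1, 1.5]})
print(validate_submission(good, template))
print(validate_submission(bad, template))
```

In the notebook this would be called with `pd.read_csv('submission.csv')` and `pd.read_csv('sampleSubmission.csv')` right after writing the file.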

## Logging and efficiency
- Print progress and timing per fold.
- Use early stopping and reasonable params first.
- Avoid long training runs without an expert check first.
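
A small sketch of per-fold progress and timing using a context manager. `timed` is a hypothetical helper name, and the loop body is a stand-in for the real `model.fit(...)` call.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, log=print):
    """Log when a block starts and how long it took, even if it raises."""
    t0 = time.perf_counter()
    log(f"{label} start")
    try:
        yield
    finally:
        log(f"{label} done in {time.perf_counter() - t0:.2f}s")

messages = []
for fold in range(1, 4):
    with timed(f"Fold {fold}", log=messages.append):
        sum(range(10_000))  # stand-in for model.fit(...)
print("\n".join(messages))
```

Passing `log=print` (the default) gives the same kind of "Fold N start / done" lines used in the cells below, without repeating the timing code in every loop.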

Next: explore files, unzip train/test, preview contents and schema.

In [1]:
import os, zipfile, pandas as pd, time, json, sys, glob
from pathlib import Path
print('CWD:', os.getcwd())
print('Dir listing:', os.listdir())

print('\nPreview sampleSubmission.csv:')
sample = pd.read_csv('sampleSubmission.csv')
print(sample.shape)
print(sample.head())

def inspect_zip(zpath):
    print(f'\nInspecting {zpath} ...')
    with zipfile.ZipFile(zpath, 'r') as z:
        names = z.namelist()
        print('Total files:', len(names))
        for n in names[:20]:
            print(' -', n)
        # find potential csv/labels/features
        csvs = [n for n in names if n.lower().endswith('.csv')]
        print('CSV files in zip:', csvs[:10])
inspect_zip('train2.zip')
inspect_zip('test2.zip')

# Unzip into folders if not already
def safe_extract(zpath, outdir):
    out = Path(outdir)
    if out.exists() and any(out.iterdir()):
        print(f'{outdir} already extracted, skipping')
        return
    out.mkdir(parents=True, exist_ok=True)
    t0=time.time()
    with zipfile.ZipFile(zpath, 'r') as z:
        z.extractall(outdir)
    print(f'Extracted {zpath} -> {outdir} in {time.time()-t0:.2f}s')

safe_extract('train2.zip', 'train2')
safe_extract('test2.zip', 'test2')

# Walk directories to understand schema
def walk_dir(d, depth=2):
    d = Path(d)
    print(f'\nWalking {d} (depth={depth})')
    for root, dirs, files in os.walk(d):
        rel = Path(root).relative_to(d)
        lvl = len(rel.parts)
        print(root, 'dirs:', len(dirs), 'files:', len(files))
        # show a few sample files
        for f in files[:10]:
            print('  file:', f)
        if lvl >= depth:
            # don't descend further
            dirs[:] = []
walk_dir('train2', depth=3)
walk_dir('test2', depth=3)

CWD: /app/agent_run_states/the-icml-2013-whale-challenge-right-whale-redux-spray-20250911-045804
Dir listing: ['test2.zip', 'train2.zip', 'main.ipynb', 'agent_metadata', 'requirements.txt', 'task.txt', 'sampleSubmission.csv', 'docker_run.log', 'description.md']

Preview sampleSubmission.csv:
(25149, 2)
                                clip  probability
0  20090330_000000_015s2ms_Test0.aif            0
1  20090330_000000_020s0ms_Test1.aif            0
2  20090330_000000_025s8ms_Test2.aif            0
3  20090330_000000_028s0ms_Test3.aif            0
4  20090330_000000_035s4ms_Test4.aif            0

Inspecting train2.zip ...
Total files: 22693
 - train2/
 - train2/20090328_121500_44292s1ms_TRAIN3930_0.aif
 - train2/20090328_040000_14754s3ms_TRAIN1536_0.aif
 - train2/20090329_190000_68809s7ms_TRAIN20079_0.aif
 - train2/20090328_170000_61317s6ms_TRAIN5011_0.aif
 - train2/20090329_050000_18012s6ms_TRAIN11800_0.aif
 - train2/20090329_191500_69780s9ms_TRAIN20291_0.aif
 - train2/20090328_11300

Total files: 25150
 - test2/
 - test2/20090330_211500_76911s3ms_Test9103.aif
 - test2/20090331_093000_34786s9ms_Test16890.aif
 - test2/20090331_164500_60302s0ms_Test20173.aif
 - test2/20090331_061500_23065s5ms_Test15089.aif
 - test2/20090330_103000_38310s6ms_Test4068.aif
 - test2/20090331_173000_63548s4ms_Test20737.aif
 - test2/20090331_204500_74792s1ms_Test22617.aif
 - test2/20090330_184500_67999s0ms_Test7115.aif
 - test2/20090330_203000_74285s8ms_Test8581.aif
 - test2/20090331_054500_21339s8ms_Test14792.aif
 - test2/20090330_211500_76623s1ms_Test9047.aif
 - test2/20090331_181500_66381s2ms_Test21309.aif
 - test2/20090330_183000_66957s4ms_Test7017.aif
 - test2/20090330_144500_53767s4ms_Test5595.aif
 - test2/20090331_201500_73585s2ms_Test22419.aif
 - test2/20090330_191500_70183s0ms_Test7551.aif
 - test2/20090331_020000_7418s6ms_Test12846.aif
 - test2/20090330_191500_70086s1ms_Test7521.aif
 - test2/20090331_000000_056s1ms_Test11109.aif
CSV files in zip: []


Extracted train2.zip -> train2 in 3.66s


Extracted test2.zip -> test2 in 4.10s

Walking train2 (depth=3)
train2 dirs: 1 files: 0
train2/train2 dirs: 0 files: 22692
  file: 20090329_224500_82193s1ms_TRAIN22168_0.aif
  file: 20090329_181500_66249s3ms_TRAIN19586_0.aif
  file: 20090329_191500_70052s7ms_TRAIN20346_0.aif
  file: 20090328_143000_52703s4ms_TRAIN4326_0.aif
  file: 20090329_153000_55933s0ms_TRAIN16724_0.aif
  file: 20090329_040000_14652s0ms_TRAIN11313_0.aif
  file: 20090329_191500_69741s8ms_TRAIN20283_0.aif
  file: 20090329_041500_15461s1ms_TRAIN11470_0.aif
  file: 20090328_093000_34253s1ms_TRAIN3035_0.aif
  file: 20090328_203000_74619s7ms_TRAIN6196_0.aif

Walking test2 (depth=3)
test2 dirs: 1 files: 0
test2/test2 dirs: 0 files: 25149
  file: 20090330_021500_8988s9ms_Test893.aif
  file: 20090331_211500_77375s4ms_Test23172.aif
  file: 20090330_001500_1374s9ms_Test107.aif
  file: 20090330_210000_76001s4ms_Test8920.aif
  file: 20090330_003000_2023s7ms_Test215.aif
  file: 20090330_211500_76703s8ms_Test9057.aif
  file: 2009
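
The listing shows no CSV files, so everything known about a clip has to come from its file name. A quick check on a few names copied from the output above suggests that the `<N>s` field is the clip's time of day in seconds, falling inside the 15-minute chunk named by the `HHMMSS` field. This is an inference from the names, not something the competition files state, and the trailing `<M>ms` digit is left uninterpreted here.

```python
import re
from datetime import datetime

# File names copied from the archive listing above
names = [
    "20090328_121500_44292s1ms_TRAIN3930_0.aif",
    "20090328_040000_14754s3ms_TRAIN1536_0.aif",
    "20090329_190000_68809s7ms_TRAIN20079_0.aif",
    "20090330_211500_76911s3ms_Test9103.aif",
]
pat = re.compile(r"^(\d{8}_\d{6})_(\d+)s(\d+)ms_")

rows = []
for name in names:
    m = pat.match(name)
    chunk_start = datetime.strptime(m.group(1), "%Y%m%d_%H%M%S")
    chunk_sec = chunk_start.hour * 3600 + chunk_start.minute * 60 + chunk_start.second
    offset = int(m.group(2)) - chunk_sec  # seconds past the chunk start
    rows.append((name, chunk_sec, offset))
    print(f"{name}: chunk starts at {chunk_sec}s, clip is {offset}s later")
```

For these four names every offset lands between 0 and 900 seconds, which is consistent with 15-minute recording chunks.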

In [2]:
import re, numpy as np, pandas as pd, os, glob, time
from pathlib import Path
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

t0 = time.time()
train_files = sorted(glob.glob('train2/train2/*.aif'))
test_files = sorted(glob.glob('test2/test2/*.aif'))
print('Found train files:', len(train_files), 'test files:', len(test_files))

pat_train = re.compile(r'(?P<date>\d{8})_(?P<hh>\d{2})(?P<mm>\d{2})(?P<ss>\d{2})_(?P<s>\d+)s(?P<ms>\d+)ms_.*_(?P<label>[01])\.aif$')
pat_test = re.compile(r'(?P<date>\d{8})_(?P<hh>\d{2})(?P<mm>\d{2})(?P<ss>\d{2})_(?P<s>\d+)s(?P<ms>\d+)ms_.*\.aif$')

def parse_train(fp):
    name = os.path.basename(fp)
    m = pat_train.search(name)
    if not m:
        return None
    d = m.groupdict()
    return {
        'clip': name,
        'date': int(d['date']),
        'hour': int(d['hh']),
        'minute': int(d['mm']),
        'second': int(d['ss']),
        'seg_seconds': int(d['s']),
        'seg_ms': int(d['ms']),
        'label': int(d['label'])
    }

def parse_test(fp):
    name = os.path.basename(fp)
    m = pat_test.search(name)
    if not m:
        return None
    d = m.groupdict()
    return {
        'clip': name,
        'date': int(d['date']),
        'hour': int(d['hh']),
        'minute': int(d['mm']),
        'second': int(d['ss']),
        'seg_seconds': int(d['s']),
        'seg_ms': int(d['ms'])
    }

t1 = time.time()
train_rows = [r for r in map(parse_train, train_files) if r is not None]
test_rows = [r for r in map(parse_test, test_files) if r is not None]
print(f'Parsed train rows: {len(train_rows)}, test rows: {len(test_rows)} in {time.time()-t1:.2f}s')

train_df = pd.DataFrame(train_rows)
test_df = pd.DataFrame(test_rows)
print(train_df.head())
print(train_df['label'].value_counts())

# Basic time-based features
def add_features(df):
    df = df.copy()
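    df['tod_sec'] = df['hour']*3600 + df['minute']*60 + df['second']
    # NOTE: the `<N>s<M>ms` filename field looks like a clip timestamp (seconds since
    # midnight) rather than a duration, so `seg_len_ms` is really a time offset.
    df['seg_len_ms'] = df['seg_seconds']*1000 + df['seg_ms']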
    df['hour_sin'] = np.sin(2*np.pi*df['hour']/24)
    df['hour_cos'] = np.cos(2*np.pi*df['hour']/24)
    df['minute_sin'] = np.sin(2*np.pi*df['minute']/60)
    df['minute_cos'] = np.cos(2*np.pi*df['minute']/60)
    return df

train_fe = add_features(train_df)
test_fe = add_features(test_df)
feats = ['tod_sec','seg_len_ms','hour','minute','second','seg_seconds','seg_ms','hour_sin','hour_cos','minute_sin','minute_cos']

# CV baseline with logistic regression
X = train_fe[feats].values
y = train_fe['label'].values
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(train_fe))
fold = 0
for tr, va in skf.split(X, y):
    fold += 1
    print(f'Fold {fold} start (n_tr={len(tr)}, n_va={len(va)})')
    tfs = time.time()
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200, n_jobs=8, solver='lbfgs'))
    model.fit(X[tr], y[tr])
    oof[va] = model.predict_proba(X[va])[:,1]
    print(f'Fold {fold} done in {time.time()-tfs:.2f}s')
auc = roc_auc_score(y, oof)
print('OOF AUC:', auc)

# Train final model on all data and predict test
final_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200, n_jobs=8, solver='lbfgs'))
final_model.fit(X, y)
test_pred = final_model.predict_proba(test_fe[feats].values)[:,1]

# Build submission per sampleSubmission order
sub_template = pd.read_csv('sampleSubmission.csv')
sub_map = pd.Series(test_pred, index=test_fe['clip']).to_dict()
sub = sub_template.copy()
miss = 0
vals = []
for c in sub['clip']:
    v = sub_map.get(c)
    if v is None:
        miss += 1
        v = 0.5
    vals.append(v)
sub['probability'] = vals
print('Missing mapped clips:', miss)
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv with shape', sub.shape, 'in', f'{time.time()-t0:.2f}s')

Found train files: 22692 test files: 25149
Parsed train rows: 22692, test rows: 25149 in 0.15s


                                   clip      date  hour  minute  second  \
0  20090328_000000_002s3ms_TRAIN0_0.aif  20090328     0       0       0   
1  20090328_000000_010s8ms_TRAIN1_0.aif  20090328     0       0       0   
2  20090328_000000_021s6ms_TRAIN2_0.aif  20090328     0       0       0   
3  20090328_000000_059s0ms_TRAIN3_0.aif  20090328     0       0       0   
4  20090328_000000_068s4ms_TRAIN4_0.aif  20090328     0       0       0   

   seg_seconds  seg_ms  label  
0            2       3      0  
1           10       8      0  
2           21       6      0  
3           59       0      0  
4           68       4      0  
label
0    20426
1     2266
Name: count, dtype: int64
Fold 1 start (n_tr=18153, n_va=4539)


Fold 1 done in 1.33s
Fold 2 start (n_tr=18153, n_va=4539)


Fold 2 done in 1.07s
Fold 3 start (n_tr=18154, n_va=4538)


Fold 3 done in 1.06s
Fold 4 start (n_tr=18154, n_va=4538)


Fold 5 done in 1.10s
OOF AUC: 0.57517519163097


Missing mapped clips: 0
Wrote submission.csv with shape (25149, 2) in 7.34s
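
Two numbers from this output are worth putting side by side. The label counts give the class balance, and an OOF AUC of about 0.575 is only slightly above the 0.5 that a constant or random predictor would score, so filename metadata alone carries little signal. The arithmetic below uses the counts printed above; nothing else is assumed.

```python
neg, pos = 20426, 2266  # label counts printed above
total = neg + pos
pos_rate = pos / total      # fraction of clips labelled 1
neg_to_pos = neg / pos      # negatives per positive
print(f"total={total}, positive rate={pos_rate:.4f}, neg:pos={neg_to_pos:.3f}")
```

Roughly one clip in ten is positive, a ratio of about 9 negatives per positive. That ratio is the kind of value commonly passed as a positive-class weight in gradient boosting libraries.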


In [4]:
# Audio feature extraction + LGBM with grouped CV
import os, sys, time, glob, re, warnings
import numpy as np
import pandas as pd
from pathlib import Path
warnings.filterwarnings('ignore')

t0_all = time.time()

# Install deps if missing
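def ensure(pkg, import_name=None):
    """Import `import_name` (defaults to `pkg`); pip-install `pkg` if the import fails."""
    import importlib, subprocess
    try:
        importlib.import_module(import_name or pkg)
    except Exception:
        print(f'Installing {pkg} ...')
        # check_call raises if pip fails, instead of silently continuing
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', pkg])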
ensure('soundfile')
ensure('librosa')
ensure('joblib')
ensure('scipy')
ensure('lightgbm')

import soundfile as sf
import librosa
from joblib import Parallel, delayed
from scipy.signal import butter, filtfilt
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb

# Load parsed filename metadata from previous cell's train_df/test_df if available, else parse again
if 'train_df' not in globals() or 'test_df' not in globals():
    train_files = sorted(glob.glob('train2/train2/*.aif'))
    test_files = sorted(glob.glob('test2/test2/*.aif'))
    pat_train = re.compile(r'(?P<date>\d{8})_(?P<hh>\d{2})(?P<mm>\d{2})(?P<ss>\d{2})_(?P<s>\d+)s(?P<ms>\d+)ms_.*_(?P<label>[01])\.aif$')
    pat_test = re.compile(r'(?P<date>\d{8})_(?P<hh>\d{2})(?P<mm>\d{2})(?P<ss>\d{2})_(?P<s>\d+)s(?P<ms>\d+)ms_.*\.aif$')
    def parse_train(fp):
        name = os.path.basename(fp)
        m = pat_train.search(name)
        if not m: return None
        d = m.groupdict()
        return {'clip': name, 'date': int(d['date']), 'hour': int(d['hh']), 'minute': int(d['mm']), 'second': int(d['ss']), 'label': int(d['label'])}
    def parse_test(fp):
        name = os.path.basename(fp)
        m = pat_test.search(name)
        if not m: return None
        d = m.groupdict()
        return {'clip': name, 'date': int(d['date']), 'hour': int(d['hh']), 'minute': int(d['mm']), 'second': int(d['ss'])}
    train_rows = [r for r in map(parse_train, train_files) if r]
    test_rows = [r for r in map(parse_test, test_files) if r]
    train_df = pd.DataFrame(train_rows)
    test_df = pd.DataFrame(test_rows)
else:
    train_files = sorted(glob.glob('train2/train2/*.aif'))
    test_files = sorted(glob.glob('test2/test2/*.aif'))

print(f'Train clips: {len(train_files)}, Test clips: {len(test_files)}')

# Audio params
SR = 2000
TARGET_LEN = 2 * SR  # 2 seconds
N_FFT = 512
HOP = 128
WIN = 256
N_MELS = 64
FMIN = 20
FMAX = 1000
BP_LO = 50
BP_HI = 400
b, a = butter(4, [BP_LO/(SR/2), BP_HI/(SR/2)], btype='band')

mel_freqs = librosa.mel_frequencies(n_mels=N_MELS, fmin=FMIN, fmax=FMAX)
low_idx = (mel_freqs >= 50) & (mel_freqs <= 400)

def load_audio(fp):
    y, orig_sr = sf.read(fp, dtype='float32')
    if y.ndim > 1:
        y = y.mean(axis=1)
    if orig_sr != SR:
        y = librosa.resample(y, orig_sr=orig_sr, target_sr=SR, res_type='kaiser_fast')
    if len(y) < TARGET_LEN:
        y = np.pad(y, (0, TARGET_LEN - len(y)))
    else:
        y = y[:TARGET_LEN]
    # Band-pass filter
    try:
        y = filtfilt(b, a, y)
    except Exception:
        pass
    return y

def feature_dict_from_audio(y):
    S = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=N_FFT, hop_length=HOP, win_length=WIN, n_mels=N_MELS, fmin=FMIN, fmax=FMAX, power=2.0)
    L = np.log1p(S)
    feats = {}
    # per-mel stats (mean, std, max, p90)
    feats.update({f'mel_mean_{i}': float(L[i].mean()) for i in range(L.shape[0])})
    feats.update({f'mel_std_{i}': float(L[i].std()) for i in range(L.shape[0])})
    feats.update({f'mel_max_{i}': float(L[i].max()) for i in range(L.shape[0])})
    feats.update({f'mel_p90_{i}': float(np.percentile(L[i], 90)) for i in range(L.shape[0])})
    # MFCCs + deltas
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(S + 1e-12), sr=SR, n_mfcc=20)
    dm = librosa.feature.delta(mfcc)
    feats.update({f'mfcc_mean_{i}': float(mfcc[i].mean()) for i in range(mfcc.shape[0])})
    feats.update({f'mfcc_std_{i}': float(mfcc[i].std()) for i in range(mfcc.shape[0])})
    feats.update({f'dmfcc_mean_{i}': float(dm[i].mean()) for i in range(dm.shape[0])})
    feats.update({f'dmfcc_std_{i}': float(dm[i].std()) for i in range(dm.shape[0])})
    # Spectral + energy stats
    spec_cent = librosa.feature.spectral_centroid(y=y, sr=SR, n_fft=N_FFT, hop_length=HOP).squeeze()
    spec_bw = librosa.feature.spectral_bandwidth(y=y, sr=SR, n_fft=N_FFT, hop_length=HOP).squeeze()
    spec_roll = librosa.feature.spectral_rolloff(y=y, sr=SR, roll_percent=0.80, n_fft=N_FFT, hop_length=HOP).squeeze()
    flat = librosa.feature.spectral_flatness(y=y, n_fft=N_FFT, hop_length=HOP).squeeze()
    rms = librosa.feature.rms(y=y, frame_length=WIN, hop_length=HOP).squeeze()
    zcr = librosa.feature.zero_crossing_rate(y=y, frame_length=WIN, hop_length=HOP).squeeze()
    def add_stat(name, arr):
        feats[f'{name}_mean'] = float(arr.mean())
        feats[f'{name}_std'] = float(arr.std())
        feats[f'{name}_max'] = float(arr.max())
        feats[f'{name}_p90'] = float(np.percentile(arr, 90))
    for name, arr in [('centroid', spec_cent), ('bandwidth', spec_bw), ('rolloff', spec_roll), ('flatness', flat), ('rms', rms), ('zcr', zcr)]:
        add_stat(name, arr)
    # low-band energy metrics
    low_band = L[low_idx].mean(axis=0)
    full_band = L.mean(axis=0)
    feats['low_full_ratio_mean'] = float(low_band.mean() / (full_band.mean() + 1e-8))
    feats['low_p90'] = float(np.percentile(low_band, 90))
    feats['low_peak_to_mean'] = float(low_band.max() / (low_band.mean() + 1e-8))
    thr = float(np.percentile(low_band, 75))
    mask = (low_band > thr).astype(np.int32)
    # run-lengths of consecutive frames above thr
    if mask.size:
        idx = np.where(np.r_[True, mask==0, True])[0]
        runs = np.diff(idx) - 1
        runs = runs[runs > 0]
        feats['low_run_max'] = float(runs.max() if runs.size else 0.0)
        # count runs directly; a rising-edge test misses a run that starts at frame 0
        feats['low_run_count'] = float(runs.size)
    else:
        feats['low_run_max'] = 0.0
        feats['low_run_count'] = 0.0
    return feats

def process_clip(fp):
    try:
        y = load_audio(fp)
        feats = feature_dict_from_audio(y)
        feats['clip'] = os.path.basename(fp)
        return feats
    except Exception as e:
        return {'clip': os.path.basename(fp), 'error': str(e)}

feat_train_path = Path('features_train.parquet')
feat_test_path = Path('features_test.parquet')

def extract_or_load_features(files, out_path, tag):
    if out_path.exists():
        print(f'Loading cached {tag} features from {out_path} ...')
        return pd.read_parquet(out_path)
    n = len(files)
    print(f'Extracting {tag} features for {n} files ...')
    t0 = time.time()
    res = Parallel(n_jobs=-1, batch_size=16, prefer='processes')(delayed(process_clip)(fp) for fp in files)
    df = pd.DataFrame(res)
    if 'error' in df.columns:
        errs = df['error'].notnull().sum()
        print(f'{tag} errors: {errs}')
    df.to_parquet(out_path, index=False)
    print(f'Saved {tag} features to {out_path} in {time.time()-t0:.1f}s, shape={df.shape}')
    return df

train_feat_df = extract_or_load_features(train_files, feat_train_path, 'train')
test_feat_df = extract_or_load_features(test_files, feat_test_path, 'test')

# Merge labels
train_feat_df = train_feat_df.merge(train_df[['clip','label','date','hour']], on='clip', how='left')
test_feat_df = test_feat_df.merge(test_df[['clip','date','hour']], on='clip', how='left')
print('Train features shape:', train_feat_df.shape, 'Test features shape:', test_feat_df.shape)

# Prepare features list
drop_cols = {'clip','label','date','hour','error'}
features = [c for c in train_feat_df.columns if c not in drop_cols]
print('N features:', len(features))

# Grouped CV by date+hour to reduce leakage
groups = train_feat_df['date'].astype(str) + '_' + train_feat_df['hour'].astype(int).astype(str).str.zfill(2)
y = train_feat_df['label'].astype(int).values

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(train_feat_df), dtype=np.float32)
models = []
# compute global class imbalance for scale_pos_weight
pos = (y == 1).sum()
neg = (y == 0).sum()
scale_pos_weight = float(neg) / max(1.0, float(pos))
params = dict(objective='binary', metric='auc', learning_rate=0.05, num_leaves=64, feature_fraction=0.8, bagging_fraction=0.8, bagging_freq=1, min_data_in_leaf=50, verbosity=-1, scale_pos_weight=scale_pos_weight)

for i, (tr, va) in enumerate(cv.split(train_feat_df[features], y, groups), 1):
    tfs = time.time()
    print(f'Fold {i} start: tr={len(tr)} va={len(va)}')
    dtr = lgb.Dataset(train_feat_df.loc[tr, features], label=y[tr])
    dva = lgb.Dataset(train_feat_df.loc[va, features], label=y[va])
    callbacks = [lgb.early_stopping(200), lgb.log_evaluation(200)]
    model = lgb.train(params, dtr, num_boost_round=3000, valid_sets=[dva], callbacks=callbacks)
    oof[va] = model.predict(train_feat_df.loc[va, features], num_iteration=model.best_iteration)
    models.append(model)
    print(f'Fold {i} done in {time.time()-tfs:.1f}s, best_iter={model.best_iteration}')
auc = roc_auc_score(y, oof)
print('OOF AUC:', auc)

# Train final model on all data (use median best_iter * 1.1)
best_iters = [m.best_iteration for m in models]
final_boost_round = int(np.median(best_iters) * 1.1) if best_iters else 1000
print('Final boost round:', final_boost_round)
final_train = lgb.Dataset(train_feat_df[features], label=y)
final_model = lgb.train(params, final_train, num_boost_round=final_boost_round, valid_sets=[final_train])
test_pred = final_model.predict(test_feat_df[features])

# Build submission
sub = pd.read_csv('sampleSubmission.csv')
pred_map = pd.Series(test_pred, index=test_feat_df['clip']).to_dict()
mapped = sub['clip'].map(pred_map)
miss = int(mapped.isna().sum())  # count before filling; after fillna this is always 0
sub['probability'] = mapped.fillna(0.5)
print('Missing mapped clips:', miss)
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv', sub.shape, 'Total time:', f'{time.time()-t0_all:.1f}s')

Train clips: 22692, Test clips: 25149
Loading cached train features from features_train.parquet ...
Loading cached test features from features_test.parquet ...


Train features shape: (22692, 369) Test features shape: (25149, 368)
N features: 365
Fold 1 start: tr=19139 va=3553


Training until validation scores don't improve for 200 rounds


[200]	valid_0's auc: 0.932028


[400]	valid_0's auc: 0.933794


[600]	valid_0's auc: 0.933641


Early stopping, best iteration is:
[440]	valid_0's auc: 0.934492
Fold 1 done in 12.8s, best_iter=440
Fold 2 start: tr=18184 va=4508


Training until validation scores don't improve for 200 rounds


[200]	valid_0's auc: 0.949883


[400]	valid_0's auc: 0.951055


[600]	valid_0's auc: 0.952004


[800]	valid_0's auc: 0.952088


Early stopping, best iteration is:
[694]	valid_0's auc: 0.952866
Fold 2 done in 12.9s, best_iter=694
Fold 3 start: tr=18115 va=4577


Training until validation scores don't improve for 200 rounds


[200]	valid_0's auc: 0.941456


[400]	valid_0's auc: 0.944947


[600]	valid_0's auc: 0.945568


Early stopping, best iteration is:
[524]	valid_0's auc: 0.946663
Fold 3 done in 9.9s, best_iter=524
Fold 4 start: tr=16960 va=5732


Training until validation scores don't improve for 200 rounds


[200]	valid_0's auc: 0.929178


[400]	valid_0's auc: 0.932521


[600]	valid_0's auc: 0.934217


[800]	valid_0's auc: 0.934686


[1000]	valid_0's auc: 0.935368


[1200]	valid_0's auc: 0.935898


[1400]	valid_0's auc: 0.935997


Early stopping, best iteration is:
[1229]	valid_0's auc: 0.936037
Fold 4 done in 18.7s, best_iter=1229
Fold 5 start: tr=18370 va=4322


Training until validation scores don't improve for 200 rounds


[200]	valid_0's auc: 0.940522


[400]	valid_0's auc: 0.943225


[600]	valid_0's auc: 0.944721


Early stopping, best iteration is:
[598]	valid_0's auc: 0.944864
Fold 5 done in 11.7s, best_iter=598
OOF AUC: 0.9303386521116115
Final boost round: 657


Missing mapped clips: 0
Wrote submission.csv (25149, 2) Total time: 74.5s
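
The pooled OOF AUC (0.9303) is lower than every per-fold best AUC in the log. Since the OOF predictions were made at each fold's best iteration, the gap has to come from pairs of clips that sit in different folds: each fold's model ranks its own validation rows well, but the scores are not on exactly the same scale across models. The sketch below first averages the fold AUCs from the log, then reproduces the effect on synthetic data. The fold offsets are an assumption chosen to make the effect visible, not a measurement of these models.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Per-fold best validation AUCs copied from the log above
fold_aucs = [0.934492, 0.952866, 0.946663, 0.936037, 0.944864]
mean_fold_auc = float(np.mean(fold_aucs))
print(f"mean fold AUC: {mean_fold_auc:.4f}")

# Synthetic illustration (not the competition data): every fold ranks its own
# rows equally well, but scores carry a fold-specific offset.
rng = np.random.default_rng(0)
ys, scores, per_fold = [], [], []
for offset in [0.0, 1.0, 2.0]:
    y = rng.integers(0, 2, size=2000)
    s = 2.0 * y + rng.normal(size=2000) + offset
    per_fold.append(roc_auc_score(y, s))
    ys.append(y)
    scores.append(s)
pooled = roc_auc_score(np.concatenate(ys), np.concatenate(scores))
print(f"synthetic mean fold AUC: {np.mean(per_fold):.4f}, pooled AUC: {pooled:.4f}")
```

Reporting both the mean fold AUC and the pooled OOF AUC makes this gap visible; which of the two tracks the leaderboard better is something to verify against actual submissions.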
