# NYC Taxi Fare Prediction — Corrected Experiment Plan (Audit-Ready)

Objective: Win a medal (target: GOLD). Metric: RMSE (lower is better).

Performance targets:
- Bronze: RMSE ≤ 2.92371
- Silver: RMSE ≤ 2.88191
- Gold: RMSE ≤ 2.83377

1) Definitive Data Structure
- Files in CWD:
  - labels.csv: FULL TRAINING DATASET (features + target). Columns: [key, pickup_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, passenger_count, fare_amount].
  - test.csv: HOLDOUT feature table for final prediction. Columns: [key, pickup_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, passenger_count].
  - sample_submission.csv: [key, fare_amount] — defines required submission order and set; keys match test.csv exactly.
- No key overlap: labels.csv keys are disjoint from test.csv keys. No joins required for training.
- Training sampling: load labels.csv in chunks and randomly sample 200k–5M rows (scalable), ensuring UTC tz-awareness and non-null critical fields.

2) Validation Plan (time-aware, no leakage)
- Convert pickup_datetime to timezone-aware UTC on load, and to America/New_York only for temporal features.
- Forward-chaining TimeSeries CV with 5 folds and a 1-day gap before each validation slice.
- Track RMSE on original scale; model trained on log1p(target) with inverse-transform for scoring.
- Keep a final temporal holdout (last ~10%) if needed for sanity checks.
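
The gap between the log-scale loss the model optimizes and the original-scale RMSE the competition scores can be illustrated with a minimal sketch. The fares and predictions below are hypothetical toy values, and `rmse` is a helper defined here for illustration only, not a function from the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error on whatever scale the inputs are given in."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical fares in dollars and hypothetical predictions made in log1p space.
fare_true = np.array([4.5, 8.0, 12.5, 52.0])
log_pred = np.log1p(np.array([5.0, 7.5, 14.0, 45.0]))

# RMSE in log1p space: the quantity a model trained on log1p(fare) minimizes.
rmse_log = rmse(np.log1p(fare_true), log_pred)
# RMSE in dollars after the inverse transform: the quantity the metric measures.
rmse_orig = rmse(fare_true, np.expm1(log_pred))

print({'rmse_log': round(rmse_log, 4), 'rmse_original': round(rmse_orig, 4)})
```

The two numbers are not comparable, so fold scores should always be reported after `np.expm1`, and long trips with large dollar errors dominate the original-scale figure.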

3) Train Cleaning and Test Parity
- Train (labels.csv) strict filters:
  - NYC geofence: lat ∈ [40.5, 41.0], lon ∈ [-74.5, -72.8] for both pickup and dropoff.
  - passenger_count ∈ [1, 6].
  - Distance (haversine_km): keep 0.01 ≤ distance ≤ 200.
  - fare_amount ∈ [2.5, 200].
- Test (test.csv) parity without dropping rows:
  - Clip passenger_count to [1, 6], clip lat/lon to geofence.
  - Clip derived distances post-feature to [0.01, 200].
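
The difference between filtering train rows and clipping test rows can be sketched as follows. The bounds are copied from the plan; the function names `filter_train_rows` and `clip_test_rows` and the toy frame are assumptions made for illustration, and the distance and fare filters are omitted to keep the sketch short.

```python
import pandas as pd

# Bounds copied from the plan above.
LAT_RANGE = (40.5, 41.0)
LON_RANGE = (-74.5, -72.8)
PASSENGER_RANGE = (1, 6)
COORD_COLS = {
    'pickup_latitude': LAT_RANGE, 'dropoff_latitude': LAT_RANGE,
    'pickup_longitude': LON_RANGE, 'dropoff_longitude': LON_RANGE,
}

def filter_train_rows(df):
    """Drop training rows outside the geofence or passenger range."""
    mask = df['passenger_count'].between(*PASSENGER_RANGE)
    for col, (lo, hi) in COORD_COLS.items():
        mask &= df[col].between(lo, hi)
    return df.loc[mask].copy()

def clip_test_rows(df):
    """Keep every test row, pulling out-of-range values to the nearest bound."""
    out = df.copy()
    out['passenger_count'] = out['passenger_count'].clip(*PASSENGER_RANGE)
    for col, (lo, hi) in COORD_COLS.items():
        out[col] = out[col].clip(lo, hi)
    return out

# Toy frame (hypothetical values): row 1 has a bad latitude, row 2 has 0 passengers.
toy = pd.DataFrame({
    'pickup_latitude': [40.75, 0.0, 40.70],
    'pickup_longitude': [-73.99, -73.98, -73.95],
    'dropoff_latitude': [40.76, 40.70, 40.72],
    'dropoff_longitude': [-73.97, -73.99, -73.96],
    'passenger_count': [1, 2, 0],
})

filtered = filter_train_rows(toy)
clipped = clip_test_rows(toy)
print(len(filtered), len(clipped))
```

Filtering is safe on train because rows can be discarded; on test every key must receive a prediction, so values are clipped into the range the model saw during training.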

4) Core Feature Set (initial ≥ 20 features)
- Geometry: haversine_km, manhattan_km (reuse haversine components), delta_lat, delta_lon, bearing_sin/cos, center_lat/lon.
- Temporal (NY local): year, month, day, hour, dow, is_weekend, is_rush, is_night, hour_sin/cos, dow_sin/cos.
- Domain/POIs: distances to JFK/LGA/EWR/Midtown/FiDi; is_airport_trip; interactions like distance × rush/weekend.
- Consistent clipping for engineered features to robust ranges.
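
The motivation for the sin/cos pairs (hour_sin/cos, dow_sin/cos) can be shown with a small sketch: on the raw scale hour 23 and hour 0 are 23 units apart, while on the unit circle they are neighbors. The helpers `cyclical_encode` and `circle_distance` are illustrative names, not part of the notebook.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map a periodic quantity onto the unit circle as (sin, cos)."""
    values = np.asarray(values, dtype=np.float64)
    angle = 2.0 * np.pi * values / period
    return np.sin(angle), np.cos(angle)

def circle_distance(a, b, period):
    """Euclidean distance between the encoded positions of a and b."""
    sa, ca = cyclical_encode(a, period)
    sb, cb = cyclical_encode(b, period)
    return float(np.hypot(sa - sb, ca - cb))

d_23_0 = circle_distance(23, 0, 24)   # adjacent hours across midnight
d_0_12 = circle_distance(0, 12, 24)   # opposite sides of the clock
print({'d_23_0': round(d_23_0, 3), 'd_0_12': round(d_0_12, 3)})
```

Tree models can split on the raw integer hour as well, so the encoded pair is a complement to `hour` rather than a replacement for it.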

5) Modeling Strategy (LightGBM baseline — mandatory)
- Baseline model: LightGBM Regressor on log1p(fare).
- Initial params (CPU): {num_leaves: 128, max_depth: -1, learning_rate: 0.05, n_estimators: up to 10k (manual early stopping via time-CV only), min_data_in_leaf: 200, feature_fraction: 0.8, bagging_fraction: 0.8, bagging_freq: 1, lambda_l1: 0.0, lambda_l2: 1.0}.
- Never early-stop on a random internal split; pass only the time-CV validation fold as the LightGBM eval set.
- Report fold RMSE and mean/std. Fit final model on all ordered data; predict test; clip predictions to [0, 500].

6) Scale-up & Advanced Features (post-baseline)
- Scale training sample to 2–5M+ rows (highest ROI).
- Add geohash/grid bins, KMeans clusters (~50–100), airport/borough/toll proxies, rotated-Manhattan distance, bearing buckets, holiday flags, frequency encodings.
- Hyperparameter tuning with Optuna (50–100 trials).
- Diversity models (XGBoost/CatBoost) and ensembling/stacking.

7) Reproducibility & Hygiene
- Global seed (e.g., 2025) for numpy/pandas/LightGBM.
- Hard assertions on schema, dtypes, tz-awareness, key uniqueness, and train–test disjointness.
- Clean notebook: delete stale code after documenting; resolve deprecation/future warnings (use root_mean_squared_error, dtype checks via isinstance).
- Save submission via robust merge on key, preserving sample_submission order.
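
A minimal sketch of the order-preserving submission merge, assuming a predictions frame with `key` and `fare_amount` columns. The function name `build_submission` and the toy keys are hypothetical; `validate='one_to_one'` is the pandas merge option that raises if either side has duplicate keys.

```python
import pandas as pd

def build_submission(sample, preds):
    """Return predictions ordered exactly like `sample`, failing loudly on any mismatch."""
    sub = sample[['key']].merge(
        preds[['key', 'fare_amount']],
        on='key',
        how='left',               # left join keeps the row order of `sample`
        validate='one_to_one',    # raises MergeError on duplicate keys in either frame
    )
    assert len(sub) == len(sample), 'Row count changed during merge'
    assert sub['fare_amount'].notna().all(), 'Missing predictions for some keys'
    assert (sub['key'].to_numpy() == sample['key'].to_numpy()).all(), 'Key order changed'
    return sub

# Toy example: predictions arrive in a different order than the sample file.
sample_toy = pd.DataFrame({'key': ['c', 'a', 'b'], 'fare_amount': [0.0, 0.0, 0.0]})
preds_toy = pd.DataFrame({'key': ['a', 'b', 'c'], 'fare_amount': [1.0, 2.0, 3.0]})

sub_toy = build_submission(sample_toy, preds_toy)
print(sub_toy)
```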

Milestones
- M1 (baseline): LightGBM + core features + time-CV → target RMSE of 3.3–3.5 (with 1–2M rows).
- M2: +features + tuning → target RMSE of 2.9–3.1.
- M3: +ensemble/stack → target RMSE of 2.83–2.88 (silver-to-gold range; gold requires ≤ 2.83377).

Audit Checkpoints
- This corrected plan.
- After data sampling and schema assertions.
- After LightGBM baseline CV.
- After scaling/tuning/features.
- Before final submission.

In [8]:
import pandas as pd
import numpy as np

SEED = 2025
rng = np.random.default_rng(SEED)

def read_test_and_sample():
    dtypes = {
        'key': 'string',
        'pickup_longitude': 'float32',
        'pickup_latitude': 'float32',
        'dropoff_longitude': 'float32',
        'dropoff_latitude': 'float32',
        'passenger_count': 'int8'
    }
    test = pd.read_csv('test.csv', dtype=dtypes)
    # Parse once, directly to tz-aware UTC; fail fast on any unparseable timestamp
    test['pickup_datetime'] = pd.to_datetime(test['pickup_datetime'], utc=True, errors='raise')
    sample = pd.read_csv('sample_submission.csv', dtype={'key': 'string'})
    # Hard assertions
    assert test['key'].notna().all(), 'Null keys in test'
    assert test['key'].is_unique, 'Duplicate keys in test'
    assert test['pickup_datetime'].notna().all(), 'Null pickup_datetime in test'
    assert isinstance(test['pickup_datetime'].dtype, pd.DatetimeTZDtype), 'pickup_datetime must be timezone-aware'
    # Ensure dtypes
    assert test['passenger_count'].dtype == 'int8', 'passenger_count not int8'
    for col in ['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude']:
        assert test[col].dtype == 'float32', f'{col} not float32'
    # Sample alignment: keys must exactly match test keys
    assert sample['key'].is_unique, 'Duplicate keys in sample_submission'
    test_keys = set(test['key'].tolist())
    samp_keys = set(sample['key'].tolist())
    assert test_keys == samp_keys, f'Sample and test key sets differ: diff_count={len(test_keys.symmetric_difference(samp_keys))}'
    return test, sample

test, sample = read_test_and_sample()
summary = {
    'test_rows': int(len(test)),
    'sample_rows': int(len(sample))
}
print('SUMMARY:', summary)
print('Dtypes OK; TZ-aware datetime enforced; keys validated; sample alignment verified.')

SUMMARY: {'test_rows': 9914, 'sample_rows': 9914}
Dtypes OK; TZ-aware datetime enforced; keys validated; sample alignment verified.


In [9]:
# Robust, unbiased hash-based sampler over entire labels.csv; tz-aware; assertions; optional cache
import pandas as pd
import numpy as np
from pandas.util import hash_pandas_object

TRAIN_SAMPLE_TARGET = 2_000_000  # per plan (>=2M)
CHUNKSIZE = 1_000_000            # IO-efficient chunk size
EST_TOTAL_ROWS = 55_000_000      # rough total rows to set sampling prob
OVERSAMPLE_FACTOR = 1.1          # safety margin to reduce undershoot risk
CACHE_PATH = None                # e.g., 'train_sample.parquet' to persist

def empty_typed_frame():
    return pd.DataFrame({
        'key': pd.Series([], dtype='string'),
        'pickup_datetime': pd.Series([], dtype='datetime64[ns, UTC]'),
        'pickup_longitude': pd.Series([], dtype='float32'),
        'pickup_latitude': pd.Series([], dtype='float32'),
        'dropoff_longitude': pd.Series([], dtype='float32'),
        'dropoff_latitude': pd.Series([], dtype='float32'),
        'passenger_count': pd.Series([], dtype='int8'),
        'fare_amount': pd.Series([], dtype='float32')
    })

def load_train_sample_hash(target_rows=TRAIN_SAMPLE_TARGET, chunksize=CHUNKSIZE, seed=SEED,
                           est_total_rows=EST_TOTAL_ROWS, oversample=OVERSAMPLE_FACTOR, cache_path=CACHE_PATH):
    if cache_path:
        try:
            df_cached = pd.read_parquet(cache_path)
            if len(df_cached) >= int(0.9 * target_rows):
                print({'train_sample_rows': int(len(df_cached)), 'source': 'cache'})
                return df_cached
        except Exception:
            pass
    usecols = ['key','pickup_datetime','pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count','fare_amount']
    dtypes = {
        'key': 'string',
        'pickup_longitude': 'float32',
        'pickup_latitude': 'float32',
        'dropoff_longitude': 'float32',
        'dropoff_latitude': 'float32',
        'passenger_count': 'int8',
        'fare_amount': 'float32'
    }
    required_nonnull = ['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count','fare_amount']
    # Probability chosen to hit target on full-pass; oversample to reduce undershoot
    p = min(0.5, max(1e-4, (target_rows / max(1, est_total_rows)) * oversample))
    parts = []
    for ch in pd.read_csv('labels.csv', usecols=usecols, dtype=dtypes, chunksize=chunksize):
        # Convert datetime post-read for speed and enforce UTC
        ch['pickup_datetime'] = pd.to_datetime(ch['pickup_datetime'], utc=True, errors='coerce')
        ch = ch.dropna(subset=['pickup_datetime'] + required_nonnull)
        if ch.empty:
            continue
        # Deterministic hash-based sampling across entire file (unbiased by position/time)
        h = hash_pandas_object(ch['key'], index=False).values.astype('uint64')
        keep = (h / np.float64(2**64)) < p
        if keep.any():
            parts.append(ch.loc[keep])
    if not parts:
        df = empty_typed_frame()
    else:
        df = pd.concat(parts, ignore_index=True)
    # If we overshot target, downsample deterministically
    if len(df) > target_rows:
        df = df.sample(n=target_rows, random_state=seed)
    # Guarantee minimum size
    assert len(df) >= int(0.9 * target_rows), f'Sampler undershot: got {len(df)} < 90% of target {target_rows}. Increase oversample or adjust est_total_rows.'
    # Hard assertions (post-clean)
    assert df['pickup_datetime'].notna().all(), 'Null pickup_datetime in train sample'
    assert isinstance(df['pickup_datetime'].dtype, pd.DatetimeTZDtype), 'pickup_datetime must be tz-aware'
    for col in required_nonnull:
        assert df[col].notna().all(), f'Nulls in {col} after cleaning'
    for col in ['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude']:
        assert df[col].dtype == 'float32', f'{col} dtype not float32'
    assert df['fare_amount'].dtype == 'float32', 'fare_amount dtype incorrect'
    assert df['passenger_count'].dtype == 'int8', 'passenger_count dtype incorrect'
    # Key uniqueness: warn instead of hard fail (some datasets may contain dup keys)
    if not df['key'].is_unique:
        dup_ct = int(df['key'].size - df['key'].nunique())
        print(f'Warning: {dup_ct} duplicate keys in train_sample (diagnostic only).')
    # Train-test disjoint (diagnostic assert ok)
    try:
        _test_keys = set(test['key'])
    except NameError:
        _test_keys = set()
    assert _test_keys == set() or set(df['key']).isdisjoint(_test_keys), 'Train sample overlaps with test keys'
    if cache_path:
        try:
            df.to_parquet(cache_path, index=False)
        except Exception:
            pass
    print({'train_sample_rows': int(len(df)), 'p': float(p)})
    return df

train_sample = load_train_sample_hash()

{'train_sample_rows': 2000000, 'p': 0.04}
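
Why the hash-based selection above is independent of row order can be checked with a small sketch. It assumes synthetic keys of the form `key_<i>` and a helper named `hash_keep_mask`, both introduced only for illustration. Note that in the sampler above the `seed` argument only affects the final `df.sample` downsample; the hash selection itself depends on nothing but the key values.

```python
import numpy as np
import pandas as pd
from pandas.util import hash_pandas_object

def hash_keep_mask(keys, p):
    """Boolean mask keeping each key with probability ~p, decided only by the key's hash."""
    h = hash_pandas_object(keys, index=False).to_numpy().astype('uint64')
    return (h / np.float64(2**64)) < p

# Synthetic keys (hypothetical); the real file uses timestamp-like key strings.
keys = pd.Series([f'key_{i}' for i in range(20_000)])
shuffled = keys.sample(frac=1.0, random_state=0).reset_index(drop=True)

kept_a = set(keys[hash_keep_mask(keys, 0.1)])
kept_b = set(shuffled[hash_keep_mask(shuffled, 0.1)])

print({'kept': len(kept_a), 'same_set_after_shuffle': kept_a == kept_b})
```

Because a key is kept or dropped based on its own hash, the same rows are selected regardless of chunk size or file order, which is what makes the sample reproducible across runs.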


In [10]:
# LightGBM with corrected time-CV (quantile folds + 1-day gap) and stronger params (unleash capacity)
import numpy as np
import pandas as pd
from sklearn.metrics import root_mean_squared_error
from datetime import timedelta

assert 'train_sample' in globals(), 'train_sample not loaded'
assert 'test' in globals() and 'sample' in globals(), 'test/sample not loaded'

try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

SEED = 2025
np.random.seed(SEED)

def haversine_km(lat1, lon1, lat2, lon2):
    R = 6371.0
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1)*np.cos(lat2)*np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c

def bearing_rad(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlon = lon2 - lon1
    y = np.sin(dlon) * np.cos(lat2)
    x = np.cos(lat1)*np.cos(lat2)*np.cos(dlon) + np.sin(lat1)*np.sin(lat2)
    return np.arctan2(y, x)

POIS = {
    'JFK': (40.6413, -73.7781),
    'LGA': (40.7769, -73.8740),
    'EWR': (40.6895, -74.1745),
    'MIDTOWN': (40.7580, -73.9855),
    'FIDI': (40.7060, -74.0086)
}

def to_local_ny(dt_series):
    return dt_series.dt.tz_convert('America/New_York')

def add_features(df):
    df = df.copy()
    dt_local = to_local_ny(df['pickup_datetime'])
    pu_lat = df['pickup_latitude'].astype('float32')
    pu_lon = df['pickup_longitude'].astype('float32')
    do_lat = df['dropoff_latitude'].astype('float32')
    do_lon = df['dropoff_longitude'].astype('float32')
    # Geometry
    df['dist_hav_km'] = haversine_km(pu_lat, pu_lon, do_lat, do_lon).astype('float32')
    df['dist_man_km'] = (
        haversine_km(pu_lat, pu_lon, pu_lat, do_lon) +
        haversine_km(pu_lat, do_lon, do_lat, do_lon)
    ).astype('float32')
    df['delta_lat'] = (do_lat - pu_lat).astype('float32')
    df['delta_lon'] = (do_lon - pu_lon).astype('float32')
    b = bearing_rad(pu_lat, pu_lon, do_lat, do_lon)
    df['bear_sin'] = np.sin(b).astype('float32')
    df['bear_cos'] = np.cos(b).astype('float32')
    df['center_lat'] = ((pu_lat + do_lat) / 2.0).astype('float32')
    df['center_lon'] = ((pu_lon + do_lon) / 2.0).astype('float32')
    # Temporal (NY local time)
    df['year'] = dt_local.dt.year.astype('int16')
    df['month'] = dt_local.dt.month.astype('int8')
    df['day'] = dt_local.dt.day.astype('int8')
    df['hour'] = dt_local.dt.hour.astype('int8')
    df['dow'] = dt_local.dt.dayofweek.astype('int8')
    df['is_weekend'] = df['dow'].isin([5,6]).astype('int8')
    df['is_rush'] = ((df['hour'].between(7,10)) | (df['hour'].between(16,19))).astype('int8')
    df['is_night'] = ((df['hour'] >= 22) | (df['hour'] <= 5)).astype('int8')
    df['hour_sin'] = np.sin(2*np.pi*df['hour']/24).astype('float32')
    df['hour_cos'] = np.cos(2*np.pi*df['hour']/24).astype('float32')
    df['dow_sin'] = np.sin(2*np.pi*df['dow']/7).astype('float32')
    df['dow_cos'] = np.cos(2*np.pi*df['dow']/7).astype('float32')
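    # POI distances (measured from the pickup point only)
    for name, (lat, lon) in POIS.items():
        df[f'dist_{name.lower()}'] = haversine_km(pu_lat, pu_lon, lat, lon).astype('float32')
    # Note: this flags pickups within 2 km of an airport; airport drop-offs are not captured
    df['is_airport_trip'] = (df[['dist_jfk','dist_lga','dist_ewr']].min(axis=1) < 2.0).astype('int8')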
    # Interactions
    df['dist_x_rush'] = (df['dist_hav_km'] * df['is_rush']).astype('float32')
    df['dist_x_weekend'] = (df['dist_hav_km'] * df['is_weekend']).astype('float32')
    return df

def clean_train(df):
    df = df.copy()
    m = (df['pickup_latitude'].between(40.5, 41.0) &
         df['pickup_longitude'].between(-74.5, -72.8) &
         df['dropoff_latitude'].between(40.5, 41.0) &
         df['dropoff_longitude'].between(-74.5, -72.8))
    m &= df['passenger_count'].between(1, 6)
    # haversine_km returns a Series aligned with df.index when given Series inputs
    dist = haversine_km(df['pickup_latitude'], df['pickup_longitude'], df['dropoff_latitude'], df['dropoff_longitude'])
    m &= dist.between(0.01, 200)
    m &= df['fare_amount'].between(2.5, 200)
    return df.loc[m].copy()

def clip_test(df):
    df = df.copy()
    df['passenger_count'] = df['passenger_count'].clip(lower=1, upper=6).astype('int8')
    df['pickup_latitude'] = df['pickup_latitude'].clip(lower=40.5, upper=41.0).astype('float32')
    df['dropoff_latitude'] = df['dropoff_latitude'].clip(lower=40.5, upper=41.0).astype('float32')
    df['pickup_longitude'] = df['pickup_longitude'].clip(lower=-74.5, upper=-72.8).astype('float32')
    df['dropoff_longitude'] = df['dropoff_longitude'].clip(lower=-74.5, upper=-72.8).astype('float32')
    return df

# Prepare datasets with parity
train_df = clean_train(train_sample)
test_df = clip_test(test)

train_df = add_features(train_df)
test_df = add_features(test_df)

# Post-feature clipping on test to match train domain for derived distances
for col in ['dist_hav_km', 'dist_man_km']:
    test_df[col] = test_df[col].clip(lower=0.01, upper=200).astype('float32')

feature_cols = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'year','month','day','hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi','is_airport_trip','dist_x_rush','dist_x_weekend'
]

train_df = train_df.dropna(subset=feature_cols + ['fare_amount']).copy()
test_df[feature_cols] = test_df[feature_cols].fillna(0)

# Order by pickup time for time-series CV
train_df = train_df.sort_values('pickup_datetime').reset_index(drop=True)

# Create matrices
X_all = train_df[feature_cols].astype('float32').values
y_all = np.log1p(train_df['fare_amount'].astype('float32').values)
dt_all = train_df['pickup_datetime']  # tz-aware
X_test = test_df[feature_cols].astype('float32').values

def make_time_folds_quantile(dt, n_folds=3, gap_days=1):
    """
    Build exactly n_folds non-empty validation windows using array_split on indices:
    - Split indices into (warmup + n_folds) contiguous chunks.
    - Each fold uses chunk k (1..K) as validation, and all rows strictly before (v_start - gap) as train.
    - If gap removes all train for earliest fold, fallback to all indices before val start (no leakage within val).
    Uses int64 timestamps and searchsorted for robust gap cut.
    """
    assert len(dt) > 0
    n = len(dt)
    idx = np.arange(n, dtype=int)
    chunks = np.array_split(idx, n_folds + 1)  # 0: warmup, 1..K: folds
    ts = dt.astype('int64').to_numpy()  # ns since epoch, sorted ascending
    gap_ns = np.int64(gap_days * 24 * 3600 * 10**9)
    folds = []
    for k in range(1, n_folds + 1):
        val_idx = chunks[k]
        if val_idx.size == 0:
            continue
        s = int(val_idx[0])
        v_start_ns = ts[s]
        cutoff = v_start_ns - gap_ns
        train_end = int(np.searchsorted(ts, cutoff, side='left'))  # positions strictly before cutoff
        if train_end == 0:
            train_end = s  # shrink gap only if needed to ensure non-empty train
        if train_end == 0:
            continue
        train_idx = np.arange(0, train_end, dtype=int)
        folds.append((train_idx, val_idx))
    assert len(folds) == n_folds, f'Expected {n_folds} folds, got {len(folds)}'
    return folds

K = 3
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

# LightGBM stronger params to reduce underfitting; rely on early stopping
lgb_params = dict(
    objective='regression',
    metric='rmse',
    learning_rate=0.05,
    n_estimators=10000,
    num_leaves=256,
    max_depth=-1,
    min_data_in_leaf=100,
    feature_fraction=0.8,
    bagging_fraction=0.8,
    bagging_freq=1,
    max_bin=255,
    reg_alpha=0.0,
    reg_lambda=1.0,
    random_state=SEED,
    n_jobs=-1,
    verbose=-1
)

cv_rmses = []
best_iters = []

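# log_evaluation prints RMSE in log1p space; the per-fold RMSE printed below is in dollars.
# Early stopping monitors the same fold that is scored, so each fold RMSE is slightly
# optimistic with respect to the number of trees chosen.
for i, (tr_idx, va_idx) in enumerate(folds, 1):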
    X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
    X_va, y_va = X_all[va_idx], y_all[va_idx]
    model = lgb.LGBMRegressor(**lgb_params)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)],
        eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=200, verbose=False), lgb.log_evaluation(period=50)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    y_true_val = np.expm1(y_va)
    rmse = float(root_mean_squared_error(y_true_val, y_pred_val))
    cv_rmses.append(rmse)
    best_iters.append(int(model.best_iteration_ if model.best_iteration_ is not None else model.n_estimators))
    print(f'Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})')

cv_mean = float(np.mean(cv_rmses))
cv_std = float(np.std(cv_rmses))
best_iter_final = int(np.clip(int(np.median(best_iters)), 200, 10000))
print({'cv_rmse_mean': round(cv_mean, 6), 'cv_rmse_std': round(cv_std, 6), 'best_iter_final': best_iter_final})

# Fit final model on all ordered data with chosen iterations
final_model = lgb.LGBMRegressor(**{**lgb_params, 'n_estimators': best_iter_final})
final_model.fit(X_all, y_all)

test_pred = np.expm1(final_model.predict(X_test)).astype('float32')
test_pred = np.clip(test_pred, 0, 500)

pred_df = pd.DataFrame({'key': test_df['key'].astype('string'), 'fare_amount': test_pred})
sub = sample[['key']].merge(pred_df, on='key', how='left')
assert sub.shape[0] == sample.shape[0], 'Submission key alignment error'
assert sub['fare_amount'].notna().all(), 'Missing predictions for some keys'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv with shape:', sub.shape)

[50]	valid_0's rmse: 0.217772


[100]	valid_0's rmse: 0.211975


[150]	valid_0's rmse: 0.210341


[200]	valid_0's rmse: 0.209268


[250]	valid_0's rmse: 0.208753


[300]	valid_0's rmse: 0.208603


[350]	valid_0's rmse: 0.208558


[400]	valid_0's rmse: 0.208604


[450]	valid_0's rmse: 0.208581


[500]	valid_0's rmse: 0.208573


[550]	valid_0's rmse: 0.208589


Fold 1: RMSE=3.40236, best_iter=389 (train_n=480373, val_n=481134)


[50]	valid_0's rmse: 0.242087


[100]	valid_0's rmse: 0.228094


[150]	valid_0's rmse: 0.224096


[200]	valid_0's rmse: 0.222233


[250]	valid_0's rmse: 0.22136


[300]	valid_0's rmse: 0.220764


[350]	valid_0's rmse: 0.220321


[400]	valid_0's rmse: 0.21998


[450]	valid_0's rmse: 0.219742


[500]	valid_0's rmse: 0.219602


[550]	valid_0's rmse: 0.219477


[600]	valid_0's rmse: 0.219348


[650]	valid_0's rmse: 0.219169


[700]	valid_0's rmse: 0.219103


[750]	valid_0's rmse: 0.218999


[800]	valid_0's rmse: 0.218918


[850]	valid_0's rmse: 0.218836


[900]	valid_0's rmse: 0.218833


[950]	valid_0's rmse: 0.218781


[1000]	valid_0's rmse: 0.218736


[1050]	valid_0's rmse: 0.218707


[1100]	valid_0's rmse: 0.218696


[1150]	valid_0's rmse: 0.21866


[1200]	valid_0's rmse: 0.218684


[1250]	valid_0's rmse: 0.218645


[1300]	valid_0's rmse: 0.218656


[1350]	valid_0's rmse: 0.218652


[1400]	valid_0's rmse: 0.218589


[1450]	valid_0's rmse: 0.218591


[1500]	valid_0's rmse: 0.218565


[1550]	valid_0's rmse: 0.218573


[1600]	valid_0's rmse: 0.21859


[1650]	valid_0's rmse: 0.218598


[1700]	valid_0's rmse: 0.218641


Fold 2: RMSE=3.86595, best_iter=1523 (train_n=961323, val_n=481134)


[50]	valid_0's rmse: 0.219922


[100]	valid_0's rmse: 0.202177


[150]	valid_0's rmse: 0.197403


[200]	valid_0's rmse: 0.194709


[250]	valid_0's rmse: 0.193111


[300]	valid_0's rmse: 0.192227


[350]	valid_0's rmse: 0.191598


[400]	valid_0's rmse: 0.191059


[450]	valid_0's rmse: 0.190737


[500]	valid_0's rmse: 0.19048


[550]	valid_0's rmse: 0.190193


[600]	valid_0's rmse: 0.189983


[650]	valid_0's rmse: 0.189859


[700]	valid_0's rmse: 0.189689


[750]	valid_0's rmse: 0.189581


[800]	valid_0's rmse: 0.189454


[850]	valid_0's rmse: 0.18936


[900]	valid_0's rmse: 0.189304


[950]	valid_0's rmse: 0.189228


[1000]	valid_0's rmse: 0.189178


[1050]	valid_0's rmse: 0.189121


[1100]	valid_0's rmse: 0.189043


[1150]	valid_0's rmse: 0.189011


[1200]	valid_0's rmse: 0.188989


[1250]	valid_0's rmse: 0.18898


[1300]	valid_0's rmse: 0.188948


[1350]	valid_0's rmse: 0.188925


[1400]	valid_0's rmse: 0.188887


[1450]	valid_0's rmse: 0.18887


[1500]	valid_0's rmse: 0.188846


[1550]	valid_0's rmse: 0.188827


[1600]	valid_0's rmse: 0.188819


[1650]	valid_0's rmse: 0.18882


[1700]	valid_0's rmse: 0.188811


[1750]	valid_0's rmse: 0.18878


[1800]	valid_0's rmse: 0.188771


[1850]	valid_0's rmse: 0.188796


[1900]	valid_0's rmse: 0.188783


[1950]	valid_0's rmse: 0.188789


Fold 3: RMSE=3.48195, best_iter=1787 (train_n=1442667, val_n=481133)
{'cv_rmse_mean': 3.583418, 'cv_rmse_std': 0.202406, 'best_iter_final': 1523}


Saved submission.csv with shape: (9914, 2)


In [11]:
# Diagnosis: Investigate CV instability (Fold-wise time ranges, distributions, holiday impact) and prepare for 5-fold CV
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as USCal

assert 'train_df' in globals() and train_df is not None, 'train_df missing'
assert 'dt_all' in globals(), 'dt_all missing'
assert 'make_time_folds_quantile' in globals(), 'fold generator missing'

def fold_diagnostics(dt, df_full, folds, name):
    print(f'=== Diagnostics for {name} ===')
    # Prepare holiday calendar on NY local dates
    dt_local = df_full['pickup_datetime'].dt.tz_convert('America/New_York')
    date_min = dt_local.min().date()
    date_max = dt_local.max().date()
    cal = USCal()
    hol = cal.holidays(start=pd.Timestamp(date_min), end=pd.Timestamp(date_max))
    hol_dates = set(pd.to_datetime(hol).date)
    results = []
    for i, (tr_idx, va_idx) in enumerate(folds, 1):
        va_dt = dt.iloc[va_idx]
        va_local = va_dt.dt.tz_convert('America/New_York')
        v_start, v_end = va_local.min(), va_local.max()
        va_part = df_full.iloc[va_idx]
        # Basic distributions
        fare_desc = va_part['fare_amount'].describe() if 'fare_amount' in va_part.columns else None
        dist_desc = va_part['dist_hav_km'].describe() if 'dist_hav_km' in va_part.columns else None
        pc_counts = va_part['passenger_count'].value_counts(normalize=True).sort_index()
        # Holiday share
        va_dates = va_local.dt.date
        hol_share = float(np.mean(va_dates.isin(hol_dates)))
        dow_counts = va_local.dt.dayofweek.value_counts(normalize=True).sort_index()
        month_counts = va_local.dt.month.value_counts(normalize=True).sort_index()
        print({
            'fold': i,
            'val_n': int(len(va_idx)),
            'val_range_local': f"{v_start} -> {v_end}",
            'holiday_share': round(hol_share, 4),
            'fare_mean': round(float(fare_desc['mean']) if fare_desc is not None else float('nan'), 4),
            'fare_p95': round(float(va_part['fare_amount'].quantile(0.95)) if 'fare_amount' in va_part.columns else float('nan'), 4),
            'dist_hav_mean': round(float(dist_desc['mean']) if dist_desc is not None else float('nan'), 4),
            'dist_hav_p95': round(float(va_part['dist_hav_km'].quantile(0.95)) if 'dist_hav_km' in va_part.columns else float('nan'), 4),
        })
        results.append({
            'fold': i,
            'v_start': v_start,
            'v_end': v_end,
            'holiday_share': hol_share,
            'pc_dist': pc_counts.to_dict(),
            'dow_dist': dow_counts.to_dict(),
            'month_dist': month_counts.to_dict(),
        })
    return results

# Ensure features exist for diagnostics (they were added before modeling)
assert 'dist_hav_km' in train_df.columns, 'Expected dist_hav_km in train_df'

# Rebuild 3-folds used earlier and also prepare 5-folds for robustness
folds_3 = make_time_folds_quantile(dt_all, n_folds=3, gap_days=1)
diag3 = fold_diagnostics(dt_all, train_df, folds_3, name='3-fold CV (current)')

folds_5 = make_time_folds_quantile(dt_all, n_folds=5, gap_days=1)
diag5 = fold_diagnostics(dt_all, train_df, folds_5, name='5-fold CV (proposed)')

# Parity check note: ensure train-side clipping for derived distances in future modeling (currently only test is clipped)
train_clip_check = {
    'train_dist_hav_min': float(train_df['dist_hav_km'].min()),
    'train_dist_hav_max': float(train_df['dist_hav_km'].max()),
    'train_dist_man_min': float(train_df['dist_man_km'].min()),
    'train_dist_man_max': float(train_df['dist_man_km'].max()),
}
print('Train distance ranges (pre-clip):', train_clip_check)
print('Next step: clip train dist_hav_km and dist_man_km to [0.01, 200] to match test, add holiday/day-of-year features, and switch to 5-fold CV.')

=== Diagnostics for 3-fold CV (current) ===


{'fold': 1, 'val_n': 481134, 'val_range_local': '2010-08-15 10:20:00-04:00 -> 2012-03-23 06:40:51-04:00', 'holiday_share': 0.0234, 'fare_mean': 10.3199, 'fare_p95': 26.9, 'dist_hav_mean': 3.3214, 'dist_hav_p95': 9.8415}


{'fold': 2, 'val_n': 481134, 'val_range_local': '2012-03-23 06:41:00-04:00 -> 2013-10-22 18:06:00-04:00', 'holiday_share': 0.017, 'fare_mean': 11.9459, 'fare_p95': 32.0, 'dist_hav_mean': 3.428, 'dist_hav_p95': 10.2621}


{'fold': 3, 'val_n': 481133, 'val_range_local': '2013-10-22 18:07:37-04:00 -> 2015-06-30 19:59:25-04:00', 'holiday_share': 0.0196, 'fare_mean': 12.8364, 'fare_p95': 35.5, 'dist_hav_mean': 3.4225, 'dist_hav_p95': 10.496}
=== Diagnostics for 5-fold CV (proposed) ===


{'fold': 1, 'val_n': 320756, 'val_range_local': '2010-01-27 12:25:28-05:00 -> 2011-03-05 12:35:00-05:00', 'holiday_share': 0.0214, 'fare_mean': 10.0924, 'fare_p95': 26.1, 'dist_hav_mean': 3.2518, 'dist_hav_p95': 9.6344}


{'fold': 2, 'val_n': 320756, 'val_range_local': '2011-03-05 12:35:04-05:00 -> 2012-03-23 06:40:51-04:00', 'holiday_share': 0.0189, 'fare_mean': 10.4113, 'fare_p95': 27.3, 'dist_hav_mean': 3.3645, 'dist_hav_p95': 9.9602}


{'fold': 3, 'val_n': 320756, 'val_range_local': '2012-03-23 06:41:00-04:00 -> 2013-04-09 03:24:00-04:00', 'holiday_share': 0.0182, 'fare_mean': 11.5625, 'fare_p95': 30.5, 'dist_hav_mean': 3.4024, 'dist_hav_p95': 10.1674}


{'fold': 4, 'val_n': 320756, 'val_range_local': '2013-04-09 03:27:00-04:00 -> 2014-05-03 16:48:00-04:00', 'holiday_share': 0.0181, 'fare_mean': 12.6008, 'fare_p95': 34.33, 'dist_hav_mean': 3.4203, 'dist_hav_p95': 10.3212}


{'fold': 5, 'val_n': 320755, 'val_range_local': '2014-05-03 16:50:37-04:00 -> 2015-06-30 19:59:25-04:00', 'holiday_share': 0.0187, 'fare_mean': 13.0101, 'fare_p95': 36.33, 'dist_hav_mean': 3.453, 'dist_hav_p95': 10.6888}
Train distance ranges (pre-clip): {'train_dist_hav_min': 0.010023146867752075, 'train_dist_hav_max': 91.0614013671875, 'train_dist_man_min': 0.01025301218032837, 'train_dist_man_max': 101.6641845703125}
Next step: clip train dist_hav_km and dist_man_km to [0.01, 200] to match test, add holiday/day-of-year features, and switch to 5-fold CV.
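
The drift visible in the 5-fold diagnostics can be quantified with a short calculation. The numbers below are copied from the printed output above; the helper `relative_change` is an illustrative name, and reading the result as a price-level shift over time is an interpretation, not something the diagnostics prove.

```python
# Validation-fold means copied from the 5-fold diagnostics printed above.
fare_mean_by_fold = {1: 10.0924, 2: 10.4113, 3: 11.5625, 4: 12.6008, 5: 13.0101}
dist_mean_by_fold = {1: 3.2518, 2: 3.3645, 3: 3.4024, 4: 3.4203, 5: 3.4530}

def relative_change(by_fold):
    """Relative change from the earliest fold to the latest fold."""
    first, last = by_fold[min(by_fold)], by_fold[max(by_fold)]
    return last / first - 1.0

fare_growth = relative_change(fare_mean_by_fold)
dist_growth = relative_change(dist_mean_by_fold)
print(f'fare mean growth: {fare_growth:.1%}, distance mean growth: {dist_growth:.1%}')
```

Mean fare grows several times faster than mean trip distance between the first and last fold, which suggests the fare level itself moves over time. That is a plausible reason for fold-to-fold RMSE differences and an argument for keeping `year` and the time-aware CV.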


In [12]:
# Stability pass: parity clipping on train, add holiday/day-of-year features, switch to 5-fold CV
import numpy as np
import pandas as pd
from sklearn.metrics import root_mean_squared_error
from pandas.tseries.holiday import USFederalHolidayCalendar as USCal

assert 'train_df' in globals() and 'test_df' in globals(), 'Expected prepared train_df/test_df'
assert 'make_time_folds_quantile' in globals(), 'fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

def add_holiday_and_doy(df):
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    # Day-of-year cyclical
    doy = dt_local.dt.dayofyear.astype('int16')
    df['doy_sin'] = np.sin(2*np.pi*(doy/365.25)).astype('float32')
    df['doy_cos'] = np.cos(2*np.pi*(doy/365.25)).astype('float32')
    # US Federal Holidays flag
    start = dt_local.min().normalize().tz_localize(None)
    end = dt_local.max().normalize().tz_localize(None)
    hol = USCal().holidays(start=start, end=end)
    hol_dates = set(pd.to_datetime(hol).date)
    # Vectorized membership test, same pattern as the holiday_share diagnostic
    df['is_holiday'] = dt_local.dt.date.isin(hol_dates).astype('int8')
    return df

# Train-side parity clipping for derived distances
for col in ['dist_hav_km', 'dist_man_km']:
    if col in train_df.columns:
        train_df[col] = train_df[col].clip(lower=0.01, upper=200).astype('float32')

# Add stability features
train_df = add_holiday_and_doy(train_df)
test_df = add_holiday_and_doy(test_df)

# Features (extended)
feature_cols = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'year','month','day','hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi','is_airport_trip','dist_x_rush','dist_x_weekend'
]

# Clean NaNs just in case
train_df = train_df.dropna(subset=feature_cols + ['fare_amount']).copy()
test_df[feature_cols] = test_df[feature_cols].fillna(0)

# Order by time and build matrices
train_df = train_df.sort_values('pickup_datetime').reset_index(drop=True)
X_all = train_df[feature_cols].astype('float32').values
y_all = np.log1p(train_df['fare_amount'].astype('float32').values)
dt_all = train_df['pickup_datetime']
X_test = test_df[feature_cols].astype('float32').values

# 5-fold CV with 1-day gap
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

SEED = 2025
lgb_params = dict(
    objective='regression',
    metric='rmse',
    learning_rate=0.05,
    n_estimators=12000,
    num_leaves=256,
    max_depth=-1,
    min_data_in_leaf=100,
    feature_fraction=0.8,
    bagging_fraction=0.8,
    bagging_freq=1,
    max_bin=255,
    reg_alpha=0.0,
    reg_lambda=1.0,
    random_state=SEED,
    n_jobs=-1,
    verbose=-1
)

cv_rmses, best_iters = [], []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
    X_va, y_va = X_all[va_idx], y_all[va_idx]
    model = lgb.LGBMRegressor(**lgb_params)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)],
        eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=300, verbose=False), lgb.log_evaluation(period=100)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    y_true_val = np.expm1(y_va)
    rmse = float(root_mean_squared_error(y_true_val, y_pred_val))
    cv_rmses.append(rmse)
    best_iters.append(int(model.best_iteration_ if model.best_iteration_ is not None else model.n_estimators))
    print(f'Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})')

cv_mean = float(np.mean(cv_rmses))
cv_std = float(np.std(cv_rmses))
best_iter_final = int(np.clip(int(np.median(best_iters)), 300, 12000))
print({'cv_rmse_mean': round(cv_mean, 6), 'cv_rmse_std': round(cv_std, 6), 'best_iter_final': best_iter_final})

# Fit final model and write submission
final_model = lgb.LGBMRegressor(**{**lgb_params, 'n_estimators': best_iter_final})
final_model.fit(X_all, y_all)
test_pred = np.expm1(final_model.predict(X_test)).astype('float32')
test_pred = np.clip(test_pred, 0, 500)
pred_df = pd.DataFrame({'key': test_df['key'].astype('string'), 'fare_amount': test_pred})
sub = sample[['key']].merge(pred_df, on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv with shape:', sub.shape)

[100]	valid_0's rmse: 0.234393
[200]	valid_0's rmse: 0.231866
[300]	valid_0's rmse: 0.231498
[400]	valid_0's rmse: 0.231514
[500]	valid_0's rmse: 0.231644
[600]	valid_0's rmse: 0.231863
Fold 1: RMSE=3.71100, best_iter=357 (train_n=319908, val_n=320756)

[100]	valid_0's rmse: 0.193662
[200]	valid_0's rmse: 0.191594
[300]	valid_0's rmse: 0.191712
[400]	valid_0's rmse: 0.191552
[500]	valid_0's rmse: 0.191791
[600]	valid_0's rmse: 0.192006
Fold 2: RMSE=3.19618, best_iter=388 (train_n=640523, val_n=320756)

[100]	valid_0's rmse: 0.216781
[200]	valid_0's rmse: 0.21133
[300]	valid_0's rmse: 0.21015
[400]	valid_0's rmse: 0.20956
[500]	valid_0's rmse: 0.20926
[600]	valid_0's rmse: 0.20898
[700]	valid_0's rmse: 0.208798
[800]	valid_0's rmse: 0.208703
[900]	valid_0's rmse: 0.208656
[1000]	valid_0's rmse: 0.208607
[1100]	valid_0's rmse: 0.208536
[1200]	valid_0's rmse: 0.208525
[1300]	valid_0's rmse: 0.208514
[1400]	valid_0's rmse: 0.208542
[1500]	valid_0's rmse: 0.208575
Fold 3: RMSE=3.61429, best_iter=1232 (train_n=961323, val_n=320756)

[100]	valid_0's rmse: 0.209217
[200]	valid_0's rmse: 0.201428
[300]	valid_0's rmse: 0.199304
[400]	valid_0's rmse: 0.19818
[500]	valid_0's rmse: 0.197609
[600]	valid_0's rmse: 0.197272
[700]	valid_0's rmse: 0.197031
[800]	valid_0's rmse: 0.196821
[900]	valid_0's rmse: 0.196684
[1000]	valid_0's rmse: 0.196571
[1100]	valid_0's rmse: 0.196447
[1200]	valid_0's rmse: 0.196375
[1300]	valid_0's rmse: 0.196293
[1400]	valid_0's rmse: 0.196247
[1500]	valid_0's rmse: 0.196189
[1600]	valid_0's rmse: 0.196144
[1700]	valid_0's rmse: 0.196128
[1800]	valid_0's rmse: 0.196117
[1900]	valid_0's rmse: 0.19609
[2000]	valid_0's rmse: 0.196095
[2100]	valid_0's rmse: 0.196093
[2200]	valid_0's rmse: 0.196101
[2300]	valid_0's rmse: 0.196103
[2400]	valid_0's rmse: 0.196109
Fold 4: RMSE=3.55101, best_iter=2136 (train_n=1282234, val_n=320756)

[100]	valid_0's rmse: 0.192223
[200]	valid_0's rmse: 0.183929
[300]	valid_0's rmse: 0.181241
[400]	valid_0's rmse: 0.179948
[500]	valid_0's rmse: 0.179229
[600]	valid_0's rmse: 0.178694
[700]	valid_0's rmse: 0.178406
[800]	valid_0's rmse: 0.17823
[900]	valid_0's rmse: 0.177993
[1000]	valid_0's rmse: 0.177886
[1100]	valid_0's rmse: 0.17779
[1200]	valid_0's rmse: 0.177687
[1300]	valid_0's rmse: 0.17763
[1400]	valid_0's rmse: 0.177611
[1500]	valid_0's rmse: 0.177527
[1600]	valid_0's rmse: 0.177509
[1700]	valid_0's rmse: 0.177494
[1800]	valid_0's rmse: 0.177463
[1900]	valid_0's rmse: 0.177458
[2000]	valid_0's rmse: 0.17744
[2100]	valid_0's rmse: 0.177417
[2200]	valid_0's rmse: 0.177429
[2300]	valid_0's rmse: 0.177411
[2400]	valid_0's rmse: 0.177402
[2500]	valid_0's rmse: 0.177416
[2600]	valid_0's rmse: 0.177426
[2700]	valid_0's rmse: 0.177414
Fold 5: RMSE=3.28215, best_iter=2404 (train_n=1602868, val_n=320755)
{'cv_rmse_mean': 3.470925, 'cv_rmse_std': 0.197849, 'best_iter_final': 1232}


Saved submission.csv with shape: (9914, 2)
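
The fold sizes printed above (a near-constant validation size and a training size that grows by roughly one block per fold) are what a forward-chaining split over equal-count time blocks would produce. `make_time_folds_quantile` is defined elsewhere in the notebook and is not shown in this section, so the following is only a hypothetical sketch of such a splitter, not the notebook's implementation. The name `time_folds_sketch`, the K+1 equal-count blocks, and the rule "train on rows strictly earlier than validation start minus the gap" are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def time_folds_sketch(dt, n_folds=5, gap_days=1):
    """Hypothetical forward-chaining splitter (NOT the notebook's make_time_folds_quantile).

    Rows are ordered by time and cut into n_folds + 1 equal-count blocks.
    Fold k validates on block k and trains on all earlier rows whose
    timestamp is more than `gap_days` before the validation block starts.
    """
    dt = pd.Series(dt).reset_index(drop=True)
    order = np.argsort(dt.values, kind='stable')   # positions sorted by time
    blocks = np.array_split(order, n_folds + 1)    # equal-count time blocks
    gap = pd.Timedelta(days=gap_days)
    folds = []
    for k in range(1, n_folds + 1):
        va_idx = blocks[k]
        va_start = dt.iloc[va_idx].min()
        earlier = np.concatenate(blocks[:k])
        keep = (dt.iloc[earlier] < va_start - gap).values  # enforce the gap
        folds.append((earlier[keep], va_idx))
    return folds

# Toy data: 600 hourly pickups, so each of the 6 blocks holds 100 rows
# and the 1-day gap removes the last 24 rows before each validation block.
dt_demo = pd.Series(pd.Timestamp('2015-01-01', tz='UTC')
                    + pd.to_timedelta(np.arange(600), unit='h'))
folds_demo = time_folds_sketch(dt_demo, n_folds=5, gap_days=1)
for tr_idx, va_idx in folds_demo:
    print(len(tr_idx), len(va_idx))
```

On the toy data the training sizes are 76, 176, 276, 376 and 476 with 100 validation rows each, mirroring the pattern in the logs where the first training set is slightly smaller than one block because of the gap.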


In [13]:
# Feature sprint: spatial bins + frequency encodings (leakage-safe), dropoff POIs, continuous trend; 5-fold time-CV
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals(), 'train_df/test_df not prepared'
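assert 'make_time_folds_quantile' in globals(), 'fold generator missing'
assert all(n in globals() for n in ('haversine_km', 'POIS', 'sample')), 'haversine_km/POIS/sample missing'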
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# 1) Add dropoff-side POIs and robust airport flag
def add_dropoff_pois(df):
    df = df.copy()
    pu_lat = df['pickup_latitude'].astype('float32'); pu_lon = df['pickup_longitude'].astype('float32')
    do_lat = df['dropoff_latitude'].astype('float32'); do_lon = df['dropoff_longitude'].astype('float32')
    # Dropoff distances
    df['do_dist_jfk'] = haversine_km(do_lat, do_lon, POIS['JFK'][0], POIS['JFK'][1]).astype('float32')
    df['do_dist_lga'] = haversine_km(do_lat, do_lon, POIS['LGA'][0], POIS['LGA'][1]).astype('float32')
    df['do_dist_ewr'] = haversine_km(do_lat, do_lon, POIS['EWR'][0], POIS['EWR'][1]).astype('float32')
    df['do_dist_midtown'] = haversine_km(do_lat, do_lon, POIS['MIDTOWN'][0], POIS['MIDTOWN'][1]).astype('float32')
    df['do_dist_fidi'] = haversine_km(do_lat, do_lon, POIS['FIDI'][0], POIS['FIDI'][1]).astype('float32')
    # Any-airport indicator (pickup or dropoff within 2km of any airport)
    min_pu_air = df[['dist_jfk','dist_lga','dist_ewr']].min(axis=1)
    min_do_air = df[['do_dist_jfk','do_dist_lga','do_dist_ewr']].min(axis=1)
    df['is_airport_trip_any'] = ((min_pu_air < 2.0) | (min_do_air < 2.0)).astype('int8')
    return df

train_df = add_dropoff_pois(train_df)
test_df = add_dropoff_pois(test_df)

# 2) Continuous time trend (days since the earliest train pickup); computed from epoch nanoseconds, so tz_convert does not change the value
def add_time_trend(df, global_start):
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    df['days_since_start'] = (dt_local.view('int64') - global_start) / np.float64(24*3600*1e9)
    df['days_since_start'] = df['days_since_start'].astype('float32')
    return df

dt_local_all = train_df['pickup_datetime'].dt.tz_convert('America/New_York')
global_start_ns = int(dt_local_all.min().value)
train_df = add_time_trend(train_df, global_start_ns)
test_df = add_time_trend(test_df, global_start_ns)

# 3) Spatial discretization: 0.005-degree grid bins (about 555 m north-south, about 420 m east-west at NYC latitude)
def add_spatial_bins(df, res=0.005):
    df = df.copy()
    pu_lat_bin = np.floor(df['pickup_latitude'] / res).astype('int32')
    pu_lon_bin = np.floor(df['pickup_longitude'] / res).astype('int32')
    do_lat_bin = np.floor(df['dropoff_latitude'] / res).astype('int32')
    do_lon_bin = np.floor(df['dropoff_longitude'] / res).astype('int32')
    df['pu_bin'] = (pu_lat_bin.astype(str) + '_' + pu_lon_bin.astype(str))
    df['do_bin'] = (do_lat_bin.astype(str) + '_' + do_lon_bin.astype(str))
    df['pair_bin'] = (df['pu_bin'] + '|' + df['do_bin'])
    return df

train_df = add_spatial_bins(train_df)
test_df = add_spatial_bins(test_df)

# 4) Optional clusters (MiniBatchKMeans for speed) on a 200k sample
def add_clusters(train, test, k=80, sample_n=200_000, seed=2025):
    # Fit pickup and dropoff clusters separately for clarity
    pu_coords = train[['pickup_latitude','pickup_longitude']].astype('float32')
    do_coords = train[['dropoff_latitude','dropoff_longitude']].astype('float32')
    if len(train) > sample_n:
        samp_idx = np.random.default_rng(seed).choice(len(train), size=sample_n, replace=False)
        pu_fit = pu_coords.iloc[samp_idx]
        do_fit = do_coords.iloc[samp_idx]
    else:
        pu_fit, do_fit = pu_coords, do_coords
    pu_km = MiniBatchKMeans(n_clusters=k, random_state=seed, batch_size=10000, n_init=5, max_no_improvement=20)
    do_km = MiniBatchKMeans(n_clusters=k, random_state=seed+1, batch_size=10000, n_init=5, max_no_improvement=20)
    pu_km.fit(pu_fit.values)
    do_km.fit(do_fit.values)
    train = train.copy(); test = test.copy()
    train['pu_cl'] = pu_km.predict(pu_coords.values).astype('int32')
    train['do_cl'] = do_km.predict(do_coords.values).astype('int32')
    test['pu_cl'] = pu_km.predict(test[['pickup_latitude','pickup_longitude']].astype('float32').values).astype('int32')
    test['do_cl'] = do_km.predict(test[['dropoff_latitude','dropoff_longitude']].astype('float32').values).astype('int32')
    train['same_cl'] = (train['pu_cl'] == train['do_cl']).astype('int8')
    test['same_cl'] = (test['pu_cl'] == test['do_cl']).astype('int8')
    return train, test

train_df, test_df = add_clusters(train_df, test_df, k=80)

# 5) Prepare base numeric features
base_features = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'year','month','day','hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday','days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi','do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend',
    'pu_cl','do_cl','same_cl'
]

# Clean NaNs
train_df = train_df.dropna(subset=base_features + ['fare_amount']).reset_index(drop=True)
test_df[base_features] = test_df[base_features].fillna(0)

# Order by time
train_df = train_df.sort_values('pickup_datetime').reset_index(drop=True)
dt_all = train_df['pickup_datetime']
y_all = np.log1p(train_df['fare_amount'].astype('float32').values)

# 6) 5-fold CV with leakage-safe frequency encodings computed per-fold
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

def build_fold_matrices(df, tr_idx, va_idx):
    # Compute frequency encodings on training subset only
    tr = df.iloc[tr_idx]
    pu_cnt_map = tr['pu_bin'].value_counts().astype('int32')
    do_cnt_map = tr['do_bin'].value_counts().astype('int32')
    pair_cnt_map = tr['pair_bin'].value_counts().astype('int32')
    # Map counts to both train and val (safe; no target)
    df_enc = df.copy()
    df_enc['pu_cnt'] = df_enc['pu_bin'].map(pu_cnt_map).fillna(0).astype('int32')
    df_enc['do_cnt'] = df_enc['do_bin'].map(do_cnt_map).fillna(0).astype('int32')
    df_enc['pair_cnt'] = df_enc['pair_bin'].map(pair_cnt_map).fillna(0).astype('int32')
    # Feature matrix
    use_cols = base_features + ['pu_cnt','do_cnt','pair_cnt']
    X_tr = df_enc.iloc[tr_idx][use_cols].astype('float32').values
    X_va = df_enc.iloc[va_idx][use_cols].astype('float32').values
    return X_tr, X_va, use_cols

SEED = 2025
lgb_params = dict(
    objective='regression',
    metric='rmse',
    learning_rate=0.05,
    n_estimators=16000,
    num_leaves=384,
    max_depth=-1,
    min_data_in_leaf=100,
    feature_fraction=0.85,
    bagging_fraction=0.8,
    bagging_freq=1,
    max_bin=255,
    reg_alpha=0.0,
    reg_lambda=1.0,
    random_state=SEED,
    n_jobs=-1,
    verbose=-1
)

cv_rmses, best_iters = [], []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    X_tr, X_va, use_cols = build_fold_matrices(train_df, tr_idx, va_idx)
    y_tr, y_va = y_all[tr_idx], y_all[va_idx]
    model = lgb.LGBMRegressor(**lgb_params)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)],
        eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=300, verbose=False), lgb.log_evaluation(period=150)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    y_true_val = np.expm1(y_va)
    rmse = float(root_mean_squared_error(y_true_val, y_pred_val))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or model.n_estimators))
    print(f'Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})')

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_final = int(np.clip(int(np.median(best_iters)), 600, 16000))
print({'cv_rmse_mean': round(cv_mean, 6), 'cv_rmse_std': round(cv_std, 6), 'best_iter_final': best_iter_final})

# 7) Fit final model on all data: build frequency encodings using full train, then predict test
pu_cnt_full = train_df['pu_bin'].value_counts().astype('int32')
do_cnt_full = train_df['do_bin'].value_counts().astype('int32')
pair_cnt_full = train_df['pair_bin'].value_counts().astype('int32')

train_full = train_df.copy()
train_full['pu_cnt'] = train_full['pu_bin'].map(pu_cnt_full).fillna(0).astype('int32')
train_full['do_cnt'] = train_full['do_bin'].map(do_cnt_full).fillna(0).astype('int32')
train_full['pair_cnt'] = train_full['pair_bin'].map(pair_cnt_full).fillna(0).astype('int32')

test_full = test_df.copy()
test_full['pu_cnt'] = test_full['pu_bin'].map(pu_cnt_full).fillna(0).astype('int32')
test_full['do_cnt'] = test_full['do_bin'].map(do_cnt_full).fillna(0).astype('int32')
test_full['pair_cnt'] = test_full['pair_bin'].map(pair_cnt_full).fillna(0).astype('int32')

final_features = use_cols
X_full = train_full[final_features].astype('float32').values
X_test = test_full[final_features].astype('float32').values

final_model = lgb.LGBMRegressor(**{**lgb_params, 'n_estimators': best_iter_final})
final_model.fit(X_full, y_all)
test_pred = np.expm1(final_model.predict(X_test)).astype('float32')
test_pred = np.clip(test_pred, 0, 500)
pred_df = pd.DataFrame({'key': test_df['key'].astype('string'), 'fare_amount': test_pred})
sub = sample[['key']].merge(pred_df, on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv with shape:', sub.shape)

  df['days_since_start'] = (dt_local.view('int64') - global_start) / np.float64(24*3600*1e9)


[150]	valid_0's rmse: 0.231592
[300]	valid_0's rmse: 0.230974
[450]	valid_0's rmse: 0.231527
Fold 1: RMSE=3.72799, best_iter=251 (train_n=319908, val_n=320756)

[150]	valid_0's rmse: 0.19299
[300]	valid_0's rmse: 0.191751
[450]	valid_0's rmse: 0.192054
Fold 2: RMSE=3.24314, best_iter=271 (train_n=640523, val_n=320756)

[150]	valid_0's rmse: 0.211039
[300]	valid_0's rmse: 0.208184
[450]	valid_0's rmse: 0.207952
[600]	valid_0's rmse: 0.207591
[750]	valid_0's rmse: 0.207656
[900]	valid_0's rmse: 0.207543
[1050]	valid_0's rmse: 0.207556
Fold 3: RMSE=3.58950, best_iter=781 (train_n=961323, val_n=320756)

[150]	valid_0's rmse: 0.200274
[300]	valid_0's rmse: 0.196358
[450]	valid_0's rmse: 0.195354
[600]	valid_0's rmse: 0.194925
[750]	valid_0's rmse: 0.194694
[900]	valid_0's rmse: 0.194651
[1050]	valid_0's rmse: 0.194604
[1200]	valid_0's rmse: 0.19459
[1350]	valid_0's rmse: 0.194658
Fold 4: RMSE=3.50795, best_iter=1082 (train_n=1282234, val_n=320756)

[150]	valid_0's rmse: 0.183133
[300]	valid_0's rmse: 0.17814
[450]	valid_0's rmse: 0.177058
[600]	valid_0's rmse: 0.176579
[750]	valid_0's rmse: 0.176387
[900]	valid_0's rmse: 0.176238
[1050]	valid_0's rmse: 0.176232
[1200]	valid_0's rmse: 0.176264
Fold 5: RMSE=3.21542, best_iter=1029 (train_n=1602868, val_n=320755)
{'cv_rmse_mean': 3.456801, 'cv_rmse_std': 0.198841, 'best_iter_final': 781}


Saved submission.csv with shape: (9914, 2)
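
The per-fold frequency encoding in the cell above builds its count maps from the training rows only and then applies them to both training and validation rows. A minimal toy illustration of that pattern follows; the helper name `count_encode` and the example keys are made up for this sketch and do not appear in the notebook.

```python
import pandas as pd

def count_encode(train_keys, apply_keys):
    """Map each key to how often it occurs in train_keys; unseen keys get 0."""
    counts = train_keys.value_counts()
    return apply_keys.map(counts).fillna(0).astype('int32')

# Toy bins standing in for pu_bin / do_bin / pair_bin values
tr_bins = pd.Series(['a', 'a', 'b', 'c', 'a'])
va_bins = pd.Series(['a', 'b', 'z'])   # 'z' never occurs in the training rows

tr_cnt = count_encode(tr_bins, tr_bins)
va_cnt = count_encode(tr_bins, va_bins)
print(tr_cnt.tolist())  # -> [3, 3, 1, 1, 3]
print(va_cnt.tolist())  # -> [3, 1, 0]
```

Because the maps never see validation rows or the target, the encoding cannot leak fare information. The counts do grow with the size of the training window, so the same bin gets a larger value in later folds and in the final full-train fit.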


In [18]:
# Mandatory rework (fast): stabilize TE — drop raw cluster IDs, add pu/do/pair log-counts, ablate te_pair, stronger regularization
import numpy as np
import pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals(), 'train_df/test_df not prepared'
assert 'make_time_folds_quantile' in globals(), 'fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# 1) Time trend (warning-safe)
def add_time_trend_fixed(df, global_start_ns):
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    ns = dt_local.astype('int64').values
    df['days_since_start'] = (ns - np.int64(global_start_ns)) / np.float64(24*3600*1e9)
    return df.astype({'days_since_start':'float32'})

dt_local_all = train_df['pickup_datetime'].dt.tz_convert('America/New_York')
global_start_ns = int(dt_local_all.min().value)
if 'days_since_start' not in train_df.columns:
    train_df = add_time_trend_fixed(train_df, global_start_ns)
    test_df = add_time_trend_fixed(test_df, global_start_ns)

# 2) Rotated-Manhattan in km
def add_rot_manhattan(df, theta_deg=29.0):
    df = df.copy()
    lat_rad = np.deg2rad(df['pickup_latitude'].astype('float32'))
    k_lat = 111_000.0
    k_lon = (np.cos(lat_rad).astype('float32') * 111_000.0).astype('float32')
    dx = (df['delta_lon'].astype('float32') * k_lon).astype('float32')
    dy = (df['delta_lat'].astype('float32') * k_lat).astype('float32')
    th = np.deg2rad(np.float32(theta_deg))
    c, s = np.cos(th), np.sin(th)
    r1 = np.abs(dx*c + dy*s); r2 = np.abs(-dx*s + dy*c)
    df['rot_manh_km'] = ((r1 + r2) / 1000.0).astype('float32')
    return df

if 'rot_manh_km' not in train_df.columns:
    train_df = add_rot_manhattan(train_df)
    test_df = add_rot_manhattan(test_df)

# 3) Ensure fast spatial bins
def ensure_spatial_bins(df, res=0.005):
    if all(c in df.columns for c in ['pu_bin','do_bin','pair_bin']):
        return df
    df = df.copy()
    pu_lat_bin = np.floor(df['pickup_latitude'] / res).astype('int32')
    pu_lon_bin = np.floor(df['pickup_longitude'] / res).astype('int32')
    do_lat_bin = np.floor(df['dropoff_latitude'] / res).astype('int32')
    do_lon_bin = np.floor(df['dropoff_longitude'] / res).astype('int32')
    df['pu_bin'] = (pu_lat_bin.astype(str) + '_' + pu_lon_bin.astype(str))
    df['do_bin'] = (do_lat_bin.astype(str) + '_' + do_lon_bin.astype(str))
    df['pair_bin'] = (df['pu_bin'] + '|' + df['do_bin'])
    return df

train_df = ensure_spatial_bins(train_df)
test_df = ensure_spatial_bins(test_df)

# 4) Smoothed target encoding helper (use moderate smoothing)
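def target_encode_smooth(tr_ser, tr_target, va_ser, m=200.0, prior=None):
    # Smoothed mean per key: (n_k * mean_k + m * prior) / (n_k + m); keys unseen in tr_ser fall back to prior.
    # NOTE: tr_enc is in-sample (each training row's own target contributes to its encoding); only va_enc is out-of-sample.
    if prior is None:
        prior = float(tr_target.mean())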
    g = pd.DataFrame({'k': tr_ser.astype('object'), 'y': tr_target}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict()
    tr_enc = tr_ser.astype('object').map(mp).fillna(prior).astype('float32').values
    va_enc = va_ser.astype('object').map(mp).fillna(prior).astype('float32').values
    return tr_enc, va_enc, prior, mp

# 5) Base numeric features — drop raw cluster IDs to avoid fighting TE; keep same_cl only if present
base_num = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'year','month','day','hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday','days_since_start','rot_manh_km',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'same_cl' in train_df.columns: base_num.append('same_cl')

need_cols = base_num + ['fare_amount']
train_df = train_df.dropna(subset=need_cols).sort_values('pickup_datetime').reset_index(drop=True)
test_df[base_num] = test_df[base_num].fillna(0)
dt_all = train_df['pickup_datetime']

# 6) 5-fold CV with per-fold TE (pu_bin, do_bin only) + log-counts (pu/do/pair) from train-only maps
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

SEED = 2025
lgb_params = dict(
    objective='regression',
    metric='rmse',
    learning_rate=0.04,
    n_estimators=25000,
    num_leaves=256,
    max_depth=-1,
    min_data_in_leaf=200,
    feature_fraction=0.8,
    bagging_fraction=0.8,
    bagging_freq=1,
    max_bin=255,
    reg_alpha=0.0,
    reg_lambda=10.0,
    random_state=SEED,
    n_jobs=-1,
    verbose=-1
)

cv_rmses, best_iters = [], []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_df.iloc[tr_idx].copy(); va = train_df.iloc[va_idx].copy()
    # targets (log1p)
    y_tr_log = np.log1p(tr['fare_amount'].astype('float32').values); y_va_log = np.log1p(va['fare_amount'].astype('float32').values)
    prior = float(y_tr_log.mean()); m = 200.0
    # TE on pu/do only (ablate te_pair_bin for stability)
    for col in ['pu_bin','do_bin']:
        tr_enc, va_enc, _, _ = target_encode_smooth(tr[col], y_tr_log, va[col], m=m, prior=prior)
        tr[f'te_{col}'] = tr_enc; va[f'te_{col}'] = va_enc
    # Leakage-safe counts from train-only
    pu_vc = tr['pu_bin'].astype('object').value_counts().to_dict()
    do_vc = tr['do_bin'].astype('object').value_counts().to_dict()
    pair_vc = tr['pair_bin'].astype('object').value_counts().to_dict()
    tr['log_pu_cnt'] = np.log1p(tr['pu_bin'].astype('object').map(pu_vc).fillna(0).astype('int32')).astype('float32')
    va['log_pu_cnt'] = np.log1p(va['pu_bin'].astype('object').map(pu_vc).fillna(0).astype('int32')).astype('float32')
    tr['log_do_cnt'] = np.log1p(tr['do_bin'].astype('object').map(do_vc).fillna(0).astype('int32')).astype('float32')
    va['log_do_cnt'] = np.log1p(va['do_bin'].astype('object').map(do_vc).fillna(0).astype('int32')).astype('float32')
    tr['log_pair_cnt'] = np.log1p(tr['pair_bin'].astype('object').map(pair_vc).fillna(0).astype('int32')).astype('float32')
    va['log_pair_cnt'] = np.log1p(va['pair_bin'].astype('object').map(pair_vc).fillna(0).astype('int32')).astype('float32')
    # Assemble
    use_cols = base_num + ['te_pu_bin','te_do_bin','log_pu_cnt','log_do_cnt','log_pair_cnt']
    X_tr = tr[use_cols].astype('float32'); X_va = va[use_cols].astype('float32')
    assert np.isfinite(X_tr.values).all() and np.isfinite(X_va.values).all(), 'Non-finite values in features'
    model = lgb.LGBMRegressor(**lgb_params)
    model.fit(
        X_tr, y_tr_log,
        eval_set=[(X_va, y_va_log)],
        eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=500, verbose=False), lgb.log_evaluation(period=200)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    rmse = float(root_mean_squared_error(np.expm1(y_va_log), y_pred_val))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or model.n_estimators))
    print(f'Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})')

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_final = int(np.clip(int(np.median(best_iters)), 1200, 25000))
print({'cv_rmse_mean': round(cv_mean, 6), 'cv_rmse_std': round(cv_std, 6), 'best_iter_final': best_iter_final})

# 7) Final model: fit TE on full train (pu/do) + log-counts; predict test
full = train_df.copy(); tst = test_df.copy()
y_full_log = np.log1p(full['fare_amount'].astype('float32').values)
prior_full = float(y_full_log.mean()); m = 200.0
for col in ['pu_bin','do_bin']:
    tr_enc, va_enc, _, _ = target_encode_smooth(full[col], y_full_log, tst[col], m=m, prior=prior_full)
    full[f'te_{col}'] = tr_enc; tst[f'te_{col}'] = va_enc
pu_vc_full = full['pu_bin'].astype('object').value_counts().to_dict()
do_vc_full = full['do_bin'].astype('object').value_counts().to_dict()
pair_vc_full = full['pair_bin'].astype('object').value_counts().to_dict()
full['log_pu_cnt'] = np.log1p(full['pu_bin'].astype('object').map(pu_vc_full).fillna(0).astype('int32')).astype('float32')
tst['log_pu_cnt'] = np.log1p(tst['pu_bin'].astype('object').map(pu_vc_full).fillna(0).astype('int32')).astype('float32')
full['log_do_cnt'] = np.log1p(full['do_bin'].astype('object').map(do_vc_full).fillna(0).astype('int32')).astype('float32')
tst['log_do_cnt'] = np.log1p(tst['do_bin'].astype('object').map(do_vc_full).fillna(0).astype('int32')).astype('float32')
full['log_pair_cnt'] = np.log1p(full['pair_bin'].astype('object').map(pair_vc_full).fillna(0).astype('int32')).astype('float32')
tst['log_pair_cnt'] = np.log1p(tst['pair_bin'].astype('object').map(pair_vc_full).fillna(0).astype('int32')).astype('float32')
final_use = base_num + ['te_pu_bin','te_do_bin','log_pu_cnt','log_do_cnt','log_pair_cnt']
X_full = full[final_use].astype('float32'); X_test = tst[final_use].astype('float32')
assert np.isfinite(X_full.values).all() and np.isfinite(X_test.values).all(), 'Non-finite values in final features'
final_model = lgb.LGBMRegressor(**{**lgb_params, 'n_estimators': best_iter_final})
final_model.fit(X_full, y_full_log)
test_pred = np.expm1(final_model.predict(X_test)).astype('float32')
test_pred = np.clip(test_pred, 0, 500)
pred_df = pd.DataFrame({'key': test_df['key'].astype('string'), 'fare_amount': test_pred})
sub = sample[['key']].merge(pred_df, on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv with shape:', sub.shape)

[200]	valid_0's rmse: 0.232369
[400]	valid_0's rmse: 0.23242
[600]	valid_0's rmse: 0.233228
Fold 1: RMSE=3.79261, best_iter=256 (train_n=319908, val_n=320756)

[200]	valid_0's rmse: 0.189598
[400]	valid_0's rmse: 0.189246
[600]	valid_0's rmse: 0.189645
Fold 2: RMSE=3.36168, best_iter=268 (train_n=640523, val_n=320756)

[200]	valid_0's rmse: 0.211846
[400]	valid_0's rmse: 0.2098
[600]	valid_0's rmse: 0.209486
[800]	valid_0's rmse: 0.209322
[1000]	valid_0's rmse: 0.209366
[1200]	valid_0's rmse: 0.209441
[1400]	valid_0's rmse: 0.209475
Fold 3: RMSE=3.76216, best_iter=929 (train_n=961323, val_n=320756)

[200]	valid_0's rmse: 0.200908
[400]	valid_0's rmse: 0.197461
[600]	valid_0's rmse: 0.196715
[800]	valid_0's rmse: 0.196321
[1000]	valid_0's rmse: 0.196153
[1200]	valid_0's rmse: 0.196115
[1400]	valid_0's rmse: 0.196141
[1600]	valid_0's rmse: 0.196177
[1800]	valid_0's rmse: 0.196172
Fold 4: RMSE=3.73814, best_iter=1448 (train_n=1282234, val_n=320756)

[200]	valid_0's rmse: 0.1841
[400]	valid_0's rmse: 0.180154
[600]	valid_0's rmse: 0.179176
[800]	valid_0's rmse: 0.178836
[1000]	valid_0's rmse: 0.178555
[1200]	valid_0's rmse: 0.178449
[1400]	valid_0's rmse: 0.17848
[1600]	valid_0's rmse: 0.178509
Fold 5: RMSE=3.51287, best_iter=1262 (train_n=1602868, val_n=320755)
{'cv_rmse_mean': 3.633491, 'cv_rmse_std': 0.168083, 'best_iter_final': 1200}
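
The rework above reports a CV mean of 3.633491, worse than the 3.456801 obtained with frequency encodings alone. One candidate explanation, offered as a hypothesis that these logs do not establish, is that `target_encode_smooth` encodes each training row with statistics that include that row's own target, while validation rows are encoded out-of-sample. The sketch below shows one common alternative, out-of-fold encoding of the training rows. It is not the notebook's method: the helper name `oof_target_encode`, the inner random `KFold`, and the per-fold prior are assumptions made for illustration, and an expanding time-ordered variant may suit this time-split setup better.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(keys, y, m=200.0, n_splits=5, seed=2025):
    """Out-of-fold smoothed target encoding for training rows (illustrative sketch).

    Each row is encoded from the other inner folds only, using
    (n_k * mean_k + m * prior) / (n_k + m); the prior is the mean target of
    those other folds, so a row's own target never enters its encoding.
    """
    keys = pd.Series(keys).reset_index(drop=True)
    y = np.asarray(y, dtype='float64')
    enc = np.empty(len(keys), dtype='float64')
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(keys):
        prior = float(y[fit_idx].mean())
        stats = (pd.DataFrame({'k': keys.iloc[fit_idx].values, 'y': y[fit_idx]})
                 .groupby('k')['y'].agg(['mean', 'count']))
        smooth = (stats['count'] * stats['mean'] + m * prior) / (stats['count'] + m)
        # Keys absent from the fitting folds fall back to the prior
        enc[enc_idx] = keys.iloc[enc_idx].map(smooth).fillna(prior).to_numpy()
    return enc.astype('float32')

# Toy data: three bins with different typical log-fares
keys_demo = pd.Series(list('aabbac') * 50)
y_demo = np.tile(np.array([1.0, 1.2, 3.0, 3.2, 0.8, 2.0]), 50)
enc_demo = oof_target_encode(keys_demo, y_demo, m=5.0)

# Changing one row's target must not change that row's own encoding
y_shift = y_demo.copy()
y_shift[0] += 100.0
enc_shift = oof_target_encode(keys_demo, y_shift, m=5.0)
print(enc_demo[0] == enc_shift[0])
```

Validation and test rows would still be encoded from a map fitted on all training rows, as in the cell above; only the training-side encoding changes.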


In [17]:
# Debug TE pipeline: inspect per-fold target encodings and feature matrix for Fold 1
import numpy as np, pandas as pd
assert 'train_df' in globals() and 'make_time_folds_quantile' in globals(), 'Prereqs missing'

# Ensure required columns exist
for c in ['pu_bin','do_bin','pair_bin','fare_amount','pickup_datetime']:
    assert c in train_df.columns, f'Missing {c} in train_df'

dt_all_dbg = train_df['pickup_datetime']
folds_dbg = make_time_folds_quantile(dt_all_dbg, n_folds=5, gap_days=1)
tr_idx, va_idx = folds_dbg[0]
tr = train_df.iloc[tr_idx].copy(); va = train_df.iloc[va_idx].copy()
y_tr_log = np.log1p(tr['fare_amount'].astype('float32').values)
prior = float(y_tr_log.mean()); m = 1000.0

def te_map(series_tr, target_tr, m=1000.0, prior=None):
    if prior is None:
        prior = float(target_tr.mean())
    g = pd.DataFrame({'k': series_tr.astype('object'), 'y': target_tr}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    return smooth.to_dict(), prior

# Build maps on train only
maps = {}
for col in ['pu_bin','do_bin','pair_bin'] + ([c for c in ['pu_cl','do_cl'] if c in tr.columns]):
    mp, pr = te_map(tr[col], y_tr_log, m=m, prior=prior)
    maps[col] = (mp, pr)

# Apply maps
te_stats = []
for col in maps:
    mp, pr = maps[col]
    tr_enc = tr[col].astype('object').map(mp).fillna(pr).astype('float32')
    va_enc = va[col].astype('object').map(mp).fillna(pr).astype('float32')
    te_stats.append({
        'col': col,
        'tr_na_rate': float(1.0 - tr[col].astype('object').map(mp).notna().mean()),
        'va_na_rate': float(1.0 - va[col].astype('object').map(mp).notna().mean()),
        'tr_mean': float(tr_enc.mean()),
        'va_mean': float(va_enc.mean()),
        'tr_std': float(tr_enc.std()),
        'va_std': float(va_enc.std())
    })

print('TE diagnostics (Fold 1): prior=', prior)
for row in te_stats:
    print(row)

# Check feature finiteness and variance for Fold 1 assembled features
base_num = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'year','month','day','hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday'
]
if 'days_since_start' in train_df.columns: base_num.append('days_since_start')
if 'rot_manh_km' in train_df.columns: base_num.append('rot_manh_km')
for c in ['dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi','do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi','is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend']:
    if c in train_df.columns: base_num.append(c)
for c in ['pu_cl','do_cl','same_cl']:
    if c in train_df.columns: base_num.append(c)

# Assemble TE features
for col in maps:
    mp, pr = maps[col]
    tr[f'te_{col}'] = tr[col].astype('object').map(mp).fillna(pr).astype('float32')
    va[f'te_{col}'] = va[col].astype('object').map(mp).fillna(pr).astype('float32')
# Frequency feature: log1p of train-only pair counts (pairs unseen in train get 0)
pair_cnt_tr = tr['pair_bin'].astype('object').value_counts()
tr['log_pair_cnt'] = np.log1p(tr['pair_bin'].astype('object').map(pair_cnt_tr).fillna(0).astype('int32')).astype('float32')
va['log_pair_cnt'] = np.log1p(va['pair_bin'].astype('object').map(pair_cnt_tr).fillna(0).astype('int32')).astype('float32')
use_cols = base_num + [c for c in tr.columns if c.startswith('te_')] + ['log_pair_cnt']
X_tr = tr[use_cols].astype('float32')
X_va = va[use_cols].astype('float32')
print('Shapes:', X_tr.shape, X_va.shape, 'n_use_cols=', len(use_cols))
print('Finite checks:', np.isfinite(X_tr.values).all(), np.isfinite(X_va.values).all())
print('Variance (first 10 cols):', X_tr.var().head(10).to_dict())
print('Any zero-variance cols:', bool((X_tr.var() == 0).any()))

TE diagnostics (Fold 1): prior= 2.243056058883667
{'col': 'pu_bin', 'tr_na_rate': 0.0, 'va_na_rate': 0.00333898664405341, 'tr_mean': 2.2265536785125732, 'va_mean': 2.227857828140259, 'tr_std': 0.11589936912059784, 'va_std': 0.11819490790367126}
{'col': 'do_bin', 'tr_na_rate': 0.0, 'va_na_rate': 0.00381598473606104, 'tr_mean': 2.20108699798584, 'va_mean': 2.2015466690063477, 'tr_std': 0.0922887846827507, 'va_std': 0.09205456078052521}
{'col': 'pair_bin', 'tr_na_rate': 0.0, 'va_na_rate': 0.09091645986357233, 'tr_mean': 2.2360243797302246, 'va_mean': 2.236215591430664, 'tr_std': 0.015584036707878113, 'va_std': 0.01518352422863245}
{'col': 'pu_cl', 'tr_na_rate': 0.0, 'va_na_rate': 0.0, 'tr_mean': 2.233232021331787, 'va_mean': 2.2348179817199707, 'tr_std': 0.1591968834400177, 'va_std': 0.16106157004833221}
{'col': 'do_cl', 'tr_na_rate': 0.0, 'va_na_rate': 3.117634588267748e-06, 'tr_mean': 2.2207770347595215, 'va_mean': 2.2219908237457275, 'tr_std': 0.1709526628255844, 'va_std': 0.1719950735

Shapes: (319908, 53) (320756, 53) n_use_cols= 53
Finite checks: True True
Variance (first 10 cols): {'pickup_longitude': 0.0010823372285813093, 'pickup_latitude': 0.0006756684742867947, 'dropoff_longitude': 0.0011220808373764157, 'dropoff_latitude': 0.0009095306741073728, 'passenger_count': 1.5728309154510498, 'dist_hav_km': 11.514139175415039, 'dist_man_km': 19.904155731201172, 'delta_lat': 0.0009210002608597279, 'delta_lon': 0.001471993513405323, 'bear_sin': 2.5747036147549807e-07}


Any zero-variance cols: False
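
The smoothing used in `te_map` above blends each category mean with the global prior as (n * mean + m * prior) / (n + m), so rare categories are pulled toward the prior more strongly than frequent ones. A minimal, dependency-free sketch of that arithmetic follows. The helper name `smoothed_te`, the toy keys and targets, and `m=2.0` are assumptions made up for illustration and are not taken from the notebook.

```python
from collections import defaultdict

def smoothed_te(keys, targets, m, prior=None):
    """Toy m-estimate target encoding: (n*mean + m*prior) / (n + m) per key."""
    if prior is None:
        prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for k, y in zip(keys, targets):
        sums[k] += y
        counts[k] += 1
    # n * mean equals the per-key sum, so the numerator is sum + m * prior
    enc = {k: (sums[k] + m * prior) / (counts[k] + m) for k in counts}
    return enc, prior

keys = ['a', 'a', 'a', 'a', 'b']      # made-up spatial bins (hypothetical)
targets = [2.0, 2.0, 2.0, 2.0, 3.0]   # made-up log1p(fare) values (hypothetical)
enc, prior = smoothed_te(keys, targets, m=2.0)

# Keys unseen at fit time fall back to the prior, mirroring .map(mp).fillna(pr)
encoded_val = [enc.get(k, prior) for k in ['a', 'b', 'zzz']]
print(enc, prior, encoded_val)
```

In this toy case the single-row key `b` moves two thirds of the way from its raw mean toward the prior, while the four-row key `a` moves only one third. The prior fallback is the case counted by `va_na_rate` in the diagnostics above (about 9% of validation rows for `pair_bin`).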


In [19]:
# Quick A/B: robust objective (huber) on stable feature set from Cell 5 to test impact on CV std
import numpy as np
import pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'feature_cols' in globals(), 'Prereqs missing (train_df/feature_cols)'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# Rebuild matrices from the simpler, previously stable feature set
train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True)
X_all_h = train_ord[feature_cols].astype('float32').values
y_all_h = np.log1p(train_ord['fare_amount'].astype('float32').values)
dt_all_h = train_ord['pickup_datetime']

# 5-fold time CV with 1-day gap
K = 5
folds_h = make_time_folds_quantile(dt_all_h, n_folds=K, gap_days=1)
assert len(folds_h) == K, f'Expected {K} folds, got {len(folds_h)}'

SEED = 2025
lgb_params_huber = dict(
    objective='huber',
    metric='rmse',
    learning_rate=0.05,
    n_estimators=16000,
    num_leaves=256,
    max_depth=-1,
    min_data_in_leaf=120,
    feature_fraction=0.8,
    bagging_fraction=0.8,
    bagging_freq=1,
    max_bin=255,
    reg_alpha=0.0,
    reg_lambda=2.0,
    random_state=SEED,
    n_jobs=-1,
    verbose=-1
)

cv_rmses, best_iters = [], []
for i, (tr_idx, va_idx) in enumerate(folds_h, 1):
    X_tr, y_tr = X_all_h[tr_idx], y_all_h[tr_idx]
    X_va, y_va = X_all_h[va_idx], y_all_h[va_idx]
    model = lgb.LGBMRegressor(**lgb_params_huber)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)],
        eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=300, verbose=False), lgb.log_evaluation(period=150)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    y_true_val = np.expm1(y_va)
    rmse = float(root_mean_squared_error(y_true_val, y_pred_val))
    cv_rmses.append(rmse)
    best_iters.append(int(model.best_iteration_ or model.n_estimators))
    print(f'[huber] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})')

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_final = int(np.clip(int(np.median(best_iters)), 400, 16000))
print({'cv_rmse_mean': round(cv_mean, 6), 'cv_rmse_std': round(cv_std, 6), 'best_iter_final': best_iter_final, 'objective': 'huber'})

[150]	valid_0's rmse: 0.233055


[300]	valid_0's rmse: 0.231346


[450]	valid_0's rmse: 0.231142


[600]	valid_0's rmse: 0.231246


[huber] Fold 1: RMSE=3.68608, best_iter=431 (train_n=319908, val_n=320756)


[150]	valid_0's rmse: 0.190609


[300]	valid_0's rmse: 0.190214


[450]	valid_0's rmse: 0.190117


[600]	valid_0's rmse: 0.190738


[huber] Fold 2: RMSE=3.16536, best_iter=395 (train_n=640523, val_n=320756)


[150]	valid_0's rmse: 0.213345


[300]	valid_0's rmse: 0.209824


[450]	valid_0's rmse: 0.208807


[600]	valid_0's rmse: 0.208317


[750]	valid_0's rmse: 0.208066


[900]	valid_0's rmse: 0.207899


[1050]	valid_0's rmse: 0.207796


[1200]	valid_0's rmse: 0.207754


[1350]	valid_0's rmse: 0.207802


[1500]	valid_0's rmse: 0.207733


[1650]	valid_0's rmse: 0.207761


[1800]	valid_0's rmse: 0.207793


[huber] Fold 3: RMSE=3.58556, best_iter=1512 (train_n=961323, val_n=320756)


[150]	valid_0's rmse: 0.20438


[300]	valid_0's rmse: 0.199264


[450]	valid_0's rmse: 0.197697


[600]	valid_0's rmse: 0.197006


[750]	valid_0's rmse: 0.196583


[900]	valid_0's rmse: 0.196287


[1050]	valid_0's rmse: 0.196071


[1200]	valid_0's rmse: 0.195913


[1350]	valid_0's rmse: 0.195818


[1500]	valid_0's rmse: 0.195742


[1650]	valid_0's rmse: 0.195669


[1800]	valid_0's rmse: 0.19562


[1950]	valid_0's rmse: 0.195613


[2100]	valid_0's rmse: 0.195597


[2250]	valid_0's rmse: 0.195581


[2400]	valid_0's rmse: 0.19561


[huber] Fold 4: RMSE=3.51703, best_iter=2245 (train_n=1282234, val_n=320756)


[150]	valid_0's rmse: 0.186709


[300]	valid_0's rmse: 0.180568


[450]	valid_0's rmse: 0.178788


[600]	valid_0's rmse: 0.178003


[750]	valid_0's rmse: 0.17744


[900]	valid_0's rmse: 0.177071


[1050]	valid_0's rmse: 0.176827


[1200]	valid_0's rmse: 0.176683


[1350]	valid_0's rmse: 0.176574


[1500]	valid_0's rmse: 0.176521


[1650]	valid_0's rmse: 0.176469


[1800]	valid_0's rmse: 0.176441


[1950]	valid_0's rmse: 0.176425


[2100]	valid_0's rmse: 0.17641


[huber] Fold 5: RMSE=3.21017, best_iter=1912 (train_n=1602868, val_n=320755)
{'cv_rmse_mean': 3.432841, 'cv_rmse_std': 0.207688, 'best_iter_final': 1512, 'objective': 'huber'}
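
The summary line above can be reproduced from the five per-fold values in the log. The sketch below uses only the standard library and copies the rounded numbers from the printed fold lines, so it agrees with the notebook only to about five decimals. One point it makes explicit: `np.std` defaults to the population standard deviation (ddof=0), which `statistics.pstdev` matches. The sample standard deviation (ddof=1) is noticeably larger with only five folds.

```python
import statistics

# Values copied from the "[huber] Fold i" lines above
fold_rmse = [3.68608, 3.16536, 3.58556, 3.51703, 3.21017]
best_iters = [431, 395, 1512, 2245, 1912]

cv_mean = statistics.fmean(fold_rmse)
cv_std_pop = statistics.pstdev(fold_rmse)      # same convention as np.std (ddof=0)
cv_std_sample = statistics.stdev(fold_rmse)    # ddof=1, for comparison only

median_iter = int(statistics.median(best_iters))
best_iter_final = min(max(median_iter, 400), 16000)  # same bounds as the np.clip call

print(round(cv_mean, 5), round(cv_std_pop, 5), round(cv_std_sample, 5), best_iter_final)
```

When comparing `cv_rmse_std` across experiments, the same convention should be used each time. The cells in this notebook all use the `np.std` default.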


In [20]:
# Phase 1 (Stability): No Target Encoding, per-fold frequency encodings, fare policy feature, monotone constraint on time trend
import numpy as np
import pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals(), 'train_df/test_df missing'
assert 'sample' in globals(), 'sample missing (needed to align submission keys)'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# 0) Ensure time trend and spatial bins exist
def ensure_days_since_start(df, global_start_ns):
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    ns = dt_local.astype('int64').values
    df['days_since_start'] = ((ns - np.int64(global_start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return df

def ensure_spatial_bins(df, res=0.005):
    if all(c in df.columns for c in ['pu_bin','do_bin','pair_bin']):
        return df
    df = df.copy()
    pu_lat_bin = np.floor(df['pickup_latitude'] / res).astype('int32')
    pu_lon_bin = np.floor(df['pickup_longitude'] / res).astype('int32')
    do_lat_bin = np.floor(df['dropoff_latitude'] / res).astype('int32')
    do_lon_bin = np.floor(df['dropoff_longitude'] / res).astype('int32')
    df['pu_bin'] = (pu_lat_bin.astype(str) + '_' + pu_lon_bin.astype(str))
    df['do_bin'] = (do_lat_bin.astype(str) + '_' + do_lon_bin.astype(str))
    df['pair_bin'] = (df['pu_bin'] + '|' + df['do_bin'])
    return df

dt_local_all = train_df['pickup_datetime'].dt.tz_convert('America/New_York')
global_start_ns = int(dt_local_all.min().value)
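# Check each frame independently so test_df is not skipped when train_df already has the column
if 'days_since_start' not in train_df.columns:
    train_df = ensure_days_since_start(train_df, global_start_ns)
if 'days_since_start' not in test_df.columns:
    test_df = ensure_days_since_start(test_df, global_start_ns)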
train_df = ensure_spatial_bins(train_df)
test_df = ensure_spatial_bins(test_df)

# 1) Fare policy feature (NYC fare hike around 2012-09-04 local time) + simple interactions
def add_fare_policy(df):
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    df['after_hike'] = (dt_local >= cutoff).astype('int8')
    # interactions (keep small for stability)
    if 'dist_hav_km' in df.columns:
        df['dist_x_after_hike'] = (df['dist_hav_km'] * df['after_hike']).astype('float32')
    else:
        df['dist_x_after_hike'] = np.float32(0.0)
    df['pc_x_after_hike'] = (df['passenger_count'].astype('float32') * df['after_hike']).astype('float32')
    return df

train_df = add_fare_policy(train_df)
test_df = add_fare_policy(test_df)

# 2) Core numeric features (no TE). Use robust, already-engineered set.
core_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'year','month','day','hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday','days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_df.columns: core_feats.append('rot_manh_km')
policy_feats = ['after_hike','dist_x_after_hike','pc_x_after_hike']

# 3) Build 5-fold time CV with 1-day gap and per-fold frequency encodings (log-counts) for pu/do/pair bins
train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True)
dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

def build_fold_mats(df, tr_idx, va_idx):
    tr = df.iloc[tr_idx].copy()
    va = df.iloc[va_idx].copy()
    pu_cnt = tr['pu_bin'].astype('object').value_counts()
    do_cnt = tr['do_bin'].astype('object').value_counts()
    pair_cnt = tr['pair_bin'].astype('object').value_counts()
    for d in (tr, va):
        d['log_pu_cnt'] = np.log1p(d['pu_bin'].astype('object').map(pu_cnt).fillna(0).astype('int32')).astype('float32')
        d['log_do_cnt'] = np.log1p(d['do_bin'].astype('object').map(do_cnt).fillna(0).astype('int32')).astype('float32')
        d['log_pair_cnt'] = np.log1p(d['pair_bin'].astype('object').map(pair_cnt).fillna(0).astype('int32')).astype('float32')
    use_cols = core_feats + policy_feats + ['log_pu_cnt','log_do_cnt','log_pair_cnt']
    X_tr = tr[use_cols].astype('float32').values
    X_va = va[use_cols].astype('float32').values
    return X_tr, X_va, use_cols

SEED = 2025
lgb_params = dict(
    objective='regression',
    metric='rmse',
    learning_rate=0.05,
    n_estimators=16000,
    num_leaves=256,
    max_depth=-1,
    min_data_in_leaf=120,
    feature_fraction=0.85,
    bagging_fraction=0.8,
    bagging_freq=1,
    max_bin=255,
    reg_alpha=0.0,
    reg_lambda=2.0,
    random_state=SEED,
    n_jobs=-1,
    verbose=-1
)

cv_rmses, best_iters = [], []
final_use_cols = None
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    X_tr, X_va, use_cols = build_fold_mats(train_ord, tr_idx, va_idx)
    y_tr, y_va = y_all_log[tr_idx], y_all_log[va_idx]
    # Monotonicity: +1 on days_since_start, 0 elsewhere
    mono = [0] * len(use_cols)
    if 'days_since_start' in use_cols:
        mono[use_cols.index('days_since_start')] = 1
    model = lgb.LGBMRegressor(**{**lgb_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)],
        eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=300, verbose=False), lgb.log_evaluation(period=150)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    y_true_val = np.expm1(y_va)
    rmse = float(root_mean_squared_error(y_true_val, y_pred_val))
    cv_rmses.append(rmse)
    best_iters.append(int(model.best_iteration_ or model.n_estimators))
    final_use_cols = use_cols
    print(f'[Stability Phase] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})')

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_final = int(np.clip(int(np.median(best_iters)), 500, lgb_params['n_estimators']))
print({'cv_rmse_mean': round(cv_mean, 6), 'cv_rmse_std': round(cv_std, 6), 'best_iter_final': best_iter_final, 'note': 'No TE; per-fold freq enc; fare policy; monotone days_since_start'})

# 4) Train final model on full data with full-data frequency maps; predict test and save submission
pu_cnt_full = train_ord['pu_bin'].astype('object').value_counts()
do_cnt_full = train_ord['do_bin'].astype('object').value_counts()
pair_cnt_full = train_ord['pair_bin'].astype('object').value_counts()

train_full = train_ord.copy()
test_full = test_df.copy()
for d in (train_full, test_full):
    d['log_pu_cnt'] = np.log1p(d['pu_bin'].astype('object').map(pu_cnt_full).fillna(0).astype('int32')).astype('float32')
    d['log_do_cnt'] = np.log1p(d['do_bin'].astype('object').map(do_cnt_full).fillna(0).astype('int32')).astype('float32')
    d['log_pair_cnt'] = np.log1p(d['pair_bin'].astype('object').map(pair_cnt_full).fillna(0).astype('int32')).astype('float32')

use_cols = final_use_cols
X_full = train_full[use_cols].astype('float32').values
y_full = y_all_log
X_test = test_full[use_cols].astype('float32').values
mono = [0] * len(use_cols)
if 'days_since_start' in use_cols:
    mono[use_cols.index('days_since_start')] = 1
final_model = lgb.LGBMRegressor(**{**lgb_params, 'n_estimators': best_iter_final, 'monotone_constraints': mono})
final_model.fit(X_full, y_full)
test_pred = np.expm1(final_model.predict(X_test)).astype('float32')
test_pred = np.clip(test_pred, 0, 500)
pred_df = pd.DataFrame({'key': test_df['key'].astype('string'), 'fare_amount': test_pred})
sub = sample[['key']].merge(pred_df, on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv with shape:', sub.shape)

[150]	valid_0's rmse: 0.231743


[300]	valid_0's rmse: 0.230688


[450]	valid_0's rmse: 0.230989


[600]	valid_0's rmse: 0.231295


[Stability Phase] Fold 1: RMSE=3.68317, best_iter=306 (train_n=319908, val_n=320756)


[150]	valid_0's rmse: 0.187459


[300]	valid_0's rmse: 0.186754


[450]	valid_0's rmse: 0.186616


[600]	valid_0's rmse: 0.187017


[Stability Phase] Fold 2: RMSE=3.05045, best_iter=407 (train_n=640523, val_n=320756)


[150]	valid_0's rmse: 0.211124


[300]	valid_0's rmse: 0.207833


[450]	valid_0's rmse: 0.207165


[600]	valid_0's rmse: 0.206658


[750]	valid_0's rmse: 0.206078


[900]	valid_0's rmse: 0.205662


[1050]	valid_0's rmse: 0.205571


[1200]	valid_0's rmse: 0.205574


[1350]	valid_0's rmse: 0.205615


[1500]	valid_0's rmse: 0.205616


[Stability Phase] Fold 3: RMSE=3.55499, best_iter=1227 (train_n=961323, val_n=320756)


[150]	valid_0's rmse: 0.200765


[300]	valid_0's rmse: 0.196497


[450]	valid_0's rmse: 0.195444


[600]	valid_0's rmse: 0.194967


[750]	valid_0's rmse: 0.194673


[900]	valid_0's rmse: 0.194511


[1050]	valid_0's rmse: 0.194424


[1200]	valid_0's rmse: 0.194333


[1350]	valid_0's rmse: 0.194275


[1500]	valid_0's rmse: 0.194296


[1650]	valid_0's rmse: 0.194248


[1800]	valid_0's rmse: 0.194355


[Stability Phase] Fold 4: RMSE=3.49205, best_iter=1629 (train_n=1282234, val_n=320756)


[150]	valid_0's rmse: 0.184215


[300]	valid_0's rmse: 0.178949


[450]	valid_0's rmse: 0.177555


[600]	valid_0's rmse: 0.176899


[750]	valid_0's rmse: 0.176472


[900]	valid_0's rmse: 0.176277


[1050]	valid_0's rmse: 0.176231


[1200]	valid_0's rmse: 0.176156


[1350]	valid_0's rmse: 0.176195


[Stability Phase] Fold 5: RMSE=3.19196, best_iter=1157 (train_n=1602868, val_n=320755)
{'cv_rmse_mean': 3.394525, 'cv_rmse_std': 0.235795, 'best_iter_final': 1157, 'note': 'No TE; per-fold freq enc; fare policy; monotone days_since_start'}


Saved submission.csv with shape: (9914, 2)
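
The per-fold frequency encoding in the cell above fits the counts on the training slice only and then maps them onto both slices, with unseen bins falling back to a count of zero. A minimal, dependency-free sketch of the same idea follows. The helper name `log_count_encode` and the toy `pair_bin` values are assumptions made up for illustration.

```python
import math
from collections import Counter

def log_count_encode(train_keys, other_keys):
    """Map keys to log1p(count in train_keys); unseen keys get log1p(0) = 0."""
    counts = Counter(train_keys)
    return [math.log1p(counts.get(k, 0)) for k in other_keys]

fold_train = ['A|B', 'A|B', 'A|B', 'C|D']   # made-up pair_bin values (hypothetical)
fold_valid = ['A|B', 'C|D', 'E|F']          # 'E|F' never occurs in the training slice

train_feat = log_count_encode(fold_train, fold_train)
valid_feat = log_count_encode(fold_train, fold_valid)
print(train_feat, valid_feat)
```

Fitting the counts on the training slice keeps validation rows from contributing to their own features. For the final model the cell recomputes the counts on the full training data before encoding the test set, as in its step 4.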


In [24]:
# Phase 1 — Targeted Iteration for Stability: piecewise time trend, airport-hike interaction, per-fold freq encodings (no monotone to avoid slowdown)
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals(), 'train_df/test_df missing'
assert 'sample' in globals(), 'sample missing (needed to align submission keys)'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

def ensure_days_since_start(df, global_start_ns):
    if 'days_since_start' in df.columns: return df
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    ns = dt_local.astype('int64').values
    df['days_since_start'] = ((ns - np.int64(global_start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return df

def ensure_spatial_bins(df, res=0.005):
    if all(c in df.columns for c in ['pu_bin','do_bin','pair_bin']): return df
    df = df.copy()
    pu_lat_bin = np.floor(df['pickup_latitude'] / res).astype('int32')
    pu_lon_bin = np.floor(df['pickup_longitude'] / res).astype('int32')
    do_lat_bin = np.floor(df['dropoff_latitude'] / res).astype('int32')
    do_lon_bin = np.floor(df['dropoff_longitude'] / res).astype('int32')
    df['pu_bin'] = (pu_lat_bin.astype(str) + '_' + pu_lon_bin.astype(str))
    df['do_bin'] = (do_lat_bin.astype(str) + '_' + do_lon_bin.astype(str))
    df['pair_bin'] = (df['pu_bin'] + '|' + df['do_bin'])
    return df

dt_local_all = train_df['pickup_datetime'].dt.tz_convert('America/New_York')
global_start_ns = int(dt_local_all.min().value)
train_df = ensure_days_since_start(train_df, global_start_ns)
test_df = ensure_days_since_start(test_df, global_start_ns)
train_df = ensure_spatial_bins(train_df)
test_df = ensure_spatial_bins(test_df)

def add_policy_and_interactions(df):
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    df['after_hike'] = (dt_local >= cutoff).astype('int8')
    df['days_since_start_x_after_hike'] = (df['days_since_start'] * df['after_hike']).astype('float32')
    if 'dist_hav_km' in df.columns:
        df['dist_x_after_hike'] = (df['dist_hav_km'] * df['after_hike']).astype('float32')
    else:
        df['dist_x_after_hike'] = np.float32(0.0)
    if 'is_airport_trip_any' not in df.columns:
        if all(c in df.columns for c in ['dist_jfk','dist_lga','dist_ewr']):
            df['is_airport_trip_any'] = (df[['dist_jfk','dist_lga','dist_ewr']].min(axis=1) < 2.0).astype('int8')
        else:
            df['is_airport_trip_any'] = df.get('is_airport_trip', pd.Series(0, index=df.index)).astype('int8')
    df['is_airport_trip_any_after_hike'] = (df['is_airport_trip_any'] * df['after_hike']).astype('float32')
    return df

train_df = add_policy_and_interactions(train_df)
test_df = add_policy_and_interactions(test_df)

core_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'year','month','day','hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday','days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_df.columns: core_feats.append('rot_manh_km')
policy_feats = ['after_hike','days_since_start_x_after_hike','dist_x_after_hike','is_airport_trip_any_after_hike']

train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True)
dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

def build_fold_mats(df, tr_idx, va_idx):
    tr = df.iloc[tr_idx].copy(); va = df.iloc[va_idx].copy()
    pu_cnt = tr['pu_bin'].astype('object').value_counts()
    do_cnt = tr['do_bin'].astype('object').value_counts()
    pair_cnt = tr['pair_bin'].astype('object').value_counts()
    for d in (tr, va):
        d['log_pu_cnt'] = np.log1p(d['pu_bin'].astype('object').map(pu_cnt).fillna(0).astype('int32')).astype('float32')
        d['log_do_cnt'] = np.log1p(d['do_bin'].astype('object').map(do_cnt).fillna(0).astype('int32')).astype('float32')
        d['log_pair_cnt'] = np.log1p(d['pair_bin'].astype('object').map(pair_cnt).fillna(0).astype('int32')).astype('float32')
    use_cols = core_feats + policy_feats + ['log_pu_cnt','log_do_cnt','log_pair_cnt']
    X_tr = tr[use_cols].astype('float32').values; X_va = va[use_cols].astype('float32').values
    return X_tr, X_va, use_cols

SEED = 2025
lgb_params = dict(
    objective='regression',
    metric='rmse',
    learning_rate=0.08,
    n_estimators=1200,
    num_leaves=32,
    max_depth=-1,
    min_data_in_leaf=2000,
    feature_fraction=0.6,
    bagging_fraction=0.7,
    bagging_freq=1,
    max_bin=31,
    reg_alpha=0.0,
    reg_lambda=40.0,
    random_state=SEED,
    n_jobs=-1,
    verbose=-1
)

cv_rmses, best_iters = [], []
final_use_cols = None
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    print(f'Starting fold {i}/{K}...')
    X_tr, X_va, use_cols = build_fold_mats(train_ord, tr_idx, va_idx)
    y_tr, y_va = y_all_log[tr_idx], y_all_log[va_idx]
    model = lgb.LGBMRegressor(**lgb_params)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)],
        eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False), lgb.log_evaluation(period=10)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    y_true_val = np.expm1(y_va)
    rmse = float(root_mean_squared_error(y_true_val, y_pred_val))
    cv_rmses.append(rmse)
    best_iters.append(int(model.best_iteration_ or model.n_estimators))
    final_use_cols = use_cols
    print(f'[Stability Iter FAST] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})')

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_final = int(np.clip(int(np.median(best_iters)), 100, lgb_params['n_estimators']))
print({'cv_rmse_mean': round(cv_mean, 6), 'cv_rmse_std': round(cv_std, 6), 'best_iter_final': best_iter_final, 'note': 'piecewise time; airport-hike; per-fold freq; no monotone (speed)'})

# Train final model and predict test
pu_cnt_full = train_ord['pu_bin'].astype('object').value_counts()
do_cnt_full = train_ord['do_bin'].astype('object').value_counts()
pair_cnt_full = train_ord['pair_bin'].astype('object').value_counts()
train_full = train_ord.copy(); test_full = test_df.copy()
for d in (train_full, test_full):
    d['log_pu_cnt'] = np.log1p(d['pu_bin'].astype('object').map(pu_cnt_full).fillna(0).astype('int32')).astype('float32')
    d['log_do_cnt'] = np.log1p(d['do_bin'].astype('object').map(do_cnt_full).fillna(0).astype('int32')).astype('float32')
    d['log_pair_cnt'] = np.log1p(d['pair_bin'].astype('object').map(pair_cnt_full).fillna(0).astype('int32')).astype('float32')
use_cols = final_use_cols
X_full = train_full[use_cols].astype('float32').values; y_full = y_all_log
X_test = test_full[use_cols].astype('float32').values
final_model = lgb.LGBMRegressor(**{**lgb_params, 'n_estimators': best_iter_final})
final_model.fit(X_full, y_full)
test_pred = np.expm1(final_model.predict(X_test)).astype('float32')
test_pred = np.clip(test_pred, 0, 500)
pred_df = pd.DataFrame({'key': test_df['key'].astype('string'), 'fare_amount': test_pred})
sub = sample[['key']].merge(pred_df, on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv with shape:', sub.shape)

Starting fold 1/5...


[10]	valid_0's rmse: 0.312364
[20]	valid_0's rmse: 0.258756
[30]	valid_0's rmse: 0.246197


[40]	valid_0's rmse: 0.242903
[50]	valid_0's rmse: 0.24146
[60]	valid_0's rmse: 0.240405


[70]	valid_0's rmse: 0.239623
[80]	valid_0's rmse: 0.238997
[90]	valid_0's rmse: 0.238343


[100]	valid_0's rmse: 0.23779
[110]	valid_0's rmse: 0.237198
[120]	valid_0's rmse: 0.236821


[130]	valid_0's rmse: 0.236571
[140]	valid_0's rmse: 0.236225
[150]	valid_0's rmse: 0.235945


[160]	valid_0's rmse: 0.235696
[170]	valid_0's rmse: 0.235428
[180]	valid_0's rmse: 0.235196


[190]	valid_0's rmse: 0.235026
[200]	valid_0's rmse: 0.23484
[210]	valid_0's rmse: 0.234743


[220]	valid_0's rmse: 0.234586
[230]	valid_0's rmse: 0.234457
[240]	valid_0's rmse: 0.234374


[250]	valid_0's rmse: 0.234237
[260]	valid_0's rmse: 0.234164
[270]	valid_0's rmse: 0.234084


[280]	valid_0's rmse: 0.23403
[290]	valid_0's rmse: 0.23395
[300]	valid_0's rmse: 0.233761


[310]	valid_0's rmse: 0.23366
[320]	valid_0's rmse: 0.233615
[330]	valid_0's rmse: 0.233558


[340]	valid_0's rmse: 0.233467
[350]	valid_0's rmse: 0.233431
[360]	valid_0's rmse: 0.233395


[370]	valid_0's rmse: 0.233331
[380]	valid_0's rmse: 0.233249
[390]	valid_0's rmse: 0.233203


[400]	valid_0's rmse: 0.233126
[410]	valid_0's rmse: 0.233068
[420]	valid_0's rmse: 0.233012


[430]	valid_0's rmse: 0.232961
[440]	valid_0's rmse: 0.232923
[450]	valid_0's rmse: 0.232881


[460]	valid_0's rmse: 0.232828
[470]	valid_0's rmse: 0.232773
[480]	valid_0's rmse: 0.232731


[490]	valid_0's rmse: 0.232684
[500]	valid_0's rmse: 0.232631
[510]	valid_0's rmse: 0.232625


[520]	valid_0's rmse: 0.232584
[530]	valid_0's rmse: 0.232549
[540]	valid_0's rmse: 0.232518


[550]	valid_0's rmse: 0.232495
[560]	valid_0's rmse: 0.232444
[570]	valid_0's rmse: 0.232425


[580]	valid_0's rmse: 0.232392
[590]	valid_0's rmse: 0.232422
[600]	valid_0's rmse: 0.232397


[610]	valid_0's rmse: 0.232351
[620]	valid_0's rmse: 0.232305
[630]	valid_0's rmse: 0.232273


[640]	valid_0's rmse: 0.232234
[650]	valid_0's rmse: 0.232217
[660]	valid_0's rmse: 0.232202


[670]	valid_0's rmse: 0.232174
[680]	valid_0's rmse: 0.232146
[690]	valid_0's rmse: 0.232134


[700]	valid_0's rmse: 0.232111
[710]	valid_0's rmse: 0.232095
[720]	valid_0's rmse: 0.232082


[730]	valid_0's rmse: 0.232058
[740]	valid_0's rmse: 0.232022
[750]	valid_0's rmse: 0.232011


[760]	valid_0's rmse: 0.232012
[770]	valid_0's rmse: 0.232038
[780]	valid_0's rmse: 0.232042


[790]	valid_0's rmse: 0.232028


[Stability Iter FAST] Fold 1: RMSE=3.73128, best_iter=745 (train_n=319908, val_n=320756)
Starting fold 2/5...


[10]	valid_0's rmse: 0.298428
[20]	valid_0's rmse: 0.228591


[30]	valid_0's rmse: 0.206987
[40]	valid_0's rmse: 0.198029


[50]	valid_0's rmse: 0.194519
[60]	valid_0's rmse: 0.191986


[70]	valid_0's rmse: 0.190204
[80]	valid_0's rmse: 0.188663


[90]	valid_0's rmse: 0.187907
[100]	valid_0's rmse: 0.18719


[110]	valid_0's rmse: 0.186246
[120]	valid_0's rmse: 0.185978


[130]	valid_0's rmse: 0.185428
[140]	valid_0's rmse: 0.185316


[150]	valid_0's rmse: 0.184929
[160]	valid_0's rmse: 0.184748


[170]	valid_0's rmse: 0.184471
[180]	valid_0's rmse: 0.184161


[190]	valid_0's rmse: 0.183795
[200]	valid_0's rmse: 0.183441


[210]	valid_0's rmse: 0.183271
[220]	valid_0's rmse: 0.183134


[230]	valid_0's rmse: 0.182948
[240]	valid_0's rmse: 0.182882


[250]	valid_0's rmse: 0.182689
[260]	valid_0's rmse: 0.182446


[270]	valid_0's rmse: 0.182312
[280]	valid_0's rmse: 0.182517
[290]	valid_0's rmse: 0.18239


[300]	valid_0's rmse: 0.182395
[310]	valid_0's rmse: 0.182287


[320]	valid_0's rmse: 0.182168
[330]	valid_0's rmse: 0.182165


[340]	valid_0's rmse: 0.182244
[350]	valid_0's rmse: 0.182163
[360]	valid_0's rmse: 0.182121


[370]	valid_0's rmse: 0.182014
[380]	valid_0's rmse: 0.182036


[390]	valid_0's rmse: 0.181995
[400]	valid_0's rmse: 0.181923


[410]	valid_0's rmse: 0.181925
[420]	valid_0's rmse: 0.181843


[430]	valid_0's rmse: 0.181848
[440]	valid_0's rmse: 0.18175


[450]	valid_0's rmse: 0.181625
[460]	valid_0's rmse: 0.18153


[470]	valid_0's rmse: 0.181513
[480]	valid_0's rmse: 0.181465


[490]	valid_0's rmse: 0.181565
[500]	valid_0's rmse: 0.181498
[510]	valid_0's rmse: 0.181468


[520]	valid_0's rmse: 0.181524


[Stability Iter FAST] Fold 2: RMSE=3.05041, best_iter=478 (train_n=640523, val_n=320756)
Starting fold 3/5...


[10]	valid_0's rmse: 0.333515
[20]	valid_0's rmse: 0.264044


[30]	valid_0's rmse: 0.242557
[40]	valid_0's rmse: 0.234073


[50]	valid_0's rmse: 0.229793
[60]	valid_0's rmse: 0.226995


[70]	valid_0's rmse: 0.225117
[80]	valid_0's rmse: 0.223741


[90]	valid_0's rmse: 0.222562
[100]	valid_0's rmse: 0.221396


[110]	valid_0's rmse: 0.220479
[120]	valid_0's rmse: 0.219732


[130]	valid_0's rmse: 0.219057
[140]	valid_0's rmse: 0.218316


[150]	valid_0's rmse: 0.217804
[160]	valid_0's rmse: 0.217308


[170]	valid_0's rmse: 0.216892
[180]	valid_0's rmse: 0.216511


[190]	valid_0's rmse: 0.216188
[200]	valid_0's rmse: 0.215949


[210]	valid_0's rmse: 0.215653
[220]	valid_0's rmse: 0.215399


[230]	valid_0's rmse: 0.215243
[240]	valid_0's rmse: 0.215046


[250]	valid_0's rmse: 0.214791
[260]	valid_0's rmse: 0.214617


[270]	valid_0's rmse: 0.214528
[280]	valid_0's rmse: 0.214336


[290]	valid_0's rmse: 0.214205
[300]	valid_0's rmse: 0.214034


[310]	valid_0's rmse: 0.21394
[320]	valid_0's rmse: 0.21386


[330]	valid_0's rmse: 0.213708
[340]	valid_0's rmse: 0.213614


[350]	valid_0's rmse: 0.213546
[360]	valid_0's rmse: 0.213434


[370]	valid_0's rmse: 0.213286
[380]	valid_0's rmse: 0.213146


[390]	valid_0's rmse: 0.213068
[400]	valid_0's rmse: 0.212957


[410]	valid_0's rmse: 0.212918
[420]	valid_0's rmse: 0.212846


[430]	valid_0's rmse: 0.212774
[440]	valid_0's rmse: 0.212708


[450]	valid_0's rmse: 0.212689
[460]	valid_0's rmse: 0.212589


[470]	valid_0's rmse: 0.212547
[480]	valid_0's rmse: 0.212434


[490]	valid_0's rmse: 0.212343
[500]	valid_0's rmse: 0.212288


[510]	valid_0's rmse: 0.212257
[520]	valid_0's rmse: 0.212193


[530]	valid_0's rmse: 0.212148
[540]	valid_0's rmse: 0.212038


[550]	valid_0's rmse: 0.211889
[560]	valid_0's rmse: 0.211767


[570]	valid_0's rmse: 0.211654
[580]	valid_0's rmse: 0.211548


[590]	valid_0's rmse: 0.211497
[600]	valid_0's rmse: 0.211378


[610]	valid_0's rmse: 0.211366
[620]	valid_0's rmse: 0.211328


[630]	valid_0's rmse: 0.211335
[640]	valid_0's rmse: 0.211283


[650]	valid_0's rmse: 0.21128
[660]	valid_0's rmse: 0.211208


[670]	valid_0's rmse: 0.211169
[680]	valid_0's rmse: 0.211082


[690]	valid_0's rmse: 0.211031
[700]	valid_0's rmse: 0.210991


[710]	valid_0's rmse: 0.210979
[720]	valid_0's rmse: 0.210933


[730]	valid_0's rmse: 0.210921
[740]	valid_0's rmse: 0.210884


[750]	valid_0's rmse: 0.210842
[760]	valid_0's rmse: 0.210825


[770]	valid_0's rmse: 0.210705
[780]	valid_0's rmse: 0.210698


[790]	valid_0's rmse: 0.210679
[800]	valid_0's rmse: 0.210663


[810]	valid_0's rmse: 0.210614
[820]	valid_0's rmse: 0.210588


[830]	valid_0's rmse: 0.210601
[840]	valid_0's rmse: 0.210576


[850]	valid_0's rmse: 0.210566
[860]	valid_0's rmse: 0.210554


[870]	valid_0's rmse: 0.210535
[880]	valid_0's rmse: 0.210483


[890]	valid_0's rmse: 0.210486
[900]	valid_0's rmse: 0.210415


[910]	valid_0's rmse: 0.21029
[920]	valid_0's rmse: 0.210204


[930]	valid_0's rmse: 0.210201
[940]	valid_0's rmse: 0.2102


[950]	valid_0's rmse: 0.210131
[960]	valid_0's rmse: 0.210082


[970]	valid_0's rmse: 0.210032
[980]	valid_0's rmse: 0.209993


[990]	valid_0's rmse: 0.209957
[1000]	valid_0's rmse: 0.209961


[1010]	valid_0's rmse: 0.209958
[1020]	valid_0's rmse: 0.209895


[1030]	valid_0's rmse: 0.209852
[1040]	valid_0's rmse: 0.209849


[1050]	valid_0's rmse: 0.209849
[1060]	valid_0's rmse: 0.209808


[1070]	valid_0's rmse: 0.209786
[1080]	valid_0's rmse: 0.209803


[1090]	valid_0's rmse: 0.209771
[1100]	valid_0's rmse: 0.20971


[1110]	valid_0's rmse: 0.209704
[1120]	valid_0's rmse: 0.209702


[1130]	valid_0's rmse: 0.209705
[1140]	valid_0's rmse: 0.209697


[1150]	valid_0's rmse: 0.209671
[1160]	valid_0's rmse: 0.209629


[1170]	valid_0's rmse: 0.209611
[1180]	valid_0's rmse: 0.209601


[1190]	valid_0's rmse: 0.209613
[1200]	valid_0's rmse: 0.209593


[Stability Iter FAST] Fold 3: RMSE=3.70670, best_iter=1183 (train_n=961323, val_n=320756)
Starting fold 4/5...


[10]	valid_0's rmse: 0.343724


[20]	valid_0's rmse: 0.264035


[30]	valid_0's rmse: 0.239414


[40]	valid_0's rmse: 0.229089


[50]	valid_0's rmse: 0.223809


[60]	valid_0's rmse: 0.220265


[70]	valid_0's rmse: 0.218219


[80]	valid_0's rmse: 0.216414


[90]	valid_0's rmse: 0.214508


[100]	valid_0's rmse: 0.213321


[110]	valid_0's rmse: 0.212097


[120]	valid_0's rmse: 0.211315


[130]	valid_0's rmse: 0.210456


[140]	valid_0's rmse: 0.209704


[150]	valid_0's rmse: 0.208992


[160]	valid_0's rmse: 0.208296
[170]	valid_0's rmse: 0.207708


[180]	valid_0's rmse: 0.207256
[190]	valid_0's rmse: 0.206875


[200]	valid_0's rmse: 0.20645


[210]	valid_0's rmse: 0.206041
[220]	valid_0's rmse: 0.205755


[230]	valid_0's rmse: 0.20541
[240]	valid_0's rmse: 0.20518


[250]	valid_0's rmse: 0.204923
[260]	valid_0's rmse: 0.204702


[270]	valid_0's rmse: 0.204475
[280]	valid_0's rmse: 0.204332


[290]	valid_0's rmse: 0.204178
[300]	valid_0's rmse: 0.204038


[310]	valid_0's rmse: 0.203831
[320]	valid_0's rmse: 0.203656


[330]	valid_0's rmse: 0.203499
[340]	valid_0's rmse: 0.203331


[350]	valid_0's rmse: 0.203164
[360]	valid_0's rmse: 0.203017


[370]	valid_0's rmse: 0.202881
[380]	valid_0's rmse: 0.202765


[390]	valid_0's rmse: 0.202661
[400]	valid_0's rmse: 0.202532


[410]	valid_0's rmse: 0.20241
[420]	valid_0's rmse: 0.202301


[430]	valid_0's rmse: 0.20217
[440]	valid_0's rmse: 0.202054


[450]	valid_0's rmse: 0.201908
[460]	valid_0's rmse: 0.201816


[470]	valid_0's rmse: 0.201723
[480]	valid_0's rmse: 0.20158


[490]	valid_0's rmse: 0.20151
[500]	valid_0's rmse: 0.201394


[510]	valid_0's rmse: 0.201319
[520]	valid_0's rmse: 0.201265


[530]	valid_0's rmse: 0.201186
[540]	valid_0's rmse: 0.201102


[550]	valid_0's rmse: 0.201043
[560]	valid_0's rmse: 0.200969


[570]	valid_0's rmse: 0.200891
[580]	valid_0's rmse: 0.200828


[590]	valid_0's rmse: 0.200762
[600]	valid_0's rmse: 0.200705


[610]	valid_0's rmse: 0.200614
[620]	valid_0's rmse: 0.200539


[630]	valid_0's rmse: 0.200475
[640]	valid_0's rmse: 0.2004


[650]	valid_0's rmse: 0.200339
[660]	valid_0's rmse: 0.200259


[670]	valid_0's rmse: 0.200164
[680]	valid_0's rmse: 0.200106


[690]	valid_0's rmse: 0.20004
[700]	valid_0's rmse: 0.199981


[710]	valid_0's rmse: 0.199948
[720]	valid_0's rmse: 0.199903


[730]	valid_0's rmse: 0.199811
[740]	valid_0's rmse: 0.199783


[750]	valid_0's rmse: 0.199732
[760]	valid_0's rmse: 0.199688


[770]	valid_0's rmse: 0.199648
[780]	valid_0's rmse: 0.199628


[790]	valid_0's rmse: 0.199557


[800]	valid_0's rmse: 0.199511
[810]	valid_0's rmse: 0.199459


[820]	valid_0's rmse: 0.199424
[830]	valid_0's rmse: 0.199389


[840]	valid_0's rmse: 0.199349
[850]	valid_0's rmse: 0.199311


[860]	valid_0's rmse: 0.199267
[870]	valid_0's rmse: 0.199242


[880]	valid_0's rmse: 0.199194
[890]	valid_0's rmse: 0.19916


[900]	valid_0's rmse: 0.199119
[910]	valid_0's rmse: 0.199086


[920]	valid_0's rmse: 0.199057
[930]	valid_0's rmse: 0.199006


[940]	valid_0's rmse: 0.19898
[950]	valid_0's rmse: 0.198955


[960]	valid_0's rmse: 0.198926
[970]	valid_0's rmse: 0.198847


[980]	valid_0's rmse: 0.198823
[990]	valid_0's rmse: 0.198793


[1000]	valid_0's rmse: 0.198761
[1010]	valid_0's rmse: 0.198734


[1020]	valid_0's rmse: 0.198689
[1030]	valid_0's rmse: 0.198654


[1040]	valid_0's rmse: 0.198615
[1050]	valid_0's rmse: 0.19858


[1060]	valid_0's rmse: 0.198567
[1070]	valid_0's rmse: 0.198536


[1080]	valid_0's rmse: 0.198509
[1090]	valid_0's rmse: 0.198489


[1100]	valid_0's rmse: 0.198455
[1110]	valid_0's rmse: 0.198438


[1120]	valid_0's rmse: 0.198422
[1130]	valid_0's rmse: 0.198406


[1140]	valid_0's rmse: 0.19838
[1150]	valid_0's rmse: 0.198339


[1160]	valid_0's rmse: 0.198302
[1170]	valid_0's rmse: 0.198262


[1180]	valid_0's rmse: 0.198242
[1190]	valid_0's rmse: 0.198218


[1200]	valid_0's rmse: 0.198209


[Stability Iter FAST] Fold 4: RMSE=3.61654, best_iter=1200 (train_n=1282234, val_n=320756)
Starting fold 5/5...


[10]	valid_0's rmse: 0.340881


[20]	valid_0's rmse: 0.256952


[30]	valid_0's rmse: 0.228799


[40]	valid_0's rmse: 0.217647


[50]	valid_0's rmse: 0.211495


[60]	valid_0's rmse: 0.207357


[70]	valid_0's rmse: 0.204553


[80]	valid_0's rmse: 0.202308


[90]	valid_0's rmse: 0.200545


[100]	valid_0's rmse: 0.199049


[110]	valid_0's rmse: 0.19766


[120]	valid_0's rmse: 0.19636


[130]	valid_0's rmse: 0.195449


[140]	valid_0's rmse: 0.194455


[150]	valid_0's rmse: 0.193743


[160]	valid_0's rmse: 0.193062


[170]	valid_0's rmse: 0.192528


[180]	valid_0's rmse: 0.192006


[190]	valid_0's rmse: 0.191518


[200]	valid_0's rmse: 0.19093


[210]	valid_0's rmse: 0.19045


[220]	valid_0's rmse: 0.190091


[230]	valid_0's rmse: 0.189829


[240]	valid_0's rmse: 0.189504


[250]	valid_0's rmse: 0.189207


[260]	valid_0's rmse: 0.188967


[270]	valid_0's rmse: 0.18869
[280]	valid_0's rmse: 0.188483


[290]	valid_0's rmse: 0.18828


[300]	valid_0's rmse: 0.18807
[310]	valid_0's rmse: 0.187855


[320]	valid_0's rmse: 0.187596


[330]	valid_0's rmse: 0.187428


[340]	valid_0's rmse: 0.187279


[350]	valid_0's rmse: 0.187119


[360]	valid_0's rmse: 0.186942


[370]	valid_0's rmse: 0.186742


[380]	valid_0's rmse: 0.186549
[390]	valid_0's rmse: 0.186395


[400]	valid_0's rmse: 0.186321
[410]	valid_0's rmse: 0.186224


[420]	valid_0's rmse: 0.186083


[430]	valid_0's rmse: 0.185917


[440]	valid_0's rmse: 0.185785
[450]	valid_0's rmse: 0.185654


[460]	valid_0's rmse: 0.185555


[470]	valid_0's rmse: 0.185426
[480]	valid_0's rmse: 0.185297


[490]	valid_0's rmse: 0.185199


[500]	valid_0's rmse: 0.185089


[510]	valid_0's rmse: 0.184972
[520]	valid_0's rmse: 0.184852


[530]	valid_0's rmse: 0.184784


[540]	valid_0's rmse: 0.18468
[550]	valid_0's rmse: 0.184599


[560]	valid_0's rmse: 0.184529
[570]	valid_0's rmse: 0.184454


[580]	valid_0's rmse: 0.184386
[590]	valid_0's rmse: 0.184327


[600]	valid_0's rmse: 0.18427
[610]	valid_0's rmse: 0.184171


[620]	valid_0's rmse: 0.184087
[630]	valid_0's rmse: 0.183973


[640]	valid_0's rmse: 0.183887
[650]	valid_0's rmse: 0.183807


[660]	valid_0's rmse: 0.183749
[670]	valid_0's rmse: 0.183693


[680]	valid_0's rmse: 0.183649
[690]	valid_0's rmse: 0.183591


[700]	valid_0's rmse: 0.183506
[710]	valid_0's rmse: 0.183427


[720]	valid_0's rmse: 0.183367
[730]	valid_0's rmse: 0.183329


[740]	valid_0's rmse: 0.183277
[750]	valid_0's rmse: 0.183231


[760]	valid_0's rmse: 0.183158
[770]	valid_0's rmse: 0.183114


[780]	valid_0's rmse: 0.18305
[790]	valid_0's rmse: 0.182994


[800]	valid_0's rmse: 0.182913
[810]	valid_0's rmse: 0.182862


[820]	valid_0's rmse: 0.182818


[830]	valid_0's rmse: 0.182772


[840]	valid_0's rmse: 0.182734
[850]	valid_0's rmse: 0.18269


[860]	valid_0's rmse: 0.182631
[870]	valid_0's rmse: 0.182578


[880]	valid_0's rmse: 0.18253
[890]	valid_0's rmse: 0.182502


[900]	valid_0's rmse: 0.182452
[910]	valid_0's rmse: 0.182409


[920]	valid_0's rmse: 0.182382


[930]	valid_0's rmse: 0.182347
[940]	valid_0's rmse: 0.182294


[950]	valid_0's rmse: 0.182258
[960]	valid_0's rmse: 0.18222


[970]	valid_0's rmse: 0.182176
[980]	valid_0's rmse: 0.182115


[990]	valid_0's rmse: 0.182078


[1000]	valid_0's rmse: 0.182027
[1010]	valid_0's rmse: 0.181992


[1020]	valid_0's rmse: 0.181959
[1030]	valid_0's rmse: 0.181912


[1040]	valid_0's rmse: 0.181868
[1050]	valid_0's rmse: 0.181832


[1060]	valid_0's rmse: 0.181817


[1070]	valid_0's rmse: 0.181757
[1080]	valid_0's rmse: 0.181704


[1090]	valid_0's rmse: 0.181676
[1100]	valid_0's rmse: 0.181626


[1110]	valid_0's rmse: 0.181597
[1120]	valid_0's rmse: 0.181586


[1130]	valid_0's rmse: 0.181567


[1140]	valid_0's rmse: 0.181529
[1150]	valid_0's rmse: 0.181508


[1160]	valid_0's rmse: 0.181482
[1170]	valid_0's rmse: 0.181459


[1180]	valid_0's rmse: 0.1814
[1190]	valid_0's rmse: 0.181377


[1200]	valid_0's rmse: 0.181343


[Stability Iter FAST] Fold 5: RMSE=3.35906, best_iter=1200 (train_n=1602868, val_n=320755)
{'cv_rmse_mean': 3.492799, 'cv_rmse_std': 0.2575, 'best_iter_final': 1183, 'note': 'piecewise time; airport-hike; per-fold freq; no monotone (speed)'}


Saved submission.csv with shape: (9914, 2)
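
Note on the two scales in the log above: the per-iteration `valid_0's rmse` values (roughly 0.18 to 0.21) are measured on the log1p(fare) training target, while the `Fold k: RMSE=...` lines are on the original dollar scale after the transform is inverted. A minimal sketch of that scoring step follows; the fares and predictions are made-up illustrative values, not numbers from this run, and the `rmse` helper is a stand-in for the metric call.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative fares in dollars (assumed values, not from the dataset).
fare_true = np.array([4.5, 8.1, 12.5, 52.0, 6.9])
fare_pred = np.array([5.0, 7.5, 14.0, 45.0, 7.2])

# What the booster logs during training: error on the log1p target.
rmse_log = rmse(np.log1p(fare_true), np.log1p(fare_pred))

# What the fold summary reports: invert with expm1, then score in dollars.
pred_log = np.log1p(fare_pred)
rmse_dollars = rmse(fare_true, np.expm1(pred_log))

print(f"log-space RMSE:      {rmse_log:.4f}")
print(f"original-scale RMSE: {rmse_dollars:.4f}")
```

In this toy case the single long trip (52.0 vs 45.0) dominates the dollar-scale error, which is one reason a fold can post a low log-space RMSE and still a high original-scale RMSE.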


In [26]:
# Phase 2 — Two-Stage Residual Model (fixed): robust Stage1 trend + Stage2 residuals with residual TE (pu/do) and counts
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals(), 'train_df/test_df missing'
assert 'sample' in globals() and 'make_time_folds_quantile' in globals(), 'Prereqs missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

def ensure_days_since_start(df, global_start_ns):
    if 'days_since_start' in df.columns: return df
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    ns = dt_local.astype('int64').values
    df['days_since_start'] = ((ns - np.int64(global_start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return df

def ensure_policy_feats(df):
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    df['after_hike'] = (dt_local >= cutoff).astype('int8')
    if 'days_since_start' in df.columns:
        df['days_since_start_x_after_hike'] = (df['days_since_start'] * df['after_hike']).astype('float32')
    else:
        df['days_since_start_x_after_hike'] = np.float32(0.0)
    return df

def ensure_spatial_bins(df, res=0.005):
    if all(c in df.columns for c in ['pu_bin','do_bin','pair_bin']): return df
    df = df.copy()
    pu_lat_bin = np.floor(df['pickup_latitude'] / res).astype('int32')
    pu_lon_bin = np.floor(df['pickup_longitude'] / res).astype('int32')
    do_lat_bin = np.floor(df['dropoff_latitude'] / res).astype('int32')
    do_lon_bin = np.floor(df['dropoff_longitude'] / res).astype('int32')
    df['pu_bin'] = (pu_lat_bin.astype(str) + '_' + pu_lon_bin.astype(str))
    df['do_bin'] = (do_lat_bin.astype(str) + '_' + do_lon_bin.astype(str))
    df['pair_bin'] = (df['pu_bin'] + '|' + df['do_bin'])
    return df

dt_local_all = train_df['pickup_datetime'].dt.tz_convert('America/New_York')
global_start_ns = int(dt_local_all.min().value)
train_df = ensure_days_since_start(train_df, global_start_ns)
test_df = ensure_days_since_start(test_df, global_start_ns)
train_df = ensure_policy_feats(train_df)
test_df = ensure_policy_feats(test_df)
train_df = ensure_spatial_bins(train_df)
test_df = ensure_spatial_bins(test_df)

# Stage 1 (Trend) features — time and policy only
trend_feats = ['days_since_start','after_hike','days_since_start_x_after_hike','year','month','doy_sin','doy_cos']
for c in trend_feats: assert c in train_df.columns, f'Missing trend feature: {c}'

# Stage 2 base features — robust numeric set + policy interactions if present
base_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'year','month','day','hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday','days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_df.columns: base_feats.append('rot_manh_km')
if 'dist_x_after_hike' in train_df.columns: base_feats.append('dist_x_after_hike')

# Prepare ordered data
train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True)
dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

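# Smoothed target encoding (TE) on Stage 1 residuals, fit on training rows only:
#   enc = (n * bin_mean + m * prior) / (n + m), with n = bin count and prior = mean training residual.
# Bins unseen in training fall back to the prior. Log-counts are added inside the fold loop.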
def te_smooth(train_key, train_tgt, apply_key, m=300.0, prior=None):
    if prior is None: prior = float(np.mean(train_tgt))
    g = pd.DataFrame({'k': train_key.astype('object'), 'y': train_tgt}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict()
    enc = apply_key.astype('object').map(mp).fillna(prior).astype('float32').values
    return enc, prior, mp

cv_rmses, best_iters_s2 = [], []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx].copy(); va = train_ord.iloc[va_idx].copy()
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    # Stage 1: small, monotone-constrained trend model
    # (constraint +1 = prediction non-decreasing in that feature; 0 = unconstrained)
    X1_tr = tr[trend_feats].astype('float32').values
    X1_va = va[trend_feats].astype('float32').values
    mono_trend = [0]*len(trend_feats)
    for feat in ['days_since_start','after_hike','days_since_start_x_after_hike']:
        if feat in trend_feats: mono_trend[trend_feats.index(feat)] = 1
    trend_model = lgb.LGBMRegressor(objective='regression', metric='rmse',
                                   learning_rate=0.06, n_estimators=1200, num_leaves=16,
                                   min_data_in_leaf=1200, feature_fraction=0.8, bagging_fraction=0.8,
                                   bagging_freq=1, max_bin=63, reg_lambda=20.0, random_state=2025, n_jobs=-1, verbose=-1,
                                   monotone_constraints=mono_trend)
    trend_model.fit(X1_tr, y_tr, eval_set=[(X1_va, y_va)], eval_metric='rmse',
                    callbacks=[lgb.early_stopping(stopping_rounds=100, verbose=False), lgb.log_evaluation(period=100)])
    tr_trend = trend_model.predict(X1_tr, num_iteration=trend_model.best_iteration_).astype('float32')
    va_trend = trend_model.predict(X1_va, num_iteration=trend_model.best_iteration_).astype('float32')
    # Residual targets (log space)
    y_tr_res = (y_tr - tr_trend).astype('float32')
    y_va_res = (y_va - va_trend).astype('float32')
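    # Residual encodings fit on train rows only: TE for pu/do (no pair) + log-counts
    # Note: train rows are encoded with bin statistics that include their own residual (not out-of-fold);
    # the m=400 smoothing dampens this but does not remove it.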
    tr['te_pu'], _, _ = te_smooth(tr['pu_bin'], y_tr_res, tr['pu_bin'], m=400.0)
    va['te_pu'], _, _ = te_smooth(tr['pu_bin'], y_tr_res, va['pu_bin'], m=400.0)
    tr['te_do'], _, _ = te_smooth(tr['do_bin'], y_tr_res, tr['do_bin'], m=400.0)
    va['te_do'], _, _ = te_smooth(tr['do_bin'], y_tr_res, va['do_bin'], m=400.0)
    pu_cnt = tr['pu_bin'].astype('object').value_counts().astype('int32')
    do_cnt = tr['do_bin'].astype('object').value_counts().astype('int32')
    pair_cnt = tr['pair_bin'].astype('object').value_counts().astype('int32')
    for d in (tr, va):
        d['log_pu_cnt'] = np.log1p(d['pu_bin'].astype('object').map(pu_cnt).fillna(0).astype('int32')).astype('float32')
        d['log_do_cnt'] = np.log1p(d['do_bin'].astype('object').map(do_cnt).fillna(0).astype('int32')).astype('float32')
        d['log_pair_cnt'] = np.log1p(d['pair_bin'].astype('object').map(pair_cnt).fillna(0).astype('int32')).astype('float32')
    # Stage 2: Residual model on base features + trend_pred + residual TE/log-counts
    tr['trend_pred'] = tr_trend; va['trend_pred'] = va_trend
    use_cols2 = base_feats + ['trend_pred','te_pu','te_do','log_pu_cnt','log_do_cnt','log_pair_cnt']
    X2_tr = tr[use_cols2].astype('float32').values
    X2_va = va[use_cols2].astype('float32').values
    s2 = lgb.LGBMRegressor(objective='huber', metric='rmse',
                           learning_rate=0.05, n_estimators=12000, num_leaves=128,
                           min_data_in_leaf=600, feature_fraction=0.8, bagging_fraction=0.8,
                           bagging_freq=1, max_bin=127, reg_lambda=15.0, random_state=2026, n_jobs=-1, verbose=-1)
    s2.fit(X2_tr, y_tr_res, eval_set=[(X2_va, y_va_res)], eval_metric='rmse',
           callbacks=[lgb.early_stopping(stopping_rounds=300, verbose=False), lgb.log_evaluation(period=150)])
    va_res_pred = s2.predict(X2_va, num_iteration=s2.best_iteration_).astype('float32')
    va_final_log = (va_trend + va_res_pred).astype('float32')
    rmse = float(root_mean_squared_error(np.expm1(y_va), np.expm1(va_final_log)))
    cv_rmses.append(rmse); best_iters_s2.append(int(s2.best_iteration_ or s2.n_estimators))
    print(f'[Two-Stage FIX] Fold {i}: RMSE={rmse:.5f}, s2_best_iter={best_iters_s2[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})')

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_s2_final = int(np.clip(int(np.median(best_iters_s2)), 600, 12000))
print({'cv_rmse_mean': round(cv_mean, 6), 'cv_rmse_std': round(cv_std, 6), 's2_best_iter_final': best_iter_s2_final, 'note': 'Two-stage residual model (fixed) with residual TE pu/do + counts'})

# Final fit on full data
full = train_ord.copy(); tst = test_df.copy()
X1_full = full[trend_feats].astype('float32').values
mono_trend_full = [0]*len(trend_feats)
for feat in ['days_since_start','after_hike','days_since_start_x_after_hike']:
    if feat in trend_feats: mono_trend_full[trend_feats.index(feat)] = 1
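# Note: Stage 1 best iterations are not tracked in the CV loop, so the final Stage 1 tree count
# is a heuristic derived from the Stage 2 median best iteration (//10, floor of 400).
trend_model_full = lgb.LGBMRegressor(objective='regression', metric='rmse',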
                                     learning_rate=0.06, n_estimators=max(400, best_iter_s2_final//10), num_leaves=16,
                                     min_data_in_leaf=1200, feature_fraction=0.8, bagging_fraction=0.8,
                                     bagging_freq=1, max_bin=63, reg_lambda=20.0, random_state=3030, n_jobs=-1, verbose=-1,
                                     monotone_constraints=mono_trend_full)
trend_model_full.fit(X1_full, y_all_log)
full_trend = trend_model_full.predict(X1_full).astype('float32')
X1_test = tst[trend_feats].astype('float32').values
test_trend = trend_model_full.predict(X1_test).astype('float32')
y_full_res = (y_all_log - full_trend).astype('float32')

def te_apply_full(tr_key, tr_tgt, ap_key, m=400.0):
    prior = float(np.mean(tr_tgt))
    g = pd.DataFrame({'k': tr_key.astype('object'), 'y': tr_tgt}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict()
    tr_enc = tr_key.astype('object').map(mp).fillna(prior).astype('float32').values
    ap_enc = ap_key.astype('object').map(mp).fillna(prior).astype('float32').values
    return tr_enc, ap_enc

full['te_pu'], tst['te_pu'] = te_apply_full(full['pu_bin'], y_full_res, tst['pu_bin'])
full['te_do'], tst['te_do'] = te_apply_full(full['do_bin'], y_full_res, tst['do_bin'])
pu_cnt_full = full['pu_bin'].astype('object').value_counts().astype('int32')
do_cnt_full = full['do_bin'].astype('object').value_counts().astype('int32')
pair_cnt_full = full['pair_bin'].astype('object').value_counts().astype('int32')
full['log_pu_cnt'] = np.log1p(full['pu_bin'].astype('object').map(pu_cnt_full).fillna(0).astype('int32')).astype('float32')
tst['log_pu_cnt'] = np.log1p(tst['pu_bin'].astype('object').map(pu_cnt_full).fillna(0).astype('int32')).astype('float32')
full['log_do_cnt'] = np.log1p(full['do_bin'].astype('object').map(do_cnt_full).fillna(0).astype('int32')).astype('float32')
tst['log_do_cnt'] = np.log1p(tst['do_bin'].astype('object').map(do_cnt_full).fillna(0).astype('int32')).astype('float32')
full['log_pair_cnt'] = np.log1p(full['pair_bin'].astype('object').map(pair_cnt_full).fillna(0).astype('int32')).astype('float32')
tst['log_pair_cnt'] = np.log1p(tst['pair_bin'].astype('object').map(pair_cnt_full).fillna(0).astype('int32')).astype('float32')

full['trend_pred'] = full_trend; tst['trend_pred'] = test_trend
final_features = base_feats + ['trend_pred','te_pu','te_do','log_pu_cnt','log_do_cnt','log_pair_cnt']
X2_full = full[final_features].astype('float32').values
X2_test = tst[final_features].astype('float32').values
s2_full = lgb.LGBMRegressor(objective='huber', metric='rmse',
                            learning_rate=0.05, n_estimators=best_iter_s2_final, num_leaves=128,
                            min_data_in_leaf=600, feature_fraction=0.8, bagging_fraction=0.8,
                            bagging_freq=1, max_bin=127, reg_lambda=15.0, random_state=4040, n_jobs=-1, verbose=-1)
s2_full.fit(X2_full, y_full_res)
test_res = s2_full.predict(X2_test).astype('float32')
test_pred = np.expm1(test_trend + test_res).astype('float32')
test_pred = np.clip(test_pred, 0, 500)
pred_df = pd.DataFrame({'key': test_df['key'].astype('string'), 'fare_amount': test_pred})
sub = sample[['key']].merge(pred_df, on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv with shape:', sub.shape)

[100]	valid_0's rmse: 0.50873


[150]	valid_0's rmse: 0.23367


[300]	valid_0's rmse: 0.231544


[450]	valid_0's rmse: 0.230829


[600]	valid_0's rmse: 0.230664


[750]	valid_0's rmse: 0.230628


[900]	valid_0's rmse: 0.230591


[1050]	valid_0's rmse: 0.230703


[Two-Stage FIX] Fold 1: RMSE=3.71307, s2_best_iter=897 (train_n=319908, val_n=320756)


[100]	valid_0's rmse: 0.516779


[200]	valid_0's rmse: 0.516822


[150]	valid_0's rmse: 0.186695


[300]	valid_0's rmse: 0.186515


[450]	valid_0's rmse: 0.186563


[Two-Stage FIX] Fold 2: RMSE=3.23728, s2_best_iter=225 (train_n=640523, val_n=320756)


[100]	valid_0's rmse: 0.5459


[200]	valid_0's rmse: 0.545706


[300]	valid_0's rmse: 0.545432


[400]	valid_0's rmse: 0.54522


[500]	valid_0's rmse: 0.545204


[600]	valid_0's rmse: 0.544816


[700]	valid_0's rmse: 0.54468


[800]	valid_0's rmse: 0.544432


[900]	valid_0's rmse: 0.544363


[1000]	valid_0's rmse: 0.544185


[150]	valid_0's rmse: 0.210083


[300]	valid_0's rmse: 0.207646


[450]	valid_0's rmse: 0.206826


[600]	valid_0's rmse: 0.206357


[750]	valid_0's rmse: 0.206158


[900]	valid_0's rmse: 0.205892


[1050]	valid_0's rmse: 0.205884


[1200]	valid_0's rmse: 0.205842


[1350]	valid_0's rmse: 0.2058


[1500]	valid_0's rmse: 0.205767


[1650]	valid_0's rmse: 0.205734


[1800]	valid_0's rmse: 0.20567


[1950]	valid_0's rmse: 0.20571


[Two-Stage FIX] Fold 3: RMSE=3.60131, s2_best_iter=1735 (train_n=961323, val_n=320756)


[100]	valid_0's rmse: 0.558158


[150]	valid_0's rmse: 0.203743


[300]	valid_0's rmse: 0.19922


[450]	valid_0's rmse: 0.197791


[600]	valid_0's rmse: 0.197079


[750]	valid_0's rmse: 0.196584


[900]	valid_0's rmse: 0.196273


[1050]	valid_0's rmse: 0.196059


[1200]	valid_0's rmse: 0.195854


[1350]	valid_0's rmse: 0.195669


[1500]	valid_0's rmse: 0.195455


[1650]	valid_0's rmse: 0.195401


[1800]	valid_0's rmse: 0.195334


[1950]	valid_0's rmse: 0.195408


[Two-Stage FIX] Fold 4: RMSE=3.54319, s2_best_iter=1781 (train_n=1282234, val_n=320756)


[100]	valid_0's rmse: 0.572638


[150]	valid_0's rmse: 0.187028


[300]	valid_0's rmse: 0.181248


[450]	valid_0's rmse: 0.179353


[600]	valid_0's rmse: 0.178216


[750]	valid_0's rmse: 0.177518


[900]	valid_0's rmse: 0.177073


[1050]	valid_0's rmse: 0.176695


[1200]	valid_0's rmse: 0.176366


[1350]	valid_0's rmse: 0.176199


[1500]	valid_0's rmse: 0.176043


[1650]	valid_0's rmse: 0.175911


[1800]	valid_0's rmse: 0.17578


[1950]	valid_0's rmse: 0.175659


[2100]	valid_0's rmse: 0.175579


[2250]	valid_0's rmse: 0.175556


[2400]	valid_0's rmse: 0.17553


[2550]	valid_0's rmse: 0.175482


[2700]	valid_0's rmse: 0.175453


[2850]	valid_0's rmse: 0.175441


[3000]	valid_0's rmse: 0.175416


[3150]	valid_0's rmse: 0.175406


[3300]	valid_0's rmse: 0.175398


[3450]	valid_0's rmse: 0.175403


[3600]	valid_0's rmse: 0.175403


[Two-Stage FIX] Fold 5: RMSE=3.23181, s2_best_iter=3362 (train_n=1602868, val_n=320755)
{'cv_rmse_mean': 3.465332, 'cv_rmse_std': 0.196197, 's2_best_iter_final': 1735, 'note': 'Two-stage residual model (fixed) with residual TE pu/do + counts'}
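
The summary line above can be re-derived from the five `[Two-Stage FIX] Fold k` lines. The sketch below copies those logged values by hand and repeats the aggregation used in the cell (mean, population standard deviation, clipped median of best iterations). The bronze threshold comes from the plan header; the variable names are illustrative.

```python
import numpy as np

# Per-fold values copied from the "[Two-Stage FIX] Fold k" log lines.
fold_rmse = np.array([3.71307, 3.23728, 3.60131, 3.54319, 3.23181])
fold_best_iter = [897, 225, 1735, 1781, 3362]

cv_mean = float(np.mean(fold_rmse))
cv_std = float(np.std(fold_rmse))  # population std (ddof=0), as in the cell
best_iter_final = int(np.clip(int(np.median(fold_best_iter)), 600, 12000))

# Bronze threshold from the plan header.
BRONZE = 2.92371
gap_to_bronze = cv_mean - BRONZE

print(round(cv_mean, 6), round(cv_std, 6), best_iter_final)
print(f"gap to bronze: {gap_to_bronze:.4f}")
```

For context, the previous run reported a CV mean of 3.492799, so the two-stage model improves the mean by about 0.027, which is much smaller than the fold-to-fold standard deviation of about 0.196.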


Saved submission.csv with shape: (9914, 2)
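
The `te_smooth` helper in the cell above shrinks each spatial bin's mean residual toward the global mean residual with a pseudo-count `m`. A small self-contained sketch of the same formula on toy data shows the effect. The bin labels, residual values, and the function name `smoothed_target_encoding` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def smoothed_target_encoding(train_key, train_tgt, apply_key, m=400.0):
    """Shrink each category mean toward the global mean by pseudo-count m."""
    prior = float(np.mean(train_tgt))
    stats = (pd.DataFrame({"k": train_key, "y": train_tgt})
             .groupby("k")["y"].agg(["mean", "count"]))
    smooth = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)
    # Categories unseen in training fall back to the prior.
    enc = apply_key.map(smooth).fillna(prior).to_numpy(dtype="float32")
    return enc, prior

# Toy residuals (assumed values): bin "A" has many rows, bin "B" only a few.
keys = pd.Series(["A"] * 1600 + ["B"] * 4)
resid = np.concatenate([np.full(1600, 0.10), np.full(4, 0.50)])

apply_keys = pd.Series(["A", "B", "C"])  # "C" never appears in training
enc, prior = smoothed_target_encoding(keys, resid, apply_keys, m=400.0)
print("prior:", prior)
print("encodings for A, B, C:", enc)
```

With `m=400`, the four-row bin "B" (raw mean 0.50) is pulled almost all the way back to the prior of about 0.101, while the 1600-row bin "A" stays close to its own mean. A larger `m`, such as the 700 used in the next cell, shrinks sparse bins even harder.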


In [27]:
# Phase 2 — Two-Stage Residual Model (gate fix): stronger Stage1, purified Stage2, geohash residual TE, time-decay weights
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals(), 'train_df/test_df missing'
assert 'sample' in globals() and 'make_time_folds_quantile' in globals(), 'Prereqs missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb
try:
    import pygeohash as pgh
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'pygeohash'])
    import pygeohash as pgh

# Ensure helpers
def ensure_days_since_start(df, global_start_ns):
    if 'days_since_start' in df.columns: return df
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    ns = dt_local.astype('int64').values
    df['days_since_start'] = ((ns - np.int64(global_start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return df

def ensure_policy_feats(df):
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    df['after_hike'] = (dt_local >= cutoff).astype('int8')
    if 'days_since_start' in df.columns:
        df['days_since_start_x_after_hike'] = (df['days_since_start'] * df['after_hike']).astype('float32')
    else:
        df['days_since_start_x_after_hike'] = np.float32(0.0)
    return df

def add_geohash(df, prec=6):
    df = df.copy()
    def enc_row(lat, lon):
        return pgh.encode(float(lat), float(lon), precision=prec) if np.isfinite(lat) and np.isfinite(lon) else ''
    df['pu_gh6'] = [enc_row(lat, lon) for lat, lon in zip(df['pickup_latitude'].astype('float32'), df['pickup_longitude'].astype('float32'))]
    df['do_gh6'] = [enc_row(lat, lon) for lat, lon in zip(df['dropoff_latitude'].astype('float32'), df['dropoff_longitude'].astype('float32'))]
    return df

def te_smooth(train_key, train_tgt, apply_key, m=700.0, prior=None):
    if prior is None: prior = float(np.mean(train_tgt))
    g = pd.DataFrame({'k': train_key.astype('object'), 'y': train_tgt}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict()
    enc = apply_key.astype('object').map(mp).fillna(prior).astype('float32').values
    return enc, prior, mp

# Prepare ordered data and features
dt_local_all = train_df['pickup_datetime'].dt.tz_convert('America/New_York')
global_start_ns = int(dt_local_all.min().value)
train_df = ensure_days_since_start(train_df, global_start_ns)
test_df = ensure_days_since_start(test_df, global_start_ns)
train_df = ensure_policy_feats(train_df)
test_df = ensure_policy_feats(test_df)
train_df = add_geohash(train_df, prec=6)
test_df = add_geohash(test_df, prec=6)

# Stage 1: expanded trend features (time + policy + weekly seasonality + holidays)
trend_feats = ['days_since_start','after_hike','days_since_start_x_after_hike','year','month','dow','is_holiday','doy_sin','doy_cos','dow_sin','dow_cos']
for c in trend_feats: assert c in train_df.columns, f'Missing trend feature: {c}'

# Stage 2: purified residual feature set — remove absolute time identifiers (year, month, days_since_start); keep geometry/POIs/flags and cyclic hour/dow if present
base_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'day','hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_df.columns: base_feats.append('rot_manh_km')
if 'dist_x_after_hike' in train_df.columns: base_feats.append('dist_x_after_hike')

train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True)
dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

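# Time-decay weights for Stage 2 (tau = 120 days): w = exp((days - max_days)/tau)
# The newest row gets w = 1; a row 120 days older gets exp(-1) ≈ 0.37; a row one year older gets exp(-365/120) ≈ 0.05.
# max_days is taken over the full ordered training set, not per fold.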
days_all = train_ord['days_since_start'].astype('float32').values
max_days = float(np.max(days_all))
tau_days = 120.0

cv_rmses, best_iters_s2 = [], []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx].copy(); va = train_ord.iloc[va_idx].copy()
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    # Stage 1: stronger, regularized trend model with monotonicity on time terms; print its val RMSE
    X1_tr = tr[trend_feats].astype('float32').values
    X1_va = va[trend_feats].astype('float32').values
    mono_trend = [0]*len(trend_feats)
    for feat in ['days_since_start','after_hike','days_since_start_x_after_hike']:
        if feat in trend_feats: mono_trend[trend_feats.index(feat)] = 1
    trend_model = lgb.LGBMRegressor(objective='regression', metric='rmse',
                                   learning_rate=0.05, n_estimators=1500, num_leaves=32,
                                   min_data_in_leaf=1200, feature_fraction=0.8, bagging_fraction=0.8,
                                   bagging_freq=1, max_bin=63, reg_lambda=20.0, random_state=2027, n_jobs=-1, verbose=-1,
                                   monotone_constraints=mono_trend)
    trend_model.fit(X1_tr, y_tr, eval_set=[(X1_va, y_va)], eval_metric='rmse',
                    callbacks=[lgb.early_stopping(stopping_rounds=150, verbose=False), lgb.log_evaluation(period=150)])
    va_tr_rmse = float(trend_model.best_score_['valid_0']['rmse']) if hasattr(trend_model, 'best_score_') else float('nan')
    print(f'[Stage1 Trend] Fold {i}: val_rmse_log={va_tr_rmse:.6f}, best_iter={trend_model.best_iteration_}')
    tr_trend = trend_model.predict(X1_tr, num_iteration=trend_model.best_iteration_).astype('float32')
    va_trend = trend_model.predict(X1_va, num_iteration=trend_model.best_iteration_).astype('float32')
    y_tr_res = (y_tr - tr_trend).astype('float32')
    y_va_res = (y_va - va_trend).astype('float32')
    # Stage 2: residual encodings using geohash TE (PU/DO) with strong smoothing; add geohash counts
    tr['te_pu_gh6'], _, _ = te_smooth(tr['pu_gh6'], y_tr_res, tr['pu_gh6'], m=700.0)
    va['te_pu_gh6'], _, _ = te_smooth(tr['pu_gh6'], y_tr_res, va['pu_gh6'], m=700.0)
    tr['te_do_gh6'], _, _ = te_smooth(tr['do_gh6'], y_tr_res, tr['do_gh6'], m=700.0)
    va['te_do_gh6'], _, _ = te_smooth(tr['do_gh6'], y_tr_res, va['do_gh6'], m=700.0)
    pu_cnt = tr['pu_gh6'].astype('object').value_counts().astype('int32')
    do_cnt = tr['do_gh6'].astype('object').value_counts().astype('int32')
    for d in (tr, va):
        d['log_pu_gh_cnt'] = np.log1p(d['pu_gh6'].astype('object').map(pu_cnt).fillna(0).astype('int32')).astype('float32')
        d['log_do_gh_cnt'] = np.log1p(d['do_gh6'].astype('object').map(do_cnt).fillna(0).astype('int32')).astype('float32')
    # Assemble Stage 2 matrices; purify time (no year/month/days_since_start) and include trend_pred
    tr['trend_pred'] = tr_trend; va['trend_pred'] = va_trend
    use_cols2 = base_feats + ['trend_pred','te_pu_gh6','te_do_gh6','log_pu_gh_cnt','log_do_gh_cnt']
    X2_tr = tr[use_cols2].astype('float32').values
    X2_va = va[use_cols2].astype('float32').values
    # Time-decay weights favoring recent history
    w_tr = np.exp((tr['days_since_start'].astype('float32').values - max_days) / tau_days).astype('float32')
    s2 = lgb.LGBMRegressor(objective='huber', metric='rmse',
                           learning_rate=0.05, n_estimators=16000, num_leaves=128,
                           min_data_in_leaf=800, feature_fraction=0.8, bagging_fraction=0.8,
                           bagging_freq=1, max_bin=127, reg_lambda=20.0, random_state=2028, n_jobs=-1, verbose=-1)
    s2.fit(X2_tr, y_tr_res, sample_weight=w_tr,
           eval_set=[(X2_va, y_va_res)], eval_metric='rmse',
           callbacks=[lgb.early_stopping(stopping_rounds=400, verbose=False), lgb.log_evaluation(period=150)])
    va_res_pred = s2.predict(X2_va, num_iteration=s2.best_iteration_).astype('float32')
    va_final_log = (va_trend + va_res_pred).astype('float32')
    rmse = float(root_mean_squared_error(np.expm1(y_va), np.expm1(va_final_log)))
    cv_rmses.append(rmse); best_iters_s2.append(int(s2.best_iteration_ or s2.n_estimators))
    print(f'[Two-Stage GH] Fold {i}: RMSE={rmse:.5f}, s2_best_iter={best_iters_s2[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})')

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_s2_final = int(np.clip(int(np.median(best_iters_s2)), 800, 16000))
print({'cv_rmse_mean': round(cv_mean, 6), 'cv_rmse_std': round(cv_std, 6), 's2_best_iter_final': best_iter_s2_final, 'note': 'Two-stage residual (Stage1 stronger, Stage2 purified + geohash TE + time-decay)'})

# Final fit on full data: Stage 1 then Stage 2
full = train_ord.copy(); tst = test_df.copy()
X1_full = full[trend_feats].astype('float32').values
mono_trend_full = [0]*len(trend_feats)
for feat in ['days_since_start','after_hike','days_since_start_x_after_hike']:
    if feat in trend_feats: mono_trend_full[trend_feats.index(feat)] = 1
trend_full = lgb.LGBMRegressor(objective='regression', metric='rmse',
                               learning_rate=0.05, n_estimators=1500, num_leaves=32,
                               min_data_in_leaf=1200, feature_fraction=0.8, bagging_fraction=0.8,
                               bagging_freq=1, max_bin=63, reg_lambda=20.0, random_state=3031, n_jobs=-1, verbose=-1,
                               monotone_constraints=mono_trend_full)
trend_full.fit(X1_full, y_all_log)
full_trend = trend_full.predict(X1_full).astype('float32')
X1_test = tst[trend_feats].astype('float32').values
test_trend = trend_full.predict(X1_test).astype('float32')
y_full_res = (y_all_log - full_trend).astype('float32')

# Geohash TE and counts on full residuals, apply to test
full = add_geohash(full, prec=6); tst = add_geohash(tst, prec=6)
def te_apply_full(key_tr, tgt_tr, key_ap, m=700.0):
    prior = float(np.mean(tgt_tr))
    g = pd.DataFrame({'k': key_tr.astype('object'), 'y': tgt_tr}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict()
    tr_enc = key_tr.astype('object').map(mp).fillna(prior).astype('float32').values
    ap_enc = key_ap.astype('object').map(mp).fillna(prior).astype('float32').values
    return tr_enc, ap_enc
full['te_pu_gh6'], tst['te_pu_gh6'] = te_apply_full(full['pu_gh6'], y_full_res, tst['pu_gh6'])
full['te_do_gh6'], tst['te_do_gh6'] = te_apply_full(full['do_gh6'], y_full_res, tst['do_gh6'])
pu_cnt_full = full['pu_gh6'].astype('object').value_counts().astype('int32')
do_cnt_full = full['do_gh6'].astype('object').value_counts().astype('int32')
full['log_pu_gh_cnt'] = np.log1p(full['pu_gh6'].astype('object').map(pu_cnt_full).fillna(0).astype('int32')).astype('float32')
tst['log_pu_gh_cnt'] = np.log1p(tst['pu_gh6'].astype('object').map(pu_cnt_full).fillna(0).astype('int32')).astype('float32')
full['log_do_gh_cnt'] = np.log1p(full['do_gh6'].astype('object').map(do_cnt_full).fillna(0).astype('int32')).astype('float32')
tst['log_do_gh_cnt'] = np.log1p(tst['do_gh6'].astype('object').map(do_cnt_full).fillna(0).astype('int32')).astype('float32')

full['trend_pred'] = full_trend; tst['trend_pred'] = test_trend
final_features = base_feats + ['trend_pred','te_pu_gh6','te_do_gh6','log_pu_gh_cnt','log_do_gh_cnt']
X2_full = full[final_features].astype('float32').values
X2_test = tst[final_features].astype('float32').values
w_full = np.exp((full['days_since_start'].astype('float32').values - max_days) / tau_days).astype('float32')
s2_full = lgb.LGBMRegressor(objective='huber', metric='rmse',
                            learning_rate=0.05, n_estimators=best_iter_s2_final, num_leaves=128,
                            min_data_in_leaf=800, feature_fraction=0.8, bagging_fraction=0.8,
                            bagging_freq=1, max_bin=127, reg_lambda=20.0, random_state=4041, n_jobs=-1, verbose=-1)
s2_full.fit(X2_full, y_full_res, sample_weight=w_full)
test_res = s2_full.predict(X2_test).astype('float32')
test_pred = np.expm1(test_trend + test_res).astype('float32')
test_pred = np.clip(test_pred, 0, 500)
pred_df = pd.DataFrame({'key': test_df['key'].astype('string'), 'fare_amount': test_pred})
sub = sample[['key']].merge(pred_df, on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv with shape:', sub.shape)

[150]	valid_0's rmse: 0.508649
[Stage1 Trend] Fold 1: val_rmse_log=0.508606, best_iter=2


[150]	valid_0's rmse: 0.508426


[300]	valid_0's rmse: 0.508223


[450]	valid_0's rmse: 0.50802


[600]	valid_0's rmse: 0.507816


[750]	valid_0's rmse: 0.507614


[900]	valid_0's rmse: 0.507411


[1050]	valid_0's rmse: 0.507209


[1200]	valid_0's rmse: 0.507008


[1350]	valid_0's rmse: 0.506806


[1500]	valid_0's rmse: 0.506605


[1650]	valid_0's rmse: 0.506404


[1800]	valid_0's rmse: 0.506203


[1950]	valid_0's rmse: 0.506003


[2100]	valid_0's rmse: 0.505803


[2250]	valid_0's rmse: 0.505603


[2400]	valid_0's rmse: 0.505404


[2550]	valid_0's rmse: 0.505204


[2700]	valid_0's rmse: 0.505006


[2850]	valid_0's rmse: 0.504807


[3000]	valid_0's rmse: 0.504608


[3150]	valid_0's rmse: 0.50441


[3300]	valid_0's rmse: 0.504211


[3450]	valid_0's rmse: 0.504014


[3600]	valid_0's rmse: 0.503817


[3750]	valid_0's rmse: 0.503619


[3900]	valid_0's rmse: 0.503423


[4050]	valid_0's rmse: 0.503227


[4200]	valid_0's rmse: 0.503032


[4350]	valid_0's rmse: 0.502835


[4500]	valid_0's rmse: 0.502639


[4650]	valid_0's rmse: 0.502445


[4800]	valid_0's rmse: 0.50225


[4950]	valid_0's rmse: 0.502054


[5100]	valid_0's rmse: 0.50186


[5250]	valid_0's rmse: 0.501665


[5400]	valid_0's rmse: 0.501472


[5550]	valid_0's rmse: 0.501278


[5700]	valid_0's rmse: 0.501084


[5850]	valid_0's rmse: 0.500892


[6000]	valid_0's rmse: 0.500699


[6150]	valid_0's rmse: 0.500507


[6300]	valid_0's rmse: 0.500314


[6450]	valid_0's rmse: 0.500122


[6600]	valid_0's rmse: 0.49993


[6750]	valid_0's rmse: 0.499739


[6900]	valid_0's rmse: 0.499547


[7050]	valid_0's rmse: 0.499356


[7200]	valid_0's rmse: 0.499166


[7350]	valid_0's rmse: 0.498976


[7500]	valid_0's rmse: 0.498786


[7650]	valid_0's rmse: 0.498595


[7800]	valid_0's rmse: 0.498406


[7950]	valid_0's rmse: 0.498217


[8100]	valid_0's rmse: 0.498028


[8250]	valid_0's rmse: 0.497839


[8400]	valid_0's rmse: 0.49765


[8550]	valid_0's rmse: 0.497462


[8700]	valid_0's rmse: 0.497273


[8850]	valid_0's rmse: 0.497086


[9000]	valid_0's rmse: 0.496898


[9150]	valid_0's rmse: 0.496711


[9300]	valid_0's rmse: 0.496524


[9450]	valid_0's rmse: 0.496337


[9600]	valid_0's rmse: 0.496151


[9750]	valid_0's rmse: 0.495964


[9900]	valid_0's rmse: 0.495778


[10050]	valid_0's rmse: 0.495592


[10200]	valid_0's rmse: 0.495407


[10350]	valid_0's rmse: 0.495222


[10500]	valid_0's rmse: 0.495037


[10650]	valid_0's rmse: 0.494852


[10800]	valid_0's rmse: 0.494667


[10950]	valid_0's rmse: 0.494482


[11100]	valid_0's rmse: 0.494298


[11250]	valid_0's rmse: 0.494115


[11400]	valid_0's rmse: 0.493932


[11550]	valid_0's rmse: 0.49375


[11700]	valid_0's rmse: 0.493567


[11850]	valid_0's rmse: 0.493385


[12000]	valid_0's rmse: 0.493202


[12150]	valid_0's rmse: 0.49302


[12300]	valid_0's rmse: 0.492838


[12450]	valid_0's rmse: 0.492657


[12600]	valid_0's rmse: 0.492475


[12750]	valid_0's rmse: 0.492295


[12900]	valid_0's rmse: 0.492114


[13050]	valid_0's rmse: 0.491934


[13200]	valid_0's rmse: 0.491753


[13350]	valid_0's rmse: 0.491573


[13500]	valid_0's rmse: 0.491393


[13650]	valid_0's rmse: 0.491214


[13800]	valid_0's rmse: 0.491034


[13950]	valid_0's rmse: 0.490854


[14100]	valid_0's rmse: 0.490675


[14250]	valid_0's rmse: 0.490496


[14400]	valid_0's rmse: 0.490318


[14550]	valid_0's rmse: 0.49014


[14700]	valid_0's rmse: 0.489962


[14850]	valid_0's rmse: 0.489784


[15000]	valid_0's rmse: 0.489607


[15150]	valid_0's rmse: 0.48943


[15300]	valid_0's rmse: 0.489252


[15450]	valid_0's rmse: 0.489075


[15600]	valid_0's rmse: 0.488898


[15750]	valid_0's rmse: 0.488722


[15900]	valid_0's rmse: 0.488546


[Two-Stage GH] Fold 1: RMSE=8.09253, s2_best_iter=16000 (train_n=319908, val_n=320756)


[150]	valid_0's rmse: 0.516721


[Stage1 Trend] Fold 2: val_rmse_log=0.516688, best_iter=75


[150]	valid_0's rmse: 0.510733


[300]	valid_0's rmse: 0.505008


[450]	valid_0's rmse: 0.499481


[600]	valid_0's rmse: 0.494136


[750]	valid_0's rmse: 0.488988


[900]	valid_0's rmse: 0.484008


[1050]	valid_0's rmse: 0.479212


[1200]	valid_0's rmse: 0.47458


[1350]	valid_0's rmse: 0.470098


[1500]	valid_0's rmse: 0.465767


[1650]	valid_0's rmse: 0.461578


[1800]	valid_0's rmse: 0.457514


[1950]	valid_0's rmse: 0.453582


[2100]	valid_0's rmse: 0.449769


[2250]	valid_0's rmse: 0.446076


[2400]	valid_0's rmse: 0.442501


[2550]	valid_0's rmse: 0.43902


[2700]	valid_0's rmse: 0.435651


[2850]	valid_0's rmse: 0.432373


[3000]	valid_0's rmse: 0.429187


[3150]	valid_0's rmse: 0.426093


[3300]	valid_0's rmse: 0.423082


[3450]	valid_0's rmse: 0.420156


[3600]	valid_0's rmse: 0.417309


[3750]	valid_0's rmse: 0.414534


[3900]	valid_0's rmse: 0.411833


[4050]	valid_0's rmse: 0.409208


[4200]	valid_0's rmse: 0.406644


[4350]	valid_0's rmse: 0.404135


[4500]	valid_0's rmse: 0.401692


[4650]	valid_0's rmse: 0.399318


[4800]	valid_0's rmse: 0.396993


[4950]	valid_0's rmse: 0.394723


[5100]	valid_0's rmse: 0.392507


[5250]	valid_0's rmse: 0.390338


[5400]	valid_0's rmse: 0.388219


[5550]	valid_0's rmse: 0.386149


[5700]	valid_0's rmse: 0.38413


[5850]	valid_0's rmse: 0.382155


[6000]	valid_0's rmse: 0.380222


[6150]	valid_0's rmse: 0.378327


[6300]	valid_0's rmse: 0.376473


[6450]	valid_0's rmse: 0.374658


[6600]	valid_0's rmse: 0.372879


[6750]	valid_0's rmse: 0.371135


[6900]	valid_0's rmse: 0.369423


[7050]	valid_0's rmse: 0.367753


[7200]	valid_0's rmse: 0.36611


[7350]	valid_0's rmse: 0.364503


[7500]	valid_0's rmse: 0.362926


[7650]	valid_0's rmse: 0.361377


[7800]	valid_0's rmse: 0.359864


[7950]	valid_0's rmse: 0.358375


[8100]	valid_0's rmse: 0.356909


[8250]	valid_0's rmse: 0.355475


[8400]	valid_0's rmse: 0.354066


[8550]	valid_0's rmse: 0.35268


[8700]	valid_0's rmse: 0.351316


[8850]	valid_0's rmse: 0.349982


[9000]	valid_0's rmse: 0.348667


[9150]	valid_0's rmse: 0.347379


[9300]	valid_0's rmse: 0.346107


[9450]	valid_0's rmse: 0.344862


[9600]	valid_0's rmse: 0.343633


[9750]	valid_0's rmse: 0.342425


[9900]	valid_0's rmse: 0.341238


[10050]	valid_0's rmse: 0.340069


[10200]	valid_0's rmse: 0.338918


[10350]	valid_0's rmse: 0.337789


[10500]	valid_0's rmse: 0.336673


[10650]	valid_0's rmse: 0.335575


[10800]	valid_0's rmse: 0.334493


[10950]	valid_0's rmse: 0.333428


[11100]	valid_0's rmse: 0.332385


[11250]	valid_0's rmse: 0.331354


[11400]	valid_0's rmse: 0.330343


[11550]	valid_0's rmse: 0.329345


[11700]	valid_0's rmse: 0.32836


[11850]	valid_0's rmse: 0.327388


[12000]	valid_0's rmse: 0.326429


[12150]	valid_0's rmse: 0.325484


[12300]	valid_0's rmse: 0.324553


[12450]	valid_0's rmse: 0.323638


[12600]	valid_0's rmse: 0.32273


[12750]	valid_0's rmse: 0.321838


[12900]	valid_0's rmse: 0.320958


[13050]	valid_0's rmse: 0.320091


[13200]	valid_0's rmse: 0.319231


[13350]	valid_0's rmse: 0.318385


[13500]	valid_0's rmse: 0.317551


[13650]	valid_0's rmse: 0.316728


[13800]	valid_0's rmse: 0.315915


[13950]	valid_0's rmse: 0.315111


[14100]	valid_0's rmse: 0.314322


[14250]	valid_0's rmse: 0.31354


[14400]	valid_0's rmse: 0.312771


[14550]	valid_0's rmse: 0.312011


[14700]	valid_0's rmse: 0.311255


[14850]	valid_0's rmse: 0.310515


[15000]	valid_0's rmse: 0.309781


[15150]	valid_0's rmse: 0.309055


[15300]	valid_0's rmse: 0.308337


[15450]	valid_0's rmse: 0.307628


[15600]	valid_0's rmse: 0.306925


[15750]	valid_0's rmse: 0.306233


[15900]	valid_0's rmse: 0.305546


[Two-Stage GH] Fold 2: RMSE=6.66018, s2_best_iter=16000 (train_n=640523, val_n=320756)


[150]	valid_0's rmse: 0.545733


[300]	valid_0's rmse: 0.545529


[450]	valid_0's rmse: 0.545506
[Stage1 Trend] Fold 3: val_rmse_log=0.545475, best_iter=305


[150]	valid_0's rmse: 0.44723


[300]	valid_0's rmse: 0.396003


[450]	valid_0's rmse: 0.364164


[600]	valid_0's rmse: 0.342213


[750]	valid_0's rmse: 0.325994


[900]	valid_0's rmse: 0.31283


[1050]	valid_0's rmse: 0.301784


[1200]	valid_0's rmse: 0.292679


[1350]	valid_0's rmse: 0.285237


[1500]	valid_0's rmse: 0.279012


[1650]	valid_0's rmse: 0.273842


[1800]	valid_0's rmse: 0.269494


[1950]	valid_0's rmse: 0.265709


[2100]	valid_0's rmse: 0.26246
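
The geohash target encoding in `te_apply_full` shrinks each cell's mean residual toward the global prior with a pseudo-count `m`. The sketch below reproduces only that formula on toy data so the effect of `m=700` is visible. The helper name `smoothed_te` and the toy keys and values are assumptions made for illustration and are not taken from the notebook.

```python
import numpy as np
import pandas as pd

def smoothed_te(keys: pd.Series, target: np.ndarray, m: float = 700.0) -> pd.Series:
    """Per-key smoothed mean: (n * mean + m * prior) / (n + m)."""
    prior = float(np.mean(target))
    stats = (
        pd.DataFrame({"k": keys.to_numpy(), "y": target})
        .groupby("k")["y"]
        .agg(["mean", "count"])
    )
    return (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)

# Toy residuals (made-up values): cell "a" has 1000 rows, cell "b" only 2.
keys = pd.Series(["a"] * 1000 + ["b"] * 2)
resid = np.concatenate([np.full(1000, 0.10), np.full(2, 1.00)])

enc = smoothed_te(keys, resid, m=700.0)
prior = float(resid.mean())
print("prior:", round(prior, 4))
print("encoded:", enc.round(4).to_dict())
```

With so few rows, cell "b" ends up close to the prior rather than at its raw mean of 1.0, which is the intended behavior of strong smoothing on sparse geohash cells.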


In [31]:
# Two-Stage Residual Model — Gate Fixes Only: linear weights + simplified Stage1 trend (robustified for early folds)
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# Ensure minimal time/policy features
def ensure_days_since_start(df, global_start_ns):
    if 'days_since_start' in df.columns: return df
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    ns = dt_local.astype('int64').values
    df['days_since_start'] = ((ns - np.int64(global_start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return df

def ensure_after_hike(df):
    if 'after_hike' in df.columns and 'days_since_start_x_after_hike' in df.columns:
        return df
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    df['after_hike'] = (dt_local >= cutoff).astype('int8')
    if 'days_since_start' in df.columns:
        df['days_since_start_x_after_hike'] = (df['days_since_start'] * df['after_hike']).astype('float32')
    else:
        df['days_since_start_x_after_hike'] = np.float32(0.0)
    return df

# Prepare ordered data
dt_local_all = train_df['pickup_datetime'].dt.tz_convert('America/New_York')
global_start_ns = int(dt_local_all.min().value)
train_df = ensure_days_since_start(train_df, global_start_ns)
test_df = ensure_days_since_start(test_df, global_start_ns)
train_df = ensure_after_hike(train_df)
test_df = ensure_after_hike(test_df)

trend_feats = ['days_since_start','after_hike','days_since_start_x_after_hike']
for c in trend_feats: assert c in train_df.columns, f'Missing trend feature: {c}'

# Stage2 feature set (purified: no absolute time except via trend_pred)
base_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'day','hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_df.columns: base_feats.append('rot_manh_km')
if 'dist_x_after_hike' in train_df.columns: base_feats.append('dist_x_after_hike')

train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True)
dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)

K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

# Stage1 params (capacity + patience) per audit recommendation
s1_params = dict(
    objective='regression', metric='rmse',
    learning_rate=0.05, n_estimators=3000,
    num_leaves=64, min_data_in_leaf=400,
    feature_fraction=0.9, bagging_fraction=0.9, bagging_freq=1,
    max_bin=127, reg_lambda=5.0, random_state=2025, n_jobs=-1, verbose=-1
)

# Stage2 params (moderate) — residuals
s2_params = dict(
    objective='huber', metric='rmse',
    learning_rate=0.05, n_estimators=12000,
    num_leaves=128, min_data_in_leaf=600,
    feature_fraction=0.85, bagging_fraction=0.8, bagging_freq=1,
    max_bin=127, reg_lambda=10.0, random_state=2026, n_jobs=-1, verbose=-1
)

cv_rmses = []; s1_best_iters = []; s2_best_iters = []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx]; va = train_ord.iloc[va_idx]
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    # Stage 1: trend
    X1_tr = tr[trend_feats].astype('float32').values
    X1_va = va[trend_feats].astype('float32').values
    # Monotone only on continuous time terms; do NOT constrain binary after_hike
    mono = [0, 0, 0]
    mono[0] = 1  # days_since_start
    mono[2] = 1  # days_since_start_x_after_hike
    # Info note if after_hike is constant 0 in both tr and va (early folds)
    ah_tr_unique = tr['after_hike'].nunique(); ah_va_unique = va['after_hike'].nunique()
    if (ah_tr_unique == 1 and ah_va_unique == 1 and int(tr['after_hike'].iloc[0]) == 0 and int(va['after_hike'].iloc[0]) == 0):
        print(f'[Stage1 Note] Fold {i}: after_hike=0 for both train and val; trend signal may be weak in early period.')

    trend_model = lgb.LGBMRegressor(**{**s1_params, 'monotone_constraints': mono})
    trend_model.fit(
        X1_tr, y_tr,
        eval_set=[(X1_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=300, verbose=False), lgb.log_evaluation(period=150)]
    )
    va_tr_rmse = float(trend_model.best_score_['valid_0']['rmse'])
    best_iter_s1 = int(trend_model.best_iteration_ or s1_params['n_estimators'])
    print(f"[Stage1 Diagnostics] Fold {i}: val_rmse_log={va_tr_rmse:.4f}, best_iter={best_iter_s1}")
    s1_best_iters.append(best_iter_s1)
    tr_trend = trend_model.predict(X1_tr, num_iteration=trend_model.best_iteration_).astype('float32')
    va_trend = trend_model.predict(X1_va, num_iteration=trend_model.best_iteration_).astype('float32')
    y_tr_res = (y_tr - tr_trend).astype('float32'); y_va_res = (y_va - va_trend).astype('float32')

    # Stage 2: residual model with linear weight ramp (no underflow)
    use_cols2 = base_feats + ['trend_pred']
    tr2 = tr.copy(); va2 = va.copy()
    tr2['trend_pred'] = tr_trend; va2['trend_pred'] = va_trend
    X2_tr = tr2[use_cols2].astype('float32').values
    X2_va = va2[use_cols2].astype('float32').values
    days_tr = tr2['days_since_start'].astype('float32').values
    dmin, dmax = float(np.min(days_tr)), float(np.max(days_tr))
    denom = (dmax - dmin) if (dmax > dmin) else 1.0
    w_tr = 0.1 + 0.9 * (days_tr - dmin) / denom
    w_tr = w_tr.astype('float32')
    print(f"[Weight Diagnostics] Fold {i}: min={w_tr.min():.4f}, max={w_tr.max():.4f}, mean={w_tr.mean():.4f}")
    assert np.isfinite(w_tr).all() and w_tr.min() >= 0.1 and w_tr.max() <= 1.0, 'Bad weights'

    s2 = lgb.LGBMRegressor(**s2_params)
    s2.fit(
        X2_tr, y_tr_res, sample_weight=w_tr,
        eval_set=[(X2_va, y_va_res)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=400, verbose=False), lgb.log_evaluation(period=200)]
    )
    s2_best = int(s2.best_iteration_ or s2_params['n_estimators'])
    s2_best_iters.append(s2_best)
    va_res_pred = s2.predict(X2_va, num_iteration=s2.best_iteration_).astype('float32')
    va_final_log = (va_trend + va_res_pred).astype('float32')
    rmse = float(root_mean_squared_error(np.expm1(y_va), np.expm1(va_final_log)))
    cv_rmses.append(rmse)
    print(f"[Two-Stage GateFix] Fold {i}: RMSE={rmse:.5f}, s2_best_iter={s2_best}")

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
print({'cv_rmse_mean': round(cv_mean, 6), 'cv_rmse_std': round(cv_std, 6), 'note': 'Gate-fix two-stage: linear weights + simplified Stage1 (robust early-fold gate)'})
assert cv_mean < 4.0, f'Gate not met: CV mean {cv_mean:.4f} >= 4.0'
print('Gate passed. Ready for further tuning once stability confirmed.')

[Stage1 Note] Fold 1: after_hike=0 for both train and val; trend signal may be weak in early period.


[150]	valid_0's rmse: 0.508622
[300]	valid_0's rmse: 0.50862


[450]	valid_0's rmse: 0.508619
[600]	valid_0's rmse: 0.50862


[750]	valid_0's rmse: 0.508619
[900]	valid_0's rmse: 0.508619


[1050]	valid_0's rmse: 0.508619
[1200]	valid_0's rmse: 0.508619
[Stage1 Diagnostics] Fold 1: val_rmse_log=0.5086, best_iter=984


[Weight Diagnostics] Fold 1: min=0.1000, max=1.0000, mean=0.5517


[200]	valid_0's rmse: 0.232736


[400]	valid_0's rmse: 0.231144


[600]	valid_0's rmse: 0.2307


[800]	valid_0's rmse: 0.230545


[1000]	valid_0's rmse: 0.23049


[1200]	valid_0's rmse: 0.230508


[1400]	valid_0's rmse: 0.230583


[Two-Stage GateFix] Fold 1: RMSE=3.66476, s2_best_iter=1134


[Stage1 Note] Fold 2: after_hike=0 for both train and val; trend signal may be weak in early period.


[150]	valid_0's rmse: 0.516954


[300]	valid_0's rmse: 0.516949


[450]	valid_0's rmse: 0.516945
[600]	valid_0's rmse: 0.516943


[750]	valid_0's rmse: 0.516942
[900]	valid_0's rmse: 0.516942


[1050]	valid_0's rmse: 0.516941
[1200]	valid_0's rmse: 0.51694


[1350]	valid_0's rmse: 0.516941
[1500]	valid_0's rmse: 0.516942


[Stage1 Diagnostics] Fold 2: val_rmse_log=0.5169, best_iter=1278


[Weight Diagnostics] Fold 2: min=0.1000, max=1.0000, mean=0.5478


[200]	valid_0's rmse: 0.192794


[400]	valid_0's rmse: 0.193263


[600]	valid_0's rmse: 0.193542


[Two-Stage GateFix] Fold 2: RMSE=3.29956, s2_best_iter=215


[150]	valid_0's rmse: 0.544867


[300]	valid_0's rmse: 0.54477


[450]	valid_0's rmse: 0.544718


[600]	valid_0's rmse: 0.544671


[750]	valid_0's rmse: 0.544657


[900]	valid_0's rmse: 0.544633


[1050]	valid_0's rmse: 0.544626


[1200]	valid_0's rmse: 0.544604


[1350]	valid_0's rmse: 0.544607


[Stage1 Diagnostics] Fold 3: val_rmse_log=0.5446, best_iter=1179


[Weight Diagnostics] Fold 3: min=0.1000, max=1.0000, mean=0.5523


[200]	valid_0's rmse: 0.20927


[400]	valid_0's rmse: 0.207021


[600]	valid_0's rmse: 0.206291


[800]	valid_0's rmse: 0.205889


[1000]	valid_0's rmse: 0.205717


[1200]	valid_0's rmse: 0.205753


[1400]	valid_0's rmse: 0.205667


[1600]	valid_0's rmse: 0.205652


[1800]	valid_0's rmse: 0.2056


[2000]	valid_0's rmse: 0.205604


[2200]	valid_0's rmse: 0.205751


[Two-Stage GateFix] Fold 3: RMSE=3.56206, s2_best_iter=1861


[150]	valid_0's rmse: 0.55871


[300]	valid_0's rmse: 0.558817


[Stage1 Diagnostics] Fold 4: val_rmse_log=0.5583, best_iter=44


[Weight Diagnostics] Fold 4: min=0.1000, max=1.0000, mean=0.5537


[200]	valid_0's rmse: 0.201594


[400]	valid_0's rmse: 0.198502


[600]	valid_0's rmse: 0.197225


[800]	valid_0's rmse: 0.196382


[1000]	valid_0's rmse: 0.195802


[1200]	valid_0's rmse: 0.195344


[1400]	valid_0's rmse: 0.195144


[1600]	valid_0's rmse: 0.195067


[1800]	valid_0's rmse: 0.194952


[2000]	valid_0's rmse: 0.19489


[2200]	valid_0's rmse: 0.194893


[Two-Stage GateFix] Fold 4: RMSE=3.47642, s2_best_iter=1968


[150]	valid_0's rmse: 0.573326


[300]	valid_0's rmse: 0.57338


[Stage1 Diagnostics] Fold 5: val_rmse_log=0.5730, best_iter=44


[Weight Diagnostics] Fold 5: min=0.1000, max=1.0000, mean=0.5524


[200]	valid_0's rmse: 0.183248


[400]	valid_0's rmse: 0.17878


[600]	valid_0's rmse: 0.177162


[800]	valid_0's rmse: 0.176277


[1000]	valid_0's rmse: 0.175641


[1200]	valid_0's rmse: 0.175314


[1400]	valid_0's rmse: 0.175059


[1600]	valid_0's rmse: 0.174858


[1800]	valid_0's rmse: 0.174703


[2000]	valid_0's rmse: 0.174583


[2200]	valid_0's rmse: 0.174539


[2400]	valid_0's rmse: 0.174544


[2600]	valid_0's rmse: 0.174573


[Two-Stage GateFix] Fold 5: RMSE=3.16010, s2_best_iter=2337
{'cv_rmse_mean': 3.43258, 'cv_rmse_std': 0.181404, 'note': 'Gate-fix two-stage: linear weights + simplified Stage1 (robust early-fold gate)'}
Gate passed. Ready for further tuning once stability confirmed.
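
The two weighting schemes used in this notebook can be compared side by side: the exponential decay `exp((days - ref_day) / tau)` and the linear ramp from 0.1 to 1.0. The sketch below is a minimal comparison under stated assumptions: the fold range of 0 to 390 days is illustrative, and the global reference day of 2000 is a hypothetical value chosen only to show how small the weights become when the reference lies far beyond the fold.

```python
import numpy as np

def exp_decay_weights(days, ref_day, tau=120.0):
    """w = exp((days - ref_day) / tau); equals 1 at ref_day and shrinks for older rows."""
    days = np.asarray(days, dtype="float64")
    return np.exp((days - ref_day) / tau)

def linear_ramp_weights(days, lo=0.1, hi=1.0):
    """Linear ramp from lo (oldest row) to hi (newest row) within the given slice."""
    days = np.asarray(days, dtype="float64")
    dmin, dmax = days.min(), days.max()
    denom = (dmax - dmin) if dmax > dmin else 1.0
    return lo + (hi - lo) * (days - dmin) / denom

fold_days = np.linspace(0.0, 390.0, 5)  # illustrative early-fold training range

w_global = exp_decay_weights(fold_days, ref_day=2000.0)          # hypothetical global maximum
w_local = exp_decay_weights(fold_days, ref_day=fold_days.max())  # fold-local reference
w_linear = linear_ramp_weights(fold_days)

for name, w in [("exp/global", w_global), ("exp/local", w_local), ("linear", w_linear)]:
    print(f"{name:>10}: min={w.min():.3e}, max={w.max():.3e}")
```

Anchoring the decay to the fold's own maximum, or using the linear ramp, keeps the largest weight at 1.0 in every fold. Whether that explains the slow Stage 2 convergence seen earlier is not established here.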


In [29]:
# Diagnostics for Cell 14 Stage1 failure: check trend feature variability and correlation in Fold 1
import numpy as np, pandas as pd
from scipy.stats import pearsonr

assert 'train_df' in globals(), 'train_df missing'
assert 'make_time_folds_quantile' in globals(), 'fold generator missing'

# Recreate ordered view and folds identical to Cell 14
train_ord_dbg = train_df.sort_values('pickup_datetime').reset_index(drop=True).copy()
dt_all_dbg = train_ord_dbg['pickup_datetime']
y_all_log_dbg = np.log1p(train_ord_dbg['fare_amount'].astype('float32').values)
folds_dbg = make_time_folds_quantile(dt_all_dbg, n_folds=5, gap_days=1)
tr_idx, va_idx = folds_dbg[0]
tr = train_ord_dbg.iloc[tr_idx].copy(); va = train_ord_dbg.iloc[va_idx].copy()

trend_feats = ['days_since_start','after_hike','days_since_start_x_after_hike']
for c in trend_feats: assert c in tr.columns, f'Missing {c}'

def stats_block(df, name):
    s = {
        'name': name,
        'rows': int(len(df)),
        'days_min': float(np.nanmin(df['days_since_start'])),
        'days_max': float(np.nanmax(df['days_since_start'])),
        'days_std': float(np.nanstd(df['days_since_start'])),
        'after_hike_unique': df['after_hike'].nunique(),
        'after_hike_counts': df['after_hike'].value_counts().to_dict(),
        'int_min': float(np.nanmin(df['days_since_start_x_after_hike'])),
        'int_max': float(np.nanmax(df['days_since_start_x_after_hike'])),
        'int_std': float(np.nanstd(df['days_since_start_x_after_hike'])),
        'y_log_std': float(np.std(y_all_log_dbg[tr_idx if name.startswith('train') else va_idx]))
    }
    # Correlations (guard for constant)
    corr = {}
    for col in trend_feats:
        v = df[col].astype('float32').values
        # startswith() keeps 'train_recomp' paired with the training targets
        y = y_all_log_dbg[tr_idx if name.startswith('train') else va_idx]
        if np.std(v) > 0 and np.std(y) > 0:
            try:
                corr[col] = float(pearsonr(v, y)[0])
            except Exception:
                corr[col] = np.nan
        else:
            corr[col] = np.nan
    s['pearson'] = corr
    return s

train_stats = stats_block(tr, 'train')
valid_stats = stats_block(va, 'valid')
print('Stage1 trend feature diagnostics:')
print(train_stats)
print(valid_stats)

# If days_since_start looks degenerate, recompute quickly to test hypothesis
if train_stats['days_std'] < 1e-6 or valid_stats['days_std'] < 1e-6:
    print('Recomputing days_since_start and after_hike from pickup_datetime (NY local)...')
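    # Use one shared origin so train and validation days stay on the same time axis
    global_start_ns_dbg = int(train_ord_dbg['pickup_datetime'].dt.tz_convert('America/New_York').min().value)
    def recompute(df):
        dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
        ns = dt_local.astype('int64').values
        df['days_since_start'] = ((ns - np.int64(global_start_ns_dbg)) / np.float64(24*3600*1e9)).astype('float32')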
        cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
        df['after_hike'] = (dt_local >= cutoff).astype('int8')
        df['days_since_start_x_after_hike'] = (df['days_since_start'] * df['after_hike']).astype('float32')
        return df
    tr = recompute(tr); va = recompute(va)
    train_stats2 = stats_block(tr, 'train_recomp')
    valid_stats2 = stats_block(va, 'valid_recomp')
    print('After recompute:')
    print(train_stats2)
    print(valid_stats2)

Stage1 trend feature diagnostics:
{'name': 'train', 'rows': 319908, 'days_min': 0.0, 'days_max': 390.7224426269531, 'days_std': 112.65982055664062, 'after_hike_unique': 1, 'after_hike_counts': {0: 319908}, 'int_min': 0.0, 'int_max': 0.0, 'int_std': 0.0, 'y_log_std': 0.5045214891433716, 'pearson': {'days_since_start': 0.004789734683296588, 'after_hike': nan, 'days_since_start_x_after_hike': nan}}
{'name': 'valid', 'rows': 320756, 'days_min': 391.7254943847656, 'days_max': 793.7321166992188, 'days_std': 115.53379821777344, 'after_hike_unique': 1, 'after_hike_counts': {0: 320756}, 'int_min': 0.0, 'int_max': 0.0, 'int_std': 0.0, 'y_log_std': 0.508609414100647, 'pearson': {'days_since_start': 0.009176272326110886, 'after_hike': nan, 'days_since_start_x_after_hike': nan}}
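
The diagnostics show `after_hike` and its interaction term taking a single value in Fold 1, so they cannot carry any signal there. One way to act on this, sketched below, is to keep only the trend features that actually vary in the training slice of each fold. The helper name `varying_features` and the toy frame are assumptions for illustration; the notebook itself does not define such a helper.

```python
import pandas as pd

def varying_features(train_part, feats):
    """Return the features that take more than one distinct value in this slice."""
    return [f for f in feats if train_part[f].nunique(dropna=True) > 1]

# Toy slice mimicking an early fold: every row predates the fare hike.
toy = pd.DataFrame({
    "days_since_start": [0.0, 10.5, 200.0],
    "after_hike": [0, 0, 0],
    "days_since_start_x_after_hike": [0.0, 0.0, 0.0],
})
trend_feats = ["days_since_start", "after_hike", "days_since_start_x_after_hike"]

kept = varying_features(toy, trend_feats)
print(kept)
```

If features are dropped per fold this way, the `monotone_constraints` list passed to Stage 1 has to be rebuilt to match the reduced feature list.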


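Every cell here depends on `make_time_folds_quantile`, which is defined elsewhere in the notebook. For readers without that definition, the sketch below shows one plausible forward-chaining splitter with a gap. It is a hypothetical stand-in named `forward_chain_folds`, not the notebook's implementation, and it assumes the timestamps are already sorted in ascending order.

```python
import numpy as np
import pandas as pd

def forward_chain_folds(dt, n_folds=5, gap_days=1):
    """Split sorted timestamps into n_folds + 1 equal-count blocks.

    Fold k validates on block k and trains on all rows that end at least
    gap_days before the first validation timestamp.
    """
    n = len(dt)
    edges = np.linspace(0, n, n_folds + 2).astype(int)  # block boundaries by row count
    gap = pd.Timedelta(days=gap_days)
    folds = []
    for k in range(1, n_folds + 1):
        va_idx = np.arange(edges[k], edges[k + 1])
        va_start = dt.iloc[edges[k]]
        tr_idx = np.flatnonzero((dt < va_start - gap).to_numpy())
        folds.append((tr_idx, va_idx))
    return folds

# Toy timeline: one row per day for 600 days.
dt = pd.Series(pd.date_range("2010-01-01", periods=600, freq="D", tz="UTC"))
folds = forward_chain_folds(dt, n_folds=5, gap_days=1)
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    print(f"fold {i}: train_n={len(tr_idx)}, val_n={len(va_idx)}")
```

The training window grows with each fold while the validation block stays the same size, which matches the pattern of `train_n` and `val_n` values printed in the logs.
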
In [33]:
# Two-Stage Residual Model — Stage1 enhancement per audit: seasonal/holiday trend, fold-aware gate; Stage2 purified
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error
from pandas.tseries.holiday import USFederalHolidayCalendar as USCal

assert 'train_df' in globals() and 'test_df' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# Ensure minimal time/policy features
def ensure_days_since_start(df, global_start_ns):
    if 'days_since_start' in df.columns: return df
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    ns = dt_local.astype('int64').values
    df['days_since_start'] = ((ns - np.int64(global_start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return df

def ensure_after_hike(df):
    if 'after_hike' in df.columns and 'days_since_start_x_after_hike' in df.columns:
        return df
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    df['after_hike'] = (dt_local >= cutoff).astype('int8')
    if 'days_since_start' in df.columns:
        df['days_since_start_x_after_hike'] = (df['days_since_start'] * df['after_hike']).astype('float32')
    else:
        df['days_since_start_x_after_hike'] = np.float32(0.0)
    return df

def ensure_seasonal_holiday(df):
    need = ['doy_sin','doy_cos','dow_sin','dow_cos','is_holiday']
    if all(c in df.columns for c in need):
        return df
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    # Day-of-year cyclic
    doy = dt_local.dt.dayofyear.astype('int16')
    df['doy_sin'] = np.sin(2*np.pi*(doy/365.25)).astype('float32')
    df['doy_cos'] = np.cos(2*np.pi*(doy/365.25)).astype('float32')
    # Day-of-week cyclic
    dow = dt_local.dt.dayofweek.astype('int8')
    df['dow_sin'] = np.sin(2*np.pi*(dow/7)).astype('float32')
    df['dow_cos'] = np.cos(2*np.pi*(dow/7)).astype('float32')
    # US Federal Holidays
    start = dt_local.min().normalize().tz_localize(None)
    end = dt_local.max().normalize().tz_localize(None)
    hol = USCal().holidays(start=start, end=end)
    # Vectorized membership test on naive local midnights (avoids a per-row Python lambda)
    local_day = dt_local.dt.tz_localize(None).dt.normalize()
    df['is_holiday'] = local_day.isin(hol).astype('int8')
    return df

# Prepare ordered data
dt_local_all = train_df['pickup_datetime'].dt.tz_convert('America/New_York')
global_start_ns = int(dt_local_all.min().value)
train_df = ensure_days_since_start(train_df, global_start_ns)
test_df = ensure_days_since_start(test_df, global_start_ns)
train_df = ensure_after_hike(train_df)
test_df = ensure_after_hike(test_df)
train_df = ensure_seasonal_holiday(train_df)
test_df = ensure_seasonal_holiday(test_df)

# Stage 1 trend features (per audit) and constraints
trend_feats = ['days_since_start','after_hike','days_since_start_x_after_hike','doy_sin','doy_cos','dow_sin','dow_cos','is_holiday']
for c in trend_feats: assert c in train_df.columns, f'Missing trend feature: {c}'
mono_map = {'days_since_start': 1, 'days_since_start_x_after_hike': 1}
mono_constraints = [mono_map.get(f, 0) for f in trend_feats]

# Stage 2 feature set (purified: remove any Stage1 features and absolute time identifiers)
base_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'day','hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_df.columns: base_feats.append('rot_manh_km')
if 'dist_x_after_hike' in train_df.columns: base_feats.append('dist_x_after_hike')

train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True)
dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)

K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

# Params
s1_params = dict(
    objective='regression', metric='rmse',
    learning_rate=0.05, n_estimators=4000,
    num_leaves=64, min_data_in_leaf=400,
    feature_fraction=0.9, bagging_fraction=0.9, bagging_freq=1,
    max_bin=127, reg_lambda=5.0, random_state=2025, n_jobs=-1, verbose=-1
)
s2_params = dict(
    objective='huber', metric='rmse',
    learning_rate=0.05, n_estimators=14000,
    num_leaves=128, min_data_in_leaf=600,
    feature_fraction=0.85, bagging_fraction=0.8, bagging_freq=1,
    max_bin=127, reg_lambda=10.0, random_state=2026, n_jobs=-1, verbose=-1
)

cv_rmses = []; s1_best_iters = []; s2_best_iters = []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx]; va = train_ord.iloc[va_idx]
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    # Stage 1: trend model on trend_feats; monotone (+1) constraints apply only to the time terms in mono_map
    X1_tr = tr[trend_feats].astype('float32').values
    X1_va = va[trend_feats].astype('float32').values
    trend_model = lgb.LGBMRegressor(**{**s1_params, 'monotone_constraints': mono_constraints})
    trend_model.fit(
        X1_tr, y_tr,
        eval_set=[(X1_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=300, verbose=False), lgb.log_evaluation(period=150)]
    )
    va_tr_rmse = float(trend_model.best_score_['valid_0']['rmse'])
    best_iter_s1 = int(trend_model.best_iteration_ or s1_params['n_estimators'])
    sigma_va = float(np.std(y_va))
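    # Gate (log scale): Stage1 validation RMSE must not exceed min(0.35, 90% of the validation target std)
    gate = min(0.35, 0.9 * sigma_va)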
    print(f"[Stage1 Diagnostics] Fold {i}: val_rmse_log={va_tr_rmse:.4f}, best_iter={best_iter_s1}, sigma_va={sigma_va:.4f}, gate={gate:.4f}")
    # TEMP: soft gate for debugging to complete CV; log warning instead of assert
    if not (va_tr_rmse <= gate):
        print(f"[Stage1 WARN] Fold {i}: gate failed (rmse={va_tr_rmse:.3f} > gate={gate:.3f}). Continuing for diagnostics.")
    s1_best_iters.append(best_iter_s1)
    tr_trend = trend_model.predict(X1_tr, num_iteration=trend_model.best_iteration_).astype('float32')
    va_trend = trend_model.predict(X1_va, num_iteration=trend_model.best_iteration_).astype('float32')
    y_tr_res = (y_tr - tr_trend).astype('float32'); y_va_res = (y_va - va_trend).astype('float32')

    # Stage 2: residual model with linear weight ramp; inputs purified of Stage1 seasonal/holiday
    use_cols2 = base_feats + ['trend_pred']
    tr2 = tr.copy(); va2 = va.copy()
    tr2['trend_pred'] = tr_trend; va2['trend_pred'] = va_trend
    X2_tr = tr2[use_cols2].astype('float32').values
    X2_va = va2[use_cols2].astype('float32').values
    days_tr = tr2['days_since_start'].astype('float32').values
    dmin, dmax = float(np.min(days_tr)), float(np.max(days_tr))
    denom = (dmax - dmin) if (dmax > dmin) else 1.0
    w_tr = (0.1 + 0.9 * (days_tr - dmin) / denom).astype('float32')
    print(f"[Weight Diagnostics] Fold {i}: min={w_tr.min():.4f}, max={w_tr.max():.4f}, mean={w_tr.mean():.4f}")
    assert np.isfinite(w_tr).all() and w_tr.min() >= 0.1 and w_tr.max() <= 1.0, 'Bad weights'

    s2 = lgb.LGBMRegressor(**s2_params)
    s2.fit(
        X2_tr, y_tr_res, sample_weight=w_tr,
        eval_set=[(X2_va, y_va_res)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=400, verbose=False), lgb.log_evaluation(period=200)]
    )
    s2_best = int(s2.best_iteration_ or s2_params['n_estimators'])
    s2_best_iters.append(s2_best)
    va_res_pred = s2.predict(X2_va, num_iteration=s2.best_iteration_).astype('float32')
    va_final_log = (va_trend + va_res_pred).astype('float32')
    rmse = float(root_mean_squared_error(np.expm1(y_va), np.expm1(va_final_log)))
    cv_rmses.append(rmse)
    print(f"[Two-Stage Enhanced] Fold {i}: RMSE={rmse:.5f}, s2_best_iter={s2_best}")

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
print({'cv_rmse_mean': round(cv_mean, 6), 'cv_rmse_std': round(cv_std, 6), 'note': 'Two-stage: enhanced Stage1 (seasonal/holiday) + purified Stage2 + linear weights'})
print('Next: if gates improve, re-introduce residual TE (coarse) as advised.')

[150]	valid_0's rmse: 0.508963
[300]	valid_0's rmse: 0.510383
[Stage1 Diagnostics] Fold 1: val_rmse_log=0.5084, best_iter=14, sigma_va=0.5086, gate=0.3500
[Stage1 WARN] Fold 1: gate failed (rmse=0.508 > gate=0.350). Continuing for diagnostics.
[Weight Diagnostics] Fold 1: min=0.1000, max=1.0000, mean=0.5517
[200]	valid_0's rmse: 0.23314
[400]	valid_0's rmse: 0.231755
[600]	valid_0's rmse: 0.231374
[800]	valid_0's rmse: 0.231289
[1000]	valid_0's rmse: 0.231291
[Two-Stage Enhanced] Fold 1: RMSE=3.66505, s2_best_iter=718


[150]	valid_0's rmse: 0.516504
[300]	valid_0's rmse: 0.516648
[Stage1 Diagnostics] Fold 2: val_rmse_log=0.5165, best_iter=97, sigma_va=0.5166, gate=0.3500
[Stage1 WARN] Fold 2: gate failed (rmse=0.516 > gate=0.350). Continuing for diagnostics.
[Weight Diagnostics] Fold 2: min=0.1000, max=1.0000, mean=0.5478
[200]	valid_0's rmse: 0.176468
[400]	valid_0's rmse: 0.175258
[600]	valid_0's rmse: 0.175126
[800]	valid_0's rmse: 0.175468
[Two-Stage Enhanced] Fold 2: RMSE=2.88110, s2_best_iter=566


[150]	valid_0's rmse: 0.54459
[300]	valid_0's rmse: 0.544098
[450]	valid_0's rmse: 0.543672
[600]	valid_0's rmse: 0.543365
[750]	valid_0's rmse: 0.543074
[900]	valid_0's rmse: 0.542826
[1050]	valid_0's rmse: 0.54256
[1200]	valid_0's rmse: 0.542289
[1350]	valid_0's rmse: 0.542114
[1500]	valid_0's rmse: 0.541968
[1650]	valid_0's rmse: 0.541764
[1800]	valid_0's rmse: 0.541541
[1950]	valid_0's rmse: 0.541338
[2100]	valid_0's rmse: 0.541171
[2250]	valid_0's rmse: 0.541062
[2400]	valid_0's rmse: 0.540927
[2550]	valid_0's rmse: 0.540822
[2700]	valid_0's rmse: 0.540742
[2850]	valid_0's rmse: 0.540615
[3000]	valid_0's rmse: 0.540533
[3150]	valid_0's rmse: 0.540454
[3300]	valid_0's rmse: 0.540368
[3450]	valid_0's rmse: 0.5403
[3600]	valid_0's rmse: 0.540208
[3750]	valid_0's rmse: 0.540194
[3900]	valid_0's rmse: 0.540162
[Stage1 Diagnostics] Fold 3: val_rmse_log=0.5401, best_iter=4000, sigma_va=0.5403, gate=0.3500
[Stage1 WARN] Fold 3: gate failed (rmse=0.540 > gate=0.350). Continuing for diagnostics.
[Weight Diagnostics] Fold 3: min=0.1000, max=1.0000, mean=0.5523
[200]	valid_0's rmse: 0.200952
[400]	valid_0's rmse: 0.199067
[600]	valid_0's rmse: 0.198475
[800]	valid_0's rmse: 0.19837
[1000]	valid_0's rmse: 0.198434
[1200]	valid_0's rmse: 0.198336
[1400]	valid_0's rmse: 0.198363
[Two-Stage Enhanced] Fold 3: RMSE=3.44298, s2_best_iter=1182


[150]	valid_0's rmse: 0.559795
[300]	valid_0's rmse: 0.563876
[Stage1 Diagnostics] Fold 4: val_rmse_log=0.5582, best_iter=52, sigma_va=0.5583, gate=0.3500
[Stage1 WARN] Fold 4: gate failed (rmse=0.558 > gate=0.350). Continuing for diagnostics.
[Weight Diagnostics] Fold 4: min=0.1000, max=1.0000, mean=0.5537
[200]	valid_0's rmse: 0.201601
[400]	valid_0's rmse: 0.198516
[600]	valid_0's rmse: 0.197483
[800]	valid_0's rmse: 0.196991
[1000]	valid_0's rmse: 0.196664
[1200]	valid_0's rmse: 0.19648
[1400]	valid_0's rmse: 0.19637
[1600]	valid_0's rmse: 0.196346
[1800]	valid_0's rmse: 0.196296
[2000]	valid_0's rmse: 0.19636
[2200]	valid_0's rmse: 0.196392
[Two-Stage Enhanced] Fold 4: RMSE=3.48977, s2_best_iter=1824


[150]	valid_0's rmse: 0.572763
[300]	valid_0's rmse: 0.573284
[Stage1 Diagnostics] Fold 5: val_rmse_log=0.5725, best_iter=73, sigma_va=0.5730, gate=0.3500
[Stage1 WARN] Fold 5: gate failed (rmse=0.573 > gate=0.350). Continuing for diagnostics.
[Weight Diagnostics] Fold 5: min=0.1000, max=1.0000, mean=0.5524
[200]	valid_0's rmse: 0.184764
[400]	valid_0's rmse: 0.181075
[600]	valid_0's rmse: 0.179686
[800]	valid_0's rmse: 0.179005
[1000]	valid_0's rmse: 0.178532
[1200]	valid_0's rmse: 0.178251
[1400]	valid_0's rmse: 0.178005
[1600]	valid_0's rmse: 0.177906
[1800]	valid_0's rmse: 0.177832
[2000]	valid_0's rmse: 0.177822
[2200]	valid_0's rmse: 0.177816
[2400]	valid_0's rmse: 0.177844
[Two-Stage Enhanced] Fold 5: RMSE=3.21613, s2_best_iter=2024
{'cv_rmse_mean': 3.339006, 'cv_rmse_std': 0.270046, 'note': 'Two-stage: enhanced Stage1 (seasonal/holiday) + purified Stage2 + linear weights'}
Next: if gates improve, re-introduce residual TE (coarse) as advised.
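
The Stage 2 sample weights in the cell above ramp linearly from 0.1 at the oldest training row to 1.0 at the newest. The sketch below reproduces that formula on synthetic, evenly spaced days; the helper name `linear_ramp_weights` is hypothetical and the data is made up. With rows spread uniformly in time the mean weight is 0.55, which is close to the means printed by the `[Weight Diagnostics]` lines.

```python
import numpy as np

def linear_ramp_weights(days, floor=0.1):
    """Hypothetical helper: weight rows linearly from `floor` (oldest) to 1.0 (newest)."""
    days = np.asarray(days, dtype='float64')
    dmin, dmax = float(days.min()), float(days.max())
    denom = (dmax - dmin) if dmax > dmin else 1.0  # guard against a zero-width time span
    return floor + (1.0 - floor) * (days - dmin) / denom

# Synthetic stand-in for days_since_start: rows spread evenly over about 390 days
days = np.linspace(0.0, 390.0, 1001)
w = linear_ramp_weights(days)
print(f'min={w.min():.4f}, max={w.max():.4f}, mean={w.mean():.4f}')

# Degenerate case: all rows share one timestamp, so every row gets the floor weight
w_flat = linear_ramp_weights([5.0, 5.0, 5.0])
print(w_flat)
```

A mean far from 0.55 on real data would suggest the training rows are unevenly distributed over time, which may be worth checking before trusting the ramp.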


In [34]:
# Two-Stage Residual Model — Stage1 FIX: monotone only on days_since_start; hard gate; increased capacity
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error
from pandas.tseries.holiday import USFederalHolidayCalendar as USCal

assert 'train_df' in globals() and 'test_df' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# Ensure minimal time/policy + seasonal/holiday features
def ensure_days_since_start(df, global_start_ns):
    if 'days_since_start' in df.columns: return df
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    ns = dt_local.astype('int64').values
    df['days_since_start'] = ((ns - np.int64(global_start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return df

def ensure_after_hike(df):
    if 'after_hike' in df.columns and 'days_since_start_x_after_hike' in df.columns:
        return df
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    df['after_hike'] = (dt_local >= cutoff).astype('int8')
    if 'days_since_start' in df.columns:
        df['days_since_start_x_after_hike'] = (df['days_since_start'] * df['after_hike']).astype('float32')
    else:
        df['days_since_start_x_after_hike'] = np.float32(0.0)
    return df

def ensure_seasonal_holiday(df):
    need = ['doy_sin','doy_cos','dow_sin','dow_cos','is_holiday']
    if all(c in df.columns for c in need):
        return df
    df = df.copy()
    dt_local = df['pickup_datetime'].dt.tz_convert('America/New_York')
    # Day-of-year cyclic
    doy = dt_local.dt.dayofyear.astype('int16')
    df['doy_sin'] = np.sin(2*np.pi*(doy/365.25)).astype('float32')
    df['doy_cos'] = np.cos(2*np.pi*(doy/365.25)).astype('float32')
    # Day-of-week cyclic
    dow = dt_local.dt.dayofweek.astype('int8')
    df['dow_sin'] = np.sin(2*np.pi*(dow/7)).astype('float32')
    df['dow_cos'] = np.cos(2*np.pi*(dow/7)).astype('float32')
    # US Federal Holidays
    start = dt_local.min().normalize().tz_localize(None)
    end = dt_local.max().normalize().tz_localize(None)
    hol = USCal().holidays(start=start, end=end)
    # Vectorized membership test on naive local midnights (avoids a per-row Python lambda)
    local_day = dt_local.dt.tz_localize(None).dt.normalize()
    df['is_holiday'] = local_day.isin(hol).astype('int8')
    return df

# Prepare ordered data
dt_local_all = train_df['pickup_datetime'].dt.tz_convert('America/New_York')
global_start_ns = int(dt_local_all.min().value)
train_df = ensure_days_since_start(train_df, global_start_ns)
test_df = ensure_days_since_start(test_df, global_start_ns)
train_df = ensure_after_hike(train_df)
test_df = ensure_after_hike(test_df)
train_df = ensure_seasonal_holiday(train_df)
test_df = ensure_seasonal_holiday(test_df)

# Stage 1 features per mandate; monotone ONLY on days_since_start
trend_feats = ['days_since_start','after_hike','days_since_start_x_after_hike','doy_sin','doy_cos','dow_sin','dow_cos','is_holiday']
for c in trend_feats: assert c in train_df.columns, f'Missing trend feature: {c}'
mono_map = {'days_since_start': 1}  # ONLY primary time trend is monotonic
mono_constraints = [mono_map.get(f, 0) for f in trend_feats]

# Stage 2 (purified) — no absolute time or Stage1 seasonal/holiday; only trend_pred conveys macro trend
base_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'day','hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_df.columns: base_feats.append('rot_manh_km')
if 'dist_x_after_hike' in train_df.columns: base_feats.append('dist_x_after_hike')

train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True)
dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)

K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

# Params — increased Stage1 capacity; cautious regularization; Stage2 unchanged (purified + linear weights)
s1_params = dict(
    objective='regression', metric='rmse',
    learning_rate=0.04, n_estimators=6000,
    num_leaves=128, min_data_in_leaf=400,
    feature_fraction=0.9, bagging_fraction=0.9, bagging_freq=1,
    max_bin=127, reg_lambda=2.0, random_state=2025, n_jobs=-1, verbose=-1
)
s2_params = dict(
    objective='huber', metric='rmse',
    learning_rate=0.05, n_estimators=14000,
    num_leaves=128, min_data_in_leaf=600,
    feature_fraction=0.85, bagging_fraction=0.8, bagging_freq=1,
    max_bin=127, reg_lambda=10.0, random_state=2026, n_jobs=-1, verbose=-1
)

cv_rmses = []; s1_best_iters = []; s2_best_iters = []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx]; va = train_ord.iloc[va_idx]
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]

    # Stage 1
    X1_tr = tr[trend_feats].astype('float32').values
    X1_va = va[trend_feats].astype('float32').values
    trend_model = lgb.LGBMRegressor(**{**s1_params, 'monotone_constraints': mono_constraints})
    trend_model.fit(
        X1_tr, y_tr,
        eval_set=[(X1_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=400, verbose=False), lgb.log_evaluation(period=150)]
    )
    va_tr_rmse = float(trend_model.best_score_['valid_0']['rmse'])
    best_iter_s1 = int(trend_model.best_iteration_ or s1_params['n_estimators'])
    sigma_va = float(np.std(y_va))
    gate = min(0.35, 0.9 * sigma_va)
    print(f"[Stage1 FIX] Fold {i}: val_rmse_log={va_tr_rmse:.4f}, best_iter={best_iter_s1}, sigma_va={sigma_va:.4f}, gate={gate:.4f}")
    assert va_tr_rmse <= gate, f"CRITICAL: Stage1 gate failed on fold {i}. RMSE={va_tr_rmse:.4f} > Gate={gate:.4f}"
    s1_best_iters.append(best_iter_s1)
    tr_trend = trend_model.predict(X1_tr, num_iteration=trend_model.best_iteration_).astype('float32')
    va_trend = trend_model.predict(X1_va, num_iteration=trend_model.best_iteration_).astype('float32')
    y_tr_res = (y_tr - tr_trend).astype('float32'); y_va_res = (y_va - va_trend).astype('float32')

    # Stage 2 with linear ramp weights (no underflow) and purified inputs
    tr2 = tr.copy(); va2 = va.copy()
    tr2['trend_pred'] = tr_trend; va2['trend_pred'] = va_trend
    use_cols2 = base_feats + ['trend_pred']
    X2_tr = tr2[use_cols2].astype('float32').values
    X2_va = va2[use_cols2].astype('float32').values
    days_tr = tr2['days_since_start'].astype('float32').values
    dmin, dmax = float(np.min(days_tr)), float(np.max(days_tr))
    denom = (dmax - dmin) if (dmax > dmin) else 1.0
    w_tr = (0.2 + 0.8 * (days_tr - dmin) / denom).astype('float32')  # floor at 0.2 for extra stability
    print(f"[Weights] Fold {i}: min={w_tr.min():.4f}, max={w_tr.max():.4f}, mean={w_tr.mean():.4f}")
    assert np.isfinite(w_tr).all() and (0.19 <= w_tr.min() <= 1.0) and (w_tr.max() <= 1.0), 'Bad weights'

    s2 = lgb.LGBMRegressor(**s2_params)
    s2.fit(
        X2_tr, y_tr_res, sample_weight=w_tr,
        eval_set=[(X2_va, y_va_res)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=400, verbose=False), lgb.log_evaluation(period=200)]
    )
    s2_best = int(s2.best_iteration_ or s2_params['n_estimators'])
    s2_best_iters.append(s2_best)
    va_res_pred = s2.predict(X2_va, num_iteration=s2.best_iteration_).astype('float32')
    va_final_log = (va_trend + va_res_pred).astype('float32')
    rmse = float(root_mean_squared_error(np.expm1(y_va), np.expm1(va_final_log)))
    cv_rmses.append(rmse)
    print(f"[Two-Stage S1-FIX] Fold {i}: RMSE={rmse:.5f}, s2_best_iter={s2_best}")

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
print({'cv_rmse_mean': round(cv_mean, 6), 'cv_rmse_std': round(cv_std, 6), 'note': 'Two-stage with Stage1 monotone fix (days only) + hard gate + purified Stage2 + linear weights'})
assert cv_std <= 0.15 and cv_mean <= 3.35, f"Gates not met: mean={cv_mean:.4f}, std={cv_std:.4f}"
print('Gates met. Proceed to residual TE (coarse) in Stage 2 next.')

[150]	valid_0's rmse: 0.509473
[300]	valid_0's rmse: 0.511097
[Stage1 FIX] Fold 1: val_rmse_log=0.5084, best_iter=21, sigma_va=0.5086, gate=0.3500


AssertionError: CRITICAL: Stage1 gate failed on fold 1. RMSE=0.5084 > Gate=0.3500
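
The hard gate fails because Stage 1's validation RMSE is almost identical to the standard deviation of the validation target, meaning the trend model does barely better than predicting a constant. A rough sketch of that comparison, using the rounded per-fold numbers logged by the earlier soft-gated run; the helper names are hypothetical, and the explained fraction is computed as 1 - (rmse / sigma)^2.

```python
def stage1_gate(sigma_va, abs_cap=0.35, rel=0.9):
    """Gate threshold as written in the cells above: min(0.35, 0.9 * sigma_va)."""
    return min(abs_cap, rel * sigma_va)

def explained_fraction(rmse, sigma):
    """Share of validation variance explained: 1 - (rmse / sigma)^2."""
    return 1.0 - (rmse / sigma) ** 2

# (val_rmse_log, sigma_va) per fold, copied from the [Stage1 Diagnostics] lines (4-decimal rounding)
logged = [(0.5084, 0.5086), (0.5165, 0.5166), (0.5401, 0.5403), (0.5582, 0.5583), (0.5725, 0.5730)]

results = []
for fold, (rmse, sigma) in enumerate(logged, 1):
    gate = stage1_gate(sigma)
    needed = explained_fraction(gate, sigma)   # what the gate demands
    actual = explained_fraction(rmse, sigma)   # what Stage 1 delivered
    results.append((gate, rmse <= gate, actual, needed))
    print(f'fold {fold}: gate={gate:.4f}, passes={rmse <= gate}, '
          f'explained={actual:.4f}, needed>={needed:.4f}')
```

On these rounded values the gate asks Stage 1 to explain more than half of the log-fare variance from calendar features alone, while the logged runs explain well under 1%. That gap suggests the threshold, rather than model capacity, may be the thing to revisit.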

In [35]:
# Diagnostics: Stage1 trend features (extended) variability and correlation in Fold 1
import numpy as np, pandas as pd
from scipy.stats import pearsonr

assert 'train_df' in globals() and 'make_time_folds_quantile' in globals(), 'Prereqs missing'

# Use the Stage1 feature set from Cell 17
trend_feats = ['days_since_start','after_hike','days_since_start_x_after_hike','doy_sin','doy_cos','dow_sin','dow_cos','is_holiday']
for c in trend_feats: assert c in train_df.columns, f'Missing {c} in train_df'

train_ord_dbg = train_df.sort_values('pickup_datetime').reset_index(drop=True).copy()
dt_all_dbg = train_ord_dbg['pickup_datetime']
y_log_dbg = np.log1p(train_ord_dbg['fare_amount'].astype('float32').values)
folds_dbg = make_time_folds_quantile(dt_all_dbg, n_folds=5, gap_days=1)
tr_idx, va_idx = folds_dbg[0]
tr = train_ord_dbg.iloc[tr_idx].copy(); va = train_ord_dbg.iloc[va_idx].copy()

def feat_stats(df, y, name):
    stats = {'name': name, 'rows': int(len(df))}
    for c in trend_feats:
        v = df[c].astype('float32').values
        stats[f'{c}_mean'] = float(np.nanmean(v))
        stats[f'{c}_std'] = float(np.nanstd(v))
        if np.nanstd(v) > 0 and np.nanstd(y) > 0:
            try:
                stats[f'corr_{c}'] = float(pearsonr(v, y)[0])
            except Exception:
                stats[f'corr_{c}'] = np.nan
        else:
            stats[f'corr_{c}'] = np.nan
    return stats

tr_stats = feat_stats(tr, y_log_dbg[tr_idx], 'train_fold1')
va_stats = feat_stats(va, y_log_dbg[va_idx], 'valid_fold1')
print('Stage1 extended trend feature diagnostics (Fold 1):')
print(tr_stats)
print(va_stats)

# Quick sanity: unique counts for binary features
print('Unique counts: after_hike (tr,va)=', tr['after_hike'].nunique(), va['after_hike'].nunique(),
      '; is_holiday (tr,va)=', tr['is_holiday'].nunique(), va['is_holiday'].nunique())

# Check if seasonal features are near-constant or miscomputed (very low std indicates issue)
low_std = {c: (tr_stats[f'{c}_std'], va_stats[f'{c}_std']) for c in trend_feats}
print('STD per feature (train, valid):', low_std)

Stage1 extended trend feature diagnostics (Fold 1):
{'name': 'train_fold1', 'rows': 319908, 'days_since_start_mean': 196.10989379882812, 'days_since_start_std': 112.65982055664062, 'corr_days_since_start': 0.004789734683296588, 'after_hike_mean': 0.0, 'after_hike_std': 0.0, 'corr_after_hike': nan, 'days_since_start_x_after_hike_mean': 0.0, 'days_since_start_x_after_hike_std': 0.0, 'corr_days_since_start_x_after_hike': nan, 'doy_sin_mean': 0.013525240123271942, 'doy_sin_std': 0.6925604343414307, 'corr_doy_sin': -0.012210300229901801, 'doy_cos_mean': 0.0685039758682251, 'doy_cos_std': 0.7179723978042603, 'corr_doy_cos': -0.011701873185007804, 'dow_sin_mean': 0.01914128102362156, 'dow_sin_std': 0.7052960991859436, 'corr_dow_sin': -0.008666983029101032, 'dow_cos_mean': -0.08087312430143356, 'dow_cos_std': 0.704024612903595, 'corr_dow_cos': -0.011065482260595182, 'is_holiday_mean': 0.021368643268942833, 'is_holiday_std': 0.1446099430322647, 'corr_is_holiday': -0.004848697234120326}
{'name':
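
The `nan` correlations in the diagnostics above come from zero-variance columns: Pearson's r divides by the product of the two standard deviations, so it is undefined when either is zero. A small NumPy-only sketch of the same guard that `feat_stats` applies; `safe_pearson` is a hypothetical name and the data is synthetic.

```python
import numpy as np

def safe_pearson(x, y):
    """Pearson r, or NaN when either input has zero variance (r is undefined there)."""
    x = np.asarray(x, dtype='float64')
    y = np.asarray(y, dtype='float64')
    sx, sy = x.std(), y.std()
    if sx == 0.0 or sy == 0.0:
        return float('nan')
    return float(np.mean((x - x.mean()) * (y - y.mean())) / (sx * sy))

rng = np.random.default_rng(0)
y = rng.normal(size=1000)       # synthetic stand-in for the log-fare target
after_hike = np.zeros(1000)     # constant, as in a fold where every row precedes the cutoff

r_const = safe_pearson(after_hike, y)   # NaN: a constant column carries no linear signal
r_self = safe_pearson(y, y)             # a variable compared with itself
print(r_const, r_self)
```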

In [36]:
# Stage 1 Gate Fix — Enrich features (year/month) + optional fold-local time anchor; hard gate enforced
import numpy as np, pandas as pd

assert 'train_df' in globals() and 'make_time_folds_quantile' in globals(), 'Prereqs missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# Prepare ordered data
train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True)
dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)

# Mandated enriched Stage1 features
trend_feats = [
    'days_since_start', 'year', 'month', 'after_hike',
    'days_since_start_x_after_hike', 'doy_sin', 'doy_cos',
    'dow_sin', 'dow_cos', 'is_holiday'
]
for c in trend_feats: assert c in train_ord.columns, f'Missing {c} in train_df'

# Build 5 time folds with 1-day gap
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

# Use the same Stage1 params as prior cell (do not alter capacity/Stage2 per mandate)
s1_params = dict(
    objective='regression', metric='rmse',
    learning_rate=0.04, n_estimators=6000,
    num_leaves=128, min_data_in_leaf=400,
    feature_fraction=0.9, bagging_fraction=0.9, bagging_freq=1,
    max_bin=127, reg_lambda=2.0, random_state=2025, n_jobs=-1, verbose=-1
)

def add_fold_local_anchor(tr, va):
    # Optional but recommended: fold-local time anchor to strengthen signal
    # Compute days since train-start in NY local time and use as an additional feature
    tr = tr.copy(); va = va.copy()
    dt_tr_local = tr['pickup_datetime'].dt.tz_convert('America/New_York')
    dt_va_local = va['pickup_datetime'].dt.tz_convert('America/New_York')
    start_ns = int(dt_tr_local.min().value)
    tr['days_since_train_start'] = ((dt_tr_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    va['days_since_train_start'] = ((dt_va_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return tr, va

# Monotone constraint ONLY on the fold-local continuous time feature
mono_map_static = {'days_since_train_start': 1}

gate_pass = True
fold_scores = []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx]; va = train_ord.iloc[va_idx]
    # Add fold-local anchor
    tr, va = add_fold_local_anchor(tr, va)
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    use_feats = trend_feats + ['days_since_train_start']
    X1_tr = tr[use_feats].astype('float32').values
    X1_va = va[use_feats].astype('float32').values
    mono_constraints = [mono_map_static.get(f, 0) for f in use_feats]
    trend_model = lgb.LGBMRegressor(**{**s1_params, 'monotone_constraints': mono_constraints})
    trend_model.fit(
        X1_tr, y_tr,
        eval_set=[(X1_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=400, verbose=False), lgb.log_evaluation(period=150)]
    )
    va_tr_rmse = float(trend_model.best_score_['valid_0']['rmse'])
    sigma_va = float(np.std(y_va))
    gate = min(0.35, 0.9 * sigma_va)
    print(f"[Stage1 GateCheck] Fold {i}: rmse_log={va_tr_rmse:.4f}, sigma_va={sigma_va:.4f}, gate={gate:.4f}")
    fold_scores.append(va_tr_rmse)
    if not (va_tr_rmse <= gate):
        gate_pass = False

assert gate_pass, f"CRITICAL: Stage1 gate failed on ≥1 folds. rmse_log per fold={fold_scores}"
print({'stage1_gate': 'PASSED', 'rmse_log_per_fold': [round(s,6) for s in fold_scores]})

[150]	valid_0's rmse: 0.508881
[300]	valid_0's rmse: 0.508842
[Stage1 GateCheck] Fold 1: rmse_log=0.5086, sigma_va=0.5086, gate=0.3500


[150]	valid_0's rmse: 0.516603
[300]	valid_0's rmse: 0.516644
[450]	valid_0's rmse: 0.516651
[Stage1 GateCheck] Fold 2: rmse_log=0.5166, sigma_va=0.5166, gate=0.3500


[150]	valid_0's rmse: 0.545897
[300]	valid_0's rmse: 0.545592
[450]	valid_0's rmse: 0.545453
[600]	valid_0's rmse: 0.545312
[750]	valid_0's rmse: 0.545267
[900]	valid_0's rmse: 0.545162
[1050]	valid_0's rmse: 0.545091
[1200]	valid_0's rmse: 0.544819
[1350]	valid_0's rmse: 0.544526
[1500]	valid_0's rmse: 0.544579
[1650]	valid_0's rmse: 0.544307
[1800]	valid_0's rmse: 0.544134


[1950]	valid_0's rmse: 0.544306
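
The fold-local anchor measures time from the first training pickup. With forward-chaining folds, every validation row comes after the training slice, so its `days_since_train_start` lies beyond anything the model saw during fitting, and a tree can only place such rows in the leaf covering the largest training values. A minimal sketch on synthetic timestamps; it uses timedelta arithmetic rather than the `int64` cast in the cell above, and the `days_since` helper is hypothetical.

```python
import pandas as pd

def days_since(dt, start):
    """Elapsed days from `start`, via timedelta arithmetic (an alternative to the int64 cast)."""
    return (dt - start).dt.total_seconds() / 86400.0

# Synthetic, time-ordered pickups: 10 training days, a 1-day gap, then 5 validation days
all_dt = pd.Series(pd.date_range('2009-01-01', periods=16, freq='D', tz='UTC'))
tr_dt, va_dt = all_dt.iloc[:10], all_dt.iloc[11:]

start = tr_dt.min()   # fold-local anchor: first training pickup
tr_days = days_since(tr_dt, start)
va_days = days_since(va_dt, start)

print('train range:', tr_days.min(), 'to', tr_days.max())
print('valid range:', va_days.min(), 'to', va_days.max())
# Every validation value exceeds the training maximum
print('all validation rows beyond training range:', bool((va_days > tr_days.max()).all()))
```

This is one plausible reason the anchor does not move the Stage 1 validation RMSE in the logs: whatever trend the model learns inside the training window is applied to validation rows as a single constant offset.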


In [37]:
# Stage 1 Gate Resolution — Add week_idx + retain fold-local continuous time with monotone (+1 only on days_since_train_start)
import numpy as np, pandas as pd

assert 'train_df' in globals() and 'make_time_folds_quantile' in globals(), 'Prereqs missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# Ordered data and target
train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True)
dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)

# Required Stage1 features (from mandate) — ensure they exist
trend_feats_base = [
    'days_since_start', 'year', 'month', 'after_hike',
    'days_since_start_x_after_hike', 'doy_sin', 'doy_cos',
    'dow_sin', 'dow_cos', 'is_holiday'
]
for c in trend_feats_base: assert c in train_ord.columns, f'Missing {c} in train_df'

# 5-fold time CV with 1-day gap
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

# Use same params as Cell 19 (no changes mandated)
s1_params = dict(
    objective='regression', metric='rmse',
    learning_rate=0.04, n_estimators=6000,
    num_leaves=128, min_data_in_leaf=400,
    feature_fraction=0.9, bagging_fraction=0.9, bagging_freq=1,
    max_bin=127, reg_lambda=2.0, random_state=2025, n_jobs=-1, verbose=-1
)

def add_fold_local_time(df_tr, df_va):
    tr = df_tr.copy(); va = df_va.copy()
    dt_tr_local = tr['pickup_datetime'].dt.tz_convert('America/New_York')
    dt_va_local = va['pickup_datetime'].dt.tz_convert('America/New_York')
    start_ns = int(dt_tr_local.min().value)
    tr_days = ((dt_tr_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    va_days = ((dt_va_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    tr['days_since_train_start'] = tr_days
    va['days_since_train_start'] = va_days
    tr['week_idx'] = np.floor(tr_days / 7.0).astype('int16')
    va['week_idx'] = np.floor(va_days / 7.0).astype('int16')
    return tr, va

# Monotone constraint ONLY on days_since_train_start
mono_map = {'days_since_train_start': 1}

gate_pass = True
fold_scores = []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx]; va = train_ord.iloc[va_idx]
    tr, va = add_fold_local_time(tr, va)
    use_feats = trend_feats_base + ['days_since_train_start', 'week_idx']
    for c in use_feats: assert c in tr.columns, f'Missing {c} in fold data'
    X_tr = tr[use_feats].astype('float32').values
    X_va = va[use_feats].astype('float32').values
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    mono_constraints = [mono_map.get(f, 0) for f in use_feats]
    model = lgb.LGBMRegressor(**{**s1_params, 'monotone_constraints': mono_constraints})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=400, verbose=False), lgb.log_evaluation(period=150)]
    )
    rmse = float(model.best_score_['valid_0']['rmse'])
    sigma_va = float(np.std(y_va))
    gate = min(0.35, 0.9 * sigma_va)
    print(f"[Stage1 WeekIdx Gate] Fold {i}: rmse_log={rmse:.4f}, sigma_va={sigma_va:.4f}, gate={gate:.4f}")
    fold_scores.append(rmse)
    if rmse > gate:
        gate_pass = False

assert gate_pass, f"CRITICAL: Stage1 gate failed on ≥1 folds. rmse_log per fold={fold_scores}"
print({'stage1_gate': 'PASSED', 'rmse_log_per_fold': [round(s,6) for s in fold_scores], 'note': 'Stage1 with week_idx + days_since_train_start (monotone on days_since_train_start only)'})

[150]	valid_0's rmse: 0.508674
[300]	valid_0's rmse: 0.508696
[Stage1 WeekIdx Gate] Fold 1: rmse_log=0.5086, sigma_va=0.5086, gate=0.3500

[150]	valid_0's rmse: 0.516915
[300]	valid_0's rmse: 0.517223
[450]	valid_0's rmse: 0.517421
[Stage1 WeekIdx Gate] Fold 2: rmse_log=0.5168, sigma_va=0.5166, gate=0.3500

[150]	valid_0's rmse: 0.545261
[300]	valid_0's rmse: 0.544777
[450]	valid_0's rmse: 0.544425
[600]	valid_0's rmse: 0.543969
[750]	valid_0's rmse: 0.543603
[900]	valid_0's rmse: 0.543318
[1050]	valid_0's rmse: 0.542997
[1200]	valid_0's rmse: 0.542752
[1350]	valid_0's rmse: 0.542533


In [38]:
# Stage 1 with categorical discrete time features + hard gate (per mandate)
import numpy as np
import pandas as pd

assert 'train_df' in globals() and 'make_time_folds_quantile' in globals(), 'Prereqs missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# Ordered data and target
train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True)
dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)

# Base Stage1 features from mandate (must already exist in train_ord)
trend_feats_base = [
    'days_since_start', 'year', 'month', 'after_hike',
    'days_since_start_x_after_hike', 'doy_sin', 'doy_cos',
    'dow_sin', 'dow_cos', 'is_holiday'
]
for c in trend_feats_base:
    assert c in train_ord.columns, f'Missing {c} in train_df'

# 5-fold time CV with 1-day gap
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

# Stage1 params (unchanged)
s1_params = dict(
    objective='regression', metric='rmse',
    learning_rate=0.04, n_estimators=6000,
    num_leaves=128, min_data_in_leaf=400,
    feature_fraction=0.9, bagging_fraction=0.9, bagging_freq=1,
    max_bin=127, reg_lambda=2.0, random_state=2025, n_jobs=-1, verbose=-1
)

def add_fold_local_time(df_tr, df_va):
    tr = df_tr.copy(); va = df_va.copy()
    dt_tr_local = tr['pickup_datetime'].dt.tz_convert('America/New_York')
    dt_va_local = va['pickup_datetime'].dt.tz_convert('America/New_York')
    start_ns = int(dt_tr_local.min().value)
    tr_days = ((dt_tr_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    va_days = ((dt_va_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    tr['days_since_train_start'] = tr_days
    va['days_since_train_start'] = va_days
    tr['week_idx'] = np.floor(tr_days / 7.0).astype('int16')
    va['week_idx'] = np.floor(va_days / 7.0).astype('int16')
    return tr, va

# Monotone constraint ONLY on the continuous fold-local time feature
mono_map = {'days_since_train_start': 1}
# Discrete temporal categorical features
cat_feats = ['year','month','week_idx','is_holiday','after_hike']

gate_pass = True
fold_scores = []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx]; va = train_ord.iloc[va_idx]
    tr, va = add_fold_local_time(tr, va)
    use_feats = trend_feats_base + ['days_since_train_start', 'week_idx']
    for c in use_feats:
        assert c in tr.columns, f'Missing {c} in fold data'
    X_tr = tr[use_feats].copy()
    X_va = va[use_feats].copy()
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    mono_constraints = [mono_map.get(f, 0) for f in use_feats]
    # Fit with categorical_feature specified
    model = lgb.LGBMRegressor(**{**s1_params, 'monotone_constraints': mono_constraints})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        categorical_feature=cat_feats,
        callbacks=[lgb.early_stopping(stopping_rounds=400, verbose=False), lgb.log_evaluation(period=150)]
    )
    rmse = float(model.best_score_['valid_0']['rmse'])
    sigma_va = float(np.std(y_va))
    gate = min(0.35, 0.9 * sigma_va)
    print(f"[Stage1 Categorical Gate] Fold {i}: rmse_log={rmse:.4f}, sigma_va={sigma_va:.4f}, gate={gate:.4f}")
    fold_scores.append(rmse)
    if rmse > gate:
        gate_pass = False

assert gate_pass, f"CRITICAL: Stage1 gate failed on ≥1 folds. rmse_log per fold={fold_scores}"
print({'stage1_gate': 'PASSED', 'rmse_log_per_fold': [round(s,6) for s in fold_scores], 'note': 'Categorical discrete time features + monotone on days_since_train_start'})



[150]	valid_0's rmse: 0.508502
[300]	valid_0's rmse: 0.508504
[450]	valid_0's rmse: 0.50852
[Stage1 Categorical Gate] Fold 1: rmse_log=0.5085, sigma_va=0.5086, gate=0.3500

[150]	valid_0's rmse: 0.517682
[300]	valid_0's rmse: 0.517783
[Stage1 Categorical Gate] Fold 2: rmse_log=0.5174, sigma_va=0.5166, gate=0.3500

[150]	valid_0's rmse: 0.552615
[300]	valid_0's rmse: 0.5516
[Stage1 Categorical Gate] Fold 3: rmse_log=0.5499, sigma_va=0.5403, gate=0.3500

[150]	valid_0's rmse: 0.558747
[300]	valid_0's rmse: 0.559394
[450]	valid_0's rmse: 0.559849
[Stage1 Categorical Gate] Fold 4: rmse_log=0.5583, sigma_va=0.5583, gate=0.3500

[150]	valid_0's rmse: 0.572944
[300]	valid_0's rmse: 0.573152
[450]	valid_0's rmse: 0.57342
[Stage1 Categorical Gate] Fold 5: rmse_log=0.5728, sigma_va=0.5730, gate=0.3500


AssertionError: CRITICAL: Stage1 gate failed on ≥1 folds. rmse_log per fold=[0.508498431160195, 0.51736767887961, 0.5498656296773534, 0.5583302984384716, 0.5727985519035138]
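
Every Stage 1 cell depends on `make_time_folds_quantile`, which is defined elsewhere in the notebook and not shown here. For orientation, the following is a minimal sketch of a forward-chaining splitter with a gap, assuming cut points are taken at equal-count quantiles of the timestamp. The name `sketch_time_folds_quantile` and all of its internals are hypothetical and may differ from the notebook's actual helper.

```python
import numpy as np
import pandas as pd

def sketch_time_folds_quantile(dt, n_folds=5, gap_days=1):
    """Hypothetical forward-chaining splitter (NOT the notebook's helper).

    dt: datetime Series sorted ascending with a default RangeIndex.
    The timeline is cut into (n_folds + 1) equal-count slices; slice k+1 is the
    validation window of fold k, and training rows must end more than
    gap_days before that window starts.
    """
    ns = dt.astype('int64').to_numpy()                    # ns since epoch
    edges = np.quantile(ns, np.linspace(0.0, 1.0, n_folds + 2))
    gap_ns = gap_days * 24 * 3600 * 10**9
    folds = []
    for k in range(1, n_folds + 1):
        va_start = edges[k]
        va_end = np.inf if k == n_folds else edges[k + 1]  # last window is open-ended
        va_mask = (ns >= va_start) & (ns < va_end)
        tr_mask = ns < (va_start - gap_ns)                 # enforce the gap
        folds.append((np.flatnonzero(tr_mask), np.flatnonzero(va_mask)))
    return folds

# Toy timeline: 600 hourly pickups (25 days), made up for illustration
dt = pd.Series(pd.date_range('2012-01-01', periods=600, freq='60min', tz='UTC'))
folds = sketch_time_folds_quantile(dt, n_folds=5, gap_days=1)
for tr_idx, va_idx in folds:
    gap = dt.iloc[va_idx].min() - dt.iloc[tr_idx].max()
    print(len(tr_idx), len(va_idx), gap)
```

With a splitter of this shape, each validation window lies strictly after its training rows, which is why fold-local features such as `week_idx` take values in validation that were never seen in training.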

In [39]:
# Stage 1 — Strategic Reset: cyclical-safe discrete time as categoricals + fold-local trend with monotone (+1)
import numpy as np, pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as USCal

assert 'train_df' in globals() and 'make_time_folds_quantile' in globals(), 'Prereqs missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# Ordered data and target
train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True)
dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)

# 5-fold time CV with 1-day gap
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

# Helper: add fold-local continuous time anchor and discrete time categoricals (NY local)
def build_stage1_views(df_tr, df_va):
    tr = df_tr.copy(); va = df_va.copy()
    dt_tr_loc = tr['pickup_datetime'].dt.tz_convert('America/New_York')
    dt_va_loc = va['pickup_datetime'].dt.tz_convert('America/New_York')
    # Fold-local continuous time
    start_ns = int(dt_tr_loc.min().value)
    tr['days_since_train_start'] = ((dt_tr_loc.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    va['days_since_train_start'] = ((dt_va_loc.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    # Discrete time features (raw categoricals for trees)
    tr['year'] = dt_tr_loc.dt.year.astype('int16'); va['year'] = dt_va_loc.dt.year.astype('int16')
    tr['month'] = dt_tr_loc.dt.month.astype('int8'); va['month'] = dt_va_loc.dt.month.astype('int8')
    tr['week_of_year'] = dt_tr_loc.dt.isocalendar().week.astype('int16')
    va['week_of_year'] = dt_va_loc.dt.isocalendar().week.astype('int16')
    tr['day_of_week'] = dt_tr_loc.dt.dayofweek.astype('int8'); va['day_of_week'] = dt_va_loc.dt.dayofweek.astype('int8')
    tr['hour'] = dt_tr_loc.dt.hour.astype('int8'); va['hour'] = dt_va_loc.dt.hour.astype('int8')
    # Use existing holiday/after_hike if present, else compute quickly
    if 'is_holiday' not in tr.columns or 'is_holiday' not in va.columns:
        start = dt_tr_loc.min().normalize().tz_localize(None)
        end = dt_va_loc.max().normalize().tz_localize(None)
        hol = USCal().holidays(start=start, end=end)
        hol_dates = set(pd.to_datetime(hol).date)
        tr['is_holiday'] = dt_tr_loc.dt.date.map(lambda d: d in hol_dates).astype('int8')
        va['is_holiday'] = dt_va_loc.dt.date.map(lambda d: d in hol_dates).astype('int8')
    if 'after_hike' not in tr.columns or 'after_hike' not in va.columns:
        cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
        tr['after_hike'] = (dt_tr_loc >= cutoff).astype('int8')
        va['after_hike'] = (dt_va_loc >= cutoff).astype('int8')
    return tr, va

# Feature lists
cont_feat = ['days_since_train_start']
cat_feats = ['year','month','week_of_year','day_of_week','hour','is_holiday','after_hike']
use_feats = cont_feat + cat_feats

# LightGBM params (with contingency tweaks for categorical splitting robustness)
s1_params = dict(
    objective='regression', metric='rmse',
    learning_rate=0.04, n_estimators=6000,
    num_leaves=128, min_data_in_leaf=100,  # contingency tweak from 400 -> 100
    feature_fraction=0.9, bagging_fraction=0.9, bagging_freq=1,
    max_bin=255, reg_lambda=10.0,        # contingency tweak from 2.0 -> 10.0
    random_state=2025, n_jobs=-1, verbose=-1
)

gate_pass = True
fold_scores = []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx]; va = train_ord.iloc[va_idx]
    tr, va = build_stage1_views(tr, va)
    X_tr = tr[use_feats].copy(); X_va = va[use_feats].copy()
    # Explicit categorical dtype casting (robust for LightGBM)
    for c in cat_feats:
        X_tr[c] = X_tr[c].astype('category')
        X_va[c] = X_va[c].astype('category')
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    # Monotone constraint only on continuous trend anchor
    mono = [1] + [0]*(len(use_feats)-1)
    model = lgb.LGBMRegressor(**{**s1_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        categorical_feature='auto',
        callbacks=[lgb.early_stopping(stopping_rounds=400, verbose=False), lgb.log_evaluation(period=150)]
    )
    rmse = float(model.best_score_['valid_0']['rmse'])
    sigma_va = float(np.std(y_va))
    gate = min(0.35, 0.9 * sigma_va)
    print(f"[Stage1 RESET Gate] Fold {i}: rmse_log={rmse:.4f}, sigma_va={sigma_va:.4f}, gate={gate:.4f}")
    fold_scores.append(rmse)
    if rmse > gate:
        gate_pass = False

assert gate_pass, f"CRITICAL: Stage1 gate failed on ≥1 folds. rmse_log per fold={fold_scores}"
print({'stage1_gate': 'PASSED', 'rmse_log_per_fold': [round(s,6) for s in fold_scores], 'note': 'Stage1 strategic reset with robust categoricals + monotone on fold-local trend'})

[150]	valid_0's rmse: 0.508046
[300]	valid_0's rmse: 0.508864
[Stage1 RESET Gate] Fold 1: rmse_log=0.5075, sigma_va=0.5086, gate=0.3500

[150]	valid_0's rmse: 0.515614
[300]	valid_0's rmse: 0.516112
[450]	valid_0's rmse: 0.516706
[Stage1 RESET Gate] Fold 2: rmse_log=0.5154, sigma_va=0.5166, gate=0.3500

[150]	valid_0's rmse: 0.543581
[300]	valid_0's rmse: 0.543605
[450]	valid_0's rmse: 0.543213
[600]	valid_0's rmse: 0.543053
[750]	valid_0's rmse: 0.542857
[900]	valid_0's rmse: 0.542584
[1050]	valid_0's rmse: 0.542396
[1200]	valid_0's rmse: 0.542287
[1350]	valid_0's rmse: 0.542199
[1500]	valid_0's rmse: 0.542128
[1650]	valid_0's rmse: 0.542011
[1800]	valid_0's rmse: 0.541919
[1950]	valid_0's rmse: 0.541831
[2100]	valid_0's rmse: 0.541756
[2250]	valid_0's rmse: 0.541675
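
One subtlety in the `week_of_year` feature built in this cell: `dt.isocalendar().week` follows ISO 8601, so the first days of January can belong to week 52 or 53 of the previous ISO year, and the last days of December can belong to week 1 of the next. Combined with the calendar `year` column, such rows share a (year, week) pair with dates almost a year away. The sketch below shows the effect on a few hand-picked dates. Whether it matters for this model is an open question, and the `iso_year` column is only one possible remedy, not something the notebook uses.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2011-12-31', '2012-01-01', '2012-01-02', '2012-12-31']))
iso = dates.dt.isocalendar()  # DataFrame with columns: year, week, day

demo = pd.DataFrame({
    'date': dates.dt.date,
    'calendar_year': dates.dt.year,
    'iso_year': iso['year'].astype('int16'),
    'week_of_year': iso['week'].astype('int16'),
})
print(demo)
# 2012-01-01 is a Sunday: calendar_year=2012 but ISO year 2011, week 52
# 2012-12-31 is a Monday: calendar_year=2012 but ISO year 2013, week 1
```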


In [40]:
# Stage 1 — Encoding Fix: shared CategoricalDtype across folds; hard gate enforced
import numpy as np, pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as USCal

assert 'train_df' in globals() and 'make_time_folds_quantile' in globals(), 'Prereqs missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# Ordered data and target
train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True)
dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)

# Helper: build global discrete time columns once (NY local) to derive stable category sets
dt_loc_all = train_ord['pickup_datetime'].dt.tz_convert('America/New_York')
global_disc = pd.DataFrame(index=train_ord.index)
global_disc['year'] = dt_loc_all.dt.year.astype('int16')
global_disc['month'] = dt_loc_all.dt.month.astype('int8')
global_disc['week_of_year'] = dt_loc_all.dt.isocalendar().week.astype('int16')
global_disc['day_of_week'] = dt_loc_all.dt.dayofweek.astype('int8')
global_disc['hour'] = dt_loc_all.dt.hour.astype('int8')
# is_holiday, after_hike (compute if not already present)
cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
if 'is_holiday' in train_ord.columns:
    global_disc['is_holiday'] = train_ord['is_holiday'].astype('int8')
else:
    hol = USCal().holidays(start=dt_loc_all.min().normalize().tz_localize(None), end=dt_loc_all.max().normalize().tz_localize(None))
    hol_dates = set(pd.to_datetime(hol).date)
    global_disc['is_holiday'] = dt_loc_all.dt.date.map(lambda d: d in hol_dates).astype('int8')
if 'after_hike' in train_ord.columns:
    global_disc['after_hike'] = train_ord['after_hike'].astype('int8')
else:
    global_disc['after_hike'] = (dt_loc_all >= cutoff).astype('int8')

# Build shared CategoricalDtype per categorical feature using union of categories from full train
cat_feats = ['year','month','week_of_year','day_of_week','hour','is_holiday','after_hike']
cat_dtype_map = {}
for c in cat_feats:
    all_cats = pd.Index(global_disc[c].unique())
    cat_dtype_map[c] = pd.CategoricalDtype(categories=all_cats.tolist(), ordered=False)

# 5-fold time CV with 1-day gap
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

# Params (unchanged from Cell 22 per mandate)
s1_params = dict(
    objective='regression', metric='rmse',
    learning_rate=0.04, n_estimators=6000,
    num_leaves=128, min_data_in_leaf=100,
    feature_fraction=0.9, bagging_fraction=0.9, bagging_freq=1,
    max_bin=255, reg_lambda=10.0,
    random_state=2025, n_jobs=-1, verbose=-1
)

def build_fold_views(df_tr, df_va):
    tr = df_tr.copy(); va = df_va.copy()
    dt_tr_loc = tr['pickup_datetime'].dt.tz_convert('America/New_York')
    dt_va_loc = va['pickup_datetime'].dt.tz_convert('America/New_York')
    # Fold-local continuous time anchor
    start_ns = int(dt_tr_loc.min().value)
    tr['days_since_train_start'] = ((dt_tr_loc.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    va['days_since_train_start'] = ((dt_va_loc.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    # Discrete time features (same definitions as global_disc)
    tr['year'] = dt_tr_loc.dt.year.astype('int16'); va['year'] = dt_va_loc.dt.year.astype('int16')
    tr['month'] = dt_tr_loc.dt.month.astype('int8'); va['month'] = dt_va_loc.dt.month.astype('int8')
    tr['week_of_year'] = dt_tr_loc.dt.isocalendar().week.astype('int16'); va['week_of_year'] = dt_va_loc.dt.isocalendar().week.astype('int16')
    tr['day_of_week'] = dt_tr_loc.dt.dayofweek.astype('int8'); va['day_of_week'] = dt_va_loc.dt.dayofweek.astype('int8')
    tr['hour'] = dt_tr_loc.dt.hour.astype('int8'); va['hour'] = dt_va_loc.dt.hour.astype('int8')
    if 'is_holiday' not in tr.columns or 'is_holiday' not in va.columns:
        hol = USCal().holidays(start=dt_tr_loc.min().normalize().tz_localize(None), end=dt_va_loc.max().normalize().tz_localize(None))
        hol_dates = set(pd.to_datetime(hol).date)
        tr['is_holiday'] = dt_tr_loc.dt.date.map(lambda d: d in hol_dates).astype('int8')
        va['is_holiday'] = dt_va_loc.dt.date.map(lambda d: d in hol_dates).astype('int8')
    if 'after_hike' not in tr.columns or 'after_hike' not in va.columns:
        tr['after_hike'] = (dt_tr_loc >= cutoff).astype('int8')
        va['after_hike'] = (dt_va_loc >= cutoff).astype('int8')
    return tr, va

use_feats = ['days_since_train_start'] + cat_feats
mono = [1] + [0]*(len(use_feats)-1)  # +1 only on continuous trend anchor

gate_pass = True
fold_scores = []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx]; va = train_ord.iloc[va_idx]
    tr, va = build_fold_views(tr, va)
    X_tr = tr[use_feats].copy(); X_va = va[use_feats].copy()
    # Apply shared CategoricalDtype per feature (consistent encoding across folds)
    for c in cat_feats:
        X_tr[c] = X_tr[c].astype(cat_dtype_map[c])
        X_va[c] = X_va[c].astype(cat_dtype_map[c])
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    model = lgb.LGBMRegressor(**{**s1_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        categorical_feature='auto',
        callbacks=[lgb.early_stopping(stopping_rounds=400, verbose=False), lgb.log_evaluation(period=150)]
    )
    rmse = float(model.best_score_['valid_0']['rmse'])
    sigma_va = float(np.std(y_va))
    gate = min(0.35, 0.9 * sigma_va)
    print(f"[Stage1 ENCODING FIX Gate] Fold {i}: rmse_log={rmse:.4f}, sigma_va={sigma_va:.4f}, gate={gate:.4f}")
    fold_scores.append(rmse)
    if rmse > gate:
        gate_pass = False

assert gate_pass, f"CRITICAL: Stage1 gate failed on ≥1 folds. rmse_log per fold={fold_scores}"
print({'stage1_gate': 'PASSED', 'rmse_log_per_fold': [round(s,6) for s in fold_scores], 'note': 'Shared categorical dtypes across folds; monotone on fold-local time'})

[150]	valid_0's rmse: 0.508046
[300]	valid_0's rmse: 0.508864
[Stage1 ENCODING FIX Gate] Fold 1: rmse_log=0.5075, sigma_va=0.5086, gate=0.3500

[150]	valid_0's rmse: 0.515613
[300]	valid_0's rmse: 0.516112
[450]	valid_0's rmse: 0.516705
[Stage1 ENCODING FIX Gate] Fold 2: rmse_log=0.5154, sigma_va=0.5166, gate=0.3500

[150]	valid_0's rmse: 0.543584
[300]	valid_0's rmse: 0.543605
[450]	valid_0's rmse: 0.543213
[600]	valid_0's rmse: 0.543054
[750]	valid_0's rmse: 0.542858
[900]	valid_0's rmse: 0.542585
[1050]	valid_0's rmse: 0.542396
[1200]	valid_0's rmse: 0.542286
[1350]	valid_0's rmse: 0.542198
[1500]	valid_0's rmse: 0.542126
[1650]	valid_0's rmse: 0.542008
[1800]	valid_0's rmse: 0.541916
[1950]	valid_0's rmse: 0.541831
[2100]	valid_0's rmse: 0.541755
[2250]	valid_0's rmse: 0.541676
[2400]	valid_0's rmse: 0.541648
[2550]	valid_0's rmse: 0.541627
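
The encoding fix in this cell rests on how pandas assigns category codes. Here is a small sketch of the difference, using made-up month values (the series `tr_month` and `va_month` are illustrative, not notebook data). Casting each frame with `astype('category')` derives codes from that frame's own observed values, while a shared `CategoricalDtype` pins the same code to the same value everywhere.

```python
import pandas as pd

tr_month = pd.Series([1, 2, 3])   # months seen in a training slice
va_month = pd.Series([3, 4])      # months seen in the later validation slice

# Per-frame casting: codes depend on each frame's own sorted unique values
print(tr_month.astype('category').cat.codes.tolist())  # [0, 1, 2]
print(va_month.astype('category').cat.codes.tolist())  # [0, 1]  (month 3 is now code 0)

# Shared dtype: month 3 maps to the same code in both frames
shared = pd.CategoricalDtype(categories=[1, 2, 3, 4], ordered=False)
tr_codes = tr_month.astype(shared).cat.codes.tolist()
va_codes = va_month.astype(shared).cat.codes.tolist()
print(tr_codes, va_codes)  # [0, 1, 2] [2, 3]
```

In this run the fold 1 and fold 2 scores printed above (0.5075 and 0.5154) match the previous cell to four decimals, so the shared dtype did not change the outcome here.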


In [42]:
# Stage 1 Pivot — Linear Ridge with proper one-hot encoding and hard gate
import numpy as np, pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import Ridge

assert 'train_df' in globals() and 'make_time_folds_quantile' in globals(), 'Prereqs missing'

train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True)
dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)

K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

def build_fold_views(df_tr, df_va):
    tr = df_tr.copy(); va = df_va.copy()
    dt_tr_loc = tr['pickup_datetime'].dt.tz_convert('America/New_York')
    dt_va_loc = va['pickup_datetime'].dt.tz_convert('America/New_York')
    # Fold-local continuous trend
    start_ns = int(dt_tr_loc.min().value)
    tr['days_since_train_start'] = ((dt_tr_loc.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    va['days_since_train_start'] = ((dt_va_loc.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    # Discrete time categoricals
    tr['year'] = dt_tr_loc.dt.year.astype('int16'); va['year'] = dt_va_loc.dt.year.astype('int16')
    tr['month'] = dt_tr_loc.dt.month.astype('int8'); va['month'] = dt_va_loc.dt.month.astype('int8')
    tr['week_of_year'] = dt_tr_loc.dt.isocalendar().week.astype('int16'); va['week_of_year'] = dt_va_loc.dt.isocalendar().week.astype('int16')
    tr['day_of_week'] = dt_tr_loc.dt.dayofweek.astype('int8'); va['day_of_week'] = dt_va_loc.dt.dayofweek.astype('int8')
    tr['hour'] = dt_tr_loc.dt.hour.astype('int8'); va['hour'] = dt_va_loc.dt.hour.astype('int8')
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    tr['after_hike'] = (dt_tr_loc >= cutoff).astype('int8'); va['after_hike'] = (dt_va_loc >= cutoff).astype('int8')
    # Holiday flag (fast local window)
    from pandas.tseries.holiday import USFederalHolidayCalendar as USCal
    hol = USCal().holidays(start=dt_tr_loc.min().normalize().tz_localize(None), end=dt_va_loc.max().normalize().tz_localize(None))
    hol_dates = set(pd.to_datetime(hol).date)
    tr['is_holiday'] = dt_tr_loc.dt.date.map(lambda d: d in hol_dates).astype('int8')
    va['is_holiday'] = dt_va_loc.dt.date.map(lambda d: d in hol_dates).astype('int8')
    use_cols = ['days_since_train_start','year','month','week_of_year','day_of_week','hour','is_holiday','after_hike']
    return tr[use_cols], va[use_cols]

cont_cols = ['days_since_train_start']
cat_cols = ['year','month','week_of_year','day_of_week','hour','is_holiday','after_hike']
pre = ColumnTransformer([
    ('cont', StandardScaler(with_mean=True, with_std=True), cont_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=True), cat_cols)
])
ridge = Ridge(alpha=5.0)
pipe = Pipeline([('pre', pre), ('model', ridge)])

gate_pass = True
fold_scores = []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    X_tr_df, X_va_df = build_fold_views(train_ord.iloc[tr_idx], train_ord.iloc[va_idx])
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    pipe.fit(X_tr_df, y_tr)
    pred_va = pipe.predict(X_va_df)
    rmse = float(np.sqrt(np.mean((pred_va - y_va)**2)))
    sigma_va = float(np.std(y_va))
    gate = min(0.35, 0.9 * sigma_va)
    print(f"[Stage1 Ridge Gate] Fold {i}: rmse_log={rmse:.4f}, sigma_va={sigma_va:.4f}, gate={gate:.4f}")
    fold_scores.append(rmse)
    if rmse > gate: gate_pass = False

assert gate_pass, f"CRITICAL: Stage1 gate failed on ≥1 folds. rmse_log per fold={fold_scores}"
print({'stage1_gate': 'PASSED', 'rmse_log_per_fold': [round(s,6) for s in fold_scores], 'note': 'Linear Ridge Stage1 with one-hot categoricals and fold-local trend'})

[Stage1 Ridge Gate] Fold 1: rmse_log=0.5090, sigma_va=0.5086, gate=0.3500
[Stage1 Ridge Gate] Fold 2: rmse_log=0.5179, sigma_va=0.5166, gate=0.3500
[Stage1 Ridge Gate] Fold 3: rmse_log=0.5416, sigma_va=0.5403, gate=0.3500
[Stage1 Ridge Gate] Fold 4: rmse_log=0.5600, sigma_va=0.5583, gate=0.3500
[Stage1 Ridge Gate] Fold 5: rmse_log=0.5718, sigma_va=0.5730, gate=0.3500


AssertionError: CRITICAL: Stage1 gate failed on ≥1 folds. rmse_log per fold=[0.5089564672389532, 0.5179140568447872, 0.5415669378943179, 0.5599786132407165, 0.5717733052967201]
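
To read the gate numbers above: a model that always predicts the validation mean has RMSE equal to `sigma_va` (`np.std` with its default `ddof=0`), so `1 - (rmse / sigma_va)**2` is the share of validation variance explained. The sketch below applies that identity to the Ridge fold scores printed above and to the 0.35 gate. The function name `r2_from_rmse` is introduced here for illustration only.

```python
def r2_from_rmse(rmse, sigma):
    """Share of variance explained relative to predicting the validation mean."""
    return 1.0 - (rmse / sigma) ** 2

# (rmse_log, sigma_va) pairs copied from the Ridge gate log above
ridge_folds = [
    (0.5090, 0.5086), (0.5179, 0.5166), (0.5416, 0.5403),
    (0.5600, 0.5583), (0.5718, 0.5730),
]

for i, (rmse, sigma) in enumerate(ridge_folds, 1):
    achieved = r2_from_rmse(rmse, sigma)
    required = r2_from_rmse(0.35, sigma)  # R^2 needed to reach the 0.35 gate
    print(f"Fold {i}: achieved R2={achieved:+.4f}, required R2={required:.4f}")
```

Under this reading, the time-only features would have to explain more than half of the log-fare variance in every fold to pass the gate, while the achieved values sit near zero and are slightly negative on several folds.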

In [43]:
# Stage 1 — Contingency Tree Params: softer reg + higher bin resolution; shared categoricals; hard gate
import numpy as np, pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as USCal

assert 'train_df' in globals() and 'make_time_folds_quantile' in globals(), 'Prereqs missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# Ordered data and target
train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True)
dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)
dt_loc_all = train_ord['pickup_datetime'].dt.tz_convert('America/New_York')

# Build global discrete columns for stable category sets
global_disc = pd.DataFrame(index=train_ord.index)
global_disc['year'] = dt_loc_all.dt.year.astype('int16')
global_disc['month'] = dt_loc_all.dt.month.astype('int8')
global_disc['week_of_year'] = dt_loc_all.dt.isocalendar().week.astype('int16')
global_disc['day_of_week'] = dt_loc_all.dt.dayofweek.astype('int8')
global_disc['hour'] = dt_loc_all.dt.hour.astype('int8')
cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
if 'is_holiday' in train_ord.columns:
    global_disc['is_holiday'] = train_ord['is_holiday'].astype('int8')
else:
    hol = USCal().holidays(start=dt_loc_all.min().normalize().tz_localize(None), end=dt_loc_all.max().normalize().tz_localize(None))
    hol_dates = set(pd.to_datetime(hol).date)
    global_disc['is_holiday'] = dt_loc_all.dt.date.map(lambda d: d in hol_dates).astype('int8')
if 'after_hike' in train_ord.columns:
    global_disc['after_hike'] = train_ord['after_hike'].astype('int8')
else:
    global_disc['after_hike'] = (dt_loc_all >= cutoff).astype('int8')

cat_feats = ['year','month','week_of_year','day_of_week','hour','is_holiday','after_hike']
cat_dtype_map = {c: pd.CategoricalDtype(categories=pd.Index(global_disc[c].unique()).tolist(), ordered=False) for c in cat_feats}

# 5-fold time CV with 1-day gap
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

# Contingency params (per audit): min_data_in_leaf=50, reg_lambda=5.0, max_bin=511
s1_params = dict(
    objective='regression', metric='rmse',
    learning_rate=0.04, n_estimators=8000,
    num_leaves=128, min_data_in_leaf=50,
    feature_fraction=0.9, bagging_fraction=0.9, bagging_freq=1,
    max_bin=511, reg_lambda=5.0,
    random_state=2025, n_jobs=-1, verbose=-1
)

def build_fold_views(df_tr, df_va):
    tr = df_tr.copy(); va = df_va.copy()
    dt_tr = tr['pickup_datetime'].dt.tz_convert('America/New_York')
    dt_va = va['pickup_datetime'].dt.tz_convert('America/New_York')
    start_ns = int(dt_tr.min().value)
    tr['days_since_train_start'] = ((dt_tr.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    va['days_since_train_start'] = ((dt_va.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    tr['year'] = dt_tr.dt.year.astype('int16'); va['year'] = dt_va.dt.year.astype('int16')
    tr['month'] = dt_tr.dt.month.astype('int8'); va['month'] = dt_va.dt.month.astype('int8')
    tr['week_of_year'] = dt_tr.dt.isocalendar().week.astype('int16'); va['week_of_year'] = dt_va.dt.isocalendar().week.astype('int16')
    tr['day_of_week'] = dt_tr.dt.dayofweek.astype('int8'); va['day_of_week'] = dt_va.dt.dayofweek.astype('int8')
    tr['hour'] = dt_tr.dt.hour.astype('int8'); va['hour'] = dt_va.dt.hour.astype('int8')
    if 'is_holiday' not in tr.columns or 'is_holiday' not in va.columns:
        hol = USCal().holidays(start=dt_tr.min().normalize().tz_localize(None), end=dt_va.max().normalize().tz_localize(None))
        hol_dates = set(pd.to_datetime(hol).date)
        tr['is_holiday'] = dt_tr.dt.date.map(lambda d: d in hol_dates).astype('int8')
        va['is_holiday'] = dt_va.dt.date.map(lambda d: d in hol_dates).astype('int8')
    if 'after_hike' not in tr.columns or 'after_hike' not in va.columns:
        tr['after_hike'] = (dt_tr >= cutoff).astype('int8')
        va['after_hike'] = (dt_va >= cutoff).astype('int8')
    return tr, va

use_feats = ['days_since_train_start'] + cat_feats
mono = [1] + [0]*(len(use_feats)-1)

gate_pass = True
fold_scores = []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx]; va = train_ord.iloc[va_idx]
    tr, va = build_fold_views(tr, va)
    X_tr = tr[use_feats].copy(); X_va = va[use_feats].copy()
    for c in cat_feats:
        X_tr[c] = X_tr[c].astype(cat_dtype_map[c])
        X_va[c] = X_va[c].astype(cat_dtype_map[c])
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    model = lgb.LGBMRegressor(**{**s1_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        categorical_feature='auto',
        callbacks=[lgb.early_stopping(stopping_rounds=500, verbose=False), lgb.log_evaluation(period=200)]
    )
    rmse = float(model.best_score_['valid_0']['rmse'])
    sigma_va = float(np.std(y_va))
    gate = min(0.35, 0.9 * sigma_va)
    print(f"[Stage1 CONTINGENCY Gate] Fold {i}: rmse_log={rmse:.4f}, sigma_va={sigma_va:.4f}, gate={gate:.4f}")
    fold_scores.append(rmse)
    if rmse > gate: gate_pass = False

assert gate_pass, f"CRITICAL: Stage1 gate failed on ≥1 folds. rmse_log per fold={fold_scores}"
print({'stage1_gate': 'PASSED', 'rmse_log_per_fold': [round(s,6) for s in fold_scores], 'note': 'Contingency params applied: min_data_in_leaf=50, reg_lambda=5, max_bin=511'})

[200]	valid_0's rmse: 0.509027
[400]	valid_0's rmse: 0.51254
[Stage1 CONTINGENCY Gate] Fold 1: rmse_log=0.5076, sigma_va=0.5086, gate=0.3500

[200]	valid_0's rmse: 0.515904
[400]	valid_0's rmse: 0.516957
[Stage1 CONTINGENCY Gate] Fold 2: rmse_log=0.5154, sigma_va=0.5166, gate=0.3500

[200]	valid_0's rmse: 0.543095
[400]	valid_0's rmse: 0.542439
[600]	valid_0's rmse: 0.541852
[800]	valid_0's rmse: 0.541388
[1000]	valid_0's rmse: 0.541059
[1200]	valid_0's rmse: 0.540733
[1400]	valid_0's rmse: 0.540648
[1600]	valid_0's rmse: 0.540646
[1800]	valid_0's rmse: 0.540797
[2000]	valid_0's rmse: 0.540998
[Stage1 CONTINGENCY Gate] Fold 3: rmse_log=0.5406, sigma_va=0.5403, gate=0.3500

[200]	valid_0's rmse: 0.557078
[400]	valid_0's rmse: 0.557447


In [45]:
# Path B — Single-Model LightGBM with per-fold TE (pu/do only) + counts; debug rmse in log/original; monotone time trend
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals(), 'Prepared train/test required'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

def ensure_days_since_start(df, start_ns=None):
    # start_ns: shared time anchor (UTC epoch ns); if None, fall back to df's own earliest pickup
    if 'days_since_start' in df.columns: return df
    d = df.copy(); dt_local = d['pickup_datetime'].dt.tz_convert('America/New_York')
    if start_ns is None: start_ns = int(dt_local.min().value)
    d['days_since_start'] = ((dt_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d

def ensure_spatial_bins(df, res=0.005):
    if all(c in df.columns for c in ['pu_bin','do_bin','pair_bin']): return df
    d = df.copy()
    pu_lat_bin = np.floor(d['pickup_latitude'] / res).astype('int32')
    pu_lon_bin = np.floor(d['pickup_longitude'] / res).astype('int32')
    do_lat_bin = np.floor(d['dropoff_latitude'] / res).astype('int32')
    do_lon_bin = np.floor(d['dropoff_longitude'] / res).astype('int32')
    d['pu_bin'] = (pu_lat_bin.astype(str) + '_' + pu_lon_bin.astype(str))
    d['do_bin'] = (do_lat_bin.astype(str) + '_' + do_lon_bin.astype(str))
    d['pair_bin'] = (d['pu_bin'] + '|' + d['do_bin'])
    return d

# Anchor train and test at the same origin (earliest train pickup) so that
# days_since_start is comparable across both frames when it has to be computed here
train_start_ns = int(train_df['pickup_datetime'].min().value)
train_df = ensure_days_since_start(train_df, start_ns=train_start_ns)
test_df = ensure_days_since_start(test_df, start_ns=train_start_ns)
train_df = ensure_spatial_bins(train_df)
test_df = ensure_spatial_bins(test_df)

base_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'year','month','day','hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday','days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_df.columns: base_feats.append('rot_manh_km')
if 'dist_x_after_hike' in train_df.columns: base_feats.append('dist_x_after_hike')

train_ord = train_df.dropna(subset=base_feats + ['fare_amount']).sort_values('pickup_datetime').reset_index(drop=True)
test_df[base_feats] = test_df[base_feats].fillna(0)
dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)

K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

def te_smooth(train_key, train_tgt, apply_key, m=500.0, prior=None):
    if prior is None: prior = float(np.mean(train_tgt))
    g = pd.DataFrame({'k': train_key.astype('object'), 'y': train_tgt}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict()
    enc = apply_key.astype('object').map(mp).fillna(prior).astype('float32').values
    return enc, prior, mp

def build_fold_mats(df, tr_idx, va_idx):
    tr = df.iloc[tr_idx].copy(); va = df.iloc[va_idx].copy()
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    prior = float(y_tr.mean())
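    # TE on pu/do only (ablate pair for stability).
    # Caveat: training rows are encoded in-sample (each row's own target enters its bin mean);
    # the m=600 smoothing dampens this but does not make the encoding out-of-fold.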
    tr['te_pu'], _, _ = te_smooth(tr['pu_bin'], y_tr, tr['pu_bin'], m=600.0, prior=prior)
    va['te_pu'], _, _ = te_smooth(tr['pu_bin'], y_tr, va['pu_bin'], m=600.0, prior=prior)
    tr['te_do'], _, _ = te_smooth(tr['do_bin'], y_tr, tr['do_bin'], m=600.0, prior=prior)
    va['te_do'], _, _ = te_smooth(tr['do_bin'], y_tr, va['do_bin'], m=600.0, prior=prior)
    # Frequency encodings (log-counts) from train-only maps
    pu_cnt = tr['pu_bin'].astype('object').value_counts().astype('int32')
    do_cnt = tr['do_bin'].astype('object').value_counts().astype('int32')
    pair_cnt = tr['pair_bin'].astype('object').value_counts().astype('int32')
    for d in (tr, va):
        d['log_pu_cnt'] = np.log1p(d['pu_bin'].astype('object').map(pu_cnt).fillna(0).astype('int32')).astype('float32')
        d['log_do_cnt'] = np.log1p(d['do_bin'].astype('object').map(do_cnt).fillna(0).astype('int32')).astype('float32')
        d['log_pair_cnt'] = np.log1p(d['pair_bin'].astype('object').map(pair_cnt).fillna(0).astype('int32')).astype('float32')
    use_cols = base_feats + ['te_pu','te_do','log_pu_cnt','log_do_cnt','log_pair_cnt']
    X_tr = tr[use_cols].astype('float32').values; X_va = va[use_cols].astype('float32').values
    return X_tr, X_va, y_tr, y_va, use_cols

SEED = 2025
lgb_params = dict(
    objective='regression',
    metric='rmse',
    learning_rate=0.045,
    n_estimators=25000,
    num_leaves=256,
    max_depth=-1,
    min_data_in_leaf=120,
    feature_fraction=0.85,
    bagging_fraction=0.8,
    bagging_freq=1,
    max_bin=255,
    reg_alpha=0.0,
    reg_lambda=5.0,
    random_state=SEED,
    n_jobs=-1,
    verbose=-1
)

cv_rmses, best_iters = [], []
final_use_cols = None
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    X_tr, X_va, y_tr, y_va, use_cols = build_fold_mats(train_ord, tr_idx, va_idx)
    mono = [0]*len(use_cols)
    if 'days_since_start' in use_cols:
        mono[use_cols.index('days_since_start')] = 1
    model = lgb.LGBMRegressor(**{**lgb_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=800, verbose=False), lgb.log_evaluation(period=200)]
    )
    y_pred_val_log = model.predict(X_va, num_iteration=model.best_iteration_)
    y_pred_val = np.expm1(y_pred_val_log)
    y_true_val = np.expm1(y_va)
    rmse = float(root_mean_squared_error(y_true_val, y_pred_val))
    rmse_log = float(np.sqrt(np.mean((y_pred_val_log - y_va)**2)))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or model.n_estimators))
    final_use_cols = use_cols
    print(f"[Single-Model] Fold {i}: RMSE={rmse:.5f}, RMSE_log={rmse_log:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})")

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
# Tree count for the full-data refit: median of the per-fold best iterations, floored at 1000
best_iter_final = int(np.clip(int(np.median(best_iters)), 1000, lgb_params['n_estimators']))
print({'cv_rmse_mean': round(cv_mean, 6), 'cv_rmse_std': round(cv_std, 6), 'best_iter_final': best_iter_final, 'note': 'Single-Model per-fold pu/do TE + counts + monotone(days_since_start)'})

# Final fit on all data with full-data TE maps
full = train_ord.copy(); tst = test_df.copy()
y_full_log = y_all_log
prior_full = float(y_full_log.mean())
def te_apply_full(tr_key, tr_tgt, ap_key, m):
    g = pd.DataFrame({'k': tr_key.astype('object'), 'y': tr_tgt}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior_full) / (cnts + m)
    mp = smooth.to_dict()
    tr_enc = tr_key.astype('object').map(mp).fillna(prior_full).astype('float32').values
    ap_enc = ap_key.astype('object').map(mp).fillna(prior_full).astype('float32').values
    return tr_enc, ap_enc
full['te_pu'], tst['te_pu'] = te_apply_full(full['pu_bin'], y_full_log, tst['pu_bin'], m=600.0)
full['te_do'], tst['te_do'] = te_apply_full(full['do_bin'], y_full_log, tst['do_bin'], m=600.0)
pu_cnt_full = full['pu_bin'].astype('object').value_counts().astype('int32')
do_cnt_full = full['do_bin'].astype('object').value_counts().astype('int32')
pair_cnt_full = full['pair_bin'].astype('object').value_counts().astype('int32')
for d in (full, tst):
    d['log_pu_cnt'] = np.log1p(d['pu_bin'].astype('object').map(pu_cnt_full).fillna(0).astype('int32')).astype('float32')
    d['log_do_cnt'] = np.log1p(d['do_bin'].astype('object').map(do_cnt_full).fillna(0).astype('int32')).astype('float32')
    d['log_pair_cnt'] = np.log1p(d['pair_bin'].astype('object').map(pair_cnt_full).fillna(0).astype('int32')).astype('float32')
final_features = final_use_cols
X_full = full[final_features].astype('float32').values
X_test = tst[final_features].astype('float32').values
mono = [0]*len(final_features)
if 'days_since_start' in final_features:
    mono[final_features.index('days_since_start')] = 1
final_model = lgb.LGBMRegressor(**{**lgb_params, 'n_estimators': best_iter_final, 'monotone_constraints': mono})
final_model.fit(X_full, y_full_log)
test_pred = np.expm1(final_model.predict(X_test)).astype('float32')
test_pred = np.clip(test_pred, 0, 500)
pred_df = pd.DataFrame({'key': test_df['key'].astype('string'), 'fare_amount': test_pred})
sub = sample[['key']].merge(pred_df, on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv with shape:', sub.shape)

[200]	valid_0's rmse: 0.232537
[400]	valid_0's rmse: 0.233322
[600]	valid_0's rmse: 0.234307
[800]	valid_0's rmse: 0.235008
[1000]	valid_0's rmse: 0.235799
[Single-Model] Fold 1: RMSE=3.78379, RMSE_log=0.23244, best_iter=224 (train_n=319908, val_n=320756)
[200]	valid_0's rmse: 0.188939
[400]	valid_0's rmse: 0.1896
[600]	valid_0's rmse: 0.189997
[800]	valid_0's rmse: 0.190606
[Single-Model] Fold 2: RMSE=3.18343, RMSE_log=0.18871, best_iter=188 (train_n=640523, val_n=320756)
[200]	valid_0's rmse: 0.210777
[400]	valid_0's rmse: 0.209468
[600]	valid_0's rmse: 0.209057
[800]	valid_0's rmse: 0.208914
[1000]	valid_0's rmse: 0.208728
[1200]	valid_0's rmse: 0.20878
[1400]	valid_0's rmse: 0.208582
[1600]	valid_0's rmse: 0.208617
[1800]	valid_0's rmse: 0.208597
[2000]	valid_0's rmse: 0.208618
[Single-Model] Fold 3: RMSE=3.75448, RMSE_log=0.20856, best_iter=1329 (train_n=961323, val_n=320756)
[200]	valid_0's rmse: 0.199909
[400]	valid_0's rmse: 0.197326
[600]	valid_0's rmse: 0.19671
[800]	valid_0's rmse: 0.196517
[1000]	valid_0's rmse: 0.196476
[1200]	valid_0's rmse: 0.196449
[1400]	valid_0's rmse: 0.196436
[1600]	valid_0's rmse: 0.196516
[1800]	valid_0's rmse: 0.196515
[Single-Model] Fold 4: RMSE=3.73546, RMSE_log=0.19641, best_iter=1147 (train_n=1282234, val_n=320756)
[200]	valid_0's rmse: 0.183052
[400]	valid_0's rmse: 0.179518
[600]	valid_0's rmse: 0.178528
[800]	valid_0's rmse: 0.178129
[1000]	valid_0's rmse: 0.178065
[1200]	valid_0's rmse: 0.178005
[1400]	valid_0's rmse: 0.178097
[1600]	valid_0's rmse: 0.178165
[1800]	valid_0's rmse: 0.17837
[2000]	valid_0's rmse: 0.178399
[Single-Model] Fold 5: RMSE=3.46122, RMSE_log=0.17800, best_iter=1203 (train_n=1602868, val_n=320755)
{'cv_rmse_mean': 3.583678, 'cv_rmse_std': 0.231278, 'best_iter_final': 1147, 'note': 'Single-Model per-fold pu/do TE + counts + monotone(days_since_start)'}


Saved submission.csv with shape: (9914, 2)
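
The `te_smooth` helper in the cell above shrinks each bin's mean log-fare toward the global prior, using (n * mean + m * prior) / (n + m), so that sparsely populated bins stay close to the prior. The following is a minimal sketch of that formula on made-up toy data; the function name `smoothed_te`, the bin labels, and the values are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def smoothed_te(keys, target, m):
    """Return ({key: (n*mean + m*prior) / (n + m)}, prior) for a toy illustration."""
    prior = float(np.mean(target))
    stats = pd.DataFrame({'k': keys, 'y': target}).groupby('k')['y'].agg(['mean', 'count'])
    smooth = (stats['count'] * stats['mean'] + m * prior) / (stats['count'] + m)
    return smooth.to_dict(), prior

# Toy data (hypothetical): bin 'A' has 4 rows, bin 'B' has a single row.
keys = np.array(['A', 'A', 'A', 'A', 'B'])
y = np.array([2.0, 2.0, 2.0, 2.0, 4.0])

te_m0, prior = smoothed_te(keys, y, m=0.0)  # no smoothing: raw bin means
te_m4, _ = smoothed_te(keys, y, m=4.0)      # smoothing pulls both bins toward the prior
unseen = te_m4.get('C', prior)              # unseen keys fall back to the prior

print('prior:', prior)
print('m=0:', te_m0)
print('m=4:', te_m4)
print('unseen key:', unseen)
```

With m=4 the single-row bin 'B' moves much further toward the prior than the four-row bin 'A', which is the behavior the notebook relies on when it sets m=600 for fine spatial bins.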


In [46]:
# Path B (Stabilized Baseline) — Single-Model LGBM without Target Encoding; robust numeric+temporal+POI + KMeans clusters
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error
from sklearn.cluster import MiniBatchKMeans

assert 'train_df' in globals() and 'test_df' in globals() and 'sample' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

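# Ensure minimal time features.
# Caveat: days_since_start is measured from the earliest pickup in the frame passed in, so train and
# test share an origin only if the column was already computed upstream with a common start.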
def ensure_days_since_start(df):
    if 'days_since_start' in df.columns: return df
    d = df.copy(); dt_local = d['pickup_datetime'].dt.tz_convert('America/New_York')
    start_ns = int(dt_local.min().value)
    d['days_since_start'] = ((dt_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d

def ensure_after_hike(df):
    if 'after_hike' in df.columns: return df
    d = df.copy()
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    d['after_hike'] = (d['pickup_datetime'].dt.tz_convert('America/New_York') >= cutoff).astype('int8')
    return d

# Add KMeans clusters if missing
def ensure_clusters(train, test, k=100, sample_n=300_000, seed=2025):
    need = ['pu_cl','do_cl','same_cl']
    if all(c in train.columns for c in need) and all(c in test.columns for c in need):
        return train, test
    pu_coords_tr = train[['pickup_latitude','pickup_longitude']].astype('float32').values
    do_coords_tr = train[['dropoff_latitude','dropoff_longitude']].astype('float32').values
    if len(train) > sample_n:
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(train), size=sample_n, replace=False)
        pu_fit = pu_coords_tr[idx]
        do_fit = do_coords_tr[idx]
    else:
        pu_fit = pu_coords_tr
        do_fit = do_coords_tr
    km_pu = MiniBatchKMeans(n_clusters=k, random_state=seed, batch_size=20000, n_init=5, max_no_improvement=30)
    km_do = MiniBatchKMeans(n_clusters=k, random_state=seed+1, batch_size=20000, n_init=5, max_no_improvement=30)
    km_pu.fit(pu_fit); km_do.fit(do_fit)
    train = train.copy(); test = test.copy()
    train['pu_cl'] = km_pu.predict(pu_coords_tr).astype('int32')
    train['do_cl'] = km_do.predict(do_coords_tr).astype('int32')
    test['pu_cl'] = km_pu.predict(test[['pickup_latitude','pickup_longitude']].astype('float32').values).astype('int32')
    test['do_cl'] = km_do.predict(test[['dropoff_latitude','dropoff_longitude']].astype('float32').values).astype('int32')
    train['same_cl'] = (train['pu_cl'] == train['do_cl']).astype('int8')
    test['same_cl'] = (test['pu_cl'] == test['do_cl']).astype('int8')
    return train, test

# Prepare data
train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True).copy()
test_prep = test_df.copy()
train_ord = ensure_days_since_start(train_ord); test_prep = ensure_days_since_start(test_prep)
train_ord = ensure_after_hike(train_ord); test_prep = ensure_after_hike(test_prep)
train_ord, test_prep = ensure_clusters(train_ord, test_prep, k=100)

# Robust numeric+temporal+POI feature set (no target encoding, no frequency encodings)
feature_cols = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'year','month','day','hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday','days_since_start','after_hike',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend',
    'pu_cl','do_cl','same_cl'
]
if 'rot_manh_km' in train_ord.columns: feature_cols.append('rot_manh_km')
if 'dist_x_after_hike' in train_ord.columns: feature_cols.append('dist_x_after_hike')

train_ord = train_ord.dropna(subset=feature_cols + ['fare_amount']).reset_index(drop=True)
test_prep[feature_cols] = test_prep[feature_cols].fillna(0)

X_all = train_ord[feature_cols].astype('float32').values
y_all = np.log1p(train_ord['fare_amount'].astype('float32').values)
dt_all = train_ord['pickup_datetime']
X_test = test_prep[feature_cols].astype('float32').values

# 5-fold time CV with 1-day gap
folds = make_time_folds_quantile(dt_all, n_folds=5, gap_days=1)
assert len(folds) == 5, 'Expected 5 folds'

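# LightGBM params: stable, regularized. Every column is passed as numeric, so the KMeans cluster IDs
# (pu_cl, do_cl) are treated as ordinal codes rather than categoricals. Monotone constraints are
# applied to days_since_start and the two distance features.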
lgb_params = dict(
    objective='regression',
    metric='rmse',
    learning_rate=0.045,
    n_estimators=20000,
    num_leaves=192,
    min_data_in_leaf=400,
    feature_fraction=0.8,
    bagging_fraction=0.8,
    bagging_freq=1,
    max_bin=255,
    reg_alpha=2.0,
    reg_lambda=10.0,
    random_state=2025,
    n_jobs=-1,
    verbose=-1
)

def build_monotone_constraints(cols):
    mono = [0]*len(cols)
    for nm in ['days_since_start','dist_hav_km','dist_man_km']:
        if nm in cols:
            mono[cols.index(nm)] = 1
    return mono

cv_rmses, best_iters = [], []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
    X_va, y_va = X_all[va_idx], y_all[va_idx]
    mono = build_monotone_constraints(feature_cols)
    model = lgb.LGBMRegressor(**{**lgb_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=600, verbose=False), lgb.log_evaluation(period=200)]
    )
    y_pred = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    rmse = float(root_mean_squared_error(np.expm1(y_va), y_pred))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or model.n_estimators))
    print(f"[Stabilized Single] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]}")

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_final = int(np.clip(int(np.median(best_iters)), 1000, lgb_params['n_estimators']))
print({'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 'best_iter_final': best_iter_final, 'note': 'No TE; robust features + KMeans clusters; monotone on time/dist'})

# Fit final model and create submission
final_mono = build_monotone_constraints(feature_cols)
final_model = lgb.LGBMRegressor(**{**lgb_params, 'n_estimators': best_iter_final, 'monotone_constraints': final_mono})
final_model.fit(X_all, y_all)
test_pred = np.expm1(final_model.predict(X_test)).astype('float32')
test_pred = np.clip(test_pred, 0, 500)
pred_df = pd.DataFrame({'key': test_prep['key'].astype('string'), 'fare_amount': test_pred})
sub = sample[['key']].merge(pred_df, on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv with shape:', sub.shape)

[200]	valid_0's rmse: 0.232668
[400]	valid_0's rmse: 0.230464
[600]	valid_0's rmse: 0.229994
[800]	valid_0's rmse: 0.229902
[1000]	valid_0's rmse: 0.229997
[1200]	valid_0's rmse: 0.230142
[Stabilized Single] Fold 1: RMSE=3.67499, best_iter=762
[200]	valid_0's rmse: 0.184857
[400]	valid_0's rmse: 0.184753
[600]	valid_0's rmse: 0.184664
[Stabilized Single] Fold 2: RMSE=3.09670, best_iter=129
[200]	valid_0's rmse: 0.212231
[400]	valid_0's rmse: 0.20824
[600]	valid_0's rmse: 0.206936
[800]	valid_0's rmse: 0.206456
[1000]	valid_0's rmse: 0.206143
[1200]	valid_0's rmse: 0.205898
[1400]	valid_0's rmse: 0.205706
[1600]	valid_0's rmse: 0.205602
[1800]	valid_0's rmse: 0.205564
[2000]	valid_0's rmse: 0.205565
[2200]	valid_0's rmse: 0.20559
[2400]	valid_0's rmse: 0.205644
[Stabilized Single] Fold 3: RMSE=3.57691, best_iter=1919
[200]	valid_0's rmse: 0.202634
[400]	valid_0's rmse: 0.197082
[600]	valid_0's rmse: 0.195469
[800]	valid_0's rmse: 0.194733
[1000]	valid_0's rmse: 0.194317
[1200]	valid_0's rmse: 0.194005
[1400]	valid_0's rmse: 0.193854
[1600]	valid_0's rmse: 0.193714
[1800]	valid_0's rmse: 0.193636
[2000]	valid_0's rmse: 0.193588
[2200]	valid_0's rmse: 0.193579
[2400]	valid_0's rmse: 0.193566
[2600]	valid_0's rmse: 0.193575
[2800]	valid_0's rmse: 0.193581
[3000]	valid_0's rmse: 0.193597
[3200]	valid_0's rmse: 0.193612
[Stabilized Single] Fold 4: RMSE=3.49711, best_iter=2655
[200]	valid_0's rmse: 0.185938
[400]	valid_0's rmse: 0.179215
[600]	valid_0's rmse: 0.177161
[800]	valid_0's rmse: 0.176339
[1000]	valid_0's rmse: 0.175808
[1200]	valid_0's rmse: 0.175509
[1400]	valid_0's rmse: 0.175321
[1600]	valid_0's rmse: 0.175184
[1800]	valid_0's rmse: 0.175086
[2000]	valid_0's rmse: 0.175023
[2200]	valid_0's rmse: 0.175033
[2400]	valid_0's rmse: 0.174999
[2600]	valid_0's rmse: 0.174995
[2800]	valid_0's rmse: 0.174994
[3000]	valid_0's rmse: 0.175023
[3200]	valid_0's rmse: 0.175086
[Stabilized Single] Fold 5: RMSE=3.20647, best_iter=2668
{'cv_rmse_mean': 3.410434, 'cv_rmse_std': 0.221471, 'best_iter_final': 1919, 'note': 'No TE; robust features + KMeans clusters; monotone on time/dist'}


Saved submission.csv with shape: (9914, 2)
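
Because both runs above use the same five time folds, their per-fold RMSEs can be compared pairwise rather than only through the means. The sketch below copies the fold scores from the two logs and computes the per-fold differences; the variable names are illustrative.

```python
import numpy as np

# Per-fold RMSEs copied from the two runs logged above.
single_model = np.array([3.78379, 3.18343, 3.75448, 3.73546, 3.46122])  # [Single-Model]
stabilized = np.array([3.67499, 3.09670, 3.57691, 3.49711, 3.20647])    # [Stabilized Single]

# Negative delta means the stabilized run scored lower (better) on that fold.
delta = stabilized - single_model

print('mean RMSE:', round(float(single_model.mean()), 5), round(float(stabilized.mean()), 5))
print('std RMSE :', round(float(single_model.std()), 5), round(float(stabilized.std()), 5))
print('per-fold delta:', np.round(delta, 5))
print('mean delta:', round(float(delta.mean()), 5),
      '| folds improved:', int((delta < 0).sum()), 'of', len(delta))
```

The stabilized run is lower on every fold, so the drop in the mean is not driven by a single slice. Both means are still well above the 2.83377 gold threshold listed in the plan, although CV RMSE on these folds need not match the leaderboard score.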


In [47]:
# Phase 1 — Stability fixes: fold ensembling + huber + per-fold freq enc + minimal TE (pu/do); no global refit
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals() and 'sample' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# Ensure helpers
def ensure_days_since_start(df):
    if 'days_since_start' in df.columns: return df
    d = df.copy(); dt_local = d['pickup_datetime'].dt.tz_convert('America/New_York')
    start_ns = int(dt_local.min().value)
    d['days_since_start'] = ((dt_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d

def ensure_spatial_bins(df, res=0.005):
    if all(c in df.columns for c in ['pu_bin','do_bin','pair_bin']): return df
    d = df.copy()
    pu_lat_bin = np.floor(d['pickup_latitude'] / res).astype('int32')
    pu_lon_bin = np.floor(d['pickup_longitude'] / res).astype('int32')
    do_lat_bin = np.floor(d['dropoff_latitude'] / res).astype('int32')
    do_lon_bin = np.floor(d['dropoff_longitude'] / res).astype('int32')
    d['pu_bin'] = (pu_lat_bin.astype(str) + '_' + pu_lon_bin.astype(str))
    d['do_bin'] = (do_lat_bin.astype(str) + '_' + do_lon_bin.astype(str))
    d['pair_bin'] = (d['pu_bin'] + '|' + d['do_bin'])
    return d

train_use = ensure_spatial_bins(ensure_days_since_start(train_df))
test_use = ensure_spatial_bins(ensure_days_since_start(test_df))
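# Only rows with a missing target are dropped; NaNs in feature columns are left to LightGBM's native missing-value handling
train_ord = train_use.dropna(subset=['fare_amount']).sort_values('pickup_datetime').reset_index(drop=True)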

# Base robust features (no high-risk IDs); keep time/dist/POI/temporal cycles
base_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'year','month','day','hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday','days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_ord.columns: base_feats.append('rot_manh_km')
if 'dist_x_after_hike' in train_ord.columns: base_feats.append('dist_x_after_hike')

dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

# TE helper (smoothed, on log target)
def te_smooth(train_key, train_tgt, apply_key, m=600.0, prior=None):
    if prior is None: prior = float(np.mean(train_tgt))
    g = pd.DataFrame({'k': train_key.astype('object'), 'y': train_tgt}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict()
    enc = apply_key.astype('object').map(mp).fillna(prior).astype('float32').values
    return enc, prior, mp

# Model params: huber for robustness; leave capacity reasonable; use monotone on days_since_start
lgb_params = dict(
    objective='huber', metric='rmse',
    learning_rate=0.045, n_estimators=22000,
    num_leaves=256, max_depth=-1, min_data_in_leaf=120,
    feature_fraction=0.85, bagging_fraction=0.8, bagging_freq=1,
    max_bin=255, reg_alpha=0.0, reg_lambda=8.0,
    random_state=2025, n_jobs=-1, verbose=-1
)

cv_rmses, best_iters = [], []
fold_test_preds = []  # store per-fold test predictions for ensembling
final_use_cols = None

for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx].copy(); va = train_ord.iloc[va_idx].copy()
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    prior = float(y_tr.mean())
    # Per-fold frequency encodings from train only
    pu_cnt = tr['pu_bin'].astype('object').value_counts().astype('int32')
    do_cnt = tr['do_bin'].astype('object').value_counts().astype('int32')
    pair_cnt = tr['pair_bin'].astype('object').value_counts().astype('int32')
    for d in (tr, va):
        d['log_pu_cnt'] = np.log1p(d['pu_bin'].astype('object').map(pu_cnt).fillna(0).astype('int32')).astype('float32')
        d['log_do_cnt'] = np.log1p(d['do_bin'].astype('object').map(do_cnt).fillna(0).astype('int32')).astype('float32')
        d['log_pair_cnt'] = np.log1p(d['pair_bin'].astype('object').map(pair_cnt).fillna(0).astype('int32')).astype('float32')
    # Minimal per-fold TE (pu/do only) with strong smoothing
    tr['te_pu'], _, _ = te_smooth(tr['pu_bin'], y_tr, tr['pu_bin'], m=600.0, prior=prior)
    va['te_pu'], _, _ = te_smooth(tr['pu_bin'], y_tr, va['pu_bin'], m=600.0, prior=prior)
    tr['te_do'], _, _ = te_smooth(tr['do_bin'], y_tr, tr['do_bin'], m=600.0, prior=prior)
    va['te_do'], _, _ = te_smooth(tr['do_bin'], y_tr, va['do_bin'], m=600.0, prior=prior)
    use_cols = base_feats + ['te_pu','te_do','log_pu_cnt','log_do_cnt','log_pair_cnt']
    X_tr = tr[use_cols].astype('float32').values
    X_va = va[use_cols].astype('float32').values
    final_use_cols = use_cols
    mono = [0]*len(use_cols)
    if 'days_since_start' in use_cols:
        mono[use_cols.index('days_since_start')] = 1
    model = lgb.LGBMRegressor(**{**lgb_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=700, verbose=False), lgb.log_evaluation(period=200)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    rmse = float(root_mean_squared_error(np.expm1(y_va), y_pred_val))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or model.n_estimators))
    print(f"[Phase1-Stable] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})")

    # Fold-ensemble prediction for test: build encodings from train-only maps and apply
    tst = test_use.copy()
    tst['log_pu_cnt'] = np.log1p(tst['pu_bin'].astype('object').map(pu_cnt).fillna(0).astype('int32')).astype('float32')
    tst['log_do_cnt'] = np.log1p(tst['do_bin'].astype('object').map(do_cnt).fillna(0).astype('int32')).astype('float32')
    tst['log_pair_cnt'] = np.log1p(tst['pair_bin'].astype('object').map(pair_cnt).fillna(0).astype('int32')).astype('float32')
    tst['te_pu'], _, _ = te_smooth(tr['pu_bin'], y_tr, tst['pu_bin'], m=600.0, prior=prior)
    tst['te_do'], _, _ = te_smooth(tr['do_bin'], y_tr, tst['do_bin'], m=600.0, prior=prior)
    X_test_fold = tst[use_cols].astype('float32').values
    fold_test_pred = np.expm1(model.predict(X_test_fold, num_iteration=model.best_iteration_)).astype('float32')
    fold_test_preds.append(fold_test_pred)

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_median = int(np.median(best_iters)) if best_iters else 0
print({'cv_rmse_mean': round(cv_mean, 6), 'cv_rmse_std': round(cv_std, 6), 'median_best_iter': best_iter_median, 'note': 'Fold ensembling + huber + per-fold freq + minimal TE pu/do'})

# Average fold predictions for test (fold ensembling); no global refit
test_pred_ens = np.mean(np.vstack(fold_test_preds), axis=0).astype('float32')
test_pred_ens = np.clip(test_pred_ens, 0, 500)
sub_df = pd.DataFrame({'key': test_use['key'].astype('string'), 'fare_amount': test_pred_ens})
sub = sample[['key']].merge(sub_df, on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (fold-ensemble) with shape:', sub.shape)

[200]	valid_0's rmse: 0.232048
[400]	valid_0's rmse: 0.232381
[600]	valid_0's rmse: 0.23328
[800]	valid_0's rmse: 0.23391
[Phase1-Stable] Fold 1: RMSE=3.74845, best_iter=218 (train_n=319908, val_n=320756)
[200]	valid_0's rmse: 0.186919
[400]	valid_0's rmse: 0.187621
[600]	valid_0's rmse: 0.188406
[800]	valid_0's rmse: 0.18889
[Phase1-Stable] Fold 2: RMSE=3.14178, best_iter=214 (train_n=640523, val_n=320756)
[200]	valid_0's rmse: 0.210618
[400]	valid_0's rmse: 0.208772
[600]	valid_0's rmse: 0.208137
[800]	valid_0's rmse: 0.208081
[1000]	valid_0's rmse: 0.207847
[1200]	valid_0's rmse: 0.207807
[1400]	valid_0's rmse: 0.207814
[1600]	valid_0's rmse: 0.207874
[1800]	valid_0's rmse: 0.207914
[2000]	valid_0's rmse: 0.207863
[Phase1-Stable] Fold 3: RMSE=3.73304, best_iter=1317 (train_n=961323, val_n=320756)
[200]	valid_0's rmse: 0.199612
[400]	valid_0's rmse: 0.196948
[600]	valid_0's rmse: 0.196335
[800]	valid_0's rmse: 0.196073
[1000]	valid_0's rmse: 0.19593
[1200]	valid_0's rmse: 0.196013
[1400]	valid_0's rmse: 0.196091
[1600]	valid_0's rmse: 0.196136
[Phase1-Stable] Fold 4: RMSE=3.70937, best_iter=1072 (train_n=1282234, val_n=320756)
[200]	valid_0's rmse: 0.182735
[400]	valid_0's rmse: 0.17883
[600]	valid_0's rmse: 0.177858
[800]	valid_0's rmse: 0.177407
[1000]	valid_0's rmse: 0.177268
[1200]	valid_0's rmse: 0.177105
[1400]	valid_0's rmse: 0.177175
[1600]	valid_0's rmse: 0.177189
[1800]	valid_0's rmse: 0.177213
[Phase1-Stable] Fold 5: RMSE=3.42745, best_iter=1248 (train_n=1602868, val_n=320755)
{'cv_rmse_mean': 3.552016, 'cv_rmse_std': 0.236614, 'median_best_iter': 1072, 'note': 'Fold ensembling + huber + per-fold freq + minimal TE pu/do'}
Saved submission.csv (fold-ensemble) with shape: (9914, 2)
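
`make_time_folds_quantile` is defined earlier in the notebook and is not shown in these cells. The fold sizes printed above (validation slices of about 320,756 rows, with the training set growing by roughly one slice per fold) are consistent with an expanding-window scheme over K+1 equal-count time blocks and a gap before each validation block. The following is a minimal sketch under that assumption; `time_folds_sketch` is a hypothetical stand-in, and the real helper may differ.

```python
import numpy as np
import pandas as pd

def time_folds_sketch(dt, n_folds=5, gap_days=1):
    """Expanding-window folds over n_folds+1 equal-count time blocks (hypothetical helper).

    `dt` must be sorted ascending. Fold k validates on block k+1 and trains on
    every row strictly earlier than (validation start - gap_days).
    """
    dt = pd.Series(dt).reset_index(drop=True)
    n = len(dt)
    edges = np.linspace(0, n, n_folds + 2).astype(int)  # block boundaries by row count
    gap = pd.Timedelta(days=gap_days)
    folds = []
    for k in range(1, n_folds + 1):
        va_idx = np.arange(edges[k], edges[k + 1])
        va_start = dt.iloc[edges[k]]
        tr_idx = np.flatnonzero((dt < va_start - gap).to_numpy())
        folds.append((tr_idx, va_idx))
    return folds

# Synthetic hourly timestamps over 60 days, for illustration only.
dt = pd.Series(pd.Timestamp('2012-01-01', tz='UTC') + pd.to_timedelta(np.arange(24 * 60), unit='h'))
folds = time_folds_sketch(dt, n_folds=5, gap_days=1)
for i, (tr, va) in enumerate(folds, 1):
    print(i, len(tr), len(va), dt.iloc[tr].max(), dt.iloc[va].min())
```

Under this scheme the first fold trains on only about one sixth of the data, which may partly explain why folds 1 and 2 stop after a few hundred iterations while later folds run past 1,000.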


In [50]:
# Phase 1 (Stabilized Geohash) — per-fold hierarchical TE (gh6 -> gh5 back-off), stronger smoothing/back-off, time-decay weights (exp), constrained capacity, fold ensembling
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals() and 'sample' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb
try:
    import pygeohash as pgh
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'pygeohash'])
    import pygeohash as pgh

# Ensure helper features
def ensure_days_since_start(df):
    if 'days_since_start' in df.columns: return df
    d = df.copy(); dt_local = d['pickup_datetime'].dt.tz_convert('America/New_York')
    start_ns = int(dt_local.min().value)
    d['days_since_start'] = ((dt_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d

def add_geohash(df, prec_list=(6,5)):
    d = df.copy()
    lat_pu = d['pickup_latitude'].astype('float32').values
    lon_pu = d['pickup_longitude'].astype('float32').values
    lat_do = d['dropoff_latitude'].astype('float32').values
    lon_do = d['dropoff_longitude'].astype('float32').values
    for prec in prec_list:
        pu_col = f'pu_gh{prec}'; do_col = f'do_gh{prec}'
        if pu_col in d.columns and do_col in d.columns: continue
        d[pu_col] = [pgh.encode(float(lat), float(lon), precision=prec) if np.isfinite(lat) and np.isfinite(lon) else '' for lat, lon in zip(lat_pu, lon_pu)]
        d[do_col] = [pgh.encode(float(lat), float(lon), precision=prec) if np.isfinite(lat) and np.isfinite(lon) else '' for lat, lon in zip(lat_do, lon_do)]
    return d

train_use = ensure_days_since_start(train_df).sort_values('pickup_datetime').reset_index(drop=True)
test_use = ensure_days_since_start(test_df).copy()
train_use = add_geohash(train_use, prec_list=(6,5))
test_use = add_geohash(test_use, prec_list=(6,5))

# Base robust numeric/temporal/POI features — drop absolute identifiers year/month/day per mandate
base_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday','days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_use.columns: base_feats.append('rot_manh_km')
if 'dist_x_after_hike' in train_use.columns: base_feats.append('dist_x_after_hike')

# Drop NaNs; fill test
train_ord = train_use.dropna(subset=base_feats + ['fare_amount']).reset_index(drop=True)
test_use[base_feats] = test_use[base_feats].fillna(0)

dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

# Smoothed TE helper (log-target); returns (encoded vector for apply_key, prior, key->TE map, key->count map)
def te_smooth(train_key, train_tgt, apply_key, m=700.0, prior=None):
    if prior is None: prior = float(np.mean(train_tgt))
    g = pd.DataFrame({'k': train_key.astype('object'), 'y': train_tgt}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict()
    enc = apply_key.astype('object').map(mp).fillna(prior).astype('float32').values
    return enc, prior, mp, cnts.to_dict()

# Blend gh6 and gh5 TEs using data-driven weight from gh6 counts; stronger smoothing/back-off
def blend_te(gh6_series_tr, gh5_series_tr, tgt_tr, gh6_series_ap, gh5_series_ap, m6=1500.0, m5=1000.0, alpha=200.0):
    enc6, prior, mp6, cnt6 = te_smooth(gh6_series_tr, tgt_tr, gh6_series_ap, m=m6)
    enc5, _,    mp5, cnt5 = te_smooth(gh5_series_tr, tgt_tr, gh5_series_ap, m=m5, prior=prior)
    if isinstance(gh6_series_ap, pd.Series):
        ap_keys = gh6_series_ap.astype('object').values
    else:
        ap_keys = gh6_series_ap.astype('object')
    w = np.array([cnt6.get(k, 0.0) for k in ap_keys], dtype='float32')
    w = w / (w + np.float32(alpha))
    return (w * enc6 + (1.0 - w) * enc5).astype('float32'), (mp6, mp5), (cnt6, cnt5)

# Model params (constrained capacity); the monotone constraint on days_since_start is added per fold below
lgb_params = dict(
    objective='huber', metric='rmse',
    learning_rate=0.045, n_estimators=20000,
    num_leaves=128, max_depth=-1, min_data_in_leaf=800,
    feature_fraction=0.70, bagging_fraction=0.8, bagging_freq=1,
    max_bin=127, reg_alpha=0.0, reg_lambda=40.0,
    random_state=2025, n_jobs=-1, verbose=-1
)

cv_rmses, best_iters = [], []
fold_test_preds = []
final_use_cols = None

for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx].copy(); va = train_ord.iloc[va_idx].copy()
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    prior = float(y_tr.mean())
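    # Hierarchical TE for PU. Validation rows are encoded from training-fold statistics only;
    # training rows are encoded in-sample (a row's own target enters its cell mean), which the large m dampens but does not remove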
    te_pu_blend_va, (pu_mp6, pu_mp5), (pu_cnt6, pu_cnt5) = blend_te(
        tr['pu_gh6'], tr['pu_gh5'], y_tr, va['pu_gh6'], va['pu_gh5'], m6=1500.0, m5=1000.0, alpha=200.0
    )
    te_pu_blend_tr, _, _ = blend_te(
        tr['pu_gh6'], tr['pu_gh5'], y_tr, tr['pu_gh6'], tr['pu_gh5'], m6=1500.0, m5=1000.0, alpha=200.0
    )
    tr['te_pu'] = te_pu_blend_tr; va['te_pu'] = te_pu_blend_va
    # Hierarchical TE for DO
    te_do_blend_va, (do_mp6, do_mp5), (do_cnt6, do_cnt5) = blend_te(
        tr['do_gh6'], tr['do_gh5'], y_tr, va['do_gh6'], va['do_gh5'], m6=1500.0, m5=1000.0, alpha=200.0
    )
    te_do_blend_tr, _, _ = blend_te(
        tr['do_gh6'], tr['do_gh5'], y_tr, tr['do_gh6'], tr['do_gh5'], m6=1500.0, m5=1000.0, alpha=200.0
    )
    tr['te_do'] = te_do_blend_tr; va['te_do'] = te_do_blend_va
    # Frequency encodings (log-counts) using gh6 and gh5, computed on train-only
    pu6_counts = tr['pu_gh6'].astype('object').value_counts().astype('int32')
    do6_counts = tr['do_gh6'].astype('object').value_counts().astype('int32')
    pu5_counts = tr['pu_gh5'].astype('object').value_counts().astype('int32')
    do5_counts = tr['do_gh5'].astype('object').value_counts().astype('int32')
    for d in (tr, va):
        d['log_pu6_cnt'] = np.log1p(d['pu_gh6'].astype('object').map(pu6_counts).fillna(0).astype('int32')).astype('float32')
        d['log_do6_cnt'] = np.log1p(d['do_gh6'].astype('object').map(do6_counts).fillna(0).astype('int32')).astype('float32')
        d['log_pu5_cnt'] = np.log1p(d['pu_gh5'].astype('object').map(pu5_counts).fillna(0).astype('int32')).astype('float32')
        d['log_do5_cnt'] = np.log1p(d['do_gh5'].astype('object').map(do5_counts).fillna(0).astype('int32')).astype('float32')
    use_cols = base_feats + ['te_pu','te_do','log_pu6_cnt','log_do6_cnt','log_pu5_cnt','log_do5_cnt']
    X_tr = tr[use_cols].astype('float32').values
    X_va = va[use_cols].astype('float32').values
    final_use_cols = use_cols
    # Monotone constraint only on days_since_start
    mono = [0]*len(use_cols)
    if 'days_since_start' in use_cols:
        mono[use_cols.index('days_since_start')] = 1
    # Exponential time-decay weights within fold (tau=180 days)
    days_tr = tr['days_since_start'].astype('float32').values
    max_days = float(np.max(days_tr)) if days_tr.size else 0.0
    w_tr = np.exp((days_tr - np.float32(max_days)) / np.float32(180.0)).astype('float32')
    model = lgb.LGBMRegressor(**{**lgb_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr, sample_weight=w_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=700, verbose=False), lgb.log_evaluation(period=200)]
    )
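    # LightGBM's logged 'rmse' is on the log1p(fare) scale; the fold score below is on the original fare scale
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))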
    rmse = float(root_mean_squared_error(np.expm1(y_va), y_pred_val))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or model.n_estimators))
    print(f"[GH-Stable] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})")

    # Build test features from this fold's train-only statistics; predictions are averaged across folds below
    tst = test_use.copy()
    # blend_te derives its prior from y_tr, so test encodings match the validation encodings exactly
    tst['te_pu'], _, _ = blend_te(
        tr['pu_gh6'], tr['pu_gh5'], y_tr, tst['pu_gh6'], tst['pu_gh5'], m6=1500.0, m5=1000.0, alpha=200.0
    )
    tst['te_do'], _, _ = blend_te(
        tr['do_gh6'], tr['do_gh5'], y_tr, tst['do_gh6'], tst['do_gh5'], m6=1500.0, m5=1000.0, alpha=200.0
    )
    # Frequency encodings: reuse the train-only count maps already computed for tr/va
    tst['log_pu6_cnt'] = np.log1p(tst['pu_gh6'].astype('object').map(pu6_counts).fillna(0).astype('int32')).astype('float32')
    tst['log_do6_cnt'] = np.log1p(tst['do_gh6'].astype('object').map(do6_counts).fillna(0).astype('int32')).astype('float32')
    tst['log_pu5_cnt'] = np.log1p(tst['pu_gh5'].astype('object').map(pu5_counts).fillna(0).astype('int32')).astype('float32')
    tst['log_do5_cnt'] = np.log1p(tst['do_gh5'].astype('object').map(do5_counts).fillna(0).astype('int32')).astype('float32')
    X_test_fold = tst[use_cols].astype('float32').values
    fold_pred = np.expm1(model.predict(X_test_fold, num_iteration=model.best_iteration_)).astype('float32')
    fold_test_preds.append(fold_pred)

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_median = int(np.median(best_iters)) if best_iters else 0
print({'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 'median_best_iter': best_iter_median, 'note': 'Hierarchical gh6/gh5 TE blend (strong smoothing/back-off) + constrained model + exp time-decay + fold ensembling'})

# Average fold predictions for test (fold ensembling); no global refit
test_pred = np.mean(np.vstack(fold_test_preds), axis=0).astype('float32')
test_pred = np.clip(test_pred, 0, 500)
pred_df = pd.DataFrame({'key': test_use['key'].astype('string'), 'fare_amount': test_pred})
sub = sample[['key']].merge(pred_df, on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (geohash fold-ensemble) with shape:', sub.shape)

[200]	valid_0's rmse: 0.233087
[400]	valid_0's rmse: 0.231237
[600]	valid_0's rmse: 0.230663
[800]	valid_0's rmse: 0.230384
[1000]	valid_0's rmse: 0.23026
[1200]	valid_0's rmse: 0.230155
[1400]	valid_0's rmse: 0.230203
[1600]	valid_0's rmse: 0.23022
[1800]	valid_0's rmse: 0.230278
[GH-Stable] Fold 1: RMSE=3.65475, best_iter=1212 (train_n=319908, val_n=320756)
[200]	valid_0's rmse: 0.187305
[400]	valid_0's rmse: 0.187093
[600]	valid_0's rmse: 0.187134
[800]	valid_0's rmse: 0.187215
[GH-Stable] Fold 2: RMSE=3.09381, best_iter=267 (train_n=640523, val_n=320756)
[200]	valid_0's rmse: 0.212057
[400]	valid_0's rmse: 0.208102
[600]	valid_0's rmse: 0.20692
[800]	valid_0's rmse: 0.206215
[1000]	valid_0's rmse: 0.205873
[1200]	valid_0's rmse: 0.20559
[1400]	valid_0's rmse: 0.205418
[1600]	valid_0's rmse: 0.205331
[1800]	valid_0's rmse: 0.205303
[2000]	valid_0's rmse: 0.205348
[2200]	valid_0's rmse: 0.205331
[2400]	valid_0's rmse: 0.205355
[GH-Stable] Fold 3: RMSE=3.55763, best_iter=1806 (train_n=961323, val_n=320756)
[200]	valid_0's rmse: 0.201123
[400]	valid_0's rmse: 0.197636
[600]	valid_0's rmse: 0.196609
[800]	valid_0's rmse: 0.196123
[1000]	valid_0's rmse: 0.19592
[1200]	valid_0's rmse: 0.195862
[1400]	valid_0's rmse: 0.195854
[1600]	valid_0's rmse: 0.19586
[1800]	valid_0's rmse: 0.195914
[2000]	valid_0's rmse: 0.195981
[GH-Stable] Fold 4: RMSE=3.48745, best_iter=1313 (train_n=1282234, val_n=320756)
[200]	valid_0's rmse: 0.183582
[400]	valid_0's rmse: 0.179804
[600]	valid_0's rmse: 0.178636
[800]	valid_0's rmse: 0.177981
[1000]	valid_0's rmse: 0.177633
[1200]	valid_0's rmse: 0.17748
[1400]	valid_0's rmse: 0.177473
[1600]	valid_0's rmse: 0.177513
[1800]	valid_0's rmse: 0.17752
[GH-Stable] Fold 5: RMSE=3.18571, best_iter=1287 (train_n=1602868, val_n=320755)
{'cv_rmse_mean': 3.395868, 'cv_rmse_std': 0.217706, 'median_best_iter': 1287, 'note': 'Hierarchical gh6/gh5 TE blend (strong smoothing/back-off) + constrained model + exp time-decay + fold ensembling'}
Saved submission.csv (geohash fold-ensemble) with shape: (9914, 2)
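
The gh6/gh5 blend in the cell above reduces to two formulas: the smoothed cell mean (n*mean + m*prior)/(n + m) and the blend weight w = n6/(n6 + alpha). The following is a minimal sketch on made-up data, not the notebook's own helpers: the keys, targets, and the small m and alpha values are chosen only so the arithmetic can be checked by hand, and `smoothed_te` and `blended` are illustrative names.

```python
import numpy as np
import pandas as pd

def smoothed_te(keys, target, m, prior):
    """Return ({key: (n*mean + m*prior)/(n + m)}, {key: n}) from training rows."""
    stats = pd.DataFrame({"k": keys, "y": target}).groupby("k")["y"].agg(["mean", "count"])
    smooth = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)
    return smooth.to_dict(), stats["count"].to_dict()

# Made-up training data: fine cells "aa1" and "aa2" share the coarse cell "aa"
fine = pd.Series(["aa1", "aa1", "aa1", "aa2"])
coarse = fine.str[:2]
y = np.array([2.0, 2.0, 2.0, 4.0])
prior = float(y.mean())

te6, n6 = smoothed_te(fine, y, m=1.0, prior=prior)
te5, _ = smoothed_te(coarse, y, m=1.0, prior=prior)

def blended(fine_key, coarse_key, alpha=1.0):
    """Weight the fine-cell encoding by how often that cell was seen in training."""
    n = n6.get(fine_key, 0)
    w = n / (n + alpha)
    return w * te6.get(fine_key, prior) + (1.0 - w) * te5.get(coarse_key, prior)

print("seen fine cell:  ", blended("aa1", "aa"))
print("unseen fine cell:", blended("aa9", "aa"))
```

An unseen fine cell gets w = 0 and falls back entirely to the coarse-cell encoding. The hard back-off threshold introduced in the next cell extends that same fallback to fine cells with only a few training rows.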


In [51]:
# Phase 1 — Final Stability Lockdown: hierarchical geohash TE (gh6 -> gh5) with hard back-off, extreme regularization, purified features, exp decay (tau=120), fold ensembling
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals() and 'sample' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb
try:
    import pygeohash as pgh
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'pygeohash'])
    import pygeohash as pgh

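def ensure_days_since_start(df):
    # Returns df unchanged if the column already exists. Otherwise the origin is this frame's own
    # earliest pickup, so train and test share a time axis only if the column was precomputed upstream.
    if 'days_since_start' in df.columns: return df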
    d = df.copy(); dt_local = d['pickup_datetime'].dt.tz_convert('America/New_York')
    start_ns = int(dt_local.min().value)
    d['days_since_start'] = ((dt_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d

def add_geohash(df, prec_list=(6,5)):
    d = df.copy()
    lat_pu = d['pickup_latitude'].astype('float32').values
    lon_pu = d['pickup_longitude'].astype('float32').values
    lat_do = d['dropoff_latitude'].astype('float32').values
    lon_do = d['dropoff_longitude'].astype('float32').values
    for prec in prec_list:
        pu_col = f'pu_gh{prec}'; do_col = f'do_gh{prec}'
        if pu_col in d.columns and do_col in d.columns: continue
        d[pu_col] = [pgh.encode(float(lat), float(lon), precision=prec) if np.isfinite(lat) and np.isfinite(lon) else '' for lat, lon in zip(lat_pu, lon_pu)]
        d[do_col] = [pgh.encode(float(lat), float(lon), precision=prec) if np.isfinite(lat) and np.isfinite(lon) else '' for lat, lon in zip(lat_do, lon_do)]
    return d

train_use = ensure_days_since_start(train_df).sort_values('pickup_datetime').reset_index(drop=True)
test_use = ensure_days_since_start(test_df).copy()
train_use = add_geohash(train_use, prec_list=(6,5))
test_use = add_geohash(test_use, prec_list=(6,5))

# Purified base features: drop absolute time (year, month, day) and raw discrete time (hour, dow); keep cyclical + flags
base_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday','days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_use.columns: base_feats.append('rot_manh_km')
if 'dist_x_after_hike' in train_use.columns: base_feats.append('dist_x_after_hike')

train_ord = train_use.dropna(subset=base_feats + ['fare_amount']).reset_index(drop=True)
test_use[base_feats] = test_use[base_feats].fillna(0)

dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)
folds = make_time_folds_quantile(dt_all, n_folds=5, gap_days=1)
assert len(folds) == 5, 'Expected 5 folds'

# Smoothed TE (log-target) returning enc/prior/map/counts
def te_smooth(train_key, train_tgt, apply_key, m=700.0, prior=None):
    if prior is None: prior = float(np.mean(train_tgt))
    g = pd.DataFrame({'k': train_key.astype('object'), 'y': train_tgt}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict()
    enc = apply_key.astype('object').map(mp).fillna(prior).astype('float32').values
    return enc, prior, mp, cnts.to_dict()

# Hierarchical blend with hard back-off: if gh6 count < thr, w=0 (use gh5 entirely)
def blend_te(gh6_tr, gh5_tr, tgt_tr, gh6_ap, gh5_ap, m6=3000.0, m5=2000.0, alpha=400.0, hard_thr=20):
    enc6, prior, mp6, cnt6 = te_smooth(gh6_tr, tgt_tr, gh6_ap, m=m6)
    enc5, _,    mp5, cnt5 = te_smooth(gh5_tr, tgt_tr, gh5_ap, m=m5, prior=prior)
    ap_keys = gh6_ap.astype('object').values if isinstance(gh6_ap, pd.Series) else gh6_ap.astype('object')
    counts = np.array([cnt6.get(k, 0.0) for k in ap_keys], dtype='float32')
    w = counts / (counts + np.float32(alpha))
    # Hard back-off
    if hard_thr is not None and hard_thr > 0:
        mask = (counts < np.float32(hard_thr))
        if mask.any():
            w = w.copy(); w[mask] = 0.0
    return (w * enc6 + (1.0 - w) * enc5).astype('float32'), (mp6, mp5), (cnt6, cnt5)

# Extremely constrained model
lgb_params = dict(
    objective='huber', metric='rmse',
    learning_rate=0.045, n_estimators=25000,
    num_leaves=96, max_depth=-1, min_data_in_leaf=1200,
    feature_fraction=0.65, bagging_fraction=0.8, bagging_freq=1,
    max_bin=127, reg_alpha=5.0, reg_lambda=60.0, min_gain_to_split=0.1,
    random_state=2025, n_jobs=-1, verbose=-1
)

cv_rmses, best_iters = [], []
fold_test_preds = []

for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx].copy(); va = train_ord.iloc[va_idx].copy()
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    prior = float(y_tr.mean())
    # Hierarchical TE for PU and DO with strong smoothing/back-off
    va_pu_te, (pu_mp6, pu_mp5), (pu_cnt6, pu_cnt5) = blend_te(tr['pu_gh6'], tr['pu_gh5'], y_tr, va['pu_gh6'], va['pu_gh5'], m6=3000.0, m5=2000.0, alpha=400.0, hard_thr=20)
    tr_pu_te, _, _ = blend_te(tr['pu_gh6'], tr['pu_gh5'], y_tr, tr['pu_gh6'], tr['pu_gh5'], m6=3000.0, m5=2000.0, alpha=400.0, hard_thr=20)
    va_do_te, (do_mp6, do_mp5), (do_cnt6, do_cnt5) = blend_te(tr['do_gh6'], tr['do_gh5'], y_tr, va['do_gh6'], va['do_gh5'], m6=3000.0, m5=2000.0, alpha=400.0, hard_thr=20)
    tr_do_te, _, _ = blend_te(tr['do_gh6'], tr['do_gh5'], y_tr, tr['do_gh6'], tr['do_gh5'], m6=3000.0, m5=2000.0, alpha=400.0, hard_thr=20)
    tr['te_pu'] = tr_pu_te; va['te_pu'] = va_pu_te
    tr['te_do'] = tr_do_te; va['te_do'] = va_do_te
    # Frequency: keep only gh6 counts per mandate
    pu6_counts = tr['pu_gh6'].astype('object').value_counts().astype('int32')
    do6_counts = tr['do_gh6'].astype('object').value_counts().astype('int32')
    for d in (tr, va):
        d['log_pu6_cnt'] = np.log1p(d['pu_gh6'].astype('object').map(pu6_counts).fillna(0).astype('int32')).astype('float32')
        d['log_do6_cnt'] = np.log1p(d['do_gh6'].astype('object').map(do6_counts).fillna(0).astype('int32')).astype('float32')
    use_cols = base_feats + ['te_pu','te_do','log_pu6_cnt','log_do6_cnt']
    X_tr = tr[use_cols].astype('float32').values
    X_va = va[use_cols].astype('float32').values
    # Monotone on days_since_start
    mono = [0]*len(use_cols)
    if 'days_since_start' in use_cols:
        mono[use_cols.index('days_since_start')] = 1
    # Exponential decay weights with tau=120 days (more aggressive)
    days_tr = tr['days_since_start'].astype('float32').values
    max_days = float(np.max(days_tr)) if days_tr.size else 0.0
    w_tr = np.exp((days_tr - np.float32(max_days)) / np.float32(120.0)).astype('float32')
    model = lgb.LGBMRegressor(**{**lgb_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr, sample_weight=w_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=700, verbose=False), lgb.log_evaluation(period=200)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    rmse = float(root_mean_squared_error(np.expm1(y_va), y_pred_val))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or model.n_estimators))
    print(f"[GH-Lockdown] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})")

    # Test-time: same hardened logic as validation (gh6/gh5 blend with hard back-off, gh6 counts only)
    tst = test_use.copy()
    # blend_te derives its prior from y_tr, so test encodings match the validation encodings exactly
    tst['te_pu'], _, _ = blend_te(tr['pu_gh6'], tr['pu_gh5'], y_tr, tst['pu_gh6'], tst['pu_gh5'], m6=3000.0, m5=2000.0, alpha=400.0, hard_thr=20)
    tst['te_do'], _, _ = blend_te(tr['do_gh6'], tr['do_gh5'], y_tr, tst['do_gh6'], tst['do_gh5'], m6=3000.0, m5=2000.0, alpha=400.0, hard_thr=20)
    # Frequency encodings: reuse the train-only gh6 count maps already computed for tr/va
    tst['log_pu6_cnt'] = np.log1p(tst['pu_gh6'].astype('object').map(pu6_counts).fillna(0).astype('int32')).astype('float32')
    tst['log_do6_cnt'] = np.log1p(tst['do_gh6'].astype('object').map(do6_counts).fillna(0).astype('int32')).astype('float32')
    X_test_fold = tst[use_cols].astype('float32').values
    fold_pred = np.expm1(model.predict(X_test_fold, num_iteration=model.best_iteration_)).astype('float32')
    fold_test_preds.append(fold_pred)

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_median = int(np.median(best_iters)) if best_iters else 0
print({'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 'median_best_iter': best_iter_median, 'note': 'Final Stability Lockdown: heavy regularization + hardened hierarchical TE + purified time + tau=120 + fold ensembling'})

# Fold-ensemble submission
test_pred = np.mean(np.vstack(fold_test_preds), axis=0).astype('float32')
test_pred = np.clip(test_pred, 0, 500)
pred_df = pd.DataFrame({'key': test_use['key'].astype('string'), 'fare_amount': test_pred})
sub = sample[['key']].merge(pred_df, on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (Final Stability Lockdown) with shape:', sub.shape)

[200]	valid_0's rmse: 0.236976
[400]	valid_0's rmse: 0.236598
[600]	valid_0's rmse: 0.236516
[800]	valid_0's rmse: 0.236475
[1000]	valid_0's rmse: 0.236396
[1200]	valid_0's rmse: 0.236377
[1400]	valid_0's rmse: 0.236359
[1600]	valid_0's rmse: 0.236338
[1800]	valid_0's rmse: 0.236328
[2000]	valid_0's rmse: 0.236319
[2200]	valid_0's rmse: 0.236313
[2400]	valid_0's rmse: 0.236296
[2600]	valid_0's rmse: 0.23629
[2800]	valid_0's rmse: 0.236285
[3000]	valid_0's rmse: 0.236285
[3200]	valid_0's rmse: 0.236258
[3400]	valid_0's rmse: 0.23626
[3600]	valid_0's rmse: 0.236255
[3800]	valid_0's rmse: 0.236251
[4000]	valid_0's rmse: 0.236247
[4200]	valid_0's rmse: 0.236237
[4400]	valid_0's rmse: 0.236235
[4600]	valid_0's rmse: 0.236225
[4800]	valid_0's rmse: 0.236218
[5000]	valid_0's rmse: 0.236219
[5200]	valid_0's rmse: 0.236219
[5400]	valid_0's rmse: 0.236213
[5600]	valid_0's rmse: 0.236204
[5800]	valid_0's rmse: 0.236204
[6000]	valid_0's rmse: 0.236198
[6200]	valid_0's rmse: 0.236192
[6400]	valid_0's rmse: 0.236197
[6600]	valid_0's rmse: 0.236193
[6800]	valid_0's rmse: 0.236184
[7000]	valid_0's rmse: 0.236181
[7200]	valid_0's rmse: 0.236177
[7400]	valid_0's rmse: 0.236181
[7600]	valid_0's rmse: 0.236173
[7800]	valid_0's rmse: 0.236175
[8000]	valid_0's rmse: 0.236177
[8200]	valid_0's rmse: 0.236168
[8400]	valid_0's rmse: 0.236161
[8600]	valid_0's rmse: 0.236166
[8800]	valid_0's rmse: 0.236162
[9000]	valid_0's rmse: 0.236165
[9200]	valid_0's rmse: 0.236156
[9400]	valid_0's rmse: 0.236154
[9600]	valid_0's rmse: 0.236153
[9800]	valid_0's rmse: 0.236154
[10000]	valid_0's rmse: 0.236147
[10200]	valid_0's rmse: 0.236146
[10400]	valid_0's rmse: 0.236138
[10600]	valid_0's rmse: 0.236136
[10800]	valid_0's rmse: 0.236133
[11000]	valid_0's rmse: 0.236131
[11200]	valid_0's rmse: 0.236132
[11400]	valid_0's rmse: 0.236131
[11600]	valid_0's rmse: 0.236133
[11800]	valid_0's rmse: 0.236126
[12000]	valid_0's rmse: 0.236124
[12200]	valid_0's rmse: 0.236126
[12400]	valid_0's rmse: 0.236127
[12600]	valid_0's rmse: 0.236122
[12800]	valid_0's rmse: 0.236112
[13000]	valid_0's rmse: 0.236114
[13200]	valid_0's rmse: 0.236119
[13400]	valid_0's rmse: 0.236123
[13600]	valid_0's rmse: 0.236113
[GH-Lockdown] Fold 1: RMSE=3.77008, best_iter=12915 (train_n=319908, val_n=320756)
[200]	valid_0's rmse: 0.191363
[400]	valid_0's rmse: 0.190667
[600]	valid_0's rmse: 0.190578
[800]	valid_0's rmse: 0.190549
[1000]	valid_0's rmse: 0.190512
[1200]	valid_0's rmse: 0.190467
[1400]	valid_0's rmse: 0.190465
[1600]	valid_0's rmse: 0.190427
[1800]	valid_0's rmse: 0.190408
[2000]	valid_0's rmse: 0.190395
[2200]	valid_0's rmse: 0.190375
[2400]	valid_0's rmse: 0.190365
[2600]	valid_0's rmse: 0.190356
[2800]	valid_0's rmse: 0.190407
[3000]	valid_0's rmse: 0.190395
[3200]	valid_0's rmse: 0.190384
[3400]	valid_0's rmse: 0.190367
[GH-Lockdown] Fold 2: RMSE=3.18991, best_iter=2743 (train_n=640523, val_n=320756)
[200]	valid_0's rmse: 0.218826
[400]	valid_0's rmse: 0.218128
[600]	valid_0's rmse: 0.218036
[800]	valid_0's rmse: 0.21797
[1000]	valid_0's rmse: 0.217936
[1200]	valid_0's rmse: 0.217916
[1400]	valid_0's rmse: 0.217835
[1600]	valid_0's rmse: 0.21782
[1800]	valid_0's rmse: 0.217809
[2000]	valid_0's rmse: 0.217805
[2200]	valid_0's rmse: 0.217774
[2400]	valid_0's rmse: 0.217784
[2600]	valid_0's rmse: 0.217769
[2800]	valid_0's rmse: 0.217757
[3000]	valid_0's rmse: 0.217745
[3200]	valid_0's rmse: 0.217749
[3400]	valid_0's rmse: 0.217751
[GH-Lockdown] Fold 3: RMSE=3.82524, best_iter=2701 (train_n=961323, val_n=320756)
[200]	valid_0's rmse: 0.206598
[400]	valid_0's rmse: 0.205776
[600]	valid_0's rmse: 0.205622
[800]	valid_0's rmse: 0.20557
[1000]	valid_0's rmse: 0.205519
[1200]	valid_0's rmse: 0.205504
[1400]	valid_0's rmse: 0.205486
[1600]	valid_0's rmse: 0.205473
[1800]	valid_0's rmse: 0.205451
[2000]	valid_0's rmse: 0.205437
[2200]	valid_0's rmse: 0.205431
[2400]	valid_0's rmse: 0.205403
[2600]	valid_0's rmse: 0.20539
[2800]	valid_0's rmse: 0.205377
[3000]	valid_0's rmse: 0.20537
[3200]	valid_0's rmse: 0.205363
[3400]	valid_0's rmse: 0.20536
[3600]	valid_0's rmse: 0.205358
[3800]	valid_0's rmse: 0.205349
[4000]	valid_0's rmse: 0.205347
[4200]	valid_0's rmse: 0.205341
[4400]	valid_0's rmse: 0.205335
[4600]	valid_0's rmse: 0.205333
[4800]	valid_0's rmse: 0.205327
[5000]	valid_0's rmse: 0.205321
[5200]	valid_0's rmse: 0.205316
[5400]	valid_0's rmse: 0.205316
[5600]	valid_0's rmse: 0.205317
[5800]	valid_0's rmse: 0.20531
[6000]	valid_0's rmse: 0.205299
[6200]	valid_0's rmse: 0.205295
[6400]	valid_0's rmse: 0.205287
[6600]	valid_0's rmse: 0.205279
[6800]	valid_0's rmse: 0.205281
[7000]	valid_0's rmse: 0.205271
[7200]	valid_0's rmse: 0.205269
[7400]	valid_0's rmse: 0.205262
[7600]	valid_0's rmse: 0.205256
[7800]	valid_0's rmse: 0.20525
[8000]	valid_0's rmse: 0.205249
[8200]	valid_0's rmse: 0.205256
[8400]	valid_0's rmse: 0.205252
[8600]	valid_0's rmse: 0.205246
[8800]	valid_0's rmse: 0.205242
[9000]	valid_0's rmse: 0.205241
[9200]	valid_0's rmse: 0.205239
[9400]	valid_0's rmse: 0.205236
[9600]	valid_0's rmse: 0.205236
[9800]	valid_0's rmse: 0.205234
[10000]	valid_0's rmse: 0.205231
[10200]	valid_0's rmse: 0.205226
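
The surrounding cells weight training rows by exp((d - d_max)/tau), with tau set to 180, 120 and 90 days. The following is a minimal sketch of what those settings imply, run on made-up day offsets rather than the notebook's data. `decay_weights` and `effective_sample_size` are illustrative helper names, and the Kish effective sample size is an assumption added here as one common way to summarize how much of the training set the weighting effectively keeps.

```python
import numpy as np

def decay_weights(days_since_start, tau_days):
    """Weight 1.0 for the newest row and exp(-age/tau) for a row `age` days older."""
    days = np.asarray(days_since_start, dtype="float64")
    return np.exp((days - days.max()) / tau_days)

def effective_sample_size(w):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype="float64")
    return float(w.sum() ** 2 / np.square(w).sum())

# Made-up example: one training row per day over two years
days = np.arange(730.0)
for tau in (180.0, 120.0, 90.0):
    w = decay_weights(days, tau)
    half_life = tau * np.log(2.0)        # age at which a row's weight falls to 0.5
    weight_1y = np.exp(-365.0 / tau)     # weight of a row one year older than the newest
    print(f"tau={tau:.0f}: half-life={half_life:.1f} d, "
          f"weight at 1 year={weight_1y:.3f}, "
          f"effective rows={effective_sample_size(w):.0f} of {len(w)}")
```

Shrinking tau concentrates the fit on the most recent months, so later folds with long histories effectively train on a small fraction of their rows. That is worth weighing against the heavy regularization already applied in these cells.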


In [52]:
# Phase 1 — Lockdown v2: ultra-constrained model + hardened hierarchical TE + purified time (no hour/dow), stronger decay, fold ensembling
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals() and 'sample' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb
try:
    import pygeohash as pgh
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'pygeohash'])
    import pygeohash as pgh

def ensure_days_since_start(df):
    if 'days_since_start' in df.columns: return df
    d = df.copy(); dt_local = d['pickup_datetime'].dt.tz_convert('America/New_York')
    start_ns = int(dt_local.min().value)
    d['days_since_start'] = ((dt_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d

def add_geohash(df, prec_list=(6,5)):
    d = df.copy()
    lat_pu = d['pickup_latitude'].astype('float32').values
    lon_pu = d['pickup_longitude'].astype('float32').values
    lat_do = d['dropoff_latitude'].astype('float32').values
    lon_do = d['dropoff_longitude'].astype('float32').values
    for prec in prec_list:
        pu_col = f'pu_gh{prec}'; do_col = f'do_gh{prec}'
        if pu_col in d.columns and do_col in d.columns: continue
        d[pu_col] = [pgh.encode(float(lat), float(lon), precision=prec) if np.isfinite(lat) and np.isfinite(lon) else '' for lat, lon in zip(lat_pu, lon_pu)]
        d[do_col] = [pgh.encode(float(lat), float(lon), precision=prec) if np.isfinite(lat) and np.isfinite(lon) else '' for lat, lon in zip(lat_do, lon_do)]
    return d

train_use = ensure_days_since_start(train_df).sort_values('pickup_datetime').reset_index(drop=True)
test_use = ensure_days_since_start(test_df).copy()
train_use = add_geohash(train_use, prec_list=(6,5))
test_use = add_geohash(test_use, prec_list=(6,5))

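# Purified features: drop raw hour/dow, their cycles, and the is_weekend/is_rush/is_night flags; keep doy cycles + holiday + days_since_start (the dist_x_rush and dist_x_weekend interactions are still included)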
base_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'doy_sin','doy_cos','is_holiday','days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_use.columns: base_feats.append('rot_manh_km')
if 'dist_x_after_hike' in train_use.columns: base_feats.append('dist_x_after_hike')

train_ord = train_use.dropna(subset=base_feats + ['fare_amount']).reset_index(drop=True)
test_use[base_feats] = test_use[base_feats].fillna(0)

dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)
folds = make_time_folds_quantile(dt_all, n_folds=5, gap_days=1)
assert len(folds) == 5, 'Expected 5 folds'

# Smoothed TE with counts map
def te_smooth(train_key, train_tgt, apply_key, m=700.0, prior=None):
    if prior is None: prior = float(np.mean(train_tgt))
    g = pd.DataFrame({'k': train_key.astype('object'), 'y': train_tgt}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict()
    enc = apply_key.astype('object').map(mp).fillna(prior).astype('float32').values
    return enc, prior, mp, cnts.to_dict()

# Hierarchical blend with hard back-off and very strong smoothing
def blend_te(gh6_tr, gh5_tr, tgt_tr, gh6_ap, gh5_ap, m6=5000.0, m5=3500.0, alpha=800.0, hard_thr=40):
    enc6, prior, mp6, cnt6 = te_smooth(gh6_tr, tgt_tr, gh6_ap, m=m6)
    enc5, _,    mp5, cnt5 = te_smooth(gh5_tr, tgt_tr, gh5_ap, m=m5, prior=prior)
    ap_keys = gh6_ap.astype('object').values if isinstance(gh6_ap, pd.Series) else gh6_ap.astype('object')
    counts = np.array([cnt6.get(k, 0.0) for k in ap_keys], dtype='float32')
    w = counts / (counts + np.float32(alpha))
    if hard_thr is not None and hard_thr > 0:
        mask = (counts < np.float32(hard_thr))
        if mask.any():
            w = w.copy(); w[mask] = 0.0
    return (w * enc6 + (1.0 - w) * enc5).astype('float32'), (mp6, mp5), (cnt6, cnt5)

# Ultra-constrained LightGBM
lgb_params = dict(
    objective='huber', metric='rmse',
    learning_rate=0.04, n_estimators=20000,
    num_leaves=64, max_depth=-1, min_data_in_leaf=2000,
    feature_fraction=0.60, bagging_fraction=0.70, bagging_freq=1,
    max_bin=63, reg_alpha=10.0, reg_lambda=80.0, min_gain_to_split=0.2,
    random_state=2025, n_jobs=-1, verbose=-1
)

cv_rmses, best_iters = [], []
fold_test_preds = []

for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx].copy(); va = train_ord.iloc[va_idx].copy()
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    prior = float(y_tr.mean())
    # Hardened hierarchical TE
    va_pu_te, (pu_mp6, pu_mp5), (pu_cnt6, pu_cnt5) = blend_te(tr['pu_gh6'], tr['pu_gh5'], y_tr, va['pu_gh6'], va['pu_gh5'])
    tr_pu_te, _, _ = blend_te(tr['pu_gh6'], tr['pu_gh5'], y_tr, tr['pu_gh6'], tr['pu_gh5'])
    va_do_te, (do_mp6, do_mp5), (do_cnt6, do_cnt5) = blend_te(tr['do_gh6'], tr['do_gh5'], y_tr, va['do_gh6'], va['do_gh5'])
    tr_do_te, _, _ = blend_te(tr['do_gh6'], tr['do_gh5'], y_tr, tr['do_gh6'], tr['do_gh5'])
    tr['te_pu'] = tr_pu_te; va['te_pu'] = va_pu_te
    tr['te_do'] = tr_do_te; va['te_do'] = va_do_te
    # Keep only gh6 counts
    pu6_counts = tr['pu_gh6'].astype('object').value_counts().astype('int32')
    do6_counts = tr['do_gh6'].astype('object').value_counts().astype('int32')
    for d in (tr, va):
        d['log_pu6_cnt'] = np.log1p(d['pu_gh6'].astype('object').map(pu6_counts).fillna(0).astype('int32')).astype('float32')
        d['log_do6_cnt'] = np.log1p(d['do_gh6'].astype('object').map(do6_counts).fillna(0).astype('int32')).astype('float32')
    use_cols = base_feats + ['te_pu','te_do','log_pu6_cnt','log_do6_cnt']
    X_tr = tr[use_cols].astype('float32').values
    X_va = va[use_cols].astype('float32').values
    # Monotone on days_since_start only
    mono = [0]*len(use_cols)
    if 'days_since_start' in use_cols:
        mono[use_cols.index('days_since_start')] = 1
    # Stronger exponential decay (tau=90 days)
    days_tr = tr['days_since_start'].astype('float32').values
    max_days = float(np.max(days_tr)) if days_tr.size else 0.0
    w_tr = np.exp((days_tr - np.float32(max_days)) / np.float32(90.0)).astype('float32')
    model = lgb.LGBMRegressor(**{**lgb_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr, sample_weight=w_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=600, verbose=False), lgb.log_evaluation(period=300)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    rmse = float(root_mean_squared_error(np.expm1(y_va), y_pred_val))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or model.n_estimators))
    print(f"[GH-Lockdown v2] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})")

    # Test features per fold: reuse blend_te so test rows get the same smoothing/back-off as validation
    tst = test_use.copy()
    tst['te_pu'], _, _ = blend_te(tr['pu_gh6'], tr['pu_gh5'], y_tr, tst['pu_gh6'], tst['pu_gh5'])
    tst['te_do'], _, _ = blend_te(tr['do_gh6'], tr['do_gh5'], y_tr, tst['do_gh6'], tst['do_gh5'])
    # gh6 frequency features reuse the train-fold counts computed above (pu6_counts, do6_counts)
    tst['log_pu6_cnt'] = np.log1p(tst['pu_gh6'].astype('object').map(pu6_counts).fillna(0).astype('int32')).astype('float32')
    tst['log_do6_cnt'] = np.log1p(tst['do_gh6'].astype('object').map(do6_counts).fillna(0).astype('int32')).astype('float32')
    X_test_fold = tst[use_cols].astype('float32').values
    fold_pred = np.expm1(model.predict(X_test_fold, num_iteration=model.best_iteration_)).astype('float32')
    fold_test_preds.append(fold_pred)

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_median = int(np.median(best_iters)) if best_iters else 0
print({'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 'median_best_iter': best_iter_median, 'note': 'Lockdown v2: ultra-constrained + very strong TE smoothing/back-off + no hour/dow cycles + tau=90'})

# Fold-ensemble submission
test_pred = np.mean(np.vstack(fold_test_preds), axis=0).astype('float32') if fold_test_preds else np.zeros(len(test_use), dtype='float32')
test_pred = np.clip(test_pred, 0, 500)
pred_df = pd.DataFrame({'key': test_use['key'].astype('string'), 'fare_amount': test_pred})
sub = sample[['key']].merge(pred_df, on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (Lockdown v2) with shape:', sub.shape)

[300]	valid_0's rmse: 0.243692
[600]	valid_0's rmse: 0.243639
[900]	valid_0's rmse: 0.243587
[1200]	valid_0's rmse: 0.243557
[1500]	valid_0's rmse: 0.243545
[1800]	valid_0's rmse: 0.243539
[2100]	valid_0's rmse: 0.24353
[2400]	valid_0's rmse: 0.243532
[2700]	valid_0's rmse: 0.243531
[3000]	valid_0's rmse: 0.24352
[3300]	valid_0's rmse: 0.243516
[3600]	valid_0's rmse: 0.243501
[3900]	valid_0's rmse: 0.243501
[4200]	valid_0's rmse: 0.243503
[4500]	valid_0's rmse: 0.2435
[4800]	valid_0's rmse: 0.243491
[5100]	valid_0's rmse: 0.243493
[5400]	valid_0's rmse: 0.243491
[5700]	valid_0's rmse: 0.243477
[6000]	valid_0's rmse: 0.243472
[6300]	valid_0's rmse: 0.243468
[6600]	valid_0's rmse: 0.243465
[6900]	valid_0's rmse: 0.243468
[GH-Lockdown v2] Fold 1: RMSE=3.87440, best_iter=6370 (train_n=319908, val_n=320756)
[300]	valid_0's rmse: 0.200298
[600]	valid_0's rmse: 0.200209
[900]	valid_0's rmse: 0.200174
[1200]	valid_0's rmse: 0.200134
[1500]	valid_0's rmse: 0.200106
[1800]	valid_0's rmse: 0.200066
[2100]	valid_0's rmse: 0.20005
[2400]	valid_0's rmse: 0.200032
[2700]	valid_0's rmse: 0.200032
[3000]	valid_0's rmse: 0.200008
[3300]	valid_0's rmse: 0.200012
[3600]	valid_0's rmse: 0.199994
[3900]	valid_0's rmse: 0.199998
[4200]	valid_0's rmse: 0.19998
[4500]	valid_0's rmse: 0.199962
[4800]	valid_0's rmse: 0.199952
[5100]	valid_0's rmse: 0.199961
[GH-Lockdown v2] Fold 2: RMSE=3.28725, best_iter=4581 (train_n=640523, val_n=320756)
[300]	valid_0's rmse: 0.229891
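
The gh6 -> gh5 back-off used by `blend_te` in the cell above is easier to follow on toy numbers. The sketch below is a minimal pure-Python illustration, not the notebook's implementation: it swaps the pandas group-bys for dictionaries, the helper names (`smoothed_means`, `backoff_weight`, `blend`) and the toy geohashes and log-fare targets are hypothetical, and the constants mirror the Lockdown v2 defaults (`m6=5000`, `m5=3500`, `alpha=800`, `hard_thr=40`).

```python
from collections import defaultdict

def smoothed_means(keys, targets, m, prior):
    """Per-key target mean shrunk toward `prior` with pseudo-count `m`."""
    sums, counts = defaultdict(float), defaultdict(int)
    for k, y in zip(keys, targets):
        sums[k] += y
        counts[k] += 1
    enc = {k: (sums[k] + m * prior) / (counts[k] + m) for k in counts}
    return enc, dict(counts)

def backoff_weight(n, alpha=800.0, hard_thr=40):
    """Weight on the fine (gh6) encoding; forced to zero for sparse cells."""
    return 0.0 if n < hard_thr else n / (n + alpha)

def blend(fine_key, coarse_key, fine, coarse, prior, alpha=800.0, hard_thr=40):
    """Blend fine and coarse encodings; unseen keys fall back to `prior`."""
    fine_enc, fine_cnt = fine
    coarse_enc, _ = coarse
    w = backoff_weight(fine_cnt.get(fine_key, 0), alpha, hard_thr)
    return w * fine_enc.get(fine_key, prior) + (1.0 - w) * coarse_enc.get(coarse_key, prior)

# Toy training fold (hypothetical): one dense gh6 cell and one sparse gh6 cell
# that share the same gh5 parent. Targets stand in for log1p(fare).
gh6 = ['dr5ru7'] * 1000 + ['dr5ru6'] * 10
gh5 = [k[:5] for k in gh6]  # a gh5 cell is the 5-character prefix of its gh6 cells
y = [2.5] * 1000 + [3.0] * 10
prior = sum(y) / len(y)

fine = smoothed_means(gh6, y, m=5000.0, prior=prior)
coarse = smoothed_means(gh5, y, m=3500.0, prior=prior)

dense = blend('dr5ru7', 'dr5ru', fine, coarse, prior)   # weight 1000 / 1800 on gh6
sparse = blend('dr5ru6', 'dr5ru', fine, coarse, prior)  # weight 0: pure gh5 value
print(round(dense, 5), round(sparse, 5))
```

With these numbers the sparse cell (10 rows, below `hard_thr`) ignores its own mean entirely and takes the gh5 value, while the dense cell keeps a weight of 1000/1800 on its heavily shrunk gh6 mean.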


In [54]:
# Phase 2 — Two-Stage Hybrid Residual Model (Drift-Modeled) with rolling window + hardened residual TE + extreme regularization
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals() and 'sample' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb
try:
    import pygeohash as pgh
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'pygeohash'])
    import pygeohash as pgh

# Helpers
def to_local(dt):
    return dt.dt.tz_convert('America/New_York')

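def ensure_days_since_start(df):
    # Caveat: the origin is this frame's own earliest pickup, so train and test share a time axis
    # only if the column was already built upstream with a common origin (then this is a no-op).
    if 'days_since_start' in df.columns: return df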
    d = df.copy(); dt_local = to_local(d['pickup_datetime'])
    start_ns = int(dt_local.min().value)
    d['days_since_start'] = ((dt_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d

def ensure_policy_feats(df):
    if 'after_hike' in df.columns: return df
    d = df.copy(); dt_local = to_local(d['pickup_datetime'])
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    d['after_hike'] = (dt_local >= cutoff).astype('int8')
    return d

def ensure_holiday(df):
    if 'is_holiday' in df.columns: return df
    from pandas.tseries.holiday import USFederalHolidayCalendar as USCal
    d = df.copy(); dt_local = to_local(d['pickup_datetime'])
    hol = USCal().holidays(start=dt_local.min().normalize().tz_localize(None), end=dt_local.max().normalize().tz_localize(None))
    hol_dates = set(pd.to_datetime(hol).date)
    d['is_holiday'] = dt_local.dt.date.map(lambda x: x in hol_dates).astype('int8')
    return d

def add_geohash(df, prec_list=(6,5)):
    d = df.copy()
    lat_pu = d['pickup_latitude'].astype('float32').values
    lon_pu = d['pickup_longitude'].astype('float32').values
    lat_do = d['dropoff_latitude'].astype('float32').values
    lon_do = d['dropoff_longitude'].astype('float32').values
    for prec in prec_list:
        pu_col = f'pu_gh{prec}'; do_col = f'do_gh{prec}'
        if pu_col in d.columns and do_col in d.columns: continue
        d[pu_col] = [pgh.encode(float(la), float(lo), precision=prec) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(lat_pu, lon_pu)]
        d[do_col] = [pgh.encode(float(la), float(lo), precision=prec) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(lat_do, lon_do)]
    return d

# Residual TE helpers (OOF-safe inside each fold)
def te_smooth(train_key, train_tgt, apply_key, m=700.0, prior=None):
    if prior is None: prior = float(np.mean(train_tgt))
    g = pd.DataFrame({'k': train_key.astype('object'), 'y': train_tgt}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict()
    enc = apply_key.astype('object').map(mp).fillna(prior).astype('float32').values
    return enc, prior, mp, cnts.to_dict()

def blend_te(gh6_tr, gh5_tr, tgt_tr, gh6_ap, gh5_ap, m6=8000.0, m5=5000.0, alpha=1200.0, hard_thr=100):
    enc6, prior, mp6, cnt6 = te_smooth(gh6_tr, tgt_tr, gh6_ap, m=m6)
    enc5, _,    mp5, cnt5 = te_smooth(gh5_tr, tgt_tr, gh5_ap, m=m5, prior=prior)
    ap_keys = gh6_ap.astype('object').values if isinstance(gh6_ap, pd.Series) else gh6_ap.astype('object')
    counts = np.array([cnt6.get(k, 0.0) for k in ap_keys], dtype='float32')
    w = counts / (counts + np.float32(alpha))
    if hard_thr is not None and hard_thr > 0:
        mask = (counts < np.float32(hard_thr))
        if mask.any():
            w = w.copy(); w[mask] = 0.0
    return (w * enc6 + (1.0 - w) * enc5).astype('float32'), (mp6, mp5), (cnt6, cnt5)

# Prepare data
train_ord = ensure_holiday(ensure_policy_feats(ensure_days_since_start(train_df.copy()))).sort_values('pickup_datetime').reset_index(drop=True)
test_prep = ensure_holiday(ensure_policy_feats(ensure_days_since_start(test_df.copy())))
train_ord = add_geohash(train_ord, (6,5)); test_prep = add_geohash(test_prep, (6,5))

dt_all = train_ord['pickup_datetime']
y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)
folds = make_time_folds_quantile(dt_all, n_folds=5, gap_days=1)
assert len(folds) == 5, 'Expected 5 folds'

# Stage 1 Trend model (small, regularized; categorical discrete features)
s1_params = dict(objective='regression', metric='rmse',
                 learning_rate=0.05, n_estimators=4000,
                 num_leaves=16, min_data_in_leaf=2000,
                 feature_fraction=0.9, bagging_fraction=0.9, bagging_freq=1,
                 max_bin=127, reg_lambda=50.0, random_state=2025, n_jobs=-1, verbose=-1)

# Stage 2 Residual model (ultra-constrained)
s2_params = dict(objective='huber', metric='rmse',
                 learning_rate=0.045, n_estimators=28000,
                 num_leaves=48, min_data_in_leaf=3000,
                 feature_fraction=0.60, bagging_fraction=0.75, bagging_freq=1,
                 max_bin=31, reg_alpha=15.0, reg_lambda=100.0, min_gain_to_split=0.5,
                 random_state=2026, n_jobs=-1, verbose=-1)

# Features
trend_feats = ['days_since_start','year','month','is_holiday','after_hike']
for c in trend_feats: assert c in train_ord.columns, f'Missing {c} for Stage1'
# Stage 2 base (purified): remove absolute time identifiers; keep geometry + hour cycle + is_holiday
base2 = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'hour_sin','hour_cos','is_holiday',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_ord.columns: base2.append('rot_manh_km')
if 'dist_x_after_hike' in train_ord.columns: base2.append('dist_x_after_hike')

cv_rmses, s2_best_iters = [], []
fold_preds = []

for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx].copy(); va = train_ord.iloc[va_idx].copy()
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    # Stage 1: categorical discrete time for year/month/is_holiday/after_hike
    X1_tr = tr[trend_feats].copy(); X1_va = va[trend_feats].copy()
    for c in ['year','month','is_holiday','after_hike']:
        if c in X1_tr.columns:
            X1_tr[c] = X1_tr[c].astype('category'); X1_va[c] = X1_va[c].astype('category')
    s1 = lgb.LGBMRegressor(**s1_params)
    s1.fit(X1_tr, y_tr, eval_set=[(X1_va, y_va)], eval_metric='rmse', categorical_feature=['year','month','is_holiday','after_hike'],
           callbacks=[lgb.early_stopping(stopping_rounds=400, verbose=False), lgb.log_evaluation(period=200)])
    # Capture training categories for categorical features to reuse in predictions
    cat_categories = {c: X1_tr[c].cat.categories for c in ['year','month','is_holiday','after_hike'] if c in X1_tr.columns}
    tr_trend = s1.predict(X1_tr, num_iteration=s1.best_iteration_).astype('float32')
    va_trend = s1.predict(X1_va, num_iteration=s1.best_iteration_).astype('float32')
    y_tr_res = (y_tr - tr_trend).astype('float32'); y_va_res = (y_va - va_trend).astype('float32')

    # Stage 2 rolling window (480 days) within train fold
    dmax = float(tr['days_since_start'].max()) if len(tr) else 0.0
    tr_win = tr.loc[tr['days_since_start'] >= (dmax - 480.0)].reset_index(drop=True).copy()
    # Align categorical dtypes for Stage1 prediction on tr_win to match training categories
    X1_trwin = tr_win[trend_feats].copy()
    for c in ['year','month','is_holiday','after_hike']:
        if c in X1_trwin.columns:
            X1_trwin[c] = X1_trwin[c].astype('category')
            if c in cat_categories:
                X1_trwin[c] = X1_trwin[c].cat.set_categories(cat_categories[c])
    y_tr_res_win = (np.log1p(tr_win['fare_amount'].astype('float32').values) - s1.predict(X1_trwin, num_iteration=s1.best_iteration_).astype('float32'))

    # Residual TE on windowed train (gh6 -> gh5) for PU and DO
    # Ensure geohash cols exist
    assert all(c in tr_win.columns for c in ['pu_gh6','pu_gh5','do_gh6','do_gh5']), 'Geohash cols missing'
    te_pu_va, (pu_mp6, pu_mp5), (pu_cnt6, pu_cnt5) = blend_te(tr_win['pu_gh6'], tr_win['pu_gh5'], y_tr_res_win, va['pu_gh6'], va['pu_gh5'],
                                                             m6=8000.0, m5=5000.0, alpha=1200.0, hard_thr=100)
    te_do_va, (do_mp6, do_mp5), (do_cnt6, do_cnt5) = blend_te(tr_win['do_gh6'], tr_win['do_gh5'], y_tr_res_win, va['do_gh6'], va['do_gh5'],
                                                             m6=8000.0, m5=5000.0, alpha=1200.0, hard_thr=100)
    # Train encodings for Stage 2 on window (train side)
    te_pu_tr, _, _ = blend_te(tr_win['pu_gh6'], tr_win['pu_gh5'], y_tr_res_win, tr_win['pu_gh6'], tr_win['pu_gh5'],
                              m6=8000.0, m5=5000.0, alpha=1200.0, hard_thr=100)
    te_do_tr, _, _ = blend_te(tr_win['do_gh6'], tr_win['do_gh5'], y_tr_res_win, tr_win['do_gh6'], tr_win['do_gh5'],
                              m6=8000.0, m5=5000.0, alpha=1200.0, hard_thr=100)
    tr_win['te_pu'] = te_pu_tr; tr_win['te_do'] = te_do_tr
    va = va.copy(); va['te_pu'] = te_pu_va; va['te_do'] = te_do_va
    # Counts (gh6 only)
    pu6_counts = tr_win['pu_gh6'].astype('object').value_counts().astype('int32')
    do6_counts = tr_win['do_gh6'].astype('object').value_counts().astype('int32')
    tr_win['log_pu6_cnt'] = np.log1p(tr_win['pu_gh6'].astype('object').map(pu6_counts).fillna(0).astype('int32')).astype('float32')
    tr_win['log_do6_cnt'] = np.log1p(tr_win['do_gh6'].astype('object').map(do6_counts).fillna(0).astype('int32')).astype('float32')
    va['log_pu6_cnt'] = np.log1p(va['pu_gh6'].astype('object').map(pu6_counts).fillna(0).astype('int32')).astype('float32')
    va['log_do6_cnt'] = np.log1p(va['do_gh6'].astype('object').map(do6_counts).fillna(0).astype('int32')).astype('float32')

    use_cols2 = base2 + ['te_pu','te_do','log_pu6_cnt','log_do6_cnt']
    X2_tr = tr_win[use_cols2].astype('float32').values
    X2_va = va[use_cols2].astype('float32').values
    s2 = lgb.LGBMRegressor(**s2_params)
    s2.fit(X2_tr, y_tr_res_win, eval_set=[(X2_va, y_va_res)], eval_metric='rmse',
           callbacks=[lgb.early_stopping(stopping_rounds=900, verbose=False), lgb.log_evaluation(period=300)])
    s2_best = int(s2.best_iteration_ or s2_params['n_estimators']); s2_best_iters.append(s2_best)
    va_res_pred = s2.predict(X2_va, num_iteration=s2.best_iteration_).astype('float32')
    va_final_log = (va_trend + va_res_pred).astype('float32')
    rmse = float(root_mean_squared_error(np.expm1(y_va), np.expm1(va_final_log)))
    cv_rmses.append(rmse)
    print(f"[TwoStage-Hybrid] Fold {i}: RMSE={rmse:.5f}, s2_best_iter={s2_best} (train_win_n={len(tr_win)}, val_n={len(va_idx)})")

    # Test-time per fold: trend + residual
    X1_test = test_prep[trend_feats].copy()
    for c in ['year','month','is_holiday','after_hike']:
        if c in X1_test.columns:
            X1_test[c] = X1_test[c].astype('category')
            if c in cat_categories:
                X1_test[c] = X1_test[c].cat.set_categories(cat_categories[c])
    test_trend = s1.predict(X1_test, num_iteration=s1.best_iteration_).astype('float32')
    tst = test_prep.copy()
    # Reuse blend_te so test rows get the same gh6 -> gh5 back-off as validation
    tst['te_pu'], _, _ = blend_te(tr_win['pu_gh6'], tr_win['pu_gh5'], y_tr_res_win, tst['pu_gh6'], tst['pu_gh5'],
                                  m6=8000.0, m5=5000.0, alpha=1200.0, hard_thr=100)
    tst['te_do'], _, _ = blend_te(tr_win['do_gh6'], tr_win['do_gh5'], y_tr_res_win, tst['do_gh6'], tst['do_gh5'],
                                  m6=8000.0, m5=5000.0, alpha=1200.0, hard_thr=100)
    # Counts (gh6 only): reuse the windowed train-fold counts computed above
    tst['log_pu6_cnt'] = np.log1p(tst['pu_gh6'].astype('object').map(pu6_counts).fillna(0).astype('int32')).astype('float32')
    tst['log_do6_cnt'] = np.log1p(tst['do_gh6'].astype('object').map(do6_counts).fillna(0).astype('int32')).astype('float32')
    X2_test = tst[use_cols2].astype('float32').values
    test_res = s2.predict(X2_test, num_iteration=s2.best_iteration_).astype('float32')
    fold_pred = np.expm1(test_trend + test_res).astype('float32')
    fold_preds.append(fold_pred)

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_median = int(np.median(s2_best_iters)) if s2_best_iters else 0
print({'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 's2_best_iter_median': best_iter_median, 'note': 'Two-Stage Hybrid: Stage1 trend + Stage2 residual (480-day window) with hardened gh6->gh5 TE'})

# Fold-ensemble submission
test_pred = np.mean(np.vstack(fold_preds), axis=0).astype('float32') if fold_preds else np.zeros(len(test_prep), dtype='float32')
test_pred = np.clip(test_pred, 0, 500)
sub = sample[['key']].merge(pd.DataFrame({'key': test_prep['key'].astype('string'), 'fare_amount': test_pred}), on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (Two-Stage Hybrid) with shape:', sub.shape)

[200]	valid_0's rmse: 0.509142
[400]	valid_0's rmse: 0.509064
[300]	valid_0's rmse: 0.240162
[600]	valid_0's rmse: 0.24007
[900]	valid_0's rmse: 0.240038
[1200]	valid_0's rmse: 0.240024
[1500]	valid_0's rmse: 0.240023
[1800]	valid_0's rmse: 0.240024
[2100]	valid_0's rmse: 0.240021
[2400]	valid_0's rmse: 0.239997
[2700]	valid_0's rmse: 0.239998
[3000]	valid_0's rmse: 0.239997
[3300]	valid_0's rmse: 0.239985
[3600]	valid_0's rmse: 0.239993
[3900]	valid_0's rmse: 0.239975
[4200]	valid_0's rmse: 0.239976
[4500]	valid_0's rmse: 0.239971
[4800]	valid_0's rmse: 0.239971
[5100]	valid_0's rmse: 0.239971
[5400]	valid_0's rmse: 0.239969
[TwoStage-Hybrid] Fold 1: RMSE=3.84993, s2_best_iter=4628 (train_win_n=319908, val_n=320756)
[200]	valid_0's rmse: 0.516681
[400]	valid_0's rmse: 0.516727
[300]	valid_0's rmse: 0.186038
[600]	valid_0's rmse: 0.185898
[900]	valid_0's rmse: 0.185865
[1200]	valid_0's rmse: 0.185813
[1500]	valid_0's rmse: 0.185786
[1800]	valid_0's rmse: 0.185763
[2100]	valid_0's rmse: 0.185747
[2400]	valid_0's rmse: 0.185731
[2700]	valid_0's rmse: 0.185711
[3000]	valid_0's rmse: 0.185706
[3300]	valid_0's rmse: 0.185697
[3600]	valid_0's rmse: 0.185688
[3900]	valid_0's rmse: 0.185673
[4200]	valid_0's rmse: 0.185656
[4500]	valid_0's rmse: 0.185635
[4800]	valid_0's rmse: 0.185636
[5100]	valid_0's rmse: 0.185639
[5400]	valid_0's rmse: 0.185637
[5700]	valid_0's rmse: 0.18563
[6000]	valid_0's rmse: 0.185624
[6300]	valid_0's rmse: 0.185614
[6600]	valid_0's rmse: 0.185613
[6900]	valid_0's rmse: 0.185609
[7200]	valid_0's rmse: 0.185605
[7500]	valid_0's rmse: 0.1856
[7800]	valid_0's rmse: 0.185591
[8100]	valid_0's rmse: 0.185591
[8400]	valid_0's rmse: 0.185588
[8700]	valid_0's rmse: 0.185587
[9000]	valid_0's rmse: 0.185585
[9300]	valid_0's rmse: 0.185577
[9600]	valid_0's rmse: 0.185574
[9900]	valid_0's rmse: 0.185571
[10200]	valid_0's rmse: 0.185583
[10500]	valid_0's rmse: 0.185562
[10800]	valid_0's rmse: 0.185564
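
The two kinds of numbers in the log above live on different scales: the `valid_0's rmse` lines are measured in log1p space (Stage 1 on the log target, logged every 200 rounds; Stage 2 on the residual, logged every 300), while the per-fold `RMSE=` line is in dollars after `expm1(trend + residual)`. The sketch below illustrates that relationship with the standard library only; the fares, the constant "trend", and the "recover 90% of the residual" Stage 2 are hypothetical stand-ins, not the notebook's models.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

# Toy fares in dollars (hypothetical values, not taken from the dataset).
fares = [6.5, 9.0, 14.5, 52.0]
y_log = [math.log1p(f) for f in fares]

# Stage 1: a trend prediction in log space (a constant here, for illustration only).
trend = [sum(y_log) / len(y_log)] * len(y_log)

# Stage 2 target: whatever the trend left unexplained.
residual_target = [y - t for y, t in zip(y_log, trend)]

# Pretend Stage 2 recovers 90% of each residual.
residual_pred = [0.9 * r for r in residual_target]

# Recombine in log space, then invert the log1p transform.
final_log = [t + r for t, r in zip(trend, residual_pred)]
final_fare = [math.expm1(v) for v in final_log]

rmse_residual = rmse(residual_target, residual_pred)  # what the Stage 2 eval metric measures
rmse_log = rmse(y_log, final_log)                     # same error, seen from the log target
rmse_fare = rmse(fares, final_fare)                   # dollar-scale error, like the fold RMSE
print(rmse_residual, rmse_log, rmse_fare)
```

Because the trend cancels out, the residual-space RMSE equals the log-space RMSE of the combined prediction, so the Stage 2 log lines can be compared directly with a single-model log-target run. The dollar-scale RMSE is dominated by the largest fares, which is why a small log-space error can still correspond to a fold RMSE of several dollars.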


In [55]:
# Phase 3 — Time-Aware Single Model with Hierarchical Temporal TE (gh6/gh5 × year/month) + fold ensembling
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals() and 'sample' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb
try:
    import pygeohash as pgh
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'pygeohash'])
    import pygeohash as pgh

# Helpers
def to_local(dt):
    return dt.dt.tz_convert('America/New_York')

def ensure_days_since_start(df):
    if 'days_since_start' in df.columns: return df
    d = df.copy(); dt_local = to_local(d['pickup_datetime'])
    start_ns = int(dt_local.min().value)
    d['days_since_start'] = ((dt_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d

def add_geohash(df, prec_list=(6,5)):
    d = df.copy()
    lat_pu = d['pickup_latitude'].astype('float32').values
    lon_pu = d['pickup_longitude'].astype('float32').values
    lat_do = d['dropoff_latitude'].astype('float32').values
    lon_do = d['dropoff_longitude'].astype('float32').values
    for prec in prec_list:
        pu_col = f'pu_gh{prec}'; do_col = f'do_gh{prec}'
        if pu_col in d.columns and do_col in d.columns: continue
        d[pu_col] = [pgh.encode(float(la), float(lo), precision=prec) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(lat_pu, lon_pu)]
        d[do_col] = [pgh.encode(float(la), float(lo), precision=prec) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(lat_do, lon_do)]
    return d

def ensure_time_parts(df):
    d = df.copy(); dt_local = to_local(d['pickup_datetime'])
    d['year'] = dt_local.dt.year.astype('int16'); d['month'] = dt_local.dt.month.astype('int8')
    d['day'] = dt_local.dt.day.astype('int8'); d['hour'] = dt_local.dt.hour.astype('int8'); d['dow'] = dt_local.dt.dayofweek.astype('int8')
    d['hour_sin'] = np.sin(2*np.pi*d['hour']/24).astype('float32'); d['hour_cos'] = np.cos(2*np.pi*d['hour']/24).astype('float32')
    d['dow_sin'] = np.sin(2*np.pi*d['dow']/7).astype('float32'); d['dow_cos'] = np.cos(2*np.pi*d['dow']/7).astype('float32')
    d['doy_sin'] = np.sin(2*np.pi*(dt_local.dt.dayofyear.astype('int16')/365.25)).astype('float32')
    d['doy_cos'] = np.cos(2*np.pi*(dt_local.dt.dayofyear.astype('int16')/365.25)).astype('float32')
    d['is_weekend'] = d['dow'].isin([5,6]).astype('int8')
    d['is_rush'] = ((d['hour'].between(7,10)) | (d['hour'].between(16,19))).astype('int8')
    d['is_night'] = ((d['hour'] >= 22) | (d['hour'] <= 5)).astype('int8')
    return d

def make_timeaware_keys(df):
    d = df.copy()
    # string concat keys
    d['k_pu_l1'] = (d['pu_gh6'].astype('string') + '_' + d['year'].astype('string') + '_' + d['month'].astype('string'))
    d['k_pu_l2'] = (d['pu_gh6'].astype('string') + '_' + d['year'].astype('string'))
    d['k_pu_l3'] = (d['pu_gh5'].astype('string') + '_' + d['year'].astype('string'))
    d['k_do_l1'] = (d['do_gh6'].astype('string') + '_' + d['year'].astype('string') + '_' + d['month'].astype('string'))
    d['k_do_l2'] = (d['do_gh6'].astype('string') + '_' + d['year'].astype('string'))
    d['k_do_l3'] = (d['do_gh5'].astype('string') + '_' + d['year'].astype('string'))
    return d

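# TE helper with smoothing; returns (enc, mapping, count_map, prior).
# Note: this redefines te_smooth with a different return order than the Phase 2 cell (enc, prior, mapping, counts).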
def te_smooth(train_key, train_tgt, apply_key, m=500.0, prior=None):
    if prior is None: prior = float(np.mean(train_tgt))
    g = pd.DataFrame({'k': train_key.astype('object'), 'y': train_tgt}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict(); cnt_map = cnts.to_dict()
    enc = apply_key.astype('object').map(mp).fillna(prior).astype('float32').values
    return enc, mp, cnt_map, prior

def blend_three_levels(k1_tr, k2_tr, k3_tr, y_tr, k1_ap, k2_ap, k3_ap, m=500.0, alpha=200.0, hard_thr=50):
    # Level 1 (fine), Level 2 (medium), Level 3 (coarse)
    enc1, mp1, cnt1, prior = te_smooth(k1_tr, y_tr, k1_ap, m=m, prior=None)
    enc2, mp2, cnt2, _ = te_smooth(k2_tr, y_tr, k2_ap, m=m, prior=prior)
    enc3, mp3, cnt3, _ = te_smooth(k3_tr, y_tr, k3_ap, m=m, prior=prior)
    # weights
    ap_keys1 = k1_ap.astype('object').values
    ap_keys2 = k2_ap.astype('object').values
    c1 = np.array([cnt1.get(k, 0.0) for k in ap_keys1], dtype='float32')
    c2 = np.array([cnt2.get(k, 0.0) for k in ap_keys2], dtype='float32')
    w1 = c1 / (c1 + np.float32(alpha))
    w2 = c2 / (c2 + np.float32(alpha))
    if hard_thr is not None and hard_thr > 0:
        m1 = (c1 < np.float32(hard_thr)); m2 = (c2 < np.float32(hard_thr))
        if m1.any(): w1 = w1.copy(); w1[m1] = 0.0
        if m2.any(): w2 = w2.copy(); w2[m2] = 0.0
    enc12 = (w1 * enc1 + (1.0 - w1) * enc2).astype('float32')
    enc = (w2 * enc12 + (1.0 - w2) * enc3).astype('float32')
    return enc

# Prepare data
train_ord = ensure_days_since_start(train_df.copy()).sort_values('pickup_datetime').reset_index(drop=True)
test_prep = ensure_days_since_start(test_df.copy())
train_ord = ensure_time_parts(train_ord); test_prep = ensure_time_parts(test_prep)
train_ord = add_geohash(train_ord, (6,5)); test_prep = add_geohash(test_prep, (6,5))
train_ord = make_timeaware_keys(train_ord); test_prep = make_timeaware_keys(test_prep)

# Base robust features; drop raw year/month/day from final model (used only in TE keys) per mandate
base_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'hour_sin','hour_cos','dow_sin','dow_cos','doy_sin','doy_cos','is_holiday','is_weekend','is_rush','is_night',
    'days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_ord.columns: base_feats.append('rot_manh_km')
if 'dist_x_after_hike' in train_ord.columns: base_feats.append('dist_x_after_hike')

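missing = [c for c in base_feats if c not in train_ord.columns or c not in test_prep.columns]
assert not missing, f'Missing base features (expected from earlier cells): {missing}'
train_ord = train_ord.dropna(subset=base_feats + ['fare_amount']).reset_index(drop=True)
test_prep[base_feats] = test_prep[base_feats].fillna(0)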

y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)
dt_all = train_ord['pickup_datetime']
K = 5
folds = make_time_folds_quantile(dt_all, n_folds=K, gap_days=1)
assert len(folds) == K, f'Expected {K} folds, got {len(folds)}'

# Model params (moderately regularized) and monotone +1 on days_since_start
lgb_params = dict(
    objective='huber', metric='rmse',
    learning_rate=0.045, n_estimators=24000,
    num_leaves=128, max_depth=-1, min_data_in_leaf=800,
    feature_fraction=0.75, bagging_fraction=0.8, bagging_freq=1,
    max_bin=127, reg_alpha=2.0, reg_lambda=30.0,
    random_state=2025, n_jobs=-1, verbose=-1
)

cv_rmses, best_iters = [], []
fold_test_preds = []

for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx].copy(); va = train_ord.iloc[va_idx].copy()
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    # Time-aware TE for PU
    te_pu_tr = blend_three_levels(tr['k_pu_l1'], tr['k_pu_l2'], tr['k_pu_l3'], y_tr, tr['k_pu_l1'], tr['k_pu_l2'], tr['k_pu_l3'], m=500.0, alpha=200.0, hard_thr=50)
    te_pu_va = blend_three_levels(tr['k_pu_l1'], tr['k_pu_l2'], tr['k_pu_l3'], y_tr, va['k_pu_l1'], va['k_pu_l2'], va['k_pu_l3'], m=500.0, alpha=200.0, hard_thr=50)
    # Time-aware TE for DO
    te_do_tr = blend_three_levels(tr['k_do_l1'], tr['k_do_l2'], tr['k_do_l3'], y_tr, tr['k_do_l1'], tr['k_do_l2'], tr['k_do_l3'], m=500.0, alpha=200.0, hard_thr=50)
    te_do_va = blend_three_levels(tr['k_do_l1'], tr['k_do_l2'], tr['k_do_l3'], y_tr, va['k_do_l1'], va['k_do_l2'], va['k_do_l3'], m=500.0, alpha=200.0, hard_thr=50)
    # Add tiny Gaussian noise (sd=1e-4) to train/validation TE values for stability; test-time TE below is noise-free
    seed_fold = 1000 + i
    rng = np.random.default_rng(seed_fold)
    tr['te_pu'] = (te_pu_tr + np.float32(1e-4) * rng.standard_normal(len(tr))).astype('float32')
    va['te_pu'] = (te_pu_va + np.float32(1e-4) * rng.standard_normal(len(va))).astype('float32')
    tr['te_do'] = (te_do_tr + np.float32(1e-4) * rng.standard_normal(len(tr))).astype('float32')
    va['te_do'] = (te_do_va + np.float32(1e-4) * rng.standard_normal(len(va))).astype('float32')
    use_cols = base_feats + ['te_pu','te_do']
    X_tr = tr[use_cols].astype('float32').values
    X_va = va[use_cols].astype('float32').values
    mono = [0]*len(use_cols)
    if 'days_since_start' in use_cols:
        mono[use_cols.index('days_since_start')] = 1
    model = lgb.LGBMRegressor(**{**lgb_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=700, verbose=False), lgb.log_evaluation(period=250)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    rmse = float(root_mean_squared_error(np.expm1(y_va), y_pred_val))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or model.n_estimators))
    print(f"[TimeAware-Single] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})")

    # Test-time TE with same time-aware keys computed from test and train-only maps
    tst = test_prep.copy()
    te_pu_test = blend_three_levels(tr['k_pu_l1'], tr['k_pu_l2'], tr['k_pu_l3'], y_tr, tst['k_pu_l1'], tst['k_pu_l2'], tst['k_pu_l3'], m=500.0, alpha=200.0, hard_thr=50)
    te_do_test = blend_three_levels(tr['k_do_l1'], tr['k_do_l2'], tr['k_do_l3'], y_tr, tst['k_do_l1'], tst['k_do_l2'], tst['k_do_l3'], m=500.0, alpha=200.0, hard_thr=50)
    tst['te_pu'] = te_pu_test.astype('float32'); tst['te_do'] = te_do_test.astype('float32')
    X_test_fold = tst[use_cols].astype('float32').values
    fold_pred = np.expm1(model.predict(X_test_fold, num_iteration=model.best_iteration_)).astype('float32')
    fold_test_preds.append(fold_pred)

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_median = int(np.median(best_iters)) if best_iters else 0
print({'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 'median_best_iter': best_iter_median, 'note': 'Time-aware TE (gh6/gh5 × year/month) + huber + monotone(time) + fold ensembling'})

# Fold-ensemble predictions
test_pred = np.mean(np.vstack(fold_test_preds), axis=0).astype('float32') if fold_test_preds else np.zeros(len(test_prep), dtype='float32')
test_pred = np.clip(test_pred, 0, 500)
sub = sample[['key']].merge(pd.DataFrame({'key': test_prep['key'].astype('string'), 'fare_amount': test_pred}), on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (Time-Aware Single Model) with shape:', sub.shape)

[250]	valid_0's rmse: 0.23419
[500]	valid_0's rmse: 0.232584
[750]	valid_0's rmse: 0.231963
[1000]	valid_0's rmse: 0.231703
[1250]	valid_0's rmse: 0.231627
[1500]	valid_0's rmse: 0.231631
[1750]	valid_0's rmse: 0.231614
[TimeAware-Single] Fold 1: RMSE=3.68843, best_iter=1175 (train_n=319908, val_n=320756)
[250]	valid_0's rmse: 0.178706
[500]	valid_0's rmse: 0.176968
[750]	valid_0's rmse: 0.176705
[1000]	valid_0's rmse: 0.176535
[1250]	valid_0's rmse: 0.176586
[1500]	valid_0's rmse: 0.1766
[TimeAware-Single] Fold 2: RMSE=2.96820, best_iter=991 (train_n=640523, val_n=320756)
[250]	valid_0's rmse: 0.212025
[500]	valid_0's rmse: 0.21023
[750]	valid_0's rmse: 0.21026
[1000]	valid_0's rmse: 0.210467
[TimeAware-Single] Fold 3: RMSE=3.82885, best_iter=545 (train_n=961323, val_n=320756)
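
Note on reading the log above: the per-iteration `valid_0's rmse` values are computed on the log1p-transformed target, while the per-fold `RMSE=` summary is computed in dollars after `np.expm1`. That is why a log-space value near 0.23 sits next to a fold RMSE near 3.7. The sketch below illustrates the two scales with made-up fares (the numbers and the `rmse` helper are illustrative assumptions, not taken from the notebook):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical fares in dollars; not taken from the competition data.
fares_true = [6.5, 12.0, 52.0]
fares_pred = [7.0, 11.0, 40.0]

# Scale of the LightGBM eval log when the model is fit on log1p(fare).
log_true = [math.log1p(v) for v in fares_true]
log_pred = [math.log1p(v) for v in fares_pred]
rmse_log = rmse(log_true, log_pred)

# Scale of the per-fold summary: invert with expm1, then score in dollars.
rmse_dollars = rmse([math.expm1(v) for v in log_true],
                    [math.expm1(v) for v in log_pred])

print(f"log-space RMSE: {rmse_log:.4f}, dollar-space RMSE: {rmse_dollars:.4f}")
```

Because a single large-fare miss dominates the dollar-scale score but is compressed in log space, early stopping on the log-space metric does not necessarily pick the iteration that minimizes the dollar-scale RMSE.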


In [57]:
# Phase 15 — Maximalist Hardening: Time-Aware Single Model with Aggressive Hierarchical TE + Reliability + Lockdown Regularization
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals() and 'sample' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb
try:
    import pygeohash as pgh
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'pygeohash'])
    import pygeohash as pgh

# Helpers
def to_local(dt):
    return dt.dt.tz_convert('America/New_York')

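def ensure_days_since_start(df):
    # No-op if the column already exists. Otherwise the origin is this frame's own earliest pickup,
    # so train and test share an origin only when the column was precomputed upstream.
    if 'days_since_start' in df.columns: return df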
    d = df.copy(); dt_local = to_local(d['pickup_datetime'])
    start_ns = int(dt_local.min().value)
    d['days_since_start'] = ((dt_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d

def ensure_time_parts(df):
    d = df.copy(); dt = to_local(d['pickup_datetime'])
    d['year'] = dt.dt.year.astype('int16'); d['month'] = dt.dt.month.astype('int8')
    return d

def add_geohash(df, prec_list=(6,5)):
    d = df.copy()
    lat_pu = d['pickup_latitude'].astype('float32').values
    lon_pu = d['pickup_longitude'].astype('float32').values
    lat_do = d['dropoff_latitude'].astype('float32').values
    lon_do = d['dropoff_longitude'].astype('float32').values
    for prec in prec_list:
        pu_col = f'pu_gh{prec}'; do_col = f'do_gh{prec}'
        if pu_col in d.columns and do_col in d.columns: continue
        d[pu_col] = [pgh.encode(float(la), float(lo), precision=prec) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(lat_pu, lon_pu)]
        d[do_col] = [pgh.encode(float(la), float(lo), precision=prec) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(lat_do, lon_do)]
    return d

def build_timeaware_keys(df):
    d = df.copy()
    # PU keys
    d['k_pu_ym6'] = (d['pu_gh6'].astype('string') + '_' + d['year'].astype('string') + '_' + d['month'].astype('string'))
    d['k_pu_y6']  = (d['pu_gh6'].astype('string') + '_' + d['year'].astype('string'))
    d['k_pu_y5']  = (d['pu_gh5'].astype('string') + '_' + d['year'].astype('string'))
    # DO keys
    d['k_do_ym6'] = (d['do_gh6'].astype('string') + '_' + d['year'].astype('string') + '_' + d['month'].astype('string'))
    d['k_do_y6']  = (d['do_gh6'].astype('string') + '_' + d['year'].astype('string'))
    d['k_do_y5']  = (d['do_gh5'].astype('string') + '_' + d['year'].astype('string'))
    return d

# Smoothed TE helper (log-target). Returns (enc, prior, smoothed-mean map, count map).
def te_smooth(train_key, train_tgt, apply_key, m=3000.0, prior=None):
    if prior is None: prior = float(np.mean(train_tgt))
    g = pd.DataFrame({'k': train_key.astype('object'), 'y': train_tgt}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict(); cnt_map = cnts.to_dict()
    enc = apply_key.astype('object').map(mp).fillna(prior).astype('float32').values
    return enc, prior, mp, cnt_map

# Hierarchical back-off blend with dynamic prior, clipping, and hard back-off.
def hardened_timeaware_te(tr_df, ap_df, key_cols, y_tr_log, m_levels=(3500.0, 2500.0, 1500.0), alpha=800.0, hard_thr=100, clip_width=0.12):
    # Dynamic prior on recent 120 days if available
    days = tr_df['days_since_start'].astype('float32').values
    prior_dyn = float(np.mean(y_tr_log[days >= (days.max() - 120.0)])) if len(days) and (days >= (days.max() - 120.0)).any() else float(np.mean(y_tr_log))
    # Level order: most specific -> least specific
    encs_ap = []; cnts_list = []
    for k_col, m in zip(key_cols, m_levels):
        enc_ap, prior_dyn, _, cnt_map = te_smooth(tr_df[k_col], y_tr_log, ap_df[k_col], m=m, prior=prior_dyn)
        encs_ap.append(enc_ap.astype('float32'))
        cnts_list.append(cnt_map)
    # Weights based on counts of higher-resolution keys
    k1 = key_cols[0]; c1 = np.array([cnts_list[0].get(k, 0.0) for k in ap_df[k1].astype('object').values], dtype='float32')
    w1 = c1 / (c1 + np.float32(alpha))
    if hard_thr is not None and hard_thr > 0:
        mask1 = (c1 < np.float32(hard_thr))
        if mask1.any():
            w1 = w1.copy(); w1[mask1] = 0.0
    # Blend level1 with level2
    enc12 = (w1 * encs_ap[0] + (1.0 - w1) * encs_ap[1]).astype('float32')
    # Second-stage weights from level2 counts
    k2 = key_cols[1]; c2 = np.array([cnts_list[1].get(k, 0.0) for k in ap_df[k2].astype('object').values], dtype='float32')
    w2 = c2 / (c2 + np.float32(alpha))
    if hard_thr is not None and hard_thr > 0:
        mask2 = (c2 < np.float32(hard_thr))
        if mask2.any():
            w2 = w2.copy(); w2[mask2] = 0.0
    enc = (w2 * enc12 + (1.0 - w2) * encs_ap[2]).astype('float32')
    # Clip TE to reduce volatility
    enc = np.clip(enc, prior_dyn - clip_width, prior_dyn + clip_width).astype('float32')
    return enc, prior_dyn, (c1, c2)

# Reliability counts on gh6
def add_reliability_counts(df_tr, df_ap):
    pu6_counts = df_tr['pu_gh6'].astype('object').value_counts().astype('int32')
    do6_counts = df_tr['do_gh6'].astype('object').value_counts().astype('int32')
    out = df_ap.copy()
    out['log_pu6_cnt'] = np.log1p(out['pu_gh6'].astype('object').map(pu6_counts).fillna(0).astype('int32')).astype('float32')
    out['log_do6_cnt'] = np.log1p(out['do_gh6'].astype('object').map(do6_counts).fillna(0).astype('int32')).astype('float32')
    return out

# Prepare data
train_ord = ensure_days_since_start(train_df.copy()).sort_values('pickup_datetime').reset_index(drop=True)
test_prep = ensure_days_since_start(test_df.copy())
train_ord = ensure_time_parts(train_ord); test_prep = ensure_time_parts(test_prep)
train_ord = add_geohash(train_ord, (6,5)); test_prep = add_geohash(test_prep, (6,5))

# Base features (no raw year/month/day in model)
base_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'hour_sin','hour_cos','dow_sin','dow_cos','doy_sin','doy_cos','is_holiday','is_weekend','is_rush','is_night',
    'days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_ord.columns: base_feats.append('rot_manh_km')
if 'dist_x_after_hike' in train_ord.columns: base_feats.append('dist_x_after_hike')

train_ord = train_ord.dropna(subset=base_feats + ['fare_amount']).reset_index(drop=True)
test_prep[base_feats] = test_prep[base_feats].fillna(0)

y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)
dt_all = train_ord['pickup_datetime']
folds = make_time_folds_quantile(dt_all, n_folds=5, gap_days=1)
assert len(folds) == 5, 'Expected 5 folds'

# LightGBM lockdown params (per mandate)
lgb_params = dict(
    objective='huber', metric='rmse',
    learning_rate=0.045, n_estimators=26000,
    num_leaves=96, max_depth=-1, min_data_in_leaf=1800,
    feature_fraction=0.65, bagging_fraction=0.80, bagging_freq=1,
    max_bin=127, reg_alpha=5.0, reg_lambda=60.0, min_gain_to_split=0.1,
    random_state=2025, n_jobs=-1, verbose=-1
)

cv_rmses, best_iters = [], []
fold_test_preds = []
REL_THR = np.log(5.0).astype('float32')  # reliability guard threshold; counts are log1p-encoded, so log(5) guards raw counts below 4, not 5

for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx].copy(); va = train_ord.iloc[va_idx].copy()
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    # Build time-aware keys
    tr = build_timeaware_keys(tr); va = build_timeaware_keys(va)
    # Hardened hierarchical time-aware TE (PU and DO) with mandated params
    te_pu_va, prior_pu, _ = hardened_timeaware_te(tr, va, ['k_pu_ym6','k_pu_y6','k_pu_y5'], y_tr,
                                                  m_levels=(3500.0, 2500.0, 1500.0), alpha=800.0, hard_thr=100, clip_width=0.12)
    te_do_va, prior_do, _ = hardened_timeaware_te(tr, va, ['k_do_ym6','k_do_y6','k_do_y5'], y_tr,
                                                  m_levels=(3500.0, 2500.0, 1500.0), alpha=800.0, hard_thr=100, clip_width=0.12)
    te_pu_tr, _, _ = hardened_timeaware_te(tr, tr, ['k_pu_ym6','k_pu_y6','k_pu_y5'], y_tr,
                                           m_levels=(3500.0, 2500.0, 1500.0), alpha=800.0, hard_thr=100, clip_width=0.12)
    te_do_tr, _, _ = hardened_timeaware_te(tr, tr, ['k_do_ym6','k_do_y6','k_do_y5'], y_tr,
                                           m_levels=(3500.0, 2500.0, 1500.0), alpha=800.0, hard_thr=100, clip_width=0.12)
    tr['te_pu'] = te_pu_tr.astype('float32'); va['te_pu'] = te_pu_va.astype('float32')
    tr['te_do'] = te_do_tr.astype('float32'); va['te_do'] = te_do_va.astype('float32')
    # Reliability features for train and validation; the guard below overwrites TE on validation (and later test) rows only, not on train
    tr = add_reliability_counts(tr, tr); va = add_reliability_counts(tr, va)
    # Overwrite TE with dynamic prior for sparse keys (guard)
    va_mask_pu = va['log_pu6_cnt'].values < REL_THR
    if np.any(va_mask_pu):
        va.loc[va_mask_pu, 'te_pu'] = prior_pu
    va_mask_do = va['log_do6_cnt'].values < REL_THR
    if np.any(va_mask_do):
        va.loc[va_mask_do, 'te_do'] = prior_do

    use_cols = base_feats + ['te_pu','te_do','log_pu6_cnt','log_do6_cnt']
    X_tr = tr[use_cols].astype('float32').values; X_va = va[use_cols].astype('float32').values
    # Monotone constraints: +1 on days_since_start, dist_hav_km, dist_man_km
    mono = [0]*len(use_cols)
    for nm in ['days_since_start','dist_hav_km','dist_man_km']:
        if nm in use_cols:
            mono[use_cols.index(nm)] = 1
    model = lgb.LGBMRegressor(**{**lgb_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=900, verbose=False), lgb.log_evaluation(period=250)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    rmse = float(root_mean_squared_error(np.expm1(y_va), y_pred_val))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or model.n_estimators))
    print(f"[Hardened TimeAware] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})")

    # Test predictions for fold ensembling with train-only encodings + reliability guard
    tst = test_prep.copy()
    tst = build_timeaware_keys(tst)
    te_pu_test, _, _ = hardened_timeaware_te(tr, tst, ['k_pu_ym6','k_pu_y6','k_pu_y5'], y_tr,
                                             m_levels=(3500.0, 2500.0, 1500.0), alpha=800.0, hard_thr=100, clip_width=0.12)
    te_do_test, _, _ = hardened_timeaware_te(tr, tst, ['k_do_ym6','k_do_y6','k_do_y5'], y_tr,
                                             m_levels=(3500.0, 2500.0, 1500.0), alpha=800.0, hard_thr=100, clip_width=0.12)
    tst['te_pu'] = te_pu_test.astype('float32'); tst['te_do'] = te_do_test.astype('float32')
    tst = add_reliability_counts(tr, tst)
    # Guard on test
    tst_mask_pu = tst['log_pu6_cnt'].values < REL_THR
    if np.any(tst_mask_pu):
        tst.loc[tst_mask_pu, 'te_pu'] = prior_pu
    tst_mask_do = tst['log_do6_cnt'].values < REL_THR
    if np.any(tst_mask_do):
        tst.loc[tst_mask_do, 'te_do'] = prior_do

    X_test_fold = tst[use_cols].astype('float32').values
    fold_pred = np.expm1(model.predict(X_test_fold, num_iteration=model.best_iteration_)).astype('float32')
    fold_pred = np.clip(fold_pred, 0, 500)
    fold_test_preds.append(fold_pred)

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_median = int(np.median(best_iters)) if best_iters else 0
print({'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 'median_best_iter': best_iter_median, 'note': 'Maximalist hardened time-aware single model (mandated params + reliability guard)'})

# Fold-ensemble submission
test_pred = np.mean(np.vstack(fold_test_preds), axis=0).astype('float32') if fold_test_preds else np.zeros(len(test_prep), dtype='float32')
test_pred = np.clip(test_pred, 0, 500)
sub = sample[['key']].merge(pd.DataFrame({'key': test_prep['key'].astype('string'), 'fare_amount': test_pred}), on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (Hardened Time-Aware Single Model) with shape:', sub.shape)

[250]	valid_0's rmse: 0.235701
[500]	valid_0's rmse: 0.23494
[750]	valid_0's rmse: 0.23481
[1000]	valid_0's rmse: 0.234737
[1250]	valid_0's rmse: 0.234704
[1500]	valid_0's rmse: 0.234681
[1750]	valid_0's rmse: 0.234651
[2000]	valid_0's rmse: 0.234639
[2250]	valid_0's rmse: 0.234605
[2500]	valid_0's rmse: 0.234587
[2750]	valid_0's rmse: 0.234575
[3000]	valid_0's rmse: 0.234584
[3250]	valid_0's rmse: 0.234561
[3500]	valid_0's rmse: 0.234553
[3750]	valid_0's rmse: 0.234552
[4000]	valid_0's rmse: 0.234542
[4250]	valid_0's rmse: 0.234537
[4500]	valid_0's rmse: 0.234538
[4750]	valid_0's rmse: 0.23452
[5000]	valid_0's rmse: 0.234519
[5250]	valid_0's rmse: 0.234517
[5500]	valid_0's rmse: 0.234504
[5750]	valid_0's rmse: 0.234497
[6000]	valid_0's rmse: 0.234502
[6250]	valid_0's rmse: 0.234495
[6500]	valid_0's rmse: 0.234489
[6750]	valid_0's rmse: 0.234498
[7000]	valid_0's rmse: 0.234498
[7250]	valid_0's rmse: 0.234485
[7500]	valid_0's rmse: 0.234481
[7750]	valid_0's rmse: 0.234474
[8000]	valid_0's rmse: 0.234474
[8250]	valid_0's rmse: 0.234476
[8500]	valid_0's rmse: 0.23448
[Hardened TimeAware] Fold 1: RMSE=3.73716, best_iter=7771 (train_n=319908, val_n=320756)
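
Note on the reliability guard in the cell above: the count features are encoded as `log1p(count)`, but the threshold is written as `log(5.0)`. Inverting the encoding shows which raw counts each threshold form actually guards. The following is a minimal sketch using only the standard library; `implied_count_cutoff` is a hypothetical helper introduced here for illustration, and boundary rows may behave differently under float32 rounding:

```python
import math

def implied_count_cutoff(threshold):
    """Invert the log1p encoding: rows with raw count below the cutoff are guarded."""
    return math.expm1(threshold)

# Threshold written as log(5): log1p(c) < log(5) is equivalent to c < 4.
cutoff_log = implied_count_cutoff(math.log(5.0))

# Threshold written as log1p(5): log1p(c) < log1p(5) is equivalent to c < 5.
cutoff_log1p = implied_count_cutoff(math.log1p(5.0))

print(f"log(5) guards counts below ~{cutoff_log:.3f}; "
      f"log1p(5) guards counts below ~{cutoff_log1p:.3f}")
```

Under these assumptions, a geohash cell seen exactly four times in the training fold keeps its target encoding with the `log(5.0)` threshold but would be reset to the prior with a `log1p(5.0)` threshold.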


In [60]:
# Phase 15 — Balanced Attack: relaxed TE clipping/back-off + relaxed model regularization + reliability guard (fixed threshold) + fold ensembling
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals() and 'sample' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb
try:
    import pygeohash as pgh
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'pygeohash'])
    import pygeohash as pgh

def to_local(dt):
    return dt.dt.tz_convert('America/New_York')

def ensure_days_since_start(df):
    if 'days_since_start' in df.columns: return df
    d = df.copy(); dt_local = to_local(d['pickup_datetime'])
    start_ns = int(dt_local.min().value)
    d['days_since_start'] = ((dt_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d

def ensure_time_parts(df):
    d = df.copy(); dt = to_local(d['pickup_datetime'])
    d['year'] = dt.dt.year.astype('int16'); d['month'] = dt.dt.month.astype('int8')
    return d

def add_geohash(df, prec_list=(6,5)):
    d = df.copy()
    lat_pu = d['pickup_latitude'].astype('float32').values
    lon_pu = d['pickup_longitude'].astype('float32').values
    lat_do = d['dropoff_latitude'].astype('float32').values
    lon_do = d['dropoff_longitude'].astype('float32').values
    for prec in prec_list:
        pu_col = f'pu_gh{prec}'; do_col = f'do_gh{prec}'
        if pu_col in d.columns and do_col in d.columns: continue
        d[pu_col] = [pgh.encode(float(la), float(lo), precision=prec) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(lat_pu, lon_pu)]
        d[do_col] = [pgh.encode(float(la), float(lo), precision=prec) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(lat_do, lon_do)]
    return d

def build_timeaware_keys(df):
    d = df.copy()
    d['k_pu_ym6'] = (d['pu_gh6'].astype('string') + '_' + d['year'].astype('string') + '_' + d['month'].astype('string'))
    d['k_pu_y6']  = (d['pu_gh6'].astype('string') + '_' + d['year'].astype('string'))
    d['k_pu_y5']  = (d['pu_gh5'].astype('string') + '_' + d['year'].astype('string'))
    d['k_do_ym6'] = (d['do_gh6'].astype('string') + '_' + d['year'].astype('string') + '_' + d['month'].astype('string'))
    d['k_do_y6']  = (d['do_gh6'].astype('string') + '_' + d['year'].astype('string'))
    d['k_do_y5']  = (d['do_gh5'].astype('string') + '_' + d['year'].astype('string'))
    return d

def te_smooth(train_key, train_tgt, apply_key, m=3000.0, prior=None):
    if prior is None: prior = float(np.mean(train_tgt))
    g = pd.DataFrame({'k': train_key.astype('object'), 'y': train_tgt}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict(); cnt_map = cnts.to_dict()
    enc = apply_key.astype('object').map(mp).fillna(prior).astype('float32').values
    return enc, prior, mp, cnt_map

def hardened_timeaware_te(tr_df, ap_df, key_cols, y_tr_log, m_levels=(3500.0, 2500.0, 1500.0), alpha=1000.0, hard_thr=50, clip_width=0.20):
    days = tr_df['days_since_start'].astype('float32').values
    prior_dyn = float(np.mean(y_tr_log[days >= (days.max() - 120.0)])) if len(days) and (days >= (days.max() - 120.0)).any() else float(np.mean(y_tr_log))
    encs_ap = []; cnts_list = []
    for k_col, m in zip(key_cols, m_levels):
        enc_ap, prior_dyn, _, cnt_map = te_smooth(tr_df[k_col], y_tr_log, ap_df[k_col], m=m, prior=prior_dyn)
        encs_ap.append(enc_ap.astype('float32'))
        cnts_list.append(cnt_map)
    k1 = key_cols[0]; c1 = np.array([cnts_list[0].get(k, 0.0) for k in ap_df[k1].astype('object').values], dtype='float32')
    w1 = c1 / (c1 + np.float32(alpha))
    if hard_thr is not None and hard_thr > 0:
        m1 = (c1 < np.float32(hard_thr))
        if m1.any():
            w1 = w1.copy(); w1[m1] = 0.0
    enc12 = (w1 * encs_ap[0] + (1.0 - w1) * encs_ap[1]).astype('float32')
    k2 = key_cols[1]; c2 = np.array([cnts_list[1].get(k, 0.0) for k in ap_df[k2].astype('object').values], dtype='float32')
    w2 = c2 / (c2 + np.float32(alpha))
    if hard_thr is not None and hard_thr > 0:
        m2 = (c2 < np.float32(hard_thr))
        if m2.any():
            w2 = w2.copy(); w2[m2] = 0.0
    enc = (w2 * enc12 + (1.0 - w2) * encs_ap[2]).astype('float32')
    enc = np.clip(enc, prior_dyn - clip_width, prior_dyn + clip_width).astype('float32')
    return enc, prior_dyn, (c1, c2)

def add_reliability_counts(df_tr, df_ap):
    pu6_counts = df_tr['pu_gh6'].astype('object').value_counts().astype('int32')
    do6_counts = df_tr['do_gh6'].astype('object').value_counts().astype('int32')
    out = df_ap.copy()
    out['log_pu6_cnt'] = np.log1p(out['pu_gh6'].astype('object').map(pu6_counts).fillna(0).astype('int32')).astype('float32')
    out['log_do6_cnt'] = np.log1p(out['do_gh6'].astype('object').map(do6_counts).fillna(0).astype('int32')).astype('float32')
    return out

train_ord = ensure_days_since_start(train_df.copy()).sort_values('pickup_datetime').reset_index(drop=True)
test_prep = ensure_days_since_start(test_df.copy())
train_ord = ensure_time_parts(train_ord); test_prep = ensure_time_parts(test_prep)
train_ord = add_geohash(train_ord, (6,5)); test_prep = add_geohash(test_prep, (6,5))

base_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'hour_sin','hour_cos','dow_sin','dow_cos','doy_sin','doy_cos','is_holiday','is_weekend','is_rush','is_night',
    'days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_ord.columns: base_feats.append('rot_manh_km')
if 'dist_x_after_hike' in train_ord.columns: base_feats.append('dist_x_after_hike')

train_ord = train_ord.dropna(subset=base_feats + ['fare_amount']).reset_index(drop=True)
test_prep[base_feats] = test_prep[base_feats].fillna(0)

y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)
dt_all = train_ord['pickup_datetime']
folds = make_time_folds_quantile(dt_all, n_folds=5, gap_days=1)
assert len(folds) == 5, 'Expected 5 folds'

lgb_params = dict(
    objective='huber', metric='rmse',
    learning_rate=0.045, n_estimators=26000,
    num_leaves=192, max_depth=-1, min_data_in_leaf=800,
    feature_fraction=0.80, bagging_fraction=0.80, bagging_freq=1,
    max_bin=127, reg_alpha=1.0, reg_lambda=30.0, min_gain_to_split=0.1,
    random_state=2025, n_jobs=-1, verbose=-1
)

cv_rmses, best_iters = [], []
fold_test_preds = []
REL_THR = np.log1p(5.0).astype('float32')  # corrected guard threshold to match log1p encoding

for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx].copy(); va = train_ord.iloc[va_idx].copy()
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    tr = build_timeaware_keys(tr); va = build_timeaware_keys(va)
    te_pu_va, prior_pu, _ = hardened_timeaware_te(tr, va, ['k_pu_ym6','k_pu_y6','k_pu_y5'], y_tr,
                                                  m_levels=(3500.0, 2500.0, 1500.0), alpha=1000.0, hard_thr=50, clip_width=0.20)
    te_do_va, prior_do, _ = hardened_timeaware_te(tr, va, ['k_do_ym6','k_do_y6','k_do_y5'], y_tr,
                                                  m_levels=(3500.0, 2500.0, 1500.0), alpha=1000.0, hard_thr=50, clip_width=0.20)
    te_pu_tr, _, _ = hardened_timeaware_te(tr, tr, ['k_pu_ym6','k_pu_y6','k_pu_y5'], y_tr,
                                           m_levels=(3500.0, 2500.0, 1500.0), alpha=1000.0, hard_thr=50, clip_width=0.20)
    te_do_tr, _, _ = hardened_timeaware_te(tr, tr, ['k_do_ym6','k_do_y6','k_do_y5'], y_tr,
                                           m_levels=(3500.0, 2500.0, 1500.0), alpha=1000.0, hard_thr=50, clip_width=0.20)
    tr['te_pu'] = te_pu_tr.astype('float32'); va['te_pu'] = te_pu_va.astype('float32')
    tr['te_do'] = te_do_tr.astype('float32'); va['te_do'] = te_do_va.astype('float32')
    tr = add_reliability_counts(tr, tr); va = add_reliability_counts(tr, va)
    # Symmetric reliability guard: apply to TRAIN and VALID equally
    tr_mask_pu = tr['log_pu6_cnt'].values < REL_THR
    if np.any(tr_mask_pu):
        tr.loc[tr_mask_pu, 'te_pu'] = prior_pu
    tr_mask_do = tr['log_do6_cnt'].values < REL_THR
    if np.any(tr_mask_do):
        tr.loc[tr_mask_do, 'te_do'] = prior_do
    va_mask_pu = va['log_pu6_cnt'].values < REL_THR
    if np.any(va_mask_pu):
        va.loc[va_mask_pu, 'te_pu'] = prior_pu
    va_mask_do = va['log_do6_cnt'].values < REL_THR
    if np.any(va_mask_do):
        va.loc[va_mask_do, 'te_do'] = prior_do

    use_cols = base_feats + ['te_pu','te_do','log_pu6_cnt','log_do6_cnt']
    X_tr = tr[use_cols].astype('float32').values; X_va = va[use_cols].astype('float32').values
    mono = [0]*len(use_cols)
    for nm in ['days_since_start','dist_hav_km','dist_man_km']:
        if nm in use_cols:
            mono[use_cols.index(nm)] = 1
    model = lgb.LGBMRegressor(**{**lgb_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=900, verbose=False), lgb.log_evaluation(period=250)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    rmse = float(root_mean_squared_error(np.expm1(y_va), y_pred_val))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or model.n_estimators))
    print(f"[Balanced TimeAware] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})")

    tst = test_prep.copy()
    tst = build_timeaware_keys(tst)
    te_pu_test, _, _ = hardened_timeaware_te(tr, tst, ['k_pu_ym6','k_pu_y6','k_pu_y5'], y_tr,
                                             m_levels=(3500.0, 2500.0, 1500.0), alpha=1000.0, hard_thr=50, clip_width=0.20)
    te_do_test, _, _ = hardened_timeaware_te(tr, tst, ['k_do_ym6','k_do_y6','k_do_y5'], y_tr,
                                             m_levels=(3500.0, 2500.0, 1500.0), alpha=1000.0, hard_thr=50, clip_width=0.20)
    tst['te_pu'] = te_pu_test.astype('float32'); tst['te_do'] = te_do_test.astype('float32')
    tst = add_reliability_counts(tr, tst)
    tst_mask_pu = tst['log_pu6_cnt'].values < REL_THR
    if np.any(tst_mask_pu):
        tst.loc[tst_mask_pu, 'te_pu'] = prior_pu
    tst_mask_do = tst['log_do6_cnt'].values < REL_THR
    if np.any(tst_mask_do):
        tst.loc[tst_mask_do, 'te_do'] = prior_do
    X_test_fold = tst[use_cols].astype('float32').values
    fold_pred = np.expm1(model.predict(X_test_fold, num_iteration=model.best_iteration_)).astype('float32')
    fold_pred = np.clip(fold_pred, 0, 500)
    fold_test_preds.append(fold_pred)

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_median = int(np.median(best_iters)) if best_iters else 0
print({'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 'median_best_iter': best_iter_median, 'note': 'Balanced Time-Aware single model (relaxed TE + relaxed LGBM + reliability guard) — minimal hardening applied'})

test_pred = np.mean(np.vstack(fold_test_preds), axis=0).astype('float32') if fold_test_preds else np.zeros(len(test_prep), dtype='float32')
test_pred = np.clip(test_pred, 0, 500)
sub = sample[['key']].merge(pd.DataFrame({'key': test_prep['key'].astype('string'), 'fare_amount': test_pred}), on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (Balanced Time-Aware Single Model) with shape:', sub.shape)

[250]	valid_0's rmse: 0.234176
[500]	valid_0's rmse: 0.233755
[750]	valid_0's rmse: 0.233626
[1000]	valid_0's rmse: 0.233551
[1250]	valid_0's rmse: 0.233511
[1500]	valid_0's rmse: 0.233492
[1750]	valid_0's rmse: 0.233466
[2000]	valid_0's rmse: 0.233456
[2250]	valid_0's rmse: 0.233412
[2500]	valid_0's rmse: 0.233386
[2750]	valid_0's rmse: 0.23337
[3000]	valid_0's rmse: 0.233365
[3250]	valid_0's rmse: 0.233349
[3500]	valid_0's rmse: 0.233337
[3750]	valid_0's rmse: 0.233329
[4000]	valid_0's rmse: 0.233323
[4250]	valid_0's rmse: 0.233325
[4500]	valid_0's rmse: 0.233314
[4750]	valid_0's rmse: 0.233291
[5000]	valid_0's rmse: 0.233287
[5250]	valid_0's rmse: 0.233286
[5500]	valid_0's rmse: 0.233272
[5750]	valid_0's rmse: 0.233268
[6000]	valid_0's rmse: 0.233261
[6250]	valid_0's rmse: 0.233251
[6500]	valid_0's rmse: 0.23325
[6750]	valid_0's rmse: 0.23325
[7000]	valid_0's rmse: 0.233242
[7250]	valid_0's rmse: 0.233224
[7500]	valid_0's rmse: 0.233222
[7750]	valid_0's rmse: 0.233218
[8000]	valid_0's rmse: 0.233209
[8250]	valid_0's rmse: 0.233195
[8500]	valid_0's rmse: 0.233199
[8750]	valid_0's rmse: 0.233193
[9000]	valid_0's rmse: 0.23319
[9250]	valid_0's rmse: 0.233184
[9500]	valid_0's rmse: 0.233182
[9750]	valid_0's rmse: 0.233178
[10000]	valid_0's rmse: 0.233178
[10250]	valid_0's rmse: 0.233171
[10500]	valid_0's rmse: 0.233176
[10750]	valid_0's rmse: 0.233167
[11000]	valid_0's rmse: 0.233163
[11250]	valid_0's rmse: 0.233162
[11500]	valid_0's rmse: 0.233171
[11750]	valid_0's rmse: 0.233164
[12000]	valid_0's rmse: 0.233155
[12250]	valid_0's rmse: 0.233152
[12500]	valid_0's rmse: 0.233155
[12750]	valid_0's rmse: 0.233151
[13000]	valid_0's rmse: 0.233148
[13250]	valid_0's rmse: 0.233141
[13500]	valid_0's rmse: 0.233144
[13750]	valid_0's rmse: 0.23312
[14000]	valid_0's rmse: 0.23312
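
Note on the target-encoding arithmetic used in the cell above: each level's mean is shrunk toward the prior with pseudo-count `m`, the finer level is weighted by `n / (n + alpha)` (or zero below `hard_thr`), and the result is clipped to `prior ± clip_width`. The sketch below works through a single row with two levels instead of three to keep it short. The counts, means, and prior are made-up values in log1p(fare) units; only `m`, `alpha`, `hard_thr`, and `clip_width` mirror the cell's settings:

```python
def smoothed_mean(n, key_mean, prior, m):
    """Shrink a per-key mean toward the prior using pseudo-count m."""
    return (n * key_mean + m * prior) / (n + m)

def backoff_weight(n, alpha, hard_thr):
    """Weight on the finer level; zero when the key has fewer than hard_thr rows."""
    return 0.0 if n < hard_thr else n / (n + alpha)

def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

# Hypothetical inputs for one row (log1p(fare) units).
prior = 2.40
fine = smoothed_mean(n=200, key_mean=2.90, prior=prior, m=3500.0)     # e.g. gh6 x year x month
coarse = smoothed_mean(n=5000, key_mean=2.70, prior=prior, m=2500.0)  # e.g. gh6 x year

# Blend fine into coarse, then clip around the prior.
w = backoff_weight(n=200, alpha=1000.0, hard_thr=50)
blended = w * fine + (1.0 - w) * coarse
encoded = clip(blended, prior - 0.20, prior + 0.20)

print(f"fine={fine:.4f} coarse={coarse:.4f} w={w:.4f} encoded={encoded:.4f}")
```

With pseudo-counts in the thousands, a key observed a few hundred times moves only slightly away from the prior, so most of the signal in this configuration comes from the coarser levels.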


In [61]:
# Phase 15 — Contingency: Adaptive Decay (tau=180) on Balanced Time-Aware Single Model with symmetric reliability guard + fold ensembling
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals() and 'sample' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb
try:
    import pygeohash as pgh
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'pygeohash'])
    import pygeohash as pgh

def to_local(dt):
    return dt.dt.tz_convert('America/New_York')

def ensure_days_since_start(df):
    if 'days_since_start' in df.columns: return df
    d = df.copy(); dt_local = to_local(d['pickup_datetime'])
    start_ns = int(dt_local.min().value)
    d['days_since_start'] = ((dt_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d

def ensure_time_parts(df):
    d = df.copy(); dt = to_local(d['pickup_datetime'])
    d['year'] = dt.dt.year.astype('int16'); d['month'] = dt.dt.month.astype('int8')
    return d

def add_geohash(df, prec_list=(6,5)):
    d = df.copy()
    lat_pu = d['pickup_latitude'].astype('float32').values
    lon_pu = d['pickup_longitude'].astype('float32').values
    lat_do = d['dropoff_latitude'].astype('float32').values
    lon_do = d['dropoff_longitude'].astype('float32').values
    for prec in prec_list:
        pu_col = f'pu_gh{prec}'; do_col = f'do_gh{prec}'
        if pu_col in d.columns and do_col in d.columns: continue
        d[pu_col] = [pgh.encode(float(la), float(lo), precision=prec) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(lat_pu, lon_pu)]
        d[do_col] = [pgh.encode(float(la), float(lo), precision=prec) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(lat_do, lon_do)]
    return d

def build_timeaware_keys(df):
    d = df.copy()
    d['k_pu_ym6'] = (d['pu_gh6'].astype('string') + '_' + d['year'].astype('string') + '_' + d['month'].astype('string'))
    d['k_pu_y6']  = (d['pu_gh6'].astype('string') + '_' + d['year'].astype('string'))
    d['k_pu_y5']  = (d['pu_gh5'].astype('string') + '_' + d['year'].astype('string'))
    d['k_do_ym6'] = (d['do_gh6'].astype('string') + '_' + d['year'].astype('string') + '_' + d['month'].astype('string'))
    d['k_do_y6']  = (d['do_gh6'].astype('string') + '_' + d['year'].astype('string'))
    d['k_do_y5']  = (d['do_gh5'].astype('string') + '_' + d['year'].astype('string'))
    return d

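def te_smooth(train_key, train_tgt, apply_key, m=3000.0, prior=None):
    """m-estimate target encoding: per-key mean of train_tgt shrunk toward `prior` with pseudo-count m; keys unseen in train get `prior`."""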
    if prior is None: prior = float(np.mean(train_tgt))
    g = pd.DataFrame({'k': train_key.astype('object'), 'y': train_tgt}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict(); cnt_map = cnts.to_dict()
    enc = apply_key.astype('object').map(mp).fillna(prior).astype('float32').values
    return enc, prior, mp, cnt_map

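def hardened_timeaware_te(tr_df, ap_df, key_cols, y_tr_log, m_levels=(3500.0, 2500.0, 1500.0), alpha=1000.0, hard_thr=50, clip_width=0.30):
    """Three-level back-off target encoding over key_cols (finest key first).

    Level 1 is blended with level 2 using weight c1/(c1+alpha), and that blend with level 3
    using weight c2/(c2+alpha), where c1 and c2 are the train counts of the level-1 and level-2 keys.
    A weight is forced to 0 when its count is below hard_thr. The result is clipped to
    prior_dyn +/- clip_width on the log1p scale; prior_dyn is the mean log target of the
    last 120 training days.
    """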
    days = tr_df['days_since_start'].astype('float32').values
    prior_dyn = float(np.mean(y_tr_log[days >= (days.max() - 120.0)])) if len(days) and (days >= (days.max() - 120.0)).any() else float(np.mean(y_tr_log))
    encs_ap = []; cnts_list = []
    for k_col, m in zip(key_cols, m_levels):
        enc_ap, prior_dyn, _, cnt_map = te_smooth(tr_df[k_col], y_tr_log, ap_df[k_col], m=m, prior=prior_dyn)
        encs_ap.append(enc_ap.astype('float32'))
        cnts_list.append(cnt_map)
    k1 = key_cols[0]; c1 = np.array([cnts_list[0].get(k, 0.0) for k in ap_df[k1].astype('object').values], dtype='float32')
    w1 = c1 / (c1 + np.float32(alpha))
    if hard_thr is not None and hard_thr > 0:
        m1 = (c1 < np.float32(hard_thr))
        if m1.any():
            w1 = w1.copy(); w1[m1] = 0.0
    enc12 = (w1 * encs_ap[0] + (1.0 - w1) * encs_ap[1]).astype('float32')
    k2 = key_cols[1]; c2 = np.array([cnts_list[1].get(k, 0.0) for k in ap_df[k2].astype('object').values], dtype='float32')
    w2 = c2 / (c2 + np.float32(alpha))
    if hard_thr is not None and hard_thr > 0:
        m2 = (c2 < np.float32(hard_thr))
        if m2.any():
            w2 = w2.copy(); w2[m2] = 0.0
    enc = (w2 * enc12 + (1.0 - w2) * encs_ap[2]).astype('float32')
    enc = np.clip(enc, prior_dyn - clip_width, prior_dyn + clip_width).astype('float32')
    return enc, prior_dyn, (c1, c2)

def add_reliability_counts(df_tr, df_ap):
    pu6_counts = df_tr['pu_gh6'].astype('object').value_counts().astype('int32')
    do6_counts = df_tr['do_gh6'].astype('object').value_counts().astype('int32')
    out = df_ap.copy()
    out['log_pu6_cnt'] = np.log1p(out['pu_gh6'].astype('object').map(pu6_counts).fillna(0).astype('int32')).astype('float32')
    out['log_do6_cnt'] = np.log1p(out['do_gh6'].astype('object').map(do6_counts).fillna(0).astype('int32')).astype('float32')
    return out

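# NOTE: ensure_days_since_start is a no-op when the column already exists (expected from earlier feature cells).
# If it were missing, train and test would each be measured from their own earliest pickup, giving inconsistent origins.
train_ord = ensure_days_since_start(train_df.copy()).sort_values('pickup_datetime').reset_index(drop=True)
test_prep = ensure_days_since_start(test_df.copy())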
train_ord = ensure_time_parts(train_ord); test_prep = ensure_time_parts(test_prep)
train_ord = add_geohash(train_ord, (6,5)); test_prep = add_geohash(test_prep, (6,5))

base_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'hour_sin','hour_cos','dow_sin','dow_cos','doy_sin','doy_cos','is_holiday','is_weekend','is_rush','is_night',
    'days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_ord.columns: base_feats.append('rot_manh_km')
if 'dist_x_after_hike' in train_ord.columns: base_feats.append('dist_x_after_hike')

train_ord = train_ord.dropna(subset=base_feats + ['fare_amount']).reset_index(drop=True)
test_prep[base_feats] = test_prep[base_feats].fillna(0)

y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)
dt_all = train_ord['pickup_datetime']
folds = make_time_folds_quantile(dt_all, n_folds=5, gap_days=1)
assert len(folds) == 5, 'Expected 5 folds'

lgb_params = dict(
    objective='huber', metric='rmse',
    learning_rate=0.045, n_estimators=26000,
    num_leaves=192, max_depth=-1, min_data_in_leaf=800,
    feature_fraction=0.80, bagging_fraction=0.80, bagging_freq=1,
    max_bin=127, reg_alpha=1.0, reg_lambda=30.0, min_gain_to_split=0.1,
    random_state=2025, n_jobs=-1, verbose=-1
)

cv_rmses, best_iters = [], []
fold_test_preds = []
REL_THR = np.log1p(5.0).astype('float32')  # guard threshold aligned with log1p counts
TAU = np.float32(180.0)  # Adaptive decay timescale (days)

for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx].copy(); va = train_ord.iloc[va_idx].copy()
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    tr = build_timeaware_keys(tr); va = build_timeaware_keys(va)
    te_pu_va, prior_pu, _ = hardened_timeaware_te(tr, va, ['k_pu_ym6','k_pu_y6','k_pu_y5'], y_tr,
                                                  m_levels=(3500.0, 2500.0, 1500.0), alpha=1000.0, hard_thr=50, clip_width=0.30)
    te_do_va, prior_do, _ = hardened_timeaware_te(tr, va, ['k_do_ym6','k_do_y6','k_do_y5'], y_tr,
                                                  m_levels=(3500.0, 2500.0, 1500.0), alpha=1000.0, hard_thr=50, clip_width=0.30)
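    # NOTE: training-row encodings are computed in-sample (each row's own target contributes to its key mean),
    # whereas validation/test encodings are not; the large m values limit this leakage but do not remove it.
    te_pu_tr, _, _ = hardened_timeaware_te(tr, tr, ['k_pu_ym6','k_pu_y6','k_pu_y5'], y_tr,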
                                           m_levels=(3500.0, 2500.0, 1500.0), alpha=1000.0, hard_thr=50, clip_width=0.30)
    te_do_tr, _, _ = hardened_timeaware_te(tr, tr, ['k_do_ym6','k_do_y6','k_do_y5'], y_tr,
                                           m_levels=(3500.0, 2500.0, 1500.0), alpha=1000.0, hard_thr=50, clip_width=0.30)
    tr['te_pu'] = te_pu_tr.astype('float32'); va['te_pu'] = te_pu_va.astype('float32')
    tr['te_do'] = te_do_tr.astype('float32'); va['te_do'] = te_do_va.astype('float32')
    tr = add_reliability_counts(tr, tr); va = add_reliability_counts(tr, va)
    # Symmetric reliability guard (train + val)
    m_tr_pu = tr['log_pu6_cnt'].values < REL_THR
    if np.any(m_tr_pu): tr.loc[m_tr_pu, 'te_pu'] = prior_pu
    m_tr_do = tr['log_do6_cnt'].values < REL_THR
    if np.any(m_tr_do): tr.loc[m_tr_do, 'te_do'] = prior_do
    m_va_pu = va['log_pu6_cnt'].values < REL_THR
    if np.any(m_va_pu): va.loc[m_va_pu, 'te_pu'] = prior_pu
    m_va_do = va['log_do6_cnt'].values < REL_THR
    if np.any(m_va_do): va.loc[m_va_do, 'te_do'] = prior_do

    use_cols = base_feats + ['te_pu','te_do','log_pu6_cnt','log_do6_cnt']
    X_tr = tr[use_cols].astype('float32').values; X_va = va[use_cols].astype('float32').values
    # Monotone +1 on time and distances
    mono = [0]*len(use_cols)
    for nm in ['days_since_start','dist_hav_km','dist_man_km']:
        if nm in use_cols: mono[use_cols.index(nm)] = 1
    # Adaptive decay sample weights (favor recent data within each fold)
    days_tr = tr['days_since_start'].astype('float32').values
    max_days = np.float32(days_tr.max()) if days_tr.size else np.float32(0.0)
    w_tr = np.exp((days_tr - max_days) / TAU).astype('float32')

    model = lgb.LGBMRegressor(**{**lgb_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr, sample_weight=w_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=900, verbose=False), lgb.log_evaluation(period=250)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    rmse = float(root_mean_squared_error(np.expm1(y_va), y_pred_val))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or model.n_estimators))
    print(f"[AdaptiveDecay TA] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})")

    # Test predictions (fold ensemble)
    tst = test_prep.copy()
    tst = build_timeaware_keys(tst)
    te_pu_test, _, _ = hardened_timeaware_te(tr, tst, ['k_pu_ym6','k_pu_y6','k_pu_y5'], y_tr,
                                             m_levels=(3500.0, 2500.0, 1500.0), alpha=1000.0, hard_thr=50, clip_width=0.30)
    te_do_test, _, _ = hardened_timeaware_te(tr, tst, ['k_do_ym6','k_do_y6','k_do_y5'], y_tr,
                                             m_levels=(3500.0, 2500.0, 1500.0), alpha=1000.0, hard_thr=50, clip_width=0.30)
    tst['te_pu'] = te_pu_test.astype('float32'); tst['te_do'] = te_do_test.astype('float32')
    tst = add_reliability_counts(tr, tst)
    m_tst_pu = tst['log_pu6_cnt'].values < REL_THR
    if np.any(m_tst_pu): tst.loc[m_tst_pu, 'te_pu'] = prior_pu
    m_tst_do = tst['log_do6_cnt'].values < REL_THR
    if np.any(m_tst_do): tst.loc[m_tst_do, 'te_do'] = prior_do
    X_test_fold = tst[use_cols].astype('float32').values
    fold_pred = np.expm1(model.predict(X_test_fold, num_iteration=model.best_iteration_)).astype('float32')
    fold_pred = np.clip(fold_pred, 0, 500)
    fold_test_preds.append(fold_pred)

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_median = int(np.median(best_iters)) if best_iters else 0
print({'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 'median_best_iter': best_iter_median, 'note': 'Adaptive Decay (tau=180) + Balanced Time-Aware TE + symmetric guard + fold ensembling'})

test_pred = np.mean(np.vstack(fold_test_preds), axis=0).astype('float32') if fold_test_preds else np.zeros(len(test_prep), dtype='float32')
test_pred = np.clip(test_pred, 0, 500)
sub = sample[['key']].merge(pd.DataFrame({'key': test_prep['key'].astype('string'), 'fare_amount': test_pred}), on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (Adaptive Decay) with shape:', sub.shape)

[250]	valid_0's rmse: 0.235366


[500]	valid_0's rmse: 0.235131


[750]	valid_0's rmse: 0.23503


[1000]	valid_0's rmse: 0.234985


[1250]	valid_0's rmse: 0.234966


[1500]	valid_0's rmse: 0.234936


[1750]	valid_0's rmse: 0.234921


[2000]	valid_0's rmse: 0.234912


[2250]	valid_0's rmse: 0.234903


[2500]	valid_0's rmse: 0.234878


[2750]	valid_0's rmse: 0.234872


[3000]	valid_0's rmse: 0.234868


[3250]	valid_0's rmse: 0.23485


[3500]	valid_0's rmse: 0.23482


[3750]	valid_0's rmse: 0.2348


[4000]	valid_0's rmse: 0.234787


[4250]	valid_0's rmse: 0.234786


[4500]	valid_0's rmse: 0.234784


[4750]	valid_0's rmse: 0.234768


[5000]	valid_0's rmse: 0.234775


[5250]	valid_0's rmse: 0.23478


[5500]	valid_0's rmse: 0.234767


[5750]	valid_0's rmse: 0.234765


[6000]	valid_0's rmse: 0.234762


[6250]	valid_0's rmse: 0.234756


[6500]	valid_0's rmse: 0.234751


[6750]	valid_0's rmse: 0.234728


[7000]	valid_0's rmse: 0.234722


[7250]	valid_0's rmse: 0.234716


[7500]	valid_0's rmse: 0.234709


[7750]	valid_0's rmse: 0.234713


[8000]	valid_0's rmse: 0.234709


[8250]	valid_0's rmse: 0.234698


[8500]	valid_0's rmse: 0.234701


[8750]	valid_0's rmse: 0.234695


[9000]	valid_0's rmse: 0.234704


[9250]	valid_0's rmse: 0.234691


[9500]	valid_0's rmse: 0.234688


[9750]	valid_0's rmse: 0.234688


[10000]	valid_0's rmse: 0.234686


[10250]	valid_0's rmse: 0.234683


[10500]	valid_0's rmse: 0.234683


[10750]	valid_0's rmse: 0.234678


[11000]	valid_0's rmse: 0.234678


[11250]	valid_0's rmse: 0.234691


[11500]	valid_0's rmse: 0.234687


[AdaptiveDecay TA] Fold 1: RMSE=3.73807, best_iter=10815 (train_n=319908, val_n=320756)
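
The adaptive decay weighting in the cell above can be sanity-checked in isolation. The snippet below is a minimal sketch, not the notebook's code: `decay_weights` is a hypothetical helper that reproduces the same formula, `exp((days - max_days) / tau)`, and the Kish effective sample size is added here only as an assumed diagnostic for how much training data the weights effectively keep.

```python
import numpy as np

def decay_weights(days_since_start, tau=180.0):
    """Exponential recency weights: the newest row gets 1.0, a row tau days older gets exp(-1)."""
    days = np.asarray(days_since_start, dtype=np.float64)
    if days.size == 0:
        return days
    return np.exp((days - days.max()) / tau)

# Four toy rows spaced tau days apart (illustrative values, not from the dataset).
days = np.array([0.0, 180.0, 360.0, 540.0])
w = decay_weights(days, tau=180.0)

# Kish effective sample size: (sum w)^2 / sum(w^2). It equals len(w) for uniform weights
# and shrinks toward 1 as the weights concentrate on the most recent rows.
ess = w.sum() ** 2 / np.square(w).sum()
print(np.round(w, 4), round(float(ess), 2))
```

With tau=180, each additional 180 days of age multiplies a row's weight by exp(-1), so the oldest rows of a multi-year training window contribute very little; a larger tau would flatten the weighting.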


In [62]:
# Phase 15 — Pivot: Coarse TE (gh5 x week_of_year), heavy smoothing, symmetric reliability guard; huber + monotone + fold ensembling
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals() and 'sample' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb
try:
    import pygeohash as pgh
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'pygeohash'])
    import pygeohash as pgh

def to_local(dt):
    return dt.dt.tz_convert('America/New_York')

def ensure_days_since_start(df):
    if 'days_since_start' in df.columns: return df
    d = df.copy(); dt = to_local(d['pickup_datetime'])
    start_ns = int(dt.min().value)
    d['days_since_start'] = ((dt.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d

def ensure_time_parts_week(df):
    d = df.copy(); dt = to_local(d['pickup_datetime'])
    d['year'] = dt.dt.year.astype('int16')
    d['week_of_year'] = dt.dt.isocalendar().week.astype('int16')
    return d

def add_geohash5(df):
    d = df.copy()
    if 'pu_gh5' in d.columns and 'do_gh5' in d.columns: return d
    lat_pu = d['pickup_latitude'].astype('float32').values
    lon_pu = d['pickup_longitude'].astype('float32').values
    lat_do = d['dropoff_latitude'].astype('float32').values
    lon_do = d['dropoff_longitude'].astype('float32').values
    d['pu_gh5'] = [pgh.encode(float(la), float(lo), precision=5) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(lat_pu, lon_pu)]
    d['do_gh5'] = [pgh.encode(float(la), float(lo), precision=5) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(lat_do, lon_do)]
    return d

def build_coarse_keys(df):
    d = df.copy()
    d['k_pu_w5'] = (d['pu_gh5'].astype('string') + '_' + d['week_of_year'].astype('string'))
    d['k_do_w5'] = (d['do_gh5'].astype('string') + '_' + d['week_of_year'].astype('string'))
    return d

def te_smooth(train_key, train_tgt, apply_key, m=2000.0, prior=None):
    if prior is None: prior = float(np.mean(train_tgt))
    g = pd.DataFrame({'k': train_key.astype('object'), 'y': train_tgt}).groupby('k').agg(['mean','count'])['y']
    means = g['mean'].astype('float64'); cnts = g['count'].astype('float64')
    smooth = (cnts * means + m * prior) / (cnts + m)
    mp = smooth.to_dict(); cnt_map = cnts.to_dict()
    enc = apply_key.astype('object').map(mp).fillna(prior).astype('float32').values
    return enc, prior, cnt_map

def add_reliability_counts_gh5(df_tr, df_ap):
    pu5_counts = df_tr['pu_gh5'].astype('object').value_counts().astype('int32')
    do5_counts = df_tr['do_gh5'].astype('object').value_counts().astype('int32')
    out = df_ap.copy()
    out['log_pu5_cnt'] = np.log1p(out['pu_gh5'].astype('object').map(pu5_counts).fillna(0).astype('int32')).astype('float32')
    out['log_do5_cnt'] = np.log1p(out['do_gh5'].astype('object').map(do5_counts).fillna(0).astype('int32')).astype('float32')
    return out

# Prepare data
train_ord = ensure_days_since_start(train_df.copy()).sort_values('pickup_datetime').reset_index(drop=True)
test_prep = ensure_days_since_start(test_df.copy())
train_ord = ensure_time_parts_week(train_ord); test_prep = ensure_time_parts_week(test_prep)
train_ord = add_geohash5(train_ord); test_prep = add_geohash5(test_prep)
train_ord = build_coarse_keys(train_ord); test_prep = build_coarse_keys(test_prep)

# Base robust features
base_feats = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'hour_sin','hour_cos','dow_sin','dow_cos','doy_sin','doy_cos','is_holiday','is_weekend','is_rush','is_night',
    'days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_ord.columns: base_feats.append('rot_manh_km')
if 'dist_x_after_hike' in train_ord.columns: base_feats.append('dist_x_after_hike')
train_ord = train_ord.dropna(subset=base_feats + ['fare_amount']).reset_index(drop=True)
test_prep[base_feats] = test_prep[base_feats].fillna(0)

y_all_log = np.log1p(train_ord['fare_amount'].astype('float32').values)
dt_all = train_ord['pickup_datetime']
folds = make_time_folds_quantile(dt_all, n_folds=5, gap_days=1)
assert len(folds) == 5, 'Expected 5 folds'

# LGB params (stable baseline per mandate)
lgb_params = dict(
    objective='huber', metric='rmse',
    learning_rate=0.045, n_estimators=26000,
    num_leaves=192, max_depth=-1, min_data_in_leaf=800,
    feature_fraction=0.80, bagging_fraction=0.80, bagging_freq=1,
    max_bin=127, reg_alpha=1.0, reg_lambda=30.0, min_gain_to_split=0.1,
    random_state=2025, n_jobs=-1, verbose=-1
)

cv_rmses, best_iters = [], []
fold_test_preds = []
REL_THR = np.log1p(5.0).astype('float32')

for i, (tr_idx, va_idx) in enumerate(folds, 1):
    tr = train_ord.iloc[tr_idx].copy(); va = train_ord.iloc[va_idx].copy()
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
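    # Single-level coarse TE on (gh5 x week_of_year).
    # NOTE: training rows are encoded in-sample (own target included in the key mean); validation rows use train-only maps.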
    te_pu_tr, prior_pu, pu_cnt_map = te_smooth(tr['k_pu_w5'], y_tr, tr['k_pu_w5'], m=2000.0, prior=None)
    te_pu_va, _, _ = te_smooth(tr['k_pu_w5'], y_tr, va['k_pu_w5'], m=2000.0, prior=prior_pu)
    te_do_tr, prior_do, do_cnt_map = te_smooth(tr['k_do_w5'], y_tr, tr['k_do_w5'], m=2000.0, prior=None)
    te_do_va, _, _ = te_smooth(tr['k_do_w5'], y_tr, va['k_do_w5'], m=2000.0, prior=prior_do)
    tr['te_pu'] = te_pu_tr.astype('float32'); va['te_pu'] = te_pu_va.astype('float32')
    tr['te_do'] = te_do_tr.astype('float32'); va['te_do'] = te_do_va.astype('float32')
    # Reliability counts and symmetric guard on gh5
    tr = add_reliability_counts_gh5(tr, tr); va = add_reliability_counts_gh5(tr, va)
    m_tr_pu = tr['log_pu5_cnt'].values < REL_THR
    if np.any(m_tr_pu): tr.loc[m_tr_pu, 'te_pu'] = prior_pu
    m_tr_do = tr['log_do5_cnt'].values < REL_THR
    if np.any(m_tr_do): tr.loc[m_tr_do, 'te_do'] = prior_do
    m_va_pu = va['log_pu5_cnt'].values < REL_THR
    if np.any(m_va_pu): va.loc[m_va_pu, 'te_pu'] = prior_pu
    m_va_do = va['log_do5_cnt'].values < REL_THR
    if np.any(m_va_do): va.loc[m_va_do, 'te_do'] = prior_do

    use_cols = base_feats + ['te_pu','te_do','log_pu5_cnt','log_do5_cnt']
    X_tr = tr[use_cols].astype('float32').values; X_va = va[use_cols].astype('float32').values
    # Monotone +1 on days_since_start, dist_hav_km, dist_man_km
    mono = [0]*len(use_cols)
    for nm in ['days_since_start','dist_hav_km','dist_man_km']:
        if nm in use_cols: mono[use_cols.index(nm)] = 1
    model = lgb.LGBMRegressor(**{**lgb_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=900, verbose=False), lgb.log_evaluation(period=250)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    rmse = float(root_mean_squared_error(np.expm1(y_va), y_pred_val))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or model.n_estimators))
    print(f"[Coarse TE] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})")

    # Fold-ensemble preds on test with train-only maps and guard
    tst = test_prep.copy()
    tst = add_reliability_counts_gh5(tr, tst)
    te_pu_test, _, _ = te_smooth(tr['k_pu_w5'], y_tr, tst['k_pu_w5'], m=2000.0, prior=prior_pu)
    te_do_test, _, _ = te_smooth(tr['k_do_w5'], y_tr, tst['k_do_w5'], m=2000.0, prior=prior_do)
    tst['te_pu'] = te_pu_test.astype('float32'); tst['te_do'] = te_do_test.astype('float32')
    m_tst_pu = tst['log_pu5_cnt'].values < REL_THR
    if np.any(m_tst_pu): tst.loc[m_tst_pu, 'te_pu'] = prior_pu
    m_tst_do = tst['log_do5_cnt'].values < REL_THR
    if np.any(m_tst_do): tst.loc[m_tst_do, 'te_do'] = prior_do
    X_test_fold = tst[use_cols].astype('float32').values
    fold_pred = np.expm1(model.predict(X_test_fold, num_iteration=model.best_iteration_)).astype('float32')
    fold_pred = np.clip(fold_pred, 0, 500)
    fold_test_preds.append(fold_pred)

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_median = int(np.median(best_iters)) if best_iters else 0
print({'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 'median_best_iter': best_iter_median, 'note': 'Coarse TE gh5 x week, heavy smoothing m=2000, symmetric guard; huber + monotone + fold ensembling'})

# Fold-ensemble submission
test_pred = np.mean(np.vstack(fold_test_preds), axis=0).astype('float32') if fold_test_preds else np.zeros(len(test_prep), dtype='float32')
test_pred = np.clip(test_pred, 0, 500)
sub = sample[['key']].merge(pd.DataFrame({'key': test_prep['key'].astype('string'), 'fare_amount': test_pred}), on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (Coarse TE gh5 x week) with shape:', sub.shape)

[250]	valid_0's rmse: 0.233634


[500]	valid_0's rmse: 0.23319


[750]	valid_0's rmse: 0.233065


[1000]	valid_0's rmse: 0.233002


[1250]	valid_0's rmse: 0.232954


[1500]	valid_0's rmse: 0.232911


[1750]	valid_0's rmse: 0.232893


[2000]	valid_0's rmse: 0.23288


[2250]	valid_0's rmse: 0.232854


[2500]	valid_0's rmse: 0.232826


[2750]	valid_0's rmse: 0.232812


[3000]	valid_0's rmse: 0.232808


[3250]	valid_0's rmse: 0.232791


[3500]	valid_0's rmse: 0.232777


[3750]	valid_0's rmse: 0.232766


[4000]	valid_0's rmse: 0.232764


[4250]	valid_0's rmse: 0.232763


[4500]	valid_0's rmse: 0.232729


[4750]	valid_0's rmse: 0.232721


[5000]	valid_0's rmse: 0.23272


[5250]	valid_0's rmse: 0.232717


[5500]	valid_0's rmse: 0.232702


[5750]	valid_0's rmse: 0.232695


[6000]	valid_0's rmse: 0.232688


[6250]	valid_0's rmse: 0.232683


[6500]	valid_0's rmse: 0.232679


[6750]	valid_0's rmse: 0.232671


[7000]	valid_0's rmse: 0.232671


[7250]	valid_0's rmse: 0.232655


[7500]	valid_0's rmse: 0.232653


[7750]	valid_0's rmse: 0.232652


[8000]	valid_0's rmse: 0.232645


[8250]	valid_0's rmse: 0.232638


[8500]	valid_0's rmse: 0.232643


[8750]	valid_0's rmse: 0.232638


[9000]	valid_0's rmse: 0.232637


[9250]	valid_0's rmse: 0.232635


[Coarse TE] Fold 1: RMSE=3.70594, best_iter=8369 (train_n=319908, val_n=320756)


[250]	valid_0's rmse: 0.177763


[500]	valid_0's rmse: 0.177676


[750]	valid_0's rmse: 0.177659


[1000]	valid_0's rmse: 0.177655


[Coarse TE] Fold 2: RMSE=2.95000, best_iter=269 (train_n=640523, val_n=320756)


[250]	valid_0's rmse: 0.212298


[500]	valid_0's rmse: 0.211378


[750]	valid_0's rmse: 0.211284


[1000]	valid_0's rmse: 0.211217


[1250]	valid_0's rmse: 0.211168


[1500]	valid_0's rmse: 0.211141


[1750]	valid_0's rmse: 0.211126


[2000]	valid_0's rmse: 0.211116


[2250]	valid_0's rmse: 0.211079


[2500]	valid_0's rmse: 0.211069


[2750]	valid_0's rmse: 0.211038


[3000]	valid_0's rmse: 0.211025


[3250]	valid_0's rmse: 0.211016


[3500]	valid_0's rmse: 0.21102


[3750]	valid_0's rmse: 0.211006


[4000]	valid_0's rmse: 0.211007


[4250]	valid_0's rmse: 0.211005


[4500]	valid_0's rmse: 0.211


[4750]	valid_0's rmse: 0.21099


[5000]	valid_0's rmse: 0.210994


[5250]	valid_0's rmse: 0.210984


[5500]	valid_0's rmse: 0.210976


[5750]	valid_0's rmse: 0.210967


[6000]	valid_0's rmse: 0.210958


[6250]	valid_0's rmse: 0.210966


[6500]	valid_0's rmse: 0.210953


[6750]	valid_0's rmse: 0.210952


[7000]	valid_0's rmse: 0.210959


[7250]	valid_0's rmse: 0.21094


[7500]	valid_0's rmse: 0.210935


[7750]	valid_0's rmse: 0.210942


[8000]	valid_0's rmse: 0.210941


[8250]	valid_0's rmse: 0.210934


[8500]	valid_0's rmse: 0.210925


[8750]	valid_0's rmse: 0.210928


[9000]	valid_0's rmse: 0.210928


[9250]	valid_0's rmse: 0.210922


[Coarse TE] Fold 3: RMSE=3.72642, best_iter=8403 (train_n=961323, val_n=320756)
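
The smoothing used by `te_smooth` is an m-estimate: the weight on a key's own mean is n/(n+m), so with m=2000 a key needs about 2000 training rows before its own mean and the prior are weighted equally. The snippet below is a minimal sketch of that behaviour on toy data; `m_estimate` is a hypothetical stand-in written for illustration, and the keys and targets are made up.

```python
import numpy as np
import pandas as pd

def m_estimate(keys, y, m, prior=None):
    """Return ({key: (n*mean + m*prior) / (n + m)}, prior) for the given keys and targets."""
    y = np.asarray(y, dtype=np.float64)
    if prior is None:
        prior = float(y.mean())
    stats = (
        pd.DataFrame({"k": np.asarray(keys, dtype=object), "y": y})
        .groupby("k")["y"]
        .agg(["mean", "count"])
    )
    smooth = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)
    return smooth.to_dict(), prior

# Toy data: key "a" has 8 rows with target 2.0, key "b" has 2 rows with target 4.0.
keys = ["a"] * 8 + ["b"] * 2
y = [2.0] * 8 + [4.0] * 2

enc_light, prior = m_estimate(keys, y, m=1.0)     # light smoothing keeps most of the key mean
enc_heavy, _ = m_estimate(keys, y, m=2000.0)      # heavy smoothing collapses toward the prior
print(prior, enc_light, enc_heavy)
```

On such small groups, m=2000 makes every encoding nearly identical to the prior, which is the intended effect for sparse geohash-week cells but also means the feature carries little signal unless a cell has thousands of rows.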


In [64]:
# Phase 15 — Pivot to CatBoost: Ordered target statistics on coarse categoricals (gh5 + temporal), robust numeric features, 5-fold time CV + fold ensembling
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals() and 'sample' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    from catboost import CatBoostRegressor, Pool
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'catboost'])
    from catboost import CatBoostRegressor, Pool
try:
    import pygeohash as pgh
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'pygeohash'])
    import pygeohash as pgh

def to_local(dt):
    return dt.dt.tz_convert('America/New_York')

def ensure_days_since_start(df):
    if 'days_since_start' in df.columns: return df
    d = df.copy(); dt = to_local(d['pickup_datetime'])
    start_ns = int(dt.min().value)
    d['days_since_start'] = ((dt.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d

def ensure_time_parts(df):
    d = df.copy(); dt = to_local(d['pickup_datetime'])
    d['hour'] = dt.dt.hour.astype('int8')
    d['dow'] = dt.dt.dayofweek.astype('int8')
    d['week_of_year'] = dt.dt.isocalendar().week.astype('int16')
    d['is_weekend'] = d['dow'].isin([5,6]).astype('int8')
    d['hour_sin'] = np.sin(2*np.pi*d['hour']/24).astype('float32')
    d['hour_cos'] = np.cos(2*np.pi*d['hour']/24).astype('float32')
    d['dow_sin'] = np.sin(2*np.pi*d['dow']/7).astype('float32')
    d['dow_cos'] = np.cos(2*np.pi*d['dow']/7).astype('float32')
    d['doy_sin'] = np.sin(2*np.pi*(dt.dt.dayofyear.astype('int16')/365.25)).astype('float32')
    d['doy_cos'] = np.cos(2*np.pi*(dt.dt.dayofyear.astype('int16')/365.25)).astype('float32')
    return d

def ensure_policy_and_holiday(df):
    d = df.copy(); dt = to_local(d['pickup_datetime'])
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    d['after_hike'] = (dt >= cutoff).astype('int8')
    # Holiday flag via pandas US Federal Holidays
    from pandas.tseries.holiday import USFederalHolidayCalendar as USCal
    hol = USCal().holidays(start=dt.min().normalize().tz_localize(None), end=dt.max().normalize().tz_localize(None))
    hol_dates = set(pd.to_datetime(hol).date)
    d['is_holiday'] = dt.dt.date.map(lambda x: x in hol_dates).astype('int8')
    return d

def add_geohash5(df):
    d = df.copy()
    if 'pu_gh5' in d.columns and 'do_gh5' in d.columns: return d
    pu_lat = d['pickup_latitude'].astype('float32').values
    pu_lon = d['pickup_longitude'].astype('float32').values
    do_lat = d['dropoff_latitude'].astype('float32').values
    do_lon = d['dropoff_longitude'].astype('float32').values
    d['pu_gh5'] = [pgh.encode(float(la), float(lo), precision=5) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(pu_lat, pu_lon)]
    d['do_gh5'] = [pgh.encode(float(la), float(lo), precision=5) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(do_lat, do_lon)]
    return d

# Prepare ordered datasets
train_cat = ensure_days_since_start(train_df.copy())
test_cat = ensure_days_since_start(test_df.copy())
train_cat = ensure_time_parts(train_cat); test_cat = ensure_time_parts(test_cat)
train_cat = ensure_policy_and_holiday(train_cat); test_cat = ensure_policy_and_holiday(test_cat)
train_cat = add_geohash5(train_cat); test_cat = add_geohash5(test_cat)

# Robust numeric feature set (same geometry/POIs as before) + simple temporals; CatBoost handles categoricals internally
base_num = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'hour_sin','hour_cos','dow_sin','dow_cos','doy_sin','doy_cos','is_holiday','is_weekend','days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_cat.columns: base_num.append('rot_manh_km')
if 'dist_x_after_hike' in train_cat.columns: base_num.append('dist_x_after_hike')

# Categorical features for CatBoost ordered statistics
cat_cols = ['pu_gh5','do_gh5','hour','dow','week_of_year','after_hike']

# Final feature columns
use_cols = base_num + cat_cols
train_ord = train_cat.dropna(subset=use_cols + ['fare_amount']).sort_values('pickup_datetime').reset_index(drop=True)
test_cat[base_num] = test_cat[base_num].fillna(0)
X_all = train_ord[use_cols].copy()
y_all = np.log1p(train_ord['fare_amount'].astype('float32').values)
dt_all = train_ord['pickup_datetime']
X_test = test_cat[use_cols].copy()

# Indices of categorical features for CatBoost
cat_indices = [use_cols.index(c) for c in cat_cols if c in use_cols]

# 5-fold time CV with 1-day gap
folds = make_time_folds_quantile(dt_all, n_folds=5, gap_days=1)
assert len(folds) == 5, 'Expected 5 folds'

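# CatBoost params: conservative regularization. Categorical columns are encoded with CatBoost's ordered target statistics;
# boosting_type is left at the library default (not set to 'Ordered' here), and has_time is not set,
# so the statistics are computed over random permutations rather than in pickup_datetime order.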
cb_params = dict(
    loss_function='RMSE',
    iterations=30000,
    learning_rate=0.05,
    depth=8,
    l2_leaf_reg=10.0,
    random_strength=1.0,
    bootstrap_type='Bernoulli', subsample=0.8,
    od_type='Iter', od_wait=900,
    random_seed=2025,
    task_type='CPU',
    verbose=False
)

cv_rmses, best_iters = [], []
fold_test_preds = []

for i, (tr_idx, va_idx) in enumerate(folds, 1):
    X_tr = X_all.iloc[tr_idx].copy(); X_va = X_all.iloc[va_idx].copy()
    y_tr = y_all[tr_idx]; y_va = y_all[va_idx]
    # CatBoost Pools
    train_pool = Pool(X_tr, label=y_tr, cat_features=cat_indices)
    valid_pool = Pool(X_va, label=y_va, cat_features=cat_indices)
    model = CatBoostRegressor(**cb_params)
    model.fit(train_pool, eval_set=valid_pool, verbose=False)
    # Evaluate on original scale
    y_pred_val_log = model.predict(valid_pool)
    y_pred_val = np.expm1(y_pred_val_log)
    rmse = float(root_mean_squared_error(np.expm1(y_va), y_pred_val))
    cv_rmses.append(rmse)
    best_iters.append(getattr(model, 'tree_count_', cb_params['iterations']))
    print(f"[CatBoost Coarse] Fold {i}: RMSE={rmse:.5f}, trees={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})")
    # Test predictions per fold for ensembling
    test_pool = Pool(X_test, cat_features=cat_indices)
    fold_pred = np.expm1(model.predict(test_pool)).astype('float32')
    fold_pred = np.clip(fold_pred, 0, 500)
    fold_test_preds.append(fold_pred)

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_median = int(np.median(best_iters)) if best_iters else 0
print({'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 'median_trees': best_iter_median, 'note': 'CatBoost ordered TE on gh5/week/hour/dow/after_hike'})

# Fold-ensemble submission
test_pred = np.mean(np.vstack(fold_test_preds), axis=0).astype('float32') if fold_test_preds else np.zeros(len(X_test), dtype='float32')
test_pred = np.clip(test_pred, 0, 500)
sub = sample[['key']].merge(pd.DataFrame({'key': test_cat['key'].astype('string'), 'fare_amount': test_pred}), on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (CatBoost ordered TE) with shape:', sub.shape)
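
The fold loop above trains on `log1p(fare_amount)` but reports RMSE in fare units by applying `expm1` to both the targets and the predictions. A minimal sketch of that scoring step as a standalone helper follows; the function name `rmse_original_scale` and the toy fares are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def rmse_original_scale(y_true_log, y_pred_log):
    """RMSE in fare units for targets/predictions stored as log1p(fare)."""
    y_true = np.expm1(np.asarray(y_true_log, dtype='float64'))
    y_pred = np.expm1(np.asarray(y_pred_log, dtype='float64'))
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy check: true fares of 10 and 20, predicted fares of 12 and 17.
y_log = np.log1p(np.array([10.0, 20.0]))
p_log = np.log1p(np.array([12.0, 17.0]))
print(rmse_original_scale(y_log, p_log))  # sqrt((2**2 + 3**2) / 2) in fare units
```

Scoring in fare units matters because a small log-space error on a long trip translates into a large dollar error, which is what the competition metric penalizes.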

In [65]:
# Phase 15 — Mandatory CatBoost Baseline (APPROVED PARAMS):
# Model: CatBoostRegressor
# Categoricals: ['pu_gh5','do_gh5','week_of_year','hour','dow','after_hike']
# Params: depth=7, l2_leaf_reg=30.0, subsample=0.7, od_wait=1500, iterations=30000, learning_rate=0.05, od_type='Iter'
# Protocol: 5-fold time CV (1-day gap) to completion, report CV mean/std, fold-ensemble submission

import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error
assert 'train_df' in globals() and 'test_df' in globals() and 'sample' in globals(), 'Prereqs missing'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'

try:
    from catboost import CatBoostRegressor, Pool
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'catboost'])
    from catboost import CatBoostRegressor, Pool

try:
    import pygeohash as pgh
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'pygeohash'])
    import pygeohash as pgh

def to_local(dt):
    return dt.dt.tz_convert('America/New_York')

def ensure_days_since_start(df):
    if 'days_since_start' in df.columns: return df
    d = df.copy(); dt = to_local(d['pickup_datetime'])
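    start_ns = int(to_local(train_df['pickup_datetime']).min().value)  # anchor to train's earliest timestamp so train and test share one origin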
    d['days_since_start'] = ((dt.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d

def ensure_time_parts(df):
    d = df.copy(); dt = to_local(d['pickup_datetime'])
    d['hour'] = dt.dt.hour.astype('int8')
    d['dow'] = dt.dt.dayofweek.astype('int8')
    d['week_of_year'] = dt.dt.isocalendar().week.astype('int16')
    # robust cyclic temporals
    d['hour_sin'] = np.sin(2*np.pi*d['hour']/24).astype('float32')
    d['hour_cos'] = np.cos(2*np.pi*d['hour']/24).astype('float32')
    d['dow_sin'] = np.sin(2*np.pi*d['dow']/7).astype('float32')
    d['dow_cos'] = np.cos(2*np.pi*d['dow']/7).astype('float32')
    d['doy_sin'] = np.sin(2*np.pi*(dt.dt.dayofyear.astype('int16')/365.25)).astype('float32')
    d['doy_cos'] = np.cos(2*np.pi*(dt.dt.dayofyear.astype('int16')/365.25)).astype('float32')
    d['is_weekend'] = d['dow'].isin([5,6]).astype('int8')
    d['is_rush'] = ((d['hour'].between(7,10)) | (d['hour'].between(16,19))).astype('int8')
    d['is_night'] = ((d['hour'] >= 22) | (d['hour'] <= 5)).astype('int8')
    return d

def ensure_policy_and_holiday(df):
    d = df.copy(); dt = to_local(d['pickup_datetime'])
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    d['after_hike'] = (dt >= cutoff).astype('int8')
    from pandas.tseries.holiday import USFederalHolidayCalendar as USCal
    hol = USCal().holidays(start=dt.min().normalize().tz_localize(None), end=dt.max().normalize().tz_localize(None))
    hol_dates = set(pd.to_datetime(hol).date)
    d['is_holiday'] = dt.dt.date.map(lambda x: x in hol_dates).astype('int8')
    return d

def add_geohash5(df):
    d = df.copy()
    if 'pu_gh5' in d.columns and 'do_gh5' in d.columns: return d
    pu_lat = d['pickup_latitude'].astype('float32').values
    pu_lon = d['pickup_longitude'].astype('float32').values
    do_lat = d['dropoff_latitude'].astype('float32').values
    do_lon = d['dropoff_longitude'].astype('float32').values
    d['pu_gh5'] = [pgh.encode(float(la), float(lo), precision=5) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(pu_lat, pu_lon)]
    d['do_gh5'] = [pgh.encode(float(la), float(lo), precision=5) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(do_lat, do_lon)]
    return d

# Prepare ordered datasets
train_cat = ensure_days_since_start(train_df.copy())
test_cat = ensure_days_since_start(test_df.copy())
train_cat = ensure_time_parts(train_cat); test_cat = ensure_time_parts(test_cat)
train_cat = ensure_policy_and_holiday(train_cat); test_cat = ensure_policy_and_holiday(test_cat)
train_cat = add_geohash5(train_cat); test_cat = add_geohash5(test_cat)

# Robust numeric features (no leakage)
base_num = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'hour_sin','hour_cos','dow_sin','dow_cos','doy_sin','doy_cos','is_holiday','is_weekend','is_rush','is_night','days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_cat.columns: base_num.append('rot_manh_km')
if 'dist_x_after_hike' in train_cat.columns: base_num.append('dist_x_after_hike')

# Mandated categoricals
cat_cols = ['pu_gh5','do_gh5','week_of_year','hour','dow','after_hike']

use_cols = base_num + cat_cols
train_ord = train_cat.dropna(subset=use_cols + ['fare_amount']).sort_values('pickup_datetime').reset_index(drop=True)
test_cat[base_num] = test_cat[base_num].fillna(0)
X_all = train_ord[use_cols].copy()
y_all = np.log1p(train_ord['fare_amount'].astype('float32').values)
dt_all = train_ord['pickup_datetime']
X_test = test_cat[use_cols].copy()

# Categorical indices
cat_indices = [use_cols.index(c) for c in cat_cols]

# 5-fold time CV with 1-day gap
folds = make_time_folds_quantile(dt_all, n_folds=5, gap_days=1)
assert len(folds) == 5, 'Expected 5 folds'

# CatBoost params (MANDATED)
cb_params = dict(
    loss_function='RMSE',
    iterations=30000,
    learning_rate=0.05,
    depth=7,
    l2_leaf_reg=30.0,
    bootstrap_type='Bernoulli', subsample=0.7,
    od_type='Iter', od_wait=1500,
    random_seed=2025,
    task_type='CPU',
    verbose=False
)

cv_rmses, best_iters = [], []
fold_test_preds = []

for i, (tr_idx, va_idx) in enumerate(folds, 1):
    X_tr = X_all.iloc[tr_idx].copy(); X_va = X_all.iloc[va_idx].copy()
    y_tr = y_all[tr_idx]; y_va = y_all[va_idx]
    train_pool = Pool(X_tr, label=y_tr, cat_features=cat_indices)
    valid_pool = Pool(X_va, label=y_va, cat_features=cat_indices)
    model = CatBoostRegressor(**cb_params)
    model.fit(train_pool, eval_set=valid_pool, verbose=False)
    # Evaluate on original target scale
    y_pred_val_log = model.predict(valid_pool)
    y_pred_val = np.expm1(y_pred_val_log)
    rmse = float(root_mean_squared_error(np.expm1(y_va), y_pred_val))
    trees = int(getattr(model, 'tree_count_', cb_params['iterations']))
    cv_rmses.append(rmse); best_iters.append(trees)
    print(f"[CatBoost Mandated] Fold {i}: RMSE={rmse:.5f}, trees={trees} (train_n={len(tr_idx)}, val_n={len(va_idx)})")
    # Test predictions per fold (fold ensembling)
    test_pool = Pool(X_test, cat_features=cat_indices)
    fold_pred = np.expm1(model.predict(test_pool)).astype('float32')
    fold_pred = np.clip(fold_pred, 0, 500)
    fold_test_preds.append(fold_pred)

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_median = int(np.median(best_iters)) if best_iters else 0
print({'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 'median_trees': best_iter_median, 'note': 'CatBoost baseline (mandated params) 5-fold time CV'})

# Fold-ensemble submission
test_pred = np.mean(np.vstack(fold_test_preds), axis=0).astype('float32') if fold_test_preds else np.zeros(len(X_test), dtype='float32')
test_pred = np.clip(test_pred, 0, 500)
sub = sample[['key']].merge(pd.DataFrame({'key': test_cat['key'].astype('string'), 'fare_amount': test_pred}), on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (CatBoost mandated) with shape:', sub.shape)

[CatBoost Mandated] Fold 1: RMSE=3.75623, trees=3484 (train_n=319908, val_n=320756)
[CatBoost Mandated] Fold 2: RMSE=3.35869, trees=246 (train_n=640523, val_n=320756)
[CatBoost Mandated] Fold 3: RMSE=3.56513, trees=16577 (train_n=961323, val_n=320756)
[CatBoost Mandated] Fold 4: RMSE=3.50761, trees=16735 (train_n=1282234, val_n=320756)
[CatBoost Mandated] Fold 5: RMSE=3.22177, trees=18746 (train_n=1602868, val_n=320755)
{'cv_rmse_mean': 3.481887, 'cv_rmse_std': 0.18204, 'median_trees': 16577, 'note': 'CatBoost baseline (mandated params) 5-fold time CV'}
Saved submission.csv (CatBoost mandated) with shape: (9914, 2)
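
The fold sizes printed above (a constant `val_n` and a `train_n` that grows by roughly one block per fold, minus a small gap) are consistent with cutting the time-sorted rows into six equal-count blocks and validating on blocks 2 through 6. The real `make_time_folds_quantile` is defined in an earlier cell that is not shown here, so the following is only a hedged sketch of how such a generator might work. The name `sketch_time_folds` and its exact cut logic are assumptions.

```python
import numpy as np

def sketch_time_folds(timestamps, n_folds=5, gap_days=1):
    """Forward-chaining folds over equal-count time blocks (hypothetical sketch).

    `timestamps` must be sorted ascending. The timeline is cut into
    n_folds + 1 blocks; fold k validates on block k + 1 and trains on every
    row that precedes the first validation row by more than `gap_days`.
    """
    ts = np.asarray(timestamps, dtype='datetime64[ns]')
    n = len(ts)
    edges = np.linspace(0, n, n_folds + 2).astype(int)
    gap = np.timedelta64(gap_days, 'D')
    folds = []
    for k in range(1, n_folds + 1):
        va_idx = np.arange(edges[k], edges[k + 1])
        cutoff = ts[va_idx[0]] - gap
        # side='left' yields the number of rows with a timestamp strictly before the cutoff
        tr_end = int(np.searchsorted(ts, cutoff, side='left'))
        tr_idx = np.arange(0, tr_end)
        if len(tr_idx) and len(va_idx):
            folds.append((tr_idx, va_idx))
    return folds

# Toy timeline: one row per hour over 60 days.
ts = np.arange('2015-01-01', '2015-03-02', dtype='datetime64[h]')
folds = sketch_time_folds(ts, n_folds=5, gap_days=1)
for i, (tr, va) in enumerate(folds, 1):
    print(i, len(tr), len(va))
```

Because each training window ends a full day before its validation block begins, trips that straddle the boundary cannot leak into the model that is scored on them.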


In [69]:
# Phase 15 — Contingency Blend (self-contained): CatBoost (mandated) + LightGBM anchor (distance-only) with 5-fold time CV
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals() and 'sample' in globals(), 'Prereqs missing: train_df/test_df/sample'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'

try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb
try:
    from catboost import CatBoostRegressor, Pool
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'catboost'])
    from catboost import CatBoostRegressor, Pool
try:
    import pygeohash as pgh
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'pygeohash'])
    import pygeohash as pgh

# Rebuild CatBoost-ready features locally to avoid dtype/name drift
def to_local(dt):
    return dt.dt.tz_convert('America/New_York')

def ensure_days_since_start(df):
    if 'days_since_start' in df.columns: return df
    d = df.copy(); dt = to_local(d['pickup_datetime'])
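    start_ns = int(to_local(train_df['pickup_datetime']).min().value)  # anchor to train's earliest timestamp so train and test share one origin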
    d['days_since_start'] = ((dt.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d

def ensure_time_parts(df):
    d = df.copy(); dt = to_local(d['pickup_datetime'])
    d['hour'] = dt.dt.hour.astype('int8')
    d['dow'] = dt.dt.dayofweek.astype('int8')
    d['week_of_year'] = dt.dt.isocalendar().week.astype('int16')
    d['hour_sin'] = np.sin(2*np.pi*d['hour']/24).astype('float32')
    d['hour_cos'] = np.cos(2*np.pi*d['hour']/24).astype('float32')
    d['dow_sin'] = np.sin(2*np.pi*d['dow']/7).astype('float32')
    d['dow_cos'] = np.cos(2*np.pi*d['dow']/7).astype('float32')
    d['doy_sin'] = np.sin(2*np.pi*(dt.dt.dayofyear.astype('int16')/365.25)).astype('float32')
    d['doy_cos'] = np.cos(2*np.pi*(dt.dt.dayofyear.astype('int16')/365.25)).astype('float32')
    d['is_weekend'] = d['dow'].isin([5,6]).astype('int8')
    d['is_rush'] = ((d['hour'].between(7,10)) | (d['hour'].between(16,19))).astype('int8')
    d['is_night'] = ((d['hour'] >= 22) | (d['hour'] <= 5)).astype('int8')
    return d

def ensure_policy_and_holiday(df):
    d = df.copy(); dt = to_local(d['pickup_datetime'])
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    d['after_hike'] = (dt >= cutoff).astype('int8')
    from pandas.tseries.holiday import USFederalHolidayCalendar as USCal
    hol = USCal().holidays(start=dt.min().normalize().tz_localize(None), end=dt.max().normalize().tz_localize(None))
    hol_dates = set(pd.to_datetime(hol).date)
    d['is_holiday'] = dt.dt.date.map(lambda x: x in hol_dates).astype('int8')
    return d

def add_geohash5(df):
    d = df.copy()
    if 'pu_gh5' in d.columns and 'do_gh5' in d.columns: return d
    pu_lat = d['pickup_latitude'].astype('float32').values
    pu_lon = d['pickup_longitude'].astype('float32').values
    do_lat = d['dropoff_latitude'].astype('float32').values
    do_lon = d['dropoff_longitude'].astype('float32').values
    d['pu_gh5'] = [pgh.encode(float(la), float(lo), precision=5) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(pu_lat, pu_lon)]
    d['do_gh5'] = [pgh.encode(float(la), float(lo), precision=5) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(do_lat, do_lon)]
    return d

# Build CatBoost dataset
train_cat = ensure_days_since_start(train_df.copy())
test_cat = ensure_days_since_start(test_df.copy())
train_cat = ensure_time_parts(train_cat); test_cat = ensure_time_parts(test_cat)
train_cat = ensure_policy_and_holiday(train_cat); test_cat = ensure_policy_and_holiday(test_cat)
train_cat = add_geohash5(train_cat); test_cat = add_geohash5(test_cat)

base_num = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'hour_sin','hour_cos','dow_sin','dow_cos','doy_sin','doy_cos','is_holiday','is_weekend','is_rush','is_night','days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_cat.columns: base_num.append('rot_manh_km')
if 'dist_x_after_hike' in train_cat.columns: base_num.append('dist_x_after_hike')

cat_cols = ['pu_gh5','do_gh5','week_of_year','hour','dow','after_hike']
use_cols = base_num + cat_cols
train_ord_local = train_cat.dropna(subset=use_cols + ['fare_amount']).sort_values('pickup_datetime').reset_index(drop=True)
test_cat[base_num] = test_cat[base_num].fillna(0)
X_all_df = train_ord_local[use_cols].copy()
y_all_log = np.log1p(train_ord_local['fare_amount'].astype('float32').values)
dt_all = train_ord_local['pickup_datetime']
X_test_df = test_cat[use_cols].copy()
cat_indices = [use_cols.index(c) for c in cat_cols]

# Folds
folds = make_time_folds_quantile(dt_all, n_folds=5, gap_days=1)
assert len(folds) == 5

# CatBoost params (mandated)
cb_params = dict(
    loss_function='RMSE',
    iterations=30000,
    learning_rate=0.05,
    depth=7,
    l2_leaf_reg=30.0,
    bootstrap_type='Bernoulli', subsample=0.7,
    od_type='Iter', od_wait=1500,
    random_seed=2025,
    task_type='CPU',
    verbose=False
)

# Anchor LGBM (compact distance/time/location feature set, huber loss, monotone increasing in the two distances and days_since_start)
lgbm_anchor_params = {
    'objective': 'huber',
    'metric': 'rmse',
    'n_estimators': 10000,
    'learning_rate': 0.04,
    'num_leaves': 64,
    'min_data_in_leaf': 1000,
    'reg_lambda': 40.0,
    'feature_fraction': 0.7,
    'bagging_fraction': 0.7,
    'bagging_freq': 1,
    'random_state': 2025,
    'verbose': -1
}
anchor_feats = [
    'dist_hav_km','dist_man_km','passenger_count',
    'hour_sin','hour_cos','dow_sin','dow_cos',
    'days_since_start','center_lat','center_lon'
]
for f in anchor_feats:
    assert f in X_all_df.columns, f'Anchor feature missing: {f}'

n = len(train_ord_local)
oof_cb_log = np.full(n, np.nan, dtype='float32')
oof_lgb_log = np.full(n, np.nan, dtype='float32')
test_cb_logs, test_lgb_logs = [], []

for i, (tr_idx, va_idx) in enumerate(folds, 1):
    # CatBoost
    X_tr_cb = X_all_df.iloc[tr_idx].copy(); X_va_cb = X_all_df.iloc[va_idx].copy()
    y_tr = y_all_log[tr_idx]; y_va = y_all_log[va_idx]
    train_pool = Pool(X_tr_cb, label=y_tr, cat_features=cat_indices)
    valid_pool = Pool(X_va_cb, label=y_va, cat_features=cat_indices)
    cb = CatBoostRegressor(**cb_params)
    cb.fit(train_pool, eval_set=valid_pool, verbose=False)
    va_cb_log = cb.predict(valid_pool).astype('float32')
    oof_cb_log[va_idx] = va_cb_log
    test_pool = Pool(X_test_df, cat_features=cat_indices)
    test_cb_logs.append(cb.predict(test_pool).astype('float32'))

    # LightGBM anchor
    X_tr_lgb = X_all_df.iloc[tr_idx][anchor_feats].astype('float32').values
    X_va_lgb = X_all_df.iloc[va_idx][anchor_feats].astype('float32').values
    lgbm = lgb.LGBMRegressor(**lgbm_anchor_params)
    mono = [0]*len(anchor_feats)
    for nm in ['dist_hav_km','dist_man_km','days_since_start']:
        mono[anchor_feats.index(nm)] = 1
    lgbm.set_params(monotone_constraints=mono)
    lgbm.fit(
        X_tr_lgb, y_tr,
        eval_set=[(X_va_lgb, y_va)],
        eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=200, verbose=False)]
    )
    va_lgb_log = lgbm.predict(X_va_lgb, num_iteration=lgbm.best_iteration_).astype('float32')
    oof_lgb_log[va_idx] = va_lgb_log
    test_lgb_logs.append(lgbm.predict(X_test_df[anchor_feats].astype('float32').values, num_iteration=lgbm.best_iteration_).astype('float32'))

    va_blend_log = 0.5 * va_cb_log + 0.5 * va_lgb_log
    rmse_fold = float(root_mean_squared_error(np.expm1(y_va), np.expm1(va_blend_log)))
    print(f"[Blend] Fold {i}: RMSE_blend(0.5)={rmse_fold:.5f} (train_n={len(tr_idx)}, val_n={len(va_idx)})")

# Weight grid
weights = [0.5, 0.6, 0.4]
results = []
for w in weights:
    oof_blend_log = (w * oof_cb_log + (1.0 - w) * oof_lgb_log).astype('float32')
    per_fold = []
    for (_, va_idx) in folds:
        per_fold.append(float(root_mean_squared_error(np.expm1(y_all_log[va_idx]), np.expm1(oof_blend_log[va_idx]))))
    cv_mean = float(np.mean(per_fold)); cv_std = float(np.std(per_fold))
    results.append((w, cv_mean, cv_std, per_fold))
    print({'weight_cb': w, 'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 'per_fold': [round(x,5) for x in per_fold]})

best = sorted(results, key=lambda x: (x[1], x[2]))[0]
w_best, cv_mean_best, cv_std_best, per_fold_best = best
print({'selected_weight_cb': w_best, 'cv_rmse_mean': round(cv_mean_best,6), 'cv_rmse_std': round(cv_std_best,6)})

# Build blended test predictions
test_cb_mean_log = np.mean(np.vstack(test_cb_logs), axis=0).astype('float32')
test_lgb_mean_log = np.mean(np.vstack(test_lgb_logs), axis=0).astype('float32')
test_blend_log = (w_best * test_cb_mean_log + (1.0 - w_best) * test_lgb_mean_log).astype('float32')
test_pred = np.expm1(test_blend_log).astype('float32')
test_pred = np.clip(test_pred, 0, 500)

sub = sample[['key']].merge(pd.DataFrame({'key': test_cat['key'].astype('string'), 'fare_amount': test_pred}), on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (CatBoost+LGBM anchor blend) with shape:', sub.shape)
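
The blend cell above mixes the two models in log space and picks the CatBoost weight from a three-point grid by mean per-fold RMSE. A finer grid can be searched with the same logic. The sketch below is an assumption-laden illustration: `pick_blend_weight` is a hypothetical helper, it pools all rows into a single RMSE rather than averaging per fold, and it expects only rows that actually received out-of-fold predictions (the NaN rows from the earliest block must be masked out first).

```python
import numpy as np

def pick_blend_weight(y_log, pred_a_log, pred_b_log, weights):
    """Return (best_weight_on_a, rmse), scoring log-space blends in fare units."""
    y = np.expm1(np.asarray(y_log, dtype='float64'))
    a = np.asarray(pred_a_log, dtype='float64')
    b = np.asarray(pred_b_log, dtype='float64')
    scored = []
    for w in weights:
        blend = np.expm1(w * a + (1.0 - w) * b)
        scored.append((float(np.sqrt(np.mean((y - blend) ** 2))), float(w)))
    rmse, w_best = min(scored)  # lowest RMSE wins; ties go to the smaller weight
    return w_best, rmse

# Toy data: model A is perfect, model B is biased upward in log space.
y_log = np.log1p(np.array([5.0, 10.0, 20.0, 40.0]))
pred_a = y_log.copy()
pred_b = y_log + 0.2
w_best, rmse = pick_blend_weight(y_log, pred_a, pred_b, np.linspace(0.0, 1.0, 11))
print(w_best, rmse)
```

With real out-of-fold predictions the chosen weight is itself fitted to the validation data, so the reported blend RMSE is slightly optimistic.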

In [67]:
# Phase 16 — Scale Up for Stability: 6M-row hash sample + robust LightGBM (distance-only anchor) 5-fold time CV
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'test' in globals() and 'sample' in globals(), 'Prereqs missing: test/sample not loaded'
assert 'load_train_sample_hash' in globals(), 'Hash sampler not found (cell 2)'
assert 'clean_train' in globals() and 'clip_test' in globals() and 'add_features' in globals(), 'Feature helpers missing (cell 3)'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# 1) Load larger, unbiased training sample (6,000,000 rows)
BIG_N = 6_000_000
train_big = load_train_sample_hash(target_rows=BIG_N, est_total_rows=55_000_000, oversample=1.2, cache_path=None)
print({'train_big_rows': int(len(train_big))})

# 2) Train/test parity prep
train_df_big = clean_train(train_big)
test_df_big = clip_test(test)
train_df_big = add_features(train_df_big)
test_df_big = add_features(test_df_big)
for col in ['dist_hav_km','dist_man_km']:
    if col in test_df_big.columns:
        test_df_big[col] = test_df_big[col].clip(lower=0.01, upper=200).astype('float32')
    if col in train_df_big.columns:
        train_df_big[col] = train_df_big[col].clip(lower=0.01, upper=200).astype('float32')

# 3) Anchor feature set (numeric only, stable)
anchor_feats = [
    'dist_hav_km','dist_man_km','passenger_count',
    'hour_sin','hour_cos','dow_sin','dow_cos',
    'days_since_start','center_lat','center_lon'
]
# Ensure required columns exist; fallback to derive minimal time cycles from pickup_datetime if missing
if 'days_since_start' not in train_df_big.columns:
    dt_local = train_df_big['pickup_datetime'].dt.tz_convert('America/New_York')
    start_ns = int(dt_local.min().value)
    train_df_big['days_since_start'] = ((dt_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    dt_l2 = test_df_big['pickup_datetime'].dt.tz_convert('America/New_York')
    test_df_big['days_since_start'] = ((dt_l2.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
if 'hour_sin' not in train_df_big.columns or 'dow_sin' not in train_df_big.columns:
    dlt = train_df_big['pickup_datetime'].dt.tz_convert('America/New_York')
    train_df_big['hour'] = dlt.dt.hour.astype('int8'); train_df_big['dow'] = dlt.dt.dayofweek.astype('int8')
    train_df_big['hour_sin'] = np.sin(2*np.pi*train_df_big['hour']/24).astype('float32')
    train_df_big['hour_cos'] = np.cos(2*np.pi*train_df_big['hour']/24).astype('float32')
    train_df_big['dow_sin'] = np.sin(2*np.pi*train_df_big['dow']/7).astype('float32')
    train_df_big['dow_cos'] = np.cos(2*np.pi*train_df_big['dow']/7).astype('float32')
    dlt2 = test_df_big['pickup_datetime'].dt.tz_convert('America/New_York')
    test_df_big['hour'] = dlt2.dt.hour.astype('int8'); test_df_big['dow'] = dlt2.dt.dayofweek.astype('int8')
    test_df_big['hour_sin'] = np.sin(2*np.pi*test_df_big['hour']/24).astype('float32')
    test_df_big['hour_cos'] = np.cos(2*np.pi*test_df_big['hour']/24).astype('float32')
    test_df_big['dow_sin'] = np.sin(2*np.pi*test_df_big['dow']/7).astype('float32')
    test_df_big['dow_cos'] = np.cos(2*np.pi*test_df_big['dow']/7).astype('float32')

need_cols = anchor_feats + ['fare_amount']
train_df_big = train_df_big.dropna(subset=need_cols).sort_values('pickup_datetime').reset_index(drop=True)
test_df_big[anchor_feats] = test_df_big[anchor_feats].fillna(0)

X_all = train_df_big[anchor_feats].astype('float32').values
y_all = np.log1p(train_df_big['fare_amount'].astype('float32').values)
dt_all = train_df_big['pickup_datetime']
X_test = test_df_big[anchor_feats].astype('float32').values

# 4) 5-fold time CV (1-day gap) with huber objective and strong regularization
assert 'make_time_folds_quantile' in globals(), 'fold generator missing'
folds = make_time_folds_quantile(dt_all, n_folds=5, gap_days=1)
assert len(folds) == 5

lgbm_anchor_params = {
    'objective': 'huber',
    'metric': 'rmse',
    'n_estimators': 16000,
    'learning_rate': 0.04,
    'num_leaves': 64,
    'min_data_in_leaf': 1500,
    'reg_lambda': 40.0,
    'feature_fraction': 0.7,
    'bagging_fraction': 0.7,
    'bagging_freq': 1,
    'random_state': 2025,
    'verbose': -1
}

cv_rmses, best_iters = [], []
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
    X_va, y_va = X_all[va_idx], y_all[va_idx]
    mono = [0]*len(anchor_feats)
    for nm in ['dist_hav_km','dist_man_km','days_since_start']:
        mono[anchor_feats.index(nm)] = 1
    model = lgb.LGBMRegressor(**{**lgbm_anchor_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)],
        eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=400, verbose=False), lgb.log_evaluation(period=300)]
    )
    y_pred_val = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    rmse = float(root_mean_squared_error(np.expm1(y_va), y_pred_val))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or model.n_estimators))
    print(f"[Anchor 6M] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})")

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_final = int(np.clip(int(np.median(best_iters)), 600, lgbm_anchor_params['n_estimators']))
print({'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 'best_iter_final': best_iter_final, 'note': '6M rows, distance-only anchor LGBM'})

# 5) Final fit on all 6M and submission (anchor-only for now)
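mono_final = [1 if f in ('dist_hav_km', 'dist_man_km', 'days_since_start') else 0 for f in anchor_feats]  # rebuilt here instead of reusing the fold-loop variable
final_model = lgb.LGBMRegressor(**{**lgbm_anchor_params, 'n_estimators': best_iter_final, 'monotone_constraints': mono_final})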
final_model.fit(X_all, y_all)
test_pred = np.expm1(final_model.predict(X_test)).astype('float32')
test_pred = np.clip(test_pred, 0, 500)
sub = sample[['key']].merge(pd.DataFrame({'key': test_df_big['key'].astype('string'), 'fare_amount': test_pred}), on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (6M anchor LGBM) with shape:', sub.shape)

{'train_sample_rows': 6000000, 'p': 0.1309090909090909}
{'train_big_rows': 6000000}


[300]	valid_0's rmse: 0.244552


[600]	valid_0's rmse: 0.244095


[900]	valid_0's rmse: 0.243961


[1200]	valid_0's rmse: 0.243892


[1500]	valid_0's rmse: 0.243853


[1800]	valid_0's rmse: 0.243821


[2100]	valid_0's rmse: 0.243808


[2400]	valid_0's rmse: 0.2438


[2700]	valid_0's rmse: 0.243793


[3000]	valid_0's rmse: 0.243794


[Anchor 6M] Fold 1: RMSE=3.81526, best_iter=2847 (train_n=959825, val_n=962298)


[300]	valid_0's rmse: 0.191257


[600]	valid_0's rmse: 0.190177


[900]	valid_0's rmse: 0.189822


[1200]	valid_0's rmse: 0.189651


[1500]	valid_0's rmse: 0.189548


[1800]	valid_0's rmse: 0.189476


[2100]	valid_0's rmse: 0.189425


[2400]	valid_0's rmse: 0.189388


[2700]	valid_0's rmse: 0.189363


[3000]	valid_0's rmse: 0.18934


[3300]	valid_0's rmse: 0.189333


[3600]	valid_0's rmse: 0.189328


[Anchor 6M] Fold 2: RMSE=2.98583, best_iter=3465 (train_n=1921688, val_n=962298)


[300]	valid_0's rmse: 0.226244


[600]	valid_0's rmse: 0.225139


[900]	valid_0's rmse: 0.224849


[1200]	valid_0's rmse: 0.224673


[1500]	valid_0's rmse: 0.224547


[1800]	valid_0's rmse: 0.224458


[2100]	valid_0's rmse: 0.224401


[2400]	valid_0's rmse: 0.224358


[2700]	valid_0's rmse: 0.224315


[3000]	valid_0's rmse: 0.224279


[3300]	valid_0's rmse: 0.224261


[3600]	valid_0's rmse: 0.224245


[3900]	valid_0's rmse: 0.224236


[4200]	valid_0's rmse: 0.224217


[4500]	valid_0's rmse: 0.224213


[4800]	valid_0's rmse: 0.224205


[5100]	valid_0's rmse: 0.224201


[5400]	valid_0's rmse: 0.224189


[Anchor 6M] Fold 3: RMSE=3.88859, best_iter=5234 (train_n=2884002, val_n=962298)


[300]	valid_0's rmse: 0.219733


[600]	valid_0's rmse: 0.2184


[900]	valid_0's rmse: 0.218009


[1200]	valid_0's rmse: 0.217829


[1500]	valid_0's rmse: 0.217698


[1800]	valid_0's rmse: 0.217596


[2100]	valid_0's rmse: 0.217522


[2400]	valid_0's rmse: 0.217472


[2700]	valid_0's rmse: 0.217429


[3000]	valid_0's rmse: 0.217391


[3300]	valid_0's rmse: 0.217368


[3600]	valid_0's rmse: 0.217342


[3900]	valid_0's rmse: 0.217331


[4200]	valid_0's rmse: 0.217313


[4500]	valid_0's rmse: 0.217302


[4800]	valid_0's rmse: 0.217287


[5100]	valid_0's rmse: 0.217278


[5400]	valid_0's rmse: 0.217272


[5700]	valid_0's rmse: 0.217264


[6000]	valid_0's rmse: 0.217256


[6300]	valid_0's rmse: 0.217245


[6600]	valid_0's rmse: 0.217243


[6900]	valid_0's rmse: 0.217239


[7200]	valid_0's rmse: 0.217231


[7500]	valid_0's rmse: 0.217228


[7800]	valid_0's rmse: 0.217231


[Anchor 6M] Fold 4: RMSE=3.91523, best_iter=7517 (train_n=3846744, val_n=962298)


[300]	valid_0's rmse: 0.204723


[600]	valid_0's rmse: 0.202878


[900]	valid_0's rmse: 0.202418


[1200]	valid_0's rmse: 0.202198


[1500]	valid_0's rmse: 0.202067


[1800]	valid_0's rmse: 0.201969


[2100]	valid_0's rmse: 0.201883


[2400]	valid_0's rmse: 0.201825


[2700]	valid_0's rmse: 0.201771


[3000]	valid_0's rmse: 0.201725


[3300]	valid_0's rmse: 0.201699


[3600]	valid_0's rmse: 0.201675


[3900]	valid_0's rmse: 0.201655


[4200]	valid_0's rmse: 0.201633


[4500]	valid_0's rmse: 0.201619


[4800]	valid_0's rmse: 0.20161


[5100]	valid_0's rmse: 0.201593


[5400]	valid_0's rmse: 0.201581


[5700]	valid_0's rmse: 0.201566


[6000]	valid_0's rmse: 0.201554


[6300]	valid_0's rmse: 0.201543


[6600]	valid_0's rmse: 0.201536


[6900]	valid_0's rmse: 0.20153


[7200]	valid_0's rmse: 0.20152


[7500]	valid_0's rmse: 0.201518


[7800]	valid_0's rmse: 0.201515


[8100]	valid_0's rmse: 0.201513


[8400]	valid_0's rmse: 0.201511


[8700]	valid_0's rmse: 0.201509


[9000]	valid_0's rmse: 0.201504


[9300]	valid_0's rmse: 0.201505


[9600]	valid_0's rmse: 0.201502


[9900]	valid_0's rmse: 0.201503


[10200]	valid_0's rmse: 0.201508


[Anchor 6M] Fold 5: RMSE=3.68505, best_iter=9809 (train_n=4808673, val_n=962298)
{'cv_rmse_mean': 3.657992, 'cv_rmse_std': 0.34543, 'best_iter_final': 5234, 'note': '6M rows, distance-only anchor LGBM'}


Saved submission.csv (6M anchor LGBM) with shape: (9914, 2)
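
The anchor cells build the monotone constraint list by index inside the fold loop. A small name-based helper keeps the list aligned with the feature order and fails loudly on a misspelled name. This is a sketch under the assumption that the constraint convention is LightGBM's (1 for increasing, -1 for decreasing, 0 for unconstrained); the helper name `monotone_vector` is hypothetical.

```python
def monotone_vector(feature_names, increasing=(), decreasing=()):
    """Build a monotone_constraints list aligned to the feature order."""
    unknown = (set(increasing) | set(decreasing)) - set(feature_names)
    if unknown:
        raise KeyError(f"Constrained features not in feature list: {sorted(unknown)}")
    return [1 if f in increasing else -1 if f in decreasing else 0
            for f in feature_names]

anchor_feats = [
    'dist_hav_km', 'dist_man_km', 'passenger_count',
    'hour_sin', 'hour_cos', 'dow_sin', 'dow_cos',
    'days_since_start', 'center_lat', 'center_lon',
]
mono = monotone_vector(
    anchor_feats,
    increasing=['dist_hav_km', 'dist_man_km', 'days_since_start'],
)
print(mono)  # [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
```

Building the vector once, before the fold loop, also makes it available to the final fit without depending on a variable left over from the last fold.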


In [72]:
# Phase 16b — Option A Pragmatic Blend: Small CatBoost on 2M (early data) + 6M Anchor LGBM preds; weight tuned on recent slice
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df' in globals() and 'test_df' in globals() and 'sample' in globals(), 'Prereqs missing'

try:
    from catboost import CatBoostRegressor, Pool
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'catboost'])
    from catboost import CatBoostRegressor, Pool
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb
try:
    import pygeohash as pgh
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'pygeohash'])
    import pygeohash as pgh

# 0) Utilities
def to_local(dt):
    return dt.dt.tz_convert('America/New_York')

def ensure_time_feats(df):
    d = df.copy()
    dt = to_local(d['pickup_datetime'])
    # days_since_start anchored to train's earliest timestamp
    if 'days_since_start' not in d.columns:
        start_ns = int(to_local(train_df['pickup_datetime']).min().value)
        d['days_since_start'] = ((dt.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    d['hour'] = dt.dt.hour.astype('int8')
    d['dow'] = dt.dt.dayofweek.astype('int8')
    d['week_of_year'] = dt.dt.isocalendar().week.astype('int16')
    d['hour_sin'] = np.sin(2*np.pi*d['hour']/24).astype('float32')
    d['hour_cos'] = np.cos(2*np.pi*d['hour']/24).astype('float32')
    d['dow_sin'] = np.sin(2*np.pi*d['dow']/7).astype('float32')
    d['dow_cos'] = np.cos(2*np.pi*d['dow']/7).astype('float32')
    d['doy_sin'] = np.sin(2*np.pi*(dt.dt.dayofyear.astype('int16')/365.25)).astype('float32')
    d['doy_cos'] = np.cos(2*np.pi*(dt.dt.dayofyear.astype('int16')/365.25)).astype('float32')
    d['is_weekend'] = d['dow'].isin([5,6]).astype('int8')
    d['is_rush'] = ((d['hour'].between(7,10)) | (d['hour'].between(16,19))).astype('int8')
    d['is_night'] = ((d['hour'] >= 22) | (d['hour'] <= 5)).astype('int8')
    # Policy and holiday
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    d['after_hike'] = (dt >= cutoff).astype('int8')
    from pandas.tseries.holiday import USFederalHolidayCalendar as USCal
    # Vectorized holiday flag: compare naive local midnights against the federal holiday index
    local_day = dt.dt.tz_localize(None).dt.normalize()
    hol = USCal().holidays(start=local_day.min(), end=local_day.max())
    d['is_holiday'] = local_day.isin(hol).astype('int8')
    return d

def add_geohash5(df):
    d = df.copy()
    if 'pu_gh5' in d.columns and 'do_gh5' in d.columns: return d
    pu_lat = d['pickup_latitude'].astype('float32').values
    pu_lon = d['pickup_longitude'].astype('float32').values
    do_lat = d['dropoff_latitude'].astype('float32').values
    do_lon = d['dropoff_longitude'].astype('float32').values
    d['pu_gh5'] = [pgh.encode(float(la), float(lo), precision=5) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(pu_lat, pu_lon)]
    d['do_gh5'] = [pgh.encode(float(la), float(lo), precision=5) if np.isfinite(la) and np.isfinite(lo) else '' for la, lo in zip(do_lat, do_lon)]
    return d

# 1) Prepare ordered 2M train view for small CatBoost and small Anchor
train_ord = train_df.sort_values('pickup_datetime').reset_index(drop=True).copy()
train_ord = ensure_time_feats(train_ord)
train_ord = add_geohash5(train_ord)

# CatBoost feature sets
cb_num = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'hour_sin','hour_cos','dow_sin','dow_cos','doy_sin','doy_cos','is_holiday','is_weekend','is_rush','is_night','days_since_start',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend'
]
if 'rot_manh_km' in train_ord.columns: cb_num.append('rot_manh_km')
if 'dist_x_after_hike' in train_ord.columns: cb_num.append('dist_x_after_hike')
cb_cat = ['pu_gh5','do_gh5','week_of_year','hour','dow','after_hike']
cb_cols = cb_num + cb_cat

# Anchor features (distance-only, stable)
anchor_feats = [
    'dist_hav_km','dist_man_km','passenger_count',
    'hour_sin','hour_cos','dow_sin','dow_cos',
    'days_since_start','center_lat','center_lon'
]
for f in anchor_feats: assert f in train_ord.columns, f'Missing anchor feature {f}'

# 2) Define recent validation slice from 2M and early training part
N = len(train_ord)
val_n = min(300_000, int(0.15 * N))
train_end = N - val_n
assert train_end > 0, 'Not enough rows for split'
tr_part = train_ord.iloc[:train_end].copy()
va_part = train_ord.iloc[train_end:].copy()
y_tr_log = np.log1p(tr_part['fare_amount'].astype('float32').values)
y_va_log = np.log1p(va_part['fare_amount'].astype('float32').values)

# 3) Small CatBoost on early part
X_tr_cb = tr_part[cb_cols].copy()
X_va_cb = va_part[cb_cols].copy()
# Build test features with required temporals/categoricals
test_te = ensure_time_feats(test_df.copy())
test_te = add_geohash5(test_te)
X_te_cb = test_te[cb_cols].copy()
cb_cat_idx = [cb_cols.index(c) for c in cb_cat]
cb_params_small = dict(
    loss_function='RMSE',
    iterations=12000,
    learning_rate=0.05,
    depth=7,
    l2_leaf_reg=30.0,
    bootstrap_type='Bernoulli', subsample=0.7,
    od_type='Iter', od_wait=800,
    random_seed=2026, task_type='CPU', verbose=False
)
pool_tr = Pool(X_tr_cb, label=y_tr_log, cat_features=cb_cat_idx)
pool_va = Pool(X_va_cb, label=y_va_log, cat_features=cb_cat_idx)
pool_te = Pool(X_te_cb, cat_features=cb_cat_idx)
cb_small = CatBoostRegressor(**cb_params_small)
cb_small.fit(pool_tr, eval_set=pool_va, verbose=False)
va_cb_log = cb_small.predict(pool_va).astype('float32')
test_cb_pred = np.expm1(cb_small.predict(pool_te)).astype('float32')

# 4) Small Anchor LGBM on early part
X_tr_lgb = tr_part[anchor_feats].astype('float32').values
X_va_lgb = va_part[anchor_feats].astype('float32').values
lgb_params_anchor_small = {
    'objective': 'huber', 'metric': 'rmse',
    'n_estimators': 12000, 'learning_rate': 0.04,
    'num_leaves': 64, 'min_data_in_leaf': 1000,
    'reg_lambda': 40.0, 'feature_fraction': 0.7,
    'bagging_fraction': 0.7, 'bagging_freq': 1,
    'random_state': 2026, 'verbose': -1
}
mono = [0]*len(anchor_feats)
for nm in ['dist_hav_km','dist_man_km','days_since_start']:
    mono[anchor_feats.index(nm)] = 1
lgb_anchor = lgb.LGBMRegressor(**{**lgb_params_anchor_small, 'monotone_constraints': mono})
lgb_anchor.fit(
    X_tr_lgb, y_tr_log,
    eval_set=[(X_va_lgb, y_va_log)],
    eval_metric='rmse',
    callbacks=[lgb.early_stopping(stopping_rounds=300, verbose=False)]
)
va_lgb_log = lgb_anchor.predict(X_va_lgb, num_iteration=lgb_anchor.best_iteration_).astype('float32')

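# 5) Load existing 6M Anchor test predictions from current submission (must exist from Cell 41)
# NOTE: step 8 overwrites submission.csv, so re-running this cell without first re-running Cell 41
# would blend against an already-blended file instead of the pure 6M Anchor predictions.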
sub6m = pd.read_csv('submission.csv', dtype={'key':'string'})
assert set(sub6m.columns)=={'key','fare_amount'}, 'submission.csv not in expected format'
anchor6m_pred = sub6m['fare_amount'].astype('float32').values
assert len(anchor6m_pred) == len(sample), 'Anchor 6M submission size mismatch'

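# 6) Tune blend weights on recent slice (grid) between CatBoost_small and Anchor_small
# Anchor_small serves as a proxy for the 6M Anchor here; the selected weight is applied to the
# 6M Anchor test predictions in step 7. If the best weight lands on a grid edge, consider widening the grid.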
weights = [0.2,0.3,0.4,0.5,0.6,0.7,0.8]
best = (None, 1e9)
y_va_true = np.expm1(y_va_log)
for w in weights:
    blend_va = w * np.expm1(va_cb_log) + (1.0 - w) * np.expm1(va_lgb_log)
    rmse = float(root_mean_squared_error(y_va_true, np.clip(blend_va, 0, 500)))
    print({'w_cb': w, 'rmse_recent_window': round(rmse,6)})
    if rmse < best[1]:
        best = (w, rmse)
w_best = best[0]
print({'selected_w_cb': w_best, 'rmse_recent_window': round(best[1],6)})

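# 7) Build final blended test predictions: CatBoost_small vs 6M Anchor, both aligned to sample order by key
assert len(test_cb_pred) == len(anchor6m_pred) == len(sample), 'Test size mismatch'
sample_keys = sample['key'].astype(str).to_numpy()
cb_by_key = pd.Series(test_cb_pred, index=test_te['key'].astype(str).to_numpy())
a6_by_key = pd.Series(anchor6m_pred, index=sub6m['key'].astype(str).to_numpy())
assert cb_by_key.index.is_unique and a6_by_key.index.is_unique, 'Duplicate keys in test predictions'
cb_aligned = cb_by_key.reindex(sample_keys).to_numpy()
a6_aligned = a6_by_key.reindex(sample_keys).to_numpy()
assert not (np.isnan(cb_aligned).any() or np.isnan(a6_aligned).any()), 'Keys missing from test predictions'
final_pred = (w_best * cb_aligned + (1.0 - w_best) * a6_aligned).astype('float32')
final_pred = np.clip(final_pred, 0, 500)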

# 8) Save submission aligned to sample order
out = pd.DataFrame({'key': sample['key'].astype('string'), 'fare_amount': final_pred})
out.to_csv('submission.csv', index=False)
print('Saved submission.csv (Option A blend: CB_small + 6M anchor) with shape:', out.shape, 'w_cb=', w_best)

{'w_cb': 0.2, 'rmse_recent_window': 3.508365}
{'w_cb': 0.3, 'rmse_recent_window': 3.427836}
{'w_cb': 0.4, 'rmse_recent_window': 3.357436}
{'w_cb': 0.5, 'rmse_recent_window': 3.297813}
{'w_cb': 0.6, 'rmse_recent_window': 3.24956}
{'w_cb': 0.7, 'rmse_recent_window': 3.213189}
{'w_cb': 0.8, 'rmse_recent_window': 3.189107}
{'selected_w_cb': 0.8, 'rmse_recent_window': 3.189107}
Saved submission.csv (Option A blend: CB_small + 6M anchor) with shape: (9914, 2) w_cb= 0.8
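
The grid search in step 6 can be isolated into a small helper. The sketch below is illustrative only: `tune_blend_weight`, the `hit_edge` flag, and the synthetic arrays are assumptions introduced here, not part of the notebook. It also reports whether the best weight sits on the edge of the grid, which is the situation in the log above (0.8 was the largest weight tried and RMSE was still falling) and usually suggests widening the grid.

```python
import numpy as np

def tune_blend_weight(y_true, pred_a, pred_b, weights):
    """Grid-search w for the blend w * pred_a + (1 - w) * pred_b.

    Returns (best_w, best_rmse, hit_edge); hit_edge is True when the best
    weight is the smallest or largest value in the grid.
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    pred_a = np.asarray(pred_a, dtype=np.float64)
    pred_b = np.asarray(pred_b, dtype=np.float64)
    weights = sorted(float(w) for w in weights)
    best_w, best_rmse = None, np.inf
    for w in weights:
        blend = w * pred_a + (1.0 - w) * pred_b
        rmse = float(np.sqrt(np.mean((y_true - blend) ** 2)))
        if rmse < best_rmse:
            best_w, best_rmse = w, rmse
    hit_edge = best_w in (weights[0], weights[-1])
    return best_w, best_rmse, hit_edge

# Synthetic illustration only (not the notebook's data):
# pred_good has much smaller errors than pred_weak, so the best weight is high.
rng = np.random.default_rng(0)
y = rng.uniform(5.0, 50.0, size=1000)
pred_good = y + rng.normal(0.0, 1.0, size=1000)
pred_weak = y + rng.normal(0.0, 5.0, size=1000)

w, rmse, hit_edge = tune_blend_weight(y, pred_good, pred_weak, [0.2, 0.5, 0.8])
print({'w': w, 'rmse': round(rmse, 4), 'hit_edge': hit_edge})
```

When `hit_edge` is true, rerunning with a wider grid (for example `np.linspace(0.0, 1.0, 21)`) shows whether the optimum really lies inside the range that was searched.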


In [75]:
# Phase 16 — Single Robust LightGBM on 6M (Fast Baseline Reset): tuned for speed, 5-fold time CV
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error
from sklearn.cluster import MiniBatchKMeans

assert 'train_df_big' in globals() and 'test_df_big' in globals(), '6M train/test not prepared (run cell 41)'
assert 'sample' in globals(), 'sample submission frame missing (needed for the final merge)'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# Helpers
def ensure_days_since_start(df, ref_start_ns=None):
    d = df.copy()
    dt_local = d['pickup_datetime'].dt.tz_convert('America/New_York')
    start_ns = int(dt_local.min().value) if ref_start_ns is None else int(ref_start_ns)
    d['days_since_start'] = ((dt_local.astype('int64') - np.int64(start_ns)) / np.float64(24*3600*1e9)).astype('float32')
    return d, start_ns

def ensure_after_hike(df):
    d = df.copy()
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    dt_local = d['pickup_datetime'].dt.tz_convert('America/New_York')
    d['after_hike'] = (dt_local >= cutoff).astype('int8')
    return d

def ensure_clusters(train, test, k=80, sample_n=200_000, seed=2025):
    need = ['pu_cl','do_cl','same_cl']
    if all(c in train.columns for c in need) and all(c in test.columns for c in need):
        return train, test
    pu_tr = train[['pickup_latitude','pickup_longitude']].astype('float32').values
    do_tr = train[['dropoff_latitude','dropoff_longitude']].astype('float32').values
    if len(train) > sample_n:
        idx = np.random.default_rng(seed).choice(len(train), size=sample_n, replace=False)
        pu_fit = pu_tr[idx]; do_fit = do_tr[idx]
    else:
        pu_fit = pu_tr; do_fit = do_tr
    km_pu = MiniBatchKMeans(n_clusters=k, random_state=seed, batch_size=20000, n_init=3, max_no_improvement=20)
    km_do = MiniBatchKMeans(n_clusters=k, random_state=seed+1, batch_size=20000, n_init=3, max_no_improvement=20)
    km_pu.fit(pu_fit); km_do.fit(do_fit)
    train = train.copy(); test = test.copy()
    train['pu_cl'] = km_pu.predict(pu_tr).astype('int32')
    train['do_cl'] = km_do.predict(do_tr).astype('int32')
    test['pu_cl'] = km_pu.predict(test[['pickup_latitude','pickup_longitude']].astype('float32').values).astype('int32')
    test['do_cl'] = km_do.predict(test[['dropoff_latitude','dropoff_longitude']].astype('float32').values).astype('int32')
    train['same_cl'] = (train['pu_cl'] == train['do_cl']).astype('int8')
    test['same_cl'] = (test['pu_cl'] == test['do_cl']).astype('int8')
    return train, test

# Copy to avoid mutating prior cells
tr = train_df_big.copy()
ts = test_df_big.copy()

# Ensure time anchors and policy parity
tr, start_ns = ensure_days_since_start(tr, ref_start_ns=None)
ts, _ = ensure_days_since_start(ts, ref_start_ns=start_ns)
tr = ensure_after_hike(tr); ts = ensure_after_hike(ts)

# Add KMeans clusters (reduced k for speed)
tr, ts = ensure_clusters(tr, ts, k=80, sample_n=200_000, seed=2025)

# Ensure engineered features (holiday + dropoff POIs + any-airport flag) exist
assert 'add_holiday_and_doy' in globals(), 'add_holiday_and_doy not found'
assert 'add_dropoff_pois' in globals(), 'add_dropoff_pois not found'
tr = add_holiday_and_doy(tr); ts = add_holiday_and_doy(ts)
tr = add_dropoff_pois(tr); ts = add_dropoff_pois(ts)

# Feature set (slightly pruned for speed) — no TE
feature_cols = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday','days_since_start','after_hike',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend',
    'pu_cl','do_cl','same_cl'
]
if 'rot_manh_km' in tr.columns: feature_cols.append('rot_manh_km')
if 'dist_x_after_hike' in tr.columns: feature_cols.append('dist_x_after_hike')

# Drop NaNs on train; fill test
tr = tr.dropna(subset=feature_cols + ['fare_amount']).sort_values('pickup_datetime').reset_index(drop=True)
ts[feature_cols] = ts[feature_cols].fillna(0)

# Matrices
X_all = tr[feature_cols].astype('float32').values
y_all = np.log1p(tr['fare_amount'].astype('float32').values)
dt_all = tr['pickup_datetime']
X_test = ts[feature_cols].astype('float32').values

# 5-fold time CV with 1-day gap
folds = make_time_folds_quantile(dt_all, n_folds=5, gap_days=1)
assert len(folds) == 5

# LightGBM params — FAST baseline, stronger regularization, quicker ES
lgb_params = dict(
    objective='regression',
    metric='rmse',
    learning_rate=0.06,
    n_estimators=8000,
    num_leaves=96,
    min_data_in_leaf=2000,
    feature_fraction=0.75,
    bagging_fraction=0.75,
    bagging_freq=1,
    max_bin=127,
    reg_alpha=2.0,
    reg_lambda=20.0,
    random_state=2025,
    n_jobs=-1,
    verbose=-1
)

def build_monotone(cols):
    mono = [0]*len(cols)
    for nm in ['days_since_start','dist_hav_km','dist_man_km']:
        if nm in cols:
            mono[cols.index(nm)] = 1
    return mono

cv_rmses, best_iters = [], []
mono = build_monotone(feature_cols)
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
    X_va, y_va = X_all[va_idx], y_all[va_idx]
    model = lgb.LGBMRegressor(**{**lgb_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)],
        eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=200, verbose=False), lgb.log_evaluation(period=100)]
    )
    y_pred = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    rmse = float(root_mean_squared_error(np.expm1(y_va), y_pred))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or model.n_estimators))
    print(f"[LGB 6M FAST] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})")

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_final = int(np.clip(int(np.median(best_iters)), 400, lgb_params['n_estimators']))
print({'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 'best_iter_final': best_iter_final, 'note': '6M FAST robust LGB baseline reset'})

# Final fit and submission
final_model = lgb.LGBMRegressor(**{**lgb_params, 'n_estimators': best_iter_final, 'monotone_constraints': mono})
final_model.fit(X_all, y_all)
test_pred = np.expm1(final_model.predict(X_test)).astype('float32')
test_pred = np.clip(test_pred, 0, 500)
sub = sample[['key']].merge(pd.DataFrame({'key': ts['key'].astype('string'), 'fare_amount': test_pred}), on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (6M FAST robust LGB baseline reset) with shape:', sub.shape)

[100]	valid_0's rmse: 0.235365


[200]	valid_0's rmse: 0.231526
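
`make_time_folds_quantile` is defined in an earlier cell and is not shown here. As a rough guide to what a forward-chaining split with a gap does, here is a minimal sketch under assumed semantics: quantile-sized time blocks, with training rows strictly earlier than the validation start minus the gap. The name `time_folds_with_gap` and its edge handling are hypothetical and may differ from the notebook's helper.

```python
import numpy as np
import pandas as pd

def time_folds_with_gap(dt, n_folds=5, gap_days=1):
    """Return a list of (train_idx, val_idx) positional index arrays.

    The timeline is cut into n_folds + 1 blocks of roughly equal row count.
    Block 0 is train-only; each later block is a validation slice, and its
    training rows are those earlier than (validation start - gap).
    """
    ts = pd.Series(dt).reset_index(drop=True)
    sorted_ts = ts.sort_values(kind='stable').reset_index(drop=True)
    n = len(sorted_ts)
    cuts = [int(round(q * (n - 1))) for q in np.linspace(0.0, 1.0, n_folds + 2)]
    edges = [sorted_ts.iloc[c] for c in cuts]
    gap = pd.Timedelta(days=gap_days)
    folds = []
    for i in range(1, n_folds + 1):
        val_start, val_end = edges[i], edges[i + 1]
        # The last block is closed on the right so the final timestamp is included
        va_mask = (ts >= val_start) & ((ts < val_end) | (i == n_folds))
        tr_mask = ts < (val_start - gap)
        folds.append((np.flatnonzero(tr_mask.to_numpy()),
                      np.flatnonzero(va_mask.to_numpy())))
    return folds

# Synthetic hourly timestamps over 60 days (illustration only)
dt_demo = pd.Timestamp('2012-01-01', tz='UTC') + pd.to_timedelta(np.arange(60 * 24), unit='h')
folds_demo = time_folds_with_gap(dt_demo, n_folds=5, gap_days=1)
for k, (tr_idx, va_idx) in enumerate(folds_demo, 1):
    print(k, len(tr_idx), len(va_idx))
```

The gap keeps trips from the day just before each validation slice out of training, which is what prevents near-duplicate temporal context from leaking across the split.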


In [78]:
# Phase 16 — Silver-Bullet Features + Robust LGBM on 6M with fold ensembling (boroughs + refined airport flags)
import numpy as np, pandas as pd
from sklearn.metrics import root_mean_squared_error

assert 'train_df_big' in globals() and 'test_df_big' in globals(), '6M data not prepared (run cell 41)'
assert 'sample' in globals(), 'sample submission frame missing (needed for the final merge)'
assert 'make_time_folds_quantile' in globals(), 'Fold generator missing'
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# Copy to avoid mutation
tr = train_df_big.copy()
ts = test_df_big.copy()

# Ensure required engineered features (holiday/day-of-year, dropoff POIs, after_hike, clusters) exist
def ensure_after_hike(df):
    d = df.copy()
    if 'after_hike' in d.columns:
        return d
    cutoff = pd.Timestamp('2012-09-04 00:00:00', tz='America/New_York')
    dt_local = d['pickup_datetime'].dt.tz_convert('America/New_York')
    d['after_hike'] = (dt_local >= cutoff).astype('int8')
    return d

assert 'add_holiday_and_doy' in globals(), 'add_holiday_and_doy not found (see cell 5)'
assert 'add_dropoff_pois' in globals(), 'add_dropoff_pois not found (see cell 6)'

# Apply holiday/doy, after_hike, and dropoff POIs
tr = add_holiday_and_doy(tr); ts = add_holiday_and_doy(ts)
tr = ensure_after_hike(tr); ts = ensure_after_hike(ts)
tr = add_dropoff_pois(tr); ts = add_dropoff_pois(ts)

# Ensure clusters (pu_cl/do_cl/same_cl)
try:
    tr, ts = ensure_clusters(tr, ts, k=80, sample_n=200_000, seed=2025)
except NameError:
    from sklearn.cluster import MiniBatchKMeans
    def _ensure_clusters(train, test, k=80, sample_n=200_000, seed=2025):
        pu_tr = train[['pickup_latitude','pickup_longitude']].astype('float32').values
        do_tr = train[['dropoff_latitude','dropoff_longitude']].astype('float32').values
        if len(train) > sample_n:
            idx = np.random.default_rng(seed).choice(len(train), size=sample_n, replace=False)
            pu_fit = pu_tr[idx]; do_fit = do_tr[idx]
        else:
            pu_fit = pu_tr; do_fit = do_tr
        km_pu = MiniBatchKMeans(n_clusters=k, random_state=seed, batch_size=20000, n_init=3, max_no_improvement=20)
        km_do = MiniBatchKMeans(n_clusters=k, random_state=seed+1, batch_size=20000, n_init=3, max_no_improvement=20)
        km_pu.fit(pu_fit); km_do.fit(do_fit)
        train = train.copy(); test = test.copy()
        train['pu_cl'] = km_pu.predict(pu_tr).astype('int32')
        train['do_cl'] = km_do.predict(do_tr).astype('int32')
        test['pu_cl'] = km_pu.predict(test[['pickup_latitude','pickup_longitude']].astype('float32').values).astype('int32')
        test['do_cl'] = km_do.predict(test[['dropoff_latitude','dropoff_longitude']].astype('float32').values).astype('int32')
        train['same_cl'] = (train['pu_cl'] == train['do_cl']).astype('int8')
        test['same_cl'] = (test['pu_cl'] == test['do_cl']).astype('int8')
        return train, test
    tr, ts = _ensure_clusters(tr, ts, k=80, sample_n=200_000, seed=2025)

# 1) Silver-bullet spatial features: borough proxies (vectorized) + refined airport flags
def add_borough_and_airports(df):
    d = df.copy()
    pu_lat = d['pickup_latitude'].astype('float32').values
    pu_lon = d['pickup_longitude'].astype('float32').values
    do_lat = d['dropoff_latitude'].astype('float32').values
    do_lon = d['dropoff_longitude'].astype('float32').values
    # Vectorized borough masks (rough bounding boxes)
    pu_manh = (pu_lat >= 40.70) & (pu_lat <= 40.88) & (pu_lon >= -74.02) & (pu_lon <= -73.92)
    pu_bk   = (pu_lat >= 40.57) & (pu_lat <= 40.73) & (pu_lon >= -74.05) & (pu_lon <= -73.85)
    pu_qn   = (pu_lat >= 40.54) & (pu_lat <= 40.80) & (pu_lon >= -73.96) & (pu_lon <= -73.70)
    pu_bx   = (pu_lat >= 40.79) & (pu_lat <= 40.91) & (pu_lon >= -73.93) & (pu_lon <= -73.76)
    pu_si   = (pu_lat >= 40.49) & (pu_lat <= 40.65) & (pu_lon >= -74.25) & (pu_lon <= -74.05)
    do_manh = (do_lat >= 40.70) & (do_lat <= 40.88) & (do_lon >= -74.02) & (do_lon <= -73.92)
    do_bk   = (do_lat >= 40.57) & (do_lat <= 40.73) & (do_lon >= -74.05) & (do_lon <= -73.85)
    do_qn   = (do_lat >= 40.54) & (do_lat <= 40.80) & (do_lon >= -73.96) & (do_lon <= -73.70)
    do_bx   = (do_lat >= 40.79) & (do_lat <= 40.91) & (do_lon >= -73.93) & (do_lon <= -73.76)
    do_si   = (do_lat >= 40.49) & (do_lat <= 40.65) & (do_lon >= -74.25) & (do_lon <= -74.05)
    d['pu_borough'] = np.select([pu_manh, pu_bk, pu_qn, pu_bx, pu_si], [1,2,3,4,5], default=0).astype('int8')
    d['do_borough'] = np.select([do_manh, do_bk, do_qn, do_bx, do_si], [1,2,3,4,5], default=0).astype('int8')
    d['is_inter_borough'] = ((d['pu_borough'] != 0) & (d['do_borough'] != 0) & (d['pu_borough'] != d['do_borough'])).astype('int8')
    d['is_manhattan_trip'] = ((d['pu_borough'] == 1) | (d['do_borough'] == 1)).astype('int8')
    # Refined airport proximity and code
    pu_min = d[['dist_jfk','dist_lga','dist_ewr']].astype('float32').min(axis=1)
    d['pu_is_airport'] = (pu_min < 2.0).astype('int8')
    pu_is_jfk = (d['dist_jfk'] <= d[['dist_lga','dist_ewr']].min(axis=1))
    pu_is_lga = (d['dist_lga'] <= d[['dist_jfk','dist_ewr']].min(axis=1))
    d['pu_airport_code'] = np.select([pu_is_jfk, pu_is_lga], [1, 2], default=3).astype('int8')
    d.loc[d['pu_is_airport'] == 0, 'pu_airport_code'] = 0
    if all(c in d.columns for c in ['do_dist_jfk','do_dist_lga','do_dist_ewr']):
        do_min = d[['do_dist_jfk','do_dist_lga','do_dist_ewr']].astype('float32').min(axis=1)
        d['do_is_airport'] = (do_min < 2.0).astype('int8')
        do_is_jfk = (d['do_dist_jfk'] <= d[['do_dist_lga','do_dist_ewr']].min(axis=1))
        do_is_lga = (d['do_dist_lga'] <= d[['do_dist_jfk','do_dist_ewr']].min(axis=1))
        d['do_airport_code'] = np.select([do_is_jfk, do_is_lga], [1, 2], default=3).astype('int8')
        d.loc[d['do_is_airport'] == 0, 'do_airport_code'] = 0
    else:
        d['do_is_airport'] = np.int8(0); d['do_airport_code'] = np.int8(0)
    # Both airport flag/code columns are guaranteed to exist at this point, so index them directly
    d['jfk_to_manhattan'] = ((d['pu_is_airport'] == 1) & (d['pu_airport_code'] == 1) & (d['do_borough'] == 1)).astype('int8')
    d['manhattan_to_jfk'] = ((d['pu_borough'] == 1) & (d['do_is_airport'] == 1) & (d['do_airport_code'] == 1)).astype('int8')
    return d

tr = add_borough_and_airports(tr)
ts = add_borough_and_airports(ts)

# 2) Feature set: robust + silver-bullet features (no TE). Ensure engineered features exist.
feature_cols = [
    'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count',
    'dist_hav_km','dist_man_km','delta_lat','delta_lon','bear_sin','bear_cos','center_lat','center_lon',
    'hour','dow','is_weekend','is_rush','is_night','hour_sin','hour_cos','dow_sin','dow_cos',
    'doy_sin','doy_cos','is_holiday','days_since_start','after_hike',
    'dist_jfk','dist_lga','dist_ewr','dist_midtown','dist_fidi',
    'do_dist_jfk','do_dist_lga','do_dist_ewr','do_dist_midtown','do_dist_fidi',
    'is_airport_trip','is_airport_trip_any','dist_x_rush','dist_x_weekend',
    'pu_cl','do_cl','same_cl',
    'pu_borough','do_borough','is_inter_borough','is_manhattan_trip',
    'pu_is_airport','pu_airport_code','do_is_airport','do_airport_code',
    'jfk_to_manhattan','manhattan_to_jfk'
]
if 'rot_manh_km' in tr.columns: feature_cols.append('rot_manh_km')
if 'dist_x_after_hike' in tr.columns: feature_cols.append('dist_x_after_hike')

# Drop NaNs and prepare matrices
tr = tr.dropna(subset=feature_cols + ['fare_amount']).sort_values('pickup_datetime').reset_index(drop=True)
ts[feature_cols] = ts[feature_cols].fillna(0)
X_all = tr[feature_cols].astype('float32').values
y_all = np.log1p(tr['fare_amount'].astype('float32').values)
dt_all = tr['pickup_datetime']
X_test = ts[feature_cols].astype('float32').values

# 3) 5-fold time CV with 1-day gap and fold ensembling
folds = make_time_folds_quantile(dt_all, n_folds=5, gap_days=1)
assert len(folds) == 5

# LGBM params: competition-safe fast config to complete within time
lgb_params = dict(
    objective='huber', metric='rmse',
    learning_rate=0.06, n_estimators=4000,
    num_leaves=96, min_data_in_leaf=3000,
    feature_fraction=0.70, bagging_fraction=0.70, bagging_freq=1,
    max_bin=63, reg_alpha=2.0, reg_lambda=20.0,
    random_state=2025, n_jobs=-1, verbose=-1
)

def build_monotone(cols):
    mono = [0]*len(cols)
    for nm in ['days_since_start','dist_hav_km','dist_man_km']:
        if nm in cols: mono[cols.index(nm)] = 1
    return mono

cv_rmses, best_iters, fold_test_preds = [], [], []
mono = build_monotone(feature_cols)
for i, (tr_idx, va_idx) in enumerate(folds, 1):
    X_tr, y_tr = X_all[tr_idx], y_all[tr_idx]
    X_va, y_va = X_all[va_idx], y_all[va_idx]
    model = lgb.LGBMRegressor(**{**lgb_params, 'monotone_constraints': mono})
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)], eval_metric='rmse',
        callbacks=[lgb.early_stopping(stopping_rounds=200, verbose=False), lgb.log_evaluation(period=200)]
    )
    y_pred = np.expm1(model.predict(X_va, num_iteration=model.best_iteration_))
    rmse = float(root_mean_squared_error(np.expm1(y_va), y_pred))
    cv_rmses.append(rmse); best_iters.append(int(model.best_iteration_ or lgb_params['n_estimators']))
    print(f"[6M+Silver FAST] Fold {i}: RMSE={rmse:.5f}, best_iter={best_iters[-1]} (train_n={len(tr_idx)}, val_n={len(va_idx)})")
    # Test prediction for fold ensembling
    fold_pred = np.expm1(model.predict(X_test, num_iteration=model.best_iteration_)).astype('float32')
    fold_pred = np.clip(fold_pred, 0, 500)
    fold_test_preds.append(fold_pred)

cv_mean = float(np.mean(cv_rmses)); cv_std = float(np.std(cv_rmses))
best_iter_median = int(np.median(best_iters)) if best_iters else 0
print({'cv_rmse_mean': round(cv_mean,6), 'cv_rmse_std': round(cv_std,6), 'median_best_iter': best_iter_median, 'note': '6M robust LGB (FAST) with borough/airport features + fold ensembling'})

# 4) Fold-ensemble submission
# Fail loudly rather than silently submitting all-zero predictions
assert fold_test_preds, 'No fold predictions collected'
test_pred = np.mean(np.vstack(fold_test_preds), axis=0).astype('float32')
test_pred = np.clip(test_pred, 0, 500)
sub = sample[['key']].merge(pd.DataFrame({'key': ts['key'].astype('string'), 'fare_amount': test_pred}), on='key', how='left')
assert sub.shape[0] == sample.shape[0] and sub['fare_amount'].notna().all(), 'Submission alignment/missing preds'
sub[['key','fare_amount']].to_csv('submission.csv', index=False)
print('Saved submission.csv (6M + Silver-Bullet Features, fold-ensemble, FAST) with shape:', sub.shape)
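
The borough boxes in `add_borough_and_airports` overlap, and `np.select` returns the choice for the first condition that is true, so the listed order acts as a priority. The sketch below replays that logic on a few hand-picked points. The helper name `borough_code`, the borough labels (read off the variable names `pu_manh`, `pu_bk`, `pu_qn`, `pu_bx`, `pu_si`), and the approximate coordinates are assumptions for illustration.

```python
import numpy as np

# Rough boxes copied from add_borough_and_airports: (lat_min, lat_max, lon_min, lon_max)
BOXES = {
    1: (40.70, 40.88, -74.02, -73.92),  # Manhattan
    2: (40.57, 40.73, -74.05, -73.85),  # Brooklyn
    3: (40.54, 40.80, -73.96, -73.70),  # Queens
    4: (40.79, 40.91, -73.93, -73.76),  # Bronx
    5: (40.49, 40.65, -74.25, -74.05),  # Staten Island
}

def borough_code(lat, lon):
    """Return the code of the first box containing each point, or 0 if none does."""
    lat = np.asarray(lat, dtype=np.float64)
    lon = np.asarray(lon, dtype=np.float64)
    conds = [
        (lat >= la0) & (lat <= la1) & (lon >= lo0) & (lon <= lo1)
        for la0, la1, lo0, lo1 in BOXES.values()
    ]
    return np.select(conds, list(BOXES.keys()), default=0).astype('int8')

# Approximate coordinates, for illustration only
points = {
    'Times Square':      (40.7580, -73.9855),
    'Downtown Brooklyn': (40.6928, -73.9903),
    'Williamsburg':      (40.7081, -73.9571),  # inside both the Manhattan and Brooklyn boxes
    'JFK':               (40.6413, -73.7781),
    'Outside all boxes': (41.5000, -73.0000),
}
lats = [p[0] for p in points.values()]
lons = [p[1] for p in points.values()]
codes = borough_code(lats, lons)
for name, code in zip(points, codes):
    print(name, int(code))
```

The Williamsburg point falls in both the Manhattan and Brooklyn boxes and is coded as Manhattan because that condition comes first; `pu_borough` and `do_borough` are therefore coarse proxies rather than exact borough labels.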