# CHAMPS Scalar Coupling — Medal Plan

Objective: WIN A MEDAL via rigorous CV, strong feature engineering, and fast iteration.

Plan:
- Environment
  - Verify GPU (nvidia-smi). If unavailable, fall back to XGBoost CPU hist rather than exiting.
  - Avoid LightGBM GPU; prefer XGBoost/CatBoost GPU.

- Data & Validation
  - Load train/test, structures.csv, and auxiliary tables.
  - Target: scalar_coupling_constant. Groups: type (1JHC, 2JHH, etc.).
  - CV: GroupKFold by molecule_name to avoid leakage; reuse the same molecule folds for every per-type model.
  - Metric: LMAE, the mean over coupling types of log(MAE within that type). Implement the exact metric.

- Baseline v0
  - Simple features without heavy physics:
    - Atom-level joins from structures: atom types for atom_index_0/1, coordinates, distances (d, dx, dy, dz), angle proxies via nearest neighbors.
    - Count features per molecule, atom type counts.
    - Bond length stats by type.
  - Model: XGBoost (gpu_hist) or CatBoost with type-wise models (one per coupling type).
  - 3-5 fold CV with logging; OOF saved; speed-first.

- Feature Engineering v1
  - Add neighbor-based features (kNN distances for each atom within same molecule).
  - Per-molecule potential_energy, dipole_moments summary joins.
  - Magnetic shielding, Mulliken charges: per-atom join (requires molecule+atom_index mapping) with aggregations for the two atoms and their neighborhood.

- Feature Engineering v2
  - Path-based graph features: shortest path length between the two atoms in the bond graph (topology); it should equal the coupling order (1J/2J/3J/etc.).
  - Angles: angle at atom0-…-atom1 via nearest path atoms; dihedral approximations.

- Modeling
  - Train separate models per type (stronger).
  - Try XGB + CatBoost blend; seed averaging.
  - Cache datasets per type to feather/parquet.

- Error Analysis
  - OOF LMAE overall and by type; focus on worst types.
  - Bucket by distance bins and path length.

- Submission
  - Predict test per type, concat, ensure id alignment, write submission.csv.
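
A minimal sketch of the FE v2 angle and dihedral features, assuming the atoms on the bond path between atom0 and atom1 are already known. The helper names `bond_angle` and `dihedral` are illustrative and not part of this notebook. For 3J couplings the dihedral along the three-bond path is the usual physically motivated input (Karplus-type dependence), so `cos(phi)` and `cos(2*phi)` are natural candidate features.

```python
import numpy as np

def bond_angle(a, b, c):
    """Angle at b (radians) between the rays b->a and b->c."""
    u = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    v = np.asarray(c, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # clip protects arccos from round-off slightly outside [-1, 1]
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def dihedral(p0, p1, p2, p3):
    """Signed dihedral (radians) for the chain p0-p1-p2-p3 around the p1-p2 axis."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=np.float64) for p in (p0, p1, p2, p3))
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # project the outer bonds onto the plane perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.arctan2(y, x))

# Toy geometry: cis arrangement gives 0, trans arrangement gives +/- pi
phi_cis = dihedral([1, 0, 0], [0, 0, 0], [0, 0, 1], [1, 0, 1])
phi_trans = dihedral([1, 0, 0], [0, 0, 0], [0, 0, 1], [-1, 0, 1])
features = {"cos_phi": np.cos(phi_trans), "cos_2phi": np.cos(2 * phi_trans)}
```

For 2J pairs only `bond_angle` applies (angle at the shared neighbor); for 3J pairs the two path atoms play the roles of `p1` and `p2`.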

Checkpoints (request expert review):
1) After this plan
2) After baseline data pipeline + CV metric implementation
3) After baseline model OOF
4) After FE v1 OOF
5) After ensembling

Time budget:
- 1h baseline pipeline + CV
- 2h baseline model per type
- 6-8h FE v1 + retrain
- 4h FE v2 selective
- 2h ensembling/tuning
- Remainder for iterations + submissions
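
A rough sketch of the "XGB + CatBoost blend" step from the plan, assuming OOF predictions from both models are available for one coupling type. The function name `best_blend_weight` and the 21-point weight grid are assumptions made for illustration. Because LMAE averages log-MAE per type, it seems reasonable to pick the weight separately for each type on OOF predictions only.

```python
import numpy as np

def best_blend_weight(y, pred_a, pred_b, grid=None):
    """Grid-search w in [0, 1] minimizing MAE of w*pred_a + (1-w)*pred_b."""
    y, pred_a, pred_b = (np.asarray(v, dtype=np.float64) for v in (y, pred_a, pred_b))
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)  # hypothetical resolution: steps of 0.05
    maes = [np.mean(np.abs(y - (w * pred_a + (1.0 - w) * pred_b))) for w in grid]
    i = int(np.argmin(maes))
    return float(grid[i]), float(maes[i])

# Toy check: model A is perfect, model B is biased by +1, so all weight goes to A
y = np.array([1.0, 2.0, 3.0])
w, mae = best_blend_weight(y, y, y + 1.0)
```

The same weight would then be applied to the test predictions of that type; choosing it on test-side information would leak.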

In [1]:
# Environment check: GPU + deps
import sys, subprocess, time, importlib, os

def run(cmd):
    print(f"$ {' '.join(cmd)}", flush=True)
    p = subprocess.run(cmd, capture_output=True, text=True)
    print(p.stdout, flush=True)
    if p.stderr:
        print(p.stderr, flush=True)
    return p.returncode

print('== NVIDIA SMI ==', flush=True)
run(['bash','-lc','nvidia-smi || true'])

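# Import `pkg`, pip-installing it if the import fails. `ver` is only applied on a fresh
# install; an existing install is left untouched (hence xgboost 2.1.4 in the output despite the 2.0.3 pin).
def ensure(pkg, pip_name=None, ver=None):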
    name = pip_name or pkg
    try:
        importlib.import_module(pkg)
        print(f"OK: {pkg} already installed")
    except Exception:
        args = [sys.executable, '-m', 'pip', 'install']
        if ver:
            args.append(f"{name}=={ver}")
        else:
            args.append(name)
        print('Installing', ' '.join(args), flush=True)
        subprocess.check_call(args)
        importlib.invalidate_caches()
        importlib.import_module(pkg)
        print(f"OK: {pkg} installed")

# Core deps (avoid torch; use XGBoost/CatBoost GPU)
ensure('pandas')
ensure('numpy')
ensure('sklearn', pip_name='scikit-learn')
ensure('xgboost', ver='2.0.3')
ensure('catboost', ver='1.2.5')

import pandas as pd, numpy as np
import sklearn
import xgboost as xgb
from catboost import CatBoostRegressor

print('Versions:',
      'pandas', pd.__version__,
      '| numpy', np.__version__,
      '| sklearn', sklearn.__version__,
      '| xgboost', xgb.__version__, flush=True)

print('GPU env vars:', {k:v for k,v in os.environ.items() if k.startswith('CUDA') or k.startswith('NVIDIA')})
print('Env check complete.')

== NVIDIA SMI ==
$ bash -lc nvidia-smi || true
Failed to initialize NVML: Unknown Error

OK: pandas already installed
OK: numpy already installed
OK: sklearn already installed
OK: xgboost already installed
OK: catboost already installed
Versions: pandas 2.2.2 | numpy 1.26.4 | sklearn 1.5.2 | xgboost 2.1.4
GPU env vars: {}
Env check complete.
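
Since NVML failed to initialize above, the later training cell hardcodes `device='cpu'`. A minimal sketch of choosing the XGBoost device programmatically instead, assuming `nvidia-smi` exits non-zero when the driver is unusable (the check cell above hides that exit code with `|| true`). The function name `pick_xgb_device` is hypothetical.

```python
import shutil
import subprocess

def pick_xgb_device(probe=("nvidia-smi",)) -> str:
    """Return 'cuda' if the probe command runs successfully, else 'cpu'."""
    exe = shutil.which(probe[0])
    if exe is None:
        return "cpu"  # probe binary not on PATH
    try:
        rc = subprocess.run([exe, *probe[1:]], capture_output=True, timeout=30).returncode
    except (OSError, subprocess.TimeoutExpired):
        return "cpu"  # could not launch, or the probe hung
    return "cuda" if rc == 0 else "cpu"

device = pick_xgb_device()
# XGBoost 2.x takes the device as a parameter alongside tree_method='hist'
xgb_device_params = {"device": device, "tree_method": "hist"}
```

A successful probe only suggests a usable GPU; a short training run on a few rows is the more reliable confirmation.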


In [10]:
# Data loading, feature builder, CV + metric (no training yet)
import pandas as pd, numpy as np
from sklearn.model_selection import GroupKFold

DATA_DIR = '.'
TRAIN_PATH = f"{DATA_DIR}/train.csv"
TEST_PATH = f"{DATA_DIR}/test.csv"
STRUCT_PATH = f"{DATA_DIR}/structures.csv"

# Load core tables
train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)
structures = pd.read_csv(STRUCT_PATH)
print('Loaded:', len(train), 'train rows |', len(test), 'test rows |', len(structures), 'structure atoms', flush=True)

# Atomic number lookup
ATOM_Z = {'H':1, 'C':6, 'N':7, 'O':8, 'F':9}

def build_basic_features(df: pd.DataFrame, structures: pd.DataFrame) -> pd.DataFrame:
    # Select only needed cols from structures
    s = structures[['molecule_name','atom_index','atom','x','y','z']].copy()
    s['Z'] = s['atom'].map(ATOM_Z).astype('float32')  # keep float to allow NaN, cast later
    # atom0 merge
    s0 = s.rename(columns={
        'atom_index':'atom_index_0','atom':'atom_0','x':'x0','y':'y0','z':'z0','Z':'Z0'
    })
    df = df.merge(s0, on=['molecule_name','atom_index_0'], how='left')
    # atom1 merge
    s1 = s.rename(columns={
        'atom_index':'atom_index_1','atom':'atom_1','x':'x1','y':'y1','z':'z1','Z':'Z1'
    })
    df = df.merge(s1, on=['molecule_name','atom_index_1'], how='left')
    # geometry
    for c in ['x0','y0','z0','x1','y1','z1']:
        df[c] = df[c].astype('float32')
    df['dx'] = (df['x0'] - df['x1']).astype('float32')
    df['dy'] = (df['y0'] - df['y1']).astype('float32')
    df['dz'] = (df['z0'] - df['z1']).astype('float32')
    df['d2'] = (df['dx']*df['dx'] + df['dy']*df['dy'] + df['dz']*df['dz']).astype('float32')
    df['d'] = np.sqrt(df['d2']).astype('float32')
    df['inv_d'] = (1.0/df['d'].replace(0, np.nan)).fillna(0).astype('float32')
    df['inv_d2'] = (1.0/df['d2'].replace(0, np.nan)).fillna(0).astype('float32')
    # atom identity
    df['Z0'] = df['Z0'].fillna(-1).astype('int16')
    df['Z1'] = df['Z1'].fillna(-1).astype('int16')
    df['same_element'] = (df['Z0'] == df['Z1']).astype('int8')
    # per-molecule counts (nH, nC, ... and total atoms)
    counts = s.groupby(['molecule_name','atom']).size().unstack('atom').fillna(0)
    counts = counts.rename(columns={a:f"n{a}" for a in ['H','C','N','O','F'] if a in counts.columns})
    counts['n_atoms'] = counts.sum(axis=1)
    counts = counts.astype('int16')
    df = df.merge(counts.reset_index(), on='molecule_name', how='left')
    for a in ['H','C','N','O','F']:
        col = f"n{a}"
        if col not in df:
            df[col] = 0
        df[col] = df[col].fillna(0).astype('int16')
    df['n_atoms'] = df['n_atoms'].fillna(0).astype('int16')
    # Keep compact dtypes
    return df

X_train = build_basic_features(train[['id','molecule_name','atom_index_0','atom_index_1','type','scalar_coupling_constant']].copy(), structures)
X_test = build_basic_features(test[['id','molecule_name','atom_index_0','atom_index_1','type']].copy(), structures)
print('Features built:', X_train.shape, X_test.shape, flush=True)

# Feature list
base_features = [
    'Z0','Z1','same_element',
    'dx','dy','dz','d','d2','inv_d','inv_d2',
    'nH','nC','nN','nO','nF','n_atoms'
]
missing_train = [c for c in base_features if c not in X_train.columns]
missing_test = [c for c in base_features if c not in X_test.columns]
if missing_train or missing_test:
    print('Missing features (train/test):', missing_train, missing_test, flush=True)

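# Metric: competition LMAE = mean over coupling types of log(MAE within the type)
def lmae_score(y_true: np.ndarray, y_pred: np.ndarray, types: pd.Series, eps: float = 1e-9) -> float:
    # np.asarray drops any pandas index so the three inputs align by position
    abs_err = pd.Series(np.abs(np.asarray(y_true, dtype='float64') - np.asarray(y_pred, dtype='float64')))
    mae_by_type = abs_err.groupby(np.asarray(types)).mean()
    # clip guards against log(0) if a type is predicted perfectly
    return float(np.log(mae_by_type.clip(lower=eps)).mean())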

# CV splitter (per-type models will use same folds by molecule_name)
def get_folds(df: pd.DataFrame, n_splits: int = 5, seed: int = 42):
    # GroupKFold as used here has no shuffle or seed; it balances folds by group size.
    # Relabel molecules with a seeded permutation and group on that label, so the
    # seed at least influences how equally sized molecules are assigned to folds.
    rng = np.random.RandomState(seed)
    uniq = df['molecule_name'].drop_duplicates().values
    rng.shuffle(uniq)
    order = pd.Series(np.arange(len(uniq)), index=uniq)
    groups = df['molecule_name'].map(order).values
    gkf = GroupKFold(n_splits=n_splits)
    return list(gkf.split(df, None, groups=groups))

folds = get_folds(X_train, n_splits=5, seed=42)
print('Prepared folds:', len(folds), 'splits', flush=True)

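# Sanity check: no molecule may appear on both sides of any split (all folds, not just the first)
mol = X_train['molecule_name']
for k, (tr, va) in enumerate(folds):
    assert not mol.iloc[va].isin(mol.iloc[tr].unique()).any(), f'Leakage: molecule spans train/valid in fold {k}'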
print('Data pipeline ready. Next: per-type training with XGBoost/CatBoost.', flush=True)

Loaded: 4191263 train rows | 467813 test rows | 1379964 structure atoms
Features built: (4191263, 30) (467813, 29)
Prepared folds: 5 splits
Data pipeline ready. Next: per-type training with XGBoost/CatBoost.
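
A small worked example of the LMAE metric on toy numbers, to make the per-type averaging concrete. The standalone function `lmae` is a sketch written for this example and is only meant to mirror what `lmae_score` computes.

```python
import numpy as np
import pandas as pd

def lmae(y_true, y_pred, types, eps=1e-9):
    """Mean over types of log(MAE within the type)."""
    abs_err = pd.Series(np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)))
    mae_by_type = abs_err.groupby(np.asarray(types)).mean()
    return float(np.log(mae_by_type.clip(lower=eps)).mean())

y_true = [1.0, 2.0, 10.0, 20.0]
y_pred = [1.5, 2.5, 11.0, 19.0]
types = ['2JHH', '2JHH', '1JHC', '1JHC']

score = lmae(y_true, y_pred, types)
# MAE(2JHH) = 0.5 and MAE(1JHC) = 1.0, so score = (log 0.5 + log 1.0) / 2
print(round(score, 4))  # -0.3466
```

Because each type contributes its own log term, halving the error of a small type such as 1JHN improves the score as much as halving the error of 3JHC, regardless of row counts.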


In [15]:
# Per-type XGBoost training baseline (CPU hist, since no GPU was detected) with OOF LMAE and submission (xgb.train API)
import time, numpy as np, pandas as pd, xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

# Use KFold on unique molecules to create group-respecting folds
def get_folds_by_molecule(df: pd.DataFrame, n_splits: int = 5, seed: int = 42):
    mols = df['molecule_name'].unique()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_of = {}
    for f, (_, val_idx) in enumerate(kf.split(mols)):
        for m in mols[val_idx]:
            fold_of[m] = f
    fold = df['molecule_name'].map(fold_of).values
    folds = [(np.where(fold!=i)[0], np.where(fold==i)[0]) for i in range(n_splits)]
    return folds

seed = 42
n_splits = 5
folds = get_folds_by_molecule(X_train, n_splits=n_splits, seed=seed)
print(f'Prepared molecule-aware folds: {len(folds)}')

feature_cols = [
    'Z0','Z1','same_element',
    'dx','dy','dz','d','d2','inv_d','inv_d2',
    'nH','nC','nN','nO','nF','n_atoms',
    # FE v1 additions:
    'path_len','inv_path','is_bonded','min_nb_d0','min_nb_d1','cos0','cos1',
    'potential_energy','dipole_x','dipole_y','dipole_z','dipole_mag',
    # FE v2 additions (quantum, identity, interactions, normalization):
    'mulliken_0','mulliken_1','z_mulliken_0','z_mulliken_1',
    'shield_iso_0','shield_iso_1','z_shield_0','z_shield_1',
    'mulliken_diff','mulliken_abs_diff','mulliken_sum','mulliken_prod',
    'shield_diff','shield_abs_diff','shield_sum','shield_prod',
    'mulliken_diff_over_d','shield_diff_over_d','mulliken_diff_x_inv_d','shield_diff_x_inv_d',
    'element_pair_id','element_pair_id_sorted','EN0','EN1','EN_diff','EN_abs_diff',
    'path_len_bucket','path_le2','d_x_inv_path','d_over_1p_path','is_bonded_x_inv_d','inv_d_x_path_le2',
    'cos0_x_inv_path','cos1_x_inv_path','min_nb_d0_x_inv_path','min_nb_d1_x_inv_path',
    'd_over_n_atoms','pe_per_atom','d_over_mol_mean_nb_d','expected_d_by_type','d_from_expected'
]

types = sorted(X_train['type'].unique())
oof = np.zeros(len(X_train), dtype=np.float32)
test_pred = np.zeros(len(X_test), dtype=np.float32)
per_type_scores = {}

# XGBoost params (CPU fallback: device='cpu'); tree_method hist for speed
xgb_params = {
    'objective': 'reg:squarederror',
    'eval_metric': 'mae',
    'device': 'cpu',
    'tree_method': 'hist',
    'max_depth': 7,
    'eta': 0.10,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 2.0,
    'reg_lambda': 1.0,
    'max_bin': 512,
    'seed': seed
}
num_boost_round = 2200
early_stopping_rounds = 100

def sanitize_df(df: pd.DataFrame) -> pd.DataFrame:
    # Replace inf/-inf with NaN then fill with 0; ensure float32
    df = df.replace([np.inf, -np.inf], np.nan).fillna(0)
    return df.astype('float32')

start_all = time.time()
for t in types:
    tr_mask = (X_train['type'] == t).values
    te_mask = (X_test['type'] == t).values
    # Ensure all required columns exist before selection
    missing_cols = [c for c in feature_cols if c not in X_train.columns]
    if missing_cols:
        raise KeyError(f"Missing feature columns for training: {missing_cols[:10]}... (total {len(missing_cols)})")
    X_t = X_train.loc[tr_mask, feature_cols].copy()
    X_te_t = X_test.loc[te_mask, feature_cols].copy()
    # Count infs before sanitizing; afterwards none remain to report
    n_inf_tr = int(np.isinf(X_t.to_numpy(dtype='float64')).sum())
    n_inf_te = int(np.isinf(X_te_t.to_numpy(dtype='float64')).sum())
    # Sanitize numeric matrices to avoid infs
    X_t = sanitize_df(X_t)
    X_te_t = sanitize_df(X_te_t)
    y_t = X_train.loc[tr_mask, 'scalar_coupling_constant'].astype('float32').values
    idx_t = np.where(tr_mask)[0]
    oof_t = np.zeros(X_t.shape[0], dtype=np.float32)
    pred_te_t = np.zeros(X_te_t.shape[0], dtype=np.float32)
    print(f'\nType {t}: n_train={X_t.shape[0]} n_test={X_te_t.shape[0]}', flush=True)
    # Quick diagnostics for infs that were replaced by 0
    if n_inf_tr or n_inf_te:
        print(f'Warning: found inf values | train={n_inf_tr} test={n_inf_te}', flush=True)
    # Pre-build DMatrix for test to reuse across folds
    dtest_t = xgb.DMatrix(X_te_t)
    for fold_i, (tr_idx, va_idx) in enumerate(folds):
        tr_loc = np.intersect1d(idx_t, tr_idx, assume_unique=False)
        va_loc = np.intersect1d(idx_t, va_idx, assume_unique=False)
        tr_loc_local = np.searchsorted(idx_t, tr_loc)
        va_loc_local = np.searchsorted(idx_t, va_loc)
        if len(va_loc_local) == 0 or len(tr_loc_local) == 0:
            continue
        t0 = time.time()
        dtrain = xgb.DMatrix(X_t.iloc[tr_loc_local, :], label=y_t[tr_loc_local])
        dvalid = xgb.DMatrix(X_t.iloc[va_loc_local, :], label=y_t[va_loc_local])
        evals = [(dtrain, 'train'), (dvalid, 'valid')]
        bst = xgb.train(
            params=xgb_params,
            dtrain=dtrain,
            num_boost_round=num_boost_round,
            evals=evals,
            early_stopping_rounds=early_stopping_rounds,
            verbose_eval=200
        )
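        # best_iteration is only defined when early stopping is active; otherwise use the last round
        # (best_ntree_limit is no longer available in XGBoost 2.x, so it cannot serve as the fallback)
        best_iter = getattr(bst, 'best_iteration', None)
        if best_iter is None:
            best_iter = bst.num_boosted_rounds() - 1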
        oof_t[va_loc_local] = bst.predict(dvalid, iteration_range=(0, int(best_iter)+1)).astype('float32')
        pred_te_t += bst.predict(dtest_t, iteration_range=(0, int(best_iter)+1)).astype('float32') / n_splits
        dt = time.time() - t0
        mae_fold = mean_absolute_error(y_t[va_loc_local], oof_t[va_loc_local])
        print(f'  Fold {fold_i}: n_tr={len(tr_loc_local)} n_va={len(va_loc_local)} | MAE={mae_fold:.5f} | {dt:.1f}s', flush=True)
    oof[idx_t] = oof_t
    test_pred[te_mask] = pred_te_t
    mae_t = float(np.mean(np.abs(y_t - oof_t)))
    per_type_scores[t] = mae_t
    print(f'Type {t}: MAE={mae_t:.6f}', flush=True)

overall_lmae = lmae_score(X_train['scalar_coupling_constant'].values, oof, X_train['type'])
print('\nPer-type MAE:', {k: round(v,6) for k,v in per_type_scores.items()})
print(f'Overall OOF LMAE: {overall_lmae:.6f} | elapsed {(time.time()-start_all)/60:.1f} min', flush=True)

# Build submission
sub = pd.DataFrame({'id': X_test['id'].values, 'scalar_coupling_constant': test_pred.astype('float32')})
sub = sub.sort_values('id')
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv:', sub.shape, 'head:\n', sub.head())

Prepared molecule-aware folds: 5



Type 1JHC: n_train=637912 n_test=71221


[0]	train-mae:11.63984	valid-mae:11.70639


[200]	train-mae:2.13083	valid-mae:2.22327


[400]	train-mae:1.92015	valid-mae:2.08051


[600]	train-mae:1.78791	valid-mae:2.00767


[800]	train-mae:1.68517	valid-mae:1.95862


[1000]	train-mae:1.60331	valid-mae:1.92589


[1200]	train-mae:1.53279	valid-mae:1.90148


[1400]	train-mae:1.46790	valid-mae:1.87928


[1600]	train-mae:1.41108	valid-mae:1.86268


[1800]	train-mae:1.35870	valid-mae:1.84924


[2000]	train-mae:1.30999	valid-mae:1.83764


[2199]	train-mae:1.26505	valid-mae:1.82785


  Fold 0: n_tr=510325 n_va=127587 | MAE=1.82785 | 65.3s


[0]	train-mae:11.65894	valid-mae:11.64864


[200]	train-mae:2.12741	valid-mae:2.23031


[400]	train-mae:1.91500	valid-mae:2.08798


[600]	train-mae:1.78728	valid-mae:2.01808


[800]	train-mae:1.68638	valid-mae:1.97064


[1000]	train-mae:1.60222	valid-mae:1.93669


[1200]	train-mae:1.53319	valid-mae:1.91336


[1400]	train-mae:1.46819	valid-mae:1.89202


[1600]	train-mae:1.41213	valid-mae:1.87657


[1800]	train-mae:1.35922	valid-mae:1.86405


[2000]	train-mae:1.31121	valid-mae:1.85157


[2199]	train-mae:1.26578	valid-mae:1.84090


  Fold 1: n_tr=510456 n_va=127456 | MAE=1.84090 | 71.1s


[0]	train-mae:11.67324	valid-mae:11.63263


[200]	train-mae:2.11848	valid-mae:2.20881


[400]	train-mae:1.91398	valid-mae:2.06730


[600]	train-mae:1.78226	valid-mae:1.99507


[800]	train-mae:1.68499	valid-mae:1.94925


[1000]	train-mae:1.60253	valid-mae:1.91703


[1200]	train-mae:1.53092	valid-mae:1.89158


[1400]	train-mae:1.46772	valid-mae:1.87058


[1600]	train-mae:1.41054	valid-mae:1.85368


[1800]	train-mae:1.35824	valid-mae:1.84037


[2000]	train-mae:1.30830	valid-mae:1.82812


[2199]	train-mae:1.26441	valid-mae:1.81855


  Fold 2: n_tr=509821 n_va=128091 | MAE=1.81855 | 64.1s


[0]	train-mae:11.63944	valid-mae:11.67909


[200]	train-mae:2.11993	valid-mae:2.22933


[400]	train-mae:1.91474	valid-mae:2.08872


[600]	train-mae:1.78477	valid-mae:2.01645


[800]	train-mae:1.68671	valid-mae:1.97204


[1000]	train-mae:1.60267	valid-mae:1.93778


[1200]	train-mae:1.53264	valid-mae:1.91291


[1400]	train-mae:1.47001	valid-mae:1.89300


[1600]	train-mae:1.41246	valid-mae:1.87670


[1800]	train-mae:1.35945	valid-mae:1.86254


[2000]	train-mae:1.31055	valid-mae:1.85001


[2199]	train-mae:1.26500	valid-mae:1.83911


  Fold 3: n_tr=510720 n_va=127192 | MAE=1.83911 | 63.9s


[0]	train-mae:11.66505	valid-mae:11.61285


[200]	train-mae:2.12301	valid-mae:2.20963


[400]	train-mae:1.91085	valid-mae:2.06457


[600]	train-mae:1.78239	valid-mae:1.99515


[800]	train-mae:1.68436	valid-mae:1.95041


[1000]	train-mae:1.60249	valid-mae:1.91779


[1200]	train-mae:1.53235	valid-mae:1.89524


[1400]	train-mae:1.47003	valid-mae:1.87585


[1600]	train-mae:1.41039	valid-mae:1.85759


[1800]	train-mae:1.35801	valid-mae:1.84358


[2000]	train-mae:1.30919	valid-mae:1.83142


[2199]	train-mae:1.26504	valid-mae:1.82218


  Fold 4: n_tr=510326 n_va=127586 | MAE=1.82216 | 67.0s


Type 1JHC: MAE=1.829696



Type 1JHN: n_train=39416 n_test=4264


[0]	train-mae:8.75381	valid-mae:8.76978


[200]	train-mae:0.51491	valid-mae:0.79317


[400]	train-mae:0.34934	valid-mae:0.74885


[600]	train-mae:0.25178	valid-mae:0.73251


[800]	train-mae:0.18588	valid-mae:0.72343


[1000]	train-mae:0.13917	valid-mae:0.71778


[1200]	train-mae:0.10470	valid-mae:0.71428


[1400]	train-mae:0.08044	valid-mae:0.71206


[1600]	train-mae:0.06151	valid-mae:0.71071


[1800]	train-mae:0.04742	valid-mae:0.70947


[2000]	train-mae:0.03684	valid-mae:0.70868


[2199]	train-mae:0.02873	valid-mae:0.70834


  Fold 0: n_tr=31650 n_va=7766 | MAE=0.70833 | 18.0s


[0]	train-mae:8.74650	valid-mae:8.79463


[200]	train-mae:0.51326	valid-mae:0.77433


[400]	train-mae:0.35104	valid-mae:0.73704


[600]	train-mae:0.25092	valid-mae:0.71836


[800]	train-mae:0.18382	valid-mae:0.71114


[1000]	train-mae:0.13545	valid-mae:0.70663


[1200]	train-mae:0.10278	valid-mae:0.70342


[1400]	train-mae:0.07851	valid-mae:0.70167


[1600]	train-mae:0.06024	valid-mae:0.70023


[1800]	train-mae:0.04661	valid-mae:0.69970


[2000]	train-mae:0.03600	valid-mae:0.69920


[2199]	train-mae:0.02807	valid-mae:0.69871


  Fold 1: n_tr=31500 n_va=7916 | MAE=0.69871 | 18.2s


[0]	train-mae:8.76461	valid-mae:8.72097


[200]	train-mae:0.51740	valid-mae:0.77270


[400]	train-mae:0.35328	valid-mae:0.73392


[600]	train-mae:0.25231	valid-mae:0.71796


[800]	train-mae:0.18592	valid-mae:0.70943


[1000]	train-mae:0.13862	valid-mae:0.70381


[1200]	train-mae:0.10426	valid-mae:0.70131


[1400]	train-mae:0.07900	valid-mae:0.69943


[1600]	train-mae:0.06084	valid-mae:0.69806


[1800]	train-mae:0.04713	valid-mae:0.69742


[2000]	train-mae:0.03646	valid-mae:0.69681


[2199]	train-mae:0.02849	valid-mae:0.69650


  Fold 2: n_tr=31593 n_va=7823 | MAE=0.69647 | 17.9s


[0]	train-mae:8.75873	valid-mae:8.75150


[200]	train-mae:0.51251	valid-mae:0.77485


[400]	train-mae:0.34846	valid-mae:0.73368


[600]	train-mae:0.24984	valid-mae:0.71445


[800]	train-mae:0.18470	valid-mae:0.70672


[1000]	train-mae:0.13744	valid-mae:0.70156


[1200]	train-mae:0.10402	valid-mae:0.69883


[1400]	train-mae:0.07935	valid-mae:0.69684


[1600]	train-mae:0.06070	valid-mae:0.69515


[1800]	train-mae:0.04666	valid-mae:0.69423


[2000]	train-mae:0.03620	valid-mae:0.69381


[2199]	train-mae:0.02817	valid-mae:0.69336


  Fold 3: n_tr=31276 n_va=8140 | MAE=0.69336 | 17.0s


[0]	train-mae:8.75854	valid-mae:8.75525


[200]	train-mae:0.51514	valid-mae:0.78308


[400]	train-mae:0.34894	valid-mae:0.74457


[600]	train-mae:0.25031	valid-mae:0.72917


[800]	train-mae:0.18233	valid-mae:0.72097


[1000]	train-mae:0.13697	valid-mae:0.71655


[1200]	train-mae:0.10320	valid-mae:0.71314


[1400]	train-mae:0.07894	valid-mae:0.71143


[1600]	train-mae:0.06062	valid-mae:0.71029


[1800]	train-mae:0.04651	valid-mae:0.70939


[2000]	train-mae:0.03586	valid-mae:0.70881


[2199]	train-mae:0.02793	valid-mae:0.70830


  Fold 4: n_tr=31645 n_va=7771 | MAE=0.70829 | 17.9s


Type 1JHN: MAE=0.700947



Type 2JHC: n_train=1026379 n_test=114488


[0]	train-mae:2.56025	valid-mae:2.56744


[200]	train-mae:0.85080	valid-mae:0.87910


[400]	train-mae:0.74888	valid-mae:0.79199


[600]	train-mae:0.68921	valid-mae:0.74577


[800]	train-mae:0.64905	valid-mae:0.71812


[1000]	train-mae:0.61627	valid-mae:0.69764


[1200]	train-mae:0.58966	valid-mae:0.68239


[1400]	train-mae:0.56699	valid-mae:0.67063


[1600]	train-mae:0.54682	valid-mae:0.66058


[1800]	train-mae:0.52848	valid-mae:0.65232


[2000]	train-mae:0.51194	valid-mae:0.64514


[2199]	train-mae:0.49702	valid-mae:0.63938


  Fold 0: n_tr=820505 n_va=205874 | MAE=0.63938 | 94.7s


[0]	train-mae:2.55836	valid-mae:2.56853


[200]	train-mae:0.85420	valid-mae:0.88238


[400]	train-mae:0.75157	valid-mae:0.79576


[600]	train-mae:0.69008	valid-mae:0.74881


[800]	train-mae:0.64726	valid-mae:0.71866


[1000]	train-mae:0.61679	valid-mae:0.70055


[1200]	train-mae:0.58915	valid-mae:0.68463


[1400]	train-mae:0.56621	valid-mae:0.67275


[1600]	train-mae:0.54657	valid-mae:0.66362


[1800]	train-mae:0.52752	valid-mae:0.65431


[2000]	train-mae:0.51108	valid-mae:0.64723


[2199]	train-mae:0.49563	valid-mae:0.64096


  Fold 1: n_tr=820790 n_va=205589 | MAE=0.64096 | 90.9s


[0]	train-mae:2.56601	valid-mae:2.55257


[200]	train-mae:0.85745	valid-mae:0.87337


[400]	train-mae:0.75232	valid-mae:0.78609


[600]	train-mae:0.69357	valid-mae:0.74174


[800]	train-mae:0.65305	valid-mae:0.71440


[1000]	train-mae:0.62049	valid-mae:0.69374


[1200]	train-mae:0.59339	valid-mae:0.67836


[1400]	train-mae:0.57007	valid-mae:0.66588


[1600]	train-mae:0.54875	valid-mae:0.65495


[1800]	train-mae:0.53076	valid-mae:0.64685


[2000]	train-mae:0.51422	valid-mae:0.63982


[2199]	train-mae:0.49896	valid-mae:0.63385


  Fold 2: n_tr=820940 n_va=205439 | MAE=0.63385 | 92.8s


[0]	train-mae:2.56151	valid-mae:2.55715


[200]	train-mae:0.85375	valid-mae:0.88093


[400]	train-mae:0.75320	valid-mae:0.79708


[600]	train-mae:0.69300	valid-mae:0.75066


[800]	train-mae:0.65274	valid-mae:0.72324


[1000]	train-mae:0.61899	valid-mae:0.70122


[1200]	train-mae:0.59202	valid-mae:0.68531


[1400]	train-mae:0.56842	valid-mae:0.67219


[1600]	train-mae:0.54896	valid-mae:0.66275


[1800]	train-mae:0.53017	valid-mae:0.65374


[2000]	train-mae:0.51359	valid-mae:0.64657


[2199]	train-mae:0.49870	valid-mae:0.64060


  Fold 3: n_tr=821450 n_va=204929 | MAE=0.64060 | 98.8s


[0]	train-mae:2.56270	valid-mae:2.56313


[200]	train-mae:0.85443	valid-mae:0.88004


[400]	train-mae:0.74851	valid-mae:0.78983


[600]	train-mae:0.69021	valid-mae:0.74484


[800]	train-mae:0.64837	valid-mae:0.71612


[1000]	train-mae:0.61715	valid-mae:0.69690


[1200]	train-mae:0.58992	valid-mae:0.68124


[1400]	train-mae:0.56634	valid-mae:0.66837


[1600]	train-mae:0.54669	valid-mae:0.65875


[1800]	train-mae:0.52801	valid-mae:0.64965


[2000]	train-mae:0.51140	valid-mae:0.64239


[2199]	train-mae:0.49560	valid-mae:0.63579


  Fold 4: n_tr=821831 n_va=204548 | MAE=0.63579 | 99.3s


Type 2JHC: MAE=0.638120



Type 2JHH: n_train=340097 n_test=37891


[0]	train-mae:2.43523	valid-mae:2.42305


[200]	train-mae:0.41139	valid-mae:0.45022


[400]	train-mae:0.35751	valid-mae:0.41897


[600]	train-mae:0.32320	valid-mae:0.40399


[800]	train-mae:0.29675	valid-mae:0.39420


[1000]	train-mae:0.27472	valid-mae:0.38726


[1200]	train-mae:0.25501	valid-mae:0.38115


[1400]	train-mae:0.23830	valid-mae:0.37671


[1600]	train-mae:0.22360	valid-mae:0.37326


[1800]	train-mae:0.21032	valid-mae:0.37077


[2000]	train-mae:0.19799	valid-mae:0.36830


[2199]	train-mae:0.18716	valid-mae:0.36660


  Fold 0: n_tr=272200 n_va=67897 | MAE=0.36660 | 43.1s


[0]	train-mae:2.43139	valid-mae:2.42193


[200]	train-mae:0.41084	valid-mae:0.44892


[400]	train-mae:0.35802	valid-mae:0.41813


[600]	train-mae:0.32354	valid-mae:0.40285


[800]	train-mae:0.29601	valid-mae:0.39218


[1000]	train-mae:0.27383	valid-mae:0.38498


[1200]	train-mae:0.25488	valid-mae:0.37959


[1400]	train-mae:0.23823	valid-mae:0.37555


[1600]	train-mae:0.22339	valid-mae:0.37241


[1800]	train-mae:0.21018	valid-mae:0.36989


[2000]	train-mae:0.19780	valid-mae:0.36785


[2199]	train-mae:0.18684	valid-mae:0.36612


  Fold 1: n_tr=272317 n_va=67780 | MAE=0.36612 | 45.4s


[0]	train-mae:2.42923	valid-mae:2.43827


[200]	train-mae:0.41112	valid-mae:0.44021


[400]	train-mae:0.35792	valid-mae:0.40994


[600]	train-mae:0.32343	valid-mae:0.39527


[800]	train-mae:0.29673	valid-mae:0.38605


[1000]	train-mae:0.27410	valid-mae:0.37852


[1200]	train-mae:0.25541	valid-mae:0.37356


[1400]	train-mae:0.23866	valid-mae:0.36979


[1600]	train-mae:0.22388	valid-mae:0.36672


[1800]	train-mae:0.21035	valid-mae:0.36397


[2000]	train-mae:0.19809	valid-mae:0.36201


[2199]	train-mae:0.18700	valid-mae:0.36012


  Fold 2: n_tr=271519 n_va=68578 | MAE=0.36012 | 42.5s


[0]	train-mae:2.42362	valid-mae:2.43802


[200]	train-mae:0.40999	valid-mae:0.44975


[400]	train-mae:0.35776	valid-mae:0.42067


[600]	train-mae:0.32154	valid-mae:0.40453


[800]	train-mae:0.29442	valid-mae:0.39481


[1000]	train-mae:0.27133	valid-mae:0.38777


[1200]	train-mae:0.25265	valid-mae:0.38302


[1400]	train-mae:0.23604	valid-mae:0.37878


[1600]	train-mae:0.22161	valid-mae:0.37578


[1800]	train-mae:0.20842	valid-mae:0.37332


[2000]	train-mae:0.19638	valid-mae:0.37102


[2199]	train-mae:0.18548	valid-mae:0.36920


  Fold 3: n_tr=272403 n_va=67694 | MAE=0.36920 | 42.4s


[0]	train-mae:2.43411	valid-mae:2.43285


[200]	train-mae:0.40951	valid-mae:0.44748


[400]	train-mae:0.35640	valid-mae:0.41738


[600]	train-mae:0.32095	valid-mae:0.40186


[800]	train-mae:0.29437	valid-mae:0.39214


[1000]	train-mae:0.27218	valid-mae:0.38530


[1200]	train-mae:0.25280	valid-mae:0.38007


[1400]	train-mae:0.23632	valid-mae:0.37650


[1600]	train-mae:0.22184	valid-mae:0.37373


[1800]	train-mae:0.20857	valid-mae:0.37116


[2000]	train-mae:0.19648	valid-mae:0.36895


[2199]	train-mae:0.18556	valid-mae:0.36730


  Fold 4: n_tr=271949 n_va=68148 | MAE=0.36730 | 42.7s


Type 2JHH: MAE=0.365854



Type 2JHN: n_train=107091 n_test=11968


[0]	train-mae:2.70604	valid-mae:2.70627


[200]	train-mae:0.29756	valid-mae:0.36063


[400]	train-mae:0.23348	valid-mae:0.33019


[600]	train-mae:0.19228	valid-mae:0.31682


[800]	train-mae:0.16288	valid-mae:0.30913


[1000]	train-mae:0.13957	valid-mae:0.30425


[1200]	train-mae:0.12090	valid-mae:0.30076


[1400]	train-mae:0.10520	valid-mae:0.29785


[1600]	train-mae:0.09208	valid-mae:0.29615


[1800]	train-mae:0.08087	valid-mae:0.29463


[2000]	train-mae:0.07130	valid-mae:0.29332


[2199]	train-mae:0.06306	valid-mae:0.29244


  Fold 0: n_tr=86036 n_va=21055 | MAE=0.29243 | 24.2s


[0]	train-mae:2.70776	valid-mae:2.68996


[200]	train-mae:0.30005	valid-mae:0.36195


[400]	train-mae:0.23428	valid-mae:0.33072


[600]	train-mae:0.19173	valid-mae:0.31586


[800]	train-mae:0.16174	valid-mae:0.30868


[1000]	train-mae:0.13810	valid-mae:0.30385


[1200]	train-mae:0.11912	valid-mae:0.30016


[1400]	train-mae:0.10320	valid-mae:0.29754


[1600]	train-mae:0.09015	valid-mae:0.29552


[1800]	train-mae:0.07927	valid-mae:0.29398


[2000]	train-mae:0.06996	valid-mae:0.29291


[2199]	train-mae:0.06175	valid-mae:0.29183


  Fold 1: n_tr=85812 n_va=21279 | MAE=0.29183 | 24.6s


[0]	train-mae:2.70424	valid-mae:2.71948


[200]	train-mae:0.30455	valid-mae:0.36435


[400]	train-mae:0.23594	valid-mae:0.33007


[600]	train-mae:0.19422	valid-mae:0.31553


[800]	train-mae:0.16263	valid-mae:0.30600


[1000]	train-mae:0.13923	valid-mae:0.30096


[1200]	train-mae:0.12040	valid-mae:0.29723


[1400]	train-mae:0.10497	valid-mae:0.29505


[1600]	train-mae:0.09173	valid-mae:0.29302


[1800]	train-mae:0.08048	valid-mae:0.29187


[2000]	train-mae:0.07093	valid-mae:0.29066


[2199]	train-mae:0.06271	valid-mae:0.28981


  Fold 2: n_tr=85603 n_va=21488 | MAE=0.28981 | 25.3s


[0]	train-mae:2.70841	valid-mae:2.70839


[200]	train-mae:0.29942	valid-mae:0.36196


[400]	train-mae:0.23022	valid-mae:0.32694


[600]	train-mae:0.18917	valid-mae:0.31292


[800]	train-mae:0.15929	valid-mae:0.30432


[1000]	train-mae:0.13635	valid-mae:0.29931


[1200]	train-mae:0.11807	valid-mae:0.29587


[1400]	train-mae:0.10282	valid-mae:0.29328


[1600]	train-mae:0.08974	valid-mae:0.29127


[1800]	train-mae:0.07895	valid-mae:0.28979


[2000]	train-mae:0.06964	valid-mae:0.28871


[2199]	train-mae:0.06174	valid-mae:0.28793


  Fold 3: n_tr=85301 n_va=21790 | MAE=0.28793 | 24.1s


[0]	train-mae:2.70553	valid-mae:2.71243


[200]	train-mae:0.30061	valid-mae:0.36955


[400]	train-mae:0.23324	valid-mae:0.33523


[600]	train-mae:0.19281	valid-mae:0.32256


[800]	train-mae:0.16246	valid-mae:0.31429


[1000]	train-mae:0.13858	valid-mae:0.30881


[1200]	train-mae:0.11957	valid-mae:0.30523


[1400]	train-mae:0.10371	valid-mae:0.30261


[1600]	train-mae:0.09047	valid-mae:0.30043


[1800]	train-mae:0.07946	valid-mae:0.29882


[2000]	train-mae:0.06994	valid-mae:0.29773


[2199]	train-mae:0.06179	valid-mae:0.29674


  Fold 4: n_tr=85612 n_va=21479 | MAE=0.29674 | 24.4s


Type 2JHN: MAE=0.291732



Type 3JHC: n_train=1359077 n_test=152130


[0]	train-mae:2.31710	valid-mae:2.31956


[200]	train-mae:0.80764	valid-mae:0.83069


[400]	train-mae:0.72722	valid-mae:0.76169


[600]	train-mae:0.67400	valid-mae:0.71854


[800]	train-mae:0.63755	valid-mae:0.69130


[1000]	train-mae:0.60907	valid-mae:0.67156


[1200]	train-mae:0.58565	valid-mae:0.65644


[1400]	train-mae:0.56510	valid-mae:0.64382


[1600]	train-mae:0.54718	valid-mae:0.63364


[1800]	train-mae:0.53136	valid-mae:0.62543


[2000]	train-mae:0.51669	valid-mae:0.61784


[2199]	train-mae:0.50368	valid-mae:0.61183


  Fold 0: n_tr=1086760 n_va=272317 | MAE=0.61183 | 118.8s


[0]	train-mae:2.31542	valid-mae:2.32810


[200]	train-mae:0.81375	valid-mae:0.83272


[400]	train-mae:0.72170	valid-mae:0.75255


[600]	train-mae:0.67301	valid-mae:0.71435


[800]	train-mae:0.63705	valid-mae:0.68813


[1000]	train-mae:0.60866	valid-mae:0.66868


[1200]	train-mae:0.58357	valid-mae:0.65209


[1400]	train-mae:0.56445	valid-mae:0.64114


[1600]	train-mae:0.54716	valid-mae:0.63176


[1800]	train-mae:0.53193	valid-mae:0.62395


[2000]	train-mae:0.51686	valid-mae:0.61603


[2199]	train-mae:0.50436	valid-mae:0.61060


  Fold 1: n_tr=1086685 n_va=272392 | MAE=0.61060 | 118.5s


[0]	train-mae:2.31876	valid-mae:2.31690


[200]	train-mae:0.80753	valid-mae:0.81491


[400]	train-mae:0.72510	valid-mae:0.74540


[600]	train-mae:0.67378	valid-mae:0.70519


[800]	train-mae:0.63739	valid-mae:0.67887


[1000]	train-mae:0.60927	valid-mae:0.66038


[1200]	train-mae:0.58587	valid-mae:0.64567


[1400]	train-mae:0.56493	valid-mae:0.63302


[1600]	train-mae:0.54835	valid-mae:0.62440


[1800]	train-mae:0.53204	valid-mae:0.61567


[2000]	train-mae:0.51719	valid-mae:0.60821


[2199]	train-mae:0.50415	valid-mae:0.60224


  Fold 2: n_tr=1086834 n_va=272243 | MAE=0.60224 | 120.3s


[0]	train-mae:2.31892	valid-mae:2.31528


[200]	train-mae:0.80598	valid-mae:0.82411


[400]	train-mae:0.72185	valid-mae:0.75208


[600]	train-mae:0.67533	valid-mae:0.71604


[800]	train-mae:0.63730	valid-mae:0.68720


[1000]	train-mae:0.60968	valid-mae:0.66838


[1200]	train-mae:0.58493	valid-mae:0.65231


[1400]	train-mae:0.56431	valid-mae:0.63985


[1600]	train-mae:0.54717	valid-mae:0.63067


[1800]	train-mae:0.53120	valid-mae:0.62212


[2000]	train-mae:0.51649	valid-mae:0.61492


[2199]	train-mae:0.50381	valid-mae:0.60947


  Fold 3: n_tr=1088563 n_va=270514 | MAE=0.60947 | 113.2s


[0]	train-mae:2.32018	valid-mae:2.31119


[200]	train-mae:0.80316	valid-mae:0.82760


KeyboardInterrupt: 

In [27]:
# FE v1: path_len + neighbor features + molecule-level joins (potential_energy, dipole)
import numpy as np, pandas as pd, time

R_COV = {1:0.31, 6:0.76, 7:0.71, 8:0.66, 9:0.57}  # H,C,N,O,F
BOND_SCALE = 1.15

def _bfs_depth_limited(adj: np.ndarray, src: int, max_depth: int = 4) -> np.ndarray:
    n = adj.shape[0]
    dist = np.full(n, -1, dtype=np.int16)
    q = [src]
    dist[src] = 0
    head = 0
    while head < len(q):
        u = q[head]; head += 1
        du = int(dist[u])
        if du >= max_depth:
            continue
        nbrs = np.nonzero(adj[u])[0]
        for v in nbrs:
            if dist[v] == -1:
                dist[v] = du + 1
                if dist[v] < max_depth:
                    q.append(int(v))
    return dist

def _build_molecule_cache(structures: pd.DataFrame, mol_names: np.ndarray):
    t0 = time.time()
    cache = {}
    s = structures[['molecule_name','atom_index','x','y','z','atom']].copy()
    s['Z'] = s['atom'].map({'H':1,'C':6,'N':7,'O':8,'F':9}).astype('int16')
    grp = s.groupby('molecule_name')
    found = 0
    for m in mol_names:
        if m not in grp.groups:
            continue
        dfm = grp.get_group(m).sort_values('atom_index')
        coords = dfm[['x','y','z']].to_numpy(dtype=np.float32)
        Z = dfm['Z'].to_numpy(dtype=np.int16)
        cache[m] = (coords, Z)
        found += 1
    print(f'Built molecule cache for {found}/{len(mol_names)} molecules in {time.time()-t0:.1f}s', flush=True)
    return cache

def _compute_per_molecule_features_targeted(coords, Z, rows, max_depth: int = 4):
    # rows: per-molecule slice of `pairs` with columns atom_index_0, atom_index_1, _src and _idx
    n = coords.shape[0]
    # pairwise distances (for nearest neighbor + angle proxies)
    diff = coords[:, None, :] - coords[None, :, :]
    D = np.sqrt(np.sum(diff*diff, axis=2, dtype=np.float32)).astype(np.float32)
    D_no_self = D + np.eye(n, dtype=np.float32)*1e9
    nn_idx = np.argmin(D_no_self, axis=1).astype(np.int32)
    nn_dist = D_no_self[np.arange(n), nn_idx].astype(np.float32)
    # adjacency by covalent radii
    rc = np.vectorize(lambda z: R_COV.get(int(z), 0.7), otypes=[np.float32])(Z).astype(np.float32)
    thr = (rc[:, None] + rc[None, :]).astype(np.float32) * BOND_SCALE
    adj = (D > 0) & (D < thr)
    adj = adj.astype(np.uint8)
    np.fill_diagonal(adj, 0)
    # Prepare outputs
    a0 = rows['atom_index_0'].to_numpy(dtype=np.int32)
    a1 = rows['atom_index_1'].to_numpy(dtype=np.int32)
    pl = np.full(len(rows), -1, dtype=np.int16)
    # Targeted BFS: for each unique source among a0, compute dist once (depth-limited)
    unique_srcs = np.unique(a0)
    for src in unique_srcs:
        if src < 0 or src >= n:
            continue
        dist = _bfs_depth_limited(adj, int(src), max_depth=max_depth)
        mask = (a0 == src)
        targets = a1[mask]
        # Safe lookup
        valid = (targets >= 0) & (targets < n)
        if valid.any():
            vals = dist[targets[valid]]
            idxs = np.flatnonzero(mask)[valid]
            pl[idxs] = vals.astype(np.int16, copy=False)
    # inv_path and flags; pairs with no path within max_depth (pl == -1) are clipped to 0, so inv_path is 1.0 for them
    pl_clip = np.where(pl < 0, 0, pl).astype(np.float32)
    inv_path = (1.0/(1.0+pl_clip)).astype(np.float32)
    is_bonded = (pl == 1).astype(np.int8)
    # nearest-neighbor distances for endpoints
    min_nb_d0 = nn_dist[a0].astype(np.float32)
    min_nb_d1 = nn_dist[a1].astype(np.float32)
    # angle proxies using nearest neighbor at each end
    eps = 1e-8
    nb0 = nn_idx[a0]
    nb1 = nn_idx[a1]
    v0_nb = coords[nb0] - coords[a0]
    v0_1  = coords[a1] - coords[a0]
    v1_nb = coords[nb1] - coords[a1]
    v1_0  = coords[a0] - coords[a1]
    def _cos(u, v):
        nu = np.linalg.norm(u, axis=1) + eps
        nv = np.linalg.norm(v, axis=1) + eps
        return (np.sum(u*v, axis=1)/ (nu*nv)).astype(np.float32)
    cos0 = _cos(v0_nb, v0_1)
    cos1 = _cos(v1_nb, v1_0)
    return {
        'path_len': pl,
        'inv_path': inv_path,
        'is_bonded': is_bonded,
        'min_nb_d0': min_nb_d0,
        'min_nb_d1': min_nb_d1,
        'cos0': cos0,
        'cos1': cos1,
    }

def add_graph_and_molecule_features(X_train: pd.DataFrame, X_test: pd.DataFrame, structures: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    t0 = time.time()
    pairs = pd.concat([
        X_train[['molecule_name','atom_index_0','atom_index_1']].assign(_src='train', _idx=np.arange(len(X_train))),
        X_test[['molecule_name','atom_index_0','atom_index_1']].assign(_src='test',  _idx=np.arange(len(X_test)))
    ], ignore_index=True)
    mols = pairs['molecule_name'].unique()
    cache = _build_molecule_cache(structures, mols)
    # Prepare output containers
    out_cols = ['path_len','inv_path','is_bonded','min_nb_d0','min_nb_d1','cos0','cos1']
    add_train = {c: np.zeros(len(X_train), dtype=np.float32) for c in out_cols}
    add_test  = {c: np.zeros(len(X_test), dtype=np.float32) for c in out_cols}
    add_train['path_len'] = np.full(len(X_train), -1, dtype=np.int16)
    add_test['path_len']  = np.full(len(X_test), -1, dtype=np.int16)
    add_train['is_bonded']= np.zeros(len(X_train), dtype=np.int8)
    add_test['is_bonded'] = np.zeros(len(X_test), dtype=np.int8)
    # Iterate per molecule
    g = pairs.groupby('molecule_name', sort=False)
    processed = 0
    for m, rows in g:
        if m not in cache:
            continue
        coords, Z = cache[m]
        feats = _compute_per_molecule_features_targeted(coords, Z, rows, max_depth=4)
        train_mask = rows['_src'].values == 'train'
        test_mask  = ~train_mask
        idx_tr = rows.loc[train_mask, '_idx'].to_numpy(dtype=np.int64)
        idx_te = rows.loc[test_mask, '_idx'].to_numpy(dtype=np.int64)
        for c in out_cols:
            vals = feats[c]
            if c == 'path_len':
                add_train[c][idx_tr] = vals[train_mask].astype(np.int16, copy=False)
                add_test[c][idx_te]  = vals[test_mask].astype(np.int16, copy=False)
            elif c == 'is_bonded':
                add_train[c][idx_tr] = vals[train_mask].astype(np.int8, copy=False)
                add_test[c][idx_te]  = vals[test_mask].astype(np.int8, copy=False)
            else:
                add_train[c][idx_tr] = vals[train_mask].astype(np.float32, copy=False)
                add_test[c][idx_te]  = vals[test_mask].astype(np.float32, copy=False)
        processed += 1
        if processed % 1000 == 0:
            print(f'  processed {processed} molecules...', flush=True)
    # Assign back to dataframes
    for c in out_cols:
        X_train[c] = add_train[c]
        X_test[c]  = add_test[c]
    # Molecule-level joins: potential_energy, dipole (with magnitude)
    pe = pd.read_csv('potential_energy.csv')[['molecule_name','potential_energy']].copy()
    dm = pd.read_csv('dipole_moments.csv')[['molecule_name','X','Y','Z']].copy()
    dm['dipole_mag'] = np.sqrt((dm[['X','Y','Z']].astype(np.float64)**2).sum(axis=1)).astype(np.float32)
    X_train = X_train.merge(pe, on='molecule_name', how='left', copy=False)
    X_test  = X_test.merge(pe, on='molecule_name', how='left', copy=False)
    X_train = X_train.merge(dm[['molecule_name','X','Y','Z','dipole_mag']].rename(columns={'X':'dipole_x','Y':'dipole_y','Z':'dipole_z'}), on='molecule_name', how='left', copy=False)
    X_test  = X_test.merge(dm[['molecule_name','X','Y','Z','dipole_mag']].rename(columns={'X':'dipole_x','Y':'dipole_y','Z':'dipole_z'}), on='molecule_name', how='left', copy=False)
    # Fill NaNs
    for c in ['potential_energy','dipole_x','dipole_y','dipole_z','dipole_mag'] + out_cols:
        if c in X_train:
            if X_train[c].dtype.kind in 'iu':
                X_train[c] = X_train[c].fillna(0)
            else:
                X_train[c] = X_train[c].astype('float32').fillna(X_train[c].mean())
        if c in X_test:
            if X_test[c].dtype.kind in 'iu':
                X_test[c] = X_test[c].fillna(0)
            else:
                X_test[c] = X_test[c].astype('float32').fillna(X_train[c].mean())
    print(f'Added graph + molecule features in {(time.time()-t0)/60:.1f} min', flush=True)
    return X_train, X_test

# Execute FE v1 and update global X_train/X_test
X_train, X_test = add_graph_and_molecule_features(X_train, X_test, structures)
print('New columns added:', [c for c in ['path_len','inv_path','is_bonded','min_nb_d0','min_nb_d1','cos0','cos1','potential_energy','dipole_x','dipole_y','dipole_z','dipole_mag'] if c in X_train.columns], flush=True)

Built molecule cache for 76510/85012 molecules in 27.6s


  processed 1000 molecules...


  processed 2000 molecules...


  processed 3000 molecules...


  processed 4000 molecules...


  processed 5000 molecules...


  processed 6000 molecules...


  processed 7000 molecules...


  processed 8000 molecules...


  processed 9000 molecules...


  processed 10000 molecules...


  processed 11000 molecules...


  processed 12000 molecules...


  processed 13000 molecules...


  processed 14000 molecules...


  processed 15000 molecules...


  processed 16000 molecules...


  processed 17000 molecules...


  processed 18000 molecules...


  processed 19000 molecules...


  processed 20000 molecules...


  processed 21000 molecules...


  processed 22000 molecules...


  processed 23000 molecules...


  processed 24000 molecules...


  processed 25000 molecules...


  processed 26000 molecules...


  processed 27000 molecules...


  processed 28000 molecules...


  processed 29000 molecules...


  processed 30000 molecules...


  processed 31000 molecules...


  processed 32000 molecules...


  processed 33000 molecules...


  processed 34000 molecules...


  processed 35000 molecules...


  processed 36000 molecules...


  processed 37000 molecules...


  processed 38000 molecules...


  processed 39000 molecules...


  processed 40000 molecules...


  processed 41000 molecules...


  processed 42000 molecules...


  processed 43000 molecules...


  processed 44000 molecules...


  processed 45000 molecules...


  processed 46000 molecules...


  processed 47000 molecules...


  processed 48000 molecules...


  processed 49000 molecules...


  processed 50000 molecules...


  processed 51000 molecules...


  processed 52000 molecules...


  processed 53000 molecules...


  processed 54000 molecules...


  processed 55000 molecules...


  processed 56000 molecules...


  processed 57000 molecules...


  processed 58000 molecules...


  processed 59000 molecules...


  processed 60000 molecules...


  processed 61000 molecules...


  processed 62000 molecules...


  processed 63000 molecules...


  processed 64000 molecules...


  processed 65000 molecules...


  processed 66000 molecules...


  processed 67000 molecules...


  processed 68000 molecules...


  processed 69000 molecules...


  processed 70000 molecules...


  processed 71000 molecules...


  processed 72000 molecules...


  processed 73000 molecules...


  processed 74000 molecules...


  processed 75000 molecules...


  processed 76000 molecules...


Added graph + molecule features in 1.9 min


New columns added: ['path_len', 'inv_path', 'is_bonded', 'min_nb_d0', 'min_nb_d1', 'cos0', 'cos1']
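
The `path_len` feature above comes from two steps: bonds are inferred wherever the interatomic distance is below the scaled sum of covalent radii, and a breadth-first search then counts bonds between the two atoms. A minimal sketch of those two steps on an idealized methane geometry follows. The coordinates are made up for illustration, the helper names `bond_adjacency` and `path_lengths` are hypothetical, and the BFS here is unbounded rather than depth-limited like the one in the cell.

```python
import numpy as np
from collections import deque

R_COV = {1: 0.31, 6: 0.76, 7: 0.71, 8: 0.66, 9: 0.57}  # H, C, N, O, F (same table as the cell above)
BOND_SCALE = 1.15

def bond_adjacency(coords, Z):
    """Boolean adjacency: bonded when 0 < distance < BOND_SCALE * (r_i + r_j)."""
    diff = coords[:, None, :] - coords[None, :, :]
    D = np.sqrt((diff * diff).sum(axis=2))
    rc = np.array([R_COV[int(z)] for z in Z])
    thr = (rc[:, None] + rc[None, :]) * BOND_SCALE
    return (D > 0) & (D < thr)

def path_lengths(adj, src):
    """Number of bonds from src to every atom (-1 if unreachable)."""
    dist = np.full(adj.shape[0], -1, dtype=int)
    dist[src] = 0
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in np.nonzero(adj[u])[0]:
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                queue.append(int(v))
    return dist

# Idealized methane: C at the origin, four H atoms 1.09 Angstrom away (illustrative values)
h = 1.09 / np.sqrt(3.0)
coords = np.array([[0, 0, 0], [h, h, h], [h, -h, -h], [-h, h, -h], [-h, -h, h]], dtype=float)
Z = np.array([6, 1, 1, 1, 1])

adj = bond_adjacency(coords, Z)
dist_from_h1 = path_lengths(adj, 1)
print(dist_from_h1)  # [1 0 2 2 2]
```

The hydrogen at index 1 is one bond from the carbon and two bonds from every other hydrogen, which matches the 1J and 2J naming of the coupling types.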


In [12]:
# FE v2: Quantum (Mulliken + Shielding) + z-scores + high-ROI interactions and identity/normalization features
import pandas as pd, numpy as np, time

EN_MAP = {1:2.20, 6:2.55, 7:3.04, 8:3.44, 9:3.98}  # Pauling EN for H,C,N,O,F

def _per_molecule_stats(df_atoms: pd.DataFrame, value_col: str):
    g = df_atoms.groupby('molecule_name')[value_col]
    stats = g.agg(['mean','std']).rename(columns={'mean': f'{value_col}_mol_mean', 'std': f'{value_col}_mol_std'})
    stats[f'{value_col}_mol_std'] = (stats[f'{value_col}_mol_std'].astype('float32') + 1e-6).astype('float32')
    df_atoms = df_atoms.merge(stats.reset_index(), on='molecule_name', how='left')
    df_atoms[f'z_{value_col}'] = ((df_atoms[value_col].astype('float32') - df_atoms[f'{value_col}_mol_mean'].astype('float32')) / df_atoms[f'{value_col}_mol_std'].astype('float32')).astype('float32')
    return df_atoms[['molecule_name','atom_index', value_col, f'z_{value_col}']]

def add_quantum_and_interactions(X_train: pd.DataFrame, X_test: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    t0 = time.time()
    # 1) Mulliken charges with per-molecule z-scores
    m = pd.read_csv('mulliken_charges.csv', usecols=['molecule_name','atom_index','mulliken_charge'])
    m = _per_molecule_stats(m, 'mulliken_charge')
    m0 = m.rename(columns={'atom_index':'atom_index_0','mulliken_charge':'mulliken_0','z_mulliken_charge':'z_mulliken_0'})
    m1 = m.rename(columns={'atom_index':'atom_index_1','mulliken_charge':'mulliken_1','z_mulliken_charge':'z_mulliken_1'})
    X_train = X_train.merge(m0, on=['molecule_name','atom_index_0'], how='left')
    X_train = X_train.merge(m1, on=['molecule_name','atom_index_1'], how='left')
    X_test  = X_test.merge(m0, on=['molecule_name','atom_index_0'], how='left')
    X_test  = X_test.merge(m1, on=['molecule_name','atom_index_1'], how='left')

    # 2) Magnetic shielding tensors: isotropic + per-molecule z-scores
    s = pd.read_csv('magnetic_shielding_tensors.csv', usecols=['molecule_name','atom_index','XX','YY','ZZ'])
    s['shield_iso'] = ((s['XX'].astype('float32') + s['YY'].astype('float32') + s['ZZ'].astype('float32'))/3.0).astype('float32')
    s = s[['molecule_name','atom_index','shield_iso']]
    s = _per_molecule_stats(s.rename(columns={'shield_iso':'shield'}), 'shield')
    s = s.rename(columns={'shield':'shield_iso'})
    s0 = s.rename(columns={'atom_index':'atom_index_0','shield_iso':'shield_iso_0','z_shield':'z_shield_0'})
    s1 = s.rename(columns={'atom_index':'atom_index_1','shield_iso':'shield_iso_1','z_shield':'z_shield_1'})
    X_train = X_train.merge(s0, on=['molecule_name','atom_index_0'], how='left')
    X_train = X_train.merge(s1, on=['molecule_name','atom_index_1'], how='left')
    X_test  = X_test.merge(s0, on=['molecule_name','atom_index_0'], how='left')
    X_test  = X_test.merge(s1, on=['molecule_name','atom_index_1'], how='left')

    # 3) Derived quantum features and interactions with geometry
    for df in (X_train, X_test):
        # Base diffs/sums/prods
        df['mulliken_diff'] = (df['mulliken_0'] - df['mulliken_1']).astype('float32')
        df['mulliken_abs_diff'] = df['mulliken_diff'].abs().astype('float32')
        df['mulliken_sum'] = (df['mulliken_0'] + df['mulliken_1']).astype('float32')
        df['mulliken_prod'] = (df['mulliken_0'] * df['mulliken_1']).astype('float32')
        df['shield_diff'] = (df['shield_iso_0'] - df['shield_iso_1']).astype('float32')
        df['shield_abs_diff'] = df['shield_diff'].abs().astype('float32')
        df['shield_sum'] = (df['shield_iso_0'] + df['shield_iso_1']).astype('float32')
        df['shield_prod'] = (df['shield_iso_0'] * df['shield_iso_1']).astype('float32')
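        # Geometry interactions (d == 0 is already guarded inside inv_d).
        # Note: each *_over_d column is computed exactly like its *_x_inv_d twin, so the pairs are duplicates.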
        df['mulliken_diff_over_d'] = (df['mulliken_diff'] * df['inv_d']).astype('float32')
        df['mulliken_diff_x_inv_d'] = (df['mulliken_diff'] * df['inv_d']).astype('float32')
        df['shield_diff_over_d'] = (df['shield_diff'] * df['inv_d']).astype('float32')
        df['shield_diff_x_inv_d'] = (df['shield_diff'] * df['inv_d']).astype('float32')

    # 4) Pair identity and chemistry hints
    for df in (X_train, X_test):
        # Element pair ids
        df['element_pair_id'] = (10*df['Z0'].astype('int16') + df['Z1'].astype('int16')).astype('int16')
        zmin = np.minimum(df['Z0'].astype('int16'), df['Z1'].astype('int16'))
        zmax = np.maximum(df['Z0'].astype('int16'), df['Z1'].astype('int16'))
        df['element_pair_id_sorted'] = (10*zmin + zmax).astype('int16')
        # Electronegativity features
        df['EN0'] = df['Z0'].map(EN_MAP).astype('float32')
        df['EN1'] = df['Z1'].map(EN_MAP).astype('float32')
        df['EN_diff'] = (df['EN0'] - df['EN1']).astype('float32')
        df['EN_abs_diff'] = df['EN_diff'].abs().astype('float32')

    # 5) Path buckets and distance-path interactions
    for df in (X_train, X_test):
        df['path_len_bucket'] = np.where(df['path_len'] <= 1, 1, np.where(df['path_len'] == 2, 2, np.where(df['path_len'] == 3, 3, 4))).astype('int8')
        df['path_le2'] = (df['path_len'] <= 2).astype('int8')
        df['d_x_inv_path'] = (df['d'].astype('float32') * df['inv_path'].astype('float32')).astype('float32')
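        # path_len == -1 (no path found) makes the denominator 0 and yields inf; sanitize_inf in the CatBoost cell maps it to NaN
        df['d_over_1p_path'] = (df['d'].astype('float32') / (1.0 + df['path_len'].astype('float32'))).astype('float32')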
        df['is_bonded_x_inv_d'] = (df['is_bonded'].astype('float32') * df['inv_d'].astype('float32')).astype('float32')
        df['inv_d_x_path_le2'] = (df['inv_d'].astype('float32') * df['path_le2'].astype('float32')).astype('float32')
        df['cos0_x_inv_path'] = (df['cos0'].astype('float32') * df['inv_path'].astype('float32')).astype('float32')
        df['cos1_x_inv_path'] = (df['cos1'].astype('float32') * df['inv_path'].astype('float32')).astype('float32')
        df['min_nb_d0_x_inv_path'] = (df['min_nb_d0'].astype('float32') * df['inv_path'].astype('float32')).astype('float32')
        df['min_nb_d1_x_inv_path'] = (df['min_nb_d1'].astype('float32') * df['inv_path'].astype('float32')).astype('float32')

    # 6) Molecule normalization features
    # Compute mol-wise means on combined to avoid train/test drift
    combo = pd.concat([
        X_train[['molecule_name','min_nb_d0','min_nb_d1','potential_energy','n_atoms']].assign(_src='train'),
        X_test[['molecule_name','min_nb_d0','min_nb_d1','potential_energy','n_atoms']].assign(_src='test')
    ], ignore_index=True)
    combo['mean_nb_d_pair'] = combo[['min_nb_d0','min_nb_d1']].astype('float32').mean(axis=1).astype('float32')
    mol_stats = combo.groupby('molecule_name').agg(
        mol_mean_nb_d=('mean_nb_d_pair','mean'),
        mol_pe=('potential_energy','mean'),
        mol_n_atoms=('n_atoms','max')
    ).reset_index()
    # NOTE: `df = df.merge(...)` rebinds only the loop variable, so mol_stats and the three
    # ratio columns below are written to a temporary frame and never reach X_train/X_test.
    # The patch cell that follows this one adds them explicitly.
    for df in (X_train, X_test):
        df = df.merge(mol_stats, on='molecule_name', how='left')
        df['d_over_n_atoms'] = (df['d'].astype('float32') / (df['n_atoms'].replace(0, np.nan)).astype('float32')).fillna(0).astype('float32')
        df['pe_per_atom'] = (df['potential_energy'].astype('float32') / (df['n_atoms'].replace(0, np.nan)).astype('float32')).fillna(0).astype('float32')
        df['d_over_mol_mean_nb_d'] = (df['d'].astype('float32') / (df['mol_mean_nb_d'].replace(0, np.nan)).astype('float32')).fillna(0).astype('float32')
        # leftover cleanup; no '_merge_tag' column is created in this cell
        if '_merge_tag' in df.columns: df.drop(columns=['_merge_tag'], inplace=True)

    # 7) Mean pair distance per coupling type (computed on train+test combined for stability)
    combo2 = pd.concat([X_train[['type','d']], X_test[['type','d']]], ignore_index=True)
    type_mean_d = combo2.groupby('type')['d'].mean().astype('float32')
    for df in (X_train, X_test):
        df['expected_d_by_type'] = df['type'].map(type_mean_d).astype('float32')
        df['d_from_expected'] = (df['d'].astype('float32') - df['expected_d_by_type'].astype('float32')).astype('float32')

    # 8) Fill NaNs with train means for consistency and downcast
    new_cols = [
        'mulliken_0','mulliken_1','z_mulliken_0','z_mulliken_1',
        'shield_iso_0','shield_iso_1','z_shield_0','z_shield_1',
        'mulliken_diff','mulliken_abs_diff','mulliken_sum','mulliken_prod',
        'shield_diff','shield_abs_diff','shield_sum','shield_prod',
        'mulliken_diff_over_d','mulliken_diff_x_inv_d','shield_diff_over_d','shield_diff_x_inv_d',
        'element_pair_id','element_pair_id_sorted','EN0','EN1','EN_diff','EN_abs_diff',
        'path_len_bucket','path_le2','d_x_inv_path','d_over_1p_path','is_bonded_x_inv_d','inv_d_x_path_le2',
        'cos0_x_inv_path','cos1_x_inv_path','min_nb_d0_x_inv_path','min_nb_d1_x_inv_path',
        'd_over_n_atoms','pe_per_atom','d_over_mol_mean_nb_d','expected_d_by_type','d_from_expected'
    ]
    means = {}
    for c in new_cols:
        if c not in X_train.columns or c not in X_test.columns:
            continue
        if X_train[c].dtype.kind in 'iu':
            # categorical ids: fill with mode or 0
            if c in ('element_pair_id','element_pair_id_sorted','path_len_bucket','path_le2'):
                mode_val = X_train[c].mode(dropna=True)
                fillv = int(mode_val.iloc[0]) if len(mode_val) else 0
                X_train[c] = X_train[c].fillna(fillv).astype('int16')
                X_test[c]  = X_test[c].fillna(fillv).astype('int16')
            else:
                X_train[c] = X_train[c].fillna(0)
                X_test[c]  = X_test[c].fillna(0)
        else:
            means[c] = X_train[c].astype('float32').mean()
            X_train[c] = X_train[c].astype('float32').fillna(means[c])
            X_test[c]  = X_test[c].astype('float32').fillna(means[c])

    print(f'Added quantum + interactions in {(time.time()-t0)/60:.1f} min')
    return X_train, X_test

# To run next:
# X_train, X_test = add_quantum_and_interactions(X_train, X_test)
# print('Added columns sample:', [c for c in ['mulliken_0','z_mulliken_0','shield_iso_0','z_shield_0','element_pair_id','path_len_bucket','d_x_inv_path','mulliken_diff_over_d'] if c in X_train.columns])

In [13]:
# Execute FE v2 to add quantum + interaction features
t0 = time.time()
X_train, X_test = add_quantum_and_interactions(X_train, X_test)
print('Post-FE v2 shapes:', X_train.shape, X_test.shape)
sample_new = ['mulliken_0','z_mulliken_0','shield_iso_0','z_shield_0','element_pair_id','path_len_bucket','d_x_inv_path','mulliken_diff_over_d','d_over_n_atoms','d_from_expected']
print('Sample new cols present:', [c for c in sample_new if c in X_train.columns])
print(f'FE v2 total time: {(time.time()-t0)/60:.2f} min')

Added quantum + interactions in 0.1 min
Post-FE v2 shapes: (4191263, 80) (467813, 79)
Sample new cols present: ['mulliken_0', 'z_mulliken_0', 'shield_iso_0', 'z_shield_0', 'element_pair_id', 'path_len_bucket', 'd_x_inv_path', 'mulliken_diff_over_d', 'd_from_expected']
FE v2 total time: 0.12 min
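
The per-molecule z-scores built by `_per_molecule_stats` can also be written with `groupby(...).transform`, which returns values already aligned to the atom rows and so skips the intermediate merge. A minimal sketch on made-up charges follows. The toy frame and its values are illustrative and not taken from the dataset; the `1e-6` offset mirrors the one used in the cell.

```python
import pandas as pd

# Toy per-atom table (hypothetical values, two molecules)
atoms = pd.DataFrame({
    'molecule_name': ['m1', 'm1', 'm1', 'm2', 'm2'],
    'atom_index': [0, 1, 2, 0, 1],
    'mulliken_charge': [-0.5, 0.25, 0.25, 0.1, -0.1],
})

g = atoms.groupby('molecule_name')['mulliken_charge']
mol_mean = g.transform('mean')        # per-molecule mean, broadcast back to each atom row
mol_std = g.transform('std') + 1e-6   # sample std (ddof=1); small offset avoids division by zero

atoms['z_mulliken_charge'] = ((atoms['mulliken_charge'] - mol_mean) / mol_std).astype('float32')
print(atoms)
```

As with `agg(['mean','std'])`, a molecule with a single atom row gets a NaN standard deviation, so the z-score is NaN there and still needs the fill step.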


In [14]:
# Patch: add missing molecule-normalization features (without re-running full FE v2 merges)
import pandas as pd, numpy as np, time
t0 = time.time()
need_cols = ['molecule_name','min_nb_d0','min_nb_d1','potential_energy','n_atoms']
for c in need_cols:
    if c not in X_train.columns or c not in X_test.columns:
        raise KeyError(f'Missing prerequisite column for normalization: {c}')

combo = pd.concat([
    X_train[need_cols].assign(_src='train'),
    X_test[need_cols].assign(_src='test')
], ignore_index=True)
combo['mean_nb_d_pair'] = combo[['min_nb_d0','min_nb_d1']].astype('float32').mean(axis=1).astype('float32')
mol_stats = combo.groupby('molecule_name').agg(
    mol_mean_nb_d=('mean_nb_d_pair','mean'),
    mol_pe=('potential_energy','mean'),
    mol_n_atoms=('n_atoms','max')
).reset_index()

# Merge into X_train and X_test explicitly
X_train = X_train.merge(mol_stats, on='molecule_name', how='left')
X_test  = X_test.merge(mol_stats, on='molecule_name', how='left')

# Create normalization features
def _safe_div(num, den):
    den = den.replace(0, np.nan)
    return (num.astype('float32') / den.astype('float32')).fillna(0).astype('float32')

X_train['d_over_n_atoms'] = _safe_div(X_train['d'], X_train['n_atoms'])
X_test['d_over_n_atoms']  = _safe_div(X_test['d'], X_test['n_atoms'])
X_train['pe_per_atom'] = _safe_div(X_train['potential_energy'], X_train['n_atoms'])
X_test['pe_per_atom']  = _safe_div(X_test['potential_energy'], X_test['n_atoms'])
X_train['d_over_mol_mean_nb_d'] = _safe_div(X_train['d'], X_train['mol_mean_nb_d'])
X_test['d_over_mol_mean_nb_d']  = _safe_div(X_test['d'], X_test['mol_mean_nb_d'])

print('Added normalization features. Shapes:', X_train.shape, X_test.shape)
print('Check presence:', all(c in X_train.columns for c in ['d_over_n_atoms','pe_per_atom','d_over_mol_mean_nb_d']))
print(f'Patch time: {(time.time()-t0):.2f}s')

Added normalization features. Shapes: (4191263, 86) (467813, 85)
Check presence: True
Patch time: 1.46s
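
The patch was needed because the FE v2 loop reassigned its loop variable: `df = df.merge(...)` creates a new frame and leaves the original untouched, so columns added afterwards are lost. The sketch below reproduces the effect on toy frames and shows one way to keep the results. The frame names and values are hypothetical.

```python
import pandas as pd

left_a = pd.DataFrame({'molecule_name': ['m1', 'm2'], 'd': [1.0, 2.0]})
left_b = pd.DataFrame({'molecule_name': ['m2'], 'd': [3.0]})
stats = pd.DataFrame({'molecule_name': ['m1', 'm2'], 'mol_n_atoms': [5, 8]})

# Pitfall: merge returns a new frame, so `df` no longer points at left_a / left_b
for df in (left_a, left_b):
    df = df.merge(stats, on='molecule_name', how='left')
    df['d_over_n_atoms'] = df['d'] / df['mol_n_atoms']
cols_after_loop = list(left_a.columns)
print(cols_after_loop)  # ['molecule_name', 'd']

# One fix: collect the merged frames and rebind the original names
frames = [f.merge(stats, on='molecule_name', how='left') for f in (left_a, left_b)]
for f in frames:
    f['d_over_n_atoms'] = f['d'] / f['mol_n_atoms']
left_a, left_b = frames
print(list(left_a.columns))  # ['molecule_name', 'd', 'mol_n_atoms', 'd_over_n_atoms']
```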


In [9]:
# CatBoost per-type (GPU) + OOF/test preds + per-type blending with XGB
import numpy as np, pandas as pd, time
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

def get_molecule_folds(df: pd.DataFrame, n_splits: int = 5, seed: int = 42):
    mols = df['molecule_name'].unique()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_of = {}
    for f, (_, val_idx) in enumerate(kf.split(mols)):
        for m in mols[val_idx]:
            fold_of[m] = f
    fold = df['molecule_name'].map(fold_of).values
    return [(np.where(fold!=i)[0], np.where(fold==i)[0]) for i in range(n_splits)], fold_of

# Build molecule-level folds so every row of a molecule lands in the same fold
cb_folds, mol2fold = get_molecule_folds(X_train, n_splits=5, seed=42)
print('CatBoost folds ready:', len(cb_folds))

# Ensure categorical columns are integer and non-null (will convert to strings before Pool)
for c in ['Z0','Z1','path_len_bucket','element_pair_id_sorted']:
    if c in X_train.columns:
        X_train[c] = X_train[c].fillna(-1).astype('int32')
    if c in X_test.columns:
        X_test[c] = X_test[c].fillna(-1).astype('int32')

# Feature columns (reuse from XGB cell if present, else define here)
try:
    cb_features = feature_cols.copy()
except NameError:
    cb_features = [
        'Z0','Z1','same_element',
        'dx','dy','dz','d','d2','inv_d','inv_d2',
        'nH','nC','nN','nO','nF','n_atoms',
        'path_len','inv_path','is_bonded','min_nb_d0','min_nb_d1','cos0','cos1',
        'potential_energy','dipole_x','dipole_y','dipole_z','dipole_mag',
        'mulliken_0','mulliken_1','z_mulliken_0','z_mulliken_1',
        'shield_iso_0','shield_iso_1','z_shield_0','z_shield_1',
        'mulliken_diff','mulliken_abs_diff','mulliken_sum','mulliken_prod',
        'shield_diff','shield_abs_diff','shield_sum','shield_prod',
        'mulliken_diff_over_d','shield_diff_over_d','mulliken_diff_x_inv_d','shield_diff_x_inv_d',
        'element_pair_id','element_pair_id_sorted','EN0','EN1','EN_diff','EN_abs_diff',
        'path_len_bucket','path_le2','d_x_inv_path','d_over_1p_path','is_bonded_x_inv_d','inv_d_x_path_le2',
        'cos0_x_inv_path','cos1_x_inv_path','min_nb_d0_x_inv_path','min_nb_d1_x_inv_path',
        'd_over_n_atoms','pe_per_atom','d_over_mol_mean_nb_d','expected_d_by_type','d_from_expected'
    ]

# Cat features indices in cb_features
cat_cols = ['Z0','Z1','path_len_bucket','element_pair_id_sorted']
cat_idx = [cb_features.index(c) for c in cat_cols if c in cb_features]
print('Cat features:', cat_cols, '-> idx', cat_idx)

def sanitize_inf(df: pd.DataFrame) -> pd.DataFrame:
    return df.replace([np.inf, -np.inf], np.nan)

types = sorted(X_train['type'].unique())
oof_cb = np.zeros(len(X_train), dtype=np.float32)
test_cb = np.zeros(len(X_test), dtype=np.float32)
per_type_cb = {}

start = time.time()
for t in types:
    tr_mask = (X_train['type'] == t).values
    te_mask = (X_test['type'] == t).values
    X_t = sanitize_inf(X_train.loc[tr_mask, cb_features].copy())
    y_t = X_train.loc[tr_mask, 'scalar_coupling_constant'].astype('float32').values
    X_te_t = sanitize_inf(X_test.loc[te_mask, cb_features].copy())
    # Cast numerics to float32 and convert categorical columns to strings
    num_features = [c for c in cb_features if c not in cat_cols]
    for df_ in (X_t, X_te_t):
        if num_features:
            df_[num_features] = df_[num_features].astype('float32')  # column assignment so the float32 dtype actually sticks
        for c in cat_cols:
            if c in df_.columns:
                df_[c] = df_[c].fillna(-1).astype('int32').astype(str)
    idx_t = np.where(tr_mask)[0]
    oof_t = np.zeros(X_t.shape[0], dtype=np.float32)
    pred_te_t = np.zeros(X_te_t.shape[0], dtype=np.float32)
    print(f"\n[CatBoost] Type {t}: n_train={X_t.shape[0]} n_test={X_te_t.shape[0]}", flush=True)
    # Pools will be created per fold to honor train/valid split
    for fold_i, (tr_idx_all, va_idx_all) in enumerate(cb_folds):
        tr_loc = np.intersect1d(idx_t, tr_idx_all, assume_unique=False)
        va_loc = np.intersect1d(idx_t, va_idx_all, assume_unique=False)
        tr_loc_local = np.searchsorted(idx_t, tr_loc)
        va_loc_local = np.searchsorted(idx_t, va_loc)
        if len(va_loc_local) == 0 or len(tr_loc_local) == 0:
            continue
        t0 = time.time()
        train_pool = Pool(X_t.iloc[tr_loc_local, :], y_t[tr_loc_local], cat_features=cat_idx)
        valid_pool = Pool(X_t.iloc[va_loc_local, :], y_t[va_loc_local], cat_features=cat_idx)
        model = CatBoostRegressor(
            loss_function='MAE', task_type='GPU',
            iterations=5000, learning_rate=0.05, depth=8,
            l2_leaf_reg=5.0, bagging_temperature=0.5,
            od_type='Iter', od_wait=200, random_seed=42,
            border_count=256, verbose=200
        )
        model.fit(train_pool, eval_set=valid_pool, use_best_model=True, verbose=200)
        oof_t[va_loc_local] = model.predict(valid_pool).astype('float32')
        pred_te_t += model.predict(Pool(X_te_t, cat_features=cat_idx)).astype('float32') / len(cb_folds)
        dt = time.time() - t0
        mae_fold = mean_absolute_error(y_t[va_loc_local], oof_t[va_loc_local])
        print(f'  Fold {fold_i}: MAE={mae_fold:.5f} | {dt:.1f}s', flush=True)
    oof_cb[idx_t] = oof_t
    test_cb[te_mask] = pred_te_t
    mae_t = float(np.mean(np.abs(y_t - oof_t)))
    per_type_cb[t] = mae_t
    print(f'[CatBoost] Type {t}: MAE={mae_t:.6f}', flush=True)

def lmae_from_oof(oof_vals: np.ndarray) -> float:
    return lmae_score(X_train['scalar_coupling_constant'].values, oof_vals, X_train['type'])

# Save artifacts for blending and reuse
np.save('oof_xgb.npy', oof.astype('float32'))
np.save('pred_test_xgb.npy', test_pred.astype('float32'))
np.save('oof_cb.npy', oof_cb.astype('float32'))
np.save('pred_test_cb.npy', test_cb.astype('float32'))
pd.Series(per_type_cb).to_csv('per_type_mae_cb.csv')

print('OOF LMAE XGB:', lmae_from_oof(oof))
print('OOF LMAE CB :', lmae_from_oof(oof_cb))

# Per-type weight search for blend on OOF
blend_oof = np.zeros_like(oof_cb)
blend_test = np.zeros_like(test_cb)
w_per_type = {}
for t in types:
    m = (X_train['type'] == t).values
    best_mae, best_w = 1e9, 0.5
    for w in np.linspace(0.0, 1.0, 21):
        o = w*oof[m] + (1.0-w)*oof_cb[m]
        mae = float(np.mean(np.abs(X_train.loc[m, 'scalar_coupling_constant'].values - o)))
        if mae < best_mae:
            best_mae, best_w = mae, float(w)
    w_per_type[t] = best_w
    blend_oof[m] = best_w*oof[m] + (1.0-best_w)*oof_cb[m]
    mt = (X_test['type'] == t).values
    blend_test[mt] = best_w*test_pred[mt] + (1.0-best_w)*test_cb[mt]
print('Per-type blend weights:', w_per_type)
print('OOF LMAE Blend:', lmae_from_oof(blend_oof))

# Write blended submission
sub_blend = pd.DataFrame({'id': X_test['id'].values, 'scalar_coupling_constant': blend_test.astype('float32')}).sort_values('id')
sub_blend.to_csv('submission.csv', index=False)
print('Saved blended submission.csv:', sub_blend.shape, 'head:\n', sub_blend.head())
print(f'Total CatBoost+Blend time: {(time.time()-start)/60:.1f} min')
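
The per-type weight search above is a one-dimensional grid search over a convex combination of two prediction vectors. A minimal sketch on synthetic data follows, assuming two hypothetical models with independent noise; `best_blend_weight` and the toy arrays are illustrative names, not part of the pipeline. Because the grid includes w=0 and w=1, the blended MAE cannot be worse than either input on the data used for the search, which is also why the reported OOF blend score is somewhat optimistic.

```python
import numpy as np

def best_blend_weight(y, pred_a, pred_b, grid=None):
    """Return (weight, mae) minimising the MAE of w*pred_a + (1-w)*pred_b over the grid."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)
    best_w, best_mae = 0.5, np.inf
    for w in grid:
        mae = float(np.mean(np.abs(y - (w * pred_a + (1.0 - w) * pred_b))))
        if mae < best_mae:
            best_w, best_mae = float(w), mae
    return best_w, best_mae

# Synthetic stand-ins (assumption): model A is less noisy than model B
rng = np.random.default_rng(0)
y_toy = rng.normal(size=5000)
pred_a = y_toy + rng.normal(scale=0.5, size=5000)
pred_b = y_toy + rng.normal(scale=1.0, size=5000)

w_toy, mae_toy = best_blend_weight(y_toy, pred_a, pred_b)
mae_a = float(np.mean(np.abs(y_toy - pred_a)))
mae_b = float(np.mean(np.abs(y_toy - pred_b)))
print(f'w={w_toy:.2f} | blend MAE={mae_toy:.4f} | A={mae_a:.4f} | B={mae_b:.4f}')
```

In the notebook the same search runs once per coupling type on the OOF slices, and the chosen weight is then applied to the matching test slice.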

In [16]:
# FE v3: True geometry features (exact angles for path_len==2 and dihedrals for path_len==3)
import numpy as np, pandas as pd, time

R_COV = {1:0.31, 6:0.76, 7:0.71, 8:0.66, 9:0.57}  # H,C,N,O,F
BOND_SCALE = 1.15

def _build_mol_coords_Z(structures: pd.DataFrame, mol_names: np.ndarray):
    s = structures[['molecule_name','atom_index','x','y','z','atom']].copy()
    s['Z'] = s['atom'].map({'H':1,'C':6,'N':7,'O':8,'F':9}).astype('int16')
    grp = s.groupby('molecule_name')
    cache = {}
    for m in mol_names:
        if m in grp.groups:
            dfm = grp.get_group(m).sort_values('atom_index')
            coords = dfm[['x','y','z']].to_numpy(dtype=np.float32)
            Z = dfm['Z'].to_numpy(dtype=np.int16)
            cache[m] = (coords, Z)
    return cache

def _adjacency_from_coords(coords: np.ndarray, Z: np.ndarray) -> np.ndarray:
    n = coords.shape[0]
    if n == 0:
        return np.zeros((0,0), dtype=np.uint8)
    diff = coords[:, None, :] - coords[None, :, :]
    D = np.sqrt(np.sum(diff*diff, axis=2, dtype=np.float32)).astype(np.float32)
    rc = np.vectorize(lambda z: R_COV.get(int(z), 0.7), otypes=[np.float32])(Z).astype(np.float32)
    thr = (rc[:, None] + rc[None, :]).astype(np.float32) * np.float32(BOND_SCALE)
    adj = (D > 0) & (D < thr)
    adj = adj.astype(np.uint8)
    np.fill_diagonal(adj, 0)
    return adj

def _angle_features(coords: np.ndarray, a: int, k: int, b: int):
    # angle at k formed by a-k-b
    v1 = coords[a] - coords[k]
    v2 = coords[b] - coords[k]
    n1 = np.linalg.norm(v1) + 1e-12
    n2 = np.linalg.norm(v2) + 1e-12
    cos_th = float(np.dot(v1, v2) / (n1 * n2))
    cos_th = max(-1.0, min(1.0, cos_th))
    sin_th = float(np.linalg.norm(np.cross(v1, v2)) / (n1 * n2))
    theta = float(np.arccos(cos_th))
    return cos_th, sin_th, theta

def _dihedral_features(coords: np.ndarray, a: int, b: int, c: int, d: int):
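    # Torsion angle for a-b-c-d using a stable atan2 formulation.
    # Sign convention: b0 points from a to b, so trans gives phi = 0 and cis gives |phi| = pi,
    # i.e. phi is shifted by pi relative to the IUPAC convention (cis = 0). cos_phi and sin_phi
    # are therefore sign-flipped versus IUPAC values, while cos2_phi is unchanged.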
    b0 = coords[b] - coords[a]
    b1 = coords[c] - coords[b]
    b2 = coords[d] - coords[c]
    # Normalize b1 for projection stability
    b1n = b1 / (np.linalg.norm(b1) + 1e-12)
    v = b0 - np.dot(b0, b1n) * b1n
    w = b2 - np.dot(b2, b1n) * b1n
    x = np.dot(v, w)
    y = np.dot(np.cross(b1n, v), w)
    phi = float(np.arctan2(y, x))
    cos_phi = float(np.cos(phi))
    sin_phi = float(np.sin(phi))
    cos2_phi = float(np.cos(2.0 * phi))
    return cos_phi, sin_phi, cos2_phi, phi

def add_true_geometry_features(X_train: pd.DataFrame, X_test: pd.DataFrame, structures: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    t0 = time.time()
    # Prepare combined index to write back
    pairs = pd.concat([
        X_train[['molecule_name','atom_index_0','atom_index_1','path_len']].assign(_src='train', _idx=np.arange(len(X_train), dtype=np.int64)),
        X_test[['molecule_name','atom_index_0','atom_index_1','path_len']].assign(_src='test',  _idx=np.arange(len(X_test), dtype=np.int64))
    ], ignore_index=True)
    mols = pairs['molecule_name'].unique()
    cache = _build_mol_coords_Z(structures, mols)

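    # Preallocate outputs. Rows with no defined angle/dihedral keep the 0.0 default,
    # which coincides with a valid value (e.g. cos = 0 at 90 degrees); path_len helps tell them apart.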
    ang_cos_tr = np.zeros(len(X_train), dtype=np.float32); ang_cos_te = np.zeros(len(X_test), dtype=np.float32)
    ang_sin_tr = np.zeros(len(X_train), dtype=np.float32); ang_sin_te = np.zeros(len(X_test), dtype=np.float32)
    ang_rad_tr = np.zeros(len(X_train), dtype=np.float32); ang_rad_te = np.zeros(len(X_test), dtype=np.float32)
    dih_cos_tr = np.zeros(len(X_train), dtype=np.float32); dih_cos_te = np.zeros(len(X_test), dtype=np.float32)
    dih_sin_tr = np.zeros(len(X_train), dtype=np.float32); dih_sin_te = np.zeros(len(X_test), dtype=np.float32)
    dih_cos2_tr = np.zeros(len(X_train), dtype=np.float32); dih_cos2_te = np.zeros(len(X_test), dtype=np.float32)

    processed = 0
    for m, rows in pairs.groupby('molecule_name', sort=False):
        if m not in cache:
            continue
        coords, Z = cache[m]
        n_atoms = coords.shape[0]
        if n_atoms < 2:
            continue
        adj = _adjacency_from_coords(coords, Z)
        # Build neighbor lists
        nbrs = [np.flatnonzero(adj[i]).astype(np.int32) for i in range(n_atoms)]

        a0 = rows['atom_index_0'].to_numpy(dtype=np.int32)
        a1 = rows['atom_index_1'].to_numpy(dtype=np.int32)
        pl = rows['path_len'].to_numpy(dtype=np.int16)
        is_pl2 = (pl == 2)
        is_pl3 = (pl == 3)
        # Angles for pl==2
        idxs_pl2 = np.flatnonzero(is_pl2)
        for idx in idxs_pl2:
            i = int(a0[idx]); j = int(a1[idx])
            # Skip rows whose atom indices fall outside the molecule
            if i < 0 or j < 0 or i >= n_atoms or j >= n_atoms:
                continue
            Ni, Nj = nbrs[i], nbrs[j]
            if Ni.size == 0 or Nj.size == 0:
                continue
            # Middle atom k: a common neighbor of i and j. intersect1d returns sorted values,
            # so taking the first one (lowest atom index) is deterministic.
            common = np.intersect1d(Ni, Nj, assume_unique=False)
            if common.size == 0:
                continue
            k = int(common[0])
            c, s, th = _angle_features(coords, i, k, j)
            if rows['_src'].iloc[idx] == 'train':
                ang_cos_tr[rows['_idx'].iloc[idx]] = c
                ang_sin_tr[rows['_idx'].iloc[idx]] = s
                ang_rad_tr[rows['_idx'].iloc[idx]] = th
            else:
                ang_cos_te[rows['_idx'].iloc[idx]] = c
                ang_sin_te[rows['_idx'].iloc[idx]] = s
                ang_rad_te[rows['_idx'].iloc[idx]] = th
        # Dihedrals for pl==3
        idxs_pl3 = np.flatnonzero(is_pl3)
        if idxs_pl3.size:
            # For BFS parent path, do a BFS from each unique source i among pl==3 rows
            # Build parent arrays per unique source to reuse
            srcs = np.unique(a0[idxs_pl3])
            parents_map = {}
            for src in srcs:
                # BFS to get parent pointers until all targets are found
                parent = np.full(n_atoms, -1, dtype=np.int32)
                parent[src] = src
                q = [int(src)]; head = 0
                while head < len(q):
                    u = q[head]; head += 1
                    for v in nbrs[u]:
                        if parent[v] == -1:
                            parent[v] = u
                            q.append(int(v))
                parents_map[int(src)] = parent
            # Now compute dihedral for each pl3 row using parent backtrack j->i
            for idx in idxs_pl3:
                i = int(a0[idx]); j = int(a1[idx])
                if i < 0 or j < 0 or i >= n_atoms or j >= n_atoms:
                    continue
                parent = parents_map.get(i, None)
                if parent is None or parent[j] == -1:
                    continue
                # Backtrack: j -> k2 -> k1 -> i
                k2 = int(parent[j])
                if k2 == -1 or k2 == j:
                    continue
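                k1 = int(parent[k2])
                if k1 == -1 or k1 == k2 or k1 == i:
                    continue
                # Require a genuine 3-bond path i-k1-k2-j in this adjacency graph
                if int(parent[k1]) != i:
                    continue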
                c, s, c2, phi = _dihedral_features(coords, i, k1, k2, j)
                if rows['_src'].iloc[idx] == 'train':
                    dih_cos_tr[rows['_idx'].iloc[idx]] = c
                    dih_sin_tr[rows['_idx'].iloc[idx]] = s
                    dih_cos2_tr[rows['_idx'].iloc[idx]] = c2
                else:
                    dih_cos_te[rows['_idx'].iloc[idx]] = c
                    dih_sin_te[rows['_idx'].iloc[idx]] = s
                    dih_cos2_te[rows['_idx'].iloc[idx]] = c2

        processed += 1
        if processed % 1000 == 0:
            print(f'  FE v3 processed {processed} molecules...', flush=True)

    # Assign to dataframes
    X_train['angle_cos'] = ang_cos_tr; X_test['angle_cos'] = ang_cos_te
    X_train['angle_sin'] = ang_sin_tr; X_test['angle_sin'] = ang_sin_te
    X_train['angle_rad'] = ang_rad_tr; X_test['angle_rad'] = ang_rad_te
    X_train['dih_cos'] = dih_cos_tr; X_test['dih_cos'] = dih_cos_te
    X_train['dih_sin'] = dih_sin_tr; X_test['dih_sin'] = dih_sin_te
    X_train['dih_cos2'] = dih_cos2_tr; X_test['dih_cos2'] = dih_cos2_te

    # Optional derived transforms (Karplus-like basis already covered by cos/sin/cos2)
    # Ensure dtypes
    for c in ['angle_cos','angle_sin','angle_rad','dih_cos','dih_sin','dih_cos2']:
        X_train[c] = X_train[c].astype('float32')
        X_test[c] = X_test[c].astype('float32')

    print(f'Added FE v3 true geometry in {(time.time()-t0)/60:.1f} min')
    return X_train, X_test

# Usage after current CatBoost run finishes:
# X_train, X_test = add_true_geometry_features(X_train, X_test, structures)
# Then add these to feature_cols/cb_features:
#   ['angle_cos','angle_sin','angle_rad','dih_cos','dih_sin','dih_cos2']
# Retrain weakest types first (3JHC/3JHN/2JHH) and re-blend.
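
A quick sanity check of the torsion formula on hand-built geometries can catch sign-convention surprises before retraining. The sketch below re-implements the same atan2 formulation as `_dihedral_features` in a standalone function (`dihedral_check` and the four test points are illustrative, not part of the pipeline). With the bond vector b0 taken from the first atom to the second, as in the cell above, the trans arrangement comes out at 0 and the cis arrangement at |phi| = pi.

```python
import numpy as np

def dihedral_check(p0, p1, p2, p3):
    """Signed torsion (radians) for the chain p0-p1-p2-p3, same vectors as _dihedral_features."""
    b0 = p1 - p0
    b1 = p2 - p1
    b2 = p3 - p2
    b1n = b1 / (np.linalg.norm(b1) + 1e-12)
    v = b0 - np.dot(b0, b1n) * b1n   # part of b0 perpendicular to the central bond
    w = b2 - np.dot(b2, b1n) * b1n   # part of b2 perpendicular to the central bond
    return float(np.arctan2(np.dot(np.cross(b1n, v), w), np.dot(v, w)))

# Central bond along z; the outer atoms sit one unit away from it
p0 = np.array([1.0, 0.0, 0.0])
p1 = np.array([0.0, 0.0, 0.0])
p2 = np.array([0.0, 0.0, 1.0])

phi_cis = dihedral_check(p0, p1, p2, np.array([1.0, 0.0, 1.0]))     # same side as p0
phi_trans = dihedral_check(p0, p1, p2, np.array([-1.0, 0.0, 1.0]))  # opposite side
phi_perp = dihedral_check(p0, p1, p2, np.array([0.0, 1.0, 1.0]))    # a quarter turn away

print('cis  :', round(abs(phi_cis), 6))
print('trans:', round(abs(phi_trans), 6))
print('perp :', round(abs(phi_perp), 6))
```

For tree models the offset is harmless, since `dih_cos` and `dih_sin` are only sign-flipped and `dih_cos2` is unaffected, but it matters if the raw angle is ever compared against tabulated Karplus parameters.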

In [28]:
# FE v3 execution + sanitation + 3-fold molecule-aware mapping (prep for LightGBM)
import numpy as np, pandas as pd, time
from sklearn.model_selection import KFold

t0 = time.time()
# 1) Run true geometry features (angles/dihedrals)
X_train, X_test = add_true_geometry_features(X_train, X_test, structures)

# 2) Define candidate feature list (filter by availability) incl. FE v3
candidate_features = [
    'Z0','Z1','same_element',
    'dx','dy','dz','d','d2','inv_d','inv_d2',
    'nH','nC','nN','nO','nF','n_atoms',
    'path_len','inv_path','is_bonded','min_nb_d0','min_nb_d1','cos0','cos1',
    'potential_energy','dipole_x','dipole_y','dipole_z','dipole_mag',
    'mulliken_0','mulliken_1','z_mulliken_0','z_mulliken_1',
    'shield_iso_0','shield_iso_1','z_shield_0','z_shield_1',
    'mulliken_diff','mulliken_abs_diff','mulliken_sum','mulliken_prod',
    'shield_diff','shield_abs_diff','shield_sum','shield_prod',
    'mulliken_diff_over_d','mulliken_diff_x_inv_d','shield_diff_over_d','shield_diff_x_inv_d',
    'element_pair_id','element_pair_id_sorted','EN0','EN1','EN_diff','EN_abs_diff',
    'path_len_bucket','path_le2','d_x_inv_path','d_over_1p_path','is_bonded_x_inv_d','inv_d_x_path_le2',
    'cos0_x_inv_path','cos1_x_inv_path','min_nb_d0_x_inv_path','min_nb_d1_x_inv_path',
    'd_over_n_atoms','pe_per_atom','d_over_mol_mean_nb_d','expected_d_by_type','d_from_expected',
    # FE v3
    'angle_cos','angle_sin','angle_rad','dih_cos','dih_sin','dih_cos2'
]
lgb_features = [c for c in candidate_features if c in X_train.columns and c in X_test.columns]

# 3) Sanitize: replace inf -> NaN -> fill; cast float32; fill test with train means
def sanitize_train_test(X_tr: pd.DataFrame, X_te: pd.DataFrame, cols: list[str]):
    X_tr = X_tr.copy(); X_te = X_te.copy()
    X_tr[cols] = X_tr[cols].replace([np.inf, -np.inf], np.nan)
    X_te[cols] = X_te[cols].replace([np.inf, -np.inf], np.nan)
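    # Integer columns: fill with the train mode (a no-op for plain NumPy integer dtypes,
    # which cannot hold NaN); all other columns: coerce to float32 and fill with the train mean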
    for c in cols:
        if X_tr[c].dtype.kind in 'iu':
            tr_val = int(pd.Series(X_tr[c]).mode(dropna=True).iloc[0]) if pd.Series(X_tr[c]).mode(dropna=True).shape[0] else 0
            X_tr[c] = X_tr[c].fillna(tr_val).astype(X_tr[c].dtype)
            X_te[c] = X_te[c].fillna(tr_val).astype(X_te[c].dtype)
        else:
            m = pd.to_numeric(X_tr[c], errors='coerce').astype('float32')
            mean_val = float(np.nanmean(m)) if np.isfinite(np.nanmean(m)) else 0.0
            X_tr[c] = pd.to_numeric(X_tr[c], errors='coerce').astype('float32').fillna(mean_val)
            X_te[c] = pd.to_numeric(X_te[c], errors='coerce').astype('float32').fillna(mean_val)
    return X_tr, X_te

X_train, X_test = sanitize_train_test(X_train, X_test, lgb_features)

# 4) Build and store a 3-fold molecule-aware mapping to reuse across models
def build_molecule_folds(df: pd.DataFrame, n_splits: int = 3, seed: int = 42):
    mols = df['molecule_name'].unique()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_of = {}
    for f, (_, val_idx) in enumerate(kf.split(mols)):
        for m in mols[val_idx]:
            fold_of[m] = f
    fold_idx = df['molecule_name'].map(fold_of).values
    folds = [(np.where(fold_idx != i)[0], np.where(fold_idx == i)[0]) for i in range(n_splits)]
    return folds, fold_of

lgb_folds, mol2fold_3f = build_molecule_folds(X_train, n_splits=3, seed=42)
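# Check every fold (not only the first) for molecules shared between train and validation
for _tr_idx, _va_idx in lgb_folds:
    assert not X_train['molecule_name'].iloc[_tr_idx].isin(X_train['molecule_name'].iloc[_va_idx]).any(), 'Leakage across folds'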

print('FE v3 done. lgb_features:', len(lgb_features), '| folds:', len(lgb_folds), '| elapsed min:', round((time.time()-t0)/60,2))

  FE v3 processed 1000 molecules...


  FE v3 processed 2000 molecules...


  FE v3 processed 3000 molecules...


  FE v3 processed 4000 molecules...


  FE v3 processed 5000 molecules...


  FE v3 processed 6000 molecules...


  FE v3 processed 7000 molecules...


  FE v3 processed 8000 molecules...


  FE v3 processed 9000 molecules...


  FE v3 processed 10000 molecules...


  FE v3 processed 11000 molecules...


  FE v3 processed 12000 molecules...


  FE v3 processed 13000 molecules...


  FE v3 processed 14000 molecules...


  FE v3 processed 15000 molecules...


  FE v3 processed 16000 molecules...


  FE v3 processed 17000 molecules...


  FE v3 processed 18000 molecules...


  FE v3 processed 19000 molecules...


  FE v3 processed 20000 molecules...


  FE v3 processed 21000 molecules...


  FE v3 processed 22000 molecules...


  FE v3 processed 23000 molecules...


  FE v3 processed 24000 molecules...


  FE v3 processed 25000 molecules...


  FE v3 processed 26000 molecules...


  FE v3 processed 27000 molecules...


  FE v3 processed 28000 molecules...


  FE v3 processed 29000 molecules...


  FE v3 processed 30000 molecules...


  FE v3 processed 31000 molecules...


  FE v3 processed 32000 molecules...


  FE v3 processed 33000 molecules...


  FE v3 processed 34000 molecules...


  FE v3 processed 35000 molecules...


  FE v3 processed 36000 molecules...


  FE v3 processed 37000 molecules...


  FE v3 processed 38000 molecules...


  FE v3 processed 39000 molecules...


  FE v3 processed 40000 molecules...


  FE v3 processed 41000 molecules...


  FE v3 processed 42000 molecules...


  FE v3 processed 43000 molecules...


  FE v3 processed 44000 molecules...


  FE v3 processed 45000 molecules...


  FE v3 processed 46000 molecules...


  FE v3 processed 47000 molecules...


  FE v3 processed 48000 molecules...


  FE v3 processed 49000 molecules...


  FE v3 processed 50000 molecules...


  FE v3 processed 51000 molecules...


  FE v3 processed 52000 molecules...


  FE v3 processed 53000 molecules...


  FE v3 processed 54000 molecules...


  FE v3 processed 55000 molecules...


  FE v3 processed 56000 molecules...


  FE v3 processed 57000 molecules...


  FE v3 processed 58000 molecules...


  FE v3 processed 59000 molecules...


  FE v3 processed 60000 molecules...


  FE v3 processed 61000 molecules...


  FE v3 processed 62000 molecules...


  FE v3 processed 63000 molecules...


  FE v3 processed 64000 molecules...


  FE v3 processed 65000 molecules...


  FE v3 processed 66000 molecules...


  FE v3 processed 67000 molecules...


  FE v3 processed 68000 molecules...


  FE v3 processed 69000 molecules...


  FE v3 processed 70000 molecules...


  FE v3 processed 71000 molecules...


  FE v3 processed 72000 molecules...


  FE v3 processed 73000 molecules...


  FE v3 processed 74000 molecules...


  FE v3 processed 75000 molecules...


  FE v3 processed 76000 molecules...


Added FE v3 true geometry in 5.1 min


FE v3 done. lgb_features: 70 | folds: 3 | elapsed min: 5.17
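
The molecule-aware mapping above shuffles unique molecule names with `KFold` and maps rows back to folds. The plan's alternative, scikit-learn's `GroupKFold`, gives the same no-leakage guarantee directly. The sketch below shows it on a tiny hypothetical frame (the molecule ids and the `toy` names are made up for illustration). Note that `GroupKFold` balances fold sizes deterministically rather than by a seeded shuffle, so the resulting folds would not match `lgb_folds`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 9 coupling rows spread over 5 molecules
toy = pd.DataFrame({
    'molecule_name': ['m0', 'm0', 'm1', 'm1', 'm1', 'm2', 'm3', 'm3', 'm4'],
    'y': np.arange(9, dtype=np.float32),
})

gkf = GroupKFold(n_splits=3)
toy_folds = list(gkf.split(toy, groups=toy['molecule_name']))

# Every fold must keep train and validation molecules disjoint
for tr_idx, va_idx in toy_folds:
    tr_mols = set(toy['molecule_name'].iloc[tr_idx])
    va_mols = set(toy['molecule_name'].iloc[va_idx])
    assert tr_mols.isdisjoint(va_mols), 'Leakage across folds'

# Each row is used for validation exactly once
va_all = np.sort(np.concatenate([va for _, va in toy_folds]))
print('validation coverage:', va_all.tolist())
```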


In [40]:
# LightGBM per-type 3-fold CPU training with FE v3 features
import time, numpy as np, pandas as pd
from sklearn.metrics import mean_absolute_error

try:
    import lightgbm as lgb
except Exception:
    import sys, subprocess, importlib
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'lightgbm==4.6.0'])
    importlib.invalidate_caches()
    import lightgbm as lgb

assert 'lgb_features' in globals(), 'Run prep cell to build lgb_features and folds first.'
assert 'lgb_folds' in globals(), 'Run prep cell to build lgb_folds first.'

def lgb_params_for_type(t: str):
    base = dict(objective='mae', metric='mae', boosting_type='gbdt',
                learning_rate=0.1, n_jobs=-1, feature_fraction=0.8,
                bagging_fraction=0.8, bagging_freq=1, max_bin=256, reg_lambda=1.0, verbose=-1)
    if t.startswith('1J'):
        base.update(dict(num_leaves=56, min_data_in_leaf=180, learning_rate=0.12, n_estimators=800))
    elif t.startswith('2J'):
        base.update(dict(num_leaves=96, min_data_in_leaf=100, learning_rate=0.10, n_estimators=1000))
    elif t.startswith('3J'):
        base.update(dict(num_leaves=128, min_data_in_leaf=50, learning_rate=0.08, n_estimators=1400))
    else:
        base.update(dict(num_leaves=96, min_data_in_leaf=100, learning_rate=0.10, n_estimators=1000))
    return base

def lmae_score_fast(y_true: np.ndarray, y_pred: np.ndarray, types: pd.Series, eps: float = 1e-9) -> float:
    df = pd.DataFrame({'y': y_true, 'p': y_pred, 'type': types})
    # Vectorized per-type MAE; avoids groupby.apply, which can emit a pandas deprecation warning
    mae_by_type = (df['y'].astype('float64') - df['p'].astype('float64')).abs().groupby(df['type']).mean()
    return float(np.log(mae_by_type.clip(lower=eps)).mean())

types_list = sorted(X_train['type'].unique())
# Prioritize heavy 3J first
types_order = [t for t in types_list if t.startswith('3J')] + [t for t in types_list if t.startswith('2J')] + [t for t in types_list if t.startswith('1J')]

oof_lgb = np.zeros(len(X_train), dtype=np.float32)
test_lgb = np.zeros(len(X_test), dtype=np.float32)
per_type_mae = {}

start_all = time.time()
for t in types_order:
    params = lgb_params_for_type(t)
    n_estimators_cap = params.pop('n_estimators')
    tr_mask = (X_train['type'] == t).values
    te_mask = (X_test['type'] == t).values
    X_t = X_train.loc[tr_mask, lgb_features].astype('float32')
    y_t = X_train.loc[tr_mask, 'scalar_coupling_constant'].astype('float32').values
    X_te_t = X_test.loc[te_mask, lgb_features].astype('float32')
    idx_t = np.where(tr_mask)[0]
    oof_t = np.zeros(X_t.shape[0], dtype=np.float32)
    pred_te_t = np.zeros(X_te_t.shape[0], dtype=np.float32)
    print(f"\n[LGBM] Type {t}: n_train={X_t.shape[0]} n_test={X_te_t.shape[0]}", flush=True)
    for fold_i, (tr_idx_all, va_idx_all) in enumerate(lgb_folds):
        tr_loc = np.intersect1d(idx_t, tr_idx_all, assume_unique=False)
        va_loc = np.intersect1d(idx_t, va_idx_all, assume_unique=False)
        tr_loc_local = np.searchsorted(idx_t, tr_loc)
        va_loc_local = np.searchsorted(idx_t, va_loc)
        if len(va_loc_local) == 0 or len(tr_loc_local) == 0:
            continue
        t0 = time.time()
        dtrain = lgb.Dataset(X_t.iloc[tr_loc_local, :], label=y_t[tr_loc_local], free_raw_data=False)
        dvalid = lgb.Dataset(X_t.iloc[va_loc_local, :], label=y_t[va_loc_local], reference=dtrain, free_raw_data=False)
        booster = lgb.train(params, dtrain, num_boost_round=int(n_estimators_cap),
                            valid_sets=[dtrain, dvalid], valid_names=['train','valid'],
                            callbacks=[lgb.early_stopping(stopping_rounds=100, verbose=False)])
        best_it = booster.best_iteration if booster.best_iteration is not None else booster.current_iteration()
        oof_t[va_loc_local] = booster.predict(X_t.iloc[va_loc_local, :], num_iteration=best_it).astype('float32')
        pred_te_t += booster.predict(X_te_t, num_iteration=best_it).astype('float32') / len(lgb_folds)
        dt = time.time() - t0
        mae_fold = mean_absolute_error(y_t[va_loc_local], oof_t[va_loc_local])
        print(f"  Fold {fold_i}: n_tr={len(tr_loc_local)} n_va={len(va_loc_local)} | MAE={mae_fold:.5f} | it={best_it} | {dt:.1f}s", flush=True)
    oof_lgb[idx_t] = oof_t
    test_lgb[te_mask] = pred_te_t
    mae_t = float(np.mean(np.abs(y_t - oof_t)))
    per_type_mae[t] = mae_t
    print(f"[LGBM] Type {t}: MAE={mae_t:.6f}", flush=True)

overall_lmae = lmae_score_fast(X_train['scalar_coupling_constant'].values, oof_lgb, X_train['type'])
print('\nPer-type MAE (LGBM):', {k: round(v,6) for k,v in per_type_mae.items()})
print(f"Overall OOF LMAE (LGBM): {overall_lmae:.6f} | elapsed {(time.time()-start_all)/60:.1f} min", flush=True)

# Save artifacts
np.save('oof_lgb.npy', oof_lgb.astype('float32'))
np.save('pred_test_lgb.npy', test_lgb.astype('float32'))
pd.Series(per_type_mae).to_csv('per_type_mae_lgb.csv')

# Build submission from LGBM as anchor
sub_lgb = pd.DataFrame({'id': X_test['id'].values, 'scalar_coupling_constant': test_lgb.astype('float32')}).sort_values('id')
sub_lgb.to_csv('submission.csv', index=False)
print('Saved LGBM submission.csv:', sub_lgb.shape, 'head:\n', sub_lgb.head())


[LGBM] Type 3JHC: n_train=1359077 n_test=152130


  Fold 0: n_tr=904819 n_va=454258 | MAE=1.17752 | it=1400 | 42.5s


  Fold 1: n_tr=906154 n_va=452923 | MAE=1.16985 | it=1400 | 42.0s


  Fold 2: n_tr=907181 n_va=451896 | MAE=1.17586 | it=1400 | 44.4s


[LGBM] Type 3JHC: MAE=1.174413



[LGBM] Type 3JHH: n_train=531224 n_test=59305


  Fold 0: n_tr=354056 n_va=177168 | MAE=0.70301 | it=1400 | 21.9s


  Fold 1: n_tr=353202 n_va=178022 | MAE=0.69360 | it=1400 | 21.8s


  Fold 2: n_tr=355190 n_va=176034 | MAE=0.70038 | it=1400 | 24.0s


[LGBM] Type 3JHH: MAE=0.698984



[LGBM] Type 3JHN: n_train=150067 n_test=16546


  Fold 0: n_tr=100305 n_va=49762 | MAE=0.36148 | it=1400 | 13.8s


  Fold 1: n_tr=99840 n_va=50227 | MAE=0.35832 | it=1400 | 12.2s


  Fold 2: n_tr=99989 n_va=50078 | MAE=0.36210 | it=1400 | 13.0s


[LGBM] Type 3JHN: MAE=0.360631



[LGBM] Type 2JHC: n_train=1026379 n_test=114488


  Fold 0: n_tr=683209 n_va=343170 | MAE=1.04589 | it=1000 | 26.0s


  Fold 1: n_tr=683994 n_va=342385 | MAE=1.03020 | it=1000 | 23.1s


  Fold 2: n_tr=685555 n_va=340824 | MAE=1.04146 | it=1000 | 23.1s


[LGBM] Type 2JHC: MAE=1.039188



[LGBM] Type 2JHH: n_train=340097 n_test=37891


  Fold 0: n_tr=226826 n_va=113271 | MAE=0.48423 | it=1000 | 12.6s


  Fold 1: n_tr=226656 n_va=113441 | MAE=0.47373 | it=1000 | 10.7s


  Fold 2: n_tr=226712 n_va=113385 | MAE=0.48050 | it=1000 | 11.4s


[LGBM] Type 2JHH: MAE=0.479484



[LGBM] Type 2JHN: n_train=107091 n_test=11968


  Fold 0: n_tr=71887 n_va=35204 | MAE=0.69451 | it=1000 | 6.6s


  Fold 1: n_tr=71216 n_va=35875 | MAE=0.69152 | it=1000 | 7.4s


  Fold 2: n_tr=71079 n_va=36012 | MAE=0.69611 | it=1000 | 6.6s


[LGBM] Type 2JHN: MAE=0.694049



[LGBM] Type 1JHC: n_train=637912 n_test=71221


  Fold 0: n_tr=425180 n_va=212732 | MAE=2.13600 | it=800 | 10.9s


  Fold 1: n_tr=425134 n_va=212778 | MAE=2.13272 | it=800 | 12.1s


  Fold 2: n_tr=425510 n_va=212402 | MAE=2.13826 | it=800 | 10.5s


[LGBM] Type 1JHC: MAE=2.135657



[LGBM] Type 1JHN: n_train=39416 n_test=4264


  Fold 0: n_tr=26428 n_va=12988 | MAE=0.87146 | it=800 | 8.6s


  Fold 1: n_tr=26195 n_va=13221 | MAE=0.84896 | it=800 | 2.9s


  Fold 2: n_tr=26209 n_va=13207 | MAE=0.87227 | it=800 | 2.8s


[LGBM] Type 1JHN: MAE=0.864185



Per-type MAE (LGBM): {'3JHC': 1.174413, '3JHH': 0.698984, '3JHN': 0.360631, '2JHC': 1.039188, '2JHH': 0.479484, '2JHN': 0.694049, '1JHC': 2.135657, '1JHN': 0.864185}
Overall OOF LMAE (LGBM): -0.208284 | elapsed 7.0 min


Saved LGBM submission.csv: (467813, 2) head:
          id  scalar_coupling_constant
335622  276                116.564529
335623  277                  0.496286
335624  278                  5.943921
335625  279                  5.943921
335626  280                  0.496286
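
As an arithmetic check of the metric, the overall LMAE should equal the mean of the natural logs of the per-type MAEs. The sketch below recomputes it from the rounded per-type values printed in the output above (the dictionary name is illustrative), so agreement with the logged -0.208284 is only expected to roughly five decimal places.

```python
import numpy as np

# Per-type OOF MAE values copied from the LGBM run output above
per_type_mae_reported = {
    '3JHC': 1.174413, '3JHH': 0.698984, '3JHN': 0.360631,
    '2JHC': 1.039188, '2JHH': 0.479484, '2JHN': 0.694049,
    '1JHC': 2.135657, '1JHN': 0.864185,
}

# LMAE = mean over types of log(MAE_type)
lmae_check = float(np.mean(np.log(list(per_type_mae_reported.values()))))
print(f'Recomputed LMAE: {lmae_check:.6f}')

# Types with MAE > 1 contribute positive log terms and drag the score up the most
worst = sorted(per_type_mae_reported, key=per_type_mae_reported.get, reverse=True)[:3]
print('Largest per-type MAE:', worst)
```

Since every type carries equal weight in the mean of logs, a fixed relative improvement on any type moves the score by the same amount, regardless of how many rows that type has.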


In [41]:
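# LightGBM per-type 3-fold CPU training, second run for blending
# (random_seed=1337, feature_fraction=0.75, bagging_fraction=0.85; otherwise same per-type params)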
import time, numpy as np, pandas as pd
from sklearn.metrics import mean_absolute_error

try:
    import lightgbm as lgb
except Exception:
    import sys, subprocess, importlib
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'lightgbm==4.6.0'])
    importlib.invalidate_caches()
    import lightgbm as lgb

assert 'lgb_features' in globals(), 'Run prep cell to build lgb_features and folds first.'
assert 'lgb_folds' in globals(), 'Run prep cell to build lgb_folds first.'

def lgb_params_for_type_seed2(t: str):
    base = dict(objective='mae', metric='mae', boosting_type='gbdt',
                learning_rate=0.1, n_jobs=-1, feature_fraction=0.75,
                bagging_fraction=0.85, bagging_freq=1, max_bin=256, reg_lambda=1.0,
                verbose=-1, random_seed=1337)
    if t.startswith('1J'):
        base.update(dict(num_leaves=56, min_data_in_leaf=180, learning_rate=0.12, n_estimators=800))
    elif t.startswith('2J'):
        base.update(dict(num_leaves=96, min_data_in_leaf=100, learning_rate=0.10, n_estimators=1000))
    elif t.startswith('3J'):
        base.update(dict(num_leaves=128, min_data_in_leaf=50, learning_rate=0.08, n_estimators=1400))
    else:
        base.update(dict(num_leaves=96, min_data_in_leaf=100, learning_rate=0.10, n_estimators=1000))
    return base

def lmae_score_fast(y_true: np.ndarray, y_pred: np.ndarray, types: pd.Series, eps: float = 1e-9) -> float:
    df = pd.DataFrame({'y': y_true, 'p': y_pred, 'type': types})
    # Vectorized per-type MAE; avoids groupby.apply, which can emit a pandas deprecation warning
    mae_by_type = (df['y'].astype('float64') - df['p'].astype('float64')).abs().groupby(df['type']).mean()
    return float(np.log(mae_by_type.clip(lower=eps)).mean())

types_list = sorted(X_train['type'].unique())
types_order = [t for t in types_list if t.startswith('3J')] + [t for t in types_list if t.startswith('2J')] + [t for t in types_list if t.startswith('1J')]

oof_lgb2 = np.zeros(len(X_train), dtype=np.float32)
test_lgb2 = np.zeros(len(X_test), dtype=np.float32)
per_type_mae2 = {}

start_all = time.time()
for t in types_order:
    params = lgb_params_for_type_seed2(t)
    n_estimators_cap = params.pop('n_estimators')
    tr_mask = (X_train['type'] == t).values
    te_mask = (X_test['type'] == t).values
    X_t = X_train.loc[tr_mask, lgb_features].astype('float32')
    y_t = X_train.loc[tr_mask, 'scalar_coupling_constant'].astype('float32').values
    X_te_t = X_test.loc[te_mask, lgb_features].astype('float32')
    idx_t = np.where(tr_mask)[0]
    oof_t = np.zeros(X_t.shape[0], dtype=np.float32)
    pred_te_t = np.zeros(X_te_t.shape[0], dtype=np.float32)
    print(f"\n[LGBM-seed2] Type {t}: n_train={X_t.shape[0]} n_test={X_te_t.shape[0]}", flush=True)
    for fold_i, (tr_idx_all, va_idx_all) in enumerate(lgb_folds):
        tr_loc = np.intersect1d(idx_t, tr_idx_all, assume_unique=False)
        va_loc = np.intersect1d(idx_t, va_idx_all, assume_unique=False)
        tr_loc_local = np.searchsorted(idx_t, tr_loc)
        va_loc_local = np.searchsorted(idx_t, va_loc)
        if len(va_loc_local) == 0 or len(tr_loc_local) == 0:
            continue
        t0 = time.time()
        dtrain = lgb.Dataset(X_t.iloc[tr_loc_local, :], label=y_t[tr_loc_local], free_raw_data=False)
        dvalid = lgb.Dataset(X_t.iloc[va_loc_local, :], label=y_t[va_loc_local], reference=dtrain, free_raw_data=False)
        booster = lgb.train(params, dtrain, num_boost_round=int(n_estimators_cap),
                            valid_sets=[dtrain, dvalid], valid_names=['train','valid'],
                            callbacks=[lgb.early_stopping(stopping_rounds=100, verbose=False)])
        best_it = booster.best_iteration if booster.best_iteration is not None else booster.current_iteration()
        oof_t[va_loc_local] = booster.predict(X_t.iloc[va_loc_local, :], num_iteration=best_it).astype('float32')
        pred_te_t += booster.predict(X_te_t, num_iteration=best_it).astype('float32') / len(lgb_folds)
        dt = time.time() - t0
        mae_fold = mean_absolute_error(y_t[va_loc_local], oof_t[va_loc_local])
        print(f"  Fold {fold_i}: n_tr={len(tr_loc_local)} n_va={len(va_loc_local)} | MAE={mae_fold:.5f} | it={best_it} | {dt:.1f}s", flush=True)
    oof_lgb2[idx_t] = oof_t
    test_lgb2[te_mask] = pred_te_t
    mae_t = float(np.mean(np.abs(y_t - oof_t)))
    per_type_mae2[t] = mae_t
    print(f"[LGBM-seed2] Type {t}: MAE={mae_t:.6f}", flush=True)

overall_lmae2 = lmae_score_fast(X_train['scalar_coupling_constant'].values, oof_lgb2, X_train['type'])
print('\nPer-type MAE (LGBM seed2):', {k: round(v,6) for k,v in per_type_mae2.items()})
print(f"Overall OOF LMAE (LGBM seed2): {overall_lmae2:.6f} | elapsed {(time.time()-start_all)/60:.1f} min", flush=True)

# Save artifacts
np.save('oof_lgb2.npy', oof_lgb2.astype('float32'))
np.save('pred_test_lgb2.npy', test_lgb2.astype('float32'))
pd.Series(per_type_mae2).to_csv('per_type_mae_lgb2.csv')
print('Saved LGBM seed2 artifacts.')


[LGBM-seed2] Type 3JHC: n_train=1359077 n_test=152130


  Fold 0: n_tr=904819 n_va=454258 | MAE=1.17598 | it=1400 | 66.7s


  Fold 1: n_tr=906154 n_va=452923 | MAE=1.17036 | it=1400 | 43.7s


  Fold 2: n_tr=907181 n_va=451896 | MAE=1.17843 | it=1400 | 43.7s


[LGBM-seed2] Type 3JHC: MAE=1.174924



[LGBM-seed2] Type 3JHH: n_train=531224 n_test=59305


  Fold 0: n_tr=354056 n_va=177168 | MAE=0.70574 | it=1400 | 22.5s


  Fold 1: n_tr=353202 n_va=178022 | MAE=0.69505 | it=1400 | 22.2s


  Fold 2: n_tr=355190 n_va=176034 | MAE=0.70160 | it=1400 | 23.2s


[LGBM-seed2] Type 3JHH: MAE=0.700787



[LGBM-seed2] Type 3JHN: n_train=150067 n_test=16546


  Fold 0: n_tr=100305 n_va=49762 | MAE=0.36120 | it=1400 | 13.5s


  Fold 1: n_tr=99840 n_va=50227 | MAE=0.35730 | it=1400 | 13.1s


  Fold 2: n_tr=99989 n_va=50078 | MAE=0.36116 | it=1400 | 12.7s


[LGBM-seed2] Type 3JHN: MAE=0.359883



[LGBM-seed2] Type 2JHC: n_train=1026379 n_test=114488


  Fold 0: n_tr=683209 n_va=343170 | MAE=1.04606 | it=1000 | 37.6s


  Fold 1: n_tr=683994 n_va=342385 | MAE=1.03249 | it=1000 | 22.5s


  Fold 2: n_tr=685555 n_va=340824 | MAE=1.04061 | it=1000 | 24.0s


[LGBM-seed2] Type 2JHC: MAE=1.039725



[LGBM-seed2] Type 2JHH: n_train=340097 n_test=37891


  Fold 0: n_tr=226826 n_va=113271 | MAE=0.48368 | it=1000 | 11.5s


  Fold 1: n_tr=226656 n_va=113441 | MAE=0.47317 | it=1000 | 10.4s


  Fold 2: n_tr=226712 n_va=113385 | MAE=0.47891 | it=1000 | 12.3s


[LGBM-seed2] Type 2JHH: MAE=0.478584



[LGBM-seed2] Type 2JHN: n_train=107091 n_test=11968


  Fold 0: n_tr=71887 n_va=35204 | MAE=0.69451 | it=1000 | 6.6s


  Fold 1: n_tr=71216 n_va=35875 | MAE=0.68966 | it=1000 | 6.6s


  Fold 2: n_tr=71079 n_va=36012 | MAE=0.70006 | it=1000 | 7.2s


[LGBM-seed2] Type 2JHN: MAE=0.694750



[LGBM-seed2] Type 1JHC: n_train=637912 n_test=71221


  Fold 0: n_tr=425180 n_va=212732 | MAE=2.12692 | it=800 | 10.9s


  Fold 1: n_tr=425134 n_va=212778 | MAE=2.13200 | it=800 | 11.9s


  Fold 2: n_tr=425510 n_va=212402 | MAE=2.14268 | it=800 | 10.9s


[LGBM-seed2] Type 1JHC: MAE=2.133864



[LGBM-seed2] Type 1JHN: n_train=39416 n_test=4264


  Fold 0: n_tr=26428 n_va=12988 | MAE=0.87444 | it=800 | 3.1s


  Fold 1: n_tr=26195 n_va=13221 | MAE=0.83312 | it=800 | 7.9s


  Fold 2: n_tr=26209 n_va=13207 | MAE=0.85862 | it=800 | 2.9s


[LGBM-seed2] Type 1JHN: MAE=0.855279



Per-type MAE (LGBM seed2): {'3JHC': 1.174924, '3JHH': 0.700787, '3JHN': 0.359883, '2JHC': 1.039725, '2JHH': 0.478584, '2JHN': 0.69475, '1JHC': 2.133864, '1JHN': 0.855279}
Overall OOF LMAE (LGBM seed2): -0.209611 | elapsed 7.6 min


Saved LGBM seed2 artifacts.
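
Every fold in both LightGBM runs reports `it` equal to the per-type `n_estimators` cap (1400 for 3J, 1000 for 2J, 800 for 1J), which suggests early stopping never triggered and validation MAE may still have been improving when training ended. A small sketch for flagging this from the fold log lines follows; `hit_cap` and the sample list are illustrative helpers, with the lines copied from the output above.

```python
import re

# Sample fold lines copied from the seed-2 output, paired with the cap used for that type
log_lines = [
    '  Fold 0: n_tr=904819 n_va=454258 | MAE=1.17598 | it=1400 | 66.7s',  # 3JHC
    '  Fold 0: n_tr=683209 n_va=343170 | MAE=1.04606 | it=1000 | 37.6s',  # 2JHC
    '  Fold 0: n_tr=425180 n_va=212732 | MAE=2.12692 | it=800 | 10.9s',   # 1JHC
]
caps = [1400, 1000, 800]

def hit_cap(line: str, cap: int) -> bool:
    """True when the logged best iteration reached the boosting-round cap."""
    m = re.search(r'it=(\d+)', line)
    return m is not None and int(m.group(1)) >= cap

flags = [hit_cap(line, cap) for line, cap in zip(log_lines, caps)]
print('Folds that hit the cap:', sum(flags), 'of', len(flags))
```

If most folds hit the cap, raising `n_estimators` (or the learning rate) for that type is a cheap experiment before adding more features.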


In [42]:
# Blend two LGBM seeds per-type (grid search weights) and build submission
import numpy as np, pandas as pd, time, os

def lmae_score_fast(y_true: np.ndarray, y_pred: np.ndarray, types: pd.Series, eps: float = 1e-9) -> float:
    # Mean over coupling types of log(per-type MAE); eps floors a zero MAE before the log.
    # Grouping a plain Series avoids the pandas groupby.apply deprecation warning.
    abs_err = np.abs(np.asarray(y_true, dtype='float64') - np.asarray(y_pred, dtype='float64'))
    mae_by_type = pd.Series(abs_err).groupby(np.asarray(types)).mean()
    return float(np.log(mae_by_type.clip(lower=eps)).mean())

# Load artifacts (fallback to in-memory if present)
if os.path.exists('oof_lgb.npy') and os.path.exists('oof_lgb2.npy'):
    oof1 = np.load('oof_lgb.npy')
    oof2 = np.load('oof_lgb2.npy')
    te1 = np.load('pred_test_lgb.npy')
    te2 = np.load('pred_test_lgb2.npy')
else:
    oof1 = oof_lgb.copy(); oof2 = oof_lgb2.copy()
    te1 = test_lgb.copy(); te2 = test_lgb2.copy()

types = X_train['type'].values
test_types = X_test['type'].values
y = X_train['scalar_coupling_constant'].values.astype('float32')

w_grid = np.linspace(0.0, 1.0, 21)
blend_oof = np.zeros_like(oof1, dtype=np.float32)
blend_test = np.zeros_like(te1, dtype=np.float32)
w_per_type = {}

start = time.time()
for t in sorted(pd.unique(types)):
    m = (types == t)
    best_mae, best_w = 1e9, 0.5
    for w in w_grid:
        o = w*oof1[m] + (1.0-w)*oof2[m]
        mae = float(np.mean(np.abs(y[m] - o)))
        if mae < best_mae:
            best_mae, best_w = mae, float(w)
    w_per_type[t] = best_w
    blend_oof[m] = best_w*oof1[m] + (1.0-best_w)*oof2[m]
    mt = (test_types == t)
    blend_test[mt] = best_w*te1[mt] + (1.0-best_w)*te2[mt]
print('Per-type weights (LGBM seeds):', w_per_type)

overall_lmae = lmae_score_fast(y, blend_oof, X_train['type'])
print(f'OOF LMAE (LGBM seeds blend): {overall_lmae:.6f} | elapsed {(time.time()-start):.1f}s')

# Save blend artifacts and weights
np.save('oof_blend_lgb12.npy', blend_oof.astype('float32'))
np.save('pred_test_blend_lgb12.npy', blend_test.astype('float32'))
pd.Series(w_per_type).to_csv('weights_lgb12_per_type.csv')

# Build submission from blended LGBM seeds
sub_blend = pd.DataFrame({'id': X_test['id'].values, 'scalar_coupling_constant': blend_test.astype('float32')}).sort_values('id')
sub_blend.to_csv('submission.csv', index=False)
print('Saved blended submission.csv:', sub_blend.shape, 'head:\n', sub_blend.head())

Per-type weights (LGBM seeds): {'1JHC': 0.5, '1JHN': 0.45, '2JHC': 0.5, '2JHH': 0.5, '2JHN': 0.5, '3JHC': 0.5, '3JHH': 0.55, '3JHN': 0.5}
OOF LMAE (LGBM seeds blend): -0.223154 | elapsed 4.0s




Saved blended submission.csv: (467813, 2) head:
          id  scalar_coupling_constant
335622  276                112.234818
335623  277                  0.938787
335624  278                  6.387651
335625  279                  6.387651
335626  280                  0.938787


In [22]:
# XGBoost per-type 3-fold CPU training (hist) using FE v3 features for blend
import time, numpy as np, pandas as pd, xgboost as xgb
from sklearn.metrics import mean_absolute_error

assert 'lgb_features' in globals(), 'Run prep to build lgb_features first.'
assert 'lgb_folds' in globals(), 'Run prep to build lgb_folds first.'

def xgb_params_for_type(t: str):
    base = dict(objective='reg:squarederror', eval_metric='mae', device='cpu',
                tree_method='hist', max_bin=256, subsample=0.8, colsample_bytree=0.8, reg_lambda=1.0, seed=42)
    if t.startswith('1J'):
        rounds, es, md, mcw, eta = 400, 75, 6, 3.0, 0.10
    elif t.startswith('2J'):
        rounds, es, md, mcw, eta = 500, 75, 7, 2.0, 0.09
    elif t.startswith('3J'):
        rounds, es, md, mcw, eta = 600, 100, 8, 1.0, 0.08
    else:
        rounds, es, md, mcw, eta = 500, 75, 7, 2.0, 0.09
    base.update(dict(max_depth=md, min_child_weight=mcw, eta=eta))
    return base, rounds, es

def lmae_score_fast(y_true: np.ndarray, y_pred: np.ndarray, types: pd.Series, eps: float = 1e-9) -> float:
    # Mean over coupling types of log(per-type MAE); eps floors a zero MAE before the log.
    # Grouping a plain Series avoids the pandas groupby.apply deprecation warning.
    abs_err = np.abs(np.asarray(y_true, dtype='float64') - np.asarray(y_pred, dtype='float64'))
    mae_by_type = pd.Series(abs_err).groupby(np.asarray(types)).mean()
    return float(np.log(mae_by_type.clip(lower=eps)).mean())

types_list = sorted(X_train['type'].unique())
types_order = [t for t in types_list if t.startswith('3J')] + [t for t in types_list if t.startswith('2J')] + [t for t in types_list if t.startswith('1J')]

oof_xgb = np.zeros(len(X_train), dtype=np.float32)
test_xgb = np.zeros(len(X_test), dtype=np.float32)
per_type_mae_xgb = {}

start_all = time.time()
for t in types_order:
    params, num_rounds, es_rounds = xgb_params_for_type(t)
    tr_mask = (X_train['type'] == t).values
    te_mask = (X_test['type'] == t).values
    X_t = X_train.loc[tr_mask, lgb_features].astype('float32')
    y_t = X_train.loc[tr_mask, 'scalar_coupling_constant'].astype('float32').values
    X_te_t = X_test.loc[te_mask, lgb_features].astype('float32')
    idx_t = np.where(tr_mask)[0]
    oof_t = np.zeros(X_t.shape[0], dtype=np.float32)
    pred_te_t = np.zeros(X_te_t.shape[0], dtype=np.float32)
    dtest_t = xgb.DMatrix(X_te_t)
    print(f"\n[XGB] Type {t}: n_train={X_t.shape[0]} n_test={X_te_t.shape[0]}", flush=True)
    for fold_i, (tr_idx_all, va_idx_all) in enumerate(lgb_folds):
        tr_loc = np.intersect1d(idx_t, tr_idx_all, assume_unique=False)
        va_loc = np.intersect1d(idx_t, va_idx_all, assume_unique=False)
        tr_loc_local = np.searchsorted(idx_t, tr_loc)
        va_loc_local = np.searchsorted(idx_t, va_loc)
        if len(va_loc_local) == 0 or len(tr_loc_local) == 0:
            continue
        t0 = time.time()
        dtrain = xgb.DMatrix(X_t.iloc[tr_loc_local, :], label=y_t[tr_loc_local])
        dvalid = xgb.DMatrix(X_t.iloc[va_loc_local, :], label=y_t[va_loc_local])
        evals = [(dtrain, 'train'), (dvalid, 'valid')]
        booster = xgb.train(params=params, dtrain=dtrain, num_boost_round=int(num_rounds), evals=evals,
                            early_stopping_rounds=int(es_rounds), verbose_eval=100)
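        # best_iteration is set when early stopping is active; otherwise fall back to the last trained round.
        # (best_ntree_limit no longer exists in XGBoost 2.x, which the device='cpu' parameter requires.)
        best_iter = booster.best_iteration if getattr(booster, 'best_iteration', None) is not None else booster.num_boosted_rounds() - 1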
        oof_t[va_loc_local] = booster.predict(dvalid, iteration_range=(0, int(best_iter)+1)).astype('float32')
        pred_te_t += booster.predict(dtest_t, iteration_range=(0, int(best_iter)+1)).astype('float32') / len(lgb_folds)
        dt = time.time() - t0
        mae_fold = mean_absolute_error(y_t[va_loc_local], oof_t[va_loc_local])
        print(f"  Fold {fold_i}: n_tr={len(tr_loc_local)} n_va={len(va_loc_local)} | MAE={mae_fold:.5f} | it={best_iter} | {dt:.1f}s", flush=True)
    oof_xgb[idx_t] = oof_t
    test_xgb[te_mask] = pred_te_t
    mae_t = float(np.mean(np.abs(y_t - oof_t)))
    per_type_mae_xgb[t] = mae_t
    print(f"[XGB] Type {t}: MAE={mae_t:.6f}", flush=True)

overall_lmae_xgb = lmae_score_fast(X_train['scalar_coupling_constant'].values, oof_xgb, X_train['type'])
print('\nPer-type MAE (XGB):', {k: round(v,6) for k,v in per_type_mae_xgb.items()})
print(f"Overall OOF LMAE (XGB): {overall_lmae_xgb:.6f} | elapsed {(time.time()-start_all)/60:.1f} min", flush=True)

# Save artifacts
np.save('oof_xgb.npy', oof_xgb.astype('float32'))
np.save('pred_test_xgb.npy', test_xgb.astype('float32'))
pd.Series(per_type_mae_xgb).to_csv('per_type_mae_xgb.csv')


[XGB] Type 3JHC: n_train=1359077 n_test=152130


[0]	train-mae:2.33293	valid-mae:2.34425


[100]	train-mae:0.76605	valid-mae:0.78738


[200]	train-mae:0.69413	valid-mae:0.72556


[300]	train-mae:0.64295	valid-mae:0.68376


[400]	train-mae:0.60855	valid-mae:0.65756


[500]	train-mae:0.58209	valid-mae:0.63858


[599]	train-mae:0.55913	valid-mae:0.62252


  Fold 0: n_tr=904819 n_va=454258 | MAE=0.62252 | it=599 | 33.3s


[0]	train-mae:2.33949	valid-mae:2.33683


[100]	train-mae:0.76873	valid-mae:0.77624


[200]	train-mae:0.69328	valid-mae:0.71163


[300]	train-mae:0.64618	valid-mae:0.67359


[400]	train-mae:0.61002	valid-mae:0.64623


[500]	train-mae:0.58220	valid-mae:0.62663


[599]	train-mae:0.55997	valid-mae:0.61174


  Fold 1: n_tr=906154 n_va=452923 | MAE=0.61174 | it=599 | 30.9s


[0]	train-mae:2.34003	valid-mae:2.33199


[100]	train-mae:0.76256	valid-mae:0.78177


[200]	train-mae:0.68756	valid-mae:0.71674


[300]	train-mae:0.64005	valid-mae:0.67805


[400]	train-mae:0.60751	valid-mae:0.65351


[500]	train-mae:0.57932	valid-mae:0.63262


[599]	train-mae:0.55693	valid-mae:0.61729


  Fold 2: n_tr=907181 n_va=451896 | MAE=0.61729 | it=599 | 31.2s


[XGB] Type 3JHC: MAE=0.617185



[XGB] Type 3JHH: n_train=531224 n_test=59305


[0]	train-mae:2.84973	valid-mae:2.85172


[100]	train-mae:0.41705	valid-mae:0.44104


[200]	train-mae:0.36782	valid-mae:0.40428


[300]	train-mae:0.33548	valid-mae:0.38248


[400]	train-mae:0.31111	valid-mae:0.36816


[500]	train-mae:0.29101	valid-mae:0.35758


[599]	train-mae:0.27468	valid-mae:0.34984


  Fold 0: n_tr=354056 n_va=177168 | MAE=0.34984 | it=599 | 14.3s


[0]	train-mae:2.84584	valid-mae:2.85225


[100]	train-mae:0.42195	valid-mae:0.43894


[200]	train-mae:0.37328	valid-mae:0.40293


[300]	train-mae:0.33952	valid-mae:0.38064


[400]	train-mae:0.31409	valid-mae:0.36562


[500]	train-mae:0.29404	valid-mae:0.35505


[599]	train-mae:0.27683	valid-mae:0.34653


  Fold 1: n_tr=353202 n_va=178022 | MAE=0.34653 | it=599 | 15.6s


[0]	train-mae:2.85058	valid-mae:2.84290


[100]	train-mae:0.41817	valid-mae:0.43816


[200]	train-mae:0.37084	valid-mae:0.40375


[300]	train-mae:0.33790	valid-mae:0.38135


[400]	train-mae:0.31313	valid-mae:0.36711


[500]	train-mae:0.29312	valid-mae:0.35614


[599]	train-mae:0.27685	valid-mae:0.34879


  Fold 2: n_tr=355190 n_va=176034 | MAE=0.34879 | it=599 | 15.2s


[XGB] Type 3JHH: MAE=0.348380



[XGB] Type 3JHN: n_train=150067 n_test=16546


[0]	train-mae:0.91096	valid-mae:0.91278


[100]	train-mae:0.21279	valid-mae:0.24696


[200]	train-mae:0.17141	valid-mae:0.22127


[300]	train-mae:0.14658	valid-mae:0.20887


[400]	train-mae:0.12880	valid-mae:0.20181


[500]	train-mae:0.11471	valid-mae:0.19668


[599]	train-mae:0.10241	valid-mae:0.19270


  Fold 0: n_tr=100305 n_va=49762 | MAE=0.19270 | it=599 | 6.9s


[0]	train-mae:0.91280	valid-mae:0.91013


[100]	train-mae:0.21610	valid-mae:0.24287


[200]	train-mae:0.17436	valid-mae:0.21803


[300]	train-mae:0.14908	valid-mae:0.20570


[400]	train-mae:0.13039	valid-mae:0.19836


[500]	train-mae:0.11546	valid-mae:0.19308


[599]	train-mae:0.10324	valid-mae:0.18948


  Fold 1: n_tr=99840 n_va=50227 | MAE=0.18948 | it=599 | 7.0s


[0]	train-mae:0.90943	valid-mae:0.91138


[100]	train-mae:0.21601	valid-mae:0.24461


[200]	train-mae:0.17483	valid-mae:0.22002


[300]	train-mae:0.14977	valid-mae:0.20819


[400]	train-mae:0.13083	valid-mae:0.20052


[500]	train-mae:0.11590	valid-mae:0.19550


[599]	train-mae:0.10364	valid-mae:0.19168


  Fold 2: n_tr=99989 n_va=50078 | MAE=0.19168 | it=599 | 7.7s


[XGB] Type 3JHN: MAE=0.191284



[XGB] Type 2JHC: n_train=1026379 n_test=114488


[0]	train-mae:2.56576	valid-mae:2.57901


[100]	train-mae:0.97656	valid-mae:0.99747


[200]	train-mae:0.86980	valid-mae:0.90013


[300]	train-mae:0.80686	valid-mae:0.84612


[400]	train-mae:0.76087	valid-mae:0.80786


[499]	train-mae:0.72860	valid-mae:0.78304


  Fold 0: n_tr=683209 n_va=343170 | MAE=0.78304 | it=499 | 19.7s


[0]	train-mae:2.57610	valid-mae:2.55824


[100]	train-mae:0.97902	valid-mae:0.98742


[200]	train-mae:0.87223	valid-mae:0.89284


[300]	train-mae:0.80962	valid-mae:0.84041


[400]	train-mae:0.76382	valid-mae:0.80354


[499]	train-mae:0.73119	valid-mae:0.77909


  Fold 1: n_tr=683994 n_va=342385 | MAE=0.77909 | it=499 | 19.1s


[0]	train-mae:2.56636	valid-mae:2.57148


[100]	train-mae:0.97525	valid-mae:0.99192


[200]	train-mae:0.87022	valid-mae:0.89755


[300]	train-mae:0.80883	valid-mae:0.84543


[400]	train-mae:0.76468	valid-mae:0.80940


[499]	train-mae:0.72808	valid-mae:0.78092


  Fold 2: n_tr=685555 n_va=340824 | MAE=0.78092 | it=499 | 18.2s


[XGB] Type 2JHC: MAE=0.781017



[XGB] Type 2JHH: n_train=340097 n_test=37891


[0]	train-mae:2.46264	valid-mae:2.44768


[100]	train-mae:0.45545	valid-mae:0.48345


[200]	train-mae:0.41289	valid-mae:0.45539


[300]	train-mae:0.38269	valid-mae:0.43804


[400]	train-mae:0.35900	valid-mae:0.42578


[499]	train-mae:0.33831	valid-mae:0.41605


  Fold 0: n_tr=226826 n_va=113271 | MAE=0.41605 | it=499 | 8.6s


[0]	train-mae:2.45177	valid-mae:2.45648


[100]	train-mae:0.46201	valid-mae:0.48172


[200]	train-mae:0.41786	valid-mae:0.45156


[300]	train-mae:0.38737	valid-mae:0.43385


[400]	train-mae:0.36140	valid-mae:0.42035


[499]	train-mae:0.34147	valid-mae:0.41135


  Fold 1: n_tr=226656 n_va=113441 | MAE=0.41135 | it=499 | 7.9s


[0]	train-mae:2.44990	valid-mae:2.46098


[100]	train-mae:0.45599	valid-mae:0.48430


[200]	train-mae:0.41176	valid-mae:0.45431


[300]	train-mae:0.38214	valid-mae:0.43788


[400]	train-mae:0.35731	valid-mae:0.42500


[499]	train-mae:0.33702	valid-mae:0.41549


  Fold 2: n_tr=226712 n_va=113385 | MAE=0.41549 | it=499 | 8.8s


[XGB] Type 2JHH: MAE=0.414293



[XGB] Type 2JHN: n_train=107091 n_test=11968


[0]	train-mae:2.73944	valid-mae:2.73060


[100]	train-mae:0.37025	valid-mae:0.41548


[200]	train-mae:0.30480	valid-mae:0.37236


[300]	train-mae:0.26503	valid-mae:0.35126


[400]	train-mae:0.23531	valid-mae:0.33846


[499]	train-mae:0.21174	valid-mae:0.32985


  Fold 0: n_tr=71887 n_va=35204 | MAE=0.32985 | it=499 | 4.0s


[0]	train-mae:2.73487	valid-mae:2.74427


[100]	train-mae:0.36913	valid-mae:0.40974


[200]	train-mae:0.30296	valid-mae:0.36596


[300]	train-mae:0.26337	valid-mae:0.34583


[400]	train-mae:0.23376	valid-mae:0.33343


[499]	train-mae:0.21069	valid-mae:0.32524


  Fold 1: n_tr=71216 n_va=35875 | MAE=0.32524 | it=499 | 4.0s


[0]	train-mae:2.73927	valid-mae:2.74133


[100]	train-mae:0.36760	valid-mae:0.41441


[200]	train-mae:0.30053	valid-mae:0.37150


[300]	train-mae:0.26323	valid-mae:0.35259


[400]	train-mae:0.23259	valid-mae:0.33927


[499]	train-mae:0.20955	valid-mae:0.33083


  Fold 2: n_tr=71079 n_va=36012 | MAE=0.33083 | it=499 | 4.1s


[XGB] Type 2JHN: MAE=0.328634



[XGB] Type 1JHC: n_train=637912 n_test=71221


[0]	train-mae:11.66550	valid-mae:11.69200


[100]	train-mae:2.52076	valid-mae:2.56098


[200]	train-mae:2.32513	valid-mae:2.39185


[300]	train-mae:2.19672	valid-mae:2.28788


[399]	train-mae:2.11072	valid-mae:2.22350


  Fold 0: n_tr=425180 n_va=212732 | MAE=2.22350 | it=399 | 10.0s


[0]	train-mae:11.68956	valid-mae:11.66829


[100]	train-mae:2.50340	valid-mae:2.55365


[200]	train-mae:2.31115	valid-mae:2.38642


[300]	train-mae:2.18640	valid-mae:2.28367


[399]	train-mae:2.10068	valid-mae:2.21814


  Fold 1: n_tr=425134 n_va=212778 | MAE=2.21814 | it=399 | 9.0s


[0]	train-mae:11.66700	valid-mae:11.66253


[100]	train-mae:2.51457	valid-mae:2.54407


[200]	train-mae:2.32268	valid-mae:2.37730


[300]	train-mae:2.19993	valid-mae:2.27876


[399]	train-mae:2.11055	valid-mae:2.21144


  Fold 2: n_tr=425510 n_va=212402 | MAE=2.21144 | it=399 | 10.4s


[XGB] Type 1JHC: MAE=2.217698



[XGB] Type 1JHN: n_train=39416 n_test=4264


[0]	train-mae:8.75259	valid-mae:8.78045


[100]	train-mae:0.77606	valid-mae:0.92449


[200]	train-mae:0.63625	valid-mae:0.85691


[300]	train-mae:0.54266	valid-mae:0.82113


[399]	train-mae:0.47396	valid-mae:0.80207


  Fold 0: n_tr=26428 n_va=12988 | MAE=0.80207 | it=399 | 1.8s


[0]	train-mae:8.75876	valid-mae:8.76716


[100]	train-mae:0.78011	valid-mae:0.91308


[200]	train-mae:0.64275	valid-mae:0.84481


[300]	train-mae:0.54692	valid-mae:0.80831


[399]	train-mae:0.47448	valid-mae:0.78809


  Fold 1: n_tr=26195 n_va=13221 | MAE=0.78809 | it=399 | 1.8s


[0]	train-mae:8.77242	valid-mae:8.74090


[100]	train-mae:0.76650	valid-mae:0.90711


[200]	train-mae:0.63133	valid-mae:0.83761


[300]	train-mae:0.54029	valid-mae:0.80273


[399]	train-mae:0.47008	valid-mae:0.78446


  Fold 2: n_tr=26209 n_va=13207 | MAE=0.78446 | it=399 | 1.8s


[XGB] Type 1JHN: MAE=0.791480



Per-type MAE (XGB): {'3JHC': 0.617185, '3JHH': 0.34838, '3JHN': 0.191284, '2JHC': 0.781017, '2JHH': 0.414293, '2JHN': 0.328634, '1JHC': 2.217698, '1JHN': 0.79148}
Overall OOF LMAE (XGB): -0.608697 | elapsed 5.0 min
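
As a sanity check on the metric, the overall LMAE can be recomputed from the per-type MAEs in the log line above as the mean of their natural logs. A minimal sketch; the dictionary is copied from that log line, and `lmae_from_per_type` is a hypothetical helper that is not part of the notebook:

```python
import math

# Per-type OOF MAEs copied from the "Per-type MAE (XGB)" log line.
per_type_mae_xgb = {
    '3JHC': 0.617185, '3JHH': 0.34838, '3JHN': 0.191284,
    '2JHC': 0.781017, '2JHH': 0.414293, '2JHN': 0.328634,
    '1JHC': 2.217698, '1JHN': 0.79148,
}

def lmae_from_per_type(mae_by_type, eps=1e-9):
    """Mean of log(MAE) over coupling types, using the same eps floor as lmae_score_fast."""
    logs = [math.log(max(v, eps)) for v in mae_by_type.values()]
    return sum(logs) / len(logs)

lmae_xgb = lmae_from_per_type(per_type_mae_xgb)
print(f"Recomputed LMAE (XGB): {lmae_xgb:.6f}")
```

The result should agree with the logged `-0.608697` up to the rounding of the printed per-type values. 1JHC is the only type with MAE above 1, so it is the only positive term in the average.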




In [23]:
# Blend LGBM-seed blend with XGB per-type and write submission
import numpy as np, pandas as pd, os, time

def lmae_score_fast(y_true, y_pred, types, eps: float = 1e-9):
    # Mean over coupling types of log(per-type MAE); eps floors a zero MAE before the log.
    # Grouping a plain Series avoids the pandas groupby.apply deprecation warning.
    abs_err = np.abs(np.asarray(y_true, dtype='float64') - np.asarray(y_pred, dtype='float64'))
    mae_by_type = pd.Series(abs_err).groupby(np.asarray(types)).mean()
    return float(np.log(mae_by_type.clip(lower=eps)).mean())

# Load artifacts
oof_lgb_blend = np.load('oof_blend_lgb12.npy') if os.path.exists('oof_blend_lgb12.npy') else np.load('oof_lgb.npy')
pred_lgb_blend = np.load('pred_test_blend_lgb12.npy') if os.path.exists('pred_test_blend_lgb12.npy') else np.load('pred_test_lgb.npy')
oof_xgb = np.load('oof_xgb.npy')
pred_xgb = np.load('pred_test_xgb.npy')

types = X_train['type'].values
test_types = X_test['type'].values
y = X_train['scalar_coupling_constant'].values.astype('float32')

w_grid = np.linspace(0.0, 1.0, 21)  # step 0.05
blend_oof = np.zeros_like(oof_xgb, dtype=np.float32)
blend_test = np.zeros_like(pred_xgb, dtype=np.float32)
w_per_type = {}

t0 = time.time()
for t in sorted(pd.unique(types)):
    m = (types == t)
    best_mae, best_w = 1e9, 0.6  # default prior
    for w in w_grid:
        o = w*oof_xgb[m] + (1.0-w)*oof_lgb_blend[m]
        mae = float(np.mean(np.abs(y[m] - o)))
        if mae < best_mae:
            best_mae, best_w = mae, float(w)
    w_per_type[t] = best_w
    blend_oof[m] = best_w*oof_xgb[m] + (1.0-best_w)*oof_lgb_blend[m]
    mt = (test_types == t)
    blend_test[mt] = best_w*pred_xgb[mt] + (1.0-best_w)*pred_lgb_blend[mt]

overall_lmae = lmae_score_fast(y, blend_oof, X_train['type'])
print('Per-type weights (XGB vs LGBblend):', w_per_type)
print(f'OOF LMAE (XGB+LGBblend): {overall_lmae:.6f} | elapsed {time.time()-t0:.1f}s')

# Save and build submission
np.save('oof_blend_xgb_lgb12.npy', blend_oof.astype('float32'))
np.save('pred_test_blend_xgb_lgb12.npy', blend_test.astype('float32'))
pd.Series(w_per_type).to_csv('weights_xgb_vs_lgb12_per_type.csv')
sub = pd.DataFrame({'id': X_test['id'].values, 'scalar_coupling_constant': blend_test.astype('float32')}).sort_values('id')
sub.to_csv('submission.csv', index=False)
print('Saved final blended submission.csv:', sub.shape, 'head:\n', sub.head())



Per-type weights (XGB vs LGBblend): {'1JHC': 0.0, '1JHN': 0.65, '2JHC': 0.05, '2JHH': 0.25, '2JHN': 0.45, '3JHC': 0.15000000000000002, '3JHH': 0.30000000000000004, '3JHN': 0.55}
OOF LMAE (XGB+LGBblend): -0.671321 | elapsed 4.0s


Saved final blended submission.csv: (467813, 2) head:
          id  scalar_coupling_constant
335622  276                102.595863
335623  277                  6.773271
335624  278                  1.953415
335625  279                  1.953415
335626  280                  6.773271
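
The per-type weight search in the cell above evaluates 21 candidate weights in a Python loop. The same search can be written as one broadcasted NumPy expression. A minimal sketch on synthetic data; `best_blend_weight` is a hypothetical helper name, not something defined in the notebook:

```python
import numpy as np

def best_blend_weight(y, pred_a, pred_b, grid=None):
    """Return (weight on pred_a, MAE) minimising the MAE of w*pred_a + (1-w)*pred_b over the grid."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)  # step 0.05, as in the notebook cell
    # Shape (n_weights, n_rows): one blended prediction vector per candidate weight.
    blends = grid[:, None] * pred_a[None, :] + (1.0 - grid[:, None]) * pred_b[None, :]
    maes = np.abs(blends - y[None, :]).mean(axis=1)
    k = int(np.argmin(maes))  # first minimum, matching the strict '<' in the loop version
    return float(grid[k]), float(maes[k])

# Toy data (synthetic, not from the competition): pred_a is exact, pred_b is biased by +1.
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
w_toy, mae_toy = best_blend_weight(y_toy, y_toy.copy(), y_toy + 1.0)
print(w_toy, mae_toy)
```

The broadcasted form materialises an `(n_weights, n_rows)` array, so for the largest types (over a million rows) the loop may still be the more memory-friendly choice. In either form the weights are selected and scored on the same OOF rows, so the blended OOF LMAE is likely to be slightly optimistic.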


In [24]:
# FE v3 interactions: angle/dihedral × inv_d with path masks
import numpy as np, pandas as pd, time

t0 = time.time()
assert 'X_train' in globals() and 'X_test' in globals(), 'Run FE cells first'

# Masks
pl2_tr = (X_train['path_len'].astype('int16') == 2).astype('float32')
pl3_tr = (X_train['path_len'].astype('int16') == 3).astype('float32')
pl2_te = (X_test['path_len'].astype('int16') == 2).astype('float32')
pl3_te = (X_test['path_len'].astype('int16') == 3).astype('float32')

def add_interactions(df, pl2, pl3):
    # angle terms (only for path_len==2)
    for base in ['angle_cos','angle_sin','angle_rad']:
        if base in df.columns:
            df[f'{base}_inv_d'] = (df[base].astype('float32') * df['inv_d'].astype('float32') * pl2).astype('float32')
    # dihedral terms (only for path_len==3) incl. cos2
    for base in ['dih_cos','dih_sin','dih_cos2']:
        if base in df.columns:
            df[f'{base}_inv_d'] = (df[base].astype('float32') * df['inv_d'].astype('float32') * pl3).astype('float32')
    return df

X_train = add_interactions(X_train, pl2_tr, pl3_tr)
X_test  = add_interactions(X_test,  pl2_te, pl3_te)

# Update lgb_features to include new cols if present
new_cols = [
    'angle_cos_inv_d','angle_sin_inv_d','angle_rad_inv_d',
    'dih_cos_inv_d','dih_sin_inv_d','dih_cos2_inv_d'
]
if 'lgb_features' in globals():
    for c in new_cols:
        if c in X_train.columns and c in X_test.columns and c not in lgb_features:
            lgb_features.append(c)

print('Added FE v3 interaction features. Now lgb_features:', len(lgb_features) if 'lgb_features' in globals() else 'N/A', '| elapsed s:', round(time.time()-t0, 2))

Added FE v3 interaction features. Now lgb_features: 81 | elapsed s: 0.07
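
One subtlety of masking by multiplication, as `add_interactions` does: if the base feature is NaN on rows outside the mask, the product stays NaN, because `NaN * 0.0` is NaN. Whether `angle_cos` or `dih_cos` actually contain NaN off-path in this notebook is not shown here, so the following is only a sketch on a synthetic frame:

```python
import numpy as np
import pandas as pd

# Toy frame (synthetic): angle_cos is only defined on the path_len == 2 row.
toy = pd.DataFrame({
    "path_len": [1, 2, 3],
    "angle_cos": [np.nan, -0.5, np.nan],
    "inv_d": [0.9, 0.5, 0.4],
})
mask = toy["path_len"] == 2

# Multiplying by a 0/1 mask keeps NaN on the masked-out rows.
by_product = toy["angle_cos"] * toy["inv_d"] * mask.astype("float32")

# np.where writes an explicit 0.0 wherever the mask is False.
by_where = pd.Series(
    np.where(mask, toy["angle_cos"] * toy["inv_d"], 0.0), index=toy.index
)

print(by_product.tolist())
print(by_where.tolist())
```

Tree libraries such as LightGBM and XGBoost handle NaN natively, so either encoding may be acceptable; the point is only that the two are not equivalent.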


In [25]:
# Quick retrain LGBM (both seeds) on high-ROI types with new interaction features, then refresh artifacts
import numpy as np, pandas as pd, time, os  # os is needed by train_update (os.path.exists)
from sklearn.metrics import mean_absolute_error

import lightgbm as lgb
assert 'lgb_features' in globals() and 'lgb_folds' in globals(), 'Run prep cell first'

types_to_retrain = ['3JHC','3JHN','3JHH','2JHH']

def params_seed1(t):
    base = dict(objective='mae', metric='mae', boosting_type='gbdt',
                n_jobs=-1, feature_fraction=0.8, bagging_fraction=0.8, bagging_freq=1, max_bin=256, reg_lambda=1.0, verbose=-1)
    if t.startswith('1J'):
        base.update(dict(num_leaves=56, min_data_in_leaf=180, learning_rate=0.12, n_estimators=800))
    elif t.startswith('2J'):
        base.update(dict(num_leaves=96, min_data_in_leaf=100, learning_rate=0.10, n_estimators=1000))
    elif t.startswith('3J'):
        base.update(dict(num_leaves=128, min_data_in_leaf=50, learning_rate=0.08, n_estimators=1400))
    else:
        base.update(dict(num_leaves=96, min_data_in_leaf=100, learning_rate=0.10, n_estimators=1000))
    return base

def params_seed2(t):
    base = dict(objective='mae', metric='mae', boosting_type='gbdt',
                n_jobs=-1, feature_fraction=0.75, bagging_fraction=0.85, bagging_freq=1, max_bin=256, reg_lambda=1.0, verbose=-1, random_seed=1337)
    if t.startswith('1J'):
        base.update(dict(num_leaves=56, min_data_in_leaf=180, learning_rate=0.12, n_estimators=800))
    elif t.startswith('2J'):
        base.update(dict(num_leaves=96, min_data_in_leaf=100, learning_rate=0.10, n_estimators=1000))
    elif t.startswith('3J'):
        base.update(dict(num_leaves=128, min_data_in_leaf=50, learning_rate=0.08, n_estimators=1400))
    else:
        base.update(dict(num_leaves=96, min_data_in_leaf=100, learning_rate=0.10, n_estimators=1000))
    return base

def train_update(types_subset, seed=1):
    oof_path = 'oof_lgb2.npy' if seed==2 else 'oof_lgb.npy'
    te_path  = 'pred_test_lgb2.npy' if seed==2 else 'pred_test_lgb.npy'
    if not (os.path.exists(oof_path) and os.path.exists(te_path)):
        print('Missing existing artifacts to update; aborting retrain for seed', seed)
        return
    oof_all = np.load(oof_path)
    te_all  = np.load(te_path)
    start = time.time()
    for t in types_subset:
        params = params_seed2(t) if seed==2 else params_seed1(t)
        n_estimators_cap = params.pop('n_estimators')
        tr_mask = (X_train['type'] == t).values
        te_mask = (X_test['type'] == t).values
        X_t = X_train.loc[tr_mask, lgb_features].astype('float32')
        y_t = X_train.loc[tr_mask, 'scalar_coupling_constant'].astype('float32').values
        X_te_t = X_test.loc[te_mask, lgb_features].astype('float32')
        idx_t = np.where(tr_mask)[0]
        oof_t = np.zeros(X_t.shape[0], dtype=np.float32)
        pred_te_t = np.zeros(X_te_t.shape[0], dtype=np.float32)
        print(f"\n[LGBM seed{seed} retrain] {t}: n_train={X_t.shape[0]} n_test={X_te_t.shape[0]}", flush=True)
        for fold_i, (tr_idx_all, va_idx_all) in enumerate(lgb_folds):
            tr_loc = np.intersect1d(idx_t, tr_idx_all, assume_unique=False)
            va_loc = np.intersect1d(idx_t, va_idx_all, assume_unique=False)
            tr_loc_local = np.searchsorted(idx_t, tr_loc)
            va_loc_local = np.searchsorted(idx_t, va_loc)
            if len(va_loc_local) == 0 or len(tr_loc_local) == 0:
                continue
            dtrain = lgb.Dataset(X_t.iloc[tr_loc_local, :], label=y_t[tr_loc_local], free_raw_data=False)
            dvalid = lgb.Dataset(X_t.iloc[va_loc_local, :], label=y_t[va_loc_local], reference=dtrain, free_raw_data=False)
            booster = lgb.train(params, dtrain, num_boost_round=int(n_estimators_cap),
                                valid_sets=[dtrain, dvalid], valid_names=['train','valid'],
                                callbacks=[lgb.early_stopping(stopping_rounds=100, verbose=False)])
            best_it = booster.best_iteration if booster.best_iteration is not None else booster.current_iteration()
            oof_t[va_loc_local] = booster.predict(X_t.iloc[va_loc_local, :], num_iteration=best_it).astype('float32')
            pred_te_t += booster.predict(X_te_t, num_iteration=best_it).astype('float32') / len(lgb_folds)
            mae_fold = mean_absolute_error(y_t[va_loc_local], oof_t[va_loc_local])
            print(f"  Fold {fold_i}: MAE={mae_fold:.5f} | it={best_it}", flush=True)
        # write back updates
        oof_all[idx_t] = oof_t
        te_all[te_mask] = pred_te_t
        mae_t = float(np.mean(np.abs(y_t - oof_t)))
        print(f"[LGBM seed{seed} retrain] {t}: MAE={mae_t:.6f}", flush=True)
    # save updated artifacts
    np.save(oof_path, oof_all.astype('float32'))
    np.save(te_path, te_all.astype('float32'))
    print(f'Updated artifacts for seed{seed} in {(time.time()-start)/60:.1f} min')

import os
train_update(types_to_retrain, seed=1)
train_update(types_to_retrain, seed=2)


[LGBM seed1 retrain] 3JHC: n_train=1359077 n_test=152130


  Fold 0: MAE=0.59310 | it=1400


  Fold 1: MAE=0.58281 | it=1400


  Fold 2: MAE=0.58979 | it=1400


[LGBM seed1 retrain] 3JHC: MAE=0.588570



[LGBM seed1 retrain] 3JHN: n_train=150067 n_test=16546


  Fold 0: MAE=0.20330 | it=1400


  Fold 1: MAE=0.19683 | it=1400


  Fold 2: MAE=0.19772 | it=1400


[LGBM seed1 retrain] 3JHN: MAE=0.199271



[LGBM seed1 retrain] 3JHH: n_train=531224 n_test=59305


  Fold 0: MAE=0.34180 | it=1400


  Fold 1: MAE=0.33729 | it=1400


  Fold 2: MAE=0.33886 | it=1400


[LGBM seed1 retrain] 3JHH: MAE=0.339315



[LGBM seed1 retrain] 2JHH: n_train=340097 n_test=37891


  Fold 0: MAE=0.39880 | it=1000


  Fold 1: MAE=0.39004 | it=1000


  Fold 2: MAE=0.39573 | it=1000


[LGBM seed1 retrain] 2JHH: MAE=0.394853


Updated artifacts for seed1 in 5.3 min



[LGBM seed2 retrain] 3JHC: n_train=1359077 n_test=152130


  Fold 0: MAE=0.59046 | it=1400


  Fold 1: MAE=0.58382 | it=1400


  Fold 2: MAE=0.59029 | it=1400


[LGBM seed2 retrain] 3JHC: MAE=0.588191



[LGBM seed2 retrain] 3JHN: n_train=150067 n_test=16546


  Fold 0: MAE=0.20191 | it=1400


  Fold 1: MAE=0.19575 | it=1400


  Fold 2: MAE=0.19768 | it=1400


[LGBM seed2 retrain] 3JHN: MAE=0.198435



[LGBM seed2 retrain] 3JHH: n_train=531224 n_test=59305


  Fold 0: MAE=0.34223 | it=1400


  Fold 1: MAE=0.33681 | it=1400


  Fold 2: MAE=0.33848 | it=1400


[LGBM seed2 retrain] 3JHH: MAE=0.339171



[LGBM seed2 retrain] 2JHH: n_train=340097 n_test=37891


  Fold 0: MAE=0.39970 | it=1000


  Fold 1: MAE=0.38853 | it=1000


  Fold 2: MAE=0.39705 | it=1000


[LGBM seed2 retrain] 2JHH: MAE=0.395091


Updated artifacts for seed2 in 5.9 min
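
Both training loops translate the global fold indices in `lgb_folds` into positions inside the per-type arrays with `np.intersect1d` followed by `np.searchsorted`. A minimal sketch of that mapping on synthetic indices (the toy values are made up for illustration):

```python
import numpy as np

# Toy setup (synthetic): rows of the current type sit at these global positions.
idx_t = np.array([1, 4, 5, 8])       # sorted, as returned by np.where(tr_mask)[0]
va_idx_all = np.array([0, 4, 8, 9])  # global validation indices for one fold

# Global indices of this type that fall in the validation fold.
va_loc = np.intersect1d(idx_t, va_idx_all)
# Positions of those rows inside the per-type arrays (X_t, y_t, oof_t).
va_loc_local = np.searchsorted(idx_t, va_loc)

print(va_loc, va_loc_local)
```

This relies on `idx_t` being sorted, which `np.where` guarantees, and on every element of `va_loc` being present in `idx_t`, which the intersection guarantees. Under those conditions `idx_t[va_loc_local]` recovers `va_loc` exactly.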


In [29]:
# Diagnostics: path_len sanity vs coupling type and geometry coverage
import numpy as np, pandas as pd

def diag_pathlen(df: pd.DataFrame, name: str):
    print(f"\n=== Path length diagnostics: {name} ===", flush=True)
    types = sorted(df['type'].unique())
    for t in types:
        dft = df[df['type']==t]
        vc = dft['path_len'].value_counts(dropna=False).sort_index()
        total = len(dft)
        pl1 = float((dft['path_len']==1).mean()) if total else 0.0
        pl2 = float((dft['path_len']==2).mean()) if total else 0.0
        pl3 = float((dft['path_len']==3).mean()) if total else 0.0
        print(f"{t}: n={total} | path_len dist: {vc.to_dict()} | P(pl=1)={pl1:.3f} P(pl=2)={pl2:.3f} P(pl=3)={pl3:.3f}", flush=True)

def diag_geometry_coverage(df: pd.DataFrame, name: str):
    print(f"\n=== Geometry feature coverage: {name} ===", flush=True)
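    pl2 = (df['path_len']==2)
    pl3 = (df['path_len']==3)
    # NOTE: the denominators below are ALL rows of df, so these are joint fractions
    # P(path_len == k and feature present), not coverage conditional on path_len == k.
    ang_cov = float((pl2 & df['angle_cos'].notna()).mean()) if len(df) else 0.0
    dih_cov = float((pl3 & df['dih_cos'].notna()).mean()) if len(df) else 0.0
    print(f"Angle coverage on pl=2: {ang_cov:.3f} | Dihedral coverage on pl=3: {dih_cov:.3f}", flush=True)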

# Run diagnostics on a 200k-row sample for speed
idx_tr = np.random.RandomState(42).choice(len(X_train), size=min(200_000, len(X_train)), replace=False)
idx_te = np.random.RandomState(1337).choice(len(X_test), size=min(100_000, len(X_test)), replace=False)
diag_pathlen(X_train.iloc[idx_tr], 'train(sample)')
diag_pathlen(X_test.iloc[idx_te], 'test(sample)')
diag_geometry_coverage(X_train.iloc[idx_tr], 'train(sample)')
diag_geometry_coverage(X_test.iloc[idx_te], 'test(sample)')


=== Path length diagnostics: train(sample) ===


1JHC: n=30564 | path_len dist: {1: 30564} | P(pl=1)=1.000 P(pl=2)=0.000 P(pl=3)=0.000


1JHN: n=1830 | path_len dist: {1: 1830} | P(pl=1)=1.000 P(pl=2)=0.000 P(pl=3)=0.000


2JHC: n=49202 | path_len dist: {2: 49202} | P(pl=1)=0.000 P(pl=2)=1.000 P(pl=3)=0.000


2JHH: n=16202 | path_len dist: {2: 16202} | P(pl=1)=0.000 P(pl=2)=1.000 P(pl=3)=0.000


2JHN: n=5103 | path_len dist: {2: 5103} | P(pl=1)=0.000 P(pl=2)=1.000 P(pl=3)=0.000


3JHC: n=64970 | path_len dist: {3: 64970} | P(pl=1)=0.000 P(pl=2)=0.000 P(pl=3)=1.000


3JHH: n=25062 | path_len dist: {3: 25062} | P(pl=1)=0.000 P(pl=2)=0.000 P(pl=3)=1.000


3JHN: n=7067 | path_len dist: {3: 7067} | P(pl=1)=0.000 P(pl=2)=0.000 P(pl=3)=1.000



=== Path length diagnostics: test(sample) ===


1JHC: n=15346 | path_len dist: {-1: 15346} | P(pl=1)=0.000 P(pl=2)=0.000 P(pl=3)=0.000


1JHN: n=954 | path_len dist: {-1: 954} | P(pl=1)=0.000 P(pl=2)=0.000 P(pl=3)=0.000


2JHC: n=24214 | path_len dist: {-1: 24214} | P(pl=1)=0.000 P(pl=2)=0.000 P(pl=3)=0.000


2JHH: n=8067 | path_len dist: {-1: 8067} | P(pl=1)=0.000 P(pl=2)=0.000 P(pl=3)=0.000


2JHN: n=2517 | path_len dist: {-1: 2517} | P(pl=1)=0.000 P(pl=2)=0.000 P(pl=3)=0.000


3JHC: n=32591 | path_len dist: {-1: 32591} | P(pl=1)=0.000 P(pl=2)=0.000 P(pl=3)=0.000


3JHH: n=12815 | path_len dist: {-1: 12815} | P(pl=1)=0.000 P(pl=2)=0.000 P(pl=3)=0.000


3JHN: n=3496 | path_len dist: {-1: 3496} | P(pl=1)=0.000 P(pl=2)=0.000 P(pl=3)=0.000



=== Geometry feature coverage: train(sample) ===


Angle coverage on pl=2: 0.353 | Dihedral coverage on pl=3: 0.485



=== Geometry feature coverage: test(sample) ===


Angle coverage on pl=2: 0.000 | Dihedral coverage on pl=3: 0.000
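
The train figures above are easier to read once the denominator is made explicit. `diag_geometry_coverage` divides by all rows in the sample, and 0.353 matches the share of `path_len == 2` rows in the train sample, (49202 + 16202 + 5103) / 200000 ≈ 0.353, just as 0.485 matches the `path_len == 3` share. That suggests the features are present on essentially every eligible train row. A sketch contrasting the two quantities on a synthetic frame; both helper names are hypothetical:

```python
import numpy as np
import pandas as pd

def joint_fraction(df, feat, k):
    """Share of ALL rows that have path_len == k and a non-null feat."""
    return float(((df["path_len"] == k) & df[feat].notna()).mean()) if len(df) else 0.0

def conditional_coverage(df, feat, k):
    """Share of rows WITH path_len == k whose feat is non-null."""
    sub = df.loc[df["path_len"] == k, feat]
    return float(sub.notna().mean()) if len(sub) else 0.0

# Toy frame (synthetic): angle_cos is present on both path_len == 2 rows.
toy = pd.DataFrame({
    "path_len": [1, 2, 2, 3],
    "angle_cos": [np.nan, 0.1, -0.3, np.nan],
})

print(joint_fraction(toy, "angle_cos", 2), conditional_coverage(toy, "angle_cos", 2))
```

On the test sample both quantities are zero for a different reason: every test row has `path_len == -1`, so no row is eligible at all.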


In [30]:
# Purge train-only graph/geometry features; rebuild safe feature list and sanitize
import numpy as np, pandas as pd

DROP_COLS = [
    # Graph/topology
    'path_len','inv_path','is_bonded','min_nb_d0','min_nb_d1','cos0','cos1',
    # Graph-derived
    'path_len_bucket','path_le2','d_x_inv_path','d_over_1p_path','is_bonded_x_inv_d','inv_d_x_path_le2',
    'cos0_x_inv_path','cos1_x_inv_path','min_nb_d0_x_inv_path','min_nb_d1_x_inv_path','d_over_mol_mean_nb_d',
    # FE v3 geometry
    'angle_cos','angle_sin','angle_rad','dih_cos','dih_sin','dih_cos2',
    # FE v3 interactions
    'angle_cos_inv_d','angle_sin_inv_d','angle_rad_inv_d','dih_cos_inv_d','dih_sin_inv_d','dih_cos2_inv_d'
]

# Safe base features (universally available)
SAFE_BASE = [
    'Z0','Z1','same_element',
    'dx','dy','dz','d','d2','inv_d','inv_d2',
    'nH','nC','nN','nO','nF','n_atoms',
    'element_pair_id_sorted',
    'EN0','EN1','EN_diff','EN_abs_diff',
    'd_over_n_atoms','pe_per_atom',
    'expected_d_by_type','d_from_expected'
]

# Quantum candidates (keep only if present for test) and simple interactions
QUANTUM_CANDS = [
    'mulliken_0','mulliken_1','z_mulliken_0','z_mulliken_1',
    'shield_iso_0','shield_iso_1','z_shield_0','z_shield_1',
    'mulliken_diff','mulliken_abs_diff','mulliken_sum','mulliken_prod',
    'shield_diff','shield_abs_diff','shield_sum','shield_prod',
    'mulliken_diff_over_d','mulliken_diff_x_inv_d','shield_diff_over_d','shield_diff_x_inv_d'
]

# Molecule-level optional (keep only if present for test) - used already via pe_per_atom if available
MOL_LEVEL = ['potential_energy','dipole_x','dipole_y','dipole_z','dipole_mag']

def coverage(series: pd.Series) -> float:
    return float(series.notna().mean()) if len(series) else 0.0

# 1) Drop train-only cols from dataframes if exist (avoid accidental use elsewhere)
for c in DROP_COLS:
    if c in X_train.columns:
        X_train.drop(columns=[c], inplace=True)
    if c in X_test.columns:
        X_test.drop(columns=[c], inplace=True)

# 2) Decide whether to keep quantum and mol-level based on test coverage
keep_quantum = True
if 'mulliken_0' in X_test.columns:
    cov_mull = coverage(X_test['mulliken_0'])
else:
    cov_mull = 0.0
if 'shield_iso_0' in X_test.columns:
    cov_shld = coverage(X_test['shield_iso_0'])
else:
    cov_shld = 0.0
if cov_mull == 0.0 and cov_shld == 0.0:
    keep_quantum = False

keep_mol_level = any(c in X_test.columns and coverage(X_test[c]) > 0.0 for c in MOL_LEVEL)

# 3) Build reduced safe feature list from intersection
cands = list(SAFE_BASE)
if keep_quantum:
    cands += QUANTUM_CANDS
if keep_mol_level:
    # we already use pe_per_atom from these; add raw only if present
    cands += [c for c in MOL_LEVEL if c in X_test.columns]

reduced_features = [c for c in cands if (c in X_train.columns and c in X_test.columns)]

# 4) Re-sanitize: replace inf->NaN->fill with train means; cast float32; keep ints for IDs
def sanitize_train_test_cols(X_tr: pd.DataFrame, X_te: pd.DataFrame, cols: list[str]):
    X_tr = X_tr.copy(); X_te = X_te.copy()
    X_tr[cols] = X_tr[cols].replace([np.inf, -np.inf], np.nan)
    X_te[cols] = X_te[cols].replace([np.inf, -np.inf], np.nan)
    for c in cols:
        if X_tr[c].dtype.kind in 'iu':
            mode_val = X_tr[c].mode(dropna=True)
            fillv = int(mode_val.iloc[0]) if len(mode_val) else 0
            X_tr[c] = X_tr[c].fillna(fillv).astype(X_tr[c].dtype)
            X_te[c] = X_te[c].fillna(fillv).astype(X_te[c].dtype)
        else:
            tr = pd.to_numeric(X_tr[c], errors='coerce').astype('float32')
            mean_val = float(np.nanmean(tr)) if np.isfinite(np.nanmean(tr)) else 0.0
            X_tr[c] = tr.fillna(mean_val)
            X_te[c] = pd.to_numeric(X_te[c], errors='coerce').astype('float32').fillna(mean_val)
    return X_tr, X_te

X_train, X_test = sanitize_train_test_cols(X_train, X_test, reduced_features)

# 5) Update lgb_features to reduced safe list and report
lgb_features = list(reduced_features)  # copy, so later appends do not mutate reduced_features
print('Quantum coverage test: mulliken=', cov_mull, 'shield=', cov_shld, '| keep_quantum=', keep_quantum)
print('Mol-level any coverage:', keep_mol_level)
print('Reduced lgb_features count:', len(lgb_features))
print('Sample features:', lgb_features[:20])
assert all(c not in lgb_features for c in DROP_COLS), 'Unsafe cols leaked into features'

Quantum coverage test: mulliken= 1.0 shield= 1.0 | keep_quantum= True
Mol-level any coverage: False
Reduced lgb_features count: 45
Sample features: ['Z0', 'Z1', 'same_element', 'dx', 'dy', 'dz', 'd', 'd2', 'inv_d', 'inv_d2', 'nH', 'nC', 'nN', 'nO', 'nF', 'n_atoms', 'element_pair_id_sorted', 'EN0', 'EN1', 'EN_diff']
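
Note: step 4 of the cell above replaces infinities with NaN and then fills both train and test from a statistic computed on train only. A small sketch of that idea on one toy column, assuming a hypothetical helper `fill_with_train_mean`; it mirrors the float branch only and is not the cell's exact function.

```python
import numpy as np
import pandas as pd

def fill_with_train_mean(tr: pd.Series, te: pd.Series):
    """inf -> NaN, then fill train and test with the TRAIN mean (0.0 if train is all-NaN)."""
    tr = pd.to_numeric(tr, errors='coerce').replace([np.inf, -np.inf], np.nan).astype('float32')
    te = pd.to_numeric(te, errors='coerce').replace([np.inf, -np.inf], np.nan).astype('float32')
    mean_val = float(tr.mean()) if tr.notna().any() else 0.0
    return tr.fillna(mean_val), te.fillna(mean_val), mean_val

# Toy column (illustrative values)
tr_toy = pd.Series([1.0, np.inf, 3.0, np.nan])
te_toy = pd.Series([np.nan, -np.inf, 5.0])

tr_filled, te_filled, train_mean = fill_with_train_mean(tr_toy, te_toy)
print(train_mean)
print(tr_filled.tolist(), te_filled.tolist())
```

Using the train statistic for both frames keeps the test transform independent of the test distribution.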


In [34]:
# Safe physics features: distance bases, radii normalization, Coulomb/EN/Quantum interactions, RBFs, expected d by type-pair, atom fractions
import numpy as np, pandas as pd, time

t0 = time.time()
assert 'X_train' in globals() and 'X_test' in globals(), 'Run data prep first'

# 1) Basic distance bases
for df in (X_train, X_test):
    d = df['d'].astype('float32')
    d2 = df['d2'].astype('float32')
    eps = 1e-8
    df['inv_d3'] = (1.0 / np.clip(d, eps, None)**3).astype('float32')
    df['inv_d4'] = (1.0 / np.clip(d, eps, None)**4).astype('float32')
    df['inv_d6'] = (1.0 / np.clip(d, eps, None)**6).astype('float32')
    df['exp_md'] = np.exp(-1.0 * d).astype('float32')
    df['exp_2d'] = np.exp(-2.0 * d).astype('float32')
    df['exp_hd'] = np.exp(-0.5 * d).astype('float32')

# 2) Radius/chemistry normalization
R_COV = {1:0.31, 6:0.76, 7:0.71, 8:0.66, 9:0.57}
def cov_radius(z):
    return np.float32(R_COV.get(int(z), 0.7))
for df in (X_train, X_test):
    # Vectorized covalent-radius lookup; unknown elements fall back to 0.7 (same default as cov_radius)
    r0 = df['Z0'].astype('int32').map(R_COV).fillna(0.7).astype('float32')
    r1 = df['Z1'].astype('int32').map(R_COV).fillna(0.7).astype('float32')
    # Guard against a zero denominator before dividing
    denom = (r0 + r1).replace(0, np.float32(1e-6)).astype('float32')
    df['bo_ratio'] = (df['d'].astype('float32') / denom).astype('float32')
    df['inv_bo'] = (1.0 / df['bo_ratio'].replace(0, np.nan)).fillna(0).astype('float32')

# 3) Coulombic proxies
for df in (X_train, X_test):
    zprod = (df['Z0'].astype('float32') * df['Z1'].astype('float32')).astype('float32')
    d = df['d'].astype('float32')
    df['Zprod_over_d'] = (zprod / np.clip(d, 1e-8, None)).astype('float32')
    df['Zprod_over_d2'] = (zprod / np.clip(d, 1e-8, None)**2).astype('float32')

# 4) Electronegativity interactions (EN0/EN1 already present)
for df in (X_train, X_test):
    d = df['d'].astype('float32')
    EN_abs_diff = df['EN_abs_diff'].astype('float32')
    EN_sum = (df['EN0'].astype('float32') + df['EN1'].astype('float32')).astype('float32')
    df['EN_abs_over_d'] = (EN_abs_diff / np.clip(d, 1e-8, None)).astype('float32')
    df['EN_sum_over_d'] = (EN_sum / np.clip(d, 1e-8, None)).astype('float32')
    df['d_times_EN_abs'] = (d * EN_abs_diff).astype('float32')

# 5) Quantum × distance interactions (guard if quantum missing)
has_mull = ('mulliken_0' in X_train.columns) and ('mulliken_0' in X_test.columns)
has_shld = ('shield_iso_0' in X_train.columns) and ('shield_iso_0' in X_test.columns)
if has_mull:
    for df in (X_train, X_test):
        d = df['d'].astype('float32')
        inv_d = (1.0/np.clip(d,1e-8,None)).astype('float32')
        m0 = df['mulliken_0'].astype('float32'); m1 = df['mulliken_1'].astype('float32')
        mdiff = (m0 - m1).astype('float32')
        msum = (m0 + m1).astype('float32')
        mprod = (m0 * m1).astype('float32')
        df['mprod_over_d'] = (mprod * inv_d).astype('float32')
        df['mabsdiff_over_d'] = (mdiff.abs() * inv_d).astype('float32')
        df['msum_over_d2'] = (msum / np.clip(d,1e-8,None)**2).astype('float32')
if has_shld:
    for df in (X_train, X_test):
        d = df['d'].astype('float32')
        inv_d = (1.0/np.clip(d,1e-8,None)).astype('float32')
        sdiff = (df['shield_iso_0'].astype('float32') - df['shield_iso_1'].astype('float32')).astype('float32')
        df['sabsdiff_over_d'] = (sdiff.abs() * inv_d).astype('float32')

# 6) Expected distance by (type, element_pair_id_sorted) computed on TRAIN ONLY
grp = X_train.groupby(['type','element_pair_id_sorted'])['d'].mean().astype('float32')
map_dict = grp.to_dict()
for df in (X_train, X_test):
    key = list(zip(df['type'].values, df['element_pair_id_sorted'].values))
    exp_pair = np.array([map_dict.get(k, np.nan) for k in key], dtype=np.float32)
    mean_fallback = np.float32(X_train['d'].mean())
    exp_pair = np.where(np.isnan(exp_pair), mean_fallback, exp_pair).astype(np.float32)
    df['expected_d_by_type_pair'] = exp_pair
    df['d_from_expected_pair'] = (df['d'].astype('float32') - df['expected_d_by_type_pair'].astype('float32')).astype('float32')

# 7) Atom fractions
for df in (X_train, X_test):
    n_atoms = df['n_atoms'].replace(0, np.nan).astype('float32')
    for a in ['H','C','N','O','F']:
        col = f'n{a}'
        if col in df.columns:
            df[f'{col}_frac'] = (df[col].astype('float32') / n_atoms).fillna(0).astype('float32')

# 8) Per-type RBFs over distance (centers from TRAIN per type); 8 centers per type
types_unique = sorted(X_train['type'].unique())
rbf_centers = {}  # type -> centers np.array
rbf_sigma = {}    # type -> sigma float
for t in types_unique:
    d_t = X_train.loc[X_train['type']==t, 'd'].astype('float32')
    if len(d_t) == 0:
        continue
    qs = np.linspace(0.1, 0.9, 8)
    centers = np.quantile(d_t.values, qs).astype('float32')
    iqr = float(np.quantile(d_t.values, 0.75) - np.quantile(d_t.values, 0.25))
    sigma = np.float32(max(iqr/6.0, 1e-2))
    rbf_centers[t] = centers
    rbf_sigma[t] = sigma

def add_rbf_features(df: pd.DataFrame, prefix: str):
    # For each row, compute RBFs relative to the centers of that row's type
    d = df['d'].astype('float32').values
    types_arr = df['type'].values
    # Preallocate the output columns with zeros; rows are filled per type below
    for j in range(8):
        df[f'{prefix}_rbf{j}'] = np.zeros(len(df), dtype=np.float32)
    for t in types_unique:
        idx = np.where(types_arr == t)[0]
        if idx.size == 0 or t not in rbf_centers:
            continue
        c = rbf_centers[t]
        sig = rbf_sigma[t]
        d_loc = d[idx]
        for j in range(len(c)):
            val = np.exp(-((d_loc - c[j])**2) / (2.0 * (sig**2))).astype('float32')
            df.iloc[idx, df.columns.get_loc(f'{prefix}_rbf{j}')] = val

add_rbf_features(X_train, 'd')
add_rbf_features(X_test,  'd')

# 9) Update lgb_features: include only columns present in both train and test
new_cols = [
    'inv_d3','inv_d4','inv_d6','exp_md','exp_2d','exp_hd',
    'bo_ratio','inv_bo','Zprod_over_d','Zprod_over_d2','EN_abs_over_d','EN_sum_over_d','d_times_EN_abs',
]
if has_mull:
    new_cols += ['mprod_over_d','mabsdiff_over_d','msum_over_d2']
if has_shld:
    new_cols += ['sabsdiff_over_d']
new_cols += ['expected_d_by_type_pair','d_from_expected_pair','nH_frac','nC_frac','nN_frac','nO_frac','nF_frac']
new_cols += [f'd_rbf{j}' for j in range(8)]

present_cols = [c for c in new_cols if (c in X_train.columns and c in X_test.columns)]
if 'lgb_features' in globals():
    for c in present_cols:
        if c not in lgb_features:
            lgb_features.append(c)
else:
    lgb_features = present_cols

# 10) Sanitize new columns (fill NaN/inf with train means)
for c in present_cols:
    X_train[c] = pd.to_numeric(X_train[c], errors='coerce').replace([np.inf,-np.inf], np.nan).astype('float32')
    X_test[c]  = pd.to_numeric(X_test[c], errors='coerce').replace([np.inf,-np.inf], np.nan).astype('float32')
    mean_val = float(np.nanmean(X_train[c].values)) if np.isfinite(np.nanmean(X_train[c].values)) else 0.0
    X_train[c] = X_train[c].fillna(mean_val).astype('float32')
    X_test[c]  = X_test[c].fillna(mean_val).astype('float32')

print('Added safe physics features. lgb_features now:', len(lgb_features), '| Δtime:', round(time.time()-t0,1), 's')
print('Sample added:', present_cols[:20])

Added safe physics features. lgb_features now: 77 | Δtime: 13.5 s
Sample added: ['inv_d3', 'inv_d4', 'inv_d6', 'exp_md', 'exp_2d', 'exp_hd', 'bo_ratio', 'inv_bo', 'Zprod_over_d', 'Zprod_over_d2', 'EN_abs_over_d', 'EN_sum_over_d', 'd_times_EN_abs', 'mprod_over_d', 'mabsdiff_over_d', 'msum_over_d2', 'sabsdiff_over_d', 'expected_d_by_type_pair', 'd_from_expected_pair', 'nH_frac']
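
Note: step 8 of the cell above fits RBF centers at distance quantiles per type on train and then evaluates Gaussian bumps around them. A vectorized sketch for a single type, assuming hypothetical helpers `fit_rbf` and `apply_rbf` and synthetic distances; the quantile grid and the IQR/6 width follow the cell, everything else is illustrative.

```python
import numpy as np

def fit_rbf(d_train, n_centers=8):
    """Centers at the 10%..90% quantiles of d; sigma = IQR/6 with a 1e-2 floor."""
    d_train = np.asarray(d_train, dtype=np.float32)
    centers = np.quantile(d_train, np.linspace(0.1, 0.9, n_centers)).astype(np.float32)
    iqr = float(np.quantile(d_train, 0.75) - np.quantile(d_train, 0.25))
    sigma = np.float32(max(iqr / 6.0, 1e-2))
    return centers, sigma

def apply_rbf(d, centers, sigma):
    """Return an (n_rows, n_centers) matrix of exp(-(d - c)^2 / (2 sigma^2))."""
    d = np.asarray(d, dtype=np.float32)
    diff = d[:, None] - centers[None, :]          # broadcast rows against centers
    return np.exp(-(diff ** 2) / (2.0 * sigma ** 2)).astype(np.float32)

# Synthetic distances for one coupling type (made-up numbers)
rng = np.random.default_rng(0)
d_train_toy = rng.normal(loc=1.09, scale=0.02, size=1000)
d_test_toy = rng.normal(loc=1.09, scale=0.02, size=200)

centers, sigma = fit_rbf(d_train_toy)             # fit on train only
F_train = apply_rbf(d_train_toy, centers, sigma)
F_test = apply_rbf(d_test_toy, centers, sigma)    # reuse train centers on test
print(F_train.shape, F_test.shape)
```

Broadcasting replaces the inner loop over centers; the per-type outer loop would still be needed because each type has its own centers and width.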


In [35]:
# Drop unsafe mol-level features per expert advice
import pandas as pd
unsafe_cols = ['potential_energy','dipole_x','dipole_y','dipole_z','dipole_mag','pe_per_atom']
for c in unsafe_cols:
    if c in X_train.columns:
        X_train.drop(columns=[c], inplace=True)
    if c in X_test.columns:
        X_test.drop(columns=[c], inplace=True)

# Remove from lgb_features if present
if 'lgb_features' in globals():
    lgb_features = [c for c in lgb_features if c not in unsafe_cols]
print('Removed unsafe cols. lgb_features now:', len(lgb_features))

Removed unsafe cols. lgb_features now: 76


In [39]:
# OOF Target Encodings: TE_dist_bin and TE_pair_bin (type-aware, distance-quantile bins with smoothing)
import numpy as np, pandas as pd, time

assert 'X_train' in globals() and 'X_test' in globals(), 'Run data prep first'
assert 'lgb_folds' in globals(), 'Build lgb_folds first (cell 10)'

t0 = time.time()
B = 15  # number of bins
m_smooth = 100.0  # smoothing weight

# 1) Compute per-type bin edges on TRAIN only
type_list = sorted(X_train['type'].unique())
bin_edges = {}  # t -> edges np.array length B+1
for t in type_list:
    d_t = X_train.loc[X_train['type']==t, 'd'].astype('float32').values
    if d_t.size == 0:
        continue
    qs = np.linspace(0.0, 1.0, B+1)
    edges = np.quantile(d_t, qs).astype('float32')
    # Ensure strictly increasing edges (handle duplicates)
    for i in range(1, len(edges)):
        if edges[i] <= edges[i-1]:
            edges[i] = edges[i-1] + 1e-6
    bin_edges[t] = edges

def assign_bins(df: pd.DataFrame) -> np.ndarray:
    out = np.full(len(df), -1, dtype=np.int16)
    types_arr = df['type'].values
    d_arr = df['d'].astype('float32').values
    for t in type_list:
        idx = np.where(types_arr == t)[0]
        if idx.size == 0 or t not in bin_edges:
            continue
        e = bin_edges[t]
        # np.digitize over the B-1 interior edges returns 0..B-1; the clip is only a safety net
        b = np.digitize(d_arr[idx], e[1:-1], right=False).astype(np.int16)
        b = np.clip(b, 0, B-1).astype(np.int16)
        out[idx] = b
    return out

# 2) Precompute train-wide stats for test transform
y_all = X_train['scalar_coupling_constant'].astype('float32').values
dist_bin_all = assign_bins(X_train)
X_train['__dist_bin'] = dist_bin_all
X_test['__dist_bin'] = assign_bins(X_test)

# Per-type global mean for smoothing fallback
type_global_mean = X_train.groupby('type')['scalar_coupling_constant'].mean().astype('float32').to_dict()

# Build full-train stats for test mapping
def build_stats(keys_df: pd.DataFrame):
    # keys_df must have columns: 'type','__dist_bin','element_pair_id_sorted','scalar_coupling_constant'
    # (a) by (type, dist_bin)
    g1 = keys_df.groupby(['type','__dist_bin'])['scalar_coupling_constant'].agg(['sum','count']).reset_index()
    g1 = g1.rename(columns={'sum':'sum1','count':'cnt1'})
    # (b) by (type, pair, dist_bin)
    g2 = keys_df.groupby(['type','element_pair_id_sorted','__dist_bin'])['scalar_coupling_constant'].agg(['sum','count']).reset_index()
    g2 = g2.rename(columns={'sum':'sum2','count':'cnt2'})
    return g1, g2

g1_all, g2_all = build_stats(X_train[['type','__dist_bin','element_pair_id_sorted','scalar_coupling_constant']])

# 3) OOF encode for train using lgb_folds
TE_dist_oof = np.zeros(len(X_train), dtype=np.float32)
TE_pair_oof = np.zeros(len(X_train), dtype=np.float32)

for fold_i, (tr_idx, va_idx) in enumerate(lgb_folds):
    tr = X_train.iloc[tr_idx][['type','__dist_bin','element_pair_id_sorted','scalar_coupling_constant']].copy()
    va = X_train.iloc[va_idx][['type','__dist_bin','element_pair_id_sorted']].copy()
    # Filter out rows with missing bin
    tr = tr[tr['__dist_bin'] >= 0]
    # Stats on training part only
    g1, g2 = build_stats(tr)
    # Merge and compute smoothed means for validation part
    va_ = va.copy()
    va_ = va_.merge(g1, on=['type','__dist_bin'], how='left')
    va_ = va_.merge(g2, on=['type','element_pair_id_sorted','__dist_bin'], how='left')
    # Compute smoothed TE values
    # For dist TE: use the per-type mean of this fold's training part as prior
    # (the full-train type_global_mean would include this fold's validation targets)
    fold_type_mean = tr.groupby('type')['scalar_coupling_constant'].mean().astype('float32')
    prior = va_['type'].map(fold_type_mean).astype('float32')
    sum1 = va_['sum1'].astype('float32').fillna(0.0); cnt1 = va_['cnt1'].astype('float32').fillna(0.0)
    te1 = (sum1 + m_smooth*prior) / (cnt1 + m_smooth)
    # For pair TE: reuse the same fold-local per-type prior
    sum2 = va_['sum2'].astype('float32').fillna(0.0); cnt2 = va_['cnt2'].astype('float32').fillna(0.0)
    te2 = (sum2 + m_smooth*prior) / (cnt2 + m_smooth)
    # Assign back
    TE_dist_oof[va_idx] = te1.values.astype('float32')
    TE_pair_oof[va_idx] = te2.values.astype('float32')
    print(f'TE OOF fold {fold_i}: n_tr={len(tr_idx)} n_va={len(va_idx)}', flush=True)

# 4) Transform test using full-train stats with smoothing
te_df = X_test[['type','__dist_bin','element_pair_id_sorted']].copy()
te_df = te_df.merge(g1_all, on=['type','__dist_bin'], how='left')
te_df = te_df.merge(g2_all, on=['type','element_pair_id_sorted','__dist_bin'], how='left')
prior_te = te_df['type'].map(type_global_mean).astype('float32')
sum1_te = te_df['sum1'].astype('float32').fillna(0.0); cnt1_te = te_df['cnt1'].astype('float32').fillna(0.0)
sum2_te = te_df['sum2'].astype('float32').fillna(0.0); cnt2_te = te_df['cnt2'].astype('float32').fillna(0.0)
TE_dist_test = ((sum1_te + m_smooth*prior_te) / (cnt1_te + m_smooth)).astype('float32')
TE_pair_test = ((sum2_te + m_smooth*prior_te) / (cnt2_te + m_smooth)).astype('float32')

# 5) Attach features and clean up
X_train['TE_dist_bin'] = TE_dist_oof.astype('float32')
X_train['TE_pair_bin'] = TE_pair_oof.astype('float32')
X_test['TE_dist_bin'] = TE_dist_test.astype('float32')
X_test['TE_pair_bin'] = TE_pair_test.astype('float32')

# Drop helper columns
X_train.drop(columns=['__dist_bin'], inplace=True)
X_test.drop(columns=['__dist_bin'], inplace=True)

# 6) Update lgb_features
for c in ['TE_dist_bin','TE_pair_bin']:
    if c in X_train.columns and c in X_test.columns:
        if 'lgb_features' in globals():
            if c not in lgb_features:
                lgb_features.append(c)
        else:
            lgb_features = [c]

print('Added TE features. lgb_features now:', len(lgb_features), '| Δtime:', round(time.time()-t0,1), 's')

TE OOF fold 0: n_tr=2792710 n_va=1398553


TE OOF fold 1: n_tr=2792391 n_va=1398872


TE OOF fold 2: n_tr=2797425 n_va=1393838


Added TE features. lgb_features now: 78 | Δtime: 8.3 s
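
Note: the encoding above bins distance by quantile and then shrinks each bin mean toward a prior with `(sum + m*prior) / (count + m)`. A worked toy example of the full-train (test-side) transform, assuming made-up data, a single type, `B = 4` bins and `m_smooth = 2.0` instead of the cell's 15 and 100.0. The helper name `smoothed_te` is hypothetical.

```python
import numpy as np
import pandas as pd

B = 4            # number of bins (the cell above uses 15)
m_smooth = 2.0   # smoothing weight (the cell above uses 100.0)

# Toy train rows for one type (illustrative values)
train = pd.DataFrame({
    'd': [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7],
    'y': [10.0, 12.0, 20.0, 22.0, 30.0, 32.0, 40.0, 42.0],
})

# Quantile edges; digitize against the B-1 interior edges so bins run 0..B-1
edges = np.quantile(train['d'].values, np.linspace(0.0, 1.0, B + 1))
train['bin'] = np.digitize(train['d'].values, edges[1:-1], right=False)

prior = float(train['y'].mean())                      # prior = overall mean of the target
stats = train.groupby('bin')['y'].agg(['sum', 'count'])

def smoothed_te(bins):
    """Smoothed mean per bin; bins never seen in train fall back to the prior."""
    s = stats['sum'].reindex(bins).fillna(0.0).to_numpy()
    n = stats['count'].reindex(bins).fillna(0.0).to_numpy()
    return (s + m_smooth * prior) / (n + m_smooth)

# Out-of-range distances land in the edge bins (0 and B-1)
test_bins = np.digitize(np.array([0.5, 9.9]), edges[1:-1], right=False)
te_test = smoothed_te(test_bins)
te_unseen = smoothed_te(np.array([7]))                # bin id absent from train
print(train['bin'].tolist(), prior)
print(test_bins.tolist(), te_test.tolist(), te_unseen.tolist())
```

With two rows per bin and `m_smooth = 2.0`, each encoded value sits halfway between the raw bin mean and the prior. For train rows the cell computes the same quantity out of fold, so a row's own target never enters its encoding.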


In [43]:
# Per-type bias correction on blended LGBM seeds and rewrite submission
import numpy as np, pandas as pd, os, time

t0 = time.time()
assert os.path.exists('oof_blend_lgb12.npy') and os.path.exists('pred_test_blend_lgb12.npy'), 'Run blend cell first'
oof_blend = np.load('oof_blend_lgb12.npy')
pred_test_blend = np.load('pred_test_blend_lgb12.npy')

y = X_train['scalar_coupling_constant'].values.astype('float32')
types_tr = X_train['type'].values
types_te = X_test['type'].values

# Compute per-type residual mean (y - oof) and apply as bias to test preds
bias_per_type = {}
for t in sorted(pd.unique(types_tr)):
    m = (types_tr == t)
    if m.any():
        bias_per_type[t] = float((y[m] - oof_blend[m]).mean())
    else:
        bias_per_type[t] = 0.0

adj_test = pred_test_blend.copy().astype('float32')
for t, b in bias_per_type.items():
    mt = (types_te == t)
    if mt.any():
        adj_test[mt] = (adj_test[mt] + np.float32(b)).astype('float32')

# Helper to report OOF LMAE before/after bias correction (defined but not called in this cell).
# A constant per-type shift zeroes the mean residual; its effect on MAE is usually small.
def lmae_score_fast(y_true, y_pred, types, eps: float = 1e-9):
    df = pd.DataFrame({'abs_err': np.abs(np.asarray(y_true) - np.asarray(y_pred)), 'type': types})
    mae_by_type = df.groupby('type')['abs_err'].mean().astype('float64')
    return float(np.log(mae_by_type.clip(lower=eps)).mean())

# Build and save bias-corrected submission
sub_bc = pd.DataFrame({'id': X_test['id'].values, 'scalar_coupling_constant': adj_test}).sort_values('id')
sub_bc.to_csv('submission.csv', index=False)
pd.Series(bias_per_type).to_csv('bias_per_type_lgb_blend.csv')
print('Applied per-type bias correction. Saved submission.csv', sub_bc.shape, '| Δtime:', round(time.time()-t0,1), 's')
print('Bias per type:', bias_per_type)

Applied per-type bias correction. Saved submission.csv (467813, 2) | Δtime: 0.9 s
Bias per type: {'1JHC': 0.06225172430276871, '1JHN': 0.03956142067909241, '2JHC': 0.21071957051753998, '2JHH': -0.026959268376231194, '2JHN': 0.07537432014942169, '3JHC': 0.13971124589443207, '3JHH': 0.02110287733376026, '3JHN': 0.054075922816991806}
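
Note: the correction above adds the per-type mean OOF residual to the test predictions. A toy sketch of the same shift together with the log-MAE metric, assuming made-up targets and residuals and a hypothetical helper `lmae`. The bias here is estimated and scored on the same rows, so the improvement it prints is optimistic.

```python
import numpy as np
import pandas as pd

def lmae(y_true, y_pred, types, eps=1e-9):
    """Mean over types of log(MAE within type)."""
    df = pd.DataFrame({
        'abs_err': np.abs(np.asarray(y_true) - np.asarray(y_pred)),
        'type': types,
    })
    mae_by_type = df.groupby('type')['abs_err'].mean()
    return float(np.log(mae_by_type.clip(lower=eps)).mean())

# Toy targets and OOF predictions (illustrative values)
types = np.array(['1JHC'] * 4 + ['2JHH'] * 4)
y = np.array([90.0, 92.0, 94.0, 96.0, -10.0, -11.0, -12.0, -13.0])
resid = np.array([0.5, 0.7, 0.3, 0.5, -0.2, -0.4, -0.2, -0.4])   # y - oof
oof = y - resid

# Per-type bias = mean residual; add it back to the predictions of that type
bias = {t: float((y[types == t] - oof[types == t]).mean()) for t in np.unique(types)}
adj = oof + np.array([bias[t] for t in types])

before = lmae(y, oof, types)
after = lmae(y, adj, types)
print(bias)
print('LMAE before:', round(before, 4), '| after:', round(after, 4))
```

A mean shift minimizes squared error within each type; the per-type median residual is the constant that minimizes MAE, so it may be worth comparing both on held-out folds.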
