# CHAMPS Scalar Coupling – Plan

Objectives:
- Establish reliable CV and baseline fast
- Build strong feature pipeline from provided data (structures, tensors, charges, potential_energy, dipoles)
- Train fast GPU models (CatBoost/XGBoost) per coupling type and blend
- Target: medal-tier LogMAE (competition metric: mean over coupling types of log(MAE_type)) and track per-type OOF

Initial Steps:
1) Environment sanity: GPU availability (nvidia-smi), versions
2) Data audit: load train/test; row counts; columns; NA; target profile; coupling_type distribution; per-type stats
3) CV: GroupKFold by molecule_name; stratify by type; lock seed; cache folds
4) Baseline features v0:
   - From train/test: distance between the two atoms (from structures), atom types (one-hot/emb), mulliken charges (sum/diff), shielding isotropic (sum/diff), potential_energy, dipole norms
   - Simple geometric: interatomic distance, d^2, 1/d, 1/d^2, 1/d^3; optional log(d)
   - Graph basics from structures: is_bonded via covalent radii threshold, shortest_path_len, degrees, ring flag
   - Per molecule context (counts, element counts) computed fold-safely
5) Models:
   - Fast screen: XGBoost GPU per type; 5 folds; early stopping
   - Compare CatBoost GPU; per-type models typically win here
6) Iteration:
   - Add angles (cos) via nearest neighbors; then dihedrals/Karplus for 3J types
   - Error buckets by type and distance/path bins
7) Ensembling:
   - Weighted blend XGB + CatBoost by per-type OOF

Risks & Checks:
- Leakage: fit transforms within folds; compute molecule-level aggs inside folds
- Do NOT use scalar_coupling_contributions (FC/SD/PSO/DSO) as features (train-only). Safe: mulliken, shielding, potential_energy, dipoles.
- CV must group by molecule_name; ensure all types present per fold
- Cache heavy joins (parquet); subsample dev runs

Milestones (request expert review at each):
A) Plan + env check
B) Data audit + CV finalized
C) Baseline FE v0 + XGB OOF
D) FE v1 (angles/dihedrals) + CatBoost
E) Blend + submission

Next: run environment check and quick data peek.

In [1]:
# Environment check + quick data peek
import os, sys, subprocess, time, json, gc
import pandas as pd
import numpy as np

t0 = time.time()
print('Running nvidia-smi...')
try:
    out = subprocess.run(['bash','-lc','nvidia-smi || true'], capture_output=True, text=True, check=False)
    print(out.stdout)
except Exception as e:
    print('nvidia-smi failed:', e)

print('Python', sys.version)
print('Pandas', pd.__version__)
print('NumPy', np.__version__)

DATA_FILES = ['train.csv','test.csv']
for f in DATA_FILES:
    print(f, 'exists:', os.path.exists(f), 'size(MB):', round(os.path.getsize(f)/1e6,2) if os.path.exists(f) else None)

SEED = 42
np.random.seed(SEED)

def mem(df):
    return round(df.memory_usage(deep=True).sum()/1e6, 2)

# Load train/test (quick peek)
print('\nLoading train/test...')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print('train shape:', train.shape, 'mem(MB):', mem(train))
print('test  shape:', test.shape,  'mem(MB):', mem(test))

print('\nTrain head:')
print(train.head(3))
print('\nCoupling types (train):')
print(train['type'].value_counts())

print('\nTarget summary:')
print(train['scalar_coupling_constant'].describe())

elapsed = time.time() - t0
print(f'Env + data peek done in {elapsed:.2f}s', flush=True)

Running nvidia-smi...
Tue Sep 23 23:19:52 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     182MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                          

train shape: (4191263, 6) mem(MB): 695.75
test  shape: (467813, 5) mem(MB): 73.91

Train head:
        id     molecule_name  atom_index_0  atom_index_1  type  \
0  3872080  dsgdb9nsd_109986             9             0  1JHC   
1  3872081  dsgdb9nsd_109986             9             2  3JHC   
2  3872082  dsgdb9nsd_109986             9            10  2JHH   

   scalar_coupling_constant  
0                  95.47000  
1                   1.47412  
2                  -9.90448  

Coupling types (train):
type
3JHC    1359077
2JHC    1026379
1JHC     637912
3JHH     531224
2JHH     340097
3JHN     150067
2JHN     107091
1JHN      39416
Name: count, dtype: int64

Target summary:


count    4.191263e+06
mean     1.591607e+01
std      3.493526e+01
min     -4.476050e+01
25%     -2.563940e-01
50%      2.280420e+00
75%      7.391740e+00
max      2.077090e+02
Name: scalar_coupling_constant, dtype: float64
Env + data peek done in 2.64s


In [2]:
# Build and cache GroupKFold (stratified by predominant type if available)
import pandas as pd, numpy as np, time, os, sys
from collections import Counter
t0 = time.time()
print('Loading minimal columns for CV...')
use_cols = ['molecule_name','type']
df = pd.read_csv('train.csv', usecols=use_cols)
print('Rows:', len(df), 'Molecules:', df['molecule_name'].nunique())

# Derive a single stratification label per molecule: predominant type
print('Computing predominant type per molecule...')
type_idx = df.groupby('molecule_name')['type'].agg(lambda s: s.value_counts().idxmax()).rename('strat_label')
mol_df = type_idx.reset_index()

FOLDS = 5
seed = 42
mol_names = mol_df['molecule_name'].values
strat_labels = mol_df['strat_label'].values

folds = np.full(len(mol_df), -1, dtype=int)
assigned = 0
try:
    from sklearn.model_selection import StratifiedGroupKFold
    print('Using StratifiedGroupKFold...')
    sgkf = StratifiedGroupKFold(n_splits=FOLDS, shuffle=True, random_state=seed)
    for k, (_, val_idx) in enumerate(sgkf.split(np.zeros(len(mol_df)), strat_labels, groups=mol_names)):
        folds[val_idx] = k
        print(f'Fold {k}: molecules {len(val_idx)}')
        assigned += len(val_idx)
except Exception as e:
    print('StratifiedGroupKFold unavailable, falling back to GroupKFold. Reason:', e)
    from sklearn.model_selection import GroupKFold
    gkf = GroupKFold(n_splits=FOLDS)
    for k, (_, val_idx) in enumerate(gkf.split(np.zeros(len(mol_df)), groups=mol_names)):
        folds[val_idx] = k
        print(f'Fold {k}: molecules {len(val_idx)}')
        assigned += len(val_idx)

assert (folds >= 0).all(), 'Unassigned folds present'
mol_df['fold'] = folds

# Save molecule-level folds mapping
fold_path = 'folds_molecules.csv'
mol_df[['molecule_name','fold']].to_csv(fold_path, index=False)
print('Saved', fold_path, 'with shape', mol_df.shape)

# Diagnostics: per-fold type distribution
df = df.merge(mol_df[['molecule_name','fold']], on='molecule_name', how='left')
print('Per-fold type counts:')
cnt = df.groupby(['fold','type']).size().unstack(fill_value=0)
print(cnt)

# Quick logMAE metric helper (OOF later) placeholder
def log_mae_by_type(y_true, y_pred, types):
    out = []
    for t in np.unique(types):
        mask = (types == t)
        mae = np.mean(np.abs(y_true[mask] - y_pred[mask]))
        out.append(np.log(mae + 1e-9))
    return float(np.mean(out))

print(f'Fold build done in {time.time()-t0:.2f}s', flush=True)

Loading minimal columns for CV...


Rows: 4191263 Molecules: 76510
Computing predominant type per molecule...


Using StratifiedGroupKFold...


Fold 0: molecules 15302
Fold 1: molecules 15301
Fold 2: molecules 15302
Fold 3: molecules 15302
Fold 4: molecules 15303
Saved folds_molecules.csv with shape (76510, 3)


Per-fold type counts:
type    1JHC  1JHN    2JHC   2JHH   2JHN    3JHC    3JHH   3JHN
fold                                                           
0     127506  7909  205166  68000  21339  271802  106072  29674
1     127794  7857  205413  68236  21313  272225  106385  29972
2     127603  7843  205146  68029  21505  271866  106131  29880
3     127552  7914  205316  67979  21483  271724  106219  30304
4     127457  7893  205338  67853  21451  271460  106417  30237
Fold build done in 26.16s


In [3]:
# Build and cache atoms table (structures + periodic props)
import pandas as pd, numpy as np, time, os, gc

t0 = time.time()
atoms_parquet = 'atoms.parquet'
if os.path.exists(atoms_parquet):
    atoms = pd.read_parquet(atoms_parquet)
    print('Loaded cached', atoms_parquet, 'shape:', atoms.shape)
else:
    print('Reading structures.csv ...')
    atoms = pd.read_csv('structures.csv')  # columns: molecule_name, atom_index, atom, x, y, z
    print('structures shape:', atoms.shape)

    # Periodic table props for CHAMPS atoms (H, C, N, O, F)
    periodic = {
        'H': {'Z':1,  'EN':2.20, 'covrad':0.31, 'period':1, 'group':1,  'valence_e':1},
        'C': {'Z':6,  'EN':2.55, 'covrad':0.76, 'period':2, 'group':14, 'valence_e':4},
        'N': {'Z':7,  'EN':3.04, 'covrad':0.71, 'period':2, 'group':15, 'valence_e':5},
        'O': {'Z':8,  'EN':3.44, 'covrad':0.66, 'period':2, 'group':16, 'valence_e':6},
        'F': {'Z':9,  'EN':3.98, 'covrad':0.57, 'period':2, 'group':17, 'valence_e':7},
    }
    pmap = pd.DataFrame.from_dict(periodic, orient='index')
    pmap.index.name = 'atom'
    pmap = pmap.reset_index()
    atoms = atoms.merge(pmap, on='atom', how='left')

    # Dtypes/downcast
    atoms['atom_index'] = atoms['atom_index'].astype(np.int32)
    for c in ['x','y','z','EN','covrad']:
        atoms[c] = atoms[c].astype(np.float32)
    for c in ['Z','period','group','valence_e']:
        atoms[c] = atoms[c].astype(np.int16)

    # Save cache
    atoms.to_parquet(atoms_parquet, index=False)
    print('Saved', atoms_parquet, 'shape:', atoms.shape)

print('Unique molecules in atoms:', atoms['molecule_name'].nunique())
print('Atom symbols:', atoms['atom'].value_counts().to_dict())
print(f'Atoms table ready in {time.time()-t0:.2f}s', flush=True)

Reading structures.csv ...


structures shape: (1379964, 6)


Saved atoms.parquet shape: (1379964, 12)
Unique molecules in atoms: 76510
Atom symbols: {'H': 707001, 'C': 486817, 'O': 107091, 'N': 77209, 'F': 1846}
Atoms table ready in 1.09s


In [4]:
# Baseline submission: per-type median prediction
import pandas as pd, numpy as np, time, os
t0 = time.time()
train = pd.read_csv('train.csv', usecols=['type','scalar_coupling_constant'])
test = pd.read_csv('test.csv', usecols=['id','type'])
med = train.groupby('type')['scalar_coupling_constant'].median().to_dict()
print('Per-type medians:', med)
pred = test['type'].map(med).astype(np.float32)
sub = pd.DataFrame({'id': test['id'].values, 'scalar_coupling_constant': pred.values})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv shape:', sub.shape, 'size(MB):', round(os.path.getsize('submission.csv')/1e6, 2))
print(sub.head())
print(f'Baseline submission ready in {time.time()-t0:.2f}s')

Per-type medians: {'1JHC': 88.20235, '1JHN': 47.869299999999996, '2JHC': -0.953401, '2JHH': -11.3289, '2JHN': 2.0169900000000003, '3JHC': 2.87845, '3JHH': 3.687980000000001, '3JHN': 0.6542279999999999}


Saved submission.csv shape: (467813, 2) size(MB): 7.73
        id  scalar_coupling_constant
0  2324604                 88.202347
1  2324605                 -0.953401
2  2324606                  2.878450
3  2324607                  2.878450
4  2324608                -11.328900
Baseline submission ready in 1.31s


In [7]:
# FE v0: geometry + periodic + molecule context; cache train/test feature tables
import pandas as pd, numpy as np, time, os, gc

t0 = time.time()
print('Loading base tables...')
atoms = pd.read_parquet('atoms.parquet')
folds = pd.read_csv('folds_molecules.csv')
train_cols = ['id','molecule_name','atom_index_0','atom_index_1','type','scalar_coupling_constant']
test_cols  = ['id','molecule_name','atom_index_0','atom_index_1','type']
train_df = pd.read_csv('train.csv', usecols=train_cols)
test_df  = pd.read_csv('test.csv',  usecols=test_cols)
print('train_df:', train_df.shape, 'test_df:', test_df.shape)

# Molecule-level safe tables
pot = pd.read_csv('potential_energy.csv')  # molecule_name, potential_energy
dip = pd.read_csv('dipole_moments.csv')    # molecule_name, components
# Normalize dipole column names to dx,dy,dz
dip_cols = set(dip.columns.str.lower())
if {'dx','dy','dz'}.issubset(dip_cols):
    # already correct or mixed case
    rename_map = {c: c.lower() for c in dip.columns if c.lower() in {'dx','dy','dz'}}
    dip = dip.rename(columns=rename_map)
elif {'x','y','z'}.issubset(dip_cols):
    # common Kaggle file has X,Y,Z
    rename_map = {}
    for c in dip.columns:
        cl = c.lower()
        if cl == 'x': rename_map[c] = 'dx'
        if cl == 'y': rename_map[c] = 'dy'
        if cl == 'z': rename_map[c] = 'dz'
    dip = dip.rename(columns=rename_map)
else:
    print('Warning: unexpected dipole_moments columns:', dip.columns.tolist())
dip['dip_norm'] = np.sqrt(dip['dx']**2 + dip['dy']**2 + dip['dz']**2).astype(np.float32)
mol_ctx = pot.merge(dip, on='molecule_name', how='left')

def build_features(df, is_train: bool):
    t1 = time.time()
    # Merge atom0 and atom1 records
    a0 = atoms.rename(columns={
        'atom_index':'atom_index_0','atom':'atom_0','x':'x0','y':'y0','z':'z0','Z':'Z0','EN':'EN0','covrad':'covrad0','period':'period0','group':'group0','valence_e':'valence_e0'
    })
    a1 = atoms.rename(columns={
        'atom_index':'atom_index_1','atom':'atom_1','x':'x1','y':'y1','z':'z1','Z':'Z1','EN':'EN1','covrad':'covrad1','period':'period1','group':'group1','valence_e':'valence_e1'
    })
    df = df.merge(a0, on=['molecule_name','atom_index_0'], how='left')
    df = df.merge(a1, on=['molecule_name','atom_index_1'], how='left')
    # Molecule context
    df = df.merge(mol_ctx, on='molecule_name', how='left')
    # Geometry
    dx = (df['x0'].values - df['x1'].values).astype(np.float32)
    dy = (df['y0'].values - df['y1'].values).astype(np.float32)
    dz = (df['z0'].values - df['z1'].values).astype(np.float32)
    d2 = dx*dx + dy*dy + dz*dz
    dist = np.sqrt(d2) + 1e-6
    df['dist'] = dist.astype(np.float32)
    df['dist2'] = d2.astype(np.float32)
    df['inv_dist']  = (1.0/dist).astype(np.float32)
    df['inv_d2']    = (1.0/d2.clip(min=1e-6)).astype(np.float32)
    df['inv_d3']    = (1.0/(dist*dist*dist)).astype(np.float32)
    # Atom identity & periodic props
    df['same_element'] = (df['atom_0'].values == df['atom_1'].values).astype(np.int8)
    for a in ['Z','EN','covrad','valence_e','period','group']:
        a0c, a1c = f'{a}0', f'{a}1'
        df[f'{a}_sum']  = (df[a0c].values + df[a1c].values).astype(np.float32)
        df[f'{a}_diff'] = (df[a0c].values - df[a1c].values).astype(np.float32)
        if a in ('EN','covrad','Z'):
            df[f'{a}_ratio'] = (df[a0c].values / (df[a1c].replace(0, np.nan))).astype(np.float32)
            df[f'{a}_ratio'] = df[f'{a}_ratio'].replace([np.inf, -np.inf], np.nan).fillna(0).astype(np.float32)
    # Dipole and potential energy
    for c in ['potential_energy','dx','dy','dz','dip_norm']:
        if c in df.columns:
            df[c] = df[c].astype(np.float32)
    # Minimal categorical encodings for atom symbols (handle unknowns safely)
    sym_map = {'H':0,'C':1,'N':2,'O':3,'F':4}
    df['sym0'] = df['atom_0'].map(sym_map).fillna(-1).astype(np.int8)
    df['sym1'] = df['atom_1'].map(sym_map).fillna(-1).astype(np.int8)
    # Keep only needed columns
    base_cols = [
        'id','type','molecule_name','dist','dist2','inv_dist','inv_d2','inv_d3','same_element','sym0','sym1',
        'Z0','Z1','EN0','EN1','covrad0','covrad1','valence_e0','valence_e1','period0','period1','group0','group1',
        'Z_sum','Z_diff','EN_sum','EN_diff','EN_ratio','covrad_sum','covrad_diff','covrad_ratio','valence_e_sum','valence_e_diff','period_sum','period_diff','group_sum','group_diff',
        'potential_energy','dx','dy','dz','dip_norm'
    ]
    cols_exist = [c for c in base_cols if c in df.columns]
    out = df[cols_exist].copy()
    if is_train:
        out = out.merge(folds, on='molecule_name', how='left')
    del df; gc.collect()
    print('Built features in', f'{time.time()-t1:.2f}s', 'shape:', out.shape)
    return out

Xtr = build_features(train_df, is_train=True)
ytr = pd.DataFrame({'id': train_df['id'].values, 'scalar_coupling_constant': train_df['scalar_coupling_constant'].values, 'type': train_df['type'].values})
Xte = build_features(test_df, is_train=False)

# Downcast numerics
def downcast_numeric(df):
    for c in df.select_dtypes(include=['float64']).columns:
        df[c] = df[c].astype(np.float32)
    for c in df.select_dtypes(include=['int64']).columns:
        if c == 'id':
            df[c] = df[c].astype(np.int32)
        else:
            df[c] = df[c].astype(np.int32)
    return df

Xtr = downcast_numeric(Xtr)
Xte = downcast_numeric(Xte)

# Save caches
Xtr_path = 'X_train_v0.parquet'; Xte_path = 'X_test_v0.parquet'; y_path = 'y_train.csv'
Xtr.to_parquet(Xtr_path, index=False)
Xte.to_parquet(Xte_path, index=False)
ytr.to_csv(y_path, index=False)
print('Saved:', Xtr_path, Xtr.shape, '|', Xte_path, Xte.shape, '|', y_path, ytr.shape)
print(f'FE v0 total time: {time.time()-t0:.2f}s', flush=True)

Loading base tables...


train_df: (4191263, 6) test_df: (467813, 5)


Built features in 3.23s shape: (4191263, 43)


Built features in 0.60s shape: (467813, 42)


Saved: X_train_v0.parquet (4191263, 43) | X_test_v0.parquet (467813, 42) | y_train.csv (4191263, 3)
FE v0 total time: 10.78s


In [11]:
# FE v1: add mulliken + shielding iso + quick extras + covalent graph features; cache v1
import pandas as pd, numpy as np, time, os, gc
from collections import deque, defaultdict

t0 = time.time()
print('Loading base inputs...')
atoms = pd.read_parquet('atoms.parquet')  # has periodic props, coords
folds = pd.read_csv('folds_molecules.csv')
train = pd.read_csv('train.csv', usecols=['id','molecule_name','atom_index_0','atom_index_1','type','scalar_coupling_constant'])
test  = pd.read_csv('test.csv',  usecols=['id','molecule_name','atom_index_0','atom_index_1','type'])

# Molecule-level safe tables
pot = pd.read_csv('potential_energy.csv')
dip = pd.read_csv('dipole_moments.csv')
dip_cols = set(dip.columns.str.lower())
if {'dx','dy','dz'}.issubset(dip_cols):
    dip = dip.rename(columns={c: c.lower() for c in dip.columns if c.lower() in {'dx','dy','dz'}})
elif {'x','y','z'}.issubset(dip_cols):
    r = {};
    [r.setdefault(c, {'x':'dx','y':'dy','z':'dz'}[c.lower()]) for c in dip.columns if c.lower() in {'x','y','z'}]
    dip = dip.rename(columns=r)
dip['dip_norm'] = np.sqrt(dip['dx']**2 + dip['dy']**2 + dip['dz']**2).astype(np.float32)
mol_ctx = pot.merge(dip, on='molecule_name', how='left')

# Quantum per-atom tables (safe for test)
print('Loading mulliken and shielding...')
mull = pd.read_csv('mulliken_charges.csv')        # molecule_name, atom_index, mulliken_charge
shield = pd.read_csv('magnetic_shielding_tensors.csv')  # molecule_name, atom_index, diag terms
# Normalize shield column names to robustly find diagonal terms
shield.columns = [str(c).lower() for c in shield.columns]
def find_col(sdf, suffix):
    for c in sdf.columns:
        cl = str(c).lower()
        if cl == suffix or cl.endswith('_'+suffix):
            return c
    return None
c_xx = find_col(shield, 'xx')
c_yy = find_col(shield, 'yy')
c_zz = find_col(shield, 'zz')
if c_xx is None or c_yy is None or c_zz is None:
    raise KeyError(f'Diagonal shielding columns not found. Available: {shield.columns.tolist()}')
shield['shield_iso'] = ((shield[c_xx] + shield[c_yy] + shield[c_zz]) / 3.0).astype(np.float32)
mull['mulliken_charge'] = mull['mulliken_charge'].astype(np.float32)
shield = shield[['molecule_name','atom_index','shield_iso']]
atom_q = mull.merge(shield, on=['molecule_name','atom_index'], how='left')

# Molecule atom counts (n_atoms, n_H, n_heavy)
mol_counts = atoms.groupby('molecule_name').agg(
    n_atoms=('atom_index','count'),
    n_H=('atom', lambda s: (s=='H').sum()),
    n_heavy=('atom', lambda s: (s!='H').sum()),
).reset_index()

sym_map = {'H':0,'C':1,'N':2,'O':3,'F':4}

def build_pair_frame(df, is_train: bool):
    t1 = time.time()
    # Prepare atom tables for merge with per-atom quantum props
    a = atoms.merge(atom_q, on=['molecule_name','atom_index'], how='left')
    a0 = a.rename(columns={'atom_index':'atom_index_0','atom':'atom_0','x':'x0','y':'y0','z':'z0','Z':'Z0','EN':'EN0','covrad':'covrad0','period':'period0','group':'group0','valence_e':'valence_e0','mulliken_charge':'q0','shield_iso':'shield0'})
    a1 = a.rename(columns={'atom_index':'atom_index_1','atom':'atom_1','x':'x1','y':'y1','z':'z1','Z':'Z1','EN':'EN1','covrad':'covrad1','period':'period1','group':'group1','valence_e':'valence_e1','mulliken_charge':'q1','shield_iso':'shield1'})
    out = df.merge(a0, on=['molecule_name','atom_index_0'], how='left')
    out = out.merge(a1, on=['molecule_name','atom_index_1'], how='left')
    out = out.merge(mol_ctx, on='molecule_name', how='left')
    out = out.merge(mol_counts, on='molecule_name', how='left')
    # Geometry
    dx = (out['x0'].values - out['x1'].values).astype(np.float32)
    dy = (out['y0'].values - out['y1'].values).astype(np.float32)
    dz = (out['z0'].values - out['z1'].values).astype(np.float32)
    d2 = dx*dx + dy*dy + dz*dz
    dist = np.sqrt(d2) + 1e-6
    out['dist'] = dist.astype(np.float32)
    out['log_dist'] = np.log(dist).astype(np.float32)
    out['dist2'] = d2.astype(np.float32)
    out['inv_dist'] = (1.0/dist).astype(np.float32)
    out['inv_d2'] = (1.0/d2.clip(min=1e-6)).astype(np.float32)
    out['inv_d3'] = (1.0/(dist*dist*dist)).astype(np.float32)
    # Dipole projection onto bond
    bond_ux = dx/dist; bond_uy = dy/dist; bond_uz = dz/dist
    out['dip_proj'] = (out['dx'].values*bond_ux + out['dy'].values*bond_uy + out['dz'].values*bond_uz).astype(np.float32)
    out['dip_cos'] = (out['dip_proj'] / (out['dip_norm'] + 1e-6)).astype(np.float32)
    out['dip_abs_proj'] = out['dip_proj'].abs().astype(np.float32)
    # Atom identity/features
    out['same_element'] = (out['atom_0'].values == out['atom_1'].values).astype(np.int8)
    for a in ['Z','EN','covrad','valence_e','period','group']:
        a0c, a1c = f'{a}0', f'{a}1'
        out[f'{a}_sum']  = (out[a0c].values + out[a1c].values).astype(np.float32)
        out[f'{a}_diff'] = (out[a0c].values - out[a1c].values).astype(np.float32)
    out['EN_absdiff'] = out['EN_diff'].abs().astype(np.float32)
    out['covrad_ratio'] = (out['covrad0'].values / np.where(out['covrad1'].values==0, np.nan, out['covrad1'].values)).astype(np.float32)
    out['covrad_ratio'] = pd.Series(out['covrad_ratio']).replace([np.inf,-np.inf], np.nan).fillna(0).astype(np.float32)
    # Cheap extras
    covsum = (out['covrad0'].values + out['covrad1'].values + 1e-6).astype(np.float32)
    out['dist_over_covsum'] = (out['dist'].values / covsum).astype(np.float32)
    out['covrad_min'] = np.minimum(out['covrad0'].values, out['covrad1'].values).astype(np.float32)
    out['covrad_max'] = np.maximum(out['covrad0'].values, out['covrad1'].values).astype(np.float32)
    # Mulliken & shielding
    for c in ['q0','q1','shield0','shield1']:
        out[c] = out[c].astype(np.float32)
    out['q_sum'] = (out['q0'].values + out['q1'].values).astype(np.float32)
    out['q_diff'] = (out['q0'].values - out['q1'].values).astype(np.float32)
    out['shield_sum'] = (out['shield0'].values + out['shield1'].values).astype(np.float32)
    out['shield_diff'] = (out['shield0'].values - out['shield1'].values).astype(np.float32)
    # Sym codes and unordered pair code
    out['sym0'] = out['atom_0'].map(sym_map).fillna(-1).astype(np.int8)
    out['sym1'] = out['atom_1'].map(sym_map).fillna(-1).astype(np.int8)
    smin = np.minimum(out['sym0'].values, out['sym1'].values).astype(np.int16)
    smax = np.maximum(out['sym0'].values, out['sym1'].values).astype(np.int16)
    out['pair_code'] = (smin*8 + smax).astype(np.int16)
    # Merge fold for train
    if is_train:
        out = out.merge(folds, on='molecule_name', how='left')
        assert 'fold' in out.columns and out['fold'].isna().sum()==0, "Missing 'fold' after merge for train"
    # Build keep list without duplicating molecule_name
    base_head = ['id','type','molecule_name'] + (['fold'] if is_train else [])
    keep_rest = [
        'dist','log_dist','dist2','inv_dist','inv_d2','inv_d3','same_element','sym0','sym1','pair_code',
        'Z0','Z1','EN0','EN1','covrad0','covrad1','valence_e0','valence_e1','period0','period1','group0','group1',
        'Z_sum','Z_diff','EN_sum','EN_diff','EN_absdiff','covrad_sum','covrad_diff','covrad_ratio','valence_e_sum','valence_e_diff','period_sum','period_diff','group_sum','group_diff',
        'q0','q1','q_sum','q_diff','shield0','shield1','shield_sum','shield_diff',
        'potential_energy','dx','dy','dz','dip_norm','dip_proj','dip_cos','dip_abs_proj','dist_over_covsum','covrad_min','covrad_max',
        'n_atoms','n_H','n_heavy',
        'atom_index_0','atom_index_1'
    ]
    keep = base_head + [c for c in keep_rest if c in out.columns]
    out = out.loc[:, ~pd.Index(out.columns).duplicated()].copy()
    out = out[keep].copy()
    # Dtypes
    for c in out.select_dtypes(include=['float64']).columns: out[c] = out[c].astype(np.float32)
    for c in out.select_dtypes(include=['int64']).columns: out[c] = out[c].astype(np.int32)
    print('Built pair frame in', f'{time.time()-t1:.2f}s', 'shape:', out.shape)
    return out

Xtr_base = build_pair_frame(train, is_train=True)
ytr = pd.DataFrame({'id': train['id'].values, 'scalar_coupling_constant': train['scalar_coupling_constant'].values, 'type': train['type'].values})
Xte_base = build_pair_frame(test, is_train=False)

# Sanity checks on base frames
assert Xtr_base['id'].is_unique and Xte_base['id'].is_unique, 'IDs not unique'
assert 'fold' in Xtr_base.columns and Xtr_base['fold'].isna().sum()==0, "Train base missing fold"
assert Xtr_base['molecule_name'].isna().sum()==0 and Xte_base['molecule_name'].isna().sum()==0, 'Missing molecule_name after merges'

# Graph features: per-molecule covalent graph with k=1.15*(covrad_i+covrad_j)
def graph_features_for_molecule(mol_atoms, pairs_rows, k=1.15, clip_len=6):
    # mol_atoms has columns: atom_index (global within mol), x,y,z,covrad
    idx = mol_atoms['atom_index'].values.astype(np.int32)
    idx2local = {g:i for i,g in enumerate(idx)}
    coords = mol_atoms[['x','y','z']].values.astype(np.float32)
    covr = mol_atoms['covrad'].values.astype(np.float32)
    n = len(idx)
    adj = [[] for _ in range(n)]
    # Build adjacency
    for i in range(n):
        ci = coords[i]
        for j in range(i+1, n):
            dij = float(np.linalg.norm(ci - coords[j]))
            thr = float(k * (covr[i] + covr[j]))
            if dij < thr:
                adj[i].append(j); adj[j].append(i)
    deg = np.array([len(nei) for nei in adj], dtype=np.int16)
    adj_sets = [set(nei) for nei in adj]
    # Prepare pairs in local indices
    a0g = pairs_rows['atom_index_0'].values.astype(np.int32)
    a1g = pairs_rows['atom_index_1'].values.astype(np.int32)
    a0 = np.array([idx2local.get(g, -1) for g in a0g], dtype=np.int32)
    a1 = np.array([idx2local.get(g, -1) for g in a1g], dtype=np.int32)
    # BFS from unique sources
    uniq_src = sorted(set([int(s) for s in a0 if s >= 0]))
    dist_map = {}
    for s in uniq_src:
        dist = np.full(n, -1, dtype=np.int16); dist[s] = 0
        dq = deque([s])
        while dq:
            u = dq.popleft()
            if dist[u] >= clip_len:
                continue
            for v in adj[u]:
                if dist[v] == -1:
                    dist[v] = dist[u] + 1
                    dq.append(v)
        dist_map[s] = dist
    # Collect features
    m = len(a0)
    path_len = np.full(m, clip_len, dtype=np.int16)
    is_bonded = np.zeros(m, dtype=np.int8)
    deg0 = np.zeros(m, dtype=np.int16)
    deg1 = np.zeros(m, dtype=np.int16)
    common_nei = np.zeros(m, dtype=np.int16)
    for i in range(m):
        u,v = a0[i], a1[i]
        if u >= 0 and v >= 0:
            d = dist_map.get(int(u), None)
            if d is not None and d[v] != -1:
                path_len[i] = min(int(d[v]), clip_len)
                is_bonded[i] = 1 if path_len[i] == 1 else 0
            deg0[i] = deg[u]
            deg1[i] = deg[v]
            common_nei[i] = len(adj_sets[u].intersection(adj_sets[v]))
    return path_len, is_bonded, deg0, deg1, common_nei

def add_graph_features(Xbase):
    t2 = time.time()
    # Ensure no duplicate column names (e.g., molecule_name)
    Xbase = Xbase.loc[:, ~pd.Index(Xbase.columns).duplicated()]
    # Prepare atoms per molecule minimal subset
    atoms_min = atoms[['molecule_name','atom_index','x','y','z','covrad']].copy()
    # Group by molecule to process
    Xbase = Xbase.sort_values(['molecule_name']).reset_index(drop=True)
    grp_idx = Xbase.groupby('molecule_name').indices
    # Arrays to fill
    n = len(Xbase)
    path_len = np.full(n, 6, dtype=np.int16)
    is_bonded = np.zeros(n, dtype=np.int8)
    deg0 = np.zeros(n, dtype=np.int16)
    deg1 = np.zeros(n, dtype=np.int16)
    comn = np.zeros(n, dtype=np.int16)
    processed = 0
    for gi, (mol, idxs) in enumerate(grp_idx.items()):
        pairs_rows = Xbase.loc[idxs, ['atom_index_0','atom_index_1']]
        mol_atoms = atoms_min[atoms_min['molecule_name'] == mol]
        pl, ib, d0, d1, cn = graph_features_for_molecule(mol_atoms, pairs_rows, k=1.15, clip_len=6)
        path_len[idxs] = pl
        is_bonded[idxs] = ib
        deg0[idxs] = d0
        deg1[idxs] = d1
        comn[idxs] = cn
        processed += 1
        if processed % 2000 == 0:
            print(f'  processed {processed}/{len(grp_idx)} molecules; elapsed {time.time()-t2:.1f}s', flush=True)
    Xbase['path_len'] = path_len
    Xbase['is_bonded'] = is_bonded
    Xbase['degree_0'] = deg0
    Xbase['degree_1'] = deg1
    Xbase['deg_sum'] = (Xbase['degree_0'].values + Xbase['degree_1'].values).astype(np.int16)
    Xbase['deg_diff'] = (Xbase['degree_0'].values - Xbase['degree_1'].values).astype(np.int16)
    Xbase['common_neighbors'] = comn
    print('Graph features added in', f'{time.time()-t2:.2f}s')
    return Xbase

Xtr_v1_path = 'X_train_v1.parquet'; Xte_v1_path = 'X_test_v1.parquet'

# Compute/load train graph
if os.path.exists(Xtr_v1_path):
    print('Loading cached train v1 features...')
    Xtr_v1 = pd.read_parquet(Xtr_v1_path)
else:
    print('Adding graph features to train...')
    Xtr_v1 = add_graph_features(Xtr_base)
    # Save immediately to avoid losing work if test step fails later
    Xtr_v1.to_parquet(Xtr_v1_path, index=False)
    print('Saved train v1 to', Xtr_v1_path)

# Compute/load test graph
if os.path.exists(Xte_v1_path):
    print('Loading cached test v1 features...')
    Xte_v1 = pd.read_parquet(Xte_v1_path)
else:
    print('Adding graph features to test...')
    Xte_v1 = add_graph_features(Xte_base)
    Xte_v1.to_parquet(Xte_v1_path, index=False)
    print('Saved test v1 to', Xte_v1_path)

# Post-run sanity checks
assert Xtr_v1.shape[0] == train.shape[0], f'Train rows mismatch: {Xtr_v1.shape[0]} vs {train.shape[0]}'
assert Xte_v1.shape[0] == test.shape[0], f'Test rows mismatch: {Xte_v1.shape[0]} vs {test.shape[0]}'
for df_chk, name in [(Xtr_v1,'train_v1'), (Xte_v1,'test_v1')]:
    key_cols = [c for c in ['dist','inv_dist','covrad_ratio','dip_proj','dip_cos','dist_over_covsum'] if c in df_chk.columns]
    if key_cols:
        assert np.isfinite(df_chk[key_cols].to_numpy()).all(), f'Non-finite values found in {name}'

# Save y
ytr.to_csv('y_train.csv', index=False)
print('Saved v1:', Xtr_v1_path, Xtr_v1.shape, '|', Xte_v1_path, Xte_v1.shape, '| y:', ytr.shape)
print(f'FE v1 total time: {time.time()-t0:.2f}s', flush=True)

Loading base inputs...


Loading mulliken and shielding...


Built pair frame in 4.39s shape: (4191263, 64)


Built pair frame in 0.79s shape: (467813, 63)


Adding graph features to train...


  processed 2000/76510 molecules; elapsed 128.5s


  processed 4000/76510 molecules; elapsed 253.8s


  processed 6000/76510 molecules; elapsed 379.6s


  processed 8000/76510 molecules; elapsed 505.5s


  processed 10000/76510 molecules; elapsed 631.2s


  processed 12000/76510 molecules; elapsed 756.8s


  processed 14000/76510 molecules; elapsed 881.9s


  processed 16000/76510 molecules; elapsed 1006.9s


  processed 18000/76510 molecules; elapsed 1132.1s


  processed 20000/76510 molecules; elapsed 1257.6s


  processed 22000/76510 molecules; elapsed 1383.7s


  processed 24000/76510 molecules; elapsed 1510.3s


  processed 26000/76510 molecules; elapsed 1635.6s


  processed 28000/76510 molecules; elapsed 1760.8s


  processed 30000/76510 molecules; elapsed 1886.2s


  processed 32000/76510 molecules; elapsed 2011.9s


  processed 34000/76510 molecules; elapsed 2138.5s


  processed 36000/76510 molecules; elapsed 2265.0s


  processed 38000/76510 molecules; elapsed 2391.5s


  processed 40000/76510 molecules; elapsed 2517.8s


  processed 42000/76510 molecules; elapsed 2644.2s


  processed 44000/76510 molecules; elapsed 2770.6s


  processed 46000/76510 molecules; elapsed 2897.1s


  processed 48000/76510 molecules; elapsed 3023.6s


  processed 50000/76510 molecules; elapsed 3150.1s


  processed 52000/76510 molecules; elapsed 3276.8s


  processed 54000/76510 molecules; elapsed 3403.5s


  processed 56000/76510 molecules; elapsed 3530.0s


  processed 58000/76510 molecules; elapsed 3656.5s


  processed 60000/76510 molecules; elapsed 3783.5s


  processed 62000/76510 molecules; elapsed 3910.1s


  processed 64000/76510 molecules; elapsed 4036.8s


  processed 66000/76510 molecules; elapsed 4163.5s


  processed 68000/76510 molecules; elapsed 4290.3s


  processed 70000/76510 molecules; elapsed 4417.4s


  processed 72000/76510 molecules; elapsed 4543.3s


  processed 74000/76510 molecules; elapsed 4668.4s


  processed 76000/76510 molecules; elapsed 4792.8s


Graph features added in 4824.49s


Saved train v1 to X_train_v1.parquet
Adding graph features to test...


  processed 2000/8502 molecules; elapsed 118.9s


  processed 4000/8502 molecules; elapsed 237.6s


  processed 6000/8502 molecules; elapsed 356.3s


  processed 8000/8502 molecules; elapsed 475.0s


Graph features added in 504.74s
Saved test v1 to X_test_v1.parquet


AssertionError: Non-finite values found in test_v1

In [12]:
# Modeling v1: per-type XGBoost GPU with GroupCV (prebuilt folds). Run after X_train_v1.parquet exists.
import os, time, gc, json
import numpy as np, pandas as pd
from sklearn.metrics import mean_absolute_error
import xgboost as xgb

def log_mae_by_type(y_true, y_pred, types):
    vals = []
    for t in np.unique(types):
        m = (types == t)
        mae = float(np.mean(np.abs(y_true[m] - y_pred[m])))
        vals.append(np.log(mae + 1e-9))
    return float(np.mean(vals))

feat_train_path = 'X_train_v1.parquet'
y_path = 'y_train.csv'
assert os.path.exists(feat_train_path), f'Missing {feat_train_path}; run FE v1 first.'
X = pd.read_parquet(feat_train_path)
y = pd.read_csv(y_path)
assert X.shape[0] == y.shape[0], 'Row mismatch X vs y'

# Align by id to be safe
X = X.sort_values('id').reset_index(drop=True)
y = y.sort_values('id').reset_index(drop=True)
for c in ['id','type']:
    assert (X[c].values == y[c].values).all(), f'Mismatch in column {c}'

# Feature columns
drop_cols = [c for c in ['id','molecule_name','type','fold','atom_index_0','atom_index_1'] if c in X.columns]
feat_cols = [c for c in X.columns if c not in drop_cols]
print('Num features:', len(feat_cols))

types = X['type'].values
folds = X['fold'].values
target = y['scalar_coupling_constant'].values.astype(np.float32)

type_list = sorted(np.unique(types))
oof = np.zeros_like(target, dtype=np.float32)
models_info = {}

xgb_params = {
    'tree_method': 'gpu_hist',
    'predictor': 'gpu_predictor',
    'objective': 'reg:absoluteerror',
    'eval_metric': 'mae',
    'learning_rate': 0.05,
    'max_depth': 8,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_alpha': 0.0,
    'reg_lambda': 1.0,
    'n_estimators': 20000,
}
ESR = 100

t0 = time.time()
for t in type_list:
    m_type = (types == t)
    Xt = X.loc[m_type, feat_cols].astype(np.float32)
    yt = target[m_type]
    ft = folds[m_type]
    print(f'\nType {t}: rows={Xt.shape[0]} features={Xt.shape[1]}')
    oof_t = np.zeros_like(yt, dtype=np.float32)
    fold_models = []
    for k in sorted(np.unique(folds)):
        tr_idx = (ft != k)
        va_idx = (ft == k)
        if va_idx.sum() == 0 or tr_idx.sum() == 0:
            continue
        dtrain = xgb.DMatrix(Xt.loc[tr_idx], label=yt[tr_idx])
        dvalid = xgb.DMatrix(Xt.loc[va_idx], label=yt[va_idx])
        w = xgb.train(xgb_params, dtrain, num_boost_round=xgb_params['n_estimators'], evals=[(dvalid, 'valid')],
                      early_stopping_rounds=ESR, verbose_eval=False)
        preds = w.predict(dvalid, iteration_range=(0, w.best_iteration+1))
        oof_t[va_idx] = preds.astype(np.float32)
        fold_models.append({'fold': int(k), 'best_iteration': int(w.best_iteration)})
        print(f'  fold {k}: best_iter={int(w.best_iteration)} MAE={mean_absolute_error(yt[va_idx], preds):.5f}', flush=True)
    oof[m_type] = oof_t
    mae_t = mean_absolute_error(yt, oof_t)
    print(f'Type {t}: OOF MAE={mae_t:.5f}', flush=True)
    models_info[t] = fold_models
    del Xt; gc.collect()

overall_logmae = log_mae_by_type(target, oof, types)
print('\nOOF log-MAE (competition metric proxy):', overall_logmae)

# Save OOF for diagnostics
oof_df = pd.DataFrame({'id': X['id'].values, 'type': types, 'oof': oof, 'y': target})
oof_df.to_csv('oof_xgb_v1.csv', index=False)
json.dump(models_info, open('models_info_xgb_v1.json','w'))
print('Saved oof and models info. Total time:', f'{time.time()-t0:.1f}s', flush=True)

Num features: 65



Type 1JHC: rows=637912 features=65



    E.g. tree_method = "hist", device = "cuda"

Parameters: { "n_estimators", "predictor" } are not used.



XGBoostError: [03:02:36] /workspace/src/tree/updater_gpu_hist.cu:861: Exception in gpu_hist: [03:02:36] /workspace/src/tree/updater_gpu_hist.cu:867: Check failed: ctx_->Ordinal() >= 0 (-1 vs. 0) : Must have at least one device
Stack trace:
  [bt] (0) /usr/local/lib/python3.11/dist-packages/xgboost/lib/libxgboost.so(+0x25c1ac) [0x7d679405c1ac]
  [bt] (1) /usr/local/lib/python3.11/dist-packages/xgboost/lib/libxgboost.so(+0xe2d2dd) [0x7d6794c2d2dd]
  [bt] (2) /usr/local/lib/python3.11/dist-packages/xgboost/lib/libxgboost.so(+0xe3b814) [0x7d6794c3b814]
  [bt] (3) /usr/local/lib/python3.11/dist-packages/xgboost/lib/libxgboost.so(+0x5ad006) [0x7d67943ad006]
  [bt] (4) /usr/local/lib/python3.11/dist-packages/xgboost/lib/libxgboost.so(+0x5ae3d4) [0x7d67943ae3d4]
  [bt] (5) /usr/local/lib/python3.11/dist-packages/xgboost/lib/libxgboost.so(+0x5f8cd8) [0x7d67943f8cd8]
  [bt] (6) /usr/local/lib/python3.11/dist-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x6f) [0x7d6793f65a1f]
  [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.8(+0x7e2e) [0x7d6a645f5e2e]
  [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.8(+0x4493) [0x7d6a645f2493]



Stack trace:
  [bt] (0) /usr/local/lib/python3.11/dist-packages/xgboost/lib/libxgboost.so(+0x25c1ac) [0x7d679405c1ac]
  [bt] (1) /usr/local/lib/python3.11/dist-packages/xgboost/lib/libxgboost.so(+0xe3ba0b) [0x7d6794c3ba0b]
  [bt] (2) /usr/local/lib/python3.11/dist-packages/xgboost/lib/libxgboost.so(+0x5ad006) [0x7d67943ad006]
  [bt] (3) /usr/local/lib/python3.11/dist-packages/xgboost/lib/libxgboost.so(+0x5ae3d4) [0x7d67943ae3d4]
  [bt] (4) /usr/local/lib/python3.11/dist-packages/xgboost/lib/libxgboost.so(+0x5f8cd8) [0x7d67943f8cd8]
  [bt] (5) /usr/local/lib/python3.11/dist-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x6f) [0x7d6793f65a1f]
  [bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.8(+0x7e2e) [0x7d6a645f5e2e]
  [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.8(+0x4493) [0x7d6a645f2493]
  [bt] (8) /usr/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0xa99d) [0x7d6a65e8899d]



In [15]:
# Monitor CatBoost training artifacts and GPU status
import os, time, json, pandas as pd, numpy as np, subprocess, datetime as dt

def fstat(path):
    if not os.path.exists(path):
        return {'exists': False}
    st = os.stat(path)
    return {'exists': True, 'size_MB': round(st.st_size/1e6, 3), 'mtime': dt.datetime.fromtimestamp(st.st_mtime).isoformat(timespec='seconds')}

paths = [
    'oof_cat_v1.csv',
    'models_info_cat_v1.json',
    'best_iters_by_type_cat_v1.json',
    'submission_cat_v1.csv',
    'submission.csv',
    'docker_run.log',
]
print('Artifact status:')
for p in paths:
    print(p, fstat(p))

print('\nGPU status (nvidia-smi):')
try:
    out = subprocess.run(['bash','-lc','nvidia-smi || true'], capture_output=True, text=True, check=False)
    print(out.stdout)
except Exception as e:
    print('nvidia-smi failed:', e)

# If OOF exists, show quick summary
if os.path.exists('oof_cat_v1.csv'):
    oof = pd.read_csv('oof_cat_v1.csv')
    def log_mae_by_type(y_true, y_pred, types):
        vals = []
        for t in np.unique(types):
            m = (types == t)
            mae = float(np.mean(np.abs(y_true[m] - y_pred[m])))
            vals.append(np.log(mae + 1e-9))
        return float(np.mean(vals))
    score = log_mae_by_type(oof['y'].values, oof['oof'].values, oof['type'].values)
    print(f'OOF log-MAE proxy: {score:.6f}')
    print(oof.groupby('type').apply(lambda d: np.log(np.mean(np.abs(d['y']-d['oof']))+1e-9)).to_string())

print('Monitoring done at', time.strftime('%Y-%m-%d %H:%M:%S'), flush=True)

Artifact status:
oof_cat_v1.csv {'exists': False}
models_info_cat_v1.json {'exists': False}
best_iters_by_type_cat_v1.json {'exists': False}
submission_cat_v1.csv {'exists': False}
submission.csv {'exists': True, 'size_MB': 7.731, 'mtime': '2025-09-23T23:22:09'}
docker_run.log {'exists': True, 'size_MB': 0.523, 'mtime': '2025-09-24T09:22:15'}

GPU status (nvidia-smi):
Failed to initialize NVML: Unknown Error

Monitoring done at 2025-09-24 09:22:15
