# Plan to Medal: NOMAD 2018 (formation_energy_ev_natom)

Objectives:
- Build a strong baseline quickly; iterate to medal.
- Primary target: formation_energy_ev_natom. Metric: RMSLE, computed per target. The notebook also models bandgap_energy_ev and writes both columns to the submission.

Approach:
1) Data loading + sanity checks
- Read train.csv/test.csv; inspect target distribution and ID linkage to folders.

2) Fast, high-signal features (composition-driven, no heavy structure ML yet)
- Parse each geometry.xyz to count elements {Al, Ga, In, O} and total atoms.
- Compute composition fractions and simple chemical descriptors as composition-weighted averages:
  atomic number, periodic row/group, Pauling electronegativity, atomic radius, valence electrons, first ionization energy, covalent radius.
- Derived stats: weighted mean/var/skew of each descriptor over the composition, entropy of composition, ratios (cation/anion fractions), and stoichiometry deviations from (Al_x Ga_y In_z)_{2N}O_{3N}.
- Counts by species type (cation count = Al+Ga+In; anion count = O).
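
The composition step above could be sketched as follows. This is a minimal illustration, not the notebook's implementation: it assumes atoms appear in geometry.xyz on lines of the form `atom x y z Element` (verify against a real file), it treats Al/Ga/In as cations and everything else as anions, and `count_elements` and `composition_features` are hypothetical helper names.

```python
import math
from collections import Counter

CATIONS = ("Al", "Ga", "In")

def count_elements(xyz_text: str) -> Counter:
    """Count element symbols on lines assumed to look like 'atom x y z El'."""
    counts = Counter()
    for line in xyz_text.splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[0] == "atom":
            counts[parts[4]] += 1
    return counts

def composition_features(counts: Counter) -> dict:
    """Fractions over all atoms plus Shannon entropy of the cation mix."""
    total = sum(counts.values())
    n_cation = sum(counts[el] for el in CATIONS)
    feats = {"n_total": total, "n_cation": n_cation, "n_anion": total - n_cation}
    for el in CATIONS:
        feats[f"frac_{el}"] = counts[el] / total if total else 0.0
    # Entropy over cation fractions only; absent species contribute nothing.
    probs = [counts[el] / n_cation for el in CATIONS if n_cation and counts[el] > 0]
    feats["cation_entropy"] = -sum(p * math.log(p) for p in probs)
    return feats

# Tiny made-up structure, used only to exercise the helpers.
sample = """lattice_vector 5.0 0.0 0.0
atom 0.0 0.0 0.0 Al
atom 1.0 0.0 0.0 Ga
atom 0.0 1.0 0.0 O
atom 0.0 0.0 1.0 O
atom 1.0 1.0 0.0 O
"""
counts = count_elements(sample)
feats = composition_features(counts)
print(dict(counts), feats)
```

Weighted descriptor averages (electronegativity, radii, and so on) would follow the same pattern with a per-element lookup table, which is omitted here because the table values are not part of this plan.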

3) Optional light structure geometry signals (cheap)
- From xyz only (no lattice):
  - Bounding box extents per axis, mean pairwise distance (sampled), min/max/median NN distance (fast KDTree), density-like proxy: atoms / bbox volume.
  - These are optional; include if fast enough.
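
A rough sketch of these geometry signals, under stated assumptions: coordinates are already parsed into (x, y, z) tuples, periodic images are ignored, and a brute-force pairwise loop stands in for the KDTree named above so that the example needs only the standard library. `cheap_geometry_features` is a hypothetical name.

```python
import math
from itertools import combinations
from statistics import median

def cheap_geometry_features(coords):
    """coords: list of (x, y, z) tuples in angstrom, at least two atoms."""
    n = len(coords)
    # Axis-aligned bounding box extents and volume.
    extents = [max(p[k] for p in coords) - min(p[k] for p in coords) for k in range(3)]
    bbox_volume = extents[0] * extents[1] * extents[2]
    # Nearest-neighbour distance per atom via an O(n^2) scan (fine for ~80 atoms).
    nn = [math.inf] * n
    for i, j in combinations(range(n), 2):
        d = math.dist(coords[i], coords[j])
        nn[i] = min(nn[i], d)
        nn[j] = min(nn[j], d)
    return {
        "extent_x": extents[0],
        "extent_y": extents[1],
        "extent_z": extents[2],
        "nn_min": min(nn),
        "nn_median": median(nn),
        "nn_max": max(nn),
        # Density-like proxy; undefined for flat or degenerate boxes.
        "bbox_density": n / bbox_volume if bbox_volume > 0 else float("nan"),
    }

# Made-up coordinates for illustration only.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
geo = cheap_geometry_features(coords)
print(geo)
```

For larger cells, a tree-based neighbour search would replace the double loop without changing the returned features.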

4) Modeling
- LightGBM regressor with KFold (5 or 10 folds), robust params, early stopping.
- Target transform: either fit on y directly and score with RMSLE in CV, or fit on log1p(y) and compute RMSLE after inverting predictions with expm1.
- Strong logging and timing per fold.
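
The two target-transform options give the same score for the same predictions, since RMSLE on raw values equals RMSE in log1p space. A small sketch, with made-up numbers and hypothetical helper names `rmsle` and `rmse`:

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; all values must be > -1."""
    sq = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))

def rmse(a, b):
    """Plain root mean squared error."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Illustrative values only, not taken from the dataset.
y_true = [0.0, 0.1, 0.25, 0.4]
y_pred = [0.02, 0.08, 0.30, 0.35]

# Option A: score raw-scale predictions with RMSLE.
score_raw = rmsle(y_true, y_pred)

# Option B: work in log1p space, score with RMSE, invert with expm1 at the end.
log_true = [math.log1p(v) for v in y_true]
log_pred = [math.log1p(v) for v in y_pred]
score_log = rmse(log_true, log_pred)
restored = [math.expm1(v) for v in log_pred]

print(score_raw, score_log)
```

Fitting in log1p space therefore lets an ordinary squared-error objective optimize the competition metric directly, provided predictions are inverted before submission.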

5) Submission
- Train on full data with best params; predict test; write submission.csv matching sample format.

Iteration roadmap:
- v1: Composition-only features + LGBM baseline.
- v2: Add cheap geometry features.
- v3: Tune LGBM (num_leaves, feature_fraction, min_data_in_leaf).
- v4: Try CatBoost/ExtraTrees stack if time permits.

Checkpoints requiring expert review:
- After plan (now), after EDA/feature schema, after first CV result, before any heavy geometry computation, and before final submission.

In [1]:
# Setup: imports and basic data check
import os, sys, json, math, time, gc, pathlib, warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd

print('Python', sys.version)
print('CWD', os.getcwd())

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print('train shape:', train.shape, 'test shape:', test.shape)
print('train columns:', list(train.columns))
print('test columns:', list(test.columns))

# Peek and target diagnostics
print('\nHead train:')
print(train.head(3))
targets = [c for c in ['formation_energy_ev_natom','bandgap_energy_ev'] if c in train.columns]
print('Detected targets:', targets)
if 'formation_energy_ev_natom' in train.columns:
    y = train['formation_energy_ev_natom'].values
    print('formation_energy_ev_natom min/mean/max:', float(np.min(y)), float(np.mean(y)), float(np.max(y)))
    print('formation_energy_ev_natom <= 0 count:', int(np.sum(y <= 0)))

# Check sample_submission expectations
if os.path.exists('sample_submission.csv'):
    sub_sample = pd.read_csv('sample_submission.csv')
    print('\nSample submission columns:', list(sub_sample.columns))
else:
    print('sample_submission.csv not found')

# Utilities for paths
def train_xyz_path(idx: int) -> str:
    return os.path.join('train', str(int(idx)), 'geometry.xyz')

def test_xyz_path(idx: int) -> str:
    return os.path.join('test', str(int(idx)), 'geometry.xyz')

# Quick existence checks
missing_train = []
for i in train['id'].head(5):
    p = train_xyz_path(i)
    if not os.path.exists(p):
        missing_train.append(p)
print('Example train xyz exists (first 5):', len(missing_train)==0, 'missing:', missing_train)

missing_test = []
for i in test['id'].head(5):
    p = test_xyz_path(i)
    if not os.path.exists(p):
        missing_test.append(p)
print('Example test xyz exists (first 5):', len(missing_test)==0, 'missing:', missing_test)

Python 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
CWD /app/agent_run_states/nomad2018-predict-transparent-conductors-spray-20250909-035925
train shape: (2160, 14) test shape: (240, 12)
train columns: ['id', 'spacegroup', 'number_of_total_atoms', 'percent_atom_al', 'percent_atom_ga', 'percent_atom_in', 'lattice_vector_1_ang', 'lattice_vector_2_ang', 'lattice_vector_3_ang', 'lattice_angle_alpha_degree', 'lattice_angle_beta_degree', 'lattice_angle_gamma_degree', 'formation_energy_ev_natom', 'bandgap_energy_ev']
test columns: ['id', 'spacegroup', 'number_of_total_atoms', 'percent_atom_al', 'percent_atom_ga', 'percent_atom_in', 'lattice_vector_1_ang', 'lattice_vector_2_ang', 'lattice_vector_3_ang', 'lattice_angle_alpha_degree', 'lattice_angle_beta_degree', 'lattice_angle_gamma_degree']

Head train:
   id  spacegroup  number_of_total_atoms  percent_atom_al  percent_atom_ga  \
0   1         206                   80.0           0.3125            0.625   
1   2         206            

In [5]:
# Feature engineering from CSV (fast) + LGBM CV for both targets (shuffled KFold for now; GroupKFold is wired in but disabled)
import numpy as np
import pandas as pd
import time, math, os, gc
from sklearn.model_selection import GroupKFold, StratifiedKFold, KFold
from sklearn.metrics import mean_squared_log_error
from sklearn.preprocessing import OneHotEncoder
from scipy.stats import entropy

def compute_volume_from_lengths_angles(a,b,c,alpha_deg,beta_deg,gamma_deg):
    alpha = math.radians(alpha_deg); beta = math.radians(beta_deg); gamma = math.radians(gamma_deg)
    cos_alpha, cos_beta, cos_gamma = math.cos(alpha), math.cos(beta), math.cos(gamma)
    vol = a*b*c*math.sqrt(max(1 + 2*cos_alpha*cos_beta*cos_gamma - cos_alpha*cos_alpha - cos_beta*cos_beta - cos_gamma*cos_gamma, 0.0))
    return vol

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    X = pd.DataFrame(index=df.index)
    # Basics
    X['id'] = df['id'].values
    X['spacegroup'] = df['spacegroup'].astype(int)
    Ntot = df['number_of_total_atoms'].astype(float).fillna(0)
    X['N_total'] = Ntot
    # Lattice
    a = df['lattice_vector_1_ang'].astype(float)
    b = df['lattice_vector_2_ang'].astype(float)
    c = df['lattice_vector_3_ang'].astype(float)
    alpha = df['lattice_angle_alpha_degree'].astype(float)
    beta  = df['lattice_angle_beta_degree'].astype(float)
    gamma = df['lattice_angle_gamma_degree'].astype(float)
    vol = [compute_volume_from_lengths_angles(ai,bi,ci,al,be,ga) for ai,bi,ci,al,be,ga in zip(a,b,c,alpha,beta,gamma)]
    X['volume'] = np.array(vol)
    X['density'] = Ntot / np.clip(X['volume'].values, 1e-8, np.inf)
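    # Composition fractions
    # NOTE: percent_atom_* appear to be fractions of cation sites (summing to ~1 per row),
    # not fractions of all atoms. If so, x_cation is ~1 and x_anion ~0 below, and the
    # count_* features further down overstate cation counts; verify before relying on them.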
    x_al = df['percent_atom_al'].astype(float).clip(0,1)
    x_ga = df['percent_atom_ga'].astype(float).clip(0,1)
    x_in = df['percent_atom_in'].astype(float).clip(0,1)
    x_cation = x_al + x_ga + x_in
    x_anion = 1.0 - x_cation
    X['x_al'] = x_al; X['x_ga'] = x_ga; X['x_in'] = x_in
    X['x_cation'] = x_cation; X['x_anion'] = x_anion
    # Mixing entropy (cations only, avoid log(0))
    eps = 1e-12
    X['cation_entropy'] = entropy(np.c_[np.maximum(x_al,eps), np.maximum(x_ga,eps), np.maximum(x_in,eps)].T, base=np.e)
    # Approx integer counts for grouping and stoichiometry checks
    counts_al = np.rint(x_al * Ntot).astype(int)
    counts_ga = np.rint(x_ga * Ntot).astype(int)
    counts_in = np.rint(x_in * Ntot).astype(int)
    X['count_al'] = counts_al; X['count_ga'] = counts_ga; X['count_in'] = counts_in
    X['count_cation'] = counts_al + counts_ga + counts_in
    X['count_anion'] = (Ntot - X['count_cation']).astype(int)
    # Ratios and deviations
    X['cation_anion_ratio'] = X['count_cation'] / np.clip(X['count_anion'].replace(0, np.nan), 1e-8, np.inf)
    X['cation_frac_dev'] = X['x_cation'] - (2.0/5.0)  # ideal (Al,Ga,In)_{2N} O_{3N} -> cation fraction ~ 2/5
    # Lattice shape features
    X['a'] = a; X['b'] = b; X['c'] = c
    X['alpha'] = alpha; X['beta'] = beta; X['gamma'] = gamma
    X['ab_ratio'] = a/np.clip(b,1e-8,np.inf)
    X['ac_ratio'] = a/np.clip(c,1e-8,np.inf)
    X['bc_ratio'] = b/np.clip(c,1e-8,np.inf)
    return X

feat_train = build_features(train)
feat_test  = build_features(test)
print('Feature shapes:', feat_train.shape, feat_test.shape)

# One-hot encode spacegroup for models
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
sg_train = ohe.fit_transform(feat_train[['spacegroup']])
sg_test = ohe.transform(feat_test[['spacegroup']])
sg_cols = [f'sg_{i}' for i in range(sg_train.shape[1])]
sg_df_train = pd.DataFrame(sg_train, columns=sg_cols, index=feat_train.index)
sg_df_test = pd.DataFrame(sg_test, columns=sg_cols, index=feat_test.index)
X_train = pd.concat([feat_train.drop(columns=['id','spacegroup']), sg_df_train], axis=1)
X_test  = pd.concat([feat_test.drop(columns=['id','spacegroup']), sg_df_test], axis=1)

# Group key: composition counts of cations
groups = list(zip(feat_train['count_al'].tolist(), feat_train['count_ga'].tolist(), feat_train['count_in'].tolist(), feat_train['N_total'].astype(int).tolist()))

# Targets and shifted-log transform helpers
def shifted_log_y(y):
    y = y.astype(float)
    y_min = y.min()
    shift = -y_min + 1e-9 if y_min <= 0 else 0.0
    return np.log1p(y + shift), shift

def inv_shifted_log_y(y_log, shift):
    return np.expm1(y_log) - shift

def rmsle_from_preds(y_true, y_pred):
    return math.sqrt(mean_squared_log_error(np.clip(y_true, 0, None), np.clip(y_pred, 0, None)))

# LightGBM install/import (if needed)
try:
    import lightgbm as lgb
except ImportError:
    import sys, subprocess
    print('Installing lightgbm...')
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

lgb_params = {
    'objective': 'regression_l1',
    'metric': 'rmse',
    'n_estimators': 2000,
    'learning_rate': 0.01,
    'num_leaves': 31,
    'max_depth': -1,
    'min_data_in_leaf': 20,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'lambda_l1': 0.1,
    'lambda_l2': 0.1,
    'n_jobs': -1,
    'seed': 42,
    'boosting_type': 'gbdt'
}

def run_cv_and_predict(X, y, groups, params, target_name, use_groups=False, use_stratified=False):
    n_splits = 5
    oof = np.zeros(len(y), dtype=float)
    preds = np.zeros(len(X_test), dtype=float)
    y_log, shift = shifted_log_y(y.values)
    fold = 0
    t0 = time.time()
    if use_groups:
        splitter = GroupKFold(n_splits=n_splits)
        split_iter = splitter.split(X, y, groups)
    elif use_stratified:
        # Stratify on binned target to stabilize folds
        y_bins = pd.qcut(y.values, q=10, duplicates='drop')
        y_bins = pd.Series(y_bins).cat.codes.values
        splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
        split_iter = splitter.split(X, y_bins)
    else:
        splitter = KFold(n_splits=n_splits, shuffle=True, random_state=42)
        split_iter = splitter.split(X)
    for tr_idx, va_idx in split_iter:
        fold += 1
        X_tr, X_va = X.iloc[tr_idx], X.iloc[va_idx]
        y_tr, y_va = y_log[tr_idx], y_log[va_idx]
        print(f'[LGB] {target_name} Fold {fold} train={len(tr_idx)} valid={len(va_idx)}')
        model = lgb.LGBMRegressor(**params)
        model.fit(
            X_tr, y_tr,
            eval_set=[(X_va, y_va)],
            callbacks=[lgb.early_stopping(stopping_rounds=100, verbose=False)]
        )
        va_pred_log = model.predict(X_va, num_iteration=model.best_iteration_)
        va_pred = inv_shifted_log_y(va_pred_log, shift)
        oof[va_idx] = va_pred
        test_pred_log = model.predict(X_test, num_iteration=model.best_iteration_)
        preds += inv_shifted_log_y(test_pred_log, shift) / n_splits
        elapsed = time.time() - t0
        print(f'  -> elapsed {elapsed:.1f}s, best_iter={model.best_iteration_}')
    cv_rmsle = rmsle_from_preds(y.values, oof)
    print(f'[LGB] {target_name} CV RMSLE: {cv_rmsle:.6f}')
    return oof, preds, cv_rmsle

# Prepare features and run for both targets
y_fe = train['formation_energy_ev_natom']
y_bg = train['bandgap_energy_ev']

# Use KFold baseline to proceed; will revisit GroupKFold once grouping bug is resolved
oof_fe, pred_fe, cv_fe = run_cv_and_predict(X_train, y_fe, groups, lgb_params, 'formation_energy_ev_natom', use_groups=False, use_stratified=False)
oof_bg, pred_bg, cv_bg = run_cv_and_predict(X_train, y_bg, groups, lgb_params, 'bandgap_energy_ev', use_groups=False, use_stratified=False)

print('CV summary -> FE:', cv_fe, 'BG:', cv_bg, 'mean:', (cv_fe + cv_bg)/2)

# Train final models on full data for each target and predict test
# (no eval set here, so all n_estimators rounds are used, unlike the early-stopped CV folds)
def fit_full_and_predict(X, y, params):
    y_log, shift = shifted_log_y(y.values)
    model = lgb.LGBMRegressor(**params)
    model.fit(X, y_log)
    test_pred_log = model.predict(X_test)
    test_pred = inv_shifted_log_y(test_pred_log, shift)
    return test_pred

full_pred_fe = fit_full_and_predict(X_train, y_fe, lgb_params)
full_pred_bg = fit_full_and_predict(X_train, y_bg, lgb_params)

# Build submission
submission = pd.DataFrame({
    'id': test['id'],
    'formation_energy_ev_natom': full_pred_fe,
    'bandgap_energy_ev': full_pred_bg
})
submission.to_csv('submission.csv', index=False)
print('Saved submission.csv with shape', submission.shape)

Feature shapes: (2160, 27) (240, 27)
[LGB] formation_energy_ev_natom Fold 1 train=1728 valid=432
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001471 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3174
[LightGBM] [Info] Number of data points in the train set: 1728, number of used features: 31
[LightGBM] [Info] Start training from score 0.166404


  -> elapsed 1.1s, best_iter=525
[LGB] formation_energy_ev_natom Fold 2 train=1728 valid=432
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001951 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3178
[LightGBM] [Info] Number of data points in the train set: 1728, number of used features: 31
[LightGBM] [Info] Start training from score 0.167081


  -> elapsed 3.6s, best_iter=1401
[LGB] formation_energy_ev_natom Fold 3 train=1728 valid=432
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000701 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3178
[LightGBM] [Info] Number of data points in the train set: 1728, number of used features: 31
[LightGBM] [Info] Start training from score 0.166954


  -> elapsed 4.6s, best_iter=480
[LGB] formation_energy_ev_natom Fold 4 train=1728 valid=432
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000721 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3179
[LightGBM] [Info] Number of data points in the train set: 1728, number of used features: 31
[LightGBM] [Info] Start training from score 0.167927


  -> elapsed 6.6s, best_iter=968
[LGB] formation_energy_ev_natom Fold 5 train=1728 valid=432
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002018 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3172
[LightGBM] [Info] Number of data points in the train set: 1728, number of used features: 31
[LightGBM] [Info] Start training from score 0.167885


  -> elapsed 9.3s, best_iter=858
[LGB] formation_energy_ev_natom CV RMSLE: 0.032992
[LGB] bandgap_energy_ev Fold 1 train=1728 valid=432
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000663 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3174
[LightGBM] [Info] Number of data points in the train set: 1728, number of used features: 31
[LightGBM] [Info] Start training from score 1.061014


  -> elapsed 3.2s, best_iter=1792
[LGB] bandgap_energy_ev Fold 2 train=1728 valid=432
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000729 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3178
[LightGBM] [Info] Number of data points in the train set: 1728, number of used features: 31
[LightGBM] [Info] Start training from score 1.069304


  -> elapsed 4.4s, best_iter=645
[LGB] bandgap_energy_ev Fold 3 train=1728 valid=432
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000699 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3178
[LightGBM] [Info] Number of data points in the train set: 1728, number of used features: 31
[LightGBM] [Info] Start training from score 1.070641


  -> elapsed 5.9s, best_iter=792
[LGB] bandgap_energy_ev Fold 4 train=1728 valid=432
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000745 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3179
[LightGBM] [Info] Number of data points in the train set: 1728, number of used features: 31
[LightGBM] [Info] Start training from score 1.048669


  -> elapsed 7.2s, best_iter=650
[LGB] bandgap_energy_ev Fold 5 train=1728 valid=432
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000707 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3172
[LightGBM] [Info] Number of data points in the train set: 1728, number of used features: 31
[LightGBM] [Info] Start training from score 1.069304


  -> elapsed 8.9s, best_iter=891
[LGB] bandgap_energy_ev CV RMSLE: 0.088660
CV summary -> FE: 0.032992171801946205 BG: 0.08866038168539883 mean: 0.06082627674367252
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000707 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3193
[LightGBM] [Info] Number of data points in the train set: 2160, number of used features: 31
[LightGBM] [Info] Start training from score 0.167166


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000749 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3193
[LightGBM] [Info] Number of data points in the train set: 2160, number of used features: 31
[LightGBM] [Info] Start training from score 1.065279


Saved submission.csv with shape (240, 3)
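
The cell above defers GroupKFold until a grouping bug is resolved, without saying what the bug is. One possible cause, offered as an assumption rather than a diagnosis, is that `groups` is a list of 4-tuples, which array-based splitters may read as a 2-D array instead of one label per row. A minimal sketch of flattening such keys to integer labels, with `encode_groups` as a hypothetical helper:

```python
def encode_groups(keys):
    """Map each hashable key (e.g. a tuple of counts) to an integer id.

    Ids are assigned in order of first appearance, so equal keys share an id.
    """
    ids = {}
    return [ids.setdefault(k, len(ids)) for k in keys]

# Made-up keys in the same (count_al, count_ga, count_in, N_total) layout.
keys = [
    (10, 20, 2, 80),
    (10, 20, 2, 80),
    (16, 16, 0, 80),
    (10, 20, 2, 80),
    (8, 8, 0, 40),
]
group_ids = encode_groups(keys)
print(group_ids)
```

The resulting list could then be passed as the `groups` argument, for example `GroupKFold(n_splits=5).split(X_train, y_fe, group_ids)`, once the composition counts themselves have been checked.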
