# Plan to Medal: NOMAD2018 Predicting Transparent Conductors

Goals:
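- Targets: formation_energy_ev_natom and bandgap_energy_ev (both columns are required in the submission)
- Metric: mean column-wise RMSLE (RMSLE computed per target, then averaged over the two targets)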
- Output: submission.csv
- Medal thresholds: Bronze ≤ 0.06582, Silver ≤ 0.06229, Gold ≤ 0.05589
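
For reference, the metric can be sketched as follows. This assumes the usual RMSLE definition, sqrt(mean((log1p(pred) - log1p(true))^2)), averaged over the target columns; the function name `mean_columnwise_rmsle` and the toy arrays are illustrative only.

```python
import numpy as np

def mean_columnwise_rmsle(y_true, y_pred):
    """RMSLE computed per column, then averaged across columns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # axis=0 averages over rows, leaving one RMSLE value per target column
    per_col = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2, axis=0))
    return float(per_col.mean())

# Toy values (made up): column 0 ~ formation energy, column 1 ~ bandgap
y_true = np.array([[0.10, 1.5], [0.20, 2.0]])
score_perfect = mean_columnwise_rmsle(y_true, y_true)
score_off = mean_columnwise_rmsle(y_true, y_true * 1.1)
print(score_perfect, score_off)
```

Because the score is an unweighted mean, a weak model on either target drags the total down by half of its own error.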

Workflow:
1) Data audit
- Locate train.csv/test.csv or equivalent metadata; enumerate directories to confirm file structure
- Inspect columns, target distribution, and link between IDs and geometry.xyz paths

2) Feature engineering (fast → strong)
- Basic composition features: counts and fractions of Al, Ga, In, O; natoms; density proxies (cell volume if available; else surrogate from bounding box of XYZ)
- Stoichiometric descriptors: ratios (Al:Ga:In), Shannon entropy of composition
- Simple geometry stats from XYZ: interatomic distance statistics (mean/median/min/max), radial features (binned histogram), nearest-neighbor stats per element
- Optional if time: matminer composition features (ElementProperty, OxidationStates) and simple structure featurizers; cache to parquet
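
A minimal sketch of the XYZ geometry statistics listed above. It assumes each geometry.xyz file holds `lattice_vector x y z` lines and `atom x y z Element` lines, with `#` comment lines; that layout has not been verified in this notebook, so check it against a real file first. The helper names `read_xyz` and `distance_stats` are hypothetical, and distances here ignore periodic images.

```python
import io
import numpy as np

def read_xyz(handle):
    """Parse lattice vectors and atoms from a geometry.xyz-style text stream.

    Assumed layout (an assumption, not confirmed from the data):
      lattice_vector x y z
      atom x y z Element
    """
    lattice, coords, elems = [], [], []
    for line in handle:
        parts = line.split()
        if not parts or parts[0].startswith('#'):
            continue  # skip blank lines and comments
        if parts[0] == 'lattice_vector':
            lattice.append([float(v) for v in parts[1:4]])
        elif parts[0] == 'atom':
            coords.append([float(v) for v in parts[1:4]])
            elems.append(parts[4])
    return np.array(lattice), np.array(coords), elems

def distance_stats(coords):
    """Pairwise distance statistics within one cell (no periodic images)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d = dist[np.triu_indices(len(coords), k=1)]  # unique pairs only
    return {'d_min': float(d.min()), 'd_mean': float(d.mean()),
            'd_median': float(np.median(d)), 'd_max': float(d.max())}

# Toy structure, made up for illustration
example = """# toy structure
lattice_vector 5.0 0.0 0.0
lattice_vector 0.0 5.0 0.0
lattice_vector 0.0 0.0 5.0
atom 0.0 0.0 0.0 Al
atom 3.0 0.0 0.0 O
atom 0.0 4.0 0.0 O
"""
lattice, coords, elems = read_xyz(io.StringIO(example))
stats = distance_stats(coords)
print(elems, stats)
```

For real structures, nearest-neighbor statistics should account for periodic boundaries (for example by replicating the cell along the lattice vectors), which this sketch leaves out.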

3) Modeling
- Baseline: LightGBM regression, one model per target (formation_energy_ev_natom, bandgap_energy_ev)
- CV: GroupKFold if structures need grouping (e.g. by composition); otherwise StratifiedKFold on target quantile bins
- Train on log1p(target) so that RMSE on the transformed scale matches RMSLE; invert predictions with expm1
- Hyperparameters: quick Bayesian/TPE or guided grid; early_stopping; consistent seeds
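
The fallback CV scheme and the target transform can be sketched as below. The target here is synthetic (a gamma-distributed stand-in), and the choice of 10 quantile bins is an assumption for illustration, not a tuned value.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=0.1, size=500)  # synthetic non-negative target

# Bin the continuous target into quantile buckets, then stratify on the bucket label
y_bins = pd.qcut(y, q=10, labels=False, duplicates='drop')
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_means = []
for trn_idx, val_idx in skf.split(np.zeros(len(y)), y_bins):
    fold_means.append(float(y[val_idx].mean()))
print('validation target mean per fold:', np.round(fold_means, 4))

# log1p / expm1 round trip: train on y_log, map predictions back with expm1
y_log = np.log1p(y)
y_back = np.expm1(y_log)
```

Stratifying on bins keeps the target distribution similar across folds, which makes per-fold scores easier to compare than with plain KFold.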

4) Validation & logging
- Track fold metrics and elapsed time; save OOF predictions; plot OOF vs true
- Sanity checks: leakage, distribution alignment, feature importances
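
One way to check distribution alignment between train and test features is a per-feature two-sample Kolmogorov-Smirnov test. This is a sketch on synthetic data; `drift_report` is a hypothetical helper, and the KS test is one option among several rather than a method this notebook prescribes.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def drift_report(train_df, test_df, cols):
    """Rank features by the KS statistic between train and test samples."""
    rows = []
    for c in cols:
        stat, pval = ks_2samp(train_df[c].values, test_df[c].values)
        rows.append({'feature': c, 'ks_stat': float(stat), 'p_value': float(pval)})
    return (pd.DataFrame(rows)
            .sort_values('ks_stat', ascending=False)
            .reset_index(drop=True))

# Synthetic example: one aligned feature, one with a shifted test distribution
rng = np.random.default_rng(0)
train_df = pd.DataFrame({'same': rng.normal(size=1000),
                         'shifted': rng.normal(size=1000)})
test_df = pd.DataFrame({'same': rng.normal(size=300),
                        'shifted': rng.normal(loc=1.0, size=300)})
report = drift_report(train_df, test_df, ['same', 'shifted'])
print(report)
```

Features near the top of the report deserve a closer look before trusting CV scores as a proxy for the leaderboard.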

5) Inference
- Generate test features identically; predict; create submission.csv

6) Iteration for medal
- Add stronger matminer features; try CatBoost/XGBoost stack
- Feature selection via importance/SHAP; try target smoothing by composition

Immediate next steps:
- Enumerate files; load train metadata; confirm target column(s); map IDs to geometry.xyz
- Build a minimal featurizer: element counts/fractions + natoms
- Train fast LightGBM CV baseline and evaluate

In [1]:
import os, sys, time, json, gc
import pandas as pd
import numpy as np
from pathlib import Path

DATA_DIR = Path('.')
train_csv = DATA_DIR / 'train.csv'
test_csv = DATA_DIR / 'test.csv'
sample_csv = DATA_DIR / 'sample_submission.csv'

print('Files exist:', train_csv.exists(), test_csv.exists(), sample_csv.exists())
train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)
sample = pd.read_csv(sample_csv)
print('train.shape, test.shape:', train.shape, test.shape)
print('train.columns:', train.columns.tolist())
print('test.columns:', test.columns.tolist())
print('sample.columns:', sample.columns.tolist())

target_col = 'formation_energy_ev_natom'
id_col = 'id'
assert id_col in train.columns and id_col in test.columns, 'id column missing'
assert target_col in train.columns, 'Target column missing in train.csv'
print('Target head:', train[target_col].head().to_list())

# Non-negativity check for log1p eligibility
y_min = train[target_col].min()
y_max = train[target_col].max()
print(f'Target min/max: {y_min:.6f} / {y_max:.6f}')
can_log1p = y_min >= 0
print('All non-negative? ->', can_log1p)

# Verify geometry.xyz path mapping for a few IDs
def check_paths(df, split='train', n=5):
    ids = df[id_col].head(n).tolist()
    results = []
    for i in ids:
        p = DATA_DIR / split / str(i) / 'geometry.xyz'
        results.append((i, p.exists(), str(p)))
    return results

print('Sample train paths exists:', check_paths(train, 'train', 5))
print('Sample test paths exists:', check_paths(test, 'test', 5))

# Basic target stats
print(train[target_col].describe())

# Save quick audit to JSON for reference
audit = {
    'train_shape': train.shape,
    'test_shape': test.shape,
    'train_columns': train.columns.tolist(),
    'test_columns': test.columns.tolist(),
    'sample_columns': sample.columns.tolist(),
    'target_min': float(y_min),
    'target_max': float(y_max),
    'can_log1p': bool(can_log1p)
}
with open('data_audit.json', 'w') as f:
    json.dump(audit, f, indent=2)
print('Wrote data_audit.json')

Files exist: True True True
train.shape, test.shape: (2160, 14) (240, 12)
train.columns: ['id', 'spacegroup', 'number_of_total_atoms', 'percent_atom_al', 'percent_atom_ga', 'percent_atom_in', 'lattice_vector_1_ang', 'lattice_vector_2_ang', 'lattice_vector_3_ang', 'lattice_angle_alpha_degree', 'lattice_angle_beta_degree', 'lattice_angle_gamma_degree', 'formation_energy_ev_natom', 'bandgap_energy_ev']
test.columns: ['id', 'spacegroup', 'number_of_total_atoms', 'percent_atom_al', 'percent_atom_ga', 'percent_atom_in', 'lattice_vector_1_ang', 'lattice_vector_2_ang', 'lattice_vector_3_ang', 'lattice_angle_alpha_degree', 'lattice_angle_beta_degree', 'lattice_angle_gamma_degree']
sample.columns: ['id', 'formation_energy_ev_natom', 'bandgap_energy_ev']
Target head: [0.1337, 0.0738, 0.3671, 0.0698, 0.1154]
Target min/max: 0.000000 / 0.657200
All non-negative? -> True
Sample train paths exists: [(1, True, 'train/1/geometry.xyz'), (2, True, 'train/2/geometry.xyz'), (3, True, 'train/3/geometry.xyz'

In [None]:
import math
from sklearn.model_selection import GroupKFold
from sklearn.metrics import mean_squared_error
import lightgbm as lgb

SEED = 42
np.random.seed(SEED)

def safe_div(a, b):
    return a / b if b != 0 else 0.0

def comp_entropy(fracs):
    fr = np.clip(np.array(fracs, dtype=float), 1e-12, 1.0)
    fr = fr / fr.sum() if fr.sum() > 0 else fr
    return float(-(fr * np.log(fr)).sum())

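def cell_volume(a, b, c, alpha_deg, beta_deg, gamma_deg):
    """Volume of a general (triclinic) cell from edge lengths and angles in degrees."""
    alpha = math.radians(alpha_deg); beta = math.radians(beta_deg); gamma = math.radians(gamma_deg)
    cos_a, cos_b, cos_g = math.cos(alpha), math.cos(beta), math.cos(gamma)
    # V = a*b*c*sqrt(1 + 2*cos(alpha)*cos(beta)*cos(gamma) - cos(alpha)^2 - cos(beta)^2 - cos(gamma)^2)
    vol_sq = 1 + 2*cos_a*cos_b*cos_g - cos_a**2 - cos_b**2 - cos_g**2
    vol_sq = max(vol_sq, 0.0)  # guard against tiny negative values from rounding
    return float(a*b*c*math.sqrt(vol_sq))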

def build_features(df):
    out = pd.DataFrame(index=df.index)
    # Base
    out['natoms'] = df['number_of_total_atoms'].astype(float)
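    # NOTE: dividing by 100 assumes percent_atom_* are stored on a 0-100 scale.
    # If the columns are already fractions in [0, 1], drop the division; otherwise
    # the cation counts and the composition groups built from them shrink toward zero.
    for e in ['al','ga','in']:
        out[f'pct_{e}'] = df[f'percent_atom_{e}'].astype(float) / 100.0
    out['pct_o'] = 1.0 - (out['pct_al'] + out['pct_ga'] + out['pct_in'])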
    # Counts (float and rounded ints)
    out['cnt_al'] = out['natoms'] * out['pct_al']
    out['cnt_ga'] = out['natoms'] * out['pct_ga']
    out['cnt_in'] = out['natoms'] * out['pct_in']
    out['cnt_o']  = out['natoms'] * out['pct_o']
    for e in ['al','ga','in','o']:
        out[f'cnt_{e}_int'] = np.rint(out[f'cnt_{e}']).astype(int)
    # Ratios
    out['ratio_al_ga'] = out['cnt_al'] / (out['cnt_ga'] + 1e-6)
    out['ratio_al_in'] = out['cnt_al'] / (out['cnt_in'] + 1e-6)
    out['ratio_ga_in'] = out['cnt_ga'] / (out['cnt_in'] + 1e-6)
    out['ratio_cation_o'] = (out['cnt_al'] + out['cnt_ga'] + out['cnt_in']) / (out['cnt_o'] + 1e-6)
    out['frac_cations'] = (out['pct_al'] + out['pct_ga'] + out['pct_in'])
    out['frac_o'] = out['pct_o']
    # Composition entropy
    out['comp_entropy'] = [comp_entropy(row) for row in out[['pct_al','pct_ga','pct_in','pct_o']].values]
    # Lattice features
    a = df['lattice_vector_1_ang'].astype(float)
    b = df['lattice_vector_2_ang'].astype(float)
    c = df['lattice_vector_3_ang'].astype(float)
    alpha = df['lattice_angle_alpha_degree'].astype(float)
    beta = df['lattice_angle_beta_degree'].astype(float)
    gamma = df['lattice_angle_gamma_degree'].astype(float)
    out['a'] = a; out['b'] = b; out['c'] = c
    out['alpha'] = alpha; out['beta'] = beta; out['gamma'] = gamma
    out['vol'] = [cell_volume(*vals) for vals in zip(a,b,c,alpha,beta,gamma)]
    out['vol'] = out['vol'].replace([np.inf, -np.inf], np.nan).fillna(0.0)
    out['density_proxy'] = out['natoms'] / (out['vol'] + 1e-6)
    # Simple interactions
    out['a_over_b'] = a / (b + 1e-6)
    out['b_over_c'] = b / (c + 1e-6)
    out['c_over_a'] = c / (a + 1e-6)
    # Spacegroup as numeric
    out['spacegroup'] = df['spacegroup'].astype(int)
    return out

# Build train/test features
t0 = time.time()
X = build_features(train)
X_test = build_features(test)
print('Feature shapes:', X.shape, X_test.shape, '| secs:', round(time.time()-t0,2))

# Groups by reduced composition (1D string key to avoid sklearn tuple handling issues)
grp_cols = ['cnt_al_int','cnt_ga_int','cnt_in_int','cnt_o_int']
groups = X[grp_cols].astype(int).astype(str).agg(lambda r: '_'.join(r.values.tolist()), axis=1).values
print('Unique groups:', len(np.unique(groups)))

def rmsle(y_true, y_pred):
    y_true = np.asarray(y_true).astype(float)
    y_pred = np.maximum(np.asarray(y_pred).astype(float), 0.0)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true))**2))

lgb_params = {
    'objective': 'regression',
    'learning_rate': 0.04,
    'num_leaves': 63,
    'min_data_in_leaf': 40,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'lambda_l2': 2.0,
    'lambda_l1': 0.0,
    'max_depth': -1,
    'metric': 'rmse',
    'verbosity': -1,
    'seed': SEED
}

features = X.columns.tolist()
print('Using', len(features), 'features')

def run_target(target_col_name):
    y = train[target_col_name].values.astype(float)
    can_log = (y.min() >= 0)
    y_tr = np.log1p(y) if can_log else y.copy()
    folds = 5
    gkf = GroupKFold(n_splits=folds)
    oof = np.zeros(len(train), dtype=float)
    best_iters = []
    models = []
    for fold, (trn_idx, val_idx) in enumerate(gkf.split(X, y_tr, groups=groups), 1):
        t1 = time.time()
        X_tr, X_val = X.iloc[trn_idx], X.iloc[val_idx]
        y_trn, y_val = y_tr[trn_idx], y_tr[val_idx]
        dtrain = lgb.Dataset(X_tr[features], label=y_trn, free_raw_data=False)
        dvalid = lgb.Dataset(X_val[features], label=y_val, free_raw_data=False)
        print(f'[{target_col_name}] Fold {fold}/{folds} | trn:{len(trn_idx)} val:{len(val_idx)}')
        callbacks = [
            lgb.early_stopping(stopping_rounds=200, verbose=False),
            lgb.log_evaluation(period=200)
        ]
        model = lgb.train(lgb_params, dtrain, num_boost_round=4000, valid_sets=[dtrain, dvalid],
                          valid_names=['train','valid'], callbacks=callbacks)
        best_iter = model.best_iteration
        best_iters.append(best_iter)
        pred_val_log = model.predict(X_val[features], num_iteration=best_iter)
        pred_val = np.expm1(pred_val_log) if can_log else pred_val_log
        oof[val_idx] = np.clip(pred_val, 0, None)
        fold_rmsle = rmsle(train[target_col_name].values[val_idx], oof[val_idx])
        print(f'[{target_col_name}] Fold {fold} RMSLE: {fold_rmsle:.6f} | best_iter: {best_iter} | elapsed: {time.time()-t1:.1f}s', flush=True)
        models.append(model)
    cv_rmsle = rmsle(train[target_col_name].values, oof)
    print(f'[{target_col_name}] OOF RMSLE: {cv_rmsle:.6f}')
    final_iter = int(np.mean(best_iters)) if len(best_iters) > 0 else 2000
    print(f'[{target_col_name}] Refitting full model with num_boost_round =', final_iter)
    dall = lgb.Dataset(X[features], label=y_tr, free_raw_data=False)
    final_model = lgb.train(lgb_params, dall, num_boost_round=final_iter)
    test_pred_log = final_model.predict(X_test[features], num_iteration=final_iter)
    test_pred = np.expm1(test_pred_log) if can_log else test_pred_log
    test_pred = np.clip(test_pred, 0, None)
    return oof, test_pred, cv_rmsle, final_iter, final_model

# Run for both targets
oof_fe, test_fe, cv_fe, iter_fe, model_fe = run_target('formation_energy_ev_natom')
oof_bg, test_bg, cv_bg, iter_bg, model_bg = run_target('bandgap_energy_ev')

mean_rmsle = np.mean([cv_fe, cv_bg])
print(f'Mean-column-wise RMSLE (OOF): {mean_rmsle:.6f} | FE: {cv_fe:.6f} | BG: {cv_bg:.6f}')

# Save artifacts
pd.DataFrame({'id': train['id'], 'oof_fe': oof_fe, 'y_fe': train['formation_energy_ev_natom'],
              'oof_bg': oof_bg, 'y_bg': train['bandgap_energy_ev']}).to_csv('oof.csv', index=False)
imp_fe = pd.DataFrame({'feature': features, 'gain': model_fe.feature_importance(importance_type='gain')}).sort_values('gain', ascending=False)
imp_bg = pd.DataFrame({'feature': features, 'gain': model_bg.feature_importance(importance_type='gain')}).sort_values('gain', ascending=False)
imp_fe.to_csv('feature_importance_fe.csv', index=False)
imp_bg.to_csv('feature_importance_bg.csv', index=False)
with open('training_log.txt','w') as f:
    f.write(f'OOF_RMSLE_FE: {cv_fe:.8f}\n')
    f.write(f'OOF_RMSLE_BG: {cv_bg:.8f}\n')
    f.write(f'MEAN_OOF_RMSLE: {mean_rmsle:.8f}\n')
    f.write(f'iter_fe: {iter_fe}\n')
    f.write(f'iter_bg: {iter_bg}\n')
print('Saved oof.csv, feature_importance_fe.csv, feature_importance_bg.csv, training_log.txt')

# Create submission with both required columns
submission = sample.copy()[['id','formation_energy_ev_natom','bandgap_energy_ev']].copy()
map_fe = pd.Series(test_fe, index=test['id']).to_dict()
map_bg = pd.Series(test_bg, index=test['id']).to_dict()
submission['formation_energy_ev_natom'] = submission['id'].map(map_fe).astype(float)
submission['bandgap_energy_ev'] = submission['id'].map(map_bg).astype(float)
submission.to_csv('submission.csv', index=False)
print('Wrote submission.csv with shape', submission.shape)

Feature shapes: (2160, 32) (240, 32) | secs: 0.11
Unique groups: 12
Using 32 features
[formation_energy_ev_natom] Fold 1/5 | trn:1365 val:795
[200]	train's rmse: 0.0254835	valid's rmse: 0.0392942
[formation_energy_ev_natom] Fold 1 RMSLE: 0.038599 | best_iter: 132 | elapsed: 0.4s
[formation_energy_ev_natom] Fold 2/5 | trn:1697 val:463
[200]	train's rmse: 0.0234948	valid's rmse: 0.077623
[formation_energy_ev_natom] Fold 2 RMSLE: 0.076908 | best_iter: 83 | elapsed: 0.5s
[formation_energy_ev_natom] Fold 3/5 | trn:1864 val:296
[200]	train's rmse: 0.0263879	valid's rmse: 0.0583212
[formation_energy_ev_natom] Fold 3 RMSLE: 0.054996 | best_iter: 61 | elapsed: 0.4s
[formation_energy_ev_natom] Fold 4/5 | trn:1848 val:312
[200]	train's rmse: 0.0261705	valid's rmse: 0.0717072
[400]	train's rmse: 0.0238608	valid's rmse: 0.0713454
[600]	train's rmse: 0.02261	valid's rmse: 0.0701465
[800]	train's rmse: 0.0218152	valid's rmse: 0.069883
[1000]	train's rmse: 0.0212675	valid's rmse: 0.0703835
[formation_energy_ev_natom] Fold 4 RMSLE: 0.069847 | best_iter: 803 | elapsed: 1.6s
[formation_energy_ev_natom] Fold 5/5 | trn:1866 val:294
[200]	train's rmse: 0.0254702	valid's rmse: 0.0390149
[400]	train's rmse: 0.0232053	valid's rmse: 0.0383504
[600]	train's rmse: 0.0218819	valid's rmse: 0.038428
[formation_energy_ev_natom] Fold 5 RMSLE: 0.038273 | best_iter: 409 | elapsed: 1.0s
[formation_energy_ev_natom] OOF RMSLE: 0.055989
[formation_energy_ev_natom] Refitting full model with num_boost_round = 297

[bandgap_energy_ev] Fold 1/5 | trn:1365 val:795
[200]	train's rmse: 0.0743757	valid's rmse: 0.0686696
[bandgap_energy_ev] Fold 1 RMSLE: 0.067157 | best_iter: 111 | elapsed: 0.4s
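
The log reports 12 unique groups and validation folds of 795, 463, 296, 312, and 294 rows. With so few groups of unequal size, GroupKFold cannot balance the folds, which likely contributes to the wide spread in per-fold RMSLE. The sketch below reproduces the effect on synthetic group labels (the group sizes are made up and only sum to the same 2160 rows) and verifies that no group is split across train and validation.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic stand-in: 12 groups with very uneven sizes (sizes are hypothetical)
sizes = [600, 400, 300, 250, 200, 150, 100, 60, 40, 30, 20, 10]
groups = np.repeat(np.arange(len(sizes)), sizes)
X_dummy = np.zeros((len(groups), 1))

gkf = GroupKFold(n_splits=5)
val_sizes = []
for trn_idx, val_idx in gkf.split(X_dummy, groups=groups):
    # No group may appear on both sides of a split
    assert set(groups[trn_idx]).isdisjoint(groups[val_idx])
    val_sizes.append(len(val_idx))
print('validation fold sizes:', val_sizes)
```

Printing `pd.Series(groups).value_counts()` on the real group keys is a quick way to confirm whether the grouping is as coarse as the fold sizes suggest.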
