# Plan to WIN A MEDAL: NOMAD2018 Predicting Transparent Conductors

Objective: Predict bandgap_energy_ev for test structures; optimize RMSLE (single-column case of mean-column-wise-rmsle).

High-level strategy:
- Data audit: Locate train.csv/test.csv or alternative labels file; map structure folders (ids) to targets.
- Feature engineering:
  - Composition-only features from each geometry.xyz: element counts and fractions (Al, Ga, In, O), O ratio, total atoms (N_total), predicted N from stoichiometry.
  - Matminer composition featurizers: Stoichiometry, ElementProperty (Magpie), ValenceOrbital, AtomicOrbitals, IonProperty.
  - Optional structural proxies from XYZ (no lattice):
    - Centered pairwise-distance statistics (mean/std/min/max, RDF histogram), nearest-neighbor stats by element pairs.
    - Coordination counts via distance thresholds per element pair (heuristic radii).
- Modeling:
  - Target transform: y_log = log1p(y); optimize RMSLE naturally; predictions = expm1(y_pred), clip >= 0.
  - Strong baselines: LightGBM, CatBoost, XGBoost. Start with LGBM; add CatBoost; blend/stack.
  - 5-fold KFold with shuffle (seed) for quick iteration; consider GroupKFold by composition signature if leakage suspected.
  - Early stopping, robust seeds; feature importance to iterate.
- Tuning:
  - Quick grid for LGBM (num_leaves, max_depth, min_data_in_leaf, feature_fraction, bagging_fraction, lambda_l1/l2).
  - Consider Optuna if time permits.
- Inference & submission:
  - Build test features with identical pipeline.
  - Save submission.csv with columns: id, bandgap_energy_ev.

Risk & unknowns to resolve immediately:
- Repository currently shows train/ and test/ with per-id geometry.xyz files; train.csv not visible. Need to discover label file (possibly at root or a metadata CSV).
- If only structures provided, we will parse labels from a CSV like train.csv or targets.csv; else, abort and search docs.

First steps (next cells):
1) Probe filesystem for any CSVs (train.csv, test.csv, targets.csv, metadata.csv).
2) If found, load and inspect target distribution and id alignment with folders.
3) Implement robust parser:
   - Read XYZ -> element list and cartesian coords; compute composition; build composition features.
   - Parallelize with joblib; cache to parquet for fast iteration.
4) Train LGBM baseline with CV; log fold times and scores; create feature importance.
5) If CV RMSLE > 0.07, iterate features (RDF/NN stats) and/or swap model to CatBoost/XGB and blend.

Time management:
- Always log progress; parallel feature build to utilize 36 vCPUs; avoid heavy structure features that need periodic boundary conditions (not available).
- While long runs execute, request expert review for guidance/interrupts.

In [None]:
# Data audit: list files, load CSVs, inspect, and verify folder alignment
import os, sys, json, math, textwrap, time, gc
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

print('CWD:', os.getcwd())
print('Top-level files/folders:', os.listdir('.'))

# Load CSVs
train_csv = 'train.csv'
test_csv = 'test.csv'
assert os.path.exists(train_csv), f"Missing {train_csv} in CWD"
assert os.path.exists(test_csv), f"Missing {test_csv} in CWD"
train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)
print('train shape:', train.shape)
print('test shape:', test.shape)
print('train columns:', train.columns.tolist())
print('test columns:', test.columns.tolist())

# Basic info
print('\ntrain.info():')
print(train.info())
print('\ntest.info():')
print(test.info())

# Target exploration
target_col = 'bandgap_energy_ev'
assert target_col in train.columns, f"Target column {target_col} not in train.csv columns"
print('\nTarget describe:')
print(train[target_col].describe())
print('Num zeros in target:', int((train[target_col] == 0).sum()))
print('Num NaNs in target:', int(train[target_col].isna().sum()))

fig, ax = plt.subplots(1,1, figsize=(6,4))
ax.hist(train[target_col].dropna(), bins=50, color='steelblue', edgecolor='k', alpha=0.8)
ax.set_title('bandgap_energy_ev histogram')
ax.set_xlabel('bandgap (eV)')
ax.set_ylabel('count')
plt.tight_layout()
plt.show()

# Verify alignment with geometry.xyz folders
def verify_paths(df, split_name):
    base = Path(split_name)
    assert base.exists(), f"Missing folder: {base}"
    assert 'id' in df.columns, "id column missing in CSV"
    ids = df['id'].astype(str).values
    missing = []
    for i, sid in enumerate(ids):
        path = base / sid / 'geometry.xyz'
        if not path.exists():
            missing.append(str(path))
        if (i+1) % 2000 == 0:
            print(f'Checked {i+1}/{len(ids)} {split_name} ids...')
    print(f"{split_name}: total ids={len(ids)}, missing geometries={len(missing)}")
    if missing:
        print('Examples of missing:', missing[:5])
    assert len(missing) == 0, f"Found {len(missing)} missing geometry.xyz files in {split_name}"

verify_paths(train, 'train')
verify_paths(test, 'test')
print('Data audit completed successfully.')

In [None]:
# Baseline features, GroupKFold CV, LightGBM model, submission
import os, time, math, gc
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import GroupKFold
from sklearn.metrics import mean_squared_error

def cell_volume(a, b, c, alpha_deg, beta_deg, gamma_deg):
    # Volume for triclinic cell from a,b,c and angles
    alpha = np.deg2rad(alpha_deg); beta = np.deg2rad(beta_deg); gamma = np.deg2rad(gamma_deg)
    cos_alpha, cos_beta, cos_gamma = np.cos(alpha), np.cos(beta), np.cos(gamma)
    term = 1 + 2*cos_alpha*cos_beta*cos_gamma - cos_alpha**2 - cos_beta**2 - cos_gamma**2
    term = np.clip(term, 0, None)
    return a * b * c * np.sqrt(term)

def add_engineered(df):
    df = df.copy()
    # Volume and density-like features
    df['cell_volume'] = cell_volume(df['lattice_vector_1_ang'], df['lattice_vector_2_ang'], df['lattice_vector_3_ang'],
                                     df['lattice_angle_alpha_degree'], df['lattice_angle_beta_degree'], df['lattice_angle_gamma_degree'])
    df['atoms_per_volume'] = df['number_of_total_atoms'] / (df['cell_volume'].replace(0, np.nan))
    # Composition counts from percent and total atoms
    for el, col in [('al','percent_atom_al'), ('ga','percent_atom_ga'), ('in','percent_atom_in')]:
        df[f'n_{el}'] = np.rint(df['number_of_total_atoms'] * df[col] / 100.0).astype(int)
    df['percent_atom_o'] = 100.0 - (df['percent_atom_al'] + df['percent_atom_ga'] + df['percent_atom_in'])
    df['n_o'] = (df['number_of_total_atoms'] - (df['n_al'] + df['n_ga'] + df['n_in'])).astype(int)
    # Fractions (0-1) for tree models; keep percents too
    for el in ['al','ga','in','o']:
        pcol = f'percent_atom_{el}'
        if pcol in df.columns:
            df[f'frac_{el}'] = df[pcol] / 100.0
    # Cation ratios and stats
    df['frac_cation'] = df[['frac_al','frac_ga','frac_in']].sum(axis=1)
    df['frac_o_to_cation'] = df['frac_o'] / (df['frac_cation'] + 1e-9)
    df['mix_entropy_cation'] = -np.sum(np.where(df[['frac_al','frac_ga','frac_in']]>0, df[['frac_al','frac_ga','frac_in']] * np.log(df[['frac_al','frac_ga','frac_in']]+1e-12), 0), axis=1)
    df['hhi_cation'] = np.sum(df[['frac_al','frac_ga','frac_in']]**2, axis=1)
    # Angles trigonometric
    for ang in ['alpha','beta','gamma']:
        col = f'lattice_angle_{ang}_degree'
        df[f'cos_{ang}'] = np.cos(np.deg2rad(df[col]))
        df[f'sin_{ang}'] = np.sin(np.deg2rad(df[col]))
    # Safe fill for infinities
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    return df

# Reload CSVs to ensure clean state
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train_fe = add_engineered(train)
test_fe = add_engineered(test)

# Group key to avoid leakage: integer counts per element
group_cols = ['n_al','n_ga','n_in','n_o']
groups = train_fe[group_cols].astype(int).astype(str).agg('_'.join, axis=1)

# Features selection
drop_cols = ['id','bandgap_energy_ev','formation_energy_ev_natom']
features = [c for c in train_fe.columns if c not in drop_cols]
cat_cols = ['spacegroup']

X = train_fe[features]
X_test = test_fe[features]
y = train_fe['bandgap_energy_ev'].astype(float)

# LightGBM setup
import importlib
try:
    import lightgbm as lgb
except Exception as e:
    import sys, subprocess
    print('Installing lightgbm...'); subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm'])
    import lightgbm as lgb

# Log-transform for RMSLE
y_log = np.log1p(y.clip(lower=0))

params = {
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.04,
    'num_leaves': 192,
    'max_depth': -1,
    'min_data_in_leaf': 80,
    'feature_fraction': 0.85,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'lambda_l2': 0.2,
    'verbosity': -1,
    'seed': 42
}

n_splits = 5
gkf = GroupKFold(n_splits=n_splits)
oof_pred = np.zeros(len(X))
test_pred = np.zeros(len(X_test))

fold_times = []
for fold, (trn_idx, val_idx) in enumerate(gkf.split(X, y_log, groups=groups), 1):
    t0 = time.time()
    X_tr, X_va = X.iloc[trn_idx], X.iloc[val_idx]
    y_tr, y_va = y_log.iloc[trn_idx], y_log.iloc[val_idx]
    lgb_train = lgb.Dataset(X_tr, label=y_tr, categorical_feature=cat_cols, free_raw_data=False)
    lgb_valid = lgb.Dataset(X_va, label=y_va, categorical_feature=cat_cols, free_raw_data=False)
    model = lgb.train(params, lgb_train, num_boost_round=5000, valid_sets=[lgb_train, lgb_valid],
                      valid_names=['train','valid'],
                      callbacks=[lgb.early_stopping(200), lgb.log_evaluation(200)])
    oof_pred[val_idx] = model.predict(X_va, num_iteration=model.best_iteration)
    test_pred += model.predict(X_test, num_iteration=model.best_iteration) / n_splits
    elapsed = time.time() - t0
    fold_times.append(elapsed)
    rmse = mean_squared_error(y_va, oof_pred[val_idx], squared=False)
    print(f'Fold {fold}/{n_splits} RMSE(log1p): {rmse:.6f} | elapsed: {elapsed:.1f}s | best_iter: {model.best_iteration}')
    del model, lgb_train, lgb_valid; gc.collect()

# CV score in RMSLE space (since we trained on log1p)
cv_rmse_log = mean_squared_error(y_log, oof_pred, squared=False)
print(f'CV RMSLE: {cv_rmse_log:.6f}  | mean fold time: {np.mean(fold_times):.1f}s')

# Train final model on full data (optional, we already averaged test preds across folds).
pred_bandgap = np.expm1(test_pred).clip(min=0)

# Save submission
sub = pd.DataFrame({'id': test['id'], 'bandgap_energy_ev': pred_bandgap})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv with shape:', sub.shape)
sub.head()

In [2]:
# Fix grouping, add diagnostics, enrich features, retrain LGBM
import numpy as np, pandas as pd, time, gc
from sklearn.metrics import mean_squared_error

def cell_volume(a, b, c, alpha_deg, beta_deg, gamma_deg):
    alpha = np.deg2rad(alpha_deg); beta = np.deg2rad(beta_deg); gamma = np.deg2rad(gamma_deg)
    ca, cb, cg = np.cos(alpha), np.cos(beta), np.cos(gamma)
    term = 1 + 2*ca*cb*cg - ca**2 - cb**2 - cg**2
    term = np.clip(term, 0, None)
    return a * b * c * np.sqrt(term)

def engineer_features(df):
    df = df.copy()
    # Geometry-derived
    a, b, c = df['lattice_vector_1_ang'], df['lattice_vector_2_ang'], df['lattice_vector_3_ang']
    alpha, beta, gamma = df['lattice_angle_alpha_degree'], df['lattice_angle_beta_degree'], df['lattice_angle_gamma_degree']
    vol = cell_volume(a, b, c, alpha, beta, gamma)
    df['cell_volume'] = vol
    df['volume_per_atom'] = vol / df['number_of_total_atoms']
    df['a_over_b'] = a / b
    df['b_over_c'] = b / c
    df['c_over_a'] = c / a
    df['abc_mean'] = (a + b + c) / 3.0
    df['abc_max'] = np.max(np.stack([a,b,c], axis=1), axis=1)
    df['abc_min'] = np.min(np.stack([a,b,c], axis=1), axis=1)
    df['abc_anisotropy'] = (df['abc_max'] - df['abc_min']) / (df['abc_mean'] + 1e-9)
    for ang_name, series in [('alpha',alpha),('beta',beta),('gamma',gamma)]:
        df[f'cos_{ang_name}'] = np.cos(np.deg2rad(series))
        df[f'abs_{ang_name}_dev90'] = np.abs(series - 90.0)
    df['orthorhombicity'] = df[['abs_alpha_dev90','abs_beta_dev90','abs_gamma_dev90']].sum(axis=1)
    df['atoms_per_volume'] = df['number_of_total_atoms'] / (vol.replace(0, np.nan))

    # Fractions
    for el in ['al','ga','in']:
        df[f'frac_{el}'] = df[f'percent_atom_{el}'] / 100.0
    df['percent_atom_o'] = 100.0 - (df['percent_atom_al'] + df['percent_atom_ga'] + df['percent_atom_in'])
    df['frac_o'] = df['percent_atom_o'] / 100.0
    df['frac_cation'] = df[['frac_al','frac_ga','frac_in']].sum(axis=1)
    # Mix stats
    cat_fracs = df[['frac_al','frac_ga','frac_in']].clip(lower=0, upper=1)
    df['mix_entropy_cation'] = -np.sum(np.where(cat_fracs>0, cat_fracs*np.log(cat_fracs+1e-12), 0), axis=1)
    df['hhi_cation'] = np.sum(cat_fracs**2, axis=1)
    # Pairwise interactions
    df['al_x_ga'] = df['frac_al']*df['frac_ga']
    df['al_x_in'] = df['frac_al']*df['frac_in']
    df['ga_x_in'] = df['frac_ga']*df['frac_in']
    df['al_minus_ga'] = df['frac_al']-df['frac_ga']
    df['al_minus_in'] = df['frac_al']-df['frac_in']
    df['ga_minus_in'] = df['frac_ga']-df['frac_in']
    eps = 1e-6
    df['al_over_ga'] = (df['frac_al']+eps)/(df['frac_ga']+eps)
    df['al_over_in'] = (df['frac_al']+eps)/(df['frac_in']+eps)
    df['ga_over_in'] = (df['frac_ga']+eps)/(df['frac_in']+eps)
    # Categorical preparation
    df['spacegroup'] = df['spacegroup'].astype('category')
    df.replace([np.inf,-np.inf], np.nan, inplace=True)
    return df

def compute_stoich_groups(df):
    # Compute integer counts using cation stoichiometry consistency
    # For all sesquioxides: total atoms = 5N, cations = 2N, oxygens = 3N
    N = np.rint(df['number_of_total_atoms']/5.0).astype(int)
    n_cat = 2 * N
    # Fractions provided are per total atoms; sum(frac_al, frac_ga, frac_in) ~ 0.4
    frac_al = df['percent_atom_al']/100.0
    frac_ga = df['percent_atom_ga']/100.0
    frac_in = df['percent_atom_in']/100.0
    frac_cations_total = (frac_al + frac_ga + frac_in).replace(0, np.nan)
    # Convert to fractions among cations
    w_al = (frac_al / frac_cations_total).clip(0, 1).fillna(0)
    w_ga = (frac_ga / frac_cations_total).clip(0, 1).fillna(0)
    # ensure sums to 1
    w_in = (1.0 - w_al - w_ga).clip(0, 1)
    n_al = np.rint(n_cat * w_al).astype(int)
    n_ga = np.rint(n_cat * w_ga).astype(int)
    n_in = (n_cat - n_al - n_ga).astype(int)
    n_o = 3 * N
    key = pd.Series(list(zip(N, n_al, n_ga, n_in))).astype(str)
    return key, N, n_al, n_ga, n_in, n_o

# Load fresh
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train_fe = engineer_features(train)
test_fe = engineer_features(test)

# Build groups
groups, N, n_al, n_ga, n_in, n_o = compute_stoich_groups(train)
train_fe['N'] = N; train_fe['n_al'] = n_al; train_fe['n_ga'] = n_ga; train_fe['n_in'] = n_in; train_fe['n_o'] = n_o
test_groups, N_te, al_te, ga_te, in_te, o_te = compute_stoich_groups(test)
test_fe['N'] = N_te; test_fe['n_al'] = al_te; test_fe['n_ga'] = ga_te; test_fe['n_in'] = in_te; test_fe['n_o'] = o_te

# Create balanced shuffled fold assignment at group level
n_splits = 5
uniq_groups = groups.drop_duplicates().sample(frac=1.0, random_state=42).reset_index(drop=True)
chunks = np.array_split(uniq_groups.values, n_splits)
group_to_fold = {}
for k, arr in enumerate(chunks):
    for g in arr:
        group_to_fold[g] = k
fold_ids = groups.map(group_to_fold).astype(int).values

# Diagnostics
y = train_fe['bandgap_energy_ev'].astype(float)
print('Overall target describe:\n', y.describe())
print('Unique groups:', len(uniq_groups))
print('Sample groups:', uniq_groups.head().tolist())
for k in range(n_splits):
    val_idx = np.where(fold_ids==k)[0]
    trn_idx = np.where(fold_ids!=k)[0]
    print(f'Fold {k}: n={len(val_idx)}, uniq_groups={pd.Series(groups.iloc[val_idx]).nunique()}')
    print(pd.Series(groups.iloc[val_idx]).value_counts().head())
    print('Fold target describe:\n', y.iloc[val_idx].describe())
    inter = set(groups.iloc[val_idx]).intersection(set(groups.iloc[trn_idx]))
    assert len(inter)==0, 'Group leakage detected!'

# Feature list (ensure train/test alignment) and drop target
drop_cols = ['id','bandgap_energy_ev','formation_energy_ev_natom']
common_cols = [c for c in train_fe.columns if c in test_fe.columns]
features = [c for c in common_cols if c not in drop_cols]
X = train_fe[features].copy()
X_test = test_fe[features].copy()
y_log = np.log1p(y.clip(lower=0))

# LightGBM with stronger regularization
import lightgbm as lgb
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.03,
    'num_leaves': 96,
    'max_depth': -1,
    'min_data_in_leaf': 200,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'lambda_l2': 1.0,
    'lambda_l1': 0.2,
    'verbosity': -1,
    'seed': 42
}

oof = np.zeros(len(X)); test_pred = np.zeros(len(X_test))
fold_times = []
for k in range(n_splits):
    t0 = time.time()
    val_idx = np.where(fold_ids==k)[0]
    trn_idx = np.where(fold_ids!=k)[0]
    dtrain = lgb.Dataset(X.iloc[trn_idx], label=y_log.iloc[trn_idx], categorical_feature=['spacegroup'], free_raw_data=False)
    dvalid = lgb.Dataset(X.iloc[val_idx], label=y_log.iloc[val_idx], categorical_feature=['spacegroup'], free_raw_data=False)
    model = lgb.train(params, dtrain, num_boost_round=5000, valid_sets=[dtrain,dvalid], valid_names=['train','valid'], callbacks=[lgb.early_stopping(300), lgb.log_evaluation(200)])
    oof[val_idx] = model.predict(X.iloc[val_idx], num_iteration=model.best_iteration)
    test_pred += model.predict(X_test, num_iteration=model.best_iteration) / n_splits
    rmse = float(mean_squared_error(y_log.iloc[val_idx], oof[val_idx]) ** 0.5)
    fold_times.append(time.time()-t0)
    print(f'Fold {k} RMSE(log1p): {rmse:.6f} | best_iter: {model.best_iteration} | elapsed: {fold_times[-1]:.1f}s')
    del model, dtrain, dvalid; gc.collect()

cv = float(mean_squared_error(y_log, oof) ** 0.5)
print(f'New CV RMSLE: {cv:.6f} | mean fold time: {np.mean(fold_times):.1f}s')

# Save new submission
pred_bandgap = np.expm1(test_pred).clip(min=0)
sub = pd.DataFrame({'id': test['id'], 'bandgap_energy_ev': pred_bandgap})
sub.to_csv('submission.csv', index=False)
print('submission.csv saved:', sub.shape)
sub.head()

Overall target describe:
 count    2160.000000
mean        2.075512
std         1.005867
min         0.000100
25%         1.275050
50%         1.901650
75%         2.761150
max         5.286100
Name: bandgap_energy_ev, dtype: float64
Unique groups: 692
Sample groups: ['(6, 0, 2, 10)', '(16, 26, 4, 2)', '(6, 2, 2, 8)', '(16, 21, 3, 8)', '(8, 8, 0, 8)']
Fold 0: n=468, uniq_groups=139
(6, 8, 4, 0)     16
(8, 13, 0, 3)    15
(8, 14, 0, 2)    13
(8, 9, 2, 5)     12
(8, 0, 3, 13)    11
Name: count, dtype: int64
Fold target describe:
 count    468.000000
mean       1.950675
std        1.035782
min        0.000100
25%        1.118925
50%        1.808550
75%        2.543750
max        5.286100
Name: bandgap_energy_ev, dtype: float64
Fold 1: n=461, uniq_groups=139
(6, 6, 6, 0)       19
(6, 4, 8, 0)       14
(8, 10, 1, 5)      13
(16, 1, 1, 30)     13
(16, 0, 16, 16)    13
Name: count, dtype: int64
Fold target describe:
 count    461.000000
mean       2.185851
std        0.997888
min        0.233

[400]	train's rmse: 0.0800665	valid's rmse: 0.105045
[600]	train's rmse: 0.0753728	valid's rmse: 0.100822
[800]	train's rmse: 0.0723948	valid's rmse: 0.0983915


[1000]	train's rmse: 0.0703115	valid's rmse: 0.0973343
[1200]	train's rmse: 0.0686176	valid's rmse: 0.0969536
[1400]	train's rmse: 0.0673267	valid's rmse: 0.0966142


[1600]	train's rmse: 0.0661675	valid's rmse: 0.096612
[1800]	train's rmse: 0.0651302	valid's rmse: 0.0966671
Early stopping, best iteration is:
[1502]	train's rmse: 0.0666871	valid's rmse: 0.096564
Fold 0 RMSE(log1p): 0.096564 | best_iter: 1502 | elapsed: 0.8s


Training until validation scores don't improve for 300 rounds
[200]	train's rmse: 0.0923241	valid's rmse: 0.0930194
[400]	train's rmse: 0.0822953	valid's rmse: 0.0884679


[600]	train's rmse: 0.0777468	valid's rmse: 0.0865252
[800]	train's rmse: 0.0748294	valid's rmse: 0.085685


[1000]	train's rmse: 0.0728146	valid's rmse: 0.0852458
[1200]	train's rmse: 0.0711822	valid's rmse: 0.0850852
[1400]	train's rmse: 0.0698328	valid's rmse: 0.0848514


[1600]	train's rmse: 0.0687103	valid's rmse: 0.0846592
[1800]	train's rmse: 0.0677057	valid's rmse: 0.0847021
[2000]	train's rmse: 0.066815	valid's rmse: 0.0845745


[2200]	train's rmse: 0.0659828	valid's rmse: 0.0845675
[2400]	train's rmse: 0.06523	valid's rmse: 0.0844758


[2600]	train's rmse: 0.0645406	valid's rmse: 0.0846742
Early stopping, best iteration is:
[2354]	train's rmse: 0.0653934	valid's rmse: 0.0844583
Fold 1 RMSE(log1p): 0.084458 | best_iter: 2354 | elapsed: 1.4s


Training until validation scores don't improve for 300 rounds
[200]	train's rmse: 0.0907615	valid's rmse: 0.0954211
[400]	train's rmse: 0.0806605	valid's rmse: 0.0922907


[600]	train's rmse: 0.0760448	valid's rmse: 0.0907279
[800]	train's rmse: 0.0732371	valid's rmse: 0.0902232
[1000]	train's rmse: 0.0712328	valid's rmse: 0.0900915


[1200]	train's rmse: 0.0696421	valid's rmse: 0.08997
[1400]	train's rmse: 0.068331	valid's rmse: 0.0900507
Early stopping, best iteration is:
[1254]	train's rmse: 0.0692788	valid's rmse: 0.0899133
Fold 2 RMSE(log1p): 0.089913 | best_iter: 1254 | elapsed: 0.8s


Training until validation scores don't improve for 300 rounds
[200]	train's rmse: 0.0918792	valid's rmse: 0.0963871
[400]	train's rmse: 0.0826031	valid's rmse: 0.0888624


[600]	train's rmse: 0.0780258	valid's rmse: 0.0857932
[800]	train's rmse: 0.0754138	valid's rmse: 0.0843728
[1000]	train's rmse: 0.0735065	valid's rmse: 0.0836148


[1200]	train's rmse: 0.0720295	valid's rmse: 0.0830419
[1400]	train's rmse: 0.0708707	valid's rmse: 0.0825756
[1600]	train's rmse: 0.0697821	valid's rmse: 0.0822702


[1800]	train's rmse: 0.0688293	valid's rmse: 0.0819862
[2000]	train's rmse: 0.0679862	valid's rmse: 0.0818472
[2200]	train's rmse: 0.0671905	valid's rmse: 0.0816003


[2400]	train's rmse: 0.0664088	valid's rmse: 0.0815923
[2600]	train's rmse: 0.0656912	valid's rmse: 0.0815257
Early stopping, best iteration is:
[2316]	train's rmse: 0.0667402	valid's rmse: 0.0814752
Fold 3 RMSE(log1p): 0.081475 | best_iter: 2316 | elapsed: 1.3s


Training until validation scores don't improve for 300 rounds
[200]	train's rmse: 0.0934315	valid's rmse: 0.0936426
[400]	train's rmse: 0.0840476	valid's rmse: 0.0861089


[600]	train's rmse: 0.079551	valid's rmse: 0.0829195
[800]	train's rmse: 0.0767522	valid's rmse: 0.0810449
[1000]	train's rmse: 0.0747718	valid's rmse: 0.0801754


[1200]	train's rmse: 0.0732264	valid's rmse: 0.0793587
[1400]	train's rmse: 0.0718615	valid's rmse: 0.0789466
[1600]	train's rmse: 0.070686	valid's rmse: 0.0786586


[1800]	train's rmse: 0.0696695	valid's rmse: 0.0785104


[2000]	train's rmse: 0.0687292	valid's rmse: 0.0784933
[2200]	train's rmse: 0.0678703	valid's rmse: 0.0784836


[2400]	train's rmse: 0.0670541	valid's rmse: 0.0784728
Early stopping, best iteration is:
[2289]	train's rmse: 0.0674853	valid's rmse: 0.0783439


Fold 4 RMSE(log1p): 0.078344 | best_iter: 2289 | elapsed: 1.6s
New CV RMSLE: 0.086464 | mean fold time: 1.2s
submission.csv saved: (240, 2)


Unnamed: 0,id,bandgap_energy_ev
0,1,1.91839
1,2,1.699429
2,3,4.327457
3,4,2.898343
4,5,1.143205


In [None]:
# 1) Stabilize CV: Stratify stoichiometry groups by target mean into folds
import numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold

# Ensure prerequisites from Cell 3
assert 'train_fe' in globals(), 'train_fe missing; run feature engineering cell.'
assert 'compute_stoich_groups' in globals(), 'compute_stoich_groups missing; run grouping cell.'

# y target
y = train_fe['bandgap_energy_ev'].astype(float)

# Ensure group key exists
if 'groups' not in globals():
    _gkey, _N, _al, _ga, _in, _o = compute_stoich_groups(pd.read_csv('train.csv'))
    groups = _gkey.astype(str)

# Build stratified folds by group mean target
gkey = groups.astype(str)
gmean = y.groupby(gkey).mean()
gbin = pd.qcut(gmean, q=10, labels=False, duplicates='drop')
uniq = pd.DataFrame({'g': gmean.index, 'bin': gbin.values}).sample(frac=1.0, random_state=42).reset_index(drop=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
group_to_fold = {}
for k, (_, val_idx) in enumerate(skf.split(uniq['g'], uniq['bin'])):
    for g in uniq['g'].iloc[val_idx]:
        group_to_fold[g] = k
fold_ids = gkey.map(group_to_fold).astype(int).values

# Assert no leakage
for k in range(5):
    vi = np.where(fold_ids==k)[0]; ti = np.where(fold_ids!=k)[0]
    assert set(gkey.iloc[vi]).isdisjoint(set(gkey.iloc[ti])), f'Group leakage detected in fold {k}'

# Diagnostics
print('Stratified GroupKFold created. Fold sizes:', pd.Series(fold_ids).value_counts().sort_index().to_dict())
print('Group bins distribution:', uniq['bin'].value_counts().sort_index().to_dict())

In [None]:
# Formation-energy OOF meta-feature + retrain bandgap
import numpy as np, pandas as pd, time, gc
from sklearn.metrics import mean_squared_error
import lightgbm as lgb

# Ensure we have engineered frames and fold_ids from previous cell
assert 'train_fe' in globals() and 'test_fe' in globals() and 'fold_ids' in globals(), 'Run previous cell first.'

# Features for formation model (must exist in both train/test, exclude targets)
drop_cols = ['id','bandgap_energy_ev','formation_energy_ev_natom']
common_cols = [c for c in train_fe.columns if c in test_fe.columns]
feat_fe = [c for c in common_cols if c not in drop_cols]
Xf = train_fe[feat_fe].copy()
Xf_test = test_fe[feat_fe].copy()

# Target with shift for log1p
y_fe = train_fe['formation_energy_ev_natom'].astype(float)
c_shift = float(max(0.0, -y_fe.min() + 1e-6))
y_fe_log = np.log1p(y_fe + c_shift)

params_fe = {
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.03,
    'num_leaves': 96,
    'max_depth': -1,
    'min_data_in_leaf': 200,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'lambda_l2': 1.0,
    'lambda_l1': 0.2,
    'verbosity': -1,
    'seed': 2025
}

n_splits = len(np.unique(fold_ids))
oof_fe_log = np.zeros(len(Xf)); test_fe_log = np.zeros(len(Xf_test))
for k in range(n_splits):
    trn_idx = np.where(fold_ids!=k)[0]
    val_idx = np.where(fold_ids==k)[0]
    dtr = lgb.Dataset(Xf.iloc[trn_idx], label=y_fe_log.iloc[trn_idx], categorical_feature=['spacegroup'], free_raw_data=False)
    dva = lgb.Dataset(Xf.iloc[val_idx], label=y_fe_log.iloc[val_idx], categorical_feature=['spacegroup'], free_raw_data=False)
    m = lgb.train(params_fe, dtr, num_boost_round=5000, valid_sets=[dtr,dva], valid_names=['train','valid'], callbacks=[lgb.early_stopping(300), lgb.log_evaluation(300)])
    oof_fe_log[val_idx] = m.predict(Xf.iloc[val_idx], num_iteration=m.best_iteration)
    test_fe_log += m.predict(Xf_test, num_iteration=m.best_iteration) / n_splits
    del m, dtr, dva; gc.collect()
cv_fe = mean_squared_error(y_fe_log, oof_fe_log, squared=False)
print(f'Formation-energy CV (log space RMSE): {cv_fe:.6f}')

# Back-transform predictions
oof_fe = np.expm1(oof_fe_log) - c_shift
pred_fe_test = np.expm1(test_fe_log) - c_shift

# Attach meta-feature
train_fe['pred_fe_meta'] = oof_fe
test_fe['pred_fe_meta'] = pred_fe_test

# Rebuild features with meta-feature included
drop_cols_bg = ['id','bandgap_energy_ev']
common_cols_bg = [c for c in train_fe.columns if c in test_fe.columns]
features_bg = [c for c in common_cols_bg if c not in drop_cols_bg]
X = train_fe[features_bg].copy()
X_test = test_fe[features_bg].copy()
y_bg = train_fe['bandgap_energy_ev'].astype(float)
y_bg_log = np.log1p(y_bg.clip(lower=0))

params_bg = {
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.03,
    'num_leaves': 96,
    'max_depth': -1,
    'min_data_in_leaf': 200,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'lambda_l2': 1.0,
    'lambda_l1': 0.2,
    'verbosity': -1,
    'seed': 42
}

oof = np.zeros(len(X)); test_pred = np.zeros(len(X_test))
for k in range(n_splits):
    trn_idx = np.where(fold_ids!=k)[0]
    val_idx = np.where(fold_ids==k)[0]
    dtr = lgb.Dataset(X.iloc[trn_idx], label=y_bg_log.iloc[trn_idx], categorical_feature=['spacegroup'], free_raw_data=False)
    dva = lgb.Dataset(X.iloc[val_idx], label=y_bg_log.iloc[val_idx], categorical_feature=['spacegroup'], free_raw_data=False)
    m = lgb.train(params_bg, dtr, num_boost_round=5000, valid_sets=[dtr,dva], valid_names=['train','valid'], callbacks=[lgb.early_stopping(300), lgb.log_evaluation(300)])
    oof[val_idx] = m.predict(X.iloc[val_idx], num_iteration=m.best_iteration)
    test_pred += m.predict(X_test, num_iteration=m.best_iteration) / n_splits
    del m, dtr, dva; gc.collect()
cv_bg = mean_squared_error(y_bg_log, oof, squared=False)
print(f'Bandgap CV RMSLE with FE meta-feature: {cv_bg:.6f}')

# Write submission
pred_bg = np.expm1(test_pred).clip(min=0)
sub = pd.DataFrame({'id': test_fe['id'], 'bandgap_energy_ev': pred_bg})
sub.to_csv('submission.csv', index=False)
print('submission.csv saved:', sub.shape)
sub.head()

In [3]:
# Lightweight XYZ structural features (pairwise stats + RDF) and retrain
import os, time, gc, math
import numpy as np
import pandas as pd
from pathlib import Path

# Dependencies
try:
    from ase.io import read as ase_read
except Exception:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'ase'])
    from ase.io import read as ase_read
try:
    from joblib import Parallel, delayed
except Exception:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'joblib'])
    from joblib import Parallel, delayed

BIN_MAX = 8.0
BIN_SIZE = 0.2
BINS = np.arange(0.0, BIN_MAX + BIN_SIZE, BIN_SIZE)

CATIONS = {'Al', 'Ga', 'In'}

def pairwise_dists(positions):
    n = positions.shape[0]
    if n <= 1:
        return np.array([])
    diffs = positions[:, None, :] - positions[None, :, :]
    D = np.sqrt(np.sum(diffs**2, axis=2) + 1e-12)
    iu = np.triu_indices(n, k=1)
    return D[iu]

def stats_from_array(x):
    if x.size == 0:
        return dict(min=np.nan, p5=np.nan, p25=np.nan, p50=np.nan, p75=np.nan, p95=np.nan, mean=np.nan, std=np.nan, max=np.nan)
    q = np.percentile(x, [5,25,50,75,95])
    return dict(min=float(np.min(x)), p5=float(q[0]), p25=float(q[1]), p50=float(q[2]), p75=float(q[3]), p95=float(q[4]), mean=float(np.mean(x)), std=float(np.std(x)), max=float(np.max(x)))

def rdf_hist(distances):
    if distances.size == 0:
        return np.zeros(len(BINS)-1, dtype=float)
    h, _ = np.histogram(distances, bins=BINS)
    denom = max(1, distances.size)
    return (h / denom).astype(float)

def manual_xyz_parse(p):
    try:
        with open(p, 'r') as f:
            lines = f.read().strip().splitlines()
        if len(lines) < 3:
            return None, None
        try:
            n = int(lines[0].strip().split()[0])
        except Exception:
            n = None
        data_lines = lines[2:]
        symbols = []
        coords = []
        for ln in data_lines:
            parts = ln.strip().split()
            if len(parts) < 4:
                continue
            sym = parts[0]
            try:
                x, y, z = float(parts[1]), float(parts[2]), float(parts[3])
            except Exception:
                continue
            symbols.append(sym)
            coords.append([x, y, z])
        if len(coords) == 0:
            return None, None
        return np.array(symbols), np.array(coords, dtype=float)
    except Exception:
        return None, None

def read_xyz_features(split, sid):
    p = Path(split) / str(int(sid)) / 'geometry.xyz'
    symbols = None; pos = None
    # Try ASE single-frame
    try:
        atoms = ase_read(str(p), index=0, format='xyz')
        pos = atoms.get_positions()
        symbols = np.array(atoms.get_chemical_symbols())
    except Exception as e1:
        # Try multi-frame and pick first valid
        try:
            frames = ase_read(str(p), index=':', format='xyz')
            if isinstance(frames, list) and len(frames) > 0:
                atoms = frames[0]
                pos = atoms.get_positions()
                symbols = np.array(atoms.get_chemical_symbols())
        except Exception as e2:
            # Fallback to manual parser
            symbols, pos = manual_xyz_parse(str(p))
            if symbols is None or pos is None:
                print(f'Failed to parse {p}: {e1} | {e2}')
                return {'id': int(sid)}
    if pos is None or symbols is None or len(pos) == 0:
        print(f'Empty positions for {p}')
        return {'id': int(sid)}

    # center to remove translation
    pos = pos - pos.mean(axis=0, keepdims=True)
    is_o = symbols == 'O'
    is_cat = np.isin(symbols, list(CATIONS))
    d_all = pairwise_dists(pos)
    idx_o = np.where(is_o)[0]
    idx_c = np.where(is_cat)[0]

    def subset_dists(idxs_a, idxs_b, same):
        if len(idxs_a)==0 or len(idxs_b)==0:
            return np.array([])
        A = pos[idxs_a]; B = pos[idxs_b]
        if same:
            n = A.shape[0]
            if n <= 1:
                return np.array([])
            diffs = A[:, None, :] - A[None, :, :]
            D = np.sqrt(np.sum(diffs**2, axis=2) + 1e-12)
            iu = np.triu_indices(n, k=1)
            return D[iu]
        else:
            diffs = A[:, None, :] - B[None, :, :]
            D = np.sqrt(np.sum(diffs**2, axis=2) + 1e-12)
            return D.reshape(-1)

    d_cc = subset_dists(idx_c, idx_c, same=True)
    d_oo = subset_dists(idx_o, idx_o, same=True)
    d_co = subset_dists(idx_c, idx_o, same=False)
    feat = {'id': int(sid)}
    for name, arr in [('all', d_all), ('cc', d_cc), ('oo', d_oo), ('co', d_co)]:
        st = stats_from_array(arr)
        for k, v in st.items():
            feat[f'd_{name}_{k}'] = v
        rh = rdf_hist(arr)
        for i, v in enumerate(rh):
            feat[f'rdf_{name}_{i}'] = float(v)

    def nearest_cross(A_idx, B_idx):
        if len(A_idx)==0 or len(B_idx)==0:
            return dict(min=np.nan, mean=np.nan, max=np.nan)
        A = pos[A_idx]; B = pos[B_idx]
        diffs = A[:, None, :] - B[None, :, :]
        D = np.sqrt(np.sum(diffs**2, axis=2) + 1e-12)
        nn = D.min(axis=1)
        return dict(min=float(nn.min()), mean=float(nn.mean()), max=float(nn.max()))

    nn_co = nearest_cross(idx_c, idx_o)
    nn_oc = nearest_cross(idx_o, idx_c)
    for k,v in nn_co.items(): feat[f'nn_c_to_o_{k}'] = v
    for k,v in nn_oc.items(): feat[f'nn_o_to_c_{k}'] = v
    return feat

def build_xyz_df(split, ids, n_jobs=8):
    t0 = time.time()
    feats = Parallel(n_jobs=n_jobs, backend='loky')(delayed(read_xyz_features)(split, sid) for sid in ids)
    df = pd.DataFrame(feats).sort_values('id').reset_index(drop=True)
    print(f'{split}: built XYZ features for {len(ids)} ids in {time.time()-t0:.1f}s with shape {df.shape}')
    return df

# Ensure prerequisite frames and folds exist
assert 'train_fe' in globals() and 'test_fe' in globals() and 'fold_ids' in globals(), 'Run earlier cells first.'

train_ids = train_fe['id'].values if 'id' in train_fe.columns else pd.read_csv('train.csv')['id'].values
test_ids = test_fe['id'].values if 'id' in test_fe.columns else pd.read_csv('test.csv')['id'].values

# Build or load cached XYZ features (force rebuild after parser fix)
cache_tr = Path('xyz_train.parquet')
cache_te = Path('xyz_test.parquet')
if cache_tr.exists(): cache_tr.unlink()
if cache_te.exists(): cache_te.unlink()
xyz_tr = build_xyz_df('train', train_ids, n_jobs=16)
xyz_te = build_xyz_df('test', test_ids, n_jobs=16)
xyz_tr.to_parquet(cache_tr, index=False); xyz_te.to_parquet(cache_te, index=False)
print('Cached XYZ features to parquet.')

# Sanity check that we actually created many feature columns
print('XYZ feature columns:', xyz_tr.columns.tolist()[:10], '... total:', len(xyz_tr.columns))

# Merge into engineered frames
train_fe = train_fe.merge(xyz_tr, on='id', how='left')
test_fe = test_fe.merge(xyz_te, on='id', how='left')

# Retrain LGBM with added XYZ features using same folds
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

drop_cols = ['id','bandgap_energy_ev']
common_cols = [c for c in train_fe.columns if c in test_fe.columns]
features = [c for c in common_cols if c not in drop_cols]
X = train_fe[features].copy()
X_test = test_fe[features].copy()
y = train_fe['bandgap_energy_ev'].astype(float)
y_log = np.log1p(y.clip(lower=0))

params = {
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.03,
    'num_leaves': 96,
    'max_depth': -1,
    'min_data_in_leaf': 200,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'lambda_l2': 3.0,
    'lambda_l1': 0.2,
    'verbosity': -1,
    'seed': 42
}

n_splits = len(np.unique(fold_ids))
oof = np.zeros(len(X)); test_pred = np.zeros(len(X_test))
for k in range(n_splits):
    trn_idx = np.where(fold_ids!=k)[0]
    val_idx = np.where(fold_ids==k)[0]
    dtr = lgb.Dataset(X.iloc[trn_idx], label=y_log.iloc[trn_idx], categorical_feature=['spacegroup'], free_raw_data=False)
    dva = lgb.Dataset(X.iloc[val_idx], label=y_log.iloc[val_idx], categorical_feature=['spacegroup'], free_raw_data=False)
    m = lgb.train(params, dtr, num_boost_round=5000, valid_sets=[dtr,dva], valid_names=['train','valid'], callbacks=[lgb.early_stopping(300), lgb.log_evaluation(300)])
    oof[val_idx] = m.predict(X.iloc[val_idx], num_iteration=m.best_iteration)
    test_pred += m.predict(X_test, num_iteration=m.best_iteration) / n_splits
    del m, dtr, dva; gc.collect()
cv = mean_squared_error(y_log, oof, squared=False)
print(f'CV RMSLE with XYZ features: {cv:.6f}')

# Save submission
pred_bg = np.expm1(test_pred).clip(min=0)
sub = pd.DataFrame({'id': test_fe['id'], 'bandgap_energy_ev': pred_bg})
sub.to_csv('submission.csv', index=False)
print('submission.csv saved:', sub.shape)
sub.head()

train: built XYZ features for 2160 ids in 1.8s with shape (2160, 203)


test: built XYZ features for 240 ids in 0.3s with shape (240, 203)
Cached XYZ features to parquet.
XYZ feature columns: ['id', 'd_all_min', 'd_all_p5', 'd_all_p25', 'd_all_p50', 'd_all_p75', 'd_all_p95', 'd_all_mean', 'd_all_std', 'd_all_max'] ... total: 203
Training until validation scores don't improve for 300 rounds


[300]	train's rmse: 0.0820224	valid's rmse: 0.108077
[600]	train's rmse: 0.0732857	valid's rmse: 0.101438


[900]	train's rmse: 0.068578	valid's rmse: 0.0980759
[1200]	train's rmse: 0.0654777	valid's rmse: 0.0967751


[1500]	train's rmse: 0.0629212	valid's rmse: 0.0961109
[1800]	train's rmse: 0.0609254	valid's rmse: 0.09596


[2100]	train's rmse: 0.0591926	valid's rmse: 0.0958148
Early stopping, best iteration is:
[2093]	train's rmse: 0.0592325	valid's rmse: 0.0957706


Training until validation scores don't improve for 300 rounds
[300]	train's rmse: 0.0848733	valid's rmse: 0.0892039


[600]	train's rmse: 0.0754857	valid's rmse: 0.0858933
[900]	train's rmse: 0.0706754	valid's rmse: 0.0848009


[1200]	train's rmse: 0.0673629	valid's rmse: 0.0844294
[1500]	train's rmse: 0.064823	valid's rmse: 0.0842012


[1800]	train's rmse: 0.0626647	valid's rmse: 0.0838746
[2100]	train's rmse: 0.0608356	valid's rmse: 0.0838457


[2400]	train's rmse: 0.0592077	valid's rmse: 0.0837712
Early stopping, best iteration is:
[2349]	train's rmse: 0.0594585	valid's rmse: 0.0837093


Training until validation scores don't improve for 300 rounds


[300]	train's rmse: 0.0832171	valid's rmse: 0.093395
[600]	train's rmse: 0.0739555	valid's rmse: 0.0906044


[900]	train's rmse: 0.0692626	valid's rmse: 0.0900064
[1200]	train's rmse: 0.066015	valid's rmse: 0.0900276


Early stopping, best iteration is:
[1147]	train's rmse: 0.0665018	valid's rmse: 0.089901
Training until validation scores don't improve for 300 rounds


[300]	train's rmse: 0.0845764	valid's rmse: 0.0904315
[600]	train's rmse: 0.0756515	valid's rmse: 0.0842928


[900]	train's rmse: 0.0710796	valid's rmse: 0.0820401
[1200]	train's rmse: 0.0678913	valid's rmse: 0.0812824


[1500]	train's rmse: 0.0655195	valid's rmse: 0.0804976


[1800]	train's rmse: 0.0635658	valid's rmse: 0.0803347


[2100]	train's rmse: 0.0617522	valid's rmse: 0.0803118
[2400]	train's rmse: 0.0602301	valid's rmse: 0.0801906


Early stopping, best iteration is:
[2275]	train's rmse: 0.0608401	valid's rmse: 0.0801157
Training until validation scores don't improve for 300 rounds


[300]	train's rmse: 0.0861201	valid's rmse: 0.0876482
[600]	train's rmse: 0.076906	valid's rmse: 0.0824866


[900]	train's rmse: 0.0720975	valid's rmse: 0.0809133
[1200]	train's rmse: 0.0689646	valid's rmse: 0.0802659


[1500]	train's rmse: 0.0663805	valid's rmse: 0.0799949
[1800]	train's rmse: 0.0641817	valid's rmse: 0.079945


[2100]	train's rmse: 0.0622199	valid's rmse: 0.0800118
Early stopping, best iteration is:
[2022]	train's rmse: 0.0627079	valid's rmse: 0.0798619


TypeError: got an unexpected keyword argument 'squared'

In [None]:
# 2) Add missing low-dimension matminer composition features (Stoichiometry, ValenceOrbital, IonProperty)
import numpy as np, pandas as pd, sys, subprocess

assert 'train_fe' in globals() and 'test_fe' in globals(), 'Run earlier feature engineering cells first.'

try:
    from matminer.featurizers.composition import Stoichiometry, ValenceOrbital, IonProperty
    from pymatgen.core.composition import Composition
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'matminer', 'pymatgen'])
    from matminer.featurizers.composition import Stoichiometry, ValenceOrbital, IonProperty
    from pymatgen.core.composition import Composition

def add_mm_lowdim(df):
    tmp = df[['composition']].copy()
    tmp['composition'] = tmp['composition'].apply(Composition)
    out = tmp.copy()
    for fz in [Stoichiometry(), ValenceOrbital(props=['avg','frac']), IonProperty(fast=True)]:
        out = fz.featurize_dataframe(out, col_id='composition', ignore_errors=True)
    out = out.drop(columns=['composition'])
    out.columns = [f'mm2_{c}' for c in out.columns]
    return out

# Build and merge
mm2_tr = add_mm_lowdim(train_fe)
mm2_te = add_mm_lowdim(test_fe)
print('Low-dim matminer built:', mm2_tr.shape, mm2_te.shape)
train_fe = pd.concat([train_fe.reset_index(drop=True), mm2_tr.reset_index(drop=True)], axis=1)
test_fe  = pd.concat([test_fe.reset_index(drop=True),  mm2_te.reset_index(drop=True)], axis=1)
print('After merge shapes:', train_fe.shape, test_fe.shape)

In [None]:
# 3) Prune XYZ RDF features; 4) Add interaction features, bowing/logs, cation-weight contrasts; OOF TE for spacegroup; rebuild folds; quick importance pruning
import numpy as np, pandas as pd, time, gc

assert 'train_fe' in globals() and 'test_fe' in globals(), 'Run earlier feature engineering cells first.'
t0_all = time.time()

# ------------------ Prune all rdf_* bins from XYZ features ------------------
t0 = time.time()
xyz_drop = [c for c in train_fe.columns if c.startswith('rdf_')]
train_fe.drop(columns=xyz_drop, inplace=True, errors='ignore')
test_fe.drop(columns=xyz_drop, inplace=True, errors='ignore')
print('Dropped RDF bins:', len(xyz_drop), '| elapsed:', f'{time.time()-t0:.2f}s', flush=True)

# ------------------ Ensure base composition fractions and Vegard weights ------------------
t0 = time.time()
for df in (train_fe, test_fe):
    if 'frac_al' not in df.columns: df['frac_al'] = df['percent_atom_al'] / 100.0
    if 'frac_ga' not in df.columns: df['frac_ga'] = df['percent_atom_ga'] / 100.0
    if 'frac_in' not in df.columns: df['frac_in'] = df['percent_atom_in'] / 100.0
    if 'percent_atom_o' not in df.columns:
        df['percent_atom_o'] = 100.0 - (df['percent_atom_al'] + df['percent_atom_ga'] + df['percent_atom_in'])
    if 'frac_o' not in df.columns:
        df['frac_o'] = df['percent_atom_o']/100.0
    frac_cat = (df['frac_al'] + df['frac_ga'] + df['frac_in']).replace(0, np.nan)
    if 'w_al' not in df.columns: df['w_al'] = (df['frac_al']/frac_cat).fillna(0)
    if 'w_ga' not in df.columns: df['w_ga'] = (df['frac_ga']/frac_cat).fillna(0)
    if 'w_in' not in df.columns: df['w_in'] = (df['frac_in']/frac_cat).fillna(0)
    if 'vegard_bg' not in df.columns:
        df['vegard_bg'] = 8.8*df['w_al'] + 4.8*df['w_ga'] + 2.9*df['w_in']
print('Ensured base composition + Vegard | elapsed:', f'{time.time()-t0:.2f}s', flush=True)

# ------------------ Add bowing and log features ------------------
t0 = time.time()
def add_bowing_logs(df):
    df['bow_in'] = df['w_in']*(1.0 - df['w_in'])
    df['bow_ga'] = df['w_ga']*(1.0 - df['w_ga'])
    if 'volume_per_atom' in df.columns:
        df['log_vpa'] = np.log1p(df['volume_per_atom'].clip(lower=0))
    if 'atoms_per_volume' in df.columns:
        df['log_apv'] = np.log1p(df['atoms_per_volume'].clip(lower=0))
    df['log_oc'] = np.log1p((df['frac_o']/(df['frac_al']+df['frac_ga']+df['frac_in']+1e-9)).clip(lower=0))
    df['log_in_over_al'] = np.log1p(((df['frac_in']+1e-6)/(df['frac_al']+1e-6)).clip(lower=0))
    return df
train_fe = add_bowing_logs(train_fe)
test_fe = add_bowing_logs(test_fe)
print('Added bowing/log features | elapsed:', f'{time.time()-t0:.2f}s', flush=True)

# ------------------ Add interaction features (keep existing if present) ------------------
t0 = time.time()
def add_interactions(df):
    fa, fg, fi = df['frac_al'], df['frac_ga'], df['frac_in']
    if 'al_in_diff_sq' not in df.columns: df['al_in_diff_sq'] = (fa - fi) ** 2
    if 'ga_in_diff_sq' not in df.columns: df['ga_in_diff_sq'] = (fg - fi) ** 2
    if 'frac_al_cu' not in df.columns: df['frac_al_cu'] = fa ** 3
    if 'frac_ga_cu' not in df.columns: df['frac_ga_cu'] = fg ** 3
    if 'frac_in_cu' not in df.columns: df['frac_in_cu'] = fi ** 3
    if 'w_al_x_veg' not in df.columns: df['w_al_x_veg'] = df['w_al'] * df['vegard_bg']
    if 'w_in_x_veg' not in df.columns: df['w_in_x_veg'] = df['w_in'] * df['vegard_bg']
    for wname in ['al','ga','in']:
        if f'w_{wname}_sq' not in df.columns: df[f'w_{wname}_sq'] = df[f'w_{wname}']**2
    if 'w_al_ga' not in df.columns: df['w_al_ga'] = df['w_al']*df['w_ga']
    if 'w_al_in' not in df.columns: df['w_al_in'] = df['w_al']*df['w_in']
    if 'w_ga_in' not in df.columns: df['w_ga_in'] = df['w_ga']*df['w_in']
    return df
train_fe = add_interactions(train_fe)
test_fe = add_interactions(test_fe)
print('After interactions/bowing/logs:', train_fe.shape, test_fe.shape, '| elapsed:', f'{time.time()-t0:.2f}s', flush=True)

# ------------------ Cation-weighted contrasts (high-signal) ------------------
t0 = time.time()
props = {
    'chi_pauling': {'Al':1.61,'Ga':1.81,'In':1.78,'O':3.44},
    'ionic_radius': {'Al':0.535,'Ga':0.62,'In':0.80,'O':1.38},  # Shannon approx (Å)
    'Z': {'Al':13,'Ga':31,'In':49,'O':8},
    'period': {'Al':3,'Ga':4,'In':5,'O':2},
    'group': {'Al':13,'Ga':13,'In':13,'O':16},
    'covalent_radius': {'Al':1.21,'Ga':1.22,'In':1.42,'O':0.66},
    'first_ionization_energy': {'Al':5.986,'Ga':5.999,'In':5.786,'O':13.618},
    'electron_affinity': {'Al':0.441,'Ga':0.30,'In':0.30,'O':1.461}
}

def add_cation_weighted(df):
    wa, wg, wi = df['w_al'], df['w_ga'], df['w_in']
    for name, table in props.items():
        ca = table['Al']; cg = table['Ga']; ci = table['In']; co = table['O']
        wmean = wa*ca + wg*cg + wi*ci
        df[f'catw_{name}_mean'] = wmean
        df[f'catw_{name}_var'] = (wa*(ca - wmean)**2 + wg*(cg - wmean)**2 + wi*(ci - wmean)**2)
        if name in ['chi_pauling','ionic_radius']:
            df[f'o_minus_catw_{name}'] = co - wmean
    return df
train_fe = add_cation_weighted(train_fe)
test_fe = add_cation_weighted(test_fe)
print('Added cation-weighted contrasts | elapsed:', f'{time.time()-t0:.2f}s', flush=True)

# ------------------ Rebuild stratified group-disjoint folds with more splits (n_splits=8) ------------------
t0 = time.time()
assert 'compute_stoich_groups' in globals(), 'compute_stoich_groups missing; run grouping cell.'
y = train_fe['bandgap_energy_ev'].astype(float)
# Ensure necessary count columns exist (vectorized, avoid row-wise apply)
need_cols = ['N','n_al','n_ga','n_in','n_o']
missing = [c for c in need_cols if c not in train_fe.columns]
if missing:
    tr_csv = pd.read_csv('train.csv')
    key_tr, N_tr, al_tr, ga_tr, in_tr, o_tr = compute_stoich_groups(tr_csv)
    train_fe['N'] = N_tr; train_fe['n_al'] = al_tr; train_fe['n_ga'] = ga_tr; train_fe['n_in'] = in_tr; train_fe['n_o'] = o_tr
    te_csv = pd.read_csv('test.csv')
    key_te, N_te, al_te, ga_te, in_te, o_te = compute_stoich_groups(te_csv)
    test_fe['N'] = N_te; test_fe['n_al'] = al_te; test_fe['n_ga'] = ga_te; test_fe['n_in'] = in_te; test_fe['n_o'] = o_te
# Vectorized group key
gkey = train_fe[['N','n_al','n_ga','n_in']].astype(int).astype(str).agg('_'.join, axis=1)
gmean = y.groupby(gkey).mean()
gbin = pd.qcut(gmean, q=10, labels=False, duplicates='drop')
uniq = pd.DataFrame({'g': gmean.index, 'bin': gbin.values}).sample(frac=1.0, random_state=42).reset_index(drop=True)
from sklearn.model_selection import StratifiedKFold
n_splits_new = 8
skf = StratifiedKFold(n_splits=n_splits_new, shuffle=True, random_state=42)
group_to_fold = {}
for k, (_, val_idx) in enumerate(skf.split(uniq['g'], uniq['bin'])):
    for g in uniq['g'].iloc[val_idx]: group_to_fold[g] = k
fold_ids = gkey.map(group_to_fold).astype(int).values
print('Rebuilt fold_ids with n_splits=', n_splits_new, 'Fold sizes:', pd.Series(fold_ids).value_counts().sort_index().to_dict(), '| elapsed:', f'{time.time()-t0:.2f}s', flush=True)

# ------------------ OOF target encoding for spacegroup (group-disjoint) ------------------
t0 = time.time()
train_fe['te_sg'] = 0.0
y_log = np.log1p(y.clip(lower=0))
global_mean = float(y_log.mean())
for k in range(n_splits_new):
    trn_idx = np.where(fold_ids!=k)[0]
    val_idx = np.where(fold_ids==k)[0]
    m = train_fe.iloc[trn_idx].groupby('spacegroup')['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).mean())
    te_map = m.to_dict()
    train_fe.loc[train_fe.index[val_idx], 'te_sg'] = train_fe.iloc[val_idx]['spacegroup'].map(te_map).fillna(global_mean).values
sg_map_full = train_fe.groupby('spacegroup')['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).mean()).to_dict()
test_fe['te_sg'] = test_fe['spacegroup'].map(sg_map_full).fillna(global_mean)
print('Added OOF TE for spacegroup | elapsed:', f'{time.time()-t0:.2f}s', flush=True)

# ------------------ Quick LightGBM importance-based pruning ------------------
t0 = time.time()
import lightgbm as lgb
drop_cols = ['id','bandgap_energy_ev','composition']
common_cols = [c for c in train_fe.columns if c in test_fe.columns]
feat_all = [c for c in common_cols if c not in drop_cols]
# Numeric only, keep spacegroup if present
num_cols = list(train_fe[feat_all].select_dtypes(include=[np.number]).columns)
if 'spacegroup' in feat_all and 'spacegroup' not in num_cols:
    num_cols.append('spacegroup')
train_X = train_fe[num_cols].copy()
test_X = test_fe[num_cols].copy()
med = train_X.median(numeric_only=True)
train_X = train_X.fillna(med)
test_X = test_X.fillna(med)

lgb_quick = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.1, random_state=42)
lgb_quick.fit(train_X, y_log)
fi = pd.DataFrame({'feat': train_X.columns, 'imp': lgb_quick.feature_importances_})
fi = fi.sort_values('imp', ascending=True).reset_index(drop=True)
keep_ratio = 0.65
k = int(np.ceil(len(fi)*keep_ratio))
keep_feats = fi.sort_values('imp', ascending=False).head(k)['feat'].tolist()
drop_feats = [f for f in train_X.columns if f not in keep_feats]
print(f'Quick LGBM importance pruning: keeping {len(keep_feats)} / {len(train_X.columns)} features (drop {len(drop_feats)}). Took {time.time()-t0:.1f}s', flush=True)

# Apply pruning to frames (safe drop only for features present in both train/test and numeric)
train_drop_cols = [c for c in drop_feats if c in train_fe.columns]
test_drop_cols = [c for c in drop_feats if c in test_fe.columns]
# Do not drop essential columns if they slipped in
essentials = set(['id','bandgap_energy_ev','composition'])
train_drop_cols = [c for c in train_drop_cols if c not in essentials]
test_drop_cols = [c for c in test_drop_cols if c not in essentials]
train_fe.drop(columns=train_drop_cols, inplace=True, errors='ignore')
test_fe.drop(columns=test_drop_cols, inplace=True, errors='ignore')
print('Applied pruning to train/test frames. New shapes:', train_fe.shape, test_fe.shape, '| total elapsed:', f'{time.time()-t0_all:.1f}s', flush=True)

gc.collect();

In [None]:
# Matminer Magpie features + Vegard predictor + LGBM/CatBoost/XGBoost + NNLS blend
import numpy as np, pandas as pd, gc, time, os, sys, subprocess
from sklearn.metrics import mean_squared_error

assert 'train_fe' in globals() and 'test_fe' in globals() and 'fold_ids' in globals(), 'Run previous cells first.'

# Install deps if missing
try:
    from matminer.featurizers.composition import ElementProperty
    from pymatgen.core.composition import Composition
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'matminer', 'pymatgen'])
    from matminer.featurizers.composition import ElementProperty
    from pymatgen.core.composition import Composition
try:
    from catboost import CatBoostRegressor, Pool
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'catboost'])
    from catboost import CatBoostRegressor, Pool
try:
    import xgboost as xgb
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'xgboost'])
    import xgboost as xgb
try:
    from scipy.optimize import nnls
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'scipy'])
    from scipy.optimize import nnls

# Ensure stoichiometric counts exist (from cell 3) and create composition strings
def ensure_counts(df):
    need = ['N','n_al','n_ga','n_in','n_o']
    for c in need:
        assert c in df.columns, f'Missing {c} in engineered frame; rerun grouping cell.'
    return df
train_fe = ensure_counts(train_fe)
test_fe = ensure_counts(test_fe)

def comp_str(row):
    return f"Al{int(row['n_al'])} Ga{int(row['n_ga'])} In{int(row['n_in'])} O{int(row['n_o'])}"
train_fe['composition'] = train_fe.apply(comp_str, axis=1)
test_fe['composition'] = test_fe.apply(comp_str, axis=1)

# Ensure spacegroup categorical dtype
for df in (train_fe, test_fe):
    df['spacegroup'] = df['spacegroup'].astype('category')

# Magpie features
def build_magpie(df):
    tmp = df[['composition']].copy()
    tmp['composition'] = tmp['composition'].apply(Composition)
    ep = ElementProperty.from_preset('magpie')
    out = ep.featurize_dataframe(tmp, col_id='composition', ignore_errors=True)
    out.columns = ['composition'] + [f'mm_{c}' for c in out.columns[1:]]
    return out.drop(columns=['composition'])

t0 = time.time()
mm_tr = build_magpie(train_fe)
mm_te = build_magpie(test_fe)
print(f'Magpie built: train {mm_tr.shape}, test {mm_te.shape} in {time.time()-t0:.1f}s')

# Merge Magpie features (other features like mm2_ from prior cell are already in train_fe/test_fe)
train_fe = pd.concat([train_fe.reset_index(drop=True), mm_tr.reset_index(drop=True)], axis=1)
test_fe = pd.concat([test_fe.reset_index(drop=True), mm_te.reset_index(drop=True)], axis=1)

# Vegard-like predictor + interactions (if missing)
def add_vegard(df):
    frac_al = df['percent_atom_al']/100.0
    frac_ga = df['percent_atom_ga']/100.0
    frac_in = df['percent_atom_in']/100.0
    frac_cat = (frac_al + frac_ga + frac_in).replace(0, np.nan)
    w_al = (frac_al/frac_cat).fillna(0)
    w_ga = (frac_ga/frac_cat).fillna(0)
    w_in = (frac_in/frac_cat).fillna(0)
    df['vegard_bg'] = df.get('vegard_bg', 8.8*w_al + 4.8*w_ga + 2.9*w_in)
    for wname, w in [('al',w_al),('ga',w_ga),('in',w_in)]:
        if f'w_{wname}' not in df.columns: df[f'w_{wname}'] = w
        if f'w_{wname}_sq' not in df.columns: df[f'w_{wname}_sq'] = w*w
    if 'w_al_ga' not in df.columns: df['w_al_ga'] = df['w_al']*df['w_ga']
    if 'w_al_in' not in df.columns: df['w_al_in'] = df['w_al']*df['w_in']
    if 'w_ga_in' not in df.columns: df['w_ga_in'] = df['w_ga']*df['w_in']
    # small interactions
    for name, expr in [('al_in_diff_sq',(frac_al-frac_in)**2), ('ga_in_diff_sq',(frac_ga-frac_in)**2),
                       ('frac_al_cu', frac_al**3), ('frac_ga_cu', frac_ga**3), ('frac_in_cu', frac_in**3),
                       ('w_al_x_veg', w_al*df['vegard_bg']), ('w_in_x_veg', w_in*df['vegard_bg'])]:
        if name not in df.columns: df[name] = expr
    return df
train_fe = add_vegard(train_fe)
test_fe = add_vegard(test_fe)

# Normalize lattice lengths by volume^(1/3) (if not present)
for df in (train_fe, test_fe):
    if 'a_red' not in df.columns or 'b_red' not in df.columns or 'c_red' not in df.columns:
        vol = df['cell_volume'].replace(0, np.nan)
        l = vol.pow(1/3)
        df['a_red'] = df['lattice_vector_1_ang']/l
        df['b_red'] = df['lattice_vector_2_ang']/l
        df['c_red'] = df['lattice_vector_3_ang']/l

# Build feature matrices
drop_cols = ['id','bandgap_energy_ev','composition']
common_cols = [c for c in train_fe.columns if c in test_fe.columns]
features = [c for c in common_cols if c not in drop_cols]
features = pd.Index(features).drop_duplicates().tolist()

med = train_fe[features].median(numeric_only=True)
train_X = train_fe[features].copy().fillna(med)
test_X = test_fe[features].copy().fillna(med)

# Align and dtype guard
train_X = train_X.loc[:, ~train_X.columns.duplicated()]
test_X = test_X.loc[:, ~test_X.columns.duplicated()]
common_aligned = [c for c in train_X.columns if c in test_X.columns]
train_X = train_X[common_aligned]
test_X = test_X[common_aligned]
bad = list(train_X.select_dtypes(include=['object']).columns)
if bad:
    print('Dropping object cols:', bad)
    train_X.drop(columns=bad, inplace=True)
    test_X.drop(columns=bad, inplace=True)
num_cols = list(train_X.select_dtypes(include=[np.number]).columns)
if 'spacegroup' in train_X.columns:
    num_cols = list(dict.fromkeys(num_cols + ['spacegroup']))
train_X = train_X[num_cols]
test_X = test_X[num_cols]
assert train_X.select_dtypes(include='object').empty and test_X.select_dtypes(include='object').empty

y = train_fe['bandgap_energy_ev'].astype(float)
y_log = np.log1p(y.clip(lower=0))
n_splits = len(np.unique(fold_ids))

# LightGBM OOF
import lightgbm as lgb
params_lgb = {
    'objective': 'regression', 'metric': 'rmse', 'learning_rate': 0.03,
    'num_leaves': 128, 'max_depth': -1, 'min_data_in_leaf': 150,
    'feature_fraction': 0.8, 'bagging_fraction': 0.8, 'bagging_freq': 1,
    'lambda_l2': 2.0, 'lambda_l1': 0.0, 'verbosity': -1, 'seed': 42
}
oof_lgb = np.zeros(len(train_X)); pred_lgb = np.zeros(len(test_X))
for k in range(n_splits):
    trn = np.where(fold_ids!=k)[0]; val = np.where(fold_ids==k)[0]
    dtr = lgb.Dataset(train_X.iloc[trn], label=y_log.iloc[trn], categorical_feature=['spacegroup'], free_raw_data=False)
    dva = lgb.Dataset(train_X.iloc[val], label=y_log.iloc[val], categorical_feature=['spacegroup'], free_raw_data=False)
    m = lgb.train(params_lgb, dtr, num_boost_round=7000, valid_sets=[dtr,dva], valid_names=['train','valid'], callbacks=[lgb.early_stopping(450), lgb.log_evaluation(300)])
    oof_lgb[val] = m.predict(train_X.iloc[val], num_iteration=m.best_iteration)
    pred_lgb += m.predict(test_X, num_iteration=m.best_iteration)/n_splits
    del m, dtr, dva; gc.collect()
cv_lgb = mean_squared_error(y_log, oof_lgb, squared=False)
print(f'LGBM CV RMSLE: {cv_lgb:.6f}')

# CatBoost OOF
cat_params = dict(loss_function='RMSE', eval_metric='RMSE', iterations=5000, learning_rate=0.03, depth=7,
                   l2_leaf_reg=5.0, subsample=0.8, rsm=0.8, random_seed=42, od_type='Iter', od_wait=300, verbose=300)
oof_cb = np.zeros(len(train_X)); pred_cb = np.zeros(len(test_X))
cat_idx = [train_X.columns.get_loc('spacegroup')] if 'spacegroup' in train_X.columns else []
for k in range(n_splits):
    trn = np.where(fold_ids!=k)[0]; val = np.where(fold_ids==k)[0]
    pool_tr = Pool(train_X.iloc[trn], y_log.iloc[trn], cat_features=cat_idx)
    pool_va = Pool(train_X.iloc[val], y_log.iloc[val], cat_features=cat_idx)
    model_cb = CatBoostRegressor(**cat_params)
    model_cb.fit(pool_tr, eval_set=pool_va, use_best_model=True)
    oof_cb[val] = model_cb.predict(pool_va)
    pred_cb += model_cb.predict(Pool(test_X, cat_features=cat_idx))/n_splits
    del model_cb, pool_tr, pool_va; gc.collect()
cv_cb = mean_squared_error(y_log, oof_cb, squared=False)
print(f'CatBoost CV RMSLE: {cv_cb:.6f}')

# XGBoost OOF
xgb_params = dict(objective='reg:squarederror', eval_metric='rmse', tree_method='hist',
                  max_depth=6, eta=0.03, subsample=0.8, colsample_bytree=0.8,
                  min_child_weight=5, reg_lambda=3.0, reg_alpha=0.0, random_state=42)
oof_xgb = np.zeros(len(train_X)); pred_xgb = np.zeros(len(test_X))
for k in range(n_splits):
    trn = np.where(fold_ids!=k)[0]; val = np.where(fold_ids==k)[0]
    dtr = xgb.DMatrix(train_X.iloc[trn], label=y_log.iloc[trn])
    dva = xgb.DMatrix(train_X.iloc[val], label=y_log.iloc[val])
    dte = xgb.DMatrix(test_X)
    model = xgb.train(xgb_params, dtr, num_boost_round=8000, evals=[(dva,'valid')], early_stopping_rounds=400, verbose_eval=300)
    oof_xgb[val] = model.predict(xgb.DMatrix(train_X.iloc[val]), iteration_range=(0, model.best_ntree_limit))
    pred_xgb += model.predict(dte, iteration_range=(0, model.best_ntree_limit))/n_splits
    del model, dtr, dva, dte; gc.collect()
cv_xgb = mean_squared_error(y_log, oof_xgb, squared=False)
print(f'XGBoost CV RMSLE: {cv_xgb:.6f}')

# NNLS blend in log space
P = np.vstack([oof_lgb, oof_cb, oof_xgb]).T
w, _ = nnls(P, y_log.values)
w = w / (w.sum() if w.sum() > 0 else 1)
print('NNLS weights (LGB, CB, XGB):', w)
oof_blend = P @ w
cv_blend = mean_squared_error(y_log, oof_blend, squared=False)
print(f'Blended (NNLS) CV RMSLE: {cv_blend:.6f}')
Ptest = np.vstack([pred_lgb, pred_cb, pred_xgb]).T
pred_blend = Ptest @ w

# Save submission (clip after expm1)
pred_bg = np.expm1(pred_blend).clip(0, 6.5)
sub = pd.DataFrame({'id': test_fe['id'], 'bandgap_energy_ev': pred_bg})
sub.to_csv('submission.csv', index=False)
print('submission.csv saved:', sub.shape)
sub.head()

In [None]:
# 8A) Fast feature pass: drop rdf_*, add key features, rebuild 8-folds, add OOF TE (no importance pruning)
import numpy as np, pandas as pd, time, gc

assert 'train_fe' in globals() and 'test_fe' in globals(), 'Run earlier feature engineering cells first.'
t0_all = time.time()

# Drop rdf_* to reduce noise
t0 = time.time()
rdf_cols_tr = [c for c in train_fe.columns if c.startswith('rdf_')]
rdf_cols_te = [c for c in test_fe.columns if c.startswith('rdf_')]
train_fe.drop(columns=rdf_cols_tr, inplace=True, errors='ignore')
test_fe.drop(columns=rdf_cols_te, inplace=True, errors='ignore')
print('Dropped RDF bins:', len(rdf_cols_tr), '| elapsed:', f'{time.time()-t0:.2f}s', flush=True)

# Ensure base composition fractions, cation weights, Vegard
t0 = time.time()
for df in (train_fe, test_fe):
    if 'frac_al' not in df.columns: df['frac_al'] = df['percent_atom_al'] / 100.0
    if 'frac_ga' not in df.columns: df['frac_ga'] = df['percent_atom_ga'] / 100.0
    if 'frac_in' not in df.columns: df['frac_in'] = df['percent_atom_in'] / 100.0
    if 'percent_atom_o' not in df.columns:
        df['percent_atom_o'] = 100.0 - (df['percent_atom_al'] + df['percent_atom_ga'] + df['percent_atom_in'])
    if 'frac_o' not in df.columns: df['frac_o'] = df['percent_atom_o'] / 100.0
    frac_cat = (df['frac_al'] + df['frac_ga'] + df['frac_in']).replace(0, np.nan)
    if 'w_al' not in df.columns: df['w_al'] = (df['frac_al']/frac_cat).fillna(0)
    if 'w_ga' not in df.columns: df['w_ga'] = (df['frac_ga']/frac_cat).fillna(0)
    if 'w_in' not in df.columns: df['w_in'] = (df['frac_in']/frac_cat).fillna(0)
    if 'vegard_bg' not in df.columns: df['vegard_bg'] = 8.8*df['w_al'] + 4.8*df['w_ga'] + 2.9*df['w_in']
print('Ensured base composition + Vegard | elapsed:', f'{time.time()-t0:.2f}s', flush=True)

# Bowing/log features
t0 = time.time()
def add_bowing_logs(df):
    df['bow_in'] = df['w_in']*(1.0 - df['w_in'])
    df['bow_ga'] = df['w_ga']*(1.0 - df['w_ga'])
    if 'volume_per_atom' in df.columns: df['log_vpa'] = np.log1p(df['volume_per_atom'].clip(lower=0))
    if 'atoms_per_volume' in df.columns: df['log_apv'] = np.log1p(df['atoms_per_volume'].clip(lower=0))
    df['log_oc'] = np.log1p((df['frac_o']/(df['frac_al']+df['frac_ga']+df['frac_in']+1e-9)).clip(lower=0))
    df['log_in_over_al'] = np.log1p(((df['frac_in']+1e-6)/(df['frac_al']+1e-6)).clip(lower=0))
    return df
train_fe = add_bowing_logs(train_fe); test_fe = add_bowing_logs(test_fe)
print('Added bowing/log features | elapsed:', f'{time.time()-t0:.2f}s', flush=True)

# Interactions (lightweight)
t0 = time.time()
def add_interactions(df):
    fa, fg, fi = df['frac_al'], df['frac_ga'], df['frac_in']
    if 'al_in_diff_sq' not in df.columns: df['al_in_diff_sq'] = (fa - fi) ** 2
    if 'ga_in_diff_sq' not in df.columns: df['ga_in_diff_sq'] = (fg - fi) ** 2
    if 'frac_al_cu' not in df.columns: df['frac_al_cu'] = fa ** 3
    if 'frac_ga_cu' not in df.columns: df['frac_ga_cu'] = fg ** 3
    if 'frac_in_cu' not in df.columns: df['frac_in_cu'] = fi ** 3
    if 'w_al_x_veg' not in df.columns: df['w_al_x_veg'] = df['w_al'] * df['vegard_bg']
    if 'w_in_x_veg' not in df.columns: df['w_in_x_veg'] = df['w_in'] * df['vegard_bg']
    for wname in ['al','ga','in']:
        if f'w_{wname}_sq' not in df.columns: df[f'w_{wname}_sq'] = df[f'w_{wname}']**2
    if 'w_al_ga' not in df.columns: df['w_al_ga'] = df['w_al']*df['w_ga']
    if 'w_al_in' not in df.columns: df['w_al_in'] = df['w_al']*df['w_in']
    if 'w_ga_in' not in df.columns: df['w_ga_in'] = df['w_ga']*df['w_in']
    return df
train_fe = add_interactions(train_fe); test_fe = add_interactions(test_fe)
print('After interactions:', train_fe.shape, test_fe.shape, '| elapsed:', f'{time.time()-t0:.2f}s', flush=True)

# Cation-weighted contrasts
t0 = time.time()
props = {
    'chi_pauling': {'Al':1.61,'Ga':1.81,'In':1.78,'O':3.44},
    'ionic_radius': {'Al':0.535,'Ga':0.62,'In':0.80,'O':1.38},
    'Z': {'Al':13,'Ga':31,'In':49,'O':8},
    'period': {'Al':3,'Ga':4,'In':5,'O':2},
    'group': {'Al':13,'Ga':13,'In':13,'O':16},
    'covalent_radius': {'Al':1.21,'Ga':1.22,'In':1.42,'O':0.66},
    'first_ionization_energy': {'Al':5.986,'Ga':5.999,'In':5.786,'O':13.618},
    'electron_affinity': {'Al':0.441,'Ga':0.30,'In':0.30,'O':1.461}
}
def add_cation_weighted(df):
    wa, wg, wi = df['w_al'], df['w_ga'], df['w_in']
    for name, table in props.items():
        ca, cg, ci, co = table['Al'], table['Ga'], table['In'], table['O']
        wmean = wa*ca + wg*cg + wi*ci
        df[f'catw_{name}_mean'] = wmean
        df[f'catw_{name}_var'] = (wa*(ca-wmean)**2 + wg*(cg-wmean)**2 + wi*(ci-wmean)**2)
        if name in ['chi_pauling','ionic_radius']:
            df[f'o_minus_catw_{name}'] = co - wmean
    return df
train_fe = add_cation_weighted(train_fe); test_fe = add_cation_weighted(test_fe)
print('Added cation-weighted contrasts | elapsed:', f'{time.time()-t0:.2f}s', flush=True)

# Rebuild 8-fold stratified group-disjoint folds (vectorized)
t0 = time.time()
assert 'compute_stoich_groups' in globals(), 'compute_stoich_groups missing; run grouping cell.'
y = train_fe['bandgap_energy_ev'].astype(float)
need_cols = ['N','n_al','n_ga','n_in','n_o']
missing = [c for c in need_cols if c not in train_fe.columns]
if missing:
    tr_csv = pd.read_csv('train.csv'); te_csv = pd.read_csv('test.csv')
    _, N_tr, al_tr, ga_tr, in_tr, o_tr = compute_stoich_groups(tr_csv)
    train_fe['N'] = N_tr; train_fe['n_al'] = al_tr; train_fe['n_ga'] = ga_tr; train_fe['n_in'] = in_tr; train_fe['n_o'] = o_tr
    _, N_te, al_te, ga_te, in_te, o_te = compute_stoich_groups(te_csv)
    test_fe['N'] = N_te; test_fe['n_al'] = al_te; test_fe['n_ga'] = ga_te; test_fe['n_in'] = in_te; test_fe['n_o'] = o_te
gkey = train_fe[['N','n_al','n_ga','n_in']].astype(int).astype(str).agg('_'.join, axis=1)
gmean = y.groupby(gkey).mean()
gbin = pd.qcut(gmean, q=10, labels=False, duplicates='drop')
uniq = pd.DataFrame({'g': gmean.index, 'bin': gbin.values}).sample(frac=1.0, random_state=42).reset_index(drop=True)
from sklearn.model_selection import StratifiedKFold
n_splits_new = 8
skf = StratifiedKFold(n_splits=n_splits_new, shuffle=True, random_state=42)
group_to_fold = {}
for k, (_, val_idx) in enumerate(skf.split(uniq['g'], uniq['bin'])):
    for g in uniq['g'].iloc[val_idx]: group_to_fold[g] = k
fold_ids = gkey.map(group_to_fold).astype(int).values
print('Fold sizes:', pd.Series(fold_ids).value_counts().sort_index().to_dict(), '| elapsed:', f'{time.time()-t0:.2f}s', flush=True)

# OOF target encoding for spacegroup (in log space)
t0 = time.time()
train_fe['te_sg'] = 0.0
y_log = np.log1p(y.clip(lower=0))
global_mean = float(y_log.mean())
for k in range(n_splits_new):
    trn_idx = np.where(fold_ids!=k)[0]; val_idx = np.where(fold_ids==k)[0]
    m = train_fe.iloc[trn_idx].groupby('spacegroup')['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).mean())
    te_map = m.to_dict()
    train_fe.loc[train_fe.index[val_idx], 'te_sg'] = train_fe.iloc[val_idx]['spacegroup'].map(te_map).fillna(global_mean).values
sg_map_full = train_fe.groupby('spacegroup')['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).mean()).to_dict()
test_fe['te_sg'] = test_fe['spacegroup'].map(sg_map_full).fillna(global_mean)
print('Added OOF TE for spacegroup | elapsed:', f'{time.time()-t0:.2f}s', flush=True)

print('Fast pass done. Shapes:', train_fe.shape, test_fe.shape, '| total elapsed:', f'{time.time()-t0_all:.1f}s', flush=True)
gc.collect();

In [None]:
# Rebuild 8-fold stratified group-disjoint folds quickly (vectorized, minimal I/O)
import numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
assert 'train_fe' in globals() and 'compute_stoich_groups' in globals(), 'Prerequisites missing.'
y = train_fe['bandgap_energy_ev'].astype(float)
if 'groups' not in globals():
    _gkey, *_ = compute_stoich_groups(pd.read_csv('train.csv'))
    groups = _gkey.astype(str)
gkey = groups.astype(str)
gmean = y.groupby(gkey).mean()
gbin = pd.qcut(gmean, q=10, labels=False, duplicates='drop')
uniq = pd.DataFrame({'g': gmean.index, 'bin': gbin.values}).sample(frac=1.0, random_state=42).reset_index(drop=True)
skf = StratifiedKFold(n_splits=8, shuffle=True, random_state=42)
group_to_fold = {}
for k, (_, val_idx) in enumerate(skf.split(uniq['g'], uniq['bin'])):
    for g in uniq['g'].iloc[val_idx]:
        group_to_fold[g] = k
fold_ids = gkey.map(group_to_fold).astype(int).values
print('8-fold fold_ids built. Fold sizes:', pd.Series(fold_ids).value_counts().sort_index().to_dict())

In [7]:
# Clean end-to-end pipeline (no Magpie): build compact features, 8-fold CV, OOF TE, 3-seed LGBM+XGB, NNLS blend
import numpy as np, pandas as pd, time, gc, os
from pathlib import Path
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error

t0_all = time.time()
print('Start clean pipeline...')

# ------------------ Load base CSVs ------------------
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
y = train['bandgap_energy_ev'].astype(float)

# ------------------ Engineer base features (reuse functions if available) ------------------
assert 'engineer_features' in globals(), 'Run Cell 3 to define engineer_features()'
train_fe = engineer_features(train)
test_fe = engineer_features(test)

# ------------------ Stoichiometric counts and group key ------------------
assert 'compute_stoich_groups' in globals(), 'Run Cell 3 to define compute_stoich_groups()'
groups, N, n_al, n_ga, n_in, n_o = compute_stoich_groups(train)
train_fe['N'] = N; train_fe['n_al'] = n_al; train_fe['n_ga'] = n_ga; train_fe['n_in'] = n_in; train_fe['n_o'] = n_o
gkey = groups.astype(str)
groups_te, N_te, al_te, ga_te, in_te, o_te = compute_stoich_groups(test)
test_fe['N'] = N_te; test_fe['n_al'] = al_te; test_fe['n_ga'] = ga_te; test_fe['n_in'] = in_te; test_fe['n_o'] = o_te

# ------------------ Composition weights and Vegard + bowing/logs + interactions ------------------
for df in (train_fe, test_fe):
    df['frac_al'] = df['percent_atom_al']/100.0
    df['frac_ga'] = df['percent_atom_ga']/100.0
    df['frac_in'] = df['percent_atom_in']/100.0
    df['percent_atom_o'] = 100.0 - (df['percent_atom_al'] + df['percent_atom_ga'] + df['percent_atom_in'])
    df['frac_o'] = df['percent_atom_o']/100.0
    frac_cat = (df['frac_al'] + df['frac_ga'] + df['frac_in']).replace(0, np.nan)
    df['w_al'] = (df['frac_al']/frac_cat).fillna(0)
    df['w_ga'] = (df['frac_ga']/frac_cat).fillna(0)
    df['w_in'] = (df['frac_in']/frac_cat).fillna(0)
    df['vegard_bg'] = 8.8*df['w_al'] + 4.8*df['w_ga'] + 2.9*df['w_in']
    df['bow_in'] = df['w_in']*(1.0 - df['w_in'])
    df['bow_ga'] = df['w_ga']*(1.0 - df['w_ga'])
    if 'volume_per_atom' in df.columns: df['log_vpa'] = np.log1p(df['volume_per_atom'].clip(lower=0))
    if 'atoms_per_volume' in df.columns: df['log_apv'] = np.log1p(df['atoms_per_volume'].clip(lower=0))
    df['log_oc'] = np.log1p((df['frac_o']/(df['frac_al']+df['frac_ga']+df['frac_in']+1e-9)).clip(lower=0))
    df['log_in_over_al'] = np.log1p(((df['frac_in']+1e-6)/(df['frac_al']+1e-6)).clip(lower=0))
    # interactions
    df['w_al_sq'] = df['w_al']**2; df['w_ga_sq'] = df['w_ga']**2; df['w_in_sq'] = df['w_in']**2
    df['w_al_ga'] = df['w_al']*df['w_ga']; df['w_al_in'] = df['w_al']*df['w_in']; df['w_ga_in'] = df['w_ga']*df['w_in']
    df['w_al_x_veg'] = df['w_al']*df['vegard_bg']; df['w_in_x_veg'] = df['w_in']*df['vegard_bg']
    df['al_in_diff_sq'] = (df['frac_al']-df['frac_in'])**2; df['ga_in_diff_sq'] = (df['frac_ga']-df['frac_in'])**2
    df['frac_al_cu'] = df['frac_al']**3; df['frac_ga_cu'] = df['frac_ga']**3; df['frac_in_cu'] = df['frac_in']**3
    # a_red/b_red/c_red
    vol = df['cell_volume'].replace(0, np.nan); l = vol.pow(1/3)
    df['a_red'] = df['lattice_vector_1_ang']/l; df['b_red'] = df['lattice_vector_2_ang']/l; df['c_red'] = df['lattice_vector_3_ang']/l

# mix metrics (if not already) were added by engineer_features

# ------------------ Cation-weighted contrasts (EN, ionic radius) ------------------
props = {
    'chi_pauling': {'Al':1.61,'Ga':1.81,'In':1.78,'O':3.44},
    'ionic_radius': {'Al':0.535,'Ga':0.62,'In':0.80,'O':1.38}
}
def add_cation_weighted(df):
    wa, wg, wi = df['w_al'], df['w_ga'], df['w_in']
    for name, tbl in props.items():
        ca, cg, ci, co = tbl['Al'], tbl['Ga'], tbl['In'], tbl['O']
        wmean = wa*ca + wg*cg + wi*ci
        df[f'catw_{name}_mean'] = wmean
        df[f'catw_{name}_var'] = (wa*(ca-wmean)**2 + wg*(cg-wmean)**2 + wi*(ci-wmean)**2)
        df[f'o_minus_catw_{name}'] = co - wmean
    return df
train_fe = add_cation_weighted(train_fe); test_fe = add_cation_weighted(test_fe)

# ------------------ Minimal XYZ features (load cache or build, then prune) ------------------
cache_tr = Path('xyz_train.parquet'); cache_te = Path('xyz_test.parquet')
if cache_tr.exists() and cache_te.exists():
    xyz_tr = pd.read_parquet(cache_tr); xyz_te = pd.read_parquet(cache_te)
else:
    assert 'build_xyz_df' in globals(), 'Run Cell 6 to define build_xyz_df/read_xyz_features'
    xyz_tr = pd.read_parquet(cache_tr) if cache_tr.exists() else build_xyz_df('train', train['id'].values, n_jobs=16)
    xyz_te = pd.read_parquet(cache_te) if cache_te.exists() else build_xyz_df('test', test['id'].values, n_jobs=16)
    xyz_tr.to_parquet(cache_tr, index=False); xyz_te.to_parquet(cache_te, index=False)

# prune: drop all rdf_* and mid-quantiles p5/p25/p50/p75/p95; keep only min/mean/std/max of d_* for all,cc,co,oo and nn_* min/mean/max
def prune_xyz(df):
    keep = ['id']
    for base in ['all','cc','co','oo']:
        for stat in ['min','mean','std','max']:
            keep.append(f'd_{base}_{stat}')
    for dirn in ['c_to_o','o_to_c']:
        for stat in ['min','mean','max']:
            keep.append(f'nn_{dirn}_{stat}')
    cols = [c for c in df.columns if c in keep]
    return df[cols].copy()
xyz_tr_p = prune_xyz(xyz_tr)
xyz_te_p = prune_xyz(xyz_te)
train_fe = train_fe.merge(xyz_tr_p, on='id', how='left')
test_fe = test_fe.merge(xyz_te_p, on='id', how='left')
print('Merged minimal XYZ:', train_fe.shape, test_fe.shape)

# ------------------ Build 8-fold stratified group-disjoint folds ------------------
y = train_fe['bandgap_energy_ev'].astype(float)
gmean = y.groupby(gkey).mean()
gbin = pd.qcut(gmean, q=10, labels=False, duplicates='drop')
uniq = pd.DataFrame({'g': gmean.index, 'bin': gbin.values}).sample(frac=1.0, random_state=42).reset_index(drop=True)
skf = StratifiedKFold(n_splits=8, shuffle=True, random_state=42)
group_to_fold = {}
for k, (_, val_idx) in enumerate(skf.split(uniq['g'], uniq['bin'])):
    for g in uniq['g'].iloc[val_idx]: group_to_fold[g] = k
fold_ids = gkey.map(group_to_fold).astype(int).values
print('Fold sizes:', pd.Series(fold_ids).value_counts().sort_index().to_dict())

# ------------------ OOF target encoding for spacegroup in log space ------------------
train_fe['te_sg'] = 0.0
y_log = np.log1p(y.clip(lower=0))
global_mean = float(y_log.mean())
for k in range(8):
    trn_idx = np.where(fold_ids!=k)[0]; val_idx = np.where(fold_ids==k)[0]
    m = train_fe.iloc[trn_idx].groupby('spacegroup')['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).mean())
    te_map = m.to_dict()
    sg_series = train_fe.iloc[val_idx]['spacegroup'].astype(str)
    mapped = sg_series.map(te_map).astype(float).fillna(global_mean)
    train_fe.loc[train_fe.index[val_idx], 'te_sg'] = mapped.values
sg_map_full = train_fe.groupby('spacegroup')['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).mean()).to_dict()
test_fe['te_sg'] = test_fe['spacegroup'].astype(str).map(sg_map_full).astype(float).fillna(global_mean)

# ------------------ Build final feature matrices ------------------
drop_cols = ['id','bandgap_energy_ev']
common_cols = [c for c in train_fe.columns if c in test_fe.columns]
features = [c for c in common_cols if c not in drop_cols]
# Drop any rdf_* remnants just in case
features = [c for c in features if not c.startswith('rdf_')]
# Ensure numeric except allow spacegroup
num_cols = list(train_fe[features].select_dtypes(include=[np.number]).columns)
if 'spacegroup' in features and 'spacegroup' not in num_cols: num_cols.append('spacegroup')
train_X = train_fe[num_cols].copy(); test_X = test_fe[num_cols].copy()
med = train_X.median(numeric_only=True); train_X = train_X.fillna(med); test_X = test_X.fillna(med)
print('Feature matrix shapes:', train_X.shape, test_X.shape)

# ------------------ Models: 3 seeds x (LGBM, XGB) ------------------
import lightgbm as lgb, xgboost as xgb
seeds = [7, 42, 2025]
n_splits = 8
oof_lgb_seeds = []; pred_lgb_seeds = []
oof_xgb_seeds = []; pred_xgb_seeds = []

for SEED in seeds:
    print(f'-- LGBM seed {SEED} --'); t0 = time.time()
    params_lgb = {
        'objective':'regression','metric':'rmse','learning_rate':0.03,
        'num_leaves':128,'max_depth':-1,'min_data_in_leaf':200,
        'feature_fraction':0.8,'bagging_fraction':0.8,'bagging_freq':1,
        'lambda_l2':3.0,'lambda_l1':0.0,'verbosity':-1,'seed':SEED
    }
    oof_lgb = np.zeros(len(train_X)); pred_lgb = np.zeros(len(test_X))
    for k in range(n_splits):
        trn = np.where(fold_ids!=k)[0]; val = np.where(fold_ids==k)[0]
        dtr = lgb.Dataset(train_X.iloc[trn], label=y_log.iloc[trn], categorical_feature=['spacegroup'] if 'spacegroup' in train_X.columns else None, free_raw_data=False)
        dva = lgb.Dataset(train_X.iloc[val], label=y_log.iloc[val], categorical_feature=['spacegroup'] if 'spacegroup' in train_X.columns else None, free_raw_data=False)
        m = lgb.train(params_lgb, dtr, num_boost_round=7000, valid_sets=[dtr,dva], valid_names=['train','valid'], callbacks=[lgb.early_stopping(450), lgb.log_evaluation(300)])
        oof_lgb[val] = m.predict(train_X.iloc[val], num_iteration=m.best_iteration)
        pred_lgb += m.predict(test_X, num_iteration=m.best_iteration)/n_splits
        del m, dtr, dva; gc.collect()
    rmse = float(mean_squared_error(y_log, oof_lgb) ** 0.5); print(f'LGBM seed {SEED} OOF RMSLE: {rmse:.6f} | {time.time()-t0:.1f}s')
    oof_lgb_seeds.append(oof_lgb); pred_lgb_seeds.append(pred_lgb)

    print(f'-- XGB seed {SEED} --'); t0 = time.time()
    xgb_params = dict(objective='reg:squarederror', eval_metric='rmse', tree_method='hist',
                      max_depth=6, eta=0.03, subsample=0.8, colsample_bytree=0.8,
                      min_child_weight=5, reg_lambda=3.0, reg_alpha=0.0, random_state=SEED)
    oof_xgb = np.zeros(len(train_X)); pred_xgb = np.zeros(len(test_X))
    for k in range(n_splits):
        trn = np.where(fold_ids!=k)[0]; val = np.where(fold_ids==k)[0]
        dtr = xgb.DMatrix(train_X.iloc[trn], label=y_log.iloc[trn], enable_categorical=True); dva = xgb.DMatrix(train_X.iloc[val], label=y_log.iloc[val], enable_categorical=True); dte = xgb.DMatrix(test_X, enable_categorical=True)
        model = xgb.train(xgb_params, dtr, num_boost_round=8000, evals=[(dva,'valid')], early_stopping_rounds=400, verbose_eval=False)
        oof_xgb[val] = model.predict(dva)
        pred_xgb += model.predict(dte)/n_splits
        del model, dtr, dva, dte; gc.collect()
    rmse = float(mean_squared_error(y_log, oof_xgb) ** 0.5); print(f'XGB seed {SEED} OOF RMSLE: {rmse:.6f} | {time.time()-t0:.1f}s')
    oof_xgb_seeds.append(oof_xgb); pred_xgb_seeds.append(pred_xgb)

# Average across seeds
oof_lgb_avg = np.mean(np.vstack(oof_lgb_seeds), axis=0)
pred_lgb_avg = np.mean(np.vstack(pred_lgb_seeds), axis=0)
oof_xgb_avg = np.mean(np.vstack(oof_xgb_seeds), axis=0)
pred_xgb_avg = np.mean(np.vstack(pred_xgb_seeds), axis=0)
cv_lgb = float(mean_squared_error(y_log, oof_lgb_avg) ** 0.5)
cv_xgb = float(mean_squared_error(y_log, oof_xgb_avg) ** 0.5)
print(f'Averaged LGBM CV RMSLE: {cv_lgb:.6f} | Averaged XGB CV RMSLE: {cv_xgb:.6f}')

# ------------------ NNLS blend on seed-averaged OOF ------------------
from scipy.optimize import nnls
P = np.vstack([oof_lgb_avg, oof_xgb_avg]).T
w, _ = nnls(P, y_log.values)
w = w / (w.sum() if w.sum() > 0 else 1.0)
print('NNLS weights (LGB, XGB):', w)
oof_blend = P @ w
cv_blend = float(mean_squared_error(y_log, oof_blend) ** 0.5)
print(f'Blended CV RMSLE: {cv_blend:.6f}')
Ptest = np.vstack([pred_lgb_avg, pred_xgb_avg]).T
pred_blend = Ptest @ w

# ------------------ Save submission ------------------
pred_bandgap = np.expm1(pred_blend).clip(0, 6.5)
sub = pd.DataFrame({'id': test['id'], 'bandgap_energy_ev': pred_bandgap})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv', sub.shape, '| total elapsed:', f'{time.time()-t0_all:.1f}s')
sub.head()

Start clean pipeline...
Merged minimal XYZ: (2160, 107) (240, 105)
Fold sizes: {0: 309, 1: 300, 2: 264, 3: 273, 4: 248, 5: 265, 6: 252, 7: 249}


  m = train_fe.iloc[trn_idx].groupby('spacegroup')['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).mean())
  m = train_fe.iloc[trn_idx].groupby('spacegroup')['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).mean())
  m = train_fe.iloc[trn_idx].groupby('spacegroup')['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).mean())
  m = train_fe.iloc[trn_idx].groupby('spacegroup')['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).mean())
  m = train_fe.iloc[trn_idx].groupby('spacegroup')['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).mean())
  m = train_fe.iloc[trn_idx].groupby('spacegroup')['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).mean())
  m = train_fe.iloc[trn_idx].groupby('spacegroup')['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).mean())
  m = train_fe.iloc[trn_idx].groupby('spacegroup')['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).mean())
  sg_map_full = train_fe

Feature matrix shapes: (2160, 105) (240, 105)
-- LGBM seed 7 --
Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0856496	valid's rmse: 0.0774812


[600]	train's rmse: 0.0772158	valid's rmse: 0.0745231
[900]	train's rmse: 0.073193	valid's rmse: 0.0738798


[1200]	train's rmse: 0.0704691	valid's rmse: 0.0739593
[1500]	train's rmse: 0.068259	valid's rmse: 0.0742076
Early stopping, best iteration is:
[1067]	train's rmse: 0.0715769	valid's rmse: 0.0736709


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0815879	valid's rmse: 0.100798


[600]	train's rmse: 0.0733341	valid's rmse: 0.0967418
[900]	train's rmse: 0.0694476	valid's rmse: 0.0965866


Early stopping, best iteration is:
[653]	train's rmse: 0.0725173	valid's rmse: 0.0963557


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0836861	valid's rmse: 0.0940116


[600]	train's rmse: 0.0749884	valid's rmse: 0.0864025
[900]	train's rmse: 0.0708483	valid's rmse: 0.0846562


[1200]	train's rmse: 0.0681095	valid's rmse: 0.084239
[1500]	train's rmse: 0.0658102	valid's rmse: 0.0843299
Early stopping, best iteration is:
[1117]	train's rmse: 0.0687921	valid's rmse: 0.0840047


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.080151	valid's rmse: 0.111537


[600]	train's rmse: 0.0723708	valid's rmse: 0.107929
[900]	train's rmse: 0.068648	valid's rmse: 0.106719


[1200]	train's rmse: 0.0661379	valid's rmse: 0.10621
[1500]	train's rmse: 0.064165	valid's rmse: 0.106117


[1800]	train's rmse: 0.0625183	valid's rmse: 0.106034
[2100]	train's rmse: 0.0610491	valid's rmse: 0.106228


Early stopping, best iteration is:
[1905]	train's rmse: 0.0619777	valid's rmse: 0.105949


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0830686	valid's rmse: 0.0862842


[600]	train's rmse: 0.0751174	valid's rmse: 0.084659
[900]	train's rmse: 0.0713326	valid's rmse: 0.0847306


[1200]	train's rmse: 0.0686077	valid's rmse: 0.0849417
Early stopping, best iteration is:
[854]	train's rmse: 0.071823	valid's rmse: 0.0845651


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0843786	valid's rmse: 0.0782782


[600]	train's rmse: 0.0762588	valid's rmse: 0.0738623


[900]	train's rmse: 0.0724736	valid's rmse: 0.07284


[1200]	train's rmse: 0.0698496	valid's rmse: 0.0722456


[1500]	train's rmse: 0.0677388	valid's rmse: 0.072013
[1800]	train's rmse: 0.065988	valid's rmse: 0.0721585


[2100]	train's rmse: 0.0643984	valid's rmse: 0.0722795
Early stopping, best iteration is:
[1652]	train's rmse: 0.0668233	valid's rmse: 0.0718587


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0812928	valid's rmse: 0.109006


[600]	train's rmse: 0.0729543	valid's rmse: 0.103708
[900]	train's rmse: 0.0691171	valid's rmse: 0.102344


[1200]	train's rmse: 0.066445	valid's rmse: 0.101943
[1500]	train's rmse: 0.0643745	valid's rmse: 0.101888


[1800]	train's rmse: 0.0625484	valid's rmse: 0.101979
Early stopping, best iteration is:
[1439]	train's rmse: 0.0647881	valid's rmse: 0.101662


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0841944	valid's rmse: 0.0762753


[600]	train's rmse: 0.0764886	valid's rmse: 0.0723259
[900]	train's rmse: 0.0729529	valid's rmse: 0.0705437


[1200]	train's rmse: 0.0703476	valid's rmse: 0.0694227
[1500]	train's rmse: 0.0681918	valid's rmse: 0.0689946


[1800]	train's rmse: 0.0664515	valid's rmse: 0.06855
[2100]	train's rmse: 0.0648673	valid's rmse: 0.0681788


[2400]	train's rmse: 0.0634607	valid's rmse: 0.0678188
[2700]	train's rmse: 0.0622122	valid's rmse: 0.0674495


[3000]	train's rmse: 0.0609934	valid's rmse: 0.0672938
[3300]	train's rmse: 0.0598725	valid's rmse: 0.0672952


[3600]	train's rmse: 0.0588747	valid's rmse: 0.0670349
[3900]	train's rmse: 0.0578846	valid's rmse: 0.0669583


[4200]	train's rmse: 0.0569952	valid's rmse: 0.066674
[4500]	train's rmse: 0.0561555	valid's rmse: 0.0669064


Early stopping, best iteration is:
[4207]	train's rmse: 0.0569705	valid's rmse: 0.066625


LGBM seed 7 OOF RMSLE: 0.086713 | 10.3s
-- XGB seed 7 --


XGB seed 7 OOF RMSLE: 0.092530 | 14.9s
-- LGBM seed 42 --
Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.085508	valid's rmse: 0.0781661


[600]	train's rmse: 0.0771791	valid's rmse: 0.0747851


[900]	train's rmse: 0.0731036	valid's rmse: 0.0739759
[1200]	train's rmse: 0.0703407	valid's rmse: 0.0741083


Early stopping, best iteration is:
[996]	train's rmse: 0.0721374	valid's rmse: 0.0737348


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0815807	valid's rmse: 0.0999958


[600]	train's rmse: 0.0734396	valid's rmse: 0.0956292
[900]	train's rmse: 0.0695839	valid's rmse: 0.0950461


[1200]	train's rmse: 0.0669634	valid's rmse: 0.0954804
Early stopping, best iteration is:
[860]	train's rmse: 0.070012	valid's rmse: 0.0949792


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0837445	valid's rmse: 0.0938564


[600]	train's rmse: 0.0751999	valid's rmse: 0.0862385
[900]	train's rmse: 0.0709888	valid's rmse: 0.0844289


[1200]	train's rmse: 0.0682034	valid's rmse: 0.0839498
[1500]	train's rmse: 0.0659152	valid's rmse: 0.0842596


Early stopping, best iteration is:
[1177]	train's rmse: 0.068366	valid's rmse: 0.0838876


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0798898	valid's rmse: 0.111709


[600]	train's rmse: 0.0721518	valid's rmse: 0.10808
[900]	train's rmse: 0.0685581	valid's rmse: 0.107172


[1200]	train's rmse: 0.0660951	valid's rmse: 0.106718
[1500]	train's rmse: 0.0640944	valid's rmse: 0.106442


[1800]	train's rmse: 0.0623874	valid's rmse: 0.10635
[2100]	train's rmse: 0.0609154	valid's rmse: 0.106664
Early stopping, best iteration is:
[1651]	train's rmse: 0.0632026	valid's rmse: 0.106149


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0831767	valid's rmse: 0.0856575


[600]	train's rmse: 0.0751093	valid's rmse: 0.0837509
[900]	train's rmse: 0.0712341	valid's rmse: 0.083749


[1200]	train's rmse: 0.0684839	valid's rmse: 0.0838816
Early stopping, best iteration is:
[841]	train's rmse: 0.0718286	valid's rmse: 0.0834152


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0846432	valid's rmse: 0.0786524


[600]	train's rmse: 0.0763644	valid's rmse: 0.0744153
[900]	train's rmse: 0.0725899	valid's rmse: 0.073078


[1200]	train's rmse: 0.0699753	valid's rmse: 0.072375
[1500]	train's rmse: 0.0678473	valid's rmse: 0.0720426


[1800]	train's rmse: 0.0660848	valid's rmse: 0.0720875
[2100]	train's rmse: 0.0644905	valid's rmse: 0.0718777


[2400]	train's rmse: 0.0630992	valid's rmse: 0.0720452
Early stopping, best iteration is:
[2200]	train's rmse: 0.0640098	valid's rmse: 0.0718165


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0813722	valid's rmse: 0.109564


[600]	train's rmse: 0.0731332	valid's rmse: 0.103896
[900]	train's rmse: 0.0692716	valid's rmse: 0.102814


[1200]	train's rmse: 0.0666337	valid's rmse: 0.102272
[1500]	train's rmse: 0.0645641	valid's rmse: 0.10213


[1800]	train's rmse: 0.0627842	valid's rmse: 0.102127
[2100]	train's rmse: 0.0611821	valid's rmse: 0.102656


Early stopping, best iteration is:
[1699]	train's rmse: 0.0633593	valid's rmse: 0.101965


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0843094	valid's rmse: 0.0751923


[600]	train's rmse: 0.0766112	valid's rmse: 0.0716441
[900]	train's rmse: 0.0729071	valid's rmse: 0.0703212


[1200]	train's rmse: 0.0702609	valid's rmse: 0.069114
[1500]	train's rmse: 0.0682411	valid's rmse: 0.0687042


[1800]	train's rmse: 0.0664917	valid's rmse: 0.0681041
[2100]	train's rmse: 0.0648839	valid's rmse: 0.0678158


[2400]	train's rmse: 0.0635033	valid's rmse: 0.0673557
[2700]	train's rmse: 0.0622023	valid's rmse: 0.0675497


[3000]	train's rmse: 0.0609959	valid's rmse: 0.0671894
[3300]	train's rmse: 0.0599078	valid's rmse: 0.0670289


[3600]	train's rmse: 0.0588721	valid's rmse: 0.0671246
[3900]	train's rmse: 0.057941	valid's rmse: 0.067161
Early stopping, best iteration is:
[3554]	train's rmse: 0.059023	valid's rmse: 0.0669614


LGBM seed 42 OOF RMSLE: 0.086466 | 13.7s
-- XGB seed 42 --


XGB seed 42 OOF RMSLE: 0.092677 | 13.6s
-- LGBM seed 2025 --
Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0859364	valid's rmse: 0.0771393


[600]	train's rmse: 0.0771758	valid's rmse: 0.074141
[900]	train's rmse: 0.0730051	valid's rmse: 0.0731567


[1200]	train's rmse: 0.0702805	valid's rmse: 0.0730463
Early stopping, best iteration is:
[1004]	train's rmse: 0.0719747	valid's rmse: 0.0728362


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0814154	valid's rmse: 0.0999094


[600]	train's rmse: 0.0732328	valid's rmse: 0.0959326
[900]	train's rmse: 0.0694414	valid's rmse: 0.0953001


[1200]	train's rmse: 0.0667236	valid's rmse: 0.0958198
Early stopping, best iteration is:
[854]	train's rmse: 0.0699752	valid's rmse: 0.0952198


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0837879	valid's rmse: 0.0938396


[600]	train's rmse: 0.0753796	valid's rmse: 0.0859205
[900]	train's rmse: 0.071236	valid's rmse: 0.0842573


[1200]	train's rmse: 0.0683923	valid's rmse: 0.0838426
[1500]	train's rmse: 0.0661048	valid's rmse: 0.0840176
Early stopping, best iteration is:
[1129]	train's rmse: 0.0690391	valid's rmse: 0.0837008


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0800815	valid's rmse: 0.111737


[600]	train's rmse: 0.0723075	valid's rmse: 0.108137
[900]	train's rmse: 0.0686146	valid's rmse: 0.106972


[1200]	train's rmse: 0.0661008	valid's rmse: 0.106454
[1500]	train's rmse: 0.0640729	valid's rmse: 0.106422


[1800]	train's rmse: 0.0624185	valid's rmse: 0.106393
Early stopping, best iteration is:
[1350]	train's rmse: 0.0650625	valid's rmse: 0.106209
Training until validation scores don't improve for 450 rounds


[300]	train's rmse: 0.0833347	valid's rmse: 0.0859517
[600]	train's rmse: 0.0752093	valid's rmse: 0.0846921


[900]	train's rmse: 0.0713428	valid's rmse: 0.084483
Early stopping, best iteration is:
[640]	train's rmse: 0.0745108	valid's rmse: 0.0843916


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0845529	valid's rmse: 0.0783966


[600]	train's rmse: 0.0763483	valid's rmse: 0.0736694
[900]	train's rmse: 0.0725471	valid's rmse: 0.072082


[1200]	train's rmse: 0.0699088	valid's rmse: 0.071604
[1500]	train's rmse: 0.0679742	valid's rmse: 0.0715702


[1800]	train's rmse: 0.0661533	valid's rmse: 0.0716301
Early stopping, best iteration is:
[1459]	train's rmse: 0.0682271	valid's rmse: 0.0713699


Training until validation scores don't improve for 450 rounds
[300]	train's rmse: 0.0815259	valid's rmse: 0.110147


[600]	train's rmse: 0.0731359	valid's rmse: 0.10455
[900]	train's rmse: 0.0692384	valid's rmse: 0.103231


[1200]	train's rmse: 0.0666076	valid's rmse: 0.103181
[1500]	train's rmse: 0.0644938	valid's rmse: 0.10303


[1800]	train's rmse: 0.0627103	valid's rmse: 0.102991
[2100]	train's rmse: 0.0610989	valid's rmse: 0.102816


[2400]	train's rmse: 0.0596901	valid's rmse: 0.10271
Early stopping, best iteration is:
[2207]	train's rmse: 0.06058	valid's rmse: 0.102592


Training until validation scores don't improve for 450 rounds


[300]	train's rmse: 0.0843953	valid's rmse: 0.0756015
[600]	train's rmse: 0.0766445	valid's rmse: 0.0713352


[900]	train's rmse: 0.0728799	valid's rmse: 0.0698603
[1200]	train's rmse: 0.0703448	valid's rmse: 0.0691909


[1500]	train's rmse: 0.0683241	valid's rmse: 0.0684698
[1800]	train's rmse: 0.0664805	valid's rmse: 0.0683905


[2100]	train's rmse: 0.0649096	valid's rmse: 0.0680098
[2400]	train's rmse: 0.0635109	valid's rmse: 0.0677584


[2700]	train's rmse: 0.0622104	valid's rmse: 0.0677844
[3000]	train's rmse: 0.0610164	valid's rmse: 0.0676602


[3300]	train's rmse: 0.0598999	valid's rmse: 0.0673782
[3600]	train's rmse: 0.0588645	valid's rmse: 0.0673075


[3900]	train's rmse: 0.0579341	valid's rmse: 0.0672961
[4200]	train's rmse: 0.0570096	valid's rmse: 0.06733


[4500]	train's rmse: 0.0561554	valid's rmse: 0.0672897
[4800]	train's rmse: 0.0553514	valid's rmse: 0.0673912


Early stopping, best iteration is:
[4524]	train's rmse: 0.0560889	valid's rmse: 0.0671464


LGBM seed 2025 OOF RMSLE: 0.086548 | 10.0s
-- XGB seed 2025 --


XGB seed 2025 OOF RMSLE: 0.092398 | 13.5s
Averaged LGBM CV RMSLE: 0.086440 | Averaged XGB CV RMSLE: 0.092294
NNLS weights (LGB, XGB): [0.94364721 0.05635279]
Blended CV RMSLE: 0.086418
Saved submission.csv (240, 2) | total elapsed: 76.3s


Unnamed: 0,id,bandgap_energy_ev
0,1,1.921003
1,2,1.697664
2,3,4.329125
3,4,2.918891
4,5,1.117051


In [None]:
# DIAG: quick stdout and state check
import numpy as np, pandas as pd, time, gc
print('[DIAG] start')
print('[DIAG] globals:', {k: True for k in ['train_fe','test_fe','fold_ids','engineer_features','compute_stoich_groups'] if k in globals()})
if 'train_fe' in globals():
    print('[DIAG] train_fe shape:', train_fe.shape)
    rdfc = sum(1 for c in train_fe.columns if isinstance(c, str) and c.startswith('rdf_'))
    print('[DIAG] rdf_* columns in train_fe:', rdfc)
    print('[DIAG] sample cols:', list(train_fe.columns[:8]))
if 'fold_ids' in globals():
    try:
        import numpy as _np
        uniq, counts = _np.unique(fold_ids, return_counts=True)
        print('[DIAG] fold_ids unique:', dict(zip(uniq.tolist(), counts.tolist())))
    except Exception as e:
        print('[DIAG] fold_ids check error:', e)
print('[DIAG] done')

In [12]:
# Install matminer/pymatgen explicitly with logs, then sanity-import
import sys, subprocess, time, os
t0 = time.time()
print('[SETUP] Installing dependencies: pymatgen, matminer (prefer binary wheels)')
os.environ['PIP_DISABLE_PIP_VERSION_CHECK'] = '1'
os.environ['PIP_NO_INPUT'] = '1'
cmd = [sys.executable, '-m', 'pip', 'install', '--prefer-binary', '--upgrade', 'pymatgen', 'matminer']
print('[SETUP] Running:', ' '.join(cmd))
subprocess.check_call(cmd)
print('[SETUP] Install finished in', f'{time.time()-t0:.1f}s')
print('[SETUP] Importing modules to warm cache...')
import importlib
mm = importlib.import_module('matminer')
pmg = importlib.import_module('pymatgen')
from matminer.featurizers.composition import Stoichiometry, ValenceOrbital, IonProperty
from pymatgen.core.composition import Composition
print('[SETUP] Versions -> matminer:', getattr(mm, '__version__', 'unknown'), '| pymatgen:', getattr(pmg, '__version__', 'unknown'))
print('[SETUP] Ready.')

In [11]:
# Precompute cached low-dim matminer features (deduplicated compositions only)
import numpy as np, pandas as pd, time, os, sys, subprocess, warnings
from pathlib import Path

t0_all = time.time()
print('[MM] Start precompute low-dim matminer features...')
os.environ['TQDM_DISABLE'] = '1'
os.environ['PYTHONWARNINGS'] = 'ignore'
warnings.filterwarnings('ignore')

# Ensure grouping util exists
assert 'compute_stoich_groups' in globals(), 'Run Cell 3 to define compute_stoich_groups()'

train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
_, N_tr, al_tr, ga_tr, in_tr, o_tr = compute_stoich_groups(train)
_, N_te, al_te, ga_te, in_te, o_te = compute_stoich_groups(test)
comp_tr = pd.DataFrame({'n_al': al_tr, 'n_ga': ga_tr, 'n_in': in_tr, 'n_o': o_tr})
comp_te = pd.DataFrame({'n_al': al_te, 'n_ga': ga_te, 'n_in': in_te, 'n_o': o_te})
def comp_str_df(df):
    return 'Al' + df['n_al'].astype(int).astype(str) + ' Ga' + df['n_ga'].astype(int).astype(str) + ' In' + df['n_in'].astype(int).astype(str) + ' O' + df['n_o'].astype(int).astype(str)
comp_tr['composition'] = comp_str_df(comp_tr)
comp_te['composition'] = comp_str_df(comp_te)

def build_mm_lowdim_from_comp(comp_series, cache_path):
    cache_p = Path(cache_path)
    if cache_p.exists():
        try:
            cached = pd.read_parquet(cache_p)
            if len(cached) == len(comp_series):
                print(f'[MM] Loaded cache: {cache_path} shape={cached.shape}')
                return cached.reset_index(drop=True)
        except Exception:
            pass
    t0 = time.time()
    try:
        from matminer.featurizers.composition import Stoichiometry, ValenceOrbital, IonProperty
        from pymatgen.core.composition import Composition
    except Exception:
        print('[MM] Installing matminer/pymatgen...'); subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'matminer', 'pymatgen'])
        from matminer.featurizers.composition import Stoichiometry, ValenceOrbital, IonProperty
        from pymatgen.core.composition import Composition
    uniq = pd.Series(comp_series.astype(str).unique())
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        comp_objs = uniq.apply(lambda s: Composition(s))
    df_u = pd.DataFrame({'composition': uniq.values, 'comp_obj': comp_objs.values})
    fz_list = [Stoichiometry(), ValenceOrbital(props=['avg','frac'], impute_nan=True), IonProperty(fast=True, impute_nan=True)]
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        out_u = df_u[['comp_obj']].rename(columns={'comp_obj':'composition'}).copy()
        for fz in fz_list:
            out_u = fz.featurize_dataframe(out_u, col_id='composition', ignore_errors=True, pbar=False)
        feats_u = out_u.drop(columns=['composition'])
    feats_u.columns = [f'mm2_{c}' for c in feats_u.columns]
    map_df = pd.concat([df_u[['composition']], feats_u], axis=1)
    all_map = pd.DataFrame({'composition': comp_series.values})
    out = all_map.merge(map_df, on='composition', how='left').drop(columns=['composition'])
    try:
        out.to_parquet(cache_p, index=False)
        print(f'[MM] Cached -> {cache_path} shape={out.shape} | uniq={len(uniq)} | {time.time()-t0:.1f}s')
    except Exception:
        print(f'[MM] Built (no cache write) shape={out.shape} | uniq={len(uniq)} | {time.time()-t0:.1f}s')
    return out.reset_index(drop=True)

mm_tr = build_mm_lowdim_from_comp(comp_tr['composition'], 'mm2_train.parquet')
mm_te = build_mm_lowdim_from_comp(comp_te['composition'], 'mm2_test.parquet')
print('[MM] Done. train/test shapes:', mm_tr.shape, mm_te.shape, '| total elapsed:', f'{time.time()-t0_all:.1f}s')

In [16]:
# Composition-only upgraded pipeline: drop XYZ and matminer, add mm-lite features, expanded cation contrasts, smoothed TE, LGBM-only (multi-seed) with optional monotone constraints
import numpy as np, pandas as pd, time, gc, os, warnings
from pathlib import Path
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error

t0_all = time.time()
print('Start composition-only pipeline (mm-lite, LGBM-only, no matminer/xyz)...', flush=True)

# Silence noisy warnings and progress bars
os.environ['TQDM_DISABLE'] = '1'
os.environ['PYTHONWARNINGS'] = 'ignore'
warnings.filterwarnings('ignore')
try:
    from tqdm import auto as _tqdm_auto
    _tqdm_auto.tqdm_disable = True
except Exception:
    pass

# ------------------ Load base CSVs ------------------
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
y = train['bandgap_energy_ev'].astype(float)

# ------------------ Engineer base features (reuse functions) ------------------
assert 'engineer_features' in globals(), 'Run Cell 3 to define engineer_features()'
assert 'compute_stoich_groups' in globals(), 'Run Cell 3 to define compute_stoich_groups()'
train_fe = engineer_features(train).copy()
test_fe = engineer_features(test).copy()

# Stoichiometric counts and group key
g_tr, N, n_al, n_ga, n_in, n_o = compute_stoich_groups(train)
train_fe['N'] = N; train_fe['n_al'] = n_al; train_fe['n_ga'] = n_ga; train_fe['n_in'] = n_in; train_fe['n_o'] = n_o
g_te, N_te, al_te, ga_te, in_te, o_te = compute_stoich_groups(test)
test_fe['N'] = N_te; test_fe['n_al'] = al_te; test_fe['n_ga'] = ga_te; test_fe['n_in'] = in_te; test_fe['n_o'] = o_te
gkey = g_tr.astype(str)

# ------------------ Composition weights, Vegard, bowing/logs, interactions, reduced lattice ------------------
for df in (train_fe, test_fe):
    df['frac_al'] = df['percent_atom_al']/100.0
    df['frac_ga'] = df['percent_atom_ga']/100.0
    df['frac_in'] = df['percent_atom_in']/100.0
    df['percent_atom_o'] = 100.0 - (df['percent_atom_al'] + df['percent_atom_ga'] + df['percent_atom_in'])
    df['frac_o'] = df['percent_atom_o']/100.0
    frac_cat = (df['frac_al'] + df['frac_ga'] + df['frac_in']).replace(0, np.nan)
    df['w_al'] = (df['frac_al']/frac_cat).fillna(0)
    df['w_ga'] = (df['frac_ga']/frac_cat).fillna(0)
    df['w_in'] = (df['frac_in']/frac_cat).fillna(0)
    df['vegard_bg'] = 8.8*df['w_al'] + 4.8*df['w_ga'] + 2.9*df['w_in']
    df['bow_in'] = df['w_in']*(1.0 - df['w_in'])
    df['bow_ga'] = df['w_ga']*(1.0 - df['w_ga'])
    if 'volume_per_atom' in df.columns: df['log_vpa'] = np.log1p(df['volume_per_atom'].clip(lower=0))
    if 'atoms_per_volume' in df.columns: df['log_apv'] = np.log1p(df['atoms_per_volume'].clip(lower=0))
    df['log_oc'] = np.log1p((df['frac_o']/(df['frac_al']+df['frac_ga']+df['frac_in']+1e-9)).clip(lower=0))
    df['log_in_over_al'] = np.log1p(((df['frac_in']+1e-6)/(df['frac_al']+1e-6)).clip(lower=0))
    # interactions
    df['w_al_sq'] = df['w_al']**2; df['w_ga_sq'] = df['w_ga']**2; df['w_in_sq'] = df['w_in']**2
    df['w_al_ga'] = df['w_al']*df['w_ga']; df['w_al_in'] = df['w_al']*df['w_in']; df['w_ga_in'] = df['w_ga']*df['w_in']
    df['w_al_x_veg'] = df['w_al']*df['vegard_bg']; df['w_in_x_veg'] = df['w_in']*df['vegard_bg']
    df['al_in_diff_sq'] = (df['frac_al']-df['frac_in'])**2; df['ga_in_diff_sq'] = (df['frac_ga']-df['frac_in'])**2
    df['frac_al_cu'] = df['frac_al']**3; df['frac_ga_cu'] = df['frac_ga']**3; df['frac_in_cu'] = df['frac_in']**3
    # reduced lattice
    vol = df['cell_volume'].replace(0, np.nan); l = vol.pow(1/3)
    df['a_red'] = df['lattice_vector_1_ang']/l; df['b_red'] = df['lattice_vector_2_ang']/l; df['c_red'] = df['lattice_vector_3_ang']/l

# ------------------ Expanded cation-weighted contrasts ------------------
props = {
    'chi_pauling': {'Al':1.61,'Ga':1.81,'In':1.78,'O':3.44},
    'ionic_radius': {'Al':0.535,'Ga':0.62,'In':0.80,'O':1.38},
    'Z': {'Al':13,'Ga':31,'In':49,'O':8},
    'period': {'Al':3,'Ga':4,'In':5,'O':2},
    'group': {'Al':13,'Ga':13,'In':13,'O':16},
    'covalent_radius': {'Al':1.21,'Ga':1.22,'In':1.42,'O':0.66},
    'first_ionization_energy': {'Al':5.986,'Ga':5.999,'In':5.786,'O':13.618},
    'electron_affinity': {'Al':0.441,'Ga':0.30,'In':0.30,'O':1.461}
}
def add_cation_weighted(df):
    wa, wg, wi = df['w_al'], df['w_ga'], df['w_in']
    for name, tbl in props.items():
        ca, cg, ci, co = tbl['Al'], tbl['Ga'], tbl['In'], tbl['O']
        wmean = wa*ca + wg*cg + wi*ci
        df[f'catw_{name}_mean'] = wmean
        df[f'catw_{name}_var'] = (wa*(ca-wmean)**2 + wg*(cg-wmean)**2 + wi*(ci-wmean)**2)
    # O-minus-cation deltas for key props
    df['o_minus_catw_chi_pauling'] = props['chi_pauling']['O'] - df['catw_chi_pauling_mean']
    df['o_minus_catw_ionic_radius'] = props['ionic_radius']['O'] - df['catw_ionic_radius_mean']
    return df
train_fe = add_cation_weighted(train_fe); test_fe = add_cation_weighted(test_fe)

# ------------------ mm-lite features (no matminer) ------------------
def add_mm_lite(df):
    # Stoichiometry norms from fracs
    fa, fg, fi, fo = df['frac_al'], df['frac_ga'], df['frac_in'], df['frac_o']
    arr = np.stack([fa, fg, fi, fo], axis=1)
    df['sto_s2'] = np.sqrt((arr**2).sum(axis=1))
    df['sto_s3'] = np.cbrt((arr**3).sum(axis=1).clip(lower=0))
    df['sto_s5'] = (arr**5).sum(axis=1).clip(lower=0) ** (1/5)
    df['frac_max'] = arr.max(axis=1); df['frac_min'] = arr.min(axis=1); df['frac_range'] = df['frac_max'] - df['frac_min']
    # mix stats on cations
    w = np.stack([df['w_al'], df['w_ga'], df['w_in']], axis=1)
    df['w_max'] = w.max(axis=1); df['w_min'] = w.min(axis=1); df['w_range'] = df['w_max'] - df['w_min']
    df['hhi_cation2'] = (w**2).sum(axis=1)
    # Valence-orbital proxies (hardcoded) s/p counts
    s_map = {'Al':2,'Ga':2,'In':2,'O':2}; p_map = {'Al':1,'Ga':1,'In':1,'O':4}
    # cation-weighted
    s_cat = df['w_al']*s_map['Al'] + df['w_ga']*s_map['Ga'] + df['w_in']*s_map['In']
    p_cat = df['w_al']*p_map['Al'] + df['w_ga']*p_map['Ga'] + df['w_in']*p_map['In']
    df['vo_cat_s_mean'] = s_cat; df['vo_cat_p_mean'] = p_cat
    df['vo_cat_p_frac'] = p_cat / (s_cat + p_cat + 1e-9); df['vo_cat_p_minus_s'] = p_cat - s_cat
    # total-weighted
    s_tot = fa*s_map['Al'] + fg*s_map['Ga'] + fi*s_map['In'] + fo*s_map['O']
    p_tot = fa*p_map['Al'] + fg*p_map['Ga'] + fi*p_map['In'] + fo*p_map['O']
    df['vo_tot_s_mean'] = s_tot; df['vo_tot_p_mean'] = p_tot
    df['vo_tot_p_frac'] = p_tot / (s_tot + p_tot + 1e-9); df['vo_tot_p_minus_s'] = p_tot - s_tot
    # Oxidation consistency (Al3+, Ga3+, In3+, O2-)
    cation_charge = 3.0*(df['n_al'] + df['n_ga'] + df['n_in'])
    oxygen_charge = -2.0*df['n_o']
    charge_imb = cation_charge + oxygen_charge
    df['charge_imbalance'] = charge_imb
    denom = (5.0*df['N']).replace(0, np.nan)
    df['abs_imbalance_per_5N'] = np.abs(charge_imb) / denom
    return df
train_fe = add_mm_lite(train_fe); test_fe = add_mm_lite(test_fe)

# ------------------ Spacegroup expansions ------------------
def lattice_system_from_sgnum(sgnum):
    n = int(sgnum)
    if n<=2: return 1
    if n<=15: return 2
    if n<=74: return 3
    if n<=142: return 4
    if n<=167: return 5
    if n<=194: return 6
    return 7
for df in (train_fe, test_fe):
    df['sg_number'] = pd.to_numeric(df['spacegroup'], errors='coerce').fillna(-1).astype(int)
    df['lattice_system'] = df['sg_number'].apply(lattice_system_from_sgnum).astype(int)

# ------------------ Build 8-fold stratified group-disjoint folds ------------------
y = train_fe['bandgap_energy_ev'].astype(float)
gmean = y.groupby(gkey).mean()
gbin = pd.qcut(gmean, q=10, labels=False, duplicates='drop')
uniq = pd.DataFrame({'g': gmean.index, 'bin': gbin.values}).sample(frac=1.0, random_state=42).reset_index(drop=True)
skf = StratifiedKFold(n_splits=8, shuffle=True, random_state=42)
group_to_fold = {}
for k, (_, val_idx) in enumerate(skf.split(uniq['g'], uniq['bin'])):
    for g in uniq['g'].iloc[val_idx]: group_to_fold[g] = k
fold_ids = gkey.map(group_to_fold).astype(int).values
print('Fold sizes:', pd.Series(fold_ids).value_counts().sort_index().to_dict(), flush=True)

# ------------------ Target encodings (m-estimate smoothing) ------------------
y_log = np.log1p(y.clip(lower=0))
global_mean = float(y_log.mean())
m_smooth = 12.0
train_fe['te_sg'] = 0.0
train_fe['fe_sg'] = 0.0  # frequency encoding
for k in range(8):
    trn_idx = np.where(fold_ids!=k)[0]; val_idx = np.where(fold_ids==k)[0]
    df_tr = train_fe.iloc[trn_idx].copy()
    s_tr = df_tr['spacegroup'].astype(str)
    grp = s_tr.groupby(s_tr)
    counts = grp.size()
    sums = df_tr.groupby(s_tr)['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).sum())
    te = (sums + m_smooth*global_mean) / (counts + m_smooth)
    fe = counts / counts.sum()
    sg_val = train_fe.iloc[val_idx]['spacegroup'].astype(str)
    train_fe.loc[train_fe.index[val_idx], 'te_sg'] = sg_val.map(te).fillna(global_mean).values
    train_fe.loc[train_fe.index[val_idx], 'fe_sg'] = sg_val.map(fe).fillna(0.0).values
# full-map for test
s_all = train_fe['spacegroup'].astype(str)
counts_all = s_all.groupby(s_all).size()
sums_all = train_fe.groupby(s_all)['bandgap_energy_ev'].apply(lambda s: np.log1p(s.clip(lower=0)).sum())
te_all = (sums_all + m_smooth*global_mean) / (counts_all + m_smooth)
fe_all = counts_all / counts_all.sum()
test_fe['te_sg'] = test_fe['spacegroup'].astype(str).map(te_all).fillna(global_mean)
test_fe['fe_sg'] = test_fe['spacegroup'].astype(str).map(fe_all).fillna(0.0)

# lattice_system frequency encoding
for k in range(8):
    trn_idx = np.where(fold_ids!=k)[0]; val_idx = np.where(fold_ids==k)[0]
    ls_counts = train_fe.iloc[trn_idx]['lattice_system'].value_counts(normalize=True)
    ls_val = train_fe.iloc[val_idx]['lattice_system']
    train_fe.loc[train_fe.index[val_idx], 'fe_ls'] = ls_val.map(ls_counts).fillna(0.0).values
ls_counts_all = train_fe['lattice_system'].value_counts(normalize=True)
test_fe['fe_ls'] = test_fe['lattice_system'].map(ls_counts_all).fillna(0.0)

# ------------------ Build final feature matrices (composition-only; no XYZ/matminer) ------------------
drop_cols = ['id','bandgap_energy_ev']
common_cols = [c for c in train_fe.columns if c in test_fe.columns]
features = [c for c in common_cols if c not in drop_cols]
# Ensure numeric matrix for LGBM
train_X = train_fe[features].copy()
test_X = test_fe[features].copy()
med = train_X.median(numeric_only=True)
train_X = train_X.fillna(med)
test_X = test_X.fillna(med)
num_cols = list(train_X.select_dtypes(include=[np.number]).columns)
train_X = train_X[num_cols]
test_X = test_X[num_cols]
print('Feature matrix shapes (LGB numeric):', train_X.shape, test_X.shape, flush=True)

# ------------------ LightGBM only: 1 seed x 8 folds (quick run), average ------------------
import lightgbm as lgb
seeds = [42]
n_splits = 8
oof_lgb_seeds = []; pred_lgb_seeds = []

# Disable monotone constraints for robustness in quick run
# mono_map = {'vegard_bg': +1, 'w_in': -1, 'catw_chi_pauling_mean': +1}
# mono_list = [mono_map.get(c, 0) for c in train_X.columns]

for SEED in seeds:
    print(f'-- LGBM seed {SEED} --', flush=True); t0 = time.time()
    params_lgb = {
        'objective':'regression','metric':'rmse','learning_rate':0.03,
        'num_leaves':96,'max_depth':-1,'min_data_in_leaf':450,
        'feature_fraction':0.78,'bagging_fraction':0.8,'bagging_freq':1,
        'lambda_l2':10.0,'lambda_l1':0.0,'verbosity':-1,'seed':SEED
    }
    oof_lgb = np.zeros(len(train_X)); pred_lgb = np.zeros(len(test_X))
    for k in range(n_splits):
        trn = np.where(fold_ids!=k)[0]; val = np.where(fold_ids==k)[0]
        print(f'   Fold {k} trn={len(trn)} val={len(val)}', flush=True)
        dtr = lgb.Dataset(train_X.iloc[trn], label=y_log.iloc[trn], free_raw_data=False)
        dva = lgb.Dataset(train_X.iloc[val], label=y_log.iloc[val], free_raw_data=False)
        m = lgb.train(params_lgb, dtr, num_boost_round=5000, valid_sets=[dtr,dva], valid_names=['train','valid'], callbacks=[lgb.early_stopping(400), lgb.log_evaluation(300)])
        oof_lgb[val] = m.predict(train_X.iloc[val], num_iteration=m.best_iteration)
        pred_lgb += m.predict(test_X, num_iteration=m.best_iteration)/n_splits
        del m, dtr, dva; gc.collect()
    rmse = float(mean_squared_error(y_log, oof_lgb) ** 0.5); print(f'LGBM seed {SEED} OOF RMSLE: {rmse:.6f} | {time.time()-t0:.1f}s', flush=True)
    oof_lgb_seeds.append(oof_lgb); pred_lgb_seeds.append(pred_lgb)

# Average across seeds
oof_avg = np.mean(np.vstack(oof_lgb_seeds), axis=0)
pred_avg = np.mean(np.vstack(pred_lgb_seeds), axis=0)
cv_lgb = float(mean_squared_error(y_log, oof_avg) ** 0.5)
print(f'Averaged LGBM CV RMSLE: {cv_lgb:.6f}', flush=True)

# ------------------ Save submission ------------------
pred_bandgap = np.expm1(pred_avg).clip(0, 6.5)
sub = pd.DataFrame({'id': test['id'], 'bandgap_energy_ev': pred_bandgap})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv', sub.shape, '| total elapsed:', f'{time.time()-t0_all:.1f}s', flush=True)
sub.head()

In [14]:
# Install CatBoost explicitly with logs, then sanity-import
import sys, subprocess, time, os, importlib
t0 = time.time()
print('[SETUP] Installing CatBoost (prefer binary wheels)')
os.environ['PIP_DISABLE_PIP_VERSION_CHECK'] = '1'
os.environ['PIP_NO_INPUT'] = '1'
cmd = [sys.executable, '-m', 'pip', 'install', '--prefer-binary', '--upgrade', 'catboost']
print('[SETUP] Running:', ' '.join(cmd))
subprocess.check_call(cmd)
print('[SETUP] Install finished in', f'{time.time()-t0:.1f}s')
print('[SETUP] Importing catboost to warm cache...')
cb = importlib.import_module('catboost')
print('[SETUP] CatBoost version:', getattr(cb, '__version__', 'unknown'))
print('[SETUP] Ready.')