# Tabular Playground Series - Dec 2021: Plan & Experiment Log

## Problem framing
- Task: Multiclass classification (7 classes) predicting Cover_Type
- Metric: Accuracy
- Data: Synthetic, inspired by the Forest Cover Type dataset (all features numeric or binary)

## Success targets
- Ship a working baseline ASAP and submit
- Above median: >= 0.953
- Medal target: >= 0.9566 LB via robust CV and ensembling/tuning

## Validation protocol
- Stratified KFold (n_splits=5, shuffle=True, fixed seed)
- Single fold set reused for all experiments; save OOF predictions
- Track CV mean/std, compare to LB with early submissions
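
As a rough illustration of the "track CV mean/std" item, the per-fold accuracies can be summarized with the standard library alone. The helper name `summarize_cv` is hypothetical; the five values are the fold accuracies printed by the baseline run recorded later in this log.

```python
from statistics import mean, stdev


def summarize_cv(fold_accs):
    """Return (mean, sample standard deviation) of per-fold accuracies."""
    return mean(fold_accs), stdev(fold_accs)


# Fold accuracies copied from the baseline run's output in this log.
fold_accs = [0.961167, 0.961110, 0.961290, 0.961810, 0.961633]
cv_mean, cv_std = summarize_cv(fold_accs)
print(f"CV accuracy: {cv_mean:.6f} +/- {cv_std:.6f}")
```

Comparing the gap between two experiments against this fold-to-fold spread is one simple way to judge whether a CV improvement is likely to be more than noise.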

## Modeling plan (iterative)
1) Baseline: XGBoost (GPU) with modest depth/regularization
2) Tune key params (max_depth, eta, min_child_weight, subsample, colsample_bytree) using CV
3) Try CatBoost (GPU) as alternative; blend XGB+Cat
4) Feature engineering:
   - Basic interactions: distance combinations, hillshade statistics, trigonometric transforms of slope/aspect
   - Row-wise aggregates over Soil_Type and Wilderness_Area binaries
5) Error analysis on OOF by class; adjust class-wise calibration if needed
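
The XGB+Cat blend in step 3 could be prototyped on OOF probabilities before spending a submission. The sketch below assumes two OOF probability arrays of shape `(n_samples, n_classes)` produced on the same folds; `best_blend_weight` and the synthetic arrays are illustrative stand-ins, not results from this competition.

```python
import numpy as np


def best_blend_weight(proba_a, proba_b, y_true, grid=None):
    """Grid-search w in [0, 1] maximizing accuracy of w*proba_a + (1-w)*proba_b."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)
    best_w, best_acc = 0.0, -1.0
    for w in grid:
        blended = w * proba_a + (1.0 - w) * proba_b
        acc = float(np.mean(np.argmax(blended, axis=1) == y_true))
        if acc > best_acc:
            best_w, best_acc = float(w), acc
    return best_w, best_acc


# Synthetic stand-ins for XGBoost and CatBoost OOF probabilities (hypothetical data).
rng = np.random.default_rng(42)
n, k = 1000, 7
y_true = rng.integers(0, k, size=n)
onehot = np.eye(k)[y_true]
oof_xgb = onehot * 0.6 + rng.random((n, k))
oof_cat = onehot * 0.5 + rng.random((n, k))
oof_xgb /= oof_xgb.sum(axis=1, keepdims=True)
oof_cat /= oof_cat.sum(axis=1, keepdims=True)

w, acc = best_blend_weight(oof_xgb, oof_cat, y_true)
print(f"best weight on XGB: {w:.2f}, blended OOF acc: {acc:.4f}")
```

Because the weight is chosen on the same OOF predictions it is scored on, the blended accuracy is slightly optimistic; a coarse grid keeps that effect small.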

## Guardrails
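- Use GPU: tree_method=gpu_hist, predictor=gpu_predictor (XGBoost < 2.0); on XGBoost >= 2.0 use tree_method=hist, device=cuda, since gpu_hist is deprecated and predictor is ignored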
- Early stopping with validation folds
- Start with smoke runs (subset rows, fewer rounds) to validate code
- Deterministic seeds, single CV splitter
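
Since the GPU settings depend on the installed XGBoost version, one option is to pick them from the version string. This is a minimal sketch; `gpu_params` is a hypothetical helper, and it only inspects the major version.

```python
def gpu_params(xgb_version: str) -> dict:
    """Return GPU-related XGBoost params suited to the given version string."""
    major = int(xgb_version.split(".")[0])
    if major >= 2:
        # XGBoost >= 2.0: the GPU is selected through `device`; `hist` is the tree method.
        return {"tree_method": "hist", "device": "cuda"}
    # Older releases: the GPU is selected through the tree method and predictor.
    return {"tree_method": "gpu_hist", "predictor": "gpu_predictor"}


print(gpu_params("2.1.4"))
print(gpu_params("1.7.6"))
```

In a notebook the result could be merged into the training params, for example `params.update(gpu_params(xgb.__version__))`.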

## Experiment Log
| Exp | Date/Time | Features | Model | Params | Folds | OOF Acc | LB Acc | Notes |
|-----|-----------|----------|-------|--------|-------|---------|--------|-------|

## Next actions
1) Environment check (GPU available) and install xgboost/catboost if needed
2) Load data, target distribution, basic checks
3) Implement 5-fold stratified CV XGBoost baseline, generate OOF/test, submit
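
For the "target distribution, basic checks" step, a small NumPy sketch is shown below. It reports per-class counts and flags classes with fewer rows than folds, which stratified K-fold cannot place in every validation fold. `check_class_counts` and the toy labels are hypothetical.

```python
import numpy as np


def check_class_counts(y, n_splits=5):
    """Return ({class: count}, [classes with fewer than n_splits samples])."""
    classes, counts = np.unique(y, return_counts=True)
    dist = {int(c): int(n) for c, n in zip(classes, counts)}
    too_rare = [c for c, n in dist.items() if n < n_splits]
    return dist, too_rare


# Toy labels (hypothetical): class 5 has a single sample.
y_toy = np.array([1] * 50 + [2] * 40 + [3] * 10 + [5] * 1)
dist, too_rare = check_class_counts(y_toy, n_splits=5)
for c, n in dist.items():
    print(f"class {c}: {n} rows ({n / len(y_toy):.2%})")
print("classes with fewer rows than folds:", too_rare)
```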

In [1]:
# Environment check: GPU + packages
import sys, subprocess, importlib, os
print(sys.version)
print('CUDA libraries present:', os.path.exists('/usr/lib/x86_64-linux-gnu/libcuda.so.1'))

def ensure(pkg, import_name=None, extras=None):
    import_name = import_name or pkg
    try:
        importlib.import_module(import_name)
        print(f'{pkg} already installed')
    except ImportError:
        to_install = pkg if extras is None else f"{pkg}{extras}"
        print(f'Installing {to_install} ...')
        subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', to_install], check=True)
        importlib.import_module(import_name)
        print(f'{pkg} installed')

# Prefer GPU-ready libs: XGBoost and CatBoost
ensure('xgboost')
ensure('catboost')
ensure('pandas')
ensure('numpy')
ensure('scikit-learn', 'sklearn')

# GPU sanity check via PyTorch if it is already installed; otherwise skip (no heavy install)
try:
    import torch
    print('Torch GPU Available:', torch.cuda.is_available())
    if torch.cuda.is_available():
        print('GPU Name:', torch.cuda.get_device_name(0))
except Exception as e:
    print('Torch not available for GPU check; skipping. Error:', str(e))

import xgboost as xgb
print('XGBoost version:', xgb.__version__)
from catboost import CatBoostClassifier, Pool
print('CatBoost ready')

import pandas as pd, numpy as np
pd.set_option('display.max_columns', 200)
print('Env ready.')

3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
CUDA libraries present: True


xgboost already installed
catboost already installed
pandas already installed
numpy already installed
scikit-learn already installed
Torch not available for GPU check; skipping. Error: No module named 'torch'
XGBoost version: 2.1.4
CatBoost ready
Env ready.


In [9]:
# Baseline: Load data, 5-fold Stratified CV XGBoost (raw features), OOF acc, test preds, submission
import pandas as pd, numpy as np, os, time
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import xgboost as xgb

t0 = time.time()
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print('Train shape:', train.shape, 'Test shape:', test.shape)

id_col = 'Id' if 'Id' in train.columns else None
target_col = 'Cover_Type'

# Prepare features: every column except the id (if present) and the target
drop_cols = {id_col, target_col}
feature_cols = [c for c in train.columns if c not in drop_cols]
X = train[feature_cols].astype(np.float32).values
y = train[target_col].astype(int).values - 1  # convert to 0..6
X_test = test[feature_cols].astype(np.float32).values

n_classes = len(np.unique(y))
print('Features:', len(feature_cols), 'Classes:', n_classes)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_proba = np.zeros((len(train), n_classes), dtype=np.float32)
test_proba = np.zeros((len(test), n_classes), dtype=np.float32)

# XGBoost native params (use xgb.train for early stopping in xgb 2.x)
params_native = {
    'objective': 'multi:softprob',
    'num_class': n_classes,
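    # XGBoost >= 2.0 deprecates 'gpu_hist' and ignores 'predictor' (see the warnings
    # in the output below); the suggested replacement is tree_method='hist', device='cuda'.
    'tree_method': 'gpu_hist',
    'predictor': 'gpu_predictor',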
    'max_depth': 10,
    'eta': 0.03,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_lambda': 5.0,
    'min_child_weight': 1.0,
    'eval_metric': 'mlogloss',
    'max_bin': 256,
    'seed': 42
}

num_boost_round = 2000
early_stopping_rounds = 200

dtest = xgb.DMatrix(X_test)

fold_accs = []
for fold, (tr_idx, va_idx) in enumerate(skf.split(X, y), 1):
    X_tr, y_tr = X[tr_idx], y[tr_idx]
    X_va, y_va = X[va_idx], y[va_idx]
    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    dvalid = xgb.DMatrix(X_va, label=y_va)
    bst = xgb.train(
        params_native,
        dtrain,
        num_boost_round=num_boost_round,
        evals=[(dtrain, 'train'), (dvalid, 'valid')],
        early_stopping_rounds=early_stopping_rounds,
        verbose_eval=False
    )
    proba_va = bst.predict(dvalid)
    oof_proba[va_idx] = proba_va.astype(np.float32)
    preds_va = np.argmax(proba_va, axis=1)
    acc = accuracy_score(y_va, preds_va)
    fold_accs.append(acc)
    best_it = getattr(bst, 'best_iteration', None)
    print(f'Fold {fold} acc: {acc:.6f}, best_iter: {best_it}')
    # predict test using best_iteration
    if best_it is not None:
        test_proba += bst.predict(dtest, iteration_range=(0, best_it + 1)) / skf.n_splits
    else:
        test_proba += bst.predict(dtest) / skf.n_splits

oof_pred = np.argmax(oof_proba, axis=1)
oof_acc = accuracy_score(y, oof_pred)
print('Fold accs:', [round(a,6) for a in fold_accs])
print(f'OOF accuracy: {oof_acc:.6f}')

# Save submission
pred_labels = np.argmax(test_proba, axis=1) + 1  # back to 1..7
# Fall back to a 0..n-1 index if the data has no 'Id' column
sub = pd.DataFrame({
    'Id': test[id_col].values if id_col else np.arange(len(test)),
    'Cover_Type': pred_labels.astype(int)
})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv with shape:', sub.shape)
print('Done in %.1fs' % (time.time() - t0))

Train shape: (3600000, 56) Test shape: (400000, 55)
Features: 54 Classes: 7

    E.g. tree_method = "hist", device = "cuda"

Parameters: { "predictor" } are not used.

Fold 1 acc: 0.961167, best_iter: 1936
Fold 2 acc: 0.961110, best_iter: 1892
Fold 3 acc: 0.961290, best_iter: 1947
Fold 4 acc: 0.961810, best_iter: 1902
Fold 5 acc: 0.961633, best_iter: 1906
Fold accs: [0.961167, 0.96111, 0.96129, 0.96181, 0.961633]
OOF accuracy: 0.961402


Saved submission.csv with shape: (400000, 2)
Done in 2809.7s
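
Step 5 of the modeling plan calls for error analysis on OOF predictions by class. A minimal sketch of that analysis follows, using toy labels rather than the real OOF arrays; `per_class_report` is a hypothetical helper. With the baseline cell's variables, the inputs would be `y` and `np.argmax(oof_proba, axis=1)`.

```python
import numpy as np


def per_class_report(y_true, y_pred, n_classes):
    """Return a confusion matrix (rows = true class) and per-class recall."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    # Accumulate one count per (true, predicted) pair, including repeated pairs.
    np.add.at(cm, (y_true, y_pred), 1)
    support = cm.sum(axis=1)
    # Recall is undefined (NaN) for classes that never occur in y_true.
    recall = np.divide(
        np.diag(cm), support,
        out=np.full(n_classes, np.nan), where=support > 0,
    )
    return cm, recall


# Toy labels and predictions (hypothetical, 3 classes).
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2])
cm, recall = per_class_report(y_true, y_pred, n_classes=3)
print(cm)
for c, r in enumerate(recall):
    print(f"class {c}: recall {r:.3f} on {cm[c].sum()} rows")
```

Classes with low recall and small support are the natural candidates for the class-wise adjustments mentioned in the plan.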
