## Training Code (Train Notebook)

### Overview
- **Objective**: Train a model for each target using LightGBM (GPU) and save the trained model.
- **Outcome**: Output a set of `.pkl` files and a meta information file (`meta.json`) to `/kaggle/working/models`.
- Publish this by selecting "Create Dataset" in Kaggle Notebook → it can be used in the submission notebook.
- Combine train.csv and train_labels.csv without preprocessing, and train each target_* using independent regression.

**This is a baseline implementation that can be submitted simply without preprocessing, using raw features as is.** The goal is to "first submit and understand how the score works."

### Steps
1. **Loading Data**
- Load `train.csv`, `test.csv`, and `train_labels.csv`.
- Confirm that `date_id` is the key.

2. **Preprocessing**
- Quantify object data (return NaN for failures)
- Category data → `.cat.codes`
- Intersect numeric columns and define `feat_cols` as `feat_cols`.

3. **Define Features and Objective Variables**
- Build training data using `train.merge(labels, on="date_id")`.
- `label_cols = target_0 .. target_423`

4. **LightGBM Settings**
- GPU Training (`device_type="gpu"`)
- Early Stopping

5. **Training Loop**
- Train for each target.
- Skip samples with insufficient samples (<20 lines) and output a `.skip` file.
- Save the training model as `joblib.dump`.

6. **Save Meta Information**
- Save the feature list, target list, and LGBM parameters used in `meta.json`.
- Create a new model by selecting ShowVersions → Notebook → Output → New Model.

7. **Verify**
- Output and verify the list of files saved in `/kaggle/working/models`.
- Submission Notebook -> https://www.kaggle.com/code/shunyafukuda/baseline-lgbm-submit/

In [None]:
# ---------------- Librarys ----------------
import os, sys, time, json, warnings, joblib
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

# Install & import
!pip -q install --upgrade lightgbm
import lightgbm as lgb

In [2]:
# =========================================================
# TRAIN NOTEBOOK: LightGBM(GPU) per-target -> save models + meta
# 1) Train and save to /kaggle/working/models
# 2) After execution, publish this folder with "Create Dataset"
# =========================================================


# ---------------- Paths ----------------
DATA_PATH  = "/kaggle/input/mitsui-commodity-prediction-challenge"
TRAIN_PATH = f"{DATA_PATH}/train.csv"
TEST_PATH  = f"{DATA_PATH}/test.csv"
LABELS_PATH= f"{DATA_PATH}/train_labels.csv"

MODEL_OUTPUT_DIR = "/kaggle/working/models"
os.makedirs(MODEL_OUTPUT_DIR, exist_ok=True)

# ---------------- Preprocess helper ----------------
def preprocess_for_lgbm(df: pd.DataFrame) -> pd.DataFrame:
    """object→numeric（失敗はNaN）、category→codes"""
    df = df.copy()
    obj_cols = df.select_dtypes(include=["object"]).columns
    if len(obj_cols) > 0:
        df[obj_cols] = df[obj_cols].apply(pd.to_numeric, errors="coerce")
    for c in df.select_dtypes(include=["category"]).columns:
        df[c] = df[c].cat.codes
    return df

# ---------------- Load ----------------
train_raw  = pd.read_csv(TRAIN_PATH)
test_raw   = pd.read_csv(TEST_PATH)
labels     = pd.read_csv(LABELS_PATH)

for name, df_ in [("train", train_raw), ("test", test_raw), ("labels", labels)]:
    assert "date_id" in df_.columns, f"{name}.csv に date_id がありません"

train = preprocess_for_lgbm(train_raw)
test  = preprocess_for_lgbm(test_raw)

# ---------------- Feature/Label schema ----------------
df = train.merge(labels, on="date_id", how="inner")

label_cols = [c for c in labels.columns if c != "date_id"]  # target_0..423
num_train_cols = df.select_dtypes(include=[np.number]).columns.tolist()
num_test_cols  = test.select_dtypes(include=[np.number]).columns.tolist()
feat_cols = sorted(set(num_train_cols).intersection(num_test_cols))
feat_cols = [c for c in feat_cols if c not in label_cols]  # exclude targets

X = df[feat_cols]
print(f"#features={len(feat_cols)} | #train_rows={len(X)} | #test_rows={len(test)}")

# ---------------- LGBM params (GPU) ----------------
lgb_params = dict(
    n_estimators=1000,
    learning_rate=0.03,
    num_leaves=31,
    feature_fraction=0.8,
    bagging_fraction=0.8,
    bagging_freq=1,
    max_bin=255,
    random_state=42,
    verbosity=-1,
    device_type="gpu",
    gpu_platform_id=0,
    gpu_device_id=0,
)
callbacks = [lgb.log_evaluation(period=0), lgb.early_stopping(stopping_rounds=50)]

# ---------------- Train & Save ----------------
t0 = time.time()
for i, tgt in enumerate(label_cols, 1):
    y_full = df[tgt].values
    mask   = np.isfinite(y_full)
    Xc, yc = X.loc[mask], y_full[mask]

    print(f"[{i:03d}/{len(label_cols)}] {tgt}: train_rows={len(Xc)}, features={Xc.shape[1]}")

    if len(yc) < 20:
        # Skip learning → Create a flag file for 0 output because it cannot be saved with empty dummy (None)
        with open(os.path.join(MODEL_OUTPUT_DIR, f"{tgt}.skip"), "w") as f:
            f.write("too_few_samples")
        continue

    n   = len(Xc)
    cut = max(int(n*0.9), n-1)   # Simple verification of the last 10%
    X_tr, y_tr = Xc.iloc[:cut], yc[:cut]
    X_va, y_va = Xc.iloc[cut:], yc[cut:]

    model = lgb.LGBMRegressor(**lgb_params)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)] if len(X_va) > 0 else None,
        eval_metric="rmse",
        callbacks=callbacks
    )

    joblib.dump(model, os.path.join(MODEL_OUTPUT_DIR, f"{tgt}.pkl"))

print(f"Training finished in {time.time()-t0:.1f}s (GPU)")

# ---------------- Save meta (schema & params) ----------------
meta = {
    "feat_cols": feat_cols,
    "label_cols": label_cols,
    "lgb_params": {k: (str(v) if k in ("device_type",) else v) for k,v in lgb_params.items()},
    "note": "Use these columns/order in the submission notebook."
}
with open(os.path.join(MODEL_OUTPUT_DIR, "meta.json"), "w") as f:
    json.dump(meta, f)

# For confirmation
print("Saved models to:", MODEL_OUTPUT_DIR)
print("Example files:", sorted(os.listdir(MODEL_OUTPUT_DIR))[:8])

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.6/3.6 MB[0m [31m36.7 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25h#features=558 | #train_rows=1917 | #test_rows=90
[001/424] target_0: train_rows=1787, features=558
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[162]	valid_0's rmse: 4.71253e-06	valid_0's l2: 2.22079e-11
[002/424] target_1: train_rows=1744, features=558
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[129]	valid_0's rmse: 0.00918688	valid_0's l2: 8.43988e-05
[003/424] target_2: train_rows=1831, features=558
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[8]	valid_0's rmse: 0.00973496	valid_0's l2: 9.47695e-05
[004/424] target_3: train_rows=1831, features=558
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[34]	valid_0's rmse: 1.09984e-06	valid_0's l2: 1.2