# Price → Log-Return (Two-Stage Inference) Baseline Design Notes

## Objective

Rather than regressing the log-return directly, inference is split into two stages:

- Stage A: Predict the price.
- Stage B: Synthesize the log-return from the predicted price and historical prices.

The goal of this two-stage design is to improve accuracy and stability.

---

## Background and Motivation

- Returns are scale-invariant and easy to handle, but a model trained on returns alone has difficulty exploiting price dynamics (autocorrelation and seasonality).
- In practice, it is common to estimate prices and then take the difference.
- Information from the price model (predicted values and error structure) can be used to stabilize return synthesis.

---

## Overall Flow

### Stage-A (Training) — Price Model

- Extract the price columns (A and B) that appear in `target_pairs.csv`.
- Train one LightGBM (GPU) regressor per price column.
- Target variable: `y = log(price)` (restored with `exp` at inference time).
- Leak prevention: exclude the target price column from the features.
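
The two Stage-A conventions above (log-transformed target, target column excluded from the features) can be sketched with the standard library alone. The helper names `to_log_target`, `from_log_target`, and `feature_columns`, as well as the sample column names, are hypothetical and only illustrate the idea; the lower clip of `1e-12` mirrors the training code.

```python
import math

EPS = 1e-12  # lower clip applied before the log, as in the training code


def to_log_target(prices):
    """Map raw prices to the regression target y = log(price)."""
    return [math.log(max(p, EPS)) for p in prices]


def from_log_target(preds):
    """Invert the transform at inference time: price = exp(y_hat)."""
    return [math.exp(v) for v in preds]


def feature_columns(all_columns, target_col):
    """Drop the target price column (and date_id) from the feature list."""
    return [c for c in all_columns if c not in (target_col, "date_id")]


# Toy values; the column names are made up for illustration.
prices = [100.0, 101.5, 99.8]
y = to_log_target(prices)
restored = from_log_target(y)
cols = feature_columns(["date_id", "price_A", "price_B", "fx_rate"], "price_A")
print(restored)
print(cols)
```

Because `exp(log(p)) = p` for positive prices, the round trip only loses floating-point precision; the clip matters only for zero or negative inputs.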

### Stage-B (Submission) — Price → Return Synthesis

- Read the price models (`.txt`) and `meta`.
- Maintain a rolling price store (history keyed by `date_id`).
- Current price `P_t(*)`: use the actual measurement if available; otherwise, use the model prediction.
- Lagged price `P_{t-L}(*)`: obtain from the store.
- Calculate the log-return based on `pair`:
  - Single: `log(P_t(A)) - log(P_{t-L}(A))`
  - `"A - B"`: `log(P_t(A)) - log(P_{t-L}(B))`
- Failsafe: return 0.0 when the history is insufficient or a price is non-positive.
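
The Stage-B rules above can be sketched as follows. This is a minimal illustration, not the submission notebook's actual code: the class `RollingPriceStore` and the functions `current_price` and `synth_log_return` are hypothetical names. It assumes that every column is pushed once per consecutive `date_id` and that the current step is pushed before any lookup.

```python
import math
from collections import deque


class RollingPriceStore:
    """Keeps the most recent prices per column, one entry per date_id."""

    def __init__(self, max_lag):
        # max_lag + 1 entries are enough to look back max_lag steps.
        self.max_len = max_lag + 1
        self.history = {}

    def push(self, prices):
        """Append one time step; `prices` maps column name -> price."""
        for col, value in prices.items():
            self.history.setdefault(col, deque(maxlen=self.max_len)).append(value)

    def lagged(self, col, lag):
        """Return P_{t-lag}(col), or None when the history is too short."""
        hist = self.history.get(col)
        if hist is None or len(hist) <= lag:
            return None
        return hist[-1 - lag]


def current_price(observed, predicted):
    """Prefer the observed price; fall back to the model prediction."""
    if observed is not None and math.isfinite(observed):
        return observed
    return predicted


def synth_log_return(store, pair, lag, p_now):
    """Log-return for `pair` ('A' or 'A - B'); 0.0 is the failsafe value."""
    a, _, b = pair.partition(" - ")
    lag_col = b.strip() if b else a.strip()  # single asset: lag its own price
    p_lag = store.lagged(lag_col, lag)
    if p_now is None or p_lag is None:
        return 0.0  # insufficient history
    if not (math.isfinite(p_now) and math.isfinite(p_lag)):
        return 0.0
    if p_now <= 0 or p_lag <= 0:
        return 0.0  # the log is undefined for non-positive prices
    return math.log(p_now) - math.log(p_lag)


# Toy history: three consecutive steps for two made-up columns.
store = RollingPriceStore(max_lag=4)
for p_a, p_b in [(100.0, 50.0), (102.0, 51.0), (101.0, 49.0)]:
    store.push({"A": p_a, "B": p_b})

p_t = current_price(observed=None, predicted=101.0)
r_single = synth_log_return(store, "A", lag=1, p_now=p_t)
r_pair = synth_log_return(store, "A - B", lag=1, p_now=p_t)
r_short = synth_log_return(store, "A", lag=4, p_now=p_t)  # falls back to 0.0
print(r_single, r_pair, r_short)
```

A bounded `deque` keeps memory constant regardless of how many steps are served, which is why the store only needs `max_lag` from `meta`.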

---

## Validation Design (Offline)

- Sort `train.csv` in ascending order by `date_id` and apply `TimeSeriesSplit(n_splits=5)`.
- Evaluate the price model using **price RMSE**.
- Additional validation (optional):
  - Use the last 10% of the training data as the validation interval.
  - Output `P̂_t(A)` using the price model.
  - Reference `P_{t-L}(B)` from the real data (no leaks allowed).
  - Synthesize `ŷ_return = log(P̂_t(A)) - log(P_{t-L}(B))`.
  - Compare RMSE/MAE with the existing direct return-regression baseline.
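
The optional validation above could be wired up roughly as below. This is a sketch on toy numbers only: the helper names (`holdout_start`, `synth_returns`, `rmse`, `mae`) are hypothetical, and the "predicted" prices and the direct baseline are stand-ins, not outputs of any real model.

```python
import math


def holdout_start(n_rows, frac=0.10):
    """Index where the validation interval (last `frac` of rows) begins."""
    return n_rows - max(1, int(n_rows * frac))


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def synth_returns(pred_prices_a, real_prices_b, lag, start):
    """y_hat[t] = log(P_hat_t(A)) - log(P_{t-lag}(B)) for t >= start.

    `pred_prices_a` is indexed relative to `start`; `real_prices_b` is the
    full series, so the lagged reference always comes from real data.
    """
    out = []
    for i, p_hat in enumerate(pred_prices_a):
        t = start + i
        out.append(math.log(p_hat) - math.log(real_prices_b[t - lag]))
    return out


# Toy single-asset case (B == A), already sorted by date_id.
real_a = [100.0 + 0.5 * t for t in range(20)]
lag = 1
start = holdout_start(len(real_a))
y_true = [math.log(real_a[t]) - math.log(real_a[t - lag])
          for t in range(start, len(real_a))]

# Stand-ins: a price model that is off by 0.1%, and a baseline predicting 0.
pred_a = [real_a[t] * 1.001 for t in range(start, len(real_a))]
y_two_stage = synth_returns(pred_a, real_a, lag, start)
y_direct = [0.0] * len(y_true)

for name, y_hat in [("two-stage", y_two_stage), ("direct", y_direct)]:
    print(f"{name}: RMSE={rmse(y_true, y_hat):.6f} MAE={mae(y_true, y_hat):.6f}")
```

Note that a relative price error of `e` turns into an additive error of roughly `log(1 + e)` in the synthesized return, so small price errors can still be large relative to typical return magnitudes; comparing against the direct baseline on the same interval makes this visible.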

---

## Steps (Execution Order)

- Run the training notebook (Stage-A).
  - Models are written to `/kaggle/working/models_price/`.
  - Publish them with "Create Dataset" (e.g., `mitsui-price-models`).
- In the submission notebook (Stage-B), attach the published dataset via "Add Data".
  - Set `MODEL_INPUT_DIR` to the `models_price` directory of the published dataset.
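
Since the mounted path depends on the dataset name chosen at publish time, one defensive way to set `MODEL_INPUT_DIR` is to probe a few candidate directories for `meta_price.json`. The function `resolve_model_dir` and the candidate list are hypothetical; the sketch only assumes that Kaggle mounts attached datasets under `/kaggle/input/<dataset-slug>/`.

```python
import os


def resolve_model_dir(candidates, marker="meta_price.json"):
    """Return the first candidate directory that contains the meta file."""
    for path in candidates:
        if os.path.isfile(os.path.join(path, marker)):
            return path
    raise FileNotFoundError(f"no directory with {marker} among {candidates}")


# Hypothetical locations; adjust the slug to the published dataset's name.
CANDIDATES = [
    "/kaggle/input/mitsui-price-models/models_price",
    "/kaggle/input/mitsui-price-models",
    "/kaggle/working/models_price",
]
```

In the submission notebook this would be used as `MODEL_INPUT_DIR = resolve_model_dir(CANDIDATES)`, which fails early with a clear error if the dataset was not attached.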

---

## Reproducibility

- `random_state=42` is fixed.

- Only the best fold from the price CV is saved to a `.txt` file; the submission notebook loads it as a `Booster` to reproduce the trained model.

- Preprocessing is fixed (object → numeric, category → codes), and `feat_cols_base` is recorded in `meta`.
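
On the loading side, the reproducibility points above amount to reading `meta_price.json`, rebuilding the same feature list used in training, and loading each saved model. The sketch below is an assumption about how the submission notebook might do this, not its actual code; `load_meta`, `feature_columns_for`, and `load_booster` are hypothetical helpers. File names follow the training cell (`meta_price.json`, `price__<col>.txt`).

```python
import json
import os


def load_meta(model_dir):
    """Read meta_price.json written by the training notebook."""
    with open(os.path.join(model_dir, "meta_price.json"), encoding="utf-8") as f:
        return json.load(f)


def feature_columns_for(meta, price_col):
    """Rebuild the per-model feature list: base features minus the target column.

    The order must match training, where the list was derived from the
    sorted feat_cols_base in the same way.
    """
    return [c for c in meta["feat_cols_base"] if c != price_col]


def load_booster(model_dir, price_col):
    """Load one saved price model.

    lightgbm is imported lazily so the helpers above work without it.
    """
    import lightgbm as lgb
    return lgb.Booster(model_file=os.path.join(model_dir, f"price__{price_col}.txt"))


# In-memory example with made-up column names.
meta_example = {"feat_cols_base": ["col_a", "col_b", "col_c"], "use_log_price": True}
print(feature_columns_for(meta_example, "col_b"))
```

A prediction would then be `booster.predict(X[feature_columns_for(meta, col)])`, followed by `exp` when `meta["use_log_price"]` is true.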

---

## Related URLs

- Dataset: https://www.kaggle.com/datasets/shunyafukuda/mitsui-byprice-lgbm-v1
- Submission notebook: https://www.kaggle.com/code/shunyafukuda/byprice-lgbm-submit

In [1]:
# =============================== #
# Price-First TRAIN (GPU-only)    #
# - resumable, progress, quiet    #
# - single-GPU sequential         #
# =============================== #

import os, json, time, warnings, re, logging
warnings.filterwarnings("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"
logging.getLogger("lightgbm").setLevel(logging.ERROR)

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

# ---- Installing LightGBM in a separate cell beforehand keeps the output even quieter ----
# try:
#     import lightgbm as lgb
# except ImportError:
#     !pip -qq install --upgrade lightgbm
#     import lightgbm as lgb
import lightgbm as lgb

from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

# ---------------- Paths/Config ----------------
DATA_PATH  = "/kaggle/input/mitsui-commodity-prediction-challenge"
TRAIN_CSV  = f"{DATA_PATH}/train.csv"
TEST_CSV   = f"{DATA_PATH}/test.csv"
PAIRS_CSV  = f"{DATA_PATH}/target_pairs.csv"

OUT_DIR    = "/kaggle/working/models_price"
os.makedirs(OUT_DIR, exist_ok=True)

LOG_JSONL  = os.path.join(OUT_DIR, "train_price_progress.jsonl")  # incremental progress log
METRICS_CSV= os.path.join(OUT_DIR, "cv_price_metrics.csv")        # summary of successful runs

# ---- Training hyperparameters (GPU assumed) ----
N_SPLITS       = 5        # reduce to 3 if training is too slow
USE_LOG_PRICE  = True     # y=log(price)
MIN_SAMPLES    = 200      # skip columns with too few samples
LGB_PARAMS = dict(
    n_estimators=2000, learning_rate=0.03, num_leaves=31,
    feature_fraction=0.8, bagging_fraction=0.8, bagging_freq=1,
    max_bin=255, random_state=42, verbosity=-1,
    device_type="gpu", gpu_platform_id=0, gpu_device_id=0,
)
CALLBACKS = [lgb.log_evaluation(period=0), lgb.early_stopping(stopping_rounds=100)]

# ---------------- Utils ----------------
def preprocess_for_lgbm(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    obj = df.select_dtypes(include=["object"]).columns
    if len(obj) > 0:
        df[obj] = df[obj].apply(pd.to_numeric, errors="coerce")
    for c in df.select_dtypes(include=["category"]).columns:
        df[c] = df[c].cat.codes
    return df

def parse_pair_string(s: str) -> tuple[str | None, str | None]:
    """'A - B' -> ('A','B'), 'A' -> ('A', None)"""
    if " - " in s:
        a, b = s.split(" - ", 1)
        return a.strip(), b.strip()
    return s.strip(), None

def append_jsonl(path: str, obj: dict):
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(obj, ensure_ascii=False) + "\n")

def already_done_cols(out_dir: str) -> set[str]:
    done = set()
    for fname in os.listdir(out_dir):
        if fname.startswith("price__") and fname.endswith(".txt"):
            done.add(fname[len("price__"):-4])
    return done

# ---------------- Load ----------------
train_raw = pd.read_csv(TRAIN_CSV)
test_raw  = pd.read_csv(TEST_CSV)
pairs     = pd.read_csv(PAIRS_CSV)

assert "date_id" in train_raw.columns, "train.csv has no date_id column"
assert set(["target","lag","pair"]).issubset(pairs.columns), "target_pairs.csv must contain the three columns target, lag, pair"

train = preprocess_for_lgbm(train_raw)
test  = preprocess_for_lgbm(test_raw)

num_train_cols = train.select_dtypes(include=[np.number]).columns.tolist()
num_test_cols  = test.select_dtypes(include=[np.number]).columns.tolist()
feat_cols_base = sorted(set(num_train_cols).intersection(num_test_cols))
feat_cols_base = [c for c in feat_cols_base if c != "date_id"]

# ---- Strictly extract the price columns to train (column names that appear in pair) ----
required_price_cols = set()
for _, row in pairs.iterrows():
    a, b = parse_pair_string(str(row["pair"]))
    if a is not None: required_price_cols.add(a)
    if b is not None: required_price_cols.add(b)

missing = [col for col in required_price_cols if col not in train.columns]
assert len(missing) == 0, f"Columns referenced in pair are missing from train.csv: {missing}"

# ---- Chronological order ----
df = train.sort_values("date_id").reset_index(drop=True)

# ---------------- Core: per-price-column training ----------------
def train_one_pricecol(pcol: str) -> dict:
    """Train LGBM (GPU) for one price column, save the model, and return metrics."""
    t_start = time.time()
    out = {"price_col": pcol, "status": "skipped", "rmse": None, "secs": 0.0, "n": 0}

    model_path = os.path.join(OUT_DIR, f"price__{pcol}.txt")
    if os.path.exists(model_path):
        out["status"] = "exists"
        return out

    y = df[pcol].astype(float).values
    mask = np.isfinite(y)
    Xc, yc = df.loc[mask, feat_cols_base + ["date_id"]], y[mask]
    out["n"] = len(yc)
    if len(yc) < MIN_SAMPLES:
        out["status"] = "too_few_samples"
        append_jsonl(LOG_JSONL, {**out, "ts": time.time()})
        return out

    # Leak prevention: exclude the target column itself from the features
    feat_cols = [c for c in feat_cols_base if c != pcol]
    yc_tr = np.log(np.clip(yc, 1e-12, None)) if USE_LOG_PRICE else yc

    tss = TimeSeriesSplit(n_splits=N_SPLITS)
    best_model, best_rmse = None, np.inf

    for fold, (tr_idx, va_idx) in enumerate(tss.split(Xc), 1):
        X_tr = Xc.iloc[tr_idx][feat_cols]
        X_va = Xc.iloc[va_idx][feat_cols]
        y_tr = yc_tr[tr_idx]
        y_va = yc_tr[va_idx]

        model = lgb.LGBMRegressor(**LGB_PARAMS)
        model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)],
                  eval_metric="rmse", callbacks=CALLBACKS)

        pred = model.predict(X_va, num_iteration=getattr(model, "best_iteration_", None))
        if USE_LOG_PRICE:
            pred = np.exp(pred)
            y_va_eval = np.exp(y_va)
        else:
            y_va_eval = y_va

        # RMSE as sqrt(MSE): the `squared` argument is not available in newer scikit-learn releases
        rmse = float(np.sqrt(mean_squared_error(y_va_eval, pred)))
        if rmse < best_rmse:
            best_rmse, best_model = rmse, model

    if best_model is not None:
        best_model.booster_.save_model(model_path)
        out["status"] = "ok"
        out["rmse"]   = float(best_rmse)

    out["secs"] = time.time() - t_start
    append_jsonl(LOG_JSONL, {**out, "ts": time.time()})
    return out

# ---------------- Run (single-GPU sequential) ----------------
price_cols_sorted = sorted(required_price_cols)
done_before = already_done_cols(OUT_DIR)
todo_cols   = [c for c in price_cols_sorted if c not in done_before]

print(f"Total targets = {len(price_cols_sorted)} | resume skipped = {len(done_before)} | to-train = {len(todo_cols)}")

results = []
t0 = time.time()
for col in tqdm(todo_cols, ascii=True, desc="train(price cols) [GPU]"):
    results.append(train_one_pricecol(col))
elapsed = time.time() - t0
print(f"Done. elapsed={elapsed/60:.1f} min")

# ---------------- Save metrics (append-safe) ----------------
prev = None
if os.path.exists(METRICS_CSV):
    try:
        prev = pd.read_csv(METRICS_CSV)
    except Exception:
        prev = None

df_res = pd.DataFrame(results)
df_ok  = df_res[df_res["status"]=="ok"][["price_col","rmse","secs","n"]]
merged = pd.concat([prev, df_ok], axis=0, ignore_index=True) if prev is not None else df_ok
merged.to_csv(METRICS_CSV, index=False)

# ---------------- Save meta ----------------
max_lag = int(pairs["lag"].max())
meta = {
    "feat_cols_base": feat_cols_base,
    "price_cols": price_cols_sorted,
    "use_log_price": USE_LOG_PRICE,
    "max_lag": max_lag,
    "n_splits": N_SPLITS,
    "note": "GPU-only sequential training with resume & progress.",
}
with open(os.path.join(OUT_DIR, "meta_price.json"), "w") as f:
    json.dump(meta, f)

print("Saved to:", OUT_DIR)
print("Examples:", sorted(os.listdir(OUT_DIR))[:10])

Total targets = 106 | resume skipped = 0 | to-train = 106


train(price cols) [GPU]:   0%|          | 0/106 [00:00<?, ?it/s]



Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[1699]	valid_0's rmse: 0.0387763	valid_0's l2: 0.0015036
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[1042]	valid_0's rmse: 0.0176491	valid_0's l2: 0.000311491
Training until validation scores don't improve for 100 rounds
Did not meet early stopping. Best iteration is:
[1915]	valid_0's rmse: 0.0225908	valid_0's l2: 0.000510345
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[339]	valid_0's rmse: 0.0167336	valid_0's l2: 0.000280015
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[33]	valid_0's rmse: 0.0122869	valid_0's l2: 0.000150968
Training until validation scores don't improve for 100 rounds
Did not meet early stopping. Best iteration is:
[1991]	valid_0's rmse: 0.0636922	valid_0's l2: 0.00405669
Training until validation scores don't impro