Fix: Remove window-normalization leakage in AR training (causal lookback normalization) #84

@HarvestStars

Description

Hi Kronos team,

While reviewing train_predictor and the dataset, I noticed a potential data-leakage issue in __getitem__:

  • In AR training we use

    token_in  = [token_seq_0[:, :-1], token_seq_1[:, :-1]]
    token_out = [token_seq_0[:, 1:],  token_seq_1[:, 1:]]
    
  • But each token_seq_* comes from x_tensor, which was Z-scored with statistics of the entire window (lookback_window + predict_window + 1).
    This makes every token_in position depend on statistics that include future points in the same window, which can cause a distribution mismatch versus real-time inference, where future statistics are unavailable.
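A minimal sketch of the leakage described above, using illustrative names (lookback_window, predict_window, x) rather than the repo's actual variables: because the whole-window statistics include the predict segment, perturbing only future steps changes the normalized lookback tokens.

```python
import numpy as np

lookback_window, predict_window = 4, 2
rng = np.random.default_rng(0)
x = rng.normal(size=lookback_window + predict_window + 1)

# Leaky: statistics computed over the ENTIRE window,
# including the future (predict) segment.
x_leaky = (x - x.mean()) / x.std()

# Perturb ONLY the future steps; the raw lookback is untouched.
x2 = x.copy()
x2[lookback_window:] += 10.0
x2_leaky = (x2 - x2.mean()) / x2.std()

# The normalized lookback tokens still change, because the
# normalization statistics depend on the future segment.
print(np.allclose(x_leaky[:lookback_window], x2_leaky[:lookback_window]))  # False
```

At inference time those future points do not exist, so the model is trained on inputs drawn from a different distribution than it will see in deployment.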

So I opened PR #83 with the fix and a short comment tying the change back to the paper's rationale.

What I changed (minimal fix):

  • Add code comments that make it easier to locate the relevant rationale in the paper.
  • Compute the Z-score using only the lookback segment.
  • Start training from the lookback boundary (predicting future steps only), to better simulate deployment.
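The fix in the list above can be sketched as follows; this is an illustrative reimplementation with assumed names (lookback_window, x), not the PR's actual diff. The statistics come from the lookback segment only, so the normalized lookback tokens are invariant to the future segment, matching what is available at inference time.

```python
import numpy as np

lookback_window, predict_window = 4, 2
rng = np.random.default_rng(0)
x = rng.normal(size=lookback_window + predict_window + 1)

# Causal: Z-score with statistics from the lookback segment only.
lb = x[:lookback_window]
mean_lb, std_lb = lb.mean(), lb.std()
eps = 1e-8  # guards against a constant (zero-variance) lookback
x_norm = (x - mean_lb) / (std_lb + eps)

# Perturbing the future no longer changes the normalized lookback tokens.
x2 = x.copy()
x2[lookback_window:] += 10.0
x2_norm = (x2 - mean_lb) / (std_lb + eps)
print(np.allclose(x_norm[:lookback_window], x2_norm[:lookback_window]))  # True
```

Restricting the AR loss to positions at or past the lookback boundary (rather than all shifted positions) then makes the training objective match the deployment task: predict the future segment given a causally normalized lookback.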

If you’d like, I can follow up with a fully causal sliding/online normalization variant as well.
