[TFT] Cannot start training #1437

Open
@jomach

Related to TFT/PyTorch

Describe the bug
I'm trying to add a new dataset to this framework, following the YAML config format. I get several different errors, but most of them look like this:

  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1445, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/workspace/models/tft_pyt/modeling.py", line 229, in forward
    t_observed_tgt = fused_pointwise_linear_v2(t_tgt_obs, self.t_tgt_embedding_vectors, self.t_tgt_embedding_bias)
RuntimeError: Error instantiating 'training.trainer.CTLTrainer' : The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/workspace/models/tft_pyt/modeling.py", line 89, in fused_pointwise_linear_v2
def fused_pointwise_linear_v2(x, a, b):
    out = x.unsqueeze(3) * a
    out = out + b
          ~~~~~~~ <--- HERE
    return out
**RuntimeError: CUDA error: device-side assert triggered**
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
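
The last line of the error message suggests CUDA_LAUNCH_BLOCKING=1. A minimal sketch of setting it from Python, assuming this runs at the very top of the entry script before any CUDA work starts (the variable can also be exported in the shell before launching training):

```python
import os

# Set this before the first CUDA call (ideally before importing torch),
# so kernels launch synchronously and the stack trace points at the
# operation that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

With synchronous launches the reported location may move from `out + b` to an earlier operation, since the current trace may be misleading, as the message itself warns.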

To Reproduce
Dataset:

zeus@b8ae237f7dad:/workspace$ head /workspace/datasets/sosd/timeseries_datasetcs.csv 
AUDAT,MATNR,WERKS,total_quantity
2022-04-01,M00000000213903201,D110,1
2022-04-01,M00000000215022201,D110,5
2022-04-01,M00000000214593302,D110,3
2022-04-01,M00000000215043701,D110,5
2022-04-01,M00000000213449504,D110,0
2022-04-01,M00000000214319300,D110,0
2022-04-01,M00000000214385102,D110,10
2022-04-01,M00000000214180004,D110,0
2022-04-01,M00000000214458104,D110,20

config:

_target_: data.datasets.create_datasets
config:
    graph: False
    source_path: /workspace/datasets/sosd/timeseries_datasetcs.csv
    dest_path: /workspace/datasets/sosd/
    train_range:
      - '2022-04-01'
      - '2023-09-02'
    valid_range:
      - '2023-10-26'
      - '2024-02-15'
    test_range:
      - '2023-09-02'
      - '2023-10-26'
    scale_per_id: True
    encoder_length: 5
    input_length: 5
    example_length: 10
    dataset_stride: 1
    MultiID: False
    features:
    - name: 'MATNR'
      feature_type: 'ID'
      feature_embed_type: 'CATEGORICAL'
      cardinality: 70908
    - name: 'MATNR'
      feature_type: 'STATIC'
      feature_embed_type: 'CATEGORICAL'
      cardinality: 70908      
    - name: 'WERKS'
      feature_type: 'ID'
      feature_embed_type: 'CATEGORICAL'
      cardinality: 1
    - name: 'AUDAT'
      feature_type: 'TIME'
      feature_embed_type: 'DATE'
    - name: 'WERKS'
      feature_type: 'KNOWN'
      feature_embed_type: 'CATEGORICAL'
      cardinality: 1
    - name: 'total_quantity'
      feature_type: 'TARGET'
      feature_embed_type: 'CONTINUOUS'
      scaler:
        _target_: sklearn.preprocessing.StandardScaler
    train_samples: 619765
    valid_samples: 174172
    binarized: True
    time_series_count: 70908
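
One possible cause of a device-side assert in a model with categorical embeddings is an encoded category index that falls outside the configured `cardinality`. This is an assumption, not something the traceback confirms. A rough sanity check, sketched with the standard library only, counts the distinct values per categorical column and compares them with the config. The helper name `distinct_counts` is hypothetical, and whether the framework reserves an extra index (for example for unknown values) is not known here.

```python
import csv
import io


def distinct_counts(csv_file, columns):
    """Return {column: number of distinct values} for the given columns."""
    seen = {col: set() for col in columns}
    for row in csv.DictReader(csv_file):
        for col in columns:
            seen[col].add(row[col])
    return {col: len(values) for col, values in seen.items()}


# Cardinalities copied from the config above.
configured = {"MATNR": 70908, "WERKS": 1}

# First rows of the dataset shown above. For the real check, pass
# open("/workspace/datasets/sosd/timeseries_datasetcs.csv", newline="") instead.
sample = io.StringIO(
    "AUDAT,MATNR,WERKS,total_quantity\n"
    "2022-04-01,M00000000213903201,D110,1\n"
    "2022-04-01,M00000000215022201,D110,5\n"
    "2022-04-01,M00000000214593302,D110,3\n"
)

counts = distinct_counts(sample, list(configured))
for col, n in counts.items():
    status = "ok" if n <= configured[col] else "exceeds configured cardinality"
    print(f"{col}: {n} distinct values, configured {configured[col]} ({status})")
```

If a column has exactly as many distinct values as its configured cardinality (as `WERKS` does in the sample), it may be worth checking whether the encoder starts indexing at 0 or at 1.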

Expected behavior
Training starts without errors.

Environment

  • NVIDIA-SMI 535.216.03
  • Driver Version: 535.216.03
  • CUDA Version: 12.2

Labels: bug
