KeyError while sampling using freshly trained PAR model #943

Closed
DamianUS opened this issue Aug 9, 2022 · 9 comments
Labels: bug, data:sequential, resolution:duplicate

Comments

DamianUS commented Aug 9, 2022

Environment Details

Please indicate the following details about the environment in which you found the bug:

SDV version: 0.16.0
Python version: 3.8.13 (default, May 8 2022, 17:48:02) [Clang 13.1.6 (clang-1316.0.21.2)]
Operating System: Macbook Pro M1 Mac OS X 12.0.1

Error description

The key error is also being raised when trying to sample from a freshly-trained PAR model in v0.16.0.

I tried both with and without passing the field types metadata; neither helps.

I printed the model metadata to check whether the model inferred the data types properly, and everything seems correct.

Here is the code I used, in case it helps (this is the latest version, in which the model infers the field types):

import pandas as pd
from sdv.timeseries import PAR
from sdv.metrics.timeseries import TSFClassifierEfficacy

data = pd.read_csv("data/micro_batch_task.csv")
sequence_index = 'start_time'
field_types = {
    "instance_num": {
        "type": "numerical",
        'subtype': 'integer'
    },
    "start_time": {
        "type": "numerical",
        'subtype': 'integer'
    },
    "plan_cpu": {
        "type": "numerical",
        'subtype': 'float'
    },
    "plan_mem": {
        "type": "numerical",
        'subtype': 'float'
    },
    "makespan": {
        "type": "numerical",
        'subtype': 'integer'
    },
}
model = PAR(
    sequence_index=sequence_index,
    segment_size=10,
    epochs=1,
    verbose=True
)
model.fit(data)
print(model.get_metadata().to_dict())
new_data = model.sample(1)  # raises KeyError: 'start_time'
print(new_data)
print(TSFClassifierEfficacy.compute(data, new_data, field_types, target='makespan'))

When trying to sample:

PARModel(epochs=1, sample_size=1, cuda='cpu', verbose=True) instance created
Epoch 1 | Loss 0.001459105173125863: 100%|██████████| 1/1 [00:51<00:00, 51.42s/it]
{'fields': {'instance_num': {'type': 'numerical', 'subtype': 'float', 'transformer': None}, 'start_time': {'type': 'numerical', 'subtype': 'integer', 'transformer': None}, 'plan_cpu': {'type': 'numerical', 'subtype': 'float', 'transformer': None}, 'plan_mem': {'type': 'numerical', 'subtype': 'float', 'transformer': None}, 'makespan': {'type': 'numerical', 'subtype': 'integer', 'transformer': None}}, 'constraints': [], 'model_kwargs': {}, 'name': None, 'primary_key': None, 'sequence_index': 'start_time', 'entity_columns': [], 'context_columns': []}
100%|██████████| 1/1 [00:00<00:00, 85.72it/s]
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'start_time'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/damianfernandez/PycharmProjects/sdv/main.py", line 46, in <module>
    new_data = model.sample(1)
  File "/opt/homebrew/lib/python3.8/site-packages/sdv/timeseries/base.py", line 268, in sample
    return self._metadata.reverse_transform(sampled)
  File "/opt/homebrew/lib/python3.8/site-packages/sdv/metadata/table.py", line 700, in reverse_transform
    field_data = reversed_data[name]
  File "/opt/homebrew/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/opt/homebrew/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'start_time'

Process finished with exit code 1

Maybe I'm not doing something properly. I'm new to the library!
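For anyone reading the traceback above: the second frame shows reverse_transform looking up reversed_data[name], so the KeyError means the sampled frame has no 'start_time' column at that point. Below is a small pandas-only sketch of that failure mode. The helper name missing_columns and the toy frame are made up for illustration and are not part of SDV.

```python
import pandas as pd


def missing_columns(frame, expected):
    """Return the expected column names that are absent from frame.

    Hypothetical helper for illustration; not part of SDV or pandas.
    """
    return [name for name in expected if name not in frame.columns]


# Toy stand-in for a sampled frame that lacks the sequence index column.
sampled = pd.DataFrame({"plan_cpu": [0.5], "makespan": [12]})
expected = ["start_time", "plan_cpu", "makespan"]
absent = missing_columns(sampled, expected)

# Indexing a missing column raises the same kind of KeyError as in the traceback.
try:
    sampled["start_time"]
except KeyError as err:
    raised = err.args[0]
```

Checking the columns first, as missing_columns does, reports every absent field at once instead of stopping at the first failed lookup.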

DamianUS added the bug and new labels Aug 9, 2022

yamidibarra commented Aug 10, 2022

Dear @npatki, thank you in advance for your support! I'm having a similar issue, described below:

Environment Details

SDV version: 0.16.0
Python version: 3.8.13
Operating System: Windows 10

Error:
Exception has occurred: KeyError
'Time'

The above exception was the direct cause of the following exception:
File "C:\Users\Data_Augmentation\PAR_Model.py", line 13, in
new_data = model.sample(1)

import pandas as pd
from sdv.timeseries import PAR

data = pd.read_pickle('df_PAR.pkl')
context_columns = ['POM', 'Mold Temperature [°C]', 'Injection velocity [cmm/s]', 'Holding pressure [bar]'] 
entity_columns = ['id']
sequence_index = 'Time'

model = PAR(
    entity_columns=entity_columns,
    context_columns=context_columns,
    sequence_index=sequence_index,
)

model.fit(data)
new_data = model.sample(1)

model.save('Timeseries_synthetic_model.pkl')

Attached are the .py file and the .pkl file with the data.
PS: I tried to reproduce the example shown here: https://sdv.dev/SDV/user_guides/timeseries/par.html but I can't access the file. I wanted to check the data types of the variables.

@yamidibarra

#808 (comment)

I understand what's going on. My Time column is a float, but PAR only allows a datetime type.

@dharmesh1007

@yamidibarra, I'm having the same issue: the Time column needs to be in datetime format.

npatki (Contributor) commented Aug 10, 2022

Hi everyone,

Yes @yamidibarra, I agree with you. Issue #808 is likely the root cause for all these errors: It is a known issue that the PAR model currently produces a sampling error when sequence_index is numerical (float, int). The error should go away if you express sequence_index as a datetime or if you remove it altogether.

Does this accurately describe everyone's scenario? If so, I can close this issue in favor of #808 for tracking.
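As a rough sketch of the "express sequence_index as a datetime" suggestion above, the snippet below converts a numerical column to datetime with pandas before the data would be passed to the model. It assumes the numbers are offsets from the Unix epoch in a known unit (seconds here). The helper name ensure_datetime_index and the toy data are hypothetical, and the snippet does not call SDV at all.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def ensure_datetime_index(df, column, unit="s"):
    """Return a copy of df in which `column` has a datetime dtype.

    Hypothetical helper: numeric values are assumed to be offsets from
    the Unix epoch expressed in `unit`.
    """
    out = df.copy()
    if is_datetime64_any_dtype(out[column]):
        return out  # already a datetime column, nothing to do
    if is_numeric_dtype(out[column]):
        out[column] = pd.to_datetime(out[column], unit=unit)
    else:
        out[column] = pd.to_datetime(out[column])  # e.g. date strings
    return out


# Toy data with an integer sequence index, similar in shape to the report above.
demo = pd.DataFrame({"start_time": [0, 10, 20], "makespan": [5, 7, 9]})
converted = ensure_datetime_index(demo, "start_time")
```

The converted frame is what would then be handed to fit; the original frame is left untouched, so the numerical values remain available for mapping sampled timestamps back afterwards.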

npatki (Contributor) commented Aug 10, 2022

BTW --

@DamianUS, thanks for filing this issue! I will delete the comments in #935 since you copied it over here

@yamidibarra, re the link:

> PS: I tried to reproduce the example shown here: https://sdv.dev/SDV/user_guides/timeseries/par.html but I can't access the file. I wanted to check the type of data variables.

The text of the link is correct, but the hyperlink points to a different URL. You should be able to open the page if you click on this: https://sdv.dev/SDV/user_guides/timeseries/par.html.

npatki added the data:sequential and under discussion labels and removed the new label Aug 10, 2022
@yamidibarra

> Hi everyone,
>
> Yes @yamidibarra, I agree with you. Issue #808 is likely the root cause for all these errors: It is a known issue that the PAR model currently produces a sampling error when sequence_index is numerical (float, int). The error should go away if you express sequence_index as a datetime or if you remove it altogether.
>
> Does this accurately describe everyone's scenario? If so, I can close this issue in favor of #808 for tracking.

Yes, it resolves this specific issue. Here is my workaround. I'll open another issue regarding the synthetic data; I have some questions and would appreciate your opinion, @npatki.

import pandas as pd
from sdv.timeseries import PAR

data = pd.read_pickle('df_PAR.pkl')
# Time holds float seconds: scale to nanoseconds so that pd.to_datetime
# reads the values as offsets from the epoch.
data['Time'] = data['Time'].multiply(1E9)
data['Time'] = pd.to_datetime(data['Time'])

context_columns = ['POM', 'Mold Temperature [°C]', 'Injection velocity [cmm/s]', 'Holding pressure [bar]']
entity_columns = ['id']
sequence_index = 'Time'
model = PAR(entity_columns=entity_columns, context_columns=context_columns, sequence_index=sequence_index)

model.fit(data)
new_data = model.sample(1)

# Get seconds back as floats (uses only the second and microsecond fields)
new_data['Time'] = new_data['Time'].apply(lambda x: '%02d.%06d' % (x.second, x.microsecond)).astype(float)
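One caveat with the last line of the workaround: x.second and x.microsecond only cover the part of the timestamp below one minute, so a Time value of 60 seconds or more would not map back to its original number. A possible alternative, sketched here with pandas only and under the assumption that Time is float seconds measured from zero, is to subtract the epoch and take the total seconds. The function names are hypothetical.

```python
import pandas as pd

EPOCH = pd.Timestamp("1970-01-01")


def seconds_to_datetime(seconds):
    """Interpret float seconds as offsets from the Unix epoch."""
    return pd.to_datetime(seconds, unit="s")


def datetime_to_seconds(timestamps):
    """Invert seconds_to_datetime: elapsed seconds since the epoch as floats."""
    return (timestamps - EPOCH).dt.total_seconds()


# Round trip on toy values, including some beyond one minute.
times = pd.Series([0.0, 0.5, 61.25, 3600.0])
as_datetime = seconds_to_datetime(times)
restored = datetime_to_seconds(as_datetime)
```

In the workaround above, seconds_to_datetime would replace the multiply and to_datetime steps, and datetime_to_seconds would replace the final apply on the sampled Time column.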

npatki (Contributor) commented Aug 11, 2022

Great, thanks for confirming! I'll close this issue in favor of #808.

Please feel free to reply if you continue to see a KeyError on the PAR model even with a datetime sequence_index, and I can reopen this issue for discussion.

npatki closed this as completed Aug 11, 2022
npatki added the resolution:duplicate label and removed the under discussion label Aug 11, 2022

mohammedsabiya commented Jul 20, 2023

Hi, I am facing the same KeyError issue in PARSynthesizer as described here, even though sequence_index is a datetime. Please see issue #1510.

P.S. The KeyError that I get comes from the context_columns.

> Great, thanks for confirming! I'll close this issue in favor of #808.
>
> Please feel free to reply if you continue to see a KeyError on the PAR model even if you have a datetime sequence_index and I can reopen this issue for discussion.

npatki (Contributor) commented Jul 21, 2023

@mohammedsabiya Thanks for filing! We'll follow up in the new issue, as it's been some time since this original one was resolved.
