PAR model sampling error when there is a numerical sequence_index (float, int) #808

Closed
doolingdavidrs21 opened this issue May 20, 2022 · 5 comments
Labels
bug Something isn't working data:sequential Related to timeseries datasets resolution:resolved The issue was fixed, the question was answered, etc.

Comments

@doolingdavidrs21

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: '0.14.1'
  • Python version: Python 3.8.10
  • Operating System: Windows 10

Error Description

I am unable to use an integer field as the sequence_index parameter of the PAR model.
I would like to do so because the increasing integers that a trained PAR model produces for that field could then be mapped back to a datetime field whose frequency is something other than days.
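
As a rough illustration of the mapping described above, the following sketch turns integer steps back into timestamps at an hourly frequency using only the standard library. The helper name steps_to_timestamps, the start time, and the one-hour step are assumptions made for the example; none of them come from SDV or from the demo data.

```python
from datetime import datetime, timedelta


def steps_to_timestamps(steps, start, step_size):
    """Map integer sequence steps onto timestamps spaced step_size apart (hypothetical helper)."""
    return [start + step * step_size for step in steps]


# Hypothetical synthetic sequence index values and an assumed hourly frequency.
synthetic_steps = [0, 1, 2, 5]
timestamps = steps_to_timestamps(
    synthetic_steps,
    start=datetime(2022, 1, 3, 9, 30),
    step_size=timedelta(hours=1),
)

print(timestamps[-1])  # 2022-01-03 14:30:00
```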

Below is an example in which setting the sequence_index parameter to an integer column allows the model to be trained, but sampling from the trained model then fails:

Steps to reproduce

from sdv.demo import load_timeseries_demo
import pandas as pd

data = load_timeseries_demo()

sequence_map = {
    sorted(data["Date"].unique())[i]: i for i in range(len(data["Date"].unique()))
}

data["Date"] = data["Date"].map(sequence_map)

entity_columns = ["Symbol"]
context_columns = ["MarketCap", "Sector", "Industry"]
sequence_index = "Date"

from sdv.timeseries import PAR

model = PAR(
    entity_columns=entity_columns,
    context_columns=context_columns,
    sequence_index=sequence_index,
    verbose=True,
    epochs=45,
)

model.fit(data)

# throws error

new_data = model.sample(num_sequences=1, sequence_length=10)

@doolingdavidrs21 doolingdavidrs21 added bug Something isn't working pending review labels May 20, 2022
@doolingdavidrs21
Author

Here is the traceback when I run the above:

C:\Users\davidd\Anaconda3\lib\site-packages\scipy\stats\_continuous_distns.py:639: RuntimeWarning: invalid value encountered in sqrt
  sk = 2*(b-a)*np.sqrt(a + b + 1) / (a + b + 2) / np.sqrt(a*b)
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\optimize\minpack.py:175: RuntimeWarning: The iteration is not making good progress, as measured by the 
  improvement from the last ten iterations.
  warnings.warn(msg, RuntimeWarning)
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\stats\_continuous_distns.py:5320: RuntimeWarning: divide by zero encountered in true_divide
  return c**2 / (c**2 - n**2)
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:2606: RuntimeWarning: invalid value encountered in double_scalars
  Lhat = muhat - Shat*mu
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\optimize\minpack.py:175: RuntimeWarning: The number of calls to function has reached maxfev = 600.
  warnings.warn(msg, RuntimeWarning)
PARModel(epochs=45, sample_size=1, cuda='cuda', verbose=True) instance created
Epoch 45 | Loss 1.814377784729004: 100%|███████████████████████████████████████████████| 45/45 [00:30<00:00,  1.46it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.63it/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

~\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Date'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_21128/1263544826.py in <module>
     35 # throws error
     36 
---> 37 new_data = model.sample(num_sequences=1, sequence_length=10)

~\Anaconda3\lib\site-packages\sdv\timeseries\base.py in sample(self, num_sequences, context, sequence_length)
    265 
    266         sampled = self._sample(context, sequence_length)
--> 267         return self._metadata.reverse_transform(sampled)
    268 
    269     def save(self, path):

~\Anaconda3\lib\site-packages\sdv\metadata\table.py in reverse_transform(self, data)
    712                 field_data = pd.Series(Table._get_fake_values(field_metadata, len(reversed_data)))
    713             else:
--> 714                 field_data = reversed_data[name]
    715 
    716             reversed_data[name] = field_data[field_data.notnull()].astype(self._dtypes[name])

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'Date'

@npatki
Contributor

npatki commented May 20, 2022

Thanks for filing @doolingdavidrs21. I can replicate this issue.

For SDV developers: I did some digging and found the following --

  • If sequence index 'Date' is a datetime, then sampled data has column 'Date.value', which is reversed back to 'Date' (no issues)
  • If sequence index 'Date' is numerical, then sampled data has column 'Date.value' and the reversed name remains 'Date.value' (error)

@npatki npatki added data:sequential Related to timeseries datasets and removed pending review labels Jun 3, 2022
@npatki npatki changed the title sequence_index columns must be datetime; integer sequence_index columns can be used for model training, but result in errors when samping to get synthetic data PAR model can fit an integer sequence index but it errors when sampling Jul 8, 2022
@npatki npatki changed the title PAR model can fit an integer sequence index but it errors when sampling PAR model sampling error when there is a numerical sequence_index (float, int) Aug 11, 2022
@npatki
Contributor

npatki commented Aug 11, 2022

Potential Workarounds

  1. If the sequence index is only used for ordering and the data is already in order, you can drop the sequence index
data = data.drop([sequence_index], axis=1)
  2. Alternatively, you can cast an int column into datetime, as proposed by @yamidibarra in issue #943 ("KeyError while sampling using freshly trained PAR model")
import pandas as pd

sequence_index = 'my_sequence_index_column_name' # name of column
data[sequence_index] = pd.to_datetime(data[sequence_index]) 

Remember to cast the synthetic data back to an int at the end:

synthetic_data[sequence_index] = synthetic_data[sequence_index].astype(int)
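
The round trip behind the second workaround can be checked on a small frame without training a model. This is only a sketch: the column name step and its values are made up, and it assumes the sampled column comes back as datetime64[ns]. It uses astype("int64") rather than astype(int) so that the result does not depend on the platform's default integer width.

```python
import pandas as pd

# Hypothetical integer sequence index; "step" is not a column in the demo data.
data = pd.DataFrame({"step": [0, 1, 2, 3]})

# pd.to_datetime reads plain integers as nanoseconds since the Unix epoch.
data["step"] = pd.to_datetime(data["step"])

# ... fit the model and sample here; this sketch just reuses the same frame ...

# Casting back recovers the nanosecond counts, i.e. the original integers.
restored = data["step"].astype("int64")
print(restored.tolist())  # [0, 1, 2, 3]
```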

@amontanez24
Contributor

Thanks for filing @doolingdavidrs21. I can replicate this issue.

For SDV developers: I did some digging and found the following --

  • If sequence index 'Date' is a datetime, then sampled data has column 'Date.value', which is reversed back to 'Date' (no issues)
  • If sequence index 'Date' is numerical, then sampled data has column 'Date.value' and the reversed name remains 'Date.value' (error)

This is because in the PAR model, only the datetime columns are transformed. This can be seen here:

_DTYPE_TRANSFORMERS = {
    'i': None,
    'f': None,
    'M': rdt.transformers.UnixTimestampEncoder(),
    'b': None,
    'O': None,
}

However, in sampling we add the .value suffix back in for the sequence index no matter what type it is.

output = output.rename(columns={
    self._sequence_index: self._sequence_index + '.value'
})

This is a bug.
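
The mismatch described above can be reproduced with a simplified stand-in that uses plain dictionaries as tables. The helpers sample_columns and reverse_transform are hypothetical and are not SDV functions; the conditional rename shown at the end is one possible guard, not necessarily the fix that was eventually shipped.

```python
SUFFIX = ".value"


def sample_columns(sequence_index, values, transformed_columns, always_rename):
    """Build a sampled table; always_rename=True mimics the unconditional rename."""
    needs_suffix = always_rename or sequence_index in transformed_columns
    name = sequence_index + SUFFIX if needs_suffix else sequence_index
    return {name: values}


def reverse_transform(table, transformed_columns):
    """Strip the suffix only from columns that a transformer actually produced."""
    reversed_table = {}
    for name, values in table.items():
        base = name[:-len(SUFFIX)] if name.endswith(SUFFIX) else name
        key = base if base in transformed_columns else name
        reversed_table[key] = values
    return reversed_table


# Datetime index: a transformer ran, so the suffix is added and later removed.
ok = reverse_transform(sample_columns("Date", [0, 1], {"Date"}, True), {"Date"})

# Numerical index: no transformer ran, but the suffix is added anyway,
# so a later lookup of "Date" would raise KeyError.
buggy = reverse_transform(sample_columns("Date", [0, 1], set(), True), set())

# Conditional rename: the suffix is skipped for untransformed columns.
fixed = reverse_transform(sample_columns("Date", [0, 1], set(), False), set())

print(sorted(ok), sorted(buggy), sorted(fixed))
```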

@npatki
Contributor

npatki commented Mar 9, 2023

Great news! This issue has now been resolved in our new SDV 1.0 (Beta!) release. Check it out and let us know if you're still encountering any problems.


@npatki npatki closed this as completed Mar 9, 2023
@npatki npatki added resolution:resolved The issue was fixed, the question was answered, etc. SDV 1.0 (Beta!) labels Mar 9, 2023