PAR model sampling error when there is a numerical `sequence_index` (float, int) #808

doolingdavidrs21 · 2022-05-20T16:15:45Z

Environment Details

Please indicate the following details about the environment in which you found the bug:

SDV version: '0.14.1'
Python version: Python 3.8.10
Operating System: Windows 10

Error Description

I am unable to use an integer field as the sequence_index parameter of the PAR model.
I would like to do so because the increasing integer values that a PAR model produces for that field could then be mapped back to a datetime field with a frequency other than days.
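To make the intended use concrete, here is a minimal sketch of such a mapping using only pandas. The start timestamp, the hourly step and the variable names are assumptions chosen for illustration; they do not come from the demo dataset.

```python
import pandas as pd

# Assumed values for illustration: first timestamp and sampling frequency.
start = pd.Timestamp("2022-01-03 09:00")
step = pd.Timedelta(hours=1)

# A hypothetical hourly datetime column.
original = pd.Series([start + i * step for i in range(5)])

# Encode each timestamp as the number of steps since the start (0, 1, 2, ...).
encoded = ((original - start) // step).astype("int64")

# Decode integers (for example, sampled ones) back into timestamps.
decoded = start + encoded * step
```

With an integer sequence index like `encoded`, the synthetic values could be decoded the same way once sampling works.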

Below is an example in which setting the sequence_index parameter to an integer column allows the model to be trained, but sampling from it then fails:

Steps to reproduce

import pandas as pd
from sdv.demo import load_timeseries_demo
from sdv.timeseries import PAR

data = load_timeseries_demo()

# Replace each date with its rank so that the sequence index becomes an integer.
unique_dates = sorted(data["Date"].unique())
sequence_map = {date: i for i, date in enumerate(unique_dates)}
data["Date"] = data["Date"].map(sequence_map)

entity_columns = ["Symbol"]
context_columns = ["MarketCap", "Sector", "Industry"]
sequence_index = "Date"

model = PAR(
    entity_columns=entity_columns,
    context_columns=context_columns,
    sequence_index=sequence_index,
    verbose=True,
    epochs=45,
)

model.fit(data)

# throws error
new_data = model.sample(num_sequences=1, sequence_length=10)


doolingdavidrs21 · 2022-05-20T16:17:36Z

Here is the traceback when I run the above:

C:\Users\davidd\Anaconda3\lib\site-packages\scipy\stats\_continuous_distns.py:639: RuntimeWarning: invalid value encountered in sqrt
  sk = 2*(b-a)*np.sqrt(a + b + 1) / (a + b + 2) / np.sqrt(a*b)
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\optimize\minpack.py:175: RuntimeWarning: The iteration is not making good progress, as measured by the 
  improvement from the last ten iterations.
  warnings.warn(msg, RuntimeWarning)
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\stats\_continuous_distns.py:5320: RuntimeWarning: divide by zero encountered in true_divide
  return c**2 / (c**2 - n**2)
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:2606: RuntimeWarning: invalid value encountered in double_scalars
  Lhat = muhat - Shat*mu
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\optimize\minpack.py:175: RuntimeWarning: The number of calls to function has reached maxfev = 600.
  warnings.warn(msg, RuntimeWarning)
PARModel(epochs=45, sample_size=1, cuda='cuda', verbose=True) instance created
Epoch 45 | Loss 1.814377784729004: 100%|███████████████████████████████████████████████| 45/45 [00:30<00:00,  1.46it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.63it/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

~\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Date'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_21128/1263544826.py in <module>
     35 # throws error
     36 
---> 37 new_data = model.sample(num_sequences=1, sequence_length=10)

~\Anaconda3\lib\site-packages\sdv\timeseries\base.py in sample(self, num_sequences, context, sequence_length)
    265 
    266         sampled = self._sample(context, sequence_length)
--> 267         return self._metadata.reverse_transform(sampled)
    268 
    269     def save(self, path):

~\Anaconda3\lib\site-packages\sdv\metadata\table.py in reverse_transform(self, data)
    712                 field_data = pd.Series(Table._get_fake_values(field_metadata, len(reversed_data)))
    713             else:
--> 714                 field_data = reversed_data[name]
    715 
    716             reversed_data[name] = field_data[field_data.notnull()].astype(self._dtypes[name])

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'Date'

npatki · 2022-05-20T16:58:23Z

Thanks for filing @doolingdavidrs21. I can replicate this issue.

For SDV developers: I did some digging and found the following --

If sequence index 'Date' is a datetime, then sampled data has column 'Date.value', which is reversed back to 'Date' (no issues)
If sequence index 'Date' is numerical, then sampled data has column 'Date.value' and the reversed name remains 'Date.value' (error)

npatki · 2022-08-11T20:26:24Z

Potential Workarounds

If the sequence index is only used for ordering and the data is already in order, you can drop the sequence index

data = data.drop([sequence_index], axis=1)
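If the rows are not guaranteed to be in order, one option is to sort within each entity before dropping the column. The sketch below uses a small made-up DataFrame; the column names mirror the example in this issue, and the values are invented. It assumes the model would then be created without a sequence_index.

```python
import pandas as pd

# Made-up data: two entities whose rows are out of order.
data = pd.DataFrame({
    "Symbol": ["A", "A", "B", "B"],
    "Date": [1, 0, 1, 0],
    "Close": [11.0, 10.0, 21.0, 20.0],
})
entity_columns = ["Symbol"]
sequence_index = "Date"

# Sort by entity, then by the sequence index, so that row order carries the ordering.
ordered = data.sort_values(entity_columns + [sequence_index]).reset_index(drop=True)

# The sequence index is now redundant and can be dropped.
ordered = ordered.drop(columns=[sequence_index])
```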

Alternatively, you can cast an int column into datetime, as proposed by @yamidibarra in KeyError while sampling using freshly trained PAR model #943

import pandas as pd

sequence_index = 'my_sequence_index_column_name' # name of column
data[sequence_index] = pd.to_datetime(data[sequence_index])

Remember to cast the synthetic data back to an int at the end

synthetic_data[sequence_index] = synthetic_data[sequence_index].astype(int)
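The round trip behind this workaround can be checked with pandas alone. In the sketch below, the tiny DataFrame is made up and a plain copy stands in for the fit and sample steps, so it only illustrates the casts, not the model. It uses "int64" explicitly, on the assumption that a plain `int` may resolve to a 32-bit type on some platforms.

```python
import pandas as pd

sequence_index = "Date"
data = pd.DataFrame({sequence_index: [0, 1, 2, 3]})  # made-up integer index

# Integers are interpreted as nanoseconds since the Unix epoch by default.
data[sequence_index] = pd.to_datetime(data[sequence_index])

# Stand-in for model.fit(data) followed by model.sample(...).
synthetic_data = data.copy()

# Cast back to integers (nanoseconds since the epoch).
synthetic_data[sequence_index] = synthetic_data[sequence_index].astype("int64")
```

Sampled timestamps are not guaranteed to match the original integers exactly, so the recovered values may need rounding or re-ranking.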

amontanez24 · 2022-10-04T23:32:57Z

Thanks for filing @doolingdavidrs21. I can replicate this issue.

For SDV developers: I did some digging and found the following --

If sequence index 'Date' is a datetime, then sampled data has column 'Date.value', which is reversed back to 'Date' (no issues)

If sequence index 'Date' is numerical, then sampled data has column 'Date.value' and the reversed name remains 'Date.value' (error)

This is because in the PAR model, only the datetime columns are transformed. This can be seen here:

SDV/sdv/timeseries/base.py

Lines 74 to 80 in f822903

    
    _DTYPE_TRANSFORMERS = {
        'i': None,
        'f': None,
        'M': rdt.transformers.UnixTimestampEncoder(),
        'b': None,
        'O': None,
    }

However, in sampling we add the .value suffix back in for the sequence index no matter what type it is.

SDV/sdv/timeseries/deepecho.py

Lines 141 to 143 in f822903

    
    output = output.rename(columns={
        self._sequence_index: self._sequence_index + '.value'
    })

This is a bug
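The mismatch can be illustrated without SDV. In the rough sketch below, `reverse_transform_sketch` is a hypothetical stand-in written for this illustration, not SDV's actual code. It restores the original column name only when the column is treated as a datetime, then looks the column up by its original name, which is the step that fails in the traceback above.

```python
import pandas as pd


def reverse_transform_sketch(sampled, name, is_datetime):
    """Hypothetical stand-in for the reverse transform, for illustration only."""
    reversed_data = sampled.copy()
    if is_datetime:
        # Only datetime columns have a transformer that maps
        # '<name>.value' back to '<name>'.
        values = reversed_data.pop(name + ".value")
        reversed_data[name] = pd.to_datetime(values)
    # The lookup by the original name raises KeyError if it was never restored.
    return reversed_data[name]


# Sampling renames the sequence index to '<name>.value' regardless of its type.
sampled = pd.DataFrame({"Date": [0, 1, 2]}).rename(columns={"Date": "Date.value"})

datetime_result = reverse_transform_sketch(sampled, "Date", is_datetime=True)

numeric_error = None
try:
    reverse_transform_sketch(sampled, "Date", is_datetime=False)
except KeyError as err:
    numeric_error = err  # KeyError: 'Date', as in the reported traceback
```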

npatki · 2023-03-09T23:10:55Z

Great news! This issue has now been resolved in our new SDV 1.0 (Beta!) release. Check it out and let us know if you're still encountering any problems.

Resources:

New documentation for the PARSynthesizer
[Tutorial] for PAR

doolingdavidrs21 added bug Something isn't working pending review labels May 20, 2022

npatki mentioned this issue Jun 1, 2022

Problem with timeseries sequence generation #825

Closed

npatki added data:sequential Related to timeseries datasets and removed pending review labels Jun 3, 2022

npatki mentioned this issue Jun 30, 2022

PAR Model cannot fit columns with dtype period #512

Open

npatki changed the title ~~sequence_index columns must be datetime; integer sequence_index columns can be used for model training, but result in errors when samping to get synthetic data~~ PAR model can fit an integer sequence index but it errors when sampling Jul 8, 2022

yamidibarra mentioned this issue Aug 10, 2022

KeyError while sampling using freshly trained PAR model #943

Closed

npatki changed the title ~~PAR model can fit an integer sequence index but it errors when sampling~~ PAR model sampling error when there is a numerical sequence_index (float, int) Aug 11, 2022

npatki mentioned this issue Oct 5, 2022

KeyError: 'frame' #1056

Closed

npatki closed this as completed Mar 9, 2023

npatki added resolution:resolved The issue was fixed, the question was answered, etc. SDV 1.0 (Beta!) labels Mar 9, 2023
