Improve quality of `sequence_index`: Move the start dates into the context model #1760

npatki · 2024-01-26T20:34:17Z

Problem Description

In multi-sequence data, a sequence index is used to denote both the order of sequences as well as the intervals between them. Usually, this column is datetime (but it may also be numerical if the user is storing absolute time values). In this example dataset, the sequence_index is the Date column.

The sequence_index is generally supposed to adhere to some special rules. However, when using the PARSynthesizer as-is, users have reported that various properties are not being learned and the rules are being broken.

The sequence_index is supposed to be unique, in order to uniquely determine the sequence order. Users are finding duplicates (see PARSynthesizer: dates duplicates in synthetic data #1723),
The sequence_index value should be increasing, as most sequential datasets are presented in order. Users are finding that the value sometimes decreases (see Sequence index values should be strictly increasing in the synthetic data #466)
The above two rules also imply that for very long sequences, sequence_index should continue on into the future. Users are finding that the current index values are very limited in range (see PARSynthesizer creates limited ranges (and is unable to forecast past the max date) #1752)

Root Cause

In an investigation with @amontanez24 and @frances-h, we found a potential root cause:

During Fit: This function is computing the diffs between rows (as expected) but it is also storing the start sequence index as a single, static value within the sequence. PAR cannot effectively learn a different static value per sequence.

SDV/sdv/sequential/par.py

Lines 205 to 212 in e6e508b

    
           def _transform_sequence_index(self, sequences): 
        
               sequence_index_idx = self._data_columns.index(self._sequence_index) 
        
               for sequence in sequences: 
        
                   data = sequence['data'] 
        
                   sequence_index = data[sequence_index_idx] 
        
                   diffs = np.diff(sequence_index).tolist() 
        
                   data[sequence_index_idx] = diffs[0:1] + diffs 
        
                   data.append(sequence_index[0:1] * len(sequence_index))

During Sampling: The diffs between each rows are being added up (as expected) but they are all using a different starting value due to the issue above.

SDV/sdv/sequential/par.py

Lines 299 to 303 in e6e508b

    
           if self._sequence_index: 
        
               sequence_index_idx = self._data_columns.index(self._sequence_index) 
        
               diffs = sequence[sequence_index_idx] 
        
               start = sequence.pop(-1) 
        
               sequence[sequence_index_idx] = np.cumsum(diffs) - diffs[0] + start

Solution

During fit:

We should continue computing the differences between rows. But the static value (sequence index) should not be added to the sequence for modeling using PAR.
Instead, the start value for each sequence should be added to the context model. This means that we only model 1 single starting value for each sequence.

During sample:

The context model will synthesize a single starting value per sequence
Now for each sequence, we can use that constant starting value and sum up the differences

Additional Notes

Should the difference column be put through a FloatFormatter with min/max clipping? Ideally, we create differences that are too high or too low
Note that the sequence_index is an optional concept so we should very that PAR can continue to work without it

The text was updated successfully, but these errors were encountered:

npatki added bug Something isn't working data:sequential Related to timeseries datasets labels Jan 26, 2024

This was referenced Jan 26, 2024

PARSynthesizer: dates duplicates in synthetic data #1723

Closed

Sequence index values should be strictly increasing in the synthetic data #466

Closed

PARSynthesizer creates limited ranges (and is unable to forecast past the max date) #1752

Closed

frances-h mentioned this issue Jan 31, 2024

Improve quality of sequence_index #1765

Merged

frances-h closed this as completed in #1765 Feb 5, 2024

amontanez24 added this to the 1.10.0 milestone Feb 8, 2024

amontanez24 assigned frances-h Feb 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve quality of `sequence_index`: Move the start dates into the context model #1760

Improve quality of `sequence_index`: Move the start dates into the context model #1760

npatki commented Jan 26, 2024 •

edited

Improve quality of sequence_index: Move the start dates into the context model #1760

Improve quality of sequence_index: Move the start dates into the context model #1760

Comments

npatki commented Jan 26, 2024 • edited

Problem Description

Root Cause

Solution

Additional Notes

Improve quality of `sequence_index`: Move the start dates into the context model #1760

Improve quality of `sequence_index`: Move the start dates into the context model #1760

npatki commented Jan 26, 2024 •

edited