Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve quality of sequence_index: Move the start dates into the context model #1760

Closed
npatki opened this issue Jan 26, 2024 · 0 comments · Fixed by #1765
Closed

Improve quality of sequence_index: Move the start dates into the context model #1760

npatki opened this issue Jan 26, 2024 · 0 comments · Fixed by #1765
Assignees
Labels
bug Something isn't working data:sequential Related to timeseries datasets
Milestone

Comments

@npatki
Copy link
Contributor

npatki commented Jan 26, 2024

Problem Description

In multi-sequence data, a sequence index is used to denote both the order of sequences as well as the intervals between them. Usually, this column is datetime (but it may also be numerical if the user is storing absolute time values). In this example dataset, the sequence_index is the Date column.

The sequence_index is generally supposed to adhere to some special rules. However, when using the PARSynthesizer as-is, users have reported that various properties are not being learned and the rules are being broken.

  1. The sequence_index is supposed to be unique, in order to uniquely determine the sequence order. Users are finding duplicates (see PARSynthesizer: dates duplicates in synthetic data #1723),
  2. The sequence_index value should be increasing, as most sequential datasets are presented in order. Users are finding that the value sometimes decreases (see Sequence index values should be strictly increasing in the synthetic data #466)
  3. The above two rules also imply that for very long sequences, sequence_index should continue on into the future. Users are finding that the current index values are very limited in range (see PARSynthesizer creates limited ranges (and is unable to forecast past the max date) #1752)

Root Cause

In an investigation with @amontanez24 and @frances-h, we found a potential root cause:

During Fit: This function is computing the diffs between rows (as expected) but it is also storing the start sequence index as a single, static value within the sequence. PAR cannot effectively learn a different static value per sequence.

SDV/sdv/sequential/par.py

Lines 205 to 212 in e6e508b

def _transform_sequence_index(self, sequences):
sequence_index_idx = self._data_columns.index(self._sequence_index)
for sequence in sequences:
data = sequence['data']
sequence_index = data[sequence_index_idx]
diffs = np.diff(sequence_index).tolist()
data[sequence_index_idx] = diffs[0:1] + diffs
data.append(sequence_index[0:1] * len(sequence_index))

During Sampling: The diffs between each rows are being added up (as expected) but they are all using a different starting value due to the issue above.

SDV/sdv/sequential/par.py

Lines 299 to 303 in e6e508b

if self._sequence_index:
sequence_index_idx = self._data_columns.index(self._sequence_index)
diffs = sequence[sequence_index_idx]
start = sequence.pop(-1)
sequence[sequence_index_idx] = np.cumsum(diffs) - diffs[0] + start

Solution

During fit:

  • We should continue computing the differences between rows. But the static value (sequence index) should not be added to the sequence for modeling using PAR.
  • Instead, the start value for each sequence should be added to the context model. This means that we only model 1 single starting value for each sequence.

During sample:

  • The context model will synthesize a single starting value per sequence
  • Now for each sequence, we can use that constant starting value and sum up the differences

Additional Notes

  • Should the difference column be put through a FloatFormatter with min/max clipping? Ideally, we create differences that are too high or too low
  • Note that the sequence_index is an optional concept so we should very that PAR can continue to work without it
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working data:sequential Related to timeseries datasets
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants