Add a utility function `get_random_sequence_subset` #2085

npatki · 2024-06-20T21:38:04Z

Problem Description

Subsetting single and multi-table data is easy by using existing functions such as get_random_subset.

But subsetting sequential data is not as easy. Since different rows can belong together (within the same sequence) and have an order, it's not possible to simply select random rows. For such data, it would be helpful to have a utility function to perform the subsetting.

Expected behavior

Add a function to utils called get_random_sequence_subset for use with sequential data.

Parameters:

(required) data: A pandas DataFrame with the sequential data
(required) metadata: A SingleTableMetadata object describing the data
(required) num_sequences: The number of sequences to subsample
max_sequence_length: The maximum length each subsampled sequence is allowed to be
- (default) None: Do not enforce any max length, meaning that entire sequences will be sampled
- int: All subsampled sequences must be <= the provided length
long_sequence_subsampling_method: The method to use when a selected sequence is too long
- (default) first_rows: Keep the first n rows of the sequence, where n is the max sequence length
- last_rows: Keep the last n rows of the sequence, where n is the max sequence length
- random: Randomly choose n rows to keep within the sequence. It is important to keep the randomly chosen rows in the same order as they appear in the original data.
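
To make the three long_sequence_subsampling_method options concrete, here is a minimal sketch of how a single over-long sequence might be shortened. The helper name `trim_sequence` and its `rng` argument are hypothetical and only for illustration; they are not part of the proposed API.

```python
import numpy as np
import pandas as pd


def trim_sequence(sequence, max_sequence_length, method='first_rows', rng=None):
    """Shorten one sequence (a DataFrame of ordered rows) to at most max_sequence_length rows.

    Hypothetical helper: the name and the rng argument are assumptions for this sketch.
    """
    # None means "no limit"; short sequences are returned untouched.
    if max_sequence_length is None or len(sequence) <= max_sequence_length:
        return sequence

    if method == 'first_rows':
        return sequence.head(max_sequence_length)

    if method == 'last_rows':
        return sequence.tail(max_sequence_length)

    if method == 'random':
        if rng is None:
            rng = np.random.default_rng()
        # Choose row positions without replacement, then sort them so the
        # kept rows stay in the same order as in the original data.
        positions = rng.choice(len(sequence), size=max_sequence_length, replace=False)
        return sequence.iloc[np.sort(positions)]

    raise ValueError(f'Unknown long_sequence_subsampling_method: {method!r}')
```

Sorting the sampled positions is what satisfies the ordering requirement for the random option; without it, the kept rows would come back in the order they were drawn.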

from sdv.utils import get_random_sequence_subset

data_subset = get_random_sequence_subset(data, metadata,
  num_sequences=100, 
  max_sequence_length=1000,
  long_sequence_subsampling_method='last_rows')

The function would do the following:

Randomly select sequences according to num_sequences parameter. (Note that the sequence_key is used in determining sequences.)
For each selected sequence, ensure that the length is <= max_sequence_length. If a sequence is longer, use the long_sequence_subsampling_method to shorten it.

Return the shortened pandas DataFrame with the subsampled data. Ensure that the index of the DataFrame has been reset.

Additional context

The metadata must contain a sequence_key -- otherwise it is not multi-sequence data and not really eligible for this type of subsampling. If there is no sequence_key, raise an error.
As a starting point, below is some code we've provided to a user to sample entire sequences. Note that this code does not consider max sequence length at all.

import numpy as np

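def get_random_sequence_subset(data, metadata, num_sequences):
  # Column that identifies which sequence each row belongs to
  sequence_key = metadata.to_dict()['sequence_key']
  unique_sequences = data[sequence_key].unique()
  # replace=False so the same sequence is never drawn twice
  sequence_subset = np.random.choice(unique_sequences, size=num_sequences, replace=False)
  # Keep every row of the chosen sequences, in their original order
  subsetted_data = data[data[sequence_key].isin(sequence_subset)].reset_index(drop=True)
  return subsetted_data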
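
For discussion, here is one way the starting-point code could be extended to cover the behavior requested above: the sequence_key check, max_sequence_length, and the three subsampling methods. This is a sketch under stated assumptions, not the merged implementation. The function name `get_random_sequence_subset_sketch`, the `seed` argument, the capping of num_sequences at the number of available sequences, and the `StubMetadata` class (a stand-in exposing only `to_dict()`, so the example runs without SDV installed) are all hypothetical.

```python
import numpy as np
import pandas as pd


class StubMetadata:
    """Hypothetical stand-in for a metadata object; only to_dict() is used."""

    def __init__(self, sequence_key=None):
        self._sequence_key = sequence_key

    def to_dict(self):
        return {'sequence_key': self._sequence_key} if self._sequence_key else {}


def get_random_sequence_subset_sketch(data, metadata, num_sequences,
                                      max_sequence_length=None,
                                      long_sequence_subsampling_method='first_rows',
                                      seed=None):
    sequence_key = metadata.to_dict().get('sequence_key')
    if sequence_key is None:
        raise ValueError('The metadata must define a sequence_key to subsample sequences.')

    methods = ('first_rows', 'last_rows', 'random')
    if long_sequence_subsampling_method not in methods:
        raise ValueError(f'long_sequence_subsampling_method must be one of {methods}.')

    rng = np.random.default_rng(seed)  # seed is an assumption, added for reproducibility
    unique_sequences = data[sequence_key].unique()
    # Assumption: cap the request at the number of sequences that exist.
    size = min(num_sequences, len(unique_sequences))
    selected = rng.choice(unique_sequences, size=size, replace=False)
    subset = data[data[sequence_key].isin(selected)]

    if max_sequence_length is not None:
        grouped = subset.groupby(sequence_key, sort=False)
        if long_sequence_subsampling_method == 'first_rows':
            subset = grouped.head(max_sequence_length)
        elif long_sequence_subsampling_method == 'last_rows':
            subset = grouped.tail(max_sequence_length)
        else:
            # grouped.indices maps each sequence to its row positions in subset.
            keep = []
            for positions in grouped.indices.values():
                if len(positions) > max_sequence_length:
                    positions = rng.choice(positions, size=max_sequence_length, replace=False)
                keep.append(positions)
            # Sorting all kept positions preserves the original row order.
            keep = np.sort(np.concatenate(keep)) if keep else np.array([], dtype=int)
            subset = subset.iloc[keep]

    return subset.reset_index(drop=True)


data = pd.DataFrame({
    'id': ['a'] * 5 + ['b'] * 2 + ['c'] * 4,
    't': list(range(5)) + list(range(2)) + list(range(4)),
})
subset = get_random_sequence_subset_sketch(
    data, StubMetadata('id'), num_sequences=2,
    max_sequence_length=3, long_sequence_subsampling_method='last_rows', seed=0)
```

Using groupby head and tail keeps rows in their original order and avoids a Python-level loop for the two deterministic methods; only the random method needs per-sequence sampling.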

amontanez24 · 2024-06-26T21:19:19Z

@npatki Should this also be in poc?

npatki added the feature request (Request for a new feature) and data:sequential (Related to timeseries datasets) labels Jun 20, 2024

amontanez24 mentioned this issue Jun 27, 2024

Add a utility function get_random_sequence_subset #2098

Merged

amontanez24 closed this as completed in #2098 Jul 2, 2024

amontanez24 self-assigned this Jul 9, 2024

amontanez24 added this to the 1.15.0 milestone Jul 9, 2024
