Problem Description
Subsetting single and multi-table data is easy with existing functions such as get_random_subset.
Subsetting sequential data is harder. Because rows can belong together (within the same sequence) and have an order, it's not possible to simply select random rows. For such data, it would be helpful to have a utility function that performs the subsetting.
Expected behavior
Add a function to utils called get_random_sequence_subset to be used by sequential data.
Parameters:
(required) data: A pandas DataFrame with the sequential data
(required) metadata: A SingleTableMetadata object describing the data
(required) num_sequences: The number of sequences to subsample
max_sequence_length: The maximum length each subsampled sequence is allowed to be
  (default) None: Do not enforce any max length, meaning that entire sequences will be sampled
  int: All subsampled sequences must be <= the provided length
long_sequence_subsampling_method: The method to use when a selected sequence is too long
  (default) first_rows: Keep the first n rows of the sequence, where n is the max sequence length
  last_rows: Keep the last n rows of the sequence, where n is the max sequence length
  random: Randomly choose n rows to keep within the sequence. It is important to keep the randomly chosen rows in the same order as they appear in the original data.
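To make the three long_sequence_subsampling_method options concrete, here is a minimal sketch of how a single sequence might be shortened. The helper name shorten_sequence and its rng argument are hypothetical, chosen only for illustration; they are not part of the proposed API.

```python
import numpy as np
import pandas as pd


def shorten_sequence(sequence, max_sequence_length, method="first_rows", rng=None):
    """Trim one sequence (a DataFrame of ordered rows) to at most max_sequence_length rows.

    Hypothetical helper for illustration only; not an existing SDV function.
    """
    # None means "no limit"; short sequences are returned unchanged.
    if max_sequence_length is None or len(sequence) <= max_sequence_length:
        return sequence
    if method == "first_rows":
        return sequence.head(max_sequence_length)
    if method == "last_rows":
        return sequence.tail(max_sequence_length)
    if method == "random":
        rng = rng if rng is not None else np.random.default_rng()
        positions = rng.choice(len(sequence), size=max_sequence_length, replace=False)
        # Sort the chosen positions so the kept rows stay in their original order.
        return sequence.iloc[np.sort(positions)]
    raise ValueError(f"Unknown long_sequence_subsampling_method: {method!r}")
```

Sorting the sampled positions before indexing is what satisfies the requirement that randomly chosen rows keep their original order.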
The function would do the following:
Randomly select sequences according to the num_sequences parameter. (Note that the sequence_key is used to determine which rows belong to which sequence.)
For each selected sequence, ensure that its length is <= max_sequence_length. If a sequence is longer, use the long_sequence_subsampling_method to shorten it.
Return the shortened pandas DataFrame with the subsampled data. Ensure that the index of the DataFrame has been reset.
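The steps above could be sketched roughly as follows. This is only an illustration under several assumptions, not a proposed final implementation: the metadata is read through a sequence_key attribute (duck-typed here so the sketch runs without SDV), the seed argument is an addition for reproducibility, and a num_sequences larger than the number of available sequences is clamped rather than rejected.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def get_random_sequence_subset(data, metadata, num_sequences, max_sequence_length=None,
                               long_sequence_subsampling_method="first_rows", seed=None):
    """Sketch: subsample whole sequences, optionally capping each sequence's length."""
    # Assumption: the metadata object exposes the key as a `sequence_key` attribute.
    sequence_key = getattr(metadata, "sequence_key", None)
    if sequence_key is None:
        raise ValueError("The metadata must contain a sequence_key to subsample sequences.")

    rng = np.random.default_rng(seed)  # `seed` is a hypothetical extra parameter

    # Step 1: randomly pick sequence keys (clamping is an assumption, see lead-in).
    keys = data[sequence_key].drop_duplicates()
    num_sequences = min(num_sequences, len(keys))
    chosen = keys.iloc[rng.choice(len(keys), size=num_sequences, replace=False)]
    subset = data[data[sequence_key].isin(chosen)]

    # Step 2: shorten sequences that exceed max_sequence_length.
    if max_sequence_length is not None:
        method = long_sequence_subsampling_method
        if method == "first_rows":
            subset = subset.groupby(sequence_key, sort=False).head(max_sequence_length)
        elif method == "last_rows":
            subset = subset.groupby(sequence_key, sort=False).tail(max_sequence_length)
        elif method == "random":
            # Give every row a random score, rank scores within each sequence, and
            # keep the n lowest ranks. A boolean mask preserves the original row order.
            scores = pd.Series(rng.random(len(subset)))
            ranks = scores.groupby(subset[sequence_key].to_numpy()).rank(method="first")
            subset = subset[(ranks <= max_sequence_length).to_numpy()]
        else:
            raise ValueError(f"Unknown long_sequence_subsampling_method: {method!r}")

    # Step 3: return the result with a fresh index.
    return subset.reset_index(drop=True)


# Usage with a stand-in metadata object (a real SingleTableMetadata would be passed instead).
example_data = pd.DataFrame({
    "id": [key for key in "abcd" for _ in range(5)],
    "t": list(range(5)) * 4,
})
example_metadata = SimpleNamespace(sequence_key="id")
example_subset = get_random_sequence_subset(
    example_data, example_metadata, num_sequences=2, max_sequence_length=3, seed=1
)
```

Selecting rows with masks and grouped head/tail, instead of concatenating per-sequence pieces, keeps the rows in the order they had in the original data.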
Additional context
The metadata must contain a sequence_key -- otherwise it is not multi-sequence data and is not eligible for this type of subsampling. If there is no sequence_key, raise an error.
As a starting point, below is some code we've provided to a user to sample entire sequences. Note that this code does not consider max sequence length at all.