We should create a utility function that will allow users to coherently subsample a very large dataset so that it can be used with HMA.
Expected behavior
Add a new function, get_random_subset, to the utils.poc module.
>>> from sdv.utils import poc
>>> small_dataset = poc.get_random_subset(
...     data,
...     metadata,
...     main_table_name='transactions',
...     num_rows=1000,
...     verbose=True
... )
Success! Your subset has 90% less rows than the original.
Table Name      # Rows (Original)    # Rows (Subset)
sessions        1200                 120
transactions    5000                 200
Parameters
data [dict] - The data dictionary
metadata [MultiTableMetadata] - The metadata for the data
main_table_name [str] - The main table to consider when subsampling
num_rows [int] - The number of rows to subsample from the main table
verbose [bool, optional] - Whether to print a summary of the results of subsampling. Defaults to True.
Returns
The data dictionary containing the subsampled tables
If verbose is True, it should also print a summary of what the function did:
The total percentage of rows that were dropped (i.e. 1 - total number of rows in the subsampled data / total number of rows in the original data)
For each table, the original number of rows in the table and the number of rows in the subsampled table
Algorithm Overview
[For disconnected schemas, which we don't currently support but may in the future]
Calculate the ratio of num_rows to the original table size for the main table
For every root table that is disconnected from the main table:
subsample the root table using ratio found above
Randomly sample num_rows rows from the main table
If the main table has any parents, for each parent:
If all parent rows were referenced in the original main table, drop all parent rows that are no longer referenced by the subsampled main table
If there were parent rows that were not referenced (aka childless parent rows) in the original main table, drop any rows that had a reference and are now no longer referenced. Determine the percentage of referenced rows that were dropped, and randomly drop the same percentage of the originally unreferenced parent rows
Repeat this process for grandparents, great-grandparents, etc. Note that if we have e.g. a diamond-shaped relationship (the main table has 2 parents that share the same parent), we want to keep all rows in the grandparent that are referenced by either parent.
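The parent-pruning rule above can be sketched in plain pandas for a single parent/child pair. This is only an illustration under assumptions: subsample_parent, its arguments, and the toy tables are hypothetical and not part of SDV, and the grandparent and diamond cases are left out.

```python
import pandas as pd


def subsample_parent(parent, child_before, child_after, pk, fk, seed=None):
    """Prune one parent table after its child table has been subsampled.

    Hypothetical helper for illustration; not an SDV function.
    """
    was_referenced = parent[pk].isin(child_before[fk])
    still_referenced = parent[pk].isin(child_after[fk])

    # Always keep parent rows that the subsampled child still points to.
    kept = parent[still_referenced]

    childless = parent[~was_referenced]
    n_referenced = was_referenced.sum()
    if not childless.empty and n_referenced > 0:
        # Keep childless rows at the same rate at which referenced rows survived,
        # i.e. drop the same percentage from both groups.
        keep_ratio = still_referenced.sum() / n_referenced
        childless = childless.sample(frac=keep_ratio, random_state=seed)
        kept = pd.concat([kept, childless])

    return kept.sort_index()


# Toy data: sessions 0-3 have transactions, sessions 4-9 are childless.
sessions = pd.DataFrame({'session_id': range(10)})
transactions = pd.DataFrame({
    'transaction_id': range(8),
    'session_id': [0, 0, 1, 1, 2, 2, 3, 3],
})
# Stand-in for the random sample of the main table (fixed here for clarity).
sampled_transactions = transactions.iloc[:4]

small_sessions = subsample_parent(
    sessions, transactions, sampled_transactions,
    pk='session_id', fk='session_id', seed=0,
)
```

In this toy case half of the referenced sessions survive, so half of the childless sessions are kept as well. In the real function the main table would be drawn with something like transactions.sample(n=num_rows) rather than a fixed slice.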
Use drop_unknown_references to enforce referential integrity and drop rows from the descendant tables. Note that this should not change the size of the main table since we only drop unreferenced rows from the parent tables.
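To make the intended effect on descendant tables concrete, here is a pandas-only sketch of dropping child rows whose parent no longer exists. It does not call drop_unknown_references itself; drop_orphans and the refunds table are hypothetical names used only for this example.

```python
import pandas as pd


def drop_orphans(child, parent, fk, pk):
    """Keep only child rows whose foreign key matches a surviving parent row.

    Hypothetical helper for illustration; not an SDV function.
    """
    return child[child[fk].isin(parent[pk])]


# Subsampled main table: transaction 11 was not selected.
sampled_transactions = pd.DataFrame({'transaction_id': [10, 12]})

# A descendant of the main table (assumed table, for illustration only).
refunds = pd.DataFrame({
    'refund_id': [1, 2, 3, 4],
    'transaction_id': [10, 10, 11, 12],
})

refunds_subset = drop_orphans(
    refunds, sampled_transactions, fk='transaction_id', pk='transaction_id'
)
```

Only the refund pointing at the dropped transaction 11 is removed, and the main table itself is untouched.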
Perform validation:
If any subsampled table has no rows, raise an error and suggest re-trying or increasing the num_rows parameter
This could happen if a parent is aggressively subsampled, causing drop_unknown_references to wipe out a child
Since we are randomly sampling, re-trying may give a better result
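A minimal sketch of the empty-table check, assuming the subset is a dict mapping table names to DataFrames. The function name validate_subset, the exception type, and the error wording are assumptions, not a decided API.

```python
import pandas as pd


def validate_subset(subset):
    """Raise if any subsampled table ended up with zero rows.

    Hypothetical helper for illustration; not an SDV function.
    """
    empty_tables = [name for name, table in subset.items() if len(table) == 0]
    if empty_tables:
        raise ValueError(
            f'Subsampling left no rows in: {empty_tables}. '
            "Try again or increase the 'num_rows' parameter."
        )


# Passes silently when every table still has rows.
validate_subset({'sessions': pd.DataFrame({'session_id': [1, 2]})})
```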
If verbose, print a summary of how the data was subsampled:
Percentage of rows dropped (1 - total number of subsampled rows / original total number of rows)
For each table, print original number of rows vs subsampled number of rows
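The summary could be computed along these lines. This is a sketch that assumes both inputs are dicts of DataFrames with the same table names; summarize_subset and the column widths are illustrative choices, and the message wording mirrors the example output earlier in this issue.

```python
import pandas as pd


def summarize_subset(original, subset):
    """Return the percentage of rows dropped and a printable summary.

    Hypothetical helper for illustration; not an SDV function.
    """
    total_before = sum(len(table) for table in original.values())
    total_after = sum(len(table) for table in subset.values())
    # Dropped share is the complement of the share of rows that were kept.
    pct_dropped = 100 * (1 - total_after / total_before)

    lines = [
        f'Success! Your subset has {pct_dropped:.0f}% less rows than the original.',
        f"{'Table Name':<16}{'# Rows (Original)':<21}{'# Rows (Subset)'}",
    ]
    for name in original:
        lines.append(f'{name:<16}{len(original[name]):<21}{len(subset[name])}')
    return pct_dropped, '\n'.join(lines)


# Toy tables: 50 rows in total before subsampling, 10 after.
original = {
    'sessions': pd.DataFrame({'session_id': range(10)}),
    'transactions': pd.DataFrame({'transaction_id': range(40)}),
}
subset = {
    'sessions': original['sessions'].head(2),
    'transactions': original['transactions'].head(8),
}

pct_dropped, report = summarize_subset(original, subset)
print(report)
```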