Add get_random_subset poc utility function #1877

Closed
frances-h opened this issue Mar 27, 2024 · 0 comments · Fixed by #1928
Labels
feature request Request for a new feature

Problem Description

We should create a utility function that will allow users to coherently subsample a very large dataset so that it can be used with HMA.

Expected behavior

Add a new function, get_random_subset, to the utils.poc module.

>>> from sdv.utils import poc

>>> small_dataset = poc.get_random_subset(
...     data,
...     metadata,
...     main_table_name='transactions',
...     num_rows=1000,
...     verbose=True
... )
Success! Your subset has 82% less rows than the original.

Table Name    # Rows (Original)    # Rows (Subset)
sessions      1200                 120
transactions  5000                 1000

Parameters

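  • data [dict] - The data dictionary, mapping each table name to its pandas DataFrame
  • metadata [MultiTableMetadata] - The metadata describing the tables in data and the relationships between them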
  • main_table_name [str] - The main table to consider when subsampling
  • num_rows [int] - The number of rows to subsample from the main table
  • verbose [bool, optional] - Whether to print a summary of the results of subsampling. Defaults to True.

Returns

  • The data dictionary containing the subsampled tables
  • If verbose is True, it should also print a summary of what the function did:
    • The total percentage of rows that were dropped (i.e. 1 - total number of rows in the subsampled data / total number of rows in the original data)
    • For each table, the original number of rows in the table and the number of rows in the subsampled table
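A minimal sketch of how the verbose summary could be assembled, assuming the dropped percentage is one minus the ratio of subsampled to original row counts. The helper name summarize_subset is hypothetical and not part of SDV; it only relies on len(), so it accepts pandas DataFrames as well as the plain lists used here, and it assumes the original data has at least one row.

```python
def summarize_subset(original, subset):
    """Build the verbose summary text (hypothetical helper, not an SDV function).

    `original` and `subset` map table names to any sized table object.
    """
    total_before = sum(len(table) for table in original.values())
    total_after = sum(len(subset[name]) for name in original)
    if total_before == 0:
        raise ValueError('The original data has no rows.')

    # The fraction dropped is one minus the fraction kept.
    percent_dropped = round(100 * (1 - total_after / total_before))

    header = ('Table Name', '# Rows (Original)', '# Rows (Subset)')
    rows = [
        (name, str(len(original[name])), str(len(subset[name])))
        for name in original
    ]
    # Pad each column to its widest cell so the table lines up.
    widths = [max(len(row[i]) for row in [header, *rows]) for i in range(3)]

    lines = [
        f'Success! Your subset has {percent_dropped}% less rows than the original.',
        '',
    ]
    for row in [header, *rows]:
        padded = (cell.ljust(width) for cell, width in zip(row, widths))
        lines.append('  '.join(padded).rstrip())
    return '\n'.join(lines)


# Stand-in tables: only the row counts matter for the summary.
original = {'sessions': [None] * 1200, 'transactions': [None] * 5000}
subset = {'sessions': [None] * 120, 'transactions': [None] * 500}
print(summarize_subset(original, subset))
```

Returning the text instead of printing it inside the helper keeps the summary easy to test; the caller can print it only when verbose is True.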

Algorithm Overview

  • [For disconnected schemas, which we don't currently support but may in the future]
    • Calculate the ratio of num_rows to the original size of the main table
    • For every root table that is disconnected from the main table:
      • Subsample the root table using the ratio found above
  • Randomly sample num_rows rows from the main table
  • If the main table has any parents, for each parent:
    • If all parent rows were referenced in the original main table, drop all parent rows that are no longer referenced by the subsampled main table
    • If some parent rows were not referenced in the original main table (aka childless parent rows), drop every parent row that was referenced before and is no longer referenced. Then determine the percentage of referenced rows that were dropped, and randomly drop the same percentage of the originally unreferenced parent rows
    • Repeat this process for grandparents, great-grandparents, etc. Note that if we have e.g. a diamond-shaped relationship (the main table has 2 parents that share the same parent), we want to keep all rows in the grandparent that are referenced by either parent.
  • Use drop_unknown_references to enforce referential integrity by dropping orphaned rows from the descendant tables. Note that this should not change the size of the main table, since we only drop unreferenced rows from the parent tables.
  • Perform validation:
    • If any subsampled table has no rows, raise an error and suggest re-trying or increasing the num_rows parameter
      • This could happen if a parent is aggressively subsampled, causing drop_unknown_references to wipe out a child
      • Since we are randomly sampling, re-trying may give a better result
  • If verbose is True, print a summary of how the data was subsampled:
    • Percentage of rows dropped (1 - total number of subsampled rows / original total number of rows)
    • For each table, print original number of rows vs subsampled number of rows
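The core of the steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the SDV implementation: tables are plain lists of dict rows rather than pandas DataFrames (to keep the sketch dependency-free), relationships are dicts with the keys parent_table_name, child_table_name, parent_primary_key and child_foreign_key, each parent links to the main table through a single foreign key, and foreign keys are never null. Only the main table's direct parents are subsampled; grandparents, diamond-shaped schemas and disconnected tables are left out. All function names here are hypothetical.

```python
import random


def subsample_parent(parent_rows, primary_key, original_fks, kept_fks, rng):
    """Subsample one parent of the main table.

    original_fks / kept_fks are the sets of foreign key values found in the
    main table before and after it was subsampled.
    """
    referenced = [row for row in parent_rows if row[primary_key] in original_fks]
    childless = [row for row in parent_rows if row[primary_key] not in original_fks]

    # Referenced parents survive only if the subsampled main table still uses them.
    keep_keys = {row[primary_key] for row in referenced if row[primary_key] in kept_fks}

    # Keep the same share of childless parents as survived among referenced ones.
    keep_ratio = len(keep_keys) / len(referenced) if referenced else 1.0
    kept_childless = rng.sample(childless, round(len(childless) * keep_ratio))
    keep_keys.update(row[primary_key] for row in kept_childless)

    return [row for row in parent_rows if row[primary_key] in keep_keys]


def drop_unknown_references_sketch(tables, relationships):
    """Repeatedly drop child rows whose foreign key has no matching parent row."""
    changed = True
    while changed:
        changed = False
        for rel in relationships:
            parent_keys = {
                row[rel['parent_primary_key']]
                for row in tables[rel['parent_table_name']]
            }
            child, fk = rel['child_table_name'], rel['child_foreign_key']
            kept = [row for row in tables[child] if row[fk] in parent_keys]
            if len(kept) != len(tables[child]):
                tables[child], changed = kept, True
    return tables


def random_subset_sketch(data, relationships, main_table_name, num_rows, seed=None):
    """Subsample the main table, its direct parents, then clean up descendants."""
    rng = random.Random(seed)
    main_rows = data[main_table_name]
    subset = dict(data)  # shallow copy: tables are replaced, never mutated
    subset[main_table_name] = rng.sample(main_rows, num_rows)

    for rel in relationships:
        if rel['child_table_name'] != main_table_name:
            continue
        fk, parent = rel['child_foreign_key'], rel['parent_table_name']
        subset[parent] = subsample_parent(
            data[parent],
            rel['parent_primary_key'],
            {row[fk] for row in main_rows},
            {row[fk] for row in subset[main_table_name]},
            rng,
        )

    subset = drop_unknown_references_sketch(subset, relationships)

    # Validation: an empty table means the random draw was too aggressive.
    empty = [name for name, rows in subset.items() if not rows]
    if empty:
        raise ValueError(
            f'Tables {empty} have no rows left. Try again or increase num_rows.'
        )
    return subset
```

Because every parent row still referenced by the subsampled main table is kept, the referential-integrity pass only removes rows from other tables, so the main table keeps exactly num_rows rows. Passing a seed makes a run reproducible, which helps when retrying after the validation error.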