Add demo loading functionality #1128

Closed
amontanez24 opened this issue Dec 2, 2022 · 3 comments

@amontanez24
Contributor

Problem Description

As a user, I would like to access demo data that I can use to try out the SDV in a sandbox-like environment.

We want functionality similar to the current demo module, which lets users see the available demo datasets and download them.

Acceptance criteria

  • Add a file called demo.py to the datasets module
  • Add a function called download_demo
    • The function should have the following parameters:
      • (required) modality: One of: 'single_table', 'multi_table', 'sequential'
      • (required) dataset_name: A string with the name of a dataset
      • output_folder_name: The name of the local folder where the metadata and data should be stored
        • (default) None: Do not save the data locally. Just load it as Python objects.
        • '': Create a subfolder in the desired location to store the data. Note: only store metadata_v1.json and not metadata_v0.json.
    • Returns tuple of (data, metadata)
      • data:
        • (single table and sequential) A pandas DataFrame object
        • (multi table) A dictionary mapping table name (string) to a pandas DataFrame object
      • metadata
        • (single table and sequential) A SingleTableMetadata object
        • (multi table) A MultiTableMetadata object
    • Errors
      • If dataset name isn't provided:
        Error: Missing required parameter 'dataset_name'.
      • If the dataset name exists in our bucket but under a different modality:
        Error: Dataset name '<name>' is a <modality> dataset. Use 'load_<modality>_demo' to load this dataset.
        For example:
        Error: Dataset name 'heart_rate' is a sequential dataset. Use 'load_sequential_demo' to load this dataset.
      • If the dataset name doesn't exist in our bucket:
        Error: Invalid dataset name 'dataset_1'. Use 'list_available_demos' to get a list of demo datasets.
      • If the output folder already exists:
        Error: Folder 'my_datasets/student_placements/' already exists. Please specify a different name or use 'load_from_csv' to load from an existing folder.

Expected behavior

from sdv.datasets.demo import download_demo

data, metadata = download_demo(
  modality='single_table',
  dataset_name='student_placements',
  output_folder_name='my_datasets/student_placements')

Additional context

  • We should stop using the bucket that currently serves the demo data and switch to the new demo data bucket. The new bucket has a designated folder for each modality, which should make downloading the correct data easier.
  • We also no longer want to save the CSV every time and load the data from there. Instead, we should download the data directly from the bucket and build the DataFrame from the bytes that S3 returns.
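
A minimal sketch of the in-memory approach described in the last point, assuming the bucket returns each table as raw CSV bytes (the storage layout is not specified in this issue). The helper name dataframe_from_bytes is hypothetical, and the bytes literal stands in for the body of an S3 response.

```python
import io

import pandas as pd


def dataframe_from_bytes(raw: bytes) -> pd.DataFrame:
    """Parse CSV content held in memory, without writing a file to disk."""
    # BytesIO wraps the bytes in a file-like object that pandas can read.
    return pd.read_csv(io.BytesIO(raw))


# Stand-in for the body of an S3 response (made-up rows, not real demo data).
raw = b"student_id,placed\n1,True\n2,False\n"
df = dataframe_from_bytes(raw)
```

The same wrapper would apply to whatever object the S3 client hands back, as long as its content can be read into bytes first.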
@amontanez24 amontanez24 added the feature request Request for a new feature label Dec 2, 2022
@amontanez24 amontanez24 added this to the 1.0.0 milestone Dec 2, 2022
@fealho
Member

fealho commented Dec 13, 2022

  1. The error "if dataset name isn't provided" shouldn't be necessary, since the call already fails on its own if you don't pass a required param.

  2. We should add another error for when modality is not one of the 3 allowed values.

  3. I'm assuming you meant load_csvs instead of load_from_csv in the final error message.
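
The modality check suggested in point 2 could look roughly like the sketch below. The function name validate_modality, the ValueError type and the message wording are all assumptions for illustration; none of them come from the SDV code base.

```python
MODALITIES = ('single_table', 'multi_table', 'sequential')


def validate_modality(modality: str) -> None:
    """Raise if ``modality`` is not one of the three supported values."""
    if modality not in MODALITIES:
        allowed = ', '.join(repr(m) for m in MODALITIES)
        # Message wording is illustrative, not taken from the SDV code base.
        raise ValueError(f"'modality' must be one of: {allowed}. Got {modality!r}.")
```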

@amontanez24
Contributor Author

@npatki I believe the error messages are outdated now, and I think the error conditions need some changes.

  1. In the case where the dataset name exists in our bucket but under a different modality, I think it would be weird to raise an error and tell them to change the modality they pass. We should either raise a warning and return it anyway, or just crash and say perhaps it is in a different modality. To actually confirm that it is in a different folder but not return it seems strange.

  2. If the dataset name doesn't exist in our bucket:
    Error: Invalid dataset name 'dataset_1'. Use 'list_available_demos' to get a list of demo datasets.
    In this case, list_available_demos should be get_available_demos

  3. If the output folder already exists
    Error: Folder 'my_datasets/student_placements/' already exists. Please specify a different name or use 'load_from_csv' to load from an existing folder.
    In this case load_from_csv should be load_csvs

What are your thoughts?

@npatki
Contributor

npatki commented Dec 16, 2022

@amontanez24 agreed on all points.

"We should either raise a warning and return it anyway, or just crash and say perhaps it is in a different modality."

The same error message as (2) would suffice for this case too.
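
To summarize the outcome of this thread, here is a rough sketch of the two remaining checks with the corrected function names. CATALOG is a hypothetical stand-in for the bucket listing, check_request is an illustrative name, and using ValueError is an assumption; only the message texts follow the discussion above.

```python
from pathlib import Path

# Hypothetical stand-in for the bucket listing: modality -> dataset names.
CATALOG = {
    'single_table': {'student_placements'},
    'sequential': {'heart_rate'},
}


def check_request(modality, dataset_name, output_folder_name=None, catalog=CATALOG):
    """Raise ValueError for the two error cases agreed in this thread."""
    # A name that is missing entirely, or present only under another
    # modality, gets the same message.
    if dataset_name not in catalog.get(modality, set()):
        raise ValueError(
            f"Invalid dataset name '{dataset_name}'. "
            "Use 'get_available_demos' to get a list of demo datasets."
        )
    # Refuse to write into a folder that is already there.
    if output_folder_name is not None and Path(output_folder_name).exists():
        raise ValueError(
            f"Folder '{output_folder_name}' already exists. Please specify a "
            "different name or use 'load_csvs' to load from an existing folder."
        )
```

With this shape, a request for 'heart_rate' under 'single_table' fails the same way as a request for a name that does not exist at all.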
