Update the benchmark to load datasets from multiple S3 buckets when None is specified #604

@sarahmish

Problem Description

By default, the benchmark downloads data from the S3 bucket that we use for SDV demo datasets. In some cases, we also want to load datasets from multiple buckets. To support this, we can create an indexing function that tells us which datasets are available in each bucket, and then load each dataset from the bucket that contains it.
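As a rough illustration of the indexing idea, the sketch below maps each dataset name to the first bucket that lists it. It assumes the per-bucket dataset listings have already been collected; the function names and the bucket and dataset names in the usage note are hypothetical and do not come from the benchmark code.

```python
def build_dataset_index(datasets_by_bucket):
    """Map each dataset name to the first bucket that lists it.

    `datasets_by_bucket` is assumed to be a dict of
    bucket name -> iterable of dataset names (hypothetical input shape).
    Buckets are visited in insertion order, so earlier buckets take priority.
    """
    index = {}
    for bucket, dataset_names in datasets_by_bucket.items():
        for name in dataset_names:
            # setdefault keeps the first bucket seen for a dataset name.
            index.setdefault(name, bucket)
    return index


def resolve_bucket(index, dataset_name):
    """Return the bucket that holds `dataset_name`, or raise if none does."""
    if dataset_name not in index:
        raise ValueError(f"Dataset '{dataset_name}' was not found in any bucket.")
    return index[dataset_name]
```

For example, with a hypothetical listing `{'public-bucket': ['adult', 'census'], 'private-bucket': ['census', 'internal_sales']}`, `resolve_bucket` would send `census` to `public-bucket` because that bucket comes first. Whether the first bucket should win on a name clash is a design choice the issue leaves open.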

Expected behavior

Right now, if the bucket is not specified, it defaults to the public bucket (see line). We want to update this so that it inspects our own buckets (both public and private).

  • Allow bucket to be a list of buckets.
  • Validate that you can access the buckets with the given credentials. If any bucket is inaccessible, raise an error.
  • If bucket is a list, build the index of available datasets with get_available_demos or _generate_dataset_info so that datasets can be searched by modality.
  • Load the dataset from a corresponding bucket.
  • Do not save the dataset locally.
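The validation and in-memory loading steps above could look roughly like the sketch below. It assumes `s3_client` behaves like a boto3 S3 client, using only its `head_bucket` and `get_object` calls; the function names are hypothetical. With a real boto3 client a failed `head_bucket` raises `botocore.exceptions.ClientError`, which is caught broadly here only to keep the sketch free of dependencies.

```python
import io


def validate_buckets(s3_client, buckets):
    """Check that every bucket is reachable with the client's credentials.

    Accepts a single bucket name or a list of names and returns a list.
    Raises PermissionError on the first bucket that cannot be accessed.
    """
    if isinstance(buckets, str):
        buckets = [buckets]

    for bucket in buckets:
        try:
            # head_bucket fails if the bucket is missing or access is denied.
            s3_client.head_bucket(Bucket=bucket)
        except Exception as error:
            raise PermissionError(f"Cannot access bucket '{bucket}'.") from error

    return list(buckets)


def load_object_in_memory(s3_client, bucket, key):
    """Read an S3 object into a BytesIO buffer without writing it to disk."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return io.BytesIO(response['Body'].read())
```

Passing the client in, rather than creating it inside the functions, leaves room for the custom credentials mentioned as future work and makes the helpers easy to exercise with a stub client.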

Additional context

Right now, we only tackle the case where the bucket is unspecified. In the future, we want to support custom bucket specification, as well as custom credentials for loading datasets.

Labels

feature request: Request for a new feature
internal: The issue doesn't change the API or functionality
