Update the benchmark to load datasets from multiple S3 buckets when None is specified #604

### Problem Description

By default, the benchmark downloads data from the S3 bucket that we use for SDV demo datasets. In some cases, we also want to load datasets from multiple buckets. To support this, we can create an indexing function that tells us which datasets are available in each bucket, and then load each dataset from the bucket that contains it.
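A minimal sketch of such an indexing function is shown here to make the idea concrete. The names `build_dataset_index` and `list_datasets`, the bucket names, and the "first bucket wins" rule for datasets present in several buckets are all assumptions for illustration, not existing SDGym behavior.

```python
from typing import Callable, Dict, Iterable, List


def build_dataset_index(
    buckets: Iterable[str],
    list_datasets: Callable[[str], List[str]],
) -> Dict[str, str]:
    """Map each dataset name to the bucket it should be loaded from.

    `list_datasets` is a hypothetical callable that returns the dataset
    names available in one bucket. If a dataset appears in several
    buckets, the first bucket in `buckets` wins (an assumption).
    """
    index: Dict[str, str] = {}
    for bucket in buckets:
        for name in list_datasets(bucket):
            # setdefault keeps the bucket that was registered first.
            index.setdefault(name, bucket)
    return index


def fake_listing(bucket: str) -> List[str]:
    """Stand-in for a real S3 listing call, used only for this example."""
    listings = {
        'public-bucket': ['adult', 'census'],
        'private-bucket': ['census', 'internal_sales'],
    }
    return listings[bucket]


index = build_dataset_index(['public-bucket', 'private-bucket'], fake_listing)
print(index)
```

Passing the listing function as an argument keeps the index logic independent of how each bucket is actually queried.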

### Expected behavior

Right now, if the bucket is not specified, it defaults to the public bucket (see [this line](https://github.com/sdv-dev/SDGym/blob/4eb841bf929171cccfa3f536cc62899fa009993c/sdgym/datasets.py#L136)). We want to update this so that the benchmark inspects our own buckets (public and private).
* Allow `bucket` to be a list of buckets.
* Validate that every bucket is accessible with the given credentials. If any bucket is inaccessible, raise an error.
* If `bucket` is a list, build an index with `get_available_demos` or `_generate_dataset_info` so that datasets can be searched by modality.
* Load each dataset from its corresponding bucket.
* Do **not** save the dataset locally.
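
The validation and in-memory loading steps above could look roughly like the sketch below. It assumes a boto3-style S3 client exposing `head_bucket` and `get_object`; the helper names, the placeholder bucket names in `DEFAULT_BUCKETS`, the `<dataset>.zip` key layout, and the choice of `ValueError` are hypothetical and not taken from SDGym.

```python
import io

# Placeholder names (assumption); the real public and private buckets differ.
DEFAULT_BUCKETS = ('example-public-bucket', 'example-private-bucket')


def normalize_buckets(bucket):
    """Return a list of bucket names for None, a single name, or a list."""
    if bucket is None:
        return list(DEFAULT_BUCKETS)
    if isinstance(bucket, str):
        return [bucket]
    return list(bucket)


def validate_buckets(buckets, s3_client):
    """Raise ValueError if any bucket cannot be reached with the client."""
    inaccessible = []
    for name in buckets:
        try:
            s3_client.head_bucket(Bucket=name)
        except Exception as error:
            # With boto3 this is typically botocore's ClientError; the broad
            # catch keeps this sketch free of third-party imports.
            inaccessible.append(f'{name} ({error})')
    if inaccessible:
        raise ValueError('Cannot access buckets: ' + ', '.join(inaccessible))


def load_dataset_in_memory(dataset_name, index, s3_client):
    """Read a dataset archive into memory without writing it to disk.

    `index` maps dataset names to bucket names. The object key layout
    `<dataset_name>.zip` is an assumption for illustration.
    """
    bucket = index[dataset_name]
    response = s3_client.get_object(Bucket=bucket, Key=f'{dataset_name}.zip')
    return io.BytesIO(response['Body'].read())
```

With boto3 installed, the client would be created with `boto3.client('s3')`, optionally passing credentials, and handed to these helpers.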

### Additional context

Right now, we only tackle the case where the bucket is unspecified. In the future, we want to let users specify custom buckets and provide custom credentials for loading datasets.

