Problem Description
SDGym version 0.11.0 offers a DatasetExplorer class that you can use to view and explore all the datasets that are within a bucket. This class has enhanced functionality that summarizes all the datasets.
At the same time, there is a legacy method called get_available_datasets that simply lists all the datasets and some basic facts about them (their size and number of tables). Much of this functionality is already covered by DatasetExplorer.summarize_datasets.
However, get_available_datasets is more lightweight. summarize_datasets requires SDGym to actually load the data and perform some computations on it, which can take time. By contrast, get_available_datasets only needs to read information that is already available in the metainfo.yaml file.
Expected behavior
Incorporate the functionality from get_available_datasets into DatasetExplorer.list_datasets.
Inputs:
- (required) modality: The modality of the data to summarize. This should be one of 'single_table', 'multi_table' or 'sequential'.
- (optional) output_filepath: A string with the full output filepath where the results will be written. This should end in .csv
Output: This function produces the same dataframe that get_available_datasets produces today. If output_filepath is provided, the same dataframe is also written as a CSV to that path.
from sdgym import DatasetExplorer
de = DatasetExplorer(
s3_url='s3://my_bucket/', # optional
aws_access_key_id='my_access_key', # optional
aws_secret_access_key='my_secret' # optional
)
dataset_list = de.list_datasets(modality='single_table', output_filepath='my_dataset_list.csv')

After doing this, we can remove the existing get_available_datasets function, as it is no longer needed.
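
As a rough illustration of the requested behavior (not SDGym's actual implementation), the input checks and the optional CSV output could look something like the sketch below. It assumes the metainfo.yaml contents have already been parsed into one dict per dataset. The function name `list_datasets_sketch` and the record keys (`dataset_name`, `modality`, `size_MB`, `num_tables`) are hypothetical, and a plain list of dicts stands in for the dataframe so that the example only needs the standard library.

```python
import csv

VALID_MODALITIES = ('single_table', 'multi_table', 'sequential')


def list_datasets_sketch(metainfo_records, modality, output_filepath=None):
    """Filter pre-parsed metainfo records by modality and optionally save them.

    `metainfo_records` is a hypothetical list of dicts, one per dataset,
    standing in for whatever is read from each metainfo.yaml file.
    """
    if modality not in VALID_MODALITIES:
        raise ValueError(
            f"'modality' must be one of {VALID_MODALITIES}, got {modality!r}."
        )
    if output_filepath is not None and not str(output_filepath).endswith('.csv'):
        raise ValueError("'output_filepath' must end in '.csv'.")

    # Only metadata is inspected here; no dataset is loaded.
    rows = [dict(r) for r in metainfo_records if r.get('modality') == modality]

    if output_filepath is not None:
        # Use the keys of the first row as the header (assumes uniform records).
        fieldnames = list(rows[0]) if rows else []
        with open(output_filepath, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)

    return rows


# Hypothetical records, for illustration only.
records = [
    {'dataset_name': 'a', 'modality': 'single_table', 'size_MB': 1.5, 'num_tables': 1},
    {'dataset_name': 'b', 'modality': 'multi_table', 'size_MB': 20.0, 'num_tables': 3},
]
rows = list_datasets_sketch(records, modality='single_table')
```

Validating both arguments before doing any work means that a bad modality or a filepath without the .csv suffix fails fast, before any file is touched.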