load_dataset fails for #253

Closed
samgalen opened this issue Jun 18, 2023 · 4 comments
Labels
bug Something isn't working resolution:WAI The software is working as intended

Comments

@samgalen

samgalen commented Jun 18, 2023

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDGym version: 0.7.0
  • Python version: 3.9
  • Operating System: macOS

Error Description

My understanding is that calling the load_dataset method should download the demo datasets from your AWS bucket. However, when I run this command, it produces an error that appears to be related to the credentials or lack thereof.

Steps to reproduce

The traceback below is for the census dataset, but I've also tested it with adult and credit, as well as a few others.

>>> import sdgym
>>> values = sdgym.load_dataset("single-table", "census")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/sdgym/datasets.py", line 112, in load_dataset
    dataset_path = _get_dataset_path(modality, dataset, datasets_path, bucket, aws_key, aws_secret)
  File "/usr/local/lib/python3.9/site-packages/sdgym/datasets.py", line 63, in _get_dataset_path
    _download_dataset(
  File "/usr/local/lib/python3.9/site-packages/sdgym/datasets.py", line 38, in _download_dataset
    obj = s3.get_object(Bucket=bucket_name, Key=f'{modality.upper()}/{dataset_name}.zip')
  File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 964, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
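
For readers following the traceback: the last SDGym frame builds the S3 object key directly from the `modality` string and the dataset name. The sketch below reproduces only that string formatting, with no AWS call. `build_s3_key` is a hypothetical helper name, not part of SDGym. It shows which key the call in this report requests, and which key an underscore spelling of the modality would request instead. Whether either key exists in the bucket is not established in this thread, so treat this as a debugging aid rather than a diagnosis.

```python
def build_s3_key(modality: str, dataset_name: str) -> str:
    """Mirror the key expression shown in the traceback (hypothetical helper)."""
    return f'{modality.upper()}/{dataset_name}.zip'


# Key requested by the call in this report.
hyphen_key = build_s3_key('single-table', 'census')

# Key for the underscore spelling. Assumption: not checked against the real bucket.
underscore_key = build_s3_key('single_table', 'census')

print(hyphen_key)      # SINGLE-TABLE/census.zip
print(underscore_key)  # SINGLE_TABLE/census.zip
```

Comparing the two strings makes it easy to check which spelling a given bucket layout expects before retrying the download.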

Edit: I should also mention that get_available_datasets does work and produces output:

>>> sdgym.get_available_datasets()
              dataset_name     size_MB  num_tables
0                   KRK_v1    0.072128           1
1                    adult    3.907448           1
2                    alarm    4.520128           1
3                     asia    1.280128           1
4                   census   98.165608           1
5          census_extended    4.949400           1
6                    child    3.200128           1
7                  covtype  255.645408           1
8                   credit   68.353808           1
9       expedia_hotel_logs    0.200128           1
10          fake_companies    0.001280           1
11       fake_hotel_guests    0.032628           1
12                    grid    0.320128           1
13                   gridr    0.320128           1
14               insurance    3.340128           1
15               intrusion  162.039016           1
16                 mnist12   81.200128           1
17                 mnist28  439.600128           1
18                    news   18.712096           1
19                    ring    0.320128           1
20      student_placements    0.026358           1
21  student_placements_pii    0.028078           1
@samgalen samgalen added bug Something isn't working new Automatic label applied to new issues labels Jun 18, 2023
@npatki

npatki commented Jun 20, 2023

Hi @samgalen, the SDGym documentation website contains a reference to all the features that we support. The get_available_datasets function is listed but load_dataset is not -- meaning, it's not currently a supported feature.

Note that we are in the process of cleaning up our library so older, unsupported features may still be present in the code. So we ask that you please bear with us as we clean our repo!

BTW, I'm curious about your use case. We found that loading datasets ad hoc was not a frequently used feature, as most of our users come directly to benchmark synthesizers. If this would be helpful to you, we could track it as a feature request.

@npatki npatki added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Jun 20, 2023
@samgalen
Author

Hi @npatki - Thanks for the response.

My use case is that I'm trying to replicate prior work that uses the load_dataset function (but no other parts of SDGym). So it's not so much that I need to use the function regularly; rather, I was trying to figure out how some of the datasets were processed, how the data was encoded, and so on.

If there's a way to see that easily in the current version of SDGym, that would be ideal.

@npatki

npatki commented Jun 21, 2023

No problem! SDGym uses the SDV library for a majority of the predefined synthesizers. It also reads from the same demo datasets.

So one option is to pull directly from SDV instead of SDGym. It should be installed automatically if you already have SDGym.

from sdv.datasets.demo import get_available_demos
from sdv.datasets.demo import download_demo

# get a table of all demos
# this should have the same datasets as what SDGym returns
all_demos = get_available_demos(modality='single_table')

# select a particular dataset name to download
data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests'
)

For more resources see:

  • SDV demo API
  • SDV transformation API. We now expose functions that allow you to see how the data is preprocessed (converted from raw -> numeric values) before applying the machine learning.

Let me know if you have any more Qs!

@npatki

npatki commented Nov 13, 2023

Hi @samgalen, I'm closing this issue off since it has been inactive for some time and we've answered the original question.

I've filed a separate feature request in #261 to allow the ability to download and inspect datasets prior to running them in the benchmark. I've also copied over the workaround where you can access the datasets directly from the SDV library.

Feel free to reply if there is more to discuss and we can always reopen the issue. Alternatively, we can continue the conversation in the new feature request.

@npatki npatki closed this as completed Nov 13, 2023
@npatki npatki added resolution:WAI The software is working as intended and removed under discussion Issue is currently being discussed labels Nov 13, 2023