Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add download_demo method #1148

Merged
merged 11 commits into from
Dec 21, 2022
Merged

Add download_demo method #1148

merged 11 commits into from
Dec 21, 2022

Conversation

fealho
Copy link
Member

@fealho fealho commented Dec 14, 2022

Resolve #1128.

Changes compared to the issue:

  1. The error "if dataset name isn't provided" shouldn't be necessary, since it crashes automatically if you don't pass a required param.
  2. Adding another error if modality is not one of the 3 required ones.
  3. I'm assuming you meant load_csvs instead of load_from_csv in the final error message.

@fealho fealho marked this pull request as ready for review December 15, 2022 21:15
@fealho fealho requested a review from a team as a code owner December 15, 2022 21:15
@fealho
Copy link
Member Author

fealho commented Dec 15, 2022

@amontanez24 could you double check that the names of the functions mentioned in the errors are correct? That is, load_csvs, load_{other_modality}_demo and list_available_demos .

sdv/datasets/demo.py Outdated Show resolved Hide resolved
sdv/datasets/errors.py Outdated Show resolved Hide resolved
sdv/datasets/demo.py Outdated Show resolved Hide resolved
Comment on lines 37 to 47
try:
urllib.request.urlopen(dataset_url)
except urllib.error.HTTPError:
# If the dataset exists in the wrong modality, raise an error
other_modalities = set(possible_modalities) - {modality}
for other_modality in other_modalities:
dataset_url = _get_dataset_url(other_modality, dataset_name)
try:
urllib.request.urlopen(dataset_url)
raise InvalidArgumentError(
f"Dataset name '{dataset_name}' is a '{other_modality}' dataset. "
f"Use 'load_{other_modality}_demo' to load this dataset."
)
except urllib.error.HTTPError:
pass

# If the dataset doesn't exist at all, raise different error
raise InvalidArgumentError(
f"Invalid dataset name '{dataset_name}'. "
"Use 'list_available_demos' to get a list of demo datasets."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should get rid of this whole section. Instead, during the actual _download method, we should throw that in a try catch and raise an error if it can't be downloaded suggesting that possibly it is a different modality.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I implemented like this to validate whether the dataset is in another modality. Since we removed it, this is unnecessary.

PS: also, I originally implemented it using the requests library, so I could make a HEAD request, which doesn't cost much. But I forgot that we don't have requests, and I don't know how to do it with urllib...

sdv/datasets/demo.py Outdated Show resolved Hide resolved
sdv/datasets/demo.py Outdated Show resolved Hide resolved
sdv/datasets/demo.py Outdated Show resolved Hide resolved
Copy link
Contributor

@amontanez24 amontanez24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment about streaming the files directly into objects in memory. Also seems like some tests are failing

Comment on lines 53 to 57
zf.extractall(output_folder_name)
os.remove(os.path.join(output_folder_name, 'metadata_v0.json'))
os.rename(
os.path.join(output_folder_name, 'metadata_v1.json'),
os.path.join(output_folder_name, METADATA_FILENAME)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to extract the objects directly in memory instead of storing them to disk in a temp file? something like what's discussed here? https://stackoverflow.com/questions/5710867/downloading-and-unzipping-a-zip-file-without-writing-to-disk
And then we'd only save to disk if requested by the user

Copy link
Member Author

@fealho fealho Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it, but the code would be a little more complicated (I need to pass a path to load_from_json) and a little less readable, while also potentially increasing memory usage (since, if you don't do something fancy, you have a copy of the zip file and the object in memory at the same time). It just doesn't seem worth it.

And regarding the tests failing, yeah, I've been trying to figure it out the whole day...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @amontanez24 pointed, it is possible to read the object in memory and once you exit the with indentation the garbage collector will close the zip file and unload it from memory, only the assigned variable in it will stay.

I'm not sure about the format that the json will come out, but if it's a dict just use the private method _load_from_dict: https://github.com/sdv-dev/SDV/blob/V1.0.0.dev/sdv/metadata/single_table.py#L512

@codecov-commenter
Copy link

codecov-commenter commented Dec 20, 2022

Codecov Report

Base: 82.54% // Head: 82.74% // Increases project coverage by +0.20% 🎉

Coverage data is based on head (96bcb66) compared to base (12b4942).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@              Coverage Diff               @@
##           V1.0.0.dev    #1148      +/-   ##
==============================================
+ Coverage       82.54%   82.74%   +0.20%     
==============================================
  Files              60       61       +1     
  Lines            5430     5495      +65     
==============================================
+ Hits             4482     4547      +65     
  Misses            948      948              
Impacted Files Coverage Δ
sdv/datasets/demo.py 100.00% <100.00%> (ø)
sdv/metadata/multi_table.py 99.34% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Comment on lines 53 to 57
zf.extractall(output_folder_name)
os.remove(os.path.join(output_folder_name, 'metadata_v0.json'))
os.rename(
os.path.join(output_folder_name, 'metadata_v1.json'),
os.path.join(output_folder_name, METADATA_FILENAME)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @amontanez24 pointed, it is possible to read the object in memory and once you exit the with indentation the garbage collector will close the zip file and unload it from memory, only the assigned variable in it will stay.

I'm not sure about the format that the json will come out, but if it's a dict just use the private method _load_from_dict: https://github.com/sdv-dev/SDV/blob/V1.0.0.dev/sdv/metadata/single_table.py#L512

sdv/datasets/demo.py Outdated Show resolved Hide resolved
sdv/datasets/demo.py Outdated Show resolved Hide resolved
@@ -610,7 +610,7 @@ def load_from_json(cls, filepath):
A ``MultiTableMetadata`` instance.
"""
metadata = read_json(filepath)
return cls._load_from_dict(metadata)
return cls._load_from_dict(metadata) # (1 <- check here)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the comment can be removed

Copy link
Member

@pvk-developer pvk-developer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just make sure to remove the comment
Good job!

Copy link
Contributor

@amontanez24 amontanez24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing! LGTM :shipit:

@fealho fealho merged commit d1f6ef9 into V1.0.0.dev Dec 21, 2022
@fealho fealho deleted the issue-1128-download-demo branch December 21, 2022 19:12
amontanez24 pushed a commit that referenced this pull request Dec 21, 2022
* Add validation and validation tests

* Working code, failing tests

* Fix tests + clean up code

* Remove unnecessary lines

* Remove requests library :(

* Address feedback

* Update error message

* Fix path joining for url...

* Working version

* Fix style

* Remove forgotten comment
amontanez24 pushed a commit that referenced this pull request Feb 17, 2023
* Add validation and validation tests

* Working code, failing tests

* Fix tests + clean up code

* Remove unnecessary lines

* Remove requests library :(

* Address feedback

* Update error message

* Fix path joining for url...

* Working version

* Fix style

* Remove forgotten comment
amontanez24 pushed a commit that referenced this pull request Mar 23, 2023
* Add validation and validation tests

* Working code, failing tests

* Fix tests + clean up code

* Remove unnecessary lines

* Remove requests library :(

* Address feedback

* Update error message

* Fix path joining for url...

* Working version

* Fix style

* Remove forgotten comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants