-
Notifications
You must be signed in to change notification settings - Fork 317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add download_demo
method
#1148
Add download_demo
method
#1148
Conversation
@amontanez24 could you double check that the names of the functions mentioned in the errors are correct? That is, |
sdv/datasets/demo.py
Outdated
try: | ||
urllib.request.urlopen(dataset_url) | ||
except urllib.error.HTTPError: | ||
# If the dataset exists in the wrong modality, raise an error | ||
other_modalities = set(possible_modalities) - {modality} | ||
for other_modality in other_modalities: | ||
dataset_url = _get_dataset_url(other_modality, dataset_name) | ||
try: | ||
urllib.request.urlopen(dataset_url) | ||
raise InvalidArgumentError( | ||
f"Dataset name '{dataset_name}' is a '{other_modality}' dataset. " | ||
f"Use 'load_{other_modality}_demo' to load this dataset." | ||
) | ||
except urllib.error.HTTPError: | ||
pass | ||
|
||
# If the dataset doesn't exist at all, raise different error | ||
raise InvalidArgumentError( | ||
f"Invalid dataset name '{dataset_name}'. " | ||
"Use 'list_available_demos' to get a list of demo datasets." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should get rid of this whole section. Instead, during the actual _download
method, we should throw that in a try catch and raise an error if it can't be downloaded suggesting that possibly it is a different modality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I implemented like this to validate whether the dataset is in another modality. Since we removed it, this is unnecessary.
PS: also, I originally implemented it using the requests
library, so I could make a HEAD request, which doesn't cost much. But I forgot that we don't have requests
, and I don't know how to do it with urllib
...
7e91805
to
75820ca
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a comment about streaming the files directly into objects in memory. Also seems like some tests are failing
sdv/datasets/demo.py
Outdated
zf.extractall(output_folder_name) | ||
os.remove(os.path.join(output_folder_name, 'metadata_v0.json')) | ||
os.rename( | ||
os.path.join(output_folder_name, 'metadata_v1.json'), | ||
os.path.join(output_folder_name, METADATA_FILENAME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be possible to extract the objects directly in memory instead of storing them to disk in a temp file? something like what's discussed here? https://stackoverflow.com/questions/5710867/downloading-and-unzipping-a-zip-file-without-writing-to-disk
And then we'd only save to disk if requested by the user
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about it, but the code would be a little more complicated (I need to pass a path to load_from_json
) and a little less readable, while also potentially increasing memory usage (since, if you don't do something fancy, you have a copy of the zip file and the object in memory at the same time). It just doesn't seem worth it.
And regarding the tests failing, yeah, I've been trying to figure it out the whole day...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @amontanez24 pointed, it is possible to read the object in memory and once you exit the with
indentation the garbage collector will close the zip
file and unload it from memory, only the assigned variable in it will stay.
I'm not sure about the format that the json
will come out, but if it's a dict
just use the private method _load_from_dict
: https://github.com/sdv-dev/SDV/blob/V1.0.0.dev/sdv/metadata/single_table.py#L512
Codecov ReportBase: 82.54% // Head: 82.74% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## V1.0.0.dev #1148 +/- ##
==============================================
+ Coverage 82.54% 82.74% +0.20%
==============================================
Files 60 61 +1
Lines 5430 5495 +65
==============================================
+ Hits 4482 4547 +65
Misses 948 948
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
sdv/datasets/demo.py
Outdated
zf.extractall(output_folder_name) | ||
os.remove(os.path.join(output_folder_name, 'metadata_v0.json')) | ||
os.rename( | ||
os.path.join(output_folder_name, 'metadata_v1.json'), | ||
os.path.join(output_folder_name, METADATA_FILENAME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @amontanez24 pointed, it is possible to read the object in memory and once you exit the with
indentation the garbage collector will close the zip
file and unload it from memory, only the assigned variable in it will stay.
I'm not sure about the format that the json
will come out, but if it's a dict
just use the private method _load_from_dict
: https://github.com/sdv-dev/SDV/blob/V1.0.0.dev/sdv/metadata/single_table.py#L512
sdv/metadata/multi_table.py
Outdated
@@ -610,7 +610,7 @@ def load_from_json(cls, filepath): | |||
A ``MultiTableMetadata`` instance. | |||
""" | |||
metadata = read_json(filepath) | |||
return cls._load_from_dict(metadata) | |||
return cls._load_from_dict(metadata) # (1 <- check here) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the comment can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just make sure to remove the comment
Good job!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing! LGTM
* Add validation and validation tests * Working code, failing tests * Fix tests + clean up code * Remove unnecessary lines * Remove requests library :( * Address feedback * Update error message * Fix path joining for url... * Working version * Fix style * Remove forgotten comment
* Add validation and validation tests * Working code, failing tests * Fix tests + clean up code * Remove unnecessary lines * Remove requests library :( * Address feedback * Update error message * Fix path joining for url... * Working version * Fix style * Remove forgotten comment
* Add validation and validation tests * Working code, failing tests * Fix tests + clean up code * Remove unnecessary lines * Remove requests library :( * Address feedback * Update error message * Fix path joining for url... * Working version * Fix style * Remove forgotten comment
Resolve #1128.
Changes compared to the issue: