Add `download_demo` method #1148

fealho · 2022-12-14T04:36:34Z

Resolve #1128.

Changes compared to the issue:

The error for a missing dataset name shouldn't be necessary, since the call already fails if a required parameter is not passed.
Added another error for when the modality is not one of the three supported ones.
I'm assuming you meant `load_csvs` instead of `load_from_csv` in the final error message.
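
A minimal sketch of the modality check described above. The three modality names and the helper name `validate_modality` are assumptions made for illustration, and `InvalidArgumentError` is a local stand-in for the error class the PR raises, not the library's own code.

```python
class InvalidArgumentError(Exception):
    """Local stand-in for the error class raised in the PR."""


# Assumed modality names; the PR only says there are three of them.
POSSIBLE_MODALITIES = ('single_table', 'multi_table', 'sequential')


def validate_modality(modality):
    """Raise an error if ``modality`` is not one of the supported values."""
    if modality not in POSSIBLE_MODALITIES:
        raise InvalidArgumentError(
            f"'modality' must be in {list(POSSIBLE_MODALITIES)}."
        )


validate_modality('single_table')  # passes silently
```

A missing dataset name needs no such check, because Python already raises a `TypeError` when a required positional argument is omitted.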

fealho · 2022-12-15T21:16:52Z

@amontanez24 could you double-check that the function names mentioned in the errors are correct? That is, `load_csvs`, `load_{other_modality}_demo`, and `list_available_demos`.

amontanez24 · 2022-12-17T00:10:49Z

sdv/datasets/demo.py

+    try:
+        urllib.request.urlopen(dataset_url)
+    except urllib.error.HTTPError:
+        # If the dataset exists in the wrong modality, raise an error
+        other_modalities = set(possible_modalities) - {modality}
+        for other_modality in other_modalities:
+            dataset_url = _get_dataset_url(other_modality, dataset_name)
+            try:
+                urllib.request.urlopen(dataset_url)
+                raise InvalidArgumentError(
+                    f"Dataset name '{dataset_name}' is a '{other_modality}' dataset. "
+                    f"Use 'load_{other_modality}_demo' to load this dataset."
+                )
+            except urllib.error.HTTPError:
+                pass
+
+        # If the dataset doesn't exist at all, raise different error
+        raise InvalidArgumentError(
+            f"Invalid dataset name '{dataset_name}'. "
+            "Use 'list_available_demos' to get a list of demo datasets."
+        )


I think we should get rid of this whole section. Instead, we should wrap the actual `_download` call in a try/except and, if the dataset can't be downloaded, raise an error suggesting that it may belong to a different modality.
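
A rough sketch of that suggestion, assuming the download goes through `urllib`. The function name `download_bytes`, its `opener` parameter, and the message wording are hypothetical, and `InvalidArgumentError` is a local stand-in for the PR's error class.

```python
import urllib.error
import urllib.request


class InvalidArgumentError(Exception):
    """Local stand-in for the error class raised in the PR."""


def download_bytes(dataset_url, dataset_name, modality, opener=urllib.request.urlopen):
    """Download a dataset archive, translating HTTP failures into a hint.

    ``opener`` is a hypothetical parameter that makes the function testable
    without network access.
    """
    try:
        with opener(dataset_url) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        # One request, one failure path: no probing of the other modalities.
        raise InvalidArgumentError(
            f"Could not download dataset '{dataset_name}' as a '{modality}' dataset. "
            "Use 'list_available_demos' to check the name; the dataset may "
            "belong to a different modality."
        ) from error
```

Compared with the removed section, this makes a single request and leaves the modality question as a hint in the message rather than checking each alternative URL.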

Yeah, I implemented it like this to validate whether the dataset is in another modality. Since we removed that check, this is unnecessary.

PS: I originally implemented it using the requests library so I could make a HEAD request, which is cheap. But I forgot that we don't have requests as a dependency, and I don't know how to do it with urllib...
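
For reference, the standard library can issue a HEAD request by passing `method='HEAD'` to `urllib.request.Request`. The sketch below is illustrative only; the helper names `build_head_request` and `url_exists` are hypothetical and not part of the PR.

```python
import urllib.error
import urllib.request


def build_head_request(url):
    """Build a request that asks only for headers, not the response body."""
    return urllib.request.Request(url, method='HEAD')


def url_exists(url, timeout=10):
    """Return True if the server answers the HEAD request without an HTTP error.

    Network-level failures (``urllib.error.URLError``) are deliberately not
    caught here, so a connectivity problem is not mistaken for a missing file.
    """
    try:
        with urllib.request.urlopen(build_head_request(url), timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return False
```

Whether a HEAD request is honored depends on the server, so this is a cheap existence check rather than a guarantee that the later download will succeed.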

amontanez24

I left a comment about streaming the files directly into objects in memory. Also, it seems like some tests are failing.

amontanez24 · 2022-12-19T23:49:57Z

sdv/datasets/demo.py

+        zf.extractall(output_folder_name)
+        os.remove(os.path.join(output_folder_name, 'metadata_v0.json'))
+        os.rename(
+            os.path.join(output_folder_name, 'metadata_v1.json'),
+            os.path.join(output_folder_name, METADATA_FILENAME)


Would it be possible to extract the objects directly in memory instead of storing them to disk in a temp file? Something like what's discussed here: https://stackoverflow.com/questions/5710867/downloading-and-unzipping-a-zip-file-without-writing-to-disk
Then we'd only save to disk if requested by the user.

I thought about it, but the code would be a little more complicated (I need to pass a path to `load_from_json`) and a little less readable, while also potentially increasing memory usage (unless you do something fancy, you hold a copy of the zip file and the object in memory at the same time). It just doesn't seem worth it.

And regarding the tests failing, yeah, I've been trying to figure it out the whole day...

codecov-commenter · 2022-12-20T03:23:24Z

Codecov Report

Base: 82.54% // Head: 82.74% // Increases project coverage by +0.20% 🎉

Coverage data is based on head (96bcb66) compared to base (12b4942).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

@@              Coverage Diff               @@
##           V1.0.0.dev    #1148      +/-   ##
==============================================
+ Coverage       82.54%   82.74%   +0.20%     
==============================================
  Files              60       61       +1     
  Lines            5430     5495      +65     
==============================================
+ Hits             4482     4547      +65     
  Misses            948      948

| Impacted Files | Coverage Δ |
|---|---|
| sdv/datasets/demo.py | `100.00% <100.00%> (ø)` |
| sdv/metadata/multi_table.py | `99.34% <100.00%> (ø)` |

pvk-developer · 2022-12-20T10:26:23Z

As @amontanez24 pointed out, it is possible to read the object in memory. Once you exit the `with` block, the zip file is closed and its memory can be reclaimed; only the variables assigned inside the block remain.

I'm not sure what format the JSON will come out in, but if it's a dict, just use the private method `_load_from_dict`: https://github.com/sdv-dev/SDV/blob/V1.0.0.dev/sdv/metadata/single_table.py#L512
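
A minimal sketch of the in-memory approach, using only the standard library. The helper names and the sample metadata contents are made up for illustration; the archive is built locally here so the example runs without a download.

```python
import io
import json
import zipfile


def read_json_member(zip_bytes, member_name):
    """Parse one JSON member of a zip archive held entirely in memory."""
    # Both context managers close on exit, so only the parsed dict survives.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(member_name) as member:
            return json.load(member)


def build_example_zip():
    """Create a small archive in memory to stand in for a downloaded demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as zf:
        # Hypothetical metadata contents, for illustration only.
        zf.writestr('metadata_v1.json', json.dumps({'columns': {'id': {}}}))
    return buffer.getvalue()


metadata = read_json_member(build_example_zip(), 'metadata_v1.json')
```

The resulting dict is what would then be handed to `_load_from_dict`, as suggested in the comment, so no file path is needed. The raw zip bytes and the parsed object do coexist briefly, which is the memory cost raised earlier in the thread.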

pvk-developer · 2022-12-21T12:41:37Z

sdv/metadata/multi_table.py

@@ -610,7 +610,7 @@ def load_from_json(cls, filepath):
            A ``MultiTableMetadata`` instance.
        """
        metadata = read_json(filepath)
-        return cls._load_from_dict(metadata)
+        return cls._load_from_dict(metadata)  # (1 <- check here)


I think that the comment can be removed.

pvk-developer

LGTM, just make sure to remove the comment
Good job!

amontanez24

Thanks for addressing! LGTM

* Add validation and validation tests
* Working code, failing tests
* Fix tests + clean up code
* Remove unnecessary lines
* Remove requests library :(
* Address feedback
* Update error message
* Fix path joining for url...
* Working version
* Fix style
* Remove forgotten comment

fealho requested review from amontanez24 and pvk-developer December 15, 2022 21:15

fealho marked this pull request as ready for review December 15, 2022 21:15

fealho requested a review from a team as a code owner December 15, 2022 21:15

amontanez24 requested changes Dec 17, 2022

fealho requested a review from amontanez24 December 19, 2022 18:48

amontanez24 reviewed Dec 19, 2022

fealho requested a review from amontanez24 December 19, 2022 19:10

fealho added 7 commits December 19, 2022 14:22

Add validation and validation tests

e4190cc

Working code, failing tests

92ad7fb

Fix tests + clean up code

605580b

Remove unnecessary lines

b868d94

Remove requests library :(

067f7c3

Address feedback

ca39aee

Update error message

75820ca

fealho force-pushed the issue-1128-download-demo branch from 7e91805 to 75820ca Compare December 19, 2022 22:22

amontanez24 reviewed Dec 19, 2022

Fix path joining for url...

633a847

pvk-developer requested changes Dec 20, 2022

fealho added 2 commits December 20, 2022 18:59

Working version

db67996

Fix style

96bcb66

pvk-developer reviewed Dec 21, 2022

pvk-developer approved these changes Dec 21, 2022

Remove forgotten comment

cc4e67c

amontanez24 approved these changes Dec 21, 2022

fealho merged commit d1f6ef9 into V1.0.0.dev Dec 21, 2022

fealho deleted the issue-1128-download-demo branch December 21, 2022 19:12
