-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add multipart zip support #62
Add multipart zip support #62
Conversation
soundata/download_utils.py
Outdated
zip_path = save_dir+"/"+k+".zip" | ||
outpath = save_dir+"/"+k+"_single.zip" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe using os.path.join()
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
Thanks @iranroman for that!:) Regarding your last comment on #55: yes of course it has no sense for the user to get just some data splits instead of downloading them all. In my version of the FSD50K loader, I added However, there is something I don't understand: where are we testing that merging and unzipping code now? Should this be done in |
Hello @genisplaja. Thanks for your feedback.
Isn't it implied that the download, merging, and unzipping all were done correctly when the |
Hello @genisplaja. I followed your suggestion and I have added a multipart zip test in |
Looks good @iranroman!! Thanks for that:) You need to make sure that every line is covered by tests, even if the |
No idea @genisplaja. This is my first PR int his repo. I was expecting them to just pass. You would know better than me. |
…data into add_multipart_zip_support
Codecov Report
@@ Coverage Diff @@
## master #62 +/- ##
==========================================
- Coverage 97.91% 97.87% -0.04%
==========================================
Files 18 18
Lines 1437 1415 -22
==========================================
- Hits 1407 1385 -22
Misses 30 30 |
Hello all, |
@iranroman will look into this tomorrow! |
…data into add_multipart_zip_support
Hello @genisplaja. This PR is very relevant to the dataset loader you wrote for |
soundata/download_utils.py
Outdated
@@ -54,6 +55,8 @@ def downloader( | |||
The directory to download the data | |||
remotes (dict or None): | |||
A dictionary of RemoteFileMetadata tuples of data in zip format. | |||
If a dictionary key has a list of RemoteFileMetadata, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Maybe rephrase? If an element of the dictionary is a list of RemoteFileMetadata
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense. Done
|
||
if remotes[k].unpack_directories: | ||
for src_dir in remotes[k].unpack_directories: | ||
if isinstance(remotes[k], list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you're assuming that the multiple files are zip
, maybe you can check that and add an NotImplementedError
if it is not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this idea. I've implemented it
Yes! I'll do that tomorrow morning (it's almost midnight here!). Thanks for letting me know:) |
No problem @genisplaja. In the meantime I ran The multipart zip download, joining, and unpacking now is handled by |
soundata/datasets/fsd50k.py
Outdated
@@ -206,51 +206,54 @@ | |||
primaryClass={cs.SD} | |||
} | |||
""" | |||
|
|||
# a dictionary key that hasa a list of RemoteFileMetadata implies a multi-part zip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: has a
soundata/download_utils.py
Outdated
""" | ||
for l in range(len(zip_remotes)): | ||
download_from_remote(zip_remotes[l], save_dir, force_overwrite) | ||
zip_path = os.path.join(save_dir, obj_to_download + ".zip") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good way to get the principal zip filename to use it as input of command zip
, but it forces the use of the principal filename without the .zip
extension as key of the list of parts of the split compressed file in the REMOTES
dict (see line 213-214 of FSD50K loader). Am I right?
If so, this may cause confusion for an external contributor, since in other loaders one would just put, for instance, development
as key, as I did for the original FSD50K loader. This should be specified somewhere in the docs, or I would suggest to iterate over the download item list parsing the field filename
to identify the one with .zip
extension, which is actually the input filename for the zip
command. I think it would be something like:
zip_path = next((part.filename for part in zip_remotes if '.zip' in part.filename), None)
and then you can get the output path like:
out_path = zip_path.replace('.zip', '_single.zip')
Then the input obj_to_download
would be no longer needed. Since at the moment this function is just for multipart files in zip format, it should work! And then the contributor is allowed to use any key for the divided file.
Also because I think that you don't actually check that the actual filename is k + '.zip'
, or obj_to_download + '.zip'
.
Just a proposal:) What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, thanks for your suggestions.
I'm afraid I tried making the changes you suggested and now the tests are not passing. do you wanna give it a shot and open a PR against this one before we join with main
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind. I got it. Great suggestions, by the way. Thanks a lot!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, no worries! This PR looks ready to me.
This PR implements handling of multi-part zip files, which was the zip format of the
fsd50k
dataset and resulted in #55 . This will be needed to write the loaders forTAU NIGENS SSE 2021
,TAU NIGENS SSE 2020
, andTAU SSE 2019
.