Add multipart zip support #62

Merged Oct 19, 2021 · 35 commits

Conversation

@iranroman (Contributor)

This PR implements handling of multi-part zip files, which is the zip format used by the FSD50K dataset and resulted in #55. This will also be needed to write the loaders for TAU NIGENS SSE 2021, TAU NIGENS SSE 2020, and TAU SSE 2019.

Comment on lines 139 to 140
zip_path = save_dir+"/"+k+".zip"
outpath = save_dir+"/"+k+"_single.zip"
Collaborator:

Maybe use os.path.join() here?

Contributor Author:

Fixed!
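
For reference, the applied fix would look roughly like this (a sketch only; save_dir and k are the variables from the snippet above, shown here with placeholder values):

import os

save_dir = "/tmp/sound_datasets"  # placeholder directory for illustration
k = "FSD50K.dev_audio"            # placeholder REMOTES key for illustration

zip_path = os.path.join(save_dir, k + ".zip")
outpath = os.path.join(save_dir, k + "_single.zip")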

@genisplaja (Collaborator)

Thanks @iranroman for that!:)

Regarding your last comment on #55: yes, of course it makes no sense for the user to get just some data splits instead of downloading them all. In my version of the FSD50K loader, I added exceptions to make sure that users were not allowed to download just a single split alone, but rather the entire development or evaluation splits. However, I like the idea of storing the RemoteFileMetadata of the splits as a list to identify the multipart zips, and then using the expected final folder name as the key.

However, there is something I don't understand: where are we testing that merging and unzipping code now? Should this be done in test_download_utils.py?
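
For illustration, a REMOTES entry under that scheme might look roughly like the following (a sketch only: URLs, checksums, and part names are placeholders rather than the real FSD50K remotes, and the RemoteFileMetadata import path, fields, and defaults are assumed from the snippets quoted in this PR):

from soundata.download_utils import RemoteFileMetadata  # import path assumed

REMOTES = {
    # ordinary entry: a single RemoteFileMetadata
    # (field set assumed; real entries may need more fields, e.g. unpack_directories)
    "ground_truth": RemoteFileMetadata(
        filename="FSD50K.ground_truth.zip",
        url="https://example.org/FSD50K.ground_truth.zip",
        checksum="placeholder-md5",
    ),
    # multipart zip: the key maps to a list of parts (.zip plus .z01, .z02, ...)
    "FSD50K.dev_audio": [
        RemoteFileMetadata(
            filename="FSD50K.dev_audio.zip",
            url="https://example.org/FSD50K.dev_audio.zip",
            checksum="placeholder-md5",
        ),
        RemoteFileMetadata(
            filename="FSD50K.dev_audio.z01",
            url="https://example.org/FSD50K.dev_audio.z01",
            checksum="placeholder-md5",
        ),
    ],
}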

@iranroman (Contributor, Author)

Hello @genisplaja. Thanks for your feedback.

> However, there is something I don't understand: where are we testing that merging and unzipping code now? Should this be done in test_download_utils.py?

Isn't it implied that the download, merging, and unzipping were all done correctly when pytest -s tests/test_full_dataset.py --local --dataset fsd50k passes?

@iranroman (Contributor, Author)

Hello @genisplaja. I followed your suggestion and added a multipart zip test in test_download_utils.py. I also refactored the code to create a download_multipart_zip function in download_utils.py. Let me know if you have more suggestions.
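
For readers following along, a rough sketch of the shape such a function could take is below (argument names, the exact merge command, and the cleanup handling are assumptions based on the snippets discussed in this PR, not the merged implementation; download_from_remote appears in the diff quoted further down, and unzip is assumed to be the existing extraction helper in download_utils.py):

import os
import subprocess

def download_multipart_zip(zip_remotes, save_dir, force_overwrite, obj_to_download):
    # download every part of the split archive (.zip, .z01, .z02, ...)
    for part in zip_remotes:
        download_from_remote(part, save_dir, force_overwrite)
    # at this point in the PR, the REMOTES key doubles as the principal zip name
    zip_path = os.path.join(save_dir, obj_to_download + ".zip")
    out_path = os.path.join(save_dir, obj_to_download + "_single.zip")
    # join the parts into one self-contained archive with the system zip utility, then extract it
    subprocess.run(["zip", "-s", "0", zip_path, "--out", out_path], check=True)
    unzip(out_path)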

@genisplaja (Collaborator)

genisplaja commented Jun 5, 2021

Looks good @iranroman!! Thanks for that :) You need to make sure that every line is covered by tests, even if test_full_dataset passes locally! Codecov checks that for you every time you update the branch on GitHub. Any idea why the CircleCI tests are not passing here? I cloned your repo and all tests pass locally for me. Have you run black formatting?

@iranroman (Contributor, Author)

No idea @genisplaja. This is my first PR in this repo. I was expecting them to just pass. You would know better than me.


codecov bot commented Oct 13, 2021

Codecov Report

Merging #62 (15782d9) into master (04b4a01) will decrease coverage by 0.03%.
The diff coverage is 93.54%.

@@            Coverage Diff             @@
##           master      #62      +/-   ##
==========================================
- Coverage   97.91%   97.87%   -0.04%     
==========================================
  Files          18       18              
  Lines        1437     1415      -22     
==========================================
- Hits         1407     1385      -22     
  Misses         30       30              

@iranroman (Contributor, Author)

Hello all, codecov is complaining about lines that I did not originally write. What should I do to make codecov ignore these lines as before?

@magdalenafuentes (Collaborator)

@iranroman will look into this tomorrow!

magdalenafuentes added a commit that referenced this pull request Oct 18, 2021
magdalenafuentes added a commit that referenced this pull request Oct 18, 2021
@iranroman (Contributor, Author)

Hello @genisplaja.

This PR is very relevant to the dataset loader you wrote for fsd50k. Would you like to have a look before we close it?

@@ -54,6 +55,8 @@ def downloader(
        The directory to download the data
    remotes (dict or None):
        A dictionary of RemoteFileMetadata tuples of data in zip format.
        If a dictionary key has a list of RemoteFileMetadata,
Collaborator:

nit: Maybe rephrase? If an element of the dictionary is a list of RemoteFileMetadata?

Contributor Author:

Makes sense. Done.


if remotes[k].unpack_directories:
    for src_dir in remotes[k].unpack_directories:
if isinstance(remotes[k], list):
Collaborator:

Since you're assuming that the multiple files are zip, maybe you can check that and raise a NotImplementedError if they are not?

Contributor Author:

Thanks for this idea. I've implemented it.
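
For reference, the suggested guard could look something like this (illustrative only; remotes, k, save_dir, and force_overwrite come from the quoted snippet, download_multipart_zip follows the sketch earlier in this thread, and the exact check in the merged code may differ):

if isinstance(remotes[k], list):
    # a split zip has exactly one principal ".zip" part plus ".z01", ".z02", ... pieces
    if not any(part.filename.endswith(".zip") for part in remotes[k]):
        raise NotImplementedError(
            "When a remote is given as a list, only multipart zip archives are supported."
        )
    download_multipart_zip(remotes[k], save_dir, force_overwrite, k)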

@genisplaja (Collaborator)

Yes! I'll do that tomorrow morning (it's almost midnight here!). Thanks for letting me know:)

@iranroman (Contributor, Author)

No problem @genisplaja.

In the meantime I ran pytest -s tests/test_full_dataset.py --local --dataset fsd50k and everything passed for me (including download, multipart joining, and unzipping).

The multipart zip download, joining, and unpacking is now handled by download_utils.py. Let us know your thoughts.

@@ -206,51 +206,54 @@
primaryClass={cs.SD}
}
"""

# a dictionary key that hasa a list of RemoteFileMetadata implies a multi-part zip
Collaborator:

typo: has a

"""
for l in range(len(zip_remotes)):
    download_from_remote(zip_remotes[l], save_dir, force_overwrite)
zip_path = os.path.join(save_dir, obj_to_download + ".zip")
Collaborator:

This is a good way to get the principal zip filename to use as input to the zip command, but it forces the principal filename (without the .zip extension) to be used as the key for the list of parts of the split archive in the REMOTES dict (see lines 213-214 of the FSD50K loader). Am I right?
If so, this may confuse external contributors, since in other loaders one would just put, for instance, development as the key, as I did in the original FSD50K loader. Either this should be specified somewhere in the docs, or I would suggest iterating over the list of download items and parsing the filename field to identify the one with the .zip extension, which is the actual input filename for the zip command. I think it would be something like:

zip_path = next((part.filename for part in zip_remotes if '.zip' in part.filename), None)

and then you can get the output path like:

out_path = zip_path.replace('.zip', '_single.zip')

Then the obj_to_download input would no longer be needed. Since at the moment this function only handles multipart files in zip format, it should work, and the contributor would be allowed to use any key for the split file.
Also, I don't think you actually check that the filename is k + '.zip' or obj_to_download + '.zip'.
Just a proposal :) What do you think?

Contributor Author:

Hello, thanks for your suggestions.

I'm afraid I tried making the changes you suggested and now the tests are not passing. Do you want to give it a shot and open a PR against this one before we merge with main?

Contributor Author:

Never mind. I got it. Great suggestions, by the way. Thanks a lot!
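
For illustration, the merging step with this suggestion applied might read roughly as follows (a sketch only, reusing the imports and helpers from the earlier download_multipart_zip sketch; not the exact merged code):

for part in zip_remotes:
    download_from_remote(part, save_dir, force_overwrite)
# identify the principal ".zip" part from the remotes themselves, so obj_to_download is no longer needed
zip_path = os.path.join(
    save_dir, next(part.filename for part in zip_remotes if ".zip" in part.filename)
)
out_path = zip_path.replace(".zip", "_single.zip")
subprocess.run(["zip", "-s", "0", zip_path, "--out", out_path], check=True)
unzip(out_path)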

Collaborator:

Great, no worries! This PR looks ready to me.

@magdalenafuentes merged commit 978c185 into soundata:master Oct 19, 2021