
Fix _quota_exceeded check and gdrive download #4109

Merged: 3 commits merged into pytorch:master on Jun 24, 2021
Conversation

@ORippler (Contributor) commented Jun 24, 2021

Fixes #4108 and #2992 by constructing the iterator from a requests.Response via Response.iter_content only once: the first chunk is passed to _quota_exceeded for checking, then the initial iterator is restored and passed to _save_response_content for writing to disk.

@pmeier

Currently, we instantiate iterators via `response.iter_content` twice.

Since this is most likely not supported for a streaming `Response.content`
(see https://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow),
we instead refactor the related functions so that only one iterator is
generated via `response.iter_content`.

The first chunk of this iterator is then used to check the quota; if the
check passes, the first chunk plus the partially consumed iterator are
used for writing to disk.
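A minimal sketch of the refactor described above (the `download` function, the stand-in `_quota_exceeded` helper, and the exact quota message are illustrative; the real implementation lives in torchvision's `download_file_from_google_drive` and `_save_response_content`):

```python
import itertools

import requests


def _quota_exceeded(first_chunk: bytes) -> bool:
    # Stand-in for torchvision's check: the Google Drive quota page is
    # HTML text, while a real payload is binary and fails to decode.
    try:
        return "Google Drive - Quota exceeded" in first_chunk.decode()
    except UnicodeDecodeError:
        return False


def download(url: str, filepath: str, chunk_size: int = 32768) -> None:
    # Stream the response so the payload is not loaded into memory at once.
    response = requests.get(url, stream=True)

    # Create the content iterator exactly once: iterating a streaming
    # response a second time is not supported.
    content = response.iter_content(chunk_size)

    # Peek at the first non-empty chunk for the quota check.
    first_chunk = None
    while not first_chunk:  # filter out keep-alive chunks
        first_chunk = next(content)

    if _quota_exceeded(first_chunk):
        raise RuntimeError("Google Drive quota exceeded")

    # "Restore" the iterator by chaining the consumed chunk back in front,
    # then hand everything to the writer.
    restored = itertools.chain([first_chunk], content)
    with open(filepath, "wb") as fh:
        for chunk in restored:
            if chunk:
                fh.write(chunk)
```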
@facebook-github-bot

Hi @ORippler!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@pmeier (Collaborator) left a comment


I wondered the exact same thing, but somehow my version passed some local tests just fine. I thought maybe there is some internal caching in the response that made this work. Weirdly enough we had no mention of this issue since my previous fix, so I thought this was done for good. Guess I was wrong 🙄

Anyway, this looks like a good solution. Thanks a lot @ORippler!

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

Restoring the initial generator simplifies the code, since we no longer
have to treat `first_chunk` separately when writing to disk but can
iterate over the restored generator instead.

* Remove unused library `contextlib`
* Reference the broken/inconsistent Google Drive API that necessitates
the workaround via decoding the `first_chunk` of the payload.
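To illustrate the commit above (a generic Python example, not code from the PR): chaining the consumed first chunk back onto the generator reproduces the original sequence, so the write loop needs no special case:

```python
import itertools


def chunks():
    # Hypothetical chunk generator standing in for response.iter_content.
    yield b"first"
    yield b"second"
    yield b"third"


gen = chunks()
first = next(gen)                         # peek at the first chunk
restored = itertools.chain([first], gen)  # restore the original order

assert list(restored) == [b"first", b"second", b"third"]
```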
@ORippler ORippler requested a review from NicolasHug June 24, 2021 14:59
@NicolasHug (Member) left a comment


Thanks @ORippler, LGTM with a question and a suggestion.

# Ideally, one would use response.status_code to check for quota limits, but Google Drive is not consistent
# with their own API, refer https://github.com/pytorch/vision/issues/2992#issuecomment-730614517.
# Should this be fixed at some place in future, one could refactor the following to no longer rely on decoding
# the first_chunk of the payload
response_content_generator = response.iter_content(32768)
Member


Does it matter that the chunk size used to be `chunk_size=128` for this specific check and now it's 32768?

Collaborator


I think I tested it back then and 128 was enough to get to the important part of the message. But since we use the chunk anyway, any reasonable chunk size should be ok.

@ORippler (Contributor, Author) commented Jun 24, 2021


As far as I could debug manually, Google Drive returns either:

  1. An HTML page stating that the quota is exceeded. We currently parse the contents of this HTML in _quota_exceeded, and yes, a size of 128 is sufficient to get to the phrase we check against here. Larger chunk sizes didn't hurt in my debugging.
  2. The desired payload, e.g. a tarball of a dataset (which cannot be decoded and parsed).

The reason parsing the HTML is worse than using response.status_code is that if Google changes the formatting of the returned HTML, or returns different HTML for edge cases we are currently unaware of, our check will fail: we will write the HTML to disk again and subsequently fail to unpack it in the next step, leaving users with a cryptic error message.

We could of course be stricter here by rejecting all responses whose payload's first chunk can be fully decoded to a string, but the best long-term solution would probably be for Google to fix/adhere to their API.
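A sketch of the stricter variant mentioned above (hypothetical; this is not what the PR implements): reject any response whose first chunk decodes cleanly, on the assumption that a real payload is binary:

```python
def _looks_like_error_page(first_chunk: bytes) -> bool:
    # Stricter hypothetical check: any fully decodable first chunk is
    # treated as an HTML/text error page rather than the payload.
    try:
        first_chunk.decode()
    except UnicodeDecodeError:
        return False  # binary data: assume this is the real payload
    return True
```

Note that this heuristic can misfire: plain-text files are valid payloads too, and a chunk boundary can split a multi-byte character and trigger a spurious `UnicodeDecodeError`, which is one reason to keep the phrase-based check instead.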

first_chunk = None
while not first_chunk:  # filter out keep-alive new chunks
    first_chunk = next(response_content_generator)

if _quota_exceeded(first_chunk):
Member


I know this doesn't come from this PR, but I would suggest removing the `_quota_exceeded` function and just inlining it here. We'd be able to use `raise RuntimeError(msg) from decode_error`, which would provide cleaner tracebacks.

@ORippler (Contributor, Author) commented Jun 24, 2021


While I can surely inline the function, I don't get what you mean with `raise from UnicodeDecodeError`. We use the `UnicodeDecodeError` as the passing condition (i.e. a `UnicodeDecodeError` will only occur if we have a valid payload).

Did I misunderstand you here? Please clarify (sorry, I am new to exception chaining).
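For background (a generic illustration, not code from this PR): `raise ... from ...` attaches a caught exception as the `__cause__` of a new one, which only helps when the caught exception is re-raised as a failure; here the `UnicodeDecodeError` is the success path, so there is nothing to chain:

```python
def parse_port(raw: str) -> int:
    # Generic exception-chaining example: the original ValueError is
    # attached as __cause__ of the more descriptive RuntimeError.
    try:
        return int(raw)
    except ValueError as err:
        raise RuntimeError(f"invalid port: {raw!r}") from err
```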

Member


oh you're right, it wouldn't make sense then!

@NicolasHug (Member) left a comment


thanks!

@NicolasHug NicolasHug merged commit ab60e53 into pytorch:master Jun 24, 2021
@ORippler ORippler deleted the fix_google_drive_quotacheck branch June 24, 2021 16:51
facebook-github-bot pushed a commit that referenced this pull request Jun 25, 2021
…iles in some cases (#4109)

Reviewed By: NicolasHug

Differential Revision: D29369894

fbshipit-source-id: 52d175103eb77170963f8115dbee3f8eb373802d
@g-i-o-r-g-i-o

Still have the same problem in 2022 :-(

@pmeier (Collaborator) commented Sep 23, 2022

@GianniGi please open a new issue.

@rewixx commented Dec 25, 2022

Please fix this thing.

Successfully merging this pull request may close these issues:

Downloads from Google Drive return empty files / are still broken