
Fix _quota_exceeded check and gdrive download #4109

Merged: 3 commits merged into pytorch:master on Jun 24, 2021
Conversation

@ORippler (Contributor) commented Jun 24, 2021

Fixes #4108 and #2992 by constructing the iterator from a requests.Response via Response.iter_content only once: the first chunk is passed to _quota_exceeded for checking, then the initial iterator is restored and passed to _save_response_content for writing to disk.

@pmeier

Currently, we instantiate iterators via `response.iter_content` twice.

Since this is most likely not supported for a streaming `Response.content`
(see https://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow),
we instead refactor the related functions so that only one iterator is
generated via `response.iter_content`.

The first chunk of this iterator is then used to check the quota; if the
check passes, the first chunk plus the partially consumed iterator are
used for writing to disk.
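A minimal sketch of the refactor described above (the `download` function, the stand-in `_quota_exceeded` helper, and the exact quota message are illustrative; the real implementation lives in torchvision's `download_file_from_google_drive` and `_save_response_content`):

```python
import itertools

import requests


def _quota_exceeded(first_chunk: bytes) -> bool:
    # Stand-in for torchvision's check: the Google Drive quota page is
    # HTML text, while a real payload is binary and fails to decode.
    try:
        return "Google Drive - Quota exceeded" in first_chunk.decode()
    except UnicodeDecodeError:
        return False


def download(url: str, filepath: str, chunk_size: int = 32768) -> None:
    # Stream the response so the payload is not loaded into memory at once.
    response = requests.get(url, stream=True)

    # Create the content iterator exactly once: iterating a streaming
    # response a second time is not supported.
    content = response.iter_content(chunk_size)

    # Peek at the first non-empty chunk for the quota check.
    first_chunk = None
    while not first_chunk:  # filter out keep-alive chunks
        first_chunk = next(content)

    if _quota_exceeded(first_chunk):
        raise RuntimeError("Google Drive quota exceeded")

    # "Restore" the iterator by chaining the consumed chunk back in front,
    # then hand everything to the writer.
    restored = itertools.chain([first_chunk], content)
    with open(filepath, "wb") as fh:
        for chunk in restored:
            if chunk:
                fh.write(chunk)
```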
@facebook-github-bot

Hi @ORippler!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@pmeier (Collaborator) left a comment


I wondered the exact same thing, but somehow my version passed some local tests just fine. I thought maybe there is some internal caching in the response that made this work. Weirdly enough we had no mention of this issue since my previous fix, so I thought this was done for good. Guess I was wrong 🙄

Anyway, this looks like a good solution. Thanks a lot @ORippler!

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

Restoring the initial generator simplifies the code, since we no longer
have to treat `first_chunk` separately when writing to disk but can
iterate over the restored generator instead.

* Remove unused library `contextlib`
* Reference the broken/inconsistent Google Drive API that necessitates
the workaround via decoding the `first_chunk` of the payload.
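To illustrate the commit above (a generic Python example, not code from the PR): chaining the consumed first chunk back onto the generator reproduces the original sequence, so the write loop needs no special case:

```python
import itertools


def chunks():
    # Hypothetical chunk generator standing in for response.iter_content.
    yield b"first"
    yield b"second"
    yield b"third"


gen = chunks()
first = next(gen)                         # peek at the first chunk
restored = itertools.chain([first], gen)  # restore the original order

assert list(restored) == [b"first", b"second", b"third"]
```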
@ORippler ORippler requested a review from NicolasHug June 24, 2021 14:59
@NicolasHug (Member) left a comment


Thanks @ORippler, LGTM with a question and a suggestion.

# Ideally, one would use response.status_code to check for quota limits, but Google Drive is not consistent
# with their own API, refer https://github.com/pytorch/vision/issues/2992#issuecomment-730614517.
# Should this be fixed at some place in future, one could refactor the following to no longer rely on decoding
# the first_chunk of the payload
response_content_generator = response.iter_content(32768)
Member


Does it matter that the chunk size used to be `chunk_size=128` for this specific check and now it's 32768?

Collaborator


I think I tested it back then and 128 was enough to get to the important part of the message. But since we use the chunk anyway, any reasonable chunk size should be ok.

@ORippler (Contributor, Author) commented Jun 24, 2021


As far as I could debug manually, Google Drive returns either:

  1. An HTML page stating that the quota is exceeded. We currently parse the contents of this HTML in _quota_exceeded, and yes, a size of 128 is sufficient to get to the phrase we check against here. Larger chunk sizes didn't hurt in my debugging.
  2. The desired payload, e.g. a tarball of a dataset (which cannot be decoded and parsed).

The reason parsing the HTML is worse than using response.status_code is that if Google changes the formatting of the returned HTML, or returns different HTML for edge cases we are currently unaware of, our check will fail: we will write the HTML to disk again and subsequently fail to unpack it in the next step, leaving users with a cryptic error message.

We could of course be stricter here by rejecting all responses whose payload's first chunk can be fully decoded to a string, but the best long-term solution would probably be for Google to fix/adhere to their API.
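A sketch of the stricter variant mentioned above (hypothetical; this is not what the PR implements): reject any response whose first chunk decodes cleanly, on the assumption that a real payload is binary:

```python
def _looks_like_error_page(first_chunk: bytes) -> bool:
    # Stricter hypothetical check: any fully decodable first chunk is
    # treated as an HTML/text error page rather than the payload.
    try:
        first_chunk.decode()
    except UnicodeDecodeError:
        return False  # binary data: assume this is the real payload
    return True
```

Note that this heuristic can misfire: plain-text files are valid payloads too, and a chunk boundary can split a multi-byte character and trigger a spurious `UnicodeDecodeError`, which is one reason to keep the phrase-based check instead.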

first_chunk = None
while not first_chunk:  # filter out keep-alive new chunks
    first_chunk = next(response_content_generator)

if _quota_exceeded(first_chunk):
Member


I know this doesn't come from this PR, but I would suggest removing the `_quota_exceeded` function and just inlining it here. We'd be able to use `raise RuntimeError(msg) from decode_error`, which would provide cleaner tracebacks.

@ORippler (Contributor, Author) commented Jun 24, 2021


While I can surely inline the function, I don't get what you mean with `raise from UnicodeDecodeError`. We use the `UnicodeDecodeError` as the passing condition (i.e. a `UnicodeDecodeError` will only occur if we have a valid payload).

Did I misunderstand you here? Please clarify (sorry, I am new to exception chaining).
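For background (a generic illustration, not code from this PR): `raise ... from ...` attaches a caught exception as the `__cause__` of a new one, which only helps when the caught exception is re-raised as a failure; here the `UnicodeDecodeError` is the success path, so there is nothing to chain:

```python
def parse_port(raw: str) -> int:
    # Generic exception-chaining example: the original ValueError is
    # attached as __cause__ of the more descriptive RuntimeError.
    try:
        return int(raw)
    except ValueError as err:
        raise RuntimeError(f"invalid port: {raw!r}") from err
```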

Member


oh you're right, it wouldn't make sense then!

@NicolasHug (Member) left a comment


thanks!

@NicolasHug NicolasHug merged commit ab60e53 into pytorch:master Jun 24, 2021
@ORippler ORippler deleted the fix_google_drive_quotacheck branch June 24, 2021 16:51
facebook-github-bot pushed a commit that referenced this pull request Jun 25, 2021
…iles in some cases (#4109)

Reviewed By: NicolasHug

Differential Revision: D29369894

fbshipit-source-id: 52d175103eb77170963f8115dbee3f8eb373802d
@g-i-o-r-g-i-o

Still have the same problem in 2022 :-(

@pmeier (Collaborator) commented Sep 23, 2022

@GianniGi please open a new issue.

@rewixx commented Dec 25, 2022

Please fix this thing.

Successfully merging this pull request may close these issues:

Downloads from Google Drive return empty files / are still broken