Fix _quota_exceeded check and gdrive download #4109
Conversation
Currently, we instantiate Iterators via `response.iter_content` twice. Since this is most likely not supported for a streaming Response.content (refer https://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow) , we instead refactor the related functions such that only 1 Iterator is generated via `response.iter_content`. The first chunk of this iterator is then used for checking the quota and the partially consumed Iterator + first chunk for writing to disk if the quota_check is passed.
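A minimal sketch of the flow described above (not the exact torchvision code: the function name, file handling, and error message are illustrative; `_quota_exceeded` is the helper discussed in this thread, with a decode-based sketch of it further below):

```python
import itertools

import requests


def download_file(url: str, filepath: str, chunk_size: int = 32768) -> None:
    # Stream the response so the payload is not read into memory up front.
    response = requests.get(url, stream=True)
    content = response.iter_content(chunk_size)  # the single iterator

    first_chunk = None
    while not first_chunk:  # filter out keep-alive new chunks
        first_chunk = next(content)

    if _quota_exceeded(first_chunk):  # sketched later in this thread
        raise RuntimeError(f"Quota exceeded for {url}")

    # Prepend the already-consumed chunk so the full payload reaches disk.
    with open(filepath, "wb") as fh:
        for chunk in itertools.chain([first_chunk], content):
            fh.write(chunk)
```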
Hi @ORippler! Thank you for your pull request and welcome to our community.

**Action Required**

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

**Process**

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
I wondered the exact same thing, but somehow my version passed some local tests just fine. I thought maybe there is some internal caching in the response that made this work. Weirdly enough we had no mention of this issue since my previous fix, so I thought this was done for good. Guess I was wrong 🙄
Anyway, this looks like a good solution. Thanks a lot @ORippler!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
Restoring the initial generator simplifies the code, since we no longer have to treat `first_chunk` separately when writing to disk but can iterate over the restored generator instead
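As a standalone illustration of that restoration step (assumed toy values, not torchvision code):

```python
import itertools

gen = iter([b"a", b"b", b"c"])
first = next(gen)  # the generator is now partially consumed
restored = itertools.chain([first], gen)  # re-attach the consumed chunk
assert list(restored) == [b"a", b"b", b"c"]
```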
* Remove unused library `contextlib`
* Reference the broken/inconsistent Google Drive API that necessitates the workaround via decoding of the `first_chunk` of the payload.
Thanks @ORippler , LGTM with a question and a suggestion
```python
# with their own API, refer https://github.com/pytorch/vision/issues/2992#issuecomment-730614517.
# Should this be fixed at some place in future, one could refactor the following to no longer rely on decoding
# the first_chunk of the payload
response_content_generator = response.iter_content(32768)
```
Does it matter that the chunk size used to be `chunk_size=128` for this specific check and now it's 32768?
I think I tested it back then and 128 was enough to get to the important part of the message. But since we use the chunk anyway, any reasonable chunk size should be ok.
As far as I could debug manually, Google Drive returns either:
- An HTML page stating that the quota is exceeded. We currently parse the contents of this HTML in `_quota_exceeded`, and yes, a size of 128 is sufficient to get to the phrase we check against here. Larger chunk sizes didn't hurt in my debugging.
- The desired payload, e.g. a tarball of a dataset (which cannot be decoded and parsed).
The reason parsing the HTML is worse than using `response.status_code` is that if Google decides to change the formatting of the returned HTML, or returns different HTML for edge cases we are currently unaware of, our check will fail: we will write the HTML to disk again, and subsequently fail to unpack it in the next step, leaving users with a cryptic error message.
We could of course be stricter here by rejecting all responses whose payload's first chunk can be fully decoded to a string, but the best long-term solution would probably be for Google to fix/adhere to their API.
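A sketch of the decode-based check described above (the exact phrase matched against is an assumption, not verified torchvision source):

```python
def _quota_exceeded(first_chunk: bytes) -> bool:
    try:
        # The quota page is HTML and therefore decodable; the phrase below
        # is an assumed stand-in for the string the check matches.
        return "Google Drive - Quota exceeded" in first_chunk.decode()
    except UnicodeDecodeError:
        # Binary payloads (e.g. a tarball) cannot be decoded as UTF-8,
        # so a decode failure means we received the actual file.
        return False
```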
```python
while not first_chunk:  # filter out keep-alive new chunks
    first_chunk = next(response_content_generator)

if _quota_exceeded(first_chunk):
```
I know this doesn't come from this PR, but I would suggest removing the `_quota_exceeded` function and just inlining it here. We'll be able to use `raise RuntimeError(msg) from decode_error`, which will provide cleaner tracebacks.
While I can surely inline the function, I don't get what you mean by `raise ... from UnicodeDecodeError`. We use the `UnicodeDecodeError` as the passing condition (i.e. a `UnicodeDecodeError` will only happen if we have a valid payload). Did I misunderstand you here? Please clarify (sorry, I am new to exception chaining).
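For context, a minimal illustration of exception chaining (plain Python, not torchvision code): `raise ... from ...` records the caught exception as `__cause__`, so both tracebacks are printed, which is only useful when the caught exception is part of the failure path — not the success path, as it is here:

```python
try:
    b"\xff\xfe".decode()  # raises UnicodeDecodeError (invalid UTF-8)
except UnicodeDecodeError as decode_error:
    # The decode error is chained as __cause__ of the RuntimeError.
    raise RuntimeError("payload is not valid UTF-8") from decode_error
```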
oh you're right, it wouldn't make sense then!
thanks!
…iles in some cases (#4109)

Reviewed By: NicolasHug

Differential Revision: D29369894

fbshipit-source-id: 52d175103eb77170963f8115dbee3f8eb373802d
Still have the same problem in 2022 :-(

@GianniGi please open a new issue.

please fix this thing
Fixes #4108 and #2992 by constructing the iterator from a `requests.Response` via `Response.iter_content` only once, passing the `first_chunk` to `_quota_exceeded` for checking, restoring the initial iterator, and passing it to `_save_response_content` for writing to disk.

@pmeier