Use `urllib` handler for `s3://` and `gs://`, improve `url_exists` through HEAD requests #34324
Conversation
Force-pushed from `0a3c15d` to `fac0cce`:
Make `url_exists` do a HEAD request for the http/https/s3 protocols. Rework the opener: construct it once and only once, and dynamically dispatch to the right one based on config.
Force-pushed from `df02ab8` to `ad9f686`.
This looks great, thanks @haampie! One thing I didn't understand from the description. You mentioned "Keep using standard HTTPRedirectHandler for HEAD requests", can you point me to where that is happening? I didn't find it in here or in the code right off the bat.
```python
with_ssl = build_opener(s3, gcs, HTTPSHandler(context=ssl.create_default_context()))

# One opener with HTTPS ssl disabled
without_ssl = build_opener(s3, gcs, HTTPSHandler(context=ssl._create_unverified_context()))
```
This is really nice, compared to how it was done in those blocks you removed!
One thing about these web/s3 PRs is that they don't change any hashes and never rebuild anything. Additionally, I don't think the unit tests do a very good job of testing the s3 bits.
Yeah, some more tests around here would help (it would also force fewer globals...). I honestly think Python has a bug in urllib (a HEAD request becomes a GET on redirect), for which I also submitted a PR to cpython, but I haven't gotten a response yet. It causes some issues with Bitbucket redirects to an AWS bucket, so in this PR I'm not touching it.
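The urllib behavior being described can be observed without any network access: the standard `HTTPRedirectHandler.redirect_request` builds the new `Request` without carrying over the original method, so it falls back to the default and a redirected HEAD comes back as a GET. A minimal demonstration:

```python
import urllib.request

handler = urllib.request.HTTPRedirectHandler()
head_req = urllib.request.Request("http://example.com/a", method="HEAD")

# Simulate a 302 redirect. redirect_request constructs the follow-up
# Request without passing method=, so the HEAD silently becomes a GET.
new_req = handler.redirect_request(
    head_req, None, 302, "Found", {}, "http://example.com/b"
)
print(head_req.get_method(), "->", new_req.get_method())  # HEAD -> GET
```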
… `url_exists` through HEAD requests (spack#34324)"" This reverts commit 8035eeb.
… `url_exists` through HEAD requests (#34324)"" (#34498) This reverts commit 8035eeb. And also removes logic around an additional HEAD request to prevent a more expensive GET request on wrong content-type. Since large files are typically an attachment and only downloaded when reading the stream, it's not an optimization that helps much, and in fact the logic was broken since the GET request was done unconditionally.
…rough HEAD requests (spack#34324) * `url_exists` improvements (take 2) Make `url_exists` do HEAD request for http/https/s3 protocols Rework the opener: construct it once and only once, dynamically dispatch to the right one based on config.
…ists` through HEAD requests (spack#34324)" This reverts commit db8f115.
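The `url_exists` change in these commits amounts to probing a URL with a HEAD request rather than starting a download. A minimal sketch of that idea (not Spack's actual implementation, which also dispatches to s3/gs handlers and honors SSL config):

```python
import urllib.error
import urllib.request


def url_exists(url, timeout=10):
    """Probe a URL with a HEAD request instead of downloading the body."""
    try:
        urllib.request.urlopen(
            urllib.request.Request(url, method="HEAD"), timeout=timeout
        )
        return True
    except (urllib.error.URLError, ValueError):
        return False
```

For http/https the HEAD avoids transferring the response body entirely; handlers for schemes like `file://` simply ignore the method, so existence checks still work there too.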
The idea of `urllib` is to add handlers for custom protocols through `<protocol>_open(req)`, so let's do that for `s3://` and `gs://`.

- Use `head_object` for `Request("s3://...", method="HEAD")` requests.
- Use `HEAD` requests in `url_exists` to avoid expensive downloads from s3.
- Create a "custom" `urlopen` that instantiates an opener once and only once.
- Keep using the standard `HTTPRedirectHandler` for `HEAD` requests; in a previous iteration of this PR, I redirected HEAD requests as HEAD requests, which led to issues when downloading from BitBucket.

Before:

After:
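The `<protocol>_open(req)` mechanism the description refers to is a urllib naming convention: `build_opener` registers any handler method named `<scheme>_open` as the opener for that scheme. A toy sketch of the mechanism with a dummy `s3://` handler (`DummyS3Handler` is illustrative, not Spack's class; a real handler would call boto3's `head_object`/`get_object` instead of returning canned bytes):

```python
import io
import urllib.request
import urllib.response
from email.message import Message


class DummyS3Handler(urllib.request.BaseHandler):
    # urllib dispatches s3:// URLs to this method purely by its name,
    # "<scheme>_open". A real handler would talk to S3 here, using
    # head_object for HEAD requests and get_object otherwise.
    def s3_open(self, req):
        body = b"" if req.get_method() == "HEAD" else b"object body"
        return urllib.response.addinfourl(
            io.BytesIO(body), Message(), req.full_url, code=200
        )


opener = urllib.request.build_opener(DummyS3Handler())
resp = opener.open(urllib.request.Request("s3://bucket/key", method="HEAD"))
print(resp.getcode(), resp.read())  # 200 b''
```

Because the handler plugs into the normal opener chain, everything layered on top of `urlopen` (retries, `url_exists`, redirect handling for http) works unchanged for the new schemes.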