Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use urllib handler for s3:// and gs://, improve url_exists through HEAD requests #34324

Merged
merged 4 commits into from
Dec 9, 2022

Conversation

haampie
Copy link
Member

@haampie haampie commented Dec 5, 2022

  • The idea of urllib is to add handlers for custom protocols through <protocol>_open(req), so let's do that for s3:// and gs://.

  • Use head_object for Request("s3://...", method="HEAD") requests

  • Use HEAD requests in url_exists to avoid expensive downloads from s3.

  • Create a "custom" urlopen that instantiates an opener once and only once.

  • Keep using the standard HTTPRedirectHandler for HEAD requests; in a previous iteration of this PR, I redirected HEAD requests as HEAD requests which led to issues when downloading from BitBucket.

Before:

In [1]: import spack.util.web

In [2]: %timeit spack.util.web.url_exists("s3://spack-binaries/develop/aws-ahug-aarch64/build_cache/index.json")
526 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit spack.util.web.url_exists("https://binaries.spack.io/develop/aws-ahug-aarch64/build_cache/index.json")
558 ms ± 9.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After:

In [1]: import spack.util.web

In [2]: %timeit spack.util.web.url_exists("s3://spack-binaries/develop/aws-ahug-aarch64/build_cache/index.json")
104 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit spack.util.web.url_exists("https://binaries.spack.io/develop/aws-ahug-aarch64/build_cache/index.json")
146 ms ± 3.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@spackbot-app spackbot-app bot added core PR affects Spack core functionality fetching tests General test capability(ies) utilities labels Dec 5, 2022
@haampie haampie marked this pull request as ready for review December 5, 2022 16:17
@haampie haampie mentioned this pull request Dec 5, 2022
@haampie haampie changed the title move s3/gs into handler Use urrlib handler for s3:// and gs://, import url_exists through HEAD requests Dec 5, 2022
@haampie haampie changed the title Use urrlib handler for s3:// and gs://, import url_exists through HEAD requests Use urrlib handler for s3:// and gs://, improve url_exists through HEAD requests Dec 5, 2022
@haampie haampie changed the title Use urrlib handler for s3:// and gs://, improve url_exists through HEAD requests Use urllib handler for s3:// and gs://, improve url_exists through HEAD requests Dec 5, 2022
@haampie haampie changed the title Use urllib handler for s3:// and gs://, improve url_exists through HEAD requests Use urllib handler for s3:// and gs://, improve url_exists through HEAD requests Dec 5, 2022
@haampie haampie force-pushed the fix/url-things-3 branch 2 times, most recently from 0a3c15d to fac0cce Compare December 6, 2022 16:27
Make `url_exists` do HEAD request for http/https/s3 protocols

Rework the opener: construct it once and only once, dynamically dispatch
to the right one based on config.
Copy link
Contributor

@scottwittenburg scottwittenburg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks @haampie! One thing I didn't understand from the description. You mentioned "Keep using standard HTTPRedirectHandler for HEAD requests", can you point me to where that is happening? I didn't find it in here or in the code right off the bat.

with_ssl = build_opener(s3, gcs, HTTPSHandler(context=ssl.create_default_context()))

# One opener with HTTPS ssl disabled
without_ssl = build_opener(s3, gcs, HTTPSHandler(context=ssl._create_unverified_context()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice, compared to how it was done in those blocks you removed!

@scottwittenburg
Copy link
Contributor

One thing about these web/s3 PRs is they don't change any hashes and never rebuild anything. Additionally, I don't think the unit tests may do a very good job of testing the s3 bits.

@haampie
Copy link
Member Author

haampie commented Dec 9, 2022

Yeah, some more tests around here would help (also it would force fewer globals...)

I honestly think Python has a bug in urllib (HEAD on redirect becomes GET), for which I also submitted a PR to cpython, but didn't get a response yet, and it causes some issues with bitbucket redirects to an aws bucket, so in this pr I'm not touching it

@haampie haampie merged commit db8f115 into spack:develop Dec 9, 2022
@haampie haampie deleted the fix/url-things-3 branch December 9, 2022 23:20
tgamblin added a commit that referenced this pull request Dec 10, 2022
…ists` through HEAD requests (#34324)"

This reverts commit db8f115.
tgamblin added a commit that referenced this pull request Dec 11, 2022
…ists` through HEAD requests (#34324)"

This reverts commit db8f115.
haampie added a commit to haampie/spack that referenced this pull request Dec 11, 2022
… `url_exists` through HEAD requests (spack#34324)""

This reverts commit 8035eeb.
haampie added a commit to haampie/spack that referenced this pull request Dec 13, 2022
… `url_exists` through HEAD requests (spack#34324)""

This reverts commit 8035eeb.
haampie added a commit to haampie/spack that referenced this pull request Dec 13, 2022
… `url_exists` through HEAD requests (spack#34324)""

This reverts commit 8035eeb.
haampie added a commit to haampie/spack that referenced this pull request Dec 14, 2022
… `url_exists` through HEAD requests (spack#34324)""

This reverts commit 8035eeb.
haampie added a commit that referenced this pull request Dec 14, 2022
… `url_exists` through HEAD requests (#34324)"" (#34498)

This reverts commit 8035eeb.

And also removes logic around an additional HEAD request to prevent
a more expensive GET request on wrong content-type. Since large files
are typically an attachment and only downloaded when reading the
stream, it's not an optimization that helps much, and in fact the logic
was broken since the GET request was done unconditionally.
benkirk pushed a commit to benkirk/spack that referenced this pull request Dec 16, 2022
… `url_exists` through HEAD requests (spack#34324)"" (spack#34498)

This reverts commit 8035eeb.

And also removes logic around an additional HEAD request to prevent
a more expensive GET request on wrong content-type. Since large files
are typically an attachment and only downloaded when reading the
stream, it's not an optimization that helps much, and in fact the logic
was broken since the GET request was done unconditionally.
amd-toolchain-support pushed a commit to amd-toolchain-support/spack that referenced this pull request Feb 16, 2023
…rough HEAD requests (spack#34324)

* `url_exists` improvements (take 2)

Make `url_exists` do HEAD request for http/https/s3 protocols

Rework the opener: construct it once and only once, dynamically dispatch
to the right one based on config.
amd-toolchain-support pushed a commit to amd-toolchain-support/spack that referenced this pull request Feb 16, 2023
…ists` through HEAD requests (spack#34324)"

This reverts commit db8f115.
amd-toolchain-support pushed a commit to amd-toolchain-support/spack that referenced this pull request Feb 16, 2023
… `url_exists` through HEAD requests (spack#34324)"" (spack#34498)

This reverts commit 8035eeb.

And also removes logic around an additional HEAD request to prevent
a more expensive GET request on wrong content-type. Since large files
are typically an attachment and only downloaded when reading the
stream, it's not an optimization that helps much, and in fact the logic
was broken since the GET request was done unconditionally.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core PR affects Spack core functionality fetching tests General test capability(ies) utilities
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants