Use `urllib` handler for `s3://` and `gs://`, improve `url_exists` through HEAD requests #34324
Conversation
Force-pushed from `0a3c15d` to `fac0cce`:
Make `url_exists` do a HEAD request for the http/https/s3 protocols. Rework the opener: construct it once and only once, and dynamically dispatch to the right one based on config.
Force-pushed from `df02ab8` to `ad9f686`.
This looks great, thanks @haampie! One thing I didn't understand from the description. You mentioned "Keep using standard HTTPRedirectHandler for HEAD requests", can you point me to where that is happening? I didn't find it in here or in the code right off the bat.
```python
with_ssl = build_opener(s3, gcs, HTTPSHandler(context=ssl.create_default_context()))

# One opener with HTTPS ssl disabled
without_ssl = build_opener(s3, gcs, HTTPSHandler(context=ssl._create_unverified_context()))
```
This is really nice, compared to how it was done in those blocks you removed!
One thing about these web/s3 PRs is that they don't change any hashes and never rebuild anything. Additionally, I don't think the unit tests do a very good job of testing the s3 bits.
Yeah, some more tests around here would help (it would also force fewer globals...). I honestly think Python has a bug in urllib (a HEAD request becomes a GET on redirect), for which I also submitted a PR to cpython, but I haven't gotten a response yet. It causes some issues with Bitbucket redirects to an AWS bucket, so in this PR I'm not touching it.
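The urllib behavior being described can be observed without any network access: the standard `HTTPRedirectHandler.redirect_request` builds the new `Request` without carrying over the original method, so it falls back to the default and a redirected HEAD comes back as a GET. A minimal demonstration:

```python
import urllib.request

handler = urllib.request.HTTPRedirectHandler()
head_req = urllib.request.Request("http://example.com/a", method="HEAD")

# Simulate a 302 redirect. redirect_request constructs the follow-up
# Request without passing method=, so the HEAD silently becomes a GET.
new_req = handler.redirect_request(
    head_req, None, 302, "Found", {}, "http://example.com/b"
)
print(head_req.get_method(), "->", new_req.get_method())  # HEAD -> GET
```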
… `url_exists` through HEAD requests (spack#34324)"" This reverts commit 8035eeb.
… `url_exists` through HEAD requests (#34324)"" (#34498) This reverts commit 8035eeb. And also removes logic around an additional HEAD request to prevent a more expensive GET request on wrong content-type. Since large files are typically an attachment and only downloaded when reading the stream, it's not an optimization that helps much, and in fact the logic was broken since the GET request was done unconditionally.
…rough HEAD requests (spack#34324) * `url_exists` improvements (take 2) Make `url_exists` do HEAD request for http/https/s3 protocols Rework the opener: construct it once and only once, dynamically dispatch to the right one based on config.
…ists` through HEAD requests (spack#34324)" This reverts commit db8f115.
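The `url_exists` change in these commits amounts to probing a URL with a HEAD request rather than starting a download. A minimal sketch of that idea (not Spack's actual implementation, which also dispatches to s3/gs handlers and honors SSL config):

```python
import urllib.error
import urllib.request


def url_exists(url, timeout=10):
    """Probe a URL with a HEAD request instead of downloading the body."""
    try:
        urllib.request.urlopen(
            urllib.request.Request(url, method="HEAD"), timeout=timeout
        )
        return True
    except (urllib.error.URLError, ValueError):
        return False
```

For http/https the HEAD avoids transferring the response body entirely; handlers for schemes like `file://` simply ignore the method, so existence checks still work there too.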
The idea of `urllib` is to add handlers for custom protocols through `<protocol>_open(req)`, so let's do that for `s3://` and `gs://`.

- Use `head_object` for `Request("s3://...", method="HEAD")` requests.
- Use `HEAD` requests in `url_exists` to avoid expensive downloads from s3.
- Create a "custom" `urlopen` that instantiates an opener once and only once.
- Keep using the standard `HTTPRedirectHandler` for `HEAD` requests; in a previous iteration of this PR, I redirected HEAD requests as HEAD requests, which led to issues when downloading from BitBucket.

Before:

After:
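The `<protocol>_open(req)` mechanism the description refers to is a urllib naming convention: `build_opener` registers any handler method named `<scheme>_open` as the opener for that scheme. A toy sketch of the mechanism with a dummy `s3://` handler (`DummyS3Handler` is illustrative, not Spack's class; a real handler would call boto3's `head_object`/`get_object` instead of returning canned bytes):

```python
import io
import urllib.request
import urllib.response
from email.message import Message


class DummyS3Handler(urllib.request.BaseHandler):
    # urllib dispatches s3:// URLs to this method purely by its name,
    # "<scheme>_open". A real handler would talk to S3 here, using
    # head_object for HEAD requests and get_object otherwise.
    def s3_open(self, req):
        body = b"" if req.get_method() == "HEAD" else b"object body"
        return urllib.response.addinfourl(
            io.BytesIO(body), Message(), req.full_url, code=200
        )


opener = urllib.request.build_opener(DummyS3Handler())
resp = opener.open(urllib.request.Request("s3://bucket/key", method="HEAD"))
print(resp.getcode(), resp.read())  # 200 b''
```

Because the handler plugs into the normal opener chain, everything layered on top of `urlopen` (retries, `url_exists`, redirect handling for http) works unchanged for the new schemes.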