
Conversation

akx (Contributor) commented Dec 29, 2023

This PR increases the read size for the hub.download_url_to_file function from 8,192 bytes to 131,072 bytes (128 * 1,024), as reading in larger chunks should be more efficient. The size could probably be larger still, at the expense of the progress bar not getting updated as often.

It re-introduces use of the READ_DATA_CHUNK constant that was originally used for this purpose in 4a3baec and since forgotten.
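
For illustration, here is a minimal sketch of what the affected read loop in torch/hub.py's download_url_to_file looks like with the change applied (simplified: variable names such as file_size and the conditional hash handling are approximations, not the exact file contents):

READ_DATA_CHUNK = 128 * 1024  # was effectively a hard-coded 8192 before this PR

# ... inside download_url_to_file: u is the urlopen() response, f the destination file ...
with tqdm(total=file_size, disable=not progress,
          unit='B', unit_scale=True, unit_divisor=1024) as pbar:
    while True:
        buffer = u.read(READ_DATA_CHUNK)  # read in 128 KiB chunks instead of 8 KiB
        if len(buffer) == 0:
            break
        f.write(buffer)
        sha256.update(buffer)  # simplified; only done when a hash_prefix is given
        pbar.update(len(buffer))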

cc @nairbv @NicolasHug @vmoens @jdsgomes @penguinwu


pytorch-bot bot commented Dec 29, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/116536

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit afe2cdd with merge base 71ec3ed:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


linux-foundation-easycla bot commented Dec 29, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: akx / name: Aarni Koskela (7654fed, afe2cdd)
  • ✅ login: NicolasHug / name: Nicolas Hug (afe2cdd)

akx (Contributor, Author) commented Dec 29, 2023

@pytorchbot label "topic: not user facing"

pytorch-bot bot added the "topic: not user facing" label on Dec 29, 2023

Commit message: Reading in larger chunks should be more efficient.
Signed-off-by: Aarni Koskela <akx@iki.fi>

akx force-pushed the larger-read-data-chunk branch from 2ab241d to 7654fed on December 31, 2023 at 17:03
cpuhrsch requested a review from NicolasHug on January 3, 2024 at 08:54
cpuhrsch added the "triaged" label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Jan 3, 2024
     unit='B', unit_scale=True, unit_divisor=1024) as pbar:
         while True:
-            buffer = u.read(8192)
+            buffer = u.read(READ_DATA_CHUNK)
NicolasHug (Member) commented:

Thanks for the PR @akx. I'm not an expert in download performance, so before moving forward I'd have a few questions:

reading in larger chunks should be more efficient

On a stable and fast network, does it actually lead to a measurable difference in download speed?

What could be potential issues with making the READ_DATA_CHUNK bigger? Is it just more RAM usage or is there something else?

akx (Contributor, Author) replied:

@NicolasHug Thanks for getting back to me!

On a stable and fast network, does it actually lead to a measurable difference in download speed?

Glad you asked! The bottleneck here is not network stability or speed; it's how many times you switch from Python to C (the underlying SSL/HTTP socket read() call) and back. The same goes for feeding lots of small chunks to the hasher (also C code) when you could be feeding larger ones.

I wrote a small benchmark to prove this: it downloads from a local HTTP server straight to /dev/null, so we can assume there is no network or disk overhead. The data is random, so we can assume there are no compression hijinks going on either.

import hashlib
import os
from urllib.request import Request, urlopen

import tqdm

url = "http://192.168.68.102:8000/foo"
chunk = int(os.environ["CHUNK"])

req = Request(url, headers={"User-Agent": "torch.hub"})
u = urlopen(req)
sha256 = hashlib.sha256()
with open("/dev/null", "wb") as f, tqdm.tqdm(unit='B', unit_scale=True, unit_divisor=1024) as pbar:
    while True:
        buffer = u.read(chunk)
        if len(buffer) == 0:
            break
        f.write(buffer)
        sha256.update(buffer)
        pbar.update(len(buffer))
$ hyperfine 'env CHUNK=8192 python download_speed_test.py' 'env CHUNK=131072 python download_speed_test.py' --show-output
Benchmark 1: env CHUNK=8192 python download_speed_test.py
0.98GB [00:00, 1.37GB/s]
[... snip ...]
  Time (mean ± σ):     966.1 ms ±  32.1 ms    [User: 626.7 ms, System: 202.0 ms]
  Range (min … max):   911.6 ms … 1005.2 ms    10 runs

Benchmark 2: env CHUNK=131072 python download_speed_test.py
0.98GB [00:00, 2.09GB/s]
[... snip ...]
  Time (mean ± σ):     690.1 ms ±  12.6 ms    [User: 458.7 ms, System: 95.0 ms]
  Range (min … max):   668.2 ms … 711.0 ms    10 runs

Summary
  env CHUNK=131072 python download_speed_test.py ran
    1.40 ± 0.05 times faster than env CHUNK=8192 python download_speed_test.py

IOW, this single-number change made downloading from a perfect network 1.4x faster. You can also see that there's 10 percentage points less time spent in syscalls in the faster version. (In this scenario, updating the chunk size to 524288 gives diminishing returns, more so on non-perfect networks.)

What could be potential issues with making the READ_DATA_CHUNK bigger?

Slightly increased memory usage during downloads, as you noted. Considering where torch is supposed to be used, 128 KiB instead of 8 KiB should not make a difference for any real use case 😁

The other is that the progress bar, if enabled, won't update quite as often (since it is only updated between chunk reads), but with this chunk size and modern networks, that shouldn't matter either.
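
As a rough, purely illustrative back-of-the-envelope check of that last point (the 100 MB/s link speed below is an assumption, not a measurement from this PR):

link_speed = 100 * 1024 * 1024  # assumed sustained throughput in bytes per second
for chunk in (8 * 1024, 128 * 1024, 512 * 1024):
    print(f"{chunk:>7} B chunks -> ~{link_speed // chunk} pbar.update() calls per second")
# 8 KiB -> ~12800, 128 KiB -> ~800, 512 KiB -> ~200 calls per second

Even the largest chunk size still produces far more update() calls per second than tqdm actually redraws (by default it throttles screen refreshes to roughly 10 per second), so the visible behaviour of the progress bar stays effectively the same.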

Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
NicolasHug (Member) left a comment:

Thanks a lot for the detailed answer and for the benchmarks @akx . It looks like the speed-up should be fairly minimal for any reasonable ML workload, but since this PR adds zero complexity/maintenance cost, I'm happy to merge it. I just made one nit suggestion in https://github.com/pytorch/pytorch/pull/116536/files#r1440361098. LMK if it's OK with you and I'll merge. Thank you !

akx (Contributor, Author) commented Jan 3, 2024

@NicolasHug Thank you, committed the suggestion! You're right that this will probably not make a large difference for a general ML workload (where you'd be expected to have downloaded models and other data already), but e.g. notebooks and CI runs should see some wallclock speedup and fewer CPU cycles spent (good for the planet, hey!) 😄

NicolasHug (Member) commented:

@pytorchbot merge

pytorch-bot bot added the "ciflow/trunk" label (trigger trunk jobs on your pull request) on Jan 3, 2024
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here


Labels: ciflow/trunk, Merged, module: hub, open source, topic: not user facing, triaged
