
Conversation

akx (Contributor) commented Dec 29, 2023

This PR increases the read size for the hub.download_url_to_file function from 8,192 bytes to 131,072 bytes (128 * 1,024), as reading in larger chunks should be more efficient. The size could probably be larger still, at the expense of the progress bar not getting updated as often.

It re-introduces use of the READ_DATA_CHUNK constant that was originally used for this purpose in 4a3baec and since forgotten.
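
For illustration, here is a minimal sketch of what the affected read loop in torch/hub.py's download_url_to_file looks like with the change applied (simplified: variable names such as file_size and the conditional hash handling are approximations, not the exact file contents):

READ_DATA_CHUNK = 128 * 1024  # was effectively a hard-coded 8192 before this PR

# ... inside download_url_to_file: u is the urlopen() response, f the destination file ...
with tqdm(total=file_size, disable=not progress,
          unit='B', unit_scale=True, unit_divisor=1024) as pbar:
    while True:
        buffer = u.read(READ_DATA_CHUNK)  # read in 128 KiB chunks instead of 8 KiB
        if len(buffer) == 0:
            break
        f.write(buffer)
        sha256.update(buffer)  # simplified; only done when a hash_prefix is given
        pbar.update(len(buffer))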

cc @nairbv @NicolasHug @vmoens @jdsgomes @penguinwu


pytorch-bot bot commented Dec 29, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/116536

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit afe2cdd with merge base 71ec3ed:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


linux-foundation-easycla bot commented Dec 29, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: akx / name: Aarni Koskela (7654fed, afe2cdd)
  • ✅ login: NicolasHug / name: Nicolas Hug (afe2cdd)

akx (Contributor, Author) commented Dec 29, 2023

@pytorchbot label "topic: not user facing"

pytorch-bot bot added the "topic: not user facing" label on Dec 29, 2023

Commit message: Reading in larger chunks should be more efficient.
Signed-off-by: Aarni Koskela <akx@iki.fi>

akx force-pushed the larger-read-data-chunk branch from 2ab241d to 7654fed on December 31, 2023 at 17:03
cpuhrsch requested a review from NicolasHug on January 3, 2024 at 08:54
cpuhrsch added the "triaged" label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Jan 3, 2024
     unit='B', unit_scale=True, unit_divisor=1024) as pbar:
         while True:
-            buffer = u.read(8192)
+            buffer = u.read(READ_DATA_CHUNK)
NicolasHug (Member) commented:

Thanks for the PR @akx. I'm not an expert in download performance, so before moving forward I'd have a few questions:

reading in larger chunks should be more efficient

On a stable and fast network, does it actually lead to a measurable difference in download speed?

What could be potential issues with making the READ_DATA_CHUNK bigger? Is it just more RAM usage or is there something else?

akx (Contributor, Author) replied:

@NicolasHug Thanks for getting back to me!

On a stable and fast network, does it actually lead to a measurable difference in download speed?

Glad you asked! The bottleneck here is not network stability or speed; it's how many times you switch from Python to C (the underlying SSL/HTTP socket read() call) and back. The same goes for feeding lots of small chunks to the hasher (also C code) when you could be feeding larger ones.

I wrote a small benchmark to prove this: it downloads from a local HTTP server straight to /dev/null, so we can assume there is no network or disk overhead. The data is random, so we can assume there are no compression hijinks going on either.

import hashlib
import os
from urllib.request import Request, urlopen

import tqdm

url = "http://192.168.68.102:8000/foo"
chunk = int(os.environ["CHUNK"])

req = Request(url, headers={"User-Agent": "torch.hub"})
u = urlopen(req)
sha256 = hashlib.sha256()
with open("/dev/null", "wb") as f, tqdm.tqdm(unit='B', unit_scale=True, unit_divisor=1024) as pbar:
    while True:
        buffer = u.read(chunk)
        if len(buffer) == 0:
            break
        f.write(buffer)
        sha256.update(buffer)
        pbar.update(len(buffer))
$ hyperfine 'env CHUNK=8192 python download_speed_test.py' 'env CHUNK=131072 python download_speed_test.py' --show-output
Benchmark 1: env CHUNK=8192 python download_speed_test.py
0.98GB [00:00, 1.37GB/s]
[... snip ...]
  Time (mean ± σ):     966.1 ms ±  32.1 ms    [User: 626.7 ms, System: 202.0 ms]
  Range (min … max):   911.6 ms … 1005.2 ms    10 runs

Benchmark 2: env CHUNK=131072 python download_speed_test.py
0.98GB [00:00, 2.09GB/s]
[... snip ...]
  Time (mean ± σ):     690.1 ms ±  12.6 ms    [User: 458.7 ms, System: 95.0 ms]
  Range (min … max):   668.2 ms … 711.0 ms    10 runs

Summary
  env CHUNK=131072 python download_speed_test.py ran
    1.40 ± 0.05 times faster than env CHUNK=8192 python download_speed_test.py

IOW, this single-number change made downloading from a perfect network 1.4x faster. You can also see that there's 10 percentage points less time spent in syscalls in the faster version. (In this scenario, updating the chunk size to 524288 gives diminishing returns, more so on non-perfect networks.)

What could be potential issues with making the READ_DATA_CHUNK bigger?

Slightly increased memory usage during downloads, as you noted. Considering where torch is supposed to be used, 128 KiB instead of 8 KiB should not make a difference for any real use case 😁

The other is that the progress bar, if enabled, won't update quite as often (since it is only updated between chunk reads), but with this chunk size and modern networks, that shouldn't matter either.
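
As a rough, purely illustrative back-of-the-envelope check of that last point (the 100 MB/s link speed below is an assumption, not a measurement from this PR):

link_speed = 100 * 1024 * 1024  # assumed sustained throughput in bytes per second
for chunk in (8 * 1024, 128 * 1024, 512 * 1024):
    print(f"{chunk:>7} B chunks -> ~{link_speed // chunk} pbar.update() calls per second")
# 8 KiB -> ~12800, 128 KiB -> ~800, 512 KiB -> ~200 calls per second

Even the largest chunk size still produces far more update() calls per second than tqdm actually redraws (by default it throttles screen refreshes to roughly 10 per second), so the visible behaviour of the progress bar stays effectively the same.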

Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
NicolasHug (Member) left a comment:

Thanks a lot for the detailed answer and for the benchmarks @akx . It looks like the speed-up should be fairly minimal for any reasonable ML workload, but since this PR adds zero complexity/maintenance cost, I'm happy to merge it. I just made one nit suggestion in https://github.com/pytorch/pytorch/pull/116536/files#r1440361098. LMK if it's OK with you and I'll merge. Thank you !

akx (Contributor, Author) commented Jan 3, 2024

@NicolasHug Thank you, committed the suggestion! You're right that this will probably not make a large difference for a general ML workload (where you'd be expected to have downloaded models and other data already), but e.g. notebooks and CI runs should see some wallclock speedup and fewer CPU cycles spent (good for the planet, hey!) 😄

NicolasHug (Member) commented:

@pytorchbot merge

pytorch-bot bot added the "ciflow/trunk" label (trigger trunk jobs on your pull request) on Jan 3, 2024
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here


Labels: ciflow/trunk, Merged, module: hub, open source, topic: not user facing, triaged
