
PERF: download and compute hashes in chunks of 1MB, did you know the progress bar was 30% of the runtime! #12810

Merged (1 commit) on Jul 17, 2024

Conversation

@morotti (Contributor) commented Jun 28, 2024

Hello, it's me again

I'm offering you 2 more improvements in this PR.

First fix: download the file in chunks of 1 MB.
Something to know: urllib is broken; depending on which function you use, it downloads in chunks of 1 byte or 10240 bytes by default. There have been tickets open about this for years, but nobody has fixed them.
pip was setting the chunk size to CONTENT_CHUNK_SIZE=10240 from urllib, which is the bad constant. You don't want to do that ⚠️

A special case of pip: pip updates a progress bar after each chunk is downloaded.
That is an insane number of progress-bar updates, which can take as much as 30% of the runtime for a large package :D

The PR downloads in chunks of 1 MB. That's a reasonable size for I/O operations.
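The loop shape being described can be sketched as follows — a minimal illustration of reading a response body in fixed chunks, not pip's actual `response_chunks` code (the constant name here is assumed):

```python
import io

# 1 MiB reads, instead of the 10240-byte CONTENT_CHUNK_SIZE default.
DOWNLOAD_CHUNK_SIZE = 1024 * 1024

def response_chunks(raw, chunk_size=DOWNLOAD_CHUNK_SIZE):
    """Yield a file-like response body in chunk_size pieces."""
    while True:
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        yield chunk

# A 207 MB wheel now takes ~207 iterations (and progress-bar updates)
# instead of ~20000 at 10240 bytes per chunk.
body = io.BytesIO(b"x" * (3 * 1024 * 1024 + 5))  # fake 3 MiB + 5 B download
sizes = [len(c) for c in response_chunks(body)]
print(sizes)  # [1048576, 1048576, 1048576, 5]
```

The progress bar is updated once per yielded chunk, so fewer, larger chunks directly translate into fewer UI updates.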

Profiling of: pip install --prefix /var/username/deleteme tensorflow-cpu --no-deps --no-cache-dir --dry-run
which takes 3 seconds to run main().

MASTER BRANCH
[profiling screenshot]

FIX BRANCH
[profiling screenshot]

Notice: the tensorflow wheel is 207 MB.
The progress bar was updated 20237 times, i.e. once every 10240 bytes.
Note the same number of calls to read(), fp_read(), stream read(), etc.

Second fix: after the wheel is downloaded, pip reads the file back in blocks of io.DEFAULT_BUFFER_SIZE to compute hashes of the file.
io.DEFAULT_BUFFER_SIZE is an obsolete constant that was set to 8 kB ages ago. You don't want to use that.
By the way, I have some tickets and PRs open against the Python interpreter to fix that constant, but I don't know if they will ever be merged: python/cpython#117151

Thankfully this one doesn't have much impact: the downloaded file should be in the read cache because it was just written, and for me it is written to /tmp, which is a ramdisk. So it makes little difference on my machine, but that really depends on what type of OS and disks you have.
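The hashing fix can be sketched like this — a minimal illustration with an assumed constant and function name, not pip's exact code:

```python
import hashlib
import io

HASH_CHUNK_SIZE = 1024 * 1024  # 1 MiB reads instead of io.DEFAULT_BUFFER_SIZE (8 KiB)

def file_hash(fp, algorithm="sha256", chunk_size=HASH_CHUNK_SIZE):
    """Hash a file-like object incrementally, reading large chunks."""
    h = hashlib.new(algorithm)
    for chunk in iter(lambda: fp.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()

data = b"wheel bytes " * 500000  # ~6 MB of fake file content
digest = file_hash(io.BytesIO(data))
assert digest == hashlib.sha256(data).hexdigest()
```

The digest is identical regardless of chunk size; only the number of read() calls (and Python-level loop iterations) changes.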

Cheers.

@morotti (Contributor, Author) commented Jul 1, 2024

Hello, all green and all comments reviewed; should be good to merge now.

Review comment on src/pip/_internal/network/utils.py (outdated, resolved)
@ichard26 self-requested a review on July 6, 2024
@morotti (Contributor, Author) commented Jul 8, 2024

I've replaced 1024*1024 with a constant as requested.

Two constants, actually, because the network chunks and file chunks don't necessarily need to be the same size, and we don't necessarily want these files to import each other just for one constant.

@morotti (Contributor, Author) commented Jul 8, 2024

Pingback to the bug tickets in requests: they've had an issue open for nearly 10 years about moving away from the small chunk size, but it was never fixed ^^
psf/requests#3186
psf/requests#844

@morotti (Contributor, Author) commented Jul 9, 2024

A quick check with an older version of pip on an older Python version:
downloading is nearly 3 times faster =)

pip version 21.0.1 on python 3.8

Default chunk size, 10 KiB:
time pip download torch --no-deps --no-cache
...

 |████████████████████████████████| 1801.8 MB 65.9 MB/s

Saved ./torch-1.13.1+cu117-cp38-cp38-linux_x86_64.whl
Successfully downloaded torch

real 0m30.120s
user 0m16.202s
sys 0m13.398s

Larger chunk size, 1 MiB:
time pip download torch --no-deps --no-cache
...

 |████████████████████████████████| 1801.8 MB 240.4 MB/s

Saved ./torch-1.13.1+cu117-cp38-cp38-linux_x86_64.whl
Successfully downloaded torch

real 0m15.173s
user 0m5.931s
sys 0m6.345s

Review comment on news/12810.bugfix.rst (outdated, resolved)
@ichard26 (Member) left a comment


While this does seem like a good idea in principle and the performance uplift is compelling, I think this can be a bit smarter to ensure responsive feedback to the user.

Let's pretend I'm on a 5 Mbps down connection and I'm installing Black. black-24.4.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl is 1.8 MB. At 5 Mbps, each megabyte will take 1.6 seconds, meaning there will be a 1.6s delay between each update of the progress bar. This results in an awkward delay in the progress bar that could be perceived as a hang.1

[Screencast attachment: Screencast.from.2024-07-09.15-18-47.webm]

I know this is nit-picky, but responsive design is important and providing immediate feedback on the download would be nice. Perhaps the chunk size could be scaled dynamically based on the download size (if known).

chunk_size = 1024 * 1024
if total_length:
    # Reduce the default chunk size to a small fraction of the total size to
    # ensure responsive progress updates (up to a lower bound to prevent
    # harmfully low chunk sizes), or to 1 MiB, whatever is lower.
    # TODO: it's probably best to round to the "nearest" power of 2...?
    chunk_size = max(1024 * 128, min(total_length // 15, chunk_size))
chunks = response_chunks(resp, chunk_size)
[Screencast attachment: Screencast.from.2024-07-09.16-07-35.webm]

Although this would quickly get tricky, as is evident from the example logic above2, so perhaps it would be best to simply lower the download chunk size to 512 KiB or 256 KiB to balance chunk overhead and responsiveness.3 To reduce the performance penalty, you can lower the download progress bar's refresh rate to something more reasonable than 30/s.

progress = Progress(*columns, refresh_per_second=30)
task_id = progress.add_task(" " * (get_indentation() + 2), total=total)

A value between 5 and 10 seems fine to me.

If people don't think it's worth optimizing for responsiveness, that's fine—go ahead and ignore my suggestion—but you can still reduce the progress bar refresh rate at least (since performance is the entire name of the game here 😉).

Footnotes

  1. It also feels slower to me, although that's subjective :)

  2. I didn't put that much thought into it. It probably needs a lot of tweaking...

  3. While I mention lower chunk sizes, I'm curious, are there any other potential negative consequences of setting the chunk size to a large value like 1 MiB? Can it have a harmful effect on flaky connections?

@notatallshaw (Member):

FWIW I agree with @ichard26: there have been complaints before from users on slow or unreliable connections that pip isn't always friendly.

With this PR there is an awkward middle ground of file sizes (between 512 KB and 10 MB): such downloads complete extremely fast and aren't noticeable for people on high-bandwidth connections, but may appear stuck for a user whose connection is slow enough that the download takes a noticeable amount of time.

I assume the chunk size is fixed, and pip can't determine ahead of time how fast or reliable a connection is. So @ichard26's approach seems a good compromise to me (without debating the specific numbers, which could always be tweaked).

@ichard26 (Member) commented Jul 9, 2024

I've been discussing my review with others (in the PyPA Discord) and Ethan made a suggestion that involves dynamically updating the chunk size:

[W]ould it be possible to do an adaptive download chunk size based on the time it took to download the last chunk?
So you start small and increase the download chunk size until the last chunk was "slow" by some metric?

TBH it sounds even more complicated and I don't think I'd want to maintain such logic, but it could be a valid approach.

@notatallshaw (Member):

If chunk size is dynamic (which it doesn't look like?), I think it would make sense to start at 8 kB and keep doubling until some time threshold was exceeded (e.g. 1 second) or some maximum value was reached (e.g. 1 MB)
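That doubling scheme might look something like this — a hypothetical sketch of the idea only, which this PR does not implement (all names and thresholds here are illustrative):

```python
import io
import time

def adaptive_chunks(raw, start=8 * 1024, cap=1024 * 1024, slow_after=1.0):
    """Yield chunks from a file-like object, doubling the read size
    while reads stay fast, up to a maximum chunk size."""
    chunk_size = start
    while True:
        t0 = time.monotonic()
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        yield chunk
        # Keep doubling only while the last read finished quickly.
        if time.monotonic() - t0 < slow_after and chunk_size < cap:
            chunk_size = min(chunk_size * 2, cap)

# On an in-memory "connection" every read is fast, so sizes double up to the cap.
src = io.BytesIO(b"x" * (4 * 1024 * 1024))
sizes = [len(c) for c in adaptive_chunks(src)]
print(sizes[:4], max(sizes))  # starts at 8192 and doubles; capped at 1048576
```

On a slow connection, the timing check stops the growth early, so the progress bar keeps updating at roughly the chosen time threshold.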

@morotti (Contributor, Author) commented Jul 10, 2024

Hello,

I've made adjustments. We can now reach ~450 MB/s download speed internally, up from ~60 MB/s before this PR.

  • Chunks are set to 256 kB. That's sufficient to show regular progress on the slowest connections, without being detrimental to performance.

Do not use smaller chunks: that incurs a performance penalty from waiting on I/O operations to the device and from the interpreter running more iterations (note that Python 3.11+ brought significant improvements in interpreter speed).

Most devices, especially HDDs or USB devices, would benefit from a larger block size (1+ MB), but the difference is not necessarily significant. (I see some inefficient copying in the SSL code to reconstruct large packets, and they'd fall outside the CPU cache, which could allow microsecond-level optimizations, but this is outside the scope of this patch.)

  • The refresh was limited to 5 times per second. It helps a lot. Thanks for pointing this out @ichard26

(It's funny: in my testing the UI "feels" more responsive the more often it flashes (I can force a refresh 60 times a second and it's very flashy!), but that doesn't make anything faster. Users won't run three pip install windows side by side at 1, 5 and 60 refreshes a second, and wouldn't tell the difference.)

Last I checked, 3 years ago, ~40% of the UK had <8 Mbps connections, including ~10% at ~2 Mbps or below.
I myself lived with a 2 Mbps connection for many years.

Downloading a movie takes a whole hour (tensorflow and torch-cpu wheels are of that order of size, depending on platform).
Downloading a game on Steam can take 3 days.
There is no improvement to the progress bar that can make that experience less terrible; users just have to wait :D
You don't need to worry so much about showing progress. Having any form of progress bar is good enough, really.

@dolfinus commented Jul 10, 2024

Last I checked, 3 years ago, ~40% of the UK had <8 Mbps connections, including ~10% at ~2 Mbps or below.
I myself lived with a 2 Mbps connection for many years.

pip can be used not only with PyPI but also with a local repository or a proxy repository (e.g. JFrog Artifactory, Sonatype Nexus). Downloading packages within an internal company network can be much faster. I'd prefer instantaneous CI builds, if possible.
What about checking whether the terminal is a TTY, and if not (e.g. a CI run or Docker build), using the maximum possible chunk size?
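A TTY check along those lines could be sketched as follows — hypothetical, not part of this PR, with illustrative names and constant values:

```python
import sys

def pick_chunk_size(interactive=None):
    """Use the smaller, progress-friendly chunk size only for interactive runs;
    non-interactive runs (CI, Docker builds) have no progress bar to update."""
    if interactive is None:
        interactive = sys.stdout.isatty()
    return 256 * 1024 if interactive else 1024 * 1024
```

Whether the extra branch is worth it depends on how much throughput is actually gained above 256 KiB chunks, which the discussion below touches on.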

@morotti (Contributor, Author) commented Jul 10, 2024

I see failing tests.

I've rebased on master and squashed all the commits; the next build should pass.

@ichard26 (Member) left a comment


While it does partially suck to lose some of the smoothness in the progress bar, the responsiveness is good enough at this point, which is all that actually matters :)

The last comment I have is whether we should adjust the minimum download size that enables the download progress bar. It's currently 40 KB, but the progress bar is essentially useless below 256 KiB, as the entire download completes in one chunk anyway (unless I'm misunderstanding how chunk sizes work; perhaps the read() call isn't guaranteed to return the requested chunk size?).

(It's funny: in my testing the UI "feels" more responsive the more often it flashes (I can force a refresh 60 times a second and it's very flashy!), but that doesn't make anything faster. Users won't run three pip install windows side by side at 1, 5 and 60 refreshes a second, and wouldn't tell the difference.)

I'm glad the performance is even better than before. 5 refreshes per second does actually seem a bit slow, but that's a matter of taste. I agree that users won't notice or wouldn't care enough to complain.

Last I checked, 3 years ago, ~40% of the UK had <8 Mbps connections, including ~10% at ~2 Mbps or below.
I myself lived with a 2 Mbps connection for many years.

Not too long ago, I was working with an "in-theory" 25/2 Mbps internet connection. That's pretty good, I realize, but the download speed I got in practice was much lower. I'm glad you understand where I was coming from.

Downloading a movie takes a whole hour (tensorflow and torch-cpu wheels are of that order of size, depending on platform).
Downloading a game on Steam can take 3 days.
There is no improvement to the progress bar that can make that experience less terrible; users just have to wait :D

Right, but as @notatallshaw pointed out, this patch in its original form most impacted the download UX of medium-sized distributions. That's what I was trying to optimize for. Of course, if you're on a slow enough connection, any large distribution is going to progress at a snail's pace, and even the most responsive progress bar wouldn't help :)

What about checking whether the terminal is a TTY, and if not (e.g. a CI run or Docker build), using the maximum possible chunk size?

Frankly, if I understand @morotti's comment, increasing the chunk size beyond 256 KiB nets only marginal gains. With the current version of this PR, they recorded a top download speed of 460 MB/s which is over 3.5 Gbps. Do more enterprises have 2.5+ Gbps ethernet links to their internal network than I think...?

Thanks @morotti for bearing with the considerable back and forth on this PR. I hope you found my comments useful and not too nit-picky :)

@morotti (Contributor, Author) commented Jul 11, 2024

(repushing to trigger builds, there seem to be a flaky test on Windows)

Thanks, I've adjusted the progress bar to only render for packages > 512 KiB, up from 40 KiB. I think that's a reasonable cutoff.

A simple "pip install jupyter" installs 60 packages, most of them very small. I find the pip output easier to read and follow with fewer progress bars appearing and disappearing very fast.

read() returns the requested chunk size, except for the last chunk.

Do more enterprises have 2.5+ Gbps ethernet links to their internal network than I think...?

A lot of companies have employees working on a remote machine (VM, VDI, RDP, remote terminal, ssh, etc...). Hardware has been dual 10 Gbps for more than a decade, 40 Gbps is common nowadays.

@morotti (Contributor, Author) commented Jul 11, 2024

There seem to be two tests in master, added in May, that are flaky on Windows.

FAILED tests/unit/test_utils_retry.py::test_retry_wait[0.015] - assert (669.765 - 669.75) >= 0.015
FAILED tests/unit/test_utils_retry.py::test_retry_time_limit[0.01-10] - assert 11 <= 10

EDIT: I see they are discussed in another PR, #12839.

@morotti (Contributor, Author) commented Jul 15, 2024

(rebasing to pick up test fixes on main)

@morotti (Contributor, Author) commented Jul 15, 2024

All green now that main branch has been fixed.
Would you be able to merge the PR? @ichard26

Final result: time pip download --dest /tmp/deleteme --no-cache tensorflow torch xgboost
A few large packages typical of machine-learning use cases, 51 packages total.
It cuts the runtime almost in half.
Download speed goes from ~100 MB/s to ~400 MB/s 🚀 🚀 🚀

on main:
real    0m37.430s
user    0m19.404s
sys     0m15.371s
with the patch:
real    0m23.292s
user    0m10.226s
sys     0m10.907s

@ichard26 (Member):

Thanks, I've adjusted the progress bar to only render for packages > 512 KiB, up from 40 KiB. I think that's a reasonable cutoff.

I'm not sure if this is the right call, but I don't feel strongly enough so I won't block the PR on this. We can see if anyone complains later.

I don't have the commit bit so I can't merge anything. I'm a triager, not a maintainer as I'm too new to the project :)

@ichard26 ichard26 added the type: performance Commands take too long to run label Jul 15, 2024
@morotti morotti mentioned this pull request Jul 17, 2024
@pfmoore (Member) commented Jul 17, 2024

This seems like a good improvement, and it's easy enough to revert if it causes issues, so I'm going to merge it pre-emptively. @pradyunsg if you have concerns for 24.2, feel free to revert it.

@pfmoore pfmoore merged commit 5fb46a3 into pypa:main Jul 17, 2024
29 checks passed