
PERF: download and compute hashes in chunks of 1MB, did you know the progress bar was 30% of the runtime! #12810

Merged (1 commit) on Jul 17, 2024

Conversation

@morotti (Contributor) commented Jun 28, 2024

Hello, it's me again

I'm offering you 2 more improvements in this PR.

First fix: download the file in chunks of 1 MB.
Something to know: urllib is broken; depending on which function you use, it downloads in chunks of 1 byte or 10240 bytes by default. There have been tickets open about this for years, but nobody has fixed them.
pip was setting the chunk size to CONTENT_CHUNK_SIZE=10240 from urllib, which is the bad constant. You don't want to do that ⚠️

A special case of pip: pip updates a progress bar after each chunk is downloaded.
That is an insane number of progress-bar updates, which can take as much as 30% of the runtime for a large package :D

The PR downloads in chunks of 1 MB. That's a reasonable size for I/O operations.
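The loop shape being described can be sketched as follows — a minimal illustration of reading a response body in fixed chunks, not pip's actual `response_chunks` code (the constant name here is assumed):

```python
import io

# 1 MiB reads, instead of the 10240-byte CONTENT_CHUNK_SIZE default.
DOWNLOAD_CHUNK_SIZE = 1024 * 1024

def response_chunks(raw, chunk_size=DOWNLOAD_CHUNK_SIZE):
    """Yield a file-like response body in chunk_size pieces."""
    while True:
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        yield chunk

# A 207 MB wheel now takes ~207 iterations (and progress-bar updates)
# instead of ~20000 at 10240 bytes per chunk.
body = io.BytesIO(b"x" * (3 * 1024 * 1024 + 5))  # fake 3 MiB + 5 B download
sizes = [len(c) for c in response_chunks(body)]
print(sizes)  # [1048576, 1048576, 1048576, 5]
```

The progress bar is updated once per yielded chunk, so fewer, larger chunks directly translate into fewer UI updates.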

Profiling of: pip install --prefix /var/username/deleteme tensorflow-cpu --no-deps --no-cache-dir --dry-run
which takes 3 seconds to run main().

MASTER BRANCH
[profiling screenshot]

FIX BRANCH
[profiling screenshot]

Notice: the tensorflow wheel is 207 MB.
The progress bar was updated 20237 times, i.e. once every 10240 bytes.
Note the same number of calls to read(), fp_read(), stream read(), etc.

Second fix: after the wheel is downloaded, pip reads the file back in blocks of io.DEFAULT_BUFFER_SIZE to compute hashes of the file.
io.DEFAULT_BUFFER_SIZE is an obsolete constant that was set to 8 kB ages ago. You don't want to use that.
By the way, I have some tickets and PRs open against the Python interpreter to fix that constant, but I don't know if they will ever be merged: python/cpython#117151

Thankfully this one doesn't have much impact: the downloaded file should be in the read cache because it was just written, and for me it is written to /tmp, which is a ramdisk. So it makes little difference on my machine, but that really depends on what type of OS and disks you have.
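The hashing fix can be sketched like this — a minimal illustration with an assumed constant and function name, not pip's exact code:

```python
import hashlib
import io

HASH_CHUNK_SIZE = 1024 * 1024  # 1 MiB reads instead of io.DEFAULT_BUFFER_SIZE (8 KiB)

def file_hash(fp, algorithm="sha256", chunk_size=HASH_CHUNK_SIZE):
    """Hash a file-like object incrementally, reading large chunks."""
    h = hashlib.new(algorithm)
    for chunk in iter(lambda: fp.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()

data = b"wheel bytes " * 500000  # ~6 MB of fake file content
digest = file_hash(io.BytesIO(data))
assert digest == hashlib.sha256(data).hexdigest()
```

The digest is identical regardless of chunk size; only the number of read() calls (and Python-level loop iterations) changes.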

Cheers.

@morotti (Contributor, Author) commented Jul 1, 2024

Hello, all green and all comments reviewed; should be good to merge now.

Review comment on src/pip/_internal/network/utils.py (outdated, resolved)
@ichard26 self-requested a review on July 6, 2024
@morotti (Contributor, Author) commented Jul 8, 2024

I've replaced 1024*1024 with a constant as requested.

Two constants, actually, because the network chunks and file chunks don't necessarily need to be the same size, and we don't necessarily want these files to import each other just for one constant.

@morotti (Contributor, Author) commented Jul 8, 2024

Pingback to the bug tickets in requests: they've had an issue open for nearly 10 years about moving away from the small chunk size, but it was never fixed ^^
psf/requests#3186
psf/requests#844

@morotti (Contributor, Author) commented Jul 9, 2024

A quick check with an older version of pip on an older Python version:
downloading is nearly 3 times faster =)

pip version 21.0.1 on python 3.8

Default chunk size, 10 KiB:
time pip download torch --no-deps --no-cache
...

 |████████████████████████████████| 1801.8 MB 65.9 MB/s

Saved ./torch-1.13.1+cu117-cp38-cp38-linux_x86_64.whl
Successfully downloaded torch

real 0m30.120s
user 0m16.202s
sys 0m13.398s

Larger chunk size, 1 MiB:
time pip download torch --no-deps --no-cache
...

 |████████████████████████████████| 1801.8 MB 240.4 MB/s

Saved ./torch-1.13.1+cu117-cp38-cp38-linux_x86_64.whl
Successfully downloaded torch

real 0m15.173s
user 0m5.931s
sys 0m6.345s

Review comment on news/12810.bugfix.rst (outdated, resolved)
@ichard26 (Member) left a comment


While this does seem like a good idea in principle and the performance uplift is compelling, I think this can be a bit smarter to ensure responsive feedback to the user.

Let's pretend I'm on a 5 Mbps down connection and I'm installing Black. black-24.4.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl is 1.8 MB. At 5 Mbps, each megabyte will take 1.6 seconds, meaning there will be a 1.6s delay between each update of the progress bar. This results in an awkward delay in the progress bar that could be perceived as a hang.1

[Screencast attachment: Screencast.from.2024-07-09.15-18-47.webm]

I know this is nit-picky, but responsive design is important and providing immediate feedback on the download would be nice. Perhaps the chunk size could be scaled dynamically based on the download size (if known).

chunk_size = 1024 * 1024
if total_length:
    # Reduce the default chunk size to a small fraction of the total size to
    # ensure responsive progress updates (up to a lower bound to prevent
    # harmfully low chunk sizes), or to 1 MiB, whatever is lower.
    # TODO: it's probably best to round to the "nearest" power of 2...?
    chunk_size = max(1024 * 128, min(total_length // 15, chunk_size))
chunks = response_chunks(resp, chunk_size)
[Screencast attachment: Screencast.from.2024-07-09.16-07-35.webm]

Although this would quickly get tricky, as is evident from the example logic above2, so perhaps it would be best to simply lower the download chunk size to 512 KiB or 256 KiB to balance chunk overhead and responsiveness.3 To reduce the performance penalty, you can lower the download progress bar's refresh rate to something more reasonable than 30/s.

progress = Progress(*columns, refresh_per_second=30)
task_id = progress.add_task(" " * (get_indentation() + 2), total=total)

A value between 5 and 10 seems fine to me.

If people don't think it's worth optimizing for responsiveness, that's fine—go ahead and ignore my suggestion—but you can still reduce the progress bar refresh rate at least (since performance is the entire name of the game here 😉).

Footnotes

  1. It also feels slower to me, although that's subjective :)

  2. I didn't put that much thought into it. It probably needs a lot of tweaking...

  3. While I mention lower chunk sizes, I'm curious, are there any other potential negative consequences of setting the chunk size to a large value like 1 MiB? Can it have a harmful effect on flaky connections?

@notatallshaw (Member):

FWIW I agree with @ichard26: there have been complaints before from users on slow or unreliable connections that pip isn't always friendly.

With this PR there is an awkward middle ground of file sizes (between 512 KB and 10 MB): such downloads complete extremely fast and aren't noticeable for people on high-bandwidth connections, but may appear stuck for a user whose connection is slow enough that the download takes a noticeable amount of time.

I assume the chunk size is fixed, and pip can't determine ahead of time how fast or reliable a connection is. So @ichard26's approach seems a good compromise to me (without debating the specific numbers, which could always be tweaked).

@ichard26 (Member) commented Jul 9, 2024

I've been discussing my review with others (in the PyPA Discord) and Ethan made a suggestion that involves dynamically updating the chunk size:

[W]ould it be possible to do an adaptive download chunk size based on the time it took to download the last chunk?
So you start small and increase the download chunk size until the last chunk was "slow" by some metric?

TBH it sounds even more complicated and I don't think I'd want to maintain such logic, but it could be a valid approach.

@notatallshaw (Member):

If chunk size is dynamic (which it doesn't look like?), I think it would make sense to start at 8 kB and keep doubling until some time threshold was exceeded (e.g. 1 second) or some maximum value was reached (e.g. 1 MB)
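That doubling scheme might look something like this — a hypothetical sketch of the idea only, which this PR does not implement (all names and thresholds here are illustrative):

```python
import io
import time

def adaptive_chunks(raw, start=8 * 1024, cap=1024 * 1024, slow_after=1.0):
    """Yield chunks from a file-like object, doubling the read size
    while reads stay fast, up to a maximum chunk size."""
    chunk_size = start
    while True:
        t0 = time.monotonic()
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        yield chunk
        # Keep doubling only while the last read finished quickly.
        if time.monotonic() - t0 < slow_after and chunk_size < cap:
            chunk_size = min(chunk_size * 2, cap)

# On an in-memory "connection" every read is fast, so sizes double up to the cap.
src = io.BytesIO(b"x" * (4 * 1024 * 1024))
sizes = [len(c) for c in adaptive_chunks(src)]
print(sizes[:4], max(sizes))  # starts at 8192 and doubles; capped at 1048576
```

On a slow connection, the timing check stops the growth early, so the progress bar keeps updating at roughly the chosen time threshold.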

@morotti (Contributor, Author) commented Jul 10, 2024

Hello,

I've made adjustments. We can now reach ~450 MB/s download speed internally, up from ~60 MB/s before this PR.

  • Chunks are set to 256 kB. That's sufficient to show regular progress on the slowest connections, without being detrimental to performance.

Do not use smaller chunks: that incurs a performance penalty from waiting on I/O operations to the device and from the interpreter running more iterations (note that Python 3.11+ brought significant improvements in interpreter speed).

Most devices, especially HDDs or USB devices, would benefit from a larger block size (1+ MB), but the difference is not necessarily significant. (I see some inefficient copying in the SSL code to reconstruct large packets, and they'd fall outside the CPU cache, which could allow microsecond-level optimizations, but this is outside the scope of this patch.)

  • The refresh was limited to 5 times per second. It helps a lot. Thanks for pointing this out @ichard26

(It's funny: in my testing the UI "feels" more responsive the more often it flashes (I can force a refresh 60 times a second and it's very flashy!), but that doesn't make anything faster. Users won't run three pip install windows side by side at 1, 5 and 60 refreshes a second, and wouldn't tell the difference.)

Last I checked, 3 years ago, ~40% of the UK had <8 Mbps connections, including ~10% at ~2 Mbps or below.
I myself lived with a 2 Mbps connection for many years.

Downloading a movie takes a whole hour (tensorflow and torch-cpu wheels are of that order of size, depending on platform).
Downloading a game on Steam can take 3 days.
There is no improvement to the progress bar that can make that experience less terrible; users just have to wait :D
You don't need to worry so much about showing progress. Having any form of progress bar is good enough, really.

@dolfinus commented Jul 10, 2024

Last I checked, 3 years ago, ~40% of the UK had <8 Mbps connections, including ~10% at ~2 Mbps or below.
I myself lived with a 2 Mbps connection for many years.

pip can be used not only with PyPI but also with a local repository or a proxy repository (e.g. JFrog Artifactory, Sonatype Nexus). Downloading packages within an internal company network can be much faster. I'd prefer instantaneous CI builds, if possible.
What about checking whether the terminal is a TTY, and if not (e.g. a CI run or Docker build), using the maximum possible chunk size?
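A TTY check along those lines could be sketched as follows — hypothetical, not part of this PR, with illustrative names and constant values:

```python
import sys

def pick_chunk_size(interactive=None):
    """Use the smaller, progress-friendly chunk size only for interactive runs;
    non-interactive runs (CI, Docker builds) have no progress bar to update."""
    if interactive is None:
        interactive = sys.stdout.isatty()
    return 256 * 1024 if interactive else 1024 * 1024
```

Whether the extra branch is worth it depends on how much throughput is actually gained above 256 KiB chunks, which the discussion below touches on.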

@morotti (Contributor, Author) commented Jul 10, 2024

I see failing tests.

I've rebased on master and squashed all the commits; the next build should pass.

@ichard26 (Member) left a comment


While it does partially suck to lose some of the smoothness in the progress bar, the responsiveness is good enough at this point, which is all that actually matters :)

The last comment I have is whether we should adjust the minimum download size that enables the download progress bar. It's currently 40 KB, but the progress bar is essentially useless below 256 KiB, as the entire download completes in one chunk anyway (unless I'm misunderstanding how chunk sizes work; perhaps the read() call isn't guaranteed to return the requested chunk size?).

(It's funny: in my testing the UI "feels" more responsive the more often it flashes (I can force a refresh 60 times a second and it's very flashy!), but that doesn't make anything faster. Users won't run three pip install windows side by side at 1, 5 and 60 refreshes a second, and wouldn't tell the difference.)

I'm glad the performance is even better than before. 5 refreshes per second does actually seem a bit slow, but that's a matter of taste. I agree that users won't notice or wouldn't care enough to complain.

Last I checked, 3 years ago, ~40% of the UK had <8 Mbps connections, including ~10% at ~2 Mbps or below.
I myself lived with a 2 Mbps connection for many years.

Not too long ago, I was working with an "in-theory" 25/2 Mbps internet connection. That's pretty good, I realize, but the download speed I got in practice was much lower. I'm glad you understand where I was coming from.

Downloading a movie takes a whole hour (tensorflow and torch-cpu wheels are of that order of size, depending on platform).
Downloading a game on Steam can take 3 days.
There is no improvement to the progress bar that can make that experience less terrible; users just have to wait :D

Right, but as @notatallshaw pointed out, this patch in its original form most impacted the download UX of medium-sized distributions. That's what I was trying to optimize for. Of course, if you're on a slow enough connection, any large distribution is going to progress at a snail's pace, and even the most responsive progress bar wouldn't help :)

What about checking whether the terminal is a TTY, and if not (e.g. a CI run or Docker build), using the maximum possible chunk size?

Frankly, if I understand @morotti's comment, increasing the chunk size beyond 256 KiB nets only marginal gains. With the current version of this PR, they recorded a top download speed of 460 MB/s which is over 3.5 Gbps. Do more enterprises have 2.5+ Gbps ethernet links to their internal network than I think...?

Thanks @morotti for bearing with the considerable back and forth on this PR. I hope you found my comments useful and not too nit-picky :)

@morotti (Contributor, Author) commented Jul 11, 2024

(repushing to trigger builds, there seem to be a flaky test on Windows)

Thanks, I've adjusted the progress bar to only render for packages > 512 KiB, up from 40 KiB. I think that's a reasonable cutoff.

A simple "pip install jupyter" installs 60 packages, most of them very small. I find the pip output easier to read and follow with fewer progress bars appearing and disappearing very fast.

read() returns the requested chunk size, except for the last chunk.

Do more enterprises have 2.5+ Gbps ethernet links to their internal network than I think...?

A lot of companies have employees working on a remote machine (VM, VDI, RDP, remote terminal, ssh, etc...). Hardware has been dual 10 Gbps for more than a decade, 40 Gbps is common nowadays.

@morotti (Contributor, Author) commented Jul 11, 2024

There seem to be two tests in master, added in May, that are flaky on Windows.

FAILED tests/unit/test_utils_retry.py::test_retry_wait[0.015] - assert (669.765 - 669.75) >= 0.015
FAILED tests/unit/test_utils_retry.py::test_retry_time_limit[0.01-10] - assert 11 <= 10

EDIT: I see they are discussed in another PR, #12839.

@morotti (Contributor, Author) commented Jul 15, 2024

(rebasing to pick up test fixes on main)

@morotti (Contributor, Author) commented Jul 15, 2024

All green now that main branch has been fixed.
Would you be able to merge the PR? @ichard26

Final result: time pip download --dest /tmp/deleteme --no-cache tensorflow torch xgboost
A few large packages typical of machine-learning use cases, 51 packages total.
It cuts the runtime almost in half.
Download speed goes from ~100 MB/s to ~400 MB/s 🚀 🚀 🚀

on main:
real    0m37.430s
user    0m19.404s
sys     0m15.371s
with the patch:
real    0m23.292s
user    0m10.226s
sys     0m10.907s

@ichard26 (Member):

Thanks, I've adjusted the progress bar to only render for packages > 512 KiB, up from 40 KiB. I think that's a reasonable cutoff.

I'm not sure if this is the right call, but I don't feel strongly enough so I won't block the PR on this. We can see if anyone complains later.

I don't have the commit bit so I can't merge anything. I'm a triager, not a maintainer as I'm too new to the project :)

@ichard26 ichard26 added the type: performance Commands take too long to run label Jul 15, 2024
@morotti morotti mentioned this pull request Jul 17, 2024
@pfmoore (Member) commented Jul 17, 2024

This seems like a good improvement, and it's easy enough to revert if it causes issues, so I'm going to merge it pre-emptively. @pradyunsg if you have concerns for 24.2, feel free to revert it.

@pfmoore pfmoore merged commit 5fb46a3 into pypa:main Jul 17, 2024
29 checks passed