Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add custom progress bar that supports parallel downloads #12404

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

NeilBotelho
Copy link
Contributor

@NeilBotelho NeilBotelho commented Nov 19, 2023

This PR introduces 2 new classes PipProgress and PipParallelProgress supporting parallel downloads as presented in PR #12388. These are subclasses of rich.progress.Progress and are designed as a replacement for the current progress bar.

In the interest of not creating a single PR that does too many things, this PR only adds the above mentioned classes and does not integrate them with the Downloader classes. For those interested in testing the integrated version, a rough merge of the integration is available on my fork. Feedback is highly appreciated :)

The need for a new subclass:

  • Keeping log message together with the progress bar when downloading in parallel.
  • Dynamically switching download descriptions and columns when the total length of the download isn't known
  • Sorting of completed downloads when downloading in parallel to keep active downloads together

To expand on this further:

Keeping log message and progress bar together:

When you add multiple tasks to the same progress bar(as would be done for parallel download), any logged text gets pushed to the start of all the progress bars. So the log message, i.e. "Downloading xxxx.whl (yykB)" or "Using cached" needs to be part of the progress bar.
But rich.progress doesn't natively support multi-line progress bars. As per the documentation, to customize this behaviour you would need to override the get_renderable method, which is the approach this PR takes.

Dynamically switching columns

Progress doesn't natively support switching between columns when a progress bar has multiple tasks. pip requires this since if it can't get the total length of one the links provided it is supposed to use a different set of columns. In the case of parallel downloads, all downloads use the same progress bar object, hence it needs to be able to dynamically switch columns for a task if the total length for that task is not known. Using the subclass of Progress, this behaviour is implemented in this PR

Sorting of completed downloads

This is more of a UX thing. By default for parallel downloads the progress bars are rendered in the order that they are added. But this makes it difficult for an end user to see how many downloads are currently active. In this PR every time a download is completed, the progress bars are sorted by which ones are completed, to keep the active ones together at the bottom of the screen.

Possible Issues

If the user doesn't provide the --parallel-downloads flag (as per PR #12388 ) I've tried to keep it so that there isn't a change in behaviour or appearance. In the terminal, I think I've succeeded. But there are a few issues I was not able to address and would appreciate help with.

Rendering issues in Jupyter (Solved)

Using parallel progress bars seems to cause rendering issues in Jupyter.(By rendering issues i mean the progress bar gets redrawn on a newline for each refresh) as can be see in the rough integrated version of this PR. I'm working on solving this currently and will update when I learn more :)

Current Solution:

The issue here was that in Jupyter you can update the current line, but as soon as a newline character is printed, you can no longer go back and update the previous line. In my previous solution, to allow the download description and its related progress bar to stay adjacent to each other when downloading multiple files in parallel, the description was added to the progress bar object and the progress bar was made into a 2 line progress bar. This means that the description was the first line and the progress meter was being reprinted for every update. In the current solution, to fix this for sequential downloads, the description is logged, and not made a part of the progress bar object. But parallel downloads still have a 2 line progress bar and remain broken in Jupyter. I think this is an acceptable solution as it maintains legacy behaviour and only breaks UI, when the --parallel-downloads flag is passed in jupyter.

Logging (Solved)

In the case of parallel downloads in BatchDownloader, if any messages are logged while the download is active, it may get overwritten by the progress bars. I did a quick check and it doesn't appear as though there are any log messages during a call to BatchDownloader but I'm not sure about this.

Current Solution:

To solve this, the logger is passed into PipProgress(for sequential downloads) and the description is logged when a task is added. To enable keeping description and progress bar together, this logging is disabled in PipParallelProgress and the description is added to the progress bar making a multi-line progress bar

1 for sequential downloads and 1 for parallel downloads.
@NeilBotelho
Copy link
Contributor Author

The latest commit (390fd51) addresses the previously mentioned potential issues. The goal was to maintain the existing behavior for sequential downloads while updating the behavior of the progress bar specifically for parallel downloads.

To achieve this in a clear and maintainable manner, I've broken up the new progress bar class into PipProgress (which is a subclass of Progress) and PipParallelProgress (which is a subclass of PipProgress). This design choice has several advantages:

Modularity:

PipProgress isolates modifications to the original Progress class for the new progress bar, making it easier to understand and maintain.

Separation of Concerns:

PipParallelProgress builds on PipProgress and introduces changes specific to supporting parallel downloads. This separation enables a clearer distinction between modifications for sequential and parallel download behaviors.

Isolation of Issues:

With this class structure, issues or updates related to the progress bar for sequential downloads can be traced to PipProgress, and issues or updates specific to parallel downloads can be traced to PipParallelProgress. This isolation simplifies debugging and future maintenance.

@ofek
Copy link
Sponsor Contributor

ofek commented Nov 27, 2023

FYI when I do pip install --dry-run -q -I --no-cache-dir --progress-bar off ... on the latest version I get no output as expected but on this branch I get a bunch of new lines that get printed out

@NeilBotelho
Copy link
Contributor Author

NeilBotelho commented Nov 27, 2023

Thanks for catching that. Thanks to that I also noticed I hadn't handled the -q flag yet. Most of this issue is related to changes on the rough merge of the integration branch. Just pushed a fix for that now. Also pushing a fix here to handle disabling progress in parallel downloads.

@notatallshaw notatallshaw mentioned this pull request Apr 9, 2024
@NeilBotelho
Copy link
Contributor Author

I wanted to gently nudge this PR for a bit more attention :) , and put forward a plan for merging this feature that hopefully makes sense.

This PR adds the necessary classes to add a UI for parallel downloads, and #12388 adds the user facing interface for it as well as enables parallel downloads on the BatchDownloader class. So I think the following makes sense:

  1. Merge this PR (after some feedback from a couple more people)
  2. Merge Add parallel download support to BatchDownloader #12388 to add the user facing interface
  3. Open a PR to add support for these progress bars in the Downloader and BatchDownloader classes

@arenasys
Copy link
Contributor

arenasys commented May 4, 2024

raw progress bar from #12586 will need to be tweaked in this rewrite, so it supports parallel downloads.

@ofek
Copy link
Sponsor Contributor

ofek commented May 4, 2024

Yes a rebase is required now.

@NeilBotelho
Copy link
Contributor Author

NeilBotelho commented May 4, 2024

Shouldn't be too difficult, it doesn't seem like #12586 adds any conflicting changes. I think a merge with main could work here?

@NeilBotelho
Copy link
Contributor Author

NeilBotelho commented May 4, 2024

Taking a closer look at it, I realize what you mean. I'll look into how to support the flow for simple progress bars with the changes suggested by this PR.
The change may just be in the integration of this PR with #12388 in which case I'll update the rough integration branch I mentioned in the PR description and make an update comment here.

@arenasys
Copy link
Contributor

arenasys commented May 5, 2024

It wont be difficult, it just needs some identifier on the lines it writes: Progress CURRENT of TOTAL to Progress FILE CURRENT of TOTAL, etc. So the program reading can distinguish between parallel downloads.

@NeilBotelho
Copy link
Contributor Author

NeilBotelho commented May 5, 2024

It'll also require a some type of mutex like threading.RLock to prevent 2 threads from trying to print to stdout at the same time. That becomes trickier to support since all threads would need to share the same RLock.

Also the way I was combining this PR and #12388 removed the support for other types of progress bars. This is because supporting parallel progress bars requires access to the underlying rich progress bar to add downloads. So now get_download_progress_renderer function returns the rich progress bar where it used to return a Callable[[Iterable[bytes]], Iterator[bytes]].
I'll have to figure out a workaround to still support other types of simple progress bars like #12586.

@arenasys
Copy link
Contributor

arenasys commented May 6, 2024

I would have a thread specifically for displaying the progress bars (maybe even the main thread?), then that thread can have a mutex guarded queue that the download threads can write progress events into (via callbacks or whatever). And queue.Queue is already guarded iirc.

@NeilBotelho
Copy link
Contributor Author

Could you elaborate?
Currently, if we are downloading sequentially the main thread will just download 1 url at a time so a mutex isn't really needed. In thr case of parallel downloads the main thread will spawn a thread pool which will download the urls in parallel and will wait till all the downloads are completed. Though the lock is only needed in the parallel case, I was going to include an optional lock argument in the common download logic and only acquire it while printing in the case of the simple progress bar.

@arenasys
Copy link
Contributor

arenasys commented May 6, 2024

So each thread is downloading and displaying a progress bar? Im saying its easier if you have all the display logic in a single thread which is fed progress events by the download threads. But anyway, just wanted to make sure raw progress mode isnt removed.

@pfmoore
Copy link
Member

pfmoore commented May 6, 2024

Can I urge some caution here? We seem to be adding a lot of complexity (progress bar subclasses, an now locking mechanisms) for getting "prettier" output, without looking at the bigger picture of what exactly we need here.

How many users really look at the output of pip install in real time, and take note of what precisely is going on? I know I don't. All I (as a user) want to know is (roughly!) how far through are we, and maybe how long is remaining (on the hope that the computer's estimate is better than the one I'd make based on % complete). After the fact I might want to look at a log file, or rerun with --verbose, to see why something took so long but when it's running - hardly ever.

Maybe we should take a lead from uv, which supports parallel downloads already. As far as I can see (it runs a bit too fast to be sure!!! 🙄) they simply have

Resolving <progress bar> --> Resolved NN packages in TTT
Downloading <progress bar> [NN/MM] some package name --> Downloaded NNN packages in TTT
Installing <progress bar> [NN/MM] some package name --> Installed NNN packages in TTT

That feels perfectly adequate to me, and well within the capabilities of the basic rich progress bar.

@pradyunsg
Copy link
Member

I concur with @pfmoore here -- rich does support parallel progress bars already and it's unclear why we need this complexity here. See https://github.com/Textualize/rich/blob/v13.7.1/examples/dynamic_progress.py for an example.

We don't need additional threads and locks etc -- that's a detail handled by rich's progress bar already (https://github.com/Textualize/rich/blob/v13.7.1/rich/progress.py#L968) and using a wrapper around that instead of subclassing it would make it easir to ensure we don't break internal expectations of that class.

@NeilBotelho
Copy link
Contributor Author

NeilBotelho commented May 6, 2024

Ah I see.

its easier if you have all the display logic in a single thread which is fed progress events by the download threads.

That is how it is currently working with this PR. rich progress bars allow you to update tasks and rich handles updating and refreshing the display a few times every second. So all the threads are sharing a common progress bar and are updating different tasks by their task_id.

But anyway, just wanted to make sure raw progress mode isnt removed.

Yep will do. I'll look into how to support this with the raw progress mode.

@NeilBotelho
Copy link
Contributor Author

NeilBotelho commented May 6, 2024

We seem to be adding a lot of complexity (progress bar subclasses, an now locking mechanisms) for getting "prettier" output, without looking at the bigger picture of what exactly we need here.

Point taken. The complexity here has gotten away from me a bit

Maybe we should take a lead from uv, which supports parallel downloads already.

I'll have to take a closer look at uv to be sure, but if realtime progress per package download isn't a priority this can be simplified greatly

@NeilBotelho
Copy link
Contributor Author

NeilBotelho commented May 6, 2024

rich does support parallel progress bars already and it's unclear why we need this complexity here.

The reason why I felt rich's default parallel progress bar was not suitable in this case was because it only supports single line progress bars, i.e. the progress bar for a task can only take a single line. When pip downloads a package its progress consists of two lines the progress bar and a log message of "Downloading .." which mentions the package name.
If I just used rich's parallel progress bar I get n rows of only progress bars with no indication of which package is being downloaded. Trying to put the package name on the same line as the progress bar result in weird output when package names are long.
These subclasses allows me to keep the progress bar together with its log message.

We don't need additional threads and locks etc -- that's a detail handled by rich's progress bar already

The subclasses introduced here rely on those, it doesn't add additional threads or locks. I had suggested additional locks for supporting the raw progress bar added by #12586 since it doesn't use a rich progress bar.

But I agree with your point about complexity. I'll try to take @pfmoore 's suggestion about not needing as much detail in the progress display. If I come up with a simpler solution I'll close this PR and open a new one.

@pradyunsg
Copy link
Member

If I just used rich's parallel progress bar I get n rows of only progress bars with no indication of which package is being downloaded. Trying to put the package name on the same line as the progress bar result in weird output when package names are long.

Ah, that should be possible to achieve with https://rich.readthedocs.io/en/latest/live.html instead of using just progress directly. See https://github.com/Textualize/rich/blob/master/examples/live_progress.py which presents things via a parent renderable -- we should be able to use https://rich.readthedocs.io/en/latest/group.html to provide such a composite renderable for this purpose.

@pfmoore
Copy link
Member

pfmoore commented May 6, 2024

Can someone show (even if it's just a mockup somehow) what the end goal of the display is here? I still feel like we'd be better off with just a 1-line display, showing N/M files downloaded, with maybe a list of the packages currently being downloaded (note that I said "packages", not "files" deliberately here) - for serial downloads that would just be one package, but for parallel it would be up to one per thread.

If we do want to keep the "one progress bar per file downloaded" scheme that we currently have, then I think it makes more sense to simplify it to a one-line format, something like:

   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.5/15.5 MB 65.5 MB/s eta 0:00:00 [Downloading numpy]

That could be done with just an extra column (you can link the package name (and version, if you like!) to the task, using custom field arguments when you create the task. You can then reference the fields as {task.fields[package]} in a text column, for example.

@pradyunsg what's your view on simplifying/compacting the download progress display like this? Is it something we should do as a standalone PR, before trying to add parallel download support? My gut instinct is that we should do that.

I had suggested additional locks for supporting the raw progress bar added by #12586 since it doesn't use a rich progress bar.

I'd keep the raw progress bar entirely separate here. Again, we should focus on what's actually needed, and that means getting input from maintainers of tools that use the raw progress bar, to find out what progress information they actually use. There's no point reporting per-file statistics if they summarise that into "X% complete", after all... I suggest having a separate issue/PR for the raw progress bar, and defer working on it until someone complains that parallel downloads (will) break their existing usage.

@notatallshaw
Copy link
Member

FYI, this discussions came up before in a previous PR, but I think the OP was hoping to achieve the same display behavior as what conda has now when it downloads multiple files at once, but with rich (whereas I believe conda uses tqdm).

I can make a gif later today to demonstrate this if someone doesn't easily have access to a recentish version of conda.

@pradyunsg
Copy link
Member

@ofek
Copy link
Sponsor Contributor

ofek commented May 6, 2024

Personally, I think based on that example the efforts are worth it! This is similar to what Docker CLI does for pulling and decompressing image layers.

@pradyunsg
Copy link
Member

@pradyunsg what's your view on simplifying/compacting the download progress display like this? Is it something we should do as a standalone PR, before trying to add parallel download support? My gut instinct is that we should do that.

I agree. I have a long list of changes I'd like to make to the UI (tweaking how stuff is presented during the resolve and download process is one of them).

@NeilBotelho
Copy link
Contributor Author

NeilBotelho commented May 6, 2024

Can someone show (even if it's just a mockup somehow) what the end goal of the display is here?

I made a test integration of this PR and #12388, it looks like this

If we do want to keep the "one progress bar per file downloaded" scheme that we currently have, then I think it makes more sense to simplify it to a one-line format, something like:

I think simplifying the package name to just the name or just the name and the version instead of the full name (like numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl) would allow a 1 line progress bar to work in most if not all cases. That would weaken the case for having subclasses of rich's progress bar.
The other reasons for having it as mentioned in the PR description like dynamically switching between column types for determinate and indeterminate download sizes still apply though.

@pfmoore
Copy link
Member

pfmoore commented May 6, 2024

OK, this is purely personal preference, but I love the one-line form as shown in the conda example, and I hate the 2-line pip form.

Specifically, the one-line form ends up being a tabular display of name|bar|ETA/percentage and I think that looks excellent. It should be pretty straightforward to do - it's just adding tasks to the progress display as we need them. Two things we will need to consider - if we used the filename, we'd want to truncate long names to avoid making the bar/ETA data tiny, and we'd need to work out what to do if we have more than a screenful of tasks (rich displays one screen and a "..." marker for the rest). It would be interesting to know what conda (or docker) does if it's downloading >1 screen of files at once.

@arenasys
Copy link
Contributor

arenasys commented May 7, 2024

getting input from maintainers of tools that use the raw progress bar

Nobody is using it, its not live. But i made the PR, need per-file progress information (same info as sequential). I can write parallel support if needed, integrating this PR too broadly with rich could cause issues with that (Ideally there's a nice bottleneck somewhere we can divert the progress events from rich).

@pfmoore
Copy link
Member

pfmoore commented May 7, 2024

I’m sorry, but I still think we should use the features Rich has.

I expressed my reservations about raw progress reporting here and the rich project seemed receptive to adding raw progress functionality in Textualize/rich#3000. I still think getting rich support first and using that is the better call here.

As noted here, my support of the original PR was because the code looked simple and clean. I am a strong -1 on changing that so the raw progress support has to handle complexities like threading support, or limit how we use rich in the normal situation.

If you want parallel download support for the raw progress bar, please create a PR for it. It can stand on its own merits, and not distract from this discussion. This PR won’t “break” the raw progress reporting in the sense that if tests fail, they will have to be fixed. That should be enough for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants