
perf(buffer)!: apply SSO technique to text buffer in buffer::Cell #601

Merged
merged 1 commit into from Jan 19, 2024

Conversation

rhysd
Contributor

@rhysd rhysd commented Oct 28, 2023

Description

ratatui allocates one String buffer for each buffer::Cell. A char is not sufficient because one character can consist of multiple code points, but in almost all cases a cell holds a single code point. This means ratatui makes many small heap allocations (~16 bytes per cell) only to cover the rare multi-code-point cases.

Heap allocation is slower than stack allocation, so moving these allocations to the stack can improve the performance of allocating a buffer::Buffer.

To achieve this, small string optimization (SSO) is a well-known technique (it is adopted by C++'s std::string). Fortunately, the compact_str crate is a solid implementation of SSO. It provides a drop-in replacement for the standard String, so we can use it with very small changes (as the diff of this PR shows).

CompactString has the same size as String and provides 24 bytes of inline storage. By using it for the text buffer of buffer::Cell, the symbol can be stored on the stack in almost all cases.

One problem is that the symbol field of the Cell struct is no longer a standard String; it is now a CompactString. The impact on users' code bases should be small, however, since users usually modify buffer cells through widgets.
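As a rough illustration, here is a minimal sketch (not the actual ratatui code) of what the change amounts to; the struct is stripped down to the symbol field and the setter mirrors the existing Cell API:

use compact_str::CompactString;

struct Cell {
    symbol: CompactString, // was `String`; same struct size, but 24 bytes of inline storage
    // ...style fields (fg, bg, modifier, ...) elided
}

impl Cell {
    fn set_symbol(&mut self, symbol: &str) -> &mut Self {
        // The symbol stays on the stack as long as it fits in 24 bytes,
        // which covers single grapheme clusters in practice.
        self.symbol = CompactString::new(symbol);
        self
    }
}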

Measurement summary

  • The total number of small (16-byte) heap allocations was reduced from 288895 to 8178
  • The total size of small (16-byte) heap allocations was reduced from 4.41MiB to 127.78KiB
  • Initial rendering was 1.77x faster
  • Rendering on each tick was 1.14x faster

Memory profiling

I used xctrace (the general-purpose profiler bundled with Xcode) for memory profiling. I modified the demo example to use TestBackend and stop after 100 tick iterations so that the impact of this change could be measured precisely.

Diff applied to demo example
diff --git a/examples/demo/dummy.rs b/examples/demo/dummy.rs
new file mode 100644
index 0000000..b733aca
--- /dev/null
+++ b/examples/demo/dummy.rs
@@ -0,0 +1,31 @@
+use std::{error::Error, io};
+
+use ratatui::backend::TestBackend;
+use ratatui::prelude::*;
+
+use crate::{app::App, ui};
+
+pub fn run(enhanced_graphics: bool) -> Result<(), Box<dyn Error>> {
+    let backend = TestBackend::new(318, 82);
+    let mut terminal = Terminal::new(backend)?;
+
+    // create app and run it
+    let app = App::new("Test Demo", enhanced_graphics);
+    let res = run_app(&mut terminal, app);
+
+    if let Err(err) = res {
+        println!("{err:?}");
+        // } else {
+        //     println!("{:?}", terminal.backend().buffer().content);
+    }
+
+    Ok(())
+}
+
+fn run_app<B: Backend>(terminal: &mut Terminal<B>, mut app: App) -> io::Result<()> {
+    for _ in 0..100 {
+        terminal.draw(|f| ui::draw(f, &mut app))?;
+        app.on_tick();
+    }
+    Ok(())
+}
diff --git a/examples/demo/main.rs b/examples/demo/main.rs
index c2f742b..a236d18 100644
--- a/examples/demo/main.rs
+++ b/examples/demo/main.rs
@@ -1,10 +1,11 @@
-use std::{error::Error, time::Duration};
+use std::{env, error::Error, time::Duration};
 
 use argh::FromArgs;
 
 mod app;
 #[cfg(feature = "crossterm")]
 mod crossterm;
+mod dummy;
 #[cfg(feature = "termion")]
 mod termion;
 #[cfg(feature = "termwiz")]
@@ -25,12 +26,19 @@ struct Cli {
 
 fn main() -> Result<(), Box<dyn Error>> {
     let cli: Cli = argh::from_env();
-    let tick_rate = Duration::from_millis(cli.tick_rate);
-    #[cfg(feature = "crossterm")]
-    crate::crossterm::run(tick_rate, cli.enhanced_graphics)?;
-    #[cfg(feature = "termion")]
-    crate::termion::run(tick_rate, cli.enhanced_graphics)?;
-    #[cfg(feature = "termwiz")]
-    crate::termwiz::run(tick_rate, cli.enhanced_graphics)?;
+    if env::var("DO_NOT_USE_DUMMY_BACKEND")
+        .map(|v| !v.is_empty())
+        .unwrap_or(false)
+    {
+        let tick_rate = Duration::from_millis(cli.tick_rate);
+        #[cfg(feature = "crossterm")]
+        crate::crossterm::run(tick_rate, cli.enhanced_graphics)?;
+        #[cfg(feature = "termion")]
+        crate::termion::run(tick_rate, cli.enhanced_graphics)?;
+        #[cfg(feature = "termwiz")]
+        crate::termwiz::run(tick_rate, cli.enhanced_graphics)?;
+    } else {
+        crate::dummy::run(cli.enhanced_graphics)?;
+    }
     Ok(())
 }

The measured results for 16-byte heap allocations are as follows:

              main branch   sso branch
Total bytes   4.41MiB       127.78KiB
# of allocs   288895        8178

(Screenshots: profiler output before / after)

Initial rendering performance

I measured E2E launch time (initial rendering) performance using hyperfine.

I removed the for loop so that a single tick iteration, where the screen is rendered once, is measured.

Diff applied to demo example
diff --git a/examples/demo/dummy.rs b/examples/demo/dummy.rs
index b733aca..6b2a4f7 100644
--- a/examples/demo/dummy.rs
+++ b/examples/demo/dummy.rs
@@ -23,9 +23,7 @@ pub fn run(enhanced_graphics: bool) -> Result<(), Box<dyn Error>> {
 }
 
 fn run_app<B: Backend>(terminal: &mut Terminal<B>, mut app: App) -> io::Result<()> {
-    for _ in 0..100 {
-        terminal.draw(|f| ui::draw(f, &mut app))?;
-        app.on_tick();
-    }
+    terminal.draw(|f| ui::draw(f, &mut app))?;
+    app.on_tick();
     Ok(())
 }

The result on my machine was as follows:

$ hyperfine --shell=none ./demo_main ./demo_sso
Benchmark 1: ./demo_main
  Time (mean ± σ):       8.8 ms ±   0.4 ms    [User: 6.4 ms, System: 1.5 ms]
  Range (min … max):     8.4 ms …  11.4 ms    316 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./demo_sso
  Time (mean ± σ):       5.0 ms ±   0.1 ms    [User: 3.0 ms, System: 1.1 ms]
  Range (min … max):     4.7 ms …   5.8 ms    581 runs

Summary
  ./demo_sso ran
    1.77 ± 0.09 times faster than ./demo_main

100 tick iterations performance

In addition, I measured the performance of 100 tick iterations with hyperfine. It turned out that this branch is 1.14x faster than the main branch.

$ hyperfine ./demo_main ./demo_sso
Benchmark 1: ./demo_main
  Time (mean ± σ):      85.2 ms ±   2.5 ms    [User: 81.8 ms, System: 2.1 ms]
  Range (min … max):    82.2 ms …  94.9 ms    34 runs

Benchmark 2: ./demo_sso
  Time (mean ± σ):      74.9 ms ±   0.6 ms    [User: 72.2 ms, System: 1.5 ms]
  Range (min … max):    73.3 ms …  76.8 ms    38 runs

Summary
  ./demo_sso ran
    1.14 ± 0.03 times faster than ./demo_main

@codecov

codecov bot commented Oct 28, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (330a899) 92.5% compared to head (d065252) 92.5%.

Additional details and impacted files
@@           Coverage Diff           @@
##            main    #601     +/-   ##
=======================================
- Coverage   92.5%   92.5%   -0.1%     
=======================================
  Files         61      61             
  Lines      16290   16287      -3     
=======================================
- Hits       15070   15067      -3     
  Misses      1220    1220             

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rhysd
Contributor Author

rhysd commented Oct 28, 2023

I noticed a performance regression: a benchmark with 100 tick iterations got ~10% slower.

After some investigation with a CPU sampling profiler, I found that calling clear() and then pushing a new string/char is slightly slower with CompactString than with String. I guess it has an additional branch to decide whether to use the inline storage or spill to heap memory.
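The fix amounts to overwriting the symbol in one step instead of clearing and pushing; here is a minimal sketch of the pattern (hypothetical helper names, the actual change is in c5ecda6):

use compact_str::CompactString;

// Clear-and-push reuses the existing storage, but every push has to re-check
// whether the data is inline or spilled to the heap.
fn reset_symbol_clear_push(symbol: &mut CompactString) {
    symbol.clear();
    symbol.push(' ');
}

// Building a fresh CompactString and overwriting the old value takes that
// decision once; a single-cell symbol stays inline either way.
fn reset_symbol_overwrite(symbol: &mut CompactString) {
    *symbol = CompactString::new(" ");
}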

I fixed it at c5ecda6 and confirmed the fix with hyperfine. Here is a measurement before/after the fix with 100 tick iterations.

$ hyperfine ./demo_before ./demo_after
Benchmark 1: ./demo_before
  Time (mean ± σ):      95.2 ms ±   2.0 ms    [User: 92.1 ms, System: 1.8 ms]
  Range (min … max):    92.7 ms … 101.2 ms    30 runs

Benchmark 2: ./demo_after
  Time (mean ± σ):      75.3 ms ±   0.9 ms    [User: 72.5 ms, System: 1.6 ms]
  Range (min … max):    73.6 ms …  77.9 ms    38 runs

Summary
  ./demo_after ran
    1.26 ± 0.03 times faster than ./demo_before

With the fix, this branch is now even faster than the main branch (1.14x). I added this result to the PR description.

@a-kenji
Contributor

a-kenji commented Oct 29, 2023

I can reproduce the improvements on:
tui-term-main vs tui-term-sso:

(Screenshots: divan benchmark results on main and on sso)

Note: the benchmarks are all done with a grid of 80x24, so I am likely looking at the minimum performance gain. Also, even though I use buffer::Cell directly, I did not need to change any code.

@joshka
Member

joshka commented Oct 30, 2023

Thanks for doing tests on this - this was missing from the previous approaches.

Some questions:

  1. How does this compare to the time spent rendering in the actual terminal?
  2. I'd suggest using something like docs(examples): add animation and FPS counter to colors_rgb #583 as a test for this (it changes every cell and the background / foreground every frame, so in theory it will ensure that we see the effect on the worst performing app).
  3. We could perhaps simplify to create a random full buffer every frame instead.
  4. It would be a good idea to test ~10K cells (200x50), as this tends to be the size of a 16-inch MacBook with some default font sizes (as well as 80x24 for a min and 255x255 for a max).
  5. Is it worth testing the other string types?
  6. What are your thoughts on the absolute magnitude of the change? (The relative 14% change obscures that this is 0.1ms - what sorts of scenarios is that worth worrying about in?)
  7. Would it be worth making the Cell fields private to avoid exposing this internal detail?

Some previous similar work (tagging for their insights / interest):

@rhysd
Contributor Author

rhysd commented Oct 30, 2023

@a-kenji Thank you for trying it out. It is very helpful to have this branch exercised by practical external widgets. I'm glad to hear that the benchmarks on your side improved.

@joshka Thank you for the feedback.

How does this compare to the time spent rendering in the actual terminal?

No backend is modified, so any difference comes only from how the backend accesses the cell content (e.g. crossterm).

Theoretically it only adds one branch per cell access, since CompactString needs to check whether it's using inline storage or a heap buffer. The branch only adds a few instructions, so we can ignore the impact.
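As an illustration only (this is neither the real compact_str internals nor the actual backend code), the per-cell read path conceptually looks like this, with the branch hidden inside as_str():

use compact_str::CompactString;

// Each cell read goes through as_str(), which decides once whether the bytes
// live inline in the struct or behind a heap pointer - a single,
// well-predicted branch per cell.
fn write_cell_symbol(symbol: &CompactString, out: &mut String) {
    out.push_str(symbol.as_str());
}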

Practically, I'll try to measure the actual impact with a real backend and post the result later.

I'd suggest using something like #583 as a test for this (it changes every cell and the background / foreground every frame, so in theory it will ensure that we see the effect on the worst performing app).

OK, I'll measure performance with colors_rgb and TestBackend later.

We could perhaps simplify to create a random full buffer every frame instead.

Do you mean measuring the creation of a Buffer instance alone? That's possible, but I feel the initial rendering measurement is sufficient for this.

It would be a good idea to test ~10K cells (200x50), as this tends to be the size of a 16-inch MacBook with some default font sizes (as well as 80x24 for a min and 255x255 for a max)

OK, I'll try 80x24, 200x50, and 255x255. (I chose 318x82 for my benchmark because it is the actual terminal size of my environment.)

Is it worth testing the other string types?

I considered the following crates:

  • smol_str: its MSRV policy does not fit this crate, and its author has said that ecow is the better implementation.
  • ecow: this is a good alternative. However, I selected compact_str for the following reasons:
    • Its inline storage is smaller than compact_str's.
    • It manages its buffer internally with a reference count for O(1) clone. This is effective when strings are cloned frequently, but that is not how this crate uses them, and incrementing/decrementing an atomic reference counter has an additional cost.

If you have other alternatives, I'll take a look.

What are your thoughts on the absolute magnitude of the change? (The relative 14% change obscures that this is 0.1ms - what sorts of scenarios is that worth worrying about in?)

I'm not sure whether rendering each tick 1.14x faster has a real impact on users' use cases. Rendering happens at some interval (e.g. every 100ms), so a program has enough budget to render each frame with the current main branch. The program's CPU usage may go down slightly with this branch, though.

The most important point regarding performance is initial rendering, since a program usually renders the first paint right away and it is CPU intensive. The absolute values of my benchmark with only initial rendering are 5.0ms (sso) vs 8.8ms (main). I feel 3.8ms is large enough to be worth considering.

And this branch is not only about CPU performance but also about memory usage. Heap memory usage with this branch is obviously smaller than on the main branch, since each cell stores its content inline instead of allocating extra heap memory. Even if this branch had no CPU performance benefit, it would be worth considering for its memory consumption alone.

Would it be worth making the Cell fields private to avoid exposing this internal detail?

Well, that depends on the maintainers; I have no strong opinion. CompactString's API is almost compatible with String, so users of buffer::Cell (I guess there are only a few direct users) can use the field almost the same way as a String. The .into() method automatically converts string types into CompactString too.

In general, exposing struct fields can cause issues like this. If I were a maintainer of this crate, I would add a builder for Cell and accessor methods instead of exposing the fields. Then the fields of Cell could be modified or added to without a breaking change. (Without #[non_exhaustive], adding a new field to the struct is a breaking change.)

struct CellBuilder {
    cell: Cell,
}

impl CellBuilder {
    fn new(s: &str) -> Self {
        Self {
            cell: Cell {
                symbol: CompactString::new(s),
                ..Default::default()
            },
        }
    }

    fn with_fg(mut self, color: Color) -> Self {
        self.cell.fg = color;
        self
    }

    // ...

    fn build(self) -> Cell {
        self.cell
    }
}

If you want this change, please let me know. I'll implement it in this branch.

@rhysd
Contributor Author

rhysd commented Oct 30, 2023

I tried the colors_rgb example with TestBackend. I applied the following patch to the example.

Diff
diff --git a/examples/colors_rgb.rs b/examples/colors_rgb.rs
index b15c859..d28e395 100644
--- a/examples/colors_rgb.rs
+++ b/examples/colors_rgb.rs
@@ -17,36 +17,36 @@ use palette::{
     convert::{FromColorUnclamped, IntoColorUnclamped},
     Okhsv, Srgb,
 };
-use ratatui::{prelude::*, widgets::*};
+use ratatui::{backend::TestBackend, prelude::*, widgets::*};
 
 type Result<T> = std::result::Result<T, Box<dyn Error>>;
 
 fn main() -> Result<()> {
-    install_panic_hook();
+    // install_panic_hook();
     App::new()?.run()
 }
 
 struct App {
-    terminal: Terminal<CrosstermBackend<Stdout>>,
+    terminal: Terminal<TestBackend>,
     should_quit: bool,
 }
 
 impl App {
     pub fn new() -> Result<Self> {
         Ok(Self {
-            terminal: Terminal::new(CrosstermBackend::new(stdout()))?,
+            terminal: Terminal::new(TestBackend::new(200, 50))?,
             should_quit: false,
         })
     }
 
     pub fn run(mut self) -> Result<()> {
-        init_terminal()?;
-        self.terminal.clear()?;
-        while !self.should_quit {
+        // init_terminal()?;
+        // self.terminal.clear()?;
+        for _ in 0..100 {
             self.draw()?;
-            self.handle_events()?;
+            // self.handle_events()?;
         }
-        restore_terminal()?;
+        // restore_terminal()?;
         Ok(())
     }
 
@@ -71,7 +71,7 @@ impl App {
 
 impl Drop for App {
     fn drop(&mut self) {
-        let _ = restore_terminal();
+        // let _ = restore_terminal();
     }
 }
 

I compiled the example on both sso and main branches with --release and measured performance with hyperfine:

$ hyperfine ./colors_rgb_main ./colors_rgb_sso
Benchmark 1: ./colors_rgb_main
  Time (mean ± σ):     292.0 ms ±   7.5 ms    [User: 289.3 ms, System: 1.5 ms]
  Range (min … max):   283.9 ms … 309.7 ms    10 runs

Benchmark 2: ./colors_rgb_sso
  Time (mean ± σ):     286.4 ms ±   1.2 ms    [User: 283.8 ms, System: 1.3 ms]
  Range (min … max):   284.8 ms … 288.3 ms    10 runs

Summary
  ./colors_rgb_sso ran
    1.02 ± 0.03 times faster than ./colors_rgb_main

Even in this worst case, the sso branch is faster than the main branch.

@rhysd
Contributor Author

rhysd commented Oct 30, 2023

I applied the #583 patch to both the sso and main branches and monitored the FPS on iTerm2 with a 200x50 terminal size.

  • sso → about 21.40 fps
  • main → about 20.00 fps
(Screenshots: FPS counter on sso and main)

@rhysd
Copy link
Contributor Author

rhysd commented Oct 30, 2023

I measured performance (the demo example with the test backend) with hyperfine at additional sizes: 200x50, 80x24, and 255x255.

200x50

$ hyperfine ./demo_main_200_50 ./demo_sso_200_50
Benchmark 1: ./demo_main_200_50
  Time (mean ± σ):      38.6 ms ±   1.5 ms    [User: 36.3 ms, System: 1.2 ms]
  Range (min … max):    37.0 ms …  46.5 ms    71 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./demo_sso_200_50
  Time (mean ± σ):      34.7 ms ±   0.7 ms    [User: 32.7 ms, System: 0.9 ms]
  Range (min … max):    33.7 ms …  39.1 ms    79 runs

Summary
  ./demo_sso_200_50 ran
    1.11 ± 0.05 times faster than ./demo_main_200_50

80x24

$ hyperfine ./demo_main_80_24 ./demo_sso_80_24
Benchmark 1: ./demo_main_80_24
  Time (mean ± σ):      12.2 ms ±   0.6 ms    [User: 10.6 ms, System: 0.8 ms]
  Range (min … max):    11.5 ms …  15.3 ms    198 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./demo_sso_80_24
  Time (mean ± σ):      11.5 ms ±   0.3 ms    [User: 10.0 ms, System: 0.7 ms]
  Range (min … max):    11.0 ms …  12.6 ms    200 runs

Summary
  ./demo_sso_80_24 ran
    1.06 ± 0.06 times faster than ./demo_main_80_24

255x255

$ hyperfine ./demo_main_255_255 ./demo_sso_255_255
Benchmark 1: ./demo_main_255_255
  Time (mean ± σ):     193.1 ms ±   6.9 ms    [User: 187.3 ms, System: 4.2 ms]
  Range (min … max):   187.0 ms … 212.2 ms    15 runs

Benchmark 2: ./demo_sso_255_255
  Time (mean ± σ):     174.9 ms ±   6.2 ms    [User: 170.2 ms, System: 3.3 ms]
  Range (min … max):   169.7 ms … 196.1 ms    17 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  ./demo_sso_255_255 ran
    1.10 ± 0.06 times faster than ./demo_main_255_255

@joshka
Member

joshka commented Oct 31, 2023

If you have other alternative, I'll take a look.

I don't have specific crates in mind for this - I'm more just curious, as there are a few options mentioned in the previous PRs.

Do you mean measuring creating Buffer instance alone? It's possible but I feel initial rendering measurement is sufficient for this.

I meant that it could be worth investing in a small test application for this. The simplest test case without any extra processing that would mess things up would be to render a full screen of symbols filled with random characters to the buffer.

The most important point in regards to performance is initial rendering since a program usually tries to render the first paint and it is CPU intensive. The absolute values of my benchmark with only initial rendering are 5.0ms (sso) vs 8.8ms (main). I feel 3.8ms is large enough to be considered.

My main concern is that this one-time gain is outside the realm of what a human can perceive, and I'm wondering if there's a reasonable use case where that speedup will matter for some application.

And this branch is not only for CPU performance but also for memory usage. Heap memory usage by this branch is obviously smaller than main branch since each cell string contains its content inline instead of allocating extra heap memory. Even if this branch has no CPU performance benefit, it is worth consideration in terms of memory consumption.

Yep, but the memory concern is pretty small too - most computers these days have at least 8GB RAM, so 4MB is 0.05%; put another way, I'd need 20 copies of an application running to make a 1% difference in memory usage.

In general, exposing struct fields can cause an issue like this. If I were a maintainer of this crate, I would add a builder for Cell and add accessor methods instead of exposing fields. Then fields of Cell can be modified/added without breaking change. (Without #[non_exhaustive], adding new field to the struct is a breaking change.)

Yeah, I think if I were designing many of the types from scratch I'd create explicit builders (and just use one of the derive-builder crates) rather than having the types themselves be builders. set_symbol() is kind of a builder-style setter though, as it returns the cell. This means that if we do change this, there's a fairly simple workaround for code that gets broken by the field being made private. That said, it's worth taking the deprecate-then-change approach for it so that people at least have a chance to see a warning that helps drive fixes.
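A minimal migration sketch of that workaround (hypothetical snippet, not from this PR): code that assigned the field directly can switch to the existing builder-style setter, which keeps working whether or not the field stays public or changes type.

use ratatui::buffer::Cell;

fn write_x(cell: &mut Cell) {
    // Before: `cell.symbol = "x".to_string();` - breaks once the field is
    // private or becomes a CompactString.
    cell.set_symbol("x"); // keeps working either way
}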


I want to make sure you hear that doing the perf tests here is really important and useful, and I value that work a lot. At the same time, I also want to highlight that the results seem to suggest that changing this doesn't seem worth it - at least for now. I'm willing to be wrong about this and change my mind, but I'd like to consider a few more things first.

Rather than changing the low-level data structure, I wonder if there's some better data structure / algorithm than our current one for handling this. When tuning an operation that happens a lot, there are two approaches that attack the same problem: 1. reduce the cost of the individual operation, 2. reduce the count of operations. The second approach is usually much more important than the first (here N = ~10-20K / frame). Here are some possible ideas to investigate:

  1. Is there some way to store each Span as the full span in the cell rather than as individual symbols? This would effectively be a zero allocation approach. I suspect there's a lot of strings that are one style and take up the entire row that could be stored just as is. We'd need some sort of z-index to ensure that things which are written over the top of each other are processed in the right order to make this work (e.g. the title of a block is written over the top of the borders etc.) https://crates.io/crates/store-interval-tree but modified for hierarchy perhaps? Perhaps there's already something like this optimized for UI work.
  2. I suspect a lot of UIs (e.g. the normal / average case) have screens that are mostly blank - what if we stored each cell as Option<Cell> instead of Cell - do we get a perf increase from the lowered number of allocations? (A speculative sketch follows this list.)
  3. I wonder if there's a simple way we can just use Cow slices of the spans instead of allocating a String for each cell? The buffer would have to gain a lifetime for this, but that might be the right thing to change.
  4. Is there some way that, instead of clearing the buffer, we could start from a buffer that is already the same as the previous buffer, and instead of re-allocating strings that are the same as in the previous buffer, just reuse the existing strings? (This would effectively move the diff code to happen during render() instead of at the end, and perhaps would mean keeping track of a render mask for each cell or something like that.)
  5. If we're breaking compatibility on this, then it's worth considering what else we should fix in the same place. My big want is to make get() & get_mut() return Option rather than panicking for index out of bounds.
  6. Is there a way to deliver this behind a feature flag?
  7. Are there any other ideas that could work?
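A purely speculative sketch of idea 2 above (not part of this PR; the type and names are hypothetical): a buffer that stores Option<Cell>, so blank cells carry no symbol storage at all.

use ratatui::{buffer::Cell, layout::Rect};

struct SparseBuffer {
    area: Rect,
    // None means "blank cell with default style"; a Cell (and its symbol
    // storage) is only created when a widget actually writes that position.
    content: Vec<Option<Cell>>,
}

impl SparseBuffer {
    fn empty(area: Rect) -> Self {
        let size = area.width as usize * area.height as usize;
        Self {
            area,
            content: vec![None; size],
        }
    }
}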

@joshka
Member

joshka commented Oct 31, 2023

Also, is there some way to make the performance test app and commands part of the code for this PR, to make it easy to reproduce this rather than applying patches - perhaps by building it into the benchmarks we have currently?

I also noticed https://nikolaivazquez.com/blog/divan/ a little while ago which might be a good replacement for criterion if we want to invest more in benchmarking.

@rhysd
Contributor Author

rhysd commented Nov 2, 2023

I'm sorry for the delayed response. I needed to handle some issues in tui-textarea.

I don't have specific crates in mind for this - more just curious as there's a few options mentioned in the previous PRs.

OK. I have considered all the options you kindly listed in the previous comment. I didn't mention smartstring, but its features are almost identical to compact_str, and I selected compact_str because it is well maintained.

I meant that it could be worth investing in a small test application for this. The simplest test case without any extra processing that would mess things up would be to render a full screen of symbols filled with random characters to the buffer.

Thanks for the additional explanation; I think I get the point now. That said, I don't think the demo example is small in terms of measuring rendering performance: it contains many widgets and the graph is dynamically updated on each tick.

My main concern is that this one time gain is outside of the realm of a human being able to see any difference, and I'm wondering if there's a reasonable use case where that speedup will matter for some application.

Well, I think speedups matter for all applications. Even if a single optimization is small, accumulating many small optimizations can produce a speedup that people notice.

We don't see any trade-offs here, so I'm honestly not understanding why it is a concern that the speed gain is small.

That said it's worth taking the deprecate and then change pattern for it so that people at least have a chance to see a change that helps drive fixes.

I agree with this idea.

I want to make sure you hear that doing the perf tests here is really important and useful to see

That's not possible for me until you elaborate on the criteria of really important and useful.

Rather than changing the low level data structure, I wonder if there's some better data structure / algorithm than our current one for handling this.

I don't have time to do such big changes.

@rhysd
Contributor Author

rhysd commented Nov 2, 2023

Honestly, I was disappointed by your response. I measured performance across as many variations as possible, and no downside was actually observed, yet you just say 'I have a concern' without any concrete examples. There is nothing I can do to push the discussion forward.

If your concern is that my benchmarks are useless, then how about merging this branch and observing whether real-world applications see any regressions? This PR is very small, the changes are obvious, and rollback is easy (just replace CompactString with String and revert the 'clear and push' operations).

@Philipp-M
Contributor

Yeah, I agree. I think if it doesn't introduce a massive dependency chain, or poorly maintained or poorly written dependencies (which is not the case here), it's a net benefit. I think the benches shown here are significant enough, and it doesn't introduce a lot of new code that needs to be maintained.

Finding an algorithmically more optimal solution is orthogonal to this change and can still be done after this.

@joshka
Member

joshka commented Nov 3, 2023

We don't see any trade offs. So I'm honestly not understanding why it is a concern that the speed gain is small.

My apologies for not properly explaining the actual concern (if I understand you right, you took offense that I focused only on the size of the speed increase).

Let's examine the tradeoff:

  • Upside: a 0.18ms per-frame speedup, based on the 255x255 rendering of the demo ((193.1 - 174.9)ms / 100). This would be a fairly usual case (except for full-screen animation apps like colors_rgb). For comparison, 60fps is 16.67ms per frame.
  • Downside: every application / library that manually sets the Cell::symbol field using a String has to be updated. I don't know how large this number is (https://github.com/search?q=ratatui+%22.symbol+%3D+%22&type=code suggests it's small, but there are ~193 repos that use tui-logger - I'm guessing a bunch of those are still on tui-rs rather than ratatui, though 🤷‍♂️). This is only the public GitHub repos, so there might be other places.
  • To me it seems difficult to argue that breaking users' code is worth such an imperceptibly small performance increase.

The main concern I have with this is that when we change the public interface, we make users have to change their code, and potentially break downstream applications at the installation point (i.e. our users' users are the ones that can't install the app). Adding CompactString to the public API, expecting our users to update, and then later expecting them to update again if we find a better approach that is not compatible with the CompactString approach, is annoying. I tend to be conservative when asking other people to change their code (and prefer to do it once rather than twice when possible).

Assuming we make the symbol field private (or perhaps pub(crate)), there's less concern, as CompactString becomes an internal detail instead of a public API issue. I think one way to do this would be to mark the field as deprecated and make a release to communicate this to downstream users prior to changing it, and also to add a getter for the field so that people can migrate to that before the change happens. Making the field private potentially breaks anyone using the serde serialization too, so there are downsides to this approach that are also "interesting".

Anyway, I hope that's helped clarify why this isn't as simple as just changing something small here, and if I've misunderstood your perspective at all, please do continue to advocate your position to help me see things from your perspective.

@rhysd
Contributor Author

rhysd commented Nov 6, 2023

@joshka Thank you for elaborating on your concern.

0.18ms per frame speedup here based on the 255x255 rendering of the demo ((193.1 - 174.9)ms/100). This would be a fairly usual case (except for full screen animation apps like colors_rgb). For comparison, 60fps is 16.67ms per frame.

This does not cover all the upsides. As I said repeatedly, there is a larger memory consumption benefit, and initial rendering is also faster (due to avoiding the initial heap allocations). Though it is a one-shot gain, it is important because users generally wait, doing nothing, until the application launches.

You said that 5ms is too small for a meaningful optimization, but on my machine the vim command starts in 11ms, and I would be happy if it were reduced to 6ms since I start vim ~20 times every day.

every application / library that manually sets the Cell::symbol field using a String has to be updated. I don't know how large this number is (https://github.com/search?q=ratatui+%22.symbol+%3D+%22&type=code suggests it's small, but there are ~193 repos that use tui-logger

I created a PR for tui-logger. As you can see, the change is really small and authors can easily adopt it. Since ratatui has not released v1 yet, it is reasonable for users to expect some breaking changes in v0.x releases, as with other breaking changes ratatui has already made.

gin66/tui-logger#51

(Actually, .symbol = ... is less efficient than .set_symbol(...), so making the field private and forcing users to use the API is a good idea even if this PR is not merged.)
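A hedged sketch of why the direct assignment tends to be less efficient, assuming a String-backed symbol like the pre-SSO implementation (hypothetical type, not the exact ratatui code): assigning drops the old String and installs a newly allocated one, while the setter can reuse the buffer that is already there.

struct StringCell {
    symbol: String,
}

impl StringCell {
    fn set_symbol(&mut self, symbol: &str) -> &mut Self {
        // Can reuse the existing allocation when it is large enough.
        self.symbol.clear();
        self.symbol.push_str(symbol);
        self
    }
}

fn assign_directly(cell: &mut StringCell) {
    // Drops the old String and allocates a fresh one on every call.
    cell.symbol = "x".to_string();
}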

The main concern I have with this is that when we change the public interface, we make users have to change their code, and potentially break downstream applications at the installation point

From my perspective, if I've understood your points well, this is a question of a 'small breaking change' vs. a 'small optimization gain' (though I don't think the memory consumption optimization is small). I still think the optimization is worth the breaking change, because the breaking change happens only once, on a single version bump, while the optimization will be effective in all future versions.

I think one way to do this would be to mark the field as deprecated and make a release to help communicate this to downstream users prior to changing it, and also adding a getter for the field so that people can migrate to that before the change happens.

I agree with this idea very much. How about the following timeline?

  1. Add deprecation message on Cell::symbol field
  2. Add a new default constructor and getter for symbol field
  3. Release ratatui v0.24.1 (or v0.24.2 if v0.24.1 doesn't fit)
  4. Hide the symbol field
  5. Release the change at v0.25.0

What I'm thinking of for step 2 is:

diff --git a/src/buffer.rs b/src/buffer.rs
index 2bd3d56..f16ae67 100644
--- a/src/buffer.rs
+++ b/src/buffer.rs
@@ -17,7 +17,7 @@ use crate::{
 #[derive(Debug, Clone, Eq, PartialEq, Hash)]
 #[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
 pub struct Cell {
-    pub symbol: CompactString,
+    pub(crate) symbol: CompactString,
     pub fg: Color,
     pub bg: Color,
     #[cfg(feature = "underline-color")]
@@ -26,12 +26,30 @@ pub struct Cell {
     pub skip: bool,
 }
 
+impl Default for Cell {
+    fn default() -> Self {
+        Self {
+            symbol: CompactString::new(" "),
+            fg: Color::Reset,
+            bg: Color::Reset,
+            #[cfg(feature = "underline-color")]
+            underline_color: Color::Reset,
+            modifier: Modifier::empty(),
+            skip: false,
+        }
+    }
+}
+
 impl Cell {
     pub fn set_symbol(&mut self, symbol: &str) -> &mut Cell {
         self.symbol = CompactString::new(symbol);
         self
     }
 
+    pub fn symbol(&self) -> &'_ str {
+        self.symbol.as_str()
+    }
+
     pub fn set_char(&mut self, ch: char) -> &mut Cell {
         let mut buf = [0; 4];
         self.symbol = CompactString::new(ch.encode_utf8(&mut buf));

Using Cell::default directly would be a fairly rare case, because Cell instances are usually created by Buffer.
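For downstream code, assuming the getter lands as proposed above, reads simply go through the accessor instead of the field - a hypothetical example:

use ratatui::buffer::Buffer;

fn first_symbol(buf: &Buffer) -> &str {
    // `symbol()` replaces direct access to the now-private `symbol` field.
    buf.get(0, 0).symbol()
}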

@joshka
Member

joshka commented Nov 8, 2023

I agree with this idea very much. How about the following timeline?

Add deprecation message on Cell::symbol field
Add a new default constructor and getter for symbol field
Release ratatui v0.24.1 (or v0.24.2 if v0.24.1 doesn't fit)
Hide the symbol field
Release the change at v0.25.0

Sounds good to me. Let's do it! :)


Everything past here is more about my curiosity than about the problem itself

There is a larger memory consumption benefit

What sort of systems would this change be a problem for? The smallest EC2 instance has 512MB, and I can't recall a laptop I've owned since perhaps 2010 that's had less than 16GB.

You said that 5ms is too small for meaningful optimization, but on my machine vim command starts in 11ms and I would be happy if it is reduced to 6ms since I start vim ~20 times every day.

I stand by that - perf isn't an absolute thing. Implementing perf changes when there isn't a reasonable situation where they can be detected is a failure to heed Knuth's observation regarding optimization:

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

I'd be curious to see if you can actually reliably detect this (I certainly could not). Want to test it out?

Some silly test code
use std::{env::args, thread::sleep, time::Duration};

fn main() {
    if rand::random() {
        println!("Slowed by 5ms - did you notice?");
        sleep(Duration::from_millis(5));
    }
    let args = args().skip(1).collect::<Vec<String>>();
    let err = exec::Command::new("vim").args(&args).exec();
    eprintln!("vim: {err}");
}

I created a PR for tui-logger. As you can see, the change is really small and authors can easily adopt them. Since ratatui has not released v1 yet, it is easy to assume some breaking changes in v0.x release for users as other breaking changes ratatui has added.

I understand that the change itself is fairly small and that it's easy to think this means that breaking it is trivial. That's not where the pain is felt, though. A lot of smaller apps have installation instructions that are just cargo install foo (they often omit the advised --locked). Every user that runs that command for a downstream application now hits a failure if we break an API. We've seen something similar happen when we introduce breaking changes for our own examples (developers copy code from our git repo and expect it to work with the most recent release). E.g. removing the generic parameter from Frame recently had this problem (and there was another issue with some other breaking changes).

@a-kenji
Contributor

a-kenji commented Nov 8, 2023

While I am not sure there is a good way to test for it, I suspect the large difference in allocations here might be able to switch a core from one P-state to another in low-load scenarios.

@rhysd
Contributor Author

rhysd commented Nov 11, 2023

Sounds good to me. Let's do it! :)

I'll make a PR for steps 1 and 2.

What sort of systems would this change be a problem for?

Well, I cannot come up with one. Maybe some systems that are almost out of memory and thrashing?

Implementing perf changes when there isn't a reasonable situation where they can be detected is a failure to heed Knuth's observation regarding optimization

Yes, I understand your point. I of course know Knuth's rules for optimization and agree with them. It's a matter of trade-off (complexity vs. benefit).

The point is that your stance is that small optimizations which people cannot perceive are premature optimization, and I don't agree, because the sum of many small optimization gains can be meaningful. For my vim example, I can save 5ms * 20 * 365 ≈ 36.5s per year, and I'd appreciate it. And this PR does not add much complexity to the internal structure (CompactString just puts the string bytes inline instead of in allocated heap memory), so Knuth's concern about introducing complexity does not really apply to this case, IMO.

@rhysd
Contributor Author

rhysd commented Nov 11, 2023

I created a PR for deprecation: #624

@rhysd
Contributor Author

rhysd commented Nov 11, 2023

I rebased onto the latest main branch to resolve conflicts and force-pushed the commit since a merge commit is not allowed by lint.

@rhysd

This comment was marked as outdated.

@joshka
Member

joshka commented Nov 13, 2023

This should be merged only after releasing 0.24.1

@rhysd
Contributor Author

rhysd commented Nov 13, 2023

I rebased onto the latest main branch to fix the conflict with #626. e689bea hid the Cell::symbol field and removed the deprecation warning.

@joshka joshka added this to the v0.25.0 milestone Nov 13, 2023
@joshka
Member

joshka commented Dec 4, 2023

I suspect this will go out in v0.26.0, not v0.25.0, as we've got some breaking changes in main that prevent us from shipping the next version as v0.24.1.

@rhysd
Contributor Author

rhysd commented Dec 5, 2023

Sure. So,

  1. Release 0.25.0 with the deprecation message
  2. Release 0.25.1, 0.25.2, ... (if necessary)
  3. Release 0.26 with this change

Am I correct?

@joshka
Member

joshka commented Dec 5, 2023

Yep

@Valentin271 Valentin271 modified the milestones: v0.25.0, v0.26.0 Dec 6, 2023
@Valentin271
Member

This can be merged after a rebase right?

@rhysd
Contributor Author

rhysd commented Dec 26, 2023

I will rebase tomorrow.

@rhysd
Contributor Author

rhysd commented Dec 27, 2023

I rebased this branch onto the latest main branch.

@orhun
Sponsor Member

orhun commented Dec 27, 2023

I'm writing a blog post about benchmarking I/O streams (block buffered, line buffered, etc.) using Ratatui, and I will reference this PR there. However, I just tried it and I'm getting slightly lower FPS on this branch. I'm not sure what is happening yet, but I will read through the comments here and try to figure it out. Maybe it is an initial rendering issue.

Here is the unpolished code that I'm using to benchmark (press Space to toggle between streams):

io_bench.rs
use std::{
    fmt,
    fs::File,
    io::{self, BufWriter, LineWriter, Write},
    os::fd::{FromRawFd, OwnedFd},
    time::{Duration, Instant},
};

use anyhow::Result;
use crossterm::{
    event::{self, Event, KeyCode, KeyEventKind},
    terminal::{disable_raw_mode, enable_raw_mode, EnterAlternateScreen, LeaveAlternateScreen},
    ExecutableCommand,
};
use enum_iterator::{last, next_cycle, Sequence};
use lazy_static::lazy_static;
use palette::{convert::FromColorUnclamped, Hsv, Srgb};
use ratatui::{prelude::*, widgets::*};

lazy_static! {
    static ref RAW_STDOUT_FD: OwnedFd = unsafe { OwnedFd::from_raw_fd(1) };
}

#[derive(Copy, Clone, Debug, Default, Sequence)]
enum IoStream {
    #[default]
    Stdout,
    LineBufferedStdout,
    BlockBufferedStdout,
    Stderr,
    LineBufferedStderr,
    BlockBufferedStderr,
}

impl fmt::Display for IoStream {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "{}",
            match self {
                IoStream::Stdout => "stdout (unbuffered)",
                IoStream::LineBufferedStdout => "stdout (line buffered)",
                IoStream::BlockBufferedStdout => "stdout (block buffered)",
                IoStream::Stderr => "stderr (unbuffered)",
                IoStream::LineBufferedStderr => "stderr (line buffered)",
                IoStream::BlockBufferedStderr => "stderr (block buffered)",
            }
        )
    }
}

impl IoStream {
    fn as_stream(&self) -> Result<Box<dyn Write>> {
        Ok(match self {
            IoStream::Stdout => Box::new(File::from(RAW_STDOUT_FD.try_clone()?)),
            IoStream::LineBufferedStdout => Box::new(io::stdout()),
            IoStream::BlockBufferedStdout => Box::new(BufWriter::new(io::stdout())),
            IoStream::Stderr => Box::new(io::stderr()),
            IoStream::LineBufferedStderr => Box::new(LineWriter::new(io::stderr())),
            IoStream::BlockBufferedStderr => Box::new(BufWriter::new(io::stderr())),
        })
    }
}

#[derive(Debug)]
struct Fps {
    frame_count: usize,
    last_instant: Instant,
    fps: Option<f32>,
}

impl Default for Fps {
    fn default() -> Self {
        Self {
            frame_count: 0,
            last_instant: Instant::now(),
            fps: None,
        }
    }
}

impl Fps {
    fn tick(&mut self) {
        self.frame_count += 1;
        let elapsed = self.last_instant.elapsed();
        // update the fps every second, but only if we've rendered at least 2 frames (to avoid
        // noise in the fps calculation)
        if elapsed > Duration::from_secs(1) && self.frame_count > 2 {
            self.fps = Some(self.frame_count as f32 / elapsed.as_secs_f32());
            self.frame_count = 0;
            self.last_instant = Instant::now();
        }
    }
}

struct FpsWidget<'a> {
    fps: &'a Fps,
}

impl<'a> Widget for FpsWidget<'a> {
    fn render(self, area: Rect, buf: &mut Buffer) {
        if let Some(fps) = self.fps.fps {
            let text = format!("{:.1} fps", fps);
            Paragraph::new(text).render(area, buf);
        }
    }
}

struct RgbColorsWidget<'a> {
    /// The colors to render - should be double the height of the area
    colors: &'a Vec<Vec<Color>>,
    /// the number of elapsed frames that have passed - used to animate the colors
    frame_count: usize,
}

impl Widget for RgbColorsWidget<'_> {
    fn render(self, area: Rect, buf: &mut Buffer) {
        let colors = self.colors;
        for (xi, x) in (area.left()..area.right()).enumerate() {
            // animate the colors by shifting the x index by the frame number
            let xi = (xi + self.frame_count) % (area.width as usize);
            for (yi, y) in (area.top()..area.bottom()).enumerate() {
                let fg = colors[yi * 2][xi];
                let bg = colors[yi * 2 + 1][xi];
                buf.get_mut(x, y).set_char('▀').set_fg(fg).set_bg(bg);
            }
        }
    }
}

struct AppWidget<'a> {
    title: Paragraph<'a>,
    fps_widget: FpsWidget<'a>,
    rgb_colors_widget: RgbColorsWidget<'a>,
}

impl<'a> AppWidget<'a> {
    fn new(app: &'a App) -> Self {
        let title = Paragraph::new(vec![Line::styled(
            app.current_stream.to_string(),
            Style::new().bold(),
        )])
        .alignment(Alignment::Center);
        Self {
            title,
            fps_widget: FpsWidget { fps: &app.fps },
            rgb_colors_widget: RgbColorsWidget {
                colors: &app.colors,
                frame_count: app.frame_count,
            },
        }
    }
}

impl Widget for AppWidget<'_> {
    fn render(self, area: Rect, buf: &mut Buffer) {
        let main_layout = Layout::default()
            .direction(Direction::Vertical)
            .constraints([Constraint::Length(1), Constraint::Min(0)])
            .split(area);
        let title_layout = Layout::default()
            .direction(Direction::Horizontal)
            .constraints([Constraint::Min(0), Constraint::Length(8)])
            .split(main_layout[0]);

        self.title.render(title_layout[0], buf);
        self.fps_widget.render(title_layout[1], buf);
        self.rgb_colors_widget.render(main_layout[1], buf);
    }
}

#[derive(Debug, Default)]
struct App {
    should_quit: bool,
    switch_stream: bool,
    current_stream: IoStream,
    // a 2d vec of the colors to render, calculated when the size changes as this is expensive
    // to calculate every frame
    colors: Vec<Vec<Color>>,
    last_size: Rect,
    fps: Fps,
    frame_count: usize,
}

impl App {
    pub fn run(io_stream: IoStream) -> Result<bool> {
        let mut terminal = init_terminal(io_stream.as_stream()?)?;
        let mut app = Self::default();
        app.current_stream = io_stream;

        while !app.should_quit && !app.switch_stream {
            app.tick();

            terminal.draw(|frame| {
                let size = frame.size();
                app.setup_colors(size);
                frame.render_widget(AppWidget::new(&app), size);
            })?;
            app.handle_events()?;
        }
        restore_terminal(io_stream.as_stream()?)?;
        Ok(app.should_quit)
    }

    fn tick(&mut self) {
        self.frame_count += 1;
        self.fps.tick();
    }

    fn handle_events(&mut self) -> Result<()> {
        if event::poll(Duration::from_secs_f32(1.0 / 60.0))? {
            if let Event::Key(key) = event::read()? {
                if key.kind == KeyEventKind::Press && key.code == KeyCode::Char('q') {
                    self.should_quit = true;
                }
                if key.kind == KeyEventKind::Press && key.code == KeyCode::Char(' ') {
                    self.switch_stream = true;
                };
            }
        }
        Ok(())
    }

    fn setup_colors(&mut self, size: Rect) {
        // only update the colors if the size has changed since the last time we rendered
        if self.last_size.width == size.width && self.last_size.height == size.height {
            return;
        }
        self.last_size = size;
        let Rect { width, height, .. } = size;
        // double the height because each screen row has two rows of half block pixels
        let height = height * 2;
        self.colors.clear();

        use rand::Rng;
        let mut rng = rand::thread_rng();
        for y in 0..height {
            let mut row = Vec::new();
            let randomness_factor = (height - y) as f32 / height as f32; // More randomness towards the bottom
            for _ in 0..width {
                let base_value = randomness_factor * ((height - y) as f32 / height as f32);
                let random_offset: f32 = rng.gen_range(-0.1..0.1); // Adjust the range as needed
                let value = base_value + random_offset;

                // Clamp the value to ensure it stays within the valid range [0.0, 1.0]
                let value = value.max(0.0).min(1.0);

                let color = Hsv::new(0.0, 0.0, value); // Set hue to 0 for grayscale
                let color = Srgb::<f32>::from_color_unclamped(color);
                let color: Srgb<u8> = color.into_format();
                let color = Color::Rgb(color.red, color.green, color.blue);
                row.push(color);
            }
            self.colors.push(row);
        }
    }
}

fn init_terminal<W>(mut stream: W) -> Result<Terminal<CrosstermBackend<W>>>
where
    W: Write,
{
    enable_raw_mode()?;
    stream.execute(EnterAlternateScreen)?;
    let mut terminal = Terminal::new(CrosstermBackend::new(stream))?;
    terminal.clear()?;
    terminal.hide_cursor()?;
    Ok(terminal)
}

fn restore_terminal<W>(mut stream: W) -> Result<()>
where
    W: Write,
{
    disable_raw_mode()?;
    stream.execute(LeaveAlternateScreen)?;
    Ok(())
}

fn main() -> Result<()> {
    let mut io_stream = last::<IoStream>().unwrap_or_default();
    loop {
        io_stream = next_cycle(&io_stream).unwrap_or_default();
        if App::run(io_stream)? {
            break;
        }
    }
    Ok(())
}

@rhysd
Contributor Author

rhysd commented Dec 27, 2023

@orhun

However, I just tried it and I'm getting a slightly less FPS on this branch.

That's interesting. Since terminal rendering is a complicated task, I guess its performance depends on the terminal implementation and the CPU architecture (I'm using iTerm2 on an older Intel Mac). If I can figure out how to run your benchmark, I'd be happy to run it on my side.

However, please note that the point of this PR is an improvement in memory usage; the CPU usage improvement I've seen is just a bonus.

@orhun
Sponsor Member

orhun commented Dec 27, 2023

However, please note that the point of this PR is an improvement of memory usage. CPU usage improvement I've seen is just a bonus.

Oh, got it. Fair point.

If I could understand how to run your benchmark, I would be happy to run it on my side.

I'm actually just running the code that I shared and comparing the FPS values (not the best approach for benchmarking, but it kinda works).

Without this change: (screen recording rec_20231227T201751)

With this change: (screen recording rec_20231227T201829)

On second thought, there isn't much difference. But it would be nice to see the result on your side as well.

Signed-off-by: rhysd <lin90162@yahoo.co.jp>
@joshka
Member

joshka commented Jan 19, 2024

Rebased against main (Buffer moved to buffer/buffer.rs and Cell moved to buffer/cell.rs)

@joshka joshka merged commit 1d3fbc1 into ratatui-org:main Jan 19, 2024
34 checks passed
@joshka
Member

joshka commented Jan 19, 2024

Merged. Thanks for the PR.

@orhun
Sponsor Member

orhun commented Jan 19, 2024

@rhysd my blog post (previously mentioned) is out now if you want to give it a read: https://blog.orhun.dev/stdout-vs-stderr/

Thanks for your contribution!
