perf(buffer)!: apply SSO technique to text buffer in buffer::Cell
#601
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
##             main    #601   +/- ##
=======================================
- Coverage    92.5%   92.5%   -0.1%
=======================================
  Files          61      61
  Lines       16290   16287     -3
=======================================
- Hits        15070   15067     -3
  Misses       1220    1220
☔ View full report in Codecov by Sentry.
I noticed a performance regression: a benchmark with 100 tick iterations got ~10% slower. After some investigation with a CPU sampling profiler, I found the cause. I fixed it at c5ecda6 and confirmed the fix with hyperfine. Here is a measurement before/after the fix with 100 tick iterations.

$ hyperfine ./demo_before ./demo_after
Benchmark 1: ./demo_before
  Time (mean ± σ):  95.2 ms ± 2.0 ms  [User: 92.1 ms, System: 1.8 ms]
  Range (min … max): 92.7 ms … 101.2 ms  30 runs
Benchmark 2: ./demo_after
  Time (mean ± σ):  75.3 ms ± 0.9 ms  [User: 72.5 ms, System: 1.6 ms]
  Range (min … max): 73.6 ms … 77.9 ms  38 runs
Summary
  ./demo_after ran 1.26 ± 0.03 times faster than ./demo_before

It's now even 1.14x faster than the main branch. I added this result to the PR description.
I can reproduce the improvements on: Note: The benchmarks are all done with a grid of
Thanks for doing tests on this - this was missing from the previous approaches. Some questions:
Some previous similar work (tagging for their insights / interest):
@a-kenji Thank you for your trial. It is very helpful to have this branch tried with practical external widgets. I'm glad to hear that the benchmarks on your side improved. @joshka Thank you for the feedback.
No backend is modified, so the difference is caused only by how the content is accessed (e.g. by crossterm). Theoretically it only adds one branch per cell access. Practically, I'll try to measure the actual impact with a real backend. I'll post the result later.
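The one branch mentioned here is the inline-versus-heap discriminant check that any SSO read path performs. Below is a toy sketch of the idea (my own illustration with an 8-byte inline buffer and an explicit enum tag; compact_str's real layout is cleverer and packs the discriminant into the same 24 bytes `String` occupies):

```rust
// Toy small-string type: short symbols live inline on the stack,
// long ones fall back to a heap-allocated String.
enum SmallSymbol {
    Inline { buf: [u8; 8], len: u8 },
    Heap(String),
}

impl SmallSymbol {
    fn new(s: &str) -> Self {
        if s.len() <= 8 {
            let mut buf = [0u8; 8];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SmallSymbol::Inline { buf, len: s.len() as u8 }
        } else {
            SmallSymbol::Heap(s.to_string())
        }
    }

    // Reading the content costs exactly one branch: inline or heap?
    fn as_str(&self) -> &str {
        match self {
            SmallSymbol::Inline { buf, len } => {
                std::str::from_utf8(&buf[..*len as usize]).unwrap()
            }
            SmallSymbol::Heap(s) => s,
        }
    }
}

fn main() {
    // "▀" is 3 UTF-8 bytes, so it stays inline: no heap allocation at all.
    let cell = SmallSymbol::new("▀");
    assert_eq!(cell.as_str(), "▀");
    assert!(matches!(cell, SmallSymbol::Inline { .. }));

    let long = SmallSymbol::new("a symbol longer than eight bytes");
    assert!(matches!(long, SmallSymbol::Heap(_)));
}
```

Since nearly every cell symbol is a single grapheme of a few bytes, the `Heap` arm is almost never taken, which is why the extra branch is cheap and predictable.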
OK, I'll measure performance with
Do you mean measuring creating
OK, I'll try 80x24, 200x50, and 255x255. (I chose 318x82 for my benchmark because it is the actual terminal size of my environment.)
I considered the following crates:
If you have other alternatives, I'll take a look.
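Whichever crate is picked, the baseline constraint is that the replacement should not grow `Cell`: `String` is three machine words, and the PR notes that `CompactString` matches that footprint exactly. A quick check of the `String` footprint (my own sketch; assumes a typical 64-bit target):

```rust
use std::mem::size_of;

fn main() {
    // String = { ptr, capacity, length }: three machine words.
    assert_eq!(size_of::<String>(), 3 * size_of::<usize>());
    // On a 64-bit target that is 24 bytes, so an SSO type of the same
    // size can keep short symbols entirely inline with zero size overhead.
    println!("String is {} bytes", size_of::<String>());
}
```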
I'm not sure whether the fact that rendering each tick is 1.14x faster has a real impact on users' use cases. Rendering happens at some interval (e.g. 100ms), so a program has enough budget for rendering each frame with the current main branch. CPU usage may go down slightly with this branch, though. The most important point in regards to performance is initial rendering, since a program usually tries to paint the first frame as soon as possible and that is CPU intensive. The absolute values of my benchmark with only initial rendering are 5.0ms (sso) vs 8.8ms (main). I feel 3.8ms is large enough to be considered. And this branch is not only about CPU performance but also about memory usage. Heap memory usage on this branch is obviously smaller than on main, since each cell string contains its content inline instead of allocating extra heap memory. Even if this branch had no CPU performance benefit, it would be worth consideration in terms of memory consumption.
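The memory argument can be made concrete with a counting allocator. The sketch below is my own illustration (not code from this PR); `InlineCell` is a hypothetical stand-in for a `CompactString`-backed cell that keeps up to 4 UTF-8 bytes inline:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Global allocator wrapper that counts every heap allocation.
struct CountingAlloc;
static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

// Hypothetical inline cell: up to 4 UTF-8 bytes on the stack, no heap.
#[derive(Clone)]
struct InlineCell {
    buf: [u8; 4],
    len: u8,
}

impl InlineCell {
    fn from_char(c: char) -> Self {
        let mut buf = [0u8; 4];
        let len = c.encode_utf8(&mut buf).len() as u8;
        InlineCell { buf, len }
    }
}

// Run `f` and report how many heap allocations it performed.
fn allocs_during<T, F: FnOnce() -> T>(f: F) -> (T, usize) {
    let before = ALLOCS.load(Ordering::Relaxed);
    let v = f();
    (v, ALLOCS.load(Ordering::Relaxed) - before)
}

fn main() {
    let n: usize = 10_000;
    // One heap allocation per String symbol, plus the Vec itself.
    let (_strings, string_allocs) =
        allocs_during(|| (0..n).map(|_| "x".to_string()).collect::<Vec<String>>());
    // Only the Vec itself allocates; the symbols live inline.
    let (_inline, inline_allocs) =
        allocs_during(|| (0..n).map(|_| InlineCell::from_char('x')).collect::<Vec<InlineCell>>());
    println!("String symbols: {} allocations", string_allocs);
    println!("inline symbols: {} allocations", inline_allocs);
    assert!(string_allocs > n);
    assert!(inline_allocs < 50);
}
```

For a 318x82 grid that is ~26,000 small allocations per buffer avoided, which is the gap the xctrace measurements in the PR description point at.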
Well, that depends on the maintainers. I have no strong opinion. In general, exposing struct fields can cause issues like this. If I were a maintainer of this crate, I would add a builder for Cell:

struct CellBuilder {
    cell: Cell,
}

impl CellBuilder {
    fn new(s: &str) -> Self {
        Self {
            cell: Cell {
                symbol: CompactString::new(s),
                ..Default::default()
            },
        }
    }

    fn with_fg(self, color: Color) -> Self {
        // ...
    }

    // ...

    fn build(self) -> Cell {
        self.cell
    }
}

If you want this change, please let me know. I'll implement it in this branch.
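For illustration, usage of such a builder could look like the following self-contained sketch (the `Cell` and `Color` types here are simplified stand-ins, not ratatui's real ones):

```rust
// Stand-in types; the real Cell/Color live in ratatui's buffer module.
#[derive(Debug, Clone, PartialEq)]
enum Color {
    Reset,
    Red,
}

#[derive(Debug, Clone, PartialEq)]
struct Cell {
    symbol: String, // the PR would make this a CompactString
    fg: Color,
}

impl Default for Cell {
    fn default() -> Self {
        Cell { symbol: " ".to_string(), fg: Color::Reset }
    }
}

struct CellBuilder {
    cell: Cell,
}

impl CellBuilder {
    fn new(s: &str) -> Self {
        CellBuilder {
            cell: Cell { symbol: s.to_string(), ..Default::default() },
        }
    }

    fn with_fg(mut self, color: Color) -> Self {
        self.cell.fg = color;
        self
    }

    fn build(self) -> Cell {
        self.cell
    }
}

fn main() {
    // The builder keeps the field representation private to the crate,
    // so swapping String for CompactString later is not a breaking change.
    let cell = CellBuilder::new("▀").with_fg(Color::Red).build();
    assert_eq!(cell.symbol, "▀");
    assert_eq!(cell.fg, Color::Red);
}
```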
I tried the colors_rgb example with TestBackend applied. Diff:

diff --git a/examples/colors_rgb.rs b/examples/colors_rgb.rs
index b15c859..d28e395 100644
--- a/examples/colors_rgb.rs
+++ b/examples/colors_rgb.rs
@@ -17,36 +17,36 @@ use palette::{
convert::{FromColorUnclamped, IntoColorUnclamped},
Okhsv, Srgb,
};
-use ratatui::{prelude::*, widgets::*};
+use ratatui::{backend::TestBackend, prelude::*, widgets::*};
type Result<T> = std::result::Result<T, Box<dyn Error>>;
fn main() -> Result<()> {
- install_panic_hook();
+ // install_panic_hook();
App::new()?.run()
}
struct App {
- terminal: Terminal<CrosstermBackend<Stdout>>,
+ terminal: Terminal<TestBackend>,
should_quit: bool,
}
impl App {
pub fn new() -> Result<Self> {
Ok(Self {
- terminal: Terminal::new(CrosstermBackend::new(stdout()))?,
+ terminal: Terminal::new(TestBackend::new(200, 50))?,
should_quit: false,
})
}
pub fn run(mut self) -> Result<()> {
- init_terminal()?;
- self.terminal.clear()?;
- while !self.should_quit {
+ // init_terminal()?;
+ // self.terminal.clear()?;
+ for _ in 0..100 {
self.draw()?;
- self.handle_events()?;
+ // self.handle_events()?;
}
- restore_terminal()?;
+ // restore_terminal()?;
Ok(())
}
@@ -71,7 +71,7 @@ impl App {
impl Drop for App {
fn drop(&mut self) {
- let _ = restore_terminal();
+ // let _ = restore_terminal();
}
}
I compiled the example on both the sso and main branches and measured with hyperfine:

$ hyperfine ./colors_rgb_main ./colors_rgb_sso
Benchmark 1: ./colors_rgb_main
Time (mean ± σ): 292.0 ms ± 7.5 ms [User: 289.3 ms, System: 1.5 ms]
Range (min … max): 283.9 ms … 309.7 ms 10 runs
Benchmark 2: ./colors_rgb_sso
Time (mean ± σ): 286.4 ms ± 1.2 ms [User: 283.8 ms, System: 1.3 ms]
Range (min … max): 284.8 ms … 288.3 ms 10 runs
Summary
./colors_rgb_sso ran
  1.02 ± 0.03 times faster than ./colors_rgb_main

Even in this worst case, the sso branch is faster than the main branch.
I applied #583 patch to both sso and main branches and monitored the FPS on iTerm2 with 200x50 terminal size.
I measured performance (demo example with test backend) with hyperfine in additional sizes: 200x50, 80x24, and 255x255.

200x50

$ hyperfine ./demo_main_200_50 ./demo_sso_200_50
Benchmark 1: ./demo_main_200_50
Time (mean ± σ): 38.6 ms ± 1.5 ms [User: 36.3 ms, System: 1.2 ms]
Range (min … max): 37.0 ms … 46.5 ms 71 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: ./demo_sso_200_50
Time (mean ± σ): 34.7 ms ± 0.7 ms [User: 32.7 ms, System: 0.9 ms]
Range (min … max): 33.7 ms … 39.1 ms 79 runs
Summary
./demo_sso_200_50 ran
  1.11 ± 0.05 times faster than ./demo_main_200_50

80x24

$ hyperfine ./demo_main_80_24 ./demo_sso_80_24
Benchmark 1: ./demo_main_80_24
Time (mean ± σ): 12.2 ms ± 0.6 ms [User: 10.6 ms, System: 0.8 ms]
Range (min … max): 11.5 ms … 15.3 ms 198 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: ./demo_sso_80_24
Time (mean ± σ): 11.5 ms ± 0.3 ms [User: 10.0 ms, System: 0.7 ms]
Range (min … max): 11.0 ms … 12.6 ms 200 runs
Summary
./demo_sso_80_24 ran
  1.06 ± 0.06 times faster than ./demo_main_80_24

255x255

$ hyperfine ./demo_main_255_255 ./demo_sso_255_255
Benchmark 1: ./demo_main_255_255
Time (mean ± σ): 193.1 ms ± 6.9 ms [User: 187.3 ms, System: 4.2 ms]
Range (min … max): 187.0 ms … 212.2 ms 15 runs
Benchmark 2: ./demo_sso_255_255
Time (mean ± σ): 174.9 ms ± 6.2 ms [User: 170.2 ms, System: 3.3 ms]
Range (min … max): 169.7 ms … 196.1 ms 17 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
./demo_sso_255_255 ran
  1.10 ± 0.06 times faster than ./demo_main_255_255
I don't have specific crates in mind for this - I'm more just curious, as there are a few options mentioned in the previous PRs.
I meant that it could be worth investing in a small test application for this. The simplest test case without any extra processing that would mess things up would be to render a full screen of symbols filled with random characters to the buffer.
My main concern is that this one-time gain is outside the realm of a human being able to see any difference, and I'm wondering if there's a reasonable use case where that speedup will matter for some application.
Yep, but the memory concern is pretty small too - most computers these days have at least 8GB RAM, so 4MB is 0.05%. Or, put another way, I'd need 20 copies of an application running to make a 1% difference in memory usage.
Yeah, I think if I were designing many of the types from scratch I'd create explicit builders (and just use one of the derive-builder crates) rather than having the types themselves be builders.

I want to make sure you hear that doing the perf tests here is really important and useful to see, and I value that work a lot. At the same time, I also want to highlight that the results seem to suggest that changing this doesn't seem worth it - at least for now. I'm willing to be wrong on this and change my mind, however. But I'd like to consider a few more things first.

Rather than changing the low-level data structure, I wonder if there's some better data structure / algorithm than our current one for handling this. When tuning an operation that happens a lot, there are two approaches that attack the same problem: 1. reduce the cost of the individual operation, 2. reduce the count of operations. The second approach is usually much more important than the first (here N = ~10-20K / frame). Here are some possible ideas to investigate:
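On the second approach: ratatui already reduces the operation count at the backend boundary by diffing the previous and current buffers and writing only changed cells to the terminal. A simplified sketch of that diffing idea (my own illustration, not ratatui's actual `Buffer::diff`):

```rust
#[derive(Clone, PartialEq)]
struct Cell {
    symbol: char, // real cells also carry colors, modifiers, etc.
}

struct Buffer {
    width: usize,
    cells: Vec<Cell>,
}

impl Buffer {
    fn filled(width: usize, height: usize, symbol: char) -> Self {
        Buffer {
            width,
            cells: vec![Cell { symbol }; width * height],
        }
    }

    // Compare against the previous frame and return only the changed
    // cells as (x, y, cell) updates.
    fn diff<'a>(&'a self, prev: &Buffer) -> Vec<(usize, usize, &'a Cell)> {
        self.cells
            .iter()
            .zip(prev.cells.iter())
            .enumerate()
            .filter(|&(_, (cur, old))| cur != old)
            .map(|(i, (cur, _))| (i % self.width, i / self.width, cur))
            .collect()
    }
}

fn main() {
    let prev = Buffer::filled(80, 24, ' ');
    let mut next = Buffer::filled(80, 24, ' ');
    next.cells[81].symbol = 'x'; // one changed cell at (1, 1)
    let updates = next.diff(&prev);
    // Only 1 of the 1920 cells needs to be written to the terminal.
    assert_eq!(updates.len(), 1);
    assert_eq!((updates[0].0, updates[0].1), (1, 1));
}
```

With mostly static frames, the write count drops from width x height to the handful of cells that actually changed, which is why reducing the per-cell cost (approach 1) and reducing the cell count (approach 2) are complementary rather than competing.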
Also, is there some way to make the performance test app and commands part of the code for the PR, to make it easy to reproduce this rather than applying patches - perhaps by building it into the benchmarks we have currently? I also noticed https://nikolaivazquez.com/blog/divan/ a little while ago, which might be a good replacement for criterion if we want to invest more in benchmarking.
I'm sorry for the delayed response. I needed to handle some issues in tui-textarea.
OK. Then I considered all options you kindly listed in the previous comment. I didn't mention
Thanks for the additional explanation. I think I get the point now. And I don't think
Well, I think speedups matter for all applications, in my opinion. Even if a single optimization is small, many small optimizations accumulated can produce a speedup that people notice. We don't see any trade-offs here. So I honestly don't understand why the small size of the speed gain is a concern.
I agree with this idea.
That's not possible for me until you elaborate on the criteria for "really important and useful".
I don't have time to do such big changes.
Honestly, I was disappointed by your response. I measured performance in as many variations as possible, and there is no downside actually observed. And you just say "I have a concern" without any concrete examples. There is nothing I can do to push the discussion forward. If your concern is that my benchmarks are useless, then how about merging this branch and observing whether real-world applications have any regressions? This PR is very small, the changes are obvious, and rollback is easy (just replace `CompactString` with `String` again).
Yeah, I agree. I think if it doesn't introduce a massive dependency chain, or poorly maintained or poorly written dependencies (which is not the case here), it's a net benefit. I think the benches shown here are significant enough, and it doesn't introduce a lot of new code that needs to be maintained. Finding an algorithmically more optimal solution is orthogonal to this change and can still be done after it.
My apologies for not properly explaining the actual concern. (If I understand you right, you took offense that I just focused on the size of the speed increase.) Let's examine the tradeoff:
The main concern I have with this is that when we change the public interface, we make users have to change their code, and potentially break downstream applications at the installation point (i.e. our users' users are the ones that can't install the app). Adding CompactString to the public API, expecting our users to update, and then later expecting them to update again if we find a better approach that is not compatible with the CompactString approach, is annoying. I have a tendency to be conservative when asking other people to change their code (and prefer to do it once rather than twice when possible).

Assuming we make the symbol field private (or perhaps pub(crate)), then there's less concern, as CompactString becomes an internal detail instead of a public API issue. I think one way to do this would be to mark the field as deprecated and make a release to help communicate this to downstream users prior to changing it, and also to add a getter for the field so that people can migrate to that before the change happens. Making the field private potentially breaks anyone using the serde serialization too, so there are downsides to this approach that are also "interesting".

Anyway, I hope that's helped clarify why this isn't as simple as just changing something small here, and if I've misunderstood your perspective at all, please do continue to advocate your position to help me see things from your perspective.
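The deprecate-then-privatize path described here can lean on Rust's `#[deprecated]` attribute, which also works on struct fields, so downstream users get compiler warnings one release before the break. A rough sketch (the version number and messages are placeholders, and the struct is heavily simplified):

```rust
pub struct Cell {
    #[deprecated(since = "0.25.0", note = "use `symbol()`; this field will become private")]
    pub symbol: String,
    // ...fg, bg, modifier, etc. unchanged
}

impl Cell {
    // The impl itself still touches the field, so silence the lint here.
    #[allow(deprecated)]
    pub fn new(symbol: &str) -> Self {
        Cell { symbol: symbol.to_string() }
    }

    // Getter published ahead of the break so callers can migrate early.
    #[allow(deprecated)]
    pub fn symbol(&self) -> &str {
        &self.symbol
    }

    #[allow(deprecated)]
    pub fn set_symbol(&mut self, symbol: &str) -> &mut Cell {
        self.symbol = symbol.to_string();
        self
    }
}

fn main() {
    // Code going through the accessors compiles warning-free, before and
    // after the field's type later changes to CompactString.
    let mut cell = Cell::new(" ");
    cell.set_symbol("▀");
    assert_eq!(cell.symbol(), "▀");
}
```

Any downstream crate still reading `cell.symbol` directly gets a deprecation warning pointing at the getter, which is exactly the migration signal discussed above.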
@joshka Thank you for elaborating on your concern.
This does not cover all the upsides. As I said repeatedly, there is a larger memory consumption benefit. And initial rendering is faster (due to fewer initial heap allocations). Though it is one-shot, it is important because users generally wait doing nothing until an application launches. You said that 5ms is too small for a meaningful optimization, but on my machine
I created a PR for tui-logger. As you can see, the change is really small and authors can easily adopt it. Since ratatui has not released v1 yet, it is easy for users to anticipate some breaking changes in a v0.x release, as with other breaking changes ratatui has added. (Actually
From my perspective, if I caught your points well, this is a problem of "small breaking change" vs. "small optimization gain" (though I don't think the memory consumption optimization is small). I still think the optimization is worth the breaking change, because the breaking change happens only once, on a single version bump, but the optimization will be effective in all future versions.
I agree with this idea very much. How about the following timeline?
What I'm thinking about for 2. is:

diff --git a/src/buffer.rs b/src/buffer.rs
index 2bd3d56..f16ae67 100644
--- a/src/buffer.rs
+++ b/src/buffer.rs
@@ -17,7 +17,7 @@ use crate::{
#[derive(Debug, Clone, Eq, PartialEq, Hash)]
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
pub struct Cell {
- pub symbol: CompactString,
+ pub(crate) symbol: CompactString,
pub fg: Color,
pub bg: Color,
#[cfg(feature = "underline-color")]
@@ -26,12 +26,30 @@ pub struct Cell {
pub skip: bool,
}
+impl Default for Cell {
+ fn default() -> Self {
+ Self {
+ symbol: CompactString::new(" "),
+ fg: Color::Reset,
+ bg: Color::Reset,
+ #[cfg(feature = "underline-color")]
+ underline_color: Color::Reset,
+ modifier: Modifier::empty(),
+ skip: false,
+ }
+ }
+}
+
impl Cell {
pub fn set_symbol(&mut self, symbol: &str) -> &mut Cell {
self.symbol = CompactString::new(symbol);
self
}
+ pub fn symbol(&self) -> &'_ str {
+ self.symbol.as_str()
+ }
+
pub fn set_char(&mut self, ch: char) -> &mut Cell {
let mut buf = [0; 4];
self.symbol = CompactString::new(ch.encode_utf8(&mut buf));

Using
Sounds good to me. Let's do it! :) Everything past here is more about my curiosity than it is about the problem.
What sort of systems would this change be a problem for? The smallest EC2 instance has 512MB, and I can't recall a laptop I've owned since perhaps 2010 that's had less than 16GB.
I stand by that - perf isn't an absolute thing. Implementing perf changes when there isn't a reasonable situation where they can be detected is a failure to heed Knuth's observation regarding optimization:
I'd be curious to see if you can actually reliably detect this (I certainly could not). Want to test it out? Some silly test code:

use std::{env::args, thread::sleep, time::Duration};
fn main() {
if rand::random() {
println!("Slowed by 5ms - did you notice?");
sleep(Duration::from_millis(5));
}
let args = args().skip(1).collect::<Vec<String>>();
let err = exec::Command::new("vim").args(&args).exec();
eprintln!("vim: {err}");
}
I understand that the change itself is fairly small and that it's easy to think this means that breaking is trivial. That's not where the issue gets felt, though. A lot of smaller apps have installation instructions that are
While I am not sure if there is a good way to test for it, I suspect the large differential in allocations here might be able to switch a core from one P-state to another in low-load scenarios.
I'll make a PR for steps 1. and 2.
Well, I cannot come up with one. Maybe some systems which are almost running out of memory and thrashing?
Yes, I understand your point. I of course know Knuth's rules for optimization and agree with them. It's a matter of trade-off (complexity vs. benefit). The point is that your stance is that small optimizations which people cannot perceive are premature optimization, and I don't subscribe to that, because the sum of many small optimization gains can be meaningful. For my
I created a PR for deprecation: #624
I rebased onto the latest main branch to resolve conflicts and force-pushed the commit, since a merge commit is not allowed by lint.
This should be merged only after releasing 0.24.1.
I suspect this will go out in v0.26.0 not v0.25.0 as we've got some breaking changes in main that prevent us shipping the next version as v0.24.1. |
Sure. So,
Am I correct?
Yep
This can be merged after a rebase, right?
I will rebase tomorrow.
I rebased this branch onto the latest main branch.
I'm writing a blog post about benchmarking the I/O streams (block buffered, line buffered, etc.) using Ratatui, and I will reference this PR there. However, I just tried it and I'm getting slightly lower FPS on this branch. I'm not sure what is happening yet, but I will read through the comments here and try to figure it out. Maybe it is an initial rendering issue. Here is the unpolished code that I'm using to benchmark (press Space to toggle between streams):

io_bench.rs

use std::{
fmt,
fs::File,
io::{self, BufWriter, LineWriter, Write},
os::fd::{FromRawFd, OwnedFd},
time::{Duration, Instant},
};
use anyhow::Result;
use crossterm::{
event::{self, Event, KeyCode, KeyEventKind},
terminal::{disable_raw_mode, enable_raw_mode, EnterAlternateScreen, LeaveAlternateScreen},
ExecutableCommand,
};
use enum_iterator::{last, next_cycle, Sequence};
use lazy_static::lazy_static;
use palette::{convert::FromColorUnclamped, Hsv, Srgb};
use ratatui::{prelude::*, widgets::*};
lazy_static! {
static ref RAW_STDOUT_FD: OwnedFd = unsafe { OwnedFd::from_raw_fd(1) };
}
#[derive(Copy, Clone, Debug, Default, Sequence)]
enum IoStream {
#[default]
Stdout,
LineBufferedStdout,
BlockBufferedStdout,
Stderr,
LineBufferedStderr,
BlockBufferedStderr,
}
impl fmt::Display for IoStream {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(
f,
"{}",
match self {
IoStream::Stdout => "stdout (unbuffered)",
IoStream::LineBufferedStdout => "stdout (line buffered)",
IoStream::BlockBufferedStdout => "stdout (block buffered)",
IoStream::Stderr => "stderr (unbuffered)",
IoStream::LineBufferedStderr => "stderr (line buffered)",
IoStream::BlockBufferedStderr => "stderr (block buffered)",
}
)
}
}
impl IoStream {
fn as_stream(&self) -> Result<Box<dyn Write>> {
Ok(match self {
IoStream::Stdout => Box::new(File::from(RAW_STDOUT_FD.try_clone()?)),
IoStream::LineBufferedStdout => Box::new(io::stdout()),
IoStream::BlockBufferedStdout => Box::new(BufWriter::new(io::stdout())),
IoStream::Stderr => Box::new(io::stderr()),
IoStream::LineBufferedStderr => Box::new(LineWriter::new(io::stderr())),
IoStream::BlockBufferedStderr => Box::new(BufWriter::new(io::stderr())),
})
}
}
#[derive(Debug)]
struct Fps {
frame_count: usize,
last_instant: Instant,
fps: Option<f32>,
}
impl Default for Fps {
fn default() -> Self {
Self {
frame_count: 0,
last_instant: Instant::now(),
fps: None,
}
}
}
impl Fps {
fn tick(&mut self) {
self.frame_count += 1;
let elapsed = self.last_instant.elapsed();
// update the fps every second, but only if we've rendered at least 2 frames (to avoid
// noise in the fps calculation)
if elapsed > Duration::from_secs(1) && self.frame_count > 2 {
self.fps = Some(self.frame_count as f32 / elapsed.as_secs_f32());
self.frame_count = 0;
self.last_instant = Instant::now();
}
}
}
struct FpsWidget<'a> {
fps: &'a Fps,
}
impl<'a> Widget for FpsWidget<'a> {
fn render(self, area: Rect, buf: &mut Buffer) {
if let Some(fps) = self.fps.fps {
let text = format!("{:.1} fps", fps);
Paragraph::new(text).render(area, buf);
}
}
}
struct RgbColorsWidget<'a> {
/// The colors to render - should be double the height of the area
colors: &'a Vec<Vec<Color>>,
/// the number of elapsed frames that have passed - used to animate the colors
frame_count: usize,
}
impl Widget for RgbColorsWidget<'_> {
fn render(self, area: Rect, buf: &mut Buffer) {
let colors = self.colors;
for (xi, x) in (area.left()..area.right()).enumerate() {
// animate the colors by shifting the x index by the frame number
let xi = (xi + self.frame_count) % (area.width as usize);
for (yi, y) in (area.top()..area.bottom()).enumerate() {
let fg = colors[yi * 2][xi];
let bg = colors[yi * 2 + 1][xi];
buf.get_mut(x, y).set_char('▀').set_fg(fg).set_bg(bg);
}
}
}
}
struct AppWidget<'a> {
title: Paragraph<'a>,
fps_widget: FpsWidget<'a>,
rgb_colors_widget: RgbColorsWidget<'a>,
}
impl<'a> AppWidget<'a> {
fn new(app: &'a App) -> Self {
let title = Paragraph::new(vec![Line::styled(
app.current_stream.to_string(),
Style::new().bold(),
)])
.alignment(Alignment::Center);
Self {
title,
fps_widget: FpsWidget { fps: &app.fps },
rgb_colors_widget: RgbColorsWidget {
colors: &app.colors,
frame_count: app.frame_count,
},
}
}
}
impl Widget for AppWidget<'_> {
fn render(self, area: Rect, buf: &mut Buffer) {
let main_layout = Layout::default()
.direction(Direction::Vertical)
.constraints([Constraint::Length(1), Constraint::Min(0)])
.split(area);
let title_layout = Layout::default()
.direction(Direction::Horizontal)
.constraints([Constraint::Min(0), Constraint::Length(8)])
.split(main_layout[0]);
self.title.render(title_layout[0], buf);
self.fps_widget.render(title_layout[1], buf);
self.rgb_colors_widget.render(main_layout[1], buf);
}
}
#[derive(Debug, Default)]
struct App {
should_quit: bool,
switch_stream: bool,
current_stream: IoStream,
// a 2d vec of the colors to render, calculated when the size changes as this is expensive
// to calculate every frame
colors: Vec<Vec<Color>>,
last_size: Rect,
fps: Fps,
frame_count: usize,
}
impl App {
pub fn run(io_stream: IoStream) -> Result<bool> {
let mut terminal = init_terminal(io_stream.as_stream()?)?;
let mut app = Self::default();
app.current_stream = io_stream;
while !app.should_quit && !app.switch_stream {
app.tick();
terminal.draw(|frame| {
let size = frame.size();
app.setup_colors(size);
frame.render_widget(AppWidget::new(&app), size);
})?;
app.handle_events()?;
}
restore_terminal(io_stream.as_stream()?)?;
Ok(app.should_quit)
}
fn tick(&mut self) {
self.frame_count += 1;
self.fps.tick();
}
fn handle_events(&mut self) -> Result<()> {
if event::poll(Duration::from_secs_f32(1.0 / 60.0))? {
if let Event::Key(key) = event::read()? {
if key.kind == KeyEventKind::Press && key.code == KeyCode::Char('q') {
self.should_quit = true;
}
if key.kind == KeyEventKind::Press && key.code == KeyCode::Char(' ') {
self.switch_stream = true;
};
}
}
Ok(())
}
fn setup_colors(&mut self, size: Rect) {
// only update the colors if the size has changed since the last time we rendered
if self.last_size.width == size.width && self.last_size.height == size.height {
return;
}
self.last_size = size;
let Rect { width, height, .. } = size;
// double the height because each screen row has two rows of half block pixels
let height = height * 2;
self.colors.clear();
use rand::Rng;
let mut rng = rand::thread_rng();
for y in 0..height {
let mut row = Vec::new();
let randomness_factor = (height - y) as f32 / height as f32; // More randomness towards the bottom
for _ in 0..width {
let base_value = randomness_factor * ((height - y) as f32 / height as f32);
let random_offset: f32 = rng.gen_range(-0.1..0.1); // Adjust the range as needed
let value = base_value + random_offset;
// Clamp the value to ensure it stays within the valid range [0.0, 1.0]
let value = value.max(0.0).min(1.0);
let color = Hsv::new(0.0, 0.0, value); // Set hue to 0 for grayscale
let color = Srgb::<f32>::from_color_unclamped(color);
let color: Srgb<u8> = color.into_format();
let color = Color::Rgb(color.red, color.green, color.blue);
row.push(color);
}
self.colors.push(row);
}
}
}
fn init_terminal<W>(mut stream: W) -> Result<Terminal<CrosstermBackend<W>>>
where
W: Write,
{
enable_raw_mode()?;
stream.execute(EnterAlternateScreen)?;
let mut terminal = Terminal::new(CrosstermBackend::new(stream))?;
terminal.clear()?;
terminal.hide_cursor()?;
Ok(terminal)
}
fn restore_terminal<W>(mut stream: W) -> Result<()>
where
W: Write,
{
disable_raw_mode()?;
stream.execute(LeaveAlternateScreen)?;
Ok(())
}
fn main() -> Result<()> {
let mut io_stream = last::<IoStream>().unwrap_or_default();
loop {
io_stream = next_cycle(&io_stream).unwrap_or_default();
if App::run(io_stream)? {
break;
}
}
Ok(())
}
That's interesting. Since terminal rendering is a complicated task, I guess its performance depends on the terminal implementation and the CPU arch (I'm using iTerm2 on an older Intel Mac). If I knew how to run your benchmark, I would be happy to run it on my side. However, please note that the point of this PR is an improvement in memory usage; the CPU usage improvement I've seen is just a bonus.
Oh, got it. Fair point.
I'm actually just running the code that I shared and comparing the FPS values (not the best approach for benchmarking, but it kind of works). Without this change: With this change: (screenshots omitted) On second thought, there isn't much difference. But it would be nice to see the result on your side as well.
Signed-off-by: rhysd <lin90162@yahoo.co.jp>
Rebased against main (Buffer moved to buffer/buffer.rs and Cell moved to buffer/cell.rs).
Merged. Thanks for the PR.
@rhysd my blog post (previously mentioned) is out now if you want to give it a read: https://blog.orhun.dev/stdout-vs-stderr/ Thanks for your contribution!
Description

ratatui allocates one `String` buffer for each `buffer::Cell`. A `char` is not sufficient since one character can consist of multiple code points. However, in almost all cases the cells hold a single code point. This means that ratatui makes many small heap allocations (~16 bytes/cell) only for the rare cases. Allocation on the heap is slower than allocation on the stack, so moving these allocations to the stack can improve the performance of allocating a `buffer::Buffer`.

To achieve this, a small string optimization (SSO) is the general technique (it is adopted by C++'s `std::string`). Fortunately, the compact_str crate is a solid implementation of SSO. It provides a drop-in replacement for the standard `String`, so we can use it with very small changes (as the diffs of this PR show).

`CompactString` has the same size as `String` and provides 24 bytes of inline storage. By using it for the text buffer of `buffer::Cell`, the symbol character can be stored on the stack in almost all cases.

One problem is that the `symbol` field of the `Cell` struct is no longer a standard `String` object. It is now a `CompactString`, but this should have little impact on users' code bases since users usually modify the buffer cells through widgets.

Measurement summary
Memory profiling

I used `xctrace` (a general-purpose profiler bundled with Xcode) for memory profiling. I modified the `demo` example to use `TestBackend` and stop after 100 tick iterations so the impact of this change can be measured precisely.

Diff applied to demo example

The measured results for 16-byte heap allocations are as follows:

`main` branch vs. `sso` branch

Initial rendering performance

I measured E2E launch time (initial rendering) performance using hyperfine. I removed the `for` loop to measure a single tick iteration in which the screen is rendered once.

Diff applied to demo example

The result on my machine was as follows:

100 tick iterations performance

In addition, I measured the performance of 100 tick iterations with hyperfine. It turned out that this branch is 1.14x faster than the main branch.