TCP client performance optimizations #177
- Add PERFORMANCE_ANALYSIS.md documenting optimization opportunities in the TCP client hot path
- Fix the perf test to use a Listener<ClientState> to wait for the connection before benchmarking, eliminating the startup race condition
- Add a README.md for the perf test with valgrind profiling instructions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Store oneshot::Sender directly in an enum variant instead of wrapping it in a boxed closure. This removes one heap allocation per request on the Channel API path while preserving the boxed callback path for the FFI/CallbackSession API.

Measured with valgrind dhat: 3.02 -> 2.02 allocs/request (-33%)
Measured with callgrind: 19385 -> 18980 instructions/request (-2.1%)
The condition `self.end == self.len()` reduces to `self.begin == 0`, so the shift never triggered when it was actually needed (begin > 0 and end at capacity). Changed to `self.end == self.buffer.len()`. Added a test that reproduces the bug: filling the buffer to capacity, consuming some bytes, then attempting to read more would panic with "space for 0 bytes" because no shift occurred.
When all decode levels are Nothing (the default), skip creating the per-transaction tracing::info_span. This avoids span allocation and tx_id Display formatting on every request when logging is inactive. Measured with callgrind: 18980 -> 18809 instructions/request (-0.9%)
Picks up first_chunk-based integer reads, const-generic read_array, and Kani-verified panic-freedom guarantees. No API breakage.
Removes a redundant 8-arm enum match on RequestDetails per response. Measured with callgrind: 18819 -> 18784 instructions/request (-0.2%)
Replace the recursive self.parse() call after header parsing with a loop. Removes one function call/match overhead per parsed frame. Measured with callgrind: 18784 -> 18733 instructions/request (-0.3%)
…ultipleIterator

These iterators already provided an exact size_hint. Adding the marker trait makes the contract explicit and may help the optimizer in generic contexts.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request delivers significant performance enhancements to the TCP client by reducing heap allocations and CPU instructions, resulting in a 33% decrease in allocations per request and a 3.4% reduction in instructions per request. It also includes a crucial correctness fix for the ReadBuffer shift logic.

Highlights
Code Review
This pull request implements several performance optimizations for the TCP client, as detailed in the description. Key changes include reducing heap allocations by refactoring Promise types, fixing a logic bug in ReadBuffer, and converting a recursive parser to an iterative one. The changes are supported by new tests and documentation. After a thorough review, I have not identified any issues or further improvement opportunities in the submitted code.
Replace iterator-based collect() with as_chunks::<2>() to eliminate per-element bounds checking when building Vec<Indexed<u16>> from register responses. Applies to both holding and input register reads. Requires bumping rust-version to 1.94 for slice::as_chunks. Measured with callgrind: 18733 -> 18145 instructions/request (-3.1%)
Notable: tokio 1.48 -> 1.50, tracing 0.1.41 -> 0.1.44
Summary
- Eliminate the `Box<dyn Callback>` allocation in Promise types by storing `oneshot::Sender` directly (3.02 → 2.02 allocs/request, -33%)
- Fix the `ReadBuffer` shift condition (`self.end == self.len()` → `self.end == self.buffer.len()`) with a regression test
- Pass the needed details to `handle_response` instead of recomputing them, removing a redundant match on `RequestDetails`
- Implement `ExactSizeIterator` for `BitIterator`, `RegisterIterator`, and `WriteMultipleIterator`
- Fix the perf test startup race with a `Listener<ClientState>`

Measurements
All measurements taken with valgrind (deterministic instruction counts and allocation profiling).