
feat: support batch in session. #401

Merged
k82cn merged 1 commit into xflops:main from k82cn:flm_400 on Apr 16, 2026

@k82cn
Contributor

@k82cn k82cn commented Apr 13, 2026

No description provided.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces batch support for sessions, enabling gang scheduling by allocating executors and dispatching tasks in coordinated groups. It includes updates to the protobuf definitions, CLI, scheduler actions, and storage engines to handle batch configurations. Feedback highlights several critical issues, including a busy-wait loop in the task polling logic that could lead to high CPU usage and a race condition during batch index assignment. Performance improvements were also suggested regarding inefficient map cloning and sorting in the task selection process, along with a fix for potential integer truncation when handling task IDs.

Comment on lines +224 to +225
ctx.waker().wake_by_ref();
return Poll::Pending;

high

Calling wake_by_ref() immediately before returning Poll::Pending without an external event trigger results in a busy-wait loop, which will consume 100% CPU on the executor thread. The future should instead be registered to be woken up when new tasks are added to the session.
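One way to do that, as a minimal sketch rather than this PR's actual types: the future stores its `Waker` in shared state when it finds no work, and whatever adds a task takes that waker and calls it. `TaskQueue`, `Shared`, and `NextTask` below are hypothetical names for illustration; only the `std::task` and `std::future` items are real API.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

/// Hypothetical shared state: pending task IDs plus the waker of the
/// future currently waiting for one.
#[derive(Default)]
struct Shared {
    pending: VecDeque<i64>,
    waker: Option<Waker>,
}

/// Hypothetical stand-in for the session's task queue.
#[derive(Clone, Default)]
struct TaskQueue {
    shared: Arc<Mutex<Shared>>,
}

impl TaskQueue {
    /// Adds a task and wakes the waiting future, if there is one.
    fn push(&self, task_id: i64) {
        let waker = {
            let mut shared = self.shared.lock().unwrap();
            shared.pending.push_back(task_id);
            shared.waker.take()
        };
        // Wake after the lock is released so the woken task can poll at once.
        if let Some(waker) = waker {
            waker.wake();
        }
    }

    /// Returns a future that resolves to the next pending task ID.
    fn next_task(&self) -> NextTask {
        NextTask { shared: Arc::clone(&self.shared) }
    }
}

struct NextTask {
    shared: Arc<Mutex<Shared>>,
}

impl Future for NextTask {
    type Output = i64;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<i64> {
        let mut shared = self.shared.lock().unwrap();
        match shared.pending.pop_front() {
            Some(task_id) => Poll::Ready(task_id),
            None => {
                // Store the waker instead of calling wake_by_ref(): the
                // executor polls again only after push() has run.
                shared.waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}
```

With this shape the executor thread stays idle between `Poll::Pending` and the next `push`. The sketch keeps a single waker slot, so it assumes at most one future waits on a queue at a time.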

Comment thread session_manager/src/storage/mod.rs Outdated
Comment on lines +626 to +636
pub async fn next_batch_index(&self, ssn_id: &SessionID, batch_size: u32) -> Result<u32, FlameError> {
    let exe_map = lock_ptr!(self.executors)?;

    let bound_count = exe_map
        .values()
        .filter_map(|exe_ptr| exe_ptr.lock().ok())
        .filter(|exe| exe.ssn_id.as_ref() == Some(ssn_id))
        .count();

    Ok((bound_count as u32) % batch_size)
}

high

There is a race condition in next_batch_index. The bound_count is calculated based on a snapshot of the executors, but the assignment of the index to the executor happens after the lock on the executor map is released. If multiple executors attempt to bind to the same session concurrently, they may receive the same batch_index.
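A possible shape for the fix, sketched under assumptions: compute the index and record the binding under the same lock acquisition, so no two callers can read the same count. `BatchIndexAllocator` and its `bind` method are hypothetical and use a plain `String` session key; they are not the storage types in this PR.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical allocator: tracks how many executors each session has bound.
#[derive(Default)]
struct BatchIndexAllocator {
    bound: Mutex<HashMap<String, u32>>,
}

impl BatchIndexAllocator {
    /// Reads the bound count and records the new binding while holding the
    /// lock, so concurrent callers cannot observe the same count.
    /// Returns None when batch_size is zero.
    fn bind(&self, ssn_id: &str, batch_size: u32) -> Option<u32> {
        if batch_size == 0 {
            return None;
        }
        let mut bound = self.bound.lock().unwrap();
        let count = bound.entry(ssn_id.to_string()).or_insert(0);
        let index = *count % batch_size;
        *count += 1;
        Some(index)
    }
}
```

The sketch does not handle unbinding; a real version would also need to release an index under the same lock when an executor leaves the session.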

Comment thread common/src/apis.rs Outdated
Comment on lines +619 to +621
if let Some((task_id, _)) = pending_tasks.clone().iter().next() {
    return pending_tasks.remove(task_id);
}

medium

Cloning the entire pending_tasks map just to retrieve the first element is highly inefficient. You can use .keys().next() to get an optional key without cloning the collection.

Suggested change
- if let Some((task_id, _)) = pending_tasks.clone().iter().next() {
-     return pending_tasks.remove(task_id);
- }
+ if let Some(task_id) = pending_tasks.keys().next().cloned() {
+     return pending_tasks.remove(&task_id);
+ }

Comment thread common/src/apis.rs Outdated
Comment on lines +625 to +626
let mut sorted_ids: Vec<_> = pending_tasks.keys().cloned().collect();
sorted_ids.sort();

medium

Sorting all pending task IDs on every call to pop_pending_task results in $O(N \log N)$ complexity. For sessions with a large number of tasks, this will significantly degrade performance. Consider using a data structure that maintains order (like BTreeMap) for tasks_index or caching the sorted keys.
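As a rough illustration of the `BTreeMap` option: the map already iterates in ascending key order, so the per-call sort disappears. The function below is a standalone sketch, not the PR's `pop_pending_task`; the `String` payload is a placeholder, and `rem_euclid` is used here as an assumption to keep the remainder non-negative.

```rust
use std::collections::BTreeMap;

/// Pops the lowest-numbered pending task whose ID maps to `batch_index`.
/// `pending` is a BTreeMap, so keys are visited in ascending order without
/// sorting. `String` stands in for the real task payload.
fn pop_pending_task(
    pending: &mut BTreeMap<i64, String>,
    batch_size: u32,
    batch_index: u32,
) -> Option<(i64, String)> {
    if batch_size == 0 {
        return None;
    }
    let size = i64::from(batch_size);
    let index = i64::from(batch_index);
    // Find the first matching key; this is still a linear scan in the
    // worst case, but it allocates nothing and never sorts.
    let task_id = pending
        .keys()
        .copied()
        .find(|id| id.rem_euclid(size) == index)?;
    pending.remove(&task_id).map(|task| (task_id, task))
}
```

If the scan itself becomes a bottleneck, keeping one ordered queue per batch index would avoid it, at the cost of more bookkeeping when tasks are added.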

Comment thread common/src/apis.rs Outdated
sorted_ids.sort();

for task_id in sorted_ids {
if (task_id as u32) % batch_size == batch_index {

medium

Casting task_id (i64) to u32 can lead to incorrect batch assignment due to truncation if task IDs exceed u32::MAX. It is safer to perform the modulo operation on the original i64 type or cast batch_size to i64.

Suggested change
- if (task_id as u32) % batch_size == batch_index {
+ if (task_id % batch_size as i64) == batch_index as i64 {
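To make the difference concrete, here is a small sketch comparing the two expressions. The helper names `slot_truncating` and `slot_widened` are made up for this illustration, and both assume a non-zero `batch_size`.

```rust
/// Batch slot computed as in the original line: truncate to u32 first.
fn slot_truncating(task_id: i64, batch_size: u32) -> u32 {
    (task_id as u32) % batch_size
}

/// Batch slot computed in i64, as in the suggested change.
fn slot_widened(task_id: i64, batch_size: u32) -> i64 {
    task_id % i64::from(batch_size)
}
```

For `task_id = 4_294_967_297` (that is, `u32::MAX + 2`) and `batch_size = 3`, the cast keeps only the low 32 bits, so `slot_truncating` returns 1 while `slot_widened` returns 2. For IDs that fit in a `u32` the two agree.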

@k82cn k82cn force-pushed the flm_400 branch 6 times, most recently from 1a5e5ac to 5b47043, on April 13, 2026 13:57
@k82cn k82cn force-pushed the flm_400 branch 13 times, most recently from 17c6766 to c3bf371, on April 14, 2026 03:43
@gitguardian

gitguardian Bot commented Apr 14, 2026

✅ There are no secrets present in this pull request anymore.

If these secrets were true positives and are still valid, we highly recommend that you revoke them.
Although these secrets were previously flagged, we no longer have a reference to the
specific commits where they were detected. Once a secret has been leaked into a git
repository, you should consider it compromised, even if it was deleted immediately.



@k82cn k82cn force-pushed the flm_400 branch 6 times, most recently from 66be9ed to 676c279, on April 15, 2026 05:38
- Add batch_index field to executor for tracking batch assignment
- Add derive_events_path_tests.rs with 12 tests covering sqlite://, fs://,
  and fallback paths including FLAME_TEST_DIR env var handling
- Add 8 comprehensive FsEventManager tests for multiple events, sessions,
  persistence, and FLAME_TEST_DIR integration
- Fix test isolation by using unique subdirectories per test to avoid
  shared events directory conflicts in parallel test runs
@k82cn k82cn merged commit 6612daf into xflops:main Apr 16, 2026
7 checks passed
@k82cn k82cn deleted the flm_400 branch April 16, 2026 02:03
@k82cn k82cn mentioned this pull request Apr 16, 2026
