-
Notifications
You must be signed in to change notification settings - Fork 3.1k
[rollout] feat: use rollout and validate parallel process #4863
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[rollout] feat: use rollout and validate parallel process #4863
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces parallel processing for validation in the FullyAsyncRollouter by using a ThreadPoolExecutor. This allows validation to run asynchronously, controlled by the parallel_validate_and_rollout configuration. The changes refactor the validation logic to support both synchronous and asynchronous execution paths. My review identifies a critical issue with unhandled exceptions in a fire-and-forget asyncio task, which could lead to silent failures. I have also pointed out several high-severity issues related to Python best practices, such as placing imports inside methods, accessing private attributes, using non-idiomatic boolean comparisons, and including non-English comments, all of which impact code quality and maintainability.
Co-authored-by: Shangwei-Li <lishangwei@mail.ustc.edu.cn>
|
|
||
| cpu_cores = multiprocessing.cpu_count() | ||
| # cpu case use cpu_cores; io case use cpu_cores*2 | ||
| self.validate_executor = ThreadPoolExecutor(max_workers=cpu_cores) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need more than 1 thread in this case ? Aren't we only making single validate call during validation ?
On high end GPU nodes there are > 100s of CPU core and this will essentially create 100s of unused threads and take memory 100*8MB.
| print("[FullyAsyncRollouter] validate_task is running, wait last validate_task to finish") | ||
| self.validate_task.get() | ||
| self.validate_task = asyncio.create_task( | ||
| self.do_validate_async(timing_raw, version, global_steps, use_trainer_do_validate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are only making call to validate method once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is subject to conditional restrictions, which is controlled by need_validate (consistent with the previous trigger conditions).
What does this PR do?
validate process and rollout process can be paralled in fully-async mode. This can reduce rollouter ilde and improve timing_s/gen speed when in short-long response_length case.
This is below profiler: (qwen2.5-Math-7b)
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
we can use
dapo_7b_math_fsdp2_8_8.sh, and add+async_training.parallel_validate_and_rollout=TruecommandAPI and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)recipesubmodule, please also update the reference to the submodule commit viagit submodule update --remoteorcd recipe && git pull origin main.