[Core] Reuse created spec tokens lists to mitigate GC cost #28917
```diff
@@ -892,7 +892,8 @@ def _update_states(self, scheduler_output: "SchedulerOutput") -> None:
             # conform to the schema. This can result in
             # scheduler_output.scheduled_spec_decode_tokens being empty,
             # even when speculative decoding is enabled.
-            self.input_batch.spec_token_ids[req_index] = spec_token_ids
+            self.input_batch.spec_token_ids[req_index].clear()
+            self.input_batch.spec_token_ids[req_index].extend(spec_token_ids)
```
Comment on lines +895 to +896
Member

I don't understand this one: how is it better to repopulate an existing list and then discard the other one? Surely it's better to avoid that extra work and just use the other one? Either way, a list is going to get garbage collected.
Collaborator (Author)

Great question! The magic is hidden behind `gc.freeze`. If we preallocate the `spec_token_ids` lists and then run `gc.freeze`, the preallocated lists will NOT be GC-tracked. The `spec_token_ids` list passed into this function is short-lived: in the majority of cases it dies via reference counting alone, so it never needs to be handled by `gc.collect`. Let me know if you want more clarification :)
Member

Thanks @Jialin. My general concern about these changes is that they sometimes come at the cost of added complexity and can negatively affect readability/maintainability. There is a balance where I think it's reasonable to give up minor perf benefits to keep the code simpler (weighing, case by case, the magnitude of the complexity against the magnitude of the perf benefit). This particular change, for example, I think is quite fragile, especially without comments explaining why we are expending extra cycles here to optimize the object lifecycle in relation to GC. It's very likely someone could change this in the future without being aware. I'm not sure whether it's realistic to enforce this (actually these lists will go away soon anyhow with MRV2).
Collaborator (Author)

@njhill Totally agree with you on the tradeoff. @zhuohan123 actually raised similar concerns on my other PRs as well, and I'm fully aligned with that. We had internal TPGS runs to justify the impact, but moving forward I should provide data that is openly accessible, to avoid this kind of confusion. I'll hold off on landing other GC changes on the OSS side until I have an e2e benchmark showing end-to-end wins (both latency and GC cost) to justify such changes.
```python
# there are no draft tokens with async scheduling,
# we clear the spec_decoding info in scheduler_output and
```