LiuXiaoxuanPKU
Collaborator

In the current implementation, even when running_queue_size >= speculative_disable_by_batch_size, execution still goes through the speculative decoding logic: get_spec_proposals with k=0, score_proposal, rejection sampling, and creating the sampler output. This flow adds overhead (especially rejection sampling), which makes running with speculative decoding disabled slower than running without speculative decoding at all.

To fix this, we reuse _run_no_spec so that the disabled case never touches the SD flow at all. This PR also adds a test to check correctness.
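The dispatch described above can be sketched as follows. This is a minimal, self-contained illustration and not the actual vLLM code: `ToyRequest`, `should_disable_speculation`, `execute`, and the two callables are hypothetical stand-ins, and the `num_spec_tokens == 0` check is an assumption about how the prefill case is detected.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyRequest:
    """Hypothetical stand-in for an execute-model request."""
    running_queue_size: int
    num_spec_tokens: int  # assumed: 0 means no speculative tokens (e.g. prefill)


def should_disable_speculation(running_queue_size: int,
                               disable_by_batch_size: Optional[int]) -> bool:
    """Return True when the running queue has reached the disable threshold."""
    if disable_by_batch_size is None:
        return False
    return running_queue_size >= disable_by_batch_size


def execute(req: ToyRequest,
            disable_by_batch_size: Optional[int],
            run_no_spec: Callable[[ToyRequest], List[str]],
            run_spec_step: Callable[[ToyRequest], List[str]]) -> List[str]:
    """Route to the plain path when speculation is off, skipping the
    proposal / scoring / rejection-sampling steps entirely."""
    disabled = should_disable_speculation(req.running_queue_size,
                                          disable_by_batch_size)
    if req.num_spec_tokens == 0 or disabled:
        return run_no_spec(req)
    return run_spec_step(req)


# Toy workers that only record which path was taken.
def toy_no_spec(req: ToyRequest) -> List[str]:
    return ["no_spec"]


def toy_spec_step(req: ToyRequest) -> List[str]:
    return ["propose", "score", "rejection_sample", "create_output"]


small_batch = execute(ToyRequest(2, 3), 4, toy_no_spec, toy_spec_step)
large_batch = execute(ToyRequest(8, 3), 4, toy_no_spec, toy_spec_step)
```

The point of the sketch is that the large batch takes the same path as a run without speculative decoding, so none of the SD steps are executed.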

Concretely, for a batch size of 8 with 128 output tokens and TP=4 on Llama3-70B, the batch latency is:

| Without SD | Disable SD | Disable SD after fix |
| --- | --- | --- |
| 5.4 s | 5.8 s | 5.5 s |

Here, "Without SD" means not using the SD flag at all. "Disable SD" means using the SD flag but setting speculative_disable_by_batch_size smaller than the batch size, so that speculative decoding is disabled.
After the fix, we are still slightly slower than the original case. This is caused by broadcasting control flow, which will be fixed in future PRs.
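To put the reported latencies in relative terms, the overhead over the no-SD baseline can be computed as below. The three latency values are taken from the measurements above; the variable and function names are only illustrative.

```python
# Batch latencies in seconds, as reported in the PR description.
baseline_s = 5.4          # without SD
disable_before_s = 5.8    # SD flag set, speculation disabled, before the fix
disable_after_s = 5.5     # SD flag set, speculation disabled, after the fix


def overhead_pct(latency_s: float, reference_s: float) -> float:
    """Relative slowdown of latency_s over reference_s, in percent."""
    return (latency_s - reference_s) / reference_s * 100.0


before_pct = overhead_pct(disable_before_s, baseline_s)
after_pct = overhead_pct(disable_after_s, baseline_s)

print(f"overhead before fix: {before_pct:.1f}%")
print(f"overhead after fix:  {after_pct:.1f}%")
```

With these numbers the overhead drops from about 7.4% to about 1.9%; the remainder is the part attributed to the broadcasting control flow.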

Collaborator

@comaniac left a comment

Thanks for the fix!

disable_all_speculation, execute_model_req.seq_group_metadata_list)

# If no spec tokens, call the proposer and scorer workers normally.
# Used for prefill.
Collaborator

Refine the comment to also mention the auto-disable case?

Collaborator

@cadedaniel left a comment

LGTM, let's update the comment before merging

"ngram_prompt_lookup_max": 3,
"speculative_disable_by_batch_size": 4
}])
@pytest.mark.parametrize("batch_size", [1, 2, 5, 8])
Collaborator

nit: suggest only [1, 5] to reduce test time
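As a quick check of this suggestion, the sketch below shows that [1, 5] still exercises both sides of the threshold of 4 used in the test config. It assumes the batch size stands in for the running queue size and that speculation is disabled when the size is >= the threshold, as stated in the PR description; the helper names are hypothetical.

```python
from typing import Iterable, Set

THRESHOLD = 4  # speculative_disable_by_batch_size from the test config


def spec_disabled(batch_size: int, threshold: int) -> bool:
    """Assumed rule: speculation is off once the batch reaches the threshold."""
    return batch_size >= threshold


def covered_branches(batch_sizes: Iterable[int], threshold: int) -> Set[bool]:
    """Set of branches hit: False = speculation on, True = speculation off."""
    return {spec_disabled(b, threshold) for b in batch_sizes}


full_sweep = covered_branches([1, 2, 5, 8], THRESHOLD)
reduced_sweep = covered_branches([1, 5], THRESHOLD)
```

Both sweeps hit the enabled and the disabled branch, so the reduced list keeps the same branch coverage with half the test cases.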

LiuXiaoxuanPKU added 2 commits May 23, 2024 19:16
@comaniac
Collaborator

Looks like the CI failure is unrelated and we should just merge this. cc @simon-mo
