🚀 The feature, motivation and pitch
Current Implementation Status
This branch contains the PARD implementation of speculative decoding for V0. However, it was written a few months ago and is not supported in V1.
In V1, vLLM's speculative decoding has no draft model support; attempting to use a draft model raises the following error:
NotImplementedError: Draft model speculative decoding is not supported yet. Please consider using other speculative decoding methods such as ngram, medusa, eagle, or mtp.
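For reference, a minimal sketch of the kind of configuration that hits this error (model names are placeholders, and the `speculative_config` dict shape is assumed from the current vLLM offline API, not taken from the original report):
```python
from vllm import LLM, SamplingParams

# Placeholder target/draft models; any draft-model speculative config
# currently raises the NotImplementedError above on the V1 engine.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",         # target model (placeholder)
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # draft model (placeholder)
        "num_speculative_tokens": 5,
    },
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```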
Other speculative decoding methods such as eagle and ngram also fail on CPU with the following assertion:
AssertionError: spec decode is not supported.
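A sketch of the CPU case (assumes a CPU-only build of vLLM; the model name is a placeholder and the ngram `speculative_config` keys are taken from the current vLLM docs, not from the original report):
```python
from vllm import LLM, SamplingParams

# On the CPU backend, even a drafter-free method like ngram currently
# trips the "spec decode is not supported" assertion shown above.
llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder model
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)
outputs = llm.generate(["The quick brown fox"], SamplingParams(max_tokens=32))
```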
I found a PR that adds draft model support, but it does not mention CPU support.
Feature Request
Enable draft-model-based speculative decoding on CPUs.
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.