[Feature]: Enabling draft model based speculative decoding for CPUs #28384

@ganeshr10

🚀 The feature, motivation and pitch

Current Implementation Status

This branch is the PARD implementation of speculative decoding for V0. However, it was written a few months ago and does not work with V1.

In V1, vLLM's speculative decoding has no draft model support. Configuring a draft model raises the following error:
NotImplementedError: Draft model speculative decoding is not supported yet. Please consider using other speculative decoding methods such as ngram, medusa, eagle, or mtp.

The other speculative decoding methods, such as eagle and ngram, also fail on CPU with the following assertion:
AssertionError: spec decode is not supported. (Permalink)

I found a PR that adds draft model support, but it does not mention support for CPUs.

Feature Request

Enable draft-model-based speculative decoding on CPUs.
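For readers unfamiliar with the technique, here is a minimal sketch of greedy draft-model speculative decoding in plain Python. It is not vLLM code and does not use any vLLM API: `speculative_decode`, `greedy_decode`, `target_model`, and `draft_model` are hypothetical names made up for illustration, and the "models" are toy functions over integer tokens. It only shows the propose/verify loop that the requested feature would need to run on CPU.

```python
from typing import Callable, List

Token = int
# A toy "model": maps the token context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def greedy_decode(target: NextToken, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: plain autoregressive decoding with the target model only."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    num_speculative_tokens: int,
    max_new_tokens: int,
) -> List[Token]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(num_speculative_tokens):
            proposal.append(draft(tokens + proposal))

        # 2. The target model checks each proposed position. A real engine
        #    would score all positions in one batched forward pass.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every proposal matched: the target contributes one bonus token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:limit]


# Toy models (assumptions for illustration only).
def target_model(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_model(ctx: List[Token]) -> Token:
    # Agrees with the target except after token 4, where it guesses wrong.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_decode(target_model, draft_model, [0], 3, 12))
    print(greedy_decode(target_model, [0], 12))
```

With greedy verification, the output matches target-only decoding token for token; the draft model only changes how many target steps can be batched together, which is where the speedup would come from.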

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
