# Fast LLM reasoning with DeepSeek-R1-Distill-Llama-8B and FastDraft

[DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf) is an open-source reasoning model developed by DeepSeek for tasks that require logical inference, mathematical problem-solving, and real-time decision-making. With DeepSeek-R1, you can follow the model's logic, which makes its output easier to understand and, if necessary, to challenge.
This capability gives reasoning models an edge in fields where outcomes need to be explainable, such as research or complex decision-making.

Distillation in AI creates smaller, more efficient models from larger ones, preserving much of their reasoning power while reducing computational demands. DeepSeek applied this technique to create a suite of distilled models from R1, based on the Qwen and Llama architectures.
This makes it possible to try DeepSeek-R1's capabilities locally on an ordinary laptop.

For running DeepSeek-R1 distilled models on your laptop with OpenVINO, see [this tutorial](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/deepseek-r1).

In this notebook we show how to speed up DeepSeek-R1 inference on your laptop with FastDraft, our latest publication.

## Intel Labs' FastDraft
FastDraft is a novel and efficient approach for pre-training and aligning a draft model to any LLM to be used with speculative decoding, by incorporating efficient pre-training followed by fine-tuning over synthetic datasets generated by the target model. 
FastDraft was presented in a [paper](https://arxiv.org/abs/2411.11055) at [ENLSP@NeurIPS24](https://neurips2024-enlsp.github.io/accepted_papers.html) by Intel Labs.
FastDraft pre-trained draft models achieve strong results on the key metrics of acceptance rate and block efficiency, with up to 3x memory-bound speedup on code completion and up to 2x on summarization, text completion, and instruction tasks. This unlocks large language model inference on AI PCs and other edge devices.
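
To make the two metrics concrete, here is a minimal sketch of how they could be computed from a log of verification steps. The definitions are assumptions rather than the exact ones used in the paper: acceptance rate is counted as accepted draft tokens over proposed draft tokens, and block efficiency as the average number of tokens emitted per target-model pass. The function name and the input format are hypothetical.

```python
from typing import Sequence


def draft_metrics(accepted_per_step: Sequence[int], gamma: int) -> tuple[float, float]:
    """Return (acceptance_rate, block_efficiency) for one generation run.

    accepted_per_step[i] is the number of draft tokens the target model
    accepted in verification step i (between 0 and gamma).
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if not accepted_per_step:
        raise ValueError("need at least one verification step")
    if any(a < 0 or a > gamma for a in accepted_per_step):
        raise ValueError("accepted counts must lie in [0, gamma]")

    steps = len(accepted_per_step)
    accepted = sum(accepted_per_step)
    # Share of proposed draft tokens that the target model kept.
    acceptance_rate = accepted / (steps * gamma)
    # Each step also yields one token from the target model itself,
    # so a step emits (accepted + 1) tokens, at most gamma + 1.
    block_efficiency = (accepted + steps) / steps
    return acceptance_rate, block_efficiency


# Hypothetical log: four verification steps with a block size of 3.
rate, efficiency = draft_metrics([3, 1, 0, 2], gamma=3)
print(f"acceptance rate: {rate:.2f}, block efficiency: {efficiency:.2f}")
```

With a block size of 3, the block efficiency is bounded by 4, so a value close to 4 would mean that almost every drafted token is kept.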

In this notebook we use the Llama-3.1 FastDraft model created as part of the project to accelerate [DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B). Because this model shares its vocabulary with the Llama-3.1 family, our Llama-3.1 FastDraft model is compatible with the DeepSeek-R1-Distill-Llama-8B model.
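
Speculative decoding exchanges token ids between the draft and the target model, so both have to map the same strings to the same ids. The sketch below illustrates such a compatibility check on plain dictionaries. The helper name and the toy vocabularies are hypothetical and only stand in for real tokenizer data.

```python
def vocab_mismatches(target_vocab: dict[str, int], draft_vocab: dict[str, int]) -> list[str]:
    """Return tokens whose id differs or that exist in only one vocabulary."""
    tokens = set(target_vocab) | set(draft_vocab)
    return sorted(t for t in tokens if target_vocab.get(t) != draft_vocab.get(t))


# Toy vocabularies standing in for real tokenizer vocabularies (hypothetical data).
target = {"<bos>": 0, "Hello": 1, "world": 2}
draft_ok = {"<bos>": 0, "Hello": 1, "world": 2}
draft_bad = {"<bos>": 0, "Hello": 2, "world": 1, "!": 3}

print(vocab_mismatches(target, draft_ok))   # no mismatches: the pair is compatible
print(vocab_mismatches(target, draft_bad))  # lists every token that disagrees
```

With Hugging Face tokenizers, dictionaries of this shape can be obtained from `tokenizer.get_vocab()`.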


## Prerequisites

> Note: we recommend running this notebook in a virtual environment. 

Install the required dependencies.

In [None]:
import os
import platform

os.environ["GIT_CLONE_PROTECTION_ACTIVE"] = "false"

%pip install -q -U --pre "openvino==2025.1.0.dev20250120" "openvino-tokenizers==2025.1.0.dev20250120" "openvino-genai==2025.1.0.dev20250120" --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu \
"optimum-intel==1.20.1" \
"optimum==1.23.3" \
"nncf==2.14.1" \
"torch>=2.1" \
"datasets" \
"accelerate" \
"gradio>=4.19" \
"transformers==4.46.3" \
"einops" "tiktoken"

if platform.system() == "Darwin":
    %pip install "numpy<2.0.0"

## Generation with OpenVINO GenAI
We first need to convert the model to OpenVINO format. After that, we are ready to run the model on your laptop.

In [None]:
from pathlib import Path

model_name = "DeepSeek-R1-Distill-Llama-8B"
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
model_dir = Path("./deepseek-r1-distill-llama-8b-int4-ov")
device = "GPU"

### Convert model using Optimum-CLI tool

🤗 [Optimum Intel](https://huggingface.co/docs/optimum/intel/index) is the interface between 🤗 [Transformers](https://huggingface.co/docs/transformers/index) and OpenVINO that accelerates end-to-end pipelines on Intel architectures. It provides an easy-to-use CLI for exporting models to the [OpenVINO Intermediate Representation (IR)](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format.

In [None]:
if not model_dir.exists():
    ! optimum-cli export openvino --model $model_id --task text-generation-with-past --weight-format int4 $model_dir

# convert OV tokenizer if needed
if not (model_dir / "openvino_tokenizer.xml").exists():
    ! convert_tokenizer $model_dir --with-detokenizer -o $model_dir

### Instantiate pipeline with OpenVINO Generate API

We will use the [OpenVINO Generate API](https://github.com/openvinotoolkit/openvino.genai/blob/master/src/README.md) to create pipelines that run inference with OpenVINO Runtime.

First, we create a pipeline with `LLMPipeline`, the main object used for text generation with LLMs in the OpenVINO GenAI API. You can construct it directly from the folder containing the converted model by providing the model directory and the target device. We then call the `generate` method and get the output as text.
Additionally, we can configure decoding parameters. We can create the default config with `ov_genai.GenerationConfig()`, set the parameters we need, and either apply the updated config with `set_generation_config(config)` or pass it directly to `generate()`.

In [None]:
import openvino_genai as ov_genai

from llm_pipeline_with_hf_tokenizer import LLMPipelineWithHFTokenizer


def streamer(subword):
    print(subword, end="", flush=True)
    return False


# Define scheduler
scheduler_config = ov_genai.SchedulerConfig()
scheduler_config.num_kv_blocks = 2048 // 16
scheduler_config.dynamic_split_fuse = False
scheduler_config.max_num_batched_tokens = 2048

pipe = LLMPipelineWithHFTokenizer(model_dir, device, scheduler_config=scheduler_config)

After instantiating the pipeline we are ready to generate with the model.

DeepSeek models work best when called with their custom chat template, so we pass the query to the model in chat format with the `apply_chat_template=True` argument.

To get a more accurate measurement of the generation time, we do a warmup step to let the model allocate memory and compile any kernels it needs to reach its full potential.
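
The warmup-then-measure pattern can be sketched independently of the model. In this minimal example, `timed_after_warmup` and `fake_generate` are hypothetical names: the stand-in workload pays a one-off setup cost on its first call, which mimics kernel compilation and memory allocation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def timed_after_warmup(warmup: Callable[[], object], run: Callable[[], T]) -> tuple[T, float]:
    """Call warmup() once untimed, then time a single call to run()."""
    warmup()
    start = time.perf_counter()
    result = run()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload (hypothetical): the first call pays a one-off setup cost.
state = {"ready": False}


def fake_generate() -> str:
    if not state["ready"]:
        time.sleep(0.05)  # simulates compilation and memory allocation
        state["ready"] = True
    return "done"


result, seconds = timed_after_warmup(fake_generate, fake_generate)
print(f"{result} in {seconds:.4f} s")
```

Because the setup cost is paid during the untimed warmup call, the measured time reflects only the steady-state call.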

In [None]:
import time


generation_config = ov_genai.GenerationConfig()

input_prompt = [{"role": "user", "content": "Which number is bigger 9.11 or 9.9?"}]

# First run a short warmup generation so that the time measurement does not include the warmup overhead.
generation_config.max_new_tokens = 8
pipe.generate(input_prompt, generation_config, apply_chat_template=True)

# Now we can measure the time and see the result
generation_config.max_new_tokens = 1024

start = time.perf_counter()
result = pipe.generate(input_prompt, generation_config, streamer, apply_chat_template=True)
ar_gen_time = time.perf_counter() - start

In [None]:
import gc

print(f"Generation took {ar_gen_time:.3f} seconds")
del pipe
gc.collect()

## Acceleration with FastDraft and speculative decoding
Speculative decoding is a lossless decoding paradigm introduced in a recent [ICML paper](https://arxiv.org/abs/2211.17192) for accelerating auto-regressive generation with LLMs.
The method aims to mitigate the inherent latency bottleneck caused by the sequential nature of auto-regressive generation.
Speculative decoding employs a draft language model to generate a block of $\gamma$ candidate tokens.
The LLM, referred to as the target model, then processes these candidate tokens in parallel.
The algorithm examines each token's probability distribution, calculated by both the target and draft models, to determine whether the token should be accepted or rejected.
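
The accept/reject rule can be illustrated with a small, self-contained sketch. It follows the speculative sampling rule as we understand it from the paper: a draft token is accepted with probability min(1, p/q), where p and q are the target and draft probabilities of that token, and a rejected token is replaced by a sample from the normalised residual max(0, p - q). The function name and the toy distributions are hypothetical, and the extra token that the target model samples when the whole block is accepted is left out for brevity. This is not the OpenVINO GenAI implementation.

```python
import random
from typing import Sequence


def verify_draft(
    draft_tokens: Sequence[int],
    draft_probs: Sequence[Sequence[float]],
    target_probs: Sequence[Sequence[float]],
    rng: random.Random,
) -> list[int]:
    """Return the tokens kept from one drafted block, plus one corrected
    token if a draft token is rejected.

    draft_probs[i] and target_probs[i] are the full next-token distributions
    of the draft and target model at position i of the block.
    """
    kept: list[int] = []
    for i, token in enumerate(draft_tokens):
        q = draft_probs[i][token]
        p = target_probs[i][token]
        # Accept the draft token with probability min(1, p / q).
        if rng.random() < min(1.0, p / q):
            kept.append(token)
            continue
        # Rejected: resample from the residual distribution max(0, p - q).
        residual = [max(0.0, pt - qt) for pt, qt in zip(target_probs[i], draft_probs[i])]
        kept.append(rng.choices(range(len(residual)), weights=residual)[0])
        break  # everything drafted after a rejection is discarded
    return kept


# Toy example over a 3-token vocabulary (hypothetical numbers).
draft_tokens = [0, 1]
draft_probs = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.1]]
target_probs = [[0.6, 0.2, 0.2], [0.1, 0.0, 0.9]]

print(verify_draft(draft_tokens, draft_probs, target_probs, random.Random(0)))
```

In this toy example the first token is always accepted because the target assigns it a higher probability than the draft, while the second is always rejected because the target assigns it zero probability.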

In this section we show how to accelerate generation with the DeepSeek model using our Llama-3.1 FastDraft model and the OpenVINO GenAI speculative decoding pipeline.

First we download the draft model, and then we initialize a generation pipeline.

In [None]:
import filecmp
import shutil

import huggingface_hub as hf_hub

draft_model_id = "OpenVINO/Llama-3.1-8B-Instruct-FastDraft-150M-int8-ov"
draft_model_path = Path("DeepSeek-R1-Llama-FastDraft-int8-ov")

if not draft_model_path.exists():
    hf_hub.snapshot_download(draft_model_id, local_dir=draft_model_path)

# We need tokenizers to match between the target and draft model so we apply this workaround
if not filecmp.cmp(str(model_dir / "openvino_tokenizer.xml"), str(draft_model_path / "openvino_tokenizer.xml"), shallow=False):
    for fname in ["openvino_tokenizer.xml", "openvino_tokenizer.bin", "openvino_detokenizer.xml", "openvino_detokenizer.bin"]:
        shutil.copy(model_dir / fname, draft_model_path / fname)

In [None]:
# Define schedulers
scheduler_config = ov_genai.SchedulerConfig()
scheduler_config.num_kv_blocks = 2048 // 16
scheduler_config.dynamic_split_fuse = False
scheduler_config.max_num_batched_tokens = 2048

draft_scheduler_config = ov_genai.SchedulerConfig()
draft_scheduler_config.num_kv_blocks = 2048 // 16
draft_scheduler_config.dynamic_split_fuse = False
draft_scheduler_config.max_num_batched_tokens = 2048

draft_model = ov_genai.draft_model(draft_model_path, device, scheduler_config=draft_scheduler_config)

pipe = LLMPipelineWithHFTokenizer(model_dir, device, scheduler_config=scheduler_config, draft_model=draft_model)

Now we are ready to generate with our speculative decoding pipeline. As in the previous section, we do a short warmup step before measuring the generation time.
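
The number of tokens the draft proposes per cycle (`num_assistant_tokens` in the next cell) trades draft cost against the chance of keeping the whole block. As a rough guide, and under the simplifying assumption used in the speculative decoding paper that each draft token is accepted independently with probability α, the expected number of tokens produced per target-model pass is (1 - α^(γ+1)) / (1 - α). The sketch below evaluates this formula. The function name and the acceptance rate of 0.7 are illustrative assumptions, not measurements from this notebook.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target pass for acceptance rate alpha
    and block size gamma, assuming independent acceptances."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is kept, plus one token from the target model.
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Illustrative acceptance rate (assumption): gains flatten as gamma grows.
for gamma in (1, 3, 5, 8):
    print(gamma, round(expected_tokens_per_step(0.7, gamma), 2))
```

The diminishing returns for larger block sizes are one reason a small value is a sensible starting point.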

In [None]:
# We need to define in the generation config how many tokens the draft should predict in each cycle
generation_config.num_assistant_tokens = 3

# Again we will do a short warmup before measuring time for the model
generation_config.max_new_tokens = 8
pipe.generate(input_prompt, generation_config, apply_chat_template=True)

# Now we can measure the time and see the result
generation_config.max_new_tokens = 1024

start = time.perf_counter()
result = pipe.generate(input_prompt, generation_config, streamer, apply_chat_template=True)
sd_gen_time = time.perf_counter() - start

In [None]:
print(f"Generation took {sd_gen_time:.3f} seconds")

In our FastDraft paper we measured a 1.5x speedup on average with our draft model after aligning it to the target model. In this case the draft model wasn't aligned to the target model; however, we still saw roughly the same speedup.
Let's calculate the speedup for the specific example we used:

In [None]:
print(f"End to end speedup with FastDraft and speculative decoding is {ar_gen_time / sd_gen_time:.2f}x")

## Evaluate Speculative Decoding Speedup On Multiple Examples

In this section we compare auto-regressive generation and speculative decoding generation with the DeepSeek-R1-Distill-Llama-8B model on multiple examples.
We use 40 example prompts taken from the [MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts) and [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) datasets.
We loop over these examples and measure generation times, first without speculative decoding and then with it. Finally, we compare the generation times of both methods and compute the average speedup.

### 1. Run target model without speculative decoding
As in the previous section, we first run generation without speculative decoding, but this time over 40 examples.

In [None]:
import gc
import json
import time

import openvino_genai as ov_genai
from tqdm import tqdm

from llm_pipeline_with_hf_tokenizer import LLMPipelineWithHFTokenizer

print(f"Loading model from {model_dir}")

# Define scheduler
scheduler_config = ov_genai.SchedulerConfig()
scheduler_config.num_kv_blocks = 2048 // 16
scheduler_config.dynamic_split_fuse = False
scheduler_config.max_num_batched_tokens = 2048

pipe = LLMPipelineWithHFTokenizer(model_dir, device, scheduler_config=scheduler_config)

generation_config = ov_genai.GenerationConfig()
generation_config.max_new_tokens = 1024

print("Loading prompts...")
# Read the prompts and wrap each one as a single-turn chat
with open("prompts.json", encoding="utf-8") as f:
    prompts = json.load(f)
prompts = [[{"role": "user", "content": p}] for p in prompts]

# Measure the end-to-end generation time of every prompt
times_auto_regressive = []
for prompt in tqdm(prompts):
    start_time = time.perf_counter()
    result = pipe.generate(prompt, generation_config, apply_chat_template=True)
    end_time = time.perf_counter()
    times_auto_regressive.append(end_time - start_time)
print("Done")

# Free the pipeline before loading the speculative decoding pipeline
del pipe
gc.collect()

### 2. Run target model with speculative decoding
Now we will run generation with speculative-decoding over the same 40 examples.

In [None]:
print(f"Loading draft from {draft_model_path}")

# Define scheduler for the draft

draft_scheduler_config = ov_genai.SchedulerConfig()
draft_scheduler_config.num_kv_blocks = 2048 // 16
draft_scheduler_config.dynamic_split_fuse = False
draft_scheduler_config.max_num_batched_tokens = 2048

draft_model = ov_genai.draft_model(draft_model_path, device, scheduler_config=draft_scheduler_config)

pipe = LLMPipelineWithHFTokenizer(model_dir, device, scheduler_config=scheduler_config, draft_model=draft_model)


generation_config = ov_genai.GenerationConfig()
generation_config.num_assistant_tokens = 3
# Use the same token budget as the auto-regressive run so the timings are comparable
generation_config.max_new_tokens = 1024

times_speculative_decoding = []

print("Running Speculative Decoding generation...")
for prompt in tqdm(prompts):
    start_time = time.perf_counter()
    result = pipe.generate(prompt, generation_config, apply_chat_template=True)
    end_time = time.perf_counter()
    times_speculative_decoding.append((end_time - start_time))
print("Done")

### 3. Calculate speedup


In [None]:
avg_speedup = sum([x / y for x, y in zip(times_auto_regressive, times_speculative_decoding)]) / len(prompts)
print(f"average speedup: {avg_speedup:.2f}")

We see that by using speculative decoding with FastDraft we can accelerate DeepSeek-R1-Distill-Llama-8B generation by ~1.5x on average.
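
The cell above averages the per-prompt speedup ratios, which gives every prompt the same weight. An alternative is the ratio of total times, which weights long generations more heavily. The sketch below reports both together with the spread. The helper name and the timings are hypothetical and only illustrate how the two aggregates can differ.

```python
from statistics import mean
from typing import Sequence


def speedup_summary(baseline_times: Sequence[float], accelerated_times: Sequence[float]) -> dict[str, float]:
    """Summarise speedups of accelerated_times relative to baseline_times."""
    if not baseline_times or len(baseline_times) != len(accelerated_times):
        raise ValueError("need two non-empty timing lists of equal length")
    ratios = [b / a for b, a in zip(baseline_times, accelerated_times)]
    return {
        "mean_of_ratios": mean(ratios),  # every prompt counts equally
        "ratio_of_totals": sum(baseline_times) / sum(accelerated_times),  # long runs dominate
        "min": min(ratios),
        "max": max(ratios),
    }


# Hypothetical timings in seconds: one long prompt that speeds up 2x, one short prompt that does not.
summary = speedup_summary([10.0, 2.0], [5.0, 2.0])
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In the notebook, the same function could be called with `times_auto_regressive` and `times_speculative_decoding`.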

## Run Chatbot

Now that the pipeline is created, we can set up a chatbot interface using [Gradio](https://www.gradio.app/).

In [None]:
from gradio_helper import make_demo

stop_strings = ["<｜end▁of▁sentence｜>", "<｜User｜>", "</User|>", "<|User|>", "<|end_of_sentence|>", "</｜"]
demo = make_demo(pipe, stop_strings, model_name)

In [None]:
demo.launch(inline=False, inbrowser=True)

In [None]:
# demo.close()