In [None]:
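# TPOT in milliseconds, measured on Verda.
# These values differ from the ones of the LRZ benchmark; the cause is not yet clear.
tpot_1_7b = 2.27  # drafter (Qwen3-1.7B), run as the main model
tpot_32b = 23.65  # 32B target, run without SD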

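What kind of speedup should we observe from speculative decoding (SD)?
For $K=4$, each SD step runs the drafter 4 times and the target once.
Because some drafted tokens are rejected, each step yields on average a number of tokens equal to the acceptance length (AL).
With $T_{d}$ and $T_{t}$ the TPOT of the drafter and the target, and ITL the duration of one SD step, the SD TPOT $T_{s}$ is: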

$$T_{s} = \frac{K \cdot T_{d} + T_{t}}{AL} = \frac{ITL}{AL} $$
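As a minimal sketch of the formula above, the same model also gives the acceptance length at which SD stops paying off: SD beats plain decoding only when $AL > (K \cdot T_{d} + T_{t}) / T_{t}$. The function names `sd_tpot_ms` and `break_even_al` are hypothetical helpers, not part of this notebook; the numbers are the Verda measurements used here.

```python
def sd_tpot_ms(k: int, t_draft_ms: float, t_target_ms: float, al: float) -> float:
    """Expected SD time per output token: (K * T_d + T_t) / AL."""
    return (k * t_draft_ms + t_target_ms) / al


def break_even_al(k: int, t_draft_ms: float, t_target_ms: float) -> float:
    """Acceptance length above which SD is faster than the target alone."""
    return (k * t_draft_ms + t_target_ms) / t_target_ms


t_s = sd_tpot_ms(k=4, t_draft_ms=2.27, t_target_ms=23.65, al=3.28)
al_min = break_even_al(k=4, t_draft_ms=2.27, t_target_ms=23.65)
print(f"expected SD TPOT: {t_s:.2f} ms, break-even AL: {al_min:.2f}")
```

With the measured AL of 3.28 well above the break-even value, the model predicts a clear gain at $K=4$.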

In [None]:
k = 4
expected_sd_itl = k * tpot_1_7b + tpot_32b
al = 3.28
expected_sd_tpot = expected_sd_itl / al
expected_sd_tpot

However, we observe a higher (slower) TPOT than expected. How slow is the drafter actually running? Solving the formula above for $T_{d}$:


$$T_{d} = \frac{T_{s} \cdot AL - T_{t}}{K}$$
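One way to sanity-check this rearrangement is to round-trip it: solving for $T_{d}$ and substituting the result back into the forward formula must return the observed $T_{s}$. The sketch below uses hypothetical helper names and the SD TPOT of 10.46 ms observed in this notebook.

```python
import math


def sd_tpot_ms(k, t_draft_ms, t_target_ms, al):
    """Forward model: T_s = (K * T_d + T_t) / AL."""
    return (k * t_draft_ms + t_target_ms) / al


def implied_draft_tpot_ms(k, t_sd_ms, t_target_ms, al):
    """Inverse model: T_d = (T_s * AL - T_t) / K."""
    return (t_sd_ms * al - t_target_ms) / k


t_d = implied_draft_tpot_ms(k=4, t_sd_ms=10.46, t_target_ms=23.65, al=3.28)
t_s = sd_tpot_ms(k=4, t_draft_ms=t_d, t_target_ms=23.65, al=3.28)

# The round trip must reproduce the observed SD TPOT.
assert math.isclose(t_s, 10.46)
print(f"implied drafter TPOT: {t_d:.2f} ms")
```

Note that the round trip holds by construction, so it validates the algebra only, not the assumption that the target TPOT is unchanged under SD.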

In [None]:
actual_sd_tpot = 10.46
implied_tpot_1_7b = (actual_sd_tpot * al - tpot_32b) / k
implied_tpot_1_7b

In [None]:
implied_itl = k * implied_tpot_1_7b + tpot_32b
implied_itl

The implied SD ITL (about 34.31 ms) closely matches the actual SD ITL (34.34 ms), which supports the assumption that the TPOT of the target is unchanged under SD.

What are the current speedups?
We compare TPOT in SD to non-SD.

In [None]:
actual_speedup = tpot_32b / actual_sd_tpot
actual_speedup

What should the speedup be?

In [None]:
expected_speedup = tpot_32b / expected_sd_tpot
expected_speedup

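Would $K=4$ still be optimal?
The analysis below shows that the optimal $K$ (the one that minimizes TPOT) is 5 with the expected drafter latency, while it stays at 4 with the drafter latency implied by the current measurements.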

In [None]:
import torch

# acceptance lengths measured for k=1 to k=8
k1 = 1.80
k2 = 2.40
k3 = 2.87
k4 = 3.28
k5 = 3.53
k6 = 3.70
k7 = 3.90
k8 = 4.04
acceptance_lens = torch.tensor([k1, k2, k3, k4, k5, k6, k7, k8])

In [None]:
import torch


def sd_tpot(k, tpot_1_7b):
    """Expected SD TPOT (ms) for k speculative tokens and a given drafter TPOT."""
    itl = tpot_1_7b * k + tpot_32b
    # acceptance_lens starts at k=1, so k maps to index k - 1
    al = acceptance_lens[k - 1]
    return itl / al


sd_tpot(4, tpot_1_7b=tpot_1_7b), torch.tensor(expected_sd_tpot)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import os

sns.set_context("talk")

k_values = torch.arange(1, 9)
expected_tpot_values = torch.tensor([sd_tpot(k, tpot_1_7b=tpot_1_7b) for k in k_values])
actual_tpot_values = torch.tensor(
    [sd_tpot(k, tpot_1_7b=implied_tpot_1_7b) for k in k_values]
)

plt.plot(k_values, expected_tpot_values, label="Expected", marker="o")
plt.plot(
    k_values[expected_tpot_values.argmin()],
    expected_tpot_values.min(),
    marker="x",
    color="blue",
    markersize=15,
)
plt.plot(k_values, actual_tpot_values, label="Current", marker="o")
plt.plot(
    k_values[actual_tpot_values.argmin()],
    actual_tpot_values.min(),
    marker="x",
    color="red",
    markersize=15,
)
plt.xlabel("Number of Speculative Tokens (K)")
plt.ylabel("SD TPOT (ms)")

plt.tight_layout()
plt.grid(alpha=0.5)
plt.legend()

os.makedirs("imgs", exist_ok=True)
plt.savefig("imgs/tpot_k_comparison.png", dpi=300, bbox_inches="tight")
plt.show()


In [None]:
best_expected_tpot_sd = expected_tpot_values.min()
best_expected_speedup = tpot_32b / best_expected_tpot_sd
best_expected_speedup

In [None]:
best_expected_speedup / actual_speedup

In [None]:
implied_tpot_1_7b / tpot_1_7b

* The draft model (Qwen3-1.7B) decodes faster when it is the main model than when it drafts in SD (TPOT of 2.27 ms vs. 2.66 ms).
* Implementing full CUDA graphs for the draft model would speed up the drafter by about 17% (2.66 / 2.27 ≈ 1.17).
* However, since the drafter makes up only a fraction of the total runtime, the overall TPOT improvement would be closer to 5 to 6%.
* With the lower drafter overhead, the optimal $K$ would move from 4 to 5, though the difference is small.
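
The percentages in the list above can be reproduced with a back-of-the-envelope calculation. This is only a sketch: it assumes that full CUDA graphs would bring the drafter back to its standalone TPOT of 2.27 ms and that the acceptance lengths stay unchanged. Variable names are illustrative.

```python
t_target = 23.65         # ms, target TPOT without SD
t_draft_alone = 2.27     # ms, drafter TPOT when run as the main model
t_draft_in_sd = 2.66     # ms, drafter TPOT implied by the SD measurements
al = {4: 3.28, 5: 3.53}  # measured acceptance lengths for K=4 and K=5


def sd_tpot(k, t_draft):
    """SD TPOT model: (K * T_d + T_t) / AL."""
    return (k * t_draft + t_target) / al[k]


# How much faster the drafter would run if it matched its standalone TPOT.
drafter_speedup = t_draft_in_sd / t_draft_alone
# Fraction of one SD step (K=4) currently spent in the drafter.
drafter_share = 4 * t_draft_in_sd / (4 * t_draft_in_sd + t_target)
# End-to-end TPOT gain relative to today's K=4 setup.
gain_k4 = sd_tpot(4, t_draft_in_sd) / sd_tpot(4, t_draft_alone)
gain_k5 = sd_tpot(4, t_draft_in_sd) / sd_tpot(5, t_draft_alone)

print(f"drafter speedup:      {drafter_speedup:.2f}x")
print(f"drafter share of ITL: {drafter_share:.0%}")
print(f"TPOT gain at K=4:     {gain_k4 - 1:.1%}")
print(f"TPOT gain at K=5:     {gain_k5 - 1:.1%}")
```

Because the drafter accounts for roughly a third of each SD step, a 17% faster drafter translates into only about a 5% lower TPOT, and switching to $K=5$ adds well under one percentage point on top.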