**Key notes:**
1. Use the T4 GPU on Colab, which is **free** and efficient enough to run this demo (a CPU also works, but it is slow).
2. The free Colab T4 GPU supports up to 6,000 tokens for DTS inference, limited by its 16 GB of GPU memory. For longer generations, please use a local GPU or upgrade your Colab plan.
3. You can freely adjust all settings and view the results on this site, but your changes will not be saved here.


**Import DTS environment:**

In [None]:
!git clone https://github.com/ZichengXu/Decoding-Tree-Sketching.git
!pip install transformers==4.54.0
%cd Decoding-Tree-Sketching

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, set_seed

import sys, os
sys.path.append(os.path.abspath('..'))

from decoding_tree_sketching.kvbatch_decoder import KVBatchEGDT
from decoding_tree_sketching.utils.eval_utils import extract_answer_llm


/home/zxu161/Decoding-Tree-Sketching


**Load model _deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B_:**

In [None]:
# print(torch.cuda.is_available())
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

# Decoding hyperparameters
DECODE_CONFIG = {
    "entropy_threshold": 2,
    "branch_top_k": 3,
    "max_active_hyps": 12,
    "max_new_tokens": 5000,
    "temperature": 0.6,
}
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # Use the GPU when one is available; otherwise fall back to the CPU.
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype="auto",
    trust_remote_code=True,
)
streamer = TextStreamer(tokenizer)
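
The `entropy_threshold` and `branch_top_k` settings suggest an entropy-gated branching rule: when the next-token distribution is uncertain, the decoder explores several candidates instead of one. The block below is a minimal sketch of that idea in plain Python, not the `KVBatchEGDT` implementation. The function names are hypothetical, entropy is assumed to be measured in nats, the low-entropy case is assumed to take the single most likely token, and the `max_active_hyps` cap is not modelled.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats (assumption: natural logarithm)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_next_tokens(logits, entropy_threshold, branch_top_k, temperature=1.0):
    """Return (token_ids, entropy) for one decoding step.

    Hypothetical rule: branch into the top-k tokens when the entropy
    exceeds the threshold, otherwise continue with the top-1 token.
    """
    probs = softmax(logits, temperature)
    h = entropy(probs)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if h > entropy_threshold:
        return ranked[:branch_top_k], h
    return ranked[:1], h


# A flat distribution over 16 tokens is uncertain, so the rule branches.
flat_ids, flat_h = choose_next_tokens([0.0] * 16, entropy_threshold=2, branch_top_k=3)

# A sharply peaked distribution is confident, so the rule does not branch.
peaked_ids, peaked_h = choose_next_tokens(
    [10.0] + [0.0] * 15, entropy_threshold=2, branch_top_k=3, temperature=0.6
)
```

Under these assumptions, a lower `entropy_threshold` makes branching more frequent, and a larger `branch_top_k` widens each branch point.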

**Build the prompt with the chat template:**

In [None]:
examples = [
  "Six points $A, B, C, D, E,$ and $F$ lie in a straight line in that order. Suppose that $G$ is a point not on the line and that $AC=26, BD=22, CE=31, DF=33, AF=73, CG=40,$ and $DG=30.$ Find the area of $\\triangle BGE.$",
]
groundtruths = [
    "468",
]
reasoning_tail = r" Please reason step by step, and put your final answer within \boxed{}."
seed = 1

set_seed(seed)
ques_idx = 0
example = examples[ques_idx]
groundtruth = groundtruths[ques_idx]
full_prompt = example + reasoning_tail
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": full_prompt}],
    tokenize=False,
    add_generation_prompt=True
)
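
Because `tokenize=False`, `apply_chat_template` returns a plain string rather than token IDs. Judging from the streamed output of the standard-inference cell, that string wraps the question in the model's special markers and ends with the assistant turn opened. The sketch below rebuilds such a string by hand, for illustration only: `build_prompt` is a hypothetical helper, and the exact markers and whitespace are assumptions read off that output, not the tokenizer's actual template.

```python
# Marker strings copied from the streamed output; treat them as assumptions.
BOS = "<｜begin▁of▁sentence｜>"
USER = "<｜User｜>"
ASSISTANT = "<｜Assistant｜>"


def build_prompt(user_message, add_generation_prompt=True):
    """Hand-rolled approximation of the chat-templated prompt string."""
    prompt = BOS + USER + user_message
    if add_generation_prompt:
        # Open the assistant turn so the model starts generating its reasoning.
        prompt += ASSISTANT + "<think>\n"
    return prompt


demo_prompt = build_prompt("What is 2 + 2?")
```

The streamed output shows the beginning-of-sentence marker twice. That is consistent with the tokenizer prepending its own marker to a string that already contains one; passing `add_special_tokens=False` when tokenizing the templated text is one common way to avoid this, although its effect on this demo has not been checked here.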

**Standard inference:**

In [None]:
# Standard inference
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=DECODE_CONFIG["max_new_tokens"],
    do_sample=True,
    temperature=DECODE_CONFIG["temperature"],
    streamer=streamer,
)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


<｜begin▁of▁sentence｜><｜begin▁of▁sentence｜><｜User｜>Six points $A, B, C, D, E,$ and $F$ lie in a straight line in that order. Suppose that $G$ is a point not on the line and that $AC=26, BD=22, CE=31, DF=33, AF=73, CG=40,$ and $DG=30.$ Find the area of $\triangle BGE.$ Please reason step by step, and put your final answer within \boxed{}.<｜Assistant｜><think>


Okay, so I have this problem where there are six points, A, B, C, D, E, and F, lying on a straight line in that order. There's another point G not on the line, and we're given several distances: AC is 26, BD is 22, CE is 31, DF is 33, AF is 73, CG is 40, and DG is 30. I need to find the area of triangle BGE. Hmm, okay, let me try to visualize this.

First, since all the points are on a straight line, I can imagine them arranged in the order A, B, C, D, E, F. So, from left to right, we have A, then B, then C, then D, then E, then F. Point G is somewhere off the line. So, triangle BGE is formed by points B, G, and E.

I think the first step is to figure out the positions of each point on the line. Maybe I can assign coordinates to each point to make it easier. Let me set a coordinate system where the line is the x-axis. Let's let point A be at coordinate 0. Then, since the points are in order A, B, C, D, E, F, I can assign coordinates to B, C, D, E, F as some increasing values.

Let me d

**Parse the final answer of standard inference:**

In [None]:
# Count only the newly generated tokens, excluding the prompt.
prompt_len = inputs["input_ids"].shape[1]
num_new_tokens = out[0].shape[0] - prompt_len
gen_text = tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True)
# print(gen_text)
ans = extract_answer_llm(gen_text)
print(f"Groundtruth = {groundtruth}, Regular decoding output = {ans}, Number of tokens = {num_new_tokens}")

Groundtruth = 468, Regular decoding output = 91, Number of tokens = 5000
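
The prompt asks the model to put its final answer within `\boxed{}`, so answer parsing comes down to finding that marker in the generated text. The block below is a minimal sketch of one way to do this with brace matching. `extract_boxed` is a hypothetical helper written for illustration; it is not the repository's `extract_answer_llm`, whose behavior may differ.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None.

    Braces are matched by depth so that nested LaTeX such as
    \\boxed{\\frac{1}{2}} is returned whole.
    """
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # the model never produced a boxed answer
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # generation was cut off before the closing brace


boxed_simple = extract_boxed(r"So the area is \boxed{468}.")
boxed_nested = extract_boxed(r"Thus \boxed{\frac{1}{2}} is the answer.")
boxed_missing = extract_boxed("The token budget ran out before an answer.")
```

A run that hits `max_new_tokens` before closing its reasoning may contain no boxed answer at all, which is why the sketch returns `None` rather than guessing.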


**DTS inference (DTS can stop early, before reaching the maximum token limit; for example, after about 3,000 of the 5,000 allowed tokens):**

In [None]:
kvegdt = KVBatchEGDT(model, tokenizer, seed=seed)
dts_out = kvegdt.generate(
    text,
    entropy_threshold=DECODE_CONFIG["entropy_threshold"],
    branch_top_k=DECODE_CONFIG["branch_top_k"],
    max_active_hyps=DECODE_CONFIG["max_active_hyps"],
    max_new_tokens=DECODE_CONFIG["max_new_tokens"],
    temperature=DECODE_CONFIG["temperature"],
)

Decoding:   0%|          | 0/5000 [00:00<?, ?it/s]`cache.key_cache[idx]` is deprecated and will be removed in v4.56.0. Use `cache.layers[idx].keys` instead.
`cache.value_cache[idx]` is deprecated and will be removed in v4.56.0. Use `cache.layers[idx].values` instead.


Decoding:  59%|█████▉    | 2942/5000 [01:10<00:49, 41.71it/s]


**Parse the final answer of DTS inference:**

In [None]:
# print(f"*** MODEL OUTPUT ***\n{dts_out['text']}")
# Print generation statistics such as steps, branch events, and sequence length
print(f"\n*** GENERATION STATS ***\n{dts_out['stats']}")
dts_ans = extract_answer_llm(dts_out['text'])
print(f"Groundtruth = {groundtruth}, DTS output = {dts_ans}, Number of tokens = {dts_out['stats']['generated_len']}")


*** GENERATION STATS ***
{'num_steps': 2943, 'num_finished': 1, 'generated_len': 2942, 'mean_entropy_best': 0.157777969086345, 'branch_events': 96, 'total_branches_created': 192, 'max_active_batch': 12, 'unfinished_left': 11}
Groundtruth = 468, DTS output = 468., Number of tokens = 2942
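
The DTS answer is printed as `468.` with a trailing period, so a strict string comparison against the ground truth `468` would fail. The sketch below shows one way to normalize answers before comparing them and to quantify the token savings from the two runs above. The helper names are hypothetical, and the normalization rules (strip whitespace, a trailing period, and thousands separators) are assumptions rather than the repository's evaluation logic.

```python
def normalize_answer(ans):
    """Strip whitespace, trailing periods, and thousands separators."""
    if ans is None:
        return ""
    return str(ans).strip().rstrip(".").replace(",", "").strip()


def is_correct(prediction, groundtruth):
    """Compare two answers after normalization."""
    return normalize_answer(prediction) == normalize_answer(groundtruth)


def token_reduction(baseline_tokens, candidate_tokens):
    """Fraction of tokens saved relative to the baseline run."""
    return (baseline_tokens - candidate_tokens) / baseline_tokens


# Values taken from the two runs shown in this notebook.
standard_ok = is_correct("91", "468")
dts_ok = is_correct("468.", "468")
saving = token_reduction(5000, 2942)
print(f"standard correct: {standard_ok}, DTS correct: {dts_ok}, tokens saved: {saving:.1%}")
```

Since the standard run stopped at the 5,000-token cap rather than finishing on its own, the saving computed this way is a lower bound on the difference between the two runs.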
