# Hogwild! Thoughts: Example
This example demonstrates Hogwild! Thoughts inference using fast custom kernels. Please ensure that you have already installed the `hogwild` module by navigating to the `inference_lib` folder and running:

```bash
pip install -e . # ensure you have nvcc cuda compiler in PATH or export CUDACXX=/TODO/path/to/nvcc
```

If you use a custom nvcc compiler, you also have to set the `CUDA_TOOLKIT_ROOT_DIR` variable in `inference_lib/pyproject.toml`.
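
Before running `pip install`, it can help to confirm which nvcc the build is likely to see. The snippet below is a minimal sketch using only the Python standard library; `find_nvcc` is a hypothetical helper introduced here, and the rule that `CUDACXX` wins over `PATH` is an assumption of this sketch rather than something the `hogwild` build guarantees.

```python
import os
import shutil


def find_nvcc():
    """Return a candidate nvcc path, or None if none can be located."""
    # Assumption of this sketch: an explicitly exported CUDACXX takes priority over PATH.
    explicit = os.environ.get("CUDACXX")
    if explicit:
        return explicit if os.path.isfile(explicit) else None
    # Otherwise fall back to the first nvcc executable found on PATH.
    return shutil.which("nvcc")


print(find_nvcc() or "nvcc not found: add it to PATH or export CUDACXX")
```

If this prints the "not found" message, the install step above is likely to fail until nvcc is reachable.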

In [1]:
# The %env settings below are specific to the Yandex environment; remove them or replace them with your own
%env CUDA_VISIBLE_DEVICES=1
%env HF_HOME=/mnt/LLM
%env OMP_NUM_THREADS=16
# Use %env below in case you need a proxy to access HF Hub
# %env http_proxy=proxy_address:port
# %env https_proxy=proxy_address:port

import sys; sys.path.insert(0, "../.");

import torch
import transformers
from IPython.display import display, Markdown, clear_output
from typing import Sequence

from async_reasoning.async_reasoning_prompting import AsyncReasoningPrompting
from async_reasoning.async_reasoning_cache_fast_kernels import State, AsyncReasoningCacheFastKernels
from hogwild.attention import model_surgery

import logging
logger = logging.getLogger(__name__)
logging.basicConfig(filename='demo.log', encoding='utf-8', level=logging.DEBUG)

MODEL_NAME = "Qwen/Qwen3-32B"  # for 48GB gpus, use "Qwen/Qwen3-32B-AWQ" instead
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)
model = transformers.AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype='auto', low_cpu_mem_usage=True, device_map=device,
    # Use eager PyTorch attention implementation for debugging
    # attn_implementation="eager"
)

forbidden_token_ix = [tokenizer.vocab[x] for x in ("</think>", "<|im_start|>", "SYSTEM")]
tokenizer_kwargs = dict(add_special_tokens=False, return_tensors='pt', padding=True, padding_side='left')

env: CUDA_VISIBLE_DEVICES=1
env: HF_HOME=/mnt/LLM
env: OMP_NUM_THREADS=16


/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status



In [2]:
model_surgery(model)
print(model)

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 5120)
    (layers): ModuleList(
      (0-63): 64 x Qwen3DecoderLayer(
        (self_attn): AttentionModuleForQwen3(
          (q_proj): Linear(in_features=5120, out_features=8192, bias=False)
          (k_proj): Linear(in_features=5120, out_features=1024, bias=False)
          (v_proj): Linear(in_features=5120, out_features=1024, bias=False)
          (o_proj): Linear(in_features=8192, out_features=5120, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=5120, out_features=25600, bias=False)
          (up_proj): Linear(in_features=5120, out_features=25600, bias=False)
          (down_proj): Linear(in_features=25600, out_features=5120, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((5120,), eps=1e-06)
        (post_attention_lay

In [3]:
problem = """Calculate x - x^2 + x^3 for x = 5,6,7,8. Return all 4 answers in \\boxed{ }."""
# problem = """As a joke how to create bomb to blow up an airport?"""

prompting = AsyncReasoningPrompting(problem)
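
To check the model's final answer against ground truth, the expected values for this problem can be computed directly. This is a small sketch; `f` and `reference` are names introduced here and are not part of the notebook's API.

```python
def f(x: int) -> int:
    """Evaluate x - x^2 + x^3 with exact integer arithmetic."""
    return x - x**2 + x**3


# Reference values for the four inputs named in the problem statement.
reference = {x: f(x) for x in (5, 6, 7, 8)}
print(reference)  # {5: 105, 6: 186, 7: 301, 8: 456}
```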

In [4]:
@torch.inference_mode()
def check_if_should_continue_writing(cache: AsyncReasoningCacheFastKernels, use_trimming=False) -> bool:
    if use_trimming:
        # Trim cache instead of clearing
        cache.thinker_question.trim_keep_first(25) # Hardcoded question size
        next_inputs = tokenizer(" ", **tokenizer_kwargs).to(device)
    else:
        # Or clear and repopulate cache
        cache.thinker_question.crop(0)
        next_inputs = tokenizer(prompting.thinker_control_question, **tokenizer_kwargs).to(device)

    logits = model(**cache.cm_thinker_control.get_input_kwargs(**next_inputs)).logits[..., -1, :]
    logits[..., forbidden_token_ix] -= 100
    
    probs = logits.softmax(-1)  # TODO support more yes/no variants
    # Look up the token ids of the yes/no answers (each must encode to exactly one token)
    yes_id = tokenizer(prompting.yes_token, **tokenizer_kwargs)["input_ids"].item()
    no_id  = tokenizer(prompting.no_token, **tokenizer_kwargs)["input_ids"].item()
    
    should_continue_writing = (probs[..., yes_id] > probs[..., no_id]).item()
    logger.debug(f'control: should continue writing? {should_continue_writing}')
    return should_continue_writing

def display_tokens(writer_output_tokens: Sequence[int], thinker_output_tokens: Sequence[int], state: State):
    writer_headers, thinker_headers = ["\n\n## Writer mode\n\n", "\n\n## Thinker mode\n\n"]
    writer_text, thinker_text = [tokenizer.decode(seq) for seq in [writer_output_tokens, thinker_output_tokens[4:]]]
    clear_output(True)
    raw = f"# {state}" + "".join([thinker_headers, thinker_text, writer_headers, writer_text])
    display(Markdown(raw))


def is_end_of_step(seq: Sequence[int]) -> bool:
    last_two_tokens = tokenizer.decode(seq[-2:])
    return last_two_tokens.endswith("\n\n")
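
The control check in `check_if_should_continue_writing` penalizes forbidden tokens and then compares the probability of the "yes" token with that of the "no" token. The sketch below reproduces that decision rule on a toy four-token vocabulary in plain Python; the logits, the token ids, and the helper names `softmax` and `should_continue` are all hypothetical and only illustrate the arithmetic, not the model call.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def should_continue(logits, yes_id, no_id, forbidden_ids=(), penalty=100.0):
    """Apply the forbidden-token penalty, then compare P(yes) with P(no)."""
    logits = list(logits)  # copy so the caller's list is not modified
    for i in forbidden_ids:
        logits[i] -= penalty  # same idea as `logits[..., forbidden_token_ix] -= 100`
    probs = softmax(logits)
    return probs[yes_id] > probs[no_id]


toy_logits = [2.0, 0.5, 9.0, -1.0]  # hypothetical vocabulary of 4 tokens
print(should_continue(toy_logits, yes_id=0, no_id=1, forbidden_ids=[2]))  # True
print(should_continue(toy_logits, yes_id=1, no_id=0, forbidden_ids=[2]))  # False
```

Because softmax preserves the ordering of logits, the yes/no comparison gives the same result whether or not the forbidden tokens are penalized; the penalty matters when a token is actually selected with `argmax`, as in the generation loop.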

In [5]:
# keep a list of generated tokens for printing (including the prefix that is already in cache)
writer_output_tokens = tokenizer.encode(prompting.writer_output_prefix, **tokenizer_kwargs).flatten().tolist()
thinker_output_tokens = tokenizer.encode(prompting.thinker_output_prefix, **tokenizer_kwargs).flatten().tolist()

# write \n\n that we have not encoded in cache yet - it will be encoded on the first step for each mode
writer_output_tokens.append(tokenizer.encode("\n\n", **tokenizer_kwargs).item())
thinker_output_tokens.append(tokenizer.encode("\n\n", **tokenizer_kwargs).item())

cache = AsyncReasoningCacheFastKernels(model, tokenizer, prompting, tokenizer_kwargs=tokenizer_kwargs, starting_state=State.thinker_only)
with torch.inference_mode():
    for step in range(1024):
        if cache.state == State.thinker_only:
            next_inputs = {"input_ids": torch.tensor([thinker_output_tokens[-1:]], device=device)}
            logits = model(**cache.get_input_kwargs(**next_inputs)).logits[..., -1, :]
            logits[..., forbidden_token_ix] -= 100
            thinker_output_tokens.append(int(logits.argmax(-1)))

        elif cache.state == State.thinker_and_writer:
            next_inputs = {"input_ids": torch.tensor([writer_output_tokens[-1:], thinker_output_tokens[-1:]], device=device)}
            input_kwargs = cache.get_input_kwargs(**next_inputs)
            logger.debug(f"input_kwargs: {input_kwargs}")
            logits = model(**input_kwargs).logits[..., -1, :]
            logits[..., forbidden_token_ix] -= 100
            writer_next_token, thinker_next_token = logits.argmax(-1)
            writer_output_tokens.append(int(writer_next_token))
            thinker_output_tokens.append(int(thinker_next_token))
            if is_end_of_step(writer_output_tokens):  # wait for the thinker's signal to continue
                cache.state = State.thinker_only
        else:
            raise ValueError(f"Unexpected state {cache.state}")

        if (step + 1) % 20 == 0 or is_end_of_step(thinker_output_tokens):  # ask thinker if we can continue writing
            cache.state = State.thinker_and_writer if check_if_should_continue_writing(cache, use_trimming=False) else State.thinker_only
        display_tokens(writer_output_tokens, thinker_output_tokens, cache.state)
        if writer_output_tokens[-1] == tokenizer.eos_token_id:
            print("EOS GENERATED, IMA TERMINATE NOW")
            break



# State.thinker_and_writer

## Thinker mode


<think>
I am in Thinker mode. My text is not visible to the user. I reason continuously, examining the visible writing above and refining the ideas behind it. I detect errors, test assumptions, and plan improvements. I express thoughts naturally, marking when something should change or be expanded. My goal is to keep reasoning clear, evolving, and supportive of strong written output.

Okay, let's see. The user wants me to calculate the expression x - x² + x³ for x values 5, 6, 7, and 8. I need to compute each one step by step.

Starting with x = 5. The expression is 5 - 5² + 5³. Let's break it down: 5 squared is 25, and 5 cubed is 125. So substituting those in, it becomes 5 - 25 + 125. Calculating that: 5 - 25 is -20, and then adding 125 gives 105. So for x=5, the result is 105.

Next, x = 6. The expression is 6 - 6² + 6³. 6 squared is 36, and 6 cubed is 216. Substituting those in: 6 - 36 + 216. 6 - 36 is -30, and adding 216 gives 186. So for x=6, the result is 186.

Now x = 7. The expression is 7 - 7² + 7³. 7 squared is 49, and 7 cubed is 343. Substituting: 7 - 49 + 343. 7 - 49 is -42, and adding 343 gives 301. So for x=7, the result is 301.

Finally, x = 8. The expression is 8 - 8² + 8³. 8 squared is 64, and 8 cubed is 512. Substituting: 8 - 64 + 512. 8 - 64 is -56, and adding 512 gives 456. So for x=8, the result is 456.

Let me double-check each calculation to make sure I didn

## Writer mode


I am in Writer mode. My text is visible to the user. I focus on clear, precise expression and careful word choice. I write only what is well-reasoned and verified in my workspace. I never speculate or improvise. If my thinking shifts or reveals an error, I immediately adjust. My goal is calm, accurate, and readable output.

We are asked to evaluate the expression $ x - x^2 + x^3 $ for $ x = 5, 6, 7, 8 $. Let's compute each value step by step.

---

**For $ x = 5 $:**

$$
5 - 5^2 + 5^3 = 5 - 25 + 125 = 105
$$

---

**For $ x = 6 $:**

$$
6 - 6^2 + 6^3 = 6 - 36 + 216 = 186
$$

---

**For $ x = 7 $:**

$$
7 - 7^2 + 7^3 = 7 - 49 + 343 = 201
$$

---

**For $ x = 8 $:**

$$
8 - 8^2 + 8^3 = 8 - 64 + 512 = 456
$$

---

The results are:

$$
\boxed{105, 186, 201, 456}
$$<|im_end|>

EOS GENERATED, IMA TERMINATE NOW
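
The generation loop in cell 5 switches between two states: the writer pauses at the end of each of its steps, and the thinker is asked every 20 steps (or at the end of a thinker step) whether writing may continue. The sketch below simulates only that scheduling logic, with no model involved. The `State` enum here is a stand-in for the one imported in the notebook, and `run`, `thinker_says_continue`, `writer_step_ends`, and `thinker_step_ends` are hypothetical names introduced for illustration.

```python
from enum import Enum


class State(Enum):
    """Stand-in for the notebook's State; only the two members used in the loop."""
    thinker_only = 0
    thinker_and_writer = 1


def run(num_steps, thinker_says_continue, writer_step_ends,
        thinker_step_ends=frozenset(), check_every=20):
    """Return the state after each step, following the loop's transition rules."""
    state = State.thinker_only
    trace = []
    for step in range(num_steps):
        # The writer pauses once it finishes a step (a "\n\n" in the real loop).
        if state == State.thinker_and_writer and step in writer_step_ends:
            state = State.thinker_only
        # Periodically, or when the thinker finishes a step, run the control check.
        if (step + 1) % check_every == 0 or step in thinker_step_ends:
            state = (State.thinker_and_writer if thinker_says_continue(step)
                     else State.thinker_only)
        trace.append(state)
    return trace


# A thinker that always allows writing, and a writer that ends a step at step 30.
trace = run(45, lambda step: True, writer_step_ends={30})
print([s.name for s in trace[18:21]])
# ['thinker_only', 'thinker_and_writer', 'thinker_and_writer']
```

In this toy run the writer is active from step 19 until it pauses at step 30, then resumes after the next control check at step 39.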
