In [1]:
# Install PyTorch (CPU version), Transformers, and Accelerate
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
%pip install transformers accelerate

Looking in indexes: https://download.pytorch.org/whl/cpu
Collecting torch
  Downloading https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (29 kB)
Collecting torchvision
  Downloading https://download.pytorch.org/whl/cpu/torchvision-0.24.1%2Bcpu-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (5.9 kB)
Collecting torchaudio
  Downloading https://download.pytorch.org/whl/cpu/torchaudio-2.9.1%2Bcpu-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (6.9 kB)
Collecting filelock (from torch)
  Using cached filelock-3.20.0-py3-none-any.whl.metadata (2.1 kB)
Collecting sympy>=1.13.3 (from torch)
  Downloading sympy-1.14.0-py3-none-any.whl.metadata (12 kB)
Collecting networkx>=2.5.1 (from torch)
  Using cached networkx-3.6.1-py3-none-any.whl.metadata (6.8 kB)
Collecting fsspec>=0.8.5 (from torch)
  Using cached fsspec-2025.12.0-py3-none-any.whl.metadata (10 kB)
Collecting numpy (from torchvision)
  Downloading numpy-2.3.5-cp313-cp313-manylinux_2_27

In [7]:
import time
import torch
from threading import Thread
from transformers import TextIteratorStreamer, AutoTokenizer, AutoModelForCausalLM

# --- 1. Setup: Device (CPU) and Model ---
device = "cpu"

# Candidate models; only the uncommented assignment takes effect.
# model_id = "gpt2"
# model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# model_id = "mistralai/Ministral-3-3B-Instruct-2512"
model_id = "Qwen/Qwen2.5-3B-Instruct"

dtype = torch.float32

print(f"\nLoading {model_id}...")

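# --- 2. Load Tokenizer and Model ---
# trust_remote_code=True lets repositories that ship custom code on the Hub load;
# only enable it for repositories you trust.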
tokenizer = AutoTokenizer.from_pretrained(
    model_id, 
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=dtype,      # This transformers release warns that torch_dtype is deprecated in favor of dtype
    trust_remote_code=True, # Allow the repository to supply its own model/config classes
    device_map=device       # Place all weights on the CPU
)

# --- 3. Run Inference ---
messages = [
    {"role": "user", "content": "Tell me a short story."}
]

# Apply the model's own chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Tokenize and move to device
inputs = tokenizer(prompt, return_tensors="pt").to(device)

generation_kwargs = dict(
    inputs=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    streamer=streamer,
    max_new_tokens=300,    
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id
)

print(f"\nPrompt: {messages[0]['content']}")
print("-" * 30)

t0 = time.time()
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

# --- 4. Stream Output ---
generated_text = ""
first_token_received = False
ttft = 0

for new_text in streamer:
    if not first_token_received:
        # Record the latency of the first streamed chunk
        ttft = time.time() - t0
        first_token_received = True
    print(new_text, end="", flush=True)
    generated_text += new_text

t_end = time.time()
thread.join()  # The streamer is exhausted, so generate() is done; clean up the thread

# --- 5. Stats ---
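# Approximate count: re-tokenizes the decoded text without adding special tokens
total_new_tokens = len(tokenizer.encode(generated_text, add_special_tokens=False))
decoding_time = t_end - (t0 + ttft)  # Time spent after the first chunk arrived

print("\n" + "-" * 30)
print(f"Time to First Token: {ttft:.4f} s")
if decoding_time > 0 and total_new_tokens > 1:
    # The first token is attributed to TTFT, so exclude it from the decode rate
    print(f"Generation Speed:    {(total_new_tokens-1)/decoding_time:.2f} tokens/sec")
print(f"Total Tokens:        {total_new_tokens}")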


Loading Qwen/Qwen2.5-3B-Instruct...


`torch_dtype` is deprecated! Use `dtype` instead!



Prompt: Tell me a short story.
------------------------------
Once upon a time, in a small village nestled between rolling hills and dense forests, there lived a young girl named Lila. Lila was known for her kind heart and her love of storytelling. She would often gather the children around her during the long winter evenings, weaving tales that filled their imaginations with magic and adventure.

One winter, as the snow fell gently and the village wrapped itself in a blanket of white, a mysterious old man appeared at the edge of the forest. He was tall and thin, with eyes that seemed to hold secrets from another world. The villagers were wary, but Lila welcomed him warmly, inviting him to share his stories.

The old man agreed, and every evening, he would come and tell tales of far-off lands, of dragons and princesses, of brave knights and wicked witches. But there was something else too; the old man spoke of hidden treasures, treasures that could grant wishes and make impossible dre
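
The timing logic in the cell above (time to first token, then decode time for the rest of the stream) can be factored into a small helper that works on any iterator of text chunks. This is a minimal sketch, not part of the original notebook: the helper name `measure_stream` and its `clock`, `start`, and `on_chunk` parameters are hypothetical, chosen here so the timing can be exercised without loading a model.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
    start: Optional[float] = None,
    on_chunk: Optional[Callable[[str], None]] = None,
) -> Tuple[str, float, float]:
    """Consume a stream of text chunks and time it.

    Returns (full_text, ttft_seconds, decode_seconds). decode_seconds is the
    time between the first chunk arriving and the stream ending.
    `start` is an optional timestamp taken earlier with the same clock
    (hypothetical parameter); if omitted, timing starts when this is called.
    """
    t0 = clock() if start is None else start
    first_chunk_at = None
    pieces = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = clock()  # First chunk marks time to first token
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print the chunk as it arrives
        pieces.append(chunk)
    t_end = clock()
    if first_chunk_at is None:
        # Empty stream: nothing was generated, so report zero timings
        return "", 0.0, 0.0
    return "".join(pieces), first_chunk_at - t0, t_end - first_chunk_at


# Demonstration with a fake stream and a deterministic fake clock
_ticks = iter([0.0, 0.5, 2.5])
text, ttft_s, decode_s = measure_stream(["Once ", "upon ", "a time"], clock=lambda: next(_ticks))
print(text, ttft_s, decode_s)
```

In the notebook, one possible call would be `measure_stream(streamer, clock=time.time, start=t0, on_chunk=lambda s: print(s, end="", flush=True))` right after `thread.start()`. That call pattern is a suggestion under the assumptions above, not something the original cell does.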