<a href="https://colab.research.google.com/github/royam0820/nanochat_rl/blob/main/nanochat_inference_working.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NanoChat 1.8B-RL Inference Notebook (Fixed)

This notebook demonstrates how to run inference with the `jasonacox/nanochat-1.8B-rl` model.

**Model Details:**
- Parameters: ~1.9 billion
- Architecture: 20 layers, 1280 embedding dimension
- Training: Includes pretraining, midtraining, SFT, and RL (GRPO)

**Model Link:** [jasonacox/nanochat-1.8B-rl](https://huggingface.co/jasonacox/nanochat-1.8B-rl)

## 1. Setup and Installation

In [1]:
# Check GPU availability
!nvidia-smi

Wed Nov 19 13:37:32 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   61C    P8             11W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
# Install dependencies
!pip install -q huggingface_hub torch

## 2. Clone NanoChat Repository

In [3]:
import os
import sys

# Clone the NanoChat repository
if not os.path.exists('nanochat'):
    !git clone https://github.com/karpathy/nanochat.git
    print("✓ NanoChat repository cloned")
else:
    print("✓ NanoChat repository exists")

# Add to Python path
sys.path.insert(0, 'nanochat')

Cloning into 'nanochat'...
remote: Enumerating objects: 629, done.
remote: Counting objects: 100% (44/44), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 629 (delta 25), reused 10 (delta 10), pack-reused 585 (from 3)
Receiving objects: 100% (629/629), 432.97 KiB | 2.19 MiB/s, done.
Resolving deltas: 100% (390/390), done.
✓ NanoChat repository cloned


## 3. Download Model and Organize Files

In [4]:
from huggingface_hub import snapshot_download
import shutil
from pathlib import Path

# Download model
print("📥 Downloading model from Hugging Face...")
model_path = snapshot_download(
    repo_id="jasonacox/nanochat-1.8B-rl",
    cache_dir=os.path.expanduser("~/.cache/huggingface")
)
print(f"✓ Downloaded to: {model_path}")

# Create NanoChat directory structure
base_dir = Path.home() / ".cache" / "nanochat"
tokenizer_dir = base_dir / "tokenizer"
checkpoint_dir = base_dir / "1.8B-rl" / "final"

tokenizer_dir.mkdir(parents=True, exist_ok=True)
checkpoint_dir.mkdir(parents=True, exist_ok=True)

print("\n📂 Organizing files...")
model_path = Path(model_path)

# Copy tokenizer files from subdirectory
source_tokenizer_dir = model_path / "tokenizer"
if source_tokenizer_dir.exists():
    for file in source_tokenizer_dir.glob("*"):
        if file.is_file():
            shutil.copy2(file, tokenizer_dir / file.name)
            print(f"  ✓ Copied tokenizer/{file.name}")
else:
    print("  ⚠ Tokenizer subdirectory not found, checking root...")
    # Fallback: check root directory
    for fname in ['tokenizer.pkl', 'token_bytes.pt']:
        src = model_path / fname
        if src.exists():
            shutil.copy2(src, tokenizer_dir / fname)
            print(f"  ✓ Copied {fname}")

# Copy checkpoint files and fix naming
for file in model_path.glob("*"):
    if file.name.startswith(('model_', 'meta_')) and file.is_file():
        dest = checkpoint_dir / file.name
        shutil.copy2(file, dest)
        print(f"  ✓ Copied {file.name}")

        # Fix filename if it has wrong number of zeros
        if 'model_' in file.name:
            # Extract step number and ensure 6-digit format
            parts = file.name.replace('model_', '').replace('.pt', '')
            step = int(parts)
            correct_name = f"model_{step:06d}.pt"
            correct_path = checkpoint_dir / correct_name

            if dest.name != correct_name:
                shutil.move(dest, correct_path)
                print(f"    → Renamed to {correct_name}")

print("\n✓ Setup complete!")
print(f"  Tokenizer: {tokenizer_dir}")
print(f"  Checkpoint: {checkpoint_dir}")
print(f"\nCheckpoint contents: {list(checkpoint_dir.glob('*'))}")

📥 Downloading model from Hugging Face...


Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]

.gitattributes:   0%|          | 0.00/235 [00:00<?, ?B/s]

meta_000466.json:   0%|          | 0.00/155 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/248 [00:00<?, ?B/s]

model_000466.pt:   0%|          | 0.00/2.08G [00:00<?, ?B/s]

tokenizer/token_bytes.pt:   0%|          | 0.00/264k [00:00<?, ?B/s]

tokenizer/tokenizer.pkl:   0%|          | 0.00/846k [00:00<?, ?B/s]

✓ Downloaded to: /root/.cache/huggingface/models--jasonacox--nanochat-1.8B-rl/snapshots/53460595bc0c4ff1e31df01c59711798a372bd6a

📂 Organizing files...
  ✓ Copied tokenizer/token_bytes.pt
  ✓ Copied tokenizer/tokenizer.pkl
  ✓ Copied meta_000466.json
  ✓ Copied model_000466.pt

✓ Setup complete!
  Tokenizer: /root/.cache/nanochat/tokenizer
  Checkpoint: /root/.cache/nanochat/1.8B-rl/final

Checkpoint contents: [PosixPath('/root/.cache/nanochat/1.8B-rl/final/meta_000466.json'), PosixPath('/root/.cache/nanochat/1.8B-rl/final/model_000466.pt')]
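
The renaming step above only normalizes `model_*.pt` files, so a `meta_*.json` file with a differently padded step number would keep its original name. The following is a minimal sketch, not part of the original notebook, that pads the step to six digits for both prefixes. The helper name `normalize_checkpoint_names` and the assumption that only `.pt` and `.json` extensions occur are illustrative.

```python
import re
from pathlib import Path

# Matches names such as model_466.pt or meta_000466.json (prefix, step, extension).
CHECKPOINT_NAME = re.compile(r"^(model|meta)_(\d+)\.(pt|json)$")

def normalize_checkpoint_names(directory):
    """Rename model_*/meta_* files so the step is zero-padded to six digits.

    Hypothetical helper: returns a list of (old_name, new_name) pairs.
    """
    renamed = []
    # sorted() materializes the listing first, so renaming during the loop is safe.
    for path in sorted(Path(directory).iterdir()):
        match = CHECKPOINT_NAME.match(path.name)
        if not match:
            continue  # leave unrelated files alone
        prefix, step, ext = match.groups()
        target = path.with_name(f"{prefix}_{int(step):06d}.{ext}")
        if target != path:
            path.rename(target)
            renamed.append((path.name, target.name))
    return renamed
```

Calling `normalize_checkpoint_names(checkpoint_dir)` after the copy step would leave already well-formed names, such as `model_000466.pt`, untouched.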


## 4. Load Model

In [5]:
import torch
from contextlib import nullcontext

# Import NanoChat modules
from nanochat.checkpoint_manager import build_model
from nanochat.common import compute_init, autodetect_device_type
from nanochat.engine import Engine

print("🚀 Initializing model...")

# Setup device
device_type = autodetect_device_type()
print(f"Device: {device_type}")

_, _, _, _, device = compute_init(device_type)

# Set precision
ptdtype = torch.bfloat16 if device_type == "cuda" else torch.float32
autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=ptdtype) if device_type == "cuda" else nullcontext()

# Find checkpoint
checkpoint_files = list(checkpoint_dir.glob("model_*.pt"))
if not checkpoint_files:
    raise FileNotFoundError(f"No checkpoint found in {checkpoint_dir}")

# Extract step number
checkpoint_file = checkpoint_files[0]
step = int(checkpoint_file.stem.split('_')[1])
print(f"Loading checkpoint step {step}...")

# Load model
model, tokenizer, _ = build_model(str(checkpoint_dir), step, device, phase="eval")
engine = Engine(model, tokenizer)

print("✓ Model loaded successfully!")
print("✓ Ready for inference")

🚀 Initializing model...
Autodetected device type: cuda
Device: cuda
Loading checkpoint step 466...
✓ Model loaded successfully!
✓ Ready for inference
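
The loading cell takes `checkpoint_files[0]`, which is whichever file `glob` happens to return first. With a single checkpoint that is fine. If several `model_*.pt` files were present, picking the highest step explicitly would be more predictable. A small sketch of that idea follows; `latest_checkpoint_step` is a hypothetical helper, not a NanoChat function.

```python
from pathlib import Path

def latest_checkpoint_step(directory):
    """Return the highest step among model_<step>.pt files in a directory."""
    steps = []
    for path in Path(directory).glob("model_*.pt"):
        suffix = path.stem.split("_", 1)[1]
        if suffix.isdigit():  # skip names like model_tmp.pt
            steps.append(int(suffix))
    if not steps:
        raise FileNotFoundError(f"No checkpoint found in {directory}")
    return max(steps)
```

In the cell above, `step = latest_checkpoint_step(checkpoint_dir)` could then replace the two lines that derive `step` from the first glob result.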


## 5. Inference Function

In [6]:
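def generate_response(prompt, max_tokens=200, temperature=0.8, top_k=50, verbose=True):
    """
    Generate a response from the model.

    Args:
        prompt: Input text, encoded as-is (no chat markers are added)
        max_tokens: Maximum number of tokens to generate
        temperature: Sampling temperature; lower is more deterministic, values above 1.0 add randomness
        top_k: Sample only from the k most likely tokens
        verbose: If True, print the prompt and stream tokens as they are generated

    Returns:
        The generated text, decoded from the sampled tokens
    """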
    tokens = tokenizer.encode(prompt)

    if verbose:
        print(f"\n{'='*60}")
        print(f"Prompt: {prompt}")
        print(f"{'='*60}")
        print("Response: ", end="", flush=True)

    response_tokens = []

    with autocast_ctx:
        for token_column, _ in engine.generate(
            tokens,
            num_samples=1,
            max_tokens=max_tokens,
            temperature=temperature,
            top_k=top_k
        ):
            token = token_column[0]
            response_tokens.append(token)

            if verbose:
                print(tokenizer.decode([token]), end="", flush=True)

    if verbose:
        print("\n" + "="*60)

    return tokenizer.decode(response_tokens)
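
Because `generate_response` encodes the raw prompt, the model has to complete the chat markup itself, which is why the recorded outputs start with `<|user_end|><|assistant_start|>` or drift into unrelated text. Below is a hedged sketch of wrapping a single user turn in chat markers before generation. It assumes the tokenizer exposes `get_bos_token_id()` and `encode_special()` as NanoChat's chat scripts appear to use them, and that a `<|user_start|>` token exists alongside the markers visible in the outputs. Check both against the cloned repository before relying on it.

```python
def build_chat_tokens(tokenizer, prompt):
    """Wrap one user message in chat markers and open an assistant turn.

    Assumption: `tokenizer` provides get_bos_token_id(), encode_special(name)
    and encode(text), and defines the special tokens named below.
    """
    return [
        tokenizer.get_bos_token_id(),
        tokenizer.encode_special("<|user_start|>"),
        *tokenizer.encode(prompt),
        tokenizer.encode_special("<|user_end|>"),
        # Ending on assistant_start asks the model to answer rather than continue the prompt.
        tokenizer.encode_special("<|assistant_start|>"),
    ]
```

To try it, replace `tokens = tokenizer.encode(prompt)` in `generate_response` with `tokens = build_chat_tokens(tokenizer, prompt)`.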

## 6. Example Usage

In [7]:
# Example 1: Simple greeting
generate_response(
    "Hello! How are you today?",
    max_tokens=100,
    temperature=0.8
);


Prompt: Hello! How are you today?
Response:  I'm at the store, I have to buy some milk. Now let me tell you what I think: I like chocolate. I think I like milk too. That's all for now. Hope you enjoy your day. How about this? Oh, I see some flowers in the window. The sunlight   is really bright today. Today is a nice day. We should go to the park to have a rest. ,. What do you think of milk?
- It's really good for our


In [8]:
# Example 2: Factual question
generate_response(
    "What is the capital of France?",
    max_tokens=80,
    temperature=0.7
);


Prompt: What is the capital of France?
Response: <|user_end|><|assistant_start|>The capital of France is Paris.<|assistant_end|>


In [9]:
# Example 3: Math problem (model has RL training for math)
generate_response(
    "If I have 5 apples and buy 3 more, how many apples do I have in total?",
    max_tokens=100,
    temperature=0.7
);


Prompt: If I have 5 apples and buy 3 more, how many apples do I have in total?
Response: <|user_end|><|assistant_start|>5 apples is equal to 5 x 3 = <|python_start|>5*3<|python_end|><|output_start|>15<|output_end|>15.
Therefore, you have a total of 15 + 3 = <|python_start|>15+ Trinity<|assistant_end|>


In [10]:
# Example 4: Explain a concept
generate_response(
    "Explain what machine learning is in simple terms.",
    max_tokens=150,
    temperature=0.7
);


Prompt: Explain what machine learning is in simple terms.
Response: <|user_end|><|assistant_start|>Machine learning is a subset of artificial intelligence (AI) that focuses on training algorithms and systems to recognize patterns and make decisions or predictions without being explicitly programmed. These systems learn from data, similar to how humans learn and grow, enabling them to improve their performance over time.

Think of it like a learning process where the system learns to recognize what it's told it to recognize. This learning is typically done through various algorithms and techniques, such as statistical models, neural networks, and deep learning. These methods allow the system to learn from data, making it capable of improving its performance over time.

Machine learning is commonly used in various applications, including image and speech recognition, natural language processing, predictive analytics, and recommendation systems, among others


In [11]:
# Example 5: Creative task
generate_response(
    "Write a haiku about artificial intelligence.",
    max_tokens=80,
    temperature=0.9
);


Prompt: Write a haiku about artificial intelligence.
Response: <|user_end|><|assistant_start|>Alone, AI is on wings,
A master of craft, without hands divine.
Its wisdom is like moonlight shining bright,
A beacon in the dark, where hope reigns.

In my dreams, we dance with hands above,
A testament to AI's unyielding might.
Its purpose, clear and clear,
Is to learn, to adapt, to master.

In the gentle hands
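
The responses above still contain marker tokens such as `<|assistant_start|>` and `<|assistant_end|>`. A minimal post-processing sketch, assuming every marker has the `<|name|>` shape seen in these outputs, keeps only the assistant's reply. `clean_response` is a hypothetical helper.

```python
import re

# Marker tokens as they appear in the outputs above, e.g. <|python_start|>.
SPECIAL_TOKEN = re.compile(r"<\|[a-z_]+\|>")

def clean_response(text):
    """Keep the assistant's reply and drop chat/tool marker tokens."""
    # Keep only what follows the last assistant start marker, if there is one.
    _, _, reply = text.rpartition("<|assistant_start|>")
    # Stop at the end-of-turn marker, if the model emitted one.
    reply = reply.split("<|assistant_end|>", 1)[0]
    # Remove any remaining markers; the text between tool markers is left in place.
    return SPECIAL_TOKEN.sub("", reply).strip()
```

For example, `clean_response(generate_response(prompt, verbose=False))` would return plain text for the factual question above. Tool-call spans such as those in Example 3 are not parsed and may need separate handling.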


## 7. Interactive Chat

In [13]:
def chat():
    """Interactive chat loop. Type 'quit', 'exit', or 'q' to end."""
    print("\n" + "="*60)
    print("NanoChat Interactive Mode")
    print("Type 'quit' or 'exit' to end")
    print("="*60 + "\n")

    while True:
        user_input = input("You: ")

        if user_input.lower() in ['quit', 'exit', 'q']:
            print("Goodbye!")
            break

        if not user_input.strip():
            continue

        print("\nNanoChat: ", end="", flush=True)
        response = generate_response(
            user_input,
            max_tokens=200,
            temperature=0.8,
            verbose=False
        )
        print(response + "\n")

# Start interactive mode (comment out this call to skip it)
chat()


NanoChat Interactive Mode
Type 'quit' or 'exit' to end

You: hi

NanoChat: , a popular Japanese dish featuring thinly sliced raw fish, usually served with a side of steamed rice, was cooked to perfection in the oven.<|assistant_end|>

You: what is 2+2

NanoChat: i?<|user_end|><|assistant_start|>The expression 2+2i can be factored as 2 * (2^2 + 2^1).<|assistant_end|>

You: use Python

NanoChat:  function to count the occurrences of each word in the text:
    word_counts = {}
    for word in text.split():
        word = word.strip().lower()
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts

text = "This is a sample text. This text is just a sample."
print(count_words(text))
```

This code defines a function `count_words` that takes a string `text` as input and returns a dictionary mapping each word to its count. The function splits the text into a list of words using `text.split()`, then uses a dictionary comprehension to create a dictionary where the keys 
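
Each turn in the loop above is sent to the model on its own, so a follow-up like "use Python" arrives without the earlier question. One way to carry context is to flatten the whole conversation into a single token list on every turn. The sketch below illustrates that idea under assumptions: `render_conversation` is hypothetical, and it relies on the tokenizer offering `get_bos_token_id()` and `encode_special()` with `<|user_start|>`-style tokens, which should be verified against the NanoChat source.

```python
def render_conversation(tokenizer, messages):
    """Flatten (role, text) turns into one token list ending with an open assistant turn.

    Assumption: special tokens follow the <|role_start|> / <|role_end|> pattern.
    """
    tokens = [tokenizer.get_bos_token_id()]
    for role, text in messages:
        if role not in ("user", "assistant"):
            raise ValueError(f"Unknown role: {role}")
        tokens.append(tokenizer.encode_special(f"<|{role}_start|>"))
        tokens.extend(tokenizer.encode(text))
        tokens.append(tokenizer.encode_special(f"<|{role}_end|>"))
    # Leave an assistant turn open so generation continues as the assistant.
    tokens.append(tokenizer.encode_special("<|assistant_start|>"))
    return tokens
```

A chat loop built on this would append `("user", user_input)` before generating and `("assistant", reply)` afterwards, then pass the rendered tokens to `engine.generate`.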

## 8. Temperature Comparison

In [14]:
prompt = "Tell me an interesting fact about space."

print("\n🌡️ Temperature Comparison\n")

for temp in [0.3, 0.7, 1.0]:
    print(f"\n{'='*60}")
    print(f"Temperature = {temp}")
    print("="*60)
    # verbose=False disables streaming, so print the returned text explicitly
    print(generate_response(prompt, max_tokens=80, temperature=temp, verbose=False))
    print()


## 9. Custom Prompt

In [15]:
# Test your own prompt here!
YOUR_PROMPT = "What is the meaning of life?"
MAX_TOKENS = 200
TEMPERATURE = 0.8

generate_response(
    YOUR_PROMPT,
    max_tokens=MAX_TOKENS,
    temperature=TEMPERATURE
);


Prompt: What is the meaning of life?
Response: <|user_end|><|assistant_start|>The meaning of life is a profound and enduring question that has puzzled thinkers and individuals for centuries. At its core, the meaning of life lies in the pursuit of happiness, fulfillment, and self-discovery. It is the idea that, despite our best efforts, we are ultimately limited by our choices, circumstances, and experiences.

Think of it this way: life is a journey, and we are the participants. Our decisions, successes, and failures significantly impact our trajectory. But the meaning of life is the journey itself, and it is our choice to navigate it with integrity, resilience, and purpose.

Ultimately, the meaning of life is about living a life that is authentic, meaningful, and true to oneself. It's about embracing the present moment, letting go of attachments, and finding joy in the simple things. It's about living in harmony with the natural world, with others, and with our own unique experiences.

## 10. Batch Processing

In [16]:
prompts = [
    "What is Python?",
    "Explain quantum computing.",
    "What are neural networks?",
    "How does the internet work?"
]

print(f"\n📦 Processing {len(prompts)} prompts...\n")

for i, prompt in enumerate(prompts, 1):
    print(f"\n[{i}/{len(prompts)}] {prompt}")
    print("-" * 60)
    response = generate_response(
        prompt,
        max_tokens=100,
        temperature=0.7,
        verbose=False
    )
    print(response)


📦 Processing 4 prompts...


[1/4] What is Python?
------------------------------------------------------------
<|user_end|><|assistant_start|>Python is a high-level, interpreted programming language developed by Guido van Rossum at Bell Labs in the 1980s. It is often compared to other languages such as Java, C++, and PHP, but Python's syntax and nature are distinct from those of these languages.

Python's syntax is relatively simple, with a focus on readability and simplicity. It supports multiple data types, including integers, strings, and tuples, and supports conditional statements such as `if`, `

[2/4] Explain quantum computing.
------------------------------------------------------------
<|user_end|><|assistant_start|>Quantum computing refers to the simulation and manipulation of quantum systems, which are fundamentally different from classical computers. Unlike classical computers, which process information using binary digits (0s and 1s), quantum computing processes informa

## 📝 Notes

### Model Characteristics
- **Size**: 1.9B parameters (d20 architecture)
- **Training**: RL-enhanced with GRPO (improved math & reduced hallucinations)
- **Best for**: Educational purposes, experiments, conversational AI

### Generation Parameters
- **temperature**:
  - `0.3-0.5`: Focused, consistent (good for facts)
  - `0.7-0.8`: Balanced (default)
  - `0.9-1.2`: Creative, diverse

- **top_k**:
  - `20-30`: Conservative
  - `50`: Default, balanced
  - `100+`: More diverse
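
To make the effect of these two parameters concrete, here is a small pure-Python sketch of how temperature and top-k reshape a next-token distribution. It illustrates the general technique, not NanoChat's actual sampling code, and `sampling_probs` is a hypothetical name.

```python
import math

def sampling_probs(logits, temperature=0.8, top_k=50):
    """Return the probabilities a top-k, temperature-scaled sampler would draw from."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    # Keep the indices of the top_k largest logits; all others get zero probability.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Dividing by a temperature below 1 widens the gaps between logits (sharper distribution).
    scaled = {i: logits[i] / temperature for i in keep}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {i: math.exp(v - peak) for i, v in scaled.items()}
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in range(len(logits))]

# Compare how much probability the most likely token receives at each setting.
for temp in (0.3, 0.7, 1.0):
    print(temp, sampling_probs([2.0, 1.0, 0.0], temperature=temp, top_k=2))
```

Lower temperatures concentrate probability on the highest-scoring token, and a smaller `top_k` removes the low-probability tail entirely.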

### Tips
1. Use GPU runtime (Runtime → Change runtime type → T4 GPU)
2. Lower temperature for factual questions
3. Higher temperature for creative tasks
4. The RL stage targets math, but answers can still be wrong (Example 3 above miscomputes 5 + 3), so check any arithmetic

### Resources
- [NanoChat GitHub](https://github.com/karpathy/nanochat)
- [Model on HF](https://huggingface.co/jasonacox/nanochat-1.8B-rl)
- [Training Details](https://github.com/karpathy/nanochat/discussions)

### Limitations
⚠️ This is a micro-model for educational purposes:
- May hallucinate facts
- Limited knowledge
- Not production-ready
- Much smaller than GPT-4/Claude

---

## 🎉 Ready to Experiment!

The model is loaded and ready. Try different prompts and parameters to explore its capabilities.

Happy experimenting! 🚀