Introducing nanocode: The best Claude Code that $200 can buy. #1

salmanmohammadi · 2026-04-05T14:19:40Z

salmanmohammadi
Apr 5, 2026
Maintainer

I'm so excited to share nanocode. This is a library showing you how to speedrun your own Claude Code end-to-end: from pretraining, to SFT, and RLHF using DPO. To a first approximation, we will follow the simplest possible approach for training using Constitutional AI - the approach used by Anthropic to train their Claude models. We'll write our own SOUL.md, define the agentic interface which our model will use to interact with the world, generate synthetic data, and use preference optimisation to align the model with our SOUL.

nanocode is written entirely in JAX and designed to be trained using TPUs. I adapted the core training infrastructure and philosophy from Karpathy's incredible nanochat project, so if you're familiar with nanochat, nanocode should feel very similar. By the end of this post you can expect to replicate my 1.3B parameter d24 GPT model:

[Video: nanocode.mp4]

You can get started for free using the Google TRC program which gives you free access to pre-emptible TPUs for a month - and I think new Google Cloud accounts also get $300 in credits. I was fortunate to have access to the TRC program for 3 months for this project, and I found that my spot instances were rarely interrupted: I could easily keep the same pod up for a week or more.

You can speedrun nanocode-d24 (1.3B params) in around 9 hours in total on a TPU v6e-8 costing $200, or train nanocode-d20 (477M params) in ~1.5 hours costing $34. If you're using NVIDIA GPUs, nanocode should also work out of the box, but you should be aware that nanocode has been highly optimised for TPUs.

Training `nanocode`: a friendly agentic coding partner

Andrej's original release post for nanochat does a great job of explaining what we're doing here, and the commands you'll use in nanocode are virtually identical, so I'd recommend reading through his work first. I'll go over what we've done differently to elicit agentic coding behaviours from our model.

Tokenization and Pre-training

The pre-training and tokenizer training process is pretty much identical to nanochat's, but I found that including additional coding data from The Stack-V2 at a ratio of 1:5 in both the pre-training and tokenizer mixture resulted in a stronger coding model and more efficient code tokenization, which helped a ton.

Let's first download the dataset shards we'll need for tokenizer training and model pre-training:

# we'll be training our d24, 1.3B parameter model. but you can adapt MODEL_TAG for your model size.
export NANOCODE_BASE_DIR="$HOME/.cache/nanocode"
export MODEL_TAG=d24
python -m data.pretrain -d fineweb-edu -n 300
# I've pre-packed and sharded The Stack similar to FineWeb
python -m data.pretrain -d the-stack-v2-dedup -n 60

And kick off our tokenizer training script:

python -m scripts.tok_train --max-chars=2000000000
python -m scripts.tok_eval

For reference, we can compare with nanochat's tokenizer which is identical aside from the addition of The Stack in the training mixture (well, I've also added special tokens and templating logic to support more sophisticated tool calling, but more on that later).

Comparison: nanocode vs nanochat
===============================================================================================
Text Type  Bytes    nanocode        nanochat        Relative     Better    
                    Tokens  Ratio   Tokens  Ratio   Diff %      
-----------------------------------------------------------------------------------------------
news       1819     407     4.47    375     4.85      +7.9%      nanochat  
korean     893      558     1.60    712     1.25     -27.6%      nanocode  
code       1259     326     3.86    492     2.56     -50.9%      nanocode  
math       1834     922     1.99    966     1.90      -4.8%      nanocode  
science    1112     259     4.29    228     4.88     +12.0%      nanochat  
fwe-train  4208518  902950  4.66    856883  4.91      +5.1%      nanochat  
fwe-val    4495276  975403  4.61    1010352 4.86      -3.6%      nanocode
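
The Ratio and Relative Diff columns can be reproduced from the raw counts. Here's a minimal sketch, assuming Ratio means bytes per token and the relative difference is taken with respect to nanocode's token count. That convention is inferred from the numbers in the table, not taken from `scripts.tok_eval`, and the helper names are hypothetical:

```python
def bytes_per_token(n_bytes: int, n_tokens: int) -> float:
    """Compression ratio: higher means fewer tokens for the same text."""
    return n_bytes / n_tokens


def relative_diff_pct(nanocode_tokens: int, nanochat_tokens: int) -> float:
    """Positive means nanocode needs more tokens than nanochat (assumed convention)."""
    return 100.0 * (nanocode_tokens - nanochat_tokens) / nanocode_tokens


# (bytes, nanocode tokens, nanochat tokens) copied from two rows of the table
rows = {
    "news": (1819, 407, 375),
    "code": (1259, 326, 492),
}

for name, (n_bytes, ours, theirs) in rows.items():
    print(
        f"{name:5s} nanocode={bytes_per_token(n_bytes, ours):.2f} "
        f"nanochat={bytes_per_token(n_bytes, theirs):.2f} "
        f"diff={relative_diff_pct(ours, theirs):+.1f}%"
    )
```

Under this convention a negative diff means nanocode spends fewer tokens on the same bytes, which is what the code row shows.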

We can see that this gives a big boost for code at the cost of general text tokenization efficiency, but this is okay since we want our model to do one thing very well: agentic coding. Our models are trained with a data:param ratio of 8, i.e. 8 training tokens per parameter (following nanochat's scaling law analysis). Let's kick off a training run like so:

python -u -m scripts.base_train \
    --batch-size=32 \
    --minibatch-size=1 \
    --config=configs.d24 \
    --eval-every=500 \
    --sample-every=500

You should see something like this:

Vocab size: 32768
World size: 8
1342.17728M model parameters
  67.108864M wte parameters
  1207.959552M h parameters
  67.108864M lm_head parameters
Training on 10737418240 tokens over 10241 steps
====================
Estimated FLOPs per token: 10066329600
Scaling the LR for the AdamW parameters ∝1/√(2048/768) = 0.612372
Step: 0/10241 | Loss: 10.398 | dt: 104.58s | | tkps: 10026 | mfu: 1.37 | ETA: -1.0 min | lr_multiplier: 1.000
        Peak bytes reserved/limit: 14.86/22.27
Step: 1/10241 | Loss: 9.771 | dt: 2.74s | | tkps: 382082 | mfu: 52.37 | ETA: -1.0 min | lr_multiplier: 1.000
Step: 2/10241 | Loss: 8.209 | dt: 2.74s | | tkps: 382220 | mfu: 52.39 | ETA: 234.1 min | lr_multiplier: 1.000
Step: 3/10241 | Loss: 7.327 | dt: 2.74s | | tkps: 382193 | mfu: 52.39 | ETA: 312.1 min | lr_multiplier: 1.000
...
fwe_bpb: 0.7626 | sv2_bpb: 0.4356 | avg_bpb: 0.5991 | dt: 90.53s
<|bos|>The capital of France is Paris. It is the largest city in France and the most populous city in
<|bos|>The chemical symbol of gold is Au. Gold is a soft, malleable, yellow metal that is
<|bos|>The closest planet to the Sun is Mercury, which is the smallest planet in the solar system. It is the closest
<|bos|>The opposite of hot is cold. The opposite of cold is heat. The opposite of heat is cold.
<|bos|>The second-last day of the week is the day of the Lord. (Leviticus 23:2)
...
CORE metric: 0.2352 | dt: 56.86s
Total training time: 467.15min
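
The header of this log can be reproduced with a little arithmetic. This is a rough sketch under stated assumptions: the width of 2048 is read off the LR scaling line, `12 * dim^2` parameters per block assumes a standard bias-free transformer block, `--batch-size` is assumed to be per device, and the 4096 context length is the one mentioned later in the post. None of this is taken from `scripts.base_train` itself:

```python
# Assumed d24 shape: depth from the model tag, width from the LR scaling line
# (2048/768), vocab size from the log.
depth, dim, vocab_size = 24, 2048, 32768

wte = vocab_size * dim          # token embedding
h = depth * 12 * dim * dim      # transformer blocks (4*dim^2 attention + 8*dim^2 MLP)
lm_head = vocab_size * dim      # untied output projection
n_params = wte + h + lm_head

tokens_per_param = 8            # the ratio of 8 quoted above
n_tokens = tokens_per_param * n_params

# ASSUMPTION: --batch-size is per device, so one optimizer step sees
# batch_size * world_size * seq_len tokens.
batch_size, world_size, seq_len = 32, 8, 4096
tokens_per_step = batch_size * world_size * seq_len
n_steps = n_tokens // tokens_per_step

print(f"{n_params / 1e6}M model parameters")
print(f"{n_tokens} tokens, {tokens_per_step} tokens/step, {n_steps} full steps")
```

The parameter and token counts match the log exactly. The log reports 10241 steps, one more than the 10240 computed here, which may simply be how the script counts the final step.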

Our model has attained some knowledge about the world, which is nice. It still doesn't know about Saturday though : ). Let's look at some more thorough quantitative results, since we only estimate metrics using a smaller subset of the evaluation data during training:

python -u -m scripts.base_eval --checkpoint=base --minibatch-size=8

This will print a whole bunch of metrics, but the relevant ones are bits-per-byte across our pretraining sets: sv2 (The Stack V2) and fwe (FineWeb-EDU), and the CORE metric which makes comparing against nanochat's results and GPT-2 straightforward. I've compiled the results across a few model parameter sizes to get a feel for our scaling laws:

| depth | params | CORE | cost | time   | MFU   | fwe bpb | sv2 bpb |
|-------|--------|------|------|--------|-------|---------|---------|
| d12   | 135M   | 0.090| $3   | 9 min  | 17.4% | 0.956   | 0.689   |
| d20   | 477M   | 0.170| $30  | 1.4 hrs| 45.2% | 0.838   | 0.533   |
| d24   | 1.3B   | 0.227| $200 | 9.3 hrs| 52.5% | 0.759   | 0.445   |
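
The MFU column can be sanity-checked against the d24 training log. This is a rough sketch: it assumes the FLOPs estimate is 6 FLOPs per non-embedding parameter plus an attention term (which reproduces the "Estimated FLOPs per token" line), and a bf16 peak of 918 TFLOP/s per v6e chip. The peak figure is an assumption taken from public specs, not something stated in this post:

```python
depth, dim, seq_len = 24, 2048, 4096   # assumed d24 shape and context length
n_params = 1_342_177_280               # total parameters from the log
n_embedding = 67_108_864               # wte parameters from the log

# 6 FLOPs per non-embedding parameter (forward + backward) plus an
# attention term of 12 * depth * dim * seq_len.
flops_per_token = 6 * (n_params - n_embedding) + 12 * depth * dim * seq_len

tokens_per_second = 382_082            # tkps at step 1 in the log
peak_flops_per_chip = 918e12           # ASSUMPTION: v6e bf16 peak
n_chips = 8                            # world size from the log

mfu = tokens_per_second * flops_per_token / (peak_flops_per_chip * n_chips)
print(f"flops/token={flops_per_token}, mfu={100 * mfu:.2f}%")
```

With these assumptions the result lands on the per-step `mfu` value printed in the log, and close to the 52.5% in the table.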

Since CORE measures general language reasoning capabilities and we've geared our models towards code data, it's expected that our CORE scores drop slightly compared to the corresponding GPT-2 models. Training d24 on FineWeb-EDU alone resulted in a CORE score of 0.261 which lines up with GPT-2 XL below and nanochat-d24. The tradeoff here is that we expect our models to perform well in coding tasks.

| model         | params | CORE  |
|---------------|--------|-------|
| GPT-2 Small   | 124M   | 0.114 |
| GPT-2 Medium  | 355M   | 0.185 |
| GPT-2 Large   | 774M   | 0.215 |
| GPT-2 XL      | 1.6B   | 0.257 |

I'll mostly be referring to our d24 GPT model throughout this post, which is similar to nanochat's d24 GPT model but is trained with twice the context length (4096 vs. 2048) to better support multi-turn agentic conversations. Now that we have a reasonably capable coding base model, let's look at how we can turn it into a fully-fledged agentic coding partner.

The Agentic Marathon

Let's think a bit about what agentic models are doing from first principles. Pre-training LLMs produces next-token-generators which have compressed a vast amount of knowledge, but they aren't really useful for things like following instructions, answering questions about the knowledge they have, or fixing bugs in Python files. There's a bunch more work to do in trying to get our models to do useful things. The first step is templating - delimiting different components of the input and output so the model learns the structure of the task it's being asked to perform. Let's take chat templating as an example. Conversation can be structured as turns, where each side takes a turn at a time - so our model needs to know whose turn it is, and what they've said.

User: What is 2+2?
Assistant: 4

We can template this as:

<|bos|><|user_start|>What is 2+2?<|user_end|>
<|assistant_start|>4<|assistant_end|>
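
A minimal sketch of how a conversation might be rendered into this template. `render_chat` is a hypothetical helper, not nanocode's tokenizer code, and joining turns with a newline simply mirrors the example above:

```python
BOS = "<|bos|>"


def render_chat(turns):
    """Render (role, text) pairs with the special tokens shown above.

    ASSUMPTION: turns are separated by a newline, mirroring the example;
    the real template may join them differently.
    """
    rendered = []
    for role, text in turns:
        if role not in ("user", "assistant"):
            raise ValueError(f"unknown role: {role}")
        rendered.append(f"<|{role}_start|>{text}<|{role}_end|>")
    return BOS + "\n".join(rendered)


print(render_chat([("user", "What is 2+2?"), ("assistant", "4")]))
```

In a real pipeline each special token would map to a single reserved token id rather than being tokenized as ordinary text.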

<|user_start|>, <|user_end|>, <|assistant_start|>, and <|assistant_end|> are special tokens which help provide structure to raw text. We typically reserve a whole token for them when tokenizing. Great. Now let's think about the kind of templating we might use for an agentic model. The basis for agentic behaviour is tool-calling - a kind of task where the model's turn isn't directed towards the user, but may instead be an action through an interface with the real world, and which produces outputs which the model may respond to in real-time.

If we look at it this way, the outputs of a tool call can just be treated as another kind of turn, so we reserve two additional special tokens <|tool_result_start|> and <|tool_result_end|> so our model knows when information is coming from a tool call, and not the user. Now we just need a way to let our model know how to make tool calls - we'll need templating for the name of the tool the model wishes to invoke and (optionally) any keyword arguments it needs to pass through. Let's take grep as an example:

# search recursively for the term "TODO" in the src/ directory
grep -rn "TODO" src/

This would look something like this:

<|assistant_start|>
I'll search for TODO.
<|tool_call_start|>Bash<|tool_arg|>command<|tool_val|>grep -rn "TODO" src/<|tool_call_end|>
<|assistant_end|>

We've defined special tokens for delimiting the entire tool call (<|tool_call_start|> and <|tool_call_end|>), and for delimiting different named arguments for that tool call (<|tool_arg|> and <|tool_val|>). Note that the model is able to think through and explain its actions by nesting the tool-call template inside its response.
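
To make the format concrete, here's a small sketch of a serialiser for this template. `render_tool_call` is a hypothetical name and the implementation is only an illustration of the layout shown above:

```python
def render_tool_call(name, args):
    """Serialise a tool call: the tool name, then <|tool_arg|>key<|tool_val|>value pairs."""
    body = "".join(
        f"<|tool_arg|>{key}<|tool_val|>{value}" for key, value in args.items()
    )
    return f"<|tool_call_start|>{name}{body}<|tool_call_end|>"


# The grep example from above, expressed through the Bash tool.
print(render_tool_call("Bash", {"command": 'grep -rn "TODO" src/'}))
```

Because argument values are delimited by special tokens rather than quotes or JSON, the value can contain arbitrary shell or Python text without escaping.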

It's important to think about what your final agentic interface is actually going to look like - you don't want to come up with a tool calling template and spend $$$ using it to train your model only to find out it doesn't work in practice. When defining our tools we are trading off expressivity with tractability; how easy it is for the model to actually learn to use a tool reliably. For the simplest possible agent we want it to interact with a UNIX environment by reading files, searching filesystems, and writing to disk. Above we used a Bash tool call, but if we only used Bash for everything the model would effectively have to learn correct shell syntax including quoting, flags, piping - just from examples. Instead we can anticipate that something like grep is probably something that the model is going to be doing often enough that we should give it a dedicated tool call. For nanocode's agentic interface, I defined four tools:

Read:  <|tool_call_start|>Read<|tool_arg|>file_path<|tool_val|>...<|tool_arg|>offset<|tool_val|>...<|tool_arg|>limit<|tool_val|>...<|tool_call_end|>
Edit:  <|tool_call_start|>Edit<|tool_arg|>file_path<|tool_val|>...<|tool_arg|>old_string<|tool_val|>...<|tool_arg|>new_string<|tool_val|>...<|tool_call_end|>
Grep:  <|tool_call_start|>Grep<|tool_arg|>pattern<|tool_val|>...<|tool_arg|>path<|tool_val|>...<|tool_call_end|>
Bash:  <|tool_call_start|>Bash<|tool_arg|>command<|tool_val|>...<|tool_call_end|>

This lets nanocode read and write files, search for patterns, and use UNIX commands when needed - though I don't anticipate that we can obtain a model which learns meaningful Bash tool usage with our compute and token budget. Based on these tool calls, our agentic CLI would just be a thin wrapper which parses the model's predicted tokens, intercepts any tool calls, and executes them, providing the result to the model as a kind of conversational turn.
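
The "thin wrapper" idea can be sketched in a few lines: find tool calls in the generated text, dispatch them, and wrap each output as a tool-result turn. Everything here (`parse_tool_calls`, `run_tool_calls`, the fake Grep tool) is hypothetical and works on decoded text for simplicity; the real CLI in `scripts.nanocode` may well operate on token ids and will also ask for permission before executing anything:

```python
import re

TOOL_CALL_RE = re.compile(r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of (tool_name, {arg: value}) found in the model's output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        name, *pairs = match.group(1).split("<|tool_arg|>")
        args = {}
        for pair in pairs:
            key, _, value = pair.partition("<|tool_val|>")
            args[key.strip()] = value
        calls.append((name.strip(), args))
    return calls


def render_tool_result(output):
    """Wrap tool output so the model sees it as a tool-result turn."""
    return f"<|tool_result_start|>{output}<|tool_result_end|>"


def run_tool_calls(text, tools):
    """tools maps tool names to plain Python callables taking keyword arguments."""
    results = []
    for name, args in parse_tool_calls(text):
        if name not in tools:
            results.append(render_tool_result(f"error: unknown tool {name}"))
            continue
        results.append(render_tool_result(tools[name](**args)))
    return results


# A pretend tool so the sketch runs without touching the filesystem.
fake_tools = {"Grep": lambda pattern, path: f"(pretend grep for {pattern!r} in {path})"}
output = (
    "i'll search for TODO."
    "<|tool_call_start|>Grep<|tool_arg|>pattern<|tool_val|>TODO"
    "<|tool_arg|>path<|tool_val|>src/<|tool_call_end|>"
)
print(run_tool_calls(output, fake_tools))
```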

Learning by example

Okay, how do we teach our model to use these tools? The simplest way is to just train the model on hundreds of thousands of examples of this tool use. These examples could look something like this:

User:
Write a Python function to process a list of strings representing stock prices and 
return a list of corresponding floating-point numbers. The input list may contain 
the following characters: letters (A-Z), numbers (0-9), periods (`.`), commas (`,`), and dashes (`-`).
Your code should pass the following test case:
input_list = ['1,234.56', '2,345.67']  
output_list = process_stock_prices(input_list)  
assert output_list == [1234.56, 2345.67]
---
Assistant:
cool, i'll write a function to process stock prices by removing commas and converting to floats. donezo!

[Edit Tool Call]  
file_path: process_stock_prices.py  
contents:  
def process_stock_prices(input_list):  
    return [float(s.replace(',', '')) for s in input_list]

---

Tool result:
def process_stock_prices(input_list):  
    return [float(s.replace(',', '')) for s in input_list]

This is a pretty rough sketch, but you get the idea - the user makes a request, and the model fulfils it by using one or more of the tools it has available. It also makes a goofy little remark to explain what it's doing. We mentioned above we're training the best Claude Code we can, and you may be familiar with Claude's soul document - a written specification of the model's character, values, and behavioural principles. Anthropic uses this document to guide how Claude is trained: it defines the desired behaviour, then training data and preference optimization are shaped to align the model with that specification.

This is the core idea behind Constitutional AI (CAI) - which was used to train early Claude models (evolutions of this technique are still used to train Claude). Constitutional AI is a training process comprising synthetic data generation, supervised fine-tuning, and preference optimisation, all in order to align a model with a specified set of characteristics and constitutional principles, or SOUL. Note that while CAI as an alignment approach is focused on producing helpful and harmless agents - in particular preventing models from producing harmful answers - our use is primarily for stylistic alignment of our model.

For nanocode's SOUL, I wanted it to have a unique voice; casual, friendly, and a little goofy, but without being sycophantic or overly verbose. This is what I came up with. To summarise, nanocode should only use lowercase but proper nouns are acceptable in code, it should be warm and friendly, and it only follows the precise instructions which it has been given. Reflecting on this I probably didn't need the philosophical fluff, particularly for models of these sizes. Our SOUL is pretty simple compared to Claude's, but as we mentioned, we want our model to be very good at only a couple things: agentic coding, and adhering to a personality which we've curated for it.

Constitutional AI instills this SOUL into a model through two stages: 1) Constitutional Supervised Fine-tuning (SFT) and 2) Reinforcement Learning from AI Feedback (RLAIF) - the preference learning stage.

1) Constitutional SFT

As I mentioned above, we need examples of our specific tool usage as well as conversational turns which adhere to our model's SOUL. The Constitutional SFT stage is a synthetic data generation pipeline which you can think of as a mix of rejection sampling and distillation. For our use case, the loop looks like this:

1. A generator model is tasked to generate an initial response to a user query.
2. A separate critique model (the judge) is given the user's query, the initial response, and the SOUL, and is asked to rate the model's response:
   - If the rating is below our threshold, the response fails. The critique model outputs a specific critique detailing exactly how the initial response failed to align with the SOUL. We save this first failed attempt as our Rejected sample.
   - If the rating passes, all good! We save this as our Chosen sample.
3. If the response failed step 2, the generator model is given the original prompt, its failed response, and the critique, and is asked to revise its response to address the feedback. This process repeats until the critique model accepts the response.

Initial prompt ──► Initial model response              
                         │                              
                         ▼                              
       ┌─────────► Critique ──(Pass?)──► Chosen sample  
       │                 │                              
       │              (Fail?)──────────► First Rejected sample
       │                 │                              
       └───── Revision ◄─┘

At the end of this process, we obtain two responses for a given prompt: a final response which is strongly aligned with the SOUL, and the initial, misaligned response. We'll use these pairs later for the preference learning stage, but for our Constitutional SFT stage, we'll just be training our model on the (Initial prompt, Chosen sample) pairs. It's worth noting that the critique loop is essential when your generator model can't reliably produce SOUL-aligned outputs in a single pass — which was the case for most of the smaller open-source models I ran locally through vLLM on TPUs. Frontier models through OpenRouter pretty much nailed things first try. I want to say that the approach I detailed here was the first one I tried but really this part of the project took a couple months of iterations and ablations.
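
The generate, critique, revise loop described above can be sketched as follows. This is not the code in the repo: `build_pair`, the rating threshold, and the `max_rounds` guard are all hypothetical, and the three callables stand in for whatever generator and judge models you use. The toy stand-ins reduce the "SOUL" to "respond in lowercase" so the sketch runs without any model:

```python
def build_pair(prompt, generate, critique, revise, threshold=4, max_rounds=3):
    """Run generate -> critique -> revise until the judge accepts.

    generate(prompt) -> str
    critique(prompt, response) -> (rating, feedback)
    revise(prompt, response, feedback) -> str
    Returns None if no revision passes within max_rounds (a guard added here).
    """
    response = generate(prompt)
    rejected = None
    for _ in range(max_rounds + 1):
        rating, feedback = critique(prompt, response)
        if rating >= threshold:
            return {"prompt": prompt, "chosen": response, "rejected": rejected}
        if rejected is None:
            rejected = response  # keep only the first failed attempt
        response = revise(prompt, response, feedback)
    return None


# Toy stand-ins for the generator and judge models.
def toy_generate(prompt):
    return "Sure! Here Is Your Function."


def toy_critique(prompt, response):
    if response == response.lower():
        return 5, ""
    return 1, "use lowercase only"


def toy_revise(prompt, response, feedback):
    return response.lower()


pair = build_pair("write a function", toy_generate, toy_critique, toy_revise)
print(pair)
```

When the first response already passes, `rejected` stays `None`: the sample is still usable for SFT, but there is no preference pair to learn from.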

I landed on three approaches for nanocode. Firstly, I generated a dataset comprising short, single-turn conversations which teach our model the fundamental agentic loop of Grep/Read, then Edit to write a solution which solves the task at hand. Importantly, it teaches our model how to understand the syntax of our tools and their results. To seed this dataset, I reused existing Python open-source instruct datasets (tulu, self-oss, and evol-codealpaca).

This turned out to be a great way to bootstrap our synthetic dataset generation process, as it provided ~120K high-quality samples with correct Python solutions and model explanations - we just need to apply the generate -> critique loop above to massage this into our format. You can see more in dev/process_datasets.py and the final dataset smohammadi/nanocode-tulu-selfoss-evol, and I'll use an example here to illustrate what our final dataset looked like:

User:
Develop a Python function that removes any falsey values from a list. Return the modified list without creating a new one. Use a list comprehension and the Python `*args` parameter to unpack the list as a parameter to the function, and then filter out any falsey values using a list comprehension.

Assistant:
Here's how we can implement this function:
```python
def remove_falsey_values(*args):
    return [val for val in args if val]
```

And after:

User:
Develop a Python function that removes any falsey values from a list. Return the modified list without creating a new one. Use a list comprehension and the Python `*args` parameter to unpack the list as a parameter to the function, and then filter out any falsey values using a list comprehension.

Assistant:
i'll wire this up with a list comprehension using *args — filters falsey values in one pass

[Edit]
  file_path: remove_falsey_values.py
  contents:
    1→def remove_falsey_values(*args):
    2→    return [val for val in args if val]

Tool Result:
---
    1→def remove_falsey_values(*args):
    2→    return [val for val in args if val]
---

Assistant:
done — modifies list in place using *args, filters falsey values via list comp, no new list created

Here I re-used the initial prompt, then converted the model's original solution into the Edit tool call by extracting the generated code and wrapping it in our tool templating. I'm also using line numbering here (1→) as I believed this would help the model make targeted Grep and Edit calls if it could see line numbers in files it was reading and editing.
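
A small sketch of the line-numbering step, assuming numbers are right-aligned and followed by `→` as in the samples shown. `number_lines` and the width of 5 are hypothetical choices, not taken from `dev/process_datasets.py`:

```python
def number_lines(source, width=5):
    """Prefix each line with a right-aligned 1-based line number and an arrow."""
    return "\n".join(
        f"{number:>{width}}→{line}"
        for number, line in enumerate(source.splitlines(), start=1)
    )


code = "def remove_falsey_values(*args):\n    return [val for val in args if val]"
print(number_lines(code))
```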

Secondly, unlike tulu and self-oss where the user just asks for code, many samples in evol-codealpaca contain code in both the instruction and the output, where the user is describing existing code and asking for a modification. This was great for mapping to multi-turn rollouts with tool chaining. To do this, I first computed a diff between the original and modified code to get targeted old_string/new_string arguments for the Edit, then prepended Grep and Read steps to obtain rollouts where the agent searches for the relevant function, reads the file, then makes a targeted edit.
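
One way the diff step could work is sketched below using the standard library's `difflib`. This is an assumption about the approach rather than the code in `dev/process_datasets.py`, and `edit_args_from_diff` is a hypothetical name:

```python
import difflib


def edit_args_from_diff(original, modified):
    """Return the contiguous line span that differs, as Edit arguments."""
    old_lines = original.splitlines(keepends=True)
    new_lines = modified.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changes = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if not changes:
        return None  # nothing to edit
    # Span from the first changed line to the last changed line on each side.
    _, old_start, _, new_start, _ = changes[0]
    _, _, old_end, _, new_end = changes[-1]
    return {
        "old_string": "".join(old_lines[old_start:old_end]),
        "new_string": "".join(new_lines[new_start:new_end]),
    }


original = "def spell_word(a_word):\n    for letter in a_word:\n        print(letter)\n"
modified = 'def spell_word(a_word):\n    for letter in a_word:\n        print(letter, end=" ")\n'
args = edit_args_from_diff(original, modified)
print(args)
```

A pure insertion yields an empty `old_string`, which a string-replace style Edit can't anchor on, so a real pipeline would need to widen the span to include some unchanged context in that case.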

Here's an example:

Before:

User:
Develop a program that takes a user-provided phrase or sentence and returns the spelling 
for each word in the phrase, including any punctuation. The program should identify and 
handle programming errors and exceptions. In addition, correct any issues found in the 
below code snippet.
```python
def spell_word(a_word):
    for letter in a_word:
        print(letter)
spell_word("Hello, World!")
```

Assistant:
The code snippet itself does not have any errors. However, it doesn't align with the 
problem statement. Here is one possible solution written in Python:
```python
def spell_word(sentence):
    words = sentence.split(' ')
    for word in words:
        print(word+": ", end="")
        for letter in word:
            print(letter, end=" ")
        print()
try:
    user_input = input("Enter a sentence: ")
    spell_word(user_input)
except Exception as e:
    print("An error occurred: ", e)
```
    
Now in this program:
- The user is prompted to provide a sentence.
- The function spell_word splits the sentence into individual words and prints each word with its letters spelled out separately.
- Room has been made for exception handling to identify and manage any programming errors that could occur. If an error does occur, a message is printed with details about the error.

And after:

User:
<as above>

Assistant:
[Grep]
  pattern: def spell_word
  path: .

Tool Result:
  spell_word.py:1:def spell_word(a_word):
  spell_word.py:2:    for letter in a_word:
  spell_word.py:3:        print(letter)

[Read]
  file_path: spell_word.py

Tool Result:
    1→def spell_word(a_word):
    2→    for letter in a_word:
    3→        print(letter)
    4→
    5→spell_word("Hello, World!")

i'll set up the word splitter — splits on whitespace, prints each word with colon, 
then spells letters with single spaces

[Edit]
  file_path: spell_word.py
  old_string:
    def spell_word(a_word):
        for letter in a_word:
            print(letter)
    spell_word("Hello, World!")
  new_string:
    def spell_word(sentence):
        words = sentence.split(' ')
        for word in words:
            print(word+": ", end="")
            for letter in word:
                print(letter, end=" ")
            print()
    try:
        user_input = input("Enter a sentence: ")
        spell_word(user_input)
    except Exception as e:
        print("An error occurred: ", e)

Tool Result:
    1→def spell_word(sentence):
    2→    words = sentence.split(' ')
    3→    for word in words:
    4→        print(word+": ", end="")
    5→        for letter in word:
    6→            print(letter, end=" ")
    7→        print()
    8→
    9→try:
   10→    user_input = input("Enter a sentence: ")
   11→    spell_word(user_input)
   12→except Exception as e:
   13→    print("An error occurred: ", e)

Assistant:
done — splits on whitespace, prints each word with colon, then spells letters with single spaces, no trailing whitespace

Lastly, I wanted to provide demonstrations of more complex tool-use: long-context rollouts which emulate realistic coding agent use-cases. This would involve complex Bash tool usage, user rejections, and realistic environment interactions such as tools erroring when filenames aren't found, or when the model needs to use multiple commands to explore a codebase. This was a huge amount of effort, but I had a lot of fun. My final dataset comprised 2000 rows of these rollouts which were generated from scratch from an initial seed dataset of 2000 prompts covering a variety of problem domains and programming languages. The relevant code lives in dev/scenarios_to_rollouts.py and you can see the final dataset at smohammadi/nanocode-long-context.

Agentic Supervised Fine-tuning

It's finally time to teach our model how to be the agent we've always wanted:

# first, let's download our SFT rollout datasets
export HF_TOKEN=YOUR_HF_TOKEN

ROLLOUTS_DIR="$NANOCODE_BASE_DIR/rollouts"

hf download smohammadi/nanocode-tulu-selfoss-evol --repo-type dataset --local-dir "$ROLLOUTS_DIR/nanocode-tulu-selfoss-evol"
hf download smohammadi/nanocode-long-context --repo-type dataset --local-dir "$ROLLOUTS_DIR/nanocode-long-context"

python -u -m scripts.agentic_sft \
    --batch-size=32 \
    --minibatch-size=1 \
    --eval-every=500 \
    --sample-every=500

We're training on a mixture of general instruct data and our synthetic dataset above. Ablations here are really helpful, as I found myself iterating many times over data mixtures to obtain the results I wanted. And after an hour, you should see some sample generations at the end of training:

...
<|bos|><|user_start|>
Can you fix the bug in hello.py?<|user_end|>
<|assistant_start|>i'll fix the syntax error — missing colon after 'hello' in the function signature<|tool_call_start|>Edit<|tool_arg|>file_path<|tool_val|>hello.py<|tool_arg|>new_string<|tool_val|>def hello():
    return "Hello, World!"

print(hello())<|tool_call_end|>
# Assuming utils.py is located in the same directory as this file
utils_path = os.path.dirname(os.path.abspath
...
        rollout_bpb: 0.2587 | chat_bpb: 0.7862 | avg_bpb: 0.5225 | dt: 17.99s
Total training time: 59.30min

This was really exciting for me - the model has learned to respond in lowercase, and has successfully grasped the tool calling structure. Note that we only sample short completions during generations (~64 tokens) so responses are cut off. The final step in the CAI approach is preference learning which helps the model distinguish between outputs which are aligned with our SOUL and those which are not.

Direct Preference Optimisation

The original CAI paper used a pretty heavyweight form of preference learning based on Reinforcement Learning From Human Feedback (RLHF). This involves training a reward model on the preference data we collected earlier, and using an online reinforcement learning algorithm like PPO to align the model. But we definitely don't have time for this. Instead, Direct Preference Optimisation (DPO) formulates the RLHF objective as a direct supervised objective on preference pairs which eliminates the need for a reward model. You can think of it as a kind of binary classification over preference pairs which penalises the model's log-probabilities over dis-preferred outputs, and rewards the model for assigning higher relative probabilities to preferred outputs.
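
For intuition, here is the standard per-pair DPO loss written in plain Python. It's a sketch of the textbook objective, not the implementation in `scripts.dpo` (which is in JAX and may differ in details); `beta=0.1` and the example log-probabilities are made-up values for illustration:

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    The margin compares how much more the policy prefers the chosen answer
    over the rejected one, relative to a frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin), margin


# Before any training the policy equals the reference, so the margin is zero.
start_loss, start_margin = dpo_loss(-12.0, -12.0, -12.0, -12.0)

# After training, the policy has moved towards chosen and away from rejected.
later_loss, later_margin = dpo_loss(-10.0, -15.0, -12.0, -12.0)

print(start_loss, start_margin)
print(later_loss, later_margin)
```

A pair counts as "correct" when its margin is positive, so averaging `margin > 0` over a batch gives an accuracy-style metric. How the training script scales or aggregates its reported margins is not something this sketch tries to match.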

# download DPO preference datasets
hf download smohammadi/nanocode-tulu-selfoss-evol-preference --repo-type dataset --local-dir "$ROLLOUTS_DIR/nanocode-tulu-selfoss-evol-preference"
hf download smohammadi/nanocode-long-context-preference --repo-type dataset --local-dir "$ROLLOUTS_DIR/nanocode-long-context-preference"

python -u -m scripts.dpo \
    --batch-size=32 \
    --minibatch-size=1 \
    --eval-every=100 \
    --sample-every=100

Overall I'm not sure how much DPO is actually doing for a model of this size and for our limited token budget, particularly since our model is so over-tuned on our SOUL-aligned datasets. That said, I think it did something, as you can see from the plots above and table below. The accuracy (the capability of the model to assign higher log-probabilities to chosen answers over rejected answers) went from 0.46 -> 0.88, and the validation bits-per-byte on the synthetic datasets didn't show any meaningful regression (0.2477 -> 0.2481).

|metric|first step|last step|
|---|---|---|
|loss|1.5190|0.7260|
|accuracy|0.4570|0.8830|
|margins|-0.0410|0.4580|
|rollout_val_bpb|0.2477|0.2481|
|MFU|0.2200|1.8200|
|tkps|1570|13259|

I think if I had more time with the TRC program I would spend it developing more rigorous evals which specifically target nanocode's agentic capabilities. The benefits of the CAI preference learning step are clearer for larger models which are trained across far more domains and tasks, as it can help optimise away unwanted behaviours which the model learned during SFT across many datasets which won't be generated using the Constitutional SFT approach.

Now we can try our agent out! Kick off the agentic CLI with:

python -m scripts.nanocode --max-tokens=4096 --attn-impl=splash

Through this interface nanocode can interact with your UNIX system through its tool calls (you are required to give permission for each one). Give it a try by asking it to explore the nanocode codebase, or a specific function you're interested in! Note that whilst nanocode has pretty successfully grasped the tool interface, it is still a very under-tuned and small model. I expect that it will struggle with complex bug-fixes, or coding tasks which it hasn't seen in its training data.

Finally, we can pull all of the logs we've created and structure them into a nice report for our run:

python -m scripts.report

This can then be copied over onto your local machine and converted into HTML by running this command:

gcloud compute tpus tpu-vm scp \
  --zone=europe-west4-a \
  --recurse \
  nanocode:/users/salmanmohammadi/nanocode/reports/d24 \
  .
brew install pandoc # if you don't already have it installed
pandoc d24/report.md -o d24/report.html --standalone
open d24/report.html

Your turn

I'd love to see what you can come up with. The codebase is designed to be minimal and hackable, and it would be great to see how you instill character and personality into your own agentic coding partner through your own SOUL. You can re-write the tool spec and interface to something that's super customized, and the synthetic data generation pipeline can help you adapt nanocode to your own use-cases. The codebase is only around 5.5K lines of code which should comfortably fit in the context window of a modern LLM.

I also hope this repo helps you better understand how JAX works and how it can be used to write really simple and elegant performant code. I've worked with (and contributed to) PyTorch for a long time and I found JAX to be really refreshing; XLA is an incredible compiler and the profiling tooling is lovely to work with.
