Introducing nanocode: The best Claude Code that $200 can buy. #1
salmanmohammadi
I'm so excited to share nanocode. This is a library showing you how to speedrun your own Claude Code end-to-end: from pretraining, to SFT, and RLHF using DPO. To a first approximation, we will follow the simplest possible approach for training using Constitutional AI - the approach used by Anthropic to train their Claude models. We'll write our own SOUL.md, define the agentic interface which our model will use to interact with the world, generate synthetic data, and use preference optimisation to align the model with our SOUL.
nanocode is written entirely in JAX and designed to be trained using TPUs. I adapted the core training infrastructure and philosophy from Karpathy's incredible nanochat project, so if you're familiar with nanochat, nanocode should feel very similar. By the end of this post you can expect to replicate my 1.3B parameter d24 GPT model: nanocode.mp4
You can get started for free using the Google TRC program which gives you free access to pre-emptible TPUs for a month - and I think new Google Cloud accounts also get $300 in credits. I was fortunate to have access to the TRC program for 3 months for this project, and I found most of the time that my spot instances were rarely interrupted and I could easily have the same pod up for a week or more.
You can speedrun nanocode-d24 (1.3B params) in around 9 hours in total on a TPU v6e-8 costing $200, or train nanocode-d20 (477M params) in ~1.5 hours costing $34. If you're using NVIDIA GPUs, nanocode should also work out of the box, but you should be aware that nanocode has been highly optimised for TPUs.
Training nanocode: a friendly agentic coding partner
Andrej's original release post for nanochat does a great job of explaining what we're doing here, and the commands you'll use in nanocode are virtually identical, so I'd recommend reading through his work first. I'll go over what we've done differently to elicit agentic coding behaviours from our model.
Tokenization and Pre-training
The pre-training and tokenizer training processes are pretty much identical to nanochat's, but I found that including additional coding data from The Stack V2 at a ratio of 1:5 in both the pre-training and tokenizer mixture resulted in a stronger coding model and more efficient code tokenization, which helped a ton.
Let's first download the dataset shards we'll need for tokenizer training and model pre-training:
And kick off our tokenizer training script:
For reference, we can compare with nanochat's tokenizer which is identical aside from the addition of The Stack in the training mixture (well, I've also added special tokens and templating logic to support more sophisticated tool calling, but more on that later).
We can see that this gives a big boost for code at the cost of general text tokenization efficiency, but this is okay since we want our model to do one thing very well: agentic coding. Our models are trained with a param:data ratio of 8 (following nanochat's scaling law analysis). Let's kick off a training run like so:
    python -u -m scripts.base_train \
        --batch-size=32 \
        --minibatch-size=1 \
        --config=configs.d24 \
        --eval-every=500 \
        --sample-every=500
You should see something like this:
Our model has attained some knowledge about the world, which is nice. It still doesn't know about Saturday though : ). Let's look at some more thorough quantitative results, since we only estimate metrics using a smaller subset of the evaluation data during training:
This will print a whole bunch of metrics, but the relevant ones are bits-per-byte across our pretraining sets: sv2 (The Stack V2) and fwe (FineWeb-EDU), and the CORE metric which makes comparing against nanochat's results and GPT-2 straightforward. I've compiled the results across a few model parameter sizes to get a feel for our scaling laws:
Since CORE measures general language reasoning capabilities and we've geared our models towards code data, it's expected that our CORE scores drop slightly compared to the corresponding GPT-2 models. Training d24 on FineWeb-EDU alone resulted in a CORE score of 0.261 which lines up with GPT-2 XL below and nanochat-d24. The tradeoff here is that we expect our models to perform well in coding tasks.
I'll mostly be referring to our d24 GPT model throughout this post, which is similar to nanochat's d24 GPT model but is trained with twice the context length (4096 vs. 2048) to better support multi-turn agentic conversations. Now that we have a reasonably capable coding base model, let's look at how we can turn it into a fully-fledged agentic coding partner.
The Agentic Marathon
Let's think a bit about what agentic models are doing from first principles. Pre-training LLMs produces next-token-generators which have compressed a vast amount of knowledge, but they aren't really useful for things like following instructions, answering questions about the knowledge they have, or fixing bugs in Python files. There's a bunch more work to do in trying to get our models to do useful things. The first step is templating - delimiting different components of the input and output so the model learns the structure of the task it's being asked to perform. Let's take chat templating as an example. Conversation can be structured as turns, where each side takes a turn at a time - so our model needs to know whose turn it is, and what they've said.
User: What is 2+2?
Assistant: 4
We can template this as:
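A minimal sketch of that templating in Python, assuming each turn is simply wrapped in its start and end tokens with nothing in between. `render_chat` is a hypothetical helper, not part of nanocode, and the real template may add further tokens (for example a beginning-of-sequence marker):

```python
# The four special tokens named in this post. How they are laid out around
# the text is an assumption made for illustration.
USER_START, USER_END = "<|user_start|>", "<|user_end|>"
ASSISTANT_START, ASSISTANT_END = "<|assistant_start|>", "<|assistant_end|>"


def render_chat(turns):
    """Wrap each (role, text) turn in that role's start/end special tokens."""
    delimiters = {
        "user": (USER_START, USER_END),
        "assistant": (ASSISTANT_START, ASSISTANT_END),
    }
    pieces = []
    for role, text in turns:
        start, end = delimiters[role]
        pieces.append(f"{start}{text}{end}")
    return "".join(pieces)


rendered = render_chat([("user", "What is 2+2?"), ("assistant", "4")])
print(rendered)
```

When tokenizing, each of these delimiter strings would map to a single reserved token id rather than being split into sub-word pieces.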
<|user_start|>, <|user_end|>, <|assistant_start|>, and <|assistant_end|> are special tokens which help provide structure to raw text. We typically reserve a whole token for them when tokenizing. Great. Now let's think about the kind of templating we might use for an agentic model. The basis for agentic behaviour is tool-calling - a kind of task where the model's turn isn't directed towards the user, but may instead be an action through an interface with the real world, and which produces outputs which the model may respond to in real-time.
If we look at it this way the outputs of a tool call can just be treated as another kind of turn, so we reserve two additional special tokens <|tool_result_start|> and <|tool_result_end|> so our model knows when information is coming from a tool call, and not the user. Now we just need a way to let our model know how to make tool calls - we'll need templating for the name of the tool the model wishes to invoke and (optionally) any keyword-arguments it needs to pass through. Let's take grep as an example:
This would look something like this:
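As a rough sketch of the idea in Python: the token layout below (tool name first, then alternating argument names and values), the closing token name `<|tool_call_end|>`, and the `command` argument name are all assumptions made for illustration, and both helpers are hypothetical rather than nanocode's actual code. The parser is the inverse step that a wrapper program would need in order to intercept a call from the model's output:

```python
import re

TOOL_CALL_START = "<|tool_call_start|>"
TOOL_CALL_END = "<|tool_call_end|>"  # assumed name for the closing token
TOOL_ARG = "<|tool_arg|>"
TOOL_VAL = "<|tool_val|>"


def render_tool_call(name, **kwargs):
    """Serialise a tool name and its keyword arguments into one templated string."""
    parts = [TOOL_CALL_START, name]
    for key, value in kwargs.items():
        parts += [TOOL_ARG, key, TOOL_VAL, str(value)]
    parts.append(TOOL_CALL_END)
    return "".join(parts)


# Non-greedy match so several tool calls in one response are found separately.
_CALL_RE = re.compile(
    re.escape(TOOL_CALL_START) + r"(.*?)" + re.escape(TOOL_CALL_END), re.DOTALL
)


def parse_tool_calls(text):
    """Recover (name, args) pairs from text that embeds templated tool calls."""
    calls = []
    for body in _CALL_RE.findall(text):
        name, *arg_chunks = body.split(TOOL_ARG)
        args = {}
        for chunk in arg_chunks:
            key, _, value = chunk.partition(TOOL_VAL)
            args[key] = value
        calls.append((name, args))
    return calls


# The call is nested inside ordinary response text, so the model can explain itself.
response = "let me search for that. " + render_tool_call(
    "Bash", command="grep -rn 'def train' ."
)
print(response)
print(parse_tool_calls(response))
```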
We've defined special tokens for delimiting the entire tool call (<|tool_call_start|> and end), and for delimiting different named arguments for that tool call (<|tool_arg|> and <|tool_val|>). Note that the model is able to think through and explain its actions by nesting the tool-call template inside its response.
It's important to think about what your final agentic interface is actually going to look like - you don't want to come up with a tool calling template and spend $$$ using it to train your model only to find out it doesn't work in practice. When defining our tools we are trading off expressivity against tractability, i.e. how easy it is for the model to actually learn to use a tool reliably. For the simplest possible agent we want it to interact with a UNIX environment by reading files, searching filesystems, and writing to disk. Above we used a Bash tool call, but if we only used Bash for everything the model would effectively have to learn correct shell syntax, including quoting, flags, and piping, just from examples. Instead we can anticipate that something like grep is something the model is going to be doing often enough that we should give it a dedicated tool call. For nanocode's agentic interface, I defined four tools:
This lets nanocode read and write files, search for patterns, and use UNIX commands when needed - though I don't anticipate that we can obtain a model which learns meaningful Bash tool usage with our compute and token budget. Based on these tool calls, our agentic CLI would just be a thin wrapper which parses the model's predicted tokens, intercepts any tool calls, and executes them, providing the result to the model as a kind of conversational turn.
Learning by example
Okay, how do we teach our model to use these tools? The simplest way is to just train the model on hundreds of thousands of examples of this tool use. These examples could look something like this:
This is a pretty rough sketch, but you get the idea - the user makes a request, and the model fulfils it by using one or more of the tools it has available. It also makes a goofy little remark to explain what it's doing. We mentioned above we're training the best Claude Code we can, and you may be familiar with Claude's soul document - a written specification of the model's character, values, and behavioural principles. Anthropic uses this document to guide how Claude is trained: it defines the desired behaviour, then training data and preference optimization are shaped to align the model with that specification.
This is the core idea behind Constitutional AI (CAI) - which was used to train early Claude models (evolutions of this technique are still used to train Claude). Constitutional AI is a training process comprising synthetic data generation, supervised fine-tuning, and preference optimisation, all in order to align a model with a specified set of characteristics and constitutional principles, or SOUL. Note that while CAI as an alignment approach is focused on producing helpful and harmless agents - in particular preventing models from producing harmful answers - our use is primarily for stylistic alignment of our model.
For nanocode's SOUL, I wanted it to have a unique voice: casual, friendly, and a little goofy, but without being sycophantic or overly verbose. This is what I came up with. To summarise, nanocode should only use lowercase (though proper nouns are acceptable in code), it should be warm and friendly, and it only follows the precise instructions which it has been given. Reflecting on this I probably didn't need the philosophical fluff, particularly for models of these sizes. Our SOUL is pretty simple compared to Claude's, but as we mentioned, we want our model to be very good at only a couple things: agentic coding, and adhering to a personality which we've curated for it.
Constitutional AI instills this SOUL into a model through two stages: 1) Constitutional Supervised Fine-tuning (SFT) and 2) Reinforcement Learning from AI Feedback (RLAIF) - the preference learning stage.
1) Constitutional SFT
As I mentioned above, we need examples of our specific tool usage as well as conversational turns which adhere to our model's SOUL. The Constitutional SFT stage is a synthetic data generation pipeline which you can think of as a mix of rejection sampling and distillation. For our use case, the loop looks like this:
[...] SOUL, and is asked to rate the model's response:
[...] SOUL. We save this first failed attempt as our Rejected sample.
At the end of this process, we obtain two responses for a given prompt: a final response which is strongly aligned with the SOUL, and the initial, misaligned response. We'll use these pairs later for the preference learning stage, but for our Constitutional SFT stage, we'll just be training our model on the (Initial prompt, Chosen sample) pairs. It's worth noting that the critique loop is essential when your generator model can't reliably produce SOUL-aligned outputs in a single pass - which was the case for most of the smaller open-source models I ran locally through vLLM on TPUs. Frontier models through OpenRouter pretty much nailed things first try. I want to say that the approach I detailed here was the first one I tried but really this part of the project took a couple months of iterations and ablations.
I landed on two approaches for nanocode. Firstly, I generated a dataset comprising short, single-turn conversations which teach our model the fundamental agentic loop of Grep/Read, then Edit to write a solution which solves the task at hand. Importantly, it teaches our model how to understand the syntax of our tools and their results. To seed this dataset, I reused existing Python open-source instruct datasets:
This turned out to be a great way to bootstrap our synthetic dataset generation process, as it provided ~120K high-quality samples with correct Python solutions and model explanations - we just need to apply the generate -> critique loop above to massage this into our format. You can see more in dev/process_datasets.py and the final dataset smohammadi/nanocode-tulu-selfoss-evol, and I'll use an example here to illustrate what our final dataset looked like:
And after:
Here I re-used the initial prompt, then converted the model's original solution into the Edit tool call by extracting the generated code and wrapping it in our tool templating. I'm also using line numbering here (1->) as I believed this would help the model make targeted Grep and Edit calls if it could see line numbers in files it was reading and editing.
Secondly, unlike tulu and self-oss where the user just asks for code, many samples in evol-codealpaca contain code in both the instruction and the output, where the user is describing existing code and asking for a modification. This was great for mapping to multi-turn rollouts with tool chaining. To do this, I first computed a diff between the original and modified code to get targeted old_string/new_string arguments for the Edit, then prepended Grep and Read steps to obtain rollouts where the agent searches for the relevant function, reads the file, then makes a targeted edit.
Here's an example:
Before:
And after:
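The diff step described above can be sketched with Python's standard difflib. This is one plausible way to derive old_string/new_string, not necessarily what dev/process_datasets.py does. `edit_args_from_diff` is a hypothetical helper, and the snippet it runs on is made up for illustration:

```python
import difflib


def edit_args_from_diff(original, modified):
    """Return (old_string, new_string) spanning the first to last changed line."""
    old_lines = original.splitlines(keepends=True)
    new_lines = modified.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    # Each opcode is (tag, i1, i2, j1, j2); keep only the non-matching regions.
    changed = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if not changed:
        return None  # the two versions are identical, so there is nothing to edit
    old_start, old_end = changed[0][1], changed[-1][2]
    new_start, new_end = changed[0][3], changed[-1][4]
    old_string = "".join(old_lines[old_start:old_end])
    new_string = "".join(new_lines[new_start:new_end])
    return old_string, new_string


# Made-up example: the "user" asks for a bug fix in an existing function.
original = "def add(a, b):\n    return a - b\n\nprint(add(1, 2))\n"
modified = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))\n"

old_string, new_string = edit_args_from_diff(original, modified)
print(repr(old_string))
print(repr(new_string))
```

A pure insertion yields an empty old_string here, so a real pipeline would likely need to widen the span with a line of surrounding context so that the Edit has something unique to match.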
Lastly, I wanted to provide demonstrations of more complex tool-use: long-context rollouts which emulate realistic coding agent use-cases. This would involve complex Bash tool usage, user rejections, and realistic environment interactions such as tools erroring when filenames aren't found, or when the model needs to use multiple commands to explore a codebase. This was a huge amount of effort, but I had a lot of fun. My final dataset comprised 2000 rows of these rollouts which were generated from scratch from an initial seed dataset of 2000 prompts covering a variety of problem domains and programming languages. The relevant code lives in dev/scenarios_to_rollouts.py and you can see the final dataset at smohammadi/nanocode-long-context.
Agentic Supervised Fine-tuning
It's finally time to teach our model how to be the agent we've always wanted:
We're training on a mixture of general instruct data and our synthetic dataset above. Ablations here are really helpful, as I found myself iterating many times over data mixtures to obtain the results I wanted. And after an hour, you should see some sample generations at the end of training:
This was really exciting for me - the model has learned to respond in lowercase, and has successfully grasped the tool calling structure. Note that we only sample short completions during generations (~64 tokens) so responses are cut off. The final step in the CAI approach is preference learning which helps the model distinguish between outputs which are aligned with our SOUL and those which are not.
Direct Preference Optimisation
The original CAI paper used a pretty heavyweight form of preference learning based on Reinforcement Learning From Human Feedback (RLHF). This involves training a reward model on the preference data we collected earlier, and using an online reinforcement learning algorithm like PPO to align the model. But we definitely don't have time for this. Instead, Direct Preference Optimisation (DPO) formulates the RLHF objective as a direct supervised objective on preference pairs, which eliminates the need for a reward model. You can think of it as a kind of binary classification over preference pairs which penalises the model's log-probabilities over dis-preferred outputs, and rewards the model for assigning higher relative probabilities to preferred outputs.
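To make the "binary classification over preference pairs" intuition concrete, here is a small pure-Python sketch of the standard DPO loss for a single pair. It takes summed sequence log-probabilities under the policy being trained and under a frozen reference model (typically the SFT checkpoint). The function name, the beta value, and the example numbers are illustrative assumptions rather than nanocode's settings, and the actual training code is written in JAX:

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    # How much more the policy likes each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))


# Policy identical to the reference: the pair is a coin flip, so loss = log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))

# Policy has raised the chosen response and lowered the rejected one: lower loss.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The sign of the inner margin is what a preference accuracy metric would count: a pair is classified correctly when the margin is positive.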
Overall I'm not sure how much DPO is actually doing for a model of this size and for our limited token budget, particularly since our model is so over-tuned on our SOUL-aligned datasets. That said, I think it did something, as you can see from the plots above and table below. The accuracy (the capability of the model to assign higher log-probabilities to chosen answers over rejected answers) rose from 0.45 to 0.88, and the validation bits-per-byte on the synthetic datasets didn't show any meaningful regression (0.247 -> 0.248).
I think if I had more time with the TRC program I would spend it developing more rigorous evals which specifically target nanocode's agentic capabilities. The benefits of the CAI preference learning step are clearer for larger models which are trained across far more domains and tasks, as it can help optimise away unwanted behaviours which the model learned during SFT across many datasets which won't be generated using the Constitutional SFT approach.
Now we can try our agent out! Kick off the agentic CLI with:
Through this interface nanocode can interact with your UNIX system through its tool calls (you are required to give permission for each one). Give it a try by asking it to explore the nanocode codebase, or a specific function you're interested in! Note that whilst nanocode has pretty successfully grasped the tool interface, it is still a very under-tuned and small model. I expect that it will struggle with complex bug-fixes, or coding tasks which it hasn't seen in its training data.
Finally, we can pull all of the logs we've created and structure them into a nice report for our run:
This can then be copied over onto your local machine and converted into HTML by running this command:
Your turn
I'd love to see what you can come up with. The codebase is designed to be minimal and hackable, and it would be great to see how you instill character and personality into your own agentic coding partner through your own SOUL. You can re-write the tool spec and interface to something that's super customized, and the synthetic data generation pipeline can help you adapt nanocode to your own use-cases. The codebase is only around 5.5K lines of code which should comfortably fit in the context window of a modern LLM.
I also hope this repo helps you better understand how JAX works and how it can be used to write really simple and elegant performant code. I've worked with (and contributed to) PyTorch for a long time and I found JAX to be really refreshing; XLA is an incredible compiler and the profiling tooling is lovely to work with.