Bare-metal LLM inference engine for Raspberry Pi 4. Runs transformer models directly on the hardware with no operating system, just the four Cortex-A72 cores, NEON SIMD, and a UART serial console.
Supports multiple model architectures (GPT2, Llama) loaded from a USB drive. On boot, presents a model selection menu over UART, auto-detects the engine type from the model binary, and enters an interactive chat loop.
- Multi-architecture support - GPT2 and Llama engines, each self-contained with their own tokenizer, sampler, and chat template. Engine type auto-detected from model.bin header magic.
- Multi-core inference - All 4 Cortex-A72 cores run parallel matrix multiplications using NEON fp16->fp32 widening with WFE/SEV synchronization.
- USB 3.0 model loading - Full bare-metal stack: PCIe root complex -> xHCI (VL805) -> USB Mass Storage -> FAT32 with Long File Name support.
- Model hot-switching - Type /exit during chat to return to the model selection menu. The heap is reset, USB is re-initialized, and a different model can be loaded without rebooting.
- GPU firmware coexistence - Automatically detects and skips the VideoCore firmware memory region during model loading. A periodic mailbox keepalive prevents the GPU watchdog from scrubbing DRAM.
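The multi-core matmul can be pictured as a row split: each core takes a contiguous band of output rows and widens fp16 weights to fp32 while accumulating. The sketch below is a portable scalar illustration of that idea, not the code in smp.c. The names half_to_float, core_row_range, and matmul_rows are hypothetical, and the even row split is an assumption. On the Pi the inner loop would use NEON, and WFE/SEV would wake and join the worker cores.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_CORES 4  /* the four Cortex-A72 cores */

/* Widen one IEEE 754 half-precision value to single precision
 * (scalar stand-in for the NEON fp16->fp32 widening). */
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t man = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0 && man == 0) {
        bits = sign;                                  /* signed zero */
    } else if (exp == 0) {
        exp = 113;                                    /* subnormal: renormalize */
        while ((man & 0x400u) == 0) { man <<= 1; exp--; }
        bits = sign | (exp << 23) | ((man & 0x3FFu) << 13);
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);      /* inf or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Contiguous band of output rows owned by `core` (assumed even split). */
static void core_row_range(int core, int n_rows, int *start, int *end)
{
    assert(core >= 0 && core < NUM_CORES);
    int per_core = (n_rows + NUM_CORES - 1) / NUM_CORES;
    *start = core * per_core;
    *end = *start + per_core;
    if (*start > n_rows) *start = n_rows;
    if (*end > n_rows) *end = n_rows;
}

/* out[i] = sum_j w[i][j] * x[j] for rows in [row_start, row_end). */
static void matmul_rows(float *out, const float *x, const uint16_t *w,
                        int n_cols, int row_start, int row_end)
{
    for (int i = row_start; i < row_end; i++) {
        float acc = 0.0f;
        for (int j = 0; j < n_cols; j++)
            acc += half_to_float(w[(size_t)i * n_cols + j]) * x[j];
        out[i] = acc;
    }
}
```

Because each core writes a disjoint slice of out, no locking is needed inside the loop; the cores only have to synchronize at the start and end of each matmul.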
- Raspberry Pi 4 Model B (BCM2711, 4x Cortex-A72)
- USB 3.0 flash drive with FAT32 partition containing model directories
- USB-to-UART adapter connected to GPIO 14/15 (UART0, 115200 baud)
- microSD card with boot firmware (see sdcard/)
/GPT2-286M/
model.bin # exported model weights
tokenizer.bin # token vocabulary
/Supra-50M-Instruct/
model.bin
tokenizer.bin
Each subdirectory is one model. The engine type is detected automatically from the first 4 bytes of model.bin:
- 0x4E434854 ("NCHT"): GPT2 engine
- Anything else: Llama engine (this fallback will need revisiting when another architecture is added)
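A minimal sketch of this detection rule follows. It assumes the magic is stored as a little-endian 32-bit integer, which the text above does not state; if the exporter writes the ASCII bytes "NCHT" in file order, the byte order in read_u32_le would need to be reversed. The names EngineType, ENGINE_GPT2, ENGINE_LLAMA, read_u32_le, and detect_engine are hypothetical and are not taken from loadmodel.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GPT2_MAGIC 0x4E434854u  /* "NCHT", the value quoted above */

typedef enum { ENGINE_GPT2 = 0, ENGINE_LLAMA = 1 } EngineType;

/* Assemble 4 bytes into a uint32, least significant byte first
 * (little-endian storage is an assumption). */
static uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Pick an engine from the first 4 bytes of model.bin. */
static EngineType detect_engine(const uint8_t *model, size_t len)
{
    assert(model != NULL);
    if (len >= 4 && read_u32_le(model) == GPT2_MAGIC)
        return ENGINE_GPT2;
    return ENGINE_LLAMA;  /* fallback: anything else is treated as Llama */
}
```

Since Llama is the fallback rather than a positively identified format, a corrupt or unrelated file would also be routed to the Llama engine under this scheme.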
Requires the aarch64-none-elf bare-metal GCC toolchain (tested with Arm GNU Toolchain 13.2).
make build
Output:
- build/inferpi.elf - ELF for JTAG/GDB loading
- build/kernel8.img - flat binary for SD card boot
cp build/kernel8.img sdcard/
The sdcard/ directory contains the required boot partition files:
| File | Purpose |
|---|---|
| config.txt | GPU bootloader configuration (gpu_mem=32) |
| start4.elf | VideoCore firmware |
| fixup4.dat | GPU firmware fixup |
| bcm2711-rpi-4-b.dtb | Device tree |
| armstub8-rpi4.bin | ARM stub for EL2 entry |
| kernel8.img | InferPi kernel image (flat binary) |
Key config.txt setting: gpu_mem=32 minimizes the GPU firmware memory region to reduce the skip gap during model loading.
0x00080000 - 0x04300000 ELF code + BSS + 64 MB heap
0x08000000 - ~0x4C400000 model.bin (up to 1.18 GB)
0x3F000000 - 0x3FFFFFFF GPU firmware (skipped during load)
0x50000000 - ~0x50100000 tokenizer.bin (up to 1 MB)
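One way to picture the skip is as a mapping from a byte offset inside model.bin to a physical address that hops over the GPU firmware region. The sketch below uses the addresses from the map above. The function names, and the assumption that the file simply continues at 0x40000000 after the gap, are illustrative and are not confirmed by loadmodel.c.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_BASE 0x08000000ull  /* start of the model.bin region */
#define GPU_START  0x3F000000ull  /* first byte of the GPU firmware region */
#define GPU_END    0x40000000ull  /* one past 0x3FFFFFFF */

static_assert(MODEL_BASE < GPU_START, "model region must start below the gap");

/* Map a byte offset within model.bin to a physical address,
 * hopping over the GPU firmware region. */
static uint64_t model_offset_to_addr(uint64_t offset)
{
    uint64_t addr = MODEL_BASE + offset;
    if (addr >= GPU_START)
        addr += GPU_END - GPU_START;
    return addr;
}

/* Largest read starting at `offset` that stays physically contiguous,
 * so that a single transfer never lands in the firmware region. */
static uint64_t contiguous_bytes(uint64_t offset, uint64_t remaining)
{
    uint64_t addr = MODEL_BASE + offset;
    if (addr < GPU_START && remaining > GPU_START - addr)
        return GPU_START - addr;
    return remaining;
}
```

Under this layout, any tensor that straddles the gap is not contiguous in memory, so the engine has to resolve weight pointers with the same mapping.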
- RPi4 GPU firmware loads kernel8.img and jumps to the init entry point at EL2
- CPU0 initializes UART, NEON, MMU, and caches; secondary cores enter worker loops
- USB storage stack initializes: PCIe -> VL805 xHCI -> USB device enumeration -> FAT32
- Root directory scanned for model subdirectories containing model.bin
- Model selection menu displayed over UART
- Selected model's model.bin and tokenizer.bin loaded into RAM
- Engine auto-detected, initialized, and chat loop entered
inferpi/
boot/ # AArch64 bootstrap, exception vectors, stack setup
drivers/
emmc/ # SD card controller (eMMC2)
fat32/ # FAT32 filesystem with LFN support
pcie/ # BCM2711 PCIe root complex + VL805 firmware load
uart/ # PL011 UART serial I/O
usb_storage/ # USB Mass Storage (Bulk-Only Transport)
xhci/ # xHCI USB 3.0 host controller
include/ # Headers
lib/ # C runtime (crt0, clock, stdio retarget)
linker/ # Linker script
mmu/ # Page table setup (EL2, 36-bit PA for PCIe at 0x600000000)
src/
main.c # Entry point, model menu, engine dispatch
inferpi.c # Shared bump allocator (64 MB), timer, string functions
loadmodel.c # USB storage init, model scanning, file loading
smp.c # Multi-core NEON matmul (4 cores, WFE/SEV sync)
engines/
gpt2.c # GPT2 engine (RoPE, RMSNorm, GQA, ReLU^2, smear gate)
llama.c # Llama engine (RoPE, RMSNorm, GQA, SwiGLU, Alpaca template)
sdcard/ # Boot partition files for microSD
tools/
export_gpt2.py # Convert GPT2 checkpoints to model.bin + tokenizer.bin
export_llama.py # Convert Llama checkpoints to model.bin + tokenizer.bin
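The shared bump allocator in inferpi.c is what makes hot-switching cheap: allocations only move a pointer forward, and resetting the heap drops everything at once. The following is a minimal sketch of that pattern, not the project's code. The BumpHeap type, the function names, and the 16-byte alignment are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *base;  /* start of the heap region */
    size_t size;    /* total bytes available (64 MB on InferPi) */
    size_t used;    /* bytes handed out so far */
} BumpHeap;

static void bump_init(BumpHeap *h, void *base, size_t size)
{
    assert(((uintptr_t)base & 15u) == 0);  /* base must be 16-byte aligned */
    h->base = base;
    h->size = size;
    h->used = 0;
}

/* Hand out `n` bytes aligned to 16 (assumed alignment, convenient for
 * NEON loads). Returns NULL when the heap is exhausted. */
static void *bump_alloc(BumpHeap *h, size_t n)
{
    size_t start = (h->used + 15u) & ~(size_t)15u;
    if (start > h->size || n > h->size - start)
        return NULL;
    h->used = start + n;
    return h->base + start;
}

/* Drop every allocation at once, as when returning to the model menu. */
static void bump_reset(BumpHeap *h)
{
    h->used = 0;
}
```

There is no per-allocation free, which suits an inference engine whose buffers all live until the model is unloaded.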
- Create src/engines/myengine.c
- Implement init() and chat_loop() functions
- Export an Engine myengine_engine struct
- Add ENGINE_MYENGINE constant to include/engine.h, increment ENGINE_COUNT
- Register in the dispatch table in main.c
- Update the auto-detection logic in loadmodel.c (or add a new magic number to your model format)
- Add the source to Makefile
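The steps above can be sketched as follows. The real Engine definition lives in include/engine.h and is not shown in this README, so the struct fields, the function signatures, the ENGINE_GPT2 and ENGINE_LLAMA constants, and the run_engine helper are all assumptions made for illustration. Only Engine, myengine_engine, ENGINE_MYENGINE, ENGINE_COUNT, init(), and chat_loop() come from the steps themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shape of the engine interface. */
typedef struct {
    const char *name;
    int (*init)(const void *model, const void *tokenizer);  /* 0 on success */
    void (*chat_loop)(void);
} Engine;

enum { ENGINE_GPT2, ENGINE_LLAMA, ENGINE_MYENGINE, ENGINE_COUNT };

static int myengine_ready;

static int myengine_init(const void *model, const void *tokenizer)
{
    if (model == NULL || tokenizer == NULL)
        return -1;
    myengine_ready = 1;  /* a real engine would map weights and build state here */
    return 0;
}

static void myengine_chat_loop(void)
{
    /* read a prompt from UART, generate, print; omitted in this sketch */
}

const Engine myengine_engine = {
    .name = "MyEngine",
    .init = myengine_init,
    .chat_loop = myengine_chat_loop,
};

/* Dispatch table indexed by engine type; GPT2 and Llama entries omitted. */
static const Engine *const engines[ENGINE_COUNT] = {
    [ENGINE_MYENGINE] = &myengine_engine,
};

/* Initialize the selected engine and enter its chat loop. */
static int run_engine(int type, const void *model, const void *tokenizer)
{
    assert(type >= 0 && type < ENGINE_COUNT);
    const Engine *e = engines[type];
    if (e == NULL || e->init(model, tokenizer) != 0)
        return -1;
    e->chat_loop();
    return 0;
}
```

Keeping each engine behind a small table of function pointers lets main.c stay unaware of tokenizer, sampler, and chat template details.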
The GPT2-286M model was trained from scratch using Karpathy's nanochat framework on 2x NVIDIA GPUs (RTX 5070 Ti + RTX 5060 Ti) with DDP via torchrun.
- 286M parameters, 12 layers, 768 embedding dim
- 32,768 vocab BPE tokenizer, 2,048 max sequence length
- Window pattern: SSSL
Trained on the ClimbMix dataset (~1.4 GB, 17 shards) for 2,520 iterations (~1.3B tokens) with a batch size of 524,288 tokens and bf16 precision. Gradient accumulation: 16 steps per rank, device batch size 8 per GPU.
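The batch-size and token figures follow directly from the training settings, and the arithmetic can be checked with a few lines of code. This is only a worked calculation using the numbers quoted above; the TrainConfig type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t num_gpus;           /* 2 */
    uint64_t device_batch_size;  /* 8 sequences per GPU */
    uint64_t seq_len;            /* 2,048 tokens */
    uint64_t grad_accum_steps;   /* 16 per rank */
} TrainConfig;

/* Tokens consumed by one optimizer step across all ranks. */
static uint64_t tokens_per_step(const TrainConfig *c)
{
    assert(c != NULL);
    return c->num_gpus * c->device_batch_size * c->seq_len * c->grad_accum_steps;
}

/* Tokens seen over the whole run. */
static uint64_t total_tokens(const TrainConfig *c, uint64_t iterations)
{
    return tokens_per_step(c) * iterations;
}
```

With 2 GPUs, device batch size 8, sequence length 2,048, and 16 accumulation steps, one step covers 2 x 8 x 2,048 x 16 = 524,288 tokens, and 2,520 iterations cover 1,321,205,760 tokens, which matches the ~1.3B figure.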
cd /path/to/nanochat/
nanochat-env/bin/python -m torch.distributed.run \
--nproc_per_node=2 \
-m scripts.base_train \
-- \
--depth=12 \
--max-seq-len=2048 \
--device-batch-size=8 \
--target-param-data-ratio=12 \
--save-every=500 \
--run=dummy
Completed in ~6.2 hours. Final validation bpb: 0.850675.
The base model does text completion only. SFT fine-tunes it on ~1M structured conversation rows (SmolTalk, MMLU, GSM8K, SpellingBee, identity conversations) so it can follow a user/assistant chat format with special tokens (<|user_start|>, <|user_end|>, <|assistant_start|>, <|assistant_end|>). Loss is masked so only assistant responses are supervised.
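To make the chat format concrete, the sketch below renders a single user turn followed by the opening of the assistant reply. It is an illustration only: the exact layout (ordering, whitespace, any leading beginning-of-sequence token) is an assumption, and a real engine would emit the special tokens as token IDs rather than as literal text. The function name render_user_turn is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Render one user turn and open the assistant reply.
 * Returns the number of characters written, or -1 if `dst` is too small. */
static int render_user_turn(char *dst, size_t cap, const char *user_msg)
{
    assert(dst != NULL && user_msg != NULL);
    int n = snprintf(dst, cap,
                     "<|user_start|>%s<|user_end|><|assistant_start|>",
                     user_msg);
    if (n < 0 || (size_t)n >= cap)
        return -1;  /* output was truncated */
    return n;
}
```

Generation would then continue from the assistant marker until the model produces <|assistant_end|>.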
cd /path/to/nanochat/
nanochat-env/bin/python -m torch.distributed.run \
--nproc_per_node=2 \
-m scripts.chat_sft \
-- \
--run=dummy
This automatically loads the latest base checkpoint as the starting point. Completed in ~2.4 hours (970 steps). Final training loss: ~1.06, validation bpb: 0.3571.
After training, export the SFT checkpoint to the flat binary format used by InferPi:
cd /path/to/nanochat/
nanochat-env/bin/python /path/to/inferpi/tools/export_gpt2.py \
--source sft --model_tag d12 --step 970 \
--output_dir /path/to/usb/GPT2-286M/
Produces model.bin (546 MB - 286M params in fp16, header magic 0x4E434854) and tokenizer.bin (339 KB - 32,768 tokens).
For base model export (text completion only, no chat format), use --source base --step 2520 instead.
You can also download model.bin and tokenizer.bin from https://huggingface.co/sjthapa/GPT2-286M-BIN
python3 tools/export_llama.py --model_dir /path/to/Supra-50M-Instruct --output_dir /path/to/usb/Supra-50M-Instruct/
Converts a HuggingFace LlamaForCausalLM model (safetensors format, handles bfloat16/fp16/fp32) to model.bin (raw Config struct header + fp16 weights in llama2.c layout) and tokenizer.bin (llama2.c binary tokenizer format converted from HuggingFace tokenizer.json).
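In llama2.c the Config header is seven 32-bit integers, followed by the weights. The sketch below parses such a header from a byte buffer. It assumes export_llama.py writes exactly those seven fields in little-endian order with no padding, which the description suggests but does not spell out. The helper names read_i32_le and parse_config are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Config as laid out in llama2.c: seven 32-bit integers. */
typedef struct {
    int32_t dim;         /* transformer width */
    int32_t hidden_dim;  /* feed-forward width */
    int32_t n_layers;
    int32_t n_heads;     /* query heads */
    int32_t n_kv_heads;  /* key/value heads (fewer than n_heads with GQA) */
    int32_t vocab_size;
    int32_t seq_len;     /* maximum sequence length */
} Config;

#define CONFIG_BYTES (7 * 4)

static int32_t read_i32_le(const uint8_t *p)
{
    uint32_t u = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return (int32_t)u;
}

/* Fill `out` from the start of model.bin. Returns the byte offset at
 * which the weights begin, or 0 if the buffer is too short. */
static size_t parse_config(const uint8_t *buf, size_t len, Config *out)
{
    assert(buf != NULL && out != NULL);
    if (len < CONFIG_BYTES)
        return 0;
    int32_t *fields[7] = { &out->dim, &out->hidden_dim, &out->n_layers,
                           &out->n_heads, &out->n_kv_heads,
                           &out->vocab_size, &out->seq_len };
    for (int i = 0; i < 7; i++)
        *fields[i] = read_i32_le(buf + 4 * i);
    return CONFIG_BYTES;
}
```

For Supra-50M-Instruct, a header parsed this way would be expected to hold dim=512, hidden_dim=1408, n_layers=12, n_heads=8, n_kv_heads=4, vocab_size=32000, and seq_len=1024, the values the Llama engine prints at startup.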
You can also download model.bin and tokenizer.bin from https://huggingface.co/sjthapa/Supra-50M-Instruct-BIN
Connect at 115200 baud, 8N1. Example session:
======================================
InferPi - Bare Metal LLM Inference
Raspberry Pi 4 (4x Cortex-A72)
======================================
=== USB Storage Init ===
PCIe: initializing BCM2711 root complex
PCIe: VL805 ready
xHCI: device VID=0x 3f0 PID=0x2003 (port 2, USB3)
USB: found mass storage interface 0
USB: bulk IN=0x81 (1024), OUT=0x 2 (1024)
USB: HP USB Flash Drive
USB: waiting for device ready...
USB: 121145344 sectors, 512 bytes/sector (59153 MB)
=== USB Storage Ready ===
Available models:
1. GPT2-286M [GPT2]
2. Supra-50M-Instruct [Llama]
Select model (1-2): 1
Loading model from GPT2-286M...
546 MB in 7122 ms (76 MB/s)
Loading tokenizer... 338 KB in 9 ms
Starting GPT2 engine...
Building GPT2 model...
GPT2: 12 layers, 768 embd, 32768 vocab, 2048 seq
Tokenizer: 32768 tokens loaded
Ready! Type your prompt and press ENTER.
> Hi!
Hello! How can I help you today?
---
Generation: 9 tokens in 1101 ms (7.2 tokens/sec)
> What is the capital of France?
The capital of France is Paris.
---
Generation: 7 tokens in 900 ms (6.6 tokens/sec)
> /exit
Available models:
1. GPT2-286M [GPT2]
2. Supra-50M-Instruct [Llama]
Select model (1-2): 2
Loading model from Supra-50M-Instruct...
98 MB in 1344 ms (73 MB/s)
Loading tokenizer... 454 KB in 11 ms
Starting Llama engine...
Building Llama transformer...
Llama: dim=512 hidden=1408 layers=12 heads=8 kv_heads=4 vocab=32000 seq=1024
Building tokenizer...
Ready! Type your prompt and press ENTER.
> Hi!
Hello! Is there anything else I can help you with?
---
Generation: 11 tokens in 673 ms (16.3 tokens/sec)
> tell me a story
Once upon a time, there was a brave young man named John. He lived in the bustling city of Tokyo with his family and friends in the mountains. One day, while wandering through the streets of Tokyo's ancient temples and gardens, he stumbled across an old map that had been lost for centuries. The map showed a vast range of treasures from ancient to modern times, including a treasure chest filled with gold coins and precious gems scattered throughout the city.
John was fascinated by this new map because it showed that there were many hidden treasures in the city's history, including ancient buildings, temples, and parks. He could hardly believe his eyes seeing such an incredible picture of this world once he had seen it before! The map showed that even a small glimpse into this world would bring so much joy and intrigue to anyone who entered.
John realized that there were many things he couldn't do without the map, but he was determined to make it a reality. He decided to keep his eyes on the map and see if anyone could help him find it, even if they didn't know of any ancient people or other artifacts in this world.
---
Generation: 233 tokens in 15838 ms (14.7 tokens/sec)
>
The Supra-50M-Instruct model used with the Llama engine is from SupraLabs/Supra-50M-Instruct on Hugging Face, converted to the llama2.c binary format using tools/export_llama.py.
The inference engines in this project are based on Andrej Karpathy's work:
- The GPT2 engine uses nanochat as its starting point, a clean and minimal GPT-2 implementation that made it practical to bring transformer inference to bare metal.
- The Llama engine uses llama2.c as its base, which demonstrated that a full Llama inference stack fits in a single C file.
Both engines have been significantly modified for bare-metal operation: fp16 weight storage, 4-core NEON parallel matmul, bump allocator integration, GPU firmware memory gap handling, UART I/O, multi-turn KV cache persistence, and the engine plugin interface. The core transformer algorithms remain faithful to the originals.
Both projects set the standard for accessible, from-scratch LLM implementations. InferPi would not exist without them.
MIT