siddharthjthapa/InferPi

InferPi

Bare-metal LLM inference engine for Raspberry Pi 4. Runs transformer models directly on the hardware with no operating system, just the four Cortex-A72 cores, NEON SIMD, and a UART serial console.

Supports multiple model architectures (GPT2, Llama) loaded from a USB drive. On boot, presents a model selection menu over UART, auto-detects the engine type from the model binary, and enters an interactive chat loop.

Features

  • Multi-architecture support - GPT2 and Llama engines, each self-contained with its own tokenizer, sampler, and chat template. The engine type is auto-detected from the model.bin header magic.
  • Multi-core inference - All 4 Cortex-A72 cores run parallel matrix multiplications using NEON fp16->fp32 widening with WFE/SEV synchronization.
  • USB 3.0 model loading - Full bare-metal stack: PCIe root complex -> xHCI (VL805) -> USB Mass Storage -> FAT32 with Long File Name support.
  • Model hot-switching - Type /exit during chat to return to the model selection menu. Heap is reset, USB re-initialized, and a different model can be loaded without rebooting.
  • GPU firmware coexistence - Automatically detects and skips the VideoCore firmware memory region during model loading. Periodic mailbox keepalive prevents the GPU watchdog from scrubbing DRAM.
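
As a rough host-side illustration of the multi-core bullet, the sketch below divides the rows of a matrix-vector product among four workers. The README does not say how InferPi assigns rows to cores, so the contiguous even split and the names `row_range` and `matvec_partitioned` are assumptions for illustration only. The slices run sequentially here, and no NEON or WFE/SEV behavior is modeled.

```python
def row_range(core: int, n_rows: int, n_cores: int = 4) -> range:
    """Rows of the output vector assigned to one core (hypothetical even split)."""
    base, extra = divmod(n_rows, n_cores)
    # The first `extra` cores take one additional row so every row is covered once.
    start = core * base + min(core, extra)
    stop = start + base + (1 if core < extra else 0)
    return range(start, stop)


def matvec_partitioned(weights, x, n_cores: int = 4):
    """Compute weights @ x one core-slice at a time (sequentially in this sketch)."""
    out = [0.0] * len(weights)
    for core in range(n_cores):
        for row in row_range(core, len(weights), n_cores):
            out[row] = sum(w * xi for w, xi in zip(weights[row], x))
    return out
```

Because each core writes a disjoint slice of the output, the only synchronization such a scheme would need is a start signal and a completion barrier, which is the role WFE/SEV plays in the feature list.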

Hardware Requirements

  • Raspberry Pi 4 Model B (BCM2711, 4x Cortex-A72)
  • USB 3.0 flash drive with FAT32 partition containing model directories
  • USB-to-UART adapter connected to GPIO 14/15 (UART0, 115200 baud)
  • microSD card with boot firmware (see sdcard/)

USB Drive Layout

/GPT2-286M/
    model.bin           # exported model weights
    tokenizer.bin       # token vocabulary
/Supra-50M-Instruct/
    model.bin
    tokenizer.bin

Each subdirectory is one model. The engine type is detected automatically from the first 4 bytes of model.bin:

  • 0x4E434854 ("NCHT"): GPT2 engine
  • Anything else: Llama engine (this fallback will need to change when another architecture is added)
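
To check on a host machine which engine a given model.bin would select, the rule above can be sketched in a few lines of Python. The function name `detect_engine` is hypothetical, and the byte order of the magic on disk is not stated in this README: little-endian is assumed by default, and the `byteorder` parameter lets you try the other interpretation.

```python
import struct

GPT2_MAGIC = 0x4E434854  # "NCHT", value taken from the list above


def detect_engine(header: bytes, byteorder: str = "<") -> str:
    """Return "GPT2" or "Llama" from the first 4 bytes of a model.bin.

    byteorder "<" (little-endian) is an assumption, not confirmed by the source;
    pass ">" to interpret the bytes as big-endian instead.
    """
    if len(header) < 4:
        raise ValueError("need at least 4 header bytes")
    (magic,) = struct.unpack(byteorder + "I", header[:4])
    # Mirrors the documented rule: one known magic, everything else is Llama.
    return "GPT2" if magic == GPT2_MAGIC else "Llama"
```

A typical call would be `detect_engine(open("model.bin", "rb").read(4))`. Since anything that is not the GPT2 magic falls through to Llama, a corrupt or truncated file is also reported as Llama.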

Building

Requires the aarch64-none-elf bare-metal GCC toolchain (tested with Arm GNU Toolchain 13.2).

make build

Output:

  • build/inferpi.elf - ELF for JTAG/GDB loading
  • build/kernel8.img - flat binary for SD card boot

Copy the kernel image into the boot partition directory:

cp build/kernel8.img sdcard/

SD Card Setup

The sdcard/ directory contains the required boot partition files:

| File | Purpose |
|---|---|
| config.txt | GPU bootloader configuration (gpu_mem=32) |
| start4.elf | VideoCore firmware |
| fixup4.dat | GPU firmware fixup |
| bcm2711-rpi-4-b.dtb | Device tree |
| armstub8-rpi4.bin | ARM stub for EL2 entry |
| kernel8.img | InferPi flat binary image (built from build/inferpi.elf) |

Key config.txt setting: gpu_mem=32 minimizes the GPU firmware memory region to reduce the skip gap during model loading.

RAM Layout

0x00080000 - 0x04300000    ELF code + BSS + 64 MB heap
0x08000000 - ~0x4C400000   model.bin (up to 1.18 GB)
0x3F000000 - 0x3FFFFFFF    GPU firmware (skipped during load)
0x50000000 - ~0x50100000   tokenizer.bin (up to 1 MB)
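
One way to picture the GPU skip is as a mapping from a byte offset in model.bin to the physical address where that byte lands. The sketch below is a host-side illustration only. The constants are copied from the table above, whereas the firmware is described as detecting the region at runtime. The rule that data resumes immediately after the gap is an assumption, and `load_address` is a hypothetical name.

```python
MODEL_BASE = 0x08000000  # start of the model.bin region in the table above
GPU_START = 0x3F000000   # first byte of the GPU firmware region
GPU_END = 0x40000000     # first byte after the GPU firmware region


def load_address(file_offset: int) -> int:
    """Map a model.bin byte offset to a physical address, hopping over the GPU region."""
    if file_offset < 0:
        raise ValueError("offset must be non-negative")
    bytes_before_gap = GPU_START - MODEL_BASE  # 0x37000000 bytes, i.e. 880 MiB
    if file_offset < bytes_before_gap:
        return MODEL_BASE + file_offset
    # Assumed behavior: loading resumes at the first address past the GPU region.
    return GPU_END + (file_offset - bytes_before_gap)
```

Under these assumptions, only models larger than 880 MiB ever reach the gap. The 546 MB GPT2-286M file fits entirely below it.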

Boot Sequence

  1. RPi4 GPU firmware loads kernel8.img and jumps to the init entry point at EL2
  2. CPU0 initializes UART, NEON, MMU, and caches; secondary cores enter worker loops
  3. USB storage stack initializes: PCIe -> VL805 xHCI -> USB device enumeration -> FAT32
  4. Root directory scanned for model subdirectories containing model.bin
  5. Model selection menu displayed over UART
  6. Selected model's model.bin and tokenizer.bin loaded into RAM
  7. Engine auto-detected, initialized, and chat loop entered

Project Structure

inferpi/
  boot/                     # AArch64 bootstrap, exception vectors, stack setup
  drivers/
    emmc/                   # SD card controller (eMMC2)
    fat32/                  # FAT32 filesystem with LFN support
    pcie/                   # BCM2711 PCIe root complex + VL805 firmware load
    uart/                   # PL011 UART serial I/O
    usb_storage/            # USB Mass Storage (Bulk-Only Transport)
    xhci/                   # xHCI USB 3.0 host controller
  include/                  # Headers
  lib/                      # C runtime (crt0, clock, stdio retarget)
  linker/                   # Linker script
  mmu/                      # Page table setup (EL2, 36-bit PA for PCIe at 0x600000000)
  src/
    main.c                  # Entry point, model menu, engine dispatch
    inferpi.c               # Shared bump allocator (64 MB), timer, string functions
    loadmodel.c             # USB storage init, model scanning, file loading
    smp.c                   # Multi-core NEON matmul (4 cores, WFE/SEV sync)
    engines/
      gpt2.c                # GPT2 engine (RoPE, RMSNorm, GQA, ReLU^2, smear gate)
      llama.c               # Llama engine (RoPE, RMSNorm, GQA, SwiGLU, Alpaca template)
  sdcard/                   # Boot partition files for microSD
  tools/
    export_gpt2.py          # Convert GPT2 checkpoints to model.bin + tokenizer.bin
    export_llama.py         # Convert Llama checkpoints to model.bin + tokenizer.bin

Adding a New Engine

  1. Create src/engines/myengine.c
  2. Implement init() and chat_loop() functions
  3. Export an Engine myengine_engine struct
  4. Add ENGINE_MYENGINE constant to include/engine.h, increment ENGINE_COUNT
  5. Register in the dispatch table in main.c
  6. Update the auto-detection logic in loadmodel.c (or add a new magic number to your model format)
  7. Add the source to Makefile

GPT2 Model Training

The GPT2-286M model was trained from scratch using Karpathy's nanochat framework on 2x NVIDIA GPUs (RTX 5070 Ti + RTX 5060 Ti) with DDP via torchrun.

Model Configuration

  • 286M parameters, 12 layers, 768 embedding dim
  • 32,768 vocab BPE tokenizer, 2,048 max sequence length
  • Window pattern: SSSL

Base Training

Trained on the ClimbMix dataset (~1.4 GB, 17 shards) for 2,520 iterations (~1.3B tokens) with a batch size of 524,288 tokens and bf16 precision. Gradient accumulation: 16 steps per rank, device batch size 8 per GPU.

cd /path/to/nanochat/
nanochat-env/bin/python -m torch.distributed.run \
    --nproc_per_node=2 \
    -m scripts.base_train \
    -- \
    --depth=12 \
    --max-seq-len=2048 \
    --device-batch-size=8 \
    --target-param-data-ratio=12 \
    --save-every=500 \
    --run=dummy

Completed in ~6.2 hours. Final validation bpb: 0.850675.
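
The batch figures quoted for base training can be cross-checked with a few lines of arithmetic. The helper name `grad_accum_steps` is illustrative and is not part of nanochat.

```python
def grad_accum_steps(total_batch_tokens, device_batch_size, seq_len, n_gpus):
    """Micro-steps each rank accumulates before one optimizer step."""
    tokens_per_micro_step = device_batch_size * seq_len * n_gpus
    steps, remainder = divmod(total_batch_tokens, tokens_per_micro_step)
    if remainder:
        raise ValueError("total batch is not a multiple of the per-step token count")
    return steps


accum = grad_accum_steps(524_288, device_batch_size=8, seq_len=2048, n_gpus=2)
total_tokens = 2_520 * 524_288  # iterations x tokens per optimizer step

print(accum)         # 16
print(total_tokens)  # 1321205760, i.e. ~1.3B tokens
```

Both results agree with the numbers stated above: 16 accumulation steps per rank and roughly 1.3B training tokens.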

SFT (Supervised Fine-Tuning)

The base model does text completion only. SFT fine-tunes it on ~1M structured conversation rows (SmolTalk, MMLU, GSM8K, SpellingBee, identity conversations) so it can follow a user/assistant chat format with special tokens (<|user_start|>, <|user_end|>, <|assistant_start|>, <|assistant_end|>). Loss is masked so only assistant responses are supervised.

cd /path/to/nanochat/
nanochat-env/bin/python -m torch.distributed.run \
    --nproc_per_node=2 \
    -m scripts.chat_sft \
    -- \
    --run=dummy

The script automatically loads the latest base checkpoint as its starting point. Completed in ~2.4 hours (970 steps). Final training loss: ~1.06, validation bpb: 0.3571.

Export to Binary

After training, export the SFT checkpoint to the flat binary format used by InferPi:

cd /path/to/nanochat/
nanochat-env/bin/python /path/to/inferpi/tools/export_gpt2.py \
    --source sft --model_tag d12 --step 970 \
    --output_dir /path/to/usb/GPT2-286M/

Produces model.bin (546 MB - 286M params in fp16, header magic 0x4E434854) and tokenizer.bin (339 KB - 32,768 tokens).
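
The model.bin size follows from the parameter count, since each fp16 weight takes 2 bytes. The sketch below assumes that "MB" here means binary megabytes (MiB), that 286M is a rounded count, and that the header is negligible. The function name is illustrative.

```python
def fp16_size_mib(n_params: int) -> float:
    """Size in MiB of n_params weights stored as 2-byte fp16 values (header ignored)."""
    return n_params * 2 / (1024 * 1024)


print(round(fp16_size_mib(286_000_000), 1))  # 545.5
```

The result is within a megabyte of the 546 MB reported for model.bin. The difference is consistent with rounding of the parameter count.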

For base model export (text completion only, no chat format), use --source base --step 2520 instead.

You can also download model.bin and tokenizer.bin from https://huggingface.co/sjthapa/GPT2-286M-BIN

Exporting Llama Models

python3 tools/export_llama.py --model_dir /path/to/Supra-50M-Instruct --output_dir /path/to/usb/Supra-50M-Instruct/

The script converts a Hugging Face LlamaForCausalLM model (safetensors format; bfloat16, fp16, and fp32 are handled) to model.bin (a raw Config struct header plus fp16 weights in llama2.c layout) and tokenizer.bin (the llama2.c binary tokenizer format, converted from the Hugging Face tokenizer.json).

You can also download model.bin and tokenizer.bin from https://huggingface.co/sjthapa/Supra-50M-Instruct-BIN

UART Interface

Connect at 115200 baud, 8N1. Example session:

======================================
  InferPi - Bare Metal LLM Inference
  Raspberry Pi 4 (4x Cortex-A72)
======================================


=== USB Storage Init ===
PCIe: initializing BCM2711 root complex
PCIe: VL805 ready
xHCI: device VID=0x 3f0 PID=0x2003 (port 2, USB3)
USB: found mass storage interface 0
USB: bulk IN=0x81 (1024), OUT=0x 2 (1024)
USB: HP       USB Flash Drive
USB: waiting for device ready...
USB: 121145344 sectors, 512 bytes/sector (59153 MB)
=== USB Storage Ready ===

Available models:

  1. GPT2-286M  [GPT2]
  2. Supra-50M-Instruct  [Llama]

Select model (1-2): 1

Loading model from GPT2-286M...
  546 MB in 7122 ms (76 MB/s)
Loading tokenizer... 338 KB in 9 ms

Starting GPT2 engine...

Building GPT2 model...
  GPT2: 12 layers, 768 embd, 32768 vocab, 2048 seq
  Tokenizer: 32768 tokens loaded

Ready! Type your prompt and press ENTER.

> Hi!

Hello! How can I help you today?
---
Generation: 9 tokens in 1101 ms (7.2 tokens/sec)

> What is the capital of France?

The capital of France is Paris.
---
Generation: 7 tokens in 900 ms (6.6 tokens/sec)

> /exit

Available models:

  1. GPT2-286M  [GPT2]
  2. Supra-50M-Instruct  [Llama]

Select model (1-2): 2

Loading model from Supra-50M-Instruct...
  98 MB in 1344 ms (73 MB/s)
Loading tokenizer... 454 KB in 11 ms

Starting Llama engine...

Building Llama transformer...
  Llama: dim=512 hidden=1408 layers=12 heads=8 kv_heads=4 vocab=32000 seq=1024
Building tokenizer...

Ready! Type your prompt and press ENTER.

> Hi!

Hello! Is there anything else I can help you with?
---
Generation: 11 tokens in 673 ms (16.3 tokens/sec)

> tell me a story

Once upon a time, there was a brave young man named John. He lived in the bustling city of Tokyo with his family and friends in the mountains. One day, while wandering through the streets of Tokyo's ancient temples and gardens, he stumbled across an old map that had been lost for centuries. The map showed a vast range of treasures from ancient to modern times, including a treasure chest filled with gold coins and precious gems scattered throughout the city.

John was fascinated by this new map because it showed that there were many hidden treasures in the city's history, including ancient buildings, temples, and parks. He could hardly believe his eyes seeing such an incredible picture of this world once he had seen it before! The map showed that even a small glimpse into this world would bring so much joy and intrigue to anyone who entered.

John realized that there were many things he couldn't do without the map, but he was determined to make it a reality. He decided to keep his eyes on the map and see if anyone could help him find it, even if they didn't know of any ancient people or other artifacts in this world.
---
Generation: 233 tokens in 15838 ms (14.7 tokens/sec)

>

Models

The Supra-50M-Instruct model used with the Llama engine is from SupraLabs/Supra-50M-Instruct on Hugging Face, converted to the llama2.c binary format using tools/export_llama.py.

Acknowledgements

The inference engines in this project are based on Andrej Karpathy's work:

  • The GPT2 engine starts from nanochat, a clean and minimal GPT-2 implementation that made it practical to bring transformer inference to bare metal.
  • The Llama engine uses llama2.c as its base, which demonstrated that a full Llama inference stack fits in a single C file.

Both engines have been significantly modified for bare-metal operation: fp16 weight storage, 4-core NEON parallel matmul, bump allocator integration, GPU firmware memory gap handling, UART I/O, multi-turn KV cache persistence, and the engine plugin interface. The core transformer algorithms remain faithful to the originals.

Both projects set the standard for accessible, from-scratch LLM implementations. InferPi would not exist without them.

License

MIT
