A bare-metal, hardware-agnostic Large Language Model (LLM) inference engine designed to run massive models on heavily memory-constrained edge devices, and to act as a safety feature for running LLMs locally.
FloatLLM is built for a fundamental shift in local AI execution: Dynamic Zero-Copy Memory Chunking.
Traditionally, running models larger than host RAM relied on static, layer-by-layer disk swapping. However, static swapping creates severe I/O bottlenecks.
FloatLLM abandons static swapping. Instead, it utilizes OS-level hardware interrogation to calculate exact, real-time memory boundaries, slicing standard .gguf neural network weights into mathematically perfect execution blocks. By leveraging native mmap (memory-mapping), it creates a zero-copy hardware bridge, streaming gigabytes of tensor data from SSD to RAM at bare-metal speeds without ever triggering an Out-of-Memory (OOM) panic.
This allows massive architectures to execute natively on anything from an Apple Silicon Mac to a non-rooted Android device running terminal environments, completely offline.
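To make the zero-copy idea concrete, here is a minimal sketch using only Python's standard `mmap` module. It is not FloatLLM's actual loader: the helper names `map_read_only` and `checksum_range` are hypothetical, and a small temporary file stands in for a real .gguf model.

```python
import mmap
import os
import tempfile


def map_read_only(path):
    """Map a whole file read-only; pages are loaded lazily by the OS."""
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def checksum_range(mm, offset, length):
    """Sum the bytes in [offset, offset + length) without copying them."""
    with memoryview(mm) as full:
        # Slicing a memoryview creates a window onto the mapping, not a copy.
        with full[offset:offset + length] as window:
            return sum(window)


# Stand-in for a model file: 4096 bytes cycling through 0..255.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 16)
    demo_path = tmp.name

mm = map_read_only(demo_path)
demo_checksum = checksum_range(mm, 1024, 8)  # touches only the pages it needs
mm.close()
os.remove(demo_path)
```

Only the pages backing the requested window are read from disk, which is the property the paragraph above relies on. Every `memoryview` must be released before the mapping is closed, hence the `with` blocks.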
FloatLLM is being developed in these stages:
The master entry point. The router dynamically interrogates the host machine's hardware, evaluating total RAM, free RAM, and SSD capacity.
- Hardware Agnostic: Automatically routes compute workloads based on host detection.
- Failsafe Math: Calculates strict safety thresholds, ensuring a configurable buffer (default 20%) is always left free for the operating system and dynamic KV Cache context.
- Absolute Control: Allows users to manually force RAM limits to run multi-gigabyte models through ultra-tight memory constraints.
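The threshold arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not the router's actual code: the function name `usable_ram_budget` is hypothetical, the reserve is assumed to be a percentage of total RAM, and a user-forced limit is assumed to cap (never raise) the computed budget.

```python
GIB = 1024 ** 3


def usable_ram_budget(total_bytes, free_bytes, reserve_percent=20,
                      forced_limit_bytes=None):
    """Return how many bytes of model data may be resident at once."""
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be in [0, 100)")
    # Assumption: the safety buffer is measured against total RAM.
    reserve = total_bytes * reserve_percent // 100
    budget = max(free_bytes - reserve, 0)
    # Assumption: a manual limit can only tighten the budget.
    if forced_limit_bytes is not None:
        budget = min(budget, forced_limit_bytes)
    return budget


# Example host: 8 GiB total, 5 GiB currently free.
auto_budget = usable_ram_budget(8 * GIB, 5 * GIB)
tight_budget = usable_ram_budget(8 * GIB, 5 * GIB, forced_limit_bytes=1 * GIB)
```

Integer arithmetic is used so the budget is exact and reproducible across platforms. In a real router the two inputs would come from a hardware query rather than constants.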
The physical memory mapper.
- Metadata Parsing: Uses the official gguf library to scan the model header, discovering exact tensor byte offsets without loading the massive payload.
- Dynamic Slicing: Takes the safety limits and mathematically groups hundreds of tensors into safe execution blocks.
- Zero-Copy Streaming: Utilizes a read-only mmap bridge to swap execution chunks in and out of RAM at maximum SSD read speeds.
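One simple way to group tensors into blocks that fit a memory budget is a greedy pass in file order. The sketch below assumes that strategy; the source does not say which grouping algorithm FloatLLM uses. The `TensorSpan` type, the `group_into_chunks` function, and the tensor names and sizes are all hypothetical, and alignment padding between tensors is ignored.

```python
from typing import List, NamedTuple


class TensorSpan(NamedTuple):
    name: str
    offset: int   # byte offset of the tensor data in the file
    nbytes: int   # size of the tensor data in bytes


def group_into_chunks(tensors: List[TensorSpan],
                      budget_bytes: int) -> List[List[TensorSpan]]:
    """Greedily pack tensors, in file order, into chunks <= budget_bytes."""
    chunks: List[List[TensorSpan]] = []
    current: List[TensorSpan] = []
    used = 0
    for t in sorted(tensors, key=lambda span: span.offset):
        if t.nbytes > budget_bytes:
            raise ValueError(f"{t.name} alone exceeds the budget")
        # Start a new chunk when adding this tensor would overflow.
        if current and used + t.nbytes > budget_bytes:
            chunks.append(current)
            current, used = [], 0
        current.append(t)
        used += t.nbytes
    if current:
        chunks.append(current)
    return chunks


# Made-up tensor table: sizes 40, 30, 50, 20, 60 bytes with a 100-byte budget.
demo_tensors = [
    TensorSpan("a", 0, 40),
    TensorSpan("b", 40, 30),
    TensorSpan("c", 70, 50),
    TensorSpan("d", 120, 20),
    TensorSpan("e", 140, 60),
]
demo_chunks = group_into_chunks(demo_tensors, budget_bytes=100)
chunk_names = [[t.name for t in chunk] for chunk in demo_chunks]
```

Keeping tensors in file order means each chunk covers one contiguous byte range, so it can be served by a single window onto the memory map.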
The bare-metal execution layer utilizing ggml.
- Hardware Binding: Dynamically binds zero-copy Python memory maps to dedicated GPU cores (Metal, CUDA, Vulkan).
- VRAM Detachment: Securely detaches CPU memory pointers to prevent OS-level segmentation faults, allowing the GPU allocator to provision safe computational VRAM on the fly.
The translation layer.
- 100% Offline Generation: Dynamically reads the internal tokenizer.ggml.tokens array directly from the GGUF file. Zero API calls, zero internet dependency.
- Dynamic Handling: Automatically scales between 1B and 405B parameter models, supporting all standard tokenization architectures.
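As a rough illustration of offline decoding, the sketch below reads the vocabulary from GGUF metadata and maps token IDs back to text. It assumes the `gguf` Python package's `GGUFReader` exposes string arrays through `fields[...].parts` and `.data`. The helper names and the toy vocabulary are hypothetical. Only a SentencePiece-style space marker is handled, which is one tokenizer convention among several.

```python
def load_vocab_from_gguf(model_path):
    """Read token strings from the model's metadata (needs `pip install gguf`)."""
    # Imported lazily so the rest of this sketch runs without the package.
    from gguf import GGUFReader

    reader = GGUFReader(model_path)
    field = reader.fields["tokenizer.ggml.tokens"]
    # Assumption: field.data holds the indices of the parts containing
    # each token's raw UTF-8 bytes.
    return [bytes(field.parts[idx]).decode("utf-8", errors="replace")
            for idx in field.data]


def ids_to_text(vocab, token_ids, space_marker="\u2581"):
    """Join token pieces and turn the space marker back into spaces."""
    pieces = [vocab[i] for i in token_ids]
    return "".join(pieces).replace(space_marker, " ")


# Toy vocabulary standing in for a real model's token list.
toy_vocab = ["\u2581Paris", "\u2581is", "\u2581the", "\u2581capital", "."]
decoded = ids_to_text(toy_vocab, [0, 1, 2, 3, 4])
```

With a real model, `load_vocab_from_gguf("/path/to/your/model.gguf")` would replace `toy_vocab`. No network access is involved at any point.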
The output interface.
- Note: Currently bypasses hidden layers to stress-test the zero-copy VRAM buffer allocation and streaming loop stability.
- Generation loop: Pipeline integrated. Prompt integers are passed securely across the ctypes bridge, processed through the GPU, and streamed back horizontally to the user terminal in real time.
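Passing prompt integers across a ctypes boundary usually means packing them into a C array first. The sketch below shows that step in isolation. The helper name and the token IDs are made up for illustration, and no native library is loaded, so the call into the engine appears only as a comment with a hypothetical signature.

```python
import ctypes


def to_c_int32_array(token_ids):
    """Pack Python ints into a contiguous C array of int32."""
    array_type = ctypes.c_int32 * len(token_ids)
    return array_type(*token_ids)


prompt_ids = [3923, 374, 279, 6864]  # arbitrary example IDs
c_ids = to_c_int32_array(prompt_ids)

# A native function would typically receive a pointer plus a length, e.g.
#   lib.generate(c_ptr, len(c_ids))   # hypothetical signature
c_ptr = ctypes.cast(c_ids, ctypes.POINTER(ctypes.c_int32))

# Read the values back through the pointer to confirm the layout.
round_trip = [c_ptr[i] for i in range(len(c_ids))]
```

The Python-side array (`c_ids`) must stay referenced for as long as native code uses the pointer. Otherwise the buffer can be freed underneath it.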
Clone this repository and install the minimal required Python libraries:
git clone https://github.com/suryanshRoy/FloatLLM.git
cd FloatLLM
pip install -r requirements.txt
FloatLLM relies on the ggml C library for its matrix operations. You must clone it into the project root before compiling:
git clone https://github.com/ggerganov/ggml.git
FloatLLM requires a model in the .gguf format. If you don't have one, you can download a 7B parameter test model (~3.5GB) with either wget or curl:
wget -c -O test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q3_K_M.gguf"
curl -L -o test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q3_K_M.gguf"
Download the Stress-Test Model (14B Parameters, ~9GB)
To demonstrate FloatLLM's core innovation, dynamic zero-copy memory chunking, you need a model larger than the RAM available on your device. Run one of these commands in your terminal to download a 14-billion-parameter test model (~9GB):
wget -c -O test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-GGUF/resolve/main/Qwen2.5-14B-Instruct-Q4_K_M.gguf"
curl -L -o test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-GGUF/resolve/main/Qwen2.5-14B-Instruct-Q4_K_M.gguf"
Make sure you have CMake installed. If you don't, install it:
- Linux (Ubuntu/Debian):
  sudo apt update && sudo apt install cmake
- macOS:
  brew install cmake
- Windows:
  https://cmake.org/download/
- If a previous CMake build is broken, remove the build directory before compiling the C++ code:
  rm -rf build
For Apple Silicon (Metal/MPS):
cmake -B build -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4
For NVIDIA GPU (CUDA):
cmake -B build -DGGML_CUDA=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4
For Vulkan GPU:
cmake -B build -DGGML_VULKAN=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4
For OpenCL:
cmake -B build -DGGML_OPENCL=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4
For SYCL (Intel OneAPI):
cmake -B build -DGGML_SYCL=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4
For Kompute (a Vulkan-based backend):
cmake -B build -DGGML_KOMPUTE=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4
For CPU-Only / Native ARM:
cmake -B build -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4
Execute the router, pointing it to a local .gguf file:
python floatllm_router.py --hardware auto --model-path /path/to/your/model.gguf --prompt "What is the capital of France?"

FloatLLM was fundamentally driven by human architectural design, but AI tools were actively leveraged as collaborative research and debugging assistants. I acted as the core systems architect, directing the routing logic, memory management, and broad structural shifts.
During development, Google Search AI Overviews were utilized for researching core concepts and discovering cross-platform C++ libraries. Gemini was heavily utilized as a debugging partner to help troubleshoot the project's most difficult technical hurdles. Specifically, Gemini assisted in debugging the bare-metal C++ inference engine crashes, optimizing tensor management within the Python loader, and resolving complex OS-specific ggml bugs. All final implementations and architectural decisions were independently executed and tested.
