
suryanshRoy/FloatLLM


FloatLLM 🚀


A bare-metal, hardware-agnostic Large Language Model (LLM) inference engine designed to run massive models on heavily memory-constrained edge devices, and to act as a safety feature for running LLMs locally.

FloatLLM is built for a fundamental shift in local AI execution: Dynamic Zero-Copy Memory Chunking.

🚀 The Architectural Shift

Originally, handling models larger than host RAM relied on static, layer-by-layer disk swapping. However, static swapping creates massive I/O bottlenecks.

FloatLLM abandons static swapping. Instead, it queries the operating system for real-time memory limits and slices standard .gguf neural network weights into execution blocks sized to fit within them. Using native mmap (memory mapping), it maps the weight file read-only so that tensor data is paged from SSD into RAM on demand without an intermediate copy, which is intended to keep memory use below the point that would trigger an Out-of-Memory (OOM) failure.
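The zero-copy idea can be illustrated with Python's standard mmap module. This is a minimal sketch, not FloatLLM's loader: the function name read_chunk_zero_copy is hypothetical, and the final bytes() call copies the slice only so the demo can return a value after the mapping is closed.

```python
import mmap
import os
import tempfile


def read_chunk_zero_copy(path, offset, length):
    """Return the bytes in [offset, offset + length) using a read-only mmap.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in lazily on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            with memoryview(mm) as whole:
                # Slicing a memoryview does not copy the underlying bytes.
                with whole[offset:offset + length] as view:
                    # Copy here only so the result outlives the mapping.
                    return bytes(view)


if __name__ == "__main__":
    fd, demo_path = tempfile.mkstemp()
    os.write(fd, bytes(range(256)))
    os.close(fd)
    print(read_chunk_zero_copy(demo_path, 16, 4))
    os.remove(demo_path)
```

A real engine would hand the memoryview (or its address) to the compute layer instead of copying it, which is where the zero-copy benefit comes from.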

This allows massive architectures to execute natively on anything from an Apple Silicon Mac to a non-rooted Android device running terminal environments, completely offline.


🏗️ Project Architecture & Status

FloatLLM is being developed in these stages:

✅ Phase 1 (Hardware Router) - floatllm_router.py

The master entry point. The router dynamically interrogates the host machine's hardware, evaluating total RAM, free RAM, and SSD capacity.

  • Hardware Agnostic: Automatically routes compute workloads based on host detection.
  • Failsafe Math: Calculates strict safety thresholds, ensuring a configurable buffer (default 20%) is always left free for the operating system and dynamic KV Cache context.
  • Absolute Control: Allows users to manually force RAM limits to run multi-gigabyte models through ultra-tight memory constraints.
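The threshold arithmetic described above might look roughly like the following. This is a sketch under stated assumptions: the name usable_budget is hypothetical, the buffer is assumed to be a percentage of currently free RAM, and a manually forced limit is assumed to cap the budget rather than raise it. The actual router may define these differently.

```python
def usable_budget(free_bytes, safety_percent=20, forced_limit=None):
    """Return how many bytes of RAM may be used for model weights.

    Hypothetical helper: reserves safety_percent of free RAM for the OS and
    KV cache, then applies an optional user-forced cap.
    """
    if not 0 <= safety_percent < 100:
        raise ValueError("safety_percent must be in [0, 100)")
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    budget = free_bytes * (100 - safety_percent) // 100
    if forced_limit is not None:
        budget = min(budget, forced_limit)
    return budget


if __name__ == "__main__":
    GIB = 1024 ** 3
    print(usable_budget(10 * GIB) / GIB)                        # default buffer
    print(usable_budget(10 * GIB, forced_limit=2 * GIB) / GIB)  # forced cap
```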

✅ Phase 2 (Memory Loader) - floatllm_loader.py

The physical memory mapper.

  • Metadata Parsing: Uses the official gguf library to scan the model header, discovering exact tensor byte offsets without loading the massive payload.
  • Dynamic Slicing: Takes the safety limits and mathematically groups hundreds of tensors into safe execution blocks.
  • Zero-Copy Streaming: Utilizes a read-only mmap bridge to swap execution chunks in and out of RAM at maximum SSD read speeds.
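The dynamic slicing step can be sketched as a greedy grouping over tensor sizes. This is an illustration, not the loader's actual algorithm: group_tensors is a hypothetical name, tensors are represented as plain (name, offset, size) tuples rather than gguf library objects, and alignment padding between tensors is ignored.

```python
def group_tensors(tensors, budget):
    """Greedily pack (name, offset, size) tuples into chunks of <= budget bytes.

    Hypothetical helper: tensors are taken in file order so each chunk covers
    a contiguous region that can be mapped in one pass.
    """
    chunks, current, used = [], [], 0
    for name, offset, size in sorted(tensors, key=lambda t: t[1]):
        if size > budget:
            raise ValueError(f"tensor {name!r} ({size} bytes) exceeds budget")
        # Start a new chunk when adding this tensor would overflow the budget.
        if current and used + size > budget:
            chunks.append(current)
            current, used = [], 0
        current.append((name, offset, size))
        used += size
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    demo = [("a", 0, 40), ("b", 40, 40), ("c", 80, 40), ("d", 120, 100)]
    for i, chunk in enumerate(group_tensors(demo, budget=100)):
        print(i, [name for name, _, _ in chunk])
```

Greedy packing in file order keeps reads sequential, at the cost of sometimes leaving part of the budget unused in a chunk.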

✅ Phase 3 (Inference Engine) - floatllm_compute.cpp

The bare-metal execution layer utilizing ggml.

  • Hardware Binding: Dynamically binds zero-copy Python memory maps to dedicated GPU cores (Metal, CUDA, Vulkan).
  • VRAM Detachment: Securely detaches CPU memory pointers to prevent OS-level segmentation faults, allowing the GPU allocator to provision safe computational VRAM on the fly.

✅ Phase 4 (Tokenizer) - floatllm_tokenizer.py

The translation layer.

  • 100% Offline Generation: Dynamically reads the internal tokenizer.ggml.tokens array directly from the GGUF file. Zero API calls, zero internet dependency.
  • Dynamic Handling: Automatically scales between 1B and 405B parameter models, supporting all standard tokenization architectures.

Phase 5 (Generation loop) - Active (Raw logit streaming test)

The output interface.

  • Note: Currently bypasses hidden layers to stress-test the zero-copy VRAM buffer allocation and streaming loop stability.
  • Generation loop: Pipeline integrated. Prompt token IDs are passed across the ctypes bridge, processed on the GPU, and streamed back to the user's terminal on a single line in real time.
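The terminal streaming described above could be sketched as follows. This is a simplified illustration: stream_tokens is a hypothetical name, the vocabulary is assumed to be a plain list of strings indexed by token ID (as a tokens array would provide), and real vocabularies often need extra handling for space markers or byte-level pieces, which is omitted here.

```python
import io
import sys


def stream_tokens(token_ids, vocab, out=None):
    """Write each token's text to `out` as soon as it is available.

    Hypothetical helper: unknown IDs are skipped rather than raising.
    """
    out = sys.stdout if out is None else out
    for tid in token_ids:
        if 0 <= tid < len(vocab):
            out.write(vocab[tid])
            # Flush per token so text appears immediately, not at line end.
            out.flush()


if __name__ == "__main__":
    demo_vocab = ["Par", "!", "is"]
    stream_tokens([0, 2, 1], demo_vocab)
    sys.stdout.write("\n")
```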

🛠️ Usage (Building from Source)

1. Environment & Requirements

Clone this repository and install the minimal required Python libraries:

git clone https://github.com/suryanshRoy/FloatLLM.git
cd FloatLLM
pip install -r requirements.txt

2. Fetch the GGML Library

FloatLLM relies on the ggml C library for the matrix operations. You must clone it into the project root before compiling:

git clone https://github.com/ggerganov/ggml.git

3. Download a Test Model

FloatLLM requires a model in the .gguf format. If you don't have one, you can download a 7B parameter test model (~3.5GB):

Using wget:

wget -c -O test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q3_K_M.gguf"

Using curl:

curl -L -o test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q3_K_M.gguf"

Stress-test Model:

To demonstrate FloatLLM's core innovation, dynamic zero-copy memory chunking, you need a model that exceeds the RAM typically available. Run one of the commands below to download a 14-billion-parameter test model (~9GB). Both commands write to test_model.gguf, the same filename used for the 7B download, so delete or rename that file first; otherwise wget -c may try to resume onto the existing file.

Using wget:

wget -c -O test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-GGUF/resolve/main/Qwen2.5-14B-Instruct-Q4_K_M.gguf"

Using curl:

curl -L -o test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-GGUF/resolve/main/Qwen2.5-14B-Instruct-Q4_K_M.gguf"

4. Build the Compute Bridge

Make sure CMake is installed. If it is not, install it as follows:

  • Linux (Ubuntu/Debian):
sudo apt update && sudo apt install cmake
  • macOS:
brew install cmake
  • Windows: https://cmake.org/download/

  • If a previous CMake build is broken, run rm -rf build before recompiling the C++ code.

For Apple Silicon (Metal/MPS):

cmake -B build -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4

For NVIDIA GPU (CUDA):

cmake -B build -DGGML_CUDA=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4

For Vulkan GPU:

cmake -B build -DGGML_VULKAN=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4

For OpenCL:

cmake -B build -DGGML_OPENCL=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4

For SYCL (Intel OneAPI):

cmake -B build -DGGML_SYCL=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4

For Kompute / DirectX:

cmake -B build -DGGML_KOMPUTE=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4

For CPU-Only / Native ARM:

cmake -B build -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4

5. Run the Engine

  • Execute the router, pointing it to a local .gguf file:
python floatllm_router.py --hardware auto --model-path /path/to/your/model.gguf --prompt "What is the capital of France?"
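For readers curious how such a command line could be parsed, here is a minimal argparse sketch. It is not the router's actual code: only the three flag names come from the command above, while the default value, the required settings, and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the three flags shown in the run command.

    Hypothetical sketch: defaults and required flags are assumptions.
    """
    parser = argparse.ArgumentParser(prog="floatllm_router.py")
    parser.add_argument("--hardware", default="auto",
                        help="hardware target; 'auto' detects the host")
    parser.add_argument("--model-path", required=True,
                        help="path to a local .gguf file")
    parser.add_argument("--prompt", required=True,
                        help="prompt text to run through the model")
    return parser


if __name__ == "__main__":
    demo_args = build_parser().parse_args(
        ["--model-path", "model.gguf", "--prompt", "Hello"]
    )
    print(demo_args.hardware, demo_args.model_path, demo_args.prompt)
```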

🤖 AI Acknowledgement

FloatLLM was fundamentally driven by human architectural design, but AI tools were actively leveraged as collaborative research and debugging assistants. I acted as the core systems architect, directing the routing logic, memory management, and broad structural shifts.

During development, Google Search AI Overviews were utilized for researching core concepts and discovering cross-platform C++ libraries. Gemini was heavily utilized as a debugging partner to help troubleshoot the project's most difficult technical hurdles. Specifically, Gemini assisted in debugging the bare-metal C++ inference engine crashes, optimizing tensor management within the Python loader, and resolving complex OS-specific ggml bugs. All final implementations and architectural decisions were independently executed and tested.

About

FloatLLM is a hardware-agnostic, memory-aware LLM inference engine built in C++ to execute massive models (up to 405B parameters) on highly constrained, low-RAM devices using dynamic memory slicing. It also supports GPU acceleration.
