TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs
Updated Apr 10, 2026 - Python
Lightweight Modular AI Routing Engine for Local LLMs — Run specialised experts efficiently on consumer GPUs using smart Mixture-of-Experts routing.
Near-lossless 5-bit transformer compression - 23 architectures verified across 4 classes (dense + MoE + SSM + ViT, 0.6B-405B). Hermes-3-405B 1.0066x, Phi-4 1.00506x. SHA-256-verifiable, reproducible reconstruction. OpenAI-compatible API at api.sipsalabs.com. pip install ultracompress
🦁 Local AI for Consumer GPUs — Run powerful LLMs on GTX 1060/1080. No cloud. No subscriptions. Built on llama.cpp + CUDA.
RAM-Backed MCP Memory Architecture for Consumer LLM Inference — 900K token context on 16GB VRAM
Arbitrary Numbers
RAMP: RL-guided Adaptive Mixed-Precision quantization for GGUF models. Data-free sensitivity analysis, evolutionary search, per-tensor type optimization. Produces hardware-optimized GGUF for consumer GPUs.
Dynamic GPU Layer Swapping: Train large models on consumer GPUs with intelligent memory management
Surgical reasoning on consumer silicon. Hybrid SSM + causal memory architecture with entropy-gated System 1/2 dispatch, O(1) inference memory, and continual learning — designed for 16 GB VRAM.
Self-hosted LLM chat client with streaming UI for vLLM servers. Run Mistral-24B locally on RTX 4090/3090. Privacy-focused ChatGPT alternative for homelab/gaming PCs. Python/Rich terminal UI.
Reproducible local inference for FLUX.1 [schnell] on 8 GB Turing GPUs (RTX 2070). Sequential CPU offload, fp16 compute, mock-tested pipeline init - portfolio piece for a neuroscience to ML transition.
AeloRu (Adaptive Elastic Learning with Orthogonal Robust Units) enables real-time, continuous learning on resource-constrained devices. Drop-in memory module for LLMs.
A comprehensive, modular framework for fine-tuning Stable Diffusion 3.5 models using LoRA (Low-Rank Adaptation). Create custom AI image generators tailored to your artistic style, objects, or concepts with memory-efficient training on consumer GPUs.
GPT-OSS B20 Local Execution. Lightweight local environment for running the model with Python 3.12 and CUDA acceleration: run GPT-OSS B20 entirely offline, optimize text generation on the GPU, and enable fast, secure inference on consumer hardware.
Tiered GPU memory architecture for consumer AI inference. VRAM as execution cache, system RAM as passive staging layer.
ismail is a from-scratch Turkish language model implementation designed for low-end hardware, built and trained on a single RTX 5070 (12GB).
PILON (Primitive-Induced Linear Operator Network) explores a compositional weight parameterization for transformer FFN layers. The goal is to replace dense FFN matrices with shared low-rank primitives plus learned composition weights.
Kernel-level profiling of batch-1 decode on a consumer GPU, showing that decode is memory-bandwidth-bound against the hardware roofline, then beating the baseline with a fused dequant+GEMV CUDA kernel (with regime-aware attribution at the end).
Optimize FAISS-compatible vector quantization for fast, accurate vector search with TurboQuant
Compress KV cache for LLM inference with TurboQuant and vLLM integration to cut memory use and raise token capacity on dense and MoE models