# consumer-gpu

Here are 20 public repositories matching this topic...

Alberto-Codes / turboquant-vllm

TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs

compression transformer triton quantization inference-optimization kv-cache llm vllm consumer-gpu turboquant

Updated Apr 10, 2026
Python

Rahul-14507 / MELLM

Lightweight Modular AI Routing Engine for Local LLMs — Run specialised experts efficiently on consumer GPUs using smart Mixture-of-Experts routing.

python mixture-of-experts llama-cpp local-llm gguf consumer-gpu ai-router

Updated Apr 6, 2026
Python

sipsalabs / ultracompress

Near-lossless 5-bit transformer compression - 23 architectures verified across 4 classes (dense + MoE + SSM + ViT, 0.6B-405B). Hermes-3-405B 1.0066x, Phi-4 1.00506x. SHA-256-verifiable, reproducible reconstruction. OpenAI-compatible API at api.sipsalabs.com. pip install ultracompress

python compression cuda inference pytorch transformer lossless quantization mlops deep-tech openai-api llm patent-pending ai-infrastructure 405b consumer-gpu 5-bit sipsa-labs experimental-tech

Updated Jun 1, 2026
Python

smouj / kimari-local-ai

🦁 Local AI for Consumer GPUs — Run powerful LLMs on GTX 1060/1080. No cloud. No subscriptions. Built on llama.cpp + CUDA.

python cli nextjs cuda local-first llama-cpp llm-inference gguf open-webui openai-compatible-api consumer-gpu openclaw gtx-1060

Updated May 26, 2026
Python

anna-claudette / angruvadal

RAM-Backed MCP Memory Architecture for Consumer LLM Inference — 900K token context on 16GB VRAM

amd mcp rocm llm llama-cpp local-llm context-window rdna4 consumer-gpu rotorquant

Updated Mar 27, 2026
Python

arbitrary-number / arbitrary-number

Arbitrary Numbers

python machine-learning deep-learning tensorflow gpu cuda pytorch nvidia model-serving numerical-computing model-optimization edge-inference ai-inference ai-performance consumer-gpu

Updated Aug 12, 2025
Python

AIdevsmartdata / ramp-quant

RAMP: RL-guided Adaptive Mixed-Precision quantization for GGUF models. Data-free sensitivity analysis, evolutionary search, per-tensor type optimization. Produces hardware-optimized GGUF for consumer GPUs.

moe quantization sensitivity-analysis ramp mixed-precision llm llama-cpp qwen gguf qwen3 consumer-gpu imatrix ik-llama

Updated Apr 16, 2026
Python

obisin / dgls

Dynamic GPU Layer Swapping: Train large models on consumer GPUs with intelligent memory management

training pytorch gpu-memory memory-optimization consumer-gpu layer-swapping

Updated Sep 12, 2025
Python

Novamind-CS / Novamind-CS

Surgical reasoning on consumer silicon. Hybrid SSM + causal memory architecture with entropy-gated System 1/2 dispatch, O(1) inference memory, and continual learning — designed for 16 GB VRAM.

machine-learning research deep-learning pytorch lora language-model mamba causal-inference reasoning low-memory fine-tuning state-space-model continual-learning neuro-symbolic llm consumer-gpu

Updated Mar 22, 2026
Python

chris-colinsky / Zorac

Self-hosted LLM chat client with streaming UI for vLLM servers. Run Mistral-24B locally on RTX 4090/3090. Privacy-focused ChatGPT alternative for homelab/gaming PCs. Python/Rich terminal UI.

python cli ai self-hosted chat-client homelab mistral nvidia-gpu awq llm vllm chatgpt-alternative local-llm llm-inference offline-ai consumer-gpu

Updated Feb 27, 2026
Python

magicmayonaise / flux-local-inference

Reproducible local inference for FLUX.1 [schnell] on 8 GB Turing GPUs (RTX 2070). Sequential CPU offload, fp16 compute, mock-tested pipeline init - portfolio piece for a neuroscience to ML transition.

flux neuroscience pytorch text-to-image diffusion-models huggingface-diffusers local-inference flux-schnell consumer-gpu

Updated May 18, 2026
Python

jyimu / AeloRu

AeloRu (Adaptive Elastic Learning with Orthogonal Robust Units) enables real-time, continuous learning on resource-constrained devices. Drop-in memory module for LLMs.

lora hebbian-learning dora continuous-learning peft hebbian llm consumer-gpu relora

Updated May 28, 2026
Python

gokhaneraslan / stable-diffusion-3.5-lora-finetuning

A comprehensive, modular framework for fine-tuning Stable Diffusion 3.5 models using LoRA (Low-Rank Adaptation). Create custom AI image generators tailored to your artistic style, objects, or concepts with memory-efficient training on consumer GPUs.

Updated Jun 7, 2025
Python

tk-yasuno / gpt-oss-20b-local-execute

GPT-OSS B20 local execution: a lightweight local environment for running the model with Python 3.12 and CUDA acceleration. Run GPT-OSS B20 entirely offline, optimize text generation on the GPU, and enable fast, secure inference on consumer hardware.

text-generation performance-optimization gpu-optimization edge-ai inference-acceleration secure-inference model-runtime minimal-setup llm-inference open-source-llm local-execution offline-inference privacy-preserving-ai consumer-gpu gpt-oss-b20 lightweight-environment

Updated Aug 13, 2025
Python

junkyard22 / holster-memory

Tiered GPU memory architecture for consumer AI inference. VRAM as execution cache, system RAM as passive staging layer.

inference pytorch transformer gpu-memory memory-management offloading vram llm local-ai consumer-gpu vram-optimization

Updated Apr 15, 2026
Python

ikaganacar1 / ismail

ismail is a from-scratch Turkish language model implementation designed for low-end hardware, built and trained on a single RTX 5070 (12GB).

moe mla pytorch-implementation turkish-nlp llm llms llm-training deepseek-v3 turkish-llm consumer-gpu low-resource-llm

Updated Nov 19, 2025
Python

Babyhamsta / PILON

PILON (Primitive-Induced Linear Operator Network) explores a compositional weight parameterization for transformer FFN layers. The goal is to replace dense FFN matrices with shared low-rank primitives plus learned composition weights.

research pytorch transformer language-model model-compression bitnet weight-sharing low-rank ternary-quantization quantization-aware-training efficient-deep-learning consumer-gpu

Updated Mar 20, 2026
Python

rishi-more-2003 / decode-roofline

Kernel-level profiling of batch-1 decode on a consumer GPU to show that decode is memory-bandwidth-bound against the hardware roofline, then beat the baseline with a fused dequant+GEMV CUDA kernel (regime-aware attribution at the end).

gpu cuda cuda-kernels quantization llm-inference roofline-profiler consumer-gpu roofline-analysis

Updated May 30, 2026
Python

bojobh609 / TurboQuant

Optimize FAISS-compatible vector quantization for fast, accurate vector search with TurboQuant

python machine-learning deep-learning metal pytorch nearest-neighbor quant attention embedding mlx iclr rag inference-optimization kv-cache kv-cache-compression consumer-gpu turboquant

Updated Jun 1, 2026
Python

Robertmorrissteelproduction437 / turboquant

Compress KV cache for LLM inference with TurboQuant and vLLM integration to cut memory use and raise token capacity on dense and MoE models

compression gpu inference pytorch quantization spectral-analysis mlx iclr faiss memory-optimization huggingface kv-cache apple-silicon google-research llm vllm kv-cache-compression consumer-gpu turboquant

Updated Jun 1, 2026
Python
