a platform for monitoring the chip situation
Updated Jul 19, 2025 - Shell
Models and training scripts for "LSTMs for Keyword Spotting with ReRAM-based Compute-In-Memory Architectures" (ISCAS 2021).
Docker image for a self-hosted WhisperLive real-time speech-to-text server, powered by faster-whisper. Provides WebSocket streaming for live audio transcription and an OpenAI-compatible REST API. Supports all Whisper models, VAD, NVIDIA GPU (CUDA) acceleration, offline mode, and multi-arch (amd64, arm64).
A pretty way to compress images
Whisper speech-to-text server installer for Ubuntu, Debian, AlmaLinux, Rocky Linux, CentOS, RHEL and Fedora. OpenAI-compatible transcription and translation APIs powered by faster-whisper. Supports all Whisper models, word-level timestamps, JSON/SRT/VTT output, SSE streaming and offline mode.
Coding assistant is a lightweight llama.cpp wrapper for quantized local SLM deployment
Build, run, and setup scripts for the complete TensorRT-LLM pipeline on RTX A6000 Ada (SM89). Reproducible path from HuggingFace checkpoint to deployable .engine file, with FP16 baseline and FP8 quantization. Companion material to the 4-part blog series on ai-box.eu — in preparation for the NVIDIA TensorRT Edge-LLM ecosystem.
Local GPU inference experiments for NVFP4 quantization and Spark model workflows.
KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.
⚡ TIMTEH Model Forge — Uncensored, abliterated & reasoning-distilled GGUFs. Forged on 8×H200 SXM5 | 1.1TB VRAM
LLM inference with 7x KV cache compression. Combines llama.cpp (production inference engine) with TurboQuant (KV quantization). Run 131K token context on 16GB VRAM. OpenAI-compatible API server. Supports 100+ model architectures.
🎤 Record and transcribe voice dictation on Linux with push-to-talk functionality, injecting text directly into any focused application.
vLLM serving stack for Gemma 4 31B on RTX PRO 6000 Blackwell, with FP8 KV cache, MTP speculative decoding, and an async FastAPI logging proxy in front.