LLM bootstrap loader for local CPU/GPU inference with fully customizable chat.
⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud, or on dedicated AI hardware.
Efficient and general syntactical decoding for Large Language Models
Implementation of Model-Distributed Inference for Large Language Models, built on top of LitGPT
Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
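Multi-LoRA servers of this kind generally share one frozen base model and keep many small low-rank adapter pairs, switching between them per request. Below is a minimal sketch of that idea; the shapes, the `adapters` registry, and the `lora_linear` helper are illustrative assumptions, not any particular server's API:

```python
# Sketch of per-request LoRA adapter selection: one shared base weight
# plus many small (A, B) low-rank pairs, chosen by adapter id.
import torch

d_in, d_out, rank = 64, 64, 8
base_weight = torch.randn(d_out, d_in)  # frozen base-model weight

adapters = {  # hypothetical registry: adapter id -> low-rank factors (A, B)
    "customer-a": (torch.randn(rank, d_in), torch.randn(d_out, rank)),
    "customer-b": (torch.randn(rank, d_in), torch.randn(d_out, rank)),
}

def lora_linear(x: torch.Tensor, adapter_id: str, scale: float = 1.0):
    """y = x W^T + scale * (x A^T) B^T  -- base output plus adapter delta.

    Real LoRA initializes B to zero and sets scale = alpha / rank; random
    values are used here only to make the per-adapter difference visible.
    """
    a, b = adapters[adapter_id]
    return x @ base_weight.T + scale * (x @ a.T) @ b.T

y = lora_linear(torch.randn(1, d_in), "customer-a")
```

Because the base weight dominates memory, thousands of such adapters can be kept resident and batched together, which is what makes this serving pattern scale.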
Run any open-source LLM, such as Llama 2 or Mistral, as an OpenAI-compatible API endpoint in the cloud.
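Serving tools like these expose the standard OpenAI chat-completions API, so the official openai Python client works against a local endpoint unchanged. A minimal sketch; the base URL, port, and model id below are assumptions, and each tool documents its own defaults:

```python
# Querying a locally served OpenAI-compatible endpoint with the openai client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",  # hypothetical local endpoint
    api_key="not-needed-locally",         # most local servers ignore the key
)

response = client.chat.completions.create(
    model="mistral-7b-instruct",          # hypothetical model id
    messages=[{"role": "user", "content": "Summarize LoRA in one sentence."}],
)
print(response.choices[0].message.content)
```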
The official evaluation suite and dynamic data release for MixEval.
Pretrain, finetune, and deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit quantization, LoRA, and more.
Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.
PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference.
JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (GPU support planned; PRs welcome).
llmon-py is a multimodal web UI for Llama 3 8B.
The easiest way to serve AI/ML models in production: build model inference services, LLM APIs, multi-model inference graphs/pipelines, LLM/RAG apps, and more!
🔮 SuperDuperDB: Bring AI to your database! Build, deploy and manage any AI application directly with your existing data infrastructure, without moving your data. Including streaming inference, scalable model training and vector search.
⚡ Build your chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs; runs LLMs efficiently on Intel platforms ⚡
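Weight quantization is the simplest of the compression techniques such toolkits offer. Here is a minimal sketch of symmetric per-tensor int8 quantization; production implementations add per-channel scales, calibration data, and fused low-precision kernels:

```python
# Symmetric int8 weight quantization: map floats to [-127, 127] with one
# scale, then recover approximate floats by multiplying the scale back.
import torch

def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0  # per-tensor scale (per-channel in practice)
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)
q, s = quantize_int8(w)
print((w - dequantize(q, s)).abs().max())  # worst-case quantization error
```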
ARS: Article Retrieval System
Implementation of the paper "Fast Inference from Transformers via Speculative Decoding" (Leviathan et al., 2023).
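The paper's core loop: a cheap draft model proposes a few tokens, the expensive target model verifies them all in one pass, each token is accepted with probability min(1, p/q), and the first rejection triggers a corrected resample from the residual distribution. A minimal sketch with toy stand-in distributions in place of real models:

```python
# Sketch of one speculative decoding step (Leviathan et al. 2023).
# draft_probs / target_probs are random stand-ins, not real LLMs.
import torch

vocab = 16
torch.manual_seed(0)

def draft_probs(prefix):   # stand-in for the cheap draft model
    return torch.softmax(torch.randn(vocab), dim=-1)

def target_probs(prefix):  # stand-in for the expensive target model
    return torch.softmax(torch.randn(vocab), dim=-1)

def speculative_step(prefix, gamma=4):
    # 1) Draft model proposes gamma tokens autoregressively.
    proposed, q_list, ctx = [], [], list(prefix)
    for _ in range(gamma):
        q = draft_probs(ctx)
        tok = torch.multinomial(q, 1).item()
        proposed.append(tok); q_list.append(q); ctx.append(tok)
    # 2) Target model scores every proposed position (one batched pass in practice).
    p_list = [target_probs(prefix + proposed[:i]) for i in range(gamma + 1)]
    # 3) Accept token i with prob min(1, p/q); resample on first rejection.
    accepted = []
    for i, tok in enumerate(proposed):
        p, q = p_list[i], q_list[i]
        if torch.rand(()) < min(1.0, (p[tok] / q[tok]).item()):
            accepted.append(tok)
        else:
            residual = torch.clamp(p - q, min=0)
            residual /= residual.sum()
            accepted.append(torch.multinomial(residual, 1).item())
            return prefix + accepted  # stop at the first rejection
    # 4) All gamma accepted: sample one bonus token from the target.
    bonus = torch.multinomial(p_list[gamma], 1).item()
    return prefix + accepted + [bonus]

print(speculative_step([1, 2, 3]))
```

The accept/resample rule makes the output distribution exactly match sampling from the target model alone; the speedup comes from verifying gamma draft tokens with a single target-model pass.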
Semantic embedding-based system for question answering from PDFs with visual analysis tools.
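The retrieval core of such a system is short: embed the PDF's text chunks once, then rank them against each question by cosine similarity. A minimal sketch using sentence-transformers; the model name and the example chunks are illustrative choices:

```python
# Embedding-based retrieval: encode passages once, rank them per question.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

chunks = [  # in practice: text extracted from the PDF and split into passages
    "LoRA freezes base weights and trains low-rank update matrices.",
    "Speculative decoding verifies draft tokens with the target model.",
    "Flash attention reduces memory traffic in the attention kernel.",
]
chunk_emb = model.encode(chunks, convert_to_tensor=True)

question = "How does LoRA fine-tuning work?"
scores = util.cos_sim(model.encode(question, convert_to_tensor=True), chunk_emb)[0]
print(chunks[int(scores.argmax())])  # best-matching passage
```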