Useful information about LLMs and their surrounding ecosystem is collected here
- vLLM: a fast and easy-to-use library for LLM inference and serving (blog); see the offline-inference sketch after this list
- RouteLLM: a framework for serving and evaluating LLM routers (paper)
- DeepSpeed: a deep learning optimization library that makes distributed training easy, efficient, and effective; see the initialization sketch after this list
- TVM: a compiler stack for deep learning systems, designed to close the gap between productivity-focused deep learning frameworks and performance- and efficiency-focused hardware backends
- FireOptimizer: customizing latency and quality for your production inference workload
- GGML: a tensor library for machine learning
- Medusa: a simple framework that democratizes acceleration techniques for LLM generation with multiple decoding heads
- Optimal Brain Compression (OBC): a framework for accurate post-training quantization (PTQ) and pruning (paper); a toy PTQ illustration follows this list
- LangChain: a framework for developing applications powered by large language models (LLMs); see the chain sketch after this list
- GenAIComps: a service-based tool that includes microservice components such as LLM, embedding, and reranking
- GenAIInfra: part of the OPEA containerization and cloud-native suite; it enables quick and efficient deployment of GenAIExamples in the cloud
- GenAIEval: measures service performance metrics such as throughput, latency, and accuracy for GenAIExamples, making it easy to compare performance across hardware configurations
- Speculative decoding: an acceleration technique in which a small draft model proposes several tokens and the large target model verifies them in a single pass (toy sketch after this list)
- OpenAI API
- Using logprobs from the OpenAI API (example request after this list)
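
A minimal vLLM offline-inference sketch; the model id here is only an example, and vLLM also exposes an OpenAI-compatible HTTP server:

```python
# Offline batch inference with vLLM (model id is an example).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                 # any HF model id
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["What is speculative decoding?"], params)
for out in outputs:
    print(out.outputs[0].text)
```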
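
A sketch of wrapping a PyTorch model with DeepSpeed; the config values are illustrative rather than tuned recommendations, and running it requires a CUDA environment plus the `deepspeed` launcher:

```python
# Wrap a PyTorch model in a DeepSpeed engine (illustrative config).
import torch
import deepspeed

model = torch.nn.Linear(512, 512)       # stand-in for a real network
ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},   # partition optimizer states
}

# The engine handles ZeRO partitioning, fp16 loss scaling, and gradient
# accumulation inside engine.backward() / engine.step().
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```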
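
To make the PTQ part of the OBC entry concrete, here is plain round-to-nearest int8 quantization. This is only the baseline quantize/dequantize step that methods like OBC improve on, not the OBC algorithm itself:

```python
# Toy post-training quantization: per-tensor, symmetric, round-to-nearest
# int8. NOT the OBC algorithm, just the baseline it refines.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0              # widest weight maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print("mean abs error:", np.abs(w - dequantize(q, s)).mean())
```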
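
A minimal LangChain sketch using the LCEL `|` composition syntax; the model name is an example, and an `OPENAI_API_KEY` is assumed in the environment:

```python
# Prompt -> model -> parser pipeline composed with LCEL's "|" operator.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Summarize in one line: {text}")
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

print(chain.invoke({"text": "vLLM serves LLMs with PagedAttention."}))
```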
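
A toy sketch of greedy speculative decoding with stand-in "models". Real implementations verify all drafted positions in a single target forward pass and use rejection sampling so the output distribution matches the target model exactly; here sequential calls and exact-match acceptance stand in for both:

```python
# Greedy speculative decoding, reduced to its control flow.
import random

VOCAB = ["a", "b", "c", "d"]

def draft_next(ctx):                 # cheap, less accurate proposer
    return random.choice(VOCAB)

def target_next(ctx):                # expensive, authoritative model
    return VOCAB[hash(tuple(ctx)) % len(VOCAB)]

def speculative_step(ctx, k=4):
    proposal = []
    for _ in range(k):               # draft k tokens autoregressively
        proposal.append(draft_next(ctx + proposal))
    accepted = []
    for tok in proposal:             # target checks each drafted token
        if target_next(ctx + accepted) == tok:
            accepted.append(tok)     # draft agreed: keep it for free
        else:
            break                    # first disagreement ends the run
    # Always make progress: append the target's own next token.
    accepted.append(target_next(ctx + accepted))
    return accepted

ctx = []
for _ in range(5):
    ctx += speculative_step(ctx)
print("generated:", "".join(ctx))
```

The payoff is that every accepted draft token costs a draft-model call instead of a target-model call, while the output stays what the target model would have produced.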
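
An example request for token log-probabilities from the OpenAI Chat Completions API; the model name is an example and an `OPENAI_API_KEY` is assumed:

```python
# Ask for per-token logprobs plus the top alternatives at each position.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Is the sky blue? Yes or no."}],
    logprobs=True,        # return a logprob for each generated token
    top_logprobs=3,       # and the 3 most likely alternatives
    max_tokens=1,
)
for tok in resp.choices[0].logprobs.content:
    print(tok.token, tok.logprob)
    for alt in tok.top_logprobs:
        print("  alt:", alt.token, alt.logprob)
```

Logprobs are handy for turning a single-token answer into a calibrated confidence score, e.g. comparing the probabilities of "Yes" and "No".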
A raw list of papers, sorted by general topic, is collected as well; a brief description and analysis of the current state of the field, based on those papers, can also be found there.