Introduction

📘Documentation | 🛠️Quick Start | 🤔Reporting Issues

English | 简体中文


Latest News 🎉

2024

[2024/05] Balance vision model when deploying VLMs with multiple GPUs
[2024/05] Support 4-bit weight-only quantization and inference on VLMs, such as InternVL v1.5, LLaVA, InternLMXComposer2
[2024/04] Support Llama3 and more VLMs, such as InternVL v1.1, v1.2, MiniGemini, InternLMXComposer2.
[2024/04] TurboMind adds online int8/int4 KV cache quantization and inference for all supported devices. Refer here for a detailed guide
[2024/04] TurboMind's latest upgrade boosts GQA, raising internlm2-20b model inference to 16+ RPS, about 1.8x faster than vLLM.
[2024/04] Support Qwen1.5-MoE and DBRX.
[2024/03] Support DeepSeek-VL offline inference pipeline and serving.
[2024/03] Support VLM offline inference pipeline and serving.
[2024/02] Support Qwen 1.5, Gemma, Mistral, Mixtral, DeepSeek-MoE and so on.
[2024/01] OpenAOE seamless integration with LMDeploy Serving Service.
[2024/01] Support for multi-model, multi-machine, multi-card inference services. For usage instructions, please refer to here
[2024/01] Support PyTorch inference engine, developed entirely in Python, helping to lower the barriers for developers and enable rapid experimentation with new features and technologies.

2023

[2023/12] TurboMind supports multimodal input. Gradio Demo
[2023/11] TurboMind supports loading Hugging Face models directly. Click here for details.
[2023/11] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
[2023/09] TurboMind supports Qwen-14B
[2023/09] TurboMind supports InternLM-20B
[2023/09] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click here for deployment guide
[2023/09] TurboMind supports Baichuan2-7B
[2023/08] TurboMind supports flash-attention2.
[2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
[2023/08] TurboMind supports Windows (tp=1)
[2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation. Check this guide for detailed info
[2023/08] LMDeploy has launched on the HuggingFace Hub, providing ready-to-use 4-bit models.
[2023/08] LMDeploy supports 4-bit quantization using the AWQ algorithm.
[2023/07] TurboMind supports Llama-2 70B with GQA.
[2023/07] TurboMind supports Llama-2 7B/13B.
[2023/07] TurboMind supports tensor-parallel inference of InternLM.

Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:

Efficient Inference: LMDeploy delivers up to 1.8x higher request throughput than vLLM by introducing key features such as persistent batching (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels.
Effective Quantization: LMDeploy supports weight-only and k/v quantization, and 4-bit inference performance is 2.4x higher than FP16. Quantization quality has been confirmed via OpenCompass evaluation.
Effortless Distribution Server: Leveraging the request distribution service, LMDeploy makes it easy to deploy multi-model services efficiently across multiple machines and cards.
Interactive Inference Mode: By caching the attention k/v during multi-round dialogue, the engine remembers dialogue history and avoids reprocessing earlier rounds.
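
The bookkeeping behind the interactive inference mode can be illustrated with a toy sketch in plain Python. This is not LMDeploy's implementation: `ToySessionCache`, `step`, and `stateless_cost` are hypothetical names, and a list of token ids stands in for the real k/v tensors. It only compares how many tokens are processed when history is cached per session with how many are processed when every round re-sends the full history.

```python
from typing import Dict, List


class ToySessionCache:
    """Hypothetical stand-in for a per-session k/v cache (illustration only)."""

    def __init__(self) -> None:
        # session_id -> token ids whose k/v entries are treated as cached
        self._cached: Dict[int, List[int]] = {}

    def step(self, session_id: int, new_tokens: List[int]) -> int:
        """Add one dialogue round; return how many tokens had to be processed."""
        history = self._cached.setdefault(session_id, [])
        history.extend(new_tokens)  # only the new tokens count as work
        return len(new_tokens)

    def cached_length(self, session_id: int) -> int:
        """Number of tokens currently held for the session."""
        return len(self._cached.get(session_id, []))


def stateless_cost(rounds: List[List[int]]) -> int:
    """Tokens processed when each round re-processes the whole history."""
    total, seen = 0, 0
    for tokens in rounds:
        seen += len(tokens)
        total += seen
    return total


# Three dialogue rounds of 3, 2 and 4 tokens in a single session.
rounds = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
cache = ToySessionCache()
cached_cost = sum(cache.step(session_id=0, new_tokens=r) for r in rounds)
print(cached_cost, stateless_cost(rounds))  # 9 17
```

With caching, the work grows with the number of new tokens per round; without it, each round pays again for everything that came before.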

Performance

For detailed inference benchmarks on more devices and in more settings, please refer to the following links:

A100
V100
4090
3090
2080

Supported Models

LLMs

Llama (7B - 65B)
Llama2 (7B - 70B)
Llama3 (8B, 70B)
InternLM (7B - 20B)
InternLM2 (7B - 20B)
QWen (1.8B - 72B)
QWen1.5 (0.5B - 110B)
QWen1.5 - MoE (0.5B - 72B)
Baichuan (7B)
Baichuan2 (7B-13B)
Code Llama (7B - 34B)
ChatGLM2 (6B)
Falcon (7B - 180B)
YI (6B-34B)
Mistral (7B)
DeepSeek-MoE (16B)
Mixtral (8x7B, 8x22B)
Gemma (2B - 7B)
Dbrx (132B)
Phi-3-mini (3.8B)
StarCoder2 (3B - 15B)

VLMs

LLaVA(1.5,1.6) (7B-34B)
InternLM-XComposer2 (7B, 4khd-7B)
QWen-VL (7B)
DeepSeek-VL (7B)
InternVL-Chat (v1.1-v1.5)
MiniGeminiLlama (7B)

LMDeploy has developed two inference engines, TurboMind and PyTorch, each with a different focus. The former strives for ultimate optimization of inference performance, while the latter, developed purely in Python, aims to lower the barrier for developers.

They differ in the types of supported models and the inference data type. Please refer to this table for each engine's capabilities and choose the one that best fits your needs.
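
As a minimal sketch of how the engine choice might look in code, the helper below picks between `TurbomindEngineConfig` and `PytorchEngineConfig` and passes the result to `pipeline` through `backend_config`. The helper names `make_backend_config` and `demo` are hypothetical, and the snippet assumes both config classes are importable from the top-level `lmdeploy` package and accept a `tp` argument; check the engine documentation for your installed version.

```python
from typing import Any

SUPPORTED_ENGINES = ("turbomind", "pytorch")


def make_backend_config(engine: str = "turbomind", tp: int = 1) -> Any:
    """Build an engine config object to pass to `lmdeploy.pipeline`."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(
            f"unknown engine {engine!r}; expected one of {SUPPORTED_ENGINES}")
    if tp < 1:
        raise ValueError("tp must be a positive integer")
    # Imported lazily so the checks above work without lmdeploy installed.
    from lmdeploy import PytorchEngineConfig, TurbomindEngineConfig
    config_cls = (TurbomindEngineConfig
                  if engine == "turbomind" else PytorchEngineConfig)
    return config_cls(tp=tp)


def demo() -> None:
    """Run one prompt on the PyTorch engine (needs lmdeploy and a GPU)."""
    from lmdeploy import pipeline
    pipe = pipeline("internlm/internlm2-chat-7b",
                    backend_config=make_backend_config("pytorch"))
    print(pipe(["Hi, pls intro yourself"]))
```

Calling `demo()` downloads the model and runs inference, so it is left uncalled here.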

Quick Start

Installation

Install lmdeploy with pip (Python 3.8+) or from source:

pip install lmdeploy

Since v0.3.0, the default prebuilt package is compiled on CUDA 12. If CUDA 11+ is required, you can install lmdeploy with:

export LMDEPLOY_VERSION=0.3.0
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
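
If you script the installation, the same wheel URL can be assembled in Python. The sketch below only reproduces the URL pattern from the command above; `cu118_wheel_url` and `python_tag` are hypothetical helper names, and the snippet does not check that a wheel exists for a given version and Python combination.

```python
import sys

# URL pattern copied from the pip command above (cu118, manylinux2014 x86_64).
WHEEL_URL_TEMPLATE = (
    "https://github.com/InternLM/lmdeploy/releases/download/"
    "v{version}/lmdeploy-{version}+cu118-cp{py}-cp{py}-manylinux2014_x86_64.whl"
)


def python_tag(version_info=sys.version_info) -> str:
    """Return the CPython tag digits, e.g. (3, 8) -> "38"."""
    return f"{version_info[0]}{version_info[1]}"


def cu118_wheel_url(lmdeploy_version: str = "0.3.0", py: str = "38") -> str:
    """Fill the release URL for one lmdeploy version and Python tag."""
    return WHEEL_URL_TEMPLATE.format(version=lmdeploy_version, py=py)


print(cu118_wheel_url("0.3.0", python_tag((3, 8))))
```

The printed URL can be passed to `pip install` together with the `--extra-index-url` option shown above.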

Offline Batch Inference

import lmdeploy
pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

Note

By default, LMDeploy downloads models from Hugging Face. To use models from ModelScope instead, install ModelScope with pip install modelscope and set the environment variable:

export LMDEPLOY_USE_MODELSCOPE=True

For more information about inference pipeline, please refer to here.
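
The pipeline call can also take engine and sampling options. The sketch below is one possible arrangement, not the reference usage: `run_batch` and `kv_quant_policy` are hypothetical helpers, the sampling values are arbitrary, and the mapping of `quant_policy` values (0 for no KV cache quantization, 4 for int4, 8 for int8) reflects my understanding of the TurboMind engine config and should be checked against the KV cache quantization guide.

```python
from typing import Dict, List

# Assumed quant_policy values for the TurboMind KV cache; verify for your version.
KV_QUANT_POLICY: Dict[str, int] = {"none": 0, "int4": 4, "int8": 8}


def kv_quant_policy(kind: str = "none") -> int:
    """Translate a readable KV cache setting into a quant_policy value."""
    try:
        return KV_QUANT_POLICY[kind]
    except KeyError:
        raise ValueError(
            f"unknown KV cache setting {kind!r}; "
            f"expected one of {sorted(KV_QUANT_POLICY)}") from None


def run_batch(prompts: List[str],
              model: str = "internlm/internlm2-chat-7b",
              kv_cache: str = "none",
              max_new_tokens: int = 128) -> List[str]:
    """Generate one completion per prompt (needs lmdeploy and a GPU)."""
    # Imported lazily so the helper above can be used without lmdeploy.
    from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

    backend_config = TurbomindEngineConfig(
        quant_policy=kv_quant_policy(kv_cache))
    gen_config = GenerationConfig(max_new_tokens=max_new_tokens,
                                  top_p=0.8,
                                  temperature=0.8)
    pipe = pipeline(model, backend_config=backend_config)
    responses = pipe(prompts, gen_config=gen_config)
    return [response.text for response in responses]
```

For example, `run_batch(["Hi, pls intro yourself", "Shanghai is"], kv_cache="int8")` would run the two prompts from the example above with an int8 KV cache, assuming the installed version supports it.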

Tutorials

Please review the getting_started section for the basic usage of LMDeploy.

For detailed user guides and advanced guides, please refer to our tutorials:

User Guide
Advanced Guide

Third-party projects

Deploying LLMs offline on the NVIDIA Jetson platform with LMDeploy: LMDeploy-Jetson

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guidelines.

Acknowledgement

Citation

@misc{2023lmdeploy,
    title={LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM},
    author={LMDeploy Contributors},
    howpublished = {\url{https://github.com/InternLM/lmdeploy}},
    year={2023}
}

License

This project is released under the Apache 2.0 license.

