lmdeploy_en
lmdeploy supports transformer architectures (such as LLaMA, LLaMA2, InternLM, Vicuna, etc.) and currently supports fp16, int8, and int4 inference.
Install the precompiled python package
python3 -m pip install lmdeploy
Convert the model to the lmdeploy inference format. Assuming the huggingface version of the LLaMA2 model has been downloaded to the /models/llama-2-7b-chat directory, the result will be stored in the workspace folder:
python3 -m lmdeploy.serve.turbomind.deploy llama2 /models/llama-2-7b-chat
Test the chat on the command line
python3 -m lmdeploy.turbomind.chat ./workspace
..
double enter to end input >>> who are you
..
Hello! I'm just an AI assistant ..
You can also start a WebUI to chat via gradio:
python3 -m lmdeploy.serve.gradio.app ./workspace
lmdeploy also supports the original Facebook model format and distributed inference for the 70B model. For usage, please refer to the official lmdeploy documentation.
lmdeploy implements kv cache int8 quantization, so the same amount of memory can serve more concurrent users.
First obtain the quantization parameters. The result is saved in workspace/triton_models/weights after the fp16 conversion; tensor parallelism is not needed for the 7B model.
# work_dir: huggingface format model
# turbomind_dir: directory to save the result
# kv_sym False: use asymmetric quantization
# num_tp: number of GPUs for tensor parallelism
python3 -m lmdeploy.lite.apis.kv_qparams \
  --work_dir /models/llama-2-7b-chat \
  --turbomind_dir ./workspace/triton_models/weights \
  --kv_sym False \
  --num_tp 1
Then modify the inference configuration to enable kv cache int8. Edit workspace/triton_models/weights/config.ini (a sketch of the edits follows this list):
- Change use_context_fmha to 0, which turns off flash attention
- Set quant_policy to 4, which enables kv cache quantization
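As a minimal sketch, assuming the generated config.ini uses the key = value layout implied above (the exact default values may differ), the two edits could be applied from the shell; editing the file by hand works just as well:
## hypothetical one-liners; check that these keys exist in your config.ini before running
sed -i 's/^use_context_fmha = .*/use_context_fmha = 0/' ./workspace/triton_models/weights/config.ini
sed -i 's/^quant_policy = .*/quant_policy = 4/' ./workspace/triton_models/weights/config.ini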
Finally execute the test
python3 -m lmdeploy.turbomind.chat ./workspace
Click here to view the kv cache int8 quantization implementation formula, accuracy and memory test report.
lmdeploy implements weight int4 quantization based on the AWQ algorithm. Relative to the fp16 version, the speed is 3.16x and memory usage drops from 16 GB to 6.3 GB.
The original llama2 model optimized with the AWQ algorithm is available here; you can download it directly:
git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4
For your own model, you can use the auto_awq tool to quantize it. Assuming your huggingface model is saved in /models/llama-2-7b-chat:
# model: huggingface format model
# w_bits: bit width for weight quantization
# w_group_size: group size for weight quantization statistics
# work_dir: directory to save the quantization result
python3 -m lmdeploy.lite.apis.auto_awq \
  --model /models/llama-2-7b-chat \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir ./llama2-chat-7b-w4
Run the following command to chat with the model in the terminal:
## Convert the model's layout and store it in the default path ./workspace
python3 -m lmdeploy.serve.turbomind.deploy \
--model-name llama2 \
--model-path ./llama2-chat-7b-w4 \
--model-format awq \
--group-size 128
## Inference
python3 -m lmdeploy.turbomind.chat ./workspace
Click here to view the memory and speed test results of weight int4 quantization.
Additionally, weight int4 and kv cache int8 do not conflict and can be enabled at the same time to save even more memory, as sketched below.
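A rough sketch of combining the two, reusing the commands and paths from the steps above (the ordering of the steps is an assumption; consult the lmdeploy documentation for the authoritative workflow):
## deploy the AWQ int4 weights into the default ./workspace (same as above)
python3 -m lmdeploy.serve.turbomind.deploy \
  --model-name llama2 \
  --model-path ./llama2-chat-7b-w4 \
  --model-format awq \
  --group-size 128
## export kv cache quantization parameters into the same workspace
python3 -m lmdeploy.lite.apis.kv_qparams \
  --work_dir /models/llama-2-7b-chat \
  --turbomind_dir ./workspace/triton_models/weights \
  --kv_sym False \
  --num_tp 1
## then set quant_policy to 4 in workspace/triton_models/weights/config.ini and chat as usual
python3 -m lmdeploy.turbomind.chat ./workspace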