lmdeploy_en
lmdeploy supports transformer architectures (such as LLaMA, LLaMA2, InternLM, Vicuna, etc.) and currently supports fp16, int8, and int4 inference.
Install the precompiled python package
python3 -m pip install lmdeploy
Convert the model to the lmdeploy inference format. Assuming the huggingface version of the LLaMA2 model has been downloaded to the /models/llama-2-7b-chat directory, the result will be stored in the workspace folder:
python3 -m lmdeploy.serve.turbomind.deploy llama2 /models/llama-2-7b-chat
Test the chat on the command line
python3 -m lmdeploy.turbomind.chat ./workspace
..
double enter to end input >>> who are you
..
Hello! I'm just an AI assistant ..
You can also start a WebUI to chat via gradio:
python3 -m lmdeploy.serve.gradio.app ./workspace
lmdeploy also supports the original Facebook model format and distributed inference for the 70B model. For usage, please refer to the official lmdeploy documentation.
lmdeploy implements kv cache int8 quantization, so the same amount of memory can serve more concurrent users.
First obtain the quantization parameters. The result is saved in workspace/triton_models/weights after the fp16 conversion; tensor parallelism is not needed for the 7B model.
# work_dir: huggingface format model
# turbomind_dir: directory to save the result
# kv_sym False: use asymmetric quantization
# num_tp: number of GPUs for tensor parallelism
python3 -m lmdeploy.lite.apis.kv_qparams \
  --work_dir /models/llama-2-7b-chat \
  --turbomind_dir ./workspace/triton_models/weights \
  --kv_sym False \
  --num_tp 1
Then modify the inference configuration to enable kv cache int8. Edit workspace/triton_models/weights/config.ini (a sketch of the edits follows this list):
- Change use_context_fmha to 0, which turns off flash attention
- Set quant_policy to 4, which enables kv cache quantization
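As a minimal sketch, assuming the generated config.ini uses the key = value layout implied above (the exact default values may differ), the two edits could be applied from the shell; editing the file by hand works just as well:
## hypothetical one-liners; check that these keys exist in your config.ini before running
sed -i 's/^use_context_fmha = .*/use_context_fmha = 0/' ./workspace/triton_models/weights/config.ini
sed -i 's/^quant_policy = .*/quant_policy = 4/' ./workspace/triton_models/weights/config.ini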
Finally execute the test
python3 -m lmdeploy.turbomind.chat ./workspace
Click here to view the kv cache int8 quantization implementation formula, accuracy and memory test report.
lmdeploy implements weight int4 quantization based on the AWQ algorithm. Relative to the fp16 version, the speed is 3.16x and memory usage drops from 16 GB to 6.3 GB.
The original llama2 model optimized with the AWQ algorithm is available here; you can download it directly:
git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4
For your own model, you can use the auto_awq tool to quantize it. Assuming your huggingface model is saved in /models/llama-2-7b-chat:
# model: huggingface format model
# w_bits: bit width for weight quantization
# w_group_size: group size for weight quantization statistics
# work_dir: directory to save the quantization result
python3 -m lmdeploy.lite.apis.auto_awq \
  --model /models/llama-2-7b-chat \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir ./llama2-chat-7b-w4
Run the following command to chat with the model in the terminal:
## Convert the model's layout and store it in the default path ./workspace
python3 -m lmdeploy.serve.turbomind.deploy \
--model-name llama2 \
--model-path ./llama2-chat-7b-w4 \
--model-format awq \
--group-size 128
## Inference
python3 -m lmdeploy.turbomind.chat ./workspace
Click here to view the memory and speed test results of weight int4 quantization.
Additionally, weight int4 and kv cache int8 do not conflict and can be enabled at the same time to save even more memory, as sketched below.
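A rough sketch of combining the two, reusing the commands and paths from the steps above (the ordering of the steps is an assumption; consult the lmdeploy documentation for the authoritative workflow):
## deploy the AWQ int4 weights into the default ./workspace (same as above)
python3 -m lmdeploy.serve.turbomind.deploy \
  --model-name llama2 \
  --model-path ./llama2-chat-7b-w4 \
  --model-format awq \
  --group-size 128
## export kv cache quantization parameters into the same workspace
python3 -m lmdeploy.lite.apis.kv_qparams \
  --work_dir /models/llama-2-7b-chat \
  --turbomind_dir ./workspace/triton_models/weights \
  --kv_sym False \
  --num_tp 1
## then set quant_policy to 4 in workspace/triton_models/weights/config.ini and chat as usual
python3 -m lmdeploy.turbomind.chat ./workspace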