diff --git "a/docs/source/LLM/LLM\346\216\250\347\220\206\346\226\207\346\241\243.md" "b/docs/source/LLM/LLM\346\216\250\347\220\206\346\226\207\346\241\243.md" index 1ed6aff05..02484682a 100644 --- "a/docs/source/LLM/LLM\346\216\250\347\220\206\346\226\207\346\241\243.md" +++ "b/docs/source/LLM/LLM\346\216\250\347\220\206\346\226\207\346\241\243.md" @@ -183,26 +183,47 @@ model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': ' template = get_template(template_type, tokenizer) seed_everything(42) + query = '浙江的省会在哪里?' gen = inference_stream(model, template, query) print(f'query: {query}') for response, history in gen: - print(f'response: {response}') + pass +print(f'response: {response}') + +# 方式1 query = '这有什么好吃的?' -gen = inference_stream(model, template, query, history) +old_history = history +gen = inference_stream(model, template, query, old_history) print(f'query: {query}') for response, history in gen: print(f'response: {response}') print(f'history: {history}') +# 方式2 +query = '这有什么好吃的?' +gen = inference_stream(model, template, query, old_history) +print_idx = 0 +print(f'query: {query}\nresponse: ', end='') +for response, history in gen: + delta = response[print_idx:] + print(delta, end='', flush=True) + print_idx = len(response) +print(f'\nhistory: {history}') + """Out[0] query: 浙江的省会在哪里? -... response: 浙江省的省会是杭州。 query: 这有什么好吃的? +response: 杭 +response: 杭州 +response: 杭州市有 ... -response: 杭州市有很多著名的美食,例如西湖醋鱼、龙井虾仁、糖醋排骨、毛血旺等。此外,还有杭州特色的点心,如桂花糕、荷花酥、艾窝窝等。 -history: [('浙江的省会在哪里?', '浙江省的省会是杭州。'), ('这有什么好吃的?', '杭州市有很多著名的美食,例如西湖醋鱼、龙井虾仁、糖醋排骨、毛血旺等。此外,还有杭州特色的点心,如桂花糕、荷花酥、艾窝窝等。')] +response: 杭州市有很多著名的美食,例如西湖醋鱼、龙井虾仁、糖醋排骨、毛血旺等。此外,还有杭州特色的点心,如桂花酥饼、抹茶糕点等。 +history: [['浙江的省会在哪里?', '浙江省的省会是杭州。'], ['这有什么好吃的?', '杭州市有很多著名的美食,例如西湖醋鱼、龙井虾仁、糖醋排骨、毛血旺等。此外,还有杭州特色的点心,如桂花酥饼、抹茶糕点等。']] +query: 这有什么好吃的? +response: 杭州有许多美食,比如西湖醋鱼、龙井虾仁、酱鸭等。此外,还有许多小吃,如烧麦、春卷、油条等,都是浙江特色美食。 +history: [['浙江的省会在哪里?', '浙江省的省会是杭州。'], ['这有什么好吃的?', '杭州有许多美食,比如西湖醋鱼、龙井虾仁、酱鸭等。此外,还有许多小吃,如烧麦、春卷、油条等,都是浙江特色美食。']] """ ``` diff --git "a/docs/source/LLM/LLM\351\207\217\345\214\226\346\226\207\346\241\243.md" "b/docs/source/LLM/LLM\351\207\217\345\214\226\346\226\207\346\241\243.md" index a61a4e852..a41000c0d 100644 --- "a/docs/source/LLM/LLM\351\207\217\345\214\226\346\226\207\346\241\243.md" +++ "b/docs/source/LLM/LLM\351\207\217\345\214\226\346\226\207\346\241\243.md" @@ -1,20 +1,17 @@ # LLM量化文档 -swift支持使用awq, gptq, bnb, hqq, eetq技术对模型进行量化. 其中awq, gptq量化技术支持vllm进行推理加速, 且量化后的模型支持qlora微调. 
+swift支持使用awq、gptq、bnb、hqq、eetq技术对模型进行量化。其中awq、gptq量化技术支持vllm进行推理加速,需要使用校准数据集,量化性能更好,但量化速度较慢。而bnb、hqq、eetq无需校准数据,量化速度较快。这五种量化方法都支持qlora微调。 -**注意** 量化在不同指令下的作用不同 -- sft lora训练中指定量化用于`qlora`,用于降低训练所需显存 -- export中指定量化用于量化模型并保存。 -- infer中指定量化用于量化模型并推理。 +awq、gptq需要使用`swift export`进行量化。而bnb、hqq、eetq可以直接在sft和infer时进行快速量化。 -其中bnb,hqq,eetq无需校准数据,量化速度较快,在 sft lora 训练 和 infer 中使用,指定`--quant_method bnb/hqq/eetq` -awq,gptq需要校准数据,在 export 中使用,`--quant_method awq/gptq` +从vllm推理加速支持的角度来看,更推荐使用awq和gptq进行量化。从量化效果的角度来看,更推荐使用awq、hqq和gptq进行量化。而从量化速度的角度来看,更推荐使用hqq进行量化。 + ## 目录 - [环境准备](#环境准备) -- [量化微调(qlora)](#量化微调(qlora)) - [原始模型](#原始模型) - [微调后模型](#微调后模型) +- [QLoRA微调](#QLoRA微调) - [推送模型](#推送模型) ## 环境准备 @@ -35,6 +32,9 @@ pip install autoawq -U # auto_gptq和cuda版本有对应关系,请按照`https://github.com/PanQiWei/AutoGPTQ#quick-installation`选择版本 pip install auto_gptq -U +# 使用bnb量化: +pip install bitsandbytes -U + # 使用hqq量化: # 需要transformers版本>4.40,从源码安装 pip install git+https://github.com/huggingface/transformers @@ -58,54 +58,9 @@ pip install -r requirements/framework.txt -U pip install -r requirements/llm.txt -U ``` -## 量化微调(qlora) -在sft lora训练中指定`--quant_method`和`--quantization_bit`来执行qlora,显著减少训练所需显存 -```bash -CUDA_VISIBLE_DEVICES=0 swift sft \ - --model_type qwen1half-7b-chat \ - --sft_type lora \ - --dataset alpaca-zh#5000 \ - --quant_method hqq \ - --quantization_bit 4 \ - -CUDA_VISIBLE_DEVICES=0 swift sft \ - --model_type qwen1half-7b-chat \ - --sft_type lora \ - --dataset alpaca-zh#5000 \ - --quant_method eetq \ - --dtype fp16 \ - -CUDA_VISIBLE_DEVICES=0 swift sft \ - --model_type qwen1half-7b-chat \ - --sft_type lora \ - --dataset alpaca-zh#5000 \ - --quant_method bnb \ - --quantization_bit 4 \ - --dtype fp16 \ -``` -**注意** -- hqq支持更多自定义参数,比如为不同网络层指定不同量化配置,具体请见[命令行参数](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md) -- eetq量化为8bit量化,无需指定quantization_bit。目前不支持bf16,需要指定dtype为fp16 -- eetq目前qlora速度比较慢,推荐使用hqq。参考[issue](https://github.com/NetEase-FuXi/EETQ/issues/17) - ## 原始模型 -使用bnb,hqq,eetq量化模型并推理 -```bash -CUDA_VISIBLE_DEVICES=0 swift infer \ - --model_type qwen1half-7b-chat \ - --quant_method bnb \ - --quantization_bit 4 -CUDA_VISIBLE_DEVICES=0 swift infer \ - --model_type qwen1half-7b-chat \ - --quant_method hqq \ - --quantization_bit 4 - -CUDA_VISIBLE_DEVICES=0 swift infer \ - --model_type qwen1half-7b-chat \ - --quant_method eetq \ - --dtype fp16 -``` +### awq、gptq 这里展示对qwen1half-7b-chat进行awq, gptq量化. ```bash @@ -234,6 +189,25 @@ CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen1half-7b-chat ``` +### bnb、hqq、eetq +对于bnb、hqq、eetq,我们只需要使用swift infer来进行快速量化并推理。 +```bash +CUDA_VISIBLE_DEVICES=0 swift infer \ + --model_type qwen1half-7b-chat \ + --quant_method bnb \ + --quantization_bit 4 + +CUDA_VISIBLE_DEVICES=0 swift infer \ + --model_type qwen1half-7b-chat \ + --quant_method hqq \ + --quantization_bit 4 + +CUDA_VISIBLE_DEVICES=0 swift infer \ + --model_type qwen1half-7b-chat \ + --quant_method eetq \ + --dtype fp16 +``` + ## 微调后模型 假设你使用lora微调了qwen1half-4b-chat, 模型权重目录为: `output/qwen1half-4b-chat/vx-xxx/checkpoint-xxx`. 
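> Editor's note: the section above states that bnb, hqq and eetq quantize weights on the fly (no calibration set) when passed to `swift sft`/`swift infer`. As a rough illustration of what the bnb path corresponds to underneath, here is a minimal sketch using the generic transformers + bitsandbytes API; the model id is only illustrative and this is not the swift-internal code path.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = 'Qwen/Qwen1.5-7B-Chat'  # illustrative model id, not taken from this patch
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # roughly corresponds to --quantization_bit 4
    bnb_4bit_compute_dtype=torch.float16,  # roughly corresponds to --dtype fp16
    bnb_4bit_quant_type='nf4',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map='auto')
# No calibration data is needed: weights are quantized as they are loaded,
# which is why this family of methods is fast but does not help vllm acceleration.
```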
@@ -281,6 +255,65 @@ curl http://localhost:8000/v1/chat/completions \ }' ``` +## QLoRA微调 + +### awq、gptq +如果想要对awq、gptq量化的模型进行qlora微调,你需要进行提前量化。例如可以对原始模型使用`swift export`进行量化。然后使用以下命令进行微调,你需要指定`--quant_method`来指定对应量化的方式: + +```bash +# awq +CUDA_VISIBLE_DEVICES=0 swift sft \ + --model_type qwen1half-7b-chat \ + --model_id_or_path qwen1half-7b-chat-awq-int4 \ + --quant_method awq \ + --sft_type lora \ + --dataset alpaca-zh#5000 \ + +# gptq +CUDA_VISIBLE_DEVICES=0 swift sft \ + --model_type qwen1half-7b-chat \ + --model_id_or_path qwen1half-7b-chat-gptq-int4 \ + --quant_method gptq \ + --sft_type lora \ + --dataset alpaca-zh#5000 \ +``` + + +### bnb、hqq、eetq +如果想要使用bnb、hqq、eetq进行qlora微调,你需要在训练中指定`--quant_method`和`--quantization_bit`: + +```bash +# bnb +CUDA_VISIBLE_DEVICES=0 swift sft \ + --model_type qwen1half-7b-chat \ + --sft_type lora \ + --dataset alpaca-zh#5000 \ + --quant_method bnb \ + --quantization_bit 4 \ + --dtype fp16 \ + +# hqq +CUDA_VISIBLE_DEVICES=0 swift sft \ + --model_type qwen1half-7b-chat \ + --sft_type lora \ + --dataset alpaca-zh#5000 \ + --quant_method hqq \ + --quantization_bit 4 \ + +# eetq +CUDA_VISIBLE_DEVICES=0 swift sft \ + --model_type qwen1half-7b-chat \ + --sft_type lora \ + --dataset alpaca-zh#5000 \ + --quant_method eetq \ + --dtype fp16 \ +``` + +**注意** +- hqq支持更多自定义参数,比如为不同网络层指定不同量化配置,具体请见[命令行参数](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md) +- eetq量化为8bit量化,无需指定quantization_bit。目前不支持bf16,需要指定dtype为fp16 +- eetq目前qlora速度比较慢,推荐使用hqq。参考[issue](https://github.com/NetEase-FuXi/EETQ/issues/17) + ## 推送模型 假设你使用lora微调了qwen1half-4b-chat, 模型权重目录为: `output/qwen1half-4b-chat/vx-xxx/checkpoint-xxx`. diff --git "a/docs/source/LLM/Qwen1.5\345\205\250\346\265\201\347\250\213\346\234\200\344\275\263\345\256\236\350\267\265.md" "b/docs/source/LLM/Qwen1.5\345\205\250\346\265\201\347\250\213\346\234\200\344\275\263\345\256\236\350\267\265.md" index 11ff7c48a..88289a178 100644 --- "a/docs/source/LLM/Qwen1.5\345\205\250\346\265\201\347\250\213\346\234\200\344\275\263\345\256\236\350\267\265.md" +++ "b/docs/source/LLM/Qwen1.5\345\205\250\346\265\201\347\250\213\346\234\200\344\275\263\345\256\236\350\267\265.md" @@ -413,7 +413,9 @@ for query in ['78654+657=?', '晚上睡不着觉怎么办']: print(f'query: {query}') print('response: ', end='') + response = '' for chunk in stream_resp: + response += chunk.choices[0].delta.content print(chunk.choices[0].delta.content, end='', flush=True) print() messages.append({'role': 'assistant', 'content': response}) @@ -574,7 +576,9 @@ for query in ['78654+657=?', '晚上睡不着觉怎么办']: print(f'query: {query}') print('response: ', end='') + response = '' for chunk in stream_resp: + response += chunk.choices[0].delta.content print(chunk.choices[0].delta.content, end='', flush=True) print() messages.append({'role': 'assistant', 'content': response}) diff --git "a/docs/source/LLM/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" "b/docs/source/LLM/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" index 7aa60846f..d27d079c9 100644 --- "a/docs/source/LLM/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" +++ "b/docs/source/LLM/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" @@ -121,8 +121,8 @@ 
|llama-3-chinese-8b-instruct|[ChineseAlpacaGroup/llama-3-chinese-8b-instruct](https://modelscope.cn/models/ChineseAlpacaGroup/llama-3-chinese-8b-instruct/summary)|q_proj, k_proj, v_proj|llama3|✔|✔||-|[hfl/llama-3-chinese-8b-instruct](https://huggingface.co/hfl/llama-3-chinese-8b-instruct)| |atom-7b|[FlagAlpha/Atom-7B](https://modelscope.cn/models/FlagAlpha/Atom-7B/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔||-|[FlagAlpha/Atom-7B](https://huggingface.co/FlagAlpha/Atom-7B)| |atom-7b-chat|[FlagAlpha/Atom-7B-Chat](https://modelscope.cn/models/FlagAlpha/Atom-7B-Chat/summary)|q_proj, k_proj, v_proj|atom|✔|✔||-|[FlagAlpha/Atom-7B-Chat](https://huggingface.co/FlagAlpha/Atom-7B-Chat)| -|llava1d6-mistral-7b-instruct|[AI-ModelScope/llava-v1.6-mistral-7b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-mistral-7b/summary)|q_proj, k_proj, v_proj|llava-mistral-instruct|✔|✘|transformers>=4.34|multi-modal, vision|[liuhaotian/llava-v1.6-mistral-7b](https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b)| -|llava1d6-yi-34b-instruct|[AI-ModelScope/llava-v1.6-34b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-34b/summary)|q_proj, k_proj, v_proj|llava-yi-instruct|✔|✘||multi-modal, vision|[liuhaotian/llava-v1.6-34b](https://huggingface.co/liuhaotian/llava-v1.6-34b)| +|llava1_6-mistral-7b-instruct|[AI-ModelScope/llava-v1.6-mistral-7b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-mistral-7b/summary)|q_proj, k_proj, v_proj|llava-mistral-instruct|✔|✘|transformers>=4.34|multi-modal, vision|[liuhaotian/llava-v1.6-mistral-7b](https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b)| +|llava1_6-yi-34b-instruct|[AI-ModelScope/llava-v1.6-34b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-34b/summary)|q_proj, k_proj, v_proj|llava-yi-instruct|✔|✘||multi-modal, vision|[liuhaotian/llava-v1.6-34b](https://huggingface.co/liuhaotian/llava-v1.6-34b)| |llama3-llava-next-8b|[AI-Modelscope/llama3-llava-next-8b](https://modelscope.cn/models/AI-Modelscope/llama3-llava-next-8b/summary)|q_proj, k_proj, v_proj|llama-llava-next|✔|✘||multi-modal, vision|[lmms-lab/llama3-llava-next-8b](https://huggingface.co/lmms-lab/llama3-llava-next-8b)| |llava-next-72b|[AI-Modelscope/llava-next-72b](https://modelscope.cn/models/AI-Modelscope/llava-next-72b/summary)|q_proj, k_proj, v_proj|llava-qwen-instruct|✔|✘||multi-modal, vision|[lmms-lab/llava-next-72b](https://huggingface.co/lmms-lab/llava-next-72b)| |llava-next-110b|[AI-Modelscope/llava-next-110b](https://modelscope.cn/models/AI-Modelscope/llava-next-110b/summary)|q_proj, k_proj, v_proj|llava-qwen-instruct|✔|✘||multi-modal, vision|[lmms-lab/llava-next-110b](https://huggingface.co/lmms-lab/llava-next-110b)| @@ -236,7 +236,7 @@ |baichuan2-13b-chat|[baichuan-inc/Baichuan2-13B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat/summary)|W_pack|baichuan|✘|✔||-|[baichuan-inc/Baichuan2-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat)| |baichuan2-13b-chat-int4|[baichuan-inc/Baichuan2-13B-Chat-4bits](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat-4bits/summary)|W_pack|baichuan|✘|✘|bitsandbytes<0.41.2, accelerate<0.26|-|[baichuan-inc/Baichuan2-13B-Chat-4bits](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat-4bits)| |mplug-owl2-chat|[iic/mPLUG-Owl2](https://modelscope.cn/models/iic/mPLUG-Owl2/summary)|q_proj, k_proj.multiway.0, k_proj.multiway.1, v_proj.multiway.0, v_proj.multiway.1|mplug-owl2|✔|✘|transformers<4.35, icecream|-|[MAGAer13/mplug-owl2-llama2-7b](https://huggingface.co/MAGAer13/mplug-owl2-llama2-7b)| 
-|mplug-owl2d1-chat|[iic/mPLUG-Owl2.1](https://modelscope.cn/models/iic/mPLUG-Owl2.1/summary)|c_attn.multiway.0, c_attn.multiway.1|mplug-owl2|✔|✘|transformers<4.35, icecream|-|[Mizukiluke/mplug_owl_2_1](https://huggingface.co/Mizukiluke/mplug_owl_2_1)| +|mplug-owl2_1-chat|[iic/mPLUG-Owl2.1](https://modelscope.cn/models/iic/mPLUG-Owl2.1/summary)|c_attn.multiway.0, c_attn.multiway.1|mplug-owl2|✔|✘|transformers<4.35, icecream|-|[Mizukiluke/mplug_owl_2_1](https://huggingface.co/Mizukiluke/mplug_owl_2_1)| |yuan2-2b-instruct|[YuanLLM/Yuan2.0-2B-hf](https://modelscope.cn/models/YuanLLM/Yuan2.0-2B-hf/summary)|q_proj, k_proj, v_proj|yuan|✔|✘||-|[IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf)| |yuan2-2b-janus-instruct|[YuanLLM/Yuan2-2B-Janus-hf](https://modelscope.cn/models/YuanLLM/Yuan2-2B-Janus-hf/summary)|q_proj, k_proj, v_proj|yuan|✔|✘||-|[IEITYuan/Yuan2-2B-Janus-hf](https://huggingface.co/IEITYuan/Yuan2-2B-Janus-hf)| |yuan2-51b-instruct|[YuanLLM/Yuan2.0-51B-hf](https://modelscope.cn/models/YuanLLM/Yuan2.0-51B-hf/summary)|q_proj, k_proj, v_proj|yuan|✔|✘||-|[IEITYuan/Yuan2-51B-hf](https://huggingface.co/IEITYuan/Yuan2-51B-hf)| diff --git "a/docs/source/LLM/\350\207\252\345\256\232\344\271\211\344\270\216\346\213\223\345\261\225.md" "b/docs/source/LLM/\350\207\252\345\256\232\344\271\211\344\270\216\346\213\223\345\261\225.md" index fa32348a9..c6c9f5016 100644 --- "a/docs/source/LLM/\350\207\252\345\256\232\344\271\211\344\270\216\346\213\223\345\261\225.md" +++ "b/docs/source/LLM/\350\207\252\345\256\232\344\271\211\344\270\216\346\213\223\345\261\225.md" @@ -8,8 +8,8 @@ 我们支持三种**自定义数据集**的方法. 1. 【推荐】**命令行参数**的形式: **更加方便支持自定义数据集**, 支持四种数据集格式(即使用`SmartPreprocessor`), 支持`dataset_id`和`dataset_path`. -2. 添加数据集到`dataset_info.json`中, 比第一种方式更灵活, 支持对数据集使用两种预处理器并指定其参数: `RenameColumnsPreprocessor`, `ConversationsPreprocessor`(默认使用`SmartPreprocessor`). 支持直接修改swift内置的`dataset_info.json`, 或者通过`--dataset_info_path xxx.json`的方式传入外置的json文件(方便pip install而非git clone的用户拓展数据集). -3. **注册数据集**的方式: 比第1、2种方式更加灵活, 支持使用函数对数据集进行预处理. 方法1、2在实现上借助了方法3. 可以直接修改源码进行拓展, 或者通过`--custom_register_path xxx.py`的方式传入, 脚本会对py文件进行解析(方便pip install的用户). +2. 添加数据集到`dataset_info.json`中, 比第一种方式更灵活但繁琐, 支持对数据集使用两种预处理器并指定其参数: `RenameColumnsPreprocessor`, `ConversationsPreprocessor`(默认使用`SmartPreprocessor`). 支持直接修改swift内置的`dataset_info.json`, 或者通过`--dataset_info_path xxx.json`的方式传入外置的json文件(方便pip install而非git clone的用户拓展数据集). +3. **注册数据集**的方式: 比第1、2种方式更加灵活但繁琐, 支持使用函数对数据集进行预处理. 方法1、2在实现上借助了方法3. 可以直接修改源码进行拓展, 或者通过`--custom_register_path xxx.py`的方式传入, 脚本会对py文件进行解析(方便pip install的用户). ### 📌 【推荐】命令行参数的形式 支持直接传入行自定义的**dataset_id**(兼容MS和HF)和**dataset_path**, 以及同时传入多个自定义数据集以及对应采样数, 脚本会进行自动的预处理和拼接. 如果传入的是`dataset_id`, 默认会使用dataset\_id中的'default'子数据集, 并设置split为'train'. 如果该dataset\_id已经注册, 则会使用注册时传入的subsets、split以及预处理函数. 如果传入的是`dataset_path`, 则可以指定为相对路径和绝对路径, 其中相对路径为相对于当前运行目录. 
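> Editor's note: the 自定义与拓展 hunk above describes passing a custom `dataset_path` on the command line as the recommended route. A minimal sketch of producing such a file follows; the `query`/`response`/`history` field names are an assumption about what `SmartPreprocessor` accepts, not something confirmed by this patch.

```python
import json

# Two toy records in the assumed query/response(/history) format.
rows = [
    {'query': '浙江的省会在哪里?', 'response': '浙江省的省会是杭州。'},
    {'query': '这有什么好吃的?',
     'response': '杭州有西湖醋鱼、龙井虾仁等美食。',
     'history': [['浙江的省会在哪里?', '浙江省的省会是杭州。']]},
]
with open('my_dataset.jsonl', 'w', encoding='utf-8') as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + '\n')
```

Such a file would then be referenced by its (relative or absolute) path when training, as the section above describes for `dataset_path`.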
diff --git "a/docs/source/Multi-Modal/llava\346\234\200\344\275\263\345\256\236\350\267\265.md" "b/docs/source/Multi-Modal/llava\346\234\200\344\275\263\345\256\236\350\267\265.md" index fa485b3ab..5e9f91561 100644 --- "a/docs/source/Multi-Modal/llava\346\234\200\344\275\263\345\256\236\350\267\265.md" +++ "b/docs/source/Multi-Modal/llava\346\234\200\344\275\263\345\256\236\350\267\265.md" @@ -4,8 +4,8 @@ | model | model_type | |-------|------------| -| [llava-v1.6-mistral-7b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-mistral-7b/summary) | llava1d6-mistral-7b-instruct | -| [llava-v1.6-34b](https://www.modelscope.cn/models/AI-ModelScope/llava-v1.6-34b/summary) | llava1d6-yi-34b-instruct | +| [llava-v1.6-mistral-7b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-mistral-7b/summary) | llava1_6-mistral-7b-instruct | +| [llava-v1.6-34b](https://www.modelscope.cn/models/AI-ModelScope/llava-v1.6-34b/summary) | llava1_6-yi-34b-instruct | |[llama3-llava-next-8b](https://modelscope.cn/models/AI-ModelScope/llama3-llava-next-8b/summary)|llama3-llava-next-8b| |[llava-next-72b](https://modelscope.cn/models/AI-ModelScope/llava-next-72b/summary)|llava-next-72b| |[llava-next-110b](https://modelscope.cn/models/AI-ModelScope/llava-next-110b/summary)|llava-next-110b| @@ -30,13 +30,13 @@ pip install -e '.[llm]' ```shell # Experimental environment: A100 # 20GB GPU memory -CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava1d6-mistral-7b-instruct +CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava1_6-mistral-7b-instruct # 70GB GPU memory -CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava1d6-yi-34b-instruct +CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava1_6-yi-34b-instruct # 4*20GB GPU memory -CUDA_VISIBLE_DEVICES=0,1,2,3 swift infer --model_type llava1d6-yi-34b-instruct +CUDA_VISIBLE_DEVICES=0,1,2,3 swift infer --model_type llava1_6-yi-34b-instruct ``` 输出: (支持传入本地路径或URL) @@ -119,7 +119,7 @@ from swift.llm import ( from swift.utils import seed_everything import torch -model_type = 'llava1d6-mistral-7b-instruct' +model_type = 'llava1_6-mistral-7b-instruct' template_type = get_default_template_type(model_type) print(f'template_type: {template_type}') @@ -176,13 +176,13 @@ LoRA微调: # Experimental environment: A10, 3090, V100... # 21GB GPU memory CUDA_VISIBLE_DEVICES=0 swift sft \ - --model_type llava1d6-mistral-7b-instruct \ + --model_type llava1_6-mistral-7b-instruct \ --dataset coco-en-2-mini \ # Experimental environment: 2*A100... 
# 2*45GB GPU memory CUDA_VISIBLE_DEVICES=0,1 swift sft \ - --model_type llava1d6-yi-34b-instruct \ + --model_type llava1_6-yi-34b-instruct \ --dataset coco-en-2-mini \ ``` @@ -191,14 +191,14 @@ CUDA_VISIBLE_DEVICES=0,1 swift sft \ # Experimental environment: 4 * A100 # 4 * 70 GPU memory NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \ - --model_type llava1d6-mistral-7b-instruct \ + --model_type llava1_6-mistral-7b-instruct \ --dataset coco-en-2-mini \ --sft_type full \ --deepspeed default-zero2 # 8 * 50 GPU memory CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 swift sft \ - --model_type llava1d6-yi-34b-instruct \ + --model_type llava1_6-yi-34b-instruct \ --dataset coco-en-2-mini \ --sft_type full \ ``` @@ -217,7 +217,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 swift sft \ ## 微调后推理 直接推理: ```shell -model_type="llava1d6-mistral-7b-instruct" +model_type="llava1_6-mistral-7b-instruct" CUDA_VISIBLE_DEVICES=0 swift infer \ --ckpt_dir output/${model_type}/vx-xxx/checkpoint-xxx \ @@ -226,7 +226,7 @@ CUDA_VISIBLE_DEVICES=0 swift infer \ **merge-lora**并推理: ```shell -model_type="llava1d6-mistral-7b-instruct" +model_type="llava1_6-mistral-7b-instruct" CUDA_VISIBLE_DEVICES=0 swift export \ --ckpt_dir "output/${model_type}/vx-xxx/checkpoint-xxx" \ --merge_lora true diff --git "a/docs/source/Multi-Modal/mplug-owl2\346\234\200\344\275\263\345\256\236\350\267\265.md" "b/docs/source/Multi-Modal/mplug-owl2\346\234\200\344\275\263\345\256\236\350\267\265.md" index 6718cd3f5..1d38a638b 100644 --- "a/docs/source/Multi-Modal/mplug-owl2\346\234\200\344\275\263\345\256\236\350\267\265.md" +++ "b/docs/source/Multi-Modal/mplug-owl2\346\234\200\344\275\263\345\256\236\350\267\265.md" @@ -1,6 +1,6 @@ # mPLUG-Owl2 最佳实践 -以下内容以`mplug-owl2d1-chat`为例, 你也可以选择`mplug-owl2-chat`. +以下内容以`mplug-owl2_1-chat`为例, 你也可以选择`mplug-owl2-chat`. ## 目录 - [环境准备](#环境准备) @@ -17,17 +17,17 @@ pip install -e '.[llm]' ``` 模型链接: -- mplug-owl2d1-chat: [https://modelscope.cn/models/iic/mPLUG-Owl2.1/summary](https://modelscope.cn/models/iic/mPLUG-Owl2.1/summary) +- mplug-owl2_1-chat: [https://modelscope.cn/models/iic/mPLUG-Owl2.1/summary](https://modelscope.cn/models/iic/mPLUG-Owl2.1/summary) - mplug-owl2-chat: [https://modelscope.cn/models/iic/mPLUG-Owl2/summary](https://modelscope.cn/models/iic/mPLUG-Owl2/summary) ## 推理 -推理`mplug-owl2d1-chat`: +推理`mplug-owl2_1-chat`: ```shell # Experimental environment: A10, 3090, V100... # 24GB GPU memory -CUDA_VISIBLE_DEVICES=0 swift infer --model_type mplug-owl2d1-chat +CUDA_VISIBLE_DEVICES=0 swift infer --model_type mplug-owl2_1-chat ``` 输出: (支持传入本地路径或URL) @@ -82,7 +82,7 @@ from swift.llm import ( from swift.utils import seed_everything import torch -model_type = ModelType.mplug_owl2d1_chat +model_type = ModelType.mplug_owl2_1_chat template_type = get_default_template_type(model_type) print(f'template_type: {template_type}') @@ -134,7 +134,7 @@ road: # Experimental environment: A10, 3090, V100... 
# 24GB GPU memory CUDA_VISIBLE_DEVICES=0 swift sft \ - --model_type mplug-owl2d1-chat \ + --model_type mplug-owl2_1-chat \ --dataset coco-en-2-mini \ ``` @@ -153,17 +153,17 @@ CUDA_VISIBLE_DEVICES=0 swift sft \ 直接推理: ```shell CUDA_VISIBLE_DEVICES=0 swift infer \ - --ckpt_dir output/mplug-owl2d1-chat/vx-xxx/checkpoint-xxx \ + --ckpt_dir output/mplug-owl2_1-chat/vx-xxx/checkpoint-xxx \ --load_dataset_config true \ ``` **merge-lora**并推理: ```shell CUDA_VISIBLE_DEVICES=0 swift export \ - --ckpt_dir output/mplug-owl2d1-chat/vx-xxx/checkpoint-xxx \ + --ckpt_dir output/mplug-owl2_1-chat/vx-xxx/checkpoint-xxx \ --merge_lora true CUDA_VISIBLE_DEVICES=0 swift infer \ - --ckpt_dir output/mplug-owl2d1-chat/vx-xxx/checkpoint-xxx-merged \ + --ckpt_dir output/mplug-owl2_1-chat/vx-xxx/checkpoint-xxx-merged \ --load_dataset_config true ``` diff --git a/docs/source_en/LLM/Customization.md b/docs/source_en/LLM/Customization.md index cbcad2383..0f3b63575 100644 --- a/docs/source_en/LLM/Customization.md +++ b/docs/source_en/LLM/Customization.md @@ -9,8 +9,8 @@ We support three methods for **customizing datasets**. 1. \[Recommended\] using command line arguments: It is more convenient to support custom datasets, and it supports four dataset formats (using `SmartPreprocessor`) as well as the `dataset_id` and `dataset_path`. -2. Adding datasets to `dataset_info.json` is more flexible than the first method, and supports using two preprocessors and specifying their parameters: `RenameColumnsPreprocessor`, `ConversationsPreprocessor` (default is to use `SmartPreprocessor`). You can directly modify the built-in `dataset_info.json` in Swift, or pass in an external json file using `--dataset_info_path xxx.json` (for users who prefer pip install over git clone to expand datasets). -3. Registering datasets: More flexible than the first two methods, it supports using functions to preprocess datasets. Methods 1 and 2 are implemented by leveraging method 3. You can directly modify the source code for expansion, or pass in a custom registration path using `--custom_register_path xxx.py`, where the script will parse the py file (for pip install users). +2. Adding datasets to `dataset_info.json` is more flexible but cumbersome compared to the first method, and supports using two preprocessors and specifying their parameters: `RenameColumnsPreprocessor`, `ConversationsPreprocessor` (default is to use `SmartPreprocessor`). You can directly modify the built-in `dataset_info.json` in Swift, or pass in an external json file using `--dataset_info_path xxx.json` (for users who prefer pip install over git clone to expand datasets). +3. Registering datasets: More flexible but cumbersome compared to the first and second methods, it supports using functions to preprocess datasets. Methods 1 and 2 are implemented by leveraging method 3. You can directly modify the source code for expansion, or pass in a custom registration path using `--custom_register_path xxx.py`, where the script will parse the py file (for pip install users). ### 📌 \[Recommended\] using Command Line Arguments diff --git a/docs/source_en/LLM/LLM-inference.md b/docs/source_en/LLM/LLM-inference.md index f270c699a..d4381aa25 100644 --- a/docs/source_en/LLM/LLM-inference.md +++ b/docs/source_en/LLM/LLM-inference.md @@ -181,26 +181,48 @@ model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': ' template = get_template(template_type, tokenizer) seed_everything(42) -query = 'Where is the capital of Zhejiang?' 
+ +query = 'What is the capital of Zhejiang Province?' gen = inference_stream(model, template, query) print(f'query: {query}') for response, history in gen: - print(f'response: {response}') -query = 'What are some famous foods there?' -gen = inference_stream(model, template, query, history) + pass +print(f'response: {response}') + +# method1 +query = 'What is there to eat?' +old_history = history +gen = inference_stream(model, template, query, old_history) print(f'query: {query}') for response, history in gen: print(f'response: {response}') print(f'history: {history}') +# method2 +query = 'What is there to eat?' +gen = inference_stream(model, template, query, old_history) +print_idx = 0 +print(f'query: {query}\nresponse: ', end='') +for response, history in gen: + delta = response[print_idx:] + print(delta, end='', flush=True) + print_idx = len(response) +print(f'\nhistory: {history}') + """Out[0] -query: Where is the capital of Zhejiang? +query: What is the capital of Zhejiang Province? +response: The capital of Zhejiang Province is Hangzhou. +query: What is there to eat? +response: Zhejiang +response: Zhejiang cuisine, +response: Zhejiang cuisine, +response: Zhejiang cuisine, also ... -response: The capital of Zhejiang province is Hangzhou. -query: What are some famous foods there? -... -response: Hangzhou has many famous local foods, such as West Lake Vinegar Fish, Longjing Shrimp, Sweet and Sour Pork Ribs, Spicy Beef, etc. In addition, there are also Hangzhou specialties like Osmanthus Cake, Lotus Seed Pastry, Ai Wo Wo, and more. -history: [('Where is the capital of Zhejiang?', 'The capital of Zhejiang province is Hangzhou.'), ('What are some famous foods there?', 'Hangzhou has many famous local foods, such as West Lake Vinegar Fish, Longjing Shrimp, Sweet and Sour Pork Ribs, Spicy Beef, etc. In addition, there are also Hangzhou specialties like Osmanthus Cake, Lotus Seed Pastry, Ai Wo Wo, and more.')] +response: Zhejiang cuisine, also known as "Hangzhou cuisine", is one of the eight traditional Chinese cuisines and is famous for its delicate taste, light fragrance, and natural appearance. It has a long history and is influenced by various cultures, including Huaiyang cuisine, Jiangnan cuisine, and Cantonese cuisine. Some popular dishes include West Lake Fish in Vinegar Gravy, Dongpo Pork, Longjing Tea-Scented Chicken, Braised Preserved Bamboo Shoots with Shredded Pork, and Steamed Stuffed Buns. There are many other delicious dishes that you can try when visiting Zhejiang. +history: [['What is the capital of Zhejiang Province?', 'The capital of Zhejiang Province is Hangzhou.'], ['What is there to eat?', 'Zhejiang cuisine, also known as "Hangzhou cuisine", is one of the eight traditional Chinese cuisines and is famous for its delicate taste, light fragrance, and natural appearance. It has a long history and is influenced by various cultures, including Huaiyang cuisine, Jiangnan cuisine, and Cantonese cuisine. Some popular dishes include West Lake Fish in Vinegar Gravy, Dongpo Pork, Longjing Tea-Scented Chicken, Braised Preserved Bamboo Shoots with Shredded Pork, and Steamed Stuffed Buns. There are many other delicious dishes that you can try when visiting Zhejiang.']] +query: What is there to eat? +response: There are many delicious foods to try in Hangzhou, such as West Lake Fish in Vinegar Gravy, Dongpo Pork, Longjing Tea Pancakes, and XiHu-style Mandarin Duck. 
Additionally, Hangzhou is famous for its snacks like xiaolongbao (soup dumplings), qingtuan (green tea cakes), and huoguoliangzi (cold barley noodles). +history: [['What is the capital of Zhejiang Province?', 'The capital of Zhejiang Province is Hangzhou.'], ['What is there to eat?', 'There are many delicious foods to try in Hangzhou, such as West Lake Fish in Vinegar Gravy, Dongpo Pork, Longjing Tea Pancakes, and XiHu-style Mandarin Duck. Additionally, Hangzhou is famous for its snacks like xiaolongbao (soup dumplings), qingtuan (green tea cakes), and huoguoliangzi (cold barley noodles).']] """ ``` diff --git a/docs/source_en/LLM/LLM-quantization.md b/docs/source_en/LLM/LLM-quantization.md index 6bcde9f90..feb5487af 100644 --- a/docs/source_en/LLM/LLM-quantization.md +++ b/docs/source_en/LLM/LLM-quantization.md @@ -1,19 +1,18 @@ # LLM Quantization Documentation -Swift supports model quantization using the techniques of awq, gptq, bnb, hqq, eetq. Among these, awq and gptq quantization techniques support inference acceleration for vllm, and the quantized models support fine-tuning with qlora. -Note The effect of quantization varies under different commands: -- During sft lora training, quantization specified for `qlora` is used to reduce the memory required for training. -- In export, quantization is specified to quantize the model and save it. -- In infer, quantization is specified for model quantization and inference. +Swift supports the use of awq, gptq, bnb, hqq, and eetq technologies to quantize models. Among them, awq and gptq quantization technologies support vllm for accelerated inference, requiring the use of a calibration dataset for better quantization performance, but with slower quantization speed. On the other hand, bnb, hqq, and eetq do not require calibration data and have faster quantization speed. All five quantization methods support qlora fine-tuning. -bnb, hqq, and eetq do not require calibration data and offer fast quantization speed. They are used in sft lora training and inference by specifying `--quant_method bnb/hqq/eetq`. -awq and gptq require calibration data and are used in export by specifying `--quant_method awq/gptq`. +Quantization using awq and gptq requires the use of 'swift export', while bnb, hqq, and eetq can be quickly quantized during sft and infer. + + +From the perspective of vllm inference acceleration support, it is more recommended to use awq and gptq for quantization. From the perspective of quantization effectiveness, it is more recommended to use awq, hqq, and gptq for quantization. And from the perspective of quantization speed, it is more recommended to use hqq for quantization. + ## Table of Contents - [Environment Preparation](#environment-preparation) -- [Qlora](#qlora) - [Original Model](#original-model) - [Fine-tuned Model](#fine-tuned-model) +- [QLoRA](#QLoRA) - [Pushing Models](#pushing-models) ## Environment Preparation @@ -32,60 +31,31 @@ pip install autoawq -U # Auto_GPTQ and CUDA versions have a corresponding relationship, please select the version according to `https://github.com/PanQiWei/AutoGPTQ#quick-installation` pip install auto_gptq -U -# Environment alignment (usually not needed. 
If you encounter errors, you can run the code below, the repository uses the latest environment for testing) -pip install -r requirements/framework.txt -U -pip install -r requirements/llm.txt -U -``` +# Using bnb quantization: +pip install bitsandbytes -U -## QLora -In the sft lora training, specify `--quant_method` and `--quantization_bit` to execute qlora, which significantly reduces the GPU memory required for training. +# Using hqq quantization: +# Requires transformers version >4.40, install from source +pip install git+https://github.com/huggingface/transformers pip install hqq +# If compatibility with training is desired, install peft from source +pip install git+https://github.com/huggingface/peft.git -```bash -CUDA_VISIBLE_DEVICES=0 swift sft \ - --model_type qwen1half-7b-chat \ - --sft_type lora \ - --dataset alpaca-zh#5000 \ - --quant_method hqq \ - --quantization_bit 4 \ +# Using eetq quantization: +# Requires transformers version >4.40, install from source +pip install git+https://github.com/huggingface/transformers +# See https://github.com/NetEase-FuXi/EETQ for reference +git clone https://github.com/NetEase-FuXi/EETQ.git cd EETQ/ git submodule update --init --recursive pip install . +# If compatibility with training is desired, install peft from source +pip install git+https://github.com/huggingface/peft.git -CUDA_VISIBLE_DEVICES=0 swift sft \ - --model_type qwen1half-7b-chat \ - --sft_type lora \ - --dataset alpaca-zh#5000 \ - --quant_method eetq \ - --dtype fp16 \ - -CUDA_VISIBLE_DEVICES=0 swift sft \ - --model_type qwen1half-7b-chat \ - --sft_type lora \ - --dataset alpaca-zh#5000 \ - --quant_method bnb \ - --quantization_bit 4 \ - --dtype fp16 \ +# Environment alignment (usually not needed. If you encounter errors, you can run the code below, the repository uses the latest environment for testing) +pip install -r requirements/framework.txt -U +pip install -r requirements/llm.txt -U ``` -**Note** -- hqq supports more customizable parameters, such as specifying different quantization configurations for different network layers. For details, please see [Command Line Arguments](https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/Command-line-parameters.md). -- eetq quantization uses 8-bit quantization, and there's no need to specify quantization_bit. Currently, bf16 is not supported; you need to specify dtype as fp16. -- Currently, eetq's qlora speed is relatively slow; it is recommended to use hqq instead. For reference, see the [issue](https://github.com/NetEase-FuXi/EETQ/issues/17). ## Original Model -Use bnb, hqq, and eetq for model quantization and inference. -```bash -CUDA_VISIBLE_DEVICES=0 swift infer \ - --model_type qwen1half-7b-chat \ - --quant_method bnb \ - --quantization_bit 4 -CUDA_VISIBLE_DEVICES=0 swift infer \ - --model_type qwen1half-7b-chat \ - --quant_method hqq \ - --quantization_bit 4 - -CUDA_VISIBLE_DEVICES=0 swift infer \ - --model_type qwen1half-7b-chat \ - --quant_method eetq \ - --dtype fp16 -``` +### awq, gptq Here we demonstrate AWQ and GPTQ quantization on the qwen1half-7b-chat model. ```bash @@ -125,6 +95,26 @@ CUDA_VISIBLE_DEVICES=0 swift infer \ CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen1half-7b-chat ``` +### bnb, hqq, eetq + +For bnb, hqq, and eetq, we only need to use `swift infer` for rapid quantization and inference. 
+```bash +CUDA_VISIBLE_DEVICES=0 swift infer \ + --model_type qwen1half-7b-chat \ + --quant_method bnb \ + --quantization_bit 4 + +CUDA_VISIBLE_DEVICES=0 swift infer \ + --model_type qwen1half-7b-chat \ + --quant_method hqq \ + --quantization_bit 4 + +CUDA_VISIBLE_DEVICES=0 swift infer \ + --model_type qwen1half-7b-chat \ + --quant_method eetq \ + --dtype fp16 +``` + ## Fine-tuned Model Assume you fine-tuned qwen1half-4b-chat using LoRA, and the model weights directory is: `output/qwen1half-4b-chat/vx-xxx/checkpoint-xxx`. @@ -172,6 +162,67 @@ curl http://localhost:8000/v1/chat/completions \ }' ``` +## QLoRA + +### awq, gptq + +If you want to fine-tune the models quantized with awq and gptq using QLoRA, you need to perform pre-quantization. For example, you can use `swift export` to quantize the original model. Then, for fine-tuning, you need to specify `--quant_method` to specify the corresponding quantization method using the following command: + +```bash +# awq +CUDA_VISIBLE_DEVICES=0 swift sft \ + --model_type qwen1half-7b-chat \ + --model_id_or_path qwen1half-7b-chat-awq-int4 \ + --quant_method awq \ + --sft_type lora \ + --dataset alpaca-zh#5000 \ + +# gptq +CUDA_VISIBLE_DEVICES=0 swift sft \ + --model_type qwen1half-7b-chat \ + --model_id_or_path qwen1half-7b-chat-gptq-int4 \ + --quant_method gptq \ + --sft_type lora \ + --dataset alpaca-zh#5000 \ +``` + +### bnb, hqq, eetq + +If you want to use bnb, hqq, eetq for QLoRA fine-tuning, you need to specify `--quant_method` and `--quantization_bit` during training: + +```bash +# bnb +CUDA_VISIBLE_DEVICES=0 swift sft \ + --model_type qwen1half-7b-chat \ + --sft_type lora \ + --dataset alpaca-zh#5000 \ + --quant_method bnb \ + --quantization_bit 4 \ + --dtype fp16 \ + +# hqq +CUDA_VISIBLE_DEVICES=0 swift sft \ + --model_type qwen1half-7b-chat \ + --sft_type lora \ + --dataset alpaca-zh#5000 \ + --quant_method hqq \ + --quantization_bit 4 \ + +# eetq +CUDA_VISIBLE_DEVICES=0 swift sft \ + --model_type qwen1half-7b-chat \ + --sft_type lora \ + --dataset alpaca-zh#5000 \ + --quant_method eetq \ + --dtype fp16 \ +``` + +**Note** +- hqq supports more customizable parameters, such as specifying different quantization configurations for different network layers. For details, please see [Command Line Arguments](https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/Command-line-parameters.md). +- eetq quantization uses 8-bit quantization, and there's no need to specify quantization_bit. Currently, bf16 is not supported; you need to specify dtype as fp16. +- Currently, eetq's qlora speed is relatively slow; it is recommended to use hqq instead. For reference, see the [issue](https://github.com/NetEase-FuXi/EETQ/issues/17). + + ## Pushing Models Assume you fine-tuned qwen1half-4b-chat using LoRA, and the model weights directory is: `output/qwen1half-4b-chat/vx-xxx/checkpoint-xxx`. 
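> Editor's note: the QLoRA section added above combines an on-the-fly (or pre-) quantized base model with LoRA adapters. As a hedged sketch of what that amounts to at the transformers/peft level (generic APIs only, illustrative model id, not the swift-internal implementation):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = 'Qwen/Qwen1.5-7B-Chat'  # illustrative model id
# Load the base model in 4-bit (the bnb route; awq/gptq would load a pre-quantized checkpoint instead).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    device_map='auto')
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters; target modules mirror the q/k/v projections listed in the model table.
lora_config = LoraConfig(
    r=8, lora_alpha=32, lora_dropout=0.05,
    target_modules=['q_proj', 'k_proj', 'v_proj'],
    task_type='CAUSAL_LM')
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```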
diff --git a/docs/source_en/LLM/Qwen1.5-best-practice.md b/docs/source_en/LLM/Qwen1.5-best-practice.md index 02852e5f6..6c13e863f 100644 --- a/docs/source_en/LLM/Qwen1.5-best-practice.md +++ b/docs/source_en/LLM/Qwen1.5-best-practice.md @@ -415,7 +415,9 @@ for query in ['78654+657=?', "What to do if I can't fall asleep at night"]: print(f'query: {query}') print('response: ', end='') + response = '' for chunk in stream_resp: + response += chunk.choices[0].delta.content print(chunk.choices[0].delta.content, end='', flush=True) print() messages.append({'role': 'assistant', 'content': response}) @@ -575,7 +577,9 @@ for query in ['78654+657=?', "What to do if I can't fall asleep at night"]: print(f'query: {query}') print('response: ', end='') + response = '' for chunk in stream_resp: + response += chunk.choices[0].delta.content print(chunk.choices[0].delta.content, end='', flush=True) print() messages.append({'role': 'assistant', 'content': response}) diff --git a/docs/source_en/LLM/Supported-models-datasets.md b/docs/source_en/LLM/Supported-models-datasets.md index 31b906612..d12d7c369 100644 --- a/docs/source_en/LLM/Supported-models-datasets.md +++ b/docs/source_en/LLM/Supported-models-datasets.md @@ -121,8 +121,8 @@ The table below introcudes all models supported by SWIFT: |llama-3-chinese-8b-instruct|[ChineseAlpacaGroup/llama-3-chinese-8b-instruct](https://modelscope.cn/models/ChineseAlpacaGroup/llama-3-chinese-8b-instruct/summary)|q_proj, k_proj, v_proj|llama3|✔|✔||-|[hfl/llama-3-chinese-8b-instruct](https://huggingface.co/hfl/llama-3-chinese-8b-instruct)| |atom-7b|[FlagAlpha/Atom-7B](https://modelscope.cn/models/FlagAlpha/Atom-7B/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔||-|[FlagAlpha/Atom-7B](https://huggingface.co/FlagAlpha/Atom-7B)| |atom-7b-chat|[FlagAlpha/Atom-7B-Chat](https://modelscope.cn/models/FlagAlpha/Atom-7B-Chat/summary)|q_proj, k_proj, v_proj|atom|✔|✔||-|[FlagAlpha/Atom-7B-Chat](https://huggingface.co/FlagAlpha/Atom-7B-Chat)| -|llava1d6-mistral-7b-instruct|[AI-ModelScope/llava-v1.6-mistral-7b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-mistral-7b/summary)|q_proj, k_proj, v_proj|llava-mistral-instruct|✔|✘|transformers>=4.34|multi-modal, vision|[liuhaotian/llava-v1.6-mistral-7b](https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b)| -|llava1d6-yi-34b-instruct|[AI-ModelScope/llava-v1.6-34b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-34b/summary)|q_proj, k_proj, v_proj|llava-yi-instruct|✔|✘||multi-modal, vision|[liuhaotian/llava-v1.6-34b](https://huggingface.co/liuhaotian/llava-v1.6-34b)| +|llava1_6-mistral-7b-instruct|[AI-ModelScope/llava-v1.6-mistral-7b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-mistral-7b/summary)|q_proj, k_proj, v_proj|llava-mistral-instruct|✔|✘|transformers>=4.34|multi-modal, vision|[liuhaotian/llava-v1.6-mistral-7b](https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b)| +|llava1_6-yi-34b-instruct|[AI-ModelScope/llava-v1.6-34b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-34b/summary)|q_proj, k_proj, v_proj|llava-yi-instruct|✔|✘||multi-modal, vision|[liuhaotian/llava-v1.6-34b](https://huggingface.co/liuhaotian/llava-v1.6-34b)| |llama3-llava-next-8b|[AI-Modelscope/llama3-llava-next-8b](https://modelscope.cn/models/AI-Modelscope/llama3-llava-next-8b/summary)|q_proj, k_proj, v_proj|llama-llava-next|✔|✘||multi-modal, vision|[lmms-lab/llama3-llava-next-8b](https://huggingface.co/lmms-lab/llama3-llava-next-8b)| 
|llava-next-72b|[AI-Modelscope/llava-next-72b](https://modelscope.cn/models/AI-Modelscope/llava-next-72b/summary)|q_proj, k_proj, v_proj|llava-qwen-instruct|✔|✘||multi-modal, vision|[lmms-lab/llava-next-72b](https://huggingface.co/lmms-lab/llava-next-72b)| |llava-next-110b|[AI-Modelscope/llava-next-110b](https://modelscope.cn/models/AI-Modelscope/llava-next-110b/summary)|q_proj, k_proj, v_proj|llava-qwen-instruct|✔|✘||multi-modal, vision|[lmms-lab/llava-next-110b](https://huggingface.co/lmms-lab/llava-next-110b)| @@ -236,7 +236,7 @@ The table below introcudes all models supported by SWIFT: |baichuan2-13b-chat|[baichuan-inc/Baichuan2-13B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat/summary)|W_pack|baichuan|✘|✔||-|[baichuan-inc/Baichuan2-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat)| |baichuan2-13b-chat-int4|[baichuan-inc/Baichuan2-13B-Chat-4bits](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat-4bits/summary)|W_pack|baichuan|✘|✘|bitsandbytes<0.41.2, accelerate<0.26|-|[baichuan-inc/Baichuan2-13B-Chat-4bits](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat-4bits)| |mplug-owl2-chat|[iic/mPLUG-Owl2](https://modelscope.cn/models/iic/mPLUG-Owl2/summary)|q_proj, k_proj.multiway.0, k_proj.multiway.1, v_proj.multiway.0, v_proj.multiway.1|mplug-owl2|✔|✘|transformers<4.35, icecream|-|[MAGAer13/mplug-owl2-llama2-7b](https://huggingface.co/MAGAer13/mplug-owl2-llama2-7b)| -|mplug-owl2d1-chat|[iic/mPLUG-Owl2.1](https://modelscope.cn/models/iic/mPLUG-Owl2.1/summary)|c_attn.multiway.0, c_attn.multiway.1|mplug-owl2|✔|✘|transformers<4.35, icecream|-|[Mizukiluke/mplug_owl_2_1](https://huggingface.co/Mizukiluke/mplug_owl_2_1)| +|mplug-owl2_1-chat|[iic/mPLUG-Owl2.1](https://modelscope.cn/models/iic/mPLUG-Owl2.1/summary)|c_attn.multiway.0, c_attn.multiway.1|mplug-owl2|✔|✘|transformers<4.35, icecream|-|[Mizukiluke/mplug_owl_2_1](https://huggingface.co/Mizukiluke/mplug_owl_2_1)| |yuan2-2b-instruct|[YuanLLM/Yuan2.0-2B-hf](https://modelscope.cn/models/YuanLLM/Yuan2.0-2B-hf/summary)|q_proj, k_proj, v_proj|yuan|✔|✘||-|[IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf)| |yuan2-2b-janus-instruct|[YuanLLM/Yuan2-2B-Janus-hf](https://modelscope.cn/models/YuanLLM/Yuan2-2B-Janus-hf/summary)|q_proj, k_proj, v_proj|yuan|✔|✘||-|[IEITYuan/Yuan2-2B-Janus-hf](https://huggingface.co/IEITYuan/Yuan2-2B-Janus-hf)| |yuan2-51b-instruct|[YuanLLM/Yuan2.0-51B-hf](https://modelscope.cn/models/YuanLLM/Yuan2.0-51B-hf/summary)|q_proj, k_proj, v_proj|yuan|✔|✘||-|[IEITYuan/Yuan2-51B-hf](https://huggingface.co/IEITYuan/Yuan2-51B-hf)| diff --git a/docs/source_en/Multi-Modal/llava-best-practice.md b/docs/source_en/Multi-Modal/llava-best-practice.md index 659c9dba4..8620ed355 100644 --- a/docs/source_en/Multi-Modal/llava-best-practice.md +++ b/docs/source_en/Multi-Modal/llava-best-practice.md @@ -3,8 +3,8 @@ The document corresponds to the following models | model | model_type | |-------|------------| -| [llava-v1.6-mistral-7b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-mistral-7b/summary) | llava1d6-mistral-7b-instruct | -| [llava-v1.6-34b](https://www.modelscope.cn/models/AI-ModelScope/llava-v1.6-34b/summary) | llava1d6-yi-34b-instruct | +| [llava-v1.6-mistral-7b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-mistral-7b/summary) | llava1_6-mistral-7b-instruct | +| [llava-v1.6-34b](https://www.modelscope.cn/models/AI-ModelScope/llava-v1.6-34b/summary) | llava1_6-yi-34b-instruct | 
|[llama3-llava-next-8b](https://modelscope.cn/models/AI-ModelScope/llama3-llava-next-8b/summary)|llama3-llava-next-8b| |[llava-next-72b](https://modelscope.cn/models/AI-ModelScope/llava-next-72b/summary)|llava-next-72b| |[llava-next-110b](https://modelscope.cn/models/AI-ModelScope/llava-next-110b/summary)|llava-next-110b| @@ -29,13 +29,13 @@ pip install -e '.[llm]' ```shell # Experimental environment: A100 # 20GB GPU memory -CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava1d6-mistral-7b-instruct +CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava1_6-mistral-7b-instruct # 70GB GPU memory -CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava1d6-yi-34b-instruct +CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava1_6-yi-34b-instruct # 4*20GB GPU memory -CUDA_VISIBLE_DEVICES=0,1,2,3 swift infer --model_type llava1d6-yi-34b-instruct +CUDA_VISIBLE_DEVICES=0,1,2,3 swift infer --model_type llava1_6-yi-34b-instruct ``` Output: (supports passing in local path or URL) @@ -118,7 +118,7 @@ from swift.llm import ( from swift.utils import seed_everything import torch -model_type = 'llava1d6-mistral-7b-instruct' +model_type = 'llava1_6-mistral-7b-instruct' template_type = get_default_template_type(model_type) print(f'template_type: {template_type}') @@ -175,12 +175,12 @@ LoRA fine-tuning: # Experimental environment: A10, 3090, V100... # 21GB GPU memory CUDA_VISIBLE_DEVICES=0 swift sft \ - --model_type llava1d6-mistral-7b-instruct \ + --model_type llava1_6-mistral-7b-instruct \ --dataset coco-en-2-mini \ # 2*45GB GPU memory CUDA_VISIBLE_DEVICES=0,1 swift sft \ - --model_type llava1d6-yi-34b-instruct \ + --model_type llava1_6-yi-34b-instruct \ --dataset coco-en-2-mini \ ``` @@ -189,14 +189,14 @@ Full parameter fine-tuning: # Experimental environment: 4 * A100 # 4 * 70 GPU memory NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \ - --model_type llava1d6-mistral-7b-instruct \ + --model_type llava1_6-mistral-7b-instruct \ --dataset coco-en-2-mini \ --sft_type full \ --deepspeed default-zero2 # 8 * 50 GPU memory CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 swift sft \ - --model_type llava1d6-yi-34b-instruct \ + --model_type llava1_6-yi-34b-instruct \ --dataset coco-en-2-mini \ --sft_type full \ ``` @@ -215,7 +215,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 swift sft \ ## Inference after Fine-tuning Direct inference: ```shell -model_type="llava1d6-mistral-7b-instruct" +model_type="llava1_6-mistral-7b-instruct" CUDA_VISIBLE_DEVICES=0 swift infer \ --ckpt_dir output/${model_type}/vx-xxx/checkpoint-xxx \ --load_dataset_config true @@ -223,7 +223,7 @@ CUDA_VISIBLE_DEVICES=0 swift infer \ **merge-lora** and inference: ```shell -model_type="llava1d6-mistral-7b-instruct" +model_type="llava1_6-mistral-7b-instruct" CUDA_VISIBLE_DEVICES=0 swift export \ --ckpt_dir "output/${model_type}/vx-xxx/checkpoint-xxx" \ --merge_lora true diff --git a/swift/llm/deploy.py b/swift/llm/deploy.py index df8469332..ad2574893 100644 --- a/swift/llm/deploy.py +++ b/swift/llm/deploy.py @@ -42,7 +42,10 @@ async def get_available_models(): model_list = [_args.model_type] if _args.lora_request_list is not None: model_list += [lora_request.lora_name for lora_request in _args.lora_request_list] - data = [Model(id=model_id) for model_id in model_list] + data = [ + Model(id=model_id, is_chat=not is_generation_template(model_id), owned_by=_args.owned_by) + for model_id in model_list + ] return ModelList(data=data) diff --git a/swift/llm/infer.py b/swift/llm/infer.py index 3b434aa1c..db137c131 100644 --- a/swift/llm/infer.py +++ 
b/swift/llm/infer.py @@ -33,8 +33,7 @@ def save_checkpoint(model: Optional[PreTrainedModel], model.save_pretrained(target_dir, safe_serialization=save_safetensors) if hasattr(tokenizer, 'processor'): tokenizer.processor.save_pretrained(target_dir) - else: - tokenizer.save_pretrained(target_dir) + tokenizer.save_pretrained(target_dir) model_type = getattr(tokenizer, 'model_type') fname_list = ['generation_config.json', 'preprocessor_config.json'] if model_type is not None: diff --git a/swift/llm/sft.py b/swift/llm/sft.py index a98c9742a..9e1da6ed2 100644 --- a/swift/llm/sft.py +++ b/swift/llm/sft.py @@ -101,6 +101,13 @@ def llm_sft(args: SftArguments) -> Dict[str, Union[str, Any]]: kwargs['use_flash_attn'] = args.use_flash_attn if args.local_repo_path: kwargs['local_repo_path'] = args.local_repo_path + if args.quant_method == 'awq': + kwargs['is_awq'] = True + elif args.quant_method == 'aqlm': + kwargs['is_aqlm'] = True + elif args.quant_method == 'gptq': + kwargs['is_gptq'] = True + model, tokenizer = get_model_tokenizer( args.model_type, args.torch_dtype, diff --git a/swift/llm/utils/argument.py b/swift/llm/utils/argument.py index 31dcc3aef..0b98142be 100644 --- a/swift/llm/utils/argument.py +++ b/swift/llm/utils/argument.py @@ -169,7 +169,10 @@ def handle_compatibility(self: Union['SftArguments', 'InferArguments']) -> None: 'openbmb-minicpm-2b-sft-chat': 'minicpm-2b-sft-chat', 'openbmb-minicpm-2b-chat': 'minicpm-2b-chat', 'cogvlm-17b-instruct': 'cogvlm-17b-chat', - 'minicpm-v-v2': 'minicpm-v-v2-chat' + 'minicpm-v-v2': 'minicpm-v-v2-chat', + 'mplug-owl2d1-chat': 'mplug-owl2_1-chat', + 'llava1d6-mistral-7b-instruct': 'llava1_6-mistral-7b-instruct', + 'llava1d6-yi-34b-instruct': 'llava1_6-yi-34b-instruct', } dataset_name_mapping = { 'ms-bench-mini': 'ms-bench#20000', @@ -442,7 +445,9 @@ class SftArguments(ArgumentsBase): model_author: List[str] = field( default_factory=lambda: [None, None], metadata={'help': "e.g. ['魔搭', 'ModelScope']"}) # note: bf16 and quantization have requirements for gpu architecture - quant_method: Literal['bnb', 'hqq', 'eetq'] = None + # awq, gptq, and aqlm need to be pre-quantized models, + # while bnb, hqq, and eetq can be quantized during SFT using the original models. + quant_method: Literal['bnb', 'hqq', 'eetq', 'awq', 'gptq', 'aqlm'] = None quantization_bit: Literal[0, 1, 2, 3, 4, 8] = 0 # hqq: 1,2,3,4,8. 
bnb: 4,8 hqq_axis: Literal[0, 1] = 0 hqq_dynamic_config_path: Optional[str] = None @@ -1205,6 +1210,8 @@ class DeployArguments(InferArguments): ssl_keyfile: Optional[str] = None ssl_certfile: Optional[str] = None + owned_by: str = 'swift' + def __post_init__(self): super().__post_init__() model_info = MODEL_MAPPING[self.model_type] diff --git a/swift/llm/utils/client_utils.py b/swift/llm/utils/client_utils.py index f1f976487..a374f0667 100644 --- a/swift/llm/utils/client_utils.py +++ b/swift/llm/utils/client_utils.py @@ -36,17 +36,18 @@ def inference_client( request_config: Optional[XRequestConfig] = None, host: str = '127.0.0.1', port: str = '8000', - adapter_name: str = None, is_chat_request: Optional[bool] = None, ) -> Union[ChatCompletionResponse, CompletionResponse, Iterator[ChatCompletionStreamResponse], Iterator[CompletionStreamResponse]]: if request_config is None: request_config = XRequestConfig() if is_chat_request is None: - template_type = get_default_template_type(model_type) - is_chat_request = 'generation' not in template_type + model_list = get_model_list_client(host, port) + for model in model_list.data: + if model_type == model.id: + is_chat_request = model.is_chat data = {k: v for k, v in request_config.__dict__.items() if not k.startswith('__')} - data['model'] = adapter_name or model_type + data['model'] = model_type if is_chat_request: data['messages'] = history_to_messages(history, query, system) url = f'http://{host}:{port}/v1/chat/completions' diff --git a/swift/llm/utils/model.py b/swift/llm/utils/model.py index 24dfb8964..c2404a769 100644 --- a/swift/llm/utils/model.py +++ b/swift/llm/utils/model.py @@ -158,8 +158,8 @@ class ModelType: atom_7b = 'atom-7b' atom_7b_chat = 'atom-7b-chat' # llava - llava1d6_mistral_7b_instruct = 'llava1d6-mistral-7b-instruct' - llava1d6_yi_34b_instruct = 'llava1d6-yi-34b-instruct' + llava1_6_mistral_7b_instruct = 'llava1_6-mistral-7b-instruct' + llava1_6_yi_34b_instruct = 'llava1_6-yi-34b-instruct' llama3_llava_next_8b = 'llama3-llava-next-8b' llava_next_72b = 'llava-next-72b' llava_next_110b = 'llava-next-110b' @@ -295,7 +295,7 @@ class ModelType: baichuan2_13b_chat_int4 = 'baichuan2-13b-chat-int4' # owl mplug_owl2_chat = 'mplug-owl2-chat' # llama - mplug_owl2d1_chat = 'mplug-owl2d1-chat' # qwen + mplug_owl2_1_chat = 'mplug-owl2_1-chat' # qwen # yuan yuan2_2b_instruct = 'yuan2-2b-instruct' yuan2_2b_janus_instruct = 'yuan2-2b-janus-instruct' @@ -418,7 +418,7 @@ class LoRATM(NamedTuple): 'v_proj.multiway.0', 'v_proj.multiway.1', ] - mplug_owl2d1 = [ + mplug_owl2_1 = [ 'c_attn.multiway.0', 'c_attn.multiway.1', ] @@ -499,8 +499,10 @@ def _check_awq_ext() -> None: '&& cd AutoAWQ_kernels && pip install -e .`') from e -def _check_gptq_model(bits: int, model_kwargs: Dict[str, Any]) -> None: +def _check_gptq_model(bits: int, model_config, model_kwargs: Dict[str, Any]) -> None: assert model_kwargs.get('quantization_config') is None + if bits == 0: + bits = model_config.quantization_config['bits'] if version.parse(transformers.__version__) >= version.parse('4.35'): model_kwargs['quantization_config'] = GPTQConfig(bits=bits, use_exllama=False) else: @@ -811,14 +813,20 @@ def get_model_tokenizer_from_repo(model_dir: str, automodel_class=AutoModelForCausalLM, **kwargs): """load from an independent repository""" + if model_config is None: + model_config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True) is_awq = kwargs.pop('is_awq', False) is_aqlm = kwargs.pop('is_aqlm', False) gptq_bits = kwargs.pop('gptq_bits', 0) + if gptq_bits > 
0: + is_gptq = True + else: + is_gptq = kwargs.pop('is_gptq', False) is_training = kwargs.pop('is_training', False) if is_awq and is_training: _check_awq_ext() - if gptq_bits > 0 and is_training: - _check_gptq_model(gptq_bits, model_kwargs) + if is_gptq and is_training: + _check_gptq_model(gptq_bits, model_config, model_kwargs) context = kwargs.get('context', None) if is_aqlm and is_training: require_version('transformers>=4.39') @@ -826,8 +834,6 @@ def get_model_tokenizer_from_repo(model_dir: str, context = aqlm.optimize_for_training() if context is None: context = nullcontext() - if model_config is None: - model_config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True) if torch_dtype is not None: model_config.torch_dtype = torch_dtype if tokenizer is None: @@ -857,10 +863,6 @@ def get_model_tokenizer_from_repo(model_dir: str, with context: model = automodel_class.from_pretrained( model_dir, config=model_config, torch_dtype=torch_dtype, trust_remote_code=True, **model_kwargs) - if load_model and is_awq: - model.is_awq = is_awq - if load_model and gptq_bits > 0: - model.gptq_bits = gptq_bits return model, tokenizer @@ -2948,6 +2950,30 @@ def _new_func(*args, **kwargs): def _patch_deepseek_vl(model) -> None: + + if not hasattr(model, 'hf_device_map') or len(model.hf_device_map.values()) == 1: + return + if hasattr(model.language_model, '__old_forward'): + # avoid double patching + return + # device_map + __old_forward = model.language_model.forward + + def _new_forward(*args, **kwargs) -> Tensor: + inputs = kwargs.get('inputs_embeds') + if inputs is None: + inputs = kwargs.get('input_ids') + device = inputs.device + output = __old_forward(*args, **kwargs) + if output.logits is not None: + output.logits = output.logits.to(device) + if output.loss is not None: + output.loss = output.loss.to(device) + return output + + model.language_model.forward = _new_forward + model.language_model.__old_forward = __old_forward + model.prepare_inputs_embeds = MethodType(__prepare_inputs_embeds, model) func_list = ['generate', 'get_input_embeddings', 'gradient_checkpointing_enable', 'forward'] _use_submodel_func(model, 'language_model', func_list) @@ -2989,8 +3015,8 @@ def get_model_tokenizer_deepseek_vl(model_dir: str, local_repo_path = _git_clone_github('https://github.com/deepseek-ai/DeepSeek-VL') sys.path.append(os.path.join(local_repo_path)) from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM - vl_chat_processor = VLChatProcessor.from_pretrained(model_dir) - tokenizer = vl_chat_processor.tokenizer + processor = VLChatProcessor.from_pretrained(model_dir) + tokenizer = processor.tokenizer # flash_attn model_config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True) use_flash_attn = kwargs.pop('use_flash_attn', False) @@ -3001,7 +3027,7 @@ def get_model_tokenizer_deepseek_vl(model_dir: str, model_config.language_config._flash_attn_2_enabled = use_flash_attn model, tokenizer = get_model_tokenizer_from_repo( model_dir, torch_dtype, model_kwargs, load_model, model_config=model_config, tokenizer=tokenizer, **kwargs) - tokenizer.vl_chat_processor = vl_chat_processor + tokenizer.processor = processor if load_model: _patch_deepseek_vl(model) return model, tokenizer @@ -4071,7 +4097,7 @@ def _new_generate(inputs=None, *args, **kwargs): @register_model( - ModelType.llava1d6_yi_34b_instruct, + ModelType.llava1_6_yi_34b_instruct, 'AI-ModelScope/llava-v1.6-34b', LoRATM.llama2, TemplateType.llava_yi_instruct, @@ -4081,7 +4107,7 @@ def _new_generate(inputs=None, *args, 
**kwargs): tags=['multi-modal', 'vision'], hf_model_id='liuhaotian/llava-v1.6-34b') @register_model( - ModelType.llava1d6_mistral_7b_instruct, + ModelType.llava1_6_mistral_7b_instruct, 'AI-ModelScope/llava-v1.6-mistral-7b', LoRATM.llama2, TemplateType.llava_mistral_instruct, @@ -4126,9 +4152,9 @@ def get_model_tokenizer_llava(model_dir: str, if 'local_repo_path' in kwargs: repo_path = kwargs['local_repo_path'] elif 'next' in llm_model_type: - repo_path = 'https://github.com/LLaVA-VL/LLaVA-NeXT.git' + repo_path = 'https://github.com/LLaVA-VL/LLaVA-NeXT' else: - repo_path = 'https://github.com/haotian-liu/LLaVA.git' + repo_path = 'https://github.com/haotian-liu/LLaVA' local_repo_path = _git_clone_github(repo_path) sys.path.append(os.path.join(local_repo_path)) @@ -4188,9 +4214,9 @@ def _new_forward(*args, **kwargs): support_flash_attn=True, hf_model_id='MAGAer13/mplug-owl2-llama2-7b') @register_model( - ModelType.mplug_owl2d1_chat, + ModelType.mplug_owl2_1_chat, 'iic/mPLUG-Owl2.1', - LoRATM.mplug_owl2d1, + LoRATM.mplug_owl2_1, TemplateType.mplug_owl2, requires=['transformers<4.35', 'icecream'], eos_token='<|endoftext|>', @@ -4224,8 +4250,8 @@ def get_model_tokenizer_mplug_owl2(model_dir: str, model, tokenizer = get_model_tokenizer_function( model_dir, torch_dtype, model_kwargs, load_model, model_config=model_config, **kwargs) logger.info('Please ignore the unimported warning.') - image_processor = CLIPImageProcessor.from_pretrained(model_dir) - tokenizer.image_processor = image_processor + processor = CLIPImageProcessor.from_pretrained(model_dir) + tokenizer.processor = processor return model, tokenizer diff --git a/swift/llm/utils/protocol.py b/swift/llm/utils/protocol.py index 264f00fef..ba0cf50a0 100644 --- a/swift/llm/utils/protocol.py +++ b/swift/llm/utils/protocol.py @@ -12,6 +12,7 @@ def random_uuid() -> str: @dataclass class Model: id: str # model_type + is_chat: bool # chat model or generation model object: str = 'model' created: int = field(default_factory=lambda: int(time.time())) owned_by: str = 'swift' diff --git a/swift/llm/utils/template.py b/swift/llm/utils/template.py index a85b28e2b..4a28b520b 100644 --- a/swift/llm/utils/template.py +++ b/swift/llm/utils/template.py @@ -1194,17 +1194,17 @@ def encode(self, example: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any image = _read_from_path(image_path) images.append(image) - vl_chat_processor = self.tokenizer.vl_chat_processor + processor = self.tokenizer.processor input_ids, labels = inputs['input_ids'], inputs['labels'] - idx_list = _findall(input_ids, vl_chat_processor.image_id) + idx_list = _findall(input_ids, processor.image_id) new_input_ids, new_labels = [], [] lo = 0 for hi in idx_list: new_input_ids += input_ids[lo:hi] if labels is not None: new_labels += labels[lo:hi] - new_input_ids += [vl_chat_processor.image_id] * vl_chat_processor.num_image_tokens - new_labels += [-100] * vl_chat_processor.num_image_tokens + new_input_ids += [processor.image_id] * processor.num_image_tokens + new_labels += [-100] * processor.num_image_tokens lo = hi + 1 new_input_ids += input_ids[lo:] if labels is not None: @@ -1212,15 +1212,15 @@ def encode(self, example: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any else: new_labels = None new_input_ids = torch.tensor(new_input_ids) - num_image_tokens = torch.tensor([vl_chat_processor.num_image_tokens] * len(idx_list)) - images_outputs = vl_chat_processor.image_processor(images, return_tensors='pt') + num_image_tokens = torch.tensor([processor.num_image_tokens] * len(idx_list)) + 
images_outputs = processor.image_processor(images, return_tensors='pt') from deepseek_vl.models.processing_vlm import VLChatProcessorOutput output = VLChatProcessorOutput( sft_format=None, input_ids=new_input_ids, pixel_values=images_outputs.pixel_values, num_image_tokens=num_image_tokens) - batched_output = vl_chat_processor.batchify([output]) + batched_output = processor.batchify([output]) model = self.model batched_output = batched_output.to(device=model.device, dtype=model.dtype) inputs_embeds = model.prepare_inputs_embeds(**batched_output)[0] @@ -1491,7 +1491,7 @@ def __init__(self): def encode(self, example: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]: from mplug_owl2.mm_utils import process_images - image_processor = self.tokenizer.image_processor + processor = self.tokenizer.processor images_path = example['images'] images = [] for image_path in images_path: @@ -1505,7 +1505,7 @@ def encode(self, example: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any return inputs, {} input_ids = inputs['input_ids'] labels = inputs['labels'] - images = process_images(images, image_processor) + images = process_images(images, processor) images = images.to(self.model.dtype) return {'input_ids': input_ids, 'labels': labels, 'images': images}, {} diff --git a/swift/llm/utils/utils.py b/swift/llm/utils/utils.py index 9bf5b1a34..8ce5c12bc 100644 --- a/swift/llm/utils/utils.py +++ b/swift/llm/utils/utils.py @@ -503,8 +503,7 @@ def put(self, value: torch.Tensor) -> None: if value.ndim > 1: value = value[0] value = value.tolist() - for v in value: - self.token_queue.put(v) + self.token_queue.put(value) def end(self) -> None: self.token_queue.put(self.stop_signal) @@ -512,7 +511,7 @@ def end(self) -> None: def __iter__(self): return self - def __next__(self): + def __next__(self) -> List[int]: value = self.token_queue.get(timeout=self.timeout) if value == self.stop_signal: raise StopIteration() @@ -632,8 +631,8 @@ def _model_generate(*args, **kwargs): is_finished = False while not is_finished: try: - token = next(streamer) - raw_generate_ids.append(token) + token_list = next(streamer) + raw_generate_ids += token_list except StopIteration: is_finished = True generate_ids = template.get_generate_ids(torch.tensor(raw_generate_ids)[None], token_len) diff --git a/swift/trainers/mixin.py b/swift/trainers/mixin.py index 705db931d..aa489a9a4 100644 --- a/swift/trainers/mixin.py +++ b/swift/trainers/mixin.py @@ -383,6 +383,8 @@ def _save(self, output_dir: Optional[str] = None, state_dict=None): sft_args = getattr(self, 'sft_args', None) # tokenizer if self.tokenizer is not None and sft_args is not None and sft_args.sft_type == 'full': + if hasattr(self.tokenizer, 'processor'): + self.tokenizer.processor.save_pretrained(output_dir) self.tokenizer.save_pretrained(output_dir) # training_args.bin torch.save(self.args, os.path.join(output_dir, 'training_args.bin'))
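> Editor's note: the deploy.py / protocol.py / client_utils.py changes in this patch expose an `is_chat` flag per served model and route requests accordingly. A short client-side sketch of using that flag follows; it assumes the server exposes the standard OpenAI-compatible `/v1/models` route (the chat-completions URL is the one shown in client_utils.py above).

```python
import requests

host, port, model_type = '127.0.0.1', 8000, 'qwen1half-7b-chat'

# Ask the deployed server which models it serves and whether each is a chat model.
models = requests.get(f'http://{host}:{port}/v1/models').json()['data']
model = next(m for m in models if m['id'] == model_type)

# Pick the endpoint based on the new is_chat field.
if model.get('is_chat'):
    url = f'http://{host}:{port}/v1/chat/completions'
    payload = {'model': model_type,
               'messages': [{'role': 'user', 'content': '浙江的省会在哪里?'}]}
else:
    url = f'http://{host}:{port}/v1/completions'
    payload = {'model': model_type, 'prompt': '浙江的省会在哪里?'}

resp = requests.post(url, json=payload).json()
print(resp)
```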