Merge 1126 (#7)
* Remove hardcoded flash-attn disable setting (lm-sys#2342)

* Document turning off proxy_buffering when api is streaming (lm-sys#2337)

* Simplify huggingface api example (lm-sys#2355)

* Update sponsor logos (lm-sys#2367)

* If LOGDIR is empty, don't try to write logs to a local file (lm-sys#2357)

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>

* add best_of and use_beam_search for completions interface (lm-sys#2348)

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>

* Extract upvote/downvote from log files (lm-sys#2369)

* Revert "add best_of and use_beam_search for completions interface" (lm-sys#2370)

* Improve doc (lm-sys#2371)

* add best_of and use_beam_search for completions interface (lm-sys#2372)

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>

* update monkey patch for llama2 (lm-sys#2379)

* Make E5 adapter more restrictive to reduce mismatches (lm-sys#2381)

* Update UI and sponsors (lm-sys#2387)

* Use fsdp api for model save (lm-sys#2390)

* Release v0.2.27

* Spicyboros + airoboros 2.2 template update. (lm-sys#2392)

Co-authored-by: Jon Durbin <jon.durbin@onna.com>

* bugfix of openai_api_server for fastchat.serve.vllm_worker (lm-sys#2398)

Co-authored-by: wuyongyu <wuyongyu@atomecho.xyz>

* Revert "bugfix of openai_api_server for fastchat.serve.vllm_worker" (lm-sys#2400)

* Revert "add best_of and use_beam_search for completions interface" (lm-sys#2401)

* Release a v0.2.28 with bug fixes and more test cases

* Fix model_worker error (lm-sys#2404)

* Added google/flan models and fixed AutoModelForSeq2SeqLM when loading T5 compression model (lm-sys#2402)

* Rename twitter to X (lm-sys#2406)

* Update huggingface_api.py (lm-sys#2409)

* Add support for baichuan2 models (lm-sys#2408)

* Fixed character overlap issue in API streaming output (lm-sys#2431)

* Support custom conversation template in multi_model_worker (lm-sys#2434)

* Add Ascend NPU support (lm-sys#2422)

* Add raw conversation template (lm-sys#2417) (lm-sys#2418)

* Improve docs & UI (lm-sys#2436)

* Fix Salesforce xgen inference (lm-sys#2350)

* Add support for Phind-CodeLlama models (lm-sys#2415) (lm-sys#2416)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* Add falcon 180B chat conversation template (lm-sys#2384)

* Improve docs (lm-sys#2438)

* add dtype and seed (lm-sys#2430)

* Data cleaning scripts for dataset release (lm-sys#2440)

* merge google/flan based adapters: T5Adapter, CodeT5pAdapter, FlanAdapter (lm-sys#2411)

* Fix docs

* Update UI (lm-sys#2446)

* Add Optional SSL Support to controller.py (lm-sys#2448)

* Format & Improve docs

* Release v0.2.29 (lm-sys#2450)

* Show terms of use as a JS alert (lm-sys#2461)

* vllm worker awq quantization update (lm-sys#2463)

Co-authored-by: 董晓龙 <dongxiaolong@shiyanjia.com>

* Fix falcon chat template (lm-sys#2464)

* Fix chunk handling when partial chunks are returned (lm-sys#2485)

* Update openai_api_server.py to add an SSL option (lm-sys#2484)

* Update vllm_worker.py (lm-sys#2482)

* fix typo quantization (lm-sys#2469)

* fix vllm quantization args

* Update README.md (lm-sys#2492)

* Huggingface api worker (lm-sys#2456)

* Update links to lmsys-chat-1m (lm-sys#2497)

* Update train code to support the new tokenizer (lm-sys#2498)

* Third Party UI Example (lm-sys#2499)

* Add metharme (pygmalion) conversation template (lm-sys#2500)

* Optimize for proper flash attn causal handling (lm-sys#2503)

* Add Mistral AI instruction template (lm-sys#2483)

* Update monitor & plots (lm-sys#2506)

* Release v0.2.30 (lm-sys#2507)

* Fix for single turn dataset (lm-sys#2509)

* replace os.getenv with os.path.expanduser because the first one doesn… (lm-sys#2515)

Co-authored-by: khalil <k.hennara@work-with-nerds.ca>

* Fix arena (lm-sys#2522)

* Update Dockerfile (lm-sys#2524)

* add Llama2ChangAdapter (lm-sys#2510)

* Add ExllamaV2 Inference Framework Support. (lm-sys#2455)

* Improve docs (lm-sys#2534)

* Fix warnings for new gradio versions (lm-sys#2538)

* revert the gradio change; now works for 3.40

* Improve chat templates (lm-sys#2539)

* Add Zephyr 7B Alpha (lm-sys#2535)

* Improve Support for Mistral-Instruct (lm-sys#2547)

* correct max_tokens by context_length instead of raising an exception (lm-sys#2544)

* Revert "Improve Support for Mistral-Instruct" (lm-sys#2552)

* Fix Mistral template (lm-sys#2529)

* Add additional Informations from the vllm worker (lm-sys#2550)

* Make FastChat work with LMSYS-Chat-1M Code (lm-sys#2551)

* Create `tags` attribute to fix `MarkupError` in rich CLI (lm-sys#2553)

* move BaseModelWorker outside serve.model_worker to make it independent (lm-sys#2531)

* Misc style and bug fixes (lm-sys#2559)

* Fix README.md (lm-sys#2561)

* release v0.2.31 (lm-sys#2563)

* resolves lm-sys#2542 modify dockerfile to upgrade cuda to 12.2.0 and pydantic 1.10.13 (lm-sys#2565)

* Add airoboros_v3 chat template (llama-2 format) (lm-sys#2564)

* Add Xwin-LM V0.1, V0.2 support (lm-sys#2566)

* Fixed model_worker generate_gate blocking the main thread (lm-sys#2540) (lm-sys#2562)

* feat: add claude-v2 (lm-sys#2571)

* Update vigogne template (lm-sys#2580)

* Fix issue lm-sys#2568: --device mps led to TypeError: forward() got an unexpected keyword argument 'padding_mask'. (lm-sys#2579)

* Add Mistral-7B-OpenOrca conversation_template (lm-sys#2585)

* docs: fix misspelled comments in the model adapter (default template name, conversation) (lm-sys#2594)

* Update Mistral template (lm-sys#2581)

* Fix <s> in mistral template

* Update README.md  (vicuna-v1.3 -> vicuna-1.5) (lm-sys#2592)

* Update README.md to highlight chatbot arena (lm-sys#2596)

* Add Lemur model (lm-sys#2584)

Co-authored-by: Roberto Ugolotti <Roberto.UGOLOTTI@ec.europa.eu>

* add trust_remote_code=True in BaseModelAdapter (lm-sys#2583)

* OpenAI interface: add use_beam_search and best_of=2 (lm-sys#2442)

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>

* Update qwen and add pygmalion (lm-sys#2607)

* feat: Support model AquilaChat2 (lm-sys#2616)

* Added settings vllm (lm-sys#2599)

Co-authored-by: bodza <bodza@qnovi.de>
Co-authored-by: bodza <sebastian.bodza@qnovi.de>

* [Logprobs] Support logprobs=1 (lm-sys#2612)

* release v0.2.32

* fix: Fix for OpenOrcaAdapter to return correct conversation template (lm-sys#2613)

* Make fastchat.serve.model_worker take a debug argument (lm-sys#2628)

Co-authored-by: hi-jin <crushed7@o.cnu.ac.kr>

* openchat 3.5 model support (lm-sys#2638)

* xFastTransformer framework support (lm-sys#2615)

* feat: support custom models vllm serving (lm-sys#2635)

* kill only fastchat process (lm-sys#2641)

* Update server_arch.png

* Use conv.update_last_message api in mt-bench answer generation (lm-sys#2647)

* Improve Azure OpenAI interface (lm-sys#2651)

* Add required_temp support in jsonl format to support flexible temperature setting for gen_api_answer (lm-sys#2653)

* Pin openai version < 1 (lm-sys#2658)

* Remove exclude_unset parameter (lm-sys#2654)

* Revert "Remove exclude_unset parameter" (lm-sys#2666)

* added support for CodeGeex(2) (lm-sys#2645)

* add chatglm3 conv template support in conversation.py (lm-sys#2622)

* UI and model change (lm-sys#2672)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* train_flant5: fix typo (lm-sys#2673)

* Fix gpt template (lm-sys#2674)

* Update README.md (lm-sys#2679)

* feat: support template's stop_str as list (lm-sys#2678)

* Update exllama_v2.md (lm-sys#2680)

* save model under deepspeed (lm-sys#2689)

* Adding SSL support for model workers and huggingface worker (lm-sys#2687)

* Check the max_new_tokens <= 0 in openai api server (lm-sys#2688)

* Add Microsoft/Orca-2-7b and update model support docs (lm-sys#2714)

* fix tokenizer of chatglm2 (lm-sys#2711)

* Template for using Deepseek code models (lm-sys#2705)

* add support for Chinese-LLaMA-Alpaca (lm-sys#2700)

* Make --load-8bit flag work with weights in safetensors format (lm-sys#2698)

* Format code and minor bug fix (lm-sys#2716)

* Bump version to v0.2.33 (lm-sys#2717)

* fix tokenizer.pad_token attribute error (lm-sys#2710)

* support stable-vicuna model (lm-sys#2696)

* Exllama cache 8bit (lm-sys#2719)

* Add Yi support (lm-sys#2723)

* Add Hermes 2.5 [fixed] (lm-sys#2725)

* Fix Hermes2Adapter (lm-sys#2727)

* Fix YiAdapter (lm-sys#2730)

* add trust_remote_code argument (lm-sys#2715)

* Add revision arg to MT Bench answer generation (lm-sys#2728)

* Fix MPS backend 'index out of range' error (lm-sys#2737)

* add starling support (lm-sys#2738)

---------

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Trangle <kw_w@foxmail.com>
Co-authored-by: Nathan Stitt <nathan@stitt.org>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: leiwen83 <leiwen83@users.noreply.github.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Jon Durbin <jon@jondurbin.com>
Co-authored-by: Jon Durbin <jon.durbin@onna.com>
Co-authored-by: Rayrtfr <2384172887@qq.com>
Co-authored-by: wuyongyu <wuyongyu@atomecho.xyz>
Co-authored-by: wangxiyuan <wangxiyuan@huawei.com>
Co-authored-by: Jeff (Zhen) Wang <wangzhen263@gmail.com>
Co-authored-by: karshPrime <94996251+karshPrime@users.noreply.github.com>
Co-authored-by: obitolyz <obitoquilt@qq.com>
Co-authored-by: Shangwei Chen <109785802+Somezak1@users.noreply.github.com>
Co-authored-by: HyungJin Ahn <crushed7@o.cnu.ac.kr>
Co-authored-by: zhangsibo1129 <134488188+zhangsibo1129@users.noreply.github.com>
Co-authored-by: Tobias Birchler <tobias@birchlerfamily.ch>
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
Co-authored-by: Mingdao Liu <joshua@btlmd.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Brandon Biggs <brandonsbiggs@gmail.com>
Co-authored-by: dongxiaolong <774848421@qq.com>
Co-authored-by: 董晓龙 <dongxiaolong@shiyanjia.com>
Co-authored-by: Siddartha Naidu <siddartha@abacus.ai>
Co-authored-by: shuishu <990941859@qq.com>
Co-authored-by: Andrew Aikawa <asai@berkeley.edu>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: enochlev <47466848+enochlev@users.noreply.github.com>
Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>
Co-authored-by: Lé <lerela@users.noreply.github.com>
Co-authored-by: Toshiki Kataoka <tos.lunar@gmail.com>
Co-authored-by: khalil <90086758+khalil-Hennara@users.noreply.github.com>
Co-authored-by: khalil <k.hennara@work-with-nerds.ca>
Co-authored-by: dubaoquan404 <87166864@qq.com>
Co-authored-by: Chang W. Lee <changlee99@gmail.com>
Co-authored-by: theScotchGame <36061851+leonxia1018@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Stephen Horvath <s.horvath@outlook.com.au>
Co-authored-by: liunux4odoo <41217877+liunux4odoo@users.noreply.github.com>
Co-authored-by: Norman Mu <normster@users.noreply.github.com>
Co-authored-by: Sebastian Bodza <66752172+SebastianBodza@users.noreply.github.com>
Co-authored-by: Tianle (Tim) Li <67527391+CodingWithTim@users.noreply.github.com>
Co-authored-by: Wei-Lin Chiang <weichiang@berkeley.edu>
Co-authored-by: Alex <alexander.s.delapaz@gmail.com>
Co-authored-by: Jingcheng Hu <67776176+REIGN12@users.noreply.github.com>
Co-authored-by: lvxuan <3645933+lvxuan263@users.noreply.github.com>
Co-authored-by: cOng <erdongerzong@qq.com>
Co-authored-by: bofeng huang <bofenghuang7@gmail.com>
Co-authored-by: Phil-U-U <phil.h.cui@gmail.com>
Co-authored-by: Wayne Spangenberg <waynespa@gmail.com>
Co-authored-by: Guspan Tanadi <36249910+guspan-tanadi@users.noreply.github.com>
Co-authored-by: Rohan Gupta <63547845+Gk-rohan@users.noreply.github.com>
Co-authored-by: ugolotti <96428459+ugolotti@users.noreply.github.com>
Co-authored-by: Roberto Ugolotti <Roberto.UGOLOTTI@ec.europa.eu>
Co-authored-by: edisonwd <2388100489@qq.com>
Co-authored-by: FangYin Cheng <staneyffer@gmail.com>
Co-authored-by: bodza <bodza@qnovi.de>
Co-authored-by: bodza <sebastian.bodza@qnovi.de>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Srinath Janakiraman <me@vjsrinath.com>
Co-authored-by: Jaeheon Jeong <tizm423@gmail.com>
Co-authored-by: One <imoneoi@users.noreply.github.com>
Co-authored-by: sheng.gui@intel.com <guisheng315@sina.com>
Co-authored-by: David <scenaristeur@gmail.com>
Co-authored-by: Witold Wasiczko <snapshotpl@users.noreply.github.com>
Co-authored-by: Peter Willemsen <peter@codebuffet.co>
Co-authored-by: ZeyuTeng96 <96521059+ZeyuTeng96@users.noreply.github.com>
Co-authored-by: Forceless <72636351+Force1ess@users.noreply.github.com>
Co-authored-by: Jeff <122586668+jm23jeffmorgan@users.noreply.github.com>
Co-authored-by: MrZhengXin <34998703+MrZhengXin@users.noreply.github.com>
Co-authored-by: Long Nguyen <long.nguyen11288@gmail.com>
Co-authored-by: Elsa Granger <zeyugao@outlook.com>
Co-authored-by: Christopher Chou <49086305+BabyChouSr@users.noreply.github.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: amaleshvemula <vemulaamalesh1997@gmail.com>
Co-authored-by: Zollty Tsou <zollty@163.com>
Co-authored-by: xuguodong1999 <bugxu@outlook.com>
Co-authored-by: Michael J Kaye <1014467+mjkaye@users.noreply.github.com>
Co-authored-by: 152334H <54623771+152334H@users.noreply.github.com>
Co-authored-by: Jingsong-Yan <75230787+Jingsong-Yan@users.noreply.github.com>
Co-authored-by: Siyuan (Ryans) Zhuang <suquark@gmail.com>
Showing 62 changed files with 6,801 additions and 5,987 deletions.
Binary file modified assets/server_arch.png
9,352 changes: 4,007 additions & 5,345 deletions data/dummy_conversation.json

Large diffs are not rendered by default.

5 changes: 3 additions & 2 deletions docker/Dockerfile
@@ -1,6 +1,7 @@
FROM nvidia/cuda:11.7.1-runtime-ubuntu20.04
FROM nvidia/cuda:12.2.0-runtime-ubuntu20.04

RUN apt-get update -y && apt-get install -y python3.9 python3.9-distutils curl
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
RUN python3.9 get-pip.py
RUN pip3 install fschat
RUN pip3 install fschat
RUN pip3 install fschat[model_worker,webui] pydantic==1.10.13
2 changes: 1 addition & 1 deletion docker/docker-compose.yml
@@ -23,7 +23,7 @@ services:
- driver: nvidia
count: 1
capabilities: [gpu]
entrypoint: ["python3.9", "-m", "fastchat.serve.model_worker", "--model-names", "${FASTCHAT_WORKER_MODEL_NAMES:-vicuna-7b-v1.3}", "--model-path", "${FASTCHAT_WORKER_MODEL_PATH:-lmsys/vicuna-7b-v1.3}", "--worker-address", "http://fastchat-model-worker:21002", "--controller-address", "http://fastchat-controller:21001", "--host", "0.0.0.0", "--port", "21002"]
entrypoint: ["python3.9", "-m", "fastchat.serve.model_worker", "--model-names", "${FASTCHAT_WORKER_MODEL_NAMES:-vicuna-7b-v1.5}", "--model-path", "${FASTCHAT_WORKER_MODEL_PATH:-lmsys/vicuna-7b-v1.5}", "--worker-address", "http://fastchat-model-worker:21002", "--controller-address", "http://fastchat-controller:21001", "--host", "0.0.0.0", "--port", "21002"]
fastchat-api-server:
build:
context: .
11 changes: 11 additions & 0 deletions docs/commands/leaderboard.md
@@ -24,3 +24,14 @@ scp atlas:/data/lmzheng/FastChat/fastchat/serve/monitor/elo_results_20230905.pkl
```
wget https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/raw/main/leaderboard_table_20230905.csv
```

### Update files on webserver
```
DATE=20231002
rm -rf elo_results.pkl leaderboard_table.csv
wget https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/resolve/main/elo_results_$DATE.pkl
wget https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/resolve/main/leaderboard_table_$DATE.csv
ln -s leaderboard_table_$DATE.csv leaderboard_table.csv
ln -s elo_results_$DATE.pkl elo_results.pkl
```
11 changes: 10 additions & 1 deletion docs/commands/webserver.md
@@ -72,7 +72,16 @@ vim /home/vicuna/anaconda3/envs/fastchat/lib/python3.9/site-packages/gradio/temp
<script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.4.1/html2canvas.min.js"></script>
```

2. Loading
2. deprecation warnings
```
vim /home/vicuna/anaconda3/envs/fastchat/lib/python3.9/site-packages/gradio/deprecation.py
```

```
def check_deprecated_parameters(
```
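The doc only pins down where the function lives; one plausible edit (an assumption on our part, not spelled out above) is to make the check a no-op so gradio stops emitting deprecation warnings:

```python
# Hypothetical patch to gradio/deprecation.py; the exact signature may differ
# between gradio versions, so match it to the installed file before editing.
def check_deprecated_parameters(cls, *, kwargs=None) -> None:
    # Intentionally a no-op: skip gradio's deprecated-parameter warnings.
    return None
```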

3. Loading
```
vim /home/vicuna/anaconda3/envs/fastchat/lib/python3.9/site-packages/gradio/templates/frontend/assets/index-188ef5e8.js
```
6 changes: 6 additions & 0 deletions docs/dataset_release.md
@@ -0,0 +1,6 @@
## Datasets
We release the following datasets based on our projects and websites.

- [LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)
- [Chatbot Arena Conversation Dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations)
- [MT-bench Human Annotation Dataset](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)
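If it helps, these datasets can be pulled with the Hugging Face `datasets` library. A minimal sketch (the dataset name and split are assumptions based on the Hub pages above; some of the datasets are gated, so you may need to accept their terms and log in first):

```python
# Hedged sketch: load one of the released datasets with Hugging Face `datasets`.
# Gated datasets require accepting the terms on the Hub and `huggingface-cli login`.
from datasets import load_dataset

arena = load_dataset("lmsys/chatbot_arena_conversations", split="train")
print(arena[0])  # one pairwise battle record
```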
63 changes: 63 additions & 0 deletions docs/exllama_v2.md
@@ -0,0 +1,63 @@
# ExllamaV2 GPTQ Inference Framework

FastChat integrates the customized [ExllamaV2](https://github.com/turboderp/exllamav2) kernel to provide faster GPTQ inference.

**Note: ExllamaV2 does not yet support the embedding REST API.**

## Install ExllamaV2

Setup environment (please refer to [this link](https://github.com/turboderp/exllamav2#how-to) for more details):

```bash
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -e .
```

Chat with the CLI:
```bash
python3 -m fastchat.serve.cli \
--model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
--enable-exllama
```

Start model worker:
```bash
# Download quantized model from huggingface
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g models/vicuna-7B-1.1-GPTQ-4bit-128g

# Load model with default configuration (max sequence length 4096, no GPU split setting).
python3 -m fastchat.serve.model_worker \
--model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
--enable-exllama

#Load model with max sequence length 2048, allocate 18 GB to CUDA:0 and 24 GB to CUDA:1.
python3 -m fastchat.serve.model_worker \
--model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
--enable-exllama \
--exllama-max-seq-len 2048 \
--exllama-gpu-split 18,24
```

`--exllama-cache-8bit` can be used to enable 8-bit caching with exllama and save some VRAM.

## Performance

Reference: https://github.com/turboderp/exllamav2#performance


| Model | Mode | Size | grpsz | act | V1: 3090Ti | V1: 4090 | V2: 3090Ti | V2: 4090 |
|------------|--------------|-------|-------|-----|------------|----------|------------|-------------|
| Llama | GPTQ | 7B | 128 | no | 143 t/s | 173 t/s | 175 t/s | **195** t/s |
| Llama | GPTQ | 13B | 128 | no | 84 t/s | 102 t/s | 105 t/s | **110** t/s |
| Llama | GPTQ | 33B | 128 | yes | 37 t/s | 45 t/s | 45 t/s | **48** t/s |
| OpenLlama | GPTQ | 3B | 128 | yes | 194 t/s | 226 t/s | 295 t/s | **321** t/s |
| CodeLlama | EXL2 4.0 bpw | 34B | - | - | - | - | 42 t/s | **48** t/s |
| Llama2 | EXL2 3.0 bpw | 7B | - | - | - | - | 195 t/s | **224** t/s |
| Llama2 | EXL2 4.0 bpw | 7B | - | - | - | - | 164 t/s | **197** t/s |
| Llama2 | EXL2 5.0 bpw | 7B | - | - | - | - | 144 t/s | **160** t/s |
| Llama2 | EXL2 2.5 bpw | 70B | - | - | - | - | 30 t/s | **35** t/s |
| TinyLlama | EXL2 3.0 bpw | 1.1B | - | - | - | - | 536 t/s | **635** t/s |
| TinyLlama | EXL2 4.0 bpw | 1.1B | - | - | - | - | 509 t/s | **590** t/s |
2 changes: 1 addition & 1 deletion docs/langchain_integration.md
@@ -19,7 +19,7 @@ Here, we use Vicuna as an example and use it for three endpoints: chat completio
See a full list of supported models [here](../README.md#supported-models).

```bash
python3 -m fastchat.serve.model_worker --model-names "gpt-3.5-turbo,text-davinci-003,text-embedding-ada-002" --model-path lmsys/vicuna-7b-v1.3
python3 -m fastchat.serve.model_worker --model-names "gpt-3.5-turbo,text-davinci-003,text-embedding-ada-002" --model-path lmsys/vicuna-7b-v1.5
```

Finally, launch the RESTful API server
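Once that REST API server is running (by default on `http://localhost:8000`), LangChain's OpenAI wrappers can point at it. A hedged sketch, assuming the aliased model names registered by the worker command above:

```python
# Hedged sketch: drive the local FastChat endpoint through LangChain's OpenAI wrapper.
# Assumes the REST API server is listening on http://localhost:8000/v1.
import os

os.environ["OPENAI_API_KEY"] = "EMPTY"
os.environ["OPENAI_API_BASE"] = "http://localhost:8000/v1"

from langchain.llms import OpenAI

llm = OpenAI(model_name="text-davinci-003")  # alias served by the Vicuna worker above
print(llm("Say hello in one sentence."))
```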
13 changes: 11 additions & 2 deletions docs/model_support.md
@@ -5,8 +5,10 @@
- [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
- example: `python3 -m fastchat.serve.cli --model-path meta-llama/Llama-2-7b-chat-hf`
- Vicuna, Alpaca, LLaMA, Koala
- example: `python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.3`
- example: `python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5`
- [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B)
- [BAAI/AquilaChat2-7B](https://huggingface.co/BAAI/AquilaChat2-7B)
- [BAAI/AquilaChat2-34B](https://huggingface.co/BAAI/AquilaChat2-34B)
- [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en#using-huggingface-transformers)
- [baichuan-inc/baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B)
- [BlinkDL/RWKV-4-Raven](https://huggingface.co/BlinkDL/rwkv-4-raven)
@@ -30,6 +32,8 @@
- [NousResearch/Nous-Hermes-13b](https://huggingface.co/NousResearch/Nous-Hermes-13b)
- [openaccess-ai-collective/manticore-13b-chat-pyg](https://huggingface.co/openaccess-ai-collective/manticore-13b-chat-pyg)
- [OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5](https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5)
- [openchat/openchat_3.5](https://huggingface.co/openchat/openchat_3.5)
- [Open-Orca/Mistral-7B-OpenOrca](https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca)
- [VMware/open-llama-7b-v2-open-instruct](https://huggingface.co/VMware/open-llama-7b-v2-open-instruct)
- [Phind/Phind-CodeLlama-34B-v2](https://huggingface.co/Phind/Phind-CodeLlama-34B-v2)
- [project-baize/baize-v2-7b](https://huggingface.co/project-baize/baize-v2-7b)
@@ -45,6 +49,11 @@
- [WizardLM/WizardLM-13B-V1.0](https://huggingface.co/WizardLM/WizardLM-13B-V1.0)
- [WizardLM/WizardCoder-15B-V1.0](https://huggingface.co/WizardLM/WizardCoder-15B-V1.0)
- [HuggingFaceH4/starchat-beta](https://huggingface.co/HuggingFaceH4/starchat-beta)
- [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)
- [Xwin-LM/Xwin-LM-7B-V0.1](https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1)
- [OpenLemur/lemur-70b-chat-v1](https://huggingface.co/OpenLemur/lemur-70b-chat-v1)
- [allenai/tulu-2-dpo-7b](https://huggingface.co/allenai/tulu-2-dpo-7b)
- [Microsoft/Orca-2-7b](https://huggingface.co/microsoft/Orca-2-7b)
- Any [EleutherAI](https://huggingface.co/EleutherAI) pythia model such as [pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b)
- Any [Peft](https://github.com/huggingface/peft) adapter trained on top of a
model above. To activate, must have `peft` in the model path. Note: If
@@ -64,7 +73,7 @@ python3 -m fastchat.serve.cli --model [YOUR_MODEL_PATH]
You can run this example command to learn the code logic.

```
python3 -m fastchat.serve.cli --model lmsys/vicuna-7b-v1.3
python3 -m fastchat.serve.cli --model lmsys/vicuna-7b-v1.5
```

You can add `--debug` to see the actual prompt sent to the model.
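For reference, first-class support for a new model usually goes through FastChat's model adapters. A rough sketch of registering one (the `my-chat` match string and the reuse of the `one_shot` template are illustrative assumptions, not part of this diff):

```python
# Hypothetical adapter sketch; the "my-chat" match string and the reused
# "one_shot" conversation template are illustrative choices.
from fastchat.conversation import get_conv_template
from fastchat.model.model_adapter import BaseModelAdapter, register_model_adapter


class MyChatAdapter(BaseModelAdapter):
    """Adapter for a hypothetical `my-chat` model family."""

    def match(self, model_path: str):
        # Route any model path containing "my-chat" to this adapter.
        return "my-chat" in model_path.lower()

    def get_default_conv_template(self, model_path: str):
        # Reuse an existing conversation template; swap in your own if needed.
        return get_conv_template("one_shot")


register_model_adapter(MyChatAdapter)
```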
14 changes: 7 additions & 7 deletions docs/openai_api.md
@@ -18,7 +18,7 @@ python3 -m fastchat.serve.controller
Then, launch the model worker(s)

```bash
python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.3
python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5
```

Finally, launch the RESTful API server
@@ -45,7 +45,7 @@ import openai
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

model = "vicuna-7b-v1.3"
model = "vicuna-7b-v1.5"
prompt = "Once upon a time"

# create a completion
@@ -77,7 +77,7 @@ Chat Completions:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.3",
"model": "vicuna-7b-v1.5",
"messages": [{"role": "user", "content": "Hello! What is your name?"}]
}'
```
@@ -87,7 +87,7 @@ Text Completions:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.3",
"model": "vicuna-7b-v1.5",
"prompt": "Once upon a time",
"max_tokens": 41,
"temperature": 0.5
@@ -99,7 +99,7 @@ Embeddings:
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.3",
"model": "vicuna-7b-v1.5",
"input": "Hello world!"
}'
```
@@ -111,8 +111,8 @@ you can replace the `model_worker` step above with a multi model variant:

```bash
python3 -m fastchat.serve.multi_model_worker \
--model-path lmsys/vicuna-7b-v1.3 \
--model-names vicuna-7b-v1.3 \
--model-path lmsys/vicuna-7b-v1.5 \
--model-names vicuna-7b-v1.5 \
--model-path lmsys/longchat-7b-16k \
--model-names longchat-7b-16k
```
```
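With both models registered, the same endpoint serves either of them. A small sketch in the pinned `openai<1` client style used earlier in this document:

```python
# Hedged sketch: query both registered models through the OpenAI-compatible API.
# Assumes the controller, multi_model_worker, and API server above are running.
import openai

openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

for model in ("vicuna-7b-v1.5", "longchat-7b-16k"):
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": "Hello! What is your name?"}],
    )
    print(model, "->", resp.choices[0].message.content)
```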
6 changes: 3 additions & 3 deletions docs/vllm_integration.md
@@ -11,15 +11,15 @@ See the supported models [here](https://vllm.readthedocs.io/en/latest/models/sup

2. When you launch a model worker, replace the normal worker (`fastchat.serve.model_worker`) with the vLLM worker (`fastchat.serve.vllm_worker`). All other commands such as controller, gradio web server, and OpenAI API server are kept the same.
```
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.3
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.5
```

If you see tokenizer errors, try
```
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.3 --tokenizer hf-internal-testing/llama-tokenizer
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer hf-internal-testing/llama-tokenizer
```

if you use a awq model, try
If you use an AWQ quantized model, try
```
python3 -m fastchat.serve.vllm_worker --model-path TheBloke/vicuna-7B-v1.5-AWQ --quantization awq
```
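Several commits in this merge add `best_of`/`use_beam_search` handling to the completions interface (lm-sys#2442). A hedged sketch for exercising that against the vLLM worker; whether the parameter is forwarded is inferred from those commit titles, not from the diff shown here:

```python
# Hedged sketch: request server-side candidate sampling from the vLLM worker.
# `best_of` is a standard completions parameter; whether it reaches the worker
# depends on the completions-interface changes referenced above.
import openai

openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

resp = openai.Completion.create(
    model="vicuna-7b-v1.5",
    prompt="Once upon a time",
    max_tokens=32,
    best_of=2,  # sample two candidates, return the best-scoring one
)
print(resp.choices[0].text)
```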
90 changes: 90 additions & 0 deletions docs/xFasterTransformer.md
@@ -0,0 +1,90 @@
# xFasterTransformer Inference Framework

FastChat integrates the customized [xFasterTransformer](https://github.com/intel/xFasterTransformer) framework to provide faster inference speed on Intel CPUs.

## Install xFasterTransformer

Setup environment (please refer to [this link](https://github.com/intel/xFasterTransformer#installation) for more details):

```bash
pip install xfastertransformer
```

## Prepare models

Prepare Model (please refer to [this link](https://github.com/intel/xFasterTransformer#prepare-model) for more details):
```bash
python ./tools/chatglm_convert.py -i ${HF_DATASET_DIR} -o ${OUTPUT_DIR}
```

## Parameters of xFasterTransformer
- `--enable-xft` enables xFasterTransformer in FastChat.
- `--xft-max-seq-len` sets the maximum token length the model can process (this includes the input tokens).
- `--xft-dtype` sets the data type xFasterTransformer uses for computation. Supported types are fp32, fp16, int8, bf16, and hybrid types such as bf16_fp16 and bf16_int8. For data type details, please refer to [this link](https://github.com/intel/xFasterTransformer/wiki/Data-Type-Support-Platform).


Chat with the CLI:
```bash
#run inference on all CPUs and using float16
python3 -m fastchat.serve.cli \
--model-path /path/to/models \
--enable-xft \
--xft-dtype fp16
```
Or run with numactl on a multi-socket server for better performance:
```bash
#run inference on numanode 0 and with data type bf16_fp16 (first token uses bfloat16, and rest tokens use float16)
numactl -N 0 --localalloc \
python3 -m fastchat.serve.cli \
--model-path /path/to/models/chatglm2_6b_cpu/ \
--enable-xft \
--xft-dtype bf16_fp16
```
Or use MPI to run inference on two sockets for better performance:
```bash
#run inference on numanode 0 and 1 and with data type bf16_fp16 (first token uses bfloat16, and rest tokens use float16)
OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
-n 1 numactl -N 0 --localalloc \
python -m fastchat.serve.cli \
--model-path /path/to/models/chatglm2_6b_cpu/ \
--enable-xft \
--xft-dtype bf16_fp16 : \
-n 1 numactl -N 1 --localalloc \
python -m fastchat.serve.cli \
--model-path /path/to/models/chatglm2_6b_cpu/ \
--enable-xft \
--xft-dtype bf16_fp16
```


Start model worker:
```bash
# Load model with default configuration (max sequence length 4096, no GPU split setting).
python3 -m fastchat.serve.model_worker \
--model-path /path/to/models \
--enable-xft \
--xft-dtype bf16_fp16
```
Or run with numactl on a multi-socket server for better performance:
```bash
#run inference on numanode 0 and with data type bf16_fp16 (first token uses bfloat16, and rest tokens use float16)
numactl -N 0 --localalloc python3 -m fastchat.serve.model_worker \
--model-path /path/to/models \
--enable-xft \
--xft-dtype bf16_fp16
```
Or use MPI to run inference on two sockets for better performance:
```bash
#run inference on numanode 0 and 1 and with data type bf16_fp16 (first token uses bfloat16, and rest tokens use float16)
OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
-n 1 numactl -N 0 --localalloc python -m fastchat.serve.model_worker \
--model-path /path/to/models \
--enable-xft \
--xft-dtype bf16_fp16 : \
-n 1 numactl -N 1 --localalloc python -m fastchat.serve.model_worker \
--model-path /path/to/models \
--enable-xft \
--xft-dtype bf16_fp16
```

For more details, please refer to [this link](https://github.com/intel/xFasterTransformer#how-to-run)
2 changes: 1 addition & 1 deletion fastchat/__init__.py
@@ -1 +1 @@
__version__ = "0.2.29"
__version__ = "0.2.33"
5 changes: 3 additions & 2 deletions fastchat/constants.py
@@ -11,11 +11,12 @@
SERVER_ERROR_MSG = (
"**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**"
)
MODERATION_MSG = "YOUR INPUT VIOLATES OUR CONTENT MODERATION GUIDELINES. PLEASE FIX YOUR INPUT AND TRY AGAIN."
MODERATION_MSG = "$MODERATION$ YOUR INPUT VIOLATES OUR CONTENT MODERATION GUIDELINES."
CONVERSATION_LIMIT_MSG = "YOU HAVE REACHED THE CONVERSATION LENGTH LIMIT. PLEASE CLEAR HISTORY AND START A NEW CONVERSATION."
INACTIVE_MSG = "THIS SESSION HAS BEEN INACTIVE FOR TOO LONG. PLEASE REFRESH THIS PAGE."
SLOW_MODEL_MSG = "⚠️ Both models will show the responses all at once. Please stay patient as it may take over 30 seconds."
# Maximum input length
INPUT_CHAR_LEN_LIMIT = int(os.getenv("FASTCHAT_INPUT_CHAR_LEN_LIMIT", 3072))
INPUT_CHAR_LEN_LIMIT = int(os.getenv("FASTCHAT_INPUT_CHAR_LEN_LIMIT", 12000))
# Maximum conversation turns
CONVERSATION_TURN_LIMIT = 50
# Session expiration time
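Because the limit above is read with `os.getenv`, it can be tuned per deployment without code changes. A small sketch, assuming the variable is set before `fastchat.constants` is imported:

```python
# Hedged sketch: override the input length limit from the environment.
# Must happen before fastchat.constants is imported anywhere in the process.
import os

os.environ["FASTCHAT_INPUT_CHAR_LEN_LIMIT"] = "3072"

from fastchat import constants

print(constants.INPUT_CHAR_LEN_LIMIT)  # -> 3072
```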

Commit 94421ea