Adding OPENAI API Demo using FastAPI #530

Merged (24 commits) on Jun 20, 2023

Conversation

@sunyuhan19981208 (Contributor)

Description

This pull request adds a demo of the OpenAI API implemented with FastAPI. The demo supports three APIs: completions, chat/completions, and embeddings.

Changes Made

  • Created a new directory named api_server_demo in the project root.
  • Added new files openai_api_server.py and openai_api_protocol.py in the api_server_demo directory containing the implementation of the API demo.
  • Implemented three API endpoints: /v1/completions, /v1/chat/completions, and /v1/embeddings.
  • Utilized FastAPI for creating the API server and handling HTTP requests.
  • Added the necessary API reference documentation to the README.md file.

Usage

Use the command below to deploy the server.

python scripts/openai_server_demo/openai_api_server.py --base_model /path/to/base_model --lora_model /path/to/lora_model --gpus 0,1

Test APIs

completions:

curl http://localhost:19327/v1/completions \
  -H "Content-Type: application/json" \
  -d '{   
    "prompt": "告诉我中国的首都在哪里"
  }'

chat:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{   
    "messages": [
      {"role": "user","message": "给我讲一些有关杭州的故事吧"}
    ],
    "repetition_penalty": 1.0
  }'

embeddings:

curl http://localhost:19327/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "今天天气真不错"
  }'
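
For convenience, the same endpoints can also be called from Python with the requests library. The snippet below is a minimal sketch: it assumes the server is running locally on port 19327 as started above, and it simply prints the raw JSON responses, whose exact fields depend on the server implementation.

import requests

BASE_URL = 'http://localhost:19327'

# Text completion
resp = requests.post(f'{BASE_URL}/v1/completions',
                     json={'prompt': '告诉我中国的首都在哪里'})
print(resp.json())

# Sentence embeddings
resp = requests.post(f'{BASE_URL}/v1/embeddings',
                     json={'input': '今天天气真不错'})
print(resp.json())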

@1anglesmith1

Hi, a quick question: I followed the tutorial and deployed the Chinese llama model locally on Windows, but how can I call this local model's API, or otherwise define a question and have the model run inference and return an output?

@sunyuhan19981208 (Contributor Author)

Hi, a quick question: I followed the tutorial and deployed the Chinese llama model locally on Windows, but how can I call this local model's API, or otherwise define a question and have the model run inference and return an output?

You can send HTTP requests with Python's requests library. Below is example code that converts the curl command into a POST request using requests:

import requests
import json

url = 'http://localhost:19327/v1/chat/completions'
headers = {
    'Content-Type': 'application/json'
}

data = {
    'messages': [
        {'role': 'user', 'message': '给我讲一些有关杭州的故事吧'}
    ],
    'repetition_penalty': 1.0
}

response = requests.post(url, headers=headers, data=json.dumps(data))

print(response.json())

Make sure the requests library is installed. You can install it via pip with the following command:

pip install requests

Note that for this code to run successfully, your local server must be running and accepting POST requests at http://localhost:19327.

@1anglesmith1

Hi, thanks a lot for your answer. However, when I start the local server with the 13B model, it eventually errors out saying there is not enough GPU memory (my GPU has 24 GB). In PowerShell I can run inference with the quantized model, but with your method both models have to be fully loaded, which causes problems. Is there a way to interact with the API using the quantized 13B model, or to reduce the GPU memory consumption?

airaria self-requested a review on June 14, 2023 at 09:07
@sunyuhan19981208 (Contributor Author)

@1anglesmith1 You can add --load_in_8bit to the launch options.

@1anglesmith1

--load_in_8bit

However I add it, I get an error saying there is no such argument: usage: openai_api_server.py [-h] --base_model BASE_MODEL [--lora_model LORA_MODEL] [--tokenizer_path TOKENIZER_PATH]
[--gpus GPUS] [--load_in_8bit] [--only_cpu]
openai_api_server.py: error: unrecognized arguments: --load_in_4bit
Where in the launch options should --load_in_4bit go?

@sunyuhan19981208 (Contributor Author)

@1anglesmith1 There is no 4-bit option, only 8-bit.

@1anglesmith1

@1anglesmith1 There is no 4-bit option, only 8-bit.

python scripts/openai_server_demo/openai_api_server.py --base_model /path/to/base_model --lora_model /path/to/lora_model --gpus 0,1 --load_in_8bit
Is that right?

@sunyuhan19981208 (Contributor Author)

@1anglesmith1 Yes. If you only have one GPU, just use --gpus 0.

@ymcui (Owner)

ymcui commented Jun 15, 2023

Thank you for your continuous contributions.
After a quick scan, the proposed PR might be a valuable addition to our project.
In the meantime, we apologize for the delay in reviewing this PR, as we are swamped with other work.
We will review this PR as soon as possible. Thanks again for your understanding.

@TGLTommy

@sunyuhan19981208
Hi, thanks for your contribution. While running your code openai_api_server.py, I found a problem: when calling the API service to compute the embeddings of a sentence, it throws an exception:

raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

Possible fix: do we need to insert a line of code below line 161:

tokenizer.add_special_tokens({'pad_token': '[PAD]'})

After I inserted that line and ran the service again, the error was gone.

@sunyuhan19981208 (Contributor Author)

OK, thanks for testing. I didn't hit this problem in my own tests; I'll look into the cause after work today.

@1anglesmith1

RuntimeError:
CUDA Setup failed despite GPU being available. Please run the following command to get more information:

    python -m bitsandbytes

    Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
    to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
    and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

This says my CUDA has a problem, but I checked and the corresponding CUDA version is installed. May I ask why, when building the API, you don't run inference directly with the quantized model instead of loading the LLaMA HF model plus the LoRA weights? Is it not possible to build an API on top of the quantized model directly?

@sunyuhan19981208 (Contributor Author)

@1anglesmith1

  1. You can try prefixing the Python command with LD_LIBRARY_PATH={the directory containing your CUDA .so libraries}:$LD_LIBRARY_PATH python ....

  2. With https://github.com/abetlen/llama-cpp-python you can directly stand up a similar service (a rough sketch follows after this list).

  3. I can work on quantized inference over the coming weekend; I originally skipped it because I felt it overlapped with the project above.
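
For reference, a minimal llama-cpp-python sketch for running a quantized model directly could look roughly like this (an illustration only: it assumes llama-cpp-python is installed and the model has been converted to a ggml file it can load; check that project's documentation for the options supported by your version). That project also documents an OpenAI-compatible HTTP server if you prefer an API instead of in-process calls.

from llama_cpp import Llama

# Load a quantized ggml model from disk (path is a placeholder)
llm = Llama(model_path='/path/to/quantized-model.bin')

# Run a completion and print the generated text
out = llm('告诉我中国的首都在哪里', max_tokens=128)
print(out['choices'][0]['text'])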

@sunyuhan19981208 (Contributor Author)

sunyuhan19981208 commented Jun 15, 2023

@TommyTang930
On my side the tokenizer already has a pad_token by default after loading; I suspect the tokenizer is missing from your merged model. Still, thanks for catching this; I've added a check for it.

(Pdb) p tokenizer
LlamaTokenizer(name_or_path='/home/sunyuhan/syh/sunyuhan/zju/chinese-alpaca-plus-7b', vocab_size=49954, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '[PAD]'}, clean_up_tokenization_spaces=False)
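
For reference, the added check could look roughly like the following (a sketch of the idea; the exact change in the PR may differ):

# Only add a pad token when the loaded tokenizer does not already define one
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})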

Comment on lines 92 to 105
def generate_chat_prompt(messages: list):
    """Generate prompt for chat completion"""
    system_msg = '''A chat between a curious user and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the user's questions.'''
    for msg in messages:
        if msg.role == 'system':
            system_msg = msg.message
    prompt = f"{system_msg} <\s>"
    for msg in messages:
        if msg.role == 'system':
            continue
        prompt += f"{msg.role}: {msg.content} <\s>"
    prompt += "assistant:"
    return prompt
@airaria (Contributor) Jun 19, 2023

Alpaca models have not been trained with a separate prompt template for multi-turn conversations.
In our training scheme, a multi-turn conversation is formatted as follows:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response: {response}

### Instruction:
{instruction}

### Response: {response}

...

However, it is possible that the format you used above is more appropriate for multi-turn conversations at inference time.
Can you please compare these two formats and determine which one is better? If necessary, please modify the code accordingly.
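
For illustration, a generate_chat_prompt variant following this Alpaca-style template might look roughly like the sketch below (it assumes the same message objects with role and content fields used elsewhere in the PR, and is not the final implementation):

def generate_chat_prompt(messages: list):
    """Build an Alpaca-style multi-turn prompt from the chat messages."""
    prompt = ("Below is an instruction that describes a task. "
              "Write a response that appropriately completes the request.\n")
    for msg in messages:
        if msg.role == 'system':
            # The Alpaca template has no dedicated system slot; skip it here
            continue
        if msg.role == 'user':
            prompt += f"\n### Instruction:\n{msg.content}\n"
        else:
            # Earlier assistant turns become previous responses
            prompt += f"\n### Response: {msg.content}\n"
    # Leave an open response slot for the model to complete
    prompt += "\n### Response: "
    return prompt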

@sunyuhan19981208 (Contributor Author)

I will modify it tonight. Thanks a lot for your code review!

@airaria (Contributor) left a review comment:

generate_chat_prompt may need to be modified.

airaria requested a review from ymcui on June 20, 2023 at 01:20
ymcui merged commit 129cb86 into ymcui:main on Jun 20, 2023