Adding OPENAI API Demo using FastAPI #530

Merged (24 commits) on Jun 20, 2023

Conversation

@sunyuhan19981208 (Contributor)

Description

This pull request adds a demo of the OpenAI API implemented with FastAPI. The demo supports three APIs: completions, chat/completions, and embeddings.

Changes Made

  • Created a new directory named api_server_demo in the project root.
  • Added new files openai_api_server.py and openai_api_protocol.py in the api_server_demo directory containing the implementation of the API demo.
  • Implemented three API endpoints: /v1/completions, /v1/chat/completions, and /v1/embeddings.
  • Utilized FastAPI for creating the API server and handling HTTP requests.
  • Added the necessary API reference documentation to the README.md file.

Usage

Use the command below to deploy the server.

python scripts/openai_server_demo/openai_api_server.py --base_model /path/to/base_model --lora_model /path/to/lora_model --gpus 0,1

Test APIs

completions:

curl http://localhost:19327/v1/completions \
  -H "Content-Type: application/json" \
  -d '{   
    "prompt": "告诉我中国的首都在哪里"
  }'

chat:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{   
    "messages": [
      {"role": "user","message": "给我讲一些有关杭州的故事吧"}
    ],
    "repetition_penalty": 1.0
  }'

embeddings:

curl http://localhost:19327/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "今天天气真不错"
  }'
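
For convenience, the same endpoints can also be called from Python with the requests library. The snippet below is a minimal sketch: it assumes the server is running locally on port 19327 as started above, and it simply prints the raw JSON responses, whose exact fields depend on the server implementation.

import requests

BASE_URL = 'http://localhost:19327'

# Text completion
resp = requests.post(f'{BASE_URL}/v1/completions',
                     json={'prompt': '告诉我中国的首都在哪里'})
print(resp.json())

# Sentence embeddings
resp = requests.post(f'{BASE_URL}/v1/embeddings',
                     json={'input': '今天天气真不错'})
print(resp.json())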

@1anglesmith1

Hi, a quick question: I followed the tutorial and deployed the Chinese llama model locally on Windows, but how can I call this local model's API, or otherwise define a question and have the model run inference and return an output?

@sunyuhan19981208 (Contributor Author)

Hi, a quick question: I followed the tutorial and deployed the Chinese llama model locally on Windows, but how can I call this local model's API, or otherwise define a question and have the model run inference and return an output?

You can send HTTP requests with Python's requests library. Below is example code that converts the curl command into a POST request using requests:

import requests
import json

url = 'http://localhost:19327/v1/chat/completions'
headers = {
    'Content-Type': 'application/json'
}

data = {
    'messages': [
        {'role': 'user', 'message': '给我讲一些有关杭州的故事吧'}
    ],
    'repetition_penalty': 1.0
}

response = requests.post(url, headers=headers, data=json.dumps(data))

print(response.json())

Make sure the requests library is installed. You can install it via pip with the following command:

pip install requests

Note that for this code to run successfully, your local server must be running and accepting POST requests at http://localhost:19327.

@1anglesmith1

Hi, thanks a lot for your answer. However, when I start the local server with the 13B model, it eventually errors out saying there is not enough GPU memory (my GPU has 24 GB). In PowerShell I can run inference with the quantized model, but with your method both models have to be fully loaded, which causes problems. Is there a way to interact with the API using the quantized 13B model, or to reduce the GPU memory consumption?

airaria self-requested a review on June 14, 2023 at 09:07
@sunyuhan19981208 (Contributor Author)

@1anglesmith1 You can add --load_in_8bit to the launch options.

@1anglesmith1

--load_in_8bit

However I add it, I get an error saying there is no such argument: usage: openai_api_server.py [-h] --base_model BASE_MODEL [--lora_model LORA_MODEL] [--tokenizer_path TOKENIZER_PATH]
[--gpus GPUS] [--load_in_8bit] [--only_cpu]
openai_api_server.py: error: unrecognized arguments: --load_in_4bit
Where in the launch options should --load_in_4bit go?

@sunyuhan19981208 (Contributor Author)

@1anglesmith1 There is no 4-bit option, only 8-bit.

@1anglesmith1

@1anglesmith1 There is no 4-bit option, only 8-bit.

python scripts/openai_server_demo/openai_api_server.py --base_model /path/to/base_model --lora_model /path/to/lora_model --gpus 0,1 --load_in_8bit
Is that right?

@sunyuhan19981208 (Contributor Author)

@1anglesmith1 Yes. If you only have one GPU, just use --gpus 0.

@ymcui (Owner)

ymcui commented Jun 15, 2023

Thank you for your continuous contributions.
After a quick scan, the proposed PR might be a valuable addition to our project.
In the meantime, we apologize for the delay in reviewing this PR, as we are swamped with other work.
We will review this PR as soon as possible. Thanks again for your understanding.

@TGLTommy

@sunyuhan19981208
Hi, thanks for your contribution. While running your code openai_api_server.py, I found a problem: when calling the API service to compute the embeddings of a sentence, it throws an exception:

raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

Possible fix: do we need to insert a line of code below line 161:

tokenizer.add_special_tokens({'pad_token': '[PAD]'})

After I inserted that line and ran the service again, the error was gone.

@sunyuhan19981208 (Contributor Author)

OK, thanks for testing. I didn't hit this problem in my own tests; I'll look into the cause after work today.

@1anglesmith1

RuntimeError:
CUDA Setup failed despite GPU being available. Please run the following command to get more information:

    python -m bitsandbytes

    Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
    to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
    and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

This says my CUDA has a problem, but I checked and the corresponding CUDA version is installed. May I ask why, when building the API, you don't run inference directly with the quantized model instead of loading the LLaMA HF model plus the LoRA weights? Is it not possible to build an API on top of the quantized model directly?

@sunyuhan19981208 (Contributor Author)

@1anglesmith1

  1. You can try prefixing the Python command with LD_LIBRARY_PATH={the directory containing your CUDA .so libraries}:$LD_LIBRARY_PATH python ....

  2. With https://github.com/abetlen/llama-cpp-python you can directly stand up a similar service (a rough sketch follows after this list).

  3. I can work on quantized inference over the coming weekend; I originally skipped it because I felt it overlapped with the project above.
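
For reference, a minimal llama-cpp-python sketch for running a quantized model directly could look roughly like this (an illustration only: it assumes llama-cpp-python is installed and the model has been converted to a ggml file it can load; check that project's documentation for the options supported by your version). That project also documents an OpenAI-compatible HTTP server if you prefer an API instead of in-process calls.

from llama_cpp import Llama

# Load a quantized ggml model from disk (path is a placeholder)
llm = Llama(model_path='/path/to/quantized-model.bin')

# Run a completion and print the generated text
out = llm('告诉我中国的首都在哪里', max_tokens=128)
print(out['choices'][0]['text'])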

@sunyuhan19981208 (Contributor Author)

sunyuhan19981208 commented Jun 15, 2023

@TommyTang930
On my side the tokenizer already has a pad_token by default after loading; I suspect the tokenizer is missing from your merged model. Still, thanks for catching this; I've added a check for it.

(Pdb) p tokenizer
LlamaTokenizer(name_or_path='/home/sunyuhan/syh/sunyuhan/zju/chinese-alpaca-plus-7b', vocab_size=49954, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '[PAD]'}, clean_up_tokenization_spaces=False)
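
For reference, the added check could look roughly like the following (a sketch of the idea; the exact change in the PR may differ):

# Only add a pad token when the loaded tokenizer does not already define one
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})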

Comment on lines 92 to 105
def generate_chat_prompt(messages: list):
    """Generate prompt for chat completion"""
    system_msg = '''A chat between a curious user and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the user's questions.'''
    for msg in messages:
        if msg.role == 'system':
            system_msg = msg.message
    prompt = f"{system_msg} <\s>"
    for msg in messages:
        if msg.role == 'system':
            continue
        prompt += f"{msg.role}: {msg.content} <\s>"
    prompt += "assistant:"
    return prompt
@airaria (Contributor) Jun 19, 2023

Alpaca models have not been trained with a separate prompt template for multi-turn conversations.
In our training scheme, a multi-turn conversation is formatted as follows:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response: {response}

### Instruction:
{instruction}

### Response: {response}

...

However, it is possible that the format you used above is more appropriate for multi-turn conversations at inference time.
Can you please compare these two formats and determine which one is better? If necessary, please modify the code accordingly.
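
For illustration, a generate_chat_prompt variant following this Alpaca-style template might look roughly like the sketch below (it assumes the same message objects with role and content fields used elsewhere in the PR, and is not the final implementation):

def generate_chat_prompt(messages: list):
    """Build an Alpaca-style multi-turn prompt from the chat messages."""
    prompt = ("Below is an instruction that describes a task. "
              "Write a response that appropriately completes the request.\n")
    for msg in messages:
        if msg.role == 'system':
            # The Alpaca template has no dedicated system slot; skip it here
            continue
        if msg.role == 'user':
            prompt += f"\n### Instruction:\n{msg.content}\n"
        else:
            # Earlier assistant turns become previous responses
            prompt += f"\n### Response: {msg.content}\n"
    # Leave an open response slot for the model to complete
    prompt += "\n### Response: "
    return prompt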

@sunyuhan19981208 (Contributor Author)

I will modify it tonight. Thanks a lot for your code review!

@airaria (Contributor) left a review comment:

generate_chat_prompt may need to be modified.

airaria requested a review from ymcui on June 20, 2023 at 01:20
ymcui merged commit 129cb86 into ymcui:main on Jun 20, 2023