Adding OpenAI API Demo using FastAPI #530
Conversation
Hi, a quick question: I followed the tutorial and deployed the Chinese LLaMA model locally on Windows, but how can I call this local model's API, or otherwise submit a question and have the model return an inference output?
You can send HTTP requests with Python's requests library. Here is example code that converts the curl command into a POST request using requests:

import requests
import json

url = 'http://localhost:19327/v1/chat/completions'
headers = {
    'Content-Type': 'application/json'
}
data = {
    'messages': [
        {'role': 'user', 'message': '给我讲一些有关杭州的故事吧'}
    ],
    'repetition_penalty': 1.0
}
response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json())

Make sure the requests library is installed; it can be installed with pip: pip install requests. Note that for this code to work, your local server must be running and accepting POST requests at http://localhost:19327.
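Assuming the server returns an OpenAI-style JSON body (the exact response schema is not shown in this thread), the reply could be extracted roughly like this; the 'choices' structure is an assumption, so adjust to the server's actual schema:

result = response.json()
# Hypothetical: OpenAI-style responses carry the generated reply under 'choices'
print(result['choices'][0]['message'])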
Hi, thank you very much for the answer. However, when I start the local server with the 13B model, it fails with an out-of-GPU-memory error; my GPU has 24G. In PowerShell I can run inference with the quantized model, but your method loads both models in full, which fails. Is there a way to serve the quantized 13B model through the API, or otherwise reduce GPU memory usage?
@1anglesmith1 You can add --load_in_8bit to the launch options.
However I add it, I get an error saying there is no such argument: usage: openai_api_server.py [-h] --base_model BASE_MODEL [--lora_model LORA_MODEL] [--tokenizer_path TOKENIZER_PATH]
@1anglesmith1 There is no 4-bit option, only 8-bit.
python scripts/openai_server_demo/openai_api_server.py --base_model /path/to/base_model --lora_model /path/to/lora_model --gpus 0,1 --load_in_8bit — is this right?
@1anglesmith1 Yes. Since you have a single GPU, use --gpus 0.
Thank you for your continuous contributions.
@sunyuhan19981208 raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
Fix: it seems a line needs to be inserted below line 161: tokenizer.add_special_tokens({'pad_token': '[PAD]'}). After inserting that line and restarting the server, the error no longer appears.
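For reference, a minimal sketch of the two fixes suggested by the error message itself:

# Option 1: reuse the end-of-sequence token as the padding token
tokenizer.pad_token = tokenizer.eos_token
# Option 2: register a dedicated [PAD] token (the fix applied above);
# a newly added token may also require model.resize_token_embeddings(len(tokenizer))
tokenizer.add_special_tokens({'pad_token': '[PAD]'})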
Thanks for testing. I didn't run into this issue in my own tests; I'll look into the cause after work today.
RuntimeError:
@TommyTang930 (Pdb) p tokenizer
LlamaTokenizer(name_or_path='/home/sunyuhan/syh/sunyuhan/zju/chinese-alpaca-plus-7b', vocab_size=49954, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '[PAD]'}, clean_up_tokenization_spaces=False)
def generate_chat_prompt(messages: list):
    """Generate prompt for chat completion"""
    # Default system message; overridden by any 'system' role message below
    system_msg = '''A chat between a curious user and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the user's questions.'''
    for msg in messages:
        if msg.role == 'system':
            system_msg = msg.message
    # Separate turns with the model's end-of-sequence token
    prompt = f"{system_msg} </s>"
    for msg in messages:
        if msg.role == 'system':
            continue
        # 'message' matches the field name used in this demo's request payloads
        prompt += f"{msg.role}: {msg.message} </s>"
    prompt += "assistant:"
    return prompt
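For illustration, with the default system message and a single user message, the function above yields a prompt of the form:

{system message} </s>user: 给我讲一些有关杭州的故事吧 </s>assistant: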
Alpaca models have not been trained with a separate prompt template for multi-turn conversations. In our training scheme, a multi-turn conversation is formatted as follows:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response: {response}
### Instruction:
{instruction}
### Response: {response}
...
However, it is possible that the format you used above is more appropriate for multi-turn inference. Could you please compare these two formats and determine which one is better? If necessary, please modify the code accordingly.
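For comparison, a minimal sketch of a prompt builder following the training-scheme template quoted above; the function name is hypothetical, and the message objects are assumed to have the same role/message attributes as in generate_chat_prompt:

def generate_alpaca_style_prompt(messages: list):
    """Sketch: render a multi-turn conversation in the
    Instruction/Response template from the training scheme."""
    prompt = ("Below is an instruction that describes a task. "
              "Write a response that appropriately completes the request.\n")
    for msg in messages:
        if msg.role == 'user':
            prompt += f"\n### Instruction:\n{msg.message}\n"
        elif msg.role == 'assistant':
            prompt += f"\n### Response: {msg.message}\n"
    # Leave an open Response header for the model to complete
    prompt += "\n### Response: "
    return prompt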
I will modify it tonight, thanks a lot for your code review!!!!
generate_chat_prompt may need to be modified accordingly.
Description
This pull request adds a demo of the OpenAI API using FastAPI. The demo supports three endpoints: completions, chat/completions, and embeddings.
Changes Made
Usage
Use the command below to deploy the server.
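The deployment command, as given earlier in this thread (model paths are placeholders; --load_in_8bit is optional and reduces GPU memory usage):

python scripts/openai_server_demo/openai_api_server.py --base_model /path/to/base_model --lora_model /path/to/lora_model --gpus 0 --load_in_8bit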
Test APIs
Example requests for the three endpoints (completions, chat/completions, and embeddings) are sketched below.
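A minimal sketch using Python's requests library, assuming OpenAI-style request bodies: the chat payload matches the example earlier in this thread, while the completions and embeddings payloads (the 'prompt' and 'input' fields) are assumptions modeled on the OpenAI API.

import requests

BASE = 'http://localhost:19327/v1'

# completions: plain prompt in, generated text out ('prompt' field assumed)
r = requests.post(f'{BASE}/completions', json={'prompt': '给我讲一些有关杭州的故事吧'})
print(r.json())

# chat/completions: payload as shown earlier in this thread
r = requests.post(f'{BASE}/chat/completions', json={
    'messages': [{'role': 'user', 'message': '给我讲一些有关杭州的故事吧'}],
    'repetition_penalty': 1.0,
})
print(r.json())

# embeddings: returns a vector for the input text ('input' field assumed)
r = requests.post(f'{BASE}/embeddings', json={'input': '今天天气真不错'})
print(r.json())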