Streaming openai api support #43

Merged · 10 commits merged on Aug 9, 2023

Conversation

@lealaxy (Contributor) commented Aug 1, 2023

Description

Use the transformers TextIteratorStreamer to support streaming responses for the OpenAI API.

Ref https://huggingface.co/docs/transformers/internal/generation_utils
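
For context, the core streaming pattern with TextIteratorStreamer looks roughly like this (a minimal sketch, not the exact code in openai_api_server.py; the model path and generation settings are placeholders):

from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained("path/to/model")
model = AutoModelForCausalLM.from_pretrained("path/to/model")

def stream_generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    # skip_prompt drops the echoed input; skip_special_tokens cleans the decoded text
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=512)
    # generate() blocks, so it runs in a background thread while we consume the streamer
    Thread(target=model.generate, kwargs=generation_kwargs).start()
    for new_text in streamer:  # yields decoded text pieces as they become available
        yield new_text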

Related Issue

None

@airaria (Contributor) commented Aug 3, 2023

@lealaxy Can you add usage instructions for the new feature to the documentation in scripts/openai_server_demo/README.md?

@lealaxy (Contributor, Author) commented Aug 3, 2023

> @lealaxy Can you add usage instructions for the new feature to the documentation in scripts/openai_server_demo/README.md?

Hello, I have added the docs and modified the format of the chat/completions API request body.

Now, by running openai_api_server.py, you can use chinese-llama-alpaca-2 as the backend for any ChatGPT-style frontend application.
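
For example, a frontend can consume the streaming endpoint with the openai Python client (a hedged sketch using the pre-1.0 openai package; whether the demo server validates API keys is an assumption):

import openai

openai.api_base = "http://localhost:19327/v1"
openai.api_key = "none"  # placeholder; the demo server is assumed not to validate keys

response = openai.ChatCompletion.create(
    model="chinese-llama-alpaca-2",
    messages=[{"role": "user", "content": "中国的首都在哪里?"}],
    stream=True,
)
for chunk in response:
    # each chunk mirrors one "data: ..." SSE event from the server
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)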

@airaria (Contributor) commented Aug 3, 2023

When I test the API with curl:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "给我讲一些有关杭州的故事吧"}
    ],
    "repetition_penalty": 1.0, "stream":true
  }'

an error occurs:

Traceback (most recent call last):
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 429, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/routing.py", line 69, in app
    await response(scope, receive, send)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/sse_starlette/sse.py", line 251, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 574, in __aexit__
    raise exceptions[0]
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/sse_starlette/sse.py", line 240, in wrap
    await func()
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/sse_starlette/sse.py", line 225, in stream_response
    async for data in self.body_iterator:
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/concurrency.py", line 63, in iterate_in_threadpool
    yield await anyio.to_thread.run_sync(_next, iterator)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
    return await future
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 754, in run
    result = context.run(func, *args)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/starlette/concurrency.py", line 53, in _next
    return next(iterator)
  File "/Users/yangziqing/Documents/projects/llama/PR/Chinese-LLaMA-Alpaca-2/scripts/openai_server_demo/openai_api_server.py", line 228, in stream_predict
    yield "{}".format(chunk.json(exclude_unset=True, ensure_ascii=False))
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/typing_extensions.py", line 2509, in wrapper
    return __arg(*args, **kwargs)
  File "/Users/yangziqing/opt/anaconda3/lib/python3.9/site-packages/pydantic/main.py", line 945, in json
    raise TypeError('`dumps_kwargs` keyword arguments are no longer supported.')

@lealaxy Do you have any idea? Could it be related to package versions?

my env:
python 3.9.16
pydantic 2.1.1
fastapi 0.100.1
uvicorn 0.21.1
sse-starlette 1.6.1
starlette 0.27.0

@lealaxy (Contributor, Author) commented Aug 3, 2023

Yes, it is because you are using pydantic>2.0.0, while in my environment I have pydantic 1.10.9 installed.

Additionally, deepspeed currently requires pydantic<2.0.0. To keep the inference and training environments consistent, I suggest installing pydantic<2.0.0.
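
The failing call in the traceback is the pydantic 1.x style. For illustration, here is the difference (a minimal sketch; Chunk is a hypothetical stand-in for the server's response model):

from pydantic import BaseModel

class Chunk(BaseModel):  # stand-in for the server's chunk model (assumption)
    content: str = ""

chunk = Chunk(content="北京")

# pydantic 1.x forwards extra kwargs (like ensure_ascii) to json.dumps:
print(chunk.json(exclude_unset=True, ensure_ascii=False))

# pydantic 2.x removed these "dumps_kwargs" and raises the TypeError above;
# its replacement is model_dump_json (which emits UTF-8 without \u escapes by default):
# print(chunk.model_dump_json(exclude_unset=True))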

my env:
Python 3.10.11
pydantic 1.10.9
fastapi 0.100.1
uvicorn 0.22.0
sse-starlette 1.6.1
starlette 0.27.0
torch 2.0.1
deepspeed 0.10.0

@airaria (Contributor) commented Aug 3, 2023

After installing pydantic==1.10.9 (other packages remain the same), I run the server (on macOS):

python openai_api_server.py --base_model ./chinese-alpaca-2-7b --only_cpu

Test without streaming:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "中国的首都在哪里?"}
    ],
    "repetition_penalty": 1.0
  }'

Output (copied from terminal):

{"id":"chatcmpl-N4hHBcHfx7WjwcEsQaNG8k","object":"chat.completion","created":1691067769,"model":"chinese-llama-alpaca-2","choices":[{"index":0,"message":{"role":"user","content":"中国的首都是哪里?"}},{"index":1,"message":{"role":"assistant","content":"中国的首都是北京。"}}]}

Test with streaming:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "中国的首都在哪里?"}
    ],
    "repetition_penalty": 1.0,
    "stream":true
  }'

Output (copied from terminal; ping info omitted):

data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "的"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "首都"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "是"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "北京"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "京。"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
data: [DONE]

It looks like the output is incomplete and garbled with stray tokens.
Am I using it the right way?

@lealaxy (Contributor, Author) commented Aug 3, 2023

Sorry. This was due to an error in how the generated text was split during generation. I have fixed the bug now.

@airaria (Contributor) commented Aug 4, 2023

There are still some extra tokens...
Input:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "中国的首都是哪里"},
      {"role": "assistant","content": "北京。"},
      {"role": "user","content": "法国的呢"}
    ],
    "repetition_penalty": 1.0,
    "stream": true
  }'

Output:

data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": " "}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "法国"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "的"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "首都"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "是"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "巴黎"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "黎。"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
data: [DONE]

@lealaxy (Contributor, Author) commented Aug 4, 2023

Fixed. I think that "" should not be counted as an extra token.

Input:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "中国的首都是哪里"},
      {"role": "assistant","content": "北京。"},
      {"role": "user","content": "法国的呢"}
    ],
    "repetition_penalty": 1.0,
    "stream": true
  }'

Output:

data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "法国"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "的"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "首都"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "是"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "巴黎"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "黎。"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
data: [DONE]

Input 2:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "中国的首都在哪里?"}
    ],
    "repetition_penalty": 1.0,
    "stream":true
  }'

Output 2:

data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "中国"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "的"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "首都"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "是"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "北京"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "京。"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
data: [DONE]

@airaria (Contributor) commented Aug 4, 2023

data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "北京"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "京。"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}

But why is there an extra "京。" (and "黎。"), and a repeated ""?
Can you fix this?

@lealaxy (Contributor, Author) commented Aug 4, 2023

The repeated "" is normal behavior during the model's generation process: the streamer sometimes returns an empty string, but it doesn't affect the final answer once the chunks are concatenated.

The bug causing the repetition of the last word has been fixed.

@GoGoJoestar (Collaborator)

I tested the API in stream mode, but the responses were strange:

My input:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "告诉我中国的首都在哪里"}
    ],
    "repetition_penalty": 1.0,
    "stream": true
  }'

Output:

[screenshot of the streamed responses omitted]

The earlier responses all returned "", only the last one returned output, and that output lost its first token.

@lealaxy (Contributor, Author) commented Aug 7, 2023

Sorry. Fixed.

@GoGoJoestar (Collaborator) commented Aug 8, 2023

It seems that not every token is returned immediately when it is generated. Instead, every once in a while, all tokens generated during that period are returned together.

data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
... # many lines of repeated empty responses omitted
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "亚洲是一个广阔的洲,拥有许多美丽的国家和城市。以下是一些亚洲国家及其首都:\n"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "1. "}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
...
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "印度:新德里\n"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "2. "}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
...
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "巴基斯坦:伊斯兰堡\n"}, "finish_reason": null}]}
...

@lealaxy (Contributor, Author) commented Aug 8, 2023

This is due to differences in GPU computing power: different GPUs have different token generation speeds.

The generated tokens are added to a buffer. I use the TextIteratorStreamer to read tokens from the buffer in a loop and return them.

In your case, it is possible that your GPU generated a large number of tokens in a short period, so a single read from the buffer returned multiple tokens at once. Then, due to CUDA calls or other reasons, there was an idle period, followed by another burst of generated tokens.

However, this does not affect usage. On my machine (a server-cooled A100 GPU), token generation timing is consistent.

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "告诉我中国的首都在哪里"}
    ],
    "repetition_penalty": 1.0,
    "stream": true
  }'

data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "中国"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "的"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "首都"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "是"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "北京"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ",位于"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "华北"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "地区"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {"content": "。"}, "finish_reason": null}]}
data: {"object": "chat.completion.chunk", "model": "chinese-llama-alpaca-2", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
data: [DONE]
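
For reference, the server-side wiring has roughly this shape (a simplified, self-contained sketch; the handler body and chunk payloads here are stand-ins, not the exact openai_api_server.py code):

from fastapi import FastAPI
from pydantic import BaseModel
from sse_starlette.sse import EventSourceResponse

app = FastAPI()

class ChatRequest(BaseModel):
    messages: list
    stream: bool = False

def stream_predict(request):
    # In the real server this loop reads pieces from TextIteratorStreamer;
    # here two fixed chunks stand in to show the SSE event shape.
    for piece in ("北京", "。"):
        yield '{"choices": [{"delta": {"content": "' + piece + '"}}]}'
    yield "[DONE]"

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
    if request.stream:
        # Each yielded string is sent to the client as one "data: ..." SSE event.
        return EventSourceResponse(stream_predict(request))
    return {"choices": [{"message": {"role": "assistant", "content": "北京。"}}]}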

@GoGoJoestar (Collaborator)

I tested on a P40 GPU, which has lower compute capability than an A100. When generating, it returns several empty ("content": "") responses, then stops and waits before producing a long chunk consisting of many tokens. If I input an English instruction, it returns tokens more frequently.

My shell command:

python scripts/openai_server_demo/openai_api_server.py --base_model [model_path] --gpus 0

Input and output:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "告诉我中国的首都在哪里"}
    ],
    "repetition_penalty": 1.0,
    "stream": true
  }'

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "Tell me where is the capital of China."}
    ],
    "repetition_penalty": 1.0,
    "stream": true
  }'

# Here I only paste the content fields
"content": ""
"content": ""
"content": ""
"content": ""
"content": ""
"content": ""
"content": "中国的首都是北京。"

"content": ""
"content": "The"
"content": "capital "
"content": "of "
"content": "China "
"content": "is "
"content": ""
"content": ""
"content": ""
"content": ""
"content": "Beijing."

The count of "content": "" responses matches the token count of the text that follows them when tokenized (excluding the first "content": "" returned for each request).

I guess the TextIteratorStreamer only returns text once the accumulated tokens can merge into a complete word, which leads to discontinuous text generation. However, generation on your device works correctly. Can you check whether something is wrong with my usage?

@lealaxy (Contributor, Author) commented Aug 8, 2023

You are right: TextIteratorStreamer avoids emitting incomplete words.

The issue you encountered occurs in transformers<4.29.0 and was fixed in #22664.

In the latest version of transformers, TextIteratorStreamer has been fixed to handle Chinese text, allowing for smooth retrieval of Chinese tokens.
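
Roughly, the streamer's flush heuristic works like this (a simplified sketch of the idea, not the library's exact code):

def _is_cjk(codepoint):
    # CJK Unified Ideographs plus common extension/compatibility ranges
    return (0x4E00 <= codepoint <= 0x9FFF or
            0x3400 <= codepoint <= 0x4DBF or
            0xF900 <= codepoint <= 0xFAFF)

def should_flush(text):
    # Older versions only flushed on a newline or a trailing space, so Chinese
    # text (which has neither) stayed buffered until generation finished.
    # Newer versions also flush when the last character is a CJK character,
    # since CJK words are not space-delimited.
    if not text:
        return False
    return text.endswith("\n") or text.endswith(" ") or _is_cjk(ord(text[-1]))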

@GoGoJoestar (Collaborator)

That's the reason. After updating transformers, everything is OK!

@ymcui merged commit 7b19c67 into ymcui:main on Aug 9, 2023
1 check passed