# Streaming chat completions - OpenAI API

### Overview

Streaming chat completions from the OpenAI API lets you start receiving output sooner. Here's how it works:

1. By default, the API generates the full completion before sending anything back. For long completions, this can take several seconds.
2. To receive output sooner, enable streaming by passing `stream=True` when calling the chat completions API.
3. Instead of one response, the API then returns a stream of chunks (server-sent events) as the completion is generated.
4. In each chunk, `choices[0]` holds a `delta` field with the newly generated piece of the message. Read the text from `delta` instead of `message`.
5. Your code can process these chunks incrementally instead of waiting for the full completion.
6. This lets you print or display parts of the response before generation finishes.

In summary, streaming does not shorten the total generation time, but it lets you show the first tokens almost immediately instead of waiting for the whole completion. Enable it with `stream=True`, then read the text from the `delta` field of each chunk.
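The extraction step can be tried without calling the API. The snippet below is a minimal sketch that assumes each chunk is a plain dict shaped like the raw chunks printed in section 2 (`choices[0]["delta"]`); the legacy client's dict-like chunk objects should accept the same access. `collect_stream_text` is a hypothetical helper, not part of the `openai` library.

```python
from typing import Any, Iterable, Mapping


def collect_stream_text(chunks: Iterable[Mapping[str, Any]]) -> str:
    """Concatenate the `content` pieces carried by each chunk's delta."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        # The first delta may carry only the role and the last one is empty,
        # so fall back to "" when `content` is missing or None.
        parts.append(delta.get("content") or "")
    return "".join(parts)


# Simulated stream mirroring the three chunks of the "capital of France" example.
fake_chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "Paris"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]
print(collect_stream_text(fake_chunks))  # Paris
```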

### Pros and Cons

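The main trade-offs of streaming chat completions:

| Pros | Cons |
|-|-|
| Lower latency: the first tokens arrive long before the full completion | Harder to moderate content, since partial output is shown before the full completion can be checked |
| A more responsive UX | More complex code to handle the streamed chunks |
| Especially helpful for long completions | Incomplete output if the stream is interrupted or truncated |
| | The streamed chunks shown in this notebook carry no `usage` field, unlike the full response |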
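One way to guard against truncation is to check whether any chunk reported a `finish_reason`: in the raw chunks printed in section 2, only the final chunk sets it (to `"stop"`). The sketch below assumes plain-dict chunks of that shape; `read_stream` and `IncompleteStreamError` are hypothetical names, and treating a missing `finish_reason` as a cut-off stream is an assumption rather than documented API behavior.

```python
from typing import Any, Iterable, List, Mapping, Optional, Tuple


class IncompleteStreamError(RuntimeError):
    """Raised when a stream ends before any chunk reports a finish_reason."""


def read_stream(chunks: Iterable[Mapping[str, Any]]) -> Tuple[str, str]:
    """Return (text, finish_reason), raising if the stream looks truncated."""
    parts: List[str] = []
    finish_reason: Optional[str] = None
    for chunk in chunks:
        choice = chunk["choices"][0]
        parts.append(choice["delta"].get("content") or "")
        if choice.get("finish_reason") is not None:
            finish_reason = choice["finish_reason"]
    if finish_reason is None:
        # No chunk said why generation stopped, so assume the stream was cut short.
        raise IncompleteStreamError(f"stream ended early after {''.join(parts)!r}")
    return "".join(parts), finish_reason


complete_stream = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "Paris"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]
print(read_stream(complete_stream))  # ('Paris', 'stop')

try:
    read_stream(complete_stream[:-1])  # drop the final chunk to simulate a cut
except IncompleteStreamError as err:
    print("Truncated:", err)
```

A `finish_reason` of `"length"` means the completion hit the `max_tokens` limit, which is another kind of truncation a caller may want to handle.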

### Setup

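Install the dependencies and import the necessary modules. The code in this notebook uses the legacy `openai.ChatCompletion` interface, which was removed in version 1.0 of the `openai` Python package, so install a release older than 1.0.

In [1]:
!pip install "openai<1.0"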



In [2]:
import time

import openai  # the legacy (<1.0) client reads the API key from the OPENAI_API_KEY environment variable

### 1. Normal Chat Completion

In [3]:
start_time = time.time()

# send a ChatCompletion request to list all prime numbers up to 200
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': 'List all prime numbers up to 200, with a comma between each number and no newlines. E.g., 2, 3, 5, ...'}
    ],
    temperature=0,
)

# calculate the time taken
response_time = time.time() - start_time

# print the time delay and text received
print(f"Full response received {response_time:.2f} seconds after request")
print(f"Full response received:\n{response}")

Full response received 2.99 seconds after request
Full response received:
{
  "id": "chatcmpl-85fA3cR2hgF6gZknGPlxq9jKxvC8X",
  "object": "chat.completion",
  "created": 1696360555,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 40,
    "completion_tokens": 136,
    "total_tokens": 176
  }
}


In [4]:
response["choices"][0]["message"]["content"]

'2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199'
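The cell above reports only when the complete response arrived. With streaming, the more telling number is how long it takes for the first piece of text to arrive. The sketch below measures both on a simulated stream: `fake_stream` is a hypothetical stand-in for the API that yields plain-dict chunks after an artificial `time.sleep` delay, so its timings say nothing about real API latency, and `measure_stream` is likewise a hypothetical helper.

```python
import time
from typing import Any, Dict, Iterable, Iterator, Mapping, Optional


def fake_stream(fragments: Iterable[str], delay: float = 0.01) -> Iterator[Dict[str, Any]]:
    """Yield plain-dict chunks, sleeping before each one to mimic generation."""
    for fragment in fragments:
        time.sleep(delay)
        yield {"choices": [{"index": 0, "delta": {"content": fragment}, "finish_reason": None}]}
    yield {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}


def measure_stream(chunks: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Return the text, the time to the first non-empty fragment, and the total time."""
    start = time.perf_counter()
    first_content_s: Optional[float] = None
    parts = []
    for chunk in chunks:
        piece = chunk["choices"][0]["delta"].get("content") or ""
        if piece and first_content_s is None:
            first_content_s = time.perf_counter() - start
        parts.append(piece)
    total_s = time.perf_counter() - start
    return {"text": "".join(parts), "first_content_s": first_content_s, "total_s": total_s}


# The generator is lazy, so no sleeping happens until measure_stream iterates it.
stats = measure_stream(fake_stream(["2", ", ", "3", ", ", "5"]))
print(f"First text after {stats['first_content_s']:.2f}s, full text after {stats['total_s']:.2f}s")
```

With the real client, the clock should start before calling `openai.ChatCompletion.create(...)`, as the cells in this notebook do, because the request itself takes time before the first chunk is available.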

### 2. Streamed Chat Completion

First, a short request that prints every raw chunk of the stream:

In [5]:
response_stream = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': 'Whats the capital of France? Answer in one word.'}
    ],
    temperature=0,
    stream=True # by default, this is False
)

for chunk in response_stream:
    print(chunk)

{
  "id": "chatcmpl-85fA7UAHPHdM4lYgmfjlrVBMr4lx5",
  "object": "chat.completion.chunk",
  "created": 1696360559,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant",
        "content": ""
      },
      "finish_reason": null
    }
  ]
}
{
  "id": "chatcmpl-85fA7UAHPHdM4lYgmfjlrVBMr4lx5",
  "object": "chat.completion.chunk",
  "created": 1696360559,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "Paris"
      },
      "finish_reason": null
    }
  ]
}
{
  "id": "chatcmpl-85fA7UAHPHdM4lYgmfjlrVBMr4lx5",
  "object": "chat.completion.chunk",
  "created": 1696360559,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "delta": {},
      "finish_reason": "stop"
    }
  ]
}
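The raw chunks above show the pattern: the first delta carries the `role`, later deltas carry `content` fragments, and the last delta is empty with `finish_reason` set. The sketch below rebuilds a complete message from such deltas, keyed by the choice `index` so it would also cope with requests for several choices. It assumes plain-dict chunks; `merge_deltas` is a hypothetical helper, not part of the `openai` library.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def merge_deltas(chunks: Iterable[Mapping[str, Any]]) -> Dict[int, Dict[str, str]]:
    """Rebuild one {"role", "content"} message per choice index."""
    messages: Dict[int, Dict[str, str]] = defaultdict(lambda: {"role": "", "content": ""})
    for chunk in chunks:
        for choice in chunk["choices"]:
            message = messages[choice["index"]]
            delta = choice["delta"]
            if "role" in delta:
                # The role arrives once, in the first delta of each choice.
                message["role"] = delta["role"]
            # Content fragments are appended in arrival order.
            message["content"] += delta.get("content") or ""
    return dict(messages)


paris_chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "Paris"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]
print(merge_deltas(paris_chunks))  # {0: {'role': 'assistant', 'content': 'Paris'}}
```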


Now let's replicate the initial prime-number example, merging the streamed chunks into a single string as they arrive:

In [10]:
start_time = time.time()

# send a ChatCompletion request to list all prime numbers up to 200
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': 'List all prime numbers up to 200, with a comma between each number and no newlines. E.g., 2, 3, 5, ...'}
    ],
    temperature=0,
    stream=True
)

result_string = ""
for chunk in response:
    # the first and last deltas may carry no text, so default to ""
    delta_string = chunk["choices"][0]["delta"].get("content", "")
    print(delta_string)
    result_string += delta_string


2
,
 
3
,
 
5
,
 
7
,
 
11
,
 
13
,
 
17
,
 
19
,
 
23
,
 
29
,
 
31
,
 
37
,
 
41
,
 
43
,
 
47
,
 
53
,
 
59
,
 
61
,
 
67
,
 
71
,
 
73
,
 
79
,
 
83
,
 
89
,
 
97
,
 
101
,
 
103
,
 
107
,
 
109
,
 
113
,
 
127
,
 
131
,
 
137
,
 
139
,
 
149
,
 
151
,
 
157
,
 
163
,
 
167
,
 
173
,
 
179
,
 
181
,
 
191
,
 
193
,
 
197
,
 
199



In [11]:
print(result_string)

2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199
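Printing each fragment with a bare `print`, as in the streamed cell, puts every chunk on its own line. To render the text as one continuous line while it streams, each fragment can be written without a newline and flushed immediately; `print(piece, end="", flush=True)` does the same in one call. A minimal sketch, assuming plain-dict chunks; `render_stream` is a hypothetical helper, and its `out` parameter exists only so the output can be redirected (here to `io.StringIO`).

```python
import io
import sys
from typing import Any, Iterable, Mapping, Optional, TextIO


def render_stream(chunks: Iterable[Mapping[str, Any]], out: Optional[TextIO] = None) -> str:
    """Write each content fragment as soon as it arrives; return the full text."""
    if out is None:
        out = sys.stdout  # resolved at call time so a redirected stdout is honored
    pieces = []
    for chunk in chunks:
        piece = chunk["choices"][0]["delta"].get("content") or ""
        out.write(piece)  # no newline, so fragments join up on one line
        out.flush()       # show the fragment now instead of waiting for a buffer to fill
        pieces.append(piece)
    out.write("\n")
    return "".join(pieces)


fragments = ["2", ",", " ", "3", ",", " ", "5"]
fake_chunks = [
    {"choices": [{"index": 0, "delta": {"content": f}, "finish_reason": None}]}
    for f in fragments
]
buffer = io.StringIO()
text = render_stream(fake_chunks, out=buffer)
print(repr(buffer.getvalue()))  # '2, 3, 5\n'
```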


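### 3. Streamed Chat Completion - Paragraph-by-Paragraph

Instead of printing every chunk, this example accumulates chunks in a buffer and flushes it whenever it contains a newline, so each paragraph is printed as a whole.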

In [31]:
start_time = time.time()

# send a streaming ChatCompletion request for the opening paragraphs of Oliver Twist
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': 'Return the first 3 paragraphs of oliver twist.'}
    ],
    temperature=0,
    stream=True
)

result_string = ""  # text received since the last flush
result_array = []   # completed paragraphs
for chunk in response:
    delta = chunk["choices"][0]["delta"]
    result_string += delta.get("content", "")
    # flush when the buffer contains a newline, or on the final (empty) delta
    if "\n" in result_string or "content" not in delta:
        result_array.append(result_string)
        print(result_string)
        result_string = ""

# flush any leftover text in case the stream ends without an empty final delta
if result_string:
    result_array.append(result_string)
    print(result_string)
    result_string = ""

Among other public buildings in a certain town, which for many reasons it will be prudent to refrain from mentioning, and to which I will assign no fictitious name, there is one anciently common to most towns, great or small: to wit, a workhouse; and in this workhouse was born; on a day and date which I need not trouble myself to repeat, inasmuch as it can be of no possible consequence to the reader, in this stage of the business at all events; the item of mortality whose name is prefixed to the head of this chapter.


For a long time after it was ushered into this world of sorrow and trouble, by the parish surgeon, it remained a matter of considerable doubt whether the child would survive to bear any name at all; in which case it is somewhat more than probable that these memoirs would never have appeared; or, if they had, that being comprised within a couple of pages, they would have possessed the inestimable merit of being the most concise and faithful specimen of biography, extant i
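The loop above flushes the whole buffer as soon as it contains a newline, so any text that arrives in the same chunk after the newline stays attached to the previous paragraph. The sketch below instead splits the buffer on newlines, emits only the complete lines, keeps the tail buffered, and flushes the remainder when the stream ends. It assumes plain-dict chunks and paragraphs that contain no internal line breaks (as in the output above); `stream_paragraphs` is a hypothetical generator, not part of the `openai` library.

```python
from typing import Any, Iterable, Iterator, Mapping


def stream_paragraphs(chunks: Iterable[Mapping[str, Any]]) -> Iterator[str]:
    """Yield each non-empty line of the streamed text as soon as it is complete."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk["choices"][0]["delta"].get("content") or ""
        # Everything before the last newline is complete; keep the tail buffered.
        *complete, buffer = buffer.split("\n")
        for line in complete:
            if line.strip():  # skip the empty strings produced by blank lines
                yield line.strip()
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()


fragments = ["First para", "graph.\n", "\nSecond", " paragraph.\n\nThird", " one"]
fake_chunks = [
    {"choices": [{"index": 0, "delta": {"content": f}, "finish_reason": None}]}
    for f in fragments
]
fake_chunks.append({"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]})
print(list(stream_paragraphs(fake_chunks)))
# ['First paragraph.', 'Second paragraph.', 'Third one']
```

Because it is a generator, each paragraph can be displayed the moment its closing newline arrives, while the caller decides whether to collect the results into a list.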

In [32]:
result_string

''

In [33]:
result_array

['Among other public buildings in a certain town, which for many reasons it will be prudent to refrain from mentioning, and to which I will assign no fictitious name, there is one anciently common to most towns, great or small: to wit, a workhouse; and in this workhouse was born; on a day and date which I need not trouble myself to repeat, inasmuch as it can be of no possible consequence to the reader, in this stage of the business at all events; the item of mortality whose name is prefixed to the head of this chapter.\n\n',
 'For a long time after it was ushered into this world of sorrow and trouble, by the parish surgeon, it remained a matter of considerable doubt whether the child would survive to bear any name at all; in which case it is somewhat more than probable that these memoirs would never have appeared; or, if they had, that being comprised within a couple of pages, they would have possessed the inestimable merit of being the most concise and faithful specimen of biography, 