Autocomplete: Use streaming to early-terminate Anthropic requests #723
Conversation
    data: string
}

async function* createSSEDecoder(iterator: AsyncIterableIterator<BufferSource>): AsyncGenerator<SSEMessage> {
Unimportant information: I think this is the first time I’m using an async generator function 😅
Going advanced!
Took some time to understand the details of the new streaming logic. Looks great, and it should significantly reduce the latency for multiline autocomplete! 💪
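For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of an SSE decoder written as an async generator. This is illustrative only: the SSEMessage shape and parsing details are assumptions, not the PR's actual code.

```typescript
// Illustrative sketch: decode a byte stream into server-sent events
// using an async generator. The SSEMessage shape is an assumption.
interface SSEMessage {
    event: string
    data: string
}

async function* createSSEDecoder(
    input: AsyncIterable<Uint8Array>
): AsyncGenerator<SSEMessage> {
    const decoder = new TextDecoder()
    let buffer = ''
    for await (const chunk of input) {
        buffer += decoder.decode(chunk, { stream: true })
        // SSE messages are separated by a blank line ('\n\n').
        let index: number
        while ((index = buffer.indexOf('\n\n')) >= 0) {
            const raw = buffer.slice(0, index)
            buffer = buffer.slice(index + 2)
            const message: SSEMessage = { event: '', data: '' }
            for (const line of raw.split('\n')) {
                if (line.startsWith('event: ')) {
                    message.event = line.slice('event: '.length)
                } else if (line.startsWith('data: ')) {
                    message.data = line.slice('data: '.length)
                }
            }
            yield message
        }
    }
}
```

Because it is an async generator, the caller can simply `for await` over the messages and `break` as soon as a stop condition is met, which is what enables the early termination.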
if (isStreamingResponse) {
    try {
        const iterator = createSSEDecoder(response.body as any)
Based on a couple of discussions, it looks like we can do the following to avoid any casting:
import type { ReadableStream } from 'node:stream/web'
const iterator = createSSEDecoder(response.body as ReadableStream<Uint8Array>)
async function* createSSEDecoder(iterator: ReadableStream<Uint8Array>): AsyncGenerator<SSEMessage> {}
Oh nice! I also think I need to add some feature checking here. I just thought of this: if Node 18 has native fetch, does that mean we'll get a proper web stream here instead of a Node stream, and would that mean that e.g. in the Agent or in a future VS Code release we'll have issues?
This polyfill is such a mess haha
@valerybugakov I think there's some confusion here: the web ReadableStream is not available to us in Node 16 (VS Code), and the node-fetch polyfill returns a Node stream instead of a compatible ReadableStream (https://github.com/node-fetch/node-fetch#streams). That's why I went with the iterator pattern, which is available on Node streams but not on native web streams.
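For illustration, the iterator pattern works because Node's Readable implements Symbol.asyncIterator, so node-fetch response bodies can be consumed with for await. The helper below is hypothetical, not code from the PR:

```typescript
import { Readable } from 'node:stream'

// Node's Readable implements Symbol.asyncIterator, so node-fetch response
// bodies can be consumed with `for await`. Native web ReadableStream did
// not support this in Node 16, hence the gating. Hypothetical helper:
async function collectChunks(stream: AsyncIterable<Buffer>): Promise<Buffer> {
    const chunks: Buffer[] = []
    for await (const chunk of stream) {
        chunks.push(chunk)
    }
    return Buffer.concat(chunks)
}
```

For example, collectChunks(Readable.from([Buffer.from('hello')])) resolves to a Buffer containing 'hello'.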
The type cast is necessary because our environment thinks that fetch is a standards-compliant implementation, which it is not. That's also why I'm gating this feature to only work on Node for now.
I changed the code to not rely on the global fetch but instead get fetch from isomorphic-fetch. In Node environments, this will now always resolve to node-fetch and the incompatible buffer polyfill, so we only have to handle this case for now. (I was worried that not monkey-patching global.fetch could lead to issues in newer Node.js versions, since they have added proper fetch support in the meantime. But I tested with the completions generation script, using the local Node 20, and it worked fine.)
I'll add a comment
let index: number
while ((index = buffer.indexOf('\n\n')) >= 0) {
    const message = buffer.slice(0, index)
    buffer = buffer.slice(index + 2)
Why do we add two here? 🤔
We check for two newlines and then slice them off. I can make a const SSE_BREAK and use SSE_BREAK.length so it's clearer :)
Nah, I figured it's related to newlines, but my brain processed them as '\n\n'.length === 4, so I was confused 😛
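The suggested refactor could look roughly like this (SSE_BREAK is the hypothetical constant from the comment above, and splitSSEMessages is an illustrative helper, not code from the PR):

```typescript
// Hypothetical refactor: name the SSE message separator so the `+ 2`
// offset becomes self-explanatory.
const SSE_BREAK = '\n\n'

// Split complete SSE messages off the front of the buffer; whatever
// remains is an incomplete message still being streamed.
function splitSSEMessages(input: string): { messages: string[]; rest: string } {
    const messages: string[] = []
    let buffer = input
    let index: number
    while ((index = buffer.indexOf(SSE_BREAK)) >= 0) {
        messages.push(buffer.slice(0, index))
        buffer = buffer.slice(index + SSE_BREAK.length)
    }
    return { messages, rest: buffer }
}
```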
// eslint-disable-next-line @typescript-eslint/no-misused-promises, no-async-promise-executor
return new Promise(async (resolve, reject) => {
    try {
        const abortController = forkSignal(abortSignal)
We already have a network controller one level higher in the call stack:
cody/vscode/src/completions/request-manager.ts
Lines 59 to 62 in 8ca07b9
// We forward a different abort controller to the network request so we
// can cancel the network request independently of the user cancelling
// the completion.
const networkRequestAbortController = new AbortController()
Would it make sense to pass it down the call tree and rely on networkRequestAbortController.abort() to reduce the number of entities instead of forking it here?
Yeah, I thought about this. The issue is that for the request manager there is just one multi-line request with n results, whereas for this logic we only want to cancel one of the n requests, hence the forking. If one request resolves early, the others might still need to sample a few more tokens.
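For context, a forkSignal helper along these lines would give each of the n requests its own controller that still follows the parent signal. This is a sketch under assumptions; the repo's actual implementation may differ:

```typescript
// Hypothetical sketch: derive a child AbortController that aborts when the
// parent signal aborts, but can also be aborted independently (e.g. to
// cancel just one of n parallel requests without touching the others).
function forkSignal(signal: AbortSignal): AbortController {
    const controller = new AbortController()
    if (signal.aborted) {
        controller.abort()
    } else {
        signal.addEventListener('abort', () => controller.abort(), { once: true })
    }
    return controller
}
```

Aborting the child leaves the parent (and any sibling forks) untouched, while aborting the parent cancels every fork.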
If the other requests are helpful, they will be accessible via the UI — that makes sense. Let's add your comment to sources then 👍
Yep. There's also a follow-up consideration to not wait for all three results for the UI (where we only show one). That was a bit out of scope for this PR, though; I'll make sure to add a comment.
/**
 * Evaluates a partial completion response and returns true when we can already use it. This is
 * used to terminate streaming responses where we get token-by-token access to the result and
 * want to terminate as soon as stop conditions are triggered.
 *
 * Right now this handles two cases:
 * 1. When a single-line completion is requested, it terminates after the first full line was
 *    received.
 * 2. For a multi-line completion, it terminates when the completion will be truncated based on the
 *    multi-line indentation logic or an eventual match with a line already in the editor.
 */
export function canUsePartialCompletion(
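For the single-line case (case 1 above), the check boils down to something like this illustrative helper. It is not the PR's actual code, which also implements the multi-line truncation logic:

```typescript
// Illustrative only: a single-line partial completion is usable as soon
// as it contains one full line, i.e. as soon as a newline arrives.
function canUseSingleLinePartial(partialResponse: string): boolean {
    return partialResponse.indexOf('\n') >= 0
}
```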
Comments 💜
Co-authored-by: Valery Bugakov <skymk1@gmail.com>
…cegraph/cody into ps/autocomplete-anthropic-streaming
const isFeatureFlagEnabled = this.featureFlagProvider
    ? await this.featureFlagProvider.evaluateFeatureFlag(FeatureFlag.CodyAutocompleteStreamingResponse)
    : false
I validated that this works with some console.log calls. We still have to create the test on dotcom, but we'll have to wait for the server to support streaming anyway (that PR is already merged) cc @taras-yemets
This adds backward-compatible support for using the new streaming backend added in sourcegraph/sourcegraph#55967.
It currently only works in Node.js because the node-fetch polyfill we use returns native Node streams and not unified browser streams. However, given that browser support is not official right now, I decided that (especially until we know this experiment is giving us wins) it's not worth adding a separate implementation.
This all works via the new stream bool option on the server. When it is set and the server has the changes in sourcegraph/sourcegraph#55967, the server returns an SSE event stream. On the client, we can check the header values for whether we got a streaming response and act on that.
Architecture-wise this is not super nice yet and duplicates some of the post-processing code to find out if the request is already finished. There are three conditions to detect whether a streaming response is "fulfilled" right now:
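The header check mentioned above could be as simple as inspecting the Content-Type header, since SSE streams use text/event-stream. This is an illustrative helper; the PR's actual detection may differ:

```typescript
// Illustrative sketch: detect an SSE response by its Content-Type header,
// falling back to the non-streaming JSON path otherwise.
function isStreamingResponse(contentType: string | null): boolean {
    return contentType !== null && contentType.startsWith('text/event-stream')
}
```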
Test plan
Tested in combination with sourcegraph/sourcegraph#55967 and sourcegraph.com (which is behind this change)