Autocomplete: Use streaming to early-terminate Anthropic requests #723

Merged: 15 commits into main from ps/autocomplete-anthropic-streaming, Aug 18, 2023

Conversation

philipp-spiess (Contributor) commented Aug 17, 2023

This adds backward-compatible support for using the new streaming backend added in sourcegraph/sourcegraph#55967.

It currently only works in Node.js because the node-fetch polyfill we use returns native Node streams and not unified browser streams. However, given that browser support is not official right now, I decided that for now (especially until we know this experiment is giving us wins) it's not worth adding a separate implementation.

This all works via the new stream boolean option on the request. When it is set and the server has the changes from sourcegraph/sourcegraph#55967, the server returns an SSE event stream. On the client, we can check the response headers to see whether we got a streaming response and handle it accordingly.
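As a rough, hypothetical sketch (the function name, endpoint, and params shape are stand-ins, not code from this PR), the client side of that handshake could look like this:

// Ask for the streaming variant and report whether the server honored it by
// inspecting the response headers.
async function requestStreamingCompletion(
    completionsEndpoint: string,
    params: Record<string, unknown>
): Promise<{ response: Response; isStreamingResponse: boolean }> {
    const response = await fetch(completionsEndpoint, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ ...params, stream: true }),
    })
    // Older servers ignore the stream flag and reply with a plain JSON body,
    // so callers fall back to the non-streaming code path in that case.
    const isStreamingResponse =
        response.headers.get('content-type')?.startsWith('text/event-stream') ?? false
    return { response, isStreamingResponse }
}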

Architecture-wise this is not super nice yet and duplicates some of the post-processing code to find out whether the request is already finished. There are three conditions right now for detecting whether a streaming response is "fulfilled" (a rough sketch follows the list):

  1. A single-line completion can be terminated once we have one non-empty, full line of response.
  2. A multi-line completion can be terminated once the multi-line truncation logic (which currently uses indentation but might later use tree-sitter) starts to truncate the answer.
  3. A multi-line completion can be terminated if a line inside the new completion matches one that is already inside the document.
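A loose sketch of those three stop conditions (not the code in this PR; the real logic lives in the shared post-processing, and the truncation helper is passed in here only to keep the example self-contained):

// Returns true once a partially streamed completion is already usable.
function isStreamedCompletionFulfilled(
    partial: string,
    isMultiline: boolean,
    documentLines: string[],
    // Stand-in for the indentation-based multi-line truncation logic.
    truncate: (text: string) => string
): boolean {
    const lines = partial.split('\n')

    if (!isMultiline) {
        // 1. Single-line: one complete, non-empty line has arrived.
        return lines.length > 1 && lines[0].trim() !== ''
    }

    // 2. Multi-line: the truncation logic already cuts the completion short.
    if (truncate(partial).length < partial.length) {
        return true
    }

    // 3. Multi-line: a fully received line matches one already in the document.
    return lines.slice(0, -1).some(line => line.trim() !== '' && documentLines.includes(line))
}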

Test plan

Tested in combination with sourcegraph/sourcegraph#55967 and sourcegraph.com (which is behind this change)

@philipp-spiess philipp-spiess requested review from valerybugakov and a team August 17, 2023 14:00
@philipp-spiess philipp-spiess self-assigned this Aug 17, 2023
@philipp-spiess philipp-spiess marked this pull request as ready for review August 17, 2023 14:00
data: string
}

async function* createSSEDecoder(iterator: AsyncIterableIterator<BufferSource>): AsyncGenerator<SSEMessage> {
Contributor Author
Unimportant information: I think this is the first time I’m using an async generator function 😅

Member
Going advanced!

Member
@valerybugakov valerybugakov left a comment

Took some time to understand the details of the new streaming logic. Looks great, and it should significantly reduce the latency for multiline autocomplete! 💪


if (isStreamingResponse) {
try {
const iterator = createSSEDecoder(response.body as any)
Member
Based on a couple of related discussions, it looks like we can do the following to avoid any casting:

import type { ReadableStream } from 'node:stream/web'

const iterator = createSSEDecoder(response.body as ReadableStream<Uint8Array>)
async function* createSSEDecoder(iterator: ReadableStream<Uint8Array>): AsyncGenerator<SSEMessage> {}

Contributor Author

Oh nice! I also think I need to add some feature checking here. I just thought of this: if Node 18 has native fetch, does that mean we'll get a proper web stream here instead of a Node stream, and would that mean that e.g. in Agent or in a future VS Code release we'll have issues?

This polyfill is such a mess haha
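For what it's worth, one way such a feature check could look (a hypothetical sketch, not code from this PR) is to gate the streaming path on whether the body is actually async-iterable:

// Node streams implement Symbol.asyncIterator; web ReadableStreams only do so
// in some runtimes, so checking for the protocol avoids depending on which
// fetch implementation is in play.
function isAsyncIterable(body: unknown): body is AsyncIterable<Uint8Array> {
    return typeof body === 'object' && body !== null && Symbol.asyncIterator in body
}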

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@valerybugakov I think there's some confusion here: the web ReadableStream is not available to us in Node 16 (VS Code), and the node-fetch polyfill returns a Node stream instead of a compatible ReadableStream (https://github.com/node-fetch/node-fetch#streams). That's why I went with the iterator pattern, which apparently is available on Node streams but not on native web streams.

The type cast is necessary because our type environment assumes that fetch is a standards-compliant implementation, which it is not. That's also why I'm gating this feature to only work in Node for now.

I changed the code to not rely on the global fetch but to get fetch from isomorphic-fetch instead. In Node environments, this will now always resolve to node-fetch and its incompatible buffer polyfill, so we only have to handle this one case for now. (I was worried that not monkey-patching global.fetch could lead to issues in newer Node.js versions, since they have added proper fetch support in the meantime, but I tested with the completions generation script using the local Node 20 and it worked fine.)

I'll add a comment.
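As background (a sketch under the assumption that node-fetch's response.body is a Node Readable, as its docs describe; not code from this PR): Node streams are async-iterable, which is what the decoder relies on.

import { Readable } from 'node:stream'

// Consume a Node Readable chunk by chunk; Node streams implement
// Symbol.asyncIterator, so `for await` works directly on the body.
async function* chunksOf(body: Readable): AsyncGenerator<string> {
    const decoder = new TextDecoder()
    for await (const chunk of body) {
        yield typeof chunk === 'string' ? chunk : decoder.decode(chunk as Uint8Array, { stream: true })
    }
}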

let index: number
while ((index = buffer.indexOf('\n\n')) >= 0) {
const message = buffer.slice(0, index)
buffer = buffer.slice(index + 2)
Member
Why do we add two here? 🤔

Contributor Author

We check for two newlines and then slice past them. I can add a const SSE_BREAK and use SSE_BREAK.length so it's clearer :)

Member

Nah, I figured it's related to new lines, but my brain processed them as '\n\n'.length === 4, so I was confused 😛
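For reference, a minimal version of the decoder with that SSE_BREAK constant might look like this (a sketch rather than the exact code in the PR; the event field on SSEMessage is an assumption):

interface SSEMessage {
    event: string
    data: string
}

// SSE messages are separated by a blank line, i.e. two consecutive newlines.
const SSE_BREAK = '\n\n'

async function* createSSEDecoder(iterator: AsyncIterable<BufferSource>): AsyncGenerator<SSEMessage> {
    const decoder = new TextDecoder()
    let buffer = ''
    for await (const chunk of iterator) {
        buffer += decoder.decode(chunk, { stream: true })

        let index: number
        while ((index = buffer.indexOf(SSE_BREAK)) >= 0) {
            const message = buffer.slice(0, index)
            buffer = buffer.slice(index + SSE_BREAK.length)

            // Each message is a set of `field: value` lines; we only care
            // about the event and data fields here.
            let event = ''
            let data = ''
            for (const line of message.split('\n')) {
                if (line.startsWith('event: ')) {
                    event = line.slice('event: '.length)
                } else if (line.startsWith('data: ')) {
                    data = line.slice('data: '.length)
                }
            }
            yield { event, data }
        }
    }
}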

vscode/src/completions/providers/anthropic.ts (resolved)
// eslint-disable-next-line @typescript-eslint/no-misused-promises, no-async-promise-executor
return new Promise(async (resolve, reject) => {
try {
const abortController = forkSignal(abortSignal)
Member
We already have a network controller one level higher in the call stack:

// We forward a different abort controller to the network request so we
// can cancel the network request independently of the user cancelling
// the completion.
const networkRequestAbortController = new AbortController()

Would it make sense to pass it down the call tree and rely on networkRequestAbortController.abort() to reduce the number of entities instead of forking it here?

Contributor Author

Yeah, I thought about this. The issue is that for the request manager there is just one multi-line request with n results, whereas for this logic we only want to cancel one of the n requests, hence the forking. If one request resolves early, the others might still need to sample a few more tokens.
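For context, forkSignal roughly does the following (a sketch of the helper, not necessarily its exact source): it derives a child controller that aborts with the parent but can also be aborted on its own, which is what lets one of the n streamed requests terminate early without cancelling its siblings.

// Derive a child AbortController from a parent signal. Aborting the parent
// aborts the child, but the child can also be aborted independently
// (e.g. when its streamed completion is already fulfilled).
function forkSignal(signal: AbortSignal): AbortController {
    const controller = new AbortController()
    if (signal.aborted) {
        controller.abort()
    } else {
        signal.addEventListener('abort', () => controller.abort(), { once: true })
    }
    return controller
}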

Member

If the other requests are helpful, they will be accessible via the UI — that makes sense. Let's add your comment to sources then 👍

Contributor Author

Yep. There's also a follow-up consideration to not wait for all three results for the UI (where we only show one). That was a bit out of scope for this PR, but I'll make sure to add a comment.

Comment on lines +5 to +16
/**
* Evaluates a partial completion response and returns true when we can already use it. This is used
* to terminate any streaming responses where we can get a token-by-token access to the result and
* want to terminate as soon as stop conditions are triggered.
*
* Right now this handles two cases:
* 1. When a single line completion is requested, it terminates after the first full line was
* received.
* 2. For a multi-line completion, it terminates when the completion will be truncated based on the
* multi-line indentation logic or an eventual match with a line already in the editor.
*/
export function canUsePartialCompletion(
Member
Comments 💜

Comment on lines +81 to +83
const isFeatureFlagEnabled = this.featureFlagProvider
? await this.featureFlagProvider.evaluateFeatureFlag(FeatureFlag.CodyAutocompleteStreamingResponse)
: false
Contributor Author
I validated that this works with some console.log statements. We still have to create the test on dotcom, but we'll have to wait for the server to support streaming anyway (that PR is already merged). cc @taras-yemets

@philipp-spiess philipp-spiess merged commit f469aad into main Aug 18, 2023
9 checks passed
@philipp-spiess philipp-spiess deleted the ps/autocomplete-anthropic-streaming branch August 18, 2023 16:31