bug: Cody Gateway batches streaming requests #2180

Closed
philipp-spiess opened this issue Dec 7, 2023 · 1 comment
Labels: bug (Something isn't working), clients/vscode
@philipp-spiess (Contributor)

Version

latest

Describe the bug

Noticeable for autocomplete, but all streaming endpoints are affected.

Cody Gateway doesn't flush messages as soon as it receives them but instead seems to buffer (for a bit over a second) before it starts sending the first payload. This causes several issues:

  • Multiline completions are slower because of it (we always have to wait for the first n lines, even if we end up using only the first line).
  • The dynamic multiline experiment is blocked (it would need to remove the \n stop character, which leads to the same issue as the point above).
  • Chat requires workarounds like the "typewriter effect" to make it appear that we're typing word by word rather than multiple lines at once.
  • This impacts time-to-first-token for every streaming endpoint.

Expected behavior

The backend relays the first SSE chunk as soon as the inference provider sends it.
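
For illustration, here is a minimal sketch of what such a pass-through relay could look like. This is not the actual Cody Gateway code; the upstream URL, request shape, and port are hypothetical:

```ts
// Sketch of an SSE relay that forwards every upstream chunk immediately,
// with no intermediate buffering. Requires Node 18+ (global fetch).
import http from 'node:http'

const UPSTREAM_URL = 'https://inference-provider.example/v1/stream' // hypothetical

http.createServer(async (req, res) => {
    // Open the streaming request to the inference provider.
    const upstream = await fetch(UPSTREAM_URL, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify({ prompt: '...' }), // placeholder payload
    })

    res.writeHead(200, {
        'content-type': 'text/event-stream',
        'cache-control': 'no-cache',
    })

    // Forward each chunk to the client as soon as it arrives.
    for await (const chunk of upstream.body!) {
        res.write(chunk)
    }
    res.end()
}).listen(9992)
```

The key property is that the write to the client happens inside the same loop iteration that reads from the provider, so time-to-first-token is bounded by the provider, not by any gateway-side batching.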

Additional context

c.f. https://sourcegraph.slack.com/archives/C05AGQYD528/p1701791953787829

@philipp-spiess philipp-spiess added bug Something isn't working clients/vscode labels Dec 7, 2023
philipp-spiess added a commit that referenced this issue Dec 11, 2023
Closes #2205

This PR adds a new feature flag and experimental config option to enable
_hot streak mode_.

The idea is simple: when generating a completion (this is mostly for
single-line and works beautifully with dynamic-multiline), we let the LLM
continue generating beyond the current line and use the follow-up lines to
seed a cache.

Then, when a user accepts a completion and moves to the next line by
pressing enter, we can instantly show another completion (and thus avoid
another LLM request and the latency it incurs). The result is a UX where
monotonous, single-line completions appear a lot faster, and it allows you
to fall into a tab+enter, tab+enter, … rhythm in which you review a longer
completion line by line. It's also much faster than multiline completions
because we can show the first line as soon as it is ready.
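
A rough sketch of the idea, with hypothetical types and names (not the actual Cody implementation): the first generated line is shown as the completion, and each remaining line seeds a cache entry keyed by the position the cursor will reach after accept + enter.

```ts
// Hot-streak cache sketch: follow-up lines of one generation become instant
// completions for the lines below.
interface Position {
    line: number
    character: number
}

const hotStreakCache = new Map<string, string>()

function cacheKey(uri: string, position: Position): string {
    return `${uri}:${position.line}`
}

// Returns the line to show now; caches the rest for subsequent lines.
function seedHotStreakCache(uri: string, position: Position, generatedText: string): string {
    const [firstLine, ...followUpLines] = generatedText.split('\n')

    followUpLines.forEach((line, index) => {
        const nextPosition = { line: position.line + 1 + index, character: 0 }
        hotStreakCache.set(cacheKey(uri, nextPosition), line)
    })

    return firstLine
}

// Called when the cursor lands on a new line after accept + enter.
function getCachedCompletion(uri: string, position: Position): string | undefined {
    return hotStreakCache.get(cacheKey(uri, position))
}
```

Keying purely by cursor line keeps the lookup synchronous, so the accept-then-enter case can be answered without another network round trip.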

⚠️ This is currently limited by #2180

## Test plan

- Enable hot streak mode
- Observe that after accepting a completion, pressing enter will show
the continuation of that completion ✨ instantly ✨

https://github.com/sourcegraph/cody/assets/458591/5f70ab45-18a2-4a4c-8d02-61a45328df47
@philipp-spiess (Contributor, Author)

After the changes on the client and the backend, this indeed seems fixed now! Here's a screenshot of incoming SSE events and the timestamps (in ms) between them. Previously, a bunch of messages arrived at the same time. Now, there's a reasonable delay between messages.

(Screenshot: SSE event timings, 2023-12-11 at 19:47:56)
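
For reference, a minimal sketch (assumed endpoint and response shape) of how such inter-chunk timing can be captured on the client:

```ts
// Log the delay between incoming SSE chunks to check that the gateway is no
// longer batching them. Requires Node 18+ (global fetch, performance).
async function logChunkTimings(url: string): Promise<void> {
    const response = await fetch(url, { headers: { accept: 'text/event-stream' } })
    const reader = response.body!.getReader()
    let last = performance.now()

    while (true) {
        const { done } = await reader.read()
        if (done) break
        const now = performance.now()
        console.log(`chunk after ${Math.round(now - last)}ms`)
        last = now
    }
}
```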

Huge props to @rafax for fixing this! I'm really curious whether we'll see a drop in latency in our E2E metrics now 👀
