SSE connections leaked #468

Open
spikecurtis opened this issue Apr 3, 2025 · 9 comments
Labels: customer-reported, regression (Something that used to work and is now broken)

Comments

@spikecurtis

spikecurtis commented Apr 3, 2025

Since v1.4.2, we leak SSE connections to /api/v2/workspaceagents/{id}/watch-metadata in at least the following scenarios:

  • A workspace is restarted while the IDE plugin is tracking it
  • You sign into a different Coder deployment with some workspaces running

I have a strong suspicion that #440 is involved, although possibly not the root cause.

Likely the cause of https://github.com/coder/customers/issues/741

How to reproduce

  1. Connect to a Coder server as a user running at least one workspace.
  2. Monitor the coderd_api_concurrent_requests gauge via the Prometheus endpoint on the Coder server (a small polling sketch follows the steps below).

Either:

  1. Repeatedly stop and start the workspace
  2. Repeatedly sign into a different Coder deployment and then back into your test deployment

Then:

  1. Observe the concurrent requests gauge climb with each cycle.
  2. Exit VSCode and see it drop.
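
For the monitoring step, a minimal polling sketch (assuming Node 18+ for the global fetch; the metrics URL is a placeholder for wherever coderd's Prometheus endpoint is exposed):

// Placeholder URL -- point this at the server's Prometheus endpoint.
const METRICS_URL = "http://127.0.0.1:2112/metrics"

async function readGauge(): Promise<number | undefined> {
  const res = await fetch(METRICS_URL)
  const body = await res.text()
  // Prometheus text format: "coderd_api_concurrent_requests <value>"
  const line = body
    .split("\n")
    .find((l) => l.startsWith("coderd_api_concurrent_requests"))
  return line ? Number(line.trim().split(/\s+/).pop()) : undefined
}

// Poll every 5 seconds; a leak shows up as a value that climbs with each
// stop/start or re-login cycle and only drops when VSCode exits.
setInterval(async () => {
  console.log(new Date().toISOString(), await readGauge())
}, 5000)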
spikecurtis added the customer-reported and regression labels Apr 4, 2025
@BrunoQuaresma
Contributor

OK, looking at the metrics from Prometheus, it does confirm the extension is part of the issue, but it is quite hard to know where exactly the problem is since VSCode does not provide any way to debug extension requests. To get more info, I set up a proxy as suggested by this user in this issue, and from what I can tell, the requests look normal. The request is closed/stopped when a watched agent gets stopped, for example.

Talking with @bcpeinhardt, maybe it is better, for now, to undo the changes from #440 until we have a better idea of what is happening. Another option is to use the WebSocket endpoint instead of SSE and see if it fixes the issue. Thoughts?

cc: @spikecurtis

@BrunoQuaresma
Contributor

My guess is that Axios is not closing the connection with the server correctly, either because of its own internals or because of our custom streaming implementation. I'm starting to think the best move here is to switch to the WebSocket connection, as we do in the UI.
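
If we go that way, the sketch below is roughly what I have in mind, with the big caveat that the WebSocket route is a guess and would need to be confirmed against the endpoint the web UI actually uses:

import WebSocket from "ws"

// Placeholders for illustration only.
const agentId = "agent-uuid"
const token = process.env.CODER_SESSION_TOKEN ?? ""

// Hypothetical path -- confirm the real route coderd exposes for the UI.
const socket = new WebSocket(
  `wss://coder.example.com/api/v2/workspaceagents/${agentId}/watch-metadata-ws`,
  { headers: { "Coder-Session-Token": token } },
)

socket.on("message", (data) => {
  // Each message would carry the same metadata payload the SSE stream does.
  console.log(JSON.parse(data.toString()))
})
socket.on("error", (err) => console.error("watch-metadata socket error", err))

// Tearing down is a single explicit call, which makes it easier to reason
// about whether the server side actually releases the request.
const dispose = () => socket.close()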

@spikecurtis
Author

We need to understand the cause of the leak so that we can be sure we have fixed it. Ideally this means writing a unit test that reproduces the issue, then fixing it. If we don't understand the cause, we won't know whether swapping SSE for WebSocket truly fixes the issue or just makes it less likely.
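
As a starting point, here is a rough sketch of the kind of test I mean, using a plain Node HTTP server and the eventsource client directly; the plugin's custom fetch/Axios adapter would be slotted in via the fetch option to exercise the real code path:

import http from "node:http"
import assert from "node:assert/strict"
// Named export in eventsource v3; v2 exports the class as the default.
import { EventSource } from "eventsource"

// Minimal SSE endpoint that never ends on its own, mimicking watch-metadata.
const server = http.createServer((req, res) => {
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  })
  res.write("data: {}\n\n")
})

const connections = (): Promise<number> =>
  new Promise((resolve, reject) =>
    server.getConnections((err, count) => (err ? reject(err) : resolve(count))),
  )

server.listen(0, async () => {
  const { port } = server.address() as { port: number }

  // Simulate the plugin's lifecycle: open a watcher, then dispose of it.
  const source = new EventSource(`http://127.0.0.1:${port}/watch`)
  await new Promise<void>((resolve) => {
    source.onopen = () => resolve()
  })
  source.close()

  // Give the runtime a moment to tear the socket down.
  await new Promise((resolve) => setTimeout(resolve, 500))

  // A leak shows up here as a connection count that never returns to zero.
  assert.equal(await connections(), 0)
  server.close()
})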

@BrunoQuaresma
Contributor

I'm going to post all my investigation findings here so we can keep track of which investigation paths we have already tried.

@BrunoQuaresma
Contributor

I thought the problem could be the event source not being closed properly before removing the agent watcher, with VSCode keeping the connection open for some reason, but that is not happening. I can confirm the agent watchers close the event source properly before removing them.

(screenshot attached)

@BrunoQuaresma
Contributor

BrunoQuaresma commented Apr 10, 2025

I went through this part of the code:

vscode-coder/src/api.ts

Lines 112 to 131 in 0d97271

start(controller) {
  response.data.on("data", (chunk: Buffer) => {
    controller.enqueue(chunk)
  })
  response.data.on("end", () => {
    controller.close()
  })
  response.data.on("error", (err: Error) => {
    controller.error(err)
  })
},
cancel() {
  response.data.destroy()
  return Promise.resolve()
},
})

This is where we define a custom stream, but after reading the ReadableStream and IncomingMessage docs, our implementation looks correct and the connections should be closed as expected.

One thing I was not able to do in this part of the code was debug the events. For some reason, only the data event was being called 🤔. Maybe Axios is not closing the streams properly, causing this leak, but I'm not sure how to test that since there is no way to access the stream from the adapter after it is created.
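
One thing that might help is dropping some hypothetical instrumentation (not the plugin's actual code) next to the start() callback above, to see whether end/close ever fire on the underlying message and its socket:

import type { IncomingMessage } from "node:http"

// Attach listeners for the full lifecycle, not just "data". If "close" never
// fires on the socket, coderd still sees an open request.
function traceStream(label: string, message: IncomingMessage): void {
  for (const event of ["end", "close", "error", "aborted"]) {
    message.on(event, (...args) =>
      console.debug(`[${label}] IncomingMessage ${event}`, ...args),
    )
  }
  message.socket?.on("close", (hadError) =>
    console.debug(`[${label}] socket close, hadError=${hadError}`),
  )
}

// Usage inside the adapter, assuming `response.data` is the streaming
// IncomingMessage returned by Axios:
//   traceStream("watch-metadata", response.data)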

@BrunoQuaresma
Contributor

Investigation note: I was able to verify the issue exists on v1.4.2, but after checking the diff here, I think the problem is not in using Axios but in the agent singleton, as you can see here.

@BrunoQuaresma
Contributor

Investigation note: After debugging a bit more, I figured out the issue is neither in removing the singleton nor in Axios. I tried using fetch instead of Axios and the result was the same.

const eventSource = new EventSource(watchUrl.toString(), {
  fetch: (input, init) =>
    fetch(input, {
      ...init,
      headers: {
        ...init?.headers,
        "Coder-Session-Token": token ?? "",
      },
    }),
})

Read the docs

Another big change we made was updating the eventsource package from ^2.0.2 to ^3.0.5. I'm going to roll back to ^2.0.2 and see what happens.

@BrunoQuaresma
Contributor

BrunoQuaresma commented Apr 11, 2025

Investigation note: After downgrading eventsource to ^2.0.2, things got fixed. The breaking changes note for 3.0.0 says:

https.* options dropped. Pass a custom fetch function that provides an agent/dispatcher instead.

We need to better understand how to implement this agent/dispatcher correctly.
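
A minimal sketch of what that could look like, assuming undici supplies the dispatcher (the Agent options and the dispose shape are illustrative, not a worked-out design; watchUrl and token stand in for what the extension already has in scope):

import { Agent, fetch as undiciFetch } from "undici"
import { EventSource } from "eventsource"

// Placeholders for illustration only.
const watchUrl = new URL(
  "https://coder.example.com/api/v2/workspaceagents/agent-uuid/watch-metadata",
)
const token = process.env.CODER_SESSION_TOKEN

// One shared dispatcher gives a single place to pool sockets and, crucially,
// to tear them down. The numbers are illustrative, not tuned values.
const dispatcher = new Agent({ connections: 10, keepAliveTimeout: 30_000 })

const eventSource = new EventSource(watchUrl.toString(), {
  fetch: (input, init) =>
    undiciFetch(input, {
      ...init,
      dispatcher,
      headers: {
        ...init?.headers,
        "Coder-Session-Token": token ?? "",
      },
    }),
})

// Closing the source and the dispatcher together should leave nothing behind:
// close() waits for in-flight requests to finish, destroy() does not.
export async function dispose(): Promise<void> {
  eventSource.close()
  await dispatcher.close()
}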
