Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SSE rearchitecture #445

Open
bartekn opened this issue May 14, 2018 · 0 comments
Open

SSE rearchitecture #445

bartekn opened this issue May 14, 2018 · 0 comments
Labels

Comments

@bartekn
Copy link
Contributor

bartekn commented May 14, 2018

Problem

After deploying master branch to testnet horizon we noticed a huge increase in CPU usage and network throughput of horizon's DB.

screen shot 2018-05-14 at 19 28 29

After investigating it we realized that code in #379 sends "preamble" ( hello message) with 200 OK header to SSE stream. Once status code is sent we can't modify it and it means SSE clients will always reconnect to stream according to SSE specification even if it makes no sense to reconnect (ex. account doesn't exist). This caused a lot of clients to reconnect over and over again and generated a huge load on our DB.

Possible solutions

We definitely need to implement a few changes in sse package. First of all, let's define a special scenarios and how streaming should behave:

  • When account doesn't exist we should probably fail with 404 Not Found status code to let client know that it should stop streaming rather than wait for create_operation for possibly infinite amount of time.
  • It's possible that temporary errors (like DB connection is broken) can happen.

To solve the first scenario, we could check if account in question exists before sending a preamble. This is easy to fix as other streaming events should always start right away streaming. (Fixed in #446)

The second scenario is more complicated because an error can occur at any moment, in most cases after sending headers. We shouldn't "fail the connection", we should rather try to reconnect until problem is fixed. I think we need to implement exponential backoff in our streaming code so that we are not flooded by multiple clients trying to reconnect every second (current retry value).

To implement exponential backoff we need to be able to identify each connection:

  • First we need to store (and flush) information about each connection to remember number of attempts. We should probably store it in Redis with multiple horizon servers deployments in mind.
  • Second, how to make clients to remember their stream ID? I think we can solve this but checking id (or other name) GET param in streaming request. If id parameter does not exist, we redirect user to the URL with random id appended. If id exists we check the stream data in storage and modify retry value to implement a backoff. From SSE specification:

    HTTP 301 Moved Permanently responses must cause the user agent to reconnect using the new server specified URL instead of the previously specified URL for all subsequent requests for this event source. (It doesn't affect other EventSource objects with the same URL unless they also receive 301 responses, and it doesn't affect future sessions, e.g. if the page is reloaded.)

The first fix (404 Not Found when account does not exist) should solve the issue we're experiencing directly and should be considered a quickfix so we can deploy master branch to testnet. The second fix is a long term fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant