SSE rearchitecture #445

bartekn · 2018-05-14T17:57:50Z

Problem

After deploying master branch to testnet horizon we noticed a huge increase in CPU usage and network throughput of horizon's DB.

After investigating it we realized that code in #379 sends "preamble" ( hello message) with 200 OK header to SSE stream. Once status code is sent we can't modify it and it means SSE clients will always reconnect to stream according to SSE specification even if it makes no sense to reconnect (ex. account doesn't exist). This caused a lot of clients to reconnect over and over again and generated a huge load on our DB.

Possible solutions

We definitely need to implement a few changes in sse package. First of all, let's define a special scenarios and how streaming should behave:

When account doesn't exist we should probably fail with 404 Not Found status code to let client know that it should stop streaming rather than wait for create_operation for possibly infinite amount of time.
It's possible that temporary errors (like DB connection is broken) can happen.

To solve the first scenario, we could check if account in question exists before sending a preamble. This is easy to fix as other streaming events should always start right away streaming. (Fixed in #446)

The second scenario is more complicated because an error can occur at any moment, in most cases after sending headers. We shouldn't "fail the connection", we should rather try to reconnect until problem is fixed. I think we need to implement exponential backoff in our streaming code so that we are not flooded by multiple clients trying to reconnect every second (current retry value).

To implement exponential backoff we need to be able to identify each connection:

First we need to store (and flush) information about each connection to remember number of attempts. We should probably store it in Redis with multiple horizon servers deployments in mind.
Second, how to make clients to remember their stream ID? I think we can solve this but checking id (or other name) GET param in streaming request. If id parameter does not exist, we redirect user to the URL with random id appended. If id exists we check the stream data in storage and modify retry value to implement a backoff. From SSE specification:

HTTP 301 Moved Permanently responses must cause the user agent to reconnect using the new server specified URL instead of the previously specified URL for all subsequent requests for this event source. (It doesn't affect other EventSource objects with the same URL unless they also receive 301 responses, and it doesn't affect future sessions, e.g. if the page is reloaded.)

The first fix (404 Not Found when account does not exist) should solve the issue we're experiencing directly and should be considered a quickfix so we can deploy master branch to testnet. The second fix is a long term fix.

The text was updated successfully, but these errors were encountered:

bartekn added the horizon label May 14, 2018

This was referenced May 15, 2018

Send initial errors while streaming #446

Merged

Missing toml in asset_stats #427

Closed

lechengfan mentioned this issue Sep 24, 2018

Validate requests for SSE before creating a new stream #686

Closed

bartekn mentioned this issue Oct 11, 2018

Streaming requests should be rate limited per DB query #715

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SSE rearchitecture #445

SSE rearchitecture #445

bartekn commented May 14, 2018 •

edited

SSE rearchitecture #445

SSE rearchitecture #445

Comments

bartekn commented May 14, 2018 • edited

Problem

Possible solutions

bartekn commented May 14, 2018 •

edited