Problem
After deploying the master branch to testnet horizon, we noticed a huge increase in CPU usage and network throughput on horizon's DB.
After investigating, we realized that the code in #379 sends a "preamble" (hello message) with a 200 OK status to the SSE stream. Once the status code is sent we can't change it, which means SSE clients will always reconnect to the stream, as the SSE specification requires, even when reconnecting makes no sense (e.g. the account doesn't exist). This caused many clients to reconnect over and over and generated a huge load on our DB.
Possible solutions
We definitely need to implement a few changes in the sse package. First of all, let's define the special scenarios and how streaming should behave:
When an account doesn't exist, we should probably fail with a 404 Not Found status code to let the client know that it should stop streaming rather than wait for create_operation for a possibly infinite amount of time.
Temporary errors (such as a broken DB connection) can happen.
To solve the first scenario, we could check whether the account in question exists before sending the preamble. This is easy to fix because other streaming events should always start streaming right away. (Fixed in #446)
The second scenario is more complicated because an error can occur at any moment, in most cases after the headers have been sent. We shouldn't "fail the connection"; instead, clients should keep trying to reconnect until the problem is fixed. I think we need to implement exponential backoff in our streaming code so that we are not flooded by many clients trying to reconnect every second (the current retry value).
To implement exponential backoff, we need to be able to identify each connection:
First, we need to store (and flush) information about each connection to remember the number of attempts. We should probably store it in Redis, with deployments of multiple horizon servers in mind.
Second, how do we make clients remember their stream ID? I think we can solve this by checking an id (or other name) GET param in the streaming request. If the id parameter does not exist, we redirect the user to the URL with a random id appended. If id exists, we check the stream data in storage and modify the retry value to implement a backoff. From the SSE specification:
HTTP 301 Moved Permanently responses must cause the user agent to reconnect using the new server specified URL instead of the previously specified URL for all subsequent requests for this event source. (It doesn't affect other EventSource objects with the same URL unless they also receive 301 responses, and it doesn't affect future sessions, e.g. if the page is reloaded.)
The first fix (404 Not Found when the account does not exist) should directly solve the issue we're experiencing and should be considered a quick fix so we can deploy the master branch to testnet. The second fix is a long-term fix.