
[Feature] scaling, read-only replicas on a shared database #622

Closed
ph4r05 opened this issue May 27, 2021 · 3 comments

@ph4r05
Contributor

ph4r05 commented May 27, 2021

tl;dr: The idea is to deploy and scale Blockbook instances in the cloud: store the synced DB on a single shared storage (e.g., Amazon EFS), with one master and multiple read-only replicas.

According to our benchmarking measurements, we could benefit from hooking a single storage (a fully synchronized database) to multiple read-only worker nodes, as CPU/RAM requirements increase with the number of requests (assuming the FS is fast enough).

A naive scaling solution would be to start new Blockbook instances from a "template" instance that is kept running and fully synchronized. The problem is that copying Blockbook to a new machine requires quite a lot of bandwidth and time to copy the full database (e.g., 2.3 TB for Ethereum). The first copy pass can be done while the source instance is running. In the second pass (e.g., with rsync) only changed files are copied, saving bandwidth and time. However, during the second pass the source instance has to be stopped to ensure database consistency, and afterwards both nodes have to resync. Thus cloning a Blockbook instance is slow and requires downtime.

We also noticed that Blockbook could scale nicely horizontally according to demand (paying only for consumed resources), keeping the database storage separate (again billed by IOPS and consumed storage).

Also, if a read-only replica gets overloaded, it cannot crash the database or make it inconsistent. Otherwise, without proper rate limiting, it is quite easy to DoS Blockbook with a few requests, leading to an unrecoverable database (you basically have to reindex or recover from a snapshot).

It seems RocksDB already supports a few interesting features that could make this possible:

  • ReadOnlyMode (code) allows opening the database in read-only mode. The benefit is not crashing the database and having multiple read-only readers of a single database, maintained by a single master writer. The problem is that you have to reopen the database to see new changes, basically every few seconds.
  • OpenAsSecondary, also mentioned in this issue. I am not sure about the current state of this method, but it seems to provide good support for implementing proper read-only replication, perhaps with TryCatchUpWithPrimary() called periodically, e.g., every x seconds (configurable?). An example.

So the question is whether this is already supported, or whether you would consider adding support for easier scaling with read-only replicas. If Blockbook could be configured to open the database as secondary, from shared storage (NFS / Amazon EFS), responding only to user-issued requests, it would work, I guess.

The benefits are easier horizontal scaling, better availability and increased robustness. As there would be only a single syncing master (let's say the master won't respond to user requests), the chances of DB corruption due to an overload are minimal.

Another scaling approach could be adding a new FS-level abstraction layer so that many remote workers could use the same synced database over a controlled API. But that seems like a lot of work compared to sharing the DB at the file level with locks (Amazon EFS is fully POSIX-compatible, and there could be other storage backends: Ceph, ...).

Does it make sense to have read-only replicas, or is there another scaling approach I am missing?

Thanks for the info and your opinion!

@martinboehm
Contributor

Hi, thank you for your proposal. My comments:

  1. Crashing or making the DB inconsistent under load: we have never observed this behavior, and since Blockbook uses RocksDB transactions to update the DB, I cannot really imagine it happening. If you have a scenario that can crash the DB under load, please share it with us, as it needs to be fixed in the code and not by adding hardware.
  2. DB in the cloud: RocksDB is very disk-demanding. We tried to run the DB on rotational disks and the performance was terrible. A cloud disk, with even higher latencies, will probably not give you reasonable performance, but we haven't tried it.
  3. Overall performance: I don't know your use case, but we just went through a large bull run without any performance issues; in general, Blockbook is very performant. It can certainly be DoSed, like any other system, but I think the solution is not horizontal scaling but some form of request throttling.
  4. Shared read-only database to save disk space: you would lose the blockchain transaction caching, which improves Blockbook's speed significantly. I do not think that ReadOnlyMode can open a DB that is currently also open by a writing instance of RocksDB. The OpenAsSecondary option would probably work in theory, but periodically calling TryCatchUpWithPrimary() sounds really hacky and would not solve the caching problem. There is, however, a much easier way to scale Blockbook horizontally: use one backend for multiple Blockbooks. Blockbook-DB-to-backend-DB size ratios are about 230 GB/2 TB for Ethereum and 340 GB/400 GB for Bitcoin, so the disk reduction would be substantial.

@ph4r05
Contributor Author

ph4r05 commented Jun 2, 2021

Hi @martinboehm, thanks for the response.

  1. Crashing or making the DB inconsistent under load: we have never observed this behavior and since Blockbook uses RocksDB transactions to update DB, I cannot really imagine that it happens. If you have some scenario which can crash the DB under load, please give it to us as it needs to be fixed in the code and not by adding hardware.

We use Blockbook on Amazon with GP3 drives (6000 IOPS, 300 MBps), and on "fast endpoints" it can provide 500-1700 RPS per instance, depending on the CPU. This performance is fine for us.

However, if you use Ethereum and call /api/v2/address/<address>?details=txs with an address that has a reasonably populated transaction history, it yields 1-3 RPS. When you keep hammering Blockbook with this load, RAM consumption grows (32 GB available), OOM eventually kills it, and after a restart Blockbook simply reports that the database is inconsistent and cannot be reused. I don't have time to reproduce this right now, but we have experienced it 3 times already, always in the same scenario (CPU load can go to 500 with 4 vCPUs). Maybe disk writes are also delayed?

I think I also noticed GitHub issues (2) saying that a Blockbook OOM-killed during synchronization damages the database and reindexing is needed. Could this be the same cause (a kill while adding new blocks)? Can Blockbook be terminated with kill -9 at any moment without damaging the database, or is it only vulnerable during the initial sync? Maybe the problem manifests with slower disks?

Is this relevant?

I0629 08:21:21.157771   63636 bulkconnect.go:56] rocksdb: bulk connect init, db set to inconsistent state

The database of geth gets corrupted by OOM kills as well.

it needs to be fixed in the code and not by adding hardware.

We aimed to fix it by adding read-only readers that, in principle, cannot damage the shared database even if OOM kills them at any time. Fixing it in the code is out of our scope.

  2. DB in the cloud: RocksDB is very disk-demanding. We tried to run the DB on rotational disks and the performance was terrible. A cloud disk, with even higher latencies, will probably not give you reasonable performance, but we haven't tried it.

As mentioned, a GP3 SSD with 6000 IOPS and 300 MBps is enough for our needs. We also use storage-optimized instances with NVMe SSD storage for fast synchronization (native latency, 6-8x better than GP3: 0.09-0.1 ms random read latency, 13.8k read IOPS, 10.2k write IOPS; sync uses just 6100 IOPS). Both use cases work great.

  3. Overall performance: I don't know your use case, but we just went through a large bull run without any performance issues; in general, Blockbook is very performant. It can certainly be DoSed, like any other system, but I think the solution is not horizontal scaling but some form of request throttling.

My experience is a bit different. I implemented a Blockbook benchmark with Locust, with 16 workers each issuing thousands of requests per second, from multiple servers. I tested several endpoints with random addresses and transactions collected by sampling the blockchain. There is a huge gap in endpoint performance, /api/v2/address/<address>?details=txs being the slowest one; the difference is orders of magnitude. (I would love to share the benchmarking tool and the results, but it was done for a company, so written approval would be necessary.)

We rate-limit the requests according to data collected from the benchmark. Some native rate limiting in Blockbook would also be great, for instance setting a maximum page size in the config file, so a DoSer cannot request 1000 records per page, with full transaction details, for the busiest addresses on the chain (I sampled blocks and picked addresses that are quite busy; I collected thousands of those). Or allowing only a subset of the ?details= options. Doing this in an application firewall is a bit trickier.
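
A maximum page size like the one proposed could be enforced with a simple clamp. Both the maxPageSize constant and the clampPageSize helper below are hypothetical, not existing Blockbook config or code:

```go
package main

import "fmt"

// maxPageSize plays the role of the proposed config option; Blockbook has
// no such setting today, this is only a sketch of the idea.
const maxPageSize = 50

// clampPageSize caps a client-requested page size so that a single request
// cannot demand thousands of fully detailed transactions for a busy address.
func clampPageSize(requested, max int) int {
	if requested < 1 {
		return 1
	}
	if requested > max {
		return max
	}
	return requested
}

func main() {
	fmt.Println(clampPageSize(1000, maxPageSize)) // 50
	fmt.Println(clampPageSize(25, maxPageSize))   // 25
}
```

The same pattern would apply to whitelisting the ?details= options: validate against a fixed set before doing any DB work.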

  4. Shared read-only database to save disk space: you would lose the blockchain transaction caching, which improves Blockbook's speed significantly. I do not think that ReadOnlyMode can open a DB that is currently also open by a writing instance of RocksDB. The OpenAsSecondary option would probably work in theory, but periodically calling TryCatchUpWithPrimary() sounds really hacky and would not solve the caching problem. There is, however, a much easier way to scale Blockbook horizontally: use one backend for multiple Blockbooks. Blockbook-DB-to-backend-DB size ratios are about 230 GB/2 TB for Ethereum and 340 GB/400 GB for Bitcoin, so the disk reduction would be substantial.
  • Good point on ReadOnlyMode, I didn't check that.
  • The TryCatchUpWithPrimary solution is the usual periodic catch-up; without direct IPC, this seems like the standard approach. One could maybe check the mtime of a new event file to detect whether TryCatchUpWithPrimary is necessary, but that's just polling moved elsewhere. IPC could mean polling Geth for new blocks and calling TryCatchUpWithPrimary on change (or subscribing to an event, if possible). But I understand that you currently don't see value in having read-only replicas.
  • Would you consider merging a simple PR adding this functionality, off by default, with the possibility to enable it via config / CLI?
  • Thanks for the scaling advice! If read-only replicas are not available, we will go with this.

Thanks again for your time

@ph4r05
Contributor Author

ph4r05 commented Jun 3, 2021

Btw @martinboehm, if you set up a testing instance of Blockbook for Ethereum mainnet (or already have one that can be tested), I can benchmark it from Amazon with my tools. We can check whether we can replicate the DB corruption problem by overloading the server with generic requests. What do you think? We could maybe also publish the results of this benchmark; it could be handy for others.
