RFC 021: The Future of the Socket Protocol #8584

Merged on Jul 19, 2022 (12 commits)
1 change: 1 addition & 0 deletions docs/rfc/README.md
@@ -58,5 +58,6 @@ sections.
- [RFC-018: BLS Signature Aggregation Exploration](./rfc-018-bls-agg-exploration.md)
- [RFC-019: Configuration File Versioning](./rfc-019-config-version.md)
- [RFC-020: Onboarding Projects](./rfc-020-onboarding-projects.rst)
- [RFC-021: The Future of the Socket Protocol](./rfc-021-socket-protocol.md)

<!-- - [RFC-NNN: Title](./rfc-NNN-title.md) -->
266 changes: 266 additions & 0 deletions docs/rfc/rfc-021-socket-protocol.md
@@ -0,0 +1,266 @@
# RFC 021: The Future of the Socket Protocol

## Changelog

- 19-May-2022: Initial draft (@creachadair)
- 19-Jul-2022: Converted from ADR to RFC (@creachadair)

## Abstract

This RFC captures some technical discussion about the ABCI socket protocol that
was originally documented to solicit an architectural decision. As of this
writing, the topic was not a high enough priority to justify a final decision.

For that reason, the text of this RFC has the general structure of an ADR, but
should be viewed primarily as a record of the issue for future reference.

## Background

The [Application Blockchain Interface (ABCI)][abci] is a client-server protocol
used by the Tendermint consensus engine to communicate with the application on
whose behalf it performs state replication. There are currently three transport
options available for ABCI applications:

1. **In-process**: Applications written in Go can be linked directly into the
same binary as the consensus node. Such applications use a "local" ABCI
connection, which exposes application methods to the node as direct function
calls.

2. **Socket protocol**: Out-of-process applications may export the ABCI service
via a custom socket protocol that sends requests and responses over a
Unix-domain or TCP socket connection as length-prefixed protocol buffers.
In Tendermint, this is handled by the [socket client][socket-client].

3. **gRPC**: Out-of-process applications may export the ABCI service via gRPC.
In Tendermint, this is handled by the [gRPC client][grpc-client].
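
To make the framing in option (2) concrete, here is a minimal sketch of
length-prefixed messages in Go. It assumes the length prefix is an unsigned
varint and treats each payload as opaque bytes rather than a real protocol
buffer. The names `writeFrame`, `readFrame`, `roundTrip`, and the `maxSize`
limit are hypothetical and are not taken from the Tendermint code.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeFrame writes one message as a uvarint length prefix followed by the
// payload. In the real protocol the payload would be a serialized protocol
// buffer; this sketch treats it as opaque bytes.
func writeFrame(w io.Writer, payload []byte) error {
	var hdr [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(hdr[:], uint64(len(payload)))
	if _, err := w.Write(hdr[:n]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// readFrame reads one length-prefixed message. maxSize is a hypothetical
// safety limit so a bad prefix cannot force a huge allocation.
func readFrame(r *bufio.Reader, maxSize uint64) ([]byte, error) {
	size, err := binary.ReadUvarint(r)
	if err != nil {
		return nil, err
	}
	if size > maxSize {
		return nil, fmt.Errorf("frame of %d bytes exceeds limit %d", size, maxSize)
	}
	buf := make([]byte, size)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, err
	}
	return buf, nil
}

// roundTrip frames each payload into a buffer (standing in for the socket)
// and reads the frames back in the order they were written.
func roundTrip(payloads ...string) ([]string, error) {
	var conn bytes.Buffer
	for _, p := range payloads {
		if err := writeFrame(&conn, []byte(p)); err != nil {
			return nil, err
		}
	}
	r := bufio.NewReader(&conn)
	var out []string
	for range payloads {
		msg, err := readFrame(r, 1<<20)
		if err != nil {
			return nil, err
		}
		out = append(out, string(msg))
	}
	return out, nil
}

func main() {
	msgs, err := roundTrip("echo", "", "check_tx")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", msgs)
}
```

Note that a frame carries nothing but a length and a payload: there is no
request identifier or method name at this layer, which is relevant to the
limitations discussed below.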

Both the out-of-process options (2) and (3) have a long history in Tendermint.
The beginnings of the gRPC client were added in [May 2016][abci-start] when
ABCI was still hosted in a separate repository, and the socket client (formerly
called the "remote client") was part of ABCI from its inception in November
2015.

At the time ABCI was first being developed, the gRPC project was very new
(it launched in Q4 2015) and it was not an obvious choice for use in Tendermint.
It took a while before the language coverage and quality of gRPC reached a
point where it could be a viable solution for out-of-process applications. For
that reason, it made sense for the initial design of ABCI to focus on a custom
protocol for out-of-process applications.

## Problem Statement

For practical reasons, ABCI needs an interprocess communication option to
support applications not written in Go. The two practical options are RPC and
FFI, and for operational reasons an RPC mechanism makes more sense.

The socket protocol has not changed all that substantially since its original
design, and has the advantage of being simple to implement in almost any
reasonable language. However, its simplicity includes some limitations that
have had a negative impact on the stability and performance of out-of-process
applications using it. In particular:

- The protocol lacks request identifiers, so the client and server must return
responses in strict FIFO order. Even if the client issues requests that have
no dependency on each other, the protocol has no way except order of issue to
map responses to requests.

This reduces (in some cases substantially) the concurrency an application can
exploit, since the parallelism of requests in flight is gated by the slowest
active request at any moment. There have been complaints from some network
operators on that basis.

- The protocol lacks method identifiers, so the only way for the client and
server to understand which operation is requested is to dispatch on the type
of the request and response payloads. For responses, this means that [any
error condition is terminal not only to the request, but to the entire ABCI
client](https://github.com/tendermint/tendermint/blob/master/abci/client/socket_client.go#L149).

The historical intent of terminating for any error seems to have been that
all ABCI errors are unrecoverable and hence protocol fatal <!-- markdown-link-check-disable-next-line -->
(see [Note 1](#note1)). In practice, however, this greatly complicates
debugging a faulty node, since the only way to respond to errors is to panic
the node, which loses valuable context that could have been logged.

- There are subtle concurrency management dependencies between the client and
the server that are not clearly documented anywhere, and it is very easy for
small changes in both the client and the server to lead to tricky deadlocks,
panics, race conditions, and slowdowns. As a recent example of this, see
https://github.com/tendermint/tendermint/pull/8581.
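
The first two limitations can be illustrated with a small sketch. It is not
the Tendermint client: the `message` and `fifoClient` types and the message
kinds are hypothetical stand-ins. A response carries only its type, so the
client can pair it only with the oldest pending request. Unlike the real
client, which treats a mismatch as fatal, this sketch simply returns an error.

```go
package main

import (
	"errors"
	"fmt"
)

// message stands in for an ABCI request or response. It carries only a kind,
// mirroring the absence of request and method identifiers on the wire.
type message struct {
	kind string
	data string
}

// fifoClient remembers requests in the order they were sent.
type fifoClient struct {
	pending []message
}

var errUnexpected = errors.New("unexpected response type")

// send records a request as in flight.
func (c *fifoClient) send(req message) {
	c.pending = append(c.pending, req)
}

// receive pairs a response with the oldest pending request, because order of
// issue is the only correlation available. A type mismatch is reported as an
// error and the queue is left unchanged.
func (c *fifoClient) receive(res message) (message, error) {
	if len(c.pending) == 0 {
		return message{}, errors.New("response with no pending request")
	}
	req := c.pending[0]
	if req.kind != res.kind {
		return message{}, fmt.Errorf("%w: want %q, got %q", errUnexpected, req.kind, res.kind)
	}
	c.pending = c.pending[1:]
	return req, nil
}

func main() {
	c := &fifoClient{}
	c.send(message{kind: "deliver_tx", data: "slow"})
	c.send(message{kind: "echo", data: "fast"})

	// The fast response cannot be accepted first: it does not match the head.
	if _, err := c.receive(message{kind: "echo"}); err != nil {
		fmt.Println("out of order:", err)
	}

	// Responses are accepted only in the order the requests were issued.
	for _, kind := range []string{"deliver_tx", "echo"} {
		req, err := c.receive(message{kind: kind})
		fmt.Println(req.kind, req.data, err)
	}
}
```

Even though the two requests are independent, the `echo` response has to wait
behind `deliver_tx`, which is the head-of-line blocking described above.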

These limitations are fixable, but one important question is whether it is
worthwhile to fix them. We can add request and method identifiers, for
example, but doing so would be a breaking change to the protocol requiring
every application using it to update. If applications have to migrate anyway,
the stability and language coverage of gRPC have improved a lot, and today it
is probably simpler to set up and maintain an application using gRPC transport
than to reimplement the Tendermint socket protocol.
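
As a rough idea of what adding request and method identifiers could look
like, the sketch below wraps each message in an envelope. This is purely
illustrative: the `envelope` type, its fields, and `idClient` are hypothetical
and do not correspond to any existing or proposed wire format.

```go
package main

import "fmt"

// envelope is a hypothetical wrapper; none of these fields exist in the
// current socket protocol.
type envelope struct {
	ID     uint64 // correlates a response with its request
	Method string // names the operation explicitly
	Err    string // non-empty reports a failure of this request only
	Body   string
}

// idClient tracks in-flight requests by identifier instead of by order.
type idClient struct {
	nextID  uint64
	pending map[uint64]string // request ID -> method
}

func newIDClient() *idClient {
	return &idClient{pending: make(map[uint64]string)}
}

// send assigns a fresh ID and records the request as in flight.
func (c *idClient) send(method, body string) envelope {
	c.nextID++
	c.pending[c.nextID] = method
	return envelope{ID: c.nextID, Method: method, Body: body}
}

// receive matches a response by ID, in any order, and returns the method of
// the request it answers. An error carried in the envelope fails only that
// request.
func (c *idClient) receive(res envelope) (string, error) {
	method, ok := c.pending[res.ID]
	if !ok {
		return "", fmt.Errorf("no pending request with ID %d", res.ID)
	}
	delete(c.pending, res.ID)
	if res.Err != "" {
		return method, fmt.Errorf("%s (request %d) failed: %s", method, res.ID, res.Err)
	}
	return method, nil
}

func main() {
	c := newIDClient()
	slow := c.send("deliver_tx", "tx-1")
	fast := c.send("echo", "ping")

	// The fast response arrives first and is still matched correctly.
	m, err := c.receive(envelope{ID: fast.ID, Body: "ping"})
	fmt.Println(m, err)

	// A failure is attributed to one request, not to the whole connection.
	m, err = c.receive(envelope{ID: slow.ID, Err: "out of gas"})
	fmt.Println(m, err)
}
```

Any change of this shape alters the bytes on the wire, which is why it would
require every existing socket-protocol application to update.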

Moreover, gRPC addresses all the above issues out-of-the-box, and requires
(much) less custom code for both the server (i.e., the application) and the
client. The project is well-funded and widely-used, which makes it a safe bet
for a dependency.

## Decision

There is a set of related alternatives to consider:

- Question 1: Designate a single IPC standard for out-of-process applications?

Claim: We should converge on one (and only one) IPC option for out-of-process
applications. We should choose an option that, after a suitable period of
deprecation for alternatives, will address most or all of the highest-impact
uses of Tendermint. Maintaining multiple options increases the surface area
for bugs and vulnerabilities, and we should not have multiple options for
basic interfaces without a clear and well-documented reason.

- Question 2a: Choose gRPC and deprecate/remove the socket protocol?

Claim: Maintaining and improving a custom RPC protocol is a substantial
project and not directly relevant to the requirements of consensus. We would
be better served by depending on a well-maintained open-source library like
gRPC.

- Question 2b: Improve the socket protocol and deprecate/remove gRPC?

Claim: If we find meaningful advantages to maintaining our own custom RPC
protocol in Tendermint, we should treat it as a first-class project within
the core and invest in making it good enough that we do not require other
options.

**One important consideration** when discussing these questions is that _any
outcome which includes keeping the socket protocol will have eventual migration
impacts for out-of-process applications_ regardless. To fix the limitations of
the socket protocol as it is currently designed will require making _breaking
changes_ to the protocol. So, while we may put off a migration cost for
out-of-process applications by retaining the socket protocol in the short term,
we will eventually have to pay those costs to fix the problems in its current
design.

## Detailed Design

1. If we choose to standardize on gRPC, the main work in Tendermint core will
be removing and cleaning up the code for the socket client and server.

Besides the code cleanup, we will also need to clearly document a
deprecation schedule, and invest time in making the migration easier for
applications currently using the socket protocol.

> **Point for discussion:** Migrating from the socket protocol to gRPC
> should mostly be a plumbing change, as long as we do it during a release
> in which we are not making other breaking changes to ABCI. However, the
> effort may be more or less depending on how gRPC integration works in the
> application's implementation language, and we would have to be sure networks
> have plenty of time not only to make the change but to verify that it
> preserves the function of the network.
>
> What questions should we be asking node operators and application
> developers to understand the migration costs better?

2. If we choose to keep only the socket protocol, we will need to follow up
with a more detailed design for extending and upgrading the protocol to fix
the existing performance and operational issues with the protocol.

Moreover, since the gRPC interface has been around for a long time we will
also need a deprecation plan for it.

3. If we choose to keep both options, we will still need to do all the work of
(2), but the gRPC implementation should not require any immediate changes.


## Alternatives Considered

- **FFI**. Another approach we could take is to use a C-based FFI interface so
that applications written in other languages are linked directly with the
consensus node, an option currently only available for Go applications.

An FFI interface is possible for a lot of languages, but FFI support varies
widely in coverage and quality across languages and the points of friction
can be tricky to work around. Moreover, it's much harder to add FFI support
to a language where it's missing after-the-fact for an application developer.

Although a basic FFI interface is not too difficult on the Go side, the C
shims for an FFI can get complicated if there's a lot of variability in the
runtime environment on the other end.

If we want to have one answer for non-Go applications, we are better off
picking an IPC-based solution (whether that's gRPC or an extension of our
custom socket protocol or something else).

## Consequences

- **Standardize on gRPC**

- ✅ Addresses existing performance and operational issues.
- ✅ Replaces custom code with a well-maintained widely-used library.
- ✅ Aligns with Cosmos SDK, which already uses gRPC extensively.
- ✅ Aligns with priv validator interface, for which the socket protocol is already deprecated in favor of gRPC.
- ❓ Applications will be hard to implement in a language without gRPC support.
- ⛔ All users of the socket protocol have to migrate to gRPC, and we believe most current out-of-process applications use the socket protocol.

- **Standardize on socket protocol**

- ✅ Less immediate impact for existing users (but see below).
- ✅ Simplifies ABCI API surface by removing gRPC.
- ❓ Users of the socket protocol will have a (smaller) migration.
- ❓ Potentially easier to implement for languages that do not have gRPC support.
- ⛔ Need to do all the work to fix the socket protocol (which will require existing users to update anyway later).
- ⛔ Ongoing maintenance burden for per-language server implementations.

- **Keep both options**

- ✅ Less immediate impact for existing users (but see below).
- ❓ Users of the socket protocol will have a (smaller) migration.
- ⛔ Still need to do all the work to fix the socket protocol (which will require existing users to update anyway later).
- ⛔ Requires ongoing maintenance and support of both gRPC and socket protocol integrations.


## References

- [Application Blockchain Interface (ABCI)][abci]
- [Tendermint ABCI socket client][socket-client]
- [Tendermint ABCI gRPC client][grpc-client]
- [Initial commit of gRPC client][abci-start]

[abci]: https://github.com/tendermint/spec/tree/master/spec/abci
[socket-client]: https://github.com/tendermint/tendermint/blob/master/abci/client/socket_client.go
[socket-server]: https://github.com/tendermint/tendermint/blob/master/abci/server/socket_server.go
[grpc-client]: https://github.com/tendermint/tendermint/blob/master/abci/client/grpc_client.go
[abci-start]: https://github.com/tendermint/abci/commit/1ab3c747182aaa38418258679c667090c2bb1e0d

## Notes

- <a id=note1></a>**Note 1**: The choice to make all ABCI errors protocol-fatal
was intended to avoid the risk that recovering an application error could
cause application state to diverge. Divergence can break consensus, so it's
essential to avoid it.

This is a sound principle, but conflates protocol errors with "mechanical"
errors such as timeouts, resource exhaustion, failed connections, and so on.
Because the protocol has no way to distinguish these conditions, the only way
for an application to report an error is to panic or crash.

Whether a node is running in the same process as the application or as a
separate process, application errors should not be suppressed or hidden.
However, it's important to ensure that errors are handled at a consistent and
well-defined point in the protocol: Having the application panic or crash
rather than reporting an error means the node sees different results
depending on whether the application runs in-process or out-of-process, even
if the application logic is otherwise identical.
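
One way to picture the distinction drawn in this note is a classifier that
separates "mechanical" failures from everything else. The sketch below is an
assumption-laden illustration, not existing Tendermint behavior: the
`errorClass` type, the `classify` function, and the sample errors are all
hypothetical, and which errors count as mechanical would need a real design.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
)

// errorClass is a hypothetical two-way split of client-side failures.
type errorClass int

const (
	classProtocol   errorClass = iota // could imply divergent state: halt
	classMechanical                   // transport trouble: safe to log and report
)

func (c errorClass) String() string {
	if c == classMechanical {
		return "mechanical"
	}
	return "protocol"
}

// Sample errors used only for illustration.
var (
	errBadResponse = errors.New("malformed response")
	errCallTimeout = fmt.Errorf("call timed out: %w", context.DeadlineExceeded)
)

// classify treats closed connections and network errors (including timeouts)
// as mechanical. Anything it does not recognize stays protocol-fatal, so the
// conservative default preserves the original safety principle.
func classify(err error) errorClass {
	var netErr net.Error
	switch {
	case errors.Is(err, io.EOF), errors.Is(err, io.ErrUnexpectedEOF):
		return classMechanical
	case errors.As(err, &netErr):
		return classMechanical
	default:
		return classProtocol
	}
}

func main() {
	samples := []error{
		errCallTimeout,
		fmt.Errorf("read: %w", io.ErrUnexpectedEOF),
		errBadResponse,
	}
	for _, err := range samples {
		fmt.Printf("%-10s %v\n", classify(err), err)
	}
}
```

The current socket protocol gives the client no way to make this split for
errors that originate in the application, since an application cannot report
an error at all except by panicking or crashing.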

## Appendix: Known Implementations of ABCI Socket Protocol

This is a list of known implementations of the Tendermint custom socket
protocol. Note that in most cases I have not checked how complete or correct
these implementations are; these are based on search results and a cursory
visual inspection.

- Tendermint Core (Go): [client][socket-client], [server][socket-server]
- Informal Systems [tendermint-rs](https://github.com/informalsystems/tendermint-rs) (Rust): [client](https://github.com/informalsystems/tendermint-rs/blob/master/abci/src/client.rs), [server](https://github.com/informalsystems/tendermint-rs/blob/master/abci/src/server.rs)
- Tendermint [js-abci](https://github.com/tendermint/js-abci) (JS): [server](https://github.com/tendermint/js-abci/blob/master/src/server.js)
- [Hotmoka](https://github.com/Hotmoka/hotmoka) ABCI (Java): [server](https://github.com/Hotmoka/hotmoka/blob/master/io-hotmoka-tendermint-abci/src/main/java/io/hotmoka/tendermint_abci/Server.java)
- [Tower ABCI](https://github.com/penumbra-zone/tower-abci) (Rust): [server](https://github.com/penumbra-zone/tower-abci/blob/main/src/server.rs)
- [abci-host](https://github.com/datopia/abci-host) (Clojure): [server](https://github.com/datopia/abci-host/blob/master/src/abci/host.clj)
- [abci_server](https://github.com/KrzysiekJ/abci_server) (Erlang): [server](https://github.com/KrzysiekJ/abci_server/blob/master/src/abci_server.erl)
- [py-abci](https://github.com/davebryson/py-abci) (Python): [server](https://github.com/davebryson/py-abci/blob/master/src/abci/server.py)
- [scala-tendermint-server](https://github.com/intechsa/scala-tendermint-server) (Scala): [server](https://github.com/InTechSA/scala-tendermint-server/blob/master/src/main/scala/lu/intech/tendermint/Server.scala)
- [kepler](https://github.com/f-o-a-m/kepler) (Haskell): [server](https://github.com/f-o-a-m/kepler/blob/master/hs-abci-server/src/Network/ABCI/Server.hs)