Combine extended flow messages into a single composite message #138

Closed
ksky8864 opened this issue Aug 28, 2019 · 22 comments
Labels
status: superseded (an issue that has been superseded by another), type: enhancement (a general enhancement)

Comments

ksky8864 commented Aug 28, 2019

I benchmarked r2dbc-postgresql against the JDBC driver for PostgreSQL to find out whether R2DBC is suitable for online transaction processing, but the R2DBC result was not as good as the JDBC one.

  • Is there something wrong with my R2DBC usage?
  • Is there any way to improve the throughput in the following test case?
  • Does R2DBC not target OLTP?

These are the benchmark test conditions:

  • common
    • execute a simple insert statement 500,000 times
    • run on OpenJDK 11
    • measure elapsed time for the workload above
  • r2dbc
    • use r2dbc-postgresql, r2dbc-client and r2dbc-pool
    • pool 64 sessions with r2dbc-pool
  • JDBC
    • commit after each insertion instead of using batch inserts (to simulate OLTP)
    • execute the query with 64 pooled threads
    • pool 64 sessions with HikariCP (a sketch of this setup follows the R2DBC code below)

The parts of the test code that use R2DBC are:

private Flux<Integer> insert() {
    return this.r2dbc.<Integer> withHandle(handle -> {
        // single-row insert via r2dbc-client; emits the number of updated rows
        return handle.createUpdate("insert into test_table values ($1)")
            .bind("$1", "TEST")
            .execute();
    });
}
// pooled ConnectionFactory: r2dbc-pool in front of r2dbc-postgresql, 64 sessions max
this.r2dbc = new R2dbc(ConnectionFactories.get("r2dbc:pool:postgresql://<user>:<password>@<hostname>:<port>/<database>?maxSize=64"));
// workload: 500,000 inserts measured end to end
var start = Instant.now();
Flux.range(0, 500000).flatMap(i -> insert()).blockLast();
System.out.println("finished: time=" + ChronoUnit.MILLIS.between(start, Instant.now()) + "msec");

The test results are:

  • r2dbc
    • 34,401msec.
  • JDBC
    • 2,422msec.
@odrotbohm

Thanks for bringing this to our attention. Admittedly, we haven't had as much time to look into potential performance issues as we would've liked. Is there a chance you could prepare the example as a JMH benchmark so that we can take a look? @mp911de's microbenchmark runner might be helpful for that.
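For reference, a minimal JMH skeleton for the insert workload might look like the sketch below; the class name, JMH setup, and connection URL are assumptions rather than the benchmark that was eventually published:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import io.r2dbc.spi.ConnectionFactories;
import io.r2dbc.spi.ConnectionFactory;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Hypothetical JMH skeleton for the insert workload; names and setup are assumptions.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class InsertBenchmark {

    ConnectionFactory connectionFactory;

    @Setup
    public void setup() {
        // r2dbc-pool in front of r2dbc-postgresql, matching the original test setup
        this.connectionFactory = ConnectionFactories.get(
                "r2dbc:pool:postgresql://<user>:<password>@<hostname>:<port>/<database>?maxSize=64");
    }

    @Benchmark
    public Integer insert() {
        // one pooled connection per invocation; the insert runs, then the connection is released
        return Flux.usingWhen(
                Mono.from(this.connectionFactory.create()),
                connection -> Flux.from(connection.createStatement("insert into test_table values ($1)")
                        .bind("$1", "TEST")
                        .execute())
                    .flatMap(result -> result.getRowsUpdated()),
                connection -> connection.close())
            .blockLast();
    }
}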

@ksky8864 (Author)

I placed benchmark code using JMH in dbc-benchmark.

As I wrote earlier, my code uses r2dbc-postgresql, r2dbc-client, and r2dbc-pool. First I'd like to find out which module causes the performance issue.
Do you have any idea? I also have Java VisualVM snapshots that I took in my environment. If they would be helpful to you, please let me know.

@odrotbohm

Thanks for preparing this. I'll have a look early next week. In general, I usually just run the benchmarks and attach a profiler (like YourKit) to see where time is spent.

skroll commented Sep 20, 2019

Have you tried it without using the connection pool? HikariCP keeps some of the connections it manages in thread-local storage and only crosses thread boundaries to steal connections from other threads when it needs to. Since you are doing all the work in a single thread, it's probably reusing the same connection over and over again.

I would run a test without r2dbc-pool and HikariCP at all, on a single connection, to see whether the database driver itself is introducing the overhead.

@ksky8864 (Author)

@skroll
I tried it without a connection pool last month.
It took approximately 10 seconds in JDBC for 500,000 requests, whereas R2DBC resulted in a connection timeout. At that point I saw a spike of about 80 idle connections in pgAdmin.

skroll commented Sep 24, 2019

That sounds like you're doing parallel work then. If you are not using a connection pool, there should be only one connection active. Were you creating a connection each time?

ksky8864 (Author) commented Oct 2, 2019

I used the following code: dbc-benchmark/without-connection-pool.
In this test, an io.r2dbc.client.R2dbc instance is created at the beginning of the test and a single thread uses it repeatedly.
Is this usage appropriate?

mp911de (Collaborator) commented Oct 2, 2019

@ksky8864 the code looks okay. Note that benchmarking without a pool includes the expensive connection-creation phase in the measurement.
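As a minimal sketch of a single-connection run that keeps connection creation out of the measured loop (the statement, table name, and URL are assumptions):

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.ConnectionFactories;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Create one connection up front and reuse it, so connection creation happens once.
// concatMap keeps the inserts sequential, since a single PostgreSQL connection
// cannot execute queries concurrently.
Connection connection = Mono.from(
        ConnectionFactories.get("r2dbc:postgresql://<user>:<password>@<hostname>:<port>/<database>").create())
    .block();

Flux.range(0, 500000)
    .concatMap(i -> Flux.from(connection.createStatement("insert into test_table values ($1)")
            .bind("$1", "TEST")
            .execute())
        .flatMap(result -> result.getRowsUpdated()))
    .blockLast();

Mono.from(connection.close()).block();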

skroll commented Oct 16, 2019

I fired up Wireshark to take a look at the packets under the hood. When executing a select statement with a single parameter through the JDBC driver, the driver sends the Parse, Bind, Describe, and Execute messages together, and they go out in a single TCP packet.

The R2DBC driver does the following:
Packet 1: Parse
Packet 2: Describe
Packet 3: Sync
Packet 4: (gets description response)
Packet 5: Bind
Packet 6: Describe
Packet 7: Execute
Packet 8: Close
Packet 9: Sync
Packet 10: (gets bind completion and row description response)

I'm guessing this is part of why you are seeing the difference in performance.

mp911de (Collaborator) commented Oct 17, 2019

Thanks a lot for having a look, @skroll. We could create composite messages to combine the extended flow into a single packet. These messages are issued in ExtendedQueryMessageFlow. We would avoid TCP packet overhead and flush I/O buffers only once instead of after each message.
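Roughly, the idea could look like the following sketch, assuming Netty's CompositeByteBuf; this is not the actual ExtendedQueryMessageFlow code, and the method and parameter names are placeholders:

import java.util.List;

import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.channel.Channel;

// Sketch of the composite-message idea: the already-encoded frontend messages are
// combined into one CompositeByteBuf and written/flushed once, so e.g.
// Bind/Describe/Execute/Close/Sync can share a single TCP packet.
static void writeAsSinglePacket(Channel channel, List<ByteBuf> encodedMessages) {
    CompositeByteBuf composite = channel.alloc().compositeBuffer(encodedMessages.size());
    for (ByteBuf message : encodedMessages) {
        composite.addComponent(true, message); // true = advance the writer index
    }
    channel.writeAndFlush(composite); // one write, one flush
}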

mp911de changed the title from "performance issue when queries rush" to "Combine extended flow messages into a single composite message" on Oct 17, 2019
mp911de added the type: enhancement label on Oct 17, 2019
mp911de (Collaborator) commented Oct 17, 2019

I updated the ticket title to reflect the actual goal.

skroll commented Oct 17, 2019

For completeness' sake, I uploaded Wireshark dumps of the same query being performed through JDBC and the r2dbc-postgresql driver. It seems the JDBC driver just sends Parse/Bind/Describe/Execute/Sync while the r2dbc-postgresql implementation sends Parse/Describe/Sync/Bind/Describe/Execute/Close/Sync.

Would combining the extended flow messages into a single composite message reduce the number of messages that need to be sent? I'm guessing you don't need a named portal when using a composite message for a simple statement with parameter binding.

jdbc.txt
r2dbc.txt

mp911de (Collaborator) commented Oct 17, 2019

I'm not terribly familiar with the extended flow yet, and I'm not sure why we additionally send Describe+Sync and another Describe before Execute. We could probably improve on that end.

A composite message would combine buffers instead of putting each message into its own packet, and it would flush only once after sending the message to the transport. Internally, the message structure would remain the same and the database server shouldn't notice any difference.

skroll commented Oct 18, 2019

OK, so I figured out why there's the double Describe/Sync. The IndefiniteStatementCache calls ExtendedQueryMessageFlow.parse, which sends Parse, Describe, Sync. Then the actual ExtendedQueryMessageFlow.execute is called when the query is invoked, which sends Bind, Describe, Execute, Close, Sync.

This only happens the first time for a query, until it is cached.

mp911de (Collaborator) commented Oct 23, 2019

I did some experiments with pre-allocated messages and streamlined flushing behavior. By reducing the number of flushes we gain roughly 10% throughput in the extended query flow:

Baseline:

Benchmark                           Mode  Cnt      Score      Error  Units
StatementBenchmarks.extendedJdbc   thrpt    5  24398,045 ± 4571,383  ops/s
StatementBenchmarks.extendedR2dbc  thrpt    5  12061,340 ±  885,606  ops/s
StatementBenchmarks.simpleJdbc     thrpt    5  13693,325 ± 1661,750  ops/s
StatementBenchmarks.simpleR2dbc    thrpt    5  12073,777 ±  770,400  ops/s

Composite message:

Benchmark                           Mode  Cnt      Score      Error  Units
StatementBenchmarks.extendedJdbc   thrpt    5  25200,876 ±  496,733  ops/s
StatementBenchmarks.extendedR2dbc  thrpt    5  13515,085 ± 1674,362  ops/s
StatementBenchmarks.simpleJdbc     thrpt    5  15891,033 ±  700,806  ops/s
StatementBenchmarks.simpleR2dbc    thrpt    5  12151,821 ±  903,077  ops/s

There is however still a significant gap between the extended flow in JDBC and the extended flow in R2DBC.

Caching the RowDescription (omitting Describe) didn't have a significant effect.

The simple flow figures are quite close, so we might want to focus on the gap in the extended flow and understand why it exists. I pushed the code changes to the perf-experiments branch. Maybe someone with more in-depth knowledge can have a look at why there is such a significant difference. From my measurements, most time is spent within Postgres.

skroll commented Oct 24, 2019

Could be related to naming every portal/prepared statement but still calling Parse, which the JDBC driver does not do. IIRC, the JDBC driver only names prepared statements that are executed X times in the same session (I think it's 5 by default; a small sketch of this follows below). After that, once it has named the statement, it no longer sends the Parse or Describe messages; it simply sends Bind, Execute, Sync.

I just tested this using Wireshark, and the behavior matches what I expected.

I have a feeling that naming a statement over and over again, and then ending with Close, incurs overhead. Postgres has a concept of "unnamed statements" which don't require the Close operation.

I think statements should only be named if they are re-used, and they shouldn't be closed each time. If the driver doesn't intend to persist named statements on the server side, they should never be named.

Also, from the postgres documentation:

The backend can keep track of multiple prepared statements and portals (but note that these exist only within a session, and are never shared across sessions). Existing prepared statements and portals are referenced by names assigned when they were created. In addition, an "unnamed" prepared statement and portal exist. Although these behave largely the same as named objects, operations on them are optimized for the case of executing a query only once and then discarding it, whereas operations on named objects are optimized on the expectation of multiple uses.
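For reference, the pgjdbc switch-over to named server-side statements is driven by the prepareThreshold setting; a minimal sketch (connection details are placeholders, and the snippet assumes a method that declares throws SQLException):

import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.postgresql.PGStatement;

// pgjdbc names a server-side prepared statement only after the same PreparedStatement
// has been executed prepareThreshold times (5 by default); after that it sends only
// Bind/Execute/Sync for this statement.
try (var connection = DriverManager.getConnection(
        "jdbc:postgresql://<hostname>:<port>/<database>?prepareThreshold=5", "<user>", "<password>");
     PreparedStatement statement = connection.prepareStatement("insert into test_table values (?)")) {

    for (int i = 0; i < 10; i++) {
        statement.setString(1, "TEST");
        statement.executeUpdate();
        // reports whether the statement has switched to a named server-side prepared statement
        System.out.println("server-prepared: " + statement.unwrap(PGStatement.class).isUseServerPrepare());
    }
}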

mp911de (Collaborator) commented Oct 25, 2019

I noticed two effects during benchmarking:

  1. Our JDBC fixture puts the connection into explicit transaction mode (auto-commit disabled) while the R2DBC connection is in auto-commit mode. Putting the R2DBC connection into explicit transaction mode yields another 1,000 queries/sec improvement (a sketch follows below).
  2. Switching the message ordering with regard to Sync makes a huge difference: going from Bind/Describe/Execute/Close/Sync to Bind/Describe/Execute/Sync/Close gives another 2,000 queries/sec improvement.

With that I end up with 13,515 vs. 17,354 queries/sec.
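For point 1, a minimal sketch of running the insert in explicit transaction mode through the R2DBC SPI (the statement and table name are placeholders):

import io.r2dbc.spi.Connection;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Wrap the insert in an explicit transaction instead of relying on auto-commit,
// matching what the JDBC fixture does. 'connection' is an already-created io.r2dbc.spi.Connection.
Mono<Void> insertInTransaction(Connection connection) {
    return Mono.from(connection.beginTransaction())
        .thenMany(Flux.from(connection.createStatement("insert into test_table values ($1)")
                .bind("$1", "TEST")
                .execute())
            .flatMap(result -> result.getRowsUpdated()))
        .then(Mono.from(connection.commitTransaction()));
}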

Squiry (Collaborator) commented Jan 8, 2020

@mp911de I think we can use NettyOutbound.sendGroups instead of those composite messages. It works just the way you want, and we already submit our messages grouped in publishers (a sketch follows the log below).
Log:

QUERY                               Executing query: SELECT value FROM test WHERE $1 <> 1
client.ReactorNettyClient           Request:  Bind{name='B_5', parameterFormats=[FORMAT_BINARY], parameters=[CompositeByteBuf(ridx: 0, widx: 4, cap: 4, components=1)], resultFormats=[], source='S_0'}
client.ReactorNettyClient           [id: 0x20d31761, L:/127.0.0.1:60344 - R:localhost/127.0.0.1:33466] WRITE: 29B
client.ReactorNettyClient           Request:  Describe{name='B_5', type=PORTAL}
client.ReactorNettyClient           [id: 0x20d31761, L:/127.0.0.1:60344 - R:localhost/127.0.0.1:33466] WRITE: 10B
client.ReactorNettyClient           Request:  Execute{name='B_5', rows=0}
client.ReactorNettyClient           [id: 0x20d31761, L:/127.0.0.1:60344 - R:localhost/127.0.0.1:33466] WRITE: 13B
client.ReactorNettyClient           Request:  Close{name='B_5', type=PORTAL}
client.ReactorNettyClient           [id: 0x20d31761, L:/127.0.0.1:60344 - R:localhost/127.0.0.1:33466] WRITE: 10B
client.ReactorNettyClient           Request:  Sync{}
client.ReactorNettyClient           [id: 0x20d31761, L:/127.0.0.1:60344 - R:localhost/127.0.0.1:33466] WRITE: 5B
client.ReactorNettyClient           [id: 0x20d31761, L:/127.0.0.1:60344 - R:localhost/127.0.0.1:33466] FLUSH
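As a sketch of that idea, assuming reactor-netty's NettyOutbound.sendGroups, which writes each inner publisher as one group and flushes once per group; the method and parameter names below are placeholders:

import io.netty.buffer.ByteBuf;
import reactor.core.publisher.Flux;
import reactor.netty.NettyOutbound;

// Each inner Flux carries the already-encoded messages of one exchange
// (e.g. Bind/Describe/Execute/Close/Sync); the whole group is written and
// flushed once instead of flushing after every message.
static NettyOutbound sendExchanges(NettyOutbound outbound, Flux<Flux<ByteBuf>> encodedExchanges) {
    return outbound.sendGroups(encodedExchanges);
}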

mp911de (Collaborator) commented Jan 8, 2020

Thanks for having a look. That looks good. We probably need another exchange method so we can choose the approach depending on the type of exchange we want to use.

The performance gains from my branch are okay, but not a terribly large increase in throughput. In particular, the Bind/Describe/Execute/Sync/Close reordering requires changes to windowing, and I'm not sure I can find the time to dig deeply enough into it. If you like, feel free to pick up my branch, introduce the changes for sendGroups, and roll back my composite message.

mp911de (Collaborator) commented Jan 30, 2020

I merged parts of the branch into the 0.8.x and master branches. These changes reduce allocations, so the error margin during benchmarks shrinks a bit. Here are the results before/after:

Before:
Benchmark                           Mode  Cnt      Score      Error  Units
StatementBenchmarks.extendedR2dbc  thrpt    5  12407,077 ± 1610,350  ops/s
StatementBenchmarks.simpleR2dbc    thrpt    5  11630,534 ±  831,796  ops/s

After:
Benchmark                           Mode  Cnt      Score     Error  Units
StatementBenchmarks.extendedR2dbc  thrpt    5  12650,548 ± 632,047  ops/s
StatementBenchmarks.simpleR2dbc    thrpt    5  11591,905 ± 383,378  ops/s

Let's have another benchmark round with sendGroups to see whether it's worth the effort.

mp911de added a commit that referenced this issue Oct 15, 2020
…CP packet

Extended flow parse+flush, execute+flush, execute+close+sync, and bind+describe messages are now sent as a single TCP packet to improve TCP efficiency.

[#138]
mp911de added a commit that referenced this issue Oct 15, 2020
…CP packet

Extended flow parse+flush, execute+flush, execute+close+sync, and bind+describe messages are now sent as a single TCP packet to improve TCP efficiency.

[#138]
mp911de (Collaborator) commented Oct 16, 2020

Related to #334, we should optimize more aggressively for the default case to combine extended flow packets if there's a single parameter binding set.

mp911de added a commit that referenced this issue Oct 16, 2020
We now enable TCP NoDelay by default to avoid delayed TCP ACKs. Our flushing behavior isn't ideal yet, but combining messages allows us to reduce the number of TCP packets.

[resolves #334][#138]
mp911de added a commit that referenced this issue Oct 16, 2020
We now enable TCP NoDelay by default to avoid delayed TCP ACKs. Our flushing behavior isn't ideal yet, but combining messages allows us to reduce the number of TCP packets.

[resolves #334][#138]
mp911de (Collaborator) commented Jan 11, 2022

ExtendedFlowDelegate now combines messages into a composite message where possible, see #373.

mp911de closed this as completed on Jan 11, 2022
mp911de added the status: superseded label on Jan 11, 2022