
Hanging connections; Pool issue? #57

Closed
hwchen opened this issue Oct 7, 2019 · 5 comments

hwchen (Contributor) commented Oct 7, 2019

Hello,

While using clickhouse-rs in our project, we've found that the connection to ClickHouse will sometimes hang. Although the upcoming v0.1.16 fixes one of the symptoms, I believe there may be a more fundamental issue.

The minimal example is at https://github.com/hwchen/clickhouse-hang. The basic idea is that with pool_max=1, running two queries in series causes the second query to enter reconnect logic (as seen in the logs) and then hang. In v0.1.16, a timeout error eventually occurs, freeing up the connection so a query can be sent again. In v0.1.14, the hang is indefinite.

Assuming this behavior is real, I think it's likely that our use of a pool_max much greater than 1 has been hiding it to some extent.

Let me know if you need any more details. Also, @jspeis and I are both happy to investigate further and/or work on a fix.

Sorry, I haven't yet been able to find which code in clickhouse-rs is responsible, but I wanted to point out the behavior first in case you can provide some direction.

hwchen (Contributor, Author) commented Oct 7, 2019

code, for posterity.

Cargo.toml dependencies:

clickhouse-rs = "=0.1.14"
futures = "0.1.29"
tokio = "0.1.22"
env_logger = "0.7.0"

main.rs:

use clickhouse_rs::Pool;
use futures::Future;
use std::io::{stdin, stdout, Write};

fn main() {
    // Important: set pool_max and pool_min to 1
    //let database_url = "tcp://localhost:9001?pool_max=1&pool_min=1";

    env_logger::init();

    let database_uri = std::env::args().nth(1).expect("please enter db uri");

    let pool = Pool::new(database_uri);

    let mut buf = String::new();

    loop {
        print!("press enter to run query:");
        stdout().flush().expect("could not flush stdout");

        stdin()
            .read_line(&mut buf)
            .expect("could not read line from stdin");

        let fut = pool
            .get_handle()
            .and_then(move |c| c.query("select 1;").fetch_all())
            .and_then(move |(_, block)| {
                println!("query success: {:?}", block);
                Ok(())
            })
            .map_err(|err| println!("database error: {}", err));

        tokio::run(fut);
    }
}

hwchen (Contributor, Author) commented Oct 7, 2019

logs, where

  • the program is run using cargo build && RUST_LOG=info target/debug/clickhouse-hang 'tcp://localhost:9000?pool_max=1&pool_min=1';
  • and <enter> is pressed twice, to run the query twice in succession.
press enter to run query:
[2019-10-07T17:54:31Z INFO  clickhouse_rs] [hello] -> Context { options: Options { addr: Url("localhost:9000"), database: "default", username: "default", password: "", compression: false, pool_min: 1, pool_max: 1, nodelay: true, keepalive: None, ping_before_query: true, send_retries: 3, retry_timeout: 5s, ping_timeout: 500ms, connection_timeout: 500ms, query_timeout: Some(180s), query_block_timeout: Some(180s), insert_timeout: Some(180s), execute_timeout: Some(180s) }, hostname: "mochi" }
[2019-10-07T17:54:31Z INFO  clickhouse_rs] [hello] <- ClickHouse 19.4.54416 (America/Montreal)
[2019-10-07T17:54:31Z INFO  clickhouse_rs] [ping]
[2019-10-07T17:54:31Z INFO  clickhouse_rs] [pong]
[2019-10-07T17:54:31Z INFO  clickhouse_rs::types::query_result] [send query] select 1;
query success: 
┌───┐
│ 1 │
├───┤
│ 1 │
└───┘
press enter to run query:
[2019-10-07T17:54:43Z INFO  clickhouse_rs] [ping]
[2019-10-07T17:54:43Z WARN  clickhouse_rs] [reconnect]

If run with v0.1.16 instead of v0.1.14, the following line is added (after a few minutes?):

database error: Driver error: `Timeout error.`

suharev7 (Owner) commented Oct 7, 2019

Hello,

This is incorrect usage. You shouldn't use the same connection pool with multiple runtimes (each call of tokio::run creates its own runtime). If you try to use one instance of a TCP stream in different runtimes, it will also work incorrectly.

This example should look like:

use clickhouse_rs::{errors::Error, Pool};
use futures::Future;
use std::io::{stdin, stdout, Write};

type BoxFuture<T> = Box<dyn Future<Item = T, Error = Error> + Send>;

fn loop_(pool: Pool, mut buf: String) -> BoxFuture<()> {
    print!("press enter to run query:");
    stdout().flush().expect("could not flush stdout");

    stdin()
        .read_line(&mut buf)
        .expect("could not read line from stdin");

    let fut = pool
        .get_handle()
        .and_then(move |c| c.query("select 1;").fetch_all())
        .and_then(move |(_, block)| {
            println!("query success: {:?}", block);
            Ok(())
        })
        // Chain the next iteration onto the same future so every query runs
        // inside the single runtime started by tokio::run in main.
        .and_then(move |_| loop_(pool, buf));

    Box::new(fut)
}

fn main() {
    env_logger::init();

    let database_uri = std::env::args().nth(1).expect("please enter db uri");
    let pool = Pool::new(database_uri);

    let done = loop_(pool, String::new()).map_err(|err| println!("database error: {}", err));

    tokio::run(done);
}

I think v0.1.16 solves the issues you describe.

hwchen (Contributor, Author) commented Oct 7, 2019

Ah, thanks, I didn't know that multiple runtimes would create an issue. Funnily enough, the incorrect usage seen here may not even be present in our app, since we just use the system runner in Actix; it's an artifact of trying to present a clean minimal example of the bug we're seeing in our app.

However, I did want to confirm whether v0.1.16 will fix the other problems we're seeing. We were seeing hanging across all queries with v0.1.14, even though I don't believe we're creating multiple runtimes. I had guessed that the hanging might be related to the reconnect logic, but I'm still not sure whether that's true.

Even with your fix to the example, I'm still able to trigger a reconnect by cutting the connection and then observe hanging (v0.1.16 will at least produce a timeout error, but I'm seeing about 3 minutes between the reconnect and the driver timeout error, even though the default retry timeout is 5s).
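
For reference, when testing this I've been assuming the retry-related options can be set through the connection URL the same way pool_max/pool_min are, using the field names from the Options struct printed in the log above; I haven't verified the exact parameter names against the crate, so treat this as a sketch:

use clickhouse_rs::Pool;

fn main() {
    // Hypothetical: parameter names are taken from the Options struct shown
    // in the log (send_retries, retry_timeout, connection_timeout). Whether
    // they are accepted as URL parameters like pool_max/pool_min is assumed,
    // not confirmed.
    let database_uri =
        "tcp://localhost:9000?pool_max=1&pool_min=1&send_retries=3&retry_timeout=5s&connection_timeout=500ms"
            .to_string();
    let _pool = Pool::new(database_uri);
}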

So I guess in the end I'm still asking three questions:

  • To me, the time it takes to time out seems long, but I might be misunderstanding. Is this the expected behavior on reconnect?
  • Is the change that allows the reconnect to time out with an error (v0.1.16) the right fix in this situation?
  • Is it likely that the reconnect logic could be the cause of the hanging across all queries in the app, or are there other causes that are as likely or more likely, such as ordinary database errors?

And thanks for taking the time to answer these questions; it's really helpful, and it will also help us contribute in the future.

suharev7 added a commit that referenced this issue Oct 9, 2019
suharev7 (Owner) commented:

  • By default, if the connection is cut, it should try to reconnect send_retries times, waiting retry_timeout before each subsequent retry.
  • I've reimplemented part of the connection pool, so v0.1.17 should solve this.
  • No, it's a problem related to the connection pool.
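
Roughly, the retry behaviour from the first bullet looks like this (a simplified synchronous sketch, not the actual pool code, which is asynchronous):

use std::{thread, time::Duration};

// Placeholder for a real connection attempt.
fn try_connect() -> Result<(), ()> {
    Err(())
}

// Try up to `send_retries` times, waiting `retry_timeout` before each
// subsequent attempt.
fn reconnect(send_retries: usize, retry_timeout: Duration) -> Result<(), &'static str> {
    for attempt in 1..=send_retries {
        if try_connect().is_ok() {
            return Ok(());
        }
        if attempt < send_retries {
            thread::sleep(retry_timeout);
        }
    }
    Err("Timeout error.")
}

fn main() {
    // Defaults from the log above: send_retries = 3, retry_timeout = 5s.
    if let Err(err) = reconnect(3, Duration::from_secs(5)) {
        println!("database error: Driver error: `{}`", err);
    }
}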
