
Hanging connections; Pool issue? #57

Closed
hwchen opened this issue Oct 7, 2019 · 5 comments

hwchen (Contributor) commented Oct 7, 2019

Hello,

While using clickhouse-rs in our project, we've found that the connection to ClickHouse will sometimes hang. Although the upcoming v0.1.16 fixes one of the symptoms, I believe there may be a more fundamental issue.

The minimal example is at https://github.com/hwchen/clickhouse-hang. The basic idea is that with pool_max=1, running two queries in series causes the second query to enter reconnect logic (as seen in the logs) and then hang. In v0.1.16, a timeout error eventually occurs, freeing up the connection so a query can be sent again. In v0.1.14, the hang is indefinite.

Assuming this behavior is real, I think it's likely that our use of a pool_max much greater than 1 has been hiding it to some extent.

Let me know if you need any more details. Also, @jspeis and I are both happy to investigate further and/or work on a fix.

Sorry, I haven't yet been able to find which code in clickhouse-rs is responsible, but I wanted to point out the behavior first in case you can provide some direction.

hwchen (Contributor, Author) commented Oct 7, 2019

code, for posterity.

Cargo.toml dependencies:

clickhouse-rs = "=0.1.14"
futures = "0.1.29"
tokio = "0.1.22"
env_logger = "0.7.0"

main.rs:

use clickhouse_rs::Pool;
use futures::Future;
use std::io::{stdin, stdout, Write};

fn main() {
    // Important: set pool_max and pool_min to 1
    //let database_url = "tcp://localhost:9001?pool_max=1&pool_min=1";

    env_logger::init();

    let database_uri = std::env::args().nth(1).expect("please enter db uri");

    let pool = Pool::new(database_uri);

    let mut buf = String::new();

    loop {
        print!("press enter to run query:");
        stdout().flush().expect("could not flush stdout");

        stdin()
            .read_line(&mut buf)
            .expect("could not read line from stdin");

        let fut = pool
            .get_handle()
            .and_then(move |c| c.query("select 1;").fetch_all())
            .and_then(move |(_, block)| {
                println!("query success: {:?}", block);
                Ok(())
            })
            .map_err(|err| println!("database error: {}", err));

        tokio::run(fut);
    }
}

hwchen (Contributor, Author) commented Oct 7, 2019

logs, where

  • the program is run using cargo build && RUST_LOG=info target/debug/clickhouse-hang 'tcp://localhost:9000?pool_max=1&pool_min=1';
  • and <enter> is pressed twice, to run the query twice in succession.
press enter to run query:
[2019-10-07T17:54:31Z INFO  clickhouse_rs] [hello] -> Context { options: Options { addr: Url("localhost:9000"), database: "default", username: "default", password: "", compression: false, pool_min: 1, pool_max: 1, nodelay: true, keepalive: None, ping_before_query: true, send_retries: 3, retry_timeout: 5s, ping_timeout: 500ms, connection_timeout: 500ms, query_timeout: Some(180s), query_block_timeout: Some(180s), insert_timeout: Some(180s), execute_timeout: Some(180s) }, hostname: "mochi" }
[2019-10-07T17:54:31Z INFO  clickhouse_rs] [hello] <- ClickHouse 19.4.54416 (America/Montreal)
[2019-10-07T17:54:31Z INFO  clickhouse_rs] [ping]
[2019-10-07T17:54:31Z INFO  clickhouse_rs] [pong]
[2019-10-07T17:54:31Z INFO  clickhouse_rs::types::query_result] [send query] select 1;
query success: 
┌───┐
│ 1 │
├───┤
│ 1 │
└───┘
press enter to run query:
[2019-10-07T17:54:43Z INFO  clickhouse_rs] [ping]
[2019-10-07T17:54:43Z WARN  clickhouse_rs] [reconnect]

If run with v0.1.16 instead of v0.1.14, the following line is added (after a few minutes?):

database error: Driver error: `Timeout error.`

suharev7 (Owner) commented Oct 7, 2019

Hello,

This is incorrect usage. You shouldn't use the same connection pool with multiple runtimes (each call of tokio::run creates its own runtime). If you try to use one instance of a TCP stream in different runtimes, it will also work incorrectly.

This example should look like:

use clickhouse_rs::{errors::Error, Pool};
use futures::Future;
use std::io::{stdin, stdout, Write};

type BoxFuture<T> = Box<dyn Future<Item = T, Error = Error> + Send>;

fn loop_(pool: Pool, mut buf: String) -> BoxFuture<()> {
    print!("press enter to run query:");
    stdout().flush().expect("could not flush stdout");

    stdin()
        .read_line(&mut buf)
        .expect("could not read line from stdin");

    let fut = pool
        .get_handle()
        .and_then(move |c| c.query("select 1;").fetch_all())
        .and_then(move |(_, block)| {
            println!("query success: {:?}", block);
            Ok(())
        })
        // Chain the next iteration onto the same future so every query runs
        // inside the single runtime started by tokio::run in main.
        .and_then(move |_| loop_(pool, buf));

    Box::new(fut)
}

fn main() {
    env_logger::init();

    let database_uri = std::env::args().nth(1).expect("please enter db uri");
    let pool = Pool::new(database_uri);

    let done = loop_(pool, String::new()).map_err(|err| println!("database error: {}", err));

    tokio::run(done);
}

I think v0.1.16 solves the issues you describe.

hwchen (Contributor, Author) commented Oct 7, 2019

Ah, thanks, I didn't know that multiple runtimes would create an issue. Funnily enough, the incorrect usage seen here may not even be present in our app, since we just use the system runner in Actix; it's an artifact of trying to present a clean minimal example of the bug we're seeing in our app.

However, I did want to confirm whether v0.1.16 will fix the other problems we're seeing. We were seeing hanging across all queries with v0.1.14, even though I don't believe we're creating multiple runtimes. I had guessed that the hanging might be related to the reconnect logic, but I'm still not sure whether that's true.

Even with your fix to the example, I'm still able to trigger a reconnect by cutting the connection and then observe hanging (v0.1.16 will at least produce a timeout error, but I'm seeing about 3 minutes between the reconnect and the driver timeout error, even though the default retry timeout is 5s).
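
For reference, when testing this I've been assuming the retry-related options can be set through the connection URL the same way pool_max/pool_min are, using the field names from the Options struct printed in the log above; I haven't verified the exact parameter names against the crate, so treat this as a sketch:

use clickhouse_rs::Pool;

fn main() {
    // Hypothetical: parameter names are taken from the Options struct shown
    // in the log (send_retries, retry_timeout, connection_timeout). Whether
    // they are accepted as URL parameters like pool_max/pool_min is assumed,
    // not confirmed.
    let database_uri =
        "tcp://localhost:9000?pool_max=1&pool_min=1&send_retries=3&retry_timeout=5s&connection_timeout=500ms"
            .to_string();
    let _pool = Pool::new(database_uri);
}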

So I guess in the end I'm still asking three questions:

  • To me, the time it takes to time out seems long, but I might be misunderstanding. Is this the expected behavior on reconnect?
  • Is the change that allows the reconnect to time out with an error (v0.1.16) the right fix in this situation?
  • Is it likely that the reconnect logic could be the cause of the hanging across all queries in the app, or are there other causes that are as likely or more likely, such as ordinary database errors?

And thanks for taking the time to answer these questions; it's really helpful, and it will also help us contribute in the future.

suharev7 added a commit that referenced this issue Oct 9, 2019
suharev7 (Owner) commented:

  • By default, if the connection is cut, it should try to reconnect send_retries times, waiting retry_timeout before each subsequent retry.
  • I've reimplemented part of the connection pool, so v0.1.17 should solve this.
  • No, it's a problem related to the connection pool.
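
Roughly, the retry behaviour from the first bullet looks like this (a simplified synchronous sketch, not the actual pool code, which is asynchronous):

use std::{thread, time::Duration};

// Placeholder for a real connection attempt.
fn try_connect() -> Result<(), ()> {
    Err(())
}

// Try up to `send_retries` times, waiting `retry_timeout` before each
// subsequent attempt.
fn reconnect(send_retries: usize, retry_timeout: Duration) -> Result<(), &'static str> {
    for attempt in 1..=send_retries {
        if try_connect().is_ok() {
            return Ok(());
        }
        if attempt < send_retries {
            thread::sleep(retry_timeout);
        }
    }
    Err("Timeout error.")
}

fn main() {
    // Defaults from the log above: send_retries = 3, retry_timeout = 5s.
    if let Err(err) = reconnect(3, Duration::from_secs(5)) {
        println!("database error: Driver error: `{}`", err);
    }
}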
