
Best practices using pool with promises/future #83

Closed

davidchall opened this issue Feb 25, 2019 · 9 comments


@davidchall

In the documentation for the promises package, it states:

Future code blocks cannot use resources such as database connections and network sockets that were created in the parent process. This is true regardless of what future implementation you use! Even if it seems to work with a simple test, you are asking for crashes or worse by sharing these kinds of resources across processes.

Instead, make sure you create, use, and destroy such resources entirely within the scope of the future code block.

This means they are recommending to write future code blocks like:

future({
  conn <- DBI::dbConnect(RMySQL::MySQL(), ...)
  result <- tbl(conn, "flights") %>%
    group_by(dest) %>%
    summarise(delay = mean(dep_time)) %>%
    collect()
  DBI::dbDisconnect(conn)
  result
})

I'd like to retain the pool package to manage database connections in my Shiny app, if possible.

If I set a minimum pool size of zero and an idle timeout of zero, does this ensure that a new database connection is instantiated for each future code block? (Thereby satisfying the requirements of the promises package.)

For example:

pool <- pool::dbPool(
  RMySQL::MySQL(),
  minSize = 0,
  maxSize = 5,
  idleTimeout = 0,
  ...
)
future({
  tbl(pool, "flights") %>%
    group_by(dest) %>%
    summarise(delay = mean(dep_time)) %>%
    collect()
})
pool::poolClose(pool)
@r2evans

r2evans commented Apr 17, 2019

@davidchall, have you had luck playing with and/or using the above code chunk? I have a project where parallelization (with future/promises) is a must-have, and I'd really like to be more parallel-friendly with something like your code above. (I have not tested it; I'm asking about lessons learned since then so that I don't have to learn them, too. :-)

@antoine-sachet

It seems to be working! I have successfully used an external pool (with RMariaDB backend) with minSize = 0 and idleTimeout = 0 in a future.

Just in case, I use the following option to make sure I fail fast if a db connection is passed to the future (which happens when idleTimeout > 0).

# Fail early if a non-exportable object (db con) is passed to future
options(future.globals.onReference = "error")

With this option, the code fails straight away if the pool contains a db connection, with the error:

Error: Detected a non-exportable reference (‘externalptr’) in one of the globals (‘db_pool’ of class ‘Pool’) used in the future expression

I will report back if I observe errors under load.

@Tazovsky

Remember that without dplyr::collect() at the end of the query, the future will fail, because the lazy result loses its connection to the pool once the worker finishes:

future::plan(multisession)

db_res <- future::future({
  connection <- pool::dbPool(
    RPostgres::Postgres(),
    host = host,
    user = user,
    password = password,
    dbname = dbname,
    port = port,
    minSize = 0,
    maxSize = Inf,
    validationInterval = 15
  )

  # does not work: returns a lazy tbl whose connection dies with the worker
  # get_data_from_db(connection)

  # works: collect() materializes the result before the future resolves
  get_data_from_db(connection) %>%
    dplyr::collect()
})

future::value(db_res)

@antoine-sachet

@Tazovsky If you define the pool in the future, you also need to close it from within.

By the way, to avoid leaking connections when you turn off the API, you need to close the pool in on.exit.

I would suggest that calling on.exit(pool::poolClose(db_pool)) immediately after creating the pool is best practice, regardless of where you create the pool.
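
A minimal sketch of that pattern, wrapping the work in a function so that on.exit() has a frame to attach to (the query and connection parameters here are placeholders, not from this thread):

query_flights <- function() {
  pool <- pool::dbPool(
    RPostgres::Postgres(),
    host = "localhost", dbname = "mydb",
    user = Sys.getenv("DBUSER"), password = Sys.getenv("DBPASS")
  )
  # register the cleanup immediately, so the pool is closed even if the query errors
  on.exit(pool::poolClose(pool), add = TRUE)
  DBI::dbGetQuery(pool, "SELECT dest, COUNT(*) AS n FROM flights GROUP BY dest")
}

f <- future::future(query_flights())
res <- future::value(f)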

@Tazovsky

Yes, you are absolutely right. Thanks for the reminder ;)

@r2evans

r2evans commented Oct 18, 2019

Correct me if I'm wrong, but ... one commonly-used mechanism in shiny apps is that the block would be reacting to inputs and/or other events, so the connection might (and probably will) be needed again. Opening and closing a connection every time can become a bottleneck (on both ends, though mostly for the client).

I answered a question on Stack Overflow some time ago (https://stackoverflow.com/a/57873387/3358272) with a "checkout/return" mechanism. It's imperfect (I'm still debugging leaked connections), but it keeps a pool alive within each future process, however many of them the parallelization backend creates. Using that example, I'd recommend something like:

if (!exists("mypool")) assign("mypool", do.call(pool::dbPool, cred), envir = .GlobalEnv)
con <- pool::poolCheckout(mypool)
# register the return immediately, so the connection goes back even on error
on.exit(pool::poolReturn(con), add = TRUE)
# do database stuff here

You could do this manually in every future-able block you use, or write a wrapper that does it for you (see the sketch after the next paragraph). Assigning mypool to the global environment is intentional, to give it the best chance of surviving whatever environment scoping shiny and/or future implicitly impose on the executed code.

The first time that block runs, assuming cred was already defined in the shiny global or server environment (and passed to the sub-process via future's implicit recognition of global variable references), mypool does not exist and so is created. From then on, each call to the chunk sees mypool and checks out a connection. If it times out, pool should internally handle reconnection. (The next step in this code would be some basic error checking to ensure other errors don't occur, such as unknown host, authentication failure, etc.)
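
A wrapper along those lines might look like this (a sketch only; the name with_pooled_con and the callback shape are my own, not from the SO answer):

# assumes `cred` is a list of pool::dbPool() arguments, as shown below
with_pooled_con <- function(f) {
  if (!exists("mypool", envir = .GlobalEnv)) {
    assign("mypool", do.call(pool::dbPool, cred), envir = .GlobalEnv)
  }
  con <- pool::poolCheckout(get("mypool", envir = .GlobalEnv))
  # return the connection whether or not `f` succeeds
  on.exit(pool::poolReturn(con), add = TRUE)
  f(con)
}

# usage inside a future block:
# future({ with_pooled_con(function(con) DBI::dbGetQuery(con, "SELECT 1")) })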

The cred object should likely be created in the parent (shiny app global environment), and look something like:

cred <- list(
  drv = odbc::odbc(),  # dbPool() needs a DBI driver object; odbc matches the driver string below
  driver = "PostgreSQL ANSI(x64)",
  uid = "myname", pwd = Sys.getenv("DBPASS"),
  host = "serverip", port = 5432
)

Or, better yet, use the config package so that you can differentiate between deployed/production and test/dev environments. (I'm making a leap with that last recommendation.) One such config.yml file could be:

development:
  db:
    driver: "PostgreSQL ANSI(x64)"
    server: "127.0.0.1"
    port: 25432
    database: devdb
    user: postgres
    password: mysecret

rsconnect:
  db:
    driver: "PostgreSQL ANSI(x64)"
    server: "11.22.33.44"
    port: 5432
    database: proddb
    user: !expr Sys.getenv("dbuser")
    password: !expr Sys.getenv("dbpass")

This capitalizes on RStudio's recommendations (in this case for an RStudio Connect deployment, but frankly it doesn't matter as long as your production and development environments set an R_CONFIG_ACTIVE env var; ref https://github.com/rstudio/config). Once that is in a config.yml file in the shiny dir or one of its parent dirs, you can add this to your shiny app (global or server environment):

cred <- config::get("db")
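
From there, those values can be spliced into dbPool(); a sketch, assuming the odbc backend implied by the driver string above:

mypool <- pool::dbPool(
  odbc::odbc(),
  driver   = cred$driver,
  server   = cred$server,
  port     = cred$port,
  database = cred$database,
  uid      = cred$user,
  pwd      = cred$password
)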

@lmeninato

> It seems to be working! I have successfully used an external pool (with RMariaDB backend) with minSize = 0 and idleTimeout = 0 in a future. […]

Is there a way to use idleTimeout > 0 with futures/promises? With idleTimeout = 0, the pool essentially opens and closes a connection for every query. It would be great if there were an R6 method like emptyPool to get around this.

Possible usage:

pool <- pool::dbPool(...)

f <- future::future({
  # definitely needed, so that connections created in the parent process are not used
  pool$emptyPool()

  # insert code running in the forked process (or however defined by plan()) here

  # might be unnecessary? is it sharing the pool object from the outer scope,
  # or was the object copied?
  pool$emptyPool()
})

Maybe there could be a PR that handles this? Obviously the ideal thing would be for the pool to detect that it was passed to a future object (I believe it gets copied; otherwise there would be race conditions everywhere) and then flush its pool of database connections. But I'm sure I could hack together something that closes all the database connections and rebuilds minSize connections.

@zmbc

zmbc commented Apr 2, 2021

For anyone looking at this in the future, note there are two things going on here:

  • People asking for a way to use {pool} to essentially cache credentials while still creating a database connection for each future. In this case the pool is not actually pooling anything, and performance is the same as not using a pool at all. This can be done with minSize = 0 and idleTimeout = 0 (I also recommend taking the fail-fast advice above to catch errors); see the sketch below this list.
  • People asking for a way to actually pool database connections between futures (and hence across multiple R processes). This is not possible due to limitations in the database packages themselves. See my answer here for Postgres: https://stackoverflow.com/a/66664783/5981641
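
A minimal sketch of the first pattern (backend, credentials, and table name are placeholders, not from this thread):

future::plan(future::multisession)

# fail fast if a live connection sneaks into the future's globals
options(future.globals.onReference = "error")

pool <- pool::dbPool(
  RMariaDB::MariaDB(),              # any DBI backend works here
  dbname = "mydb",
  username = Sys.getenv("DBUSER"),
  password = Sys.getenv("DBPASS"),
  minSize = 0,                      # keep no idle connections in the pool
  idleTimeout = 0                   # close each connection as soon as it is returned
)

f <- future::future({
  # each future opens (and closes) its own connection through the pool
  DBI::dbGetQuery(pool, "SELECT COUNT(*) AS n FROM flights")
})
future::value(f)

pool::poolClose(pool)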

@hadley
Member

hadley commented Jan 5, 2023

Thanks for the great summary @zmbc!
