Document batch configuration options in run #248

Closed
coffeemug opened this issue Mar 27, 2014 · 20 comments

@coffeemug
Contributor

@wojons asked how to configure batching. @mlucy pointed out the following. We should document it properly.

So, this isn't documented anywhere, but there's a batch_conf optarg which lets you set these things. If you write query.run(batch_conf:{max_els:5, max_dur:50*1000}), you'll get a batch back as soon as 5 rows are available or 50*1000 microseconds pass, whichever happens first. (You can also use max_size to configure a maximum serialized size in bytes, which is what we use internally for most batch sizing.)
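For JavaScript readers, here is a minimal sketch of what the equivalent call might look like with the official JS driver. The nested batch_conf optarg and the old option names (max_els, max_dur) are copied from the comment above; since this was undocumented at the time, the exact spelling the driver accepts is an assumption (and is questioned later in this thread).

```js
var r = require('rethinkdb');

r.connect({host: 'localhost', port: 28015}, function (err, conn) {
  if (err) throw err;
  // Sketch only: batch_conf and its old option names come from the comment
  // above. Ask for a batch as soon as 5 rows are ready or 50,000 µs pass.
  r.table('users').run(conn, {batch_conf: {max_els: 5, max_dur: 50 * 1000}},
    function (err, cursor) {
      if (err) throw err;
      cursor.each(function (err, row) {
        if (err) throw err;
        console.log(row);
      });
    });
});
```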

@AtnNn
Member

AtnNn commented Mar 27, 2014

@mlucy I believe you left it undocumented for a reason. Is the code foolproof enough? Does it error correctly if zero or negative values are passed?

@mlucy
Member

mlucy commented Mar 27, 2014

I honestly have no idea. It wasn't originally written for external use. We should write some tests for error cases before we document it for people.

@wojons

wojons commented May 2, 2014

Does this batching also control the internal batch sizes?

@mlucy
Member

mlucy commented May 2, 2014

Yeah; it affects the intracluster batch sizes as well.

@coffeemug
Contributor Author

This has been fixed (see rethinkdb/rethinkdb#2185), so we should document it for 1.16.

coffeemug added this to the 1.16 milestone on Sep 18, 2014
@neumino
Member

neumino commented Sep 25, 2014

The changes landed in 1.15.
Shouldn't we document it now?

@coffeemug
Contributor Author

Yep, we missed this. /cc @chipotle

@chipotle
Contributor

chipotle commented Oct 1, 2014

What are the defaults for the various options? There's max_batch_rows, max_batch_bytes, max_batch_seconds, and first_batch_scaledown_factor. Also, what does that last one do? (This ticket references the first three by their old names, but not the scaledown factor.)

chipotle self-assigned this on Oct 1, 2014
@coffeemug
Contributor Author

/cc @mlucy

@chipotle
Contributor

chipotle commented Oct 1, 2014

@neumino pointed me at the source code with the defaults. One other question: I'm assuming that in the JS driver these "sub-optargs" are not translated from camelCase the way top-level optargs are, so you would write

run(conn, {batchConf: {max_batch_rows: 10}})

rather than

run(conn, {batchConf: {maxBatchRows: 10}})

Is that right?

@neumino
Member

neumino commented Oct 1, 2014

I just looked at the code; there's currently a bug in the JS driver.

We accept only batchConf, but we do not translate it. And as @chipotle said, we do not translate sub-optargs.

@neumino
Member

neumino commented Oct 1, 2014

Err sorry, I just read ast.coffee and not net.coffee.

The available options seem to be here:
rethinkdb/rethinkdb#2463 (comment)

I don't see nested options, so I'm a bit confused about what the new syntax is.
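
For concreteness, here is a sketch of a run call under the assumption (based on the linked comment) that the batch options are flat, snake_case optargs rather than a nested batchConf object; whether the JS driver translates these names is exactly the open question above.

```js
// Sketch assuming flat, top-level batch optargs per the linked #2463 comment
// (not the nested batchConf form discussed earlier in this thread):
r.table('users').run(conn, {
  max_batch_rows: 10,     // hand back at most 10 rows per batch
  max_batch_seconds: 0.5  // or whatever is ready after half a second
}, function (err, cursor) {
  if (err) throw err;
  cursor.toArray(function (err, rows) {
    if (err) throw err;
    console.log(rows.length);
  });
});
```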

@coffeemug
Contributor Author

Do we need to create an issue to keep track of the js driver bug?

@neumino
Member

neumino commented Oct 1, 2014

I'm not sure if it's a bug in the JS driver, or the actual spec.
@gchpaco should know more.

@chipotle
Contributor

chipotle commented Oct 1, 2014

I've looked at batching.cc lines 173-199; I can see what effect changing first_scaledown_factor has on the first result batch retrieved, but I don't understand the use case for changing that parameter.

@coffeemug
Contributor Author

@chipotle -- it's perceived latency. What happens is, people type a query in the repl, run it, and measure how long it takes to get the first batch. They then treat it as an indication of RethinkDB performance. Of course what really happens is that there is a tradeoff between latency and throughput, but that's not generally how people think.

So the scaledown factor gives people the first batch quickly to improve perceived latency in repl interactions, then optimizes for throughput on subsequent batches. This turns out to strike a good balance between perceived and real latency/throughput.
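
A minimal sketch of the resulting arithmetic, assuming the first batch's row cap is simply max_batch_rows divided by first_batch_scaledown_factor, clamped to at least one row (the helper name is hypothetical; batching.cc has the authoritative rounding rules):

```js
// Hypothetical helper illustrating the scaledown behavior described above.
// Assumption: first-batch cap = floor(max_batch_rows / first_batch_scaledown_factor),
// never below 1 row. See batching.cc lines 173-199 for the real logic.
function firstBatchRows(maxBatchRows, firstBatchScaledownFactor) {
  return Math.max(1, Math.floor(maxBatchRows / firstBatchScaledownFactor));
}

console.log(firstBatchRows(8, 8)); // => 1: the defaults favor perceived latency
console.log(firstBatchRows(8, 1)); // => 8: first batch behaves like later ones
```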

@chipotle
Contributor

chipotle commented Oct 2, 2014

It looks like the default values ensure that the initial batch will be one row -- max_batch_rows and first_batch_scaledown_factor are both 8. If you adjust max_batch_rows, is there guidance we should give on what to change the scaledown factor to? Is there really an advantage to making first_batch_scaledown_factor user-tunable, rather than just automatically setting it to be the same as max_batch_rows? And if so, would you ever want to set it to anything other than max_batch_rows or 1, which (I think) would make the first batch perform the same as the following batches?

chipotle removed this from the 1.16 milestone on Oct 2, 2014
@AtnNn
Member

AtnNn commented Oct 5, 2014

The batch_conf documentation was only added to the JavaScript docs. I think it should be added to the Ruby and Python docs as well. Re-opening.

AtnNn reopened this on Oct 5, 2014
@chipotle
Contributor

chipotle commented Oct 6, 2014

Argh.

@chipotle
Contributor

chipotle commented Oct 6, 2014

Code review 2181 open. (@AtnNn, I've set you as the reviewer for this one.)
