
Make batch_conf option use user-friendly names if it's documented. #2185

Closed · srh opened this issue Mar 27, 2014 · 41 comments

@srh (Contributor) commented Mar 27, 2014

See rethinkdb/docs#248

max_els? max_dur? batch_conf?

All these abbreviations are cryptic, unprofessional-seeming, inconvenient for non-native speakers, and exude an aura of poor quality.

batch_conf should not have conf or configuration in its name at all; it's not configuration in any sense that other optargs aren't.

@mlucy (Member) commented Mar 27, 2014

max_els and max_dur should probably change. conf is common though. See also: .conf files, autoconf, etc.

@neumino (Member) commented Mar 27, 2014

We can just call it batch_size : )
Same number of characters :-)

@srh (Contributor, author) commented Mar 27, 2014

> conf is common though. See also: .conf files, autoconf, etc.

This has nothing to do with whether the word configuration is appropriate for an optarg name.

@srh (Contributor, author) commented Mar 27, 2014

Also, file suffixes and tool names follow a different kind of naming scheme -- just because an abbreviation there seems reasonable doesn't mean it would be reasonable in other contexts.

coffeemug added this to the 1.13-polish milestone Mar 27, 2014
@coffeemug (Contributor):

I feel like we should flatten it. There is no reason to have a nested object. One possible choice for a user-friendly version would be batch_max_length and batch_max_duration. I'd prefer that.

coffeemug modified the milestone: 1.13-polish Mar 27, 2014
@mlucy (Member) commented Mar 27, 2014

How about batch_max_size, batch_max_rows, and batch_max_time?

@coffeemug (Contributor):

Presumably you get whichever comes first?

That makes sense to me, I like that.

@AtnNn (Member) commented Mar 27, 2014

How about batch_max_bytes and batch_max_time_ms? Making the unit explicit would avoid problems such as #1365

@mlucy (Member) commented Mar 27, 2014

In this case it's microseconds, so I'd rather spell it out to make it clear:

  • batch_max_rows
  • batch_max_bytes
  • batch_max_microseconds

@coffeemug (Contributor):

I'd prefer:

  • batch_max_rows
  • batch_max_bytes
  • batch_max_us

@coffeemug (Contributor):

Err, maybe not. Now that I've typed it out, it's a bit weird. microseconds is fine.

@mlucy (Member) commented Mar 27, 2014

Actually, how about max_batch_rows instead of batch_max_rows?

@coffeemug (Contributor):

That's cool too, as long as the other ones are changed in the same way. I like that more.

@mlucy (Member) commented Apr 1, 2014

I like the most recent incarnation of this. We also need to add first_batch_scaledown_factor, though (I just looked in batching.cc).

@coffeemug (Contributor):

I also like the recent incarnation.

What's first_batch_scaledown_factor? I don't think it's a very good name :)

@mlucy (Member) commented Apr 1, 2014

We scale down the first batch to help with latency (so the first batch a cursor gets will be smaller than the other batches). This lets users show their users something as soon as possible.

@coffeemug (Contributor):

So the scaledown factor applies to all other constants (rows, bytes, microseconds)? In that case the name is ok IMO.

@mlucy (Member) commented Apr 1, 2014

I'm actually not 100% sure -- @danielmewes, what does the scaledown factor apply to?

@danielmewes (Member):

first_batch_scaledown_factor applies to the max rows and the max bytes, but not to the max microseconds.
first_batch_downsizing_factor would be a slightly better name, though it's still pretty long.
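
For illustration, a minimal sketch of that behavior, with hypothetical names and defaults rather than the actual batching.cc code:

```python
# Hypothetical sketch: the scaledown factor shrinks the row and byte
# limits for the first batch, but leaves the time limit alone.
def first_batch_limits(max_rows, max_bytes, max_micros, scaledown_factor=4):
    return {
        "max_rows": max(1, max_rows // scaledown_factor),    # scaled down
        "max_bytes": max(1, max_bytes // scaledown_factor),  # scaled down
        "max_micros": max_micros,                            # not scaled down
    }

first_batch_limits(max_rows=1000, max_bytes=1 << 20, max_micros=500_000)
# -> {'max_rows': 250, 'max_bytes': 262144, 'max_micros': 500000}
```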

@coffeemug (Contributor):

I guess there is no great name here, we just have to document things well. I think either of the two is fine (though I have a preference for scaledown).

@srh (Contributor, author) commented Apr 2, 2014

> first_batch_downsizing_factor would be a slightly better name, though it's still pretty long.

That has negative associations I think we should avoid.

Why is it a factor and not simply first_batch_size? I don't mean to nitpick an obscure API that'll disappear if we ever do batching appropriately, but (I can't help myself) -- increasing the batch size doesn't immediately suggest that you'd want to increase the first batch proportionately. As a general rule (one that's true 70% of the time), it's not great to have API options representing proportions of one another.

@danielmewes (Member):

@srh: The main reason for it being a factor is that the batch size is already divided down internally, to account for the fact that we have multiple shards. Having the first batch size as a factor means that those size-reduction operations can just be chained together and we don't have to explicitly handle the first_batch_size when we go to the individual shards. It just makes the implementation simpler.

We could certainly change that for a user-facing API. If the user doesn't specify first_batch_size, would it still be implicitly downscaled from the regular batch size?
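
For illustration, a rough sketch of why the factor composes nicely with the per-shard division (hypothetical code, not the real implementation):

```python
# Hypothetical sketch: because the first-batch adjustment is a factor,
# it is just one more division chained onto the per-shard split, instead
# of a separate first_batch_size tracked for each of the four limits.
def per_shard_row_limit(max_rows, num_shards, is_first_batch,
                        first_batch_scaledown_factor=4):
    rows = max_rows / num_shards               # split the budget across shards
    if is_first_batch:
        rows /= first_batch_scaledown_factor   # chain one more reduction
    return max(1, int(rows))

per_shard_row_limit(1000, num_shards=8, is_first_batch=False)  # -> 125
per_shard_row_limit(1000, num_shards=8, is_first_batch=True)   # -> 31
```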

@srh (Contributor, author) commented Apr 2, 2014

@danielmewes: That doesn't make any sense. At some point the first batch factor gets combined with the batch size on the first batch. That would require passing it down to the same place that the first batch size would have been passed down.

@danielmewes (Member):

@srh: Yes, but the factor is one value, while first_batch_size is four values (min_els, max_els, max_size, max_dur).

@danielmewes (Member):

@coffeemug It's not a lot of work.

The question remains what we do if somebody specifies some of the options, but not the corresponding first_batch_* ones. Would we still scale those down by some hard-coded factor? Or would we keep the first batch unchanged, even if it ends up being the same or even larger than the later (explicitly specified) batches?
The scaledown factor is a bit nicer in this respect because it doesn't require users to take care of the size of the first batch if they don't want to. At the same time, it makes the first batch's size adapt in a reasonable and consistent way if users change some of the other batch parameters.

@danielmewes (Member):

Actually, scaling the regular size parameters down by a hard-coded factor unless the user overrides them through one of the first_batch_* parameters isn't unreasonable.

@coffeemug (Contributor):

I don't think we should scale it down by a hardcoded factor. I think we should just specify defaults for all these values. If the user changes some options, I'd leave the corresponding first_batch options as is. I don't think it would hurt anybody, it's easy to understand and transparent, and it's easy to change if you don't like the values.

@danielmewes (Member):

@coffeemug The problem is that users have to be aware of the existence of the first_batch options and what they do, or they might be surprised when changing some batch option but not seeing any impact on the first batch.
For example somebody specifying batch_max_rows: 1 would probably expect to only receive single-document batches. But unless they also set first_batch_max_rows: 1 that wouldn't be the case.
I don't know, I feel like not having an automatic scale down makes the interface more cumbersome to use.
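
In illustrative Python driver syntax (the optarg names were still in flux at this point, and first_batch_max_rows is hypothetical):

```python
# A user asking for single-row batches...
cursor = r.table("users").run(conn, batch_max_rows=1)
# ...would still get a first batch governed by the first_batch_max_rows
# default, which may contain more than one document.
```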

@coffeemug (Contributor):

Err, I see. What if we define missing first_batch_x to mean min(first_batch_x_default, batch_max_x)?
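
As a quick sketch of that rule for a single limit x (FIRST_BATCH_X_DEFAULT being a hypothetical built-in default):

```python
# An explicit first_batch_x wins; otherwise cap the default at batch_max_x.
if first_batch_x is None:
    first_batch_x = min(FIRST_BATCH_X_DEFAULT, batch_max_x)
```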

@danielmewes (Member):

Using the min is another option. I think having a constant factor unless the value is explicitly overridden is slightly better, but I would be ok with either.

@mlucy (Member) commented Apr 5, 2014

You know, I sort of think the scaledown factor might be better. It has the behavior you want if people don't bother to specify it (since it's a factor rather than a raw value), and I'm having trouble thinking of times when I want to adjust just one maximum value for the first batch. 4 options instead of 6 with more intuitive behavior seems like a win, especially since I think 3 of those 6 options would see very little use. The scaledown approach is also more flexible if we switch to some sort of smarter batch sizing in the future.

(There's also the fact that it's already implemented.)

@coffeemug (Contributor):

I sort of disagree, but whatever you guys decide I won't stand in your way.

@srh (Contributor, author) commented Apr 5, 2014

It's an implementation-detail option anyway; in the long run it'll change request-to-request, get increased until increasing it no longer speeds things up, or be specifiable live on the cursor. (Or we'll just have a push API.)

@mlucy (Member) commented Apr 7, 2014

Alright, let's just keep the scaledown factor. Marking as settled with these names:

  • max_batch_rows
  • max_batch_bytes
  • max_batch_seconds (can be a double, we'll scale to microseconds internally)
  • first_batch_scaledown_factor
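
For illustration, here is how these could be passed as global optargs from the Python driver (a sketch; treat the exact syntax as an assumption, since the thread doesn't pin it down):

```python
import rethinkdb as r

conn = r.connect("localhost", 28015)

# Sketch of the settled optargs, passed per query via run().
cursor = r.table("events").run(
    conn,
    max_batch_rows=500,              # cap on documents per batch
    max_batch_bytes=1024 * 1024,     # cap on bytes per batch
    max_batch_seconds=0.5,           # a double; scaled to microseconds internally
    first_batch_scaledown_factor=4,  # shrink the first batch to cut latency
)
```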

coffeemug modified the milestones: subsequent, 1.13-polish Jun 12, 2014
coffeemug assigned gchpaco and unassigned mlucy Aug 20, 2014
gchpaco pushed a commit that referenced this issue Sep 12, 2014
Per #2185.  ReQL tests are not patched yet, but it passes its unit tests.
gchpaco pushed a commit that referenced this issue Sep 18, 2014
Fixes #2185 for next.  In CR 2083.
@coffeemug (Contributor):

@gchpaco -- in the future, please reference the relevant issue in the docs tracker so we don't forget to document the API in time for release.

@gchpaco (Contributor) commented Sep 18, 2014

Whoops, sorry.

@neumino (Member) commented Sep 18, 2014

For docs purposes, are the constraints strict, or do we only try to comply with them as much as possible?

If I use max_batch_rows=2, am I guaranteed to never get more than 2 elements per batch?

@mlucy (Member) commented Sep 18, 2014

They're hints which the server is free to ignore.
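
In practice that means client code shouldn't rely on batch boundaries; the safe pattern is plain cursor iteration, letting the driver fetch batches as needed:

```python
# Batch sizes are hints only, so consume the cursor document by document
# rather than assuming any particular batch shape.
for doc in r.table("events").run(conn, max_batch_rows=2):
    handle(doc)  # hypothetical per-document callback
```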

AtnNn modified the milestones: 1.15, subsequent Sep 25, 2014