New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Modifier functions #5813

Closed
danielmewes opened this Issue Jun 1, 2016 · 23 comments

Comments

Projects
None yet
8 participants
@danielmewes
Member

danielmewes commented Jun 1, 2016

Intro

Modifier functions are the first proposal out of two, which together will allow users to set up expiration / TTL (#746). I'll post the second part shortly.

A modifier function is a ReQL function that can be configured on a given table. A modifier function is applied to each write, and can:

  • perform schema validation
  • rewrite documents or add additional fields, for example a lastModificationTs field to implement expiration (TTL)
  • implement security rules, such as insert-only tables. In the future, we can extend this concept to expose a user context with the user name and user groups to the modifier function.

Specification

The modifier function has the signature (oldVal : object, newVal : object) -> object. Each of the objects can also be null.

A modifier function has to be deterministic.

Whenever a write operation on the table inserts, deletes or modifies a given document, the modifier function will be called with the old value of the document (or null on inserts) and the new value of the document (or null on deletes). It then returns the value that should actually be inserted and/or replaced instead of newVal. It can also return r.error(...) to abort the write.

To simplify things, I think we should require the modifier function to return null exactly iff newVal is null. That way we make sure that it doesn't turn an insert/update into a deletion, and doesn't turn a deletion into an update, which could be confusing and might break some assumptions in our code.

Configuration API

Proposal for an API for setting a modifier function on a given table:

  • tbl.setModifier(function) sets the modifier function or overwrites it if one already exists
  • tbl.setModifier(null) deletes the modifier function
  • tbl.getModifier() gets a string representation of the current modifier function, or null

The setModifier term requires "config" permissions on tbl.

This API could be more straight-forward if we had a ReQL query pseudo-type ( #3636 ), but I think that would become too much for 2.4. The current API therefore is loosely inspired by the secondary index API.

Examples for modifier functions:

// Inject a `lastModified` timestamp
tbl.setModifier(function(oldVal, newVal) {
    return newVal.merge({lastModified: r.now}).default(null);
  })

(Note that r.now is the deterministic r.now that's filled in when the write query triggering this is compiled, not when the modifier is set. This might require some extra plumbing to work.)

// Verify that the documents have certain fields.
tbl.setModifier(function(oldVal, newVal) {
    return r.branch(newVal.hasFields("field1", "field2"), newVal, r.error("Missing fields")).default(null);
  })

@danielmewes danielmewes added this to the 2.4 milestone Jun 1, 2016

@deontologician

This comment has been minimized.

Show comment
Hide comment
@deontologician

deontologician Jun 1, 2016

Contributor

I think we should require the modifier function to return null exactly iff newVal is null

Or r.error

Contributor

deontologician commented Jun 1, 2016

I think we should require the modifier function to return null exactly iff newVal is null

Or r.error

@danielmewes

This comment has been minimized.

Show comment
Hide comment
@danielmewes

danielmewes Jun 1, 2016

Member

Or r.error

Ah right. It can always throw or return errors.

Member

danielmewes commented Jun 1, 2016

Or r.error

Ah right. It can always throw or return errors.

@mlucy

This comment has been minimized.

Show comment
Hide comment
@mlucy

mlucy Jun 1, 2016

Member

A modifier function has to be deterministic.

Truly deterministic or per-node deterministic? (I.e. can you use geo functions?)

Also, are we making an explicit exception for the current time? If we're OK making an exception for the current time, why do we need the determinism constraint at all?

setModifier and getModifier

The modifiers should presumably also live in an artificial table somewhere, so that people can e.g. subscribe to changes on them. Maybe on table_config alongside the indexes?

Also, do we want to back up and restore modifiers when doing import/export the way we do with indexes?

tbl.setModifier(function(oldVal, newVal)

The signature should maybe be (id, oldVal, newVal) for consistency with insert conflict functions.

Note that r.now is the deterministic r.now that's filled in when the write query triggering this is compiled, not when the modifier is set. This might require some extra plumbing to work.

How are the semantics of r.now going to change exactly? Are we going to move to r.now being a non-deterministic term that always evaluates to the current time? Or is the proposal to special-case it inside of modifier functions?

I think it would be better if we could keep a consistent meaning for r.now everywhere -- it would be extremely confusing if it meant something different in normal functions vs. modifier functions vs. conflict resolution functions vs. secondary index functions vs. etc.

Member

mlucy commented Jun 1, 2016

A modifier function has to be deterministic.

Truly deterministic or per-node deterministic? (I.e. can you use geo functions?)

Also, are we making an explicit exception for the current time? If we're OK making an exception for the current time, why do we need the determinism constraint at all?

setModifier and getModifier

The modifiers should presumably also live in an artificial table somewhere, so that people can e.g. subscribe to changes on them. Maybe on table_config alongside the indexes?

Also, do we want to back up and restore modifiers when doing import/export the way we do with indexes?

tbl.setModifier(function(oldVal, newVal)

The signature should maybe be (id, oldVal, newVal) for consistency with insert conflict functions.

Note that r.now is the deterministic r.now that's filled in when the write query triggering this is compiled, not when the modifier is set. This might require some extra plumbing to work.

How are the semantics of r.now going to change exactly? Are we going to move to r.now being a non-deterministic term that always evaluates to the current time? Or is the proposal to special-case it inside of modifier functions?

I think it would be better if we could keep a consistent meaning for r.now everywhere -- it would be extremely confusing if it meant something different in normal functions vs. modifier functions vs. conflict resolution functions vs. secondary index functions vs. etc.

@mlucy

This comment has been minimized.

Show comment
Hide comment
@mlucy

mlucy Jun 1, 2016

Member

Also, we might want to consider a name other than "modifier function". It's not at all clear to me that I would want to use a modifier function for schema validation. What do you think about a name like "write_hook" or "pre_write_hook" instead?

Member

mlucy commented Jun 1, 2016

Also, we might want to consider a name other than "modifier function". It's not at all clear to me that I would want to use a modifier function for schema validation. What do you think about a name like "write_hook" or "pre_write_hook" instead?

@danielmewes

This comment has been minimized.

Show comment
Hide comment
@danielmewes

danielmewes Jun 1, 2016

Member

@mlucy We need full determinism because these functions will be executed on each replica while holding the lock on the leaf node. They have the same constraints as update functions.

The modifiers should presumably also live in an artificial table somewhere, so that people can e.g. subscribe to changes on them. Maybe on table_config alongside the indexes?

Yea, that sounds good to me.

Also, do we want to back up and restore modifiers when doing import/export the way we do with indexes?

Good point, I think yes. We just need to make sure that we only set them after we're done with the data insert.

The signature should maybe be (id, oldVal, newVal) for consistency with insert conflict functions.

Makes sense.

About r.now: I think it's consistent, but with the one subtlety that the modifier function will be executed with the "r.now environment" of the query that causes the write, not the environment that existed when setting the modifier.
The only other place where we currently can store functions and where this is an issue is in secondary indexes. Our current behavior for secondary index functions is inconsistent to this, but our it also doesn't make sense and doesn't seem to be what people are expecting. We should probably just disallow r.now in secondary index functions as part of this, and the behavior would be consistent again afaikt.
You could describe it as: r.now in a query is evaluated based on an r.now value in a hidden query environment. The r.now value in the environment is set once when the query gets compiled.
With the exception of secondary index functions, I believe this is equivalent to our current behavior, even if this is not how it is implemented.

Member

danielmewes commented Jun 1, 2016

@mlucy We need full determinism because these functions will be executed on each replica while holding the lock on the leaf node. They have the same constraints as update functions.

The modifiers should presumably also live in an artificial table somewhere, so that people can e.g. subscribe to changes on them. Maybe on table_config alongside the indexes?

Yea, that sounds good to me.

Also, do we want to back up and restore modifiers when doing import/export the way we do with indexes?

Good point, I think yes. We just need to make sure that we only set them after we're done with the data insert.

The signature should maybe be (id, oldVal, newVal) for consistency with insert conflict functions.

Makes sense.

About r.now: I think it's consistent, but with the one subtlety that the modifier function will be executed with the "r.now environment" of the query that causes the write, not the environment that existed when setting the modifier.
The only other place where we currently can store functions and where this is an issue is in secondary indexes. Our current behavior for secondary index functions is inconsistent to this, but our it also doesn't make sense and doesn't seem to be what people are expecting. We should probably just disallow r.now in secondary index functions as part of this, and the behavior would be consistent again afaikt.
You could describe it as: r.now in a query is evaluated based on an r.now value in a hidden query environment. The r.now value in the environment is set once when the query gets compiled.
With the exception of secondary index functions, I believe this is equivalent to our current behavior, even if this is not how it is implemented.

@deontologician

This comment has been minimized.

Show comment
Hide comment
@deontologician

deontologician Jun 1, 2016

Contributor

What if the modifier function itself has to be deterministic, but you're allowed to return an object that has non-deterministic terms in it, which are evaluated at the time the modifier function completes without an error.

Contributor

deontologician commented Jun 1, 2016

What if the modifier function itself has to be deterministic, but you're allowed to return an object that has non-deterministic terms in it, which are evaluated at the time the modifier function completes without an error.

@danielmewes

This comment has been minimized.

Show comment
Hide comment
@danielmewes

danielmewes Jun 1, 2016

Member

@deontologician That wouldn't work. The whole thing needs to be deterministic, because it will be executed independently on every replica. If each execution can lead to different results, then we'll get divergence between replicas.

As a secondary concern, the function should also be efficient since we will be holding a write lock on the leaf node and globally on all secondary indexes of the table while we're evaluating it.

Member

danielmewes commented Jun 1, 2016

@deontologician That wouldn't work. The whole thing needs to be deterministic, because it will be executed independently on every replica. If each execution can lead to different results, then we'll get divergence between replicas.

As a secondary concern, the function should also be efficient since we will be holding a write lock on the leaf node and globally on all secondary indexes of the table while we're evaluating it.

@encryptio

This comment has been minimized.

Show comment
Hide comment
@encryptio

encryptio Jun 1, 2016

Contributor

What about cases where you want a modifier function in the vast majority of cases (including new stuff that developers write) but want to skip it for just this one weird query? (e.g., table is append and delete only, but you want to rewrite all rows once with an .update({"_notimportant_":true}); .update({"_notimportant_": r.literal()}), and you'd like to keep your modifiers in place during that process.)

Should we have an optarg to insert/update to optionally skip the modifier functions? Do you need config permissions to use the optarg? Or is this too subtle and unknown territory to put into the first version?

Contributor

encryptio commented Jun 1, 2016

What about cases where you want a modifier function in the vast majority of cases (including new stuff that developers write) but want to skip it for just this one weird query? (e.g., table is append and delete only, but you want to rewrite all rows once with an .update({"_notimportant_":true}); .update({"_notimportant_": r.literal()}), and you'd like to keep your modifiers in place during that process.)

Should we have an optarg to insert/update to optionally skip the modifier functions? Do you need config permissions to use the optarg? Or is this too subtle and unknown territory to put into the first version?

@danielmewes

This comment has been minimized.

Show comment
Hide comment
@danielmewes

danielmewes Jun 1, 2016

Member

@encryptio Interesting, I think that could be very useful.

I'm not quite sure what the permission requirements for the override should be.

If we allow everyone with write permissions to ignore the modifier, that would obviously rule out modifiers as a security feature. On the other hand they will be really limited for that purpose in the current proposal anyway, and their main use case is a different one.

Requiring config permissions for it would be safe, and might actually make sense since most cases I can think of where you would want to ignore the modifier are administrative operations.

Member

danielmewes commented Jun 1, 2016

@encryptio Interesting, I think that could be very useful.

I'm not quite sure what the permission requirements for the override should be.

If we allow everyone with write permissions to ignore the modifier, that would obviously rule out modifiers as a security feature. On the other hand they will be really limited for that purpose in the current proposal anyway, and their main use case is a different one.

Requiring config permissions for it would be safe, and might actually make sense since most cases I can think of where you would want to ignore the modifier are administrative operations.

@VeXocide

This comment has been minimized.

Show comment
Hide comment
@VeXocide

VeXocide Jun 3, 2016

Member

I agree with @mlucy that this should be part of tbl.config() instead of having an explicit getter and setter.

Member

VeXocide commented Jun 3, 2016

I agree with @mlucy that this should be part of tbl.config() instead of having an explicit getter and setter.

@danielmewes

This comment has been minimized.

Show comment
Hide comment
@danielmewes

danielmewes Jun 3, 2016

Member

I agree with @mlucy that this should be part of tbl.config() instead of having an explicit getter and setter.

Yeah, but it doesn't work without a ReQL query data type unfortunately.

Member

danielmewes commented Jun 3, 2016

I agree with @mlucy that this should be part of tbl.config() instead of having an explicit getter and setter.

Yeah, but it doesn't work without a ReQL query data type unfortunately.

@tatsujin1

This comment has been minimized.

Show comment
Hide comment
@tatsujin1

tatsujin1 Jun 3, 2016

Am I understanding it correctly that you're sneaking in an implementation to #3583 under the fence? :)
If so, doubly awesome!

tatsujin1 commented Jun 3, 2016

Am I understanding it correctly that you're sneaking in an implementation to #3583 under the fence? :)
If so, doubly awesome!

@danielmewes

This comment has been minimized.

Show comment
Hide comment
@danielmewes

danielmewes Jun 3, 2016

Member

@tatsujin1 No, #3583 will not be addressed by the changes proposed here. r.now is still going to resolve to a single constant timestamp at the beginning of query execution.

Member

danielmewes commented Jun 3, 2016

@tatsujin1 No, #3583 will not be addressed by the changes proposed here. r.now is still going to resolve to a single constant timestamp at the beginning of query execution.

@mlucy

This comment has been minimized.

Show comment
Hide comment
@mlucy

mlucy Jun 3, 2016

Member

To clarify, I think there should be a setter and getter for the same reason indexCreate and indexDrop are terms, but you should also be able to interact with these things through the system tables.

Member

mlucy commented Jun 3, 2016

To clarify, I think there should be a setter and getter for the same reason indexCreate and indexDrop are terms, but you should also be able to interact with these things through the system tables.

@marshall007

This comment has been minimized.

Show comment
Hide comment
@marshall007

marshall007 Jun 6, 2016

Contributor

@danielmewes @mlucy if modifier functions are to be backed up during import/export, I think .getModifier() is going to need to return an .indexStatus() style response rather than a string representation, right? In the existing proposal, I don't see where else you would get the binary representation in order to recreate it (like we do with secondary indices).

Contributor

marshall007 commented Jun 6, 2016

@danielmewes @mlucy if modifier functions are to be backed up during import/export, I think .getModifier() is going to need to return an .indexStatus() style response rather than a string representation, right? In the existing proposal, I don't see where else you would get the binary representation in order to recreate it (like we do with secondary indices).

@danielmewes

This comment has been minimized.

Show comment
Hide comment
@danielmewes

danielmewes Jun 6, 2016

Member

@marshall007 Good point! We should probably return an object that has both the binary and the pretty-printed representation.

Member

danielmewes commented Jun 6, 2016

@marshall007 Good point! We should probably return an object that has both the binary and the pretty-printed representation.

@danielmewes

This comment has been minimized.

Show comment
Hide comment
@danielmewes

danielmewes Jun 22, 2016

Member

Updated proposal with the following changes:

  • renamed to write hook
  • the first argument to the write hook is now the document's ID
  • added ignore_write_hook flag to ignore the hook on a per-query basis
  • added specification of getWriteHook return format
  • added note on r.now evaluation changes
  • added note on backup tool changes
  • added note on system table representation

Specification

The write hook function has the signature (id: datum, oldVal : object, newVal : object) -> object. Each of the objects can also be null.

A write hook function has to be deterministic.

Whenever a write operation on the table inserts, deletes or modifies a given document, the write hook function will be called with the primary key of the document, the old value of the document (or null on inserts) and the new value of the document (or null on deletes). It then returns the value that should actually be inserted and/or replaced instead of newVal. It can also return r.error(...) to abort the write.

For simplicity, the write hook function is allowed to return null exactly iff newVal is null. That way we make sure that it doesn't turn an insert/update into a deletion, and doesn't turn a deletion into an update, which could be confusing and might break some assumptions in our code.

Configuration API

Proposal for an API for setting a write function on a given table:

  • tbl.setWriteHook(function) sets the modifier function or overwrites it if one already exists.
  • tbl.setWriteHook(binary) variant for rethinkdb restore that accepts a binary representation of the function as obtained from getWriteHook
  • tbl.setWriteHook(null) deletes the modifier function
  • tbl.getWriteHook() returns either null or an object of the following form:
{
  "function": binary,
  "query": "setWriteHook(function(_var1, _var2, _var3) { return ...; })" , 
}

The setWriteHook term requires "config" permissions on tbl.

Override
Any write query can override the write hook. The argument ignoreWriteHook: true can be passed to insert, update, delete and replace or globally to run. If specified, the write hook is not executed.

Notes

  • The way r.row is evaluated will be changed slightly. r.now in a query will be evaluated based on an r.now value in a hidden query environment. The r.now value in the environment is set once when the query gets compiled. We no longer replace r.now by a date-time constant during query compilation. We will allow r.now to appear in write hook functions, and will disallow it to appear in index function. For write hooks, the r.now environment value will be taken from the write query's environment.
  • The rethinkdb backup tools should be adapted in two ways. First, they should store and restore the write hook function, similar to how they currently store and restore secondary indexes. Secondly, rethinkdb restore should use the ignoreWriteHook flag for its inserts. I'm not sure what the best default for rethinkdb import is. We can also make this configurable.
  • The table's entry in table_config will get a new field write_hook with the same value as what getWriteHook() returns when called on the table. For simplicity I think we should make have this field be read-only.
Member

danielmewes commented Jun 22, 2016

Updated proposal with the following changes:

  • renamed to write hook
  • the first argument to the write hook is now the document's ID
  • added ignore_write_hook flag to ignore the hook on a per-query basis
  • added specification of getWriteHook return format
  • added note on r.now evaluation changes
  • added note on backup tool changes
  • added note on system table representation

Specification

The write hook function has the signature (id: datum, oldVal : object, newVal : object) -> object. Each of the objects can also be null.

A write hook function has to be deterministic.

Whenever a write operation on the table inserts, deletes or modifies a given document, the write hook function will be called with the primary key of the document, the old value of the document (or null on inserts) and the new value of the document (or null on deletes). It then returns the value that should actually be inserted and/or replaced instead of newVal. It can also return r.error(...) to abort the write.

For simplicity, the write hook function is allowed to return null exactly iff newVal is null. That way we make sure that it doesn't turn an insert/update into a deletion, and doesn't turn a deletion into an update, which could be confusing and might break some assumptions in our code.

Configuration API

Proposal for an API for setting a write function on a given table:

  • tbl.setWriteHook(function) sets the modifier function or overwrites it if one already exists.
  • tbl.setWriteHook(binary) variant for rethinkdb restore that accepts a binary representation of the function as obtained from getWriteHook
  • tbl.setWriteHook(null) deletes the modifier function
  • tbl.getWriteHook() returns either null or an object of the following form:
{
  "function": binary,
  "query": "setWriteHook(function(_var1, _var2, _var3) { return ...; })" , 
}

The setWriteHook term requires "config" permissions on tbl.

Override
Any write query can override the write hook. The argument ignoreWriteHook: true can be passed to insert, update, delete and replace or globally to run. If specified, the write hook is not executed.

Notes

  • The way r.row is evaluated will be changed slightly. r.now in a query will be evaluated based on an r.now value in a hidden query environment. The r.now value in the environment is set once when the query gets compiled. We no longer replace r.now by a date-time constant during query compilation. We will allow r.now to appear in write hook functions, and will disallow it to appear in index function. For write hooks, the r.now environment value will be taken from the write query's environment.
  • The rethinkdb backup tools should be adapted in two ways. First, they should store and restore the write hook function, similar to how they currently store and restore secondary indexes. Secondly, rethinkdb restore should use the ignoreWriteHook flag for its inserts. I'm not sure what the best default for rethinkdb import is. We can also make this configurable.
  • The table's entry in table_config will get a new field write_hook with the same value as what getWriteHook() returns when called on the table. For simplicity I think we should make have this field be read-only.
@danielmewes

This comment has been minimized.

Show comment
Hide comment
@danielmewes

danielmewes Jun 22, 2016

Member

@mlucy @coffeemug could you make a pass over the updated proposal please?

Member

danielmewes commented Jun 22, 2016

@mlucy @coffeemug could you make a pass over the updated proposal please?

@encryptio

This comment has been minimized.

Show comment
Hide comment
@encryptio

encryptio Jun 22, 2016

Contributor

To be explicit, your proposal allows ignoreWriteHook without config permissions?

Contributor

encryptio commented Jun 22, 2016

To be explicit, your proposal allows ignoreWriteHook without config permissions?

@danielmewes

This comment has been minimized.

Show comment
Hide comment
@danielmewes

danielmewes Jun 22, 2016

Member

To be explicit, your proposal allows ignoreWriteHook without config permissions?

I was thinking yes, but on second thought it's probably better to require config permissions on the table for this.

Member

danielmewes commented Jun 22, 2016

To be explicit, your proposal allows ignoreWriteHook without config permissions?

I was thinking yes, but on second thought it's probably better to require config permissions on the table for this.

@danielmewes

This comment has been minimized.

Show comment
Hide comment
@danielmewes

danielmewes Jun 24, 2016

Member

Marking settled as described here #5813 (comment) with the addition that specifying ignoreWriteHook should required config permissions on the table.

Member

danielmewes commented Jun 24, 2016

Marking settled as described here #5813 (comment) with the addition that specifying ignoreWriteHook should required config permissions on the table.

@nighelles

This comment has been minimized.

Show comment
Hide comment
@nighelles

nighelles Aug 11, 2016

This is in review #3725.

nighelles commented Aug 11, 2016

This is in review #3725.

nighelles pushed a commit that referenced this issue Sep 2, 2016

Nighelles
Write Hooks for #5813
Allows creating ReQL functions on a table that are applied to each write. Also changes cluster version to 2.4.

Review by @danielmewes and @VeXocide

Implemented in nighelles/5813
@nighelles

This comment has been minimized.

Show comment
Hide comment
@nighelles

nighelles Sep 2, 2016

This is in next.

nighelles commented Sep 2, 2016

This is in next.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment