Bulletproof js execution in v8 #69

Closed
coffeemug opened this Issue Nov 16, 2012 · 32 comments

@coffeemug
Contributor

coffeemug commented Nov 16, 2012

For example, the following query basically requires us to do a server restart:

r.js("while (1) {}").run()

RethinkDB should be able to restart crashed v8 processes, and should kill them if they use too much cpu/ram for too long.

@skorokithakis

skorokithakis commented Nov 17, 2012

I would so much love this, especially if it were configurable per-process (but I know that's too much to ask).

@coffeemug
Contributor

coffeemug commented Nov 17, 2012

Hmm, I think it's pretty difficult to write a useful js expression that also needs to be killed because it's using too much ram/cpu. I think most of the time the expressions that will cross the threshold will be obviously poorly written. Can you think of an example when this would have to be configurable?

@skorokithakis

skorokithakis commented Nov 17, 2012

The application I'm making right now allows (untrusted) users to write their own analytics code, so it would be fantastic if I could just push it to RethinkDB to run. I'm currently using a custom Lua backend which I kill when it exceeds some limits.

However, if I were designing a database, I wouldn't put that feature in, as there's no clean way to do it. Is there a way, at least, to do map/reduce with a custom process of my own (i.e. launch a separate process/script/program to be the mapper/reducer)?

Obviously, horizontal scaling is important to me here, rather than having to ferry all the data across the shards to process it.

EDIT: Here's the app, for reference: http://www.instahero.com/

@coffeemug
Contributor

coffeemug commented Nov 17, 2012

Ah, I see. Currently all v8 computation is already done in separate processes, so they can't really hurt the rethinkdb server. We do have to write some code to kill/restart them when they get out of line, but that wouldn't be too difficult.

What's the format/API for your users to write map/reduce code in? If they just provide a javascript map function and a javascript reduce function, as soon as this bug is closed you can just push them down to rethink and they won't be able to hurt the server. If you want them to provide full-blown reql queries, that would be more difficult because we'd have to write an entire quota system, and that's a non-trivial amount of work.
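
To make that concrete, here is a rough sketch of what pushing user-supplied map/reduce functions down could look like from the Python driver; the table, field names, and connection are placeholders, and the exact call syntax may end up different:

# Sketch only: untrusted JS strings supplied by the user, pushed down as
# map/reduce functions. "events", "value", and conn are hypothetical.
user_map = "(function (row) { return row.value; })"
user_reduce = "(function (a, b) { return a + b; })"

r.table("events").map(r.js(user_map)).reduce(r.js(user_reduce)).run(conn)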

@skorokithakis

skorokithakis commented Nov 17, 2012

It's just map/reduce (or it will be); the query will be done before the map/reduce step with some parser I'll write. My main problem is that the map/reduce code can easily contain an infinite loop or something that allocates memory unchecked, hence the limits requirement.

@coffeemug
Contributor

coffeemug commented Nov 17, 2012

That's actually trivial for us to do. What's your timeline on this?

@skorokithakis

skorokithakis commented Nov 17, 2012

I'm not in a hurry (read: months is fine); it would just save me a lot of work if it were doable. I'm amazed and grateful for your help, but, as I said before, this would be a bit of feature creep for a database, so you might want to think about it a bit first!

@skorokithakis

skorokithakis commented Nov 17, 2012

By the way, when I said "per-process" I meant "per-call", i.e. different map/reduce tasks having different limits, which is the part that strikes me as "unclean". Sanity checks on v8 processes would be nice, obviously.

Also, Linux includes built-in support for setting process limits; I can give you some info about it if you need it.
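
To illustrate, here is a minimal Python sketch of those kernel limits applied to a spawned worker; the limit values and the "worker.py" script are made up and not RethinkDB-specific:

import resource
import subprocess

def limit_child():
    # Runs in the child right before exec: cap CPU seconds and address space.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))
    resource.setrlimit(resource.RLIMIT_AS, (256 * 2**20, 256 * 2**20))

# "worker.py" stands in for whatever process evaluates the untrusted code.
worker = subprocess.Popen(["python", "worker.py"], preexec_fn=limit_child)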

@wmrowan
Contributor

wmrowan commented Dec 17, 2012

I believe that the v8 api makes it pretty easy to set resource limits on scripts you give it to evaluate. I'll be working on our JS evaluation code in the near future, so I'll just assign this to myself.

@coffeemug
Contributor

coffeemug commented Apr 10, 2013

@wmrowan has been preoccupied with #505, so this isn't going to make it into the sprint. Moving out for now.

@wmrowan
Contributor

wmrowan commented Apr 22, 2013

It turns out that timeout-based external job interruption has been lying dormant in the code the whole time. Turning it on is simply a matter of setting a non-zero timeout value. There are a few next steps here.

  • decide on the right timeout value
  • make sure that the interruptor code does the right thing. It appears to work, but this dormant functionality has never been tested
  • change the error message generated when the JS query is interrupted
@coffeemug
Contributor

coffeemug commented Apr 23, 2013

A really good implementation would thread the timeout through to the js operation so the users could actually configure it. I think five seconds is fine as a default.

Other than that, the plan looks sound to me.

@coffeemug
Contributor

coffeemug commented Apr 23, 2013

Note: this should be done as part of sprint-3 via @wmrowan.

@wmrowan
Contributor

wmrowan commented Apr 23, 2013

I don't think that would actually be very hard to do at all. We'd simply add a "timeout" optional argument to the query language. Here's how it might look from the outside.

r.js("while(true) { ; }", timeout=5)

I'm personally not familiar with how optional arguments are handled on the server beyond protobuf parsing, but I assume it's pretty straightforward.
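
And from the client's point of view, the behavior would be roughly this (a Python sketch; the error class and exact wording are assumptions until the server side settles):

import rethinkdb as r

conn = r.connect("localhost", 28015)
try:
    # Runaway script: should be killed once the one-second budget is used up.
    r.js("while (true) {}", timeout=1).run(conn)
except r.RqlRuntimeError as err:
    print(err)  # expected to report that the JavaScript query timed out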

@coffeemug
Contributor

coffeemug commented Apr 23, 2013

Yep, agreed on the API. (Except that the default should be on the server, not clients)

@wmrowan
Contributor

wmrowan commented Apr 23, 2013

Ok, this is mostly done. @coffeemug do you have a preference for error message? Currently you get:

RqlRuntimeError: JavaScript query timeout. in:
r.js('while(1) {}', 'timeout'=1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@jdoliner
Contributor

jdoliner commented Apr 23, 2013

Do we have a timeout by default? I ask because getting a message like:

RqlRuntimeError: JavaScript query timeout. in:
r.js('while(1) {}')
^^^^^^^^^^^^^^^^^^^

seems a bit confusing. Could it be:

RqlRuntimeError: JavaScript query timed out after 1 seconds in:
r.js('while(1) {}', 'timeout'=1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@jdoliner
Contributor

jdoliner commented Apr 23, 2013

Whoops, really bad copy and paste error there; don't read that last comment in email.

@skorokithakis

Too late. :p

@wmrowan
Contributor

wmrowan commented Apr 23, 2013

Ok, the error message format is now "JavaScript query timed out after %.2G seconds", timeout_s.
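
For reference, %.2G keeps at most two significant digits; for example, with Python's printf-style formatting (same rules as C):

"JavaScript query timed out after %.2G seconds" % 5.0   # 'JavaScript query timed out after 5 seconds'
"JavaScript query timed out after %.2G seconds" % 1.37  # 'JavaScript query timed out after 1.4 seconds'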

@wmrowan
Contributor

wmrowan commented Apr 23, 2013

This is in code review number 425. @mlucy will have to implement this opt arg in Ruby.

@coffeemug
Contributor

coffeemug commented Apr 23, 2013

The error message seems great. I'd still like an answer to how the thing actually works.

@wmrowan
Contributor

wmrowan commented Apr 23, 2013

Here's how it works. All this code is in extproc/pool.cc if you want to check it out.

  1. a timer interruptor is created and set to go off after the configured timeout time is up
  2. the js job is serialized and assigned to an external worker in the pool
  3. a coroutine waits interruptibly with the timer interruptor for the external worker to respond
  4. if the timer interruptor goes off before the external process responds the worker is killed with a SIGKILL signal
  5. more workers are then spawned until the pool is full again
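
The real implementation is the C++ in extproc/pool.cc; purely to illustrate steps 1-5, here is the same control flow sketched in Python, with a hypothetical worker binary and a one-shot worker per job instead of the long-lived pool:

import signal
import subprocess

WORKER_CMD = ["rethinkdb-js-worker"]  # hypothetical stand-in for an extproc worker

def run_js_job(job_bytes, timeout_s):
    # 2. hand the serialized job to an external worker process
    worker = subprocess.Popen(WORKER_CMD, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    try:
        # 1 + 3. wait for the worker's response, giving up once the timer fires
        out, _ = worker.communicate(input=job_bytes, timeout=timeout_s)
        return out
    except subprocess.TimeoutExpired:
        # 4. the timer won: kill the runaway worker with SIGKILL
        worker.send_signal(signal.SIGKILL)
        worker.wait()
        # 5. (the real pool respawns replacement workers here)
        raise RuntimeError("JavaScript query timed out after %.2G seconds" % timeout_s)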
@coffeemug
Contributor

coffeemug commented Apr 23, 2013

Thank you, I am satisfied!

@wmrowan
Contributor

wmrowan commented Apr 24, 2013

This is in code review 425 by @jdoliner and @coffeemug.

@coffeemug
Contributor

coffeemug commented Apr 26, 2013

@wmrowan -- what's the state of the issue?

@wmrowan
Contributor

wmrowan commented Apr 26, 2013

Still in code review. All the issues have been fixed but @jdoliner still needs to approve.

@wmrowan
Contributor

wmrowan commented Apr 29, 2013

This has been merged into next.

@mlucy
Member

mlucy commented May 14, 2013

Reopening; the optional parameter was never accounted for in Ruby.

@mlucy mlucy reopened this May 14, 2013

@ghost ghost assigned mlucy May 14, 2013

@mlucy
Member

mlucy commented May 14, 2013

Also, the YAML docs don't appear to have been updated.

@mlucy
Member

mlucy commented May 14, 2013

This is in code-review 538 by @wmrowan.

@mlucy
Member

mlucy commented May 14, 2013

This is in next.

@mlucy mlucy closed this May 14, 2013
