Allow workers to timeout. #149

Closed
wants to merge 4 commits into from

tgxworld
Contributor

No description provided.

tgxworld force-pushed the implement_timeout branch 4 times, most recently from 3459cd0 to ed50964 (April 18, 2017 06:16)
README.markdown Outdated
@@ -271,6 +271,7 @@ optipng:
* `:allow_lossy` — Allow lossy workers and optimizations *(defaults to `false`)*
* `:cache_dir` — Configure cache directory
* `:cache_worker_digests` - Also cache worker digests along with original file digest and worker options: updating workers invalidates cache
* `:timeout` — Number of seconds before workers are timed out. *(defaults to `0`)*

What does the default of 0 mean? I assume it means "no timeout", but it would be good to clarify that, as a naive reading suggests that the default is basically "kill as soon as you start".

Contributor Author

Good point 👍 I'll update the description for the option.


class ImageOptim
# Helper for running commands
module Cmd
class Timeout < StandardError; end

A more "obviously erroneous" name might be good here. The Timeout.timeout standard library method uses Timeout::Error by default, which isn't great but is at least obviously some sort of Bad Thing. Perhaps TimeoutExceeded?

end
end

# Run commands using `Process.spawn`

This comment demonstrates why mentioning implementation details in comments is fraught with peril... as far as I can tell, you're not actually using Process.spawn in the method. Even if you were, anyone calling run_with_timeout shouldn't need to care how you're doing what you're doing, just what is supposed to happen. "Run the specified command, and kill it off if it runs longer than :timeout seconds" would be more descriptive.

success = thread.value.exitstatus.zero?
end
ensure
stdin.close

If you're not going to use stdin, it's best to close it immediately after calling popen2, rather than in an ensure block after everything is done. It guarantees that the child process won't hang around unnecessarily waiting on input that will never come, and makes it clearer that the pipes definitely are getting closed. If you're very (very) sure that the subprocess won't produce output, you can close stdout too immediately, however if the subprocess does write anything, sadness will result (the process will get a SIGPIPE and almost certainly fall over in a heap). On the other hand, failing to read from stdout if it is left open can lead to the process stalling, because the pipe has a finite capacity, and if it writes too much to its stdout further writes will block on the pipe emptying, and the process won't make any progress, which is also sad.

Isn't multiprocessing fun?
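
As a minimal sketch of the advice above (not the code from this PR): close the child's stdin right after `Open3.popen2`, then drain stdout so the child cannot stall on a full pipe. The helper name `run_and_drain` and the use of `cat` as the child command are assumptions made for illustration, and the example assumes a Unix-like system.

```ruby
require 'open3'

# Hypothetical helper: run a command whose stdin we never use.
def run_and_drain(*cmd)
  stdin, stdout, wait_thread = Open3.popen2(*cmd)
  stdin.close            # the child sees EOF immediately instead of waiting for input
  output = stdout.read   # drain stdout until the child closes it
  [output, wait_thread.value]
ensure
  stdout.close if stdout && !stdout.closed?
end

# `cat` exits as soon as it reads EOF on stdin.
output, status = run_and_drain('cat')
```

Reading stdout to the end before waiting on the process is what avoids the blocked-on-a-full-pipe case described above.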

Contributor Author

tgxworld, Apr 18, 2017

Thank you for explaining @mpalmer 👍

I've gone ahead and just closed stdout since the only method that uses this library redirects :out and :err to Path::NULL.

now = Time.now
pid = thread[:pid]

sleep 0.001 while (Time.now - now) < timeout && thread.alive?

Will thread.join(timeout) work here instead?

Contributor Author

👍 Now I know that I can set a limit on Thread#join
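
For reference, a small sketch of the `Thread#join` behaviour being discussed: with a limit, `join` returns `nil` if the thread is still running when the limit elapses, and returns the thread itself otherwise. The variable names are illustrative only.

```ruby
slow = Thread.new { sleep 5 }
fast = Thread.new { :done }

timed_out = slow.join(0.05).nil?  # nil means the limit elapsed first
finished  = fast.join(1)          # the thread itself when it finished in time

slow.kill                         # do not leave the slow thread running
```

This replaces the `sleep 0.001 while ...` polling loop with a single blocking call.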


while Time.now - now < 10
begin
Process.getpgid(pid)

This is more idiomatically expressed as Process.kill(0, pid) (and also raises ESRCH when the process goes away).
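
A minimal sketch of that idiom, assuming a Unix-like system. The helper name `process_alive?` is hypothetical. Signal `0` delivers nothing and only checks whether the pid can be signalled.

```ruby
# Hypothetical helper built on the signal-0 probe.
def process_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false  # no such process
rescue Errno::EPERM
  true   # the process exists but belongs to another user
end

pid = Process.spawn('sleep', '30')
alive_before = process_alive?(pid)

Process.kill('TERM', pid)
Process.wait(pid)  # reap the child so the pid really disappears
alive_after = process_alive?(pid)
```

Note that an exited but not yet reaped child (a zombie) still counts as existing for this check, which is why the sketch waits on the pid first.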


def cleanup_process(pid)
Thread.new do
Process.kill('-TERM', pid)

From Process.kill:

If signal is negative (or starts with a minus sign), kills process groups instead of processes.

I'm pretty sure that isn't what you want to have happen here. A straight-up TERM might be more appropriate.

(Also applies to the -KILL on line 117)

Contributor Author

Hmm I want to be killing the process group here since we won't know if the command being run will spawn child processes. My understanding is that sending a signal to the process group would also send the signal to all the processes in the process group.


Oooh, of course. Not sure why I didn't associate the new_pgroup code with the "kill the process group"... Carry on.
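
To illustrate the point the two reviewers converge on, here is a sketch (not the PR's code) of spawning a command in its own process group and signalling the whole group. It assumes a Unix-like system with `sh` available. The shell command is only a stand-in for a worker that starts a child of its own.

```ruby
# :pgroup => true makes the child the leader of a new process group,
# so the group id equals the child's pid.
pid = Process.spawn('sh', '-c', 'sleep 30 & wait', :pgroup => true)
sleep 0.1  # give the shell a moment to start its background child

# A signal name starting with a minus sign targets the process group,
# so both the shell and its `sleep` child receive TERM.
Process.kill('-TERM', pid)
_, status = Process.wait2(pid)
```

Without the new process group, a negative signal would target whatever group the parent Ruby process belongs to, which is the hazard the first comment was worried about.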

tgxworld force-pushed the implement_timeout branch 6 times, most recently from 1101d9c to 5b4fd97 (April 20, 2017 07:11)
Owner

toy left a comment

Sorry for the wait. Checked a lot of stuff and created an issue in ruby bug tracker.

options[:allow_lossy] = image_optim.allow_lossy

[:allow_lossy, :timeout].each do |option_key|
if !options.key?(option_key) && klass.method_defined?(option_key)
Owner

I think a guard (or two) would be cleaner in this case

Contributor Author

Sorry, I don't quite see what you mean here. What would the guard be used for?

Owner

next if … and next unless … instead of if with block

0,
'Number of seconds before worker is timed out. Must be greater than' \
'0 to enable timeout.'
){ |v| v }
Owner

.to_i

Contributor Author

👍

option(
:timeout,
0,
'Number of seconds before worker is timed out. Must be greater than' \
Owner

Space missing between than and 0

Contributor Author

Ah good catch will fix

@@ -7,6 +7,8 @@ class Worker
#
# Jhead internally uses jpegtran which should be on path
class Jhead < Worker
TIMEOUT_OPTION = timeout_option
Owner

It would be better not to time out jhead. It should not be a time-consuming process, but it may time out for a big JPEG, and if EXIF is removed, the image will be "broken".

Contributor Author

From what I understand, none of the image optimizing is done in place right? Even for optimize_image!, it does not replace the original image if optimization fails.

Owner

You are right. I understood only later that you've implemented the timeout in such a way that if any worker times out, then optimisation will not continue. But if workers that time out are just skipped, it is better to not allow jhead to time out.

Owner

I still think that jhead should not timeout

@@ -151,7 +153,18 @@ def run_command(cmd_args)
{:out => Path::NULL, :err => Path::NULL},
].flatten
end
Cmd.run(*args)

seconds_to_timeout = timeout || @image_optim.timeout
Owner

timeout will already contain specific or global value, it is passed in Worker.init_all

@@ -15,6 +18,44 @@ def run(*args)
success
end

def support_timeout?
Owner

I think supports_timeout? or timeout_supported? would be better

Contributor Author

Sure I'll go with supports_timeout?

init_options!(args)

begin
stdin, stdout, thread = Open3.popen2(*args)
Owner

As pipe is not really used, why not directly use spawn and Process.detach?

Owner

Also it looks like some workers don't like closed stdin or stdout, so maybe better to use :in, :out with Path::NULL options of spawn
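
A rough sketch of what this suggestion could look like, under stated assumptions: the method name `run_with_timeout_sketch` is hypothetical, `File::NULL` stands in for the project's `Path::NULL`, and the process-group kill follows the earlier discussion. It is not the implementation from this PR and assumes a Unix-like system.

```ruby
# Hypothetical sketch: spawn with null streams, wait through Process.detach,
# and kill the process group if the limit elapses.
def run_with_timeout_sketch(timeout, *cmd)
  null = { :in => File::NULL, :out => File::NULL, :err => File::NULL }
  pid = Process.spawn(*cmd, null.merge(:pgroup => true))
  waiter = Process.detach(pid)  # thread whose value is the Process::Status

  # join returns the thread when the child finished within the limit.
  return waiter.value.success? == true if waiter.join(timeout)

  begin
    Process.kill('-KILL', pid)  # the child leads its own group
  rescue Errno::ESRCH
    # the child exited between the join and the kill
  end
  waiter.join                   # wait for the child to be reaped
  :timed_out
end
```

For example, `run_with_timeout_sketch(0.2, 'sleep', '5')` gives `:timed_out`, while `run_with_timeout_sketch(5, 'true')` gives `true`. Redirecting the streams to the null device means no pipe has to be opened, closed, or drained at all.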

thread.kill
fail TimeoutExceeded
else
success = thread.value.exitstatus.zero?
Owner

Running ruby -Ilib bin/image_optim --timeout 1 -r spec/images dies with undefined method `zero?' for nil:NilClass
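
One way such a `nil` can arise, shown as a sketch rather than a diagnosis of this PR: `Process::Status#exitstatus` is `nil` when the process was terminated by a signal instead of exiting normally, so calling `zero?` on it fails. The example assumes a Unix-like system.

```ruby
pid = Process.spawn('sleep', '30')
Process.kill('KILL', pid)
_, status = Process.wait2(pid)

exit_code = status.exitstatus  # nil, because the process was signalled

# Check exited? first so a signalled process counts as a failure
# instead of raising NoMethodError.
succeeded = status.exited? && status.exitstatus.zero?
```

`status.success?` is not a drop-in replacement here, since it also returns `nil` for a process that did not exit normally.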

begin
Cmd.run_with_timeout(seconds_to_timeout, *args)
rescue Cmd::TimeoutExceeded
raise ImageOptim::Worker::TimeoutExceeded
Owner

This exception is not handled: the CLI will fail for all images, whereas I would expect a timeout to stop only one worker and move on to the next one.

pid = thread[:pid]

if thread.join(timeout).nil?
cleanup_process(pid)
Owner

cleanup_process creates a thread which is never joined, then the watching thread is killed, I think in this case it is better to wait for cleanup to finish

Owner

toy commented May 10, 2017

ping @tgxworld

@tgxworld
Contributor Author

@toy Noted. Give me a while as I've been a little busy at work, will fix those changes soon.

tgxworld force-pushed the implement_timeout branch 8 times, most recently from c0bef26 to e49089b (May 19, 2017 08:37)
thread = Process.detach(pid)

if thread.join(timeout).nil?
cleanup_process(pid)
Contributor Author

@toy The reason I'm not calling join here is because we don't want the cleanup process to be blocking the main process.

Owner

Check the comment for Thread.new of cleanup_process

Process.detach(pid)
now = Time.now

while Time.now - now < 10
Contributor Author

@toy 10 is sort of a magic number here. Any ideas about how we can handle this better?

@tgxworld
Contributor Author

@toy Updated per your review comments. Rubocop is broken on master so Travis is having a bit of sad here as well.

Owner

toy commented May 20, 2017

@tgxworld I've reverted the last commit, so rubocop doesn't complain. I'll try to review the changes as soon as I can

Owner

toy left a comment

Lots of small comments, but main points are:

  • simplifying run_with_timeout
  • cleanup in background can create problems with reusing of temporary files
  • it is better not to timeout jhead

begin
handler.process{ |src, dst| worker.optimize(src, dst) }
rescue Worker::TimeoutExceeded
next
Owner

next is not needed, but a message to stderr if verbose is on can be helpful

Contributor Author

Sure I can add that

@@ -15,6 +17,40 @@ def run(*args)
success
end

def supports_timeout?
if defined?(JRUBY_VERSION)
JRUBY_VERSION >= '9.0.0.0'
Owner

There are a lot of versions to go till 10.0 and all of them >= 1.9, but next major jruby version will not be.
Does it really work well with jruby at all and only starting with version 9? I remember having lots of problems trying to just make new system (spawn) syntax work with jruby (though maybe pre 9).
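
The remark about the next major JRuby version appears to refer to version strings being compared character by character. A short sketch of the difference, using `Gem::Version` from RubyGems as one possible alternative (an assumption for illustration, not something the PR proposes):

```ruby
require 'rubygems'

# String comparison is lexicographic: '1' sorts before '9',
# so a future 10.x would be treated as older than 9.0.0.0.
string_compare = '10.0.0.0' >= '9.0.0.0'

# Gem::Version compares the numeric segments instead.
version_compare = Gem::Version.new('10.0.0.0') >= Gem::Version.new('9.0.0.0')
```

Here `string_compare` is `false` and `version_compare` is `true`. A feature check, as suggested in the next comment, sidesteps version parsing entirely.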

if defined?(JRUBY_VERSION)
JRUBY_VERSION >= '9.0.0.0'
else
RUBY_VERSION >= '1.9'
Owner

Maybe Kernel.respond_to?(:spawn)

if timeout
begin
Cmd.run_with_timeout(timeout, *args)
rescue Cmd::TimeoutExceeded
Owner

Is it worth having two new exceptions just to reraise?

return run(*args) unless timeout > 0 && supports_timeout?

success = false
init_options!(args)
Owner

Method name is not descriptive, method is short and not used for anything else, is it worth extraction?

Contributor Author

Rubocop was complaining about the method being too long 😉

Owner

This method can be shortened, so maybe it will fit inside ;)

begin
Process.kill(0, pid)
sleep 0.001
next
Owner

Not needed

sleep 0.001
next
rescue Errno::ESRCH
break
Owner

At this place you can already return or even combine it with other rescue as in both cases there should be no such process

@@ -7,6 +7,8 @@ class Worker
#
# Jhead internally uses jpegtran which should be on path
class Jhead < Worker
TIMEOUT_OPTION = timeout_option
Owner

I still think that jhead should not timeout

def timeout_option
option(
:timeout,
nil,
Owner

Insert Integer as next parameter so you don't need to add NilClass in option_parser.rb

thread = Process.detach(pid)

if thread.join(timeout).nil?
cleanup_process(pid)
Owner

Check the comment for Thread.new of cleanup_process

tgxworld force-pushed the implement_timeout branch 2 times, most recently from eb6e374 to 4feeb03 (June 5, 2017 05:03)
tgxworld force-pushed the implement_timeout branch 3 times, most recently from 5e130be to 8416f03 (June 5, 2017 05:28)
Contributor Author

tgxworld commented Jun 5, 2017

@toy I've been thinking about my approach here and feel like it might be wrong. The use case that we had at Discourse was to time out calls to optimize_images! so that we wouldn't hit Unicorn's per-request timeout. What I initially wanted to achieve was a global timeout for all the workers instead of timeouts for individual workers. Are you open to that? At least for us, timeouts for individual workers aren't going to be useful.

Owner

toy commented Jun 8, 2017

@tgxworld Sure, there can be timeout per worker and timeout for the whole optimisation (two global options). I'm fine with a PR for one of them or both. Probably if you decide to only implement one for whole optimisation, better open a separate PR, so this one can also be finished later

smileart commented Sep 6, 2017

Guys, thanks for the lib and all the work done! I'm really waiting for this one to be merged since when I use external Timeout.timeout around the compression method it (obviously) leaves zombie processes which eventually use too many threads! For now I came up with a dirty hack of killing parent (puma) workers which produce those zombies but it'd be much better to have this functionality in the gem itself. Or at least it'd be nice to expose worker's PID to give a chance to detach it from the outside if we ran out of execution time. Anyway, thanks one more time. Cheers! 👍

Owner

toy commented Sep 6, 2017

@smileart Thanks for the reminder that this will be useful

Contributor Author

tgxworld commented Jul 5, 2018

Closing in favor of #162

tgxworld closed this Jul 5, 2018
toy pushed a commit to oblakeerickson/image_optim that referenced this pull request May 9, 2021
The original commit discourse/image_optim@8bf3c0e, see discussion in resolved PRs.

Resolves toy#21, resolves toy#148, resolves toy#149, resolves toy#162, resolves toy#184, resolves toy#189.

Co-authored-by: Blake Erickson <o.blakeerickson@gmail.com>
toy added a commit to oblakeerickson/image_optim that referenced this pull request May 9, 2021
The original commit discourse/image_optim@8bf3c0e, see discussion in resolved PRs.

Resolves toy#21, resolves toy#148, resolves toy#149, resolves toy#162, resolves toy#184, resolves toy#189.

Co-authored-by: Blake Erickson <o.blakeerickson@gmail.com>
Co-authored-by: Ivan Kuchin <tadump+git@gmail.com>
toy added a commit that referenced this pull request May 9, 2021
The original commit discourse/image_optim@8bf3c0e, see discussion in resolved PRs.

Resolves #21, resolves #148, resolves #149, resolves #162, resolves #184, resolves #189.

Co-authored-by: Blake Erickson <o.blakeerickson@gmail.com>
Co-authored-by: Ivan Kuchin <tadump+git@gmail.com>

4 participants