Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gproc_pool:claim doesn't return worker when calling process is killed #109

Open
netDalek opened this issue Apr 1, 2016 · 7 comments
Open

Comments

@netDalek
Copy link

netDalek commented Apr 1, 2016

I've found gproc_pool very flexible vs other pool libraries, but found this behaviour

48>  gproc_pool:claim(pool, fun(_, P) ->  P end).
{true,<0.185.0>}
49> Pid = spawn_link(fun() -> gproc_pool:claim(pool, fun(_, P) -> timer:sleep(10000), P en
d) end).
<0.223.0>
50> exit(Pid, kill).
** exception exit: killed
51>  gproc_pool:claim(pool, fun(_, P) ->  P end).
false
69> gproc_pool:active_workers(pool).
[{a,<0.185.0>}]

After killing the process, worker stay busy forever. I look through the code and found that it is expected behaviour. Am I right? Are there any plans to improve it?

@uwiger
Copy link
Owner

uwiger commented Apr 2, 2016

It's true that the code doesn't handle the case of a process inside a claim operation gets killed from the outside. Strictly speaking, this is a bug.

Fixing it will cause a general slowdown, but I think I found a way that is still cheap enough. See PR #110.
Let me know what you think.

@netDalek
Copy link
Author

netDalek commented Apr 2, 2016

Looks good and simple, can't find disadvantages

@uwiger
Copy link
Owner

uwiger commented Apr 3, 2016

Describing it to my son, I came up with some failure scenarios that seem very hard to address.

Basically, the approach is inherently vulnerable to the calling process being murdered while running inside the claim. Starting from the (correct) assumption that the process could be killed after any given call, at least the following scenarios, however unlikely, come to mind:

  • Line 509, just after getting the lock, but before spawning the monitoring process. A solution could be to start monitoring before trying the lock, but this seems to forfeit the purpose of the approach.
  • Line 551, if the calling process is killed just after killing the monitoring process, but before resetting the counter, the counter will never be reset.
  • If lines 551 & 552 are switched, and the calling process resets the counter then dies before killing the monitoring process, the monitor will reset the counter again, possibly after another process has claimed it.

Considering the above, one could either conclude that the claim() operation is broken and shouldn't be used, or decide to live with the probability that it sometimes breaks, and then decide which types of breakage are least desirable (I'd wager that the counter never getting reset is the most serious).

Alternatively, figure out a different approach that's safe and not so much slower that the function becomes unfit for its purpose.

@uwiger
Copy link
Owner

uwiger commented Apr 3, 2016

(This would be a good time to play around with QuickCheck/Pulse ...)

@uwiger
Copy link
Owner

uwiger commented Apr 3, 2016

Since it certainly is an improvement over the previous implementation, I decided to merge. We can try to eliminate the (very slight) holes in the algorithm later.

@netDalek
Copy link
Author

netDalek commented Apr 4, 2016

Line 551, if the calling process is killed just after killing the monitoring process, but before resetting the counter, the counter will never be reset.

May be it's better in place of kill + reset_counter send a message to spawned process.

execute_claim(F, K, Pid) ->
     Parent = self(),
     Mon = spawn(
             fun() ->
                     Ref = erlang:monitor(process, Parent),
                     receive
                         {'DOWN', Ref, _, _, _} ->
                             gproc:reset_counter(K)
                         exit ->
                             gproc:reset_counter(K)
                     end
             end),
     try begin
             Res = F(K, Pid),
             {true, Res}
         end
     after
         Pid ! exit
     end.

@uwiger
Copy link
Owner

uwiger commented Apr 4, 2016

I thought about that, but it complicates things, is not inherently safe, and also introduces significant latency (assuming the monitoring process always does the reset).

It would probably at least be a good idea to switch lines 551 and 552, though. I forgot to do that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants