Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[XrdCl] Avoid possibility of Channel unregistering the wrong task #1947

Merged
merged 1 commit into from
Apr 14, 2023

Conversation

smithdh
Copy link
Contributor

@smithdh smithdh commented Mar 7, 2023

This is (hopefully) a fix for issue #1883

@amadio amadio added this to the 5.5.4 milestone Mar 8, 2023
@amadio amadio modified the milestones: 5.5.4, 5.6 Mar 20, 2023
@amadio amadio removed this from the 5.6 milestone Apr 5, 2023
@smithdh
Copy link
Contributor Author

smithdh commented Apr 12, 2023

Hi. What shall we do with this PR? Maybe there are some open questions:

(a) is this effective in helping with #1883
(b) is there a better solution for the issue that should be followed rather than this?

In the issue I outlined why I think this change would be of help. Unfortunately, for purposes of checking, the original problem looks rare enough that even running with a change does not make a clear case, expect in a failing case where the problem reoccurs. As an addition I could instrument a test with extra logging (and probably artificial delays) to more directly show this avoiding a reproducible problem with similar symptoms.

On the other hand during discussion of #1883 it was said that a larger change would improve the code. My subjective feeling is this is a larger change should be done only in release 6; but certainly that's debatable.

[Actually the most recent EOS packaged eos-xrootd did cherry-pick this patch, but if we decide to abandon for xroot it I guess we'd drop it again on the eos side in future].

@abh3
Copy link
Member

abh3 commented Apr 13, 2023

The problem that we found with this patch is that it appeared to just move the problem. By avoiding the call to UnregisterTask() may leave the task hanging around with a reference to the new deleted channel object. That simply appeared to create a SEGV in the opposite direction (i.e. task to channel vs channel to task). Given that the problem doesn't happen often makes this one really hard to solve and will likely require some redesign to avoid a fatal task/channel dependency. Frankly, I'm surprised that the EOS folks picked up this patch given its problematic nature.

@smithdh
Copy link
Contributor Author

smithdh commented Apr 13, 2023

Hi. Thanks, I didn't realise there was the specific concern that this change adds a problem. My reading of this is that:

(i) In ~Channel(), Invalidate() is called on the pTickGenerator. This removes the associated between this TickGeneratorTask and the Channel. If the Run() method of the task is called it no longer invokes any method of the Channel, and returns (time_t)0.
(ii) The TaskManager is primarily single shot; with optional re-queuing of tasks: If a task's Run() method returns non-zero it is re-queued at the time indicated by the return value. If the return value is zero the task is not re-queued, and if "owned" it is deleted. (And in this case the pTickGenerator is indeed owned by TaskManager).
(iii) Therefore after Invalidate() the task will be de-queued and deleted at some point (the next time it is scheduled to run). So currently there's a race between the Invalidate() and the UnregisterTask() in ~Channel: usually UnregisterTask() will unregister the task, but sometimes the task will have already been deleted and UnregisterTask() is called with a dangling pointer as argument.
(iv) The side-effect of removing the call to UnregisterTask() is expected to be the elimination of this race, with the side effect of keeping a task in the queue a bit longer than otherwise. However the task no longer contains an association to the channel, and will not do much when Run(), and will then be deleted.

So I'd think that removing UnregisterTask() is the appropriate, aside from issue 1883.

I haven't clearly shown this is the specific cause of issue #1883, although I could try to do so. And even if that's done it doesn't rule out the possibility of other bugs around this area; so your point about redesign here is taken. But I wanted to double check the evaluation this is change is actually introducing a bug, as that is certainly not wanted!

@abh3
Copy link
Member

abh3 commented Apr 14, 2023

Well, thank you for making me dive further into this. Indeed, it looks like abandoning the task will actually work here. So, I am merging the change. Sorry for causing you some grief by misunderstanding the logic flow (or miss-flow as it is).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants