Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in probe heap comparison function #3115

Merged
merged 1 commit into from Nov 11, 2019

Conversation

mbgrydeland
Copy link
Contributor

A case has been observed where the probe scheduling stopped. The system had some probes with very long interval (24h), and some with short intervals. It is suspected that the cause was the bug that this PR fixes, where the heap bubbled the wrong probe to the top giving a very long (24h) wait time before the next probe scheduling.

An attempt was made at creating a test case for this, but it proved very unreliable and timing sensitive so it was abandoned.

Fix the probe scheduler heap comparison function to be consistent with
regard to different running state of the two arguments. With this fix,
probes that are not running will always bubble to the top before those
that are already running.
@rezan
Copy link
Member

rezan commented Nov 5, 2019

#2976 (comment)

@nigoroll
Copy link
Member

nigoroll commented Nov 5, 2019

I am pretty sure that at one point in time I had more or less the same code (IIUC more like if (aa->running != bb->running) return (!aa->running)), but somehow convinced myself that the existing code was correct.

Can you explain why it is not?

@hermunn
Copy link
Member

hermunn commented Nov 5, 2019

Can you explain why it is not?

In my not so humble opinion, I think it should be the other way around, why would the current code work? Let me explain.

If a function has the name compare and is used as a compare function in some kind of algorithm, then it should be a compare function.

In the existing code, if one probe is running and one is not, then it is possible that both vbp_cmp(a,b) and vbp_cmp(b,a) return zero, which normally would indicate that they are equal or equivalent. Here, the problem might be that all of vbp_cmp(a,b), vbp_cmp(b,c) and vbp_cmp(c,a) can return all 0, while vbp_cmp(b,a) evaluates to 1, and this can happen both when exactly one or two of the three are running.

The code in the proposed fix is clear, and it clearly a true compare function. If one is running and the other is not, use that. If they are the same, use the due time.

It is possible that two different probes become equal in priority, but they will form equivalence classes (they all share running == 0 and the due time).

i think it should be uncontroversial to just merge this (but I am sharing office with @mbgrydeland, so not my call).

@nigoroll
Copy link
Member

nigoroll commented Nov 6, 2019

Just to illustrate, the resulting truth table change is (with aa and bb denoting the running state and due denoting the comparison of the due values)

before:

aa      bb      return
 0      0       due
 0      1       due
 1      0       0
 1      1       due

after:

aa      bb      return
 0      0       due
 0      1       1
 1      0       0
 1      1       due

Now I cannot understand any more why I ever thought the existing code was correct.

@bsdphk bsdphk self-requested a review November 11, 2019 07:36
@mbgrydeland mbgrydeland merged commit f402e5f into varnishcache:master Nov 11, 2019
@mbgrydeland mbgrydeland deleted the master-probefix branch November 11, 2019 10:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants