Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix order of COLD events for directors vs. VCL (and VMODs, implicitly) #4037

Merged
merged 1 commit into from
Jan 15, 2024

Conversation

nigoroll
Copy link
Member

@nigoroll nigoroll commented Jan 5, 2024

@delthas reported an interesting issue with vmod_dynamic: When the vmod receives a cold event, it might have some directors left to delete, and when it deletes them, they do not receive a cold event, and a follow up assertion triggers because a warm director is deleted.

I believe the root cause is actually a wrong ordering in Varnish-Cache:

vcldir_retire() only sends a VCL_EVENT_COLD when the vcl is warm. In other words, it (rightly, IMHO) asserts that a COLD event (if any) has alrady been posted when a director is deleted on a COLD vcl.

Yet, when the director is deleted upon a COLD vmod event, this assertion was wrong, because the COLD events for directors were only posted after vmod events.

Given that vmods do things like deleting directors, it appears (more) correct to post VDI COLD events before VMOD COLD events.

Copy link
Member

@dridi dridi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On paper I think this change looks good and it's certainly something we could quickly merge to solve the problem this addresses.

I thought this was going to be short review, but revisiting the VCL event logic I'm wondering whether the semantics should be clarified again. Before this change it looks like we are marking the VCL as cold and then propagating the cold events further down and I think that in general that's a sane behavior.

On the other hand it looks like we do the same for warm events: we make the VCL warm and then propagate the warm signal. I feel like it should be the other way around: once everything is warm, "commit" the warmup transaction at the VCL level. Of course if the VMODs and VBEs can't warm up without a warm VCL, then it is inappropriate.

Of course this is not something we can just do without making sure that we don't break the loose contract we currently have with VMODs. Maybe vcl_send_event() should be in charge of the dispatch.

We could start by encoding the desired behavior with types:

static const struct vcl_event {
        const char      *name;
        int             forward; /* before < 0, never == 0, after > 0 */
        unsigned        method;
        unsigned        have_msg;
        // ...
} vcl_events[] = {{
        // ...
}};

static const struct vcl_event *
vcl_event(enum vcl_event_e ev)
{
        // assert
        return (&vcl_events[ev]);
}

And with something like this we could nail down the forward logic and semantics in one central place.

@nigoroll
Copy link
Member Author

nigoroll commented Jan 8, 2024

I feel like it should be the other way around: once everything is warm, "commit" the warmup transaction at the VCL level.

I looked into a similar question before adding the PR, and in general I agree that the current logic looks inconsistent: We mark the VCL warm before it is, and we mark it cold before it is.

But running git grep -C 5 -E -- '->is_(warm|cold)' shows that, besides assertions, the warm/cold attribute is used for exactly two purposes: VSC hiding and the VDI_Event() discussed here. And I think that for both purposes the current implementation is suitable, except for the ordering issue raised herein.

I think that any proposal to complicate this logic should be motivated by relevant use cases. For this reason and my expectation of canned worms, I have not thought about your proposal in more detail yet.

@bsdphk
Copy link
Contributor

bsdphk commented Jan 8, 2024

I think the ordering should be reversed, but it's obviously something which is badly underdocumented.

@nigoroll
Copy link
Member Author

FTR, OKed by bugwash

vcldir_retire() only sends a VCL_EVENT_COLD when the vcl is warm. In
other words, it (rightly) asserts that a COLD event (if any) has
alrady been posted when a director is deleted on a COLD vcl.

Yet, when the director is deleted upon a COLD vmod event, this assertion
was wrong, because the COLD events for directors were only posted after
vmod events.

Given that vmods do things like deleting directors, it appears (more)
correct to post VDI COLD events before VMOD COLD events.
@nigoroll nigoroll merged commit c44bd67 into varnishcache:master Jan 15, 2024
1 of 7 checks passed
@nigoroll nigoroll deleted the vcl_BackendEvent_ordering branch January 15, 2024 15:42
nigoroll added a commit to nigoroll/libvmod-dynamic that referenced this pull request Jan 28, 2024
To avoid VRT_AddDirector() tripping on the "Dynamic Backends can only be
added to warm VCLs" assertion, we need to ensure that we hold a VCL
reference until all resolver threads are stopped, such that the VCL
temperature remains VCL_COOLING and VRT_AddDirector() returns NULL
instead of panicking.

Yet because of the wrong COLD event order in VC#4037, we run into a
catch-22:

- we can not return the reference until all directors received a cold
  event, which only happens after the VMOD cold event,
- but we can neither keep the reference until they are all done, because
  that trips another assertion in mgt that a cold event needs to
  succeed.

Because we do not want to keep this legacy in the main branch, we add
the workarounds specifically for 7.4.

Ref varnishcache/varnish-cache#4037

Closes #108
nigoroll added a commit to nigoroll/libvmod-dynamic that referenced this pull request Jul 30, 2024
So it turns out, that - still - we can deadlock raching between VCL
events and director deletions (see
varnishcache/varnish-cache#4140). This commit
was originally written to fix #108, but it now enables us to take a
different route again: Do not pthread_join() lookup threads, but rather
detach them and make sure _they_ return the reference before terminating.

The remainder of the commit message is an edit in light of the new
insights:

Our lookup threads may continue for a bit after a VCL_COLD event has
been received: dom_event() sets dom->status to DYNAMIC_ST_DONE while
the lookup thread may be blocking in the resolver or updating a
domain.

So, lookup threads might try to create backends even after the VCL has
gone cold, in which case VRT_AddDirector() would panic with "Dynamic
Backends can only be added to warm VCLs" - _unless_ a reference is
present on the VCL to make the VCL state change go through
VCL_TEMP_COOLING, in which case VRT_AddDirector() just returns NULL.

vmod_dynamic took a VCL reference since
99912bc back in May 2019, but before
the merge of varnishcache/varnish-cache#4037
it did not work correctly, because the VCL reference was returned
before all threads had terminated.

The naive idea to solve this would be to take/release a VCL reference
in each lookup thread (or when starting/stopping it), but references
can only be taken from the CLI context (ASSERT_CLI() in
VRT_VCL_Prevent_Discard). Also, to support use of the .resolve() type
method from vcl_init{}, we needed to depart from the original concept
of running resolver threads only for warm vcls.

We get out of this situation by using a reference on top of the VCL
reference, which we can take even before we have the VCL reference.
The VCL reference is returned once all internal "referef"s are returned.

Ref #108

Probably fixes #117 also
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants