Fix order of COLD events for directors vs. VCL (and VMODs, implicitly) #4037

nigoroll · 2024-01-05T20:47:50Z

@delthas reported an interesting issue with vmod_dynamic: When the vmod receives a cold event, it might have some directors left to delete, and when it deletes them, they do not receive a cold event, and a follow up assertion triggers because a warm director is deleted.

I believe the root cause is actually a wrong ordering in Varnish-Cache:

vcldir_retire() only sends a VCL_EVENT_COLD when the vcl is warm. In other words, it (rightly, IMHO) asserts that a COLD event (if any) has alrady been posted when a director is deleted on a COLD vcl.

Yet, when the director is deleted upon a COLD vmod event, this assertion was wrong, because the COLD events for directors were only posted after vmod events.

Given that vmods do things like deleting directors, it appears (more) correct to post VDI COLD events before VMOD COLD events.

Ref varnishcache/varnish-cache#4037 Closes #107

dridi

On paper I think this change looks good and it's certainly something we could quickly merge to solve the problem this addresses.

I thought this was going to be short review, but revisiting the VCL event logic I'm wondering whether the semantics should be clarified again. Before this change it looks like we are marking the VCL as cold and then propagating the cold events further down and I think that in general that's a sane behavior.

On the other hand it looks like we do the same for warm events: we make the VCL warm and then propagate the warm signal. I feel like it should be the other way around: once everything is warm, "commit" the warmup transaction at the VCL level. Of course if the VMODs and VBEs can't warm up without a warm VCL, then it is inappropriate.

Of course this is not something we can just do without making sure that we don't break the loose contract we currently have with VMODs. Maybe vcl_send_event() should be in charge of the dispatch.

We could start by encoding the desired behavior with types:

static const struct vcl_event {
        const char      *name;
        int             forward; /* before < 0, never == 0, after > 0 */
        unsigned        method;
        unsigned        have_msg;
        // ...
} vcl_events[] = {{
        // ...
}};

static const struct vcl_event *
vcl_event(enum vcl_event_e ev)
{
        // assert
        return (&vcl_events[ev]);
}

And with something like this we could nail down the forward logic and semantics in one central place.

nigoroll · 2024-01-08T13:34:30Z

I feel like it should be the other way around: once everything is warm, "commit" the warmup transaction at the VCL level.

I looked into a similar question before adding the PR, and in general I agree that the current logic looks inconsistent: We mark the VCL warm before it is, and we mark it cold before it is.

But running git grep -C 5 -E -- '->is_(warm|cold)' shows that, besides assertions, the warm/cold attribute is used for exactly two purposes: VSC hiding and the VDI_Event() discussed here. And I think that for both purposes the current implementation is suitable, except for the ordering issue raised herein.

I think that any proposal to complicate this logic should be motivated by relevant use cases. For this reason and my expectation of canned worms, I have not thought about your proposal in more detail yet.

bsdphk · 2024-01-08T14:39:10Z

I think the ordering should be reversed, but it's obviously something which is badly underdocumented.

nigoroll · 2024-01-15T14:24:05Z

FTR, OKed by bugwash

vcldir_retire() only sends a VCL_EVENT_COLD when the vcl is warm. In other words, it (rightly) asserts that a COLD event (if any) has alrady been posted when a director is deleted on a COLD vcl. Yet, when the director is deleted upon a COLD vmod event, this assertion was wrong, because the COLD events for directors were only posted after vmod events. Given that vmods do things like deleting directors, it appears (more) correct to post VDI COLD events before VMOD COLD events.

To avoid VRT_AddDirector() tripping on the "Dynamic Backends can only be added to warm VCLs" assertion, we need to ensure that we hold a VCL reference until all resolver threads are stopped, such that the VCL temperature remains VCL_COOLING and VRT_AddDirector() returns NULL instead of panicking. Yet because of the wrong COLD event order in VC#4037, we run into a catch-22: - we can not return the reference until all directors received a cold event, which only happens after the VMOD cold event, - but we can neither keep the reference until they are all done, because that trips another assertion in mgt that a cold event needs to succeed. Because we do not want to keep this legacy in the main branch, we add the workarounds specifically for 7.4. Ref varnishcache/varnish-cache#4037 Closes #108

So it turns out, that - still - we can deadlock raching between VCL events and director deletions (see varnishcache/varnish-cache#4140). This commit was originally written to fix #108, but it now enables us to take a different route again: Do not pthread_join() lookup threads, but rather detach them and make sure _they_ return the reference before terminating. The remainder of the commit message is an edit in light of the new insights: Our lookup threads may continue for a bit after a VCL_COLD event has been received: dom_event() sets dom->status to DYNAMIC_ST_DONE while the lookup thread may be blocking in the resolver or updating a domain. So, lookup threads might try to create backends even after the VCL has gone cold, in which case VRT_AddDirector() would panic with "Dynamic Backends can only be added to warm VCLs" - _unless_ a reference is present on the VCL to make the VCL state change go through VCL_TEMP_COOLING, in which case VRT_AddDirector() just returns NULL. vmod_dynamic took a VCL reference since 99912bc back in May 2019, but before the merge of varnishcache/varnish-cache#4037 it did not work correctly, because the VCL reference was returned before all threads had terminated. The naive idea to solve this would be to take/release a VCL reference in each lookup thread (or when starting/stopping it), but references can only be taken from the CLI context (ASSERT_CLI() in VRT_VCL_Prevent_Discard). Also, to support use of the .resolve() type method from vcl_init{}, we needed to depart from the original concept of running resolver threads only for warm vcls. We get out of this situation by using a reference on top of the VCL reference, which we can take even before we have the VCL reference. The VCL reference is returned once all internal "referef"s are returned. Ref #108 Probably fixes #117 also

nigoroll added a commit to nigoroll/libvmod-dynamic that referenced this pull request Jan 5, 2024

Work around arguably wrong event order in Varnish-Cache

1faceae

Ref varnishcache/varnish-cache#4037 Closes #107

nigoroll mentioned this pull request Jan 5, 2024

Assert error in dom_release(): Condition((dom->thread) == 0) not true nigoroll/libvmod-dynamic#107

Closed

dridi reviewed Jan 8, 2024

View reviewed changes

nigoroll force-pushed the vcl_BackendEvent_ordering branch from 3c84e9e to c44bd67 Compare January 15, 2024 14:29

nigoroll merged commit c44bd67 into varnishcache:master Jan 15, 2024
1 of 7 checks passed

nigoroll deleted the vcl_BackendEvent_ordering branch January 15, 2024 15:42

nigoroll mentioned this pull request Jan 26, 2024

Assert error in ref_del(): Condition((r->dir) != NULL) not true nigoroll/libvmod-dynamic#109

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix order of COLD events for directors vs. VCL (and VMODs, implicitly) #4037

Fix order of COLD events for directors vs. VCL (and VMODs, implicitly) #4037

nigoroll commented Jan 5, 2024

dridi left a comment

nigoroll commented Jan 8, 2024

bsdphk commented Jan 8, 2024

nigoroll commented Jan 15, 2024

Fix order of COLD events for directors vs. VCL (and VMODs, implicitly) #4037

Fix order of COLD events for directors vs. VCL (and VMODs, implicitly) #4037

Conversation

nigoroll commented Jan 5, 2024

dridi left a comment

Choose a reason for hiding this comment

nigoroll commented Jan 8, 2024

bsdphk commented Jan 8, 2024

nigoroll commented Jan 15, 2024