Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing Exec race condition #1907

Merged
merged 9 commits into from
Feb 22, 2024
Merged

Fixing Exec race condition #1907

merged 9 commits into from
Feb 22, 2024

Conversation

steiler
Copy link
Collaborator

@steiler steiler commented Feb 22, 2024

This is a hotfix for an issue a user reported, where the defined execs on container would run without the network interfaces being present already. Not a problem, as long as the execs are supposed to configure (ip link set or ip address add) these interfaces.

To understand the issue you have to know how we deploy nodes at the moment. We schedule nodes and make them go through the different stages. (create, create-links, configure, healthy, exit) not all of these stages are always applicable. healthy or exited will only be applicable and waited for if some other node did define that as a wait-for stage.
For the create-link stage, there is another special workflow. If there is no dependency on the create-links phase, then the node will try to create all its network interfaces. For veths between clab nodes, it might be that the peer container is not yet created and hence the veth interface peer namespace can not be set. Hence we opted in case of veth interface for a strategy, that the second node of a link will create and assign the link endpoints to itself and the peer node.
The first node however will in the basic case not stop and simply continue with the configure phase after which the execs where being run. This could lead to race conditions where via the exec commands interfaces where being tried to be configured that are not already there.

This PR drags the exec commands into a post-node deployment phase, where all the nodes will definitely be created and therefor all the links will be there.

Anyways to fix this properly we will have to create the links on the first nodes call to create a veth link and somehow park the peer link in the host or another parking network namespace and make the second node pull the interface in when running the create-link stage.

@hellt hellt marked this pull request as ready for review February 22, 2024 14:29
Copy link

codecov bot commented Feb 22, 2024

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (f2b4a62) 48.59% compared to head (be18fb5) 53.74%.
Report is 223 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1907      +/-   ##
==========================================
+ Coverage   48.59%   53.74%   +5.15%     
==========================================
  Files         133      155      +22     
  Lines       12789    11264    -1525     
==========================================
- Hits         6215     6054     -161     
+ Misses       5870     4351    -1519     
- Partials      704      859     +155     
Files Coverage Δ
clab/clab.go 76.12% <ø> (-1.45%) ⬇️
clab/dependency_manager/dependency_node.go 95.55% <100.00%> (ø)
nodes/k8s_kind/logger.go 66.66% <ø> (ø)
cmd/deploy.go 67.36% <50.00%> (+1.63%) ⬆️

... and 124 files with indirect coverage changes

@hellt hellt merged commit ff671fb into main Feb 22, 2024
61 of 62 checks passed
@hellt hellt deleted the fixExecRace branch February 22, 2024 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants