This repository has been archived by the owner on Dec 7, 2023. It is now read-only.

Parallel VM creation fix #524

Merged 2 commits from darkowlzz:parallel-creation-fix into weaveworks:master on Feb 24, 2020

Conversation

@darkowlzz (Contributor) commented Feb 15, 2020

Add lockfile at snapshot activation to avoid race condition

This creates an ignite lock file at /tmp/ignite-snapshot.lock
when an overlay snapshot is created. The locking is handled via
a pid file, using the github.com/nightlyone/lockfile package. This
helps avoid the race condition that occurs when multiple ignite
processes try to create a loop device and use the device mapper for
the overlay snapshot at the same time. While one process holds the
lock, the other processes keep retrying until they obtain it. Once
the snapshot device is created, the lock is released.

lockfile godoc: https://pkg.go.dev/github.com/nightlyone/lockfile?tab=doc
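
A minimal sketch of this scheme using the lockfile API (the helper name and the bare retry loop are illustrative, not the exact ignite code):

```go
package main

import (
	"log"

	"github.com/nightlyone/lockfile"
)

// acquireSnapshotLock blocks until this process owns /tmp/ignite-snapshot.lock.
// The lock file holds the pid of the owning process.
func acquireSnapshotLock() (lockfile.Lockfile, error) {
	lock, err := lockfile.New("/tmp/ignite-snapshot.lock")
	if err != nil {
		return lock, err
	}
	// Another ignite process may hold the lock; keep retrying until TryLock succeeds.
	for lock.TryLock() != nil {
	}
	return lock, nil
}

func main() {
	lock, err := acquireSnapshotLock()
	if err != nil {
		log.Fatalf("unable to lock %q: %v", "/tmp/ignite-snapshot.lock", err)
	}
	// ... create the loop device and set up the device-mapper overlay snapshot ...
	if err := lock.Unlock(); err != nil { // release once the snapshot device exists
		log.Fatal(err)
	}
}
```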

Fixes #510

# ignite run weaveworks/ignite-ubuntu --cpus 1 --memory 1GB --ssh --name my-vm1 & \
ignite run weaveworks/ignite-ubuntu --cpus 1 --memory 1GB --ssh --name my-vm2 & \
ignite run weaveworks/ignite-ubuntu --cpus 1 --memory 1GB --ssh --name my-vm3 & \
ignite run weaveworks/ignite-ubuntu --cpus 1 --memory 1GB --ssh --name my-vm4 &
[1] 18784
[2] 18785
[3] 18786
[4] 18787
# INFO[0002] Created VM with ID "a058bc89a63ca580" and name "my-vm3" 
INFO[0002] Created VM with ID "317b8230e47866d5" and name "my-vm4" 
INFO[0003] Created VM with ID "f91e4505483c2310" and name "my-vm2" 
INFO[0003] Created VM with ID "3ea0c71dd9728a1b" and name "my-vm1" 
INFO[0004] Networking is handled by "cni"               
INFO[0004] Started Firecracker VM "a058bc89a63ca580" in a container with ID "ignite-a058bc89a63ca580" 
INFO[0004] Networking is handled by "cni"               
INFO[0004] Started Firecracker VM "317b8230e47866d5" in a container with ID "ignite-317b8230e47866d5" 
INFO[0005] Networking is handled by "cni"               
INFO[0005] Started Firecracker VM "3ea0c71dd9728a1b" in a container with ID "ignite-3ea0c71dd9728a1b" 
INFO[0006] Networking is handled by "cni"               
INFO[0006] Started Firecracker VM "f91e4505483c2310" in a container with ID "ignite-f91e4505483c2310"

[1]   Done                    ignite run weaveworks/ignite-ubuntu --cpus 1 --memory 1GB --ssh --name my-vm1
[2]   Done                    ignite run weaveworks/ignite-ubuntu --cpus 1 --memory 1GB --ssh --name my-vm2
[3]-  Done                    ignite run weaveworks/ignite-ubuntu --cpus 1 --memory 1GB --ssh --name my-vm3
[4]+  Done                    ignite run weaveworks/ignite-ubuntu --cpus 1 --memory 1GB --ssh --name my-vm4

@darkowlzz force-pushed the parallel-creation-fix branch 3 times, most recently from 575f829 to 6299c72, on February 18, 2020 18:38
@darkowlzz (Contributor, Author)

Tightened the lock by moving it closer to the loop device and device mapper setup code. This keeps other processes from waiting on the lock when they could already be importing images.

Tested it multiple times with and without a sleep in the lock retry loop and didn't see any failure without the sleep, so I decided not to add a sleep and to retry as soon as possible.

Releasing the lock right after creating the loop devices just moves the race to the lock file creation itself, with the error `FATA[0000] unable to lock "/tmp/ignite-snapshot.lock": open /tmp/ignite-snapshot.lock: no such file or directory`. Releasing the lock after the device mapper setup seems to be safer.
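
Roughly, the tightened scope looks like the sketch below (a conceptual ordering only, not ignite's actual call graph; importImage, setupLoopDevice, and setupDeviceMapper are hypothetical stand-ins):

```go
package snapshot

import "github.com/nightlyone/lockfile"

// Hypothetical stand-ins for ignite's image import, loop-device, and
// device-mapper snapshot code.
func importImage() error       { return nil }
func setupLoopDevice() error   { return nil }
func setupDeviceMapper() error { return nil }

func activateSnapshot() error {
	// Image import runs outside the critical section, so other ignite
	// processes can keep importing images while one holds the snapshot lock.
	if err := importImage(); err != nil {
		return err
	}

	lock, err := lockfile.New("/tmp/ignite-snapshot.lock")
	if err != nil {
		return err
	}
	for lock.TryLock() != nil { // retry immediately, without sleeping
	}
	// Unlock only after the device-mapper setup; unlocking right after the
	// loop devices raced on the lock file itself (see above).
	defer lock.Unlock()

	if err := setupLoopDevice(); err != nil {
		return err
	}
	return setupDeviceMapper()
}
```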

The two commits in this PR:

- Add lockfile at snapshot activation to avoid race condition
- `make tidy-in-docker`
@stealthybox (Contributor)

> Releasing the lock after device mapper setup seems to be safer.

Thanks so much for the thorough testing of this bug fix.

@stealthybox (Contributor) commented Feb 24, 2020

Concurrent VM creation is much faster than serial with this patch 🏎️

5 VMs -- parallel vs. serial:

num_vms=5

time (
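    # note: {1..${num_vms}} expands numerically in zsh; with plain bash,
    # use $(seq 1 "$num_vms") instead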
    for i in {1..${num_vms}}; do
        sudo bin/ignite run weaveworks/ignite-ubuntu \
          --name concurrent-${RANDOM} --ssh 1>/dev/null &
    done
wait )

time (
    for i in {1..${num_vms}}; do
        sudo bin/ignite run weaveworks/ignite-ubuntu \
          --name serial-${RANDOM} --ssh 1>/dev/null
    done
)

results on my laptop:

( for i in {1..5}; do; sudo bin/ignite run weaveworks/ignite-ubuntu --name   )  
3.40s user 5.52s system 147% cpu 5.845 total
( for i in {1..5}; do; sudo bin/ignite run weaveworks/ignite-ubuntu --name   )  
1.00s user 0.66s system 9% cpu 17.619 total

and with num_vms=10:

( for i in {1..${num_vms}}; do; sudo bin/ignite run weaveworks/ignite-ubuntu )  
9.23s user 23.65s system 317% cpu 10.361 total
( for i in {1..${num_vms}}; do; sudo bin/ignite run weaveworks/ignite-ubuntu )  
2.02s user 1.42s system 9% cpu 35.097 total

For these cases, it's over a 3x improvement.

There is no lock for the image pull, so we run into a race when the image does not exist, as we expected on last week's call:

num_vms=5

sudo bin/ignite image rm weaveworks/ignite-ubuntu:latest

echo
time ( 
    for i in {1..${num_vms}}; do
        sudo bin/ignite run weaveworks/ignite-ubuntu \
          --name concurrent-${RANDOM} --ssh &
    done
wait )
echo

(../ignite-scratch/ignite-clean.sh 2>&1; ../ignite-scratch/iptables-clean-cni-ignite.sh 2>&1) >/dev/null
sudo bin/ignite image rm weaveworks/ignite-ubuntu:latest
ef609546af94ace0

INFO[0000] Starting image import...                     
INFO[0000] Starting image import...                     
INFO[0000] Starting image import...                     
INFO[0000] Starting image import...                     
INFO[0000] Starting image import...                     
FATA[0004] command ["resize2fs" "-P" "/dev/loop3"] exited with "resize2fs 1.44.6 (5-Mar-2019)\nresize2fs: Device or resource busy while trying to open /dev/loop3\nCouldn't find valid filesystem superblock.\n": exit status 1 
FATA[0004] command ["resize2fs" "-P" "/dev/loop3"] exited with "resize2fs 1.44.6 (5-Mar-2019)\nresize2fs: Device or resource busy while trying to open /dev/loop3\nCouldn't find valid filesystem superblock.\n": exit status 1 
FATA[0004] command ["resize2fs" "-P" "/dev/loop3"] exited with "resize2fs 1.44.6 (5-Mar-2019)\nresize2fs: Invalid argument while trying to open /dev/loop3\nCouldn't find valid filesystem superblock.\n": exit status 1 
INFO[0005] Imported OCI image "weaveworks/ignite-ubuntu:latest" (226.5 MB) to base image with UID "ef609546af94ace0" 
INFO[0005] Imported OCI image "weaveworks/ignite-ubuntu:latest" (226.5 MB) to base image with UID "05054ab4da76e736" 
INFO[0005] Removed VM with name "concurrent-30625" and ID "99fecee0aa7b23e2" 
FATA[0005] ambiguous image query: "weaveworks/ignite-ubuntu:latest" matched multiple names 
INFO[0005] Removed VM with name "concurrent-4736" and ID "764ffe0d5b103a98" 
FATA[0005] ambiguous image query: "weaveworks/ignite-ubuntu:latest" matched multiple names 
( for i in {1..${num_vms}}; do; sudo bin/ignite run weaveworks/ignite-ubuntu )  2.02s user 9.23s system 200% cpu 5.602 total

FATA[0000] ambiguous image query: "weaveworks/ignite-ubuntu:latest" matched the following IDs/names: weaveworks/ignite-ubuntu:latest, weaveworks/ignite-ubuntu:latest 

This can be worked around by the user importing the image before performing the parallel run operations.

We can fix that issue at a future time with a separate issue/PR.
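
One possible shape for that follow-up, reusing the same lockfile pattern around the image import (a sketch only; the lock path and function signatures are hypothetical, not an agreed design):

```go
package images

import "github.com/nightlyone/lockfile"

// importImageOnce serializes imports of a missing image across ignite
// processes, so only the first one actually performs the import.
// imageExists and doImport are hypothetical hooks into ignite's image code.
func importImageOnce(name string, imageExists func(string) bool, doImport func(string) error) error {
	lock, err := lockfile.New("/tmp/ignite-image-import.lock") // hypothetical lock path
	if err != nil {
		return err
	}
	for lock.TryLock() != nil { // retry until the lock is obtained
	}
	defer lock.Unlock()

	// Re-check under the lock: another process may have just finished the import.
	if imageExists(name) {
		return nil
	}
	return doImport(name)
}
```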

@stealthybox merged commit 3ca8dae into weaveworks:master on Feb 24, 2020
@luxas added this to the v0.7.0 milestone on Jun 2, 2020
@darkowlzz deleted the parallel-creation-fix branch on June 8, 2020 18:30