
Sporadic Performance issues with 1.1 #221

Closed
sadsfae opened this Issue Mar 15, 2019 · 5 comments


@sadsfae
Member

sadsfae commented Mar 15, 2019

This is a tracking issue for some unreasonably long bin/quads-cli command executions that we've been experiencing at random times. We need to investigate this further with some Python profiling work to isolate where this is happening.

Some preliminary strace runs reveal severe hangups here:

18:50:41.069933 openat(AT_FDCWD, "/usr/local/lib64/python3.7/site-packages/nacl/bindings/__pycache__/utils.cpython-37.pyc", O_RDONLY|O_CLOEXEC) = 4
18:50:41.069991 fstat(4, {st_mode=S_IFREG|0644, st_size=3796, ...}) = 0
18:50:41.070050 lseek(4, 0, SEEK_CUR)   = 0
18:50:41.070099 fstat(4, {st_mode=S_IFREG|0644, st_size=3796, ...}) = 0
18:50:41.070158 read(4, "B\r\r\n\0\0\0\0\317yr\\D\21\0\0\343\0\0\0\0\0\0\0\0\0\0\0\0\2\0\0"..., 3797) = 3796
18:50:41.070209 read(4, "", 1)          = 0
18:50:41.070265 close(4)                = 0
18:50:41.070401 openat(AT_FDCWD, "/dev/random", O_RDONLY) = 4
18:50:41.070463 poll([{fd=4, events=POLLIN}], 1, -1

via strace -tt quads-cli --full-summary for example.

Namely here; note the roughly 90-second stall (18:50:41 to 18:52:10):

18:50:41.070401 openat(AT_FDCWD, "/dev/random", O_RDONLY) = 4
18:50:41.070463 poll([{fd=4, events=POLLIN}], 1, -1) = 1 ([{fd=4, revents=POLLIN}])
18:52:10.523470 close(4)                = 0
18:52:10.523535 openat(AT_FDCWD, "/dev/urandom", O_RDONLY) = 4
18:52:10.523659 fstat(4, {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x9), ...}) = 0
18:52:10.523707 fcntl(4, F_GETFD)       = 0
18:52:10.523747 fcntl(4, F_SETFD, FD_CLOEXEC) = 0
18:52:10.523845 read(4, "\305\241\255\306\5\302I\3223\363\302_v\225\263B", 16) = 16
18:52:10.524087 stat("/usr/local/lib64/python3.7/site-packages/nacl", {st_mode=S_IFDIR|0755, st_size=238, ...}) = 0
18:52:10.524207 stat("/usr/local/lib64/python3.7/site-packages/nacl/encoding.py", {st_mode=S_IFREG|0644, st_size=1928, ...}) = 0
18:52:10.524346 stat("/usr/local/lib64/python3.7/site-packages/nacl/encoding.py", {st_mode=S_IFREG|0644, st_size=1928, ...}) = 0

And in particular:

18:50:41.070463 poll([{fd=4, events=POLLIN}], 1, -1) = 1 ([{fd=4, revents=POLLIN}])
18:52:10.523470 close(4)                = 0
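The blocking poll() seen above can be probed safely from Python with a bounded timeout (a sketch; would_block is a hypothetical helper, not part of QUADS):

```python
import os
import select

def would_block(path="/dev/random", timeout=0.5):
    """Return True if a read on `path` would currently block.

    Mirrors the poll() call seen in the strace output above, but with a
    bounded timeout instead of -1 (wait forever), so the probe itself
    can never hang.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        readable, _, _ = select.select([fd], [], [], timeout)
        return not readable
    finally:
        os.close(fd)
```

On a healthy system this returns False almost immediately; on an entropy-starved VM guest it flags the same stall strace caught, without hanging the probe itself.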

Something is trying to poll a file handle; this could be symptomatic rather than causal. We need Python profiling, e.g. cProfile or similar, to dig deeper into the code.

https://docs.python.org/3/library/profile.html#introduction-to-the-profilers
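The profilers linked above can also be driven programmatically, sorting by cumulative time to surface where the wall-clock goes (a sketch; profile_cumulative is a hypothetical helper, not part of QUADS):

```python
import cProfile
import io
import pstats

def profile_cumulative(func, top=15):
    """Run func under cProfile and return a report of the top entries
    sorted by cumulative time, the same sort used when profiling
    quads-cli from the command line."""
    pr = cProfile.Profile()
    pr.runcall(func)
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(top)
    return buf.getvalue()
```

Usage, e.g.: `print(profile_cumulative(lambda: sum(range(10**5))))`.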

It's important to note that we initially thought this was due to running containers inside a VM (particularly quads_db / mongo). Although we see it less often when testing directly in a VM via RPM against latest master (c72d9b2), we still see it there.

@sadsfae sadsfae added this to the 1.1.0 milestone Mar 15, 2019

@sadsfae

Member Author

sadsfae commented Mar 15, 2019

Update: After running some profiling with cProfile via:

python -m cProfile /opt/quads/bin/quads-cli --full-summary

Here is what I've found so far:

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
650/12    0.000    0.000   72.322    6.027  <frozen importlib._bootstrap>:211(_call_with_frames_removed)
514/12    0.002    0.000   72.324    6.027  <frozen importlib._bootstrap>:663(_load_unlocked)
538/10    0.002    0.000   72.326    7.233  <frozen importlib._bootstrap>:948(_find_and_load_unlocked)
538/10    0.002    0.000   72.326    7.233  <frozen importlib._bootstrap>:978(_find_and_load)
469/12    0.001    0.000   72.324    6.027  <frozen importlib._bootstrap_external>:722(exec_module)
4         0.000    0.000   72.047   18.012  __init__.py:15(<module>)
1         0.000    0.000   72.071   72.071  __init__.py:20(<module>)
1         0.000    0.000   72.183   72.183  move_and_rebuild_hosts.py:3(<module>)
1         0.000    0.000   72.368   72.368  quads-cli:2(<module>)
1         0.000    0.000   72.009   72.009  signing.py:15(<module>)
1         0.000    0.000   72.001   72.001  sodium_core.py:21(_sodium_init)
1         0.000    0.000   72.001   72.001  sodium_core.py:27(sodium_init)
1         0.000    0.000   72.072   72.072  ssh_helper.py:2(<module>)
1         0.000    0.000   72.066   72.066  transport.py:22(<module>)
1        72.001   72.001   72.001   72.001  {built-in method _sodium.sodium_init}
501/1     0.003    0.000   72.368   72.368  {built-in method builtins.exec}
1         0.000    0.000   72.001   72.001  {method 'init_once' of 'CompiledFFI' objects}


Full cProfile trace

https://gist.github.com/sadsfae/ccd42c1e627befffa674da1b040a6470
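Since the heavy rows in the table are all <module> entries, the cost is paid at import time. A quick way to confirm that on an affected host (a sketch; time_import is a hypothetical helper, not part of QUADS):

```python
import importlib
import time

def time_import(module_name):
    """Measure the wall-clock cost of importing a module for the first
    time in this process. The profile above points at import-time work:
    _sodium.sodium_init runs when nacl.bindings is first imported."""
    start = time.monotonic()
    importlib.import_module(module_name)
    return time.monotonic() - start
```

On an affected host, `time_import("nacl.bindings")` in a fresh interpreter should show essentially the whole ~72 s stall on its own.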


@sadsfae

Member Author

sadsfae commented Mar 15, 2019

Update: it seems this might be our issue: a bug in the nacl / libsodium library, which explains the /dev/random hangs in strace.

pyca/pynacl#327

@sadsfae

Member Author

sadsfae commented Mar 15, 2019

Symptoms / cause:

By default libsodium only uses the /dev/random path to determine if
the CSPRNG is initialized before it starts using /dev/urandom.
It also prefers to use getrandom, which gives the same behavior for
free (blocking only until the CSPRNG is ready). On normal systems this
happens extremely rapidly, but there exists a weird long tail of systems
that can have bad entropy for long periods of time

Why is this an issue for QUADS

We use the paramiko libraries, which make calls into the libsodium / nacl bindings. libsodium only uses the /dev/random (blocking) path, and this is a known limitation for VM guests. We also don't want to modify third-party libraries to use something like /dev/urandom (non-blocking entropy), as that's not tenable.

  • This would be an issue with VMs using qemu / libvirt (what we use)
  • This would be an issue for containers running on qemu / libvirt (what we use for CI)

It's entirely possible that we only hit the lack of entropy because, at the time of testing QUADS 1.1/master, we aren't running the scheduling processes and tasks as we normally would. But we don't want users to hit this while first trying things out and get discouraged.

Switching to (or telling others to use) the virtio_rng setting and modifying their QEMU VMs isn't really an option. As /dev/urandom is not an option for libsodium (by design: crypto libraries should, as best practice, operate against blocking entropy), the pragmatic solution here is instead to use the haveged service to provide software entropy.

As amusing as it would be to have a valid excuse to run a service that just loops opening and closing a CD-ROM tray, or some other behavior that makes the bare-metal system appear haunted, to generate more entropy and pass it to the VM guest, that's probably not the best solution. It would be hilarious though; maybe we'll do it anyway.

Proposed Solution

  • Make haveged a service dependency of QUADS via RPM installation (you can switch back to /dev/random simply by turning it off) so that people running QUADS in a VM are covered.
  • For containers we will not include haveged: it most likely needs its own container just to run the entropy service as a daemon, and in testing it will not run inside a container unless kept in the foreground via haveged --Foreground.
  • If folks are using QUADS in containers inside a VM and they hit entropy issues, it's instead advised to run the containers on a bare-metal Docker host.
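As a quick check for whether a given VM is entropy-starved in the first place, the kernel's own estimate can be read directly (a sketch, Linux-only; entropy_available is a hypothetical helper, not part of QUADS):

```python
def entropy_available(path="/proc/sys/kernel/random/entropy_avail"):
    """Return the kernel's current entropy-pool estimate in bits, or
    None on non-Linux hosts. On the 2019-era kernels in question a full
    pool read 4096; values that stay very low on a VM indicate the
    starvation haveged is meant to address."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None
```

Running this before and after starting haveged on an affected guest should show the estimate recovering.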
@sadsfae

Member Author

sadsfae commented Mar 15, 2019

sadsfae added a commit that referenced this issue Mar 16, 2019

Fix Entropy Performance issue with libsodium, etc.
There seems to be a bug with entropy-pool gathering in libsodium in
which reads from /dev/random block.  This manifests itself in QEMU
because it has no hardware entropy source (and, as a result, in
containers run inside VMs).

As /dev/random is a blocking device, this shows up as slowdowns in
random, sporadic calls from QUADS.  libsodium, which is used by
paramiko, cannot use /dev/urandom (non-blocking entropy), so this moves
us to an optional software-based entropy service.

#221

More about the nacl bindings / libsodium bug:
PyNACL / Sodium:

pyca/pynacl#327

Symptoms / cause:

======
By default libsodium only uses the /dev/random path to determine if
the CSPRNG is initialized before it starts using /dev/urandom.
It also prefers to use getrandom, which gives the same behavior for
free (blocking only until the CSPRNG is ready). On normal systems this
happens extremely rapidly, but there exists a weird long tail of systems
that can have bad entropy for long periods of time
======

The pragmatic solution here is to instead use the haveged service to
provide software entropy.

https://issihosts.com/haveged/

* Make haveged a dependency on QUADS via RPM installation
* For containers include and start it in the Dockerfile
* haveged is optional; just turn it off if you're on bare metal or feel
you have sufficient entropy and want to use /dev/random.

Included are other fixes/changes below:

* Add python3-ipdb to rpm spec requirements
* Specify version requirements for ipmitool and git
* Remove contactbank wp plugin; we're not using it and it may not even
work right.
* Since HTTPD is required for serving visuals/instackenv.json make it
also be a dependency of quads-server.service
* Demote docker compose from recommended status
* Correct RPM spec warnings about incorrect date format
* Remove waffle.io badge / references as they are closing up shop.

https://waffle.io/closing-its-doors

Fixes: #221

Change-Id: I5c20defe12871e6399cf6b1ada659caf1a5e1b94
@sadsfae

Member Author

sadsfae commented Mar 19, 2019

This should be resolved with the inclusion of haveged in the QUADS RPM packaging, leaving it optional for folks to use or not.

@sadsfae sadsfae closed this Mar 19, 2019

@sadsfae sadsfae removed the in progress label Mar 19, 2019
