Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse loop device provisioning with systemd units #2936

Merged
merged 3 commits into from
Nov 27, 2020

Conversation

gdemonet
Copy link
Contributor

@gdemonet gdemonet commented Nov 19, 2020

Component: salt, storage

Context:

The previous approach was relying entirely on Salt (through our custom
metalk8s_volumes execution/state module) to manage provisioning and
cleanup of loop devices (on sparse files).
The issues arise when (re)booting a node:

  • Salt minion has to execute a startup state, which may come in when
    kube-apiserver is not (yet) available
  • losetup --find calls, with the version available for CentOS 7, are
    not atomic, so we may lose some devices if we have too many (re-running
    the state manually is the current workaround)

Summary:

To fix these, we introduce two main changes:

  • management of loop devices is delegated to systemd; using a unit
    template file, we define a service per volume (passing the Volume's UUID
    as a parameter), which will provision and attach the device on start,
    and detach it on stop
  • losetup --find invocations from these units is made sequential by
    using an exclusive lockfile (this is not perfect, but avoids the need to
    re-implement the command ourselves to include the fix from higher
    versions of losetup)

Note that we cannot simply use losetup --detach in the unit template,
because sparse volumes may not always be discovered from the same path:

  • either the volume is formatted, and we can find it in
    /dev/disk/by-uuid/
  • or it's kept raw, so we only have the UUID in the partition table, and
    we can discover the device through /dev/disk/by-partuuid/

Acceptance criteria:

  • CI keeps working (provisioning and cleanup of loop devices is tested)
  • (manual) can reboot a node with many sparse loop volumes (10+) and have them all in losetup -l after

Closes: #2726

@gdemonet gdemonet added kind:bug Something isn't working topic:storage Issues related to storage topic:flakiness Some test are flaky and cause CI to do transient failing complexity:medium Something that requires one or few days to fix labels Nov 19, 2020
@bert-e
Copy link
Contributor

bert-e commented Nov 19, 2020

Hello gdemonet,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Status report is not available.

@bert-e
Copy link
Contributor

bert-e commented Nov 19, 2020

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • one peer

Peer approvals must include at least 1 approval from the following list:

buildchain/buildchain/salt_tree.py Outdated Show resolved Hide resolved
Comment on lines -14 to -17
# Prepare volume at startup (required for loop devices persistency).
startup_states: sls
sls_list:
- /metalk8s/volumes/provisioned
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

salt/metalk8s/volumes/prepared/init.sls Outdated Show resolved Hide resolved
salt/metalk8s/volumes/prepared/init.sls Outdated Show resolved Hide resolved
salt/metalk8s/volumes/prepared/init.sls Outdated Show resolved Hide resolved
salt/metalk8s/volumes/prepared/init.sls Show resolved Hide resolved
@bert-e
Copy link
Contributor

bert-e commented Nov 19, 2020

Branches have diverged

This pull request's source branch bugfix/2726-sparse-file-systemd has diverged from
development/2.7 by more than 50 commits.

To avoid any integration risks, please re-synchronize them using one of the
following solutions:

  • Merge origin/development/2.7 into bugfix/2726-sparse-file-systemd
  • Rebase bugfix/2726-sparse-file-systemd onto origin/development/2.7

Note: If you choose to rebase, you may have to ask me to rebuild
integration branches using the reset command.

@@ -0,0 +1,61 @@
#!/usr/bin/env python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should specified 2 or 3 (seeing the import future I guess it's 2, but better be explicit).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And somewhere make sure it's installed on the system. Sure, we can assume it is (we're running salt-minion after all), but still...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sticking with Python 2 for now, not exactly sure where we should check if it's installed though. In the service template? Or through Salt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Salt. This is where maybe at some point these files shouldn't be dropped as-is by Salt, but packaged properly, which then allows to add a dependency on some Python version.

Comment on lines 30 to 33
partition_name = os.path.basename(os.path.realpath(path_by_partuuid))
device_name = os.path.basename(
os.path.realpath("/sys/class/block/{}/..".format(partition_name))
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if using regex to extract loop[0-9]+ isn't simpler here, but maybe that way is more reliable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A regex would make assumptions about the naming, which is not guaranteed. Here all we assume is how sysfs is structured, which is kernel API (I believe...).

We could assert the final path we calculate is an actual block device (from stat), and a loop device (also from stat, though not sure how to figure out the major of loop except by parsing /proc/devices).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Easy using awk: awk '$2=="loop" {print $1}' /proc/devices, and likely 7 on all systems, not dynamically registered, but still, hard-coding isn't great).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do prefer the regex as well. It's not like we're trying to make this logic work for "any kind of device partition", we know it is supposed to be a loop device, so if we don't find loop[0-9]+, it's a good idea to bail out (if some other OS we support decides to break this behaviour, our code will explicitly tell us).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree we want to assert it's a loop-device. However, we should not do so based on some name, but based on the major of the block-device matching the major assigned to the loop block-device driver.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check to verify that:

  • the device path exists
  • it's a block device
  • its major is the "loop" major (set to 7, hardcoded, sorry @NicolasT)

@gdemonet gdemonet force-pushed the bugfix/2726-sparse-file-systemd branch 3 times, most recently from 64594db to d1fd5e2 Compare November 20, 2020 11:19


SPARSE_FILES_DIR = '/var/lib/metalk8s/storage/sparse/'
LOOP_MAJOR = 7
Copy link
Contributor

@slaperche-scality slaperche-scality Nov 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could cite Documentation/admin-guide/devices.txt from linux source tree as reference.

e.g.:

# See Documentation/admin-guide/devices.txt in linux source tree
 LOOP_MAJOR = 7 

def is_loop_device(path):
device_stat = os.stat(path)
return stat.S_ISBLK(device_stat.st_mode) \
and (_major(device_stat.st_rdev) == LOOP_MAJOR)
Copy link
Contributor

@slaperche-scality slaperche-scality Nov 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use os.major(device_stat.st_rdev) instead of coding your own _major 😉

Copy link
Contributor

@NicolasT NicolasT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some issues

help='Cleanup MetalK8s sparse loop volumes',
)

parser.add_argument(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is required, why use a flag argument instead of a positional one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, remnants of the temporary implementation where I had an action (provision or cleanup) as positional. Anyway, will change.

)

parser.add_argument(
'-i', '--uid',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not uid, but uuid (though again, not sure this should be a flag).

parser.add_argument(
'-i', '--uid',
required=True,
help='UID of the MetalK8s Volume object',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UUID

import sys


SPARSE_FILES_DIR = '/var/lib/metalk8s/storage/sparse/'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can drop the last / (see below).


def check_sparse_file(volume_id):
"""Check if the sparse file exists for this volume ID."""
sparse_file_path = SPARSE_FILES_DIR + volume_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.path.join iso +

device_name = os.path.basename(
os.path.realpath("/sys/class/block/{}/..".format(partition_name))
)
device_path = '/dev/{}'.format(match.group(0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing something, but where does match come from? May need to add this file to the linted Python files in buildchain?

def is_loop_device(path):
device_stat = os.stat(path)
return stat.S_ISBLK(device_stat.st_mode) \
and (_major(device_stat.st_rdev) == LOOP_MAJOR)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not os.major?

[Service]
Type=oneshot
ExecStart=/bin/flock --exclusive --wait 10 /var/lock/metalk8s-sparse-volume.lock \
-c "/sbin/losetup --find --partscan /var/lib/metalk8s/storage/sparse/%i"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could not have -c "...", and use the sudo-style invocation, no?

Type=oneshot
ExecStart=/bin/flock --exclusive --wait 10 /var/lock/metalk8s-sparse-volume.lock \
-c "/sbin/losetup --find --partscan /var/lib/metalk8s/storage/sparse/%i"
ExecStop=/usr/local/bin/libexec/metalk8s-sparse-volume-cleanup --uid "%i"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/usr/local/bin/libexec ain't right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops.

{%- if to_remove is not none %}
{%- if 'sparseLoopDevice' in to_remove.spec %}
Disable systemd provisioning of loop device for {{ volume_name }}:
service.disabled:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this also 'stop' the service?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I didn't realize it wasn't stopped, because I left the cleanup logic in the module 😕

@gdemonet gdemonet force-pushed the bugfix/2726-sparse-file-systemd branch 4 times, most recently from b59a16f to b22f504 Compare November 25, 2020 13:26
@bert-e
Copy link
Contributor

bert-e commented Nov 25, 2020

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • one peer

Peer approvals must include at least 1 approval from the following list:

The following reviewers are expecting changes from the author, or must review again:

@gdemonet gdemonet marked this pull request as ready for review November 25, 2020 13:26
Comment on lines 17 to 23
SPARSE_FILES_DIR = '/var/lib/metalk8s/storage/sparse'

# https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/devices.txt#L191
LOOP_MAJOR = 7

# https://github.com/torvalds/linux/blob/master/include/uapi/linux/loop.h#L110
LOOP_CLR_FD = 0x4C01
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since those constants are always used by a single function, maybe it's better to declare them near their usage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I just store them at the top of my modules, out of habit :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, bike-shedding: I like them on top, actually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, missed your comment @NicolasT so already did the change. I think both sides have good arguments, but the stakes are too low for us to have this debate. So I won't change it anymore 😜

)

if not is_loop_device(device_path):
die(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: instead of having a die function, maybe raising an exception (+ global try/catch around main()) would be clearer (more idiomatic at least).
This (global try/catch) would also handle the errors you forgot/ignored on purpose (failure of os.stat and os.open for instance).

Not sure it's worth it though, it's your call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good suggestion, and not too much work, I'll handle it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did something really basic, should do the trick 🙂

@@ -651,7 +651,8 @@ def _get_parts(self) -> Iterator[str]:
Path('salt/metalk8s/volumes/init.sls'),
Path('salt/metalk8s/volumes/prepared/init.sls'),
Path('salt/metalk8s/volumes/prepared/installed.sls'),
Path('salt/metalk8s/volumes/provisioned/init.sls'),
Path('salt/metalk8s/volumes/prepared/files/metalk8s-sparse-volume@.service'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About commit message: really good one 👍

On

to include the fix from higher versions of losetup

Maybe add a link to the fix and/or mention in which version it was fixed (v2.25 IIRC), maybe when we migrate to CentOS 8 we can get rid of this workaround.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@gdemonet gdemonet force-pushed the bugfix/2726-sparse-file-systemd branch 2 times, most recently from e61b37d to 6567a22 Compare November 26, 2020 08:12
@bert-e
Copy link
Contributor

bert-e commented Nov 26, 2020

Conflict

There is a conflict between your branch bugfix/2726-sparse-file-systemd and the
destination branch development/2.7.

Please resolve the conflict on the feature branch (bugfix/2726-sparse-file-systemd).

 $ git fetch
 $ git checkout origin/bugfix/2726-sparse-file-systemd
 $ git merge origin/development/2.7
 $ # <intense conflict resolution>
 $ git commit
 $ git push origin HEAD:bugfix/2726-sparse-file-systemd

The previous approach was relying entirely on Salt (through our custom
`metalk8s_volumes` execution/state module) to manage provisioning and
cleanup of loop devices (on sparse files).
The issues arise when (re)booting a node:
- Salt minion has to execute a startup state, which may come in when
kube-apiserver is not (yet) available
- `losetup --find` calls, with the version available for CentOS 7, are
not atomic, so we may lose some devices if we have too many (re-running
the state manually is the current workaround)

To fix these, we introduce two main changes:
- management of loop devices is delegated to systemd; using a unit
template file, we define a service per volume (passing the Volume's UUID
as a parameter), which will provision and attach the device on start,
and detach it on stop
- `losetup --find` invocations from these units is made sequential by
using an exclusive lockfile (this is not perfect, but avoids the need to
re-implement the command ourselves to include the fix from losetup 2.25
and higher: see util-linux/util-linux@f7e21185)

Note that we cannot simply use `losetup --detach` in the unit template,
because sparse volumes may not always be discovered from the same path:
- either the volume is formatted, and we can find it in
/dev/disk/by-uuid/
- or it's kept raw, so we only have the UUID in the partition table, and
we can discover the device through /dev/disk/by-partuuid/

Fixes: #2726
We adjust the `lint-python` tox environment to:
- handle Python 2 and Python 3 files (we only run mypy on the latter)
- always run both pylint and mypy, and report exit codes for both in
case of failure

See: #2726
@gdemonet gdemonet force-pushed the bugfix/2726-sparse-file-systemd branch from a5fcb70 to f6d45d9 Compare November 26, 2020 12:53
@bert-e
Copy link
Contributor

bert-e commented Nov 26, 2020

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • one peer

Peer approvals must include at least 1 approval from the following list:

The following reviewers are expecting changes from the author, or must review again:


# Needed by the sparse volume cleanup script
Ensure Python 2 is available:
test.fail_without_changes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I would've used a pkg_installed('python2') or something along those lines here ;-) But sure, works for this particular case I guess.

@bert-e
Copy link
Contributor

bert-e commented Nov 26, 2020

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • one peer

Peer approvals must include at least 1 approval from the following list:

@gdemonet
Copy link
Contributor Author

/approve

@bert-e
Copy link
Contributor

bert-e commented Nov 27, 2020

In the queue

The changeset has received all authorizations and has been added to the
relevant queue(s). The queue(s) will be merged in the target development
branch(es) as soon as builds have passed.

The changeset will be merged in:

  • ✔️ development/2.7

The following branches will NOT be impacted:

  • development/1.0
  • development/1.1
  • development/1.2
  • development/1.3
  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3
  • development/2.4
  • development/2.5
  • development/2.6

There is no action required on your side. You will be notified here once
the changeset has been merged. In the unlikely event that the changeset
fails permanently on the queue, a member of the admin team will
contact you to help resolve the matter.

IMPORTANT

Please do not attempt to modify this pull request.

  • Any commit you add on the source branch will trigger a new cycle after the
    current queue is merged.
  • Any commit you add on one of the integration branches will be lost.

If you need this pull request to be removed from the queue, please contact a
member of the admin team now.

The following options are set: approve

@bert-e
Copy link
Contributor

bert-e commented Nov 27, 2020

I have successfully merged the changeset of this pull request
into targetted development branches:

  • ✔️ development/2.7

The following branches have NOT changed:

  • development/1.0
  • development/1.1
  • development/1.2
  • development/1.3
  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3
  • development/2.4
  • development/2.5
  • development/2.6

Please check the status of the associated issue None.

Goodbye gdemonet.

@bert-e bert-e merged commit f6d45d9 into development/2.7 Nov 27, 2020
@bert-e bert-e deleted the bugfix/2726-sparse-file-systemd branch November 27, 2020 08:14
gdemonet added a commit that referenced this pull request Nov 27, 2020
TeddyAndrieux pushed a commit that referenced this pull request Nov 27, 2020
TeddyAndrieux pushed a commit that referenced this pull request Nov 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
complexity:medium Something that requires one or few days to fix kind:bug Something isn't working topic:flakiness Some test are flaky and cause CI to do transient failing topic:storage Issues related to storage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pod failed to mount SparseLoopDevice Volume after node reboot
5 participants