
CentOS CI "get well" plan #18

Closed
13 of 19 tasks
mrc0mmand opened this issue Oct 18, 2018 · 48 comments

@mrc0mmand
Member

mrc0mmand commented Oct 18, 2018

The purpose of this issue is to keep track of the things that need to be done to make the systemd CentOS CI work again.

The following things still need to be done:

Long term goals:

  • Upstream all downstream RHEL tests (= drop annoying test syncing)

Notes:

@mrc0mmand
Member Author

* [ ]  (?) Use qemu-kvm instead of systemd-nspawn (or both?) [slave/testsuite.sh]

@keszybz any thoughts on this one?

@mrc0mmand
Member Author

mrc0mmand commented Oct 21, 2018

Failing tests:

  • TEST-01-BASIC
  • TEST-15-DROPIN
  • TEST-22-TMPFILES
  • TEST-24-UNIT-TESTS

Common error:

+ env --unset=UNIFIED_CGROUP_HIERARCHY /root/systemd-centos-ci/systemd/build/systemd-nspawn -U --private-network --register=no --kill-signal=SIGKILL --directory=/var/tmp/systemd-test.UnjLx7/unprivileged-nspawn-root /usr/lib/systemd/systemd
Spawning container unprivileged-nspawn-root on /var/tmp/systemd-test.UnjLx7/unprivileged-nspawn-root.
Press ^] three times within 1s to kill container.
Selected user namespace base 970129408 and range 65536.
Failed to fork inner child: Invalid argument
E: nspawn failed with exit code 1
-rw-r-----+ 1 root systemd-journal 8388608 Oct 21 15:07 /var/tmp/systemd-test.UnjLx7/journal/32e86a2de4e543fe8c41793961c76987/system.journal
make: *** [run] Error 1
make: Leaving directory `/root/systemd-centos-ci/systemd/test/
--x-- Result of TEST-01-BASIC: 2 --x--

This happens even with user_namespace.enable=1

[root@host-8-251-180 systemd]# tr ' ' '\n' </proc/cmdline | grep user_namespace 
user_namespace.enable=1
[root@host-8-251-180 systemd]# cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-3.10.0-862.14.4.el7.x86_64 root=UUID=40735eda-bc43-4610-961f-bc5c0353239a ro console=tty0 console=ttyS0,115200 crashkernel=auto net.ifnames=0 rhgb quiet LANG=en_US.UTF-8 user_namespace.enable=1
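The tr|grep check above can be wrapped into a small helper (hypothetical, for illustration only; the optional second argument lets it read a file other than /proc/cmdline):

```shell
# Hypothetical helper: report whether a kernel parameter is present on
# the command line, using the same tr|grep trick as above. The second
# argument (an alternate cmdline file) exists only to make it testable.
has_kparam() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -q "^$1"
}

# Usage:
#   has_kparam user_namespace.enable=1 && echo "userns enabled on cmdline"
```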

Workaround/fix:

# echo 10000 > /proc/sys/user/max_user_namespaces
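To make the workaround survive reboots, a sysctl.d fragment could be dropped in. A minimal sketch, assuming the fragment name (the function must be run as root):

```shell
# Sketch: persist the user-namespace limit across reboots via sysctl.d.
# The fragment name 90-max-user-namespaces.conf is an arbitrary choice.
persist_userns_limit() {
    echo "user.max_user_namespaces = 10000" \
        > /etc/sysctl.d/90-max-user-namespaces.conf
    sysctl -p /etc/sysctl.d/90-max-user-namespaces.conf
}

# Run as root:
#   persist_userns_limit
```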

Persisting issues (since fixed by installing the missing dependencies quota and net-tools):

# make -C test/TEST-22-TMPFILES/ setup
...
+ for _x in inst_symlink inst_script inst_binary inst_simple
+ inst_simple ldconfig.real
+ [[ -f ldconfig.real ]]
+ return 1
+ return 1
+ [[ yes = yes ]]
+ dinfo 'Skipping program ldconfig.real as it cannot be found and is' 'flagged to be optional'
+ set +x
I: Skipping program ldconfig.real as it cannot be found and is flagged to be optional
make: *** [setup] Error 1
make: Leaving directory `/root/systemd-centos-ci/systemd/test/TEST-22-TMPFILES'

...

@keszybz
Member

keszybz commented Oct 23, 2018

  • (?) Use qemu-kvm instead of systemd-nspawn (or both?) [slave/testsuite.sh]

"Both" is worthwhile, because different things are tested in both environments. But reliability is more important than having both, so if just one can be made to work, that's better than having flaky tests.

@mrc0mmand
Member Author

mrc0mmand commented Oct 26, 2018

Notes from the "make the QEMU testsuite work again" session:

  • run each test with the correct INITRD and KERNEL_IMG env vars (/boot/initramfs-$(uname -r).img and /boot/vmlinuz-$(uname -r), respectively)
  • dracut includes filesystem modules ONLY for filesystems currently present in /etc/fstab (apparently), which is xfs for a default CentOS installation. However, the QEMU testsuite uses an ext4 filesystem, which results in a boot failure for the respective virtual machine (dracut -f --filesystems ext4 to the rescue)
  • switching between nspawn/QEMU is currently done by removing/creating the /usr/bin/qemu-kvm symlink (to /usr/libexec/qemu-kvm). Maybe there's a nicer way
  • TEST-13-NSPAWN-SMOKE seems to be failing again under qemu systemd#10544
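The first two notes above can be sketched as a single helper. This is an illustration, not the actual CI script: it assumes a systemd checkout, root privileges, and the upstream test suite's make targets.

```shell
# Sketch: regenerate the initrd with the ext4 driver (the QEMU test
# image uses ext4) and run one test with the kernel/initrd paths from
# the notes above. Nothing runs until the function is invoked, as root,
# from a systemd checkout.
run_one_qemu_test() {
    dracut -f --filesystems ext4
    export KERNEL_IMG="/boot/vmlinuz-$(uname -r)"
    export INITRD="/boot/initramfs-$(uname -r).img"
    make -C "test/$1" clean setup run
}

# Usage:
#   run_one_qemu_test TEST-01-BASIC
```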

@mrc0mmand
Member Author

mrc0mmand commented Nov 14, 2018

@evverx With the help of several other people I finally got something that could get things moving again - I'm going to propose this ticket at the CentOS CBS meeting (every Monday, 2 PM UTC in #centos-devel@Freenode) and hopefully it will get us somewhere.

@mrc0mmand
Member Author

Apparently there was a miscommunication, so I didn't receive the previous email. However, I finally got the credentials, so we can start breaking things!

(OT: Is there any chat to catch you in (e.g. IRC, Telegram, etc.)? @evverx)

@evverx
Member

evverx commented Nov 20, 2018

However I finally got the credentials, so we can start breaking things!

That's great news! Congratulations!

Is there any chat to catch you in (e.g. IRC, Telegram, etc.)?

I'm afraid it isn't possible to catch me there, but, on the positive side, I usually reply to comments on GitHub relatively fast.

@mrc0mmand
Member Author

mrc0mmand commented Nov 20, 2018

Notes from the "why it doesn't work in CentOS CI infrastructure" session:

  • the target nodes apparently don't like the new initrd image generated by upstream dracut - the machine won't boot after reboot (dropping the dracut initrd re-generation solves the issue for now); I'll investigate further
  • it's not the upstream dracut, the same thing happens after dracut -f --regenerate-all with the downstream package
  • I can either re-generate the initrd or install upstream systemd; if I do both, the system won't boot (and debugging boot issues without a serial console is wonderful...)

@evverx
Member

evverx commented Nov 20, 2018

Could it be that you ran into systemd/systemd#10854? There are two PRs that are supposed to fix the issue. Could you try applying one of them to see if it works?

@yuwata
Member

yuwata commented Nov 20, 2018

If the failure is caused by systemd/systemd#10854, then please provide any logs or something if possible. Thank you.

@yuwata
Member

yuwata commented Nov 20, 2018

Another possibility is systemd/systemd#10754...

@mrc0mmand
Member Author

mrc0mmand commented Nov 20, 2018

Unfortunately, neither mentioned issue seems to be relevant for this case. I did a quick bisect, but the issue occurs all the way down to systemd/systemd@80df8f2 - without this commit systemd won't compile; I'll try to work around it tomorrow.

Also, I'll try to ask for some possibility to get any useful logs from the machine after it dies.

Anyway, in my opinion, the issue is somewhere in the multipath which is used for the root filesystem...

@evverx
Member

evverx commented Nov 20, 2018

Regarding systemd/systemd@80df8f2, I think meson -Dnetworkd=false might help to get around it.

@yuwata
Member

yuwata commented Nov 21, 2018

Could you try to boot with udev.children_max=1?

@mrc0mmand
Member Author

Thanks a lot for the suggestions; unfortunately, neither of them helped. -Dnetworkd=false excludes systemd-networkd from the compilation, but sd-netlink still causes issues, and the issue still occurs even with udev.children_max=1.

I raised the post-mortem debugging issue on the CentOS CI Users mailing list so let's see if someone will be able to help.

In the meantime I'll play around with bisect in hopes I'll stumble upon the root cause...

@mrc0mmand
Member Author

mrc0mmand commented Nov 22, 2018

Notes from the "why it doesn't work in CentOS CI infrastructure" session, part 2:

[0]
curl -q https://github.com/systemd/systemd/commit/80df8f2518aa07ef3c328e1c634573347e130cf0.patch | git am

@evverx
Member

evverx commented Nov 22, 2018

@mrc0mmand thank you a lot for finding the offending commit! By the way, apparently GitHub doesn't send notifications when comments are edited so probably major breakthroughs deserve to be written down separately :-)

The easiest way to unbreak CentOS CI would be to revert that commit. @keszybz @poettering @yuwata what do you think? As usual, I agree that it would be much better to figure out what's going on and fix it, but, in this case, it's not that easy and given how long it took to get access to the testing infrastructure I don't think the question @mrc0mmand asked in https://lists.centos.org/pipermail/ci-users/2018-November/000918.html will be answered anytime soon.

@mrc0mmand
Member Author

mrc0mmand commented Nov 22, 2018

@evverx I was trying to get a remote shell in the initrd to get logs before pinging everyone, and believe it or not, I managed to do it using https://github.com/dracut-crypt-ssh/dracut-crypt-ssh! Right now I have a working shell and access to the journal and kernel ring buffer, so I'll open an issue shortly with as many logs as I can get.

@evverx
Member

evverx commented Nov 22, 2018

@mrc0mmand that's great! I'm wondering if it would be possible to use it in the script that reboots and connects to the machine so that in the future issues like this would be a little bit easier to debug. It could just dump all the logs somewhere, which is better than nothing I guess and more or less automatic.

@mrc0mmand
Member Author

I guess we could incorporate it into the CI scripts, as the setup is fairly simple.

@mrc0mmand
Member Author

mrc0mmand commented Nov 23, 2018

The testsuite almost passes, there's some issue with networking, hopefully it's not something major - https://ci.centos.org/job/systemd-pr-build/3673/console

Debug log from systemd-networkd-tests.py: https://paste.fedoraproject.org/paste/jnhwagD3-saGbeCNzYYk0w

@ssahani could you shed some light on what's happening here?

@evverx
Member

evverx commented Nov 23, 2018

I suspect test-execute is failing because the regular expression doesn't cover all links that can pop up after a new network namespace is created as was discussed in systemd/systemd#10331 (comment).

Regarding systemd-networkd-tests.py, could you try stopping dnsmasq before running the test to see if it helps?

@mrc0mmand
Member Author

I suspect test-execute is failing because the regular expression doesn't cover all links

That makes sense, thanks for the reference link.

Regarding systemd-networkd-tests.py, could you try stopping dnsmasq before running the test to see if it helps?

Unfortunately not. I even tried rebooting the machine before the test itself, but it still fails in the same way.

@evverx
Member

evverx commented Nov 23, 2018

@mrc0mmand could you create a new issue about systemd-networkd-tests.py so that it would be possible to track it properly? This issue is already hard to follow if you ask me :-)

@mrc0mmand
Member Author

Tracking issues for current CentOS CI blockers:

  • test-execute - systemd/systemd#10934
  • systemd-networkd-tests.py - #23

@evverx
Member

evverx commented Nov 28, 2018

@mrc0mmand I'm wondering if you have figured out what @systemd-centos-ci is. I think it would make sense to turn CentOS CI on as soon as possible to at least make sure that systemd compiles and the rest of the tests still pass.

systemd-networkd-tests.py seems to be always broken (partly because nobody has ever run it automatically) and can be skipped for now and test-execute (or more precisely exec-privatenetwork-yes.service) isn't exactly useful and can be replaced with something that simply won't fail.

@mrc0mmand
Member Author

@evverx IMHO @systemd-centos-ci was created simply to provide an API key for the GitHub builder plugin in the CentOS CI jenkins - this allows jenkins to update the commit/PR state according to the results of the test run. However, I don't know who has access to this account, so maybe it would be wise if I just used my own API key (with limited permissions), so we have everything under our control.

I'll go ahead and temporarily disable the mentioned tests so the results are finally usable.

@mrc0mmand
Member Author

Ah, I take that back - I can't use my API key, as I don't have the appropriate permissions in systemd/systemd. Either we could track down the owner of @systemd-centos-ci or just create a new account for that purpose.

@evverx
Member

evverx commented Nov 28, 2018

I have no problem with a new account. If I understand correctly, it'll just have to be invited as a collaborator and I can do that. But, as far as I know, https://wiki.centos.org/QaWiki/CI/GithubIntegration will no longer be applicable there, so it'd be great if you could let me know what the webhook is supposed to look like. Now it just points to https://ci.centos.org/ghprbhook/ with no secret.

@evverx
Member

evverx commented Nov 28, 2018

@keszybz it would be great if you could help here. Judging by the presence of @systemd-centos-ci, I assume there are some reasons unknown to me for it to be here (most likely related to secure access to the repository, but who knows).

@evverx
Member

evverx commented Nov 28, 2018

In light of the recent events that shall remain nameless, one can never be too cautious giving write access to the repository :-)

@mrc0mmand
Member Author

mrc0mmand commented Nov 28, 2018

@evverx Sorry for the delay, I wanted to make sure everything works before we start messing with webhooks. I temporarily disabled the problematic parts of the testsuite in 42340c2 and it's finally passing: https://ci.centos.org/job/systemd-pr-build/3676/console.

I guess now we just have to figure out which user to use for the CI, so I can configure it properly on the jenkins side.

@evverx
Member

evverx commented Nov 30, 2018

@mrc0mmand given that I already bother contributors with LGTM alerts like systemd/systemd#10249 (comment) I think we could use my account as a bearer of bad news (at least temporarily). What do you think?

@evverx
Member

evverx commented Nov 30, 2018

Though, I'd prefer it if @poettering and @keszybz chimed in here because I'm still not sure whether anyone else is interested in getting it working.

@evverx
Member

evverx commented Nov 30, 2018

On second thought, it also seems reasonable to me to invite @mrc0mmand as a collaborator to the systemd repository and point CentOS CI to @mrc0mmand's handle. I'm pretty sure it'll make everything much faster, simpler and even a little bit more secure.

@evverx
Member

evverx commented Dec 3, 2018

And systemd is failing to compile on CentOS again: systemd/systemd#11036.

@mrc0mmand
Member Author

I guess CentOS CI could have easily prevented that...

As for the ideas above - using your account, @evverx, is definitely possible, but I don't like the idea of being in charge of someone else's API key. Not that I have any ulterior motives, but it's still a responsibility.

@evverx
Member

evverx commented Dec 3, 2018

@mrc0mmand I'm completely with you on this one; that's why I suggested inviting you as a collaborator to the systemd repository. I'd do that right now, but I'm not sure I can make decisions like that without at least one ACK. Maybe you could ping someone to speed up the process.

@evverx
Member

evverx commented Dec 4, 2018

So, manually launching a CentOS VM and running ./agent/bootstrap.sh to see whether systemd/systemd#11036 is gone was the last straw. I'll take the liberty of inviting @mrc0mmand as a collaborator to the systemd repository. I think that making the scripts in this repository usable and resurrecting Travis CI is enough for me to be sure that @mrc0mmand can get things done. Plus, apparently @mrc0mmand cares about my API key even more than I do :-)

@evverx
Member

evverx commented Dec 4, 2018

@mrc0mmand let me know when (and probably how) I should turn the webhook on. 6 hours ago https://ci.centos.org/ghprbhook/ responded with 500 so I turned it off again.

@mrc0mmand
Member Author

@evverx will do! However, as usual, there is one small catch, because otherwise things would be too easy... In Jenkins, every user has their own credentials store to manage credentials for various plugins, but, for some reason, I can't manage credentials for the plugin we need (GitHub Pull Request Builder). I just asked about that on the #centos-devel channel, so let's hope for a (relatively) fast response.

@evverx
Member

evverx commented Dec 4, 2018

@mrc0mmand in case the response isn't fast, I'm wondering if it would be possible, as a last resort, to trigger CentOS CI via Travis CI. I'm fantasizing here and assuming you have everything you need to run Jenkins jobs that can produce reports like https://ci.centos.org/job/systemd-pr-build/3676/console. In theory, could we encrypt your credentials and use them to spawn VMs via agent-control.py and then put a link to the report at the end of the Travis build log?

@mrc0mmand
Member Author

@evverx I just gave up and wrote a simple wrapper that does the status reporting, and it seems to be working. I'll definitely improve it as soon as possible (or ditch it completely if I figure out the jenkins plugin madness), but for now it should finally start delivering results to PRs.
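For reference, a hypothetical sketch of what such a status-reporting wrapper might boil down to. The statuses endpoint is GitHub's real commit-status API; the function name, arguments, and GITHUB_TOKEN variable are assumptions for illustration, not the actual wrapper.

```shell
# Hypothetical sketch: post a commit status back to GitHub once a
# Jenkins run finishes. GITHUB_TOKEN is assumed to hold an API token
# with permission to set commit statuses.
report_status() {
    # args: <commit sha> <success|failure|error|pending> <report URL>
    payload=$(printf '{"state":"%s","context":"CentOS CI","target_url":"%s"}' \
                     "$2" "$3")
    curl -s -X POST \
         -H "Authorization: token $GITHUB_TOKEN" \
         -d "$payload" \
         "https://api.github.com/repos/systemd/systemd/statuses/$1"
}
```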

For now, just set up a webhook according to the CentOS CI documentation, i.e.:

1. On your Github Project page, choose 'Settings'
2. Navigate to the 'Webhooks and Services' tab
3. Choose 'Add a webhook'
4. Select 'Let me select individual events' under 'Which events would you like to trigger this webhook?'
5. Unselect 'Push' and select 'Pull Request' and 'Issue Comment'
6. Paste 'https://ci.centos.org/ghprbhook/' (note the trailing slash) in the Payload URL
7. Open the new webhook and verify the ping in the Recent Deliveries section

@evverx
Member

evverx commented Dec 4, 2018

I didn't select "Issue Comment" because I'm not sure it'd be useful. To judge from systemd/systemd#11045, the hook has started to deliver :-)

@mrc0mmand
Member Author

So, thanks to collaboration with Brian, we now have a working CentOS CI without workarounds. As the next step, I'll sort out artifact exporting so the logs can be properly investigated in case of failure.

@mrc0mmand
Member Author

mrc0mmand commented Dec 5, 2018

I was working on the artifact exporting, but stumbled upon an issue with permissions (which was fixed by Brian). In the meantime I set up a CI for this repository, so we don't have to manually test every change, see #24 and #25.

Hopefully the artifact exporting should be up tomorrow.

@mrc0mmand
Member Author

Quick update:

@mrc0mmand
Member Author

I'd say the main goal of this issue was successfully achieved - the CentOS CI is working and delivering results. I'm going to close this issue and move any outstanding issues to a new one, to keep things easier to follow.
