
CentOS CI "get well" plan #18

Closed
13 of 19 tasks
mrc0mmand opened this issue Oct 18, 2018 · 48 comments

@mrc0mmand
Member

mrc0mmand commented Oct 18, 2018

The purpose of this issue is to keep track of the things that need to be done to make the systemd CentOS CI work again.

The following things still need to be done:

Long term goals:

  • Upstream all downstream RHEL tests (= drop annoying test syncing)

Notes:

@mrc0mmand
Member Author

* [ ]  (?) Use qemu-kvm instead of systemd-nspawn (or both?) [slave/testsuite.sh]

@keszybz any thoughts on this one?

@mrc0mmand
Member Author

mrc0mmand commented Oct 21, 2018

Failing tests:

  • TEST-01-BASIC
  • TEST-15-DROPIN
  • TEST-22-TMPFILES
  • TEST-24-UNIT-TESTS

Common error:

+ env --unset=UNIFIED_CGROUP_HIERARCHY /root/systemd-centos-ci/systemd/build/systemd-nspawn -U --private-network --register=no --kill-signal=SIGKILL --directory=/var/tmp/systemd-test.UnjLx7/unprivileged-nspawn-root /usr/lib/systemd/systemd
Spawning container unprivileged-nspawn-root on /var/tmp/systemd-test.UnjLx7/unprivileged-nspawn-root.
Press ^] three times within 1s to kill container.
Selected user namespace base 970129408 and range 65536.
Failed to fork inner child: Invalid argument
E: nspawn failed with exit code 1
-rw-r-----+ 1 root systemd-journal 8388608 Oct 21 15:07 /var/tmp/systemd-test.UnjLx7/journal/32e86a2de4e543fe8c41793961c76987/system.journal
make: *** [run] Error 1
make: Leaving directory `/root/systemd-centos-ci/systemd/test/
--x-- Result of TEST-01-BASIC: 2 --x--

This happens even with user_namespace.enable=1

[root@host-8-251-180 systemd]# tr ' ' '\n' </proc/cmdline | grep user_namespace 
user_namespace.enable=1
[root@host-8-251-180 systemd]# cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-3.10.0-862.14.4.el7.x86_64 root=UUID=40735eda-bc43-4610-961f-bc5c0353239a ro console=tty0 console=ttyS0,115200 crashkernel=auto net.ifnames=0 rhgb quiet LANG=en_US.UTF-8 user_namespace.enable=1
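The tr|grep check above can be wrapped into a small helper (hypothetical, for illustration only; the optional second argument lets it read a file other than /proc/cmdline):

```shell
# Hypothetical helper: report whether a kernel parameter is present on
# the command line, using the same tr|grep trick as above. The second
# argument (an alternate cmdline file) exists only to make it testable.
has_kparam() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -q "^$1"
}

# Usage:
#   has_kparam user_namespace.enable=1 && echo "userns enabled on cmdline"
```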

Workaround/fix:

# echo 10000 > /proc/sys/user/max_user_namespaces
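To make the workaround survive reboots, a sysctl.d fragment could be dropped in. A minimal sketch, assuming the fragment name (the function must be run as root):

```shell
# Sketch: persist the user-namespace limit across reboots via sysctl.d.
# The fragment name 90-max-user-namespaces.conf is an arbitrary choice.
persist_userns_limit() {
    echo "user.max_user_namespaces = 10000" \
        > /etc/sysctl.d/90-max-user-namespaces.conf
    sysctl -p /etc/sysctl.d/90-max-user-namespaces.conf
}

# Run as root:
#   persist_userns_limit
```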

Persisting issues (since fixed by installing the missing dependencies quota and net-tools):

# make -C test/TEST-22-TMPFILES/ setup
...
+ for _x in inst_symlink inst_script inst_binary inst_simple
+ inst_simple ldconfig.real
+ [[ -f ldconfig.real ]]
+ return 1
+ return 1
+ [[ yes = yes ]]
+ dinfo 'Skipping program ldconfig.real as it cannot be found and is' 'flagged to be optional'
+ set +x
I: Skipping program ldconfig.real as it cannot be found and is flagged to be optional
make: *** [setup] Error 1
make: Leaving directory `/root/systemd-centos-ci/systemd/test/TEST-22-TMPFILES'

...

@keszybz
Member

keszybz commented Oct 23, 2018

  • (?) Use qemu-kvm instead of systemd-nspawn (or both?) [slave/testsuite.sh]

"Both" is worthwhile, because different things are tested in both environments. But reliability is more important than having both, so if just one can be made to work, that's better than having flaky tests.

@mrc0mmand
Member Author

mrc0mmand commented Oct 26, 2018

Notes from the "make the QEMU testsuite work again" session:

  • run each test with the correct INITRD and KERNEL_IMG env vars (/boot/initramfs-$(uname -r).img and /boot/vmlinuz-$(uname -r), respectively)
  • dracut includes filesystem modules ONLY for filesystems currently present in /etc/fstab (apparently), which is xfs for a default CentOS installation. However, the QEMU testsuite uses an ext4 filesystem, which results in a boot failure for the respective virtual machine (dracut -f --filesystems ext4 to the rescue)
  • switching between nspawn/QEMU is currently done by removing/creating the /usr/bin/qemu-kvm symlink (to /usr/libexec/qemu-kvm). Maybe there's a nicer way
  • TEST-13-NSPAWN-SMOKE seems to be failing again under qemu systemd#10544
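The first two notes above can be sketched as a single helper. This is an illustration, not the actual CI script: it assumes a systemd checkout, root privileges, and the upstream test suite's make targets.

```shell
# Sketch: regenerate the initrd with the ext4 driver (the QEMU test
# image uses ext4) and run one test with the kernel/initrd paths from
# the notes above. Nothing runs until the function is invoked, as root,
# from a systemd checkout.
run_one_qemu_test() {
    dracut -f --filesystems ext4
    export KERNEL_IMG="/boot/vmlinuz-$(uname -r)"
    export INITRD="/boot/initramfs-$(uname -r).img"
    make -C "test/$1" clean setup run
}

# Usage:
#   run_one_qemu_test TEST-01-BASIC
```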

@mrc0mmand
Member Author

mrc0mmand commented Nov 14, 2018

@evverx With the help of several other people I finally got something that could get things moving again - I'm going to propose this ticket at the CentOS CBS meeting (every Monday, 2 PM UTC in #centos-devel@Freenode) and hopefully it will get us somewhere.

@mrc0mmand
Member Author

Apparently there was a miscommunication, so I didn't receive the previous email. However, I finally got the credentials, so we can start breaking things!

(OT: Is there any chat to catch you in (e.g. IRC, Telegram, etc.)? @evverx)

@evverx
Member

evverx commented Nov 20, 2018

However I finally got the credentials, so we can start breaking things!

That's great news! Congratulations!

Is there any chat to catch you in (e.g. IRC, Telegram, etc.)?

I'm afraid it isn't possible to catch me there, but, on the positive side, I usually reply to comments on GitHub relatively fast.

@mrc0mmand
Member Author

mrc0mmand commented Nov 20, 2018

Notes from the "why it doesn't work in CentOS CI infrastructure" session:

  • the target nodes apparently don't like the new initrd image generated by upstream dracut - the machine won't boot after reboot (dropping the dracut initrd re-generation solves the issue for now); I'll investigate further
  • it's not the upstream dracut, the same thing happens after dracut -f --regenerate-all with the downstream package
  • I can either re-generate the initrd or install upstream systemd; if I do both, the system won't boot (and debugging boot issues without a serial console is wonderful...)

@evverx
Member

evverx commented Nov 20, 2018

Could it be that you ran into systemd/systemd#10854? There are two PRs that are supposed to fix the issue. Could you try applying one of them to see if it works?

@yuwata
Member

yuwata commented Nov 20, 2018

If the failure is caused by systemd/systemd#10854, then please provide any logs or something if possible. Thank you.

@yuwata
Member

yuwata commented Nov 20, 2018

Another possibility is systemd/systemd#10754...

@mrc0mmand
Member Author

mrc0mmand commented Nov 20, 2018

Unfortunately, neither mentioned issue seems to be relevant for this case. I did a quick bisect, but the issue occurs all the way down to systemd/systemd@80df8f2 - without this commit systemd won't compile; I'll try to work around it tomorrow.

Also, I'll try to ask for some possibility to get any useful logs from the machine after it dies.

Anyway, in my opinion, the issue is somewhere in the multipath which is used for the root filesystem...

@evverx
Member

evverx commented Nov 20, 2018

Regarding systemd/systemd@80df8f2, I think meson -Dnetworkd=false might help to get around it.

@yuwata
Member

yuwata commented Nov 21, 2018

Could you try to boot with udev.children_max=1?

@mrc0mmand
Member Author

Thanks a lot for the suggestions; unfortunately, neither of them helped. -Dnetworkd=false excludes systemd-networkd from the compilation, but sd-netlink still causes issues, and the issue still occurs even with udev.children_max=1.

I raised the post-mortem debugging issue on the CentOS CI Users mailing list so let's see if someone will be able to help.

In the meantime I'll play around with bisect in hopes I'll stumble upon the root cause...

@mrc0mmand
Member Author

mrc0mmand commented Nov 22, 2018

Notes from the "why it doesn't work in CentOS CI infrastructure" session, part 2:

[0]
curl -q https://github.com/systemd/systemd/commit/80df8f2518aa07ef3c328e1c634573347e130cf0.patch | git am

@evverx
Member

evverx commented Nov 22, 2018

@mrc0mmand thank you a lot for finding the offending commit! By the way, apparently GitHub doesn't send notifications when comments are edited so probably major breakthroughs deserve to be written down separately :-)

The easiest way to unbreak CentOS CI would be to revert that commit. @keszybz @poettering @yuwata what do you think? As usual, I agree that it would be much better to figure out what's going on and fix it, but, in this case, it's not that easy and given how long it took to get access to the testing infrastructure I don't think the question @mrc0mmand asked in https://lists.centos.org/pipermail/ci-users/2018-November/000918.html will be answered anytime soon.

@mrc0mmand
Member Author

mrc0mmand commented Nov 22, 2018

@evverx I was trying to get a remote shell in the initrd to get logs before pinging everyone, and believe it or not, I managed to do it using https://github.com/dracut-crypt-ssh/dracut-crypt-ssh! Right now I have a working shell and access to the journal and kernel ring buffer, so I'll open an issue shortly with as many logs as I can get.

@evverx
Member

evverx commented Nov 22, 2018

@mrc0mmand that's great! I'm wondering if it would be possible to use it in the script that reboots and connects to the machine so that in the future issues like this would be a little bit easier to debug. It could just dump all the logs somewhere, which is better than nothing I guess and more or less automatic.

@mrc0mmand
Member Author

I guess we could incorporate it into the CI scripts, as the setup is fairly simple.

@mrc0mmand
Member Author

mrc0mmand commented Nov 23, 2018

The testsuite almost passes, there's some issue with networking, hopefully it's not something major - https://ci.centos.org/job/systemd-pr-build/3673/console

Debug log from systemd-networkd-tests.py: https://paste.fedoraproject.org/paste/jnhwagD3-saGbeCNzYYk0w

@ssahani could you shed some light on what's happening here?

@evverx
Member

evverx commented Nov 23, 2018

I suspect test-execute is failing because the regular expression doesn't cover all links that can pop up after a new network namespace is created as was discussed in systemd/systemd#10331 (comment).

Regarding systemd-networkd-tests.py, could you try stopping dnsmasq before running the test to see if it helps?

@mrc0mmand
Member Author

I suspect test-execute is failing because the regular expression doesn't cover all links

That makes sense, thanks for the reference link.

Regarding systemd-networkd-tests.py, could you try stopping dnsmasq before running the test to see if it helps?

Unfortunately not. I even tried rebooting the machine before the test itself, but it still fails in the same way.

@evverx
Member

evverx commented Nov 23, 2018

@mrc0mmand could you create a new issue about systemd-networkd-tests.py so that it would be possible to track it properly? This issue is already hard to follow if you ask me :-)

@mrc0mmand
Member Author

Tracking issues for current CentOS CI blockers:

  • test-execute - systemd/systemd#10934
  • systemd-networkd-tests.py - #23

@evverx
Member

evverx commented Nov 28, 2018

@mrc0mmand I'm wondering if you have figured out what @systemd-centos-ci is. I think it would make sense to turn CentOS CI on as soon as possible to at least make sure that systemd compiles and the rest of the tests still pass.

systemd-networkd-tests.py seems to be always broken (partly because nobody has ever run it automatically) and can be skipped for now and test-execute (or more precisely exec-privatenetwork-yes.service) isn't exactly useful and can be replaced with something that simply won't fail.

@mrc0mmand
Member Author

@evverx IMHO @systemd-centos-ci was created simply to provide an API key for the GitHub builder plugin in the CentOS CI jenkins - this allows jenkins to update the commit/PR state according to the results of the test run. However, I don't know who has access to this account, so maybe it would be wise if I just used my own API key (with limited permissions), so we have everything under our control.

I'll go ahead and temporarily disable the mentioned tests so the results are finally usable.

@mrc0mmand
Member Author

Ah, I take that back - I can't use my API key, as I don't have the appropriate permissions in systemd/systemd. Either we could track down the owner of @systemd-centos-ci or just create a new account for that purpose.

@evverx
Member

evverx commented Nov 28, 2018

I have no problem with a new account. If I understand correctly, it'll just have to be invited as a collaborator and I can do that. But, as far as I know, https://wiki.centos.org/QaWiki/CI/GithubIntegration will no longer be applicable there, so it'd be great if you could let me know what the webhook is supposed to look like. Now it just points to https://ci.centos.org/ghprbhook/ with no secret.

@evverx
Member

evverx commented Nov 28, 2018

@keszybz it would be great if you could help here. Judging by the presence of @systemd-centos-ci, I assume there are some reasons unknown to me for it to be here (most likely related to secure access to the repository, but who knows).

@evverx
Member

evverx commented Nov 28, 2018

In light of the recent events that shall remain nameless, one can never be too cautious giving write access to the repository :-)

@mrc0mmand
Member Author

mrc0mmand commented Nov 28, 2018

@evverx Sorry for the delay, I wanted to make sure everything works before we start messing with webhooks. I temporarily disabled the problematic parts of the testsuite in 42340c2 and it's finally passing: https://ci.centos.org/job/systemd-pr-build/3676/console.

I guess now we just have to figure out which user to use for the CI, so I can configure it properly on the jenkins side.

@evverx
Member

evverx commented Nov 30, 2018

@mrc0mmand given that I already bother contributors with LGTM alerts like systemd/systemd#10249 (comment) I think we could use my account as a bearer of bad news (at least temporarily). What do you think?

@evverx
Member

evverx commented Nov 30, 2018

Though, I'd prefer it if @poettering and @keszybz chimed in here because I'm still not sure whether anyone else is interested in getting it working.

@evverx
Member

evverx commented Nov 30, 2018

On second thought, it also seems reasonable to me to invite @mrc0mmand as a collaborator to the systemd repository and point CentOS CI to @mrc0mmand's handle. I'm pretty sure it'll make everything much faster, simpler and even a little bit more secure.

@evverx
Member

evverx commented Dec 3, 2018

And systemd is failing to compile on CentOS again: systemd/systemd#11036.

@mrc0mmand
Member Author

I guess CentOS CI could have easily prevented that...

As for the ideas above - using your account, @evverx, is definitely possible, but I don't like the idea of being in charge of someone else's API key. Not that I have any ulterior motives, but it's still a responsibility.

@evverx
Member

evverx commented Dec 3, 2018

@mrc0mmand I'm completely with you on this one; that's why I suggested inviting you as a collaborator to the systemd repository. I'd do that right now, but I'm not sure I can make decisions like that without at least one ACK. Maybe you could ping someone to speed up the process.

@evverx
Member

evverx commented Dec 4, 2018

So, manually launching a CentOS VM and running ./agent/bootstrap.sh to see whether systemd/systemd#11036 is gone was the last straw. I'll take the liberty of inviting @mrc0mmand as a collaborator to the systemd repository. I think that making the scripts in this repository usable and resurrecting Travis CI is enough for me to be sure that @mrc0mmand can get things done. Plus, apparently @mrc0mmand cares about my API key even more than I do :-)

@evverx
Member

evverx commented Dec 4, 2018

@mrc0mmand let me know when (and probably how) I should turn the webhook on. 6 hours ago https://ci.centos.org/ghprbhook/ responded with 500 so I turned it off again.

@mrc0mmand
Member Author

@evverx will do! However, as usual, there is one small catch, because otherwise things would be too easy... In Jenkins, every user has their own credentials store to manage credentials for various plugins, but, for some reason, I can't manage credentials for the plugin we need (GitHub Pull Request Builder). I just asked about that on the #centos-devel channel, so let's hope for a (relatively) fast response.

@evverx
Member

evverx commented Dec 4, 2018

@mrc0mmand in case the response isn't fast, I'm wondering if it would be possible, as a last resort, to trigger CentOS CI via Travis CI. I'm fantasizing here and assuming you have everything you need to run Jenkins jobs that can produce reports like https://ci.centos.org/job/systemd-pr-build/3676/console. In theory, could we encrypt your credentials and use them to spawn VMs via agent-control.py and then put a link to the report at the end of the Travis build log?

@mrc0mmand
Member Author

@evverx I just gave up and wrote a simple wrapper that does the status reporting, and it seems to be working. I'll definitely improve it as soon as possible (or ditch it completely if I figure out the jenkins plugin madness), but for now it should finally start delivering results to PRs.
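For reference, a hypothetical sketch of what such a status-reporting wrapper might boil down to. The statuses endpoint is GitHub's real commit-status API; the function name, arguments, and GITHUB_TOKEN variable are assumptions for illustration, not the actual wrapper.

```shell
# Hypothetical sketch: post a commit status back to GitHub once a
# Jenkins run finishes. GITHUB_TOKEN is assumed to hold an API token
# with permission to set commit statuses.
report_status() {
    # args: <commit sha> <success|failure|error|pending> <report URL>
    payload=$(printf '{"state":"%s","context":"CentOS CI","target_url":"%s"}' \
                     "$2" "$3")
    curl -s -X POST \
         -H "Authorization: token $GITHUB_TOKEN" \
         -d "$payload" \
         "https://api.github.com/repos/systemd/systemd/statuses/$1"
}
```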

For now, just set up a webhook according to the CentOS CI documentation, i.e.:

1. On your Github Project page, choose 'Settings'
2. Navigate to the 'Webhooks and Services' tab
3. Choose 'Add a webhook'
4. Select 'Let me select individual events' under 'Which events would you like to trigger this webhook?'
5. Unselect 'Push' and select 'Pull Request' and 'Issue Comment'
6. Paste 'https://ci.centos.org/ghprbhook/' (note the trailing slash) in the Payload URL
7. Open the new webhook and verify the ping in the Recent Deliveries section

@evverx
Member

evverx commented Dec 4, 2018

I didn't select "Issue Comment" because I'm not sure it'd be useful. To judge from systemd/systemd#11045, the hook has started to deliver :-)

@mrc0mmand
Member Author

So, thanks to collaboration with Brian, we now have a working CentOS CI without workarounds. As the next step, I'll sort out artifact exporting so the logs can be properly investigated in case of failure.

@mrc0mmand
Member Author

mrc0mmand commented Dec 5, 2018

I was working on the artifact exporting, but stumbled upon an issue with permissions (which was fixed by Brian). In the meantime I set up a CI for this repository, so we don't have to manually test every change, see #24 and #25.

Hopefully the artifact exporting should be up tomorrow.

@mrc0mmand
Member Author

Quick update:

@mrc0mmand
Member Author

I'd say the main goal of this issue was successfully achieved - the CentOS CI is working and delivering results. I'm going to close this issue and move any outstanding issues to a new one, to keep things easier to follow.
