Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sparc64]: systemctl fails with 'Failed to parse bus message: Bad message' #2633

Closed
glaubitz opened this issue Feb 16, 2016 · 15 comments
Closed
Labels
bug 🐛 Programming errors, that need preferential fixing sd-bus

Comments

@glaubitz
Copy link
Contributor

Hi!

On a Debian/unstable sparc64 fresh installation with all packages up-to-date and systemd at 229, we're getting the following error when running systemctl:

root@ravirin:# systemctl
Failed to parse bus message: Bad message
root@ravirin:
#

root@ravirin:# systemctl --version
systemd 229
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD -IDN
root@ravirin:
# uname -a
Linux ravirin 4.3.0-1-sparc64 #1 Debian 4.3.3-5 (2016-01-04) sparc64 GNU/Linux
root@ravirin:~#

Listing failed units works, on the other hand, while --all does not:

root@ravirin:# systemctl --all
Failed to parse bus message: Bad message
root@ravirin:
# systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● rng-tools.service loaded failed failed rng-tools.service

LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.

1 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
root@ravirin:~#

Surprisingly, this happens only on two of three sparc64 systems I tested with the same kernel and systemd version. The machine where the issue does not arise is much newer and has much more memory (SPARC-T5 with 192 GiB of RAM). The older machines are a Sun Blade 100 with just 1 GiB RAM and an older UltraSPARC IIIi (Jalapeno).

Cheers,
Adrian

@poettering
Copy link
Member

We probably forgot some endianess conversion somewhere... Since BE systems are kinda exotic these days, we are happy to support them but kinda rely on user patches and testing to ensure this stays working.

@glaubitz
Copy link
Contributor Author

I just realized, there is actually a difference between the machines affected by the issue and the machines where systemctl works fine, the latter uses an SMP kernel while the others don't. Maybe it's some kernel option that's configured differently in the SMP kernel.

I will do some more testing, but if you have some general pointers where to poke I'll be happy to look there.

@poettering
Copy link
Member

Try gdb on it. The error message is generated by bus_log_parse_error() which is invoked at a ton of places. set a breakpoint on it when running systemctl like this, and see where it triggers. (you might want to recompile systemd with -g -O0 first, otherwise your gdb experience will be crap)

@glaubitz
Copy link
Contributor Author

So, here's my first shot:

Breakpoint 1, bus_log_parse_error (r=-76) at src/shared/bus-util.c:1369
1369            return log_error_errno(r, "Failed to parse bus message: %m");

Oops, forgot the important part:

(gdb) up
#1  0x0000010000011430 in get_unit_list (bus=0x100002d3010, machine=0x0, patterns=0x7fefffff700, unit_infos=0x7fefffff4a8, c=14, _reply=0x7fefffff4b8)
    at src/systemctl/systemctl.c:577
577                 return bus_log_parse_error(r);
(gdb)

(gdb) bt
#0  bus_log_parse_error (r=-76) at src/shared/bus-util.c:1369
#1  0x0000010000011430 in get_unit_list (bus=0x100002d3010, machine=0x0, patterns=0x7fefffff700, unit_infos=0x7fefffff4a8, c=14, _reply=0x7fefffff4b8)
    at src/systemctl/systemctl.c:577
#2  0x000001000001177c in get_unit_list_recursive (bus=0x100002d3010, patterns=0x7fefffff700, _unit_infos=0x7fefffff5a8, _replies=0x7fefffff5b0, 
    _machines=0x7fefffff5b8) at src/systemctl/systemctl.c:619
#3  0x0000010000011d38 in list_units (argc=1, argv=0x7fefffff6f8, userdata=0x0) at src/systemctl/systemctl.c:687
#4  0x0000010000106aa8 in dispatch_verb (argc=1, argv=0x7fefffffc08, verbs=0x100002c7cc8 <verbs>, userdata=0x0) at src/basic/verbs.c:99
#5  0x000001000003722c in systemctl_main (argc=1, argv=0x7fefffffc08) at src/systemctl/systemctl.c:7399
#6  0x0000010000038cac in main (argc=1, argv=0x7fefffffc08) at src/systemctl/systemctl.c:7661
(gdb)

@poettering
Copy link
Member

OK, so this is triggered by sd_message_read() somewhere (the error is an immediate result of that call).

Here's my recommendation to track this down further: recompile systemctl (and sd-bus), but add:

#undef EBADMSG
#define EBADMSG (__LINE__ + 1000)

to the top of src/libsystem/sd-bus/bus-message.c. This has the effect that instead of returning the "Bad message" error code the parsing failure will now result in an error code that is 1000 plus the line number where this was really triggered. If you run the tool then, it should output neatly the line number where the EBADMSG is generated first.

@poettering
Copy link
Member

(add those two lines immediately after all #include lines, so that the EBADMSG definition of the OS is overriden with this debug definition)

@glaubitz
Copy link
Contributor Author

So, this seems to work:

root@test-adrian1:/systemd# ./systemctl
Failed to parse bus message: Unknown error 4518
root@test-adrian1:
/systemd#

@glaubitz
Copy link
Contributor Author

So, the problem seems that the object path string contains garbage at its end:

(gdb) x/125c q
0x100002d7f8c:  47 '/'  111 'o' 114 'r' 103 'g' 47 '/'  102 'f' 114 'r' 101 'e'
0x100002d7f94:  101 'e' 100 'd' 101 'e' 115 's' 107 'k' 116 't' 111 'o' 112 'p'
0x100002d7f9c:  47 '/'  115 's' 121 'y' 115 's' 116 't' 101 'e' 109 'm' 100 'd'
0x100002d7fa4:  49 '1'  47 '/'  117 'u' 110 'n' 105 'i' 116 't' 47 '/'  100 'd'
0x100002d7fac:  101 'e' 118 'v' 95 '_'  50 '2'  100 'd' 100 'd' 105 'i' 115 's'
0x100002d7fb4:  107 'k' 95 '_'  50 '2'  100 'd' 98 'b'  121 'y' 95 '_'  53 '5'
0x100002d7fbc:  99 'c'  120 'x' 50 '2'  100 'd' 117 'u' 117 'u' 105 'i' 100 'd'
0x100002d7fc4:  95 '_'  50 '2'  100 'd' 52 '4'  98 'b'  54 '6'  98 'b'  101 'e'
0x100002d7fcc:  54 '6'  57 '9'  55 '7'  95 '_'  53 '5'  99 'c'  120 'x' 50 '2'
0x100002d7fd4:  100 'd' 54 '6'  51 '3'  101 'e' 99 'c'  95 '_'  53 '5'  99 'c'
0x100002d7fdc:  120 'x' 50 '2'  100 'd' 52 '4'  48 '0'  99 'c'  99 'c'  95 '_'
0x100002d7fe4:  53 '5'  99 'c'  120 'x' 50 '2'  100 'd' 56 '8'  55 '7'  55 '7'
0x100002d7fec:  54 '6'  95 '_'  53 '5'  99 'c'  120 'x' 50 '2'  100 'd' 99 'c'
0x100002d7ff4:  54 '6'  55 '7'  99 'c'  56 '8'  102 'f' 57 '9'  99 'c'  51 '3'
0x100002d7ffc:  56 '8'  49 '1'  48 '0'  95 '_'  0 '\000'    0 '\000'    0 '\000'    0 '\000'
0x100002d8004:  0 '\000'    0 '\000'    0 '\000'    0 '\000'    0 '\000'
(gdb) print object_path_is_valid(q)
$6 = true
(gdb) print validate_nul(q,l)
$7 = false
(gdb)

Until "3810", it correspondends to the UUID of swap:

root@test-adrian1:~# blkid /dev/sda2
/dev/sda2: UUID="4b6be697-63ec-40cc-8776-c67c8f9c3810" TYPE="swap"
root@test-adrian1:~#

But the rest of the string is garbage.

@poettering
Copy link
Member

where does the 125 come from you used? is that the value of l?

The string looks pretty OK, and is NUL terminated, but the trailing underscore surprises me. (and of course, if l is 125, then the string contains NUL bytes, and that's weird...

@glaubitz
Copy link
Contributor Author

Yes, that's the value of l. Will continue debugging tonight.

@poettering
Copy link
Member

Any update on this?

Any chance that da8358c fixed this issue, too?

@glaubitz
Copy link
Contributor Author

I just applied the single fix from da8358c to my old build tree and just ran make again but that didn't help:

root@test-adrian1:/systemd# ./systemctl
Failed to parse bus message: Unknown error 4518
root@test-adrian1:
/systemd#

I'll clone systemd from git now and then test again with the latest revision.

@glaubitz
Copy link
Contributor Author

Ok, cloned systemd from git, built and installed it. Problem still persists, unfortunately.

@poettering poettering added the bug 🐛 Programming errors, that need preferential fixing label Apr 18, 2016
@poettering
Copy link
Member

OK, no idea then. I figure there's no way around some gdb sessions on your side... Try to figure out why the memory got corrupted there...

@glaubitz
Copy link
Contributor Author

Oh, it looks like this problem has actually fixed itself somewhat magically. On a machine where it didn't work in the past (an older sun4u machine), systemctl is now working as it should. Tested on another previously affected machine and can't reproduce it there either. Nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐛 Programming errors, that need preferential fixing sd-bus
Development

No branches or pull requests

2 participants