journalctl --list-boots fails with v232-v234 again (was working in v231) #6447

hese10 · 2017-07-25T10:32:53Z

Submission type

Bug report

systemd version the issue has been seen with

systemd v233/v234

Used distribution

company specific.

In case of bug report: Expected behaviour you didn't see

journalctrl --list-boots shuold print all boot ids.

In case of bug report: Unexpected behaviour you saw

journalctl --list-boots returns error:
Failed to determine boots: No data available

In case of bug report: Steps to reproduce the problem

causing corrupted journal files with "reboot -f" commands.
then trying to list boots.
With old systemd v231, same journal files can be parsed correctly, and thus boot ids are listed corrected. But with systemd v232,v233,v234, there comes error, and no boot ids are listed.
binary journal files cannot be shared because of company confidentially. Maybe if this can be produced in plain fedora build, i can then share the files.

medhefgo · 2017-07-25T12:09:26Z

Well, id you can't share the files, you're you could try a git bisect. In any case, you might wanna make a copy of the files to prevent them from getting rotated away.

shibumi · 2017-07-25T18:52:35Z

journalctl --list-boots works fine for me with systemd version 233 on Arch Linux.

hese10 · 2017-07-27T12:43:39Z

It seems the changes done to journal-file.c for v232 caused the problem to occur. When I reverted the journal-file.c code from v231, then the list-boots worked again for corrupted files.

medhefgo · 2017-07-27T13:40:24Z

That doesn't help. OI course it would be around there, but there are 10 commits in between v231 and v232. You'll need to run a git bisect. I can help you with that if you don't know how.

hese10 · 2017-07-27T14:34:08Z

I now used git bisect for the first time.
I used commands:

git bisect start -- src/journal/
git bisect bad a1e2ef7ec912902d8142e7cb5830cbfb47dba86c
git bisect good 38b383d9fe0f5c4e987c1e01136ae6073076fee3
compiling the code and giving "git bisect bad" or "git bisect good" based on journalctl --list-boots result.
after all tries, it gave this commit as result:
[fedora@euca-172-31-9-156 systemd]$ git bisect bad
989793d341e730f452175fa18cf0f7ef4529d62c is the first bad commit
commit 989793d341e730f452175fa18cf0f7ef4529d62c
Author: Lennart Poettering <lennart@poettering.net>
Date:   Wed Oct 12 12:36:01 2016 +0200

medhefgo · 2017-07-27T16:23:09Z

@poettering any thoughts?

hese10 · 2017-07-28T08:51:13Z

The problematic journal file give verify error:

journalctl --file system@0005540316b69226-f802b22107d17028.journal~  --verify
eecfe8: Invalid entry realtime timestamp: 0                      
eecfe8: Invalid object contents: Bad message                     
File corruption detected at system@0005540316b69226-f802b22107d17028.journal~:eecfe8 (of 16777216 bytes, 93%).
FAIL: system@0005540316b69226-f802b22107d17028.journal~ (Bad message)

And when printing out logs with journalctl output, it shows strange alone "--Reboot--" text as last output text line. In other journal files that are corrupted (after user gave "reboot -f" also in these all cases), there is not this alone "--Reboot--" text as last shown output from journalctl.

poettering · 2017-07-28T18:04:50Z

Any chance you can provide me with the offending journal file? That'd be easiest to track down this issue.

hese10 · 2017-07-31T07:16:54Z

I cannot provide the offending journal file as that has company confidential material. I have not yet managed to reproduce this in normal fedora.
Anyway here is debug log outputs from journalctl command.
journalctl_debug.txt

poettering · 2017-07-31T09:39:43Z

I fear I can't debug this without the journal file :-( Not sure what else we can do on this one...

hese10 · 2017-07-31T10:23:11Z

In journalctl.c, in get_boots functions, it is always giving up from error. Can we change the code to still output the data that had been collected before meeting the first error?
like:

diff --git a/src/journal/journalctl.c b/src/journal/journalctl.c
index f753435..4d67495 100644
--- a/src/journal/journalctl.c
+++ b/src/journal/journalctl.c
@@ -1185,8 +1185,16 @@ static int get_boots(
 
                 r = discover_next_boot(j, previous_boot_id, advance_older, &current);
                 if (r < 0) {
-                        boot_id_free_all(head);
-                        return r;
+                        /* if we have already found some boot ids, 
+                           returning those here instead of just discarding those */
+                        if ( count > 0 ) {
+                                if (boots)
+                                        *boots = head;
+                                return count;
+                        } else {
+                                /* no boot ids found yet, returning the error code */
+                                return r;
+                        }
                 }
 
                 if (!current)

at least in my problem case the boot ids were printed out with this change.

poettering · 2017-07-31T13:30:03Z

But we really shouldn't generate the error in the first place if its recoverable... Any chance you can add #undef EBADMSG followed by #define EBADMSG __LINE__ to the top of journal-file.c and recompile? Then run journalctl --verbose --list-boots and let me know the final bits of the output. That should tell where precisely the EBADMSG comes from, as the returned error code for it will not be a proper error code anymore, but instead the line number...

hese10 · 2017-08-01T07:43:36Z

What is this "--verbose" parameter ?, as there is no such parameter in journalctl

poettering · 2017-08-01T07:45:30Z

What is this "--verbose" parameter ?, as there is no such parameter in journalctl

Ah sorry, I was a bit confused... meant to say, please set SYSTEMD_LOG_LEVEL=debug...:

# SYSTEMD_LOG_LEVEL=debug journalctl …

hese10 · 2017-08-01T07:57:24Z

Here is output.
journalctl_debug2.txt
But I do not see any line number here. The code is going to debug log line "Whoopsie! We found a boot ID but can't read its last entry.", and in that case zero value is returned from journal-file routine. So how this undef/define thing would even reveal anything?

hese10 · 2017-08-01T08:07:15Z

Ok, I moved #undef/#define after all #include things, then it changed outputs.
Here it is.
journalctl_debug3.txt

hese10 · 2017-09-07T10:23:40Z

Is there going to be any progress?

shibumi · 2017-09-07T12:29:42Z

Hello,
Are you sure your journal is not corrupted? Because journalctl --list-boots works fine for me on systemd 234 on Arch Linux.

poettering · 2017-09-07T12:34:33Z

@shibumi it's besides the point whether they are corrupted or not, journalctl is supposed to deal with all kinds of corruptions implicitly and silently and make the best of it.

shibumi · 2017-09-07T13:05:31Z

@poettering Maybe this issue has something todo with this thread: #4088

rantala · 2017-09-08T08:09:47Z

Hi, I got access to the journal logs (I work for the same company).

In this case there are total 20+ journal files, and one of the corrupted files is causing journalctl -D . --list-boots to fail:

$ journalctl -D . --list-boots 
Failed to determine boots: No data available
$ chmod 0 system@0005540316b69226-f802b22107d17028.journal~
$ journalctl -D . --list-boots
-52 22633da1c5374a728d6c215e2c301dc2 Mon 2017-07-10 05:29:21 EEST—Mon 2017-07-10 05:31:51 EEST
-51 2253aab9ea7e4a2598f2abda82939eff Mon 2017-07-10 05:32:22 EEST—Mon 2017-07-10 05:36:49 EEST
-50 ef0d85d35c74486fa4104f9d6391b6ba Mon 2017-07-10 05:40:33 EEST—Mon 2017-07-10 05:40:40 EEST
[...]

journalctl --verify reports the following for the corrupted file (there are corruptions in some other journal files as well):

$ journalctl --verify --file system@0005540316b69226-f802b22107d17028.journal~
eecfe8: Invalid entry realtime timestamp: 0                                                                                        
eecfe8: Invalid object contents: Bad message                                                                                       
File corruption detected at system@0005540316b69226-f802b22107d17028.journal~:eecfe8 (of 16777216 bytes, 93%).                     
FAIL: system@0005540316b69226-f802b22107d17028.journal~ (Bad message)

The last partially written entry in the file is starting from offset eecfe8, and the bytes are:
03 00 00 00 00 00 00 00 D0 00 00 00 00 00 00 00 6F 37 00 00 00 [rest of file is zero bytes...]

If I get everything right, this is parsed as:

=> ObjectHeader.type=0x03 (OBJECT_ENTRY)
=> ObjectHeader.flags=0x00
=> ObjectHeader.reserved=0x00 00 00 00 00 00
=> ObjectHeader.size=0xD0 00 00 00 00 00 00 00 (decimal 208)
=> EntryObject.seqnum=0x6F 37 00 00 00 00 00 00
=> rest EntryObject members all zeroes.

It looks like journalctl --list-boots does not notice that the entry is corrupted, and as EntryObject.boot_id is also zero, I believe it is causing some problems as the code tries to look for non-existent zero _BOOT_ID.

As a quick test, I copy-pasted the OBJECT_ENTRY checks from journal_file_object_verify() to journal_file_move_to_object() like this:

@@ -823,6 +829,55 @@ int journal_file_move_to_object(JournalFile *f, ObjectType type, uint64_t offset
                 return -EBADMSG;
         }
 
+       if (o->object.type == OBJECT_ENTRY) {
+                if ((le64toh(o->object.size) - offsetof(EntryObject, items)) % sizeof(EntryItem) != 0) {
+                        log_debug(
+                              "Bad entry size (<= %zu): %"PRIu64,
+                              offsetof(EntryObject, items),
+                              le64toh(o->object.size));
+                        return -EBADMSG;
+                }
+
+                if ((le64toh(o->object.size) - offsetof(EntryObject, items)) / sizeof(EntryItem) <= 0) {
+                        log_debug(
+                              "Invalid number items in entry: %"PRIu64,
+                              (le64toh(o->object.size) - offsetof(EntryObject, items)) / sizeof(EntryItem));
+                        return -EBADMSG;
+                }
+
+                if (le64toh(o->entry.seqnum) <= 0) {
+                        log_debug(
+                              "Invalid entry seqnum: %"PRIx64,
+                              le64toh(o->entry.seqnum));
+                        return -EBADMSG;
+                }
+
+                if (!VALID_REALTIME(le64toh(o->entry.realtime))) {
+                        log_debug(
+                              "Invalid entry realtime timestamp: %"PRIu64,
+                              le64toh(o->entry.realtime));
+                        return -EBADMSG;
+                }
+
+                if (!VALID_MONOTONIC(le64toh(o->entry.monotonic))) {
+                        log_debug(
+                              "Invalid entry monotonic timestamp: %"PRIu64,
+                              le64toh(o->entry.monotonic));
+                        return -EBADMSG;
+                }
+
+                for (uint64_t i = 0; i < journal_file_entry_n_items(o); i++) {
+                        if (o->entry.items[i].object_offset == 0 ||
+                            !VALID64(o->entry.items[i].object_offset)) {
+                                log_debug(
+                                      "Invalid entry item (%"PRIu64"/%"PRIu64" offset: "OFSfmt,
+                                      i, journal_file_entry_n_items(o),
+                                      o->entry.items[i].object_offset);
+                                return -EBADMSG;
+                        }
+                }
+       }
+
         if (s > tsize) {
                 r = journal_file_move_to(f, type, false, offset, s, &t, NULL);
                 if (r < 0)

With this change, journal_file_move_to_object() notices that realtime is invalid, and --list-boots works better:

$ chmod 644 system@0005540316b69226-f802b22107d17028.journal~
$ ~/git/systemd/builddir/journalctl --list-boots -D . | wc -l
53

hese10 · 2017-09-15T10:09:02Z

Any comments? Do we just create pull request for this change proposal?

poettering · 2017-09-15T15:54:25Z

Your patch looks excellent. In the long run we really should add similar validators for all object types really. And yes, please submit this patch as PR.

The patch as it is looks pretty flawless, except maybe that I'd split out these checks into a function of its own that journal_file_move_to_object() just calls if the type matches.

Thank you for tracking this down and prepping the fix!

Introduce journal_file_check_object(), which does lightweight object sanity checks, and use it in journal_file_move_to_object(), so that we will catch certain corrupted objects in the journal file. This fixes systemd#6447, where we had only partially written out OBJECT_ENTRY (ObjectHeader written, but rest of object zero bytes), causing "journalctl --list-boots" to fail. $ builddir.vanilla/journalctl --list-boots -D bug6447/ Failed to determine boots: No data available $ builddir.patched/journalctl --list-boots -D bug6447/ -52 22633da1c5374a728d6c215e2c301dc2 Mon 2017-07-10 05:29:21 EEST—Mon 2017-07-10 05:31:51 EEST -51 2253aab9ea7e4a2598f2abda82939eff Mon 2017-07-10 05:32:22 EEST—Mon 2017-07-10 05:36:49 EEST -50 ef0d85d35c74486fa4104f9d6391b6ba Mon 2017-07-10 05:40:33 EEST—Mon 2017-07-10 05:40:40 EEST [...] Note that journal_file_check_object() is similar to journal_file_object_verify(). The most expensive checks are omitted, as they would slow down every journal_file_move_to_object() call too much. With this implementation, the added overhead is small, for example when dumping some journal content to /dev/null (built with -Dbuildtype=debugoptimized -Db_ndebug=true): Performance counter stats for 'builddir.vanilla/journalctl -D 76f4d4c3406945f9a60d3ca8763aa754/': 12542,311634 task-clock:u (msec) # 1,000 CPUs utilized 0 context-switches:u # 0,000 K/sec 0 cpu-migrations:u # 0,000 K/sec 80 100 page-faults:u # 0,006 M/sec 41 786 963 456 cycles:u # 3,332 GHz 105 453 864 770 instructions:u # 2,52 insn per cycle 24 342 227 334 branches:u # 1940,809 M/sec 105 709 217 branch-misses:u # 0,43% of all branches 12,545199291 seconds time elapsed Performance counter stats for 'builddir.patched/journalctl -D 76f4d4c3406945f9a60d3ca8763aa754/': 12734,723233 task-clock:u (msec) # 1,000 CPUs utilized 0 context-switches:u # 0,000 K/sec 0 cpu-migrations:u # 0,000 K/sec 80 693 page-faults:u # 0,006 M/sec 42 661 017 429 cycles:u # 3,350 GHz 107 696 985 865 instructions:u # 2,52 insn per cycle 24 950 526 745 branches:u # 1959,252 M/sec 101 762 806 branch-misses:u # 0,41% of all branches 12,737527327 seconds time elapsed Fixes systemd#6447.

journal: add object sanity check to journal_file_move_to_object() (#6447)

poettering added journal needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer labels Jul 28, 2017

poettering added this to the v235 milestone Jul 28, 2017

archuserr mentioned this issue Jul 31, 2017

journalctl --list-boots not accurate #6488

Closed

rantala mentioned this issue Sep 22, 2017

journal: add object sanity check to journal_file_move_to_object() (#6447) #6887

Merged

poettering closed this as completed in #6887 Sep 24, 2017

poettering added a commit that referenced this issue Sep 24, 2017

Merge pull request #6887 from rantala/6447

cddaa1f

journal: add object sanity check to journal_file_move_to_object() (#6447)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

journalctl --list-boots fails with v232-v234 again (was working in v231) #6447

journalctl --list-boots fails with v232-v234 again (was working in v231) #6447

hese10 commented Jul 25, 2017

medhefgo commented Jul 25, 2017

shibumi commented Jul 25, 2017

hese10 commented Jul 27, 2017

medhefgo commented Jul 27, 2017

hese10 commented Jul 27, 2017 •

edited by poettering

medhefgo commented Jul 27, 2017

hese10 commented Jul 28, 2017 •

edited by poettering

poettering commented Jul 28, 2017

hese10 commented Jul 31, 2017

poettering commented Jul 31, 2017

hese10 commented Jul 31, 2017

poettering commented Jul 31, 2017 •

edited

hese10 commented Aug 1, 2017

poettering commented Aug 1, 2017

hese10 commented Aug 1, 2017 •

edited

hese10 commented Aug 1, 2017

hese10 commented Sep 7, 2017

shibumi commented Sep 7, 2017

poettering commented Sep 7, 2017

shibumi commented Sep 7, 2017

rantala commented Sep 8, 2017

hese10 commented Sep 15, 2017

poettering commented Sep 15, 2017

journalctl --list-boots fails with v232-v234 again (was working in v231) #6447

journalctl --list-boots fails with v232-v234 again (was working in v231) #6447

Comments

hese10 commented Jul 25, 2017

Submission type

systemd version the issue has been seen with

Used distribution

In case of bug report: Expected behaviour you didn't see

In case of bug report: Unexpected behaviour you saw

In case of bug report: Steps to reproduce the problem

medhefgo commented Jul 25, 2017

shibumi commented Jul 25, 2017

hese10 commented Jul 27, 2017

medhefgo commented Jul 27, 2017

hese10 commented Jul 27, 2017 • edited by poettering

medhefgo commented Jul 27, 2017

hese10 commented Jul 28, 2017 • edited by poettering

poettering commented Jul 28, 2017

hese10 commented Jul 31, 2017

poettering commented Jul 31, 2017

hese10 commented Jul 31, 2017

poettering commented Jul 31, 2017 • edited

hese10 commented Aug 1, 2017

poettering commented Aug 1, 2017

hese10 commented Aug 1, 2017 • edited

hese10 commented Aug 1, 2017

hese10 commented Sep 7, 2017

shibumi commented Sep 7, 2017

poettering commented Sep 7, 2017

shibumi commented Sep 7, 2017

rantala commented Sep 8, 2017

hese10 commented Sep 15, 2017

poettering commented Sep 15, 2017

hese10 commented Jul 27, 2017 •

edited by poettering

hese10 commented Jul 28, 2017 •

edited by poettering

poettering commented Jul 31, 2017 •

edited

hese10 commented Aug 1, 2017 •

edited