Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

journalctl --list-boots fails with v232-v234 again (was working in v231) #6447

Closed
hese10 opened this issue Jul 25, 2017 · 23 comments · Fixed by #6887
Closed

journalctl --list-boots fails with v232-v234 again (was working in v231) #6447

hese10 opened this issue Jul 25, 2017 · 23 comments · Fixed by #6887
Labels
journal needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer
Milestone

Comments

@hese10
Copy link
Contributor

hese10 commented Jul 25, 2017

Submission type

  • Bug report

systemd version the issue has been seen with

systemd v233/v234

Used distribution

company specific.

In case of bug report: Expected behaviour you didn't see

journalctrl --list-boots shuold print all boot ids.

In case of bug report: Unexpected behaviour you saw

journalctl --list-boots returns error:
Failed to determine boots: No data available

In case of bug report: Steps to reproduce the problem

causing corrupted journal files with "reboot -f" commands.
then trying to list boots.
With old systemd v231, same journal files can be parsed correctly, and thus boot ids are listed corrected. But with systemd v232,v233,v234, there comes error, and no boot ids are listed.
binary journal files cannot be shared because of company confidentially. Maybe if this can be produced in plain fedora build, i can then share the files.

@medhefgo
Copy link
Contributor

Well, id you can't share the files, you're you could try a git bisect. In any case, you might wanna make a copy of the files to prevent them from getting rotated away.

@shibumi
Copy link
Contributor

shibumi commented Jul 25, 2017

journalctl --list-boots works fine for me with systemd version 233 on Arch Linux.

@hese10
Copy link
Contributor Author

hese10 commented Jul 27, 2017

It seems the changes done to journal-file.c for v232 caused the problem to occur. When I reverted the journal-file.c code from v231, then the list-boots worked again for corrupted files.

@medhefgo
Copy link
Contributor

That doesn't help. OI course it would be around there, but there are 10 commits in between v231 and v232. You'll need to run a git bisect. I can help you with that if you don't know how.

@hese10
Copy link
Contributor Author

hese10 commented Jul 27, 2017

I now used git bisect for the first time.
I used commands:

git bisect start -- src/journal/
git bisect bad a1e2ef7ec912902d8142e7cb5830cbfb47dba86c
git bisect good 38b383d9fe0f5c4e987c1e01136ae6073076fee3
compiling the code and giving "git bisect bad" or "git bisect good" based on journalctl --list-boots result.
after all tries, it gave this commit as result:
[fedora@euca-172-31-9-156 systemd]$ git bisect bad
989793d341e730f452175fa18cf0f7ef4529d62c is the first bad commit
commit 989793d341e730f452175fa18cf0f7ef4529d62c
Author: Lennart Poettering <lennart@poettering.net>
Date:   Wed Oct 12 12:36:01 2016 +0200

@medhefgo
Copy link
Contributor

@poettering any thoughts?

@hese10
Copy link
Contributor Author

hese10 commented Jul 28, 2017

The problematic journal file give verify error:

journalctl --file system@0005540316b69226-f802b22107d17028.journal~  --verify
eecfe8: Invalid entry realtime timestamp: 0                      
eecfe8: Invalid object contents: Bad message                     
File corruption detected at system@0005540316b69226-f802b22107d17028.journal~:eecfe8 (of 16777216 bytes, 93%).
FAIL: system@0005540316b69226-f802b22107d17028.journal~ (Bad message)

And when printing out logs with journalctl output, it shows strange alone "--Reboot--" text as last output text line. In other journal files that are corrupted (after user gave "reboot -f" also in these all cases), there is not this alone "--Reboot--" text as last shown output from journalctl.

@poettering
Copy link
Member

Any chance you can provide me with the offending journal file? That'd be easiest to track down this issue.

@poettering poettering added journal needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer labels Jul 28, 2017
@poettering poettering added this to the v235 milestone Jul 28, 2017
@hese10
Copy link
Contributor Author

hese10 commented Jul 31, 2017

I cannot provide the offending journal file as that has company confidential material. I have not yet managed to reproduce this in normal fedora.
Anyway here is debug log outputs from journalctl command.
journalctl_debug.txt

@poettering
Copy link
Member

I fear I can't debug this without the journal file :-( Not sure what else we can do on this one...

@hese10
Copy link
Contributor Author

hese10 commented Jul 31, 2017

In journalctl.c, in get_boots functions, it is always giving up from error. Can we change the code to still output the data that had been collected before meeting the first error?
like:

diff --git a/src/journal/journalctl.c b/src/journal/journalctl.c
index f753435..4d67495 100644
--- a/src/journal/journalctl.c
+++ b/src/journal/journalctl.c
@@ -1185,8 +1185,16 @@ static int get_boots(
 
                 r = discover_next_boot(j, previous_boot_id, advance_older, &current);
                 if (r < 0) {
-                        boot_id_free_all(head);
-                        return r;
+                        /* if we have already found some boot ids, 
+                           returning those here instead of just discarding those */
+                        if ( count > 0 ) {
+                                if (boots)
+                                        *boots = head;
+                                return count;
+                        } else {
+                                /* no boot ids found yet, returning the error code */
+                                return r;
+                        }
                 }
 
                 if (!current)

at least in my problem case the boot ids were printed out with this change.

@poettering
Copy link
Member

poettering commented Jul 31, 2017

But we really shouldn't generate the error in the first place if its recoverable... Any chance you can add #undef EBADMSG followed by #define EBADMSG __LINE__ to the top of journal-file.c and recompile? Then run journalctl --verbose --list-boots and let me know the final bits of the output. That should tell where precisely the EBADMSG comes from, as the returned error code for it will not be a proper error code anymore, but instead the line number...

@hese10
Copy link
Contributor Author

hese10 commented Aug 1, 2017

What is this "--verbose" parameter ?, as there is no such parameter in journalctl

@poettering
Copy link
Member

What is this "--verbose" parameter ?, as there is no such parameter in journalctl

Ah sorry, I was a bit confused... meant to say, please set SYSTEMD_LOG_LEVEL=debug...:

# SYSTEMD_LOG_LEVEL=debug journalctl …

@hese10
Copy link
Contributor Author

hese10 commented Aug 1, 2017

Here is output.
journalctl_debug2.txt
But I do not see any line number here. The code is going to debug log line "Whoopsie! We found a boot ID but can't read its last entry.", and in that case zero value is returned from journal-file routine. So how this undef/define thing would even reveal anything?

@hese10
Copy link
Contributor Author

hese10 commented Aug 1, 2017

Ok, I moved #undef/#define after all #include things, then it changed outputs.
Here it is.
journalctl_debug3.txt

@hese10
Copy link
Contributor Author

hese10 commented Sep 7, 2017

Is there going to be any progress?

@shibumi
Copy link
Contributor

shibumi commented Sep 7, 2017

Hello,
Are you sure your journal is not corrupted? Because journalctl --list-boots works fine for me on systemd 234 on Arch Linux.

@poettering
Copy link
Member

@shibumi it's besides the point whether they are corrupted or not, journalctl is supposed to deal with all kinds of corruptions implicitly and silently and make the best of it.

@shibumi
Copy link
Contributor

shibumi commented Sep 7, 2017

@poettering Maybe this issue has something todo with this thread: #4088

@rantala
Copy link
Contributor

rantala commented Sep 8, 2017

Hi, I got access to the journal logs (I work for the same company).

In this case there are total 20+ journal files, and one of the corrupted files is causing journalctl -D . --list-boots to fail:

$ journalctl -D . --list-boots 
Failed to determine boots: No data available
$ chmod 0 system@0005540316b69226-f802b22107d17028.journal~
$ journalctl -D . --list-boots
-52 22633da1c5374a728d6c215e2c301dc2 Mon 2017-07-10 05:29:21 EEST—Mon 2017-07-10 05:31:51 EEST
-51 2253aab9ea7e4a2598f2abda82939eff Mon 2017-07-10 05:32:22 EEST—Mon 2017-07-10 05:36:49 EEST
-50 ef0d85d35c74486fa4104f9d6391b6ba Mon 2017-07-10 05:40:33 EEST—Mon 2017-07-10 05:40:40 EEST
[...]

journalctl --verify reports the following for the corrupted file (there are corruptions in some other journal files as well):

$ journalctl --verify --file system@0005540316b69226-f802b22107d17028.journal~
eecfe8: Invalid entry realtime timestamp: 0                                                                                        
eecfe8: Invalid object contents: Bad message                                                                                       
File corruption detected at system@0005540316b69226-f802b22107d17028.journal~:eecfe8 (of 16777216 bytes, 93%).                     
FAIL: system@0005540316b69226-f802b22107d17028.journal~ (Bad message)

The last partially written entry in the file is starting from offset eecfe8, and the bytes are:
03 00 00 00 00 00 00 00 D0 00 00 00 00 00 00 00 6F 37 00 00 00 [rest of file is zero bytes...]

If I get everything right, this is parsed as:

=> ObjectHeader.type=0x03 (OBJECT_ENTRY)
=> ObjectHeader.flags=0x00
=> ObjectHeader.reserved=0x00 00 00 00 00 00
=> ObjectHeader.size=0xD0 00 00 00 00 00 00 00 (decimal 208)
=> EntryObject.seqnum=0x6F 37 00 00 00 00 00 00
=> rest EntryObject members all zeroes.

It looks like journalctl --list-boots does not notice that the entry is corrupted, and as EntryObject.boot_id is also zero, I believe it is causing some problems as the code tries to look for non-existent zero _BOOT_ID.

As a quick test, I copy-pasted the OBJECT_ENTRY checks from journal_file_object_verify() to journal_file_move_to_object() like this:

@@ -823,6 +829,55 @@ int journal_file_move_to_object(JournalFile *f, ObjectType type, uint64_t offset
                 return -EBADMSG;
         }
 
+       if (o->object.type == OBJECT_ENTRY) {
+                if ((le64toh(o->object.size) - offsetof(EntryObject, items)) % sizeof(EntryItem) != 0) {
+                        log_debug(
+                              "Bad entry size (<= %zu): %"PRIu64,
+                              offsetof(EntryObject, items),
+                              le64toh(o->object.size));
+                        return -EBADMSG;
+                }
+
+                if ((le64toh(o->object.size) - offsetof(EntryObject, items)) / sizeof(EntryItem) <= 0) {
+                        log_debug(
+                              "Invalid number items in entry: %"PRIu64,
+                              (le64toh(o->object.size) - offsetof(EntryObject, items)) / sizeof(EntryItem));
+                        return -EBADMSG;
+                }
+
+                if (le64toh(o->entry.seqnum) <= 0) {
+                        log_debug(
+                              "Invalid entry seqnum: %"PRIx64,
+                              le64toh(o->entry.seqnum));
+                        return -EBADMSG;
+                }
+
+                if (!VALID_REALTIME(le64toh(o->entry.realtime))) {
+                        log_debug(
+                              "Invalid entry realtime timestamp: %"PRIu64,
+                              le64toh(o->entry.realtime));
+                        return -EBADMSG;
+                }
+
+                if (!VALID_MONOTONIC(le64toh(o->entry.monotonic))) {
+                        log_debug(
+                              "Invalid entry monotonic timestamp: %"PRIu64,
+                              le64toh(o->entry.monotonic));
+                        return -EBADMSG;
+                }
+
+                for (uint64_t i = 0; i < journal_file_entry_n_items(o); i++) {
+                        if (o->entry.items[i].object_offset == 0 ||
+                            !VALID64(o->entry.items[i].object_offset)) {
+                                log_debug(
+                                      "Invalid entry item (%"PRIu64"/%"PRIu64" offset: "OFSfmt,
+                                      i, journal_file_entry_n_items(o),
+                                      o->entry.items[i].object_offset);
+                                return -EBADMSG;
+                        }
+                }
+       }
+
         if (s > tsize) {
                 r = journal_file_move_to(f, type, false, offset, s, &t, NULL);
                 if (r < 0)

With this change, journal_file_move_to_object() notices that realtime is invalid, and --list-boots works better:

$ chmod 644 system@0005540316b69226-f802b22107d17028.journal~
$ ~/git/systemd/builddir/journalctl --list-boots -D . | wc -l
53

@hese10
Copy link
Contributor Author

hese10 commented Sep 15, 2017

Any comments? Do we just create pull request for this change proposal?

@poettering
Copy link
Member

Your patch looks excellent. In the long run we really should add similar validators for all object types really. And yes, please submit this patch as PR.

The patch as it is looks pretty flawless, except maybe that I'd split out these checks into a function of its own that journal_file_move_to_object() just calls if the type matches.

Thank you for tracking this down and prepping the fix!

rantala added a commit to rantala/systemd that referenced this issue Sep 22, 2017
Introduce journal_file_check_object(), which does lightweight object
sanity checks, and use it in journal_file_move_to_object(), so that we
will catch certain corrupted objects in the journal file.

This fixes systemd#6447, where we had only partially written out OBJECT_ENTRY
(ObjectHeader written, but rest of object zero bytes), causing
"journalctl --list-boots" to fail.

  $ builddir.vanilla/journalctl --list-boots -D bug6447/
  Failed to determine boots: No data available

  $ builddir.patched/journalctl --list-boots -D bug6447/
  -52 22633da1c5374a728d6c215e2c301dc2 Mon 2017-07-10 05:29:21 EEST—Mon 2017-07-10 05:31:51 EEST
  -51 2253aab9ea7e4a2598f2abda82939eff Mon 2017-07-10 05:32:22 EEST—Mon 2017-07-10 05:36:49 EEST
  -50 ef0d85d35c74486fa4104f9d6391b6ba Mon 2017-07-10 05:40:33 EEST—Mon 2017-07-10 05:40:40 EEST
  [...]

Note that journal_file_check_object() is similar to
journal_file_object_verify(). The most expensive checks are omitted, as
they would slow down every journal_file_move_to_object() call too much.

With this implementation, the added overhead is small, for example when
dumping some journal content to /dev/null
(built with -Dbuildtype=debugoptimized -Db_ndebug=true):

 Performance counter stats for 'builddir.vanilla/journalctl -D 76f4d4c3406945f9a60d3ca8763aa754/':

      12542,311634      task-clock:u (msec)       #    1,000 CPUs utilized
                 0      context-switches:u        #    0,000 K/sec
                 0      cpu-migrations:u          #    0,000 K/sec
            80 100      page-faults:u             #    0,006 M/sec
    41 786 963 456      cycles:u                  #    3,332 GHz
   105 453 864 770      instructions:u            #    2,52  insn per cycle
    24 342 227 334      branches:u                # 1940,809 M/sec
       105 709 217      branch-misses:u           #    0,43% of all branches

      12,545199291 seconds time elapsed

 Performance counter stats for 'builddir.patched/journalctl -D 76f4d4c3406945f9a60d3ca8763aa754/':

      12734,723233      task-clock:u (msec)       #    1,000 CPUs utilized
                 0      context-switches:u        #    0,000 K/sec
                 0      cpu-migrations:u          #    0,000 K/sec
            80 693      page-faults:u             #    0,006 M/sec
    42 661 017 429      cycles:u                  #    3,350 GHz
   107 696 985 865      instructions:u            #    2,52  insn per cycle
    24 950 526 745      branches:u                # 1959,252 M/sec
       101 762 806      branch-misses:u           #    0,41% of all branches

      12,737527327 seconds time elapsed

Fixes systemd#6447.
rantala added a commit to rantala/systemd that referenced this issue Sep 24, 2017
Introduce journal_file_check_object(), which does lightweight object
sanity checks, and use it in journal_file_move_to_object(), so that we
will catch certain corrupted objects in the journal file.

This fixes systemd#6447, where we had only partially written out OBJECT_ENTRY
(ObjectHeader written, but rest of object zero bytes), causing
"journalctl --list-boots" to fail.

  $ builddir.vanilla/journalctl --list-boots -D bug6447/
  Failed to determine boots: No data available

  $ builddir.patched/journalctl --list-boots -D bug6447/
  -52 22633da1c5374a728d6c215e2c301dc2 Mon 2017-07-10 05:29:21 EEST—Mon 2017-07-10 05:31:51 EEST
  -51 2253aab9ea7e4a2598f2abda82939eff Mon 2017-07-10 05:32:22 EEST—Mon 2017-07-10 05:36:49 EEST
  -50 ef0d85d35c74486fa4104f9d6391b6ba Mon 2017-07-10 05:40:33 EEST—Mon 2017-07-10 05:40:40 EEST
  [...]

Note that journal_file_check_object() is similar to
journal_file_object_verify(). The most expensive checks are omitted, as
they would slow down every journal_file_move_to_object() call too much.

With this implementation, the added overhead is small, for example when
dumping some journal content to /dev/null
(built with -Dbuildtype=debugoptimized -Db_ndebug=true):

 Performance counter stats for 'builddir.vanilla/journalctl -D 76f4d4c3406945f9a60d3ca8763aa754/':

      12542,311634      task-clock:u (msec)       #    1,000 CPUs utilized
                 0      context-switches:u        #    0,000 K/sec
                 0      cpu-migrations:u          #    0,000 K/sec
            80 100      page-faults:u             #    0,006 M/sec
    41 786 963 456      cycles:u                  #    3,332 GHz
   105 453 864 770      instructions:u            #    2,52  insn per cycle
    24 342 227 334      branches:u                # 1940,809 M/sec
       105 709 217      branch-misses:u           #    0,43% of all branches

      12,545199291 seconds time elapsed

 Performance counter stats for 'builddir.patched/journalctl -D 76f4d4c3406945f9a60d3ca8763aa754/':

      12734,723233      task-clock:u (msec)       #    1,000 CPUs utilized
                 0      context-switches:u        #    0,000 K/sec
                 0      cpu-migrations:u          #    0,000 K/sec
            80 693      page-faults:u             #    0,006 M/sec
    42 661 017 429      cycles:u                  #    3,350 GHz
   107 696 985 865      instructions:u            #    2,52  insn per cycle
    24 950 526 745      branches:u                # 1959,252 M/sec
       101 762 806      branch-misses:u           #    0,41% of all branches

      12,737527327 seconds time elapsed

Fixes systemd#6447.
poettering added a commit that referenced this issue Sep 24, 2017
journal: add object sanity check to journal_file_move_to_object() (#6447)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
journal needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer
Development

Successfully merging a pull request may close this issue.

5 participants