Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

journald source fails to retrieve checkpoint #1081

Closed
phyber opened this issue Oct 24, 2019 · 26 comments
Closed

journald source fails to retrieve checkpoint #1081

phyber opened this issue Oct 24, 2019 · 26 comments
Assignees
Labels
source: journald Anything `journald` source related type: bug A code related bug.

Comments

@phyber
Copy link
Contributor

phyber commented Oct 24, 2019

Software Versions

$ vector --version; journalctl --version
vector 0.4.0-dev
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP \
+GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN

Note that I'm running vector from the 0.5.0 RPM package, but the version is incorrectly reported.
I'm running Vector on the latest (at time of filing issue) Amazon Linux 2 in EC2. The instance is pretty much vanilla except for Puppet and Vector.

Vector Config

data_dir = "/var/lib/vector"

[sources.in]
type = "journald"
units = ["puppet"]

[sinks.tempfile]
inputs = ["in"]
path = "/var/log/vector-puppet-%Y-%m-%d.log"
type = "file"

Issue

Hi,

There appears to be an issue with the journald source, but I'm hoping it's just me missing something.

I'm expecting the above configuration to read from journald, filter out events from the puppet service, and write them to /var/log/vector-puppet-%Y-%m-%d.

However, what actually happens is Vector throws an error regarding the journald checkpointing.

ERROR source{name="in"}:journald-server: vector::sources::journald: Could not retrieve journald checkpoint. error=Cannot assign requested address (os error 99)

I believe this error originates here: https://github.com/timberio/vector/blob/c112c4ac7f45e69fea312e7691566a3f9e8e3066/src/sources/journald.rs#L221-L234

However, this is as far as I've got. Trying to follow the cursor code quickly disappears into the journald crate which calls out to some C code.

Might I be missing some required systemd configuration? I've scanned the documentation and read the vector.spec.toml, and I believe the above config should "just work".

Thanks.

@phyber phyber changed the title journald source fails to open checkpoint journald source fails to retrieve checkpoint Oct 24, 2019
@binarylogic
Copy link
Contributor

Thanks @phyber, we'll investigate and see what's going on.

@binarylogic binarylogic added source: journald Anything `journald` source related type: bug A code related bug. labels Oct 24, 2019
@bruceg
Copy link
Member

bruceg commented Oct 24, 2019

I'm puzzled what might be happening here. There are actually two places where that error could appear, and neither make sense.

In the first, it indicates a failure to read the cursor record previously saved to a file. The error is EADDRNOTAVAIL which is not possible when reading from a file (that has already been opened, no less). This is not what you pointed to and I agree the other is more likely.

In the second, it indicates a failure to retrieve the cursor from the journald library. From my reading of the systemd code and online comments, this can only happen if the journal has not been opened and read from, but this code only happens after we try to read from the journal.

Is it possible that the journal on your system is completely empty or that there is a permissions issue with your user reading from the journal?

@dbhowell
Copy link
Contributor

I was also getting this error. I managed to narrow it down, originally thinking it was related to reading/writing to the checkpoint.txt. It was erroring in the 'second' place (which you've since changed).

You're latest changeset stops the error, however there is no output recorded from the sink.

next() is returning nothing, but not erroring either.

I believe it is an issue around:

https://github.com/timberio/vector/blob/8fef7056a1d1c515014e721a2940d04ff269a704/lib/journald/src/lib.rs#L90

This is happening on Amazon Linux 2 (systemd-219, vector-0.5) and Ubuntu 18.04 (systemd-237, vector built from source).

@phyber
Copy link
Contributor Author

phyber commented Oct 25, 2019

There are actually two places where that error could appear

It does look like that at first glance, but it's definitely the second location. One error string ends with a . while the other does not, luckily allowing us to narrow it down quicker. There's probably a minor enhancement there to make the two error locations more distinct :)

The journal on my system is definitely not empty, there are plenty of events to be seen in journalctl -fu <unit>.

I'll be looking at this more today, so I'll have a play around with various permissions and see if that changes anything.

Thanks.

@dbhowell
Copy link
Contributor

dbhowell commented Oct 25, 2019

Sorry @phyber , I should have explained better. I agree with you and I am experiencing the exact same issue as you. My comments were for @bruceg to help debug the problem.

I've determined permissions are not the problem either as I get the same result (no data from the source when there is clearly logs coming via journalctl -fu <unit> when run as root.

@phyber
Copy link
Contributor Author

phyber commented Oct 25, 2019

Sorry @dbhowell , I should have been clearer too, the reply was to @bruceg. :)
Thanks for being another set of eyes on this!

@bruceg
Copy link
Member

bruceg commented Oct 25, 2019

Thanks @dbhowell. It was not my intention to mask the problem, but this does shed a lot more light on what might be going on.

If next() is returning nothing (None), it means that sd_journal_next has returned 0 indicating that it has reached the end of the journal. I think what is happening is an issue I saw with testing as well.

The initialization for journald takes a flags parameter, which includes a bit named SD_JOURNAL_RUNTIME_ONLY. We have exposed that flag as a current_runtime_only setting, which defaults to true. However, this flag means to read from only the volatile journal, not to start from the current boot (which I have been unable to find direct access to, despite journalctl --boot working nearly instantly). This is perhaps a poor choice of either default value or naming for that setting.

Try setting current_runtime_only = false in the journald section of your Vector configuration.

@dbhowell
Copy link
Contributor

@bruceg I can confirm that setting current_runtime_only = false does produce a result, but like you say above is not the current boot, but previous boots as well.

I assume you will need to use sd_journal_seek_monotonic_usec() and then get the boot_id from sd_id128_get_boot() to get to the head of the current boot.

You're new issue #1097 looks like you're well ahead of me.

Thanks for persisting with this. Unfortunately this is well out of my expertise.

@bruceg
Copy link
Member

bruceg commented Oct 28, 2019

PR #1105 should resolve this issue. Note that it replaces the current_runtime_only setting with a much clearer current_boot_only, which again defaults to true.

@dbhowell
Copy link
Contributor

This almost resolves the issue. I believe you had to add a filter match as well, such as:

sd_journal_add_match(journal, "_BOOT_ID=" + boot_id, 0)

and most likely a sd_journal_flush_matches() prior to make sure.

@bruceg
Copy link
Member

bruceg commented Oct 29, 2019

I don't see why a filter match is necessary. If we are seeking to the last boot (which is currently running), what does the filter accomplish? I implemented and tested it, and it made no difference on my system.

@dbhowell
Copy link
Contributor

See here: systemd/systemd#1752

Basically the seek works to the nearest first record of the boot, but does not guarantee the next records are from the same boot.

@bruceg
Copy link
Member

bruceg commented Oct 31, 2019

Got it, see PR #1122.

@bruceg
Copy link
Member

bruceg commented Nov 7, 2019

This is done then, right?

@knl
Copy link

knl commented Nov 18, 2019

I'm experiencing exactly the same issue running the latest code compiled on NixOS:

$ sudo -u vector -g vector /nix/store/x838gypsxsvzidcm7kklqhk1gpcc59bb-vector-v0.6.0/bin/vector --config=/nix/store/0fc119g6vmkd6i5jisf0hmgsrpmfyy03-config.toml
Nov 18 15:38:54.136  INFO vector: Log level "info" is enabled.
Nov 18 15:38:54.136  INFO vector: Loading config. path="/nix/store/0fc119g6vmkd6i5jisf0hmgsrpmfyy03-config.toml"
Nov 18 15:38:54.138  INFO vector: Vector is starting. version="0.6.0" git_version="" released="Mon, 18 Nov 2019 15:36:45 +0000" arch="x86_64"
Nov 18 15:38:54.155  INFO vector::topology: Running healthchecks.
Nov 18 15:38:54.155  INFO vector::topology: Starting source "nginx"
Nov 18 15:38:54.156  INFO vector::topology: Starting sink "es_cluster"
Nov 18 15:38:54.157  INFO source{name=nginx type=journald}: vector::sources::journald: Starting journald server.
Nov 18 15:38:54.157 ERROR source{name=nginx type=journald}:journald-server: vector::sources::journald: Could not retrieve current journald checkpoint. error=Cannot assign requested address (os error 99)
Nov 18 15:38:54.443  INFO vector::topology::builder: Healthcheck: Passed.
Nov 18 15:38:54.657 ERROR source{name=nginx type=journald}:journald-server: vector::sources::journald: Could not retrieve current journald checkpoint. error=Cannot assign requested address (os error 99)
Nov 18 15:38:55.157 ERROR source{name=nginx type=journald}:journald-server: vector::sources::journald: Could not retrieve current journald checkpoint. error=Cannot assign requested address (os error 99)
Nov 18 15:38:55.657 ERROR source{name=nginx type=journald}:journald-server: vector::sources::journald: Could not retrieve current journald checkpoint. error=Cannot assign requested address (os error 99)
Nov 18 15:38:56.158 ERROR source{name=nginx type=journald}:journald-server: vector::sources::journald: Could not retrieve current journald checkpoint. error=Cannot assign requested address (os error 99)
^CNov 18 15:38:56.650  INFO vector: Shutting down.

config.toml:

$ cat /nix/store/0fc119g6vmkd6i5jisf0hmgsrpmfyy03-config.toml
data_dir = "/var/lib/vector"
# Ingest data by tailing one or more files
[sources.nginx]
  type         = "journald"
  units        = ["nginx"]

# Send structured data to a short-term storage
[sinks.es_cluster]
  inputs       = ["nginx"]
  type         = "elasticsearch"
  host         = "http://elasticsearch.example.com:9200"
  index        = "nginx-%Y-%m-%d"

bruceg added a commit that referenced this issue Nov 18, 2019
The two read states (saw-record vs at-end) are actually independent and
not mutually exclusive, so representing them as an enum leads to flow
control errors. This led to breaking a previous fix for issue #1081
(journald source fails to retrieve checkpoint).

Signed-off-by: Bruce Guenter <bruce@untroubled.org>
@bruceg
Copy link
Member

bruceg commented Nov 19, 2019

It appears the fix we put in place for this was un-fixed by #1106. I've submitted a new fix in #1202 which solves the problem in a better way.

bruceg added a commit that referenced this issue Nov 19, 2019
…name (#1202)

* Fix systemd dynamic library name

The correct name for the systemd library is `libsystemd.so.0` and not
`libsystemd.so`.

Signed-off-by: Bruce Guenter <bruce@untroubled.org>

* Rework journald read state logic

The two read states (saw-record vs at-end) are actually independent and
not mutually exclusive, so representing them as an enum leads to flow
control errors. This led to breaking a previous fix for issue #1081
(journald source fails to retrieve checkpoint).

Signed-off-by: Bruce Guenter <bruce@untroubled.org>

* Break up the overly-long `JournaldServer::run` function

Signed-off-by: Bruce Guenter <bruce@untroubled.org>
@bruceg
Copy link
Member

bruceg commented Nov 19, 2019

@knl could you confirm the fix just committed to master resolves this issue for you?

@knl
Copy link

knl commented Nov 20, 2019

@bruceg the failure is not there anymore. However, I see nothing being shipped to elasticsearch either. But that is probably a different issue.

@bruceg
Copy link
Member

bruceg commented Nov 20, 2019

If you add a console sink such as the following to the config, does it show data coming from journald?

[sinks.console]
  inputs = [ "my-journald-source" ]
  type = "console"
  encoding = "text"

@knl
Copy link

knl commented Nov 21, 2019

@bruceg I used the following config:

data_dir = "/var/lib/vector"
[sources.nginx]
  type         = "journald"

[sinks.console]
  inputs = [ "nginx" ]
  type = "console"
  encoding = "text"

I run it as:

$ sudo -u vector -g vector /nix/store/qlih401lmjbh5n4j2rb0dsvwy6vyb60j-vector-v0.6.0/bin/vector --config=/tmp/config.toml
Nov 21 10:35:02.246  INFO vector: Log level "info" is enabled.
Nov 21 10:35:02.246  INFO vector: Loading config. path="/tmp/config.toml"
Nov 21 10:35:02.248  INFO vector: Vector is starting. version="0.6.0" git_version="" released="Wed, 20 Nov 2019 15:35:37 +0000" arch="x86_64"
Nov 21 10:35:02.249  INFO vector::topology: Running healthchecks.
Nov 21 10:35:02.249  INFO vector::topology: Starting source "nginx"
Nov 21 10:35:02.249  INFO vector::topology: Starting sink "console"
Nov 21 10:35:02.249  INFO vector::topology::builder: Healthcheck: Passed.
Nov 21 10:35:02.250  INFO source{name=nginx type=journald}: vector::sources::journald: Starting journald server.

Nothing is displayed on the console. The system is NixOS. I'll investigate a bit during the weekend.

@bruceg
Copy link
Member

bruceg commented Nov 21, 2019

That certainly looks like the logs are not reaching vector. Try running vector with the --verbose option to get more detail logged.

@knl
Copy link

knl commented Nov 22, 2019

Sorry, I forgot to mention that, I did also run with --verbose and the output is exactly the same. I will dig over the weekend, whether is comes from the way how I compiled vector (Nix).

@binarylogic
Copy link
Contributor

@knl do you have any other information on this issue? Did you get it corrected? @bruceg it might be worth trying out the journald source on nix to see if we can reproduce.

@ghost
Copy link

ghost commented Dec 23, 2019

I think the problem with thenix package is that it doesn't have access to libsystemd.so, which is a run-time dependency for journald source. See #1242 (comment) for details.

@lukesteensen lukesteensen assigned ghost and unassigned bruceg Dec 30, 2019
@binarylogic
Copy link
Contributor

@phyber @knl thanks for your patience on this. And apologies for the rocky experience with this source. This could have been prevented with full integration tests (which we're working on). Journald integration is surprisingly more complicated than it appears. The woes of 0.x development. 😄

To update you, we've identified the issue and you can track progress on #1473. Things we're doing that will prevent regression:

  1. Adding full integration tests for journald across all of our platforms. (Add tests for journald source which would try to load libsystemd #1464)
  2. Officially supporting nix and adding it to our list of platforms that we build and run tests on. (chore(operations): Verify building of the Nix package #1432)

We're hoping to have this completed in the next week, which depends on the solution we choose in #1473.

@binarylogic binarylogic assigned bruceg and unassigned ghost Jan 13, 2020
@bruceg
Copy link
Member

bruceg commented Jan 23, 2020

This should be resolved by the resolution of #1473. If this is still a problem, please reopen this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
source: journald Anything `journald` source related type: bug A code related bug.
Projects
None yet
Development

No branches or pull requests

5 participants