OSD daemon logs are not being collected #2479

Closed
travisn opened this issue Jan 10, 2019 · 15 comments · Fixed by #2496
travisn (Member) commented Jan 10, 2019

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
Since the decoupling of the ceph version, the ceph-osd log output is not being captured in the k8s logs. In the osd logs we see the output of ceph-volume and that the osd process is started, but nothing after that. It seems to be an issue with rook starting the ceph-osd process and not capturing the stderr output.

Expected behavior:
OSD logging should be captured.

How to reproduce it (minimal and precise):

  • Create the rook operator
  • Create the rook cluster
  • Look at the osd pod logs
travisn added the ceph and beginner labels Jan 10, 2019
travisn added this to the 0.9 milestone Jan 10, 2019
travisn added this to To do in v0.9 via automation Jan 10, 2019
BlaineEXE (Member) commented

I'm unable to repro this with Rook master and Ceph 13.2.2. Is this an issue with Rook 0.9.1?

travisn (Member, Author) commented Jan 11, 2019

Do you see any output in the osd log after the ceph-osd command? All I see is the output collected from ceph-volume, then the command to start ceph-osd. When debugging an issue with someone else yesterday, we found that setting the log file in the ceph config to a file instead of /dev/stderr gave us a bunch of output in the file.

BlaineEXE (Member) commented

Here is a pastebin of the log from one of my osd containers. https://paste.opensuse.org/view//1664407a

It seems to me that this is what I would expect, but maybe there is data missing that I don't see. My pods are running bluestore with ceph-osd as the entrypoint. It sounds like you could be using filestore.

dotnwat (Contributor) commented Jan 14, 2019

I think the issue here is that the osd spec should use -d instead of --foreground. Verifying...
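
For context, per the standard Ceph daemon options the two flags differ in where log output goes, so -d would put the logs on the container's stderr where k8s can collect them:

ceph-osd --foreground   # run in foreground; log to the configured log file
ceph-osd -d             # run in foreground and send log output to stderr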

jbw976 moved this from To do to In progress in v0.9 Jan 15, 2019
dotnwat (Contributor) commented Jan 15, 2019

@travisn is the intention for OSDs to not produce any log files, and rather send all logging output to stdout/stderr to be collected by k8s logging?

travisn (Member, Author) commented Jan 15, 2019

@noahdesu Yes, collecting all the log output would be the goal. Files inside the pod are more easily lost when the pod is restarted.

travisn (Member, Author) commented Jan 16, 2019

The logging problem might just be that the default logging level is 0, which seems far too low. Seems like we should remove these default logging levels and rely on the default ceph logging levels. @liewegas thoughts on this?

Note that the translation to level 0 happens here, which was an attempt to make it simpler to enable debug logging and make sure the logs weren't filling up too quickly. Level 0 seems very low though.

Perhaps we should only set the log levels with the ceph.conf override ability, which we already have.
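
For illustration, that kind of override would look like the following in ceph.conf; the subsystems and levels here are arbitrary examples, using Ceph's "log level/memory level" syntax:

[global]
debug osd = 1/5
debug ms = 0/5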

#2496 still looks like a good idea to make sure we're capturing all the logging.

liewegas (Member) commented

> The logging problem might just be that the default logging level is 0, which seems far too low. Seems like we should remove these default logging levels and rely on the default ceph logging levels. @liewegas thoughts on this?

Yes, please just use the ceph defaults.

travisn (Member, Author) commented Jan 23, 2019

The issue seems to be in the following scenario, only affecting a certain type of osd:

  • On builds where ceph-volume is available (12.2.10, 13.2.4, and newer), rook provisions the osd
  • Rook starts the osd pod with the entrypoint of rook.
  • Rook creates two processes:
    • ceph-volume activate, which creates a tmpfs for the config directory on the volume
    • ceph-osd to run the osd daemon
  • Rook attempts to capture all the stdout and stderr output from the two processes and writes them to the log

Rook seems to be losing the stderr output written by the ceph-osd process. It is surely related to mounting /dev into the container for the osds to access, with the output to /dev/stderr getting confused somewhere along the way. However, it's not clear where the disconnect in the logging is.
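
For context, the capture path boils down to the standard os/exec pattern. This is a simplified, hypothetical sketch rather than the actual Rook code (the invocation and error handling are illustrative), with logOutput standing in for Rook's log collector; how logOutput reads the two streams turns out to be the crux of this issue, as discussed below:

package main

import (
	"bufio"
	"io"
	"log"
	"os/exec"
)

// logOutput stands in for Rook's log collector, which (per the exec.go
// source quoted later in this thread) scans both streams via io.MultiReader.
func logOutput(stdout, stderr io.Reader) {
	in := bufio.NewScanner(io.MultiReader(stdout, stderr))
	for in.Scan() {
		log.Println(in.Text())
	}
}

func main() {
	// Illustrative invocation only; Rook passes many more arguments.
	cmd := exec.Command("ceph-osd", "--foreground")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	stderr, err := cmd.StderrPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	logOutput(stdout, stderr)
	if err := cmd.Wait(); err != nil {
		log.Fatal(err)
	}
}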

The workaround to collect osd logs is to set the log file config option to write to a file in the container instead of stderr. For example, set a custom ceph.conf setting as follows:

[global]
log file = /var/lib/rook/$name.log
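
One way to deliver such a setting is Rook's rook-config-override ConfigMap. A sketch, assuming the cluster runs in the rook-ceph namespace (adjust to your deployment):

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [global]
    log file = /var/lib/rook/$name.log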

travisn moved this from In progress to In Review in v0.9 Jan 23, 2019
zerkms (Contributor) commented Jan 24, 2019

It looks like this problem is related to mine, #2547.

It looks like the /rook/rook process, after it starts ceph-osd, does not forward stderr where it should, and after a couple of minutes of running under I/O load one of the buffers fills up and ceph-osd blocks forever.

zerkms (Contributor) commented Jan 24, 2019

Ok, I think I found a bug:

@travisn, you mentioned:

> Rook attempts to capture all the stdout and stderr output from the two processes and writes them to the log

but the implementation is the following:

exec.ExecuteCommand https://github.com/rook/rook/blob/master/pkg/util/exec/exec.go#L64 runs logOutput(actionName, stdout, stderr)

which has the following line:

https://github.com/rook/rook/blob/master/pkg/util/exec/exec.go#L202

in := bufio.NewScanner(io.MultiReader(stdout, stderr))

The problem with it is that io.MultiReader reads from the passed io.Readers sequentially.

So the first reader (in this case stdout) is read until io.EOF, which never happens while the daemon keeps running.

So lines from stderr are never read, which leads to the problems in the bug I reported.
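
A minimal, runnable demonstration of that sequencing (io.Pipe stands in for the child process's stdout/stderr pipes):

package main

import (
	"bufio"
	"fmt"
	"io"
)

func main() {
	stdoutR, stdoutW := io.Pipe()
	stderrR, stderrW := io.Pipe()

	go func() {
		// This write blocks until the scanner reaches the stderr reader,
		// which only happens after stdout hits EOF. A real daemon's stderr
		// writes would likewise stall once the OS pipe buffer fills.
		fmt.Fprintln(stderrW, "stderr: only visible after stdout EOF")
		stderrW.Close()
	}()

	go func() {
		fmt.Fprintln(stdoutW, "stdout: line 1")
		// A long-running daemon never closes stdout; it is closed here
		// only so the demo terminates.
		stdoutW.Close()
	}()

	in := bufio.NewScanner(io.MultiReader(stdoutR, stderrR))
	for in.Scan() {
		fmt.Println(in.Text())
	}
}

Remove the stdoutW.Close() and the program hangs: the scanner waits on stdout forever and never reads a line from stderr, matching the missing OSD logs.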

dotnwat (Contributor) commented Jan 24, 2019

@zerkms https://github.com/noahdesu/rook/commit/8f13ccf102854339b7214b93c79d3cd4566356d2

zerkms (Contributor) commented Jan 24, 2019

@noahdesu right, I did not spot that a PR had already been sent to address it, sorry.

v0.9 automation moved this from In Review to Done Jan 24, 2019
jbw976 modified the milestones: 0.9, 1.0 Jan 24, 2019
jbw976 added this to To do in v1.0 via automation Jan 24, 2019
jbw976 moved this from To do to Done in v1.0 Jan 24, 2019
jbw976 (Member) commented Jan 24, 2019

Does this need to be backported to 0.9? It's in the 0.9 project.

jbw976 moved this from Done to To do in v0.9 Jan 24, 2019
dotnwat (Contributor) commented Jan 24, 2019

yes

travisn pushed a commit to travisn/rook that referenced this issue Jan 25, 2019
The io.MultiReader(r1, r2) reads r1 until EOF before moving on to r2.
When used for stdout/stderr, stderr will not be written to the log until
stdout reaches EOF. This patch reads stderr in a goroutine so that the
two streams can be interleaved properly.

Fixes: rook#2479

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit fbb56c4)
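
The shape of that fix, as a hedged sketch (simplified from the commit message above; names and the usage in main are illustrative, not the exact Rook code):

package main

import (
	"bufio"
	"io"
	"log"
	"strings"
)

// logOutput drains stdout and stderr concurrently, so a quiet or
// never-ending stdout can no longer starve stderr.
func logOutput(stdout, stderr io.Reader) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		s := bufio.NewScanner(stderr)
		for s.Scan() {
			log.Println(s.Text())
		}
	}()
	s := bufio.NewScanner(stdout)
	for s.Scan() {
		log.Println(s.Text())
	}
	<-done // wait for stderr to be fully drained before returning
}

func main() {
	// Stand-in readers; in Rook these come from the child process's pipes.
	logOutput(strings.NewReader("out line\n"), strings.NewReader("err line\n"))
}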
travisn moved this from To do to In Review in v0.9 Jan 25, 2019
travisn moved this from In Review to Done in v0.9 Jan 25, 2019