RFC: Keep Key Logs #73
base: master
Conversation
Signed-off-by: Yilin Chen <sticnarf@gmail.com>
text/0000-keep-key-logs.md
Outdated
The problem is that we don't define which logs are important. Whether a log is important for diagnosis is not related to the log level.

For example, we prefer to keep info level logs in the raftstore rather than keep warn level logs in the coprocessor module. The former one is critical for finding out how each region evolves while the latter one may be only a retryable error.
Is it possible to set different log levels for different modules?
Yes, it is. But module-level settings are coarse-grained. This RFC lets you choose which logs can be dropped even within the same module.
We used to have this problem too. Error logs are not enough for diagnostics, while info logs are too much.
The solution we came up with is to cache recent logs in a buffer, and as soon as we find an error log, we print the whole buffer (including every log before the error log).
I guess it may mitigate the problem.
A concern here is:
In many cases we need to troubleshoot resource-intensive scenarios and understand why they happened. Hence we actually need the logs even more in such scenarios, which leads to a paradox.
I agree with Tony on the suggestion.
We can create a circular buffer to keep diagnostic information and flush it regularly or before a key point.
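To make the idea concrete, here is a minimal, hypothetical sketch of such a circular buffer; the `Level` enum and `RingLogBuffer` type are invented for illustration and are not part of the RFC, and a real implementation would hook into an slog `Drain` instead of `println!`.

```rust
use std::collections::VecDeque;

// Hypothetical severity type; a real implementation would use slog's Level.
#[derive(Clone, Copy, PartialEq, PartialOrd)]
enum Level { Trace, Debug, Info, Warn, Error }

// Ring buffer that keeps the most recent `cap` log lines and flushes
// them all once an error-level record arrives.
struct RingLogBuffer {
    cap: usize,
    buf: VecDeque<String>,
}

impl RingLogBuffer {
    fn new(cap: usize) -> Self {
        RingLogBuffer { cap, buf: VecDeque::with_capacity(cap) }
    }

    fn log(&mut self, level: Level, msg: &str) {
        if self.buf.len() == self.cap {
            self.buf.pop_front(); // drop the oldest record on overflow
        }
        self.buf.push_back(msg.to_owned());
        if level >= Level::Error {
            self.flush(); // the "key point": dump everything buffered so far
        }
    }

    fn flush(&mut self) {
        for line in self.buf.drain(..) {
            println!("{}", line); // stand-in for the real file/async drain
        }
    }
}

fn main() {
    let mut logger = RingLogBuffer::new(3);
    logger.log(Level::Info, "applying snapshot for region 2");
    logger.log(Level::Info, "snapshot applied");
    logger.log(Level::Error, "apply failed"); // triggers a full flush
}
```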
Tony's solution sounds good but may not be so suitable for today's TiKV. TiKV is weak at detecting problems autonomously. It is really common that service degradation or data corruption happens but TiKV itself does not know it and does not print any related error logs immediately. So the "key point" is hard to define.
Anyway, it probably hints at what we can work on in the future.
text/0000-keep-key-logs.md
Outdated
The principle is that availability should never be sacrificed. So we must not block the sender on overflow, nor use an unbounded channel which has the risk of OOM. But we still hope to keep the key logs.

The fallback solution is to write logs to a file synchronously.
"synchronously" is ambiguous.
I changed it to "directly in the working threads".
text/0000-keep-key-logs.md
Outdated
For example, we prefer to keep info level logs in the raftstore rather than keep warn level logs in the coprocessor module. The former one is critical for finding out how each region evolves while the latter one may be only a retryable error.

So, these important logs deserve to have another set of macros:
Like the slow log, maybe we can use a tag to identify whether a log is a key log.
Yes. As said in the "Changes to slog modules" section, these macros are just shortcuts to hide the magic tag, so it would be easier for users.
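To illustrate what such a shortcut could look like, here is a rough sketch assuming the magic tag is literally the string "key" (that name is an assumption, not from the RFC). It uses slog's record tags and the `Discard` drain only to keep the example self-contained; TiKV would wire in its own async drain.

```rust
use slog::{info, o, Discard, Logger};

// Hypothetical shortcut: the RFC only says the key_* macros hide a "magic
// tag"; the tag name "key" below is an assumption.
macro_rules! key_info {
    ($logger:expr, $($args:tt)+) => {
        // slog records can carry a tag (the #"..." prefix); a drain can read
        // it back via Record::tag() and refuse to drop such records.
        info!($logger, #"key", $($args)+)
    };
}

fn main() {
    // Discard drain just to make the example compile.
    let logger = Logger::root(Discard, o!());
    key_info!(logger, "peer created"; "region_id" => 2);
}
```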
Signed-off-by: Yilin Chen <sticnarf@gmail.com>
The fallback solution is to write logs to a file directly in the working threads.

If key logs fail to be sent to the channel, they are formatted at the sender thread and written to the fallback log file. Without disk syncing, these writes should finish with very small latency and reasonable throughput. Then the working threads will not be badly affected and we do not really drop any key logs.
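A minimal sketch of this fallback path, assuming a bounded std channel and a shared fallback file; the type and field names are illustrative, not from the RFC.

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};
use std::sync::Mutex;

// Illustrative only: the RFC does not prescribe these names or types.
struct AsyncLogger {
    tx: SyncSender<String>,         // bounded channel to the log thread
    fallback: Mutex<std::fs::File>, // fallback file shared by working threads
}

impl AsyncLogger {
    fn log_key(&self, line: String) {
        match self.tx.try_send(line) {
            Ok(()) => {}
            // Channel is full (or the log thread is gone): format and write
            // directly in the working thread instead of dropping the record.
            Err(TrySendError::Full(line)) | Err(TrySendError::Disconnected(line)) => {
                let mut f = self.fallback.lock().unwrap();
                // No fsync here; a plain write keeps the latency small.
                let _ = writeln!(f, "{}", line);
            }
        }
    }
}

fn main() -> std::io::Result<()> {
    let (tx, _rx) = sync_channel(8192);
    let file = OpenOptions::new().create(true).append(true).open("tikv.fallback.log")?;
    let logger = AsyncLogger { tx, fallback: Mutex::new(file) };
    logger.log_key("[key] region 2 became leader".to_owned());
    Ok(())
}
```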
The key logs may be written to two different files, and the order of logs between them may not hold.
Indeed. We cannot tell the order of logs in these two files.
If we really need to solve the problem, we need some other light synchronization.
A possible approach is adding a sequence field. I'm not very sure whether this is really necessary or whether it may make the logs a bit messy.
What do you think? @BusyJay
I've been thinking for a while about printing the sequence number only when necessary, but all the approaches in my mind seem too complex and not elegant...
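For example, a single global atomic counter shared by the async drain and the fallback writer would be enough to merge the two files later. This is only a sketch of the idea, with names invented for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative only: one global counter shared by the async drain and the
// fallback writer, so records in the two files can be merged by `seq`.
static LOG_SEQ: AtomicU64 = AtomicU64::new(0);

fn next_seq() -> u64 {
    // Relaxed is enough: we only need a unique, monotonically assigned number,
    // not ordering with respect to other memory operations.
    LOG_SEQ.fetch_add(1, Ordering::Relaxed)
}

fn main() {
    // The sequence would be captured when the record is created, e.g. as an
    // extra "seq" key-value attached to each slog record.
    let line = format!("[seq={}] [INFO] peer created", next_seq());
    println!("{}", line);
}
```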
Why not just depend on time?
@BusyJay Time works in most cases, but it does not in extreme cases, such as when two logs happen in the same millisecond.
Also, it looks like the time is currently fetched at the final drainer, so the time in async logs is a bit later (adjusting the drainer sequence should work, I think).
Does strict order matter? It seems fine to me as long as I can know which logs have a strict order and which don't.
Multiple files are a challenge for log collectors.
Maybe not.
But it's hard for me to tell how much difficulty the lack of strict order will add. I don't have experience diagnosing problems under this condition.
I think strict order of logs that happen concurrently is unnecessary too, and it is okay to depend on time. But even when two logs have a clear happens-before order both in time and in code, the order can still break.
Consider logA happening before logB in code, and logA1 happening before logA2 in time. It is possible that the normal log file contains [logB1, logA2] while the fallback log file contains [logA1, logB2], and the timestamps of logB1 and logA2 are both later than those of logA1 and logB2.
- RFC PR:
- Tracking issue:
Reminder: populate these fields on time.
Thanks. When this RFC is close to being accepted, I will create a tracking issue and fill in these fields.
This seems a bit weird because we are actually adding another set of log levels.
I don't know of any similar design in other projects. I find async logging is not that common. Other databases like PostgreSQL and CockroachDB just print logs in the working thread. But async logging can reduce the possible latency spikes caused by file IO. It might not be a good idea to go back to sync logging.
The problem is that we don't define which logs are important. Whether a log is important for diagnosis is not related to the log level.

For example, we prefer to keep info level logs in the raftstore rather than keep warn level logs in the coprocessor module. The former one is critical for finding out how each region evolves while the latter one may be only a retryable error.

So, these important logs deserve to have another set of macros:

- `key_crit!`
- `key_error!`
- `key_warn!`
- `key_info!`
- `key_debug!`
- `key_trace!`

Logs from these macros should not be dropped as they are key information. These macros should **NOT** be abused.
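As a rough illustration of how the drain side could treat these macros, here is a hypothetical sketch (not part of the RFC): records carrying the key tag take the fallback path when the channel is full, while ordinary records may be dropped. A real implementation would check slog's `Record::tag()` instead of the boolean field used here.

```rust
// Hypothetical record shape; a real implementation would inspect slog's
// Record::tag() instead of a boolean field.
struct LogRecord {
    is_key: bool, // set by the key_* macros via the magic tag
    line: String,
}

#[derive(Debug)]
enum Outcome {
    SentAsync,     // queued on the bounded channel as usual
    WroteFallback, // channel full, but the record is key: write it directly
    Dropped,       // channel full and the record is ordinary: drop it
}

fn dispatch(record: &LogRecord, channel_full: bool) -> Outcome {
    if !channel_full {
        Outcome::SentAsync // normal path: hand the record to the async log thread
    } else if record.is_key {
        Outcome::WroteFallback // key logs must never be lost
    } else {
        Outcome::Dropped // ordinary logs may be dropped to protect availability
    }
}

fn main() {
    let records = [
        LogRecord { is_key: true, line: "[key] region 2 became leader".into() },
        LogRecord { is_key: false, line: "coprocessor retryable error".into() },
    ];
    for rec in &records {
        // Pretend the channel is full to show the two different outcomes.
        println!("{:?}: {}", dispatch(rec, true), rec.line);
    }
}
```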
I am concerned about this design.
IMHO, the standard logging levels from `TRACE` to `FATAL` are so popular across the whole industry that it's a bad idea to reinvent them.
For example, when logging a message at the `INFO` level, it means that we want people who see this log message to know that the system is in the state the message describes, like "I am ready to begin a transaction". Other verbose messages that are unnecessary at this level are logged at `DEBUG` or `TRACE`.
If there's an "important" (in this RFC's context) message that matters for diagnosis, we should consider putting it into `WARN`, which signals that something unexpected is happening and action must be taken, but there is no direct damage. Or we can put it into `INFO` as above, because it's important to say that the system is currently in some state for diagnosis reasons.
I believe the problem then is that there are so many `INFO` level log messages that they overwhelm the important diagnostic messages. But wait! Why don't we take a look at all those `INFO` level log messages: are they too verbose? I mean, do we really need to talk to the user all the time when we are doing some common operations?
PS: These are my personal comments, which may be wrong, please point out. The whole point is that it's really weird to reinvent a whole bunch of logging levels XD