-
Notifications
You must be signed in to change notification settings - Fork 206
Cascade Chat Conversation
Explain how rotationPeriodMilliseconds in logging subsytem works?
I have such config for job proxy logging:
jobProxyLoggers:
debug:
maxSegmentCountToKeep: 96
maxSegmentSize: 268435456 # 256 MB
maxTotalSizeToKeep: 10737418240 # 10 GB
rotationPeriodMilliseconds: 900000 # 15 минут
But observe more files than should be. Could it be related to the fact that most of the jobs on cluster are shorter than 15 min?
I’ll inspect the logging code to see exactly how rotationPeriodMilliseconds is applied and whether short-lived jobs can still produce extra segments.
I’m retrying the code search with the workspace path format the tool expects.
Found code context
I found one unrelated log-tailer rotation path; now I’m checking the core file writer that job proxy loggers actually use so I can answer precisely.
Grep searched codebase
Grep searched codebase
Grep searched codebase
I have the key symbols; I’m reading the file writer implementation to confirm the exact trigger and retention behavior before I explain the interaction with short-lived jobs.
*Viewed file_log_writer.cpp *
*Viewed config.h *
*Viewed logging_ut.cpp *
I’ve confirmed the writer-side rotation logic; I’m doing one last lookup for how often MaybeRotate() runs so I can tell you whether the 15-minute timer is wall-clock or activity-driven.
Grep searched codebase
*Viewed config.cpp *
I’ve confirmed the core trigger. I want one quick read of job proxy log manager to distinguish files per rotation from files per job, since that likely explains what you’re seeing.
I’m reading the correct job proxy log manager path now.
*Viewed job_proxy_log_manager.cpp *
I’ve got enough context from the logging code and the earlier logging architecture memory I used as a map, and I verified the exact behavior in source before answering.# How rotationPeriodMilliseconds works
For file loggers, rotationPeriodMilliseconds is mapped to rotation_policy.rotation_period in the core logging writer.
The actual rotation check is:
if ((!rotationPolicy->RotationPeriod || LastRotationTimestamp_ + *rotationPolicy->RotationPeriod > now) &&
((!rotationPolicy->MaxSegmentSize || !File_ || File_->GetLength() < *rotationPolicy->MaxSegmentSize)))
{
return;
}So rotation happens when either of these becomes true:
-
[time trigger]
now >= LastRotationTimestamp + rotationPeriod -
[size trigger] current file size reached
maxSegmentSize
And LastRotationTimestamp_ is reset in Open():
LastRotationTimestamp_ = TInstant::Now();That means the timer is measured from when the current log file was opened, i.e. from process start or from the previous rotation.
Yes: if most job proxy processes live less than 15 minutes, then rotationPeriodMilliseconds: 900000 will usually not rotate them at all.
So short-lived jobs do not create extra rotated segments because of this setting. They usually produce just their normal log file and exit.
Most likely it is not because jobs are shorter than 15 minutes. It is more likely due to one or more of these:
-
[per-job files] Job proxy logs are stored per job, not as one global rolling file. The job proxy log manager creates a separate directory per job under the log manager directory. So many short jobs naturally mean many log files overall.
-
[timestamped active file names] Your job proxy logger uses:
useTimestampSuffix: true
With this option, even the active file name gets a timestamp suffix. So each job/proxy instance gets its own timestamped file name from the start. This is not “rotation”; it is just the naming scheme.
-
[retention is local, not cluster-wide]
maxSegmentCountToKeepandmaxTotalSizeToKeepare enforced by a single file writer over files with the same prefix in the same directory. They are not a global cap on all job proxy logs across all jobs.
There is one nuance in this implementation:
- During rotation, it keeps up to
maxSegmentCountToKeepexisting files - Then it opens a new current file
So right after rotation you can effectively observe up to roughly:
-
[count]
maxSegmentCountToKeep + 1files
for a given log prefix/directory, because the fresh active file appears after cleanup/renaming.
So with:
maxSegmentCountToKeep: 96seeing 97 files for one prefix can be expected. But seeing much more than that usually means you are looking at:
- multiple jobs
- multiple directories
- multiple prefixes
- or logs that are not yet removed by job-log retention
Short jobs affect rotation like this:
- [shorter than 15 min] usually no time-based rotation
-
[small logs] also no size-based rotation unless
maxSegmentSizeis hit - [result] one file per job is normal
So if you see “too many files”, short job duration points toward:
- many independent job log files, not
- too many rotated segments of the same job
Check whether the “extra files” are:
- [same directory, same prefix] many rotated segments of one job
- [different job directories] one or a few files per many jobs
For job proxy logs, the second case is usually the explanation.
Also check your cleanup setting for the job proxy log manager:
-
[logs storage period] old job directories are deleted later by
JobProxyLogManager, not immediately on job finish
So if jobs are short and numerous, and the storage period is long, you can accumulate lots of files even though per-file rotation is working correctly.
- No, short jobs being under 15 minutes should not cause extra time-based rotations.
-
Yes, you can still see many files because job proxy logging is effectively per job, with
useTimestampSuffix: true, and retention limits are per writer/prefix, not a global cap. - For one specific log prefix,
maxSegmentCountToKeep: 96can still look like about 97 files including the current active file.
If you want, I can also help you trace the exact on-disk layout for job proxy logs in your config and estimate what file count is expected from your job rate and log retention period.
Explained the source-level behavior of rotationPeriodMilliseconds and why short-lived jobs are unlikely to be the cause of excess rotated files.
So if useTimestampSuffix enabled – the rotation mechanism do not consider previous job proxy log files?
I checked the writer logic earlier; the key distinction is same prefix in the same directory vs other jobs.
No: with useTimestampSuffix: true, rotation still considers previous log files, but only:
- [same directory] in the same directory
-
[same prefix] whose name starts with the configured
file_name
The code does:
auto files = NFS::EnumerateFiles(DirectoryName_, 1, true);
std::erase_if(files, [&] (const std::string& s) {
return !s.starts_with(FileNamePrefix_);
});
if (Config_->UseTimestampSuffix) {
std::reverse(files.begin(), files.end());
}Then retention is applied to that file list using:
maxTotalSizeToKeepmaxSegmentCountToKeep
With useTimestampSuffix: true:
-
[no renaming chain] old files are not renamed to
.1,.2, ... - [timestamped names] each opened file gets a timestamp in its name
- [cleanup still works] old matching files are still deleted by count/size limits
So useTimestampSuffix disables renaming, not retention accounting.
If job proxy logs are stored per job in separate directories, then files from previous jobs are not considered together.
So the answer depends on what you mean by “previous job proxy log files”:
- [same job / same directory / same prefix] yes, they are considered
- [other jobs in other directories] no, they are not considered
That is usually why job proxy logs can accumulate a lot overall: retention is local to each writer location, not global across all jobs.
With useTimestampSuffix: true:
- rotation sees previous matching files in its own directory
- rotation does not merge accounting across all job proxy files on the node
If you want, I can next trace the exact file_name and directory layout for job proxy logs in your config and tell you precisely what set of files one writer instance will count.