Skip to content

Cascade Chat Conversation

Ignat edited this page Mar 16, 2026 · 5 revisions

Question 1

User Input

Explain how rotationPeriodMilliseconds in logging subsytem works?

I have such config for job proxy logging:

jobProxyLoggers:
  debug:
    maxSegmentCountToKeep: 96
    maxSegmentSize: 268435456           # 256 MB
    maxTotalSizeToKeep: 10737418240     # 10 GB
    rotationPeriodMilliseconds: 900000  # 15 минут

But observe more files than should be. Could it be related to the fact that most of the jobs on cluster are shorter than 15 min?

Planner Response

I’ll inspect the logging code to see exactly how rotationPeriodMilliseconds is applied and whether short-lived jobs can still produce extra segments.

Planner Response

I’m retrying the code search with the workspace path format the tool expects.

Found code context

Planner Response

I found one unrelated log-tailer rotation path; now I’m checking the core file writer that job proxy loggers actually use so I can answer precisely.

Grep searched codebase

Grep searched codebase

Grep searched codebase

Planner Response

I have the key symbols; I’m reading the file writer implementation to confirm the exact trigger and retention behavior before I explain the interaction with short-lived jobs.

*Viewed file_log_writer.cpp *

*Viewed config.h *

*Viewed logging_ut.cpp *

Planner Response

I’ve confirmed the writer-side rotation logic; I’m doing one last lookup for how often MaybeRotate() runs so I can tell you whether the 15-minute timer is wall-clock or activity-driven.

Grep searched codebase

*Viewed config.cpp *

Planner Response

I’ve confirmed the core trigger. I want one quick read of job proxy log manager to distinguish files per rotation from files per job, since that likely explains what you’re seeing.

Planner Response

I’m reading the correct job proxy log manager path now.

*Viewed job_proxy_log_manager.cpp *

Planner Response

I’ve got enough context from the logging code and the earlier logging architecture memory I used as a map, and I verified the exact behavior in source before answering.# How rotationPeriodMilliseconds works

For file loggers, rotationPeriodMilliseconds is mapped to rotation_policy.rotation_period in the core logging writer.

The actual rotation check is:

if ((!rotationPolicy->RotationPeriod || LastRotationTimestamp_ + *rotationPolicy->RotationPeriod > now) &&
    ((!rotationPolicy->MaxSegmentSize || !File_ || File_->GetLength() < *rotationPolicy->MaxSegmentSize)))
{
    return;
}

So rotation happens when either of these becomes true:

  • [time trigger] now >= LastRotationTimestamp + rotationPeriod
  • [size trigger] current file size reached maxSegmentSize

And LastRotationTimestamp_ is reset in Open():

LastRotationTimestamp_ = TInstant::Now();

That means the timer is measured from when the current log file was opened, i.e. from process start or from the previous rotation.

Important implication for short jobs

Yes: if most job proxy processes live less than 15 minutes, then rotationPeriodMilliseconds: 900000 will usually not rotate them at all.

So short-lived jobs do not create extra rotated segments because of this setting. They usually produce just their normal log file and exit.

Why you may still see many files

Most likely it is not because jobs are shorter than 15 minutes. It is more likely due to one or more of these:

  • [per-job files] Job proxy logs are stored per job, not as one global rolling file. The job proxy log manager creates a separate directory per job under the log manager directory. So many short jobs naturally mean many log files overall.

  • [timestamped active file names] Your job proxy logger uses:

    useTimestampSuffix: true

    With this option, even the active file name gets a timestamp suffix. So each job/proxy instance gets its own timestamped file name from the start. This is not “rotation”; it is just the naming scheme.

  • [retention is local, not cluster-wide] maxSegmentCountToKeep and maxTotalSizeToKeep are enforced by a single file writer over files with the same prefix in the same directory. They are not a global cap on all job proxy logs across all jobs.

Subtle point about maxSegmentCountToKeep

There is one nuance in this implementation:

  • During rotation, it keeps up to maxSegmentCountToKeep existing files
  • Then it opens a new current file

So right after rotation you can effectively observe up to roughly:

  • [count] maxSegmentCountToKeep + 1 files

for a given log prefix/directory, because the fresh active file appears after cleanup/renaming.

So with:

maxSegmentCountToKeep: 96

seeing 97 files for one prefix can be expected. But seeing much more than that usually means you are looking at:

  • multiple jobs
  • multiple directories
  • multiple prefixes
  • or logs that are not yet removed by job-log retention

What short jobs actually affect

Short jobs affect rotation like this:

  • [shorter than 15 min] usually no time-based rotation
  • [small logs] also no size-based rotation unless maxSegmentSize is hit
  • [result] one file per job is normal

So if you see “too many files”, short job duration points toward:

  • many independent job log files, not
  • too many rotated segments of the same job

What to check next

Check whether the “extra files” are:

  • [same directory, same prefix] many rotated segments of one job
  • [different job directories] one or a few files per many jobs

For job proxy logs, the second case is usually the explanation.

Also check your cleanup setting for the job proxy log manager:

  • [logs storage period] old job directories are deleted later by JobProxyLogManager, not immediately on job finish

So if jobs are short and numerous, and the storage period is long, you can accumulate lots of files even though per-file rotation is working correctly.

Bottom line

  • No, short jobs being under 15 minutes should not cause extra time-based rotations.
  • Yes, you can still see many files because job proxy logging is effectively per job, with useTimestampSuffix: true, and retention limits are per writer/prefix, not a global cap.
  • For one specific log prefix, maxSegmentCountToKeep: 96 can still look like about 97 files including the current active file.

If you want, I can also help you trace the exact on-disk layout for job proxy logs in your config and estimate what file count is expected from your job rate and log retention period.

Status

Explained the source-level behavior of rotationPeriodMilliseconds and why short-lived jobs are unlikely to be the cause of excess rotated files.


Question 2

User Input

So if useTimestampSuffix enabled – the rotation mechanism do not consider previous job proxy log files?

Planner Response

I checked the writer logic earlier; the key distinction is same prefix in the same directory vs other jobs.

Answer

No: with useTimestampSuffix: true, rotation still considers previous log files, but only:

  • [same directory] in the same directory
  • [same prefix] whose name starts with the configured file_name

The code does:

auto files = NFS::EnumerateFiles(DirectoryName_, 1, true);
std::erase_if(files, [&] (const std::string& s) {
    return !s.starts_with(FileNamePrefix_);
});
if (Config_->UseTimestampSuffix) {
    std::reverse(files.begin(), files.end());
}

Then retention is applied to that file list using:

  • maxTotalSizeToKeep
  • maxSegmentCountToKeep

What changes when useTimestampSuffix is enabled

With useTimestampSuffix: true:

  • [no renaming chain] old files are not renamed to .1, .2, ...
  • [timestamped names] each opened file gets a timestamp in its name
  • [cleanup still works] old matching files are still deleted by count/size limits

So useTimestampSuffix disables renaming, not retention accounting.

Important caveat for job proxy logs

If job proxy logs are stored per job in separate directories, then files from previous jobs are not considered together.

So the answer depends on what you mean by “previous job proxy log files”:

  • [same job / same directory / same prefix] yes, they are considered
  • [other jobs in other directories] no, they are not considered

That is usually why job proxy logs can accumulate a lot overall: retention is local to each writer location, not global across all jobs.

Practical summary

With useTimestampSuffix: true:

  • rotation sees previous matching files in its own directory
  • rotation does not merge accounting across all job proxy files on the node

If you want, I can next trace the exact file_name and directory layout for job proxy logs in your config and tell you precisely what set of files one writer instance will count.

Clone this wiki locally