Better naming of cluster log files #14

Closed
mbhall88 opened this issue Apr 8, 2020 · 3 comments
Labels
enhancement New feature or request

Comments

@mbhall88
Member

mbhall88 commented Apr 8, 2020

I don’t like the format the names of the cluster log files have been changed to. It makes it impossible to figure out which log file relates to which job without digging into the snakemake stderr log (which is one of the main things I hate about nextflow).

Current implementation

self.logdir / "{jobid}_{random_string}.err".format(jobid=self.jobid, random_string=self.random_string)

Proposal

self.logdir / self.rule_name / self.wildcards_str / "jobid{jobid}_{random_string}.err".format(jobid=self.jobid, random_string=self.random_string)

Contrasting both implementations

# current
'logdir/2_random.out'
# proposed
'logdir/search_fasta_on_index/i=0/jobid2_random.out'
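
A minimal sketch of how the proposed path could be assembled as a standalone function. The helper names `wildcards_to_str` and `proposed_log_path`, the `name=value` formatting, and the `,` separator between wildcards are assumptions for illustration, not the profile's actual code.

```python
from pathlib import PurePosixPath


def wildcards_to_str(wildcards):
    """Format wildcards as 'name=value' pairs, e.g. {'i': 0} -> 'i=0'.

    The ',' separator is an assumption made for this sketch.
    """
    return ",".join(f"{name}={value}" for name, value in wildcards.items())


def proposed_log_path(logdir, rule_name, wildcards, jobid, random_string, ext="err"):
    """Build logdir/<rule>/<wildcards>/jobid<jobid>_<random>.<ext>."""
    filename = f"jobid{jobid}_{random_string}.{ext}"
    return PurePosixPath(logdir) / rule_name / wildcards_to_str(wildcards) / filename


path = proposed_log_path(
    "logdir", "search_fasta_on_index", {"i": 0}, 2, "random", ext="out"
)
print(path)  # logdir/search_fasta_on_index/i=0/jobid2_random.out
```

Note that a wildcard value containing `/` would introduce extra directory levels in this sketch.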

There are two major advantages I see to the new naming scheme.

  1. It is easier to find the log file for a specific job without having to search for its jobid in the snakemake log.
  2. For large pipelines that produce tens or hundreds of thousands of jobs, this will avoid having potentially 200,000 log files in one directory, which I guess might send the cluster into meltdown 😅
@mbhall88 mbhall88 added the enhancement New feature or request label Apr 8, 2020
@leoisl
Collaborator

leoisl commented Apr 8, 2020

I totally agree with this idea, it is indeed a very nice improvement!

I am not sure if we should worry about this (from https://unix.stackexchange.com/questions/32795/what-is-the-maximum-allowed-filename-and-folder-size-with-ecryptfs):

Linux has a maximum filename length of 255 characters for most filesystems (including EXT4), and a maximum path of 4096 characters.

Filename length is not an issue, but when we add all the wildcards of a rule to the log filepath, its length increases a lot. Sometimes I have rules where the wildcards are full paths to other files, but I don't think I have ever hit the limit. Anyway, we could ensure the path has at most 4096 characters and cut some of the wildcards if it exceeds that, but I am just wondering whether we should address this or not (it seems like a rare corner case).
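
For illustration only, a rough sketch of what such a check and cut could look like. The constants are the limits quoted above; the function names are hypothetical. One assumption worth stating: on Linux these limits are counted in bytes and the 255 limit applies to every path component, so the wildcards directory name is subject to it as well as the final filename.

```python
MAX_PATH_BYTES = 4096  # path limit quoted above
MAX_NAME_BYTES = 255   # per-component limit quoted above


def fits_limits(path, max_path=MAX_PATH_BYTES, max_name=MAX_NAME_BYTES):
    """Return True if the whole path and each of its components fit the limits."""
    if len(path.encode("utf-8")) > max_path:
        return False
    return all(len(part.encode("utf-8")) <= max_name for part in path.split("/"))


def truncate_component(name, max_bytes=MAX_NAME_BYTES):
    """Cut a single path component down to max_bytes (hypothetical helper).

    errors="ignore" drops a multi-byte character that was split by the cut.
    """
    encoded = name.encode("utf-8")
    if len(encoded) <= max_bytes:
        return name
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


long_wildcards = "sample=" + "x" * 300
print(fits_limits(f"logdir/rule/{long_wildcards}/jobid2_random.err"))  # False
print(len(truncate_component(long_wildcards)))  # 255
```

Cutting wildcards this way can make two different jobs map to the same directory, which is harmless here only because the jobid and random string keep the filenames distinct.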

That is the only possible issue I see we need to tackle, the rest is all fine. I really like this proposal

@mbhall88
Member Author

mbhall88 commented Apr 8, 2020

I think 4096 characters is more than sufficient. If someone hits that limit because of wildcards, I would suggest there might be some changes they could make to their pipeline.

I think enforcing it is hard as I guess it could vary from file system to file system?
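
As a sketch of how the limits could be looked up per filesystem rather than hard-coded: on Unix, Python's `os.pathconf` reports them for a given directory. The wrapper name `filesystem_limits` is hypothetical, and this assumes a platform where `os.pathconf` and the `PC_NAME_MAX` / `PC_PATH_MAX` names are available.

```python
import os


def filesystem_limits(directory="."):
    """Query the name and path length limits for the filesystem holding `directory`.

    Unix only; values are in bytes and may differ between filesystems.
    """
    return {
        "name_max": os.pathconf(directory, "PC_NAME_MAX"),
        "path_max": os.pathconf(directory, "PC_PATH_MAX"),
    }


limits = filesystem_limits(".")
print(limits)
```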

mbhall88 pushed a commit to mbhall88/lsf that referenced this issue Apr 8, 2020
@leoisl
Collaborator

leoisl commented Apr 8, 2020

Yeah, I would just consider common Linux filesystems, probably just EXT4. I think we should drop this. It is hard to do properly (so that it works on any filesystem), and it would only tackle an extremely rare corner case.
