our WDL files are too long #115

Open
a-frantz opened this issue Oct 11, 2023 · 0 comments


a-frantz commented Oct 11, 2023

I'll single the 3 worst offenders out: util.wdl, samtools.wdl, and picard.wdl have so many tasks in them it becomes difficult to find the one you're looking for. At least that's my experience. They are each >700 lines long, which I think in just about every other language would be considered a behemoth for maintenance. Most languages have recommended file lengths (guessing an average consensus would be around 500 lines?), I propose we adopt something similar for WDL/this repo.
I don't want to base it on line count, though. I feel that can encourage some sloppy coding when the file in question is around the limit, e.g. collapsing lines that should be separated in order to keep below the length limit.

The below indented section would be me thinking out loud realizing all my ideas have some fatal flaw. I'm stumped as to how to solve the problem. Feel free to skip the indented section, or read it to see the thoughts I've had.

A saner approach to me is a task limit. The exact number might require some trial and error. Rough gut feeling: 5 seems too strict, 10 seems too lenient. I'd say we start looking in the 6-9 range for our task number limit.

"But we currently organize our files by tool. Are we throwing that scheme out?" No! (At least that's not my initial proposal. I'd hear someone out if they have an alternative.)
I say we start with organizing our files by tool, and then once they grow past 6-9 tasks, we make a split. What that split is will depend on some context.
For ex. picard.wdl: could be split into picard-qc.wdl and picard-manipulation.wdl.
picard-qc has all the Picard tasks which generate a report of some kind, and don't change the BAM file.
picard-manipulation has all the Picard tasks which deal with modifying BAM files.
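A task limit like this could be checked mechanically. The following is a minimal sketch, not an existing tool in this repo: the names `task_names` and `files_over_limit` are hypothetical, the default limit of 9 is just the upper end of the range floated above, and the regex assumes every task declaration starts at the beginning of a line (a real WDL parser would be more robust).

```python
import re
from typing import Dict, List

# Assumption: task declarations look like "task name {" and start in column 0.
# A "task" line inside a command block at column 0 would be a false positive.
TASK_RE = re.compile(r"^task\s+([A-Za-z][A-Za-z0-9_]*)\s*\{", re.MULTILINE)


def task_names(wdl_text: str) -> List[str]:
    """Return the task names in the order they appear in the file."""
    return TASK_RE.findall(wdl_text)


def files_over_limit(files: Dict[str, str], limit: int = 9) -> Dict[str, int]:
    """Map each file name to its task count, keeping only files above the limit.

    `files` maps a file name to that file's WDL source text.
    """
    counts = {name: len(task_names(text)) for name, text in files.items()}
    return {name: n for name, n in counts.items() if n > limit}
```

In practice the `files` dict could be built by reading every `*.wdl` file under the tools directory, and the result would be the list of candidates for a split.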

samtools.wdl could be split into... Alright I don't see a great way to split this file.
Let's try util.wdl: could be split into util-python-scripts.wdl, and gosh this is proving more difficult than I expected.

Pivot: what about sorting our tasks? That would also accomplish the goal of making it easier to find a specific task in a long file.

Let's start with a file whose order I like: kraken2.wdl. It's ordered so well I know it off the top of my head: download_taxonomy, download_library, create_library_from_fastas, build_db, kraken. It flows in the order that the tasks would be used. It's logical. This seems to be a special case I don't see a way to generalize. Unfortunate...

ngsderive.wdl is roughly in the order that the task/subcommand was created. Chronological ordering makes sense, although I'd say that knowledge is pretty esoteric. I doubt anyone besides me and Clay could rattle off the order that commands were added to ngsderive. So that works for me but is probably not an ordering we should stand by. Now that I think about it, I think we always (or nearly always) add new tasks to the bottom of the file. So really most of our files are ordered in this chronological way. Kinda works for helping us regular maintainers find what we're looking for, but not helpful for anyone else.

Alphabetize? Are there any other sorting choices? Would alphabetized tasks be an improvement? It wouldn't be of very much help to me. My brain is not great at alphabetizing. Not "can't do it" bad, but it also wouldn't be trivial for me to locate what I'm looking for in that sort. I imagine finding a task would be about as hard for me as it is now. Maybe a small improvement?

At first I thought this would be a silly suggestion but I don't hate it: shorter tasks at the top of the file, longer tasks at the bottom of the file.
Con: really really annoying to get that sorted and maintain that sort (assuming we don't automate it).
Pro: it's the closest to "locate by vibe" that exists 😆
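If the maintenance cost is the main con, a length ordering could at least be checked automatically. This is a rough sketch under stated assumptions: `task_lengths` and `is_sorted_by_length` are hypothetical names, task declarations are assumed to start at the beginning of a line, and a task's "length" is approximated as the number of lines from its declaration up to the next task declaration (or the end of the file), so trailing blank lines and comments are counted.

```python
import re
from typing import List, Tuple

# Assumption: task declarations look like "task name {" and start in column 0.
TASK_RE = re.compile(r"^task\s+([A-Za-z][A-Za-z0-9_]*)\s*\{")


def task_lengths(wdl_text: str) -> List[Tuple[str, int]]:
    """Return (task name, approximate line count) pairs in file order."""
    lines = wdl_text.splitlines()
    starts = []
    for i, line in enumerate(lines):
        match = TASK_RE.match(line)
        if match:
            starts.append((i, match.group(1)))
    lengths = []
    for idx, (start, name) in enumerate(starts):
        # A task is taken to run until the next declaration or end of file.
        end = starts[idx + 1][0] if idx + 1 < len(starts) else len(lines)
        lengths.append((name, end - start))
    return lengths


def is_sorted_by_length(wdl_text: str) -> bool:
    """True if tasks appear shortest first (ties allowed)."""
    sizes = [size for _, size in task_lengths(wdl_text)]
    return sizes == sorted(sizes)
```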

So I'm stumped. I still think our files are too long and they should be broken up. Or sorted, although I'd prefer a scheme for splitting files into smaller chunks. But I don't like any specific implementation I can come up with.

The best thing I thought of is alphabetizing tasks. I don't love it bc my brain is wired in a way that makes finding things in an alphabetic sort harder than it should be. But it's probably an overall improvement (especially while we lack any viable alternatives).
So, do we want to start alphabetizing our WDL files with many tasks? All our task files? Is there a threshold under which it's not worth the effort? Would that look strange, some files sorted, some not?
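Whatever threshold gets picked, an alphabetical ordering is cheap to verify. Below is a minimal sketch of such a check, assuming task declarations start at the beginning of a line and that the comparison should be case-insensitive. `first_out_of_order` is a hypothetical helper, not part of any existing linter.

```python
import re
from typing import Optional

# Assumption: task declarations look like "task name {" and start in column 0.
TASK_RE = re.compile(r"^task\s+([A-Za-z][A-Za-z0-9_]*)\s*\{", re.MULTILINE)


def first_out_of_order(wdl_text: str) -> Optional[str]:
    """Return the first task that sorts before its predecessor, or None.

    None means the tasks in the file are already alphabetized.
    """
    names = TASK_RE.findall(wdl_text)
    for previous, current in zip(names, names[1:]):
        if current.lower() < previous.lower():
            return current
    return None
```

Run against the kraken2.wdl ordering described above, this would flag `download_library`, since it sorts before `download_taxonomy`. That illustrates the trade-off: alphabetizing would discard that file's order-of-use layout.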

Opening the floor for proposals!
