Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Samtools view --no-PG default should be more nuanced #1175

Open
jmarshall opened this issue Jan 23, 2020 · 11 comments
Open

Samtools view --no-PG default should be more nuanced #1175

jmarshall opened this issue Jan 23, 2020 · 11 comments

Comments

@jmarshall
Copy link
Member

One of the main uses of samtools view is to get an accurate view of the contents of the file (the clue's in the name!). Samtools 1.10 now adds a @PG ID:samtools … CL:samtools view -h … header to the output by default, which means that what you're seeing is not an accurate rendition of the contents of the file.

(If your input file already has any @PG headers, then samtools view adds a plethora of non-existent @PG ID:samtools-N headers which is even worse.)

IMHO the ideal default behaviour of --add-PG/--no-PG for samtools view is up for debate. In particular, when output is to a terminal I think there is a very good case for printing the headers exactly as they are in the file. A case could also be made that whether output is to a terminal or a file, unlike other samtools commands, format conversions or region subsetting are not very interesting so the default in general for samtools view should be --no-PG. (Where the user has specified more interesting filtering and subsetting that they'd like to record, they would have the option to do so with samtools view --add-PG. The question is what the default should be.)

This constitutes a behaviour change since samtools 1.9, and annoyingly you can't easily suppress it in a way that's portable to previous versions — as samtools 1.9 and prior produce an unknown option error for samtools view --no-PG ….

See also arq5x/bedtools2#814 which notes (amongst other problems fixable within bedtools) that the @PG line added causes test suite failures.

@jkbonfield
Copy link
Contributor

The addition of this feature was a user request to track all stages of a pipeline, including file format changes which view is normally used for. Conversely, the adhoc debugging nature of wanting to see what's in a file isn't a pipeline thing and therefore has a lot more flexibility with regards to requiring change. Hence why we went with this option.

I also disagree that format conversions are not interesting. If trying to diagnose a bug it can be useful to know if the data went in to CRAM and back again for example. Again - this was an explicit local user request for improved data provenance.

We debated the vanilla samtools view to terminal case for some time, considering a difference between outputting to terminal vs to a file too as you suggest. In the end we decided that would could even more confusion given differing behaviour. Users can learn. Pipelines less so.

(I know it's caused some test suite failures, but so has the zero length file becoming invalid too.)

@jmarshall
Copy link
Member Author

Yes, this is why I said “should be up for debate”, not “this is wrong”.

I am inclined to agree that tracking format conversions etc in pipelines might well be a good default, and I am not going to make the latter case myself.

However I think it will be worth revisiting the view-to-terminal case in the light of experience of actually using it. When the output goes to a terminal, it is because the user has just typed the command and is viewing the output on screen, and that output will soon scroll off and be discarded. (Unless they are using script instead of redirections for their pipelining, but that's stupid enough to be ignored!)

Thus when the view output goes to a terminal, it is ephemeral so there is no permanent log aspect in favour of adding @PG lines. And the user knows very well what command line they used, because it's visible right there on the screen (or by looking at whatever script they just invoked that used samtools).

Perhaps to reduce confusion this should be the case for all the subcommands: default to --no-PG if output is to a terminal (thus interactive and ephemeral), default to --add-PG if output is a non-terminal (thus potentially pipelined and permanent). And have both options so that the user can be explicit.

Perhaps the maintainers could share their reasoning about the pros and cons in the terminal case.

@jmarshall
Copy link
Member Author

jmarshall commented Jan 23, 2020

(I know it's caused some test suite failures, but so has the zero length file becoming invalid too.)

Oh please. xx#blank.sam was a test that exercised the 0-length behaviour, and when the behaviour changed the test was also changed. That is a different kettle of fish from unrelated tests failing due to a change in unrelated behaviour! (Or do you have some other zero-length-impacted test suite failures in mind? In any case, that's irrelevant to this issue.)

@jkbonfield
Copy link
Contributor

jkbonfield commented Jan 23, 2020

I wasn't talking about our tests (actually mine - that one came from io_lib), but other peoples. It certainly broke some of the internal pipeline tests for example. It's not irrelevant. My point was this isn't an isolated case of changes in a tool meaning changes required to tests using the tool.

As for the terminal vs file and PG issue, I agree if we wished to make a distinction it should be universal and not just in samtools view.

If you want pros and cons regarding terminal vs file/pipe behaviour, then I guess everyone will have their own view, but personally:

Pro for -add-PG being default from terminal

  • Consistency of output across all mediums.

  • No sudden gotchas due to behaviour dependent on environment; eg grepping and suddenly getting a line turn up which you weren't expecting.

  • Documentation / command line help is simple and easy to describe the behaviour.

  • Consistency with other tool behaviour. For example Scramble already did this (and is heavily used by some teams in Sanger).

Neutral

  • Having the PG line listed is pretty irrelevant either way. "Storm in a teacup"

  • It doesn't change pipelines, test frameworks, etc that aren't using a terminal.

  • It's not even visible unless you do "view -h".

  • Edit: one more. I very rarely do "samtools view -h file.bam". It's nearly always "samtools view -h file.bam | less".

Con (aka Pro for --no-PG being default)

  • It's (usually) one extra line of unnecessary verbiage.

@jmarshall
Copy link
Member Author

It's not irrelevant.

It's irrelevant for this issue#1175.


Con (i.e., in favour of --no-PG default on terminal)

  • The extra @PG line is misleading as it does not really exist in the file you're viewing.

  • When your file already contains several @PG chains, it's lots of extra lines of spurious verbiage.


Incidentally I have an upcoming PR proposing adding a new samtools head command that would work around part of this issue by providing an alternative way to view an accurate dump of a file's headers.

@jkbonfield
Copy link
Contributor

We'll just have to agree to disagree on what's irrelevant.

Also:

When your file already contains several @pg chains, it's lots of extra lines of spurious verbiage

Wrong. If you have several PG chains then you get several extra lines. If you have LOTS of PG chains then you get LOTS of extra records. It's precisely 1 per chain, always, which is exactly as the specification implies it should be.

Generally I don't see dozens and dozens of PG chains unless something has gone disastrously wrong already.

The extra @pg line is misleading as it does not really exist in the file you're viewing

How do you define "viewing"? Is samtools view -h foo.bam | less still viewing? Should we distingish | less from | grep ...? Is samtools view -h foo.bam > foo.sam viewing it too? It's such a slippery slope I think we shouldn't take a single step down that path!

Are you still arguing for different output when going to a naked unpiped terminal vs to anything else, or arguing for PG being off by default instead of on for all usage?

@jkbonfield
Copy link
Contributor

Note you can get the unadulterated data in pure "view" form using htsfile -c. This is generally more useful for debugging purposes than the kitchen sink view sub-command as it has various other features too such as being able to set verbosity.

@robmaz
Copy link

robmaz commented Feb 7, 2020

I want to add to this discussion that some tools (collate comes to mind, sort and merge possibly) do not seem to add a proper PP: tag, so that as you move downstream in a pipeline, the number of @pg lines increases geometrically because the tools then add a separate line for every perceived "branch" in the processing history. And then in the end view adds another line for each of the existing ones. This in fact muddies the view of the processing history substantially, and defeats the point.

Also, the need for --no-PG broke the interface and means that pipeline scripts now either don't work with older samtools versions, or come out with different (and IMO, now messy) headers. It would be more elegant if this behavior could (maybe additionally) steered by setting environment variables, like SAMTOOLS_NO_PG or SAMTOOLS_ADD_PG. These could just be silently ignored by samtools versions that don't know them.

@jkbonfield
Copy link
Contributor

Could you please provide some short examples of PG being added incorrectly?

I just tested samtools collate and it worked fine for me.

@robmaz
Copy link

robmaz commented Feb 7, 2020

I cannot reproduce it either & may have been wrong about not writing the PP. I had some unexpected header explosion yesterday, but maybe this came from a merge where I forgot the -p, and then some more steps behind it.

@jkbonfield
Copy link
Contributor

Sadly it's not uncommon to find files with a myriad of unconnected PG lines which then causes an explosion. The format itself also frankly sucks as there is no way to merge PG chains back into a single chain (after samtools merge for example). I had a spec change to do this, but it didn't seem to get much attention so my interest in pushing it fizzled out.

So we do occasionally see many PG lines being added, but I've yet to see this happen in a situation where it shouldn't have done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants