Samtools view --no-PG default should be more nuanced #1175
Comments
The addition of this feature was a user request to track all stages of a pipeline, including file format changes which view is normally used for. Conversely, the ad hoc debugging nature of wanting to see what's in a file isn't a pipeline thing and therefore has a lot more flexibility with regards to requiring change. Hence why we went with this option. I also disagree that format conversions are not interesting. If trying to diagnose a bug it can be useful to know if the data went into CRAM and back again for example. Again - this was an explicit local user request for improved data provenance. We debated the vanilla samtools view to terminal case for some time, considering a difference between outputting to terminal vs to a file too as you suggest. In the end we decided that would cause even more confusion given differing behaviour. Users can learn. Pipelines less so. (I know it's caused some test suite failures, but so has the zero length file becoming invalid too.) |
Yes, this is why I said “should be up for debate”, not “this is wrong”. I am inclined to agree that tracking format conversions etc in pipelines might well be a good default, and I am not going to make the latter case myself. However I think it will be worth revisiting the view-to-terminal case in the light of experience of actually using it. When the output goes to a terminal, it is because the user has just typed the command and is viewing the output on screen, and that output will soon scroll off and be discarded. (Unless they are using Thus when the view output goes to a terminal, it is ephemeral so there is no permanent log aspect in favour of adding Perhaps to reduce confusion this should be the case for all the subcommands: default to Perhaps the maintainers could share their reasoning about the pros and cons in the terminal case. |
Oh please. xx#blank.sam was a test that exercised the 0-length behaviour, and when the behaviour changed the test was also changed. That is a different kettle of fish from unrelated tests failing due to a change in unrelated behaviour! (Or do you have some other zero-length-impacted test suite failures in mind? In any case, that's irrelevant to this issue.) |
I wasn't talking about our tests (actually mine - that one came from io_lib), but other people's. It certainly broke some of the internal pipeline tests for example. It's not irrelevant. My point was this isn't an isolated case of changes in a tool meaning changes required to tests using the tool. As for the terminal vs file and PG issue, I agree if we wished to make a distinction it should be universal and not just in If you want pros and cons regarding terminal vs file/pipe behaviour, then I guess everyone will have their own view, but personally: Pro for --add-PG being default from terminal
Neutral
Con (aka Pro for --no-PG being default)
|
It's irrelevant for this issue#1175. Con (i.e., in favour of --no-PG default on terminal)
Incidentally I have an upcoming PR proposing adding a new |
We'll just have to agree to disagree on what's irrelevant. Also:
Wrong. If you have several PG chains then you get several extra lines. If you have LOTS of PG chains then you get LOTS of extra records. It's precisely 1 per chain, always, which is exactly as the specification implies it should be. Generally I don't see dozens and dozens of PG chains unless something has gone disastrously wrong already.
How do you define "viewing"? Is Are you still arguing for different output when going to a naked unpiped terminal vs to anything else, or arguing for PG being off by default instead of on for all usage? |
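To make the "one extra record per chain" point concrete, here is a minimal Python sketch that counts the PG chain ends in a SAM header, i.e. the @PG records that no other @PG record names in its PP field. The function name `pg_chain_ends` and the sample header are illustrative assumptions, not part of samtools; by the description in this thread, each chain end is where one new @PG record would be attached.

```python
def pg_chain_ends(header_text):
    """Return the IDs of @PG records that are not referenced by any PP field.

    Hypothetical helper for illustration; it only inspects header text.
    """
    ids = []
    referenced = set()
    for line in header_text.splitlines():
        if not line.startswith("@PG\t"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs; values may contain ':'.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "ID" in fields:
            ids.append(fields["ID"])
        if "PP" in fields:
            referenced.add(fields["PP"])
    return [i for i in ids if i not in referenced]


# Made-up header: one bwa -> sort chain plus one unconnected record.
header = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@PG\tID:bwa\tPN:bwa",
    "@PG\tID:sort\tPN:samtools\tPP:bwa",
    "@PG\tID:other\tPN:othertool",
])
print(pg_chain_ends(header))  # two chain ends: sort and other
```

A header with many unconnected @PG records therefore has many chain ends, which is the situation in which the number of added records grows.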
Note you can get the unadulterated data in pure "view" form using |
I want to add to this discussion that some tools (collate comes to mind, sort and merge possibly) do not seem to add a proper PP: tag, so that as you move downstream in a pipeline, the number of @PG lines increases geometrically because the tools then add a separate line for every perceived "branch" in the processing history. And then in the end view adds another line for each of the existing ones. This in fact muddies the view of the processing history substantially, and defeats the point. Also, the need for --no-PG broke the interface and means that pipeline scripts now either don't work with older samtools versions, or come out with different (and IMO, now messy) headers. It would be more elegant if this behavior could (maybe additionally) be steered by setting environment variables, like SAMTOOLS_NO_PG or SAMTOOLS_ADD_PG. These could just be silently ignored by samtools versions that don't know them. |
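Until something like that exists, one possible workaround for the portability problem is to gate the option on the samtools version. The following is a rough Python sketch under stated assumptions: the helper names `supports_no_pg` and `view_args` are hypothetical, the version line is assumed to look like the first line of `samtools --version` output (for example "samtools 1.10"), and the 1.10 threshold is taken from this issue's description of when --no-PG became necessary.

```python
import re


def supports_no_pg(version_line):
    """Guess whether this samtools accepts --no-PG (assumed: 1.10 and later)."""
    m = re.search(r"(\d+)\.(\d+)", version_line)
    if m is None:
        return False  # unknown version string: do not risk an unknown option error
    # Compare as integer tuples so that 1.9 sorts before 1.10.
    return (int(m.group(1)), int(m.group(2))) >= (1, 10)


def view_args(version_line, extra):
    """Build a samtools view command line, adding --no-PG only where supported."""
    args = ["samtools", "view"]
    if supports_no_pg(version_line):
        args.append("--no-PG")
    return args + list(extra)


print(view_args("samtools 1.9", ["-h", "in.bam"]))
print(view_args("samtools 1.10", ["-h", "in.bam"]))
```

The integer-tuple comparison matters here: comparing the version strings directly would order "1.10" before "1.9".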
Could you please provide some short examples of PG being added incorrectly? I just tested |
I cannot reproduce it either & may have been wrong about not writing the PP. I had some unexpected header explosion yesterday, but maybe this came from a merge where I forgot the -p, and then some more steps behind it. |
Sadly it's not uncommon to find files with a myriad of unconnected PG lines which then causes an explosion. The format itself also frankly sucks as there is no way to merge PG chains back into a single chain (after So we do occasionally see many PG lines being added, but I've yet to see this happen in a situation where it shouldn't have done. |
One of the main uses of samtools view is to get an accurate view of the contents of the file (the clue's in the name!). Samtools 1.10 now adds a @PG ID:samtools … CL:samtools view -h … header to the output by default, which means that what you're seeing is not an accurate rendition of the contents of the file. (If your input file already has any @PG headers, then samtools view adds a plethora of non-existent @PG ID:samtools-N headers which is even worse.)

IMHO the ideal default behaviour of --add-PG / --no-PG for samtools view is up for debate. In particular, when output is to a terminal I think there is a very good case for printing the headers exactly as they are in the file. A case could also be made that whether output is to a terminal or a file, unlike other samtools commands, format conversions or region subsetting are not very interesting so the default in general for samtools view should be --no-PG. (Where the user has specified more interesting filtering and subsetting that they'd like to record, they would have the option to do so with samtools view --add-PG. The question is what the default should be.)

This constitutes a behaviour change since samtools 1.9, and annoyingly you can't easily suppress it in a way that's portable to previous versions, as samtools 1.9 and prior produce an unknown option error for samtools view --no-PG ….

See also arq5x/bedtools2#814 which notes (amongst other problems fixable within bedtools) that the @PG line added causes test suite failures.
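For test suites that compare SAM text across samtools versions, one version-independent option is to ignore @PG header lines when comparing. The sketch below is only an illustration: `without_pg` is a hypothetical helper, and it drops every @PG line, including any that were genuinely in the input file, so it suits comparisons where program records are not what is being tested.

```python
def without_pg(sam_text):
    """Return SAM text with all @PG header lines removed (illustrative helper)."""
    kept = [line for line in sam_text.splitlines() if not line.startswith("@PG\t")]
    return "\n".join(kept)


# Made-up outputs: the same record, with and without an added @PG line.
old_output = "@HD\tVN:1.6\nread1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"
new_output = (
    "@HD\tVN:1.6\n"
    "@PG\tID:samtools\tPN:samtools\tCL:samtools view -h in.sam\n"
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"
)
print(without_pg(old_output) == without_pg(new_output))
```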