Library level stats from Markdup #1143
Comments
It would take some reworking of the stats, but that might not be a bad thing. As for JSON, how about posting an example of the kind of output you would like to see?
I had a similar need. I'm in the process of choosing the proper mark-duplicates software for each kind of experiment data we have. This means that I will run different mark-duplicates programs, each of them reporting its stats in a different way (if it reports any statistics at all). That led me to write my own duplicate statistics code that, given any BAM, returns a JSON-formatted file containing precisely the information that you ask for. The JSON represents a dictionary of samples; each sample is a dictionary of libraries, each library is a dictionary of read group IDs (which in my case should be equivalent to platform units), and each read group ID contains the actual statistics. With this data, you can aggregate the fields at the level you need. An example of my output:
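The nesting described above (sample, then library, then read group ID, then statistics) could be built along these lines. This is only a sketch and not the commenter's actual code or output: the counter names `reads` and `duplicates`, the read group IDs, and the input data are all hypothetical, and the reads are given as plain tuples rather than parsed from a BAM.

```python
import json
from collections import defaultdict

# Hypothetical read-group metadata, as it might be parsed from @RG header lines.
read_groups = {
    "rg1": {"SM": "sampleA", "LB": "lib1"},
    "rg2": {"SM": "sampleA", "LB": "lib1"},
    "rg3": {"SM": "sampleA", "LB": "lib2"},
}

# Hypothetical per-read records: (read group ID, duplicate flag).
reads = [
    ("rg1", False), ("rg1", True),
    ("rg2", False),
    ("rg3", True), ("rg3", True), ("rg3", False),
]


def nested_stats(read_groups, reads):
    """Count reads and duplicates per sample -> library -> read group ID."""
    stats = defaultdict(lambda: defaultdict(dict))
    for rg_id, is_dup in reads:
        meta = read_groups[rg_id]
        # Create the leaf counters the first time a read group is seen.
        leaf = stats[meta["SM"]][meta["LB"]].setdefault(
            rg_id, {"reads": 0, "duplicates": 0}
        )
        leaf["reads"] += 1
        leaf["duplicates"] += int(is_dup)
    return stats


result = nested_stats(read_groups, reads)
print(json.dumps(result, indent=2, sort_keys=True))
```

Keeping the counters at the read group level means nothing is lost: any coarser view (library, sample, whole file) can be derived later by summing.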
I think that this format could serve most purposes quite well. Optional aggregation of the statistics by library or sample should be easy to implement too.
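As a rough illustration of that aggregation, the sketch below sums per-read-group counters up to library, sample, and whole-file level. It assumes the nested sample/library/read-group layout described in this thread; the counter names, the output keys (`total`, `samples`, `libraries`), and the numbers are hypothetical.

```python
from collections import Counter

# Hypothetical input: sample -> library -> read group ID -> counters.
stats = {
    "sampleA": {
        "lib1": {
            "rg1": {"reads": 100, "duplicates": 10},
            "rg2": {"reads": 50, "duplicates": 5},
        },
        "lib2": {
            "rg3": {"reads": 200, "duplicates": 80},
        },
    },
}


def aggregate(stats):
    """Sum read-group counters up to library, sample and whole-file totals."""
    file_total = Counter()
    samples = {}
    for sample, libraries in stats.items():
        sample_total = Counter()
        lib_totals = {}
        for library, rgs in libraries.items():
            lib_total = Counter()
            for counters in rgs.values():
                lib_total.update(counters)  # Counter.update adds counts
            lib_totals[library] = dict(lib_total)
            sample_total.update(lib_total)
        samples[sample] = {"total": dict(sample_total), "libraries": lib_totals}
        file_total.update(sample_total)
    return {"total": dict(file_total), "samples": samples}


summary = aggregate(stats)
print(summary["samples"]["sampleA"]["libraries"])
```

Only additive counters are summed here. A derived value such as a duplicate rate would be recomputed from the summed counts at each level rather than averaged across read groups.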
Thanks for that @Poshi. When I get back to markdup I will take a look at adding a JSON option and individual library stats.
Is your feature request related to a problem? Please specify.
Not a problem but a limitation of the current stats summary when a given BAM file contains multiple libraries.
Describe the solution you would like.
Assuming duplicates are marked by library (RG-LB) within a given BAM, and optical duplicates are marked by platform unit (RG-PU), having the duplicate stats for at least each library defined in the BAM, rather than only for the sample or even a multi-sample BAM as a whole, would improve quality control processes.
Requested solution:
samtools markdup -f markdup.stats in.bam out.bam
produces a file with one column or row for each library within "in.bam". I acknowledge that this leaves the optical duplicate stats with a similar issue, since the numbers might differ by RG-PU. However, unlike different libraries, which might have very different duplicate levels, I'd expect the same library sequenced on multiple RG-PUs to have very similar optical duplicate numbers, so averaging the level observed by library seems reasonable.
In a perfect world, the stats would be output in JSON with levels for the total BAM, each RG-SM (there might be multiple samples in a BAM), and each RG-LB (there might be multiple libraries).