Library level stats from Markdup #1143

Closed
PedalheadPHX opened this issue Nov 26, 2019 · 3 comments

@PedalheadPHX

Is your feature request related to a problem? Please specify.

Not a problem but a limitation of the current stats summary when a given BAM file contains multiple libraries.

Describe the solution you would like.

Assuming duplicates are marked per library (RG-LB) within a given BAM and optical duplicates are marked per platform unit (RG-PU), reporting at least the duplicate stats for each library defined in the BAM, rather than only for the sample or the whole multi-sample BAM, would improve quality control processes.

Requested solution: samtools markdup -f markdup.stats in.bam out.bam produces a file with one column or row for each library within "in.bam". I acknowledge that the optical duplicate stats raise a similar issue, since the numbers might differ by RG-PU. However, unlike different libraries, which can have very different duplicate levels, I would expect the same library sequenced on multiple RG-PUs to show very similar optical duplicate numbers, so averaging the observed level by library seems reasonable.

In a perfect world the stats would be output in JSON with levels for the total BAM, each RG-SM (might be multiple samples in a BAM), and each RG-LB (might be multiple libraries).

@whitwham
Contributor

It would take some reworking of the stats but that might not be a bad thing.

As for JSON, how about posting an example of the kind of output you would like to see?

@Poshi

Poshi commented Oct 6, 2021

I had a similar need. I'm in the process of choosing the proper mark-duplicates software for each kind of experiment data we have. This means that I will run different mark-duplicates programs, each of them reporting its stats in a different way (if it reports any at all). That led me to write my own duplicate statistics code that, given any BAM, returns a JSON-formatted file containing precisely the information you ask for. The JSON represents a dictionary of samples; each sample is a dictionary of libraries, each library is a dictionary of read group IDs (which in my case should be equivalent to platform units), and each read group ID contains the actual statistics. With this data, you can aggregate the fields at the level you need.
Having a separate statistics program allows us to freely switch the duplicate-marking software without having to worry about which stats we will get. On the downside, you have to read the BAM again.

An example of my output:

{
  "SampleName": {
    "LibraryName": {
      "RGID_1": {
        "UNPAIRED_READS_EXAMINED": 53,
        "READ_PAIRS_EXAMINED": 22456,
        "SECONDARY_OR_SUPPLEMENTARY_RDS": 225,
        "UNMAPPED_READS": 53,
        "UNPAIRED_READ_DUPLICATES": 53,
        "READ_PAIR_DUPLICATES": 2006,
        "READ_PAIR_OPTICAL_DUPLICATES": 810,
        "PERCENT_DUPLICATION": 0.09040364728121873,
        "ESTIMATED_LIBRARY_SIZE": 188598
      },
      "RGID_2": {
        "UNPAIRED_READS_EXAMINED": 20,
        "READ_PAIRS_EXAMINED": 22503,
        "SECONDARY_OR_SUPPLEMENTARY_RDS": 263,
        "UNMAPPED_READS": 20,
        "UNPAIRED_READ_DUPLICATES": 20,
        "READ_PAIR_DUPLICATES": 1687,
        "READ_PAIR_OPTICAL_DUPLICATES": 730,
        "PERCENT_DUPLICATION": 0.075378670101719,
        "ESTIMATED_LIBRARY_SIZE": 240369
      },
      "RGID_N": {
        "UNPAIRED_READS_EXAMINED": 29,
        "READ_PAIRS_EXAMINED": 14875,
        "SECONDARY_OR_SUPPLEMENTARY_RDS": 166,
        "UNMAPPED_READS": 29,
        "UNPAIRED_READ_DUPLICATES": 29,
        "READ_PAIR_DUPLICATES": 1377,
        "READ_PAIR_OPTICAL_DUPLICATES": 705,
        "PERCENT_DUPLICATION": 0.09345511937942845,
        "ESTIMATED_LIBRARY_SIZE": 144634
      }
    }
  }
}
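
As a rough illustration of where the sample / library / read group ID nesting can come from, here is a minimal Python sketch that builds that hierarchy from the @RG lines of a SAM header. It is not the code that produced the output above; the function name and the fallback names "unknown_sample" and "unknown_library" are assumptions made for this example.

```python
from collections import defaultdict


def read_group_hierarchy(header_lines):
    """Return {sample: {library: [read group IDs]}} from SAM header lines.

    Hypothetical helper: it only inspects @RG lines and their ID, SM and LB tags.
    """
    tree = defaultdict(lambda: defaultdict(list))
    for line in header_lines:
        if not line.startswith("@RG"):
            continue
        # @RG fields are tab-separated TAG:VALUE pairs; split on the first
        # colon only, because values may themselves contain colons.
        fields = dict(
            field.split(":", 1)
            for field in line.rstrip("\n").split("\t")[1:]
            if ":" in field
        )
        # SM and LB are optional in the SAM specification, so fall back to
        # placeholder names (an assumption of this sketch).
        sample = fields.get("SM", "unknown_sample")
        library = fields.get("LB", "unknown_library")
        tree[sample][library].append(fields["ID"])
    # Convert the nested defaultdicts to plain dicts for easy serialisation.
    return {sample: dict(libraries) for sample, libraries in tree.items()}


# Example header, e.g. as printed by `samtools view -H in.bam`.
example_header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:RGID_1\tSM:SampleName\tLB:LibraryName\tPU:unit1",
    "@RG\tID:RGID_2\tSM:SampleName\tLB:LibraryName\tPU:unit2",
]
hierarchy = read_group_hierarchy(example_header)
```

Per-read statistics would then be accumulated under each read group ID by looking up the RG tag of every alignment record.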

I think this JSON format could serve most purposes quite well. Optional aggregation of the statistics by library or sample should be easy to implement too.
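
A minimal Python sketch of such an aggregation, assuming the nested layout shown above. The function names are hypothetical. It sums the count fields and recomputes PERCENT_DUPLICATION as (unpaired duplicates + 2 × pair duplicates) / (unpaired examined + 2 × pairs examined), which reproduces the per-read-group values in the example. ESTIMATED_LIBRARY_SIZE is deliberately left out of the aggregates because it is an estimate, not a count, and cannot simply be summed.

```python
# Fields that are plain counts and can be summed across read groups.
COUNT_FIELDS = (
    "UNPAIRED_READS_EXAMINED",
    "READ_PAIRS_EXAMINED",
    "SECONDARY_OR_SUPPLEMENTARY_RDS",
    "UNMAPPED_READS",
    "UNPAIRED_READ_DUPLICATES",
    "READ_PAIR_DUPLICATES",
    "READ_PAIR_OPTICAL_DUPLICATES",
)


def merge_counts(stat_dicts):
    """Sum the count fields of several stats dicts and recompute the rate."""
    total = {field: 0 for field in COUNT_FIELDS}
    for stats in stat_dicts:
        for field in COUNT_FIELDS:
            total[field] += stats[field]
    # Each pair contributes two reads to both numerator and denominator.
    examined = total["UNPAIRED_READS_EXAMINED"] + 2 * total["READ_PAIRS_EXAMINED"]
    duplicates = total["UNPAIRED_READ_DUPLICATES"] + 2 * total["READ_PAIR_DUPLICATES"]
    total["PERCENT_DUPLICATION"] = duplicates / examined if examined else 0.0
    return total


def aggregate(stats):
    """Return (by_library, by_sample) from {sample: {library: {rgid: stats}}}."""
    by_library = {}
    by_sample = {}
    for sample, libraries in stats.items():
        by_library[sample] = {
            library: merge_counts(read_groups.values())
            for library, read_groups in libraries.items()
        }
        # Sample-level totals are the sum over that sample's libraries.
        by_sample[sample] = merge_counts(by_library[sample].values())
    return by_library, by_sample
```

Loading the JSON file with json.load and passing the resulting dictionary to aggregate would give library-level and sample-level totals without reading the BAM a second time.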

@whitwham
Contributor

whitwham commented Oct 6, 2021

Thanks for that, @Poshi. When I get back to markdup I will take a look at adding a JSON option and individual library stats.
