Library level stats from Markdup #1143

PedalheadPHX · 2019-11-26T01:10:32Z

Is your feature request related to a problem? Please specify.

Not a problem but a limitation of the current stats summary when a given BAM file contains multiple libraries.

Describe the solution you would like.

Assuming duplicates are marked by library (RG-LB) within a given BAM and optical duplicates are marked by platform unit (RG-PU), reporting at least the duplicate stats for each library defined in the BAM, rather than only for the whole sample or multi-sample BAM, would improve quality control processes.

Requested solution: samtools markdup -f markdup.stats in.bam out.bam produces a file with one column or row for each library within "in.bam". I acknowledge this leaves the optical duplicate stats with a similar issue, since the numbers might differ by RG-PU. However, unlike different libraries, which might have very different duplicate levels, I'd expect the same library sequenced on multiple RG-PUs to have very similar optical duplicate numbers, so averaging the observed level by library seems reasonable.

In a perfect world the stats would be output in JSON with levels for the total BAM, each RG-SM (might be multiple samples in a BAM), and each RG-LB (might be multiple libraries).
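As a rough illustration of the per-library counts requested above, here is a minimal Python sketch that tallies reads and flagged duplicates per RG-LB from SAM text. It is not how samtools markdup computes its stats; the function name `library_duplicate_counts` and the "unknown" fallback label are assumptions made for this example, and it only counts the duplicate flag (0x400) on primary alignments.

```python
from collections import defaultdict

DUP_FLAG = 0x400             # SAM flag bit: read is marked as a duplicate
SKIP_FLAGS = 0x100 | 0x800   # secondary or supplementary alignments


def library_duplicate_counts(sam_lines):
    """Count reads and flagged duplicates per library (RG-LB).

    sam_lines: iterable of SAM text lines, header included.
    Hypothetical helper for illustration only.
    """
    rg_to_lib = {}
    counts = defaultdict(lambda: {"reads": 0, "duplicates": 0})
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("@"):
            # Header: remember which library each read group belongs to.
            if fields[0] == "@RG":
                tags = dict(f.split(":", 1) for f in fields[1:])
                rg_to_lib[tags["ID"]] = tags.get("LB", "unknown")
            continue
        if len(fields) < 11:
            continue  # skip blank or malformed lines
        flag = int(fields[1])
        if flag & SKIP_FLAGS:
            continue
        # The read group is stored in the optional RG:Z: tag.
        rg = next((f[5:] for f in fields[11:] if f.startswith("RG:Z:")), None)
        lib = rg_to_lib.get(rg, "unknown")
        counts[lib]["reads"] += 1
        if flag & DUP_FLAG:
            counts[lib]["duplicates"] += 1
    return {lib: dict(c) for lib, c in counts.items()}
```

The input could come from something like samtools view -h out.bam piped into a script that passes sys.stdin to this function.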


whitwham · 2019-11-26T09:53:00Z

It would take some reworking of the stats but that might not be a bad thing.

As for JSON, how about posting an example of the kind of output you would like to see?

Poshi · 2021-10-06T09:36:06Z

I had a similar need. I'm in the process of choosing the proper mark duplicates software for each kind of experiment data we have. This means that I will run different mark duplicates programs, each of them reporting the stats in a different way (if it reports any at all). That led me to write my own duplicate statistics code that, given any BAM, returns a JSON-formatted file containing precisely the information you ask for. The JSON represents a dictionary of samples, each sample is a dictionary of libraries, each library is a dictionary of read group IDs (which in my case should be equivalent to platform units), and each read group ID contains the actual statistics. With this data, you can aggregate the fields at the level you need.
Having a separate statistics program allows us to freely switch the duplicate marking software without having to worry about the stats we will get. On the downside, you have to read the BAM again.

An example of my output:

{
  "SampleName": {
    "LibraryName": {
      "RGID_1": {
        "UNPAIRED_READS_EXAMINED": 53,
        "READ_PAIRS_EXAMINED": 22456,
        "SECONDARY_OR_SUPPLEMENTARY_RDS": 225,
        "UNMAPPED_READS": 53,
        "UNPAIRED_READ_DUPLICATES": 53,
        "READ_PAIR_DUPLICATES": 2006,
        "READ_PAIR_OPTICAL_DUPLICATES": 810,
        "PERCENT_DUPLICATION": 0.09040364728121873,
        "ESTIMATED_LIBRARY_SIZE": 188598
      },
      "RGID_2": {
        "UNPAIRED_READS_EXAMINED": 20,
        "READ_PAIRS_EXAMINED": 22503,
        "SECONDARY_OR_SUPPLEMENTARY_RDS": 263,
        "UNMAPPED_READS": 20,
        "UNPAIRED_READ_DUPLICATES": 20,
        "READ_PAIR_DUPLICATES": 1687,
        "READ_PAIR_OPTICAL_DUPLICATES": 730,
        "PERCENT_DUPLICATION": 0.075378670101719,
        "ESTIMATED_LIBRARY_SIZE": 240369
      },
      "RGID_N": {
        "UNPAIRED_READS_EXAMINED": 29,
        "READ_PAIRS_EXAMINED": 14875,
        "SECONDARY_OR_SUPPLEMENTARY_RDS": 166,
        "UNMAPPED_READS": 29,
        "UNPAIRED_READ_DUPLICATES": 29,
        "READ_PAIR_DUPLICATES": 1377,
        "READ_PAIR_OPTICAL_DUPLICATES": 705,
        "PERCENT_DUPLICATION": 0.09345511937942845,
        "ESTIMATED_LIBRARY_SIZE": 144634
      }
    }
  }
}

I think that this format could serve most purposes quite well. Optional statistics aggregation by library or sample should be easy to implement too.
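A minimal Python sketch of that aggregation, assuming the nested sample > library > read group layout from the example above. The function names `sum_counts` and `aggregate` are made up for illustration. The duplication rate is recomputed as (unpaired duplicates + 2 × pair duplicates) / (unpaired examined + 2 × pairs examined), which reproduces the per-read-group PERCENT_DUPLICATION values in the example. ESTIMATED_LIBRARY_SIZE is left out because it cannot simply be summed and would need to be re-estimated.

```python
COUNT_FIELDS = (
    "UNPAIRED_READS_EXAMINED",
    "READ_PAIRS_EXAMINED",
    "SECONDARY_OR_SUPPLEMENTARY_RDS",
    "UNMAPPED_READS",
    "UNPAIRED_READ_DUPLICATES",
    "READ_PAIR_DUPLICATES",
    "READ_PAIR_OPTICAL_DUPLICATES",
)


def sum_counts(stat_dicts):
    """Sum the count fields of several stats dicts and recompute the rate."""
    total = {key: 0 for key in COUNT_FIELDS}
    for stats in stat_dicts:
        for key in COUNT_FIELDS:
            total[key] += stats.get(key, 0)
    # Each read pair contributes two reads to both numerator and denominator.
    examined = total["UNPAIRED_READS_EXAMINED"] + 2 * total["READ_PAIRS_EXAMINED"]
    dups = total["UNPAIRED_READ_DUPLICATES"] + 2 * total["READ_PAIR_DUPLICATES"]
    total["PERCENT_DUPLICATION"] = dups / examined if examined else 0.0
    return total


def aggregate(stats):
    """Return (per-library, per-sample) totals from the nested stats dict."""
    by_library, by_sample = {}, {}
    for sample, libraries in stats.items():
        by_library[sample] = {
            lib: sum_counts(read_groups.values())
            for lib, read_groups in libraries.items()
        }
        by_sample[sample] = sum_counts(
            rg for read_groups in libraries.values() for rg in read_groups.values()
        )
    return by_library, by_sample
```

Loading the file with json.load and passing the result to aggregate would give the library and sample levels; a whole-BAM total could be obtained the same way by summing over all samples.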

whitwham · 2021-10-06T14:26:03Z

Thanks for that @Poshi. When I get back to markdup I'll take a look at adding a JSON option and individual library stats.

PedalheadPHX mentioned this issue May 31, 2021

No Duplicate stats in LIMS for BAMs with multiple Libraries tgen/phoenix#421

Open

whitwham self-assigned this May 16, 2022

whitwham mentioned this issue Aug 24, 2022

Add an option to mark duplicates by read group. #1699

Merged

whitwham closed this as completed Aug 24, 2022
