This repository has been archived by the owner on Nov 2, 2021. It is now read-only.

Releases: tecangenomics/nudup

Patch for compressed files

30 Jan 18:46

Fixed bug for .fq.gz file input.

-T path patch

23 Feb 21:00

Prevents error when using -T

Paired end patch

17 Jan 19:59

Fixes bug for paired end.

Support for samtools versions >= 1.2

13 Dec 18:03

- Support for samtools >= 1.2, with backward compatibility for older samtools versions
- Option to output only one duplicate-removed BAM file
- Ability to set the temporary directory for data processing on the command line

Paired-end and single-end read deduplication

04 Dec 03:14

Marks/removes PCR-introduced duplicate molecules based on the molecular tagging
technology used in NuGEN products.

For SINGLE END reads, duplicates are marked if they fulfill the following
criteria: a) start at the same genomic coordinate b) have the same strand
orientation c) have the same molecular tag sequence. The read with the
highest mapping quality is kept as the non-duplicate read.
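
The single-end rule can be illustrated with a minimal Python sketch. This is not nudup's actual implementation: the `Read` record and the function name are hypothetical, the chromosome is assumed to be part of the "genomic coordinate", and ties in mapping quality are assumed to be resolved in favor of the read seen first.

```python
from collections import namedtuple

# Hypothetical minimal read record; not nudup's internal representation.
Read = namedtuple("Read", "name chrom start strand tag mapq")


def mark_single_end_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Reads sharing start coordinate, strand, and molecular tag form one
    group; only the read with the highest mapping quality is kept.
    """
    best = {}           # grouping key -> read currently kept
    duplicates = set()
    for read in reads:
        key = (read.chrom, read.start, read.strand, read.tag)
        kept = best.get(key)
        if kept is None:
            best[key] = read
        elif read.mapq > kept.mapq:
            # The new read wins; the previously kept read becomes a duplicate.
            duplicates.add(kept.name)
            best[key] = read
        else:
            # Assumption: on a tie, the first read seen stays.
            duplicates.add(read.name)
    return duplicates


reads = [
    Read("r1", "chr1", 100, "+", "ACGTAC", 30),
    Read("r2", "chr1", 100, "+", "ACGTAC", 42),  # same group as r1, higher MAPQ
    Read("r3", "chr1", 100, "-", "ACGTAC", 42),  # different strand
    Read("r4", "chr1", 100, "+", "TTGACA", 42),  # different molecular tag
]
print(sorted(mark_single_end_duplicates(reads)))  # ['r1']
```

Only `r1` is flagged: `r3` and `r4` share the start coordinate but differ in strand or tag, so they are treated as distinct molecules.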

For PAIRED END reads, duplicates are marked if they fulfill the following
criteria: a) start at the same genomic coordinate b) have the same template
length c) have the same molecular tag sequence. The read pair with the highest
mapping quality is kept as the non-duplicate read.
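
For paired-end data the grouping key changes from strand to template length. The sketch below is illustrative only and makes several assumptions that the text above does not settle: the `Pair` record is hypothetical, each pair carries a single pair-level `mapq` score (how the two mates' qualities would be combined is not specified), the template length is compared by absolute value, and ties go to the pair seen first.

```python
from collections import defaultdict, namedtuple

# Hypothetical pair-level record; `mapq` is assumed to be one score per pair.
Pair = namedtuple("Pair", "name chrom start tlen tag mapq")


def split_paired_end(pairs):
    """Return (kept_names, duplicate_names) for a list of read pairs."""
    groups = defaultdict(list)
    for pair in pairs:
        # Same start coordinate, same template length, same molecular tag.
        key = (pair.chrom, pair.start, abs(pair.tlen), pair.tag)
        groups[key].append(pair)

    kept, duplicates = [], []
    for members in groups.values():
        # max() returns the first maximal element, so ties keep the earliest pair.
        winner = max(members, key=lambda p: p.mapq)
        kept.append(winner.name)
        duplicates.extend(p.name for p in members if p is not winner)
    return kept, duplicates


pairs = [
    Pair("p1", "chr2", 500, 180, "GATTACAG", 60),
    Pair("p2", "chr2", 500, 180, "GATTACAG", 35),  # duplicate of p1
    Pair("p3", "chr2", 500, 210, "GATTACAG", 60),  # different template length
]
kept, duplicates = split_paired_end(pairs)
print(kept)        # ['p1', 'p3']
print(duplicates)  # ['p2']
```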

Here are the two cases for running this tool:

  • CASE 1 (Standard): User supplies two input files,
    1. SAM/BAM file that a) ends with .sam or .bam extension b) contains unique
      alignments only
    2. FASTQ file (i.e., the index FASTQ) that contains the molecular tag sequence
      for each read name in the corresponding SAM/BAM file as either a) the read
      sequence or b) part of the FASTQ entry header. If the index FASTQ read
      is 6, 8, 12, 14, or 18 nt long, as expected for NuGEN products, the
      molecular tag sequence is extracted from the read according to the -s and
      -l parameters; otherwise the molecular tag is extracted from the header
      of the FASTQ entry.
  • CASE 2 (Runtime Optimized): User supplies only one input file,
    1. SAM/BAM file that a) ends with .sam or .bam extension b) contains unique
      alignments only c) is sorted d) has a fixed-length sequence containing the
      molecular tag appended to each read name.
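
The tag-extraction decision described for CASE 1 can be sketched as follows. This is a rough illustration, not the tool's code. The text above does not define how -s and -l are interpreted, so the sketch assumes a hypothetical 1-based `start` counted from the left end of the read and a `length` in nucleotides; the real tool may count differently. For the header fallback it assumes, as in CASE 2, that the tag is the trailing fixed-length part of the read name.

```python
# Index read lengths listed above as expected for NuGEN products.
EXPECTED_INDEX_LENGTHS = {6, 8, 12, 14, 18}


def extract_tag(header, sequence, start, length):
    """Pick the molecular tag from the index read or from its header.

    `start` and `length` stand in for -s and -l; their exact semantics
    here (1-based, counted from the left) are an assumption.
    """
    if len(sequence) in EXPECTED_INDEX_LENGTHS:
        # Tag is a slice of the index read itself.
        return sequence[start - 1:start - 1 + length]
    # Otherwise fall back to the header. Assumption: the tag is the last
    # `length` characters of the read name (first whitespace-delimited token).
    read_name = header.lstrip("@").split()[0]
    return read_name[-length:]


# 12 nt index read: the tag is taken from the sequence.
print(extract_tag("@read1 1:N:0", "ACGTACGTACGT", start=7, length=6))  # GTACGT

# 50 nt read: length is not an expected index length, so the header is used.
print(extract_tag("@read2:TTGACA 1:N:0", "A" * 50, start=1, length=6))  # TTGACA
```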