Skip to content

a-slide/NanoCount

Repository files navigation

NanoCount v1.1.0

NanoCount

GitHub license language PyPI version PyPI downloads DOI Anaconda Version Anaconda Downloads

NanoCount estimates transcript abundances from Oxford Nanopore direct RNA sequencing datasets, using filtering steps and an expectation-maximization approach (similar to RSEM, Kallisto, Salmon, etc) to handle the uncertainty of multi-mapping reads

Further documentation is available at: https://a-slide.github.io/NanoCount/


Installation

Install with pip from GitHub.

pip install git+https://github.com/a-slide/NanoCount.git

Read alignment prior to quantification with NanoCount

Reads should be aligned to a transcriptome reference using minimap2. We recommend using the -N 10 option to retain at least 10 secondary mappings.

minimap2 -t 4 -ax map-ont -N 10 transcriptome.fasta reads.fastq | samtools view -bh > aligned_reads.bam

General Command Line Usage

NanoCount --help
usage: NanoCount [-h] [--version] -i ALIGNMENT_FILE [-o COUNT_FILE]
                 [-b FILTER_BAM_OUT] [-l MIN_ALIGNMENT_LENGTH]
                 [-f MIN_QUERY_FRACTION_ALIGNED] [-s SEC_SCORING_VALUE]
                 [-t SEC_SCORING_THRESHOLD] [-c CONVERGENCE_TARGET]
                 [-e MAX_EM_ROUNDS] [-x] [-p PRIMARY_SCORE] [-a] [-n]
                 [-d MAX_DIST_3_PRIME] [-u MAX_DIST_5_PRIME] [-v] [-q]

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Input/Output options:
  -i ALIGNMENT_FILE, --alignment_file ALIGNMENT_FILE
                        Sorted and indexed BAM or SAM file containing aligned
                        ONT direct RNA reads including secondary alignments
                        (required) [str]
  -o COUNT_FILE, --count_file COUNT_FILE
                        Output file path where to write estimated counts (TSV
                        format) (default: None) [str]
  -b FILTER_BAM_OUT, --filter_bam_out FILTER_BAM_OUT
                        Optional output file path where to write filtered
                        reads selected by NanoCount to perform quantification
                        estimation (BAM format) (default: None) [str]

Misc options:
  -l MIN_ALIGNMENT_LENGTH, --min_alignment_length MIN_ALIGNMENT_LENGTH
                        Minimal length of the alignment to be considered valid
                        (default: 50) [int]
  -f MIN_QUERY_FRACTION_ALIGNED, --min_query_fraction_aligned MIN_QUERY_FRACTION_ALIGNED
                        Minimal fraction of the primary alignment query
                        aligned to consider the read valid (default: 0.5)
                        [float]
  -s SEC_SCORING_VALUE, --sec_scoring_value SEC_SCORING_VALUE
                        Value to use for score thresholding of secondary
                        alignments either "alignment_score" or
                        "alignment_length" (default: alignment_score) [str]
  -t SEC_SCORING_THRESHOLD, --sec_scoring_threshold SEC_SCORING_THRESHOLD
                        Fraction of the alignment score or the alignment
                        length of secondary alignments compared to the primary
                        alignment to be considered valid alignments (default:
                        0.95) [float]
  -c CONVERGENCE_TARGET, --convergence_target CONVERGENCE_TARGET
                        Convergence target value of the cummulative difference
                        between abundance values of successive EM round to
                        trigger the end of the EM loop. (default: 0.005)
                        [float]
  -e MAX_EM_ROUNDS, --max_em_rounds MAX_EM_ROUNDS
                        Maximum number of EM rounds before triggering stop
                        (default: 100) [int]
  -x, --extra_tx_info   Add transcript length and zero coverage transcripts
                        to the output file (required valid bam/sam header)
                        (default: False) [boolean]
  -p PRIMARY_SCORE, --primary_score PRIMARY_SCORE
                        Method to pick the best alignment for each read. By
                        default ("alignment_score") uses the best alignment
                        score (AS optional field), but it can be changed to
                        use either the primary alignment defined by the
                        aligner ("primary") or the longest alignment
                        ("alignment_length"). Choices = [primary,
                        alignment_score, alignment_length] (default:
                        alignment_score) [str]
  -a, --keep_supplementary
                        Retain any supplementary alignments and considered
                        them like secondary alignments. Discarded by default.
                        (default: False) [boolean]
  -n, --keep_neg_strand
                        Retain negative strand alignments for ONT cDNA data. 
                        Un-tested and not benchmarked. Use with caution.
                        If using for cDNA, either deactivate 3 prime end filtering, or enable both 
                        3 prime and 5 prime filtering. cDNA does not only begin at 3 prime 
                        ends as it does with direct RNA. 
                        (default: False) [boolean]
  -d MAX_DIST_3_PRIME, --max_dist_3_prime MAX_DIST_3_PRIME
                        Maximum distance of alignment end to 3 prime of
                        transcript. In ONT direct RNA sequencing, reads are assumed to start
                        from the polyA tail (-1 to deactivate) (default: 50)
                        [int]
  -u MAX_DIST_5_PRIME, --max_dist_5_prime MAX_DIST_5_PRIME
                        Maximum distance of alignment start to 5 prime of
                        transcript. In conjunction with max_dist_3_prime it
                        can be used to select near full length transcript reads only
                        (-1 to deactivate). (default: -1) [int]

Verbosity options:
  -v, --verbose         Increase verbosity for QC and debugging (default:
                        False) [boolean]
  -q, --quiet           Reduce verbosity (default: False) [boolean]

Basic command:

NanoCount -i data/aligned_reads_sorted.bam -b output/aligned_reads_selected.bam --extra_tx_info -o output/tx_counts.tsv

Input BAM file

NanoCount is designed for Oxford Nanopore direct RNA sequencing datasets only.

Reads should be aligned to a transcriptome reference using minimap2. We recommend using the -N 10 option to retain at least 10 secondary mappings. For highly repetitive transcriptomes, this value can be increased.

Since we use a transcriptome reference the alignment algorithm does not have to be splice aware.

NanoCount can take either BAM or SAM format and does not require reads to be sorted, although it is recommended.

Output TSV file

NanoCount returns a file containing count data per transcript. By default only transcripts with at least one read mapped are included in the output. This can be changed to include all transcripts with the option extra_tx_info.

  • transcript_name: Transcript name as in source bam/sam file.
  • raw: Raw abundance estimates. The sum of all abundance values is 1.
  • est_count: Estimated counts obtained by multiplying the raw abundance by the number of primary alignments.
  • tpm: Estimated counts obtained by multiplying the raw abundance by 1M.
  • transcript_length: Optional column included with the option extra_tx_info.

Tpm and estimated counts are not normalised by transcript length as usually done with short-read sequencing. The reason is that in long-read direct RNA sequencing one read is supposed to represent a single transcript molecule starting from the polyA tail, even if the fragment doesn't extend to the 5' end. If using a custom protocol allowing sequencing to start from internal RNA fragments (whole RNA tailing, degenerated custom adapter), then this prior may no longer be valid.

Output BAM file

Optionally, users can choose to output the alignments that pass NanoCount's filtering thresholds for downstream analysis such ad QC or visualisation. The alignments are written in the same order as the source BAM file.

Citation

If you use NanoCount please cite as follows:

Josie Gleeson, Adrien Leger, Yair D J Prawer, Tracy A Lane, Paul J Harrison, Wilfried Haerty, Michael B Clark, Accurate expression quantification from nanopore direct RNA sequencing with NanoCount, Nucleic Acids Research, Volume 50, Issue 4, 28 February 2022, Page e19, https://doi.org/10.1093/nar/gkab1129

Disclaimers

Please be aware that NanoCount is a research package that is still under development. The API, command line interface, and implementation might change without retro-compatibility. It was tested under Linux Ubuntu 16.04 and in an HPC environment running under Red Hat Enterprise 7.1.

Licence

MIT (https://mit-license.org/). Copyright © 2020 Adrien Leger.

Acknowledgements

The package was inspired from https://github.com/jts/nanopore-rna-analysis by Jared Simpson (@jts).

Classifiers

  • Development Status :: 3 - Alpha
  • Intended Audience :: Science/Research
  • Topic :: Scientific/Engineering :: Bio-Informatics
  • License :: OSI Approved :: MIT License
  • Programming Language :: Python :: 3

About

EM based transcript abundance from nanopore reads mapped to a transcriptome with minimap2

Resources

License

Code of conduct

Stars

Watchers

Forks

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •