Skip to content

Latest commit

 

History

History
580 lines (468 loc) · 21.4 KB

MANUAL.md

File metadata and controls

580 lines (468 loc) · 21.4 KB
title author
abismal reference manual
Guilherme Sena

NAME

abismal - a fast and memory-efficient mapper for short whole genome bisulfite sequencing reads. Official release, discussion, source code and development versions are available at the abismal GitHub repo.

SYNOPSIS

abismal [OPTIONS] -o output.sam <input-1.fastq> [<input-2.fastq>]

Examples

abismal -g ref.fa -o out.sam [-s out.stats] reads-1.fq
abismal -g ref.fa -o out.sam [-s out.stats] reads-1.fq reads-2.fq

If a ref.idx file was created using abismalidx:

abismal -i ref.idx -o out.sam [-s out.stats] reads-1.fq
abismal -i ref.idx -o out.sam [-s out.stats] reads-1.fq reads-2.fq

DESCRIPTION

abismal maps short bisulfite sequencing reads in FASTQ format to a reference genome in FASTA format and produces an output file in SAM format. Only reads that are mapped uniquely or ambiguously (optional) are reported in the SAM output.

abismal can map reads assuming either C>T or G>A conversion in single-end reads. Both conversions are also accepted in paired-end mapping, but corresponding read pairs are assume to have complementary base conversions. The assumption of C>T conversion means Ts in the read are considered matches to either Cs or Ts in the reference (in either strand). The assumption of G>A conversion means that As in reads are considered matches to either Gs or As in the reference.

absimal was built to map short reads of up to 250 bases. It should successfully map reads of size up to 1 million, but because it uses very short seeds for filtration, the mapping time will increase substantially.

QUICK INSTALLATION

Run the following commands to install abismal

wget https://github.com/smithlabcode/abismal/releases/download/v3.2.2/abismal-3.2.2.tar.gz
tar -xvzf abismal-3.2.2.tar.gz
cd abismal-3.2.2
./configure --prefix=$(pwd)
make
make install

This will create a bin directory where binaries abismalidx and abismal are located. The abismalidx program indexes a FASTA reference genome, and abismal maps FASTQ reads to a reference.

OPTIONS

-i FILE, -index FILE [required if -g not provided]

Input index file generated by abismalidx. Either the -g or -i parameter must be provided (but not both). If -g is provided, abismal will read the indexed genome before starting to map reads. Using -i is recommended if several FASTQ files are mapped to the same genome.

-g FILE, -genome FILE [required if -i not provided]

Input FASTA genome. Either the -g or -i parameter must be provided (but not both). If -g is provided, abismal will first index the input FASTA genome before starting to map reads. Using -g is recommended if only one sample will be mapped to a reference genome. In most practical cases, however, it is preferable to first index the genome using abismalidx, then using the output index file as input through the -i flag.

-o FILE, -outfile FILE

Output file in SAM format by default. This argument is required.

-B, -bam

Using this argument, the output will be in BAM format.

-s FILE, -stats FILE

Output mapping statistics file in YAML format. This file provides a summary of mapping efficiency, detailing how many reads were zero, one or multiple times. It also provides the error rate of mapped reads and the number of reads that were too short to be mapped

The output is in YAML format, which is human-readable and can be parsed in several programming languages.

For single-end reads the statistics looks something like this:

total_reads: 1000000
mapped:
    num_mapped: 954620
    num_unique: 856108
    num_ambiguous: 98512
    percent_mapped: 95.462
    percent_unique: 85.6108
    percent_ambiguous: 9.8512
    unique_error:
        edits: 762988
        total_bases: 85610800
        error_rate: 0.891229
num_unmapped: 45380
num_skipped: 0
percent_unmapped: 4.538
percent_skipped: 0

For paired-end reads it looks like this:

pairs:
    total_pairs: 1000000
    mapped:
        num_mapped: 783009
        num_unique: 720655
        num_ambiguous: 62354
        percent_mapped: 78.3009
        percent_unique: 72.0655
        percent_ambiguous: 6.2354
        unique_error:
            edits: 1042712
            total_bases: 144131000
            error_rate: 0.723447
    num_unmapped: 216991
    num_discordant: 16085
    percent_unmapped: 21.6991
    percent_discordant: 1.6085
mate1:
    total_reads: 279345
    mapped:
        num_mapped: 175933
        num_unique: 103058
        num_ambiguous: 72875
        percent_mapped: 62.9805
        percent_unique: 36.8927
        percent_ambiguous: 26.0878
        unique_error:
            edits: 180656
            total_bases: 10305800
            error_rate: 1.75295
    num_unmapped: 103412
    num_skipped: 0
    percent_unmapped: 37.0195
    percent_skipped: 0
mate2:
    total_reads: 279345
    mapped:
        num_mapped: 115030
        num_unique: 43807
        num_ambiguous: 71223
        percent_mapped: 41.1785
        percent_unique: 15.682
        percent_ambiguous: 25.4964
        unique_error:
            edits: 55687
            total_bases: 4380700
            error_rate: 1.27119
    num_unmapped: 164315
    num_skipped: 0
    percent_unmapped: 58.8215
    percent_skipped: 0

-t NUM-THREADS, -threads NUM-THREADS [default : 1]

number of threads that should be used to map reads. Each thread reads 1000 reads in parallel, so increasing this number uses more memory by a few kilobytes per additional thread. In most practical cases this should not be significantly different than single-thread mapping.

-l MIN-FRAG-VALUE, -min-frag MIN-FRAG-VALUE [default : 32]

For paired-end mode only. The minimum size a concordant fragment can have. There are cases in which concordant pairs can "dovetail", that is, the end of the reverse mate can pass the start of the forward mate. Ths parameter dictates the minimum size a dovetail read can have while accepting concordance. The schematic below depicts what the value of -l represents:

                    [==================================>] [FORWARD MATE]
[<==================================] [REVERSE MATE]
                    |-------l-------|

-L MAX-FRAG-VALUE, -max-frag MAX-FRAG-VALUE [default : 3000]

For paired-end mode only. The maximum size a concordant fragment, defined as the maximum distance between the genome start (smallest) coordinate of the forward mapped mate and the start (largest) coordinate of the reverse mapped strand, for a pair to be considered concordant. The schematic below depics how L is calculated.

[===============>] [FORWARD MATE]
                                         [<==============] [REVERSE MATE]
|---------------------------L----------------------------|

-m MAX-FRACTION, -max-distance MAX-FRACTION [default : 0.1]

The maximum edit distance allowed for a read to be considered "mapped", relative to the read's mapped length. Abismal will choose the location in the genome with maximum alignment score. This location will be reported if the number of mismatches, insertions and deletions is at most -m times the mapped region of the read (i.e. excluding soft-clipped bases). For instance, if a read of 150 bp is mapped to a location with CIGAR string 20S100M30S, the read is allowed to have at most 10 mismatches.

-a -ambig

If this flag is provided, abismal will report a random location to reads that mapped ambiguously. These reads can be identified by the presence of bit 0x100 in the SAM flags.

-P -pbat

For paired-end mapping only. Assumes the bisulfite conversion of the first end to be G>A and the bisulfite conversion of the second end to be C>T.

-R -random-pbat

Maps reads in random PBAT mode.

For single-end mapping, abismal will attempt to map reads assuming C>T conversion, then G>A, and keeping the conversion that attains the best alignment score.

For paired-end mapping, abismal will attempt to map reads assuming end 1 has C>T conversion and end 2 has G>A conversion. Then it will map the same reads assuming end 1 has G>A conversion. The conversion that attains highest sum of alignment scores is kept.

-A -a-rich

For single-end mapping only. Indicates that reads are A-rich. Mapping with this flags assumes that the bisulfite conversion is G>A instead of C>T.

-v -verbose

Prints more run info on the mapping progress, including a progress bar showing the percentage of input reads currently processed.

INPUT FASTQ FORMAT

abismal accepts reads in either FASTQ or FASTQ.GZ formats. The FASTQ format represents reads through 4 lines per read.

  • The first line is the read name and has to start with the @ character.
  • The second line is the read itself. Read charcters must beither A, C, G, T or N, in lowercase or uppercase.
  • The third line is ignored. It usually starts with the + character.
  • The fourth line is the Phred quality of each base in the read. It must be the same length as the second line. It is also ignored by the abismal algorithm.

An example FASTQ file with one read looks like this:

@1_chr3:131015484-131015553_R1
TTTATTAGGTAAGAAGGATAATAAGGGAGTTGAGTTTATGTGTTATAGAG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Because reads take 4 lines per entry, the number of lines of the FASTQ input must be a multiple of 4. This is a necessary but not sufficient condition for properly formatted FASTQ.

For paired-end data, it is mandatory that both FASTQ ends have the same number of lines. Corresponding entries in each file are assumed to be mates.

OUTPUT SAM FORMAT

Output headers

abismal representes mapped reads in the Sequence Alignment/Map (SAM) format. Detailed specification of SAM is available at the samtools documentation page. BAM format is also supported, but the BAM format can be converted to SAM using samtools if desired.

The output starts with SAM headers. Headers contain metadata information on the reference genome. Each header line starts with the @ character.

The first line in the SAM output file is given by

@HD VN:1.0

followed by one line per chromosome in the input FASTA file, encoding the chromosome length. Each line is in the format

@SQ SN:[chrom-name] LN:[chrom-length]

where [chrom-name] is given by the first word of the chromosome name in the FASTA file (anything after the first white space is deleted), and [chrom-length] is the number of bases in the chromosome sequence.

The last line of the headers is a copy of how the program was called to generate the SAM output, and is of the form

@PG ID:ABISMAL  VN:3.2.2  CL:[command-call]

where [command-call] is the shell command used to run abismal.

Output mapped lines

Following the header lines, reads that are mapped once (or at least once if the -a flag is used) are reported. Each read is a set of thirteen tab-separated fields as follows.

The first eleven fields are mandatory SAM fields:

  • QNAME: The query name, given by the first word of the FASTQ read name
  • FLAG: List of samtools flags, according to the SAM file format documentation
  • RNAME: The reference name, or the chromosome to which the read was mapped
  • POS: 1-based position in the chromosome in which the read was mapepd
  • MAPQ: Abismal does not provide MAPQ, so this value is always set to 255 ("not defined" according to the SAM documentation)
  • CIGAR: A CIGAR string representing the read alignment. M stands for matches, S are soft-clipped bases, I are insertions to the reference and D are deletions from the reference
  • RNEXT: This field is an equal sign (=) for reads that mapped concordantly or an asterisk (*) for single-end reads or reads that map discordantly.
  • PNEXT: The position (POS field) of the read's mate. In single-end reads this is given by an asterisk (*)
  • TLEN: The fragment length for paired-end reads. It is a positive value for if the first end maps to the forward strand and negative otherwise. For discordant or single-end reads a value of 0 is reported.
  • SEQ: The input sequence identical to the FASTQ input (see "important note on SEQ reads reported in the SAM output" below)
  • QUAL: An asterisk (*). QUAL values are discarded on mapping and not reported

The last two fields are optional tags that can be used downstream

  • NM: The edit distance (mismatches + insertions + deletions) in the alignment between the read and reference
  • CV: The bisulfite letter conversion assumed when mapping the read. If the read was assumed A-rich (G>A conversion), the value CV:A:A is reported. If the read was assumed T-rich (C>T conversion), then the value CV:A:T is reported. If the -R flag is not set, all reads coming from the same FASTQ file will have the same assumed conversion. If the -R flag is used, this tag provides the conversion used in the reported alignment.

IMPORTANT NOTE ON SEQ READS REPORTED IN THE SAM OUTPUT

In a strict sense, the SEQ field reported by abismal does not comply with the SAM standard. Properly formatted SAM files reverse-complement reads that map to the reverse strand of the reference genome, whereas abismal reports reads identical to the input FASTQ. This is done deliberately to maintain consistency with the conversion type reported in the CV tag. This standard allows downstream analyses on letter frequencies and conversion types. Tools to reverse-complement the SEQ field and the letter in the CV tag can be used if properly formatted SAM files are necessary.

Filtering concordant/discordant pairs in paired-end SAM

In paired-end mapping, abismal will try to mate concordant pairs and find pairs whose sum of edit distances is below the user-defined cut-off (set to 10% of the mapped read length by default). If it cannot find such concordant pair (either due to ambiguity or low alignment score), abismal will map each end as single-end reads and report locations with sufficiently high similarity. In many cases, one may want to only keep concordant pairs, or even isolate discordant pairs to analyze their sequence patterns, mapping quality, etc.

To isolate concordant pairs from paired-end output out.sam, use the following awk script, which isolates the SAM header and reads with an equal (=) symbol in the RNEXT field (reserved only for concordant pairs).

awk '$1 ~ /^@/ || $7 == "="' out.sam >out_paired.sam

To isolate discordantly mapped reads from paired-end output out.sam, use the following awk script, which isolates the SAM header and reads with an asterisk (*) in the RNEXT field (reserved for discordantly mapped reads).

awk '$1 ~ /^@/ || $7 == "*"' out.sam >out_single.sam

NOTATION USED BY ABISMAL

Below we explain in more detail some of the notation used in the abismal options.

T-rich and A-rich reads

Abismal uses the notation of T-rich and A-rich reads for input reads. We say a read is T-rich if it was sequenced directly after bisulfite conversion, and we say a read is A-rich if it the complement of the bisulfite-converted DNA was sequenced. Single-end and end 1 of paired-end reads are usually T-rich, and end 2 of paired-end reads are usually A-rich.

For most organisms, we can infer if an input is T-rich based on the frequencies of Ts and Cs (for animal samples and most plants). Half of the T-rich bases should be Ts, and the frequency of Cs should be extremely low. Conversely, in A-rich samples half of the bases are As, and the frequency of Gs is very low. If neither case applies for the dataset, it may be either an RPBAT sample or not a bisulfite sequencing sample (see "notation used by abismal" below for a description of what RPBAT samples are)

Traditional, PBAT and RPBAT reads

Reads that follow the expected T-rich/A-rich convention are called traditional. Inputs are assumed traditional by default. In other words, if no flags are provided, A single-end input, as well as end 1 of paired-end reads, will be assumed T-rich, and end 2 will be assumed A-rich.

We say reads are PBAT (often from the Post-Bisulfite Adapter Tagging protocol) if they follow the opposite convention of traditional reads. This means that a PBAT single-end input is A-rich, and a PBAT paired-end input is A-rich in end 1 and T-rich in end 2. Reads can be mapped in PBAT mode by using the -p flag or the -pbat flag among the [options].

We say reads are RPBAT (from the Random-priming PBAT protocol) if reads in the input can be T-rich or A-rich with equal probability, and both must be tried. For paired-end input, we always assume that corresponding ends of reads have complementary bisulfite bases. In other words, if end 1 is T-rich, then end 2 is A-rich, and vice-versa. Reads can be mapped as RPBAT by passing the -R flag or the -random-pbat flag among the [options].

What if the protocol (traditional, PBAT, RPBAT) is not known?

If the sequencing protocol is not known, we suggest trying all 3 possibilities. You will always get most reads by mapping as RPBAT, but that does not mean this is always the best option. For instance, if reads come from the traditional single-end protocol, the reads mapped as A-rich may be false positives and lead to incorrect downstream analyses. You should only map as RPBAT if you are getting very low mapping values in both traditional and PBAT modes, which suggests that reads are truly RPBAT.

Valid hits

Abismal maps reads by first computing Hamming distances between the read and the candidates retrieved during seeding. Hamming distance is the number of mismatches between the read and the candidate, so no insertions and deletions are made. We say a candidate is a valid hit for a read if the Hamming distance is below 40% of the read length. Hits that are not valid will not be aligned, as they are extremely unlikely to yield high alignment scores.

Alignment scores and edit distances

Abismal aligns reads through a banded Smith-Waterman alignment, using a band width of 3. We use a scoring system of +1 for matches, -1 for mismatches and -1 for indels. In other words, if an alignment has M matches, m mismatches, I insertions and D deletions, the alignment score A is given by

$$A = a - I - D$$

and the edit distance E is given by

$$E = m + I + D$$

Abismal selects the best alignment score, and only reports it if the edit distance is below a fraction m (set under the -m or -max-fraction flag) of the read length. For example, if the read length is 100 and m =0.1, the read is only reported if the edit distance is at most 10.

We also note that abismal does not use phred quality scores in alignment, so a mismatch on a low quality base is the same as a mismatch in a high-quality base.

Valid alignments

We say a candidate is a valid alignment if the edit distance is at most a fraction m of the read length. The acceptable fraction can be set through the -m or -max-distance flag in [options]. Some application-specific cases may require more or less acceptable error, so this parameter can be adjusted by the user.

Concordant pairs

On paired-end input, we say reads $r_1$ and $r_2$ with lengths $n_1$ and $n_2$, respectively, are a corresponding pair are concordant if there exist positions $p_1$ and $p_2$ such that

  1. $p_1$ is a valid alignment for $r_1$ and $p_2$ is a valid alignment for $r_2$
  2. $p_1$ and $p_2$ are in opposite strands of the genome, and
  3. $p_2 - p_1 \geq l$ and $p_2 - p_1 \leq L - n2$.

This means that the fragment lengths from which the pair originates is at least $l$ and at most $L$. The default values of $l$ and $L$ are 32 and 3000, respectively, and can be set by the -l and -L flags. Those are conservative values that cover most of the current protocols. We will incorporate automatic calculations of these values in the future based on the first high-quality read pairs that are mapped.

Discordant pairs

We say a read pair is discordant if the following 3 conditions are met simultaneously

  1. It is not concordant,
  2. The best mapping position of each end is unique,
  3. The best mapping position of each end is a valid alignment.

Reporting non-concordant reads

Read pairs that are not concordant (including discordant reads) do not follow the expectations, but in cases where they map with high quality, they will be reported. If read pairs are not concordant, abismal maps each end independently as single-end reads. If users do not which to keep single-end alignments in paired-end data, they can filter out single-end reads of an output file out.sam though the following command.

samtools view out.sam | awk '$7 == "="' >out_paired.sam

The resulting file (out_paired.sam) will contain only concordant pairs.

EXIT VALUES

0 : Success.

1 : Runtime error. When the program, fails, an error message will be shown in STDERR describing the problem encountered.

REPORTING BUGS

Bugs can be reported at the abismal GitHub page at the issues section. the abismal GitHub page) as well as by e-mail to desenabr@usc.edu or andrewds@usc.edu. When reporting bugs, please provide the version of abismal used and the operating system used to run abismal. It is also helpful, when relevant, to provide small input datasets and the genome used so we can reproduce the issue and find the source of the problem.

COPYRIGHT

Copyright (C) 2018-2023 Andrew D. Smith and Guilherme Sena

abismal is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

abismal is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.