* Added FO and KS tags to @RG
 * Added the FZ alignment tag
 * Allow IUPAC code in MD
 * Clarify that IUPAC is allowed in SEQ
 * Clarify that the H SAM type stores a byte array in the Hex format
 * Added BAM tag type "B"
 * Other format changes and minor clarifications
Heng Li committed Apr 2, 2011
1 parent 8a44780 commit 598d014
Showing 1 changed file with 71 additions and 53 deletions.
124 changes: 71 additions & 53 deletions SAMv1.tex
@@ -12,7 +12,7 @@

\makeindex

\title{The SAM Format Specification (v1.3-r882)}
\title{The SAM Format Specification (v1.3-r946)}
\author{The SAM Format Specification Working Group}
\begin{document}

@@ -61,6 +61,8 @@ \subsection{An example}
\end{verbatim}
\end{framed}

\pagebreak

\subsection{Terminologies and Concepts}

\begin{description}
@@ -90,9 +92,9 @@ \subsection{Terminologies and Concepts}
\subsection{The header section}
Each header line begins with character `{\tt @}' followed by a
two-letter record type code. In the header, each line is TAB-delimited
and each data field follows a format `{\tt TAG:VALUE}' where {\tt TAG}
and except the {\tt @CO} lines, each data field follows a format `{\tt TAG:VALUE}' where {\tt TAG}
is a two-letter string that defines the content and the format of {\tt
VALUE}. Each header line should match:\\ {\tt
VALUE}. Each header line should match: {\tt
/\char94@[A-Za-z][A-Za-z](\char92t[A-Za-z][A-Za-z]:[
-\char126])+\$/}. Tags containing lowercase letters are reserved for
end users.
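
The header-line rule above can be sketched as a small validator. This is a minimal illustration, not part of the specification: the helper names `is_valid_header_line` and `parse_header_fields` are hypothetical, and the pattern assumes a `+` after the value character class so that values longer than one character match, which the rule as printed does not include.

```python
import re

# Pattern transcribed from the rule above. ASSUMPTION: a '+' is added after
# the [ -~] class so that multi-character values are accepted.
HEADER_LINE = re.compile(r"@[A-Za-z][A-Za-z](\t[A-Za-z][A-Za-z]:[ -~]+)+")


def is_valid_header_line(line):
    """Return True if a header line (without its newline) is well-formed."""
    if line == "@CO" or line.startswith("@CO\t"):
        # @CO lines hold free text, so the TAG:VALUE rule is not applied.
        return True
    return HEADER_LINE.fullmatch(line) is not None


def parse_header_fields(line):
    """Split a TAG:VALUE header line into (record_type, {tag: value})."""
    if line.startswith("@CO") or not is_valid_header_line(line):
        raise ValueError("not a TAG:VALUE header line: %r" % line)
    record_type, _, rest = line[1:].partition("\t")
    # Split on the first ':' only, because a value may itself contain ':'.
    fields = dict(field.split(":", 1) for field in rest.split("\t"))
    return record_type, fields
```

For instance, `parse_header_fields("@HD\tVN:1.3\tSO:coordinate")` yields the record type `HD` with the two tags `VN` and `SO`.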
@@ -109,9 +111,9 @@ \subsection{The header section}
& {\tt VN}* & Format version. \emph{Accepted format}: {\tt /\char94[0-9]+\char92.[0-9]+\$/}.\\\cline{2-3}
& {\tt SO} & Sorting order of alignments. \emph{Valid values}: {\tt unknown} (default), {\tt
unsorted}, {\tt queryname} and {\tt coordinate}. For coordinate sort, the major sort
key is the RNAME field, with order defined by the order of @SQ lines in the header. The
minor sort key is the POS field. For alignments with equal RNAME and POS, order is
arbitrary. All alignments with * in RNAME field follow alignments with some other
key is the {\sf RNAME} field, with order defined by the order of {\tt @SQ} lines in the header. The
minor sort key is the {\sf POS} field. For alignments with equal {\sf RNAME} and {\sf POS}, order is
arbitrary. All alignments with `{\tt *}' in {\sf RNAME} field follow alignments with some other
value but otherwise are in arbitrary order.\\\cline{1-3}
\multicolumn{2}{|l}{\tt @SQ} & Reference sequence dictionary. The order of {\tt @SQ} lines defines the alignment sorting order.\\\cline{2-3}
& {\tt SN}* & Reference sequence name. Each {\tt @SQ} line must have a unique {\tt SN} tag. The value of this
@@ -123,26 +125,37 @@ \subsection{The header section}
& {\tt SP} & Species.\\\cline{2-3}
& {\tt UR} & URI of the sequence. This value may start with one of the standard
protocols, e.g.\ http: or ftp:. If it does not start with one of these protocols, it is assumed to be a file-system path.\\\cline{1-3}\pagebreak\cline{1-3}
\multicolumn{2}{|l}{\tt @RG} & Read group. Unordered multiple lines are allowed.\\\cline{2-3}
\multicolumn{2}{|l}{\tt @RG} & Read group. Unordered multiple {\tt @RG} lines are allowed.\\\cline{2-3}
& {\tt ID}* & Read group identifier. Each {\tt @RG} line must have a unique {\tt ID}. The value of {\tt ID}
is used in the RG tags of alignment records. Must be unique among all read groups in header section. Read group IDs may be modified when merging SAM files in order to handle collisions.\\\cline{2-3}
& {\tt CN} & Name of sequencing center producing the read.\\\cline{2-3}
& {\tt DS} & Description.\\\cline{2-3}
& {\tt DT} & Date the run was produced (ISO8601 date or date/time).\\\cline{2-3}
& {\tt FO} & Flow order. The array of nucleotide bases that correspond to the nucleotides used for each flow of each read.
Multi-base flows are encoded in IUPAC format, and non-nucleotide flows by various other characters. \emph{Format}: {\tt /\char92*|[ACMGRSVTWYHKDBN]+/}\\\cline{2-3}
& {\tt KS} & The array of nucleotide bases that correspond to the key sequence of each read.\\\cline{2-3}
& {\tt LB} & Library.\\\cline{2-3}
& {\tt PG} & Programs used for processing the read group.\\\cline{2-3}
& {\tt PI} & Predicted median insert size.\\\cline{2-3}
& {\tt PL} & Platform/technology used to produce the read. \emph{Valid values}:
{\tt ILLUMINA}, {\tt SOLID}, {\tt LS454}, {\tt HELICOS} and {\tt PACBIO}.\\\cline{2-3}
& {\tt PL} & Platform/technology used to produce the reads. \emph{Valid values}:
{\tt CAPILLARY}, {\tt LS454}, {\tt ILLUMINA}, {\tt SOLID}, {\tt HELICOS}, {\tt IONTORRENT} and {\tt PACBIO}.\\\cline{2-3}
& {\tt PU} & Platform unit (e.g. flowcell-barcode.lane for Illumina or slide for SOLiD). Unique identifier.\\\cline{2-3}
& {\tt SM} & Sample. Use pool name where a pool is being sequenced.\\\cline{1-3}
\multicolumn{2}{|l}{\tt @PG} & Program. \\\cline{2-3}
& {\tt ID}* & Program record identifier. Each {\tt @PG} line must have a unique {\tt ID}. The value of {\tt ID} is used in the alignment {\tt PG} tag and {\tt PP} tags of other {\tt @PG} lines. PG IDs may be modified when merging SAM files in order to handle collisions.\\\cline{2-3}
& {\tt ID}* & Program record identifier. Each {\tt @PG} line must have a unique {\tt ID}.
The value of {\tt ID} is used in the alignment {\tt PG} tag and {\tt PP} tags of other {\tt @PG} lines.
{\tt PG} IDs may be modified when merging SAM files in order to handle collisions.\\\cline{2-3}
& {\tt PN} & Program name \\\cline{2-3}
& {\tt CL} & Command line \\\cline{2-3}
& {\tt PP} & Previous {\tt @PG-ID}. Must match another {\tt @PG} header's {\tt ID} tag. {\tt @PG} records may be chained using {\tt PP} tag, with the last record in the chain having no {\tt PP} tag. This chain defines the order of programs that have been applied to the alignment. PP values may be modified when merging SAM files in order to handle collisions of PG IDs. The first PG record in a chain (i.e. the one referred to by the PG tag in a SAM record) describes the most recent program that operated on the SAM record. The next PG record in the chain describes the next most recent program that operated on the SAM record. \\\cline{2-3}
& {\tt PP} & Previous {\tt @PG-ID}. Must match another {\tt @PG} header's {\tt ID} tag.
{\tt @PG} records may be chained using {\tt PP} tag, with the last record in the chain
having no {\tt PP} tag. This chain defines the order of programs that have been applied to the alignment.
{\tt PP} values may be modified when merging SAM files in order to handle collisions of {\tt PG} {\tt ID}s.
The first {\tt PG} record in a chain (i.e. the one referred to by the {\tt PG} tag in a SAM record)
describes the most recent program that operated on the SAM record.
The next {\tt PG} record in the chain describes the next most recent program that operated on the SAM record. \\\cline{2-3}
& {\tt VN} & Program version \\\cline{1-3}
\multicolumn{2}{|l}{\tt @CO} & One-line text comment. Unordered multiple lines are allowed.\\
\multicolumn{2}{|l}{\tt @CO} & One-line text comment. Unordered multiple {\tt @CO} lines are allowed.\\
\cline{1-3}
\end{longtable}
\end{center}
@@ -179,24 +192,25 @@ \subsection{The alignment section: mandatory fields}
\item {\sf FLAG}: bitwise FLAG. Each bit is explained in the following
table (`*' means no bits are set):
\begin{center}\small
\begin{tabular}{rl}
\begin{tabular}{rcl}
\hline
Bit & Description\\
Bit & Chr\footnotemark[1] & Description\\
\hline
0x1 & template having multiple fragments in sequencing \\
0x2 & each fragment properly aligned according to the aligner \\
0x4 & fragment unmapped \\
0x8 & next fragment in the template unmapped \\
0x10 & {\sf SEQ} being reverse complemented \\
0x20 & {\sf SEQ} of the next fragment in the template being reversed \\
0x40 & the first fragment in the template \\
0x80 & the last fragment in the template \\
0x100 & secondary alignment\\
0x200 & not passing quality controls \\
0x400 & PCR or optical duplicate \\
0x1 & p&template having multiple fragments in sequencing \\
0x2 & P&each fragment properly aligned according to the aligner \\
0x4 & u&fragment unmapped \\
0x8 & U&next fragment in the template unmapped \\
0x10 & r& {\sf SEQ} being reverse complemented \\
0x20 & R& {\sf SEQ} of the next fragment in the template being reversed \\
0x40 & 1& the first fragment in the template \\
0x80 & 2& the last fragment in the template \\
0x100 &s& secondary alignment\\
0x200 &f& not passing quality controls \\
0x400 &d& PCR or optical duplicate \\
\hline
\end{tabular}
\end{center}
\footnotetext[1]{For human readability, some programs may use a string to represent {\sf FLAG}, but this is not formally defined in the SAM spec.}
\begin{itemize}
\item Bit 0x4 is the only reliable place to tell whether the fragment
is unmapped. If 0x4 is set, no assumptions can be made about {\sf
@@ -284,20 +298,19 @@ \subsection{The alignment section: mandatory fields}
sequence is not stored. If not a `*', the length of the sequence must
equal the sum of lengths of {\tt M/I/S/=/X} operations in {\sf CIGAR}.
An `=' denotes the base is identical to the reference base. No
assumptions can be made on the letter cases. Anything other than {\tt
A/C/G/T/=} is regarded as ambiguous base {\tt N}.
assumptions can be made on the letter cases.
\item {\sf QUAL}: ASCII of base QUALity plus 33 (same as the quality
string in the Sanger FASTQ format). A base quality is the phred-scaled
base error probability which equals $-10\log_{10}\Pr\{\mbox{base is
wrong}\}$. This field can be a `*' when quality is not stored. If
not a `*', {\sf SEQ} is not a `*' and the length of the quality string
not a `*', {\sf SEQ} must not be a `*' and the length of the quality string
ought to equal the length of {\sf SEQ}.
\end{enumerate}
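
The FLAG bit table and the QUAL encoding from the field list above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the one-character codes come from the `Chr` column, which the footnote says is not formally defined in the SAM spec.

```python
# (bit, informal one-character code, description), copied from the FLAG table.
FLAG_BITS = [
    (0x1, "p", "template having multiple fragments in sequencing"),
    (0x2, "P", "each fragment properly aligned according to the aligner"),
    (0x4, "u", "fragment unmapped"),
    (0x8, "U", "next fragment in the template unmapped"),
    (0x10, "r", "SEQ being reverse complemented"),
    (0x20, "R", "SEQ of the next fragment in the template being reversed"),
    (0x40, "1", "the first fragment in the template"),
    (0x80, "2", "the last fragment in the template"),
    (0x100, "s", "secondary alignment"),
    (0x200, "f", "not passing quality controls"),
    (0x400, "d", "PCR or optical duplicate"),
]


def flag_string(flag):
    """Informal string form of FLAG; '*' means no bits are set."""
    chars = "".join(char for bit, char, _ in FLAG_BITS if flag & bit)
    return chars or "*"


def describe_flag(flag):
    """List the table descriptions of every bit set in FLAG."""
    return [text for bit, _, text in FLAG_BITS if flag & bit]


def phred_scores(qual):
    """Decode a QUAL string: each character is the base quality plus 33."""
    return [ord(char) - 33 for char in qual]


def error_probability(quality):
    """Invert Q = -10 * log10(Pr{base is wrong})."""
    return 10 ** (-quality / 10)
```

A FLAG of 99 (0x63) sets bits 0x1, 0x2, 0x20 and 0x40, so `flag_string(99)` gives `pPR1`.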

\subsection{The alignment section: optional fields}
All optional fields are presented in the {\tt TAG:TYPE:VALUE} format
where {\tt TAG} is a two-character string that matches {\tt
/[A-Za-z][A-Za-z0-9]/}, {\tt TYPE} is a case-sensitive single letter which
/[A-Za-z][A-Za-z0-9]/}, and {\tt TYPE} is a case-sensitive single letter which
defines the format of {\tt VALUE}:
\begin{center}\small
\begin{tabular}{cll}
@@ -308,23 +321,24 @@ \subsection{The alignment section: optional fields}
i & {\tt [-+]?[0-9]+} & Signed 32-bit integer \\
f & {\tt [-+]?[0-9]*\char92.?[0-9]+([eE][-+]?[0-9]+)?} & Single-precision floating number \\
Z & {\tt [\,\,\,!-\char126]+} & Printable string, including space\\
H & {\tt [0-9A-F]+} & Hex string, high nybble first \\
H & {\tt [0-9A-F]+} & Byte array in the Hex format\footnotemark[1]\\
\hline
\end{tabular}
\footnotetext[1]{For example, a byte array {\tt \{0x1a,0xe3,0x1\}} corresponds to a Hex string `{\tt 1AE301}'.}
\end{center}
Each {\tt TAG} can only appear once in one alignment line. A {\tt TAG}
containing lowercase letters is reserved for end users.
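
As a rough illustration of the `TAG:TYPE:VALUE` layout, the sketch below parses optional fields for the four types whose rows are visible in the table above (`i`, `f`, `Z`, `H`). The function names are hypothetical, and converting `H` values to Python `bytes` is a choice made for this example only.

```python
import re

TAG_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]")

# VALUE patterns copied from the type table (only the rows shown above).
TYPE_PATTERNS = {
    "i": re.compile(r"[-+]?[0-9]+"),
    "f": re.compile(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?"),
    "Z": re.compile(r"[ !-~]+"),
    "H": re.compile(r"[0-9A-F]+"),
}

# How this sketch converts each VALUE into a Python object.
CONVERTERS = {"i": int, "f": float, "Z": str, "H": bytes.fromhex}


def parse_optional_field(field):
    """Parse one TAG:TYPE:VALUE field into (tag, type, converted value)."""
    tag, type_code, value = field.split(":", 2)  # VALUE may contain ':'
    if TAG_PATTERN.fullmatch(tag) is None:
        raise ValueError("bad tag: %r" % tag)
    pattern = TYPE_PATTERNS.get(type_code)
    if pattern is None or pattern.fullmatch(value) is None:
        raise ValueError("bad type or value in %r" % field)
    return tag, type_code, CONVERTERS[type_code](value)


def parse_optional_fields(fields):
    """Parse several fields, enforcing that each TAG appears only once."""
    parsed = {}
    for field in fields:
        tag, _, value = parse_optional_field(field)
        if tag in parsed:
            raise ValueError("duplicate tag: %s" % tag)
        parsed[tag] = value
    return parsed
```

With the footnote's example, a Hex string `1AE301` decodes to the byte array `{0x1a,0xe3,0x1}`.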

{\flushleft Predefined tags are shown in the following table. You can
{Predefined tags are shown in the following table. You can
freely add new tags, and if a new tag may be of general interest, you
can email {\tt samtools-help@lists.sourceforge.net} to add the new tag
can email {\tt samtools-devel@lists.sourceforge.net} to add the new tag
to the specification. Note that tags started with `{\tt X}', `{\tt Y}'
and `{\tt Z}' or tags containing lowercase letters in either position are reserved for local use and will not be formally
defined in any future version of this specification.}
\begin{center}\small
\begin{tabular}{ccp{12.5cm}}
\hline
{\bf Tag} & {\bf Type} & {\bf Description} \\
{\bf Tag\footnotemark[1]} & {\bf Type} & {\bf Description} \\
\hline
{\tt X?} & ? & Reserved fields for end users (together with {\tt Y?} and {\tt Z?}) \\
{\tt AM} & i & The smallest template-independent mapping quality of fragments in the rest \\
@@ -339,13 +353,14 @@ \subsection{The alignment section: optional fields}
{\tt E2} & Z & The 2nd most likely base calls. Same encoding and same length as {\sf QUAL}.\\
{\tt FI} & i & The index of fragment in the template.\\
{\tt FS} & Z & Fragment suffix.\\
{\tt FZ} & H & Flow signal intensities on the original strand of the read, stored as {\tt (uint16\_t) round(value * 100.0)}. \\
{\tt LB} & Z & Library. Value to be consistent with the header {\tt RG-LB} tag if {\tt @RG} is present.\\
{\tt H0} & i & Number of perfect hits\\
{\tt H1} & i & Number of 1-difference hits (see also {\tt NM})\\
{\tt H2} & i & Number of 2-difference hits \\
{\tt HI} & i & Query hit index, indicating the alignment record is the i-th one stored in SAM\\
{\tt IH} & i & Number of stored alignments in SAM that contain the query in the current record\\
{\tt MD} & Z & String for mismatching positions. \emph{Regex}: {\tt [0-9]+(([ACGTN]|\char92\char94[ACGTN]+)[0-9]+)*}\,$^1$\\
{\tt MD} & Z & String for mismatching positions. \emph{Regex}: {\tt [0-9]+(([A-Z]|\char92\char94[A-Z]+)[0-9]+)*}\footnotemark[2]\\
{\tt MQ} & i & Mapping quality of the mate/next fragment \\
{\tt NH} & i & Number of reported alignments that contain the query in the current record\\
{\tt NM} & i & Edit distance to the reference, including ambiguous bases but excluding clipping\\
@@ -365,18 +380,16 @@ \subsection{The alignment section: optional fields}
\hline
\end{tabular}
\end{center}
\begin{enumerate}
\item The MD field aims to achieve SNP/indel calling without looking at
\footnotetext[1]{The {\tt GS}, {\tt GC}, {\tt GQ}, {\tt MF}, {\tt S2}
and {\tt SQ} are reserved for backward compatibility.}
\footnotetext[2]{The MD field aims to achieve SNP/indel calling without looking at
the reference. For example, a string `{\tt 10A5\char94AC6}' means from
the leftmost reference base in the alignment, there are 10 matches
followed by an A on the reference which is different from the aligned
read base; the next 5 reference bases are matches followed by a 2bp
deletion from the reference; the deleted sequence is AC; the last 6
bases are matches. The {\tt MD} field ought to match the {\sf CIGAR}
string.
\item The {\tt GS}, {\tt GC}, {\tt GQ}, {\tt MF}, {\tt S2}
and {\tt SQ} are reserved for backward compatibility.
\end{enumerate}
string.}
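
The MD example in the footnote can be worked through in code. The sketch below is one possible reading, with hypothetical function names: it validates a string against the updated MD regex and splits it into matches, mismatches and deletions.

```python
import re

# Updated MD regex from the tag table (any uppercase letter is allowed).
MD_PATTERN = re.compile(r"[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*")
# One token: a run of matches, a deletion (^BASES), or a single mismatch.
MD_TOKEN = re.compile(r"([0-9]+)|\^([A-Z]+)|([A-Z])")


def parse_md(md):
    """Split an MD string into (kind, value) operations."""
    if MD_PATTERN.fullmatch(md) is None:
        raise ValueError("malformed MD string: %r" % md)
    operations = []
    for token in MD_TOKEN.finditer(md):
        number, deleted, mismatch = token.groups()
        if number is not None:
            operations.append(("match", int(number)))
        elif deleted is not None:
            operations.append(("deletion", deleted))
        else:
            operations.append(("mismatch", mismatch))
    return operations


def md_reference_length(md):
    """Number of reference bases covered by the MD string."""
    total = 0
    for kind, value in parse_md(md):
        # Matches count by their number; other operations by their bases.
        total += value if kind == "match" else len(value)
    return total
```

For `10A5^AC6` this gives 10 matches, a mismatch against reference base A, 5 matches, a deletion of AC, and 6 matches, covering 24 reference bases in total.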

\pagebreak

@@ -440,7 +453,12 @@ \subsection{The BGZF compression format}
format. The goal of BGZF is to provide good compression while allowing
efficient random access to the BAM file for indexed queries. The BGZF
format is `gunzip compatible', in the sense that a compliant gunzip
utility can decompress a BGZF compressed file.
utility can decompress a BGZF compressed file\footnote{It is worth noting that there is a known bug in the Java {\sf
GZIPInputStream} class that concatenated gzip archives cannot be
successfully decompressed by this class. BGZF files can be created and
manipulated using the built-in Java {\sf util.zip} package, but naive
use of {\sf GZIPInputStream} on a BGZF file will not work due to this
bug.}.

A BGZF archive is a series of concatenated BGZF blocks. Each BGZF block
is itself a spec-compliant gzip archive which contains an "extra field"
@@ -497,7 +515,7 @@ \subsection{The BGZF compression format}

BGZF files support random access through the BAM file index. To achieve
this, the BAM file index uses \emph{virtual file offsets} into the BGZF
file. Each virtual file offset is 64 bits, defined as: {\tt
file. Each virtual file offset is an unsigned 64-bit integer, defined as: {\tt
coffset\char60\char60 16\char124uoffset}, where {\tt coffset} is an
unsigned byte offset into the BGZF file to the beginning of a BGZF
block, and {\tt uoffset} is an unsigned byte offset into the
@@ -506,15 +524,8 @@ \subsection{The BGZF compression format}
and addition between a virtual offset and an integer are both
disallowed.
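
The virtual file offset layout, `coffset<<16|uoffset`, can be sketched as below. The helper names are hypothetical. The 48-bit limit on `coffset` is inferred from the offset being an unsigned 64-bit integer with 16 bits taken by `uoffset`.

```python
def make_virtual_offset(coffset, uoffset):
    """Pack a block offset and an in-block offset into one 64-bit value."""
    if not 0 <= uoffset < (1 << 16):
        raise ValueError("uoffset must fit in 16 bits")
    if not 0 <= coffset < (1 << 48):
        raise ValueError("coffset must fit in 48 bits")
    return (coffset << 16) | uoffset


def split_virtual_offset(virtual_offset):
    """Recover (coffset, uoffset) from a virtual file offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF
```

Because `coffset` occupies the high bits, comparing two virtual offsets as plain integers orders them first by block position and then by position within the block. Adding an integer to a packed value could carry out of the low 16 bits into `coffset`, which is one way to see why such arithmetic is disallowed.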

It is worth noting that there is a known bug in the Java {\sf
GZIPInputStream} class that concatenated gzip archives cannot be
successfully decompressed by this class. BGZF files can be created and
manipulated using the built-in Java {\sf util.zip} package, but naive
use of {\sf GZIPInputStream} on a BGZF file will not work due to this
bug.

\subsection{The BAM format}
BAM is compressed in the BGZF format. All integers in BAM are
BAM is compressed in the BGZF format. All multi-byte numbers in BAM are
little-endian, regardless of the machine endianness. The format is
formally described in the following table where values in brackets are
the default when the corresponding information is not available; an
@@ -546,15 +557,22 @@ \subsection{The BAM format}
& \multicolumn{2}{l|}{\sf tlen} & Template length ($=\underline{\sf TLEN}$) & {\tt int32\_t} & [0] \\\cline{2-6}
& \multicolumn{2}{l|}{\sf read\_name} & Read name, {\tt NULL} terminated (\underline{\sf QNAME} plus a trailing `{\tt \char92 0}') & {\tt char[{\sf l\_read\_name}]} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf cigar} & CIGAR: {\tt {\sf op\_len}\char60\char60 4\char124{\sf op}}. `{\tt MIDNSHP\char61X}'$\to$`012345678' & {\tt uint32\_t[{\sf n\_cigar\_op}]} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf seq} & 4-bit encoded read: `{\tt =ACGTN}'$\to0,1,2,4,8,15$; high nybble first (1st base in the highest 4-bit of the 1st byte) & {\tt uint8\_t[({\sf l\_seq}+1)/2]} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf seq} & 4-bit encoded read: `{\tt =ACMGRSVTWYHKDBN}'$\to[0,15]$; other characters mapped to `{\tt N}'; high nybble first (1st base in the highest 4-bit of the 1st byte) & {\tt uint8\_t[({\sf l\_seq}+1)/2]} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf qual} & Phred base quality (a sequence of {\tt 0xFF} if absent) & {\tt char[{\sf l\_seq}]} & \\\cline{2-6}
& \multicolumn{5}{c|}{\textcolor{gray}{\it List of auxiliary data (until the end of the alignment block)}} \\\cline{3-6}
& & {\sf tag} & Two-character tag & {\tt char[2]} & \\\cline{3-6}
& & {\sf val\_type} & Value type: {\tt AcCsSiIfZH}. An integer may be stored as `{\tt cCsSiI}' depending on the magnitude of the integer. In SAM, all integer types are mapped to `{\tt i}'. & {\tt char} & \\\cline{3-6}
& & {\sf value} & Tag value & by {\sf val\_type} &\\
& & {\sf val\_type} & Value type: {\tt AcCsSiIfZHB}\footnotemark[1]$^,$\footnotemark[2] & {\tt char} & \\\cline{3-6}
& & {\sf value} & Tag value & (by {\sf val\_type}) &\\
\cline{1-6}
\end{tabular}}
\end{table}
\footnotetext[1]{An integer may be stored as one of `{\tt cCsSiI}' in BAM, representing {\tt int8\_t}, {\tt uint8\_t},
{\tt int16\_t}, {\tt uint16\_t}, {\tt int32\_t} and {\tt uint32\_t}, respectively. In SAM, all integer types are mapped to {\tt int32\_t}.}
\footnotetext[2]{BAM uses two types `{\tt H}' and `{\tt B}' to store a byte string. When type {\tt H} is in use,
the byte string is stored as a {\tt NULL}-terminated Hex string. When type {\tt B} is in use,
the first 4 bytes in `{\sf value}', in the little-endian byte order, give the number of bytes in the byte string;
the byte string is kept in the following bytes. Both BAM type `{\tt H}' and `{\tt B}' are mapped to
a single SAM type `{\tt H}'.}
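
The `seq` and `cigar` encodings in the table above can be illustrated with a short sketch. The function names are hypothetical. Upper-casing the input before lookup is an assumption of this example, based on the statement that no assumptions can be made about letter case in {\sf SEQ}.

```python
import struct

SEQ_CODES = "=ACMGRSVTWYHKDBN"  # index in this string is the 4-bit code
CIGAR_OPS = "MIDNSHP=X"         # index in this string is the operation code


def encode_seq(seq):
    """Pack bases two per byte, the first base in the high nybble."""
    codes = []
    for base in seq.upper():
        index = SEQ_CODES.find(base)
        # Characters outside the table are mapped to 'N' (code 15).
        codes.append(index if index >= 0 else 15)
    if len(codes) % 2:
        codes.append(0)  # pad the unused low nybble of the last byte
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def encode_cigar(operations):
    """Encode (op_len, op) pairs as little-endian uint32: op_len<<4|op."""
    words = [(op_len << 4) | CIGAR_OPS.index(op) for op_len, op in operations]
    return struct.pack("<%dI" % len(words), *words)
```

For example, `ACGT` maps to codes 1, 2, 4, 8 and packs into the two bytes `0x12 0x48`, and a CIGAR of `10M5S` becomes the words 160 and 84.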

\pagebreak
