- Learning Objective
This article gives you an overview of formatted file I/O in SeqAn.
- Difficulty
Basic
- Duration
30 min
- Prerequisites
tutorial-datastructures-sequences
Most file formats in bioinformatics are structured as lists of records. Often, they start with a header that itself contains different header records. For example, the Sequence Alignment/Map format and its binary variant (SAM/BAM) start with a header that lists all contigs of the reference sequence. The BAM header is followed by a list of BAM alignment records that contain query sequences aligned to some reference contig.
SeqAn lets you read or write record-structured files through two types of classes: FormattedFileIn and FormattedFileOut. Classes of type FormattedFileIn read files, whereas classes of type FormattedFileOut write files. Note that these classes do not allow reading and writing the same file at the same time.
These types of classes provide the following I/O operations on formatted files:
- Open a file given its filename or attach to an existing stream like std::cin or std::cout.
- Guess the file format from the file content or filename extension.
- Access compressed or uncompressed files transparently.
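To make the guessing step more concrete, here is a minimal sketch of how a format and a compression scheme could be derived from a filename or from the first bytes of a file. It uses only the C++ standard library. The helper names (endsWith, guessCompression, guessFormat, guessCompressionFromContent) are hypothetical and invented for this illustration; this is not how SeqAn implements the feature. The only facts assumed are that gzip data starts with the bytes 0x1f 0x8b and bzip2 data starts with "BZh".

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers for illustration only; not SeqAn's implementation.

// Returns true if `name` ends with `suffix`.
bool endsWith(std::string const & name, std::string const & suffix)
{
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Guesses the compression from the filename: "gz", "bz2", or "" (none).
std::string guessCompression(std::string const & name)
{
    if (endsWith(name, ".gz"))
        return "gz";
    if (endsWith(name, ".bz2"))
        return "bz2";
    return "";
}

// Guesses the record format from the extension that remains after
// stripping a compression suffix, e.g. "reads.sam.gz" -> "sam".
std::string guessFormat(std::string name)
{
    std::string const compression = guessCompression(name);
    if (!compression.empty())
        name.erase(name.size() - compression.size() - 1);  // drop ".gz" / ".bz2"

    std::string::size_type const dot = name.rfind('.');
    if (dot == std::string::npos || dot + 1 == name.size())
        return "";  // no extension: the format is unknown
    return name.substr(dot + 1);
}

// Guesses the compression from the first bytes of the file content.
std::string guessCompressionFromContent(std::string const & head)
{
    if (head.size() >= 2 &&
        static_cast<unsigned char>(head[0]) == 0x1f &&
        static_cast<unsigned char>(head[1]) == 0x8b)
        return "gz";   // gzip magic bytes
    if (head.compare(0, 3, "BZh") == 0)
        return "bz2";  // bzip2 magic bytes
    return "";
}
```

In this sketch, guessFormat("reads.sam.gz") yields "sam" and guessCompression("reads.sam.gz") yields "gz". A name without an extension yields an empty string, which is the situation in which a format has to be named explicitly.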
SeqAn supports the following file formats through these classes:
- SeqFileIn, SeqFileOut (see tutorial tutorial-io-sequence-io)
- BamFileIn, BamFileOut (see tutorial tutorial-io-sam-bam-io)
- BedFileIn, BedFileOut (see tutorial tutorial-io-bed-io)
- VcfFileIn, VcfFileOut (see tutorial tutorial-io-vcf-io)
- GffFileIn, GffFileOut (see tutorial tutorial-io-gff-and-gtf-io)
- RoiFileIn, RoiFileOut
- SimpleIntervalsFileIn, SimpleIntervalsFileOut
- UcscFileIn, UcscFileOut
Warning
Access to compressed files relies on external libraries. For instance, you need zlib installed to read .gz files and libbz2 to read .bz2 files. If you are using Linux or OS X and followed the tutorial-getting-started tutorial closely, you should already have installed the necessary libraries. On Windows, you will need to follow infra-use-install-dependencies to get the necessary libraries.
You can check whether these libraries are installed by running CMake again. Simply call cmake . in your build directory. At the end of the output, there is a section "SeqAn Features". If it lists ZLIB - FOUND and BZIP2 - FOUND, you can use zlib and libbz2 in your programs.
This tutorial shows the basic functionality provided by any class of type FormattedFileIn or FormattedFileOut. In particular, it uses the classes BamFileIn and BamFileOut as concrete types. The class BamFileIn reads files in SAM or BAM format, whereas the class BamFileOut writes them. Nonetheless, this functionality is independent of the particular file format and thus applies to all record-based file formats supported by SeqAn.
The demo application shown here is a simple BAM to SAM converter.
Support for a specific format comes by including a specific header file. In this case, we include the BAM header file:
demos/tutorial/file_io_overview/example1.cpp
Classes of type FormattedFileIn and FormattedFileOut provide the functions FormattedFile#open and FormattedFile#close to open and close files.
A file can be opened by passing the filename to the constructor:
demos/tutorial/file_io_overview/example1.cpp
Alternatively, a file can be opened after construction by calling FormattedFile#open:
demos/tutorial/file_io_overview/example1.cpp
Note that a file is closed automatically when the FormattedFileIn or FormattedFileOut object goes out of scope. If needed, a file can also be closed manually by calling FormattedFile#close.
To access the header, we need an object representing the format-specific header. In this case, we use an object of type BamHeader. The content of this object can be ignored for now; it is covered in the tutorial-io-sam-bam-io tutorial.
demos/tutorial/file_io_overview/example1.cpp
The function FormattedFileIn#readHeader reads the header from the input BAM file and FormattedFileOut#writeHeader writes it to the SAM output file.
Again, to access records, we need an object representing format-specific information. In this case, we use an object of type BamAlignmentRecord. Each call to FormattedFileIn#readRecord reads one record from the BAM input file and moves the BamFileIn forward. Each call to FormattedFileOut#writeRecord writes the record just read to the SAM output file. We check whether the end of the input file has been reached by calling FormattedFile#atEnd.
demos/tutorial/file_io_overview/example1.cpp
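The header-then-records pattern described above is not specific to BAM. As a library-independent illustration, the following sketch copies a toy text file in which header lines start with '@' (as they do in SAM) and every other line is one record. It works on standard C++ streams only. CopyStats and copyHeaderAndRecords are hypothetical names made up for this sketch and are not part of SeqAn.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Toy stand-in for a record-structured text file: header lines start
// with '@' (as in SAM), every following line is one record.
// The names below are hypothetical and not part of SeqAn.
struct CopyStats
{
    std::size_t headerLines = 0;
    std::size_t records = 0;
};

CopyStats copyHeaderAndRecords(std::istream & in, std::ostream & out)
{
    CopyStats stats;
    std::string line;

    // 1. Copy the header: consume lines while the next character is '@'.
    while (in.peek() == '@' && std::getline(in, line))
    {
        out << line << '\n';
        ++stats.headerLines;
    }

    // 2. Copy records one at a time until the input is exhausted.
    while (std::getline(in, line))
    {
        if (line.empty())
            continue;  // skip blank lines
        out << line << '\n';
        ++stats.records;
    }
    return stats;
}
```

Because the function takes std::istream and std::ostream references, the same code can be called with a std::istringstream in a test, with file streams, or with std::cin and std::cout.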
Our small BAM to SAM conversion demo is ready. The tool still lacks error handling, reading from standard input and writing to standard output. You are now going to add these features.
We distinguish between two types of errors: low-level file I/O errors and high-level file format errors. File I/O errors can affect both input and output files. Examples of such errors are: the file permissions forbid a certain operation, the file does not exist, there is a disk reading error, a file gets deleted while we are reading from it, or there is a physical error in the hard disk. Conversely, file format errors can only affect input files: such errors arise whenever the content of the input file is incorrect or damaged. Error handling in SeqAn is implemented by means of exceptions.
All FormattedFileIn and FormattedFileOut constructors (FormattedFile#FormattedFile) and functions throw exceptions of type IOError to signal low-level file I/O errors. Therefore, it is sufficient to catch these exceptions to handle I/O errors properly.
There is only one exception to this rule: the function FormattedFile#open returns a bool to indicate whether the file was opened successfully.
- Type
Application
- Objective
Improve the program above to detect file I/O errors.
- Hint
Use the IOError class.
- Solution
demos/tutorial/file_io_overview/solution1.cpp
Classes of type FormattedFileIn throw exceptions of type ParseError to signal high-level input file format errors.
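The two error categories map naturally onto two exception types that a caller catches separately. The following is a minimal sketch of that pattern using only the standard library. SketchIOError, SketchParseError, readPosition and classify are hypothetical stand-ins invented for this illustration; they are not SeqAn's IOError and ParseError classes, and the toy record layout ("name<TAB>position") is an assumption of the sketch.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Stand-in exception types for this sketch only; SeqAn's own classes
// are IOError and ParseError.
struct SketchIOError : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

struct SketchParseError : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Reads one "name<TAB>position" record and returns the position.
// Throws SketchIOError if nothing can be read and SketchParseError
// if the line does not have the expected layout.
int readPosition(std::istream & in)
{
    std::string line;
    if (!std::getline(in, line))
        throw SketchIOError("could not read from input");

    std::string::size_type const tab = line.find('\t');
    if (tab == std::string::npos)
        throw SketchParseError("missing tab in record: " + line);

    try
    {
        return std::stoi(line.substr(tab + 1));
    }
    catch (std::exception const &)
    {
        throw SketchParseError("position is not a number: " + line);
    }
}

// Maps each error category to a distinct result, as a caller might map
// them to distinct messages or exit codes.
std::string classify(std::string const & content)
{
    std::istringstream in(content);
    try
    {
        readPosition(in);
        return "ok";
    }
    catch (SketchIOError const &)
    {
        return "io error";
    }
    catch (SketchParseError const &)
    {
        return "format error";
    }
}
```

Keeping the two catch clauses separate lets a program report a damaged input file differently from an unreadable one, which is the distinction this section draws.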
- Type
Application
- Objective
Improve the program above to detect file format errors.
- Solution
demos/tutorial/file_io_overview/solution2.cpp
The FormattedFileIn and FormattedFileOut constructors (FormattedFile#FormattedFile) accept not only filenames, but also standard C++ streams or any other class implementing the StreamConcept. For instance, you can pass std::cin to any FormattedFileIn constructor and std::cout to any FormattedFileOut constructor.
Note
When writing to std::cout, classes of type FormattedFileOut cannot guess the file format from the filename extension. Therefore, the file format has to be specified explicitly by providing a tag, e.g. FileFormats#Sam or FileFormats#Bam.
- Type
Application
- Objective
Improve the program above to write to standard output.
- Solution
demos/tutorial/file_io_overview/solution3.cpp
Running this program results in the following output.
demos/tutorial/file_io_overview/solution3.cpp.stdout
If you want, you can now have a look at the API documentation of the FormattedFile class.
You can now read the tutorials for the supported file formats:
tutorial-io-sequence-io
tutorial-io-sam-bam-io
tutorial-io-vcf-io
tutorial-io-bed-io
tutorial-io-gff-and-gtf-io