File I/O Overview

Learning Objective

This article gives you an overview of formatted file I/O in SeqAn.

Difficulty

Basic

Duration

30 min

Prerequisites

tutorial-datastructures-sequences

Overview

Most file formats in bioinformatics are structured as lists of records. Often, they start with a header that itself contains different header records. For example, a file in the Sequence Alignment/Map format (SAM) or its binary counterpart (BAM) starts with a header that lists all contigs of the reference sequence. The header is followed by a list of alignment records that contain query sequences aligned to some reference contig.

Formatted Files

SeqAn lets you read or write record-structured files through two types of classes: FormattedFileIn and FormattedFileOut. Classes of type FormattedFileIn read files, whereas classes of type FormattedFileOut write files. Note that these classes do not allow reading and writing the same file at the same time.

These types of classes provide the following I/O operations on formatted files:

  1. Open a file given its filename or attach to an existing stream like std::cin or std::cout.
  2. Guess the file format from the file content or filename extension.
  3. Access compressed or uncompressed files transparently.
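
The second operation, guessing the format from the filename extension, can be sketched in plain C++ as follows. This is not SeqAn's implementation: the FileFormat enum and the function names are hypothetical, the sketch only knows the .sam and .bam extensions, and the comparison is case-sensitive. It also ignores a trailing .gz or .bz2 suffix, which is one plausible way to combine format guessing with compressed files.

```cpp
#include <string>

// Hypothetical format tags for this sketch; SeqAn uses its own tag types.
enum class FileFormat { Unknown, Sam, Bam };

// Returns true if `name` ends with `suffix`.
bool endsWith(std::string const & name, std::string const & suffix)
{
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Guess the format from the filename extension, ignoring a trailing
// compression suffix such as ".gz" or ".bz2".
FileFormat guessFormatFromExtension(std::string name)
{
    if (endsWith(name, ".gz"))
        name.erase(name.size() - 3);
    else if (endsWith(name, ".bz2"))
        name.erase(name.size() - 4);

    if (endsWith(name, ".sam"))
        return FileFormat::Sam;
    if (endsWith(name, ".bam"))
        return FileFormat::Bam;
    return FileFormat::Unknown;
}
```

A name without a recognizable extension yields FileFormat::Unknown. This is the situation in which a format has to be given explicitly, for example when writing to a stream that has no filename at all.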

SeqAn provides formatted file classes for the following file formats:

  • SeqFileIn, SeqFileOut (see Tutorial tutorial-io-sequence-io)
  • BamFileIn, BamFileOut (see Tutorial tutorial-io-sam-bam-io)
  • BedFileIn, BedFileOut (see Tutorial tutorial-io-bed-io)
  • VcfFileIn, VcfFileOut (see Tutorial tutorial-io-vcf-io)
  • GffFileIn, GffFileOut (see Tutorial tutorial-io-gff-and-gtf-io)
  • RoiFileIn, RoiFileOut
  • SimpleIntervalsFileIn, SimpleIntervalsFileOut
  • UcscFileIn, UcscFileOut

Warning

Access to compressed files relies on external libraries. For instance, you need to have zlib installed for reading .gz files and libbz2 for reading .bz2 files. If you are using Linux or OS X and you followed the tutorial-getting-started tutorial closely, then you should have already installed the necessary libraries. On Windows, you will need to follow infra-use-install-dependencies to get the necessary libraries.

You can check whether you have installed these libraries by running CMake again. Simply call cmake . in your build directory. At the end of the output, there will be a section "SeqAn Features". If you can read ZLIB - FOUND and BZIP2 - FOUND then you can use zlib and libbz2 in your programs.

Basic I/O

This tutorial shows the basic functionality provided by any class of type FormattedFileIn or FormattedFileOut. In particular, it uses the classes BamFileIn and BamFileOut as concrete types. The class BamFileIn reads files in SAM or BAM format, whereas the class BamFileOut writes them. Nonetheless, this functionality is independent of the particular file format and thus valid for all record-based file formats supported by SeqAn.

The demo application shown here is a simple BAM to SAM converter.

Includes

Support for a specific format comes by including a specific header file. In this case, we include the BAM header file:

demos/tutorial/file_io_overview/example1.cpp

Opening and Closing Files

Classes of type FormattedFileIn and FormattedFileOut provide the functions FormattedFile#open and FormattedFile#close to open and close files.

A file can be opened by passing the filename to the constructor:

demos/tutorial/file_io_overview/example1.cpp

Alternatively, a file can be opened after construction by calling FormattedFile#open:

demos/tutorial/file_io_overview/example1.cpp

Note that any file is closed automatically when the FormattedFileIn or FormattedFileOut object goes out of scope. If needed, a file can also be closed manually by calling FormattedFile#close.

Accessing the Header

To access the header, we need an object representing the format-specific header. In this case, we use an object of type BamHeader. The content of this object can be ignored for now; it is covered in the tutorial-io-sam-bam-io tutorial.

demos/tutorial/file_io_overview/example1.cpp

The function FormattedFileIn#readHeader reads the header from the input BAM file and FormattedFileOut#writeHeader writes it to the SAM output file.

Accessing the Records

Again, to access records, we need an object representing format-specific information. In this case, we use an object of type BamAlignmentRecord. Each call to FormattedFileIn#readRecord reads one record from the BAM input file and moves the BamFileIn forward. Each call to FormattedFileOut#writeRecord writes the record just read to the SAM output file. We check whether the end of the input file has been reached by calling FormattedFile#atEnd.

demos/tutorial/file_io_overview/example1.cpp

Our small BAM to SAM conversion demo is ready. The tool still lacks error handling, reading from standard input and writing to standard output. You are now going to add these features.

Error Handling

We distinguish between two types of errors: low-level file I/O errors and high-level file format errors. File I/O errors can affect both input and output files. Examples of such errors are: the file permissions forbid a certain operation, the file does not exist, there is a disk read error, a file gets deleted while we are reading from it, or there is a physical error on the hard disk. Conversely, file format errors can only affect input files: such errors arise whenever the content of the input file is incorrect or damaged. Error handling in SeqAn is implemented by means of exceptions.

I/O Errors

All FormattedFileIn and FormattedFileOut constructors and functions throw exceptions of type IOError to signal low-level file I/O errors. Therefore, it is sufficient to catch these exceptions to handle I/O errors properly.

There is only one exception to this rule. The function FormattedFile#open does not throw; it returns a bool to indicate whether the file was opened successfully.
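
The same two ways of reporting an open failure, a return value versus an exception, also exist in the C++ standard library. The following sketch uses std::ifstream only as an analogy; it does not call SeqAn, and tryOpen and openOrThrow are hypothetical helper names.

```cpp
#include <fstream>
#include <string>

// Style 1: report failure through the return value.
// The caller must check the result explicitly.
bool tryOpen(std::ifstream & in, std::string const & path)
{
    in.open(path);
    return in.is_open();
}

// Style 2: report failure through an exception.
// With failbit enabled in the exception mask, a failed open() throws
// std::ios_base::failure instead of failing silently.
void openOrThrow(std::ifstream & in, std::string const & path)
{
    in.exceptions(std::ifstream::failbit | std::ifstream::badbit);
    in.open(path);
}
```

With the first style, a forgotten check goes unnoticed until a later read fails. With the second style, the error cannot be ignored, but the call has to be wrapped in a try block that catches std::ios_base::failure.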

Assignment 1

Type

Application

Objective

Improve the program above to detect file I/O errors.

Hint

Use the IOError class.

Solution

demos/tutorial/file_io_overview/solution1.cpp

Format Errors

Classes of type FormattedFileIn throw exceptions of type ParseError to signal high-level input file format errors.
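
To illustrate what a format error is, as opposed to an I/O error, here is a minimal sketch in plain C++. It relies on the fact that a SAM alignment line has 11 mandatory tab-separated fields and rejects lines with fewer. The function names are hypothetical, and std::runtime_error is only a stand-in for SeqAn's ParseError; SeqAn's real parser checks far more than the field count.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Count the tab-separated fields of one line.
std::size_t countFields(std::string const & line)
{
    std::size_t fields = 1;
    for (char c : line)
        if (c == '\t')
            ++fields;
    return fields;
}

// Throw if the line cannot be a SAM alignment record because it has fewer
// than the 11 mandatory fields. std::runtime_error is a stand-in here,
// not SeqAn's ParseError.
void checkSamRecordLine(std::string const & line)
{
    std::size_t const fields = countFields(line);
    if (fields < 11)
        throw std::runtime_error("expected at least 11 fields, found " +
                                 std::to_string(fields));
}
```

Note that the file itself was read without any problem in this scenario. The error lies in the content, which is why it can only occur for input files.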

Assignment 2

Type

Application

Objective

Improve the program above to detect file format errors.

Solution

demos/tutorial/file_io_overview/solution2.cpp

Streams

The FormattedFileIn and FormattedFileOut constructors accept not only filenames, but also standard C++ streams or any other class implementing the Stream concept (StreamConcept). For instance, you can pass std::cin to any FormattedFileIn constructor and std::cout to any FormattedFileOut constructor.

Note

When writing to std::cout, classes of type FormattedFileOut cannot guess the file format from the filename extension. Therefore, the file format has to be specified explicitly by providing a tag, e.g. FileFormats#Sam or FileFormats#Bam.

Assignment 3

Type

Application

Objective

Improve the program above to write to standard output.

Solution

demos/tutorial/file_io_overview/solution3.cpp

Running this program results in the following output.

demos/tutorial/file_io_overview/solution3.cpp.stdout

Next Steps

If you want, you can now have a look at the API documentation of the FormattedFile class.

You can now read the tutorials for the supported file formats:

  • tutorial-io-sequence-io
  • tutorial-io-sam-bam-io
  • tutorial-io-vcf-io
  • tutorial-io-bed-io
  • tutorial-io-gff-and-gtf-io