MPEG-G Genomic Information Database

Purpose

This is the access point to the MPEG-G Genomic Information Database ("Database"). The contents of the Database are listed in the file mpeg-g-gidb.xlsx, which contains four tables:

  1. Sequencing Data Collection
  2. Data origins
  3. Reference sequences
  4. Conformance Test Items

We will refer to these tables as "Table 1", "Table 2", "Table 3" and "Table 4", respectively.
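As a quick way to check which tables are present, the sketch below lists the worksheet names of an .xlsx file using only the Python standard library. It assumes that the four tables are stored as separate worksheets and that the file uses the common (transitional) SpreadsheetML namespace; neither is confirmed here, and the function name `list_sheet_names` is purely illustrative.

```python
import xml.etree.ElementTree as ET
import zipfile
from typing import List

# Namespace used by most .xlsx writers (assumption: the file is not "strict" OOXML).
SPREADSHEET_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def list_sheet_names(xlsx_path: str) -> List[str]:
    """Return the worksheet names of an .xlsx file in workbook order."""
    # An .xlsx file is a ZIP archive; the sheet list lives in xl/workbook.xml.
    with zipfile.ZipFile(xlsx_path) as archive:
        workbook_xml = archive.read("xl/workbook.xml")
    root = ET.fromstring(workbook_xml)
    return [sheet.attrib["name"] for sheet in root.iter(f"{{{SPREADSHEET_NS}}}sheet")]


if __name__ == "__main__":
    # Assumes mpeg-g-gidb.xlsx has been downloaded to the working directory.
    for index, name in enumerate(list_sheet_names("mpeg-g-gidb.xlsx"), start=1):
        print(f"Table {index}: {name}")
```

For reading the table contents themselves, a spreadsheet library is likely more convenient than parsing the archive by hand.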

The Database consists of two parts:

  1. Sequencing Data Collection: a collection of statistically meaningful sequencing data for assessing the performance of genomic information compression technologies. Besides the sequencing data (listed in Table 1), the Sequencing Data Collection contains a set of reference sequences (listed in Table 3) and supplementary data for variant calling experiments (listed only on the server).

  2. Conformance Test Items: a set of bitstreams for conformance testing according to ISO/IEC 23092-5. Table 4 lists the Conformance Test Items.

Further work on the Database is discussed on the MPEG AHG on Genomic Information Representation email reflector: genome_compression@listes.epfl.ch.

Database access

The public access point to the Database is: https://github.com/voges/mpeg-g-gidb.

The Database is provided as a list of URLs (in Table 2) pointing to the public repositories where the data are available.

If some resources are not available, a copy of the data can be requested from mpeg-g@tnt.uni-hannover.de with the following details:

Data classes

To make the Database statistically meaningful, sequencing data with different characteristics are considered.

Experiment types

The Database includes sequencing data generated for different experiment types:

  • Whole genome sequencing (WGS)
    • Including simulated human WGS data generated with ART [1]
    • Including cancer genome sequencing data
  • Metagenomics sequencing
  • RNA sequencing (RNA-Seq)

Organisms

The Database includes sequencing data from the following species:

  • Animalia
    • D. melanogaster
    • H. sapiens
  • Plantae
    • T. cacao
  • Fungi
    • S. cerevisiae
  • Bacteria
    • E. coli (different strains)
    • P. aeruginosa
  • Viruses
    • Phi X 174

Sequencing technologies

The Database includes sequencing data generated with the following sequencing technologies:

  • Sequencing by synthesis
    • Illumina/Solexa Genome Analyzer
    • Illumina Genome Analyzer IIx
    • Illumina MiSeq
    • Illumina HiSeq 2000
    • Illumina HiSeq X Ten
    • Illumina NovaSeq 6000
  • Single molecule real time sequencing
    • Pacific Biosciences SMRT (PacBio)
  • Nanopore sequencing
    • Oxford Nanopore MinION
  • Ion semiconductor sequencing
    • Ion Torrent PGM

Data formats

Unmapped sequencing data are provided as gzipped FASTQ files. FASTQ files are usually manipulated with custom scripts written in Bash, Python, Perl, etc.
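As an illustration of such a custom script, the following Python sketch reads a gzipped FASTQ file and reports the number of reads and bases. It assumes the common four-lines-per-record layout (no line-wrapped sequences); the names `FastqRecord`, `read_fastq` and `summarize_fastq`, and the file name in the usage line, are illustrative and not part of the Database.

```python
import gzip
from typing import Iterator, NamedTuple, TextIO, Tuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records or at the end
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:].rstrip("\r\n"), sequence, quality)


def summarize_fastq(path: str) -> Tuple[int, int]:
    """Return (number of reads, total number of bases) of a gzipped FASTQ file."""
    reads = 0
    bases = 0
    with gzip.open(path, "rt") as handle:  # "rt" decompresses to text on the fly
        for record in read_fastq(handle):
            reads += 1
            bases += len(record.sequence)
    return reads, bases


if __name__ == "__main__":
    # Hypothetical file name; substitute a file listed in Table 1.
    print(summarize_fastq("example.fastq.gz"))
```

Streaming the file record by record keeps memory use constant, which matters for WGS-sized inputs.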

Mapped sequencing data are provided as BAM files. BAM files can be transcoded to the SAM format with the Samtools program suite (https://www.htslib.org) [2], which can also be used to manipulate data stored in either format.
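A minimal sketch of the BAM-to-SAM transcoding step, driven from Python, is shown below. It assumes that the `samtools` executable is installed and on the PATH; the helper names and file names are illustrative. The underlying command is `samtools view -h -o out.sam in.bam`, where `-h` includes the header in the output and `-o` names the output file.

```python
import shutil
import subprocess
from typing import List


def bam_to_sam_command(bam_path: str, sam_path: str) -> List[str]:
    """Build the samtools command line that writes bam_path as SAM, header included."""
    return ["samtools", "view", "-h", "-o", sam_path, bam_path]


def bam_to_sam(bam_path: str, sam_path: str) -> None:
    """Transcode a BAM file to SAM; raises if samtools is missing or fails."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools was not found on PATH")
    subprocess.run(bam_to_sam_command(bam_path, sam_path), check=True)


if __name__ == "__main__":
    # Hypothetical file names; substitute a BAM file listed in Table 1.
    bam_to_sam("example.bam", "example.sam")
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.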

Database

The paths listed in the tables are relative to the Database URL. The Database is organized in the following top-level folders:

  • candidate-data: files which are under consideration for a possible incorporation into the Sequencing Data Collection
  • collection: Sequencing Data Collection
  • conformance: Conformance Test Items
  • development: test files used during the development of the reference software

Sequencing Data Collection

Table 1 lists the selected sequencing data.

Table 2 lists the data origins of the Sequencing Data Collection. These URLs are kept as a record of where the data came from; some of them may no longer work.

Table 3 lists the available reference sequences. Some of them were used for the alignment of the BAM files listed in Table 1.

Conformance Test Items

Table 4 provides the Conformance Test Items.

References

[1] W. Huang, L. Li, J. R. Myers and G. T. Marth, "ART: a next-generation sequencing read simulator," Bioinformatics, vol. 28, no. 4, pp. 593-594, 2012.

[2] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin and 1000 Genome Project Data Processing Subgroup, "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078-2079, 2009.

[3] J. Voges, J. Ostermann and M. Hernaez, "CALQ: compression of quality values of aligned sequencing data," Bioinformatics, vol. 34, no. 10, pp. 1650-1658, 2018.

[4] J. K. Bonfield, "The Scramble conversion tool," Bioinformatics, vol. 30, no. 19, pp. 2818-2819, 2014.

[5] F. Hach, I. Numanagic and S. C. Sahinalp, "DeeZ: reference-based compression by local assembly," Nature Methods, vol. 11, pp. 1082-1084, 2014.

[6] S. Marco-Sola, M. Sammeth, R. Guigó and P. Ribeca, "The GEM mapper: fast, accurate and versatile alignment by filtration," Nature Methods, vol. 9, pp. 1185-1188, 2012.