Data Repository

All reference files should be pre-processed with ref.py, as explained in the Manual.

SAM Samples

| Organism | Technology | Coverage | Uncompressed size | Link | Reference |
|---|---|---|---|---|---|
| E.coli | Illumina MiSeq | 420x | 5.2 GB | MiSeq_Ecoli_DH10B_110721_PF<sup>1</sup> (1.3 GB) | CP000948 |
| H.sapiens | IonTorrent | 0.6x | 5.6 GB | sample-2-10_sorted (1.4 GB) | Homo_sapiens_assembly19 |
| H.sapiens | Illumina HiSeq | 2x | 20 GB | 9827_2#49 (6.1 GB) | hs37d5 |
| D.melanogaster | PacBio | 75x | 29 GB | dm3PacBio (12 GB) | dm3 |
| H.sapiens | RNASeq | 6x | 71 GB | K562_cytosol_LID8465_TopHat_v (12.8 GB) | hg19<sup>2</sup> |
| H.sapiens | PacBio | 15x | 118 GB | NA12878.pacbio.bwa-sw.20140202 (53.8 GB) | hs37d5 |
| H.sapiens | Illumina-like Cancer Cell | 30x | 398 GB | HCC1954.mix1.n80t20<sup>3</sup> (122.5 GB) | Homo_sapiens_assembly19 |
| H.sapiens | Illumina HiSeq | 50x | 549 GB | NA12878_S1 (113.3 GB) | hg19 |

  1. All BAM files must be decompressed to SAM via `samtools view -h <bam> -o <sam>`.
  2. You can concatenate separate chromosome files into one large FASTA file via `cat chr*.fa > hg19.fa`.
  3. You need GeneTorrent to download this sample. After you obtain it, you can fetch this particular sample via `gtdownload -vv -c https://cghub.ucsc.edu/software/downloads/cghub_public.key -d 360b4736-6c5e-48df-af58-c1cf51609350`.
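
For environments without a POSIX shell, the concatenation in note 2 can be approximated in Python. This is a minimal sketch, not part of the repository's tooling: the function name `concat_fasta` is hypothetical, and it assumes that plain byte-wise sorting of file names is an acceptable chromosome order (the same order a typical shell glob would produce).

```python
import glob
import os
import shutil


def concat_fasta(pattern, out_path):
    """Concatenate every file matching `pattern` into `out_path`.

    Rough equivalent of `cat chr*.fa > hg19.fa`. Hypothetical helper.
    """
    out_abs = os.path.abspath(out_path)
    # Resolve the glob before creating the output, and never read the
    # output file itself. sorted() gives chr1, chr10, chr11, ..., chr2, ...
    paths = sorted(
        p for p in glob.glob(pattern) if os.path.abspath(p) != out_abs
    )
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                # Stream in chunks so large chromosome files are not
                # loaded into memory at once.
                shutil.copyfileobj(src, out)
    return paths
```

Calling `concat_fasta("chr*.fa", "hg19.fa")` in the directory holding the chromosome files would then produce the combined FASTA and return the list of files in the order they were written.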

FASTQ Samples

| Organism | Technology | Paired | Coverage | Uncompressed size | Link | Reference |
|---|---|---|---|---|---|---|
| P.aeruginosa | Illumina GAIIx | | 50x | 1 GB | SRR554369_1<sup>1</sup> (119 MB)<br>SRR554369_2 (120 MB) | NC_002516.2 |
| E.coli | PacBio | | 140x | 1.3 GB | SRR1284073<sup>3</sup> (2.2 GB) | Arabidopsis |
| H.sapiens gut | Illumina GAII | | Unknown | 3.6 GB | MH0001_081026_clean.1<sup>2</sup> (478 MB)<br>MH0001_081026_clean.2 (550 MB) | hg19 |
| S.cerevisiae | Illumina GAII | | 175x | 7.7 GB | SRR327342_1 (792 MB)<br>SRR327342_2 (947 MB) | ACFL01000033 |
| T.cacao | Illumina GAIIx | | 35x | 39 GB | SRR870667_1 (5.2 GB)<br>SRR870667_2 (4.0 GB) | Cacao |
| H.sapiens | Illumina HiSeq | | 13x | 102 GB | ERR174310_1 (17.3 GB)<br>ERR174310_2 (16.8 GB) | hg19 |
| H.sapiens | Illumina HiSeq | | 120x (single-end) | 887 GB | ERR174324<sup>4</sup> (17.5 GB)<br>ERR174325 (16.7 GB)<br>ERR174326 (16.3 GB)<br>ERR174327 (16.3 GB)<br>ERR174328 (16.3 GB)<br>ERR174329 (16.3 GB)<br>ERR174330 (16.1 GB)<br>ERR174331 (17.3 GB)<br>ERR174332 (15.7 GB)<br>ERR174333 (15.4 GB)<br>ERR174334 (15.6 GB)<br>ERR174335 (15.6 GB)<br>ERR174336 (15.9 GB)<br>ERR174337 (16.0 GB)<br>ERR174338 (16.0 GB)<br>ERR174339 (15.6 GB)<br>ERR174340 (11.2 GB)<br>ERR174341 (14.8 GB)<br>Total 284.6 GB | hg19 |

  1. All bzip2 files must be decompressed via `bzip2 -d <bz> -c > <fastq>`.
  2. All Gzip files must be decompressed via `gzip -d <gz> -c > <fastq>`.
  3. You need the NCBI SRA Toolkit to download this sample. After you obtain it, you can fetch this particular sample via `fastq-dump SRR1284073`.
  4. For this sample, only the first library mate (`_1`) files are used.
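
The decompression steps in notes 1 and 2 can also be scripted with the Python standard library. The sketch below is illustrative only: the helper name `decompress` and the suffix-based dispatch are assumptions, not something the repository provides.

```python
import bz2
import gzip
import shutil

# Map a file suffix to the standard-library opener that can read it.
OPENERS = {".bz2": bz2.open, ".gz": gzip.open}


def decompress(src_path, dst_path):
    """Write the decompressed contents of `src_path` to `dst_path`.

    Mirrors `bzip2 -d <bz> -c > <fastq>` and `gzip -d <gz> -c > <fastq>`:
    the compressed input is left in place. Hypothetical helper.
    """
    for suffix, opener in OPENERS.items():
        if src_path.endswith(suffix):
            with opener(src_path, "rb") as src, open(dst_path, "wb") as dst:
                # Stream in chunks; these FASTQ files can be many gigabytes.
                shutil.copyfileobj(src, dst)
            return dst_path
    raise ValueError(f"unrecognised compression suffix: {src_path}")
```

The choice between the two codecs is made purely on the file name, so a file with a misleading extension would fail at read time rather than being detected up front.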

Coverage calculation

Coverage was calculated by dividing the total number of nucleotides in the SAM or FASTQ file by the rounded reference genome size. These numbers are not intended to be exact; they are rough estimates of the coverage of each sample.
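
As a rough illustration of this calculation for a FASTQ file, the sketch below sums the lengths of the sequence lines and divides by a genome size. The function names are hypothetical, and it assumes the common four-line FASTQ layout (one sequence line per record, no wrapping); the exact script used to produce the figures in the tables is not given here.

```python
def fastq_bases(path):
    """Count nucleotides in a FASTQ file with four lines per record."""
    total = 0
    with open(path) as handle:
        for index, line in enumerate(handle):
            # Line 2 of every 4-line record (index 1, 5, 9, ...) is the sequence.
            if index % 4 == 1:
                total += len(line.strip())
    return total


def coverage(total_bases, genome_size):
    """Coverage estimate: total nucleotides divided by reference size."""
    return total_bases / genome_size
```

For paired samples the base counts of both mate files would be added together before dividing, which is an assumption about how the paired rows were computed.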

The following reference genome sizes were used:

| Organism | Size (bp) |
|---|---|
| H.sapiens | 3,100,000,000 |
| T.cacao | 345,000,000 |
| D.melanogaster | 168,000,000 |
| S.cerevisiae | 12,000,000 |
| P.aeruginosa | 6,300,000 |
| E.coli | 4,700,000 |
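
The same estimate can be sketched for SAM input using the sizes listed above. This is an assumption-laden sketch rather than the original method: the names `GENOME_SIZE`, `sam_bases`, and `estimate_coverage` are hypothetical, every record with a stored sequence is counted (including unmapped or secondary records, if present), and the SEQ field is taken from column 10 as defined by the SAM format.

```python
# Rounded reference genome sizes in base pairs, copied from the table above.
GENOME_SIZE = {
    "H.sapiens": 3_100_000_000,
    "T.cacao": 345_000_000,
    "D.melanogaster": 168_000_000,
    "S.cerevisiae": 12_000_000,
    "P.aeruginosa": 6_300_000,
    "E.coli": 4_700_000,
}


def sam_bases(lines):
    """Sum the lengths of the SEQ field over an iterable of SAM lines."""
    total = 0
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        seq = fields[9]
        if seq != "*":  # "*" means the sequence is not stored
            total += len(seq)
    return total


def estimate_coverage(total_bases, organism):
    """Divide a nucleotide count by the rounded genome size for `organism`."""
    return total_bases / GENOME_SIZE[organism]
```

Passing an open file handle works because file objects iterate line by line, for example `estimate_coverage(sam_bases(open("sample.sam")), "E.coli")`, where `sample.sam` is a placeholder file name.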