Download code from https://github.com/xa6xa6/metaOthello/
There are two code folders:
- Builder -- The code under "build" is used to generate the MetaOthello index.
- Classifier -- The code under "classifier" is used to perform taxonomic classification of sequencing reads using the MetaOthello index.
Building an index is time-consuming (about 6 hours for the NCBI/refseq bacterial genome database). We therefore provide prebuilt indexes (NCBI/refseq bacterial genome database) for download:
- 20mer index: https://drive.google.com/open?id=0BxgO-FKbbXRIYWREa2NwejlVYUU
- 25mer index: https://drive.google.com/open?id=0BxgO-FKbbXRIY1pRaHJsYVg5dTQ
- 31mer index: https://drive.google.com/open?id=0BxgO-FKbbXRIa0Flc3Q4bWtycGM
If you want to build an index with your own reference sequences, follow these steps.
Jellyfish is used to prepare k-mer files for each reference sequence. Download jellyfish from: http://www.cbcb.umd.edu/software/jellyfish/
- Produce a k-mer count file for your reference sequences. Command:
jellyfish count \
-o <path_to_bacterial_rawKmerCountFile> \
-m <Kmer_length> \
-t <threads_num> \
-s <bf_size> \
-C <path_to_bacterial_referenceSeqFastaFile>
- Dump k-mers to human-readable format. Command:
jellyfish dump \
-t -c \
-o <path_to_bacterial_readableKmerCountFile> \
<path_to_bacterial_rawKmerCountFile>
- Generate taxonomy info file. Put all readable k-mer count files into the same directory <path_to_bacterial_reference_seq_Kmer_file_dir> and rename them as 1.Kmer, 2.Kmer, ..., m.Kmer, and generate a taxonomy info file like: https://drive.google.com/open?id=0BxgO-FKbbXRIZlV3ZzBBdlFpMTQ
  There are three columns for each taxonomic rank in the file: the 1st column is a reissued id from 0 to m-1, where m is the total taxon num in that taxonomic rank. The 2nd column lists taxon ids, and the 3rd column lists taxon scientific names. Each row represents a species and its associated taxonomy info.
- Run make build under the directory build.
- Build the index. Command:
./build \
<bacterial_reference_seq_associated_taxonomy_info_file(generated in Step1.3)> \
<path_to_bacterial_reference_seq_Kmer_file_dir> \
<shared kmer file suffixes> \
<Kmer_length> 6 \
<path_to_bacterial_index> \
<path_to_a_temp_dir_for_intermediate_files>
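The k-mer preparation steps above can be sketched as a small shell helper that loops over reference FASTA files and numbers the dumped outputs 1.Kmer, 2.Kmer, and so on. This is only an illustration under stated assumptions: the function names, the .fa extension, the intermediate .jf file names, and the default values for K, THREADS and HASH_SIZE are all hypothetical placeholders. How the file numbers should correspond to rows of the taxonomy info file is not spelled out here, so check that against the example file before relying on the ordering. By default the helper only prints the commands (DRY_RUN=1).

```shell
# Sketch only: print (DRY_RUN=1, the default) or run the jellyfish count/dump
# steps for every FASTA file in a directory, writing 1.Kmer, 2.Kmer, ...
# Assumptions: references end in .fa; K, THREADS, HASH_SIZE are placeholders.

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

prepare_kmers() {
    ref_dir=$1      # directory of reference FASTA files (hypothetical layout)
    kmer_dir=$2     # directory that will hold 1.Kmer, 2.Kmer, ...
    k=${K:-20}
    threads=${THREADS:-4}
    hash_size=${HASH_SIZE:-100M}
    n=0
    for fasta in "$ref_dir"/*.fa; do
        [ -e "$fasta" ] || continue     # skip when the glob matches nothing
        n=$((n + 1))
        raw="$kmer_dir/$n.jf"           # hypothetical name for the raw count file
        run jellyfish count -o "$raw" -m "$k" -t "$threads" -s "$hash_size" -C "$fasta"
        run jellyfish dump -t -c -o "$kmer_dir/$n.Kmer" "$raw"
    done
}
```

For example, `K=31 prepare_kmers ./refs ./kmers` prints the commands for review, and `DRY_RUN=0 K=31 prepare_kmers ./refs ./kmers` runs them.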
- Run make classifier in the classifier directory.
- Perform taxonomic classification of metagenomic sequencing reads. Command:
./classifier \
<path_to_bacterial_index> \
<path_to_output_results_dir> \
<Kmer_length> \
<threads_num> \
<fa_or_fq> \
<SE_or_PE> \
<bacterial_speciesId2taxoInfo_file> \
<NCBI_names_file> \
<readFile_singleEnd or readFile_end1> \
(<readFile_end2 if paired-end reads are provided>)
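Because the classifier takes ten or eleven positional arguments, it is easy to misplace one. The sketch below assembles the argument list in the order shown above and checks that the number of read files matches the mode. It assumes the <SE_or_PE> argument is the literal string SE or PE, and it passes the <fa_or_fq> value through unchecked; both points should be confirmed against the classifier source. The function name classify_cmd is hypothetical, and it only prints the command rather than running it.

```shell
# Sketch only: build the ./classifier command line in the documented order.
# Assumption: the mode argument is the literal string SE or PE.
classify_cmd() {
    index=$1; out_dir=$2; k=$3; threads=$4; fmt=$5; mode=$6
    taxo=$7; names=$8; reads1=$9; reads2=${10:-}

    case $mode in
        SE) [ -z "$reads2" ] || { echo "SE takes one read file" >&2; return 1; } ;;
        PE) [ -n "$reads2" ] || { echo "PE needs two read files" >&2; return 1; } ;;
        *)  echo "mode must be SE or PE" >&2; return 1 ;;
    esac

    # Rebuild the positional parameters as the final command line.
    set -- ./classifier "$index" "$out_dir" "$k" "$threads" "$fmt" "$mode" \
        "$taxo" "$names" "$reads1"
    if [ "$mode" = PE ]; then
        set -- "$@" "$reads2"
    fi
    echo "$*"
}
```

Printing the command first makes it possible to review the argument order before starting a long classification run.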
<bacterial_speciesId2taxoInfo_file> can be downloaded from: https://drive.google.com/open?id=0BxgO-FKbbXRIc3FkLVFvMlpVVGM
Each row represents a species and its associated taxon ids at each taxonomic rank: species, genus, family, order, class, and phylum. Assign -1 if the taxon id is not available.
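When preparing a speciesId2taxoInfo file of your own, it can help to see how many species lack a taxon id at each rank. The following is a minimal sketch that counts the -1 entries per rank. It assumes whitespace-separated columns in the order species, genus, family, order, class, phylum; the exact column layout should be checked against the downloadable file. The function name count_missing_ranks is hypothetical.

```shell
# Sketch only: count rows with a missing (-1) taxon id at each rank.
# Assumption: whitespace-separated columns in the order
# species, genus, family, order, class, phylum.
count_missing_ranks() {
    awk '
        BEGIN { split("species genus family order class phylum", rank, " ") }
        # Tally every field among the first six that holds the marker -1.
        { for (i = 1; i <= 6; i++) if ($i == "-1") missing[i]++ }
        END { for (i = 1; i <= 6; i++) printf "%s\t%d\n", rank[i], missing[i] + 0 }
    ' "$1"
}
```

Running `count_missing_ranks <bacterial_speciesId2taxoInfo_file>` prints one line per rank with the number of rows marked -1.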
<NCBI_names_file> can be downloaded from: https://drive.google.com/open?id=0BxgO-FKbbXRIUFI2dHlBMXZhdTA
NOTE: We will keep the following files updated with the latest NCBI/refseq bacterial genome databases:
- bacterial reference seq associated taxonomy info file
- bacterial index (MetaOthello index for classification)
- bacterial speciesId2taxoInfo_file
- NCBI names file
Also, we will release tools for generating all the above files (from NCBI/refseq bacterial genome databases) very soon.
Copyright (C) 2016-, University of Kentucky
Please refer to LICENSE.TXT for the detailed 'License'.