Skip to content


Folders and files

Last commit message
Last commit date

Latest commit



81 Commits

Repository files navigation

This code is associated with the following manuscript. If you use any part of the source code, please cite us:
Sumaiya Nazeen, Yun William Yu, and Bonnie Berger*. 
"Carnelian uncovers hidden functional patterns across diverse study populations from whole metagenome sequencing reads." (Accepted for publication in Genome Biology). 

A preliminary version of the paper was presented at ECCB 2018 (Applications Track) and bioRxiv preprint can be found at

Upon publication, further information can be found at

0. Requirments
    Vowpal Wabbit 8.1.1
    R 3.3.2
    Python 2.7.13
    BioPython 1.70

    This code has been tested with GCC 6.3.0 on Ubuntu 17.04, running
    under Bash 4.4.7(1) on a server with Intel Xeon E5-2695 v2 x86_64 2.40 GHz 
    processor and 320 GB RAM.
    Using the EC-2010-DB dataset as gold standard, Carnelian can be comfortably
    run on a machine with 16GB RAM using 1 CPU.

1. Directory structure
data/: EC-2010-DB dataset with gold standard EC labels.
scripts/: R scripts for abundance estimation and analysis from read counts in functional bins.
    ext/: external libararies.
    test/: test drawfrag.c and fasta2skm.c
    drawfrag.c: draw fragments from fasta records.
    fasta2skm.c: construct feature (spaced k-mer profile), and convert to VW input format. generate LSH function using LDPC code. splitting and merging utilities for fasta files. links paired-end read files using paired-end relationships. translate sequences using reduced amino acid alphabets.
    kseq.h: parse FASTA files
    demo_data/: data files required for unit and advanced tests. contains the unit tests for Carnelian. contains the end-to-end tests for Carnelian. configures unit tests and advanced tests for Carnelian.
    README.txt: contains the instructions to run the tests.
2. Install and test:

3. Usage:

    (default --optional-arguments such as k-mer length, fragment size,
    hash functions, etc. are set to work best with EC-2010-DB as used in the manuscript.
    If you're going to train on a different dataset, be sure to tune parameters.)

    1)  ./ frag [--optional-arguments] test_dir frag_dir [-h]
        Looks for a fasta file in test_dir with matching label file.
        Randomly draws fragments of length and coverage specified in
        optional-arguments. (use "./ frag -h" for details)

        Outputs these fragments with corresponding label into frag_dir.

    2) ./ train [--optional-arguments] train_dir model_dir [-h]

        Looks for a fasta file in train_dir with matching label file.
        For each batch of training, randomly draw fragments and generate
        feature vectors using Opal LDPC hashes, and trains Vowpal_Wabbit
        One-Against-All classifier against all batches sequentially. To train 
        classifiers in precise mode, use "--precise" option which will make
        the learned model store probabilities.

        Outputs the generated classifier model into model_dir.

    3) ./ retrain [--optional-arguments] old_model_dir new_model_dir new_exmpls_dir [-h]

        Looks for a vowpal-wabbit model with patterns and dictionary file
        in the old_model_dir and a fasta file with matching labels in the
        new_exmpls_dir. Starting with the old model, it updates the existing
        training model and merges new labels with old dictionary using the old
        LDPC patterns. Note that a model trained in default mode must be updated
        in default mode. Same is true for precise mode.
        Output model, dictionary, and pattern files will be generated in new_model_dir.
    4) ./ translate [--optional-arguments] seq_dir out_dir fgsp_loc [-h]
	Using FragGeneScan program located in the fgsp_loc directory, tries to find
	coding sequences in the input reads fasta file in seq_dir, and translated the
	coding sequences to possible ORFs outputting them in a fasta file in the out_dir.

    5) ./ predict [--optional-arguments] model_dir test_dir predict_dir [-h]

        Looks for a classifier model in model_dir, and a fasta file in
        test_dir containing reads/fragments. To make predictions with probabilities,
        run in precise mode using "--precise" option and specify probability cutoff
        using "--cutoff <X>" option.
        Outputs the predictions in predict_dir as a fasta file with
        corresponding a corresponding label file.

    6) ./ eval reference_file predicted_labels [-h]
        Evaluation of prediction accuracy in terms of micro and macro averaged
        precision, sensitivity, and F1-score. If run in "precise" mode, it will
        assume predicted_labels file to have two tab-separated columns: <readID, predLabel>

    7) ./ abundance in_dir out_dir mapping_file gs_file [-h]

        Generates abundance estimates of functional terms. Looks for predicted labels for
        each sample in its own sub-directory in in_dir and sample mapping information
        and average protein length per label in mapping_file and gs_file respectively. Please
	note that, the sample ids must not start with digits. 

        Outputs raw counts and effective counts matrices in out_dir.

    8) ./ simulate [--optional-arguments] test_dir train_dir out_dir [-h]

        Runs a full pipeline for performance evaluation starting from training on data 
        in train_dir, testing on data in test_dir, and outputting fragments, model and 
        predictions under out_dir in the following directory structure:

            simulated test data (drawn fragments) are saved here.
            (ignored if --do-not-fragment)
            classifier will be saved here.
            fragment classifications are saved here.

    9) ./ annotate [--optional-arguments] sample_dir model_dir out_dir fgsp_loc [-h]
	Annotates the input nucleotide reads starting from gene finding and translation
	on the reads fasta file in the sample_dir using FragGeneScan located in the fgsp_loc
	directory, then classifying the predicted ORFs using the model in model_dir, and 
	outputting the labels in the out_dir.
    Steps to be followed in a typical workflow is given in the workflow.txt file.
    To replicate our classification performance analysis the code in performance_analysis.R can be used. Before running
    the script the following packages need to be installed: 
        caret, pROC, ROCR, cvAUC, randomForest

    Sumaiya Nazeen,

    This implementation of Carnelian is adapted from the source code of the following papers:
    Yunan Luo, Y. William Yu, Jianyang Zeng, Bonnie Berger, and Jian Peng. Metagenomic binning through low density hashing.  Bioinformatics (2018), bty611,
    K. Vervier, P. Mahe, M. Tournoud, J.-B. Veyrieras, and J.-P. Vert. Large-scale Machine Learning for Metagenomics Sequence Classification , Technical report HAL-01151453, May, 2015.