Merge 6b6fad1 into be56e56

smirarab · Feb 1, 2021 · c524fd7
2 parents be56e56 + 6b6fad1
Showing 6 changed files with 731 additions and 514 deletions.
diff --git a/README.TIPP.md b/README.TIPP.md
@@ -1,75 +1,69 @@
-------------------------------------
-Summary
-------------------------------------
-
-TIPP stands for `Taxonomic identification and phylogenetic profiling`, and so is a method for the following problems:
+Taxonomic Identification and Phylogenetic Profiling (TIPP)
+==========================================================
+TIPP is a method for the following problems:
 
 Taxonomic identification:
-- Input: A query sequence `q`
-- Output: The taxonomic lineage of `q`.
++ Input: A query sequence *q*
++ Output: The taxonomic lineage of *q*
 
 Abundance profiling:
-- Input: A set of query sequences `Q`
-- Output: An abundance profile estimated on `Q`
-
++ Input: A set *Q* of query sequences
++ Output: An abundance profile estimated on *Q*
 
-TIPP is a modification of SEPP for classifying query sequences using phylogenetic placement.  TIPP inserts the query sequences into a taxonomic tree and uses the insertion location to identify the reads.  The novel idea behind TIPP is that rather than using the single best alignment and placement for taxonomic identification, we use a collection of alignments and placements and consider statistical support for each alignment and placement.  Our study shows that TIPP provides improved classification accuracy on novel sequences and on sequences with evolutionarily divergent datasets.  TIPP can also be used for abundance estimation by computing an abundance profile on the reads binned to the 30 gene reference dataset.
+TIPP is a modification of SEPP for classifying query sequences (i.e. reads) using phylogenetic placement. TIPP inserts each read into a taxonomic tree and uses the insertion location to identify the taxonomic lineage of the read. The novel idea behind TIPP is that rather than using the single best alignment and placement for taxonomic identification, we use a collection of alignments and placements and consider statistical support for each alignment and placement. Our study shows that TIPP provides improved classification accuracy on novel sequences and on sequences with evolutionarily divergent datasets. TIPP can also be used for abundance estimation by computing an abundance profile on the reads binned to marker genes in a reference dataset. TIPP2 provides a new reference dataset with 40 marker genes, assembled from the NCBI RefSeq database (learn more [here](https://github.com/shahnidhi/TIPP_reference_package)). In addition, TIPP2 updates how query sequences (i.e. reads) are mapped to marker genes. This repository corresponds to TIPP2, and henceforth we use the terms TIPP and TIPP2 interchangeably.
 
-Developers: Nam Nguyen, Siavash Mirarab, and Tandy Warnow.
+Developers of TIPP: Nam Nguyen, Siavash Mirarab, Nidhi Shah, Erin Molloy, and Tandy Warnow.
 
-###Publication:
-Nguyen, Nam , Siavash Mirarab, Bo Liu, Mihai Pop, and Tandy Warnow. `TIPP: Taxonomic identification and phylogenetic profiling`. Bioinformatics (2014). [doi:10.1093/bioinformatics/btu721](http://bioinformatics.oxfordjournals.org/content/30/24/3548.full.pdf).
+### Publications:
+Nguyen, Nam, Siavash Mirarab, Bo Liu, Mihai Pop, and Tandy Warnow, "TIPP: Taxonomic identification and phylogenetic profiling," *Bioinformatics*, 2014. [doi:10.1093/bioinformatics/btu721](http://bioinformatics.oxfordjournals.org/content/30/24/3548.full.pdf).
 
+Shah, Nidhi, Erin K. Molloy, Mihai Pop, and Tandy Warnow, "TIPP2: metagenomic taxonomic profiling using phylogenetic markers," *Bioinformatics*, 2020. [doi:10.1093/bioinformatics/btab023](https://doi.org/10.1093/bioinformatics/btab023)
 
 ### Note and Acknowledgment: 
 - TIPP bundles the following three programs into its distribution:
-  1. pplacer: http://matsen.fhcrc.org/pplacer/
-  2. hmmer: http://hmmer.janelia.org/
-  3. EPA: http://sco.h-its.org/exelixis/software.html
+	- pplacer: http://matsen.fhcrc.org/pplacer/
+	- hmmer: http://hmmer.janelia.org/
+	- EPA: http://sco.h-its.org/exelixis/software.html
 - TIPP uses the [Dendropy](http://pythonhosted.org/DendroPy/) package. 
 - TIPP uses some code from [SATe](http://phylo.bio.ku.edu/software/sate/sate.html).
 
 -------------------------------------
-Installation
--------------------------------------
-This section details steps for installing and running TIPP. We have run TIPP on Linux and MAC. If you experience difficulty installing or running the software, please contact one of us (Tandy Warnow, Nam Nguyen, or Siavash Mirarab).
+
+Installing TIPP
+===============
+This section details steps for installing and running TIPP. We have run TIPP on Linux and MAC. If you experience difficulty installing or running the software, please contact one of us (Tandy Warnow or Siavash Mirarab).
 
 Requirements:
--------------------
+-------------
 Before installing the software you need to make sure the following programs are installed on your machine.
 
-1. Python: Version > 2.7. 
-2. Java: Version > 1.5
-3. Blast: Version > 2.2.2
+- Python: Version > 2.7 
+- Java: Version > 1.5
+- Blast: Version > 2.2.2
 
 Installation Steps:
 -------------------
-TIPP is a part of the SEPP distribution package.  First install SEPP.  Once done, do the following. 
+TIPP is a part of the SEPP distribution package. First install SEPP. Once done, do the following. 
 
-1. Download the reference dataset available at https://github.com/tandyw/tipp-reference/releases/download/v2.0.0/tipp.zip
-2. Unzip it to a directory
-3. Set the environmental variables that will be used to create the configuration file.  The environmental variable can be set using the following command (shell-dependent)
-    `export VARIABLE_NAME=/path/to/file`  (bash shell)
-    `setenv VARIABLE_NAME /path/to/file` (tcsh shell)  
-    3a. Set the environment variable REFERENCE to point to the location of the reference directory (i.e., the location of the tipp folder generated from unzipping the tipp.zip file)
-    3b. Set the environment variable BLAST to point to the location of blastn
-4. Configure: run `python setup.py tipp` or `python setup.py tipp -c` (you should use `-c` if you used `-c` when you installed SEPP). 
-
+1. Download and decompress the reference dataset available at [https://obj.umiacs.umd.edu/tipp/tipp2-refpkg.tar.gz](https://obj.umiacs.umd.edu/tipp/tipp2-refpkg.tar.gz).
+2. Set the environment variables (`REFERENCE` and `BLAST`) that will be used to create the configuration file. The `REFERENCE` environment variable should point to the location of the reference dataset (i.e. it should point to the `tipp2-refpkg` directory with its full path). The `BLAST` environment variable should point to the location of the `blastn` binary. Environment variables can be set using the following (shell-dependent) commands:
+	- `export VARIABLE_NAME=/path/to/file` (bash shell)
+	- `setenv VARIABLE_NAME /path/to/file` (tcsh shell)
+3. Create the TIPP configuration file by running the command: `python setup.py tipp` or `python setup.py tipp -c`. This creates a `tipp.config` config file. It is important that you use `-c` here if you used `-c` when installing SEPP, and that you omit `-c` otherwise. 
 
-The last step creates a `tipp.config` config file. It is important that you use `-c` here if you used `-c` when installing SEPP and otherwise, not use `-c`. 
 
 Common Problems:
--------------------
-1.  TIPP requires SEPP to be installed.  If TIPP is not running, first check to see if TIPP was installed correctly.
+----------------
+1. TIPP requires SEPP to be installed. If TIPP is not running, first check to see if SEPP was installed correctly.
 
-2.  TIPP relies on blastn for the binning of metagenomic reads.  This needs to be installed separately.  To point BLAST to your installation of blastn, modify ~/.sepp/tipp.config. 
-   blast: http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download
+2. TIPP relies on `blastn` for the binning of metagenomic reads, so BLAST needs to be downloaded and installed separately (learn more [here](http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download)). Then, point the `BLAST` environment variable to your installation of `blastn`. Alternatively, you can manually point TIPP to the `blastn` installation by modifying the `tipp.config` file. 
+
+3. TIPP performs abundance profiling using a set of 40 marker genes. This reference dataset needs to be downloaded separately from [here](https://obj.umiacs.umd.edu/tipp/tipp2-refpkg.tar.gz). Then, point the `REFERENCE` environment variable to the decompressed directory before installing TIPP. Alternatively, you can manually point TIPP to the reference dataset by modifying the `tipp.config` file. 
 
-3.  TIPP performs abundance profiling uses a set of 30 marker genes.  This needs to be downloaded separately.  Download the reference dataset and unzip it to a directory.  Point the REFERENCE environment variable to this directory before installing TIPP.  You can manually point TIPP to the reference directory by modifying the ~/.sepp/tipp.config file. 
-
 ---------------------------------------------
+
 Running TIPP
----------------------------------------------
+============
 To run TIPP, invoke the `run_tipp.py` script from the `bin` sub-directory of the location in which you installed the Python packages. To see options for running the script, use the command:
 
 `python <bin>/run_tipp.py -h`
@@ -80,14 +74,17 @@ The general command for running TIPP for a specific marker is:
 
 `python <bin>/run_tipp.py -R <reference_marker> -f <fragment_file>`
 
-The main output of TIPP is a _classification.txt file that annotation for each read.  In addition, TIPP outputs a .json file with the placements, created according to pplacer format. Please refer to pplacer website (currently http://matsen.github.com/pplacer/generated_rst/pplacer.html#json-format-specification) for more information on the format of the josn file. Also note that pplacer package provides a program called guppy that can read .json files and perform downstream steps such as visualization.
+The main output of TIPP is a `_classification.txt` file that contains the annotation for each read. In addition, TIPP outputs a `.json` file with the placements, created according to the pplacer format. Please refer to the [pplacer website](http://matsen.github.com/pplacer/generated_rst/pplacer.html#json-format-specification) for more information on the format of the `.json` file. Also note that the pplacer package provides a program called guppy that can read `.json` files and perform downstream steps such as visualization.
 
-In addition to the .json file, TIPP outputs alignments of fragments to reference sets. There could be multiple alignment files created, each corresponding to a different placement subset. 
+In addition to the `.json` file, TIPP outputs alignments of fragments to reference sets. There could be multiple alignment files created, each corresponding to a different placement subset. 
 
-By setting SEPP_DEBUG environmental variable to `True`, you can instruct SEPP to output more information that can be helpful for debugging.  
+By setting the `SEPP_DEBUG` environment variable to `True`, you can instruct SEPP to output more information that can be helpful for debugging.  
+
+This [tutorial](tutorial/tipp-tutorial.md) contains examples of running TIPP for read classification as well as abundance profiling.
 
 ---------------------------------------------
+
 Bugs and Errors
----------------------------------------------
-TIPP is under active research development at UIUC by the Warnow Lab (and especially with her former PhD students Siavash Mirarab and Nam Nguyen). Please report any errors to Siavash Mirarab (smirarab@gmail.com) and Nam Nguyen (ndn006@eng.ucsd.edu).
+===============
+TIPP is under active research development at UIUC by the Warnow Lab. Please report any errors to Tandy Warnow (warnow@illinois.edu) and Siavash Mirarab (smirarab@ucsd.edu).
 
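The revised installation steps in README.TIPP.md depend on `REFERENCE` and `BLAST` being set correctly before `python setup.py tipp` is run. A minimal sketch of a pre-flight check follows. The helper name `check_tipp_env` is hypothetical and is not part of SEPP or TIPP; it only mirrors the two conditions the README states (a reference directory and a `blastn` binary).

```python
import os


def check_tipp_env(env):
    """Return a list of problems with the REFERENCE and BLAST settings.

    `env` is any mapping of variable names to values, e.g. os.environ.
    This helper is illustrative only; it is not shipped with TIPP.
    """
    problems = []
    reference = env.get("REFERENCE")
    blast = env.get("BLAST")

    # REFERENCE should be the full path of the decompressed reference dataset.
    if not reference:
        problems.append("REFERENCE is not set")
    elif not os.path.isdir(reference):
        problems.append("REFERENCE is not a directory: %s" % reference)

    # BLAST should be the path of the blastn binary itself.
    if not blast:
        problems.append("BLAST is not set")
    elif not (os.path.isfile(blast) and os.access(blast, os.X_OK)):
        problems.append("BLAST is not an executable file: %s" % blast)

    return problems


if __name__ == "__main__":
    for problem in check_tipp_env(os.environ):
        print("WARNING: " + problem)
```

Running the script before configuring prints one warning per misconfigured variable and prints nothing when both look usable.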
diff --git a/README.md b/README.md
@@ -21,4 +21,4 @@ Each of these related tools has its own README file.
 ---------------------------------------------
 Bugs and Errors
 ---------------------------------------------
-SEPP, TIPP, UPP, HIPPI are under active research development at UIUC by the Warnow Lab (and especially with her PhD student Mike Nute and former students Siavash Mirarab and Nam-phuong Nguyen). Please report any errors to Siavash Mirarab (smirarab@ucsd.edu), Nam Nguyen (ndn006@eng.ucsd.edu), or Mike Nute (nute2@illinois.edu).
+SEPP, TIPP, UPP, HIPPI are under active research development at UIUC by the Warnow Lab and former student Siavash Mirarab (now at UCSD). Please report any errors to Siavash Mirarab (smirarab@ucsd.edu).
diff --git a/run_tipp_tool.py b/run_tipp_tool.py
@@ -12,63 +12,74 @@
 import sepp.config
 import sepp.metagenomics
 import sepp
+import sys
 
 
 def parse_args():
     parser = argparse.ArgumentParser(
         description='Performs various tools for TIPP.')
     parser.add_argument(
-        '-g', '--gene', default=None, metavar='GENE',
-        help='use GENE\'s reference package',
-        type=str, dest='gene')
+        "-g", "--gene",
+        dest="gene", metavar="GENE",
+        default=None,
+        type=str,
+        help="use GENE\'s reference package")
     parser.add_argument(
-        '-a', '--action', default=None, metavar='ACTION', help='Run ACTION',
-        required=True, type=str, dest='action')
+        "-o", "--output",
+        dest="output", metavar="OUTPUT",
+        default="output",
+        type=sepp.config.valid_file_prefix,
+        help="output files with prefix OUTPUT. [default: %(default)s]")
     parser.add_argument(
-        '-o', '--output', default='output', metavar='OUTPUT',
-        help='OUTPUT directory', type=str, dest='output')
+        "-d", "--outdir",
+        dest="outdir", metavar="OUTPUT_DIR",
+        default=os.path.curdir,
+        type=sepp.config.valid_dir_path,
+        help="output to OUTPUT_DIR directory. full-path required. "
+             "[default: %(default)s]")
     parser.add_argument(
-        '-p', '--prefix', default='prefix', metavar='PREFIX', help='PREFIX',
-        type=str, dest='prefix')
+        "-i", "--input",
+        dest="input", metavar="INPUT",
+        default=None,
+        type=str,
+        help="input file. full-path required. "
+             "(_classification.txt file from running run_tipp.py)")
     parser.add_argument(
-        '-i', '--input', default='input', metavar='INPUT',
-        help='INPUT destination',
-        type=str, dest='input')
-    parser.add_argument(
-        '-t', '--threshold', default=0.95, metavar='THRESHOLD',
-        help='threshold for classification',
-        type=float, dest='threshold')
-
-    args = parser.parse_args()
-    return args
-
-
-root_p = open(os.path.join(os.path.split(
-    os.path.split(sepp.__file__)[0])[0], "home.path")).readlines()[0].strip()
-tipp_config_path = os.path.join(root_p, "tipp.config")
+        "-t", "--threshold",
+        dest="threshold",  metavar="THRESHOLD",
+        default=0.95,
+        type=float,
+        help="Threshold for classification [default: 0.95]")
+    return parser.parse_args()
 
 
-def profile(input, gene, output, prefix, threshold):
+def profile(inputf, gene, output, prefix, threshold):
+    root_p = open(os.path.join(os.path.split(
+                  os.path.split(sepp.__file__)[0])[0], "home.path"))
+    root_p = root_p.readlines()[0].strip()
+    tipp_config_path = os.path.join(root_p, "tipp.config")
     sepp.config.set_main_config_path(tipp_config_path)
     opts = Namespace()
     sepp.config._read_config_file(open(tipp_config_path, 'r'), opts)
     (taxon_map, level_map, key_map) = sepp.metagenomics.load_taxonomy(
-        "%s/refpkg/%s.refpkg/all_taxon.taxonomy" % (opts.reference.path, gene))
+        "%s/%s.refpkg/all_taxon.taxonomy" % (opts.reference.path, gene))
     gene_classification = sepp.metagenomics.generate_classification(
-        input, threshold)
+        inputf, threshold)
     sepp.metagenomics.remove_unclassified_level(gene_classification)
-    sepp.metagenomics.write_classification(
-        gene_classification, "%s/%s.classification" % (output, prefix))
+    tstr = str("%d" % (threshold * 100))
+    cfile = str("%s/%s.classification_%s.txt" % (output, prefix, tstr))
+    sepp.metagenomics.write_classification(gene_classification, cfile)
     sepp.metagenomics.write_abundance(gene_classification, output)
 
 
 def main():
     args = parse_args()
-    if (args.action == 'profile'):
-        profile(args.input, args.gene, args.output, args.prefix,
-                args.threshold)
+    if os.path.isdir(args.outdir):
+        sys.stdout.write("WARNING: %s already exists,"
+                         " may overwrite files\n" % args.outdir)
+    profile(args.input, args.gene, args.outdir, args.output,
+            args.threshold)
 
 
 if __name__ == "__main__":
     main()
-#
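The run_tipp_tool.py changes alter two paths: the taxonomy file is now looked up directly under the reference directory (no intermediate `refpkg/` component), and the classification file name gains a threshold suffix. The sketch below reproduces both string templates from the diff in standalone form. The function names and the marker name `markerA` are hypothetical. One deliberate deviation, labeled in the code: the commit formats the suffix with `"%d" % (threshold * 100)`, which truncates, so a threshold such as 0.57 (whose product is 56.99999999999999 in floating point) would yield `56`. The sketch rounds first.

```python
import os


def taxonomy_path(reference_path, gene):
    # Same template as the new profile(): <reference>/<gene>.refpkg/...
    return "%s/%s.refpkg/all_taxon.taxonomy" % (reference_path, gene)


def classification_path(outdir, prefix, threshold):
    # Assumption of this sketch: round() before formatting. The commit
    # itself uses "%d" % (threshold * 100), which truncates instead.
    tstr = "%d" % round(threshold * 100)
    return "%s/%s.classification_%s.txt" % (outdir, prefix, tstr)


if __name__ == "__main__":
    # "markerA" and the reference path are placeholder values.
    print(taxonomy_path("/path/to/tipp2-refpkg", "markerA"))
    # Defaults from parse_args(): --outdir is os.path.curdir,
    # --output is "output", --threshold is 0.95.
    print(classification_path(os.path.curdir, "output", 0.95))
```

With the argparse defaults shown in the diff, the classification file is therefore written to `./output.classification_95.txt`.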
diff --git a/sepp/exhaustive_tipp.py b/sepp/exhaustive_tipp.py
@@ -352,10 +352,9 @@ def merge_results(self):
     def check_options(self, supply=[]):
         if options().reference_pkg is not None:
             self.load_reference(
-                os.path.join(
-                    options().reference.path,
-                    'refpkg/%s.refpkg/' % options().reference_pkg))
-        if options().taxonomy_file is None:
+                os.path.join(options().reference.path,
+                             '%s.refpkg/' % options().reference_pkg))
+        if (options().taxonomy_file is None):
             supply = supply + ["taxonomy file"]
         if options().taxonomy_name_mapping_file is None:
             supply = supply + ["taxonomy name mapping file"]
@@ -477,13 +476,12 @@ def get_alignment_decomposition_tree(self, p_tree):
             _LOG.info(
                 "Reading alignment decomposition input tree: %s" % (
                     self.options.alignment_decomposition_tree))
-            d_tree = PhylogeneticTree(
+            return PhylogeneticTree(
                 dendropy.Tree.get_from_stream(
                     self.options.alignment_decomposition_tree,
                     schema="newick",
                     preserve_underscores=True,
                     taxon_set=self.root_problem.subtree.get_tree().taxon_set))
-            return d_tree
 
 
 def augment_parser():