Initial commit.PuttingRSD on github as a python package. Newscripts, …

…rsd_search andrsd_format
todddeluca · Oct 15, 2011 · a99c39d · a99c39d
commit a99c39d
Show file tree

Hide file tree

Showing 19 changed files with 6,894 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,22 @@
+# Ignore Mac OS X metadata files
+.DS_Store
+
+# Ignore backups and compiled files
+*~
+*.py[co]
+*.old
+*.class
+#*#
+
+# Ignore dirs from packaging and distributing
+dist
+build
+reciprocal_smallest_distance.egg-info
+MANIFEST
+
+# Ignore all tmp dirs and everything in them
+tmp/*
+tmp/**/*
+**/tmp/*
+**/tmp/**/*
+
diff --git a/LICENSE.txt b/LICENSE.txt
@@ -0,0 +1,20 @@
+The MIT License (MIT)
+Copyright (c) 2004-2011 Dennis P. Wall, Todd F. DeLuca
+
+Permission is hereby granted, free of charge, to any person obtaining a
+copy of this software and associated documentation files (the "Software"),
+to deal in the Software without restriction, including without limitation
+the rights to use, copy, modify, merge, publish, distribute, sublicense,
+and/or sell copies of the Software, and to permit persons to whom the
+Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -0,0 +1,3 @@
+include *.markdown *.txt
+recursive-include rsd *.ctl *.dat
+recursive-include examples *.txt *.aa
diff --git a/README.markdown b/README.markdown
@@ -0,0 +1,108 @@
+Author: Todd F. DeLuca  
+Organization: Wall Laboratory, Center for Biomedical Informatics, Harvard Medical School, USA, Earth, Sol System, Orion Arm, Milky Way.  
+Date: 2011/08/29  
+
+This package contains the scripts needed to run the Reciprocal Smallest Distance (RSD) ortholog detection algorithm as well as examples of input and output files.
+
+- README.md:  the file you are reading now
+- bin/rsd_search: a script that runs the reciprocal smallest distance (RSD) algorithm to search for orthologs.
+- bin/rsd_format: a script that turns FASTA-formatted genomes into BLAST-formatted indexes.
+- rsd/: package implementing the RSD algorithm.  
+- rsd/jones.dat, rsd/codeml.ctl:  used by codeml/paml to compute the evolutionary distance between two sequences.
+- examples/:  a directory containing examples of inputs and outputs to rsd, including fasta-formatted genome protein sequence files,
+ a query sequence id file (for --ids), and an orthologs output file.
+
+
+## Installing RSD
+
+### Prerequisites
+
+RSD depends on Python, NCBI BLAST, PAML, and Kalign.  It has been tested to work with the following versions.  It might work with other versions too.
+
+Install:
+
+- Python 2.7: http://www.python.org/download/
+- NCBI BLAST 2.2.24: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
+- PAML 4.4: http://abacus.gene.ucl.ac.uk/software/paml.html
+- Kalign 2.04: http://msa.sbc.su.se/cgi-bin/msa.cgi
+
+Add the executables for python (version 2.7), makeblastdb, blastp, codeml, and kalign, to your PATH.
+
+### Installing from a tarball
+
+Download and untar the latest version from github:
+
+    cd ~
+    curl https://github.com/downloads/todddeluca/reciprocal_smallest_distance/reciprocal_smallest_distance-VERSION.tar.gz | tar xvz
+
+Install reciprocal\_smallest\_distance, making sure to use Python 2.7:
+
+    cd reciprocal_smallest_distance-VERSION
+    python setup.py install
+
+
+## Using RSD to Find Othologs
+
+The following example commands demonstrate the main ways to run `rsd_search`.
+Every invocation of `rsd_search` requires specifying the location of a FASTA-formatted sequence file for two genomes,
+called the query and subject genomes.  Their order is arbitrary, but if you use the `--ids` option, the ids must come from the query genome.
+You must also specify a file to write the results of the orthologs found by the RSD algorithm.
+The format of the output file contains one ortholog per line.  Each line contains the query sequence id, subject sequence id,
+and distance (calculated by codeml) between the sequences.
+You can optionally specify a file containing ids using the `--ids` option.  Then rsd will only search for orthologs for those ids.
+Using `--divergence` and `--evalue`, you have the option of using different thresholds from the defaults.
+
+
+Get help on how to run rsd_search:
+
+    rsd_search -h
+
+
+Find orthologs between all the sequences in the query and subject genomes.
+
+    rsd_search -d 0.2 -e 1e-20 -q examples/genomes/Mycoplasma_genitalium.aa/Mycoplasma_genitalium.aa \
+    --subject-genome=examples/genomes/Mycobacterium_leprae.aa/Mycobacterium_leprae.aa \
+    -o examples/Mycoplasma_genitalium.aa_Mycobacterium_leprae.aa_0.2_1e-20.orthologs.txt
+
+
+Find orthologs using non-default divergence and evalue thresholds
+
+    rsd_search -q examples/genomes/Mycoplasma_genitalium.aa/Mycoplasma_genitalium.aa \
+    --subject-genome=examples/genomes/Mycobacterium_leprae.aa/Mycobacterium_leprae.aa \
+    -o examples/Mycoplasma_genitalium.aa_Mycobacterium_leprae.aa_0.8_1e-5.orthologs.txt \
+    -d 0.8 -e 1e-5
+
+
+Find orthologs between all the sequences in the query and subject genomes using genomes that have already been formatted for blast.
+
+    rsd_search -q examples/genomes/Mycoplasma_genitalium.aa/Mycoplasma_genitalium.aa \
+    --subject-genome=examples/genomes/Mycobacterium_leprae.aa/Mycobacterium_leprae.aa \
+    -o examples/Mycoplasma_genitalium.aa_Mycobacterium_leprae.aa_0.8_1e-5.orthologs.txt \
+    --no-format
+
+
+Find orthologs for specific sequences in the query genome.  For finding orthologs for only a few sequences, using `--no-blast-cache` can
+speed up computation.  YMMV.
+
+    rsd_search -q examples/genomes/Mycoplasma_genitalium.aa/Mycoplasma_genitalium.aa \
+    --subject-genome=examples/genomes/Mycobacterium_leprae.aa/Mycobacterium_leprae.aa \
+    -o examples/Mycoplasma_genitalium.aa_Mycobacterium_leprae.aa_0.8_1e-5.orthologs.txt \
+    --ids examples/Mycoplasma_genitalium.aa.ids.txt --no-blast-cache
+
+
+## Formatting FASTA Files For BLAST
+
+It is not necessary to format a fasta file for BLAST, because rsd_search does it for you.  However should you find yourself running
+the program multiple times, especially for large genomes which take some time to format, preformatting the fasta files and 
+then running `rsd_search` with the `--no-format` option can save you time.  Here is how to format a FASTA file in place:
+
+    rsd_format -g examples/genomes/Mycoplasma_genitalium.aa/Mycoplasma_genitalium.aa -v
+
+And here is how to format the FASTA file, putting the results in a specific directory:
+
+    rsd_format -g examples/genomes/Mycoplasma_genitalium.aa/Mycoplasma_genitalium.aa -v -d /my/place/for/genomes
+
+
+## Extras
+
+A user might also want to change the alpha shape parameter for the gamma distribution used in the likelihood calculations of the codeml package of paml.  This can be done by editing the codeml.ctl file included in the distribution.  See the documentation for codeml for more details.
diff --git a/bin/rsd_format b/bin/rsd_format
@@ -0,0 +1,51 @@
+#!/usr/bin/env python
+
+# RSD: The reciprocal smallest distance algorithm.
+#   Wall, D.P., Fraser, H.B. and Hirsh, A.E. (2003) Detecting putative orthologs, Bioinformatics, 19, 1710-1711.
+# Original Author: Dennis P. Wall, Department of Biological Sciences, Stanford University.
+# Author: Todd F. DeLuca, Center for Biomedical Informatics, Harvard Medical School
+# Contributors: I-Hsien Wu, Computational Biology Initiative, Harvard Medical School
+
+import argparse
+import os
+import shutil
+
+import rsd
+
+
+def main():
+    # create command line parser for "format" command
+    formatDesc = ('Reciprocal smallest distance (RSD) uses BLAST to search for putative orthologs.  ' +
+                  'This command formats a FASTA-formatted genome for use with BLAST.  ' +
+                  'By default, index names are derived from the name of GENOME and the indexes are placed in the same dir as GENOME.  ' +
+                  'If DIR is specified, GENOME is copied to DIR, and indexes are placed in DIR.')
+    parser = argparse.ArgumentParser(description=formatDesc)
+    parser.add_argument('-d', '--dir', help='Dir where BLAST indexes will be put.  Default: the directory containing GENOME.')
+    parser.add_argument('-g', '--genome', required=True, help='FASTA format protein sequence file, with unique ids on each nameline either in the form ">id" or ">ns|id|...".')
+    parser.add_argument('-v', '--verbose', default=False, action='store_true')
+    args = parser.parse_args()
+
+    srcPath = os.path.abspath(os.path.expanduser(args.genome))
+
+    # copy GENOME to DIR if necessary
+    if args.dir:
+        destDir = os.path.abspath(os.path.expanduser(args.dir))
+        destPath = os.path.join(destDir, os.path.basename(srcPath))
+        if srcPath != destPath:
+            if args.verbose:
+                print 'copying {} to {}'.format(srcPath, destPath)
+            shutil.copyfile(srcPath, destPath)
+    else:
+        destPath = srcPath
+
+    # formatting puts blast indexes in the same dir as destPath.
+    if args.verbose:
+        print 'formatting {}'.format(destPath)
+    rsd.formatForBlast(destPath)
+
+
+if __name__ == '__main__':
+   main()
+
+
+# last line
diff --git a/bin/rsd_search b/bin/rsd_search
@@ -0,0 +1,105 @@
+#!/usr/bin/env python
+
+# RSD: The reciprocal smallest distance algorithm.
+#   Wall, D.P., Fraser, H.B. and Hirsh, A.E. (2003) Detecting putative orthologs, Bioinformatics, 19, 1710-1711.
+# Original Author: Dennis P. Wall, Department of Biological Sciences, Stanford University.
+# Author: Todd F. DeLuca, Center for Biomedical Informatics, Harvard Medical School
+# Contributors: I-Hsien Wu, Computational Biology Initiative, Harvard Medical School
+
+import argparse
+import os
+import shutil
+
+import rsd
+import rsd.nested
+
+
+def main():
+
+    parser = argparse.ArgumentParser(description='Compute orthologs using the reciprocal smallest distance (RSD) algorithm between the query genome and the subject genome.  See http://bioinformatics.oxfordjournals.org/content/19/13/1710 for a description of the algorithm.')
+    parser.add_argument('-q', '--query-genome', required=True, help='FASTA format sequence file, with unique ids on each nameline either in the form ">id" or ">ns|id|...".')
+    parser.add_argument('-s', '--subject-genome', required=True, help='FASTA format sequence file, with unique ids on each nameline either in the form ">id" or ">ns|id|...".')
+    parser.add_argument('-o', '--outfile', required=True, help='File in which to write orthologs.  Orthologs are written: query_id subject_id distance.')
+    parser.add_argument('-d', '--divergence', type=float, default='0.8', help='Theshold for the maximum divergence allowed between a query and subject sequence.  A number > 0 and < 1. e.g. 0.2, or 0.5.  Default is 0.8.')
+    parser.add_argument('-e', '--evalue', type=float, default='1e-5', help='Theshold for the maximum BLAST e-value allowed between a query and subject sequence.  e.g. 1e-20, or 0.005.  Default is 1e-5.')
+    parser.add_argument('--ids', help='Path to file containing seq ids (one per line) in query_genome for which to compute orthologs.  If you only have one or a few sequences of interest it can be much faster to limit computation to those sequences.  The default is to compute othologs for all sequences in query_genome.  The sequence ids in the file must correspond to ids on the fasta namelines of query_genome.')
+    parser.add_argument('--no-blast-cache', default=False, action='store_true', help='If this option is given, blast hits will not be precomputed for every sequence in each genome.  Using this option Can be faster if computing orthologs for only a few sequences.  Consider using in conjunction with --ids.')
+    parser.add_argument('--no-format', default=False, action='store_true', help='If this option is given, genome fasta files will not be formatted for blast.  This is useful if blast formatted indices already exist and are located in the same directory as the fasta files.')
+    parser.add_argument('--workdir', default='.', help='Directory under which to work.  will create a subdirectory under this dir in which to write temporary files, etc.  This subdirectory will be removed when rsd finishes.  Default is "."')
+    parser.add_argument('-v', '--verbose', default=False, action='store_true')
+    args = parser.parse_args()
+
+    assert args.divergence > 0.0 and args.divergence < 1.0
+    assert args.evalue >= 0.0
+
+    if args.ids:
+        with open(os.path.abspath(os.path.expanduser(args.ids))) as fh:
+            # one id per line.  ignore blank lines and comment lines
+            ids = [i for i in (line.strip() for line in fh) if i and not i.startswith('#')] 
+    else:
+        ids = None
+    if args.verbose:
+        print 'query sequence ids:', ids
+
+    div = str(args.divergence)
+    evalue = str(args.evalue)
+    outfile = os.path.abspath(os.path.expanduser(args.outfile))
+    if args.verbose:
+        print 'outfile:', outfile
+        print 'divergence:', div
+        print 'evalue:', evalue
+
+    with rsd.nested.NestedTempDir(dir=os.path.abspath(os.path.expanduser(args.workdir)), nesting=0) as tmpDir:
+
+        # format fasta files if needed.  do it in tmpDir to be clean.
+        if args.no_format:
+            # assume blast formatted index files coexist with the fasta files
+            queryFastaPath = os.path.abspath(os.path.expanduser(args.query_genome))
+            subjectFastaPath = os.path.abspath(os.path.expanduser(args.subject_genome))
+            if args.verbose:
+                print 'fasta files:', queryFastaPath, subjectFastaPath
+        else:
+            if args.verbose:
+                print 'copying fasta files'
+            queryFastaPath = os.path.join(tmpDir, os.path.basename(args.query_genome))
+            subjectFastaPath = os.path.join(tmpDir, os.path.basename(args.subject_genome))
+            if args.verbose:
+                print 'fasta files:', queryFastaPath, subjectFastaPath
+            shutil.copy(os.path.abspath(os.path.expanduser(args.query_genome)), queryFastaPath)
+            shutil.copy(os.path.abspath(os.path.expanduser(args.subject_genome)), subjectFastaPath)
+            if args.verbose:
+                print 'formatting fasta files'
+            rsd.formatForBlast(queryFastaPath)
+            rsd.formatForBlast(subjectFastaPath)
+
+        # print queryFastaPath, subjectFastaPath, outfile, div, evalue
+        divEvalues = [(div, evalue)]
+
+        # precompute blast hits if needed.  do it in tmpDir to be clean.
+        if args.no_blast_cache: 
+            getForwardHits = rsd.makeGetHitsOnTheFly(subjectFastaPath, evalue, tmpDir)
+            getReverseHits = rsd.makeGetHitsOnTheFly(queryFastaPath, evalue, tmpDir)
+            if args.verbose:
+                print 'computing orthologs'
+            divEvalueToOrthologs = rsd.computeOrthologsUsingOnTheFlyHits(queryFastaPath, subjectFastaPath, divEvalues, ids, tmpDir)
+        else:
+            if args.verbose:
+                print 'precomuting blast hits'
+            forwardHitsPath = os.path.join(tmpDir, 'query_subject.blast.hits.pickle')
+            reverseHitsPath = os.path.join(tmpDir, 'subject_query.blast.hits.pickle')
+            rsd.computeBlastHits(queryFastaPath, subjectFastaPath, forwardHitsPath, evalue, workingDir=tmpDir)
+            rsd.computeBlastHits(subjectFastaPath, queryFastaPath, reverseHitsPath, evalue, workingDir=tmpDir)
+            if args.verbose:
+                print 'computing orthologs'
+            divEvalueToOrthologs = rsd.computeOrthologsUsingSavedHits(queryFastaPath, subjectFastaPath, divEvalues, forwardHitsPath, reverseHitsPath, ids, tmpDir)
+
+        if args.verbose:
+            print 'writing {0} orthologs to outfile'.format(len(divEvalueToOrthologs[divEvalues[0]]))
+        rsd.writeToOutfile(divEvalueToOrthologs[divEvalues[0]], outfile)
+
+
+if __name__ == '__main__':
+   main()
+
+
+# last line
diff --git a/examples/Mycoplasma_genitalium.aa.ids.txt b/examples/Mycoplasma_genitalium.aa.ids.txt
@@ -0,0 +1,5 @@
+108885075
+108885076
+12044853
+12044854
+12044855