Skip to content

Commit

Permalink
Initial commit.PuttingRSD on github as a python package. Newscripts, …
Browse files Browse the repository at this point in the history
…rsd_search andrsd_format
  • Loading branch information
Todd DeLuca committed Oct 15, 2011
0 parents commit a99c39d
Show file tree
Hide file tree
Showing 19 changed files with 6,894 additions and 0 deletions.
22 changes: 22 additions & 0 deletions .gitignore
@@ -0,0 +1,22 @@
# Ignore Mac OS X metadata files
.DS_Store

# Ignore backups and compiled files
*~
*.py[co]
*.old
*.class
#*#

# Ignore dirs from packaging and distributing
dist
build
reciprocal_smallest_distance.egg-info
MANIFEST

# Ignore all tmp dirs and everything in them
tmp/*
tmp/**/*
**/tmp/*
**/tmp/**/*

20 changes: 20 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,20 @@
The MIT License (MIT)
Copyright (c) 2004-2011 Dennis P. Wall, Todd F. DeLuca

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
3 changes: 3 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,3 @@
include *.markdown *.txt
recursive-include rsd *.ctl *.dat
recursive-include examples *.txt *.aa
108 changes: 108 additions & 0 deletions README.markdown
@@ -0,0 +1,108 @@
Author: Todd F. DeLuca
Organization: Wall Laboratory, Center for Biomedical Informatics, Harvard Medical School, USA, Earth, Sol System, Orion Arm, Milky Way.
Date: 2011/08/29

This package contains the scripts needed to run the Reciprocal Smallest Distance (RSD) ortholog detection algorithm as well as examples of input and output files.

- README.md: the file you are reading now
- bin/rsd_search: a script that runs the reciprocal smallest distance (RSD) algorithm to search for orthologs.
- bin/rsd_format: a script that turns FASTA-formatted genomes into BLAST-formatted indexes.
- rsd/: package implementing the RSD algorithm.
- rsd/jones.dat, rsd/codeml.ctl: used by codeml/paml to compute the evolutionary distance between two sequences.
- examples/: a directory containing examples of inputs and outputs to rsd, including fasta-formatted genome protein sequence files,
a query sequence id file (for --ids), and an orthologs output file.


## Installing RSD

### Prerequisites

RSD depends on Python, NCBI BLAST, PAML, and Kalign. It has been tested to work with the following versions. It might work with other versions too.

Install:

- Python 2.7: http://www.python.org/download/
- NCBI BLAST 2.2.24: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
- PAML 4.4: http://abacus.gene.ucl.ac.uk/software/paml.html
- Kalign 2.04: http://msa.sbc.su.se/cgi-bin/msa.cgi

Add the executables for python (version 2.7), makeblastdb, blastp, codeml, and kalign, to your PATH.

### Installing from a tarball

Download and untar the latest version from github:

cd ~
curl https://github.com/downloads/todddeluca/reciprocal_smallest_distance/reciprocal_smallest_distance-VERSION.tar.gz | tar xvz

Install reciprocal\_smallest\_distance, making sure to use Python 2.7:

cd reciprocal_smallest_distance-VERSION
python setup.py install


## Using RSD to Find Othologs

The following example commands demonstrate the main ways to run `rsd_search`.
Every invocation of `rsd_search` requires specifying the location of a FASTA-formatted sequence file for two genomes,
called the query and subject genomes. Their order is arbitrary, but if you use the `--ids` option, the ids must come from the query genome.
You must also specify a file to write the results of the orthologs found by the RSD algorithm.
The format of the output file contains one ortholog per line. Each line contains the query sequence id, subject sequence id,
and distance (calculated by codeml) between the sequences.
You can optionally specify a file containing ids using the `--ids` option. Then rsd will only search for orthologs for those ids.
Using `--divergence` and `--evalue`, you have the option of using different thresholds from the defaults.


Get help on how to run rsd_search:

rsd_search -h


Find orthologs between all the sequences in the query and subject genomes.

rsd_search -d 0.2 -e 1e-20 -q examples/genomes/Mycoplasma_genitalium.aa/Mycoplasma_genitalium.aa \
--subject-genome=examples/genomes/Mycobacterium_leprae.aa/Mycobacterium_leprae.aa \
-o examples/Mycoplasma_genitalium.aa_Mycobacterium_leprae.aa_0.2_1e-20.orthologs.txt


Find orthologs using non-default divergence and evalue thresholds

rsd_search -q examples/genomes/Mycoplasma_genitalium.aa/Mycoplasma_genitalium.aa \
--subject-genome=examples/genomes/Mycobacterium_leprae.aa/Mycobacterium_leprae.aa \
-o examples/Mycoplasma_genitalium.aa_Mycobacterium_leprae.aa_0.8_1e-5.orthologs.txt \
-d 0.8 -e 1e-5


Find orthologs between all the sequences in the query and subject genomes using genomes that have already been formatted for blast.

rsd_search -q examples/genomes/Mycoplasma_genitalium.aa/Mycoplasma_genitalium.aa \
--subject-genome=examples/genomes/Mycobacterium_leprae.aa/Mycobacterium_leprae.aa \
-o examples/Mycoplasma_genitalium.aa_Mycobacterium_leprae.aa_0.8_1e-5.orthologs.txt \
--no-format


Find orthologs for specific sequences in the query genome. For finding orthologs for only a few sequences, using `--no-blast-cache` can
speed up computation. YMMV.

rsd_search -q examples/genomes/Mycoplasma_genitalium.aa/Mycoplasma_genitalium.aa \
--subject-genome=examples/genomes/Mycobacterium_leprae.aa/Mycobacterium_leprae.aa \
-o examples/Mycoplasma_genitalium.aa_Mycobacterium_leprae.aa_0.8_1e-5.orthologs.txt \
--ids examples/Mycoplasma_genitalium.aa.ids.txt --no-blast-cache


## Formatting FASTA Files For BLAST

It is not necessary to format a fasta file for BLAST, because rsd_search does it for you. However should you find yourself running
the program multiple times, especially for large genomes which take some time to format, preformatting the fasta files and
then running `rsd_search` with the `--no-format` option can save you time. Here is how to format a FASTA file in place:

rsd_format -g examples/genomes/Mycoplasma_genitalium.aa/Mycoplasma_genitalium.aa -v

And here is how to format the FASTA file, putting the results in a specific directory:

rsd_format -g examples/genomes/Mycoplasma_genitalium.aa/Mycoplasma_genitalium.aa -v -d /my/place/for/genomes


## Extras

A user might also want to change the alpha shape parameter for the gamma distribution used in the likelihood calculations of the codeml package of paml. This can be done by editing the codeml.ctl file included in the distribution. See the documentation for codeml for more details.
51 changes: 51 additions & 0 deletions bin/rsd_format
@@ -0,0 +1,51 @@
#!/usr/bin/env python

# RSD: The reciprocal smallest distance algorithm.
# Wall, D.P., Fraser, H.B. and Hirsh, A.E. (2003) Detecting putative orthologs, Bioinformatics, 19, 1710-1711.
# Original Author: Dennis P. Wall, Department of Biological Sciences, Stanford University.
# Author: Todd F. DeLuca, Center for Biomedical Informatics, Harvard Medical School
# Contributors: I-Hsien Wu, Computational Biology Initiative, Harvard Medical School

import argparse
import os
import shutil

import rsd


def main():
# create command line parser for "format" command
formatDesc = ('Reciprocal smallest distance (RSD) uses BLAST to search for putative orthologs. ' +
'This command formats a FASTA-formatted genome for use with BLAST. ' +
'By default, index names are derived from the name of GENOME and the indexes are placed in the same dir as GENOME. ' +
'If DIR is specified, GENOME is copied to DIR, and indexes are placed in DIR.')
parser = argparse.ArgumentParser(description=formatDesc)
parser.add_argument('-d', '--dir', help='Dir where BLAST indexes will be put. Default: the directory containing GENOME.')
parser.add_argument('-g', '--genome', required=True, help='FASTA format protein sequence file, with unique ids on each nameline either in the form ">id" or ">ns|id|...".')
parser.add_argument('-v', '--verbose', default=False, action='store_true')
args = parser.parse_args()

srcPath = os.path.abspath(os.path.expanduser(args.genome))

# copy GENOME to DIR if necessary
if args.dir:
destDir = os.path.abspath(os.path.expanduser(args.dir))
destPath = os.path.join(destDir, os.path.basename(srcPath))
if srcPath != destPath:
if args.verbose:
print 'copying {} to {}'.format(srcPath, destPath)
shutil.copyfile(srcPath, destPath)
else:
destPath = srcPath

# formatting puts blast indexes in the same dir as destPath.
if args.verbose:
print 'formatting {}'.format(destPath)
rsd.formatForBlast(destPath)


if __name__ == '__main__':
main()


# last line
105 changes: 105 additions & 0 deletions bin/rsd_search
@@ -0,0 +1,105 @@
#!/usr/bin/env python

# RSD: The reciprocal smallest distance algorithm.
# Wall, D.P., Fraser, H.B. and Hirsh, A.E. (2003) Detecting putative orthologs, Bioinformatics, 19, 1710-1711.
# Original Author: Dennis P. Wall, Department of Biological Sciences, Stanford University.
# Author: Todd F. DeLuca, Center for Biomedical Informatics, Harvard Medical School
# Contributors: I-Hsien Wu, Computational Biology Initiative, Harvard Medical School

import argparse
import os
import shutil

import rsd
import rsd.nested


def main():

parser = argparse.ArgumentParser(description='Compute orthologs using the reciprocal smallest distance (RSD) algorithm between the query genome and the subject genome. See http://bioinformatics.oxfordjournals.org/content/19/13/1710 for a description of the algorithm.')
parser.add_argument('-q', '--query-genome', required=True, help='FASTA format sequence file, with unique ids on each nameline either in the form ">id" or ">ns|id|...".')
parser.add_argument('-s', '--subject-genome', required=True, help='FASTA format sequence file, with unique ids on each nameline either in the form ">id" or ">ns|id|...".')
parser.add_argument('-o', '--outfile', required=True, help='File in which to write orthologs. Orthologs are written: query_id subject_id distance.')
parser.add_argument('-d', '--divergence', type=float, default='0.8', help='Theshold for the maximum divergence allowed between a query and subject sequence. A number > 0 and < 1. e.g. 0.2, or 0.5. Default is 0.8.')
parser.add_argument('-e', '--evalue', type=float, default='1e-5', help='Theshold for the maximum BLAST e-value allowed between a query and subject sequence. e.g. 1e-20, or 0.005. Default is 1e-5.')
parser.add_argument('--ids', help='Path to file containing seq ids (one per line) in query_genome for which to compute orthologs. If you only have one or a few sequences of interest it can be much faster to limit computation to those sequences. The default is to compute othologs for all sequences in query_genome. The sequence ids in the file must correspond to ids on the fasta namelines of query_genome.')
parser.add_argument('--no-blast-cache', default=False, action='store_true', help='If this option is given, blast hits will not be precomputed for every sequence in each genome. Using this option Can be faster if computing orthologs for only a few sequences. Consider using in conjunction with --ids.')
parser.add_argument('--no-format', default=False, action='store_true', help='If this option is given, genome fasta files will not be formatted for blast. This is useful if blast formatted indices already exist and are located in the same directory as the fasta files.')
parser.add_argument('--workdir', default='.', help='Directory under which to work. will create a subdirectory under this dir in which to write temporary files, etc. This subdirectory will be removed when rsd finishes. Default is "."')
parser.add_argument('-v', '--verbose', default=False, action='store_true')
args = parser.parse_args()

assert args.divergence > 0.0 and args.divergence < 1.0
assert args.evalue >= 0.0

if args.ids:
with open(os.path.abspath(os.path.expanduser(args.ids))) as fh:
# one id per line. ignore blank lines and comment lines
ids = [i for i in (line.strip() for line in fh) if i and not i.startswith('#')]
else:
ids = None
if args.verbose:
print 'query sequence ids:', ids

div = str(args.divergence)
evalue = str(args.evalue)
outfile = os.path.abspath(os.path.expanduser(args.outfile))
if args.verbose:
print 'outfile:', outfile
print 'divergence:', div
print 'evalue:', evalue

with rsd.nested.NestedTempDir(dir=os.path.abspath(os.path.expanduser(args.workdir)), nesting=0) as tmpDir:

# format fasta files if needed. do it in tmpDir to be clean.
if args.no_format:
# assume blast formatted index files coexist with the fasta files
queryFastaPath = os.path.abspath(os.path.expanduser(args.query_genome))
subjectFastaPath = os.path.abspath(os.path.expanduser(args.subject_genome))
if args.verbose:
print 'fasta files:', queryFastaPath, subjectFastaPath
else:
if args.verbose:
print 'copying fasta files'
queryFastaPath = os.path.join(tmpDir, os.path.basename(args.query_genome))
subjectFastaPath = os.path.join(tmpDir, os.path.basename(args.subject_genome))
if args.verbose:
print 'fasta files:', queryFastaPath, subjectFastaPath
shutil.copy(os.path.abspath(os.path.expanduser(args.query_genome)), queryFastaPath)
shutil.copy(os.path.abspath(os.path.expanduser(args.subject_genome)), subjectFastaPath)
if args.verbose:
print 'formatting fasta files'
rsd.formatForBlast(queryFastaPath)
rsd.formatForBlast(subjectFastaPath)

# print queryFastaPath, subjectFastaPath, outfile, div, evalue
divEvalues = [(div, evalue)]

# precompute blast hits if needed. do it in tmpDir to be clean.
if args.no_blast_cache:
getForwardHits = rsd.makeGetHitsOnTheFly(subjectFastaPath, evalue, tmpDir)
getReverseHits = rsd.makeGetHitsOnTheFly(queryFastaPath, evalue, tmpDir)
if args.verbose:
print 'computing orthologs'
divEvalueToOrthologs = rsd.computeOrthologsUsingOnTheFlyHits(queryFastaPath, subjectFastaPath, divEvalues, ids, tmpDir)
else:
if args.verbose:
print 'precomuting blast hits'
forwardHitsPath = os.path.join(tmpDir, 'query_subject.blast.hits.pickle')
reverseHitsPath = os.path.join(tmpDir, 'subject_query.blast.hits.pickle')
rsd.computeBlastHits(queryFastaPath, subjectFastaPath, forwardHitsPath, evalue, workingDir=tmpDir)
rsd.computeBlastHits(subjectFastaPath, queryFastaPath, reverseHitsPath, evalue, workingDir=tmpDir)
if args.verbose:
print 'computing orthologs'
divEvalueToOrthologs = rsd.computeOrthologsUsingSavedHits(queryFastaPath, subjectFastaPath, divEvalues, forwardHitsPath, reverseHitsPath, ids, tmpDir)

if args.verbose:
print 'writing {0} orthologs to outfile'.format(len(divEvalueToOrthologs[divEvalues[0]]))
rsd.writeToOutfile(divEvalueToOrthologs[divEvalues[0]], outfile)


if __name__ == '__main__':
main()


# last line
5 changes: 5 additions & 0 deletions examples/Mycoplasma_genitalium.aa.ids.txt
@@ -0,0 +1,5 @@
108885075
108885076
12044853
12044854
12044855

0 comments on commit a99c39d

Please sign in to comment.