merged

ratschlab · Sep 25, 2012 · 24e586c
2 parents 2aceb5f + be266f5
commit 24e586c
Showing 101 changed files with 17,616 additions and 5 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1 +1,2 @@
-*~.branch
+*~
+.branch
diff --git a/AUTHORS b/AUTHORS
@@ -0,0 +1,11 @@
+rQuant:
+Regina Bohnert <regina.bohnert@tuebingen.mpg.de>
+Gunnar Raetsch <gunnar.raetsch@tuebingen.mpg.de>
+
+Gff2Anno:
+Vipin T Sreedharan <vipin.ts@tuebingen.mpg.de>
+
+BAM file processing:
+Regina Bohnert <regina.bohnert@tuebingen.mpg.de>
+Jonas Behr <jonas.behr@tuebingen.mpg.de>
+Gunnar Raetsch <gunnar.raetsch@tuebingen.mpg.de>
diff --git a/COPYRIGHT b/COPYRIGHT
@@ -0,0 +1,3 @@
+GPL:
+====
+rQuant is licensed under the GNU General Public License version 3 or, at your option, any later version.
diff --git a/INSTALL b/INSTALL
@@ -0,0 +1,10 @@
+To set up rQuant, please follow these steps:
+
+1. Download SAMTools (version 0.1.7) from http://samtools.sourceforge.net/ and install it. You need to add the flag -fPIC in the SAMTools Makefile for compilation.
+2. Add the SAMTools directory to ./mex/Makefile, go to ./mex and run make ('make octave' for Octave and 'make matlab' for Matlab).
+3. Run ./setup_rquant.sh and set up the paths and configuration options for rQuant.
+
+Optional:
+4. Download the example data with ./get_data.sh in ./examples.
+5. Run an example by executing ./run_example.sh in the examples directory with the input 'small' or 'big' to work on a small (55 examples) or big (1865 examples) C. elegans data set, respectively.
+
diff --git a/LICENSE b/LICENSE
diff --git a/NEWS b/NEWS
@@ -0,0 +1,66 @@
+rQuant version 2.1 (Aug 30, 2011)
+----------------------------------------
+
+New features:
+- Profiles can now also be estimated empirically instead of using the
+optimisation approach (CFG.learn_profiles=1: empirically estimated,
+CFG.learn_profiles=2: optimised), which is considerably faster.
+- The usage of information from paired-end reads has been implemented
+and can be used during abundance estimation (CFG.paired = 1).
+
+
+rQuant version 2.0 (May 24, 2011)
+----------------------------------------
+
+New features:
+- The optimisation of the transcript and profile variables has been
+newly implemented. The optimisation problems are now solved via
+coordinate descent and the analytical solution, making the
+calculations much faster than in the old releases and making rQuant
+independent of a commercial solving software.
+- The profile functions are now modelled with piecewise linear
+functions instead of piecewise constant functions.
+
+Bug fixes:
+- ParseGFF.py: assertion for a GFF file without 9 columns
+
+
+rQuant version 1.2 (May 18, 2011)
+----------------------------------------
+
+New features:
+- transcripts from overlapping loci are merged for quantitation
+- additional option allowing genome annotation to be in AGS format
+
+Other changes:
+- ParseGFF.py: now also parses multiple mappings of parent IDs of GFF
+features
+
+
+rQuant version 1.1 (March 11, 2011)
+----------------------------------------
+
+New features:
+- tool ReadStats: generates a statistic about the read alignments and
+the covered genes
+
+Other changes:
+- prctiles.m: replaced function by own implementation
+- ParseGFF.py: now also parses non-coding transcripts; exon
+coordinates are always in ascending order (for both strands)
+
+Bug fixes:
+- get_reads: fixed a memory leak and segmentation faults
+- sanitise_genes.m: adapted to closed intervals in gene structure from
+ParseGFF.py
+- rquant_core.m: corrected initialisation of transcript length bins
+
+
+rQuant version 1.0 (December 17, 2010)
+----------------------------------------
+
+This is the first release of the quantitation tool rQuant, which
+determines abundances of multiple transcripts per gene locus from
+RNA-Seq measurements.
+Please also visit http://fml.mpg.de/raetsch/suppl/rquant for more
+information about this software.
diff --git a/README b/README
@@ -1,4 +1,57 @@
-Software: <name>
-Description: <description>
-Authors: <authors>
-URL: <url>
+----------------------
+  rQuant version 2.1
+----------------------
+
+DESCRIPTION
+rQuant is a programme to determine abundances of multiple transcripts
+per gene locus from RNA-Seq measurements. It can simultaneously
+estimate the effect of biases introduced by experimental settings. 
+
+REQUIREMENTS
+- Octave or Matlab
+- Python >=2.6.5 and Scipy >=0.7.1
+- SAMTools >= 0.1.7
+
+GETTING STARTED
+To install rQuant and the required software please follow the
+instructions in INSTALL in this directory.
+
+CONTENTS
+All relevant scripts for rQuant are located in the subdirectory src.
+rquant.sh is the main script to start rQuant.
+In the same subdirectory you find the script read_stats.sh that
+generates a statistic about the read alignments and the covered genes.
+
+GALAXY
+rQuant can be used as a web service embedded in a Galaxy instance
+(cf. http://galaxy.fml.tuebingen.mpg.de/tool_runner?tool_id=rquantweb).
+The Galaxy tool configuration file of rQuant is located in the
+subdirectory galaxy along with the XML file for loading example data and
+instructions (rquant_web.xml and rquant_web_instructions.xml,
+respectively). Please adapt the paths to the respective tools in
+the command section of the XML files as indicated.
+The subdirectory test_data contains all data for running a functional
+test in Galaxy (e.g. with sh run_functional_test.sh -id rquantweb). You
+may need to move these test files into the Galaxy test-data directory.
+
+DOCUMENTATION
+More information is available in doc/rquant_web_instructions.txt,
+doc/rquant_web.txt, and doc/read_stats.txt. Examples for running
+rQuant can be found in examples/run_example.sh.
+You can also find information on rQuant.web and rQuant on
+http://fml.mpg.de/raetsch/suppl/rquant/web and
+http://fml.mpg.de/raetsch/suppl/rquant, respectively.
+
+LICENSE
+rQuant is licensed under the GPL version 3 or any later version
+(cf. LICENSE).
+
+CITE US
+If you use rQuant in your research you are kindly asked to cite the
+following publications:
+* Regina Bohnert and Gunnar Raetsch: rQuant.web: A tool for
+RNA-Seq-based transcript quantitation. Nucleic Acids Research,
+38(Suppl 2):W348-51, July 2010.
+* Regina Bohnert, Jonas Behr, and Gunnar Raetsch: Transcript quantification
+with RNA-Seq data. BMC Bioinformatics, 10(S13):P5, October 2009.
+
diff --git a/VERSION b/VERSION
@@ -0,0 +1 @@
+2.1
diff --git a/bin/genarglist.sh b/bin/genarglist.sh
@@ -0,0 +1,21 @@
+#!/bin/bash
+
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# Written (W) 2009-2010 Regina Bohnert, Gunnar Raetsch
+# Copyright (C) 2009-2010 Max Planck Society
+#
+
+until [ -z $1 ] ; do
+	if [ $# != 1 ];
+	then
+		echo -n "'$1', "
+	else
+		echo -n "'$1'"
+	fi
+	shift
+done
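
Review note on bin/genarglist.sh: the loop above turns the command-line arguments into a comma-separated list of single-quoted strings, which is later pasted into an Octave/Matlab function call. The following is a minimal Python sketch of the same formatting, for illustration only; `gen_arg_list` is a hypothetical name and is not part of the commit. Unlike the shell loop, which stops at the first empty argument because of the `-z $1` test, the sketch keeps empty strings. Like the script, it does not escape single quotes embedded in an argument.

```python
def gen_arg_list(args):
    """Quote every argument with single quotes and join them with ', '.

    Hypothetical helper mirroring the output format of bin/genarglist.sh.
    Embedded single quotes are NOT escaped, just as in the shell script.
    """
    return ", ".join("'{}'".format(arg) for arg in args)


if __name__ == "__main__":
    # Illustrative file names only; they are assumptions, not taken from the commit.
    print(gen_arg_list(["anno.gff3", "reads.bam", "out_dir"]))
```

The resulting string is suitable for insertion between the parentheses of a call such as `rquant(...)`.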
diff --git a/bin/genes_cell2struct b/bin/genes_cell2struct
@@ -0,0 +1 @@
+rquant_wrapper.sh
diff --git a/bin/read_stats b/bin/read_stats
@@ -0,0 +1 @@
+rquant_wrapper.sh
diff --git a/bin/rquant b/bin/rquant
@@ -0,0 +1 @@
+rquant_wrapper.sh
diff --git a/bin/rquant_config.sh b/bin/rquant_config.sh
@@ -0,0 +1,27 @@
+#!/bin/bash
+
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# Written (W) 2009-2011 Regina Bohnert, Gunnar Raetsch
+# Copyright (C) 2009-2011 Max Planck Society
+#
+
+
+export RQUANT_VERSION=2.1
+export RQUANT_PATH=
+export RQUANT_SRC_PATH=
+export INTERPRETER=
+export MATLAB_BIN_PATH=
+export MATLAB_MEX_PATH=
+export MATLAB_INCLUDE_DIR=
+export OCTAVE_BIN_PATH=
+export OCTAVE_MKOCT=
+export SAMTOOLS_DIR=
+export PYTHON_PATH=
+export SCIPY_PATH=
+
+if [ -z "${RQUANT_PATH}" ]; then echo Warning: variable RQUANT_PATH not set\; consider running ./setup_rquant.sh ; fi
diff --git a/bin/rquant_gendata b/bin/rquant_gendata
@@ -0,0 +1 @@
+rquant_wrapper.sh
diff --git a/bin/rquant_wrapper.sh b/bin/rquant_wrapper.sh
@@ -0,0 +1,20 @@
+#!/bin/bash
+
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# Written (W) 2009-2010 Regina Bohnert, Gunnar Raetsch
+# Copyright (C) 2009-2010 Max Planck Society
+#
+
+# rQuant wrapper script to start the interpreter with the correct list of arguments
+
+set -e
+
+PROG=`basename $0`
+DIR=`dirname $0`
+
+exec ${DIR}/start_interpreter.sh ${PROG} "`${DIR}/genarglist.sh $@`"
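
Review note on bin/rquant_wrapper.sh: the one-line entries bin/rquant, bin/read_stats, bin/genes_cell2struct and bin/rquant_gendata appear to be symlinks to this wrapper, so the name under which the wrapper is invoked (`basename $0`) selects the function to run. The sketch below restates that dispatch in Python under this assumption; `wrapper_command` is a hypothetical name, and the argument formatting repeats what genarglist.sh prints.

```python
import os


def wrapper_command(argv0, args):
    """Return the argument vector the wrapper would exec (hypothetical sketch).

    argv0 -- path under which the wrapper was invoked, e.g. 'bin/rquant'
    args  -- remaining command-line arguments
    """
    prog = os.path.basename(argv0)          # PROG=`basename $0`
    bin_dir = os.path.dirname(argv0) or "."  # DIR=`dirname $0` ('.' if no slash)
    arglist = ", ".join("'{}'".format(a) for a in args)
    return [os.path.join(bin_dir, "start_interpreter.sh"), prog, arglist]


if __name__ == "__main__":
    # Illustrative invocation; the file names are assumptions.
    print(wrapper_command("bin/rquant", ["anno.gff3", "reads.bam"]))
```

Because the program name comes from the invocation path, adding another entry point only requires another link to the wrapper, provided a function of the same name exists on the interpreter path.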
diff --git a/bin/start_interpreter.sh b/bin/start_interpreter.sh
@@ -0,0 +1,34 @@
+#!/bin/bash
+
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# Written (W) 2009-2010 Regina Bohnert, Gunnar Raetsch
+# Copyright (C) 2009-2010 Max Planck Society
+#
+
+set -e
+
+. `dirname $0`/rquant_config.sh
+
+export MATLAB_RETURN_FILE=`tempfile`
+
+if [ "$INTERPRETER" == 'octave' ];
+then
+	echo exit | ${OCTAVE_BIN_PATH} --eval "global SHELL_INTERPRETER_INVOKE; SHELL_INTERPRETER_INVOKE=1; addpath $RQUANT_SRC_PATH; rquant_config; $1($2); exit;" || (echo starting Octave failed; rm -f $MATLAB_RETURN_FILE; exit -1) ;
+fi
+
+if [ "$INTERPRETER" == 'matlab' ];
+then
+	echo exit | ${MATLAB_BIN_PATH} -nodisplay -r "global SHELL_INTERPRETER_INVOKE; SHELL_INTERPRETER_INVOKE=1; addpath $RQUANT_SRC_PATH; rquant_config; $1($2); exit;" || (echo starting Matlab failed; rm -f $MATLAB_RETURN_FILE; exit -1) ;
+fi
+
+test -f $MATLAB_RETURN_FILE || exit 0
+ret=`cat $MATLAB_RETURN_FILE` ;
+rm -f $MATLAB_RETURN_FILE
+exit $ret
+
+
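
Review note on bin/start_interpreter.sh: the script builds one statement string and hands it to Octave via `--eval` or to Matlab via `-nodisplay -r`. The Python sketch below reconstructs that string and the interpreter argument vector so the pieces are easier to see; `build_eval_string` and `interpreter_argv` are hypothetical names. One deliberate difference: the shell script does nothing when INTERPRETER is neither 'octave' nor 'matlab', whereas the sketch raises an error.

```python
def build_eval_string(src_path, prog, arglist):
    """Compose the statement string passed to the interpreter (sketch)."""
    return (
        "global SHELL_INTERPRETER_INVOKE; SHELL_INTERPRETER_INVOKE=1; "
        "addpath {src}; rquant_config; {prog}({args}); exit;"
    ).format(src=src_path, prog=prog, args=arglist)


def interpreter_argv(interpreter, bin_path, eval_string):
    """Return the argument vector for the chosen interpreter (sketch)."""
    if interpreter == "octave":
        return [bin_path, "--eval", eval_string]
    if interpreter == "matlab":
        return [bin_path, "-nodisplay", "-r", eval_string]
    # Assumption: failing loudly is preferable to silently doing nothing.
    raise ValueError("unknown interpreter: {!r}".format(interpreter))


if __name__ == "__main__":
    # The paths below are placeholders, not values from the commit.
    stmt = build_eval_string("/path/to/rquant/src", "rquant", "'anno.gff3', 'reads.bam'")
    print(interpreter_argv("octave", "/usr/bin/octave", stmt))
```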
diff --git a/doc/read_stats.txt b/doc/read_stats.txt
@@ -0,0 +1,50 @@
+**What it does** 
+
+`ReadStats` generates a statistic about the read alignments (number of reads) and the covered genes (read coverage, number of covered introns, intron coverage). It can be used to perform a sanity check of the read alignments file and the annotation.
+
+**Inputs**
+
+`ReadStats` requires three input files to run:
+
+1. The Genome Information Object, containing essential information about the genome (sequence, size, etc). It can be created using the `GenomeTool` from a fasta file.
+2. The Genome Annotation Object, containing the necessary information about the transcripts that are to be quantified. It can be constructed using the `GFF2Anno` tool from an annotation in GFF3 format.
+3. The BAM alignment file, which stores the read alignments in a compressed format. It can be generated using the `SAM-to-BAM` tool in the NGS: SAM Tools section.
+
+
+**Output**
+
+`ReadStats` writes an output file (Read Statistic) containing
+
+1. the number of reads,
+2. the read coverage of the given genes,
+3. the number of covered introns, and
+4. the intron coverage.
+
+------
+
+.. class:: infomark
+
+**About formats**
+
+**GFF3 format** General Feature Format is a format for describing genes
+and other features associated with DNA, RNA and protein
+sequences. GFF3 lines have nine tab-separated fields:
+
+1. seqid - The name of a chromosome or scaffold.
+2. source - The program that generated this feature.
+3. type - The name of this type of feature. Some examples of standard feature types are "gene", "CDS", "protein", "mRNA", and "exon". 
+4. start - The starting position of the feature in the sequence. The first base is numbered 1.
+5. stop - The ending position of the feature (inclusive).
+6. score - A score between 0 and 1000. If there is no score value, enter ".".
+7. strand - Valid entries include '+', '-', or '.' (for don't know/care).
+8. phase - If the feature is a coding exon, the phase should be a number between 0 and 2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.
+9. attributes - A semicolon-separated list of tag=value pairs (for example ID and Parent) that describe the feature and link related lines together.
+
+For more information see http://www.sequenceontology.org/gff3.shtml
+
+**SAM/BAM format** The Sequence Alignment/Map (SAM) format is a
+tab-delimited text format that stores large nucleotide sequence
+alignments. BAM is the binary version of a SAM file that allows for
+fast and intensive data processing. The format specification and the
+description of SAMtools can be found on
+http://samtools.sourceforge.net/.
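
Review note on doc/read_stats.txt: the nine-field GFF3 layout described above can be checked with a few lines of code. The sketch below is not ParseGFF.py; it is a minimal, hypothetical parser (`parse_gff3_line`, `Feature`) that splits one line on tabs, rejects lines without exactly nine columns, and reads the attributes column as semicolon-separated tag=value pairs as defined in the GFF3 specification. The example line is invented for illustration.

```python
from collections import namedtuple

# Field names follow the list in the documentation above.
Feature = namedtuple(
    "Feature",
    ["seqid", "source", "type", "start", "stop",
     "score", "strand", "phase", "attributes"],
)


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a Feature tuple (hypothetical sketch)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated fields, got %d" % len(cols))
    attributes = {}
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    # start and stop are 1-based, inclusive coordinates.
    return Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                   cols[5], cols[6], cols[7], attributes)


if __name__ == "__main__":
    example = "I\texample_source\tmRNA\t1000\t2000\t.\t+\t.\tID=T1;Parent=G1"
    print(parse_gff3_line(example))
```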
diff --git a/doc/rquant_web.txt b/doc/rquant_web.txt
@@ -0,0 +1,68 @@
+**What it does** 
+
+`rQuant` determines the abundances of a given set of transcripts based on aligned reads from an RNA-Seq experiment.
+
+**Inputs**
+
+`rQuant` requires two input files to run:
+
+1. An annotation file in either GFF3 or AGS format, containing the necessary information about the transcripts that are to be quantified.
+2. The BAM alignment file, which stores the read alignments in a compressed format. It can be generated using the `SAM-to-BAM` tool in the NGS: SAM Tools section.
+
+For the feature Transcript Profiles you have three options:
+
+1. "No profiles": This disables the estimation of the density model.
+2. "Load profiles": You can load a pre-learned density model (consisting of transcript profiles).
+3. "Learn profiles": This enables the estimation of the density model. You can specify the number of iterations. As an additional output, one file describing the density model (transcript profiles) is generated in your history.
+
+
+**Output**
+
+`rQuant` generates a GFF3 file with the attributes `ARC` and `RPKM` that describe the abundance of a transcript in ARC (estimated average read coverage) and RPKM (reads per kilobase of exon model per million mapped reads), respectively.
+
+------
+
+**Licenses**
+
+If **rQuant.web** is used to obtain results for scientific publications it
+should be cited as [1]_ or [2]_.
+
+**References** 
+
+.. [1] Bohnert, R, and Raetsch, G (2010): `rQuant.web. A tool for RNA-Seq-based transcript quantitation`_. Nucleic Acids Research, 38(Suppl 2):W348-51.
+
+.. [2] Bohnert, R, Behr, J, and Raetsch, G (2009): `Transcript quantification with RNA-Seq data`_. BMC Bioinformatics, 10(S13):P5.
+
+.. _rQuant.web. A tool for RNA-Seq-based transcript quantitation: http://nar.oxfordjournals.org/cgi/content/abstract/38/suppl_2/W348
+.. _Transcript quantification with RNA-Seq data: http://www.biomedcentral.com/1471-2105/10/S13/P5
+
+------
+
+.. class:: infomark
+
+**About formats**
+
+**GFF3 format** General Feature Format is a format for describing genes and other features associated with DNA, RNA and protein sequences. GFF3 lines have nine tab-separated fields:
+
+1. seqid - The name of a chromosome or scaffold.
+2. source - The program that generated this feature.
+3. type - The name of this type of feature. Some examples of standard feature types are "gene", "CDS", "protein", "mRNA", and "exon". 
+4. start - The starting position of the feature in the sequence. The first base is numbered 1.
+5. stop - The ending position of the feature (inclusive).
+6. score - A score between 0 and 1000. If there is no score value, enter ".".
+7. strand - Valid entries include '+', '-', or '.' (for don't know/care).
+8. phase - If the feature is a coding exon, the phase should be a number between 0 and 2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.
+9. attributes - A semicolon-separated list of tag=value pairs (for example ID and Parent) that describe the feature and link related lines together.
+
+For the quantitation we provide two additional attributes:
+
+1. ARC: estimated average read coverage (direct output from rQuant)
+2. RPKM: the number of reads per thousand bases per million mapped reads
+
+describing the estimated expression value for each transcript.
+
+For more information see http://www.sequenceontology.org/gff3.shtml
+
+**AGS format** Annotation Gene Structure Object is an internal structure that efficiently stores the information from a GFF3 file.
+
+**SAM/BAM format** The Sequence Alignment/Map (SAM) format is a tab-delimited text format that stores large nucleotide sequence alignments. BAM is the binary version of a SAM file that allows for fast and intensive data processing. The format specification and the description of SAMtools can be found at http://samtools.sourceforge.net/.
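
Review note on doc/rquant_web.txt: the RPKM attribute is defined above as reads per kilobase of exon model per million mapped reads. The sketch below spells out that standard definition as arithmetic. It is an assumption that this matches how rQuant derives RPKM internally (the documentation states that ARC is the direct output); `rpkm` and its parameter names are hypothetical.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    Standard definition, not taken from the rQuant source:
        RPKM = read_count / (exon_length_bp / 1e3) / (total_mapped_reads / 1e6)
             = read_count * 1e9 / (exon_length_bp * total_mapped_reads)
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # 500 reads on a 2,000 bp transcript in a library of 10 million mapped reads.
    print(rpkm(500, 2000, 10000000))
```

With these illustrative numbers, 500 reads over 2 kilobases is 250 reads per kilobase, and dividing by 10 million mapped reads gives an RPKM of 25.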