command line documentation, clarity
ctb committed Jun 5, 2016
1 parent 0b6178d commit 947ff53
Showing 3 changed files with 102 additions and 12 deletions.
Empty file added doc/_templates/.empty
Empty file.
101 changes: 94 additions & 7 deletions doc/command-line.rst
@@ -2,6 +2,22 @@
Using sourmash from the command line
====================================

From the command line, sourmash can be used to compute `MinHash
sketches <https://en.wikipedia.org/wiki/MinHash>`__ from DNA
sequences, compare them to each other, and plot the results. This
allows you to estimate sequence similarity quickly and accurately.

Please see the `mash <http://mash.readthedocs.io/en/latest/>`__
software and the `mash paper (Ondov et al., 2016)
<http://biorxiv.org/content/early/2015/10/26/029827>`__ for background
information on how and why MinHash sketches work.
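
As a rough illustration of the idea behind MinHash sketching, the following Python sketch keeps only the smallest hash values of a sequence's k-mers and uses them to estimate the Jaccard index. It is not the sourmash or mash implementation: the function names, the use of MD5 as the hash, and the bottom-n scheme are all assumptions chosen for brevity.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(seq, k, n):
    """Bottom-n sketch: the n smallest distinct k-mer hashes, sorted."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:n]


def estimate_jaccard(a, b, n):
    """Estimate the Jaccard index from two bottom-n sketches."""
    merged = sorted(set(a) | set(b))[:n]  # bottom-n of the combined sketches
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

When ``n`` is at least the number of distinct k-mers, the estimate matches the exact Jaccard index (barring hash collisions); a smaller ``n`` trades accuracy for a smaller sketch.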

----

sourmash uses a subcommand syntax, so all commands start with
``sourmash`` followed by a subcommand specifying the action to be
taken.

An example
==========

@@ -15,8 +31,8 @@ Compute signatures for each::

sourmash compute -f *.fna.gz

This will produce three `.sig` files containing MinHash signatures at k=31;
the `-f` bypasses an error where the last of the genomes has some non-ATCGN
This will produce three ``.sig`` files containing MinHash signatures at k=31;
the ``-f`` bypasses an error where the last of the genomes has some non-ATCGN
characters in it.

Next, compare all the signatures to each other::
@@ -25,10 +41,81 @@ Next, compare all the signatures to each other::

Finally, plot a dendrogram::

./plot-comparison.py cmp --pdf
sourmash plot cmp

This will output two files, ``cmp.dendro.png`` and ``cmp.matrix.png``,
containing a clustering & dendrogram of the sequences, as well as a
similarity matrix and heatmap.

The ``sourmash`` command and its subcommands
============================================

To get a list of subcommands, run ``sourmash`` without any arguments.

There are three main subcommands: ``compute``, ``compare``, and ``plot``.

``sourmash compute``
--------------------

The ``compute`` subcommand computes and saves MinHash sketches for
each sequence in one or more sequence files. It takes as input FASTA
or FASTQ files, and these files can be uncompressed or compressed with
gzip or bzip2. The output will be one or more YAML signature files
that can be used with ``sourmash compare``.

Usage::

sourmash compute filename [ filename2 ... ]

Optional arguments::

--ksizes K1[,K2,K3] -- one or more k-mer sizes to use; default is 31
--force -- recompute existing signatures; convert non-DNA characters to N
--output -- save all the signatures to this file; can be '-' for stdout.
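
To make the ``--ksizes`` and ``--force`` descriptions concrete, here is a minimal Python sketch of the two behaviours: splitting a comma-separated list of k-mer sizes, and replacing non-DNA characters with ``N``. The helper names are hypothetical and the uppercasing step is an assumption; this is not the code sourmash itself runs.

```python
def parse_ksizes(text):
    """Split a comma-separated string such as '21,31,51' into integers."""
    return [int(part) for part in text.split(",") if part.strip()]


def clean_sequence(seq):
    """Uppercase seq and replace anything other than A, C, G, T with N."""
    return "".join(ch if ch in "ACGT" else "N" for ch in seq.upper())
```

For example, ``parse_ksizes("21,31,51")`` yields three k-mer sizes, and ``clean_sequence`` maps ambiguity codes such as ``R`` or ``Y`` to ``N``.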

``sourmash compare``
--------------------

The ``compare`` subcommand compares one or more signature files
(created with ``compute``) using estimated `Jaccard index
<https://en.wikipedia.org/wiki/Jaccard_index>`__. The default output
is a text display of a similarity matrix where each entry ``[i, j]``
contains the estimated Jaccard index between input signature ``i`` and
input signature ``j``. The output matrix can be saved to a file
with ``--output`` and used with the ``sourmash plot`` subcommand.
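
The layout of that matrix can be sketched in a few lines of Python. This example computes the exact Jaccard index on plain sets rather than the MinHash estimate that ``compare`` reports, and the function names are illustrative assumptions.

```python
def jaccard(a, b):
    """Exact Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(sets):
    """Square matrix whose entry [i][j] is jaccard(sets[i], sets[j])."""
    return [[jaccard(a, b) for b in sets] for a in sets]
```

The result is symmetric with 1.0 on the diagonal for non-empty inputs, which is the shape one would expect from a pairwise comparison of signatures.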

Usage::

sourmash compare file1.sig [ file2.sig ... ]

Options::

--output -- save the distance matrix to this file (as a numpy binary matrix)
--ksize -- do the comparisons at this k-mer size.

``sourmash plot``
-----------------

The ``plot`` subcommand produces two plots -- a dendrogram and a
dendrogram+matrix -- from a distance matrix computed by ``sourmash compare
--output <matrix>``. The default output is two PNG files.

Usage::

sourmash plot <matrix>

Options::

--pdf -- output PDF files.
--labels -- display the signature names (by default, the filenames) on the plot
--indices -- turn off index display on the plot.
--vmax -- maximum value (default 1.0) for heatmap.
--vmin -- minimum value (default 0.0) for heatmap.

Example figures:

Mention:
.. figure:: _static/cmp.matrix.png
:width: 60%

* reads fa, fq, gz, bz2
* show png

.. figure:: _static/cmp.dendro.png
:width: 60%
13 changes: 8 additions & 5 deletions sourmash
@@ -165,7 +165,7 @@ Commands can be:
parser = argparse.ArgumentParser()
parser.add_argument('signatures', nargs='+')
parser.add_argument('-k', '--ksize', type=int, default=DEFAULT_K)
parser.add_argument('-o', '--output-filename')
parser.add_argument('-o', '--output')
args = parser.parse_args(args)

# load in the various signatures
@@ -174,6 +174,9 @@ Commands can be:
print('loading', filename, file=sys.stderr)
data = open(filename).read()
loaded = sig.load_signatures(data, select_ksize=args.ksize)
if not loaded:
print('warning: no signatures loaded at given ksize from %s' %
filename, file=sys.stderr)
siglist.extend(loaded)

if len(siglist) == 0:
@@ -198,15 +201,15 @@ Commands can be:
print('min similarity in matrix:', numpy.min(D), file=sys.stderr)

# shall we output a matrix?
if args.output_filename:
labeloutname = args.output_filename + '.labels.txt'
if args.output:
labeloutname = args.output + '.labels.txt'
print('saving labels to:', labeloutname, file=sys.stderr)
with open(labeloutname, 'w') as fp:
fp.write("\n".join(labeltext))

print('saving distance matrix to:', args.output_filename,
print('saving distance matrix to:', args.output,
file=sys.stderr)
with open(args.output_filename, 'wb') as fp:
with open(args.output, 'wb') as fp:
numpy.save(fp, D)


