Skip to content

Commit

Permalink
added API example
Browse files Browse the repository at this point in the history
  • Loading branch information
ctb committed Jun 5, 2016
1 parent d7ddef6 commit edb9320
Show file tree
Hide file tree
Showing 4 changed files with 106 additions and 2 deletions.
4 changes: 2 additions & 2 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -10,12 +10,12 @@ clean:
cd doc && make clean

test: all
$(PYTHON) -m pytest sourmash_lib/*.py
$(PYTHON) -m pytest

doc: .PHONY
cd doc && make html

coverage: all
$(PYTHON) setup.py clean --all
SOURMASH_COVERAGE=1 $(PYTHON) setup.py build_ext -i
$(PYTHON) -m pytest --cov=. sourmash_lib/*.py
$(PYTHON) -m pytest --cov=.
101 changes: 101 additions & 0 deletions doc/api-example.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
============================
``sourmash`` Python examples
============================

A very simple example: two k-mers
---------------------------------

Define two sequences:

>>> seq1 = "ATGGCA"
>>> seq2 = "AGAGCA"

Create two estimators using 3-mers, and add the sequences:

>>> import sourmash_lib
>>> E1 = sourmash_lib.Estimators(n=20, ksize=3)
>>> E1.add_sequence(seq1)

>>> E2 = sourmash_lib.Estimators(n=20, ksize=3)
>>> E2.add_sequence(seq2)

One of the 3-mers (out of 4) overlaps, so Jaccard index is 1/4:

>>> E1.jaccard(E2)
0.25

and of course the estimators match themselves:

>>> E1.jaccard(E1)
1.0

We can add sequences and query at any time --

>>> E1.add_sequence(seq2)
>>> x = E1.jaccard(E2)
>>> round(x, 3)
0.571

Consuming files
---------------

Suppose we want to create MinHash sketches from genomes --

>>> import glob, pprint
>>> genomes = glob.glob('data/GCF*.fna.gz')
>>> pprint.pprint(genomes)
['data/GCF_000005845.2_ASM584v2_genomic.fna.gz',
'data/GCF_000006945.1_ASM694v1_genomic.fna.gz',
'data/GCF_000783305.1_ASM78330v1_genomic.fna.gz']

We have to read them in (here using screed), but then they can be fed
into 'add_sequence' directly; here we set 'force=True' in ``add_sequence``
to ignore non-ACTGN characters.

(Note, just for speed reasons, we'll truncate the sequences to 50kb in length.)

>>> import screed
>>> estimators = []
>>> for g in genomes:
... E = sourmash_lib.Estimators(n=500, ksize=31)
... for record in screed.open(g):
... E.add_sequence(record.sequence[:50000], True)
... estimators.append(E)

And now the estimators can be compared against each other:

>>> for i, e in enumerate(estimators):
... print(genomes[i][:20], end=' ')
... for j, e2 in enumerate(estimators):
... x = e.jaccard(estimators[j])
... print(round(x, 3), end=' ')
... print('')
data/GCF_000005845.2 1.0 0.0 0.0
data/GCF_000006945.1 0.0 1.0 0.0
data/GCF_000783305.1 0.0 0.0 1.0

Note that the comparisons are quite quick; most of the time is spent in
making the estimators, which can be saved and loaded easily.

Saving and loading signature files
----------------------------------

>>> from sourmash_lib import signature
>>> sig1 = signature.SourmashSignature('titus@idyll.org', estimators[0],
... name=genomes[0][:20])
>>> with open('/tmp/genome1.sig', 'wt') as fp:
... signature.save_signatures([sig1], fp)

Here, ``/tmp/genome1.sig`` is a YAML file that can now be loaded and
compared -- first, load:

>>> sigdata = open('/tmp/genome1.sig', 'rt').read()
>>> siglist = signature.load_signatures(sigdata)
>>> loaded_sig = siglist[0]

then compare:

>>> loaded_sig.estimator.jaccard(sig1.estimator)
1.0
>>> sig1.estimator.jaccard(loaded_sig.estimator)
1.0
1 change: 1 addition & 0 deletions doc/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,7 @@ Contents:

command-line
api
api-example


Indices and tables
Expand Down
2 changes: 2 additions & 0 deletions pytest.ini
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
[pytest]
addopts = --doctest-glob='*.rst'

0 comments on commit edb9320

Please sign in to comment.