added API example

sourmash-bio · Jun 5, 2016 · edb9320 · edb9320
1 parent d7ddef6
commit edb9320
Show file tree

Hide file tree

Showing 4 changed files with 106 additions and 2 deletions.
diff --git a/Makefile b/Makefile
@@ -10,12 +10,12 @@ clean:
 	cd doc && make clean
 
 test: all
-	$(PYTHON) -m pytest sourmash_lib/*.py
+	$(PYTHON) -m pytest
 
 doc: .PHONY
 	cd doc && make html
 
 coverage: all
 	$(PYTHON) setup.py clean --all
 	SOURMASH_COVERAGE=1 $(PYTHON) setup.py build_ext -i
-	$(PYTHON) -m pytest --cov=. sourmash_lib/*.py
+	$(PYTHON) -m pytest --cov=.
diff --git a/doc/api-example.rst b/doc/api-example.rst
@@ -0,0 +1,101 @@
+============================
+``sourmash`` Python examples
+============================
+
+A very simple example: two k-mers
+---------------------------------
+
+Define two sequences:
+
+>>> seq1 = "ATGGCA"
+>>> seq2 = "AGAGCA"
+
+Create two estimators using 3-mers, and add the sequences:
+
+>>> import sourmash_lib
+>>> E1 = sourmash_lib.Estimators(n=20, ksize=3)
+>>> E1.add_sequence(seq1)
+
+>>> E2 = sourmash_lib.Estimators(n=20, ksize=3)
+>>> E2.add_sequence(seq2)
+
+One of the 3-mers (out of 4) overlaps, so Jaccard index is 1/4:
+
+>>> E1.jaccard(E2)
+0.25
+
+and of course the estimators match themselves:
+
+>>> E1.jaccard(E1)
+1.0
+
+We can add sequences and query at any time --
+
+>>> E1.add_sequence(seq2)
+>>> x = E1.jaccard(E2)
+>>> round(x, 3)
+0.571
+
+Consuming files
+---------------
+
+Suppose we want to create MinHash sketches from genomes --
+
+>>> import glob, pprint
+>>> genomes = glob.glob('data/GCF*.fna.gz')
+>>> pprint.pprint(genomes)
+['data/GCF_000005845.2_ASM584v2_genomic.fna.gz',
+ 'data/GCF_000006945.1_ASM694v1_genomic.fna.gz',
+ 'data/GCF_000783305.1_ASM78330v1_genomic.fna.gz']
+
+We have to read them in (here using screed), but then they can be fed
+into 'add_sequence' directly; here we set 'force=True' in ``add_sequence``
+to ignore non-ACTGN characters.
+
+(Note, just for speed reasons, we'll truncate the sequences to 50kb in length.)
+
+>>> import screed
+>>> estimators = []
+>>> for g in genomes:
+...     E = sourmash_lib.Estimators(n=500, ksize=31)
+...     for record in screed.open(g):
+...         E.add_sequence(record.sequence[:50000], True)
+...     estimators.append(E)
+
+And now the estimators can be compared against each other:
+
+>>> for i, e in enumerate(estimators):
+...    print(genomes[i][:20], end=' ')
+...    for j, e2 in enumerate(estimators):
+...       x = e.jaccard(estimators[j])
+...       print(round(x, 3), end=' ')
+...    print('')
+data/GCF_000005845.2 1.0 0.0 0.0 
+data/GCF_000006945.1 0.0 1.0 0.0 
+data/GCF_000783305.1 0.0 0.0 1.0 
+
+Note that the comparisons are quite quick; most of the time is spent in
+making the estimators, which can be saved and loaded easily.
+
+Saving and loading signature files
+----------------------------------
+
+>>> from sourmash_lib import signature
+>>> sig1 = signature.SourmashSignature('titus@idyll.org', estimators[0],
+...                                    name=genomes[0][:20])
+>>> with open('/tmp/genome1.sig', 'wt') as fp:
+...   signature.save_signatures([sig1], fp)
+
+Here, ``/tmp/genome1.sig`` is a YAML file that can now be loaded and
+compared -- first, load:
+
+>>> sigdata = open('/tmp/genome1.sig', 'rt').read()
+>>> siglist = signature.load_signatures(sigdata)
+>>> loaded_sig = siglist[0]
+
+then compare:
+
+>>> loaded_sig.estimator.jaccard(sig1.estimator)
+1.0
+>>> sig1.estimator.jaccard(loaded_sig.estimator)
+1.0
diff --git a/doc/index.rst b/doc/index.rst
@@ -47,6 +47,7 @@ Contents:
 
    command-line
    api
+   api-example
 
 
 Indices and tables

diff --git a/pytest.ini b/pytest.ini
@@ -0,0 +1,2 @@
+[pytest]
+addopts = --doctest-glob='*.rst'