# Using the `LCA_Database` API

Create an `LCA_Database` like so:

In [4]:
import sourmash
db = sourmash.lca.LCA_Database(ksize=31, scaled=1000)

Compute signatures for some genomes, then load them and insert them into the database:

In [5]:
!sourmash compute --name-from-first -k 31 --scaled=1000 genomes/*

== This is sourmash version 3.2.4.dev5+g6484e78f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: genomes/akkermansia.fa, genomes/shew_os185.fa, genomes/shew_os223.fa
Computing signature for ksizes: [31]
Computing only nucleotide (and not protein) signatures.
Computing a total of 1 signature(s).
skipping genomes/akkermansia.fa - already done
skipping genomes/shew_os185.fa - already done
skipping genomes/shew_os223.fa - already done


In [7]:
sig1 = sourmash.load_one_signature('akkermansia.fa.sig', ksize=31)
sig2 = sourmash.load_one_signature('shew_os185.fa.sig', ksize=31)
sig3 = sourmash.load_one_signature('shew_os223.fa.sig', ksize=31)

In [8]:
db.insert(sig1, ident='akkermansia')
db.insert(sig2, ident='shew_os185')
db.insert(sig3, ident='shew_os223')

## Run `search` and `gather` via the `Index` API

In [14]:
from pprint import pprint
pprint(db.search(sig1, threshold=0.1))

[(1.0,
  SourmashSignature('CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome', 6822e0b7),
  None)]


In [15]:
pprint(db.search(sig2, threshold=0.1))

[(1.0,
  SourmashSignature('NC_009665.1 Shewanella baltica OS185, complete genome', b47b13ef),
  None),
 (0.22846441947565543,
  SourmashSignature('NC_011663.1 Shewanella baltica OS223, complete genome', ae6659f6),
  None)]
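
The second match scores below 1.0 because the two Shewanella signatures share only part of their hash values. As a rough illustration, assuming the reported score is the Jaccard similarity between the two hash sets, the calculation can be sketched in plain Python. The hash values and the helper `jaccard` below are made up for this example and are not part of sourmash.

```python
def jaccard(hashes_a, hashes_b):
    """Jaccard similarity: shared hashes divided by all distinct hashes."""
    a, b = set(hashes_a), set(hashes_b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch: two empty sets score 0
    return len(a & b) / len(a | b)

# Hypothetical toy hash sets, for illustration only (not the real signatures).
toy_os185 = {1, 2, 3, 4, 5, 6}
toy_os223 = {4, 5, 6, 7, 8, 9}

print(jaccard(toy_os185, toy_os185))  # a set compared with itself
print(jaccard(toy_os185, toy_os223))  # partial overlap: 3 shared of 9 distinct
```

A signature compared with itself always scores 1.0 under this measure, which matches the first entry in each result list above.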


In [16]:
pprint(db.gather(sig3))

[(1.0,
  SourmashSignature('NC_011663.1 Shewanella baltica OS223, complete genome', ae6659f6),
  None)]


## Retrieve all signatures with `signatures()`

In [19]:
for i in db.signatures():
    print(i)

SourmashSignature('CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome', 6822e0b7)
SourmashSignature('NC_009665.1 Shewanella baltica OS185, complete genome', b47b13ef)
SourmashSignature('NC_011663.1 Shewanella baltica OS223, complete genome', ae6659f6)


## Access identifiers and names

The unique identifiers in the database are the keys of the attribute `ident_to_idx`, which maps each identifier to an integer index (`idx`). The attribute `ident_to_name` maps each identifier to its full name, which is taken from `sig.name()` upon insertion.

In [20]:
pprint(db.ident_to_idx.keys())

dict_keys(['akkermansia', 'shew_os185', 'shew_os223'])


In [23]:
pprint(db.ident_to_name)

{'akkermansia': 'CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete '
                'genome',
 'shew_os185': 'NC_009665.1 Shewanella baltica OS185, complete genome',
 'shew_os223': 'NC_011663.1 Shewanella baltica OS223, complete genome'}
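
To make the relationship between these mappings concrete, here is a toy model in plain Python. It is only a sketch: the class `ToyIdentRegistry` and its `insert` method are hypothetical and do not reflect how sourmash is implemented. Only the attribute names `ident_to_idx`, `idx_to_ident`, and `ident_to_name` mirror the ones used in this notebook.

```python
class ToyIdentRegistry:
    """Hypothetical stand-in that mimics the three lookup dictionaries."""

    def __init__(self):
        self.ident_to_idx = {}   # identifier -> integer index
        self.idx_to_ident = {}   # integer index -> identifier
        self.ident_to_name = {}  # identifier -> full name

    def insert(self, ident, name):
        # Assumption made for this sketch: identifiers must be unique.
        if ident in self.ident_to_idx:
            raise ValueError('duplicate identifier: {}'.format(ident))
        idx = len(self.ident_to_idx)  # next unused integer index
        self.ident_to_idx[ident] = idx
        self.idx_to_ident[idx] = ident
        self.ident_to_name[ident] = name
        return idx

reg = ToyIdentRegistry()
reg.insert('akkermansia',
           'CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome')
reg.insert('shew_os185',
           'NC_009665.1 Shewanella baltica OS185, complete genome')

# Round trip: identifier -> idx -> identifier -> name.
idx = reg.ident_to_idx['shew_os185']
ident = reg.idx_to_ident[idx]
print(idx, ident, reg.ident_to_name[ident])
```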


## Access hash values directly

The attribute `hashval_to_idx` maps each individual hash value to the set of `idx` indices of the signatures that contain it.

See the method `_find_signatures()` for an example of how this is used in `search` and `gather`.
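
As a rough illustration of why such a mapping is useful for searching, the sketch below counts how many query hashes each signature index shares with a query, using only dictionary lookups. It is not the actual `_find_signatures()` implementation: the dictionary `toy_hashval_to_idx` and the helper `count_shared_hashes` are hypothetical and use made-up hash values.

```python
from collections import Counter

# Hypothetical toy inverted index: hash value -> set of signature indices.
toy_hashval_to_idx = {
    11: {0},
    22: {0, 1},
    33: {1},
    44: {1, 2},
    55: {2},
}

def count_shared_hashes(query_hashes, hashval_to_idx):
    """Count, per signature index, how many query hashes it contains."""
    counts = Counter()
    for hashval in query_hashes:
        # Hash values absent from the index contribute nothing.
        for idx in hashval_to_idx.get(hashval, ()):
            counts[idx] += 1
    return counts

counts = count_shared_hashes({22, 33, 44}, toy_hashval_to_idx)
for idx, n_shared in counts.most_common():
    print('idx {} shares {} hash(es) with the query'.format(idx, n_shared))
```

Only signatures that share at least one hash with the query appear in `counts`, so candidates can be found without comparing the query against every signature in turn.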

In [25]:
print('{} hash values total in this database'.format(len(db.hashval_to_idx)))

1300 hash values total in this database


In [27]:
all_idx = set()
for idx_set in db.hashval_to_idx.values():
    all_idx.update(idx_set)
print('belonging to signatures with idx {}'.format(all_idx))

belonging to signatures with idx {0, 1, 2}


In [35]:
first_three_hashvals = list(db.hashval_to_idx)[:3]

In [36]:
for hashval in first_three_hashvals:
    print('hashval {} belongs to idxs {}'.format(hashval, db.hashval_to_idx[hashval]))

hashval 17302105753387 belongs to idxs {0}
hashval 95741036335406 belongs to idxs {0}
hashval 165640715598232 belongs to idxs {0}


In [44]:
query_idx = 2
hashval_set = set()
for hashval, idx_set in db.hashval_to_idx.items():
    if query_idx in idx_set:
        hashval_set.add(hashval)
        
print('{} hashvals belong to query idx {}'.format(len(hashval_set), query_idx))

ident = db.idx_to_ident[query_idx]
print('query idx {} matches to ident {}'.format(query_idx, ident))

name = db.ident_to_name[ident]
print('query idx {} matches to name {}'.format(query_idx, name))

490 hashvals belong to query idx 2
query idx 2 matches to ident shew_os223
query idx 2 matches to name NC_011663.1 Shewanella baltica OS223, complete genome
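
The loop above scans the whole mapping for a single `query_idx`. When hash sets are needed for several signatures, one option is to invert the mapping once and then look each index up directly. The sketch below uses a made-up dictionary in place of `db.hashval_to_idx`; the helper `invert_hashval_to_idx` is hypothetical and not a sourmash function.

```python
from collections import defaultdict

def invert_hashval_to_idx(hashval_to_idx):
    """Build idx -> set of hash values from hash value -> set of idx."""
    idx_to_hashvals = defaultdict(set)
    for hashval, idx_set in hashval_to_idx.items():
        for idx in idx_set:
            idx_to_hashvals[idx].add(hashval)
    return dict(idx_to_hashvals)

# Hypothetical toy data standing in for db.hashval_to_idx.
toy = {11: {0}, 22: {0, 1}, 33: {1}, 44: {1, 2}, 55: {2}}

idx_to_hashvals = invert_hashval_to_idx(toy)
for idx in sorted(idx_to_hashvals):
    print('idx {} has hashvals {}'.format(idx, sorted(idx_to_hashvals[idx])))
```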


## TODO: add lineage manipulation examples