# refget-py tutorial

In [1]:
import refget
from refget import trunc512_digest

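Compute digests for a few sequences. The optional second argument sets how many digest bytes to keep; the default yields 48 hex characters (24 bytes), while 26 yields 52: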

In [2]:
trunc512_digest('ACGT')

'68a178f7c740c5c240aa67ba41843b119d3bf9f8b0f0ac36'

In [3]:
trunc512_digest('TCGA')

'3912dddce432f3085c6b4f72a644c4c4c73f07215a9679ce'

In [4]:
trunc512_digest('ACGT', 26)

'68a178f7c740c5c240aa67ba41843b119d3bf9f8b0f0ac36cf70'
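To make the truncation concrete, here is a minimal sketch of how a truncated SHA-512 digest can be computed with the standard library. The function name `trunc512_sketch` and its parameter `n_bytes` are hypothetical and are not part of refget; the sketch assumes the digest is the hex encoding of the first bytes of the SHA-512 hash of the sequence.

```python
import hashlib


def trunc512_sketch(seq: str, n_bytes: int = 24) -> str:
    """Hex-encode the first n_bytes of the SHA-512 digest of seq (illustrative only)."""
    full = hashlib.sha512(seq.encode("ascii")).digest()  # 64 raw bytes
    return full[:n_bytes].hex()  # two hex characters per byte


short = trunc512_sketch("ACGT")
longer = trunc512_sketch("ACGT", 26)

print(short)
print(len(short), len(longer))
print(longer.startswith(short))  # a longer truncation extends the shorter one
```

Because truncation only drops trailing bytes, a longer digest always starts with the shorter one, which matches the two outputs shown for 'ACGT' above.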

## Use a database

Next, instantiate a RefDB object. It needs a database in which to store lookup values. For a demo, a plain Python dictionary works as the lookup database, but its contents are lost when the Python session ends.

Seed our database with a few pre-existing entries:

In [5]:
local_lookup_dict = {
    trunc512_digest('ACGT'): "ACGT",
    trunc512_digest('TCGA'): "TCGA"
}

rgdb_local = refget.RefDB(local_lookup_dict)


Retrieve a sequence by its checksum:

In [6]:
rgdb_local.refget(trunc512_digest('TCGA'))

'TCGA'

We can also add new sequences to the database. A lookup for a sequence that has not been loaded yet returns 'Not found':

In [7]:
rgdb_local.refget(trunc512_digest('TCGATCGA'))  # This sequence is not found in our database yet

'Not found'

In [8]:
checksum = rgdb_local.load_seq("TCGATCGA")  # So, let's add it to the database

In [9]:
rgdb_local.refget(checksum)  # This time the sequence is returned

'TCGATCGA'
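The load-then-lookup behaviour above can be pictured with a small stand-in class. This is only a sketch under assumptions: `TinyRefDB` and `digest` are hypothetical names, and the real RefDB may be implemented differently. It assumes that loading a sequence stores it under its own digest and that a missing key yields the string 'Not found'.

```python
import hashlib


def digest(seq, n_bytes=24):
    """Truncated SHA-512 hex digest (assumed scheme, for illustration)."""
    return hashlib.sha512(seq.encode("ascii")).digest()[:n_bytes].hex()


class TinyRefDB:
    """Hypothetical stand-in for RefDB, backed by any dict-like object."""

    def __init__(self, database):
        self.database = database

    def load_seq(self, seq):
        # Store the sequence under its own checksum and hand the key back.
        checksum = digest(seq)
        self.database[checksum] = seq
        return checksum

    def refget(self, checksum):
        # Mirror the tutorial's behaviour for unknown checksums.
        try:
            return self.database[checksum]
        except KeyError:
            return "Not found"


tiny = TinyRefDB({digest("ACGT"): "ACGT"})
before = tiny.refget(digest("TCGATCGA"))  # not loaded yet
key = tiny.load_seq("TCGATCGA")
after = tiny.refget(key)
print(before, after)
```

Since the class only uses item assignment and item lookup, any object offering those two operations could serve as the database.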

## Switching to a Redis back-end
A dict-backed database does not persist between sessions, so let's use a Redis back-end instead. If you can run a local Redis server, you can use it as the back-end. First, start a server like this:

```
docker run --rm --workdir="`pwd`" redis:5.0.5 redis-server
```

Then instantiate a new RefDB object that uses it:

In [10]:
rgdb = refget.RefDB(refget.RedisDict())
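The tutorial does not show how `refget.RedisDict` works internally. As a rough illustration of the idea, the sketch below wraps any client exposing `get` and `set` methods in a dict-style interface. `KeyValueDict` and `InMemoryClient` are hypothetical names introduced here; the in-memory client exists only so the example runs without a server.

```python
class KeyValueDict:
    """Hypothetical adapter giving a get/set client dict-style access."""

    def __init__(self, client):
        self.client = client

    def __setitem__(self, key, value):
        self.client.set(key, value)

    def __getitem__(self, key):
        value = self.client.get(key)
        if value is None:  # the client signals a missing key with None
            raise KeyError(key)
        return value


class InMemoryClient:
    """Stand-in client so the sketch runs without a Redis server."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)


store = KeyValueDict(InMemoryClient())
store["example-key"] = "GGAA"
print(store["example-key"])
```

With the redis-py package installed and a server reachable, a client such as `redis.Redis(decode_responses=True)` could be passed in place of `InMemoryClient()`, since it also offers `get` and `set` and returns `None` for missing keys.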

## Database insertion

Insert a sequence into the database, then retrieve it by its checksum:

In [11]:
checksum = rgdb.load_seq("GGAA")
rgdb.refget(checksum)

'GGAA'

## Insert and retrieve a sequence collection (fasta file)

In [12]:
fa_file = "demo.fa"
checksum, content = rgdb.load_fasta(fa_file)
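As a sketch of what loading a collection could involve, the example below parses FASTA text, stores each sequence under its own digest, and stores a 'name:digest' summary string under a collection digest. The helper names (`digest`, `parse_fasta`, `load_fasta_text`) are hypothetical, and the assumption that the collection checksum is the digest of the summary string is ours, not something the tutorial states. The FASTA content is inferred from the outputs shown in this tutorial.

```python
import hashlib


def digest(text, n_bytes=24):
    """Truncated SHA-512 hex digest (assumed scheme, for illustration)."""
    return hashlib.sha512(text.encode("ascii")).digest()[:n_bytes].hex()


def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = (line[1:].split() or [""])[0]
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequences may span several lines
    return records


def load_fasta_text(database, text):
    """Store each sequence, then a summary entry for the whole collection."""
    content = {}
    for name, seq in parse_fasta(text).items():
        seq_checksum = digest(seq)
        database[seq_checksum] = seq
        content[name] = seq_checksum
    summary = ";".join(f"{name}:{cs}" for name, cs in content.items())
    collection_checksum = digest(summary)  # assumption: digest of the summary
    database[collection_checksum] = summary
    return collection_checksum, content


db = {}
top, names = load_fasta_text(db, ">chr1\nACGT\n>chr2\nTCGA\n")
print(db[top])
print(db[names["chr1"]])
```

Storing every sequence as its own entry is what allows single-sequence lookups to work independently of the collection.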

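The call returns the checksum of the whole collection together with content, which maps each sequence name to that sequence's checksum. Here we retrieve the complete fasta file: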

In [13]:
rgdb.refget(checksum)

'>chr1\nACGT\n>chr2\nTCGA'

You can limit recursion to get just the checksums for individual sequences, rather than the sequences themselves:

In [14]:
rgdb.refget(checksum, reclimit=1)

'chr1:68a178f7c740c5c240aa67ba41843b119d3bf9f8b0f0ac36;chr2:3912dddce432f3085c6b4f72a644c4c4c73f07215a9679ce'
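One way to picture the recursion limit is a resolver that expands a collection entry into its member sequences unless the limit stops it after the first lookup. This is a sketch under assumptions: `resolve` and `digest` are hypothetical names, the storage layout (a 'name:digest' summary string per collection) is inferred from the output above, and the real implementation may interpret reclimit differently.

```python
import hashlib


def digest(text, n_bytes=24):
    """Truncated SHA-512 hex digest (assumed scheme, for illustration)."""
    return hashlib.sha512(text.encode("ascii")).digest()[:n_bytes].hex()


def resolve(database, checksum, reclimit=None):
    """Look up checksum, expanding collections up to reclimit levels deep."""
    value = database.get(checksum)
    if value is None:
        return "Not found"
    is_collection = ":" in value  # plain sequences contain no colon
    if not is_collection or (reclimit is not None and reclimit <= 1):
        return value  # primary sequence, or the limit forbids going deeper
    next_limit = None if reclimit is None else reclimit - 1
    parts = []
    for item in value.split(";"):
        name, member = item.split(":", 1)
        parts.append(f">{name}\n{resolve(database, member, next_limit)}")
    return "\n".join(parts)


# Build a tiny database by hand: two sequences plus one collection entry.
summary = f"chr1:{digest('ACGT')};chr2:{digest('TCGA')}"
top = digest(summary)
db = {digest("ACGT"): "ACGT", digest("TCGA"): "TCGA", top: summary}

full = resolve(db, top)
shallow = resolve(db, top, reclimit=1)
print(repr(full))
print(shallow == summary)
```

With no limit the resolver follows every member checksum down to the sequence itself; with a limit of 1 it stops after the first lookup and returns the stored summary unchanged.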

The individual sequences are also retrievable independently because each sequence from the fasta file is stored as a primary unit. Test some single-sequence lookups from the database:

In [15]:
rgdb.refget(content["chr1"])

'ACGT'

In [16]:
rgdb.refget(trunc512_digest('ACGT'))

'ACGT'

If we discard that object and create a new one that uses the same Redis back-end, the data is still available, because it is stored in Redis rather than in the Python object:

In [17]:
rgdb = None
rgdb2 = refget.RefDB(refget.RedisDict())
rgdb2.refget(checksum)

'>chr1\nACGT\n>chr2\nTCGA'