# Refget Python package tutorial

Record some versions used in this tutorial:

In [34]:
from platform import python_version 
python_version()

'3.11.11'

In [1]:
import refget
refget.__version__

'0.8.0'

## Computing digests locally

In [36]:
from refget import sha512t24u_digest, digest_fasta

Show some results for sequence digests:

In [37]:
sha512t24u_digest('GGAA')

'YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj'
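
For reference, the GA4GH `sha512t24u` scheme can be sketched with the standard library alone: take the SHA-512 digest of the sequence bytes, keep the first 24 bytes, and base64url-encode them. The helper name `sha512t24u_sketch` is made up for this illustration, and this is a sketch of the published algorithm, not the package's own implementation.

```python
import base64
import hashlib


def sha512t24u_sketch(data: bytes) -> str:
    """Hypothetical helper: SHA-512, truncated to 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(data).digest()
    # 24 bytes encode to exactly 32 base64url characters, so no '=' padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


empty_digest = sha512t24u_sketch(b"")
ggaa_digest = sha512t24u_sketch(b"GGAA")
print(empty_digest)
print(ggaa_digest)
```

If `sha512t24u_digest` follows this definition, the second printed value should agree with the output shown above.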

You can also use the `digest_fasta` function to compute digests for every sequence in a FASTA file:

In [38]:
for x in digest_fasta('../../../test_fasta/base.fa'):
    print(f"{x.id}\t{x.length}\t{x.sha512t24u}\t{x.md5}")

chrX	8	iYtREV555dUFKg2_agSJW6suquUyPpMw	5f63cfaa3ef61f88c9635fb9d18ec945
chr1	4	YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj	31fc6ca291a32fb9df82b85e5f077e31
chr2	4	AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6	92c6a56c9e9459d8a42b96f7884710bc
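
To make the per-record output concrete, here is a minimal sketch of the same idea using only the standard library. The names `SeqDigest` and `iter_fasta_digests` are hypothetical, and uppercasing the sequence before hashing is an assumption based on the refget convention; the package's real parser may behave differently.

```python
import base64
import hashlib
from typing import Iterator, List, NamedTuple, Optional


class SeqDigest(NamedTuple):  # hypothetical record type for this sketch
    id: str
    length: int
    sha512t24u: str
    md5: str


def _digest_record(name: str, chunks: List[str]) -> SeqDigest:
    # Assumption: sequences are uppercased before hashing.
    seq = "".join(chunks).upper().encode("ascii")
    sha = base64.urlsafe_b64encode(hashlib.sha512(seq).digest()[:24]).decode("ascii")
    return SeqDigest(name, len(seq), sha, hashlib.md5(seq).hexdigest())


def iter_fasta_digests(fasta_text: str) -> Iterator[SeqDigest]:
    """Yield one SeqDigest per FASTA record found in the given text."""
    name: Optional[str] = None
    chunks: List[str] = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield _digest_record(name, chunks)
            # The record id is the first whitespace-delimited token of the header.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield _digest_record(name, chunks)


demo = ">chr1 demo record\nGG\nAA\n>chr2\ntcga\n"
records = list(iter_fasta_digests(demo))
for r in records:
    print(f"{r.id}\t{r.length}\t{r.sha512t24u}\t{r.md5}")
```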


## Connecting to a remote API

The refget package provides a simple Python wrapper around a remotely hosted refget RESTful API. Provide the base URL when constructing a `RefgetClient` object, and you can retrieve sequences from the remote server.

In [None]:
rgc = refget.RefgetClient(seq_api_urls=["https://beta.ensembl.org/data/refget/"])

In [3]:
rgc.get_sequence("6681ac2f62509cfc220d78751b8dc524", start=0, end=10)

INFO:refget.clients:Successful response from https://beta.ensembl.org/data/refget/


'CCACACCACA'

In [4]:
rgc.get_sequence("6681ac2f62509cfc220d78751b8dc524", start=0, end=50)

INFO:refget.clients:Successful response from https://beta.ensembl.org/data/refget/


'CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC'
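
The `start` and `end` arguments correspond to the range query parameters of the refget sequence endpoint, where `start` is 0-based and inclusive and `end` is exclusive, so the two calls above return 10 and 50 bases. The sketch below builds such a request URL by hand. `build_sequence_url` is a hypothetical helper, and the `sequence/{digest}` path layout is taken from the refget API specification; how the client assembles its URLs internally is not shown here.

```python
from urllib.parse import urlencode


def build_sequence_url(base_url, digest, start=None, end=None):
    """Hypothetical helper: build a refget-style sequence URL with an optional range."""
    # Normalise the base so exactly one slash separates it from the endpoint path.
    url = base_url.rstrip("/") + "/sequence/" + digest
    # Only include the range parameters that were actually supplied.
    params = {k: v for k, v in (("start", start), ("end", end)) if v is not None}
    return f"{url}?{urlencode(params)}" if params else url


url = build_sequence_url(
    "https://beta.ensembl.org/data/refget/",
    "6681ac2f62509cfc220d78751b8dc524",
    start=0,
    end=10,
)
print(url)
```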

You can also hit the `{digest}/metadata` and `service_info` API endpoints described in the refget API specification:

In [None]:
rgc.get_metadata("6681ac2f62509cfc220d78751b8dc524")

AttributeError: 'RefGetClient' object has no attribute 'meta'

In [None]:
rgc.service_info()

{'service': {'algorithms': ['ga4gh', 'md5', 'trunc512'],
  'circular_supported': True,
  'subsequence_limit': None,
  'supported_api_versions': ['1.0.0']}}

When requesting a sequence that is not found, the service responds appropriately:

In [None]:
rgc.refget(trunc512_digest('TCGATCGA'))

'Not Found'

## Use a local database for caching

By default, any full sequences retrieved from an API are cached locally in memory (in a Python dict). This data will not persist past the current session, but the cache is useful if your application makes repeated requests. Here, we re-request the sequence requested above. It is much faster this time because it is served from the local cache:


In [None]:
rgc.refget("6681ac2f62509cfc220d78751b8dc524", start=0, end=10)

'CCACACCACA'
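
The caching behaviour can be pictured as a read-through cache keyed by digest. The following is a minimal sketch, not the client's actual code: `CachedSequenceStore` and `fake_remote_fetch` are hypothetical names, and the design (store the full sequence once, then slice locally) is an assumption drawn from the description above.

```python
from typing import Callable, MutableMapping, Optional


class CachedSequenceStore:
    """Hypothetical read-through cache keyed by sequence digest."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        cache: Optional[MutableMapping[str, str]] = None,
    ):
        self._fetch = fetch
        # Compare against None explicitly: an empty dict-like cache is falsy,
        # so `cache or {}` would silently discard a caller-supplied empty store.
        self._cache = {} if cache is None else cache

    def get(self, digest: str, start: Optional[int] = None, end: Optional[int] = None) -> str:
        if digest not in self._cache:
            # Cache miss: fetch the full sequence once and remember it.
            self._cache[digest] = self._fetch(digest)
        # Subsequence requests are answered by slicing the cached full sequence.
        return self._cache[digest][start:end]


remote_calls = []


def fake_remote_fetch(digest: str) -> str:
    """Stand-in for a network request; records each call it receives."""
    remote_calls.append(digest)
    return {"demo-digest": "CCACACCACACCCACA"}[digest]


store = CachedSequenceStore(fake_remote_fetch)
first = store.get("demo-digest", 0, 10)
second = store.get("demo-digest", 0, 10)
print(first, second, len(remote_calls))
```

Because the constructor accepts any mutable mapping, the same class works unchanged with a persistent dict-like backend.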

We can also add new sequences into the database:

In [None]:
rgc.refget(refget.md5('TCGATCGA'))  # This sequence is not found in our database yet

'Not Found'

In [None]:
checksum = rgc.load_seq("TCGATCGA")  # So, let's add it into database

In [None]:
rgc.refget(checksum)  # This time it returns

'TCGATCGA'

Keep in mind that sequences added in this way are added to your *local* database, not to the remote API, so when we restart, they will be gone:

In [None]:
del rgc

In [None]:
rgc = refget.RefGetClient("https://refget.herokuapp.com/sequence/")
rgc.refget(refget.md5('TCGATCGA'))  # The sequence we loaded earlier is gone

'Not Found'

## Making data persist

If you want to retain your local cache, you can use a dict that is backed by persistent storage, such as a database on disk or another running process. There are many ways to do this: for example, you can use an SQLite database, a Redis database, or a MongoDB database. Here we show how to use the `sqlitedict` package to back your local database.

To start, create a dict-like object and pass it to the RefGetClient constructor.

In [None]:
import refget
from sqlitedict import SqliteDict
mydict = SqliteDict('./my_db.sqlite', autocommit=True)

In [None]:
rgc = refget.RefGetClient("https://refget.herokuapp.com/sequence/", mydict)

Now when we retrieve a sequence it will be added to the local sqlite database automatically.

In [None]:
rgc.refget("6681ac2f62509cfc220d78751b8dc524", start=0, end=50)

'CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC'

Look, we can see that this object has been added to our sqlite database:

In [None]:
mydict["6681ac2f62509cfc220d78751b8dc524"][1:50]

'CACACCACACCCACACACCCACACACCACACCACACACCACACCACACC'

So now if we delete this object and create it again *without the API connection*, but with the `mydict` local backend, we can still retrieve the sequence:

In [None]:
del rgc

In [None]:
rgc = refget.RefGetClient(database=mydict)

In [None]:
rgc.refget("6681ac2f62509cfc220d78751b8dc524", start=0, end=50)

'CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC'
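
Any object that behaves like a dict can play this role. As an illustration of what such a backend involves, here is a minimal sketch of a string-to-string mapping stored in SQLite, written with the standard library only. `SqliteStrDict` is a hypothetical class for this example and is much simpler than `sqlitedict`; it is not how that package is implemented.

```python
import os
import sqlite3
import tempfile
from collections.abc import MutableMapping


class SqliteStrDict(MutableMapping):
    """Hypothetical dict-like store that keeps string values in an SQLite file."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)"
        )
        self._conn.commit()

    def __getitem__(self, key):
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        # Commit on every write, similar in spirit to autocommit=True.
        self._conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value)
        )
        self._conn.commit()

    def __delitem__(self, key):
        if key not in self:
            raise KeyError(key)
        self._conn.execute("DELETE FROM kv WHERE key = ?", (key,))
        self._conn.commit()

    def __iter__(self):
        return iter([r[0] for r in self._conn.execute("SELECT key FROM kv")])

    def __len__(self):
        return self._conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]

    def close(self):
        self._conn.close()


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "seqs.sqlite")

    writer = SqliteStrDict(db_path)
    writer["demo-key"] = "TCGATCGA"
    writer.close()

    # A fresh object opened on the same file still sees the stored value.
    reader = SqliteStrDict(db_path)
    reloaded = reader["demo-key"]
    count = len(reader)
    reader.close()

print(reloaded, count)
```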

## Loading a fasta file

The client also comes with a helper method, `load_fasta`, that loads every sequence in a FASTA file into the local database and returns a checksum for each one.

In [None]:
fa_file = "../demo_fasta/demo.fa"
content = rgc.load_fasta(fa_file)

In [None]:
content

[{'name': 'chr1',
  'length': 4,
  'sequence_digest': 'f1f8f4bf413b16ad135722aa4591043e'},
 {'name': 'chr2',
  'length': 4,
  'sequence_digest': '45d0ff9f1a9504cf2039f89c1ffb4c32'}]

In [None]:
rgc.refget(content[0]['sequence_digest'])

'ACGT'

Because this client was created without an API connection, looking up a digest that is not in the local database fails until a remote URL is set:

In [None]:
rgc.refget("blah")

No remote URL connected


In [None]:
rgc.api_url_base = "https://refget.herokuapp.com/sequence/"

In [None]:
rgc.refget("blah")

'Not Found'

In [None]:
# You can show the complete contents of the database like this:
# rgc.show()
