# Refgenie tutorial

The `Refgenie` class is a key component of the Refgenie package, which manages and organizes reference genome files. It provides methods for interacting with reference genome assets and related resources.

## Purpose of this file

This file serves as a tutorial for using the `Refgenie` Python API. It demonstrates how to set up a temporary directory for storing reference genome assets, configure the `Refgenie` instance, and perform various operations such as listing available assets, retrieving asset information, and managing data channels. 

To learn more about any of the concepts mentioned in the code, please refer to the corresponding section of the documentation.

## Installation

Until the package is released, clone the repository and install it from source, for example using `uv`:
  
```bash
git clone <repo_url>
cd refgenie1
uv pip install .
```

## Configuration

First, let's import a few utilities used throughout this tutorial.


In [1]:
from pathlib import Path
from rich import print
import os

Let's set a temporary directory to store the refgenie assets.

In [2]:
from tempfile import mkdtemp

REFGENIE_CODE_PATH = Path.cwd().parent / "refgenie"

# create the directories with mkdtemp so that they persist for the whole session
# (a discarded TemporaryDirectory object removes its directory once it is garbage-collected)
archive_tmp_dir = mkdtemp(prefix="refgenie_archive_demo_")
tmp_dir = mkdtemp(prefix="refgenie_demo_")

# set the environment variables that tell refgenie where to store archives and assets
os.environ["REFGENIE_GENOME_ARCHIVE_FOLDER"] = archive_tmp_dir
os.environ["REFGENIE_GENOME_FOLDER"] = tmp_dir
# optionally, set REFGENIE_DB_CONFIG_PATH to a sqlite config file in the refgenie package
# os.environ["REFGENIE_DB_CONFIG_PATH"] = (REFGENIE_CODE_PATH / "config" / "sqlite_config.yaml").as_posix()


Let's inspect the refgenie configuration object.

In [3]:
from refgenie.config import config

print(config)

### Database backend

As you can see, the refgenie configuration points to a database configuration file. By default, refgenie is backed by a SQLite database.

Let's inspect the refgenie database configuration file.

In [4]:
%cat {config.database_config_path}

type: sqlite
path: /Users/stolarczyk/.refgenie/refgenie


Make sure the directory where the SQLite database file is stored exists, and create it if it doesn't.

In [5]:
!refgenie1 purge --force
!rm -rf ~/refgenie_db
!mkdir -p ~/refgenie_db

INFO     Purged genome folder: '/var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ig9f7if_' (refgenie.py:534)
INFO     Purged refgenie backend: 'sqlite:////Users/stolarczyk/.refgenie/refgenie' (refgenie.py:536)


In practice, you don't even need to create the configuration file manually: as we've seen above, refgenie ships with a default configuration file that is used if none is provided.

For production deployments, you may want to use a different database backend, such as MySQL or PostgreSQL. In this case, you can provide the database configuration file path by setting the `REFGENIE_DB_CONFIG_PATH` environment variable, or set/override the database engine using the `database_engine` argument of the `Refgenie` constructor. The value passed must be a `sqlalchemy.engine.Engine` object.
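As a minimal sketch of the environment-variable route, the snippet below writes a SQLite configuration file using the two keys shown in the default file above (`type` and `path`) and points `REFGENIE_DB_CONFIG_PATH` at it. The directory and file names are illustrative assumptions, and the variable presumably has to be set before refgenie loads its configuration.

```python
import os
from pathlib import Path
from tempfile import mkdtemp

# Hypothetical location for this demo; any writable directory works.
config_dir = Path(mkdtemp(prefix="refgenie_db_config_demo_"))
db_file = config_dir / "refgenie.sqlite"
db_config_file = config_dir / "sqlite_config.yaml"

# Same two keys as in the default configuration file shown above.
db_config_file.write_text(f"type: sqlite\npath: {db_file.as_posix()}\n")

# Point refgenie at the custom database configuration file.
os.environ["REFGENIE_DB_CONFIG_PATH"] = db_config_file.as_posix()

print(db_config_file.read_text())
```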



### Refgenieserver client

Similarly, refgenie ships with a Refgenieserver client, which is used by default to retrieve remote genome assets and does not need to be replaced in the majority of use cases. However, you can provide a custom URL-client mapping to the `Refgenie` constructor by setting the `server_client_mapping` argument. Please note that the clients need to follow a specific interface, defined in the `refgenie.server.ServerClient` protocol. More details below.

In [6]:
from refgenie.server.models import ServerClient
from rich import inspect

inspect(
    ServerClient,
    methods=True,
    docs=True,
    help=True,
    title="ServerClient Protocol structure",
)
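For readers unfamiliar with protocols: a `typing.Protocol` describes an interface structurally, so a custom client does not need to inherit from anything; it only needs to provide the expected methods. The sketch below illustrates the idea with a made-up protocol. `DemoClient`, `fetch_asset_names`, and `InMemoryClient` are hypothetical names and do not reflect the actual `ServerClient` interface, which the `inspect` call above displays.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class DemoClient(Protocol):
    """Hypothetical interface; NOT the real refgenie ServerClient protocol."""

    def fetch_asset_names(self, genome: str) -> list[str]: ...


class InMemoryClient:
    """Satisfies DemoClient structurally, without inheriting from it."""

    def __init__(self, assets: dict[str, list[str]]) -> None:
        self._assets = assets

    def fetch_asset_names(self, genome: str) -> list[str]:
        return self._assets.get(genome, [])


class NotAClient:
    """Lacks the required method, so it does not satisfy the protocol."""


client = InMemoryClient({"t7": ["fasta", "bowtie2_index"]})
print(isinstance(client, DemoClient))
print(isinstance(NotAClient(), DemoClient))
print(client.fetch_asset_names("t7"))
```

Note that `runtime_checkable` only verifies that the methods exist, not that their signatures match, so a static type checker remains the more reliable way to validate a custom client.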

Now, let's import the `Refgenie` class from the `refgenie` package and create an instance.


In [7]:
from refgenie import Refgenie

refgenie = Refgenie(suppress_migrations=True)

Let's ensure we start with a clean slate by removing any existing refgenie metadata and initializing a new refgenie instance.


In [8]:
refgenie.init()  # initialize new refgenie instance

Let's subscribe to the default refgenie server. This method reaches out to the server at the provided URL and queries its OpenAPI specification to determine whether the server is refgenie-compatible. If it is, the server is added to the list of subscribed servers.

Note: there's currently no public compatible refgenieserver instance deployed, so the following code snippets use a local refgenieserver instance serving the latest API.


In [9]:
refgenie.configuration.subscribe("http://localhost:8000")

And that's it! We have now configured a refgenie instance and subscribed to a refgenie-compatible server. We can now start using the refgenie instance to manage reference genome assets.

### Pull an asset

Let's initialize a new genome by pulling an asset of fasta class. This will create a new directory in the `data` subdirectory of the `genome_folder` and mirror it in the `alias` directory with symbolic links, rather than copies of the files.


In [10]:
refgenie.pull(alias_name="rCRSd-1", asset_group_name="fasta")

Asset(name='samtools-1.21', description=None, size=34082, updated_at=datetime.datetime(2025, 7, 6, 18, 28, 50, 885001), path='data/ZtAkf32sCUjeSl0KxVA5DVevklHDazQM/fasta/samtools-1.21', digest='8ccce3f01185ef75c8dabeb9e03f8822', recipe_id=None, asset_group_id=1, created_at=datetime.datetime(2025, 7, 6, 18, 28, 50, 885007))

As you can see above, the genome has been initialized and the `fasta` asset was pulled. Let's inspect the initialized genome.


In [11]:
print(refgenie.genomes_table())

Now that a `fasta` asset has been pulled for the `rCRSd-1` genome, let's add a custom asset class and recipe so that we can build assets that depend on a `fasta` asset.

### Add `bowtie2_index` asset class and recipe

When you supply a URL (a `str` object) rather than a local path (a `pathlib.Path` object), refgenie fetches the remote file and registers it as if it were a local file.


In [12]:
refgenie.asset_class.add(
    "https://github.com/refgenie/recipes/raw/refgenie1/asset_classes/bowtie2_index_asset_class.yaml"
)
refgenie.recipe.add(
    "https://github.com/refgenie/recipes/raw/refgenie1/recipes/bowtie2_index_asset_recipe.yaml"
)

Recipe(id=2, name='bowtie2_index', version='0.0.1', description='Genome index for bowtie2, produced with bowtie2-build', output_asset_class_id=2, command_templates=['bowtie2-build --threads {{values.params["threads"]}} {{values.genome_folder}}/{{values.assets["fasta"].seek_keys_dict["fasta"]}} {{values.output_folder}}/{{values.genome_digest}}'], input_params={'threads': {'description': 'Number of threads to use', 'default': 1}}, input_files=None, input_assets={'fasta': {'asset_class': 'fasta', 'description': 'fasta asset for genome', 'default': 'fasta'}}, docker_image='docker.io/databio/refgenie', custom_properties={'version': "bowtie2-build --version | awk 'NR==1{print $3}'"}, default_asset='{{values.custom_properties.version}}', updated_at=datetime.datetime(2025, 7, 6, 18, 28, 51, 196749), created_at=datetime.datetime(2025, 7, 6, 18, 28, 51, 196751))
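The type of the argument is what distinguishes the two cases. The sketch below shows one way such a dispatch could look; `describe_source` is a hypothetical helper written for this tutorial, not part of the refgenie API, and it only classifies the input without downloading anything.

```python
from pathlib import Path
from typing import Union
from urllib.parse import urlparse


def describe_source(source: Union[str, Path]) -> str:
    """Classify a definition source as 'local' or 'remote' (hypothetical helper)."""
    if isinstance(source, Path):
        return "local"
    if isinstance(source, str):
        # Assumption for this sketch: strings must be http(s) URLs.
        if urlparse(source).scheme in ("http", "https"):
            return "remote"
        raise ValueError(f"Expected an http(s) URL, got: {source!r}")
    raise TypeError(f"Unsupported source type: {type(source).__name__}")


print(describe_source(Path("recipes/bowtie2_index_asset_recipe.yaml")))
print(
    describe_source(
        "https://github.com/refgenie/recipes/raw/refgenie1/recipes/bowtie2_index_asset_recipe.yaml"
    )
)
```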

Let's verify that the asset class and recipe were added by listing the available asset classes and recipes:


In [13]:
from rich import print

print(refgenie.recipe.table())
print(refgenie.asset_class.table())

## Build a `fasta` asset


In [14]:
from refgenie import BuildParams

refgenie.build_asset(
    recipe_name="fasta",
    genome_name="t7",
    asset_group_name="fasta",
    params=BuildParams(
        files={"fasta": REFGENIE_CODE_PATH.parent / "tests/data/t7.fa"}
    ),
    genome_description="Genome of T7 phage",
)

Using default schema: /Users/stolarczyk/code/refgenie1/.venv/lib/python3.12/site-packages/pipestat_output_schema.yaml


No pipestat output schema was supplied to PipestatManager.
Initializing results file '/var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/fasta/samtools-1.21/_refgenie_build/stats.yaml'


### Pipeline run code and environment:

*          Command: `/Users/stolarczyk/code/refgenie1/.venv/lib/python3.12/site-packages/ipykernel_launcher.py --f=/Users/stolarczyk/Library/Jupyter/runtime/kernel-v3b4e8f4927474eb39f1b6ad0ce5ad7909ee3e90d0.json`
*     Compute host: `michals-macbook-pro-2.home`
*      Working dir: `/Users/stolarczyk/code/refgenie1/docs`
*        Outfolder: `/var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/fasta/samtools-1.21/_refgenie_build/`
*         Log file: `/var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/fasta/samtools-1.21/_refgenie_build/refgenie_t7_fasta_samtools-1.21_log.md`
*       Start time:  (07-06 20:28:51) elapsed: 0.0 _TIME_

### Version log:

*   Python version: `3.12.7`
*      Pypiper dir: `/Users/stolarczyk/code/refgenie1/.venv/lib/python3.12/site-packages/pypiper`
*  Pypiper version: `0.14.3`
*     Pypiper hash: `f0996bac

Asset(name='samtools-1.21', description='DNA sequences in the FASTA format, indexed FASTA (produced with samtools index), chromosome sizes file and FASTA dict (produced with samtools dict)', size=42981, updated_at=datetime.datetime(2025, 7, 6, 18, 28, 51, 515997), path='data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/fasta/samtools-1.21', digest='51a58ef25e4f1d7e76f226fd5655754f', recipe_id=1, asset_group_id=2, created_at=datetime.datetime(2025, 7, 6, 18, 28, 51, 516002))

### Build a `bowtie2_index` asset

The `bowtie2_index` asset class and recipe have been added successfully. Let's build the `bowtie2_index` asset for the `t7` genome.


In [15]:
from refgenie.models import BuildParams

refgenie.build_asset(
    recipe_name="bowtie2_index",
    genome_name="t7",
    asset_group_name="bowtie2_index",
    params=BuildParams(params={"threads": 8}),
    archive=True,  # archive the asset right after building
)

Using default schema: /Users/stolarczyk/code/refgenie1/.venv/lib/python3.12/site-packages/pipestat_output_schema.yaml


No pipestat output schema was supplied to PipestatManager.
Initializing results file '/var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/bowtie2_index/2.5.3/_refgenie_build/stats.yaml'


### Pipeline run code and environment:

*          Command: `/Users/stolarczyk/code/refgenie1/.venv/lib/python3.12/site-packages/ipykernel_launcher.py --f=/Users/stolarczyk/Library/Jupyter/runtime/kernel-v3b4e8f4927474eb39f1b6ad0ce5ad7909ee3e90d0.json`
*     Compute host: `michals-macbook-pro-2.home`
*      Working dir: `/Users/stolarczyk/code/refgenie1/docs`
*        Outfolder: `/var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/bowtie2_index/2.5.3/_refgenie_build/`
*         Log file: `/var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/bowtie2_index/2.5.3/_refgenie_build/refgenie_t7_bowtie2_index_2.5.3_log.md`
*       Start time:  (07-06 20:28:51) elapsed: 0.0 _TIME_

### Version log:

*   Python version: `3.12.7`
*      Pypiper dir: `/Users/stolarczyk/code/refgenie1/.venv/lib/python3.12/site-packages/pypiper`
*  Pypiper version: `0.14.3`
*     Pypiper hash: `f0996bac

Settings:
  Output files: "/var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/bowtie2_index/2.5.3/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 32
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  /var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/fasta/samtools-1.21/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.fa
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference 

Building a SMALL index
Renaming /var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/bowtie2_index/2.5.3/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.3.bt2.tmp to /var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/bowtie2_index/2.5.3/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.3.bt2
Renaming /var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/bowtie2_index/2.5.3/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.4.bt2.tmp to /var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/bowtie2_index/2.5.3/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.4.bt2
Renaming /var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/bowtie2_index/2.5.3/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.1.bt2.tmp to /var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLK

Exited Ebwt loop
fchr[A]: 0
fchr[C]: 10842
fchr[G]: 19880
fchr[T]: 30171
fchr[$]: 39937
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 4207850 bytes to primary EBWT file: /var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/bowtie2_index/2.5.3/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.rev.1.bt2.tmp
Wrote 9992 bytes to secondary EBWT file: /var/folders/18/3fc3jyt50sv9kqx6hdqg5b600000gn/T/refgenie_demo_ugm8n45l/data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/bowtie2_index/2.5.3/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.rev.2.bt2.tmp
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
    len: 39937
    bwtLen: 39938
    sz: 9985
    bwtSz: 9985
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 80
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 2497
    offsSz: 9988
    lineSz: 64
    sideSz: 64
    sideBwtSz: 48
    sideBwtLen: 192
    numSides: 209
    

building file list ... done
kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.1.bt2
kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.2.bt2
kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.3.bt2
kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.4.bt2
kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.rev.1.bt2
kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.rev.2.bt2

sent 8447138 bytes  received 152 bytes  16894580.00 bytes/sec
total size is 8445686  speedup is 1.00


a 2.5.3
a 2.5.3/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.4.bt2
a 2.5.3/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.rev.2.bt2
a 2.5.3/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.2.bt2
a 2.5.3/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.3.bt2
a 2.5.3/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.1.bt2
a 2.5.3/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB.rev.1.bt2


Asset(name='2.5.3', description='Genome index for bowtie2, produced with bowtie2-build', size=8446757, updated_at=datetime.datetime(2025, 7, 6, 18, 28, 51, 871146), path='data/kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB/bowtie2_index/2.5.3', digest='40a7b72a358850f722b6734a836b0fa8', recipe_id=2, asset_group_id=3, created_at=datetime.datetime(2025, 7, 6, 18, 28, 51, 871153))

Let's list the assets for the genome `t7` to verify that the `bowtie2_index` asset has been built successfully.


In [16]:
refgenie.assets_table(genome_names=["t7"])[0]

One of the assets was also archived (a necessary step for serving assets via the refgenie server). Let's list the archived assets.

In [17]:
print(refgenie.archive.table())

The `bowtie2_index` asset has been built successfully for the `t7` genome and automatically tagged with `2.5.3`, the version of the Bowtie2 software used (this behavior is encoded in the recipe).


## Interact with aliases

Let's list the aliases:

In [18]:
refgenie.aliases_table()

Let's assign two more aliases to the same genome digest, so that we can refer to the same genome in multiple ways.

In [19]:
t7_alias = refgenie.set_genome_alias(
    alias_name="Bacteriophage-T7",
    genome_digest="kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB",
    genome_description="My favorite genome",
)
fav_alias = refgenie.set_genome_alias(
    alias_name="myFavGenome",
    genome_digest="kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB",
    genome_description="My favorite genome",
)

The new aliases should now be listed:

In [20]:
refgenie.aliases_table()

The command not only creates a new alias, but also creates symbolic links to the files in the `data` directory for that genome.

Conversely, alias removal will remove the symbolic links, but not the files in the `data` directory.

In [21]:
refgenie.remove_alias("myFavGenome")
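The relationship between the `data` and `alias` directories can be illustrated with plain symbolic links. The following sketch uses only the standard library and a made-up directory layout (the file name `t7.fa` and the `link_alias` helper are assumptions for illustration). It is not how refgenie implements aliases internally, but it shows why removing an alias leaves the underlying data intact.

```python
import os
from pathlib import Path
from tempfile import mkdtemp

genome_folder = Path(mkdtemp(prefix="alias_demo_"))

# A data file stored under the genome digest (illustrative layout).
digest = "kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB"
data_file = genome_folder / "data" / digest / "fasta" / "t7.fa"
data_file.parent.mkdir(parents=True)
data_file.write_text(">t7\nACGT\n")


def link_alias(alias: str) -> Path:
    """Create alias/<alias>/fasta/t7.fa as a symlink to the data file."""
    link = genome_folder / "alias" / alias / "fasta" / data_file.name
    link.parent.mkdir(parents=True)
    os.symlink(data_file, link)
    return link


link = link_alias("myFavGenome")
print(link.is_symlink(), link.read_text() == data_file.read_text())

# Removing the alias deletes only the link; the data file is untouched.
link.unlink()
print(link.exists(), data_file.exists())
```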

## Retrieve paths to assets

Most importantly, we can retrieve paths to refgenie-managed files.

All of the commands below return the same path to the FASTA file managed by refgenie:

In [22]:
print(refgenie.seek("t7", "fasta"))
print(refgenie.seek("Bacteriophage-T7", "fasta", "samtools-1.21"))
print(refgenie.seek("t7", "fasta", "samtools-1.21", "fasta"))
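The three calls are equivalent because the omitted arguments fall back to defaults. As a rough illustration of that idea (not refgenie's actual resolution logic), the toy function below resolves a path from a nested dictionary, assuming each asset group records a default asset name and each asset records a default seek key. All names and paths in it are hypothetical.

```python
from typing import Optional

# Toy registry: genome -> asset group -> default asset + per-asset seek keys.
REGISTRY = {
    "t7": {
        "fasta": {
            "default_asset": "samtools-1.21",
            "assets": {
                "samtools-1.21": {
                    "default_seek_key": "fasta",
                    "seek_keys": {"fasta": "t7.fa", "fai": "t7.fa.fai"},
                }
            },
        }
    }
}


def toy_seek(
    genome: str,
    asset_group: str,
    asset: Optional[str] = None,
    seek_key: Optional[str] = None,
) -> str:
    """Resolve a relative path, filling in defaults for omitted arguments."""
    group = REGISTRY[genome][asset_group]
    asset = asset or group["default_asset"]
    entry = group["assets"][asset]
    seek_key = seek_key or entry["default_seek_key"]
    return f"{genome}/{asset_group}/{asset}/{entry['seek_keys'][seek_key]}"


# All three levels of specificity resolve to a single path.
paths = {
    toy_seek("t7", "fasta"),
    toy_seek("t7", "fasta", "samtools-1.21"),
    toy_seek("t7", "fasta", "samtools-1.21", "fasta"),
}
print(paths)
```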

## Remove an asset

Let's remove the `bowtie2_index` asset group for the `t7` genome.

In [23]:
refgenie.remove_asset_group(
    genome_name="t7", asset_group_name="bowtie2_index", force=True
)

## Data channels

Refgenie supports data channels, which allow third-party tool developers to expose their recipes and asset classes to the Refgenie ecosystem.
In the simplest case, a data channel is just a GitHub repository with an index file that lists the available asset classes and recipes, like so:

```yaml
asset_class:
  dir: asset_classes # optional, needed only if the asset classes are stored in a subdirectory
  files: # list of asset class files, relative to the index file (or directory)
    - fasta.yaml 
    - bowtie2_index.yaml
recipe:
  dir: recipes # optional, needed only if the recipes are stored in a subdirectory
  files: # list of recipe files, relative to the index file (or directory)
    - fasta.yaml
    - bowtie2_index.yaml
```

One such example is the [refgenie/recipes](https://github.com/refgenie/recipes/blob/refgenie1/index.yaml) repository, which can be added as a data channel to refgenie in the following way:

In [24]:
data_channel = refgenie.data_channel.add(
    name="refgenie-recipes",
    type="http",
    index_address="https://refgenie.github.io/recipes/index.yaml",
    description="Refgenie recipes channel",
)

print(refgenie.data_channel.table())

Subsequently, the asset classes and recipes from the data channel can be listed and added to the refgenie instance.

In [25]:
for asset_class in refgenie.data_channel.iter_asset_classes("refgenie-recipes"):
    try:
        refgenie.asset_class.add(asset_class)
    except Exception as e:
        print(e)

for recipe in refgenie.data_channel.iter_recipes("refgenie-recipes"):
    try:
        refgenie.recipe.add(recipe)
    except Exception as e:
        print(e)
        


Alternatively, the same can be achieved by running the following CLI command:

```bash
refgenie1 data_channel sync refgenie-recipes --exists-ok
```
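To make the index layout described at the start of this section concrete, the sketch below expands an already-parsed index (written as a plain dictionary, so no YAML library is needed) into full file URLs, resolving each file relative to the index address and the optional `dir`. The `expand_index` helper is hypothetical and only mirrors the comments in the example index; the file names come from that example, and refgenie's own resolution may differ.

```python
from urllib.parse import urljoin

index_address = "https://refgenie.github.io/recipes/index.yaml"

# The example index from this section, as it would look after YAML parsing.
index = {
    "asset_class": {
        "dir": "asset_classes",
        "files": ["fasta.yaml", "bowtie2_index.yaml"],
    },
    "recipe": {"dir": "recipes", "files": ["fasta.yaml", "bowtie2_index.yaml"]},
}


def expand_index(address: str, section: dict) -> list[str]:
    """Resolve the files of one index section relative to the index address."""
    subdir = section.get("dir")  # optional subdirectory
    prefix = f"{subdir.strip('/')}/" if subdir else ""
    return [urljoin(address, prefix + name) for name in section["files"]]


for kind in ("asset_class", "recipe"):
    for url in expand_index(index_address, index[kind]):
        print(kind, url)
```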

## SeqCol interface

Refgenie also provides a `SeqCol` interface. SeqCol is a standard for working with sequence collections; more details can be found on the [SeqCol project website](https://seqcol.readthedocs.io/en/latest/).
Under the hood, refgenie uses the `SeqCol` digests to uniquely identify genomes.

In [26]:
# build the paths from REFGENIE_CODE_PATH (defined earlier) instead of hardcoding a user-specific location
d1 = refgenie.refget_db_agent.seqcol.add_from_fasta_file(
    (REFGENIE_CODE_PATH.parent / "tests/data/rCRSd.fa").as_posix()
).digest
d2 = refgenie.refget_db_agent.seqcol.add_from_fasta_file(
    (REFGENIE_CODE_PATH.parent / "tests/data/rCRSd-extra.fa").as_posix()
).digest
refgenie.refget_db_agent.compare_digests(d1, d2)

{'attributes': {'a_only': [],
  'b_only': [],
  'a_and_b': ['lengths',
   'name_length_pairs',
   'names',
   'sequences',
   'sorted_sequences']},
 'array_elements': {'a': {'lengths': 1,
   'name_length_pairs': 1,
   'names': 1,
   'sequences': 1,
   'sorted_sequences': 1},
  'b': {'lengths': 2,
   'name_length_pairs': 2,
   'names': 2,
   'sequences': 2,
   'sorted_sequences': 2},
  'a_and_b': {'lengths': 1,
   'name_length_pairs': 1,
   'names': 1,
   'sequences': 1,
   'sorted_sequences': 1},
  'a_and_b_same_order': {'lengths': True,
   'name_length_pairs': True,
   'names': True,
   'sequences': True,
   'sorted_sequences': True}}}
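
The genome digests seen throughout this tutorial (for example `kN9XHLKLS_u7ei2GH87H-qpQrkz8moPB`) are 32-character URL-safe strings. This is consistent with the GA4GH `sha512t24u` scheme used by refget: a SHA-512 hash truncated to its first 24 bytes and base64url-encoded. The sketch below shows that primitive only; computing a full sequence collection digest additionally involves canonicalizing the collection's attributes, which is not shown here.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """GA4GH-style digest: SHA-512, truncated to 24 bytes, base64url-encoded."""
    truncated = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


digest = sha512t24u(b"ACGT")
print(digest, len(digest))
```

Because 24 bytes encode to exactly 32 base64 characters, the result never carries `=` padding, which keeps the digest safe to use in URLs and directory names.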