# Tutorial

I assume you've already installed refgenie. In this tutorial I'll show you a few ways to use refgenie from the command line (shell commands are prefixed with `!`) and from Python.

To start, initialize an empty refgenie configuration file from the shell and subscribe to the desired asset server:

In [1]:
!refgenie init -c refgenie.yaml -s http://rg.databio.org

Initialized genome configuration file: /Users/mstolarczyk/code/refgenie/docs_jupyter/refgenie.yaml
Created directories:
 - /Users/mstolarczyk/code/refgenie/docs_jupyter/data
 - /Users/mstolarczyk/code/refgenie/docs_jupyter/alias


Here's what the new configuration file looks like:

In [2]:
!cat refgenie.yaml

config_version: 0.4
genome_folder: /Users/mstolarczyk/code/refgenie/docs_jupyter
genome_servers: 
 - http://rg.databio.org
genomes: null
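
For scripts that only need the server list, the file shown above can be read with a few lines of Python. This is a minimal sketch that assumes exactly the flat layout printed above; `read_genome_servers` is a hypothetical helper, not part of refgenie, and a real YAML parser would be the more robust choice.

```python
def read_genome_servers(config_text):
    """Collect the list items that follow the 'genome_servers:' key.

    Hypothetical helper: handles only the simple layout shown in this tutorial.
    """
    servers = []
    in_servers = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("genome_servers:"):
            in_servers = True
        elif in_servers and stripped.startswith("- "):
            servers.append(stripped[2:].strip())
        elif stripped:
            # Any other non-empty line ends the server list.
            in_servers = False
    return servers


# Contents copied from the `cat refgenie.yaml` output above.
config_text = """\
config_version: 0.4
genome_folder: /Users/mstolarczyk/code/refgenie/docs_jupyter
genome_servers: 
 - http://rg.databio.org
genomes: null
"""

print(read_genome_servers(config_text))
```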


We can list the assets available on the subscribed server with `refgenie listr`:

In [3]:
!refgenie listr -c refgenie.yaml

                             Remote refgenie assets
                       Server URL: http://rg.databio.org
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ genome           ┃ assets                                                    ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ rCRSd            │ fasta, bowtie2_index, bwa_index, hisat2_index,            │
│                  │ star_index, bismark_bt2_index                             │
│ hg18_cdna        │ fasta, kallisto_index                                     │
│ hs38d1           │ fasta, suffixerator_index, bowtie2_index, bwa_index,      │
│                  │ tallymer_index, hisat2_index, star_index,                 │
│                  │ bismark_bt2_index                                         │
│ hg38_cdna        │ fasta, kallis

Now let's switch to Python and load the configuration file with `refgenconf`:

In [4]:
import refgenconf
rgc = refgenconf.RefGenConf(filepath="refgenie.yaml")

Use `pull` to download one of the assets:

In [5]:
rgc.pull("mouse_chrM2x", "fasta", "default")

Output()

(['43f14ba8beed34d52edb244e26f193df6edbb467bd55d37a', 'fasta', 'default'],
 {'asset_path': 'fasta',
  'asset_digest': '8dfe402f7d29d5b036dd8937119e4404',
  'archive_digest': 'bfb7877ee114c61a17a50bd471de47a2',
  'asset_size': '39.4KB',
  'archive_size': '9.1KB',
  'seek_keys': {'fasta': '43f14ba8beed34d52edb244e26f193df6edbb467bd55d37a.fa',
   'fai': '43f14ba8beed34d52edb244e26f193df6edbb467bd55d37a.fa.fai',
   'chrom_sizes': '43f14ba8beed34d52edb244e26f193df6edbb467bd55d37a.chrom.sizes'},
  'asset_parents': [],
  'asset_children': ['43f14ba8beed34d52edb244e26f193df6edbb467bd55d37a/suffixerator_index:default',
   '43f14ba8beed34d52edb244e26f193df6edbb467bd55d37a/bowtie2_index:default',
   '43f14ba8beed34d52edb244e26f193df6edbb467bd55d37a/bwa_index:default',
   '43f14ba8beed34d52edb244e26f193df6edbb467bd55d37a/tallymer_index:default',
   '43f14ba8beed34d52edb244e26f193df6edbb467bd55d37a/hisat2_index:default',
   '43f14ba8beed34d52edb244e26f193df6edbb467bd55d37a/star_index:default',
   '
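
The `asset_children` entries above follow a `genome/asset:tag` pattern. Splitting such a string into its parts can be sketched as below. `split_registry_path` is a hypothetical helper written for this tutorial, and falling back to `default` when no tag is given is an assumption of the sketch.

```python
import re

# "<genome>/<asset>" with an optional ":<tag>" suffix.
_REGISTRY_PATH = re.compile(
    r"^(?P<genome>[^/:]+)/(?P<asset>[^/:]+)(?::(?P<tag>[^/:]+))?$"
)


def split_registry_path(path, default_tag="default"):
    """Split 'genome/asset[:tag]' into a (genome, asset, tag) tuple."""
    match = _REGISTRY_PATH.match(path)
    if match is None:
        raise ValueError(f"not a genome/asset[:tag] string: {path!r}")
    # The tag group is None when the suffix is absent.
    return match["genome"], match["asset"], match["tag"] or default_tag


child = "43f14ba8beed34d52edb244e26f193df6edbb467bd55d37a/bowtie2_index:default"
print(split_registry_path(child))
print(split_registry_path("rCRSd/fasta"))
```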

Once it's downloaded, use `seek` to retrieve a path to it.

In [6]:
rgc.seek("mouse_chrM2x", "fasta")

'/Users/mstolarczyk/code/refgenie/docs_jupyter/alias/mouse_chrM2x/fasta/default/mouse_chrM2x.fa'
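
The returned path appears to be composed of the genome folder, an `alias` directory, the genome alias, the asset name, the tag, and the file name. The sketch below only reproduces that observed layout. `alias_path` is a hypothetical helper and not how refgenie resolves paths internally, so prefer `seek` in real code.

```python
from pathlib import PurePosixPath


def alias_path(genome_folder, genome, asset, tag, filename):
    """Rebuild the layout observed in the seek output (illustration only)."""
    return PurePosixPath(genome_folder) / "alias" / genome / asset / tag / filename


path = alias_path(
    "/Users/mstolarczyk/code/refgenie/docs_jupyter",
    "mouse_chrM2x",
    "fasta",
    "default",
    "mouse_chrM2x.fa",
)
print(path)
```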

You can get the unique asset identifier with `id()`:

In [7]:
rgc.id("mouse_chrM2x", "fasta")

'8dfe402f7d29d5b036dd8937119e4404'
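
Note that this identifier equals the `asset_digest` reported by `pull` earlier. A small consistency check can be sketched as follows. The values are copied from the outputs above, and `same_asset` is a hypothetical helper.

```python
# Values copied from the pull and id outputs shown in this tutorial.
pull_metadata = {
    "asset_digest": "8dfe402f7d29d5b036dd8937119e4404",
    "archive_digest": "bfb7877ee114c61a17a50bd471de47a2",
}
reported_id = "8dfe402f7d29d5b036dd8937119e4404"


def same_asset(asset_id, metadata):
    """True when an id matches the asset digest recorded in the metadata."""
    return asset_id == metadata["asset_digest"]


print(same_asset(reported_id, pull_metadata))
```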

## Building and pulling from the command line

We can also build a fasta asset instead of pulling one. Back in the shell, we'll download the Revised Cambridge Reference Sequence (the human mitochondrial genome, chosen because it's small):

In [8]:
!wget -O rCRSd.fa.gz http://big.databio.org/refgenie_raw/files.rCRSd.fasta.fasta

--2021-03-09 12:22:40--  http://big.databio.org/refgenie_raw/files.rCRSd.fasta.fasta
Resolving big.databio.org (big.databio.org)... 128.143.245.181, 128.143.245.182
Connecting to big.databio.org (big.databio.org)|128.143.245.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8399 (8.2K) [application/octet-stream]
Saving to: ‘rCRSd.fa.gz’


2021-03-09 12:22:40 (1.35 MB/s) - ‘rCRSd.fa.gz’ saved [8399/8399]
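
Before building, it can be useful to peek at the sequence names in the downloaded file. This is a minimal sketch that assumes the download really is gzip-compressed FASTA, as the `.gz` name suggests; `fasta_headers` is a hypothetical helper.

```python
import gzip
import os


def fasta_headers(path):
    """Return the header lines (without the leading '>') of a gzipped FASTA file."""
    headers = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                headers.append(line[1:].strip())
    return headers


# Only inspect the file if the download step above has been run here.
if os.path.exists("rCRSd.fa.gz"):
    print(fasta_headers("rCRSd.fa.gz"))
```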



Then we build the `fasta` asset from the downloaded file:

In [9]:
!refgenie build rCRSd/fasta -c refgenie.yaml  --files fasta=rCRSd.fa.gz -R

Using 'default' as the default tag for 'rCRSd/fasta'
Recipe validated successfully against a schema: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/refgenie/schemas/recipe_schema.yaml
Building 'rCRSd/fasta:default' using 'fasta' recipe
Initializing genome: rCRSd
Loaded AnnotatedSequenceDigestList (1 sequences)
Set genome alias (94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4: rCRSd)
Created alias directories: 
 - /Users/mstolarczyk/code/refgenie/docs_jupyter/alias/rCRSd
Saving outputs to:
- content: /Users/mstolarczyk/code/refgenie/docs_jupyter/data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4
- logs: /Users/mstolarczyk/code/refgenie/docs_jupyter/data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/fasta/default/_refgenie_build
### Pipeline run code and environment:

*              Command:  `/Library/Frameworks/Python.framework/Versions/3.6/bin/refgenie build rCRSd/fasta -c refgenie.yaml --files fasta=rCRSd.fa.gz -R`
*         Compute host:  Michal

The asset should now be available for local use. Let's call `refgenie list` to check:

In [10]:
!refgenie list -c refgenie.yaml --genome rCRSd

                        Local refgenie assets
             Server subscriptions: http://rg.databio.org
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ genome    ┃ asset (seek_keys)                          ┃ tags      ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ rCRSd     │ fasta (fasta, fai, chrom_sizes)            │ default   │
└───────────┴────────────────────────────────────────────┴───────────┘


We can retrieve the path to this asset with:

In [11]:
!refgenie seek rCRSd/fasta -c refgenie.yaml

/Users/mstolarczyk/code/refgenie/docs_jupyter/alias/rCRSd/fasta/default/rCRSd.fa


Naturally, we can do the same thing from within Python:

In [12]:
rgc = refgenconf.RefGenConf("refgenie.yaml")
rgc.seek("rCRSd", "fasta")

'/Users/mstolarczyk/code/refgenie/docs_jupyter/alias/rCRSd/fasta/default/rCRSd.fa'
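
If `refgenconf` cannot be imported in the calling environment, shelling out to the CLI is another option. The sketch below reuses the `refgenie seek ... -c ...` invocation shown earlier; `seek_command` and `seek_via_cli` are hypothetical wrappers, and the CLI call itself is only defined here, not executed.

```python
import subprocess


def seek_command(genome, asset, config="refgenie.yaml"):
    """Build the argument list for `refgenie seek <genome>/<asset> -c <config>`."""
    return ["refgenie", "seek", f"{genome}/{asset}", "-c", config]


def seek_via_cli(genome, asset, config="refgenie.yaml"):
    """Run the CLI and return the printed path (requires refgenie on PATH)."""
    result = subprocess.run(
        seek_command(genome, asset, config),
        check=True,                # raise if refgenie exits with an error
        stdout=subprocess.PIPE,
        universal_newlines=True,   # decode stdout as text
    )
    return result.stdout.strip()


print(" ".join(seek_command("rCRSd", "fasta")))
```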

Now, if we have `bowtie2-build` in our `$PATH`, we can build the `bowtie2_index` asset with no further requirements.
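
Checking whether an executable is on `$PATH` can be sketched from Python with the standard library. `missing_tools` is a hypothetical helper; the result depends on the machine it runs on.

```python
import shutil


def missing_tools(tools):
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# An empty list means bowtie2-build was found.
print(missing_tools(["bowtie2-build"]))
```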

Let's check the requirements with `refgenie build --requirements`:


In [13]:
!refgenie build rCRSd/bowtie2_index -c refgenie.yaml --requirements

'bowtie2_index' recipe requirements: 
- assets:
	fasta (fasta asset for genome); default: fasta


Since I already have the `fasta` asset, I don't need anything else to build `bowtie2_index`.

In [14]:
!refgenie build rCRSd/bowtie2_index -c refgenie.yaml

Using 'default' as the default tag for 'rCRSd/bowtie2_index'
Recipe validated successfully against a schema: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/refgenie/schemas/recipe_schema.yaml
Building 'rCRSd/bowtie2_index:default' using 'bowtie2_index' recipe
Saving outputs to:
- content: /Users/mstolarczyk/code/refgenie/docs_jupyter/data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4
- logs: /Users/mstolarczyk/code/refgenie/docs_jupyter/data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index/default/_refgenie_build
### Pipeline run code and environment:

*              Command:  `/Library/Frameworks/Python.framework/Versions/3.6/bin/refgenie build rCRSd/bowtie2_index -c refgenie.yaml`
*         Compute host:  MichalsMBP
*          Working dir:  /Users/mstolarczyk/code/refgenie/docs_jupyter
*            Outfolder:  /Users/mstolarczyk/code/refgenie/docs_jupyter/data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index/default/_refge

Command completed. Elapsed time: 0:00:00. Running peak memory: 0.003GB.  
  PID: 63609;	Command: bowtie2-build;	Return code: 0;	Memory used: 0.003GB


> `touch /Users/mstolarczyk/code/refgenie/docs_jupyter/data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index/default/_refgenie_build/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4_bowtie2_index__default.flag` (63611)
<pre>
psutil.ZombieProcess process still exists but it's a zombie (pid=63611)
</pre>
Command completed. Elapsed time: 0:00:00. Running peak memory: 0.003GB.  
  PID: 63611;	Command: touch;	Return code: 0;	Memory used: 0GB

Asset digest: 1262e30d4a87db9365d501de8559b3b4
Default tag for '94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index' set to: default

### Pipeline completed. Epilogue
*        Elapsed time (this run):  0:00:01
*  Total elapsed time (all runs):  0:00:00
*         Peak memory (this run):  0.0028 GB
*        Pipeline completed time: 2021-03-09 12:22:46
Finished building 'bowtie2_ind

We can see a list of available recipes like this:

In [15]:
!refgenie list -c refgenie.yaml --recipes

bismark_bt1_index, bismark_bt2_index, blacklist, bowtie2_index, bwa_index, cellranger_reference, dbnsfp, dbsnp, ensembl_gtf, ensembl_rb, epilog_index, fasta, fasta_txome, feat_annotation, gencode_gtf, hisat2_index, kallisto_index, refgene_anno, salmon_index, salmon_partial_sa_index, salmon_sa_index, star_index, suffixerator_index, tallymer_index, tgMap
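
Because the recipes are printed as one comma-separated line, a script can turn that output into a list and test for a recipe before attempting a build. This is a minimal sketch; `parse_recipes` is a hypothetical helper and the string is copied from the output above.

```python
recipes_output = (
    "bismark_bt1_index, bismark_bt2_index, blacklist, bowtie2_index, bwa_index, "
    "cellranger_reference, dbnsfp, dbsnp, ensembl_gtf, ensembl_rb, epilog_index, "
    "fasta, fasta_txome, feat_annotation, gencode_gtf, hisat2_index, "
    "kallisto_index, refgene_anno, salmon_index, salmon_partial_sa_index, "
    "salmon_sa_index, star_index, suffixerator_index, tallymer_index, tgMap"
)


def parse_recipes(text):
    """Split the comma-separated recipe line into a list of names."""
    return [name.strip() for name in text.split(",") if name.strip()]


recipes = parse_recipes(recipes_output)
print("bowtie2_index" in recipes)
```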


We can get the unique digest for any asset with `refgenie id`:

In [16]:
!refgenie id rCRSd/fasta -c refgenie.yaml

4eb430296bc02ed7e4006624f1d5ac53


## Versions

In [17]:
from platform import python_version 
python_version()

'3.6.5'

In [18]:
!refgenie --version

refgenie 0.10.0-dev | refgenconf 0.10.0-dev
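
To record these versions alongside analysis results, the version line can be split into a dictionary. This sketch assumes the `name version | name version` layout printed above; `parse_versions` is a hypothetical helper.

```python
def parse_versions(line):
    """Map each package name in the version line to its version string."""
    versions = {}
    for part in line.split("|"):
        name, version = part.split()
        versions[name] = version
    return versions


versions = parse_versions("refgenie 0.10.0-dev | refgenconf 0.10.0-dev")
print(versions)
```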
