## Welcome to the BioProv tutorial!

### Introduction

BioProv is a library for recording provenance information about bioinformatics workflows. If you work with genomics, you've probably encountered the situation where you have several different files for a number of biological samples, and each file concerns a certain aspect of your data. As you develop your analysis workflow, it becomes challenging to keep track of the **provenance** of your data: how, when and why each file was created or modified. There are many tools to aid in this task, such as version control systems [REF], scientific workflow management systems [REF], or even simply keeping a tidy computational notebook [REF].

Although these tools are certainly helpful and we recommend that you employ them [REF], it is not trivial to integrate and share provenance information across different people, research groups and even computing environments. One solution has been the development of W3C-PROV [REF], a standard created by the World Wide Web Consortium (W3C) to facilitate the exchange of provenance data on the web.

W3C-PROV is composed of a set of 13 documents [REF], of which perhaps the most pertinent to us is W3C-PROV-DM, which describes a data model for representing provenance information. Although this model is widely implemented across a range of application domains, including scientific workflows [REFs], to the best of our knowledge there is not yet a software tool specialized in the provenance of biological data structures and bioinformatics workflows. To extract provenance attributes from common file formats and common project organization patterns in bioinformatics, generic provenance extraction systems must be extended or customized, which can be costly for both the domain specialist and the developers of those systems. To fill this gap, we present BioProv, a Python library that aims to facilitate provenance extraction in bioinformatics workflows by integrating two open-source libraries: BioPython [REF] and Prov [REF].
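To make the idea of a provenance data model more concrete, here is a minimal sketch using plain Python dictionaries. It borrows the three core PROV-DM concepts (entities, activities and agents) and three of its relations (`used`, `wasGeneratedBy`, `wasAssociatedWith`). The identifiers, the `ex:` prefix and the helper functions are hypothetical, and the layout is a simplification for illustration. It is not a standards-compliant serialization, nor is it how BioProv or the Prov library store documents.

```python
import json

# Hypothetical identifiers, chosen only for illustration.
document = {
    "entity": {
        "ex:genome.fna": {"ex:role": "input assembly"},
        "ex:GATTACA_count.txt": {"ex:role": "k-mer count"},
    },
    "activity": {
        "ex:grep_run_1": {"ex:command": "grep -c 'GATTACA' genome.fna"},
    },
    "agent": {
        "ex:analyst": {"ex:type": "person"},
    },
    # Relations written as (relation, subject, object) triples.
    "relations": [
        ("used", "ex:grep_run_1", "ex:genome.fna"),
        ("wasGeneratedBy", "ex:GATTACA_count.txt", "ex:grep_run_1"),
        ("wasAssociatedWith", "ex:grep_run_1", "ex:analyst"),
    ],
}


def generated_by(doc, entity_id):
    """Return the activities recorded as having generated `entity_id`."""
    return [obj for rel, subj, obj in doc["relations"]
            if rel == "wasGeneratedBy" and subj == entity_id]


def inputs_of(doc, activity_id):
    """Return the entities recorded as having been used by `activity_id`."""
    return [obj for rel, subj, obj in doc["relations"]
            if rel == "used" and subj == activity_id]


# Trace the output file back to the file it was computed from.
for activity in generated_by(document, "ex:GATTACA_count.txt"):
    print(activity, "used", inputs_of(document, activity))

# The whole record is plain data, so it can be exchanged as JSON.
print(json.dumps(document, indent=2))
```

Roughly speaking, files play the role of entities and program executions the role of activities. That mapping is our reading for this sketch rather than a statement about BioProv's internals.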

### How it works

BioProv is **project-based**: each **Project** contains a number of **Samples**, which have associated **Files**. **Files** may also be associated directly with the **Project** if they contain information about zero or multiple samples. BioProv also stores information about **Programs** used to create new **Files** and modify existing ones. **Programs** may have **Parameters**, which determine how they are run. Once a **Program** has been run, information about the process is stored as a **Run**.

Therefore, these are the main classes of the BioProv library:
* **Project**
* **Sample**
* **File**
* **Program**
* **Parameter**
* **Run**
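The containment relationships described above (a Project holds Samples, and both can hold Files keyed by tag) can be sketched with standard-library dataclasses. These stand-in classes, their field names and the `add_file`/`add_sample` helpers are assumptions made for illustration only. They are not BioProv's actual implementation, and the file path is hypothetical.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict


@dataclass
class File:
    path: Path
    tag: str


@dataclass
class Sample:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    files: Dict[str, File] = field(default_factory=dict)

    def add_file(self, file: File) -> None:
        # Files are stored by tag, so each tag maps to exactly one file.
        self.files[file.tag] = file


@dataclass
class Project:
    tag: str
    samples: Dict[str, Sample] = field(default_factory=dict)
    # Files that describe zero or several samples attach to the project itself.
    files: Dict[str, File] = field(default_factory=dict)

    def add_sample(self, sample: Sample) -> None:
        self.samples[sample.name] = sample


project = Project(tag="Synechococcus_genome_analysis")
sample = Sample(
    name="Synechococcus_elongatus_PCC_6301",
    attributes={"ncbi_accession": "GCF_000010065.1"},
)
sample.add_file(File(Path("genomes/genome.fna"), tag="assembly"))
project.add_sample(sample)

print(project.samples[sample.name].files["assembly"].path)
```

Keying the dictionaries by name and tag keeps lookups explicit, at the cost of silently replacing an entry when a tag is reused.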

Let's start by creating a **Sample** and adding it to a **Project**.

In [1]:
import bioprov as bp

# A Sample has a name and a dictionary of arbitrary attributes.
sample = bp.Sample("Synechococcus_elongatus_PCC_6301",
                   attributes={"ncbi_accession": "GCF_000010065.1",
                               "ncbi_database": "assembly"}
                  )

# A Project groups one or more Samples under a tag.
project = bp.Project(samples=[sample], tag="Synechococcus_genome_analysis")

### Adding Files and Programs

Now we have a **Project** containing one **Sample**. However, our sample has no associated **Files** or **Programs** yet. Let's add a **File** to our **Sample** and run a program on it.

BioProv comes with an auxiliary `data` subpackage, which contains some preset data for us to experiment with. The `synechococcus_genome` variable is an instance of `pathlib.PosixPath`, which is used to hold file paths.
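If you have not used `pathlib` before, the short sketch below shows the kind of operations a path object supports. It uses `PurePosixPath` so that it behaves the same on any operating system, and the directory `/data/genomes` is a hypothetical location rather than where BioProv installs its data.

```python
from pathlib import PurePosixPath

# Hypothetical location, used only for illustration.
genome = PurePosixPath("/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna")

print(genome.name)    # file name without the directory
print(genome.suffix)  # extension, including the leading dot
print(genome.parent)  # containing directory

# Build a sibling path in the same directory, e.g. for an output file.
count_file = genome.parent.joinpath("GATTACA_count.txt")
print(count_file)
```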

In [14]:
from bioprov.data import synechococcus_genome

# We create a File object based on a path or a string representing a path.
assembly_file = bp.File(synechococcus_genome, tag="assembly")

# We can add this File to our Sample
sample.add_files(assembly_file)
sample.files

Updating file assembly with value /Users/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna.


{'assembly': /Users/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna,
 'GATTACA_count': PosixPath('/Users/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GATTACA_count.txt')}

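Now our instance of `Sample` holds a `File` object. Files can be accessed through the `.files` attribute, which is a dictionary of the form `{file.tag: File instance}`. (The `GATTACA_count` entry in the output above is only created in the next code cell. It presumably appears here because this cell was re-run after the later cells had been executed.)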

We can now run a **Program** on our **Sample**. The sample's **Files** can be used as **Parameters** for the program. Programs are run by the UNIX shell.

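Here we set up a program that uses UNIX's `grep` to count occurrences of the 7-mer 'GATTACA' in our sample. Note that `grep -c` counts the lines that contain a match, not the individual matches, so the result is a line count. We are going to write the results to a new **File**.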
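One caveat is worth illustrating: `grep -c` reports the number of matching lines, and FASTA files commonly wrap a sequence across several lines. The sketch below contrasts a line count in the style of `grep -c` with a count of all occurrences in the concatenated sequence. The function names and the toy sequence are made up for this example and are not part of BioProv.

```python
def count_matching_lines(lines, kmer):
    """Mimic `grep -c`: number of lines containing `kmer` at least once."""
    return sum(1 for line in lines if kmer in line)


def count_occurrences(lines, kmer):
    """Count all (possibly overlapping) occurrences of `kmer` in the
    sequence obtained by joining the non-header lines."""
    sequence = "".join(
        line.strip() for line in lines if not line.startswith(">")
    )
    count = 0
    start = sequence.find(kmer)
    while start != -1:
        count += 1
        start = sequence.find(kmer, start + 1)
    return count


# Toy FASTA record: two hits on one line and one hit split across two lines.
fasta_lines = [
    ">toy_sequence",
    "GATTACAGATTACA",
    "AAGATT",
    "ACACC",
]

print(count_matching_lines(fasta_lines, "GATTACA"))
print(count_occurrences(fasta_lines, "GATTACA"))
```

For the purposes of this tutorial the line count is good enough, since the goal is to record how the file was produced rather than to obtain an exact k-mer count.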

To write our program, we start with an instance of the **Program** class and add **Parameters** to it.

In [15]:
grep = bp.Program("grep")

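# K-mer to search for, and the path of the output file,
# stored in sample.files under the tag 'GATTACA_count'.
kmer = "GATTACA"
sample.files['{}_count'.format(kmer)] = sample.files['assembly'].directory.joinpath("{}_count.txt".format(kmer))

# Creating Parameters
count = bp.Parameter('-c')                      # print the number of matching lines
kmer_param = bp.Parameter("'{}'".format(kmer))  # the pattern, quoted for the shell
in_file = bp.Parameter(str(sample.files['assembly']))  # input file
# Redirect grep's standard output to the output file.
pipe_out = bp.Parameter("> {}".format(str(sample.files['{}_count'.format(kmer)])))

for param in (count, kmer_param, in_file, pipe_out):
    grep.add_parameter(param)

# Inspect the full command line that will be run.
grep.cmd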

Added parameter -c with value '' to program grep
Added parameter 'GATTACA' with value '' to program grep
Added parameter /Users/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna with value '' to program grep
Added parameter > /Users/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GATTACA_count.txt with value '' to program grep


"/usr/bin/grep -c 'GATTACA' /Users/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna > /Users/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GATTACA_count.txt"

### Running Programs

Okay, there's a lot going on in this code block. First, we create the program we are going to run. We then create two variables: the k-mer we wish to count and an item in the `sample.files` dictionary holding the path to our output file.

After that, we create the parameters to be added to the `grep` program. Parameters are strings that are appended to the program's command line. We could pass a single string containing all of our parameters, but creating them one by one as `bp.Parameter` instances allows them to be queried later. Parameters are then added with the `Program.add_parameter()` method.

Finally, we check that our command is correct: each `bioprov.Program` instance has a `Program.cmd` attribute, which shows the exact command line that will be run in the UNIX shell.
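To see why individual parameter objects are easier to query than one long string, consider the following minimal sketch. The `Parameter` and `Program` classes here are simplified stand-ins written for this example, with assumed attribute names (`key`, `value`, `params`). They only mimic the behaviour shown in this tutorial and are not BioProv's own code. The file names are hypothetical.

```python
from collections import OrderedDict


class Parameter:
    """A command-line parameter: a key plus an optional value."""

    def __init__(self, key, value=""):
        self.key = key
        self.value = value

    @property
    def cmd_string(self):
        # Join key and value, skipping the value when it is empty.
        return " ".join(part for part in (self.key, self.value) if part)


class Program:
    """A program name plus an ordered collection of parameters."""

    def __init__(self, name):
        self.name = name
        self.params = OrderedDict()

    def add_parameter(self, param):
        self.params[param.key] = param

    @property
    def cmd(self):
        # The command line is rebuilt from the parameters on every access.
        parts = [self.name] + [p.cmd_string for p in self.params.values()]
        return " ".join(parts)


grep = Program("grep")
for param in (
    Parameter("-c"),
    Parameter("'GATTACA'"),
    Parameter("genome.fna"),
    Parameter(">", "GATTACA_count.txt"),
):
    grep.add_parameter(param)

print(grep.cmd)

# Because parameters are stored individually, they can be queried later.
print("-c" in grep.params)
print([key for key in grep.params if key.startswith("-")])
```

With a single command string, answering a question such as "was this program run with `-c`?" would require parsing the string again.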

Now we want to run our program. We use the `Program.run()` method. 

In [18]:
grep.run()

Running program 'grep'.
Command is:
/usr/bin/grep -c 'GATTACA' /Users/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna > /Users/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GATTACA_count.txt


Run of Program 'grep' with 4 parameter(s).
Started at Mon Oct 12 12:46:22 2020.
Ended at Mon Oct 12 12:46:22 2020.
Status is finished.

When we run a **Program**, we create a new **Run**. The `bioprov.Run` class holds information about a process, such as its start and end times. Runs are stored in the `Program.runs` attribute (two runs are listed below because `grep.run()` was called twice in this session):

In [26]:
grep.runs

{'1': Run of Program 'grep' with 4 parameter(s).
 Started at Mon Oct 12 12:45:59 2020.
 Ended at Mon Oct 12 12:45:59 2020.
 Status is finished.,
 '2': Run of Program 'grep' with 4 parameter(s).
 Started at Mon Oct 12 12:46:22 2020.
 Ended at Mon Oct 12 12:46:22 2020.
 Status is finished.}
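The kind of information a Run records can be approximated with the standard library alone. The sketch below runs a command through the shell, captures its start time, end time, output streams and exit code, and stores each record under a string key that increases with every execution, mirroring the `'1'` and `'2'` keys above. The `run_and_record` helper, the dictionary layout and the `"failed"` status label are assumptions for illustration and are not BioProv's implementation.

```python
import subprocess
from datetime import datetime


def run_and_record(cmd):
    """Run `cmd` in the shell and return a dictionary describing the process."""
    start_time = datetime.now()
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    end_time = datetime.now()
    return {
        "cmd": cmd,
        "start_time": start_time,
        "end_time": end_time,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "returncode": proc.returncode,
        # Hypothetical status labels, derived from the exit code.
        "status": "finished" if proc.returncode == 0 else "failed",
    }


# Every execution is stored under the next integer, kept as a string key.
runs = {}
for _ in range(2):
    runs[str(len(runs) + 1)] = run_and_record("echo GATTACA")

for key, record in runs.items():
    print(key, record["status"], record["stdout"].strip())
```

One thing to keep in mind when deriving a status from the exit code is that `grep` exits with status 1 when no line matches, so a nonzero exit code does not always indicate an error.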

In [29]:
# Each Run has useful attributes such as stdout, stderr and status
grep.runs['1'].__dict__

{'program': Program 'grep' with 4 parameter(s).,
 'params': OrderedDict([('-c', Parameter with command string '-c '),
              ("'GATTACA'", Parameter with command string ''GATTACA' '),
              ('/Users/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna',
               Parameter with command string '/Users/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna '),
              ('> /Users/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GATTACA_count.txt',
               Parameter with command string '> /Users/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GATTACA_count.txt ')]),
 'sample': Sample Synechococcus_elongatus_PCC_6301 with 2 file(s).,
 'cmd': "/usr/bin/grep -c 'GATTACA' /Users/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genom