## Welcome to the BioProv tutorial!

### Introduction

BioProv is a library to record provenance information of bioinformatics workflows. If you work with genomics, you've probably encountered the situation where you have several different files for a number of biological samples, and each file concerns a certain aspect of your data. As you develop your analysis workflow, it is challenging to keep track of the **provenance** of your data: how, when and why each file was created and/or modified. There are many tools to aid in this task, such as version control systems [REF], scientific workflow management systems [REF], or even simply keeping a tidy computational notebook [REF].

Although these tools are certainly helpful and we recommend that you employ them [REF], it is not trivial to integrate and share provenance information across different people, research groups and even computing environments. One solution has been the development of W3C-PROV [REF], a standard created by the World Wide Web Consortium (W3C) to facilitate the exchange of provenance data on the web.

W3C-PROV comprises a set of 13 documents [REF], of which perhaps the most pertinent to us is W3C-PROV-DM, which describes a data model to represent provenance information. Although this model is widely implemented in a range of domain applications, including scientific workflows [REFs], to the best of our knowledge there is not yet a software tool specialized in the provenance of biological data structures and bioinformatics workflows. To extract provenance attributes from common file formats and common project organization patterns in bioinformatics, generic provenance extraction systems must be extended or customized, which can be a costly task for both the domain specialist and the developers of said systems. To fill this gap, we present BioProv, a Python library that aims to facilitate provenance extraction in bioinformatics workflows by integrating two open source libraries: BioPython [REF] and Prov [REF].

### How it works

BioProv is **project-based**: each **Project** contains a number of **Samples**, which have associated **Files**. **Files** may also be associated directly with the **Project** if they contain information about zero or multiple samples. BioProv also stores information about the **Programs** used to create new **Files** and modify existing ones. **Programs** may contain **Parameters**, which determine how they are run. Once a **Program** has been run, information about the process is stored as a **Run**.

Therefore, these are the main classes of the BioProv library:
* **Project**
* **Sample**
* **File**
* **Program**
* **Parameter**
* **Run**
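
To make the relationships between these classes more concrete, here is a minimal sketch of the Project, Sample and File hierarchy written with plain dataclasses. It is not BioProv's implementation: the `Toy*` names and the `add_file` method are hypothetical and exist only for illustration. The one detail taken from this tutorial is that a sample's files are kept in a dictionary keyed by each file's tag.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class ToyFile:
    """Hypothetical stand-in for a file record: a path plus a short tag."""
    path: Path
    tag: str


@dataclass
class ToySample:
    """Hypothetical stand-in for a sample: a name, attributes and tagged files."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    files: Dict[str, ToyFile] = field(default_factory=dict)

    def add_file(self, toy_file: ToyFile) -> None:
        # Files are stored by tag, so they can be looked up by name later.
        self.files[toy_file.tag] = toy_file


@dataclass
class ToyProject:
    """Hypothetical stand-in for a project: a tag, samples and project-level files."""
    tag: str
    samples: List[ToySample] = field(default_factory=list)
    files: Dict[str, ToyFile] = field(default_factory=dict)


toy_sample = ToySample("sample_1", attributes={"database": "assembly"})
toy_sample.add_file(ToyFile(Path("genome.fna"), tag="assembly"))
toy_project = ToyProject(tag="toy_analysis", samples=[toy_sample])

# Navigate from the project down to a file path.
print(toy_project.samples[0].files["assembly"].path)
```

Files that describe zero or several samples would go in the project-level `files` dictionary instead of a sample's.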

Let's start by creating a **Sample** and adding it to a **Project**.

In [26]:
import bioprov as bp 

sample = bp.Sample("Synechococcus_elongatus_PCC_6301",
                   attributes={"ncbi_accession": "GCF_000010065.1",
                               "ncbi_database": "assembly"}
                  )

project = bp.Project(samples=[sample,], tag="Synechococcus_genome_analysis")

### Adding Files and Programs

Now we have a **Project** containing one **Sample**. However, our sample has no associated **Files** or **Programs**. Let's add a **File** to our **Sample** and run a program on it.

BioProv comes with an auxiliary `data` subpackage, which contains some preset data for us to experiment with. The `synechococcus_genome` variable is an instance of `pathlib.PosixPath`, which is used to hold file paths.
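
For readers who have not used `pathlib` before, the short sketch below shows the kind of information a path object exposes. It uses only the standard library. The relative path is an illustrative assumption (only the file name is taken from this tutorial), and `PurePosixPath` is used so that the example behaves the same on any operating system.

```python
from pathlib import PurePosixPath

# Illustrative relative path; the real location depends on where BioProv is installed.
genome_path = PurePosixPath("data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna")

print(genome_path.name)    # file name, including the extension
print(genome_path.suffix)  # the extension alone
print(genome_path.parent)  # the containing directory
```

Path objects can be passed wherever a string path is accepted, and `str(genome_path)` converts one back to plain text.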

In [27]:
from bioprov.data import synechococcus_genome

# We create a File object based on a path or a string representing a path.
assembly_file = bp.File(synechococcus_genome, tag="assembly")

# We can add this File to our Sample
sample.add_files(assembly_file)

Now our instance of `Sample` holds a `File` object. Files can be accessed by the attribute `.files`, which is a dictionary composed of `{file.tag: File instance}`.

In [28]:
sample.files

{'assembly': /home/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna}

We can now run a **Program** on our **Sample**. The sample's **Files** can be used as **Parameters** of the program. Programs are processed by the UNIX shell. Here we set up a program that uses UNIX's `grep` to look for the 7-mer 'GATTACA' in our sample. Note that `grep -c` counts the lines that contain a match, not the individual occurrences. We are going to write the results to a new **File**.

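There are three kinds of parameters: *input*, which are **Files** used as input; *output*, which are **Files** generated by the program; and *misc*, which is the kind used in the next cell for arguments that are not files, such as the k-mer and the `-c` flag.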

In [90]:
p = bp.Program("grep")

# Parameters to be added
kmer = bp.Parameter(key="kmer", value="'GATTACA'", kind="misc", keyword_argument=False)
count = bp.Parameter(key="count", value="-c", kind="misc", keyword_argument=False)
in_file = bp.Parameter(key='input_file', value=sample.files["assembly"], kind='input', keyword_argument=False)

for param in (kmer, count, in_file):
    p.add_parameter(param)

Added parameter kmer with value ''GATTACA'' to program grep
Added parameter count with value '-c' to program grep
Added parameter input_file with value '/home/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna' to program grep


In [91]:
p.cmd

"/bin/grep -c /home/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna 'GATTACA'"

In [69]:
p.runs['1'].stderr

b'/bin/grep: GATTACA: No such file or directory\n'
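
The error above is what `grep` reports when its arguments are in the wrong order: the first non-option argument is always read as the pattern and everything after it as file names, so here the genome path was taken as the pattern and `GATTACA` as a file to open. The sketch below reproduces both orderings with the standard library's `subprocess` module instead of BioProv. The toy FASTA content and the variable names are assumptions made for illustration, and the `grep` calls are skipped if `grep` is not installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

PATTERN = "GATTACA"
# Toy sequence data, only for illustration: two lines contain the k-mer,
# and one of those lines contains it twice.
TOY_FASTA = ">seq1\nACGTGATTACAACGT\nTTTTGGGG\nGATTACAGATTACA\n"

# Total occurrences, counted in Python for comparison with grep -c.
occurrences = sum(line.count(PATTERN) for line in TOY_FASTA.splitlines())

with tempfile.TemporaryDirectory() as tmp:
    toy_fasta = Path(tmp) / "toy.fna"
    toy_fasta.write_text(TOY_FASTA)

    # Options first, then the pattern, then the file to search.
    right_order = ["grep", "-c", PATTERN, str(toy_fasta)]
    # File and pattern swapped, as in the command string shown earlier.
    wrong_order = ["grep", "-c", str(toy_fasta), PATTERN]

    matching_lines = None
    wrong_returncode = None
    if shutil.which("grep"):
        ok = subprocess.run(right_order, capture_output=True, text=True, cwd=tmp)
        matching_lines = int(ok.stdout.strip())
        bad = subprocess.run(wrong_order, capture_output=True, text=True, cwd=tmp)
        wrong_returncode = bad.returncode
        print("lines with a match:", matching_lines)
        print("stderr with swapped arguments:", bad.stderr.strip())

print("total occurrences:", occurrences)
```

The comparison also shows why `grep -c` is only an approximation of a k-mer count: a line with several matches is counted once.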

In [51]:
p.path

''