## Welcome to the BioProv tutorials!

### Tutorial index
* <a href="./introduction.ipynb">Introduction to BioProv</a>
* <a href="./w3c-prov.ipynb">W3C-PROV projects</a>

## W3C-PROV workflows

In the previous tutorial, we learned how to start a **Project** in BioProv and met the main classes of the library. In this tutorial, we look at some additional BioProv functions that make our work easier, and we produce a W3C-PROV document describing our workflow. Let's dive in.

### Importing data easily

Creating BioProv Projects and Samples one by one quickly becomes repetitive. To avoid this, you can import data from comma- or tab-delimited files with the BioProv `read_csv()` function.

In [1]:
import bioprov as bp
from bioprov.data import picocyano_dataset

proj = bp.read_csv(picocyano_dataset); proj

Project 'hopeful-deer' with 2 samples.

The `picocyano_dataset` variable is simply a path pointing to a comma-delimited file that comes with BioProv:

In [2]:
picocyano_dataset

PosixPath('/Users/vini/anaconda3/envs/bioprov/lib/python3.7/site-packages/bioprov/data/datasets/picocyano.csv')

In [3]:
# Take a peek at the file
!cat ../../bioprov/data/datasets/picocyano.csv

sample-id,assembly,taxon
GCF_000010065.1_ASM1006v1,bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna,Synechococcus elongatus PCC 6301
GCF_000007925.1_ASM792v1,bioprov/data/genomes/GCF_000007925.1_ASM792v1_genomic.fna,Prochlorococcus marinus CCMP1375


As you can see, this is a simple comma-delimited file. If your file uses a different delimiter, pass it to the `sep` argument of `read_csv()`. For a tab-delimited file, for example, type `read_csv(path, sep="\t")`.
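If you are not sure which delimiter a file uses, you can check before importing it. The sketch below relies only on Python's standard `csv` module and is not part of BioProv. `guess_delimiter` is a hypothetical helper name, and the sketch assumes the file is either comma- or tab-delimited. The value it returns could then be passed as `sep`.

```python
import csv


def guess_delimiter(text, candidates=",\t"):
    """Guess the delimiter of delimited text, considering only `candidates`."""
    # csv.Sniffer inspects the text and picks the candidate character
    # that appears a consistent number of times on every line.
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    return dialect.delimiter


# A tab-delimited version of the example dataset shown above.
tsv_text = (
    "sample-id\tassembly\ttaxon\n"
    "GCF_000010065.1_ASM1006v1\tbioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna\tSynechococcus elongatus PCC 6301\n"
    "GCF_000007925.1_ASM792v1\tbioprov/data/genomes/GCF_000007925.1_ASM792v1_genomic.fna\tProchlorococcus marinus CCMP1375\n"
)

sep = guess_delimiter(tsv_text)
print(repr(sep))
```

For a file on disk, reading the first few lines and passing them to the helper should be enough.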

BioProv uses Pandas to process delimited files. Because of this, you can also import data from Pandas DataFrames using the `from_df()` function. This is handy if you want to preprocess your data before importing it into BioProv.

In [4]:
import pandas as pd

df = pd.read_csv(picocyano_dataset)

df['assembly'] = "../../" + df["assembly"]
... # do your processing here

proj = bp.from_df(df, file_cols="assembly"); proj

Project 'coral-mackerel' with 2 samples.

The `from_df()` function has some useful arguments to make sure our data is read correctly. 

The first is `index_col`, the column used to read the Sample IDs. This column must contain unique identifiers and can be passed as an integer (the position of the column) or as a string (the name of the column). We don't have to worry about it here, because `from_df()` uses the first column as `index_col` by default.

The second useful argument is `file_cols`, which specifies the columns that contain file paths. In our dataset, the second column, `"assembly"`, contains the path to the genome assembly of each sample. BioProv creates a **File** instance for each **Sample** from this column and tags it with the column name. The remaining columns are added as attributes of the **Sample**.

In [7]:
proj = bp.from_df(df, file_cols=["assembly",])

# Creates files tagged with the column name
print(proj.samples, "\n")
print(proj['GCF_000010065.1_ASM1006v1'].files, "\n")
print(proj['GCF_000010065.1_ASM1006v1'].attributes)

{'GCF_000010065.1_ASM1006v1': Sample GCF_000010065.1_ASM1006v1 with 1 file(s)., 'GCF_000007925.1_ASM792v1': Sample GCF_000007925.1_ASM792v1 with 1 file(s).} 

{'assembly': /Users/vini/Bio/BioProv/docs/tutorials/../../bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna} 

{'assembly': '../../bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna', 'taxon': 'Synechococcus elongatus PCC 6301'}
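To make the roles of `index_col` and `file_cols` concrete, here is a minimal sketch of the row-to-sample mapping using only the standard library. It is an illustration under simplifying assumptions, not BioProv's implementation: `rows_to_samples` is a hypothetical function, and plain dictionaries stand in for **Sample** and **File** instances.

```python
import csv
import io

CSV_TEXT = """sample-id,assembly,taxon
GCF_000010065.1_ASM1006v1,bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna,Synechococcus elongatus PCC 6301
GCF_000007925.1_ASM792v1,bioprov/data/genomes/GCF_000007925.1_ASM792v1_genomic.fna,Prochlorococcus marinus CCMP1375
"""


def rows_to_samples(text, index_col="sample-id", file_cols=("assembly",)):
    """Illustrative only: map each row to a plain dict standing in for a Sample."""
    samples = {}
    for row in csv.DictReader(io.StringIO(text)):
        sample_id = row[index_col]
        # The index column must hold unique identifiers.
        if sample_id in samples:
            raise ValueError(f"Duplicate sample ID: {sample_id}")
        samples[sample_id] = {
            # One entry per file column, tagged with the column name.
            "files": {col: row[col] for col in file_cols},
            # Every column except the index is kept as an attribute; the output
            # above shows the file column among the attributes as well.
            "attributes": {k: v for k, v in row.items() if k != index_col},
        }
    return samples


samples = rows_to_samples(CSV_TEXT)
print(samples["GCF_000010065.1_ASM1006v1"]["files"])
print(samples["GCF_000010065.1_ASM1006v1"]["attributes"])
```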


The `read_csv()` function also accepts these arguments and passes them to `from_df()`. So we can just do:

In [8]:
proj = bp.read_csv(picocyano_dataset, file_cols=["assembly",])

# Because of the path of the tutorials, we are going to have to process the file,
# so we use from_df()

df = pd.read_csv(picocyano_dataset)
df['assembly'] = "../../" + df["assembly"]
proj = bp.from_df(df, file_cols="assembly", tag="W3C-PROV-tutorial.json"); proj

Project 'W3C-PROV-tutorial.json' with 2 samples.

### SeqFiles
The goal of BioProv is to make the provenance of biological data structures more accessible. For this purpose, it provides the **SeqFile** class. A **SeqFile** is a specialized **File** that also holds information about sequences, so it can load files such as FASTA and extract information from them using the Biopython modules `Bio.SeqIO` and `Bio.AlignIO`. When importing data, we can specify which columns contain sequence files with the `sequencefile_cols` argument. The `"assembly"` column of our dataset contains FASTA files, so we can load it as such and pass `import_data=True` to extract information from the files.

In [21]:
df = pd.read_csv(picocyano_dataset)
df['assembly'] = "../../" + df["assembly"]
proj = bp.from_df(df, sequencefile_cols="assembly", tag="W3C-PROV-tutorial.json", import_data=True); proj

Project 'W3C-PROV-tutorial.json' with 2 samples.

**SeqFiles** have attributes that are specific to biological sequences, such as number of sequences, number of base pairs, GC content, and N50. These are available as attributes of the **Sample** instance and are also collected in the **SeqStats** data class, which is itself an attribute of the **SeqFile**.

In [22]:
proj["GCF_000010065.1_ASM1006v1"].files['assembly'].__dict__

{'path': PosixPath('/Users/vini/Bio/BioProv/docs/tutorials/../../bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna'),
 'name': 'GCF_000010065.1_ASM1006v1_genomic',
 'basename': 'GCF_000010065.1_ASM1006v1_genomic.fna',
 'directory': PosixPath('/Users/vini/Bio/BioProv/docs/tutorials/../../bioprov/data/genomes'),
 'extension': '.fna',
 'tag': 'assembly',
 'attributes': {},
 '_exists': True,
 '_size': '2.6 MB',
 'raw_size': 2730026,
 '_sha1': 'f8658496b343257690f828ec14226644dc9e9ca2',
 '_entity': None,
 'format': 'fasta',
 'records': {'NC_006576.1': SeqRecord(seq=Seq('ATTTAAATCACTGGCATCAGCATTCGCAATATCATTGAGGTCAACAATACTTTC...GGC'), id='NC_006576.1', name='NC_006576.1', description='NC_006576.1 Synechococcus elongatus PCC 6301 DNA, complete genome', dbxrefs=[])},
 '_generator': <Bio.SeqIO.FastaIO.FastaIterator at 0x7ff342d6ecd0>,
 '_seqstats': SeqStats(number_seqs=1, total_bps=2696255, mean_bp=2696255.0, min_bp=2696255, max_bp=2696255, N50=2696255, GC=0.55484),
 '_parser': 'seq',
 

In [23]:
proj["GCF_000010065.1_ASM1006v1"].files['assembly'].seqstats

SeqStats(number_seqs=1, total_bps=2696255, mean_bp=2696255.0, min_bp=2696255, max_bp=2696255, N50=2696255, GC=0.55484)
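To clarify what these fields mean, the following sketch computes the same kind of statistics for a list of nucleotide strings. It is a standalone illustration, not BioProv's code: `SimpleSeqStats` and `compute_stats` are hypothetical names, and the field names simply mirror the `SeqStats` output above. N50 is taken here as the length of the sequence at which the running total, counted from the longest sequence down, first reaches half of all bases.

```python
from dataclasses import dataclass


@dataclass
class SimpleSeqStats:
    """Hypothetical stand-in that mirrors the fields printed by SeqStats above."""
    number_seqs: int
    total_bps: int
    mean_bp: float
    min_bp: int
    max_bp: int
    N50: int
    GC: float


def compute_stats(sequences):
    """Compute basic statistics for a non-empty list of nucleotide strings."""
    lengths = sorted((len(s) for s in sequences), reverse=True)
    total = sum(lengths)

    # N50: walk from the longest sequence down and stop at the first length
    # where the running total reaches at least half of all bases.
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    # GC content: fraction of G and C bases over all bases.
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)

    return SimpleSeqStats(
        number_seqs=len(lengths),
        total_bps=total,
        mean_bp=total / len(lengths),
        min_bp=lengths[-1],
        max_bp=lengths[0],
        N50=n50,
        GC=round(gc / total, 5),
    )


stats = compute_stats(["ATGCATGC", "GGCC", "AT"])
print(stats)
```

For a complete genome stored as a single sequence, as in the output above, the mean, minimum, maximum, and N50 all equal the genome length.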

### PresetPrograms

Now that we've seen an easier way to import data into BioProv, let us see an easier way to run **Programs**. BioProv provides the **PresetProgram** class, a convenient way to create Programs that are run often. The `bioprov.programs` module contains functions that return PresetPrograms. For this example, we are going to run [Prodigal](https://github.com/hyattpd/Prodigal) using a PresetProgram. Prodigal is gene-calling software that predicts coding sequences in prokaryotic genomes.

In [24]:
from bioprov.programs import prodigal

# Samples are stored in a dictionary, so we iterate over (name, sample) pairs with .items()
for _, sample in proj.items():
    sample.add_programs(prodigal(sample))
    sample.run_programs(_print=False)

Running program 'prodigal' for sample GCF_000010065.1_ASM1006v1.
Command is:
/Users/vini/anaconda3/envs/bioprov/bin/prodigal \ 
	-i /Users/vini/Bio/BioProv/docs/tutorials/../../bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna \ 
	-a /Users/vini/Bio/BioProv/docs/tutorials/../../bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic_proteins.faa \ 
	-d /Users/vini/Bio/BioProv/docs/tutorials/../../bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic_genes.fna \ 
	-s /Users/vini/Bio/BioProv/docs/tutorials/../../bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic_scores.cds 
Running program 'prodigal' for sample GCF_000007925.1_ASM792v1.
Command is:
/Users/vini/anaconda3/envs/bioprov/bin/prodigal \ 
	-i /Users/vini/Bio/BioProv/docs/tutorials/../../bioprov/data/genomes/GCF_000007925.1_ASM792v1_genomic.fna \ 
	-a /Users/vini/Bio/BioProv/docs/tutorials/../../bioprov/data/genomes/GCF_000007925.1_ASM792v1_genomic_proteins.faa \ 
	-d /Users/vini/Bio/BioProv/docs/tutorials/../../bio
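The commands printed above follow a regular pattern: each output file sits next to the input assembly and reuses its name with a different suffix. The sketch below rebuilds that pattern with the standard library only. It is not how BioProv constructs the command internally; `prodigal_command` is a hypothetical helper, and the executable is assumed to be on the `PATH` as `prodigal`.

```python
from pathlib import PurePosixPath


def prodigal_command(assembly, executable="prodigal"):
    """Build a Prodigal command list; the output names follow the log above."""
    assembly = PurePosixPath(assembly)
    stem = assembly.stem  # file name without the ".fna" extension

    def sibling(suffix):
        # Place each output next to the input file.
        return str(assembly.with_name(stem + suffix))

    return [
        executable,
        "-i", str(assembly),             # input genome (FASTA)
        "-a", sibling("_proteins.faa"),  # protein translations
        "-d", sibling("_genes.fna"),     # nucleotide sequences of genes
        "-s", sibling("_scores.cds"),    # scores of potential genes
    ]


cmd = prodigal_command("bioprov/data/genomes/GCF_000010065.1_ASM1006v1_genomic.fna")
print(" \\\n\t".join(cmd))
```

A list like this could be handed to `subprocess.run()`, but with a PresetProgram you do not need to build it yourself.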

Because Prodigal is a **PresetProgram**, it already expects a Sample to have a File tagged as `"assembly"`. This can be customized by setting the `input_files` argument in the constructor of PresetProgram:

In [25]:
bp.PresetProgram?

Init signature:
bp.PresetProgram(
    name=None,
    params=None,
    sample=None,
    input_files=None,
    output_files=None,
    preffix_tag=None,
)
Docstring:
Class for holding a preset program and related functions.

A WorkflowStep instance inherits from Program and consists of an instance
of Program with an associated instance of Sample or Project.
Init docstring:
:param name: Instance of bioprov.Program
:param params: Dictionary of parameters.
:param sample: An instance of Sa

### BioProvDocuments and W3C-PROV documents (under development)

Now that we've imported our **Project** and run the **PresetProgram** Prodigal on the **Samples**, we can record the provenance of our workflow. For this, we use the **BioProvDocument** class, which couples the **Project** to a **ProvDocument** from the [Prov](https://prov.readthedocs.io/en/latest/) library.

In [26]:
prov = bp.BioProvDocument(proj)

Now our **Project** is associated with a `prov.ProvDocument` object, which is an attribute of the **BioProvDocument** instance. The project itself remains accessible through the `.project` attribute of the BioProvDocument:

In [29]:
print(prov.project, "\n")

Project 'W3C-PROV-tutorial.json' with 2 samples. 



There are numerous ways to manipulate this ProvDocument. 

* We can extract the PROV-N format, a human-readable provenance notation, with the `get_provn()` method.
* We can serialize the document as W3C-PROV-compatible JSON with the `serialize()` method.
* We can export the document as a provenance graph with `prov_to_dot()`.

In [28]:
from prov.dot import prov_to_dot

print("PROV-N", "\n\n", prov.ProvDocument.get_provn())
print("PROV-JSON", "\n\n", prov.ProvDocument.serialize())
dot = prov_to_dot(prov.ProvDocument)
dot.write_png("W3C-PROV-tutorial.png")

PROV-N 

 document
  prefix project <Project 'W3C-PROV-tutorial.json' with 2 samples.>
  prefix users <Users associated with BioProv Project 'W3C-PROV-tutorial.json'>
  prefix samples <Samples associated with bioprov Project 'W3C-PROV-tutorial.json'>
  prefix GCF_000010065.1_ASM1006v1.programs <Programs associated with Sample GCF_000010065.1_ASM1006v1>
  prefix envs <Environments associated with User 'vini'>
  prefix GCF_000007925.1_ASM792v1.programs <Programs associated with Sample GCF_000007925.1_ASM792v1>
  
  entity(project:Project 'W3C-PROV-tutorial.json' with 2 samples.)
  wasDerivedFrom(samples:GCF_000010065.1_ASM1006v1, project:Project 'W3C-PROV-tutorial.json' with 2 samples., -, -, -)
  wasDerivedFrom(GCF_000010065.1_ASM1006v1.programs:prodigal, envs:Environment_caafe21a8b87100b49a042b205b378e1a106dc7a, -, -, -)
  wasDerivedFrom(samples:GCF_000007925.1_ASM792v1, project:Project 'W3C-PROV-tutorial.json' with 2 samples., -, -, -)
  wasDerivedFrom(GCF_000007925.1_ASM792v1.program
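To give an idea of what the JSON serialization of such a document contains, the sketch below writes a single `wasDerivedFrom` relation by hand in the PROV-JSON layout, using only the standard `json` module. It is not output produced by BioProv: the `ex` prefix, its URI, and the identifiers `ex:project`, `ex:sample`, and `_:d1` are placeholders chosen for this illustration.

```python
import json

# Hand-written PROV-JSON for one derivation: a sample derived from a project.
# The "ex" prefix and its URI are placeholders chosen for this sketch.
document = {
    "prefix": {"ex": "http://example.org/"},
    "entity": {
        "ex:project": {},
        "ex:sample": {},
    },
    "wasDerivedFrom": {
        "_:d1": {
            "prov:generatedEntity": "ex:sample",
            "prov:usedEntity": "ex:project",
        }
    },
}

serialized = json.dumps(document, indent=2)
print(serialized)

# Round-trip to confirm the text is valid JSON with the same content.
assert json.loads(serialized) == document
```

The same relation reads `wasDerivedFrom(ex:sample, ex:project, -, -, -)` in PROV-N, which matches the shape of the lines printed above.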

### To be continued

The graph feature is still under development, so for now the graph looks like this:

![](W3C-PROV-tutorial.png)

Previous versions of the provenance graph looked like this:

![](deprecated_graph.png)

This tutorial is still under development, and any suggestions or contributions are appreciated!