# Presets and Workflows

### Tutorial index
* <a href="./introduction.ipynb">Introduction to BioProv</a>
* <a href="./w3c-prov.ipynb">W3C-PROV projects</a>
* <a href="./workflows_and_presets.ipynb">Presets and Workflows</a>

In this tutorial, we're gonna take a look at how to build preset programs and workflows in order to accelerate the development of provenance-aware pipelines.

This will be done using the PresetProgram and Workflow classes made available by BioProv.

## Preset programs

As discussed before, BioProv supports and encourages creating programs manually. This can become repetitive, though, if you find yourself using the same command structure frequently. For that reason we provide the PresetProgram class, which creates a program from a single object and makes it easy to reuse the same program structure when running it in batches or building workflows with it.

First off, it's good to know that BioProv already provides a series of preset programs for common bioinformatics tasks. Here is a list of them:

In [None]:
import bioprov.programs
from inspect import getmembers, isfunction

getmembers(bioprov.programs, isfunction)

If you don't find the program you want in the list above, don't fret! We're gonna show you how to build your own next.

First, let's read in the example data we've been using, as well as a blastdb for a subset of the [MegaRes database](https://megares.meglab.org/) made available through our data module.

In [None]:
import bioprov as bp
from bioprov import Sample
from bioprov.data import synechococcus_genome, megares_blastdb

example_genome = Sample("Synechococcus", files={"assembly": synechococcus_genome})

And let's say, for demonstration purposes, that BioProv doesn't have a BlastN preset program but we still wish to align this genome to the MegaRes BlastDB.

To do that, we can take one of two paths: manually create the program object for BlastN in a script or, if we expect to run this process many times over different days and across various genomes, create a PresetProgram for it. We'll show you how to do the latter.

For this you will need BlastN installed, which can be easily done through the [conda package manager](https://anaconda.org/bioconda/blast):
```bash
conda install blast -c bioconda
```

In [None]:
from bioprov.src.main import Parameter, PresetProgram

blastn = PresetProgram(
    name="blastn",
    params=(
        Parameter(key="-db", value=megares_blastdb),
        # outfmt 6 is the blastn tabular output format
        Parameter(key="-outfmt", value=6),
    ),
    sample=example_genome,
    input_files={"-query": "assembly"},
    output_files={"-out": ("megares_hits", "_megares_hits.txt")},
)

Okay, maybe that was a lot. Let's break down what PresetPrograms take in:
1. The program name, in this case `blastn`

2. Some general parameters and their corresponding values, defined through the `params` argument.

3. The input parameter, `-query` in this case, and the corresponding tag for the input sample, which we defined as 'assembly'.

4. The output parameter, `-out` in this case, as well as a tag and a suffix for the output file. The tag will be added to the sample's files, and the output parameter value will be the sample name followed by the suffix.

We can also define the `extra_flags` argument, which is used to append extra command line parameters (flags or switches) to the end of the command string.

Though less flexible than manually adding Files and Parameters as we saw for the Program class in the introductory tutorial, PresetPrograms support the general structure of most command line tools. They help accelerate development and keep the codebase shorter and easier to read.
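To make the mapping from these arguments to a command line more concrete, here is a minimal sketch in plain Python. It is not BioProv's implementation: the `build_command` helper, the order in which flags are emitted, and the file paths are all assumptions made for illustration.

```python
def build_command(name, params, input_files, output_files,
                  sample_name, sample_files, extra_flags=()):
    """Hypothetical helper: assemble a command string from preset-style arguments."""
    parts = [name]
    # General parameters: each key is followed by its value.
    for key, value in params:
        parts += [key, str(value)]
    # Input files: the flag is followed by the path stored under the given tag.
    for flag, tag in input_files.items():
        parts += [flag, sample_files[tag]]
    # Output files: the path is the sample name plus the suffix,
    # and it is registered on the sample under the new tag.
    for flag, (tag, suffix) in output_files.items():
        out_path = sample_name + suffix
        sample_files[tag] = out_path
        parts += [flag, out_path]
    # Extra flags go at the end of the command.
    parts += list(extra_flags)
    return " ".join(parts)


# Made-up paths, for illustration only.
files = {"assembly": "synechococcus.fasta"}
command = build_command(
    name="blastn",
    params=[("-db", "megares_db"), ("-outfmt", 6)],
    input_files={"-query": "assembly"},
    output_files={"-out": ("megares_hits", "_megares_hits.txt")},
    sample_name="Synechococcus",
    sample_files=files,
)
print(command)
```

After the call, `files` also contains a `megares_hits` entry, mirroring how the output tag ends up among the sample's files.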

Of course, we can then run this program the same way we would with any other one:

In [None]:
blastn.run()

We can then see the output file gets added to our Sample:

In [None]:
example_genome.files

Let's turn this preset into a function that takes in only the sample, just so it's more reusable.

In [None]:
def align_to_megares(assembly_sample):
    
    blastn = PresetProgram(
        name="blastn",
        params=(
            Parameter(key="-db", value=megares_blastdb),
            # outfmt 6 is the blastn tabular output format
            Parameter(key="-outfmt", value=6),
        ),
        sample=assembly_sample,
        input_files={"-query": "assembly"},
        output_files={"-out": ("megares_hits", "_megares_hits.txt")}
    )
    
    return blastn

## Workflows

But then, let's suppose aligning to MegaRes is only the first step in a series of commands (or pipeline) we routinely run in our work.

What we're describing here is a Workflow and BioProv provides a helper class to deal with this scenario.

Let's say the next step could be filtering the hits we obtained by their score and saving the filtered data. For this purpose, we could use an [AWK](https://en.wikipedia.org/wiki/Awk) command, so, let's build a PresetProgram for this task too! In this case, we'll use the output tag from the previous step as our input.

In [None]:
def filter_high_score(alignment_sample):
    
    score_filter = PresetProgram(
        name="awk",
        # Parameters can be omitted!
        sample=alignment_sample,
        # Score greater or equal to 1000
        input_files={"'$12>=1000'": "megares_hits"},
        output_files={">": ("high_score", "_high_score_hits.txt")}
    )
    
    return score_filter
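For readers less familiar with AWK: `$12` is the twelfth field of each line, which in BLAST's default tabular format (`-outfmt 6`) is the bit score. A rough Python equivalent of the filter might look like the sketch below. It runs on a few made-up rows rather than real output, and `filter_high_score_lines` is a hypothetical helper, not part of BioProv.

```python
def filter_high_score_lines(lines, min_score=1000.0, column=12):
    """Keep tab-separated lines whose `column`-th field (1-based) is >= min_score."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= column and float(fields[column - 1]) >= min_score:
            kept.append(line)
    return kept


# Made-up rows in the default outfmt 6 column order:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
rows = [
    "contig_1\tgene_A\t99.1\t800\t7\t0\t1\t800\t1\t800\t0.0\t1420",
    "contig_2\tgene_B\t85.0\t300\t45\t2\t10\t309\t5\t304\t1e-80\t298",
    "contig_3\tgene_C\t97.5\t600\t15\t0\t1\t600\t1\t600\t0.0\t1000",
]
high_score = filter_high_score_lines(rows)
print(len(high_score))
```

The first and third rows pass the threshold, while the second one is dropped.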

Okay, let's start building a workflow using the two presets we've made.

The first thing we'll need is a dataset to use as input for the workflow. For that we'll use the `picocyano_dataset`, only altering the paths in the 'assembly' column so that they are relative paths to their respective files. The workflow will run on all samples created from this dataset.

In [None]:
from bioprov.data import picocyano_dataset

import pandas as pd

df = pd.read_csv(picocyano_dataset)

df["assembly"] = "../../" + df["assembly"]

df.to_csv(path_or_buf="test_dataset.csv", index=False)

Then we can begin building the workflow itself. BioProv Workflow objects take in a lot of parameters (run `help(Workflow)` to see a list of them). The ones we'll be using today are:

1. `name`: The name we'll choose for this workflow;

2. `description`: A description for the workflow;

3. `input`, `sep`: The input for the workflow. Since the input is a tabular data file, it's good to specify the column separator (it defaults to tabs);

4. `input_type`: The type of input the workflow expects. It can be a dataframe (a path to a tabular data file), a path to a directory of files, or both;

5. `index_col`, `file_columns`: These should be familiar to you if you've looked at the previous tutorials. They indicate which portions of the dataframe are used to create the project;

6. `write_provn`: Whether to write the PROVN document after running the workflow. Let's set it to `True`.
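To illustrate how `index_col` and `file_columns` relate to the input table, the sketch below reads a small CSV with the standard library and turns each row into a sample name plus a dictionary of tagged files. It only mimics the idea: `rows_to_samples` and the table contents are hypothetical, and BioProv's own loading logic may differ.

```python
import csv
import io


def rows_to_samples(handle, index_col, file_columns, sep=","):
    """Map each row to {sample name: {file tag: path}} (illustrative only)."""
    samples = {}
    for row in csv.DictReader(handle, delimiter=sep):
        samples[row[index_col]] = {tag: row[tag] for tag in file_columns}
    return samples


# Made-up table standing in for a dataset file.
table = io.StringIO(
    "sample-id,assembly,habitat\n"
    "sample_1,genomes/sample_1.fasta,marine\n"
    "sample_2,genomes/sample_2.fasta,freshwater\n"
)
samples = rows_to_samples(table, index_col="sample-id", file_columns=["assembly"])
print(samples)
```

Columns that are neither the index nor a file column (here `habitat`) are simply ignored by this sketch.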

In [None]:
from bioprov.src.workflow import Workflow, Step

get_megares_hits = Workflow(
    name="Get best MegaRes hits",
    description="Aligns nucleotide data to MegaRes with BLASTN and filters high-score (>=1000) hits.",
    input='./test_dataset.csv',
    sep=",",
    input_type="dataframe",
    index_col="sample-id",
    file_columns="assembly",
    write_provn=True,
)

Since this workflow isn't associated with any particular project, you can see BioProv generated a project using a random tag.

Now let's add the steps to this workflow using the PresetProgram functions we built before. You might think it's strange that we're setting the samples to None. This is only so that each function returns a PresetProgram instance, which the Workflow expects. The samples will be injected from the dataset we used as input for the workflow, so they won't be None in the end.

In [None]:
steps = [
    Step(align_to_megares(None), default=True),
    Step(filter_high_score(None), default=True),
]

for _step in steps:
    get_megares_hits.add_step(_step)

We can then choose which steps we want to run using a list as input to the `run_steps` method. They will then run in the order they were added to the workflow.

In [None]:
get_megares_hits.run_steps(['blastn', 'awk'])
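The behaviour described above (steps created without a sample, samples supplied later, and steps executed in the order they were added) can be pictured with a toy model. `ToyWorkflow` and `ToyStep` are hypothetical stand-ins written for illustration. They do not reproduce the real Workflow class, and the choice to run each step over all samples before moving to the next step is an assumption.

```python
class ToyStep:
    """A named action that receives its sample only when it is run."""

    def __init__(self, name, action):
        self.name = name
        self.action = action


class ToyWorkflow:
    def __init__(self, samples):
        self.samples = samples
        self.steps = {}  # dicts preserve insertion order

    def add_step(self, step):
        self.steps[step.name] = step

    def run_steps(self, names):
        log = []
        # Iterate in insertion order, skipping steps that were not requested.
        for name, step in self.steps.items():
            if name not in names:
                continue
            for sample in self.samples:
                log.append(step.action(sample))
        return log


workflow = ToyWorkflow(samples=["sample_1", "sample_2"])
workflow.add_step(ToyStep("blastn", lambda s: f"blastn on {s}"))
workflow.add_step(ToyStep("awk", lambda s: f"awk on {s}"))

# The order of the requested names does not matter here; insertion order wins.
log = workflow.run_steps(["awk", "blastn"])
print(log)
```

In this toy model, requesting only `["awk"]` would skip the alignment step entirely, which is why the order and selection of steps deserve attention when one step consumes another's output.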

Wow, that was a lot of stuff! But, if you look into this tutorial's directory, you will see a list of outputs corresponding to each step of the workflow, which ran for each file in the input dataset, as well as two other files:

1. A PROVN file describing the whole workflow and the programs we just ran.

2. A .log file similar to the cell output you see above, describing each command that was run.

But anyway, that's it for presets and workflows in BioProv. We hope this inspires you to create your own provenance-aware programs and workflows. If you do, share them with us!

And if you think the workflow or preset you created would be useful to others, feel free to strike up a conversation on our GitHub or open a new pull request.