# ChemBioSys and AquaDiva | Genome-resolved metagenomics workshop

## Session 06 | Scalable pipelines 

Date: 18-19 September 2023

---
### ❓ **Why do we need scalable pipelines?**
**Disadvantages of using stand-alone tools:**

1. Reproducibility: 💭 Which version did I use?<br>
<br>
2. Data privacy: 💀 Am I allowed to share that on an 'open server'?<br>
<br>
3. Efficiency: 💂‍♂️ Large scale input and output?<br>
<br>
4. Standardisation: 👀 Comparable output?<br>
<br>
5. Bioinformatics: 😹 Are my coding skills sufficient? Time?

---


### One example of scalable pipelines: [nf-core](https://nf-co.re/)


<img src="img/nf-core-logo.png" width="500"/>
<font size="2"> 


<img src="img/nf-core-fig.png" width="500"/>
<font size="2"> 

<font size="2">[Ewels et al. 2020](https://www.nature.com/articles/s41587-020-0439-x)

`nf-core` is a community effort to collect a curated set of analysis pipelines built using Nextflow, which can run tasks across multiple compute infrastructures.<br>
<br>
- Highly optimised pipelines
- Portable (docker/singularity/conda)
- Extensive documentation provided by the nf-core community
- Reproducible due to validated releases and packaged software
- Widespread community ready to help
- Easy to apply and use

---
  

#### **Pipeline example**

<img src="img/nf-core-funcscan_logo_flat_dark.png" width="500"/>
<font size="2"> 

`nf-core/funcscan` is a bioinformatics best-practice analysis pipeline that screens nucleotide sequences, such as assembled contigs, for functional genes. It currently features mining for antimicrobial peptides (AMPs), antibiotic resistance genes (ARGs) and biosynthetic gene clusters (BGCs).

<img src="img/funcscan_metro_workflow.png" width="500"/>
<font size="2"> 

**Example Usage**

1. Prepare a samplesheet (samplesheet.csv) with your input data. Each row represents a (multi-)fasta file of assembled contig sequences.

```
sample,fasta
CONTROL_REP1,AEG588A1_001.fasta
CONTROL_REP2,AEG588A1_002.fasta
CONTROL_REP3,AEG588A1_003.fasta
```

2. Run the pipeline with a single command:

```
nextflow run nf-core/funcscan \
   -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR> \
   --run_amp_screening \
   --run_arg_screening \
   --run_bgc_screening
```

3. The output comprises three CSV tables, one each for AMPs, ARGs and BGCs, which the user can filter further downstream.
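
With many assemblies, writing the samplesheet from step 1 by hand becomes tedious. The following is a minimal sketch of how it could be generated, not part of `nf-core/funcscan` itself. The function name `write_samplesheet`, the accepted file extensions, and the use of the file name as the sample name are all assumptions made for illustration; only the `sample,fasta` header comes from the example above.

```python
import csv
from pathlib import Path


def write_samplesheet(fasta_dir, out_csv="samplesheet.csv"):
    """Write one samplesheet row per FASTA file found in fasta_dir.

    Returns the number of samples written.
    """
    # Assumption: assemblies end in one of these extensions; adjust as needed.
    suffixes = {".fasta", ".fa", ".fna"}
    fastas = sorted(p for p in Path(fasta_dir).iterdir() if p.suffix in suffixes)

    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        # Header taken from the samplesheet example in step 1.
        writer.writerow(["sample", "fasta"])
        for fasta in fastas:
            # Illustrative choice: sample name = file name without extension,
            # absolute path so the sheet works from any working directory.
            writer.writerow([fasta.stem, str(fasta.resolve())])
    return len(fastas)
```

Calling `write_samplesheet("assemblies/")` (with `assemblies/` being a hypothetical folder of contig files) would write `samplesheet.csv` into the current directory, ready to be passed to `--input`.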


<span style="color:blue">For more information, please refer to the [funcscan documentation](https://github.com/nf-core/funcscan).
</span>





### Snakemake
`nf-core` is based on [`nextflow`](https://www.nextflow.io/), which is a workflow management system. In life sciences, scalable pipelines are currently mostly written using either `nextflow` or [`Snakemake`](https://snakemake.github.io/). 

#### In a nutshell
`Snakemake` is a tool to create reproducible and scalable data analyses. Workflows are described via a human-readable, [`Python`](https://www.python.org/)-based language. They can be seamlessly scaled to server, cluster, grid and cloud environments without the need to modify the workflow definition. Finally, `Snakemake` workflows can include a description of the required software, which will be automatically deployed to any execution environment.

<img src="img/smk_1.png" alt="Snakemake" width="500"/>

#### Rules
`Snakemake` is centered around rules that define input and output files, and how these files should be processed/generated. 

<img src="img/smk_2.png" alt="Snakemake rules" width="500"/>

#### Automatic dependency inference
`Snakemake` dynamically determines which rules have to be run in order to generate the intended output. Rules are connected by their input/output files, and `Snakemake` checks which files are already there. If a workflow was run before, stopped for whatever reason, and is then repeated, `Snakemake` determines which output files are still missing and runs only the rules needed to generate them.

<img src="img/smk_3.png" alt="Snakemake rules" width="500"/>
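
To make the idea of dependency inference more concrete, here is a toy sketch in plain Python. It is not how `Snakemake` is implemented (the real tool also considers things like file modification times and wildcards); it only illustrates working backwards from a target file through rules that are linked by their inputs and outputs. The `Rule` class, the `plan` function, and all rule and file names are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class Rule:
    name: str
    inputs: List[str]
    output: str
    command: str  # placeholder text, never executed in this sketch


def plan(target, rules, exists=lambda p: Path(p).exists()):
    """Return the names of the rules needed to build target, in run order."""
    producers = {rule.output: rule for rule in rules}
    ordered = []

    def visit(path):
        if exists(path):
            return  # file is already there, nothing to do
        rule = producers.get(path)
        if rule is None:
            raise FileNotFoundError(f"No rule produces missing file: {path}")
        for dep in rule.inputs:
            visit(dep)  # make sure all inputs can be generated first
        if rule.name not in ordered:
            ordered.append(rule.name)

    visit(target)
    return ordered


# Hypothetical three-step workflow; commands are placeholders.
rules = [
    Rule("assemble", ["reads.fastq"], "contigs.fasta", "<assembler command>"),
    Rule("annotate", ["contigs.fasta"], "genes.gff", "<annotation command>"),
    Rule("report", ["genes.gff", "contigs.fasta"], "report.html", "<report command>"),
]

# Simulate the file system with a set so the example runs without real files.
present = {"reads.fastq"}
print(plan("report.html", rules, exists=lambda p: p in present))
# ['assemble', 'annotate', 'report']

# After an interrupted run that already produced the contigs:
present.add("contigs.fasta")
print(plan("report.html", rules, exists=lambda p: p in present))
# ['annotate', 'report']
```

The second call shows the resume behaviour described above: because `contigs.fasta` is already present, the `assemble` step is skipped.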

#### Scalability
`Snakemake` workflows can run in any environment (locally, clusters, cloud).

<img src="img/smk_4.png" alt="Snakemake rules" width="500"/>

#### Automatic software deployment
Required tools can be installed _on the fly_, e.g. through `conda` or [`docker`](https://www.docker.com/).

<img src="img/smk_5.png" alt="Snakemake rules" width="500"/>

#### Nextflow vs. Snakemake
`Nextflow` and `Snakemake` are both great tools and ultimately serve similar purposes, primarily making science more reproducible and making high-throughput data processing more accessible. `nf-core` is a big asset, as it provides numerous well-documented pipelines. Overall, the `Nextflow` community is bigger than the `Snakemake` camp. `Snakemake` is heavily based on [`Python`](https://www.python.org/), so compared to `Nextflow` its learning curve is arguably 😜 flatter. Within `Snakemake`, there is the [Snakemake Wrappers project](https://snakemake-wrappers.readthedocs.io/en/stable/), a collection of reusable wrappers that make it quick to use popular tools from Snakemake rules and workflows. In the end, it comes down to personal preference.

_Useful `Snakemake` resources:_

* `Snakemake` [homepage](https://snakemake.github.io/)
* `Snakemake` [documentation](https://snakemake.readthedocs.io/)
* `Snakemake` [tutorial](https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html)

---
<sub> © Anan Ibrahim, Carl-Eric Wegner 2023-09 </sub>