<a href="https://colab.research.google.com/github/shukwong/FAIR_Workflows/blob/main/src/bioinformatics_workflow_workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FAIR principles applied to cancer bioinformatics workflows

## Introduction
In this workshop, we will:
- discuss how to apply the FAIR principles to bioinformatics workflows
- install the conda package manager
- install the Nextflow workflow manager and bioinformatics tools
- search for and download data from the cloud
- run a Nextflow pipeline
- examine the results

### The FAIR principles - Findable, Accessible, Interoperable, and Reusable
These principles help ensure that data, tools, and methodologies are easily discoverable and usable by both humans and machines.







### Findable
- Version control platforms: Researchers are encouraged to share their code and workflows on version control and collaboration platforms such as GitHub and GitLab.
- Persistent links: Each file and commit in a GitHub repository has a unique URL; a URL that includes the commit hash gives stable, long-term access to that specific version of a workflow.
- Discoverability: GitHub's search functionality and integration with other platforms improve the chances of workflows being found by interested researchers.
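
As a minimal sketch of the persistent-link idea, the helper below builds a GitHub URL that pins a file to one commit. The function name `github_permalink` is an illustrative choice, and the commit hash in the example is a hypothetical placeholder, not a real commit of this repository.

```python
def github_permalink(owner: str, repo: str, commit: str, path: str) -> str:
    """Build a GitHub URL that points at a file as of one specific commit."""
    return f"https://github.com/{owner}/{repo}/blob/{commit}/{path}"


# "0123abc" is a hypothetical commit hash used only for illustration.
print(
    github_permalink(
        "shukwong",
        "FAIR_Workflows",
        "0123abc",
        "src/bioinformatics_workflow_workshop.ipynb",
    )
)
```

Unlike a URL that names a branch such as `main`, a commit-pinned URL keeps pointing at the same content after the branch moves on, which is what a publication should cite.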

### Accessible

- Access: GitHub provides a public platform where researchers can freely share their code, making it accessible to anyone with an internet connection.
- Version control: GitHub's version control system allows users to access different versions of a workflow, including the latest updates and previous iterations (e.g. the version used for publication).
- Documentation: README files, wikis, and inline comments in GitHub repositories can provide clear instructions on how to access and use the workflows.
- Issue tracking: The issue tracker on GitHub allows users to report problems or request features, facilitating communication between creators and users of the workflow.
- API access: GitHub's API allows programmatic access to repositories, enabling automated systems to retrieve workflows.
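
The programmatic access mentioned above can be sketched with the standard library alone. This example targets the GitHub REST API "repository contents" endpoint; the helper names `contents_url` and `fetch_metadata` are illustrative, and the default branch name `main` is an assumption taken from the Colab link at the top of this notebook.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def contents_url(owner: str, repo: str, path: str, ref: str = "main") -> str:
    """Build the REST API URL describing one file at a given branch, tag, or commit."""
    quoted_path = urllib.parse.quote(path)
    return f"{API_ROOT}/repos/{owner}/{repo}/contents/{quoted_path}?ref={ref}"


def fetch_metadata(url: str) -> dict:
    """Download and decode the JSON metadata for the file (requires network access)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Only the URL is built here; call fetch_metadata(url) to perform the request.
url = contents_url("shukwong", "FAIR_Workflows", "src/bioinformatics_workflow_workshop.ipynb")
print(url)
```

Calling `fetch_metadata(url)` should return a dictionary with fields such as `sha` and `download_url`. Unauthenticated requests are rate-limited, so an automated system would normally send an access token as well.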

### Interoperable
- Interoperability is vital in cancer bioinformatics workflows because it enables the integration of diverse data types and tools from various sources.
- By adhering to standard formats, providing detailed documentation, and using common ontologies, researchers can create workflows that others can easily adapt and build upon.

### Reusable

The use of GitHub and of workflow management systems such as Snakemake and Nextflow is crucial for bioinformatics research and collaborative science.

By hosting workflows on GitHub, researchers can:

- Version control their code, allowing for easy tracking of changes and rollbacks if needed.
- Share their workflows openly, enabling other researchers to examine, use, and build upon existing work.
- Facilitate collaboration through features like pull requests and issue tracking.
- Provide documentation alongside the code, improving understanding and reproducibility.

Workflow management systems like Snakemake and Nextflow further enhance reusability by:

- Offering a standardized way to define and execute complex bioinformatics pipelines.
- Providing built-in support for containers (e.g., Docker, Singularity) and conda environments, so that environment specifications live in the workflow definition and the same software environment is used across different systems.
- Supporting modular design, allowing researchers to easily swap out or update individual steps in a pipeline.
- Automatically handling job scheduling and parallelization, making workflows more efficient and scalable.

Both Nextflow and Snakemake have public collections of standardized, best-practice pipelines for common bioinformatics tasks:
- Nextflow provides [nf-core](https://nf-co.re/)
- Snakemake provides [Snakemake workflow catalog](https://snakemake.github.io/snakemake-workflow-catalog/)

## Hands-on workshop

### Install the conda package manager

Conda is a cross-platform package and environment manager. Although it is best known in the Python community, it is language-agnostic, and it lets users install, run, and update packages together with their dependencies.

In [1]:
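import os

# condacolab installs conda under /usr/local on Colab.
conda_path = "/usr/local/bin/conda"

if os.path.exists(conda_path):
    print(f"{conda_path} exists.")
else:
    print(f"{conda_path} does not exist, installing")
    !pip install -q condacolab
    import condacolab
    # Installs conda and then restarts the kernel (see the output below).
    condacolab.install()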

/usr/local/bin/conda does not exist, installing
⏬ Downloading https://github.com/conda-forge/miniforge/releases/download/23.11.0-0/Mambaforge-23.11.0-0-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:16
🔁 Restarting kernel...


In [1]:
!conda --version

conda 23.11.0
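
The check above relies on a fixed install path. As a hedged alternative, the sketch below looks the executable up on `PATH` with `shutil.which`, which also works outside Colab; the helper name `find_tool` is an illustrative choice.

```python
import shutil
from typing import Optional


def find_tool(name: str) -> Optional[str]:
    """Return the full path of an executable found on PATH, or None if it is absent."""
    return shutil.which(name)


conda_location = find_tool("conda")
if conda_location:
    print(f"conda found at {conda_location}")
else:
    print("conda not found on PATH")
```

The same helper can be reused later for any command-line tool the workflow depends on.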


### Use conda to install tools
- Conda channels are repositories that contain packages for installation.  
- One notable channel in the bioinformatics community is Bioconda, which specializes in providing bioinformatics software.
- Bioconda offers a vast array of tools and libraries specifically tailored for biological data analysis, making it easier for researchers to set up and manage their computational environments.
- We will install Nextflow and Samtools through Bioconda.
  - [Nextflow](https://www.nextflow.io/) is a powerful workflow management system that enables scalable and reproducible scientific workflows using software containers.
  - [Samtools](https://www.htslib.org/) is a suite of programs for interacting with high-throughput sequencing data. It provides utilities for manipulating alignments in the SAM, BAM, and CRAM formats, including sorting, merging, indexing, and converting between formats.

In [2]:
%%bash
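# Channels added later take priority, so conda-forge ends up above bioconda.
conda config --add channels bioconda
conda config --add channels conda-forge
# Pin the versions so the environment is reproducible.
conda install samtools=1.20 nextflow=24.04.3 -y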

Channels:
 - conda-forge
 - bioconda
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - nextflow=24.04.3
    - samtools=1.20


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    alsa-lib-1.2.12            |       h4ab18f5_0         543 KB  conda-forge
    ca-certificates-2024.7.4   |       hbcca054_0         151 KB  conda-forge
    cairo-1.18.0               |       h3faef2a_0         959 KB  conda-forge
    certifi-2024.7.4           |     pyhd8ed1ab_0         156 KB  conda-forge
    coreutils-9.5              |       hd590300_0         2.9 MB  conda-forge
    curl-8.8.0                 |       he654da7_0         163 KB  conda-forge
    expat-2.6.2                |       h59595ed_0         134 KB  conda-forge
    font-ttf-dejavu-san



    current version: 23.11.0
    latest version: 24.5.0

Please update conda by running

    $ conda update -n base -c conda-forge conda




In [3]:
%%bash
which nextflow

/usr/local/bin/nextflow


In [4]:
!which samtools

/usr/local/bin/samtools


### Find and download test data
We will use the [42basepairs](https://42basepairs.com/) browser to look for a tumor/normal paired dataset on which to do some variant calling.

A good place to look for open bioinformatics data and references on the cloud is [Open Data on AWS](https://registry.opendata.aws/tag/bioinformatics/).

Let's download two tumor/normal pairs from the [TCGA DREAM challenge](https://42basepairs.com/browse/gs/broad-public-datasets/TCGA_DREAM). Because space on the Colab runtime is limited, we will subset the BAMs on the fly to the TP53 gene region (chr17:7,571,739-7,590,808, GRCh37) and download only that region.

In [5]:
%%bash
tp53_grch37_region="17:7571739-7590808"
samtools view -b gs://broad-public-datasets/TCGA_DREAM/synthetic.challenge.set1.tumor.bam ${tp53_grch37_region} >set1.tumor.tp53.bam
samtools view -b gs://broad-public-datasets/TCGA_DREAM/synthetic.challenge.set1.normal.bam ${tp53_grch37_region} >set1.normal.tp53.bam
samtools view -b gs://broad-public-datasets/TCGA_DREAM/synthetic.challenge.set2.tumor.bam ${tp53_grch37_region} >set2.tumor.tp53.bam
samtools view -b gs://broad-public-datasets/TCGA_DREAM/synthetic.challenge.set2.normal.bam ${tp53_grch37_region} >set2.normal.tp53.bam

samtools index set1.tumor.tp53.bam
samtools index set1.normal.tp53.bam
samtools index set2.tumor.tp53.bam
samtools index set2.normal.tp53.bam
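
The eight commands above follow one pattern, so they can also be generated in a loop. The sketch below only builds and prints the commands. It uses `samtools view -o` to name the output file instead of shell redirection, and the function name `subset_commands` is an illustrative choice.

```python
import shlex

BUCKET = "gs://broad-public-datasets/TCGA_DREAM"
TP53_GRCH37_REGION = "17:7571739-7590808"


def subset_commands(sets=("set1", "set2"), kinds=("tumor", "normal")):
    """Return one 'samtools view' and one 'samtools index' command per BAM."""
    commands = []
    for set_name in sets:
        for kind in kinds:
            source = f"{BUCKET}/synthetic.challenge.{set_name}.{kind}.bam"
            output = f"{set_name}.{kind}.tp53.bam"
            # Extract only the TP53 region into a local BAM file.
            commands.append(
                ["samtools", "view", "-b", "-o", output, source, TP53_GRCH37_REGION]
            )
            # Index the subset so downstream tools can query it by position.
            commands.append(["samtools", "index", output])
    return commands


for command in subset_commands():
    print(shlex.join(command))
```

To execute the commands rather than print them, each list can be passed to `subprocess.run(command, check=True)`, which stops at the first failure.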

In [6]:
%%bash
echo -e "17\t7571739\t7590808" >tp53.bed
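
One detail is worth knowing when writing BED files by hand: samtools region strings are 1-based and inclusive, whereas BED intervals are 0-based and half-open. A strict conversion therefore subtracts one from the start coordinate. The cell above writes the start unchanged, so under the BED convention its interval begins one base after the samtools region, a difference that is negligible for this exercise. The sketch below shows the strict conversion; the function name `region_to_bed` is an illustrative choice.

```python
def region_to_bed(region: str) -> str:
    """Convert a 1-based inclusive 'chrom:start-end' region into a BED line."""
    chrom, span = region.split(":")
    # Allow thousands separators such as 7,571,739.
    start, end = (int(value.replace(",", "")) for value in span.split("-"))
    # BED starts are 0-based; BED ends are exclusive, so the end is unchanged.
    return f"{chrom}\t{start - 1}\t{end}"


print(region_to_bed("17:7571739-7590808"))
```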

In [7]:
%%bash
echo -e "patient,status,sample,bam,bai
sample1,1,tumor_sample1,set1.tumor.tp53.bam,set1.tumor.tp53.bam.bai
sample1,0,normal_sample1,set1.normal.tp53.bam,set1.normal.tp53.bam.bai
sample2,1,tumor_sample2,set2.tumor.tp53.bam,set2.tumor.tp53.bam.bai
sample2,0,normal_sample2,set2.normal.tp53.bam,set2.normal.tp53.bam.bai" > samplesheet.csv
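
For more than a handful of samples, the samplesheet is easier to generate than to type. The sketch below reproduces the same five columns with Python's `csv` module. The helper names are illustrative, and the file naming pattern is assumed to match the BAM subsets created earlier.

```python
import csv
import io

FIELDS = ["patient", "status", "sample", "bam", "bai"]


def samplesheet_rows(n_pairs: int = 2):
    """Build one tumor row (status 1) and one normal row (status 0) per patient."""
    rows = []
    for index in range(1, n_pairs + 1):
        for status, kind in ((1, "tumor"), (0, "normal")):
            bam = f"set{index}.{kind}.tp53.bam"
            rows.append(
                {
                    "patient": f"sample{index}",
                    "status": status,
                    "sample": f"{kind}_sample{index}",
                    "bam": bam,
                    "bai": f"{bam}.bai",
                }
            )
    return rows


def render_samplesheet(rows) -> str:
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(render_samplesheet(samplesheet_rows()), end="")
```

Writing the returned text to `samplesheet.csv` gives the same content as the cell above.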

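### Run the Sarek workflow with Nextflow
[Sarek](https://nf-co.re/sarek/3.4.2/) is a comprehensive and modular Nextflow workflow designed for the analysis of germline and somatic variants in whole-genome and exome sequencing data.

The BAMs are aligned to GRCh37, so we first download a matching reference FASTA together with its index and sequence dictionary.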


In [8]:
%%bash
wget --no-verbose https://data.broadinstitute.org/snowman/hg19/Homo_sapiens_assembly19.fasta
wget --no-verbose https://data.broadinstitute.org/snowman/hg19/Homo_sapiens_assembly19.fasta.fai
wget --no-verbose https://data.broadinstitute.org/snowman/hg19/Homo_sapiens_assembly19.dict

2024-07-22 23:44:53 URL:https://data.broadinstitute.org/snowman/hg19/Homo_sapiens_assembly19.fasta [3140756381/3140756381] -> "Homo_sapiens_assembly19.fasta" [1]
2024-07-22 23:44:53 URL:https://data.broadinstitute.org/snowman/hg19/Homo_sapiens_assembly19.fasta.fai [2780/2780] -> "Homo_sapiens_assembly19.fasta.fai" [1]
2024-07-22 23:44:54 URL:https://data.broadinstitute.org/snowman/hg19/Homo_sapiens_assembly19.dict [14811/14811] -> "Homo_sapiens_assembly19.dict" [1]


In [9]:
%%bash
nextflow run nf-core/sarek -r 3.4.2 -profile conda --max_cpus 2 --max_memory 8GB \
 --input ./samplesheet.csv --outdir ./results --fasta ./Homo_sapiens_assembly19.fasta --fasta_fai ./Homo_sapiens_assembly19.fasta.fai \
 --genome null --igenomes_ignore --step variant_calling \
 --wes --intervals tp53.bed \
 --tools strelka --only_paired_variant_calling


 N E X T F L O W   ~  version 24.04.3

Pulling nf-core/sarek ...
 downloaded from https://github.com/nf-core/sarek.git
Launching `https://github.com/nf-core/sarek` [elegant_leakey] DSL2 - revision: b5b766d3b4 [3.4.2]

Downloading plugin nf-validation@1.1.3
Downloading plugin nf-prov@1.2.2
WARN: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  There is a problem with your Conda configuration!

  You will need to set-up the conda-forge and bioconda channels correctly.
  Please refer to https://bioconda.github.io/
  The observed channel order is 
  [conda-forge, bioconda]
  but the following channel order is required:
  [conda-forge, bioconda, defaults]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
WARN: Access to undefined parameter `monochromeLogs` -- Initialise it to a default value eg. `params.monochromeLogs = some_value`


------------------------------------------------------
                                     

In [12]:
%%bash
zcat ./results/variant_calling/strelka/tumor_sample1_vs_normal_sample1/tumor_sample1_vs_normal_sample1.strelka.somatic_snvs.vcf.gz

##fileformat=VCFv4.1
##fileDate=20240722
##source=strelka
##source_version=2.9.10
##startTime=Mon Jul 22 23:58:53 2024
##cmdline=/content/work/conda/strelka_somatic-797e7ecb4792efbff3ce957e697bec25/bin/configureStrelkaSomaticWorkflow.py --tumor tumor_sample1.converted.cram --normal normal_sample1.converted.cram --referenceFasta Homo_sapiens_assembly19.fasta --callRegions 17_7571740-7590808.bed.gz --exome --runDir strelka
##reference=file:///content/work/ad/cab6e82969b0e434bb8acb3038e925/Homo_sapiens_assembly19.fasta
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<I