*This tutorial provides a summarized explanation of Nextflow, derived from the original [Nextflow training documents](https://training.nextflow.io/). It also showcases an interactive example of how to use Nextflow in BRH, adapted from the [Canine Data Commons FASTQ Reader tutorial](https://brh.data-commons.org/dashboard/Public/notebooks/canine_datacommons_fastq_reader.html).*

To download data from Gen3 in this tutorial, you must authorize access to the Tutorial CANINE Google Login on the Profile page in BRH. Pulling the data from Gen3 may not work on your local machine. [Read more](https://brh.data-commons.org/dashboard/Public/index.html#LinkingAccessTo)

# Nextflow Introduction

## Why Use Nextflow?

Nextflow is a tool that helps to easily create workflows for data-heavy computations. It's built around the idea that Linux, with its many command-line and scripting tools, can be used to create data science pipelines.

Nextflow allows you to define complex interactions between programming languages and gives you a sophisticated parallel computing environment in which to write pipelines (termed workflows). Its key features include:
* Workflow portability and reproducibility
* Scalability of parallelization and deployment
* Integration of existing tools

## Processes and Channels

A Nextflow workflow is created by connecting different processes. Each process can be written in any scripting language that Linux can run (like Bash, Perl, Ruby, Python, etc.). These processes run independently and do not share any common state that can be written to.

The only way these processes can talk to each other is through asynchronous queues called channels. Any process can define one or more channels as an input and output. The way these processes interact, and therefore the workflow itself, is determined by these input and output declarations.
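As a rough analogy only (this is not how Nextflow is implemented), the idea of independent steps that communicate solely through a queue can be sketched in plain Python. The function names, file names, and the `DONE` sentinel are all assumptions made for this illustration.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the channel (an assumption of this sketch)

def download(out_channel):
    """Stand-in for a process that emits file names on its output channel."""
    for name in ["sample_1.fastq", "sample_2.fastq"]:  # hypothetical file names
        out_channel.put(name)
    out_channel.put(DONE)

def analyze(in_channel, results):
    """Stand-in for a process that consumes items as they arrive."""
    while True:
        item = in_channel.get()
        if item is DONE:
            break
        results.append(f"{item}_analysis.txt")

channel = queue.Queue()  # the only link between the two steps
results = []             # used here only to collect what the consumer produced

producer = threading.Thread(target=download, args=(channel,))
consumer = threading.Thread(target=analyze, args=(channel, results))
producer.start()
consumer.start()
producer.join()
consumer.join()

print(results)
```

The consumer starts working as soon as the first item arrives, without waiting for the producer to finish, which is the behavior the channel model is meant to provide.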

## Execution Abstraction

While a process defines what command or script should be run, the executor determines how that script is run on the target platform. Unless specified otherwise, processes are run on your local computer.

However, for real-world workflows, a cloud platform is often needed. Nextflow provides a separation between the logic of the workflow and the system on which it runs. This means you can create a workflow that runs on your computer, a server, or the cloud, without needing any changes. You just need to define the target platform in the configuration file.
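The separation between what a task does and how it is launched can be illustrated with a small Python sketch. It is a simplified picture, not Nextflow's own code: the two launcher functions and the `config` dictionary are hypothetical. In Nextflow itself the corresponding setting is `process.executor` in the configuration file.

```python
# Hypothetical back ends: each one only decides HOW a task script is launched.
def local_command(script):
    """Run the script directly with Bash on this machine."""
    return ["bash", "-c", script]

def slurm_command(script):
    """Hand the same script to a Slurm cluster through sbatch."""
    return ["sbatch", f"--wrap={script}"]

EXECUTORS = {"local": local_command, "slurm": slurm_command}

def build_command(script, config):
    """Pick the launcher named in the configuration; default to local."""
    executor = config.get("executor", "local")
    return EXECUTORS[executor](script)

task_script = "echo hello"  # the task logic never changes
local_cmd = build_command(task_script, {})
cluster_cmd = build_command(task_script, {"executor": "slurm"})
print(local_cmd)
print(cluster_cmd)
```

Only the configuration value differs between the two calls; the task script is identical in both.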

## Scripting Language

Nextflow scripting is an extension of the Groovy programming language, which itself is a superset of Java. Groovy is like a simplified version of Java, making it easier and more approachable to write code.

# Example Gen3 Workflow

## Set-Up

Now let's get into the specifics of how Nextflow works. The most essential component of a Nextflow workflow is the main .nf Nextflow script file. Here is where the main flow of the workflow is set. We will slowly build up the components of this file in this section.

```bash
#!/usr/bin/env nextflow
...
workflow {
    files = DownloadFastqFile()
    files.flatten().set {input_files}
    AnalyzeFastqFile(input_files)
}
```

The first line of a Nextflow script file is a shebang. It's a special type of comment used in scripts to specify what interpreter should be used to execute the script.

In this case, `#!/usr/bin/env nextflow` tells the system to execute the script using the `nextflow` interpreter. The `env` command helps locate the `nextflow` interpreter within the system's PATH. If you are using a Nextflow workspace in BRH, this setup has already been properly done. If not, refer to https://www.nextflow.io/docs/latest/getstarted.html for setup instructions.

The workflow is the final component of the Nextflow script and outlines the primary steps in the workflow:

1. `files = DownloadFastqFile()`: This line executes the `DownloadFastqFile` process, which is defined earlier in the script (not shown here for simplicity). The process output, which may contain multiple files, is returned as a channel and assigned to the variable `files`.

2. `files.flatten().set {input_files}`: The `flatten()` operator takes any list of files emitted by the channel and emits each file as a separate item. This step is necessary because `DownloadFastqFile` declares its output as `path("*.fastq")`, which emits all matching files together as one list when more than one file matches. `set` then assigns the resulting channel to the name `input_files`. A Nextflow channel is the mechanism that passes data between processes.

3. `AnalyzeFastqFile(input_files)`: This line calls the `AnalyzeFastqFile` process with the flattened output of `DownloadFastqFile` as input, so the process runs once for each file.
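The effect of flattening can be pictured with a short Python sketch. It only mimics the behavior on ordinary lists and is not Nextflow's implementation; the file names are hypothetical.

```python
def flatten(emissions):
    """Yield every file on its own, whether an emission is one path or a list of paths."""
    for item in emissions:
        if isinstance(item, (list, tuple)):
            yield from flatten(item)
        else:
            yield item

# One emission that bundles two files together (hypothetical names)
download_output = [["reads_1.fastq", "reads_2.fastq"]]

input_files = list(flatten(download_output))
print(input_files)
```

After flattening, a downstream step receives two separate items instead of a single list, so it can handle each file independently.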

The workflow showcased here consists of two separate processes. The first of these is `DownloadFastqFile`.

```bash
process DownloadFastqFile {

    // DownloadFastqFile has no publishDir directive, so its outputs stay in
    // the task's work directory

    conda './environment.yml' // path to your environment.yml file

    output:
    path("*.fastq")

    script:
    """
    gen3 drs-pull object "${params.fastq_guid}"
    for file in \$(ls | grep .fastq.gz); do
        gunzip "\$file"
    done
    """
}
```

This process consists of several major components:

1. `conda './environment.yml'`: This is an optional line which tells Nextflow to use the conda environment specified in the 'environment.yml' file to run this process (or create it if it doesn't already exist). Conda is a package and environment management system, and using this statement ensures that the correct dependencies are installed and used for this process. It is also possible to specify the conda environment to be used in the `nextflow.config` file instead (see below for more details on the Nextflow configuration file).

2. `output: path("*.fastq")`: This line specifies what the process will output. In this case, the process outputs one or more fastq files pulled from gen3.

3. `script:`: This begins the section that defines the commands the process runs. Here, a Bash script downloads the file identified by `params.fastq_guid` from Gen3, then loops over every downloaded file ending in `.fastq.gz` and unzips it. `params.fastq_guid` is a workflow parameter set in the configuration file (more details below).
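For readers more familiar with Python than Bash, the unzip loop in the script block does roughly what the following sketch does. This is an illustrative equivalent, not part of the workflow; the helper name `gunzip_all` and the sample file are assumptions, and the demo runs in a temporary directory.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def gunzip_all(directory):
    """Decompress every *.fastq.gz file in `directory` and remove the archive, as gunzip does."""
    unpacked = []
    for archive in sorted(Path(directory).glob("*.fastq.gz")):
        target = archive.with_suffix("")  # drops the trailing ".gz"
        with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        archive.unlink()
        unpacked.append(target)
    return unpacked

# Demo on a tiny, made-up FASTQ record
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "example.fastq.gz"
    with gzip.open(sample, "wt") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")
    unpacked_names = [path.name for path in gunzip_all(tmp)]
    unpacked_text = (Path(tmp) / "example.fastq").read_text()

print(unpacked_names)
```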

The `./environment.yml` file used to specify the conda environment installs the dependencies needed by the Python script that the workflow runs. Read more about conda environment files [here](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).

In [1]:
%%writefile environment.yml
name: canine
channels:
  - defaults
dependencies:
  - python=3.9
  - pip
  - pip:
    - gen3
    - bioinfokit

Writing environment.yml


*Note: the `%%writefile environment.yml` line tells Jupyter Notebook to write the contents of the cell to a file rather than run them as code.*

In [2]:
%%writefile nextflow.config
// enables the processes to be run in a conda environment
conda.enabled = true

params {
    fastq_guid = "dg.C78ne/4527012c-3a5f-481d-820c-da7b77a26b48" // GUID for fastq file
    max_records = 10 // max records to process
    endpoint = "https://caninedc.org"
    refresh_file = "/home/jovyan/.gen3/credentials.json"
}

Writing nextflow.config


The Nextflow configuration file gives additional setup details for the Nextflow workflow. In this case, it specifies that Conda is to be used for setting up the environments for the individual processes and sets up some default parameters for the Nextflow script.

These parameters can be used within the script and can also be overridden when you run the script. 

You can set or override the parameters when you run the script by passing them as command line arguments with the `--` prefix. 

For example, to run the script with different values for `refresh_file` and `max_records`, you would use the following command:

```bash
nextflow run canine.nf --refresh_file="/home/josh/.gen3/credentials.json" --max_records=20
```

In this command, `nextflow run canine.nf` is the ordinary syntax for running the script, `/home/josh/.gen3/credentials.json` is the new value for `refresh_file`, and `20` is the new value for `max_records`. The `max_records` parameter is accessed in the Nextflow script itself through `${params.max_records}`.

This is one of the features that makes Nextflow powerful for pipeline scripting. By using parameters, the same script can be used with different data or settings without any changes to the script itself.
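The way configuration defaults combine with command-line overrides can be sketched in Python. This is a simplified model rather than Nextflow's actual parser: the `apply_overrides` helper is hypothetical, it handles only the `--name=value` form, and it converts only whole numbers.

```python
def apply_overrides(defaults, argv):
    """Return a copy of `defaults` with any --name=value arguments applied on top."""
    params = dict(defaults)
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            name, value = arg[2:].split("=", 1)
            value = value.strip('"')
            # Simplification: treat all-digit values as integers, everything else as text
            params[name] = int(value) if value.isdigit() else value
    return params

# Defaults mirroring two of the entries in nextflow.config
defaults = {
    "max_records": 10,
    "refresh_file": "/home/jovyan/.gen3/credentials.json",
}
argv = ['--refresh_file="/home/josh/.gen3/credentials.json"', "--max_records=20"]

params = apply_overrides(defaults, argv)
print(params)
```

The defaults are left untouched, so running again without arguments falls back to the values from the configuration file.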

Nextflow configuration files can specify variables for the script being run or specific settings for how to run the script. Read more about Nextflow configuration files at https://www.nextflow.io/docs/latest/config.html .

```bash
process AnalyzeFastqFile {

    // Copy the outputs of AnalyzeFastqFile to the results directory in
    // the base workflow directory
    publishDir "${baseDir}/results", mode: 'copy'
    
    conda './environment.yml' // path to your environment.yml file

    input:
    path(input_file)

    output:
    path("${input_file}_analysis.txt")

    script:
    """
    python3 ${baseDir}/analyze_fastq.py ${params.endpoint} ${params.refresh_file} ${input_file} ${params.max_records}
    """
}
```

The second process in this workflow takes each file it receives through its `input_file` input and runs the Python script `analyze_fastq.py`, which is located in the same directory as the Nextflow script (this is indicated by the `${baseDir}/` before the name of the Python script).

The `analyze_fastq.py` script is run with four command-line arguments. Three of them (`endpoint`, `refresh_file`, and `max_records`) come from the parameters in the configuration file, and the fourth is the input file. Nextflow automatically stores the path of the process's input file in the `input_file` variable, which can then be referenced in the output file name as well as passed as a command-line argument to the Python script.

Lastly, the `AnalyzeFastqFile` process declares `${input_file}_analysis.txt` as its output. The code that creates the file is in the Python script itself; the output declaration tells the process to expect a text file named `${input_file}_analysis.txt` to be written by the script. The line `publishDir "${baseDir}/results", mode: 'copy'` tells Nextflow to copy the `${input_file}_analysis.txt` file produced by this process to the `results` folder in the same directory as the Nextflow script (`${baseDir}`). Nextflow also creates the `results` folder automatically if it doesn't already exist.
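The copy behavior of `publishDir` with `mode: 'copy'` can be approximated in Python as follows. This is only a sketch of the observable effect (create the folder if needed, then copy the output file into it); the `publish` helper and the directory layout in the demo are assumptions.

```python
import shutil
import tempfile
from pathlib import Path

def publish(output_file, publish_dir):
    """Copy `output_file` into `publish_dir`, creating the directory if it is missing."""
    publish_dir = Path(publish_dir)
    publish_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(output_file, publish_dir))

# Demo with a made-up task directory and output file
with tempfile.TemporaryDirectory() as base_dir:
    task_dir = Path(base_dir) / "work" / "06" / "2b42a8"
    task_dir.mkdir(parents=True)
    produced = task_dir / "example.fastq_analysis.txt"
    produced.write_text("ACGT, IIII, {'A': 1, 'C': 1, 'G': 1, 'T': 1}\n")

    published = publish(produced, Path(base_dir) / "results")
    published_name = published.name
    both_exist = produced.exists() and published.exists()

print(published_name, both_exist)
```

Because the file is copied rather than moved, the original stays in the task's work directory and the published copy survives even if the work directory is deleted later.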

The Python code for the AnalyzeFastqFile process is given in `analyze_fastq.py`.

In [3]:
%%writefile analyze_fastq.py
import argparse
from gen3.file import Gen3File
from gen3.query import Gen3Query
from gen3.auth import Gen3Auth
from gen3.submission import Gen3Submission
from gen3.index import Gen3Index
from bioinfokit.analys import fastq

# create the top-level parser
parser = argparse.ArgumentParser(prog='analyze_fastq')
parser.add_argument('endpoint', type=str, help='The endpoint.')
parser.add_argument('refresh_file', type=str, help='The refresh file.')
parser.add_argument('fastq_file', type=str, help='The FASTQ file to analyze.')
parser.add_argument('record_limit', type=int, help='The maximum number of records to process.')

args = parser.parse_args()

endpoint = args.endpoint
auth = Gen3Auth(endpoint, refresh_file=args.refresh_file)
sub = Gen3Submission(endpoint, auth)
file = Gen3File(endpoint, auth)

programs = sub.get_programs()

records = fastq.fastq_reader(file=args.fastq_file)
counter = 0
output_file = open(args.fastq_file + "_analysis.txt", "w")

for record in records:
    if counter < args.record_limit:
        _, sequence, _, quality = record
        base_count = {'A': sequence.count('A'), 'C': sequence.count('C'), 'G': sequence.count('G'), 'T': sequence.count('T')}
        output_file.write(f"{sequence}, {quality}, {base_count}\n")
    else:
        break
    counter += 1
output_file.close()

Writing analyze_fastq.py


This script, `analyze_fastq.py`, is used to analyze a FASTQ file and create a textual analysis output file. This analysis consists of the sequence, quality, and base counts for a specified number of records. The FASTQ file to be analyzed, the number of records to be analyzed, and a few more parameters are passed to the script as command-line arguments.

Let's go through the script line by line:

1. `import argparse`: This line imports the `argparse` module, which provides functions to facilitate the parsing of command-line arguments.

2. The script then imports several modules from the `gen3` and `bioinfokit` packages. These packages are used to interact with a Gen3 data repository and analyze the FASTQ file respectively.

3. The `argparse.ArgumentParser` function is used to create a parser object. This object will hold all the information necessary to parse the command-line arguments.

4. `parser.add_argument` is then used to specify which command-line options the program is expecting. In this case, it is expecting four positional arguments: `endpoint`, `refresh_file`, `fastq_file`, and `record_limit`.

5. `args = parser.parse_args()`: This line parses the command-line arguments and returns them as an `argparse.Namespace` object. The arguments are then accessed as attributes of this object.

6. `auth = Gen3Auth(endpoint, refresh_file=args.refresh_file)`: This line creates a `Gen3Auth` object. This represents the user's authorization to access the Gen3 data repository. The `refresh_file` argument should point to a file containing the user's credentials.

7. `sub = Gen3Submission(endpoint, auth)`: This line creates a `Gen3Submission` object which can be used to submit data to the Gen3 repository.

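8. `file = Gen3File(endpoint, auth)`: This line creates a `Gen3File` object, a client for the file-related features of the Gen3 repository. It does not represent the FASTQ file being analyzed.

9. `programs = sub.get_programs()`: This line gets a list of programs from the Gen3 repository. Neither `file` nor `programs` is used in the rest of the script.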

10. `records = fastq.fastq_reader(file=args.fastq_file)`: This line reads the records from the FASTQ file, which is one of the command-line arguments.

11. An output file is opened for writing. The output file's name is the same as the input FASTQ file, with "_analysis.txt" appended to it.

12. The script then enters a for loop, iterating over the FASTQ records. For each record, it unpacks the record's four fields and keeps the sequence and quality. It then counts the number of each base (A, C, G, T) in the sequence. This information is written to the output file. The loop stops when the number of processed records equals the `record_limit` argument.

13. `output_file.close()`: Finally, the script closes the output file.

In summary, this script (and by extension, the `AnalyzeFastqFile` process which calls it) reads a FASTQ file, analyzes the sequence and quality of a certain number of records, and writes the results to a new file.
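To make the per-record analysis concrete without installing `gen3` or `bioinfokit`, the following sketch reproduces the same counting logic using only the standard library. It assumes the simple four-lines-per-record FASTQ layout, and the helper names and sample reads are made up for illustration; it is not a replacement for `fastq.fastq_reader`.

```python
import io
from collections import Counter
from itertools import islice

def read_fastq(handle):
    """Yield (header, sequence, separator, quality) from a four-lines-per-record FASTQ stream."""
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:  # an empty header line means end of input
            return
        yield tuple(lines)

def summarize(handle, record_limit):
    """Build one output line per record, stopping after `record_limit` records."""
    rows = []
    for _, sequence, _, quality in islice(read_fastq(handle), record_limit):
        counts = Counter(sequence)
        base_count = {base: counts[base] for base in "ACGT"}
        rows.append(f"{sequence}, {quality}, {base_count}")
    return rows

# Two made-up reads; only the first is processed because record_limit is 1
sample = io.StringIO("@r1\nAACG\n+\nIIII\n@r2\nTTGA\n+\nJJJJ\n")
rows = summarize(sample, record_limit=1)
print(rows)
```

Each row has the same `sequence, quality, base_count` layout that `analyze_fastq.py` writes to its output file.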

Putting all sections of the Nextflow script together:

In [4]:
%%writefile canine.nf
#!/usr/bin/env nextflow

process DownloadFastqFile {

    // DownloadFastqFile has no publishDir directive, so its outputs stay in
    // the task's work directory

    conda './environment.yml' // path to your environment.yml file

    output:
    path("*.fastq")

    script:
    """
    gen3 drs-pull object "${params.fastq_guid}"
    for file in \$(ls | grep .fastq.gz); do
        gunzip "\$file"
    done
    """
}


process AnalyzeFastqFile {

    // Copy the outputs of AnalyzeFastqFile to the results directory in
    // the base workflow directory
    publishDir "${baseDir}/results", mode: 'copy'
    
    conda './environment.yml' // path to your environment.yml file

    input:
    path(input_file)

    output:
    path("${input_file}_analysis.txt")

    script:
    """
    python3 ${baseDir}/analyze_fastq.py ${params.endpoint} ${params.refresh_file} ${input_file} ${params.max_records}
    """
}

workflow {
    files = DownloadFastqFile()
    files.flatten().set {input_files}
    AnalyzeFastqFile(input_files)
}

Writing canine.nf


To edit and re-run the Nextflow script, execute the above cell again to rewrite the Nextflow script file, then run the code block that follows to re-run the script itself. It may also be necessary to delete the `work` and `results` directories between runs, since the large downloaded `SRR7012463_1.fastq` file can quickly exhaust disk space.

## Results

Now all that's left is to run the Nextflow workflow itself (the `-resume` option, written with a single dash, is optional and tells Nextflow to reuse cached results from previous runs where possible):

In [1]:
!nextflow run canine.nf -resume

N E X T F L O W  ~  version 22.10.6
Launching `canine.nf` [voluminous_spence] DSL2 - revision: 660eb25521
executor >  local (2)
[4e/d145d9] process > DownloadFastqFile    [100%] 1 of 1 ✔
[06/2b42a8] process > AnalyzeFastqFile (1) [100%] 1 of 1 ✔
Completed at: 25-Jul-2023 19:43:57
Duration    : 1m 13s
CPU hours   : (a few seconds)
Succeeded   : 2



While it runs, Nextflow lists the process or processes it is executing in its command-line output (displayed above). The hash corresponding to each process task is displayed to the left of the process name. At the end of its output, Nextflow also reports the completion timestamp, the wall-clock duration of the workflow, the CPU hours consumed, and the number of tasks that completed successfully.

When the script is run, Nextflow creates a directory structure to manage the data and workflow execution. Here is a breakdown of the resulting directory structure:

1. `work/`: This directory is automatically created by Nextflow and is where the program runs the processes. Inside the `work/` directory, there will be multiple subdirectories, each corresponding to a process task. Each subdirectory will be named with a hash value, unique to each task (and its corresponding process). Inside these task directories, Nextflow stores scripts, input files, and output files related to each task. 

2. `results/`: This directory is specified in the `AnalyzeFastqFile` process with the `publishDir` directive. This is where the output files of the `AnalyzeFastqFile` process will be copied to.

    - `results/SRR7012463_1.fastq_analysis.txt`: This file is the output of the `AnalyzeFastqFile` process. It contains the analysis of the FASTQ file and is copied into the `results/` directory.

The file `SRR7012463_1.fastq` downloaded by the `DownloadFastqFile` process is not specified to be copied or moved to a specific location, so it remains within its `work/` subdirectory. This file is not copied out of its work directory because it is extremely large and a second copy would be likely to exhaust disk space.
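Before deciding whether to clean up between runs, it can help to check how much space a directory such as `work/` occupies. The sketch below is one way to do that in Python; the `directory_size` helper is hypothetical, and the demo uses a temporary directory rather than a real `work/` folder. Symbolic links are skipped so that a file linked into another task directory is not counted twice.

```python
import tempfile
from pathlib import Path

def directory_size(root):
    """Total size in bytes of the regular files under `root`, ignoring symbolic links."""
    return sum(
        path.stat().st_size
        for path in Path(root).rglob("*")
        if path.is_file() and not path.is_symlink()
    )

# Demo on a made-up directory tree
with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp) / "4e" / "d145d9"
    task_dir.mkdir(parents=True)
    (task_dir / "example.fastq").write_bytes(b"ACGTA")    # 5 bytes
    (task_dir / ".command.sh").write_bytes(b"echo hi")    # 7 bytes
    total_bytes = directory_size(tmp)

print(total_bytes)
```

In a notebook, calling `directory_size("work")` after a run would report the space taken by that run's task directories.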