# Working in this notebook

This Jupyter notebook is set up to use a `bash` kernel. That means the commands in the code blocks
are executed by a bash shell, just as if we had typed them in a terminal.
If you would rather work in a terminal, you can convert this document to a shell script through the `Export`
option in the file menu.

# Setup

The Docker image comes with Python 3, Java, and conda installed. We will use those to bootstrap
our environment as we go along.

## [Install Nextflow](https://www.nextflow.io/docs/latest/getstarted.html#installation)

Nextflow is distributed as a self-contained executable package, which means that it does not require any special installation procedure.

It only needs two easy steps:

- Download the executable package by copying and pasting the following command in your terminal window: `wget -qO- https://get.nextflow.io | bash`. It will create the nextflow main executable file in the current directory.

- Optionally, move the nextflow file to a directory accessible by your `$PATH` variable (this is only required to avoid remembering and typing the full path to nextflow each time you need to run it).
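The optional second step could look like the sketch below. The target directory `$HOME/bin` is a hypothetical choice (any directory listed in `$PATH` works), and the sketch copies the launcher instead of moving it, on the assumption that you still want `./nextflow` in the project directory for the later cells of this notebook.

```bash
# Hypothetical install location; any directory that is on $PATH will do.
BIN_DIR="$HOME/bin"
mkdir -p "$BIN_DIR"

# Copy the launcher only if the download step left it in the current directory.
if [ -f ./nextflow ]; then
    chmod +x ./nextflow
    cp ./nextflow "$BIN_DIR/nextflow"
fi

# Prepend the directory to PATH for this session unless it is already listed.
case ":$PATH:" in
    *":$BIN_DIR:"*) ;;
    *) export PATH="$BIN_DIR:$PATH" ;;
esac

# Report where the shell now finds nextflow, if anywhere.
command -v nextflow || echo "nextflow is not on PATH yet"
```

The `export` only lasts for the current shell session; to make it permanent you would add the same line to your shell start-up file.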

In [11]:
wget -qO- https://get.nextflow.io | bash

CAPSULE: Downloading dependency org.slf4j:log4j-over-slf4j:jar:1.7.25
CAPSULE: Downloading dependency org.multiverse:multiverse-core:jar:0.7.0
CAPSULE: Downloading dependency com.fasterxml.jackson.core:jackson-databind:jar:2.6.7.2
CAPSULE: Downloading dependency com.amazonaws:aws-java-sdk-batch:jar:1.11.542
CAPSULE: Downloading dependency io.nextflow:nf-httpfs:jar:20.10.0
CAPSULE: Downloading dependency joda-time:joda-time:jar:2.8.1
CAPSULE: Downloading dependency com.beust:jcommander:jar:1.35
CAPSULE: Downloading dependency io.nextflow:nf-commons:jar:20.10.0
CAPSULE: Downloading dependency org.jsoup:jsoup:jar:1.11.2
CAPSULE: Downloading dependency ch.grengine:grengine:jar:1.3.0
CAPSULE: Downloading dependency jline:jline:jar:2.9
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-nio:jar:3.0.5
CAPSULE: Downloading dependency javax.mail:mail:jar:1.4.7
CAPSULE: Downloading dependency org.slf4j:jul-to-slf4j:jar:1.7.25
CAPSULE: Downloading dependency org.apache.httpco

Now we can check to see that our nextflow works. Note here that I am addressing the `nextflow` executable without 
adding it to the path. I tend to work this way to keep the version of `nextflow` available tightly coupled to the project 
I'm working on. 

In [5]:
./nextflow

Usage: nextflow [options] COMMAND [arg...]

Options:
  -C
     Use the specified configuration file(s) overriding any defaults
  -D
     Set JVM properties
  -bg
     Execute nextflow in background
  -c, -config
     Add the specified file to configuration set
  -d, -dockerize
     Launch nextflow via Docker (experimental)
  -h
     Print this help
  -log
     Set nextflow log file path
  -q, -quiet
     Do not print information messages
  -syslog
     Send logs to syslog server (eg. localhost:514)
  -v, -version
     Print the program version

Commands:
  clean         Clean up project cache and work directories
  clone         Clone a project into a folder
  config        Print a project configuration
  console       Launch Nextflow interactive console
  drop          Delete the local copy of a project
  help          Print the usage help for a command
  info          Print project and system runtime information
  kuberun       Execute a workflow in a Kubernetes cluster (experimental

## Add Bioconda channel (optional)

As a first step, let's add the [Bioconda channel](https://bioconda.github.io/user/install.html#set-up-channels) (and conda-forge) to our conda installation.

Since this is our first time interacting with code, note that the next block is a code block
containing three commands. To run a code block, you can:

1. Hit the play button.
2. Hit shift-enter (which executes and moves to the next code block).
3. Hit control-enter (which executes but does not move to the next code block).

Pick one of these now to add the additional conda channels that we will use later.

In [6]:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge



We can double-check that this works by `getting` the conda channels.

In [7]:
conda config --get channels

--add channels 'defaults'   # lowest priority
--add channels 'bioconda'
--add channels 'conda-forge'   # highest priority


Did we confirm that we added the channels?
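If you would rather have the shell answer that question, a small helper can scan the listing for each expected channel. This is only a sketch: the function name `channels_present` is made up for this notebook, and it assumes that `conda config --get channels` prints lines in the `--add channels '<name>'` form shown in the output above.

```bash
# Reads a `conda config --get channels` listing on stdin and checks that
# every channel named as an argument appears in it.
channels_present() {
    local listing ch
    listing=$(cat)
    for ch in "$@"; do
        if ! grep -qF -- "--add channels '$ch'" <<<"$listing"; then
            echo "missing channel: $ch"
            return 1
        fi
    done
    echo "all requested channels are configured"
}

# Only query conda when it is actually installed.
if command -v conda >/dev/null 2>&1; then
    conda config --get channels | channels_present defaults bioconda conda-forge
fi
```

The helper returns a non-zero status as soon as one channel is missing, so it can also be used in an `if` statement.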

# Nextflow

## Functions

Nextflow allows the definition of custom functions in the workflow script using the following syntax:

```
def <function name> ( arg1, arg, .. ) {
    <function body>
}
```

For example:

```
def foo() {
    'Hello world'
}
```

```
def bar(alpha, omega) {
    alpha + omega
}
```

The above snippets define two simple functions that can be invoked in the workflow script: `foo()`, which returns the string `Hello world`, and `bar(10,20)`, which returns the sum of its two parameters.

- Tip: Functions implicitly return the result of the last evaluated statement.

The keyword `return` can be used to explicitly exit from a function, returning the specified value. For example:

```
def fib( x ) {
    if( x <= 1 )
        return x
    else
        fib(x-1) + fib(x-2)
}
```

### Example

Let's get our feet wet by running a file with a Nextflow script in it. Since we have learned about functions, we can use this as our playground. The file of interest is available in the `01_example_functions` directory. Alternatively, [CLICK THIS LINK](01_example_functions/main.nf).

In [10]:
cat 01_example_functions/main.nf

nextflow.preview.dsl=2


/* foo takes no arguments 
   and returns a Hello World */

def foo() {
    'Hello world'
}



// bar takes two "things" 
// and returns their sum.

def bar(alpha, omega) {
    alpha + omega
}


In [9]:
./nextflow run 01_example_functions/main.nf

N E X T F L O W  ~  version 20.10.0
Launching `example_1/main.nf` [festering_cantor] - revision: 75c8d1da04
WARN: DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE


That is interesting, but we didn't end up doing anything obvious. The functions `foo` and `bar` are never executed
and we never print the result.

Making a few additions, including `println` statements, results in [this new file](01_example_functions/main2.nf).

Running this file through nextflow produces results.

In [None]:
./nextflow run 01_example_functions/main2.nf

Functions, then, are useful for encapsulating functionality, for modularity, and, for the star students, for testing.
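To experiment with all three functions from this section in one place, the sketch below writes them to a scratch file and calls them from a `workflow` block. It is not the `main2.nf` file linked above; the `sketches/functions.nf` path is hypothetical, and the script uses `nextflow.enable.dsl=2` on the assumption that the installed Nextflow version accepts it.

```bash
mkdir -p sketches

# The quoted 'EOF' keeps bash from expanding anything inside the script.
cat > sketches/functions.nf <<'EOF'
nextflow.enable.dsl=2

def foo() {
    'Hello world'
}

def bar(alpha, omega) {
    alpha + omega
}

def fib( x ) {
    if( x <= 1 )
        return x
    else
        fib(x-1) + fib(x-2)
}

workflow {
    // Each call's return value is printed, so the implicit returns become visible.
    println foo()
    println bar(10, 20)
    println fib(10)
}
EOF

# Run the sketch only if the launcher from the Setup section is present.
if [ -x ./nextflow ]; then
    ./nextflow run sketches/functions.nf
else
    echo "./nextflow not found; wrote sketches/functions.nf without running it"
fi
```

Because the calls sit inside `workflow`, nothing is evaluated until Nextflow executes the script, which is the piece that was missing from the first example.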

## Processes

Functions are useful for encapsulating code, but they are **not** the building blocks of a workflow or pipeline. In practice a Nextflow pipeline script is made by joining together different processes. Each process can be written in any scripting language that can be executed by the Linux platform (Bash, Perl, Ruby, Python, etc.).

Processes are executed independently and are isolated from each other, i.e. they do not share a common (writable) state. The only way they can communicate is via asynchronous FIFO queues, called channels in Nextflow.

Any process can define one or more channels as input and output. The interaction between these processes, and ultimately the pipeline execution flow itself, is implicitly defined by these input and output declarations.

A Nextflow script looks like this:

```
nextflow.preview.dsl=2

process foo {
    output:
      path 'foo.txt'
    script:
      """
      your_command > foo.txt
      """
}

process bar {
    input:
      path x
    output:
      path 'bar.txt'
    script:
      """
      another_command $x > bar.txt
      """
}

workflow {
    data = channel.fromPath('/some/path/*.txt')
    foo()
    bar(data)
}
```

The above example defines two _processes_. Their *execution order* is not determined by the fact that the `foo` process comes before `bar` in the script (it could also be written the other way around).
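A runnable variation of that script is sketched below, with `bar` deliberately defined before `foo`. The placeholder commands are replaced by `echo` and `tr`, and the `sketches/` directory, the `input.txt` file, and the `nextflow.enable.dsl=2` line are assumptions made for this illustration.

```bash
mkdir -p sketches
echo "some input" > sketches/input.txt

# The quoted 'EOF' keeps bash from expanding $x inside the Nextflow script.
cat > sketches/processes.nf <<'EOF'
nextflow.enable.dsl=2

// `bar` is defined first on purpose: definition order does not set execution order.
process bar {
    input:
      path x
    output:
      path 'bar.txt'
    script:
      """
      tr a-z A-Z < $x > bar.txt
      """
}

process foo {
    output:
      path 'foo.txt'
    script:
      """
      echo hello > foo.txt
      """
}

workflow {
    data = channel.fromPath('sketches/input.txt')
    foo()
    bar(data)
}
EOF

# Run from the notebook directory so the relative input path resolves.
if [ -x ./nextflow ]; then
    ./nextflow run sketches/processes.nf
else
    echo "./nextflow not found; wrote sketches/processes.nf without running it"
fi
```

The results land in the task directories under `work/`; nothing is printed from the files themselves because no output channel is viewed.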

### Process composition
Processes with matching input-output declarations can be composed so that the output of the first process is passed as input to the following process. Given the previous process definitions, it’s possible to write the following:

```
workflow {
    bar(foo())
}
```

### Process outputs
A process output can also be accessed using the out attribute for the respective process object. For example:

```
workflow {
    foo()
    bar(foo.out)
    bar.out.view()
}
```

When a process defines two or more output channels, each of them can be accessed using the array element operator e.g. `out[0]`, `out[1]`, etc. or using named outputs (see below).
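Index-based access might look like the following sketch. The process name `split_greeting`, its two output files, and the `sketches/` path are invented for illustration, and `nextflow.enable.dsl=2` is assumed to be accepted by the installed version.

```bash
mkdir -p sketches

# The quoted 'EOF' keeps bash from expanding $it inside the Nextflow script.
cat > sketches/multi_output.nf <<'EOF'
nextflow.enable.dsl=2

// A process with two output declarations, hence two output channels.
process split_greeting {
    output:
      path 'first.txt'
      path 'second.txt'
    script:
      """
      echo hello > first.txt
      echo world > second.txt
      """
}

workflow {
    split_greeting()
    // Outputs are indexed in declaration order.
    split_greeting.out[0].view { "first output: $it" }
    split_greeting.out[1].view { "second output: $it" }
}
EOF

if [ -x ./nextflow ]; then
    ./nextflow run sketches/multi_output.nf
else
    echo "./nextflow not found; wrote sketches/multi_output.nf without running it"
fi
```

Index-based access depends on the order of the `output:` declarations, so it has to be updated whenever that order changes.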

#### Process named output

The process output definition allows the use of the emit option to define a name identifier that can be used to reference the channel in the external scope. For example:

```
process foo {
  output:
    path '*.bam', emit: samples_bam

  '''
  your_command --here
  '''
}

workflow {
    foo()
    foo.out.samples_bam.view()
}
```


## Workflow

The `workflow` block then defines how processes are related to each other through input parameters and outputs. In the example script from the Processes section, neither `foo` nor `bar` consumes the other's output, so these two processes **can** run in parallel (if resources to do so exist).

### Workflow definition

The workflow keyword allows the definition of sub-workflow components that enclose the invocation of one or more processes and operators:

```
workflow my_pipeline {
    foo()
    bar( foo.out.collect() )
}
```

For example, the above snippet defines a workflow component named `my_pipeline` that can be invoked from another workflow component definition like any other function or process, i.e. `my_pipeline()`.
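As a sketch of such an invocation, the script below surrounds `my_pipeline` with concrete `foo` and `bar` processes and calls it from the implicit workflow. The process bodies (`echo`, `cat`), the `sketches/` path, and the `nextflow.enable.dsl=2` line are assumptions made for this example.

```bash
mkdir -p sketches

# The quoted 'EOF' keeps bash from expanding $x inside the Nextflow script.
cat > sketches/named_workflow.nf <<'EOF'
nextflow.enable.dsl=2

process foo {
    output:
      path 'foo.txt'
    script:
      """
      echo hello > foo.txt
      """
}

process bar {
    input:
      path x
    output:
      path 'bar.txt'
    script:
      """
      cat $x > bar.txt
      """
}

// Named workflow component: defined here, but not run on its own.
workflow my_pipeline {
    foo()
    bar( foo.out.collect() )
}

// The implicit workflow invokes the component like a function.
workflow {
    my_pipeline()
}
EOF

if [ -x ./nextflow ]; then
    ./nextflow run sketches/named_workflow.nf
else
    echo "./nextflow not found; wrote sketches/named_workflow.nf without running it"
fi
```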

### Workflow parameters

A workflow component can access any variable and parameter defined in the outer scope:

```
params.data = '/some/data/file'

workflow my_pipeline {
    if( params.data )
        bar(params.data)
    else
        bar(foo())
}
```
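Parameters such as `params.data` can also be overridden when the script is launched, by passing `--data <value>` on the command line. The sketch below only prints the parameter so that the effect is easy to see; the `sketches/params.nf` path and the replacement value `/other/data/file` are hypothetical.

```bash
mkdir -p sketches

# The quoted 'EOF' keeps bash from expanding ${params.data} itself.
cat > sketches/params.nf <<'EOF'
nextflow.enable.dsl=2

// Default value; `--data <value>` on the command line takes precedence.
params.data = '/some/data/file'

workflow {
    println "params.data = ${params.data}"
}
EOF

if [ -x ./nextflow ]; then
    # First run uses the default from the script.
    ./nextflow run sketches/params.nf
    # Second run overrides it from the command line.
    ./nextflow run sketches/params.nf --data /other/data/file
else
    echo "./nextflow not found; wrote sketches/params.nf without running it"
fi
```

Note the double dash: options with a single dash (such as `-c`) belong to Nextflow itself, while `--name` sets `params.name` for the pipeline.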

### Workflow inputs

A workflow component can declare one or more input channels using the take keyword. For example:

```
workflow my_pipeline {
    take: data
    main:
    foo(data)
    bar(foo.out)
}
```

- Warning: When the take keyword is used, the beginning of the workflow body needs to be identified with the main keyword.

Then, the input can be specified as an argument in the workflow invocation statement:

```
workflow {
    my_pipeline( channel.from('/some/data') )
}
```

- Note: Workflow inputs are by definition channel data structures. If a basic data type is provided instead (i.e. a number, string, list, etc.), it’s implicitly converted to a channel value (i.e. non-consumable).

### Workflow outputs
A workflow component can declare one or more output channels using the `emit` keyword. For example:

```
workflow my_pipeline {
    main:
      foo(data)
      bar(foo.out)
    emit:
      bar.out
}
```

Then, the result of the `my_pipeline` execution can be accessed using the `out` property, i.e. `my_pipeline.out`. When there are multiple output channels declared, use the array bracket notation to access each output component as described for the Process outputs definition.

Alternatively, the output channel can be accessed using the identifier name to which it’s assigned in the `emit` declaration:

```
workflow my_pipeline {
   main:
     foo(data)
     bar(foo.out)
   emit:
     my_data = bar.out
}
```

Then, the result of the above snippet can be accessed using `my_pipeline.out.my_data`.
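Putting `take`, `main`, and a named `emit` together, a complete script might look like the sketch below. The process bodies, the `val name` input on `foo`, the sample values `alpha` and `beta`, and the `sketches/` path are all invented for illustration; `nextflow.enable.dsl=2` is assumed to be accepted by the installed version.

```bash
mkdir -p sketches

# The quoted 'EOF' keeps bash from expanding $name and $x.
cat > sketches/named_emit.nf <<'EOF'
nextflow.enable.dsl=2

process foo {
    input:
      val name
    output:
      path 'foo.txt'
    script:
      """
      echo $name > foo.txt
      """
}

process bar {
    input:
      path x
    output:
      path 'bar.txt'
    script:
      """
      tr a-z A-Z < $x > bar.txt
      """
}

workflow my_pipeline {
    take: data
    main:
      foo(data)
      bar(foo.out)
    emit:
      my_data = bar.out
}

workflow {
    my_pipeline( channel.of('alpha', 'beta') )
    // Access the emitted channel by the name given in the emit block.
    my_pipeline.out.my_data.view()
}
EOF

if [ -x ./nextflow ]; then
    ./nextflow run sketches/named_emit.nf
else
    echo "./nextflow not found; wrote sketches/named_emit.nf without running it"
fi
```

With two input values, `foo` and `bar` each run twice, so `view()` should list two `bar.txt` paths from different task directories.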

### Implicit workflow
A workflow definition which does not declare any name is assumed to be the main workflow and it’s implicitly executed. Therefore it’s the entry point of the workflow application.

- Note: The implicit workflow definition is ignored when a script is included as a module. This allows writing a workflow script that can be used both as a library module and as an application script.

- Tip: An alternative workflow entry can be specified using the -entry command line option.
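A sketch of the `-entry` option in use follows. The workflow names `hello` and `goodbye` and the `sketches/entry.nf` path are hypothetical, and each workflow only prints a message so that it is obvious which one ran.

```bash
mkdir -p sketches

cat > sketches/entry.nf <<'EOF'
nextflow.enable.dsl=2

workflow hello {
    println 'running the hello workflow'
}

workflow goodbye {
    println 'running the goodbye workflow'
}

// Implicit workflow: the default entry point.
workflow {
    println 'running the implicit workflow'
}
EOF

if [ -x ./nextflow ]; then
    # Default: the unnamed workflow is the entry point.
    ./nextflow run sketches/entry.nf
    # Alternative entry point chosen on the command line.
    ./nextflow run sketches/entry.nf -entry goodbye
else
    echo "./nextflow not found; wrote sketches/entry.nf without running it"
fi
```

Since `-entry` is an option of `nextflow run` itself, it takes a single dash, unlike pipeline parameters.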

### Workflow composition

Workflows defined in your script or imported by a module inclusion can be invoked and composed like any other process in your application.

```
workflow flow1 {
    take: data
    main:
        foo(data)
        bar(foo.out)
    emit:
        bar.out
}

workflow flow2 {
    take: data
    main:
        foo(data)
        baz(foo.out)
    emit:
        baz.out
}

workflow {
    data = channel.fromPath('/some/path/*.txt')
    flow1(data)
    flow2(flow1.out)
}
```