# Tutorial 3: Run Nextflow pipelines with the `nf-quilt` plugin

The Nextflow `nf-core/rnaseq` pipeline, in conjunction with `nf-quilt`, was used to process raw sequencing data (fastqs) and generate per-sample expression values. Samples were processed together in batches (called "runs"), mirroring common practice in NGS centers, where multiple samples on a sequencing flow cell are pre-processed at the same time. The `nf-quilt` plugin automatically packages Nextflow pipeline output into a Quilt package and appends detailed pipeline run metadata to the package.

In this demo, we will use the raw data packages containing raw fastqs and sample-level package metadata as input to the `nf-core/rnaseq` pipeline to process RNA-sequencing data and generate sample-level expression measurements. 


For more detailed information on using the `nf-quilt` plugin with Nextflow pipelines, please refer to the docs at https://github.com/nextflow-io/nf-quilt.


In [None]:
import pandas as pd
import quilt3  # used below to browse and push Quilt packages

# 1. Create a sample sheet

The primary input to the `nf-core/rnaseq` pipeline is a `samplesheet.csv`, which tells the pipeline where the input files are located. For more information on sample sheets, please refer to the `nf-core` documentation: https://nf-co.re/rnaseq/3.14.0.

Since we want to process all samples from a single batch together in the same pipeline run, we will create one sample sheet per sample batch.

In [None]:
# user-specified params for specific batches of raw data to include in nf-core run
registry = "s3://quilt-example-bucket"
batch = "ccle/20190225_PRJNA523380"

In [None]:
# get list of packages in the registry
# (assumption: the original cell did not define `packages`; quilt3.list_packages is used here)
packages = list(quilt3.list_packages(registry))

samples = []
fastq1 = []
fastq2 = []

# get locations of fastqs from raw quilt packages
for p in packages:
    if batch in p:
        print(">>>>> " + p)
        pkg = quilt3.Package.browse(p, registry)
        # sample ID is "<CellLine>__<package suffix>", where the suffix follows "<batch>_"
        samples.append(pkg.meta["CellLine"] + "__" + p.replace(batch + "_", ""))
        for file in pkg:
            print(file)
            # registry + "/" + p is the package's prefix in the bucket
            if "_1.fastq.gz" in file:
                fastq1.append(registry + "/" + p + "/" + file)
            if "_2.fastq.gz" in file:
                fastq2.append(registry + "/" + p + "/" + file)
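
The pairing step in the loop above can also be isolated into a small pure function, which makes it possible to check the logic without touching S3. The following is a minimal sketch under stated assumptions: `pair_fastqs` is a hypothetical helper (not part of `quilt3` or `nf-quilt`), the file names in the example are made up, and read 1 / read 2 files are assumed to end in `_1.fastq.gz` / `_2.fastq.gz` as in the cell above.

```python
R1_SUFFIX = "_1.fastq.gz"
R2_SUFFIX = "_2.fastq.gz"


def pair_fastqs(keys, prefix):
    """Return (fastq_1, fastq_2) URI lists for the paired reads found in `keys`.

    Hypothetical helper: `keys` are file names inside one package and
    `prefix` is that package's location, e.g. "s3://bucket/namespace/name".
    """
    # Index each read file by its stem (the name without the read suffix).
    reads1 = {k[: -len(R1_SUFFIX)]: k for k in keys if k.endswith(R1_SUFFIX)}
    reads2 = {k[: -len(R2_SUFFIX)]: k for k in keys if k.endswith(R2_SUFFIX)}

    # Fail early if a mate is missing, rather than building a misaligned sample sheet.
    unpaired = sorted(set(reads1) ^ set(reads2))
    if unpaired:
        raise ValueError(f"unpaired fastq files under {prefix}: {unpaired}")

    stems = sorted(reads1)
    return (
        [f"{prefix}/{reads1[s]}" for s in stems],
        [f"{prefix}/{reads2[s]}" for s in stems],
    )


# Example with made-up file names (not real data)
r1, r2 = pair_fastqs(
    ["SRR0000001_1.fastq.gz", "SRR0000001_2.fastq.gz", "README.md"],
    "s3://quilt-example-bucket/ccle/20190225_PRJNA523380_SRR0000001",
)
print(r1)
print(r2)
```

Pairing by stem rather than by list position means a missing mate raises an error instead of silently shifting `fastq_2` entries against the wrong sample.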

In [None]:
# create sample sheet
sample_sheet = pd.DataFrame({"sample": samples, "fastq_1": fastq1, "fastq_2": fastq2})
sample_sheet.insert(sample_sheet.shape[1], "strandedness", "auto")

In [None]:
# save the sample sheet locally; it could also be written to a location on S3
outpath = "~/data/nfcore"
sample_sheet.to_csv(outpath + "/" + batch.replace("ccle/", "") + "__samplesheet.csv", index=False)
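
Before launching a pipeline run, it can be worth checking that the written sample sheet has the expected shape. The sketch below uses only the standard library and a made-up in-memory CSV. `check_sample_sheet` is a hypothetical helper, and it assumes the four columns built in this tutorial (`sample`, `fastq_1`, `fastq_2`, `strandedness`) with paired-end reads, so an empty `fastq_2` is treated as an error here.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]


def check_sample_sheet(handle):
    """Return the parsed rows, raising ValueError on missing columns or empty fields."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = list(reader)
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for col in ("sample", "fastq_1", "fastq_2"):
            if not row[col]:
                raise ValueError(f"line {line_no}: empty {col}")
    return rows


# Made-up sample sheet content, for illustration only
example_csv = io.StringIO(
    "sample,fastq_1,fastq_2,strandedness\n"
    "CELLLINE__SRR0000001,"
    "s3://quilt-example-bucket/ccle/20190225_PRJNA523380_SRR0000001/SRR0000001_1.fastq.gz,"
    "s3://quilt-example-bucket/ccle/20190225_PRJNA523380_SRR0000001/SRR0000001_2.fastq.gz,"
    "auto\n"
)
rows = check_sample_sheet(example_csv)
print(len(rows), rows[0]["sample"])
```

To check the file written above, the same function could be called with `open(path, newline="")`, where `path` is the fully expanded location of the CSV.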

In [None]:
# upload the sample sheet to a quilt package
# this quilt package will also house the nextflow pipeline outputs

# define pkg name
pname = "ccle/20190225_PRJNA523380_nfcore_rnaseq"

# create the quilt package
p = quilt3.Package()

# stage sample sheet in pkg
p.set(batch.replace("ccle/", "") + "__samplesheet.csv",
      outpath + "/" + batch.replace("ccle/", "") + "__samplesheet.csv"
     )

# push sample sheet to bucket 
p.push(pname, registry=registry, message='upload sample sheet for batch')

# 2. Run `nf-core/rnaseq` with `nf-quilt`

Now that the input sample sheet is prepared, we can run the Nextflow pipeline with the `nf-quilt` plugin, which packages pipeline outputs automatically and appends detailed pipeline run parameters as metadata.

## 2.1 Seqera Tower


Below are the three places in the Seqera Tower Launchpad where a Nextflow run needs to be modified to enable packaging outputs with the `nf-quilt` plugin.


**`Pipeline parameters`**

- Ensure `--input` points to the sample sheet created above; it accepts either an S3 path or a Quilt package URI
- Ensure `--outdir` is a Quilt package URI beginning with `quilt+s3://`

```json
{"input":"s3://quilt-example-bucket/ccle/20190225_PRJNA523380_nfcore_rnaseq/20190225_PRJNA523380__samplesheet.csv",
"outdir":"quilt+s3://quilt-example-bucket#package=ccle/20190225_PRJNA523380_nfcore_rnaseq",
...
}
```
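
If the registry and package name are already defined in Python, as in section 1, the two parameters can be derived from them instead of being typed by hand. This is a minimal sketch: `quilt_package_uri` is a hypothetical helper written for this tutorial, not a `quilt3` or `nf-quilt` function, and it only covers the `quilt+s3://<bucket>#package=<name>` form shown above.

```python
import json


def quilt_package_uri(registry, package):
    """Build a `quilt+s3://<bucket>#package=<name>` URI from an s3:// registry."""
    if not registry.startswith("s3://"):
        raise ValueError(f"expected an s3:// registry, got {registry!r}")
    bucket = registry[len("s3://"):].strip("/")
    return f"quilt+s3://{bucket}#package={package}"


# Values from section 1 of this tutorial
registry = "s3://quilt-example-bucket"
pname = "ccle/20190225_PRJNA523380_nfcore_rnaseq"
sheet = "20190225_PRJNA523380__samplesheet.csv"

params = {
    "input": f"{registry}/{pname}/{sheet}",        # S3 path to the pushed sample sheet
    "outdir": quilt_package_uri(registry, pname),  # package that will receive the outputs
}
print(json.dumps(params, indent=2))
```

The printed JSON contains only these two keys; any other pipeline parameters still need to be added in the Launchpad.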

**`Advanced options > Nextflow config file`**

- Add the `nf-quilt` plugin to the Nextflow config as shown below
- To pin a specific version, use `id 'nf-quilt@0.7.12'` instead; we recommend using the latest version

```groovy
plugins {
    id 'nf-quilt'
}
```

**`Advanced options > Pre-run script`**

- Paste the following text into the pre-run script field to install the `nf-quilt` plugin requirements for the run

```bash
yum install python3-pip -y
yum install git -y
pip3 install quilt3

```


... and that's it! You can run your Nextflow pipeline!

## 2.2 Command-line

As an alternative to Seqera Tower, you can launch Nextflow pipelines with the `nf-quilt` plugin directly from the command line. An example is shown below. As with Tower, make sure your output directory is a Quilt package URI, and enable `nf-quilt` with the `-plugins` option (the plugin version is optional; if it is omitted, the latest version is used).

```bash
nextflow run 'https://github.com/nf-core/rnaseq' \
	-r 3.14.0 \
	-profile docker \
	--input "s3://quilt-example-bucket/ccle/20190225_PRJNA523380_nfcore_rnaseq/20190225_PRJNA523380__samplesheet.csv" \
	--outdir "quilt+s3://quilt-example-bucket#package=ccle/20190225_PRJNA523380_nfcore_rnaseq" \
	-plugins nf-quilt@0.7.12
```
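
When runs are launched from a notebook or script rather than typed by hand, the same command can be assembled in Python. The sketch below only builds and prints the argument list. `nextflow_command` is a hypothetical helper, its defaults mirror the values used in this tutorial, and pipeline parameters are written with a double dash (`--input`, `--outdir`) while Nextflow's own options use a single dash.

```python
import shlex


def nextflow_command(samplesheet, outdir, revision="3.14.0",
                     profile="docker", plugin="nf-quilt"):
    """Return the argument list for an nf-core/rnaseq run that uses nf-quilt."""
    if not outdir.startswith("quilt+s3://"):
        raise ValueError("outdir must be a Quilt package URI (quilt+s3://...)")
    return [
        "nextflow", "run", "https://github.com/nf-core/rnaseq",
        "-r", revision,          # pipeline revision
        "-profile", profile,     # execution profile
        "--input", samplesheet,  # pipeline parameter: sample sheet location
        "--outdir", outdir,      # pipeline parameter: Quilt package URI
        "-plugins", plugin,      # Nextflow option: enable nf-quilt
    ]


cmd = nextflow_command(
    "s3://quilt-example-bucket/ccle/20190225_PRJNA523380_nfcore_rnaseq/"
    "20190225_PRJNA523380__samplesheet.csv",
    "quilt+s3://quilt-example-bucket#package=ccle/20190225_PRJNA523380_nfcore_rnaseq",
    plugin="nf-quilt@0.7.12",
)
print(shlex.join(cmd))
```

To actually launch the run, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where Nextflow is installed.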