# meripseq pipeline using Nextflow and AWS Batch

## Overview

<mark>This short tutorial demonstrates how to run an RNA-Seq workflow using a prokaryotic data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantitate gene expression. This tutorial uses a popular workflow manager called [Nextflow](https://www.nextflow.io) run via [AWS Batch](https://aws.amazon.com/batch/). If you completed the other tutorials in this repo, you will see that it is similar to Tutorial 2, but instead of running Snakemake locally, we switch to Nextflow and run it using Batch.</mark>

AWS Batch will create the needed permissions, roles, and resources to run Nextflow in a serverless manner. You can set up AWS Batch manually or deploy it **automatically** with a stack template. The Launch Stack button below opens the CloudFormation Create Stack page with the template and its required resources already linked.

If you prefer to skip manual deployment and deploy automatically in the cloud, click the Launch Stack button below. For a walkthrough of the screens shown during automatic deployment, please click [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToLaunchAWSBatch.md). Deployment should take about 5 minutes, after which the resources are ready for use.

[![Launch Stack](images/LaunchStack.jpg)](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=aws-batch-nigms&templateURL=https://nigms-sandbox.s3.us-east-1.amazonaws.com/cf-templates/AWSBatch_template.yaml )

Before beginning this tutorial, if you do not have the required roles, policies, permissions, or compute environment and would like to set them up **manually**, please click [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/AWS-Batch-Setup.md).


## Prerequisites
#### Python requirements
+ Python >= 3.8

#### AWS requirements
+ Please ensure you have a VPC, subnets, and security group set up before running this tutorial.
+ Role with AdministratorAccess, AmazonSageMakerFullAccess, S3 access and AWSBatchServiceRole.
+ Instance Role with AmazonECS_FullAccess, AmazonEC2ContainerRegistryFullAccess, and S3 access.
+ If you do not have the required set-up for AWS Batch please follow this tutorial [here](https://github.com/STRIDES/NIHCloudLabAWS/blob/zbyosufzai-awsbatch-1/notebooks/AWSBatch/Intro_AWS_Batch.ipynb#install_nextflow).
+ ***When making the instance role, make another role for SageMaker notebooks with the following permissions: AdministratorAccess, AmazonEC2ContainerRegistryFullAccess, AmazonECS_FullAccess, AmazonS3FullAccess, AmazonSageMakerFullAccess, and AWSBatchServiceRole.***
+ It is recommended that permissions to specific folders be added through an inline policy. An example of the JSON is below:

<pre>
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSageMakerS3Access",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:CreateBucket"
            ],
            "Resource": [
                "arn:aws:s3:::batch-bucket",
                "arn:aws:s3:::batch-bucket/*",
                "arn:aws:s3:::nigms-sandbox-healthomics",
                "arn:aws:s3:::nigms-sandbox-healthomics/*",
                "arn:aws:s3:::ngi-igenomes",
                "arn:aws:s3:::ngi-igenomes/*"
            ]
        }
    ]
}
</pre>
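If the bucket list differs between accounts, the policy above can be generated instead of edited by hand. The following is a minimal sketch, not part of the original notebook: `build_s3_policy` is a hypothetical helper name, while the actions, `Sid`, and bucket names are copied from the example.

```python
import json

# Actions copied from the example policy above.
S3_ACTIONS = [
    "s3:GetObject",
    "s3:PutObject",
    "s3:ListBucket",
    "s3:GetBucketLocation",
    "s3:CreateBucket",
]


def build_s3_policy(bucket_names):
    """Return an inline-policy dict granting S3_ACTIONS on each bucket and its objects."""
    resources = []
    for name in bucket_names:
        resources.append(f"arn:aws:s3:::{name}")    # the bucket itself
        resources.append(f"arn:aws:s3:::{name}/*")  # every object inside it
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowSageMakerS3Access",
                "Effect": "Allow",
                "Action": list(S3_ACTIONS),
                "Resource": resources,
            }
        ],
    }


policy = build_s3_policy(["batch-bucket", "nigms-sandbox-healthomics", "ngi-igenomes"])
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the inline policy editor; replace `batch-bucket` with the name of your own bucket.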
For AWS bucket naming conventions, please click [here](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html).
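A bucket name can be checked locally before any AWS call is made. The sketch below covers only a subset of the published rules (length of 3 to 63 characters, lowercase letters, digits, dots and hyphens, starting and ending with a letter or digit, no adjacent dots, not shaped like an IP address). The function name `looks_like_valid_bucket_name` is hypothetical, and the linked page remains the authoritative list.

```python
import re

# 3-63 characters; lowercase letters, digits, dots, hyphens;
# must start and end with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
_IPV4_LIKE_RE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def looks_like_valid_bucket_name(name):
    """Check a subset of the S3 general-purpose bucket naming rules."""
    if not _BUCKET_RE.fullmatch(name):
        return False
    if ".." in name:                   # adjacent periods are not allowed
        return False
    if _IPV4_LIKE_RE.fullmatch(name):  # must not be formatted like an IP address
        return False
    return True


for candidate in ["aws-batch-nigms-batch-bucket-123456789012", "Batch_Bucket", "ab"]:
    print(candidate, looks_like_valid_bucket_name(candidate))
```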

### Step 1. Install required dependencies, update paths and create a new S3 Bucket to store input and output files (if needed)


In [None]:
import boto3
# Get account ID and region 
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name
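`boto3.session.Session().region_name` returns `None` when no default region is configured, and that `None` would then flow into the later `sed` and `nextflow` cells. A small guard, sketched here with the hypothetical name `require_region`, makes the failure visible at this step instead:

```python
def require_region(region_name):
    """Return region_name unchanged, or raise if no default region is configured."""
    if not region_name:
        raise ValueError(
            "No default AWS region found; set one with `aws configure` "
            "or the AWS_DEFAULT_REGION environment variable."
        )
    return region_name


# In the notebook this would wrap the call in the cell above, e.g.
# region = require_region(boto3.session.Session().region_name)
print(require_region("us-east-1"))
```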

In [None]:
# Set variable names 
# These variables should come from the Intro AWS Batch tutorial (or leave as-is if using the launch stack button)
BUCKET_NAME = "aws-batch-nigms-batch-bucket-" + account_id
INPUT_FOLDER = 'nigms-sandbox/ovarian-cancer-example-fastqs'
AWS_QUEUE = 'aws-batch-nigms-JobQueue'
AWS_REGION = region
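The step title mentions creating a new S3 bucket if needed, but no cell does so. The sketch below shows one way this could be done with the boto3 S3 client calls `list_buckets` and `create_bucket`. The helper names are hypothetical, and the client is passed in as an argument so the functions can be exercised without credentials.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build the keyword arguments for the S3 client's create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def ensure_bucket(s3_client, bucket_name, region):
    """Create bucket_name unless this account already owns it; return True if created."""
    owned = {bucket["Name"] for bucket in s3_client.list_buckets()["Buckets"]}
    if bucket_name in owned:
        return False
    s3_client.create_bucket(**create_bucket_kwargs(bucket_name, region))
    return True


# Example call (needs AWS credentials, so it is left commented out):
# import boto3
# ensure_bucket(boto3.client("s3", region_name=AWS_REGION), BUCKET_NAME, AWS_REGION)
```

If you deployed with the Launch Stack button, the bucket should already exist and `ensure_bucket` would return `False`.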

In [None]:
# Install Nextflow
! mamba install -y -c conda-forge -c bioconda nextflow --quiet

In [None]:
# Install Java, which Nextflow needs to run (skip this cell if Java is already installed)
# Adapted from https://github.com/STRIDES/NIHCloudLabAWS/blob/zbyosufzai-awsbatch-1/notebooks/AWSBatch/Intro_AWS_Batch.ipynb#install_nextflow
! sudo apt update
! sudo apt-get install default-jdk -y
! java -version
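Before reinstalling anything, it can help to check which command-line tools are already on `PATH`. A minimal sketch using the standard library's `shutil.which`, with the hypothetical helper name `missing_tools`:

```python
import shutil


def missing_tools(names):
    """Return the subset of names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# An empty list means both tools are available and the install cell can be skipped.
print(missing_tools(["java", "aws"]))
```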

In [None]:
# Install Nextflow, make it executable, and update it
! curl https://get.nextflow.io | bash
! chmod +x nextflow
! ./nextflow self-update
! ./nextflow plugin update

In [None]:
# replace batch bucket name in nextflow configuration file
! sed -i "s/aws-batch-nigms-batch-bucket-/$BUCKET_NAME/g" nf-meripseq-aws-batch/nextflow.config

In [None]:
# replace job queue name in configuration file 
! sed -i "s/aws-batch-nigms-JobQueue/$AWS_QUEUE/g" nf-meripseq-aws-batch/nextflow.config

In [None]:
# replace the region placeholder with your region
! sed -i "s/aws-region/$AWS_REGION/g" nf-meripseq-aws-batch/nextflow.config
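With the default `BUCKET_NAME`, re-running the bucket `sed` cell matches the placeholder prefix a second time and appends the account ID again. The sketch below performs the same three substitutions in Python so that they can be applied repeatedly. It assumes the config contains exactly the placeholder strings used in the `sed` commands above, and `fill_placeholders` is a hypothetical name.

```python
import re


def fill_placeholders(text, bucket, queue, region):
    """Apply the three config substitutions; safe to run more than once."""
    # Match the placeholder prefix plus any account ID already appended to it.
    text = re.sub(r"aws-batch-nigms-batch-bucket-\d*", lambda _match: bucket, text)
    text = text.replace("aws-batch-nigms-JobQueue", queue)
    text = text.replace("aws-region", region)
    return text


# Hypothetical config fragment, for illustration only.
sample = (
    "outdir = 's3://aws-batch-nigms-batch-bucket-/nextflow_output/'\n"
    "queue = 'aws-batch-nigms-JobQueue'\n"
    "region = 'aws-region'\n"
)
once = fill_placeholders(
    sample, "aws-batch-nigms-batch-bucket-123456789012", "aws-batch-nigms-JobQueue", "us-east-1"
)
print(once)

# To apply it to the real file:
# from pathlib import Path
# cfg = Path("nf-meripseq-aws-batch/nextflow.config")
# cfg.write_text(fill_placeholders(cfg.read_text(), BUCKET_NAME, AWS_QUEUE, AWS_REGION))
```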

### Step 2. Enable AWS Batch for the Nextflow script
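The launch command in this step mixes two kinds of options: single-dash options (`-profile`, `-c`) are handled by Nextflow itself, while double-dash options (`--input`, `--awsqueue`, `--awsregion`) are passed to the pipeline as parameters. The sketch below assembles the same command as an argument list, which avoids shell-quoting surprises. `nextflow_command` is a hypothetical helper, and the paths are the ones used in this notebook.

```python
import shlex


def nextflow_command(input_folder, queue, region, pipeline_dir="nf-meripseq-aws-batch"):
    """Build the argument list for the launch command used in this step."""
    return [
        "./nextflow", "run", f"{pipeline_dir}/main.nf",
        "--input", f"s3://{input_folder}/samplesheet.csv",  # pipeline parameter
        "-profile", "docker,awsbatch",                      # Nextflow option
        "-c", f"{pipeline_dir}/conf/add.config",            # extra config file
        "--awsqueue", queue,                                # pipeline parameter
        "--awsregion", region,                              # pipeline parameter
    ]


cmd = nextflow_command(
    "nigms-sandbox/ovarian-cancer-example-fastqs", "aws-batch-nigms-JobQueue", "us-east-1"
)
print(shlex.join(cmd))
# To launch from Python instead of a `!` cell: subprocess.run(cmd, check=True)
```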

In [4]:
# Run nextflow script with parameters 
! ./nextflow run nf-meripseq-aws-batch/main.nf --input s3://$INPUT_FOLDER/samplesheet.csv -profile docker,awsbatch -c nf-meripseq-aws-batch/conf/add.config --awsqueue $AWS_QUEUE --awsregion $AWS_REGION

Downloading nextflow dependencies. It may require a few seconds, please wait ..
 N E X T F L O W   ~  version 24.10.5

Launching `nf-meripseq-aws-batch/main.nf` [romantic_carson] DSL2 - revision: 975a2c9438

Downloading plugin nf-schema@2.3.0
Downloading plugin nf-amazon@2.9.2
Downloading plugin nf-wave@1.7.4
Input/output options
  input                     : s3://nigms-sandbox/ovarian-cancer-example-fastqs/samplesheet.csv
  read_length               : 37
  outdir                    : s3://aws-batch-nigms-batch-bucket-009160061346/nextflow_output/

Reference genome options
  fasta                     : s3://nigms-sandbox/ovarian-cancer-example-fastqs/chr11_1.5M.fasta
  gtf                       : s3://nigms-sandbox/ovarian-cancer-example-fastqs/gencode.v46.pri.chr11.1.5M.gtf

Alignment options

### Step 3. Explore Results

In [5]:
# List the output files written to the S3 bucket
! aws s3 ls s3://$BUCKET_NAME/nextflow_output/ --recursive | cut -c32-

nextflow_output/
nextflow_output/fastqc/
nextflow_output/fastqc/1850_input_1_fastqc.html
nextflow_output/fastqc/1850_input_1_fastqc.zip
nextflow_output/fastqc/1850_input_2_fastqc.html
nextflow_output/fastqc/1850_input_2_fastqc.zip
nextflow_output/fastqc/1850_m6A-IP_1_fastqc.html
nextflow_output/fastqc/1850_m6A-IP_1_fastqc.zip
nextflow_output/fastqc/1850_m6A-IP_2_fastqc.html
nextflow_output/fastqc/1850_m6A-IP_2_fastqc.zip
nextflow_output/fastqc/1917_input_1_fastqc.html
nextflow_output/fastqc/1917_input_1_fastqc.zip
nextflow_output/fastqc/1917_input_2_fastqc.html
nextflow_output/fastqc/1917_input_2_fastqc.zip
nextflow_output/fastqc/1917_m6A-IP_1_fastqc.html
nextflow_output/fastqc/1917_m6A-IP_1_fastqc.zip
nextflow_output/fastqc/1917_m6A-IP_2_fastqc.html
nextflow_output/fastqc/1917_m6A-IP_2_fastqc.zip
nextflow_output/fastqc/2005_input_1_fastqc.html
nextflow_output/fastqc/2005_input_1_fastqc.zip
nextflow_output/fastqc/2005_input_2_fastqc.html
nextflow_output/fastqc/2005_input_2_fastqc.zip
n
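A long recursive listing is easier to scan when summarised per output folder. The sketch below counts files under each first-level folder of `nextflow_output/`. `count_by_folder` is a hypothetical helper, and the sample keys are taken from the listing above.

```python
from collections import Counter


def count_by_folder(keys, prefix="nextflow_output/"):
    """Count files per first-level folder under prefix, ignoring folder marker keys."""
    counts = Counter()
    for key in keys:
        if key.endswith("/") or not key.startswith(prefix):
            continue  # skip folder markers and unrelated keys
        relative = key[len(prefix):]
        folder = relative.split("/", 1)[0] if "/" in relative else "."
        counts[folder] += 1
    return dict(counts)


sample_keys = [
    "nextflow_output/",
    "nextflow_output/fastqc/",
    "nextflow_output/fastqc/1850_input_1_fastqc.html",
    "nextflow_output/fastqc/1850_input_1_fastqc.zip",
]
print(count_by_folder(sample_keys))
```

In the notebook, the keys could come from capturing the `aws s3 ls` output, for example `keys = !aws s3 ls ... | cut -c32-`.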

In [None]:
# Copy output to local results folder (same outdir as if workflow was run locally)
! aws s3 sync s3://$BUCKET_NAME/nextflow_output/ meripseq-aws-batch-results/ --quiet
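Once the results are synced locally, the FastQC reports can be grouped by sample to confirm that every library has both reports. This sketch assumes the file naming seen in the listing (`<sample>_<library>_<n>_fastqc.html`) and treats the trailing `_1`/`_2` as the read-pair index, which is an inference from the names rather than something the pipeline documents here. `group_reports` is a hypothetical helper.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed pattern: <sample>_<library>_<1|2>_fastqc.html, e.g. 1850_m6A-IP_1_fastqc.html
REPORT_RE = re.compile(r"^(?P<sample>[^_]+)_(?P<library>.+)_(?P<read>[12])_fastqc\.html$")


def group_reports(filenames):
    """Map (sample, library) to the sorted list of read indices that have a report."""
    grouped = defaultdict(list)
    for name in filenames:
        match = REPORT_RE.match(name)
        if match:
            grouped[(match["sample"], match["library"])].append(int(match["read"]))
    return {key: sorted(reads) for key, reads in grouped.items()}


results_dir = Path("meripseq-aws-batch-results/fastqc")
names = [p.name for p in results_dir.glob("*_fastqc.html")] if results_dir.is_dir() else []
for (sample, library), reads in sorted(group_reports(names).items()):
    print(sample, library, reads)
```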