# Module 5. Running the MeRIP-seq Pipeline on AWS Batch with Nextflow

## Overview

In this module, we will run a MeRIP-seq (m6A-seq) pipeline using Nextflow on [AWS Batch](https://aws.amazon.com/batch/). If you completed the other tutorials in this repo, you will find it similar to Submodule 4. Instead of running the Nextflow pipeline locally in the SageMaker notebook, we submit the work to AWS Batch, which enables scalable, reproducible, cloud-based analysis with minimal infrastructure management.

We will build upon concepts introduced in previous modules:
- Submodules 1–3: Local data processing and MeRIP-seq basics
- Submodule 4: Running the pipeline with Nextflow on a local machine
- **Submodule 5**: Transitioning the workflow to **AWS Batch**

#### About AWS Batch
AWS Batch provisions and manages the compute resources that run the Nextflow jobs, so you do not have to manage servers yourself. You can set up the required permissions, roles, and resources manually or deploy them **automatically** with a stack template. Please see **Setting up AWS Batch** in the Get Started section below to learn more about how to use it.


## Prerequisites
#### Python requirements
+ Python >= 3.8

#### AWS requirements
+ Please ensure you have a VPC, subnets, and security group set up before running this tutorial.
+ Role with AdministratorAccess, AmazonSageMakerFullAccess, S3 access and AWSBatchServiceRole.
+ Instance Role with AmazonECS_FullAccess, AmazonEC2ContainerRegistryFullAccess, and S3 access.
+ If you do not have the required set-up for AWS Batch please follow this tutorial [here](https://github.com/STRIDES/NIHCloudLabAWS/blob/zbyosufzai-awsbatch-1/notebooks/AWSBatch/Intro_AWS_Batch.ipynb#install_nextflow).
+ ***When making the instance role, make another for SageMaker notebooks with the following permissions: AdministratorAccess, AmazonEC2ContainerRegistryFullAccess, AmazonECS_FullAccess, AmazonS3FullAccess, AmazonSageMakerFullAccess, and AWSBatchServiceRole.***
+ We recommend granting access to specific buckets and folders through an inline policy. An example of the JSON is below:

<pre>
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSageMakerS3Access",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:CreateBucket"
            ],
            "Resource": [
                "arn:aws:s3:::batch-bucket",
                "arn:aws:s3:::batch-bucket/*",
                "arn:aws:s3:::nigms-sandbox-healthomics",
                "arn:aws:s3:::nigms-sandbox-healthomics/*",
                "arn:aws:s3:::ngi-igenomes",
                "arn:aws:s3:::ngi-igenomes/*"
            ]
        }
    ]
}
</pre>
For AWS bucket naming conventions, please click [here](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html).
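
If you need the same inline policy for a different set of buckets, it can be generated rather than edited by hand. The sketch below is only an illustration: the helper name `build_s3_inline_policy` is hypothetical, and the bucket names are the placeholders from the JSON above, which you would replace with your own.

```python
import json

def build_s3_inline_policy(bucket_names, sid="AllowSageMakerS3Access"):
    """Return an IAM policy document granting the tutorial's S3 actions on each bucket."""
    resources = []
    for name in bucket_names:
        # Bucket-level ARN (needed for actions such as s3:ListBucket)
        resources.append(f"arn:aws:s3:::{name}")
        # Object-level ARN (needed for s3:GetObject and s3:PutObject)
        resources.append(f"arn:aws:s3:::{name}/*")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": sid,
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                    "s3:CreateBucket",
                ],
                "Resource": resources,
            }
        ],
    }

# Placeholder bucket names taken from the example policy above
policy = build_s3_inline_policy(
    ["batch-bucket", "nigms-sandbox-healthomics", "ngi-igenomes"]
)
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the IAM console when adding an inline policy to the role.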



<div style="border: 1px solid #ffe69c; padding: 0px; border-radius: 4px;">
  <div style="background-color: #fff3cd; padding: 5px; font-weight: bold;">
    <i class="fas fa-exclamation-triangle" style="color: #664d03;margin-right: 5px;"></i><a style="color: #664d03">Before using AWS Batch </a>
  </div>
  <p style="margin-left: 5px;">
Before beginning this tutorial, if you do not have the required roles, policies, permissions, or compute environment and would like to set them up <b>manually</b>, please follow the instructions <a href="https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/AWS-Batch-Setup.md">here</a>.
  </p>
</div>

## Get Started
### Step 0. Setting up AWS Batch
AWS Batch manages the provisioning of compute environments (EC2, Fargate), container orchestration, job queues, IAM roles, and permissions. We can deploy a full environment either:
- Automatically using a preconfigured AWS CloudFormation stack (**recommended**)
- Manually by setting up roles, queues, and buckets

To deploy automatically, click the **Launch Stack** button below. It opens the AWS CloudFormation "Create stack" page with the template for the required resources already linked. For a walkthrough of the screens during automatic deployment, please click [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToLaunchAWSBatch.md). The deployment should take about 5 minutes, after which the resources will be ready for use.

[![Launch Stack](images/LaunchStack.jpg)](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=aws-batch-nigms&templateURL=https://nigms-sandbox.s3.us-east-1.amazonaws.com/cf-templates/AWSBatch_template.yaml )
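
Before moving on, it can help to confirm that the stack has finished deploying. The following is a minimal sketch, assuming a boto3 CloudFormation client is passed in; the helper name `stack_is_ready` is hypothetical, and `aws-batch-nigms` is simply the default stack name from the Launch Stack link.

```python
def stack_is_ready(cfn_client, stack_name):
    """Return True once the stack reports a completed create or update."""
    response = cfn_client.describe_stacks(StackName=stack_name)
    status = response["Stacks"][0]["StackStatus"]
    return status in {"CREATE_COMPLETE", "UPDATE_COMPLETE"}

# Usage in the notebook (requires AWS credentials):
#   import boto3
#   print(stack_is_ready(boto3.client("cloudformation"), "aws-batch-nigms"))
```

The client is passed in as an argument so the check can be reused with whatever session or region the notebook is configured for.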

### Step 1. Install dependencies, update paths and create a new S3 Bucket to store input and output files
After setting up the AWS CloudFormation stack, we need to tell the Nextflow workflow where those resources are by providing the configuration below:
<div style="border: 1px solid #e57373; padding: 0px; border-radius: 4px;">
  <div style="background-color: #ffcdd2; padding: 5px; ">
    <i class="fas fa-exclamation-triangle" style="color: #b71c1c;margin-right: 5px;"></i><a style="color: #b71c1c"><b>Important</b> - Customize Required</a>
  </div>
  <p style="margin-left: 5px;">
Replace the <b>stack name</b> with the name of the stack that you just created. <code>STACK_NAME = "your-stack-name-here"</code>
  </p>
</div>

In [None]:
# Define a stack name variable
STACK_NAME = "aws-batch-nigms-test1"

In [None]:
import boto3
# Get account ID and region 
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

In [None]:
# Set variable names 
# These variables should come from the Intro AWS Batch tutorial (or leave as-is if using the launch stack button)
BUCKET_NAME = f"{STACK_NAME}-batch-bucket-{account_id}"
AWS_QUEUE = f"{STACK_NAME}-JobQueue"
INPUT_FOLDER = 'nigms-sandbox/ovarian-cancer-example-fastqs'
AWS_REGION = region
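
Because `BUCKET_NAME` is assembled from the stack name and account ID, a long or mixed-case stack name can produce a name that S3 rejects. The sketch below checks only a subset of the S3 naming rules (length, allowed characters, no adjacent periods, not formatted like an IP address); the function name and the example account ID are hypothetical.

```python
import re

# 3-63 characters; lowercase letters, digits, dots, hyphens; starts and ends alphanumeric
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
_IPV4_RE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")

def looks_like_valid_bucket_name(name):
    """Check a bucket name against a subset of the S3 naming rules."""
    if not _BUCKET_RE.fullmatch(name):
        return False
    if ".." in name:                # adjacent periods are not allowed
        return False
    if _IPV4_RE.fullmatch(name):    # must not be formatted like an IP address
        return False
    return True

# Hypothetical account ID, same naming pattern as BUCKET_NAME above
example_name = "aws-batch-nigms-test1" + "-batch-bucket-" + "123456789012"
print(example_name, looks_like_valid_bucket_name(example_name))
```

With this naming pattern, the `-batch-bucket-` suffix plus a 12-digit account ID takes 26 characters, so a stack name longer than 37 characters would exceed the 63-character limit.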

#### Install dependencies

In [None]:
# Install Nextflow
! mamba install -y -c conda-forge -c bioconda nextflow --quiet

<details>
<summary>Install Java and Nextflow if needed in other systems</summary>
If you are using a system other than an AWS SageMaker notebook, you might need to install Java and Nextflow using the code below:
<br> <i># Install java</i><pre>
    sudo apt update
    sudo apt-get install default-jdk -y
    java -version
    </pre>
    <i># Install Nextflow</i><pre>
    curl https://get.nextflow.io | bash
    chmod +x nextflow
    ./nextflow self-update
    ./nextflow plugin update
    </pre>
</details>

#### Create additional .config file needed

In [None]:
# copy the aws batch configuration file 
! cp nf-meripseq/conf/aws_batch_template.config aws_batch_submodule5.config 
# replace batch bucket name in nextflow configuration file
! sed -i "s/aws-batch-nigms-batch-bucket-/$BUCKET_NAME/g" aws_batch_submodule5.config 
# replace job queue name in configuration file 
! sed -i "s/aws-batch-nigms-JobQueue/$AWS_QUEUE/g" aws_batch_submodule5.config 
# replace the region placeholder with your region
! sed -i "s/us-east-1/$AWS_REGION/g" aws_batch_submodule5.config 
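
For readers less familiar with `sed`, the three substitutions above amount to plain string replacement. The sketch below shows the same logic in Python on a made-up template; the config keys and values in `template` are hypothetical stand-ins, not the actual contents of `aws_batch_template.config`.

```python
def fill_batch_config(template_text, bucket_name, queue_name, region):
    """Apply the same three placeholder substitutions as the sed commands."""
    replacements = [
        ("aws-batch-nigms-batch-bucket-", bucket_name),  # bucket placeholder
        ("aws-batch-nigms-JobQueue", queue_name),        # job queue placeholder
        ("us-east-1", region),                           # region placeholder
    ]
    for placeholder, value in replacements:
        template_text = template_text.replace(placeholder, value)
    return template_text

# Hypothetical template lines, for illustration only
template = (
    "workDir = 's3://aws-batch-nigms-batch-bucket-/work'\n"
    "process.queue = 'aws-batch-nigms-JobQueue'\n"
    "aws.region = 'us-east-1'\n"
)

print(fill_batch_config(
    template,
    bucket_name="my-stack-batch-bucket-123456789012",
    queue_name="my-stack-JobQueue",
    region="us-west-2",
))
```

As with `sed ... /g`, every occurrence of each placeholder is replaced, so it is worth opening `aws_batch_submodule5.config` afterwards to confirm the values look right.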

### Step 2. Enable AWS Batch for the Nextflow script

In [None]:
# Run nextflow script with parameters 
! nextflow run nf-meripseq -profile docker,awsbatch \
    --input s3://$INPUT_FOLDER/samplesheet.csv \
    --fasta s3://$INPUT_FOLDER/chr11_1.5M.fasta \
    --gtf s3://$INPUT_FOLDER/gencode.v46.pri.chr11.1.5M.gtf \
    --genome hg38 \
    --read_length 37 \
    --contrast "omental_tumor_vs_normal_Fallopian_tube" \
    -c aws_batch_submodule5.config \
    -resume
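
While the pipeline runs, its processes are submitted as jobs to the queue named in `AWS_QUEUE`. As a rough sketch that is not part of the original notebook, the helper below counts jobs per status using the boto3 Batch `list_jobs` call; the function name `count_jobs_by_status` and the choice of statuses are assumptions.

```python
def count_jobs_by_status(batch_client, job_queue,
                         statuses=("RUNNABLE", "RUNNING", "SUCCEEDED", "FAILED")):
    """Return a dict mapping each job status to the number of jobs in the queue."""
    counts = {}
    for status in statuses:
        total = 0
        kwargs = {"jobQueue": job_queue, "jobStatus": status}
        while True:
            page = batch_client.list_jobs(**kwargs)
            total += len(page.get("jobSummaryList", []))
            token = page.get("nextToken")
            if not token:
                break
            # Request the next page of results for this status
            kwargs["nextToken"] = token
        counts[status] = total
    return counts

# Usage in the notebook (requires AWS credentials):
#   import boto3
#   print(count_jobs_by_status(boto3.client("batch", region_name=AWS_REGION), AWS_QUEUE))
```

Jobs that stay in `RUNNABLE` for a long time often point to a compute environment that cannot start instances, which is worth checking in the AWS Batch console.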

### Step 3. Explore Results

In [None]:
# View output files that were output to S3 bucket
! aws s3 ls s3://$BUCKET_NAME/nextflow_output/ --recursive | cut -c32-
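
The `cut -c32-` in the command above drops the date, time, and size columns so that only object keys are shown. The sketch below does the same thing in Python, assuming the CLI's usual layout of a 19-character timestamp, a space, a size right-aligned to 10 characters, a space, and then the key. The sample line and file path are made up for illustration.

```python
def key_from_s3_ls_line(line):
    """Extract the object key from one line of `aws s3 ls --recursive` output."""
    # Columns are date, time, size, key; maxsplit=3 keeps spaces inside the key intact
    parts = line.rstrip("\n").split(None, 3)
    return parts[3] if len(parts) == 4 else None

# Build a sample line in the assumed layout (hypothetical file path)
sample_line = (
    "2024-05-01 12:34:56" + " " + "1024".rjust(10) + " "
    + "nextflow_output/multiqc/report.html"
)

print(key_from_s3_ls_line(sample_line))
print(sample_line[31:])  # equivalent of `cut -c32-`
```

Splitting on whitespace is slightly more forgiving than a fixed column offset, since it still works if the size column is wider than 10 digits.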

In [None]:
# Copy output to local results folder (same outdir as if workflow was run locally)
#! aws s3 sync s3://$BUCKET_NAME/nextflow_output/ meripseq-aws-batch-results/ --quiet