# miRNASeq Pipeline

## Introduction

This notebook provides steps to run the primary analysis of miRNA-Seq data on an existing AWS CFNCluster.

<div class="alert alert-info">

Before running this notebook, ensure:

* You are running it on a Linux or macOS platform
* You have installed the `paramiko` and `scp` packages in the environment where it is running
* You have an existing AWS CFN cluster set up, and have the IP address and private key pair file for this cluster

</div>

<div class="alert alert-warning">

## Design File Creation

<h4>Analyst note:</h4>
This pipeline cannot be run without a project-specific design file created by the analyst.  Create such a file for this project, following the guidance shown below (excerpted from the cirrus-ngs README):

The design file is a tab-separated text file that specifies the names of the sequence files to process. It has no header line. For miRNA-Seq, it must contain two columns, but only the first column is actively used by the pipeline. The **first column** is the filename of the sample (including its extensions, e.g. `fastq.gz`), as shown below in examples modified from the cirrus-ngs README:
>
>For example, a two-column design file for two single-end-sequenced samples named `mysample1` and `mysample2` might look like:

```
	mysample1.fastq.gz		not_applicable
	mysample2.fastq.gz		not_applicable
```

>If the sequencing data is paired-end, the first column includes the name of the forward sequencing file, followed by a comma, followed by the name of the reverse sequencing file (note that there must **not** be any spaces between these two file names--only a comma!)  An example two-column design file for two paired-end-sequenced samples named `mysample1` and `mysample2` might look like:

```
	mysample1_fwd.fastq.gz,mysample1_rev.fastq.gz		not_applicable
	mysample2_fwd.fastq.gz,mysample2_rev.fastq.gz		not_applicable
```

Note that the **second column** must have content but is not actively used by the pipeline, so best practice is to set all values of that column to "not_applicable".

Once you have created the design file, input its path into the Input Parameters below.

</div>
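
The design file format described above can also be generated programmatically. The following is a minimal sketch, not part of cirrus-ngs: the helper name `write_design_file` and its parameters are hypothetical, and it assumes that a single tab between the two columns and one line per sample is all the pipeline requires.

```python
def write_design_file(path, samples, placeholder="not_applicable"):
    """Write a two-column, tab-separated design file with no header line.

    Each entry in `samples` is either one file name (single-end data)
    or a (forward, reverse) pair of file names (paired-end data).
    """
    lines = []
    for sample in samples:
        if isinstance(sample, str):
            first_column = sample
        else:
            forward, reverse = sample
            # Paired-end files are joined by a comma with no spaces
            first_column = "{},{}".format(forward, reverse)
        if any(ch.isspace() for ch in first_column):
            raise ValueError("File names must not contain whitespace: {!r}".format(first_column))
        # The second column is unused for miRNA-Seq but must have content
        lines.append("{}\t{}\n".format(first_column, placeholder))
    with open(path, "w") as handle:
        handle.writelines(lines)
```

For example, `write_design_file("my_design.txt", ["mysample1.fastq.gz", "mysample2.fastq.gz"])` would produce a file equivalent to the single-end example above (the output file name here is only illustrative).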



## Input Parameters Set-Up

*Project settings*

**project_name**: a name for your project, containing no whitespace.

**design_file**: the path to your design file for this project. Please see the README for the format specification for the design files. 

**genome**: the name of the reference genome against which to align during this project; note that currently only "human" and "mouse" are supported.

**s3_input_files_address**: the s3 path to the *directory* in which your input fastq files are found. All fastq files must be in the same s3 bucket.

**s3_output_files_address**: the s3 path to the *directory* in which the outputs from your project should be uploaded. This will only be the root directory; please see the README for details of how outputs are structured.

*Cluster settings*

**your_cluster_name**: the name of your cluster (assigned when it was created using cfncluster). 

**private_key**: the path to the private key pair file needed to access your cluster.

<div class="alert alert-warning">
<h4>Analyst note:</h4>
The values in the cell below are example settings, and <strong>MUST</strong> be replaced with appropriate values for your cluster and project.
</div>

In [None]:
## A project name: no whitespace allowed
project_name = "20190101_testuser_mymirnaseq"

# Path to the design file
design_file = "../../data/cirrus-ngs/mirna_test_design.txt"

# Genome: currently available genomes are human and mouse
genome = "human"

# S3 input and output addresses.
# Notice: DO NOT put a forward slash at the end of these s3 addresses.
s3_input_files_address = "s3://path/to/fastq"
s3_output_files_address = "s3://path/to/output"
    
# CFNCluster name
your_cluster_name = "myclustername"

# Private key pair file for accessing the cluster
private_key = "/path/to/your_aws_key.pem"

print("Variables set.")
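
Because mistakes in these settings tend to surface only after jobs have been submitted, it may be worth checking them locally first. The sketch below is not part of cirrus-ngs; the function name `find_setting_problems` is hypothetical, and the checks simply encode the constraints stated in this notebook (no whitespace in the project name, "human" or "mouse" as the genome, no trailing slash on the s3 addresses). The `s3://` prefix check is an added assumption.

```python
import os

# Genomes this notebook describes as currently supported
SUPPORTED_GENOMES = {"human", "mouse"}

def find_setting_problems(project_name, design_file, genome,
                          s3_input_files_address, s3_output_files_address,
                          private_key):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if not project_name or any(ch.isspace() for ch in project_name):
        problems.append("project_name must be non-empty and contain no whitespace")
    if genome not in SUPPORTED_GENOMES:
        problems.append("genome must be one of {}".format(sorted(SUPPORTED_GENOMES)))
    addresses = (("s3_input_files_address", s3_input_files_address),
                 ("s3_output_files_address", s3_output_files_address))
    for label, address in addresses:
        if not address.startswith("s3://"):
            problems.append("{} should start with s3://".format(label))
        if address.endswith("/"):
            problems.append("{} must not end with a forward slash".format(label))
    # Both files are read locally, so they should exist on this machine
    for label, path in (("design_file", design_file), ("private_key", private_key)):
        if not os.path.isfile(path):
            problems.append("{} not found at {}".format(label, path))
    return problems
```

In the notebook, this could be called directly after the settings cell as `find_setting_problems(project_name, design_file, genome, s3_input_files_address, s3_output_files_address, private_key)`, printing any returned messages before continuing.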

<div class="alert alert-warning">
<h4>Analyst note:</h4>
The values shown below are standard settings and <strong>SHOULD NOT</strong> be modified without a clear understanding of what change should be made and why it is necessary.
</div>

In [None]:
# location of cirrus-ngs src directory
# on local installation; may be an absolute or relative path.
# DO NOT put a forward slash at the end of the path.
source_dir_path = "../../src/cirrus_ngs"

# location to which multiqc report should be downloaded
import os
report_dir = os.getcwd().split("notebooks")[0] + "data"

# Pipeline name for miRNASeq pipeline;
# DO NOT CHANGE
pipeline = "SmallRNASeq"

# Workflow to run; currently this value MUST be "bowtie2"
# so DO NOT CHANGE
workflow = "bowtie2"

# Analysis steps: a set of strings that contains the analysis steps to run; 
# note that the order of the steps does not matter.
# For miRNASeq, options are "fastqc", "trim", "cut_adapt", "align_and_count", and "multiqc"
analysis_steps = {"fastqc", "trim", "cut_adapt", "align_and_count","multiqc"}
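
If the `analysis_steps` set is ever edited, a misspelled step name is easy to miss. The following sketch, which is not part of cirrus-ngs, compares a set of steps against the five step names listed in the comment above; the names `KNOWN_STEPS` and `unknown_steps` are hypothetical.

```python
# Step names listed in this notebook as valid for miRNASeq
KNOWN_STEPS = {"fastqc", "trim", "cut_adapt", "align_and_count", "multiqc"}

def unknown_steps(analysis_steps):
    """Return, in sorted order, any requested steps that are not recognized."""
    return sorted(set(analysis_steps) - KNOWN_STEPS)
```

An empty list from `unknown_steps(analysis_steps)` indicates that every requested step matches one of the documented names.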

## Pipeline Initiation

Connect to the CFN cluster:

In [None]:
import os
import sys
sys.path.append(source_dir_path)

from cfnCluster import CFNClusterManager, ConnectionManager
master_ip_address = CFNClusterManager.create_cfn_cluster(cluster_name=your_cluster_name)
ssh_client = ConnectionManager.connect_master(hostname=master_ip_address,
                                              username="ec2-user",
                                              private_key_file=private_key)

Execute the pipeline on the cluster:

In [None]:
from util import DesignFileLoader
from util import PipelineManager

reference = "hairpin_{}".format(genome)
print(reference)

sample_list, group_list, pair_list = DesignFileLoader.load_design_file(design_file)

## DO NOT edit: no user-serviceable settings here
PipelineManager.execute(pipeline, ssh_client, project_name, workflow, analysis_steps, s3_input_files_address,
                       sample_list, group_list, s3_output_files_address, reference, "NA", pair_list)

## Pipeline Status Checking

While the pipeline is running, it will not push information on its status to this notebook.  However, it is possible to pull information about the status of the pipeline at any time using the code below:

In [None]:
# Status check settings:

# Specify a step (one that is in the analysis_steps set) whose status should be checked,
# or set to "all" to see the status of all steps
step="all"

# View verbose status information by setting this value to True or 
# view abbreviated status information by setting it to False
verbose=True

In [None]:
import datetime
print(datetime.datetime.now())
PipelineManager.check_status(ssh_client, step, pipeline, workflow, project_name, analysis_steps,verbose=verbose)
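
Since status must be pulled rather than pushed, one option is to repeat the check at a fixed interval. The helper below is a minimal sketch under stated assumptions: `poll_status` is a hypothetical name, it takes any zero-argument callable, and it does not try to detect completion because this notebook does not say what, if anything, `PipelineManager.check_status` returns.

```python
import time

def poll_status(check, interval_seconds=300, max_checks=12, sleep=time.sleep):
    """Call `check()` up to `max_checks` times, pausing between calls.

    `sleep` is injectable so the pause can be replaced when testing.
    """
    for attempt in range(1, max_checks + 1):
        print("Status check {} of {}".format(attempt, max_checks))
        check()
        # No pause is needed after the final check
        if attempt < max_checks:
            sleep(interval_seconds)
```

In the notebook, the status call shown above could be wrapped as `poll_status(lambda: PipelineManager.check_status(ssh_client, step, pipeline, workflow, project_name, analysis_steps, verbose=verbose))`. The cell blocks while polling, so interrupting the kernel is the way to stop it early.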

#### Failure Handling
If the status information above indicates that steps have failed, examine the log files accessible from the cluster's master node to identify the issue.  To do this, gather the login information for the master node:

In [1]:
# Uncomment the lines below to display the login information
# print(private_key)
# print(master_ip_address)

With this information, `ssh` into the master node with a command of the following form:
    
    ssh -i <private_key> ec2-user@<master_ip_address>
    
Logs are stored in the `/shared/workspace/logs/` directory, under sub-folders for the pipeline, workflow, and project name.  `cd` to this directory with a command of the following form:

    cd /shared/workspace/logs/SmallRNASeq/bowtie2/<project_name>
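
The two commands above can be assembled from values already defined in this notebook. This is a small sketch rather than part of cirrus-ngs: the function names are hypothetical, and the log layout (`/shared/workspace/logs/<pipeline>/<workflow>/<project_name>`) is taken from the description above.

```python
import posixpath
import shlex

def ssh_command(private_key, master_ip_address, username="ec2-user"):
    """Build the ssh command for logging in to the master node."""
    # shlex.quote protects key paths that contain spaces or shell characters
    return "ssh -i {} {}@{}".format(shlex.quote(private_key), username, master_ip_address)

def log_directory(pipeline, workflow, project_name, logs_root="/shared/workspace/logs"):
    """Build the path of the project's log directory on the cluster."""
    # posixpath is used because the cluster path is always a Linux path
    return posixpath.join(logs_root, pipeline, workflow, project_name)
```

Calling `print(ssh_command(private_key, master_ip_address))` and `print(log_directory(pipeline, workflow, project_name))` in the notebook would give strings ready to paste into a terminal.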
    
    
If a sample's log files are empty, this indicates that the sample was not downloaded from s3 (by default, the aws s3 download command runs in quiet mode and does not generate log messages even on failure).  Double-check the file names in the design file to ensure they are correct and that the files are actually in the stated s3 location.
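
Empty log files can be located without opening each one. The sketch below is a hypothetical helper, not part of cirrus-ngs; it assumes Python 3 is available on the master node and that it is pointed at the project's log directory.

```python
import os

def find_empty_logs(log_dir):
    """Return the sorted paths of all zero-byte files under `log_dir`."""
    empty = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```

Any path returned by `find_empty_logs` points at a sample whose file name and s3 location are worth checking against the design file.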

Once the cause of the failures has been addressed, simply resubmit the jobs without clearing the logs or outputs; cirrus-ngs will automatically rerun the failed samples, while the samples which have outputs won’t be run again.  If it is necessary to redo the entire run from scratch, delete both the project folder in the logs and also the outputs before commencing the rerun to ensure that all existing state has been cleared.

## MultiQC Report Display

If the multiqc step has been **run and completed**, the code below can be used to view the multiqc report to assess the overall outcome of the primary analysis.

<div class="alert alert-info">
        
**NOTE** that report display will only work if the `aws` command line tool has been configured with AWS credentials (by running the `aws configure` command) in the environment in which the notebook is being executed. If it has not, a `fatal error: Unable to locate credentials` message will be displayed.
        
</div>

In [None]:
# Download the multiqc html file to local directory
!aws s3 cp $s3_output_files_address/$project_name/$workflow/multiqc_report.html $report_dir

In [None]:
from IPython.display import IFrame

IFrame(os.path.relpath("{}/multiqc_report.html".format(report_dir)), width="100%", height=1000)

## Appendix: Methods Documentation

Print out the settings and script for each step run in the pipeline and workflow:

In [None]:
from util import AddonsManager
AddonsManager.display_pipeline_workflow_settings_and_scripts(ssh_client, pipeline, workflow, analysis_steps)

Print out the software configuration file, including software and reference genome versions, etc.:

In [None]:
AddonsManager.display_software_config(ssh_client)