<div class="alert alert-warning">
    <strong>Analyst Note:</strong><br />
    Fill in the human-readable name of your project and the type of your data, such as:
    
   > # Dr. Doe Human Patient Time-Series RNASeq

</div>


# Primary Analysis

<div class="alert alert-warning">
    <strong>Analyst Note:</strong><br />
    Fill in the author attributions for your analysis, such as:
    
   > * Guorong Xu, CCBB (g1xu@ucsd.edu)
</div>


## Table of Contents
* [Background](#Background)
* [Introduction](#Introduction)
* [Design File Creation](#Design-File-Creation)
* [Parameters Input](#Parameters-Input)
* [Pipeline Initiation](#Pipeline-Initiation)
* [Pipeline Status Checking](#Pipeline-Status-Checking)
* [MultiQC Report Display](#MultiQC-Report-Display)
* [Archiving](#Archiving)
* [Appendix: Methods Documentation](#Appendix:-Methods-Documentation)


## Background

<div class="alert alert-warning">
    <strong>Analyst Note:</strong><br />
    Fill in background on the specific project here, such as 
   
   > The fastq data analyzed in this notebook were downloaded from the ftp://igm-storage1.ucsd.edu/190319_K00180_0772_AH2Y27BBXY_SR75_Combo/ link provided by Dr. Postdoc on 03/20/2019.
   
</div>

[Table of Contents](#Table-of-Contents)

## Introduction

This notebook provides the steps to run primary analysis of either RNASeq or miRNASeq data on an existing AWS CFN cluster.

<div class="alert alert-info">

Before running this notebook, ensure:

* You are running it on a Linux or macOS platform
* You have installed the `paramiko` and `scp` packages in the environment where it is running
* You have an existing AWS CFN cluster set up, and have the IP address and private key pair file for this cluster
* The master node of the cluster is running (not stopped) on AWS

</div>

[Table of Contents](#Table-of-Contents)

<div class="alert alert-warning">

## Design File Creation

<h4>Analyst note:</h4>
This pipeline cannot be run without a project-specific design file created by the analyst.  Create such a file for this project, following the guidance shown below (excerpted from the cirrus-ngs README):

The design file is a tab-separated text file that specifies the names of the sequence files to process.  It has no header line.  For RNASeq and miRNA-Seq, it must contain two columns but only the first column is actively used by the pipeline. The **first column** is the filename of the sample (with extensions: e.g. fastq.gz), as shown below in examples modified from the cirrus README:
>
>For example, a two-column design file for two single-end-sequenced samples named `mysample1` and `mysample2` might look like:

```
	mysample1.fastq.gz		not_applicable
	mysample2.fastq.gz		not_applicable
```

>If the sequencing data is paired-end, the first column includes the name of the forward sequencing file, followed by a comma, followed by the name of the reverse sequencing file (note that there must **not** be any spaces between these two file names--only a comma!)  An example two-column design file for two paired-end-sequenced samples named `mysample1` and `mysample2` might look like:

```
	mysample1_fwd.fastq.gz,mysample1_rev.fastq.gz		not_applicable
	mysample2_fwd.fastq.gz,mysample2_rev.fastq.gz		not_applicable
```

Note that the **second column** must have content but is not actively used by the pipeline, so best practice is to set all values of that column to "not_applicable".

Once you have created the design file, enter its path in the Parameters Input section below.

</div>
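
The format described above can also be written and checked programmatically. The following is a minimal sketch, not part of cirrus-ngs: the helper names `write_design_file` and `check_design_file` are hypothetical, and the sketch assumes exactly one tab between the two columns and no leading whitespace on each line.

```python
import os
import tempfile


def write_design_file(path, samples, placeholder="not_applicable"):
    """Write a two-column, tab-separated design file with no header line.

    Each entry of `samples` is either one file name (single-end) or a
    (forward, reverse) tuple of file names (paired-end).
    """
    with open(path, "w") as handle:
        for sample in samples:
            if isinstance(sample, str):
                first_column = sample
            else:
                # Paired-end: join the two names with a comma and NO spaces
                first_column = ",".join(sample)
            handle.write("{0}\t{1}\n".format(first_column, placeholder))


def check_design_file(path):
    """Return a list of human-readable problems found in a design file."""
    problems = []
    with open(path) as handle:
        for line_num, line in enumerate(handle, start=1):
            columns = line.rstrip("\n").split("\t")
            if len(columns) != 2:
                problems.append("line {0}: expected 2 columns, found {1}".format(
                    line_num, len(columns)))
                continue
            if " " in columns[0]:
                problems.append("line {0}: space in file name column".format(line_num))
            if not columns[1].strip():
                problems.append("line {0}: second column is empty".format(line_num))
    return problems


# Example: two paired-end samples written to a temporary location
example_path = os.path.join(tempfile.mkdtemp(), "example_design.txt")
write_design_file(example_path, [
    ("mysample1_fwd.fastq.gz", "mysample1_rev.fastq.gz"),
    ("mysample2_fwd.fastq.gz", "mysample2_rev.fastq.gz"),
])
print(check_design_file(example_path))
```

An empty list means no problems were found. This check only covers the layout of the file; it does not confirm that the named files exist in S3.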

[Table of Contents](#Table-of-Contents)

## Parameters Input

*Project settings*

**project_name**: a name for your project, containing no whitespace.

**design_file**: the path to your design file for this project. Please see the README for the format specification for the design files. 

**genome**: the name of the reference genome against which to align during this project. Only human and mouse are currently supported: use "human" or "mouse" for the SmallRNASeq pipeline, and "hg19" or "mm10" for the RNASeq pipeline.

**s3_input_files_address**: the s3 path to the *directory* in which your input fastq files are found. All fastq files must be in the same s3 bucket.

**s3_output_files_address**: the s3 path to the *directory* in which the outputs from your project should be uploaded. This will only be the root directory; please see the README for details of how outputs are structured.

*Cluster settings*

**your_cluster_name**: the name of your cluster (assigned when it was created using cfncluster). 

**private_key**: the path to the private key pair file needed to access your cluster.

<div class="alert alert-warning">
<h4>Analyst note:</h4>
The values in the cell below are example settings, and <strong>MUST</strong> be replaced with appropriate values for your cluster and project.
</div>

In [None]:
## A project name: no whitespace allowed
project_name = "20190101_testuser_mymirnaseq"

# Path to the design file
design_file = "../../data/cirrus-ngs/mirna_test_design.txt"

# Pipeline: choose either "SmallRNASeq" (for miRNASeq) or "RNASeq"
pipeline = "SmallRNASeq"

# IF you chose "RNASeq" as the pipeline above, 
# then set the genome to hg19 for human or mm10 for mouse
# (mm10 is only supported in star_rsem as of now).
# Conversely, if you chose "SmallRNASeq" as the pipeline above,
# then set the genome to "human" for human or "mouse" for mouse.
# Note that ONLY human and mouse are supported.
genome = "mouse"

# If and only if you chose "RNASeq" as the pipeline above, 
# then set the workflow value below to one of these three 
# choices: "star_gatk", "star_htseq", or "star_rsem".
# If you chose "SmallRNASeq" as the pipeline above,
# simply leave the workflow as None
workflow = None

# S3 input, output, and archive addresses.
# Notice: DO NOT put a forward slash at the end of these s3 addresses.
s3_input_files_address = "s3://path/to/fastq"
s3_output_files_address = "s3://path/to/output"
s3_archive_files_address = "s3://path/to/archive"

# CFNCluster name
your_cluster_name = "myclustername"

# private key pair file for accessing the new cluster
private_key = "/path/to/your_aws_key.pem"

print("Variables set.")
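
The constraints stated in the comments above (no whitespace in the project name, no trailing slash on the S3 addresses, and a genome name that matches the chosen pipeline) can be checked in code before anything is submitted. This is a minimal sketch: `check_settings` is a hypothetical helper that is not part of cirrus-ngs, and the values it accepts are taken only from the comments in the settings cell.

```python
def check_settings(project_name, pipeline, genome, workflow, s3_addresses):
    """Return a list of problems with the project settings (empty if none)."""
    # Genome names per pipeline, as described in the settings comments
    allowed_genomes = {
        "SmallRNASeq": {"human", "mouse"},
        "RNASeq": {"hg19", "mm10"},
    }
    rnaseq_workflows = {"star_gatk", "star_htseq", "star_rsem"}

    problems = []
    if any(char.isspace() for char in project_name):
        problems.append("project_name contains whitespace")
    if pipeline not in allowed_genomes:
        problems.append("unknown pipeline: {0}".format(pipeline))
    elif genome not in allowed_genomes[pipeline]:
        problems.append("genome {0!r} is not valid for {1}".format(genome, pipeline))
    if pipeline == "RNASeq" and workflow not in rnaseq_workflows:
        problems.append("unknown RNASeq workflow: {0}".format(workflow))
    if pipeline == "RNASeq" and genome == "mm10" and workflow != "star_rsem":
        problems.append("mm10 is only supported by the star_rsem workflow")
    for address in s3_addresses:
        if address.endswith("/"):
            problems.append("S3 address ends with a slash: {0}".format(address))
    return problems


print(check_settings("20190101_testuser_mymirnaseq", "SmallRNASeq", "mouse", None,
                     ["s3://path/to/fastq", "s3://path/to/output"]))
```

In this notebook, the call would take `project_name`, `pipeline`, `genome`, `workflow`, and a list of the three S3 address variables; an empty list means none of these checks failed.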

<div class="alert alert-warning">
<h4>Analyst note:</h4>
The values shown below are standard settings and <strong>SHOULD NOT</strong> be modified without a clear understanding of what change should be made and why it is necessary.
</div>

In [None]:
# location of cirrus-ngs src directory
# on local installation; may be an absolute or relative path.
# DO NOT put a forward slash at the end of the path.
source_dir_path = "../../src/cirrus_ngs"

# location to which the multiqc report should be downloaded
import os
report_dir = os.getcwd().split("notebooks")[0] + "data"

# Analysis steps: a set of strings that contains the analysis steps to run. 
# note that the order of the analysis steps does not matter; 
# the analysis_steps variable only specifies WHICH steps to run,
# and the pipeline/workflow itself specifies the order in which
# the chosen steps are run.
# IF AND ONLY IF you are a power user who wants to run/rerun just
# a subset of steps, then input the steps you want in the analysis_steps 
# variable below.  If you are a normal user who wants to run the full 
# pipeline, leave it as None.
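analysis_steps = None

# Default these to None so that an unrecognized pipeline/workflow
# combination reaches the ValueError below instead of a NameError
full_analysis_steps = None
reference = None

if pipeline == "SmallRNASeq":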
    # SmallRNASeq has only one approved workflow, so 
    # if that is the pipeline, set the workflow to the 
    # only approved one;
    workflow = "bowtie2"
    reference = "hairpin_{}".format(genome)
    full_analysis_steps = {"fastqc", "trim", "cut_adapt", "align_and_count", "merge_counts", "multiqc"}
elif pipeline == "RNASeq":
    reference = genome
    if workflow == "star_gatk":
        full_analysis_steps = {"fastqc", "trim", "align", "multiqc", "variant_calling"}
    elif workflow == "star_htseq":
        full_analysis_steps = {"fastqc", "trim", "align", "multiqc", "count", "merge_counts"}
    elif workflow == "star_rsem":
        full_analysis_steps = {"fastqc", "trim", "align_count", "multiqc", "merge_counts"}
analysis_steps = full_analysis_steps if analysis_steps is None else analysis_steps

# if analysis steps is STILL none after the code above, 
# then the pipeline workflow combination is unrecognized, so throw an error:
if analysis_steps is None:
    raise ValueError("Unrecognized pipeline/workflow combination: {0}/{1}".format(pipeline, workflow))

print("workflow: {0}".format(workflow))
print("reference: {0}".format(reference))
print("analysis_steps (not necessarily in run order): {0}".format(analysis_steps))
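
For power users who supply their own `analysis_steps`, it can help to confirm that every requested step belongs to the chosen workflow before submitting. A minimal sketch follows; `unknown_steps` is a hypothetical helper, not a cirrus-ngs function, and how cirrus-ngs itself treats unrecognized step names is not covered here.

```python
def unknown_steps(requested_steps, full_analysis_steps):
    """Return, sorted, the requested steps that the workflow does not define."""
    return sorted(set(requested_steps) - set(full_analysis_steps))


# Step names for the star_htseq workflow, copied from the cell above
star_htseq_steps = {"fastqc", "trim", "align", "multiqc", "count", "merge_counts"}

print(unknown_steps({"align", "count"}, star_htseq_steps))        # []
print(unknown_steps({"align", "align_count"}, star_htseq_steps))  # ['align_count']
```

In this notebook, the equivalent call would be `unknown_steps(analysis_steps, full_analysis_steps)`.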

[Table of Contents](#Table-of-Contents)

## Pipeline Initiation

Connect to the CFN cluster:

In [None]:
import os
import sys
sys.path.append(source_dir_path)

from cfnCluster import CFNClusterManager, ConnectionManager
master_ip_address = CFNClusterManager.create_cfn_cluster(cluster_name=your_cluster_name)
ssh_client = ConnectionManager.connect_master(hostname=master_ip_address,
                                              username="ec2-user",
                                              private_key_file=private_key)

Execute the pipeline on the cluster:

In [None]:
## DO NOT edit: no user-serviceable settings here
from util import DesignFileLoader
from util import PipelineManager

sample_list, group_list, pair_list = DesignFileLoader.load_design_file(design_file)
PipelineManager.execute(pipeline, ssh_client, project_name, workflow, analysis_steps, s3_input_files_address,
                       sample_list, group_list, s3_output_files_address, reference, "NA", pair_list)

[Table of Contents](#Table-of-Contents)

## Pipeline Status Checking

While the pipeline is running, it will not push information on its status to this notebook.  However, it is possible to pull information about the status of the pipeline at any time using the code below:

In [None]:
# Status check settings:

# Specify a step (one that is in the analysis_steps set) to check on the status of 
# or set to "all" to see the status of all steps
step="all"

# View verbose status information by setting this value to True or 
# view abbreviated status information by setting it to False
verbose=True

<div class="alert alert-info">

Be aware that the cell below, which reports analysis status, can take a very long time (tens of minutes) to complete if a large number of jobs are running on the cluster for the analysis.

</div>

In [None]:
import datetime
print(datetime.datetime.now())
PipelineManager.check_status(ssh_client, step, pipeline, workflow, project_name, analysis_steps, verbose=verbose)

#### Failure Handling
If the status information above indicates that steps have failed, examine the log files accessible from the cluster's master node to identify the issue.  To do this, gather the login information for the master node:

In [None]:
# print(private_key)
# print(master_ip_address)

With this information, `ssh` into the master node with a command of the following form:
    
    ssh -i <private_key> ec2-user@<master_ip_address>
    
Logs are stored in the `/shared/workspace/logs/` directory, under sub-folders for the pipeline, workflow, and project name.  `cd` to this directory with a command of the following form:

    cd /shared/workspace/logs/SmallRNASeq/bowtie2/<project_name>
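
The two commands above can be assembled from the notebook's own variables rather than typed by hand. This is a minimal sketch: `master_node_commands` is a hypothetical helper, and the key path and IP address in the example call are placeholders.

```python
import posixpath
import shlex


def master_node_commands(private_key, master_ip_address, pipeline, workflow, project_name):
    """Return the ssh command and the cd command for a project's logs."""
    ssh_command = "ssh -i {0} ec2-user@{1}".format(shlex.quote(private_key), master_ip_address)
    # Logs live under sub-folders for the pipeline, workflow, and project name
    log_dir = posixpath.join("/shared/workspace/logs", pipeline, workflow, project_name)
    cd_command = "cd {0}".format(shlex.quote(log_dir))
    return ssh_command, cd_command


for command in master_node_commands("/path/to/your_aws_key.pem", "203.0.113.10",
                                    "SmallRNASeq", "bowtie2", "20190101_testuser_mymirnaseq"):
    print(command)
```

`posixpath` is used instead of `os.path` because the log directory is on the cluster's master node, whatever platform the notebook itself runs on.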
    
    
If a sample's log files are empty, the sample was not downloaded from s3 (by default, the `aws s3` download command runs in quiet mode and does not generate log messages even on failure).  Double-check the file names in the design file to ensure they are correct and that the files are actually in the stated s3 location.
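
Empty log files can be found without opening each one. The following minimal sketch is meant to be run with Python on the master node; `find_empty_logs` is a hypothetical helper, and it treats any zero-byte file under the given directory as an empty log.

```python
import os


def find_empty_logs(log_dir):
    """Return the sorted paths of all zero-byte files under log_dir."""
    empty = []
    for dir_path, _dir_names, file_names in os.walk(log_dir):
        for file_name in file_names:
            file_path = os.path.join(dir_path, file_name)
            if os.path.getsize(file_path) == 0:
                empty.append(file_path)
    return sorted(empty)


# Example call; the directory follows the log layout described above
for path in find_empty_logs("/shared/workspace/logs/SmallRNASeq/bowtie2"):
    print(path)
```

If the directory does not exist, `os.walk` yields nothing and the loop prints no paths.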

Once the cause of the failures has been addressed, simply resubmit the jobs without clearing the logs or outputs; cirrus-ngs will automatically rerun the failed samples, while the samples which have outputs won’t be run again.  If it is necessary to redo the entire run from scratch, delete both the project folder in the logs and also the outputs before commencing the rerun to ensure that all existing state has been cleared.

[Table of Contents](#Table-of-Contents)

## MultiQC Report Display

If the multiqc step has been **run and completed**, the code below can be used to view the multiqc report to assess the overall outcome of the primary analysis.

<div class="alert alert-info">
        
**NOTE** that report display will only work if the `aws` command line tool has been configured with AWS credentials in the environment in which the notebook is being executed by first running the `aws configure` command; if it has not, a `fatal error: Unable to locate credentials` message will be displayed.
        
</div>

In [None]:
# Download the multiqc html file to local directory
!aws s3 cp $s3_output_files_address/$project_name/$workflow/multiqc_report.html $report_dir

In [None]:
from IPython.display import IFrame

IFrame(os.path.relpath("{}/multiqc_report.html".format(report_dir)), width="100%", height=1000)
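
The S3 location and local path used by the two cells above can be computed in one place, which also makes it easy to confirm that the download succeeded before displaying the report. A minimal sketch follows; `multiqc_report_locations` is a hypothetical helper, and the example values are the placeholder settings from earlier in this notebook.

```python
import os


def multiqc_report_locations(s3_output_files_address, project_name, workflow, report_dir):
    """Return (S3 URI, local path) for the project's MultiQC report."""
    s3_uri = "/".join([s3_output_files_address, project_name, workflow,
                       "multiqc_report.html"])
    local_path = os.path.join(report_dir, "multiqc_report.html")
    return s3_uri, local_path


s3_uri, local_path = multiqc_report_locations(
    "s3://path/to/output", "20190101_testuser_mymirnaseq", "bowtie2", "data")
print(s3_uri)
print("report downloaded:", os.path.isfile(local_path))
```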

[Table of Contents](#Table-of-Contents)

## Archiving

Generally, the outputs are written to a temporary directory on S3 until the analysis run has been verified as successful.  The final step is therefore to move the outputs of a successful run to their permanent archive location.

In [None]:
import datetime
# get today's date, without hyphens between yr/mo/day
curr_date = str(datetime.datetime.now().date()).replace("-","")
archive_subdir_name = "{0}_{1}_primary_analysis_deliverable".format(curr_date, workflow)
print(archive_subdir_name)

In [None]:
!aws s3 mv $s3_output_files_address/$project_name/$workflow/ $s3_archive_files_address/$archive_subdir_name/outputs --recursive --exclude "*.fastq"
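
The archive destination used by the `aws s3 mv` command above is built from the run date and the workflow name. A minimal sketch of the same naming as a reusable function follows; `archive_destination` is a hypothetical helper, and `strftime("%Y%m%d")` is used here as an equivalent of stripping the hyphens from the ISO date.

```python
import datetime


def archive_destination(s3_archive_files_address, workflow, run_date=None):
    """Return the S3 URI under which a run's outputs would be archived."""
    if run_date is None:
        run_date = datetime.date.today()
    subdir = "{0}_{1}_primary_analysis_deliverable".format(
        run_date.strftime("%Y%m%d"), workflow)
    return "{0}/{1}/outputs".format(s3_archive_files_address, subdir)


print(archive_destination("s3://path/to/archive", "bowtie2", datetime.date(2019, 3, 20)))
```

Passing the date explicitly makes the result reproducible, which is useful when an archive step is rerun on a later day.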

If desired, delete the output directory that was used to temporarily store the outputs before archiving:

In [None]:
# !aws s3 rm --recursive $s3_output_files_address/

[Table of Contents](#Table-of-Contents)

## Appendix: Methods Documentation

Print out the settings and script for each step run in the pipeline and workflow:

In [None]:
from util import AddonsManager
AddonsManager.display_pipeline_workflow_settings_and_scripts(ssh_client, pipeline, workflow, analysis_steps)

Print out the software configuration file, including software and reference genome versions:

In [None]:
AddonsManager.display_software_config(ssh_client)

[Table of Contents](#Table-of-Contents)

Copyright (c) 2018 UC San Diego Center for Computational Biology & Bioinformatics under the MIT License

Notebook template by Guorong Xu and Amanda Birmingham