# miRNA Pipeline

## Be sure to install paramiko and scp with pip before using this notebook

## 1. Configure AWS key pair, data location on S3 and the project information

This cell only contains information that you, the user, should input.

#### String Fields

**s3_input_files_address**: This is an s3 path to where your input fastq files are found. This shouldn't be the path to the actual fastq files, just to the directory containing all of them. All fastq files must be in the same s3 bucket.

**s3_output_files_address**: This is an s3 path to where you would like the outputs from your project to be uploaded. This is only the root directory of the outputs. Please see the README for information about exactly how outputs are structured.

**design_file**: This is a path to your design file for this project. Please see the README for the format specification for the design files. 

**your_cluster_name**: This is the name given to your cluster when it was created using cfncluster. 

**private_key**: The path to your private key needed to access your cluster.

**project_name**: Name of your project. There should be no whitespace.

**workflow**: The workflow you want to run for this project. For the miRNASeq pipeline the only available workflow is "bowtie2".

**genome**: The name of the reference you want to use for your project. The currently available genomes are "human" and "mouse".
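
The constraints on the string fields above can be checked before anything is sent to the cluster. The following is a minimal sketch, not part of the pipeline: `check_settings` is a hypothetical helper, and the bucket names in the usage line are made up. It only checks what the text states (s3 paths, no trailing forward slash, no whitespace in the project name).

```python
import re

def check_settings(s3_input, s3_output, project_name):
    """Return a list of problems found in the user-supplied string fields.

    Hypothetical helper for illustration; it is not part of the pipeline code.
    """
    problems = []
    for label, address in (("s3_input_files_address", s3_input),
                           ("s3_output_files_address", s3_output)):
        # Both addresses are expected to be s3 paths.
        if not address.startswith("s3://"):
            problems.append("{} must start with s3://".format(label))
        # The configuration cell asks for no forward slash at the end.
        if address.endswith("/"):
            problems.append("{} must not end with a forward slash".format(label))
    # The project name should contain no whitespace.
    if re.search(r"\s", project_name):
        problems.append("project_name must not contain whitespace")
    return problems

# Made-up values: the output address and the project name each break one rule.
print(check_settings("s3://my-bucket/fastq", "s3://my-bucket/results/", "test project"))
```

An empty list means none of these particular checks failed. It does not confirm that the bucket or the files exist.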

#### analysis_steps
This is a set of strings that contains the steps you would like to run. The order of the steps does not matter.

Possible bowtie2 steps: "fastqc", "trim", "cut_adapt", "align_and_count", "multiqc"
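
Because the steps are given as a Python set, a misspelled step name is easy to miss. The sketch below is an illustration only: `KNOWN_STEPS` and `unknown_steps` are hypothetical names, and the list of steps is copied from the line above (the comment in the configuration cell also mentions "merge_counts", which this sketch leaves out).

```python
# Step names copied from the list of possible bowtie2 steps above (assumption:
# this list is complete for the purpose of the check).
KNOWN_STEPS = {"fastqc", "trim", "cut_adapt", "align_and_count", "multiqc"}

def unknown_steps(analysis_steps):
    """Return the requested steps that are not in KNOWN_STEPS, sorted by name."""
    return sorted(set(analysis_steps) - KNOWN_STEPS)

# A correctly spelled selection yields an empty list.
print(unknown_steps({"fastqc", "trim", "multiqc"}))

# "cutadapt" is not a step name; the step is called "cut_adapt".
print(unknown_steps({"fastqc", "cutadapt"}))

# Sets compare equal regardless of the order in which the steps were written.
print({"trim", "fastqc"} == {"fastqc", "trim"})
```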

In [1]:
import os
import sys
from util import PipelineManager
from util import DesignFileLoader

## S3 input and output addresses.
# Notice: DO NOT put a forward slash at the end of your addresses.
s3_input_files_address = "s3://ucsd-ccbb-interns/Mengyi/mirna_test/20171107_Tom_miRNASeq/fastq"
s3_output_files_address = "s3://ucsd-ccbb-interns/Mustafa/smallrna_test"
    
## Path to the design file
design_file = "../../data/cirrus-ngs/mirna_test_design.txt"

## CFNCluster name
your_cluster_name = "mustafa8"

## The private key pair for accessing cluster.
private_key = "/home/mustafa/keys/interns_oregon_key.pem"

## Project information
# Recommended: Specify year, month, date, user name and pipeline name (no empty spaces)
project_name = "test_project"

## Workflow information: currently only "bowtie2" is available
workflow = "bowtie2"

## Genome information: currently available genomes: human, mouse
genome = "mouse"

## "fastqc", "trim", "cut_adapt", "align_and_count", "merge_counts", "multiqc"
analysis_steps = {"fastqc", "trim", "cut_adapt", "align_and_count","multiqc"}

## Whether to delete the cfncluster after the job is done.
delete_cfncluster = False

print("Variables set.")

Variables set.


## 2. Create CFNCluster

The following cell connects to your cluster. Run it before step 3.

In [2]:
sys.path.append("../../src/cirrus_ngs")
from cfnCluster import CFNClusterManager, ConnectionManager
## Create a new cluster
master_ip_address = CFNClusterManager.create_cfn_cluster(cluster_name=your_cluster_name)
ssh_client = ConnectionManager.connect_master(hostname=master_ip_address,
                                              username="ec2-user",
                                              private_key_file=private_key)

cluster mustafa8 does exist.
Status: CREATE_COMPLETE
Status: CREATE_COMPLETE
MasterServer: RUNNING
MasterServer: RUNNING
Output:"MasterPublicIP"="34.218.52.146"
Output:"MasterPrivateIP"="172.31.47.153"
Output:"GangliaPublicURL"="http://34.218.52.146/ganglia/"
Output:"GangliaPrivateURL"="http://172.31.47.153/ganglia/"

connecting
connected


## 3. Run the pipeline

This cell actually executes your pipeline. Make sure that steps 1 and 2 have been completed before running.

In [3]:
## DO NOT edit below
reference = "hairpin_{}".format(genome)
print(reference)

sample_list, group_list, pair_list = DesignFileLoader.load_design_file(design_file)

PipelineManager.execute("SmallRNASeq", ssh_client, project_name, workflow, analysis_steps, s3_input_files_address,
                       sample_list, group_list, s3_output_files_address, reference, "NA", pair_list)

hairpin_mouse
[['AD17-WK52_ACTGAT_S72_L002_R1_001.fastq.gz'], ['AD17-WK73_GTGAAA_S18_L001_R1_001.fastq.gz'], ['AD5-WK24_GTGGCC_S67_L002_R1_001.fastq.gz']]
['group1', 'group1', 'group2']
{}
making the yaml file...
copying yaml file to remote master node...
test_project.yaml
/shared/workspace/Pipelines/yaml_files/SmallRNASeq/bowtie2
executing pipeline...


## 4. Check status of pipeline

This cell lets you check the status of your pipeline. You can specify a single step or set the step variable to "all". If you specify a step, it should be one that is in your analysis_steps set. To make the status check less verbose, set the verbose flag (at the end of the second line) to False.
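
The rule for the step variable can be sketched as a small check made before the status call. This is an illustration under stated assumptions: `steps_to_check` is a hypothetical helper, not a function of `PipelineManager`, and it says nothing about how `check_status` itself treats an invalid step.

```python
def steps_to_check(step, analysis_steps):
    """Return the steps a status check would cover for a given step value.

    Hypothetical helper for illustration only.
    """
    if step == "all":
        # Sorted by name for a stable result; this is not the pipeline's run order.
        return sorted(analysis_steps)
    if step not in analysis_steps:
        raise ValueError("'{}' is not in analysis_steps".format(step))
    return [step]

analysis_steps = {"fastqc", "trim", "cut_adapt", "align_and_count", "multiqc"}
print(steps_to_check("trim", analysis_steps))
print(steps_to_check("all", analysis_steps))
```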

In [9]:
step="all"
PipelineManager.check_status(ssh_client, step, "SmallRNASeq", workflow, project_name, analysis_steps,verbose=True)

checking status of jobs...

Your project will go through the following steps:
	fastqc, trim, cut_adapt, align_and_count, multiqc

The fastqc step calls the fastqc.sh script on the cluster
The fastqc step has finished running without failure

The trim step calls the trim.sh script on the cluster
The trim step has finished running without failure

The cut_adapt step calls the cutadapt.sh script on the cluster
The cut_adapt step has finished running without failure

The align_and_count step calls the bowtie2_and_count.sh script on the cluster
The align_and_count step has finished running without failure

The multiqc step calls the multiqc.sh script on the cluster
The multiqc step has finished running without failure


Your pipeline has finished



## 5. Display MultiQC report

### Note: Run the cells below after the multiqc step is done

In [10]:
# Download the MultiQC HTML report to the local data directory
notebook_dir = os.getcwd().split("notebooks")[0] + "data/"
!aws s3 cp $s3_output_files_address/$project_name/$workflow/multiqc_report.html $notebook_dir

download: s3://ucsd-ccbb-interns/Mustafa/smallrna_test/test_project/bowtie2/multiqc_report.html to ../../data/multiqc_report.html
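
The download cell above builds the report location from three of the variables set in step 1 and derives the local target directory from the notebook's working directory. A minimal sketch of both computations follows. The function names are hypothetical and the working directory in the usage lines is a made-up example.

```python
def report_s3_url(s3_output_files_address, project_name, workflow):
    """Compose the S3 location of the MultiQC report as the download cell does."""
    # The address must not end with "/", otherwise the result contains "//".
    return "{}/{}/{}/multiqc_report.html".format(
        s3_output_files_address, project_name, workflow)

def local_data_dir(cwd):
    """Mirror the notebook_dir expression: keep the path before 'notebooks', append 'data/'."""
    return cwd.split("notebooks")[0] + "data/"

# Values from the configuration cell in step 1.
print(report_s3_url("s3://ucsd-ccbb-interns/Mustafa/smallrna_test", "test_project", "bowtie2"))

# Made-up working directory for illustration.
print(local_data_dir("/home/user/cirrus-ngs/notebooks/cirrus-ngs"))
```

Note that `split("notebooks")[0]` cuts at the first occurrence of the string "notebooks" anywhere in the path, so a parent directory with that word in its name would change the result.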


In [11]:
from IPython.display import IFrame

IFrame(os.path.relpath("{}multiqc_report.html".format(notebook_dir)), width="100%", height=1000)