# miRNA Pipeline (with MultiQC analysis)

Notice: Before running this notebook, open /notebooks/awsCluster/BasicCFNClusterSetup.ipynb and install the CFNCluster package on your Jupyter notebook server.

## 1. Configure AWS key pair, data location on S3 and the project information

In [2]:
import os
import sys

## S3 input and output addresses.
# Notice: DO NOT put a forward slash at the end of your addresses.
s3_input_files_address = "s3://ucsd-ccbb-interns/Mengyi/mirna_test/20171107_Tom_miRNASeq/fastq"
s3_output_files_address = "s3://ucsd-ccbb-interns/Mustafa/smallrna_test"
    
## Path to the design file
design_file = "../../data/cirrus-ngs/mirna_test_design.txt"

## CFNCluster name
your_cluster_name = "mustafa7"

## Path to the private key file (.pem) used to access the cluster.
private_key = "/home/mustafa/keys/interns_oregon_key.pem"

## Project information
# Recommended: Specify year, month, date, user name and pipeline name (no empty spaces)
project_name = "new_test_mi1"

## Workflow information: only "bowtie2" is currently supported
workflow = "bowtie2"

## Genome information: currently available genomes: human, mouse
genome = "human"

## Available analysis steps: "fastqc", "trim", "cut_adapt", "align_and_count", "merge_counts", "multiqc"
analysis_steps = {"fastqc", "trim", "cut_adapt", "align_and_count","multiqc"}

## Whether to delete the CFNCluster after the job is done.
delete_cfncluster = False

print("Variables set.")

Variables set.
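The comments in the cell above imply a few constraints: S3 addresses must not end with a forward slash, the project name should contain no spaces, and the analysis steps come from a fixed list. A minimal sketch of a pre-flight check for those constraints follows. The function `validate_settings` and the constant `KNOWN_STEPS` are hypothetical helpers written for illustration; they are not part of cirrus-ngs.

```python
# Hypothetical helper, not part of cirrus-ngs: checks the constraints stated
# in the configuration comments before anything is sent to the cluster.
KNOWN_STEPS = {"fastqc", "trim", "cut_adapt", "align_and_count", "merge_counts", "multiqc"}


def validate_settings(s3_addresses, project_name, analysis_steps):
    """Return a list of problems; an empty list means the settings look fine."""
    problems = []
    for address in s3_addresses:
        if not address.startswith("s3://"):
            problems.append("not an S3 address: {}".format(address))
        if address.endswith("/"):
            problems.append("trailing slash: {}".format(address))
    if any(ch.isspace() for ch in project_name):
        problems.append("project name contains whitespace: {!r}".format(project_name))
    unknown = set(analysis_steps) - KNOWN_STEPS
    if unknown:
        problems.append("unknown analysis steps: {}".format(sorted(unknown)))
    return problems


print(validate_settings(
    ["s3://ucsd-ccbb-interns/Mustafa/smallrna_test"],
    "new_test_mi1",
    {"fastqc", "trim", "cut_adapt", "align_and_count", "multiqc"},
))
```

Running a check like this right after the configuration cell would surface a stray trailing slash or a misspelled step name before the cluster is contacted.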


## 2. Create CFNCluster

Notice: The CFNCluster package can only be installed on a Linux machine that supports pip installation.

In [4]:
sys.path.append("../../src/cirrus_ngs")
from cfnCluster import CFNClusterManager, ConnectionManager
## Create a new cluster
master_ip_address = CFNClusterManager.create_cfn_cluster(cluster_name=your_cluster_name)
ssh_client = ConnectionManager.connect_master(hostname=master_ip_address,
                                              username="ec2-user",
                                              private_key_file=private_key)

cluster mustafa7 does exist.
Status: CREATE_COMPLETE
Status: CREATE_COMPLETE
MasterServer: RUNNING
MasterServer: RUNNING
Output:"MasterPublicIP"="34.214.180.149"
Output:"MasterPrivateIP"="172.31.33.31"
Output:"GangliaPublicURL"="http://34.214.180.149/ganglia/"
Output:"GangliaPrivateURL"="http://172.31.33.31/ganglia/"

connecting
connected


## 3. Run the pipeline

In [5]:
from util import PipelineManager
from util import DesignFileLoader

## DO NOT edit below
reference = "hairpin_{}".format(genome)
print(reference)

sample_list, group_list, pair_list = DesignFileLoader.load_design_file(design_file)

PipelineManager.execute("SmallRNASeq", ssh_client, project_name, workflow, analysis_steps, s3_input_files_address,
                       sample_list, group_list, s3_output_files_address, reference, "NA", pair_list)

hairpin_human
[['AD17-WK52_ACTGAT_S72_L002_R1_001.fastq.gz'], ['AD17-WK73_GTGAAA_S18_L001_R1_001.fastq.gz'], ['AD5-WK24_GTGGCC_S67_L002_R1_001.fastq.gz']]
['group1', 'group1', 'group2']
{}
making the yaml file...
copying yaml file to remote master node...
new_test_mi1.yaml
/shared/workspace/Pipelines/yaml_files/SmallRNASeq/bowtie2
executing pipeline...
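The printed `sample_list` and `group_list` are parallel lists: the i-th entry of `group_list` names the group of the i-th sample. As a rough illustration of that relationship, the sketch below collects the samples under their group names using the values printed above. The function `samples_by_group` is a hypothetical helper, not something `DesignFileLoader` is known to provide.

```python
from collections import defaultdict


def samples_by_group(sample_list, group_list):
    """Map each group name to the list of samples assigned to it (hypothetical helper)."""
    if len(sample_list) != len(group_list):
        raise ValueError("sample_list and group_list must have the same length")
    grouped = defaultdict(list)
    for files, group in zip(sample_list, group_list):
        grouped[group].append(files)
    return dict(grouped)


# Values copied from the output printed by the cell above.
sample_list = [
    ["AD17-WK52_ACTGAT_S72_L002_R1_001.fastq.gz"],
    ["AD17-WK73_GTGAAA_S18_L001_R1_001.fastq.gz"],
    ["AD5-WK24_GTGGCC_S67_L002_R1_001.fastq.gz"],
]
group_list = ["group1", "group1", "group2"]

for group, samples in samples_by_group(sample_list, group_list).items():
    print(group, len(samples))
```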


## 4. Check status of pipeline

In [23]:
step = "all"
PipelineManager.check_status(ssh_client, step, "SmallRNASeq", workflow, project_name, analysis_steps, verbose=True)

checking status of jobs...

Your project will go through the following steps:
	align_and_count, multiqc

The align_and_count step calls the bowtie2_and_count.sh script on the cluster
The align_and_count step is being executed
There are 1 instances of the align_and_count step currently queued
	one is currently queued using 4 core(s) and was submitted 0 days, 0 hours, and 0 minutes ago
There are 2 instances of the align_and_count step currently running
	one is currently running using 4 core(s) and was submitted 0 days, 0 hours, and 0 minutes ago
	one is currently running using 4 core(s) and was submitted 0 days, 0 hours, and 0 minutes ago

The multiqc step calls the multiqc.sh script on the cluster
The multiqc step has not started yet.
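When many samples are in flight, the verbose report can be condensed into per-step counts. The sketch below assumes the status text has been captured as a string (for example by copying the cell output) and that its wording matches the lines shown above. Both the regular expression and the function `summarize_status` are illustrative assumptions, not part of `PipelineManager`.

```python
import re

# Matches lines such as:
#   "There are 2 instances of the align_and_count step currently running"
# The wording is taken from the output above and may differ in other versions.
STATUS_LINE = re.compile(
    r"There (?:are|is) (\d+) instances? of the (\w+) step currently (queued|running)"
)


def summarize_status(text):
    """Return {step: {state: count}} for every queued/running line found in text."""
    summary = {}
    for count, step, state in STATUS_LINE.findall(text):
        summary.setdefault(step, {})[state] = int(count)
    return summary


status_text = """\
The align_and_count step is being executed
There are 1 instances of the align_and_count step currently queued
There are 2 instances of the align_and_count step currently running
The multiqc step has not started yet.
"""

print(summarize_status(status_text))
```

Steps that have not started, such as `multiqc` in this output, produce no matching line and therefore do not appear in the summary.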




## 5. Display MultiQC report

### Note: Run the cells below after all jobs are done on the cluster.

In [25]:
# Download the MultiQC HTML report into the repository's data/ directory
notebook_dir = os.getcwd().split("notebooks")[0] + "data/"
!aws s3 cp $s3_output_files_address/$project_name/$workflow/multiqc_report.html $notebook_dir

download: s3://ucsd-ccbb-interns/Mustafa/smallrna_test/test_mi1/bowtie2/multiqc_report.html to ../../data/multiqc_report.html
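The download cell relies on the report living at `<output address>/<project name>/<workflow>/multiqc_report.html`, and on the local `data/` directory sitting next to the `notebooks` directory. A small sketch of how those two paths are assembled is given below. The helper names `multiqc_report_uri` and `data_dir_from` are hypothetical, and the example working directory is made up for illustration.

```python
def multiqc_report_uri(s3_output_files_address, project_name, workflow):
    """Build the S3 URI of the MultiQC report from the notebook settings."""
    return "/".join([
        s3_output_files_address.rstrip("/"),  # tolerate an accidental trailing slash
        project_name,
        workflow,
        "multiqc_report.html",
    ])


def data_dir_from(cwd):
    """Mirror the notebook's logic: keep everything before 'notebooks', then add data/."""
    return cwd.split("notebooks")[0] + "data/"


uri = multiqc_report_uri(
    "s3://ucsd-ccbb-interns/Mustafa/smallrna_test", "new_test_mi1", "bowtie2"
)
# Hypothetical working directory, used only to show the path arithmetic.
destination = data_dir_from("/home/user/repo/notebooks/cirrus-ngs")

# Argument list equivalent to the `!aws s3 cp` line; it is only printed here.
command = ["aws", "s3", "cp", uri, destination]
print(" ".join(command))
```

Passing a list like `command` to `subprocess.run` would be one way to perform the same copy outside a notebook, provided the AWS CLI is installed and configured.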


In [26]:
from IPython.display import IFrame

# Display the report downloaded in the previous cell
IFrame(os.path.relpath("{}multiqc_report.html".format(notebook_dir)), width="100%", height=1000)