# RNA-seq Pipeline

Notice: Before running this notebook, open /notebooks/awsCluster/BasicCFNClusterSetup.ipynb and install the CFNCluster package on your Jupyter notebook server.

## 1. Configure AWS key pair, data location on S3 and the project information

In [17]:
import os
import sys
sys.path.append("../../src/cirrus_ngs")
from cfnCluster import CFNClusterManager, ConnectionManager
from util import PipelineManager
from util import DesignFileLoader

## S3 input and output addresses.
# Notice: DO NOT put a forward slash at the end of your addresses.
# s3_input_files_address = "s3://ucsd-ccbb-interns/Mengyi/rna_test/rna_vc"
s3_input_files_address = "s3://ucsd-ccbb-interns/Mustafa/rna_test"
s3_output_files_address = "s3://ucsd-ccbb-interns/Mustafa/rna_test"

## Path to the design file
design_file = "../../data/cirrus-ngs/rna_short_design.txt"

## CFNCluster name
your_cluster_name = "mustafa8"

## Path to the private key file used to access the cluster.
private_key = "/home/mustafa/keys/interns_oregon_key.pem"

## Project information
# Recommended: Specify year, month, day, user name and pipeline name (no spaces)
project_name = "mousetest"

## Choose one of the workflows: "star_gatk", "star_htseq", "star_rsem", "kallisto"
workflow = "star_rsem"

## Genome information. Currently available genomes: hg19, mm10 (mm10 is currently supported only in star_rsem)
genome = "mm10"

## Analysis steps:
# for star_gatk: "fastqc", "trim", "align", "multiqc", "variant_calling"
# for star_htseq: "fastqc", "trim", "align", "multiqc", "count", "merge_counts"
# for star_rsem: "fastqc", "trim", "align_count", "multiqc", "merge_counts"
# for kallisto: "fastqc", "trim", "align", "multiqc", "count", "merge_counts"


analysis_steps = {"fastqc", "trim", "align_count"}

## Whether to delete the CFNCluster after the job is done.
delete_cfncluster = False

print("Variables set.")

Variables set.
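
Two of the settings above are easy to get wrong: the S3 addresses must not end with a forward slash, and each workflow accepts only its own analysis steps. The following is a minimal sketch of a pre-flight check. `validate_config` and `WORKFLOW_STEPS` are hypothetical names that are not part of cirrus_ngs, and the step table is copied from the comments in the cell above rather than from the package itself.

```python
# Hypothetical pre-flight check; not part of cirrus_ngs.
# The step lists are copied from the comments in the configuration cell.
WORKFLOW_STEPS = {
    "star_gatk": ["fastqc", "trim", "align", "multiqc", "variant_calling"],
    "star_htseq": ["fastqc", "trim", "align", "multiqc", "count", "merge_counts"],
    "star_rsem": ["fastqc", "trim", "align_count", "multiqc", "merge_counts"],
    "kallisto": ["fastqc", "trim", "align", "multiqc", "count", "merge_counts"],
}


def validate_config(workflow, analysis_steps, s3_addresses):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if workflow not in WORKFLOW_STEPS:
        problems.append("unknown workflow: {}".format(workflow))
    else:
        # Steps requested but not listed for the chosen workflow.
        unknown = sorted(set(analysis_steps) - set(WORKFLOW_STEPS[workflow]))
        if unknown:
            problems.append(
                "steps not valid for {}: {}".format(workflow, ", ".join(unknown))
            )
    for address in s3_addresses:
        if not address.startswith("s3://"):
            problems.append("not an S3 address: {}".format(address))
        if address.endswith("/"):
            problems.append("trailing slash: {}".format(address))
    return problems


print(validate_config("star_rsem", {"fastqc", "trim", "align_count"},
                      ["s3://ucsd-ccbb-interns/Mustafa/rna_test"]))
```

An empty list means the configuration passed both checks.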


## 2. Create CFNCluster

Notice: The CFNCluster package can only be installed on a Linux machine that supports pip installation.

In [10]:
## Create a new cluster
master_ip_address = CFNClusterManager.create_cfn_cluster(cluster_name=your_cluster_name)
ssh_client = ConnectionManager.connect_master(hostname=master_ip_address,
                                              username="ec2-user",
                                              private_key_file=private_key)

cluster mustafa8 does exist.
Status: CREATE_COMPLETE
Status: CREATE_COMPLETE
MasterServer: RUNNING
MasterServer: RUNNING
Output:"MasterPublicIP"="34.218.52.146"
Output:"MasterPrivateIP"="172.31.47.153"
Output:"GangliaPublicURL"="http://34.218.52.146/ganglia/"
Output:"GangliaPrivateURL"="http://172.31.47.153/ganglia/"

connecting
connected
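
The `Output:"key"="value"` lines printed above carry the cluster addresses, including the Ganglia monitoring URLs. If that text has been saved, it can be turned into a dictionary with a small parser. This is only a sketch: `parse_cluster_outputs` is a hypothetical helper, not a cirrus_ngs function, and it assumes the line format is exactly the one shown in the output above.

```python
import re

# Matches lines such as: Output:"MasterPublicIP"="34.218.52.146"
OUTPUT_LINE = re.compile(r'^Output:"(?P<key>[^"]+)"="(?P<value>[^"]*)"$')


def parse_cluster_outputs(log_text):
    """Collect the Output:"key"="value" lines of a cluster log into a dict."""
    outputs = {}
    for line in log_text.splitlines():
        match = OUTPUT_LINE.match(line.strip())
        if match:
            outputs[match.group("key")] = match.group("value")
    return outputs


log = '''Status: CREATE_COMPLETE
MasterServer: RUNNING
Output:"MasterPublicIP"="34.218.52.146"
Output:"GangliaPublicURL"="http://34.218.52.146/ganglia/"
'''
print(parse_cluster_outputs(log)["GangliaPublicURL"])
```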


## 3. Run the RNA sequencing pipeline

In [18]:
## DO NOT EDIT BELOW
print(analysis_steps)

sample_list, group_list, pair_list = DesignFileLoader.load_design_file(design_file)

PipelineManager.execute("RNASeq", ssh_client, project_name, workflow, analysis_steps, s3_input_files_address,
                        sample_list, group_list, s3_output_files_address, genome, "NA", pair_list)

{'trim', 'align_count', 'fastqc'}
[['638-BM-Stem_S1_R1_001_head.fastq', '638-BM-Stem_S1_R2_001_head.fastq'], ['689-BM-Stem_S0_R1_001_head.fastq', '689-BM-Stem_S0_R2_001_head.fastq']]
['groupA', 'groupB']
{}
making the yaml file...
copying yaml file to remote master node...
mousetest.yaml
/shared/workspace/Pipelines/yaml_files/RNASeq/star_rsem
executing pipeline...
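
The printed lists above show what the design file loader returned: one entry of FASTQ file names per sample, and one group label per sample in the same order. A quick consistency check before launching can catch a malformed design file. The sketch below uses a hypothetical helper, `summarize_design`, and assumes that an entry with two file names is a paired-end sample (R1 and R2), which is inferred from the file names above rather than confirmed by the package.

```python
def summarize_design(sample_list, group_list):
    """Pair each sample with its group and guess the read layout."""
    if len(sample_list) != len(group_list):
        raise ValueError(
            "{} samples but {} groups".format(len(sample_list), len(group_list))
        )
    summary = []
    for files, group in zip(sample_list, group_list):
        # Assumption: two files per sample means paired-end reads.
        layout = "paired-end" if len(files) == 2 else "single-end"
        summary.append((group, layout, files))
    return summary


samples = [
    ["638-BM-Stem_S1_R1_001_head.fastq", "638-BM-Stem_S1_R2_001_head.fastq"],
    ["689-BM-Stem_S0_R1_001_head.fastq", "689-BM-Stem_S0_R2_001_head.fastq"],
]
groups = ["groupA", "groupB"]

for group, layout, files in summarize_design(samples, groups):
    print(group, layout, ", ".join(files))
```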


## 4. Check status of pipeline

In [21]:
step="all"
PipelineManager.check_status(ssh_client, step, "RNASeq", workflow, project_name, analysis_steps, verbose=True)

checking status of jobs...

Your project will go through the following steps:
	fastqc, trim, align_count

The fastqc step calls the fastqc.sh script on the cluster
The fastqc step has finished running without failure

The trim step calls the trim.sh script on the cluster
The trim step has finished running without failure

The align_count step calls the cal_expression.sh script on the cluster
The align_count step has finished running, but has failed
	Please check the logs


Your pipeline has finished
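
When many steps are listed, it can help to reduce a status report like the one above to a per-step result. The sketch below parses the report text, assuming it has been copied or captured into a string and that the wording of each line matches the output shown above. `summarize_status` is a hypothetical helper and not part of `PipelineManager`.

```python
import re

# Matches the two result lines seen in the status report above.
STATUS_LINE = re.compile(
    r"^The (?P<step>\w+) step has finished running"
    r"(?P<result>, but has failed| without failure)$"
)


def summarize_status(status_text):
    """Map each finished step to "ok" or "failed"."""
    results = {}
    for line in status_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            failed = "failed" in match.group("result")
            results[match.group("step")] = "failed" if failed else "ok"
    return results


report = """The fastqc step has finished running without failure
The trim step has finished running without failure
The align_count step has finished running, but has failed
"""
failed_steps = [s for s, r in summarize_status(report).items() if r == "failed"]
print(failed_steps)
```

For the report above this singles out `align_count`, the step whose logs need checking.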



If your pipeline has finished, run this cell in case some processes are still running.
This is only relevant if you plan to run another pipeline on the same cluster right afterwards.

In [11]:
PipelineManager.stop_pipeline(ssh_client)

## 5. Display MultiQC report

### Note: Run the cells below after all jobs are done on the cluster.

In [4]:
# Download the MultiQC HTML report to the local data directory
notebook_dir = os.getcwd().split("notebooks")[0] + "data/"
!aws s3 cp $s3_output_files_address/$project_name/$workflow/multiqc_report.html $notebook_dir

download: s3://ucsd-ccbb-interns/Mustafa/rna_test/new_index_test/kallisto/multiqc_report.html to ../../data/multiqc_report.html


In [5]:
from IPython.display import IFrame

IFrame(os.path.relpath("{}multiqc_report.html".format(notebook_dir)), width="100%", height=1000)
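
If the download did not succeed, for example because no report exists at that S3 location, the IFrame will simply show a missing page. A small guard makes the failure explicit. This is a sketch under stated assumptions: `local_report_path` is a hypothetical helper, and the default file name is the one used in the download cell above.

```python
import os


def local_report_path(data_dir, filename="multiqc_report.html"):
    """Return the report path relative to the working directory.

    Raises FileNotFoundError if the report was not downloaded.
    """
    path = os.path.join(data_dir, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError("report not found: {}".format(path))
    return os.path.relpath(path)
```

The returned path can be passed to `IFrame` in place of the formatted string, for example `IFrame(local_report_path(notebook_dir), width="100%", height=1000)`.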