# WES/WGS Pipeline

## Be sure to install paramiko and scp with pip before using this notebook
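A quick way to confirm that both packages are importable before running the cells below is sketched here. The helper name `missing_packages` is hypothetical and not part of cirrus-ngs; it uses only the standard library.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The two packages this notebook asks for.
missing = missing_packages(["paramiko", "scp"])
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("paramiko and scp are available")
```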

## 1. Configure AWS key pair, data location on S3 and the project information

This cell only contains information that you, the user, should input.

#### String Fields

**s3_input_files_address**: This is an s3 path to where your input fastq files are found. This shouldn't be the path to the actual fastq files, just to the directory containing all of them. All fastq files must be in the same s3 bucket.

**s3_output_files_address**: This is an s3 path to where you would like the outputs from your project to be uploaded. This is only the root directory; please see the README for information about exactly how outputs are structured.

**design_file**: This is a path to your design file for this project. Please see the README for the format specification for the design files. 

**your_cluster_name**: This is the name given to your cluster when it was created using cfncluster. 

**private_key**: The path to your private key needed to access your cluster.

**project_name**: Name of your project. There should be no whitespace.

**workflow**: The workflow you want to run for this project. For the DNASeq pipeline the possible workflows are "bwa_gatk" and "bwa_mutect". 

**genome**: The name of the reference you want to use for your project. Currently only "hg19" and "GRCm38" are supported here.
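
The constraints on the string fields above can be checked before anything is sent to the cluster. The following is a minimal sketch, assuming only the rules stated in this section; `check_config` is a hypothetical helper, not a cirrus-ngs function, and the values in the example call are placeholders.

```python
def check_config(s3_input, s3_output, project_name, workflow, genome):
    """Return a list of problems; an empty list means the values look sane."""
    problems = []
    for label, path in (("s3_input_files_address", s3_input),
                        ("s3_output_files_address", s3_output)):
        if not path.startswith("s3://"):
            problems.append("{} should start with s3://".format(label))
    # The input address must be the directory, not an individual fastq file.
    if s3_input.endswith((".fastq", ".fastq.gz")):
        problems.append("s3_input_files_address should be a directory")
    if any(ch.isspace() for ch in project_name):
        problems.append("project_name must not contain whitespace")
    if workflow not in ("bwa_gatk", "bwa_mutect"):
        problems.append("workflow must be bwa_gatk or bwa_mutect")
    if genome not in ("hg19", "GRCm38"):
        problems.append("genome must be hg19 or GRCm38")
    return problems


# Placeholder values for illustration only.
for problem in check_config("s3://example-bucket/fastq", "s3://example-bucket/out",
                            "my project", "bwa_gatk", "hg19"):
    print(problem)
```

The example call reports only the whitespace in `project_name`, since the other values satisfy the rules above.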

#### analysis_steps
This is a set of strings that contains the steps you would like to run. The order of the steps does not matter.

possible bwa_gatk steps: "fastqc", "trim", "align", "multiqc", "sort", "dedup", "split", "postalignment", "haplotype", "merge", "combine_vcf"

possible bwa_mutect steps: "fastqc", "trim", "align", "multiqc", "sort", "dedup", "split", "postalignment",
"somatic_variant_calling", "merge"
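
Because `analysis_steps` is a plain set of strings, a misspelled step name is easy to miss. The sketch below flags names that are not in the two lists above. `ALLOWED_STEPS` and `unknown_steps` are hypothetical names, and the step names are copied from this section only; the pipeline itself may accept others (the comments in the configuration cell also mention "filter" for bwa_gatk).

```python
# Step names copied from the lists in this section (an assumption, not the
# pipeline's authoritative list).
ALLOWED_STEPS = {
    "bwa_gatk": {"fastqc", "trim", "align", "multiqc", "sort", "dedup", "split",
                 "postalignment", "haplotype", "merge", "combine_vcf"},
    "bwa_mutect": {"fastqc", "trim", "align", "multiqc", "sort", "dedup", "split",
                   "postalignment", "somatic_variant_calling", "merge"},
}


def unknown_steps(workflow, analysis_steps):
    """Return the requested steps that are not listed for `workflow`, sorted."""
    return sorted(set(analysis_steps) - ALLOWED_STEPS[workflow])


# "haplotype" belongs to bwa_gatk, so it is flagged for bwa_mutect.
print(unknown_steps("bwa_mutect", {"fastqc", "trim", "haplotype"}))
```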

In [1]:
import os
import sys
sys.path.append("../../src/cirrus_ngs")
from cfnCluster import CFNClusterManager, ConnectionManager
from util import PipelineManager
from util import DesignFileLoader

#s3 address of input files and output files
s3_input_files_address = "s3://ucsd-ccbb-interns/Mustafa/wgs_test/Sample_cDNA/mouse_samples"
s3_output_files_address = "s3://ucsd-ccbb-interns/Mustafa/wgs_test/Sample_cDNA/gatk_run"

## CFNCluster name
your_cluster_name = "mustafa7"

## The private key pair for accessing cluster.
private_key = "/home/mustafa/keys/interns_oregon_key.pem"

## Project information
project_name = "test_status"

#bwa_gatk or bwa_mutect
workflow = "bwa_mutect"

#hg19 or GRCm38
genome = "GRCm38"

## Whether to delete the cfncluster after the job is done.
delete_cfncluster = False

## Possible analysis_steps inputs for the two workflows. Order of input does not matter.

##bwa_gatk: "fastqc", "trim", "align", "multiqc","sort", "dedup", "split", "postalignment", 
#"haplotype", "merge", "combine_vcf", "filter"

##bwa_mutect: "fastqc", "trim" , "align", "multiqc", "sort", "dedup", "split", "postalignment",
#"somatic_variant_calling", "merge"

analysis_steps = {
                    "fastqc"
                    ,"trim"
                    ,"align"
#                     ,"multiqc"
                    ,"sort"
                    ,"dedup"
                    ,"split"
#                     ,"postalignment"
#                     ,"haplotype"
#                     ,"mutect"
#                     ,"merge" 
#                     ,"merge_vcf_pairwise"
#                     ,"combine_vcf"
                    #"filter"
                }

#add design file path here
#examples in cirrus_root/data/cirrus-ngs/
design_file = "/home/mustafa/ccbb/cirrus-ngs/data/cirrus-ngs/mouse_test.txt"

print("variables set")

variables set


## 2. Create CFNCluster

The following cell connects to your cluster. Run it before step 3.

In [2]:
## Create the cluster (an existing cluster with this name is reused) and connect to its master node
master_ip_address = CFNClusterManager.create_cfn_cluster(cluster_name=your_cluster_name)
ssh_client = ConnectionManager.connect_master(hostname=master_ip_address,
               username="ec2-user",
               private_key_file=private_key)

cluster mustafa7 does exist.
Status: CREATE_COMPLETE
Status: CREATE_COMPLETE
MasterServer: RUNNING
MasterServer: RUNNING
Output:"MasterPublicIP"="34.214.180.149"
Output:"MasterPrivateIP"="172.31.33.31"
Output:"GangliaPublicURL"="http://34.214.180.149/ganglia/"
Output:"GangliaPrivateURL"="http://172.31.33.31/ganglia/"

connecting
connected


## 3. Run the pipeline

This cell actually executes your pipeline. Make sure that steps 1 and 2 have been completed before running.

In [3]:
sample_list, group_list, pairs_list = DesignFileLoader.load_design_file(design_file)

PipelineManager.execute("DNASeq", ssh_client, project_name, workflow, analysis_steps, s3_input_files_address,
                       sample_list, group_list, s3_output_files_address, genome, "NA", pairs_list)

[['SRR2473232_1.fastq.gz', 'SRR2473232_2.fastq.gz'], ['SRR616274_1.fastq.gz', 'SRR616274_2.fastq.gz']]
['groupA', 'groupA']
{'SRR2473232_1': 'SRR616274_1'}
making the yaml file...
copying yaml file to remote master node...
test_status.yaml
/shared/workspace/Pipelines/yaml_files/DNASeq/bwa_mutect
executing pipeline...
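
The three values printed above are the sample list, the group list, and the pairs mapping returned by the design file loader. In this single example, the names in the pairs mapping match the first fastq file of each sample with the `.fastq.gz` suffix removed. The sketch below checks that relationship. It is inferred from this one output only; the README remains the authority on the design file format, and both helper names are hypothetical.

```python
def sample_name(fastq_file):
    """Strip a trailing .fastq.gz or .fastq from a file name."""
    for suffix in (".fastq.gz", ".fastq"):
        if fastq_file.endswith(suffix):
            return fastq_file[:-len(suffix)]
    return fastq_file


def unmatched_pair_names(sample_list, pairs_list):
    """Return names used in pairs_list that match no sample's first fastq file."""
    known = {sample_name(files[0]) for files in sample_list}
    used = set(pairs_list) | set(pairs_list.values())
    return sorted(used - known)


# Values copied from the output above.
sample_list = [["SRR2473232_1.fastq.gz", "SRR2473232_2.fastq.gz"],
               ["SRR616274_1.fastq.gz", "SRR616274_2.fastq.gz"]]
pairs_list = {"SRR2473232_1": "SRR616274_1"}
print(unmatched_pair_names(sample_list, pairs_list))
```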


## 4. Check status of pipeline

This allows you to check the status of your pipeline. You can specify a step or set the step variable to "all". If you specify a step it should be one that is in your analysis_steps set. You can toggle how verbose the status checking is by setting the verbose flag (at the end of the second line) to False. 

In [14]:
step = "all"
PipelineManager.check_status(ssh_client, step, "DNASeq", workflow, project_name, analysis_steps, verbose=False)

checking status of jobs...

The fastqc step is being executed
There are 2 instances of the fastqc step currently running

The trim step has not started yet.

The bwa step has not started yet.

The sort step has not started yet.

The dedup step has not started yet.

The split step has not started yet.



If your pipeline is finished, run this cell in case some processes are still running.
This is only relevant if you plan to do another run on the same cluster afterwards.

In [None]:
PipelineManager.stop_pipeline(ssh_client)

## 5. Display MultiQC report

### Note: Run the cells below after the multiqc step is done

In [None]:
# Download the multiqc html file to local
notebook_dir = os.getcwd().split("notebooks")[0] + "data/"
!aws s3 cp $s3_output_files_address/$project_name/$workflow/multiqc_report.html $notebook_dir

In [None]:
from IPython.display import IFrame

IFrame(os.path.relpath("{}multiqc_report.html".format(notebook_dir)), width="100%", height=1000)
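
The paths used by the two cells above can be made explicit, which helps when the download fails or the IFrame renders empty. This is a minimal sketch that mirrors the path layout used in those cells; the function names are hypothetical, the example values are placeholders, and, unlike the download cell, `multiqc_report_uri` strips a trailing slash from the output address.

```python
import os


def multiqc_report_uri(s3_output_files_address, project_name, workflow):
    """Build the S3 location of the report, following the layout used in the download cell."""
    return "{}/{}/{}/multiqc_report.html".format(
        s3_output_files_address.rstrip("/"), project_name, workflow)


def local_data_dir(cwd):
    """Mirror the notebook's rule: everything before 'notebooks', plus 'data/'."""
    return cwd.split("notebooks")[0] + "data/"


def report_is_downloaded(data_dir):
    """True if multiqc_report.html is already present in `data_dir`."""
    return os.path.isfile(os.path.join(data_dir, "multiqc_report.html"))


# Placeholder values for illustration only.
print(multiqc_report_uri("s3://example-bucket/out/", "test_status", "bwa_mutect"))
print(local_data_dir("/home/user/cirrus-ngs/notebooks/cirrus-ngs"))
```

Checking `report_is_downloaded(notebook_dir)` before creating the IFrame gives a clearer signal than a blank frame when the multiqc step has not finished.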