# ChIPSeq Pipeline

## 1. Configure AWS key pair, data location on S3 and the project information

This cell only contains information that you, the user, should input.

#### String Fields

**s3_input_files_address**: This is an s3 path to where your input fastq files are found. This shouldn't be the path to the actual fastq files, just to the directory containing all of them. All fastq files must be in the same s3 bucket.

**s3_output_files_address**: This is an s3 path to where you would like the outputs from your project to be uploaded. This is only the root directory. Please see the README for information about exactly how outputs are structured.

**design_file**: This is a path to your design file for this project. Please see the README for the format specification for the design files. 

**your_cluster_name**: This is the name given to your cluster when it was created using cfncluster. 

**private_key**: The path to your private key needed to access your cluster.

**project_name**: Name of your project. There should be no whitespace.

**workflow**: The workflow you want to run for this project. For the ChIPSeq pipeline the only possible workflow is "homer".

**genome**: The name of the reference you want to use for your project. Currently only "hg19" and "mm10" are supported here.

**style**: This will always be either "factor" or "histone", depending on your purposes. The "factor" style is better suited to transcription factor analysis, while the "histone" style is intended for histone analysis. More details can be found [here](http://homer.ucsd.edu/homer/ngs/peaks.html).
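
The constraints on the string fields can be checked locally before any cluster time is spent. The following is a minimal sketch, not part of cirrus-ngs: `validate_config` is a hypothetical helper, the allowed values are copied from the field descriptions above, and the list of fastq file extensions is an assumption.

```python
# Hypothetical helper, not part of cirrusngs: checks the string fields
# against the constraints described in this section.
ALLOWED_WORKFLOWS = {"homer"}
ALLOWED_GENOMES = {"hg19", "mm10"}
ALLOWED_STYLES = {"factor", "histone"}

# Assumption: these suffixes are what a fastq file name usually ends with.
FASTQ_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz")


def validate_config(s3_input, s3_output, project_name, workflow, genome, style):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for label, address in (("s3_input_files_address", s3_input),
                           ("s3_output_files_address", s3_output)):
        if not address.startswith("s3://"):
            problems.append("{} should start with s3://".format(label))
    if s3_input.lower().endswith(FASTQ_SUFFIXES):
        problems.append("s3_input_files_address should be a directory, not a fastq file")
    if any(ch.isspace() for ch in project_name):
        problems.append("project_name must not contain whitespace")
    if workflow not in ALLOWED_WORKFLOWS:
        problems.append("workflow must be one of {}".format(sorted(ALLOWED_WORKFLOWS)))
    if genome not in ALLOWED_GENOMES:
        problems.append("genome must be one of {}".format(sorted(ALLOWED_GENOMES)))
    if style not in ALLOWED_STYLES:
        problems.append("style must be one of {}".format(sorted(ALLOWED_STYLES)))
    return problems
```

Calling `validate_config(s3_input_files_address, s3_output_files_address, project_name, workflow, genome, style)` after the configuration cell and printing the returned list is one way to catch a leftover placeholder such as `genome = "genome"` early.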

#### analysis_steps
This is a set of strings that contains the steps you would like to run. The order of the steps does not matter.

Possible homer steps: "fastqc", "trim", "align", "multiqc", "make_tag_directory", "make_UCSC_file", "find_peaks", "annotate_peaks", "pos2bed", "find_motifs_genome"
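
Because `analysis_steps` is a plain set of strings, a misspelled step name is easy to miss. As a rough sketch (the helper `unknown_steps` is hypothetical and not part of cirrus-ngs; the step names are copied from the list above), a set difference reports any names that are not valid homer steps:

```python
# Step names copied from the list of homer steps above.
HOMER_STEPS = {
    "fastqc", "trim", "align", "multiqc", "make_tag_directory",
    "make_UCSC_file", "find_peaks", "annotate_peaks", "pos2bed",
    "find_motifs_genome",
}


def unknown_steps(analysis_steps):
    """Return the requested steps that are not valid homer steps, sorted."""
    return sorted(set(analysis_steps) - HOMER_STEPS)
```

For example, `unknown_steps({"fastqc", "bowtie"})` returns `["bowtie"]`, while a set containing only valid names returns an empty list. Since the input is a set, neither the order nor any repetition of step names affects the result.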

In [1]:
import os
import sys
from cirrusngs.managers import ConnectionManager, PipelineManager, ClusterSetupManager
from cirrusngs.util import DesignFileLoader

#s3 addresses for input files and output files
s3_input_files_address = "s3://path/to/fastq"
s3_output_files_address = "s3://path/to/output"

#path to chipseq design file
#examples in cirrus_root/data/cirrus-ngs/
design_file = "/path/to/design/file"

## CFNCluster name
your_cluster_name = "clustername"

## The private key pair for accessing cluster.
private_key = "/path/to/your_aws_key.pem"

## Project information
project_name = "project_name"

#options: homer
workflow = "workflow"

#options: hg19, mm10
genome = "genome"

#options: factor, histone
style = "style"

##order does not matter

##can be fastqc, trim, align, multiqc, make_tag_directory, make_UCSC_file,
##find_peaks, annotate_peaks, pos2bed, find_motifs_genome
analysis_steps = {
                    "fastqc"
                    ,"trim"
                    ,"align"
                    ,"multiqc"
                    ,"make_tag_directory"
                    ,"make_UCSC_file"
                    ,"find_peaks"
                    ,"annotate_peaks"
                    ,"pos2bed"
                    ,"find_motifs_genome"
                }

## 2. Create CFNCluster

The following cell connects to your cluster. Run it before step 3.

In [None]:
## Create a new cluster
master_ip_address = ClusterSetupManager.create_cfn_cluster(cluster_name=your_cluster_name)
ssh_client = ConnectionManager.connect_master(hostname=master_ip_address,
               username="ec2-user",
               private_key_file=private_key)

## 3. Run the pipeline

This cell actually executes your pipeline. Make sure that steps 1 and 2 have been completed before running.

In [None]:
sample_list, group_list, pairs_list = DesignFileLoader.load_design_file(design_file)

PipelineManager.execute("ChiPSeq", ssh_client, project_name, workflow, analysis_steps, s3_input_files_address,
                       sample_list, group_list, s3_output_files_address, genome, style, pairs_list)

## 4. Check status of pipeline

This cell lets you check the status of your pipeline. Set the step variable to a specific step or to "all". A specific step must be one that is in your analysis_steps set. To make the status output less verbose, set the verbose flag (at the end of the second line) to False.
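
The rule for the step variable can be written as a small guard. This is only an illustrative sketch: `check_step_choice` is a hypothetical helper, not a cirrus-ngs function, and it only checks the value locally without contacting the cluster.

```python
def check_step_choice(step, analysis_steps):
    """Return step unchanged if it is "all" or one of analysis_steps; otherwise raise."""
    if step != "all" and step not in analysis_steps:
        raise ValueError(
            "step must be \"all\" or one of: {}".format(", ".join(sorted(analysis_steps)))
        )
    return step
```

Running `check_step_choice(step, analysis_steps)` before the status call turns a mistyped step name into an immediate, readable error.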

In [None]:
step = "all" #can be any step in analysis_steps or "all"
PipelineManager.check_status(ssh_client, step, "ChiPSeq", workflow, project_name, analysis_steps,verbose=True)

If your pipeline is finished, run this cell in case some processes are still running.
This is only relevant if you plan to do another run on the same cluster afterwards.

In [None]:
PipelineManager.stop_pipeline(ssh_client)

## 5. Display MultiQC report

### Note: Run the cells below after the multiqc step is done

In [None]:
# Download the MultiQC HTML report to the local data directory
notebook_dir = os.getcwd().split("notebooks")[0] + "data/"
!aws s3 cp $s3_output_files_address/$project_name/$workflow/multiqc_report.html $notebook_dir

In [None]:
from IPython.display import IFrame

IFrame(os.path.relpath("{}multiqc_report.html".format(notebook_dir)), width="100%", height=1000)
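
The two paths used in the cells above follow simple string rules, shown here as a sketch with hypothetical helper names (`data_dir_for` and `report_s3_path` are not part of cirrus-ngs). The first mirrors the `notebook_dir` expression, and the second mirrors the S3 path passed to `aws s3 cp`. Stripping a trailing slash from the output address is an addition of this sketch, not something the notebook does.

```python
def data_dir_for(cwd):
    """Keep everything before the first "notebooks" in cwd and append "data/"."""
    return cwd.split("notebooks")[0] + "data/"


def report_s3_path(s3_output_files_address, project_name, workflow):
    """Build <output root>/<project>/<workflow>/multiqc_report.html."""
    # Assumption of this sketch: drop a trailing slash to avoid a doubled "/".
    root = s3_output_files_address.rstrip("/")
    return "{}/{}/{}/multiqc_report.html".format(root, project_name, workflow)
```

For a working directory such as `/home/user/cirrus-ngs/notebooks/cirrus-ngs`, `data_dir_for` gives `/home/user/cirrus-ngs/data/`. If the working directory does not contain the string "notebooks", the expression simply appends "data/" to it with no separator (for example `/tmp/work` becomes `/tmp/workdata/`), so the download cell is best run from inside the repository's notebooks directory.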