# ChIP-Seq Pipeline

## Be sure to install paramiko and scp with pip before using this notebook

## 1. Configure AWS key pair, data location on S3 and the project information

This cell only contains information that you, the user, should input.

#### String Fields

**s3_input_files_address**: This is an s3 path to where your input fastq files are found. This shouldn't be the path to the actual fastq files, just to the directory containing all of them. All fastq files must be in the same s3 bucket.

**s3_output_files_address**: This is an s3 path to where you would like the outputs from your project to be uploaded. This is only the root directory. Please see the README for information about exactly how outputs are structured.

**design_file**: This is a path to your design file for this project. Please see the README for the format specification for the design files. 

**your_cluster_name**: This is the name given to your cluster when it was created using ParallelCluster. 

**private_key**: The path to your private key needed to access your cluster.

**project_name**: Name of your project. There should be no whitespace.

**workflow**: The workflow you want to run for this project. For the ChIPSeq pipeline the only possible workflow is "homer".

**genome**: The name of the reference you want to use for your project. Currently only "hg19" and "mm10" are supported here.

**style**: Either "factor" or "histone", depending on your purposes. The "factor" style suits transcription factor analysis, while the "histone" style is intended for histone analysis. More details can be found [here](http://homer.ucsd.edu/homer/ngs/peaks.html).

#### analysis_steps
This is a set of strings that contains the steps you would like to run. The order of the steps does not matter.

Possible homer steps: "fastqc", "trim", "align", "multiqc", "make_tag_directory", "make_UCSC_file", "find_peaks", "annotate_peaks", "pos2bed", "find_motifs_genome"

In [1]:
import os
import sys
sys.path.append("../../src/cirrus_ngs")
from awsCluster import ClusterManager, ConnectionManager
from util import PipelineManager
from util import DesignFileLoader
from util import ConfigParser

#s3 addresses for input files and output files
s3_input_files_address = "s3://path/to/fastq"
s3_output_files_address = "s3://path/to/output"

## Path to the design file
design_file = "../../data/cirrus-ngs/chipseq_design_example.txt"

## ParallelCluster name
your_cluster_name = "clustername"

## The private key file for accessing the cluster.
private_key = "/path/to/your_aws_key.pem"

## Project information
project_name = "test_project"

#options: homer
workflow = "homer"

#options: hg38, hg19, mm10
genome = "hg19"

#options: factor, histone
style = "histone"

## order does not matter
## can be fastqc, trim, align, multiqc, make_tag_directory, make_UCSC_file,
## find_peaks, annotate_peaks, pos2bed, find_motifs_genome
analysis_steps = {
                    "fastqc"
                    ,"trim"
                    ,"align"
                    ,"multiqc"
                    ,"make_tag_directory"
                    ,"make_UCSC_file"
                    ,"find_peaks"
                    ,"annotate_peaks"
                    ,"pos2bed"
                    ,"find_motifs_genome"
                }


print("variables set")

variables set
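
Before connecting to the cluster, it can help to sanity-check the values set above. The following is a minimal sketch, not part of cirrus-ngs: `validate_settings` is a hypothetical helper, the allowed step and style names are copied from the field descriptions in step 1, and the `s3://` prefix and fastq-suffix checks are assumptions based on those descriptions.

```python
import re

# Step and style names copied from the field descriptions in step 1.
HOMER_STEPS = {
    "fastqc", "trim", "align", "multiqc", "make_tag_directory",
    "make_UCSC_file", "find_peaks", "annotate_peaks", "pos2bed",
    "find_motifs_genome",
}
STYLES = {"factor", "histone"}


def validate_settings(project_name, workflow, style, analysis_steps,
                      s3_input, s3_output):
    """Return a list of problems found (hypothetical helper, not a cirrus-ngs API)."""
    problems = []
    if not project_name or re.search(r"\s", project_name):
        problems.append("project_name must be non-empty and contain no whitespace")
    if workflow != "homer":
        problems.append('workflow must be "homer"')
    if style not in STYLES:
        problems.append('style must be "factor" or "histone"')
    unknown = set(analysis_steps) - HOMER_STEPS
    if unknown:
        problems.append("unknown analysis steps: " + ", ".join(sorted(unknown)))
    # Assumption: both addresses are written as s3:// URIs, as in the example cell.
    for label, address in (("s3_input_files_address", s3_input),
                           ("s3_output_files_address", s3_output)):
        if not address.startswith("s3://"):
            problems.append(label + " should start with s3://")
    # The input address should name the directory, not an individual fastq file.
    if s3_input.lower().endswith((".fastq", ".fastq.gz", ".fq", ".fq.gz")):
        problems.append("s3_input_files_address should be a directory, not a fastq file")
    return problems


# Example: "bowtie" is the aligner, but the step itself is named "align".
example_problems = validate_settings(
    "test_project", "homer", "histone", {"fastqc", "bowtie"},
    "s3://path/to/fastq", "s3://path/to/output",
)
print(example_problems)
```

An empty list means none of these checks failed. It does not guarantee that the S3 paths exist or that the design file is well formed.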


## 2. Create ParallelCluster

The following cell connects to your cluster. Run it before step 3.

In [None]:
master_ip_address = ClusterManager.create_aws_cluster(cluster_name=your_cluster_name)
ssh_client = ConnectionManager.connect_master(hostname=master_ip_address,
               username="ec2-user",
               private_key_file=private_key)

## 3. Run the pipeline

This cell actually executes your pipeline. Make sure that steps 1 and 2 have been completed before running.

In [2]:
## DO NOT EDIT BELOW
## print the analysis information
reference_list, tool_list = ConfigParser.parse(os.getcwd())
ConfigParser.print_software_info("ChiPSeq", workflow, genome, reference_list, tool_list)

print (analysis_steps)

sample_list, group_list, pairs_list = DesignFileLoader.load_design_file(design_file)

PipelineManager.execute("ChiPSeq", ssh_client, project_name, workflow, analysis_steps, s3_input_files_address,
                       sample_list, group_list, s3_output_files_address, genome, style, pairs_list)

#Primary analysis details
#Author: Guorong Xu
#Date: 2019-07-25 14:06:40

#Reference used:
Reference genome: ucsc.hg19.fasta
Annotation: gencode.v19.annotation.gtf

#Tools used:
FASTQC: 0.11.3
Trimmomatic: 0.36
samtools: 1.9
MultiQC: v1.3
bowtie: 1.0.1
homer: 4.8.3



{'trim', 'make_tag_directory', 'align', 'find_peaks', 'annotate_peaks', 'find_motifs_genome', 'multiqc', 'make_UCSC_file', 'fastqc', 'pos2bed'}


## 4. Check status of pipeline

This allows you to check the status of your pipeline. You can specify a step or set the step variable to "all". If you specify a step it should be one that is in your analysis_steps set. You can toggle how verbose the status checking is by setting the verbose flag (at the end of the second line) to False. 
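
As a minimal sketch of the rule above, the `step` value can be checked before it is handed to `check_status`. `resolve_status_step` is a hypothetical helper and not part of cirrus-ngs. It only encodes the constraint that `step` is either "all" or a member of `analysis_steps`.

```python
def resolve_status_step(step, analysis_steps):
    """Return step if it is "all" or one of the selected steps, else raise ValueError.

    Hypothetical helper for illustration; not a cirrus-ngs function.
    """
    if step == "all" or step in analysis_steps:
        return step
    raise ValueError(
        "step must be 'all' or one of: " + ", ".join(sorted(analysis_steps))
    )


# Example with a reduced step set.
chosen_steps = {"fastqc", "trim", "align"}
print(resolve_status_step("align", chosen_steps))
```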

In [6]:
step = "all" #can be any step in analysis_steps or "all"
PipelineManager.check_status(ssh_client, step, "ChiPSeq", workflow, project_name, analysis_steps,verbose=True)

checking status of jobs...

Your project will go through the following steps:
	fastqc, trim, align, make_tag_directory, make_UCSC_file, find_peaks, annotate_peaks, pos2bed, find_motifs_genome

The fastqc step calls the fastqc.sh script on the cluster
The fastqc step has finished running without failure

The trim step calls the trim.sh script on the cluster
The trim step has finished running without failure

The align step calls the bowtie.sh script on the cluster
The align step has finished running without failure

The make_tag_directory step calls the make_tag_directory.sh script on the cluster
The make_tag_directory step has finished running without failure

The make_UCSC_file step calls the make_UCSC_file.sh script on the cluster
The make_UCSC_file step has finished running without failure

The find_peaks step calls the findpeaks.sh script on the cluster
The find_peaks step has finished running without failure

The annotate_peaks step calls the annotate_peaks.sh script on the cluste

If your pipeline has finished, run this cell in case some processes are still running.
This is only relevant if you plan on doing another run on the same cluster afterwards.

In [None]:
PipelineManager.stop_pipeline(ssh_client)

## 5. Display MultiQC report

### Note: Run the cells below after the multiqc step is done

In [None]:
# Download the MultiQC HTML report from S3 to the local data directory
notebook_dir = os.getcwd().split("notebooks")[0] + "data/"
!aws s3 cp $s3_output_files_address/$project_name/$workflow/multiqc_report.html $notebook_dir

In [None]:
from IPython.display import IFrame

IFrame(os.path.relpath("{}multiqc_report.html".format(notebook_dir)), width="100%", height=1000)
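
If the report does not render, one likely cause is that the download did not succeed. The sketch below shows how the S3 location in the `aws s3 cp` command is composed and how to check for the local copy before displaying it. Both functions are hypothetical helpers, not part of cirrus-ngs. Stripping a trailing slash from the output address is an added convenience that the original command does not perform.

```python
import os


def multiqc_report_uri(s3_output_files_address, project_name, workflow):
    """Compose <output address>/<project>/<workflow>/multiqc_report.html."""
    # Assumption: a trailing slash on the address is dropped to avoid "//".
    root = s3_output_files_address.rstrip("/")
    return "/".join([root, project_name, workflow, "multiqc_report.html"])


def report_is_downloaded(notebook_dir):
    """Return True if multiqc_report.html exists in the given local directory."""
    return os.path.isfile(os.path.join(notebook_dir, "multiqc_report.html"))


# Example using the placeholder values from step 1.
print(multiqc_report_uri("s3://path/to/output", "test_project", "homer"))
```

Calling `report_is_downloaded(notebook_dir)` before the `IFrame` cell gives a quick indication of whether the copy from S3 needs to be repeated.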