Call differential binding events

Tao Liu (τν) edited this page Jun 20, 2014 · 3 revisions

In this page, I describe how to use MACS v2 to identify differential regions by comparing pileup tracks of two conditions. Two modules will be involved: callpeak and bdgdiff ( predictd is optional ). I will use human ChIP-seq data as example, and filenames of raw data are: cond1_ChIP.bam and cond1_Control.bam for condition 1; cond2_ChIP.bam and cond2_Control.bam for condition 2.

Step 1: Generate pileup tracks using callpeak module

Purpose of this step is to use callpeak with -B option to generate bedGraph files for both conditions. There are several things to be remember: 1. --SPMR is not compatible with bdgdiff, so avoid using it; 2. prepare a pen to write down the number of non-redundant reads of both conditions -- you will find such information in runtime message or xls output from callpeak; 3. keep using the same --extsize for both conditions ( you can get it from predictd module).

To get a uniform extension size for running callpeak, run predictd:

 $ macs2 predictd -i cond1_ChIP.bam
 $ macs2 predictd -i cond2_ChIP.bam

An easy solution is to use the average of two 'fragment size' predicted in callpeak, however any reasonable value will work. For example, you can use 200 for most ChIP-seq for transcription factors, or 147 for most histone modification ChIP-seq. The only requirement is that you have to keep using the same extsize for the following commands:

 $ macs2 callpeak -B -t cond1_ChIP.bam -c cond1_Control.bam -n cond1 --nomodel --extsize 120
 $ macs2 callpeak -B -t cond2_ChIP.bam -c cond2_Control.bam -n cond2 --nomodel --extsize 120

Pay attention to runtime message, or extract the "tags after filtering in treatment" and "tags after filtering in control" lines from xls to see the effective sequencing depths for both conditions. In our previous command lines, '--to-large' is not used, so the effective sequencing depth is the smaller number of treatment and control. For example:

 $ egrep "tags after filtering in treatment|tags after filtering in control" cond1_peaks.xls
 # tags after filtering in treatment: 19291269
 # tags after filtering in control: 12914669
 $ egrep "tags after filtering in treatment|tags after filtering in control" cond2_peaks.xls
 # tags after filtering in treatment: 19962431
 # tags after filtering in control: 14444786

Then actual effective depths of condition 1 and 2 are: 12914669 and 14444786. Keep record of these two numbers and we will use them later. After successfully running callpeak, you will have cond1_treat_pileup.bdg, cond1_control_lambda.bdg, cond2_treat_pileup.bdg, and cond2_control_lambda.bdg in the working directory.

Step 2: Call differential regions

The purpose of this step is to do a three ways comparisons to find out where in the genome has differential enrichment between two conditions. A basic requirement is that this region should be at least enriched in either condition. A log10 likelihood ratio cutoff (C) will be applied in this step. Three types of differential regions will be reported: 1. those having more enrichment in condition 1 over condition 2 ( cond1_ChIP > cond1_Control and cond1_ChIP > cond2_ChIP ); 2. those having more enrichment in condition 2 over condition 1 ( cond2_ChIP > cond2_Control and cond2_ChIP > cond1_ChIP ); those having similar enrichment in both conditions ( cond1_ChIP > cond1_Control and cond2_ChIP > cond2_Control and cond1_ChIP ≈ cond1_ChIP ).

Run this:

 $ macs2 bdgdiff --t1 cond1_treat_pileup.bdg --c1 cond1_control_lambda.bdg --t2 cond2_treat_pileup.bdg\
   --c2 cond2_control_lambda.bdg --d1 12914669 --d2 14444786 -g 60 -l 120 --o-prefix diff_c1_vs_c2

You will get the following three files in working directory:

diff_c1_vs_c2_c3.0_cond1.bed
This file stores regions that are highly enriched in condition 1 comparing to condition 2. The last column in the file represent the log10 likelihood ratio to show how likely the observed signal in condition 1 in this region is from condition 1 comparing to condition 2. Higher the value, bigger the difference.
diff_c1_vs_c2_c3.0_cond2.bed
This file stores regions that are highly enriched in condition 2 comparing to condition 1. The last column in the file represent the log10 likelihood ratio to show how likely the observed signal in condition 2 in this region is from condition 2 comparing to condition 1. Higher the value, bigger the difference.
diff_c1_vs_c2_c3.0_common.bed
This file stores regions that are highly enriched in both condition 1 and condition 2, and the difference between condition 1 and condition 2 is not significant. The last column in the file represent the difference between condition 1 and condition 2 in log10 likelihood ratios.

Step 3: Annotate peaks from callpeak

TBD