# Steps to continue data analysis after Galaxy

This notebook picks up after read mapping and peak calling on ChIP-seq data in Galaxy. Its goal is to estimate differential peak binding between samples.

In [1]:
mkdir -p ~/work/jupyter-res
cd ~/work/jupyter-res

### Gather files
Galaxy exports several files to a results directory. Only the bed (peak) and bam (alignment) files are used in the next steps.

In [2]:
peak_caller=epic2

## find input files on system
bed_01=($(ls ~/work/galaxy-res/chipseq1/*bed | grep -i "${peak_caller}"))
bed_02=($(ls ~/work/galaxy-res/chipseq2/*bed | grep -i "${peak_caller}"))

bam_1=($(ls ~/work/galaxy-res/chipseq1/2*merged.bam* | grep -i chip))
inp_1=($(ls ~/work/galaxy-res/chipseq1/2*merged.bam* | grep -i input))

bam_2=($(ls ~/work/galaxy-res/chipseq2/2*merged.bam* | grep -i chip))
inp_2=($(ls ~/work/galaxy-res/chipseq2/2*merged.bam* | grep -i input))
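
The variables above are bash arrays, and a plain `$bam_1` expands to the first element only. Since the `*merged.bam*` pattern can also match index files, it may help to confirm what each variable resolved to. The following is a minimal sketch, assuming a hypothetical helper named `check_first` that is not part of the original notebook:

```shell
# Hypothetical helper (not part of the original notebook).
# Reports how many paths were passed and whether the first one is a file.
# usage: check_first LABEL PATH...
check_first() {
    label=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "$label: no match"
        return 1
    fi
    if [ ! -f "$1" ]; then
        echo "$label: first match is not a regular file: $1"
        return 1
    fi
    echo "$label: $# match(es), using $1"
}

# Demo on throwaway files; in the notebook one would call, for example,
#   check_first bam_1 "${bam_1[@]}"
tmp=$(mktemp -d)
touch "$tmp/chip.bam" "$tmp/chip.bam.bai"
check_first demo "$tmp"/chip.bam*
check_first empty || true
rm -r "$tmp"
```

A non-zero return status signals a missing input, so the check can stop a cell before any downstream command runs on an empty variable.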

### Analysis of differentially marked regions

Peaks were called by comparing ChIP alignments against INPUT samples, and each peak carries a fold change (FC) of ChIP over INPUT. Before running the differential binding analysis, we kept only peaks with FC > 1 to enrich for better-defined peaks. To disable this filter, set `min_pk_fc=0`. Additionally, peaks on scaffolds that contained non-nuclear sequences were removed. The steps are:
* Filter epic2 peaks
* Run MAnorm

#### Filter epic2 results
First, the largest peaks remaining after filtering by FC and scaffold origin are listed so that they can be inspected in IGV.

In [3]:
scaf_list=/home/jovyan/work/nuclear_scaff.txt
min_pk_fc=1

In [4]:
# leaves
awk -v fc=$min_pk_fc '$7>fc{print $4,$3-$2+1,$1,$2,$3}' OFS='\t' \
   <(join -t$'\t' $scaf_list <(sort $bed_01)) | sort -k2rn 2>/dev/null | head -5

island_15025	84400	Scaffold0276	12000	96399
island_1906	43800	A02	5991800	6035599
island_6749	31600	A05	1188400	1219999
island_13965	31200	A09	43780000	43811199
island_2067	30000	A02	8235200	8265199


In [5]:
# inflorescences
awk -v fc=$min_pk_fc '$7>fc{print $4,$3-$2+1,$1,$2,$3}' OFS='\t' \
   <(join -t$'\t' $scaf_list <(sort $bed_02)) | sort -k2rn 2>/dev/null | head -5

island_4572	12000	A03	22574400	22586399
island_4698	9400	A03	24657800	24667199
island_15554	8800	A10	17663400	17672199
island_10364	8600	A07	16591400	16599999
island_10441	8400	A07	18497000	18505399
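
The two preview cells above chain `join`, `awk` and `sort`. As a sketch of what each stage does, the same pipeline can be run on a few made-up peaks. The helper name `list_peaks` and all toy values are assumptions for illustration, and the column layout (fold change in column 7) simply mirrors the commands above:

```shell
# Hypothetical helper: list peaks on kept scaffolds with FC above a cutoff,
# widest first. Output columns: name, width, chrom, start, end.
# SCAFFOLD_LIST must already be sorted (C locale).
list_peaks() {  # usage: list_peaks SCAFFOLD_LIST PEAKS_BED MIN_FC
    tab=$(printf '\t')
    LC_ALL=C sort -k1,1 "$2" |
        LC_ALL=C join -t "$tab" "$1" - |
        awk -v fc="$3" -v OFS='\t' '$7>fc{print $4,$3-$2+1,$1,$2,$3}' |
        sort -k2,2rn
}

tmp=$(mktemp -d)

# Scaffolds to keep (toy stand-in for the nuclear scaffold list).
printf 'A01\nA02\n' > "$tmp/keep.txt"

# Toy peaks: chrom, start, end, name, score, strand, fold change.
{
    printf 'A02\t100\t299\tisland_1\t10\t.\t2.5\n'
    printf 'A01\t500\t899\tisland_2\t10\t.\t0.8\n'
    printf 'Mt\t100\t999\tisland_3\t10\t.\t4.0\n'
    printf 'A01\t1000\t1599\tisland_4\t10\t.\t1.7\n'
} > "$tmp/peaks.bed"

# island_3 is dropped by join (scaffold not in the list),
# island_2 is dropped by awk (FC not above 1).
list_peaks "$tmp/keep.txt" "$tmp/peaks.bed" 1

rm -r "$tmp"
```

Only `island_4` (width 600) and `island_1` (width 200) survive in this toy input, in that order.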


Filtering by fold change and removing non-nuclear scaffolds eliminates most noisy peaks. One remaining peak in the first sample (`island_15025`) is removed manually; the second sample needs no additional filtering.

In [6]:
pk2rm_1=(island_15025)
bed_1="${bed_01/.bed/_filt.bed}"
bed_2="${bed_02/.bed/_filt.bed}"

# first input: peak names to drop; second input: peaks on nuclear scaffolds
awk -v fc=$min_pk_fc 'FNR==NR{a[$1];next;} $7>fc && !($4 in a){print $0}' \
   RS=' ' <(echo ${pk2rm_1[@]}) \
   RS='\n' <(join -t$'\t' $scaf_list <(sort $bed_01)) \
   > ${bed_1}

awk -v fc=$min_pk_fc '$7>fc{print $0}' OFS='\t' \
   <(join -t$'\t' $scaf_list <(sort $bed_02)) \
   > ${bed_2}

In [7]:
# number of peaks before and after filtering
for bed in $bed_01 $bed_1 $bed_02 $bed_2
do
    wc -l $bed
done

15140 /home/jovyan/work/galaxy-res/chipseq1/epic2_peaks.bed
9058 /home/jovyan/work/galaxy-res/chipseq1/epic2_peaks_filt.bed
21397 /home/jovyan/work/galaxy-res/chipseq2/epic2_peaks.bed
8722 /home/jovyan/work/galaxy-res/chipseq2/epic2_peaks_filt.bed
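
The line counts are easier to compare as a retained percentage. A minimal sketch, assuming a hypothetical helper named `pct_kept` that is not part of the original notebook:

```shell
# Hypothetical helper (not in the original notebook): line count of
# FILTERED_BED expressed as a percentage of the line count of RAW_BED.
pct_kept() {  # usage: pct_kept RAW_BED FILTERED_BED
    raw=$(wc -l < "$1")
    kept=$(wc -l < "$2")
    awk -v r="$raw" -v k="$kept" 'BEGIN {
        if (r > 0) printf "%.1f%%\n", 100 * k / r
        else print "NA"
    }'
}

# Demo on throwaway files: 1 of 4 lines kept.
tmp=$(mktemp -d)
printf 'a\nb\nc\nd\n' > "$tmp/raw.bed"
printf 'a\n' > "$tmp/filt.bed"
pct_kept "$tmp/raw.bed" "$tmp/filt.bed"
rm -r "$tmp"
```

In the notebook this would be called as `pct_kept "$bed_01" "$bed_1"`. With the counts printed above, the filters keep roughly 59.8% of the peaks in the first sample and 40.8% in the second.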


#### MAnorm
The filtered peak sets were compared between the two samples. MAnorm uses only the ChIP files and compares them on an M-A plot to determine differentially marked regions.

In [8]:
sample_1=leaf
sample_2=infl
manorm_dir=manorm-"$sample_1"VS"$sample_2"

manorm \
--peak1 "$bed_1" \
--peak2 "$bed_2" \
--peak-format bed \
--read1 "$bam_1" \
--read2 "$bam_2" \
--read-format bam \
--name1 "$sample_1" \
--name2 "$sample_2" \
--paired-end \
-o "$manorm_dir" \
2> manorm.log
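
For orientation, the coordinates of an M-A plot are the log2 ratio (M) and the mean log2 level (A) of the two read densities. The sketch below computes only these raw definitions for one pair of made-up densities. It is an illustration under that assumption, not MAnorm's implementation, which additionally rescales the values before reporting them. The helper name `ma_values` is hypothetical.

```shell
# Hypothetical helper: raw M and A values from two read densities.
# Illustration only; MAnorm applies its own normalization on top of this.
ma_values() {  # usage: ma_values DENSITY_1 DENSITY_2
    awk -v x="$1" -v y="$2" 'BEGIN {
        m = log(x / y) / log(2)           # M: log2 ratio of the densities
        a = 0.5 * log(x * y) / log(2)     # A: mean of the two log2 densities
        printf "M=%.2f\tA=%.2f\n", m, a
    }'
}

ma_values 8 2    # density 4x higher in sample 1
ma_values 2 8    # same pair swapped: M changes sign, A stays the same
```

A positive M therefore points to a stronger signal in the first sample (`leaf` here) and a negative M to a stronger signal in the second (`infl`).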

#### Add summary statistics to the MAnorm log and move it to the output folder

In [9]:
echo -e "\n# peaks\t|M|>0.1\t|M|>0.25\t|M|>0.5\t|M|>1" >> manorm.log
awk -F '\t' 'NR>1{m_val=sqrt($5^2); if(m_val>.1){a++;} if(m_val>.25){b++;} if(m_val>.5){c++;} if(m_val>1){d++;} }END{print "total",a,b,c,d}' OFS='\t' "$manorm_dir"/*xls >> manorm.log
awk -F '\t' 'NR>1&&$5>0{m_val=$5; if(m_val>.1){a++;} if(m_val>.25){b++;} if(m_val>.5){c++;} if(m_val>1){d++;} }END{print "M > 0",a,b,c,d}' OFS='\t' "$manorm_dir"/*xls >> manorm.log
awk -F '\t' 'NR>1&&$5<0{m_val=-$5; if(m_val>.1){a++;} if(m_val>.25){b++;} if(m_val>.5){c++;} if(m_val>1){d++;} }END{print "M < 0",a,b,c,d}' OFS='\t' "$manorm_dir"/*xls >> manorm.log

mv manorm.log "$manorm_dir"
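
The three `awk` calls above read the same table three times. As an alternative sketch, the counts can be collected in one pass. The helper name `m_stats` and the toy table are assumptions for illustration; as in the commands above, the M value is taken from column 5 of a tab-separated table with one header line.

```shell
# Hypothetical helper (not in the original notebook).
# Expects a tab-separated table with one header line and M in column 5.
m_stats() {  # usage: m_stats MANORM_TABLE
    awk -F '\t' -v OFS='\t' '
        BEGIN { cut[1] = 0.1; cut[2] = 0.25; cut[3] = 0.5; cut[4] = 1 }
        NR > 1 {
            m = ($5 < 0) ? -$5 : $5                               # |M|
            sign = ($5 > 0) ? "pos" : (($5 < 0) ? "neg" : "zero")
            for (i = 1; i <= 4; i++)
                if (m > cut[i]) { n["all", i]++; n[sign, i]++ }
        }
        END {
            print "total", n["all", 1] + 0, n["all", 2] + 0, n["all", 3] + 0, n["all", 4] + 0
            print "M > 0", n["pos", 1] + 0, n["pos", 2] + 0, n["pos", 3] + 0, n["pos", 4] + 0
            print "M < 0", n["neg", 1] + 0, n["neg", 2] + 0, n["neg", 3] + 0, n["neg", 4] + 0
        }' "$1"
}

# Demo on a toy table with five peaks.
tmp=$(mktemp -d)
{
    printf 'chr\tstart\tend\tsummit\tM_value\n'
    printf 'A01\t1\t100\t50\t1.5\n'
    printf 'A01\t200\t300\t250\t-0.3\n'
    printf 'A02\t1\t100\t50\t0.05\n'
    printf 'A02\t200\t300\t250\t-2\n'
    printf 'A03\t1\t100\t50\t0.6\n'
} > "$tmp/toy.xls"
m_stats "$tmp/toy.xls"
rm -r "$tmp"
```

Adding `+ 0` to each counter prints `0` instead of an empty field when no peak passes a cutoff.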


### Draw metagene plot and gene heatmap
A metagene plot helps visualize the distribution of the mark across genes. The ChIP signal is normalized against INPUT, so both files are passed to ngs.plot.

In [10]:
genes=/data/Bra_3.0_genes.bed 

ngs.plot.r \
-G Bra3.0 \
-R bed \
-E "$genes" \
-C "$bam_1":"$inp_1" \
-T "" \
-IN 1 \
-O genes_leaf \
-FS 10 \
-WD 6 -HG 5 -SE 1 -LEG 0 \
-RR 200

Configuring variables...Done
Loading R libraries.....Done
In headerIndexBam(bam.list) :
  Aligner for: /home/jovyan/work/galaxy-res/chipseq1/2_filt-markdup_INPUT_merged.bam cannot be determined. Style of 
standard SAM mapping score will be used. Would you mind submitting an issue 
report to us on Github? This will benefit people using the same aligner.
'isNotPrimaryRead' is deprecated.
Use 'isSecondaryAlignment' instead.
See help("Deprecated") 

## After running the R notebook
In the R notebook, genes are categorized by expression level based on their counts. Here, a metagene plot shows the average mark level on the genes of each category. Three steps are taken:
* Prepare a bed file for each gene list from the provided bed annotation of gene models
* Write a configuration file with paths to bam and bed files
* Run ngs.plot

In [11]:
# prepare bed files
gene_path=~/work/jupyter-res/gene_lists
for f in "${gene_path}"/leaf*txt; do 
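    # join needs both inputs sorted on the join field (gene ID)
    join -2 4 -o 2.1,2.2,2.3,2.4,2.5,2.6 -t $'\t' \
        <(sort -t $'\t' -k1,1 "$f") \
        <(sort -t $'\t' -k4,4 /data/Bra_3.0_genes.bed) > "${f%.txt}.bed"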
done
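
`join` silently skips gene IDs that have no entry in the annotation, so it can be worth listing them before plotting. A minimal sketch, assuming a hypothetical helper named `missing_ids`, gene IDs in the first column of each list, and gene names in column 4 of the annotation (as in the `join` call above):

```shell
# Hypothetical helper (not in the original notebook): print gene IDs from
# GENE_LIST (column 1) that do not occur in column 4 of ANNOTATION_BED.
# Assumes ANNOTATION_BED is not empty.
missing_ids() {  # usage: missing_ids GENE_LIST ANNOTATION_BED
    awk -F '\t' 'FNR==NR { seen[$4]; next } !($1 in seen) { print $1 }' "$2" "$1"
}

# Demo on throwaway files with made-up gene names.
tmp=$(mktemp -d)
printf 'A01\t100\t900\tgeneA\t0\t+\n'  >  "$tmp/genes.bed"
printf 'A01\t2000\t3500\tgeneB\t0\t-\n' >> "$tmp/genes.bed"
printf 'geneA\ngeneX\n' > "$tmp/list.txt"
missing_ids "$tmp/list.txt" "$tmp/genes.bed"
rm -r "$tmp"
```

In this toy input only `geneX` is reported, because it is absent from the annotation.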

In [12]:
# write config file
echo '# base command: ngs.plot.r -G Bra3.0 -R bed -C config_leaves.txt -O leaves -P 6 -FL 300 -IN 1 -FS 10 -WD 5 -HG 5 -SE 1' > config_leaves.txt
echo '# Use TAB to separate the three columns: coverage file<TAB>gene list<TAB>title' >> config_leaves.txt
echo '# "title" will be shown in the figure legend.' >> config_leaves.txt
echo -e "$bam_1:$inp_1\t"${gene_path}"/leaf.high.bed\t'High'" >> config_leaves.txt
echo -e "$bam_1:$inp_1\t"${gene_path}"/leaf.medium.bed\t'Medium'" >> config_leaves.txt
echo -e "$bam_1:$inp_1\t"${gene_path}"/leaf.low.bed\t'Low'" >> config_leaves.txt
echo -e "$bam_1:$inp_1\t"${gene_path}"/leaf.no_expr.bed\t'No expr'" >> config_leaves.txt
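
The four data lines differ only in the expression level and the legend title, so they can also be generated in a loop. The sketch below is one way to do that, assuming a hypothetical helper named `write_config`. It writes only the data lines, not the comment header, and keeps the single-quoted titles used above.

```shell
# Hypothetical helper (not in the original notebook): write one config line
# per expression category, as coverage<TAB>gene list<TAB>title.
write_config() {  # usage: write_config COVERAGE GENE_DIR OUT_FILE
    : > "$3"   # start from an empty file
    for pair in high:High medium:Medium low:Low "no_expr:No expr"; do
        level=${pair%%:*}   # text before the colon, used in the file name
        title=${pair#*:}    # text after the colon, used as legend title
        printf "%s\t%s/leaf.%s.bed\t'%s'\n" "$1" "$2" "$level" "$title" >> "$3"
    done
}

# Demo with placeholder paths; in the notebook the call would be, for example,
#   write_config "$bam_1:$inp_1" "$gene_path" config_leaves.txt
tmp=$(mktemp -d)
write_config "chip.bam:input.bam" "/path/to/gene_lists" "$tmp/config.txt"
cat "$tmp/config.txt"
rm -r "$tmp"
```

Using `printf` with explicit `%s` placeholders avoids the nested quoting needed with `echo -e` and keeps paths containing backslashes intact.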

In [13]:
## run ngs.plot
ngs.plot.r \
-G Bra3.0 \
-R bed \
-C config_leaves.txt \
-O plot-leafBYexpr \
-P 6 \
-FL 300 \
-IN 1 \
-FS 10 \
-WD 5 \
-HG 5 \
-SE 1

Configuring variables...Done
Loading R libraries.....Done
In headerIndexBam(bam.list) :
  Aligner for: /home/jovyan/work/galaxy-res/chipseq1/2_filt-markdup_INPUT_merged.bam cannot be determined. Style of 
standard SAM mapping score will be used. Would you mind submitting an issue 
report to us on Github? This will benefit people using the same aligner.
'isNotPrimaryRead' is deprecated.
Use 'isSecondaryAlignment' instead.
See help("Deprecated") 