# Steps to continue data analysis after Galaxy

This notebook starts after reads have been mapped and peaks called on ChIP-seq data in Galaxy. Its goal is to estimate differential peak binding.

In [1]:
## setup working directory
jup_wd=~/work/jupyter-res
gal_wd=~/work/galaxy-res

# create the working directory and its figures subfolder if they are missing
mkdir -p "${jup_wd}/figures"
cd ${jup_wd}

### Gather files
Galaxy exports several files to a results directory. For the next steps, bed and bam files are used.

In [2]:
peak_caller=epic2

## find input files on system
bed_01=($(ls ~/work/galaxy-res/chipseq1/*bed | grep -i "${peak_caller}"))
bed_02=($(ls ~/work/galaxy-res/chipseq2/*bed | grep -i "${peak_caller}"))

bam_1=($(ls ~/work/galaxy-res/chipseq1/bam_files/*merged.bam* | grep -i chip))
inp_1=($(ls ~/work/galaxy-res/chipseq1/bam_files/*merged.bam* | grep -i input))

bam_2=($(ls ~/work/galaxy-res/chipseq2/bam_files/*merged.bam* | grep -i chip))
inp_2=($(ls ~/work/galaxy-res/chipseq2/bam_files/*merged.bam* | grep -i input))

### Analysis of differentially marked regions

Peaks were called by comparing ChIP alignments against INPUT samples, and the fold change (FC) measures the enrichment of ChIP over INPUT. Before running differential binding analysis, we applied a filter of FC > 1 (ChIP over INPUT) to enrich for better-defined peaks. To remove this filter, set `min_pk_fc=0`. Additionally, peaks on scaffolds that contained non-nuclear sequences were removed. Thus, the steps followed are:
* Filter epic2 peaks
* Run MAnorm
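
The filtering step can be illustrated on toy data. The sketch below is not the notebook's own cell: the scaffold names, coordinates, and fold changes are made up, and the fold change is assumed to sit in column 7, as in the `awk` commands used later. `join` keeps only peaks on listed scaffolds, and `awk` keeps those above the threshold.

```shell
# Toy data only: scaffold names, coordinates and fold changes are made up.
tmp=$(mktemp -d)
tab=$(printf '\t')

# Allowed (nuclear) scaffolds, one per line, sorted for join
printf '%s\n' A01 A02 | sort > "$tmp/nuclear.txt"

# BED-like peaks: chrom, start, end, name, score, strand, fold change
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    A01 100 500 island_1 10 . 2.5 \
    A01 900 1200 island_2 5 . 0.8 \
    Mt 50 400 island_3 30 . 4.0 \
    A02 300 700 island_4 12 . 1.7 > "$tmp/peaks.bed"
sort "$tmp/peaks.bed" > "$tmp/peaks.sorted.bed"

min_pk_fc=1

# join keeps peaks whose first column is in the scaffold list;
# awk then keeps rows whose column 7 exceeds the threshold
kept=$(join -t "$tab" "$tmp/nuclear.txt" "$tmp/peaks.sorted.bed" |
    awk -F '\t' -v fc="$min_pk_fc" '$7 > fc {print $4}')

echo "$kept"
rm -r "$tmp"
```

Here `island_2` is dropped by the fold-change filter and `island_3` by the scaffold list.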

#### Filter epic2 results
First, the largest peaks remaining after filtering by FC and scaffold origin are listed so that they can be inspected in IGV.

In [3]:
scaf_list=~/work/lib/brapa_genome/nuclear_scaff.txt
min_pk_fc=1

In [4]:
# leaves
awk -v fc=$min_pk_fc '$7>fc{print $4,$3-$2+1,$1,$2,$3}' OFS='\t' \
   <(join -t$'\t' $scaf_list <(sort $bed_01)) | sort -k2rn 2>/dev/null | head -5

island_15025	84400	Scaffold0276	12000	96399
island_1906	43800	A02	5991800	6035599
island_6749	31600	A05	1188400	1219999
island_13965	31200	A09	43780000	43811199
island_2067	30000	A02	8235200	8265199


In [5]:
# inflorescences
awk -v fc=$min_pk_fc '$7>fc{print $4,$3-$2+1,$1,$2,$3}' OFS='\t' \
   <(join -t$'\t' $scaf_list <(sort $bed_02)) | sort -k2rn 2>/dev/null | head -5

island_4572	12000	A03	22574400	22586399
island_4698	9400	A03	24657800	24667199
island_15554	8800	A10	17663400	17672199
island_10364	8600	A07	16591400	16599999
island_10441	8400	A07	18497000	18505399


Most noisy peaks are removed when filtering for fold change and non-nuclear scaffolds. One peak in the first sample (`island_15025`, on `Scaffold0276`) is removed manually. The second sample did not require additional filtering.
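
Removing peaks by name relies on the two-file `awk` idiom (`FNR == NR`). A minimal sketch on made-up peaks, assuming the peak name is in column 4:

```shell
tmp=$(mktemp -d)

# Peak names to drop (hypothetical) and a toy 4-column peak file
printf '%s\n' island_2 > "$tmp/remove.txt"
printf '%s\t%s\t%s\t%s\n' \
    A01 100 500 island_1 \
    A01 900 1200 island_2 \
    A02 300 700 island_3 > "$tmp/peaks.bed"

# While reading the first file, fill a lookup array;
# for the second file, print rows whose name is not in it
remaining=$(awk -F '\t' 'FNR == NR {drop[$1]; next} !($4 in drop)' \
    "$tmp/remove.txt" "$tmp/peaks.bed" | cut -f4)

echo "$remaining"
rm -r "$tmp"
```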

In [6]:
pk2rm_1=(island_15025)
bed_1="${bed_01/.bed/_filt.bed}"
bed_2="${bed_02/.bed/_filt.bed}"

# first input: names of peaks to remove; second input: scaffold-filtered peaks
awk -v fc=$min_pk_fc 'FNR==NR{a[$1];next;} $7>fc{if(!($4 in a)) {print $0}}' \
   RS=' ' <(echo ${pk2rm_1[@]}) \
   RS='\n' <(join -t$'\t' $scaf_list <(sort $bed_01)) \
   > ${bed_1}

awk -v fc=$min_pk_fc '$7>fc{print $0}' OFS='\t' \
   <(join -t$'\t' $scaf_list <(sort $bed_02)) \
   > ${bed_2}

In [7]:
# number of peaks before and after filtering
for bed in $bed_01 $bed_1 $bed_02 $bed_2
do
    wc -l $bed
done

15140 /home/jovyan/work/galaxy-res/chipseq1/epic2_peaks.bed
9058 /home/jovyan/work/galaxy-res/chipseq1/epic2_peaks_filt.bed
21397 /home/jovyan/work/galaxy-res/chipseq2/epic2_peaks.bed
8722 /home/jovyan/work/galaxy-res/chipseq2/epic2_peaks_filt.bed


#### MAnorm 
The filtered peak results were compared between the two samples. MAnorm uses only the ChIP files and compares the samples on an M-A plot to determine differentially marked regions.
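
As a reminder of what the M-A coordinates mean, the sketch below computes M (log2 ratio between two samples) and A (mean log2 intensity) for three made-up peaks. It shows only the coordinate transform, not MAnorm's normalization, and the read densities are illustrative values.

```shell
# Toy read densities for three peaks in two samples (made-up values)
ma=$(printf '%s\t%s\t%s\n' \
        peak_a 8 2 \
        peak_b 4 4 \
        peak_c 1 16 |
    awk -F '\t' '{
        m = log($2 / $3) / log(2)          # M: log2 ratio, sample 1 over sample 2
        a = 0.5 * log($2 * $3) / log(2)    # A: mean of the two log2 intensities
        printf "%s\t%.2f\t%.2f\n", $1, m, a
    }')

echo "$ma"
```

In this toy case all three peaks share the same A value while M ranges from positive (higher in the first sample) to negative (higher in the second).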

In [8]:
sample_1=leaf
sample_2=infl
manorm_dir=manorm-"$sample_1"VS"$sample_2"

manorm \
--peak1 "$bed_1" \
--peak2 "$bed_2" \
--peak-format bed \
--read1 "$bam_1" \
--read2 "$bam_2" \
--read-format bam \
--name1 "$sample_1" \
--name2 "$sample_2" \
--paired-end \
-o "$manorm_dir" \
2> manorm.log

#### Add some stats to manorm log and move it to folder

In [9]:
echo -e "\n# peaks\tM>0.1\tM>0.25\tM>0.5\tM>1" >> manorm.log
awk -F '\t' 'NR>1{m_val=sqrt($5^2); if(m_val>.1){a++;} if(m_val>.25){b++;} if(m_val>.5){c++;} if(m_val>1){d++;} }END{print "total",a,b,c,d}' OFS='\t' "$manorm_dir"/*xls >> manorm.log
awk -F '\t' 'NR>1&&$5>0{m_val=$5; if(m_val>.1){a++;} if(m_val>.25){b++;} if(m_val>.5){c++;} if(m_val>1){d++;} }END{print "M > 0",a,b,c,d}' OFS='\t' "$manorm_dir"/*xls >> manorm.log
awk -F '\t' 'NR>1&&$5<0{m_val=-$5; if(m_val>.1){a++;} if(m_val>.25){b++;} if(m_val>.5){c++;} if(m_val>1){d++;} }END{print "M < 0",a,b,c,d}' OFS='\t' "$manorm_dir"/*xls >> manorm.log

mv manorm.log "$manorm_dir"

echo "$(( $(cat $manorm_dir/*xls | wc -l) - 1 )) combined peaks analyzed by MAnorm"
tail -4 $manorm_dir/manorm.log

10726 combined peaks analyzed by MAnorm
# peaks	M>0.1	M>0.25	M>0.5	M>1
total	9718	8238	5986	3199
M > 0	5949	5214	4018	2408
M < 0	3769	3024	1968	791
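
The three `awk` passes could also be folded into a single pass. The following is a sketch on a toy table, assuming one header line and the M value in column 5 as in the commands above. The table contents and column names are made up.

```shell
tmp=$(mktemp)

# Header plus five toy rows; only column 5 (M value) matters here
printf 'chr\tstart\tend\tsummit\tM_value\n' > "$tmp"
printf 'A01\t1\t100\t50\t%s\n' 0.05 0.3 -0.6 1.2 -2 >> "$tmp"

stats=$(awk -F '\t' -v OFS='\t' '
    BEGIN { n = split("0.1 0.25 0.5 1", thr, " ") }
    NR > 1 {
        m = $5 + 0
        am = (m < 0) ? -m : m                  # absolute M value
        for (i = 1; i <= n; i++) {
            if (am > thr[i] + 0) {
                total[i]++
                if (m > 0) up[i]++; else down[i]++
            }
        }
    }
    END {
        split("total|M > 0|M < 0", label, "|")
        for (r = 1; r <= 3; r++) {
            line = label[r]
            for (i = 1; i <= n; i++) {
                v = (r == 1) ? total[i] : ((r == 2) ? up[i] : down[i])
                line = line OFS (v + 0)        # unset counters print as 0
            }
            print line
        }
    }' "$tmp")

echo "$stats"
rm "$tmp"
```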


## ChIP-seq read counts on bins for Scatterplot
To compare ChIP-seq experiments, read coverage is computed on fixed-size genomic bins and used for sample-wise visualization and correlation. Alignment files and their indexes for the samples to plot are downloaded and renamed manually into the corresponding analysis folder. Note: index files are `sample.bai`.
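
To make the correlation step concrete, here is a minimal sketch that computes a Pearson coefficient between two samples from a per-bin count table. The table is a made-up stand-in: it assumes one header line, three coordinate columns, and one count column per sample from column 4 onward.

```shell
tmp=$(mktemp)

# Toy per-bin counts for two samples (s1, s2); values are made up
printf '%s\t%s\t%s\t%s\t%s\n' \
    "#chr" start end s1 s2 \
    A01 0 10000 10 20 \
    A01 10000 20000 20 40 \
    A01 20000 30000 30 60 \
    A01 30000 40000 40 80 > "$tmp"

# Accumulate the sums needed for Pearson's r in one pass
r=$(awk -F '\t' 'NR > 1 {
        x = $4; y = $5; n++
        sx += x; sy += y; sxx += x * x; syy += y * y; sxy += x * y
    }
    END {
        num = n * sxy - sx * sy
        den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
        printf "%.3f", num / den
    }' "$tmp")

echo "Pearson r = $r"
rm "$tmp"
```

Because the second toy sample is exactly twice the first, the coefficient is 1.000.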

In [None]:
# bam files are renamed to remove the history number (this step can be skipped)
# A pattern is used to avoid targeting merged bam files
for f in ${gal_wd}/chipseq*/bam_files/*[^d].bam; do mv $f $(echo $f | sed 's;/[0-9]\{3\}_;/;'); done
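
Before renaming, the `sed` pattern can be checked as a dry run. The path below is a hypothetical example of a file name carrying a three-digit history prefix:

```shell
# Hypothetical exported file name with a three-digit history number
f=/data/chipseq1/bam_files/123_leaf_C1.bam

# Same substitution as in the loop: drop "NNN_" right after a slash
new=$(echo "$f" | sed 's;/[0-9]\{3\}_;/;')

# Print the command instead of running it
echo "mv $f $new"
```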

In [10]:
ls ~/work/galaxy-res/chipseq*/*[^d].bam*

/home/jovyan/work/galaxy-res/chipseq1/leaf_C1.bam
/home/jovyan/work/galaxy-res/chipseq1/leaf_C3.bam
/home/jovyan/work/galaxy-res/chipseq1/leaf_I1.bam
/home/jovyan/work/galaxy-res/chipseq1/leaf_I3.bam
/home/jovyan/work/galaxy-res/chipseq2/infl_C1.bam
/home/jovyan/work/galaxy-res/chipseq2/infl_I1.bam


In [None]:
# bam files are indexed with samtools.
cd ${gal_wd}/chipseq1/bam_files/
conda activate samtools

for f in *bam; do samtools index ${f} ${f}.bai; done
cd ${gal_wd}/chipseq2/bam_files/
for f in *bam; do samtools index ${f} ${f}.bai; done

conda deactivate

In [11]:
cpus=5
conda activate deeptools

multiBamSummary bins \
 --bamfiles ~/work/galaxy-res/chipseq*/*[^d].bam \
 --binSize 10000 \
 --numberOfProcessors ${cpus} \
 -out ChIP_counts.npz \
 --outRawCounts ChIP_counts.tab \
 --scalingFactors ChIP_cnt_sf.tab \
 2> log_multibam.err
 
conda deactivate


### Draw metagene plot and gene heatmap
A metagene plot helps visualize the distribution of the mark across genes. For normalization, we used both ChIP and INPUT files.

FIRST RUN WITHOUT DRAWING

In [12]:
mkdir -p ~/work/jupyter-res/ngsplot && cd ~/work/jupyter-res/ngsplot
genes=~/work/lib/brapa_genome/Bra_3.0_genes.bed 
cpus=6

# leaves
ngs.plot.r \
-P $cpus \
-G Bra3.0 \
-R bed \
-E "$genes" \
-C "$bam_1":"$inp_1" \
-O allgenes_leaf \
-FI 1

# inflorescences
ngs.plot.r \
-P $cpus \
-G Bra3.0 \
-R bed \
-E "$genes" \
-C "$bam_2":"$inp_2" \
-O allgenes_infl \
-FI 1

rm 2-*

Configuring variables...Done
Loading R libraries.....Done
In headerIndexBam(bam.list) :
  Aligner for: /home/jovyan/work/galaxy-res/chipseq1/2-MarkDupes_INPUT_merged.bam cannot be determined. Style of 
standard SAM mapping score will be used. Would you mind submitting an issue 
report to us on Github? This will benefit people using the same aligner.
'isNotPrimaryRead' is deprecated.
Use 'isSecondaryAlignment' instead.
See help("Deprecated") 

# After running R notebook
In the R notebook, genes are categorized by expression level. Here, a metagene plot shows the average mark level on the genes of each category. Three steps are taken:
* Prepare bed files from each list with a provided bed annotation of gene models
* Write a configuration file with paths to bam and bed files
* Run ngs.plot
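
The first step, turning a gene list into a BED file, comes down to a `join` on the gene identifier. The sketch below uses made-up gene IDs and a toy 6-column annotation, assuming the identifier is in column 4 of the BED file:

```shell
tmp=$(mktemp -d)
tab=$(printf '\t')

# Toy gene list (one ID per line) and toy 6-column BED annotation
printf '%s\n' geneB geneA > "$tmp/high.txt"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    A01 100 900 geneA 0 + \
    A01 2000 2600 geneC 0 - \
    A02 50 700 geneB 0 + > "$tmp/genes.bed"

# join needs both inputs sorted on the join field
sort "$tmp/high.txt" > "$tmp/high.sorted.txt"
sort -t "$tab" -k4,4 "$tmp/genes.bed" > "$tmp/genes.sorted.bed"

# Match list column 1 against BED column 4 and output the six BED columns
subset=$(join -t "$tab" -1 1 -2 4 -o 2.1,2.2,2.3,2.4,2.5,2.6 \
    "$tmp/high.sorted.txt" "$tmp/genes.sorted.bed")

echo "$subset"
rm -r "$tmp"
```

Only `geneA` and `geneB` are kept. `geneC` is in the annotation but not in the list.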

SECOND RUN FILTERED GENES

In [13]:
cd  ~/work/jupyter-res/ngsplot
cpus=6

## leaves
ngs.plot.r \
-P $cpus \
-G Bra3.0 \
-R bed \
-E epic2_marked_genes_LF.bed \
-C "$bam_1":"$inp_1" \
-T "" \
-IN 1 \
-O filtgenes_leaf \
-FS 20 \
-SE 1 -LEG 0 \
-RR 50 \
-CD 0.7 -CO darkred:yellow:darkgreen

## inflorescences
ngs.plot.r \
-P $cpus \
-G Bra3.0 \
-R bed \
-E epic2_marked_genes_FL.bed \
-C "$bam_2":"$inp_2" \
-T "" \
-O filtgenes_infl \
-FS 20 \
-SE 1 -LEG 0 \
-CD 0.7 -CO darkred:yellow:darkgreen \
-RR 45 -RB 0.05

Configuring variables...Done
Loading R libraries.....Done
In headerIndexBam(bam.list) :
  Aligner for: /home/jovyan/work/galaxy-res/chipseq1/2-MarkDupes_INPUT_merged.bam cannot be determined. Style of 
standard SAM mapping score will be used. Would you mind submitting an issue 
report to us on Github? This will benefit people using the same aligner.
'isNotPrimaryRead' is deprecated.
Use 'isSecondaryAlignment' instead.
See help("Deprecated") 
..........................................................................................................................................................................................................................................................Done
Plotting figures...Done
Saving results...Done
Wrapping results up...sh: 1: : Permission denied
In system2(zip, args) : error in running command
Done
All done. Cheers!
Configuring variables...Done
Loading R libraries.....Done
'isNotPrimaryRead' is deprecated.
Use 'isSecondaryAlignment' instead.
See 

RUN BY GENE EXPRESSION

In [14]:
# prepare bed files
gene_path=~/work/jupyter-res/gene_lists
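for f in "${gene_path}"/*txt; do
    # join needs both inputs sorted on the join field (list column 1, BED column 4)
    join -2 4 -o 2.{1..6} -t $'\t' \
        <(sort -t $'\t' -k1,1 "$f") \
        <(sort -t $'\t' -k4,4 ~/work/lib/brapa_genome/Bra_3.0_genes.bed) \
        > "${f%txt}bed"
done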

In [15]:
# write config file leaves
echo '# base command: ngs.plot.r -G Bra3.0 -R bed -C config_leaves.txt -O leaves -P 6 -FL 300 -IN 1 -FS 10 -WD 5 -HG 5 -SE 1' > config_leaves.txt
echo '# Use TAB to separate the three columns: coverage file<TAB>gene list<TAB>title' >> config_leaves.txt
echo '# "title" will be shown in the figure legend.' >> config_leaves.txt
echo -e "$bam_1:$inp_1\t"${gene_path}"/leaf.high.bed\t'High'" >> config_leaves.txt
echo -e "$bam_1:$inp_1\t"${gene_path}"/leaf.medium.bed\t'Medium'" >> config_leaves.txt
echo -e "$bam_1:$inp_1\t"${gene_path}"/leaf.low.bed\t'Low'" >> config_leaves.txt
echo -e "$bam_1:$inp_1\t"${gene_path}"/leaf.no_expr.bed\t'No expr'" >> config_leaves.txt

# write config file infl
echo '# base command: ngs.plot.r -G Bra3.0 -R bed -C config_infl.txt -O infl -P 6 -FL 300 -IN 1 -FS 10 -WD 5 -HG 5 -SE 1' > config_infl.txt
echo '# Use TAB to separate the three columns: coverage file<TAB>gene list<TAB>title' >> config_infl.txt
echo '# "title" will be shown in the figure legend.' >> config_infl.txt
echo -e "$bam_2:$inp_2\t"${gene_path}"/infl.high.bed\t'High'" >> config_infl.txt
echo -e "$bam_2:$inp_2\t"${gene_path}"/infl.medium.bed\t'Medium'" >> config_infl.txt
echo -e "$bam_2:$inp_2\t"${gene_path}"/infl.low.bed\t'Low'" >> config_infl.txt
echo -e "$bam_2:$inp_2\t"${gene_path}"/infl.no_expr.bed\t'No expr'" >> config_infl.txt
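
The four data lines of each configuration file follow one pattern, so they could also be generated in a loop. This is only a sketch: the `bam`, `inp`, `gene_path`, and `tissue` values are placeholders rather than the notebook's variables, and the output goes to a temporary file.

```shell
# Placeholder values for illustration only
bam=chip.bam; inp=input.bam; gene_path=gene_lists; tissue=leaf
cfg=$(mktemp)

for level in high:High medium:Medium low:Low no_expr:"No expr"; do
    file_tag=${level%%:*}     # part before the colon, used in the file name
    title=${level#*:}         # part after the colon, shown in the legend
    # three TAB-separated columns: coverage files, gene list, title
    printf "%s:%s\t%s/%s.%s.bed\t'%s'\n" \
        "$bam" "$inp" "$gene_path" "$tissue" "$file_tag" "$title"
done > "$cfg"

cfg_lines=$(cat "$cfg")
echo "$cfg_lines"
rm "$cfg"
```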

In [16]:
cpus=6

## run ngs.plot
for f in config*
do
    s=$(sed 's;config_;;' <(echo ${f%%.txt}))
    ngs.plot.r \
    -G Bra3.0 \
    -R bed \
    -C $f \
    -O plot-"$s"BYexpr \
    -P $cpus \
    -FL 300 \
    -IN 1 \
    -FS 10 -WD 5 -HG 5 \
    -SE 1
done

Configuring variables...Done
Loading R libraries.....Done
'isNotPrimaryRead' is deprecated.
Use 'isSecondaryAlignment' instead.
See help("Deprecated") 