This repository contains the shapes and in-house scripts used in our paper Self-Organizing Maps with Variable Neighborhoods Facilitate Learning of Chromatin Accessibility Signal Shapes Associated with Regulatory Elements. The code in this repository can be used to annotate new chromatin accessibility samples or learn new shapes. We have also included code to replicate our results. Our code is designed for use in a Unix environment and can be run using a command-line interface.
Prerequisites:

- Python 2.7 or higher. Note that your version of Python must be compiled to use UCS2. You can check this by running the following commands at the Python prompt:

  import sys
  print sys.maxunicode

  If it uses UCS2, this prints 65535 (a UCS4 build prints 1114111).
- wig_split.py. This can be obtained via the taolib repository. Note: we plan to release a future version without this dependency.
- The GCC compiler. Some of our file I/O scripts are written in C and will need to be compiled using GCC, which should be available on most Unix systems.
- bedtools (see the bedtools documentation for installation instructions).
- The Unix utilities shuf, cut, and awk. These should be available on most Unix systems.
- The Python modules numpy, scipy, HTSeq, argparse, logging, sys, itertools, and pysam.
- If you plan to learn new shapes, the Python module tensorflow.
While a pre-existing version of gosr exists, our framework uses a modified version that preserves zeros in the signal profile. gosr should automatically be downloaded when you download our repository. To install it, simply untar the file gosr.tar provided in this repository.
You will need to add the following paths to your PYTHONPATH variable:
- <installation_path>/SOM-VN/gosr/gosr/tools
- <path_to_gap_statistic_code>
You will need to add the following paths to your PATH variable:
- <installation_path>/SOM-VN/gosr/bin
- The directory where you installed bedtools.
- The directory containing the taolib folder for wig_split.py.
To segment regions for learning shapes or annotating shapes, you will need to run the following:

gosr binbam -f 0 -n 1000 -t <name_of_sample> <bam_file> 50 <name_of_sample> > <name_of_WIG_file>
sed '3d' <name_of_WIG_file> > <name_of_WIG_file>_noheader.wig
mkdir <chrom_wig_dir>
python wig_split.py <name_of_WIG_file>_noheader.wig <chrom_wig_dir>/<name_of_sample>
gcc -pthread -o run_get_data ../common_scripts/get_file_data.c -lm
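As an aside, the `sed '3d'` step above simply deletes the third line of the WIG file, a header line that downstream scripts cannot parse (hence the `_noheader` suffix). A minimal, self-contained illustration; the toy WIG content below is invented and only stands in for real gosr output:

```shell
# Toy stand-in for a gosr WIG file; line 3 plays the role of the header to drop.
printf 'track type=wiggle_0\nfixedStep chrom=chr1 start=1 step=50\nheader-to-drop\n12\n7\n' > toy.wig

# Delete line 3, as in the pipeline above.
sed '3d' toy.wig > toy_noheader.wig

cat toy_noheader.wig   # 4 lines remain; the header line is gone
```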
You may annotate new samples using the shape files we provide or using new shapes that you have learned. In the pipeline below, file_containing_shapes refers to the shape file you are using. The code you will need for this task is in the folder annotation_scripts. To annotate the new samples, you will need to do the following for each chromosome, where annotation_file is the location where you wish to store the annotated regions:

python make_annotated_bed.py <regions_to_annotate> <file_containing_shapes> <annotation_file> <chrom_wig_file> 0.0

This will generate two files: one that contains the annotated regions and their scores, and one that also includes the original signal for each annotated region, saved as <annotation_file>clust. Then sort and consolidate the annotations:

bedtools sort -i <annotation_file> > <annotation_sorted_file>
bedtools sort -i <annotation_file>clust > <annotation_sorted_file>clust
python common_scripts/consolidate_bed.py <annotation_sorted_file> <annotation_consolidated_file>
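`bedtools sort` orders a BED file by chromosome name and then numerically by start coordinate. If you want to sanity-check a sorted file without bedtools, GNU sort gives the same default ordering; the toy data below is invented:

```shell
# Invented annotations, out of order.
printf 'chr2\t100\t200\tPromoter\nchr1\t500\t600\tEnhancer\nchr1\t50\t150\tWeak\n' > toy_anno.bed

# Chromosome first (lexicographic), then start position (numeric).
sort -k1,1 -k2,2n toy_anno.bed > toy_anno_sorted.bed

cat toy_anno_sorted.bed   # chr1 intervals first, ordered by start
```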
The code you will need for this task is in the folder shape_learning_scripts. To learn shapes for one chromosome, you will need to do the following:
shuf <regions_to_train> > <shuffled_regions_to_train>
python shift_input.py <shuffled_regions_to_train> <shuffled_regions_to_train> 50 4000 <chrom_wig_file>.chr<chrom> false 0
python som_vn.py <shuffled_regions_to_train> <som_shapes_learned> <chrom_wig_file>.chr<chrom> 4000 50 0 False
python remove_by_cutoff.py <som_shapes_learned> 1 <som_shapes_filtered>
python merge_shifted.py <som_shapes_filtered> <som_shapes_shifted> 0
python kmeans_shapes.py <som_shapes_shifted> <som_shapes_final>
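For context, the `shuf` step only randomizes the order in which training regions are presented to the SOM; it neither adds nor drops lines. A quick self-contained check of that property on invented data:

```shell
# Four invented region identifiers.
printf 'region1\nregion2\nregion3\nregion4\n' > toy_regions.txt
shuf toy_regions.txt > toy_regions_shuffled.txt

# Same lines, possibly in a different order: sorted copies are identical.
sort toy_regions.txt > toy_sorted_a.txt
sort toy_regions_shuffled.txt > toy_sorted_b.txt
cmp toy_sorted_a.txt toy_sorted_b.txt && echo "same contents"
```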
python make_shape_bed.py <regions_to_annotate> <som_shapes_final> <regions_annotated> 0
bedtools sort -i <regions_annotated> > <regions_annotated_sorted>
python consolidate.py <regions_annotated_sorted> <regions_annotated_final>
cut -d$'\t' -f 1,2,3,4,5 <regions_annotated_final> > <regions_annotated_final>.bed
awk '{ print $6}' <regions_annotated_final> > <clusters_annotated_final>
cut -d$'\t' -f 7,8,9,10 <regions_annotated_final> > <scores_annotated_final>
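The `cut` and `awk` commands above split each consolidated row by column: fields 1-5 become the BED annotation, field 6 the shape (cluster) assignment, and fields 7-10 the scores. (Tab is already cut's default delimiter, so `-d$'\t'` is optional.) On an invented 10-column row:

```shell
# Invented row standing in for one line of <regions_annotated_final>.
printf 'chr1\t100\t600\tPromoter\t0.9\t17\t0.1\t0.2\t0.3\t0.4\n' > toy_final.txt

cut -f 1,2,3,4,5 toy_final.txt    # BED columns
awk '{ print $6 }' toy_final.txt  # shape/cluster id
cut -f 7,8,9,10 toy_final.txt     # score columns
```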
bedtools intersect -wao -a <regions_annotated_final>.bed -b <chromhmm_mnemonic_file> > <intersects>
bedtools sort -i <intersects> > <intersects_sorted>
python consolidate_chromHMM.py <intersects_sorted> <som_shapes_final> <shapes> <chrom_wig_file>.chr<chrom> <name_of_sample> <regions_to_annotate> 0
To merge shapes learned across multiple chromosomes, run the following:
python common_scripts/merge_significant.py <shapes> <shapes_all> <shapes_log>
All of our data can be downloaded by running download_all_files.sh in the location where you wish to save your BAM files. The files will be indexed by cell type and SPOT score.
To evaluate null models, you will need to learn shapes using SOM-VN with unpermuted input and SOM-VN with permuted input. You will then need to associate the learned shapes with both the original, unpermuted ChromHMM RE and the permuted ChromHMM RE.
- Run the following scripts. Note that they are designed to be run in a SLURM supercomputing environment, but can be modified to be run on a local machine.
learn_shapes_A549_slurm.sh
learn_shapes_b_cell_low_slurm.sh
learn_shapes_b_cell_high_slurm.sh
learn_shapes_Brain_low_slurm.sh
learn_shapes_Brain_high_slurm.sh
learn_shapes_GM12878_slurm.sh
learn_shapes_H1_low_slurm.sh
learn_shapes_H1_high_slurm.sh
learn_shapes_HeLa_low_slurm.sh
learn_shapes_HeLa_high_slurm.sh
learn_shapes_heart_low_slurm.sh
learn_shapes_heart_high_slurm.sh
learn_shapes_stomach_low_slurm.sh
learn_shapes_stomach_high_slurm.sh
- Merge shapes learned across chromosomes. To do this, run:

  python common_scripts/merge_significant.py $BASE_PATH/anno_A549_consolidated <shapes_all_A549> <shapes_log_A549>
  ...
  python common_scripts/merge_significant.py $BASE_PATH/anno_stomach_high_consolidated <shapes_all_stomach_high> <shapes_log_stomach_high>
- Run the following scripts. Note that they are designed to be run in a SLURM supercomputing environment, but can be modified to be run on a local machine.
learn_from_perm_wig_A549_slurm.sh
learn_from_perm_wig_b_cell_low_slurm.sh
learn_from_perm_wig_b_cell_high_slurm.sh
learn_from_perm_wig_Brain_low_slurm.sh
learn_from_perm_wig_Brain_high_slurm.sh
learn_from_perm_wig_GM12878_slurm.sh
learn_from_perm_wig_H1_low_slurm.sh
learn_from_perm_wig_H1_high_slurm.sh
learn_from_perm_wig_HeLa_low_slurm.sh
learn_from_perm_wig_HeLa_high_slurm.sh
learn_from_perm_wig_heart_low_slurm.sh
learn_from_perm_wig_heart_high_slurm.sh
learn_from_perm_wig_stomach_low_slurm.sh
learn_from_perm_wig_stomach_high_slurm.sh
- Merge shapes learned across chromosomes. To do this, run:

  python common_scripts/merge_significant.py $BASE_PATH/anno_A549_consolidated_perm <shapes_all_A549_perm> <shapes_log_A549_perm>
  ...
  python common_scripts/merge_significant.py $BASE_PATH/anno_stomach_high_consolidated_perm <shapes_all_stomach_high_perm> <shapes_log_stomach_high_perm>
- Run

  python permute_chromhmm.py <chromhmm_mnemonic_file> <permuted_chromhmm_mnemonic_file>
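Conceptually, permuting the ChromHMM file shuffles the state labels among the intervals while leaving the intervals themselves fixed, which destroys any genuine association between signal shape and RE. The sketch below illustrates that idea with standard utilities on invented data; permute_chromhmm.py remains the authoritative implementation:

```shell
# Toy ChromHMM-style file: chrom, start, end, state label (all invented).
printf 'chr1\t0\t200\t1_TssA\nchr1\t200\t400\t7_Enh\nchr1\t400\t600\t15_Quies\n' > toy_chromhmm.bed

# Shuffle column 4 independently of columns 1-3, then paste them back together.
cut -f 1-3 toy_chromhmm.bed > toy_coords.bed
cut -f 4 toy_chromhmm.bed | shuf > toy_labels.txt
paste toy_coords.bed toy_labels.txt > toy_chromhmm_perm.bed

cat toy_chromhmm_perm.bed   # same intervals, labels in random order
```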
- Run the following scripts. Note that they are designed to be run in a SLURM supercomputing environment, but can be modified to be run on a local machine.
associate_from_perm_chromhmm_A549_slurm.sh
associate_from_perm_chromhmm_b_cell_low_slurm.sh
associate_from_perm_chromhmm_b_cell_high_slurm.sh
associate_from_perm_chromhmm_Brain_low_slurm.sh
associate_from_perm_chromhmm_Brain_high_slurm.sh
associate_from_perm_chromhmm_GM12878_slurm.sh
associate_from_perm_chromhmm_H1_low_slurm.sh
associate_from_perm_chromhmm_H1_high_slurm.sh
associate_from_perm_chromhmm_HeLa_low_slurm.sh
associate_from_perm_chromhmm_HeLa_high_slurm.sh
associate_from_perm_chromhmm_heart_low_slurm.sh
associate_from_perm_chromhmm_heart_high_slurm.sh
associate_from_perm_chromhmm_stomach_low_slurm.sh
associate_from_perm_chromhmm_stomach_high_slurm.sh
- Merge shapes learned across chromosomes. To do this, run:

  python common_scripts/merge_significant.py $BASE_PATH/anno_A549_consolidated_chromhmm_perm <shapes_all_A549_chromhmm_perm> <shapes_log_A549_chromhmm_perm>
  ...
  python common_scripts/merge_significant.py $BASE_PATH/anno_stomach_high_consolidated_chromhmm_perm <shapes_all_stomach_high_chromhmm_perm> <shapes_log_stomach_high_chromhmm_perm>
- Run the following scripts:
run_crosscorr_A549_slurm.sh
run_crosscorr_b_cell_low_slurm.sh
run_crosscorr_b_cell_high_slurm.sh
run_crosscorr_Brain_low_slurm.sh
run_crosscorr_Brain_high_slurm.sh
run_crosscorr_GM12878_slurm.sh
run_crosscorr_H1_low_slurm.sh
run_crosscorr_H1_high_slurm.sh
run_crosscorr_HeLa_low_slurm.sh
run_crosscorr_HeLa_high_slurm.sh
run_crosscorr_heart_low_slurm.sh
run_crosscorr_heart_high_slurm.sh
run_crosscorr_stomach_low_slurm.sh
run_crosscorr_stomach_high_slurm.sh
- Make the plots. To do this, run:

  python meta_analysis_scripts/plot_crosscorr_distrib.py $BASE_PATH/crosscorr_A549/anno $BASE_PATH/crosscorr_A549_perm/anno <name_of_plot_A549> A549
  ...
  python meta_analysis_scripts/plot_crosscorr_distrib.py $BASE_PATH/crosscorr_stomach_high/anno $BASE_PATH/crosscorr_stomach_high_perm/anno <name_of_plot_stomach_high> "Stomach (High)"
- Run the hypothesis tests, i.e.

  python crosscorr_hypothesis_tests.py $BASE_PATH/crosscorr_A549/anno $BASE_PATH/crosscorr_A549_perm/anno
To run CAGT, note that you must have MATLAB installed. In addition, while CAGT is available for download separately, our local copy of CAGT contains a slight modification that prevents the program from stopping early in the case of slow convergence.
- Learn SOM-VN shapes as in "Learning SOM-VN shapes from permuted signal" above.
- Run the following commands before starting CAGT.
mkdir /data/eichertd/som_vn_data/matlab_matrix_<cell_type>
mkdir /data/eichertd/som_vn_data/cagt_out_<cell_type>/
- For each chromosome and cell type, run the following:
matlab -nodisplay -nodesktop -r "run_cagt('/data/eichertd/som_vn_data/training_<cell_type>_shifted/chrom<chrom>','/data/eichertd/som_vn_data/matlab_matrix_<cell_type>','<chrom>','/data/eichertd/som_vn_data/cagt_out_<cell_type>/<chrom>.csv','$BASE_PATH/cagt/trunk/matlab/src/', '0.8', '1000', '40')"
- Run the following scripts:
run_cagt_analysis_A549_slurm.sh
run_cagt_analysis_b_cell_low_slurm.sh
run_cagt_analysis_b_cell_high_slurm.sh
run_cagt_analysis_Brain_low_slurm.sh
run_cagt_analysis_Brain_high_slurm.sh
run_cagt_analysis_GM12878_slurm.sh
run_cagt_analysis_H1_low_slurm.sh
run_cagt_analysis_H1_high_slurm.sh
run_cagt_analysis_HeLa_low_slurm.sh
run_cagt_analysis_HeLa_high_slurm.sh
run_cagt_analysis_heart_low_slurm.sh
run_cagt_analysis_heart_high_slurm.sh
run_cagt_analysis_stomach_low_slurm.sh
run_cagt_analysis_stomach_high_slurm.sh
- Merge shapes learned using CAGT for each cell type, e.g.
python ../common_scripts/merge_significant.py $BASE_PATH/anno_A549_consolidated_cagt <shapes_all_A549_cagt> <shapes_log_A549_cagt>
- Run

  python plot_precision_recall_nobaselines.py $BASE_PATH/annotated_merged_<cell_type>/ $BASE_PATH/annotated_consolidated_<cell_type>/ $BASE_PATH/pr_<cell_type>.png "<cell_type> SOM-VN" $BASE_PATH/wig/<cell_type>/<cell_type>.chr

  to plot precision and recall for SOM-VN.
- Run

  python meta_analysis_scripts/save_precision_recall.py $BASE_PATH/annotated_merged_<cell_type>/ $BASE_PATH/annotated_consolidated_<cell_type>/ <wig_chrom_dir> <cell_type_precision_recall_file>

- Run

  python plot_precision_recall_nobaselines.py $BASE_PATH/annotated_merged_<cell_type>_cagt/ $BASE_PATH/annotated_consolidated_<cell_type>_cagt/ $BASE_PATH/pr_<cell_type>_cagt.png "<cell_type> CAGT" $BASE_PATH/wig/<cell_type>/<cell_type>.chr

  to plot precision and recall for CAGT.
- Run

  python meta_analysis_scripts/save_precision_recall.py $BASE_PATH/annotated_merged_<cell_type>_cagt/ $BASE_PATH/annotated_consolidated_<cell_type>_cagt/ <wig_chrom_dir> <cell_type_precision_recall_file>

- Run

  python meta_analysis_scripts/save_precision_recall_threshold.py $BASE_PATH/annotated_merged_<cell_type>/ $BASE_PATH/annotated_consolidated_<cell_type>/ <wig_chrom_dir> <cell_type_precision_recall_file>
- Run the following scripts:
annotate_b_cell_low_from_high.sh
annotate_HeLa_low_from_high.sh
annotate_Heart_low_from_high.sh
annotate_Stomach_low_from_high.sh
- Run

  mkdir <b_cell_precision_recall_directory>

- Run

  python meta_analysis_scripts/save_precision_recall.py $BASE_PATH/annotated_merged_b_cell_high_b_cell_low/ $BASE_PATH/annotated_consolidated_b_cell_high_b_cell_low/ <b_cell_wig_chrom_dir> <b_cell_precision_recall_directory>
- Run

  shape_learning_scripts/magnitude_all.sh

  to associate magnitude with RE on each chromosome of each cell type.
- Run

  python merge_significant_magnitude.py $BASE_PATH/anno_<cell_type>_magnitude_consolidated/ $BASE_PATH/magnitudes_<cell_type> $BASE_PATH/magnitudes_<cell_type>.log

- Run

  annotation_scripts/annotate_magnitude_all.sh

  to annotate each chromosome of each cell type with the learned magnitude associations for that cell type.
- Run

  python meta_analysis_scripts/save_precision_recall.py $BASE_PATH/annotated_merged_magnitude_<cell_type>/ $BASE_PATH/annotated_consolidated_magnitude_<cell_type>/ <wig_chrom_dir> <cell_type_precision_recall_file>
For each sample, run the following:

- For each chromosome and RE, run

  awk '($4 == "<RE>")' $BASE_PATH/annotated_consolidated_<sample>/anno<chrom>.bed > $BASE_PATH/annotated_consolidated_<sample>/anno<chrom>_known_<RE>.bed

  A shortcut is to run ./get_re_specific_anno.sh.
- Concatenate the per-chromosome files, and do the same for any additional samples:

  cat $BASE_PATH/annotated_consolidated_<sample>/anno*_known_<RE>.bed > $BASE_PATH/annotated_consolidated_<sample>/anno_all_known_<RE>.bed

- Intersect the annotations with the sample's peaks and count the overlap:

  bedtools intersect -a $BASE_PATH/annotated_consolidated_<sample>/anno_all_known_<RE>.bed -b $PEAKS/<sample>.bed -u > annotations_<RE>_<cell_type>_peak.bed
  wc -l $BASE_PATH/annotated_consolidated_<sample>/anno_all_known_<RE>.bed
  wc -l annotations_<RE>_<cell_type>_peak.bed
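The `awk '($4 == "<RE>")'` filter above keeps only rows whose fourth column matches the RE of interest. On invented data:

```shell
# Invented per-chromosome annotations: two promoters and one enhancer.
printf 'chr1\t100\t200\tPromoter\nchr1\t300\t400\tEnhancer\nchr1\t500\t600\tPromoter\n' > toy_anno_chr1.bed

# Keep only rows labeled Promoter in column 4.
awk '($4 == "Promoter")' toy_anno_chr1.bed > toy_known_Promoter.bed

wc -l < toy_known_Promoter.bed   # 2 rows survive the filter
```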
Run the following for each cell type.

- Run

  cat $CHROMHMM/b_cell_E032.bed | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

  to compute the total coverage of the entire ChromHMM file.
- Run

  awk '($4 == "6_EnhG" || $4 == "7_Enh" || $4 == "12_EnhBiv")' $CHROMHMM/b_cell_E032.bed > $CHROMHMM/b_cell_E032_enhancer.bed

  to obtain a ChromHMM file with only enhancers.
- Run

  awk '($4 == "1_TssA" || $4 == "2_TssAFlnk" || $4 == "10_TssBiv" || $4 == "11_BivFlnk")' $CHROMHMM/b_cell_E032.bed > $CHROMHMM/b_cell_E032_promoter.bed

  to obtain a ChromHMM file with only promoters.
- Run

  awk '($4 == "9_Het" || $4 == "15_Quies")' $CHROMHMM/b_cell_E032.bed > $CHROMHMM/b_cell_E032_weak.bed

  to obtain a ChromHMM file with only weak RE.
- Run

  cat $CHROMHMM/b_cell_E032_<RE>.bed | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

  to compute the total coverage of ChromHMM for each RE (promoter, enhancer, weak).
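The coverage command sums end minus start over all intervals in the file. Checking it on a toy BED file:

```shell
# Two invented intervals covering 100 bp and 250 bp: total coverage 350.
printf 'chr1\t0\t100\t7_Enh\nchr2\t50\t300\t1_TssA\n' > toy_cov.bed

# Same summation as above; awk can read the file directly without cat.
awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}' toy_cov.bed   # prints 350
```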
First, download and unzip the peaks for each high quality brain tissue replicate.
wget https://www.encodeproject.org/files/ENCFF018ATG/@@download/ENCFF018ATG.bed.gz
gunzip ENCFF018ATG.bed.gz
mv ENCFF018ATG.bed Brain_rep1_peaks.bed
wget https://www.encodeproject.org/files/ENCFF936VAD/@@download/ENCFF936VAD.bed.gz
gunzip ENCFF936VAD.bed.gz
mv ENCFF936VAD.bed Brain_rep2_peaks.bed
Next, run IDR:

idr --samples Brain_rep{1,2}_peaks.bed
Follow the same steps for A549 and GM12878, where the peak files to download are as follows:
- A549 Rep 1: https://www.encodeproject.org/files/ENCFF529HMB/@@download/ENCFF529HMB.bed.gz
- A549 Rep 2: https://www.encodeproject.org/files/ENCFF823UOG/@@download/ENCFF823UOG.bed.gz
- GM12878 Rep 1: https://www.encodeproject.org/files/ENCFF598KWZ/@@download/ENCFF598KWZ.bed.gz
- GM12878 Rep 2: https://www.encodeproject.org/files/ENCFF073ORT/@@download/ENCFF073ORT.bed.gz

Run the following files to learn shapes for each replicate:
learn_shapes_Brain_rep1_slurm.sh
learn_shapes_Brain_rep2_slurm.sh
learn_shapes_GM12878_rep1_slurm.sh
learn_shapes_GM12878_rep2_slurm.sh
learn_shapes_A549_rep1_slurm.sh
learn_shapes_A549_rep2_slurm.sh
- For each cell type, run

  python common_scripts/merge_significant.py $BASE_PATH/anno_<sample_name>_consolidated_perm <shapes> <shapes_log>
Run the following files to annotate each cell type for each replicate:

annotate_Brain_rep1.sh
annotate_Brain_rep2.sh
annotate_GM12878_rep1.sh
annotate_GM12878_rep2.sh
annotate_A549_rep1.sh
annotate_A549_rep2.sh
Prepare the chromatin state assignments as follows for each sample:

cd $BASE_PATH/annotated_consolidated_<sample_name>/
rm anno*clust.bed
cat anno*.bed > all_anno.bed
sort -k1,1 -k2,2n all_anno.bed > all_anno_sorted.bed
awk '($4 == "Enhancer")' all_anno_sorted.bed > all_anno_sorted_enhancer.bed
awk '($4 == "Promoter")' all_anno_sorted.bed > all_anno_sorted_promoter.bed
awk '($4 == "Weak")' all_anno_sorted.bed > all_anno_sorted_weak.bed
For each cell type and RE, compute the overlap between replicates:

bedtools intersect -a annotated_consolidated_<cell_type>_rep1/all_anno_sorted_<RE>.bed -b annotated_consolidated_<cell_type>_rep2/all_anno_sorted_<RE>.bed -u > annotations_<RE>_<cell_type>_a.bed
bedtools intersect -a annotated_consolidated_<cell_type>_rep2/all_anno_sorted_<RE>.bed -b annotated_consolidated_<cell_type>_rep1/all_anno_sorted_<RE>.bed -u > annotations_<RE>_<cell_type>_b.bed
wc -l annotated_consolidated_<cell_type>_rep1/all_anno_sorted_<RE>.bed
wc -l annotations_<RE>_<cell_type>_a.bed
wc -l annotated_consolidated_<cell_type>_rep2/all_anno_sorted_<RE>.bed
wc -l annotations_<RE>_<cell_type>_b.bed
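From the `wc -l` counts above, the per-RE reproducibility is the overlap count divided by that replicate's total. With hypothetical counts (the numbers below are invented, not results from our data):

```shell
total=1200      # hypothetical: lines in rep1's all_anno_sorted_<RE>.bed
overlap=950     # hypothetical: lines in annotations_<RE>_<cell_type>_a.bed

# Fraction of rep1 annotations confirmed by rep2, as a percentage.
awk -v t="$total" -v o="$overlap" 'BEGIN { printf "%.1f%%\n", 100 * o / t }'   # prints 79.2%
```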