This repository contains the shapes and in-house scripts used in our paper Self-Organizing Maps with Variable Neighborhoods Facilitate Learning of Chromatin Accessibility Signal Shapes Associated with Regulatory Elements. The code in this repository can be used to annotate new chromatin accessibility samples or learn new shapes. We have also included code to replicate our results. Our code is designed for use in a Unix environment and can be run using a command-line interface.
Prerequisites:

- Python 2.7 or higher. Note that your version of Python must be compiled to use UCS2. You can check this by running the following commands at the Python prompt:

  import sys
  print sys.maxunicode

  If it uses UCS2, this prints 65535 (a UCS4 build prints 1114111).
- wig_split.py. This can be obtained via the taolib repository. Note: we plan to release a future version without this dependency.
- The GCC compiler. Some of our file I/O scripts are written in C and will need to be compiled using GCC, which should be available on most Unix systems.
- bedtools (see the bedtools documentation for installation instructions).
- The Unix utilities shuf, cut, and awk. These should be available on most Unix systems.
- The Python modules numpy, scipy, HTSeq, argparse, logging, sys, itertools, and pysam.
- If you plan to learn new shapes, the Python module tensorflow.
While a pre-existing version of gosr exists, our framework uses a modified version that preserves zeros in the signal profile. gosr should automatically be downloaded when you download our repository. To install it, simply untar the file gosr.tar provided in this repository.
You will need to add the following paths to your PYTHONPATH variable:
- <installation_path>/SOM-VN/gosr/gosr/tools
- <path_to_gap_statistic_code>
You will need to add the following paths to your PATH variable:
- <installation_path>/SOM-VN/gosr/bin
- The directory where you installed bedtools.
- The directory containing the taolib folder for wig_split.py.
To segment regions for learning shapes or annotating shapes, you will need to run the following:

gosr binbam -f 0 -n 1000 -t <name_of_sample> <bam_file> 50 <name_of_sample> > <name_of_WIG_file>
sed '3d' <name_of_WIG_file> > <name_of_WIG_file>_noheader.wig
mkdir <chrom_wig_dir>
python wig_split.py <name_of_WIG_file>_noheader.wig <chrom_wig_dir>/<name_of_sample>
gcc -pthread -o run_get_data ../common_scripts/get_file_data.c -lm
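As an aside, the `sed '3d'` step above simply deletes the third line of the WIG file, a header line that downstream scripts cannot parse (hence the `_noheader` suffix). A minimal, self-contained illustration; the toy WIG content below is invented and only stands in for real gosr output:

```shell
# Toy stand-in for a gosr WIG file; line 3 plays the role of the header to drop.
printf 'track type=wiggle_0\nfixedStep chrom=chr1 start=1 step=50\nheader-to-drop\n12\n7\n' > toy.wig

# Delete line 3, as in the pipeline above.
sed '3d' toy.wig > toy_noheader.wig

cat toy_noheader.wig   # 4 lines remain; the header line is gone
```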
You may annotate new samples using the shape files we provide or using new shapes that you have learned. In the pipeline below, file_containing_shapes refers to the shape file you are using. The code you will need for this task is in the folder annotation_scripts. To annotate the new samples, you will need to do the following for each chromosome, where annotation_file is the location where you wish to store the annotated regions:

python make_annotated_bed.py <regions_to_annotate> <file_containing_shapes> <annotation_file> <chrom_wig_file> 0.0

This will generate two files: one that contains the annotated regions and their scores, and one that also includes the original signal for each annotated region, saved as <annotation_file>clust. Then sort and consolidate the annotations:

bedtools sort -i <annotation_file> > <annotation_sorted_file>
bedtools sort -i <annotation_file>clust > <annotation_sorted_file>clust
python common_scripts/consolidate_bed.py <annotation_sorted_file> <annotation_consolidated_file>
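`bedtools sort` orders a BED file by chromosome name and then numerically by start coordinate. If you want to sanity-check a sorted file without bedtools, GNU sort gives the same default ordering; the toy data below is invented:

```shell
# Invented annotations, out of order.
printf 'chr2\t100\t200\tPromoter\nchr1\t500\t600\tEnhancer\nchr1\t50\t150\tWeak\n' > toy_anno.bed

# Chromosome first (lexicographic), then start position (numeric).
sort -k1,1 -k2,2n toy_anno.bed > toy_anno_sorted.bed

cat toy_anno_sorted.bed   # chr1 intervals first, ordered by start
```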
The code you will need for this task is in the folder shape_learning_scripts. To learn shapes for one chromosome, you will need to do the following:
shuf <regions_to_train> > <shuffled_regions_to_train>
python shift_input.py <shuffled_regions_to_train> <shuffled_regions_to_train> 50 4000 <chrom_wig_file>.chr<chrom> false 0
python som_vn.py <shuffled_regions_to_train> <som_shapes_learned> <chrom_wig_file>.chr<chrom> 4000 50 0 False
python remove_by_cutoff.py <som_shapes_learned> 1 <som_shapes_filtered>
python merge_shifted.py <som_shapes_filtered> <som_shapes_shifted> 0
python kmeans_shapes.py <som_shapes_shifted> <som_shapes_final>
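For context, the `shuf` step only randomizes the order in which training regions are presented to the SOM; it neither adds nor drops lines. A quick self-contained check of that property on invented data:

```shell
# Four invented region identifiers.
printf 'region1\nregion2\nregion3\nregion4\n' > toy_regions.txt
shuf toy_regions.txt > toy_regions_shuffled.txt

# Same lines, possibly in a different order: sorted copies are identical.
sort toy_regions.txt > toy_sorted_a.txt
sort toy_regions_shuffled.txt > toy_sorted_b.txt
cmp toy_sorted_a.txt toy_sorted_b.txt && echo "same contents"
```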
python make_shape_bed.py <regions_to_annotate> <som_shapes_final> <regions_annotated> 0
bedtools sort -i <regions_annotated> > <regions_annotated_sorted>
python consolidate.py <regions_annotated_sorted> <regions_annotated_final>
cut -d$'\t' -f 1,2,3,4,5 <regions_annotated_final> > <regions_annotated_final>.bed
awk '{ print $6}' <regions_annotated_final> > <clusters_annotated_final>
cut -d$'\t' -f 7,8,9,10 <regions_annotated_final> > <scores_annotated_final>
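The `cut` and `awk` commands above split each consolidated row by column: fields 1-5 become the BED annotation, field 6 the shape (cluster) assignment, and fields 7-10 the scores. (Tab is already cut's default delimiter, so `-d$'\t'` is optional.) On an invented 10-column row:

```shell
# Invented row standing in for one line of <regions_annotated_final>.
printf 'chr1\t100\t600\tPromoter\t0.9\t17\t0.1\t0.2\t0.3\t0.4\n' > toy_final.txt

cut -f 1,2,3,4,5 toy_final.txt    # BED columns
awk '{ print $6 }' toy_final.txt  # shape/cluster id
cut -f 7,8,9,10 toy_final.txt     # score columns
```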
bedtools intersect -wao -a <regions_annotated_final>.bed -b <chromhmm_mnemonic_file> > <intersects>
bedtools sort -i <intersects> > <intersects_sorted>
python consolidate_chromHMM.py <intersects_sorted> <som_shapes_final> <shapes> <chrom_wig_file>.chr<chrom> <name_of_sample> <regions_to_annotate> 0
To merge shapes learned across multiple chromosomes, run the following:
python common_scripts/merge_significant.py <shapes> <shapes_all> <shapes_log>
All of our data can be downloaded by running download_all_files.sh in the location where you wish to save your BAM files. The files will be indexed by cell type and SPOT score.
To evaluate null models, you will need to learn shapes using SOM-VN with unpermuted input and SOM-VN with permuted input. You will then need to associate the learned shapes with both the original, unpermuted ChromHMM RE and the permuted ChromHMM RE.
- Run the following scripts. Note that they are designed to be run in a SLURM supercomputing environment, but can be modified to be run on a local machine.
learn_shapes_A549_slurm.sh
learn_shapes_b_cell_low_slurm.sh
learn_shapes_b_cell_high_slurm.sh
learn_shapes_Brain_low_slurm.sh
learn_shapes_Brain_high_slurm.sh
learn_shapes_GM12878_slurm.sh
learn_shapes_H1_low_slurm.sh
learn_shapes_H1_high_slurm.sh
learn_shapes_HeLa_low_slurm.sh
learn_shapes_HeLa_high_slurm.sh
learn_shapes_heart_low_slurm.sh
learn_shapes_heart_high_slurm.sh
learn_shapes_stomach_low_slurm.sh
learn_shapes_stomach_high_slurm.sh
- Merge shapes learned across chromosomes. To do this, run:

  python common_scripts/merge_significant.py $BASE_PATH/anno_A549_consolidated <shapes_all_A549> <shapes_log_A549>
  ...
  python common_scripts/merge_significant.py $BASE_PATH/anno_stomach_high_consolidated <shapes_all_stomach_high> <shapes_log_stomach_high>
- Run the following scripts. Note that they are designed to be run in a SLURM supercomputing environment, but can be modified to be run on a local machine.
learn_from_perm_wig_A549_slurm.sh
learn_from_perm_wig_b_cell_low_slurm.sh
learn_from_perm_wig_b_cell_high_slurm.sh
learn_from_perm_wig_Brain_low_slurm.sh
learn_from_perm_wig_Brain_high_slurm.sh
learn_from_perm_wig_GM12878_slurm.sh
learn_from_perm_wig_H1_low_slurm.sh
learn_from_perm_wig_H1_high_slurm.sh
learn_from_perm_wig_HeLa_low_slurm.sh
learn_from_perm_wig_HeLa_high_slurm.sh
learn_from_perm_wig_heart_low_slurm.sh
learn_from_perm_wig_heart_high_slurm.sh
learn_from_perm_wig_stomach_low_slurm.sh
learn_from_perm_wig_stomach_high_slurm.sh
- Merge shapes learned across chromosomes. To do this, run:

  python common_scripts/merge_significant.py $BASE_PATH/anno_A549_consolidated_perm <shapes_all_A549_perm> <shapes_log_A549_perm>
  ...
  python common_scripts/merge_significant.py $BASE_PATH/anno_stomach_high_consolidated_perm <shapes_all_stomach_high_perm> <shapes_log_stomach_high_perm>
- Run

  python permute_chromhmm.py <chromhmm_mnemonic_file> <permuted_chromhmm_mnemonic_file>
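Conceptually, permuting the ChromHMM file shuffles the state labels among the intervals while leaving the intervals themselves fixed, which destroys any genuine association between signal shape and RE. The sketch below illustrates that idea with standard utilities on invented data; permute_chromhmm.py remains the authoritative implementation:

```shell
# Toy ChromHMM-style file: chrom, start, end, state label (all invented).
printf 'chr1\t0\t200\t1_TssA\nchr1\t200\t400\t7_Enh\nchr1\t400\t600\t15_Quies\n' > toy_chromhmm.bed

# Shuffle column 4 independently of columns 1-3, then paste them back together.
cut -f 1-3 toy_chromhmm.bed > toy_coords.bed
cut -f 4 toy_chromhmm.bed | shuf > toy_labels.txt
paste toy_coords.bed toy_labels.txt > toy_chromhmm_perm.bed

cat toy_chromhmm_perm.bed   # same intervals, labels in random order
```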
- Run the following scripts. Note that they are designed to be run in a SLURM supercomputing environment, but can be modified to be run on a local machine.
associate_from_perm_chromhmm_A549_slurm.sh
associate_from_perm_chromhmm_b_cell_low_slurm.sh
associate_from_perm_chromhmm_b_cell_high_slurm.sh
associate_from_perm_chromhmm_Brain_low_slurm.sh
associate_from_perm_chromhmm_Brain_high_slurm.sh
associate_from_perm_chromhmm_GM12878_slurm.sh
associate_from_perm_chromhmm_H1_low_slurm.sh
associate_from_perm_chromhmm_H1_high_slurm.sh
associate_from_perm_chromhmm_HeLa_low_slurm.sh
associate_from_perm_chromhmm_HeLa_high_slurm.sh
associate_from_perm_chromhmm_heart_low_slurm.sh
associate_from_perm_chromhmm_heart_high_slurm.sh
associate_from_perm_chromhmm_stomach_low_slurm.sh
associate_from_perm_chromhmm_stomach_high_slurm.sh
- Merge shapes learned across chromosomes. To do this, run:

  python common_scripts/merge_significant.py $BASE_PATH/anno_A549_consolidated_chromhmm_perm <shapes_all_A549_chromhmm_perm> <shapes_log_A549_chromhmm_perm>
  ...
  python common_scripts/merge_significant.py $BASE_PATH/anno_stomach_high_consolidated_chromhmm_perm <shapes_all_stomach_high_chromhmm_perm> <shapes_log_stomach_high_chromhmm_perm>
- Run the following scripts:
run_crosscorr_A549_slurm.sh
run_crosscorr_b_cell_low_slurm.sh
run_crosscorr_b_cell_high_slurm.sh
run_crosscorr_Brain_low_slurm.sh
run_crosscorr_Brain_high_slurm.sh
run_crosscorr_GM12878_slurm.sh
run_crosscorr_H1_low_slurm.sh
run_crosscorr_H1_high_slurm.sh
run_crosscorr_HeLa_low_slurm.sh
run_crosscorr_HeLa_high_slurm.sh
run_crosscorr_heart_low_slurm.sh
run_crosscorr_heart_high_slurm.sh
run_crosscorr_stomach_low_slurm.sh
run_crosscorr_stomach_high_slurm.sh
- Make the plots. To do this, run:

  python meta_analysis_scripts/plot_crosscorr_distrib.py $BASE_PATH/crosscorr_A549/anno $BASE_PATH/crosscorr_A549_perm/anno <name_of_plot_A549> A549
  ...
  python meta_analysis_scripts/plot_crosscorr_distrib.py $BASE_PATH/crosscorr_stomach_high/anno $BASE_PATH/crosscorr_stomach_high_perm/anno <name_of_plot_stomach_high> "Stomach (High)"
- Run the hypothesis tests, i.e.

  python crosscorr_hypothesis_tests.py $BASE_PATH/crosscorr_A549/anno $BASE_PATH/crosscorr_A549_perm/anno
To run CAGT, note that you must have MATLAB installed. In addition, while CAGT is available for download separately, our local copy of CAGT contains a slight modification that prevents the program from stopping early in the case of slow convergence.
- Learn SOM-VN shapes as in "Learning SOM-VN shapes from permuted signal" above.
- Run the following commands before starting CAGT.
mkdir /data/eichertd/som_vn_data/matlab_matrix_<cell_type>
mkdir /data/eichertd/som_vn_data/cagt_out_<cell_type>/
- For each chromosome and cell type, run the following:
matlab -nodisplay -nodesktop -r "run_cagt('/data/eichertd/som_vn_data/training_<cell_type>_shifted/chrom<chrom>','/data/eichertd/som_vn_data/matlab_matrix_<cell_type>','<chrom>','/data/eichertd/som_vn_data/cagt_out_<cell_type>/<chrom>.csv','$BASE_PATH/cagt/trunk/matlab/src/', '0.8', '1000', '40')"
- Run the following scripts:
run_cagt_analysis_A549_slurm.sh
run_cagt_analysis_b_cell_low_slurm.sh
run_cagt_analysis_b_cell_high_slurm.sh
run_cagt_analysis_Brain_low_slurm.sh
run_cagt_analysis_Brain_high_slurm.sh
run_cagt_analysis_GM12878_slurm.sh
run_cagt_analysis_H1_low_slurm.sh
run_cagt_analysis_H1_high_slurm.sh
run_cagt_analysis_HeLa_low_slurm.sh
run_cagt_analysis_HeLa_high_slurm.sh
run_cagt_analysis_heart_low_slurm.sh
run_cagt_analysis_heart_high_slurm.sh
run_cagt_analysis_stomach_low_slurm.sh
run_cagt_analysis_stomach_high_slurm.sh
- Merge shapes learned using CAGT for each cell type, e.g.
python ../common_scripts/merge_significant.py $BASE_PATH/anno_A549_consolidated_cagt <shapes_all_A549_cagt> <shapes_log_A549_cagt>
- Run

  python plot_precision_recall_nobaselines.py $BASE_PATH/annotated_merged_<cell_type>/ $BASE_PATH/annotated_consolidated_<cell_type>/ $BASE_PATH/pr_<cell_type>.png "<cell_type> SOM-VN" $BASE_PATH/wig/<cell_type>/<cell_type>.chr

  to plot precision and recall for SOM-VN.
- Run

  python meta_analysis_scripts/save_precision_recall.py $BASE_PATH/annotated_merged_<cell_type>/ $BASE_PATH/annotated_consolidated_<cell_type>/ <wig_chrom_dir> <cell_type_precision_recall_file>

- Run

  python plot_precision_recall_nobaselines.py $BASE_PATH/annotated_merged_<cell_type>_cagt/ $BASE_PATH/annotated_consolidated_<cell_type>_cagt/ $BASE_PATH/pr_<cell_type>_cagt.png "<cell_type> CAGT" $BASE_PATH/wig/<cell_type>/<cell_type>.chr

  to plot precision and recall for CAGT.
- Run

  python meta_analysis_scripts/save_precision_recall.py $BASE_PATH/annotated_merged_<cell_type>_cagt/ $BASE_PATH/annotated_consolidated_<cell_type>_cagt/ <wig_chrom_dir> <cell_type_precision_recall_file>

- Run

  python meta_analysis_scripts/save_precision_recall_threshold.py $BASE_PATH/annotated_merged_<cell_type>/ $BASE_PATH/annotated_consolidated_<cell_type>/ <wig_chrom_dir> <cell_type_precision_recall_file>
- Run the following scripts:
annotate_b_cell_low_from_high.sh
annotate_HeLa_low_from_high.sh
annotate_Heart_low_from_high.sh
annotate_Stomach_low_from_high.sh
- Run

  mkdir <b_cell_precision_recall_directory>

- Run

  python meta_analysis_scripts/save_precision_recall.py $BASE_PATH/annotated_merged_b_cell_high_b_cell_low/ $BASE_PATH/annotated_consolidated_b_cell_high_b_cell_low/ <b_cell_wig_chrom_dir> <b_cell_precision_recall_directory>
- Run

  shape_learning_scripts/magnitude_all.sh

  to associate magnitude with RE on each chromosome of each cell type.
- Run

  python merge_significant_magnitude.py $BASE_PATH/anno_<cell_type>_magnitude_consolidated/ $BASE_PATH/magnitudes_<cell_type> $BASE_PATH/magnitudes_<cell_type>.log

- Run

  annotation_scripts/annotate_magnitude_all.sh

  to annotate each chromosome of each cell type with the learned magnitude associations for that cell type.
- Run

  python meta_analysis_scripts/save_precision_recall.py $BASE_PATH/annotated_merged_magnitude_<cell_type>/ $BASE_PATH/annotated_consolidated_magnitude_<cell_type>/ <wig_chrom_dir> <cell_type_precision_recall_file>
For each sample, run the following:

- For each chromosome and RE, run

  awk '($4 == "<RE>")' $BASE_PATH/annotated_consolidated_<sample>/anno<chrom>.bed > $BASE_PATH/annotated_consolidated_<sample>/anno<chrom>_known_<RE>.bed

  A shortcut is to run ./get_re_specific_anno.sh.
- Concatenate the per-chromosome files, and do the same for any additional samples:

  cat $BASE_PATH/annotated_consolidated_<sample>/anno*_known_<RE>.bed > $BASE_PATH/annotated_consolidated_<sample>/anno_all_known_<RE>.bed

- Intersect the annotations with the sample's peaks and count the overlap:

  bedtools intersect -a $BASE_PATH/annotated_consolidated_<sample>/anno_all_known_<RE>.bed -b $PEAKS/<sample>.bed -u > annotations_<RE>_<cell_type>_peak.bed
  wc -l $BASE_PATH/annotated_consolidated_<sample>/anno_all_known_<RE>.bed
  wc -l annotations_<RE>_<cell_type>_peak.bed
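The `awk '($4 == "<RE>")'` filter above keeps only rows whose fourth column matches the RE of interest. On invented data:

```shell
# Invented per-chromosome annotations: two promoters and one enhancer.
printf 'chr1\t100\t200\tPromoter\nchr1\t300\t400\tEnhancer\nchr1\t500\t600\tPromoter\n' > toy_anno_chr1.bed

# Keep only rows labeled Promoter in column 4.
awk '($4 == "Promoter")' toy_anno_chr1.bed > toy_known_Promoter.bed

wc -l < toy_known_Promoter.bed   # 2 rows survive the filter
```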
Run the following for each cell type.

- Run

  cat $CHROMHMM/b_cell_E032.bed | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

  to compute the total coverage of the entire ChromHMM file.
- Run

  awk '($4 == "6_EnhG" || $4 == "7_Enh" || $4 == "12_EnhBiv")' $CHROMHMM/b_cell_E032.bed > $CHROMHMM/b_cell_E032_enhancer.bed

  to obtain a ChromHMM file with only enhancers.
- Run

  awk '($4 == "1_TssA" || $4 == "2_TssAFlnk" || $4 == "10_TssBiv" || $4 == "11_BivFlnk")' $CHROMHMM/b_cell_E032.bed > $CHROMHMM/b_cell_E032_promoter.bed

  to obtain a ChromHMM file with only promoters.
- Run

  awk '($4 == "9_Het" || $4 == "15_Quies")' $CHROMHMM/b_cell_E032.bed > $CHROMHMM/b_cell_E032_weak.bed

  to obtain a ChromHMM file with only weak RE.
- Run

  cat $CHROMHMM/b_cell_E032_<RE>.bed | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

  to compute the total coverage of ChromHMM for each RE (promoter, enhancer, weak).
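The coverage command sums end minus start over all intervals in the file. Checking it on a toy BED file:

```shell
# Two invented intervals covering 100 bp and 250 bp: total coverage 350.
printf 'chr1\t0\t100\t7_Enh\nchr2\t50\t300\t1_TssA\n' > toy_cov.bed

# Same summation as above; awk can read the file directly without cat.
awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}' toy_cov.bed   # prints 350
```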
First, download and unzip the peaks for each high quality brain tissue replicate.
wget https://www.encodeproject.org/files/ENCFF018ATG/@@download/ENCFF018ATG.bed.gz
gunzip ENCFF018ATG.bed.gz
mv ENCFF018ATG.bed Brain_rep1_peaks.bed
wget https://www.encodeproject.org/files/ENCFF936VAD/@@download/ENCFF936VAD.bed.gz
gunzip ENCFF936VAD.bed.gz
mv ENCFF936VAD.bed Brain_rep2_peaks.bed
Next, run IDR:

idr --samples Brain_rep{1,2}_peaks.bed
Follow the same steps for A549 and GM12878, where the peak files to download are as follows:
- A549 Rep 1: https://www.encodeproject.org/files/ENCFF529HMB/@@download/ENCFF529HMB.bed.gz
- A549 Rep 2: https://www.encodeproject.org/files/ENCFF823UOG/@@download/ENCFF823UOG.bed.gz
- GM12878 Rep 1: https://www.encodeproject.org/files/ENCFF598KWZ/@@download/ENCFF598KWZ.bed.gz
- GM12878 Rep 2: https://www.encodeproject.org/files/ENCFF073ORT/@@download/ENCFF073ORT.bed.gz

Run the following files to learn shapes for each replicate:
learn_shapes_Brain_rep1_slurm.sh
learn_shapes_Brain_rep2_slurm.sh
learn_shapes_GM12878_rep1_slurm.sh
learn_shapes_GM12878_rep2_slurm.sh
learn_shapes_A549_rep1_slurm.sh
learn_shapes_A549_rep2_slurm.sh
- For each cell type, run

  python common_scripts/merge_significant.py $BASE_PATH/anno_<sample_name>_consolidated_perm <shapes> <shapes_log>
Run the following files to annotate each cell type for each replicate:

annotate_Brain_rep1.sh
annotate_Brain_rep2.sh
annotate_GM12878_rep1.sh
annotate_GM12878_rep2.sh
annotate_A549_rep1.sh
annotate_A549_rep2.sh
Prepare the chromatin state assignments as follows for each sample:

cd $BASE_PATH/annotated_consolidated_<sample_name>/
rm anno*clust.bed
cat anno*.bed > all_anno.bed
sort -k1,1 -k2,2n all_anno.bed > all_anno_sorted.bed
awk '($4 == "Enhancer")' all_anno_sorted.bed > all_anno_sorted_enhancer.bed
awk '($4 == "Promoter")' all_anno_sorted.bed > all_anno_sorted_promoter.bed
awk '($4 == "Weak")' all_anno_sorted.bed > all_anno_sorted_weak.bed
For each cell type and RE, compute the overlap between replicates:

bedtools intersect -a annotated_consolidated_<cell_type>_rep1/all_anno_sorted_<RE>.bed -b annotated_consolidated_<cell_type>_rep2/all_anno_sorted_<RE>.bed -u > annotations_<RE>_<cell_type>_a.bed
bedtools intersect -a annotated_consolidated_<cell_type>_rep2/all_anno_sorted_<RE>.bed -b annotated_consolidated_<cell_type>_rep1/all_anno_sorted_<RE>.bed -u > annotations_<RE>_<cell_type>_b.bed
wc -l annotated_consolidated_<cell_type>_rep1/all_anno_sorted_<RE>.bed
wc -l annotations_<RE>_<cell_type>_a.bed
wc -l annotated_consolidated_<cell_type>_rep2/all_anno_sorted_<RE>.bed
wc -l annotations_<RE>_<cell_type>_b.bed
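From the `wc -l` counts above, the per-RE reproducibility is the overlap count divided by that replicate's total. With hypothetical counts (the numbers below are invented, not results from our data):

```shell
total=1200      # hypothetical: lines in rep1's all_anno_sorted_<RE>.bed
overlap=950     # hypothetical: lines in annotations_<RE>_<cell_type>_a.bed

# Fraction of rep1 annotations confirmed by rep2, as a percentage.
awk -v t="$total" -v o="$overlap" 'BEGIN { printf "%.1f%%\n", 100 * o / t }'   # prints 79.2%
```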