Skip to content
master
Switch branches/tags
Go to file
Code

Latest commit

Ibrahim Numanagić Fix #19
09e65b7

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
Jun 28, 2018
Jul 2, 2018
src
Sep 2, 2020
Aug 23, 2017
Jun 28, 2018

README.md

SEDEF: Segmental Duplication Evaluation Framework

SEDEF is a tool for quick detection of segmental duplications in a genome.

Paper

SEDEF has been presented at ECCB 2018 (DOI 10.1093/bioinformatics/bty586). Preprint is available here. Get the final paper here.

Results

👨🎨 Human (hg38) 👨‍🎨 Human (hg19) 🐭 Mouse (mm8)
Final calls Final calls Final calls

The experiment pipeline from the paper is described in this Jupyter notebook.

How to compile

Simple! Do this:

git clone https://github.com/vpc-ccg/sedef
cd sedef
make -j release

By default, SEDEF uses Intel C++ compiler. If you are using g++, build with:

make -j release CXX=g++

If you are using Clang on macOS, compile as

brew install libomp
make -j release OPENMP="-Xpreprocessor -fopenmp" CXX=clang++

You need at least g++ 5.1.0 (C++14) to compile SEDEF. Clang should work fine as well.

SEDEF requires Boost libraries in order to compile. In case you installed Boost in a non-standard directory, you can still compile as follows:

CPATH={path_to_boost} make -j release

How to run

The genome assembly must be soft-masked (i.e. all common and tandem repeats should be converted to lower-case letters) and indexed. Suppose that our genome is hg19.fa (we use UCSC hg19 genome with 24 standard chromosomes that does not contain patches (unGl) or random strains (chrXX_random)).

Automatic transmission

Just go to sedef directory and run

./sedef.sh -o <output> -j <jobs> <genome> 

For example, to run hg19.fa on 80 cores type:

./sedef.sh -o sedef_hg19 -j 80 hg19.fa

You can add -f if sedef_hg19 already exists (it will overwrite the existing content though). The final results will be located in sedef_hg19/final.bed.

Please note that sedef.sh depends on Samtools and GNU Parallel. If you want to experiment with different parameters, run sedef help for parameter documentation.

Output

Output will be located in <out_dir>/final.bed.

The fields of BEDPE file are as follows:

First 6 fields are standard BEDPE fields describing the coordinates of SD mates:

  • chr1, start1 and end1
  • chr2, start2 and end2

Other fields are (in the order of appearance):

Field Description
name SD name
score Total alignment error
strand1 1st SD mate strand
strand2 2nd SD mate strand
max_len Length of longer mate
aln_len Alignment length (length with gaps)
cigar Empty string
comment Comment: currently shows mismatch base error (m) and gap base error (g)
indel_a Number of gap bases in the 1st mate
indel_b Number of gap bases in the 2nd mate
alnB Aligned base count (matches and mismatches without gaps)
matchB Match base count
mismatchB Mismatch base count
transitionsB Transition count (A <-> G and C <-> T)
transversions Transversion count (all mismatches that are not transitions)
fracMatch matchB / alnB
fracMatchIndel matchB / aln_len
jck Jaccard score: where w = mismatchB / alnB
k2K Kimura score: where p = transitionsB / alnB and q = transitionsB / alnB
aln_gaps Number of gaps in the alignment
uppercaseA Number of non-masked (uppercase) bases in the 1st mate
uppercaseB Number of non-masked (uppercase) bases in the 2nd mate
uppercaseMatches Non-masked match count
aln_matches Match base count
aln_mismatches Mismatch base count
aln_gaps Number of gaps in the alignment
aln_gap_bases Number of gap bases (indel_a + indel_b)
cigar CIGAR string of the SD mate alignment
filter_score (aln_gaps + aln_mismatches) / aln_len (should be ≥ 0.5)

All errors are expressed as ratios (0.0--1.0) of the alignment length unless otherwise noted.

Warning: as per WGAC, when calculating the similarity and error rates (fields score, fracMatch, fracMatchIndel and filter_score) SEDEF counts a gap as a single error (so the hypothetical alignment of A-----GC and AT-----C will have error 4 and NOT 8). This might lead to SDs with rather large gap contents. For more filtering, consult comment field that provides the percentage of match/mismatch and gap bases.

Manual transmission

First make sure to index the genome:

samtools faidx hg19.fa

Then run the sedef-search in parallel (in this example, we will use GNU parallel) to get the initial seeds:

mkdir -p out # For the output
mkdir -p out/log # For the logs

for i in `seq 1 22` X Y; do 
for j in `seq 1 22` X Y; do  
	SI=`awk '$1=="chr'$i'" {print $2}' hg19.fa.fai`; 
	SJ=`awk '$1=="chr'$j'" {print $2}' hg19.fa.fai`; 
	if [ "$SI" -le "$SJ" ] ; 
	then 
		for m in y n ; do
		[ "$m" == "y" ] && rc="-r" || rc="";
		echo "sedef search $rc hg19.fa chr$i chr$j >out/${i}_${j}_${m}.bed 2>out/log/${i}_${j}_${m}.log"
		done; 
	fi
done
done | time parallel --will-cite -j 80 --eta

# Now make sure that all runs completed successfully 
grep Total out/log/*.log | wc -l
# You should see here 600 (or n(n+1) if you have n chromosomes in your file)

# Get the single-core running time
grep Wall out/log/*.log | tr -d '(' | awk '{s+=$4}END{print s}'

# Get the maximum meory usage as well
grep Memory out/log/*.log | awk '{if($3>m)m=$3}END{print m}'

Then use sedef-align to bucket the files for the optimal parallel alignment. Afterwards, start the whole alignment:

# First bucket the reads into 1000 bins
mkdir -p out/bins
mkdir -p out/log/bins
time sedef align bucket -n 1000 out out/bins

# Now run the alignment
for j in out/bins/bucket_???? ; do
	k=$(basename $j);
	echo "sedef align generate -k 11 hg19.fa $j >${j}.bed 2>out/log/bins/${k}.log"
done | time parallel --will-cite -j 80 --eta

# Make sure that all runs finished nicely
grep Finished out/log/bins/*.log | wc -l
# Should be number of bins (in our case, 1000)

# Get again the total running time
grep Wall out/log/bins/*.log | tr -d '(' | awk '{s+=$4}END{print s}'

# And the memory
grep Memory out/log/bins/*.log | awk '{if($3>m)m=$3}END{print m}'

Finally, run sedef-stats to produce the final output:

# Concatenate the files 
cat out/*.bed > out.bed # seed SDs
cat out/bins/bucket_???? > out.init.bed # potential SD regions
cat out/bins/*.bed | sort -k1,1V -k9,9r -k10,10r -k4,4V -k2,2n -k3,3n -k5,5n -k6,6n |\
	uniq > out.final.bed # final chains

# Now get the final calls
sedef stats generate hg19.fa out.final.bed |\
		sort -k1,1V -k9,9r -k10,10r -k4,4V -k2,2n -k3,3n -k5,5n -k6,6n |\
		uniq > out.hg19.bed

Acknowledgments

SEDEF uses {fmt}, argh and the modified version of Heng Li's ksw2.

Support

Questions, bugs? Open a GitHub issue or drop me an e-mail at inumanag at mit dot edu.