
Inconsistencies between BAM v. BED #24

Closed
MediciPrime opened this issue Jun 7, 2016 · 6 comments
@MediciPrime

Problem
I am getting different values for the complexity output and the future yield depending on whether I generate them from a BAM file or a BED file. I might be doing something wrong, so please look at the workflow below:

BAM
samtools sort fly_aligned.bam -o bam/fly_aligned.sorted.bam
preseq c_curve -o bam/complexity_output.txt -B bam/fly_aligned.sorted.bam
preseq lc_extrap -o bam/future_yield.txt -B bam/fly_aligned.sorted.bam

BED
bedtools bamtobed -bed12 -i fly_aligned.bam > bed/fly_aligned.bed
sort -k 1,1 -k 2,2n -k 3,3n -k 6,6 bed/fly_aligned.bed > bed/fly_aligned.sorted.bed
preseq c_curve -o bed/complexity_output.txt bed/fly_aligned.sorted.bed
preseq lc_extrap -o bed/future_yield.txt bed/fly_aligned.sorted.bed

R Plots

fy_bed = read.table("bed/future_yield.txt", header=TRUE)  # BED future yield
sub_fy_bed <- fy_bed[1:2]
fy_bam = read.table("bam/future_yield.txt", header=TRUE)  # BAM future yield
sub_fy_bam <- fy_bam[1:2]
plot(sub_fy_bed, type="p", col="red", pch=3, lwd=1, main="Future Yield", xlab="Total Reads", ylab="Expected Distinct")
points(sub_fy_bam, type="p", col="green", pch=1, lwd=1)
legend(5e+09, 7.0e+06, c("BED", "BAM"), lty=c(3, 1), lwd=c(1, 1), col=c("red", "green"))

[figure: future yield comparison plot]

co_bed = read.table("bed/complexity_output.txt", header=TRUE)
co_bam = read.table("bam/complexity_output.txt", header=TRUE)
sub_co_bed <- co_bed[1:2]
sub_co_bam <- co_bam[1:2]
plot(sub_co_bed, type="p", col="red", pch=3, lwd=1, main="Complexity Plot", xlab="Total Reads", ylab="Distinct Reads")
points(sub_co_bam, type="p", col="green", pch=1, lwd=1)
legend(5e+09, 7.0e+06, c("BED", "BAM"), lty=c(3, 1), lwd=c(1, 1), col=c("red", "green"))

[figure: complexity comparison plot]


The plots look very different, and I was wondering why the BAM workflow produces different results from the BED workflow.

Any ideas?
Behram

@timydaley
Contributor

Is the data paired-end? I think bedtools outputs the two ends of a paired-end read separately. We use a file format called MR, in which the two mates of a concordantly mapped read pair are merged into one read. It looks like BED format but with sequence and score information; the conversion can be done with the provided tool bam2mr.
In default mode (i.e., not paired-end mode), only the first mate is used to determine the number of duplicates in a BAM file.
One way to test this is to filter out just the first mates, convert the BAM to BED, then run preseq. This is how we tested it when we first added BAM support. If you still see inconsistencies after that, please let us know.
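The first-mate filter described above boils down to SAM flag arithmetic; a minimal sketch (the bit values are standard SAM FLAG bits, but the toy flag list and function name are invented for illustration):

```python
# In the SAM FLAG field, bit 0x40 marks the first mate of a read pair,
# and bit 0x100 marks a secondary alignment.
FIRST_MATE = 0x40
SECONDARY = 0x100

def is_first_primary(flag: int) -> bool:
    """Keep reads that are the first mate and not secondary alignments."""
    return bool(flag & FIRST_MATE) and not (flag & SECONDARY)

# Toy flags: 99 = paired, proper pair, mate reverse, first mate;
#            147 = paired, proper pair, read reverse, second mate;
#            355/403 = the same two with the secondary bit set.
flags = [99, 147, 355, 403]
kept = [f for f in flags if is_first_primary(f)]
print(kept)  # [99]
```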

@MediciPrime
Author

Hi Timothy! Thank you for the prompt reply. The BAM contains single-end 50-bp reads from an RNA-seq experiment, so I don't think there should be any issue converting it to BED. As for bam2mr, isn't that only necessary when using gc_extrap? Anyway, I will try to reproduce the problem using SRA data and post my results when I am finished.

@timydaley
Contributor

Hmmm. That's concerning. Your second plot shows more reads in the BED format, so one of two things is happening:

  1. BEDTools is outputting too many reads.
  2. preseq isn't counting all the bam reads.

I think it may be the first, similar to the problem mentioned here: https://www.biostars.org/p/67579/
To test this, first filter the BAM file for primary, properly mapped reads, then run bamtobed.

@MediciPrime
Author

Okay, so my mentor recommended that I try using uniquely mapped reads instead of the raw mapped reads. I used the same workflow from my earlier post to generate the plots shown below.

[figure: complexity plot, uniquely mapped reads]

[figure: future yield plot, uniquely mapped reads]

The results for the future yield don't exactly line up, so I am still a bit concerned.
Does preseq filter BAM files using the mapping quality scores?
If so, that might explain the massive discrepancy between BED and BAM in the complexity plot from my previous post: BEDTools is outputting too many reads, preseq has no way to filter them from the BED file, and preseq is additionally filtering the BAM reads by mapping quality.
Lastly, should I always use uniquely mapped reads to avoid the problem discussed in this issue?

@timydaley
Contributor

What you really want to compare between the two inputs is the counts histogram for the BAM and BED formats (primary mappings only, which I think is what you're calling unique mappings). This can be printed with the verbose option. If the output histograms are identical, then there are no issues.
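The counts histogram mentioned above is also easy to compute directly from a sorted BED file as a sanity check; a minimal sketch with toy records (here I assume duplicates are defined by identical chrom/start/end/strand, which approximates how preseq counts duplicate reads):

```python
from collections import Counter

def counts_histogram(bed_lines):
    """Count how many distinct positions occur 1x, 2x, ... times."""
    duplicates = Counter()
    for line in bed_lines:
        chrom, start, end, name, score, strand = line.split()[:6]
        duplicates[(chrom, start, end, strand)] += 1
    # Histogram: multiplicity -> number of distinct positions with it.
    return dict(sorted(Counter(duplicates.values()).items()))

bed = [
    "chr2L 100 150 r1 0 +",
    "chr2L 100 150 r2 0 +",  # duplicate of r1
    "chr2L 300 350 r3 0 -",
    "chr3R 500 550 r4 0 +",
]
print(counts_histogram(bed))  # {1: 2, 2: 1}
```

If the histograms from the BAM and BED runs differ, the difference is in the inputs, not in preseq's estimator.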

Does preseq filter BAM files using the mapping quality scores?
We don't filter by quality scores, but we do filter out secondary mappings, since including them would obviously bias the estimated complexity.
BEDTools is outputting too many reads and preseq doesn't have a way to filter them
I don't think there is a way to filter out secondary mappings based on the BED format: there is no analog of the SAM flag in BED to distinguish primary mappings from secondary ones. The tool bam2mr will do this filtering, though.

When looking at the extrapolation, you only want to look 20-100-fold beyond the initial experiment. So if your initial experiment is 10M reads, I would limit the comparison of the extrapolations to 1e9 reads. Otherwise the variability introduced by bootstrapping is large enough that you may think there are big differences when there really aren't. This variability is absent in interpolation because we use an exact formula. One way to control for it is to use quick mode, which does a single extrapolation without bootstrapping (and hence no randomness), or to manually seed the random number generator at line 279 in preseq.cpp (I'm not sure you want to get into that).
If you're trying to get the total number of molecules, this is difficult because of inherent mathematical issues (see our most recent paper, http://arxiv.org/abs/1605.03294): small changes in the input can lead to very large changes in the output. So I wouldn't worry too much about differences that far out in the extrapolation.
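The 20-100-fold rule of thumb above is simple arithmetic; a trivial helper (the function name is mine, not part of preseq):

```python
def extrapolation_window(initial_reads):
    """Total-read depths up to which lc_extrap comparisons are reasonable,
    per the 20-100x rule of thumb."""
    return 20 * initial_reads, 100 * initial_reads

# A 10M-read experiment: trust the extrapolation out to about 1e9 reads.
low, high = extrapolation_window(10_000_000)
print(low, high)  # 200000000 1000000000
```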

@MediciPrime
Author

Thank you for all your help Timothy! Preseq makes a lot of sense now!
