
Inconsistencies between BAM v. BED #24

Closed
MediciPrime opened this issue Jun 7, 2016 · 6 comments
@MediciPrime

Problem
I am getting different values for the complexity output and the future yield depending on whether I generate them from a BAM file or a BED file. I might be doing something wrong, so please look at the workflow below:

BAM
samtools sort fly_aligned.bam -o bam/fly_aligned.sorted.bam
preseq c_curve -o bam/complexity_output.txt -B bam/fly_aligned.sorted.bam
preseq lc_extrap -o bam/future_yield.txt -B bam/fly_aligned.sorted.bam

BED
bedtools bamtobed -bed12 -i fly_aligned.bam > bed/fly_aligned.bed
sort -k 1,1 -k 2,2n -k 3,3n -k 6,6 bed/fly_aligned.bed > bed/fly_aligned.sorted.bed
preseq c_curve -o bed/complexity_output.txt bed/fly_aligned.sorted.bed
preseq lc_extrap -o bed/future_yield.txt bed/fly_aligned.sorted.bed

R Plots

fy_bed = read.table("bed/future_yield.txt", header=TRUE)  # BED future yield
sub_fy_bed <- fy_bed[1:2]
fy_bam = read.table("bam/future_yield.txt", header=TRUE)  # BAM future yield
sub_fy_bam <- fy_bam[1:2]
plot(sub_fy_bed, type="p", col="red", pch=3, lwd=1, main="Future Yield", xlab="Total Reads", ylab="Expected Distinct")
points(sub_fy_bam, type="p", col="green", pch=1, lwd=1)
legend(5e+09, 7.0e+06, c("BED", "BAM"), lty=c(3, 1), lwd=c(1, 1), col=c("red", "green"))

[figure: future yield comparison plot]

co_bed = read.table("bed/complexity_output.txt", header=TRUE)
co_bam = read.table("bam/complexity_output.txt", header=TRUE)
sub_co_bed <- co_bed[1:2]
sub_co_bam <- co_bam[1:2]
plot(sub_co_bed, type="p", col="red", pch=3, lwd=1, main="Complexity Plot", xlab="Total Reads", ylab="Distinct Reads")
points(sub_co_bam, type="p", col="green", pch=1, lwd=1)
legend(5e+09, 7.0e+06, c("BED", "BAM"), lty=c(3, 1), lwd=c(1, 1), col=c("red", "green"))

[figure: complexity comparison plot]


The plots look very different, and I was wondering why the BAM workflow produces different results from the BED workflow.

Any ideas?
Behram

@timydaley
Contributor

Is the data paired-end? I think bedtools outputs the two ends of a paired-end read separately. We use a file format called MR, in which the two mates of a concordantly mapped read pair are merged into one read. It looks like BED format but with sequence and score information; the conversion can be done with the provided tool bam2mr.
In default mode (i.e., not paired-end mode), only the first mate is used to determine the number of duplicates in a BAM file.
One way to test this is to filter out just the first mates, convert the BAM to BED, then run preseq. This is how we tested it when we first added BAM support. If you still see inconsistencies after that, please let us know.
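The first-mate filter described above boils down to SAM flag arithmetic; a minimal sketch (the bit values are standard SAM FLAG bits, but the toy flag list and function name are invented for illustration):

```python
# In the SAM FLAG field, bit 0x40 marks the first mate of a read pair,
# and bit 0x100 marks a secondary alignment.
FIRST_MATE = 0x40
SECONDARY = 0x100

def is_first_primary(flag: int) -> bool:
    """Keep reads that are the first mate and not secondary alignments."""
    return bool(flag & FIRST_MATE) and not (flag & SECONDARY)

# Toy flags: 99 = paired, proper pair, mate reverse, first mate;
#            147 = paired, proper pair, read reverse, second mate;
#            355/403 = the same two with the secondary bit set.
flags = [99, 147, 355, 403]
kept = [f for f in flags if is_first_primary(f)]
print(kept)  # [99]
```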

@MediciPrime
Author

Hi Timothy! Thank you for the prompt reply. The BAM contains single-end 50-bp reads from an RNA-seq experiment, so I don't think there should be any issue converting it to BED. As for bam2mr, isn't that only necessary when using gc_extrap? Anyway, I will try to reproduce the problem using SRA data and post my results when I am finished.

@timydaley
Contributor

Hmmm. That's concerning. Your second plot shows more reads in the BED format, so one of two things is happening:

  1. BEDTools is outputting too many reads.
  2. preseq isn't counting all the bam reads.

I think it may be the first, similar to the problem mentioned here: https://www.biostars.org/p/67579/
To test this, first filter the BAM file for primary, properly mapped reads, then run bamtobed.

@MediciPrime
Author

Okay, so my mentor recommended that I try using uniquely mapped reads instead of the raw mapped reads. I used the same workflow from my earlier post to generate the plots shown below.

[figure: complexity plot, uniquely mapped reads]

[figure: future yield plot, uniquely mapped reads]

The results for the future yield don't exactly line up, so I am still a bit concerned.
Does preseq filter BAM files using the mapping quality scores?
If so, that might explain the massive discrepancy between BED and BAM in the complexity plot from my previous post: BEDTools is outputting too many reads, preseq has no way to filter them from the BED file, and preseq is additionally filtering the BAM reads by mapping quality.
Lastly, should I always use uniquely mapped reads to avoid the problem discussed in this issue?

@timydaley
Contributor

What you really want to compare between the two inputs is the counts histogram for the BAM and BED formats (primary mappings only, which I think is what you're calling unique mappings). This can be printed with the verbose option. If the output histograms are identical, then there are no issues.
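The counts histogram mentioned above is also easy to compute directly from a sorted BED file as a sanity check; a minimal sketch with toy records (here I assume duplicates are defined by identical chrom/start/end/strand, which approximates how preseq counts duplicate reads):

```python
from collections import Counter

def counts_histogram(bed_lines):
    """Count how many distinct positions occur 1x, 2x, ... times."""
    duplicates = Counter()
    for line in bed_lines:
        chrom, start, end, name, score, strand = line.split()[:6]
        duplicates[(chrom, start, end, strand)] += 1
    # Histogram: multiplicity -> number of distinct positions with it.
    return dict(sorted(Counter(duplicates.values()).items()))

bed = [
    "chr2L 100 150 r1 0 +",
    "chr2L 100 150 r2 0 +",  # duplicate of r1
    "chr2L 300 350 r3 0 -",
    "chr3R 500 550 r4 0 +",
]
print(counts_histogram(bed))  # {1: 2, 2: 1}
```

If the histograms from the BAM and BED runs differ, the difference is in the inputs, not in preseq's estimator.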

Does preseq filter BAM files using the mapping quality scores?
We don't filter by quality scores, but we do filter out secondary mappings, since including them would obviously bias the estimated complexity.
BEDTools is outputting too many reads and preseq doesn't have a way to filter them
I don't think there is a way to filter out secondary mappings based on the BED format: there is no analog of the SAM flag in BED to distinguish primary mappings from secondary ones. The tool bam2mr will do this filtering, though.

When looking at the extrapolation, you only want to look 20-100-fold beyond the initial experiment. So if your initial experiment is 10M reads, I would limit the comparison of the extrapolations to 1e9 reads. Otherwise the variability introduced by bootstrapping is large enough that you may think there are big differences when there really aren't. This variability is absent in interpolation because we use an exact formula. One way to control for it is to use quick mode, which does a single extrapolation without bootstrapping (and hence no randomness), or to manually seed the random number generator at line 279 in preseq.cpp (I'm not sure you want to get into that).
If you're trying to get the total number of molecules, this is difficult because of inherent mathematical issues (see our most recent paper, http://arxiv.org/abs/1605.03294): small changes in the input can lead to very large changes in the output. So I wouldn't worry too much about differences that far out in the extrapolation.
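The 20-100-fold rule of thumb above is simple arithmetic; a trivial helper (the function name is mine, not part of preseq):

```python
def extrapolation_window(initial_reads):
    """Total-read depths up to which lc_extrap comparisons are reasonable,
    per the 20-100x rule of thumb."""
    return 20 * initial_reads, 100 * initial_reads

# A 10M-read experiment: trust the extrapolation out to about 1e9 reads.
low, high = extrapolation_window(10_000_000)
print(low, high)  # 200000000 1000000000
```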

@MediciPrime
Author

Thank you for all your help Timothy! Preseq makes a lot of sense now!
