Inconsistencies between BAM v. BED #24
Comments
Is the data paired-end? I think bedtools outputs the two ends of paired-end data separately. We use a file format called mr, where the two mates of a concordantly mapped read are merged into one read. It looks like bed format but with sequence and score information. The conversion can be made with the provided tool bam2mr.
Hi Timothy! Thank you for the prompt reply. The BAM contains single-end 50-bp reads from
Hmmm. That's concerning. Your second plot shows more reads in the bed format. So one of two things is happening:
I think it may be the first, similar to the problem mentioned here: https://www.biostars.org/p/67579/
Okay, so my mentor recommended that I try using uniquely mapped reads instead of the raw mapped reads. I used the same workflow from my earlier post to generate the plots shown below. The results for the future yield don't exactly line up, so I am still a bit concerned.
What you really want to look at to compare the two inputs is the counts histogram for bam and bed format (primary mapping only, which I think you're referring to as unique mapping). This can be done with the verbose option. If the two output histograms are identical, then there are no issues.
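As a rough cross-check outside preseq, the counts histogram can be approximated from a BED file with standard tools. This is only a sketch: `counts_hist` is a hypothetical helper, and it assumes that two reads are duplicates when they share chromosome, start, end, and strand (columns 1, 2, 3, and 6), which may not match how preseq itself defines duplicates for each input format.

```shell
# counts_hist FILE: print "j<TAB>n_j", where n_j is the number of distinct
# (chrom, start, end, strand) keys that occur exactly j times in a
# tab-delimited BED file. The duplicate definition is an assumption.
counts_hist() {
  cut -f 1,2,3,6 "$1" \
    | sort \
    | uniq -c \
    | awk '{ print $1 }' \
    | sort -n \
    | uniq -c \
    | awk '{ print $2 "\t" $1 }'
}
```

Running this on the BED file and comparing the result against the histogram printed in verbose mode for the BAM run (for example with diff) should show whether the two inputs are being counted the same way.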
When looking at the extrapolation, you only want to look 20-100 fold out from the initial experiment. So if your initial experiment is 10M reads, I would limit the comparison of the extrapolation to 1e9 reads. Beyond that, the variability introduced by bootstrapping is large enough that you may think there are large differences when there really aren't. This variability is absent in interpolation because we use an exact formula. One way to control for this is either to use quick mode, which does a single extrapolation without bootstrapping (and hence no randomness), or to manually set the random number generator at line 279 in preseq.cpp (I'm not sure if you want to get into that).
Thank you for all your help Timothy! Preseq makes a lot of sense now!
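To apply the 20-100 fold limit before plotting, the future-yield tables can be truncated at 100 times the initial read count. This is a minimal sketch: `trim_yield` is a hypothetical helper, and it assumes the table has a single header line and that its first column holds the total number of reads.

```shell
# trim_yield FILE N_READS [FOLD]: keep the header line plus every row whose
# first column (assumed to be total reads) is at most FOLD * N_READS.
# FOLD defaults to 100, the upper end of the 20-100 fold range.
trim_yield() {
  awk -v n="$2" -v fold="${3:-100}" 'NR == 1 || $1 + 0 <= fold * n' "$1"
}
```

For a 10M-read experiment, `trim_yield future_yield.txt 10000000` would keep rows up to 1e9 reads, so the BAM and BED curves are compared only over the range where the extrapolation is expected to be stable.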
Problem
I am getting different values for complexity output and future yield when generating the outputs using BAM or BED file. I might be doing something wrong so please look at the workflow below:
BAM
samtools sort fly_aligned.bam -o bam/fly_aligned.sorted.bam
preseq c_curve -o complexity_output.txt -B fly_aligned.sorted.bam
preseq lc_extrap -o future_yield.txt -B fly_aligned.sorted.bam
BED
bedtools bamtobed -bed12 -i fly_aligned.bam > bed/fly_aligned.bed
sort -k 1,1 -k 2,2n -k 3,3n -k 6,6 fly_aligned.bed > fly_aligned.sorted.bed
preseq c_curve -o complexity_output.txt fly_aligned.sorted.bed
preseq lc_extrap -o future_yield.txt fly_aligned.sorted.bed
R Plots
They look very different and I was wondering why the BAM workflow is producing different results from the BED workflow.
Any ideas?
Behram