Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected relative profile results (possible user error) #3

Closed
rsharris opened this issue Apr 1, 2021 · 7 comments
Closed

Unexpected relative profile results (possible user error) #3

rsharris opened this issue Apr 1, 2021 · 7 comments

Comments

@rsharris
Copy link

rsharris commented Apr 1, 2021

Synopsis: I have a contrived example with a short genome and some perfect reads sampled from that genome. I'm using the relative profile feature to get counts of read kmers across the genome. The results are mostly 1s with a few short stretches of higher counts. The 1s conflict with the result I expect.

Details:

The example can be found at https://github.com/rsharris/fake_reads_2021. apple.fa is a random genome of length 3K. orange.fa is 80 fake reads, each 300 bp, sampled from apple with no sequencing error, at random positions and with random orientation. Average depth ≈8X.

Running this
FastK -k40 -t1 apple.fa
FastK -k40 -t1 orange.fa
FastK -k40 -t1 -p:orange.ktab apple.fa
Profex apple.prof 1 > orange_on_apple.profile
I expected to get a count profile that roughly matched the coverage profile I would get from piling up orange to apple alignments (realizing that the 40-mers absent at the end of the reads would reduce avg count to ≈7X).

(orange_on_apple.profile is in that same repo)
Instead, I get mostly 1s with occasional short bursts of higher numbers. I'm at a loss to understand this. Am I missing something in how I am using the tools? I know the motivation for profiles was to profile along reads, not along genomes, so maybe there's a difference I'm not understanding.

I wonder if the fact that the kmer counts in apple itself are all 1 is relevant.

FWIW I'm using a fetch of the repo from about an hour ago.

@thegenemyers
Copy link
Owner

thegenemyers commented Apr 1, 2021 via email

@rsharris
Copy link
Author

rsharris commented Apr 1, 2021

Oh yeah, I see what you mean about duplicates in the kmer table for orange. Very strange.

I'm going to try the same test on a couple different architectures, on the outside change that it's platform-dependent.

@rsharris
Copy link
Author

rsharris commented Apr 1, 2021

I've tried it on three machines, all of which have duplicates in the orange kmer table.

One machine on campus running some version of linux. And two macs here at home. I can send you architecture details, but my guess about platform differences looks to be irrelevant.

@rsharris
Copy link
Author

rsharris commented Apr 2, 2021

I tried a few similar datasets with various sizes. All with same result UNTIL I happened to try FastK -k40 -t2 orange.fa instead of FastK -k40 -t1 orange.fa. The table for -t2 looks more reasonable — by eyeball I didn't notice any duplicated kmers. But the histogram still seems to have too many for freq=1 (even though it shows a distribution that otherwise makes more sense than t=1 did, with a nice peak at freq=7).

I'm probably chasing after cats that are of no use to you, so I'll stop with that one.

@thegenemyers
Copy link
Owner

Its early morning here in Germany. It is a bug I unfortunately introduced last Saturday into the
core sorting routine. I've found it (a one character mistake (again) :-)) and am doing a little more
testing before I check the fix in.

@richarddurbin
Copy link
Collaborator

richarddurbin commented Apr 2, 2021 via email

@thegenemyers
Copy link
Owner

@richarddurbin @rsharris

I just checked in the fixed version of FastK. I'm so sorry, I inadvertently screwed up the core MSD sort
in the trunk version while working on trying to get more speed. And I did this on Saturday the day after
my talk to the VGP gang. So I'm going to also send an e-mail to the group to let them know as there has
been a bump in downloads since the talk. Sheisse.

Anyhow its fixed now (and a little faster than it was before :-) ).

Thanks Bob for catching this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants