-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unexpected relative profile results (possible user error) #3
Comments
Are you sure about orange.fa, it does not look like an 8x of apple.
When I look
at the histogram orange.hist produced in the second step with Histex it is:
Histogram of unique 40-mers of orange
Input: 17,475 unique 40-mers
Freq: Count Cum. %
17: 12 0.1%
16: 35 0.3%
15: 1 0.3%
12: 49 0.6%
11: 51 0.8%
10: 3 0.9%
9: 54 1.2%
7: 18 1.3%
6: 156 2.2%
5: 20 2.3%
4: 2 2.3%
3: 96 2.8%
1: 16978 100.0%
In otherwords most k-mers occuring once would be consistent with the
output your are getting
when you build the relative profile. So I looked at the k-mer table for
orange.fa with Tabex
and its the strangest thing, identical k-mers are not collapsed, e.g.:
Opening 40-mer table with 17,475 entries
0: aaaaaaatggaatctcgcagcgcctatagtagcaaggata = 1
1: aaaaaaatggaatctcgcagcgcctatagtagcaaggata = 1
2: aaaaaaatggaatctcgcagcgcctatagtagcaaggata = 1
3: aaaaaatcgcaaactcacgaggttgaccacttaagtacgg = 1
4: aaaaaatcgcaaactcacgaggttgaccacttaagtacgg = 1
5: aaaaaatcgcaaactcacgaggttgaccacttaagtacgg = 1
6: aaaaaatcgcaaactcacgaggttgaccacttaagtacgg = 1
7: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1
8: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1
9: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1
10: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1
11: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1
12: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1
13: aaaaaatggaatctcgcagcgcctatagtagcaaggatac = 1
14: aaaaaatggaatctcgcagcgcctatagtagcaaggatac = 1
15: aaaaaatggaatctcgcagcgcctatagtagcaaggatac = 1
16: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1
17: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1
18: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1
19: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1
20: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1
21: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1
22: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1
23: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1
24: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1
25: aaaaagcacatttccgcacagaccggactcaacactgcgc = 3
26: aaaaagcattttttcctcacaagctcacaggttcgaagct = 1
27: aaaaagcattttttcctcacaagctcacaggttcgaagct = 1
28: aaaaagcattttttcctcacaagctcacaggttcgaagct = 1
Hmm, its midnight here in Europe, I'll look at it tomorrow. Could again
be some facet of a
super small data set, albeit I did test a couple small ones previously.
I appreciate your kicking the tires in ways I obviously haven't, and am
committed
to solving every problem you report, albeit I hope (for your sake and my
pride)
there aren't too many more.
Best, Gene
…On 4/1/21, 8:57 PM, Bob Harris wrote:
Synopsis: I have a contrived example with a short genome and some
perfect reads sampled from that genome. I'm using the relative profile
feature to get counts of read kmers across the genome. The results are
mostly 1s with a few short stretches of higher counts. The 1s conflict
with the result I expect.
Details:
The example can be found at
https://github.com/rsharris/fake_reads_2021. apple.fa is a random
genome of length 3K. orange.fa is 80 fake reads, each 300 bp, sampled
from apple with no sequencing error, at random positions and with
random orientation. Average depth ≈8X.
Running this
|FastK -k40 -t1 apple.fa|
|FastK -k40 -t1 orange.fa|
|FastK -k40 -t1 -p:orange.ktab apple.fa|
|Profex apple.prof 1 > orange_on_apple.profile|
I expected to get a count profile that roughly matched the coverage
profile I would get from piling up orange to apple alignments
(realizing that the 40-mers absent at the end of the reads would
reduce avg count to ≈7X).
(orange_on_apple.profile is in that same repo)
Instead, I get mostly 1s with occasional short bursts of higher
numbers. I'm at a loss to understand this. Am I missing something in
how I am using the tools? I know the motivation for profiles was to
profile along reads, not along genomes, so maybe there's a difference
I'm not understanding.
I wonder if the fact that the kmer counts in apple itself are all 1 is
relevant.
FWIW I'm using a fetch of the repo from about an hour ago.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#3>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABUSINXDZAHQWSZZRD2AJZDTGS62VANCNFSM42HT5OVA>.
|
Oh yeah, I see what you mean about duplicates in the kmer table for orange. Very strange. I'm going to try the same test on a couple different architectures, on the outside change that it's platform-dependent. |
I've tried it on three machines, all of which have duplicates in the orange kmer table. One machine on campus running some version of linux. And two macs here at home. I can send you architecture details, but my guess about platform differences looks to be irrelevant. |
I tried a few similar datasets with various sizes. All with same result UNTIL I happened to try I'm probably chasing after cats that are of no use to you, so I'll stop with that one. |
Its early morning here in Germany. It is a bug I unfortunately introduced last Saturday into the |
Let me know when you have checked in the fix. I am planning to do a whole set of FastK runs today and I want them to be correct.
… On 2 Apr 2021, at 07:11, Eugene W Myers Jr ***@***.***> wrote:
Its early morning here in Germany. It is a bug I unfortunately introduced last Saturday into the
core sorting routine. I've found it (a one character mistake (again) :-)) and am doing a little more
testing before I check the fix in.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA2FXZUICZ4QNGAXJTI7JSTTGVNXZANCNFSM42HT5OVA>.
|
I just checked in the fixed version of FastK. I'm so sorry, I inadvertently screwed up the core MSD sort Anyhow its fixed now (and a little faster than it was before :-) ). Thanks Bob for catching this. |
Synopsis: I have a contrived example with a short genome and some perfect reads sampled from that genome. I'm using the relative profile feature to get counts of read kmers across the genome. The results are mostly 1s with a few short stretches of higher counts. The 1s conflict with the result I expect.
Details:
The example can be found at https://github.com/rsharris/fake_reads_2021. apple.fa is a random genome of length 3K. orange.fa is 80 fake reads, each 300 bp, sampled from apple with no sequencing error, at random positions and with random orientation. Average depth ≈8X.
Running this
FastK -k40 -t1 apple.fa
FastK -k40 -t1 orange.fa
FastK -k40 -t1 -p:orange.ktab apple.fa
Profex apple.prof 1 > orange_on_apple.profile
I expected to get a count profile that roughly matched the coverage profile I would get from piling up orange to apple alignments (realizing that the 40-mers absent at the end of the reads would reduce avg count to ≈7X).
(orange_on_apple.profile is in that same repo)
Instead, I get mostly 1s with occasional short bursts of higher numbers. I'm at a loss to understand this. Am I missing something in how I am using the tools? I know the motivation for profiles was to profile along reads, not along genomes, so maybe there's a difference I'm not understanding.
I wonder if the fact that the kmer counts in apple itself are all 1 is relevant.
FWIW I'm using a fetch of the repo from about an hour ago.
The text was updated successfully, but these errors were encountered: