Support for reference sequences longer than 65535 #70
Well, I had to make the decision for the most common use cases, because using larger position types measurably increases the size of the index file (for everyone else as well). I can make a compile-time option for this in the future so you could specify e.g.
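The compile-time option discussed here could look roughly like the following sketch. The macro and type names are illustrative assumptions, not Lambda's actual code; the point is that widening the in-sequence offset type lifts the 65535 length cap at the cost of a larger index entry.

```cpp
#include <cstdint>

// Hypothetical compile-time switch: define LAMBDA_LONG_PROTEIN_SEQS at build
// time to get 32-bit in-sequence offsets instead of the default 16-bit ones.
#ifdef LAMBDA_LONG_PROTEIN_SEQS
using SeqOffset = uint32_t;   // sequences up to ~4.3 billion positions
#else
using SeqOffset = uint16_t;   // sequences up to 65535 positions (the cap in this issue)
#endif

// Each suffix-array entry stores (sequence number, offset within sequence),
// so the per-entry size follows directly from the chosen offset type.
struct SaEntry
{
    uint32_t  seqNo;   // which sequence
    SeqOffset seqPos;  // position within that sequence
};
```

With the default 16-bit offset, every entry stays small, which is exactly the disk-size trade-off described above.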
Hi, I respect your decision of using 16 bit uints for the index, that's fine. I'll definitely try the change you suggested. Hopefully I'll get an environment up where I can compile this on my own (been using the precompiled binaries so far). I'll report back as soon as I have time to try this.
While we're at it: I noticed the expected number of sequences is stored as uint32. That also seems a bit low for what I normally use [1]. I would suggest adding options to compile with 64 bit integers in these fields as well, since capping it at merely 4.3 million sequences isn't a lot. Without looking at the entire code base, I'm not sure if this would be possible with lambda; are there any implications I'm not considering? [1] Genbank's non-redundant protein database currently contains more than 57 million sequences...
Ok, please let me know if it works!
32-bit means up to 4.3 billion sequences, I hope that's enough for the next two or three years 😄
Haha. Whoops. My bad 😆. Disregard that, I still haven't woken up today it seems.
I'm not having any luck trying to build lambda 😞. I'm trying to build from the lambda-next branch, but I also tried building the master branch just to see if that works (it didn't). I tried it on three different machines:
Besides lots of errors like this:
Based on these messages I was thinking maybe my anaconda3 installation is messing things up, but removing all anaconda-stuff from my $PATH didn't change anything. On the second machine the build and linking of lambda_indexer seems to work, and it appears to run like it should.
Did you clone the master branch with the
Please try only with the master branch for now. And can you post a full log of the cmake run? [for make the last part is enough, but for cmake I need the full log]
No problem, I understand a little ;)
Yes, I used the
Strangely enough, I managed to complete a build from the master branch just now. The compiled binaries don't work, however:
Here is
Not sure what exactly you want from cmake, but here are two log files I could find. Hope they are of some use.
I deleted everything, started over, and tried another build on my laptop (the Bash on Ubuntu on Windows thing) from the master branch, and it seems to succeed now. I could even produce static binaries that seem to run on my main crunching machine (the first of the three listed in my previous post). However, I got this output during compilation:
Not sure if this is a problem though. It seems to be working now.
It looks like you are using a non-system compiler but then linking against system libraries; this can lead to all sorts of problems, unrelated to Lambda :)
[However, static builds might break on CentOS, because they don't provide the static lib files for zlib; this is the error you are seeing up at the top]
Yes, these warnings can be ignored!
Yeah. Compiling stuff that requires newer GCC versions than what the RedHat repositories have available is frustrating, to say the least... 😞 Despite my success with the static binary from my laptop, I'll try building static binaries on my main machine as well, just to see if I can get it to work. The binaries complain about not being built in release mode, and they are indeed very slow. Can I do something about this?
Yeah, RedHat is the most difficult platform to support.
To make portable binaries, also deactivate
Thanks for pointing this out! The
Just add
Hmmm.. Building on my RedHat machine is still not working. Getting another set of errors now though... :)
I think I'll just abandon my attempts at building on that machine and be happy with the binary I get from my laptop. Sorry to hear about the issues with supporting builds on RedHat systems, especially seeing as they are so common in our line of work (essentially all cluster systems I've ever used here in Sweden are based on CentOS). Thanks a lot for all the help today.

Quick edit: One more thing! I actually needed to remove all anaconda-related paths from my $PATH before it started working. Maybe I wasn't clear enough on that. Just in case someone else has the same issue with linking to
Can you try once more with:
(the first is also necessary for static linking, the second only for dynamic) where /path/to/lib is the path to your GCC's lib or lib64 directory? I am not sure whether
Sure thing!
I don't think I have versions of
Ok, one more try with:
That's ok, because it should check the other parts, too. The important thing is that libgcc, libstdc++ and libgomp are taken from that directory (they should be there); it then checks the usual directories for the other libs.
It's a mess 😢
Ok! Cleaned out the entire build dir. Ran
The build took muuuch longer to complete (maybe two minutes or so), rather than the usual <0.5-1 min it takes to fail. It still seems I get a message about the binary not being built in release mode though...
Glad to hear! 👍
That's completely normal.
Ah, my internal check is case-sensitive and only checks for
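The pitfall described here (a case-sensitive comparison missing a lowercase `release` build type) can be avoided with a case-insensitive check. This is only an illustrative sketch, not the project's actual code:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Sketch of a case-insensitive build-type check: lowercase the input first,
// so "Release", "release" and "RELEASE" are all recognized.
// The function name is made up for this example.
bool isReleaseBuild(std::string buildType)
{
    std::transform(buildType.begin(), buildType.end(), buildType.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return buildType == "release";
}
```

Note the cast to `unsigned char` before `std::tolower`, which avoids undefined behavior for negative `char` values.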
Recompiled with
Yep, if it's swapping, it will never finish, because the high memory consumption is permanent during construction.
Short answer: roughly 10 times the total sequence length you are indexing (if your fasta headers are small).

Long answer: you have two 32-bit integers per suffix array position now, and the full suffix array needs to be constructed; that's 64 bit (8 byte) per text position, i.e. per basepair in your text. If you are doing translation from DNA to protein on the text first, the text size doubles (six translation frames, but length divided by 3). Then, in addition to the suffix array, the BWT (size of text) needs to be constructed, plus some helper data structures and extra buffers (1-2 times the size of the text).
In the future I want to precalculate the expected size and warn the user if the available memory seems too small. Possible solutions if you have too little memory:
The memory requirements for performing the search are actually much lower: around twice the size of the database file plus the size of the query file.
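The rules of thumb above can be captured in a tiny back-of-envelope estimator. The formulas are the approximations from this thread, not exact measurements, and all names here are made up for illustration:

```cpp
#include <cstdint>

// Index construction: ~10x the total sequence length (full suffix array at
// 8 bytes/position, plus BWT and 1-2x text size in helper buffers).
// DNA-to-protein translation doubles the effective text size
// (six frames, each one third of the original length).
uint64_t indexMemEstimate(uint64_t textBytes, bool translated)
{
    uint64_t effective = translated ? textBytes * 2 : textBytes;
    return effective * 10;
}

// Search: roughly twice the database file plus the query file.
uint64_t searchMemEstimate(uint64_t dbFileBytes, uint64_t queryFileBytes)
{
    return 2 * dbFileBytes + queryFileBytes;
}
```

For example, indexing 10 GB of translated DNA would land around 200 GB under these assumptions, which matches why swapping kills construction long before it finishes.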
I will keep it open as a TODO for adding the compile time option to cmake!
No problem! If you would like to be informed about new versions, please sign up to the newsletter:
Ok! Thanks for the amazingly great reply! |
Ok, but just to warn you: based on the database size I would expect the run-times to be
Ouch! That's quite some time... But I'm in no rush, I'll let it chug along :).
It wasn't so bad actually!
Searching against the indexed database is very quick! Using a little test query file with 3888 short peptides, running on 40 cores, it takes about 20-25 seconds with reasonable memory consumption (about 80 GB peak). The first search took about 4 minutes, I'm guessing because it had to put the index into the file system cache, but subsequent searches are lightning quick! Very impressed.
I have just added the macro and also added checks to the programs to make sure that incompatible indexes are caught: dd800f9
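A compatibility check like the one described could work roughly as follows. The header layout, macro name, and function names are assumptions for illustration, not the actual code from that commit: the idea is that the index records the offset width it was built with, and the loading program compares it against its own compiled-in width.

```cpp
#include <cstdint>
#include <stdexcept>

// Illustrative index metadata: the offset width is recorded at build time.
struct IndexHeader
{
    uint8_t seqOffsetBytes;  // 2 for 16-bit offsets, 4 for 32-bit
};

// The running binary knows its own width from the (hypothetical) macro.
constexpr uint8_t compiledOffsetBytes =
#ifdef LAMBDA_LONG_PROTEIN_SEQS
    4;
#else
    2;
#endif

// Reject an index built with a different offset width instead of
// silently misreading it.
void checkIndexCompatible(IndexHeader const & hdr)
{
    if (hdr.seqOffsetBytes != compiledOffsetBytes)
        throw std::runtime_error("index was built with a different offset width");
}
```

Storing such a tag in the index header is a common way to fail fast rather than produce garbage results from a binary-incompatible file.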
I was hoping to use lambda for mapping fairly short peptides to reference genome sequences, which I guess it isn't really intended for... Is there a reason for this? Perhaps the alignment approach used in lambda isn't well suited to this type of mapping?