Support for reference sequences longer than 65535 #70

Closed
boulund opened this issue Oct 12, 2016 · 27 comments


boulund commented Oct 12, 2016

I was hoping to use lambda for mapping fairly short peptides to reference genome sequences, which I guess it isn't really intended for... Is there a reason for the 65535 limit on reference sequence length? Perhaps the alignment approach used in lambda isn't well suited for this type of mapping?

Member

h-2 commented Oct 12, 2016

Well, I had to make the decision for the most common use cases, because using larger position types measurably increases the size of the index file (for everyone else as well).
You can change this line to use a 32-bit integer:
https://github.com/seqan/lambda/blob/master/src/options.hpp#L57
and rebuild lambda, then it should "just work".

I can make a compile time option for this in the future so you could specify e.g. -DLAMBDA_LONG_PROTEIN_SUBJ_SEQS 😃
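
A minimal sketch of the trade-off, for illustration only. This is not Lambda's actual source: the macro name is borrowed from the suggestion above, while `SubjPos`, `maxSeqLength` and `positionTableBytes` are hypothetical names. It shows why a 16-bit position type caps subject sequences at 65535 and why widening it to 32 bits doubles the storage per stored position.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical compile-time switch, modelled on the option name suggested above.
// NOT Lambda's real code; it only illustrates the idea.
#ifdef LAMBDA_LONG_PROTEIN_SUBJ_SEQS
using SubjPos = std::uint32_t;   // longer subject sequences, larger index
#else
using SubjPos = std::uint16_t;   // default: positions up to 65535
#endif

// Longest subject sequence whose positions fit into the given type.
template <typename TPos>
constexpr std::uint64_t maxSeqLength()
{
    return std::numeric_limits<TPos>::max();
}

// Bytes needed to store one position per entry (hypothetical helper).
template <typename TPos>
constexpr std::size_t positionTableBytes(std::size_t numEntries)
{
    return numEntries * sizeof(TPos);
}

static_assert(maxSeqLength<std::uint16_t>() == 65535u,
              "a 16-bit position type limits sequences to 65535");
static_assert(positionTableBytes<std::uint32_t>(1000) ==
              2 * positionTableBytes<std::uint16_t>(1000),
              "32-bit positions take twice the space of 16-bit ones");
```

Compiling with `-DLAMBDA_LONG_PROTEIN_SUBJ_SEQS` would select the wider type in this sketch; how the real option ends up being wired into the build is not specified in this thread.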

h-2 added this to the 1.9.2 milestone Oct 12, 2016
h-2 self-assigned this Oct 12, 2016
Author

boulund commented Oct 13, 2016

Hi,

I respect your decision to use 16-bit uints for the index, that's fine. I'll definitely try the change you suggested. Hopefully I'll get an environment up where I can compile this on my own (I've been using the precompiled binaries so far). I'll report back as soon as I have time to try this.

Author

boulund commented Oct 13, 2016

While we're at it: I noticed the expected number of sequences is stored as uint32. That also seems a bit low for what I normally use [1]. I would suggest adding options to compile with 64-bit integers in these fields as well, since capping it at merely 4.3 million sequences isn't a lot. Without looking at the entire code base, I'm not sure if this would be possible with lambda; are there any implications I'm not considering?

[1] Genbank's non-redundant protein database currently contains more than 57 million sequences...

Member

h-2 commented Oct 13, 2016

> I'll report back as soon as I have time to try this.

Ok, please let me know if it works!

> While we're at it. I noticed the expected number of sequences is stored as uint32. That also seems a bit low for what I normally use [1]. I would suggest adding options to compile with 64 bit integers in these fields as well, since capping it merely 4.3 million sequences isn't a lot.

32-bit means up to about 4.3 billion sequences, I hope that's enough for the next two or three years 😄
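
The arithmetic can be checked with a small sketch. The names below (`kMaxUint32`, `kQuotedDbSize`, `fitsInUint32`) are hypothetical and not taken from Lambda; the database size is the "more than 57 million" figure quoted above, rounded down.

```cpp
#include <cstdint>
#include <limits>

// Largest sequence count representable in an unsigned 32-bit field (2^32 - 1).
constexpr std::uint64_t kMaxUint32 = std::numeric_limits<std::uint32_t>::max();

// Database size quoted in this thread ("more than 57 million sequences").
constexpr std::uint64_t kQuotedDbSize = 57000000ULL;

// True if a sequence count fits into a 32-bit counter (hypothetical helper).
constexpr bool fitsInUint32(std::uint64_t numSeqs)
{
    return numSeqs <= kMaxUint32;
}

static_assert(kMaxUint32 == 4294967295ULL, "roughly 4.3 billion");
static_assert(fitsInUint32(kQuotedDbSize), "57 million fits with plenty of room");
```

So a 32-bit counter leaves room for a database roughly 75 times the quoted size.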

Author

boulund commented Oct 13, 2016

Haha. Whoops. My bad 😆. Disregard that, I still haven't woken up today it seems.

Author

boulund commented Oct 13, 2016

I'm not having any luck trying to build lambda 😞. Trying to build from the lambda-next branch, but also tried building the master branch just to see if that works (it didn't). Tried it on three different machines:

  • RHEL 7.2, 2x20 core Intel Xeon E5-2650 v3 with 192GB RAM, GCC 4.9.4 built from scratch (this morning), cmake 3.6.1.
  • CentOS 6.7, 20 core Intel Xeon E5-2650 v3 with 132GB RAM, GCC 6.2.0, cmake 3.3.1.
  • Debian stretch/sid (Bash on Ubuntu on Windows 10 insider preview), 4 core i7-6600U, 16GB RAM, GCC 5.4.0-6ubuntu1~16.04.2, cmake 3.5.1

Besides lots of errors like this:
/home/boulund/src/lambda.git/src/lambda.hpp:2067:36: error: ‘_prepareAndRunSimdAlignment’ was not declared in this scope
which I guess is related to the SIMD development that's currently in progress, I also see this (sorry, the original output was in Swedish; translated here):

/usr/bin/ld: attempted static link of dynamic object `/home/boulund/anaconda3/lib/libz.so'
collect2: error: ld returned 1 exit status
src/CMakeFiles/lambda_indexer.dir/build.make:95: recipe for target 'bin/lambda_indexer' failed
make[2]: *** [bin/lambda_indexer] Error 1
CMakeFiles/Makefile2:124: recipe for target 'src/CMakeFiles/lambda_indexer.dir/all' failed
make[1]: *** [src/CMakeFiles/lambda_indexer.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
/usr/bin/ld: attempted static link of dynamic object `/home/boulund/anaconda3/lib/libz.so'
collect2: error: ld returned 1 exit status
src/CMakeFiles/lambda.dir/build.make:95: recipe for target 'bin/lambda' failed
make[2]: *** [bin/lambda] Error 1
CMakeFiles/Makefile2:87: recipe for target 'src/CMakeFiles/lambda.dir/all' failed
make[1]: *** [src/CMakeFiles/lambda.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2

Based on these messages I was thinking maybe my anaconda3 installation is messing things up, but removing all anaconda-stuff from my $PATH didn't change anything.

On the second machine the build and linking of lambda_indexer seems to work, and it appears to run like it should.

Member

h-2 commented Oct 13, 2016

Did you clone the master branch with the --recursive argument? Otherwise lambda will use SeqAn from your system which might be too old.

> I'm not having any luck trying to build lambda 😞. Trying to build from the lambda-next branch, but also tried building the master branch just to see if that works (it didn't).

Please try only with the master branch for now. And can you post a full log of the cmake-run? [for make the last parts are enough, but for cmake I need the full log]

> I also see this (sorry for this one being in Swedish):

Inga problem, jag förstår lite ;) ("No problem, I understand a little")

Author

boulund commented Oct 13, 2016

Yes, I used the --recursive argument. I don't have the SeqAn library installed system-wide (I've never had much success with SeqAn to be honest; very happy you supply precompiled binaries).

Strangely enough I managed to complete a build from the master branch just now. The compiled binaries don't work, however:

[bin]$ ./lambda                                                            
./lambda: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./lambda)                
[bin]$ ./lambda_indexer                                                    
./lambda_indexer: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./lambda_indexer)

Here is the ldd output for these binaries:

[bin]$ ldd lambda                                                          
./lambda: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./lambda)                
        linux-vdso.so.1 =>  (0x00007fffeaffe000)                                                        
        librt.so.1 => /lib64/librt.so.1 (0x00007fb16ab66000)                                            
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fb16a94a000)                                  
        libz.so.1 => /lib64/libz.so.1 (0x00007fb16a733000)                                              
        libbz2.so.1 => /lib64/libbz2.so.1 (0x00007fb16a523000)                                          
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fb16a21a000)                                    
        libm.so.6 => /lib64/libm.so.6 (0x00007fb169f17000)                                              
        libgomp.so.1 => /lib64/libgomp.so.1 (0x00007fb169d00000)                                        
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fb169aea000)                                      
        libc.so.6 => /lib64/libc.so.6 (0x00007fb169728000)                                              
        /lib64/ld-linux-x86-64.so.2 (0x00007fb16ad86000)                                                
[bin]$ ldd lambda_indexer                                                  
./lambda_indexer: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./lambda_indexer)
        linux-vdso.so.1 =>  (0x00007ffe0df5f000)                                                        
        librt.so.1 => /lib64/librt.so.1 (0x00007fa8d7c90000)                                            
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa8d7a74000)                                  
        libz.so.1 => /lib64/libz.so.1 (0x00007fa8d785d000)                                              
        libbz2.so.1 => /lib64/libbz2.so.1 (0x00007fa8d764d000)                                          
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fa8d7344000)                                    
        libm.so.6 => /lib64/libm.so.6 (0x00007fa8d7041000)                                              
        libgomp.so.1 => /lib64/libgomp.so.1 (0x00007fa8d6e2a000)                                        
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fa8d6c14000)                                      
        libc.so.6 => /lib64/libc.so.6 (0x00007fa8d6852000)                                              
        /lib64/ld-linux-x86-64.so.2 (0x00007fa8d7eb0000)                                                

Not sure what exactly you want from cmake, but here are two log files I could find. Hope they are of some use.
CMakeError.log.txt
CMakeOutput.log.txt

Author

boulund commented Oct 13, 2016

Deleted everything and started over again and tried another build on my laptop (Bash on Ubuntu on Windows thing) from the master branch, and it seems to succeed now. I could even produce static binaries that seem to run on my main crunching machine (the first of the three listed in my previous post). However, I got this output during compilation:

Scanning dependencies of target lambda                                                                                                                  
Scanning dependencies of target lambda_indexer                                                                                                          
[ 50%] Building CXX object src/CMakeFiles/lambda_indexer.dir/lambda_indexer.cpp.o                                                                       
[ 50%] Building CXX object src/CMakeFiles/lambda.dir/lambda.cpp.o                                                                                       
[ 75%] Linking CXX executable ../bin/lambda_indexer                                                                                                     
/usr/lib/gcc/x86_64-linux-gnu/5/libgomp.a(target.o): In function `gomp_target_init':
(.text+0xba): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
[ 75%] Built target lambda_indexer
[100%] Linking CXX executable ../bin/lambda
/usr/lib/gcc/x86_64-linux-gnu/5/libgomp.a(target.o): In function `gomp_target_init':
(.text+0xba): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
[100%] Built target lambda                                                                                                                              

Not sure if this is a problem though. It seems to be working now.

Member

h-2 commented Oct 13, 2016

It looks like you are using a non-system compiler but then linking against system libraries. This can lead to all sorts of problems unrelated to Lambda :)
This might already be solvable by removing your cmake files and then rerunning cmake with -DLAMBDA_STATIC_BUILD=ON.
For dynamic linking you need to pass the correct rpath to your compiler:

-DCMAKE_CXX_FLAGS="-L/path/to/lib" \
-DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath=/path/to/lib"

[However, static builds might break on CentOS, because it doesn't provide the static lib files for zlib; this is the error you are seeing up at the top]

Member

h-2 commented Oct 13, 2016

> Not sure if this is a problem though. It seems to be working now.

Yes, these warnings can be ignored!

Author

boulund commented Oct 13, 2016

Yeah. Compiling stuff that requires newer GCC versions than what RedHat repositories have available is frustrating, to say the least... 😞

Despite my success with the static binary from my laptop, I'll try building static binaries on my main machine as well, just to see if I can get it to work.

The binaries complain about not being built in release mode, and they are indeed very slow. Can I do something about this?

Member

h-2 commented Oct 13, 2016

> Yeah. Compiling stuff that requires newer GCC versions than what RedHat repositories have available is frustrating, to say the least...

Yeah, RedHat is the most difficult platform to support.

> Despite my success with the static binary from my laptop, I'll try building static binaries on my main machine as well, just to see if I can get it to work.

To make portable binaries, also deactivate LAMBDA_NATIVE_BUILD… but even then it might have issues with a too-old libstdc++ on RedHat :(

> The binaries complain about not being built in release mode, and they are indeed very slow. Can I do something about this?

Thanks for pointing this out! The CMAKE_BUILD_TYPE should be set to RELEASE. Previously this was automatically set to RELEASE if not specified by the user, but apparently this was lost when changing the build system recently and I didn't notice, because on my operating system the system default is always RELEASE. But on Linux the system default may be DEBUG so you got slow binaries now.

Just add -DCMAKE_BUILD_TYPE=RELEASE to your build, I will fix the default behaviour again for the next release.

Author

boulund commented Oct 13, 2016

Hmmm.. Building on my RedHat machine is still not working. Getting another set of errors now though... :)

[release]$ cmake3 ../../lambda -DLAMBDA_STATIC_BUILD=ON -DCMAKE_CXX_COMPILER=/storage/boulund/TTT/lambda_test/src/gcc/bin/g++ -DCMAKE_BUILD_TYPE=RELEASE -DLAMBDA_NATIVE_BUILD=OFF
Compiler Detection                                                                                                                             
-- The CXX compiler identification is GNU 4.9.4                                                                                                
-- Check for working CXX compiler: /storage/boulund/TTT/lambda_test/src/gcc/bin/g++                                                            
-- Check for working CXX compiler: /storage/boulund/TTT/lambda_test/src/gcc/bin/g++ -- works                                                   
-- Detecting CXX compiler ABI info                                                                                                             
-- Detecting CXX compiler ABI info - done                                                                                                      
-- Detecting CXX compile features                                                                                                              
-- Detecting CXX compile features - done                                                                                                       

Dependency detection                                                                                                                           
-- Found a local SeqAn library provided with the Lambda source code.                                                                           
   This will be preferred over system global headers.                                                                                          
-- Performing Test CXX14_BUILTIN                                                                                                               
-- Performing Test CXX14_BUILTIN - Failed                                                                                                      
-- Performing Test CXX14_FLAG                                                                                                                  
-- Performing Test CXX14_FLAG - Success                                                                                                        
-- Looking for C++ include execinfo.h                                                                                                          
-- Looking for C++ include execinfo.h - found                                                                                                  
--   Determined version is 2.2.0                                                                                                               
-- These dependencies where found:                                                                                                             
     OPENMP     TRUE      -fopenmp                                                                                                             
     ZLIB       TRUE      1.2.7                                                                                                                
     BZIP2      TRUE      1.0.6                                                                                                                
     SEQAN      TRUE      2.2.0                                                                                                                
-- The requirements where met.                                                                                                                 

Build configuration                                                                                                                            
-- LAMBDA version is: 1.0.1                                                                                                                    
-- The following options are selected for the build:    
     LAMBDA_FASTBUILD         OFF                                                          
     LAMBDA_LINGAPS_OPT       OFF                                                          
     LAMBDA_MMAPPED_DB        ON                                                           
     LAMBDA_NATIVE_BUILD      OFF                                                          
     LAMBDA_STATIC_BUILD      ON                                                           
-- Run 'cmake -LH' to get a comment on each option.                                        
-- Remove CMakeCache.txt and re-run cmake with -DOPTIONNAME=ON|OFF to change an option.    

Setting up unit tests                                                                      
-- Configuring done                                                                        
-- Generating done                                                                         
-- Build files have been written to: /home/boulund/TTT/lambda_test/src/lambda_build/release
[release]$ make -j2
Scanning dependencies of target lambda                                           
Scanning dependencies of target lambda_indexer                                   
[ 25%] Building CXX object src/CMakeFiles/lambda_indexer.dir/lambda_indexer.cpp.o
[ 50%] Building CXX object src/CMakeFiles/lambda.dir/lambda.cpp.o                
[ 75%] Linking CXX executable ../bin/lambda_indexer                              
/usr/bin/ld: cannot find -lrt                                                    
/usr/bin/ld: cannot find -lpthread                                               
/usr/bin/ld: cannot find -lz                                                     
/usr/bin/ld: cannot find -lbz2                                                   
/usr/bin/ld: cannot find -lm                                                     
/usr/bin/ld: cannot find -lpthread                                               
/usr/bin/ld: cannot find -lc                                                     
collect2: error: ld returned 1 exit status                                       
make[2]: *** [bin/lambda_indexer] Error 1                                        
make[1]: *** [src/CMakeFiles/lambda_indexer.dir/all] Error 2                     
make[1]: *** Waiting for unfinished jobs....                                     
[100%] Linking CXX executable ../bin/lambda                                      
/usr/bin/ld: cannot find -lrt                                                    
/usr/bin/ld: cannot find -lpthread                                               
/usr/bin/ld: cannot find -lz                                                     
/usr/bin/ld: cannot find -lbz2                                                   
/usr/bin/ld: cannot find -lm                                                     
/usr/bin/ld: cannot find -lpthread                                               
/usr/bin/ld: cannot find -lc                                                     
collect2: error: ld returned 1 exit status                                       
make[2]: *** [bin/lambda] Error 1                                                
make[1]: *** [src/CMakeFiles/lambda.dir/all] Error 2                             
make: *** [all] Error 2                                                          

I think I'll just abandon my attempts at building on that machine and just be happy with the binary I get from my laptop. Sorry to hear about the issues with supporting builds on RedHat systems, especially seeing as they are so common in our line of work (essentially all cluster systems I've ever used here in Sweden are based on CentOS). Thanks a lot for all the help today.

Quick edit: One more thing! I actually needed to remove all anaconda related paths from my $PATH before it started working. Maybe I wasn't clear enough on that. Just in case someone else has the same issue with linking to anaconda3/lib/libz.so.

Member

h-2 commented Oct 13, 2016

Can you try once more with:

-DCMAKE_CXX_FLAGS="-L/path/to/lib" -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath=/path/to/lib"

(the first is also necessary for static linking, the second only for dynamic)

Here /path/to/lib is the path to your GCC's lib or lib64 directory. I am not sure whether /storage/boulund/TTT/lambda_test/src/gcc/ is the actual install dir of your compiler; if it is, the path would likely be /storage/boulund/TTT/lambda_test/src/gcc/lib64...

Author

boulund commented Oct 13, 2016

Sure thing!
Actually that is the install dir. As I said earlier, I compiled this GCC version from scratch this morning just so I could try out this teeny uint32 change in the index :).

[release]$ cmake3 ../../lambda -DLAMBDA_STATIC_BUILD=ON -DCMAKE_CXX_COMPILER=/storage/boulund/TTT/lambda_test/src/gcc/bin/g++ -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_CXX_FLAGS="-L/storage/boulund/TTT/lambda_test/src/gcc/lib64" -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath=/storage/boulund/TTT/lambda_test/src/gcc/lib64"                                                                            
Compiler Detection                                                                                                                        
-- The CXX compiler identification is GNU 4.9.4                                                                                           
-- Check for working CXX compiler: /storage/boulund/TTT/lambda_test/src/gcc/bin/g++                                                       
-- Check for working CXX compiler: /storage/boulund/TTT/lambda_test/src/gcc/bin/g++ -- works                                              
-- Detecting CXX compiler ABI info                                                                                                        
-- Detecting CXX compiler ABI info - done                                                                                                 
-- Detecting CXX compile features                                                                                                         
-- Detecting CXX compile features - done                                                                                                  

Dependency detection                                                                                                                      
-- Found a local SeqAn library provided with the Lambda source code.                                                                      
   This will be preferred over system global headers.                                                                                     
-- Performing Test CXX14_BUILTIN                                                                                                          
-- Performing Test CXX14_BUILTIN - Failed                                                                                                 
-- Performing Test CXX14_FLAG                                                                                                             
-- Performing Test CXX14_FLAG - Success                                                                                                   
-- Looking for C++ include execinfo.h                                                                                                     
-- Looking for C++ include execinfo.h - found                                                                                             
--   Determined version is 2.2.0                                                                                                          
-- These dependencies where found:                                                                                                        
     OPENMP     TRUE      -fopenmp                                                                                                        
     ZLIB       TRUE      1.2.7                                                                                                           
     BZIP2      TRUE      1.0.6                                                                                                           
     SEQAN      TRUE      2.2.0                                                                                                           
-- The requirements where met.                                                                                                            

Build configuration                                                                                                                       
-- LAMBDA version is: 1.0.1                                                                                                               
-- The following options are selected for the build:                                                                                      
     LAMBDA_FASTBUILD         OFF                                                                                                         
     LAMBDA_LINGAPS_OPT       OFF                                                                                                         
     LAMBDA_MMAPPED_DB        ON                                                                                                          
     LAMBDA_NATIVE_BUILD      ON                                                                                                          
     LAMBDA_STATIC_BUILD      ON                                                                                                          
-- Run 'cmake -LH' to get a comment on each option.                                                                                       
-- Remove CMakeCache.txt and re-run cmake with -DOPTIONNAME=ON|OFF to change an option.                                                   

Setting up unit tests                                                                                                                     
-- Configuring done                                                                                                                       
-- Generating done                                                                                                                        
-- Build files have been written to: /home/boulund/TTT/lambda_test/src/lambda_build/release
[release]$ make -j2                              
Scanning dependencies of target lambda_indexer                                   
Scanning dependencies of target lambda                                           
[ 25%] Building CXX object src/CMakeFiles/lambda_indexer.dir/lambda_indexer.cpp.o
[ 50%] Building CXX object src/CMakeFiles/lambda.dir/lambda.cpp.o                
[ 75%] Linking CXX executable ../bin/lambda_indexer                              
/usr/bin/ld: cannot find -lrt                                                    
/usr/bin/ld: cannot find -lpthread                                               
/usr/bin/ld: cannot find -lz                                                     
/usr/bin/ld: cannot find -lbz2                                                   
/usr/bin/ld: cannot find -lm                                                     
/usr/bin/ld: cannot find -lpthread                                               
/usr/bin/ld: cannot find -lc                                                     
collect2: error: ld returned 1 exit status                                       
make[2]: *** [bin/lambda_indexer] Error 1                                        
make[1]: *** [src/CMakeFiles/lambda_indexer.dir/all] Error 2                     
make[1]: *** Waiting for unfinished jobs....                                     
[100%] Linking CXX executable ../bin/lambda                                      
/usr/bin/ld: cannot find -lrt                                                    
/usr/bin/ld: cannot find -lpthread                                               
/usr/bin/ld: cannot find -lz                                                     
/usr/bin/ld: cannot find -lbz2                                                   
/usr/bin/ld: cannot find -lm                                                     
/usr/bin/ld: cannot find -lpthread                                               
/usr/bin/ld: cannot find -lc                                                     
collect2: error: ld returned 1 exit status                                       
make[2]: *** [bin/lambda] Error 1                                                
make[1]: *** [src/CMakeFiles/lambda.dir/all] Error 2                             
make: *** [all] Error 2                                                          

I don't think I have versions of librt, libpthread, libz, libbz2, libm, and libc in the lib folder where I installed GCC (and pointed cmake to). I guess this is why it won't work? I feel like I have a lot to learn about C++ stuff...

Member

h-2 commented Oct 13, 2016

Ok, one more try with: -DLAMBDA_STATIC_BUILD=OFF :)

@h-2
Member

h-2 commented Oct 13, 2016

I don't think I have versions of librt, libpthread, libz, libbz2, libm, and libc in the lib folder where I installed GCC (and pointed cmake to).

That's ok, because it checks the other locations, too. The important thing is that libgcc, libstdc++ and libgomp are taken from that directory (they should be there); it then checks the usual directories for the other libs.
However, since you activated static builds, it looks for .a files instead of .so files. Most Unix systems provide both, but the RedHat/CentOS/Fedora family decided to strip the .a files out, so it can't find them.
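
To see which of the libraries from the linker errors lack a static archive, a small check along these lines may help. It is only a sketch: the function name and the default directory list are assumptions (common Linux library locations), not something lambda or cmake provides, and the linker may search other paths on your system.

```python
from pathlib import Path

# Assumption: typical Linux library directories; adjust for your system.
DEFAULT_LIB_DIRS = (
    "/usr/lib64",
    "/usr/lib",
    "/lib64",
    "/lib",
    "/usr/lib/x86_64-linux-gnu",
)


def missing_static_libs(names, lib_dirs=DEFAULT_LIB_DIRS):
    """Return the names for which no lib<name>.a file exists in lib_dirs."""
    missing = []
    for name in names:
        archive = f"lib{name}.a"
        # A static build needs the .a archive; a .so alone is not enough.
        if not any((Path(d) / archive).is_file() for d in lib_dirs):
            missing.append(name)
    return missing


# Library names taken from the -l flags in the linker errors above.
libs = ["rt", "pthread", "z", "bz2", "m", "c"]
print("no static archive found for:", missing_static_libs(libs))
```

If the list is non-empty, either install the static variants of those libraries or build with -DLAMBDA_STATIC_BUILD=OFF as suggested above.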

I feel like I have a lot to learn about C++ stuff...

It's a mess 😢

@boulund
Author

boulund commented Oct 13, 2016

Ok!

Cleaned out the entire build dir. Ran

cmake3 ../../lambda -DLAMBDA_STATIC_BUILD=OFF -DCMAKE_CXX_COMPILER=/storage/boulund/TTT/lambda_test/src/gcc/bin/g++ -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_CXX_FLAGS="-L/storage/boulund/TTT/lambda_test/src/gcc/lib64" -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath=/storage/boulund/TTT/lambda_test/src/gcc/lib64"

and make -j2. It actually seems to work!

The build took muuuch longer to complete (maybe two minutes or so), rather than the usual <0.5-1 min to fail.

It still seems I get a message about the binary not being built in release mode though...
But it runs now anyway! Thanks so much!

@h-2
Member

h-2 commented Oct 13, 2016

It actually seems to work!

Glad to hear! 👍

The build took muuuch longer to complete (maybe two minutes or so), rather than the usual <0.5-1 min to fail.

That's completely normal.

It still seems I get a message about the binary not being built in release mode though...

Ah, my internal check is case sensitive and only checks for =Release and not =RELEASE, but for the actual build I think (hope!) that it doesn't make a difference.
You can double-check with the readelf tool:
http://stackoverflow.com/a/12801855
or recompile with =Release to make sure.

@boulund
Author

boulund commented Oct 14, 2016

Recompiled with =Release to get rid of the debug warning. Currently running the lambda_indexer; it's been running since this morning (a few hours now). It has printed some kind of progress meter for SuffixArray generation, it seems, but it hasn't updated yet and is still at 0%. Memory consumption is high: it has currently requested 338GB of virtual memory, of which 247GB is in use (the machine has 252GB to spare when no other processes are running). I think it's swapping a fair bit (iotop shows the process spending about 80% of its time in the SWAPIN column).

Can you calculate the expected memory consumption of the indexing somehow? I'm trying to index about 3.5k whole genome sequences, some of them split across a few contigs each. If the expected memory consumption is within reasonable bounds I could try to prepare the index on a machine with even more memory (I think I can round up a 512GB node if the expected time consumption is less than a few days).
Other than that, I think this issue is as good as done. You can close it if you want to. Thanks again for all the help.

@h-2
Member

h-2 commented Oct 14, 2016

I think it's swapping a fair bit.

Yep, if it's swapping, it will never finish, because the high memory consumption is permanent during construction.

Can you calculate the expected memory consumption of the indexing somehow?

Short answer: roughly 10 times the total sequence length you are indexing (if your FASTA headers are small: sequence length ~= total file size).
If your database is translated, roughly 20 times.

Long answer: Well, you have two 32-bit integers per suffix array position now, and the full suffix array needs to be constructed. That's 64 bits (8 bytes) per text position, i.e. per base pair in your text. If you are doing translation from DNA to protein on the text first, the text size doubles (six translation frames, but length divided by 3). Then, in addition to the suffix array, the BWT (size of the text) needs to be constructed, plus some helper data structures and extra buffers (1-2 times the size of the text).

For a translated database of untranslated length n, that gives 8 * 2n + 2 * 2n = 20n bytes.

In the future I want to precalculate the expected size and warn the user if the available memory seems too small.
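
As a rough illustration of that estimate (a sketch only, not lambda's actual code; the function name and the constants' names are made up here, and the factor of 2 for the BWT and helper structures is the approximate figure from the formula above):

```python
def estimate_index_memory_bytes(total_sequence_length, translated=False):
    """Rough peak memory for index construction, following 8 * 2n + 2 * 2n."""
    bytes_per_suffix_array_entry = 8  # two 32-bit integers per text position
    extra_bytes_per_text_char = 2     # BWT plus helpers/buffers (approximate)
    # Six-frame translation at one third the length doubles the text size.
    text_length = total_sequence_length * (2 if translated else 1)
    return (bytes_per_suffix_array_entry + extra_bytes_per_text_char) * text_length


# Example: 18 GB of nucleotide sequence, indexed as a translated database.
needed = estimate_index_memory_bytes(18 * 10**9, translated=True)
print(f"{needed / 1e9:.0f} GB")  # prints "360 GB"
```

Comparing this number against the machine's free RAM before starting gives an early hint that the indexer would end up swapping.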

Possible solutions if you have too little memory:

  • just split the database into multiple smaller files and search them individually
  • use --algorithm skew7ext, which is an on-disk instead of an in-memory algorithm. Note, however, that the required amount of free disk space is even higher than the required memory for the other algorithms; I'd say at least twice as much. Also, this algorithm is not parallelized, so it is much slower than the other algorithms (there is no progress indicator, so you just need to wait). If you do have the disk space, also specify --tmp-dir /path/to/huge/disk so that it is used. Note that network storage is too slow for this to make sense.
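
For the first option, splitting a FASTA file can be sketched as below. This is a hypothetical helper, not part of lambda: the function name, the output naming scheme and the size threshold are all assumptions. Records are never split across files, so a chunk can exceed the threshold by up to one record.

```python
def split_fasta(path, max_residues, out_prefix):
    """Split a FASTA file into numbered chunks at record boundaries.

    A new chunk is started at the next header once the current chunk
    holds at least max_residues sequence characters.
    Returns the list of chunk file names that were written.
    """
    outputs = []
    out = None
    residues = 0
    try:
        with open(path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    if out is None or residues >= max_residues:
                        if out is not None:
                            out.close()
                        name = f"{out_prefix}.{len(outputs):04d}.fasta"
                        out = open(name, "w")
                        outputs.append(name)
                        residues = 0
                elif out is None:
                    continue  # ignore anything before the first header
                else:
                    residues += len(line.strip())
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return outputs
```

Each chunk would then be indexed and searched on its own; combining the per-chunk results afterwards is left to the user.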

The memory requirements for performing the search are actually much lower: around twice the size of the database file plus the size of the query file.

Other than that, I think this issue is as good as done. You can close it if you want to.

I will keep it open as a TODO for adding the compile time option to cmake!

Thanks again for all the help.

No problem! If you would like to be informed about new versions, please sign up to the newsletter:
https://lists.fu-berlin.de/listinfo/lambda-users (less than one email a month)
Also (in case you haven't seen it yet) the wiki has lots of help on parameters and output formats:
https://github.com/seqan/lambda/wiki

@boulund
Author

boulund commented Oct 14, 2016

Ok! Thanks for the amazingly great reply!
Using the formula from your short answer would mean I would need about 360GB of RAM to construct this index (about 18GB of sequence data). Hmm.. I'll try with the on-disk alternative then. There's disk space a-plenty! :)

@h-2
Member

h-2 commented Oct 14, 2016

Ok, but just to warn you: based on the database size I would expect the run-times to be

  • 8-12 hours when using radixsort and enough RAM and more than 20 CPU cores
  • a few days or even a week when using skew7ext

@boulund
Author

boulund commented Oct 14, 2016

Ouch! That's quite some time... But I'm in no rush, I'll let it chug along :).

@boulund
Author

boulund commented Oct 17, 2016

It wasn't so bad actually!

[lambda_test]$ time ./lambda_indexer -d /storage/TTT/reference_data/reference_genomes.fasta -p tblastn --tmp-dir /storage/boulund/TTT/lambda_test/temp --algorithm skew7ext
Loading Subject Sequences and Ids... done.
Dumping Subj Ids... done.
 dumping untranslated subject lengths...No Seg-File specified, no masking will take place.Dumping binary seqan mask file... done.
translating...Dumping unreduced Subj Sequences... done. 
Generating Index... done.
Writing Index to disk... done.

real    1309m24.853s
user    913m2.870s
sys     167m57.245s

Searching against the indexed database is very quick! Using a little test query file with 3888 short peptides, running on 40 cores, it takes about 20-25 seconds with reasonable memory consumption (about 80GB peak). The first search took about 4 minutes, I'm guessing because it had to put the index into file system cache, but subsequent searches are lightning quick! Very impressed.

@h-2
Member

h-2 commented Nov 23, 2016

I have just added the macro and also added checks to the programs to make sure that they catch incompatible indexes: dd800f9

@h-2 h-2 closed this as completed Nov 23, 2016