Hello!
We're preparing a new release of MGnify Proteins database, which will significantly increase the size of non-redundant protein sequences from ~2.4 billion to ~5.7 billion, stored in 48 different fasta file chunks. We want to generate clusters and their representatives for this new release. We've first tried to create the database by running this command:
createdb /path/to/mgy_proteins_1.fasta /path/to/mgy_proteins_2.fasta ... /path/to/mgy_proteins_48.fasta db
But after the "Converting sequences" step it seems to hang and do nothing. It does create quite sizeable db files (1.4 TB), and the db_index file has the correct number of lines, matching the number of sequences in the input FASTA files. The final few lines of the output log look like this:
=================================================================================================== 5737 Mio. sequences processed
=================================================================================================== 5738 Mio. sequences processed
=================================================================================================== 5739 Mio. sequences processed
=================================================================================================== 5740 Mio. sequences processed
=================================================================================================== 5741 Mio. sequences processed
=================================================================================================== 5742 Mio. sequences processed
=================================================================================================== 5743 Mio. sequences processed
=================================
I looked through the GitHub issues and found this comment mentioning a limit of around 4 billion sequences: #495 (comment). Is this limit still in place, and is it the reason the createdb command hangs?
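For context, if that limit stems from a 32-bit sequence index (an assumption on my part; the exact cause is what I'm asking about), the cap would be 2^32 ≈ 4.29 billion entries, which our sequence count exceeds. A quick sanity check:

```python
# If sequence IDs are stored as 32-bit unsigned integers, the index
# can address at most 2**32 entries (an assumed explanation for the
# ~4 billion limit mentioned in issue #495).
LIMIT_32BIT = 2**32          # 4_294_967_296
n_sequences = 5_743_000_000  # approximate count from the log above

print(n_sequences > LIMIT_32BIT)  # True: the index would overflow
```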
The recommendation there seems to be to run createdb in multiple batches, which we can do. But would it then be possible to merge all of the created databases into one? Or would we run into this ~4 billion limit again when running linclust? What would your recommendation be?
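To make the batching question concrete, here is a small planning sketch that partitions our 48 FASTA chunks into batches that each stay under the assumed 2^32 cap and prints a createdb command per batch. The per-chunk sequence counts are hypothetical placeholders (~5.7 billion spread evenly), and the commands are only printed, not executed:

```python
# Partition FASTA chunks into batches that each stay under the assumed
# ~2**32 sequence limit, then print one createdb command per batch.
LIMIT = 2**32

# Hypothetical per-chunk counts: ~5.76 billion sequences over 48 chunks.
chunk_counts = {f"mgy_proteins_{i}.fasta": 120_000_000 for i in range(1, 49)}

batches, current, current_n = [], [], 0
for name, n in chunk_counts.items():
    # Start a new batch if adding this chunk would exceed the limit.
    if current and current_n + n > LIMIT:
        batches.append(current)
        current, current_n = [], 0
    current.append(name)
    current_n += n
if current:
    batches.append(current)

for i, batch in enumerate(batches, 1):
    print(f"mmseqs createdb {' '.join(batch)} db_batch{i}")
```

With these placeholder counts the 48 chunks fall into two batches; in practice we would use the real per-chunk counts, and the open question is whether the resulting per-batch databases can then be merged for a single linclust run.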
Here's some further logging info in case it's useful:
MMseqs Version: d45e0c44404715475da3e1f06df6529d4c83e49e
Database type 0
Shuffle input database true
Createdb mode 0
Write lookup file 1
Offset of numeric ids 0
Threads 48
Compressed 0
Mask residues 0
Mask residues probability 0.9
Mask lower case residues 0
Mask lower letter repeating N times 0
Use GPU 0
Verbosity 3
Thanks in advance and best regards,
Chris