
How to cluster almost 6 billion protein sequences? #1100

@chrisAta

Description


Hello!

We're preparing a new release of the MGnify Proteins database, which will significantly increase the number of non-redundant protein sequences from ~2.4 billion to ~5.7 billion, stored across 48 FASTA file chunks. We want to generate clusters and their representatives for this new release. We first tried to create the database by running this command:

createdb /path/to/mgy_proteins_1.fasta /path/to/mgy_proteins_2.fasta ... /path/to/mgy_proteins_48.fasta db

But it seems to hang and do nothing after the "Converting sequences" step. It creates quite sizeable db files (1.4 TB), and the db_index file does have the correct number of lines, matching the number of sequences in the input FASTA files. The final couple of lines of the output log look like this:

===================================================================================================     5737 Mio. sequences processed
===================================================================================================     5738 Mio. sequences processed
===================================================================================================     5739 Mio. sequences processed
===================================================================================================     5740 Mio. sequences processed
===================================================================================================     5741 Mio. sequences processed
===================================================================================================     5742 Mio. sequences processed
===================================================================================================     5743 Mio. sequences processed
=================================

I checked the GitHub issues and found this comment, which mentions a limit of around 4 billion sequences: #495 (comment). Is this limit still in place, and is it the reason the createdb command hangs?
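For what it's worth, the ~4 billion figure from #495 lines up with the unsigned 32-bit integer range, so my guess (an assumption on my part, not something I've confirmed in the MMseqs2 code) is that sequence indices overflow a 32-bit value somewhere, which would explain stalling at ~5.74 billion:

```shell
# Assumption: the ~4 billion limit is the unsigned 32-bit integer range.
uint32_max=$(( (1 << 32) - 1 ))   # 4294967295
stall_point=5743000000            # roughly where our createdb run stalls
echo "uint32 max: $uint32_max"
if [ "$stall_point" -gt "$uint32_max" ]; then
  echo "our sequence count is past the 32-bit range"
fi
```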

The recommendation seems to be to run it in multiple batches, which we can do, but is it then possible to merge all of the created DBs into one? Or would we run into issues again when running linclust because of this ~4 billion limit? What would your recommendation be?
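To make the batching question concrete, here is the kind of split we have in mind: a dry-run sketch that groups the 48 chunks into batch databases, each well under the ~4 billion-sequence mark. The batch size, the db_batchN names, and the even grouping are illustrative assumptions (our real chunks vary in size), and whether the resulting DBs can then be merged safely is exactly the open question:

```shell
# Dry run: print one createdb command per batch of FASTA chunks.
# BATCH_SIZE and the db_batchN output names are illustrative assumptions.
BATCH_SIZE=16   # 48 chunks / 16 per batch = 3 batch databases
batch=1
files=""
i=1
while [ "$i" -le 48 ]; do
  files="$files /path/to/mgy_proteins_${i}.fasta"
  if [ $((i % BATCH_SIZE)) -eq 0 ]; then
    echo "mmseqs createdb$files db_batch${batch}"
    files=""
    batch=$((batch + 1))
  fi
  i=$((i + 1))
done
```

Dropping the `echo` would actually run the three createdb jobs; we would still need to know how (or whether) to combine db_batch1..3 afterwards.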

Here's some further logging info in case it's useful:

MMseqs Version:                         d45e0c44404715475da3e1f06df6529d4c83e49e
Database type                           0
Shuffle input database                  true
Createdb mode                           0
Write lookup file                       1
Offset of numeric ids                   0
Threads                                 48
Compressed                              0
Mask residues                           0
Mask residues probability               0.9
Mask lower case residues                0
Mask lower letter repeating N times     0
Use GPU                                 0
Verbosity                               3

Thanks in advance and best regards,
Chris
