
How to cluster almost 6 billion protein sequences? #1100

@chrisAta

Description


Hello!

We're preparing a new release of the MGnify Proteins database, which will significantly increase the number of non-redundant protein sequences from ~2.4 billion to ~5.7 billion, stored across 48 FASTA file chunks. We want to generate clusters and their representatives for this new release. We first tried to create the database by running this command:

createdb /path/to/mgy_proteins_1.fasta /path/to/mgy_proteins_2.fasta ... /path/to/mgy_proteins_48.fasta db

But it seems to hang and do nothing after the "Converting sequences" step. It creates quite sizeable db files (1.4 TB), and the db_index file does have the correct number of lines, matching the number of sequences in the input FASTA files. The final couple of lines of the output log look like this:

===================================================================================================     5737 Mio. sequences processed
===================================================================================================     5738 Mio. sequences processed
===================================================================================================     5739 Mio. sequences processed
===================================================================================================     5740 Mio. sequences processed
===================================================================================================     5741 Mio. sequences processed
===================================================================================================     5742 Mio. sequences processed
===================================================================================================     5743 Mio. sequences processed
=================================

I checked the GitHub issues and found this comment, which mentions a limit of around 4 billion sequences: #495 (comment). Is this limit still in place, and is it the reason the createdb command hangs?
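For what it's worth, the ~4 billion figure from #495 lines up with the unsigned 32-bit integer range, so my guess (an assumption on my part, not something I've confirmed in the MMseqs2 code) is that sequence indices overflow a 32-bit value somewhere, which would explain stalling at ~5.74 billion:

```shell
# Assumption: the ~4 billion limit is the unsigned 32-bit integer range.
uint32_max=$(( (1 << 32) - 1 ))   # 4294967295
stall_point=5743000000            # roughly where our createdb run stalls
echo "uint32 max: $uint32_max"
if [ "$stall_point" -gt "$uint32_max" ]; then
  echo "our sequence count is past the 32-bit range"
fi
```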

The recommendation seems to be to run it in multiple batches, which we can do, but is it then possible to merge all of the created DBs into one? Or would we run into issues again when running linclust because of this ~4 billion limit? What would your recommendation be?
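To make the batching question concrete, here is the kind of split we have in mind: a dry-run sketch that groups the 48 chunks into batch databases, each well under the ~4 billion-sequence mark. The batch size, the db_batchN names, and the even grouping are illustrative assumptions (our real chunks vary in size), and whether the resulting DBs can then be merged safely is exactly the open question:

```shell
# Dry run: print one createdb command per batch of FASTA chunks.
# BATCH_SIZE and the db_batchN output names are illustrative assumptions.
BATCH_SIZE=16   # 48 chunks / 16 per batch = 3 batch databases
batch=1
files=""
i=1
while [ "$i" -le 48 ]; do
  files="$files /path/to/mgy_proteins_${i}.fasta"
  if [ $((i % BATCH_SIZE)) -eq 0 ]; then
    echo "mmseqs createdb$files db_batch${batch}"
    files=""
    batch=$((batch + 1))
  fi
  i=$((i + 1))
done
```

Dropping the `echo` would actually run the three createdb jobs; we would still need to know how (or whether) to combine db_batch1..3 afterwards.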

Here's some further logging info in case it's useful:

MMseqs Version:                         d45e0c44404715475da3e1f06df6529d4c83e49e
Database type                           0
Shuffle input database                  true
Createdb mode                           0
Write lookup file                       1
Offset of numeric ids                   0
Threads                                 48
Compressed                              0
Mask residues                           0
Mask residues probability               0.9
Mask lower case residues                0
Mask lower letter repeating N times     0
Use GPU                                 0
Verbosity                               3

Thanks in advance and best regards,
Chris
