Clustering using a batch system #764

@boratyng

Hi,

I am trying to use MMseqs2 to cluster a large protein database, splitting the work into batch jobs. I followed the search example from https://github.com/soedinglab/mmseqs2/wiki#how-to-run-mmseqs2-on-multiple-servers-using-batch-systems, searching each batch of the database against the whole database. I then tried to use the search results to compute clusters with the clust subcommand. Here is my script:

$MMSEQS createdb $INFASTA $DB
$MMSEQS splitdb $DB ${DB}_split --split $NUM_SPLITS

for i in $(ls ${DB}_split_*_$NUM_SPLITS) ; do
      $MMSEQS search $i $DB ${i}_search tmp
done

$MMSEQS mergedbs ${DB}_split_0_${NUM_SPLITS}_search ${DB}_search $(awk 'BEGIN {for (i=1;i < '$NUM_SPLITS';i++) printf("'$DB'_split_%d_'$NUM_SPLITS'_search ", i);}')

$MMSEQS clust ${DB} ${DB}_search ${DB}_clust 

mmseqs clust fails with the error "Sequence db size != result db size".
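
The size mismatch named in the error can be checked directly by comparing entry counts of the two databases. The following is only a sketch: it assumes each MMseqs2 database has a companion .index file with one line per entry, and count_entries and same_size are hypothetical helper names, not MMseqs2 commands.

```shell
# Count the entries of a database by counting the lines of its .index file.
# Assumption: the .index file holds one line per database entry.
count_entries() {
    wc -l < "$1.index" | tr -d ' '
}

# Hypothetical helper: succeeds only if both databases have the same entry count.
same_size() {
    [ "$(count_entries "$1")" -eq "$(count_entries "$2")" ]
}

# Report the counts only when DB is set and both index files exist.
if [ -n "${DB:-}" ] && [ -f "$DB.index" ] && [ -f "${DB}_search.index" ]; then
    echo "sequence db entries: $(count_entries "$DB")"
    echo "result db entries:   $(count_entries "${DB}_search")"
    same_size "$DB" "${DB}_search" || echo "entry counts differ"
fi
```

If the two counts differ, the merged result database does not hold one entry per sequence in $DB, which would be consistent with the error message.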

Is there a way to combine the search results into a single results database, or to compute clusters for each database batch and merge them? Or is there any other way to do clustering on a batch system (without MPI)?

Your Environment

Linux CentOS.
MMseqs2 Release 14-7e284: https://github.com/soedinglab/MMseqs2/releases/download/14-7e284/mmseqs-linux-avx2.tar.gz
