MMseqs2: ultra fast and sensitive protein search and clustering suite

MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets. MMseqs2 is open-source, GPL-licensed software implemented in C++ for Linux, MacOS, and (as a beta version, via Cygwin) Windows. The software is designed to run on multiple cores and servers and scales very well. MMseqs2 can run 10000 times faster than BLAST. At 100 times the speed of BLAST, it achieves almost the same sensitivity. It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed.

Publications

Steinegger M and Soeding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi: 10.1038/nbt.3988 (2017).

Steinegger M and Soeding J. Clustering huge protein sequence sets in linear time. Nature Communications, doi: 10.1038/s41467-018-04964-5 (2018).

Mirdita M, Steinegger M and Soeding J. MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics, doi: 10.1093/bioinformatics/bty1057 (2019).

Documentation

The MMseqs2 user guide is available in our GitHub Wiki or as a PDF file (Thanks to pandoc!). We provide a tutorial of MMseqs2 here.

Stay up to date on MMseqs2/Linclust updates by following Martin on Twitter.

Installation

MMseqs2 can be used by compiling from source, downloading a statically compiled version, using Homebrew, conda or Docker. MMseqs2 requires a 64-bit system (check with uname -a | grep x86_64) with at least the SSE4.1 instruction set (check by executing cat /proc/cpuinfo | grep sse4_1 on Linux or sysctl -a | grep machdep.cpu.features | grep SSE4.1 on MacOS).

 # install by brew
 brew install mmseqs2
 # install via conda
 conda install -c bioconda mmseqs2
 # install via Docker
 docker pull soedinglab/mmseqs2
 # static build with SSE4.1
 wget https://mmseqs.com/latest/mmseqs-linux-sse41.tar.gz; tar xvfz mmseqs-linux-sse41.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH
 # static build with AVX2
 wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz; tar xvfz mmseqs-linux-avx2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

The AVX2 version is faster than the SSE4.1 version. Check whether AVX2 is supported by executing cat /proc/cpuinfo | grep avx2 on Linux or sysctl -a | grep machdep.cpu.leaf7_features | grep AVX2 on MacOS. We also provide static binaries for MacOS and Windows at mmseqs.com/latest.
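
The manual checks above can be wrapped in a small script. The following is a minimal sketch, not part of MMseqs2: pick_build and cpu_flags are hypothetical helper names, and the sketch assumes the feature flags are spelled as in the grep commands above (avx2 and sse4_1 on Linux, AVX2 and SSE4.1 on MacOS).

```shell
# Sketch: choose between the static AVX2 and SSE4.1 builds.
# pick_build and cpu_flags are hypothetical helpers, not MMseqs2 commands.

# pick_build FLAGS: print "avx2", "sse41" or "unsupported" for a string of CPU flags.
pick_build() {
    local flags
    # Lowercase and squeeze whitespace so Linux and MacOS spellings both match.
    flags=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -s '[:space:]' ' ')
    case " $flags " in
        *" avx2 "*) echo "avx2" ;;
        *" sse4_1 "*|*" sse4.1 "*) echo "sse41" ;;
        *) echo "unsupported" ;;
    esac
}

# cpu_flags: print the feature flags of the current machine (Linux or MacOS).
cpu_flags() {
    if [ -r /proc/cpuinfo ]; then
        grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2
    else
        sysctl -n machdep.cpu.features machdep.cpu.leaf7_features 2>/dev/null
    fi
}

# Example: echo "mmseqs-linux-$(pick_build "$(cpu_flags)").tar.gz"
```

On a Linux machine with AVX2 the commented example would print the name of the AVX2 archive used in the download commands above. A result of "unsupported" means the SSE4.1 requirement is not met.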

MMseqs2 comes with bash command and parameter auto-completion, which can be activated by adding the following lines to your $HOME/.bash_profile:

        # replace "/Path to MMseqs2" with the location of your MMseqs2 checkout
        if [ -f "/Path to MMseqs2/util/bash-completion.sh" ]; then
            source "/Path to MMseqs2/util/bash-completion.sh"
        fi

Compilation from source

Compiling MMseqs2 from source has the advantage that it will be optimized for the specific system, which should improve its performance. To compile MMseqs2, git, g++ (4.8 or later) and cmake (2.8.12 or later) are needed. Afterwards, the MMseqs2 binary will be located in the build/bin/ directory.

    git clone https://github.com/soedinglab/MMseqs2.git
    cd MMseqs2
    mkdir build
    cd build
    cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
    make -j 4
    make install
    export PATH=$(pwd)/bin/:$PATH

❗️ To compile MMseqs2 on MacOS, first install the gcc compiler from Homebrew. The default MacOS clang compiler does not support OpenMP, so MMseqs2 built with it can only use a single thread. Then use the following cmake call:

    CC="$(brew --prefix)/bin/gcc-9" CXX="$(brew --prefix)/bin/g++-9" cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..

Getting started

We provide easy workflows to cluster, search and assign taxonomy. These easy workflows are a shorthand to deal directly with FASTA/FASTQ files as input and output. MMseqs2 provides many modules to transform, filter, execute external programs and search. However, these modules use the MMseqs2 database formats, instead of the FASTA/FASTQ format. For maximum flexibility, we recommend using MMseqs2 workflows and modules directly. Please read more about this in the documentation.

Cluster

For clustering, MMseqs2 easy-cluster and easy-linclust are available.

easy-cluster by default clusters the entries of a FASTA/FASTQ file using a cascaded clustering algorithm.

    mmseqs easy-cluster examples/DB.fasta clusterRes tmp --min-seq-id 0.5 -c 0.8 --cov-mode 1        

easy-linclust clusters the entries of a FASTA/FASTQ file. The runtime scales linearly with input size. This mode is recommended for huge datasets.

    mmseqs easy-linclust examples/DB.fasta clusterRes tmp     

By default, sequence identity is estimated. To output the real sequence identity, use --alignment-mode 3. Read more about the clustering format in our user guide.

Please adjust the clustering criteria and check that the temporary directory provides enough free space. For disk space requirements, see the user guide.
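
To get a quick overview of a clustering result, the cluster sizes can be tabulated. The sketch below assumes that the workflow wrote a two-column, tab-separated file (representative, member), which we take to be named clusterRes_cluster.tsv for the calls above; please check the user guide for the exact output files. cluster_sizes is a hypothetical helper, not an MMseqs2 module.

```shell
# Sketch: count the members of each cluster in a two-column TSV
# (column 1: representative, column 2: member). The file layout is an assumption.

# cluster_sizes FILE: print "representative<TAB>member_count", largest cluster first.
cluster_sizes() {
    awk -F'\t' '{ n[$1]++ } END { for (r in n) printf "%s\t%d\n", r, n[r] }' "$1" |
        sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Example: cluster_sizes clusterRes_cluster.tsv | head
```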

Search

The easy-search workflow searches directly with a FASTA/FASTQ file against either another FASTA/FASTQ file or an already existing MMseqs2 database.

    mmseqs easy-search examples/QUERY.fasta examples/DB.fasta alnRes tmp

It is also possible to pre-compute the index for the target database:

    mmseqs createdb examples/DB.fasta targetDB
    mmseqs createindex targetDB tmp
    mmseqs easy-search examples/QUERY.fasta targetDB alnRes tmp

The speed and sensitivity of the search can be adjusted with the -s parameter and should be adapted to your use case (see setting sensitivity -s parameter). A very fast search would use a sensitivity of -s 1.0, while a very sensitive search would use a sensitivity of up to -s 7.0. A detailed guide on how to speed up searches is here.

The output can be customized with the --format-output option. For example, --format-output "query,target,qaln,taln" returns the query and target accessions and the pairwise alignments in tab-separated format. You can choose many different output columns.
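
Because the result is plain tab-separated text, it can be post-processed with standard tools. The following sketch keeps only the highest-scoring row per query. best_hits is a hypothetical helper, and the score column number is something you have to supply yourself: it depends on the columns you requested (for instance column 3 for a three-column output such as "query,target,bits", where the column name bits is an assumption to verify against the user guide).

```shell
# Sketch: keep the best-scoring hit per query from a tab-separated result file.
# best_hits is a hypothetical helper; the score column depends on your --format-output.

# best_hits FILE SCORE_COL: for each query (column 1), print the row with the
# highest numeric value in column SCORE_COL. Ties keep the first row seen.
best_hits() {
    awk -F'\t' -v c="$2" '
        !($1 in best) || $c + 0 > best[$1] { best[$1] = $c + 0; row[$1] = $0 }
        END { for (q in row) print row[q] }
    ' "$1" | sort
}

# Example: best_hits alnRes 3
```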

Taxonomy

The easy-taxonomy workflow can be used to assign taxonomic labels to sequences. It performs a search against a target sequence database and computes the lowest common ancestor of all equal-scoring top hits (default). Other assignment options are available through --lca-mode.

    mmseqs createdb examples/DB.fasta targetDB
    mmseqs createtaxdb targetDB tmp
    mmseqs createindex targetDB tmp
    mmseqs easy-taxonomy examples/QUERY.fasta targetDB alnRes tmp

By default, createtaxdb assigns every sequence with a Uniprot accession to a taxonomic identifier and downloads the NCBI taxonomy. We also support BLAST, SILVA or custom taxonomic databases.

Read more about the taxonomy format and the classification in our user guide.

Supported search modes

MMseqs2 provides many additional search modes:

Many modes can also be combined. You can, for example, do a translated nucleotide against protein profile search.

Memory Requirements

The minimum memory requirement of MMseqs2 for cluster or linclust is 1 byte per sequence residue; search needs 1 byte per target residue. Sequence databases can be compressed using the --compress flag: DNA sequences can be reduced by a factor of ~3.5 and proteins by ~1.7.
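
The 1 byte per residue rule allows a rough lower bound to be computed directly from a FASTA file. This is a minimal sketch under that rule only: fasta_residues and min_memory_mib are hypothetical helpers, they handle FASTA (not FASTQ) input, and the real memory footprint of a run will be higher than this minimum.

```shell
# Sketch: lower bound on memory from the "1 byte per residue" rule.
# Both functions are hypothetical helpers and expect FASTA (not FASTQ) input.

# fasta_residues FILE: count all residues, i.e. characters on non-header lines.
fasta_residues() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { printf "%.0f\n", n }' "$1"
}

# min_memory_mib FILE: residues * 1 byte, rounded up to whole MiB.
min_memory_mib() {
    local residues
    residues=$(fasta_residues "$1")
    echo $(( (residues + 1048575) / 1048576 ))
}

# Example: min_memory_mib examples/DB.fasta
```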

MMseqs2 checks the available system memory and automatically divides the target database in parts that fit into memory. Splitting the database will increase the runtime slightly. It is possible to control the memory usage using --split-memory-limit.

How to run MMseqs2 on multiple servers using MPI

MMseqs2 can run on multiple cores and servers using OpenMP and Message Passing Interface (MPI). MPI assigns database splits to each compute node, which are then computed with multiple cores (OpenMP).

Make sure that MMseqs2 was compiled with MPI by using the -DHAVE_MPI=1 flag (cmake -DHAVE_MPI=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=. ..). Our precompiled static version of MMseqs2 cannot use MPI. The version string of MMseqs2 will have a -MPI suffix, if it was built successfully with MPI support.
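
The suffix check can be scripted before submitting a job. In this sketch has_mpi_suffix is a hypothetical helper, and obtaining the version string through mmseqs version is an assumption; use whichever way your build reports its version.

```shell
# Sketch: test whether a version string carries the -MPI suffix.
# has_mpi_suffix is a hypothetical helper, not an MMseqs2 command.

# has_mpi_suffix VERSION: succeed (exit status 0) if VERSION ends in "-MPI".
has_mpi_suffix() {
    case "$1" in
        *-MPI) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (assumes "mmseqs version" prints the version string):
#   has_mpi_suffix "$(mmseqs version)" || echo "this build has no MPI support" >&2
```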

To search with multiple servers, call the search or cluster workflow with the MPI command exported in the RUNNER environment variable. The databases and temporary folder have to be shared between all nodes (e.g. through NFS):

    RUNNER="mpirun -pernode -np 42" mmseqs search queryDB targetDB resultDB tmp