Releases · soedinglab/MMseqs2

31 Oct 09:22

6f45232

Latest

MMseqs2 Release 15 brings efficient single query searches with low memory overhead through the new ungapped-prefiltering mode (--prefilter-mode 1). We also improved our greedy clustering algorithm and added a large swath of smaller fixes and features. Thanks to all contributors for their vital contributions and fixes.

Breaking

Updated greedy cluster algorithm. The clustering picks better representatives to respect the sequence identity and coverage criteria. (2568829) Thanks @bbuchfink

New Features and Enhancements

Implement additional prefilter modes (standard double k-mer prefilter, ungapped prefilter, exhaustive searching) (5e119e9)
Added createclusearchdb and mkrepseqdb modules to build cluster-search databases. This was implemented for Foldseek; cluster-search in MMseqs2 will be implemented at a later point (9ae4458, 80f8b0b, 542f362, ad6dfc6, 91f2a6a, 8310cd6, 0019026, 76b7df1)
Implement target-side similar k-mer search mode for sequence-sequence prefiltering (71dd32e)
Rework ungappedprefilter to improve performance and expose additional parameters such as taxon filtering and db-load-mode to ungappedprefilter (8a89305, 800eb09, eb01b5b, 20d3afc)
Added gappedprefilter module for Smith-Waterman prefiltering, similar to ungappedprefilter (df77d9e)
Reworked pairaln for the ColabFold greedy taxonomy pairing mode (1514015)
Implemented experimental module for A3M filtering (167bbd1, 499bb73)
Implemented weighted clustering (bd080e6, b36070a, fd1837b) Thanks @AnnSeidel
Precomputed indices without k-mers can be created with --index-subset (314c1f0, 8fe3bf9)
Add result2neff module to extract Neff scores (4148e09) Thanks @neftlon
Add ppos format-output to convertalis for count of positive substitution scores (5edc79b) Thanks @Dohyun-s
Speed-up FASTA parsing in kseq.h with memchr (98406dd) Thanks @valentynbez @kloetzl

Bugfixes

Add min and max modes for result2stats (19dce03, 61e7734) Thanks @ClovisG
Fixed a segmentation fault in ca3m with the same database (f5f780a) Thanks @ClovisG
Fix crash when some input file sizes are an exact multiple of 4096 in convertalis and gff2db (712f288) Thanks @RuoshiZhang
Fixed issues for GTDB r214 database creation (4b52296) Thanks @apcamargo
Fix source number being limited to 16-bit (65k) (1d62fa0)
kseq now correctly handles input sequences larger than 2^31 bytes (07ca4a7)
Fixed unpackdb to work without a .lookup file and added support for writing compressed files (92d8cc3, 570e3ed)
createindex --check-compatible checks the k-mer threshold correctly now (bb0a1b3)
Fixed excessively long prefilter result lists leading to result truncation. This was primarily a Foldseek issue and shouldn't affect MMseqs2 (ed4c55f)
Corrected handling of multiline checks in createdb (6b93884)
Fix crash by disabling wrapped scoring when the target sequence is shorter than the query (8459b6b) Thanks @AnnSeidel
Fixed logic in reciprocal-best-hit by removing resAB_sort (3bcbdba) Thanks @StephanieSKim
Corrected handling of differently ordered parts of sequence databases in concatdbs (ea17d30)
Fix --single-step-clustering misspelled in cluster warning (fa6c093) Thanks @valentynbez

Build and Compatibility Updates

Addressed build and compatibility issues, including updates for newer compilers and architectures (e.g., Mac ARM64) (e26b9ad, 3e43617, b341b66, 932d32b) Thanks @A-N-Other
Added Mac ARM64 support in GitHub actions and updated from Ubuntu 18.04 to a newer image (1fea43d, 05132de)
Updated regression testing to fix errors in MPI test (2113766)

Developer

Introduced base: prefix to enable inheriting subprojects to find shadowed modules (e.g. Foldseek shadows createdb, but can use base:createdb to use MMseqs2's version) (90aa913)
Exported build architecture in CMake so subprojects can use it (fce06b1)

Contributors

A-N-Other, kloetzl, and 9 other contributors

13 Oct 12:31

martin-steinegger

14-7e284

7e28409

MMseqs2 Release 14-7e284

This is a major release containing features implemented for ColabFold, Foldseek, MMseqs2 profile-profile (not published yet, and still in preview) and many bugfixes. Thanks a lot to the contributors who submitted bug fixes.

If you are using the Docker Hub based MMseqs2 containers, please switch to the new GitHub Container Registry based ones. The Docker Hub containers will not be maintained in the future.

Breaking

Profile databases created by previous MMseqs2 releases won't work anymore with this release. Please recreate them from previous search results or MSAs with result2profile or msa2profile.
Profile k-mer threshold parameters were fitted to the new pseudo-count parameters (--pca, --pcb). Previous --k-score parameters will have differing sensitivity. However, most users will have set -s instead, which was fitted to match as closely as possible.

Features

gff2db now should actually work correctly after refactoring (488df86, thanks @RuoshiZhang)
result2msa now supports reading from precomputed index
Add db2tar: Create a tar file from a database
Add parsable columnar tsv output to databases with --tsv
Add taxonomic filtering during prefilter with --taxon-list
Add --comp-bias-corr-scale to adjust the weight of the compositional bias correction
Add --mask-prob parameter to adjust tantan's masking threshold
Add context specific pseudo-counts to result2profile
Add iterative profile-profile search workflow (thanks @haydenji0731)
Add support for profile-profile scoring in striped Smith-Waterman algorithm (thanks @haydenji0731)
Add support for gap-open/gap-close costs to striped Smith-Waterman algorithm (thanks @hgsommer)
Add environment variable MMSEQS_IGNORE_INDEX to ignore an existing precomputed index
createsubdb and view can now return results from identifiers in .lookup with --id-mode 1
Change compressdb loop to omp static to keep order
Improvements to nucleotide alignments and scoring (thanks @AnnSeidel)

Features built for ColabFold now available in MMseqs2

Add pairaln: taxonomic pairing on sequences for MSA building (9a0df0d, 5e245d1, 3f8695e, 3e92abf, edb8223, e19df7c)
Add A3M support to result2msa (--msa-format-mode 5)
Add A3M support with alignment information to result2msa (--msa-format-mode 6)
result2profile allows --diff 0
Make taxonomy mapping mmap'able for (near) instant read-in
Add tsv2exprofiledb workflow to create an expandable profile (profile-profile) db from TSVs
Enable result2profile/filterresult to read new expand alignment index
Add support to filter MSAs in buckets (filterresult, result2profile)
Add --filter-min-enable to enable filtering only above a minimum threshold of hits (c6d8ae0)
Expand can filter in each target cluster before expanding (75af0c8, 85ce847)

Bugfixes

summarizeresult was rejecting hits that match the coverage threshold exactly (#586, 67949d7)
Don’t use reserved filename characters in unpackdb (#467, c663497 thanks @cutecutecat)
Fix typo (violoations -> violations) (#526, 74c3aa6, thanks @Benjamin-Lee)
Fix potential endless loop in rescorediagonal
Fix prefilter/alignment with 0-size query input #433
Fix unpackdb parameter parsing issue
Make sure FILTER_RESULT variable is always correctly set for exhaustive search (d4a3354)
tar2db breaking with --tar-include/exclude (#561)
Wrong database name printed for variadic input when creating a tmp directory
extractorfs sometimes loading invalid start/stop codons on non-avx2 platforms
Don't mask consensus sequences in profiles
result2msa correctly prints X residues
Allocate CSProfile only if it's going to be used (d873697)
Taxonomy db paths are now correctly found if given a precomputed index (8ff26f2)
Encode more strings internally as base64 if special characters are used (16b5774, d155586)
Disable broken iterative profile searches in taxonomy (#432)
Fixed a possible segmentation fault in align (thanks @rchikhi)
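
The summarizeresult fix above concerns hits that sit exactly on the coverage threshold. As a minimal Python sketch of an inclusive coverage check, assuming coverage is the aligned length divided by the sequence length (the function names and the 0-based inclusive coordinates are illustrative, not MMseqs2's code):

```python
def coverage(start, end, seq_len):
    """Fraction of a sequence covered by the 0-based inclusive range [start, end]."""
    return (end - start + 1) / seq_len

def passes_coverage(start, end, seq_len, threshold):
    """Accept hits whose coverage is at least the threshold (inclusive)."""
    return coverage(start, end, seq_len) >= threshold

# A hit covering exactly 80 of 100 residues passes a 0.8 threshold.
print(passes_coverage(0, 79, 100, 0.8))
```

Using a strict `>` here instead of `>=` would reject hits that match the threshold exactly.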

MMseqs2 `databases`

Added VOGDB
Updated dbCAN2 to V9 and removed .aln suffix from profile names
Fix issues with ResFinder (#494, 56816b3), GTDB (#561, 678c82a), Kalamari (#531, ce7bf53), Uniref (#496, e85ceb9, thanks to @fanhuan)

Speedup

Rework of result2msa to avoid allocating a lot of memory
Improvement of speed for ungapped alignment in prefilter
TaxonomyExpression is faster with a single tax identifier (8ff7279)

MMseqs2 subprojects

MMseqs2-based subprojects can use databases too (5afd33c)
Add appenddbtoindex: augment a precomputed index with other databases in sub-projects
Allow subprojects to build their own precomputed indices (a506d67)
Add support for external k-mer thresholds for the prefilter (fea8d20)
Subprojects can define their own DbType validators

Developers

Added CirrusCI to test FreeBSD and old compilers (a2e2129, 904d0c6, a09a704, 4f1996a, 482dedc, 16830a5)
MMseqs2 Docker containers are now published in the GitHub Container Registry (eb203d3, 5185d3c, ba4e11f)
Our microtar fork can write tar files again (dcd180b)
Add URIs as allowed parameter inputs (3b9cf88)
Additional s390x fixes (linclust might work now)
Add support for new MultiParameter type
Bundled SIMDe was updated (thanks @mr-c)

Contributors

rchikhi, mr-c, and 7 other contributors

24 Feb 11:08

milot-mirdita

13-45111

45111b6

MMseqs2 Release 13-45111

New Taxonomy Workflow (new feature and breaking change)

We introduce a new taxonomy workflow for assigning taxonomic labels to nucleotide sequences by searching against protein reference databases. For details see:

Mirdita M, Steinegger M, Breitwieser F, Söding J, Levy Karin E: Fast and sensitive taxonomic assignment to metagenomic contigs. bioRxiv, doi: 10.1101/2020.11.27.401018 (2020)

The nucleotide-to-protein taxonomic assignment is now much faster and is optimized towards annotation of contigs. If you use MMseqs2 taxonomy to assign taxonomic labels to short reads, consider using the --orf-filter 0 parameter to disable the new filter stage as it can reject too many short query sequences. MMseqs2 is still considerably faster with this parameter set.

As our nucleotide-to-nucleotide taxonomic assignment does not support the 2bLCA assignment mode for stable lowest-common-ancestor computation, we previously set MMseqs2 to perform LCA assignment by top-hit (--lca-mode 4) as default. Approximate (see manuscript) 2bLCA is now again the default and we automatically switch to top-hit if given nucleotide-to-nucleotide input.
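
As a rough illustration of what a lowest-common-ancestor assignment computes, the following minimal Python sketch finds the LCA of several taxon IDs in a toy parent map. The taxonomy, the `PARENT` dictionary and the function names are made up for illustration; this is plain LCA, not the 2bLCA procedure described in the manuscript.

```python
# Toy taxonomy as child -> parent; IDs are illustrative, 1 is the root.
PARENT = {1: 1, 2: 1, 10: 2, 11: 2, 100: 10, 101: 10, 110: 11}

def lineage(taxid):
    """Return the path from taxid up to the root, inclusive."""
    path = [taxid]
    while PARENT[taxid] != taxid:
        taxid = PARENT[taxid]
        path.append(taxid)
    return path

def lca(taxids):
    """Lowest common ancestor of one or more taxon IDs."""
    taxids = list(taxids)
    common = set(lineage(taxids[0]))
    for t in taxids[1:]:
        common &= set(lineage(t))
    # Walking up from the first taxon, the first shared node is the deepest one.
    for node in lineage(taxids[0]):
        if node in common:
            return node

print(lca([100, 101]))
print(lca([100, 110]))
```

A top-hit mode, by contrast, would simply report the taxon of the best-scoring hit without combining lineages.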

Breaking changes

--slice-search is now called --exhaustive-search
Unify --compress --summarize --omit-consensus (in result2msa) to --msa-format-mode

Features

Add GTDB and CDD to databases downloader #410
Add nrtotaxmapping to create taxonomy mapping from NR
Add unpackdb to split a database into separate files #406
Add majoritylca module for majority voting based taxonomy from alignment results
Add cpdb and lndb
Taxonomy information is stored in binary format (a single db_taxonomy file, instead of db_{named,nodes,merged}.dmp,db_mapping) to speed up read-in. Old format is still supported.
--exhaustive-search is usable with ungapped alignments (--alignment-mode 4)
Allow sequence/result database input in taxonomyreport #401/#408
msa2profile/result can skip the first sequence with --skip-query
createtaxdb can create a taxdb by mapping through .source in addition to .lookup (--tax-mapping-mode 1)
splitsequence can create a sequence database with original headers
align can return short cluster format if only identifiers are required --alignment-output-mode
tar2db can be used multi-threaded if input allows (e.g. .tar containing .gz files)
Encode species names in taxonomy blocklist to make sure we don't block random nodes (e.g. in GTDB)
Split non-index parts over additional files in split index case to reduce peak memory use
proteinaln2nucl can now compute scores and e-values
createdb can create a sequence database from a database containing fasta files (e.g. created by tar2db)
Add MMSEQS_FORCE_MERGE environment variable to force generating fully merged databases
Improved many descriptions, warnings and error messages

Bugs fixed

Fix filterresult off by one issue removing wrong sequences
Fix addtaxonomy always crashing due to invalid check #355
Reduce numbers of calls to posix_memalign to fix lock contention on macOS
extractorfs doesn't flood warnings due to short sequences anymore
expand2profile --pca is correctly set to 0
msa2profile always copies .lookup/source files instead of symlinking
Clustering of clustering input would not work with set-cover or connected-component
Short circuit --cluster-reassign if nothing can be reassigned
Fix temporary files not getting removed in linclust/cluster with --remove-tmp-files
Fix kmermatcher setting user k-mer pattern in auto k-mer selection and breaking
Krona taxonomyreport was not working if no sequence was unclassified
Make Matcher::resultToBuffer buffer sizes consistent (could crash with very long backtraces, needs further refactoring)
Fix multiple locations where Util::checkAllocation could never be called as it would have crashed before
Whitespace containing parameters do not break workflows anymore (e.g. passing whitespaces to --sub-mat)
taxonomyreport and addtaxonomy parameters were not adjustable in easy-taxonomy
E-value parameters are now correctly parsed as doubles instead of floats #379
Add symlinks to splitdb #376
Increase maximum number of open files in DBReader
Include file size and modified date of inputs in temporary file hash calculation #372
--cov-mode 5 was not working #371
Database downloader deals correctly with redirects now
result2profile could crash if target database contained much longer sequences than query database
Stop symlinking header database (and other ancillary files) in filterresult
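
To see why parsing E-values as single-precision floats can matter (#379), this small Python sketch round-trips a value through 32-bit and 64-bit IEEE 754 encodings using the standard `struct` module. The helper names are illustrative and the snippet does not reproduce MMseqs2's parser.

```python
import struct

def as_float32(x):
    """Round-trip x through a 32-bit IEEE 754 float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def as_float64(x):
    """Round-trip x through a 64-bit IEEE 754 double."""
    return struct.unpack("d", struct.pack("d", x))[0]

evalue = 1e-50  # far below the smallest positive 32-bit float
print(as_float32(evalue))  # underflows to 0.0
print(as_float64(evalue))  # value is preserved
```

Very small E-value thresholds therefore need double precision to be represented at all.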

Developer

Add vector of predefined substitution matrices to add additional matrices in subprojects
Don't create false _has_{builtin,attribute} macros (see simd-everywhere/simde#691 (comment))
Add USE_SYSTEM_ZSTD cmake flag to use system provided zstd #411
Replace texlive with tectonic for faster/prettier userguide
Add more instructions to simd.h
Add initial fixes to get MMseqs2 working on s390x (work in progress)
Prebuilt macOS binary is now a Universal Mac Binary supporting SSE, AVX and Apple Silicon NEON
Build ARM64/PPC64LE binaries by cross-compiling
Add missing licenses and READMEs for vendored libraries #403
Update ALP to 1.98
Update xxhash to v0.8.0

01 Sep 11:22

martin-steinegger

12-113e3

113e321

MMseqs2 Release 12-113e3

Breaking changes

Remove --add-internal-id parameter from result2msa
filterdb --shuffle now shuffles randomly instead of deterministically
Taxonomy expressions in filtertax(seq)db interpret , as || now #320
convertalis pident output field now correctly reports percentage (0-100) sequence identity instead of fraction (0.00-1.00), use fident to print the fraction instead
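
The difference between the two convertalis fields is only one of scale. A minimal Python sketch, assuming identity is counted as identical columns over the alignment length (the exact definition used by MMseqs2 may differ; the function bodies are illustrative):

```python
def fident(aligned_query, aligned_target):
    """Fraction of alignment columns with identical residues (0.0-1.0)."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(q == t and q != "-" for q, t in zip(aligned_query, aligned_target))
    return matches / len(aligned_query)

def pident(aligned_query, aligned_target):
    """The same identity expressed as a percentage (0-100)."""
    return 100.0 * fident(aligned_query, aligned_target)

print(fident("ACDE", "ACDF"))
print(pident("ACDE", "ACDF"))
```

Scripts that compared pident against a fractional cutoff such as 0.9 need to switch to fident or rescale the cutoff to 90.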

Features

Support nucleotide clustering in cluster and easy-cluster
Support other architectures (SSE2/ARM64/POWER8/POWER9/etc) through SIMDe
Linclust is much faster on systems with a lot of CPU cores
Clustering update is faster, more stable and correctly deals with deleted sequences #272
Add easy workflow for reciprocal best hit searches easy-rbh
Add SILVA, Pfam-B, dbCAN2 to databases
databases produces taxonomy information for NR
Replace old greedy incremental clustering with new memory efficient version
Add result2dnamsa module to create MSAs of nucleotide sequences
Continued progress on profile-profile searching (result2pp, expandaln, expand2profile), stay tuned!
Add multi-parameter support to overwrite sequence-type-specific parameters, e.g. --gap-open "nucl:5,aa:11"
Add ORF information as output options to convertalis (qOrfStart, qOrfEnd, dbOrfStart, dbOrfEnd)
Speed up sorting using ips4o
Speed up masking through new version of tantan
Speed up multi-threaded writing of clustering results
Speed up reading of database indices and merging target split databases
Add memory tracking to account for index size when computing available memory (--split-memory-limit should be more reliable when searching/clustering billions of sequences).
Add --search-type 4 (translated/translated search) to createindex
Add convertalis --format-mode 3 HTML output based on MMseqs2 app (app.mmseqs.com)
Improve memory management in result2msa and result2profile modules
Add msa2result module to create an alignment result db from MSAs
Add filterresult to slim down result dbs with pairwise HHblits filtering #316
Add --kmers-per-sequence-scale to linsearch to extract a k-mer fraction instead of a fixed count
Add a random integer to --local-tmp path to avoid race conditions if multiple MMseqs2 instances run on the same machine
Add --max-seqs to ungappedprefilter
Add --tax-lineage-mode 2 parameter to print numeric taxids
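
The multi-parameter syntax shown above for --gap-open ("nucl:5,aa:11") can be read as a small per-sequence-type mapping. A minimal Python sketch of such a parser follows; the function name, the set of accepted type labels, and the handling of a plain value without labels are assumptions for illustration, not MMseqs2's actual behaviour.

```python
def parse_multi_param(value, seq_types=("nucl", "aa")):
    """Parse 'nucl:5,aa:11' or a plain '11' into a per-type dict of ints."""
    if ":" not in value:
        # Assumption: a plain value applies to every sequence type.
        return {t: int(value) for t in seq_types}
    result = {}
    for part in value.split(","):
        seq_type, _, number = part.partition(":")
        if seq_type not in seq_types:
            raise ValueError(f"unknown sequence type: {seq_type!r}")
        result[seq_type] = int(number)
    return result

print(parse_multi_param("nucl:5,aa:11"))
```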

Bugs fixed

rbh workflow was broken due to issues with filterdb
Fix -a in RBH search to show alignments
Fix PDB70 database creation in databases
Fix aria2c download support
Fix memory issues and MPI in kmermatcher
Fix memory issues in extractorfs when using AVX2
Fix --cluster-reassign to respect --cov-mode
Set-cover supports up to 2^32 sequences (previously crashed with more than 2^31)
Exit correctly if there is not enough disk space instead of crashing in the next module
Fix prefilter order instability when searching very redundant databases
Correctly parse keys from data files in filterdb --filter-file, this was causing instability in linsearch
Allow overwriting string parameters with empty strings
Fix ASAN issue in extractorf when using AVX2
Microtar would try to seek backwards constantly resulting in horrible gzip read performance
Avoid lookup writing to corrupt memory if an accession is too long
Fix various inconsistencies and usability issues in alignall:
- --alignment-mode inconsistent with align module
- --add-backtrace did not do anything
Fix restart of clusterings using reassignment cluster --cluster-reassign
Fix createdb not correctly reading gz/bzip files with --createdb-mode 1 #323

11 Feb 22:31

martin-steinegger

11-e1a1c

e1a1c12

MMseqs2 Release 11-e1a1c

At a glance: The MMseqs2 command line interface is cleaner and validates user input. Many MMseqs2 modules use less memory and run faster. The new databases module helps to download and set up databases. We now have chat support at chat.mmseqs.com.

Known Issues

rbh crashes due to invalid sorting mode (#290)
Homebrew's macOS version does not use multiple cores (#289)
prefilter results can be unstable between different runs for extremely redundant databases (#277)
linclust/cluster can crash for very small input sets (#274)

Breaking Changes

kmermatcher --skip-n-repeat-kmer parameter was replaced with --ignore-multi-kmer
It no longer discards whole sequences if a k-mer occurred too often; instead it skips the specific k-mers.
Either mode is only used in Plass and not in Linclust
--lca-ranks from (easy-)taxonomy and lca has to be delimited with semicolons (;) instead of colons (:)
--dont-shuffle flag was renamed to --shuffle true/false

Features

new databases workflow to list and download common databases.
Supported databases:

  Name                	Type      	Taxonomy	Url
- UniRef100           	Aminoacid 	     yes	https://www.uniprot.org/help/uniref
- UniRef90            	Aminoacid 	     yes	https://www.uniprot.org/help/uniref
- UniRef50            	Aminoacid 	     yes	https://www.uniprot.org/help/uniref
- UniProtKB           	Aminoacid 	     yes	https://www.uniprot.org/help/uniprotkb
- UniProtKB/TrEMBL    	Aminoacid 	     yes	https://www.uniprot.org/help/uniprotkb
- UniProtKB/Swiss-Prot	Aminoacid 	     yes	https://uniprot.org
- NR                  	Aminoacid 	       -	https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- NT                  	Nucleotide	       -	https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- PDB                 	Aminoacid 	       -	https://www.rcsb.org
- PDB70               	Profile   	       -	https://github.com/soedinglab/hh-suite
- Pfam-A.full         	Profile   	       -	https://pfam.xfam.org
- Pfam-A.seed         	Profile   	       -	https://pfam.xfam.org
- eggNOG              	Profile   	       -	http://eggnog5.embl.de
- Resfinder           	Nucleotide	       -	https://cge.cbs.dtu.dk/services/ResFinder
- Kalamari            	Nucleotide	     yes	https://github.com/lskatz/Kalamari

(easy-)search --slice-search is now usable. Slice search finds all hits that fulfill the alignment criteria while using only as much disk space as defined by --disk-space-limit
createdb and the various easy- workflows learned to read query input from STDIN
taxonomyreport learned to display the summarized taxonomy result with Krona
new filtertaxseqdb module for filtering sequence DBs with taxonomy information according to provided taxa
--taxon-list parameter understands expressions. E.g. get all bacterial and human sequences --taxon-list "2||9606"
easy-search and convertalis can now output taxonomic information using --format-output

taxid      Taxonomic identifier
taxname    Taxon Name
taxlineage Taxonomic lineage

speed up in (easy-)cluster/linclust by improving k-mer extraction
MMseqs2 consistently creates .source and .lookup files to track which input file a sequence came from
E.g. after mmseqs createdb input1.fa input2.fa seqDB, each sequence in seqDB can be traced back to input1.fa or input2.fa
createdb learned to index an existing (single-line-seq per entry) FASTA file without copying the FASTA content to a new database
align and rescorediagonal learned to align circular sequences
align exposes the z-drop parameter of its Banded Nucleotide alignment algorithm
reverseseq learned to reverse profiles
filterdb can filter rows with value within given percentage of first row
new aggregatetax module to assign a taxonomic label to contigs according to the fragments matched on the contig
Adjusting --max-seq-len is not required anymore, MMseqs2 automatically increases the length now.
MMseqs2 on Cygwin/Windows uses nedmalloc as its memory allocator now and does not massively slow down due to lock contention
new tar2db module to efficiently transform content of tar archives to MMseqs2 databases
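
The --taxon-list expression mentioned above ("2||9606" for all bacterial and human sequences) can be pictured as an OR over taxon IDs matched against each sequence's lineage. The following Python sketch assumes a toy table of abbreviated lineages and only models the `||` operator; the sequence names and helper functions are hypothetical and other expression syntax is not covered.

```python
# Toy lineages: sequence name -> set of taxon IDs on its path to the root.
# 2 = Bacteria and 9606 = Homo sapiens in the NCBI taxonomy; lineages are abbreviated.
LINEAGES = {
    "ecoli_prot": {1, 2, 562},
    "human_prot": {1, 2759, 9606},
    "yeast_prot": {1, 2759, 4932},
}

def matches(expression, lineage):
    """True if any taxon ID in an 'a||b||c' expression lies on the lineage."""
    wanted = {int(tok) for tok in expression.split("||")}
    return not wanted.isdisjoint(lineage)

def filter_by_taxon(expression, lineages=LINEAGES):
    """Names of all sequences whose lineage satisfies the expression."""
    return sorted(name for name, lin in lineages.items() if matches(expression, lin))

print(filter_by_taxon("2||9606"))
```

Matching against the whole lineage rather than only the leaf taxon is what lets a single ID such as 2 select every sequence below that node.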

Bug fixes

createindex would create corrupted indices for profile target databases
rbh workflow would create its result DB at an unexpected (wrong) location
(easy)-taxonomy --lca-mode 3 (Approx. LCA) was aligning invalid sequences in the second iteration and producing bad results
lca (and (easy)-taxonomy) add empty columns for unclassified sequences to be valid TSVs
kmermatcher uses xxhash for hashing now (faster)
kmermatcher avoids crashing if the machine does not have enough memory to process the data at once (affects linclust/cluster)
kmermatcher correctly deals with sequences longer than MAX_SHRT now
kmermatcher fixed various edge cases (e.g. alignment of 1-char sequences)
kmermatcher hash-shift would be ignored
offsetalignment could produce wrong results in the minus-strand
clust now correctly and consistently handles alignment DB input
clusthash better deals with nucleotide input now and several multi-threaded inefficiencies were resolved
(easy-)cluster --single-step-clustering could cluster unrelated sequences due to hash collisions
prefilter --diag-score 0 respects --min-ungapped-score
createseqfiledb could print empty sequence lines
taxonomyreport could crash if no sequence was unclassified
result2flat could crash with long sequence input
result2msa, result2profile, msa2profile backport filtering fix from HHblits
align could produce bad alignments if all sequence lengths in query DB were a lot shorter than in target DB
splitsequence fixed issues when combined with compressed databases
result2profile fix Filter2 bug of HH-suite in MMseqs2
apply would crash due to reading wrong entry lengths
filterdb --filter-expression was not thread safe and could corrupt results
filterdb --extract-lines and --trim-to-one-column are compatible with each other

Developers

Internal representation of sequences changed from 4-byte per character to 1-byte per character
Compilation under AppleClang + libomp works now (see util/build_osx.sh)
Tools inheriting from MMseqs2 can now add their own citations
MMseqs2 on macOS compiles with the macOS 10.9 SDK (removed symlinkat call; relevant for bioconda)

23 Aug 12:05

martin-steinegger

10-6d92c

6d92cd2

MMseqs2 Release 10-6d92c

At a glance: The MMseqs2 command line interface is cleaner and validates user input. Many MMseqs2 modules use less memory and run faster.

Known Issues

High sensitivity searches (higher than -s 6) with precomputed indices should fail. Pass --db-load-mode 3 as a workaround to the MMseqs2 call.

Breaking Changes

Default taxonomy mode is assigning the same taxonomic label as the top hit. The previous "approximate 2bLCA" mode can be used with --lca-mode 3 or the non-approximated 2bLCA with --lca-mode 2
MMseqs2 will refuse to compile on compilers without OpenMP support (Use -DREQUIRE_OPENMP=0 to force a single-threaded no OpenMP build)
The confusingly named (and probably non-functional) --global-alignment parameter is gone
File names of the latest precompiled binaries changed. All archives contain a copy of the user guide and the MMseqs2 binary in the same subfolder (see further down for binaries of release 10-6d92c):

SIMD	Linux	macOS	Windows
SSE4.1	mmseqs-linux-sse41.tar.gz	mmseqs-osx-sse41.tar.gz	mmseqs-win64.zip
AVX2	mmseqs-linux-avx2.tar.gz	mmseqs-osx-avx2.tar.gz	-

Known Issues

MMseqs2 on Windows seems to not scale well on multiple threads
MMseqs2 on Windows can crash when built with AVX2 support (mostly on VMs)

Features

createindex can precompute split indices to improve runtime when searching against a database that is larger than the system memory. Precomputed databases also require less overhead RAM, since only the required parts are loaded
easy-search, easy-taxonomy, easy-linclust and easy-cluster workflows can take any number of query FASTA or FASTQ files
MMseqs2 validates database types. It will exit with an error message on wrong input, where it would previously crash
kmermatcher reports the diagonal with the most k-mer matches
kmermatcher scales the number of k-mers with sequence length (--kmer-per-seq-scale)
rescorediagonal got two new rescore modes, one for global alignment scoring and one for scoring a quasi-global alignment fulfilling a local window criterion
Peak memory usage for reading in very large databases is greatly reduced. 128GB nodes should comfortably be able to deal with up to the maximum of 4.2 billion entries
Parameters taking byte values support syntax with a SI suffix (e.g., --split-memory-limit 64G)
Nucleotide substitution matrices should be user definable
Taxonomy report is compatible with Pavian. Thanks to Florian Breitwieser!
cluster workflow learned a reassignment mode --cluster-reassign. This mode corrects errors that occurred because of cascaded clustering
extractorfs can directly translate a nucleotide ORF to an amino acid sequence
result2stats can write TSV files
createsubdb supports softlinks instead of always hard copying the whole file to disk
reduced harddisk space usage for all cascaded clusterings
easy-taxonomy reports the top hit alignment as a separate output file with the suffix tophit_aln
createindex checks for whether an index needs to be recomputed were improved
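
Byte-valued parameters with a suffix, such as --split-memory-limit 64G in the list above, can be converted to a byte count along these lines. This Python sketch is illustrative only: whether a suffix means powers of 1000 or 1024, and how a value without a suffix is treated, are assumptions here and are exposed or marked as such.

```python
def parse_byte_size(text, base=1024):
    """Convert strings such as '64G' or '512M' into a number of bytes."""
    suffixes = {"K": 1, "M": 2, "G": 3, "T": 4}
    text = text.strip()
    last = text[-1].upper()
    if last in suffixes:
        return int(text[:-1]) * base ** suffixes[last]
    return int(text)  # assumption: no suffix means plain bytes

print(parse_byte_size("64G"))
print(parse_byte_size("64G", base=1000))
```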

Bug fixes

MMseqs2 did not compile on FreeBSD. Please let us know about free continuous integration options to make sure it will keep working in the future
proteinaln2nucl could return wrong coordinates
apply would deadlock when running with multiple threads
MPI searches are way more reliable, there were various issues around merging the separate results. MPI logic of split and merge is also integrated into the regression tests suite
prefilter splits nucleotide searches if not enough memory is available
kmermatcher could corrupt memory
rescorediagonal could produce wrong sequence identities when aligning mixed-case sequences
macOS builds were not actually static (still dynamically link libsystem however)
lca module could corrupt memory and crash
createdb does not crash on systems with only 4GB of RAM anymore
AVX2 and SSE4.1 builds could produce slightly different results
summarizeresults does not crash on empty alignments results anymore
fix wrong tophit_report in easy-taxonomy
Precompiled Windows builds were broken
Precomputed indices of databases with very short sequences could truncate alignments if the query sequences were longer

Developers

Tools using MMseqs2 as a framework do not need to export MMseqs2 modules again anymore
MMseqs2 uses Azure Pipelines for all platforms to run our regression tests suite and provide precompiled binaries
MMseqs2 runs under ASan without any issues. We fixed various small memory leaks

The regression suite is directly linked through a submodule

It can be used by running:

git submodule update --init
./util/regression/run_regression.sh $PATH_TO_MMSEQS/mmseqs $TMP_DIR

04 May 03:26

martin-steinegger

9-d36de

d36dea2

MMseqs2 Release 9-d36de

At a glance: Improved taxonomy, add colors to user output, improve computation progress bar, small speed ups and many bug fixes

Features

Add support for Kraken style taxonomy reports. Thanks to Florian Breitwieser
New easy-taxonomy workflow
New progress bar to reduce output
Colored errors and warnings

Bugs

Fix alignment problem in SSW library mengyao/Complete-Striped-Smith-Waterman-Library#61
Fix iterative profile search
Fix protein nucleotide index issues
Fix cluster update workflow
Fix critical multi threading bug in taxonomy workflow

01 Apr 01:18

martin-steinegger

8-fac81

fac81fa

MMseqs2 Release 8-fac81

At a glance: Faster searches and clustering through improved IO and better seeding. More search modes like tblastx, reciprocal best hit and linsearch. New output format SAM. Support for compressed databases to reduce hard disk and memory requirements.

Known Issues

Iterative search only works up to 2 iterations

Breaking Changes

MMseqs2 now saves a lot on IO by not merging result datafiles
There is still a single .index file, but the corresponding data files are split into multiple parts (as many as threads were used previously)
MMseqs2 now uses the VTML80 [1,2] substitution matrix to speed up the prefiltering (changeable by --seed-sub-mat), the final alignment is still computed with the Blosum62 (still changeable by --sub-mat)
All databases have now a .dbtype file
MMseqs2 Docker image is now based on Debian instead of Alpine
Changed Orf header format to be more space efficient. The new format is now orignIdentifer startPos(-/+)len flag
prefilter returns ungapped-alignment scores instead of e-values
createindex file extension is now .idx instead of the previous .[s]k[6,7] format

Features

Support for tblastx-style nucl-nucl translated searches
mmseqs search nuclDB1 nuclDb2 aln tmp --search-mode 2
Support for nucleotide searches
mmseqs search nuclDB1 nuclDb2 aln tmp --search-mode 3
convertalis has learned to return SAM formatted output (preview)
Database can be compressed by applying zstd on each entry (--compressed 1)
- Also added compress and decompress modules
rbh workflow for reciprocal best hit searches added
linclust can now cluster nucleotide sequences on both forward and reverse strand
Added linsearch, a lightning fast search for proteins and nucleotide sequences (preview; easy workflow variant easy-linsearch also added)
createlinindex computes an index for linsearch
taxonomy uses --orf-start-mode 1 to annotate more sequences
Added approx. 2bLCA to speed up computation, this is now the new default. The old mode can be turned on by --lca-mode 2
createdb recognizes sequences containing Uracil as DNA sequences
createdb is now faster through speeding up its shuffle operations
view module to view single entry in an MMseqs2 database
align module has learned --min-aln-len parameter to filter by minimal alignment length
Alignment modules (rescorediagonal, align) can align longer sequences now (not limited to 2^15 length)
Input sequences can now be soft-masked (lower-case letter masking) with --mask-lower-case instead of only hard-masked (replacing with X). The masking only applies to the prefilter stages (kmermatcher or prefilter) and can be combined with --mask
filterdb has learned --filter-expression parameter and mode that allows filtering by simple mathematical expressions
alignbykmer can be used for nucleotide searches
MMseqs2 did-you-mean functionality gives better suggestions
MMseqs2 does not repeat the whole parameter list for each submodule call anymore
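
The reciprocal best hit criterion behind the rbh workflow listed above can be sketched in a few lines of Python. This is a minimal illustration, not the MMseqs2 implementation: the (query, target, score) tuples, the function names, and the example identifiers are all assumptions made for the sketch.

```python
def best_hits(hits):
    """Map each query to its highest-scoring target.

    `hits` is an iterable of (query, target, score) tuples. This input
    shape is an assumption for illustration, not an MMseqs2 format.
    """
    best = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each sequence is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical hits from searching set A against set B and vice versa.
a_vs_b = [("a1", "b1", 250.0), ("a1", "b2", 90.0),
          ("a2", "b2", 120.0), ("a3", "b1", 200.0)]
b_vs_a = [("b1", "a1", 245.0), ("b1", "a3", 198.0), ("b2", "a2", 118.0)]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)  # [('a1', 'b1'), ('a2', 'b2')]
```

Here a3 is dropped because its best hit b1 prefers a1, which is the asymmetry the reciprocal criterion is meant to filter out.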

Bugs

Default parameters of map workflow are now set correctly
Some modules were using the wrong coverage parameter
Sliced profile search was losing high E-value hits
Sliced profile search is now stable
Profile-sequence alignment E-values were slightly too high
result2msa was crashing with profiles on the target side
result2msa should not crash with --allow-deletion anymore
Some parameters were never visible (with or without -h)
Various issues with MPI were resolved

Developers

Continuous integration now enforces no compile warnings
Continuous integration now tries to build for AArch64 with Docker and QEMU
We added a first draft of our developer guide to the wiki

References

[1] Müller T, Vingron M. Modeling amino acid replacement. J Comput Biol. 2000;7:761–76. doi: 10.1089/10665270050514918.

[2] Müller T, Spang R, Vingron M. Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol. 2002;19:8–13. doi: 10.1093/oxfordjournals.molbev.a003985.

29 Nov 23:43

martin-steinegger

7-4e23d

4e23d5f

MMseqs2 Release 7-4e23d

Changes since release 6-f5a1c

New features

Simplified taxonomy. We added createtaxdb to create taxonomically annotated databases. Result databases can be filtered by taxonomy with filtertaxdb, and addtaxonomy appends taxonomy information to result databases
index (createindex) support for translated target database searches
add nucleotide search (experimental)
support NEON CPU architecture (experimental)
improve performance of prefilter if the L2 cache is larger than 256K
easy-search automatically computes backtrace if requested by --format-output
Create search-2m workflow, similar to 2bLCA but without the LCA computation
We added a database preload mode (0: auto, 1: fread, 2: mmap, 3: mmap+touch). With fread, the processing time per query is 15% faster, but reading the database in is slower. mmap is used for the MMseqs2 webserver; it enables instant searches if the database is already in memory. mmap+touch uses mmap and touches every page.
We added a new tool, touchdb, which loads the database into memory. This can be useful for --db-load-mode 2.
Add local hard disk support (--local-tmp) for MPI runs. This reduces pressure on the NFS
Introduce sortresult tool to sort an unordered sequence db (e.g. from mergeresult)
prefilter now supports indexes with k-mer ranges > 2^31
convertkb can read multiple files
speed up mmap memory touch function
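
The difference between the fread, mmap, and mmap+touch preload modes described above can be illustrated with a small Python sketch. It only mirrors the general idea (copy everything up front, map lazily, or map and then fault in every page); the function names and the temporary file are assumptions for illustration and say nothing about how MMseqs2 implements these modes internally.

```python
import mmap
import os
import tempfile


def load_fread(path):
    """fread-style: copy the whole file into process memory up front."""
    with open(path, "rb") as handle:
        return handle.read()


def load_mmap(path, touch=False):
    """mmap-style: map the file so pages are loaded lazily on first access.

    With touch=True (mmap+touch-style), one byte per page is read so that
    every page is faulted in before the first real access.
    """
    with open(path, "rb") as handle:
        # mmap keeps its own duplicate of the descriptor, so the file
        # object can be closed once the mapping exists.
        mapped = mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
    if touch:
        for offset in range(0, len(mapped), mmap.PAGESIZE):
            _ = mapped[offset]
    return mapped


# Hypothetical stand-in for a database file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ACGT" * 5000)
    db_path = tmp.name

copied = load_fread(db_path)
lazy = load_mmap(db_path)
touched = load_mmap(db_path, touch=True)

# All three strategies expose identical bytes; they differ only in when
# the data is brought into memory.
same = copied == lazy[:] == touched[:]
print(len(copied), same)

lazy.close()
touched.close()
os.remove(db_path)
```

The trade-off matches the release note: the up-front copy pays its cost at load time, while a mapping defers it to first access unless the pages are touched beforehand.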

Breaking changes

New index version. Old indexes need to be recomputed
--format-output is now comma separated
Changed taxonomy database format. Old taxonomy databases are not supported anymore

Default parameter change

extractorfs default is now --orf-start-mode 1. This is important for translated searches in organisms with introns.

Bug fixes

Fix wrong alignment positions for translated searches
Fix off-by-one error in extractalignedregion
Fix bug in NcbiTaxonomy tool
Fix e-value threshold if -e < --e-profile

Developer

Update to newest ALP version

09 Oct 01:40

martin-steinegger

6-f5a1c

f5a1cdb

MMseqs2 Release 6-f5a1c

Changes since release 5-9375b

New features

Support user defined output format in convertalis.
Add parameters for gap open and gap extension costs.
Improve substitution matrix support. Letters of the alphabet can now be chosen freely.
Add a few PAM matrices to the data folder. Choose them with the --sub-mat parameter.
Support IUPAC codes in translated search.
Add parameter to define a spaced k-mer pattern.
Add a new module ungappedprefilter. It computes an optimal ungapped score using a vectorized algorithm.
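
To make "optimal ungapped score" concrete, here is a minimal scalar sketch in Python. It scans every diagonal of the query/target comparison and keeps the best-scoring contiguous run. The ungappedprefilter module is described above as vectorized; this sketch is not, and the flat match/mismatch scores stand in for a real substitution matrix purely as an assumption for illustration.

```python
def best_ungapped_score(query, target, match=2, mismatch=-1):
    """Best local alignment score without gaps between two sequences.

    Each diagonal (fixed offset j - i) is scanned once; the running score
    is reset to zero whenever it would drop below zero.
    """
    best = 0
    for diag in range(-(len(query) - 1), len(target)):
        running = 0
        i = max(0, -diag)   # start position in the query
        j = i + diag        # matching start position in the target
        while i < len(query) and j < len(target):
            score = match if query[i] == target[j] else mismatch
            running = max(0, running + score)
            best = max(best, running)
            i += 1
            j += 1
    return best


# Hypothetical sequences: the query aligns on one diagonal with one mismatch.
result = best_ungapped_score("ACDEFG", "XXACDQFGXX")
print(result)  # 9
```

Five matches and one mismatch on the same diagonal give 5 * 2 - 1 = 9; no other diagonal contains a match.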

Bug fixes

Fix easy-linclust parameter parsing issue.
Fix coverage filtering in align when the parameter --realign is set.
Fix sequence identity computation in rescorediagonal --rescore-mode 2.
Fix MPI support in apply.
Fix representative sequence output bug in result2repseq.
Fix possible MPI issues in modules creating symlinks.
Fix slightly wrong E-value computed in alignall module.

Known Issues

easy-search output has only one column. Workaround: Add parameter --format-output "".
