milot-mirdita released this Feb 24, 2021 · 67 commits to master since this release

New Taxonomy Workflow (new feature and breaking change)

We introduce a new taxonomy workflow for assigning taxonomic labels to nucleotide sequences by searching against protein reference databases. For details see:

Mirdita M, Steinegger M, Breitwieser F, Söding J, Levy Karin E: Fast and sensitive taxonomic assignment to metagenomic contigs. bioRxiv, doi: 10.1101/2020.11.27.401018 (2020)

The nucleotide-to-protein taxonomic assignment is now much faster and is optimized for the annotation of contigs. If you use MMseqs2 taxonomy to assign taxonomic labels to short reads, consider using the --orf-filter 0 parameter to disable the new filter stage, as it can reject too many short query sequences. Even with this parameter set, MMseqs2 is still considerably faster than in previous releases.
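
For short-read input, the recommendation above could look roughly like the following sketch. The database names are placeholders, and the RUN switch is a convention of this example only (not an MMseqs2 feature), so that the command is printed by default and executed only on request.

```shell
# Placeholder database names (assumptions, not from the release notes).
QUERY_DB="readsDB"        # short-read sequence database, e.g. built with createdb
TARGET_DB="refProtDB"     # protein reference database with taxonomy information
RESULT_DB="taxResult"
TMP_DIR="tmp"

# --orf-filter 0 disables the new filter stage that can reject short queries.
CMD="mmseqs taxonomy $QUERY_DB $TARGET_DB $RESULT_DB $TMP_DIR --orf-filter 0"
echo "$CMD"

# Execute only when explicitly requested and mmseqs is on PATH.
if [ "${RUN:-0}" = "1" ] && command -v mmseqs >/dev/null 2>&1; then
    $CMD
fi
```

Running the script with RUN=1 set in the environment executes the printed command instead of only displaying it.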

As our nucleotide-to-nucleotide taxonomic assignment does not support the 2bLCA assignment mode for stable lowest-common-ancestor computation, we previously set MMseqs2 to perform LCA assignment by top-hit (--lca-mode 4) as default. Approximate (see manuscript) 2bLCA is now again the default and we automatically switch to top-hit if given nucleotide-to-nucleotide input.
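
To keep the previous top-hit behaviour for a nucleotide-to-protein search, the LCA mode can be set explicitly. The following is a minimal sketch with placeholder database names; the RUN switch is an assumption of this example, used only to make it a dry run by default.

```shell
# Placeholder database names (assumptions, not from the release notes).
QUERY_DB="contigsDB"
TARGET_DB="refProtDB"
RESULT_DB="taxResultTopHit"
TMP_DIR="tmp"

# --lca-mode 4 selects top-hit assignment, which was the previous default.
LCA_CMD="mmseqs taxonomy $QUERY_DB $TARGET_DB $RESULT_DB $TMP_DIR --lca-mode 4"
echo "$LCA_CMD"

# Execute only when explicitly requested and mmseqs is on PATH.
if [ "${RUN:-0}" = "1" ] && command -v mmseqs >/dev/null 2>&1; then
    $LCA_CMD
fi
```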

Breaking changes

  • --slice-search is now called --exhaustive-search
  • Unify --compress --summarize --omit-consensus (in result2msa) to --msa-format-mode
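
Existing scripts that still pass --slice-search need to be updated. One way to do this is a simple text substitution, sketched below; migrate_flags is a hypothetical helper name, and the sample command line uses placeholder database names. The --msa-format-mode change is not covered here, since the mapping of the old flags to mode values is not given in these notes.

```shell
# Hypothetical helper: rewrite the renamed flag in text read from stdin.
migrate_flags() {
    sed 's/--slice-search/--exhaustive-search/g'
}

# Example: convert one old-style command line.
OLD_CMD='mmseqs search queryDB targetDB resultDB tmp --slice-search'
NEW_CMD=$(printf '%s\n' "$OLD_CMD" | migrate_flags)
echo "$NEW_CMD"
```

To convert a whole script, its contents can be piped through the same function and the result reviewed before replacing the original file.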

Features

  • Add GTDB and CDD to databases downloader #410
  • Add nrtotaxmapping to create taxonomy mapping from NR
  • Add unpackdb to split a database into separate files #406
  • Add majoritylca module for majority voting based taxonomy from alignment results
  • Add cpdb and lndb
  • Taxonomy information is stored in binary format (a single db_taxonomy file, instead of db_{named,nodes,merged}.dmp,db_mapping) to speed up read-in. Old format is still supported.
  • --exhaustive-search is usable with ungapped alignments (--alignment-mode 4)
  • Allow sequence/result database input in taxonomyreport #401/#408
  • msa2profile/result can skip the first sequence with --skip-query
  • createtaxdb can create a taxdb by mapping through .source in addition to .lookup (--tax-mapping-mode 1)
  • splitsequence can create a sequence database with original headers
  • align can return short cluster format if only identifiers are required (--alignment-output-mode)
  • tar2db can be used multi-threaded if input allows (e.g. .tar containing .gz files)
  • Encode species names in taxonomy blocklist to make sure we don't block random nodes (e.g. in GTDB)
  • Split non-index parts over additional files in split index case to reduce peak memory use
  • proteinaln2nucl can now compute scores and e-values
  • createdb can create a sequence database from a database containing fasta files (e.g. created by tar2db)
  • Add MMSEQS_FORCE_MERGE environment variable to force generating fully merged databases
  • Improved many descriptions, warnings and error messages

Bugs fixed

  • Fix filterresult off by one issue removing wrong sequences
  • Fix addtaxonomy always crashing due to invalid check #355
  • Reduce number of calls to posix_memalign to fix lock contention on macOS
  • extractorfs doesn't flood warnings due to short sequences anymore
  • expand2profile --pca is correctly set to 0
  • msa2profile always copies .lookup/source files instead of symlinking
  • Clustering of clustering input would not work with set-cover or connected-component
  • Short circuit --cluster-reassign if nothing can be reassigned
  • Fix temporary files not getting removed in linclust/cluster with --remove-tmp-files
  • Fix kmermatcher applying the user k-mer pattern during automatic k-mer selection, which caused breakage
  • Krona taxonomyreport was not working if no sequence was unclassified
  • Make Matcher::resultToBuffer buffer sizes consistent (could crash with very long backtraces, needs further refactoring)
  • Fix multiple locations where Util::checkAllocation could never be called as it would have crashed before
  • Whitespace containing parameters do not break workflows anymore (e.g. passing whitespaces to --sub-mat)
  • taxonomyreport and addtaxonomy parameters were not adjustable in easy-taxonomy
  • E-value parameters are now correctly parsed as doubles instead of floats #379
  • Add symlinks to splitdb #376
  • Increase maximum number of open files in DBReader
  • Include file size and modified date of inputs in temporary file hash calculation #372
  • --cov-mode 5 was not working #371
  • Database downloader deals correctly with redirects now
  • result2profile could crash if target database contained much longer sequences than query database
  • Stop symlinking header database (and other ancillary files) in filterresult

Developer

  • Add vector of predefined substitution matrices to add additional matrices in subprojects
  • Don't create false _has_{builtin,attribute} macros (see simd-everywhere/simde#691)
  • Add USE_SYSTEM_ZSTD cmake flag to use system provided zstd #411
  • Replace texlive with tectonic for faster/prettier userguide
  • Add more instructions to simd.h
  • Add initial fixes to get MMseqs2 working on s390x (work in progress)
  • Prebuilt macOS binary is now a Universal Mac Binary supporting SSE, AVX and Apple Silicon NEON
  • Build ARM64/PPC64LE binaries by cross-compiling
  • Add missing licenses and READMEs for vendored libraries #403
  • Update ALP to 1.98
  • Update xxhash to v0.8.0