@martin-steinegger martin-steinegger released this Oct 9, 2018 · 48 commits to master since this release

Changes since release 5-9375b

New features

  • Support user defined output format in convertalis.
  • Add parameters for gap open and gap extension costs.
  • Improve substitution matrix support. Letters of the alphabet can now be chosen freely.
  • Add a few PAM matrices to the data folder. Choose them with the --sub-mat parameter.
  • Support IUPAC codes in translated search.
  • Add parameter to define a spaced k-mer pattern.
  • Add a new module ungappedprefilter. It computes an optimal ungapped score using a vectorized algorithm.
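
As a rough illustration of what an "optimal ungapped score" means, the sketch below scans every diagonal of a query/target pair and keeps the best-scoring contiguous run of aligned positions. It is a plain, non-vectorized Python sketch and not the ungappedprefilter implementation; the function name and the match/mismatch scores (+2/-1) are assumptions chosen for illustration, where a real run would use a substitution matrix.

```python
def best_ungapped_score(query, target, match=2, mismatch=-1):
    """Best score of any gap-free local alignment between two strings.

    The match/mismatch scheme is a hypothetical stand-in for a real
    substitution matrix.
    """
    best = 0
    # Each diagonal is identified by the offset d = j - i.
    for d in range(-(len(query) - 1), len(target)):
        running = 0
        i = max(0, -d)
        j = i + d
        while i < len(query) and j < len(target):
            step = match if query[i] == target[j] else mismatch
            # Kadane-style update: restart the run once it drops below zero.
            running = max(0, running + step)
            best = max(best, running)
            i += 1
            j += 1
    return best
```

Because no gaps are allowed, each diagonal can be scored independently of the others, which is what makes this kind of computation a natural fit for vectorization.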

Bug fixes

  • Fix easy-linclust parameter parsing issue.
  • Fix coverage filtering in align when the parameter --realign is set.
  • Fix sequence identity computation in rescorediagonal --rescore-mode 2.
  • Fix MPI support in apply.
  • Fix representative sequence output bug in result2repseq.
  • Fix possible MPI issues in modules creating symlinks.
  • Fix slightly wrong E-value computed in alignall module.

@martin-steinegger martin-steinegger released this Sep 4, 2018 · 117 commits to master since this release

Changes since release 4-0b8cc

Bug fixes

  • bool flag parameters (e.g. -a) work again
  • swapresults will deterministically rank results
  • shellcompletion does not report run time anymore

@martin-steinegger martin-steinegger released this Sep 4, 2018 · 120 commits to master since this release

Changes since release 3-be8f6

New features

  • Alternative alignments in search (--alt-ali). Find alignments by masking out previously found regions in the target sequence.
  • Added map workflow for fast near-exact mapping of reads
  • Added easy-linclust workflow, that works on FASTA files
  • Sequence lengths longer than 32k are now supported (default sequence length limit is now 65535)
  • createdb shuffles the order of entries by default (--dont-shuffle to disable), useful for database splits, where one split could take much longer than others
  • linclust now supports MPI
  • linclust adds one hash for the whole sequence, to improve exact sequence matching
  • New sequence identity computation modes, where the normalization happens on the query or target length instead of alignment length
  • New --cov-mode that computes the coverage only based on sequence lengths (--cov-mode 3)
  • search/cluster/linclust workflows have learned --alignment-mode 4 for faster ungapped alignments
  • Translated search now sorts results by E-value and aggregates all ORFs under the corresponding contig identifier
  • prefiltering can now sort hits with score > 255 correctly
  • convertalis now works with profiles
  • Added generalized database transposition tool swapdb (swapresults only makes sense for prefiltering/alignment results)
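
The sequence identity modes listed above differ only in the length used as the denominator. The following Python sketch shows that idea; the function and the mode names ("alignment", "query", "target") are hypothetical labels for illustration and are not the values accepted by any MMseqs2 parameter.

```python
def sequence_identity(identical, aln_len, query_len, target_len,
                      normalize_by="alignment"):
    """Fraction of identical positions under a chosen normalization.

    `normalize_by` is a hypothetical selector, not an MMseqs2 option value.
    """
    lengths = {
        "alignment": aln_len,
        "query": query_len,
        "target": target_len,
    }
    if normalize_by not in lengths:
        raise ValueError(f"unknown normalization: {normalize_by}")
    denominator = lengths[normalize_by]
    # Guard against empty sequences or alignments.
    return identical / denominator if denominator else 0.0
```

For a short alignment inside a long query, normalizing by query length yields a much lower identity than normalizing by alignment length, so the choice changes which hits pass an identity threshold.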

Performance

  • Speedup extractorf with vectorization
  • Many performance improvements to reduce overhead for web server mode
  • createtsv writes output in parallel
  • Avoid many unnecessary memory allocations in various modules

Bug fixes

  • convertmsa now correctly parses STOCKHOLM files without accession keys
  • In search, when using splits, the limit would end up at fewer than --max-seqs sequences; the per-split limit is now computed correctly (max-seqs/Splits + 4*sqrt(max-seqs/Splits))
  • Fix bug in MsaFilter where wrong sequences would be filtered
  • swapresults will add an empty entry if a target entry has no corresponding query match, instead of no entry at all
  • createindex now correctly creates a tmp directory if none exists already
  • Fix query split runs for small input databases
  • result2stats was reading the wrong first sequence (from query instead of target database)
  • result2repseq now writes the correct .dbtype file
  • convertalis now reads the correct dbtype for the target sequence
  • Fix empty REG_EMPTY bug on macOS
  • Fix possible memory corruption when searching against database indexed by 'createindex'
  • Report error if -DHAVE_MPI was set and MPI is not installed on the system
  • Avoid race condition in kmermatcher (invalid parallel writing to vector)
  • Fix msa2profile header output format
  • msa2profile uses the FASTA read-in mode by default now
  • Target profile databases and databases built with --exact-kmer-matching now correctly extract all k-mers
  • Fix identical score computation of alignment if clustering using profiles
  • Nucleotide backtranslation translateaa would produce invalid codons for X
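
The per-split limit quoted in the bug-fix list above can be evaluated directly. This is a minimal Python sketch of that formula only; the function name is an assumption, and any rounding to an integer that the actual code may apply is not stated in these notes and is therefore left out.

```python
import math


def per_split_limit(max_seqs, splits):
    """Evaluate max-seqs/Splits + 4*sqrt(max-seqs/Splits).

    Returns a float; integer rounding is deliberately not modeled.
    """
    per_split = max_seqs / splits
    return per_split + 4 * math.sqrt(per_split)
```

With --max-seqs 300 and 3 splits, each split may keep 100 + 4*10 = 140 hits, so the sum over all splits comfortably exceeds the requested 300 before the final merge.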

Others

  • removed --early-exit
  • Output name of program called

Experimental new modules

  • new fast alignment method alignbykmer

Developers

  • CMake flag -DHAVE_GPROF for profiling MMseqs2 using gprof
  • Fixed most warnings
  • SSTR does not use stringstreams anymore
  • Refactored time measuring
  • Debug::INFO/WARNING/ERROR is now used consistently across the codebase
  • If available, shellcheck (https://github.com/koalaman/shellcheck) will critique shell scripts and fail the compilation

@martin-steinegger martin-steinegger released this May 28, 2018 · 344 commits to master since this release

Changes since 2-23394 Release

New Features

  • Add simple workflows (FASTA/FASTQ in, flat file out) for clustering (easy-cluster) and searching (easy-search)
  • Add a new greedy incremental clustering algorithm to the clust module, which needs less memory
  • Make the new low memory clustering algorithm default if --cov-mode 1 is used in linclust and cluster
  • Add alignall module for all-against-all alignments of e.g. clusters
  • Improved Windows support
  • filterdb learned new modes

Bug fixes

  • Fix wrong merging code in linclust
  • Fix e-value issues in target-split case
  • Fix seg. fault in rescorediagonal if 'z' is used
  • Fix seg. fault when using masking in kmermatcher
  • Fix wrong filterdb default mode
  • prefilter overestimated the required amount of memory and refused to run
  • prefilter scores would saturate too early, now they have the full 2^16 range

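
Score saturation, as mentioned for the prefilter above, means that a score stops growing once it reaches the largest value its storage type can hold. The Python sketch below models that with an explicit cap; the function name and the 8-bit comparison value are assumptions for illustration, and only the 2^16 range comes from these notes.

```python
def saturating_add(score, increment, cap):
    """Add increment to score, clamping the result at cap."""
    return min(score + increment, cap)


CAP_16_BIT = 2**16 - 1  # full 2^16 range named in the notes
CAP_8_BIT = 2**8 - 1    # hypothetical narrower range, for comparison only
```

Once two hits have both reached the cap they can no longer be ranked against each other, which is why a wider range improves the ordering of strong hits.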
Others

  • Profile searches now produce fewer high-scoring false positives through better compositional bias correction and masking of low-complexity regions of profiles
  • Clustering now supports the whole 2^32 range instead of the previous 2^31
  • Speed up clustering when using --cov-mode 1
  • Rework symlinks to the header databases
  • Support profiles on query and target side in result2profile

@martin-steinegger martin-steinegger released this Mar 5, 2018 · 477 commits to master since this release

Changes since 1-c7a89 Release

New Features

  • Translated searches (blastx and tblastn like search modes)
  • Improved splitting of input sequences in kmermatcher (less memory needed for linclust)
  • linclust supports nucleotide sequences (experimental feature, k-mer length is not yet optimized)
  • search supports nucleotide-nucleotide searches (preview, not stable yet)
  • pssm2profile module to print human readable profiles
  • msa2profile has a gap match mode to convert multiple sequence alignments without a representative sequence to profile databases
  • Compute sequence identity in a similar way to BLAST if --alignment-mode 3 is used
  • apply module to execute an arbitrary program on each entry of an mmseqs database. Like map from MapReduce.
  • extractorf can use start/stop codons from alternative translation tables
  • filterdb now can append entries from other databases by looking them up
  • proteinaln2nucl maps a protein alignment back to its original nucleotide sequences
  • taxonomy can now blacklist nodes (by default the "unclassified" and "others" nodes)
  • tmp folder is automatically created, all workflow intermediate results are placed in a subfolder based on the hash of all paths and parameters

Performance Regressions Fixed

  • Fixed regression when multiple mmseqs instances were running at the same time

Breaking Command Line Interface Changes

  • Incremented index version, old precomputed indices have to be regenerated
  • New Profile format, databases generated through convertprofiledb and msa2profile have to be regenerated
  • Clustering workflow is now by default cascaded. We replaced the --cascaded flag with --single-step-clustering
  • Max sequence length of 32768 is now actually validated and enforced
  • Each sequence database has now a dbtype file (AA=0, NUC=1, PROFILE=2)
  • extractorf was reworked:
    * --skip-incomplete was split into two parameters --contig-start-mode and --contig-end-mode
    * --longest-orf was reworked into --orf-start-mode
    * removed --extend-min parameter

Others

  • Clustering workflow is four times faster
  • Improve speed of linclust by a factor of two
  • Remove 'X' from prefilter index (reduces memory and improves speed at the same sensitivity)
  • Fix bugs for Query coverage mode (--cov-mode 2)
  • Clustering is now the same between single- and multi-threaded versions
  • Speedup of kmermatcher
  • Fix bug in Clust hash. It can now cluster to 1.0 sequence identity
  • Improve target profile search, set max-seqs to infinite for alignments.
  • Improve speed of align if prefilter results fit into memory
  • Many usability improvements
  • Improved suggestions of bash completion
  • Expert modules are hidden by default, use -h flag to show everything
  • Speed up mergeclusters by a lot
  • Fix sequence identity print out bug if the id is less than 10%
  • MPI Runner variable can now correctly contain further parameters (RUNNER="mpirun -np 4" was not working)
  • Enforcing GCC 4.6 compatibility in our continuous integration
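
The RUNNER item above concerns a launcher string that carries its own arguments. These notes do not describe the cause of the original failure; one common pitfall with such variables is treating the whole string as a single program name instead of splitting it into words. The Python sketch below shows the splitting step only, with a hypothetical helper name and placeholder program arguments.

```python
import shlex


def build_command(runner, program_args):
    """Prefix a command with a runner string that may contain arguments.

    `program_args` is a placeholder argument list for illustration.
    """
    # shlex.split honors shell-style quoting, e.g. 'mpirun -np 4'.
    return shlex.split(runner) + list(program_args)
```

With runner "mpirun -np 4", the resulting argument list begins with three separate words, so the launcher receives -np and 4 as its own options.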

Developers

  • MMseqs2 can now be included in framework mode to subprojects
  • DBReader has a SHUFFLE mode

@martin-steinegger martin-steinegger released this Oct 29, 2017 · 714 commits to master since this release

Changes since vNatBiotech Release

New Features

  • Taxonomy classification workflow with robust 2bLCA computation and fast LCA computation in O(N log N)
  • Support reading .bz2 archives for createdb
  • createdb can now turn multiple FASTA files into one database
  • Extend prefilter score range to improve order of best hits after prefiltering.
  • Automatically split input sequence set based on system RAM in kmermatcher. Linclust can now run with less memory.
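
To make the LCA term in the taxonomy item concrete, here is a naive lowest-common-ancestor sketch over a parent map. It is deliberately simple and is neither the O(N log N) method nor the 2bLCA procedure mentioned above; the function names and the convention that the root is its own parent are assumptions for illustration.

```python
def lineage(node, parent):
    """Path from node up to the root (the root is its own parent here)."""
    path = [node]
    while parent[node] != node:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(nodes, parent):
    """Deepest taxon shared by the lineages of all given nodes."""
    nodes = list(nodes)
    if not nodes:
        raise ValueError("at least one node is required")
    common = lineage(nodes[0], parent)
    for node in nodes[1:]:
        seen = set(lineage(node, parent))
        # Keep bottom-up order so the first element stays the deepest.
        common = [taxon for taxon in common if taxon in seen]
    return common[0]
```

In a taxonomy workflow the input nodes would be the taxa of the hits for one query, and the returned node is the most specific label that all of those hits agree on.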

Performance Regressions Fixed

  • Fixed underperforming iterative-sequence-profile search without a precomputed index table

Breaking Command Line Interface Changes

  • Iterative-non-profile-search --sens-step-size changed to --sens-steps (Number of Iterations) (Does not break nested workflows anymore)

Others

  • Query coverage mode (--cov-mode 2) for searching
  • Clustering is now the same between single- and multi-threaded versions
  • Bug fixes in rescorediagonal
  • Speedup of kmermatcher
  • Speedup and memory reduction of swapresults
  • Many usability improvements

Developers

  • MMseqs2 can now be included in framework mode to subprojects

@milot-mirdita milot-mirdita released this Aug 8, 2017 · 833 commits to master since this release

Release for Nature Biotechnology