
Metabuli produces very different results from mmseqs on longread contigs #41

sgalkina opened this issue Sep 15, 2023 · 3 comments


sgalkina commented Sep 15, 2023

Dear authors,

Thank you for your work! The tool looks great, and our preliminary testing showed fantastic results for short reads.

We are interested in contig-level annotations, but the paper describes Metabuli as a read-level tool. Do you see any problems with using it for contigs?

We tested it on the CAMI short-read contigs and got excellent species annotations for 99% of the contigs. However, after running it on real long-read data, we are seeing major differences (~25% of the annotated part of the dataset) from how mmseqs annotates the same dataset, starting already at the phylum level. Do you know what the problem could be? Does it make sense to use the tool on long-read contigs?

Thank you,
Svetlana

@jaebeom-kim
Collaborator

Hi, Svetlana.
Thank you for testing Metabuli :)

"We are interested in contig-level annotations, but in the paper it is stated as the read-level tool. Do you see any problems with using Metabuli for contigs?"

I haven't tested Metabuli with contigs myself, but I don't expect any problems in using it that way.
If the contigs are assembled correctly from short reads, classifying contigs is usually easier than classifying the short reads themselves.

"However, after running it on the real long-read data, we are seeing major differences (~25% of the annotated part of the dataset) from how mmseqs annotates the same dataset, starting already at the phylum level."

Interesting point.
I'm in the same lab as the authors of MMseqs2 Taxonomy, so I'll discuss it with them.
@milot-mirdita @martin-steinegger
To figure this out, I need some more context.
Could you let me know which database you used with each tool?
Also, did you assemble the long reads and use the resulting contigs as input, or did you use the long reads themselves?

Thanks,
Jaebeom

@sgalkina
Copy link
Author

Thanks for the swift reply!

I've used GTDB for both. For Metabuli, the database was created by the default command from the docs: metabuli databases GTDB207 DBDIR tmp
For mmseqs, I believe it was created with the standard mmseqs databases command, but it was a while ago so I can't recall for sure.
For CAMI, the two tools are in agreement; for the real long reads they disagree. We used the contigs assembled from the long reads, not the reads themselves.
I'll also note that the IDs returned by the two tools are not exactly the same, but even after matching them, the discrepancies persist.
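For reference, a minimal sketch of this kind of ID matching and agreement check. The file layout and column positions below are assumptions for illustration, not the tools' documented output formats; adjust them to the actual Metabuli and MMseqs2 outputs:

```python
# Hedged sketch: compare per-contig taxon assignments from two
# classification TSVs, then measure how often the tools agree.
# Column indices are assumptions -- adjust to your actual files.
import csv

def load_assignments(path, id_col, taxid_col, delimiter="\t"):
    """Read a TSV of per-contig classifications into {contig_id: taxid}."""
    out = {}
    with open(path, newline="") as fh:
        for row in csv.reader(fh, delimiter=delimiter):
            if len(row) > max(id_col, taxid_col):
                out[row[id_col]] = row[taxid_col]
    return out

def agreement(a, b):
    """Fraction of contigs classified by both tools with the same taxid."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    same = sum(1 for contig in shared if a[contig] == b[contig])
    return same / len(shared)
```

Restricting the comparison to contigs classified by both tools avoids counting unclassified contigs as disagreements.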

For annotating both the short- and long-read contigs, I used the following command:

metabuli classify --seq-mode 1 <contigs> tmp/latest/gtdb207 metabuli_outdir $dataset

@jaebeom-kim
Collaborator

Could you provide the report files generated by Metabuli and MMseqs2?
They will help us see whether the differing results are intended behavior or a bug.

In our benchmarks, Metabuli and MMseqs2 Taxonomy make different classifications even when the same reference sequences are used to build the databases.
Here is the link to the Metabuli preprint:
https://www.biorxiv.org/content/10.1101/2023.05.31.543018v1
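As a sketch of one way to compare two such report files at the phylum level, assuming a Kraken-style report layout (percent, clade reads, taxon reads, rank code, taxid, name); the column meanings here are an assumption, so adjust them if the actual reports differ:

```python
# Hedged sketch: extract phylum-level abundances from a Kraken-style
# report and rank the phyla that differ most between two reports.
# Assumed tab-separated columns:
#   percent, clade_reads, taxon_reads, rank_code, taxid, name
def phylum_abundances(report_path):
    """Return {phylum_name: percent_of_reads} for rows with rank code 'P'."""
    out = {}
    with open(report_path) as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 6 and cols[3] == "P":
                out[cols[5].strip()] = float(cols[0])
    return out

def biggest_shifts(report_a, report_b, top_n=5):
    """List the phyla whose percentages differ most between two reports."""
    a = phylum_abundances(report_a)
    b = phylum_abundances(report_b)
    phyla = set(a) | set(b)
    ranked = sorted(phyla,
                    key=lambda p: abs(a.get(p, 0.0) - b.get(p, 0.0)),
                    reverse=True)
    return [(p, a.get(p, 0.0), b.get(p, 0.0)) for p in ranked[:top_n]]
```

Looking at the largest per-phylum shifts first narrows the disagreement down to a few taxa, which makes it easier to tell a systematic difference from scattered noise.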

2 participants