
Metabuli produces very different results from mmseqs on longread contigs #41

sgalkina opened this issue Sep 15, 2023 · 3 comments


sgalkina commented Sep 15, 2023

Dear authors,

Thank you for your work! The tool looks great, and our preliminary testing showed fantastic results for short reads.

We are interested in contig-level annotations, but the paper describes Metabuli as a read-level tool. Do you see any problems with using it for contigs?

We tested it on the CAMI short-read contigs and got excellent species annotations for 99% of the contigs. However, after running it on real long-read data, we are seeing major differences (~25% of the annotated part of the dataset) from how mmseqs annotates the same dataset, starting already at the phylum level. Do you know what the problem could be? Does it make sense to use the tool on long-read contigs?

Thank you,
Svetlana

@jaebeom-kim
Collaborator

Hi, Svetlana.
Thank you for testing Metabuli :)

"We are interested in contig-level annotations, but in the paper it is stated as the read-level tool. Do you see any problems with using Metabuli for contigs?"

I haven't tested Metabuli with contigs myself, but I don't expect any problems in using it that way.
If the contigs are assembled correctly from short reads, classifying contigs is usually easier than classifying the short reads themselves.

"However, after running it on the real long-read data, we are seeing major differences (~25% of the annotated part of the dataset) from how mmseqs annotates the same dataset, starting already at the phylum level."

Interesting point.
I'm in the same lab as the authors of MMseqs2 Taxonomy, so I'll discuss it with them.
@milot-mirdita @martin-steinegger
To figure this out, I need some more context.
Could you let me know which database you used with each tool?
Also, did you assemble the long reads and use the resulting contigs as input, or did you use the long reads themselves?

Thanks,
Jaebeom

@sgalkina
Copy link
Author

Thanks for the swift reply!

I've used GTDB for both. For Metabuli, the database was created by the default command from the docs: metabuli databases GTDB207 DBDIR tmp
For mmseqs, I believe it was created with the standard mmseqs databases command, but it was a while ago so I can't recall for sure.
For CAMI, the two tools are in agreement; for the real long reads they disagree. We used the contigs assembled from the long reads, not the reads themselves.
I'll also note that the IDs returned by the two tools are not exactly the same, but even after matching them, the discrepancies persist.
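For reference, a minimal sketch of this kind of ID matching and agreement check. The file layout and column positions below are assumptions for illustration, not the tools' documented output formats; adjust them to the actual Metabuli and MMseqs2 outputs:

```python
# Hedged sketch: compare per-contig taxon assignments from two
# classification TSVs, then measure how often the tools agree.
# Column indices are assumptions -- adjust to your actual files.
import csv

def load_assignments(path, id_col, taxid_col, delimiter="\t"):
    """Read a TSV of per-contig classifications into {contig_id: taxid}."""
    out = {}
    with open(path, newline="") as fh:
        for row in csv.reader(fh, delimiter=delimiter):
            if len(row) > max(id_col, taxid_col):
                out[row[id_col]] = row[taxid_col]
    return out

def agreement(a, b):
    """Fraction of contigs classified by both tools with the same taxid."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    same = sum(1 for contig in shared if a[contig] == b[contig])
    return same / len(shared)
```

Restricting the comparison to contigs classified by both tools avoids counting unclassified contigs as disagreements.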

For annotating both the short- and long-read contigs, I used the following command:

metabuli classify --seq-mode 1 <contigs> tmp/latest/gtdb207 metabuli_outdir $dataset

@jaebeom-kim
Collaborator

Could you provide the report files generated by Metabuli and MMseqs2?
They will help us see whether the differing results are intended behavior or a bug.

In our benchmarks, Metabuli and MMseqs2 Taxonomy make different classifications even when the same reference sequences are used to build the databases.
Here is the link to the Metabuli preprint:
https://www.biorxiv.org/content/10.1101/2023.05.31.543018v1
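As a sketch of one way to compare two such report files at the phylum level, assuming a Kraken-style report layout (percent, clade reads, taxon reads, rank code, taxid, name); the column meanings here are an assumption, so adjust them if the actual reports differ:

```python
# Hedged sketch: extract phylum-level abundances from a Kraken-style
# report and rank the phyla that differ most between two reports.
# Assumed tab-separated columns:
#   percent, clade_reads, taxon_reads, rank_code, taxid, name
def phylum_abundances(report_path):
    """Return {phylum_name: percent_of_reads} for rows with rank code 'P'."""
    out = {}
    with open(report_path) as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 6 and cols[3] == "P":
                out[cols[5].strip()] = float(cols[0])
    return out

def biggest_shifts(report_a, report_b, top_n=5):
    """List the phyla whose percentages differ most between two reports."""
    a = phylum_abundances(report_a)
    b = phylum_abundances(report_b)
    phyla = set(a) | set(b)
    ranked = sorted(phyla,
                    key=lambda p: abs(a.get(p, 0.0) - b.get(p, 0.0)),
                    reverse=True)
    return [(p, a.get(p, 0.0), b.get(p, 0.0)) for p in ranked[:top_n]]
```

Looking at the largest per-phylum shifts first narrows the disagreement down to a few taxa, which makes it easier to tell a systematic difference from scattered noise.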

2 participants