Older database support? #63

wwood · 2024-04-08T23:36:37Z

Hi,

I was keen to try out Metabuli and was using an R207 pre-built database, but it now seems that this has been superseded by the R214 one. For the sake of reproducibility, are old databases kept around?

One small suggestion also: it would be helpful if the GTDB db had the version coded directly into its filename, rather than simply gtdb.tar.gz. I realise that this info is in the README and is visible once the tgz is unpacked, but it would still be helpful so that the download link changes with each version.

TIA, ben

jaebeom-kim · 2024-04-09T07:46:28Z

Dear Ben,

Thank you for trying Metabuli :)

Let me discuss keeping older databases on the cloud server and get back to you with an answer later.

Regarding your suggestion, I think the point is to make the version information more visible.
Renaming the DB link could be a solution, but it requires changing the source code and making a new release/bioconda package whenever the DB is updated.
Users would also have to update their Metabuli to download DBs from the new links.
So, I'm thinking about a different way to improve the version visibility.

Sorry for not giving direct answers.
I will post updates on my progress in addressing your concerns.

Thank you,
Jaebeom

jaebeom-kim · 2024-04-09T07:51:36Z

By the way, I'd recommend using the R214 one.
In the R207 DB, k-mers from lowercase letters in the human genome were not extracted correctly.
This harmed Metabuli's performance when decontaminating human reads.
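
To illustrate the kind of bug described above, here is a minimal generic sketch, not Metabuli's actual code (its k-mer scheme is not shown in this thread). It assumes plain nucleotide k-mers and a hypothetical helper named `extract_kmers`. Genome assemblies commonly use lowercase letters to mark soft-masked regions, so an extractor that only recognises uppercase `ACGT` would silently skip them; normalising case first avoids that.

```python
from typing import List


def extract_kmers(sequence: str, k: int) -> List[str]:
    """Return all k-mers made only of A, C, G, T, ignoring letter case.

    Hypothetical helper for illustration; not taken from Metabuli.
    """
    # Treat lowercase (soft-masked) bases like their uppercase counterparts.
    seq = sequence.upper()
    valid = set("ACGT")
    kmers = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows containing N or any other non-ACGT symbol.
        if set(kmer) <= valid:
            kmers.append(kmer)
    return kmers


# The lowercase "gt" is kept; windows overlapping "N" are dropped.
print(extract_kmers("ACgtNAC", 3))  # ['ACG', 'CGT']
```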

wwood · 2024-04-10T00:13:41Z

Thanks for the quick reply.

> Renaming the DB link could be a solution, but it requires changing the source code and making a new release/bioconda package whenever the DB is updated.
> Users would also have to update their Metabuli to download DBs from the new links.
> So, I'm thinking about a different way to improve the version visibility.

There are other solutions. For instance, MetaPhlAn keeps an "mpa_latest" file that points to the versioned database files. SingleM uses a library we made, https://github.com/centre-for-microbiome-research/zenodo_backpack, which operates through Zenodo, where there is one DOI for the series as a whole and a separate DOI for each specific database version.
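
As a rough sketch of the "latest pointer" idea, the snippet below resolves a small pointer file to a versioned archive name, while still letting a user pin an older release for reproducibility. Everything in it is an assumption made for illustration: the base URL, the `gtdb_<version>.tar.gz` naming scheme and the function `resolve_database_url` are hypothetical and do not reflect how MetaPhlAn, zenodo_backpack or Metabuli actually work.

```python
from typing import Optional

# Hypothetical base URL, for illustration only.
BASE_URL = "https://example.org/databases"


def resolve_database_url(pointer_text: str,
                         pinned_version: Optional[str] = None) -> str:
    """Return the URL of a versioned database archive.

    pointer_text is the content of a small "latest" pointer file,
    for example "R214\n". If pinned_version is given, it takes
    precedence, so older releases stay reachable.
    """
    version = (pinned_version or pointer_text).strip()
    if not version:
        raise ValueError("no database version given")
    # Assumed naming scheme: the version is part of the file name.
    return f"{BASE_URL}/gtdb_{version}.tar.gz"


# Default: follow the pointer to the newest release.
print(resolve_database_url("R214\n"))
# Reproducible run: pin an older release explicitly.
print(resolve_database_url("R214\n", pinned_version="R207"))
```

With this layout only the pointer file changes when a new database is published, so the tool itself would not need a new release for each database update.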

It would be great if each tool developer didn't have to spend time solving these problems and there was a broadly adopted, standardised solution.

> By the way, I'd recommend using the R214 one.
> In the R207 DB, k-mers from lowercase letters in the human genome were not extracted correctly.
> This harmed Metabuli's performance when decontaminating human reads.

Thanks - is there any other reason not to use the R207 one? My samples do not contain human reads.

jaebeom-kim · 2024-06-04T11:15:14Z

Thank you for providing great examples for the task!
If your sample doesn't contain human reads, R207 is totally fine.
R207 just has a smaller number of genomes.

wwood · 2024-06-05T11:26:15Z

Thanks @jaebeom-kim.

Congratulations on the Metabuli paper.
