Greengenes database updates #547

Merged
gregcaporaso merged 4 commits into qiime2:master on Jan 10, 2023

@wasade wasade commented Jan 4, 2023

In this pull request, we revise the entries surrounding Greengenes and in particular note 2022.10, which is the latest release of Greengenes2.

cc @gregcaporaso

@gregcaporaso gregcaporaso left a comment

Thanks @wasade, looking forward to trying this out! Couple of minor requests on this.

@@ -12,14 +12,22 @@ Naive Bayes classifiers trained on:

- `Silva 138 99% OTUs full-length sequences <https://data.qiime2.org/2022.11/common/silva-138-99-nb-classifier.qza>`_ (MD5: ``b8609f23e9b17bd4a1321a8971303310``)
- `Silva 138 99% OTUs from 515F/806R region of sequences <https://data.qiime2.org/2022.11/common/silva-138-99-515-806-nb-classifier.qza>`_ (MD5: ``e05afad0fe87542704be96ff483824d4``)
- `Greengenes 13_8 99% OTUs full-length sequences <https://data.qiime2.org/2022.11/common/gg-13-8-99-nb-classifier.qza>`_ (MD5: ``6bbc9b3f2f9b51d663063a7979dd95f1``)
- `Greengenes 13_8 99% OTUs from 515F/806R region of sequences <https://data.qiime2.org/2022.11/common/gg-13-8-99-515-806-nb-classifier.qza>`_ (MD5: ``9e82e8969303b3a86ac941ceafeeac86``)

@gregcaporaso (Member):

Can you include these links as well so users still have the option to access these? I think that will be helpful for a couple of release cycles since gg 13_8 has been such a staple to the field for so long. For example, I've been using GTDB for a while now, but I almost always classify with both GTDB and GG 13_8, and compare assignments for features of interest.

If you want to emphasize GG2 over 13_8, you could add a box that says something like:

GG2 has succeeded Greengenes 13_8. If you still need to access the outdated 13_8 classifiers, for example to reproduce old results or to compare against new classifiers, you can access them through the following links:
...

@wasade (Member Author):

Couldn't they access those links through older versions of QIIME 2? What I worry about here is that we don't, for example, highlight the older versions of SILVA, and if we're not careful, it may give the impression that we don't think Greengenes2 is clearly better than 13_8.

@gregcaporaso (Member):

@wasade, your concern makes sense to me. This transition seems to me to be more akin to the QIIME 1 to QIIME 2 transition than to Silva release transitions (which tend to be relatively frequent incremental updates). Given that this is a major change to a fundamental tool in the field, I think it's important that we prioritize accessibility of both iterations for a release cycle or two. (For the QIIME transition we went through about a one-year period where we supported both to allow users to assess which one they preferred and compare results of the two platforms.)

While the classifiers can be accessed through old versions of the documentation, I think we can make this more accessible. Would it address your concern if we either:

  • added a box with a note like I suggested above, but rather than linking users to the classifiers directly, pointed them to the relevant section of the 2022.11 version of the documentation
  • or bumped the old classifier links to the bottom of this page under a heading of something like Legacy Taxonomic Classifiers, and noted in that section that we recommend using the most recent versions but that these versions may be useful for reproducing old results or comparing to newer classifiers.

@wasade (Member Author):

That makes complete sense, thanks!! I'll issue a commit with the first option.


- **License Information** can be found on the `Greengenes website <https://greengenes.secondgenome.com/>`_. Greengenes data are released under a `Creative Commons Attribution-ShareAlike 3.0 License <https://creativecommons.org/licenses/by-sa/3.0/deed.en_US>`_.
+ **License Information** can be found on the `Greengenes website <https://greengenes.secondgenome.com/>`_. Greengenes data (prior to 2022) are released under a `Creative Commons Attribution-ShareAlike 3.0 License <https://creativecommons.org/licenses/by-sa/3.0/deed.en_US>`_. Greengenes data (2022-) are released under a `BSD-3 license <http://ftp.microbio.me/greengenes_release/2022.10/00LICENSE>`_.

@gregcaporaso (Member):

Should this be Greengenes2 data (2022-) are released...?

Also, a more general question - do you think we should start referring to the older releases collectively as Greengenes1? That's kind of how we handle the QIIME / QIIME 1 / QIIME 2 situation now (where I typically use QIIME to refer to QIIME 1 and QIIME 2).

@wasade (Member Author):

I don't think so. All Greengenes data (2022-) is under BSD-3.

I'm torn on that to be honest. With Greengenes, the version numbers are pretty distinct and usually associated with the Greengenes reference. We re-marked it as "Greengenes2" to reflect a major change in the database; however, (to me at least) the most important thing is still the version number.

I dunno, what do you think will be least confusing for users?

@gregcaporaso (Member):

From my perspective, calling it Greengenes2 (and then having versions of that, like 2022.11) signifies fundamental changes to the product as opposed to an incremental update, and given the long gap between versions I suspect calling it Greengenes2 will be less confusing (and possibly easier to sell to users and/or funders).

Your call though - just depends on how you want to brand it. I thought it may have been a typo.

@wasade (Member Author):

Fair point :) I'll revise to Greengenes2, which is consistent with our other uses.

@gregcaporaso gregcaporaso self-assigned this Jan 6, 2023
@gregcaporaso gregcaporaso merged commit f4cdd02 into qiime2:master Jan 10, 2023

@gregcaporaso (Member):

Thanks @wasade!

@wasade (Member Author) commented Jan 10, 2023:

Wonderful, thank you!
