What categories & keywords do we want associated with all rOpenSci packages? #7
Comments
To clarify, @cboettig: where are these categories and keywords meant to go? Are you talking about the tags at the top of each GitHub repo, or the strings that will go in the JSON describing each package in the https://github.com/ropensci/roregistry/blob/gh-pages/registry.json file?
I think they'd go into the DESCRIPTION (e.g. with the rest of the package metadata), and then they'd be imported into the |
Makes sense to me! |
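A minimal sketch of the DESCRIPTION idea above, assuming keywords live in a custom field. The field name `X-schema.org-keywords`, the package name, and the keyword values are assumptions for illustration only; the thread does not settle on a field name. Base R's `read.dcf()` can read such a field, which a registry build step could then import.

```r
# Hypothetical DESCRIPTION content; the field name and keywords are assumptions.
desc_file <- tempfile("DESCRIPTION")
writeLines(c(
  "Package: examplepkg",
  "Version: 0.1.0",
  "X-schema.org-keywords: data-access, taxonomy"
), desc_file)

# read.dcf() returns a character matrix with one column per requested field.
fields <- read.dcf(desc_file, fields = c("Package", "X-schema.org-keywords"))

# Split the comma-separated keyword string into a character vector.
keywords <- trimws(strsplit(fields[1, "X-schema.org-keywords"], ",")[[1]])
keywords
```

Keeping the keywords in DESCRIPTION means the registry never has to be edited by hand; it only needs to re-read each package's metadata.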
I like this discussion. There are currently many places defining some sort of categories for rOpenSci packages. It'd be great to align them and, as said above, to divide them into technical vs. scientific domain area and package type:
Categories from the registry
> sort(unique(ropkgs::ro_pkgs()$packages$ropensci_category))
 [1] "altmetrics" "data-access" "data-analysis" "data-publication"
 [5] "data-tools" "data-visualization" "databases" "developer-tools"
 [9] "geospatial" "http-tools" "ignore" "image-processing"
[13] "literature" "metadata" "scalereprod" "security"
[17] "taxonomy"
Categories from the packages page
This list of labels doesn't seem exhaustive?
Categories in the onboarding policies
These are primarily used to assess fit, there's no condition on the topic, currently packages only need to have a scientific application.
Labels in onboarding issues
labels <- gh::gh("/repos/:owner/:repo/labels",
                 owner = "ropensci", repo = "onboarding", .limit = 200)
labels <- unlist(lapply(labels, "[[", "name"))
onboarding_topics <- labels[stringr::str_detect(labels, "topic")]
onboarding_topics <- stringr::str_replace(onboarding_topics, "topic\\:", "")
onboarding_topics
These "topics" are a mix of what the package does (accessing or extracting data, for instance) and the science field in which it is used (physiology, taxonomy). In a perfect world, submitters would store all sorts of keywords (what their package does, and the field of application) in their package metadata before submission, and that way the labels in the onboarding issue could be added automatically? 🤔
I agree it would be good to harmonize all the various labels/categories. Another thought I had was that we should either let each package have multiple categories OR add another layer of labels. For example, if a package is doing data access, then it might be in the data-access category, but to provide nice rich filtering on our pkgs page it'd be nice to let the user filter by gov't data sources. I'm guessing it'd be easier for us to add this 2nd layer of keywords as we think is needed, rather than hope maintainers do it.
That would be neat, though in reality it might prove hard because we'd have an unknown set of words every time that we'd have to fit into the categories we have, and that isn't straightforward to automate AFAIK.
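The two-layer idea above (one category per package, plus a second layer of keywords for finer filtering) could look roughly like this. The package names, categories, and keywords are all made up for illustration, and `filter_pkgs()` is a hypothetical helper, not an existing rOpenSci function.

```r
# Hypothetical packages, categories, and keywords, for illustration only.
pkgs <- data.frame(
  name     = c("pkgA", "pkgB", "pkgC"),
  category = c("data-access", "data-access", "geospatial"),
  stringsAsFactors = FALSE
)

# A list column lets each package carry any number of keywords.
pkgs$keywords <- list(
  c("government", "climate"),
  c("literature"),
  c("government", "maps")
)

# Filter by category first, then by a keyword within that category.
filter_pkgs <- function(pkgs, category, keyword) {
  in_category <- pkgs$category == category
  has_keyword <- vapply(pkgs$keywords, function(k) keyword %in% k, logical(1))
  pkgs$name[in_category & has_keyword]
}

filter_pkgs(pkgs, "data-access", "government")
```

Because the keywords sit in a separate layer, they can be curated centrally without touching the category each maintainer chose.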
Oops when I said "all sorts of keywords" I was thinking of "all sorts of keywords from a list we provide" (possibly writing an addin or something to help them enter all these keywords). I think packages would have several labels to describe their tasks and their topic areas. |
In the Github labels of onboarding I like the idea to use "topic:", but there could be "type:" for what the package does and "topic:" for actual topics? Besides, I'm not sure this discussion belongs here? 🤔 |
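As a rough sketch of the "type:" / "topic:" split suggested above, labels could be separated by prefix with base R alone. The label names below are hypothetical; only the "topic:" prefix is in use in the thread, and "type:" is the proposed addition.

```r
# Hypothetical label names; "type:" is the proposed prefix, not an existing one.
labels <- c("topic:taxonomy", "type:data-access", "topic:physiology",
            "type:data-extraction", "package")

# Keep only labels that carry one of the two prefixes.
prefixed <- grep("^(type|topic):", labels, value = TRUE)

# Split each label into its prefix and its value.
prefix <- sub(":.*$", "", prefixed)
value  <- sub("^[^:]*:", "", prefixed)

# Group the values by prefix, giving a named list with $topic and $type.
by_kind <- split(value, prefix)
by_kind
```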
one label we might want in the registry (but not in DESCRIPTION) is community vs. staff contributed? |
@cboettig do you know of any good existing nomenclature for scientific domain areas? I imagine our nomenclatures for "what the package does" and "technical domain area" might be more home-made (and simpler), but for scientific field it doesn't need to be? |
Good question, I'm not super familiar with the main existing nomenclatures for that. Mostly I just get annoyed when they treat all of biology as if it considered nothing larger than the cell and dealt with no other issue but human health |
@maelle Bryce says the one they use most is https://earthdata.nasa.gov/about/gcmd/global-change-master-directory-gcmd-keywords, pretty rich hierarchical keyword model. Though not entirely obvious to me what the valid categories there are beyond "Earth Science". Maybe you want something more generic for scientific domain areas? cc @amoeba |
Thanks both! 😺 I had a quick look and I'm not sure it'd cover everything, even if the topics "Biosphere" |
I'll keep looking. A good nomenclature would
|
@cboettig In the CodeMeta terms, do "applicationCategory" and "applicationSubCategory" have to be unique? I've been thinking of technical and topical categories; most often packages will have only one of those, but maybe not? In edge cases, maybe one of the two could be applicationCategory and the other one a keyword. My workflow will probably be to build a table of all packages with the categories and subcategories I think they should have, then ask for feedback on that before probably opening PRs to the repos to add the terms in DESCRIPTION. In the future we could strive for that info to be included in DESCRIPTION when onboarding / creating the package. |
Sounds good. No requirement to be unique, and of course you can have one or more of each. |
Pragmatic decisions.
Maybe we'll still want to curate better categories, from some taxonomy, but not in the short term? |
Some "categories" could be automatically guessed. E.g. if there's a system requirement on a C library (how to define that remains to be seen) it's a "package wrapping a C library", if the package depends on One could also try clustering packages based on the dependencies they share, without too much supervision. |
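The clustering idea above can be sketched with base R only. The packages and their dependency lists below are invented for illustration, and the choice of Jaccard distance with average-linkage clustering is an assumption, not something the thread specifies.

```r
# Hypothetical packages and dependencies, for illustration only.
deps <- list(
  pkgA = c("httr", "jsonlite", "xml2"),
  pkgB = c("httr", "jsonlite"),
  pkgC = c("sf", "sp"),
  pkgD = c("sf", "sp", "raster")
)

# Build a binary package-by-dependency matrix (1 = package uses dependency).
all_deps <- sort(unique(unlist(deps)))
m <- t(vapply(deps, function(d) as.integer(all_deps %in% d),
              integer(length(all_deps))))
colnames(m) <- all_deps

# dist(method = "binary") gives the Jaccard distance on presence/absence rows.
d <- dist(m, method = "binary")

# Hierarchical clustering, cut into two groups.
clusters <- cutree(hclust(d, method = "average"), k = 2)
clusters
```

With real data the number of groups would not be known in advance, so the resulting clusters would be a starting point for manual curation rather than final categories.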
Categories would probably be based around existing categories listed on: https://ropensci.org/packages/, maybe revised somewhat:
Might be good to distinguish between more science / user / researcher focused and more developer focused here, e.g. "Data Tools" covers pretty developer-focused things mostly but that might not be obvious based on the category label.
Also would be good to have a (probably larger) space of keywords (i.e. packages probably have one category but multiple keywords). Keywords should probably capture:
Some of this might overlap with terms / keywords describing the databases themselves (see https://github.com/ropenscilabs/datasauce)
Also related to discussions of tags / categories for blog posts, see: https://github.com/rosadmin/comms/issues/22