What categories & keywords do we want associated with all rOpenSci packages? #7

Open
cboettig opened this issue Jul 10, 2017 · 17 comments

Comments

@cboettig
Member

cboettig commented Jul 10, 2017

Categories would probably be based around existing categories listed on: https://ropensci.org/packages/, maybe revised somewhat:

  • Data Publication
  • Data Access
  • Literature
  • Altmetrics
  • Scalable & Reproducible Computing
  • Databases
  • Data Visualization
  • Image Processing
  • Data Tools
  • Taxonomy
  • HTTP tools
  • Geospatial
  • Data Analysis (maybe delete?)

Might be good to distinguish between more science / user / researcher focused and more developer focused here, e.g. "Data Tools" covers pretty developer-focused things mostly but that might not be obvious based on the category label.

Also would be good to have a (probably larger) space of keywords (i.e. packages probably have one category but multiple keywords). Keywords should probably capture:

  • scientific domain area (both general and specific, e.g. "ecology" -> "marine" -> "fish")
  • technical domain area
  • type of package, e.g. wrapping REST APIs vs wrapping a C library vs ...
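
One way to picture the hierarchical scientific keywords above is a parent lookup, where tagging a package with the most specific term implies the broader ones. This is only a sketch: the `keyword_parent` table and the `expand_keyword()` helper are hypothetical, and the terms are just the ones from the example.

```r
# Hypothetical parent lookup: each keyword points to its broader term
# (NA marks a top-level term).
keyword_parent <- c(ecology = NA, marine = "ecology", fish = "marine")

# Walk up the hierarchy from a specific keyword to its top-level term.
expand_keyword <- function(keyword, parents = keyword_parent) {
  path <- character(0)
  while (!is.na(keyword)) {
    path <- c(keyword, path)
    keyword <- unname(parents[keyword])
  }
  path
}

expand_keyword("fish")
```

A flat lookup like this keeps the keyword list easy to edit by hand, at the cost of allowing only one parent per term.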

Some of this might overlap with terms / keywords describing the databases themselves (see https://github.com/ropenscilabs/datasauce)

Also related to discussions of tags / categories for blog posts, see: https://github.com/rosadmin/comms/issues/22

@sckott
Contributor

sckott commented Jul 11, 2017

To clarify, @cboettig: where are these categories and keywords meant to go? Are you talking about the tags at the top of each GitHub repo, or the strings that will go in the JSON describing each pkg in the https://github.com/ropensci/roregistry/blob/gh-pages/registry.json file?

@cboettig
Member Author

I think they'd go into the DESCRIPTION (e.g. with the rest of the package metadata), and then they'd be imported into the codemeta.json in each repo. We'd then generate the registry.json by simply appending the codemeta.json from each repo. Does that make sense, or is it too convoluted?
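
A rough sketch of that flow in base R, under some loud assumptions: the `X-schema.org-*` field names are only one possible way to store the terms in DESCRIPTION, `examplepkg` is made up, and the record is kept as an R list rather than written out as JSON.

```r
# Hypothetical DESCRIPTION contents; the X-schema.org-* field names are an
# assumption about how the terms could be stored, not something settled here.
desc_lines <- c(
  "Package: examplepkg",
  "Title: A Hypothetical Data Access Package",
  "Version: 0.1.0",
  "X-schema.org-applicationCategory: Data Access",
  "X-schema.org-keywords: ecology, marine, fish"
)

# Step 1: read the metadata from DESCRIPTION (DCF format, base R).
con <- textConnection(desc_lines)
desc <- read.dcf(con)
close(con)

# Step 2: build the per-repo record (roughly what codemeta.json would hold).
record <- list(
  name = unname(desc[1, "Package"]),
  applicationCategory = unname(desc[1, "X-schema.org-applicationCategory"]),
  keywords = trimws(strsplit(desc[1, "X-schema.org-keywords"], ",")[[1]])
)

# Step 3: the registry is the records from every repo appended together.
registry <- list(record)  # further records: c(registry, list(other_record))
str(registry)
```

Serializing `registry` to JSON (for instance with `jsonlite::toJSON()`) would be the last step, left out here to keep the sketch dependency-free.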

@sckott
Contributor

sckott commented Jul 15, 2017

Makes sense to me!

@maelle
Member

maelle commented Jan 29, 2018

I like this discussion. Several places currently define some sort of categories for rOpenSci packages. It'd be great to align them and, as said above, to separate technical domain area, scientific domain area, and package type:

  • categories from the registry

  • categories in the packages page of the website

  • this issue itself

  • categories in the onboarding policies

  • "topic:" labels in onboarding issues

Categories from the registry

> sort(unique(ropkgs::ro_pkgs()$packages$ropensci_category))
 [1] "altmetrics"         "data-access"        "data-analysis"      "data-publication"  
 [5] "data-tools"         "data-visualization" "databases"          "developer-tools"   
 [9] "geospatial"         "http-tools"         "ignore"             "image-processing"  
[13] "literature"         "metadata"           "scalereprod"        "security"          
[17] "taxonomy"

Categories from the packages page

  • Altmetrics

  • Data Publication Tools

  • Visualization

  • Databases

  • Geospatial

  • Web

  • Images Processing

  • Literature

  • Computing Infrastructure

  • Security

  • Taxonomy

This list of labels doesn't seem exhaustive?

Categories in the onboarding policies

These are primarily used to assess fit. There's no condition on the topic: currently packages only need to have a scientific application.

  • data retrieval

  • data extraction

  • database access

  • data munging

  • data deposition

  • reproducibility

  • geospatial data

  • text analysis

Labels in onboarding issues

# Fetch all labels of the ropensci/onboarding repo via the GitHub API
labels <- gh::gh("/repos/:owner/:repo/labels",
                 owner = "ropensci", repo = "onboarding", .limit = 200)
# Keep only the label names
labels <- unlist(lapply(labels, "[[", "name"))
# Keep the "topic:" labels and strip the prefix
onboarding_topics <- labels[stringr::str_detect(labels, "topic")]
onboarding_topics <- stringr::str_replace(onboarding_topics, "topic\\:", "")
onboarding_topics
##  [1] "altmetrics"         "climate-data"       "data-access"       
##  [4] "data-extraction"    "data-munging"       "data-only-pkg"     
##  [7] "data-publication"   "data-visualisation" "databases"         
## [10] "earth-science"      "geospatial"         "hardware"          
## [13] "health"             "imaging"            "lab"               
## [16] "linguistics"        "literature"         "mathematics"       
## [19] "misc"               "molecular-biology"  "parsing"           
## [22] "phylogenetics"      "physiology"         "references"        
## [25] "reproducibility"    "taxonomy"           "text-mining"

These "topics" are a mix of what the package does (accessing or extracting data for instance) and in which science field it is used (physiology, taxonomy).
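
To get a feel for how far the registry categories and the onboarding topics diverge, a quick set comparison might help. This is a sketch that hard-codes the two vectors printed above instead of querying `ropkgs` or the GitHub API again.

```r
# The two label sets, copied from the output shown in this comment.
registry_categories <- c(
  "altmetrics", "data-access", "data-analysis", "data-publication",
  "data-tools", "data-visualization", "databases", "developer-tools",
  "geospatial", "http-tools", "ignore", "image-processing",
  "literature", "metadata", "scalereprod", "security", "taxonomy"
)
onboarding_topics <- c(
  "altmetrics", "climate-data", "data-access", "data-extraction",
  "data-munging", "data-only-pkg", "data-publication",
  "data-visualisation", "databases", "earth-science", "geospatial",
  "hardware", "health", "imaging", "lab", "linguistics", "literature",
  "mathematics", "misc", "molecular-biology", "parsing", "phylogenetics",
  "physiology", "references", "reproducibility", "taxonomy", "text-mining"
)

# Labels used in both places.
shared <- intersect(registry_categories, onboarding_topics)

# Labels that exist in only one of the two places.
registry_only <- setdiff(registry_categories, onboarding_topics)
onboarding_only <- setdiff(onboarding_topics, registry_categories)

shared
registry_only
onboarding_only
```

Note that "data-visualization" and "data-visualisation" end up on opposite sides only because of spelling, so any harmonization would need to normalize that first.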

In a perfect world, submitters would store all sorts of keywords (what their package does, and the field of application) in their package metadata before submission, so that the labels in the onboarding issue could be added automatically? 🤔

@sckott
Contributor

sckott commented Jan 30, 2018

I agree it would be good to harmonize all the various labels/categories.

Another thought I had was that we should either let each package have multiple categories OR add another layer of labels. For example, if a package is doing data access, then it might be in the data-access category, but then to provide nice rich filtering on our pkgs page it'd be nice to let the user filter by gov't data sources. I'm guessing it'd be easier for us to add this 2nd layer of keywords as we think is needed, rather than hope maintainers do it.
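
A minimal sketch of the "one category plus a second layer of keywords" option, assuming a plain data frame with a list column. The package names, the `government-data` keyword, and the `filter_pkgs()` helper are all made up for illustration.

```r
# Hypothetical packages: one category each, plus a free second layer of keywords.
pkgs <- data.frame(
  package = c("pkgA", "pkgB", "pkgC"),
  category = c("data-access", "data-access", "geospatial"),
  stringsAsFactors = FALSE
)
pkgs$keywords <- list(
  c("government-data", "climate"),
  c("literature"),
  c("government-data")
)

# Filter first by category, then by a keyword from the second layer.
filter_pkgs <- function(pkgs, category, keyword) {
  has_keyword <- vapply(pkgs$keywords, function(k) keyword %in% k, logical(1))
  pkgs$package[pkgs$category == category & has_keyword]
}

filter_pkgs(pkgs, "data-access", "government-data")
```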


"In a perfect world, submitters would store all sorts of keywords (what their package does, and the field of application) in their package metadata before submission, so that the labels in the onboarding issue could be added automatically?"

that would be neat, though in reality that might prove hard because we'd have an unknown set of words every time that we'd have to fit into the categories we have. Also, that isn't straightforward to automate AFAIK.

@maelle
Member

maelle commented Jan 31, 2018

Oops when I said "all sorts of keywords" I was thinking of "all sorts of keywords from a list we provide" (possibly writing an addin or something to help them enter all these keywords).

I think packages would have several labels to describe their tasks and their topic areas.
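
To illustrate the "keywords from a list we provide" idea, here is a small sketch of checking submitted keywords against a curated list. The `allowed_keywords` vector and the `check_keywords()` helper are hypothetical; a real list would be much longer.

```r
# Hypothetical curated list (terms borrowed from the labels discussed above).
allowed_keywords <- c(
  "data-access", "data-munging", "taxonomy", "geospatial", "text-mining"
)

# Split submitted keywords into those on the list and those that are not.
check_keywords <- function(keywords, allowed = allowed_keywords) {
  normalized <- tolower(trimws(keywords))
  list(
    accepted = intersect(normalized, allowed),
    unknown = setdiff(normalized, allowed)
  )
}

check_keywords(c("Taxonomy", " geospatial", "fishes"))
```

An addin could run the same check interactively and only offer terms from the list, so the "unknown" bucket would stay empty.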

@maelle
Member

maelle commented Jan 31, 2018

In the GitHub labels of onboarding I like the idea of using "topic:", but there could be "type:" for what the package does and "topic:" for actual topics?
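
With such prefixes, separating the two kinds of labels would be straightforward. A sketch, assuming labels follow a `prefix:value` pattern (the label vector below is made up):

```r
# Hypothetical issue labels using the two proposed prefixes.
labels <- c(
  "topic:taxonomy", "type:data-access", "topic:physiology",
  "type:data-extraction", "package"
)

# Keep only prefixed labels, then split them into prefix and value.
prefixed <- labels[grepl("^(type|topic):", labels)]
kind <- sub(":.*$", "", prefixed)
value <- sub("^[^:]+:", "", prefixed)

# One character vector per kind of label.
by_kind <- split(value, kind)
by_kind
```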

Besides, I'm not sure this discussion belongs here? 🤔

@maelle
Member

maelle commented Jan 31, 2018

one label we might want in the registry (but not in DESCRIPTION) is community vs. staff contributed?

@maelle
Member

maelle commented Feb 1, 2018

@cboettig do you know of any good existing nomenclature for scientific domain areas?

I imagine our nomenclatures for "what the package does" and "technical domain area" might be more home-made (and simpler), but for scientific field it doesn't need to be?

@cboettig
Member Author

cboettig commented Feb 1, 2018

Good question, I'm not super familiar with the main existing nomenclatures for that. Mostly I just get annoyed when they treat all of biology as if it considered nothing larger than the cell and dealt with no other issue but human health ☹️. I've asked in the NCEAS/DataONE slack, they've probably got some good ideas so I'll let you know if I hear from them.

@cboettig
Member Author

cboettig commented Feb 1, 2018

@maelle Bryce says the one they use most is https://earthdata.nasa.gov/about/gcmd/global-change-master-directory-gcmd-keywords, pretty rich hierarchical keyword model. Though not entirely obvious to me what the valid categories there are beyond "Earth Science". Maybe you want something more generic for scientific domain areas? cc @amoeba

@maelle
Member

maelle commented Feb 2, 2018

Thanks both! 😺 I had a quick look and I'm not sure it'd cover everything, even if the topics "Biosphere" and "Biological Classification" for instance make it more generic than I thought at first glance. I'll have a better look next week and will try to find something more generic indeed (containing "linguistics" among other things).

@maelle
Member

maelle commented Feb 5, 2018

  • I found http://skos.um.es/unesco6/00/html which looks ok but 1) I'm not sure it's widely used, it was proposed to UNESCO but not accepted 2) Taxonomy gets split up in three (insect, plant, animal) which makes it look not so ok?

  • I tried to find a nomenclature from the NSF and only found this pdf which doesn't seem very good for this use case.

I'll keep looking. A good nomenclature would

  • be sort of official/well recognized

  • contain nearly all the topics we could think of (which would be an indicator that it's complete for some fields we do not currently have much expertise in)

  • not be too detailed either; otherwise we could stop at a given level of detail, or always rely on conditional display for, say, surveys.

@maelle
Member

maelle commented Feb 15, 2018

@cboettig In the CodeMeta terms, do "applicationCategory" and "applicationSubCategory" have to be unique?

I've been thinking of technical and topical categories. Most often packages will have only one of those, but maybe not? In edge cases, maybe one of the two could be applicationCategory and the other one a keyword.

My workflow will probably be to build a table of all packages with their categories and subcategories as I think they should be, then ask for feedback on that before opening PRs to the repos to add the terms in DESCRIPTION. In the future we could strive for that info to be included in DESCRIPTION when onboarding / creating the package.

@cboettig
Member Author

Sounds good. No requirement to be unique, and of course you can have one or more of each.

@maelle
Member

maelle commented Nov 5, 2018

Pragmatic decisions.

Maybe we'll still want to curate better categories, from some taxonomy, but not in the short term?

@maelle
Member

maelle commented Nov 5, 2018

Some "categories" could be automatically guessed. E.g. if there's a system requirement on a C library (how to define that remains to be seen) it's a "package wrapping a C library", if the package depends on httr/curl/crul it's probably something like a web API wrapper, etc.
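
A rough sketch of that kind of rule-based guessing. The `guess_package_type()` helper and the type names it returns are hypothetical, and treating any non-empty SystemRequirements value as "wraps a system library" is only a crude stand-in, since how to define that remains to be seen.

```r
# Guess package "types" from declared dependencies (hypothetical helper).
guess_package_type <- function(imports, system_requirements = "") {
  types <- character(0)
  # Depending on an HTTP client suggests a web API wrapper.
  if (any(c("httr", "curl", "crul") %in% imports)) {
    types <- c(types, "web-api-wrapper")
  }
  # Crude stand-in: any non-empty SystemRequirements counts as wrapping
  # a system library.
  if (nzchar(trimws(system_requirements))) {
    types <- c(types, "system-library-wrapper")
  }
  types
}

guess_package_type(c("crul", "jsonlite"))
guess_package_type("Rcpp", system_requirements = "libxml2")
```

Guesses like these would probably still need a human check, since an HTTP client dependency does not always mean the package wraps an API.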

One could also try clustering packages based on the dependencies they share, without too much supervision.
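
For instance, a minimal sketch using hierarchical clustering on a binary package-by-dependency matrix, all in base R. The four packages and their dependency lists are invented, and the choice of Jaccard ("binary") distance, average linkage, and two clusters is just one reasonable option.

```r
# Hypothetical packages and the dependencies they declare.
deps <- list(
  pkgA = c("httr", "jsonlite"),
  pkgB = c("httr", "jsonlite", "xml2"),
  pkgC = c("sf", "sp"),
  pkgD = c("sf", "sp", "raster")
)

# Build a package x dependency incidence matrix (1 = package uses dependency).
all_deps <- sort(unique(unlist(deps)))
dep_matrix <- t(vapply(
  deps,
  function(d) as.numeric(all_deps %in% d),
  numeric(length(all_deps))
))
colnames(dep_matrix) <- all_deps

# "binary" distance is the Jaccard distance on shared dependencies.
d <- dist(dep_matrix, method = "binary")
tree <- hclust(d, method = "average")

# Cut the tree into two groups of packages.
clusters <- cutree(tree, k = 2)
clusters
```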


3 participants