Feedback? #1

Open
sckott opened this Issue Nov 23, 2016 · 29 comments

10 participants

Owner

sckott commented Nov 23, 2016

rather than filling your inbox with another email, pinging you fine folks here:

@ethanwhite @cboettig @dfalster @dlebauer @ibartomeus @wcornwell

I'd love your feedback on this thing

What it is and why

Trait data seems to be one of the most locked down bits of the data ecosystem. Datasets are out there but often in supplemental materials, etc. that are hard to discover.

This service aims to be primarily a REST API that can be searched for datasets based on

  • taxonomy
  • location
  • trait type
  • general search across all fields

I got an Amazon research grant for $500 in credits for https://github.com/sckott/gbids but it's not seeing as much use as I thought it would get. So I'm trying to pivot to this idea, which I've been thinking about for some time.

There's description of the API at https://github.com/sckott/traitdb#api

Both /search routes are based on Elasticsearch, and the current plan has the entire datasets loaded into ES. I don't think this will scale, so I'm evaluating other options that still allow searching within datasets. What these routes return are the actual records (aka rows) of data from the datasets, so one can get just the records they want based on some search.

There are only 6 datasets in there now, as I evaluate what works and what doesn't, and get feedback from you :)

I haven't yet cleaned up/standardized these, but they will be useful once done:

  • taxonomy
  • geolocation (lat/long, or similar)
  • place names
  • other things?

Then we could allow search specifically on those elements instead of just a full text search across all.
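
As a rough illustration of how full-text search and field-specific search could be combined once those elements are standardized, here is a minimal sketch that builds an Elasticsearch query body as a plain dict. The field names `taxon` and `location` are assumptions made for the example, not fields the service is known to expose, and the function name is hypothetical.

```python
def build_search_body(text=None, taxon=None, bbox=None):
    """Build an Elasticsearch query body (a plain dict).

    text  : free-text query run across the index's default fields
    taxon : exact match on a hypothetical standardized `taxon` keyword field
    bbox  : (top, left, bottom, right) in degrees, matched against a
            hypothetical `location` geo_point field
    """
    must = []
    filters = []
    if text:
        # query_string without explicit fields searches the default fields
        must.append({"query_string": {"query": text}})
    if taxon:
        filters.append({"term": {"taxon": taxon}})
    if bbox:
        top, left, bottom, right = bbox
        filters.append({
            "geo_bounding_box": {
                "location": {
                    "top_left": {"lat": top, "lon": left},
                    "bottom_right": {"lat": bottom, "lon": right},
                }
            }
        })
    if not must:
        # no free text given: match everything, then apply the filters
        must.append({"match_all": {}})
    return {"query": {"bool": {"must": must, "filter": filters}}}
```

The returned dict could be sent as the JSON body of a search request; keeping the structured criteria in `filter` means they narrow the results without affecting relevance scoring.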

There is no website for this yet, but I could make one on top of the API, for users that prefer a GUI.

What do you think?

  • Do you think people will use this?
  • Any issues you see with licensing/etc? All Dryad datasets are CC0, and I've been using those so far. AFAIK supp. ESA journals datasets are CC licensed, so I think those are all fair game as well
  • Would it be better to not actually serve datasets, but only be a discovery service?
  • If this was to narrow focus, what would be of the most benefit to the most people?
  • other things?

wcornwell commented Nov 24, 2016

My two cents: I think we really need a discovery service. We need lots of things to be sure, but maybe discovery would be a good place to start. As the repositories multiply (Dryad, Figshare, Zenodo, GigaDB, USDA, ESA, and Australia has its own one now; see NPG's list), there is basically no way to keep on top of what's coming out. It would be really useful for me if there was some kind of automated search for trait data. Is there some way to (smartly) automate dataset discovery?

ibartomeus commented Nov 24, 2016

  • Do you think people will use this?
    In my experience lots of ecologists will look for trait data in a web portal, few using R and almost none using other programmatic tools.
  • Would it be better to not actually serve datasets, but only be a discovery service?
    I think a two step process 1) locate where the datasets are (that is search through dryad, etc...), 2) download them if you want, would be optimal.

https://traits.party gives a Forbidden message.

  • Others: I am working on a curated open trait database (initially for bees). One big problem is the standardization of search terms: what do we call the same traits, trait units, etc. Locating datasets with traits would already be a big improvement, but I imagine collating those traits in a single table would still need lots of tailored work depending on the sources. I'll keep you posted about my progress.

Owner

sckott commented Nov 24, 2016

@ibartomeus

In my experience lots of ecologists will look for trait data in a web portal, few using R and almost none using other programmatic tools.

Sorry, should have included that it'd be straightforward to make a website on top of the API for people that prefer that, and the API can easily be used from R, Python, etc. I'll edit my description above

Owner

sckott commented Nov 24, 2016

@wcornwell Thanks for your thoughts.

Is there some way to (smartly) automate dataset discovery?

Perhaps. If it was a catch-all dataset discovery service that'd be a different thing: useful, but different from this (assuming this remains focused on traits). I think it could be done either using web services (e.g., Dryad's OAI-PMH or Solr service) or just scraping if needed. In terms of user interface, could add RSS feeds, emails, other notifications about new items coming out.
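
To make the web-services route concrete, here is a minimal sketch of harvesting via OAI-PMH: build a `ListRecords` request for records added since a date, then keep records whose Dublin Core title or subject mentions a keyword. The base URL, the sample response, and the keyword matching are all assumptions for illustration; a real repository's endpoint and metadata would need checking, and keyword matching is only a crude stand-in for "smart" discovery.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH responses and Dublin Core elements
OAI_NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, from_date, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords URL for records changed since from_date."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix, "from": from_date}
    return base_url + "?" + urlencode(params)

def trait_like_records(xml_text, keywords=("trait",)):
    """Return (identifier, title) for records whose title/subject contains a keyword."""
    root = ET.fromstring(xml_text)
    hits = []
    for record in root.iterfind(".//oai:record", OAI_NS):
        identifier = record.findtext("oai:header/oai:identifier", default="", namespaces=OAI_NS)
        titles = [t.text or "" for t in record.iterfind(".//dc:title", OAI_NS)]
        subjects = [s.text or "" for s in record.iterfind(".//dc:subject", OAI_NS)]
        haystack = " ".join(titles + subjects).lower()
        if any(k in haystack for k in keywords):
            hits.append((identifier, titles[0] if titles else ""))
    return hits

# Made-up response, shaped like an OAI-PMH ListRecords reply, for demonstration only
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Leaf trait measurements for 40 tree species</dc:title>
          <dc:subject>functional traits</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Stream temperature logger data</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://example.org/oai", "2016-11-01"))
    print(trait_like_records(SAMPLE))
```

Run on a schedule, the hits from something like this could feed the RSS or email notifications mentioned above.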

Owner

sckott commented Nov 24, 2016

@ibartomeus Sorry, the base URL is supposed to auto-redirect to heartbeat, and in most cases it does. Do no other routes work for you?

The bee dataset sounds great! Right, two problems: discovering datasets, and then merging, are both hard.

ethanwhite commented Nov 26, 2016

  1. I definitely agree that "trait data seems to be one of the most locked down bits of the data ecosystem" and that any work to remedy this is useful.
  2. For me personally access is more of an issue than discoverability, but I agree that they are both important. As such I'd prefer access to data as well.
  3. How are you thinking of this as relating to TraitBank, which as I understand it is trying to solve the discoverability problem and also providing access to the raw data through APIs? Their website allows search on most of the things you listed and you've already handled access to the API in https://github.com/ropensci/traits. Is it that it's not ingesting enough of the available data? No longer being actively developed? Something it's missing?

cboettig commented Nov 26, 2016

Re discovery, you might want to consider searching DataONE, which already indexes metadata from many major ecological repositories.

cboettig commented Nov 26, 2016

Or maybe DataCite, though it might not expose enough metadata for very rich search, it would include the general purpose repos like figshare & zenodo. (Incidentally, that will be a weakness in general with repos like figshare where there is very little metadata to help discovery.)

Owner

sckott commented Nov 26, 2016

@cboettig Right, for discoverability there is DataONE and DataCite - so we don't want to reinvent what they're doing. They aren't specific to traits though. But some subset will be trait datasets.

Owner

sckott commented Nov 26, 2016

@ethanwhite Thanks for your comments.

access is more of an issue than discoverability

good to know.

How are you thinking of this as relating to TraitBank

The traits pkg could pull data from this API as well, so yeah, that's sorted. wrt Traitbank, AFAIK they don't really expose much of anything in their API for traitbank; that is, I think you can only request data for a given species ID. So you can't search Traitbank via the API, only via the web interface. I don't aim for this to be species centric as they are doing, where they're collating traits for individual species.

For this service, datasets will be the units of observation. Then we layer on top Elasticsearch to allow flexible query across datasets, so we can maintain the individual datasets as units.

Also, I think API first thinking will make this much more flexible and widely used in the end since it can be used in anything, whereas almost all Traitbank web portal functionality is not in their API (so not helping with reproducibility)

I have a feeling Traitbank is not running on a full tank of gas, but that's just my impression from activity on it and how long it takes to fix things.

I don't have a sense for how much data they have.

Owner

sckott commented Dec 8, 2016

also love to hear from:

@dmcglinn @dlebauer @zachary-foster @davharris and anyone else really

dlebauer commented Dec 8, 2016

Do you think people will use this?

Yes!

Any issues you see with licensing/etc? All Dryad datasets are CC0, and I've been using those so far. AFAIK supp. ESA journals datasets are CC licensed, so I think those are all fair game as well

Yes! with both licensing and attribution. It is an unsolved problem worthy of another thread and many conferences / working groups. CC0 makes it easiest to share data but makes it hard to give credit and track provenance. I'll punt on this after I apologize to anyone for the fact that it is hard to aggregate data and also cite every source (though we try).

Would it be better to not actually serve datasets, but only be a discovery service?

No!

There are many discovery services. An interface that provides consistent access to many datasets is much more valuable than a discovery service. And there is a need for guidance to researchers on how they can prepare data to make it most easily reusable. Consistent metadata only gets so far - consistent formats will do much more to facilitate synthetic research.

First, users of trait data would benefit from knowing how they can collect and arrange their data so that it can be combined with other data.

Second, the hardest part of using data from multiple sources is converting two formats into one. It doesn't need to be done by every person who wants to combine the same two datasets. But if it is done twice, we can use that as an opportunity for qc.

While any format must balance generality with utility, at some point there may be a need for a different approach, e.g. to capture additional information. But we can move toward doing this more based on need than by lack of guidance. If we (the community that creates and uses trait data) can converge on a countable number of clearly defined formats, it will be easier to develop translators and interfaces. This will make it easier to leverage each other's data in our own pipelines.

If this was to narrow focus, what would be of the most benefit to the most people?

Hard to tell because the API returns 'forbidden' (#2) but

I think the value will be in providing access to many datasets in a common 'homogenized' format. AND providing scientists with a template for their own data. Once the data formats (aka 'data models' or 'column names') are defined the value is in creating importer / exporter functions that translate among diverse data models (e.g. so that traitdb could easily import dumps from betydb and vice versa).
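
A translator of the kind described can be sketched as a column mapping plus optional unit conversions. Everything below is a hypothetical illustration: the column names on both sides are invented for the example and are not the actual traitdb or BETYdb schemas.

```python
# Hypothetical mapping: source column -> (common-format column, unit conversion or None)
SOURCE_TO_COMMON = {
    "species": ("taxon", None),
    "sla_cm2_g": ("sla_m2_kg", lambda v: v / 10),  # 1 cm^2/g = 0.1 m^2/kg
    "lat": ("latitude", None),
    "lon": ("longitude", None),
}

def to_common(row, mapping):
    """Translate one record (a dict) from a source data model to the common one."""
    out = {}
    for source_col, value in row.items():
        if source_col not in mapping:
            continue  # drop columns the common model does not define
        target_col, convert = mapping[source_col]
        if convert is not None and value is not None:
            value = convert(value)
        out[target_col] = value
    return out

if __name__ == "__main__":
    record = {"species": "Acer rubrum", "sla_cm2_g": 200.0, "plot": "A"}
    print(to_common(record, SOURCE_TO_COMMON))
```

With one such mapping per data model, an exporter is the same function applied with the inverse mapping, which is why a small, well-defined set of formats keeps the number of translators manageable.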

As an example I will talk about the database I've developed called BETYdb (on github and accessible via the ropensci traits package). BETYdb aims to solve some of the issues related to data homogenization and accessibility. BETYdb was designed with a web application that facilitates the collection of data extracted from figures and tables in journal articles, found in supplementary datasets, collected in the lab or field, or streaming from sensors. The primary portal betydb.org provides open access to the data it contains. But the web application can be used independently, and there is currently a network of ~10 independent instances of the database that sync their public data.

Back to the point about having a countable number of defined interfaces; if you can write a single interface between the betydb and traitdb APIs, traitdb could host any data shared by users of BETYdb. Furthermore, the data available in traitdb could be synced to all BETYdb servers, to be used in teaching and research.

This example extends to any other trait database - each one provides its own functionality and has its own user community. Effort should be made not just to have a single database but to allow all trait databases to share each other's data. In fact, genomic databases also contain trait data (e.g. to support analysis of gene-phenotype relationships using QTL and GWAS analysis). We are currently working on interfaces between BETYdb and genomic databases. These databases contain a lot of domain knowledge and functionality required for specific use cases. Coupling databases is the most efficient way to make use of and keep up to date with current functionality and user needs.

Owner

sckott commented Dec 8, 2016

thanks for your feedback @dlebauer !

I get your point about talking to other databases - though first thing to focus on is what is this thing - happy to explore talking to other databases later. So yeah, what should this thing focus on? rolling over to next comment:

putting data into homogenized formats would be nice. I'd like to make certain fields across datasets homogenized to make searching across them easier (e.g., taxonomy, geospatial, place names, etc.) and that should help some.

Owner

sckott commented Dec 8, 2016

It seems somewhat narrowly focused databases are the most successful, as perhaps it's easy to pitch to the target audience, and the interfaces/etc. can be made specific to the field, and curation/homogenization is tractable given the narrower scope.

  • One idea is to focus entirely on supplementary datasets that are not in dryad/other repos - so rescuing datasets that would otherwise be very hard to find. Then we're not duplicating what's already searchable, etc. in dryad and friends - but does that focus not really serve a useful goal?

dlebauer commented Dec 8, 2016

@sckott I'd suggest a separate thread on what this is ... I have some comments on what I can infer about the schema but it would help to have a database dump and perhaps a diagram of table relationships, plus a description of the scope.

zachary-foster commented Dec 9, 2016

Hi @sckott, I am not sure I fully understand this yet, but I will comment as best I can. Finding datasets with specific properties quickly would be very useful. In addition to "taxonomy, geospatial, place names", I would include date (range?), and perhaps data type (e.g. abundance matrix vs occurrence matrix; perhaps this is the same as "trait type"). Also, observational studies vs experimental studies could be a useful distinction, since both types in the same analysis might not make much sense.

Do you think people will use this?

I think they would if they knew it existed.

Any issues you see with licensing/etc? All Dryad datasets are CC0, and I've been using those so far. AFAIK supp. ESA journals datasets are CC licensed, so I think those are all fair game as well

There might be, but perhaps you can avoid this by only supplying data with compatible licenses and simply letting the user know of the existence and location of data with incompatible licenses so they can get it manually.

Would it be better to not actually serve datasets, but only be a discovery service?

Perhaps a combination of the two so that the licensing issue can be handled in the way stated above.
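
The combined approach could look something like the sketch below: serve records for datasets whose license allows redistribution, and return only a pointer to the source otherwise. The set of redistributable licenses, the record fields, and the function name are assumptions for illustration, not a statement of what any license actually permits.

```python
# Assumption: license identifiers the service has decided it may re-serve
REDISTRIBUTABLE = {"CC0-1.0", "CC-BY-4.0"}

def plan_response(dataset):
    """Decide whether to serve a dataset's records or only link to its source."""
    license_id = dataset.get("license")
    if license_id in REDISTRIBUTABLE:
        return {"id": dataset["id"], "mode": "serve", "license": license_id}
    # Unknown or incompatible license: tell the user where the data lives
    return {
        "id": dataset["id"],
        "mode": "link_only",
        "license": license_id,
        "source_url": dataset.get("source_url"),
    }

if __name__ == "__main__":
    print(plan_response({"id": "d1", "license": "CC0-1.0"}))
    print(plan_response({"id": "d2", "license": None, "source_url": "https://example.org/d2"}))
```

Treating a missing license as link-only keeps the default conservative.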

If this was to narrow focus, what would be of the most benefit to the most people?

I’m not sure I understand this, but species occurrence and abundance data should have relatively consistent formats (matrices) and would be easiest to combine for meta analysis.

I hope that is useful. As far as I understand it, the goal of this package is to provide a webservice to locate and extract information from diverse datasets based on a set of criteria. Is that right?

davharris commented Dec 9, 2016

@sckott Not sure I have anything to add beyond what's already been said. Thanks for thinking of me though.

dlebauer commented Dec 9, 2016

@zachary-foster I am not working with species occurrence / abundance data (I don't know if this is within scope for a 'trait' database, but it seems that it could fit), so we are coming from different perspectives. But your comment "observational studies vs experimental studies could be a useful distinction" raises an interesting question that could be discussed in a separate thread on schema design. And I am interested in the useful level of generalization.

The key issue here is that the definition of 'observational' vs. 'experimental' and even the definition of the experimental design depends on the questions being asked. And the fact that there are almost as many experimental designs as there are studies means that it takes some effort to harmonize data collected with different goals. A meta-analyst may have different questions than the original researcher, and the original questions may not be relevant to the meta-analysis. Harmonization can include abstracting the experimental treatments into covariates that can be compared across studies.

zachary-foster commented Dec 9, 2016

@dlebauer Thanks for the feedback. I'm not sure I understand the scope of what a "trait" is in this instance, so I could be off in my comments.

The key issue here is that the definition of 'observational' vs. 'experimental' and even the definition of the experimental design depends on the questions being asked.

Yes, that is a difficult distinction, especially to automate in any useful way. When I say "observational" I am thinking of studies that simply record information for natural systems without active manipulation. These strike me as easier to combine into larger datasets than experimental designs, because as you said "there are almost as many experimental designs as there are studies". I have no idea how to go about combining experimental data in a robust way. The reason I think this might be a useful distinction is that combining the two types would rarely make sense.

For example, say I wanted to know the natural DBH range of a species of tree. I would want observational data from natural systems, not experimental data from plots with nutrient amendments or different logging practices. If I was interested in the effect of nutrient amendments or different logging practices, I would not want observational, but controlled experimental data. I’m not sure if it would ever make sense to mix the two, so the distinction might be useful.

dlebauer commented Dec 9, 2016

@zachary-foster I'll try to clarify with your example:

I would want observational data from natural systems, not experimental data from plots with nutrient amendments or different logging practices.

In this case the observational data would be equivalent to the unfertilized, unlogged control. Such a study may wish to consider stand age as a covariate. In this case the stand age could be time since disturbance, and disturbance could include logging, fire, disease, farm abandonment, etc.

If I was interested in the effect of nutrient amendments or different logging practices, I would not want observational, but controlled experimental data.

In this case, in the context of a meta-analysis, the observational study could be treated as a 'control' and used to inform the intercept (at 0 fertilization) whether or not there was any fertilization treatment.

I’m not sure if it would ever make sense to mix the two, so the distinction might be useful.

An easy approach is to create a random effect of 'treatment' to estimate the mean under control conditions as described in LeBauer et al 2013 and implemented in the PEcAn meta analysis module.

Owner

sckott commented Dec 9, 2016

thanks for your feedback @zachary-foster - I'd like to stay in the realm of trait data. I'm defining that as things measured on organisms, whether it be observational or experimental - I hadn't thought of a distinction there, just that it's trait data. I don't include abundance for the purposes of this thing. I don't think I'd want to exclude experimental results.

zachary-foster commented Dec 11, 2016

@sckott, yea, I was not entirely sure what you were going for. I think I understand better now, but I am not familiar with the tools you are using so I probably can't make many helpful comments. I did not mean that you should exclude experimental results, just that there might be instances where people want trait data from "unmanipulated" organisms/ecosystems; however, I expect automating that distinction is hard.

robgur commented Dec 13, 2016

I am late to this party, @sckott, but am tracking all the comments now. As you know, we've spent a fair bit of effort getting body length and body mass measurements unlocked from specimen data - those trait measurements were already in records, just not harmonized to be usable and mostly hidden. Vertnet now has a trait search function and all the trait values have also been pushed to archival repos, in this case on Cyverse. We discuss some of the effort here: http://blog.vertnet.org/post/150968616716/sizing-up-the-improved-vertnet-portal and there is a paper out in the journal Database that covers all the details.

We are also pursuing what amounts to an assertion graphstore about traits built on top of an observation and measurement model that can work whether the entity described is a species, an individual, etc. It looks something like your idea -- a graphstore and API. We'd love to build something bigger still, and connect together all this content more coherently. We are not trying to be Traitbank --- more of a service for data assembly and aggregation.

We'd like to pursue more here collaboratively. For one thing, we can push you the trait data we have extracted for your service and talk about connecting any services we build. It would be great to do that more in a linked open data framework if possible.

dfalster commented Dec 20, 2016

Hi Scott,

Sorry to be so late commenting. I've been finishing off my current position so a bit swamped the last weeks.

First, I absolutely agree more/better services are needed for discovery and distribution of trait data (and scientific data more generally).

Second, I think the biggest breakthrough that could be made would be to establish a set of standardised templates (e.g. column names for a table of trait data) and then a collection of scripts which grab individual resources (e.g. from dryad, figshare, zenodo, or wherever) and coerce the relevant data into the relevant format. As a community, we could then start sharing these scripts around.

The traits service/site you are proposing could fulfil this vision by providing

  1. A place where standards and scripts are assembled
  2. Routine testing of scripts
  3. A cache of data obtained using the scripts provided from number 1 (these might arise as outputs of the tests on each script)
  4. An interface that sits on top and allows searching across the assembled data (and scripts). Importantly, the site would also provide, with any download, a script that enables the data provided to be retrieved afresh from the original sources.
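
The four points above can be sketched as a small registry of scripts, a routine check of each script's output against the template, and a cache of the rows that pass. The template columns and the two example scripts are invented for illustration; a real script would download and reshape a dataset rather than return a literal.

```python
# Hypothetical standard template: the columns every script's output must have
TEMPLATE_COLUMNS = ("dataset_id", "taxon", "trait", "value", "unit")

def check_rows(rows, columns=TEMPLATE_COLUMNS):
    """Return a list of problems; an empty list means the rows fit the template."""
    problems = []
    for i, row in enumerate(rows):
        missing = [c for c in columns if c not in row]
        extra = [c for c in row if c not in columns]
        if missing:
            problems.append(f"row {i}: missing {missing}")
        if extra:
            problems.append(f"row {i}: unexpected {extra}")
    return problems

def run_scripts(scripts):
    """Run each registered script, test its output, and cache rows that pass."""
    cache, report = {}, {}
    for name, script in scripts.items():
        rows = script()
        report[name] = check_rows(rows)
        if not report[name]:
            cache[name] = rows
    return cache, report

def example_script():
    # Stand-in for a script that fetches one dataset and coerces it to the template
    return [{"dataset_id": "example-1", "taxon": "Acer rubrum",
             "trait": "height", "value": 12.5, "unit": "m"}]

def broken_script():
    # Stand-in for a script whose output has drifted from the template
    return [{"dataset_id": "example-2", "species": "Acer rubrum"}]

if __name__ == "__main__":
    cache, report = run_scripts({"good": example_script, "bad": broken_script})
    print(sorted(cache), report["bad"])
```

Because the cache is only ever filled from script output, shipping the script name alongside a download is enough for a user to regenerate the same rows from the original source.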

The model I am suggesting is like the academic torrents site, which uses BitTorrent technology to distribute open datasets. But we add to that standardised formats and the data manipulation scripts needed to achieve that.

It may help to have a location where all the data is hosted, but IMO the primary focus should be on collating and testing of scripts that fetch and manipulate data from the growing number of repositories that are popping up.

Hope this provides some good food for thought!

All the best,
Daniel

PS. Somewhat related, but not ecology, I was really inspired by the NICTA national map. This provides a unified interface for diverse spatial data across Australia. They don't host the data, but rather provide a slick interface linking out to different datasets hosted on many different servers and putting them onto a common spatial interface.

dlebauer commented Dec 20, 2016

Daniel: Great ideas, nicely said!

Owner

sckott commented Dec 22, 2016

@robgur

thanks for info on vertnet traits

We'd like to pursue more here collaboratively. For one thing, we can push you the trait data we have extracted for your service and talk about connecting any services we build. It would be great to do that more in a linked open data framework if possible.

all for working together! Though this thread is part of figuring out what this service is first :)

yes, linked data would be best, I'll consider that.

Owner

sckott commented Dec 22, 2016

thanks for your comments @dfalster !!

I'm all for your proposed idea about scripts, templates, etc. And I think this plays nicely with our idea at https://github.com/ropensci/openscripts (not off the ground yet) - and maybe this idea deserves our collective attention.

For this traitdb thing, I do still feel like it's useful to "rescue" datasets that are lost in the graveyards of journal supplementary materials - and provide through an API - BUT the other idea for scripts is perhaps even of more immediate usefulness

NICTA national map

nice example. A map is a nice interface. I do prefer API first, then the map should be pretty easy, if people want it.

robgur commented Dec 28, 2016

Owner

sckott commented Dec 31, 2016

thanks @robgur
