define machine-readable? #2

Closed
ltalirz opened this issue Feb 16, 2018 · 9 comments

Comments

@ltalirz
Contributor

ltalirz commented Feb 16, 2018

Since you are restricting the list to machine-readable datasets (and rightfully so, I would say), it would be very helpful to explain what this means, perhaps best using a few examples.

In practical terms: Many of these materials science efforts provide an HTML form, which connects to a database and spits out another HTML page with search results (possibly paginated). Should this count as machine readable?
In principle, of course, all information made available in digital form can be considered machine readable, but then we can drop the requirement in the first place.

In my view:

  • if the whole database can be downloaded, basically in whatever format, it's machine readable
  • if there is a documented API for automated requests, it's machine readable
  • if there is just a web form that allows querying the database... it kind of makes things unnecessarily difficult
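
A minimal sketch of how these three cases could be told apart, purely as an illustration. The names `Database`, `Access`, `access_level` and `is_machine_readable` are hypothetical and not part of this repository:

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    """The three access situations from the list above (hypothetical labels)."""
    BULK_DOWNLOAD = "bulk download"
    DOCUMENTED_API = "documented API"
    WEB_FORM_ONLY = "web form only"


@dataclass
class Database:
    """One entry of the list; the two flags are assumptions for illustration."""
    name: str
    bulk_download: bool = False   # the whole database can be downloaded
    documented_api: bool = False  # a documented API for automated requests exists


def access_level(db: Database) -> Access:
    # A full download wins over an API; a web form is the fallback case.
    if db.bulk_download:
        return Access.BULK_DOWNLOAD
    if db.documented_api:
        return Access.DOCUMENTED_API
    return Access.WEB_FORM_ONLY


def is_machine_readable(db: Database) -> bool:
    # Under the criteria proposed here, only the first two cases qualify.
    return access_level(db) is not Access.WEB_FORM_ONLY
```

With this reading, an entry such as `Database("example", documented_api=True)` would qualify, while one with neither flag set would fall into the "web form only" case.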

What did you have in mind?

In the end, perhaps it is best to drop the requirement and instead put something like a FAIR sticker (or similar) on those entries that actually make it easy to query the data automatically.

@blokhin
Member

blokhin commented Feb 17, 2018

Totally agree and support your point of view. I'd add one extra point, though:

  • if the authors provide their dataset in full privately (e.g. because they are unable to implement any APIs)

@blokhin
Member

blokhin commented Feb 17, 2018

Well, but that's basically your first point. The only difference is whether the offer is stated publicly.

@ltalirz
Contributor Author

ltalirz commented Feb 17, 2018

Well, even if a dataset is proprietary, this does not prevent one from implementing an (access-restricted) API.
But even if such an API is not present, if the whole database can be downloaded that's fine from my point of view.

How should we proceed? Should I make a pull request?
Perhaps I would rename "contributing" to "guidelines" and include a section there describing the "machine-readable" part.

And would you like to keep "machine-readable" as a basic requirement or would you rather provide a "machine-friendly" sticker that highlights those entries which make an effort to be machine-readable?

@blokhin
Member

blokhin commented Feb 18, 2018

Let's keep the machine-readable criterion as a basic requirement? I think it is crucial. On top of that, to my knowledge, all the datasets mentioned are (or were) investigated with data science methods.

@ltalirz
Contributor Author

ltalirz commented Feb 18, 2018

> Let's keep the machine-readable criterion as a basic requirement

Fine!

> to my knowledge, all the datasets mentioned are (or were) investigated with data science methods

Here it is not really clear to me what this means...

Some of the databases in the list can be downloaded, so that's fine. Some may have documented APIs for automated querying. But several have neither, or am I missing something?
What about the Zeolite Structures Database, WURM, the phonon database, NREL, ...?
I guess you can reverse-engineer the web forms quite easily, but where does one draw the line?

In essence, what I am looking for is the set of criteria that led you to the choice of the databases in the list (so that I know how to add to it).

@blokhin
Member

blokhin commented Feb 18, 2018

OK, let me try to formulate...

blokhin added a commit that referenced this issue Feb 26, 2018
…e answer on the mass downloads for data mining (relevant for #2)
@blokhin
Member

blokhin commented Mar 10, 2018

@ltalirz I thought about your suggestion and ended up with the following. Any database is machine-readable by design. Only the access policies matter (and they aren't necessarily FAIR!). For instance, upon a private agreement, one may be granted unrestricted access to a conservative, otherwise HTML-only data source.

After contacting some of the uncertain participants of my list, I received explicit or implicit requests for deletion. So why shouldn't we follow the canary principle? We just include anything we know was or would be of use for the mentioned or similar software frameworks and delete immediately upon request.

@ltalirz
Contributor Author

ltalirz commented Mar 10, 2018

> Any database is machine-readable by design. Only the access policies matter (and they aren't necessarily FAIR!).

Agreed.

> For instance, upon a private agreement, one may be granted unrestricted access to a conservative, otherwise HTML-only data source.

> We just include anything we know was or would be of use for the mentioned or similar software frameworks and delete immediately upon request.

Do I understand correctly that you are proposing to include any potentially useful database, as long as they do not explicitly state (publicly or to us) that they are not open for machine-based data mining?
I think this is a reasonable approach.

In this case, however, I would suggest two things:

  1. Define a set of symbols (can even be just words for the moment) that identify for each entry of the list its data-mining openness (free / commercial / unknown)
  2. Somewhere (doesn't need to be on the main page) keep a list of databases that have explicitly been excluded (new proposals will be checked against this list)
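
To make the two suggestions concrete, here is a rough sketch under stated assumptions. The names `Openness`, `EXCLUDED` and `review_proposal`, as well as the placeholder entry in the exclusion list, are hypothetical and only illustrate the idea:

```python
from enum import Enum
from typing import Iterable, Optional


class Openness(Enum):
    """Data-mining openness labels as suggested in point 1."""
    FREE = "free"
    COMMERCIAL = "commercial"
    UNKNOWN = "unknown"


# Point 2: databases whose maintainers explicitly asked not to be listed.
# The entry below is a placeholder, not a real exclusion.
EXCLUDED = {"example excluded database"}


def _normalize(name: str) -> str:
    # Compare names case-insensitively and ignore extra whitespace.
    return " ".join(name.lower().split())


def review_proposal(
    name: str,
    openness: Openness = Openness.UNKNOWN,
    excluded: Iterable[str] = EXCLUDED,
) -> Optional[str]:
    """Return a labelled list entry, or None if the database was excluded."""
    if _normalize(name) in {_normalize(e) for e in excluded}:
        return None
    return f"{name} [{openness.value}]"
```

A new proposal would then be passed through `review_proposal` before it is added; anything on the exclusion list is rejected, and everything else gets an openness label, defaulting to "unknown".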

@blokhin
Member

blokhin commented Mar 10, 2018

Great!

> Define a set of symbols (can even be just words for the moment) that identify for each entry of the list its data-mining openness (free / commercial / unknown)

There's a proprietary label already. Its absence implies the data are open.

> Somewhere (doesn't need to be on the main page) keep a list of databases that have explicitly been excluded (new proposals will be checked against this list)

OK, makes sense.

2 participants