Action by software: identification of the software #38

Open
mtrekels opened this issue Jun 24, 2020 · 18 comments
Comments

@mtrekels

If the property:type has value http://schema.org/SoftwareApplication

Is there a way to specify:

  • software used
  • version of the software (release number, revision number...)

Use-case: action is performed in an automated way (identified by artificial intelligence, recorded with camera trap...)
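A minimal sketch of what such a record could look like, assuming the software name and version are carried as plain strings next to the type. The class `Agent`, the field names `software_version` and `identifier`, and the example classifier are hypothetical and only serve as illustration; they are not terms defined by this project.

```python
from dataclasses import dataclass
from typing import Optional

SOFTWARE_TYPE = "http://schema.org/SoftwareApplication"
PERSON_TYPE = "http://schema.org/Person"


@dataclass
class Agent:
    """Hypothetical agent record; field names are assumptions for illustration."""
    type: str
    name: str
    software_version: Optional[str] = None  # release or revision number, if known
    identifier: Optional[str] = None        # persistent identifier, if one exists


def describe(agent: Agent) -> str:
    """Return a human-readable label, appending the version for software agents."""
    if agent.type == SOFTWARE_TYPE and agent.software_version:
        return f"{agent.name} {agent.software_version}"
    return agent.name


# Made-up software agent for the automated-identification use case.
classifier = Agent(
    type=SOFTWARE_TYPE,
    name="Example Species Classifier",
    software_version="2.1.0",
)
print(describe(classifier))
```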

@mtrekels
Author

Similarly: do we need a property to identify the hardware used?

@matdillen

Wouldn't we "just" need to point to an external ID for the observing hardware/software? That's what we would do if the subject was a person, when we want to include other subject metadata.

Of course, this raises the question of where we would keep our persistent IDs for our camera traps.

@mtrekels
Author

This is mainly an issue for the use case where no persistent identifier is available. The property 'name'/'verbatimName' is provided, but this mainly covers 'human' agents.

Extra question: a persistent identifier is not really 'human readable'. Do we want to include the possibility of a more readable representation of the agent?

@matdillen

Well, in the absence of an identifier, the name/verbatimName combination can just as well be insufficient for a person as it can be for hardware or software.

The key problem in both cases is disambiguation. Common names like James Smith or Zhang Wei will hardly be sufficient without additional information or a unique ID. Internal identifiers could be appended to the name field, e.g. James Smith (3) or Zhang Wei_13, but that is only helpful as an opaque internal identifier.

I think we're running into the problem we're already trying to solve by strongly recommending the use of unique identifiers. Is it important that we come up with a solution in the absence of identifiers as well?
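The disambiguation problem can be sketched in a few lines, assuming agent records are simple dictionaries with a `name` and an optional `identifier`. The records and the `example.org` identifiers below are invented for illustration.

```python
from collections import defaultdict

# Invented records: two different agents share a name, one has no identifier.
records = [
    {"name": "James Smith", "identifier": "https://example.org/agent/1"},
    {"name": "James Smith", "identifier": "https://example.org/agent/2"},
    {"name": "Zhang Wei", "identifier": None},
    {"name": "Example Species Classifier", "identifier": "https://example.org/agent/3"},
]


def ambiguous_names(records):
    """Return names that cannot be resolved to exactly one identified agent.

    A name is ambiguous if it maps to more than one identifier, or if any
    record carrying that name has no identifier at all.
    """
    ids_by_name = defaultdict(set)
    for record in records:
        ids_by_name[record["name"]].add(record["identifier"])
    return sorted(
        name
        for name, ids in ids_by_name.items()
        if len(ids) > 1 or None in ids
    )


print(ambiguous_names(records))
```

The same check applies unchanged to people, hardware, and software, which is the point made above: without an identifier, the name alone does not disambiguate any of them.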

@mtrekels
Author

@dshorthouse we mentioned the issue of attribution of software at the TDWG workshop. One take-away message from the meeting is that by attributing the software, you actually attribute the 'developers' of the software/algorithm.

@deepreef

@mtrekels : A similar issue was raised during the Machine Observations session. When a robot (or some other automated machine) records an observation, who should get the attribution? The robot/machine as an "agent", or the engineers who designed the robot and its software? Does it/will it make a difference if AI algorithms are used to make decisions about when to record an observation? At what point does attribution pass from the developers of the software logic to the machine itself?

@dshorthouse
Contributor

Value judgments re: attribution aside, what I think we'll need in this context is a way to uniquely identify the software, just as we (mostly) have mechanisms to uniquely identify people. A GitHub repo? A Zenodo DOI? Other?

@mtrekels
Author

Thanks @deepreef and @dshorthouse for the remarks.

Indeed, it may not be entirely clear whether software can be considered an agent. However, my personal feeling is that it's 'something' that performs an action, so I think it fits the definition.

As for which identifier to use, in many cases this is probably of similar complexity as for people. Some software will have a well-defined identifier (GitHub, Zenodo...). Others unfortunately will not (some commercial software might be more difficult, but we/I need to think/search a bit deeper). However, I think that nowadays traceability of software is quite good.

@wouteraddink

wouteraddink commented Sep 29, 2020 via email

@deepreef

deepreef commented Sep 29, 2020

I agree with @mtrekels and @wouteraddink on this -- there really needs to be a way to cite a non-human (non-living?) entity as an agent for attribution purposes. But there is a subtle and potentially important issue related to this, especially if identifiers are to be minted, which has to do with establishing some sort of parity between an instance of a human agent and an instance of a non-human/non-living agent. "Human" refers to a class of thing (≈Homo sapiens Linnaeus 1758), and individual persons are instances of that class. "Human" itself is a subclass of "living things" (and so on). So on the software (and hardware) side of Agents playing roles, what are the equivalencies? At first blush, I would think "Electronic Equipment" and "Computer Software" are more or less congruent to "Living Thing". But how to define an "instance" of one of these to be comparable to an instance of a human (e.g., Richard Lawrence Pyle)? Would it be sufficient to document it as something like "Whizbang Organism Identifier Software version 3.7", or "Canon EOS R5"? Or would it be better to somehow identify a specific instance/installation of those "subclasses" of software/hardware (e.g., via individual serial numbers)?

Most of this is philosophical, so may not be worth spending any time on. But there are some practical implications, and I would assume whatever system is proposed or adopted would be clear enough and well-defined enough to promote consistent implementation (and, maybe more importantly, consistent understanding of what an identifier minted for a non-human agent actually refers to).

@matdillen

Persistent identifiers for software have been designed. Not sure how frequently these are used.

Do we need to indicate the nature of the agent (human, other animal, software, hardware...)? I think this falls under the authority of the resource from which we use the identifier. We may want to define what can possibly constitute an agent and what cannot, to avoid interoperability issues in the future.

@danstowell

danstowell commented Feb 18, 2021

I agree with others here that it's reasonable to treat an algorithmic agent as the entity to be attributed. (As well as the algorithm "designers", there is, increasingly commonly, the "training dataset", and indeed it gets muddy if we try to ignore the existence of the algorithm as a stable entity issuing assertions.)

I also have found Prov-O a good source, and I hope Prov-O will help us with the "indirect" attribution via a software agent to its creators.

I don't think TDWG should try to enumerate all the ways that an algorithm could be described. Name of software and version of software would be fine, but presumably simply as string values. Beyond that, we should allow for some opaque/external way(s) to refer to algorithms, such as the "persistent identifiers" mentioned by Mat. I've not seen those identifiers before, but what I have seen often is the use of DOIs to refer to a specific software edition, in particular via the GitHub-Zenodo DOI service. Would DOI be one useful option? (I expect it wouldn't cover every case.)

Edit: re-reading the thread, I realise that DOIs are fine, and the more difficult issue is how to refer to software agents that don't have any such obvious identifier.
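To illustrate the three situations (a DOI, some other resolvable URI, or only a free-text string), here is a rough sketch. The function `identifier_kind` is hypothetical, the DOI pattern is a commonly used heuristic rather than a complete validator, and the Zenodo DOI and repository URL in the usage lines are made-up placeholders.

```python
import re
from urllib.parse import urlparse

# Heuristic for modern DOIs: "10.", a 4-9 digit registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
DOI_PREFIXES = ("https://doi.org/", "http://doi.org/", "doi:")


def identifier_kind(value: str) -> str:
    """Classify an agent identifier as 'doi', 'uri', or plain 'string'."""
    text = value.strip()
    candidate = text
    for prefix in DOI_PREFIXES:
        if text.lower().startswith(prefix):
            candidate = text[len(prefix):]
            break
    if DOI_PATTERN.match(candidate):
        return "doi"
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "uri"
    # No recognisable identifier: this is the difficult case described above.
    return "string"


print(identifier_kind("https://doi.org/10.5281/zenodo.1234567"))  # placeholder DOI
print(identifier_kind("https://github.com/example/classifier"))   # placeholder repo
print(identifier_kind("Whizbang Organism Identifier Software version 3.7"))
```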

@matdillen

> Edit: re-reading the thread, I realise that DOIs are fine, and the more difficult issue is how to refer to software agents that don't have any such obvious identifier.

Yes, and I think the problem is similar for people without obvious identifiers. The solution is to strongly recommend unique identifiers and list good examples. For software, it is a bit more complex than for people, because you can have different versions of the same software around at the same time, but only one version of a person.

The other question remains unanswered, I think: "Do we need to indicate the nature of the agent (human, other animal, software, hardware...)?" This is particularly relevant when we just have a string, and not a URI to a resource which can address this for us.

@wouteraddink

For software as agent I think what you want to identify is the instance of the software, not (only) the software project and version. The instance is very comparable to a human, there is only one and it may have a unique configuration/set of settings. The lifespan of a software instance is much shorter than that of a human though.

@pmergen

pmergen commented Feb 18, 2021

Hi

I am currently involved in another discussion group, on how to convey scientific knowledge to different audiences in the digital transformation world. They strongly advocate for having the best, most transparent and complete information on the source of the information. In this context it is probably important to have the software or the algorithm shown as the "agent". As for the "creator" of the algorithm, it seems important to have that too, although there are many branches (spin-offs) created by the community.

As for crediting the contributors who make the algorithm evolve, that's also quite tricky, as many can be involved in making it evolve, including the users themselves by using it ... so I would not go down that road.

@matdillen

I think crediting the authors of software or builders of hardware is not within the scope of Darwin Core. That is up to the resources we link to identifying the software/hardware.

Identifying a software version is much more straightforward than identifying a software instance, including configuration details, system specs, training datasets... While this is obviously interesting for the sake of repeatability, it also constitutes considerable overhead and goes further than where we are currently at with human agents. We don't qualify their state of mind at the time of their action, nor the physical context in which they performed it. When you make the analogy to human agents, you can also notice how there may be some privacy concerns.
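The difference between identifying a version and identifying an instance can be made concrete with a small sketch. The classes `SoftwareVersion` and `SoftwareInstance`, the `settings` dictionary, and the hash-based fingerprint are all assumptions for illustration; nothing like this is proposed as a term here.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SoftwareVersion:
    """A released version: cheap to identify with a name and version string."""
    name: str
    version: str


@dataclass
class SoftwareInstance:
    """A particular installation: the version plus its configuration."""
    software: SoftwareVersion
    settings: dict = field(default_factory=dict)  # hypothetical configuration

    def fingerprint(self) -> str:
        """Hash the version and settings so different configurations differ."""
        payload = json.dumps(
            {
                "name": self.software.name,
                "version": self.software.version,
                "settings": self.settings,
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


release = SoftwareVersion("Example Species Classifier", "2.1.0")
field_unit = SoftwareInstance(release, {"threshold": 0.8, "model": "birds"})
lab_unit = SoftwareInstance(release, {"threshold": 0.5, "model": "birds"})

# Same version, different instances.
print(field_unit.software == lab_unit.software)
print(field_unit.fingerprint() == lab_unit.fingerprint())
```

Even this toy fingerprint needs every relevant setting to be captured and serialised consistently, which hints at the overhead of going beyond the version.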

@danstowell

danstowell commented Feb 18, 2021

Over in the "DWC for biologging" group there's a closely related discussion tdwg/dwc-for-biologging#29 which includes one suggestion that basisOfRecord is the right way to distinguish between human and machine identifications. @matdillen would this answer the "Do we need to indicate the nature of the agent" question?

Perhaps if basisOfRecord:HumanObservation, the agent should be "people, groups, or organizations" (as in the current identifiedBy definition), whereas if basisOfRecord:MachineObservation, the agent should be an instance of some class/subclass representing a machine or an algorithm instance.
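As a sketch of how that rule might be checked, assuming agent types are given as simple labels: the machine-side class names (`Machine`, `Algorithm`) and the function name are hypothetical, since no such classes have been defined yet. Values of basisOfRecord that the rule does not mention are deliberately left undecided.

```python
from typing import Optional

# Human side follows the wording of the current identifiedBy definition.
HUMAN_AGENT_TYPES = {"Person", "Group", "Organization"}
# Hypothetical class names for the machine side; not defined anywhere yet.
MACHINE_AGENT_TYPES = {"Machine", "Algorithm"}

ALLOWED_AGENT_TYPES = {
    "HumanObservation": HUMAN_AGENT_TYPES,
    "MachineObservation": MACHINE_AGENT_TYPES,
}


def agent_type_is_consistent(basis_of_record: str, agent_type: str) -> Optional[bool]:
    """Check an agent type against the basisOfRecord rule sketched above.

    Returns True or False when the rule applies, and None when the
    basisOfRecord value is not covered by the rule.
    """
    allowed = ALLOWED_AGENT_TYPES.get(basis_of_record)
    if allowed is None:
        return None
    return agent_type in allowed


print(agent_type_is_consistent("HumanObservation", "Person"))
print(agent_type_is_consistent("MachineObservation", "Person"))
print(agent_type_is_consistent("PreservedSpecimen", "Person"))
```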

@matdillen

> Over in the "DWC for biologging" group there's a closely related discussion tdwg/dwc-for-biologging#29 which includes one suggestion that basisOfRecord is the right way to distinguish between human and machine identifications. @matdillen would this answer the "Do we need to indicate the nature of the agent" question?
>
> Perhaps if basisOfRecord:HumanObservation, the agent should be "people, groups, or organizations" (as in the current identifiedBy definition), whereas if basisOfRecord:MachineObservation, the agent should be an instance of some class/subclass representing a machine or an algorithm instance.

When dealing with HumanObservation and MachineObservation this works, but not with specimens such as PreservedSpecimen. But that is a more fundamental problem with basisOfRecord when it comes to specimens (tdwg/dwc/issues/302). We also have no way to distinguish between human and nonhuman (or further, such as hardware vs software) for other actions than the observing/recording. We may need terms such as basisOfIdentification for that.

7 participants