IRI reuse and synonyms #17

Open
dbooth-boston opened this issue Dec 7, 2018 · 21 comments
Labels
Category: language features (for language features of RDF itself -- model and syntax); standards (standardization should address this)

Comments

@dbooth-boston
Collaborator

In theory, RDF authors should reuse
existing IRIs, rather than minting their own. But this makes
for messy RDF and increases the up-front burden on developers.
Consider a typical RDF project that integrates data from
multiple sources, and needs to connect that data into its own
vocabulary. The resulting data involves both the normalized
vocabulary and the non-normalized source vocabularies,
intermixed. The developers might be happy to adopt existing
concepts like foaf:name (for a person's name) and dc:title (for
a document title) into the project's normalized vocabulary.
But by using those existing IRIs instead of minting their
own IRIs in their own namespace (such as myapp:name and
myapp:title), it becomes hard to distinguish IRIs of the normalized
vocabulary from IRIs of the non-normalized source vocabularies.

Ideally a project should be able to use its own preferred names
(and namespaces), like myapp:name and myapp:title, while still
tying those names to existing external IRIs, such as foaf:name
and dc:title.

owl:sameAs is not great for this. It is too heavyweight
for simple synonyms, and it is only for OWL individuals --
not classes. Furthermore, it provides no way to indicate
which IRI is locally preferred. It would be good to have a
simple standard way to rename IRIs or define IRI synonyms.

@dbooth-boston added the "Category: language features" label Dec 8, 2018
@william-vw

Why would re-using existing IRIs be a problem? .. Wasn't the existence of different terms with identical meanings (such as name, title) - due to blindness about what vocabularies are out there - one of the major issues of the Semantic Web? It's a major hindrance to automated agents in interpreting distributed RDF data, and the integration of multiple RDF datasets.

I suppose my question is: why confound even the simple, intuitive concept of vocabulary (re-)use? Perhaps I'm misunderstanding the main rationale behind this issue .. Why would it lead to "messy" RDF data, and why is this related to normalization? I understand that this leads to an up-front burden for the developer - but that problem could be solved in a different way (see e.g., BioPortal, which greatly facilitates the discovery of relevant ontologies).

@dbooth-boston
Collaborator Author

@darth-willy the problem is not the re-use of existing IRIs per se. The problem is that users are forced to refer to those IRIs verbatim (modulo namespace prefixes) whenever they are used. To illustrate, suppose I want to use a particular collection of concepts from several namespaces, and I want to tell my users what they are. Right now I have to explicitly list all of those URIs, such as:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

skos:prefLabel
dc:title
foaf:name
rdf:type
rdfs:subClassOf
prov:wasDerivedFrom
owl:hasKey
etc.

And if you are debugging your RDF and looking at a term such as dc:author, is it on my approved list? Hmm, not easy to tell. You have to carefully examine the whole list.

It would be much more user-friendly if I could bundle up this entire collection of URIs, from many namespaces, into a single coherent package and use a common prefix for that entire bundle of concepts, so that others could use it like this:

@prefix : <..../myapp#> .

:prefLabel
:title
:name
:type
:subClassOf
:wasDerivedFrom
:hasKey
etc.

Then if you see dc:author in your data you will instantly know that it is not on my approved list, whereas if it shows up as :author then you know it is on the list.
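To make the check concrete, here is a minimal Python sketch of the "is it on the approved list?" test once every approved term shares one namespace. The namespace IRI `http://example.org/myapp#` and the helper `is_approved` are hypothetical (the thread elides the real namespace), and IRIs are handled as plain strings rather than through an RDF library.

```python
# Hypothetical application namespace; the thread elides the real one.
MYAPP = "http://example.org/myapp#"
DC = "http://purl.org/dc/elements/1.1/"


def is_approved(iri: str, namespace: str = MYAPP) -> bool:
    """Return True when the IRI sits in the application's bundled namespace."""
    return iri.startswith(namespace)


# Terms as they might show up while debugging some data.
# dc:author is the thread's own example of a term that is not on the list.
terms = [MYAPP + "title", DC + "author"]

# Anything outside the bundled namespace is flagged at a glance.
unapproved = [t for t in terms if not is_approved(t)]
```

With the renamed terms, the whole approved-list lookup collapses into a single prefix test, which is the convenience being asked for.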

This is quite analogous to what can already be done in programming languages.

Obviously a mechanism would have to be developed to support these name associations or renaming, and it would be nice if it could be done both on an individual basis, such as picking one specific URI from an ontology, and on a group basis, such as combining all of the URIs from both FOAF and PROV. (Conflicts would obviously have to be resolved also, if two sources use the same local names.)
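The group-basis case and its conflict problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helpers `local_name` and `bundle` are hypothetical, the local name is taken to be whatever follows the last `#` or `/`, and conflicts are merely reported, not resolved.

```python
from collections import defaultdict

SKOS = "http://www.w3.org/2004/02/skos/core#"
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"


def local_name(iri: str) -> str:
    """Take the part after the last '#' or '/' as the local name."""
    return iri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]


def bundle(iris):
    """Group source IRIs by local name.

    Returns (bindings, conflicts): bindings maps each unambiguous local
    name to its single source IRI; conflicts maps each clashing local
    name to every source IRI that claims it.
    """
    by_name = defaultdict(list)
    for iri in iris:
        by_name[local_name(iri)].append(iri)
    bindings = {name: found[0] for name, found in by_name.items() if len(found) == 1}
    conflicts = {name: found for name, found in by_name.items() if len(found) > 1}
    return bindings, conflicts


bindings, conflicts = bundle(
    [SKOS + "prefLabel", DC + "title", DCTERMS + "title", FOAF + "name"]
)
```

Here `title` is claimed by two sources, so it lands in `conflicts` and a human (or a policy) has to pick one before it can be bound to a local name.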

I think there are two basic approaches that could address this need. One would be to define a property that is used for renaming URIs:

dc:title rdf:prefUri :title .

This would act somewhat like a one-way owl:sameAs : when the processor sees :title it treats it as dc:title.
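A minimal sketch of how a processor might apply such a statement, assuming triples are plain tuples of strings. Note that `rdf:prefUri` is only the property proposed in this comment and is not part of RDF; the `MYAPP` namespace and the function `resolve_aliases` are likewise hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
MYAPP = "http://example.org/myapp#"  # hypothetical application namespace
PREF_URI = RDF + "prefUri"           # proposed in this thread, not a real RDF term


def resolve_aliases(triples):
    """Rewrite local aliases back to the IRIs they stand for.

    A statement (original, PREF_URI, alias) means: whenever `alias`
    appears, treat it as `original`. The alias statements themselves
    are dropped from the result.
    """
    alias_to_original = {o: s for (s, p, o) in triples if p == PREF_URI}

    def resolve(term):
        return alias_to_original.get(term, term)

    return [
        (resolve(s), resolve(p), resolve(o))
        for (s, p, o) in triples
        if p != PREF_URI
    ]


triples = [
    (DC + "title", PREF_URI, MYAPP + "title"),
    ("http://example.org/doc1", MYAPP + "title", "Moby Dick"),
]
resolved = resolve_aliases(triples)
```

Because literals are plain strings here too, a real implementation would need to distinguish them from IRIs before rewriting; the sketch ignores that, and it does not follow chains of aliases.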

The other basic approach would be to define a higher-level binding syntax, roughly like what programming languages use for importing libraries. For example, in JavaScript you can pull in an externally defined object (that has sub-names) and associate it with your own local name:

var React = require('react');

I don't know what mechanism would be easiest. It would be nice to explore some ideas.

@william-vw

Maybe I'm just misunderstanding..

But firstly, I don't know why one would have to explicitly list all URIs of re-used concepts up front, i.e., create an "approved list". Do you mean restricting authors regarding which predicates, types, .. should be used in data, e.g., which will be added to your repository (i.e., for consistency purposes)?

This renaming of URIs with your own prefix merely seems to obfuscate their provenance, and, as you say, will require support for resolving (possibly, a chain of) re-named URIs. If you are referring to the effort of having to refer to multiple namespaces, this seems quite minimal; and, in fact, in line with the overall philosophy of utilizing a distinct namespace to group domain- or application-specific URIs. Wouldn't throwing concepts from different namespaces, i.e., from different domains, into a single, personal namespace break this design goal?

@dbooth-boston
Collaborator Author

If you have control over your RDF authors, then yes you could restrict your authors to the approved list of predicates and classes (for example).

But even if you don't control your RDF authors, such as if you are integrating data from external sources, then often there will be a set of predicates and classes (for example) that you already know how to handle -- the approved list -- and when new ones show up from new data sources, you might expand the approved list. So it is useful both to be able to distinguish easily between terms that are on the approved list and terms that are not, and to be able to work with the approved terms using a single namespace.

But the desire for easy URI renaming or synonyms goes beyond that also. It would also help when two or more URIs are discovered for the same entity, and this happens a lot. For simpler processing it would be easiest if you could simply declare that the URIs are synonyms, and indicate which URI is the preferred synonym, and then use only that URI within your application, instead of having to deal with all of them.

the overall philosophy of utilizing a distinct namespace to group domain- or application-specific URIs

That is exactly one of the goals that this issue is intended to address. Right now if I want to reuse someone else's URI, I cannot use a distinct namespace to group it into my application-specific namespace. In other words, a URI cannot belong to more than one group, because it only has one namespace. It is only and forever in its original application-specific namespace group. This does not make sense when the goal is to reuse common URIs. Perhaps this means that we need to somehow separate the grouping mechanism (which currently is done with namespaces) from the unique identification mechanism (which is done with URIs).
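One way to picture that separation is a membership table kept apart from the identifiers themselves. The sketch below is purely illustrative: the group names `myapp` and `labels` and the helper `groups_of` are hypothetical, and nothing like this exists in RDF today.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/elements/1.1/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Grouping is stored separately from identification: each group is just
# a named set of full IRIs, whatever namespace those IRIs come from.
GROUPS = {
    "myapp": {FOAF + "name", DC + "title"},
    "labels": {FOAF + "name", RDFS + "label"},
}


def groups_of(iri, groups=GROUPS):
    """Return, in sorted order, the names of all groups containing the IRI."""
    return sorted(name for name, members in groups.items() if iri in members)


memberships = groups_of(FOAF + "name")
```

Here `foaf:name` keeps its original IRI yet belongs to two groups at once, which a namespace alone cannot express.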

@HughGlaser
Collaborator

HughGlaser commented Dec 23, 2018

Ah, @dbooth-boston , I think I am beginning to get what you are after :-)
(The idea of an "approved list" seemed a bit strange at first.)
I'll try an example.
I am building a very simple site that gathers bibliographic information from a variety of sources, and re-presents it in a simple structured form, with the names of the authors and the paper titles.
I want to write my display code (that is URI resolution or SPARQL query to HTML), and so I need to know all the properties for the names and titles.
I am very likely to be getting rdfs:label, dc:creator, foaf:name for people, as well as a bunch of similar ones for the titles.
I really only want to choose one of those to "see" in the html generation code - otherwise I am making my consumption code dependent on knowing what the different properties in the source data are, which is pants.
And now, what happens when my system is acquiring a new or changed source?
It should be possible to easily tell whether anything special needs to be done.
And if I subsequently find that a source is using skos:prefLabel or skos:label?
I would like to be able to adjust the system easily (and even automatically) to cope with this, and certainly without changing the html generation.

Yes, we have struggled with this over the years.
For a long time (in rkbexplorer etc.) we used
https://www.w3.org/2005/04/fresnel-info/
which allowed the html generation to be dynamically controlled by an RDF config, which could come out of the acquisition process. Very nice, but any support seemed to die.
We also developed a whole sameAs-aware infrastructure (with servers like sameAs.org), which we still use.
Every similar (sic!) URI gets grouped in the sameAs store, which will suggest a canon for any of them if you ask, and then the infrastructure always asks for the canon and uses that.
I have a sense that is sort of what you are suggesting, with a different mechanism.
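The "group the similar URIs and ask for the canon" idea can be sketched with a small union-find structure in Python. This is an assumption-laden illustration, not a description of how sameAs.org or the rkbexplorer infrastructure is implemented; the class `SameAsStore` and its methods are hypothetical names.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
SKOS = "http://www.w3.org/2004/02/skos/core#"


class SameAsStore:
    """Groups equivalent IRIs and answers with one canon per group."""

    def __init__(self):
        self._parent = {}

    def _find(self, iri):
        # Unknown IRIs start out as their own canon.
        self._parent.setdefault(iri, iri)
        root = iri
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited IRI straight at the root.
        while self._parent[iri] != root:
            self._parent[iri], iri = root, self._parent[iri]
        return root

    def add(self, preferred, synonym):
        """Merge the two groups, keeping the canon of `preferred`."""
        keep, drop = self._find(preferred), self._find(synonym)
        if keep != drop:
            self._parent[drop] = keep

    def canon(self, iri):
        """Return the canonical IRI for any member of a group."""
        return self._find(iri)


store = SameAsStore()
store.add(FOAF + "name", RDFS + "label")
store.add(RDFS + "label", SKOS + "prefLabel")
```

The display code would then only ever ask for `store.canon(...)` and work with `foaf:name`, so a source that starts using `skos:prefLabel` needs one extra `add` call and no change to the HTML generation.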

@dbooth-boston
Collaborator Author

Yes, exactly. Those are some of the use cases and workarounds that would be addressed if we had better standard mechanisms to address this issue.

@chiarcos

chiarcos commented Dec 23, 2018 via email

@draggett
Member

@HughGlaser wrote:

I am very likely to be getting rdfs:label, dc:creator, foaf:name for people, as well as a bunch of similar ones for the titles. I really only want to choose one of those to "see" in the html generation code - otherwise I am making my consumption code dependent on knowing what the different properties in the source data are, which is pants.

This sounds like a vocabulary mapping problem where you need to map external vocabularies into the vocabulary your application logic is defined in. The mapping could be defined through rules, and may be context dependent, and may define preferences when the input provides multiple choices. The data could be pushed through the rules in an eager processing model, or pulled in a lazy processing model. The inability to find a mapping for a new data source could send a signal that developer attention is needed. To put that differently, some processes are fully automated whilst others bring humans into the loop for collaborative problem solving.
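A rough Python sketch of that mapping model, under stated assumptions: the rule table `RULES`, the `MYAPP` namespace, and the functions `map_lazy` and `map_eager` are all hypothetical, rules are a flat predicate-to-predicate lookup, and context dependence and preferences among multiple choices are left out.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DC = "http://purl.org/dc/elements/1.1/"
SKOS = "http://www.w3.org/2004/02/skos/core#"
MYAPP = "http://example.org/myapp#"  # hypothetical application namespace

# Hypothetical rules: external predicate -> application predicate.
RULES = {
    FOAF + "name": MYAPP + "name",
    RDFS + "label": MYAPP + "name",
    DC + "title": MYAPP + "title",
}


def map_lazy(triples, rules, unmapped):
    """Lazy model: map one triple at a time as the consumer pulls it.

    Predicates with no rule pass through unchanged and are recorded in
    `unmapped` so a developer can be alerted.
    """
    for s, p, o in triples:
        if p in rules:
            yield (s, rules[p], o)
        else:
            unmapped.add(p)
            yield (s, p, o)


def map_eager(triples, rules):
    """Eager model: push everything through the rules up front."""
    unmapped = set()
    mapped = list(map_lazy(triples, rules, unmapped))
    return mapped, unmapped


triples = [
    ("http://example.org/p1", FOAF + "name", "Alice"),
    ("http://example.org/d1", SKOS + "prefLabel", "Notes"),
]
mapped, needs_attention = map_eager(triples, RULES)
```

A non-empty `needs_attention` set is the "developer attention is needed" signal: the new source uses a predicate that no rule covers yet.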

@HughGlaser
Collaborator

@draggett wrote

@HughGlaser wrote:

I am very likely to be getting rdfs:label, dc:creator, foaf:name for people, as well as a bunch of similar ones for the titles. I really only want to choose one of those to "see" in the html generation code - otherwise I am making my consumption code dependent on knowing what the different properties in the source data are, which is pants.

This sounds like a vocabulary mapping problem where you need to map external vocabularies into the vocabulary your application logic is defined in. The mapping could be defined through rules, and may be context dependent, and may define preferences when the input provides multiple choices. The data could be pushed through the rules in an eager processing model, or pulled in a lazy processing model. The inability to find a mapping for a new data source could send a signal that developer attention is needed. To put that differently, some processes are fully automated whilst others bring humans into the loop for collaborative problem solving.

Yes, this is pretty much what Fresnel provides.
Or rather, provided. And it was quite complex to use, so not suitable for newbies.

Having agreed there is a problem here, what do you think is the best way to make it easier for newbies (and others) to surmount or avoid it?
That is, what should EasierRDF recommend?
Fresnel?, sameAs aware technology?, @dbooth-boston 's proposal?, something else?

@draggett
Member

This is one of the topics I proposed for the W3C Graph Data workshop in early March. What would really help is to gather some concrete use cases that we can forge ideas against. It relates to interest in higher level frameworks and easier rule languages.

@william-vw

Ok, I think I'm discerning a few different, albeit related, topics here (feel free to correct):

(1) Useful to be able to work with the approved terms using a single namespace.
(2) Easily distinguish between terms that are on the "approved list" and terms that are not.
(3) Simply declare that the URIs are synonyms, indicate which URI is the preferred synonym, and then use only that URI within your application, instead of having to deal with all of them
(only want to choose one of [ rdfs:label, dc:title, foaf:name, .. ] to "see" in the html generation code - otherwise I am making my consumption code dependent on knowing what the different properties in the source data are).
(4) What if I subsequently find that a source is using skos:prefLabel or skos:label? I would like to be able to adjust the system easily (and even automatically) to cope with this, and certainly without changing the html generation.

To me it seems that these are separate issues (although one solution could partially address multiple of them). The first issue seems related to ease of use, but, as mentioned, could obfuscate the provenance (and thus, meaning) of these terms, simply to avoid utilizing a few more prefixes..

Issues 2-4 seem very much related and pertain to data discovery. As noted by others, they could be (partially) resolved by introducing a built-in "synonym" predicate, possibly supported by a sameAs service. By checking whether a new term is a synonym of an approved term or not, one can meet issue 2.

@dbooth-boston
Collaborator Author

@darth-willy, yes that is a pretty good summary.

You could think of these topics as being independent, but I think it's helpful to step back and take a broader view of them. What they have in common is the tension between the rigidity of a single global naming space versus the need to create RDF applications locally and independently. Again I'll make an analogy of RDF being like assembly language. At the assembly language level, there is only one global variable space. But as higher-level languages were developed, computer scientists realized that it is very beneficial to support local naming spaces, and provide mechanisms for mapping them into the underlying global naming space. By analogy, this has not yet been developed for RDF. I do not yet know what will be the best mechanisms to address these issues -- we still need more creative ideas -- but I think at least one element should include the ability to easily indicate a preferred synonym.

sameAs services can be quite useful, but one cautionary note: in the end they can only suggest synonymous URIs. They cannot be authoritative for all applications. This is because different applications need to make different judgement calls about which URIs are synonymous enough for that application's purpose. Attempting to universally decide that two URIs are synonymous leads to deep and unsolvable philosophical questions about identity.

@william-vw

william-vw commented Dec 27, 2018

@dbooth-boston I'm still a bit confused by your analogy to namespaces in programming languages. In particular, (a) how this relates to the prior notion of "re-packaging" terms into a local, application-specific namespace, and (b) that a namespace mechanism does not exist for RDF. It seems like the former would be similar to re-packaging a class like java.lang.String as darth.willy.String (right?), which still doesn't make much sense to me. It's wholly unclear to me what you mean by the latter - because hierarchical name spaces don't exist in RDF? Afaik one can perfectly create constructs in your own, application-specific namespace, while re-using constructs from other namespaces, both in high-level programming languages and RDF.

You make a good point when saying that synonyms can be application-specific. E.g., for some applications, foaf:name and dc:title can be similar enough, whereas others may want to differentiate between them.

@HughGlaser
Collaborator

HughGlaser commented Dec 27, 2018

@dbooth-boston

sameAs services can be quite useful, but one cautionary note: in the end they can only suggest synonymous URIs.

I disagree very strongly.
That is like saying that
:foo rdf:type :bar
is only suggesting a type for :foo.
If that triple is in your purview, triplestore or whatever, :foo does have type :bar, and any inference or whatever that is performed can and should respect that.
In exactly the same way
:foo owl:sameAs :bar
tells the system some very specific information that can and should be respected.

They cannot be authoritative for all applications. This is because different applications need to make different judgement calls about which URIs are synonymous enough for that application's purpose.

True.
And for that reason I use many sameAs stores, each of which captures policies that are appropriate for different applications, and the system needs to know which sameAs store(s) is applicable for which application/context.
As always, you don't bring in RDF from sources that don't have the sort of knowledge you want, or that you think is wrong.

Attempting to universally decide that two URIs are synonymous leads to deep and unsolvable philosophical questions about identity.

Yes - and that's why it is a worthless task.
And it is a worthless task discussing it too (hence my little rant :-) )
We can safely leave it to Cratylus, Heraclitus, Socrates and Plato, who failed to sort it out more than two thousand years ago, along with Theseus' ship and the Family Broom.
In the end, we are not in the world of philosophy, even if thinking that way does help us sometimes.
We are in a world of engineering, where systems execute code based on the data they have.
And owl:sameAs, owl:differentFrom and any other predicates can usefully be used and interpreted in appropriate ways.
Cheers! :-)

@dbooth-boston
Collaborator Author

@dbooth-boston wrote:

sameAs services can be quite useful, but one cautionary note: in the end they can only suggest synonymous URIs.

@HughGlaser wrote:

I disagree very strongly.

Sorry, I should have been clearer. Certainly a sameAs service that a developer has chosen to use for a particular application can be authoritative for that application. What I meant was that general purpose sameAs services -- such as sameas.org -- cannot be authoritative for all applications: they can only suggest synonyms, because developers still need to make their own final choices of what they consider synonyms.

@draggett
Member

@darth-willy wrote:

"re-packaging" terms into a local, application-specific namespace

Indeed, and this is supported by JSON-LD contexts. It is also likely to be important when dealing with graph databases that don't support namespaces explicitly, and when we want to use RDF as an interchange framework between different graph databases.

@dbooth-boston
Collaborator Author

[Catching up on an earlier comment that I missed.]

@chiarcos, yes, that is a work-around that could be used. But I think the need is general enough that we really should have better support for it in the RDF ecosystem. I don't think rdfs:subPropertyOf or skos:broader are ideal for this, because they are already used for completely different purposes that would be detrimental to conflate. I also agree that a pre-processor might turn out to be the best approach. But I think we still need more ideas on the table.

@dbooth-boston added the "standards" label Mar 11, 2019
@azaroth42

The discussion has been mostly in terms of predicates, but should this apply also to instance URIs?

@dbooth-boston
Collaborator Author

The discussion has been mostly in terms of predicates, but should this apply also to instance URIs?

@azaroth42 , yes, I think the need exists for instance URI synonyms also.

@iherman
Member

iherman commented Apr 17, 2019

I would like to differentiate (here and in some other issues) between changes (or not) of the fundamental RDF concepts and what serializations can offer. My reading of the thread is that this is an issue for the latter and not the former.

JSON-LD has, by and large, solved this issue through the @context mechanism. I guess it paved the way for other serializations to do something similar. The question that may arise is whether the concept of @context should be re-adapted to Turtle or RDFa, too (I would not bother about RDF/XML). We may either envisage some sort of a generic @context that could be understood by the future versions of all three, or decide to reproduce @context for, say, Turtle specifically. Not sure which is a better approach.

Words of warnings, though:

  • JSON-LD contexts may be fairly complex. Most (if not all) of the work done by the current JSON-LD WG is to look at features of @context. Some of those issues are JSON specific, but some may be generic issues. Beware of flood gates:-)
  • A natural extension of the concept is to store context files at URI-s that systems would have to read when interpreting a specific data set. We shied away from that in the RDFa days (it was discussed) because the repeated HTTP requests, caching, etc, seemed to become an issue. The JSON-LD WG decided to bite the bullet, but it is still a matter of discussion.

In spite of the potential pitfalls, I think the fundamental approach of JSON-LD, yielding @context, is important: make the life of the average data author easier, pushing and hiding the complexity in the @context, done by experts and done only once. Something like that might be of a general value...
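The core of the idea, short local terms bound once to full IRIs, can be sketched in plain Python. This is a deliberately tiny stand-in and not the JSON-LD expansion or compaction algorithm, which also handles `@id`, `@type`, nested and remote contexts, and much more; the names `CONTEXT`, `expand_keys` and `compact_keys` are hypothetical.

```python
# A context written once "by experts": local term -> full IRI.
CONTEXT = {
    "name": "http://xmlns.com/foaf/0.1/name",
    "title": "http://purl.org/dc/elements/1.1/title",
}


def expand_keys(node, context):
    """Replace known short keys with full IRIs; leave other keys alone."""
    return {context.get(key, key): value for key, value in node.items()}


def compact_keys(node, context):
    """Inverse direction: replace known full IRIs with their short keys."""
    iri_to_term = {iri: term for term, iri in context.items()}
    return {iri_to_term.get(key, key): value for key, value in node.items()}


# What the data author writes, with no IRIs in sight.
authored = {"name": "Alice", "title": "Notes on RDF"}
expanded = expand_keys(authored, CONTEXT)
```

The author only ever touches `name` and `title`; the mapping to `foaf:name` and `dc:title` lives in one place and can be changed without editing the data.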

@dbooth-boston
Collaborator Author

Interesting idea! I wonder how an @context-like name mapping feature would look in Turtle.

7 participants