Fix/schema identifier inconsistencies: allow identifier strings in cross-reference properties#54

Open
dgbroeder wants to merge 4 commits into skg-if:main from
dgbroeder:fix/schema-identifier-anyof

@dgbroeder
Collaborator

Several cross-reference properties (references to other entities) are documented as accepting plain identifier strings (e.g.
"org_1", "ven1") but the schema defined them as $ref object types only, causing validators to reject data the docs
explicitly show as valid.

Changes from bare $ref / oneOf to anyOf: [string | $ref] for:

  • Product: topics[].term, relevant_organisations[], funding[], contributions[].declared_affiliations[],
    contributions[].by, manifestations[].biblio.in, manifestations[].biblio.hosting_data_source
  • Person/Agent: affiliations[].affiliation
  • Grant: beneficiaries[], contributions[].declared_affiliations[], contributions[].by, funding_agency

API responses may still return expanded objects.
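
A minimal client-side sketch of what the `anyOf: [string | $ref]` change implies: a cross-reference value may arrive either as a plain identifier string or as an expanded object. The helper name `reference_id` and the sample `name` field are hypothetical and only illustrate the idea; they are not part of the schema or of this PR.

```python
from typing import Any, Optional


def reference_id(value: Any) -> Optional[str]:
    """Return the identifier carried by a cross-reference value.

    Handles both shapes: a plain identifier string, or an expanded object
    that carries its own local_identifier. Anything else yields None.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        ident = value.get("local_identifier")
        return ident if isinstance(ident, str) else None
    return None


# Hypothetical Product fragment mixing both shapes in one array.
product = {
    "relevant_organisations": [
        "org_1",
        {"local_identifier": "org_2", "name": "Example Org"},
    ]
}
org_ids = [reference_id(v) for v in product["relevant_organisations"]]
```

With a helper like this, a client needs only one code path regardless of whether the server returned identifiers or expanded objects.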

- cf.search.keyword
- cf.search.org_name
- relevant_organisations.name, relevant_organisations.identifiers.scheme, relevant_organisations.identifiers.value
- srv_has_hosting_organisation.name, srv_has_hosting_organisation.scheme, srv_has_hosting_organisation.identifier
- cf.search.org_name
- pageQueryParam, pageSizeQueryParam
Added syntax examples for all of these.
@rduyme
Collaborator

rduyme commented Feb 24, 2026

Hi Daan, @dgbroeder

PR Topic 1 meta section page

The current meta fields are:
[image]

The idea was to reuse the semantic logic of the ActivityStreams vocabulary, with page navigation managed by the result identifiers (previous and next). The "page" is a string and can have any value.

In the first commit, your request is to add the following fields to the meta section:

        page:
          description: current page number => this one is represented by the local_identifier of the meta 
          type: integer
        page_size:
          description: maximum number of items per page ( you have them in your client query and probably don't need it to iterate)
          type: integer
        items_count:
          description: number of items returned on this page ( you can count the graph array)
          type: integer

Note: if we want a total count for the whole search (not the current page count), we could add a partOf section.
See https://iiif.io/api/search/2.0/#paging-results for the usage of the ActivityStreams vocabulary. IIIF is the inspiration used for the OpenAPI paging.
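
To make the trade-off concrete, here is a rough sketch of how the three proposed fields could be derived on the server. The function name `paging_fields` is hypothetical; the field names come from the proposal above. As noted in the descriptions, a client can already infer `page_size` from its own query and `items_count` from the length of the `@graph` array.

```python
from typing import Any, Dict, List


def paging_fields(page: int, page_size: int, graph: List[Dict[str, Any]]) -> Dict[str, int]:
    """Build the proposed integer paging fields for one page of results."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    if len(graph) > page_size:
        raise ValueError("a page cannot hold more items than page_size")
    return {
        "page": page,
        "page_size": page_size,
        # Same number a client gets by counting the @graph array itself.
        "items_count": len(graph),
    }


# Hypothetical page with two entities in its @graph.
meta_paging = paging_fields(page=3, page_size=10, graph=[{}, {}])
```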

@rduyme
Collaborator

rduyme commented Feb 25, 2026

PR Topic 2 identifiers

Current status

This is the current format of local_identifiers we agreed for the SKG-IF OpenAPI.

See: https://elements-demo.stoplight.io/?spec=https://w3id.org/skg-if/api/skg-if-openapi.yaml

[image]
All examples have this unique generic format in the OpenAPI documentation.

example:

  "@context": [
    "https://w3id.org/skg-if/context/1.1.0/skg-if.json",
    "https://w3id.org/skg-if/context/1.0.0/skg-if-api.json", // for meta section
    {
      "@base": "https://w3id.org/skg-if/sandbox/my-skg-acronym/" 
      //@base fallback to avoid "orphan" impl. errors but we could remove it in the OpenAPI recommendations 
    }
  ],
  "meta": {
    ...
  },
  "@graph": [
    {
      "local_identifier": "http://example.com/skg-if/api/products/prd-c66c6-38be-4d5f-85db-d44c9f869333", 
      //local_identifier always full URL in OpenAPI with "products" mandatory prefix 
      "entity_type": "product",
      "product_type": "literature"
    },

In the SKG-IF OpenAPI, the local_identifier is there to loop back to the entities through OpenAPI-compatible URLs; that was a deep discussion we had with Menzo @menzowindhouwer.


Discussion

  • Why not allow arbitrary local_identifiers? For local entities that can be discovered and ingested using the API, the local_identifier format must point to the API, e.g. "http://example.com/skg-if/api/persons/c66c6-38be-4d5f-85db-d44c9". Yes, allowing other formats gives freedom on the server-side implementation, but it makes the client side unpredictable. I removed the option on purpose, so that client-side logic does not need two different JSON parsing processes, for example to handle either a grant URL string or a grant object/structure.

At the moment, https://github.com/skg-if/examples/blob/main/OpenCitations/oc_1.jsonld is compatible with the model, but it is not an SKG-IF OpenAPI output. It is just a static file on GitHub, not a response from any SKG-IF specific endpoint (an OpenAPI endpoint defines a URL endpoint + query params + output format). We have endpoints as defined above for each entity type (products, grants, etc.).

"local_identifier": "https://w3id.org/oc/meta/br/062501777134" is not compatible with the OpenAPI: it resolves to an HTML page, and you don't have a generic process to get its data via :

I am not sure that allowing or suggesting external URIs makes sense. A "non-RDF" client won't be able to ingest or interpret them. As explained before, it is allowed in pure RDF, but it gets very confusing in the REST OpenAPI for client applications. In other words, the OpenAPI relies only on local SKG entities that can be exposed by the OpenAPI itself. It removes the RDF Open World Assumption. In RDF you are free to put anything in an id without any guarantee that it resolves, which is really uncommon for most REST API developers.

1. Full URL — any URL, used as-is as the entity identifier (e.g. a server API URL,
a DOI, a ROR URL, or an SKG-IF sandbox URL)
2. Plain string — resolved by prepending the `@base` from the JSON-LD preamble;
the framework prescribes `https://w3id.org/skg-if/sandbox/<provider-acronym>/` as the
`@base` for entities that have no independently dereferenceable identifier of their own,
producing a *sandbox URL* — but any `@base` value is valid
3. On-the-fly — plain string using the template `otf___<session-id>___<identifier-string>`,
also resolved via `@base`, for identifiers created on-the-fly during document generation
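
The three formats can be sketched as one small resolver. This is an illustration under assumptions, not the framework's reference logic: the helper names (`find_base`, `to_full_url`, `parse_otf`) are hypothetical, relative resolution is delegated to `urllib.parse.urljoin`, and the regular expression assumes the session id itself contains no triple underscore.

```python
import re
from typing import Any, List, Optional, Tuple
from urllib.parse import urljoin

# Template from the list above: otf___<session-id>___<identifier-string>
OTF_PATTERN = re.compile(r"^otf___(?P<session>.+?)___(?P<ident>.+)$")


def find_base(context: List[Any]) -> Optional[str]:
    """Return the @base declared in a JSON-LD @context array, if any."""
    for entry in context:
        if isinstance(entry, dict) and "@base" in entry:
            return entry["@base"]
    return None


def to_full_url(local_identifier: str, base: Optional[str]) -> str:
    """Formats 1-3: full URLs pass through, plain strings resolve via @base."""
    if local_identifier.startswith(("http://", "https://")):
        return local_identifier
    if base is None:
        raise ValueError("a plain-string identifier needs an @base to resolve")
    return urljoin(base, local_identifier)


def parse_otf(local_identifier: str) -> Optional[Tuple[str, str]]:
    """Split an on-the-fly identifier into (session id, identifier string)."""
    match = OTF_PATTERN.match(local_identifier)
    return (match.group("session"), match.group("ident")) if match else None


context = [
    "https://w3id.org/skg-if/context/1.1.0/skg-if.json",
    {"@base": "https://w3id.org/skg-if/sandbox/my-skg-acronym/"},
]
base = find_base(context)
```

Note that this is exactly the context interpretation that the discussion argues non-RDF clients should not be required to do; the sketch only shows how small that step is when a client chooses to do it.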
@dgbroeder
Collaborator Author

As we discussed today, these proposals make the API align with what is currently in the specs and examples. The current restriction of the local_identifier format by the API is, I think, not needed. We should, however, try to avoid very long identifiers, i.e. concatenating the server URL with a sandbox-type identifier URL. Let's think of a few workflow examples where this would occur.

@rduyme
Collaborator

rduyme commented Feb 25, 2026

OK, as discussed this morning with Daan,
I checked how we can deal with identifiers like
"local_identifier": "https://w3id.org/oc/meta/br/062501777134"

What would be the options to resolve them?

  • Option 1 : https://example.com/skg-if/api/products/https://w3id.org/oc/meta/br/062501777134 ?

    • This would be incompatible/weird/confusing with the other local_identifiers we currently have: https://example.com/skg-if/api/products/123456 would resolve at https://example.com/skg-if/api/products/https://example.com/skg-if/api/products/123456
  • Option 2 : rely on content negotiation, recommending that https://w3id.org/oc/meta/br/062501777134 supports content negotiation for the content type "application/vnd.skgif.ld+json"

  • Option 3 : resurrect the "resolve" end point as a new additional endpoint in SKG-IF OpenAPI specs

    • https://example.com/skg-if/api/resolve?id=https://w3id.org/oc/meta/br/062501777134 .
    • This is the initial resolve endpoint that existed prior to the OpenAPI specs.
    • summary of endpoints we would now have in SKG-IF OpenAPI : https://elements-demo.stoplight.io/?spec=https://w3id.org/skg-if/api/skg-if-openapi.yaml
      • {protocol}://{server}:{port}/{skg_if_api_path}/products
      • {protocol}://{server}:{port}/{skg_if_api_path}/products/{short_local_identifier}
      • new : {protocol}://{server}:{port}/{skg_if_api_path}/resolve?id={local_identifier} ( local_identifier always passed as URL ). We could allow any output on this single entity resolve endpoint
    • For existing implementations, it would also be able to resolve the API ids we currently have, e.g. http://example.com/skg-if/api/products/c66c6-38be-4d5f-85db-d44c9 via the /resolve endpoint https://example.com/skg-if/api/resolve?id=http://example.com/skg-if/api/products/c66c6-38be-4d5f-85db-d44c9, by simply redirecting the /resolve endpoint to existing code.
    • Note that the resolve URL does not contain the entity type. It is universal for any entity id: https://example.com/skg-if/api/resolve?id=
    • Open question: which output would we expect on https://example.com/skg-if/api/resolve?id=https://ror.org/05gq02987 ?
  • I am not in favor of allowing plain strings (not URLs), even if it is possible to resolve them with the @base.

    • ex: local_identifier: otf___1730027051396___prod-1 , local_identifier: foobar-acme-person-1
    • Too confusing for non-RDF implementers. We don't want them to interpret the context section and rebuild URLs.
    • The API spec is for now: local_identifier: http://example.com/skg-if/api/products/otf___1730027051396___prod-1

Option 3 could work.
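
For comparison, options 2 and 3 might look roughly like this from a client's point of view. The function names are hypothetical, and percent-encoding the `id` parameter is an assumption of this sketch (the URLs in the list above are written unencoded). No request is actually sent here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Media type quoted in option 2.
SKGIF_MEDIA_TYPE = "application/vnd.skgif.ld+json"


def negotiated_request(local_identifier: str) -> Request:
    """Option 2: ask the identifier's own URL for SKG-IF JSON-LD via Accept."""
    return Request(local_identifier, headers={"Accept": SKGIF_MEDIA_TYPE})


def resolve_url(api_root: str, local_identifier: str) -> str:
    """Option 3: build a call to a generic /resolve endpoint.

    The entity type is not part of the URL; the full local_identifier is
    passed as the (percent-encoded) id query parameter.
    """
    query = urlencode({"id": local_identifier})
    return f"{api_root.rstrip('/')}/resolve?{query}"


oc_id = "https://w3id.org/oc/meta/br/062501777134"
request = negotiated_request(oc_id)
url = resolve_url("https://example.com/skg-if/api", oc_id)
```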

@essepuntato
Contributor

Hi @rduyme,

My two cents here.

My point, in short, is that SKG-IF and its API must be aligned. Thus, I believe that the real answer is option 1. The SKG-IF says explicitly that:

  1. the local identifier is a URL;
  2. the content of that URL is in the hands of the source.

Thus, I think that requiring the source to use a URL template that differs from what they may have already used is a major breaking point, because it forces a source to change its entire LOD-oriented logic (already implemented and shared) to make it compatible with SKG-IF. In addition, just to mention, Crossref also uses the doi.org URL to refer to its resources internally, and thus it is not compatible with the constrained local identifier currently proposed in the API. The same goes for ORCID. And this problem will apply to any source that already exposes data as LOD. I do not think forcing anything here adds value; it may put adoption at risk.

In addition, being part of the same specification, there is a strong need for the SKG-IF API to follow the SKG-IF (data model, ontology, etc.); otherwise, there is a huge risk of exposing SKG-IF compliant data in dumps that differ from those returned by the SKG-IF API, and I think this is not ideal, honestly.

Of course, if a source wants, the URL can be constructed as suggested currently, but that should not be mandatory for all adopters.

Have a nice day :-)

S.

@rduyme
Collaborator

rduyme commented Mar 4, 2026

Thanks Silvio @essepuntato

Option 1 review

If we want to be clean on option 1, it would mean changing our approach a bit, forcing users to use w3id.org if they don't have reliable ids as URLs, i.e. no local_identifier like https://example.com/skg-if/api/products/123456 (a breaking change, but why not).

  • CESSDA
    • https://cessda.com/skg-if/api/products/https://w3id.org/skg-if/sandbox/cessda/product-123456
    • https://cessda.com/skg-if/api/grants/https://w3id.org/skg-if/sandbox/cessda/grant-123456
    • In a search response, if you use the @base https://w3id.org/skg-if/sandbox/cessda/ in the context and leave only "grant-123456" as the local_identifier value in the response object, you should also be compatible with https://example.com/skg-if/api/grants/grant-123456 calls.
  • OpenCitations (in case of API implementation)
    • https://opencitations.com/skg-if/api/products/https://w3id.org/oc/meta/br/062501777134
    • https://opencitations.com/skg-if/api/agents/https://w3id.org/oc/meta/ra/0614010840729 ( we need a generic "agents/" end point)

No usage of /resolve endpoint then.

We see this as a URL-as-id compatibility approach :

@dgbroeder
Collaborator Author

Thanks for responding, Renaud and Silvio.

Option 1 (full URL) is included in the pull request, so I am fine with that. But should it be a full URL to the exclusion of other possibilities? Maybe not.
In the pull request, other formats are suggested, e.g. for use in JSON-LD records, that also expand into a full URL.

  1. Full URL — any URL, used as-is as the entity identifier (e.g. a server API URL,
    a DOI, a ROR URL, or an SKG-IF sandbox URL)
  2. Plain string — resolved by prepending the @base from the JSON-LD preamble;
    the framework prescribes https://w3id.org/skg-if/sandbox/<provider-acronym>/ as the
    @base for entities that have no independently dereferenceable identifier of their own,
    producing a sandbox URL — but any @base value is valid
  3. On-the-fly — plain string using the template otf___<session-id>___<identifier-string>,
    also resolved via @base, for identifiers created on-the-fly during document generation

I am not shocked by "https://example.com/skg-if/api/products/https://w3id.org/oc/meta/br/062501777134",
since it's not a local_identifier format but an API call.

Whether external URIs make sense is a matter for the source and its intended clients. Having an agreed common resolving method (options 2 and 3) is an additional constraint, and could be a recommendation instead of a hard requirement.

@rduyme
Collaborator

rduyme commented Mar 4, 2026

OK let's go for option 1

I will include your changes.

I will also :

  • update all the examples with the w3id.org sandbox in the yml and in the sample_data directory
  • remove any reference to ids in the format https://example.com/skg-if/api/products/prod-123456; they are more confusing than helpful with option 3.
  • remove refs to short_local_identifier in the get-by-id endpoints
  • explain in the get-by-id endpoint that https://example.com/skg-if/api/products/ must resolve both prod-123456 and https://example.com/skg-if/api/products/https://w3id.org/skg-if/sandbox/<provider-acronym>/prod-123456.
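
A rough sketch of the "must resolve both" rule for the get-by-id endpoint, assuming the server stores entities keyed by their full-URL local_identifier. The names `canonical_id` and `get_product`, the in-memory `store`, and the use of the cessda sandbox base are illustrative assumptions, not the actual implementation.

```python
from typing import Dict, Optional

# Assumed @base for this provider (cessda is used as an example in the thread).
SANDBOX_BASE = "https://w3id.org/skg-if/sandbox/cessda/"


def canonical_id(path_tail: str, base: str = SANDBOX_BASE) -> str:
    """Map whatever follows /products/ in the request path to a full URL."""
    if path_tail.startswith(("http://", "https://")):
        return path_tail  # already a full-URL local_identifier
    return base + path_tail  # short form: prepend the sandbox base


def get_product(path_tail: str, store: Dict[str, dict]) -> Optional[dict]:
    """Look up a product by either form of its identifier."""
    return store.get(canonical_id(path_tail))


# Hypothetical in-memory store keyed by full-URL local_identifier.
store = {
    "https://w3id.org/skg-if/sandbox/cessda/product-123456": {
        "local_identifier": "https://w3id.org/skg-if/sandbox/cessda/product-123456",
        "entity_type": "product",
    }
}
by_short = get_product("product-123456", store)
by_full = get_product("https://w3id.org/skg-if/sandbox/cessda/product-123456", store)
```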

Other remarks:

  • no need to have /resolve (option 3)
  • we will need to announce this breaking change.
  • we need to add info about the optional option 2. It is implemented by RoHub and remains valid. I will create a side task.
  • we need clarification regarding <provider-acronym>. I will create a side task to update the page https://skg-if.github.io/interoperability-framework/. It is not clear enough how we can avoid name conflicts here.

@markusjt

markusjt commented Mar 6, 2026

Option 1 sounds good to me as it seems to align with what I've done in CESSDA's staging endpoint so far for Products. It's simpler when there's an existing identifier for all items and also a fully resolvable URL that contains the same identifier so all of these work and resolve to the same API result page:

It just takes the actual identifier at the end of the local_identifier if it starts with http(s).
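
That trick can be sketched as follows. This is a guess at the behaviour described, not CESSDA's code, and the function name is hypothetical. One caveat of taking only the last path segment: identifiers that themselves contain a slash (a DOI URL, for instance) would be cut short.

```python
def short_identifier(requested: str) -> str:
    """Return the identifier used for the lookup.

    If the requested value starts with http(s), keep only the last path
    segment; otherwise use the value unchanged.
    """
    if requested.startswith(("http://", "https://")):
        return requested.rstrip("/").rsplit("/", 1)[-1]
    return requested


full = short_identifier("https://w3id.org/skg-if/sandbox/cessda/product-123456")
short = short_identifier("product-123456")
```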

I've been wondering how to do it for entities that don't always have any identifier but I guess some id has to be generated then, e.g. for Person entity it would be based on the person's name even though same name doesn't guarantee it's the same person the same way ORCiD does. That means I'd still rather use ORCiD when possible and only generate id when needed and then make either one resolve in SKG-IF API and use the same trick as with product id for full URLs.

Currently the staging endpoint contains a lot of otf identifiers for other entities referenced in Product and I don't think it's feasible to make them resolve at all (plan is to replace otf identifiers with stable identifiers as the entity endpoints are implemented). Currently the random part of an otf identifier is just unix timestamp at the time of generation with no way to later link it to the same thing it referenced when the otf identifier was generated. Making the "random" part based on something stable that can later be used to get the same result again means that we might as well remove the otf part and use the generated id as is.
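
One way to base the generated part on something stable is to hash a key that the source already controls. The sketch below uses SHA-256 and an `<entity type>_<digest>` layout purely as assumptions; the thread does not say how the identifiers are actually derived, and the example key is made up.

```python
import hashlib


def stable_identifier(entity_type: str, stable_key: str) -> str:
    """Derive a repeatable identifier from a stable source-side key.

    The same key always yields the same identifier, so it can be resolved
    again later, unlike a timestamp-based otf identifier.
    """
    digest = hashlib.sha256(stable_key.encode("utf-8")).hexdigest()
    return f"{entity_type}_{digest}"


# Hypothetical stable key for one record in the source system.
first = stable_identifier("product", "study:example-0001")
second = stable_identifier("product", "study:example-0001")
```

For Person entities a name-based key would still carry the ambiguity mentioned above, so an ORCID, when available, remains the safer choice.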

Edit: After reading the other issue (#39) I think I understand better what the intention with https://w3id.org/skg-if/provider/cessda/ is and that it might be better if the local_identifier in my example here would be https://w3id.org/skg-if/provider/cessda/product_7e3c6fee8b0086785724ab698588433727629380e2ee04b7da1d34d94a0a82e4. That wouldn't currently resolve so it would be a breaking change but it would be a very simple change to make. I liked the current local_identifier implementation in CESSDA's staging because it's easy for human user to get to CDC from the results while it can still also be used as the identifier that resolves to the SKG-IF API page of that study but I see that it would also work the same way if https://w3id.org/skg-if/provider/cessda/product_7e3c6fee8b0086785724ab698588433727629380e2ee04b7da1d34d94a0a82e4 got redirected to CDC.

- added a number of new example files for testing cross-entity referencing vs. embedded (light) entities
- modified app.py (FastAPI) to support full-URL local_identifier handling, and to handle expansion of cross-referenced entities when the expand=true flag is set
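
The expansion behaviour could look roughly like the sketch below. This is not the code from app.py; the function name, the `index` lookup table, and the sample organisation are hypothetical, and the FastAPI wiring of the `expand` query parameter is left out.

```python
from typing import Any, Dict, List


def expand_references(values: List[Any], index: Dict[str, dict], expand: bool) -> List[Any]:
    """Optionally replace identifier strings with the entities they point to.

    With expand=False the values are returned unchanged. With expand=True
    each string found in the index is replaced by its entity; unknown
    identifiers and already-embedded objects are kept as they are.
    """
    if not expand:
        return list(values)
    return [index.get(v, v) if isinstance(v, str) else v for v in values]


# Hypothetical index of known entities keyed by local_identifier.
index = {"org_1": {"local_identifier": "org_1", "name": "Example Org"}}
collapsed = expand_references(["org_1", "ven1"], index, expand=False)
expanded = expand_references(["org_1", "ven1"], index, expand=True)
```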