Replace registry with new instance with retrieved resource #1065

braingram · 2023-03-22T14:19:23Z

Thanks in advance for your help with this.

Is there a way for a validator to replace the registry with a new instance that includes a retrieved resource (to avoid multiple retrievals for the same uri)?

Running the following:

import jsonschema
import referencing


uri = "https://somewhere.org/foo/bar"
obj_schema = {
    "$id": uri,
    "type" : "object",
    "properties" : {
        "price" : {"type" : "number"},
    },
}

cart_schema = {
    "properties": {
        "foo": {
            "$ref": uri,
        },
        "bar": {
            "$ref": uri,
        },
    },
}

cart = {"foo": {"price": 1}, "bar": {"price": 2}}

# make a validator with a registry
def retrieve(target_uri):
    if target_uri == uri:
        resource = referencing.Resource(contents=obj_schema, specification=referencing.jsonschema.DRAFT4)
    print(f"{target_uri} retrieved")
    return resource

registry = referencing.Registry(retrieve=retrieve)
validator = jsonschema.validators.Draft4Validator(cart_schema, registry=registry)

validator.validate(cart)

Results in the follow output:

https://somewhere.org/foo/bar retrieved
https://somewhere.org/foo/bar retrieved

due to retrieve being called twice.

The text was updated successfully, but these errors were encountered:

Julian · 2023-03-22T14:43:58Z

This should happen automatically for at least some cases, but clearly not the one you're showing for some reason. I'll have to check why.

Julian · 2023-04-03T15:57:46Z

Just a minor (non-)update on this. I haven't fully made a decision yet (so feedback definitely welcome), but I'm actually second guessing doing any of this automatically.

The reason is that if it does happen fully automatically (i.e. if Registry objects always cache the things they retrieve) then there's no easy good way to avoid that behavior (and implement some sort of behavior where you periodically do want to re-retrieve the resource).

Instead then one could of course cache any requests external to referencing, i.e. by using any normal caching mechanism you'd like. So retrieve would always be called, but you'd cache your network response or whatever you'd like.

Of course I do agree most users would want the caching behavior normally, so I'm also thinking (even beyond the above, that one could cache using any normal mechanism like functools.lru_cache), that referencing could possibly offer a to_cached_resource decorator (of type Callable[[Callable[[str], Any], Callable[[Any], Resource]], Callable[[Callable[[URI], str]], Callable[str, Resource]]) usable like this:

@referencing.to_cached_resource(loads=json.loads, to_resource=Resource.from_contents)  # the defaults
def retrieve_over_http(uri):
    return httpx.get(uri).text

which basically handles loading (in order to be able to ensure it has something cacheable, as loaded JSON may not be) and conversion to a resource, but otherwise essentially does the caching one would normally want.

Feedback still welcome, but as I say I'm kind of leaning towards tweaking the referencing expectations to match the above so that we still can support non-cached use cases.

braingram · 2023-04-03T16:37:14Z

Thanks for the detailed update.

To check that I'm understanding, the decorator you mentioned might be used to allow retrieve to inform the registry that the returned resource should be 'cached' by including it in the registry (so encountering a $ref to the loaded schema will not result in another retrieve)?

@eslavich and I have been working on updating ASDF to be compatible with 4.18. Adding the lru_cache did appear to improve performance (in some cases, more details here: eslavich/asdf#6) but as you mentioned the caching outside the registry still results in a new registry being created (I think here in get_or_retrieve) and any 'crawling' of a resource is lost (I think the resource always starts as uncrawled). I am not too familiar with the internals of jsonschema and referencing to know when all this work is duplicated (if it happens for every branch down a schema, every validation, or both).

As an alternative, is there a way to define a single Registry that will always be used (and allow for adding retrieved resources)? I expect this is against the idea of an immutable registry but for ASDF cases I believe there's an assumption that the schema are static during run-time (at the least a schema for a specific uri will not change) and there is considerable schema cross referencing. I believe performance could be improved if there was a way to perhaps pass a registry to a validator, use it to validate an object (retrieving and adding resources as needed), then cache the registry (this could be done in ASDF) so that it can be reused on the next object (that might have a different but likely related schema).

Julian · 2023-04-03T17:48:19Z

To check that I'm understanding, the decorator you mentioned might be used to allow retrieve to inform the registry that the returned resource should be 'cached' by including it in the registry (so encountering a $ref to the loaded schema will not result in another retrieve)?

It wouldn't really notify the Registry of anything, it would just speed up the function the same way it sounds like you've done via lru_cache (it'd just be a way of reminding people they likely want to consider doing so if the retrieval function is expensive). But the registry object would be "oblivious" essentially to how it worked.

and any 'crawling' of a resource is lost (I think the resource always starts as uncrawled).

Ah, this is a good point which I'd have to consider. (Yes resources are uncrawled, though you can .crawl() a registry to get a registry where the resources are crawled). Will have to think.

I believe performance could be improved if there was a way to perhaps pass a registry to a validator, use it to validate an object

I'm missing the question probably, because there certainly is a way to do this, it's the registry argument to a Validator that I'm sure you're already making use of. You can create a registry, add all your resources to it, call .crawl() to discover all subresources, and then repeatedly pass that registry to all validators you create. Am I missing the question?

(FWIW this is precisely how the registry in the jsonschema-specifications repository works)

braingram · 2023-04-03T18:43:03Z

I'm missing the question probably, because there certainly is a way to do this, it's the registry argument to a Validator that I'm sure you're already making use of. You can create a registry, add all your resources to it, call .crawl() to discover all subresources, and then repeatedly pass that registry to all validators you create. Am I missing the question?

I think what you're describing works if you know what resources are required before validating (where those resources can be loaded and a complete registry made). For our current attempts at updating ASDF to work with 4.18 we are using retrieve to handle the case where a validation pass encounters a uri for an unknown schema (this API is all new to me, so please let me know if it sounds like we're using it incorrectly). If I modify the example to do what I think is a minimal case for how we're using schema in ASDF (where the top level schema is known prior to validation but the referenced sub-schemas are unknown), the referenced schema do not appear to be retrieved during crawl and are retrieved twice during validate. Is this an expected result with crawl?

import jsonschema
import referencing


obj_uri = "https://somewhere.org/foo/obj"
obj_schema = {
    "id": obj_uri,
    "type" : "object",
    "properties" : {
        "price" : {"type" : "number"},
    },
}

cart_uri = "https://somewhere.org/foo/cart"
cart_schema = {
    "id": cart_uri,
    "properties": {
        "foo": {
            "$ref": obj_uri,
        },
        "bar": {
            "$ref": obj_uri,
        },
    },
}

cart = {"foo": {"price": 1}, "bar": {"price": 2}}

# make a validator with a registry
def retrieve(target_uri):
    if target_uri == obj_uri:
        schema = obj_schema
    elif target_uri == cart_uri:
        schema = cart_schema
    else:
        raise referencing.exceptions.NoSuchResource(f"{target_uri} not found")
    resource = referencing.Resource(contents=schema, specification=referencing.jsonschema.DRAFT4)
    print(f"{target_uri} retrieved")
    return resource

print("Making registry")
registry = (retrieve(cart_uri) @ referencing.Registry(retrieve=retrieve)).crawl()
print("Validating")
validator = jsonschema.validators.Draft4Validator(cart_schema, registry=registry)

validator.validate(cart)

Produces the following output:

Making registry
https://somewhere.org/foo/cart retrieved
Validating
https://somewhere.org/foo/obj retrieved
https://somewhere.org/foo/obj retrieved

Julian · 2023-04-03T18:56:53Z

Got it, yeah ok that's similar to the case from "this ticket", one where you want some unknown URI to be cached. Yeah, as I say I'm leaning towards supporting that by asking you to wrap your retrieve in a cache of some sort (either your own unrelated to referencing or else to_cached_resource if referencing has one it gives you).

Obviously as you say if you know obj_uri ahead of time you can do e.g.:

registry = retrieve(cart_uri) @ referencing.Registry(retrieve=retrieve)).get_or_retrieve(obj_uri).registry.crawl()

(ugly smashed on one line but hope it's clear what's going on there)

but yeah I'm assuming you're saying you have a bunch of random $refs in your schemas and don't know their URIs ahead of time.

FWIW the JSON Schema specifications do say to you as an author that you should expect implementations to ask you to show up with the entire bundle of schemas you want to make available (and some implementations will not even do any additional discovery on your behalf).

But as I say I want to go a bit beyond that and at least help out on this case with suggesting the right caching if possible.

This is a fairly simple caching decorator (which just delegates to lru_cache), but it's useful because: * Most dynamic retrievers will probably want this * It saves a small bit of referencing-specific boilerplate, letting users know to return some JSON and the rest is handled (as long as their schemas contain $schema as usual) It also allows for continuing to support use cases that *don't* want caching (by of course not using this decorator) such as when you *do* want to dynamically re-retrieve a URI because it may have changed contents. Some tweaks may still be necessary here, but it does work for the initial example. Refs: python-jsonschema/jsonschema#1065

Julian · 2023-05-31T22:48:14Z

So I've just added referencing.retrieval.to_cached_resource to the referencing library, with essentially the signature I mentioned above. It will be released in referencing v0.29.0 momentarily.

I'm going to close this as being addressed by that, but if you have feedback on it or want to discuss further definitely let me know!

In particular, for your original example, if you now instead add the relevant decorator:

import json

import jsonschema
import referencing.retrieval


obj_uri = "https://somewhere.org/foo/obj"
obj_schema = {
    "id": obj_uri,
    "type": "object",
    "properties": {
        "price": {"type": "number"},
    },
}

cart_uri = "https://somewhere.org/foo/cart"
cart_schema = {
    "id": cart_uri,
    "properties": {
        "foo": {
            "$ref": obj_uri,
        },
        "bar": {
            "$ref": obj_uri,
        },
    },
}

cart = {"foo": {"price": 1}, "bar": {"price": 2}}


# make a validator with a registry
@referencing.retrieval.to_cached_resource(from_contents=referencing.jsonschema.DRAFT4.create_resource)
def retrieve(target_uri):
    if target_uri == obj_uri:
        schema = obj_schema
    elif target_uri == cart_uri:
        schema = cart_schema
    else:
        raise referencing.exceptions.NoSuchResource(f"{target_uri} not found")

    print(f"{target_uri} retrieved")
    return json.dumps(schema)

print("Making registry")
first = retrieve(cart_uri) @ referencing.Registry(retrieve=retrieve)
print(first.get_or_retrieve(obj_uri).registry.crawl())
registry = first.crawl()
print("Validating")
validator = jsonschema.validators.Draft4Validator(cart_schema, registry=registry)

validator.validate(cart)

You should notice that now repeated lookups do not cause repeated calls to your retrieve function, they re-use the cached schema.

Thanks for raising, and yeah again please do let me know if you have any more feedback!

braingram · 2023-06-12T19:44:48Z

Hi @Julian

Thanks for the reply to this issue and sorry for the delay in getting back to this.

When I try to run your example with referencing==0.29.0 and jsonschema==4.18.0a9 I get the following error:

ValueError: Cannot combine registries with conflicting retrieval functions.

Perhaps this is related to the changes in: 1240e68#diff-d38128373c229147d38e1c4707cdecd7e510c14eef80565374fe9a69627e4676R272

Julian · 2023-06-12T19:57:33Z

Hey! No worries -- possibly, that should be pretty easy to fix (and might be a small bug anyhow) -- will have a quick look! Thanks for trying it out, will update you in a bit.

In other words when someone does *not* do Validator(..., registry=...) This both limits the case which we *do* still perform automatic remote ref retrieval (in a way that still preserves backwards compatibility given that the registry argument did not previously exist, so someone using it must be doing something new) as well as fixes an issue with creating a validator with a registry that has a retrieve function. Previously the latter raised a ValueError (coming from referencing which doesn't want you combining registries unless they agree about retrieval functions. See the added test(s) for details). Fixes the issue mentioned here: #1065 (comment) Refs: #1089

Julian · 2023-06-12T20:30:05Z

OK this should be fixed and released in a minute, thanks for the follow-up, let me know if you notice anything else!

braingram · 2023-06-12T22:30:04Z

Thanks for the fix and quick response :)

The examples run without issue on 4.18.0.a10

Julian added the Bug Something doesn't work the way it should. label Mar 22, 2023

braingram mentioned this issue Mar 22, 2023

Attempt to use only public jsonschema interfaces asdf-format/asdf#1490

Closed

Julian mentioned this issue Mar 24, 2023

Catching RefResolutionError no longer works #1069

Closed

Julian added Enhancement Some new desired functionality and removed Bug Something doesn't work the way it should. labels Apr 3, 2023

Julian closed this as completed May 31, 2023

braingram mentioned this issue Jun 12, 2023

TEST: changes for jsonschema 4.18 alpha asdf-format/asdf#1562

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace registry with new instance with retrieved resource #1065

Replace registry with new instance with retrieved resource #1065

braingram commented Mar 22, 2023

Julian commented Mar 22, 2023

Julian commented Apr 3, 2023

braingram commented Apr 3, 2023

Julian commented Apr 3, 2023

braingram commented Apr 3, 2023

Julian commented Apr 3, 2023

Julian commented May 31, 2023

braingram commented Jun 12, 2023

Julian commented Jun 12, 2023

Julian commented Jun 12, 2023

braingram commented Jun 12, 2023

Replace registry with new instance with retrieved resource #1065

Replace registry with new instance with retrieved resource #1065

Comments

braingram commented Mar 22, 2023

Julian commented Mar 22, 2023

Julian commented Apr 3, 2023

braingram commented Apr 3, 2023

Julian commented Apr 3, 2023

braingram commented Apr 3, 2023

Julian commented Apr 3, 2023

Julian commented May 31, 2023

braingram commented Jun 12, 2023

Julian commented Jun 12, 2023

Julian commented Jun 12, 2023

braingram commented Jun 12, 2023