Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace registry with new instance with retrieved resource #1065

Closed
braingram opened this issue Mar 22, 2023 · 11 comments
Closed

Replace registry with new instance with retrieved resource #1065

braingram opened this issue Mar 22, 2023 · 11 comments
Labels
Enhancement Some new desired functionality

Comments

@braingram
Copy link

Thanks in advance for your help with this.

Is there a way for a validator to replace the registry with a new instance that includes a retrieved resource (to avoid multiple retrievals for the same uri)?

Running the following:

import jsonschema
import referencing


uri = "https://somewhere.org/foo/bar"
obj_schema = {
    "$id": uri,
    "type" : "object",
    "properties" : {
        "price" : {"type" : "number"},
    },
}

cart_schema = {
    "properties": {
        "foo": {
            "$ref": uri,
        },
        "bar": {
            "$ref": uri,
        },
    },
}

cart = {"foo": {"price": 1}, "bar": {"price": 2}}

# make a validator with a registry
def retrieve(target_uri):
    if target_uri == uri:
        resource = referencing.Resource(contents=obj_schema, specification=referencing.jsonschema.DRAFT4)
    print(f"{target_uri} retrieved")
    return resource

registry = referencing.Registry(retrieve=retrieve)
validator = jsonschema.validators.Draft4Validator(cart_schema, registry=registry)

validator.validate(cart)

Results in the follow output:

https://somewhere.org/foo/bar retrieved
https://somewhere.org/foo/bar retrieved

due to retrieve being called twice.

@Julian
Copy link
Member

Julian commented Mar 22, 2023

This should happen automatically for at least some cases, but clearly not the one you're showing for some reason. I'll have to check why.

@Julian
Copy link
Member

Julian commented Apr 3, 2023

Just a minor (non-)update on this. I haven't fully made a decision yet (so feedback definitely welcome), but I'm actually second guessing doing any of this automatically.

The reason is that if it does happen fully automatically (i.e. if Registry objects always cache the things they retrieve) then there's no easy good way to avoid that behavior (and implement some sort of behavior where you periodically do want to re-retrieve the resource).

Instead then one could of course cache any requests external to referencing, i.e. by using any normal caching mechanism you'd like. So retrieve would always be called, but you'd cache your network response or whatever you'd like.

Of course I do agree most users would want the caching behavior normally, so I'm also thinking (even beyond the above, that one could cache using any normal mechanism like functools.lru_cache), that referencing could possibly offer a to_cached_resource decorator (of type Callable[[Callable[[str], Any], Callable[[Any], Resource]], Callable[[Callable[[URI], str]], Callable[str, Resource]]) usable like this:

@referencing.to_cached_resource(loads=json.loads, to_resource=Resource.from_contents)  # the defaults
def retrieve_over_http(uri):
    return httpx.get(uri).text

which basically handles loading (in order to be able to ensure it has something cacheable, as loaded JSON may not be) and conversion to a resource, but otherwise essentially does the caching one would normally want.

Feedback still welcome, but as I say I'm kind of leaning towards tweaking the referencing expectations to match the above so that we still can support non-cached use cases.

@braingram
Copy link
Author

Thanks for the detailed update.

To check that I'm understanding, the decorator you mentioned might be used to allow retrieve to inform the registry that the returned resource should be 'cached' by including it in the registry (so encountering a $ref to the loaded schema will not result in another retrieve)?

@eslavich and I have been working on updating ASDF to be compatible with 4.18. Adding the lru_cache did appear to improve performance (in some cases, more details here: eslavich/asdf#6) but as you mentioned the caching outside the registry still results in a new registry being created (I think here in get_or_retrieve) and any 'crawling' of a resource is lost (I think the resource always starts as uncrawled). I am not too familiar with the internals of jsonschema and referencing to know when all this work is duplicated (if it happens for every branch down a schema, every validation, or both).

As an alternative, is there a way to define a single Registry that will always be used (and allow for adding retrieved resources)? I expect this is against the idea of an immutable registry but for ASDF cases I believe there's an assumption that the schema are static during run-time (at the least a schema for a specific uri will not change) and there is considerable schema cross referencing. I believe performance could be improved if there was a way to perhaps pass a registry to a validator, use it to validate an object (retrieving and adding resources as needed), then cache the registry (this could be done in ASDF) so that it can be reused on the next object (that might have a different but likely related schema).

@Julian
Copy link
Member

Julian commented Apr 3, 2023

To check that I'm understanding, the decorator you mentioned might be used to allow retrieve to inform the registry that the returned resource should be 'cached' by including it in the registry (so encountering a $ref to the loaded schema will not result in another retrieve)?

It wouldn't really notify the Registry of anything, it would just speed up the function the same way it sounds like you've done via lru_cache (it'd just be a way of reminding people they likely want to consider doing so if the retrieval function is expensive). But the registry object would be "oblivious" essentially to how it worked.

and any 'crawling' of a resource is lost (I think the resource always starts as uncrawled).

Ah, this is a good point which I'd have to consider. (Yes resources are uncrawled, though you can .crawl() a registry to get a registry where the resources are crawled). Will have to think.

I believe performance could be improved if there was a way to perhaps pass a registry to a validator, use it to validate an object

I'm missing the question probably, because there certainly is a way to do this, it's the registry argument to a Validator that I'm sure you're already making use of. You can create a registry, add all your resources to it, call .crawl() to discover all subresources, and then repeatedly pass that registry to all validators you create. Am I missing the question?

(FWIW this is precisely how the registry in the jsonschema-specifications repository works)

@Julian Julian added Enhancement Some new desired functionality and removed Bug Something doesn't work the way it should. labels Apr 3, 2023
@braingram
Copy link
Author

I'm missing the question probably, because there certainly is a way to do this, it's the registry argument to a Validator that I'm sure you're already making use of. You can create a registry, add all your resources to it, call .crawl() to discover all subresources, and then repeatedly pass that registry to all validators you create. Am I missing the question?

I think what you're describing works if you know what resources are required before validating (where those resources can be loaded and a complete registry made). For our current attempts at updating ASDF to work with 4.18 we are using retrieve to handle the case where a validation pass encounters a uri for an unknown schema (this API is all new to me, so please let me know if it sounds like we're using it incorrectly). If I modify the example to do what I think is a minimal case for how we're using schema in ASDF (where the top level schema is known prior to validation but the referenced sub-schemas are unknown), the referenced schema do not appear to be retrieved during crawl and are retrieved twice during validate. Is this an expected result with crawl?

import jsonschema
import referencing


obj_uri = "https://somewhere.org/foo/obj"
obj_schema = {
    "id": obj_uri,
    "type" : "object",
    "properties" : {
        "price" : {"type" : "number"},
    },
}

cart_uri = "https://somewhere.org/foo/cart"
cart_schema = {
    "id": cart_uri,
    "properties": {
        "foo": {
            "$ref": obj_uri,
        },
        "bar": {
            "$ref": obj_uri,
        },
    },
}

cart = {"foo": {"price": 1}, "bar": {"price": 2}}

# make a validator with a registry
def retrieve(target_uri):
    if target_uri == obj_uri:
        schema = obj_schema
    elif target_uri == cart_uri:
        schema = cart_schema
    else:
        raise referencing.exceptions.NoSuchResource(f"{target_uri} not found")
    resource = referencing.Resource(contents=schema, specification=referencing.jsonschema.DRAFT4)
    print(f"{target_uri} retrieved")
    return resource

print("Making registry")
registry = (retrieve(cart_uri) @ referencing.Registry(retrieve=retrieve)).crawl()
print("Validating")
validator = jsonschema.validators.Draft4Validator(cart_schema, registry=registry)

validator.validate(cart)

Produces the following output:

Making registry
https://somewhere.org/foo/cart retrieved
Validating
https://somewhere.org/foo/obj retrieved
https://somewhere.org/foo/obj retrieved

@Julian
Copy link
Member

Julian commented Apr 3, 2023

Got it, yeah ok that's similar to the case from "this ticket", one where you want some unknown URI to be cached. Yeah, as I say I'm leaning towards supporting that by asking you to wrap your retrieve in a cache of some sort (either your own unrelated to referencing or else to_cached_resource if referencing has one it gives you).

Obviously as you say if you know obj_uri ahead of time you can do e.g.:

registry = retrieve(cart_uri) @ referencing.Registry(retrieve=retrieve)).get_or_retrieve(obj_uri).registry.crawl()

(ugly smashed on one line but hope it's clear what's going on there)

but yeah I'm assuming you're saying you have a bunch of random $refs in your schemas and don't know their URIs ahead of time.

FWIW the JSON Schema specifications do say to you as an author that you should expect implementations to ask you to show up with the entire bundle of schemas you want to make available (and some implementations will not even do any additional discovery on your behalf).

But as I say I want to go a bit beyond that and at least help out on this case with suggesting the right caching if possible.

Julian added a commit to python-jsonschema/referencing that referenced this issue May 31, 2023
This is a fairly simple caching decorator (which just delegates to
lru_cache), but it's useful because:

    * Most dynamic retrievers will probably want this
    * It saves a small bit of referencing-specific boilerplate, letting
      users know to return some JSON and the rest is handled (as long as
      their schemas contain $schema as usual)

It also allows for continuing to support use cases that *don't* want
caching (by of course not using this decorator) such as when you *do*
want to dynamically re-retrieve a URI because it may have changed
contents.

Some tweaks may still be necessary here, but it does work for the
initial example.

Refs: python-jsonschema/jsonschema#1065
@Julian
Copy link
Member

Julian commented May 31, 2023

So I've just added referencing.retrieval.to_cached_resource to the referencing library, with essentially the signature I mentioned above. It will be released in referencing v0.29.0 momentarily.

I'm going to close this as being addressed by that, but if you have feedback on it or want to discuss further definitely let me know!

In particular, for your original example, if you now instead add the relevant decorator:

import json

import jsonschema
import referencing.retrieval


obj_uri = "https://somewhere.org/foo/obj"
obj_schema = {
    "id": obj_uri,
    "type": "object",
    "properties": {
        "price": {"type": "number"},
    },
}

cart_uri = "https://somewhere.org/foo/cart"
cart_schema = {
    "id": cart_uri,
    "properties": {
        "foo": {
            "$ref": obj_uri,
        },
        "bar": {
            "$ref": obj_uri,
        },
    },
}

cart = {"foo": {"price": 1}, "bar": {"price": 2}}


# make a validator with a registry
@referencing.retrieval.to_cached_resource(from_contents=referencing.jsonschema.DRAFT4.create_resource)
def retrieve(target_uri):
    if target_uri == obj_uri:
        schema = obj_schema
    elif target_uri == cart_uri:
        schema = cart_schema
    else:
        raise referencing.exceptions.NoSuchResource(f"{target_uri} not found")

    print(f"{target_uri} retrieved")
    return json.dumps(schema)

print("Making registry")
first = retrieve(cart_uri) @ referencing.Registry(retrieve=retrieve)
print(first.get_or_retrieve(obj_uri).registry.crawl())
registry = first.crawl()
print("Validating")
validator = jsonschema.validators.Draft4Validator(cart_schema, registry=registry)

validator.validate(cart)

You should notice that now repeated lookups do not cause repeated calls to your retrieve function, they re-use the cached schema.

Thanks for raising, and yeah again please do let me know if you have any more feedback!

@Julian Julian closed this as completed May 31, 2023
@braingram
Copy link
Author

Hi @Julian

Thanks for the reply to this issue and sorry for the delay in getting back to this.

When I try to run your example with referencing==0.29.0 and jsonschema==4.18.0a9 I get the following error:

ValueError: Cannot combine registries with conflicting retrieval functions.

Perhaps this is related to the changes in: 1240e68#diff-d38128373c229147d38e1c4707cdecd7e510c14eef80565374fe9a69627e4676R272

@Julian
Copy link
Member

Julian commented Jun 12, 2023

Hey! No worries -- possibly, that should be pretty easy to fix (and might be a small bug anyhow) -- will have a quick look! Thanks for trying it out, will update you in a bit.

Julian added a commit that referenced this issue Jun 12, 2023
In other words when someone does *not* do Validator(..., registry=...)

This both limits the case which we *do* still perform automatic remote
ref retrieval (in a way that still preserves backwards compatibility
given that the registry argument did not previously exist, so someone
using it must be doing something new) as well as fixes an issue with
creating a validator with a registry that has a retrieve function.

Previously the latter raised a ValueError (coming from referencing which
doesn't want you combining registries unless they agree about retrieval
functions. See the added test(s) for details).

Fixes the issue mentioned here:

#1065 (comment)

Refs: #1089
@Julian
Copy link
Member

Julian commented Jun 12, 2023

OK this should be fixed and released in a minute, thanks for the follow-up, let me know if you notice anything else!

@braingram
Copy link
Author

Thanks for the fix and quick response :)

The examples run without issue on 4.18.0.a10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Some new desired functionality
Projects
None yet
Development

No branches or pull requests

2 participants