refactor(jsonschema): reworking how we handle json schema #65

Arqu · 2020-04-23T13:03:05Z

WIP PR for the jsonschema implementation rework.

Things to do:

Things that will not be covered in this release:

Implement multi-version support
Implement older version keywords / handling

b5

quick mid-point review, mainly pointing out style things to make life easier later. Looking great!

draft2019_09_keywords.go

keyword.go

schema_context.go

Arqu · 2020-04-29T20:21:18Z

We're entering reviewable PR territory. All things considered, this is still very much a work in progress and needs the following finalized:

Major cleanup
Finalized tests
Ref resolution finalization
Support for older versions

However, I would like to get going on the "overall logic" front. Here's a breakdown of what changed functionally from the previous implementation:

Moved a lot of logic away from schema - all keywords are now implemented independently of the schema itself, and schema only serves as a vessel for implementations of other keywords and validators. The only logic schema will retain is the reference resolution.
Validation is now slightly more stateful, so we propagate a "schema context" throughout all validation. The reasoning here is to detach the global alignment of features such as additional items/properties, boolean logic & conditional statements from schema and have them function independently but coordinated through the global context.

Codewise:

Keywords are loaded directly, which means we have to have base definitions for all drafts available
Currently the above is manually loaded; however, in upcoming cleanups we will use the schema version to dictate and autoload keywords, or fall back to the latest if unspecified
Keyword evaluation order is now stricter and relies on the keyword definition to declare its order preference
The overarching concept is broken into two concrete pieces: schema parsing and data validation.
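As a rough illustration of the loading and ordering points above, here is a minimal sketch of a keyword registry in which each definition declares its own order preference. The names `keywordDef`, `register`, and `evaluationOrder` are hypothetical and are not taken from this PR; the order values are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// keywordDef is a hypothetical registry entry: a keyword name plus the
// order preference that the keyword declares for itself.
type keywordDef struct {
	name  string
	order int // lower values are evaluated first
}

// registry maps keyword names to their definitions. In this sketch it is
// filled by hand, mirroring the "manually loaded" state described above.
var registry = map[string]keywordDef{}

// register adds (or replaces) a keyword definition.
func register(name string, order int) {
	registry[name] = keywordDef{name: name, order: order}
}

// evaluationOrder keeps only the registered keywords from present and
// returns them sorted by declared order, breaking ties by name so the
// result is deterministic.
func evaluationOrder(present []string) []string {
	out := make([]string, 0, len(present))
	for _, k := range present {
		if _, ok := registry[k]; ok {
			out = append(out, k)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		a, b := registry[out[i]], registry[out[j]]
		if a.order != b.order {
			return a.order < b.order
		}
		return a.name < b.name
	})
	return out
}

func main() {
	register("properties", 1)
	register("additionalProperties", 2)
	register("unevaluatedProperties", 3)
	keys := []string{"unevaluatedProperties", "x-unknown", "additionalProperties", "properties"}
	fmt.Println(evaluationOrder(keys))
}
```

Keeping the order on the definition means a keyword that depends on the results of its siblings only has to declare a later slot, rather than the schema knowing about it.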

Schema parsing:

The whole process is done in a depth-first fashion and each item is independently unmarshaled. The reason for this is to let every item have its own logic for how to parse and, more importantly, how to prepare the state for validation
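A minimal sketch of what depth-first, per-item unmarshaling could look like, assuming only two keywords (`maximum` and `properties`). The types below are simplified stand-ins written for this example, not the ones in the PR, and boolean schemas are deliberately left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Keyword is a simplified stand-in for a parsed keyword.
type Keyword interface{}

// Maximum unmarshals itself from a JSON number.
type Maximum float64

// Properties unmarshals itself into named sub-schemas. Decoding each value
// calls (*Schema).UnmarshalJSON, which is what makes the walk depth-first.
type Properties map[string]*Schema

// Schema is only a container for the keywords found in one JSON object.
type Schema struct {
	Keywords map[string]Keyword
}

// UnmarshalJSON splits the object into raw keyword values and lets each
// known keyword type decode its own value. Unknown keys are ignored here.
func (s *Schema) UnmarshalJSON(data []byte) error {
	raw := map[string]json.RawMessage{}
	if err := json.Unmarshal(data, &raw); err != nil {
		return err
	}
	s.Keywords = map[string]Keyword{}
	for name, msg := range raw {
		switch name {
		case "maximum":
			var m Maximum
			if err := json.Unmarshal(msg, &m); err != nil {
				return err
			}
			s.Keywords[name] = m
		case "properties":
			p := Properties{}
			if err := json.Unmarshal(msg, &p); err != nil {
				return err
			}
			s.Keywords[name] = p
		}
	}
	return nil
}

// parse is a small helper that decodes a schema from a string.
func parse(src string) (*Schema, error) {
	s := &Schema{}
	if err := json.Unmarshal([]byte(src), s); err != nil {
		return nil, err
	}
	return s, nil
}

func main() {
	s, err := parse(`{"maximum": 10, "properties": {"age": {"maximum": 120}}}`)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	age := s.Keywords["properties"].(Properties)["age"]
	fmt.Println(s.Keywords["maximum"], age.Keywords["maximum"])
}
```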

Schema validation:

As with parsing, this is done in a depth-first fashion. However, since we have the validation state already prepared, we mostly use this as a validation tree and simply evaluate along its path
This is where we use schema context to keep track of evaluation results, location in the state tree, and the data currently being evaluated against
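The walk described above might be sketched like this. It assumes a pre-built tree with only three keywords, and the `evalContext` type is a hypothetical, heavily reduced stand-in for the schema context: it carries just the current instance location and a shared error list.

```go
package main

import (
	"fmt"
	"strconv"
)

// evalContext is a reduced, hypothetical stand-in for the schema context.
type evalContext struct {
	location string    // JSON-pointer-like path of the data being checked
	errs     *[]string // shared by every context in one validation run
}

// descend returns a context for a child value, one path token deeper.
func (c evalContext) descend(token string) evalContext {
	return evalContext{location: c.location + "/" + token, errs: c.errs}
}

// node is one prepared node of the validation tree.
type node struct {
	maximum    *float64
	properties map[string]*node
	items      *node
}

// validate checks data against this node, then recurses depth-first.
func (n *node) validate(c evalContext, data interface{}) {
	if n.maximum != nil {
		if num, ok := data.(float64); ok && num > *n.maximum {
			msg := fmt.Sprintf("%s: must be less than or equal to %v", c.location, *n.maximum)
			*c.errs = append(*c.errs, msg)
		}
	}
	if obj, ok := data.(map[string]interface{}); ok {
		for name, child := range n.properties {
			if v, present := obj[name]; present {
				child.validate(c.descend(name), v)
			}
		}
	}
	if arr, ok := data.([]interface{}); ok && n.items != nil {
		for i, v := range arr {
			n.items.validate(c.descend(strconv.Itoa(i)), v)
		}
	}
}

// run validates data from the root and returns the collected errors.
func run(n *node, data interface{}) []string {
	errs := []string{}
	n.validate(evalContext{errs: &errs}, data)
	return errs
}

func main() {
	limit := 10.0
	tree := &node{properties: map[string]*node{
		"scores": &node{items: &node{maximum: &limit}},
	}}
	data := map[string]interface{}{"scores": []interface{}{3.0, 12.0}}
	for _, e := range run(tree, data) {
		fmt.Println(e)
	}
}
```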

Notable pieces of code:

schema.go
schema_context.go
schema_registry.go
keyword.go

There is also this prerequisite PR for qri-io/jsonpointer which is functionally the same, but carries some new options for performance improvements.

Points for further discussion:

Breaking the API - this version might lead to "odd" behavior for external users as things have shifted a bit. The surface API will remain the same for the most part, however things like RootSchema no longer exist (which was a major entrypoint for any user). What I'd like to figure out here is what lengths we should go to in order to keep backwards compatibility.
Qri specific keywords/implementations - these should live in a separate package, as this one should only be concerned with the by-the-spec jsonschema implementation. Keep your eyes peeled on how we should actually do this while not being invasive. My current thought process is to have the "load keywords" process be overloadable/extendable so we can manually inject new keywords and implementations for our use
Support for keywords like unevaluatedItems/Properties - these require an additional copy of validation state just to be able to execute the validation (which can be worked around with the existing additionalItems/Properties keyword), which has the drawbacks of a) complicating the logic further and b) carrying a non-negligible performance hit. Currently most of the time spent in execution is on managing, copying and juggling state, with the heaviest being keeping track of item evaluations and their underlying maps. It's not an end-of-the-world performance hit, but is probably in the X0% range in the average case of real world usage.
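To make the state-copying cost in the last point concrete, here is a minimal sketch of tracking evaluated property names. It assumes a sub-schema branch works on a forked copy of the set and is merged back only when the caller decides the branch counts. The `evalState` type and its methods are hypothetical and are not the PR's implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// evalState is a hypothetical tracker for the property names that sibling
// keywords have already evaluated.
type evalState struct {
	evaluated map[string]bool
}

func newEvalState() *evalState {
	return &evalState{evaluated: map[string]bool{}}
}

// markEvaluated records that some keyword has handled the named property.
func (s *evalState) markEvaluated(name string) {
	s.evaluated[name] = true
}

// fork copies the evaluated set so a sub-schema can be tried without
// touching the parent. This per-branch map copy is the kind of bookkeeping
// the comment above describes as expensive.
func (s *evalState) fork() *evalState {
	c := newEvalState()
	for k := range s.evaluated {
		c.evaluated[k] = true
	}
	return c
}

// merge folds a child's results back into s.
func (s *evalState) merge(child *evalState) {
	for k := range child.evaluated {
		s.evaluated[k] = true
	}
}

// unevaluated returns, in sorted order, the names in obj that no keyword
// has marked. These are what an unevaluatedProperties check would inspect.
func (s *evalState) unevaluated(obj map[string]interface{}) []string {
	var out []string
	for name := range obj {
		if !s.evaluated[name] {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	obj := map[string]interface{}{"a": 1, "b": 2, "c": 3}
	parent := newEvalState()
	parent.markEvaluated("a")

	child := parent.fork() // e.g. one branch of an applicator keyword
	child.markEvaluated("b")

	fmt.Println(parent.unevaluated(obj)) // the branch is not merged yet
	parent.merge(child)
	fmt.Println(parent.unevaluated(obj))
}
```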

Arqu · 2020-05-07T00:36:55Z

I deem this ready to start being reviewed. Refer to the above comment for a bit more guidance on the changes in general. The top comment reflects the current state of what still needs to be done before merging and what will be left for future work.

b5

Ok, this is a first pass. I've looked at 23/34 files. Looking great. I'd love to get back over to package jsonschema ASAP.

The other thing we need to do is cut a release (v0.2.0) that updates CHANGELOG.md and specifies that this is the last version before a jump to 2019 draft support

README.md

go.mod

keyword.go

schema_registry.go

util.go

schema_test.go

b5

Ok, now we get to chat about SchemaContext 😄

keyword.go

keywords_conditional.go

b5 · 2020-05-13T13:33:28Z

schema_context.go

+type SchemaContext struct {
+	Local                   *Schema
+	Root                    *Schema
+	RecursiveAnchor         *Schema
+	Instance                interface{}
+	LastEvaluatedIndex      int
+	LocalLastEvaluatedIndex int
+	BaseURI                 string
+	InstanceLocation        *jptr.Pointer
+	RelativeLocation        *jptr.Pointer
+	BaseRelativeLocation    *jptr.Pointer
+
+	LocalRegistry *SchemaRegistry
+
+	EvaluatedPropertyNames      map[string]bool
+	LocalEvaluatedPropertyNames map[string]bool
+	Misc                        map[string]interface{}
+
+	ApplicationContext *context.Context
+}


Ok, after a bunch of reading, I think this API needs work, but seems to clearly point to how it can be improved.

It seems to me this struct is a state machine, not a context. A "context" in Go carries scope across API boundaries. This struct harmonizes state into a single location, allowing keywords to be stateless.

Because all fields are exported, the guts of this state machine are open to the world to modify.

The methods & fields on this struct are the primary API for keyword developers. Because we support custom keywords, it's a public API and needs to be written defensively.

All keywords in this package are consumers of this API, and examples for other keyword developers, a massive win IMHO.

This struct is not safe for concurrent use. That might be ok for now, but we should have a plan for making it safe for concurrency.

The Instance field couples instance data to validation state, which feels incorrect. Data should flow separately through the keyword API, especially in a streaming context.

This state machine should collect validation errors (we've been using an outParam for this).
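A minimal sketch of how a few of these points could combine: unexported fields, internal error collection, and a mutex for concurrent use. The `ValidationState` and `AddError` names follow this review, while `KeyError`, `Errs`, the `location` field, and the zero-argument constructor are assumptions made only for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// KeyError is a hypothetical error record for this sketch.
type KeyError struct {
	Path    string
	Message string
}

// ValidationState keeps every field unexported, so keyword authors can
// only change it through methods.
type ValidationState struct {
	mu       sync.Mutex
	location string
	errs     []KeyError
}

// NewValidationState returns a state positioned at the document root.
// (The version proposed in this review takes the root schema as an
// argument; that is omitted here to keep the sketch self-contained.)
func NewValidationState() *ValidationState {
	return &ValidationState{location: "/"}
}

// AddError records an error at the current location. The mutex makes
// concurrent calls safe.
func (s *ValidationState) AddError(msg string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.errs = append(s.errs, KeyError{Path: s.location, Message: msg})
}

// Errs returns a copy of the collected errors, so callers cannot modify
// the internal slice.
func (s *ValidationState) Errs() []KeyError {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]KeyError, len(s.errs))
	copy(out, s.errs)
	return out
}

func main() {
	st := NewValidationState()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			st.AddError(fmt.Sprintf("error %d", n))
		}(i)
	}
	wg.Wait()
	fmt.Println(len(st.Errs()), "errors collected")
}
```

A single mutex is the simplest option; whether it is the right plan for concurrency here depends on how sub-states end up being shared, which this sketch does not address.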

Starting from the top level Schema "user api". Instead of being "a wrapper function to maintain some level of backwards compatibility with versions v0.1.2 and prior", let's break the API completely and define a very "generic" Go function that initializes a root state and transitions to the "keyword API":

func (s *Schema) Validate(ctx context.Context, data interface{}) {
	st := NewValidationState(s)
	s.ValidateKeyword(ctx, st, data)
}

The call to ValidateKeyword transitions us to the "keyword API", where we've changed the primary ValidateFromContext interface function to something like ValidateKeyword:

// Keyword is an interface for anything that can validate.
// JSON-Schema keywords are all examples of Keyword
type Keyword interface {
	// ValidateKeyword runs a validation check against decoded JSON data,
	// calling methods on ValidationState to record any discovered errors
	ValidateKeyword(ctx context.Context, state *ValidationState, data interface{})
	// ...
}

A keyword implementation would change to implement like this:

// ValidateKeyword implements the Keyword interface for Maximum
func (m Maximum) ValidateKeyword(ctx context.Context, state *ValidationState, data interface{}) {
	SchemaDebug("[Maximum] Validating")
	if num, ok := data.(float64); ok {
		if num > float64(m) {
			// state now keeps the errs slice internally, has all the info it needs to
			// populate error fields
			state.AddError(fmt.Sprintf("must be less than or equal to %f", m))
		}
	}
}

A complex keyword needs a fairly rich API from the state struct. Here I've made up the methods Items needs. Warning, untested code:

// ValidateKeyword implements the Keyword interface for Items
func (it Items) ValidateKeyword(ctx context.Context, state *ValidationState, data interface{}) {
	SchemaDebug("[Items] Validating")
	// data now arrives as an argument instead of being read from schCtx.Instance
	if arr, ok := data.([]interface{}); ok {
		// instead of "NewSchemaContextFromSourceClean(*schCtx)", state gets a method "subState"
		// that initializes & returns a clean substate from the parent
		subState := state.NewSubState()
		if it.single {
			// BaseRelativeLocation should be a method that reads from a private field,
			// the method defends against nil access, making it safe to use like this:
			if newPtr := state.BaseRelativeDescendant("items"); newPtr != nil {
				subState.SetBaseRelativeLocation(newPtr)
			}
			// this could probably be turned into a one-liner:
			newPtr := state.RelativeDescendant("items")
			subState.SetRelativeLocation(newPtr)
			for i, elem := range arr {
				if _, ok := state.LocalKeyword("additionalItems"); ok {
					state.SetPropertiesEvaluated("0")
					state.SetLocalPropertiesEvaluated("0")
					// These might be combined into some sort of "only increment if higher" setter
					if state.LastEvaluatedIndex() < i {
						state.SetEvaluatedIndex(i)
					}
					if state.LocalLastEvaluatedIndex() < i {
						state.SetLocalLastEvaluatedIndex(i)
					}
				}
				subState.ClearContext()
				newPtr = state.InstanceLocationDescendant(strconv.Itoa(i))
				subState.SetInstanceLocation(newPtr)
				// here it's clearer we're using a subState with a different data element:
				it.Schemas[0].ValidateKeyword(ctx, subState, elem)
				if _, ok := state.LocalKeyword("additionalItems"); ok {
					// TODO(arqu): this might clash with additionalProperties
					// should separate items out
					state.SetPropertiesEvaluated(subState.EvaluatedProperties()...)
					state.SetLocalPropertiesEvaluated(subState.LocalEvaluatedProperties()...)
				}
			}
		} else {
			for i, vs := range it.Schemas {
				if i < len(arr) {
					if _, ok := state.LocalKeyword("additionalItems"); ok {
						state.SetPropertyEvaluated(strconv.Itoa(i))
						state.SetLocalPropertyEvaluated(strconv.Itoa(i))
						// These might be combined into some sort of "only increment if higher" setter
						if state.LastEvaluatedIndex() < i {
							state.SetEvaluatedIndex(i)
						}
						if state.LocalLastEvaluatedIndex() < i {
							state.SetLocalLastEvaluatedIndex(i)
						}
					}
					subState.ClearContext()
					if newPtr := state.BaseRelativeDescendant("items", strconv.Itoa(i)); newPtr != nil {
						subState.SetBaseRelativeLocation(newPtr)
					}
					newPtr, _ := state.RelativeLocationDescendant("items", strconv.Itoa(i))
					subState.SetRelativeLocation(newPtr)
					newPtr = state.InstanceLocationDescendant(strconv.Itoa(i))
					subState.SetInstanceLocation(newPtr)
					vs.ValidateKeyword(ctx, subState, arr[i])
					if _, ok := state.LocalKeyword("additionalItems"); ok {
						state.SetPropertiesEvaluated(subState.EvaluatedProperties()...)
						state.SetLocalPropertiesEvaluated(subState.LocalEvaluatedProperties()...)
					}
				}
			}
		}
	}
}
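The "only increment if higher" setter mentioned in the code comments could look roughly like this. It is a sketch only: `indexState` and `bumpEvaluatedIndex` are hypothetical names, and starting both indexes at -1 to mean "nothing evaluated yet" is an assumption.

```go
package main

import "fmt"

// indexState holds the two index counters from the Items example as
// unexported fields.
type indexState struct {
	lastEvaluatedIndex      int
	localLastEvaluatedIndex int
}

// newIndexState starts both counters at -1, assumed here to mean that no
// array element has been evaluated yet.
func newIndexState() *indexState {
	return &indexState{lastEvaluatedIndex: -1, localLastEvaluatedIndex: -1}
}

// bumpEvaluatedIndex raises each counter to i only when i is higher than
// the current value, so callers no longer repeat the comparison.
func (s *indexState) bumpEvaluatedIndex(i int) {
	if s.lastEvaluatedIndex < i {
		s.lastEvaluatedIndex = i
	}
	if s.localLastEvaluatedIndex < i {
		s.localLastEvaluatedIndex = i
	}
}

func main() {
	st := newIndexState()
	for _, i := range []int{0, 2, 1} {
		st.bumpEvaluatedIndex(i)
	}
	fmt.Println(st.lastEvaluatedIndex, st.localLastEvaluatedIndex)
}
```

With a setter like this, the two `if ... < i` blocks in each branch of the Items example would collapse into a single call.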

The hard work of figuring out how to arrange this is done, all of these comments are just ergonomics, but important considering we're going to break the API. I think breaking the API is wholly appropriate with the transition to the 2019_09 spec.

b5

🌟 🌟 🚀 🧑‍🚀 🚀 🌟 🌟
🔥 🔥 🎸 👩‍🎤 🎸 🔥 🔥
🚂 🚋 🚋 🚋 🚋 🚋 🚋

b5 · 2020-05-21T15:37:07Z

Let's merge a version bump first, but THIS IS SO GOOD YAY @Arqu

Arqu added this to In progress in Sprint H via automation Apr 23, 2020

Arqu force-pushed the refactoring-jsonschema branch 2 times, most recently from ab38c2d to f09df59 Compare April 24, 2020 23:46

Arqu added this to In progress in Sprint I via automation Apr 27, 2020

Arqu self-assigned this Apr 27, 2020

Arqu force-pushed the refactoring-jsonschema branch 2 times, most recently from 4fdfb09 to b60b331 Compare April 28, 2020 22:15

b5 reviewed Apr 29, 2020

refactor(jsonschema): reworking how we handle json schema parsing and…

732e0ba

… validating

Arqu force-pushed the refactoring-jsonschema branch from b60b331 to 732e0ba Compare April 29, 2020 19:50

Arqu mentioned this pull request Apr 29, 2020

feat(pointer): better performance through directly managed pointers qri-io/jsonpointer#8

Merged

Arqu added the refactor label Apr 29, 2020

refactor(jsonschema): refactored jsonschema and implmented draft2019_09

43bb537

Arqu marked this pull request as ready for review May 7, 2020 00:40

refactor(jsonschema): added more tests, made schemas autoload draft20…

3601eb5

…19 keywords

Arqu moved this from In progress to Needs Review in Sprint I May 7, 2020

refactor(jsonschema): cleanup, updated README

4b4c01f

Arqu force-pushed the refactoring-jsonschema branch from c0215a4 to 4b4c01f Compare May 10, 2020 23:12

b5 added this to In progress in Sprint J via automation May 11, 2020

b5 moved this from In progress to Review in progress in Sprint J May 11, 2020

b5 reviewed May 13, 2020

b5 requested changes May 13, 2020

b5 moved this from Needs Review to To do in Sprint J May 14, 2020

Arqu moved this from To do to In progress in Sprint J May 14, 2020

Arqu moved this from In progress to To do in Sprint J May 15, 2020

b5 moved this from To do to In progress in Sprint J May 20, 2020

refactor(jsonschema): fixing bugs, implementing more keywords, cleanup

4bc1017

Arqu force-pushed the refactoring-jsonschema branch from 8e6be66 to f41640c Compare May 20, 2020 23:42

refactor(jsonschema): further cleanup

2d24f12

Arqu force-pushed the refactoring-jsonschema branch from f41640c to 2d24f12 Compare May 20, 2020 23:43

refactor(jsonschema): improving the validation API

857fdbb

Arqu requested a review from b5 May 21, 2020 15:17

Arqu moved this from In progress to Needs Review in Sprint J May 21, 2020

Arqu changed the title ~~refactor(jsonschema): reworking how we handle json schema WIP~~ refactor(jsonschema): reworking how we handle json schema May 21, 2020

Sprint J automation moved this from Needs Review to Reviewer approved May 21, 2020

b5 approved these changes May 21, 2020

Arqu merged commit bb2a1cf into master May 21, 2020

Sprint J automation moved this from Reviewer approved to Done May 21, 2020

Arqu deleted the refactoring-jsonschema branch May 21, 2020 16:16
