Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(jsonschema): reworking how we handle json schema #65

Merged
merged 7 commits into from
May 21, 2020

Conversation

Arqu
Copy link
Contributor

@Arqu Arqu commented Apr 23, 2020

WIP PR for the jsonschema implementation rework.

Things to do:

  • Initial wiring/rework
  • Implement all draft2019-09 keywords
  • Improve validation error formats
  • Clean up code
  • Clear TODOs
  • Run benchmarks
  • Align with old API for as much backwards compatibility as possible
  • Merge Json Pointer changes first
  • Re-introduce some of the lost tests traversal_test.go, val_error_test.go & validate_test.go
  • Update README
  • Resolve linting issues
  • Comment out code
  • Resolve 2 failing tests for ref.json (or disregard them)
  • Undo the package change to main and remove main.go
  • Run another performance tweaking rund
  • Improve lib ergonomy
  • Clean up the package API
  • Cut release of the current implmenetation before merging
  • Update godocs
  • Cut new release

Things that will not be covered in this release:

  • Implement multi-version support
  • Implement older version keywords / handling

@Arqu Arqu added this to In progress in Sprint H via automation Apr 23, 2020
@Arqu Arqu force-pushed the refactoring-jsonschema branch 2 times, most recently from ab38c2d to f09df59 Compare April 24, 2020 23:46
@Arqu Arqu added this to In progress in Sprint I via automation Apr 27, 2020
@Arqu Arqu self-assigned this Apr 27, 2020
@Arqu Arqu force-pushed the refactoring-jsonschema branch 2 times, most recently from 4fdfb09 to b60b331 Compare April 28, 2020 22:15
Copy link
Member

@b5 b5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick mid-point review, mainly pointing out style things to make life easier later. Looking great!

draft2019_09_keywords.go Outdated Show resolved Hide resolved
keyword.go Show resolved Hide resolved
keyword.go Outdated Show resolved Hide resolved
keyword.go Outdated Show resolved Hide resolved
schema_context.go Outdated Show resolved Hide resolved
@Arqu
Copy link
Contributor Author

Arqu commented Apr 29, 2020

We're entering reviewable PR teritory. All things considered this is very much work in progress and needs the following finalized:

  • Major cleanup
  • Finalized tests
  • Ref resolution finalization
  • Support for older versions

However I would like to get going on the "overal logic" front. Here's a breakdown on what changed functionally from the previous implementation:

  • Moved a lot of logic away from schema - this means that all keywords are now independantly implemented from the schema itself and schema only serves as a vessle for implementations of other keywords and validators. The only logic schema will retain is the reference resolution.
  • Validation is now slightly more stateful and thus we propagate a "schema context" throughout all validation. The reasoning here is to detach the global alignment of features such as additional items/properties, boolean logic & conditional statements from schema and have them function independantly but coordinated through the global context

Codewise:

  • Keywords are loaded directly which means we have to have base definitions for all drafts available
  • Currently the above is manually loaded, however in upcomming cleanups, we will utilize the schema version to dictate and autoload keywords or fall back to the latest if unspecified
  • Keyword evaluation order is now strict-er and relies on the keyword definition to declare it's order preference
  • The overarching concept is broken into two concrete pieces: schema parsing and data validation.

Schema parsing:

  • The whole process is done in a depth first fashion and each item is independantly unmarshaled. The reason for this is to have every item have it's own logic on how to parse and more importantly how to prepare the state for validation

Schema validation:

  • Same as parsing is done in a depth first fashion, however since we have the validation state already prepared, we mostly use this as a validation tree and simply evaluate along it's path
  • This is where we utilize schema context to keep track of evaluation results, location in the state tree and the data currently being evaluated against

Notable pieces of code:

  • schema.go
  • schema_context.go
  • schema_registry.go
  • keyword.go

There is also this prerequisite PR for qri-io/jsonpointer which is functionally the same, but carries some new options for performance improvements.

Points for further discussion:

  • Breaking the API - this version might lead to "odd" behavior for external users as things have shifted a bit. The surface API will remain the same for the most part, however things like RootSchema no longer exist (which was a major entrypoint for any user). What I'd like to figure out here is to what lenght should we go to keep backwards compatibility.
  • Qri specific keywords/implementations - these should live in a separate package as this one should be only concerned with by the spec jsonschema implementation. Keep your eyes peeled on how we should actually do this while not being invasive. My current thought process is to have the "load keywords" process to be overload-able/extendable so we can manually interject new keywords and implementations in our use
  • Support for keywords like unevaluatedItems/Properties - these require to have an additional copy of validation state just to be able to execute the validation (which can be worked around with the existing additionalItems/Properties keyword) which has the drawback of a) complicating the logic further, b) carrying a not negligable performance hit. Currently most of the time spent in execution is on managing, copying and juggling state with the heaviest being keeping track of item evaluations and their underlying maps. It's not an end of the world performance hit, but is probably in the X0% range in the average case of real world usage.

@Arqu
Copy link
Contributor Author

Arqu commented May 7, 2020

I deem this ready to start being reviewed. Refer to the above comment for a bit more guidance on the changes in general. The top comment reflects the current state of what still needs to be done before merging and what will be left for future work.

@Arqu Arqu marked this pull request as ready for review May 7, 2020 00:40
@Arqu Arqu moved this from In progress to Needs Review in Sprint I May 7, 2020
@b5 b5 added this to In progress in Sprint J via automation May 11, 2020
@b5 b5 moved this from In progress to Review in progress in Sprint J May 11, 2020
Copy link
Member

@b5 b5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is a first pass. I've looked at 23/34 files. Looking great. I'd love to get back over to package jsonschema ASAP.

The other thing we need to do is cut a release (v0.2.0) that updates CHANGELOG.md and specifies this the last version before a jump to 2019 draft support

README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
go.mod Outdated Show resolved Hide resolved
keyword.go Outdated Show resolved Hide resolved
keyword.go Outdated Show resolved Hide resolved
schema_registry.go Outdated Show resolved Hide resolved
util.go Outdated Show resolved Hide resolved
util.go Outdated Show resolved Hide resolved
schema_test.go Outdated Show resolved Hide resolved
schema_test.go Outdated Show resolved Hide resolved
Copy link
Member

@b5 b5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now we get to chat about SchemaContext 😄

keyword.go Outdated Show resolved Hide resolved
keywords_conditional.go Show resolved Hide resolved
Comment on lines 12 to 31
type SchemaContext struct {
Local *Schema
Root *Schema
RecursiveAnchor *Schema
Instance interface{}
LastEvaluatedIndex int
LocalLastEvaluatedIndex int
BaseURI string
InstanceLocation *jptr.Pointer
RelativeLocation *jptr.Pointer
BaseRelativeLocation *jptr.Pointer

LocalRegistry *SchemaRegistry

EvaluatedPropertyNames map[string]bool
LocalEvaluatedPropertyNames map[string]bool
Misc map[string]interface{}

ApplicationContext *context.Context
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, after a bunch of reading, I think this API needs work, but seems to clearly point to how it can be improved.

  • It seems to me this struct is a state machine, not a context. a "context" in go carries scope across API boundaries. This struct harmonizes state into a single location, allowing keywords to be stateless.
  • because all fields are exported, the guts of this state machine are open to the world to modify
  • the methods & fields on this struct are the primary API for keyword developers. Because we support custom keywords, It's a public API and needs to be written defensively.
  • all keywords in this package are consumers of this API, and examples for other keyword developers, a massive win IMHO.
  • this struct is not safe for concurrent use. That might be ok for now, but we should have a plan for making it safe for concurrency.
  • the Instance couples Instance data to validation state, which feels incorrect. data should flow separately though the keyword API, especially in a streaming context
  • this state machine should collect validation errors (we've been using an outParam for this)

Starting from the top level Schema "user api". Instead of being "a wrapper function to maintain some level of backwards compatibility with versions v0.1.2 and prior", let's break the API complete and define a very "generic" go function that initializes a root state and transitions to the "keyword API":

func (s *Schema) Validate(ctx context.Context, data interface{}) {
	st := NewValidationState(s)
	s.ValidateKeyword(ctx, st, data)
}

The call to ValidateKeyword transitions us to the "keyword API", where we've changed the primary ValidateFromContext interface function to something like ValidateKeyword:

// Keyword is an interface for anything that can validate.
// JSON-Schema keywords are all examples of Keyword
type Keyword interface {
  // ValidateKeyword runs a validation check against decoded JSON data, 
  // calling methods on ValidationState to record any discovered errors
  ValidateKeyword(ctx context.Context, state *ValidationState, data interface{})
  // ...
}

A keyword implementation would change to implement like this:

// ValidateKeyword implements the Keyword interface for Maximum
func (m Maximum) ValidateKeyword(ctx context.Context, state *ValidateionState, data interface{}) {
	SchemaDebug("[Maximum] Validating")
	if num, ok := data.(float64); ok {
		if num > float64(m) {
			// state now keeps the errs slice internally, has all the info it needs to 
			// populate error fields
			state.AddError(fmt.Sprintf("must be less than or equal to %f", m))
		}
	}
}

a complex keyword needs a fairly rich API from the state struct. Here I've made up methods to satistify methods Items needs. Warning, untested code:

// ValidateKeyword implements the Keyword interface for Items
func (it Items) ValidateKeyword(ctx context.Context, state *ValidationState, data interface{}) {
	SchemaDebug("[Items] Validating")
	if arr, ok := schCtx.Instance.([]interface{}); ok {
		// instead of "NewSchemaContextFromSourceClean(*schCtx)", state gets a method "subState"
		// that initializes & returns a clean substate from the parent
		subState := state.NewSubState()

		if it.single {
			// BaseRelativeLocation should be a method that reads from a private field,
			// the method defends against nil access, making it safe to use like this:
			if newPtr := state.BaseRelativeDescendant("items"); newPtr != nil {
				subState.SetBaseRelativeLocation(newPtr)
			}
			// this could probably be turned into a one-liner:
			newPtr := state.RelativeDescendant("items")
			subState.SetRelativeLocation(newPtr)
			for i, elem := range arr {
				if _, ok := state.LocalKeyword("additionalItems"); ok {
					state.SetPropertiesEvaluated("0")
					state.SetLocalPropertiesEvaluated("0")
					// These might be combined into some sort of "only increment if higher" setter
					if state.LastEvaluatedIndex() < i {
						state.SetEvaluatedIndex(i)
					}
					if state.LocalLastEvaluatedIndex() < i {
						state.SetLocalLastEvaluatedIndex(i)
					}
				}
				subState.ClearContext()
				newPtr = state.InstanceLocationDescendant(strconv.Itoa(i))
				subState.SetInstanceLocation(newPtr)
				// here it's clearer we're using a subState with a different data element:
				it.Schemas[0].ValidateKeyword(ctx, subState, elem)
				if _, ok := state.LocalKeyword("additionalItems"); ok {
					// TODO(arqu): this might clash with additionalProperties
					// should separate items out
					state.SetPropertiesEvaluated(subState.EvaluatedProperties()...)
					state.SetLocalPropertiesEvaluated(subState.LocalEvaluatedProperties()...)
				}
			}
		} else {
			for i, vs := range it.Schemas {
				if i < len(arr) {
					if _, ok := state.LocalKeyword("additionalItems"); ok {
						state.SetPropertyEvaluated(strconv.Itoa(i))
						state.SetLocalPropertyEvaluated(strconv.Itoa(i))
						// These might be combined into some sort of "only increment if higher" setter
						if state.LastEvaluatedIndex() < i {
							state.SetEvaluatedIndex(i)
						}
						if state.LocalLastEvaluatedIndex() < i {
							state.SetLocalLastEvaluatedIndex(i)
						}
					}
					subState.ClearContext()
					if newPtr := state.BaseRelativeDescendant("items", strconv.Itoa(i)); newPtr != nil {
						subState.SetBaseRelativeLocation(newPtr)
					}
					newPtr, _ := state.RelativeLocationDescendant("items", strconv.Itoa(i))
					subState.SetRelativeLocation(newPtr)
					newPtr = state.InstanceLocationDescendant(strconv.Itoa(i))
					subState.SetInstanceLocation(newPtr)

					vs.ValidateKeyword(ctx, subState, arr[i])
					if _, ok := state.LocalKeyword("additionalItems"); ok {
						state.SetPropertiesEvaluated(subState.EvaluatedProperties()...)
						state.SetLocalPropertiesEvaluated(subState.LocalEvaluatedProperties()...)
					}
				}
			}
		}
	}
}

The hard work of figuring out how to arrange this is done, all of these comments are just ergonomics, but important considering we're going to break the API. I think breaking the API is wholly appropriate with the transition to the 2019_09 spec.

@b5 b5 moved this from Needs Review to To do in Sprint J May 14, 2020
@Arqu Arqu moved this from To do to In progress in Sprint J May 14, 2020
@Arqu Arqu moved this from In progress to To do in Sprint J May 15, 2020
@b5 b5 moved this from To do to In progress in Sprint J May 20, 2020
@Arqu Arqu requested a review from b5 May 21, 2020 15:17
@Arqu Arqu moved this from In progress to Needs Review in Sprint J May 21, 2020
@Arqu Arqu changed the title refactor(jsonschema): reworking how we handle json schema WIP refactor(jsonschema): reworking how we handle json schema May 21, 2020
Sprint J automation moved this from Needs Review to Reviewer approved May 21, 2020
Copy link
Member

@b5 b5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🌟 🌟 🚀 🧑‍🚀 🚀 🌟 🌟
🔥 🔥 🎸 👩‍🎤 🎸 🔥 🔥
🚂 🚋 🚋 🚋 🚋 🚋 🚋

@b5
Copy link
Member

b5 commented May 21, 2020

Let's merge a version bump first, but THIS IS SO GOOD YAY @Arqu

@Arqu Arqu merged commit bb2a1cf into master May 21, 2020
Sprint J automation moved this from Reviewer approved to Done May 21, 2020
@Arqu Arqu deleted the refactoring-jsonschema branch May 21, 2020 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
No open projects
Sprint H
  
In progress
Sprint I
  
Needs Review
Sprint J
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

2 participants