Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #12: Implement Schema.serializable #83

Merged
merged 32 commits into from
Jul 12, 2021
Merged

Conversation

thinkharderdev
Copy link
Contributor

Create a MetaSchema data structure to represent the abstract structure of a schema and an associated Schema.Meta(spec: MetaSchema) extends Schema[Schema[_]] which is the "serialized" representation of a Schema[A].

This is about 90% implemented. The only outstanding items are to support diffing of Schema.Meta and add some additional unit tests.

Limitations and known issues:

  1. The main limitation is that we lose all information about a Schema[A] except for the structure when serializing it (since we cannot serialized transform, extractFieldN, deconstruct, construct lambdas). As such the reified Schema that you get when you deserialize a Schema.Meta is always going to be a generic version of the original (i.e. Schema.CaseClassN -> Schema.GenericRecord and Schema.EnumN -> Schema.Enumeration).
  2. It's not really obvious what we should do with Schema.CaseObject and the current approach discards all type information about the original object represented by the schema.

@thinkharderdev thinkharderdev requested a review from a team as a code owner June 27, 2021 13:27
@thinkharderdev thinkharderdev marked this pull request as draft June 27, 2021 13:28
@jdegoes
Copy link
Member

jdegoes commented Jun 28, 2021

@thinkharderdev Thanks for working on this!

  1. Rather than add a new subtype of Schema, I think you can implement Schema#serializable just by dynamically building up a new Schema which does not include Transform nodes (and which genericizes enums and records, I suppose). Now, this is not "type safe", because there is no way to know, looking at the schema, that it can be serialized and deserialized. But if we want to address that, we can in other ways (e.g. phantom type, path-dependent type, fixed point data, etc.).
  2. I think the problem with case object is suggesting we do not have a correct representation for that. Probably, we need to modify CaseObject to store an id (as a String?), and then transform it. Or maybe change it to mean some fixed, static value, which can then be transformed to the case object. The key thing is we are missing a generic notion of "this is fixed, static data" in a way that doesn't bundle it with a user-defined data type; whereas for records and enumerations, we have the ability to "lose" the user-defined data types by genericizing (converting to generic records / generic enumerations with just the labels and schemas for the terms).

@jdegoes
Copy link
Member

jdegoes commented Jun 28, 2021

Related:

zio/zio-json#307

Case objects are isomorphic to unit. So maybe that's another way to handle them: represent them as unit, which is then transformed to the case object. In a sum of case objects, the term ids would be the case object names (names of the subtype of the sum type).

case Schema.Sequence(schema, _, _) => Sequence(fromSchema(schema))
case Schema.Fail(message) => Fail(message)
case Schema.Transform(schema, _, _) => fromSchema(schema)
case lzy @ Schema.Lazy(_) => fromSchema(lzy.schema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should test this on a schema like JSON which is recursive. I think our "schema language" may have to embed references in order to handle recursion (or some sort of "fixed point" operator).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is going to be a problem. I can work on this today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JSON Schema doc uses "pointers" for recursion.

To track recursion, you can use a map based on object identity. This will allow you to detect if you processed the same field before. It might also be possible to define equals / hashCode on the ordinary Schema classes in such a way that it terminates even for recursive structures.

final case class Duration(units: TemporalUnit) extends MetaSchema

object Duration {
implicit val schema: Schema[Duration] = Schema[String].transformOrFail(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice!

implicit val chunkSchema: Schema[Chunk[MetaCase]] = Schema.chunk(schema)
}

final case class Value(valueType: StandardType[_]) extends MetaSchema
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the added type safety. ❤️

@thinkharderdev thinkharderdev marked this pull request as ready for review July 2, 2021 19:17
This allows us to check for cyclic references while building the tree
and replace recursive references with pointers.

Also fixed a bug with JSON encoding/decoding of recursive data types and
fixed derivation of recursive ADTs by inserting laziness into the
implicit conversions.
@thinkharderdev
Copy link
Contributor Author

Went back to the drawing board and came up with what I think is a much better encoding for the schema serialization.

Modelling the the "meta schema" as an AST now which has two nice benefits:

  1. We can handle recursive references when we are building the tree and replace recursive subtrees with a pointer to the ancestor.
  2. The resulting JSON encoding is more intuitive and human-readable. I think it will more naturally transform to other encodings (JSON schema, Avro schema, etc) as well (I think....)

Also fixed the issue with JSON serialization/deserialization which turns out was just a regular bug. Still having an issue with the protobuf codecs on recursive data types but I assume that is just a bug as well. Need to do some more debugging though to be sure.

@thinkharderdev thinkharderdev marked this pull request as draft July 4, 2021 14:41
@thinkharderdev thinkharderdev marked this pull request as ready for review July 9, 2021 23:06
case (_: Schema.Lazy[_], _) => true // equalsAst(expected.schema, actual)
case (_, _: Schema.Lazy[_]) => true // equalsAst(expected, actual.schema)
case _ => false
equalsAst(expected, actual, depth)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Laziness can be tested via reference equality, assuming we are not making mistakes anywhere. It should not be tested for structural equality since that will lead to infinite checks (= stack overflows).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think reference equality isn't what we want here. We need to test that that expected and actual have the same AST. They may not have (and in most cases we care about w/r/t tests will not) have the same type.


import zio.{ Chunk, ChunkBuilder }

sealed trait Ast { self =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern here is with the name Ast. Because we are in the zio.schema package, it would be nice if we can shoot for import zioi.schema._ in most code bases. But the name Ast is quite common.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good paint. Maybe SchemaAst?

case Schema.Meta(ast) => ast
}

def subtree(schema: Schema[_], lineage: Chunk[Int], optional: Boolean = false, dimensions: Int = 0): Ast =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this one package private?

}
}

def materialize(ast: Ast, refs: Map[Int, Ast] = Map.empty): Schema[_] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Package private?


}

object AstRenderer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Package private?

}
}

implicit lazy val schema: Schema[Ast] = DeriveSchema.gen[Ast]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe expand for long-term stability? What do you think?

@jdegoes
Copy link
Member

jdegoes commented Jul 12, 2021

A giant undertaking. Let's do any revisions as followups. Thanks for your work on this!

@jdegoes jdegoes merged commit f1f6b31 into zio:main Jul 12, 2021
landlockedsurfer pushed a commit to landlockedsurfer/zio-schema that referenced this pull request May 28, 2022
* Remove setup gpg

* Update setup-java

* Test

* test2

* test 3

* Fixes zio#78: Implement diffing between instances described by a schema

* Specs and additional implementations

* Rename DiffAlgorithm to Differ

* Add diffs for product and sum types

* Copy Myers diff implementation from zio-test

* Implement binary diff and unit tests

* Fix various temporal diffs and add unit tests

* formatting

* Test coverage and tweaks

* Fixes zio#12: Implement Schema.serializable

* Formatting

* Remove commented code

* Refactoring for clarity and type safety

* Refactor for clarity

* Test coverage and (ignored) test cases for recursive data types

* linting

* Better encoding of serializable schema with an Abstract Syntax Tree.
This allows us to check for cyclic references while building the tree
and replace recursive references with pointers.

Also fixed a bug with JSON encoding/decoding of recursive data types and
fixed derivation of recursive ADTs by inserting laziness into the
implicit conversions.

* Fix bugs in protobuf decoding

* 2.12 does not suport by-name implicit parameters

* Re-enable dynamic value test for recursive data types

* AST materialization and unit tests

* Remove ref method from Schema

* Preserve ref map on recursive materializations

* Add multi-dimensionality to schema to account for sequences of sequences

* Do AST comparison on Lazy schemas by limiting stack depth to avoid infinite recurse

* Rename Ast to SchemaAst to avoid conflicts

* Appease the 2.12 compiler

Co-authored-by: thinkharder <thinkharderdev@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants