
zql [1/n] - v1 of end-to-end implementation of zql #29

Merged 34 commits into main, Mar 15, 2024

Conversation

@tantaman commented Mar 9, 2024

For a high-level overview before diving into the code, read below.

The description below is a copy-paste of src/zql/README.md, so the links to files will not work while this is a PR.

ZQL

./query/EntityQuery.ts is the main entrypoint for everything query related:

  • building
  • preparing
  • running
  • and materializing queries

Creating an EntityQuery

First, build your schema for rails as you normally would:

const issueSchema = z.object({
  id: z.string(),
  title: z.string(),
  created: z.date(),
  status: z.enum(['active', 'inactive']),
});
type Issue = z.infer<typeof issueSchema>;

Then you can create a well-typed query builder

const query = new EntityQuery<Issue>(context, 'issue');
  • The first param to EntityQuery is the integration point between the query builder and Replicache. It provides the query builder with a way to gain access to the current Replicache instance and collections. See makeTestContext for an example.
  • The second param to EntityQuery is the same prefix parameter that is passed to rails generate. It is used to identify the collection being queried.

Note: this'll eventually be folded into a method returned by GenerateResult so users are not exposed to either parameter on EntityQuery.
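To make the context's role concrete, here is a hypothetical sketch of what such an integration point could provide. The `DemoSource`, `DemoContext`, and `getSource` names are invented for this illustration; the real shape is whatever makeTestContext builds, not this sketch.

```typescript
// Hypothetical sketch of a query context. `DemoSource` and `getSource`
// are invented names for illustration; see makeTestContext for the
// real integration point.
type Entity = {id: string};

interface DemoSource<T extends Entity> {
  add(value: T): void;
  values(): readonly T[];
}

interface DemoContext {
  // Look up a collection by the same name/prefix passed to EntityQuery.
  getSource<T extends Entity>(name: string): DemoSource<T>;
}

function makeDemoContext(): DemoContext {
  const collections = new Map<string, Entity[]>();
  return {
    getSource<T extends Entity>(name: string): DemoSource<T> {
      if (!collections.has(name)) collections.set(name, []);
      const rows = collections.get(name)! as T[];
      return {
        add: v => {
          rows.push(v);
        },
        values: () => rows,
      };
    },
  };
}

const ctx = makeDemoContext();
ctx.getSource<Entity>('issue').add({id: 'i1'});
```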

EntityQuery

./query/EntityQuery.ts

EntityQuery holds all the various query methods and is responsible for building the AST (./ast/ZqlAst.ts) to represent the query.

Example:

const derivedQuery = query
  .where(...)
  .join(...)
  .select('id', 'title', 'joined.thing', ...)
  .asc(...);

Under the hood, where, join, select, etc. are all making a copy of and updating the internal AST.

Key points:

  1. EntityQuery is immutable. Each method invoked on it returns a new query. This prevents queries that have been passed around from being modified out from under their users. This also makes it easy to fork queries that start from a common base.
  2. EntityQuery is a 100% type safe interface to the user. Layers below EntityQuery which are internal to the framework do need to ditch type safety in a number of places but, since the interface is typed, we know the types coming in are correct.
  3. The order in which methods are invoked on EntityQuery that return this does not and will not ever matter. All permutations will return the same AST and result in the same query.
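The copy-on-write behavior in point 1 can be sketched with a toy builder. This is illustrative only: the real EntityQuery copies and updates an AST, not a string list, and `QueryBuilder` here is an invented name.

```typescript
// Toy illustration of the immutable, copy-on-write builder pattern.
// Each method returns a NEW instance; nothing is mutated in place.
class QueryBuilder {
  private constructor(private readonly clauses: readonly string[]) {}

  static create(): QueryBuilder {
    return new QueryBuilder([]);
  }

  where(clause: string): QueryBuilder {
    // Copy the internal state rather than mutating `this`.
    return new QueryBuilder([...this.clauses, `WHERE ${clause}`]);
  }

  toString(): string {
    return this.clauses.join(' ');
  }
}

// Forking from a common base is safe: the forks cannot interfere
// with each other or with `base`.
const base = QueryBuilder.create();
const active = base.where("status = 'active'");
const closed = base.where("status = 'closed'");
```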

Once a user has built a query they can turn it into a prepared statement.

Prepared Statements

./query/Statement.ts

A prepared statement is used to:

  1. Manage the lifetime of a query
  2. In the future, change bindings of the query
  3. De-duplicate queries
  4. Materialize a query
const stmt = derivedQuery.prepare();

Lifetime - A statement will subscribe to its input sources when it is subscribed or when it is materialized into a view. For this reason, statements must be cleaned up by calling destroy.

Bindings - not yet implemented. See the ZQL design doc.

Query de-duplication - not yet implemented. See the ZQL design doc.

Materialization - the process of running the query and, optionally, keeping that query's results up to date. Materialization can be 1-shot or continually maintained.
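A toy model of these lifetime rules, with invented `Demo*` names (the real API lives in ./query/Statement.ts): the statement subscribes to its source only when materialized, and must release that subscription via destroy.

```typescript
// Illustrative model of statement lifetime: subscribe lazily on
// materialization, clean up on destroy. Names are invented for this sketch.
type Unsubscribe = () => void;

class DemoSource {
  private listeners = new Set<(row: unknown) => void>();

  subscribe(fn: (row: unknown) => void): Unsubscribe {
    this.listeners.add(fn);
    return () => this.listeners.delete(fn);
  }

  get listenerCount(): number {
    return this.listeners.size;
  }
}

class DemoStatement {
  private cleanup: Unsubscribe | undefined;

  constructor(private readonly source: DemoSource) {}

  materialize(): void {
    // Subscription happens here, not at construction; repeat calls are no-ops.
    this.cleanup ??= this.source.subscribe(() => {});
  }

  destroy(): void {
    // Required cleanup: drop the subscription to the input source.
    this.cleanup?.();
    this.cleanup = undefined;
  }
}

const src = new DemoSource();
const stmt = new DemoStatement(src);
```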

Prepared Statement Creation

./ast-to-ivm/pipelineBuilder.ts

When the user calls query.prepare() the AST held by the query is converted into a differential dataflow graph.

The resulting graph/pipeline is held by the prepared statement. The pipelineBuilder is responsible for performing this conversion.

High-level notes on dataflow

The pipeline builder walks the AST:

  1. Tables encountered via FROM and JOIN are added as sources to the graph.
  2. Each JOIN adds a join operator that joins the two mentioned sources.
  3. WHERE conditions are added as filters against the sources.
  4. SELECT statements are added as map operations to re-shape the results.
  5. ORDER BY and LIMIT are retained to be passed to the source provider, the view, or both.
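The walk above can be sketched as a function from a toy AST to a list of pipeline stages. This is illustrative only; the real pipelineBuilder in ./ast-to-ivm/pipelineBuilder.ts emits dataflow operators, not strings, and `DemoAST` is an invented shape.

```typescript
// Invented, simplified AST shape for illustration; the real AST lives
// in ./ast/ZqlAst.ts and is richer than this.
type DemoAST = {
  table: string;
  where?: {field: string; op: string; value: unknown};
  select?: string[];
};

// Walk the AST top-down, emitting one "stage" per clause, in the same
// order as the walk described above: source, then filter, then map.
function buildDemoPipeline(ast: DemoAST): string[] {
  const stages: string[] = [`source(${ast.table})`];
  if (ast.where) {
    const {field, op, value} = ast.where;
    stages.push(`filter(${field} ${op} ${String(value)})`);
  }
  if (ast.select) {
    stages.push(`map(pick ${ast.select.join(',')})`);
  }
  return stages;
}

const stages = buildDemoPipeline({
  table: 'issue',
  where: {field: 'status', op: '=', value: 'active'},
  select: ['id', 'title'],
});
```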

Dataflow Internals: Source, DifferenceStream, Operator, View

Also see: ./ivm/README.md

The components, in code, that make up the dataflow graph are:

  1. ISource
    1. MemorySource
    2. StatelessSource
  2. DifferenceStream
    1. DifferenceStreamReader
    2. DifferenceStreamWriter
  3. Operator
    1. Join (handles JOIN)
    2. Map (handles SELECT as well as function application)
    3. Reduce (handles aggregates like GroupBy)
    4. Filter (handles WHERE and ON statements or HAVING when applied after a reduction)
    5. LinearCount (handles COUNT)
    6. Or
    7. And
    8. Union
    9. Intersect
  4. View
    1. ValueView
    2. TreeView

Conspicuously absent are LIMIT and ORDER BY. These are handled either by the sources or views. A future section is devoted to these two.

The above components would be composed into a graph like so:

(diagram: the above components composed into a dataflow graph)

Query Execution

Query execution, from scratch, and incremental maintenance are nearly identical processes.

The dataflow graph represents the execution plan for a query. Executing a query is then a simple matter of sending all rows from all sources through the graph.

See Statement.test.ts for examples of queries being run from scratch.

Execution can be optimized in the case where a limit and cursor are provided (not yet implemented here).

In other words, if:

  1. the final view has the same ordering as the source and
  2. the query specifies a cursor

we can jump to that set of rows rather than feeding all rows through the graph. If a limit is specified, we can stop reading rows once we hit the limit.

The limit functionality is implemented by making Multiset lazy. See Multiset.ts for how this is currently implemented. The limited view that is pulling values from a multiset can stop without all values being visited.
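The laziness idea can be sketched with a generator of [value, multiplicity] entries: a limited consumer stops pulling, and the rest of the multiset is never materialized. (Sketch only; see Multiset.ts for the actual representation.)

```typescript
// A multiset entry pairs a value with its multiplicity.
type Entry<T> = readonly [T, number];

// An infinite lazy multiset. Safe only because consumers pull lazily.
function* demoMultiset(): Generator<Entry<number>> {
  for (let i = 0; ; i++) {
    yield [i, 1] as const;
  }
}

// A limited consumer: stops iterating once `limit` values are produced,
// so entries past the limit are never visited.
function takeLimit<T>(iter: Iterable<Entry<T>>, limit: number): T[] {
  const out: T[] = [];
  for (const [value, mult] of iter) {
    for (let i = 0; i < mult && out.length < limit; i++) {
      out.push(value);
    }
    if (out.length >= limit) break; // stop pulling from the generator
  }
  return out;
}
```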

Index selection is not yet implemented. E.g., queries of the form:

Issue.where('id', '=', x);
Issue.where('id', '>', x);

should be point lookups or range scans against the primary key rather than full scans.

If a view's order does not match a source's order, we will (not yet implemented here) create a new version of the source that is in the view's order. This source will be maintained in concert with the original source and used for any queries that need the given ordering.

Incremental Maintenance

Incremental maintenance is simply a matter of feeding each write through the graph as it happens. The graph will produce the correct aggregate result at the end.

The details of how that works are specific to individual operators. For operators that only use the current row, like map & filter, it is trivial. They just emit their result. For join and reduce (not implemented yet) it is more complex.

What has been implemented here lays the groundwork for join & reduce, which is why it isn't as simple as a system that only needs to support map & filter.
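For the trivial per-row case, incremental maintenance can be sketched as pushing (value, multiplicity) deltas through a filter into a running count, in the style of LinearCount. The names below are invented for the example; a multiplicity of -1 represents a delete.

```typescript
// A delta pairs a value with a multiplicity change: +1 insert, -1 delete.
type Delta<T> = readonly [T, number];

// A tiny filter-then-count operator. Each incoming write updates the
// aggregate directly; the query is never re-run from scratch.
function makeFilterCount<T>(pred: (v: T) => boolean) {
  let count = 0;
  return {
    push(delta: Delta<T>): number {
      const [value, mult] = delta;
      if (pred(value)) count += mult;
      return count; // current maintained result
    },
  };
}

const activeCount = makeFilterCount<string>(s => s === 'active');
```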

Sources: Stateful vs Stateless

Sources model tables. A source can come in stateful or stateless variants.

A stateless source cannot return historical data to queries that are subscribed after the source was created.

A stateful source knows its contents. When a dataflow graph is attached to it, the full contents of the source are sent through the graph, effectively running the query against historical data.

Views: ValueView, TreeView

There are currently two kinds of views: ValueView and TreeView.

ValueView maintains the result of a count query.

TreeView maintains select queries.

count and select are distinct queries. This implementation does not support returning a count with other selected columns.

TreeView holds a comparator which uses the columns provided to Order By to sort its contents. If no Order By is specified, the items are ordered by id. Any time an Order By is specified that does not include the id, the id is appended as the final ordering key. This is so we get a stable sort order when users sort on columns that are not unique.
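The id tie-break can be sketched as follows (hypothetical `Row` and `makeComparator` names; the real comparator lives in TreeView):

```typescript
// Toy row type for the example; only `id` is assumed to be unique.
type Row = {id: string; title: string};

// Compare on the requested column first, then fall back to `id` so the
// resulting order is total and stable even when the column has duplicates.
function makeComparator(field: keyof Row) {
  return (a: Row, b: Row): number => {
    if (a[field] < b[field]) return -1;
    if (a[field] > b[field]) return 1;
    // Tie-break on id.
    return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
  };
}

const rows: Row[] = [
  {id: 'b', title: 'same'},
  {id: 'a', title: 'same'},
];
// Both titles are equal, so the id tie-break decides the order.
const sorted = [...rows].sort(makeComparator('title'));
```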

Views can be subscribed to if a user wishes to be notified whenever the view is updated.

Aaron was pretty adamant about using native JS collections, hence the value property on TreeView returns a JS array. This is fine for cases where the view has a limit but for cases where you want a view of thousands+ of items I'd recommend the PersistentTreapView (not available here, but in Materialite).

OR, Parenthesis & Breadth First vs Depth First Computation

This PR does not support OR or nested conditions but does lay the groundwork for them by executing the dataflow graph breadth first rather than depth first.

This is the reason for the split of dataflow events between enqueue and notify or run and notify for operators.

See the commentary on IOperator in Operator.ts

Transactions

The IVM system here has a concept of a transaction. It enables:

  1. Many sources and values to be written before updating query results
  2. Only notifying query subscribers after all queries have been updated with the results of the writes made in that transaction.

See commentary in ISourceInternal in ISource.ts as well as on IOperator in Operator.ts.
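A toy version of the transaction rule: buffer writes, then notify subscribers exactly once after the whole batch commits. The `DemoTxSource` name and shape are invented for this sketch; see ISourceInternal and IOperator for the real mechanics.

```typescript
// Illustrative transactional source: writes accumulate in `pending` and
// subscribers are notified once, at commit, never per-write.
class DemoTxSource<T> {
  private pending: T[] = [];
  private committed: T[] = [];
  private listeners = new Set<(rows: readonly T[]) => void>();

  onCommit(fn: (rows: readonly T[]) => void): void {
    this.listeners.add(fn);
  }

  write(row: T): void {
    this.pending.push(row); // buffered; no notification yet
  }

  commit(): void {
    this.committed.push(...this.pending);
    this.pending = [];
    // One notification for the whole transaction's worth of writes.
    for (const fn of this.listeners) fn(this.committed);
  }
}

const txSource = new DemoTxSource<string>();
let notifications = 0;
let seen: readonly string[] = [];
txSource.onCommit(rows => {
  notifications++;
  seen = rows;
});
txSource.write('a');
txSource.write('b'); // still no notification
txSource.commit();
```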


(demo video: zkl-compressed.mov)

- select
- where
- limit
- orderBy
- count

up later:
- join
- groupBy
- subSelect

but first we'll generate the AST for these operators and get them working e2e in Replicache.
This'll serve as the basis for MemorySource and then integration with Replicache.
A connection to replicache is provided by a `Context` param, injected into queries.

Rails will create query instances and will be responsible for injecting this param.

The context provides:
- a way to look up sources based on table/collection name
- a materialite instance for the given Replicache instance
- orderBy fields are added as a hidden field on the mapped object
- sources are created in the desired order
Contributor

@aboodman aboodman left a comment


I only looked at the first few commits and my comments are not substantive.

My main question that I want to understand is where, if at all, copies of entity data are happening from the source since it's so important to minimize them for performance. I hope that we can make this work in a zero-copy way, except for creating the final array/map to return to caller.


The components, in code, that make up the dataflow graph are:

1. [ISource](./ivm/source/ISource.ts)
Contributor


The most nit, we don't usually label our interfaces with a prefix I.

@@ -0,0 +1,20 @@
export type Edge<TSrc extends EntitySchema, TDst extends EntitySchema> = {
Contributor


Now that I've read the design doc, I understand what this type does, but I think some comments would be useful.

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Basically it's the same thing again about the "Edge" and "Node" terminology being unfamiliar. I wasn't sure if this was a node/edge in the query graph, or the data flow graph or ...).

Author


changed it to Relationship

@@ -0,0 +1,40 @@
/* eslint-disable @typescript-eslint/ban-types */
Contributor


Gonna take your word for it on this file. 😂

readonly prepare: () => IStatement<TReturn>;
}

export interface IStatement<TReturn> {
Contributor


Nittiest Nit, but can we just say Statement? IStatement is very dot-net.


Views can be subscribed to if a user wishes to be notified whenever the view is updated.

Aaron was pretty adamant about using native JS collections, hence the `value` property on `TreeView` returns a JS array. This is fine for cases where the view has a `limit` but for cases where you want a view of thousands+ of items I'd recommend the `PersistentTreapView` (not available here, but in Materialite).
Contributor


Hey, here I am. I go back and forth about this – it's possible I'm wrong. My main concern is the dx – the dx of true arrays is so nice because JS has so many language helpers and so on for them. When @arv gets in here he'll have opinions I'm sure and may overrule.

I think that in practice we do not want people querying thousands of items, they should instead page them in < 1k at a time. But I can think of use cases for querying thousands of items at once.

@aboodman
Contributor

This is just absurdly, mind-numbingly exciting btw. Absolutely on the edge of my seat to start playing with this.

@arv
Contributor

arv commented Mar 10, 2024

Any advice on reviewing this? Is it better to review commit by commit or just review the final results of the pr?

@aboodman
Contributor

aboodman commented Mar 10, 2024 via email

@tantaman
Author

Any advice on reviewing this? Is it better to review commit by commit or just review the final results of the pr?

It's up to you guys. I can split this up into a PR per commit for review purposes and stack all of those.

@tantaman tantaman marked this pull request as ready for review March 11, 2024 16:08

Would create 1k pipelines since we don't collapse over slots in stage 2.

## Stage 3: Pipeline per Unique Unbound ZQL Query
Contributor


I'm on the edge of my seat for Stages 3+ :)

);
}

where<K extends keyof S['fields']>(
Contributor

@grgbkr commented Mar 11, 2024


I had been thinking you would have to select a field in order to filter on it in where. Of course that is not how SQL works, and I think what you have here makes more sense. It will complicate slightly the logic for determining which columns to sync to the client, but I don't think by much.

Contributor


We will also need logic for determining the 'required columns' for a query on the client, so that we can filter the source to just entities that have the 'required columns'. I'm imagining this would be the very first step in the pipeline.

readonly count: () => IEntityQuery<TSchema, number>;
readonly where: <K extends keyof TSchema['fields']>(
f: K,
op: Operator,
Contributor


Which operators are valid is dependent on the type of the field, and then the type for value is dependent on the type of the operator.

I don't have the full typing worked out, but I think instead of just Operator, we need
NumberOperator, StringOperator, BooleanOperator, and eventually SubqueryOperator.

Author


Which operators are valid is dependent on the type of the field, and then the type for value is dependent on the type of the operator.

Yeah. Let me see what I can work out.

Contributor

@arv arv left a comment


I will do another high level pass tomorrow.

@tantaman tantaman changed the title v1 of end-to-end implementation of zql zql - v1 of end-to-end implementation of zql Mar 13, 2024
@tantaman
Author

My main question that I want to understand is where, if at all, copies of entity data are happening from the source since it's so important to minimize them for performance. I hope that we can make this work in a zero-copy way, except for creating the final array/map to return to caller. - @aboodman

As you saw, it does create a new object in order to apply the selection set. We could omit this step.

Where I hook it into Replicache it uses experimentalWatch and dumps the data returned by that method into a tree.

Maybe this can be omitted too? Since, presumably, Replicache already has all of this stuff in-memory? I'll need to dive into the Replicache internals to find out or get some pointers from @arv / @grgbkr

@aboodman
Contributor

As you saw, it does create a new object in order to apply the selection set. We could omit this step.

The ultimate arbiter is going to be framerate in Repliear. Before, you managed to beat our carefully built code using abstractions that I felt would be expensive. So maybe @arv is right and we build it the reasonable way and see how it performs.

But I'm warning you now that I'm likely to keep pestering about taking copies off the read path since I know it will make us faster and it's like free performance.

Where I hook it into Replicache it uses experimentalWatch and dumps the data returned by that method into a tree.

Maybe this can be omitted too? Since, presumably, Replicache already has all of this stuff in-memory? I'll need to dive into the Replicache internals to find out or get some pointers from @arv / @grgbkr

Yes, Replicache maintains a memory cache and is very careful to take the object returned by IDB, stick it in the cache, and pass it to the user with no copies. This is the way we have found is fastest historically.

@tantaman tantaman changed the title zql - v1 of end-to-end implementation of zql zql [1/n] - v1 of end-to-end implementation of zql Mar 14, 2024
but since declaration merging is also banned, we move Statement -> StatementImpl
@tantaman
Author

tantaman commented Mar 14, 2024

@arv - lmk if this is close enough to merge.

The big picture should all come together when join is implemented and the integration tests are built out (#33).

map & filter aren't too interesting / this is a lot more than needed for just map & filter since it lays groundwork for future iterations. You can make a pipeline that is only map & filter incremental in a few lines of code.

You do start to need all the complications of a graph once you want to:

  • join different pipelines together
  • share common sections
  • introduce or, union, intersect & except to bring different pipelines together

other big picture references:

Linear with 1 million items, updating in realtime:
https://vlcn-io.github.io/materialite/linearite/

source: https://github.com/vlcn-io/materialite/tree/main/demos/linearite

(demo video: linear-mil.mov)

Walkthrough of React bindings:
https://github.com/vlcn-io/materialite/blob/main/demos/react/walkthrough/walkthrough.md

Benchmarks:
https://observablehq.com/@tantaman/materialite

@arv
Contributor

arv commented Mar 14, 2024

Let's land this. It will be easier to make changes to main in the future.

@tantaman tantaman merged commit b5424f1 into main Mar 15, 2024
4 checks passed
@tantaman tantaman deleted the mlaw/zql-ivm branch March 15, 2024 01:34
@aboodman
Contributor

aboodman commented Mar 15, 2024 via email
