
zql [1/n] - v1 of end-to-end implementation of zql #29

Merged 34 commits into main, Mar 15, 2024

Conversation

@tantaman commented Mar 9, 2024

For a high-level overview before diving into the code, read below.

The description below is a copy-paste of src/zql/README.md, so the links to files will not work while this is a PR.

ZQL

./query/EntityQuery.ts is the main entrypoint for everything query related:

  • building
  • preparing
  • running
  • and materializing queries

Creating an EntityQuery

First, build your schema for rails as you normally would:

const issueSchema = z.object({
  id: z.string(),
  title: z.string(),
  created: z.date(),
  status: z.enum(['active', 'inactive']),
});
type Issue = z.infer<typeof issueSchema>;

Then you can create a well-typed query builder

const query = new EntityQuery<Issue>(context, 'issue');
  • The first param to EntityQuery is the integration point between the query builder and Replicache. It provides the query builder with a way to gain access to the current Replicache instance and collections. See makeTestContext for an example.
  • The second param to EntityQuery is the same prefix parameter that is passed to rails generate. It is used to identify the collection being queried.

Note: this'll eventually be folded into a method returned by GenerateResult so users are not exposed to either parameter on EntityQuery.
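To make the context's role concrete, here is a hypothetical sketch of what such an integration point could provide. The `DemoSource`, `DemoContext`, and `getSource` names are invented for this illustration; the real shape is whatever makeTestContext builds, not this sketch.

```typescript
// Hypothetical sketch of a query context. `DemoSource` and `getSource`
// are invented names for illustration; see makeTestContext for the
// real integration point.
type Entity = {id: string};

interface DemoSource<T extends Entity> {
  add(value: T): void;
  values(): readonly T[];
}

interface DemoContext {
  // Look up a collection by the same name/prefix passed to EntityQuery.
  getSource<T extends Entity>(name: string): DemoSource<T>;
}

function makeDemoContext(): DemoContext {
  const collections = new Map<string, Entity[]>();
  return {
    getSource<T extends Entity>(name: string): DemoSource<T> {
      if (!collections.has(name)) collections.set(name, []);
      const rows = collections.get(name)! as T[];
      return {
        add: v => {
          rows.push(v);
        },
        values: () => rows,
      };
    },
  };
}

const ctx = makeDemoContext();
ctx.getSource<Entity>('issue').add({id: 'i1'});
```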

EntityQuery

./query/EntityQuery.ts

EntityQuery holds all the various query methods and is responsible for building the AST (./ast/ZqlAst.ts) to represent the query.

Example:

const derivedQuery = query
  .where(...)
  .join(...)
  .select('id', 'title', 'joined.thing', ...)
  .asc(...);

Under the hood, where, join, select, etc. are all making a copy of and updating the internal AST.

Key points:

  1. EntityQuery is immutable. Each method invoked on it returns a new query. This prevents queries that have been passed around from being modified out from under their users. This also makes it easy to fork queries that start from a common base.
  2. EntityQuery is a 100% type safe interface to the user. Layers below EntityQuery which are internal to the framework do need to ditch type safety in a number of places but, since the interface is typed, we know the types coming in are correct.
  3. The order in which methods are invoked on EntityQuery that return this does not and will not ever matter. All permutations will return the same AST and result in the same query.
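The copy-on-write behavior in point 1 can be sketched with a toy builder. This is illustrative only: the real EntityQuery copies and updates an AST, not a string list, and `QueryBuilder` here is an invented name.

```typescript
// Toy illustration of the immutable, copy-on-write builder pattern.
// Each method returns a NEW instance; nothing is mutated in place.
class QueryBuilder {
  private constructor(private readonly clauses: readonly string[]) {}

  static create(): QueryBuilder {
    return new QueryBuilder([]);
  }

  where(clause: string): QueryBuilder {
    // Copy the internal state rather than mutating `this`.
    return new QueryBuilder([...this.clauses, `WHERE ${clause}`]);
  }

  toString(): string {
    return this.clauses.join(' ');
  }
}

// Forking from a common base is safe: the forks cannot interfere
// with each other or with `base`.
const base = QueryBuilder.create();
const active = base.where("status = 'active'");
const closed = base.where("status = 'closed'");
```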

Once a user has built a query they can turn it into a prepared statement.

Prepared Statements

./query/Statement.ts

A prepared statement is used to:

  1. Manage the lifetime of a query
  2. In the future, change bindings of the query
  3. De-duplicate queries
  4. Materialize a query
const stmt = derivedQuery.prepare();

Lifetime - A statement will subscribe to its input sources when it is subscribed or when it is materialized into a view. For this reason, statements must be cleaned up by calling destroy.

Bindings - not yet implemented. See the ZQL design doc.

Query de-duplication - not yet implemented. See the ZQL design doc.

Materialization - the process of running the query and, optionally, keeping that query's results up to date. Materialization can be 1-shot or continually maintained.
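A toy model of these lifetime rules, with invented `Demo*` names (the real API lives in ./query/Statement.ts): the statement subscribes to its source only when materialized, and must release that subscription via destroy.

```typescript
// Illustrative model of statement lifetime: subscribe lazily on
// materialization, clean up on destroy. Names are invented for this sketch.
type Unsubscribe = () => void;

class DemoSource {
  private listeners = new Set<(row: unknown) => void>();

  subscribe(fn: (row: unknown) => void): Unsubscribe {
    this.listeners.add(fn);
    return () => this.listeners.delete(fn);
  }

  get listenerCount(): number {
    return this.listeners.size;
  }
}

class DemoStatement {
  private cleanup: Unsubscribe | undefined;

  constructor(private readonly source: DemoSource) {}

  materialize(): void {
    // Subscription happens here, not at construction; repeat calls are no-ops.
    this.cleanup ??= this.source.subscribe(() => {});
  }

  destroy(): void {
    // Required cleanup: drop the subscription to the input source.
    this.cleanup?.();
    this.cleanup = undefined;
  }
}

const src = new DemoSource();
const stmt = new DemoStatement(src);
```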

Prepared Statement Creation

./ast-to-ivm/pipelineBuilder.ts

When the user calls query.prepare() the AST held by the query is converted into a differential dataflow graph.

The resulting graph/pipeline is held by the prepared statement. The pipelineBuilder is responsible for performing this conversion.

High-level notes on dataflow

The pipeline builder walks the AST:

  1. Tables encountered via FROM and JOIN are added as sources to the graph.
  2. Each JOIN adds a join operator that joins the two mentioned sources.
  3. WHERE conditions are added as filters against the sources.
  4. SELECT statements are added as map operations to re-shape the results.
  5. ORDER BY and LIMIT are retained to be passed to the source provider, the view, or both.
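The walk above can be sketched as a function from a toy AST to a list of pipeline stages. This is illustrative only; the real pipelineBuilder in ./ast-to-ivm/pipelineBuilder.ts emits dataflow operators, not strings, and `DemoAST` is an invented shape.

```typescript
// Invented, simplified AST shape for illustration; the real AST lives
// in ./ast/ZqlAst.ts and is richer than this.
type DemoAST = {
  table: string;
  where?: {field: string; op: string; value: unknown};
  select?: string[];
};

// Walk the AST top-down, emitting one "stage" per clause, in the same
// order as the walk described above: source, then filter, then map.
function buildDemoPipeline(ast: DemoAST): string[] {
  const stages: string[] = [`source(${ast.table})`];
  if (ast.where) {
    const {field, op, value} = ast.where;
    stages.push(`filter(${field} ${op} ${String(value)})`);
  }
  if (ast.select) {
    stages.push(`map(pick ${ast.select.join(',')})`);
  }
  return stages;
}

const stages = buildDemoPipeline({
  table: 'issue',
  where: {field: 'status', op: '=', value: 'active'},
  select: ['id', 'title'],
});
```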

Dataflow Internals: Source, DifferenceStream, Operator, View

Also see: ./ivm/README.md

The components, in code, that make up the dataflow graph are:

  1. ISource
    1. MemorySource
    2. StatelessSource
  2. DifferenceStream
    1. DifferenceStreamReader
    2. DifferenceStreamWriter
  3. Operator
    1. Join (handles JOIN)
    2. Map (handles SELECT as well as function application)
    3. Reduce (handles aggregates like GroupBy)
    4. Filter (handles WHERE and ON statements or HAVING when applied after a reduction)
    5. LinearCount (handles COUNT)
    6. Or
    7. And
    8. Union
    9. Intersect
  4. View
    1. ValueView
    2. TreeView

Conspicuously absent are LIMIT and ORDER BY. These are handled either by the sources or views. A future section is devoted to these two.

The above components would be composed into a graph like so:

(diagram: the above components composed into a dataflow graph)

Query Execution

Query execution, from scratch, and incremental maintenance are nearly identical processes.

The dataflow graph represents the execution plan for a query. Executing a query is then a simple matter of sending all rows from all sources through the graph.

See Statement.test.ts for examples of queries being run from scratch.

Execution can be optimized in the case where a limit and cursor are provided (not yet implemented here).

In other words, if:

  1. the final view has the same ordering as the source and
  2. the query specifies a cursor

we can jump to that set of rows rather than feeding all rows through the graph. If a limit is specified, we can stop reading rows once we hit the limit.

The limit functionality is implemented by making Multiset lazy. See Multiset.ts for how this is currently implemented. The limited view that is pulling values from a multiset can stop without all values being visited.
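The laziness idea can be sketched with a generator of [value, multiplicity] entries: a limited consumer stops pulling, and the rest of the multiset is never materialized. (Sketch only; see Multiset.ts for the actual representation.)

```typescript
// A multiset entry pairs a value with its multiplicity.
type Entry<T> = readonly [T, number];

// An infinite lazy multiset. Safe only because consumers pull lazily.
function* demoMultiset(): Generator<Entry<number>> {
  for (let i = 0; ; i++) {
    yield [i, 1] as const;
  }
}

// A limited consumer: stops iterating once `limit` values are produced,
// so entries past the limit are never visited.
function takeLimit<T>(iter: Iterable<Entry<T>>, limit: number): T[] {
  const out: T[] = [];
  for (const [value, mult] of iter) {
    for (let i = 0; i < mult && out.length < limit; i++) {
      out.push(value);
    }
    if (out.length >= limit) break; // stop pulling from the generator
  }
  return out;
}
```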

Index selection is not yet implemented. E.g., queries of the form:

Issue.where('id', '=', x);
Issue.where('id', '>', x);

should be point lookups or range scans against the primary key rather than full scans.

If a view's order does not match a source's order, we will (not yet implemented here) create a new version of the source that is in the view's order. This source will be maintained in concert with the original source and used for any queries that need the given ordering.

Incremental Maintenance

Incremental maintenance is simply a matter of feeding each write through the graph as it happens. The graph will produce the correct aggregate result at the end.

The details of how that works are specific to individual operators. For operators that only use the current row, like map & filter, it is trivial. They just emit their result. For join and reduce (not implemented yet) it is more complex.

What has been implemented here lays the groundwork for join & reduce, which is why it isn't as simple as a system that only needs to support map & filter.
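For the trivial per-row case, incremental maintenance can be sketched as pushing (value, multiplicity) deltas through a filter into a running count, in the style of LinearCount. The names below are invented for the example; a multiplicity of -1 represents a delete.

```typescript
// A delta pairs a value with a multiplicity change: +1 insert, -1 delete.
type Delta<T> = readonly [T, number];

// A tiny filter-then-count operator. Each incoming write updates the
// aggregate directly; the query is never re-run from scratch.
function makeFilterCount<T>(pred: (v: T) => boolean) {
  let count = 0;
  return {
    push(delta: Delta<T>): number {
      const [value, mult] = delta;
      if (pred(value)) count += mult;
      return count; // current maintained result
    },
  };
}

const activeCount = makeFilterCount<string>(s => s === 'active');
```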

Sources: Stateful vs Stateless

Sources model tables. A source can come in stateful or stateless variants.

A stateless source cannot return historical data to queries that are subscribed after the source was created.

A stateful source knows its contents. When a dataflow graph is attached to it, the full contents of the source are sent through the graph, effectively running the query against historical data.

Views: ValueView, TreeView

There are currently two kinds of views: ValueView and TreeView.

ValueView maintains the result of a count query.

TreeView maintains select queries.

count and select are distinct queries. This implementation does not support returning a count with other selected columns.

TreeView holds a comparator which uses the columns provided to Order By to sort its contents. If no Order By is specified, the items are ordered by id. Any time an Order By is specified that does not include the id, the id is appended as the final ordering key. This is so we get a stable sort order when users sort on columns that are not unique.
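The id tie-break can be sketched as follows (hypothetical `Row` and `makeComparator` names; the real comparator lives in TreeView):

```typescript
// Toy row type for the example; only `id` is assumed to be unique.
type Row = {id: string; title: string};

// Compare on the requested column first, then fall back to `id` so the
// resulting order is total and stable even when the column has duplicates.
function makeComparator(field: keyof Row) {
  return (a: Row, b: Row): number => {
    if (a[field] < b[field]) return -1;
    if (a[field] > b[field]) return 1;
    // Tie-break on id.
    return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
  };
}

const rows: Row[] = [
  {id: 'b', title: 'same'},
  {id: 'a', title: 'same'},
];
// Both titles are equal, so the id tie-break decides the order.
const sorted = [...rows].sort(makeComparator('title'));
```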

Views can be subscribed to if a user wishes to be notified whenever the view is updated.

Aaron was pretty adamant about using native JS collections, hence the value property on TreeView returns a JS array. This is fine for cases where the view has a limit but for cases where you want a view of thousands+ of items I'd recommend the PersistentTreapView (not available here, but in Materialite).

OR, Parenthesis & Breadth First vs Depth First Computation

This PR does not support OR or nested conditions but does lay the groundwork for them by executing the dataflow graph breadth first rather than depth first.

This is the reason for the split of dataflow events between enqueue and notify or run and notify for operators.

See the commentary on IOperator in Operator.ts

Transactions

The IVM system here has a concept of a transaction. It enables:

  1. Many sources and values to be written before updating query results
  2. Only notifying query subscribers after all queries have been updated with the results of the writes made in that transaction.

See commentary in ISourceInternal in ISource.ts as well as on IOperator in Operator.ts.
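A toy version of the transaction rule: buffer writes, then notify subscribers exactly once after the whole batch commits. The `DemoTxSource` name and shape are invented for this sketch; see ISourceInternal and IOperator for the real mechanics.

```typescript
// Illustrative transactional source: writes accumulate in `pending` and
// subscribers are notified once, at commit, never per-write.
class DemoTxSource<T> {
  private pending: T[] = [];
  private committed: T[] = [];
  private listeners = new Set<(rows: readonly T[]) => void>();

  onCommit(fn: (rows: readonly T[]) => void): void {
    this.listeners.add(fn);
  }

  write(row: T): void {
    this.pending.push(row); // buffered; no notification yet
  }

  commit(): void {
    this.committed.push(...this.pending);
    this.pending = [];
    // One notification for the whole transaction's worth of writes.
    for (const fn of this.listeners) fn(this.committed);
  }
}

const txSource = new DemoTxSource<string>();
let notifications = 0;
let seen: readonly string[] = [];
txSource.onCommit(rows => {
  notifications++;
  seen = rows;
});
txSource.write('a');
txSource.write('b'); // still no notification
txSource.commit();
```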


(demo video: zkl-compressed.mov)

- select
- where
- limit
- orderBy
- count

up later:
- join
- groupBy
- subSelect

but first we'll generate the AST for these operators and get them working e2e in Replicache.
This'll serve as the basis for MemorySource and then integration with Replicache.
A connection to replicache is provided by a `Context` param, injected into queries.

Rails will create query instances and will be responsible for injecting this param.

The context provides:
- a way to look up sources based on table/collection name
- a materialite instance for the given Replicache instance
- orderBy fields are added as a hidden field on the mapped object
- sources are created in the desired order
Contributor

@aboodman aboodman left a comment


I only looked at the first few commits and my comments are not substantive.

My main question that I want to understand is where, if at all, copies of entity data are happening from the source since it's so important to minimize them for performance. I hope that we can make this work in a zero-copy way, except for creating the final array/map to return to caller.


The components, in code, that make up the dataflow graph are:

1. [ISource](./ivm/source/ISource.ts)
Contributor


The most nit, we don't usually label our interfaces with a prefix I.

@@ -0,0 +1,20 @@
export type Edge<TSrc extends EntitySchema, TDst extends EntitySchema> = {
Contributor


Now that I've read the design doc, I understand what this type does, but I think some comments would be useful.

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Basically it's the same thing again about the "Edge" and "Node" terminology being unfamiliar. I wasn't sure if this was a node/edge in the query graph, or the data flow graph or ...).

Author


changed it to Relationship

@@ -0,0 +1,40 @@
/* eslint-disable @typescript-eslint/ban-types */
Contributor


Gonna take your word for it on this file. 😂

readonly prepare: () => IStatement<TReturn>;
}

export interface IStatement<TReturn> {
Contributor


Nittiest Nit, but can we just say Statement? IStatement is very dot-net.


Views can be subscribed to if a user wishes to be notified whenever the view is updated.

Aaron was pretty adamant about using native JS collections, hence the `value` property on `TreeView` returns a JS array. This is fine for cases where the view has a `limit` but for cases where you want a view of thousands+ of items I'd recommend the `PersistentTreapView` (not available here, but in Materialite).
Contributor


Hey, here I am. I go back and forth about this – it's possible I'm wrong. My main concern is the dx – the dx of true arrays is so nice because JS has so many language helpers and so on for them. When @arv gets in here he'll have opinions I'm sure and may overrule.

I think that in practice we do not want people querying thousands of items, they should instead page them in < 1k at a time. But I can think of use cases for querying thousands of items at once.

@aboodman
Contributor

This is just absurdly, mind-numbingly exciting btw. Absolutely on the edge of my seat to start playing with this.

@arv
Contributor

arv commented Mar 10, 2024

Any advice on reviewing this? Is it better to review commit by commit or just review the final results of the pr?

@aboodman
Contributor

aboodman commented Mar 10, 2024 via email

@tantaman
Author

Any advice on reviewing this? Is it better to review commit by commit or just review the final results of the pr?

It's up to you guys. I can split this up into a PR per commit for review purposes and stack all of those.

@tantaman tantaman marked this pull request as ready for review March 11, 2024 16:08

Would create 1k pipelines since we don't collapse over slots in stage 2.

## Stage 3: Pipeline per Unique Unbound ZQL Query
Contributor


I'm on the edge of my seat for Stages 3+ :)

);
}

where<K extends keyof S['fields']>(
Contributor

@grgbkr commented Mar 11, 2024


I had been thinking you would have to select a field in order to filter on it in where. Of course that is not how SQL works, and I think what you have here makes more sense. It will complicate slightly the logic for determining which columns to sync to the client, but I don't think by much.

Contributor


We will also need logic for determining the 'required columns' for a query on the client, so that we can filter the source to just entities that have the 'required columns'. I'm imagining this would be the very first step in the pipeline.

readonly count: () => IEntityQuery<TSchema, number>;
readonly where: <K extends keyof TSchema['fields']>(
f: K,
op: Operator,
Contributor


Which operators are valid is dependent on the type of the field, and then the type for value is dependent on the type of the operator.

I don't have the full typing worked out, but I think instead of just Operator, we need
NumberOperator, StringOperator, BooleanOperator, and eventually SubqueryOperator.

Author


Which operators are valid is dependent on the type of the field, and then the type for value is dependent on the type of the operator.

Yeah. Let me see what I can work out.

Contributor

@arv arv left a comment


I will do another high level pass tomorrow.

@tantaman tantaman changed the title v1 of end-to-end implementation of zql zql - v1 of end-to-end implementation of zql Mar 13, 2024
@tantaman
Author

My main question that I want to understand is where, if at all, copies of entity data are happening from the source since it's so important to minimize them for performance. I hope that we can make this work in a zero-copy way, except for creating the final array/map to return to caller. - @aboodman

As you saw, it does create a new object in order to apply the selection set. We could omit this step.

Where I hook it into Replicache it uses experimentalWatch and dumps the data returned by that method into a tree.

Maybe this can be omitted too? Since, presumably, Replicache already has all of this stuff in-memory? I'll need to dive into the Replicache internals to find out or get some pointers from @arv / @grgbkr

@aboodman
Contributor

As you saw, it does create a new object in order to apply the selection set. We could omit this step.

The ultimate arbiter is going to be framerate in Repliear. Before, you managed to beat our carefully built code using abstractions that I felt would be expensive. So maybe @arv is right and we build it the reasonable way and see how it performs.

But I'm warning you now that I'm likely to keep pestering about taking copies off the read path since I know it will make us faster and it's like free performance.

Where I hook it into Replicache it uses experimentalWatch and dumps the data returned by that method into a tree.

Maybe this can be omitted too? Since, presumably, Replicache already has all of this stuff in-memory? I'll need to dive into the Replicache internals to find out or get some pointers from @arv / @grgbkr

Yes, Replicache maintains a memory cache and is very careful to take the object returned by IDB, stick it in the cache, and pass it to the user with no copies. This is the way we have found is fastest historically.

@tantaman tantaman changed the title zql - v1 of end-to-end implementation of zql zql [1/n] - v1 of end-to-end implementation of zql Mar 14, 2024
but since declaration merging is also banned, we move Statement -> StatementImpl
@tantaman
Author

tantaman commented Mar 14, 2024

@arv - lmk if this is close enough to merge.

The big picture should all come together when join is implemented and the integration tests are built out (#33).

map & filter aren't too interesting / this is a lot more than needed for just map & filter since it lays groundwork for future iterations. You can make a pipeline that is only map & filter incremental in a few lines of code.

You do start to need all the complications of a graph once you want to:

  • join different pipelines together
  • share common sections
  • introduce or, union, intersect & except to bring different pipelines together

other big picture references:

Linear with 1 million items, updating in realtime:
https://vlcn-io.github.io/materialite/linearite/

source: https://github.com/vlcn-io/materialite/tree/main/demos/linearite

(demo video: linear-mil.mov)

Walkthrough of React bindings:
https://github.com/vlcn-io/materialite/blob/main/demos/react/walkthrough/walkthrough.md

Benchmarks:
https://observablehq.com/@tantaman/materialite

@arv
Contributor

arv commented Mar 14, 2024

Let's land this. It will be easier to make changes to main in the future.

@tantaman tantaman merged commit b5424f1 into main Mar 15, 2024
4 checks passed
@tantaman tantaman deleted the mlaw/zql-ivm branch March 15, 2024 01:34
@aboodman
Contributor

aboodman commented Mar 15, 2024 via email
