Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

query language #9

Closed
wkalt opened this issue Apr 2, 2024 · 6 comments
Closed

query language #9

wkalt opened this issue Apr 2, 2024 · 6 comments

Comments

@wkalt
Copy link
Owner

wkalt commented Apr 2, 2024

Background

Today dp3 executes only one kind of query, which you could consider as an unrestricted as-of join on timestamp, or a timestamp-sorted union.

We will want to expand this to more flexibly defined as-of joins on timestamp, and more generally provide a query language interface with some "psql-like" functionality in the CLI to the user to give them a more interactive and databasey experience.

SQL is a poor language for this because supporting any SQL sets up the expectation that we will support all of it -- which we are not prepared to do efficiently -- and even if we did there are no SQL implementations that express as-of joins in a concise and user-friendly way, that SQL non-users will be excited to pick up. People who like SQL will be annoyed that we only support a subset and people that hate it will be annoyed that our query language resembles it.

Finally SQL is going to limit our eventual prospects for flexible autocompletion support in the client, due to the way it lists columns prior to tables.

So, we are not going to implement SQL.

The concern about only implementing part of SQL also applies to the other languages people are trying to standardize like kusto or prql. We are at best going to get inspired by them - we aren't going to implement them or advertise support for them.

We do need a few similar features to what all of these provide though:

  • Filtering i.e where clauses with the usual binary operators
  • N-way as-of join on timestamp only, with support for either "succeeds" or "precedes" relationships since the user may want to think either way.
  • A modifier on the as-of join like "by less than X/by more than X", allowing you to specify you only want hits within a particular amount of time.
  • Composable
  • Limit/offset
  • Filtering on timestamp, but elevated to a part of the syntax.
  • Reverse the order of timestamp iteration, to enable reverse search as typically presented in log search interfaces.
  • Restriction on producer elevated to part of the syntax. Initially we will require it but not forever.
  • Complex array support. What we need here actually goes beyond SQL’s capabilities.
  • UDF support - a bonus but something we should plan for/think about, as users may wish to do some custom computations like linear algebra or whatever that we will never be fast enough to generalize in the product.

The intent of the language will be to support easy interactive searching, primarily on small numbers of topics, and to enable the expression of multi-topic conditions like "show me 50 times in the last month when we braked hard while it was raining" (considering in that example separate topics for "hard brake" and "is raining"). These conditions can get complex and incorporate several topics. I would expect that as-of joins bigger than 12-way will be rare and that most will be (well) under 6-way.

We likely never support other kinds of joins, or sorting on fields other than log time, or heavy analytics (at least in this engine). We need to play to the strengths of our system and the query engine that supports this will only be single-node. A closer analog for what we are targeting would be the query languages of sumologic, stackdriver, or any of the cloud log search tools, except with a lot more focus on as-of joins. For heavy analytics work users can use spark.

For purposes of autocomplete, we can assume that we can get both a fast listing of available tables (i.e topics), and a fast schema listing for a particular table. Both of those are true, we just don't have APIs for them. So if a query leads with a table and follows with the column restrictions, we will be able to autocomplete it (probably glossing over some nuances about the grammar).

Proposal

Get single topic:

/topic

Get two topics joined

Current behavior of topics param - comma operator

/topic1, /topic2

Restrict a single topic with scalar subfield

Conventional comparison operators supported i.e =, <, <=, ~, *~, <>, etc.

/topic where field.subfield.subfield2 = 'foo'

Restrict on a fixed-size array element.

/topic where field.subfield.subfield2[1] = 'foo'

Restrict on variable-sized array element.

We probably do need the ability to address a variable-sized element by index, but other queries on variable-sized arrays are going to require some kind of "any", "all", or "none" semantics I think. For access by index we will use the same syntax as above for fixed-size array. For the others we need something special

;; access by index
/topic where field.subfield.subfield2[1] = 'foo'

;; any/all/none - element alias. This can be anything the user wants but conventionally “.” will be used unless
;; disambiguation is required (e.g nested arrays). Note that we should be able to autocomplete child fields.

;; scalar arrays
/topic where field.subfield.subfield2 has any . > 10
/topic where field.subfield has no . > 10
/topic where field.subfield has all . > 10

;; complex element with scalar subfield
/topic where field.subfield.subfield2 has any .x.y > 10

;; complex element with scalar array-valued subfield – note use of alias “element” - could be anything.
/topic where field.subfield has any (.x.y has any element > 10)

;; complex element with complex array-valued subfield
/topic where field.subfield has any (.x.y has any element.x > 10)

;; double nested
/topic where field.subfield has any (.x.y has any (element.x.y has no child.x > 10))

It’s possible we won’t need the parentheses, but if we don’t it would be good to support them anyway since they make the structure clearer. In all cases above the argument to “any” could have been parenthesized.

Join two restrictions on timestamp

(/topic where ...), (/topic2 where ...)

Precedes/succeeds/neighbors operators.

Supports keywords nanoseconds, microseconds, milliseconds, seconds, minutes. We will probably cap at 1 or 5 minutes for now since something is going to buffer that data until/unless we spill queries to disk, which we can defer until requested.

(/topic where …) precedes (/topic2 where …) by less than 5 seconds
(/topic where …) succeeds (/topic2 where …) by less than or equal to 5 minutes
(/topic where …) neighbors (/topic2 where …) by less than 5 seconds

;; unrestricted
/topic precedes /topic2 by less than 5 seconds

;; Composition
((/topic where …) precedes (/topic2 where …) by less than 5 seconds) neighbors (/topic3 where…) by less than 5 seconds

As-of semijoin

^ AKA suppression operator

;; unrestricted semijoin of /topic and /topic2. Valid but pointless.
/topic, ^/topic2
;; useful example of semijoin, for instance if /topic is the only one required.
(/topic where …) precedes ^(/topic2 where …) by less than 5 seconds

Timestamp restriction

From/to keywords

;; dates
/topic from '2020-01-01' to '2020-01-02'

;; nanoseconds
/topic from 123 to 345

;; on the result of a join
/topic, /topic2 from '2020-01-01' to '2020-01-02'
(/topic where ...) precedes (/topic2 where ...) by less than 5 seconds from '2020-01-01' to '2020-01-02'

;; applied to subquery - mostly pointless but valid
(/topic from '2020-01-01' to '2020-01-02') precedes (/topic2 from '2020-01-02' to '2020-01-02')

Descending keyword

Reverse the sort order. This should be used at the end of queries (before limit/offset) but is also valid in subqueries. Most likely when used in subqueries the effect on precedes/succeeds behavior will become confusing.

/topic descending

/topic where ... descending

/topic where ... from a to b descending

;; legal but don't do this
(/topic where ... descending) precedes (/topic2 where ...) by less than 5 seconds

;; better
(/topic where ...) precedes (/topic2 where ...) by less than 5 seconds descending

Limit/offset

Valid at the end of any query/subquery only.

;; get 5 records, e.g to inspect structure
/topic limit 5

;; equivalent
/topic limit 5 offset 5
/topic offset 5 limit 5

;; on a join
(/topic where ...) precedes (/topic2 where ...) by less than 5 seconds descending limit 5 offset 5
@wkalt
Copy link
Owner Author

wkalt commented Apr 2, 2024

ping @jainilajmera @mullr @jhubberts for thoughts/insights

@wkalt
Copy link
Owner Author

wkalt commented Apr 7, 2024

I am now targeting something that looks more like this:

a
a, b
a precedes b
a, b precedes c
a precedes b precedes c
where a.foo = 10 and b.bar = 20 and c. baz = 30
limit 10 offset 20
a preceeds b succeeds c
where a.foo = 10 and b.bar = 20 and c.baz = 30

@wkalt
Copy link
Owner Author

wkalt commented Apr 9, 2024

from discussion with Russ about https://docs.auxon.io/speqtr/syntax.html#queries

  • we need a way to control the behavior of "asof" to indicate whether all RHS messages should be emitted or just the one succeeding the LHS. This is their concept of "crossing any".
  • they have a "not followed by"/"not preceded by"
  • AS keyword looks nice too. Repeating topic names in where clauses will be inconvenient for long topics.

@wkalt
Copy link
Owner Author

wkalt commented Apr 9, 2024

from device between '2020-01-01' and '2021-01-01'
/is_raining as raining precedes immediate /events as events
where raining.raining = true and events.type = "hard-brake"

@wkalt
Copy link
Owner Author

wkalt commented Apr 10, 2024

big chunk of query execution stuff is now merged. Leaving this open as we still do not have variable length array support.

@wkalt
Copy link
Owner Author

wkalt commented Apr 11, 2024

will actually close this and track new query language features in the monster list - that won't be the only one

@wkalt wkalt closed this as completed Apr 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant