SurrealDB is built upon various ordered and, to the extent they are distributed, range-partitioned key-value stores such as TiKV. This has the potential to make range queries on keys (record ids) very performant. However, SurrealQL lacks both a dedicated syntax and performance guarantees for such queries.
Consider the following timeseries records grouped by game (the first index of each record id is the game name, and the second index is a timestamp of days since product launch):
Assuming each node can handle 150 records, a likely partitioning into four nodes would result in the following ranges:
Node 1 gets keys ["Chess", 1] to ["Chess", 150]
Node 2 gets keys ["Chess", 151] to ["Chess", 300]
Node 3 gets keys ["Tetris", 1] to ["Tetris", 150]
Node 4 gets keys ["Tetris", 151] to ["Tetris", 300]
A common query pattern will be to chart the data for a particular game over roughly the last 90 days. Using the game Tetris as an example, that means getting all records between ["Tetris", 210] (inclusive) and ["Tetris", 300] (inclusive). Luckily, these records all reside on Node 4, so the underlying KV store can retrieve them in a single access. (Side note: if we were querying many more records, we might hit multiple nodes, but the ordering would still make their disk accesses much more efficient, and the number of nodes hit would stay relatively small.)
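The partitioning argument above can be sketched as a toy model in Python. It assumes record ids compare like tuples (game name first, then day) and that each node holds one contiguous run of 150 sorted keys. The helpers `build_keys`, `partition`, and `nodes_for_range` are hypothetical names for illustration only, not SurrealDB or TiKV APIs.

```python
from bisect import bisect_left, bisect_right

NODE_CAPACITY = 150  # assumption taken from the example above


def build_keys():
    """Sorted record ids: (game, day) for two games and 300 days each."""
    return sorted((game, day) for game in ("Chess", "Tetris") for day in range(1, 301))


def partition(keys, capacity=NODE_CAPACITY):
    """Split the sorted keys into contiguous runs, one run per node."""
    return [keys[i:i + capacity] for i in range(0, len(keys), capacity)]


def nodes_for_range(nodes, low, high):
    """Return 1-based numbers of the nodes holding any key in [low, high]."""
    hit = []
    for number, run in enumerate(nodes, start=1):
        # Each run is sorted, so two binary searches bound the matching slice.
        start = bisect_left(run, low)
        end = bisect_right(run, high)
        if start < end:
            hit.append(number)
    return hit


nodes = partition(build_keys())
print(nodes_for_range(nodes, ("Tetris", 210), ("Tetris", 300)))  # [4]
print(nodes_for_range(nodes, ("Chess", 100), ("Chess", 200)))    # [1, 2]
```

Under these assumptions the Tetris query touches only Node 4, while a range that straddles a partition boundary touches two adjacent nodes rather than all of them.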
Describe the solution
Idea 1 (.. and ..= to signify Range<Key> and RangeInclusive<Key>, respectively):
SELECT id, players FROM metrics:["Tetris", 210]..["Tetris", 301]
SELECT id, players FROM metrics:["Tetris", 210]..=["Tetris", 300]
Or, if preferable, idea 1.5:
SELECT id, players FROM metrics:["Tetris", 210]..metrics:["Tetris", 301]
SELECT id, players FROM metrics:["Tetris", 210]..=metrics:["Tetris", 300]
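To make the intended difference between .. and ..= concrete, here is a minimal Python sketch that models both forms over a sorted list of keys. It assumes the Rust-like meaning suggested by Range<Key> and RangeInclusive<Key> (end excluded versus end included). The functions `scan_exclusive` and `scan_inclusive` are hypothetical and say nothing about how SurrealDB would implement the scan.

```python
from bisect import bisect_left, bisect_right

# Sorted record ids for one game, modeled as (game, day) tuples.
keys = [("Tetris", day) for day in range(1, 301)]


def scan_exclusive(sorted_keys, start, end):
    """Model of `start..end`: start is included, end is excluded."""
    return sorted_keys[bisect_left(sorted_keys, start):bisect_left(sorted_keys, end)]


def scan_inclusive(sorted_keys, start, end):
    """Model of `start..=end`: both bounds are included."""
    return sorted_keys[bisect_left(sorted_keys, start):bisect_right(sorted_keys, end)]


half_open = scan_exclusive(keys, ("Tetris", 210), ("Tetris", 301))
closed = scan_inclusive(keys, ("Tetris", 210), ("Tetris", 300))
print(half_open == closed, len(closed))  # True 91
```

Both forms select the same records here, which is why the exclusive examples end at 301 and the inclusive ones at 300.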
Alternative methods
Idea 2 (support normal-SQL's BETWEEN ... AND ... syntax, and make sure it is optimized to use a range lookup from the underlying KV-store):
SELECT id, players FROM metrics WHERE id BETWEEN ["Tetris", 210] AND ["Tetris", 300]
Idea 3 (no new syntax, just make sure the following optimizes to use a range lookup from the underlying KV-store):
SELECT id, players FROM metrics WHERE ["Tetris", 210] <= id AND id <= ["Tetris", 300]
Non-solution: Changing the schema to use a random record id, with an index on game name and timestamp, would throw spatial locality and, by extension, query performance out the window. Executing SELECT id, players FROM metrics WHERE name = "Tetris" AND timestamp BETWEEN 210 AND 300, assuming the existence of an ordered index on (name, timestamp), would play nicely with that index but would then need a separate random access for each matching record (roughly 90 of them) to fetch the actual records.
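The locality loss can be illustrated with a small Python sketch that counts how many contiguous runs of storage positions the matching records occupy under each layout. This is a toy model under stated assumptions: storage order is taken to follow record id order, random ids are drawn from a seeded generator, and `contiguous_runs` is a hypothetical helper where each run stands in for one sequential read.

```python
import random


def contiguous_runs(positions):
    """Count groups of adjacent positions; each group models one sequential read."""
    ordered = sorted(positions)
    return sum(1 for i, p in enumerate(ordered) if i == 0 or p != ordered[i - 1] + 1)


records = [(game, day) for game in ("Chess", "Tetris") for day in range(1, 301)]
wanted = {("Tetris", day) for day in range(210, 301)}

# Layout 1: the record id is (game, day), so storage order follows the id.
by_compound_id = sorted(records)
runs_compound = contiguous_runs(
    [pos for pos, rec in enumerate(by_compound_id) if rec in wanted]
)

# Layout 2: each record gets a random id (an assumption for illustration),
# so storage order is unrelated to (game, day).
rng = random.Random(0)  # fixed seed so the sketch is repeatable
random_ids = rng.sample(range(10**9), len(records))
by_random_id = [rec for _, rec in sorted(zip(random_ids, records))]
runs_random = contiguous_runs(
    [pos for pos, rec in enumerate(by_random_id) if rec in wanted]
)

print(runs_compound, runs_random)
```

With compound ids the matching records form a single run. With random ids they scatter into many short runs, which is the behaviour the index-based schema would pay for on every fetch.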
So at a quick glance, idea 1 would be the way we would want to go.
We already have a (not yet documented) feature called models. These are designed purely for testing (generating a large number of records in a simple query), but they make use of a range operator in the query parser...
CREATE |person:1000|; -- Create 1000 randomly generated people records
CREATE |person:1..1000|; -- Create 1000 people records with IDs from 1 to 1000
Just to summarise, we shouldn't go with ideas 2 or 3 because:
the range query is generated from the FROM clause and not from the WHERE clause
we would have to ensure the user creates indexes for the id field (which isn't and shouldn't be necessary)
I prefer the syntax of idea 1, as it is more in line with the current SurrealQL concepts
See also: https://discord.com/channels/902568124350599239/902568124350599242/1012746600315105401
SurrealDB version
surreal 1.0.0-beta.6 for linux on x86_64
Contact Details
finnbearone@gmail.com