Feature: Range queries on record ids should be ergonomic and fast #66

finnbear · 2022-08-27T00:03:53Z

Is your feature request related to a problem?

SurrealDB is built upon various ordered and, to the extent they are distributed, range-partitioned key-value stores such as TiKV. This has the potential to make range queries on keys (record ids) very performant. However, SurQL lacks a dedicated syntax or performance guarantees for such queries.

Consider the following time-series records grouped by game (the first element of each record id is the game name, and the second is a timestamp in days since product launch):

[
  {
    id: ["Chess", 1],
    players: 50
  },
  {
    id: ["Chess", 2],
    players: 15
  },
  ...296 records omitted...
  {
    id: ["Chess", 299],
    players: 15
  },
  {
    id: ["Chess", 300],
    players: 15
  },
  {
    id: ["Tetris", 1],
    players: 10
  },
  {
    id: ["Tetris", 2],
    players: 12
  },
  ...296 records omitted...
  {
    id: ["Tetris", 299],
    players: 26
  },
  {
    id: ["Tetris", 300],
    players: 23
  }
]

Assuming each node can handle 150 records, a likely partitioning into four nodes would result in the following ranges:

Node 1 gets keys ["Chess", 1] to ["Chess", 150]
Node 2 gets keys ["Chess", 151] to ["Chess", 300]
Node 3 gets keys ["Tetris", 1] to ["Tetris", 150]
Node 4 gets keys ["Tetris", 151] to ["Tetris", 300]

A common query pattern will be to chart the data for a particular game over roughly the last 90 days. Using the game Tetris as an example, that means getting all records between ["Tetris", 210] (inclusive) and ["Tetris", 300] (inclusive). Luckily, these records all reside on Node 4, so the underlying KV store can retrieve them in a single access. (Side note: if we were querying many more records, we might hit multiple nodes, but the key ordering would still make their disk accesses much more efficient, and the number of nodes hit would stay small.)
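To make the partitioning argument concrete, here is a minimal Python sketch. It is only a toy model: the sorted list, the 150-key node capacity, and the helper names `node_of` and `range_scan` are assumptions for illustration, not how SurrealDB or TiKV actually encode or place keys. It shows that the inclusive range from ["Tetris", 210] to ["Tetris", 300] covers 91 daily records and touches only Node 4, while a wider range spills onto a neighbouring node.

```python
from bisect import bisect_left, bisect_right

# Toy model (assumption): keys are (game, day) tuples kept in sorted order
# and split into contiguous partitions of 150 keys each.
KEYS = sorted((game, day) for game in ("Chess", "Tetris") for day in range(1, 301))
NODE_CAPACITY = 150


def node_of(position):
    """Return the 1-based node number holding the key at a sorted position."""
    return position // NODE_CAPACITY + 1


def range_scan(lo, hi):
    """Return (keys, nodes) for every key with lo <= key <= hi."""
    start = bisect_left(KEYS, lo)   # first position with key >= lo
    end = bisect_right(KEYS, hi)    # first position with key > hi
    keys = KEYS[start:end]          # one contiguous slice of the keyspace
    nodes = sorted({node_of(i) for i in range(start, end)})
    return keys, nodes


keys, nodes = range_scan(("Tetris", 210), ("Tetris", 300))
wide_keys, wide_nodes = range_scan(("Tetris", 100), ("Tetris", 300))
print(len(keys), nodes)
print(len(wide_keys), wide_nodes)
```

In this model the result is always one contiguous slice, so even the wider scan only visits adjacent nodes.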

Describe the solution

Idea 1 (.. and ..= to signify Range<Key> and RangeInclusive<Key>, respectively):

SELECT id, players FROM metrics:["Tetris", 210]..["Tetris", 301]
SELECT id, players FROM metrics:["Tetris", 210]..=["Tetris", 300]

Or, if preferable, idea 1.5:

SELECT id, players FROM metrics:["Tetris", 210]..metrics:["Tetris", 301]
SELECT id, players FROM metrics:["Tetris", 210]..=metrics:["Tetris", 300]
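As a sanity check on the intended semantics, the following Python sketch models `..` as a half-open range and `..=` as an inclusive range over composite keys, compared lexicographically. The `in_range` helper is hypothetical and says nothing about how SurrealDB would parse or execute the proposed syntax. It only shows that an exclusive end of ["Tetris", 301] and an inclusive end of ["Tetris", 300] select the same records when the second element is an integer day.

```python
def in_range(key, start, end, inclusive=False):
    """Model `start..end` (half-open) or `start..=end` (inclusive) on tuple keys."""
    if inclusive:
        return start <= key <= end
    return start <= key < end


# Same toy data set as in the issue: two games, days 1 through 300.
all_keys = [(game, day) for game in ("Chess", "Tetris") for day in range(1, 301)]

# Equivalent of metrics:["Tetris", 210]..["Tetris", 301]
half_open = [k for k in all_keys if in_range(k, ("Tetris", 210), ("Tetris", 301))]

# Equivalent of metrics:["Tetris", 210]..=["Tetris", 300]
closed = [k for k in all_keys if in_range(k, ("Tetris", 210), ("Tetris", 300), inclusive=True)]

print(half_open == closed)
```

The equivalence relies on days being integers; with fractional timestamps the two forms would differ for keys strictly between 300 and 301.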

Alternative methods

Idea 2 (support standard SQL's BETWEEN ... AND ... syntax, and make sure it is optimized to use a range lookup from the underlying KV store):

SELECT id, players FROM metrics WHERE id BETWEEN ["Tetris", 210] AND ["Tetris", 300]

Idea 3 (no new syntax, just make sure the following optimizes to use a range lookup from the underlying KV-store):

SELECT id, players FROM metrics WHERE ["Tetris", 210] <= id AND id <= ["Tetris", 300]

Non-solution: Changing the schema to use a random record id and to have an index on game name and timestamp would throw spatial locality and, by extension, query performance out the window. Executing SELECT id, players FROM metrics WHERE name = "Tetris" AND timestamp BETWEEN 210 AND 300, assuming the existence of an ordered index on (name, timestamp), would play nicely with that index but would then do about 90 random accesses (one per matching record) to fetch the actual records.
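The locality cost of the non-solution can be illustrated with a small Python sketch. It is a toy model under stated assumptions: storage positions stand in for on-disk order, a shuffled layout stands in for random record ids, and `contiguous_runs` is a hypothetical helper that counts how many separate sequential reads are needed. It does not measure SurrealDB or any real KV store.

```python
import random


def contiguous_runs(positions):
    """Count runs of adjacent positions; each run models one sequential read."""
    ordered = sorted(positions)
    return sum(1 for i, p in enumerate(ordered) if i == 0 or p != ordered[i - 1] + 1)


records = [(game, day) for game in ("Chess", "Tetris") for day in range(1, 301)]
wanted = [r for r in records if r[0] == "Tetris" and 210 <= r[1] <= 300]

# Layout A: composite record ids, so the primary store is sorted by (game, day).
composite_position = {r: i for i, r in enumerate(sorted(records))}
composite_runs = contiguous_runs(composite_position[r] for r in wanted)

# Layout B: random record ids, modelled by shuffling the storage order.
# An ordered index on (name, timestamp) still finds the matching ids,
# but the records themselves are scattered across the primary store.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
shuffled = records[:]
rng.shuffle(shuffled)
random_position = {r: i for i, r in enumerate(shuffled)}
random_runs = contiguous_runs(random_position[r] for r in wanted)

print(composite_runs, random_runs)
```

With composite ids the matching records form a single run; with random ids the number of runs approaches the number of matching records.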

See also: https://discord.com/channels/902568124350599239/902568124350599242/1012746600315105401

SurrealDB version

surreal 1.0.0-beta.6 for linux on x86_64

Contact Details

finnbearone@gmail.com

Is there an existing issue for this?

I have searched the existing issues

Code of Conduct

I agree to follow this project's Code of Conduct

tobiemh · 2022-08-27T00:14:17Z

Thanks for this @finnbear!

So at a quick glance, idea 1 would be the way we would want to go.

We already have (not yet documented) things called models. These are designed purely for testing (generating a large amount of records in a simple query), but make use of a range operator in the query parser...

CREATE |person:1000|; -- Create 1000 randomly generated people records
CREATE |person:1..1000|; -- Create 1000 people records with IDs from 1 to 1000

Just to summarise, we shouldn't go with ideas 2 or 3 because:

the range query is generated from the FROM clause and not from the WHERE clause
we would have to ensure the user creates indexes for the id field (which isn't and shouldn't be necessary)
I prefer the syntax of idea 1, as it fits more in line with the current SurrealQL concepts

tobiemh closed this as completed in c1a1eba Aug 28, 2022
