Feature: Range queries on record ids should be ergonomic and fast #66

Closed
2 tasks done
finnbear opened this issue Aug 27, 2022 · 1 comment

finnbear (Contributor) commented Aug 27, 2022

Is your feature request related to a problem?

SurrealDB is built on ordered key-value stores such as TiKV, which are range-partitioned when distributed. This ordering has the potential to make range queries on keys (record ids) very performant. However, SurrealQL has neither a dedicated syntax nor performance guarantees for such queries.

Consider the following time-series records grouped by game (the first element of each record id is the game name, and the second is a timestamp in days since product launch):

[
  {
    id: ["Chess", 1],
    players: 50
  },
  {
    id: ["Chess", 2],
    players: 15
  },
  ...296 records omitted...
  {
    id: ["Chess", 299],
    players: 15
  },
  {
    id: ["Chess", 300],
    players: 15
  },
  {
    id: ["Tetris", 1],
    players: 10
  },
  {
    id: ["Tetris", 2],
    players: 12
  },
  ...296 records omitted...
  {
    id: ["Tetris", 299],
    players: 26
  },
  {
    id: ["Tetris", 300],
    players: 23
  }
]

Assuming each node can handle 150 records, a likely partitioning into four nodes would result in the following ranges:

  1. Node 1 gets keys ["Chess", 1] to ["Chess", 150]
  2. Node 2 gets keys ["Chess", 151] to ["Chess", 300]
  3. Node 3 gets keys ["Tetris", 1] to ["Tetris", 150]
  4. Node 4 gets keys ["Tetris", 151] to ["Tetris", 300]

A common query pattern will be to chart the data for a particular game over roughly the last 90 days. Using Tetris as an example, that means getting all records between ["Tetris", 210] (inclusive) and ["Tetris", 300] (inclusive). Luckily, these records all reside on Node 4, so the underlying KV store can retrieve them in a single access. (Side note: if we were querying many more records, we might hit multiple nodes, but the ordering would still make their disk accesses much more efficient, and the number of nodes hit would stay relatively small.)
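The locality argument above can be sketched with a toy model. This is only an illustration, not how SurrealDB or TiKV is implemented: the sorted Python list stands in for an ordered key space, and `node_of` and `range_scan` are hypothetical helpers that split it into fixed-size nodes of 150 keys, as in the example.

```python
from bisect import bisect_left, bisect_right

# Hypothetical stand-in for an ordered key-value store:
# keys are (game, day) tuples kept in sorted order.
keys = sorted((game, day) for game in ("Chess", "Tetris") for day in range(1, 301))

NODE_CAPACITY = 150  # records per node, taken from the example above


def node_of(position):
    """Map a position in the sorted key space to a 1-based node number."""
    return position // NODE_CAPACITY + 1


def range_scan(lo, hi):
    """Return (matching keys, nodes touched) for all keys with lo <= key <= hi."""
    start = bisect_left(keys, lo)
    end = bisect_right(keys, hi)
    nodes = sorted({node_of(i) for i in range(start, end)})
    return keys[start:end], nodes


found, nodes = range_scan(("Tetris", 210), ("Tetris", 300))
print(len(found), nodes)  # 91 [4]
```

Both bounds are inclusive, so the scan returns 91 records, all of them from one contiguous slice of the key space held by Node 4.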

Describe the solution

Idea 1 (.. and ..= to signify Range<Key> and RangeInclusive<Key>, respectively):

SELECT id, players FROM metrics:["Tetris", 210]..["Tetris", 301]
SELECT id, players FROM metrics:["Tetris", 210]..=["Tetris", 300]

Or, if preferable, idea 1.5:

SELECT id, players FROM metrics:["Tetris", 210]..metrics:["Tetris", 301]
SELECT id, players FROM metrics:["Tetris", 210]..=metrics:["Tetris", 300]
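To illustrate why each pair of queries above should return the same rows, here is a minimal sketch of half-open versus inclusive bounds over composite keys. The function names `scan_exclusive` and `scan_inclusive` are made up for this example and do not correspond to any SurrealDB API.

```python
from bisect import bisect_left, bisect_right

# Sorted composite keys for one game, days 1 through 300.
keys = [("Tetris", day) for day in range(1, 301)]


def scan_exclusive(lo, hi):
    """Keys with lo <= key < hi, in the spirit of Rust's Range (lo..hi)."""
    return keys[bisect_left(keys, lo):bisect_left(keys, hi)]


def scan_inclusive(lo, hi):
    """Keys with lo <= key <= hi, in the spirit of Rust's RangeInclusive (lo..=hi)."""
    return keys[bisect_left(keys, lo):bisect_right(keys, hi)]


a = scan_exclusive(("Tetris", 210), ("Tetris", 301))
b = scan_inclusive(("Tetris", 210), ("Tetris", 300))
print(a == b)  # True
```

The inclusive form is handy when the successor of the upper bound is not obvious, for example when the last key component is a string rather than an integer.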

Alternative methods

Idea 2 (support standard SQL's BETWEEN ... AND ... syntax, and make sure it is optimized to use a range lookup from the underlying KV-store):

SELECT id, players FROM metrics WHERE id BETWEEN ["Tetris", 210] AND ["Tetris", 300]

Idea 3 (no new syntax, just make sure the following optimizes to use a range lookup from the underlying KV-store):

SELECT id, players FROM metrics WHERE ["Tetris", 210] <= id AND id <= ["Tetris", 300]

Non-solution: Changing the schema to use a random record id and to have an index on game name and timestamp would throw spatial locality and, by extension, query performance out the window. Executing SELECT id, players FROM metrics WHERE name = "Tetris" AND timestamp BETWEEN 210 AND 300, assuming the existence of an ordered index on (name, timestamp), would play nicely with that index but would then need one random access for each of the 91 matching records to fetch the actual data.
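A rough simulation of that locality loss, under the same toy assumptions as the partitioning example (four nodes of 150 records, storage ordered by primary key). The random 64-bit ids and the `nodes_touched` helper are assumptions made for illustration only.

```python
import random

NODE_CAPACITY = 150
records = [(game, day) for game in ("Chess", "Tetris") for day in range(1, 301)]


def nodes_touched(primary_keys, wanted):
    """Nodes holding the wanted records when storage is ordered by primary key."""
    order = sorted(range(len(records)), key=lambda i: primary_keys[i])
    position = {rec_index: pos for pos, rec_index in enumerate(order)}
    return sorted({position[i] // NODE_CAPACITY + 1 for i in wanted})


# Indexes of the records the chart query needs.
wanted = [
    i for i, (game, day) in enumerate(records)
    if game == "Tetris" and 210 <= day <= 300
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
random_ids = [rng.getrandbits(64) for _ in records]

composite = nodes_touched(records, wanted)     # record id is (game, day)
scattered = nodes_touched(random_ids, wanted)  # record id is random
print(composite, scattered)
```

With composite ids the wanted records sit on a single node. With random ids they are spread across the key space, so the same query touches several nodes and reads non-adjacent keys.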

See also: https://discord.com/channels/902568124350599239/902568124350599242/1012746600315105401

SurrealDB version

surreal 1.0.0-beta.6 for linux on x86_64

Contact Details

finnbearone@gmail.com

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

tobiemh (Member) commented Aug 27, 2022

Thanks for this @finnbear!

So at a quick glance, idea 1 would be the way we would want to go.

We already have a (not yet documented) feature called models. These are designed purely for testing (generating a large number of records in a simple query), but make use of a range operator in the query parser...

CREATE |person:1000|; -- Create 1000 randomly generated people records
CREATE |person:1..1000|; -- Create 1000 people records with IDs from 1 to 1000

Just to summarise, we shouldn't go with ideas 2 or 3 because:

  • the range query is generated from the FROM clause and not from the WHERE clause
  • we would have to ensure the user creates indexes for the id field (which isn't and shouldn't be necessary)
  • I prefer the syntax of idea 1, as it fits more in line with the current SurrealQL concepts
