General query engine interface #5

Closed
rubensworks opened this issue Sep 30, 2021 · 28 comments

Comments

@rubensworks
Member

rubensworks commented Sep 30, 2021

After some internal discussions with @gsvarovsky and @jacoscaz, we identified the need to come up with a base query engine interface for RDF/JS (for declarative queries).

In essence, it should expose an interface that allows you to do operations such as:
const resultStream = await engine.query('some query');

The goal of this issue is to collect input on what already exists, so we can identify what the requirements are for such an interface.


Projects I contribute to that would benefit from this interface:

Big open question for me is how close the relationship to SPARQL should be. (We could for example start off with defining it in terms of SPARQL, but leave room for other query languages)

@tpluscode

tpluscode commented Sep 30, 2021

@RubenVerborgh
Member

(We could for example start off with defining it in terms of SPARQL, but leave room for other query languages)

Optional second argument, defaulting to { language: 'SPARQL', version: '1.1', extensions: [] }?
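
A minimal sketch of how such an optional argument with defaults could be resolved. The names QueryFormat, DEFAULT_FORMAT and resolveFormat are hypothetical and not part of any RDF/JS spec; only the default values come from the suggestion above.

```typescript
// Hypothetical shape of the optional second argument (an assumption, not a spec).
interface QueryFormat {
  language: string;
  version: string;
  extensions: string[];
}

// Default values taken from the suggestion in this comment.
const DEFAULT_FORMAT: QueryFormat = { language: 'SPARQL', version: '1.1', extensions: [] };

// Merge caller-supplied fields over the defaults; the extensions array is copied
// so that callers never share (and mutate) the default array.
function resolveFormat(format: Partial<QueryFormat> = {}): QueryFormat {
  return {
    ...DEFAULT_FORMAT,
    ...format,
    extensions: [...(format.extensions ?? DEFAULT_FORMAT.extensions)],
  };
}
```

Callers that only care about SPARQL 1.1 would omit the argument entirely, while an engine supporting another language could be addressed with something like resolveFormat({ language: 'GraphQL' }).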

@gsvarovsky

I would want to implement this interface with m-ld's Javascript engine, so that it can operate in environments that use SPARQL queries directly.

I have implemented my own interface, which looks like this:
https://github.com/m-ld/m-ld-js/blob/edge/src/rdfjs-support.ts

Note the dependency on the sparqlalgebrajs types – I would prefer not to have to pass strings, so there may be a need to have an interface package extracted from sparqlalgebrajs.

I would also prefer that the interface allowed a store to be quite explicit about which queries it supports (e.g. Construct but not Describe).
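
One way an engine could state its supported query forms explicitly is a runtime capability set, sketched below. This is only an illustration: QueryForm, DeclaresSupport and assertSupported are hypothetical names, and the same idea could instead be expressed at the type level through method overloads on algebra types.

```typescript
// Hypothetical list of query forms an engine may declare (an assumption for this sketch).
type QueryForm = 'select' | 'construct' | 'describe' | 'ask' | 'update';

// Hypothetical interface: the engine exposes exactly the forms it can execute.
interface DeclaresSupport {
  readonly supportedForms: ReadonlySet<QueryForm>;
}

// Fail early, before any parsing or planning, when a form is not supported.
function assertSupported(engine: DeclaresSupport, form: QueryForm): void {
  if (!engine.supportedForms.has(form)) {
    throw new Error(`Query form "${form}" is not supported by this engine`);
  }
}

// Example: an engine that supports Construct but not Describe.
const constructOnly: DeclaresSupport = {
  supportedForms: new Set<QueryForm>(['construct']),
};
```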

@jacoscaz
Contributor

jacoscaz commented Sep 30, 2021

I maintain quadstore, a persistent RDF store with SPARQL capabilities via Comunica, and I would happily implement the proposed base query engine interface.

@tpluscode

tpluscode commented Sep 30, 2021

Does it have to be that complicated? I've been working with @bergos' sparql-http-client and I think it's just about right. It comes in two forms, both of which have methods select/construct/ask/update but differ in the return types of select and construct:

declare module 'sparql-http-client/StreamClient' {
  class StreamClient {
    query: {
      select(query: string): Promise<EventEmitter>
      construct(query: string): Promise<import('rdf-js').Stream>
      ask(query: string): Promise<boolean>
      update(query: string): Promise<void>
    }
  }
}

declare module 'sparql-http-client/ParsingClient' {
  class ParsingClient {
    query: {
      select(query: string): Promise<Array<Record<string, import('rdf-js').Term>>>
      construct(query: string): Promise<import('rdf-js').Dataset>
      ask(query: string): Promise<boolean>
      update(query: string): Promise<void>
    }
  }
}

The only change that I would make above is for the StreamClient not to return promises:

class StreamClient {
  query: {
-    select(query: string): Promise<EventEmitter>
-    construct(query: string): Promise<import('rdf-js').Stream>
+    select(query: string): EventEmitter
+    construct(query: string): import('rdf-js').Stream
  }
}

@jacoscaz
Contributor

jacoscaz commented Oct 1, 2021

both of which have methods select/construct/ask/update

I think I see the point you're making but doing this would imply leaking aspects of the query language into the query interface. It would be equivalent to having separate methods for SELECT, INSERT, DELETE and UPDATE queries in SQL-oriented database drivers and ORMs. IMHO, this would make it much harder to deal with use cases in which the nature of a query is not known ahead of time and also introduce too strong a coupling between SPARQL and the interface itself.

@tpluscode

Database drivers (SDKs?) very much do have a similar distinction, and there is nothing wrong with it. Think query/execute/executeScalar. I will not even comment on ORMs because that is the worst comparison ever. You might have a look at micro-ORMs like Dapper, which are much "closer to the metal" to mitigate the impedance mismatch issues plaguing ORMs.

use cases in which the nature of a query is not known ahead of time

I'm curious about this statement. In what scenarios is the desired kind of result (tabular, graph or boolean) not known?

and also introduce too strong a coupling between SPARQL and the interface itself.

What is the nature of this coupling? Arguably, the whole RDF stack is built on uniformity and standards. The RDF graph is the same graph in every software component. SPARQL, being a similarly important core standard, can act the same. Otherwise, I read your comment as an invitation to build (IMO unnecessary) abstractions.

@rubensworks
Member Author

I've created a new issue to follow up on the discussion of query methods: #6
(So that we can keep this issue here focussed on collecting existing approaches)

@tpluscode

I might also mention my lib @tpluscode/sparql-builder. Right now it relies on sparql-http-client for execution, but a standard interface would be nice, especially if we could get an in-memory query engine.

@rubensworks
Member Author

Ah yes indeed, libs that depend on engines are also relevant to include here.

In that respect, the following libs may also be relevant:

@ericprud

ericprud commented Oct 1, 2021

I'd suggest starting with the SPARQL Algebra (but be willing to depart from it as use cases indicate). You can do a lot with it (like all of SPARQL), but it is simpler than SPARQL. For instance, the idiosyncrasies of a SolutionModifier's GROUP BY and HAVING reuse aggregation and filter. You may also opt to lop off large parts of it, but having a set of composable operations should be familiar to programmers.

@gsvarovsky

@tpluscode #5 (comment)

not to return promises

👍

@rubensworks
Member Author

not to return promises

That would actually depend on the outcome of #6. Because if we only expose a single method there, then the return type would vary based on the query. Since query type and return type may be determined async, we may require promises.
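
To illustrate why a single method pushes towards promises, here is a rough sketch in which the result kind is only known after the (possibly asynchronous) inspection of the query. All names (QueryResult, resultType, execute) are assumptions made for this sketch, and the Term and Quad types are simplified stand-ins for the RDF/JS ones. The toy engine only looks at the first keyword, so it would misclassify queries that start with PREFIX or BASE.

```typescript
// Simplified stand-ins for RDF/JS types, to keep the sketch self-contained.
type Term = { termType: string; value: string };
type Quad = { subject: Term; predicate: Term; object: Term };

// Hypothetical discriminated union: callers branch on resultType.
type QueryResult =
  | { resultType: 'bindings'; execute(): Promise<Array<Record<string, Term>>> }
  | { resultType: 'quads'; execute(): Promise<Quad[]> }
  | { resultType: 'boolean'; execute(): Promise<boolean> }
  | { resultType: 'void'; execute(): Promise<void> };

// Toy single-method engine: inspects the leading keyword and returns empty results.
async function query(queryString: string): Promise<QueryResult> {
  const form = (queryString.trim().split(/\s+/)[0] ?? '').toUpperCase();
  switch (form) {
    case 'SELECT':
      return { resultType: 'bindings', execute: async () => [] };
    case 'CONSTRUCT':
    case 'DESCRIBE':
      return { resultType: 'quads', execute: async () => [] };
    case 'ASK':
      return { resultType: 'boolean', execute: async () => false };
    default:
      // Everything else is treated as an update in this sketch.
      return { resultType: 'void', execute: async () => {} };
  }
}
```

Because the union is tagged, TypeScript narrows the type of execute() once the caller has checked resultType.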

@rubensworks
Member Author

rubensworks commented Oct 5, 2021

I've had a look at all the suggested libraries, and I've tried to create an overview of the aspects that I feel may require some standardization.
If I missed any, please let me know!

Once we agree upon a list of aspects, we can branch off into separate issues to see how we want to tackle the specifics of each one.

1. Query method interface

How to pass a query to a library, and obtain results.

Discussion in #6.

Single method

All query forms are handled via a single method, possibly via method overloading or union types.

Example:

      query(query: string): Promise<SomeUnionType>

Implemented by:

  • M-ld
  • Quadstore
  • Comunica
  • rdf-test-suite.js

Form-based methods

Each query form has its own dedicated method.

Example:

      select(query: string): Promise<Array<Record<string, import('rdf-js').Term>>>
      construct(query: string): Promise<import('rdf-js').Dataset>
      ask(query: string): Promise<boolean>
      update(query: string): Promise<void>

Implemented by:

  • sparql-http-client
  • fetch-sparql-endpoint.js

Other

The following libraries follow other query interfaces, which seem to be use-case-specific and may not benefit that much from standardization:

  • sparql-engine
  • LDflex
  • GraphQL-LD
  • @tpluscode/sparql-builder

2. Representing bindings

How to represent the results of tabular queries such as SELECT.

JSON-based

Example of a single bindings object:

{
  '?varA': namedNode('ex:a'),
  '?varB': namedNode('ex:b'),
}

Implemented by:

  • M-ld
  • rdf-test-suite.js
  • sparql-http-client
  • sparql-engine
  • @tpluscode/sparql-builder
  • fetch-sparql-endpoint.js

Object-based

A custom data structure that exposes methods and allows bindings to be stored internally in a different manner.

Example of a single bindings object:

const bindings = ...
const term: RDF.Term = bindings.get('?a');

Implemented by:

  • Quadstore
  • Comunica

3. Exposing metadata

At both the query level and the source level, it may be beneficial to expose metadata such as cardinality (estimates).
Such information may be useful for query optimization.

Dedicated method for obtaining metadata

interface CountableSource {
  countQuads(): Promise<number> | number
}

Implemented by:

  • Comunica
  • M-ld

Generic object that provides metadata

const results = engine.query(...);
const metadata = await results.metadata();
console.log(metadata.cardinality);

Implemented by:

  • Comunica

4. Serializing results

A method to serialize query results to a standard format, such as SPARQL JSON results.
Related to this, methods may be added that expose the available formats.

  resultToString: (queryResult: ..., format?: string) => Promise<Stream<string>>;

Implemented by:

  • Comunica

5. Defining sources

Some engines allow query sources to vary per query execution,
and therefore enable passing them as an additional argument.

  query: (query: string, context: { sources: IDataSource[] }) => Promise<IQueryResult>;
export type IDataSource = string | RDF.Source | {
  type?: string;
  value: string | RDF.Source;
  context?: ActionContext;
};

Implemented by:

  • Comunica
  • rdf-test-suite.js

6. Passing query as algebra

Instead of passing a query string to an engine, a (pre-optimized?) algebra object may be passed.
Related to this, methods for parsing a query string to algebra may also be valuable to standardize.

Example:

export interface QueryableRdf<Q extends BaseQuad = Quad> {
  query(query: Algebra.Construct): Stream<Q>;
  query(query: Algebra.Describe): Stream<Q>;
  query(query: Algebra.Project): BaseStream<Binding>;
}

Implemented by:

  • M-ld
  • Quadstore
  • Comunica

7. Defining query syntax format

If engines support different query syntaxes, they typically allow this to be customized via an optional argument.

Example:

  query: (query: string, context: { queryFormat: string }) => Promise<IQueryResult>;

Implemented by:

  • Comunica

@jacoscaz
Contributor

jacoscaz commented Oct 7, 2021

@rubensworks thank you for this list, very useful. I think as long as we keep our comments short, we might be able to discuss all of these points in this thread without branching into separate issues, which makes it a lot harder to keep track of the general picture IMHO. Of course, we will need to branch out for any point that sparks significant discussion.

My preferences:

1. Query method interface

Discussion in #6. My preference goes for single method + return type metadata.

2. Representing bindings

My preference goes for JSON-based representation (simple objects).

3. Exposing metadata

My preference goes for a generic object that provides metadata. I find that this approach leads to easier and better optimization in terms of sharing computation between metadata and query results. Worth mentioning that this is starting to have significant overlap with the current FilterableSource spec contained in this repo.

4. Serializing results

I would prefer not to standardize serialization in this spec.

5. Defining sources

Definitely in favor of this.

6. Passing query as algebra

Definitely in favor of this.

7. Defining query syntax format

Also discussed in #6, I have no need for anything else than SPARQL but I defer to people working with multiple query languages on this one.

@jacoscaz
Contributor

jacoscaz commented Oct 11, 2021

Some of my notes from today's call with @gsvarovsky and @rubensworks...

1. Query method interface

@gsvarovsky pointed out that the single method approach leads to code which is not as immediate and easy to grasp. Nonetheless, we ultimately settled on this approach, mainly due to its flexibility, potential for optimization and the fact that it's the only method capable of covering all of our current use-cases. We evaluated using base classes for convenience methods but ultimately elected not to include convenience methods to keep the spec as small as possible.

2. Representing bindings

@rubensworks explained the inherent risk of conflicts with native object properties when using bindings representations based on simple JavaScript objects. We discussed using an object-based representation with instance-level methods strictly related to reading bindings (.get('?var')) and class-level static methods for more general manipulation of bindings. We also believe that following in the footsteps of the RDF/JS data model by using factory functions would be a good idea (const bindings = DataFactory.bindings()).
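
The conflict with native object properties can be shown in a few lines. This is a sketch only: MapBindings is a hypothetical class, not the interface agreed upon here, and Term is a simplified stand-in for the RDF/JS term type.

```typescript
// Simplified stand-in for an RDF/JS term.
type Term = { termType: string; value: string };

// Plain-object bindings: properties inherited from Object.prototype leak through,
// so a lookup cannot tell "unbound variable" apart from "native property".
const plain: Record<string, Term> = {};
const toStringLeaks = 'toString' in plain;       // true although nothing was bound
const constructorLeaks = 'constructor' in plain; // true although nothing was bound

// Map-backed bindings: only explicitly bound variables are visible.
class MapBindings {
  private readonly bound = new Map<string, Term>();

  set(variable: string, term: Term): this {
    this.bound.set(variable, term);
    return this;
  }

  get(variable: string): Term | undefined {
    return this.bound.get(variable);
  }

  has(variable: string): boolean {
    return this.bound.has(variable);
  }
}

const safe = new MapBindings().set('a', { termType: 'NamedNode', value: 'ex:a' });
```

A query that happens to use a variable named toString or constructor would therefore behave correctly with the Map-backed variant, at the cost of a slightly less direct access syntax.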

Open question: should we keep the ? in variable names?

3. Exposing metadata

Related to point 1), we discussed the single method approach, with the main query() method returning an intermediate result object having a metadata(): Promise<Metadata> method. Standardized metadata would include quad/bindings count and ordering. We noticed that the intermediate result object overlaps with the FilterableSource spec.
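
A small sketch of such an intermediate result object, where metadata can be requested before (or without) executing the query. The names PendingResult, Metadata and pendingResultOver are hypothetical, rows are simplified to string maps, and only the count is modelled here, not the ordering.

```typescript
// Hypothetical metadata shape; only the count is modelled in this sketch.
interface Metadata {
  cardinality: number;
}

// Simplified row type standing in for a bindings object.
type Row = Record<string, string>;

// Hypothetical intermediate result: nothing is evaluated until a method is called.
interface PendingResult {
  metadata(): Promise<Metadata>;
  execute(): Promise<Row[]>;
}

// Toy implementation over an in-memory array, where the count is known up front.
function pendingResultOver(rows: Row[]): PendingResult {
  return {
    metadata: async () => ({ cardinality: rows.length }),
    execute: async () => rows,
  };
}
```

In a real engine, metadata() and execute() could share intermediate computation, which is the optimization argument made for this approach.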

4. Serializing results

We all agree that serialization should not fall within the scope of this spec.

5. Defining sources

Defining sources at query time allows query engines to be re-used across sources. This would be best modeled by passing sources as a parameter/option of the main query method: .query('SELECT ...', { sources: [store] }).
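
As a rough illustration of the re-use argument, the sketch below runs one engine instance against two different stores. ToyEngine, Source and QueryContext are hypothetical names, the Source interface is a heavily simplified stand-in for an RDF/JS source, and the engine ignores the query string entirely; it only shows how sources are wired in per call.

```typescript
// Simplified quad and source types (assumptions for this sketch).
type Quad = { subject: string; predicate: string; object: string };

interface Source {
  match(): Quad[];
}

interface QueryContext {
  sources: Source[];
}

class ToyEngine {
  // Ignores the query string and returns every quad from the given sources.
  async query(_query: string, context: QueryContext): Promise<Quad[]> {
    const quads: Quad[] = [];
    for (const source of context.sources) {
      quads.push(...source.match());
    }
    return quads;
  }
}

// Two independent stores queried through the same engine instance.
const storeA: Source = { match: () => [{ subject: 'ex:a', predicate: 'ex:p', object: 'ex:b' }] };
const storeB: Source = { match: () => [] };
const engine = new ToyEngine();
```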

6. Passing query as algebra

We considered basing the spec around two different query methods, one taking a SPARQL string and the other taking a SPARQL Algebra object. Implementors would be free to implement either/or.

7. Defining query syntax format

We all agree on keeping this spec SPARQL-based.

@blake-regalia

Was there a posting about scheduling a call? Best to keep such things open to the community instead of behind closed doors.

@rubensworks
Member Author

@blake-regalia There were no formal RDF/JS calls, no. Just some informal talk between @gsvarovsky, @jacoscaz, and myself about the overlaps between our work, and potential alignments.

Definitely open to having a call about the query spec, but I'm not sure there is a real need for one at this stage?
Discussion via GH issues seems to be progressing quite well.

@blake-regalia

blake-regalia commented Oct 11, 2021 via email

@rubensworks
Member Author

I would appreciate being part of the discussion over the phone however, just saying.

@blake-regalia Of course!
Would you like to initiate scheduling a call?

@RubenVerborgh
Member

Open question: should we keep the ? in variable names?

Just reacting to this tiny nit: I would suggest dropping the question mark.

SPARQL already has two syntaxes for variables, one with ? and one with $, and they indicate the same variable:

A query variable is marked by the use of either "?" or "$"; the "?" or "$" is not part of the variable name.

https://www.w3.org/TR/sparql11-query/#QSynVariables
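
In code, dropping the marker amounts to a one-line normalization, sketched here with a hypothetical helper named variableName:

```typescript
// Strip a single leading "?" or "$" marker, which is not part of the variable name.
// Names without a marker are returned unchanged.
function variableName(token: string): string {
  return token.replace(/^[?$]/, '');
}
```

With this, '?x' and '$x' both normalize to the same name, so bindings keyed by the bare name do not need to care which syntax the query used.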

@bergos
Member

bergos commented Oct 12, 2021

Please keep the variable interface of the Data Model spec in mind.

It should be possible to use variable term objects as identifiers in bindings:

const bindings = ...
const a = factory.variable('a')
const term = bindings.get(a)

Then there is no need to open the leading ? discussion, because you can point to the Data Model spec that defines the value of a variable term like this:

value the name of the variable without leading "?" (example: "a").
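
A sketch of how bindings keyed by variable terms might work internally. TermKeyedBindings is a hypothetical class and the variable function is a stand-in for a data factory's variable() method; the types are simplified. The key point is that lookups go through the term's value, so two distinct term objects for the same variable resolve to the same binding.

```typescript
// Simplified stand-ins for RDF/JS term types.
interface Term {
  termType: string;
  value: string;
}

interface Variable extends Term {
  termType: 'Variable';
}

// Stand-in for a data factory's variable() method; value has no leading "?".
const variable = (value: string): Variable => ({ termType: 'Variable', value });

class TermKeyedBindings {
  // Keyed by variable.value, so object identity of the key term does not matter.
  private readonly byName = new Map<string, Term>();

  set(key: Variable, term: Term): this {
    this.byName.set(key.value, term);
    return this;
  }

  get(key: Variable): Term | undefined {
    return this.byName.get(key.value);
  }
}

const bindings = new TermKeyedBindings().set(variable('a'), { termType: 'Literal', value: 'hello' });
// A freshly created variable term with the same name finds the same binding.
const term = bindings.get(variable('a'));
```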

@jacoscaz
Contributor

It should be possible to use variable term objects as identifier in bindings

This is a very sensible consideration IMHO, although I do wonder about the effects on performance (and complexity?) in long chains of transformations. That said, I would be 100% in favor of using object-based representation of variables if at all possible.

@rubensworks
Member Author

I do wonder about the effects on performance (and complexity?) in long chains of transformations.

I actually think this should be pretty ok performance-wise.

The only downside of this would be that it would be a bit less convenient for interface users to access values of a certain variable. But this is similar to the discussion around #6, as more dev-friendly abstractions can easily be built on top of this.

@rubensworks
Member Author

FYI, there is a possibility for a call about this; see the mailing list: https://lists.w3.org/Archives/Public/public-rdfjs/2021Oct/0000.html

@jacoscaz
Contributor

PR to extend the discussion to everyone interested at #7

@jacoscaz
Contributor

We've recently merged #7, which includes and elaborates upon what was discussed in this issue. I think we can close this in favor of more focused issues - @rubensworks final word up to you!

@rubensworks
Member Author

Sounds good! Let's create new issues where needed based on the experimental interfaces in https://github.com/rdfjs/query-spec/blob/master/queryable-spec.ts

Once we're happy, we can create a proper spec.

For reference, I've started implementing these interfaces in @rdfjs/types in a new branch: https://github.com/rdfjs/types/tree/feature/query
Experimenting with them in Comunica as we speak.
