Integration of sparql with large language model related functionality #193

Open
fsasaki opened this issue Dec 14, 2023 · 13 comments

fsasaki commented Dec 14, 2023

Why

Several vendors are looking into this space already, sometimes in relation to extended (vector based) search capabilities, sometimes in relation to more general large language model features like summarization or knowledge graph generation from unstructured text.

Previous work

See also https://www.biorxiv.org/content/10.1101/463778v1.full.pdf .

Proposed solution

Nothing concrete yet, currently gathering related work.

Considerations for backward compatibility

Too early to discuss.


hartig commented Dec 14, 2023

In the context of a tutorial that I gave a few years ago, I collected information about the full-text search features provided by several triple store vendors (BlazeGraph, Virtuoso, AllegroGraph, Stardog, GraphDB). The latest version of my slides with this information can be found at the following address, where slides 24 to 41 are the relevant ones.
https://www.ida.liu.se/research/semanticweb/events/SemWebCourse2019/TripleStores.pdf


ktk commented Dec 14, 2023

@hartig great one, tnx

@VladimirAlexiev
Contributor

GraphDB supports the following:
https://graphdb.ontotext.com/documentation/10.4/gpt-queries.html

  • magic predicates to ask an LLM for text, list or table using data from your KG:
  • query explanation
  • result explanation, summarization, rephrasing, translation

https://graphdb.ontotext.com/documentation/10.4/retrieval-graphdb-connector.html

  • Indexing of KG entities in a vector database
  • Supports any text embedding algorithm and vector database. We've played with Weaviate, Elastic, etc
  • Uses the same powerful connector (indexing) language that we use for Elastic, Solr, Lucene
  • Automatic synchronization of changes in RDF data to the KG entity index
  • Supports nested objects (but not yet in the UI)
  • Serializes KG entities to text like this:
Franvino:
- is a RedWine.
- made from grape Merlo.
- made from grape Cabernet Franc.
- has sugar dry.
- has year 2012.
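
A minimal sketch of this kind of entity-to-text serialization, for illustration only. The triple list and the `serialize_entities` helper are hypothetical and are not GraphDB's connector implementation; they only reproduce the output shape shown above.

```python
from collections import defaultdict

# Hypothetical input: (subject label, predicate phrase, object label) triples.
# These mirror the example above and are not read from a real store.
TRIPLES = [
    ("Franvino", "is a", "RedWine"),
    ("Franvino", "made from grape", "Merlo"),
    ("Franvino", "made from grape", "Cabernet Franc"),
    ("Franvino", "has sugar", "dry"),
    ("Franvino", "has year", "2012"),
]


def serialize_entities(triples):
    """Group triples by subject and render each entity as a text block."""
    by_subject = defaultdict(list)
    for subject, predicate, obj in triples:
        by_subject[subject].append(f"- {predicate} {obj}.")
    return {
        subject: f"{subject}:\n" + "\n".join(lines)
        for subject, lines in by_subject.items()
    }


text = serialize_entities(TRIPLES)["Franvino"]
print(text)
```

The resulting text is what would then be passed to an embedding model and stored in the vector index.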

https://graphdb.ontotext.com/documentation/10.4/talk-to-graph.html

  • A simple chatbot using a defined KG entity index


We are working on natural language querying (NLQ) aka knowledge graph question answering (KGQA).
Cheers!


jpmccu commented Dec 14, 2023 via email

@ericprud
Member

@jpmccu , very cool. Any idea whether standardized value sets for vector spaces would allow "composition" of machines? Specific example: given a set of synaptic weights for diagnosing an ischemic stroke and another set for traffic patterns in a city, could one combine independently-trained machines in order to optimize stroke patient care (e.g. decide between close hospital or one further away that's good at angioplasty)? Sounds like you might be playing with stuff like that. Testing that in SPARQL would be very interesting indeed.


jpmccu commented Dec 19, 2023 via email

@VladimirAlexiev
Contributor

But @jpmccu and @ericprud, is it appropriate to store tensors in JSON?
Shouldn't we think of appropriate binary formats like HDF5 or stores like TensorStore?
There are also Data Abstraction Layers (eg GDAL) to isolate data access from the specific binary format/storage used.

Under https://accordproject.eu/ (automated compliance checking of architectural designs and urban planning) we're thinking about a binary data connector for GraphDB.

There's also schemaorg/schemaorg#3140


jpmccu commented Dec 20, 2023 via email

Author

fsasaki commented Dec 20, 2023

Thanks a lot to all for the discussion so far. Let me try to structure this.

A lot of the discussion seems to be focused on how to encode vectors and how to use them for search.

And there are existing issues like #40 related to that.

There has been almost no discussion of capabilities to query LLMs and to generate graphs from their output, e.g. the magic predicates mentioned by @VladimirAlexiev or the ones from Franz I had mentioned at the top of this issue. Do people see a use case for standardising these?

Also, how about use cases that go beyond querying but build on vector-based similarity? One could use this, for example, for KG construction (which could build on queries via SPARQL CONSTRUCT) or for validation ("check if everything which is skos:related is really semantically related").
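
To make the validation idea concrete, here is a rough sketch, assuming embeddings for the concepts are already available. The vectors, the concept IRIs and the 0.5 threshold are made-up illustration values, and `suspicious_links` is a hypothetical helper, not an existing SPARQL or SHACL feature.

```python
import math

# Hypothetical precomputed embeddings for three concepts (made-up numbers).
VECTORS = {
    "ex:Wine": [0.9, 0.1, 0.0],
    "ex:Vineyard": [0.8, 0.3, 0.1],
    "ex:Bicycle": [0.0, 0.2, 0.9],
}

# Hypothetical skos:related assertions found in the graph.
RELATED = [("ex:Wine", "ex:Vineyard"), ("ex:Wine", "ex:Bicycle")]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def suspicious_links(pairs, vectors, threshold=0.5):
    """Return the asserted pairs whose embeddings are less similar than the threshold."""
    return [
        (a, b) for a, b in pairs
        if cosine(vectors[a], vectors[b]) < threshold
    ]


print(suspicious_links(RELATED, VECTORS))
```

With these values only the ex:Wine / ex:Bicycle link is flagged. In practice the threshold would need tuning per embedding model, and a flagged link is a candidate for review rather than a proven error.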

Author

fsasaki commented May 10, 2024

Some updates on this topic with newer developments.

@rdfguy mentioned in the KGC panel discussions on KG standards that the combination of symbolic and statistical reasoning would be a potential future direction for graph technologies.

At Data Week Leipzig 2024, Lisa Wenige gave a 15-minute presentation on what this may look like, showing SPARQL extensions for LLMs. See https://www.youtube.com/watch?v=QfPCU8RiNhA&list=PLiyYYLqA8v5NBcAZJy6CpLVnDMrU4Y4yL&t=8344s

At the Knowledge Graph Conference, LLM support was shown by nearly all knowledge graph vendors.
A few steps that seem to be common to Graph RAG patterns are:

  • Storing vectors for (parts of) a graph, see e.g. Support for Vectors and Matrices #163
  • Providing vector generation capabilities. Common patterns seem to be: vector generation based on node descriptions or Concise Bounded Descriptions, or based on custom functions.
  • Providing similarity search based on vectors.
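
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and the node IRIs and descriptions are invented. It does not reflect any vendor's implementation.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Step 2: generate vectors from (hypothetical) node descriptions.
NODE_DESCRIPTIONS = {
    "ex:Franvino": "red wine made from merlot and cabernet franc grapes",
    "ex:Riesling": "white wine made from riesling grapes",
    "ex:Cheddar": "hard cheese made from cow milk",
}

# Step 1: store one vector per node.
INDEX = {node: embed(desc) for node, desc in NODE_DESCRIPTIONS.items()}


def search(query, k=2):
    """Step 3: return the k nodes whose vectors are most similar to the query."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda node: cosine(q, INDEX[node]), reverse=True)
    return ranked[:k]


print(search("dry red wine"))
```

A real system would replace `embed` with a learned text embedding and `INDEX` with a vector database, but the division into storage, generation and search stays the same.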

As pointed out previously in this issue, many of these topics are related to search. Now there seem to be functionalities beyond search, e.g.

  • Generate content using LLMs. Content can be textual content but also further graph structures
  • Validate based on statistical inferences, e.g. have SHACL constraints that a skos:matches relation can be justified by a statistical inference.

I am wondering if there is now a critical mass for starting work on this topic.

Both RDF and property graph vendors are quite active in this space now.
My perception is that property graph vendors are more recognized in the communities that need such functionalities, esp. AI.
Waiting too long to pick this up for RDF may mean losing the attention of, e.g., AI developers who are now starting to look into graphs.


ktk commented May 10, 2024

As a quick reminder on how this group works: Everyone can pick up one of the topics and make a concrete proposal in the form of a SEP, see https://github.com/w3c/sparql-dev/tree/main/SEP.

But from experience I can say that each SEP needs at least 1-2 people who really want to get it done and spend the time on it. We have a few successful examples, for instance when @afs and @Tpt created a SEP and then worked on implementations in both Jena & Oxigraph.

Author

fsasaki commented May 10, 2024

@ktk thanks for the reminder. My question was meant to see if somebody wants to pick this up (potentially jointly) :)

Collaborator

afs commented May 12, 2024

Interested!

There are several dimensions for SPARQL enhancements.

One part of this may be to work on the standardization of call-out extensibility.

Free text search is an example here. There is a common general sense of what a text search involves, while each text search system has its own features and syntax details. The choice is therefore either to define (yet another!) free text search syntax or to provide a flexible way to pass requests to text search systems.

What would be the requirements on a call-out interface to support LLMs? What about call-in?
