Integration of sparql with large language model related functionality #193

fsasaki · 2023-12-14T13:24:21Z

Why

Several vendors are looking into this space already, sometimes in relation to extended (vector based) search capabilities, sometimes in relation to more general large language model features like summarization or knowledge graph generation from unstructured text.

Previous work

Franz (RDF) https://franz.com/agraph/support/documentation/current/neuro-symbolic-llm-intro.html , for vector indexing with LLMs see https://franz.com/agraph/support/documentation/current/llm.html
Neptune (property graph part) https://docs.aws.amazon.com/neptune-analytics/latest/userguide/vector-index.html
neo4j (property graphs) https://github.com/neo4j/NaLLM?tab=readme-ov-file
Related discussions in Standardize free text search of RDF data #40 and Support for Vectors and Matrices #163

See also https://www.biorxiv.org/content/10.1101/463778v1.full.pdf .

Proposed solution

Nothing concrete yet, currently gathering related work.

Considerations for backward compatibility

Too early to discuss.

hartig · 2023-12-14T13:44:08Z

In the context of a tutorial that I gave a few years ago, I collected information about the full-text search features provided by several triple store vendors (BlazeGraph, Virtuoso, AllegroGraph, Stardog, GraphDB). The latest version of my slides with this information can be found at the following address, where slides 24 to 41 are the relevant ones.
https://www.ida.liu.se/research/semanticweb/events/SemWebCourse2019/TripleStores.pdf

ktk · 2023-12-14T14:09:46Z

@hartig great one, tnx

VladimirAlexiev · 2023-12-14T14:50:42Z

GraphDB supports the following:
https://graphdb.ontotext.com/documentation/10.4/gpt-queries.html

magic predicates to ask an LLM for text, list or table using data from your KG:
query explanation
result explanation, summarization, rephrasing, translation

https://graphdb.ontotext.com/documentation/10.4/retrieval-graphdb-connector.html

Indexing of KG entities in a vector database
Supports any text embedding algorithm and vector database. We've played with Weaviate, Elastic, etc
Uses the same powerful connector (indexing) language that we use for Elastic, Solr, Lucene
Automatic synchronization of changes in RDF data to the KG entity index
Supports nested objects (but not yet in the UI)
Serializes KG entities to text like this:

Franvino:
- is a RedWine.
- made from grape Merlo.
- made from grape Cabernet Franc.
- has sugar dry.
- has year 2012.
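
The text rendering above can be imitated with a few lines of Python. This is only a sketch of the general idea, not GraphDB's connector code: the function name `serialize_entity` and the (phrase, value) input shape are assumptions made for illustration.

```python
def serialize_entity(label, facts):
    """Render an entity as a heading line followed by one bullet per fact.

    `facts` is a list of (phrase, value) pairs, e.g. ("made from grape", "Merlo").
    Both the function name and this input shape are hypothetical.
    """
    lines = [f"{label}:"]
    for phrase, value in facts:
        lines.append(f"- {phrase} {value}.")
    return "\n".join(lines)


# Facts copied from the example entity in the comment above.
franvino = [
    ("is a", "RedWine"),
    ("made from grape", "Merlo"),
    ("made from grape", "Cabernet Franc"),
    ("has sugar", "dry"),
    ("has year", "2012"),
]

print(serialize_entity("Franvino", franvino))
```

In a real setup the phrases would presumably come from property labels in the graph, and the resulting text would be passed to an embedding model.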

https://graphdb.ontotext.com/documentation/10.4/talk-to-graph.html

A simple chatbot using a defined KG entity index

We are working on natural language querying (NLQ) aka knowledge graph question answering (KGQA).
Cheers!

jpmccu · 2023-12-14T17:42:46Z

Interesting! I've started a plug-in for integrating vectors into SPARQL by using registered IRIs as defined vector spaces, and rdf:JSON literals as objects. Haven't made progress on the search side yet, but this is super relevant to many of our research projects.

Jamie McCusker (she/her/hers), Director, Data Operations, Tetherless World Constellation, Rensselaer Polytechnic Institute, http://tw.rpi.edu

ericprud · 2023-12-19T13:44:36Z

@jpmccu , very cool. Any idea whether standardized value sets for vector spaces would allow "composition" of machines? Specific example: given a set of synaptic weights for diagnosing an ischemic stroke and another set for traffic patterns in a city, could one combine independently-trained machines in order to optimize stroke patient care (e.g. decide between close hospital or one further away that's good at angioplasty)? Sounds like you might be playing with stuff like that. Testing that in SPARQL would be very interesting indeed.

jpmccu · 2023-12-19T14:09:44Z

We assume that each vector space dimension is consistent (and is enforced before storage in the vector DB). One could concatenate vectors into a vector union in a new space, but we haven't really thought about doing multi-space comparisons. Right now we just have the representation and a plug-in for whyis that intercepts the vectors as they're being published. We haven't done much more than brainstorm what the SPARQL would look like, beyond the BGPs for access looking like the RDF (using Jena PropertyFunctions) and ANN search using a PropertyFunction similar to the full text search module.
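
A minimal sketch of the two ideas in this comment, dimension enforcement before storage and concatenation into a union space, could look as follows. The class `VectorSpace`, the function `union_space`, and the `ex:` identifiers are hypothetical and are not taken from the whyis plug-in.

```python
class VectorSpace:
    """A named vector space that rejects vectors of the wrong length."""

    def __init__(self, iri, dimensions):
        self.iri = iri                # IRI identifying the space (illustrative)
        self.dimensions = dimensions  # fixed number of components
        self.vectors = {}             # entity IRI -> list of numbers

    def add(self, entity, vector):
        # Enforce the dimension before the vector is stored.
        if len(vector) != self.dimensions:
            raise ValueError(
                f"{self.iri} expects {self.dimensions} components, got {len(vector)}"
            )
        self.vectors[entity] = list(vector)


def union_space(iri, left, right):
    """Build a new space by concatenating the vectors of shared entities."""
    combined = VectorSpace(iri, left.dimensions + right.dimensions)
    for entity in left.vectors.keys() & right.vectors.keys():
        combined.add(entity, left.vectors[entity] + right.vectors[entity])
    return combined


stroke = VectorSpace("ex:strokeSpace", 2)
stroke.add("ex:patient1", [0.1, 0.2])

traffic = VectorSpace("ex:trafficSpace", 3)
traffic.add("ex:patient1", [1.0, 2.0, 3.0])
traffic.add("ex:patient2", [4.0, 5.0, 6.0])

both = union_space("ex:strokeTrafficSpace", stroke, traffic)
print(both.dimensions, both.vectors)
```

Concatenation only places the components side by side. Whether distances in the combined space are meaningful is exactly the open multi-space question raised above.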

VladimirAlexiev · 2023-12-20T10:01:29Z

But @jpmccu and @ericprud, is it appropriate to store tensors in JSON?
Shouldn't we think of appropriate binary formats like HDF5 or stores like TensorStore?
There are also Data Abstraction Layers (eg GDAL) to isolate data access from the specific binary format/storage used.

Under https://accordproject.eu/ (automated compliance checking of architectural designs and urban planning) we're thinking about a binary data connector for GraphDB.

There's also schemaorg/schemaorg#3140

jpmccu · 2023-12-20T14:16:44Z

They aren't actually stored in JSON, just represented that way. And within my system, we can add loaders for any useful format. JSON is useful because it can be embedded in Turtle easily, and I was able to create an RDFlib handler for it that didn't require serialization and deserialization, so they remain Python objects when put in memory graphs.
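
To illustrate why a JSON representation embeds easily in Turtle, here is a standard-library sketch that does not use RDFlib. It assumes that the vector space IRI serves as the predicate, which the thread does not confirm; `VectorStatement`, `make_statement`, `to_turtle`, `vector_of`, and the example.org IRIs are all hypothetical.

```python
import json
from dataclasses import dataclass

# Datatype IRI of rdf:JSON.
RDF_JSON = "http://www.w3.org/1999/02/22-rdf-syntax-ns#JSON"


@dataclass(frozen=True)
class VectorStatement:
    subject: str   # IRI of the described entity
    space: str     # IRI of the vector space, used as predicate (assumption)
    lexical: str   # JSON text holding the vector components
    datatype: str = RDF_JSON


def make_statement(subject, space, values):
    """Wrap a sequence of numbers as a JSON-typed literal statement."""
    return VectorStatement(subject, space, json.dumps(list(values)))


def to_turtle(statement):
    """Serialize the statement as one Turtle triple with a typed literal."""
    escaped = statement.lexical.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"<{statement.subject}> <{statement.space}> "
        f'"{escaped}"^^<{statement.datatype}> .'
    )


def vector_of(statement):
    """Recover the Python list from the literal's lexical form."""
    return json.loads(statement.lexical)


st = make_statement(
    "http://example.org/wine/Franvino",
    "http://example.org/space/demo",
    [0.1, 0.2, 0.3],
)
print(to_turtle(st))
print(vector_of(st))
```

Unlike the handler described in the comment, this sketch does round-trip through a string. It shows only the representation, not how to avoid serialization.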

fsasaki · 2023-12-20T14:56:15Z

Thanks a lot to all for the discussion so far. Let me try to structure this.

A lot of the discussion seems to be focused on how to encode vectors and how to use them for search.

And there are existing issues like #40 related to that.

There is nearly no discussion on capabilities to query LLMs and to generate graphs out of them. E.g. the magic predicates mentioned by @VladimirAlexiev or the ones from Franz I had mentioned at the top of this issue. Do people see a use case for these to be standardised?

Also, how about use cases that go beyond querying but build on vector-based similarity? One could use this, for example, for KG construction (which could build on queries via SPARQL CONSTRUCT) or for validation ("check whether everything that is skos:related is really semantically related").
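
The validation use case can be sketched in plain Python: given embeddings for concepts, flag the skos:related pairs whose cosine similarity falls below a threshold. The function names, the threshold of 0.5, and the toy two-dimensional embeddings are assumptions chosen for illustration; a real check would use vectors from an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weakly_supported(related_pairs, embeddings, threshold=0.5):
    """Return the pairs whose embeddings are less similar than `threshold`."""
    return [
        (s, o)
        for s, o in related_pairs
        if cosine(embeddings[s], embeddings[o]) < threshold
    ]


# Toy data: hypothetical concepts with hand-picked vectors.
embeddings = {
    "ex:wine": [1.0, 0.0],
    "ex:grape": [0.9, 0.1],
    "ex:tractor": [0.0, 1.0],
}
related = [("ex:wine", "ex:grape"), ("ex:wine", "ex:tractor")]

print(weakly_supported(related, embeddings))
```

The flagged pairs are candidates for review rather than proven errors, since the threshold is a judgment call.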

fsasaki · 2024-05-10T09:13:28Z

Some updates on this topic with newer developments.

@rdfguy mentioned in the KGC panel discussions on KG standards that the combination of symbolic and statistical reasoning would be a potential future direction for graph technologies.

At Data Week Leipzig 2024, Lisa Wenige gave a 15-minute presentation on what this might look like, showing SPARQL extensions for LLMs; see https://www.youtube.com/watch?v=QfPCU8RiNhA&list=PLiyYYLqA8v5NBcAZJy6CpLVnDMrU4Y4yL&t=8344s

At the Knowledge Graph Conference, LLM support was shown by nearly all knowledge graph vendors.
A few steps that seem to be common to Graph RAG patterns are:

Storing vectors for (parts of) a graph, see e.g. Support for Vectors and Matrices #163
Providing vector generation capabilities. Common patterns seem to be: vector generation based on node descriptions or Concise Bounded Descriptions, or based on custom functions.
Provide similarity search based on vectors.
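
The three steps above can be sketched end to end in a few lines. The embedding here is a toy one (normalized word counts) standing in for a real model, and all names (`embed`, `build_index`, `search`) and the `ex:` node identifiers are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: L2-normalized word counts, stored as a sparse dict."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}


def similarity(a, b):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def build_index(descriptions):
    """Steps 1 and 2: generate a vector per node description and store it."""
    return {node: embed(text) for node, text in descriptions.items()}


def search(index, query, k=2):
    """Step 3: rank nodes by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(index, key=lambda node: similarity(q, index[node]), reverse=True)
    return ranked[:k]


descriptions = {
    "ex:Franvino": "Franvino is a red wine made from grape Merlo",
    "ex:Tractor": "A tractor is a farm vehicle",
}
index = build_index(descriptions)
print(search(index, "red wine grape", k=1))
```

In a Graph RAG setting the retrieved nodes would then be expanded through graph queries and handed to an LLM as context. That part is outside this sketch.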

As pointed out previously in this issue, many of these topics are related to search. Now, there seem to be functionalities beyond search, e.g.

Generate content using LLMs. Content can be textual content but also further graph structures
Validate based on statistical inferences, e.g. have SHACL constraints requiring that a skos:matches relation can be justified by a statistical inference.

I am wondering if there is now a critical mass for starting work on this topic.

Both RDF and property graph vendors are quite active in this space now.
My perception is that property graph vendors are more recognized in the communities that need such functionalities, esp. AI.
Waiting too long to pick this up for RDF may mean losing the attention of, e.g., AI developers who are now starting to look into graphs.

ktk · 2024-05-10T10:20:00Z

As a quick reminder on how this group works: Everyone can pick up one of the topics and make a concrete proposal in the form of a SEP, see https://github.com/w3c/sparql-dev/tree/main/SEP.

But from experience I can say that each SEP needs at least 1-2 people who really want to get it done and are willing to spend the time on it. We have a few successful examples, for instance when @afs and @Tpt created a SEP and then worked on implementations in both Jena and Oxigraph.

fsasaki · 2024-05-10T10:23:36Z

@ktk thanks for the reminder. My question was meant to see if somebody wants to pick this up (potentially jointly) :)

afs · 2024-05-12T14:17:56Z

Interested!

There are several dimensions for SPARQL enhancements.

One part of this may be to work on the standardization of call-out extensibility.

Free text search is an example here. There is a common general sense of what a text search involves, while each text search system has particular features and syntax details. The options are therefore either to define a (another!) free text search syntax or to provide a flexible way to pass requests to text search systems.

What would be the requirements on a call-out interface to support LLMs? What about call-in?
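
One way to picture a generic call-out interface is a registry that maps a service IRI to a handler, where every handler takes named arguments and returns rows of variable bindings. This is a speculative sketch, not a proposal from the thread: `CallOutRegistry`, the handler signature, and the `ex:` service names are assumptions, and the "LLM" handler is a stub that calls no model.

```python
from typing import Callable, Dict, Iterable, List

Bindings = Dict[str, str]
CallOut = Callable[[Bindings], Iterable[Bindings]]


class CallOutRegistry:
    """Maps service IRIs to handlers that share one calling convention."""

    def __init__(self) -> None:
        self._handlers: Dict[str, CallOut] = {}

    def register(self, iri: str, handler: CallOut) -> None:
        self._handlers[iri] = handler

    def invoke(self, iri: str, arguments: Bindings) -> List[Bindings]:
        if iri not in self._handlers:
            raise KeyError(f"no call-out registered for {iri}")
        # Materialize the rows so callers get a plain list of bindings.
        return [dict(row) for row in self._handlers[iri](arguments)]


DOCUMENTS = {
    "ex:Franvino": "Red wine made from Merlo",
    "ex:Tractor": "Farm vehicle",
}


def text_search(arguments: Bindings) -> Iterable[Bindings]:
    """Substring search standing in for a real text index."""
    needle = arguments["query"].lower()
    for node, text in DOCUMENTS.items():
        if needle in text.lower():
            yield {"node": node}


def stub_llm(arguments: Bindings) -> Iterable[Bindings]:
    """Placeholder for a model call; it only echoes the prompt."""
    yield {"answer": f"(no model attached) {arguments['prompt']}"}


registry = CallOutRegistry()
registry.register("ex:textSearch", text_search)
registry.register("ex:llm", stub_llm)

print(registry.invoke("ex:textSearch", {"query": "wine"}))
print(registry.invoke("ex:llm", {"prompt": "Summarize ex:Franvino"}))
```

The point of the shared signature is that a query engine could pass bindings to any registered service without knowing its native syntax. Call-in, the reverse direction, is not covered here.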
