
feat: add graph_stores, impl Simple KG & Nebula KG #2581

Merged: 10 commits from the external_kg branch merged into run-llama:main on Jun 7, 2023

Conversation

@wey-gu (Contributor) commented May 6, 2023:

Draft for RFC #1318.

  • add graph_stores
  • implement the Simple KG and NebulaGraph KG graph stores (a usage sketch follows below)

WIP:

  • get comments from Jerry
  • continue to rebase onto the 0.6.0 changes (introduce graph_store into the storage context)
  • implement load_from_disk and save_to_disk (not needed after 0.6.x)
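For orientation, a rough sketch of how the proposed graph_stores abstraction might be used once rebased onto 0.6.x. Class names, import paths, and parameters are assumptions based on this PR's description, not necessarily the merged API.

# Hedged sketch: build a knowledge graph index with a graph store bundled
# into the storage context (names assumed, not the final API).
from llama_index import KnowledgeGraphIndex, SimpleDirectoryReader, StorageContext
from llama_index.graph_stores import SimpleGraphStore

documents = SimpleDirectoryReader("./data").load_data()

# The graph store sits in the storage context alongside the docstore,
# index store, and vector store.
graph_store = SimpleGraphStore()
storage_context = StorageContext.from_defaults(graph_store=graph_store)

index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=2,  # assumed parameter name
)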

@wey-gu force-pushed the external_kg branch 3 times, most recently from 0e6b231 to 38c32be, on May 6, 2023 09:51
@wey-gu mentioned this pull request on May 6, 2023
@jerryjliu (Collaborator) commented:

@wey-gu this is cool, thanks for the PR! I'll try to take some time today to review and offer suggestions on how to rebase.

@Disiok (Collaborator) commented May 8, 2023:

Hey @wey-gu this is great! As @jerryjliu noted, we made some fairly significant changes to how we handle storage (see https://gpt-index.readthedocs.io/en/latest/how_to/storage.html)

@Disiok (Collaborator) commented May 8, 2023:

Some specific notes:

  • For stores backed by an external connection (e.g. NebulaGraph here), we no longer try to save the configuration; instead we ask the user to reconstruct the connection (a sketch of this pattern follows after this list).
  • We created a storage context that bundles the docstore, index store, and vector store. The question here is whether we should add another graph store object into it. It would be great if you could take a look at 0.6.0 and let us know your thoughts.
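A minimal sketch of the "reconstruct the connection" pattern described above, assuming a NebulaGraphStore class whose connection details are supplied by user code rather than persisted to disk. Environment variable names, parameters, and values are illustrative assumptions, not the final API.

# Hedged sketch: connection-backed stores are re-created on load instead of
# being restored from persisted configuration.
import os

from llama_index import StorageContext, load_index_from_storage
from llama_index.graph_stores import NebulaGraphStore

# Connection details live in user code / the environment, not in saved config.
os.environ["NEBULA_USER"] = "root"              # assumed env var names
os.environ["NEBULA_PASSWORD"] = "nebula"
os.environ["NEBULA_ADDRESS"] = "127.0.0.1:9669"

graph_store = NebulaGraphStore(
    space_name="llamaindex",                    # assumed space / edge / tag setup
    edge_types=["relationship"],
    rel_prop_names=["relationship"],
    tags=["entity"],
)

# The rest of the storage context can still be loaded from disk.
storage_context = StorageContext.from_defaults(
    persist_dir="./storage", graph_store=graph_store
)
index = load_index_from_storage(storage_context)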

@Disiok (Collaborator) commented May 8, 2023:

Happy to help and answer any questions you might have about 0.6.0

@wey-gu (Contributor, Author) commented May 8, 2023:

> Hey @wey-gu this is great! As @jerryjliu noted, we made some fairly significant changes to how we handle storage (see https://gpt-index.readthedocs.io/en/latest/how_to/storage.html)

Dear @Disiok

Got it! I'll make changes based on the new storage design and ask for further comments.

Thanks a lot!
Cheers// Wey

@jerryjliu (Collaborator) commented:

Awesome! Yeah @wey-gu, adding on to what @Disiok said: a new graph store abstraction would probably be used mostly by specific indices, like our knowledge graph index. It can be an optional part of our StorageContext that's None by default.
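A toy sketch of what that suggestion could look like, for illustration only (the merged PR ends up defaulting graph_store to a SimpleGraphStore rather than None, per the review discussion further down):

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class StorageContextSketch:
    """Toy storage context with an optional graph store."""

    docstore: Any
    index_store: Any
    vector_store: Any
    # None by default; only graph-based indices such as the knowledge graph
    # index would read it.
    graph_store: Optional[Any] = None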

@wey-gu force-pushed the external_kg branch 3 times, most recently from bfb3e79 to 2e1f9af, on May 25, 2023 09:27
@wey-gu (Contributor, Author) commented May 25, 2023:

Dear @Disiok

> Some specific notes:
>
> • for storage based on external connection (e.g. NebulaGraph) here, we no longer try to save the configuration, instead we ask user to reconstruct the connection

wey: Got it, now I don't have to implement that :)

> • we created a storage context that bundles docstore, index store, and vector store. The question here is whether we should add another graph store object into it. Would be great if you could take a look at 0.6.0 and let us know your thoughts.

wey: This new abstraction is awesome. Is it possible to enable a chainable storage context? The knowledge graph index (now with graph_store added) comes with embedding support, which is memory based; could that embedding storage inside the knowledge graph index also consume the storage context in the future?

Dear @jerryjliu

> awesome yeah @wey-gu adding on to what @Disiok said, it's possible a new graph store abstraction would be mostly used for specific indices, like our knowledge graph index. can be an optional part of our StorageContext that's none by default

wey: now graph_store has been introduced into StorageContext, and the knowledge graph index has been adapted to be based on SimpleGraphStore or NebulaGraphStore :)

What do you think of this change? It's now storage-context based :)

I will be working on more typical stories/demos in the docs to help users understand how it works and how it helps:

  • better consuming global/cross-node context
  • consuming an existing knowledge graph in a custom retriever
  • a composable index co-existing with other indexes
  • building a knowledge graph by simply dropping docs in (with the help of LlamaHub loaders for different sources)

in a separate docs PR + blogs after this is merged.

Thanks again! I am super excited about this change :D

BR//Wey

@wey-gu (Contributor, Author) commented May 29, 2023:

Sorry, will fix the lint and UT issue.

@wey-gu force-pushed the external_kg branch 2 times, most recently from ed9fcc3 to abd3690, on May 29, 2023 08:14
@wey-gu (Contributor, Author) commented May 29, 2023:

Linting and UT pass locally now, thanks! ❤
cc @Disiok @jerryjliu

@jerryjliu (Collaborator) commented:

hey @wey-gu thanks for the changes - will take an action item to review this :)

@wey-gu (Contributor, Author) commented Jun 5, 2023:

Pushed another version to address the conflicts from the rename of GPTKnowledgeGraphIndex to KnowledgeGraphIndex by @Disiok.

@logan-markewich self-requested a review on June 5, 2023 21:43
@wey-gu (Contributor, Author) commented Jun 6, 2023:

Thanks @logan-markewich for helping with the review!

Also, another rebase to resolve the lint error.

@logan-markewich (Collaborator) left a review:

This looks super good, and thanks for tackling such a complicated PR!

A few minor nits, but my main worry is backwards compatibility (both with previously saved knowledge graph indexes and with previously saved indexes in general).

The storage context code likely needs a bit of TLC to be more robust.

Lastly, and maybe not in this PR, but some docs on using Nebula would be cool. (I actually still need to try setting it up lol)

Review comments were left on:

  • docs/examples/index_structs/knowledge_graph/example.html
  • llama_index/__init__.py
  • llama_index/data_structs/data_structs.py
  • llama_index/indices/knowledge_graph/base.py
  • llama_index/indices/knowledge_graph/retrievers.py
  • llama_index/storage/storage_context.py
@wey-gu (Contributor, Author) left a review comment:

Will work on all great review/improvement points from Logan, thanks again!

Replies were left on:

  • docs/examples/index_structs/knowledge_graph/example.html
  • llama_index/__init__.py
  • llama_index/data_structs/data_structs.py
  • llama_index/graph_stores/types.py
  • llama_index/indices/knowledge_graph/retrievers.py
@wey-gu (Contributor, Author) replied in a review thread:

Indeed, that makes a lot of sense. I hadn't thought about backwards compatibility yet, but I'll address it in this round of pushes. Thanks!

Still need a follow-up commit to address backwards compatibility of graph_store.json from the previous implementation.
@logan-markewich (Collaborator) commented:

I think this is good to ship! Looking forward to your future work @wey-gu! Knowledge graphs can be very powerful, and I hope llama-index can continue to be a great tool to leverage them.

Comment on lines +86 to +87
SIMPLE_KG = "simple_kg"
NEBULAGRAPH = "nebulagraph"
A Collaborator commented:

nit: where are these used?

A Collaborator replied:

I think this just follows the same structure for the vector store registry -- is that registry used anywhere?
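For illustration, a hedged sketch of the registry pattern being referenced here, mirroring the vector store registry; the mapping name below is an assumption, not the merged code:

from enum import Enum

class GraphStoreType(str, Enum):
    SIMPLE_KG = "simple_kg"
    NEBULAGRAPH = "nebulagraph"

# A mapping like this would let persisted metadata name a store type and
# resolve it back to a class later, even if nothing consumes it yet.
GRAPH_STORE_TYPE_TO_CLASS_NAME = {
    GraphStoreType.SIMPLE_KG: "SimpleGraphStore",
    GraphStoreType.NEBULAGRAPH: "NebulaGraphStore",
}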

@@ -60,6 +60,7 @@ def __init__(
self._storage_context = storage_context or StorageContext.from_defaults()
self._docstore = self._storage_context.docstore
self._vector_store = self._storage_context.vector_store
self._graph_store = self._storage_context.graph_store
A Collaborator commented:

I don't think the base class should know about graph store?

A Collaborator replied:

Mmm, true, but you could say the same about the vector store? We can take a look at both in a future PR.
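A minimal sketch of the alternative raised in this thread, where the base index stays unaware of graph stores and only the knowledge graph index reads one from the storage context (illustrative only, not the merged code):

class BaseIndexSketch:
    """Toy base index: knows the storage context, not the graph store."""

    def __init__(self, storage_context):
        self._storage_context = storage_context
        self._docstore = storage_context.docstore
        self._vector_store = storage_context.vector_store


class KnowledgeGraphIndexSketch(BaseIndexSketch):
    """Only the KG index reaches into the storage context for the graph store."""

    def __init__(self, storage_context):
        super().__init__(storage_context)
        self._graph_store = storage_context.graph_store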


"""
if persist_dir is None:
docstore = docstore or SimpleDocumentStore()
index_store = index_store or SimpleIndexStore()
vector_store = vector_store or SimpleVectorStore()
graph_store = graph_store or SimpleGraphStore()
A Collaborator commented:

Just checking: there should be minimal overhead in doing this, right?

Otherwise, we should just not construct an object and leave it as None, so it doesn't impact other types of indices that don't use graphs.

A Collaborator replied:

It's essentially the same as a vector_store in terms of impact -> it should be near-zero.

Similar to how other indexes don't use the vector store, but we still instantiate it, I guess.
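To make the "near-zero overhead" point concrete, a toy in-memory sketch of what a default simple graph store amounts to (an assumption about its shape, not the actual SimpleGraphStore code): construction allocates little more than an empty mapping.

from collections import defaultdict
from typing import Dict, List

class InMemoryGraphStoreSketch:
    """Toy stand-in: a dict of subject -> list of [relation, object] pairs."""

    def __init__(self) -> None:
        # Constructing the default store is just an empty defaultdict, so
        # indices that never touch the graph store pay essentially nothing.
        self._rel_map: Dict[str, List[List[str]]] = defaultdict(list)

    def upsert_triplet(self, subj: str, rel: str, obj: str) -> None:
        self._rel_map[subj].append([rel, obj])

    def get(self, subj: str) -> List[List[str]]:
        return self._rel_map[subj]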

@Disiok (Collaborator) left a review:

Added some nitpick comments, but I don't want to block merging this massive PR.
We can do more cleanup after landing as well.

@logan-markewich (Collaborator) commented Jun 7, 2023:

Addressed a majority of your comments, @Disiok. I think it's good to land for now :)

@logan-markewich merged commit 7b9f6da into run-llama:main on Jun 7, 2023
8 checks passed
@jerryjliu (Collaborator) commented:

@wey-gu this is an amazing change, thanks for the contribution. Thanks to @logan-markewich and @Disiok for the reviews too. Just a heads up: planning to publish this Friday morning Pacific time; let me know your Twitter handle!

@wey-gu (Contributor, Author) commented Jun 8, 2023:

> @wey-gu this is an amazing change, thanks for the contribution. thanks to @logan-markewich @Disiok for the reviews too. just a heads up, planning to publish this friday morning pacific time - let me know your twitter handle!

Dear @jerryjliu,

Thanks so much! I am honored to have the chance to bring something to the LlamaIndex project, and it's been an awesome experience working in the great Llama community with you, @logan-markewich, and @Disiok. I am working on an upcoming PR and a demo project/video on top of this change.

My handle is wey_gu :)

Thanks!

@wey-gu (Contributor, Author) commented Jun 8, 2023:

> I think this is good to ship! Looking forward to your future work @wey-gu! Knowledge graphs can be very powerful, and hoping llama-index can continue to be a great tool to leverage them

Dear @logan-markewich,

Many thanks for the great help and guidance (and big thanks to @Disiok!!). I am preparing the upcoming PRs/demos; let's make LLMs understand more knowledge with graphs!

BR//Wey

f" [rel.`{self._rel_prop_names[0]}`, dst(rel)] "
f"] AS rels "
f"RETURN "
f" subj,"
A Contributor commented:

1. Bug fix: rels needs to be added to the RETURN statement, like:

f"  subj, rels,"

(a sketch of this fix follows after this comment)

2. Question: when I add an entity type to the MATCH statement, like:

MATCH (s:entity)

and add a LIMIT statement, like:

LIMIT 1000

it still raises ValueError: Scan vertices or edges need to specify a limit number, or limit number can not push down.

My env:
nebula3-python==3.4.0
NebulaGraph version 3.1.0
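A hedged sketch of the first point above, showing the suggested fix rather than the merged code: include rels in the query's RETURN clause so the collected relationships are actually returned. rel_prop_name below is a placeholder standing in for self._rel_prop_names[0].

rel_prop_name = "relationship"  # placeholder for self._rel_prop_names[0]

query_fragment = (
    f"  [rel.`{rel_prop_name}`, dst(rel)] "
    f"] AS rels "
    f"RETURN "
    f"  subj, rels,"  # `rels` added to the RETURN list, per the suggestion above
)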

@wey-gu (Contributor, Author) replied:

Could you please upgrade to NebulaGraph 3.5.0 and see what happens? This implementation expects NebulaGraph 3.5.0 or later.

@pachgadehardik commented:

I need to load existing data/indexes from NebulaGraph into the retriever for the knowledge graph query_engine. Is that functionality available in llama-index?

@wey-gu (Contributor, Author) commented Aug 7, 2023:

> Need to load the existing data indexes from nebula graph into the retriever for knowledge graph query_engine. Is that functionality available in llama-index?

@pachgadehardik

For text2cypher, yes, it's already implemented! Following https://gpt-index.readthedocs.io/en/latest/examples/query_engine/knowledge_graph_query_engine.html is all that's needed (see the sketch below).

For Graph RAG (find the major entities with keywords or embeddings from the task, then get a subgraph as context), it's not yet fully supported. I am thinking of adding this soon.
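For reference, a rough usage sketch following the linked notebook; the import paths, parameter names, and space/edge/tag values are taken from that era of llama-index and its example, and may differ in your setup:

from llama_index import StorageContext
from llama_index.graph_stores import NebulaGraphStore
from llama_index.query_engine import KnowledgeGraphQueryEngine

# Reconstruct the NebulaGraph connection (values assumed from the example).
graph_store = NebulaGraphStore(
    space_name="llamaindex",
    edge_types=["relationship"],
    rel_prop_names=["relationship"],
    tags=["entity"],
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

# text2cypher: the query engine translates the question into a graph query.
query_engine = KnowledgeGraphQueryEngine(
    storage_context=storage_context,
    verbose=True,
)
response = query_engine.query("Tell me about Peter Quill?")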

@pachgadehardik commented:

> For text2cypher, yes, it's already implemented! Follow https://gpt-index.readthedocs.io/en/latest/examples/query_engine/knowledge_graph_query_engine.html is all needed.
>
> For Graph RAG(find major entities with keyword or embedding from the task, get subgraph as context), not yet fully supported. I am thinking of adding this soon.

Thanks a lot @wey-gu for the update. However, I am facing an issue while running KnowledgeGraphQueryEngine. NebulaGraphStore is being loaded, but when executing the KGQueryEngine I hit a query syntax error:

ValueError: Query failed. Query:
MATCH ()-[e:relationship]->()
WITH e limit 1
MATCH (m)-[:relationship]->(n) WHERE id(m) == src(e) AND id(n) == dst(e)
RETURN "(:" + tags(m)[0] + ")-[:relationship]->(:" + tags(n)[0] + ")" AS rels
, Param: {} Error message: Scan vertices or edges need to specify a limit number, or limit number cannot push down.

@wey-gu (Contributor, Author) commented Aug 8, 2023:

> Thanks a lot @wey-gu for the update. However I am facing an issue while running KnowledgeGraphQueryEngine. [...] Error message: Scan vertices or edges need to specify a limit number, or limit number cannot push down.

Dear @pachgadehardik,

Could you share the NebulaGraph version? If it's an older version like 3.1.0, it's highly recommended to upgrade to NebulaGraph 3.5.0; this is basically just a binary replacement (offline).

@pachgadehardik commented:

> Could you share the NebulaGraph version? If it's an older version like 3.1.0, it's highly recommended to upgrade to NebulaGraph 3.5.0; this is basically just a binary replacement (offline).

@wey-gu, the current version deployed in the Kubernetes cluster is 3.4.0.

@wey-gu (Contributor, Author) commented Aug 9, 2023:

> @wey-gu, the current version deployed in the Kubernetes cluster is 3.4.0.

I see: the current way of fetching the schema is not compatible with clusters older than 3.5.0. I'll open a PR to fix this today! Thanks for letting me know, and sorry about that!

@wey-gu (Contributor, Author) commented Aug 9, 2023:

@pachgadehardik

It should be fixed in #7204
