
Add Option for Knowledge Graph Similarities #487

Merged: 27 commits, Feb 25, 2023
Conversation

@logan-markewich (Collaborator) commented Feb 20, 2023

This PR adds support for using embedding similarities within the knowledge graph index.

You may have noticed that relying purely on keyword mapping to find relation tuples can result in no relation tuples being found at all! If you didn't turn on the `include_text=True` option, chances are the model will hallucinate an answer.

This PR allows you to use the existing keyword mapping, the top_k related knowledge graph tuples, or a combination of both!

The following is some example usage, using both keywords and embeddings:

```python
from gpt_index.data_structs.data_structs import MODE_KEYWORDS, MODE_EMBEDDINGS
from gpt_index.indices.knowledge_graph.base import GPTKnowledgeGraphIndex

index = GPTKnowledgeGraphIndex(
    documents, llm_predictor=llm_predictor, mode=MODE_EMBEDDINGS + MODE_KEYWORDS
)
response = index.query(
    "What did the author do growing up?",
    include_text=False,
    similarity_cutoff=0.8,
    top_k_embeddings=1,
    response_mode="tree_summarize",
)
```

In my testing, the similarity cutoff is pretty important. When unrelated nodes are added to the prompt, it ends up hurting the response accuracy (maybe the prompt itself needs some investigation too 🤔)
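To illustrate the cutoff behaviour described above, here is a minimal sketch; the function and example triplets are hypothetical simplifications, not the library's actual internals:

```python
# Illustrative sketch of the similarity-cutoff idea: triplets whose
# embedding similarity falls below the cutoff are dropped before they
# can pollute the prompt. These names are hypothetical, not the real API.

def filter_by_similarity(scored_triplets, similarity_cutoff=0.8):
    """Keep only (triplet, score) pairs at or above the cutoff."""
    return [(t, s) for t, s in scored_triplets if s >= similarity_cutoff]

scored = [
    (("author", "wrote", "short stories"), 0.91),
    (("interleaf", "added", "scripting language"), 0.42),
]

filtered = filter_by_similarity(scored, similarity_cutoff=0.8)
# Only the 0.91 triplet survives; the unrelated one is discarded.
```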

@logan-markewich logan-markewich marked this pull request as draft February 20, 2023 01:56
@logan-markewich logan-markewich changed the title Add Option for Knowledge Graph Similarities [WIP] Add Option for Knowledge Graph Similarities Feb 20, 2023
@logan-markewich logan-markewich changed the title [WIP] Add Option for Knowledge Graph Similarities Add Option for Knowledge Graph Similarities Feb 20, 2023
@logan-markewich logan-markewich marked this pull request as ready for review February 20, 2023 02:01
@logan-markewich (Collaborator, Author) commented Feb 20, 2023

Ah, maybe don't look at this quite yet; I just realized `_should_use_node` filters out everything when `similarity_cutoff` is set.

@logan-markewich (Collaborator, Author) commented:

Ok, the bug is fixed! (Also, I need to figure out my flake8 setup; it wasn't catching the same rules as the automated checks.)

@jerryjliu (Collaborator) left a comment:

This is amazing! I have a high-level comment about how to structure this a bit more nicely, let me know if you agree!

Once this PR is ready to land, let me know if you have a notebook sample demonstrating this capability - this would be a great feature to publicize :)

Review comments on:
- gpt_index/data_structs/data_structs.py
- gpt_index/indices/knowledge_graph/base.py
- gpt_index/indices/query/knowledge_graph/query.py
- gpt_index/indices/knowledge_graph/base.py
@logan-markewich (Collaborator, Author) commented:

Thanks for the comments! I'll work to decouple the query from the index and add some usage examples to a notebook 👍🏻

@logan-markewich (Collaborator, Author) commented Feb 21, 2023

Alright, I think I've addressed your comments 🚀

I also added a mechanism to use the `include_text` option for triplets that are found using the embeddings.

Just working on getting some examples, but I'm struggling a bit with the Paul Graham essay; I'm trying to find a few queries where this can really shine. I might try a different example dataset 🤔

I think a lot of it also depends on the quality of triplets (which could be alleviated by inserting 3rd party triplets directly into the index struct 😉)
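A rough sketch of what that `include_text` mechanism amounts to; the function and data structures below are hypothetical simplifications, not the library's actual implementation:

```python
# Hypothetical sketch: with include_text=True, each triplet found via
# embeddings also pulls the source text chunk it was extracted from
# into the prompt context, rather than the bare triplet alone.

def build_context(triplets, triplet_to_source, include_text=False):
    """Render triplets (and optionally their source text) for the prompt."""
    lines = [" -> ".join(t) for t in triplets]
    if include_text:
        lines.extend(
            triplet_to_source[t] for t in triplets if t in triplet_to_source
        )
    return "\n".join(lines)

triplets = [("author", "studied", "painting")]
sources = {("author", "studied", "painting"): "The author studied painting at RISD."}

context = build_context(triplets, sources, include_text=True)
```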

@jerryjliu (Collaborator) left a comment:

this looks pretty good - some minor comments. let me know if you need any help on the notebook as well!

Review comments on:
- gpt_index/indices/knowledge_graph/base.py
- gpt_index/indices/query/knowledge_graph/query.py
- gpt_index/indices/query/knowledge_graph/query.py
```diff
@@ -93,6 +154,9 @@ def _get_nodes_for_response(
         # filter sorted nodes
         sorted_nodes = [node for node in sorted_nodes if self._should_use_node(node)]
         for chunk_idx, node in zip(sorted_chunk_indices, sorted_nodes):
+            # nodes are found with keyword mapping, give high conf to avoid cutoff
+            if similarity_tracker is not None:
+                similarity_tracker.add(node, 1000.0)
```
@jerryjliu (Collaborator) commented:

this is a bit hacky (the num 1000 is a bit arbitrary), but it's ok, i'll take a stab at this later

@logan-markewich (Collaborator, Author) commented Feb 22, 2023

Definitely hacky; without the comment, this line will look pretty strange.

Maybe to be less hacky, we could override get_nodes_and_similarities_for_response or `_should_use_node` to avoid the original problem of nodes being removed?
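The override idea could look something like this minimal sketch; the class and method names here are illustrative, not the actual gpt_index query internals:

```python
# Sketch contrasting the sentinel-score hack with an override of
# _should_use_node: keyword-matched nodes simply bypass the cutoff,
# so no arbitrary 1000.0 score is needed. All names are hypothetical.

class KGQuerySketch:
    def __init__(self, similarity_cutoff=None):
        self.similarity_cutoff = similarity_cutoff
        self._keyword_nodes = set()

    def mark_keyword_node(self, node_id):
        # Nodes found via keyword mapping should never be filtered out.
        self._keyword_nodes.add(node_id)

    def _should_use_node(self, node_id, similarity):
        if node_id in self._keyword_nodes:
            return True  # bypass the cutoff entirely
        if self.similarity_cutoff is None:
            return True
        return similarity >= self.similarity_cutoff

query = KGQuerySketch(similarity_cutoff=0.8)
query.mark_keyword_node("node-1")
```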

Review comment on: tests/indices/knowledge_graph/test_base.py
@logan-markewich (Collaborator, Author) commented Feb 23, 2023

Ok, I've addressed basically everything. I'm not 100% sure where to put the new enum, so it's in the query file for now.

Still working on the example, but I could maybe use a hand if you have time.

I was using the NYC wiki data example; I'll attach the index to save you the tokens (it was constructed using the same index options as the knowledge graph example notebook).

If you don't have time that's totally fine! I'll explore this more after work tomorrow otherwise. 💪🏻
Attachment: index_kg_emb_nyc.zip

@jerryjliu (Collaborator) commented:

> If you don't have time that's totally fine! I'll explore this more after work tomorrow otherwise. 💪🏻 index_kg_emb_nyc.zip

thanks for sending over @logan-markewich! I might not have time until Friday or this weekend to take a deeper look, but I can get to it then if you'd like me to. Just lmk!

@jerryjliu (Collaborator) left a comment:

Oops, sorry, I accidentally approved this when I meant to approve another PR! Putting a hold on this until I get back to it.

@logan-markewich (Collaborator, Author) commented:

Settled on an example inside the existing KG notebook. I'm pretty happy with the current state of this PR 💪🏻

@jerryjliu (Collaborator) commented:

> Settled on an example inside the existing KG notebook. I'm pretty happy with the current state of this PR 💪🏻

sweet! will take a look in a bit

@jerryjliu (Collaborator) left a comment:

Looks good, and amazing work on the demo notebook! I'll take a look at fixing some of these nits before merging.

```python
    KEYWORD ("keyword"): Default query mode, using keywords to find triplets.
    EMBEDDING ("embedding"): Embedding mode, using embeddings to find
        similar triplets.
    HYBRID ("hyrbid"): Hyrbid mode, combining both keywords and embeddings
```
@jerryjliu (Collaborator) commented:

nit: hybrid

```python
    embedding_mode (KGQueryMode): Specifies whether to use keywords,
        embeddings, or both to find relevant triplets. Should be one of
        "keyword", "embedding", or "hybrid".
    top_k_embeddings (int): The number of top embeddings to use
```
@jerryjliu (Collaborator) commented:

nit: let's rename this to similarity_top_k to keep consistent with vector index queries

Jerry Liu added 2 commits February 24, 2023 23:27
@jerryjliu jerryjliu merged commit 7281dea into run-llama:main Feb 25, 2023