-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Option for Knowledge Graph Similarities #487
Conversation
Ah maybe don't look at this quite yet, just realized |
Ok bug is fixed! (Also I need to figure out my flake8 setup, it wasn't catching the same rules as the automated checks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is amazing! I have a high-level comment about how to structure this a bit more nicely, let me know if you agree!
Once this PR is ready to land, let me know if you have a notebook sample demonstrating this capability - this would be a great feature to publicize :)
Thanks for the comments! I'll work to de-couple the query from the index, and add some examples of the usage to a notebook 👍🏻 |
Alright, I think I've addressed your comments 🚀 I also added a mechanism to use the "include_text" option for triplets that are found using the embeddings Just working on getting some examples, but I'm struggling a bit with the Paul Graham essay. Just trying to find a few queries where it can really shine. I might try with a different example dataset 🤔 I think a lot of it also depends on the quality of triplets (which could be alleviated by inserting 3rd party triplets directly into the index struct 😉) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks pretty good - some minor comments. let me know if you need any help on the notebook as well!
@@ -93,6 +154,9 @@ def _get_nodes_for_response( | |||
# filter sorted nodes | |||
sorted_nodes = [node for node in sorted_nodes if self._should_use_node(node)] | |||
for chunk_idx, node in zip(sorted_chunk_indices, sorted_nodes): | |||
# nodes are found with keyword mapping, give high conf to avoid cutoff | |||
if similarity_tracker is not None: | |||
similarity_tracker.add(node, 1000.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a bit hacky (the num 1000 is a bit arbitrary), but it's ok, i'll take a stab at this later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely hacky, without the comment this line will look pretty strange.
Maybe to be less hacky, we can override the get_nodes_and_similarities_for_response
or _should_use_node
? to avoid the original problem of nodes being removed
Ok, addressed basically everything. Not 100% sure where to put the new enum, so its in the query file for now. Still working on the example, but could maybe use a hand if you have time. I was using the NYC wiki data example, I'll attach the index to save you the tokens (constructed using the same index options from the knowledge graph example notebook) If you don't have time that's totally fine! I'll explore this more after work tomorrow otherwise. 💪🏻 |
thanks for sending over @logan-markewich! I might not have time until friday or this weekend or so to take a deeper look, but can get to it then if you'd like me to. just lmk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops sorry i accidentally meant to approve another PR! putting a hold on this until i get back to it
Settled on an example inside the existing KG notebook. I'm pretty happy with the current state of this PR 💪🏻 |
sweet! will take a look in a bit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good! and amazing work on the demo notebook! I'll take a look at fiing some of these nits before merging
KEYWORD ("keyword"): Default query mode, using keywords to find triplets. | ||
EMBEDDING ("embedding"): Embedding mode, using embeddings to find | ||
similar triplets. | ||
HYBRID ("hyrbid"): Hyrbid mode, combining both keywords and embeddings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: hybrid
embedding_mode (KGQueryMode): Specifies whether to use keyowrds, | ||
embeddings, or both to find relevent triplets. Should be one of "keyword", | ||
"embedding", or "hybrid". | ||
top_k_embeddings (int): The number of top embeddings to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: let's rename similarity_top_k to keep consistent with vector index queries
This PR adds support for using embedding similarities within the knowledge graph index.
You may have noticed that relying purely on keyword mapping for finding relation tuples may result in no relation tuples being found! If you didn't turn on the
include_text=True
option, then chances are the model will hallucinate an answer.This PR allows you to use the existing keyword mapping, the top_k related knowledge graph tuples, or a combination of both!
The following is some example usage using both keywords and embeddings:
In my testing, the similarity cutoff is pretty important. Unrelated nodes are added to the prompt, it ends up hurting the response accuracy (maybe the prompt itself needs some investigation too 🤔)