
HELP! Can anyone help me understand how the knowledge graph works internally in LlamaIndex? #11600

Closed

vishnu9000 opened this issue Mar 4, 2024 · 5 comments
Labels: question (Further information is requested)

Comments

@vishnu9000

Question Validation

  • I have searched both the documentation and Discord for an answer.

Question

I am having trouble understanding how the knowledge graph works in LlamaIndex. I understand a few parts, but other sections are still unclear to me. Here is what I do understand:

  1. Why a knowledge graph is needed, and how to construct one using an LLM.
  2. How to use an existing graph store such as Neo4j or NebulaGraph.
  3. How a user query is converted to a Cypher query to get an answer from the KG.

What I don't understand:

  1. I have seen two ways to construct an answer from the KG: direct Cypher query generation, and subgraph retrieval. In both cases we only seem to retrieve triplets (source, relationship, target). In many cases a bare source-relationship-target triplet, without the actual context it was extracted from, is not enough to give an accurate answer. So are we storing the chunk from which each triplet was created? If yes, are we also using it to generate answers? If yes, wouldn't that overflow the context window, and how would it then differ from a normal vector search?
  2. I have seen the index query engine return answers for questions that are not in the graph, so how do I set the prompt here?
vishnu9000 added the question (Further information is requested) label Mar 4, 2024

dosubot bot commented Mar 4, 2024

Hey there, @vishnu9000! Great to see you diving deep into LlamaIndex again. How have you been?

Based on the context provided, it seems like you have a good understanding of how the knowledge graph works in LlamaIndex. Let's dive into your questions:

  1. Regarding the construction of answers from the knowledge graph: LlamaIndex extracts source-relationship-target triplets from the text and stores them in a graph structure. In the test cases this is done with the mock_extract_triplets function; the triplets then form a knowledge graph, as seen in the instantiation of SimpleGraphStore and its use in KnowledgeGraphIndex. The context from which a triplet was created is not explicitly stored. Instead, context size is managed by controlling the depth of retrieval via the graph_store_query_depth and similarity_top_k parameters (a sketch of these parameters in use follows this list). This prevents the context window from overflowing, and it differs from a normal vector search by providing a structured, depth-limited retrieval of knowledge graph nodes based on similarity and relevance to the query.

  2. The index query engine can answer questions not present in the graph through embeddings and similarity search. The MockEmbedding class simulates the generation of embeddings for both queries and text, which are then used to find the most relevant nodes in the knowledge graph. Because retrieval is based on semantic similarity between the query and the knowledge stored in the graph, the system can infer and return related information even when the exact query is not present in the graph.
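
Here is a minimal sketch of how those parameters are typically wired up. It assumes the core KnowledgeGraphIndex API, a configured LLM for triplet extraction, and a local "data" folder; the specific values shown are illustrative, not a prescription:

```python
# Minimal sketch: build a KnowledgeGraphIndex and query it with depth- and
# similarity-limited retrieval. Assumes an LLM is configured via Settings
# and that documents live in a local "data" folder.
from llama_index.core import KnowledgeGraphIndex, SimpleDirectoryReader, StorageContext
from llama_index.core.graph_stores import SimpleGraphStore

documents = SimpleDirectoryReader("data").load_data()

graph_store = SimpleGraphStore()
storage_context = StorageContext.from_defaults(graph_store=graph_store)

index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=2,   # cap how many triplets the LLM extracts per chunk
    include_embeddings=True,    # store triplet embeddings so similarity_top_k works
    storage_context=storage_context,
)

query_engine = index.as_query_engine(
    retriever_mode="hybrid",    # combine keyword and embedding retrieval
    graph_store_query_depth=2,  # how many hops to expand around matched triplets
    similarity_top_k=3,         # keep only the top-3 triplets by embedding similarity
)
print(query_engine.query("Who founded the company?"))
```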

For more details, you can refer to the test cases in llama-index-core/tests/indices/knowledge_graph/test_retrievers.py.

As for setting the prompt, it would depend on the specific use case and the nature of the data. Generally, the prompt should be designed to guide the model towards generating the desired output. However, without more specific information about your use case, it's difficult to provide a more detailed answer. Could you provide more details about the kind of prompts you're looking to set and the specific problems you're trying to solve?
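
For example, to discourage the engine from answering outside the graph, you can pass a more restrictive QA prompt. This is a hypothetical template (the wording is illustrative; text_qa_template is the standard hook on as_query_engine, and index is the knowledge graph index from the sketch above):

```python
from llama_index.core import PromptTemplate

# Hypothetical restrictive prompt: answer only from the retrieved KG context.
qa_prompt = PromptTemplate(
    "Knowledge graph context is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using ONLY the context above and no prior knowledge, answer the query.\n"
    "If the answer is not in the context, reply exactly: I don't know.\n"
    "Query: {query_str}\n"
    "Answer: "
)

query_engine = index.as_query_engine(text_qa_template=qa_prompt)
```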


@vishnu9000
Author

@dosubot Understood. So you are not storing the chunk from which the triplet was extracted. In that case, can relying on the triplets alone produce a complete answer, if the context they came from is not included? Also, say an entity in the question has more than 100 relations; isn't it a leap of faith to put all of its relations into the context?

What do graph_store_query_depth and similarity_top_k do?

Also, my second question: I saw a section in your documentation that shows how to manually add triplets to the graph, and there you also insert the text chunk. Why is that?


dosubot bot commented Mar 4, 2024

Based on the context provided, here are the answers to your questions:

  1. Context and Triplets: The source-relationship-target (SRT) triplets form the core of the knowledge graph, representing entities and their interrelations. Not including context in these triplets means that the relationships are considered in isolation, without additional information that could specify or clarify the nature of these relationships. This could lead to ambiguities or a lack of depth in understanding the connections between entities, as the context often provides crucial information that influences the interpretation of these relationships.

  2. Including all relations of an entity in context: Including all relations of an entity, especially when they number over 100, could be challenging in terms of both storage and processing. While technically feasible, it might lead to performance issues, as the system would need to handle a large volume of data for a single entity. This could slow down query processing and increase the complexity of the knowledge graph. Practical implementations might limit the number of relations considered or employ strategies to prioritize the most relevant or recent relations.

  3. Functions of the parameters graph_store_query_depth and similarity_top_k:

    • graph_store_query_depth: This parameter likely controls the depth of the query when retrieving information from the graph store. A higher depth means that the system will look further into the connections between entities, potentially returning more comprehensive information about the relationships involving the query subject.
    • similarity_top_k: This parameter seems to control the number of top similar entities or relationships to retrieve based on similarity scores. In the context of embedding-based retrieval, this would limit the results to the top k entities or relationships that are most similar to the query, as determined by their embedding vectors.
  4. Manually adding triplets to the graph along with the text chunk: Manually adding triplets along with text chunks allows for a more controlled and accurate construction of the knowledge graph. It enables the inclusion of specific, curated information, ensures the relationships represented are accurate and relevant, and enriches the graph with context that might not be automatically extractable from text. This manual intervention can improve the quality and utility of the knowledge graph as a tool for information retrieval and analysis (see the sketch after this list).
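
As a minimal sketch (assuming the KnowledgeGraphIndex API; the triplet and text are illustrative data), manual insertion pairs each triplet with the node it came from:

```python
from llama_index.core import KnowledgeGraphIndex
from llama_index.core.schema import TextNode

index = KnowledgeGraphIndex([])  # start from an empty index

# The text chunk the triplet was extracted from (illustrative data).
node = TextNode(text="Alice founded Acme Corp in 2015 after leaving Globex.")

# The triplet is stored together with a mapping back to its source node,
# so retrieval can later surface the original chunk as supporting context.
index.upsert_triplet_and_node(("Alice", "founded", "Acme Corp"), node)
```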

For more details, you can refer to the test cases in llama-index-core/tests/indices/knowledge_graph/test_retrievers.py and the HTML node parser in llama-index-core/llama_index/core/node_parser/file/html.py.


@logan-markewich
Collaborator

@vishnu9000 "isn't it a leap of faith to put all its relations in context" -- it's fine, since a response synthesizer is used that handles context limits.

graph_store_query_depth and similarity_top_k -- the first one controls how "deep" to retrieve (if I retrieve a triplet, do I also retrieve its neighbors? Neighbors of neighbors? etc.). similarity_top_k is used when you constructed the triplets with embeddings.

"Here you are also inserting the text chunk. Why is that?" -- triplets are mapped to the text chunk they were extracted from. This is what the include_text parameter controls (i.e., do I send the chunks where matching triplets were found to the LLM, or just the triplets alone?). See the sketch below.
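
A quick illustration of that switch (two hypothetical engines over the same index, differing only in include_text):

```python
# include_text=False: the LLM sees only the matched triplets.
triplets_only = index.as_query_engine(include_text=False)

# include_text=True: the LLM also sees the source chunks those triplets map to.
with_chunks = index.as_query_engine(include_text=True)
```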

@logan-markewich
Collaborator

I really recommend just reading the source code for a deeper understanding here

dosubot added the stale label (Issue has not had recent activity or appears to be solved) Jun 3, 2024
dosubot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 10, 2024
dosubot removed the stale label Jun 10, 2024