# The Start
This notebook documents the process of creating a graph index from Obsidian notes.

In [None]:
# This notebook is in the eval folder.  Change to the root folder.
%cd ..
%pwd  # To verify the current working directory

# Load the documents
First we need some TextNode objects. I put 3 documents within the `test` folder. 

In [None]:
# Read in the markdown files in the Obsidian vault directory
from src.doc_stats import DocStats
from src.ingest_service import IngestService

# The directory containing the knowledge documents the AI uses to analyze the soil tests.
soil_knowledge_directory = r"G:\My Drive\Audios_To_Knowledge\knowledge\AskGrowBuddy\AskGrowBuddy\Knowledge\soil_test_knowlege\test"
# Load the documents
ingest_service = IngestService()
loaded_documents = ingest_service.load_obsidian_notes(soil_knowledge_directory)
# Show some summary stats about the documents
DocStats.print_llama_index_docs_summary_stats(loaded_documents)

## Split into Text Nodes
This is discussed more in the notebook where the vector index is built.

In [None]:
text_nodes = ingest_service.chunk_text(loaded_documents)
DocStats.print_llama_index_docs_summary_stats(text_nodes)

## View the nodes
Let's look at the contents of the nodes.  Open up the link to view the nodes in the browser.  There are three files in the `test` directory.  You can see these in the node viewer by looking at the source.
- `ph.md` is one node.
- `soil science notes.md` has 33 nodes.
- `Focusing on Calcium Nutrition.md` is also one node.


In [None]:
from node_view import launch_node_viewer

# Create and launch the interface
launch_node_viewer(text_nodes)

# Build Knowledge Graph (not batch mode)
The method `build_knowledge_graph()` in `knowledge_graph.py` encapsulates creating our knowledge graph.  I am new to using a knowledge graph. I ended up:
- evolving LlamaIndex's `PropertyGraphIndex` class and rewriting it.
- using neo4j to store the graph.

To create the graph index, an LLM extracts triplets from the text. These triplets are then used to create the graph in neo4j. This is a costly, token-consuming process. I ended up using Ollama LLMs for testing and Anthropic's Claude Sonnet for the final version.

__Note: neo4j must be running to create the graph index.__
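
To make the triplet idea concrete, here is a minimal sketch of how one extracted triplet could be turned into a Cypher `MERGE` statement. This is not the code in `knowledge_graph.py`. The `Triplet` class, the `Entity` label, and the helper names are assumptions made for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """Hypothetical container for one (subject, relation, object) fact."""
    subject: str
    relation: str
    obj: str


def relation_type(relation: str) -> str:
    """Turn free text into a safe Cypher relationship type.

    Cypher cannot take a relationship type as a query parameter,
    so the text is sanitized before being placed in the query string.
    """
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", relation).strip("_").upper()
    if not cleaned:
        return "RELATED_TO"  # assumed fallback name
    if cleaned[0].isdigit():
        cleaned = "REL_" + cleaned  # a type should not start with a digit
    return cleaned


def to_cypher(triplet: Triplet) -> tuple[str, dict]:
    """Build a MERGE query and its parameters for one triplet."""
    query = (
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        f"MERGE (s)-[:{relation_type(triplet.relation)}]->(o)"
    )
    return query, {"subject": triplet.subject, "object": triplet.obj}
```

Using `MERGE` rather than `CREATE` means the same entity mentioned in several text nodes ends up as a single graph node.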


In [5]:
# from src.knowledge_graph import BuildGraphIndex

# kg_builder = BuildGraphIndex()
# kg_index = kg_builder.build_graph_index(
#     text_nodes=text_nodes, database_name="test", llm_model_name="mistral_soil"
# )

# Build Knowledge Graph - Batch mode
I use Anthropic's Claude 3.5 Sonnet to build the final version. Anthropic has a batch API that cuts the cost in half. The way it works is you submit a batch of requests and the API returns a batch ID. You poll the API until the batch is complete, then retrieve the results.

Here is an example of what a call to Anthropic's `create_batch()` method returns:

In [6]:
batch = {
  "id": "msgbatch_01EMPAyqx1B2mXtiD4EqMCvR",
  "type": "message_batch",
  "processing_status": "in_progress",
  "request_counts": {
    "processing": 35,
    "succeeded": 0,
    "errored": 0,
    "canceled": 0,
    "expired": 0
  },
  "created_at": "2024-11-04T17:27:01.482961+00:00",
  "ended_at": None,
  "expires_at": "2024-11-05T17:27:01.482961+00:00",
  "archived_at": None,
  "cancel_initiated_at": None,
  "results_url": None,
  "time_remaining": "20:53:12"
}
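
The submit, poll, retrieve cycle described above can be sketched as a small polling loop. This is a sketch under assumptions, not the notebook's implementation: `wait_for_batch` and its parameters are hypothetical names, and `fetch_status` stands in for whatever returns a status dictionary like the one shown above (for example, `check_batch_status()`). It relies on `processing_status` becoming `"ended"` when the batch finishes.

```python
import time
from typing import Callable, Mapping


def wait_for_batch(
    fetch_status: Callable[[], Mapping],
    poll_seconds: float = 60.0,
    max_polls: int = 1440,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping:
    """Poll until the batch reports processing_status == "ended".

    fetch_status: hypothetical callable returning a status mapping.
    sleep: injectable so the loop can be exercised without waiting.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status["processing_status"] == "ended":
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"Batch did not finish after {max_polls} polls")
```

Once the loop returns, the results can be retrieved as shown in the cells that follow.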

## Create a new batch job
It is easy to create a new batch job. Pass the text nodes to the `create_batch()` method. An object of type `BetaMessageBatch` is returned. The fields returned can be seen in `BuildGraphIndexBatch.check_batch_status()`.

In [None]:
from src.knowledge_graph_batch import BuildGraphIndexBatch
kg_builder_batch = BuildGraphIndexBatch()
# Ask the llm to create triplets from the text in the nodes.  These triplets are then stored in neo4j as the knowledge graph.
batch = kg_builder_batch.create_batch(text_nodes)
# Save the batch ID in order to eventually retrieve results after the batch job runs.
kg_builder_batch.save_batch_id(batch.id)
print(batch)
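
For readers wondering what saving the batch ID involves, here is one way it could be done with a small JSON file. This is only a sketch: the functions below are stand-ins, not the actual `save_batch_id()` and `load_batch_id()` methods of `BuildGraphIndexBatch`, and the `{"batch_id": ...}` file layout is an assumption.

```python
import json
from pathlib import Path


def save_batch_id(batch_id: str, path: str = "batch_status.json") -> None:
    """Write the batch ID to a JSON file (assumed layout)."""
    Path(path).write_text(json.dumps({"batch_id": batch_id}), encoding="utf-8")


def load_batch_id(path: str = "batch_status.json") -> str:
    """Read the batch ID back from the JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))["batch_id"]
```

Persisting the ID matters because a batch can take hours to finish, and the notebook kernel may be restarted before the results are ready.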

## Check Status
Check the status of the last batch job that was submitted.

In [None]:
import json
from src.knowledge_graph_batch import BuildGraphIndexBatch
kg_builder_batch = BuildGraphIndexBatch()
batch_id = kg_builder_batch.load_batch_id("batch_status.json")
# Override the saved ID to check a specific batch; remove this line to use the saved ID.
batch_id = 'msgbatch_01L3c152bCazJSgTRKNJkVsm'
status = kg_builder_batch.check_batch_status(batch_id)
print(json.dumps(status, indent=4))


## List Batch Jobs
We can list the batch jobs that have been sent in.

In [6]:
def print_batch_summary(batches):
    for batch in batches.data:
        print(f"\nBatch ID: {batch.id}")
        print(f"Status: {batch.processing_status}")
        print(f"Created: {batch.created_at}")
        print(f"Ended: {batch.ended_at}")
        print("Request Counts:")
        print(f"  Processing: {batch.request_counts.processing}")
        print(f"  Succeeded: {batch.request_counts.succeeded}")
        print(f"  Errored: {batch.request_counts.errored}")
        print(f"  Results URL: {batch.results_url}")
        print("-" * 50)


In [None]:
from src.knowledge_graph_batch import BuildGraphIndexBatch
kg_builder_batch = BuildGraphIndexBatch()
batches = kg_builder_batch.list_batches()
print_batch_summary(batches)

## Retrieve Batch Results
The batch results are retrieved and then saved so that we can process them locally.

In [None]:
import anthropic
from src.knowledge_graph_batch import BuildGraphIndexBatch


client = anthropic.Anthropic()
results = client.beta.messages.batches.results("msgbatch_016rrH1m8ACbxr7gtdeP4z8d")
kg_builder_batch = BuildGraphIndexBatch()
kg_builder_batch.save_batch_results(results)
for result in results:
    print(result)

In [None]:
from src.knowledge_graph_batch import BuildGraphIndexBatch
kg_builder_batch = BuildGraphIndexBatch()
results = kg_builder_batch.load_batch_results()
results[0]

## Build the Knowledge Graph
Process through each of the results. The results contain the triplets.  Processing means writing the triplets into neo4j to build the knowledge graph.

In [None]:
import anthropic
from src.knowledge_graph_batch import BuildGraphIndexBatch
kg_builder_batch = BuildGraphIndexBatch()
client = anthropic.Anthropic()
results = kg_builder_batch.load_batch_results()
kg_builder_batch.process_batch_results(text_nodes, results, database_name='test')

# View Token Use
To understand the system, I built the graph using two simple files and one far richer in content:
- `ph.md`
- `Focusing on Calcium Nutrition.md`
- `soil_science_notes.md`

I used Mistral to build the graph.

I then used DB Browser to view the token count. The SQL query `select sum("completion_tokens") from "token_usage"` returned 9,144 tokens. The query `select sum("prompt_tokens") from "token_usage"` returned 30,260 tokens.

This is just for three files. There are far more documents than these three to put into the knowledge graph.
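
The same totals can be read from Python instead of DB Browser. The sketch below assumes the token log is a SQLite database with the `token_usage` table and the two columns used in the queries above. The function name and the database file name in the usage note are hypothetical.

```python
import sqlite3


def sum_token_usage(conn: sqlite3.Connection) -> tuple[int, int]:
    """Return (prompt_tokens, completion_tokens) summed over all rows."""
    row = conn.execute(
        'select coalesce(sum("prompt_tokens"), 0), '
        'coalesce(sum("completion_tokens"), 0) '
        'from "token_usage"'
    ).fetchone()
    return int(row[0]), int(row[1])
```

A call would look like `sum_token_usage(sqlite3.connect("token_usage.db"))`, where `token_usage.db` is a placeholder for wherever the token log is stored. The `coalesce` keeps the result at zero rather than `None` when the table is empty.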

# Retrieval
Now let's do retrieval. A knowledge graph consists of nodes and the relationships between them.

In [None]:
from src.knowledge_graph import RetrieveGraphNodes
retriever = RetrieveGraphNodes()
nodes = retriever.retrieve("What is the ideal ph for growing Cannabis?", database_name="test")