# The Start
This notebook documents:
- Loading documents from a directory of an Obsidian vault into "node" objects.

# Step 1: Load the Documents
The `IngestService` class has a `load_obsidian_notes` method that loads notes from an Obsidian vault.  The input can be either a list of markdown strings or a directory containing Obsidian notes.

Using Obsidian notes provides several advantages:
- Notes are easy to write and edit.
- Frontmatter/tags transfer into the metadata of the nodes.
- The Headers provide natural splitting points for the text.

The `IngestService` class relies on LangChain's `ObsidianLoader` class to load the notes.  LangChain's loader is the only one I found that honors the Obsidian frontmatter, dataview fields, and tags when populating the metadata of the nodes.  Filtering on metadata properties could be a powerful retrieval technique.
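To make the "frontmatter and tags become metadata" idea concrete, here is a minimal standard-library sketch. It is not how `ObsidianLoader` is implemented; `extract_note_metadata` is a hypothetical helper, and it assumes simple `key: value` frontmatter rather than full YAML.

```python
import re


def extract_note_metadata(note_text: str) -> tuple[dict, str]:
    """Split a note into (metadata, body). Simplified illustration only."""
    metadata = {}
    body = note_text
    # Frontmatter is a block delimited by '---' lines at the very top of the note.
    match = re.match(r"^---\n(.*?)\n---\n?", note_text, flags=re.DOTALL)
    if match:
        for line in match.group(1).splitlines():
            key, sep, value = line.partition(":")
            if sep:  # keep only simple "key: value" lines
                metadata[key.strip()] = value.strip()
        body = note_text[match.end():]
    # Inline tags are a '#' followed directly by a letter, e.g. #raise_ph.
    # Markdown headers ('# Title') are skipped because a space follows the '#'.
    tags = re.findall(r"(?<!\S)#([A-Za-z][\w/-]*)", body)
    if tags:
        metadata["tags"] = ",".join(sorted(set(tags)))
    return metadata, body


note = "---\nauthor: me\n---\n#Calcium #raise_ph\n# What is Wollastonite?\nText"
metadata, body = extract_note_metadata(note)
print(metadata)
```

The tags are joined into one string because vector stores often accept only scalar metadata values; that choice is an assumption of this sketch, not something the loader is known to do.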

The `load_obsidian_notes` method takes either:
- a list of strings, where each string is treated as the contents of one markdown file.
- a directory path to an Obsidian vault.

The list-of-strings option is useful for testing.

It returns a list of LlamaIndex `Document` objects, one for each node.  Each one contains an id, text, and metadata.
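The "list of strings or directory" input can be sketched with the standard library alone. This is only an illustration of the dispatch idea; `read_markdown_sources` is a hypothetical name, and the real `load_obsidian_notes` goes through `ObsidianLoader` rather than reading files directly.

```python
from pathlib import Path


def read_markdown_sources(source):
    """Return markdown strings from either a vault directory or a list of strings."""
    if isinstance(source, (str, Path)):
        vault = Path(source)
        if not vault.is_dir():
            raise ValueError(f"{vault} is not a directory")
        # rglob walks sub-folders too; sorted() keeps the order stable between runs.
        return [p.read_text(encoding="utf-8") for p in sorted(vault.rglob("*.md"))]
    # Anything else is treated as an iterable of in-memory markdown strings.
    return [str(text) for text in source]


# The in-memory form is handy in tests: no files needed.
print(read_markdown_sources(["# Note A\nbody", "# Note B\nbody"]))
```

One consequence of this design is that a single bare string is always read as a path, so a lone note has to be wrapped in a list.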

In [1]:
# Example: a single note held in a string.  Wrapping it in a list lets us test `load_obsidian_notes` without a vault directory.

doc = """#Calcium_additive #raise_ph #Wollastonite #Silicon_additive #buffer_pH #Calcium
Growers  turn to Wollastonite for:
- Its **liming** capability.  Wollastonite's dissolution rate is slower than agricultural lime, offering a buffering effect against rapid pH changes. This makes Wollastonite beneficial in areas with fluctuating acidity levels.
- Adding **Silicon**.
- Adding **Calcium**.
Wollastonite's pH buffering effect and Silicon content contribute to pest control and powdery mildew suppression, although the exact mechanisms are not fully understood.

# What is Wollastonite?

## Formation
Wollastonite is formed when Limestone is subjected to heat and pressure during metamorphism if surrounding silicate minerals are present.
### Basic Reaction:
Given high pressure and high temperature:
- CaCO3 (Limestone) + SiO2 (silica) → CaSiO3 (Wollastonite) + CO2 (carbon Dioxide)
## Sources
China is the largest producer of Wollastonite. Other areas where Wollastonite is mined include the United States (although it was originally mined in California, the only active mining in the U.S. is now in New York State), India, Mexico, Canada, and Finland.

## Industrial Applications of Wollastonite

|Industry|Application|
|---|---|
|Ceramics|Smoother and more durable ceramics, reinforcement agent|
|Plastics and Rubber|Cost-effective strengthening agent|
|Paints and Coatings|Reinforcement, improved durability and impact resistance|
|Construction|Improved strength and durability of building materials, safe alternative to asbestos|
##  How Wollastonite Provides Plants with Ca and Si

Wollastonite reacts with Water and Carbon Dioxide in the soil to form Calcium Bicarbonate and Silicon Dioxide.
- CaSiO₃ (Wollastonite)+2CO₂ (carbon Dioxide,)+H₂O (Water)→Ca(HCO₃)₂ (Calcium bicarbonate)+SiO₂ (silica)

### Calcium
- Calcium bicarbonate  (Ca(HCO₃)₂) is unstable and fairly easily decomposes to Limestone (CaCO₃):
		- Ca(HCO₃)₂ (Calcium bicarbonate)→CaCO₃ (Limestone)+  CO₂ (carbon Dioxide) + H₂O (Water)

- Soils with a pH below 7 (acidic soils) contain hydrogen ions (H+). These hydrogen ions react with the Limestone (CaCO3) to form Calcium ions (Ca2+), Water (H2O), and Carbon Dioxide (CO2).
	- CaCO3 (Limestone) + 2H+ (hydrogen ions) → Ca2+ (Calcium ions) + H2O (Water) + CO2 (carbon Dioxide)
### Silicon
- Silicon Dioxide slowly breaks down into Silicic Acid, which plants absorb. This process is influenced by soil pH, temperature, and microbial activity.
	- SiO2 (Silicon Dioxide) + 2H2O (Water) → H4SiO4 (Silicic Acid)

- Plants absorb Silicic Acid from the soil solution through their roots.


"""

In [None]:
# Hit restart, then run this to start off with the right path for module resolution.
# This notebook is in the eval folder.  Change to the root folder.
%cd ..
%pwd  # To verify the current working directory

In [None]:
# --->: Read in the markdown files in the Obsidian vault directory
from src.ingest_service import IngestService
from src.doc_stats import DocStats
# The Directory containing the knowledge documents used by the AI to do the analysis on the soil tests.
soil_knowledge_directory = r"G:\My Drive\Audios_To_Knowledge\knowledge\AskGrowBuddy\AskGrowBuddy\Knowledge\soil_test_knowlege"
# Load the documents
ingest_service = IngestService()
loaded_documents = ingest_service.load_obsidian_notes(soil_knowledge_directory)
# Show some summary stats about the documents

DocStats.print_llama_index_docs_summary_stats(loaded_documents)

In [None]:
# A check for duplicate documents. The first time I loaded the documents, there were many duplicates. I added code to check and remove them.
from rich import print
from collections import defaultdict

# Dictionary to keep track of how many times each document name has appeared
doc_count = defaultdict(int)

for doc in loaded_documents:
    source = doc.metadata['source']
    doc_count[source] += 1

    if doc_count[source] > 1:
        print(f"{source} (Duplicate document. Document count: {doc_count[source]})")
    else:
        print(source)

I found duplicate documents, Excalidraw documents, and code blocks that needed to be cleaned out of the documents. The `load_obsidian_notes` method removes some of these.  Challenges remain, because some nodes still contain little to no text, or text that is meaningless.  It's a rabbit hole probably worth exploring with a combination of pattern matching, NLP, and other cleanup techniques. For now, after splitting the documents, I manually remove the bad nodes and pickle a "GOOD" collection of TextNodes.
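As a rough idea of what the pattern-matching side of that cleanup could look like, here is a small sketch. The helper names and the five-word threshold are assumptions made for illustration; this is not the logic inside `load_obsidian_notes`.

```python
import re

# Build the fence marker programmatically so this example can sit inside a fenced block.
FENCE = "`" * 3
FENCED_CODE = re.compile(FENCE + r".*?" + FENCE, flags=re.DOTALL)


def strip_code_blocks(text: str) -> str:
    """Remove fenced code blocks, keeping the surrounding prose."""
    return FENCED_CODE.sub("", text)


def is_useful(text: str, min_words: int = 5) -> bool:
    """Treat a chunk as useful if enough real words remain after stripping code."""
    words = re.findall(r"[A-Za-z]{2,}", strip_code_blocks(text))
    return len(words) >= min_words


def filter_texts(texts, min_words: int = 5):
    """Keep only the chunks that pass the word-count check."""
    return [t for t in texts if is_useful(t, min_words)]


chunks = [
    "## Sources",
    "Wollastonite reacts with water and carbon dioxide in the soil.",
    FENCE + "python\nprint('only code')\n" + FENCE,
]
print(filter_texts(chunks))
```

A word-count threshold is crude: it drops header-only and code-only chunks, but it cannot tell meaningful short text from noise, which is why the manual review step is still needed.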

# Step 2: Split the Documents using Markdown Splitting
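Before calling `chunk_text`, it may help to see what splitting on markdown headers means in its simplest form. The sketch below is standard-library only and is not what `chunk_text` does internally; `split_on_headers` is a hypothetical helper that ignores edge cases such as `#` comment lines inside code blocks.

```python
import re

# A header is 1-6 '#' characters followed by whitespace, so '#tag' lines do not match.
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def _has_content(chunk) -> bool:
    return chunk["header"] is not None or any(line.strip() for line in chunk["lines"])


def split_on_headers(markdown: str):
    """Return one {'header', 'text'} chunk per header section, in document order."""
    chunks = []
    current = {"header": None, "lines": []}
    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            # A new header closes the chunk that was being collected.
            if _has_content(current):
                chunks.append(current)
            current = {"header": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    if _has_content(current):
        chunks.append(current)
    return [{"header": c["header"], "text": "\n".join(c["lines"]).strip()} for c in chunks]


for chunk in split_on_headers("#tag1 #tag2\nIntro line.\n# Title\n\n## Sub\nBody B"):
    print(chunk)
```

Note that a header immediately followed by another header produces a chunk with empty text. That is one source of the "little to no text" nodes mentioned above.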

In [None]:

text_nodes = ingest_service.chunk_text(loaded_documents)
DocStats.print_llama_index_docs_summary_stats(text_nodes)

# Step 3: Review Unuseful Nodes
After splitting the text, the contents of each node should be checked manually.  I spent several iterations on this process, and I evolved `load_obsidian_notes` based on that review so it automatically filters out some of the unuseful nodes.  It is a "hunt and peck" activity.  If I were stronger in NLP, I would have a bigger toolbelt to filter the nodes through.  The ultimate goal is to have only quality content going into the index.

I built a node viewer (`launch_node_viewer`). Running the cell below prints a URL; clicking it brings up a Gradio interface where you can view the text and metadata of each node and delete nodes from the collection.

In [None]:
from node_view import launch_node_viewer
# Create and launch the interface
launch_node_viewer(text_nodes)


I save and restore the text nodes to make it easier to pick up where I left off.

In [None]:
# Saving the nodes in case we want to start before indexing.
import pickle
with open('eval/text_nodes.pkl', 'wb') as f:
    pickle.dump(text_nodes, f)

In [None]:
# This notebook is in the eval folder.  Change to the root folder.
%cd ..
%pwd  # To verify the current working directory

In [None]:
# Now unpickle
import pickle

with open('eval/text_nodes.pkl', 'rb') as f:
    text_nodes = pickle.load(f)

# Step 4: Build the Index
Now on to building the vector index.  I was originally going to use the LlamaIndex APIs to simplify the code, but I got frustrated with bugs such as files that had not been updated for Pydantic 2.  I stuck with the Chroma APIs after that. Even then, there was a challenge with the score, which I discuss later.

`chromadb` is used as the persistent store for the vector index.  I started with the SentenceTransformer embedding functions and tried several embedding models.  For now, `multi-qa-mpnet-base-cos-v1` works well enough.

In [5]:
# build_vector_index wraps the Chroma steps that are shown cell by cell later in this notebook.
from src.ingest_service import IngestService
ingest_service = IngestService()
collection = ingest_service.build_vector_index(nodes=text_nodes, collection_name='soil_test_knowledge', embed_model_name='multi-qa-mpnet-base-cos-v1')


2024-10-27 15:23:36,376 - src.ingest_service - INFO - Starting to build vector index with embedding model: multi-qa-mpnet-base-cos-v1 - c:\Users\happy\Documents\Projects\askgrowbuddy\src\ingest_service.py:193
2024-10-27 15:24:13,301 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: multi-qa-mpnet-base-cos-v1 - c:\Users\happy\Documents\Projects\askgrowbuddy\.venv\Lib\site-packages\sentence_transformers\SentenceTransformer.py:216


Batches:   0%|          | 0/4 [00:00<?, ?it/s]

After creating the collection of embeddings, the collection can be easily retrieved.

In [6]:
import chromadb
from rich import print
# Show available collections
chroma_client = chromadb.PersistentClient(path='vectorstore')
collection_list = chroma_client.list_collections()

print(collection_list)

In [9]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
embedding_function = SentenceTransformerEmbeddingFunction(model_name='multi-qa-mpnet-base-cos-v1')
chroma_client = chromadb.PersistentClient(path='vectorstore')
collection_name = 'soil_test_knowledge'
collection = chroma_client.get_collection(name=collection_name, embedding_function=embedding_function)
print(f"Number of nodes in the {collection_name} collection: {collection.count()}")


Let's check how well the index retrieves similar nodes to a question.
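Chroma reports distances, not similarities. For a collection created with `"hnsw:space": "cosine"`, my understanding is that the distance is `1 - cosine similarity`, so the conversion is a single subtraction. The sketch below uses made-up ids and distances in the shape that `collection.query` returns, and builds a new list instead of overwriting `results['distances']`.

```python
# Made-up data in the shape Chroma returns: one inner list per query text.
results = {
    "ids": [["12", "7", "31"]],
    "documents": [["doc a", "doc b", "doc c"]],
    "distances": [[0.18, 0.35, 0.62]],
}


def to_similarities(results, query_index=0):
    """Return (id, document, similarity) tuples for one query, leaving results untouched."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    dists = results["distances"][query_index]
    # Only valid for cosine space: similarity = 1 - cosine distance.
    return [(i, d, 1.0 - dist) for i, d, dist in zip(ids, docs, dists)]


for doc_id, doc, similarity in to_similarities(results):
    print(f"{doc_id}: {similarity:.2f}  {doc}")
```

Keeping the raw distances intact matters later in the notebook, where `results_dict['distances']` is read again. If the collection was created without the cosine setting, the subtraction does not yield a cosine similarity.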

In [14]:
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

query = "ideal ph for Cannabis"
# collection.query returns a dictionary of lists (one inner list per query text); the keys we care about are ids, documents, metadatas, and distances.
embedding_function = SentenceTransformerEmbeddingFunction(model_name='multi-qa-mpnet-base-cos-v1')
results_dict = collection.query(query_texts=[query],  n_results=5)
retrieved_documents = results_dict['documents'][0]
# Convert cosine distances to cosine similarities
results_dict['distances'][0] = [1 - distance for distance in results_dict['distances'][0]]

# Print the results
print("Cosine Similarities:")
for idx, similarity in enumerate(results_dict['distances'][0]):
    print(f"Result {idx}: {similarity}")
    print(results_dict['documents'][0][idx][:300])
    print('-'*50)


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

IndexError: list index out of range

In [None]:
results_dict["distances"][0]

In [None]:
new_text = "The ideal ph for Cannabis is 6.8"
new_id = str(collection.count() + 1)
new_metadata = {"source": "manual_addition"}
# Add the new document to the collection
collection.add(
    documents=[new_text],
    ids=[new_id],
    metadatas=[new_metadata]
)

print(f"Added new document with ID: {new_id}")
print(f"New collection count: {collection.count()}")


In [None]:
# New cell to calculate cosine similarity using existing Visualize methods

from src.visualize import Visualize
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Initialize Visualize object (make sure to use the same parameters as before)
visualize = Visualize()

# Define your query and top_k
query = "Ideal ph for Cannabis."
top_k = 5

# Get relevant nodes and embeddings
nodeswithscore = visualize._get_relevant_nodes(query, top_k)
combined_embeddings, included_indices, n_samples, _ = visualize._prepare_embeddings(query, nodeswithscore)

# Extract query embedding and document embeddings
query_embedding = combined_embeddings[0]
doc_embeddings = combined_embeddings[1:]

# Calculate cosine similarities
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]

# Get corresponding documents and ids
results_dict = collection.query(query_texts=[query],  n_results=5)
documents = results_dict['documents'][0]
ids = results_dict['ids'][0]

# Create a list of (id, document, similarity) tuples
results = list(zip(ids, documents, similarities))

# Sort by similarity (highest to lowest)
results.sort(key=lambda x: x[2], reverse=True)

# Print top results
print(f"Top {top_k} most similar documents:")
for doc_id, doc, sim in results[:top_k]:  # doc_id avoids shadowing the built-in id()
    print(f"ID: {doc_id}")
    print(f"Document: {doc[:100]}...")  # Print first 100 characters
    print(f"Similarity: {sim}")
    print()

# Compare with Chroma's results
print(f"\nChroma's top {top_k} results:")
for node in nodeswithscore:
    print(f"ID: {node.node.id_}")
    print(f"Document: {node.node.text[:100]}...")
    print(f"Score: {node.score}")
    print()

In [None]:
from transformers import AutoTokenizer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Initialize the tokenizer and embedding function
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-mpnet-base-cos-v1")
embedding_function = visualize.embedding_function

query = "ideal pH for Cannabis"
target_text = "The ideal ph for Cannabis is 6.8"

# Get embeddings
query_embedding = np.array(embedding_function([query])[0])
target_embedding = np.array(embedding_function([target_text])[0])

# Calculate cosine similarity manually
dot_product = np.dot(query_embedding, target_embedding)
query_norm = np.linalg.norm(query_embedding)
target_norm = np.linalg.norm(target_embedding)
manual_cosine_sim = dot_product / (query_norm * target_norm)

# Calculate using sklearn for comparison
sklearn_cosine_sim = cosine_similarity([query_embedding], [target_embedding])[0][0]

# Print results
print(f"Manual cosine similarity: {manual_cosine_sim}")
print(f"Sklearn cosine similarity: {sklearn_cosine_sim}")

print("\nEmbedding analysis:")
print(f"Query embedding shape: {query_embedding.shape}")
print(f"Target embedding shape: {target_embedding.shape}")
print(f"First 5 dimensions of query embedding: {query_embedding[:5]}")
print(f"First 5 dimensions of target embedding: {target_embedding[:5]}")

In [None]:
import chromadb
from chromadb.utils import embedding_functions
import numpy as np

client = chromadb.Client()
ef = embedding_functions.DefaultEmbeddingFunction()

# Test with "cosine" space
collection_cosine = client.create_collection("test_cosine", embedding_function=ef, metadata={"hnsw:space": "cosine"})

# Test with "ip" space
collection_ip = client.create_collection("test_ip", embedding_function=ef, metadata={"hnsw:space": "ip"})

# Create two simple vectors
vec1 = [1, 0]
vec2 = [0, 1]

# Add these to both collections
collection_cosine.add(ids=["vec1", "vec2"], embeddings=[vec1, vec2])
collection_ip.add(ids=["vec1", "vec2"], embeddings=[vec1, vec2])

# Query using vec1 for both collections
results_cosine = collection_cosine.query(query_embeddings=[vec1], n_results=2, include=["distances"])
results_ip = collection_ip.query(query_embeddings=[vec1], n_results=2, include=["distances"])

print("Cosine space results:", results_cosine)
print("IP space results:", results_ip)
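One caveat about the test above: `vec1` and `vec2` are orthogonal unit vectors, so the cosine and inner-product spaces should report the same distances for them. The pure-Python sketch below uses the formulas as I understand them from Chroma's documentation (cosine distance `1 - cos`, inner-product distance `1 - dot`); treat those definitions as an assumption. It shows that the two spaces only diverge once the vectors are not unit length.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; unaffected by vector length."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def ip_distance(a, b):
    """1 - inner product; equals cosine distance only for unit-length vectors."""
    return 1.0 - dot(a, b)


pairs = {
    "unit, orthogonal": ([1, 0], [0, 1]),
    "unit, identical": ([1, 0], [1, 0]),
    "not unit length": ([2, 0], [1, 1]),
}
for label, (a, b) in pairs.items():
    print(f"{label}: cosine={cosine_distance(a, b):.4f}  ip={ip_distance(a, b):.4f}")
```

For the last pair the inner-product distance is negative while the cosine distance stays between 0 and 2. A mismatch like this can make scores look wrong when embeddings are not normalized.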

In [None]:
results_dict["distances"][0][0]

In [None]:
results_dict["documents"][0][0]

In [None]:
results_dict = collection.query(query_texts=[query], n_results=5)

# Print all keys in results_dict
print("Keys in results_dict:")
for key in results_dict.keys():
    print(f"- {key}")


Convert `results_dict` into a list of LlamaIndex `NodeWithScore` objects.  `NodeWithScore` is the type we use when working with the other retrievers.

In [4]:
from llama_index.core.schema import NodeWithScore, TextNode

# Put results into a list of NodeWithScore objects
nodes_with_score = []
for i in range(len(results_dict['documents'][0])):
    # Create a TextNode with the document text, metadata, and id
    text_node = TextNode(
        text=results_dict['documents'][0][i],
        metadata=results_dict['metadatas'][0][i],
        id_=results_dict['ids'][0][i]
    )

    # Create a NodeWithScore, using the distance directly as the score
    node_with_score = NodeWithScore(node=text_node, score=results_dict['distances'][0][i])

    nodes_with_score.append(node_with_score)

# Now nodes_with_score is a list of NodeWithScore objects

In [None]:
for node in nodes_with_score:
    print(f"Node ID: {node.node.id_}, Type: {type(node.node.id_)}")

Let's visualize the nodes

In [None]:
from src.visualize import Visualize

visualize = Visualize('soil_test_knowledge')
visualize.plot_3d_umap(query, nodes_with_score)

In [None]:
from src.visualize import Visualize

visualize = Visualize('soil_test_knowledge')
visualize.plot_3d_plotly(query, nodes_with_score)

Explore the retrieved documents.

In [9]:
def print_dict_keys(dictionary):
    for key in dictionary.keys():
        print(key)

In [None]:
from rich import print
# What type of object is a document retrieved from the chroma collection?
print(type(results))
print_dict_keys(results)

In [None]:
# Let's look at the distances Chroma returned (in cosine space, distance = 1 - cosine similarity).
print(results['distances'][0])

In [None]:
print("Analyzing results structure:")
for i, result in enumerate(results):
    print(f"\nResult {i}:")
    if isinstance(result, dict):
        print("Keys in this result:")
        for key in result.keys():
            print(f"  - {key}")
        if 'id' in result:
            print(f"ID found: {result['id']}")
    else:
        print(f"Type: {type(result)}")
        print(f"Content: {str(result)[:50]}...")  # Print first 50 characters

    if i >= 4:  # Limit to first 5 results to avoid overwhelming output
        print("\n(Showing only first 5 results)")
        break

print("\nTotal number of results:", len(list(results)))

In [None]:
print_dict_keys(results)


In [None]:

print(results['ids'][50])


In [None]:
from rich import print
print(f"Successfully retrieved collection '{collection_name}'")
print(f"Number of items: {len(results['ids'])}")
print(f"Metadata sample: {results['metadatas'][0] if results['metadatas'] else 'No metadata'}")
print(f"Document sample: {results['documents'][0] if results['documents'] else 'No documents'}")
print(f"Embedding sample shape: {len(results['embeddings'][0]) if results['embeddings'] else 'No embeddings'}")

The `build_vector_index` method uses the chromadb APIs, along the lines of the cells below.

In [None]:
# 1. Setup - load up the db and set up the embedding model that will be used during collection creation.
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
# The path to where the db is stored is fixed. The thought process is it simplifies the interface.
chroma_client = chromadb.PersistentClient(path='vectorstore')
collection_name = 'test'
embed_model_name = 'multi-qa-mpnet-base-cos-v1'
embedding_function = SentenceTransformerEmbeddingFunction(model_name=embed_model_name)

In [None]:

# 1.b Check embedding dimension
sample_text = "This is a sample text to check the embedding dimension."
sample_embedding = embedding_function([sample_text])
embedding_dim = len(sample_embedding[0])
print(f"Embedding dimension: {embedding_dim}")

In [None]:
# 2. Create the collection.  The documents are embedded with the embedding function, and each one is stored with its metadata and an id.
existing_collections = chroma_client.list_collections()
if any(collection.name == collection_name for collection in existing_collections):
    chroma_client.delete_collection(collection_name)
    print(f"Collection {collection_name} has been deleted.")
# The "hnsw:space" metadata entry sets the distance metric to cosine.
our_collection = chroma_client.create_collection(collection_name, embedding_function=embedding_function, metadata={"hnsw:space": "cosine"})
ids = [str(i) for i in range(len(text_nodes))]
documents = [node.text for node in text_nodes]
metadata_list = [node.metadata for node in text_nodes]
our_collection.add(ids=ids, documents=documents, metadatas = metadata_list)
print(f"Created collection '{collection_name}' with {our_collection.count()} document nodes")

With the vector index created, we can retrieve the results.

In [None]:
print(results)

In [None]:
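import logging

import chromadb

logger = logging.getLogger(__name__)

chroma_client = chromadb.Client()
collection_name = "microsoft_annual_report_2022"
# Drop any stale copy first so create_collection cannot fail on a name clash.
# list_collections returns Collection objects or plain names depending on the chromadb version.
if any(getattr(c, "name", c) == collection_name for c in chroma_client.list_collections()):
    chroma_client.delete_collection(collection_name)
chroma_collection = chroma_client.create_collection(collection_name, embedding_function=embedding_function, metadata={"hnsw:space": "cosine"})
logger.debug(f'Chroma collection {collection_name} was created.')

ids = [str(i) for i in range(len(text_nodes))]
# Chroma expects plain strings and dictionaries, not TextNode objects.
documents = [node.text for node in text_nodes]
metadata_list = [node.metadata for node in text_nodes]
chroma_collection.add(ids=ids, documents=documents, metadatas=metadata_list)
chroma_collection.count()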

In [None]:
# Create the collection

from src.ingest_service import IngestService
ingest_service = IngestService()
# Create a Chroma collection object of a given name. Metadata, embeddings, text are all added.
our_collection = ingest_service.create_collection(docs=text_nodes, collection='soil_test_knowledge', embedding_model_name='snowflake-arctic-embed')
# This would print the number of items in the collection
# print(our_collection.count())

In [None]:
from llama_index.core import Settings

# Check embedding dimension
sample_text = "This is a sample text to check the embedding dimension."
sample_embedding = Settings.embed_model.get_text_embedding(sample_text)
embedding_dim = len(sample_embedding)
print(f"Embedding dimension: {embedding_dim}")

Now that we have our collection, we can create the index.

In [12]:
from llama_index.core import Settings, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
from src.ingest_service import IngestService
# Grab the vector index
ingest_service = IngestService()
our_collection = ingest_service.get_collection('soil_test_knowledge')
chroma_vector_store = ChromaVectorStore(chroma_collection=our_collection)
# Create a VectorStoreIndex using the ChromaVectorStore
vector_index = VectorStoreIndex.from_vector_store(chroma_vector_store, embed_model=Settings.embed_model)

Let's retrieve some documents.

In [None]:
retriever = vector_index.as_retriever(similarity_top_k=5,embed_model=Settings.embed_model)
q = "retrieve records that provide knowledge on the correct pH value for growing Cannabis as well as records that provide knowledge on what to do when the pH is too high or too low."

nodes = retriever.retrieve(q)

In [None]:
from node_view import print_node_scores
print_node_scores(nodes)

In [None]:
from node_view import launch_node_viewer
# Create and launch the interface
launch_node_viewer(nodes)