# The Start
Let's walk through the process of creating a vector index.

# Step 1: Load the Documents
The `IngestService` has a method `load_obsidian_notes` that loads the notes from an Obsidian vault. The input can be either a list of markdown strings or a directory containing Obsidian notes.

Using Obsidian notes provides several advantages:
- Notes are easy to write and edit.
- Frontmatter/tags transfer into the metadata of the nodes.
- Headers provide natural splitting points for the text.

The `IngestService` class relies on LangChain's `ObsidianLoader` class to load the notes. The `ObsidianLoader` class uses the frontmatter, tags, dataview fields, and file metadata to populate the metadata of the nodes. The additional metadata helps the retriever find the best nodes to answer a given query.
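
To make the tags-to-metadata idea concrete, here is a minimal sketch that pulls Obsidian-style `#tags` out of a markdown string and stores them in a metadata dictionary. It is not `ObsidianLoader`'s implementation: the `extract_tags` helper, the regular expression, and the comma-separated `tags` value are all assumptions made for illustration.

```python
import re

# Hypothetical helper for illustration only; ObsidianLoader's real parsing is more thorough.
# A tag is '#' immediately followed by tag characters, at line start or after whitespace.
TAG_PATTERN = re.compile(r"(?:^|\s)#([A-Za-z0-9_/-]+)")


def extract_tags(markdown_text: str) -> list[str]:
    """Return the Obsidian-style tags (e.g. #raise_ph) found in the text, sorted."""
    tags = set()
    for line in markdown_text.splitlines():
        # Headers such as '# Title' or '## Title' do not match, because the
        # character right after a matched '#' must be a tag character.
        tags.update(TAG_PATTERN.findall(line))
    return sorted(tags)


sample = "#Calcium_additive #raise_ph\nGrowers turn to Wollastonite.\n# What is Wollastonite?\n"
# Assumption: tags are stored as one comma-separated string in the metadata.
metadata = {"tags": ",".join(extract_tags(sample))}
print(metadata)
```

Keeping tags in the metadata, rather than only in the text, means they can later be used as filters at query time.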

The `load_obsidian_notes` method takes in either:
- a list of strings, where each string is treated as the contents of one markdown file.
- a directory path to an Obsidian vault.

The list-of-strings option is useful for testing.

The method returns a list of LlamaIndex `Document` objects, one for each node. Each node contains an id, text, and metadata.

In [None]:
# Example: a single markdown note held in a string, used to test loading from a list of strings instead of a vault directory.

doc = """#Calcium_additive #raise_ph #Wollastonite #Silicon_additive #buffer_pH #Calcium
Growers  turn to Wollastonite for:
- Its **liming** capability.  Wollastonite's dissolution rate is slower than agricultural lime, offering a buffering effect against rapid pH changes. This makes Wollastonite beneficial in areas with fluctuating acidity levels.
- Adding **Silicon**.
- Adding **Calcium**.
Wollastonite's pH buffering effect and Silicon content contribute to pest control and powdery mildew suppression, although the exact mechanisms are not fully understood.

# What is Wollastonite?

## Formation
Wollastonite is formed when Limestone is subjected to heat and pressure during metamorphism if surrounding silicate minerals are present.
### Basic Reaction:
Given high pressure and high temperature:
- CaCO3 (Limestone) + SiO2 (silica) → CaSiO3 (Wollastonite) + CO2 (carbon Dioxide)
## Sources
China is the largest producer of Wollastonite. Other areas where Wollastonite is mined include the United States (although it was originally mined in California, the only active mining in the U.S. is now in New York State), India, Mexico, Canada, and Finland.

## Industrial Applications of Wollastonite

|Industry|Application|
|---|---|
|Ceramics|Smoother and more durable ceramics, reinforcement agent|
|Plastics and Rubber|Cost-effective strengthening agent|
|Paints and Coatings|Reinforcement, improved durability and impact resistance|
|Construction|Improved strength and durability of building materials, safe alternative to asbestos|
##  How Wollastonite Provides Plants with Ca and Si

Wollastonite reacts with Water and Carbon Dioxide in the soil to form Calcium Bicarbonate and Silicon Dioxide.
- CaSiO₃ (Wollastonite)+2CO₂ (carbon Dioxide,)+H₂O (Water)→Ca(HCO₃)₂ (Calcium bicarbonate)+SiO₂ (silica)

### Calcium
- Calcium bicarbonate  (Ca(HCO₃)₂) is unstable and fairly easily decomposes to Limestone (CaCO₃):
		- Ca(HCO₃)₂ (Calcium bicarbonate)→CaCO₃ (Limestone)+  CO₂ (carbon Dioxide) + H₂O (Water)

- Soils with a pH below 7 (acidic soils) contain hydrogen ions (H+). These hydrogen ions react with the Limestone (CaCO3) to form Calcium ions (Ca2+), Water (H2O), and Carbon Dioxide (CO2).
	- CaCO3 (Limestone) + 2H+ (hydrogen ions) → Ca2+ (Calcium ions) + H2O (Water) + CO2 (carbon Dioxide)
### Silicon
- Silicon Dioxide slowly breaks down into Silicic Acid, which plants absorb. This process is influenced by soil pH, temperature, and microbial activity.
	- SiO2 (Silicon Dioxide) + 2H2O (Water) → H4SiO4 (Silicic Acid)

- Plants absorb Silicic Acid from the soil solution through their roots.


"""

In [None]:
# This notebook is in the eval folder.  Change to the root folder.
%cd ..
%pwd  # To verify the current working directory

In [None]:
# Read in the markdown files in the Obsidian vault directory
from src.ingest_service import IngestService
from src.doc_stats import DocStats
# The Directory containing the knowledge documents used by the AI to do the analysis on the soil tests.
soil_knowledge_directory = r"G:\My Drive\Audios_To_Knowledge\knowledge\AskGrowBuddy\AskGrowBuddy\Knowledge\soil_test_knowlege"
# Load the documents
ingest_service = IngestService()
loaded_documents = ingest_service.load_obsidian_notes(soil_knowledge_directory)
# Show some summary stats about the documents

DocStats.print_llama_index_docs_summary_stats(loaded_documents)

In [6]:
from rich import print
from collections import defaultdict

# Dictionary to keep track of how many times each document name has appeared
doc_count = defaultdict(int)

for doc in loaded_documents:
    source = doc.metadata['source']
    doc_count[source] += 1

    if doc_count[source] > 1:
        print(f"{source} (Duplicate document. Document count: {doc_count[source]})")
    else:
        print(source)

In [None]:
# View the contents of all the fields in each node.
from node_view import launch_node_viewer
# Create and launch the interface
launch_node_viewer(loaded_documents)

# Step 2: Split the Documents using Markdown Splitting

In [None]:

text_nodes = ingest_service.chunk_text(loaded_documents)
DocStats.print_llama_index_docs_summary_stats(text_nodes)
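
The internals of `chunk_text` are not shown in this notebook. As a rough illustration of what header-based Markdown splitting means, the sketch below starts a new chunk at every header line. The `split_on_headers` function and its output shape are hypothetical, and the sketch ignores edge cases such as `#` comments inside fenced code blocks.

```python
import re

# A header is 1-6 '#' characters, then whitespace, then the title.
# Tag lines such as '#raise_ph' have no space after '#', so they are not headers.
HEADER_PATTERN = re.compile(r"^(#{1,6})\s+(.*)$")


def split_on_headers(markdown_text: str) -> list[dict]:
    """Split markdown into chunks, starting a new chunk at every header line."""
    chunks = []
    current = {"header": None, "lines": []}

    def has_content(chunk: dict) -> bool:
        return chunk["header"] is not None or any(l.strip() for l in chunk["lines"])

    for line in markdown_text.splitlines():
        match = HEADER_PATTERN.match(line)
        if match:
            # Close the chunk in progress and open a new one under this header.
            if has_content(current):
                chunks.append(current)
            current = {"header": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    if has_content(current):
        chunks.append(current)
    return [{"header": c["header"], "text": "\n".join(c["lines"]).strip()} for c in chunks]


sample = (
    "Intro text.\n# What is Wollastonite?\n## Formation\nFormed from Limestone.\n"
    "## Sources\nChina is the largest producer.\n"
)
for chunk in split_on_headers(sample):
    print(chunk)
```

Note that a header with no body text of its own (here `# What is Wollastonite?`) produces an empty chunk, which is one reason a cleanup pass over the nodes is worthwhile.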

# Step 3: Delete Unuseful Nodes
Some nodes will not contain any useful content. I delete them to provide cleaner data to the retriever. I also check for other problems that might be occurring. In one case, I noticed that Excalidraw drawings were included, so I filtered these out.
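
A filter along these lines could automate part of that cleanup. This is only a sketch under stated assumptions: `SimpleNode` is a stand-in for the real node class, the `MIN_CHARS` threshold is arbitrary, and detecting Excalidraw content by looking for the word "excalidraw" in the text or the `source` path is a guess about how those notes appear in your vault.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleNode:
    """Stand-in for a LlamaIndex node: only the fields this sketch needs."""
    text: str
    metadata: dict = field(default_factory=dict)


MIN_CHARS = 20  # Assumed threshold; tune it after inspecting your own nodes.


def is_useful(node) -> bool:
    """Return False for near-empty nodes and for nodes that look like Excalidraw drawings."""
    text = node.text.strip()
    if len(text) < MIN_CHARS:
        return False
    # Assumption: Excalidraw notes mention 'excalidraw' in the text or the source path.
    source = str(node.metadata.get("source", ""))
    if "excalidraw" in text.lower() or "excalidraw" in source.lower():
        return False
    return True


example_nodes = [
    SimpleNode("Wollastonite adds Calcium and Silicon to the soil.", {"source": "wollastonite.md"}),
    SimpleNode("   ", {"source": "empty.md"}),
    SimpleNode("Switch to EXCALIDRAW VIEW in the MORE OPTIONS menu of this document.",
               {"source": "drawing.excalidraw.md"}),
]
kept = [n for n in example_nodes if is_useful(n)]
print(f"Kept {len(kept)} of {len(example_nodes)} nodes")
```

An automatic filter like this is a first pass; viewing the remaining nodes by hand is still the way to catch problems the rules do not anticipate.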

In [None]:
from node_view import launch_node_viewer
# Create and launch the interface
launch_node_viewer(text_nodes)


In [4]:
# Save the nodes so we can resume from this point without re-running the loading and splitting steps.
import pickle
with open('eval/text_nodes.pkl', 'wb') as f:
    pickle.dump(text_nodes, f)

In [None]:
# This notebook is in the eval folder.  Change to the root folder.
%cd ..
%pwd  # To verify the current working directory

In [7]:
# Now unpickle
import pickle

with open('eval/text_nodes.pkl', 'rb') as f:
    text_nodes = pickle.load(f)

I deleted several nodes that were not useful. Several had Excalidraw content that should be filtered out.  

# Step 4: Build the Index
Now on to building the vector index. I originally planned to use the LlamaIndex APIs to simplify the code, but I ran into frustrating bugs, such as files that had not been updated for Pydantic 2. Instead, we will use the `chromadb` API to build the index.

`chromadb` is used as the persistent store for the vector index.

In [None]:
# The build_vector_index method does what is shown in the three cells after this one.
from src.ingest_service import IngestService
ingest_service = IngestService()
collection = ingest_service.build_vector_index(nodes=text_nodes, collection_name='soil_test_knowledge')


The `build_vector_index` method uses the `chromadb` APIs, similar to the cells below.

In [None]:
# 1. Setup - load up the db and set up the embedding model that will be used during collection creation.
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
# The path to where the db is stored is fixed. The thought process is it simplifies the interface.
chroma_client = chromadb.PersistentClient(path='vectorstore')
collection_name = 'test'
embed_model_name = 'multi-qa-mpnet-base-cos-v1'
embedding_function = SentenceTransformerEmbeddingFunction(model_name=embed_model_name)

In [None]:
# 1.a Show the collections in the database.
from rich import print
db = 'vectorstore'
collection_list = chroma_client.list_collections()

print(collection_list)

In [None]:

# 1.b Check embedding dimension
sample_text = "This is a sample text to check the embedding dimension."
sample_embedding = embedding_function([sample_text])
embedding_dim = len(sample_embedding[0])
print(f"Embedding dimension: {embedding_dim}")

In [None]:
# 2. Create the collection. The documents are embedded with the embedding function. Each document is added along with its metadata and an id.
existing_collections = chroma_client.list_collections()
if any(collection.name == collection_name for collection in existing_collections):
    chroma_client.delete_collection(collection_name)
    print(f"Collection {collection_name} has been deleted.")
# The "hnsw:space" metadata key sets the distance metric to cosine.
our_collection = chroma_client.create_collection(collection_name, embedding_function=embedding_function, metadata={"hnsw:space": "cosine"})
ids = [str(i) for i in range(len(text_nodes))]
documents = [node.text for node in text_nodes]
metadata_list = [node.metadata for node in text_nodes]
our_collection.add(ids=ids, documents=documents, metadatas = metadata_list)
print(f"Created collection '{collection_name}' with {our_collection.count()} document nodes")
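
For reference, with `"hnsw:space": "cosine"` the scores Chroma reports are cosine distances, that is, one minus the cosine similarity, so smaller values mean closer matches. The sketch below computes that quantity in plain Python. The `cosine_distance` function is written here for illustration and is not part of the `chromadb` API.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity. Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Vectors pointing the same way have distance 0, regardless of their length.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))
# Orthogonal vectors have distance 1.
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))
```

Because the measure ignores vector length, it compares the direction of two embeddings only.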

With the vector index created, we can query it and look at the retrieved results.

In [None]:
query = "What is the ideal pH for Cannabis?"

results = our_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    print(document)
    print('\n')

In [None]:
print(results)
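
The raw result is a dictionary of parallel lists, with one inner list per query text, which is why the earlier cell indexes `results['documents'][0]`. The sketch below zips those lists into one record per hit. The `example_results` dictionary is hand-made to mimic that shape, and its values are invented for illustration; `flatten_first_query` is a hypothetical helper, not a `chromadb` function.

```python
# Hand-made dictionary with the same shape as a Chroma query result for ONE query text.
# The values are invented for illustration; they are not output from the soil collection.
example_results = {
    "ids": [["12", "3"]],
    "documents": [["Example chunk about soil pH.", "Example chunk about lime."]],
    "metadatas": [[{"source": "ph.md"}, {"source": "lime.md"}]],
    "distances": [[0.21, 0.35]],
}


def flatten_first_query(results: dict) -> list[dict]:
    """Zip the parallel lists for the first query into one record per hit."""
    return [
        {"id": i, "distance": d, "source": m.get("source"), "text": t}
        for i, t, m, d in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]


for hit in flatten_first_query(example_results):
    print(f"{hit['distance']:.2f}  {hit['source']}  {hit['text']}")
```

Seeing the distance and source next to each chunk makes it easier to judge whether the top hits are actually relevant.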

In [None]:
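# Alternative: build the collection with an in-memory client (nothing is persisted to disk).
# Reuses `embedding_function` and `text_nodes` from the earlier cells.
import logging
import chromadb

logger = logging.getLogger(__name__)

chroma_client = chromadb.Client()
collection_name = "microsoft_annual_report_2022"
try:
    # Drop any previous collection with this name so we start clean.
    chroma_client.delete_collection(collection_name)
except Exception:
    pass  # The collection did not exist yet.
chroma_collection = chroma_client.create_collection(collection_name, embedding_function=embedding_function, metadata={"hnsw:space": "cosine"})
logger.debug(f'Chroma collection {collection_name} was created.')

# Chroma expects parallel lists of ids, text strings, and metadata dictionaries.
ids = [str(i) for i in range(len(text_nodes))]
documents = [node.text for node in text_nodes]
metadatas = [node.metadata for node in text_nodes]
chroma_collection.add(ids=ids, documents=documents, metadatas=metadatas)
chroma_collection.count()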

In [None]:
# Create the collection

from src.ingest_service import IngestService
ingest_service = IngestService()
# Create a Chroma collection object of a given name. Metadata, embeddings, text are all added.
our_collection = ingest_service.create_collection(docs=text_nodes, collection='soil_test_knowledge', embedding_model_name='snowflake-arctic-embed')
# This will print the embedding dimension


In [None]:
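# Check embedding dimension
# Assumes Settings.embed_model has already been configured (for example by IngestService).
from llama_index.core import Settings

sample_text = "This is a sample text to check the embedding dimension."
sample_embedding = Settings.embed_model.get_text_embedding(sample_text)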
embedding_dim = len(sample_embedding)
print(f"Embedding dimension: {embedding_dim}")

Now that we have our collection, we can create the index.

In [12]:
from llama_index.core import Settings, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
from src.ingest_service import IngestService
# Grab the vector index
ingest_service = IngestService()
our_collection = ingest_service.get_collection('soil_test_knowledge')
chroma_vector_store = ChromaVectorStore(chroma_collection=our_collection)
# Create a VectorStoreIndex using the ChromaVectorStore
vector_index = VectorStoreIndex.from_vector_store(chroma_vector_store, embed_model=Settings.embed_model)

Let's retrieve some documents.

In [None]:
retriever = vector_index.as_retriever(similarity_top_k=5, embed_model=Settings.embed_model)
q = "retrieve records that provide knowledge on the correct pH value for growing Cannabis as well as records that provide knowledge on what to do when the pH is too high or too low."

nodes = retriever.retrieve(q)

In [None]:
from node_view import print_node_scores
print_node_scores(nodes)

In [None]:
from node_view import launch_node_viewer
# Create and launch the interface
launch_node_viewer(nodes)