# The Start
We start with an already created vector index.
Let's get to know how the vector index works by following the path an Obsidian note takes from loading, through indexing, to retrieval.

The document we'll be using, 'Bluelab Pulse Meter Review.md', includes frontmatter as well as transcribed text from a YouTube video.  This note was chosen because it has a moderate length and rich metadata fields.

The `load_obsidian_notes` method loads the note into a list of LlamaIndex Documents. In this case, there is only one document. Notice the metadata: it includes some interesting fields, such as `description`, which came from the frontmatter of the Obsidian note.
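To make the frontmatter-to-metadata step concrete, here is a minimal sketch of how a note could be separated into metadata fields and body text. It is not the code behind `load_obsidian_notes`; the `split_frontmatter` helper and the sample note are hypothetical, and only flat `key: value` lines are handled (real frontmatter is YAML and can hold lists and nested values).

```python
def split_frontmatter(note_text: str) -> tuple[dict[str, str], str]:
    """Split a note into (frontmatter fields, body text).

    Assumption: frontmatter is a block delimited by '---' lines at the very
    top of the note and contains only flat 'key: value' pairs.
    """
    lines = note_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, note_text  # no frontmatter block at the top

    metadata: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":  # closing delimiter found
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return metadata, body
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            metadata[key.strip()] = value.strip()
    return {}, note_text  # unterminated block: treat everything as body


# Hypothetical note content, not the actual review note.
sample_note = """---
title: Example Note
description: A made-up description field
---
# First Header
Some body text.
"""

metadata, body = split_frontmatter(sample_note)
print(metadata)
print(body)
```

The returned dictionary plays the role of the metadata seen on the loaded document, and the body is what goes on to be chunked.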


In [None]:
%cd ..
%pwd  # To verify the current working directory

In [2]:
from rich import print

In [None]:
from src.ingest_service import IngestService
from src.doc_stats import DocStats
ingest_service = IngestService()
# obsidian_notes_path = 'eval/obsidian_notes'
obsidian_notes_path = r'G:\My Drive\Audios_To_Knowledge\knowledge\AskGrowBuddy\AskGrowBuddy\Knowledge\soil_test_knowlege'
docs = ingest_service.load_obsidian_notes(obsidian_notes_path)

DocStats.print_llama_index_docs_summary_stats(docs)


We have loaded a document. Let's break it into text nodes using LlamaIndex's `MarkdownNodeParser`. Besides carrying metadata, Obsidian notes structure their text with markdown headers. By using a `MarkdownNodeParser`, we can use these headers to guide how the text is chunked. The document workflow uses LlamaIndex libraries throughout: we start with a `LlamaIndexDocument` and now move to a `BaseNode`, or more precisely a `TextNode`.
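The idea of header-guided chunking can be sketched in a few lines of plain Python. This is an illustration of the concept only, not `MarkdownNodeParser`'s actual implementation; the `split_on_headers` function and the sample text are hypothetical.

```python
import re

FENCE = "`" * 3  # code-fence marker, built programmatically for readability
HEADER = re.compile(r"#{1,6}\s")  # a markdown header: 1-6 '#' then whitespace


def split_on_headers(markdown_text: str) -> list[str]:
    """Start a new chunk at every header line; drop whitespace-only chunks.

    Lines inside fenced code blocks are never treated as headers, so a
    '# comment' inside code does not start a new chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(current).strip()
        if text:
            chunks.append(text)
        current.clear()

    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code block
        elif HEADER.match(line) and not in_fence:
            flush()  # close the previous chunk before starting a new one
        current.append(line)
    flush()
    return chunks


# Hypothetical sample: a code block before the first header, then two sections.
sample = "\n".join([
    f"{FENCE}timestamp-url",
    "https://example.com/video",
    FENCE,
    "# Introduction",
    "Intro text.",
    "## Details",
    "Detail text.",
])

chunks = split_on_headers(sample)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

Note that anything appearing before the first header becomes a chunk of its own, even if it carries no prose.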


Now when we print out the docstats, we see that the number of documents (nodes) has increased from 1 to 6.  The content length varies and the nodes all have metadata.

In [None]:
text_nodes = ingest_service.chunk_text(docs)
DocStats.print_llama_index_docs_summary_stats(text_nodes)

# The Nodes
Let's look at the contents of the nodes.  The first node looks like it does not have any interesting content. The only thing in the text is the timestamp-url code block that is used to play the YouTube video.

```  timestamp-url
https://www.youtube.com/watch?v=KbZDsrs5roI
```
This surprised me because I expected nodes to start with a header. What I have learned so far:
- The markdown splitting is aggressive and simple. It looks for a `#`. When it sees one, it starts a new node if there is text underneath it. This is how nodes can end up with very little text. Some nodes, like the first one in this example, do not contain any semantically relevant information. In this case, the node contains only a timestamp-url code block.
- The content is interspersed with the timestamp-url code blocks. These should be removed.
- Even after this, the nodes will need to be reviewed and cleaned up.
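One way the timestamp-url blocks could be removed is with a regular expression applied to each node's text, followed by dropping any node left empty. This is a sketch under the assumption that every block opens with a fence line tagged `timestamp-url` and closes with a bare fence line; `strip_timestamp_blocks` and `clean_nodes` are hypothetical helpers, not part of `IngestService`.

```python
import re

FENCE = "`" * 3  # code-fence marker, built programmatically for readability

# Matches an opening fence tagged 'timestamp-url', everything up to the
# next bare fence line, and that closing fence line itself.
TIMESTAMP_BLOCK = re.compile(
    rf"^{FENCE}[ \t]*timestamp-url[ \t]*\n.*?^{FENCE}[ \t]*$\n?",
    flags=re.MULTILINE | re.DOTALL,
)


def strip_timestamp_blocks(text: str) -> str:
    """Remove every timestamp-url code block and trim surrounding whitespace."""
    return TIMESTAMP_BLOCK.sub("", text).strip()


def clean_nodes(node_texts: list[str]) -> list[str]:
    """Strip timestamp blocks from each node text; drop nodes left empty."""
    cleaned = (strip_timestamp_blocks(text) for text in node_texts)
    return [text for text in cleaned if text]


# Hypothetical node texts: one holding only a block, one with a block mid-text.
block = "\n".join([f"{FENCE}  timestamp-url", "https://example.com/video", FENCE])
node_texts = [
    block,
    "# pH Basics\nSome text.\n" + block + "\nMore text.",
]

cleaned = clean_nodes(node_texts)
print(cleaned)
```

Working on plain strings keeps the sketch independent of the node class; with real nodes the same cleanup would be applied to each node's text before indexing.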



# The End
This exploration started with a vector index whose nodes were created with a very aggressive Markdown splitter. In addition, the content should be cleaned up. Even after these activities happen, the nodes will need further review and cleanup.

The next step in the exploration is to explore the [node splitting process](https://github.com/solarslurpi/askgrowbuddy/blob/f29dd0ff194471bc546587ea17d603a55b85ff26/eval/eval_markdown_splitter.ipynb).


In [None]:
from node_view import launch_node_viewer
# Create and launch the interface
launch_node_viewer(text_nodes)


Let's store the nodes in a chromadb vector store.

In [None]:
ingest_service = IngestService()
# Create a Chroma collection object of a given name. Metadata, embeddings, text are all added.
our_collection = ingest_service.create_collection(docs=docs, collection='test', embedding_model_name='nomic-embed-text')

Now we'll retrieve the nodes from the vector store and convert them back to LlamaIndex TextNodes to get a closer look at what is in the vector store.  It should be the same view as when the text nodes were first created.
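The "same view" expectation can be checked mechanically. The sketch below compares the original node texts against the dictionary shape that a Chroma collection's `get()` returns (parallel lists under `ids` and `documents`). The `find_mismatches` helper and the toy data are hypothetical, and it assumes node ids are preserved when the nodes are stored.

```python
def find_mismatches(original: dict[str, str], chroma_data: dict) -> list[str]:
    """Return ids whose text differs between the original nodes and the store.

    original    -- mapping of node id to node text, as first created
    chroma_data -- dict with parallel 'ids' and 'documents' lists
    """
    stored = dict(zip(chroma_data["ids"], chroma_data["documents"]))
    # Ids that are missing from the store or whose text changed.
    changed = [node_id for node_id, text in original.items()
               if stored.get(node_id) != text]
    # Ids present in the store that were never in the original set.
    extras = [node_id for node_id in stored if node_id not in original]
    return changed + extras


# Toy data standing in for real nodes and a real collection.
original = {"n1": "# Intro\nHello", "n2": "# Details\nWorld"}
chroma_data = {
    "ids": ["n1", "n2"],
    "documents": ["# Intro\nHello", "# Details\nWorld"],
}

print(find_mismatches(original, chroma_data))
```

An empty list means every original node came back from the store unchanged.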

In [18]:
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
from src.ingest_service import IngestService
# Grab the vector index
ingest_service = IngestService()
our_collection = ingest_service.get_collection('test')
chroma_vector_store = ChromaVectorStore(chroma_collection=our_collection)
# Create a VectorStoreIndex using the ChromaVectorStore
vector_index = VectorStoreIndex.from_vector_store(chroma_vector_store)

In [19]:
from llama_index.core.schema import TextNode
def chroma_to_text_nodes(chroma_data):
    text_nodes = []
    for doc, metadata, node_id in zip(chroma_data['documents'], chroma_data['metadatas'], chroma_data['ids']):
        node = TextNode(
            text=doc,
            metadata=metadata,
            id_=node_id
        )
        text_nodes.append(node)
    return text_nodes


In [None]:
# Get all document IDs
doc_ids = our_collection.get()['ids']

# Print the number of documents
print(f"Total number of documents: {len(doc_ids)}")

# Retrieve the text and metadata for every node in the collection
node_data = our_collection.get(ids=doc_ids, include=['documents', 'metadatas'])
print(node_data['metadatas'][0])
text_nodes = chroma_to_text_nodes(node_data)
DocStats.print_llama_index_docs_summary_stats(text_nodes)


In [None]:
# Create and launch the interface
from node_view import launch_node_viewer
launch_node_viewer(text_nodes)

Finally, let's set up the vector index as a retriever and see how it works. We'll retrieve `similarity_top_k` nodes. Instead of being of type `TextNode`, the retrieved nodes are now `NodeWithScore` objects, each pairing a node with its similarity score.
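Conceptually, top-k retrieval ranks every stored embedding by its similarity to the query embedding and keeps the best `k`. The sketch below shows that idea with toy two-dimensional vectors and cosine similarity. It is not how LlamaIndex or Chroma implement retrieval, the metric actually used depends on how the collection was configured, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float],
          node_vecs: dict[str, list[float]],
          k: int) -> list[tuple[str, float]]:
    """Return the k (node_id, score) pairs most similar to the query."""
    scored = [(node_id, cosine_similarity(query_vec, vec))
              for node_id, vec in node_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best score first
    return scored[:k]


# Toy embeddings; real ones have hundreds of dimensions.
node_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = top_k([1.0, 0.0], node_vecs, k=2)
for node_id, score in results:
    print(node_id, round(score, 3))
```

Each `(node_id, score)` pair mirrors what a `NodeWithScore` carries: the node itself plus the score that earned it a place in the results.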


In [None]:
retriever = vector_index.as_retriever(similarity_top_k=5)
response = retriever.retrieve("What is the ideal ph for growing Cannabis?")
print(f"Response type: {type(response[0])}")


In [None]:
query_engine = vector_index.as_query_engine(similarity_top_k=5)
response = query_engine.query("I am growing Cannabis in Living Soil and my pH is 8.  Is that ok?")
print(response)

In [None]:
type(response.source_nodes[0])


In [None]:
# Create and launch the interface
from node_view import launch_node_viewer
launch_node_viewer(response.source_nodes)