# The idea

1) Traverse the file system from 'root'.
2) For each directory, recurse into its subdirectories and files.
3) For each file, split its full path into one directory name per level and append the names to a list. Use this list to construct the 'metadata' or 'tags' for the file (the directory names along the path are likely to capture the semantic meaning of the file).

**Do not embed the full documents.**

4) For the file itself, pass the filename and its tags from the previous step to an LLM to write a short 'description' / keywords.
5) Build the vector store, in which the indexed / embedded content is the description and the stored metadata is the entry_id.
6) Store (entry_id, full_path) in a hash table.
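Steps 1 to 3 and step 6 can be sketched with the standard library alone. This is a minimal sketch, not the notebook's implementation: the function name `build_entries`, the use of a UUID as `entry_id`, and the rule of skipping hidden directories are all assumptions made for illustration.

```python
import os
import tempfile
import uuid


def build_entries(root_dir):
    """Walk root_dir and return (tags_by_id, path_by_id).

    tags_by_id maps entry_id -> list of path components (the file's tags).
    path_by_id is the hash table from step 6: entry_id -> full path.
    """
    tags_by_id, path_by_id = {}, {}
    for subdir, dirs, files in os.walk(root_dir):
        # Prune hidden directories in place so os.walk does not descend into them.
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        for name in files:
            full_path = os.path.join(subdir, name)
            # Path relative to the root, split into one tag per level.
            tags = os.path.relpath(full_path, root_dir).split(os.sep)
            entry_id = str(uuid.uuid4())  # hypothetical ID scheme
            tags_by_id[entry_id] = tags
            path_by_id[entry_id] = full_path
    return tags_by_id, path_by_id


if __name__ == "__main__":
    # Build a tiny throwaway tree so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "qi-llama", "qi"))
        open(os.path.join(root, "qi-llama", "qi", "qi_tools.py"), "w").close()
        tags_by_id, path_by_id = build_entries(root)
        for entry_id, tags in tags_by_id.items():
            print(tags, "->", path_by_id[entry_id])
```

Keeping the tags and the full path in separate dictionaries keyed by the same `entry_id` means the vector store only needs to carry the ID as metadata.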

The vector database is built and queried with Ollama embeddings (locally hosted). The keyword-based descriptions are written with the OpenAI API for the best results.

User -> OpenAI's GPT-4o -> Parse to keywords -> Vector DB -> Context (possible matches etc.) -> GPT-4o -> Deliver to user
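The flow above can be read as three composable stages. The sketch below is only an illustration of that shape: `answer_query` and the three `fake_*` stand-ins are hypothetical names, and the stand-ins replace the real GPT-4o calls and vector DB lookup so the code runs without any external service.

```python
from typing import Callable, List


def answer_query(
    query: str,
    to_keywords: Callable[[str], str],
    search: Callable[[str], List[str]],
    compose: Callable[[str, List[str]], str],
) -> str:
    """Run one query through the flow: keywords -> vector search -> final answer."""
    keywords = to_keywords(query)   # first LLM call: parse the query to keywords
    matches = search(keywords)      # vector DB lookup returns candidate context
    return compose(query, matches)  # second LLM call: write the reply


# Stand-ins (assumptions) so the sketch runs offline.
def fake_keywords(query: str) -> str:
    # Keep only words longer than three characters.
    return " ".join(w for w in query.lower().split() if len(w) > 3)


def fake_search(keywords: str) -> List[str]:
    # Substring match against a toy catalog instead of an embedding search.
    catalog = ["qi/qi_tools.py", "docs/readme.md"]
    return [p for p in catalog if any(k in p for k in keywords.split())]


def fake_compose(query: str, matches: List[str]) -> str:
    return f"{len(matches)} candidate(s): {', '.join(matches)}"


print(answer_query("find the tools file", fake_keywords, fake_search, fake_compose))
```

Passing the stages in as callables keeps each one replaceable, so the embedding backend or the LLM can be swapped without touching the flow itself.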

In [1]:
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.schema import TextNode

import chromadb
import os

model_endpoint = "http://localhost:11435/"
embedding_model = "nomic-embed-text"

ollama_embedding = OllamaEmbedding(
    model_name=embedding_model,
    base_url=model_endpoint,
)
class VectorIndex:
    def __init__(self, persist_dir="chroma_store") -> None:
        self.persist_dir = persist_dir

        # Initialize Chroma DB client with persistence
        self.chroma_client = chromadb.PersistentClient(path=self.persist_dir)
        self.chroma_collection = self.chroma_client.get_or_create_collection(name="docs")

        # Connect Chroma to LlamaIndex
        self.vector_store = ChromaVectorStore(chroma_collection=self.chroma_collection)
        self.storage_context = StorageContext.from_defaults(vector_store=self.vector_store)
        
    def build_vector_index_database(self, root_dir):
        nodes = []
        for subdir, _, files in os.walk(root_dir):
            if "/." in subdir: 
                continue
            for f in files:
                if f.endswith((".DS_Store", ".bin", "chroma.sqlite3")):
                    continue
                
                file_path = os.path.join(subdir, f)
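                # Path components relative to root_dir (the root itself is implicitly shared).
                # relpath also works when root_dir has no trailing slash.
                tracks = os.path.relpath(file_path, root_dir).split(os.sep)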
               
                node = TextNode(text=str(tracks)) 
                nodes.append(node)
        
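        # Pass the storage context so the embeddings are written to the Chroma
        # collection; the PersistentClient saves them under persist_dir.
        return VectorStoreIndex(
            nodes,
            storage_context=self.storage_context,
            embed_model=ollama_embedding,
        )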
   
indexer = VectorIndex()
ROOT_DIR = '/home/sdaadmin/tharitt_working/'
index = indexer.build_vector_index_database(ROOT_DIR)

In [2]:
retriever = index.as_retriever(similarity_top_k=10, sparse_top_k=None)
query = "look for the file that describe python qi functions in rock physics catalog"
nodes = retriever.retrieve(query)

# Print the top-k matching documents
for i, node in enumerate(nodes):
    print(f"Result {i+1}:")
    print(node)

Result 1:
Node ID: 05b38408-8481-44b7-a859-f6f663ca6d56
Text: ['qi-llama', 'qi', 'qi_tools.py']
Score:  0.731

Result 2:
Node ID: ab33831e-6ff7-4cf7-8f92-a660153ce74f
Text: ['qi-llama', 'qi', '__pycache__', 'qi_tools.cpython-313.pyc']
Score:  0.716

Result 3:
Node ID: be91b241-fab6-4a6a-8499-b4d39aaf4891
Text: ['qi-llama', 'qi', '__pycache__', 'qi_tools.cpython-39.pyc']
Score:  0.712

Result 4:
Node ID: a1d17967-3f17-4afb-b5bf-c7ffea3a4ded
Text: ['qi-llama', 'qi', 'qi_bongkot_loader.py']
Score:  0.708

Result 5:
Node ID: 8376a85b-0fd9-4c13-adc4-31f625ad634b
Text: ['qi-llama', 'qi', 'qi_well.py']
Score:  0.707

Result 6:
Node ID: b1334546-3fe6-4158-970b-2d73a5d77841
Text: ['qi-llama', 'qi', '__pycache__',
'qi_bongkot_loader.cpython-313.pyc']
Score:  0.705

Result 7:
Node ID: f2f54caf-759c-44aa-9c80-b670bce01de9
Text: ['qi-llama', 'qi', 'qi_lang.py']
Score:  0.701

Result 8:
Node ID: 90ed4d12-9afb-41c3-8d28-943f9588fd93
Text: ['qi-llama', 'qi', 'qi_arthit_loader.py']
Score:  0.697

[output truncated]
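Results 2, 3 and 6 above are compiled bytecode files under `__pycache__`, which crowd out real source files in the top-k. One option is to filter such paths at index time. The helper below is a sketch: `should_index`, `SKIP_DIRS` and `SKIP_SUFFIXES` are hypothetical names, and the skip lists are assumptions that extend the suffixes already excluded in the indexing cell.

```python
import os

SKIP_DIRS = {"__pycache__"}  # assumed list of cache directories
SKIP_SUFFIXES = (".pyc", ".DS_Store", ".bin", "chroma.sqlite3")


def should_index(rel_path: str) -> bool:
    """Return True if the file at rel_path (relative to the root) should be indexed."""
    parts = rel_path.split(os.sep)
    # Reject files that live under a cache directory or a hidden directory.
    if any(p in SKIP_DIRS or p.startswith(".") for p in parts[:-1]):
        return False
    # Reject files whose name ends with an excluded suffix.
    return not parts[-1].endswith(SKIP_SUFFIXES)


print(should_index(os.path.join("qi-llama", "qi", "qi_tools.py")))
print(should_index(os.path.join("qi-llama", "qi", "__pycache__", "qi_tools.cpython-313.pyc")))
```

Calling this on each relative path before creating a node would keep the bytecode entries out of the index.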

In [3]:
from openai_service import OpenAIService
llm = OpenAIService()
system_prompt = (
    "The user is looking for the full file path of the intended file. "
    "The context below comes from a semantic search and gives each candidate "
    f"path as a list of path components: {nodes}. "
    f"The path prefix is {ROOT_DIR}. The user query is: {query}. "
    "If multiple paths may match, pick the few you are most confident in."
)
data = llm.create_data(
    system=system_prompt,
    prompt="Give a well structured, human-like response to the user.",
)
response = llm.create_request(data)
print(response)


Based on your query regarding the file that describes Python functions related to the "qi" (rock physics catalog), I have identified a few potential file paths from the provided list. Here are the most relevant ones:

1. `/home/sdaadmin/tharitt_working/qi-llama/qi/qi_tools.py`
2. `/home/sdaadmin/tharitt_working/qi-llama/qi/qi_bongkot_loader.py`
3. `/home/sdaadmin/tharitt_working/qi-llama/qi/qi_well.py`
4. `/home/sdaadmin/tharitt_working/qi-llama/qi/qi_lang.py`
5. `/home/sdaadmin/tharitt_working/qi-llama/qi/qi_arthit_loader.py`

These files seem to be related to the "qi" functions in the rock physics catalog. If you need more specific details or further assistance, please let me know!
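The LLM rebuilds each full path from the prefix and the list of components. The same join can also be done deterministically, which avoids relying on the model to copy paths correctly. The sketch below assumes the node text is the `str()` of a Python list, as in the results shown earlier; `tracks_to_path` is a hypothetical helper name.

```python
import ast
import os


def tracks_to_path(root_dir: str, node_text: str) -> str:
    """Rebuild a full path from the root and a stringified list of components."""
    # literal_eval safely parses "['a', 'b']" back into a list of strings.
    tracks = ast.literal_eval(node_text)
    return os.path.join(root_dir, *tracks)


print(tracks_to_path("/home/user/work/", "['qi-llama', 'qi', 'qi_tools.py']"))
```

Applied to the text of each retrieved node, this yields candidate paths that can be checked with `os.path.exists` before they are shown to the user.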
