# Load Any Vector Store + Get Top K

**Tools:**

1. LangChain: a standardized way to implement (set up, create, and query) multiple vector stores
2. Vector Stores supported:
    1. Chroma
3. Embedding Models supported:
    1. HuggingFace

**References:**

1. [LangChain-Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma/)

In [1]:
import os
import sys
import chromadb

import pandas as pd

from tqdm import tqdm
from uuid import uuid4

from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

from langchain_core.documents import Document

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from prediction_properties import PredictionProperties
from text_generation_models import TextGenerationModelFactory
from vector_stores import ChromaVectorStore, VectorStoreDirector

In [2]:
pd.set_option('display.max_colwidth', 800)
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

## Load Vector Store

In [3]:
collection_name = "prediction_collection-real_data"
persist_directory = "../data/chroma/chroma_langchain_db"
chroma_loader = ChromaVectorStore(collection_name, persist_directory)
chroma_loader

	Collection Name: prediction_collection-real_data
	Persist Directory: ../data/chroma/chroma_langchain_db
	Vector Store: None
	Docments: []
	UUIDS: None
	Embedding Model: None


<vector_stores.ChromaVectorStore at 0x14b81cbe95d0>
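The printed state suggests that constructing the loader only records the collection name and persist directory; the vector store, documents, UUIDs, and embedding model appear to stay empty until a query runs. A minimal sketch of that lazy-loading shape, assuming a hypothetical `LazyStoreConfig` class (the real `ChromaVectorStore` in `vector_stores` may be organized differently):

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class LazyStoreConfig:
    """Hypothetical stand-in for a vector store wrapper; not the real ChromaVectorStore."""

    collection_name: str
    persist_directory: str
    vector_store: Optional[Any] = None       # filled in later by a load step
    documents: List[Any] = field(default_factory=list)
    uuids: Optional[List[str]] = None
    embedding_model: Optional[Any] = None    # filled in later by a load step

    def is_loaded(self) -> bool:
        # Treat the wrapper as loaded once both heavy objects exist.
        return self.vector_store is not None and self.embedding_model is not None

    def summary(self) -> str:
        # Build a tab-indented summary similar in spirit to the output above.
        return "\n".join([
            f"\tCollection Name: {self.collection_name}",
            f"\tPersist Directory: {self.persist_directory}",
            f"\tVector Store: {self.vector_store}",
            f"\tDocuments: {self.documents}",
            f"\tUUIDS: {self.uuids}",
            f"\tEmbedding Model: {self.embedding_model}",
        ])


config = LazyStoreConfig(
    "prediction_collection-real_data",
    "../data/chroma/chroma_langchain_db",
)
print(config.summary())
print(config.is_loaded())
```

Deferring the heavy objects keeps construction cheap, which is convenient when the same wrapper is handed to a director that decides when to load.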

## Load Prompt

In [4]:
query_string = PredictionProperties.get_prediction_properties()
query_string

' A prediction <p> = (<p_s>, <p_t>, <p_d>, <p_o>), where it consists of the following four properties:\n\n        1. <p_s>\n            - Defined as: \n                - Source entity that states the <p>\n            - Characteristics:\n                - A person with either: a name only, profile name only, geneder only, domain specific title only or any combination of these.\n                - An associated organization\n                - Named entity: Person, organization\n                - Part of speech: Noun\n            - Examples:\n                1. A person with a name only: Detravious\n                2. A person with a profile name: FitToJesus\n                3. A person with a gender only: He\n                4. A person with a domain specific title: reporter, analyst, expert, top executive, senior level person, etc \n                5. A person with a combination: Detravious, a reporter\n                6. An associated organization: FitTo...\n                7. A combina

In [5]:
chroma_director = VectorStoreDirector(loader=chroma_loader)
embedding_model_name = "Hugging Face"
# query_string = "Hey"
k = 3
query_results = chroma_director.query(embedding_model_name, query_string, k)

### LOADER ###
	<vector_stores.ChromaVectorStore object at 0x14b81cbe95d0>
### INITIALIZE CLIENT VECTOR STORE ###
	Vector Store (Prediction's Wrapper): None
### LOAD EMBEDDING MODEL ###



	Hugging Face
### LOAD VECTOR STORE ###
	Collection Name: prediction_collection-real_data
	Embedding Model: model_name='sentence-transformers/all-mpnet-base-v2' cache_folder=None model_kwargs={} encode_kwargs={} query_encode_kwargs={} multi_process=False show_progress=False
	Persist Directory: ../data/chroma/chroma_langchain_db
	Vector Store (Original): <langchain_chroma.vectorstores.Chroma object at 0x14b7c4e8f5d0>
	Vector Store (Prediction's Wrapper): <vector_stores.ChromaVectorStore object at 0x14b81cbe95d0>
	Documents (D) 40
### TOP K ###
	1. Similarity
	2. Similarity with score
	3. Similarity by vector
	4. Retriever
	Query Results: {'similarity_search': ['With the new production plant the company would increase its capacity to meet the expected increase in demand and would improve the use of raw materials and therefore increase the production profitability .', 'Finnish Talentum reports its operating profit increased to EUR 20.5 mn in 2005 from EUR 9.3 mn in 2004 , and net sales to
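The log above suggests that `VectorStoreDirector.query` runs a fixed sequence: initialize the client, load the embedding model, load the vector store, then run the top-k searches. A minimal sketch of that director pattern, assuming hypothetical `InMemoryLoader` and `Director` classes with a toy token-overlap "embedding" standing in for Chroma and Hugging Face:

```python
from typing import Callable, Dict, List, Optional, Set


def token_set(text: str) -> Set[str]:
    # Toy "embedding": the set of lowercase tokens in the text.
    return set(text.lower().split())


class InMemoryLoader:
    """Hypothetical loader; keeps documents in a list instead of a Chroma collection."""

    def __init__(self, documents: List[str]) -> None:
        self.documents = documents
        self.embed: Optional[Callable[[str], Set[str]]] = None

    def load_embedding_model(self, name: str) -> None:
        # A real loader would map the name to a model; here every name maps to token_set.
        self.embed = token_set

    def top_k(self, query: str, k: int) -> List[str]:
        if self.embed is None:
            raise RuntimeError("call load_embedding_model first")
        embed = self.embed
        query_tokens = embed(query)

        def overlap(doc: str) -> float:
            # Jaccard similarity between query tokens and document tokens.
            doc_tokens = embed(doc)
            union = query_tokens | doc_tokens
            return len(query_tokens & doc_tokens) / len(union) if union else 0.0

        return sorted(self.documents, key=overlap, reverse=True)[:k]


class Director:
    """Hypothetical director; runs the loader's steps in a fixed order."""

    def __init__(self, loader: InMemoryLoader) -> None:
        self.loader = loader

    def query(self, embedding_model_name: str, query_string: str, k: int) -> Dict[str, List[str]]:
        self.loader.load_embedding_model(embedding_model_name)
        return {"similarity_search": self.loader.top_k(query_string, k)}


docs = [
    "net sales increased this year",
    "the plant opens in May",
    "operating profit and net sales rose",
]
director = Director(InMemoryLoader(docs))
results = director.query("toy-token-set", "net sales", k=2)
print(results)
```

Because the director only depends on the loader's methods, a loader for a different vector store could be swapped in without changing the calling code.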

In [6]:
query_results

{'similarity_search': ['With the new production plant the company would increase its capacity to meet the expected increase in demand and would improve the use of raw materials and therefore increase the production profitability .',
  'Finnish Talentum reports its operating profit increased to EUR 20.5 mn in 2005 from EUR 9.3 mn in 2004 , and net sales totaled EUR 103.3 mn , up from EUR 96.4 mn .',
  'Net sales increased to EUR193 .3 m from EUR179 .9 m and pretax profit rose by 34.2 % to EUR43 .1 m. ( EUR1 = USD1 .4 )'],
 'similarity_with_score': [('With the new production plant the company would increase its capacity to meet the expected increase in demand and would improve the use of raw materials and therefore increase the production profitability .',
   '1.689347'),
  ('Finnish Talentum reports its operating profit increased to EUR 20.5 mn in 2005 from EUR 9.3 mn in 2004 , and net sales totaled EUR 103.3 mn , up from EUR 96.4 mn .',
   '1.743741'),
  ('Net sales increased to EUR193
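The `similarity_with_score` entries pair each document with a score stored as a string. As the LangChain Chroma documentation describes it, `similarity_search_with_score` returns a distance, so a lower value means a closer match. A small sketch that converts the scores to floats and ranks them, using the two scores visible above and shortened labels in place of the full sentences (the `rank_by_distance` helper is hypothetical):

```python
from typing import List, Tuple


def rank_by_distance(pairs: List[Tuple[str, str]]) -> List[Tuple[str, float]]:
    # Convert string scores to floats and sort ascending (smaller distance first).
    converted = [(text, float(score)) for text, score in pairs]
    return sorted(converted, key=lambda pair: pair[1])


# Shortened labels; the scores are the two shown in the output above.
pairs = [
    ("Finnish Talentum reports ...", "1.743741"),
    ("With the new production plant ...", "1.689347"),
]
ranked = rank_by_distance(pairs)
for text, distance in ranked:
    print(f"{distance:.6f}  {text}")
```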

## Pipeline: Rephraser

In [7]:
updated_prompt = f"""
    Given the prediction properties: {query_string} and the query results: {query_results}, generate a new query string.
    The goal of this query string is to search my Chroma vector database for predictions that align with the prediction properties above.
    My Chroma vector database contains the following dataset: Financial PhraseBank, described as "Polar sentiment dataset of sentences from financial news.
    The dataset consists of 4840 sentences from English language financial news categorised by sentiment."
    Only generate the new query string and nothing else.
    """
updated_prompt

'\n    Given the prediction properties:  A prediction <p> = (<p_s>, <p_t>, <p_d>, <p_o>), where it consists of the following four properties:\n\n        1. <p_s>\n            - Defined as: \n                - Source entity that states the <p>\n            - Characteristics:\n                - A person with either: a name only, profile name only, geneder only, domain specific title only or any combination of these.\n                - An associated organization\n                - Named entity: Person, organization\n                - Part of speech: Noun\n            - Examples:\n                1. A person with a name only: Detravious\n                2. A person with a profile name: FitToJesus\n                3. A person with a gender only: He\n                4. A person with a domain specific title: reporter, analyst, expert, top executive, senior level person, etc \n                5. A person with a combination: Detravious, a reporter\n                6. An associated organization:
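Interpolating the full `query_results` dictionary makes the prompt long, as the output above shows. One option is to build the prompt with a small helper that includes only the retrieved sentences. A sketch, assuming a hypothetical `build_rephrase_prompt` helper; whether `refine_query` builds its prompt this way is not shown in this notebook:

```python
from typing import Dict, List


def build_rephrase_prompt(
    properties: str,
    results: Dict[str, List[str]],
    dataset_note: str,
) -> str:
    # Keep only the plain similarity-search sentences, one per bullet.
    sentences = results.get("similarity_search", [])
    bullets = "\n".join(f"- {sentence}" for sentence in sentences)
    return "\n".join([
        "Given the prediction properties:",
        properties.strip(),
        "",
        "and the query results:",
        bullets,
        "",
        "generate a new query string for searching my Chroma vector database.",
        f"Dataset in the database: {dataset_note}",
        "Only generate the new query string and nothing else.",
    ])


# Placeholder inputs for illustration only.
example_results = {"similarity_search": ["sentence one", "sentence two"]}
prompt = build_rephrase_prompt(
    "A prediction <p> = (<p_s>, <p_t>, <p_d>, <p_o>)",
    example_results,
    "Financial PhraseBank",
)
print(prompt)
```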

In [8]:
query_results = chroma_director.refine_query(embedding_model_name, query_string, k, llm_model_name='mistral-small-3.1')

### REPHRASER ###
	Updated Query String: <class 'list'> To summarize and clarify the structure and characteristics of the prediction `<p>`, let's break down each component with examples and additional details:
### QUERY AGAIN ###

### INITIALIZE CLIENT VECTOR STORE ###
	Vector Store (Prediction's Wrapper): None
### LOAD EMBEDDING MODEL ###
	Hugging Face
### LOAD VECTOR STORE ###
	Collection Name: prediction_collection-real_data
	Embedding Model: model_name='sentence-transformers/all-mpnet-base-v2' cache_folder=None model_kwargs={} encode_kwargs={} query_encode_kwargs={} multi_process=False show_progress=False
	Persist Directory: ../data/chroma/chroma_langchain_db
	Vector Store (Original): <langchain_chroma.vectorstores.Chroma object at 0x14b7c3c77b90>
	Vector Store (Prediction's Wrapper): <vector_stores.ChromaVectorStore object at 0x14b81cbe95d0>
	Documents (D) 40
### TOP K ###
	1. Similarity
	2. Similarity with score
	3. Similarity by vector
	4. Retriever
	Query Results: {'similarity_

In [9]:
query_results

{'similarity_search': ['With the new production plant the company would increase its capacity to meet the expected increase in demand and would improve the use of raw materials and therefore increase the production profitability .',
  'Operating profit totalled EUR 21.1 mn , up from EUR 18.6 mn in 2007 , representing 9.7 % of net sales .',
  'Net sales increased to EUR193 .3 m from EUR179 .9 m and pretax profit rose by 34.2 % to EUR43 .1 m. ( EUR1 = USD1 .4 )'],
 'similarity_with_score': [('With the new production plant the company would increase its capacity to meet the expected increase in demand and would improve the use of raw materials and therefore increase the production profitability .',
   '1.673075'),
  ('Operating profit totalled EUR 21.1 mn , up from EUR 18.6 mn in 2007 , representing 9.7 % of net sales .',
   '1.678059'),
  ('Net sales increased to EUR193 .3 m from EUR179 .9 m and pretax profit rose by 34.2 % to EUR43 .1 m. ( EUR1 = USD1 .4 )',
   '1.685584')],
 'similarit
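Comparing the two runs, the refined query kept two of the three sentences and swapped the Finnish Talentum sentence for the operating profit one. A sketch of that comparison, using shortened labels in place of the full sentences (the `diff_results` helper is hypothetical):

```python
from typing import Dict, List


def diff_results(before: List[str], after: List[str]) -> Dict[str, List[str]]:
    # Report which documents survived, disappeared, or appeared after refinement.
    before_set, after_set = set(before), set(after)
    return {
        "kept": [doc for doc in before if doc in after_set],
        "dropped": [doc for doc in before if doc not in after_set],
        "added": [doc for doc in after if doc not in before_set],
    }


# Shortened labels for the top-3 sentences from In [6] and In [9].
before = ["production plant", "Finnish Talentum", "net sales EUR193.3 m"]
after = ["production plant", "operating profit EUR 21.1 mn", "net sales EUR193.3 m"]
changes = diff_results(before, after)
print(changes)
```

In the outputs above, the score for the production plant sentence also moved from 1.689347 to 1.673075 after refinement.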