# Top K with Vector Stores

Tools:
1. LangChain: a standardized way to set up, create, and query multiple vector stores
2. Vector Stores:
    1. Chroma
3. Embedding Models
    1. HuggingFace

[LangChain-Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma/)

In [1]:
import os
import sys

import pandas as pd

from tqdm import tqdm
from uuid import uuid4

from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

from langchain_core.documents import Document

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

import log_files
from data_processing import DataProcessing

In [2]:
pd.set_option('max_colwidth', 800)
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

## Load Prediction and Observations Data

In [3]:
base_path = os.path.join(notebook_dir, '../data/')
financial_full_path = os.path.join(base_path, 'financial_phrase_bank/all_data-adjusted_header.csv')
financial_df = pd.read_csv(financial_full_path, encoding_errors = 'ignore')

In [4]:
financial_df['domain'] = 'financial'
financial_df

Unnamed: 0,sentiment,sentence,domain
0,neutral,"According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .",financial
1,neutral,"Technopolis plans to develop in stages an area of no less than 100,000 square meters in order to host companies working in computer technologies and telecommunications , the statement said .",financial
2,negative,"The international electronic industry company Elcoteq has laid off tens of employees from its Tallinn facility ; contrary to earlier layoffs the company contracted the ranks of its office workers , the daily Postimees reported .",financial
3,positive,With the new production plant the company would increase its capacity to meet the expected increase in demand and would improve the use of raw materials and therefore increase the production profitability .,financial
4,positive,"According to the company 's updated strategy for the years 2009-2012 , Basware targets a long-term net sales growth in the range of 20 % -40 % with an operating profit margin of 10 % -20 % of net sales .",financial
...,...,...,...
4841,negative,LONDON MarketWatch -- Share prices ended lower in London Monday as a rebound in bank stocks failed to offset broader weakness for the FTSE 100 .,financial
4842,neutral,"Rinkuskiai 's beer sales fell by 6.5 per cent to 4.16 million litres , while Kauno Alus ' beer sales jumped by 6.9 per cent to 2.48 million litres .",financial
4843,negative,"Operating profit fell to EUR 35.4 mn from EUR 68.8 mn in 2007 , including vessel sales gain of EUR 12.3 mn .",financial
4844,negative,"Net sales of the Paper segment decreased to EUR 221.6 mn in the second quarter of 2009 from EUR 241.1 mn in the second quarter of 2008 , while operating profit excluding non-recurring items rose to EUR 8.0 mn from EUR 7.6 mn .",financial


## Embedding Model(s)

In [5]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
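
An embedding model maps each sentence to a vector, and the vector store later compares those vectors with a distance measure. The sketch below is a toy illustration, not part of the notebook's pipeline: it assumes unit-length embeddings and a store that reports squared L2 distance (both are assumptions about the configuration, so verify them for your setup). Under those assumptions, a reported distance `d` corresponds to a cosine similarity of `1 - d / 2`. The vectors and helper names are hypothetical.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 0.5])

d = squared_l2(a, b)
cos = cosine_similarity(a, b)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b), so cos = 1 - d / 2.
print(f"squared L2 = {d:.4f}, cosine = {cos:.4f}, 1 - d/2 = {1 - d / 2:.4f}")
```

If the embeddings are not normalized, the identity does not hold and the two measures can rank documents differently.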


## Create Vector Store: Collection Name + Connect Embedding Model(s)

- A collection is where Chroma stores your embeddings, documents, and any additional metadata (see [Getting Started with Chroma](https://docs.trychroma.com/docs/overview/getting-started)).
- Collections index your embeddings and documents, and enable efficient retrieval and filtering.
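
To make "retrieval" concrete, here is a minimal sketch of top-k search as an exhaustive scan over stored vectors. It is only an illustration: Chroma uses an approximate nearest-neighbor index rather than this brute-force loop, and the names `top_k` and `toy_collection`, as well as the 2-dimensional vectors, are hypothetical.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, stored, k=3):
    """Return the k (distance, doc_id) pairs closest to query_vec.

    stored: dict mapping doc_id -> vector.
    """
    scored = ((squared_l2(query_vec, vec), doc_id) for doc_id, vec in stored.items())
    return heapq.nsmallest(k, scored)


# Hypothetical 2-dimensional vectors standing in for sentence embeddings.
toy_collection = {
    "doc-a": [0.0, 1.0],
    "doc-b": [1.0, 0.0],
    "doc-c": [0.9, 0.1],
    "doc-d": [-1.0, 0.0],
}

nearest = top_k([1.0, 0.0], toy_collection, k=2)
for distance, doc_id in nearest:
    print(f"{doc_id}: distance={distance:.4f}")
```

An exhaustive scan costs one distance computation per stored vector, which is why vector stores build an index once the collection grows.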

In [6]:
vector_store = Chroma(
    collection_name="prediction_collection-real_data",
    embedding_function=embeddings,
)

## Create a Chroma Client

In [7]:
import chromadb

client = chromadb.Client()

## Add Observations to Vector Store

In [13]:
documents = []

for idx, row in tqdm(financial_df.iterrows()):
    # Debug print (uncomment to inspect each row):
    # print(f"Index: {idx}, Sentence: {row['sentence']}, "
    #       f"Sentiment: {row['sentiment']}, Domain: {row['domain']}")
    base_sentence = row['sentence']
    domain = row['domain']

    document = Document(
        page_content=base_sentence,
        metadata={"domain": domain},
        id=str(idx),  # Document.id is typed as a string
    )
    
    documents.append(document)

uuids = [str(uuid4()) for _ in range(len(documents))]
vector_store.add_documents(documents=documents, ids=uuids)

4846it [00:00, 78518.44it/s]


['2d63001c-3618-4e4e-b03f-fd769f718e5a',
 '6cc7933a-bdef-4de4-b65a-625f63b598a4',
 '977de090-dc2a-498a-afff-48285265ea89',
 'aec154f1-74a8-4a10-8ba5-bb3a22195cb9',
 '94a0a210-d831-44f8-b85a-7d2c60d46e7e',
 '85487b75-2e7e-40d1-a1c8-738b3e0d1b47',
 '9c8815a0-2b3d-4b18-b5a2-91da62e65206',
 '12085d00-d6d5-498b-9c3b-80ae10332386',
 'd0e22808-5a4b-436a-82b7-52c77fd75bef',
 '0855f758-7b88-44f7-afdb-9cb61fce1d30',
 '0d3bb4fc-a8ca-4a80-99f4-6831228ec7a6',
 'f6bf16b5-33f6-46dd-8ca9-35356fd0f95f',
 'c1a2c90d-6630-4c3a-ac96-45fe4ca19db7',
 '0bc46343-bb1d-4569-bfbd-cc9d0ece5165',
 '3867f146-90dd-4375-92ab-18a88e0839f7',
 '921c2813-af45-44c0-8498-e8cb1990753b',
 '0e96a366-0b94-44dd-a2c8-82d50e462e77',
 'c8c0ce04-7f02-486b-bfdb-5357006a56ce',
 '66411f6d-7146-4a87-9991-0e1222b9e330',
 'ab223d86-aff3-4475-856d-03f5fbbf84ea',
 'fa78b6cf-5a03-4466-a87d-a1c9346ac68c',
 'afb04aad-1f27-4a36-beae-3f2edd6ea57d',
 '71bcb938-57cb-4ea2-81b1-e6075b1fcfa7',
 'cc9a72c8-f1d5-49be-90e1-e5d627eb4644',
 '9da4aca6-b3f4-
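
Because `uuid4()` is random, re-running the cell above generates fresh IDs, so the same sentences would likely be added to the collection a second time. One possible alternative, sketched here under assumptions, is to derive each ID deterministically from the row index and sentence with `uuid5`. The helper `stable_id` and the sample sentences are hypothetical, and how a store treats a repeated ID (overwrite or reject) should be checked against its documentation.

```python
from uuid import NAMESPACE_URL, uuid5


def stable_id(idx, text, namespace=NAMESPACE_URL):
    """Derive a UUID from the row index and text.

    The same (idx, text) pair always yields the same ID. Including the index
    keeps duplicate sentences at different rows from colliding.
    """
    return str(uuid5(namespace, f"{idx}:{text}"))


# Illustrative placeholder sentences, not rows from the dataset.
sentences = [
    "Sales rose in the first quarter .",
    "Sales rose in the first quarter .",  # duplicate text, different row
    "Profit fell compared with last year .",
]

ids = [stable_id(idx, sentence) for idx, sentence in enumerate(sentences)]
for doc_id in ids:
    print(doc_id)
```

With IDs like these, the list passed as `ids=` to `add_documents` stays the same across runs as long as the data does.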

## Query Vector Store

In [14]:
words = ['expected', 'will', 'may', 'should', 'predict', 'forecast', 'EUR']

In [15]:
for idx, word in enumerate(words):
    print(idx, word)
    print("-------Similarity-------")
    results = vector_store.similarity_search(
        word,
        k=3,
    )
    for res in results:
        print(f"* {res.page_content} [{res.metadata}]")

    print()
    print("-------Similarity with score-------")
    # Note: Chroma returns a distance here, so a lower score means a closer match.
    results = vector_store.similarity_search_with_score(
        word, k=1,
    )
    for res, score in results:
        print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")

    print()
    print("-------Similarity by vector-------")
    results = vector_store.similarity_search_by_vector(
        embedding=embeddings.embed_query(word), k=3
    )
    for doc in results:
        print(f"* {doc.page_content} [{doc.metadata}]")

    print()
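    print("-------Retriever-------")
    # MMR retrieval: fetch 5 candidates, then keep 3 that balance relevance and diversity
    retriever = vector_store.as_retriever(
        search_type="mmr", search_kwargs={"k": 3, "fetch_k": 5}
    )
    # Print the retrieved documents; the original call discarded its result
    for doc in retriever.invoke(word):
        print(f"* {doc.page_content} [{doc.metadata}]")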

0 expected
-------Similarity-------
* The first quarter was as expected and was in line with analysts ' forecasts . [{'domain': 'financial'}]
* Markets had been expecting a poor performance , and the company 's stock was up 6 percent at  x20ac 23.89 US$ 33.84 in early afternoon trading in Helsinki . [{'domain': 'financial'}]
* Performance in the second half of 2009 exceeded expectations . [{'domain': 'financial'}]

-------Similarity with score-------
* [SIM=1.444711] The first quarter was as expected and was in line with analysts ' forecasts . [{'domain': 'financial'}]

-------Similarity by vector-------
* The first quarter was as expected and was in line with analysts ' forecasts . [{'domain': 'financial'}]
* Markets had been expecting a poor performance , and the company 's stock was up 6 percent at  x20ac 23.89 US$ 33.84 in early afternoon trading in Helsinki . [{'domain': 'financial'}]
* Performance in the second half of 2009 exceeded expectations . [{'domain': 'financial'}]

-----
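
The retriever in the query cell uses `search_type="mmr"` (maximal marginal relevance) with `k` and `fetch_k`. The sketch below shows the general idea only and is not LangChain's implementation: from a small candidate pool (the role of `fetch_k`), it greedily picks `k` items, rewarding similarity to the query and penalizing similarity to items already chosen. The parameter name `lambda_mult` mirrors LangChain's naming; the vectors and the `mmr` helper are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query, candidates, k=3, lambda_mult=0.5):
    """Greedily pick k candidate indices, trading relevance against redundancy.

    lambda_mult=1.0 ranks purely by relevance to the query;
    lower values penalize similarity to already-selected candidates more.
    """
    selected = []
    remaining = list(range(len(candidates)))

    def score(i):
        relevance = cosine(query, candidates[i])
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [          # the pool that fetch_k would return
    [0.9, 0.1],         # 0: most relevant
    [0.9, 0.12],        # 1: near-duplicate of 0
    [0.8, -0.6],        # 2: relevant but different direction
    [0.0, 1.0],         # 3: unrelated
    [-1.0, 0.0],        # 4: opposite
]

print("MMR picks:      ", mmr(query, candidates, k=2))
print("Relevance only: ", mmr(query, candidates, k=2, lambda_mult=1.0))
```

In this toy pool, the near-duplicate at index 1 is skipped in favor of the more distinct candidate at index 2, whereas pure relevance ranking keeps both near-duplicates.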