# Vector Store

This notebook demonstrates how to work with vector stores for semantic search and retrieval.

We'll cover:
- Loading and chunking text documents
- Storing document embeddings in different vector databases
- Querying and retrieving relevant documents
- Using vector stores as configurable LangChain runnables

In [None]:
from dotenv import load_dotenv
from rich import print

load_dotenv(verbose=True)

%load_ext autoreload
%autoreload 2

import sys

sys.path.append(".")

### Split the text into chunks

First we load a text document and split it into smaller chunks for processing:

- Uses LangChain's `TextLoader` to load the file
- Applies `RecursiveCharacterTextSplitter` to break the text into chunks of at most 2000 characters
- Sets `chunk_overlap=0`, so consecutive chunks share no text

In [None]:
from genai_tk.core.embeddings_factory import EmbeddingsFactory
from genai_tk.core.vector_store_factory import VECTOR_STORE_ENGINE, VectorStoreRegistry
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = TextLoader("use_case_data/other/state_of_the_union.txt")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
print(texts)
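
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width splitter. It is a simplified illustration, not the algorithm `RecursiveCharacterTextSplitter` uses (that one tries to split on separators such as paragraph and line breaks first). The function name `split_fixed` and the sample text are made up for this example.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Simplified illustration only: it ignores paragraph and sentence
    boundaries, unlike RecursiveCharacterTextSplitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghij" * 5  # 50 characters of stand-in text
no_overlap = split_fixed(sample, chunk_size=20)
with_overlap = split_fixed(sample, chunk_size=20, chunk_overlap=5)

print([len(c) for c in no_overlap])
print([len(c) for c in with_overlap])
```

With zero overlap, joining the chunks reproduces the input exactly. With a non-zero overlap, each chunk repeats the tail of the previous one, which can help when a relevant passage straddles a chunk boundary.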

### Store document embeddings in a vector database

We use our `VectorStoreRegistry` to:

1. Select a vector store backend (default is in-memory)
2. Configure the embedding model (default from config)
3. Add our document chunks to the store

Key benefits of the factory pattern:
- Easy switching between vector store implementations
- Consistent interface regardless of backend
- Centralized configuration management

In [None]:
vs_engine: VECTOR_STORE_ENGINE = "InMemory"

# Other choices (Examples)
# vs_engine = "Chroma_in_memory"
# vs_engine = "Sklearn"

vs_factory = VectorStoreRegistry(
    id=vs_engine,
    table_name_prefix="name",
    embeddings_factory=EmbeddingsFactory(),
)

print(vs_factory)

db = vs_factory.get()
db.add_documents(texts)
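
The idea behind selecting a backend by a string id can be sketched in a few lines of plain Python. This is not how `VectorStoreRegistry` is implemented in `genai_tk`; every name below (`ToyStore`, `_BACKENDS`, `get_store`) is hypothetical and only illustrates the pattern.

```python
from dataclasses import dataclass, field
from typing import Callable


# Hypothetical stand-in for a vector store; not a genai_tk or LangChain class.
@dataclass
class ToyStore:
    backend: str
    docs: list[str] = field(default_factory=list)

    def add_documents(self, docs: list[str]) -> None:
        self.docs.extend(docs)


# Map each backend id to a zero-argument constructor.
_BACKENDS: dict[str, Callable[[], ToyStore]] = {
    "InMemory": lambda: ToyStore("InMemory"),
    "Sklearn": lambda: ToyStore("Sklearn"),
}


def get_store(engine_id: str) -> ToyStore:
    """Build the store registered under engine_id, or fail with a clear error."""
    try:
        return _BACKENDS[engine_id]()
    except KeyError:
        raise ValueError(
            f"Unknown engine {engine_id!r}; choose from {sorted(_BACKENDS)}"
        ) from None


store = get_store("InMemory")
store.add_documents(["chunk one", "chunk two"])
print(store.backend, len(store.docs))
```

Because callers only touch `get_store` and `add_documents`, switching backends means changing one string rather than the calling code.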

### Test semantic search queries

We'll search for content related to:
1. "What did the president say about Ketanji Brown Jackson" (English)
2. "Qu'as dit le président sur Ketanji Brown Jackson" (French)

This demonstrates:
- With multilingual embeddings, semantically equivalent queries retrieve similar content regardless of the query language
- Cross-language quality depends on the embedding model: an English-only model may rank results for the French query poorly

In [None]:
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query, k=3)
print(docs)

In [None]:
query = "Qu'as dit le président sur Ketanji Brown Jackson"
docs = db.similarity_search(query, k=3)
print(docs)
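
Similarity search typically ranks stored vectors by a distance or similarity measure against the query vector. The sketch below uses cosine similarity, which is a common choice but an assumption here, since the metric depends on the backend. The three-dimensional vectors are invented for illustration and do not come from any embedding model; the point is only that two queries with nearby vectors (as a multilingual model would ideally produce for the English and French queries) yield the same top result.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (name, score) pairs with the highest cosine similarity."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up "embeddings" for three chunks and two queries.
chunk_vectors = {
    "supreme_court": [0.9, 0.1, 0.0],
    "economy": [0.1, 0.9, 0.1],
    "ukraine": [0.0, 0.2, 0.9],
}
query_en = [0.8, 0.2, 0.1]
query_fr = [0.85, 0.1, 0.15]

print(top_k(query_en, chunk_vectors, k=1))
print(top_k(query_fr, chunk_vectors, k=1))
```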

### Vector Store as Runnable

LangChain's `as_retriever()` wraps the vector store in a retriever, which is a LangChain `Runnable`. The retriever can:

- Be chained with other LangChain components
- Support streaming and async operations
- Be configured with search parameters, such as the number of results `k`

In [None]:
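# Set k when building the retriever: depending on the LangChain version,
# extra keyword arguments passed to invoke() may be ignored.
retriever = db.as_retriever(search_kwargs={"k": 1})

# `query` still holds the French query from the previous cell
a = retriever.invoke(query)
print(a)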