# LangChain Integration

This notebook shows how to use uubed with LangChain in RAG applications: encoding embeddings as compact strings, storing them in document metadata, and wiring the result into retrievers.

**Note**: This notebook requires LangChain to be installed:
```bash
pip install langchain openai chromadb
```

In [None]:
# Check if required packages are available
try:
    import langchain
    print(f"LangChain version: {langchain.__version__}")
except ImportError:
    print("Please install LangChain: pip install langchain")

from uubed.integrations.langchain import (
    UubedEncoder,
    UubedEmbeddings,
    UubedDocumentProcessor,
    create_uubed_retriever
)
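
The `try`/`except` above only prints a hint and then carries on to the uubed import. A minimal sketch of an up-front check that reports every missing package at once, using the standard library's `importlib.util.find_spec`. The helper name `missing_packages` is made up for this example:

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["langchain", "uubed", "numpy"]
missing = missing_packages(required)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are available.")
```

`find_spec` locates a top-level package without importing it, so the check is cheap and has no side effects.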

## 1. Basic Document Encoding

Let's start by encoding document embeddings with uubed:

In [None]:
from langchain.schema import Document
import numpy as np

# Create sample documents
documents = [
    Document(page_content="The quick brown fox jumps over the lazy dog.", metadata={"id": 1}),
    Document(page_content="Python is a versatile programming language.", metadata={"id": 2}),
    Document(page_content="Machine learning transforms data into insights.", metadata={"id": 3}),
]

# Simulate embeddings (in practice, these come from an embedding model)
embeddings = [
    np.random.randn(768).tolist(),  # Gaussian noise; not normalized
    np.random.randn(768).tolist(),
    np.random.randn(768).tolist(),
]

# Create encoder and encode documents
encoder = UubedEncoder(method="shq64")
encoded_docs = encoder.encode_documents(documents, embeddings)

# Check the results
for doc in encoded_docs:
    print(f"Document {doc.metadata['id']}: {doc.page_content[:50]}...")
    print(f"  Encoded: {doc.metadata['uubed_encoded']}")
    print()
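
`np.random.randn` draws Gaussian noise, so the simulated vectors in this cell are not unit length. If a downstream step expects normalized embeddings, they can be rescaled first. A minimal sketch using only the standard library; `random_unit_vector` is a hypothetical helper, not part of uubed or LangChain:

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a Gaussian vector and rescale it to L2 norm 1."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


rng = random.Random(0)  # fixed seed so the run is repeatable
unit_embeddings = [random_unit_vector(768, rng) for _ in range(3)]

for vec in unit_embeddings:
    length = math.sqrt(sum(x * x for x in vec))
    print(f"dim={len(vec)}, norm={length:.6f}")
```

With NumPy, the same rescaling is `v / np.linalg.norm(v)`.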

## 2. Using UubedEmbeddings Wrapper

Wrap any LangChain embedding model to add uubed encoding:

In [None]:
# Create a mock embedding class for demonstration
class MockEmbeddings:
    """Mock embeddings for demonstration (replace with OpenAIEmbeddings, etc.)"""
    
    def embed_documents(self, texts):
        # Simulate embedding generation
        return [np.random.randn(384).tolist() for _ in texts]
    
    def embed_query(self, text):
        # Simulate query embedding
        return np.random.randn(384).tolist()
    
    async def aembed_documents(self, texts):
        return self.embed_documents(texts)
    
    async def aembed_query(self, text):
        return self.embed_query(text)

# Create base embeddings (in practice: OpenAIEmbeddings, CohereEmbeddings, etc.)
base_embeddings = MockEmbeddings()

# Wrap with uubed
uubed_embeddings = UubedEmbeddings(
    base_embeddings,
    method="t8q64",
    return_encoded=True,  # Return encoded strings instead of raw embeddings
    k=16  # Parameter for t8q64 method
)

# Use like any other embedding model
texts = [
    "Hello, world!",
    "Machine learning is fascinating.",
    "LangChain makes LLM apps easy."
]

encoded_embeddings = uubed_embeddings.embed_documents(texts)
print("Encoded embeddings:")
for text, encoded in zip(texts, encoded_embeddings):
    print(f"  '{text}' -> {encoded}")
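
The wrapper pattern itself is simple: delegate to the base model, then optionally post-process each vector. The sketch below is not uubed's implementation. `EncodingWrapper`, `FixedEmbeddings`, and `toy_encode` are hypothetical names, and plain Base64 over 8-bit quantized values stands in for a real uubed method:

```python
import base64


def toy_encode(vector):
    """Stand-in encoder: clamp to [-1, 1], quantize to bytes, Base64-encode."""
    quantized = bytes(
        int(round((max(-1.0, min(1.0, x)) + 1.0) * 127.5)) for x in vector
    )
    return base64.b64encode(quantized).decode("ascii")


class FixedEmbeddings:
    """Tiny deterministic base model used only for this example."""

    def embed_documents(self, texts):
        return [self.embed_query(text) for text in texts]

    def embed_query(self, text):
        # Three made-up features derived from the text length.
        n = len(text)
        return [(n % 7) / 7.0, (n % 5) / 5.0 - 0.5, -(n % 3) / 3.0]


class EncodingWrapper:
    """Delegates to a base embedding model and optionally encodes the result."""

    def __init__(self, base, encode, return_encoded=False):
        self.base = base
        self.encode = encode
        self.return_encoded = return_encoded

    def embed_documents(self, texts):
        vectors = self.base.embed_documents(texts)
        if self.return_encoded:
            return [self.encode(v) for v in vectors]
        return vectors

    def embed_query(self, text):
        vector = self.base.embed_query(text)
        return self.encode(vector) if self.return_encoded else vector


raw = EncodingWrapper(FixedEmbeddings(), toy_encode, return_encoded=False)
encoded = EncodingWrapper(FixedEmbeddings(), toy_encode, return_encoded=True)

print(raw.embed_query("Hello, world!"))      # list of floats
print(encoded.embed_query("Hello, world!"))  # Base64 string
```

Keeping the flag on the wrapper lets the same object feed a vector store (raw floats) or a text index (encoded strings) without touching the base model.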

## 3. Document Processing Pipeline

Use UubedDocumentProcessor in your document ingestion pipeline:

In [None]:
# Create a document processor
processor = UubedDocumentProcessor(
    embeddings=base_embeddings,
    method="zoq64",
    batch_size=2  # Process 2 documents at a time
)

# Create more documents
large_doc_set = [
    Document(page_content=f"Document {i}: This is test content for document number {i}.", 
             metadata={"doc_id": i, "source": "test"})
    for i in range(10)
]

# Process documents
processed_docs = processor.process(large_doc_set)

print(f"Processed {len(processed_docs)} documents")
print("\nSample processed document:")
print(f"  Content: {processed_docs[0].page_content}")
print(f"  Metadata: {processed_docs[0].metadata}")
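
A batch size of 2 over ten documents works out to five batches. How `UubedDocumentProcessor` batches internally is not shown in this notebook; the sketch below only illustrates the chunking arithmetic, with `iter_batches` as a hypothetical helper:

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


doc_texts = [f"Document {i}" for i in range(10)]
batches = list(iter_batches(doc_texts, batch_size=2))

print(f"{len(batches)} batches")
for batch in batches:
    print(batch)
```

The last batch is simply shorter when the document count is not a multiple of the batch size.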

## 4. Integration with Vector Stores

Here's how to add the processed documents to a vector store and build a retriever with `create_uubed_retriever`:

In [None]:
# Mock vector store for demonstration
class MockVectorStore:
    """Mock vector store (replace with Chroma, Pinecone, etc.)"""
    
    def __init__(self):
        self.documents = []
        self.embeddings = []
        self.embedding_function = None
    
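    def add_documents(self, documents):
        # Offset IDs by the current size so repeated calls do not reuse them
        start = len(self.documents)
        self.documents.extend(documents)
        return [f"id_{start + i}" for i in range(len(documents))]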
    
    def as_retriever(self, **kwargs):
        return self
    
    def get_relevant_documents(self, query):
        # Simple mock retrieval: ignores the query and k, returns the first two documents
        return self.documents[:2] if self.documents else []

# Create vector store and retriever
vectorstore = MockVectorStore()

# Add processed documents
doc_ids = vectorstore.add_documents(processed_docs)
print(f"Added {len(doc_ids)} documents to vector store")

# Create uubed-enhanced retriever
retriever = create_uubed_retriever(
    vectorstore=vectorstore,
    embeddings=base_embeddings,
    method="shq64",
    search_kwargs={"k": 5}
)

# Use the retriever
query = "Tell me about document processing"
results = retriever.get_relevant_documents(query)
print(f"\nRetrieved {len(results)} documents for query: '{query}'")
for i, doc in enumerate(results):
    print(f"  {i+1}. {doc.page_content[:60]}...")
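
`MockVectorStore` ignores both the query and `search_kwargs`; it always returns the first two documents. A real store ranks documents by similarity to the query embedding and keeps the top `k`. A minimal sketch of that ranking step with cosine similarity, in pure Python. `cosine` and `top_k` are hypothetical helpers, not LangChain or uubed APIs:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k most similar vectors, best first."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, doc_vecs, k=2))
```

Production stores replace the full sort with an approximate nearest-neighbor index, but the contract is the same: query vector in, `k` best documents out.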

## 5. Complete RAG Pipeline Example

Here's a complete example showing how uubed fits into a RAG pipeline:

In [None]:
class UubedRAGPipeline:
    """Example RAG pipeline with uubed encoding."""
    
    def __init__(self, embedding_model, vector_store, encoding_method="auto"):
        # Wrap embeddings with uubed
        self.embeddings = UubedEmbeddings(
            embedding_model,
            method=encoding_method,
            return_encoded=False  # Keep raw embeddings for vector search
        )
        
        # Document processor for ingestion
        self.processor = UubedDocumentProcessor(
            self.embeddings,
            method=encoding_method
        )
        
        # Vector store for retrieval
        self.vector_store = vector_store
        self.vector_store.embedding_function = self.embeddings
    
    def ingest_documents(self, documents):
        """Ingest documents with uubed encoding."""
        # Process documents (adds encoded embeddings to metadata)
        processed = self.processor.process(documents)
        
        # Add to vector store
        ids = self.vector_store.add_documents(processed)
        
        return ids
    
    def search(self, query, k=5):
        """Search for relevant documents."""
        # A real vector store embeds the query via embedding_function
        # (raw vectors here, since return_encoded=False); the mock skips this
        retriever = self.vector_store.as_retriever(search_kwargs={"k": k})
        return retriever.get_relevant_documents(query)
    
    def get_encoding_stats(self):
        """Get statistics about encoded documents."""
        encoded_lengths = []
        for doc in self.vector_store.documents:
            if "uubed_encoded" in doc.metadata:
                encoded_lengths.append(len(doc.metadata["uubed_encoded"]))
        
        if encoded_lengths:
            return {
                "total_documents": len(self.vector_store.documents),
                "encoded_documents": len(encoded_lengths),
                "avg_encoding_length": np.mean(encoded_lengths),
                "encoding_method": self.processor.encoder.method
            }
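        # No encoded documents found: still report the true document count
        return {
            "total_documents": len(self.vector_store.documents),
            "encoded_documents": 0
        }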

# Create and use the pipeline
pipeline = UubedRAGPipeline(
    embedding_model=MockEmbeddings(),
    vector_store=MockVectorStore(),
    encoding_method="shq64"
)

# Ingest documents
sample_docs = [
    Document(page_content="RAG combines retrieval with generation for better AI responses."),
    Document(page_content="Embeddings capture semantic meaning in vector space."),
    Document(page_content="uubed provides position-safe encoding for embeddings."),
]

doc_ids = pipeline.ingest_documents(sample_docs)
print(f"Ingested {len(doc_ids)} documents")

# Search
results = pipeline.search("How do embeddings work?", k=2)
print("\nSearch results:")
for i, doc in enumerate(results):
    print(f"  {i+1}. {doc.page_content}")
    if "uubed_encoded" in doc.metadata:
        print(f"     Encoded: {doc.metadata['uubed_encoded']}")

# Get statistics
stats = pipeline.get_encoding_stats()
print(f"\nPipeline statistics: {stats}")
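
`MockEmbeddings` draws fresh random vectors on every call, so the same text never maps to the same embedding twice. For repeatable demos, the mock can be seeded from the text itself. A sketch under that assumption; `DeterministicMockEmbeddings` is a hypothetical class and still produces noise, not semantic vectors:

```python
import hashlib
import random


class DeterministicMockEmbeddings:
    """Mock embeddings that return the same vector for the same text."""

    def __init__(self, dim=384):
        self.dim = dim

    def _vector(self, text):
        # Derive a stable seed from the text; built-in hash() is salted per process.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        rng = random.Random(int.from_bytes(digest[:8], "big"))
        return [rng.gauss(0.0, 1.0) for _ in range(self.dim)]

    def embed_documents(self, texts):
        return [self._vector(text) for text in texts]

    def embed_query(self, text):
        return self._vector(text)


mock = DeterministicMockEmbeddings(dim=8)
first = mock.embed_query("How do embeddings work?")
second = mock.embed_query("How do embeddings work?")
print(first == second)
```

Because it exposes the same `embed_documents` and `embed_query` methods, it can be passed wherever `MockEmbeddings()` is used in this notebook.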

## 6. Advanced: Custom Encoding Strategies

Choose encoding methods based on your use case:

In [None]:
def select_encoding_method(embedding_dim, use_case="general"):
    """Return a (method, params) tuple suited to the given use case."""
    
    if use_case == "exact_match":
        # Need lossless encoding for exact matching
        return "eq64", {}
    
    elif use_case == "similarity_search":
        # Compact similarity-preserving encoding
        return "shq64", {"planes": min(64, embedding_dim // 4)}
    
    elif use_case == "sparse":
        # Sparse representation for high-dimensional embeddings
        k = min(32, embedding_dim // 10)
        return "t8q64", {"k": k}
    
    elif use_case == "spatial":
        # Spatial locality for range queries
        return "zoq64", {}
    
    else:
        # Auto-select based on dimensions
        return "auto", {}

# Test different strategies
use_cases = ["exact_match", "similarity_search", "sparse", "spatial"]
embedding_dims = [384, 768, 1536]

print("Encoding strategy recommendations:")
print("-" * 60)
for dim in embedding_dims:
    print(f"\nEmbedding dimension: {dim}")
    for use_case in use_cases:
        method, params = select_encoding_method(dim, use_case)
        print(f"  {use_case:20s}: {method} {params}")
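
For the three dimensions tested in this cell, both parameter formulas hit their caps: `min(64, dim // 4)` is 64 and `min(32, dim // 10)` is 32 in every case. The parameters only shrink for smaller embeddings, as this standalone check of the same two formulas shows (the function names are chosen for this example):

```python
def planes_for(embedding_dim):
    """Same formula as the similarity_search branch."""
    return min(64, embedding_dim // 4)


def k_for(embedding_dim):
    """Same formula as the sparse branch."""
    return min(32, embedding_dim // 10)


for dim in [64, 128, 256, 384, 768, 1536]:
    print(f"dim={dim:5d}  planes={planes_for(dim):3d}  k={k_for(dim):3d}")
```

The `planes` cap is reached at 256 dimensions and the `k` cap at 320, so for common model sizes the selection depends on the use case alone.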

## Summary

In this notebook, we covered:

1. **Basic encoding**: Adding uubed-encoded embeddings to document metadata
2. **Embedding wrapper**: Using UubedEmbeddings to wrap any LangChain embedding model
3. **Document processing**: Batch processing documents with automatic encoding
4. **Vector store integration**: Creating enhanced retrievers with uubed
5. **Complete RAG pipeline**: Full example of uubed in a RAG application
6. **Encoding strategies**: Selecting optimal encoding methods for different use cases

Key benefits of using uubed with LangChain:
- **Position-safe**: Prevents substring pollution in search
- **Flexible**: Works with any embedding model and vector store
- **Efficient**: Multiple encoding methods for different needs
- **Easy integration**: Drop-in components for existing pipelines
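
To see what substring pollution looks like, consider plain Base64, where the same characters can appear at any offset. In the sketch below, the Base64 text of one byte string shows up inside the Base64 text of unrelated bytes, so a substring search on the encoded text reports a false match. This only illustrates the problem; how uubed's encodings avoid it is not shown here:

```python
import base64

needle_bytes = b"abc"
needle = base64.b64encode(needle_bytes).decode("ascii")

# Build a Base64 string that contains the needle shifted by one character,
# then decode it to see which bytes it actually represents.
haystack = "A" + needle + "AAA"
haystack_bytes = base64.b64decode(haystack)

print(f"needle:   {needle}")
print(f"haystack: {haystack}")
print(f"substring match on encoded text: {needle in haystack}")
print(f"match on underlying bytes:       {needle_bytes in haystack_bytes}")
```

The encoded strings match while the underlying bytes do not, which is the kind of false positive a position-safe encoding is meant to rule out.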

Next steps:
- Try with real embedding models (OpenAI, Cohere, etc.)
- Test with production vector stores (Chroma, Pinecone, Weaviate)
- Experiment with different encoding methods for your use case