# Embeddings and Vector Stores with LangChain

## Learning Objectives
By the end of this notebook, you will be able to:
- Generate embeddings from text using various providers
- Understand vector representations and similarity
- Work with vector stores (Chroma, FAISS, etc.)
- Perform semantic search and retrieval
- Optimize embedding strategies for cost and performance
- Build vector databases for RAG systems

## Why This Matters: The Heart of Semantic Search

**In RAG Systems:**
- Embeddings enable semantic understanding
- Vector stores provide fast retrieval
- Similarity search finds relevant context

**In Search Applications:**
- Go beyond keyword matching
- Find conceptually similar content
- Enable multilingual search

## Prerequisites
- Completed notebooks 00-05
- Understanding of document processing
- Basic linear algebra concepts (vectors, dot products) are helpful

## Setup: Install and Import Dependencies

In [None]:
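# Install required packages (python-dotenv for load_dotenv, sentence-transformers for
# HuggingFaceEmbeddings, rank_bm25 for BM25Retriever)
!pip install -q langchain langchain-community langchain-openai chromadb faiss-cpu tiktoken python-dotenv sentence-transformers rank_bm25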

import os
import numpy as np
from dotenv import load_dotenv
from typing import List

# Embeddings
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

# Vector stores
from langchain_community.vectorstores import Chroma, FAISS
from langchain_core.documents import Document

# For splitting
from langchain.text_splitter import RecursiveCharacterTextSplitter

load_dotenv()
print("✅ Ready to work with embeddings and vectors!")

---

## Instructor Activity 1: Understanding Embeddings

### Example 1: Generating and Comparing Embeddings

In [None]:
# Empty cell for demonstration

<details>
<summary>Solution</summary>

```python
from langchain_openai import OpenAIEmbeddings
import numpy as np

# Initialize embeddings model
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"  # Newer and cheaper than text-embedding-ada-002
)

# Generate embeddings for similar and different texts
texts = [
    "Machine learning is a type of artificial intelligence.",
    "AI and ML are closely related technologies.",
    "The weather today is sunny and warm.",
    "Deep learning is a subset of machine learning."
]

# Generate embeddings
vector_embeddings = embeddings.embed_documents(texts)

print("Embedding Analysis:")
print("=" * 50)
print(f"Embedding dimensions: {len(vector_embeddings[0])}")
print(f"Number of texts: {len(texts)}")

# Calculate cosine similarity
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Compare all pairs
print("\nPairwise Similarities:")
for i in range(len(texts)):
    for j in range(i+1, len(texts)):
        similarity = cosine_similarity(
            vector_embeddings[i], 
            vector_embeddings[j]
        )
        print(f"Text {i+1} vs Text {j+1}: {similarity:.3f}")
        print(f"  '{texts[i][:30]}...' vs '{texts[j][:30]}...'")

# Query embedding
query = "What is artificial intelligence?"
query_embedding = embeddings.embed_query(query)

print(f"\nQuery: '{query}'")
print("Relevance scores:")
for i, text in enumerate(texts):
    score = cosine_similarity(query_embedding, vector_embeddings[i])
    print(f"  {score:.3f}: {text[:50]}...")
```

</details>
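
To see what the cosine similarity formula does without calling an embedding API, here is a minimal sketch that uses only the standard library. The three-dimensional vectors are hypothetical stand-ins for real embeddings, which have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Hypothetical toy "embeddings" (not real model output)
toy_vectors = {
    "ml": [0.9, 0.1, 0.0],
    "ai": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["ml"], toy_vectors["ai"])
sim_unrelated = cosine_similarity(toy_vectors["ml"], toy_vectors["weather"])

print(f"ml vs ai:      {sim_related:.3f}")
print(f"ml vs weather: {sim_unrelated:.3f}")
```

Vectors that point in nearly the same direction score close to 1, while nearly orthogonal vectors score close to 0. Scaling a vector does not change the result, because only direction matters.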

### Example 2: Different Embedding Providers

In [None]:
# Empty cell for demonstration

<details>
<summary>Solution</summary>

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

# Different embedding models
embedding_models = {
    "OpenAI": OpenAIEmbeddings(model="text-embedding-3-small"),
    "HuggingFace": HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
}

test_text = "LangChain is great for building LLM applications."

print("Embedding Model Comparison:")
print("=" * 50)

for name, model in embedding_models.items():
    try:
        embedding = model.embed_query(test_text)
        print(f"\n{name}:")
        print(f"  Dimensions: {len(embedding)}")
        print(f"  Sample values: {embedding[:5]}")
        print(f"  Cost: {'Paid' if 'OpenAI' in name else 'Free'}")
    except Exception as e:
        print(f"\n{name}: Error - {str(e)[:50]}")

print("\n💡 Choose embeddings based on cost, quality, and speed needs!")
```

</details>
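
The two models above return vectors of different sizes (1536 dimensions for `text-embedding-3-small` by default, 384 for `all-MiniLM-L6-v2`), so vectors from one model cannot be compared against vectors from the other. The sketch below shows one way to guard against mixing them. The helper name `check_same_space` is hypothetical, and matching dimensions is a necessary check, not proof that two vectors came from the same model.

```python
def check_same_space(query_vec, doc_vecs):
    """Return the shared dimension, or raise if the vectors cannot be compared."""
    if not doc_vecs:
        raise ValueError("No document vectors to compare against")
    dims = {len(v) for v in doc_vecs}
    if len(dims) != 1:
        raise ValueError(f"Documents have mixed dimensions: {sorted(dims)}")
    (doc_dim,) = dims
    if len(query_vec) != doc_dim:
        raise ValueError(
            f"Query has {len(query_vec)} dimensions, documents have {doc_dim}"
        )
    return doc_dim

# Placeholder vectors with the dimensions of all-MiniLM-L6-v2
query_vec = [0.0] * 384
doc_vecs = [[0.0] * 384, [0.0] * 384]
print(f"Shared dimension: {check_same_space(query_vec, doc_vecs)}")
```

In practice this means you embed documents and queries with the same model, and re-embed the whole corpus when you switch models.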

---

## Learner Activity 1: Practice with Embeddings

### Exercise: Build a Semantic Similarity Checker

In [None]:
# Your code here
# TODO: Create a function that:
# 1. Takes two texts as input
# 2. Generates embeddings
# 3. Calculates similarity
# 4. Returns whether they're similar (similarity > 0.8)

<details>
<summary>Solution</summary>

```python
def semantic_similarity_checker(text1: str, text2: str, threshold: float = 0.8):
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    
    vec1 = embeddings.embed_query(text1)
    vec2 = embeddings.embed_query(text2)
    
    similarity = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
    
    is_similar = similarity > threshold
    
    print(f"Text 1: {text1[:50]}...")
    print(f"Text 2: {text2[:50]}...")
    print(f"Similarity: {similarity:.3f}")
    print(f"Similar? {is_similar}")
    
    return is_similar

# Test
pairs = [
    ("Python is a programming language", "Python is used for coding"),
    ("The sky is blue", "Grass is green")
]

for t1, t2 in pairs:
    semantic_similarity_checker(t1, t2)
    print()
```

</details>
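
The 0.8 threshold is a starting point, not a universal constant, because typical similarity values differ from one embedding model to another. One way to pick a threshold is to score a handful of pairs you have labeled by hand and choose the cutoff that separates them best. The sketch below assumes a pair counts as similar when its score is at or above the threshold. The scores and labels are hypothetical, not real model output.

```python
def best_threshold(scored_pairs):
    """Return the (threshold, accuracy) that best separates labeled pairs.

    scored_pairs: list of (similarity, is_similar) tuples.
    A pair is predicted similar when similarity >= threshold.
    """
    candidates = sorted({score for score, _ in scored_pairs})
    best = (None, -1.0)
    for threshold in candidates:
        correct = sum(
            (score >= threshold) == label for score, label in scored_pairs
        )
        accuracy = correct / len(scored_pairs)
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best

# Hypothetical similarity scores with human labels
labeled = [
    (0.91, True), (0.78, True), (0.74, True),
    (0.61, False), (0.55, False), (0.32, False),
]

threshold, accuracy = best_threshold(labeled)
print(f"Best threshold: {threshold}, accuracy: {accuracy:.2f}")
```

With real data the two groups usually overlap, so expect accuracy below 1.0 and decide whether false positives or false negatives matter more for your use case.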

---

## Instructor Activity 2: Vector Stores

### Example 1: Creating and Querying Chroma

In [None]:
# Empty cell for demonstration

<details>
<summary>Solution</summary>

```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

# Create documents
docs = [
    Document(
        page_content="LangChain is a framework for building LLM applications.",
        metadata={"source": "intro.txt", "topic": "langchain"}
    ),
    Document(
        page_content="Vector stores enable semantic search over documents.",
        metadata={"source": "vectors.txt", "topic": "search"}
    ),
    Document(
        page_content="RAG combines retrieval with generation for better results.",
        metadata={"source": "rag.txt", "topic": "rag"}
    ),
    Document(
        page_content="Embeddings convert text into numerical vectors.",
        metadata={"source": "embeddings.txt", "topic": "embeddings"}
    )
]

# Create vector store (same embedding model as the earlier examples)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name="demo_collection"
)

print("Vector Store Created!")
print(f"Documents: {vectorstore._collection.count()}")

# Similarity search
query = "How do I search documents?"
results = vectorstore.similarity_search(query, k=2)

print(f"\nQuery: '{query}'")
print("\nTop Results:")
for i, doc in enumerate(results, 1):
    print(f"{i}. {doc.page_content}")
    print(f"   Source: {doc.metadata['source']}")

# Search with scores (Chroma returns a distance here, so lower means more similar)
results_with_scores = vectorstore.similarity_search_with_score(query, k=2)

print("\nResults with Distance Scores (lower = closer):")
for doc, score in results_with_scores:
    print(f"Distance: {score:.3f} - {doc.page_content[:50]}...")

# Metadata filtering
filtered_results = vectorstore.similarity_search(
    query,
    k=2,
    filter={"topic": "search"}
)

print("\nFiltered Results (topic=search):")
for doc in filtered_results:
    print(f"- {doc.page_content}")
```

</details>
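
The values returned by `similarity_search_with_score` depend on the distance metric the store uses. Assuming the store reports squared Euclidean (L2) distance, which is Chroma's default as far as its documentation describes, and assuming unit-length vectors (OpenAI embeddings are normalized), the distance and the cosine similarity are linked by `distance = 2 - 2 * cosine`. The sketch below checks that identity on hypothetical toy vectors.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy vectors, normalized to unit length
a = normalize([0.9, 0.1, 0.0])
b = normalize([0.8, 0.2, 0.1])

distance = squared_l2(a, b)
similarity = dot(a, b)
recovered = 1 - distance / 2  # cosine recovered from the distance

print(f"Squared L2 distance:     {distance:.4f}")
print(f"Cosine similarity:       {similarity:.4f}")
print(f"Cosine from distance:    {recovered:.4f}")
```

If your collection is configured with a different metric, the conversion differs, so check the store's settings before turning scores into thresholds.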

### Example 2: FAISS for Large-Scale Search

In [None]:
# Empty cell for demonstration

<details>
<summary>Solution</summary>

```python
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create larger dataset
long_text = """
Artificial Intelligence (AI) is revolutionizing how we interact with technology.
Machine learning models can now understand and generate human language.
Deep learning has enabled breakthroughs in computer vision and NLP.
Neural networks are inspired by the human brain's architecture.
Large language models like GPT have billions of parameters.
Transfer learning allows models to adapt to new tasks quickly.
Reinforcement learning helps agents learn through trial and error.
Computer vision enables machines to understand visual information.
Natural language processing allows computers to understand text.
AI ethics is crucial for responsible AI development.
"""

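# Split into chunks (chunk_size=100 keeps each sentence above in one piece;
# 50 would cut sentences in half and hurt retrieval quality)
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)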
texts = splitter.split_text(long_text)

# Create FAISS index
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
faiss_index = FAISS.from_texts(
    texts,
    embeddings,
    metadatas=[{"chunk_id": i} for i in range(len(texts))]
)

print(f"FAISS Index created with {len(texts)} chunks")

# Search
query = "How do neural networks work?"
docs = faiss_index.similarity_search(query, k=3)

print(f"\nQuery: {query}")
print("\nTop Results:")
for doc in docs:
    print(f"- {doc.page_content}")

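# Save and load
faiss_index.save_local("faiss_index")
# load_local unpickles the stored docstore, so only enable this flag for index files you created yourself
loaded_index = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)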
print("\n✅ Index saved and loaded successfully!")

# Maximal marginal relevance (diverse results)
mmr_results = faiss_index.max_marginal_relevance_search(query, k=3, fetch_k=10)
print("\nDiverse Results (MMR):")
for doc in mmr_results:
    print(f"- {doc.page_content}")
```

</details>
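
Maximal marginal relevance (MMR) picks results one at a time, rewarding similarity to the query and penalizing similarity to results already chosen. The following is a minimal standard-library sketch of that idea, not the FAISS or LangChain implementation. The 2-D vectors are hypothetical, and `lambda_mult=0.5` weighs relevance and diversity equally.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy: highest similarity to anything already selected
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical vectors: docs 0 and 1 are near-duplicates, doc 2 is less
# relevant but covers a different direction
query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [0.7, -0.7]]

top_by_similarity = sorted(
    range(len(doc_vecs)),
    key=lambda i: cosine(query_vec, doc_vecs[i]),
    reverse=True,
)[:2]
top_by_mmr = mmr(query_vec, doc_vecs, k=2)

print(f"Plain similarity picks: {top_by_similarity}")
print(f"MMR picks:              {top_by_mmr}")
```

Plain similarity returns both near-duplicates, while MMR keeps the best one and replaces the second with the document that adds something new.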

---

## Learner Activity 2: Build a Document Search System

### Exercise: Create a Searchable Knowledge Base

In [None]:
# Your code here
# TODO: Build a system that:
# 1. Loads multiple documents
# 2. Creates a vector store
# 3. Implements search with metadata filtering
# 4. Returns top K relevant documents

<details>
<summary>Solution</summary>

```python
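from typing import List, Optional

class KnowledgeBase:
    def __init__(self):
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.vectorstore = None

    def add_documents(self, texts: List[str], metadatas: Optional[List[dict]] = None):
        # Give every document its own metadata dict (avoid sharing one mutable default)
        if metadatas is None:
            metadatas = [{} for _ in texts]
        if len(metadatas) != len(texts):
            raise ValueError("texts and metadatas must have the same length")
        docs = [
            Document(page_content=text, metadata=metadata or {})
            for text, metadata in zip(texts, metadatas)
        ]

        if self.vectorstore is None:
            # A named collection keeps this knowledge base separate from other demos
            self.vectorstore = Chroma.from_documents(
                docs, self.embeddings, collection_name="knowledge_base"
            )
        else:
            self.vectorstore.add_documents(docs)

    def search(self, query: str, k: int = 3, filter: Optional[dict] = None):
        if self.vectorstore is None:
            return []
        return self.vectorstore.similarity_search(query, k=k, filter=filter)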

# Test
kb = KnowledgeBase()

# Add documents
kb.add_documents(
    texts=[
        "Python is great for data science.",
        "JavaScript is used for web development.",
        "Machine learning requires good data."
    ],
    metadatas=[
        {"category": "programming"},
        {"category": "programming"},
        {"category": "ml"}
    ]
)

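# Search
results = kb.search("What language for AI?", k=2)
for doc in results:
    print(f"- {doc.page_content} [{doc.metadata}]")

# Search restricted by metadata
filtered = kb.search("What language for AI?", k=2, filter={"category": "programming"})
print("\nFiltered (category=programming):")
for doc in filtered:
    print(f"- {doc.page_content} [{doc.metadata}]")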
```

</details>
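
Conceptually, a filtered search first narrows the candidates by metadata and then ranks what remains by similarity. The sketch below shows that order of operations with a brute-force, in-memory search. It is an illustration under simplified assumptions, not how Chroma is implemented. The record layout, the `metadata_filter` parameter, and the 2-D vectors are all hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, records, k=3, metadata_filter=None):
    """Filter records by exact metadata match, then rank by cosine similarity."""
    metadata_filter = metadata_filter or {}
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value
               for key, value in metadata_filter.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(query_vec, r["vector"]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical records with toy 2-D vectors in place of real embeddings
records = [
    {"text": "Python is great for data science.",
     "vector": [0.9, 0.3], "metadata": {"category": "programming"}},
    {"text": "JavaScript is used for web development.",
     "vector": [0.2, 0.9], "metadata": {"category": "programming"}},
    {"text": "Machine learning requires good data.",
     "vector": [0.8, 0.5], "metadata": {"category": "ml"}},
]

query_vec = [1.0, 0.2]
unfiltered = search(query_vec, records, k=2)
ml_only = search(query_vec, records, k=2, metadata_filter={"category": "ml"})

for r in unfiltered:
    print(f"- {r['text']} [{r['metadata']}]")
print("Filtered (category=ml):")
for r in ml_only:
    print(f"- {r['text']} [{r['metadata']}]")
```

Note that a filter can return fewer than `k` results when few documents match, which is also what you should expect from a real vector store.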

---

## Instructor Activity 3: Advanced Vector Store Techniques

### Example: Hybrid Search and Indexing Strategies

In [None]:
# Empty cell for demonstration

<details>
<summary>Solution</summary>

```python
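from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # requires the rank_bm25 package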

# Documents about programming
docs = [
    Document(page_content="Python pandas library is great for data analysis."),
    Document(page_content="JavaScript async/await simplifies asynchronous code."),
    Document(page_content="SQL JOIN operations combine data from multiple tables."),
    Document(page_content="Machine learning models need training data."),
    Document(page_content="REST APIs use HTTP methods like GET and POST.")
]

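# Create vector store retriever
# A dedicated collection name keeps this demo separate from any other Chroma
# collections created in the same session
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(docs, embeddings, collection_name="hybrid_demo")
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})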

# Create keyword retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 3

# Combine retrievers
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # Weight semantic search higher
)

# Test hybrid search
query = "How to analyze data with Python?"

print("Hybrid Search Results:")
# invoke() replaces the deprecated get_relevant_documents()
results = ensemble_retriever.invoke(query)
for i, doc in enumerate(results[:3], 1):
    print(f"{i}. {doc.page_content}")

# Compare with individual retrievers
print("\nVector Only:")
for doc in vector_retriever.invoke(query)[:2]:
    print(f"- {doc.page_content}")

print("\nKeyword Only:")
for doc in bm25_retriever.invoke(query)[:2]:
    print(f"- {doc.page_content}")

print("\n💡 Hybrid search combines semantic and keyword matching!")
```

</details>
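
Hybrid search needs a way to merge two ranked lists whose scores are not on the same scale. A common approach, and the one `EnsembleRetriever` is documented as using, is weighted reciprocal rank fusion: each document earns `weight / (c + rank)` from every list it appears in. The sketch below is a minimal standard-library version of that idea rather than LangChain's code. The document IDs and rankings are hypothetical, and `c=60` is the conventional constant.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document IDs with weighted reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a vector retriever and a keyword retriever
vector_ranking = ["pandas", "ml_data", "sql_join"]
keyword_ranking = ["pandas", "sql_join", "rest_api"]

fused = weighted_rrf([vector_ranking, keyword_ranking], weights=[0.6, 0.4])
print(fused)
```

Documents that appear in both lists rise to the top, even when neither retriever ranked them first, because their contributions add up.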

---

## Summary & Next Steps

### What You've Learned
✅ Generating embeddings from text  
✅ Understanding vector similarity  
✅ Using Chroma and FAISS vector stores  
✅ Semantic search with metadata filtering  
✅ Hybrid search strategies  

### Key Takeaways
1. **Embeddings capture semantic meaning** - Similar concepts have similar vectors
2. **Vector stores enable fast search** - Efficient similarity matching at scale
3. **Metadata filtering adds precision** - Combine semantic and structured search
4. **Choose the right embedding model** - Balance cost, speed, and quality
5. **Hybrid search improves results** - Combine semantic and keyword matching

### What's Next?
In the next notebook (`07_rag_systems.ipynb`), you'll learn:
- Building complete RAG pipelines
- Query enhancement techniques
- Context window management
- RAG evaluation metrics

---

🎉 **Congratulations!** You've mastered embeddings and vector stores!