# IBM RAG and Agentic AI

## Course 3 - Vector Databases for RAG: An Introduction

### Module 1 - Introduction to Vector Databases and Chroma DB

#### Lecture 1 - Vector Database Concepts

- Vector DBs can be used to group items, classify items and suggest relationships among items
- Vector DBs can be used
    - to store complex data types (social likes, geospatial data, genomic data, etc.)
      <img src='images/200754.008_Vector-DB-reading-image1.png' width=600/>
    - to perform similarity searches
    - in diverse domains such as biology, healthcare, e-commerce, social media and traffic planning
    - to support machine learning
- Traditional DBs store data as tables, whereas vector DBs store data as high-dimensional vectors, each with a magnitude and a direction. Each dimension corresponds to a different attribute. For example, a book could be stored in a vector DB as [1, 300, 2024, 4.2].
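
A minimal sketch of the book example, assuming the four dimensions stand for genre id, page count, publication year and average rating. The notes do not name the attributes, so these labels and the two extra books are hypothetical. It shows that once items are vectors, "similar" simply means "close together":

```python
import math

# Hypothetical attribute order: [genre_id, page_count, year, rating]
book_a = [1, 300, 2024, 4.2]   # the example vector from the notes
book_b = [1, 320, 2023, 4.0]   # made-up book with similar attributes
book_c = [7, 900, 1950, 2.1]   # made-up book with very different attributes


def l2_distance(u, v):
    """Straight-line (Euclidean) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


dist_ab = l2_distance(book_a, book_b)
dist_ac = l2_distance(book_a, book_c)

# The similar book is closer to book_a than the dissimilar one.
print(dist_ab < dist_ac)
```

In practice the raw attributes would usually be scaled first, since here the page count and year dominate the distance.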

#### Lecture 2 - Traditional vs Vector Databases

|Function|Traditional databases|Vector databases|
|------|------|------|
|Data Representation|Traditional databases organize data in a structured format using tables, rows, and columns, ideal for relational data|Vector databases represent data as multi-dimensional vectors, efficiently encoding complex and unstructured data like images, text, and sensor data.|
|Data Search and Retrieval|SQL queries are suited for traditional databases with structured data.|Vector databases specialize in similarity searches and retrieving vectorized data, facilitating tasks like image retrieval, recommendation systems, and anomaly detection.|
|Indexing|Traditional databases employ indexing methods like B-trees for efficient data retrieval.|Vector databases use indexing structures like metric trees and hashing suited for high-dimensional spaces, enhancing nearest-neighbor searches and similarity assessments.|
|Scalability|Scaling traditional databases can be challenging, often requiring resource augmentation or data sharding.|Vector databases are designed for scalability, especially in handling large datasets and similarity searches, using distributed architectures for horizontal scaling.|
|Applications|Traditional databases are pivotal in business applications and transactional systems where structured data is processed.|Vector databases shine in analyzing vast datasets, supporting fields like scientific research, natural language processing, and multimedia analysis.|

#### Lecture 3 - Vector Database Types

- **In-memory** vector DBs, e.g. *RedisAI, TorchServe*, store vectors in RAM, so they are fast but limited in size
- **Disk-based** vector DBs, e.g. *Annoy, Milvus, ScaNN*, store vectors on disk, use compression and indexing, and are suitable for large datasets
- **Distributed** vector DBs, e.g. *FAISS, ElasticSearch+, Dask-ML*, spread data across multiple nodes/servers, which gives horizontal scaling and fault tolerance and makes them suitable for large datasets that need fast retrieval
- **Graph-based** vector DBs, e.g. *Neo4J, Amazon Neptune, TigerGraph*, model data as a graph with nodes and edges representing attributes. They are good at capturing complex relationships and at graph analytics
- **Time-series** vector DBs, e.g. *InfluxDB, TimescaleDB, Prometheus*, represent data collected over time as vectors and are good for identifying temporal patterns and anomalies

Vector DBs can also be classified as dedicated vector DBs or DBs that support vector search

**Dedicated Vector DBs**
- use specialised data structures like reverse indexes, product quantization and Locality-Sensitive Hashing (LSH)
- support vector operations like similarity search, nearest neighbour search and distance calculations
- provide scalability through clustering or distributed nodes
- deliver speed through optimized algorithms and data structures
- are customisable: indexing and search parameters can be tuned to the application's needs
- Examples are FAISS, Annoy and Milvus
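
To make one of these data structures concrete, here is a minimal sketch of Locality-Sensitive Hashing using random hyperplanes, one common LSH variant. It illustrates the idea only and is not how FAISS, Annoy or Milvus implement it; all function names are hypothetical. Each vector gets one bit per hyperplane (which side it falls on), and vectors pointing in similar directions tend to share a bucket, so a search only needs to compare against vectors in the same bucket:

```python
import random


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals from a standard normal distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def build_buckets(vectors, hyperplanes):
    """Group vector names by signature; each signature acts as a hash bucket."""
    buckets = {}
    for name, vec in vectors.items():
        buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(name)
    return buckets


planes = make_hyperplanes(num_planes=8, dim=3)
vectors = {
    "a": [1.0, 2.0, 3.0],
    "a_scaled": [2.0, 4.0, 6.0],      # same direction as "a"
    "opposite": [-1.0, -2.0, -3.0],   # opposite direction
}
buckets = build_buckets(vectors, planes)

# "a" and "a_scaled" share a bucket; "opposite" lands in a different one.
for signature, names in buckets.items():
    print(signature, names)
```

More hyperplanes give smaller, more selective buckets at the cost of occasionally separating true neighbours, which is why LSH is an approximate method.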

**Databases that support Vector Search**
- are regular DBs or data processing frameworks that have tools and addons to allow users to do vector search and other queries
- Store data as part of their data model as BLOBs, Arrays or UDTs.
- Allow standard and custom indexing to organise data
- Have add-on libraries and plugins to support vector operations
- Not as optimized or fast as dedicated vector DBs
- Examples are SingleStore (works with watsonx.ai), ElasticSearch, PostgreSQL, MySQL, RedisAI, MongoDB and Apache Cassandra

#### Lecture 4 - Applications of Vector DBs

1. **Image and Video Analysis**

|Task|Capability|Uses|
|------|------|------|
|Feature Extraction & Representation|Store High-Dimensional Feature Vectors|Displays aspects of images, such as color histograms, texture descriptions or deep learning embeddings|
|Similarity Searches|Store Feature Vectors|Locate images, Summarize videos, and suggest images and videos based on content|
|Process Real-time data|Provide horizontal scaling for real-time storage|Perform video surveillance, object recognition, and live event analysis|

2. **Recommendation Systems**

|Task|Capability|Uses|
|------|------|------|
|Embedding Storage and Nearest Neighbour Search|Incorporate embeddings or numerical representations of items or entities generated by a recommendation system|Access the vector's likes and traits, Locate the vector's closest neighbours for improved personalized suggestions|
|Deliver performance improvement and scalability|Provide scalability to handle additional searches and vectors. Improve query processing and indexing structure|Deliver fast, scalable recommendation services for large numbers of concurrent users|
|Provide cross-domain suggestions|Store embeddings and carry out cross-domain suggestions|Enhance the completeness of recommendation systems|

3. **Geospatial analysis and location-based services**

|Task|Capability|Uses|
|------|------|------|
|Efficiently store and index data|Use indexing methods like R-tree or quadtree; store geospatial data like addresses, polygons and GPS locations|Deliver spatial queries like closeness searches, range queries and spatial joins, for GPS information and other mapping needs|
|Provide location-based suggestions|Combine geospatial data with user preferences and location|Deliver recommendations for nearby events, services and places of interest|
|Deliver real-time geospatial analytics|Process streaming data in real time; group items together spatially; recognize spatial patterns|Power apps for tracking vehicles, managing fleets, dynamic routing and finding hotspots|

4. **Marketing and social media insights**

|Task|Capability|Uses|
|------|------|------|
|Provide distributed storage and parallel processing for horizontal scalability|Spread data and queries across multiple nodes or groups|Process big data and handle simultaneous queries such as SEO calculations|
|Reduce latency and boost overall speed|Use optimized caching and query execution plans|Obtain trending analytics faster|
|Adjust to changing task needs|Support auto-scaling and dynamic resource allocation|Your company can scale hardware and cloud resource usage for the best performance and lower costs|

#### Lecture 5 - Similarity Search

For any two vectors $\vec{a}$ and $\vec{b}$:
- the **L2 distance** or Euclidean distance $\sqrt{\sum_i (a_i-b_i)^2}$ is a **distance** metric
- the **dot product** $\vec{a} \cdot \vec{b} = \sum_i a_i b_i = \lVert \vec{a} \rVert \lVert \vec{b} \rVert \cos(\alpha)$, where $\alpha$ is the angle between the vectors, is a **similarity** metric. Its negative can be used as a distance metric (larger dot product $\implies$ smaller distance)
- the **cosine similarity** $\cos(\alpha) = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \lVert \vec{b} \rVert} = \frac{\vec{a}}{\lVert \vec{a} \rVert} \cdot \frac{\vec{b}}{\lVert \vec{b} \rVert}$, i.e. the dot product of the two normalized vectors, is a **similarity** metric, and $1 - \cos(\alpha)$ is a **distance** metric

|Metric|Sensitive to Magnitude|Normalised|Best For|
|------|------|------|------|
|L2 Distance|$\checkmark$Yes|$\times$No|Spatial Data, Clustering|
|Cosine Distance|$\times$No|$\checkmark$Yes|Text, Embeddings, NLP|
|Dot Product|$\checkmark$Yes|$\times$No|Neural Networks, recommender systems|

L2 distance works well for continuous, lower-dimensional data where magnitude matters. 

Cosine distance excels with high-dimensional, sparse data where direction is more important than magnitude. 

Dot product offers computational efficiency and is useful when both magnitude and direction contribute to similarity. 
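
The difference in magnitude sensitivity is easy to see on tiny vectors. The sketch below uses made-up 2-D vectors and plain Python helpers (the names are illustrative, not from any library): `b` points the same way as `a` but is twice as long, and `c` is perpendicular to `a`.

```python
import math


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


a = [1.0, 2.0]
b = [2.0, 4.0]    # same direction as a, twice the magnitude
c = [2.0, -1.0]   # perpendicular to a

# L2 distance is non-zero for (a, b) even though they point the same way.
print("L2(a, b)     =", l2_distance(a, b))
# The dot product grows with magnitude: scaling b would scale this value too.
print("dot(a, b)    =", dot_product(a, b))
# Cosine similarity ignores magnitude: approximately 1 for a and b,
# and 0 for the perpendicular pair a and c.
print("cosine(a, b) =", cosine_similarity(a, b))
print("cosine(a, c) =", cosine_similarity(a, c))
```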

##### Lab on manually implementing Vector Similarity

In [None]:
!pip install sentence-transformers==4.1.0 | tail -n 1

In [None]:
import math

import numpy as np
import scipy
import torch
from sentence_transformers import SentenceTransformer

In [None]:
# Example documents
documents = [
    'Bugs introduced by the intern had to be squashed by the lead developer.',
    'Bugs found by the quality assurance engineer were difficult to debug.',
    'Bugs are common throughout the warm summer months, according to the entomologist.',
    'Bugs, in particular spiders, are extensively studied by arachnologists.'
]

In [None]:
# Load a pre-trained model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [None]:
# Generate embeddings
embeddings = model.encode(documents)

In [None]:
embeddings.shape

In [None]:
embeddings

In [None]:
def euclidean_distance_fn(vector1, vector2):
    squared_sum = sum((x - y) ** 2 for x, y in zip(vector1, vector2))
    return math.sqrt(squared_sum)

In [None]:
euclidean_distance_fn(embeddings[0], embeddings[1])

In [None]:
euclidean_distance_fn(embeddings[1], embeddings[0])

In [None]:
l2_dist_manual = np.zeros([4,4])
for i in range(embeddings.shape[0]):
    for j in range(embeddings.shape[0]):
        l2_dist_manual[i,j] = euclidean_distance_fn(embeddings[i], embeddings[j])

l2_dist_manual

In [None]:
l2_dist_manual[0,1]

In [None]:
l2_dist_manual[1,0]

In [None]:
l2_dist_manual_improved = np.zeros([4,4])
for i in range(embeddings.shape[0]):
    for j in range(embeddings.shape[0]):
        if (i>j):
            l2_dist_manual_improved[i,j] = l2_dist_manual_improved[j,i]
        elif (i<j):
            l2_dist_manual_improved[i,j] = euclidean_distance_fn(embeddings[i], embeddings[j])
l2_dist_manual_improved

In [None]:
l2_dist_scipy = scipy.spatial.distance.cdist(embeddings, embeddings, 'euclidean')
l2_dist_scipy

In [None]:
np.allclose(l2_dist_manual, l2_dist_scipy)

In [None]:
def dot_product_fn(vector1, vector2):
    return sum(x * y for x, y in zip(vector1, vector2))

In [None]:
dot_product_fn(embeddings[0], embeddings[1])

In [None]:
dot_product_manual = np.empty([4,4])
for i in range(embeddings.shape[0]):
    for j in range(embeddings.shape[0]):
        dot_product_manual[i,j] = dot_product_fn(embeddings[i], embeddings[j])

dot_product_manual

In [None]:
# Matrix multiplication operator
dot_product_operator = embeddings @ embeddings.T
dot_product_operator

In [None]:
np.allclose(dot_product_manual, dot_product_operator, atol=1e-05)

In [None]:
# Equivalent to `np.matmul()` if both arrays are 2-D:
np.matmul(embeddings,embeddings.T)

In [None]:
# `np.dot` returns an identical result, but `np.matmul` is recommended if both arrays are 2-D:
np.dot(embeddings,embeddings.T)

In [None]:
dot_product_distance = -dot_product_manual
dot_product_distance

In [None]:
# L2 norms
l2_norms = np.sqrt(np.sum(embeddings**2, axis=1))
l2_norms

In [None]:
# L2 norms reshaped
l2_norms_reshaped = l2_norms.reshape(-1,1)
l2_norms_reshaped

In [None]:
normalized_embeddings_manual = embeddings/l2_norms_reshaped
normalized_embeddings_manual

In [None]:
np.allclose(np.sqrt(np.sum(normalized_embeddings_manual**2, axis=1)),np.array([1,1,1,1]))

In [None]:
normalized_embeddings_torch = torch.nn.functional.normalize(
    torch.from_numpy(embeddings)
).numpy()
normalized_embeddings_torch

In [None]:
np.allclose(normalized_embeddings_manual, normalized_embeddings_torch)

In [None]:
dot_product_fn(normalized_embeddings_manual[0], normalized_embeddings_manual[1])

In [None]:
cosine_similarity_manual = np.empty([4,4])
for i in range(normalized_embeddings_manual.shape[0]):
    for j in range(normalized_embeddings_manual.shape[0]):
        cosine_similarity_manual[i,j] = dot_product_fn(
            normalized_embeddings_manual[i], 
            normalized_embeddings_manual[j]
        )

cosine_similarity_manual

In [None]:
cosine_similarity_operator = normalized_embeddings_manual @ normalized_embeddings_manual.T
cosine_similarity_operator

In [None]:
np.allclose(cosine_similarity_manual, cosine_similarity_operator)

In [None]:
1 - cosine_similarity_manual

In [None]:
### YOUR CODE GOES HERE ###
# First, embed the query:
query_embedding = model.encode(
    ["Who is responsible for a coding project and fixing others' mistakes?"]
)

# Second, normalize the query embedding:
normalized_query_embedding = torch.nn.functional.normalize(
    torch.from_numpy(query_embedding)
).numpy()

# Third, calculate the cosine similarity between the documents and the query by using the dot product:
cosine_similarity_q3 = normalized_embeddings_manual @ normalized_query_embedding.T

# Fourth, find the position of the vector with the highest cosine similarity:
highest_cossim_position = cosine_similarity_q3.argmax()

# Fifth, find the document in that position in the `documents` array:
documents[highest_cossim_position]

# As you can see, the query retrieved the document `Bugs introduced by the intern had to be squashed by the lead developer.` which is what we would expect.

#### Lecture 6 - Chroma DB Key Concepts and Architecture

**Chroma DB Capabilities**
- Storage of embeddings and their metadata
- Vector Search
- Full-text Search
- Document Storage
- Metadata Filtering
- Multi-Modal Retrieval

**Deployment Modes**
- *Client-Server Architecture*: client and server run as independent processes; the server is launched through the CLI or a Docker image, and the client connects to it over HTTP
- *Standalone Mode*: Python only; client and server run in the same process. Useful for demonstrating capabilities or when it is clear that only one machine will be used

**Architecture Phases**
1. Obtaining Embeddings (Optional, Chroma DB can do this automatically)
2. Creating Collections
3. Storing Data (need to pass embeddings if Chroma DB is not handling embedding internally)
4. Performing Collection Operations (Update/Delete/Rename etc)
5. Querying and Grouping Data

**Ecosystem - Clients and Integrations**
- Officially supports Python and JS clients. Community-maintained clients exist for Java, Ruby, C#, Go, Rust, PHP, etc.
- Integrates with LangChain, LlamaIndex and Ollama
- Provides native integrations with HuggingFace, Google and OpenAI

**In Practice**
Steps to execute
1. Create Collection
2. Add Text Chunks + Metadata (ChromaDB handles embeddings, else you pass embeddings)
3. Query Collection (most similar results returned, query embedding handled internally. Similarity by default is Euclidean (L2) Distance. Dot Product and Cosine Similarity are also supported.)
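
The three steps can be mimicked with a toy in-memory collection. This is a sketch of the workflow only, not Chroma's API or implementation: `ToyCollection`, `toy_embed` and the five-word vocabulary are all hypothetical, and the "embedding" is just a word count. It shows the same flow: documents are embedded on `add` unless embeddings are passed in, the query text is embedded internally, and results come back ordered by increasing L2 distance.

```python
import math

VOCAB = ["apple", "banana", "bread", "fresh", "red"]  # hypothetical toy vocabulary


def toy_embed(text):
    """Stand-in embedding function: counts how often each vocabulary word occurs."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


class ToyCollection:
    def __init__(self, name, embedding_function=toy_embed):
        self.name = name
        self.embedding_function = embedding_function
        self.ids, self.documents, self.embeddings = [], [], []

    def add(self, documents, ids, embeddings=None):
        # Embed the documents ourselves unless the caller supplies embeddings.
        if embeddings is None:
            embeddings = [self.embedding_function(doc) for doc in documents]
        self.ids.extend(ids)
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def query(self, query_text, n_results=1):
        # Embed the query, then rank every stored vector by L2 distance.
        query_vec = self.embedding_function(query_text)
        scored = sorted(
            (math.dist(query_vec, emb), doc_id, doc)
            for emb, doc_id, doc in zip(self.embeddings, self.ids, self.documents)
        )
        return scored[:n_results]


# 1. Create the collection
collection = ToyCollection("toy_groceries")
# 2. Add text chunks (embeddings are computed internally)
collection.add(
    documents=["fresh red apple", "banana bread", "red banana"],
    ids=["id1", "id2", "id3"],
)
# 3. Query: the closest documents come first
results = collection.query("red apple", n_results=2)
for distance, doc_id, doc in results:
    print(f"{doc_id}: {doc!r} (distance {distance:.3f})")
```

A real vector database replaces the brute-force `sorted` scan with a vector index so that it does not have to compare the query against every stored vector.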

**Performance Features**
- Efficient Similarity Search
    - Optimized for Nearest Neighbour Search
    - Internally uses HNSW Algorithm (Hierarchical Navigable Small World)
- Coding Practices
    - Written in Rust $\implies$ 3-5 times improvement in querying and writing ops

**Use Cases and Applications**
- Recommender Systems
- Document Search with vector or full text search
- Image Retrieval, based on text queries using multi-modal retrieval
- AI-based chatbots built with semantic search and retrieval capabilities for context augmentation

**Filtering**
- Supports **Metadata filtering** (similar to SQL where clauses but more powerful) and **Document Filtering** (similar to SQL contains clauses but more powerful)
- **Metadata Filtering**
    - Syntax `collection.get(where={"key":"value"})` (also works for `.delete()` and `.query()`)
    - The following operators are also supported in metadata filtering: `$eq, $ne, $gt, $lt, $gte, $lte, $in, $nin`, used as `where={"key":{"$eq":"value"}}`, `where={"key":{"$in":["value1", "value2"]}}` etc.

    - Filters can be combined with `$and` or `$or` using the syntax below
    ```python
    collection.get(
        where={"$and":[
            {"key1":{"$gte":"value1"}}, 
            {"key2":{"$lt":"value2"}}
            ]
        }
    )
    ```
- **Document Filtering**
    - Syntax `collection.get(where_document={"$contains":"value"})`. `$not_contains` is supported in the same way
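
To show how such a filter reads, here is a small pure-Python evaluator for `where`-style dictionaries. It is an illustrative sketch, not Chroma's implementation: the `matches` function is hypothetical, and its handling of missing keys (treated as "no match") is an assumption that may differ from the real behaviour.

```python
import operator

# Map each filter operator name to a Python comparison.
OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$lt": operator.lt,
    "$gte": operator.ge,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata, where):
    """Return True if a metadata dict satisfies a where-style filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"version": {"$lt": 0.3}}
            for op_name, expected in condition.items():
                if key not in metadata:
                    return False  # assumption: a missing key never matches
                if not OPERATORS[op_name](metadata[key], expected):
                    return False
        elif metadata.get(key) != condition:
            # Shorthand form, e.g. {"source": "python.org"}
            return False
    return True


records = [
    {"source": "langchain.com", "version": 0.1},
    {"source": "llamaindex.ai", "version": 0.2},
    {"source": "python.org", "version": 0.3},
]
where = {
    "$and": [
        {"source": {"$in": ["langchain.com", "llamaindex.ai"]}},
        {"version": {"$lt": 0.2}},
    ]
}
selected = [r["source"] for r in records if matches(r, where)]
print(selected)
```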

##### **Working Example of ChromaDB**

In [None]:
!pip install chromadb

In [None]:
import chromadb
from chromadb.utils import embedding_functions

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

client=chromadb.Client()

collection_name = "filter_demo"

try:
    client.delete_collection(collection_name)
except ValueError:
    pass

collection=client.create_collection(
    name=collection_name,
    metadata={"description":"Used to demo filtering in ChromaDB"},
    configuration={
        "embedding_function":ef
    }
)

print(f"Collection Created: {collection.name}")

In [None]:
collection.add(
    documents=[
        "This is a document about LangChain",
        "This is a reading about LlamaIndex",
        "This is a book about Python",
        "This is a document about pandas",
        "This is another document about LangChain"
    ],
    metadatas=[
        {"source": "langchain.com", "version": 0.1},
        {"source": "llamaindex.ai", "version": 0.2},
        {"source": "python.org", "version": 0.3},
        {"source": "pandas.pydata.org", "version": 0.4},
        {"source": "langchain.com", "version": 0.5},
    ],
    ids=["id1", "id2", "id3", "id4", "id5"]
)

In [None]:
# finds all documents where the source is "langchain.com"
collection.get(
    where={"source": {"$eq": "langchain.com"}}
)

In [None]:
# finds all documents where the source is "langchain.com" with versions less than 0.3
collection.get(
    where={
        "$and": [
            {"source": {"$eq": "langchain.com"}}, 
            {"version": {"$lt": 0.3}}
        ]
    }
)

In [None]:
# retrieves all documents about LangChain and LlamaIndex with a version less than 0.3
collection.get(
    where={
        "$and": [
            {"source": {"$in": ["langchain.com", "llamaindex.ai"]}}, 
            {"version": {"$lt": 0.3}}
        ]
    }
)

In [None]:
# performs a full-text search for documents containing "pandas"
collection.get(
    where_document={"$contains":"pandas"}
)

In [None]:
# looks for all documents containing "LangChain" or "Python" with version numbers greater than 0.1
collection.get(
    where={"version": {"$gt": 0.1}},
    where_document={
        "$or": [
            {"$contains": "LangChain"},
            {"$contains": "Python"}
        ]
    }
)

##### **Similarity Search and HNSW in Chroma DB**

**Vector Indexes** are specialized data structures that enable algorithms to compute similarity scores with only a small subset of vectors, significantly speeding up the search while still returning exact or near-optimal results.

One such vector index is **Hierarchical Navigable Small World or HNSW**

HNSW builds a multi-layered graph where:

- The upper layers contain a sparse overview of the data for fast navigation.
- The bottom layer holds all vectors for detailed search.

Each vector connects to a few nearby neighbors, forming a "small world" network—meaning most vectors can be reached in just a few steps.

HNSW is fast, accurate, scalable, and versatile.
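
The core navigation idea can be sketched on a single layer. The code below is a deliberately simplified illustration and not the HNSW algorithm itself: it has no layer hierarchy, no candidate list, and it builds its neighbour graph by brute force. The helper names are hypothetical. It shows the greedy step that HNSW repeats on each layer: from the current node, move to whichever neighbour is closest to the query, and stop when no neighbour is closer.

```python
import math


def build_knn_graph(points, k):
    """Brute-force graph: link each point to its k nearest other points."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        graph[i] = others[:k]
    return graph


def greedy_search(points, graph, query, entry):
    """Walk the graph, always hopping to the neighbour closest to the query."""
    current = entry
    current_dist = math.dist(query, points[current])
    hops = [current]
    while True:
        best = min(graph[current], key=lambda j: math.dist(query, points[j]))
        best_dist = math.dist(query, points[best])
        if best_dist >= current_dist:
            return current, hops  # no neighbour is closer: stop here
        current, current_dist = best, best_dist
        hops.append(current)


points = [(float(x), 0.0) for x in range(10)]  # ten points on a line
graph = build_knn_graph(points, k=2)
nearest, hops = greedy_search(points, graph, query=(6.2, 0.0), entry=0)

print("nearest point index:", nearest)
print("nodes visited:", hops)
```

On a single layer the walk has to cross the data one short hop at a time. The sparse upper layers in HNSW exist to provide long-range links so that the search reaches the right neighbourhood in far fewer steps. A greedy walk can also stop at a point that is only locally closest, which is why the results are approximate.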

###### **HNSW setup in ChromaDB**

In [None]:
import chromadb
from chromadb.utils import embedding_functions

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Collection creation
client = chromadb.Client()

collection_name = "hnsw_demo"

try:
    client.delete_collection(name=collection_name)
except ValueError:
    pass
    
collection = client.create_collection(
    name=collection_name,
    metadata={"topic": "query testing"},
    configuration={
        "hnsw": {
            # space can be "l2" for L2/Euclidean distance, "ip" for inner (dot) product or "cosine" for cosine similarity
            "space": "cosine",
            # ef_search determines size of candidate list when nearest neighbour search is done. Higher values
            # increase accuracy but reduce speed
            "ef_search": 100,
            # ef_construction determines size of candidate list used to select nearest neighbours when a new 
            # node is inserted into the index. Again higher values improve accuracy and reduce speed
            "ef_construction": 100,
            # max_neighbors determines maximum connections a node can have during construction. Higher values 
            # increase accuracy but also increase cost in terms of memory usage and time. Default value is 16
            "max_neighbors": 16
        },
        "embedding_function": ef
    }
)

Thus `ef_search` affects the breadth of the search at query time, while `ef_construction` and `max_neighbors` affect the quality of the vector index that is built.

###### **Querying in Chroma DB**

In [None]:
collection.add(
    documents=[
        "Giant pandas are a bear species that lives in mountainous areas.",
        "A pandas DataFrame stores two-dimensional, tabular data",
        "I think everyone agrees that pandas are some of the cutest animals on the planet",
        "A direct comparison between pandas and polars indicates that polars is a more efficient library than pandas.",
    ],
    metadatas=[
        {"topic": "animals"},
        {"topic": "data analysis"},
        {"topic": "animals"},
        {"topic": "data analysis"},
    ],
    ids=["id1", "id2", "id3", "id4"]
)

In [None]:
# Only 4 documents exist, so all 4 are returned, ordered by increasing distance.
collection.query(
    query_texts=["cat"],
    n_results=10,
)

In [None]:
# this wrongly returns document #4, confusing the polars library with polar bears.
collection.query(
    query_texts=["polar bear"],
    n_results=1,
)

In [None]:
# this can be fixed by using filters along with the query as below
collection.query(
    query_texts=["polar bear"],
    n_results=1,
    where={'topic': 'animals'}
)

In [None]:
# as alternative to metadata filtering, we can also use full document text search filter as below
collection.query(
    query_texts=["polar bear"],
    n_results=1,
    where_document={'$not_contains': 'library'}
)

In [None]:
# both metadata filtering and full text search can be combined as well, as below
collection.query(
    query_texts=["polar bear"],
    n_results=1,
    where={'topic': 'animals'},
    where_document={'$not_contains': 'library'}
)

#### Lab: Similarity Search on Text Using a Chroma Vector Database

In [None]:
!pip install chromadb==1.0.12

In [None]:
!pip install sentence-transformers==4.1.0

In [None]:
import chromadb
from chromadb.utils import embedding_functions
# Define the embedding function using SentenceTransformers
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

In [None]:
# Create a new instance of ChromaClient to interact with the Chroma DB
client = chromadb.Client()

# Define the name for the collection to be created or retrieved
collection_name = "my_grocery_collection"

In [None]:
# Define the main function to interact with the Chroma DB
def main():
    try:
        # Create a collection in the Chroma database with a specified name, 
        # distance metric, and embedding function. In this case, we are using 
        # cosine distance
        try:
            client.delete_collection(collection_name)
        except ValueError:
            pass
            
        collection = client.create_collection(
            name=collection_name,
            metadata={"description": "A collection for storing grocery data"},
            configuration={
                "hnsw": {"space": "cosine"},
                "embedding_function": ef
            }
        )
        print(f"Collection created: {collection.name}")

        # Array of grocery-related text items
        texts = [
            'fresh red apples',
            'organic bananas',
            'ripe mangoes',
            'whole wheat bread',
            'farm-fresh eggs',
            'natural yogurt',
            'frozen vegetables',
            'grass-fed beef',
            'free-range chicken',
            'fresh salmon fillet',
            'aromatic coffee beans',
            'pure honey',
            'golden apple',
            'red fruit'
        ]
        
        # Create a list of unique IDs for each text item in the 'texts' array
        # Each ID follows the format 'food_<index>', where <index> starts from 1
        ids = [f"food_{index + 1}" for index, _ in enumerate(texts)]

        # Add documents and their corresponding IDs to the collection
        # The `add` method inserts the data into the collection
        # The documents are the actual text items, and the IDs are unique identifiers
        # ChromaDB will automatically generate embeddings using the configured embedding function
        collection.add(
            documents=texts,
            metadatas=[{"source": "grocery_store", "category": "food"} for _ in texts],
            ids=ids
        )

        # Retrieve all the items (documents) stored in the collection
        # The `get` method fetches all data from the collection
        all_items = collection.get()
        # Log the retrieved items to the console for inspection
        # This will print out all the documents, IDs, and metadata stored in the collection
        print("Collection contents:")
        print(f"Number of documents: {len(all_items['documents'])}")

        # Define the query term you want to search for in the collection
        # query_term = "apple"
        
        # practice exercise changes query term to a list
        query_term = ["red","fresh"]

        if (isinstance(query_term, str)):
            query_term = [query_term]

        # Perform a query to search for the most similar documents to the 'query_term'
        results = collection.query(
            query_texts=query_term,
            n_results=3  # Retrieve top 3 results
        )
        print(f"Query results for '{query_term}':")
        print(results)

        # Check if no results are returned or if the results array is empty
        if not results or not results['ids'] or len(results['ids'][0]) == 0:
            # Log a message indicating that no similar documents were found for the query term
            print(f'No documents found similar to "{query_term}"')
            return

        for q in range(len(query_term)):
            print(f'Top 3 similar documents to "{query_term[q]}":')
            # Access the nested arrays in 'results["ids"]' and 'results["distances"]'
            for i in range(min(3, len(results['ids'][q]))):
                doc_id = results['ids'][q][i]  # Get ID from 'ids' array
                score = results['distances'][q][i]  # Get score from 'distances' array
                # Retrieve text data from the results
                text = results['documents'][q][i]
                if not text:
                    print(f' - ID: {doc_id}, Text: "Text not available", Score: {score:.4f}')
                else:
                    print(f' - ID: {doc_id}, Text: "{text}", Score: {score:.4f}')

        perform_similarity_search(collection, all_items)
    except Exception as error:  # Catch any errors and log them to the console
        print(f"Error: {error}")

In [None]:
# Function to perform a similarity search in the collection
def perform_similarity_search(collection, all_items):
    try:
        # Place your similarity search code inside this block
        pass
    except Exception as error:
        print(f"Error in similarity search: {error}")

In [None]:
if __name__ == "__main__":
    main()

### Module 2 - Vector Databases for Recommendation Systems and RAG

#### Lecture 1 - Essential Database Operations in ChromaDB

Collections can be
- Created: `client.create_collection(name=...)`
- Retrieved: `client.get_collection(name=...)`
- Modified: `collection.modify(name=..., metadata=...)`. *Not everything can be modified*: the embedding model and distance metric are fixed once the collection exists, so changing them means creating a new collection and copying the data into it.

Documents can be added to a collection by calling `collection.add(...)`

```python
collection.add(
    documents=[
        "This is document 1",
        "This is document 2",
        # ... more documents
    ],
    metadatas=[
        {"topic": "X", "version": "0.1"},
        {"topic": "Y", "version": "1.2"},
        # ... one metadata dict per document
    ],
    ids=["id1", "id2"],  # ... one unique id per document
)
```

Documents can be retrieved using `collection.get()`, which returns all documents. The output looks like this:
```python
{
    "ids": ["id1", "id2"],
    "embeddings": None,
    "documents": ["...", "..."],
    "uris": None,
    "included": ["metadatas", "documents"],
    "data": None,
    "metadatas": [{"...": "..."}, {"...": "..."}]
}
```

Embeddings are not returned by default, but can be requested with
```python
collection.get(include=["embeddings"])
```


To retrieve a specific document, pass its id: `collection.get(ids=["id1"])`

To update existing data in a collection, use
```python
collection.update(
    ids=["id1"], 
    metadatas=[{"topic":"New Topic", "version":"2.0"}],
    documents=["Updated document"]
)
```

ChromaDB automatically recomputes the embedding in the background when a document changes.

Delete documents from a collection using ids, a `where` clause, or both.
```python
collection.delete(
    ids=["id1","id2"],
    where={"source":"llamaindex.ai"}
)
```

ChromaDB uses HNSW for its vector index. The distance metric is set with the `space` parameter, which accepts `l2`, `ip` or `cosine`.

```python
client.create_collection(
    name="my_collection",
    # metadata must be a dict (key: value), not a set
    metadata={"description": "My Demo Collection"},
    configuration={"hnsw": {"space": "cosine"}},
    embedding_function=ef
)
```

#### Lab - Similarity Search on Employee Records using Python and Chroma DB

In [None]:
!pip install chromadb==1.0.12
!pip install sentence-transformers==4.1.0

In [None]:
# Importing necessary modules from the chromadb package:
# chromadb is used to interact with the Chroma DB database,
# embedding_functions is used to define the embedding model
import chromadb
from chromadb.utils import embedding_functions

# Define the embedding function using SentenceTransformers
# This function will be used to generate embeddings (vector representations) for the data
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Creating an instance of ChromaClient to establish a connection with the Chroma database
client = chromadb.Client()

In [None]:
# Defining a name for the collection where data will be stored or accessed
# This collection is likely used to group related records, such as employee data
collection_name = "employee_collection"

In [None]:
# Function to perform various types of searches within the collection
def perform_advanced_search(collection, all_items):
    try:
        # Advanced search operations will be placed here
        print("=== Similarity Search Examples ===")
        
        # Example 1: Search for Python developers
        print("\n1. Searching for Python developers:")
        query_text = "Python developer with web development experience"
        results = collection.query(
            query_texts=[query_text],
            n_results=3
        )
        print(f"Query: '{query_text}'")
        for i, (doc_id, document, distance) in enumerate(zip(
            results['ids'][0], results['documents'][0], results['distances'][0]
        )):
            metadata = results['metadatas'][0][i]
            print(f"  {i+1}. {metadata['name']} ({doc_id}) - Distance: {distance:.4f}")
            print(f"     Role: {metadata['role']}, Department: {metadata['department']}")
            print(f"     Document: {document[:100]}...")
        
        # Example 2: Search for leadership roles
        print("\n2. Searching for leadership and management roles:")
        query_text = "team leader manager with experience"
        results = collection.query(
            query_texts=[query_text],
            n_results=3
        )
        print(f"Query: '{query_text}'")
        for i, (doc_id, document, distance) in enumerate(zip(
            results['ids'][0], results['documents'][0], results['distances'][0]
        )):
            metadata = results['metadatas'][0][i]
            print(f"  {i+1}. {metadata['name']} ({doc_id}) - Distance: {distance:.4f}")
            print(f"     Role: {metadata['role']}, Experience: {metadata['experience']} years")

        print("\n=== Metadata Filtering Examples ===")

        # Example 1: Filter by department
        print("\n3. Finding all Engineering employees:")
        results = collection.get(
            where={"department": "Engineering"}
        )
        print(f"Found {len(results['ids'])} Engineering employees:")
        for i, doc_id in enumerate(results['ids']):
            metadata = results['metadatas'][i]
            print(f"  - {metadata['name']}: {metadata['role']} ({metadata['experience']} years)")
        
        # Example 2: Filter by experience range
        print("\n4. Finding employees with 10+ years experience:")
        results = collection.get(
            where={"experience": {"$gte": 10}}
        )
        print(f"Found {len(results['ids'])} senior employees:")
        for i, doc_id in enumerate(results['ids']):
            metadata = results['metadatas'][i]
            print(f"  - {metadata['name']}: {metadata['role']} ({metadata['experience']} years)")
        
        # Example 3: Filter by location
        print("\n5. Finding employees in California:")
        results = collection.get(
            where={"location": {"$in": ["San Francisco", "Los Angeles"]}}
        )
        print(f"Found {len(results['ids'])} employees in California:")
        for i, doc_id in enumerate(results['ids']):
            metadata = results['metadatas'][i]
            print(f"  - {metadata['name']}: {metadata['location']}")

        print("\n=== Combined Search: Similarity + Metadata Filtering ===")

        # Example: Find experienced Python developers in specific locations
        print("\n6. Finding senior Python developers in major tech cities:")
        query_text = "senior Python developer full-stack"
        results = collection.query(
            query_texts=[query_text],
            n_results=5,
            where={
                "$and": [
                    {"experience": {"$gte": 8}},
                    {"location": {"$in": ["San Francisco", "New York", "Seattle"]}}
                ]
            }
        )
        print(f"Query: '{query_text}' with filters (8+ years, major tech cities)")
        print(f"Found {len(results['ids'][0])} matching employees:")
        
        for i, (doc_id, document, distance) in enumerate(zip(
            results['ids'][0], results['documents'][0], results['distances'][0]
        )):
            metadata = results['metadatas'][0][i]
            print(f"  {i+1}. {metadata['name']} ({doc_id}) - Distance: {distance:.4f}")
            print(f"     {metadata['role']} in {metadata['location']} ({metadata['experience']} years)")
            print(f"     Document snippet: {document[:80]}...")

        # Guard against an empty result set from the combined search
        if not results or not results['ids'] or len(results['ids'][0]) == 0:
            # Log a message if no similar documents are found for the query term
            print(f'No documents found similar to "{query_text}"')
            return

        # Summarize the top 3 matches from the combined search
        # Chroma returns distances, so lower values mean closer matches
        print(f'Top 3 similar documents to "{query_text}":')
        for i in range(min(3, len(results['ids'][0]))):
            # Extract the document ID, distance, and text for the current result
            doc_id = results['ids'][0][i]
            distance = results['distances'][0][i]
            text = results['documents'][0][i]
            # Fall back to a placeholder if the text is missing
            if not text:
                print(f' - ID: {doc_id}, Text: "Text not available", Distance: {distance:.4f}')
            else:
                print(f' - ID: {doc_id}, Text: "{text}", Distance: {distance:.4f}')
    except Exception as error:
        print(f"Error in advanced search: {error}")

In [None]:
# Defining a function named 'main'
# This function is used to encapsulate the main operations for creating collections,
# generating embeddings, and performing similarity search
def main():
    try:
        # Creating a collection using the ChromaClient instance
        # The 'create_collection' method creates a new collection with the specified configuration
        collection = client.create_collection(
            # Specifying the name of the collection to be created
            name=collection_name,
            # Adding metadata to describe the collection
            metadata={"description": "A collection for storing employee data"},
            # Configuring the collection with cosine distance and embedding function
            configuration={
                "hnsw": {"space": "cosine"},
                "embedding_function": ef
            }
        )
        print(f"Collection created: {collection.name}")

        # Defining a list of employee dictionaries
        # Each dictionary represents an individual employee with comprehensive information
        employees = [
            {
                "id": "employee_1",
                "name": "John Doe",
                "experience": 5,
                "department": "Engineering",
                "role": "Software Engineer",
                "skills": "Python, JavaScript, React, Node.js, databases",
                "location": "New York",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_2",
                "name": "Jane Smith",
                "experience": 8,
                "department": "Marketing",
                "role": "Marketing Manager",
                "skills": "Digital marketing, SEO, content strategy, analytics, social media",
                "location": "Los Angeles",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_3",
                "name": "Alice Johnson",
                "experience": 3,
                "department": "HR",
                "role": "HR Coordinator",
                "skills": "Recruitment, employee relations, HR policies, training programs",
                "location": "Chicago",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_4",
                "name": "Michael Brown",
                "experience": 12,
                "department": "Engineering",
                "role": "Senior Software Engineer",
                "skills": "Java, Spring Boot, microservices, cloud architecture, DevOps",
                "location": "San Francisco",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_5",
                "name": "Emily Wilson",
                "experience": 2,
                "department": "Marketing",
                "role": "Marketing Assistant",
                "skills": "Content creation, email marketing, market research, social media management",
                "location": "Austin",
                "employment_type": "Part-time"
            },
            {
                "id": "employee_6",
                "name": "David Lee",
                "experience": 15,
                "department": "Engineering",
                "role": "Engineering Manager",
                "skills": "Team leadership, project management, software architecture, mentoring",
                "location": "Seattle",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_7",
                "name": "Sarah Clark",
                "experience": 8,
                "department": "HR",
                "role": "HR Manager",
                "skills": "Performance management, compensation planning, policy development, conflict resolution",
                "location": "Boston",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_8",
                "name": "Chris Evans",
                "experience": 20,
                "department": "Engineering",
                "role": "Senior Architect",
                "skills": "System design, distributed systems, cloud platforms, technical strategy",
                "location": "New York",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_9",
                "name": "Jessica Taylor",
                "experience": 4,
                "department": "Marketing",
                "role": "Marketing Specialist",
                "skills": "Brand management, advertising campaigns, customer analytics, creative strategy",
                "location": "Miami",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_10",
                "name": "Alex Rodriguez",
                "experience": 18,
                "department": "Engineering",
                "role": "Lead Software Engineer",
                "skills": "Full-stack development, React, Python, machine learning, data science",
                "location": "Denver",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_11",
                "name": "Hannah White",
                "experience": 6,
                "department": "HR",
                "role": "HR Business Partner",
                "skills": "Strategic HR, organizational development, change management, employee engagement",
                "location": "Portland",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_12",
                "name": "Kevin Martinez",
                "experience": 10,
                "department": "Engineering",
                "role": "DevOps Engineer",
                "skills": "Docker, Kubernetes, AWS, CI/CD pipelines, infrastructure automation",
                "location": "Phoenix",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_13",
                "name": "Rachel Brown",
                "experience": 7,
                "department": "Marketing",
                "role": "Marketing Director",
                "skills": "Strategic marketing, team leadership, budget management, campaign optimization",
                "location": "Atlanta",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_14",
                "name": "Matthew Garcia",
                "experience": 3,
                "department": "Engineering",
                "role": "Junior Software Engineer",
                "skills": "JavaScript, HTML/CSS, basic backend development, learning frameworks",
                "location": "Dallas",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_15",
                "name": "Olivia Moore",
                "experience": 12,
                "department": "Engineering",
                "role": "Principal Engineer",
                "skills": "Technical leadership, system architecture, performance optimization, mentoring",
                "location": "San Francisco",
                "employment_type": "Full-time"
            },
        ]

        # Create comprehensive text documents for each employee
        # These documents will be used for similarity search based on skills, roles, and experience
        employee_documents = []
        for employee in employees:
            document = f"{employee['role']} with {employee['experience']} years of experience in {employee['department']}. "
            document += f"Skills: {employee['skills']}. Located in {employee['location']}. "
            document += f"Employment type: {employee['employment_type']}."
            employee_documents.append(document)

        # Adding data to the collection in the Chroma database
        # The 'add' method inserts new records into the specified collection
        # (use 'upsert' instead to insert or update records with existing IDs)
        collection.add(
            # Extracting employee IDs to be used as unique identifiers for each record
            ids=[employee["id"] for employee in employees],
            # Using the comprehensive text documents we created
            documents=employee_documents,
            # Adding comprehensive metadata for filtering and search
            metadatas=[{
                "name": employee["name"],
                "department": employee["department"],
                "role": employee["role"],
                "experience": employee["experience"],
                "location": employee["location"],
                "employment_type": employee["employment_type"]
            } for employee in employees]
        )

        # Retrieving all items from the specified collection
        # The 'get' method fetches all records stored in the collection
        all_items = collection.get()
        # Logging the retrieved items to the console for inspection or debugging
        print("Collection contents:")
        print(f"Number of documents: {len(all_items['documents'])}")

        # Call the perform_advanced_search function with the collection and all_items as arguments
        perform_advanced_search(collection, all_items)
        
        pass
    except Exception as error:
        # Catching and handling any errors that occur within the 'try' block
        # Logs the error message to the console for debugging purposes
        print(f"Error: {error}")

In [None]:
if __name__ == "__main__":
    main()

#### Lecture 2 - How Vector Databases power RAG 

The RAG Pipeline and role of vector databases

<img src="images/RAG_Pipeline.png" width=600/>

**Why Vector DB for RAG**

- **Reduces risk** of critical errors such as
    - using different embedding models for source documents and user prompts
    - incorrectly linking embeddings to their source documents
- **Simplifies and speeds up** development
    - less custom code to maintain
    - faster implementation and debugging
- **Optimizes Performance**
    - Vector DBs are built for fast semantic similarity searches
    - Custom built alternatives are often slower without significant optimization
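
To make the "custom built alternative" concrete, here is a minimal sketch of a brute-force similarity search in plain Python. The function names and the 3-dimensional toy vectors are assumptions for illustration only; a real vector DB replaces the linear scan with an approximate nearest-neighbor index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)

def brute_force_search(query_vec, store, k=2):
    """Score every stored vector against the query and return the top k.

    `store` maps a document ID to a (vector, text) pair, so each embedding
    stays linked to its source text.
    """
    scored = [
        (cosine_similarity(query_vec, vec), doc_id, text)
        for doc_id, (vec, text) in store.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

# Hypothetical toy "embeddings"; real models produce hundreds of dimensions.
toy_store = {
    "doc_1": ([1.0, 0.0, 0.0], "email policy"),
    "doc_2": ([0.9, 0.1, 0.0], "internet usage policy"),
    "doc_3": ([0.0, 0.0, 1.0], "parking rules"),
}

top_hits = brute_force_search([1.0, 0.0, 0.0], toy_store, k=2)
for score, doc_id, text in top_hits:
    print(f"{doc_id}: {text} (similarity {score:.3f})")
```

Every query touches every stored vector, which is why this approach slows down as the collection grows.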

**Pitfalls of RAG**

|Pitfall|Remedy|
|------|------|
|Using different embedding models for source docs and prompts|Use same embedding model everywhere (vector DBs do this by default)|
|Chunks are either too short or too long|Choose a chunk size long enough to keep the meaning clear, yet short enough that each chunk stays on one topic|
|Not re-embedding when data, metrics or embedding model changes|Re-embed whenever the data, metrics or embedding model change|
|Assuming retrieval is always accurate|Test retrieval results and adjust if needed|

RAG frameworks like LangChain and LlamaIndex handle tasks that vector DBs don't:
- chunking
- advanced retrieval logic
- prompt augmentation
- LLM integration

#### Lab - Food Recommendation system using Chroma DB

See Code/ibm-genai-course/food-chatbot

## Course 4 - Advanced RAG with Vector Databases and Retrievers

### Module 1 - Advanced Retrievers for RAG

#### Lecture 1 - Explore Advanced Retrievers in LangChain

**What is a LangChain Retriever**
- An interface that returns documents based on an unstructured query
- More general than a vector store
- Retrieves documents or their chunks
- Accepts a string query as input and returns a list of documents or chunks as output

**Vector Store Based Retriever**
- It plugs into an existing vector store: it embeds the query and compares it with the embedded chunks using similarity search or **maximal marginal relevance (MMR)** to retrieve the most relevant chunks.
- It is simple because it reuses an existing vector DB and does not require an LLM to retrieve the most similar chunks

**MMR**
- MMR is a technique to improve both relevance and diversity of retrieved results
    - MMR selects documents that are most relevant to query and also minimally similar to previous documents
    - This avoids redundancy and ensures comprehensive coverage for the query
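
The selection rule can be sketched in plain Python. This is a hand-written illustration of the usual MMR formula (relevance to the query minus similarity to already selected documents), not LangChain's implementation; the name `lambda_mult` and the toy vectors are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy.

    lambda_mult=1.0 ranks purely by relevance; lower values favor diversity.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(idx):
            relevance = cosine(query_vec, doc_vecs[idx])
            # Similarity to the closest document that was already picked
            redundancy = max(
                (cosine(doc_vecs[idx], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0, 0.0]
docs_vecs = [
    [1.0, 0.10, 0.0],  # most relevant
    [1.0, 0.12, 0.0],  # near-duplicate of the first document
    [1.0, 0.00, 0.5],  # relevant, but covers a different direction
]

print(mmr_select(query, docs_vecs, k=2, lambda_mult=0.5))  # diversity-aware
print(mmr_select(query, docs_vecs, k=2, lambda_mult=1.0))  # relevance only
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the third document, while `lambda_mult=1.0` returns the two most similar documents regardless of overlap.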
```python
retriever = vectordb.as_retriever(search_type="mmr")
docs = retriever.invoke("email policy")
```

**Multi-Query Retriever**
- Uses an LLM to generate different versions of the same query
- This overcomes differences in results caused by changes in query wording or by poor embeddings
- Takes a union across all documents returned by all query variants to build a larger set
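
The union step can be sketched without any LLM. In the sketch below, `fake_generate_variants` and `fake_retrieve` are hypothetical stand-ins for the LLM rewrite and the vector store lookup; only the merge logic is the point.

```python
def multi_query_retrieve(query, generate_variants, retrieve):
    """Run every query variant and return the de-duplicated union of results."""
    seen = set()
    union = []
    for variant in generate_variants(query):
        for doc_id in retrieve(variant):
            if doc_id not in seen:  # keep the first occurrence only
                seen.add(doc_id)
                union.append(doc_id)
    return union

def fake_generate_variants(query):
    # Hypothetical stand-in for the LLM call that rewrites the query
    return [query, "rules for sending email", "email usage guidelines"]

# Hypothetical lookup table standing in for a vector store
FAKE_INDEX = {
    "email policy": ["doc_a", "doc_b"],
    "rules for sending email": ["doc_b", "doc_c"],
    "email usage guidelines": ["doc_a", "doc_d"],
}

def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])

merged = multi_query_retrieve("email policy", fake_generate_variants, fake_retrieve)
print(merged)
```

Each variant alone returns two documents; the union covers four, which is the larger candidate set the bullet above describes.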

*Example*
```python
from ibm_watsonx_ai.foundation_models import ModelInference
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models.extensions.langchain import WatsonxLLM
from langchain.retrievers.multi_query import MultiQueryRetriever

def llm():
    # Model used to generate the query variants
    model_id = "mistralai/mixtral-8x7b-instruct-v01"
    params = {
        GenParams.MAX_NEW_TOKENS: 256,
        GenParams.TEMPERATURE: 0.5
    }
    # Credentials are passed as a dict holding the service URL
    credentials = {"url": "https://us-south.ml.cloud.ibm.com"}
    project_id = "skills-network"
    model = ModelInference(
        model_id=model_id,
        params=params,
        credentials=credentials,
        project_id=project_id
    )
    mixtral_llm = WatsonxLLM(model=model)
    return mixtral_llm

# The multi-query retriever wraps a base retriever and an LLM
# (vectordb is the vector store built earlier)
retriever = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(),
    llm=llm()
)

docs = retriever.invoke("email policy")
```

**Self-Query Retriever**
- Documents have text as well as metadata.
- None of the retrievers discussed so far have the ability to access the metadata
- Converts every query into
    1. A string to look up semantically and
    2. Metadata filter to go along with it
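
A rough sketch of that split, in plain Python. The structured query is written by hand here, whereas the real retriever asks an LLM to produce it; `StructuredQuery`, `matches`, and the operator names are hypothetical.

```python
import operator
from dataclasses import dataclass, field

# Comparison operators a metadata filter may use (illustrative subset)
OPERATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}

@dataclass
class StructuredQuery:
    query: str                                   # text to search semantically
    filters: list = field(default_factory=list)  # (field_name, op, value) tuples

def matches(metadata, filters):
    """True if the metadata satisfies every filter; missing fields never match."""
    for name, op, value in filters:
        if name not in metadata or not OPERATORS[op](metadata[name], value):
            return False
    return True

movies = [
    {"text": "Scientists bring back dinosaurs", "metadata": {"year": 1993, "rating": 7.7}},
    {"text": "A detective gets lost in dreams", "metadata": {"year": 2006, "rating": 8.6}},
    {"text": "Toys come alive", "metadata": {"year": 1995}},
    {"text": "Three men walk into the Zone", "metadata": {"year": 1979, "rating": 9.9}},
]

# "I want to watch a movie rated higher than 8.5" carries no semantic part,
# so the query string is empty and only the filter applies.
structured = StructuredQuery(query="", filters=[("rating", "gt", 8.5)])
candidates = [m for m in movies if matches(m["metadata"], structured.filters)]
print([m["text"] for m in candidates])
```

In a full pipeline the filtered candidates would then be ranked by embedding similarity against `structured.query`.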

**Self-Query retriever setup**
```python
from langchain_core.documents import Document
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.vectorstores import Chroma
from lark import lark

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"}
    ),
    #... and many more documents ...
]

# watsonx_embedding() is the embedding helper defined in Lab 1 below
vectordb = Chroma.from_documents(docs, watsonx_embedding())

# Describe each metadata field so the LLM knows what it can filter on
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama']",
        type="string"
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer"
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string"
    ),
    AttributeInfo(
        name="rating",
        description="A 1-10 rating of the movie",
        type="float"
    )
]

document_content_description = "Brief summary of a movie"

retriever = SelfQueryRetriever.from_llm(
    llm(),
    vectordb,
    document_content_description,
    metadata_field_info
)

retriever.invoke("I want to watch a movie rated higher than 8.5")
```

**Parent Document Retriever**
- Splitting documents involves a trade-off: small chunks give more accurate embeddings, while long chunks preserve context
- Parent Document Retriever fetches small chunks, looks up their parent IDs and then returns the larger parent documents for those chunks
- In the setup below, after `invoke` the retriever returns the large chunks matched to the smaller chunks
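
The child-to-parent lookup can be sketched in plain Python. Keyword containment stands in for the embedding similarity search, and the helper names and sample text are hypothetical; the point is only the mapping from a small matched chunk back to its larger parent.

```python
def build_parent_child_index(text):
    """Split text into parent sections and child lines, linking each child to its parent."""
    parent_store = {}  # parent_id -> full parent text (plays the role of the docstore)
    child_index = []   # (child_text, parent_id) pairs (plays the role of the vector store)
    for parent_id, parent in enumerate(text.split("\n\n")):
        parent_store[parent_id] = parent
        for child in parent.split("\n"):
            child_index.append((child, parent_id))
    return parent_store, child_index

def retrieve_parents(query, parent_store, child_index):
    """Match the query against small chunks, then return their parents once each."""
    matched_ids = []
    for child, parent_id in child_index:
        # Stand-in for similarity search: simple case-insensitive containment
        if query.lower() in child.lower() and parent_id not in matched_ids:
            matched_ids.append(parent_id)
    return [parent_store[parent_id] for parent_id in matched_ids]

policy_text = (
    "Smoking policy\n"
    "Smoking is permitted only in designated outdoor areas.\n"
    "Violations are reported to HR."
    "\n\n"
    "Email policy\n"
    "Company email is for business use.\n"
    "Do not share passwords by email."
)

parent_store, child_index = build_parent_child_index(policy_text)
results = retrieve_parents("designated outdoor", parent_store, child_index)
print(results[0])
```

The query matches a single line, yet the caller receives the whole smoking section, including the sentence about violations that the small chunk alone would have dropped.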

**Parent Document retriever setup**
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain_text_splitters import CharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

# Set up two splitters: one with a big chunk size (parent) and one with a small chunk size (child)
parent_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=20, separator="\n")
child_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=20, separator="\n")

# The vector store holds the small (child) chunks
# watsonx_embedding() is the embedding helper defined in Lab 1 below
vectordb = Chroma(
    collection_name="split_parents",
    embedding_function=watsonx_embedding()
)

# The storage layer for the parent documents
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectordb,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)
# data is the list of loaded documents (for example, the output of a document loader)
retriever.add_documents(data)
retriever.invoke("smoking policy")
```

#### Lab 1 - Build a Smarter Search with LangChain Context Retrieval

In [None]:
import sys
!{sys.executable} -m pip install "ibm-watsonx-ai==1.1.2" | tail -n 1
!{sys.executable} -m pip install "langchain==0.2.1" | tail -n 1
!{sys.executable} -m pip install "langchain-ibm==0.1.11" | tail -n 1
!{sys.executable} -m pip install "langchain-community==0.2.1" | tail -n 1
!{sys.executable} -m pip install "chromadb==0.4.24" | tail -n 1
!{sys.executable} -m pip install "pypdf==4.3.1" | tail -n 1
!{sys.executable} -m pip install "lark==1.1.9" | tail -n 1
!{sys.executable} -m pip install 'posthog<6.0.0' | tail -n 1

In [None]:
# You can use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

In [None]:
from ibm_watsonx_ai.foundation_models import ModelInference
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models.extensions.langchain import WatsonxLLM

In [None]:
def llm():
    model_id = 'mistralai/mistral-small-3-1-24b-instruct-2503'
    
    parameters = {
        GenParams.MAX_NEW_TOKENS: 256,  # this controls the maximum number of tokens in the generated output
        GenParams.TEMPERATURE: 0.5, # this controls the randomness or creativity of the model's responses
    }
    
    credentials = {
        "url": "https://us-south.ml.cloud.ibm.com"
    }
    
    
    project_id = "skills-network"
    
    model = ModelInference(
        model_id=model_id,
        params=parameters,
        credentials=credentials,
        project_id=project_id
    )
    
    mistral_llm = WatsonxLLM(model = model)
    return mistral_llm

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
def text_splitter(data, chunk_size, chunk_overlap):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    chunks = text_splitter.split_documents(data)
    return chunks

In [None]:
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames
from langchain_ibm import WatsonxEmbeddings

In [None]:
def watsonx_embedding():
    embed_params = {
        EmbedTextParamsMetaNames.TRUNCATE_INPUT_TOKENS: 3,
        EmbedTextParamsMetaNames.RETURN_OPTIONS: {"input_text": True},
    }
    
    watsonx_embedding = WatsonxEmbeddings(
        model_id="ibm/slate-125m-english-rtrvr-v2",
        url="https://us-south.ml.cloud.ibm.com",
        project_id="skills-network",
        params=embed_params,
    )
    return watsonx_embedding

In [None]:
!wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/MZ9z1lm-Ui3YBp3SYWLTAQ/companypolicies.txt"

In [None]:
from langchain_community.document_loaders import TextLoader

In [None]:
loader = TextLoader("companypolicies.txt")
txt_data = loader.load()

In [None]:
chunks_txt = text_splitter(txt_data, 200, 20)

In [None]:
from langchain.vectorstores import Chroma

In [None]:
# this will fail right now because Watson API key is needed outside of the Skills Network Notebooks
vectordb = Chroma.from_documents(chunks_txt, watsonx_embedding())

In [None]:
query = "email policy"
retriever = vectordb.as_retriever()

In [None]:
#You can also specify search kwargs like k to limit the retrieval results
retriever = vectordb.as_retriever(search_kwargs={"k": 1})

In [None]:
# The following code shows how to conduct an MMR search in a vector database.
# You just need to specify search_type="mmr".
retriever = vectordb.as_retriever(search_type="mmr")

In [None]:
# You can also set a retrieval method that defines a similarity score threshold, 
# returning only documents with a score above that threshold.
retriever = vectordb.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.4}
)

In [None]:
docs = retriever.invoke(query)

**Multi-Query Retriever**

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/ioch1wsxkfqgfLLgmd-6Rw/langchain-paper.pdf")
pdf_data = loader.load()

In [None]:
# Split the document and store the embeddings into a vector database.

# Split
chunks_pdf = text_splitter(pdf_data, 500, 20)

# VectorDB
ids = vectordb.get()["ids"]
vectordb.delete(ids) # We need to delete existing embeddings from previous documents and then store current document embeddings in.
vectordb = Chroma.from_documents(documents=chunks_pdf, embedding=watsonx_embedding())

In [None]:
# The MultiQueryRetriever function from LangChain is used.

from langchain.retrievers.multi_query import MultiQueryRetriever

query = "What does the paper say about langchain?"

retriever = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm()
)

In [None]:
# Set logging for the queries.

import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [None]:
docs = retriever.invoke(query)

**Self-Querying Retriever**

A Self-Querying Retriever, as the name suggests, has the ability to query itself. Specifically, given a natural language query, the retriever uses a query-constructing LLM chain to generate a structured query. It then applies this structured query to its underlying vector store. This enables the retriever to not only use the user-input query for semantic similarity comparison with the contents of stored documents but also to extract and apply filters based on the metadata of those documents.
The following code demonstrates how to use a Self-Querying Retriever.

In [None]:
from langchain_core.documents import Document
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from lark import lark

In [None]:
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]

In [None]:
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]

In [None]:
vectordb = Chroma.from_documents(docs, watsonx_embedding())

In [None]:
document_content_description = "Brief summary of a movie."

retriever = SelfQueryRetriever.from_llm(
    llm(),
    vectordb,
    document_content_description,
    metadata_field_info,
)

In [None]:
# This example only specifies a filter
retriever.invoke("I want to watch a movie rated higher than 8.5")

In [None]:
# This example specifies a query and a filter
retriever.invoke("Has Greta Gerwig directed any movies about women")

In [None]:
# This example specifies a composite filter
retriever.invoke("What's a highly rated (above 8.5) science fiction film?")

**Parent Document Retriever**

When splitting documents for retrieval, there are often conflicting desires:

1. You may want to have small documents so that their embeddings can most accurately reflect their meaning. If the documents are too long, the embeddings can lose meaning.
2. You want to have long enough documents so that the context of each chunk is retained.

The ParentDocumentRetriever strikes that balance by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks but then looks up the parent IDs for those chunks and returns those larger documents.

In [None]:
from langchain.retrievers import ParentDocumentRetriever
from langchain_text_splitters import CharacterTextSplitter
from langchain.storage import InMemoryStore

In [None]:
# Set two splitters. One is with big chunk size (parent) and one is with small chunk size (child)
parent_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=20, separator='\n')
child_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20, separator='\n')

In [None]:
vectordb = Chroma(
    collection_name="split_parents", embedding_function=watsonx_embedding()
)
#vectordb = Chroma.from_documents(documents=chunks_pdf, embedding=watsonx_embedding())
# The storage layer for the parent documents
store = InMemoryStore()

In [None]:
retriever = ParentDocumentRetriever(
    vectorstore=vectordb,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [None]:
retriever.add_documents(txt_data)

In [None]:
# This is the number of large (parent) chunks:
len(list(store.yield_keys()))

In [None]:
# Let's make sure the underlying vector store still retrieves the small chunks.
sub_docs = vectordb.similarity_search("smoking policy")
print(sub_docs[0].page_content)
retrieved_docs = retriever.invoke("smoking policy")
print(retrieved_docs[0].page_content)

In [None]:
# Exercise 1: Retrieve the top two results for the company policy document for the query "smoking policy" 
# using the Vector Store-Backed Retriever.
vectordb = Chroma.from_documents(documents=chunks_txt, embedding=watsonx_embedding())
retriever = vectordb.as_retriever(search_kwargs={"k": 2})
query = "smoking policy"
docs = retriever.invoke(query)
docs

In [None]:
# Exercise 2: Use the Self-Querying Retriever to invoke a query with a filter.

# You might encounter errors or blank content when you run the following code.
# This happens because the LLM may fail to produce a valid structured query at first. Re-running the cell a few times usually resolves it.

vectordb = Chroma.from_documents(docs, watsonx_embedding())

retriever = SelfQueryRetriever.from_llm(
    llm(),
    vectordb,
    document_content_description,
    metadata_field_info,
)

# This example specifies a query with filter
retriever.invoke(
    "I want to watch a movie directed by Christopher Nolan"
)

#### Lecture 2 - Advanced Retrievers in LlamaIndex

Advanced retrievers in LlamaIndex are sophisticated components that go beyond simple vector similarity search to provide more nuanced, context-aware, and intelligent information retrieval. They combine multiple techniques such as:

- **Semantic Understanding**: Using embeddings to understand meaning and context
- **Keyword Matching**: Precise term-based search for exact specifications
- **Hierarchical Context**: Maintaining relationships between different levels of information
- **Multi-Query Processing**: Generating and combining results from multiple query variations
- **Fusion Techniques**: Intelligently combining results from different retrieval methods

Why are Advanced Retrievers Important?

- **Improved Accuracy**: Advanced retrievers can find more relevant information by using multiple search strategies
- **Better Context Preservation**: They maintain important relationships between pieces of information
- **Reduced Hallucination**: More precise retrieval leads to more accurate AI responses
- **Scalability**: Efficient retrieval strategies work better with large document collections
- **Flexibility**: Different retrieval methods can be combined for optimal results


##### Types of Indexes

**VectorStoreIndex**

Enables semantic search based on meaning
- Stores embeddings for each document chunk
- Best for semantic retrieval
- Common in LLM pipelines


**DocumentSummaryIndex**

Uses generated summaries to identify relevant documents
- Generates and stores summaries of documents
- Filters documents before full retrieval
- Useful for large, diverse document sets (that cannot fit in LLM context window)


**KeywordTableIndex**

Exact keyword matching for rule-based or hybrid search
- Extracts keywords from documents
- Maps keywords to specific chunks of content
- Enables exact keyword matching
- Useful for hybrid or rule-based search

##### Core and Advanced Retrievers

###### **Vector Index Retriever**

- uses embeddings to find relevant content
- ideal for general purpose search
- common in RAG

The Vector Index Retriever uses vector embeddings to find semantically related content, making it ideal for general-purpose search and widely used in retrieval-augmented generation (RAG) pipelines.

**How it works**:

- Documents are split into nodes and embedded using the configured embedding model
- Query is converted to an embedding vector
- Returns nodes ranked by cosine similarity to the query embedding
- Generates embeddings in batches of 2048 nodes by default
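
The ranking step described above can be sketched in plain Python. This is a minimal illustration, not LlamaIndex's implementation: the three-dimensional vectors and the node names (`ml_intro`, `deep_learning`, `computer_vision`) are made up for the example, whereas a real embedding model would produce vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, node_vecs, k=2):
    """Return (node_id, score) pairs for the k nodes most similar to the query."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy embeddings (assumption: hand-picked values for illustration).
node_vecs = {
    "ml_intro": [0.9, 0.1, 0.0],
    "deep_learning": [0.7, 0.6, 0.1],
    "computer_vision": [0.1, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

ranked = top_k(query_vec, node_vecs, k=2)
for node_id, score in ranked:
    print(f"{node_id}: {score:.3f}")
```

Here `k` plays the same role as `similarity_top_k` in the retriever: it only limits how many of the ranked nodes are returned.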

**When to use**:

- General-purpose semantic search (most common use case)
- Finding conceptually related content based on meaning rather than exact keywords
- RAG pipelines where semantic understanding is crucial
- When exact keyword matching isn't the primary requirement

**Key characteristics from authoritative source**:

- Stores embeddings for each document chunk (VectorStoreIndex foundation)
- Best for semantic retrieval based on meaning and context
- Commonly used in LLM pipelines for retrieval-augmented generation

**Strengths**:

- Excellent semantic understanding and context awareness
- Handles synonyms and related concepts effectively
- Works well with natural language queries

**Limitations**:

- May miss exact keyword matches when specific terms are crucial
- Requires a good embedding model for optimal performance
- Can be computationally intensive for large document collections

>**TF-IDF**: **Term Frequency** (how often a word appears in a document) × **Inverse Document Frequency** (how rare that word is across all documents).
>It highlights words that are frequent in one document but rare across all documents.

###### **BM25 Retriever**

- Keyword-based retrieval for ranking documents
- Retrieves content based on exact keyword matches (not semantic similarity)
- Improves on TF-IDF by
    - Adding term frequency saturation
    - Normalizing for document length

BM25 is a keyword-based retrieval method that improves on TF-IDF by addressing some of its key limitations. It's widely used in production search systems including Elasticsearch and Apache Lucene.

>**Understanding TF-IDF: The Foundation**
>
>Before diving into BM25, let's understand TF-IDF (Term Frequency-Inverse Document Frequency), which BM25 builds upon:
>
>Term Frequency (TF): Measures how often a word appears in a document
>
>Example: If "neural" appears 3 times in a 100-word document, TF = 3/100 = 0.03
>
>Inverse Document Frequency (IDF): Measures how rare a word is across all documents
>
>Example: If "neural" appears in only 2 out of 1000 documents, IDF = ln(1000/2) ≈ 6.21 (natural logarithm)
>
>*Common words like "the" have low IDF; rare technical terms have high IDF*
>
>TF-IDF Score: TF × IDF
>
>Highlights words that are frequent in one document but rare across the collection
>
>The IDF component was introduced by Karen Spärck Jones, who pioneered the concept of term specificity
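
The arithmetic in the two examples above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the natural logarithm (which is what yields 6.21) and the plain, unsmoothed definitions of TF and IDF; the function names are illustrative only.

```python
import math

def term_frequency(term_count, doc_length):
    """Share of the document's words that are the term."""
    return term_count / doc_length

def inverse_document_frequency(total_docs, docs_with_term):
    """Natural log of (collection size / number of documents containing the term)."""
    return math.log(total_docs / docs_with_term)

tf = term_frequency(3, 100)                 # "neural" appears 3 times in 100 words
idf = inverse_document_frequency(1000, 2)   # "neural" appears in 2 of 1000 documents
tf_idf = tf * idf

print(f"TF = {tf:.2f}, IDF = {idf:.2f}, TF-IDF = {tf_idf:.3f}")
# prints: TF = 0.03, IDF = 6.21, TF-IDF = 0.186

# A word that appears in every document carries no signal under this definition.
idf_common = inverse_document_frequency(1000, 1000)
```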

**How BM25 Improves Upon TF-IDF**

Key BM25 Improvements:

**Term Frequency Saturation**: BM25 reduces the impact of repeated terms using term frequency saturation

Problem: In TF-IDF, if a word appears 100 times vs 10 times, the score increases linearly

Solution: BM25 uses a saturation function that plateaus after a certain frequency

**Document Length Normalization**: BM25 adjusts for document length, making it more effective for keyword-based search

Problem: In TF-IDF, longer documents have unfair advantages

Solution: BM25 normalizes scores based on document length relative to average

**Tunable Parameters**: Allow fine-tuning for different types of content

- `k1` ≈ 1.2: Controls term frequency saturation (how quickly scores plateau)
- `b` ≈ 0.75: Controls document length normalization (0 = none, 1 = full)
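
To make saturation and length normalization concrete, here is a small sketch of the per-term score in the common textbook (Okapi) form of BM25. It is an illustration under stated assumptions, not the exact formula used by any particular library: the IDF value is passed in as a plain number, and `bm25_term_score` is a hypothetical helper name.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf=1.0, k1=1.2, b=0.75):
    """Contribution of one query term to a document's BM25 score."""
    # Length normalization: 1.0 for an average-length document,
    # larger for longer documents, smaller for shorter ones.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # Saturation: the fraction approaches (k1 + 1) as tf grows, never exceeding it.
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Term frequency saturation: 10x more occurrences gives far less than 10x the score.
score_tf10 = bm25_term_score(tf=10, doc_len=100, avg_doc_len=100)
score_tf100 = bm25_term_score(tf=100, doc_len=100, avg_doc_len=100)
print(f"tf=10 -> {score_tf10:.3f}, tf=100 -> {score_tf100:.3f}")

# Document length normalization: same term count, but the longer document scores lower.
score_short = bm25_term_score(tf=3, doc_len=100, avg_doc_len=100)
score_long = bm25_term_score(tf=3, doc_len=400, avg_doc_len=100)
print(f"short doc -> {score_short:.3f}, long doc -> {score_long:.3f}")

# With b=0 the document length has no effect at all.
no_norm_short = bm25_term_score(tf=3, doc_len=100, avg_doc_len=100, b=0.0)
no_norm_long = bm25_term_score(tf=3, doc_len=400, avg_doc_len=100, b=0.0)
```

Raising `k1` delays the plateau, so repeated terms keep adding to the score for longer; lowering `b` towards 0 weakens the penalty on long documents.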

**When to Use BM25**

Ideal for:

- Technical documentation where exact terms matter
- Legal documents with specific terminology
- Product catalogs with precise specifications
- Academic papers with specialized vocabulary
- Applications requiring keyword-based retrieval rather than semantic similarity

**Advantages**:

- Excellent precision for exact term matches
- Fast computational performance
- Proven effectiveness in production systems
- No training required (unlike neural approaches)
- Interpretable scoring mechanism

**Limitations**:

- No semantic understanding (doesn't handle synonyms)
- Struggles with typos and variations
- Limited context understanding
- Requires careful parameter tuning for optimal performance

###### **Document Summary Index Retriever**

- Uses summaries to filter documents
- Two versions
    - LLM based (time-consuming, expensive)
    - Embedding based (uses semantic similarity, efficient for large collections)
- Returns original documents, not their summaries

Document Summary Index Retrievers use document summaries instead of the actual documents to find relevant content, making them efficient for large collections. They return the original documents, not their summaries.

**How it works (from authoritative source)**:

- Generates and stores summaries of documents at indexing time
- Uses summaries to filter documents before retrieving full content
- Two-stage Process: First uses summaries to filter documents, then returns full document content
- Especially useful for large, diverse corpora that cannot fit in the context window of an LLM
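
The two-stage process can be sketched without any library. In this toy version a simple word-overlap score stands in for the LLM or embedding comparison (an assumption made purely so the example runs on its own), and the document IDs and texts are invented. The point it illustrates is that ranking happens on the summaries while the original documents are what gets returned.

```python
def overlap_score(query, text):
    """Toy relevance: fraction of query words that also appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)

def retrieve_via_summaries(query, summaries, documents, top_k=1):
    """Stage 1: rank documents by their summaries. Stage 2: return the full documents."""
    ranked_ids = sorted(
        summaries,
        key=lambda doc_id: overlap_score(query, summaries[doc_id]),
        reverse=True,
    )
    return [documents[doc_id] for doc_id in ranked_ids[:top_k]]

# Hypothetical corpus: full documents plus summaries created at indexing time.
documents = {
    "hr_policy": "Full HR policy text. Smoking is only permitted in designated outdoor areas.",
    "ml_guide": "Full machine learning guide. Supervised learning uses labeled training data.",
}
summaries = {
    "hr_policy": "company policy on smoking leave and workplace conduct",
    "ml_guide": "introduction to machine learning and neural networks",
}

results = retrieve_via_summaries("smoking policy", summaries, documents, top_k=1)
print(results[0])
```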

**Two Retrieval Options**:

**DocumentSummaryIndexLLMRetriever**:

- Uses a large language model to analyze the query against document summaries
- Provides intelligent document selection but can be more time-consuming and expensive
- Best for complex queries requiring nuanced understanding

**DocumentSummaryIndexEmbeddingRetriever**:

- Uses semantic similarity between the query and summary embeddings
- Faster and more cost-effective than LLM-based approach
- Good for straightforward similarity matching

**When to use (based on authoritative guidance)**:

- Large document collections where documents cover different topics
- When you need efficient document-level filtering before detailed retrieval
- Multi-document QA where documents have distinct subject matters
- Large and diverse document sets that cannot fit in the context window of an LLM

**Configuration Parameters**:

- `choice_top_k` (LLM retriever): Number of documents to select
- `similarity_top_k` (Embedding retriever): Number of documents to select
- The default is 1; increase it to retrieve multiple documents

**Key Point**: Returns original documents, not their summaries - the summaries are only used for filtering

**Strengths**:

- Efficient document selection and reduces search space
- Good for heterogeneous collections with diverse topics
- Returns original documents with full context intact

**Limitations**:

- Requires LLM for summary generation during indexing
- May lose some detail present in original documents during summary creation
- LLM-based version can be slower and more expensive than other options

###### **Auto Merging Retriever**

- preserves context in long docs using a hierarchical structure
- uses hierarchical chunking to break document into parent and child nodes
- Retrieves parent node if enough child nodes match
- Consolidates related content and preserves broader context

Auto Merging Retriever is designed to preserve context in long documents using a hierarchical structure. **It uses hierarchical chunking to break documents into parent and child nodes, and if enough child nodes from the same parent are retrieved, the retriever returns the parent node instead.**

**How it works (from authoritative source)**:
- **Uses hierarchical chunking** to break documents into parent and child nodes
- **Retrieves parent if enough children match** - intelligent merging logic
- **Preserves context in long documents** by consolidating related content
- **Dual Storage**: Smaller child chunks are indexed in the vector store for precise matching, while larger parent chunks are stored in the docstore

**Key behavior pattern**:
- Child chunks enable precise matching for specific queries
- When multiple child chunks from the same parent are retrieved, the system returns the parent chunk
- This **helps consolidate related content and preserve broader context**
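
The merging rule can be illustrated with a short sketch. It assumes a single parent level and a hypothetical threshold of 0.5 (the parent replaces its children when more than half of them were retrieved); the node IDs and the `auto_merge` helper are invented for the example and do not mirror LlamaIndex internals.

```python
from collections import defaultdict

def auto_merge(retrieved_children, parent_of, children_of, ratio_threshold=0.5):
    """Replace child IDs with their parent ID when enough siblings were retrieved."""
    hits_per_parent = defaultdict(list)
    for child_id in retrieved_children:
        hits_per_parent[parent_of[child_id]].append(child_id)

    merged = []
    for parent_id, hits in hits_per_parent.items():
        ratio = len(hits) / len(children_of[parent_id])
        if ratio > ratio_threshold:
            merged.append(parent_id)   # enough children matched: return the parent
        else:
            merged.extend(hits)        # too few matched: keep the individual children
    return merged

# Hypothetical hierarchy: two parent sections with four child chunks each.
children_of = {
    "section_1": ["s1_c1", "s1_c2", "s1_c3", "s1_c4"],
    "section_2": ["s2_c1", "s2_c2", "s2_c3", "s2_c4"],
}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}

retrieved = ["s1_c1", "s1_c2", "s1_c3", "s2_c4"]
result = auto_merge(retrieved, parent_of, children_of)
print(result)
```

Three of the four children of `section_1` were retrieved, so the whole section is returned, while the single hit in `section_2` stays a child chunk.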

**When to use (based on authoritative guidance):**
- Long documents where small chunks lose important surrounding context
- Legal documents, research papers, technical specifications that need context preservation
- When you need both precise matching and comprehensive context
- Documents with natural hierarchical structure (sections, subsections)

**Configuration:**
- `chunk_sizes`: List of chunk sizes from largest to smallest (e.g., [512, 256, 128])
- `chunk_overlap`: Overlap between chunks to maintain continuity
- Storage context manages both vector store (child nodes) and docstore (parent nodes)

**Strengths**: 
- Automatically preserves context without manual intervention
- Reduces information fragmentation in long documents
- Intelligent merging based on retrieval patterns
- Maintains granular search capability while providing broader context

**Limitations**: 
- More complex setup compared to basic retrievers
- Requires hierarchical document structure to be effective
- Higher storage overhead due to multiple chunk levels
- May not be suitable for very short documents

*Based on: https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_merging_retriever/*

###### **Recursive Retriever**

- Follows node relationships using references e.g.
    - citations
    - metadata links
- Supports chunk and metadata linking references
- Retrieves related content across documents or abstraction layers

The Recursive Retriever is **designed to follow relationships between nodes using references**. **It can follow references from one node to another, such as citations in academic papers or other metadata links**, allowing it to **retrieve related content across documents or layers of abstraction**.

**How it works (from authoritative source)**:
- **Follows node references** - traverses relationships to find referenced content
- **Supports chunk and metadata linking** - handles different types of references
- **Multi-Level Navigation**: Can execute sub-queries on referenced retrievers or query engines
- **Network Building**: Creates a network of interconnected retrievers that can reference each other

**Reference Types Supported**:
1. **Chunk References**: Smaller child chunks refer to larger parent chunks for additional context
2. **Metadata References**: Summaries or generated questions refer to larger content chunks, such as citations in academic papers
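
Following references amounts to a graph traversal. The sketch below is a simplified, library-free illustration: each node is a dictionary with a hypothetical `refs` list, and `follow_references` is an invented helper. A visited set and a depth limit keep circular citations and long chains under control.

```python
def follow_references(start_id, nodes, max_depth=3):
    """Collect the text of a node and of everything it references, depth-first."""
    collected = []
    visited = set()

    def visit(node_id, depth):
        if node_id in visited or depth > max_depth:
            return
        visited.add(node_id)
        node = nodes[node_id]
        collected.append(node["text"])
        for ref_id in node.get("refs", []):
            visit(ref_id, depth + 1)

    visit(start_id, 0)
    return collected

# Hypothetical citation graph; note that paper_b cites paper_a back (a cycle).
nodes = {
    "paper_a": {"text": "Paper A builds on the method of Paper B.", "refs": ["paper_b"]},
    "paper_b": {"text": "Paper B introduces the method and cites Paper C.", "refs": ["paper_c", "paper_a"]},
    "paper_c": {"text": "Paper C provides the dataset.", "refs": []},
}

texts = follow_references("paper_a", nodes)
for text in texts:
    print(text)
```

Lowering `max_depth` is one way to address the limitation that deep reference chains can pull in too much related content.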

**When to use (based on authoritative guidance):**
- **Academic papers with citations** and extensive references
- **Research papers** where you need to retrieve relevant content from cited papers
- Documentation with cross-references and linked content
- Knowledge bases with interconnected information
- When nodes reference structured data (tables, databases, other documents)

**Configuration:**
- `retriever_dict`: Maps node IDs or keys to specific retrievers
- `query_engine_dict`: Maps keys to query engines for sub-queries
- Node metadata can contain references to other nodes or data structures

**Key capability**: **Retrieves related content across documents** by following reference chains

**Strengths**: 
- Follows complex relationships and enables multi-step reasoning
- Provides comprehensive coverage across related documents
- Excellent for handling interconnected information systems
- Can traverse multiple levels of references automatically

**Limitations**: 
- Requires careful setup of node relationships
- Can be computationally expensive for deep reference chains
- Complex debugging when reference chains are extensive
- May retrieve too much related content if not properly configured

*Based on: https://docs.llamaindex.ai/en/stable/examples/retrievers/recurisve_retriever_nodes_braintrust/*

###### **Query Fusion Retriever**

- Combines results from multiple different retrievers
- Supports multiple query variations
- Use fusion strategies to improve recall
    - **Reciprocal Rank Fusion**: Combines ranked lists by assigning high scores to documents that appear near the top of any list
    - **Relative Score Fusion**: Normalizes scores within each result set by dividing by the maximum score
    - **Distribution-Based Fusion**: Uses statistical techniques such as z-score normalization or percentile rankings to combine results, making it well suited to result sets whose scores vary widely

The Query Fusion Retriever **combines results from different retrievers** (such as vector-based and keyword-based methods) and **optionally generates multiple variations of a query using an LLM to improve coverage**. **The results are merged using fusion strategies** to improve recall.

**How it works (from authoritative source)**:
- **Combines results from multiple retrievers** - e.g., vector-based and keyword-based methods
- **Supports multiple query variations** - generates different formulations of the same query
- **Uses fusion strategies to improve recall** - sophisticated merging techniques
- **Improved Coverage**: Reduces impact of query formulation on final results

**Core capabilities**:
1. **Multiple Retriever Support**: Combines results from different retrievers
2. **Query Variation Generation**: Optionally generates multiple variations of a query using an LLM
3. **Fusion Strategies**: Merges results using sophisticated fusion techniques

**Fusion Strategies Supported (from authoritative source)**:
1. **Reciprocal Rank Fusion (RRF)**: **Combines rankings across queries** - robust and doesn't rely on score magnitudes
2. **Relative Score Fusion**: **Normalizes scores within each result set** - preserves the relative confidence of each retriever
3. **Distribution-Based Fusion**: **Uses statistical normalization** - ideal for handling score variability

**When to use (based on authoritative guidance):**
- General Q&A where you want to combine semantic relevance with keyword matching
- Complex or ambiguous queries that may benefit from multiple formulations
- When query phrasing significantly impacts results
- Research and exploratory search scenarios
- When users provide under-specified or unclear queries

**Configuration:**
- `num_queries`: Number of query variations to generate (default: 4)
- `mode`: Fusion strategy ("reciprocal_rerank", "relative_score", "dist_based_score")
- `similarity_top_k`: Number of results to retrieve per query
- `use_async`: Enable async processing for better performance

**Key benefit**: **Uses fusion strategies such as reciprocal rank fusion or relative score fusion** to intelligently combine results

**Strengths**: 
- Improved recall through multiple query formulations
- Handles query variations effectively
- Reduces query sensitivity
- Combines strengths of different retrieval methods

**Limitations**: 
- Higher computational cost due to multiple retrievers/queries
- Requires an LLM when query variations are generated (additional cost)
- May introduce noise if fusion strategies are not well-tuned
- More complex setup and configuration

**Reciprocal Rank Fusion**

Reciprocal Rank Fusion is the most robust fusion method in QueryFusionRetriever, designed to combine ranked lists from multiple query variations by using the reciprocal of ranks, which reduces the impact of outliers and provides stable fusion results.

**How it works within QueryFusionRetriever**:
- Generates multiple query variations (e.g., "machine learning approaches", "ML techniques", "learning algorithms")
- Retrieves results for each query variation
- Calculates reciprocal rank score: `1 / (rank + k)` where k is typically 60
- Sums reciprocal rank scores across all query variations for each document
- Re-ranks documents by combined RRF scores

**Mathematical formula**:
```
RRF_score(d) = Σ (1 / (rank_i(d) + k))
```
Where:
- `d` is a document
- `rank_i(d)` is the rank of document d in query variation i's results
- `k` is a constant (typically 60) that controls the fusion behavior
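
The formula translates directly into a few lines of Python. This sketch assumes ranks start at 1 and that every query variation contributes with equal weight; the document IDs are invented, and the helper is an illustration rather than the library's own code.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one list of (doc_id, score)."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (rank + k)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical results for three variations of the same query.
ranked_lists = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
]

fused = reciprocal_rank_fusion(ranked_lists)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

`doc_b` ends up first because it ranks near the top of all three lists, even though `doc_a` also appears in every list. Only positions matter here; the original retrieval scores are never used.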

**Why RRF works well for query fusion**:
- **Scale-invariant**: Works regardless of individual query result score ranges
- **Robust to outliers**: Reciprocal function reduces impact of extreme rankings
- **Query-agnostic**: Doesn't depend on specific query formulations
- **Proven effectiveness**: Well-established in information retrieval research

**When to use RRF mode**:
- Default choice for most query fusion scenarios
- When query variations might have very different result qualities
- When you want stable, predictable fusion behavior
- For production systems requiring consistent performance

**Advantages**:
- Most stable fusion method across different query types
- No parameter tuning required beyond the standard k=60
- Handles varying numbers of results per query variation gracefully
- Computationally efficient

**Limitations**:
- Loses absolute score information from individual queries
- Treats all query variations equally (no weighting)
- May not leverage score magnitude differences effectively

*Based on: https://docs.llamaindex.ai/en/stable/examples/retrievers/reciprocal_rerank_fusion/*

**Relative Score Fusion Mode**

Relative Score Fusion normalizes retrieval scores relative to the maximum score within each query variation's results, enabling effective combination when you want to preserve score magnitude information across different query formulations.

**How it works within QueryFusionRetriever**:
- Generates multiple query variations using LLM
- Retrieves results for each query variation
- Normalizes each query's scores by dividing by the maximum score in that query's results
- Creates scores in the range [0, 1] where 1 is the best result from each query variation
- Combines normalized scores using weighted average or sum

**Mathematical approach**:
```
normalized_score_i(d) = score_i(d) / max_score_i
combined_score(d) = Σ (weight_i × normalized_score_i(d))
```
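
A minimal sketch of the two formulas above, assuming equal weights by default and non-negative scores. The result sets and their scores are invented, and the second set deliberately uses a different scale to show why normalization is needed before combining.

```python
from collections import defaultdict

def relative_score_fusion(result_sets, weights=None):
    """Normalize each result set by its maximum score, then take a weighted sum."""
    if weights is None:
        weights = [1.0 / len(result_sets)] * len(result_sets)

    combined = defaultdict(float)
    for weight, scores in zip(weights, result_sets):
        if not scores:
            continue
        max_score = max(scores.values())
        for doc_id, score in scores.items():
            normalized = score / max_score if max_score > 0 else 0.0
            combined[doc_id] += weight * normalized
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical result sets: doc_id -> raw score.
result_sets = [
    {"doc_a": 0.80, "doc_b": 0.40},   # scores on a 0..1 scale
    {"doc_b": 12.0, "doc_c": 6.0},    # scores on a much larger scale
]

fused = relative_score_fusion(result_sets)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.2f}")
```

Without the division by `max_score`, the second result set would dominate the sum simply because its raw numbers are larger.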

**Why Relative Score Fusion is valuable for query variations**:
- **Preserves score magnitudes**: Unlike RRF, retains information about how confident each query was about its results
- **Fair combination**: Ensures no single query variation dominates due to different scoring scales
- **Interpretable results**: Final scores reflect the relative strength across query variations
- **Flexible weighting**: Can weight certain query formulations more heavily if desired

**When to use Relative Score mode**:
- When you trust the embedding model's confidence scores
- For queries where score magnitudes are meaningful
- When different query variations should contribute proportionally to their confidence
- In scenarios where you want to understand why certain results ranked highly

**Configuration within QueryFusionRetriever**:
- Automatically handles score normalization across query variations
- Equal weighting of all query variations by default
- Preserves relative differences in retriever confidence

**Advantages**:
- Preserves valuable score magnitude information
- Intuitive normalization approach
- Works well when retriever scores are reliable
- More interpretable than pure rank-based methods

**Limitations**:
- Sensitive to outlier scores within individual query results
- Assumes retriever scores are meaningful and comparable
- May not handle unreliable scoring mechanisms well

*Based on: https://docs.llamaindex.ai/en/stable/examples/retrievers/relative_score_dist_fusion/*

**Distribution-Based Score Fusion Mode**

Distribution-Based Score Fusion uses statistical properties of score distributions from each query variation to normalize and combine retrieval results, providing the most sophisticated handling of score variability and reliability across different query formulations.

**How it works within QueryFusionRetriever**:
- Generates multiple query variations using LLM
- Analyzes the statistical distribution of scores from each query variation
- Normalizes scores using distribution parameters (mean, standard deviation, percentiles)
- Applies statistical transformations like z-score normalization or percentile ranking
- Combines normalized scores with confidence weighting based on distribution characteristics

**Statistical approaches used**:
1. **Z-score normalization**: Centers scores around mean with unit variance
   - Formula: `z_score = (score - mean) / std_dev`
   - Converts to [0,1] range using sigmoid: `1 / (1 + exp(-z_score))`

2. **Percentile ranking**: Converts scores to percentile positions
   - Formula: `percentile = rank(score) / total_results`

3. **Distribution-aware normalization**: Considers score distribution shape
   - Uses IQR (Interquartile Range) to adjust for distribution spread
   - Handles multi-modal distributions from different query variations
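
The first two approaches can be sketched with the standard library. The formulas follow the ones listed above; the score lists are invented, the population standard deviation is an assumption, and ties in the percentile rank are counted with "less than or equal". How a given library actually implements its distribution-based mode may differ from this illustration.

```python
import math
import statistics

def z_score_sigmoid(scores):
    """Z-score normalize the scores, then squash them into (0, 1) with a sigmoid."""
    mean = statistics.fmean(scores)
    std_dev = statistics.pstdev(scores)
    if std_dev == 0:
        return [0.5 for _ in scores]   # all scores identical: no information
    return [1.0 / (1.0 + math.exp(-(s - mean) / std_dev)) for s in scores]

def percentile_ranks(scores):
    """rank(score) / total_results, where rank counts the scores <= this one."""
    total = len(scores)
    return [sum(1 for other in scores if other <= s) / total for s in scores]

# Hypothetical result sets with very different score distributions.
wide_scores = [12.0, 6.0, 3.0]       # widely spread scores
narrow_scores = [0.82, 0.80, 0.78]   # tightly clustered scores

wide_norm = z_score_sigmoid(wide_scores)
narrow_norm = z_score_sigmoid(narrow_scores)
wide_pct = percentile_ranks(wide_scores)

print([round(v, 3) for v in wide_norm])
print([round(v, 3) for v in narrow_norm])
print([round(v, 3) for v in wide_pct])
```

After normalization both lists live on the same (0, 1) scale, so a tightly clustered result set is no longer drowned out by one with large raw values.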

**Why Distribution-Based Fusion excels for query variations**:
- **Statistical robustness**: Accounts for how scores are distributed within each query variation
- **Adaptive weighting**: Can weight query variations based on their score distribution confidence
- **Outlier handling**: Statistical methods naturally handle extreme scores
- **Multi-modal support**: Each query variation may have different score distribution characteristics

**When to use Distribution-Based mode**:
- When query variations produce very different score distributions
- For complex queries where some variations are much more reliable than others
- When you need statistically principled score combination
- In scenarios with noisy or unreliable retrieval scoring

**Advanced features in QueryFusionRetriever context**:
- Automatic distribution analysis for each query variation
- Confidence-based weighting of query variations
- Robust handling of varying result set sizes
- Statistical outlier detection within query results

**Advantages**:
- Most statistically principled approach to query fusion
- Handles complex score distributions effectively
- Adapts to different query variation characteristics
- Robust to various types of score variability and noise

**Limitations**:
- Most computationally intensive fusion method
- Requires sufficient results for reliable distribution estimation
- May over-normalize in some simple scenarios
- More complex to interpret than simpler fusion methods

*Based on: https://docs.llamaindex.ai/en/stable/examples/retrievers/relative_score_dist_fusion/*

###### **Recommended retrievers by use case**

|Use Case|Primary Retriever|Secondary/Hybrid|
|------|------|------|
|General Q&A|Vector Index Retriever|+BM25|
|Technical Docs|BM25 Retriever|+Vector|
|Long Documents|Auto Merging Retriever|-|
|Research Papers|Recursive Retriever|-|
|Large Document Sets|Document Summary Index Retriever|+Vector|

Based on the authoritative source and the characteristics of each retriever, here are recommended approaches for different scenarios:

**General Q&A Applications:**
- **Primary**: Vector Index Retriever for semantic understanding
- **Enhancement**: Combine with BM25 Retriever using Query Fusion for hybrid approach
- **Benefit**: Combines semantic relevance with keyword matching
- **From authoritative source**: "For general Q&A, use a vector index retriever, potentially combined with a BM25 retriever. This retriever fusion combines semantic relevance with keyword matching."

**Technical Documentation:**
- **Primary**: BM25 Retriever for exact term matching
- **Enhancement**: Vector Index Retriever as secondary for contextual flexibility
- **Benefit**: Prioritizes exact technical terms while maintaining semantic understanding
- **From authoritative source**: "For technical documents, especially those where exact terms need to be prioritized, consider making BM25 your primary retriever, with the vector index retriever adding contextual flexibility as a secondary retriever."

**Long Documents:**
- **Primary**: Auto Merging Retriever
- **Benefit**: Retrieves longer parent versions only if enough shorter child versions are retrieved, preserving context
- **From authoritative source**: "For long documents, the auto merging retriever is a great option, because it will retrieve longer parent versions only if enough shorter child versions are retrieved."

**Research Papers:**
- **Primary**: Recursive Retriever
- **Benefit**: Follows citations and references to retrieve relevant content from cited papers
- **From authoritative source**: "For research papers, use the recursive retriever in order to retrieve relevant content from cited papers."

**Large Document Collections:**
- **Primary**: Document Summary Index Retriever for initial filtering
- **Enhancement**: Followed by Vector Index Retriever for detailed search within relevant documents
- **Benefit**: Narrows down relevant documents first, then performs detailed retrieval
- **From authoritative source**: "For large document sets, consider using the document summary index retriever to narrow down the number of relevant documents, followed by a vector search within the remaining subset to retrieve the most pertinent content."


#### Lab 2 - Explore Advanced Retrievers in Llama Index

In [None]:
import sys
# Below will fail unless the appropriate Python version is running
!{sys.executable} -m pip install llama-index==0.12.49 \
    llama-index-embeddings-huggingface==0.5.5 \
    llama-index-llms-ibm==0.4.0 \
    llama-index-retrievers-bm25==0.5.2 \
    sentence-transformers==5.0.0 \
    rank-bm25==0.2.2 \
    PyStemmer==2.2.0.3 \
    ibm-watsonx-ai==1.3.31 | tail -n 1

In [None]:
import os
import json
from typing import List, Optional
import asyncio
import warnings
import numpy as np
warnings.filterwarnings('ignore')

# Core LlamaIndex imports
from llama_index.core import (
    VectorStoreIndex, 
    SimpleDirectoryReader, 
    Document,
    Settings,
    DocumentSummaryIndex,
    KeywordTableIndex
)
from llama_index.core.retrievers import (
    BaseRetriever,
    VectorIndexRetriever,
    AutoMergingRetriever,
    RecursiveRetriever,
    QueryFusionRetriever
)
from llama_index.core.indices.document_summary import (
    DocumentSummaryIndexLLMRetriever,
    DocumentSummaryIndexEmbeddingRetriever,
)
from llama_index.core.node_parser import SentenceSplitter, HierarchicalNodeParser
from llama_index.core.schema import NodeWithScore, QueryBundle
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.embeddings import BaseEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Advanced retriever imports
from llama_index.retrievers.bm25 import BM25Retriever

# IBM WatsonX LlamaIndex integration
from ibm_watsonx_ai import APIClient
from llama_index.llms.ibm import WatsonxLLM

# Sentence transformers
from sentence_transformers import SentenceTransformer

# Statistical libraries for fusion techniques
try:
    from scipy import stats
    SCIPY_AVAILABLE = True
except ImportError:
    SCIPY_AVAILABLE = False
    print("⚠️ scipy not available - some advanced fusion features will be limited")

print("✅ All imports successful!")

In [None]:
# watsonx.ai LLM using official LlamaIndex integration
def create_watsonx_llm():
    """Create watsonx.ai LLM instance using official LlamaIndex integration."""
    try:
        # Create the API client object
        api_client = APIClient({'url': "https://us-south.ml.cloud.ibm.com"})
        # Use llama-index-llms-ibm (official watsonx.ai integration)
        llm = WatsonxLLM(
            model_id="ibm/granite-3-3-8b-instruct",
            url="https://us-south.ml.cloud.ibm.com",
            project_id="skills-network",
            api_client=api_client,
            temperature=0.9
        )
        print("✅ watsonx.ai LLM initialized using official LlamaIndex integration")
        return llm
    except Exception as e:
        print(f"⚠️ watsonx.ai initialization error: {e}")
        print("Falling back to mock LLM for demonstration")
        
        # Fallback mock LLM for demonstration
        from llama_index.core.llms.mock import MockLLM
        return MockLLM(max_tokens=512)

In [None]:
# Initialize embedding model first
print("🔧 Initializing HuggingFace embeddings...")
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)
print("✅ HuggingFace embeddings initialized!")

# Setup with watsonx.ai
print("🔧 Initializing watsonx.ai LLM...")
llm = create_watsonx_llm()

# Configure global settings
Settings.llm = llm
Settings.embed_model = embed_model
print("✅ watsonx.ai LLM and embeddings configured!")

In [None]:
# Sample data for the lab - AI/ML focused documents
SAMPLE_DOCUMENTS = [
    "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.",
    "Deep learning uses neural networks with multiple layers to model and understand complex patterns in data.",
    "Natural language processing enables computers to understand, interpret, and generate human language.",
    "Computer vision allows machines to interpret and understand visual information from the world.",
    "Reinforcement learning is a type of machine learning where agents learn to make decisions through rewards and penalties.",
    "Supervised learning uses labeled training data to learn a mapping from inputs to outputs.",
    "Unsupervised learning finds hidden patterns in data without labeled examples.",
    "Transfer learning leverages knowledge from pre-trained models to improve performance on new tasks.",
    "Generative AI can create new content including text, images, code, and more.",
    "Large language models are trained on vast amounts of text data to understand and generate human-like text."
]

# Consistent query examples used throughout the lab
DEMO_QUERIES = {
    "basic": "What is machine learning?",
    "technical": "neural networks deep learning", 
    "learning_types": "different types of learning",
    "advanced": "How do neural networks work in deep learning?",
    "applications": "What are the applications of AI?",
    "comprehensive": "What are the main approaches to machine learning?",
    "specific": "supervised learning techniques"
}

print(f"📄 Loaded {len(SAMPLE_DOCUMENTS)} sample documents")
print(f"🔍 Prepared {len(DEMO_QUERIES)} consistent demo queries")
for i, doc in enumerate(SAMPLE_DOCUMENTS[:3], 1):
    print(f"{i}. {doc}")
print("...")

In [None]:
class AdvancedRetrieversLab:
    def __init__(self):
        print("🚀 Initializing Advanced Retrievers Lab...")
        self.documents = [Document(text=text) for text in SAMPLE_DOCUMENTS]
        self.nodes = SentenceSplitter().get_nodes_from_documents(self.documents)
        
        print("📊 Creating indexes...")
        # Create various indexes
        self.vector_index = VectorStoreIndex.from_documents(self.documents)
        self.document_summary_index = DocumentSummaryIndex.from_documents(self.documents)
        self.keyword_index = KeywordTableIndex.from_documents(self.documents)
        
        print("✅ Advanced Retrievers Lab Initialized!")
        print(f"📄 Loaded {len(self.documents)} documents")
        print(f"🔢 Created {len(self.nodes)} nodes")

# Initialize the lab
lab = AdvancedRetrieversLab()

In [None]:
print("=" * 60)
print("1. VECTOR INDEX RETRIEVER")
print("=" * 60)

# Basic vector retriever
vector_retriever = VectorIndexRetriever(
    index=lab.vector_index,
    similarity_top_k=3
)

# Alternative creation method
alt_retriever = lab.vector_index.as_retriever(similarity_top_k=3)

query = DEMO_QUERIES["basic"]  # "What is machine learning?"
nodes = vector_retriever.retrieve(query)

print(f"Query: {query}")
print(f"Retrieved {len(nodes)} nodes:")
for i, node in enumerate(nodes, 1):
    print(f"{i}. Score: {node.score:.4f}")
    print(f"   Text: {node.text[:100]}...")
    print()

In [None]:
print("=" * 60)
print("2. BM25 RETRIEVER")
print("=" * 60)

try:
    import Stemmer
    
    # Create BM25 retriever with default parameters
    bm25_retriever = BM25Retriever.from_defaults(
        nodes=lab.nodes,
        similarity_top_k=3,
        stemmer=Stemmer.Stemmer("english"),
        language="english"
    )
    
    query = DEMO_QUERIES["technical"]  # "neural networks deep learning"
    nodes = bm25_retriever.retrieve(query)
    
    print(f"Query: {query}")
    print("BM25 analyzes exact keyword matches with sophisticated scoring")
    print(f"Retrieved {len(nodes)} nodes:")
    
    for i, node in enumerate(nodes, 1):
        score = node.score if hasattr(node, 'score') and node.score else 0
        print(f"{i}. BM25 Score: {score:.4f}")
        print(f"   Text: {node.text[:100]}...")
        
        # Highlight which query terms appear in the text
        text_lower = node.text.lower()
        query_terms = query.lower().split()
        found_terms = [term for term in query_terms if term in text_lower]
        if found_terms:
            print(f"   → Found terms: {found_terms}")
        print()
    
    print("BM25 vs TF-IDF Comparison:")
    print("TF-IDF Problem: Linear term frequency scaling")
    print("  Example: 10 occurrences → score of 10, 100 occurrences → score of 100")
    print("BM25 Solution: Saturation function")
    print("  Example: 10 occurrences → high score, 100 occurrences → slightly higher score")
    print()
    print("TF-IDF Problem: No document length consideration")
    print("  Example: Long documents dominate results")
    print("BM25 Solution: Length normalization (b parameter)")
    print("  Example: Scores adjusted based on document length vs. average")
    print()
    print("Key BM25 Parameters:")
    print("- k1 ≈ 1.2: Term frequency saturation (how quickly scores plateau)")
    print("- b ≈ 0.75: Document length normalization (0=none, 1=full)")
    print("- IDF weighting: Rare terms get higher scores")
        
except ImportError:
    print("⚠️ BM25Retriever requires 'pip install PyStemmer'")
    print("Demonstrating BM25 concepts with fallback vector search...")
    
    fallback_retriever = lab.vector_index.as_retriever(similarity_top_k=3)
    query = DEMO_QUERIES["technical"]
    nodes = fallback_retriever.retrieve(query)
    
    print(f"Query: {query}")
    print("(Using vector fallback to demonstrate BM25 concepts)")
    
    for i, node in enumerate(nodes, 1):
        print(f"{i}. Vector Score: {node.score:.4f}")
        print(f"   Text: {node.text[:100]}...")
        
        # Show which query terms appear verbatim in the text (the terms a keyword scorer would match)
        text_lower = node.text.lower()
        query_terms = query.lower().split()
        found_terms = [term for term in query_terms if term in text_lower]
        
        if found_terms:
            print(f"   → BM25 would boost this result for terms: {found_terms}")
        print()
    
    print("BM25 Concept Demonstration:")
    print("1. TF-IDF Foundation:")
    print("   - Term Frequency: How often words appear in document")
    print("   - Inverse Document Frequency: How rare words are across collection")
    print("   - TF-IDF = TF × IDF (balances frequency vs rarity)")
    print()
    print("2. BM25 Improvements:")
    print("   - Saturation: Prevents over-scoring repeated terms")
    print("   - Length normalization: Prevents long document bias")
    print("   - Tunable parameters: k1 (saturation) and b (length adjustment)")
    print()
    print("3. Real-world Usage:")
    print("   - Elasticsearch default scoring function")
    print("   - Apache Lucene/Solr standard")
    print("   - TF-IDF-style term weighting, which BM25 refines, is widely used in text-based recommender systems")
    print("   - Developed by Robertson & Spärck Jones at City University London")
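
The saturation and length-normalization ideas printed in the cell above can be sketched in a few lines. This is a minimal pure-Python illustration of the textbook Okapi BM25 term score, not the code that `BM25Retriever` runs internally; the function name `bm25_term_score` and the corpus statistics are assumptions made up for illustration.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term for one document with the textbook BM25 formula."""
    # IDF: rare terms (small doc_freq) receive a larger weight.
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # Length normalization: b=0 ignores document length, b=1 normalizes fully.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturation: the fraction approaches (k1 + 1) as tf grows.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)

# Hypothetical corpus statistics, for illustration only.
stats = dict(doc_len=100, avg_doc_len=100, n_docs=10, doc_freq=2)
score_1 = bm25_term_score(tf=1, **stats)
score_10 = bm25_term_score(tf=10, **stats)
score_100 = bm25_term_score(tf=100, **stats)

# Going from 1 to 10 occurrences helps far more than going from 10 to 100.
print(f"tf=1: {score_1:.3f}, tf=10: {score_10:.3f}, tf=100: {score_100:.3f}")

# The same term frequency counts for less in a longer-than-average document.
long_doc = bm25_term_score(tf=10, doc_len=400, avg_doc_len=100, n_docs=10, doc_freq=2)
print(f"tf=10 in a 4x longer document: {long_doc:.3f}")
```

Raising `k1` delays the plateau, and setting `b=0` switches length normalization off entirely.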

In [None]:
print("=" * 60)
print("3. DOCUMENT SUMMARY INDEX RETRIEVERS")
print("=" * 60)

# LLM-based document summary retriever
doc_summary_retriever_llm = DocumentSummaryIndexLLMRetriever(
    lab.document_summary_index,
    choice_top_k=3  # Number of documents to select
)

# Embedding-based document summary retriever  
doc_summary_retriever_embedding = DocumentSummaryIndexEmbeddingRetriever(
    lab.document_summary_index,
    similarity_top_k=3  # Number of documents to select
)

query = DEMO_QUERIES["learning_types"]  # "different types of learning"

print(f"Query: {query}")

print("\nA) LLM-based Document Summary Retriever:")
print("Uses LLM to select relevant documents based on summaries")
try:
    nodes_llm = doc_summary_retriever_llm.retrieve(query)
    print(f"Retrieved {len(nodes_llm)} nodes")
    for i, node in enumerate(nodes_llm[:2], 1):
        print(f"{i}. Score: {node.score:.4f}" if hasattr(node, 'score') and node.score else f"{i}. (Document summary)")
        print(f"   Text: {node.text[:80]}...")
        print()
except Exception as e:
    print(f"LLM-based retrieval demo: {str(e)[:100]}...")

print("B) Embedding-based Document Summary Retriever:")
print("Uses vector similarity between query and document summaries")
try:
    nodes_emb = doc_summary_retriever_embedding.retrieve(query)
    print(f"Retrieved {len(nodes_emb)} nodes")
    for i, node in enumerate(nodes_emb[:2], 1):
        print(f"{i}. Score: {node.score:.4f}" if hasattr(node, 'score') and node.score else f"{i}. (Document summary)")
        print(f"   Text: {node.text[:80]}...")
        print()
except Exception as e:
    print(f"Embedding-based retrieval demo: {str(e)[:100]}...")

print("Document Summary Index workflow:")
print("1. Generates summaries for each document using LLM")
print("2. Uses summaries to select relevant documents")
print("3. Returns full content from selected documents")

In [None]:
print("=" * 60)
print("4. AUTO MERGING RETRIEVER")
print("=" * 60)

# Create hierarchical nodes
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[512, 256, 128]
)

hier_nodes = node_parser.get_nodes_from_documents(lab.documents)

# Create a storage context whose docstore holds every node (parents and leaves)
from llama_index.core import StorageContext
from llama_index.core.node_parser import get_leaf_nodes
from llama_index.core.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore()
docstore.add_documents(hier_nodes)

storage_context = StorageContext.from_defaults(docstore=docstore)

# Index only the leaf nodes; parents stay in the docstore so retrieved leaves can be merged upward
leaf_nodes = get_leaf_nodes(hier_nodes)
base_index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
base_retriever = base_index.as_retriever(similarity_top_k=6)

# Create auto-merging retriever
auto_merging_retriever = AutoMergingRetriever(
    base_retriever, 
    storage_context,
    verbose=True
)

query = DEMO_QUERIES["advanced"]  # "How do neural networks work in deep learning?"
nodes = auto_merging_retriever.retrieve(query)

print(f"Query: {query}")
print(f"Auto-merged to {len(nodes)} nodes")
for i, node in enumerate(nodes[:3], 1):
    print(f"{i}. Score: {node.score:.4f}" if hasattr(node, 'score') and node.score else f"{i}. (Auto-merged)")
    print(f"   Text: {node.text[:120]}...")
    print()
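
The merging step itself can be pictured with a small sketch. This is a simplified, single-pass illustration of the idea and not LlamaIndex's implementation: the `merge_into_parents` helper, the `parent_of` and `children_of` mappings, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def merge_into_parents(retrieved_ids, parent_of, children_of, ratio_threshold=0.5):
    """Replace retrieved child chunks with their parent when enough siblings were retrieved."""
    retrieved = set(retrieved_ids)
    merged = set()
    for node_id in retrieved:
        parent = parent_of.get(node_id)
        if parent is None:
            merged.add(node_id)  # Top-level chunk: nothing to merge into.
            continue
        siblings = children_of[parent]
        hit_ratio = len(retrieved & set(siblings)) / len(siblings)
        # Merge only when more than the threshold share of siblings was retrieved.
        merged.add(parent if hit_ratio > ratio_threshold else node_id)
    return sorted(merged)

# Hypothetical hierarchy: parent "P1" has four leaf chunks, "P2" has two.
children_of = {"P1": ["a", "b", "c", "d"], "P2": ["e", "f"]}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}

result = merge_into_parents(["a", "b", "c", "e"], parent_of, children_of)
print(result)
```

Three of the four `P1` chunks were retrieved, so they collapse into `P1`; only half of the `P2` chunks were retrieved, so `e` is returned as-is.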

In [None]:
print("=" * 60)
print("6.1 RECIPROCAL RANK FUSION MODE DEMONSTRATION")
print("=" * 60)

# Create QueryFusionRetriever with RRF mode
base_retriever = lab.vector_index.as_retriever(similarity_top_k=5)

print("Testing QueryFusionRetriever with reciprocal_rerank mode:")
print("This demonstrates how RRF works within the query fusion framework")

# Use the same query for consistency across all fusion modes
query = DEMO_QUERIES["comprehensive"]  # "What are the main approaches to machine learning?"

try:
    # Create query fusion retriever with RRF mode
    rrf_query_fusion = QueryFusionRetriever(
        [base_retriever],
        similarity_top_k=3,
        num_queries=3,
        mode="reciprocal_rerank",
        use_async=False,
        verbose=True
    )
    
    print(f"\nQuery: {query}")
    print("QueryFusionRetriever will:")
    print("1. Generate query variations using LLM")
    print("2. Retrieve results for each variation")
    print("3. Apply Reciprocal Rank Fusion")
    
    nodes = rrf_query_fusion.retrieve(query)
    
    print(f"\nRRF Query Fusion Results:")
    for i, node in enumerate(nodes, 1):
        print(f"{i}. Final RRF Score: {node.score:.4f}")
        print(f"   Text: {node.text[:100]}...")
        print()
    
    print("RRF Benefits in Query Fusion Context:")
    print("- Automatically handles query variations of different quality")
    print("- No bias toward queries that return higher raw scores")
    print("- Stable performance across diverse query formulations")
    
except Exception as e:
    print(f"QueryFusionRetriever error: {e}")
    print("Demonstrating RRF concept manually with query variations...")
    
    # Manual demonstration with query variations derived from the main query
    query_variations = [
        DEMO_QUERIES["comprehensive"],  # Original query
        "machine learning approaches and methods",
        "different ML techniques and algorithms"
    ]
    
    print("Manual RRF with Query Variations:")
    all_results = {}
    
    for i, query_var in enumerate(query_variations):
        print(f"\nQuery variation {i+1}: {query_var}")
        nodes = base_retriever.retrieve(query_var)
        
        # Apply RRF scoring
        for rank, node in enumerate(nodes):
            node_id = node.node.node_id
            if node_id not in all_results:
                all_results[node_id] = {
                    'node': node,
                    'rrf_score': 0,
                    'query_ranks': []
                }
            
            # Calculate RRF contribution: 1 / (k + rank), where rank is 1-based
            k = 60  # Standard RRF parameter
            rrf_contribution = 1.0 / (rank + 1 + k)  # enumerate() starts at 0, so add 1
            all_results[node_id]['rrf_score'] += rrf_contribution
            all_results[node_id]['query_ranks'].append((i, rank + 1))
    
    # Sort by final RRF score
    sorted_results = sorted(
        all_results.values(), 
        key=lambda x: x['rrf_score'], 
        reverse=True
    )
    
    print(f"\nCombined RRF Results (top 3):")
    for i, result in enumerate(sorted_results[:3], 1):
        print(f"{i}. Final RRF Score: {result['rrf_score']:.4f}")
        print(f"   Query ranks: {result['query_ranks']}")
        print(f"   Text: {result['node'].text[:100]}...")
        print()
    
    print("RRF Formula Demonstration:")
    print("For each document: RRF_score = Σ(1 / (rank + 60))")
    print("- Rank 1 in query: 1/(1+60) = 0.0164")
    print("- Rank 2 in query: 1/(2+60) = 0.0161")
    print("- Rank 3 in query: 1/(3+60) = 0.0159")
    print("Documents appearing in multiple queries get higher combined scores")

In [None]:
print("=" * 60)
print("6.2 RELATIVE SCORE FUSION MODE DEMONSTRATION")
print("=" * 60)

base_retriever = lab.vector_index.as_retriever(similarity_top_k=5)

print("Testing QueryFusionRetriever with relative_score mode:")
print("This mode preserves score magnitudes while normalizing across query variations")

# Use the same query for consistency across all fusion modes
query = DEMO_QUERIES["comprehensive"]  # "What are the main approaches to machine learning?"

try:
    # Create query fusion retriever with relative score mode
    rel_score_fusion = QueryFusionRetriever(
        [base_retriever],
        similarity_top_k=3,
        num_queries=3,
        mode="relative_score",
        use_async=False,
        verbose=False
    )
    
    print(f"\nQuery: {query}")
    print("QueryFusionRetriever with relative_score will:")
    print("1. Generate query variations")
    print("2. Normalize scores within each variation (score/max_score)")
    print("3. Combine normalized scores")
    
    nodes = rel_score_fusion.retrieve(query)
    
    print(f"\nRelative Score Fusion Results:")
    for i, node in enumerate(nodes, 1):
        print(f"{i}. Combined Relative Score: {node.score:.4f}")
        print(f"   Text: {node.text[:100]}...")
        print()
    
    print("Relative Score Benefits in Query Fusion:")
    print("- Preserves confidence information from embedding model")
    print("- Ensures fair contribution from each query variation")
    print("- More interpretable than rank-only methods")
    
except Exception as e:
    print(f"QueryFusionRetriever error: {e}")
    print("Demonstrating Relative Score concept manually...")
    
    # Manual demonstration with query variations derived from the main query
    query_variations = [
        DEMO_QUERIES["comprehensive"],  # Original query
        "machine learning approaches and methods",
        "different ML techniques and algorithms"
    ]
    
    print("Manual Relative Score Fusion with Query Variations:")
    all_results = {}
    query_max_scores = []
    
    # Step 1: Get results and find max scores for each query
    for i, query_var in enumerate(query_variations):
        print(f"\nQuery variation {i+1}: {query_var}")
        nodes = base_retriever.retrieve(query_var)
        scores = [node.score or 0 for node in nodes]
        max_score = max(scores) if scores else 1.0
        query_max_scores.append(max_score)
        
        print(f"Max score for this query: {max_score:.4f}")
        
        # Store results with normalization info
        for node in nodes:
            node_id = node.node.node_id
            original_score = node.score or 0
            normalized_score = original_score / max_score if max_score > 0 else 0
            
            if node_id not in all_results:
                all_results[node_id] = {
                    'node': node,
                    'combined_score': 0,
                    'contributions': []
                }
            
            all_results[node_id]['combined_score'] += normalized_score
            all_results[node_id]['contributions'].append({
                'query': i,
                'original': original_score,
                'normalized': normalized_score
            })
    
    # Step 2: Sort by combined relative score
    sorted_results = sorted(
        all_results.values(),
        key=lambda x: x['combined_score'],
        reverse=True
    )
    
    print(f"\nCombined Relative Score Results (top 3):")
    for i, result in enumerate(sorted_results[:3], 1):
        print(f"{i}. Combined Score: {result['combined_score']:.4f}")
        print(f"   Score breakdown:")
        for contrib in result['contributions']:
            print(f"     Query {contrib['query']}: {contrib['original']:.3f} → {contrib['normalized']:.3f}")
        print(f"   Text: {result['node'].text[:100]}...")
        print()
    
    print("Relative Score Normalization Process:")
    print("1. For each query variation, find max_score")
    print("2. Normalize: normalized_score = original_score / max_score")
    print("3. Sum normalized scores across all query variations")
    print("4. Documents with consistently high scores across queries win")

In [None]:
print("=" * 60)
print("6.3 DISTRIBUTION-BASED SCORE FUSION MODE DEMONSTRATION")
print("=" * 60)

base_retriever = lab.vector_index.as_retriever(similarity_top_k=8)

print("Testing QueryFusionRetriever with dist_based_score mode:")
print("This mode uses statistical analysis for the most sophisticated score fusion")

# Use the same query for consistency across all fusion modes
query = DEMO_QUERIES["comprehensive"]  # "What are the main approaches to machine learning?"

try:
    # Create query fusion retriever with distribution-based mode
    dist_fusion = QueryFusionRetriever(
        [base_retriever],
        similarity_top_k=3,
        num_queries=3,
        mode="dist_based_score",
        use_async=False,
        verbose=False
    )
    
    print(f"\nQuery: {query}")
    print("QueryFusionRetriever with dist_based_score will:")
    print("1. Generate query variations")
    print("2. Analyze score distributions for each variation")
    print("3. Rescale scores using each variation's mean and standard deviation")
    print("4. Combine the rescaled scores across variations")
    
    nodes = dist_fusion.retrieve(query)
    
    print(f"\nDistribution-Based Fusion Results:")
    for i, node in enumerate(nodes, 1):
        print(f"{i}. Statistically Normalized Score: {node.score:.4f}")
        print(f"   Text: {node.text[:100]}...")
        print()
    
    print("Distribution-Based Benefits in Query Fusion:")
    print("- Accounts for score distribution differences between query variations")
    print("- Statistically robust against outliers and noise")
    print("- Adapts weighting based on query variation reliability")
    
except Exception as e:
    print(f"QueryFusionRetriever error: {e}")
    print("Demonstrating Distribution-Based concept manually...")
    
    if not SCIPY_AVAILABLE:
        print("⚠️ Full statistical analysis requires scipy")
    
    # Manual demonstration with query variations derived from the main query
    query_variations = [
        DEMO_QUERIES["comprehensive"],  # Original query
        "machine learning approaches and methods",
        "different ML techniques and algorithms"
    ]
    
    print("Manual Distribution-Based Fusion with Query Variations:")
    all_results = {}
    variation_stats = []
    
    # Step 1: Collect results and analyze distributions
    for i, query_var in enumerate(query_variations):
        print(f"\nQuery variation {i+1}: {query_var}")
        nodes = base_retriever.retrieve(query_var)
        scores = [node.score or 0 for node in nodes]
        
        # Calculate distribution statistics
        mean_score = np.mean(scores) if scores else 0
        std_score = np.std(scores) if len(scores) > 1 else 1
        min_score = np.min(scores) if scores else 0
        max_score = np.max(scores) if scores else 1
        
        stats_info = {
            'mean': mean_score,
            'std': std_score,
            'min': min_score,
            'max': max_score,
            'nodes': nodes,
            'scores': scores
        }
        variation_stats.append(stats_info)
        
        print(f"Distribution stats: mean={mean_score:.3f}, std={std_score:.3f}")
        print(f"Score range: [{min_score:.3f}, {max_score:.3f}]")
        
        # Apply z-score normalization
        for node, score in zip(nodes, scores):
            node_id = node.node.node_id
            
            # Z-score normalization
            if std_score > 0:
                z_score = (score - mean_score) / std_score
            else:
                z_score = 0
            
            # Convert to [0,1] using sigmoid
            normalized_score = 1 / (1 + np.exp(-z_score))
            
            if node_id not in all_results:
                all_results[node_id] = {
                    'node': node,
                    'combined_score': 0,
                    'contributions': []
                }
            
            all_results[node_id]['combined_score'] += normalized_score
            all_results[node_id]['contributions'].append({
                'query': i,
                'original': score,
                'z_score': z_score,
                'normalized': normalized_score
            })
    
    # Step 2: Sort by combined distribution-based score
    sorted_results = sorted(
        all_results.values(),
        key=lambda x: x['combined_score'],
        reverse=True
    )
    
    print(f"\nCombined Distribution-Based Results (top 3):")
    for i, result in enumerate(sorted_results[:3], 1):
        print(f"{i}. Combined Score: {result['combined_score']:.4f}")
        print(f"   Statistical breakdown:")
        for contrib in result['contributions']:
            print(f"     Query {contrib['query']}: {contrib['original']:.3f} → "
                  f"z={contrib['z_score']:.2f} → {contrib['normalized']:.3f}")
        print(f"   Text: {result['node'].text[:100]}...")
        print()
    
    print("Distribution-Based Process:")
    print("1. Calculate mean and std for each query variation")
    print("2. Z-score normalize: z = (score - mean) / std")
    print("3. Sigmoid transform: normalized = 1 / (1 + exp(-z))")
    print("4. Sum normalized scores across variations")
    print("5. Results reflect statistical significance across all query forms")

# Show fusion mode comparison summary
print("\n" + "=" * 60)
print("FUSION MODES COMPARISON SUMMARY")
print("=" * 60)
print("All three modes tested with the same query for direct comparison:")
print(f"Query: {query}")
print()
print("Mode Characteristics:")
print("• RRF (reciprocal_rerank): Most robust, rank-based, scale-invariant")
print("• Relative Score: Preserves confidence, normalizes by max score")  
print("• Distribution-Based: Most sophisticated, statistical normalization")
print()
print("Choose based on your use case:")
print("- Production stability → RRF")
print("- Score interpretability → Relative Score")
print("- Statistical robustness → Distribution-Based")
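
The three scoring rules used in the manual fallbacks above can be compared side by side on toy data. The sketch below is a pure-Python illustration under stated assumptions: the function names, document IDs, and scores are hypothetical, and the code mirrors the fallback logic in these cells, not the internals of `QueryFusionRetriever`.

```python
import math
import statistics
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over lists, rank starting at 1."""
    fused = defaultdict(float)
    for results in ranked_lists:
        for rank, (doc_id, _score) in enumerate(results, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return dict(fused)

def relative_score(ranked_lists):
    """Divide each score by the list's maximum, then sum across lists."""
    fused = defaultdict(float)
    for results in ranked_lists:
        top = max(score for _doc, score in results) or 1.0
        for doc_id, score in results:
            fused[doc_id] += score / top
    return dict(fused)

def zscore_sigmoid(ranked_lists):
    """Z-score each list, squash with a sigmoid, then sum across lists."""
    fused = defaultdict(float)
    for results in ranked_lists:
        scores = [score for _doc, score in results]
        mean = statistics.fmean(scores)
        std = statistics.pstdev(scores) or 1.0
        for doc_id, score in results:
            fused[doc_id] += 1.0 / (1.0 + math.exp(-(score - mean) / std))
    return dict(fused)

# Hypothetical results for two query variations, already sorted by score.
lists = [
    [("doc_a", 0.90), ("doc_b", 0.60), ("doc_c", 0.40)],
    [("doc_b", 0.70), ("doc_d", 0.65), ("doc_a", 0.60)],
]

rankings = {}
for name, fuse in [("RRF", rrf), ("relative", relative_score), ("z-score", zscore_sigmoid)]:
    fused = fuse(lists)
    rankings[name] = sorted(fused, key=fused.get, reverse=True)
    print(f"{name:>8}: {rankings[name]}")
```

On this data the rank-only rule favors `doc_b` (ranks 2 and 1), while max-normalization favors `doc_a` because its large raw score in the first list carries through. That is the practical difference between rank-based and score-based fusion.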

##### Exercise 1 - Build a Custom Hybrid Retriever

Your task is to create a hybrid retriever that combines both vector similarity and BM25 keyword search for improved results.

**Requirements:**
- Use both Vector Index Retriever and BM25 Retriever
- Implement a simple score fusion mechanism which takes a weighted average of normalized scores
- Test with different query types (semantic vs keyword-focused)

**Important Note**: `lab.vector_index` and `lab.nodes` were parsed separately, so their node IDs won't match even for the same content. Match results by text content instead.

In [None]:
# Create both retrievers
vector_retriever = lab.vector_index.as_retriever(similarity_top_k=10)
try:
    bm25_retriever = BM25Retriever.from_defaults(
        nodes=lab.nodes, similarity_top_k=10
    )
except Exception:
    # Fallback if BM25 is not available
    bm25_retriever = vector_retriever

def hybrid_retrieve(query, top_k=5):
    # Get results from both retrievers
    vector_results = vector_retriever.retrieve(query)
    bm25_results = bm25_retriever.retrieve(query)
    
    # Create dictionaries using text content as keys (since node IDs differ)
    vector_scores = {}
    bm25_scores = {}
    all_nodes = {}
    
    # Normalize vector scores (guards against empty results, None scores, and a zero maximum)
    max_vector_score = max((r.score or 0.0 for r in vector_results), default=0.0) or 1.0
    for result in vector_results:
        text_key = result.text.strip()  # Use text content as key
        normalized_score = (result.score or 0.0) / max_vector_score
        vector_scores[text_key] = normalized_score
        all_nodes[text_key] = result
    
    # Normalize BM25 scores (the maximum is 0 when no query term matches any document)
    max_bm25_score = max((r.score or 0.0 for r in bm25_results), default=0.0) or 1.0
    for result in bm25_results:
        text_key = result.text.strip()  # Use text content as key
        normalized_score = (result.score or 0.0) / max_bm25_score
        bm25_scores[text_key] = normalized_score
        all_nodes[text_key] = result
    
    # Calculate hybrid scores
    hybrid_results = []
    for text_key in all_nodes:
        vector_score = vector_scores.get(text_key, 0)
        bm25_score = bm25_scores.get(text_key, 0)
        hybrid_score = 0.7 * vector_score + 0.3 * bm25_score
        
        hybrid_results.append({
            'node': all_nodes[text_key],
            'vector_score': vector_score,
            'bm25_score': bm25_score,
            'hybrid_score': hybrid_score
        })
    
    # Sort by hybrid score and return top k
    hybrid_results.sort(key=lambda x: x['hybrid_score'], reverse=True)
    return hybrid_results[:top_k]

# Test with different queries
test_queries = [
    "What is machine learning?",
    "neural networks deep learning", 
    "supervised learning techniques"
]

for query in test_queries:
    print(f"Query: {query}")
    results = hybrid_retrieve(query, top_k=3)
    for i, result in enumerate(results, 1):
        print(f"{i}. Hybrid Score: {result['hybrid_score']:.3f}")
        print(f"   Vector: {result['vector_score']:.3f}, BM25: {result['bm25_score']:.3f}")
        print(f"   Text: {result['node'].text[:80]}...")
    print()

##### Exercise 2 - Create a Production RAG Pipeline

Build a complete RAG pipeline that uses multiple retrieval strategies and includes evaluation metrics.

**Requirements:**
- Implement retrieval with multiple strategies
- Add query routing logic
- Include basic evaluation metrics that evaluate whether the pipeline succeeded or failed
- Handle edge cases and errors

In [None]:
class ProductionRAGPipeline:
    def __init__(self, index, llm):
        self.index = index
        self.llm = llm
        self.vector_retriever = index.as_retriever(similarity_top_k=5)
        
    def _route_query(self, question):
        """Simple query routing based on question characteristics"""
        if any(word in question.lower() for word in ["what", "explain", "describe"]):
            return "semantic"
        elif any(word in question.lower() for word in ["list", "types", "examples"]):
            return "comprehensive"
        else:
            return "semantic"
    
    def query(self, question, strategy="auto"):
        try:
            # Route query if strategy is auto
            if strategy == "auto":
                strategy = self._route_query(question)
            
            # Retrieve relevant documents
            if strategy == "semantic":
                retriever = self.vector_retriever
                top_k = 3
            elif strategy == "comprehensive":
                retriever = self.vector_retriever
                top_k = 5
            else:
                retriever = self.vector_retriever
                top_k = 3
            
            # Get relevant documents
            relevant_docs = retriever.retrieve(question)
            
            # Prepare context
            context = "\n\n".join([doc.text for doc in relevant_docs[:top_k]])
            
            # Generate response
            prompt = f"""Based on the following context, please answer the question:

Context:
{context}

Question: {question}

Answer:"""
            
            try:
                response = self.llm.complete(prompt)
                return {
                    "answer": response.text,
                    "strategy": strategy,
                    "num_docs": len(relevant_docs),
                    "status": "success"
                }
            except Exception as e:
                return {
                    "answer": f"Based on the retrieved documents: {context[:200]}...",
                    "strategy": strategy,
                    "num_docs": len(relevant_docs),
                    "status": f"llm_error: {str(e)}"
                }
                
        except Exception as e:
            return {
                "answer": "I encountered an error processing your question.",
                "strategy": strategy,
                "num_docs": 0,
                "status": f"error: {str(e)}"
            }
    
    def evaluate(self, test_queries):
        results = []
        for query in test_queries:
            result = self.query(query)
            results.append({
                "query": query,
                "result": result,
                "success": result["status"] == "success"
            })
        
        # Avoid ZeroDivisionError when no test queries are supplied
        success_rate = sum(1 for r in results if r["success"]) / len(results) if results else 0.0
        return {
            "success_rate": success_rate,
            "results": results
        }

# Test the pipeline
pipeline = ProductionRAGPipeline(lab.vector_index, llm)

test_queries = [
    "What is machine learning?",
    "List different types of learning algorithms",
    "Explain neural networks"
]

print("Testing Production RAG Pipeline:")
for query in test_queries:
    result = pipeline.query(query)
    print(f"\nQuery: {query}")
    print(f"Strategy: {result['strategy']}")
    print(f"Status: {result['status']}")
    print(f"Answer: {result['answer'][:100]}...")

# Evaluate performance
evaluation = pipeline.evaluate(test_queries)
print(f"\nPipeline Success Rate: {evaluation['success_rate']:.2%}")

### Module 2 - Build a Comprehensive RAG Application

#### Lecture 1 - Introduction to FAISS and how it compares to ChromaDB

**Facebook AI Similarity Search** or **FAISS**
- is a library created by Meta for fast vector search
- runs on a single machine, either CPU or GPU
- you write code to use it, as it has no database storage or server
- is great for custom, high-performance systems


**ChromaDB**
- is a vector DB designed for AI applications
- stores vectors and metadata together
- can run locally or as a server
- easy to use with tools like LangChain

|Features|FAISS|ChromaDB|
|------|------|------|
|Type|Library|Database|
|Deployment|Local, single node|Local, single node or distributed|
|Indexing Options|Flat, IVF, LSH, HNSW|HNSW only|
|Metadata support|None|Storage, filtering|
|LangChain/LlamaIndex support|Yes|Yes|

##### **Indexing**

**Flat Indexing**

<img src="images/FAISS-flat-indexing.png" width=400/>

- Stores embeddings of all documents
- embeds the query
- measures the distance between the query embedding and every stored vector, using dot product or Euclidean distance, in a **brute force** way
- then returns the k-nearest vectors ordered from closest to farthest

Exact (no approximation error), but slow on large collections because every vector is compared
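
The steps above can be sketched in pure Python. This is an illustration of exhaustive search, not FAISS code; the `flat_search` helper and the 2-D vectors are assumptions made for the example. FAISS's `IndexFlatL2` performs the same exhaustive comparison with optimized native code.

```python
import math

def flat_search(vectors, query, k=2):
    """Exact k-nearest-neighbour search: compare the query with every stored vector."""
    distances = [(math.dist(vec, query), idx) for idx, vec in enumerate(vectors)]
    distances.sort()  # Closest first.
    return [(idx, round(d, 3)) for d, idx in distances[:k]]

# Hypothetical 2-D embeddings; real embeddings have hundreds of dimensions.
stored = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
neighbours = flat_search(stored, query=(1.0, 1.0), k=2)
print(neighbours)
```

The cost grows linearly with the number of stored vectors, which is why the approximate indexes below exist.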

**Inverted File Index (IVF)**

<img src="images/FAISS-IVF-Index.png" width=400/>

- Clusters vectors using techniques like k-means
- forms Voronoi cells around centroids
- each cell contains vectors closest to its centroid
- when a query vector is introduced, the search is limited to vectors in the nearest cell

It is faster than a flat index, but may slightly reduce accuracy because some nearby vectors may lie in a neighboring cell that is not searched.
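
A minimal sketch of the idea, assuming the centroids have already been found: the helpers `build_ivf` and `ivf_search` and the toy data are hypothetical, while `nprobe` borrows the name FAISS uses for the number of cells to visit.

```python
import math

def build_ivf(vectors, centroids):
    """Assign each vector index to the cell of its nearest centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
        cells[nearest].append(idx)
    return cells

def ivf_search(vectors, centroids, cells, query, k=1, nprobe=1):
    """Search only the nprobe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in cells[c]]
    candidates.sort(key=lambda idx: math.dist(vectors[idx], query))
    return candidates[:k]

# Hypothetical data: centroids are fixed here; in practice k-means would learn them.
centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = [(1.0, 0.0), (4.9, 0.0), (5.4, 0.0), (9.0, 0.0)]
cells = build_ivf(vectors, centroids)

query = (5.1, 0.0)
one_cell = ivf_search(vectors, centroids, cells, query, nprobe=1)
two_cells = ivf_search(vectors, centroids, cells, query, nprobe=2)
print(one_cell, two_cells)
```

With `nprobe=1` the search returns vector 2, even though vector 1 is closer: it sits just across the cell boundary. Probing both cells recovers it, at the cost of scanning more vectors.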

**Locality-Sensitive Hashing (LSH)**
- uses hash functions to group similar vectors
- best for high-dimensional sparse data
- not the fastest or most accurate method
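
One common hashing scheme is the random-hyperplane signature, sketched below. This is an illustration only: the `signature` helper and the data are hypothetical, and the hyperplanes are fixed by hand so the example is reproducible, whereas real LSH draws them at random.

```python
from collections import defaultdict

def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(int(sum(v * h for v, h in zip(vec, plane)) >= 0) for plane in hyperplanes)

# Hyperplanes are normally drawn at random; fixed here to keep the sketch reproducible.
hyperplanes = [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0)]

vectors = {
    "a": (2.0, 1.0),
    "b": (3.0, 1.5),
    "c": (-1.0, 2.0),
    "d": (-2.0, -1.0),
}

# Vectors with the same signature land in the same bucket.
buckets = defaultdict(list)
for name, vec in vectors.items():
    buckets[signature(vec, hyperplanes)].append(name)

query = (2.5, 1.0)
candidates = buckets[signature(query, hyperplanes)]
print(candidates)
```

Only the vectors in the query's bucket are compared exactly, so vectors pointing in a similar direction are found without scanning the whole collection.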

**Hierarchical Navigable Small World (HNSW)**

<img src="images/FAISS-HNSW-Index.png" width=400/>

- top layer serves like an express highway to quickly jump to the right "region"
- lower layers then help reach the final nearest vectors
- fast and accurate, especially for large datasets
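
The layered search can be illustrated with a heavily simplified sketch. The two hand-built layers, the `greedy_search` helper, and the 1-D points are assumptions for illustration; a real HNSW index assigns points to layers randomly and keeps a list of candidates instead of a single current node.

```python
import math

def greedy_search(graph, points, query, entry):
    """Walk to whichever neighbour is closer to the query until none improves."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: math.dist(points[n], query), default=current)
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best

# Hypothetical 1-D points, written as 1-tuples so math.dist applies.
points = {i: (float(i),) for i in range(10)}

# Bottom layer: every point links to its immediate neighbours.
layer0 = {i: [j for j in (i - 1, i + 1) if j in points] for i in points}
# Top layer: a sparse subset with long-range links (the "express highway").
layer1 = {0: [4], 4: [0, 8], 8: [4]}

query = (6.8,)
coarse = greedy_search(layer1, points, query, entry=0)        # Fast hops to the right region.
nearest = greedy_search(layer0, points, query, entry=coarse)  # Local refinement.
print(coarse, nearest)
```

The top layer reaches point 8 in two hops, and the bottom layer then needs a single step to settle on point 7, instead of walking through every point from 0.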

##### **Extending FAISS with Milvus**

- FAISS offers high-performance vector search but lacks certain features like metadata support and distributed scaling
- Milvus builds on FAISS to add the missing production-ready capabilities
- Adds metadata storage and filtering for hybrid query capabilities
- Supports distributed and scalable deployments

##### **FAISS, Milvus or Chroma DB - when to use which?**

|FAISS|Chroma DB|Milvus|
|------|------|------|
|Custom high-performance systems|Fast prototyping|Distributed production-scale systems|
|GPU Acceleration|Metadata-rich queries|FAISS-style indexing with database features|
|Local-only deployments||Hybrid search (vector+metadata)|

#### Lab 1 - Semantic Similarity with FAISS

In [None]:
import sys
!{sys.executable} -m pip install faiss-cpu numpy scikit-learn
!{sys.executable} -m pip install "tensorflow>=2.0.0"
!{sys.executable} -m pip install --upgrade tensorflow-hub

In [43]:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import faiss
import re
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from pprint import pprint

# Suppressing warnings
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

**The 20 Newsgroups Dataset**

In this project, we'll be using the 20 Newsgroups dataset, a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups. It's a go-to dataset in the NLP community because it presents real-world challenges.

**What is the 20 Newsgroups Dataset?**

- **Diverse Topics**: The dataset spans 20 different topics, from sports and science to politics and religion, reflecting the diverse interests of newsgroup members.
- **Natural Language**: It contains actual discussions, with all the nuances of human language, making it ideal for semantic search.
- **Prevalence of Context**: The conversations within it require understanding of context to differentiate between the topics effectively.

**How are we using the 20 Newsgroups Dataset?**

1. **Exploring Data**: We'll start by loading the dataset and exploring its structure to understand the kind of information it holds.
2. **Preprocessing**: We'll clean the text data, removing any unwanted noise that could affect our semantic analysis.
3. **Vectorization**: We'll then use the Universal Sentence Encoder to transform this text into numerical vectors that capture the essence of each document.
4. **Semantic Search Implementation**: Finally, we'll use FAISS to index these vectors, allowing us to perform fast and efficient semantic searches across the dataset.

By working with the 20 Newsgroups dataset, you'll gain hands-on experience with real-world data and the end-to-end process of building a semantic search engine.

In [44]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

In [None]:
pprint(list(newsgroups_train.target_names))

In [None]:
# Display the first 3 posts from the dataset
for i in range(3):
    print(f"Sample post {i+1}:\n")
    pprint(newsgroups_train.data[i])
    print("\n" + "-"*80 + "\n")

**Pre-processing Data**

In this section, we focus on preparing the text data from the 20 Newsgroups dataset for our semantic search engine. Preprocessing is a critical step to ensure the quality and consistency of the data before it's fed into the Universal Sentence Encoder.

**Steps in Preprocessing:**

1. **Fetching Data**: 
   - We load the complete 20 Newsgroups dataset using `fetch_20newsgroups` from `sklearn.datasets`. 
   - `documents = newsgroups.data` stores all the newsgroup documents in a list.

2. **Defining the Preprocessing Function**:
   - The `preprocess_text` function is designed to clean each text document. Here's what it does to every piece of text:
     - **Removes Email Headers**: Strips off lines that start with 'From:' as they usually contain metadata like email addresses.
     - **Eliminates Email Addresses**: Finds patterns resembling email addresses and removes them.
     - **Strips Punctuation and Numbers**: Removes every character except letters and whitespace, so that only the words remain.
     - **Converts to Lowercase**: Standardizes the text by converting all characters to lowercase, ensuring uniformity.
     - **Trims Excess Whitespace**: Cleans up any extra spaces, tabs, or line breaks.

3. **Applying Preprocessing**:
   - We iterate over each document in the `documents` list and apply our `preprocess_text` function.
   - The cleaned documents are stored in `processed_documents`, ready for further processing.

By preprocessing the text data in this way, we reduce noise and standardize the text, which is essential for achieving meaningful semantic analysis in later steps.
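
To make the effect of each step concrete, here is a small trace that applies the same five regular-expression steps one at a time to a made-up message (the sample text and the `steps` list are illustrative and not part of the lab).

```python
import re

# Made-up message used only for this illustration
sample = (
    "From: jane@example.com\n"
    "Subject: Bike for sale!\n"
    "Contact me at jane@example.com, price is $500.\n"
)

# The same operations as in preprocess_text, applied one by one
steps = [
    ("remove 'From:' lines", lambda t: re.sub(r'^From:.*\n?', '', t, flags=re.MULTILINE)),
    ("remove email addresses", lambda t: re.sub(r'\S*@\S*\s?', '', t)),
    ("keep letters and whitespace", lambda t: re.sub(r'[^a-zA-Z\s]', '', t)),
    ("lowercase", lambda t: t.lower()),
    ("collapse whitespace", lambda t: re.sub(r'\s+', ' ', t).strip()),
]

text = sample
for name, step in steps:
    text = step(text)
    print(f"{name:>28}: {text!r}")

# Final result: 'subject bike for sale contact me at price is'
```

Note that the word "subject" survives: only lines starting with `From:` are removed, so other header lines lose their punctuation but keep their words.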


In [47]:
newsgroups = fetch_20newsgroups(subset='all')
documents = newsgroups.data

# Basic preprocessing of text data
def preprocess_text(text):
    # Remove email headers
    text = re.sub(r'^From:.*\n?', '', text, flags=re.MULTILINE)
    # Remove email addresses
    text = re.sub(r'\S*@\S*\s?', '', text)
    # Remove punctuations and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove excess whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Preprocess each document
processed_documents = [preprocess_text(doc) for doc in documents]

In [None]:
# Choose a sample post to display
sample_index = 0  # for example, the first post in the dataset

# Print the original post
print("Original post:\n")
print(newsgroups_train.data[sample_index])
print("\n" + "-"*80 + "\n")

# Print the preprocessed post
print("Preprocessed post:\n")
print(preprocess_text(newsgroups_train.data[sample_index]))
print("\n" + "-"*80 + "\n")

**Universal Sentence Encoder**

After preprocessing the text data, the next step is to transform this cleaned text into numerical vectors using the Universal Sentence Encoder (USE). These vectors capture the semantic essence of the text.

**Loading the USE Module:**

- We use TensorFlow Hub (`hub`) to load the pre-trained Universal Sentence Encoder.
- `embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")` fetches the USE module, making it ready for vectorization.

**Defining the Embedding Function:**

- The `embed_text` function is defined to take a piece of text as input and return its vector representation.
- Inside the function, `embed(text)` converts the text into a high-dimensional vector, capturing the nuanced semantic meaning.
- `.numpy()` is used to convert the result from a TensorFlow tensor to a NumPy array, which is a more versatile format for subsequent operations.

**Vectorizing Preprocessed Documents:**

- We then apply the `embed_text` function to each document in our preprocessed dataset, `processed_documents`.
- `np.vstack([...])` stacks the vectors vertically to create a 2D array, where each row represents a document.
- The resulting array `X_use` holds the vectorized representations of all the preprocessed documents, ready to be used for semantic search indexing and querying.

By vectorizing the text with USE, we've now converted our textual data into a format that can be efficiently processed by machine learning algorithms, setting the stage for the next step: indexing with FAISS.

In [49]:
# Load the Universal Sentence Encoder's TF Hub module
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Function to generate embeddings
def embed_text(text):
    return embed(text).numpy()

# Generate embeddings for each preprocessed document
X_use = np.vstack([embed_text([doc]) for doc in processed_documents])

**Indexing with FAISS**

With our documents now represented as vectors using the Universal Sentence Encoder, the next step is to use FAISS (Facebook AI Similarity Search) for efficient similarity searching.

**Creating a FAISS Index:**

- We first determine the dimension of our vectors from `X_use` using `X_use.shape[1]`.
- A FAISS index (`index`) is created specifically for L2 distance (Euclidean distance) using `faiss.IndexFlatL2(dimension)`.
- We add our document vectors to this index with `index.add(X_use)`. This step effectively creates a searchable space for our document vectors.
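
For reference, the distance being compared can be written out. For a query vector $q$ and a document vector $x$ with $D$ dimensions, and to the best of my knowledge of FAISS, `IndexFlatL2` reports the squared distance (it skips the square root, which does not change the ranking):

$$
\lVert q - x \rVert_2^2 = \sum_{i=1}^{D} (q_i - x_i)^2
$$

If both vectors have unit length, this is directly related to the dot product, so ranking by smallest L2 distance and by largest dot product give the same order:

$$
\lVert q - x \rVert_2^2 = \lVert q \rVert_2^2 + \lVert x \rVert_2^2 - 2\, q \cdot x = 2 - 2\, q \cdot x
$$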

**Choosing the Right Index:**

- In this project, we use `IndexFlatL2` for its simplicity and effectiveness in handling small to medium-sized datasets.
- FAISS offers a variety of indexes tailored for different use cases and dataset sizes. Depending on your specific needs and the complexity of your data, you might consider other indexes for more efficient searching.
- For larger datasets or more advanced use cases, indexes like `IndexIVFFlat`, `IndexIVFPQ`, and others can provide faster search times and reduced memory usage. Explore more at [FAISS indexes wiki](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes).


In [51]:
dimension = X_use.shape[1]
index = faiss.IndexFlatL2(dimension)  # Creating a FAISS index
index.add(X_use)  # Adding the document vectors to the index

**Querying with FAISS**

**Defining the Search Function:**

- The `search` function is designed to find documents that are semantically similar to a given query.
- It preprocesses the query text using the `preprocess_text` function to ensure consistency.
- The query text is then converted to a vector using `embed_text`.
- FAISS performs a search for the nearest neighbors (`k`) to this query vector in our index.
- It returns the distances and indices of these nearest neighbors.

**Executing a Query and Displaying Results:**

- We test our search engine with an example query (e.g., "motorcycle").
- The `search` function returns the indices of the documents in the index that are most similar to the query.
- For each result, we display:
   - The ranking of the result (based on distance).
   - The distance value itself, indicating how close the document is to the query.
   - The actual text of the document. We display both the preprocessed and original versions of each document for comparison.

This functionality showcases the practical application of semantic search: retrieving information that is contextually relevant to the query, not just based on keyword matching. The displayed results will give a clear idea of how our semantic search engine interprets and responds to natural language queries.


In [None]:
# Function to perform a query using the Faiss index
def search(query_text, k=5):
    # Preprocess the query text
    preprocessed_query = preprocess_text(query_text)
    # Generate the query vector
    query_vector = embed_text([preprocessed_query])
    # Perform the search
    distances, indices = index.search(query_vector.astype('float32'), k)
    return distances, indices

# Example Query
query_text = "motorcycle"
distances, indices = search(query_text)

# Display the results
for i, idx in enumerate(indices[0]):
    # Ensure that the displayed document is the preprocessed one
    print(f"Rank {i+1}: (Distance: {distances[0][i]})\n{processed_documents[idx]}\n")

In [None]:
# Display the results
for i, idx in enumerate(indices[0]):
    # Displaying the original (unprocessed) document corresponding to the search result
    print(f"Rank {i+1}: (Distance: {distances[0][i]})\n{documents[idx]}\n")

#### Reading - Hierarchical Navigable Small World (HNSW)

[HNSW Reading](readings/HNSW.pdf)