# IBM RAG and Agentic AI

## Course 3 - Vector Databases for RAG: An Introduction

### Module 1 - Introduction to Vector Databases and Chroma DB

#### Lecture 1 - Vector Database Concepts

- Vector DBs can be used to group items, classify items and suggest relationships among items
- Vector DBs can be used
    - to store complex data types (social likes, geospatial data, genomic data, etc.)
      <img src='images/200754.008_Vector-DB-reading-image1.png' width=600/>
    - to perform similarity searches
    - across diverse domains such as biology, healthcare, e-commerce, social media and traffic planning
    - to support machine learning
- Traditional DBs store data as tables, whereas vector DBs store data as high-dimensional vectors that have both magnitude and direction. Each dimension corresponds to a different attribute. For example, a book can be stored in a vector DB as [1, 300, 2024, 4.2]
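
As a minimal sketch of the idea, the snippet below stores three books as four-dimensional vectors and ranks them by distance to one of them. The attribute order (genre id, page count, publication year, rating) and the values for `book_b` and `book_c` are assumptions made for illustration; the notes only give the raw vector.

```python
import math

# Hypothetical attribute order: [genre_id, pages, year, rating]
books = {
    "book_a": [1, 300, 2024, 4.2],
    "book_b": [1, 320, 2023, 4.0],
    "book_c": [7, 900, 1954, 4.8],
}

def euclidean(u, v):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

query = books["book_a"]

# Rank the other books by distance to the query, closest first.
ranked = sorted(
    (name for name in books if name != "book_a"),
    key=lambda name: euclidean(query, books[name]),
)
print(ranked)  # ['book_b', 'book_c']
```

Raw attributes on very different scales (pages versus rating) let the largest one dominate the distance, which is one reason real systems usually scale features or use learned embeddings instead.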

#### Lecture 2 - Traditional vs Vector Databases

|Function|Traditional databases|Vector databases|
|------|------|------|
|Data Representation|Traditional databases organize data in a structured format using tables, rows, and columns, ideal for relational data|Vector databases represent data as multi-dimensional vectors, efficiently encoding complex and unstructured data like images, text, and sensor data.|
|Data Search and Retrieval|SQL queries are suited for traditional databases with structured data.|Vector databases specialize in similarity searches and retrieving vectorized data, facilitating tasks like image retrieval, recommendation systems, and anomaly detection.|
|Indexing|Traditional databases employ indexing methods like B-trees for efficient data retrieval.|Vector databases use indexing structures like metric trees and hashing suited for high-dimensional spaces, enhancing nearest-neighbor searches and similarity assessments.|
|Scalability|Scaling traditional databases can be challenging, often requiring resource augmentation or data sharding.|Vector databases are designed for scalability, especially in handling large datasets and similarity searches, using distributed architectures for horizontal scaling.|
|Applications|Traditional databases are pivotal in business applications and transactional systems where structured data is processed.|Vector databases shine in analyzing vast datasets, supporting fields like scientific research, natural language processing, and multimedia analysis.|
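
The retrieval difference in the table can be sketched in a few lines. The records and their two-dimensional embeddings below are made-up values, and the linear scan stands in for the index a real vector database would use.

```python
import numpy as np

# Toy records: an id, a title, and a 2-D embedding (hypothetical values).
records = [
    {"id": 1, "title": "red apple", "vec": np.array([0.9, 0.1])},
    {"id": 2, "title": "green apple", "vec": np.array([0.8, 0.3])},
    {"id": 3, "title": "sports car", "vec": np.array([0.1, 0.95])},
]

# Traditional-style retrieval: exact match on a structured field.
exact = [r["id"] for r in records if r["title"] == "red apple"]

# Vector-style retrieval: rank every record by L2 distance to a query vector.
query = np.array([0.9, 0.15])
nearest = sorted(records, key=lambda r: np.linalg.norm(r["vec"] - query))
nearest_ids = [r["id"] for r in nearest]

print(exact)        # [1]
print(nearest_ids)  # [1, 2, 3]
```

The exact-match query returns only rows that satisfy the predicate, while the similarity query returns every row in order of closeness, so a cut-off (top-k or a distance threshold) is normally applied.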

#### Lecture 3 - Vector Database Types

- **In-memory** Vector DBs e.g. *RedisAI, Torchserve* store vectors in RAM hence fast but limited in size
- **Disk-Based** Vector DBs e.g. *Annoy, Milvus, ScaNN* store vectors on disk, use compression and indexing and are suitable for large datasets
- **Distributed** Vector DBs e.g. *FAISS, ElasticSearch+, Dask-ML* spread data across multiple nodes/servers, which gives horizontal scaling and fault tolerance and makes them suitable for large datasets with fast retrieval
- **Graph Based** Vector DBs e.g. *Neo4J, Amazon Neptune, TigerGraph* model data as a graph with nodes and edges representing attributes. They are great at capturing complex relationships and graph analytics
- **Time Series** Vector DBs e.g. *InfluxDB, TimescaleDB, Prometheus* represent data collected over time as vectors and are good for identifying temporal patterns and anomalies

Vector DBs can also be classified as dedicated vector DBs or DBs that support vector search

**Dedicated Vector DBs**
- use specialised data structures like inverted indexes, product quantization and Locality-Sensitive Hashing (LSH)
- support vector operations like similarity search, nearest neighbour search and distance calculations
- provide scalability through clustering or distributed nodes
- deliver speed through optimized algorithms and data structures
- are customisable by changing parameters of indexing and searching as per application needs
- Examples are FAISS, Annoy and Milvus
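
To make Locality-Sensitive Hashing less abstract, here is a minimal sketch of one common LSH family, random-hyperplane hashing for cosine similarity. It is an illustration only and is not meant to describe how FAISS, Annoy or Milvus implement their indexes; the dimension, number of hyperplanes and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_planes = 8, 6

# Random hyperplanes through the origin; each one contributes one bit of the hash.
planes = rng.normal(size=(n_planes, dim))

def lsh_signature(vec):
    """Return a tuple of bits: which side of each hyperplane the vector lies on."""
    return tuple((planes @ vec > 0).astype(int).tolist())

base = rng.normal(size=dim)
scaled = 2.0 * base   # same direction, different magnitude
opposite = -base      # opposite direction

sig_base = lsh_signature(base)
sig_scaled = lsh_signature(scaled)
sig_opposite = lsh_signature(opposite)

print(sig_base == sig_scaled)  # True: the hash depends on direction only
print(sum(a != b for a, b in zip(sig_base, sig_opposite)))  # 6: every bit flips
```

Vectors that point in similar directions tend to share most bits, so candidates can be fetched from matching hash buckets instead of scanning the whole collection.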

**Databases that support Vector Search**
- are regular DBs or data processing frameworks that have tools and add-ons to allow users to run vector searches alongside other queries
- Store data as part of their data model as BLOBs, Arrays or UDTs.
- Allow standard and custom indexing to organise data
- Have add-on libraries and plugins to support vector operations
- Not as optimized or fast as dedicated vector DBs
- Examples are SingleStore (works with watsonx.ai), ElasticSearch, PostgreSQL, MySQL, RedisAI, MongoDB and Apache Cassandra
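
The "store vectors as BLOBs" approach can be sketched with SQLite from the Python standard library. The table name, labels and three-dimensional embeddings are hypothetical, and the search is a brute-force scan in Python, which illustrates why this route is usually slower than a dedicated vector index.

```python
import sqlite3

import numpy as np

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT, embedding BLOB)")

# Hypothetical 3-D embeddings, serialised to raw float32 bytes.
rows = [
    (1, "cat", np.array([0.9, 0.1, 0.0], dtype=np.float32)),
    (2, "kitten", np.array([0.8, 0.2, 0.1], dtype=np.float32)),
    (3, "truck", np.array([0.0, 0.1, 0.9], dtype=np.float32)),
]
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(item_id, label, vec.tobytes()) for item_id, label, vec in rows],
)

def most_similar(query, k=2):
    """Brute-force cosine similarity over every stored row (no vector index)."""
    scored = []
    for label, blob in conn.execute("SELECT label, embedding FROM items"):
        vec = np.frombuffer(blob, dtype=np.float32)
        score = float(vec @ query / (np.linalg.norm(vec) * np.linalg.norm(query)))
        scored.append((score, label))
    return [label for _, label in sorted(scored, reverse=True)[:k]]

result = most_similar(np.array([1.0, 0.0, 0.0], dtype=np.float32))
print(result)  # ['cat', 'kitten']
```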

#### Lecture 4 - Applications of Vector DBs

1. **Image and Video Analysis**

|Task|Capability|Uses|
|------|------|------|
|Feature Extraction & Representation|Store High-Dimensional Feature Vectors|Displays aspects of images, such as color histograms, texture descriptions or deep learning embeddings|
|Similarity Searches|Store Feature Vectors|Locate images, Summarize videos, and suggest images and videos based on content|
|Process Real-time data|Provide horizontal scaling for real-time storage|Perform video surveillance, object recognition, and live event analysis|

2. **Recommendation Systems**

|Task|Capability|Uses|
|------|------|------|
|Embedding Storage and Nearest Neighbour Search|Incorporate embeddings or numerical representations of items or entities generated by a recommendation system|Access the vector's likes and traits, Locate the vector's closest neighbours for improved personalized suggestions|
|Deliver performance improvement and scalability|Provide scalability to handle additional searches and vectors. Improve query processing and indexing structure|Deliver fast, scalable recommendation services for large numbers of concurrent users|
|Provide cross-domain suggestions|Store embeddings and carry out cross-domain suggestions|Enhance the completeness of recommendation systems|

3. **Geospatial analysis and location-based services**

|Task|Capability|Uses|
|------|------|------|
|Efficiently store and index data|Use indexing methods like R-tree or quadtree; store geospatial data like addresses, polygons and GPS locations|Deliver spatial queries like closeness searches, range queries and spatial joins, for GPS information and other mapping needs|
|Provide location-based suggestions|Combine geospatial data with user preferences and location|Deliver recommendations for nearby events, services and places of interest|
|Deliver real-time geospatial analytics|Process streaming data in real time; group items together spatially; recognize spatial patterns|Power apps like tracking vehicles, managing fleets, dynamic routing, finding hotspots|

4. **Marketing and social media insights**

|Task|Capability|Uses|
|------|------|------|
|Provide distributed storage and parallel processing for horizontal scalability|Spread data and queries across multiple nodes or groups|Process big data and handle simultaneous queries such as SEO calculations|
|Reduce latency and boost overall speed|Use optimized caching and query execution plans|Obtain trending analytics faster|
|Adjust to changing task needs|Support auto-scaling and dynamic resource allocation|Scale hardware and cloud resource usage for the best performance and lower costs|
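
Of the use cases above, the geospatial closeness search and range query are the easiest to make concrete. The sketch below uses the haversine formula and a linear scan; the place names and approximate coordinates are illustrative assumptions, and a real system would use a spatial index such as an R-tree rather than checking every point.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical points of interest: name -> (latitude, longitude)
places = {
    "museum": (48.8606, 2.3376),
    "tower": (48.8584, 2.2945),
    "airport": (49.0097, 2.5479),
}

def within_radius(lat, lon, radius_km):
    """Range query by linear scan: names of all places within radius_km, sorted by name."""
    return sorted(
        name
        for name, (plat, plon) in places.items()
        if haversine_km(lat, lon, plat, plon) <= radius_km
    )

nearby = within_radius(48.8566, 2.3522, radius_km=5)
print(nearby)  # ['museum', 'tower']
```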

#### Lecture 4 - Similarity Search

For any two vectors $\vec{a}$ and $\vec{b}$ 
- the **L2 distance** or Euclidean distance $\sqrt{\sum_i (a_i-b_i)^2}$ is a **distance** metric
- the **dot product** $\sum_i a_i b_i$, equivalently $\lVert a \rVert \lVert b \rVert \cos(\alpha)$ where $\alpha$ is the angle between the vectors, is a **similarity** metric. However, its negative can be used as a distance metric (larger dot product $\implies$ smaller distance)
- the **cosine similarity** cosine_similarity(a,b) $= \frac{a \cdot b}{\lVert a \rVert \lVert b \rVert} = \frac{a}{\lVert a \rVert} \cdot \frac{b}{\lVert b \rVert} = \text{norm}(a) \cdot \text{norm}(b)$ is a **similarity** metric and (1-cosine_similarity) is a **distance** metric
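
A small worked example may help tie the three formulas together. The vectors `a` and `b` are arbitrary values chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# L2 distance: sqrt((3-4)^2 + (4-3)^2) = sqrt(2)
l2_distance = np.sqrt(np.sum((a - b) ** 2))

# Dot product: 3*4 + 4*3 = 24
dot_product = np.sum(a * b)

# Cosine similarity: 24 / (5 * 5) = 0.96, so cosine distance is about 0.04
cosine_similarity = dot_product / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_distance = 1 - cosine_similarity

print(l2_distance, dot_product, cosine_similarity, cosine_distance)
```

Both vectors have length 5, so here the dot product is simply 25 times the cosine similarity.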

|Metric|Sensitive to Magnitude|Normalised|Best For|
|------|------|------|------|
|L2 Distance|$\checkmark$ Yes|$\times$ No|Spatial Data, Clustering|
|Cosine Distance|$\times$ No|$\checkmark$ Yes|Text, Embeddings, NLP|
|Dot Product|$\checkmark$ Yes|$\times$ No|Neural Networks, recommender systems|

L2 distance works well for continuous, lower-dimensional data where magnitude matters. 

Cosine distance excels with high-dimensional, sparse data where direction is more important than magnitude. 

Dot product offers computational efficiency and is useful when both magnitude and direction contribute to similarity. 
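
The "sensitive to magnitude" column can be checked with a quick experiment: scale one vector by a constant and see which metrics change. The vectors and the helper names `l2` and `cosine_distance` are illustrative, not part of the course code.

```python
import numpy as np

def l2(u, v):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(u - v))

def cosine_distance(u, v):
    """One minus the cosine of the angle between two vectors."""
    return float(1 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0])
b = np.array([2.0, 1.0])
b_scaled = 10 * b  # same direction as b, ten times the magnitude

# Each entry holds (metric against b, metric against b_scaled).
changes = {
    "l2": (l2(a, b), l2(a, b_scaled)),
    "cosine": (cosine_distance(a, b), cosine_distance(a, b_scaled)),
    "dot": (float(a @ b), float(a @ b_scaled)),
}

for name, (before, after) in changes.items():
    print(f"{name}: {before:.3f} -> {after:.3f}")
```

The L2 distance and the dot product both change when `b` is scaled, while the cosine distance stays at about 0.2 because it depends only on the angle between the vectors.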

In [None]:
!pip install sentence-transformers==4.1.0 | tail -n 1

In [None]:
import math

import numpy as np
import scipy.spatial.distance  # explicit submodule import so scipy.spatial.distance.cdist is always available
import torch
from sentence_transformers import SentenceTransformer

In [None]:
# Example documents
documents = [
    'Bugs introduced by the intern had to be squashed by the lead developer.',
    'Bugs found by the quality assurance engineer were difficult to debug.',
    'Bugs are common throughout the warm summer months, according to the entomologist.',
    'Bugs, in particular spiders, are extensively studied by arachnologists.'
]

In [None]:
# Load a pre-trained model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [None]:
# Generate embeddings
embeddings = model.encode(documents)

In [None]:
embeddings.shape

In [None]:
embeddings

In [None]:
def euclidean_distance_fn(vector1, vector2):
    squared_sum = sum((x - y) ** 2 for x, y in zip(vector1, vector2))
    return math.sqrt(squared_sum)

In [None]:
euclidean_distance_fn(embeddings[0], embeddings[1])

In [None]:
euclidean_distance_fn(embeddings[1], embeddings[0])

In [None]:
l2_dist_manual = np.zeros([4,4])
for i in range(embeddings.shape[0]):
    for j in range(embeddings.shape[0]):
        l2_dist_manual[i,j] = euclidean_distance_fn(embeddings[i], embeddings[j])

l2_dist_manual

In [None]:
l2_dist_manual[0,1]

In [None]:
l2_dist_manual[1,0]

In [None]:
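# Compute only the upper triangle and mirror it, since L2 distance is symmetric.
# The diagonal stays 0 because the distance from a vector to itself is 0.
l2_dist_manual_improved = np.zeros([4,4])
for i in range(embeddings.shape[0]):
    for j in range(embeddings.shape[0]):
        if (i>j):
            # Row j was filled on an earlier pass, so [j,i] is already available
            l2_dist_manual_improved[i,j] = l2_dist_manual_improved[j,i]
        elif (i<j):
            l2_dist_manual_improved[i,j] = euclidean_distance_fn(embeddings[i], embeddings[j])
l2_dist_manual_improved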

In [None]:
l2_dist_scipy = scipy.spatial.distance.cdist(embeddings, embeddings, 'euclidean')
l2_dist_scipy

In [None]:
np.allclose(l2_dist_manual, l2_dist_scipy)
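
The nested loops can also be replaced by NumPy broadcasting. The sketch below uses a small random matrix as a stand-in for `embeddings` so that it runs without downloading the model; the shape `(4, 8)` and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.normal(size=(4, 8))  # stand-in for the embeddings matrix

# Pairwise differences via broadcasting:
# (4, 1, 8) - (1, 4, 8) -> (4, 4, 8), one difference vector per pair of rows.
diffs = vectors[:, None, :] - vectors[None, :, :]

# Square, sum over the feature axis, and take the square root.
l2_broadcast = np.sqrt((diffs ** 2).sum(axis=-1))

# The same matrix using NumPy's norm along the last axis.
l2_norm = np.linalg.norm(diffs, axis=-1)

print(np.allclose(l2_broadcast, l2_norm))  # True
```

Broadcasting builds an intermediate array of shape `(n, n, d)`, so for large collections a library routine such as `cdist` is usually the more memory-friendly choice.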

In [None]:
def dot_product_fn(vector1, vector2):
    return sum(x * y for x, y in zip(vector1, vector2))

In [None]:
dot_product_fn(embeddings[0], embeddings[1])

In [None]:
dot_product_manual = np.empty([4,4])
for i in range(embeddings.shape[0]):
    for j in range(embeddings.shape[0]):
        dot_product_manual[i,j] = dot_product_fn(embeddings[i], embeddings[j])

dot_product_manual

In [None]:
# Matrix multiplication operator
dot_product_operator = embeddings @ embeddings.T
dot_product_operator

In [None]:
np.allclose(dot_product_manual, dot_product_operator, atol=1e-05)

In [None]:
# Equivalent to `np.matmul()` if both arrays are 2-D:
np.matmul(embeddings,embeddings.T)

In [None]:
# `np.dot` returns an identical result, but `np.matmul` is recommended if both arrays are 2-D:
np.dot(embeddings,embeddings.T)

In [None]:
dot_product_distance = -dot_product_manual
dot_product_distance
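
One way to use a negated dot-product matrix is to sort each row to get a nearest-first ordering. The three vectors below are made-up values chosen to show a quirk of this metric: with unnormalised vectors, an item's "nearest" neighbour is not necessarily itself.

```python
import numpy as np

# Hypothetical vectors: b points the same way as a but is twice as long.
vectors = np.array([
    [1.0, 1.0],   # a
    [2.0, 2.0],   # b
    [1.0, -1.0],  # c
])

similarity = vectors @ vectors.T  # pairwise dot products
distance = -similarity            # larger dot product means smaller distance

# For each row, column indices ordered from smallest distance upward.
# kind="stable" keeps tied entries in their original column order.
order = np.argsort(distance, axis=1, kind="stable")
print(order.tolist())  # [[1, 0, 2], [1, 0, 2], [2, 0, 1]]
```

Row 0 ranks `b` ahead of `a` itself because `a @ b = 4` exceeds `a @ a = 2`, which is the magnitude sensitivity noted in the comparison table.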

In [None]:
# L2 norms
l2_norms = np.sqrt(np.sum(embeddings**2, axis=1))
l2_norms

In [None]:
# L2 norms reshaped
l2_norms_reshaped = l2_norms.reshape(-1,1)
l2_norms_reshaped

In [None]:
normalized_embeddings_manual = embeddings/l2_norms_reshaped
normalized_embeddings_manual
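
Two properties of normalised vectors are worth checking: every row has unit length, and the dot product of two normalised rows equals their cosine similarity. The sketch below also verifies the identity $\lVert u - v \rVert^2 = 2 - 2\,(u \cdot v)$ for unit vectors. It uses a random matrix as a stand-in for `embeddings`; the shape and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
vectors = rng.normal(size=(4, 8))  # stand-in for the embeddings matrix

# keepdims=True yields shape (4, 1), so no separate reshape step is needed.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit = vectors / norms

# Dot products of unit vectors are cosine similarities.
cosine_sim = unit @ unit.T

# Squared L2 distances between the unit vectors.
l2_sq = ((unit[:, None, :] - unit[None, :, :]) ** 2).sum(axis=-1)

print(np.allclose(np.linalg.norm(unit, axis=1), 1.0))  # True
print(np.allclose(l2_sq, 2 - 2 * cosine_sim))          # True
```

Because of this identity, ranking unit vectors by L2 distance, by cosine distance, or by negated dot product gives the same ordering.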