# 🗃️ Vector Database: The Heart of Educational RAG

This notebook shows you how to build and use vector databases that power intelligent educational platforms like Canopy.

You've already deployed Milvus through GitOps - now let's see how it works! We'll convert course content into searchable vectors and demonstrate how students can find relevant materials by meaning, not just keywords.

Let's build your vector database! 🚀

## 📦 Install Required Packages

Install the Python packages needed for this lab.

In [None]:
# Step 1: Install necessary libraries (run in a cell if needed)
!pip install -q pymilvus==2.5.0 sentence-transformers==3.0.1 scikit-learn==1.4.2 matplotlib==3.8.4 marshmallow==3.20.2 boto3==1.34.103 docling==2.39.0 huggingface-hub==0.33.2 langchain-core==0.3.68 langchain-openai==0.3.27


## 📚 Import Libraries

Import the tools we'll use for vector database operations.

In [None]:
# Import libraries for vector database operations
from pymilvus import connections, utility, Collection, CollectionSchema, FieldSchema, DataType
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

## 🗃️ Populate Your Vector Database

Let's connect to Milvus and set up a collection to store course content as searchable vectors.

We'll use the `all-MiniLM-L6-v2` embedding model, which produces 384-dimensional vectors. The dimension of the collection's vector field must match the model's output dimension exactly! Many other embedding models are available on Hugging Face - check the **[MTEB embedding leaderboard](https://huggingface.co/spaces/mteb/leaderboard)** to compare them.

`all-MiniLM-L6-v2` isn't the top performer, but it's one of the best in its size class and downloads/runs quickly for this lab.

## 🔗 Connect to Your Milvus Database

Connect to the Milvus instance you deployed via GitOps.

‼️⚠️ IMPORTANT ⚠️‼️

Add your username and cluster domain that were shared with you. This connects to your Milvus instance in the `{username}-test` namespace.

### 🖼️ Optional: Explore Milvus Attu Web Interface

Before we start coding, you can visually explore your empty Milvus database using Attu (Milvus web UI):

**Attu URL**: `https://milvus-test-attu-{username}-test.{cluster_domain}`

Replace `{username}` and `{cluster_domain}` with your values. You'll see an empty database initially - perfect for understanding the starting point!
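
If you'd rather not edit the URL by hand, the template above can be filled in with a few lines of Python. This is a minimal sketch: the helper name `attu_url` and the example values are hypothetical, and the placeholder check simply assumes unreplaced values still start with `<`.

```python
def attu_url(username: str, cluster_domain: str) -> str:
    """Fill the Attu URL template from this notebook with your values."""
    for name, value in (("username", username), ("cluster_domain", cluster_domain)):
        # Catch placeholders such as "<USER_NAME>" that were never replaced
        if not value or value.startswith("<"):
            raise ValueError(f"{name} still looks like a placeholder: {value!r}")
    return f"https://milvus-test-attu-{username}-test.{cluster_domain}"


# Hypothetical example values, not real credentials
print(attu_url("user1", "apps.example.com"))
```

The same check is handy before connecting to Milvus, since a forgotten placeholder otherwise only shows up as a connection error.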

In [None]:
# IMPORTANT! Add your username and cluster domain here
username = "<USER_NAME>"
cluster_domain = "<CLUSTER_DOMAIN>"

# Define collection name for our educational content vectors
collection_name = "vectordb_collection"

connections.connect(
    uri=f"http://milvus-test.{username}-test.svc.cluster.local:19530",
    alias="default"
)

## 🧹 Clean Up Previous Runs

Remove any existing collections to start fresh.

In [None]:
# Remove existing collection if it exists
if utility.has_collection(collection_name):
    utility.drop_collection(collection_name)

print(f"Collection list after cleanup: {utility.list_collections()}")

## 📋 Define Database Schema

Vector databases need a schema just like traditional databases. Our schema defines what each record contains: a unique ID and a vector embedding.

This structure is essential for storing and managing vector embeddings efficiently - Milvus needs to know exactly what fields to expect and their data types.
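
To make "exactly what fields to expect" concrete, here is a small local pre-check you could run on a row before inserting it. It is only a sketch of the idea, not Milvus's own validation: the helper `validate_row` is hypothetical, and it assumes the two-field layout (`id`, `embedding`) and the 384-dimension model used in this notebook.

```python
EXPECTED_DIM = 384  # output size of all-MiniLM-L6-v2, as stated in this notebook


def validate_row(row: dict, dim: int = EXPECTED_DIM) -> None:
    """Raise ValueError if a row does not fit the two-field schema."""
    # With a strict schema, a row carries exactly these two fields
    if set(row) != {"id", "embedding"}:
        raise ValueError(f"unexpected fields: {sorted(row)}")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(row["id"], int) or isinstance(row["id"], bool):
        raise ValueError("id must be an integer")
    vec = row["embedding"]
    if len(vec) != dim:
        raise ValueError(f"embedding has {len(vec)} values, expected {dim}")
    if not all(isinstance(x, float) for x in vec):
        raise ValueError("embedding values must be floats")


validate_row({"id": 0, "embedding": [0.0] * EXPECTED_DIM})  # passes silently
```

Checking locally gives a readable error message that names the offending field, which is often easier to act on than a server-side rejection.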

In [None]:
# Databases need a schema that defines the structure of each record
# Our schema has two fields: an identifier and a vector embedding

# Define the primary key field for unique record identification
id_field = FieldSchema(
    name="id",
    dtype=DataType.INT64,
    is_primary=True,
    auto_id=False
)

# Specify embedding model and its output dimension
embedding_model = "all-MiniLM-L6-v2"  # Hugging Face model name
embedding_dim = 384  # Vector size must match model output

# Define the vector field to hold embedding values
embedding_field = FieldSchema(
    name="embedding",
    dtype=DataType.FLOAT_VECTOR,
    dim=embedding_dim
)

# Assemble collection schema combining ID and embedding fields
schema = CollectionSchema(
    fields=[id_field, embedding_field],
    description="Educational content vectors",
    enable_dynamic_field=False  # Strict schema enforcement
)

## 🏗️ Create the Collection

A collection in Milvus is like a table in a traditional database - it's where your embedding vectors will be stored, indexed, and queried.

We'll configure it with strong consistency to ensure you always get the most up-to-date data when searching for educational content.

In [None]:
# Create the Milvus collection with our schema and configuration
collection = Collection(
    name=collection_name, 
    schema=schema, 
    using='default',  # Use default connection
    shards_num=2,  # Number of data shards for distribution
    consistency_level="Strong"  # Ensures latest data is always returned
)

print(f"Collection: {collection.schema}\n")

print(f"Collection list: {utility.list_collections()}")

## 💾 Store Content in Vector Database

Time to save our course content vectors in Milvus for searching!

In [None]:
# Load embedding model from Hugging Face
model = SentenceTransformer(embedding_model)

sentences = ["Introduction to Machine Learning covers supervised learning algorithms.",
             "Machine Learning fundamentals include supervised algorithm techniques.",
             "Computer Science department offers advanced database systems courses.",
             "Students can access research databases through the library portal."]

embeddings = model.encode(sentences)

In [None]:
# Prepare educational content vectors for database insertion
data = [
    {"id": i, "embedding": vec.tolist()}  # Convert numpy array to list
    for i, vec in enumerate(embeddings)
]

# Insert the vectors into our Milvus collection
collection.insert(data=data)

# Create an index for fast similarity searching
# COSINE is a common metric for comparing sentence embeddings
collection.create_index(
    field_name="embedding",
    index_params={
        "metric_type": "COSINE",  # Cosine similarity for semantic search
        "index_type": "IVF_FLAT",  # Inverted file index
        "params": {"nlist": 128}  # Number of clusters for indexing
    },
    index_name="idx"
)

# Commit changes and load collection into memory for searching
collection.flush()  # Ensure data is written to disk
collection.load()   # Load collection into memory for fast queries
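
The `IVF_FLAT` index above groups vectors into `nlist` clusters so that a search only has to scan a few of them. The pure-Python sketch below illustrates that idea on toy 2-D data. It is a conceptual simplification, not how Milvus implements the index: the centroids are hard-coded here (real ones are learned from the data, typically with k-means), and the function names are hypothetical. `nprobe` is the usual name for the number of clusters scanned at search time.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def build_ivf(vectors, centroids):
    """Group vector ids by their most similar centroid (the 'inverted lists')."""
    lists = {c: [] for c in range(len(centroids))}
    for vec_id, vec in vectors.items():
        best = max(range(len(centroids)), key=lambda c: cosine(vec, centroids[c]))
        lists[best].append(vec_id)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe=1, limit=2):
    """Scan only the nprobe clusters closest to the query, then rank exactly."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: cosine(query, centroids[c]), reverse=True)[:nprobe]
    candidates = [vec_id for c in probe for vec_id in lists[c]]
    ranked = sorted(candidates, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:limit]


# Toy 2-D data; the centroids are hand-picked for this sketch
vectors = {0: [1.0, 0.1], 1: [0.9, 0.2], 2: [0.1, 1.0], 3: [0.2, 0.9]}
centroids = [[1.0, 0.0], [0.0, 1.0]]  # nlist = 2 in this sketch
lists = build_ivf(vectors, centroids)
print(lists)
print(ivf_search([1.0, 0.0], vectors, centroids, lists, nprobe=1))
```

Scanning fewer clusters is faster but can miss a good match that landed in an unscanned cluster, which is the trade-off `nprobe` controls.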

### 📊 Check Attu After Insertion

Now visit your Attu web interface again to see the inserted data:

**Attu URL**: `https://milvus-test-attu-{username}-test.{cluster_domain}`

You'll now see your collection with 4 educational content vectors!

## 🔎 Search Your Vector Database

Now the exciting part! Let's search for educational content using semantic similarity. We stored four course-related sentences - two about Machine Learning, two about other topics.

We'll search with "What AI ethics topics are covered in the curriculum?" and see how Machine Learning content scores higher than unrelated academic content. This demonstrates exactly how Canopy finds relevant course materials when students ask questions!
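
Before running the real search, it can help to see what a cosine top-k search computes. The sketch below does it by brute force in plain Python: normalize every vector to unit length, take dot products with the query, and keep the best `k`. The 3-D vectors are made-up stand-ins for the 384-D embeddings, and `top_k` is a hypothetical helper, not part of pymilvus.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, stored, k=3):
    """Return the k (id, score) pairs with the highest cosine similarity."""
    q = normalize(query)
    scored = (
        (vec_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for vec_id, vec in stored.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 3-D vectors standing in for the 384-D embeddings (values are made up)
stored = {0: [0.9, 0.1, 0.0], 1: [0.8, 0.2, 0.1], 2: [0.0, 0.1, 0.9], 3: [0.1, 0.9, 0.2]}
for vec_id, score in top_k([1.0, 0.0, 0.0], stored, k=3):
    print(vec_id, round(score, 4))
```

Vectors pointing in nearly the same direction as the query score close to 1, while vectors pointing elsewhere score close to 0. The search always returns `k` results, however weak the last ones are.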

In [None]:
# Demonstrate semantic search with educational query
print("\n🔁 VECTOR DATABASE RETRIEVAL DEMO")
query = "What AI ethics topics are covered in the curriculum?"
query_vector = model.encode([query])  # Convert query to vector

# Search our vector database for semantically similar content
results = collection.search(
    data=query_vector,
    anns_field="embedding",  # Field containing our vectors
    param={"metric_type": "COSINE"},  # Use cosine similarity
    limit=3,  # Return top 3 matches
    output_fields=["embedding"]  # Include original vectors in results
)

# Map result IDs back to original text
id_to_text = {i: sentence for i, sentence in enumerate(sentences)}

# Display the search results with similarity scores
print(f"\n📌 Query: '{query}'\n")
print("📥 Top matches:\n")
for match in results[0]:
    match_id = match.id
    score = match.score  # Cosine similarity score (higher = more similar)
    matched_text = id_to_text.get(match_id, "[Unknown]")

    print(f"🆔 ID: {match_id}")
    print(f"🧠 Text: {matched_text}")
    print(f"📏 Score: {score:.4f}\n")

print("✅ Notice: ML-related content gets higher similarity scores!")
print("🎓 This is exactly how Canopy finds relevant course materials!")

In [None]:
# Clean up: release the collection from memory
collection.release()
# Optional: uncomment the next line to also drop the collection
# utility.drop_collection(collection_name)

## 🎉 You've Set Up Your Vector Database!

**What you accomplished:**
- Connected to the Milvus instance you deployed via GitOps
- Learned how embedding models convert text to searchable vectors
- Created a database schema and collection for educational content  
- Tested semantic similarity with course materials
- Demonstrated how vector search finds relevant content by meaning

**Key insights:**
- Related educational content (like ML courses) gets high similarity scores
- Unrelated content gets lower scores and drops out of the top results
- This enables intelligent search that understands context, not just keywords
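
If you want low-scoring matches dropped rather than just ranked last, one option is a score cutoff applied to the hits. A minimal sketch, assuming the hits are available as `(id, score)` pairs: the helper `filter_hits`, the scores, and the cutoff of 0.3 are all illustrative, not output or settings from this notebook.

```python
def filter_hits(hits, min_score):
    """Keep only (id, score) pairs whose score reaches the cutoff."""
    return [(hit_id, score) for hit_id, score in hits if score >= min_score]


# Made-up scores for illustration; they are not output from this notebook
hits = [(0, 0.52), (1, 0.47), (2, 0.11)]
print(filter_hits(hits, min_score=0.3))
```

A sensible cutoff depends on the embedding model and the content, so it is worth tuning against real queries.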

This is the foundation that powers Canopy's intelligent search capabilities. Students can now ask questions and get relevant answers based on meaning!

Go back to the instructions to integrate this vector database with LlamaStack for complete RAG functionality.