# Exercises XP: Vector Databases and RAG
Use this guided notebook and fill each TODO before running cells.

## What you'll learn
- Vector search strategies (KNN, ANN) and evaluation.
- Vector database utility (similarity search, RAG).
- Differences between vector DBs, libraries, and plugins.
- Best practices for vector store usage and performance.
- How LMs use context; embedding generation and storage.
- Querying vector stores and applying LMs for QA with retrieved context.

## What you'll build
A functional RAG pipeline with FAISS and ChromaDB, plus QA over retrieved context using a Hugging Face model.

## 0. Setup
Run the install cell once. If your platform needs system deps (e.g., libomp for FAISS), follow instructions in comments.

In [None]:
%pip uninstall -y pydantic-core pydantic
%pip install -U "pydantic<2"
%pip install -U "faiss-cpu>=1.8.0" "chromadb==0.3.21"
%pip install -U "numpy<2" sentence-transformers transformers

In [None]:
import os
import json
from pathlib import Path
import numpy as np
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer, InputExample
import chromadb
from chromadb.config import Settings
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from IPython.display import display
os.makedirs('cache', exist_ok=True)


## ðŸŒŸ Exercise 1 Â· Data loading and preparation

In [15]:
data_path = 'labeled_newscatcher_dataset.csv'
pdf = pd.read_csv(data_path, sep=';')

if 'id' not in pdf.columns:
    pdf['id'] = range(len(pdf))

display(pdf.head())

pdf_subset = pdf.iloc[:1000].copy()
pdf_subset[['id', 'title']].head()

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4


Unnamed: 0,id,title
0,0,A closer look at water-splitting's solar fuel ...
1,1,"An irresistible scent makes locusts swarm, stu..."
2,2,Artificial intelligence warning: AI will know ...
3,3,Glaciers Could Have Sculpted Mars Valleys: Study
4,4,Perseid meteor shower 2020: What time and how ...


## ðŸŒŸ Exercise 2 Â· Vectorization with Sentence Transformers

In [16]:
from sentence_transformers import InputExample, SentenceTransformer

# Helper function is already defined
def example_create_fn(idx: int, text: str) -> InputExample:
    return InputExample(guid=str(idx), texts=[text], label=0.0)

# âœ… Create training examples from the subset
faiss_train_examples = [
    example_create_fn(row.id, row.title)
    for _, row in pdf_subset.iterrows()
]

# Preview first 2 examples
faiss_train_examples[:2]



[<sentence_transformers.readers.InputExample.InputExample at 0x784d328bb770>,
 <sentence_transformers.readers.InputExample.InputExample at 0x784d32852600>]

In [17]:
model = SentenceTransformer('all-MiniLM-L6-v2')
titles_list = pdf_subset['title'].tolist()
faiss_title_embedding = model.encode(titles_list, convert_to_numpy=True, show_progress_bar=True)
len(faiss_title_embedding), len(faiss_title_embedding[0])


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Batches:   0%|          | 0/32 [00:00<?, ?it/s]

(1000, 384)

## ðŸŒŸ Exercise 3 Â· FAISS indexing and search

In [18]:
pdf_to_index = pdf_subset
id_index = pdf_to_index['id'].to_numpy().astype(np.int64)
content_encoded_normalized = faiss_title_embedding.astype('float32')
faiss.normalize_L2(content_encoded_normalized)
index_content = faiss.IndexIDMap(faiss.IndexFlatIP(content_encoded_normalized.shape[1]))
index_content.add_with_ids(content_encoded_normalized, id_index)
index_content.ntotal


1000

In [19]:
def search_content(query: str, pdf_to_index: pd.DataFrame, k: int = 3):
    # Encode the query using the same sentence transformer model
    query_vector = model.encode([query], convert_to_numpy=True).astype('float32')

    # Normalize the query vector
    faiss.normalize_L2(query_vector)

    # Search the FAISS index
    sims, ids = index_content.search(query_vector, k)

    # Retrieve matching rows
    results = pdf_to_index[pdf_to_index['id'].isin(ids[0])].copy()

    # Add similarity scores
    results['similarities'] = sims[0][:len(results)]

    # Sort by similarity descending
    results = results.sort_values(by='similarities', ascending=False)

    return results

# Test the search function
display(search_content('animal', pdf_to_index, k=5))

Unnamed: 0,topic,link,domain,published_date,title,lang,id,similarities
99,TECHNOLOGY,https://www.gematsu.com/2020/08/ghostwire-toky...,gematsu.com,2020-08-07 16:43:13,Ghostwire: Tokyo confirms dog petting,en,99,0.391902
176,TECHNOLOGY,https://www.pushsquare.com/news/2020/08/random...,pushsquare.com,2020-08-03 16:30:00,Random: You Can Pick Up and Pet Cats in Assass...,en,176,0.376784
762,SCIENCE,https://af.reuters.com/article/worldNews/idAFK...,af.reuters.com,2020-08-13 16:51:00,'Secret' life of sharks: Study reveals their s...,en,762,0.344058
928,SCIENCE,https://www.thecut.com/2020/08/scientists-say-...,thecut.com,2020-08-04 12:52:00,Just Let This Lizard Be a Dinosaur,en,928,0.317387
975,HEALTH,https://www.news-medical.net/news/20200813/Res...,news-medical.net,2020-08-13 05:18:00,Researchers explore social behavior of animals...,en,975,0.295497


## ðŸŒŸ Exercise 4 Â· ChromaDB collection and querying

In [20]:
import chromadb
from chromadb.config import Settings
import json

# Initialize ChromaDB client
chroma_client = chromadb.Client(Settings(anonymized_telemetry=False))

collection_name = 'my_news'

# Delete existing collection if it exists
if any(c.name == collection_name for c in chroma_client.list_collections()):
    chroma_client.delete_collection(name=collection_name)

# Create a new collection
collection = chroma_client.create_collection(name=collection_name)

# Add first 100 titles with metadata and unique IDs
collection.add(
    documents=pdf_subset["title"][:100].tolist(),
    metadatas=[{"topic": t} for t in pdf_subset["topic"][:100].tolist()],
    ids=[str(i) for i in pdf_subset["id"][:100].tolist()]
)

# Query the collection for documents related to 'space'
results = collection.query(
    query_texts=["space"],
    n_results=10
)

# Print results neatly
print(json.dumps(results, indent=2))



ERROR:chromadb.telemetry.posthog:Failed to send telemetry event client_start: capture() takes 1 positional argument but 3 were given
ERROR:chromadb.telemetry.posthog:Failed to send telemetry event collection_add: capture() takes 1 positional argument but 3 were given


{
  "ids": [
    [
      "72",
      "7",
      "30",
      "26",
      "23",
      "76",
      "69",
      "40",
      "47",
      "75"
    ]
  ],
  "embeddings": null,
  "documents": [
    [
      "Beck teams up with NASA and AI for 'Hyperspace' visual album experience",
      "Orbital space tourism set for rebirth in 2021",
      "NASA drops \"insensitive\" nicknames for cosmic objects",
      "\u2018It came alive:\u2019 NASA astronauts describe experiencing splashdown in SpaceX Dragon",
      "Hubble Uses Moon As \u201cMirror\u201d to Study Earth\u2019s Atmosphere \u2013 Proxy in Search of Potentially Habitable Planets Around Other Stars",
      "Australia's small yet crucial part in the mission to find life on Mars",
      "NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico",
      "SpaceX's Starship spacecraft saw 150 meters high",
      "NASA\u2019s InSight lander shows what\u2019s beneath Mars\u2019 surface",
      "Alien base on Mercury: ET hunters claim to find hu

## ðŸŒŸ Exercise 5 Â· Question answering with a Hugging Face model

In [21]:
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

model_id = 'google/flan-t5-small'  # lightweight QA model

# âœ… Create the text2text-generation pipeline
pipe = pipeline(
    "text2text-generation",
    model=model_id,
    tokenizer=model_id,
    device_map="auto"  # uses GPU if available
)

# Define your question and context
question = "What's the latest news on space development?"

# Use top 3 retrieved documents from ChromaDB as context
context_docs = results['documents'][0][:3]
context = ' '.join(context_docs)

# Build the prompt
prompt = f"Answer the question using only the context.\nContext: {context}\nQuestion: {question}\nAnswer:\n"

# Generate the answer
response = pipe(prompt)[0]['generated_text']

print(response)



config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

Device set to use cpu


NASA drops "insensitive" nicknames for cosmic objects
