<a href="https://colab.research.google.com/github/salmantec/AI-Agents-Crash-Course/blob/feat%2FDay-3/Day-3/Day_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
## Add search

In [None]:
# In the first part of the course, we focus on data preparation. Before we can use data for AI agents, we need to prepare it properly.

# We have already downloaded the data from a GitHub repository. Yesterday (Day 2), we processed it by chunking it where necessary.

# Now it's time to use this data. We will index this data by putting it inside a search engine. This allows us to quickly find relevant information when users ask questions.

# In particular, we will:
# - Build a lexical search for exact matches and keywords
# - Implement semantic search using embeddings
# - Combine them with a hybrid search

# At the end of this lesson, you'll have a working search system you can query about your project. This search engine can be used later by the AI agent to look up user questions in the database.


In [1]:
!pip install uv

Collecting uv
  Downloading uv-0.8.22-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading uv-0.8.22-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.3 MB)
Installing collected packages: uv
Successfully installed uv-0.8.22


In [2]:
## 1. Text search

# The simplest type of search is a text search. Suppose we build a Q&A system for courses (using the FAQ dataset). We want to find the answer to this question:

# "What should be in a test dataset for AI evaluation?"

# Text search works by finding all documents that contain at least one word from the query. The more words from the query that appear in a document, the more relevant that document is.

# This is how modern search systems like Apache Solr or Elasticsearch work. They use indexes to efficiently search through millions of documents without having to scan each one individually.
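
# To make the idea of an index concrete, here is a minimal sketch of an inverted index in plain Python. It is not how minsearch, Solr, or Elasticsearch are implemented; the function names and the toy documents are made up for illustration, and the score is a plain count of matching query words.

```python
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on non-alphanumeric characters (a deliberately naive tokenizer)
    cleaned = ''.join(ch if ch.isalnum() else ' ' for ch in text.lower())
    return cleaned.split()

def build_inverted_index(docs):
    # Map each word to the set of document positions that contain it
    inverted = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in tokenize(text):
            inverted[word].add(doc_id)
    return inverted

def lexical_search(inverted, query):
    # Score = number of distinct query words found in the document
    scores = defaultdict(int)
    for word in set(tokenize(query)):
        for doc_id in inverted.get(word, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document position
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

toy_docs = [
    "A test dataset should reflect what the system should know",
    "Install the package with pip",
    "Evaluation of AI systems needs a good test dataset",
]
toy_inverted = build_inverted_index(toy_docs)
toy_hits = lexical_search(toy_inverted, "What should be in a test dataset for AI evaluation?")
print(toy_hits)
```

# Only documents that share a word with the query are ever touched, which is why the lookup stays fast as the collection grows. Real engines typically weight terms (for example with TF-IDF or BM25) instead of counting matches.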

# In this lesson, we'll start with a simple in-memory text search. The engine we will use is called minsearch.

# Note: This search engine was implemented as part of a workshop I held some time ago. You can find details here if you want to know how it works

!uv pip install minsearch requests python-frontmatter

Using Python 3.12.11 environment at: /usr
Resolved 18 packages in 154ms
Prepared 2 packages in 10ms
Installed 2 packages in 2ms
 + minsearch==0.0.5
 + python-frontmatter==1.1.0


In [3]:
# read_repo_data comes from the first lesson and sliding_window from the second lesson. Both are repeated here so the notebook runs on its own.

import io
import zipfile
import requests
import frontmatter

def read_repo_data(repo_owner, repo_name):
  """
  Download and parse all markdown files from a github repository

  Args:
    repo_owner : Github username or organization
    repo_name: Repository name

  Returns:
    List of dictionaries containing file content and metadata
  """
  prefix = 'https://codeload.github.com'
  url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
  resp = requests.get(url)

  if resp.status_code != 200:
    raise Exception(f"Failed to download repository {repo_owner}/{repo_name}: {resp.status_code}")

  repository_data = []

  # Create a ZipFile object from the downloaded content
  zf = zipfile.ZipFile(io.BytesIO(resp.content))

  for file_info in zf.infolist():
    filename = file_info.filename
    filename_lower = filename.lower()

    if not filename_lower.endswith(('.md', '.mdx')):
      continue

    try:
      with zf.open(file_info) as f_in:
        content = f_in.read().decode('utf-8', errors='ignore')
        post = frontmatter.loads(content)
        data = post.to_dict()
        data['filename'] = filename
        repository_data.append(data)
    except Exception as e:
      print(f"Error processing {filename}: {e}")
      continue

  zf.close()
  return repository_data

In [4]:
def sliding_window(sequence, size, step):
  if size <= 0 or step <= 0:
    raise ValueError("Size and step must be positive")

  n = len(sequence)
  result = []
  for i in range(0, n, step):
    chunk = sequence[i:i+size]
    result.append({'start': i, 'chunk': chunk})
    if i + size >= n:
      break

  return result
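
# A quick way to see the overlap this produces is to run it on a short string. The block repeats a compact copy of the function so that it runs on its own; the sample string and the size/step values are arbitrary.

```python
def sliding_window(sequence, size, step):
    # Compact copy of the function above so this example is self-contained
    if size <= 0 or step <= 0:
        raise ValueError("Size and step must be positive")
    result = []
    for i in range(0, len(sequence), step):
        result.append({'start': i, 'chunk': sequence[i:i + size]})
        if i + size >= len(sequence):
            break
    return result

demo_chunks = sliding_window('abcdefghij', size=4, step=2)
for c in demo_chunks:
    print(c['start'], c['chunk'])
# 0 abcd
# 2 cdef
# 4 efgh
# 6 ghij
```

# With step smaller than size, each chunk shares its tail with the start of the next one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.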

In [5]:
evidently_docs = read_repo_data('evidentlyai', 'docs')

evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        chunk.update(doc_copy)
    evidently_chunks.extend(chunks)

# Let's now index this data with minsearch:

from minsearch import Index

index = Index(
    text_fields=["chunk", "title", "description", "filename"],
    keyword_fields=[]
)

index.fit(evidently_chunks)

# Here we create an index that will search through four text fields: chunk content, title, description, and filename. The keyword_fields parameter is for exact matches (we don't need it for now).
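
# To illustrate the difference between the two kinds of fields, here is a toy sketch (not minsearch's implementation): text fields are tokenized and scored by word overlap, while a keyword field must match exactly. The helper toy_search, the course field, and the sample records are all hypothetical.

```python
def toy_search(docs, query, text_fields, filters=None):
    # Hypothetical helper: exact-match filtering on keyword fields,
    # then word-overlap scoring across the text fields
    filters = filters or {}
    query_words = set(query.lower().split())
    scored = []
    for doc in docs:
        # Keyword fields: the whole value must be equal, no tokenization
        if any(doc.get(field) != value for field, value in filters.items()):
            continue
        # Text fields: count query words that appear in any of the fields
        doc_words = set()
        for field in text_fields:
            doc_words.update(str(doc.get(field, '')).lower().split())
        score = len(query_words & doc_words)
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]

toy_records = [
    {'title': 'join late', 'chunk': 'you can join the course late', 'course': 'de-zoomcamp'},
    {'title': 'join late', 'chunk': 'you can join the course late', 'course': 'ml-zoomcamp'},
    {'title': 'setup', 'chunk': 'install docker first', 'course': 'de-zoomcamp'},
]
toy_matches = toy_search(toy_records, 'can I join', ['title', 'chunk'], filters={'course': 'de-zoomcamp'})
print(toy_matches)
```

# The second record has identical text but is dropped by the exact-match filter, and the third passes the filter but shares no words with the query.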

# We can now use it for search:

query = 'What should be in a test dataset for AI evaluation?'
results = index.search(query)
print(results)

[{'start': 0, 'chunk': 'Retrieval-Augmented Generation (RAG) systems rely on retrieving answers from a knowledge base before generating responses. To evaluate them effectively, you need a test dataset that reflects what the system *should* know.\n\nInstead of manually creating test cases, you can generate them directly from your knowledge source, ensuring accurate and relevant ground truth data.\n\n## Create a RAG test dataset\n\nYou can generate ground truth RAG dataset from your data source.\n\n### 1. Create a Project\n\nIn the Evidently UI, start a new Project or open an existing one.\n\n* Navigate to “Datasets” in the left menu.\n* Click “Generate” and select the “RAG” option.\n\n![](/images/synthetic/synthetic_data_select_method.png)\n\n### 2. Upload your knowledge base\n\nSelect a file containing the information your AI system retrieves from. Supported formats: Markdown (.md), CSV, TXT, PDFs. Choose how many inputs to generate.\n\n![](/images/synthetic/synthetic_data_inputs_examp

In [6]:
# For DataTalksClub FAQ, it's similar, except we don't need to chunk the data. For the data engineering course, it'll look like this:

dtc_faq = read_repo_data('DataTalksClub', 'faq')

de_dtc_faq = [d for d in dtc_faq if 'data-engineering' in d['filename']]

faq_index = Index(
    text_fields=["question", "content"],
    keyword_fields=[]
)

faq_index.fit(de_dtc_faq)

query = 'Course: Can I still join the course after the start date?'
results = faq_index.search(query)
print(results)

# This is text search, also known as "lexical search". We look for exact word matches between our query and the documents.

In [7]:
## 2. Vector search

# Text search has limitations. Consider these two queries:
# - "I just discovered the program, can I still enroll?"
# - "I just found out about the course, can I still join?"

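# These ask the same question, but the important words differ ("discovered" vs. "found out", "program" vs. "course", "enroll" vs. "join"). They only share filler words such as "I", "just", and "can". Text search would fail to find relevant matches.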
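
# A small check makes this visible. The sketch below strips a handful of stop words and compares what is left; the stop word list is a made-up minimal one chosen for these two sentences, not a standard list.

```python
import re

# Small, made-up stop word list; real systems use much longer ones
STOP_WORDS = {'i', 'just', 'the', 'can', 'still', 'out', 'about'}

def content_words(text):
    # Lowercase, keep alphabetic tokens, drop stop words
    return {w for w in re.findall(r'[a-z]+', text.lower()) if w not in STOP_WORDS}

q1 = "I just discovered the program, can I still enroll?"
q2 = "I just found out about the course, can I still join?"

shared = content_words(q1) & content_words(q2)
print(sorted(content_words(q1)))
print(sorted(content_words(q2)))
print(shared)
```

# The intersection is empty, so a search that relies only on shared content words has nothing to match on.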

# This is where embeddings help. Embeddings are numerical representations of text that capture semantic meaning. Words and phrases with similar meanings have similar embeddings, even if they use different words.
# Vector search uses these embeddings to identify semantically similar documents, rather than just exact word matches.

# For vector search, we need to turn our documents into vectors (embeddings).

# We will use the sentence-transformers library for this purpose.

!uv pip install sentence-transformers

Using Python 3.12.11 environment at: /usr
Audited 1 package in 137ms


In [8]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('multi-qa-distilbert-cos-v1')

# The multi-qa-distilbert-cos-v1 model is trained explicitly for question-answering tasks. It creates embeddings optimized for finding answers to questions.

# Other popular models include:
# - all-MiniLM-L6-v2 - General-purpose, fast, and efficient
# - all-mpnet-base-v2 - Higher quality, slower

# Check Sentence Transformers documentation (https://www.sbert.net/docs/pretrained_models.html) for more options.

# This is how we use it

record = de_dtc_faq[2]
text = record['question'] + ' ' + record['content']
v_doc = embedding_model.encode(text)

# We combine the question and answer text, then convert it to an embedding vector.

# Let's do the same for the query:

query = 'I just found out about the course. Can I enroll now?'
v_query = embedding_model.encode(query)

# This is how we compute similarity between the query and document vectors:
similarity = v_query.dot(v_doc)

# The dot product measures similarity between the two vectors.

# Values closer to 1 indicate higher similarity; values near 0 (or below) indicate lower similarity. This works because the model produces normalized (unit-length) embeddings, and for unit-length vectors the dot product equals the cosine similarity.
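
# The relationship between the dot product and cosine similarity can be checked on two small vectors. This sketch uses plain Python lists rather than real embeddings; the vectors are arbitrary.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale the vector to unit length
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def cosine(a, b):
    # Dot product divided by the product of the two lengths
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

a_unit, b_unit = normalize(a), normalize(b)
print(cosine(a, b))          # cosine similarity of the raw vectors
print(dot(a_unit, b_unit))   # dot product after normalization
# both values are approximately 0.96
```

# Once vectors are normalized, the division by the lengths is already done, so a single dot product is enough.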

# So we can create embeddings for all documents, then compute similarity between the query and each document to find the most similar ones.
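
# As a rough sketch of that procedure (an illustration of the idea, not the code inside minsearch), the block below scores every document against the query and keeps the top results. The function brute_force_search, the two-dimensional vectors, and the document labels are made up.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def brute_force_search(query_vec, doc_vecs, docs, num_results=2):
    # Score every document against the query, then keep the best ones
    scored = [(dot(query_vec, vec), doc) for vec, doc in zip(doc_vecs, docs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]

# Made-up 2-dimensional unit vectors standing in for real embeddings
toy_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_documents = ['doc about joining', 'doc about deadlines', 'doc about docker']
toy_query_vec = [0.8, 0.6]

print(brute_force_search(toy_query_vec, toy_vectors, toy_documents))
# ['doc about deadlines', 'doc about joining']
```

# Scanning every vector is fine for a few hundred or a few thousand documents. Larger collections usually call for an approximate nearest-neighbor index.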

# This is what VectorSearch from minsearch does. Let's use it.



In [9]:
# First, we turn our docs into embeddings. This process takes time, so we'll monitor progress with tqdm:

from tqdm.auto import tqdm
import numpy as np

faq_embeddings = []

for d in tqdm(de_dtc_faq):
  text = d['question'] + ' ' + d['content']
  v = embedding_model.encode(text)
  faq_embeddings.append(v)

faq_embeddings = np.array(faq_embeddings)

# We combine question and answer text for each FAQ entry. We convert the list to a NumPy array for efficient similarity computations.

  0%|          | 0/449 [00:00<?, ?it/s]

In [11]:
# Now let's use VectorSearch:

from minsearch import VectorSearch

faq_vindex = VectorSearch()
faq_vindex.fit(faq_embeddings, de_dtc_faq)

# This creates a vector search index using our embeddings and original documents

# Let's use it now

query = 'Can I join the course now?'
q = embedding_model.encode(query)
results = faq_vindex.search(q)

# We first create an embedding for our query (q), then search for similar document embeddings.

print(results)

[{'id': '3f1424af17', 'question': 'Course: Can I still join the course after the start date?', 'sort_order': 3, 'content': "Yes, even if you don't register, you're still eligible to submit the homework.\n\nBe aware, however, that there will be deadlines for turning in homeworks and the final projects. So don't leave everything for the last minute.", 'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md'}, {'id': '068529125b', 'question': 'Course - Can I follow the course after it finishes?', 'sort_order': 8, 'content': 'Yes, we will keep all the materials available, so you can follow the course at your own pace after it finishes.\n\nYou can also continue reviewing the homeworks and prepare for the next cohort. You can also start working on your final capstone project.', 'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/008_068529125b_course-can-i-follow-the-course-after-it-finishes.md'}, 

In [12]:
# You can easily do the same with the Evidently docs (but only use the chunk field for embeddings):

evidently_embeddings = []

for d in tqdm(evidently_chunks):
  v = embedding_model.encode(d['chunk'])
  evidently_embeddings.append(v)

evidently_embeddings = np.array(evidently_embeddings)

evidently_vindex = VectorSearch()
evidently_vindex.fit(evidently_embeddings, evidently_chunks)

  0%|          | 0/575 [00:00<?, ?it/s]

<minsearch.vector.VectorSearch at 0x7c9866d77740>

In [13]:
## 3. Hybrid search

# Text search is fast and efficient.
# It works well for exact matches and specific terms, and requires no model inference. However, it misses semantically similar but differently worded queries and struggles with synonyms.

# Vector search captures semantic meaning and handles paraphrased questions.
# It works with synonyms and related concepts, but it may miss exact keyword matches.

# Combining both approaches gives us the best of both worlds. This is known as "hybrid search".

# The code is quite simple:

query = 'Can I join the course now?'

text_results = faq_index.search(query, num_results=5)

q = embedding_model.encode(query)
vector_results = faq_vindex.search(q, num_results=5)

final_results = text_results + vector_results

print(final_results)

[{'id': '3f1424af17', 'question': 'Course: Can I still join the course after the start date?', 'sort_order': 3, 'content': "Yes, even if you don't register, you're still eligible to submit the homework.\n\nBe aware, however, that there will be deadlines for turning in homeworks and the final projects. So don't leave everything for the last minute.", 'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md'}, {'id': '9e508f2212', 'question': 'Course: When does the course start?', 'sort_order': 1, 'content': "The next cohort starts January 13th, 2025. More info at [DTC](https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html).\n\n- Register before the course starts using this [link](https://airtable.com/shr6oVXeQvSI5HuWD).\n- Join the [course Telegram channel with announcements](https://t.me/dezoomcamp).\n- Don’t forget to register in DataTalks.Club's Slack and join the channel.", 'file

In [14]:
# Putting this together

# Our search is implemented!

# But before we can use it in our agent, we need to organize the code. Let's put all the code into different functions.

def text_search(query):
  return faq_index.search(query, num_results=5)

def vector_search(query):
  q = embedding_model.encode(query)
  return faq_vindex.search(q, num_results=5)

def hybrid_search(query):
  text_results = text_search(query)
  vector_results = vector_search(query)

  # Combine and deduplicate results, using the filename as a unique document key
  seen_ids = set()
  combined_results = []

  for result in text_results + vector_results:
    if result['filename'] not in seen_ids:
      seen_ids.add(result['filename'])
      combined_results.append(result)

  return combined_results
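
# The function above keeps the text results first and appends any vector-only results after them. If you later want an ordering that rewards documents both searches agree on, one common option is reciprocal rank fusion (RRF). The sketch below shows that swapped-in technique; it is not part of this lesson's code, and the function name, the k=60 constant, and the toy results are assumptions for illustration.

```python
def reciprocal_rank_fusion(result_lists, key='filename', k=60):
    # Each document earns 1 / (k + rank) from every list it appears in;
    # documents ranked highly by several searches accumulate the most
    scores = {}
    docs_by_key = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            doc_key = doc[key]
            scores[doc_key] = scores.get(doc_key, 0.0) + 1.0 / (k + rank)
            docs_by_key.setdefault(doc_key, doc)
    ordered = sorted(scores, key=scores.get, reverse=True)
    return [docs_by_key[doc_key] for doc_key in ordered]

# Toy ranked lists standing in for text and vector search output
toy_text_results = [{'filename': 'a.md'}, {'filename': 'b.md'}, {'filename': 'c.md'}]
toy_vector_results = [{'filename': 'b.md'}, {'filename': 'd.md'}, {'filename': 'a.md'}]

fused = reciprocal_rank_fusion([toy_text_results, toy_vector_results])
print([doc['filename'] for doc in fused])
# ['b.md', 'a.md', 'd.md', 'c.md']
```

# Because RRF only uses ranks, it sidesteps the problem that text scores and vector similarities live on different scales.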

In [None]:
# We have seen three approaches: text search, vector search, and hybrid search. You may wonder how to select the best one. We will discuss evaluation methods later in the course.

# As with chunking, you should always start with the simplest approach. For search, that's text search. It's faster, easier to debug, and works well for many use cases. Only add complexity when a simple text search isn't sufficient.

# But let's first build our agent! Our data is ready. Tomorrow, we will build a conversational agent that can answer questions based on the data we collected.
