# Day 3: Add Search

https://docs.google.com/document/d/1aieUiF2oN8q3FCvAV7KEyC9HoPFxFIAHvUEnhALz8Og/edit?tab=t.0

### 1. Text search

In [1]:
from utils.ingest import read_repo_data
from utils.chunk import sliding_window
from pprint import pprint

In [2]:
evidently_docs = read_repo_data('evidentlyai', 'docs')

evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        chunk.update(doc_copy)
    evidently_chunks.extend(chunks)

In [5]:
evidently_chunks[0]

{'start': 0,
 'chunk': '<Note>\n  If you\'re not looking to build API reference documentation, you can delete\n  this section by removing the api-reference folder.\n</Note>\n\n## Welcome\n\nThere are two ways to build API documentation: [OpenAPI](https://mintlify.com/docs/api-playground/openapi/setup) and [MDX components](https://mintlify.com/docs/api-playground/mdx/configuration). For the starter kit, we are using the following OpenAPI specification.\n\n<Card\n  title="Plant Store Endpoints"\n  icon="leaf"\n  href="https://github.com/mintlify/starter/blob/main/api-reference/openapi.json"\n>\n  View the OpenAPI specification file\n</Card>\n\n## Authentication\n\nAll API endpoints are authenticated using Bearer tokens and picked up from the specification file.\n\n```json\n"security": [\n  {\n    "bearerAuth": []\n  }\n]\n```',
 'title': 'Introduction',
 'description': 'Example section for showcasing API endpoints',
 'filename': 'docs-main/api-reference/introduction.mdx'}
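
Each chunk carries a `start` offset and a `chunk` string, plus the document metadata copied in by `chunk.update(doc_copy)`. The helper `utils.chunk.sliding_window` is not shown in these notes, so the version below is only a guess at what it might look like: a minimal sketch that assumes the arguments are the window size and the step, and that the last window stops at the end of the text.

```python
def sliding_window(seq, size, step):
    """Split `seq` into overlapping windows of `size` items, advancing by `step`.

    Hypothetical stand-in for utils.chunk.sliding_window, not the original code.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    result = []
    n = len(seq)
    for i in range(0, n, step):
        result.append({'start': i, 'chunk': seq[i:i + size]})
        # Stop once a window reaches the end, to avoid tiny trailing chunks
        if i + size >= n:
            break
    return result


# Toy example: windows of 4 characters, moving 2 characters at a time
text = "abcdefghij"
chunks = sliding_window(text, size=4, step=2)
for c in chunks:
    print(c)
```

With `size=2000` and `step=1000`, as in the cell above, consecutive chunks would overlap by 1000 characters, so a passage cut at one chunk boundary still appears whole in a neighboring chunk.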

For simple in-memory text search, we use minsearch:

- [minsearch](https://github.com/alexeygrigorev/minsearch)
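
To build intuition for what a text index does, here is a simplified sketch of keyword search with TF-IDF weights and cosine similarity, which is the general approach minsearch is understood to take. It uses only the standard library and is not minsearch's implementation; the function names and the toy documents are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs; real tokenizers do more than this
    return re.findall(r"[a-z0-9]+", text.lower())


def build_tfidf_index(docs):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed inverse document frequency: rare terms get higher weight
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return idf, vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs, idf, vectors, num_results=3):
    # Query terms that never occur in the corpus are ignored
    tf = Counter(t for t in tokenize(query) if t in idf)
    q_vec = {t: tf[t] * idf[t] for t in tf}
    scored = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop documents that share no terms with the query
    return [docs[i] for score, i in scored[:num_results] if score > 0]


docs = [
    "How do I join the course after it starts?",
    "Which Python version is required?",
    "How can I install Docker on Windows?",
]
idf, vectors = build_tfidf_index(docs)
results = search("join course", docs, idf, vectors)
print(results)
```

Because the score depends on shared words, a query phrased with different vocabulary ("enroll" instead of "join") would score zero here. That limitation is what vector search addresses later.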

In [3]:
from minsearch import Index

index = Index(
    text_fields=["chunk", "title", "description", "filename"],
    keyword_fields=[]
)

index.fit(evidently_chunks)

<minsearch.minsearch.Index at 0x77d62c7ec290>

In [4]:
index.docs

[{'start': 0,
  'chunk': '<Note>\n  If you\'re not looking to build API reference documentation, you can delete\n  this section by removing the api-reference folder.\n</Note>\n\n## Welcome\n\nThere are two ways to build API documentation: [OpenAPI](https://mintlify.com/docs/api-playground/openapi/setup) and [MDX components](https://mintlify.com/docs/api-playground/mdx/configuration). For the starter kit, we are using the following OpenAPI specification.\n\n<Card\n  title="Plant Store Endpoints"\n  icon="leaf"\n  href="https://github.com/mintlify/starter/blob/main/api-reference/openapi.json"\n>\n  View the OpenAPI specification file\n</Card>\n\n## Authentication\n\nAll API endpoints are authenticated using Bearer tokens and picked up from the specification file.\n\n```json\n"security": [\n  {\n    "bearerAuth": []\n  }\n]\n```',
  'title': 'Introduction',
  'description': 'Example section for showcasing API endpoints',
  'filename': 'docs-main/api-reference/introduction.mdx'},
 {'start'

In [4]:
query = 'What should be in a test dataset for AI evaluation?'
results = index.search(query)

In [23]:
pprint(results[:3])

[{'content': "Yes, even if you don't register, you're still eligible to submit "
             'the homework.\n'
             '\n'
             'Be aware, however, that there will be deadlines for turning in '
             "homeworks and the final projects. So don't leave everything for "
             'the last minute.',
  'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md',
  'id': '3f1424af17',
  'question': 'Course: Can I still join the course after the start date?',
  'sort_order': 3},
 {'content': 'Yes, we will keep all the materials available, so you can follow '
             'the course at your own pace after it finishes.\n'
             '\n'
             'You can also continue reviewing the homeworks and prepare for '
             'the next cohort. You can also start working on your final '
             'capstone project.',
  'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/008

In [6]:
# For DataTalksClub FAQ, it's similar, except we don't need to chunk the data.
dtc_faq = read_repo_data('DataTalksClub', 'faq')

de_dtc_faq = [d for d in dtc_faq if 'data-engineering' in d['filename']]

faq_index = Index(
    text_fields=["question", "content"],
    keyword_fields=[]
)

faq_index.fit(de_dtc_faq)

<minsearch.minsearch.Index at 0x76c6cf6ee0f0>

In [24]:
results = faq_index.search(query)
pprint(results[:3])

[{'content': "Yes, even if you don't register, you're still eligible to submit "
             'the homework.\n'
             '\n'
             'Be aware, however, that there will be deadlines for turning in '
             "homeworks and the final projects. So don't leave everything for "
             'the last minute.',
  'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md',
  'id': '3f1424af17',
  'question': 'Course: Can I still join the course after the start date?',
  'sort_order': 3},
 {'content': 'The next cohort starts January 13th, 2025. More info at '
             '[DTC](https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html).\n'
             '\n'
             '- Register before the course starts using this '
             '[link](https://airtable.com/shr6oVXeQvSI5HuWD).\n'
             '- Join the [course Telegram channel with '
             'announcements](https://t.me

### 2. Vector search

For vector search, we need to turn our documents into vectors (embeddings).

- [sentence-transformers](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html)

In [8]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('multi-qa-distilbert-cos-v1')

The `multi-qa-distilbert-cos-v1` model is trained specifically for question answering: its embeddings are tuned so that a question lands close to the text that answers it.

Other popular models include:
- `all-MiniLM-L6-v2` - General-purpose, fast, and efficient
- `all-mpnet-base-v2` - Higher quality, slower

In [9]:
record = de_dtc_faq[2]
pprint(record)

{'content': "Yes, even if you don't register, you're still eligible to submit "
            'the homework.\n'
            '\n'
            'Be aware, however, that there will be deadlines for turning in '
            "homeworks and the final projects. So don't leave everything for "
            'the last minute.',
 'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md',
 'id': '3f1424af17',
 'question': 'Course: Can I still join the course after the start date?',
 'sort_order': 3}


In [10]:
text = record['question'] + ' ' + record['content']
v_doc = embedding_model.encode(text)

In [11]:
v_doc[:10]

array([ 0.06053618, -0.07152255,  0.03627984, -0.02451891,  0.05339969,
       -0.03946254, -0.01561872,  0.02626443, -0.02312284, -0.00011206],
      dtype=float32)

In [12]:
query = 'I just found out about the course. Can I enroll now?'
v_query = embedding_model.encode(query)

In [13]:
similarity = v_query.dot(v_doc)
print("Similarity:", similarity)

Similarity: 0.5190934
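
A plain dot product works as a similarity score here because the `-cos-` sentence-transformers models are documented as producing unit-length embeddings, and for unit-length vectors the dot product equals cosine similarity. The sketch below checks that identity on two small hand-made vectors; the vectors are illustrative and have nothing to do with the model's actual output.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Scale the vector to length 1
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

cos_ab = cosine_similarity(a, b)            # (12 + 12) / (5 * 5) = 0.96
dot_unit = dot(normalize(a), normalize(b))  # same value after normalization

print(cos_ab, dot_unit)
```

For a model whose embeddings are not normalized, the dot product would also reflect vector length, so normalizing first (or computing cosine similarity directly) would be the safer choice.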


In [14]:
from tqdm.auto import tqdm
import numpy as np

faq_embeddings = []

for d in tqdm(de_dtc_faq):
    text = d['question'] + ' ' + d['content']
    v = embedding_model.encode(text)
    faq_embeddings.append(v)

# convert the list to a NumPy array for efficient similarity computations.
faq_embeddings = np.array(faq_embeddings)

  0%|          | 0/449 [00:00<?, ?it/s]
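
With every document embedded, vector search comes down to scoring the query vector against each stored vector and keeping the best matches. The following is a minimal brute-force sketch of that idea in plain Python. It is an assumption about the general technique, not the code behind `VectorSearch`, and `vector_search_sketch` and the toy data are hypothetical.

```python
def vector_search_sketch(query_vec, embeddings, payload, num_results=2):
    # Dot product of the query with every stored embedding
    scores = [
        sum(q * x for q, x in zip(query_vec, emb))
        for emb in embeddings
    ]
    # Indices of documents sorted by score, best first
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [payload[i] for i in ranked[:num_results]]


# Toy 2-dimensional "embeddings" and the records they belong to
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
payload = [{'id': 'a'}, {'id': 'b'}, {'id': 'c'}]

top = vector_search_sketch([0.8, 0.6], embeddings, payload)
print(top)
```

Scanning every vector is fine for a few hundred FAQ entries. Much larger collections typically call for an approximate nearest-neighbor index instead.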

In [15]:
from minsearch import VectorSearch

faq_vindex = VectorSearch()
faq_vindex.fit(faq_embeddings, de_dtc_faq)

<minsearch.vector.VectorSearch at 0x76c5bf6b5b20>

In [16]:
query = 'Can I join the course now?'
q = embedding_model.encode(query)
results = faq_vindex.search(q)

In [17]:
pprint(results[:3])

[{'content': "Yes, even if you don't register, you're still eligible to submit "
             'the homework.\n'
             '\n'
             'Be aware, however, that there will be deadlines for turning in '
             "homeworks and the final projects. So don't leave everything for "
             'the last minute.',
  'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md',
  'id': '3f1424af17',
  'question': 'Course: Can I still join the course after the start date?',
  'sort_order': 3},
 {'content': 'Yes, we will keep all the materials available, so you can follow '
             'the course at your own pace after it finishes.\n'
             '\n'
             'You can also continue reviewing the homeworks and prepare for '
             'the next cohort. You can also start working on your final '
             'capstone project.',
  'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/008

### 3. Hybrid search

Text search is fast and works well when the query shares exact keywords with the documents. Vector search captures semantic meaning and handles paraphrased questions. Combining the two, known as "hybrid search," gives us the best of both worlds.


In [18]:
query = 'Can I join the course now?'

text_results = faq_index.search(query, num_results=5)

q = embedding_model.encode(query)
vector_results = faq_vindex.search(q, num_results=5)

final_results = text_results + vector_results

In [21]:
pprint(final_results[:3])

[{'content': "Yes, even if you don't register, you're still eligible to submit "
             'the homework.\n'
             '\n'
             'Be aware, however, that there will be deadlines for turning in '
             "homeworks and the final projects. So don't leave everything for "
             'the last minute.',
  'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md',
  'id': '3f1424af17',
  'question': 'Course: Can I still join the course after the start date?',
  'sort_order': 3},
 {'content': 'The next cohort starts January 13th, 2025. More info at '
             '[DTC](https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html).\n'
             '\n'
             '- Register before the course starts using this '
             '[link](https://airtable.com/shr6oVXeQvSI5HuWD).\n'
             '- Join the [course Telegram channel with '
             'announcements](https://t.me

### 4. Putting this together

In [26]:
def text_search(query):
    return faq_index.search(query, num_results=5)

def vector_search(query):
    q = embedding_model.encode(query)
    return faq_vindex.search(q, num_results=5)

def hybrid_search(query):
    text_results = text_search(query)
    vector_results = vector_search(query)
    
    # Combine results and deduplicate by filename, keeping the first occurrence
    seen_filenames = set()
    combined_results = []

    for result in text_results + vector_results:
        if result['filename'] not in seen_filenames:
            seen_filenames.add(result['filename'])
            combined_results.append(result)
    
    return combined_results
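
The `hybrid_search` function above concatenates the two lists, so text results always come before vector results regardless of how strong each match is. One common alternative, not used in these notes, is Reciprocal Rank Fusion (RRF), which rewards documents that rank highly in either list and especially those that appear in both. The sketch below assumes results are dictionaries with a `filename` key, as in the FAQ records; the constant `k=60` is a conventional default, and the function name is hypothetical.

```python
def reciprocal_rank_fusion(result_lists, key='filename', k=60):
    scores = {}
    docs = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            doc_key = doc[key]
            # Each appearance adds 1 / (k + rank); higher ranks add more
            scores[doc_key] = scores.get(doc_key, 0.0) + 1.0 / (k + rank)
            docs.setdefault(doc_key, doc)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_key] for doc_key in ranked]


# Toy ranked lists standing in for text and vector search output
text_hits = [{'filename': 'a.md'}, {'filename': 'b.md'}]
vector_hits = [{'filename': 'c.md'}, {'filename': 'a.md'}]

fused = reciprocal_rank_fusion([text_hits, vector_hits])
print([d['filename'] for d in fused])
```

Here `a.md` comes first because both searches returned it, while `c.md` outranks `b.md` because it was first in its list rather than second.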

In [27]:
query = 'Can I join the course now?'

In [35]:
# Using text search
results = text_search(query)

print("Question:", query)
for result in results:
    print("*" * 80)
    print(f"Filename: {result['filename']}")
    print(f"Content: {result['content'][:200]}...")

Question: Can I join the course now?
********************************************************************************
Filename: faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md
Content: Yes, even if you don't register, you're still eligible to submit the homework.

Be aware, however, that there will be deadlines for turning in homeworks and the final projects. So don't leave everythi...
********************************************************************************
Filename: faq-main/_questions/data-engineering-zoomcamp/general/001_9e508f2212_course-when-does-the-course-start.md
Content: The next cohort starts January 13th, 2025. More info at [DTC](https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html).

- Register before the course starts using this [link](h...
********************************************************************************
Filename: faq-main/_questions/data-engineerin

In [36]:
# Using vector search
results = vector_search(query)

print("Question:", query)
for result in results:
    print("*" * 80)
    print(f"Filename: {result['filename']}")
    print(f"Content: {result['content'][:200]}...")

Question: Can I join the course now?
********************************************************************************
Filename: faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md
Content: Yes, even if you don't register, you're still eligible to submit the homework.

Be aware, however, that there will be deadlines for turning in homeworks and the final projects. So don't leave everythi...
********************************************************************************
Filename: faq-main/_questions/data-engineering-zoomcamp/general/008_068529125b_course-can-i-follow-the-course-after-it-finishes.md
Content: Yes, we will keep all the materials available, so you can follow the course at your own pace after it finishes.

You can also continue reviewing the homeworks and prepare for the next cohort. You can ...
********************************************************************************
Filename: faq-main/_questions/

In [37]:
# Using hybrid search
results = hybrid_search(query)

print("Question:", query)
for result in results:
    print("*" * 80)
    print(f"Filename: {result['filename']}")
    print(f"Content: {result['content'][:200]}...")

Question: Can I join the course now?
********************************************************************************
Filename: faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md
Content: Yes, even if you don't register, you're still eligible to submit the homework.

Be aware, however, that there will be deadlines for turning in homeworks and the final projects. So don't leave everythi...
********************************************************************************
Filename: faq-main/_questions/data-engineering-zoomcamp/general/001_9e508f2212_course-when-does-the-course-start.md
Content: The next cohort starts January 13th, 2025. More info at [DTC](https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html).

- Register before the course starts using this [link](h...
********************************************************************************
Filename: faq-main/_questions/data-engineerin