# Semantic Search Demo

This notebook demonstrates how to use the fine-tuned embedding model for semantic search: given a query, find the most relevant documents from a collection.


In [1]:
# Import helper functions from the src package
from src.data.loaders import load_toy_dataset
from src.models.embedding_pipeline import load_embeddinggemma_model
from src.models.lora_setup import setup_lora_model
from src.llm.semantic_search import embed_document_collection, search




## Load Model

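Load the model used for search. The cell below loads the base model and attaches LoRA adapters for demonstration; if a trained checkpoint is available, load the fine-tuned weights instead.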


In [2]:
# Load model
tokenizer, base_model = load_embeddinggemma_model()
model = setup_lora_model(base_model, r=16, lora_alpha=32, lora_dropout=0.1)

print("Model loaded for semantic search")


    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    


Model loaded for semantic search
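
The `r=16` and `lora_alpha=32` arguments above can be read through the standard LoRA formulation, where a frozen weight matrix gets a low-rank update scaled by `lora_alpha / r`. The sketch below works through that arithmetic for a single layer. The 768 x 768 layer shape is a hypothetical example (768 matches the embedding size printed later in this notebook), and `lora_summary` is an illustrative helper, not part of `setup_lora_model`, whose internals are not shown here.

```python
def lora_summary(d_in, d_out, r, lora_alpha):
    """Parameter counts and scaling for one LoRA-adapted linear layer."""
    full_params = d_in * d_out        # frozen weight W of shape (d_out, d_in)
    lora_params = r * (d_in + d_out)  # A is (r, d_in), B is (d_out, r)
    return {
        "scaling": lora_alpha / r,    # multiplier applied to the update B @ A
        "full_params": full_params,
        "lora_params": lora_params,
        "trainable_fraction": lora_params / full_params,
    }


# Hypothetical 768 x 768 projection, using the r and lora_alpha from the cell above.
summary = lora_summary(d_in=768, d_out=768, r=16, lora_alpha=32)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Under these assumptions the adapter adds a small fraction of the layer's parameter count, which is the usual motivation for LoRA over full fine-tuning.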


## Create Document Collection

We'll use the `positive` sentence from each example in the toy dataset as the document collection to search over.


In [3]:
# Load dataset and use positives as documents
train_data = load_toy_dataset()
documents = [item["positive"] for item in train_data]

print(f"Document collection contains {len(documents)} documents:")
for i, doc in enumerate(documents, 1):
    print(f"  {i}. {doc}")


Document collection contains 4 documents:
  1. Playing soccer is my favorite hobby.
  2. It's quite a sunny day outside.
  3. The football match ended in a draw.
  4. It rained heavily throughout the day.


## Embed Document Collection

Pre-compute embeddings for all documents once. Each later search then only needs to embed the query, which keeps repeated searches cheap.


In [4]:
# Embed all documents
document_embeddings = embed_document_collection(documents, model, tokenizer)

print(f"Document embeddings shape: {document_embeddings.shape}")
print("Documents are now ready for semantic search!")


Document embeddings shape: torch.Size([4, 768])
Documents are now ready for semantic search!
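
To make the benefit of pre-computing concrete, the sketch below counts encoder calls for two strategies: embedding the documents once versus re-embedding them for every query. `CountingEncoder` and its toy two-dimensional "embedding" are hypothetical stand-ins for the real model, used only so the example runs without any model weights.

```python
class CountingEncoder:
    """Hypothetical stand-in for the model: counts how many texts it encodes."""

    def __init__(self):
        self.calls = 0

    def encode(self, text):
        self.calls += 1
        # Toy 2-d vector (character count, word count); not a real embedding.
        return [float(len(text)), float(len(text.split()))]


toy_documents = ["doc one", "doc two", "doc three", "doc four"]
toy_queries = ["first query", "second query", "third query"]

# Strategy A: embed the documents once, then embed only each query.
cached = CountingEncoder()
index = [cached.encode(doc) for doc in toy_documents]
for q in toy_queries:
    cached.encode(q)

# Strategy B: re-embed every document for every query.
naive = CountingEncoder()
for q in toy_queries:
    naive.encode(q)
    for doc in toy_documents:
        naive.encode(doc)

print(f"pre-computed: {cached.calls} encoder calls")
print(f"re-embedding: {naive.calls} encoder calls")
```

With N documents and Q queries, the pre-computed strategy makes N + Q encoder calls, while re-embedding makes Q * (N + 1). The gap grows with both the collection size and the number of queries.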


## Perform Semantic Search

Search for documents relevant to a query. The model should find documents that are semantically similar, even if they don't share exact keywords.


In [5]:
# Example query
query = "I enjoy playing soccer in my free time."

# Perform search
results = search(
    query,
    document_embeddings,
    documents,
    model,
    tokenizer,
    top_k=3
)

# Display results
print(f"Query: '{query}'")
print("\nTop Results:")
print("-" * 60)
for result in results:
    print(f"\nRank {result['rank']}: (similarity={result['similarity']:.3f})")
    print(f"  Document: '{result['document']}'")


Query: 'I enjoy playing soccer in my free time.'

Top Results:
------------------------------------------------------------

Rank 1: (similarity=0.883)
  Document: 'Playing soccer is my favorite hobby.'

Rank 2: (similarity=0.681)
  Document: 'It's quite a sunny day outside.'

Rank 3: (similarity=0.635)
  Document: 'The football match ended in a draw.'
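
The result dictionaries above carry `rank`, `similarity`, and `document` keys. A minimal sketch of how such a ranking could be produced is shown below, assuming the score is cosine similarity between the query embedding and each document embedding. That assumption is not confirmed by this notebook, and `cosine` and `rank_by_cosine` are illustrative names rather than the implementation behind `search`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vec, doc_vecs, docs, top_k=3):
    """Return the top_k documents sorted by descending cosine similarity."""
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(doc_vecs, docs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"rank": i, "similarity": sim, "document": doc}
        for i, (sim, doc) in enumerate(scored[:top_k], start=1)
    ]


# Toy 2-d vectors standing in for real embeddings.
toy_docs = ["doc a", "doc b", "doc c"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_results = rank_by_cosine([1.0, 0.1], toy_vecs, toy_docs, top_k=2)
for result in toy_results:
    print(f"Rank {result['rank']}: {result['document']} (similarity={result['similarity']:.3f})")
```

Cosine similarity depends only on the angle between vectors, so documents are ranked by direction in embedding space rather than by vector length.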


## Try More Queries

Test the semantic search with different queries to see how well it retrieves relevant documents.


In [6]:
# Additional test queries
test_queries = [
    "It's a beautiful sunny day outside.",
    "The match was tied with no winner.",
    "Heavy rain fell all day long."
]

for query in test_queries:
    results = search(query, document_embeddings, documents, model, tokenizer, top_k=2)
    print(f"\nQuery: '{query}'")
    print(f"Top match: '{results[0]['document']}' (sim={results[0]['similarity']:.3f})")
    print("-" * 60)



Query: 'It's a beautiful sunny day outside.'
Top match: 'It's quite a sunny day outside.' (sim=0.942)
------------------------------------------------------------

Query: 'The match was tied with no winner.'
Top match: 'The football match ended in a draw.' (sim=0.802)
------------------------------------------------------------

Query: 'Heavy rain fell all day long.'
Top match: 'It rained heavily throughout the day.' (sim=0.968)
------------------------------------------------------------
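
Eyeballing the top match works for three queries, but a small scoring helper makes the check repeatable. The sketch below computes top-1 accuracy over (query, expected document) pairs. `top1_accuracy` and `toy_search` are hypothetical names, and `toy_search` ranks by shared words purely so the example runs without the model.

```python
def top1_accuracy(search_fn, labeled_queries):
    """Fraction of queries whose top-ranked document equals the expected one."""
    hits = 0
    for query, expected in labeled_queries:
        results = search_fn(query)
        if results and results[0]["document"] == expected:
            hits += 1
    return hits / len(labeled_queries)


def toy_search(query):
    """Hypothetical stand-in for search(): ranks documents by shared lowercase words."""
    docs = [
        "It rained heavily throughout the day.",
        "The football match ended in a draw.",
    ]
    query_words = set(query.lower().strip(".").split())

    def overlap(doc):
        return len(query_words & set(doc.lower().strip(".").split()))

    ranked = sorted(docs, key=overlap, reverse=True)
    return [{"rank": i, "document": d} for i, d in enumerate(ranked, start=1)]


labeled = [
    ("The match was tied with no winner.", "The football match ended in a draw."),
    ("Heavy rain fell all day long.", "It rained heavily throughout the day."),
]
accuracy = top1_accuracy(toy_search, labeled)
print(f"Top-1 accuracy: {accuracy:.2f}")
```

In this notebook, the real search function could be plugged in as `lambda q: search(q, document_embeddings, documents, model, tokenizer, top_k=1)`, with the expected documents supplied by hand.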
