# 03_semantic_search_demo.ipynb — Semantic search (SBERT + OpenAI)

This notebook builds a semantic search over unique Stack Exchange question titles.

- Sentence Transformers embeddings + NearestNeighbors (cosine)
- OpenAI embeddings (`text-embedding-3-large`) + NearestNeighbors (cosine)
- Demo queries and a qualitative comparison


## Installs (Colab)

In [None]:
!pip install -q datasets scikit-learn pandas matplotlib sentence-transformers openai

## Imports

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import torch
from sklearn.neighbors import NearestNeighbors
from sentence_transformers import SentenceTransformer


## Load dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "sentence-transformers/stackexchange-duplicates",
    "title-title-pair"
)

# Convert to pandas for convenience
df = dataset["train"].to_pandas()
df.head()

### Semantic search module

In [15]:
all_titles = pd.concat([df["title1"], df["title2"]], ignore_index=True)

# Unique titles (exact-match deduplication)
unique_titles = all_titles.drop_duplicates().reset_index(drop=True)
len(unique_titles)

453369

In [16]:
unique_titles.head()

index,title
0,How to get jquery to read a dynamic divs?
1,Does Steam back up my games?
2,Will my opponent lose from his own Plague Spit...
3,Checking whether a GeoServer service is WMS or...
4,Sqrt function working in c
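
The deduplication step above can be illustrated on a toy frame. This is a minimal sketch: the two-row `toy` DataFrame is made up for illustration and only mirrors the `title1`/`title2` column layout used in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the dataset: two title pairs.
toy = pd.DataFrame({
    "title1": ["How to center a div?", "Does Steam back up my games?"],
    "title2": ["How to center a div", "How to center a div?"],
})

# Stack both columns into one Series, then drop exact duplicates.
stacked = pd.concat([toy["title1"], toy["title2"]], ignore_index=True)
unique = stacked.drop_duplicates().reset_index(drop=True)

print(unique.tolist())
```

Because `drop_duplicates` compares strings exactly, titles that differ only in punctuation ("How to center a div?" vs. "How to center a div") both survive, which is why such near-duplicates can show up side by side in the search results.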


Load the SentenceTransformer model and compute embeddings

In [17]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=device)

Device: cuda


In [18]:
# Embeddings for all unique titles
title_embeddings = model.encode(
    unique_titles.tolist(),
    batch_size=64,
    convert_to_numpy=True,
    show_progress_bar=True,
    device=device
)

title_embeddings = title_embeddings.astype("float32")
title_embeddings.shape

(453369, 384)

NearestNeighbors with the cosine metric

In [24]:
# Build a brute-force index with cosine distance
nn = NearestNeighbors(
    n_neighbors=5,
    metric="cosine"   # cosine distance = 1 - cosine similarity
)

nn.fit(title_embeddings)
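
To make the cosine metric concrete, here is a rough NumPy sketch of a brute-force cosine search. `cosine_top_k` is a hypothetical helper written for illustration, not part of scikit-learn, and it assumes no vector has zero norm.

```python
import numpy as np

def cosine_top_k(query_vec, matrix, k=5):
    """Return (distances, indices) of the k rows of `matrix` closest to `query_vec`."""
    # L2-normalise so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    distances = 1.0 - m @ q              # cosine distance for every row
    order = np.argsort(distances)[:k]    # smallest distance first
    return distances[order], order

# Toy 2-D "embeddings" (illustrative values only).
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
dist, idx = cosine_top_k(np.array([2.0, 0.0]), vectors, k=3)
print(idx, np.round(dist, 3))
```

The query points in the same direction as row 0, so that row comes first with distance 0, followed by the diagonal row 2 and the orthogonal row 1. Vector length plays no role: only the angle matters.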

Semantic search function

In [25]:
def semantic_search(query, model, nn, titles, k=5):
    """
    Return the top-k most similar questions for a text query.
    Uses SBERT embeddings + sklearn.NearestNeighbors (cosine).
    """
    # Embed the query; encode() runs on the device the model was loaded on
    q_emb = model.encode([query], convert_to_numpy=True).astype("float32")

    # Find the nearest neighbours
    distances, indices = nn.kneighbors(q_emb, n_neighbors=k)

    results = []
    for dist, idx in zip(distances[0], indices[0]):
        sim = 1.0 - dist  # cosine similarity
        results.append({
            "title": titles.iloc[idx],
            "distance": float(dist),
            "similarity": float(sim)
        })
    return results

Test on a few queries

In [26]:
queries = [
    "How to fix NullPointerException in Java?",
    "Train/test split for machine learning model",
    "How to center a div in CSS?"
]

for q in queries:
    print("\n" + "="*80)
    print("QUERY:", q)
    res = semantic_search(q, model, nn, unique_titles, k=5)
    for r in res:
        print(f"  sim={r['similarity']:.3f} | {r['title']}")


QUERY: How to fix NullPointerException in Java?
  sim=1.000 | How to fix NullPointerException in Java?
  sim=0.984 | Issue with NullPointerException in Java
  sim=0.981 | java.lang.NullPointerException error, how to fix?
  sim=0.981 | java.lang.NullPointerException error how to fix?
  sim=0.980 | Getting NullPointerException Error in Java

QUERY: Train/test split for machine learning model
  sim=0.705 | Alternatives to a train-test split with a small data set
  sim=0.669 | When should we split the data into train, valid and test datasets?
  sim=0.654 | Validation: Data splitting into training vs. test datasets
  sim=0.616 | How to split data into 3 sets (train, validation and test)?
  sim=0.614 | How to do data augmentation and train-validate split?

QUERY: How to center a div in CSS?
  sim=0.943 | How to center a div?
  sim=0.943 | How to center a div
  sim=0.929 | How to center a div element?
  sim=0.915 | How to center a div within a div
  sim=0.913 | How to center a div inside a d

In [29]:
import getpass  # needed for the interactive prompt below

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

Enter your OpenAI API key: ··········


In [31]:
from openai import OpenAI
client = OpenAI()

Function to get the embedding of a single text

Uses the text-embedding-3-large model

In [32]:
def get_embedding(text: str, model: str = "text-embedding-3-large"):
    text = text.replace("\n", " ")  # OpenAI's recommendation: replace newlines with spaces
    resp = client.embeddings.create(
        model=model,
        input=[text]
    )
    return resp.data[0].embedding

Embeddings for a subset of unique_titles

In [34]:
# For example, limit ourselves to the first 10,000 titles
max_titles = 10_000
titles_subset = unique_titles.iloc[:max_titles]

openai_embeddings = []

for i, text in enumerate(titles_subset):
    emb = get_embedding(text, model="text-embedding-3-large")
    openai_embeddings.append(emb)
    if (i + 1) % 1000 == 0:
        print(f"Processed {i+1} titles")

openai_embeddings = np.array(openai_embeddings, dtype="float32")
openai_embeddings.shape

Processed 1000 titles
Processed 2000 titles
Processed 3000 titles
Processed 4000 titles
Processed 5000 titles
Processed 6000 titles
Processed 7000 titles
Processed 8000 titles
Processed 9000 titles
Processed 10000 titles


(10000, 3072)
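
The loop above sends one request per title. Since the embeddings endpoint accepts a list of inputs, the same work could be done in far fewer requests. The sketch below is one possible way to do that: `chunked` and `embed_in_batches` are hypothetical helpers, and the batch size of 256 is an arbitrary assumption that should be checked against the current API limits.

```python
import numpy as np

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_in_batches(client, texts, model="text-embedding-3-large", batch_size=256):
    """Embed `texts` with one API request per batch instead of one per text."""
    vectors = []
    for batch in chunked(list(texts), batch_size):
        # Same newline clean-up as in get_embedding().
        cleaned = [t.replace("\n", " ") for t in batch]
        resp = client.embeddings.create(model=model, input=cleaned)
        # Sort by `index` so vectors line up with the input order.
        ordered = sorted(resp.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return np.asarray(vectors, dtype="float32")
```

With the objects defined in this notebook, a call might look like `openai_embeddings = embed_in_batches(client, titles_subset)`. Passing `client` explicitly also makes the function easy to exercise with a stub instead of the real API.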

In [35]:
nn_openai = NearestNeighbors(
    n_neighbors=5,
    metric="cosine"
)
nn_openai.fit(openai_embeddings)

In [36]:
def semantic_search_openai(query, client, nn, titles, model="text-embedding-3-large", k=5):
    # Note: `client` is kept in the signature, but get_embedding() uses the module-level client
    # 1. Get the query embedding
    q_emb = np.array(get_embedding(query, model=model), dtype="float32").reshape(1, -1)

    # 2. Find the nearest neighbours
    distances, indices = nn.kneighbors(q_emb, n_neighbors=k)

    results = []
    for dist, idx in zip(distances[0], indices[0]):
        sim = 1.0 - dist
        results.append({
            "title": titles.iloc[idx],
            "distance": float(dist),
            "similarity": float(sim)
        })
    return results

In [37]:
queries = [
    "How to fix NullPointerException in Java?",
    "Train/test split for machine learning model",
    "How to center a div in CSS?"
]

for q in queries:
    print("\n" + "="*80)
    print("QUERY:", q)
    res = semantic_search_openai(q, client, nn_openai, titles_subset, k=5)
    for r in res:
        print(f"  sim={r['similarity']:.3f} | {r['title']}")


QUERY: How to fix NullPointerException in Java?
  sim=0.820 | How can I fix Fatal Exception: java.lang.NullpointerException in my Java file?
  sim=0.746 | What do i do when I am getiting Java.lang.NullPointerException?
  sim=0.721 | Java - code always throw NullPointerException
  sim=0.721 | Java code says: nullPointerException
  sim=0.715 | What is causing my NullPointerException

QUERY: Train/test split for machine learning model
  sim=0.396 | Why is the outer loop of nested cross validation needed?
  sim=0.378 | Training loss low, but testing loss is high
  sim=0.363 | Neural network performance evaluation
  sim=0.345 | Neural network in R using the caret package
  sim=0.342 | Data standardization or normalization in GBDT

QUERY: How to center a div in CSS?
  sim=0.799 | How to I center is DIV?
  sim=0.779 | How can I center text inside a div element using CSS
  sim=0.694 | How to set div center vertically
  sim=0.692 | How do I center text vertically with CSS?
  sim=0.681 | How to
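
When comparing the two result lists, keep in mind that the SBERT index covers all 453,369 titles while the OpenAI index covers only the first 10,000, so the two searches draw from different candidate pools. As a rough complement to reading the lists by eye, the overlap between two top-k lists can be expressed as a single number. The sketch below assumes result dictionaries shaped like the ones returned by the search functions above; `top_k_overlap` and the two toy hit lists are made up for illustration.

```python
def top_k_overlap(results_a, results_b):
    """Jaccard overlap between the titles returned by two searches."""
    titles_a = {r["title"] for r in results_a}
    titles_b = {r["title"] for r in results_b}
    if not titles_a and not titles_b:
        return 1.0  # two empty result lists are treated as identical
    return len(titles_a & titles_b) / len(titles_a | titles_b)

# Illustrative hit lists, not actual model output.
sbert_hits = [{"title": "How to center a div?"}, {"title": "How to center a div"}]
openai_hits = [{"title": "How to center a div?"}, {"title": "How to set div center vertically"}]

print(round(top_k_overlap(sbert_hits, openai_hits), 3))
```

A value of 1.0 means both searches returned the same set of titles, 0.0 means they share none. The number is only meaningful when both indexes are built over the same set of titles.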

## Notes

- For OpenAI embeddings, the key can be supplied via Colab Secrets (`OPENAI_API_KEY`) or typed in interactively (`getpass`).
- For a demo, a subset of `unique_titles` is enough (for example 20k; the run above uses the first 10,000) to keep cost and time under control.
