# **Semantic Search using Vector Database (Endee)**

This notebook builds a semantic search system using the BBC News dataset.
The goal is to demonstrate how unstructured text data can be embedded,
stored in a vector database, and queried using similarity search.

----------------

### Workflow
1. Load BBC News dataset
2. Preprocess text
3. Generate embeddings using Sentence Transformers
4. Store vectors in Endee
5. Perform semantic search

--------------------

In [1]:
# Install required libraries
!pip install -q sentence-transformers numpy pandas scikit-learn

### Step 1: Load BBC News Dataset

We use the BBC News dataset containing unstructured news articles
across multiple categories. This dataset will be used to build
a semantic search system using vector embeddings.

-----------------------

In [2]:
from google.colab import files
import zipfile
import os

# Upload ZIP file
uploaded = files.upload()

# Get uploaded file name
zip_filename = list(uploaded.keys())[0]
print(f"Uploaded file: {zip_filename}")

# Extract ZIP file
with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
    zip_ref.extractall("bbc_news")

print("Extraction completed.")

# List extracted files/folders
os.listdir("bbc_news")


Saving BBC News Summary.zip to BBC News Summary (1).zip
Uploaded file: BBC News Summary (1).zip
Extraction completed.


['BBC News Summary']
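
Outside Colab, `files.upload()` is not available. The following is a minimal sketch of the same extraction step for a ZIP file that is already on disk. The helper name `extract_dataset` and the file name in the usage comment are assumptions for illustration, not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, target_dir="bbc_news"):
    """Extract a ZIP archive and return the sorted top-level entries (hypothetical helper)."""
    zip_path = Path(zip_path)
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"Not a ZIP archive: {zip_path}")

    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(target)

    # Mirror the os.listdir() check from the Colab cell
    return sorted(p.name for p in target.iterdir())


# Usage (assumed local file name):
# extract_dataset("BBC News Summary.zip")
```

Returning the top-level entries makes it easy to confirm the nested folder layout before hard-coding a dataset path.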

### Step 2: Load and Prepare BBC News Text Data

This step walks the category folders under `News Articles`, reads every article file,
skips files with 100 characters or fewer, and collects the rest in a DataFrame
with `id`, `category`, and `text` columns.

----------------


In [5]:
import os
import pandas as pd

DATASET_PATH = "bbc_news/BBC News Summary/BBC News Summary/News Articles"

documents = []

for category in os.listdir(DATASET_PATH):
    category_path = os.path.join(DATASET_PATH, category)

    if not os.path.isdir(category_path):
        continue

    for file_name in os.listdir(category_path):
        file_path = os.path.join(category_path, file_name)

        if os.path.isfile(file_path):
            with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
                text = f.read().strip()

                if len(text) > 100:
                    documents.append({
                        "id": f"{category}_{file_name}",
                        "category": category,
                        "text": text
                    })

df = pd.DataFrame(documents)

print("Total documents loaded:", len(df))
df.head()

Total documents loaded: 2225


| | id | category | text |
|---|---|---|---|
| 0 | tech_151.txt | tech | 'Blog' picked as word of the year\n\nThe term ... |
| 1 | tech_233.txt | tech | 2D Metal Slug offers retro fun\n\nLike some dr... |
| 2 | tech_308.txt | tech | Microsoft makes anti-piracy move\n\nMicrosoft ... |
| 3 | tech_168.txt | tech | A decade of good website design\n\nThe web loo... |
| 4 | tech_319.txt | tech | Why Cell will get the hard sell\n\nThe world i... |


### Step 3: Generate Text Embeddings

In this step, we convert each news article into a dense vector
representation using a pre-trained Sentence Transformer model.

These embeddings will later be stored in the Endee vector database
for semantic search.
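
The model card for `all-MiniLM-L6-v2` states that input longer than 256 word pieces is truncated by default, so only the opening of a long article shapes its embedding. One possible workaround is to embed overlapping chunks instead of whole articles. The sketch below is illustrative: the function name `chunk_words` and the window sizes are assumptions, and it counts whitespace-separated words, which only approximates the model's word-piece count.

```python
def chunk_words(text, max_words=150, overlap=30):
    """Split text into overlapping windows of whitespace-separated words.

    Word counts only approximate the model's token counts, so max_words
    is kept well below the assumed 256-token limit.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, max_words=4, overlap=1))
```

Each chunk would then be embedded on its own, with the article's `id` kept alongside it so search hits can be mapped back to the source article.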

--------------

In [6]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Convert texts to embeddings
texts = df["text"].tolist()

print("Generating embeddings...")
embeddings = model.encode(
    texts,
    batch_size=32,
    show_progress_bar=True
)

# Attach embeddings to dataframe
df["embedding"] = embeddings.tolist()

print("Embedding shape:", np.array(embeddings).shape)
df.head()


Generating embeddings...


Embedding shape: (2225, 384)


| | id | category | text | embedding |
|---|---|---|---|---|
| 0 | tech_151.txt | tech | 'Blog' picked as word of the year\n\nThe term ... | [0.010807140730321407, -0.10933691263198853, 0... |
| 1 | tech_233.txt | tech | 2D Metal Slug offers retro fun\n\nLike some dr... | [-0.023779505863785744, 0.0030381432734429836,... |
| 2 | tech_308.txt | tech | Microsoft makes anti-piracy move\n\nMicrosoft ... | [-0.0838695839047432, 0.02840115688741207, 0.0... |
| 3 | tech_168.txt | tech | A decade of good website design\n\nThe web loo... | [-0.022386251017451286, 0.02649599127471447, 0... |
| 4 | tech_319.txt | tech | Why Cell will get the hard sell\n\nThe world i... | [0.018591539934277534, -0.02684377133846283, -... |


### Step 4: Prepare Vector Records for Endee

In this step, we structure the embeddings along with metadata
so they can be stored in the Endee vector database.
Each record contains:
- a unique ID
- the embedding vector
- metadata such as category and text
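
Before sending records to any database, it can help to check that each one has the shape described above. This is a minimal sketch: the helper name `validate_record` is hypothetical, the default dimension of 384 is taken from the embedding shape printed in Step 3, and the checks are generic sanity checks rather than documented Endee requirements.

```python
import math


def validate_record(record, dimension=384):
    """Return a list of problems found in one vector record (empty if valid)."""
    problems = []

    if not isinstance(record.get("id"), str) or not record.get("id"):
        problems.append("id must be a non-empty string")

    vector = record.get("vector")
    if not isinstance(vector, list) or len(vector) != dimension:
        problems.append(f"vector must be a list of {dimension} floats")
    elif not all(isinstance(v, float) and math.isfinite(v) for v in vector):
        problems.append("vector contains non-float or non-finite values")

    if not isinstance(record.get("metadata"), dict):
        problems.append("metadata must be a dict")

    return problems


good = {"id": "tech_151.txt", "vector": [0.1] * 384, "metadata": {"category": "tech"}}
bad = {"id": "", "vector": [0.1, float("nan")], "metadata": None}
print(validate_record(good))
print(validate_record(bad))
```

Collecting problems in a list, instead of raising on the first one, makes it possible to report every malformed record in a single pass.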

-----------------

In [7]:
# Prepare vector records (Endee-style)
vector_records = []

for _, row in df.iterrows():
    vector_records.append({
        "id": row["id"],
        "vector": row["embedding"],
        "metadata": {
            "category": row["category"],
            "text": row["text"][:500]  # truncate for metadata safety
        }
    })

print("Total vector records prepared:", len(vector_records))
vector_records[0]

Total vector records prepared: 2225


{'id': 'tech_151.txt',
 'vector': [0.010807140730321407,
  -0.10933691263198853,
  0.009803259745240211,
  0.04897219315171242,
  0.08123082667589188,
  ...

(output truncated; the full vector has 384 values)

### Step 5: Semantic Search using Cosine Similarity

In this step, we implement semantic search by:
1. Converting a user query into an embedding
2. Comparing it with stored document embeddings
3. Retrieving the most semantically similar articles
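
The comparison in step 2 uses cosine similarity, cos(q, d) = (q · d) / (‖q‖ ‖d‖), which depends on the angle between two vectors and not on their length. A minimal pure-Python sketch with toy 3-dimensional vectors follows (the real embeddings have 384 dimensions); the function name `cosine` is illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # a scaled copy of the query
orthogonal = [0.0, 3.0, 0.0]       # shares no component with the query

print(cosine(query, same_direction))  # close to 1: same direction
print(cosine(query, orthogonal))      # 0: no shared direction
```

Because scaling a vector does not change the score, a long article and a short query can still match closely if their embeddings point the same way.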

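This brute-force scan over every stored embedding stands in for the similarity search
that a vector database like Endee would perform; vector databases typically use an index
instead of comparing the query against every vector.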

--------------

In [8]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Function to perform semantic search
def semantic_search(query, top_k=5):
    # Embed the query
    query_embedding = model.encode([query])

    # Convert stored embeddings to numpy array
    doc_embeddings = np.array(df["embedding"].tolist())

    # Compute cosine similarity
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]

    # Get top-k most similar documents
    top_indices = similarities.argsort()[-top_k:][::-1]

    results = []
    for idx in top_indices:
        results.append({
            "id": df.iloc[idx]["id"],
            "category": df.iloc[idx]["category"],
            "score": float(similarities[idx]),
            "text_preview": df.iloc[idx]["text"][:300]
        })

    return results
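
`semantic_search` rebuilds the embedding matrix and fully sorts all scores on every call. A possible variant, sketched here under stated assumptions, normalizes the document matrix once so that cosine similarity reduces to a dot product, and uses `np.argpartition` to pick the top k without sorting everything. The function names and the random toy data are assumptions for illustration.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale each row to unit L2 norm so a dot product equals cosine similarity."""
    matrix = np.asarray(matrix, dtype=np.float32)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)


def top_k_cosine(query_vec, doc_matrix_unit, top_k=5):
    """Return (indices, scores) of the top_k rows most similar to query_vec."""
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / max(float(np.linalg.norm(q)), 1e-12)
    scores = doc_matrix_unit @ q

    k = min(top_k, scores.shape[0])
    if k == 0:
        return np.array([], dtype=int), np.array([], dtype=np.float32)
    # argpartition selects the k largest scores; only those k are then sorted
    candidates = np.argpartition(scores, -k)[-k:]
    order = candidates[np.argsort(scores[candidates])[::-1]]
    return order, scores[order]


# Toy data standing in for the 2225 x 384 embedding matrix
rng = np.random.default_rng(0)
docs = normalize_rows(rng.normal(size=(100, 8)))
indices, scores = top_k_cosine(docs[42], docs, top_k=3)
print(indices, scores)
```

With a few thousand documents the difference is small; the precomputed, normalized matrix matters more as the collection grows or queries become frequent.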

In [9]:
query = "latest technology and software innovations"
results = semantic_search(query)

for r in results:
    print(f"\nID: {r['id']}")
    print(f"Category: {r['category']}")
    print(f"Similarity Score: {r['score']:.4f}")
    print(f"Preview: {r['text_preview']}")


ID: tech_295.txt
Category: tech
Similarity Score: 0.4272
Preview: More power to the people says HP

The digital revolution is focused on letting people tell and share their own stories, according to Carly Fiorina, chief of technology giant Hewlett Packard.

The job of firms such as HP now, she said in a speech at the Consumer Electronics Show (CES), was to ensure 

ID: tech_228.txt
Category: tech
Similarity Score: 0.4272
Preview: More power to the people says HP

The digital revolution is focused on letting people tell and share their own stories, according to Carly Fiorina, chief of technology giant Hewlett Packard.

The job of firms such as HP now, she said in a speech at the Consumer Electronics Show (CES), was to ensure 

ID: tech_309.txt
Category: tech
Similarity Score: 0.4094
Preview: What's next for next-gen consoles?

The next generation of video games consoles are in development but what will the new machines mean for games firms and consumers? We may not know when they will 
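
The first two hits above (`tech_295.txt` and `tech_228.txt`) show the same preview and the same score, which suggests the dataset holds the same article under more than one file name. One way to avoid repeated hits is to drop exact duplicates before embedding. The sketch below hashes whitespace-normalized text; the helper name `drop_exact_duplicates` is an assumption, and near-duplicates that differ in wording would not be caught.

```python
import hashlib


def drop_exact_duplicates(documents):
    """Keep the first document for each distinct text (whitespace-normalized)."""
    seen = set()
    unique = []
    for doc in documents:
        normalized = " ".join(doc["text"].split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        unique.append(doc)
    return unique


docs = [
    {"id": "tech_295.txt", "text": "More power to the people says HP"},
    {"id": "tech_228.txt", "text": "More power to the  people says HP\n"},
    {"id": "tech_309.txt", "text": "What's next for next-gen consoles?"},
]
print([d["id"] for d in drop_exact_duplicates(docs)])
# → ['tech_295.txt', 'tech_309.txt']
```

Applied to the `documents` list from Step 2 before building the DataFrame, this would keep one copy of each article and leave the rest of the pipeline unchanged.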

### Step 6: Endee Vector Database Integration (Conceptual)

In a production environment, the embeddings generated in this project
would be stored in the Endee vector database.

Endee would handle:
- Vector indexing
- Efficient similarity search
- Scalability and low-latency retrieval
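
To make these responsibilities concrete, a small in-memory stand-in can play the role of the database during local experiments. This is a minimal sketch: the class `InMemoryVectorStore` and its methods are hypothetical and do not reflect Endee's API. It accepts records shaped like `vector_records` from Step 4 and performs an exact cosine scan, not an indexed search.

```python
import numpy as np


class InMemoryVectorStore:
    """Hypothetical local stand-in for a vector database (exact cosine scan)."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.ids = []
        self.metadata = []
        self.vectors = np.empty((0, dimension), dtype=np.float32)

    @staticmethod
    def _unit(matrix):
        # Normalize rows so that a dot product equals cosine similarity
        norms = np.linalg.norm(matrix, axis=1, keepdims=True)
        return matrix / np.clip(norms, 1e-12, None)

    def insert(self, records):
        """Add records of the form {"id", "vector", "metadata"}."""
        if not records:
            return
        new = np.asarray([r["vector"] for r in records], dtype=np.float32)
        if new.ndim != 2 or new.shape[1] != self.dimension:
            raise ValueError(f"expected vectors of dimension {self.dimension}")
        self.vectors = np.vstack([self.vectors, self._unit(new)])
        self.ids.extend(r["id"] for r in records)
        self.metadata.extend(r.get("metadata", {}) for r in records)

    def search(self, query_vector, top_k=5):
        """Return the top_k records by cosine similarity, best first."""
        q = self._unit(np.asarray([query_vector], dtype=np.float32))[0]
        scores = self.vectors @ q
        best = np.argsort(scores)[::-1][:top_k]
        return [
            {"id": self.ids[i], "score": float(scores[i]), "metadata": self.metadata[i]}
            for i in best
        ]


store = InMemoryVectorStore(dimension=3)
store.insert([
    {"id": "a", "vector": [1.0, 0.0, 0.0], "metadata": {"category": "tech"}},
    {"id": "b", "vector": [0.0, 1.0, 0.0], "metadata": {"category": "sport"}},
    {"id": "c", "vector": [0.9, 0.1, 0.0], "metadata": {"category": "tech"}},
])
for hit in store.search([1.0, 0.05, 0.0], top_k=2):
    print(hit["id"], round(hit["score"], 3))
```

Keeping the stand-in behind `insert` and `search` methods means the calling code would need few changes if it were later pointed at a real database client.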

Below is a conceptual example showing how vectors would be inserted
and queried using Endee's API or SDK.

---------------------

In [None]:
"""
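# Example pseudo-code for Endee integration.
# The client class, method names, and parameters below are placeholders,
# not Endee's documented API; check the Endee SDK documentation before use.

from endee import EndeeClient

client = EndeeClient(api_key="YOUR_API_KEY")

# Create a collection whose dimension matches the embedding size (384)
client.create_collection(
    name="bbc_news",
    dimension=384
)

# Insert the records prepared in Step 4
client.insert(
    collection="bbc_news",
    records=vector_records
)

# Embed the query with the same model, then search
query_embedding = model.encode([query])[0].tolist()
results = client.search(
    collection="bbc_news",
    query_vector=query_embedding,
    top_k=5
)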
"""

### **Conclusion**

In this notebook, we built a semantic search pipeline on the BBC News dataset.

We loaded 2,225 articles, generated 384-dimensional embeddings with `all-MiniLM-L6-v2`,
and ran cosine-similarity search in memory to retrieve articles by meaning.

Storing and querying these vectors in Endee is shown only as pseudo-code. The in-memory search
stands in for the role a vector database like Endee plays in real-world AI applications
such as semantic search and retrieval-augmented generation.

--------------------------------