# Weighted Embeddings for Semantic Search (Pinecone + Sentence Transformers)

This notebook demonstrates an advanced semantic search technique:
**weighted embeddings**.

Instead of embedding a single concatenated text field, each record is split
into multiple semantic fields (course + section attributes), embedded separately,
and combined using a weighted vector sum.

This approach enables:
- Emphasizing the most important fields (e.g., course name and section description)
- Improving retrieval quality in structured datasets
- More controllable semantic search behavior


In [1]:
import os

import numpy as np
import pandas as pd
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer

In [2]:
# The client only needs the API key; cloud and region are set later via ServerlessSpec.
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))

## Loading Course Section Dataset

The dataset contains both course-level and section-level information.
Each record is assigned a unique vector ID using:

`course_id_section_id`

This ensures a stable unique identifier per section.
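
The identifier is only useful if it is actually unique: upserting two vectors with the same ID keeps just the last one. A quick check along these lines may help. This is a minimal sketch in plain Python; the helper name `find_duplicate_ids` and the sample values are illustrative and not taken from the dataset.

```python
from collections import Counter


def find_duplicate_ids(course_ids, section_ids):
    """Return the composite IDs that occur more than once."""
    # Build IDs the same way as the notebook: "<course_id>_<section_id>".
    ids = [f"{c}_{s}" for c, s in zip(course_ids, section_ids)]
    counts = Counter(ids)
    return sorted(uid for uid, n in counts.items() if n > 1)


# Hypothetical sample values, not taken from the real CSV.
duplicates = find_duplicate_ids([1, 1, 2, 2], [10, 11, 10, 10])
print(duplicates)  # ['2_10']
```

With the real data, the two arguments would be `files["course_id"]` and `files["section_id"]`.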


In [3]:
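# 'ANSI' is only a valid codec name on Windows; cp1252 is assumed here as the
# portable equivalent for a Western-locale Windows export.
files = pd.read_csv("../../data/course_section_descriptions.csv", encoding="cp1252")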

In [4]:
files["unique_id"] = files["course_id"].astype(str) + "_" + files["section_id"].astype(str)

## Creating Metadata for Retrieval Interpretation

Metadata is attached to each vector so results can be interpreted easily.
This includes:
- course name
- section name
- section description
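
Empty CSV cells arrive in pandas as NaN, and Pinecone metadata does not accept null values, so it may be worth dropping empty fields before upserting. The sketch below assumes metadata is a plain dict; `clean_metadata` and the sample record are hypothetical.

```python
import math


def clean_metadata(metadata):
    """Drop keys whose value is None or a float NaN."""
    return {
        key: value
        for key, value in metadata.items()
        if value is not None
        and not (isinstance(value, float) and math.isnan(value))
    }


# Hypothetical record with a missing description.
record = {
    "course_name": "Intro to Python",
    "section_name": "Loops",
    "section_description": float("nan"),
}
cleaned = clean_metadata(record)
print(cleaned)  # {'course_name': 'Intro to Python', 'section_name': 'Loops'}
```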


In [5]:
files["metadata"] = files.apply(lambda row: {
    "course_name": row["course_name"],
    "section_name": row["section_name"],
    "section_description": row["section_description"]
}, axis=1)

## Defining Field Weights

Each embedding field is assigned a weight based on importance.

Higher weight means:
- stronger influence on semantic similarity
- increased retrieval sensitivity to that field

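The weights below sum to 1.05 rather than exactly 1.0. Because the combined
vector is L2-normalized afterwards, only the ratios between weights affect the result.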
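
If the weights should read as proportions, they can be rescaled to sum to exactly 1.0. A minimal sketch follows; `normalize_weights` is a hypothetical helper, and the dictionary repeats the values used in this notebook.

```python
import math

WEIGHTS = {
    "course_name": 0.35,
    "course_technology": 0.15,
    "course_description": 0.10,
    "section_name": 0.20,
    "section_description": 0.25,
}


def normalize_weights(weights):
    """Rescale weights so they sum to 1.0 while keeping their ratios."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {field: w / total for field, w in weights.items()}


normalized = normalize_weights(WEIGHTS)
print(round(sum(WEIGHTS.values()), 2))              # 1.05
print(math.isclose(sum(normalized.values()), 1.0))  # True
```

Rescaling does not change the final embeddings, since the combined vector is normalized to unit length anyway.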


In [6]:
WEIGHTS = {
    "course_name": 0.35,
    "course_technology": 0.15,
    "course_description": 0.10,
    "section_name": 0.20,
    "section_description": 0.25
}


## Weighted Embedding Strategy

For each record:

1. Encode individual fields separately
2. Combine embeddings via weighted sum
3. Normalize the final embedding vector

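Normalization scales every combined vector to unit length. Cosine similarity
is itself scale-invariant, so this does not change cosine rankings, but it
keeps stored vectors on a consistent scale and makes dot product equal to
cosine similarity.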
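
The three steps can be illustrated without a model. The sketch below uses toy 2-D vectors in place of real embeddings and plain Python in place of NumPy; the field names `title` and `body` and the helper functions are made up for the example.

```python
import math


def weighted_sum(vectors, weights):
    """Combine same-length vectors, scaling each by its field weight."""
    dim = len(next(iter(vectors.values())))
    combined = [0.0] * dim
    for field, vec in vectors.items():
        for i, x in enumerate(vec):
            combined[i] += weights[field] * x
    return combined


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Toy "embeddings": each field points along a different axis.
vectors = {"title": [1.0, 0.0], "body": [0.0, 1.0]}
weights = {"title": 0.75, "body": 0.25}

combined = l2_normalize(weighted_sum(vectors, weights))
length = math.sqrt(sum(x * x for x in combined))
# The result has unit length and leans toward the more heavily weighted field.
```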


In [7]:
def weighted_embedding(row, model, weights):
    """Encode each weighted field of a row, sum the scaled vectors, and L2-normalize."""
    # Encode every field listed in `weights` and scale it by that field's weight.
    combined = sum(
        weight * model.encode(row[field])
        for field, weight in weights.items()
    )

    # Normalize to unit length so all stored vectors share the same scale.
    combined = combined / np.linalg.norm(combined)
    return combined.tolist()

## Embedding Model Selection

`multi-qa-distilbert-cos-v1` is used because it is optimized for:
- semantic search
- question-answer retrieval use cases

This model outputs embeddings of **768 dimensions**.


In [8]:
model = SentenceTransformer('multi-qa-distilbert-cos-v1')

In [9]:
files["embeddings"] = files.apply(lambda row: weighted_embedding(row, model, WEIGHTS), axis=1)

## Creating a Pinecone Index for Weighted Embeddings

A Pinecone index is created with:
- dimension = 768
- metric = cosine similarity

The index is deleted and recreated if it already exists.


In [10]:
index_name = "bert"
dimension = 768  # Dimension of the embeddings
metric = "cosine"  # Similarity metric

In [11]:
if index_name in [i.name for i in pc.list_indexes()]:
    pc.delete_index(index_name)
    print(f"Deleted existing index '{index_name}'.")
else:
    print(f"{index_name} not in the index list.")

In [12]:
pc.create_index(
    name=index_name,
    dimension=dimension,
    metric=metric,
    spec = ServerlessSpec( cloud="aws", region="us-east-1")
)

In [13]:
index =  pc.Index(index_name)

## Upserting Weighted Embeddings into Pinecone

Each vector is stored as:
- ID: unique_id
- Values: weighted embedding vector
- Metadata: interpretability fields
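
Pinecone limits the size of a single upsert request, so larger datasets are usually sent in batches. A minimal sketch of a batching helper; the name `batched`, the batch size of 100, and the dummy vectors are assumptions for illustration.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Dummy (id, values, metadata) tuples standing in for real vectors.
vectors = [(f"id_{i}", [0.0, 1.0], {"n": i}) for i in range(250)]

batch_sizes = [len(batch) for batch in batched(vectors)]
print(batch_sizes)  # [100, 100, 50]
```

In this notebook the loop would look like `for batch in batched(vectors_to_upsert): index.upsert(vectors=batch)`.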


In [14]:
vectors_to_upsert = [
    (row["unique_id"], row["embeddings"], row["metadata"])
    for _, row in files.iterrows()  # `_` avoids reusing the name of the Pinecone `index`
]

In [15]:
index.upsert(vectors=vectors_to_upsert)
print("Data upserted successfully.")

## Weighted Query Embedding

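A query has no separate fields, so the same query embedding stands in for
every field and is combined with the same weights. Because all five vectors
are identical, the weighted sum is just the query embedding scaled by the
total weight, and after normalization it equals the normalized plain query
embedding.

This keeps query and document vectors in the same model space and at unit length.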
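
When one vector is used for every field, the weighted sum collapses to a scaled copy of that vector, and normalization removes the scale. The sketch below checks this with a toy 2-D vector in place of a real query embedding; the helper and values are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


q = [3.0, 4.0]                            # toy stand-in for model.encode(query)
weights = [0.35, 0.15, 0.10, 0.20, 0.25]  # the notebook's five field weights

# Weighted sum of five identical vectors: sum(weights) * q, component-wise.
weighted = [sum(w * x for w in weights) for x in q]

same = all(
    math.isclose(a, b)
    for a, b in zip(l2_normalize(weighted), l2_normalize(q))
)
print(same)  # True
```

In other words, the field weights shape the document vectors only; on the query side they cancel out.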


In [16]:
def weighted_query_embedding(query, model, weights):
    # The query has no separate fields, so one encoding stands in for every field.
    q = model.encode(query)

    # With identical vectors the weighted sum equals sum(weights) * q;
    # the form is kept to mirror weighted_embedding().
    combined = sum(weight * q for weight in weights.values())

    # Normalizing removes the scale factor and returns a unit-length vector.
    combined = combined / np.linalg.norm(combined)

    return combined.tolist()


## Semantic Search with Weighted Embeddings

A semantic query is converted into a weighted embedding
and searched against the Pinecone index.

Results are filtered using a similarity threshold.
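
Separating the filtering step from printing makes the threshold easier to tune and test. A minimal sketch, assuming matches behave like dicts with a `score` key; `filter_matches` is a hypothetical helper and the sample scores are made up.

```python
def filter_matches(matches, threshold):
    """Keep only matches whose similarity score meets the threshold."""
    return [m for m in matches if m["score"] >= threshold]


# Made-up matches mimicking the shape of query_results['matches'].
matches = [
    {"id": "1_10", "score": 0.62},
    {"id": "2_11", "score": 0.41},
    {"id": "3_12", "score": 0.18},
]

kept = filter_matches(matches, threshold=0.4)
print([m["id"] for m in kept])  # ['1_10', '2_11']
```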


In [17]:
query = "regression in python"
query_embedding = weighted_query_embedding(query, model, WEIGHTS)


In [18]:
query_results = index.query(
    vector=query_embedding,
    top_k=12,
    include_metadata=True
)

In [19]:
score_threshold = 0.4

In [20]:
for match in query_results['matches']:
    if match['score'] >= score_threshold:
        course_details =  match.get('metadata', {})
        course_name = course_details.get('course_name', 'N/A')
        section_name = course_details.get('section_name', 'N/A')
        section_description = course_details.get('section_description', 'N/A')
        print(f"Matched Item ID: {match['id']}, Score: {match['score']}")
        print(f"Course: {course_name} \nSection: {section_name} \nDescription: {section_description}\n")

## Summary

This notebook demonstrated a structured semantic search pipeline using
**weighted embeddings**:

- Built unique IDs for course sections
- Stored interpretability metadata
- Embedded structured fields separately
- Combined embeddings with a weighted sum
- Normalized vectors for cosine similarity
- Indexed weighted vectors into Pinecone
- Queried Pinecone using the same weighted embedding strategy

Weighted embeddings provide a practical approach to improving retrieval
quality when working with structured datasets.
