# Weighted Embeddings Semantic Search (Pinecone + Sentence Transformers)

This notebook demonstrates **weighted embeddings** for semantic search.

Instead of embedding one concatenated text field, each structured field
(course + section attributes) is embedded separately and combined using
a weighted sum.

This approach allows:
- prioritizing important fields (e.g., section description)
- improving retrieval quality on structured educational datasets
- fine-grained search at the section level


In [None]:
import os

import numpy as np
import pandas as pd
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer

In [None]:
# The Pinecone client class needs only the API key; the `environment`
# argument belonged to the legacy `pinecone.init` interface.
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))

## Loading Course + Section Dataset

Each section is assigned a stable unique identifier:

`course_id_section_id`

This enables section-level indexing and retrieval.


In [None]:
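# "ANSI" is not a portable Python codec name (it is only an alias on Windows).
# cp1252 is assumed here as the usual Western "ANSI" code page.
files = pd.read_csv("../../data/course_section_descriptions.csv", encoding="cp1252")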

In [None]:
files["unique_id"] = files["course_id"].astype(str) + "_" + files["section_id"].astype(str)
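
Since a Pinecone upsert overwrites any existing vector that has the same ID, it may be worth confirming that the generated IDs are unique before indexing. The following is a minimal sketch using only the standard library; the helper name `find_duplicate_ids` and the sample IDs are hypothetical and not part of the original notebook.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return the IDs that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [item for item, n in counts.items() if n > 1]


# Hypothetical sample IDs in the course_id_section_id format.
sample_ids = ["101_1", "101_2", "102_1", "101_2"]
duplicates = find_duplicate_ids(sample_ids)
print(duplicates)
```

In the notebook, the same helper could be called as `find_duplicate_ids(files["unique_id"])`; an empty result means every section has its own ID.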

## Metadata Storage

Metadata is attached to each vector so results can be interpreted
without requiring a lookup table.


In [None]:
files["metadata"] = files.apply(lambda row: {
    "course_name": row["course_name"],
    "section_name": row["section_name"],
    "section_description": row["section_description"]
}, axis=1)

## Field Weights

Each field is embedded separately and combined using a weighted sum.

A higher weight increases that field's influence on the similarity search.


In [None]:
WEIGHTS = {
    "course_name": 0.35,
    "course_technology": 0.15,
    "course_description": 0.10,
    "section_name": 0.20,
    "section_description": 0.25
}
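
The weights above sum to 1.05 rather than 1. Because the combined vector is L2-normalized afterwards, only the ratios between the weights matter, but rescaling them to sum to 1 can make each value easier to read as a share of the total. A minimal sketch, where `normalize_weights` is a hypothetical helper and not part of the original notebook:

```python
def normalize_weights(weights):
    """Rescale weights so they sum to 1 while preserving their ratios."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {field: value / total for field, value in weights.items()}


# Same values as the WEIGHTS dictionary in the notebook.
weights = {
    "course_name": 0.35,
    "course_technology": 0.15,
    "course_description": 0.10,
    "section_name": 0.20,
    "section_description": 0.25,
}

normalized = normalize_weights(weights)
for field, value in normalized.items():
    print(f"{field}: {value:.4f}")
```

Using the rescaled weights in place of the originals should give the same final embeddings up to floating-point rounding, since the normalization step removes any overall scale.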


## Building Weighted Embeddings

For each record:
1. Encode each field into its own embedding vector
2. Combine using weights
3. Normalize the final embedding vector

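Cosine similarity ignores vector length, so normalization does not change the ranking. It does keep every stored vector at unit length, which makes cosine and dot-product scores coincide.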
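
Written out, the three steps correspond to the formula below. The symbols are introduced here for illustration and do not appear in the original notebook: $F$ is the set of fields, $w_f$ is the weight of field $f$, and $\mathbf{e}_f$ is the embedding of that field's text.

$$
\mathbf{s} = \sum_{f \in F} w_f \, \mathbf{e}_f,
\qquad
\mathbf{v} = \frac{\mathbf{s}}{\lVert \mathbf{s} \rVert_2}
$$

The unit-length vector $\mathbf{v}$ is what gets stored in the index for each section.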


In [None]:
def weighted_embedding(row, model, weights):
    e_course_name = model.encode(row['course_name'])
    e_course_technology = model.encode(row['course_technology'])
    e_course_description = model.encode(row['course_description'])
    e_section_name = model.encode(row['section_name'])
    e_section_description = model.encode(row['section_description'])

    combined = (
        weights['course_name'] * e_course_name +
        weights['course_technology'] * e_course_technology +
        weights['course_description'] * e_course_description +
        weights['section_name'] * e_section_name +
        weights['section_description'] * e_section_description
    )

    # L2-normalize so the stored vector has unit length.
    combined = combined / np.linalg.norm(combined)
    return combined.tolist()
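
Calling `model.encode` once per field per row can be slow on larger datasets. `SentenceTransformer.encode` also accepts a list of strings and returns one vector per string, so each field can be encoded for all rows in a single call. The sketch below relies on that behavior; the function name `weighted_embeddings_batch` and the list-of-dicts input format are hypothetical and not part of the original notebook.

```python
import numpy as np


def weighted_embeddings_batch(records, model, weights):
    """Compute unit-length weighted embeddings for many records at once.

    records: list of dicts mapping field name -> text (assumed input format).
    model:   any object whose encode(list_of_str) returns one vector per string.
    weights: dict mapping field name -> weight.
    """
    combined = None
    for field, weight in weights.items():
        # Encode this field for every record in one call.
        texts = [str(record[field]) for record in records]
        vectors = np.asarray(model.encode(texts), dtype=float)  # shape (n, dim)
        scaled = weight * vectors
        combined = scaled if combined is None else combined + scaled

    # L2-normalize each row; all-zero rows are left as zeros to avoid dividing by zero.
    norms = np.linalg.norm(combined, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return combined / norms
```

With the notebook's DataFrame, the input could be built with `files.to_dict("records")`, and each row of the returned array converted with `.tolist()` before upserting.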

## Embedding Model

`all-MiniLM-L6-v2` is used to generate embeddings:
- general purpose semantic embeddings
- 384 dimensional vectors
- small and fast


In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
files["embeddings"] = files.apply(lambda row: weighted_embedding(row, model, WEIGHTS), axis=1)

## Creating Pinecone Index

The Pinecone index must match the embedding dimension.

Here:
- dimension = 384 (MiniLM)
- metric = cosine


In [None]:
index_name = "my-index"
dimension = 384  # Dimension of the embeddings
metric = "cosine"  # Similarity metric

In [None]:
if index_name in [i.name for i in pc.list_indexes()]:
    pc.delete_index(index_name)
    print(f"Deleted existing index '{index_name}'.")
else:
    print(f"{index_name} not in the index list.")

In [None]:
pc.create_index(
    name=index_name,
    dimension=dimension,
    metric=metric,
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

In [None]:
index = pc.Index(index_name)

## Upserting Weighted Vectors

Vectors are inserted into Pinecone as:
- ID: unique section identifier
- Values: weighted embedding
- Metadata: course + section info


In [None]:
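# Build (id, values, metadata) tuples. The loop variable is named `_` so it
# does not reuse the name `index`, which already refers to the Pinecone index.
vectors_to_upsert = [
    (row["unique_id"], row["embeddings"], row["metadata"])
    for _, row in files.iterrows()
]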

In [None]:
index.upsert(vectors=vectors_to_upsert)
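
Sending every vector in a single upsert request may run into request size limits on larger datasets, so upserts are commonly split into batches. The sketch below shows one way to do that; the `chunked` helper and the batch size of 100 are assumptions and not part of the original notebook.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


# Intended use with the notebook's objects (not executed here):
# for batch in chunked(vectors_to_upsert, 100):
#     index.upsert(vectors=batch)

print(list(chunked(range(5), 2)))
```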

## Weighted Query Embedding

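The query goes through the same weighting and normalization steps as the
indexed vectors. Note that the same query text is encoded for every field, so
the weights only scale a single vector and normalization cancels them: the
result is the unit-length embedding of the query.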


In [None]:
def weighted_query_embedding(query, model, weights):
    # encode query as if it belongs to each field
    q_course_name = model.encode(query)
    q_course_technology = model.encode(query)
    q_course_description = model.encode(query)
    q_section_name = model.encode(query)
    q_section_description = model.encode(query)

    combined = (
        weights["course_name"] * q_course_name +
        weights["course_technology"] * q_course_technology +
        weights["course_description"] * q_course_description +
        weights["section_name"] * q_section_name +
        weights["section_description"] * q_section_description
    )

    combined = combined / np.linalg.norm(combined)

    return combined.tolist()
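
Because every field receives the same query vector, the weighted sum here is that vector scaled by the sum of the weights, and normalization removes the scale. The sketch below checks this numerically, using a random stand-in vector in place of a real `model.encode(query)` result; the stand-in vector and the variable names are assumptions for illustration.

```python
import numpy as np

# Same values as the WEIGHTS dictionary in the notebook.
weights = {
    "course_name": 0.35,
    "course_technology": 0.15,
    "course_description": 0.10,
    "section_name": 0.20,
    "section_description": 0.25,
}

rng = np.random.default_rng(0)
q = rng.normal(size=384)  # stand-in for model.encode(query)

# Weighted sum of five copies of the same vector, then L2 normalization.
combined = sum(w * q for w in weights.values())
combined = combined / np.linalg.norm(combined)

# Plain L2-normalized query vector, with no weights involved.
plain = q / np.linalg.norm(q)

print(np.allclose(combined, plain))
```

With identical text for every field, the weights therefore shape only the indexed vectors, not the query vector.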


## Semantic Search

A query is embedded and searched against the Pinecone index.
Only matches scoring at or above a similarity threshold are displayed.


In [None]:
query = "regression in python"
query_embedding = weighted_query_embedding(query, model, WEIGHTS)


In [None]:
query_results = index.query(
    vector=query_embedding,
    top_k=12,
    include_metadata=True
)


In [None]:
score_threshold = 0.4

In [None]:
for match in query_results['matches']:
    if match['score'] >= score_threshold:
        course_details = match.get('metadata', {})
        course_name = course_details.get('course_name', 'N/A')
        section_name = course_details.get('section_name', 'N/A')
        section_description = course_details.get('section_description', 'N/A')
        print(f"Matched Item ID: {match['id']}, Score: {match['score']}")
        print(f"Course: {course_name} \nSection: {section_name} \nDescription: {section_description}\n")

## Summary

This notebook demonstrated **weighted embeddings semantic search**:

- Created section-level IDs (`course_id_section_id`)
- Stored interpretability metadata
- Embedded multiple structured fields separately
- Combined embeddings using weighted sum
- Normalized final vectors for cosine similarity
- Indexed the result in Pinecone
- Queried Pinecone using aligned weighted query embeddings

Weighted embedding strategies are useful when structured fields
do not contribute equally to semantic relevance.
