# Semantic Search with Pinecone (Course + Section Level Indexing)

This notebook demonstrates building a **section-level semantic search**
system using **Pinecone** and **Sentence Transformers**.

Instead of embedding each course as a single vector, this implementation embeds one vector per section, combining:

- Course-level information (name, technology, description)
- Section-level information (section name, section description)

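Indexing at the section level lets a query match a specific section rather than a whole course, which gives finer-grained results for search and for retrieval-augmented generation (RAG).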


In [9]:
import os

import pandas as pd
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer

In [10]:
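# The current Pinecone client only needs an API key here; the cloud and region
# are chosen later through ServerlessSpec when the index is created.
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))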

In [11]:
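# "ANSI" is only recognized as a codec name on Windows. "cp1252" is assumed here
# as the usual Western European Windows code page; adjust it if the file differs.
files = pd.read_csv("../../data/course_section_descriptions.csv", encoding="cp1252")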

## Loading Course Section Dataset

The dataset contains course and section information.

A `unique_id` is constructed using:
- `course_id`
- `section_id`

This ensures each section is uniquely indexed in Pinecone.
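
Because an upsert replaces any existing vector that has the same ID, duplicate `(course_id, section_id)` pairs would silently overwrite each other. A minimal sketch of a check that can be run before indexing is shown here; the helper names `make_unique_id` and `find_duplicate_ids` are hypothetical and the sample rows are made up.

```python
from collections import Counter


def make_unique_id(course_id, section_id):
    """Join the two IDs with an underscore, mirroring the notebook's scheme."""
    return f"{course_id}_{section_id}"


def find_duplicate_ids(ids):
    """Return the IDs that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [uid for uid, n in counts.items() if n > 1]


# Illustrative (course_id, section_id) pairs, not taken from the dataset.
rows = [(1, 1), (1, 2), (2, 1), (1, 2)]
ids = [make_unique_id(course, section) for course, section in rows]
duplicates = find_duplicate_ids(ids)
print(duplicates)
```

With the real data, the same check could be applied to `files["unique_id"].tolist()`.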


In [12]:
files["unique_id"] = files["course_id"].astype(str) + "_" + files["section_id"].astype(str)

## Creating Metadata

Metadata is attached to each vector so retrieval can return:
- Course name
- Section name
- Section description

The metadata is stored alongside each vector in Pinecone
and returned with the query results when `include_metadata=True` is set.
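
Pinecone restricts metadata values to strings, numbers, booleans, and lists of strings, and it does not accept nulls. Since pandas loads empty CSV cells as `NaN`, one option is to drop such fields before upserting. The following is a sketch under that assumption; `clean_metadata` is a hypothetical helper and the example values are made up.

```python
import math


def clean_metadata(metadata):
    """Return a copy of `metadata` without None or NaN values."""
    cleaned = {}
    for key, value in metadata.items():
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        cleaned[key] = value
    return cleaned


example = {
    "course_name": "Intro to Python",  # made-up value
    "section_name": float("nan"),      # an empty cell as pandas would load it
    "section_description": None,
}
print(clean_metadata(example))
```

In the notebook this could be applied as `files["metadata"].apply(clean_metadata)` before building the upsert list.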


In [13]:
files["metadata"] = files.apply(lambda row: {
    "course_name": row["course_name"],
    "section_name": row["section_name"],
    "section_description": row["section_description"]
}, axis=1)

## Embedding Model

The Sentence Transformers library converts text into dense vectors.
`all-MiniLM-L6-v2` is a small, fast model that produces 384-dimensional
embeddings, which is why the Pinecone index below uses a dimension of 384.


In [15]:
model = SentenceTransformer('all-MiniLM-L6-v2')

## Generating Embeddings for Course Sections

Each row is converted into a combined text block containing:
- Course name and technology
- Course description
- Section name and section description

Including the section text lets a query about a specific topic
(e.g., "regression in python") match the section that covers it,
not just the parent course.
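
One detail worth guarding against: when a field is missing, f-string interpolation renders it as the literal text `nan`, which then ends up in the embedded text. The sketch below builds the combined text while skipping missing or empty fields. `build_combined_text` and `is_missing` are hypothetical helpers, the column names are the ones used in this notebook, and the sample row is made up.

```python
import math

# Column names as used in this notebook's combined text.
TEXT_FIELDS = (
    "course_name",
    "course_technology",
    "course_description",
    "section_name",
    "section_description",
)


def is_missing(value):
    """True for None and for float NaN (how pandas marks empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_combined_text(row, fields=TEXT_FIELDS):
    """Join the non-missing, non-empty fields of `row` with single spaces."""
    parts = []
    for field in fields:
        value = row.get(field)
        if is_missing(value):
            continue
        text = str(value).strip()
        if text:
            parts.append(text)
    return " ".join(parts)


sample_row = {
    "course_name": "Machine Learning",
    "course_technology": "Python",
    "course_description": float("nan"),  # missing description
    "section_name": "Linear Regression",
    "section_description": "  Fitting a line to data.  ",
}
print(build_combined_text(sample_row))
```

The function only relies on `.get`, so it accepts a plain dict as well as a pandas row.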


In [None]:
def create_embeddings(row):
    """Embed the concatenated course and section text of one row."""
    combined_text = f"{row['course_name']} {row['course_technology']} {row['course_description']} {row['section_name']} {row['section_description']}"

    return model.encode(combined_text)

In [16]:
files["embeddings"] = files.apply(create_embeddings, axis=1)

## Pinecone Index Setup

A Pinecone serverless index is created using:
- cosine similarity
- 384-dimensional vectors
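
For reference, cosine similarity compares the direction of two vectors and ignores their length: it is 1 for identical directions, 0 for orthogonal vectors, and -1 for opposite ones. The pure-Python sketch below only illustrates the metric (Pinecone computes it internally); the function name `cosine_similarity` is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```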

The index is deleted first if it already exists
to ensure a clean build.
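
A freshly created index may need a short time before it accepts requests. One common pattern is to poll `describe_index` until its status reports ready. The sketch below assumes the client object exposes `describe_index(name).status["ready"]`, as the current Pinecone Python client does; `wait_until_ready`, the timeout, and the polling interval are hypothetical choices.

```python
import time


def wait_until_ready(pc, index_name, timeout_s=60.0, poll_s=1.0):
    """Poll until the index reports ready; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if pc.describe_index(index_name).status["ready"]:
            return True
        time.sleep(poll_s)
    return False


# Hypothetical usage after pc.create_index(...):
# if not wait_until_ready(pc, index_name):
#     raise RuntimeError(f"Index '{index_name}' was not ready in time.")
```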


In [18]:
index_name = "my-index"
dimension = 384  # Dimension of the embeddings
metric = "cosine"  # Similarity metric

In [19]:
if index_name in [i.name for i in pc.list_indexes()]:
    pc.delete_index(index_name)
    print(f"Deleted existing index '{index_name}'.")
else:
    print(f"{index_name} not in the index list.")

In [20]:
pc.create_index(
    name=index_name,
    dimension=dimension,
    metric=metric,
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

In [21]:
index = pc.Index(index_name)

## Upserting Section Vectors into Pinecone

Vectors are stored as:

- `id`: `{course_id}_{section_id}` (unique per section)
- `values`: the 384-dimensional embedding of the combined course and section text
- `metadata`: course and section details used to display results
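
Upsert requests are subject to size limits, so larger datasets are usually sent in batches rather than in one call. A minimal sketch, assuming an index object with an `upsert(vectors=...)` method as used in this notebook; the helpers `chunked` and `upsert_in_batches` and the default batch size of 100 are hypothetical.

```python
from itertools import islice


def chunked(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def upsert_in_batches(index, vectors, batch_size=100):
    """Upsert `vectors` batch by batch and return how many were sent."""
    total = 0
    for batch in chunked(vectors, batch_size):
        index.upsert(vectors=batch)
        total += len(batch)
    return total


print(list(chunked(range(5), 2)))
```

In the notebook, `upsert_in_batches(index, vectors_to_upsert)` would take the place of a single `index.upsert` call.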


In [22]:
# Convert each NumPy embedding to a plain list of floats before sending it.
vectors_to_upsert = [
    (row["unique_id"], row["embeddings"].tolist(), row["metadata"])
    for _, row in files.iterrows()
]

In [23]:
index.upsert(vectors=vectors_to_upsert)
print("Data upserted successfully.")

## Semantic Search Query

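A natural language query is embedded with the same model used for the
indexed text, so that both live in the same vector space, and is then
searched against Pinecone.

Pinecone returns the `top_k` nearest vectors. With cosine similarity,
higher scores indicate more relevant course sections.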


In [34]:
query = "regression in python"
query_embedding = model.encode(query).tolist()

In [35]:
query_results = index.query(
    vector=query_embedding,
    top_k=12,
    include_metadata=True
)

In [36]:
score_threshold = 0.4

## Filtering Search Results

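Matches scoring below `score_threshold` (0.4 here) are discarded. The cutoff
is a heuristic and may need tuning for other datasets or models.
For each remaining match, the associated metadata is displayed: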
- course name
- section name
- section description
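
The filtering step can also be wrapped in small reusable functions. The sketch below works on plain dictionaries shaped like the matches in this notebook (`id`, `score`, `metadata`); `filter_matches` and `format_match` are hypothetical helpers and the sample matches are made up.

```python
def filter_matches(matches, score_threshold):
    """Keep matches at or above the threshold, highest score first."""
    kept = [m for m in matches if m["score"] >= score_threshold]
    return sorted(kept, key=lambda m: m["score"], reverse=True)


def format_match(match):
    """Render one match as a single summary line."""
    metadata = match.get("metadata") or {}
    course = metadata.get("course_name", "N/A")
    section = metadata.get("section_name", "N/A")
    return f"{match['id']} ({match['score']:.3f}): {course} / {section}"


# Made-up matches for illustration only.
sample_matches = [
    {"id": "2_1", "score": 0.35, "metadata": {"course_name": "SQL Basics", "section_name": "Joins"}},
    {"id": "1_2", "score": 0.75, "metadata": {"course_name": "Machine Learning", "section_name": "Regression"}},
    {"id": "1_3", "score": 0.5, "metadata": {}},
]
for match in filter_matches(sample_matches, score_threshold=0.4):
    print(format_match(match))
```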


In [37]:
for match in query_results['matches']:
    if match['score'] >= score_threshold:
        course_details = match.get('metadata', {})
        course_name = course_details.get('course_name', 'N/A')
        section_name = course_details.get('section_name', 'N/A')
        section_description = course_details.get('section_description', 'N/A')
        print(f"Matched Item ID: {match['id']}, Score: {match['score']}")
        print(f"Course: {course_name} \nSection: {section_name} \nDescription: {section_description}\n")

## Summary

This notebook demonstrated section-level semantic search with Pinecone:

- Built unique IDs per course section
- Stored rich metadata for interpretability
- Embedded course + section text using Sentence Transformers
- Created and populated a Pinecone index
- Queried Pinecone using natural language search
- Filtered results using a similarity threshold

This structure can be extended with:
- metadata filtering (by technology or topic)
- retrieval for RAG-based question answering
- a search UI for course recommendations
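
As an illustration of the metadata filtering extension, Pinecone queries accept a `filter` dictionary with operators such as `$eq` and `$in`. The sketch below assumes a `course_technology` field has been added to each vector's metadata, which this notebook does not currently do; `build_technology_filter` is a hypothetical helper.

```python
def build_technology_filter(technologies):
    """Build a metadata filter matching any of the given technologies."""
    values = [t for t in technologies if t]
    if not values:
        raise ValueError("at least one technology is required")
    if len(values) == 1:
        return {"course_technology": {"$eq": values[0]}}
    return {"course_technology": {"$in": values}}


print(build_technology_filter(["Python"]))

# Hypothetical usage against the notebook's index:
# index.query(
#     vector=query_embedding,
#     top_k=12,
#     include_metadata=True,
#     filter=build_technology_filter(["Python", "SQL"]),
# )
```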
