# Semantic Search with Pinecone on Course Descriptions

This notebook demonstrates how to build a **semantic search system**
using **Pinecone** as the vector database and **Sentence Transformers**
for embedding text into vectors.

The workflow includes:
- Loading structured course data from CSV
- Creating enhanced course text descriptions
- Generating embeddings using `all-MiniLM-L6-v2`
- Indexing vectors into Pinecone
- Performing semantic search queries


In [None]:
import os

import pandas as pd
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer

## Loading Course Data

The course metadata is loaded from a CSV file.
A new human-readable course description is generated
to enrich semantic meaning for embedding.


In [None]:
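# 'ANSI' is only recognised as a codec name on Windows; 'cp1252' is the usual
# Western "ANSI" code page and works on every platform.
files = pd.read_csv("../../data/course_descriptions.csv", encoding='cp1252')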
pd.set_option('display.max_rows', 106)
files.head()

## Creating an Enriched Course Description

A combined text representation improves embedding quality by including:
- Course name
- Slug
- Technology
- Topic


In [None]:
def create_course_Description(row):
    # Build one readable sentence from the structured fields.
    return (
        f"The course name is {row['course_name']}, "
        f"the slug is {row['course_slug']}, "
        f"the technology is {row['course_technology']}, "
        f"and the course topic is {row['course_topic']}."
    )

In [None]:

files['course_description_new'] = files.apply(create_course_Description, axis=1)
print(files['course_description_new'])
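
If any of these columns contain missing values, string formatting would place the literal text `nan` into the description. The following is a minimal sketch, not part of the original notebook, of a variant that skips missing fields. The helper names `is_missing` and `describe_course` and the sample row are assumptions made for illustration.

```python
import math

def is_missing(value):
    """Return True for None, NaN, or blank strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()

def describe_course(row):
    """Build a description from whichever fields are present (hypothetical helper)."""
    labels = [
        ("course_name", "The course name is"),
        ("course_slug", "the slug is"),
        ("course_technology", "the technology is"),
        ("course_topic", "the course topic is"),
    ]
    parts = [
        f"{label} {row[key]}"
        for key, label in labels
        if not is_missing(row.get(key))
    ]
    return ", ".join(parts)

# Illustrative row with a missing technology value.
example = {
    "course_name": "Intro to SQL",
    "course_slug": "intro-sql",
    "course_technology": float("nan"),
    "course_topic": "Databases",
}
print(describe_course(example))
```

Because the helper only relies on `row[key]` and `row.get(key)`, it should also work when passed to `files.apply(..., axis=1)`, though that has not been verified against the course CSV.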

## Pinecone Setup

A Pinecone serverless index is created using:
- Cosine similarity
- Fixed embedding dimensionality (384)
- AWS + a specified region

If an index with the same name already exists, it is deleted first and then recreated, which discards any vectors stored in it.


In [None]:
# Serverless indexes take their cloud and region from ServerlessSpec,
# so only the API key is needed to create the client.
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))

In [None]:
index_name = "my-index"
dimension = 384  # Dimension of the embeddings
metric = "cosine"  # Similarity metric

In [None]:
if index_name in [i.name for i in pc.list_indexes()]:
    pc.delete_index(index_name)
    print(f"Deleted existing index '{index_name}'.")
else:
    print(f"{index_name} not in the index list.")

In [None]:
pc.create_index(
    name=index_name,
    dimension=dimension,
    metric=metric,
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

In [None]:
index = pc.Index(index_name)

## Generating Text Embeddings

A Sentence Transformer model converts each course record into a vector.

Text fields are combined to maximize semantic information:
- Original description
- Generated description
- Short description


In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
def create_embedding(row):
    # Join the three description fields into a single string before encoding.
    fields = ['course_description', 'course_description_new', 'course_description_short']
    combined_text = ' '.join(str(row[field]) for field in fields)
    embedding = model.encode(combined_text, show_progress_bar=False)
    return embedding

In [None]:
files["embedding"] = files.apply(create_embedding, axis=1)
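
Since the index was created with a fixed dimension of 384, it can help to confirm that every embedding has that length before upserting. The following is a minimal sketch of such a check using plain Python lists. The helper name `find_bad_dimensions` and the sample vectors are assumptions for illustration.

```python
EXPECTED_DIM = 384  # matches the `dimension` used when creating the index

def find_bad_dimensions(vectors, expected_dim=EXPECTED_DIM):
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]

# Illustrative input: the second vector is one element short.
sample = [[0.0] * 384, [0.0] * 383, [0.0] * 384]
print(find_bad_dimensions(sample))  # [1]
```

In the notebook, the same helper could presumably be called on `files["embedding"]`, where an empty result means all vectors match the index dimension.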

## Upserting Vectors into Pinecone

Each vector is upserted as:
- `id`: course_name
- `values`: embedding vector

Because the course name is the vector ID, every match returned by a query carries the course name directly.


In [None]:
# Build (id, values) tuples, converting each NumPy array to a plain list.
vectors_to_upsert = [
    (str(row['course_name']), row['embedding'].tolist())
    for _, row in files.iterrows()
]
index.upsert(vectors=vectors_to_upsert)

print("Data upserted successfully.")
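
A single upsert call is enough for a small catalogue. For larger datasets it is common to send vectors in batches, and attaching metadata at upsert time gives `include_metadata=True` something to return at query time. The following is a minimal sketch of both ideas. The helpers `build_records` and `chunked`, the metadata keys, the batch size, and the sample rows are all assumptions for illustration. The records use the dictionary form with `id`, `values`, and `metadata` keys.

```python
from itertools import islice

def build_records(rows):
    """Turn row dicts into records with metadata (hypothetical helper)."""
    return [
        {
            "id": str(row["course_name"]),
            # For NumPy arrays, use row["embedding"].tolist() instead.
            "values": list(row["embedding"]),
            "metadata": {
                "technology": str(row["course_technology"]),
                "topic": str(row["course_topic"]),
            },
        }
        for row in rows
    ]

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch

# Illustrative rows standing in for the course DataFrame.
rows = [
    {
        "course_name": f"Course {i}",
        "embedding": [0.1, 0.2, 0.3],
        "course_technology": "Python",
        "course_topic": "Demo",
    }
    for i in range(5)
]
batches = list(chunked(build_records(rows), size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

Each batch could then be sent with `index.upsert(vectors=batch)`. The right batch size depends on vector dimension and metadata size, so the value used here is only a placeholder.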

## Semantic Search

A query is embedded using the same embedding model and searched
against the Pinecone index using cosine similarity.

The results are filtered by a similarity score threshold.
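
To make the metric concrete, here is a minimal pure-Python sketch of cosine similarity, the dot product of two vectors divided by the product of their lengths. It is for illustration only, since Pinecone computes the score on the server. The function name `cosine_similarity` and the toy vectors are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
```

Scores range from -1 to 1, with higher values indicating that the query and the course text point in a more similar direction in embedding space.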


In [None]:
query = "Clustering"
query_embedding = model.encode(query).tolist()

In [None]:
# `vector` takes a flat list of floats, so pass the embedding itself rather than a nested list.
query_results = index.query(vector=query_embedding, top_k=12, include_metadata=True)

In [None]:
query_results

### Filtering Search Results

Only results above a similarity threshold are displayed.


In [None]:
score_threshold = 0.3
for match in query_results['matches']:
    if match['score'] >= score_threshold:
        print(f"Course Name: {match['id']}, Score: {match['score']}")
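
The same filtering step can be wrapped in a small reusable function. The following is a minimal sketch that works on any list of match dictionaries with `id` and `score` keys. The helper name `filter_matches` and the sample matches are made up for illustration and are not real query results.

```python
def filter_matches(matches, score_threshold=0.3):
    """Keep matches at or above the threshold, highest score first."""
    kept = [m for m in matches if m["score"] >= score_threshold]
    return sorted(kept, key=lambda m: m["score"], reverse=True)

# Hypothetical matches, shaped like the entries in query_results['matches'].
sample_matches = [
    {"id": "Course A", "score": 0.62},
    {"id": "Course B", "score": 0.18},
    {"id": "Course C", "score": 0.41},
]
for match in filter_matches(sample_matches):
    print(f"Course Name: {match['id']}, Score: {match['score']:.2f}")
```

The threshold of 0.3 is the value chosen in this notebook. A suitable cutoff depends on the embedding model and the data, so it is worth tuning against a few known queries.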

## Summary

This notebook demonstrated end-to-end semantic search using Pinecone:

- Loaded structured course metadata from CSV
- Enriched course descriptions for better embeddings
- Generated embeddings using Sentence Transformers
- Created a Pinecone vector index
- Upserted course vectors into the index
- Queried the index for semantic search

This structure can be extended into:
- metadata filtering (topic/technology)
- RAG retrieval
- search UI integration
