<a href="https://colab.research.google.com/github/solomontessema/Generative-AI-with-Python/blob/main/notebooks/new_Vector_Database_with_Pinecone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## What you'll do

1. Install and configure **Pinecone** and **OpenAI**.
2. Review how similarity metrics work and when to use each.
3. Create **three Pinecone indexes** (cosine / euclidean / dotproduct).
4. Upsert the **same vectors** into each index.
5. Run the **same query** against each index and compare results.
6. Try **metadata filtering**, **updates**, and **deletes**.
7. (Optional) See a **FastAPI** example for a semantic search API.

> 🧪 This notebook is designed for experimentation—feel free to tweak texts, add more vectors, and observe how rankings change across metrics.

In [None]:
# %%capture
# The Pinecone SDK is published as `pinecone`; the older `pinecone-client` name is deprecated.
!pip -q install pinecone openai python-dotenv

## Configure API keys (Pinecone + OpenAI)

- You can paste your keys when prompted, or set them in a `.env` file.
- **Never** commit secrets to version control.

**Required env vars:**
- `PINECONE_API_KEY`
- `OPENAI_API_KEY`

In [None]:
import os
from getpass import getpass
from dotenv import load_dotenv

load_dotenv()

if not os.getenv("PINECONE_API_KEY"):
    os.environ["PINECONE_API_KEY"] = getpass("Enter PINECONE_API_KEY: ")
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY: ")

print("Pinecone key set:", bool(os.getenv("PINECONE_API_KEY")))
print("OpenAI key set:  ", bool(os.getenv("OPENAI_API_KEY")))

## Handling Different Similarity Metrics

**Cosine similarity**
- Measures the angle between vectors (direction-only).
- Popular for text embeddings; magnitude differences are downplayed.

**Euclidean distance**
- Straight-line distance between points in space.
- Sensitive to vector magnitude and scale.

**Dot product**
- Sum of element-wise products.
- Can emphasize vectors with larger magnitudes (length matters).
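
The three definitions above can be sketched in plain Python. This is a minimal illustration on made-up 2-dim vectors, not how Pinecone computes scores internally; the helper names `dot`, `cosine`, and `euclidean` are chosen here for the example.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0]
same_direction_longer = [3.0, 0.0]   # same direction as the query, 3x the length
nearby_other_direction = [1.0, 1.0]  # 45 degrees away from the query

for name, vec in [("longer", same_direction_longer), ("nearby", nearby_other_direction)]:
    print(
        name,
        "cosine:", round(cosine(query, vec), 3),
        "euclidean:", round(euclidean(query, vec), 3),
        "dot:", round(dot(query, vec), 3),
    )
```

On these toy vectors, cosine and dot product both prefer the longer vector that points the same way as the query, while Euclidean distance prefers the nearby vector. The metrics can disagree only when vector lengths differ.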

## Initialize clients & helpers

We will:
1. Initialize the Pinecone client (the serverless spec is supplied later, when the indexes are created).
2. Initialize the OpenAI client.
3. Define a reusable `embed()` function using **`text-embedding-3-small`** (1536-dim).

In [None]:
from pinecone import Pinecone, ServerlessSpec
from openai import OpenAI

PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

pc = Pinecone(api_key=PINECONE_API_KEY)
client = OpenAI(api_key=OPENAI_API_KEY)

EMBED_MODEL = "text-embedding-3-small"  # 1536-dim
EMBED_DIM = 1536

def embed(text: str):
    """Return a single 1536-dim embedding vector for the given text."""
    resp = client.embeddings.create(model=EMBED_MODEL, input=[text])
    return resp.data[0].embedding

print("Clients ready.")

## Create three indexes with different metrics

We'll create:
- `my-index-cosine` (cosine)
- `my-index-euclidean` (euclidean)
- `my-index-dotproduct` (dotproduct)

⚠️ Index creation can take a few seconds. If the index exists, we reuse it.

In [None]:
CLOUD = "aws"
REGION = "us-east-1"

index_names = {
    "cosine": "my-index-cosine",
    "euclidean": "my-index-euclidean",
    "dotproduct": "my-index-dotproduct"
}

import time

existing = set(pc.list_indexes().names())

for metric, name in index_names.items():
    if name not in existing:
        print(f"Creating {name} with metric={metric}...")
        pc.create_index(
            name=name,
            dimension=EMBED_DIM,
            metric=metric,
            spec=ServerlessSpec(cloud=CLOUD, region=REGION)
        )
        # Block until the new index reports ready; upserts can fail before that.
        while not pc.describe_index(name).status["ready"]:
            time.sleep(1)
    else:
        print(f"{name} already exists; skipping creation.")

index_cosine = pc.Index(index_names["cosine"])
index_euclidean = pc.Index(index_names["euclidean"])
index_dotproduct = pc.Index(index_names["dotproduct"])

print("Indexes are ready for use.")

## Upsert the same vectors into each index

We'll use two short texts with distinct topics and add a `topic` metadata field to each. 

This makes it easy to demonstrate **metadata filtering** later.

In [None]:
texts = {
    "concept1": {
        "text": "The theory of relativity and its implications",
        "metadata": {"topic": "physics"}
    },
    "concept2": {
        "text": "The role of empathy in leadership",
        "metadata": {"topic": "psychology"}
    }
}

def upsert_to_index(index, items):
    vectors = []
    for _id, payload in items.items():
        vec = embed(payload["text"])  # 1536-dim
        vectors.append({"id": _id, "values": vec, "metadata": payload["metadata"]})
    index.upsert(vectors=vectors)

for idx in (index_cosine, index_euclidean, index_dotproduct):
    upsert_to_index(idx, texts)

print("Upserted vectors into all three indexes.")

## Query each index with the same query vector

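We'll query for **“The theory of relativity”** and compare the top-k results from each index. Freshly upserted vectors can take a few seconds to become queryable, so if a result list comes back empty, wait a moment and rerun the cell.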

In [None]:
from pprint import pprint

query_text = "The theory of relativity"
query_vec = embed(query_text)

def run_query(index, vector, top_k=5, with_metadata=True):
    return index.query(
        vector=vector,
        top_k=top_k,
        include_metadata=with_metadata
    )

res_cosine = run_query(index_cosine, query_vec)
res_euclid = run_query(index_euclidean, query_vec)
res_dot   = run_query(index_dotproduct, query_vec)

print("\n=== Cosine similarity results ===")
pprint([{"id": m.id, "score": m.score, "metadata": m.metadata} for m in res_cosine.matches])

# On a euclidean index a lower score means a closer match (Pinecone reports the squared distance).
print("\n=== Euclidean distance results ===")
pprint([{"id": m.id, "score": m.score, "metadata": m.metadata} for m in res_euclid.matches])

print("\n=== Dot product results ===")
pprint([{"id": m.id, "score": m.score, "metadata": m.metadata} for m in res_dot.matches])

## Understanding the impact

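- **Cosine similarity** compares vector *direction* only → a good fit for text, where magnitude differences are rarely meaningful. Higher scores are closer.
- **Euclidean distance** measures absolute distance → sensitive to magnitude and scale. Lower scores are closer.
- **Dot product** grows with both alignment and magnitude → longer vectors can score higher. Higher scores are closer.

Expect all three indexes to rank the results in the same order here: OpenAI embeddings are normalized to unit length, so for these vectors the dot product equals the cosine similarity and the squared Euclidean distance equals `2 - 2 * cosine`. Only the score values differ.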

> ✨ Tip: For modern text embeddings, cosine is a great default. If your pipeline preserves meaningful magnitudes (e.g., TF-IDF-like scaling) or you need geometric distance, experiment with dot product or euclidean.
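
To see why the choice of metric stops affecting rank order once vectors have unit length, here is a small sketch in plain Python. The 3-dim vectors are made-up stand-ins for real embeddings, and the helper names are chosen for this example. It checks the identity `||q - d||^2 = 2 - 2 * (q · d)` and then compares the two orderings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors standing in for embeddings (hypothetical values).
query = normalize([0.9, 0.1, 0.3])
docs = {
    "doc_a": normalize([1.0, 0.2, 0.2]),
    "doc_b": normalize([0.1, 1.0, 0.4]),
    "doc_c": normalize([0.5, 0.5, 0.5]),
}

# For unit vectors, squared distance is a decreasing function of the dot product.
for d in docs.values():
    assert math.isclose(squared_euclidean(query, d), 2 - 2 * dot(query, d), abs_tol=1e-12)

by_dot = sorted(docs, key=lambda k: dot(query, docs[k]), reverse=True)   # higher is closer
by_dist = sorted(docs, key=lambda k: squared_euclidean(query, docs[k]))  # lower is closer
print("same ranking:", by_dot == by_dist)
```

If you remove the `normalize` calls, the identity no longer holds and the two orderings are free to diverge, which is the situation where picking a metric deliberately matters.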

## Metadata filtering example

You can filter matches by metadata—e.g., return only items where `topic == "physics"`.

In [None]:
res_filtered = index_cosine.query(
    vector=query_vec,
    top_k=5,
    include_metadata=True,
    filter={"topic": {"$eq": "physics"}}
)
print("Filtered (topic=physics) on cosine index:")
pprint([{"id": m.id, "score": m.score, "metadata": m.metadata} for m in res_filtered.matches])
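
Filters can also combine several conditions. The sketch below builds filter dictionaries using the `$in`, `$eq`, and `$and` operators from Pinecone's metadata filter syntax. `topic_filter` is a hypothetical helper written for this example, and the `version` field assumes metadata like that written in the update cell later in this notebook.

```python
def topic_filter(topics, version=None):
    """Build a Pinecone-style metadata filter dict (hypothetical helper)."""
    clauses = [{"topic": {"$in": list(topics)}}]
    if version is not None:
        clauses.append({"version": {"$eq": version}})
    # A single clause can be passed as-is; several are combined with $and.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

print(topic_filter(["physics", "psychology"]))
print(topic_filter(["physics"], version="updated"))
```

Either dictionary can be passed as the `filter=` argument of `index_cosine.query(...)` in place of the `$eq` filter used above.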

## Updating and deleting vectors

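- **Update** by upserting with the same ID; the new values and metadata replace the stored record.
- **Delete** by ID.
- Re-query to verify the change.

The cell below modifies only the cosine index; the other two indexes keep the original vectors.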

In [None]:
# Update concept1 text
updated_text = "Updated theory of relativity and its modern implications"
index_cosine.upsert(
    vectors=[{
        "id": "concept1",
        "values": embed(updated_text),
        "metadata": {"topic": "physics", "version": "updated"}
    }]
)

# Delete concept2
index_cosine.delete(ids=["concept2"]) 

# Re-query to see new state (broad query)
broad_vec = embed("theory")
res_after = index_cosine.query(vector=broad_vec, top_k=10, include_metadata=True)
pprint([{"id": m.id, "score": m.score, "metadata": m.metadata} for m in res_after.matches])

## (Optional) Integrating with an application (FastAPI example)

Below is a minimal FastAPI endpoint that:
1. Accepts a query string
2. Embeds with OpenAI
3. Queries a Pinecone index
4. Returns top-k matches

This is **for reference**—running FastAPI in Colab requires extra steps (e.g., `uvicorn` + tunnels).

In [None]:
FASTAPI_SNIPPET = r'''
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from pinecone import Pinecone
from openai import OpenAI
import os

app = FastAPI()

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("my-index-cosine")  # choose your index
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

EMBED_MODEL = "text-embedding-3-small"

class QueryRequest(BaseModel):
    query: str
    top_k: int = 5

def embed(text: str):
    resp = client.embeddings.create(model=EMBED_MODEL, input=[text])
    return resp.data[0].embedding

@app.post("/search")
def search(req: QueryRequest):
    if not req.query:
        raise HTTPException(status_code=400, detail="query is required")
    qv = embed(req.query)
    results = index.query(vector=qv, top_k=req.top_k, include_metadata=True)
    return {"matches": [
        {"id": m.id, "score": m.score, "metadata": m.metadata} for m in results.matches
    ]}
'''
print(FASTAPI_SNIPPET)

## (Optional) Cleanup

Delete the demo indexes to avoid clutter and charges. Set `DO_DELETE = True` to proceed.

In [None]:
DO_DELETE = False  # set to True to delete the demo indexes

if DO_DELETE:
    for name in index_names.values():
        print("Deleting:", name)
        pc.delete_index(name)
    print("Deleted demo indexes.")
else:
    print("Skipping delete. Set DO_DELETE=True to remove demo indexes.")

## Next steps

- Add more documents and experiment with **longer texts** or **different domains**.
- Try **hybrid search** (lexical + vector) by combining scores externally.
- Attach **payloads** such as URLs, timestamps, and authors in metadata, then filter on them.
- Benchmark metrics for your task (e.g., retrieval for QA vs. recommendations).
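
As a starting point for the hybrid search idea above, here is one possible way to combine scores externally: rescale each score set to the 0 to 1 range, then take a weighted sum. This is a sketch under stated assumptions, not a Pinecone feature. The function names, the `alpha` weight, and all score values are made up for illustration, and both inputs are assumed to be "higher is better" (scores from a euclidean index would need to be negated first).

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_rank(vector_scores, lexical_scores, alpha=0.5):
    """Blend two score dicts: alpha weights the vector side, 1 - alpha the lexical side."""
    v, l = min_max(vector_scores), min_max(lexical_scores)
    ids = set(v) | set(l)
    # A document missing from one side contributes 0 for that side.
    blended = {i: alpha * v.get(i, 0.0) + (1 - alpha) * l.get(i, 0.0) for i in ids}
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)

# Made-up scores for illustration only.
vector_scores = {"concept1": 0.82, "concept2": 0.15, "concept3": 0.40}
lexical_scores = {"concept1": 2.0, "concept3": 6.0}

print(hybrid_rank(vector_scores, lexical_scores, alpha=0.7))
print(hybrid_rank(vector_scores, lexical_scores, alpha=0.3))
```

With `alpha=0.7` the vector side dominates and `concept1` ranks first; with `alpha=0.3` the lexical side dominates and `concept3` moves to the top. Tuning `alpha` on a labeled set of queries is one way to benchmark the blend for your task.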