# Pinecone Vector Database Integration

This notebook demonstrates how to use Pinecone as a vector database for storing and retrieving document embeddings in a RAG (Retrieval-Augmented Generation) system.

## Overview
- Set up Pinecone vector database
- Create embeddings using Google's Gemini model
- Store documents with metadata
- Perform similarity searches with filtering
- Use retrievers for advanced querying

## Prerequisites
- Pinecone API key
- Google API key
- Required packages: pinecone (published as pinecone-client in older releases), langchain-pinecone, langchain-google-genai, python-dotenv

In [None]:
"""
Environment Setup and API Key Configuration

This cell loads the required API keys from environment variables and validates
that all necessary credentials are available for the Pinecone and Google services.
"""

# Import the standard library and python-dotenv helpers
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Get API keys from environment
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')

# Validate that required API keys are loaded
if not PINECONE_API_KEY:
    raise ValueError("PINECONE_API_KEY not found in environment variables")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY not found in environment variables")

print("✓ API keys loaded successfully")

✓ API keys loaded successfully
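The validation above stops at the first missing key. As a minimal sketch, the check can also be written so that a single error lists every missing variable. The helper name `require_env` is hypothetical and not part of any library used in this notebook.

```python
import os


def require_env(names):
    """Return the values of the given environment variables as a dict.

    Raises one ValueError that lists every missing (or empty) variable.
    """
    values = {name: os.getenv(name) for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    return values


# Hypothetical usage with the keys this notebook expects:
# keys = require_env(["PINECONE_API_KEY", "GOOGLE_API_KEY"])
```

Reporting all missing keys at once saves a round trip when a fresh `.env` file lacks several entries.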


In [None]:
"""
Initialize Pinecone Client

Create a connection to Pinecone using the API key.
Pinecone is a vector database service that allows for efficient similarity search
and retrieval of high-dimensional vectors.
"""

from pinecone import Pinecone

# Initialize Pinecone client with API key
pc = Pinecone(api_key=PINECONE_API_KEY)
print("✓ Pinecone client initialized successfully")

In [None]:
"""
Initialize Google Generative AI Embeddings

Set up the embedding model using Google's Gemini embedding model.
This model converts text into high-dimensional vectors (embeddings) that
capture semantic meaning and can be used for similarity comparisons.
"""

from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Initialize Google Gemini embeddings model
# The gemini-embedding-001 model produces 3072-dimensional vectors by default
embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")
print("✓ Google Gemini embeddings model initialized")
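To make "similarity comparisons" concrete, here is a small sketch of cosine similarity in plain Python. The three-dimensional vectors are toy values invented for illustration. Real embeddings from the model have thousands of dimensions, and nothing here calls the Google API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (values are made up).
pancakes = [0.9, 0.1, 0.0]
breakfast = [0.8, 0.2, 0.1]
stock_market = [0.0, 0.1, 0.9]

print(cosine_similarity(pancakes, breakfast))     # close to 1: similar direction
print(cosine_similarity(pancakes, stock_market))  # close to 0: nearly orthogonal
```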


In [None]:
"""
Create and Configure Pinecone Index

This cell creates a new Pinecone index if it doesn't exist, or connects to an existing one.
The index is configured with:
- Dimension: 3072 (matching the Gemini embedding model output)
- Metric: cosine similarity for comparing vectors
- Serverless specification for automatic scaling
"""

from pinecone import ServerlessSpec

# Define index configuration
index_name = "rag"  # Name of the Pinecone index
dimension = 3072   # Must match the embedding model's output dimension
metric = "cosine"  # Similarity metric for vector comparisons

# Create index if it doesn't exist
if not pc.has_index(index_name):
    print(f"Creating new index '{index_name}'...")
    pc.create_index(
        name=index_name,
        dimension=dimension,
        metric=metric,
        spec=ServerlessSpec(
            cloud="aws",           # Cloud provider
            region="us-east-1"     # AWS region
        ),
    )
    print(f"✓ Index '{index_name}' created successfully")
else:
    print(f"✓ Index '{index_name}' already exists")

# Connect to the index
index = pc.Index(index_name)
print(f"✓ Connected to index: {index_name}")
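A newly created index may take a short while before it accepts requests. The sketch below shows one way to poll for readiness. The helper `wait_until_ready` is hypothetical, and the commented usage assumes that `pc.describe_index(index_name).status["ready"]` reports readiness in the installed SDK version.

```python
import time


def wait_until_ready(is_ready, timeout=60.0, interval=1.0):
    """Poll is_ready() until it returns True or `timeout` seconds elapse.

    Returns True if the check succeeded, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage with this notebook's client (assumed status field):
# wait_until_ready(lambda: pc.describe_index(index_name).status["ready"])
```

Passing the readiness check as a callable keeps the helper independent of the client, so it can be exercised without network access.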

In [None]:
"""
Display Index Information

Show the current state and configuration of the Pinecone index.
"""
print("Index Information:")
print(f"Index: {index}")

<pinecone.db_data.index.Index at 0x71fe226f6e10>

In [None]:
"""
Initialize LangChain Pinecone Vector Store

Create a LangChain wrapper around the Pinecone index that provides
a unified interface for document storage and retrieval operations.
This abstraction makes it easier to work with vectors in LangChain applications.
"""

from langchain_pinecone import PineconeVectorStore

# Create vector store wrapper
vector_store = PineconeVectorStore(
    index=index,           # The Pinecone index instance
    embedding=embeddings   # The embedding function to use
)
print("✓ LangChain Pinecone vector store initialized")



In [None]:
"""
Display Vector Store Information
"""
print("Vector Store Information:")
print(f"Vector Store: {vector_store}")

<langchain_pinecone.vectorstores.PineconeVectorStore at 0x71fe20220ce0>

In [None]:
"""
Create Sample Documents

Define a collection of sample documents with different content types and sources.
Each document includes:
- page_content: The actual text content
- metadata: Additional information like source type for filtering

These documents simulate different types of content you might encounter in a RAG system:
- Social media posts (tweets)
- News articles
- Website content
"""

from langchain_core.documents import Document

# Personal/Social Media Content
document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet", "category": "personal"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet", "category": "entertainment"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet", "category": "technology"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet", "category": "technology"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet", "category": "personal"},
)

# News Content
document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news", "category": "weather"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news", "category": "crime"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news", "category": "finance"},
)

# Website/Review Content
document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website", "category": "technology"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website", "category": "sports"},
)

# Compile all documents into a list
documents = [
    document_1, document_2, document_3, document_4, document_5,
    document_6, document_7, document_8, document_9, document_10,
]

print(f"✓ Created {len(documents)} sample documents")

In [None]:
"""
Add Documents to Vector Store

Upload all sample documents to the Pinecone vector database.
This process:
1. Generates embeddings for each document's content
2. Stores the embeddings along with metadata in Pinecone
3. Returns unique IDs for each stored document
"""

print("Adding documents to vector store...")
document_ids = vector_store.add_documents(documents=documents)
print(f"✓ Successfully added {len(document_ids)} documents to the vector store")
print(f"Document IDs: {document_ids[:3]}..." if len(document_ids) > 3 else f"Document IDs: {document_ids}")

['27e14bf5-a894-4e42-b39a-027e818f4f60',
 '05052e9d-3933-4a17-98f8-d17954425f8b',
 '3d4c8b3d-121f-47f0-985e-5bb71947445f',
 '1a4ca509-8123-4b2c-8167-09476e02b82e',
 '415cd051-5309-4a7e-a3ab-a5eb7cd6d475',
 '26a13be2-d9c4-42b0-80d1-5c20be5232d6',
 'd913f86b-05bb-44c3-b5df-e8fa136eaccc',
 '79f7df4f-6882-4c8e-8b33-db28b7bb8838',
 '1fd1e3e4-a699-46c6-bdb8-7280433c1996',
 'a3eab079-7d50-428a-9a74-8751da5ff259']
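The IDs above are random UUIDs, so re-running the upload cell stores the same documents again under new IDs. One possible remedy, sketched here, is to derive each ID from the document itself so that repeated uploads overwrite rather than duplicate. The helper `stable_id` is hypothetical, and truncating the hash to 32 characters is an arbitrary choice for readability.

```python
import hashlib


def stable_id(page_content, source):
    """Derive a deterministic ID from a document's text and source."""
    digest = hashlib.sha256(f"{source}\n{page_content}".encode("utf-8"))
    return digest.hexdigest()[:32]


# Hypothetical usage with this notebook's objects:
# ids = [stable_id(d.page_content, d.metadata["source"]) for d in documents]
# document_ids = vector_store.add_documents(documents=documents, ids=ids)

print(stable_id("The top 10 soccer players in the world right now.", "website"))
```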

In [None]:
"""
Perform Similarity Search with Filtering

Demonstrate how to query the vector store for similar documents.
This example searches for content related to "LangGraph" but only
within documents that have source="tweet".

Parameters:
- query: The search text
- k: Number of results to return
- filter: Metadata filtering criteria
"""

print("=== Similarity Search: LangGraph in Tweets ===")
results = vector_store.similarity_search(
    query="How good is LangGraph?",
    k=2,                           # Return top 2 results
    filter={"source": "tweet"},    # Only search in tweets
)

print(f"Found {len(results)} results:")
for i, res in enumerate(results, 1):
    print(f"{i}. {res.page_content}")
    print(f"   Metadata: {res.metadata}")
    print()

* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]
* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
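The filter `{"source": "tweet"}` is shorthand for an equality test, and Pinecone's filter syntax also offers operators such as `$eq`, `$ne`, `$in`, `$nin`, `$and`, and `$or`. The sketch below is a local illustration of how such filters select documents. It is not Pinecone's implementation, the function `matches` is hypothetical, and edge cases such as missing fields may behave differently on the server.

```python
def matches(metadata, flt):
    """Return True if `metadata` satisfies a small subset of the filter syntax."""
    for key, condition in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            # A bare value is treated as shorthand for {"$eq": value}.
            if not isinstance(condition, dict):
                condition = {"$eq": condition}
            value = metadata.get(key)
            for op, operand in condition.items():
                if op == "$eq":
                    ok = value == operand
                elif op == "$ne":
                    ok = value != operand
                elif op == "$in":
                    ok = value in operand
                elif op == "$nin":
                    ok = value not in operand
                else:
                    raise ValueError(f"unsupported operator: {op}")
                if not ok:
                    return False
    return True


tweet = {"source": "tweet", "category": "technology"}
print(matches(tweet, {"source": "tweet"}))
print(matches(tweet, {"source": {"$in": ["news", "website"]}}))
print(matches(tweet, {"$and": [{"source": "tweet"}, {"category": {"$ne": "personal"}}]}))
```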


In [None]:
"""
Similarity Search with Similarity Scores

Perform a search that returns both the matching documents and their
similarity scores. This helps understand how closely each result
matches the query.

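With the cosine metric configured for this index, higher scores mean closer
matches. Cosine similarity ranges from -1 to 1, where:
- 1.0 = vectors point in the same direction (closest possible match)
- 0.0 = vectors are orthogonal (unrelated)
- negative values = vectors point in opposing directions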
"""

print("=== Similarity Search with Scores: Weather in News ===")
results = vector_store.similarity_search_with_score(
    query="Will it be hot tomorrow?", 
    k=2,                          # Return top 2 results
    filter={"source": "news"}     # Only search in news articles
)

print(f"Found {len(results)} results:")
for i, (res, score) in enumerate(results, 1):
    print(f"{i}. [Similarity: {score:.3f}] {res.page_content}")
    print(f"   Metadata: {res.metadata}")
    print()

* [SIM=0.736267] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]
* [SIM=0.570427] The stock market is down 500 points today due to fears of a recession. [{'source': 'news'}]


In [None]:
"""
Create and Use a Retriever with Score Threshold

Set up a retriever that only returns documents with similarity scores
above a specified threshold. This helps filter out irrelevant results.

Configuration:
- search_type: "similarity_score_threshold" - only return results above threshold
- k: maximum number of results to return
- score_threshold: minimum relevance score on LangChain's normalized 0-to-1 scale
- filter: metadata filter that restricts the search to news articles
"""

print("=== Retriever with Score Threshold ===")

# Create retriever with score threshold and metadata filter
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "k": 2,                        # Maximum results to return
        "score_threshold": 0.4,        # Minimum normalized relevance score
        "filter": {"source": "news"},  # Only search in news articles
    },
)

# Perform retrieval
print("Searching for: 'Stealing from the bank is a crime' in news articles...")
results = retriever.invoke("Stealing from the bank is a crime")

print(f"Found {len(results)} results above threshold:")
for i, doc in enumerate(results, 1):
    print(f"{i}. {doc.page_content}")
    print(f"   Metadata: {doc.metadata}")
    print()

if not results:
    print("No results found above the similarity threshold of 0.4")

[Document(id='1a4ca509-8123-4b2c-8167-09476e02b82e', metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.'),
 Document(id='1fd1e3e4-a699-46c6-bdb8-7280433c1996', metadata={'source': 'news'}, page_content='The stock market is down 500 points today due to fears of a recession.')]
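To illustrate what a score threshold does, the sketch below takes the top `k` results and then drops any whose score falls below the threshold. It rests on two assumptions that should be checked against the installed library version: that raw cosine scores are mapped onto a 0-to-1 scale with `(score + 1) / 2`, and that the threshold is applied after the top-`k` selection. Both function names are hypothetical, and the scores are invented rather than taken from a real query.

```python
def normalize_cosine(score):
    """Map a cosine similarity in [-1, 1] onto [0, 1] (assumed mapping)."""
    return (score + 1.0) / 2.0


def apply_score_threshold(scored_docs, score_threshold, k):
    """Keep at most k (doc, score) pairs whose score meets the threshold."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [(doc, score) for doc, score in ranked[:k] if score >= score_threshold]


# Hypothetical raw cosine scores, made up for this example.
raw = [("bank robbery", 0.71), ("stock market", 0.52), ("weather", -0.35)]
normalized = [(doc, normalize_cosine(score)) for doc, score in raw]

print(apply_score_threshold(normalized, score_threshold=0.4, k=2))
```

Under this assumed mapping, a threshold of 0.4 corresponds to a raw cosine similarity of -0.2, so it is worth inspecting real scores before choosing a cutoff.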

In [None]:
"""
Summary and Next Steps

This notebook demonstrated:
1. ✓ Setting up Pinecone vector database
2. ✓ Creating embeddings with Google Gemini
3. ✓ Storing documents with metadata
4. ✓ Performing similarity searches
5. ✓ Using filters and score thresholds
6. ✓ Working with retrievers

Next steps for production use:
- Implement batch document processing
- Add error handling and retry logic
- Monitor index usage and performance
- Set up automated index management
- Integrate with your specific data sources
"""

print("🎉 Pinecone Vector Database Tutorial Complete!")
print("\nKey takeaways:")
print("- Pinecone provides scalable vector storage and search")
print("- Metadata filtering enables targeted searches")  
print("- Score thresholds help maintain result quality")
print("- LangChain integration simplifies vector operations")
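Two of the next steps listed above, batch processing and retry logic, can be sketched in plain Python. The helpers `batched` and `with_retry` are hypothetical, the batch size of 100 in the usage comment is an arbitrary example, and catching every exception is a simplification. Production code would normally retry only on errors known to be transient.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def with_retry(func, attempts=3, base_delay=1.0):
    """Call func(), retrying with exponential backoff if it raises."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


# Hypothetical usage with this notebook's objects:
# for batch in batched(documents, 100):
#     with_retry(lambda: vector_store.add_documents(documents=batch))
```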