# Vector Databases with Pinecone

## Introduction to Pinecone

- Pinecone is a fully managed vector database service that makes it easy to build high-performance vector search applications. 

- Unlike traditional databases that store and query structured data, Pinecone is specifically designed for similarity search in high-dimensional vector spaces. 

- This makes it particularly well-suited for applications in machine learning, natural language processing, computer vision, and recommendation systems.
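
To make "similarity search" concrete, here is a minimal brute-force sketch in plain Python. It only illustrates the idea, assuming cosine similarity as the metric; the names `cosine_similarity` and `nearest` and the toy vectors are made up for this example, and a managed service would rely on an index structure rather than a linear scan.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query: List[float],
            vectors: Dict[str, List[float]],
            top_k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the best top_k."""
    scored = [(vid, cosine_similarity(query, vec)) for vid, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy 3-dimensional "embeddings" (hypothetical values for illustration)
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

result = nearest([1.0, 0.0, 0.0], toy_vectors)
print(result)
```

The two animal vectors rank above "car" because they point in nearly the same direction as the query.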

## Why Pinecone?

Pinecone offers several advantages that make it an excellent choice for vector search applications:

1. Managed Infrastructure: Pinecone handles all the infrastructure management, scaling, and optimization, allowing developers to focus on building applications.

2. Real-time Updates: Unlike some vector databases that require periodic rebuilding of indices, Pinecone supports real-time updates to your vector data.

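3. Filtered Search: Combine vector similarity search with metadata filtering for more precise results. (Pinecone generally uses the term "hybrid search" for combining dense and sparse vectors, which this notebook does not cover.)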

4. Enterprise Features: Built-in security features, automatic backups, and high availability make it suitable for production deployments.

## Getting Started with Pinecone

### Creating a Pinecone Account and Obtaining API Keys

Before we can start using Pinecone, we need to create an account and obtain our API credentials. Here's how to do it:

1. First, visit the Pinecone website (https://www.pinecone.io/) and click on the "Start Free" button.

2. You'll be prompted to create an account. You can sign up using:
   - Your email address
   - GitHub account
   - Google account

3. After creating your account, you'll be taken to the Pinecone Console. This is where you'll manage your indexes and API keys.

4. To create an API key:
   - Navigate to the API Keys section in the console
   - Click on "Create API Key"
   - Give your key a meaningful name (e.g., "development-key" or "tutorial-key")
   - Set the appropriate permissions (read/write)
   - Copy and save your API key immediately - you won't be able to see it again!

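5. Make note of your environment if you plan to use pod-based indexes. You can find it in the console next to your API key, and it will look something like "us-east1-gcp" or "us-west1-aws". The serverless index created later in this notebook needs only a cloud provider and a region (for example, "aws" and "us-east-1").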

Important Security Considerations:
- Never commit your API key to version control
- Use environment variables or secure secret management systems
- Rotate your keys periodically
- Create separate keys for development and production

Here's how to properly manage your API key in your code:

In [1]:
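import os

# For quick local experiments only. Prefer setting PINECONE_API_KEY in your
# shell or a .env file; setdefault keeps any value that is already set.
os.environ.setdefault('PINECONE_API_KEY', "FILL_IN_YOUR_API_KEY")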

In [2]:

# Get API key from environment variable
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')


Alternatively, keep the key out of your code entirely by creating a `.env` file in your project directory:


```plaintext
PINECONE_API_KEY=your-api-key-here
```

Make sure to add `.env` to your `.gitignore` file:
```plaintext
# .gitignore
.env
```
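
The notebook cells read the key from the process environment, so something has to load the `.env` file first. The `python-dotenv` package is a common choice; the sketch below shows the same idea with only the standard library. `load_env_file` is a hypothetical helper written for this example, and it handles only simple `KEY=value` lines.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=value lines and add them to os.environ (hypothetical helper)."""
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Do not override variables already set in the real environment
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


load_env_file()  # no-op if there is no .env file in the working directory
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
```

Using `setdefault` means a key exported in your shell takes precedence over the file, which is usually what you want in production.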

## Setting Up Pinecone

Let's import our dependencies and initialize Pinecone:


In [None]:
from pinecone import Pinecone, ServerlessSpec
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import List, Dict
import os
from tqdm.auto import tqdm

# Define the index name and the cloud region for the serverless index
index_name = "sentence-embeddings"
pinecone_env = "us-east-1"  # AWS region; change this to your preferred region

# Initialize Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)


## Creating a Vector Index

Let's create an index to store our vectors. An index is configured with a vector dimension, a similarity metric, and a deployment spec. The dimension must match the output size of the embedding model you use:

In [None]:
# Create index if it doesn't exist
if index_name not in pc.list_indexes().names():
    # Create a serverless index
    pc.create_index(
        name=index_name,
        dimension=384,  # must match the embedding size (all-MiniLM-L6-v2 outputs 384 values)
        metric='cosine',
        spec=ServerlessSpec(
            cloud='aws',
            region=pinecone_env  # You can change this to your preferred region
        )
    )
    print(f"Created new index: {index_name}")
else:
    print(f"Using existing index: {index_name}")

# Get the index
index = pc.Index(index_name)
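
A newly created index can take a short while to become ready, and upserts issued too early may not succeed. A minimal polling sketch follows. `wait_until` is a hypothetical helper, and the commented usage assumes that `pc.describe_index(index_name).status['ready']` reports readiness, as in Pinecone's quickstart examples, so check it against your client version.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool],
               timeout_s: float = 60.0,
               poll_interval_s: float = 1.0) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_interval_s)
    # One last check after the deadline has passed
    return is_ready()


# Assumed usage with the objects defined above:
# wait_until(lambda: pc.describe_index(index_name).status['ready'])
```

The helper returns `False` on timeout instead of raising, so the caller can decide whether to retry or abort.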

## Loading and Processing Data

Let's create a simple document processing pipeline:

In [5]:
import pinecone


class Document:
    def __init__(self, title: str, content: str, metadata: Dict = None):
        self.title = title
        self.content = content
        self.metadata = metadata or {}
        
    def to_text(self) -> str:
        return f"{self.title} {self.content}"

def process_documents(documents: List[Document], 
                     model: SentenceTransformer) -> List[Dict]:
    """
    Process documents into vectors and prepare them for Pinecone indexing.
    """
    processed_docs = []
    
    for i, doc in enumerate(tqdm(documents)):
        # Generate vector embedding
        vector = model.encode(doc.to_text())
        
        # Create the record
        record = {
            'id': f'doc_{i}',
            'values': vector.tolist(),
            'metadata': {
                'title': doc.title,
                'content': doc.content,
                **doc.metadata
            }
        }
        processed_docs.append(record)
    
    return processed_docs
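
One caveat with the `doc_{i}` IDs above: they depend on list position, and an upsert replaces any existing record with the same ID, so re-running the pipeline with a reordered or different document list would overwrite unrelated records. One possible alternative, sketched below, derives the ID from the document text. `stable_doc_id` is a hypothetical helper, and hashing the title plus content is just one reasonable choice.

```python
import hashlib


def stable_doc_id(title: str, content: str, prefix: str = "doc") -> str:
    """Derive a deterministic ID from the document text (hypothetical helper)."""
    digest = hashlib.sha256(f"{title}\n{content}".encode("utf-8")).hexdigest()
    # 16 hex characters keep IDs short while making collisions very unlikely
    return f"{prefix}_{digest[:16]}"


id_a = stable_doc_id("Introduction to Machine Learning", "Machine learning is ...")
id_b = stable_doc_id("Introduction to Machine Learning", "Machine learning is ...")
id_c = stable_doc_id("Introduction to Machine Learning", "A different body")
print(id_a == id_b, id_a == id_c)
```

The same document always maps to the same ID, so re-indexing it updates its own record rather than someone else's.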

## Indexing Documents

Now let's create a function to upload our vectors to Pinecone:

In [None]:
def index_documents(index: pinecone.Index, 
                   processed_docs: List[Dict], 
                   batch_size: int = 100):
    """
    Upload document vectors to Pinecone in batches.
    """
    total_docs = len(processed_docs)
    
    for i in tqdm(range(0, total_docs, batch_size)):
        batch = processed_docs[i:i + batch_size]  # slicing clips at the end of the list
        index.upsert(vectors=batch)

# Example usage:
sample_docs = [
    Document(
        title="Introduction to Machine Learning",
        content="Machine learning is a subset of artificial intelligence...",
        metadata={"category": "technology", "date": "2024-01-01"}
    ),
    # Add more documents...
]

# Initialize the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Process and index documents
processed_docs = process_documents(sample_docs, model)
index_documents(index, processed_docs)

## Implementing Semantic Search

Let's create a search function that combines vector similarity search with optional metadata filtering:

In [7]:
def semantic_search(
    query: str,
    index: pinecone.Index,
    model: SentenceTransformer,
    top_k: int = 5,
    metadata_filter: Dict = None
) -> List[Dict]:
    """
    Perform semantic search with optional metadata filtering.
    """
    # Generate query vector
    query_vector = model.encode(query).tolist()
    
    # Perform the search
    search_results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,
        filter=metadata_filter
    )
    
    # Format results
    formatted_results = []
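    for match in search_results['matches']:
        content = match['metadata']['content']
        # Truncate long content for display; add an ellipsis only when text was cut
        preview = (content[:200] + "...") if len(content) > 200 else content
        result = {
            'title': match['metadata']['title'],
            'content': preview,
            'score': match['score'],
            'metadata': {k:v for k,v in match['metadata'].items() 
                        if k not in ['title', 'content']}
        }
        formatted_results.append(result)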
    
    return formatted_results

# Example usage:
results = semantic_search(
    query="what is machine learning?",
    index=index,
    model=model,
    metadata_filter={"category": "technology"}
)

# Display results
for i, result in enumerate(results, 1):
    print(f"\nResult {i}:")
    print(f"Title: {result['title']}")
    print(f"Preview: {result['content']}")
    print(f"Score: {result['score']}")
    print(f"Metadata: {result['metadata']}")
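
The filter in the example uses simple equality. Pinecone's filter language also offers MongoDB-style operators such as `$eq`, `$in`, `$gte`, and `$and`. The range operators are documented for numeric values, so the string `date` field in the sample document could not be range-filtered as is. One option, sketched below under that assumption, is to store dates as Unix timestamps. `to_timestamp` and `build_filter` are hypothetical helpers, and `date_ts` is a hypothetical metadata field you would need to add at indexing time.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def to_timestamp(iso_date: str) -> int:
    """Convert 'YYYY-MM-DD' to a Unix timestamp (seconds, midnight UTC)."""
    dt = datetime.strptime(iso_date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def build_filter(categories: List[str], since: Optional[str] = None) -> Dict:
    """Build a metadata filter dict: category in list, optionally date_ts >= since."""
    clauses = [{"category": {"$in": categories}}]
    if since is not None:
        clauses.append({"date_ts": {"$gte": to_timestamp(since)}})
    # A single clause needs no $and wrapper
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


metadata_filter = build_filter(["technology", "science"], since="2024-01-01")
print(metadata_filter)
```

The resulting dictionary can be passed as the `metadata_filter` argument of `semantic_search`.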

## Advanced Features

### Namespace Management

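Pinecone supports namespaces to partition the records within an index. Each upsert and query targets a single namespace, so a query never returns vectors stored in another one: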


In [8]:
def index_to_namespace(
    index: pinecone.Index,
    namespace: str,
    vectors: List[Dict]
):
    """
    Index vectors to a specific namespace.
    """
    index.upsert(
        vectors=vectors,
        namespace=namespace
    )

def search_in_namespace(
    index: pinecone.Index,
    namespace: str,
    query_vector: List[float],
    top_k: int = 5
):
    """
    Search within a specific namespace.
    """
    return index.query(
        vector=query_vector,
        namespace=namespace,
        top_k=top_k,
        include_metadata=True
    )

## Best Practices and Optimization

1. Batch Processing: Always batch your upserts to optimize performance.
2. Index Configuration: Choose the deployment type (serverless, as used in this notebook, or pod-based with an appropriate pod type and size) based on your data volume and query patterns.
3. Vector Dimension: Use lower-dimensional embeddings when possible to reduce storage and improve query speed.
4. Metadata Design: Keep metadata fields minimal and relevant to your search needs.
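
To see why vector dimension matters, a rough back-of-the-envelope estimate helps. The sketch below assumes each vector component is stored as a 4-byte float and ignores metadata and index overhead, so treat the numbers as a lower bound rather than Pinecone's actual accounting. `estimate_vector_bytes` is a hypothetical helper.

```python
def estimate_vector_bytes(num_vectors: int,
                          dimension: int,
                          bytes_per_value: int = 4) -> int:
    """Raw size of the vector values alone (assumes 4-byte floats by default)."""
    return num_vectors * dimension * bytes_per_value


# Compare a few common embedding sizes for one million vectors
for dim in (384, 768, 1536):
    size_gib = estimate_vector_bytes(1_000_000, dim) / 1024**3
    print(f"{dim:>5} dims: {size_gib:.2f} GiB per million vectors")
```

Raw storage grows linearly with dimension, so moving from 384 to 1536 dimensions quadruples it.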

## Clean Up

Don't forget to clean up resources when you're done:

In [9]:
# Delete the index when you're finished
pc.delete_index(index_name)


**NOTES:**

This notebook demonstrates the fundamental concepts of working with Pinecone for vector search applications. You can extend these examples based on your specific use case, whether it's document similarity, recommendation systems, or any other vector search application.

Remember to replace the placeholder API key and region with your own Pinecone credentials and settings before running this code. Also, consider the pricing implications of your index configuration and usage patterns in production environments.