# Pinecone

- Author: [ro__o_jun](https://github.com/ro-jun)
- Design: []()
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/08-Embeeding/01-OpenAIEmbeddings.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/08-Embeeding/01-OpenAIEmbeddings.ipynb)

## Overview

This tutorial provides a comprehensive guide to integrating `Pinecone` with `LangChain` for creating and managing high-performance vector databases.  

It explains how to set up `Pinecone`, preprocess documents, and use Pinecone's APIs for vector indexing and document retrieval.  

Additionally, it demonstrates advanced features such as `hybrid search` with `dense` and `sparse embeddings`, `metadata filtering`, and `dynamic reranking` to build efficient and scalable search systems.  

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [What is Pinecone?](#what-is-pinecone)
- [Pinecone setup guide](#pinecone-setup-guide)
- [Data preprocessing](#data-preprocessing)
- [Pinecone and LangChain Integration Guide: Step by Step](#pinecone-and-langchain-integration-guide-step-by-step)
- [Create Sparse Encoder](#create-sparse-encoder)
- [Pinecone: Add to DB Index (Upsert)](#pinecone-add-to-db-index-upsert)
- [Index inquiry/delete](#index-inquirydelete)
- [Create HybridRetrieve](#create-hybridretrieve)

### References

- [Langchain-PineconeVectorStore](https://python.langchain.com/api_reference/pinecone/vectorstores/langchain_pinecone.vectorstores.PineconeVectorStore.html)
- [Langchain-Retrievers](https://python.langchain.com/docs/integrations/retrievers/pinecone_hybrid_search/)
- [Pinecone-Docs](https://docs.pinecone.io/guides/get-started/overview)
- [Pinecone-Docs-integrations](https://docs.pinecone.io/integrations/langchain)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials. 
- See the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial




In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain-pinecone",
        "pinecone[grpc]",
        "nltk",
        "langchain_community",
        "pymupdf",
        "langchain-openai",
        "pinecone-text",
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "PINECONE_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Pinecone",
    },
)

Environment variables have been set successfully.


[Note] If you are using a `.env` file, proceed as follows.

In [4]:
from dotenv import load_dotenv

load_dotenv(override=True)

True

## What is Pinecone?

`Pinecone` is a **cloud-based**, high-performance vector database for **efficient vector storage and retrieval** in AI and machine learning applications.

**Features** :
1. **Supports SDKs** for Python, Node.js, Java, and Go.
2. **Fully managed** : Reduces the burden of infrastructure management.
3. **Real-time updates** : Supports real-time insertion, updates, and deletions.

**Advantages** :
1. Scalability for large datasets.
2. Real-time data processing.
3. High availability with cloud infrastructure.

**Disadvantages** :
1. Relatively higher cost compared to other vector databases.
2. Limited customization options.

## Pinecone setup guide

This section explains how to set up `Pinecone` , including `API key` creation.

**[steps]**

1. Log in to [Pinecone](https://www.pinecone.io/)
2. Create an API key under the `API Keys` tab.

![example](./assets/04-pinecone-api-01.png)  
![example](./assets/04-pinecone-api-02.png)  

## Data preprocessing

Below is the preprocessing step for general documents.  
The code reads every file matching `data/*.pdf` , splits it into chunks, and stores the result in `split_docs` .

In [5]:
from utils.pinecone import DocumentProcessor

directory_path = "data/*.pdf"
doc_processor = DocumentProcessor(
    directory_path=directory_path,
    chunk_size=300,
    chunk_overlap=50,
    use_basename=True,
)
split_docs = doc_processor.process_pdf_files(directory_path)

print(f"Number of processed documents: {len(split_docs)}")

[INFO] Processed 414 documents from 1 files.
Number of processed documents: 414


In [6]:
split_docs[12].page_content

'up. I have a serious reason: he is the best friend I have in the world. I have another reason: this grown-up understands everything, even books about children. I have a third reason: he lives in France where he is hungry and cold. He needs cheering up. If all these'

In [7]:
split_docs[12].metadata

{'source': 'TheLittlePrince.pdf',
 'file_path': 'data\\TheLittlePrince.pdf',
 'page': 2,
 'total_pages': 64,
 'format': 'PDF 1.3',
 'title': '',
 'author': 'Paula MacDowell',
 'subject': '',
 'keywords': '',
 'creator': 'Safari',
 'producer': 'Mac OS X 10.10.5 Quartz PDFContext',
 'creationDate': "D:20160209011144Z00'00'",
 'modDate': "D:20160209011144Z00'00'",
 'trapped': ''}

Next, preprocess the split documents so they can be stored in Pinecone. You can choose which metadata to keep with `metadata_keys` during this step.

You can also tag additional metadata, or add and process metadata ahead of time in a separate preprocessing task.

- `split_docs` : `List[Document]` containing the results of document splitting.
- `metadata_keys` : List of metadata keys to keep for each document.
- `min_length` : Minimum document length. Documents shorter than this are excluded.
- `use_basename` : Whether to reduce the source path to its file name. The default is `False` .

**Preprocessing of documents**

- Extracts the required `metadata` information.
- Keeps only documents that meet the minimum length.
- Optionally replaces the source path with its `basename` , the last component of the file path.
- For example, `/data/TheLittlePrince.pdf` becomes `TheLittlePrince.pdf`.
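
The steps listed above can be sketched in plain Python. This is only an illustration, not the code of `DocumentProcessor.preprocess_documents` : the `Doc` class, the `preprocess` function, and the choice to drop chunks with fewer than `min_length` characters are assumptions made for the example.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (hypothetical, for illustration)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def preprocess(docs, metadata_keys=("source", "page", "author"),
               min_length=10, use_basename=False):
    """Return (contents, metadatas) with one metadata list per key."""
    contents = []
    metadatas = {key: [] for key in metadata_keys}
    for doc in docs:
        text = doc.page_content.strip()
        if len(text) < min_length:
            continue  # assumption: chunks shorter than min_length are dropped
        contents.append(text)
        for key in metadata_keys:
            value = doc.metadata.get(key)
            if key == "source" and use_basename and value is not None:
                value = os.path.basename(value)  # keep only the file name
            metadatas[key].append(value)
    return contents, metadatas


demo_docs = [
    Doc("A chunk that is long enough to keep.",
        {"source": "/data/TheLittlePrince.pdf", "page": 2, "author": "Paula MacDowell"}),
    Doc("short",
        {"source": "/data/TheLittlePrince.pdf", "page": 3, "author": "Paula MacDowell"}),
]
demo_contents, demo_metadatas = preprocess(demo_docs, min_length=10, use_basename=True)
print(demo_contents)
print(demo_metadatas)
```

Keeping one list per metadata key, all the same length as `contents` , makes it easy to check that nothing went out of sync before upserting.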


In [8]:
contents, metadatas = doc_processor.preprocess_documents(docs=split_docs, min_length=10)

print(f"Number of processed documents: {len(contents)}")
print(f"Metadata keys: {list(metadatas.keys())}")
print(f"Sample 'source' metadata: {metadatas['source'][:5]}")

Preprocessing documents: 100%|██████████| 414/414 [00:00<00:00, 31331.84it/s]

Number of processed documents: 414
Metadata keys: ['source', 'page', 'author']
Sample 'source' metadata: ['TheLittlePrince.pdf', 'TheLittlePrince.pdf', 'TheLittlePrince.pdf', 'TheLittlePrince.pdf', 'TheLittlePrince.pdf']





In [9]:
# Check number of documents, check number of sources, check number of pages
len(contents), len(metadatas["source"]), len(metadatas["page"]), len(
    metadatas["author"]
)

(414, 414, 414, 414)

## Pinecone and LangChain Integration Guide: Step by Step

This guide walks through integrating Pinecone with LangChain to set up and use a vector database. 

The key steps are described below.

### Pinecone client initialization and vector database setup

The code in this section initializes a Pinecone client, sets up an index in Pinecone, and defines a vector database to store embeddings.

**[caution]**    

If you plan to use hybrid search, set the index metric to `dotproduct`.  
Users on the free plan cannot use `PodSpec`; use `ServerlessSpec` instead.  
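
The "create or reuse" behavior of the index setup can be sketched as follows. The function name `ensure_index` and the `FakePineconeClient` class are hypothetical. The fake client only mimics the three calls the sketch relies on ( `list_indexes().names()` , `create_index(...)` , `Index(...)` ) so that the example runs without an API key, and the tutorial's own `PineconeDocumentManager.create_index` may be implemented differently.

```python
class FakeIndexList:
    """Hypothetical stand-in for the object returned by list_indexes()."""

    def __init__(self, names):
        self._names = list(names)

    def names(self):
        return list(self._names)


class FakePineconeClient:
    """In-memory stand-in for a Pinecone client (assumption, for illustration)."""

    def __init__(self):
        self.created = {}

    def list_indexes(self):
        return FakeIndexList(self.created)

    def create_index(self, name, dimension, metric, spec):
        self.created[name] = {"dimension": dimension, "metric": metric, "spec": spec}

    def Index(self, name):
        return self.created[name]


def ensure_index(pc, name, dimension, metric="dotproduct", spec=None):
    """Create the index only if it does not exist yet, then return a handle."""
    if name in pc.list_indexes().names():
        print(f"Using existing index: {name}")
    else:
        pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
        print(f"Created index: {name}")
    return pc.Index(name)


demo_pc = FakePineconeClient()
demo_spec = {"cloud": "aws", "region": "us-east-1"}  # placeholder for a ServerlessSpec
ensure_index(demo_pc, "demo-index", dimension=3072, spec=demo_spec)
ensure_index(demo_pc, "demo-index", dimension=3072, spec=demo_spec)  # reused, not recreated
```

Checking for the index before creating it keeps the cell safe to re-run, which matters in a notebook.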

### Pinecone index settings

**Create and check indexes.**

In [10]:
import os
from utils.pinecone import PineconeDocumentManager

# Initialize Pinecone client with API key from environment variables
pc_db = PineconeDocumentManager(api_key=os.environ.get("PINECONE_API_KEY"))
pc_db

<utils.pinecone.PineconeDocumentManager at 0x18a2d5fffd0>

In [11]:
# Check existing index names
pc_db.check_indexes()

Existing Indexes: [{
    "name": "langchain-opentutorial-index",
    "dimension": 3072,
    "metric": "dotproduct",
    "host": "langchain-opentutorial-index-9v46jum.svc.aped-4627-b74a.pinecone.io",
    "spec": {
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    },
    "status": {
        "ready": true,
        "state": "Ready"
    },
    "deletion_protection": "disabled"
}, {
    "name": "langchain-opentutorial-multimodal-1024",
    "dimension": 1024,
    "metric": "dotproduct",
    "host": "langchain-opentutorial-multimodal-1024-9v46jum.svc.aped-4627-b74a.pinecone.io",
    "spec": {
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    },
    "status": {
        "ready": true,
        "state": "Ready"
    },
    "deletion_protection": "disabled"
}]


In [12]:
from pinecone import ServerlessSpec, PodSpec

# Create or reuse the index
index_name = "langchain-opentutorial-index"

# Set to True when using the serverless method, and False when using the PodSpec method.
use_serverless = True
if use_serverless:
    spec = ServerlessSpec(cloud="aws", region="us-east-1")
else:
    spec = PodSpec(environment="us-west1-gcp", pod_type="p1.x1", pods=1)

pc_db.create_index(
    index_name=index_name,
    dimension=3072,
    metric="dotproduct",
    spec=spec,
)

Using existing index: langchain-opentutorial-index


<pinecone.grpc.index_grpc.GRPCIndex at 0x18a2dea3c50>

**Inspect the contents of an index.**

In [13]:
index = pc_db.get_index(index_name)
print(index.describe_index_stats())

{'dimension': 3072,
 'index_fullness': 0.0,
 'namespaces': {'langchain-opentutorial-01': {'vector_count': 414}},
 'total_vector_count': 414}


![04-pinecone-index.png](./assets/04-pinecone-index.png)

**Delete an index.**

**[Note]** If you want to delete the index, uncomment the lines below and run the code.

In [14]:
# index_name = "langchain-opentutorial-index2"

# pc_db.delete_index(index_name)
# print(pc_db.list_indexes())

## Create Sparse Encoder

- Create a sparse encoder.

- Configure stop-word handling.

- Fit the sparse encoder on `contents` . The fitted encoder is used to create sparse vectors when storing documents in the vector store.


A simplified NLTK-based BM25 tokenizer:

In [15]:
from utils.pinecone import NLTKBM25Tokenizer

tokenizer = NLTKBM25Tokenizer()

[INFO] Downloading NLTK stopwords and punkt tokenizer...
[INFO] NLTK setup completed.


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\thdgh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\thdgh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Tokenization test

In [16]:
text = "This is an example text, and it contains some punctuation and stop words."
tokens = tokenizer(text)

print("Before stop words modification:", tokenizer(text))
tokenizer.add_stop_words(["text", "stop"])
print("\nAfter adding stop words:", tokenizer(text))
tokenizer.remove_stop_words(["text", "stop"])
print("\nAfter removing stop words:", tokenizer(text))

Before stop words modification: ['example', 'text', 'contains', 'punctuation', 'stop', 'words']

After adding stop words: ['example', 'contains', 'punctuation', 'words']

After removing stop words: ['example', 'text', 'contains', 'punctuation', 'stop', 'words']
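
To make the stop-word behavior above concrete, here is a minimal sketch of a tokenizer with an editable stop-word set. It is not the `NLTKBM25Tokenizer` from `utils.pinecone` : the class name, the regular expression, and the tiny stop-word list are assumptions chosen so the example runs without NLTK.

```python
import re


class SimpleStopwordTokenizer:
    """Lowercases text, splits on non-word characters, and drops stop words."""

    def __init__(self, stop_words):
        self.stop_words = {w.lower() for w in stop_words}

    def add_stop_words(self, words):
        self.stop_words.update(w.lower() for w in words)

    def remove_stop_words(self, words):
        self.stop_words.difference_update(w.lower() for w in words)

    def __call__(self, text):
        tokens = re.findall(r"[a-z0-9']+", text.lower())  # punctuation is discarded
        return [t for t in tokens if t not in self.stop_words]


# Assumed stop-word list, much smaller than NLTK's English list.
demo_tokenizer = SimpleStopwordTokenizer(["this", "is", "an", "and", "it", "some"])
demo_text = "This is an example text, and it contains some punctuation and stop words."

print(demo_tokenizer(demo_text))
demo_tokenizer.add_stop_words(["text", "stop"])
print(demo_tokenizer(demo_text))
demo_tokenizer.remove_stop_words(["text", "stop"])
print(demo_tokenizer(demo_text))
```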


Create Sparse Encoder

In [17]:
from pinecone_text.sparse import BM25Encoder

sparse_encoder = BM25Encoder()

# Connect custom tokenizer
sparse_encoder._tokenizer = tokenizer

In [18]:
# sparse_encoder test
test_corpus = ["This is a text document.", "Another document for testing."]
sparse_encoder.fit(test_corpus)

print(sparse_encoder.encode_documents("Test document."))

  0%|          | 0/2 [00:00<?, ?it/s]

{'indices': [3127628307, 3368723024], 'values': [0.49504950495049505, 0.49504950495049505]}
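
The values in the output above can be understood with a small sketch of BM25-style document weighting. The formula, the defaults `k1=1.2` and `b=0.75` , and the token lists are assumptions made for this illustration. The real `BM25Encoder` also hashes each token to an integer index, which is not reproduced here.

```python
from collections import Counter


def fit_avgdl(tokenized_corpus):
    """Average document length (in tokens) over the fitted corpus."""
    return sum(len(doc) for doc in tokenized_corpus) / len(tokenized_corpus)


def bm25_document_weights(tokens, avgdl, k1=1.2, b=0.75):
    """Length-normalized term-frequency weight for each term of one document."""
    tf = Counter(tokens)
    dl = len(tokens)
    norm = k1 * (1.0 - b + b * dl / avgdl)
    return {term: count / (norm + count) for term, count in tf.items()}


# Assumed token lists after stop-word removal (not produced by NLTK here).
demo_corpus_tokens = [["text", "document"], ["another", "document", "testing"]]
demo_avgdl = fit_avgdl(demo_corpus_tokens)
demo_weights = bm25_document_weights(["test", "document"], demo_avgdl)
print(demo_weights)
```

Under these assumptions both terms receive a weight of about 0.495, which is consistent with the two values printed above. Fitting matters because the average document length of the corpus enters the normalization.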


Fit the sparse encoder on the corpus.

- `save_path` : Path where the fitted sparse encoder is saved in pickle format. The saved encoder is loaded later to embed queries, so choose a path you can reuse.

In [19]:
import pickle

save_path = "./sparse_encoder.pkl"

# Learn and save Sparse Encoder.
sparse_encoder.fit(contents)
with open(save_path, "wb") as f:
    pickle.dump(sparse_encoder, f)
print(f"[fit_sparse_encoder]\nSaved Sparse Encoder to: {save_path}")

  0%|          | 0/414 [00:00<?, ?it/s]

[fit_sparse_encoder]
Saved Sparse Encoder to: ./sparse_encoder.pkl


[Optional]  
Below is the code to use when you need to reload the learned and saved Sparse Encoder later.

In [20]:
file_path = "./sparse_encoder.pkl"

# It is used later to load the learned sparse encoder.
try:
    with open(file_path, "rb") as f:
        loaded_file = pickle.load(f)
    print(f"[load_sparse_encoder]\nLoaded Sparse Encoder from: {file_path}")
    sparse_encoder = loaded_file
except Exception as e:
    print(f"[load_sparse_encoder]\n{e}")
    sparse_encoder = None

[load_sparse_encoder]
Loaded Sparse Encoder from: ./sparse_encoder.pkl


## Pinecone: Add to DB Index (Upsert)

![04-pinecone-upsert](./assets/04-pinecone-upsert.png)

- `context` : The text content of the document chunk.
- `page` : The page number of the document.
- `source` : The source of the document.
- `values` : The dense embedding of the document, produced by the embedder.
- `sparse values` : The sparse embedding of the document, produced by the sparse encoder.
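
As a rough sketch of how the fields listed above fit together, the example below assembles upsert records and splits them into batches. The helper names `build_records` and `batched` are hypothetical, the `doc-<n>` id scheme is inferred from the ids that appear in the search output of this tutorial, and the vectors are tiny placeholders rather than real embeddings.

```python
def build_records(contents, metadatas, dense_vectors, sparse_vectors, start=0):
    """Combine text, metadata, and both embeddings into one record per chunk."""
    records = []
    for i, text in enumerate(contents):
        metadata = {key: values[i] for key, values in metadatas.items()}
        metadata["context"] = text  # the chunk text is stored as metadata
        records.append(
            {
                "id": f"doc-{start + i}",
                "values": dense_vectors[i],
                "sparse_values": sparse_vectors[i],
                "metadata": metadata,
            }
        )
    return records


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


demo_records = build_records(
    contents=["first chunk", "second chunk"],
    metadatas={"source": ["TheLittlePrince.pdf"] * 2, "page": [1, 2]},
    dense_vectors=[[0.1, 0.2], [0.3, 0.4]],
    sparse_vectors=[{"indices": [7], "values": [0.5]}, {"indices": [9], "values": [0.5]}],
)
print(demo_records[0])
print(len(list(batched(list(range(414)), 32))))
```

With 414 chunks and a batch size of 32, this batching yields 13 batches, matching the progress bars shown in the upsert cells.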

Upsert documents in sequential batches, without parallel processing.
Use this method when the number of documents is not large.

In [21]:
from langchain_openai import OpenAIEmbeddings

openai_embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Settings
embedder = openai_embeddings
batch_size = 32
namespace = "langchain-opentutorial-01"

# Running upsert on Pinecone
pc_db.upsert_documents(
    index=index,
    contents=contents,
    metadatas=metadatas,
    embedder=openai_embeddings,
    sparse_encoder=sparse_encoder,
    namespace=namespace,
    batch_size=batch_size,
)

Processing Batches: 100%|██████████| 13/13 [00:59<00:00,  4.59s/it]

{'dimension': 3072,
 'index_fullness': 0.0,
 'namespaces': {'langchain-opentutorial-01': {'vector_count': 414}},
 'total_vector_count': 414}





For large uploads, upsert the batches in parallel with multiple workers ( `max_workers` ), which is considerably faster than the sequential version.
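
As a rough illustration of the parallel approach, the sketch below submits batches to a thread pool. It is not the implementation of `upsert_documents_parallel` : the function names are hypothetical, and `fake_upsert` stands in for the network call so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upsert_in_parallel(batches, upsert_fn, max_workers=8):
    """Send each batch through upsert_fn on a worker thread; return records sent."""
    total = 0
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(upsert_fn, batch) for batch in batches]
        for future in as_completed(futures):
            total += future.result()  # re-raises any exception from the worker
    return total


def fake_upsert(batch):
    """Hypothetical stand-in for a call such as index.upsert(vectors=batch)."""
    return len(batch)


demo_items = list(range(414))
demo_batches = [demo_items[i:i + 32] for i in range(0, len(demo_items), 32)]
demo_sent = upsert_in_parallel(demo_batches, fake_upsert, max_workers=8)
print(demo_sent)
```

Threads suit this workload because each upsert spends most of its time waiting on the network rather than computing.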

In [22]:
from langchain_openai import OpenAIEmbeddings

openai_embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

embedder = openai_embeddings
# Set batch size and number of workers
batch_size = 32
max_workers = 8
namespace = "langchain-opentutorial-02"

# Running Upsert in Parallel on Pinecone
pc_db.upsert_documents_parallel(
    index=index,
    contents=contents,
    metadatas=metadatas,
    embedder=openai_embeddings,
    sparse_encoder=sparse_encoder,
    namespace=namespace,
    batch_size=batch_size,
    max_workers=max_workers,
)

Processing Batches in Parallel: 100%|██████████| 13/13 [00:12<00:00,  1.03it/s]


{'dimension': 3072,
 'index_fullness': 0.0,
 'namespaces': {'langchain-opentutorial-01': {'vector_count': 414},
                'langchain-opentutorial-02': {'vector_count': 0}},
 'total_vector_count': 414}

**[Note]** Index statistics can lag briefly behind writes. The stats printed right after the parallel upsert still report `vector_count: 0` for `langchain-opentutorial-02` ; querying the stats again returns the final count.


In [23]:
print(index.describe_index_stats())

{'dimension': 3072,
 'index_fullness': 0.0,
 'namespaces': {'langchain-opentutorial-01': {'vector_count': 414},
                'langchain-opentutorial-02': {'vector_count': 414}},
 'total_vector_count': 828}


![04-pinecone-namespaces-01.png](./assets/04-pinecone-namespaces-01.png)

## Index inquiry/delete

The `describe_index_stats` method returns statistics about the contents of an index, such as the index dimension and the number of vectors in each namespace.

**Parameters**
* `filter` (Optional[Dict[str, Union[str, float, int, bool, List, dict]]]): A filter that returns statistics only for vectors that meet certain conditions. Default is `None`.
* `**kwargs`: Additional keyword arguments.

**Return value**
* `DescribeIndexStatsResponse`: Object containing statistical information about the index.

**Usage example**
* Default usage: `index.describe_index_stats()`
* Apply filter: `index.describe_index_stats(filter={'key': 'value'})`
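
A small sketch of how the returned statistics might be summarized. It assumes the response has been converted to a plain dictionary with the same shape as the printed output in this tutorial. The helper `summarize_stats` is hypothetical.

```python
def summarize_stats(stats):
    """Return ({namespace: vector_count}, total) from a stats dictionary."""
    namespaces = stats.get("namespaces", {})
    counts = {name: info["vector_count"] for name, info in namespaces.items()}
    return counts, sum(counts.values())


# Values copied from the stats output shown in this tutorial.
demo_stats = {
    "dimension": 3072,
    "index_fullness": 0.0,
    "namespaces": {
        "langchain-opentutorial-01": {"vector_count": 414},
        "langchain-opentutorial-02": {"vector_count": 414},
    },
    "total_vector_count": 828,
}
demo_counts, demo_total = summarize_stats(demo_stats)
print(demo_counts, demo_total)
```

Comparing the per-namespace sum with `total_vector_count` is a quick sanity check after an upsert or a delete.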

In [24]:
# Index lookup
index.describe_index_stats()

{'dimension': 3072,
 'index_fullness': 0.0,
 'namespaces': {'langchain-opentutorial-01': {'vector_count': 414},
                'langchain-opentutorial-02': {'vector_count': 414}},
 'total_vector_count': 828}

**Search for documents in the index**

In [25]:
# Define your query
question = "If you come at 4 PM, I will be happy from 3 PM. As time goes by, I will become happier."

# Convert the query into dense and sparse vectors
dense_vector = embedder.embed_query(question)
# Note: the query is encoded with `encode_documents` here; `BM25Encoder` also
# provides `encode_queries` for query-side weighting.
sparse_vector = sparse_encoder.encode_documents(question)

results = pc_db.search(
    index=index,
    namespace="langchain-opentutorial-01",
    query=dense_vector,
    sparse_vector=sparse_vector,
    top_k=3,
    include_metadata=True,
)

print(results)

{'matches': [{'id': 'doc-303',
              'metadata': {'author': 'Paula MacDowell',
                           'context': "o'clock in the afternoon, then at three "
                                      "o'clock I shall begin to be happy. I "
                                      'shall feel happier and happier as the '
                                      "hour advances. At four o'clock, I shall "
                                      'already be worrying and jumping about. '
                                      'I shall show you how',
                           'page': 46.0,
                           'source': 'TheLittlePrince.pdf'},
              'score': 1.3499277,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': 'doc-304',
              'metadata': {'author': 'Paula MacDowell',
                           'context': 'happy I am! But if you come at just any '
                                      'time, I shall neve

**Delete namespace**

In [26]:
index.delete(delete_all=True, namespace="langchain-opentutorial-02")



![04-pinecone-namespaces-02.png](./assets/04-pinecone-namespaces-02.png)

In [27]:
index.describe_index_stats()

{'dimension': 3072,
 'index_fullness': 0.0,
 'namespaces': {'langchain-opentutorial-01': {'vector_count': 414}},
 'total_vector_count': 414}

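Deleting vectors by metadata filter is not available on every plan or index type; this tutorial treats it as a paid feature. On the index used here the request fails, as the output below shows.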

In [28]:
from pinecone.exceptions import PineconeException

try:
    index.delete(
        filter={"source": {"$eq": "TheLittlePrince.pdf"}},
        namespace="langchain-opentutorial-01",
    )
except PineconeException as e:
    print(f"Error while deleting using filter:\n{e}")

index.describe_index_stats()

Error while deleting using filter:
UNKNOWN:Error received from peer  {grpc_message:"Invalid request.", grpc_status:3, created_time:"2025-02-15T15:26:54.3610786+00:00"}


{'dimension': 3072,
 'index_fullness': 0.0,
 'namespaces': {'langchain-opentutorial-01': {'vector_count': 414}},
 'total_vector_count': 414}

## Create HybridRetrieve

**PineconeHybridRetriever initialization parameter settings**

The `init_pinecone_index` function and the `PineconeHybridRetriever` class implement a hybrid search system using Pinecone. This system combines dense and sparse vectors to perform effective document retrieval.

Pinecone index initialization

The `init_pinecone_index` function initializes the Pinecone index and sets up the necessary components.

Parameters 
* `index_name` (str): Pinecone index name 
* `namespace` (str): Namespace to use 
* `api_key` (str): Pinecone API key 
* `sparse_encoder_pkl_path` (str): Sparse encoder pickle file path 
* `stopwords` (List[str]): List of stop words 
* `tokenizer` (str): Tokenizer to use (default: "nltk") 
* `embeddings` (Embeddings): Embedding model 
* `alpha` (float): Weight that balances dense and sparse vectors (default: 0.5)
* `top_k` (int): Maximum number of documents to return (default: 4) 

**Main features** 
1. Pinecone index initialization and statistical information output
2. Sparse encoder (BM25) loading and tokenizer settings
3. Specify namespace


In [29]:
from langchain_openai import OpenAIEmbeddings
from utils.pinecone import PineconeDocumentManager
import os

pc_db = PineconeDocumentManager(api_key=os.environ.get("PINECONE_API_KEY"))

# Settings
index_name = "langchain-opentutorial-index"
namespace = "langchain-opentutorial-01"
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
# `sparse_encoder` is the BM25Encoder fitted (or reloaded) in the earlier cells

# Create Hybrid Search Retriever
retriever = pc_db.create_hybrid_search_retriever(
    index_name=index_name,
    embeddings=embeddings,
    sparse_encoder=sparse_encoder,
    namespace=namespace,
    alpha=0.5,
    top_k=4,
)

[INFO] Hybrid Search Retriever initialized for index 'langchain-opentutorial-index'.


**Main properties** 
* `embeddings` : Embedding model for dense vector transformations 
* `sparse_encoder` : Encoder for sparse vector transformations 
* `index` : Pinecone index object 
* `top_k` : Maximum number of documents to return 
* `alpha` : Weight that balances dense and sparse vectors 
* `namespace` : Namespace within the Pinecone index

**Features** 
* Hybrid search retriever combining dense and sparse vectors 
* Search strategy can be tuned through weight adjustment 
* Dynamic metadata filtering can be applied per query (using `search_kwargs` : `filter` , `top_k` , `alpha` , etc.)

**Usage example** 
1. Initialize the required components with the `init_pinecone_index` function.   
2. Create a `PineconeHybridRetriever` instance with the initialized components.  
3. Perform a hybrid search using the created retriever.

**General search**

In [30]:
query = "If you come at 4 PM, I will be happy from 3 PM. As time goes by, I will become happier."
search_results = retriever(query)

for result in search_results:
    print("Page Content:", result["metadata"]["context"])
    print("Metadata:", result["metadata"])
    print("\n====================\n")

Page Content: o'clock in the afternoon, then at three o'clock I shall begin to be happy. I shall feel happier and happier as the hour advances. At four o'clock, I shall already be worrying and jumping about. I shall show you how
Metadata: {'context': "o'clock in the afternoon, then at three o'clock I shall begin to be happy. I shall feel happier and happier as the hour advances. At four o'clock, I shall already be worrying and jumping about. I shall show you how", 'page': 46.0, 'author': 'Paula MacDowell', 'source': 'TheLittlePrince.pdf'}


Page Content: happy I am! But if you come at just any time, I shall never know at what hour my heart is to be ready to greet you . . . One must observe the proper rites . . ." "What is a rite?" asked the little prince.
Metadata: {'context': 'happy I am! But if you come at just any time, I shall never know at what hour my heart is to be ready to greet you . . . One must observe the proper rites . . ." "What is a rite?" asked the little prince.', 'pag

Use dynamic `search_kwargs` - `top_k` : Specify the maximum number of documents to return.

In [31]:
query = "If you come at 4 PM, I will be happy from 3 PM. As time goes by, I will become happier."

search_kwargs = {"top_k": 2}
search_results = retriever(query, **search_kwargs)

for result in search_results:
    print("Page Content:", result["metadata"]["context"])
    print("Metadata:", result["metadata"])
    print("\n====================\n")

Page Content: o'clock in the afternoon, then at three o'clock I shall begin to be happy. I shall feel happier and happier as the hour advances. At four o'clock, I shall already be worrying and jumping about. I shall show you how
Metadata: {'context': "o'clock in the afternoon, then at three o'clock I shall begin to be happy. I shall feel happier and happier as the hour advances. At four o'clock, I shall already be worrying and jumping about. I shall show you how", 'page': 46.0, 'author': 'Paula MacDowell', 'source': 'TheLittlePrince.pdf'}


Page Content: happy I am! But if you come at just any time, I shall never know at what hour my heart is to be ready to greet you . . . One must observe the proper rites . . ." "What is a rite?" asked the little prince.
Metadata: {'context': 'happy I am! But if you come at just any time, I shall never know at what hour my heart is to be ready to greet you . . . One must observe the proper rites . . ." "What is a rite?" asked the little prince.', 'pag


Use dynamic `search_kwargs` - `alpha` : Weight that balances dense and sparse vectors. Specify a value between 0 and 1. The default is `0.5` ; the closer the value is to 1, the higher the weight of the dense vector.
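
One common way to apply such a weight is a convex combination: scale the dense vector by `alpha` and the sparse values by `1 - alpha` before querying. The sketch below shows that idea. Whether the retriever in this tutorial scales vectors in exactly this way is an assumption, and `hybrid_scale` is a hypothetical helper.

```python
def hybrid_scale(dense, sparse, alpha):
    """Weight a dense vector by alpha and sparse values by (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse


demo_dense = [0.2, 0.4]
demo_sparse = {"indices": [10, 42], "values": [0.5, 0.5]}

for demo_alpha in (0.0, 0.5, 1.0):
    print(demo_alpha, hybrid_scale(demo_dense, demo_sparse, demo_alpha))
```

With `alpha=1` the sparse values become zero, so only the dense (semantic) part contributes to the score. With `alpha=0` only the sparse (keyword) part contributes.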

In [32]:
query = "If you come at 4 PM, I will be happy from 3 PM. As time goes by, I will become happier."

search_kwargs = {"alpha": 1, "top_k": 2}
search_results = retriever(query, **search_kwargs)

for result in search_results:
    print("Page Content:", result["metadata"]["context"])
    print("Metadata:", result["metadata"])
    print("\n====================\n")

Page Content: o'clock in the afternoon, then at three o'clock I shall begin to be happy. I shall feel happier and happier as the hour advances. At four o'clock, I shall already be worrying and jumping about. I shall show you how
Metadata: {'context': "o'clock in the afternoon, then at three o'clock I shall begin to be happy. I shall feel happier and happier as the hour advances. At four o'clock, I shall already be worrying and jumping about. I shall show you how", 'page': 46.0, 'author': 'Paula MacDowell', 'source': 'TheLittlePrince.pdf'}


Page Content: of misunderstandings. But you will sit a little closer to me, every day . . ." The next day the little prince came back. "It would have been better to come back at the same hour," said the fox. "If, for example, you come at four
Metadata: {'context': 'of misunderstandings. But you will sit a little closer to me, every day . . ." The next day the little prince came back. "It would have been better to come back at the same hour," said th

In [33]:
query = "If you come at 4 PM, I will be happy from 3 PM. As time goes by, I will become happier."

search_kwargs = {"alpha": 0, "top_k": 2}
search_results = retriever(query, **search_kwargs)

for result in search_results:
    print("Page Content:", result["metadata"]["context"])
    print("Metadata:", result["metadata"])
    print("\n====================\n")

Page Content: o'clock in the afternoon, then at three o'clock I shall begin to be happy. I shall feel happier and happier as the hour advances. At four o'clock, I shall already be worrying and jumping about. I shall show you how
Metadata: {'context': "o'clock in the afternoon, then at three o'clock I shall begin to be happy. I shall feel happier and happier as the hour advances. At four o'clock, I shall already be worrying and jumping about. I shall show you how", 'page': 46.0, 'author': 'Paula MacDowell', 'source': 'TheLittlePrince.pdf'}


Page Content: happy I am! But if you come at just any time, I shall never know at what hour my heart is to be ready to greet you . . . One must observe the proper rites . . ." "What is a rite?" asked the little prince.
Metadata: {'context': 'happy I am! But if you come at just any time, I shall never know at what hour my heart is to be ready to greet you . . . One must observe the proper rites . . ." "What is a rite?" asked the little prince.', 'pag

**Metadata filtering**

![04-pinecone-filter](./assets/04-pinecone-filter.png)

Use dynamic `search_kwargs` - `filter` : Apply metadata filtering.
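
To illustrate what filter expressions such as `{"page": {"$lt": 25}}` mean, here is a small local matcher for a subset of the comparison operators. It only mirrors the semantics on plain dictionaries: the actual filtering happens inside Pinecone, and the `matches` helper is hypothetical.

```python
import operator

# Subset of comparison operators used in metadata filters.
_OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata, flt):
    """Return True if metadata satisfies every condition in the filter."""
    for field, condition in flt.items():
        if field not in metadata:
            return False
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # bare value is shorthand for equality
        for op, operand in condition.items():
            if not _OPS[op](metadata[field], operand):
                return False
    return True


demo_pages = [{"page": 15.0}, {"page": 25.0}, {"page": 46.0}]
print([m["page"] for m in demo_pages if matches(m, {"page": {"$lt": 25}})])
print([m["page"] for m in demo_pages if matches(m, {"page": {"$in": [25, 16]}})])
```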

(Example) Search only chunks whose `page` value is less than 25.

In [34]:
query = "If you come at 4 PM, I will be happy from 3 PM. As time goes by, I will become happier."

search_kwargs = {"alpha": 1, "top_k": 3, "filter": {"page": {"$lt": 25}}}
search_results = retriever(query, **search_kwargs)

for result in search_results:
    print("Page Content:", result["metadata"]["context"])
    print("Metadata:", result["metadata"])
    print("\n====================\n")

Page Content: "I am very fond of sunsets. Come, let us go look at a sunset now." "But we must wait," I said. "Wait? For what?" "For the sunset. We must wait until it is time." At first you seemed to be very much surprised. And then you laughed to yourself. You said to me:
Metadata: {'context': '"I am very fond of sunsets. Come, let us go look at a sunset now." "But we must wait," I said. "Wait? For what?" "For the sunset. We must wait until it is time." At first you seemed to be very much surprised. And then you laughed to yourself. You said to me:', 'page': 15.0, 'author': 'Paula MacDowell', 'source': 'TheLittlePrince.pdf'}


Page Content: Hum! That will be about--about--that will be this evening about twenty minutes to eight. And you will see how well I am obeyed!" The little prince yawned. He was regretting his lost sunset. And then, too, he was already beginning to be a little bored.
Metadata: {'context': 'Hum! That will be about--about--that will be this evening about twenty minut

In [35]:
query = "If you come at 4 PM, I will be happy from 3 PM. As time goes by, I will become happier."

search_kwargs = {"alpha": 1, "top_k": 4, "filter": {"page": {"$in": [25, 16]}}}
search_results = retriever(query, **search_kwargs)

for result in search_results:
    print("Page Content:", result["metadata"]["context"])
    print("Metadata:", result["metadata"])
    print("\n====================\n")

Page Content: He should be able, for example, to order me to be gone by the end of one minute. It seems to me that conditions are favorable . . ." As the king made no answer, the little prince hesitated a moment. Then, with a sigh, he took his leave.
Metadata: {'context': 'He should be able, for example, to order me to be gone by the end of one minute. It seems to me that conditions are favorable . . ." As the king made no answer, the little prince hesitated a moment. Then, with a sigh, he took his leave.', 'page': 25.0, 'author': 'Paula MacDowell', 'source': 'TheLittlePrince.pdf'}


Page Content: way." "No," said the king. But the little prince, having now completed his preparations for departure, had no wish to grieve the old monarch. "If Your Majesty wishes to be promptly obeyed," he said, "he should be able to give me a reasonable order.
Metadata: {'context': 'way." "No," said the king. But the little prince, having now completed his preparations for departure, had no wish to griev