### What is Hybrid Search?

Hybrid search combines dense vector search (embeddings that capture semantic meaning) with traditional keyword-based sparse search to retrieve the most relevant results. Dense retrieval matches paraphrases and related concepts, while sparse retrieval matches exact terms such as names and identifiers. Using both improves accuracy and relevance, especially in information retrieval and question-answering systems.
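
As a rough illustration of how the two signals can be blended, the sketch below scores two toy documents with a weighted sum of a dense dot product and a sparse term-weight dot product. The vectors, the `alpha` weight, and the helper names (`hybrid_score`, `rank`) are illustrative assumptions and not part of Pinecone or LangChain.

```python
def dot(a, b):
    """Dot product of two dense vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    return sum(w * b[t] for t, w in a.items() if t in b)


def hybrid_score(dense_q, dense_d, sparse_q, sparse_d, alpha=0.5):
    """Weighted sum: alpha weights the dense score, (1 - alpha) the sparse score."""
    return alpha * dot(dense_q, dense_d) + (1 - alpha) * sparse_dot(sparse_q, sparse_d)


# Toy data with made-up values, for illustration only
docs = {
    "doc_a": {"dense": [0.9, 0.1], "sparse": {"india": 1.0}},
    "doc_b": {"dense": [0.2, 0.8], "sparse": {"cricket": 1.0, "captain": 1.0}},
}
query = {"dense": [0.3, 0.7], "sparse": {"india": 1.0}}


def rank(alpha):
    """Return document names ordered from highest to lowest hybrid score."""
    scored = {
        name: hybrid_score(query["dense"], d["dense"], query["sparse"], d["sparse"], alpha)
        for name, d in docs.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


for alpha in (0.0, 0.5, 1.0):
    print(alpha, rank(alpha))
```

With `alpha=1.0` only the dense score counts and `doc_b` ranks first; with `alpha=0.0` only the keyword match counts and `doc_a` ranks first.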

### Why Pinecone?
Pinecone is a managed vector database designed for similarity search and real-time analytics. It helps store and search high-dimensional vectors efficiently.

For hybrid search, Pinecone can store a dense vector and a sparse vector in the same record and query both at once. Results then reflect semantic similarity from the embeddings together with keyword matches from the sparse representation.
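
To make this concrete, a record in a hybrid index carries a dense vector under `values` and a sparse vector under `sparse_values`, the latter as parallel `indices` and `values` lists. The helper below is a minimal sketch that assembles such a record in plain Python without calling Pinecone; the function names, the token ids, and the `text` metadata key are hypothetical.

```python
def to_sparse_values(term_weights):
    """Convert {token_id: weight} into parallel indices/values lists, dropping zeros."""
    items = sorted((i, w) for i, w in term_weights.items() if w != 0.0)
    return {
        "indices": [i for i, _ in items],
        "values": [w for _, w in items],
    }


def build_record(record_id, dense, term_weights, text):
    """Assemble one sparse-dense record as a plain dict."""
    return {
        "id": record_id,
        "values": list(dense),                           # dense embedding
        "sparse_values": to_sparse_values(term_weights),  # keyword signal
        "metadata": {"text": text},                       # hypothetical metadata key
    }


# Made-up token ids and weights, for illustration only
record = build_record("doc-0", [0.1, 0.2], {7: 0.5, 3: 1.5, 9: 0.0}, "I am from India")
print(record)
```

Only non-zero weights are kept, which is what makes the sparse side cheap to store even when the vocabulary is large.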

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_community.retrievers import PineconeHybridSearchRetriever
from pinecone import Pinecone, ServerlessSpec

# Index Name in Pinecone
index_name = "hybrid-search-with-pinecone"

# Initialize the Pinecone client with the API key read from the environment
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
pc = Pinecone(api_key=PINECONE_API_KEY)

In [42]:
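# Handle to the index; the execution count shows this cell ran after the index-creation cell (In [7])
index = pc.Index(index_name)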

In [6]:
# Dense embedding model for semantic (vector) search
from langchain_huggingface import HuggingFaceEmbeddings

# Not used below: all-MiniLM-L6-v2 is downloaded once and run locally
HUGGINGFACE_API_KEY = os.getenv('HUGGINGFACE_API_KEY')

embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')
print(embeddings)

# Read the embedding size (384 for this model) from the pooling layer
word_embedding_dimension = embeddings.client[1].word_embedding_dimension



client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
) model_name='all-MiniLM-L6-v2' cache_folder=None model_kwargs={} encode_kwargs={} multi_process=False show_progress=False


In [7]:
# Create the Pinecone index on the first run.
# Sparse-dense (hybrid) queries require the dotproduct metric.
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=word_embedding_dimension,
        metric="dotproduct",
        spec=ServerlessSpec(cloud='aws', region='us-east-1'),
    )

In [32]:
# Importing the sparse encoder for keyword-based (sparse) search.
# SPLADE is a learned sparse model, not TF-IDF.

from pinecone_text.sparse import SpladeEncoder

splade = SpladeEncoder()
splade



<pinecone_text.sparse.splade_encoder.SpladeEncoder at 0x3087bf190>

In [34]:
corpus = [
    "I am Susovan",
    "I am from India",
    "MS Dhoni is the best captain in cricket till now."
]

In [35]:
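# Optional check: encode the corpus into sparse vectors.
# The result is not reused; retriever.add_texts() encodes the texts again itself.
sparse_vector = splade.encode_documents(corpus)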

In [46]:
# Combine the dense embeddings, the sparse encoder, and the Pinecone index
retriever = PineconeHybridSearchRetriever(
    embeddings=embeddings,
    sparse_encoder=splade,
    index=index,
)

In [47]:
retriever

PineconeHybridSearchRetriever(embeddings=HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False), sparse_encoder=<pinecone_text.sparse.splade_encoder.SpladeEncoder object at 0x3087bf190>, index=<pinecone.data.index.Index object at 0x173957010>)

In [48]:
retriever.add_texts(corpus)

100%|██████████| 1/1 [00:02<00:00,  2.93s/it]


In [49]:
retriever.invoke('Where I am from?')

[Document(page_content='I am from India'),
 Document(page_content='I am Susovan'),
 Document(page_content='MS Dhoni is the best captain in cricket till now.')]

In [52]:
retriever.invoke('Who is the best captain in cricket till now')

[Document(page_content='MS Dhoni is the best captain in cricket till now.'),
 Document(page_content='I am from India'),
 Document(page_content='I am Susovan')]