[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/hybrid-search/fast-intro/pubmed-bm25.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/hybrid-search/fast-intro/pubmed-bm25.ipynb)

# Hybrid Search in Pinecone

Pinecone now supports search across dense and sparse vectors with the introduction of the new hybrid vector index — a first-of-its-kind solution that lets engineers easily build keyword-aware semantic search into their applications. In this notebook, we will demonstrate how to use the new hybrid vector index.



## Install Dependencies

In [7]:
!pip install -qU \
    pinecone-datasets==0.7.0 \
    pinecone-client==3.1.0 \
    pinecone-text \
    torch \
    datasets \
    transformers \
    sentence-transformers \
    requests

## Initializing the Index

Now we need a place to store the embeddings we will create and to run efficient vector search over them. For that we use Pinecone. Get a [free API key](https://app.pinecone.io/) and provide it below, where we initialize the connection to Pinecone and create a new index.

In [10]:
import os
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
# read the key from the environment rather than hard-coding it in the notebook
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'

# configure client
pc = Pinecone(api_key=api_key)

Next we set up our index specification, which defines the cloud provider and region where the index is deployed. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [11]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

Create the index:

In [12]:
index_name = "hybrid-test"

In [13]:
import time

# check if index already exists (it shouldn't if this is first time)
if index_name not in pc.list_indexes().names():
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=384,
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
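The readiness loop above polls until the index reports ready, with no upper bound on the wait. As a minimal sketch (not part of the original notebook), a small polling helper with a timeout could look like this; `wait_until` and its parameters are illustrative names, not part of the Pinecone client.

```python
import time

def wait_until(predicate, timeout_s=60.0, interval_s=1.0):
    """Poll `predicate` until it returns True or `timeout_s` seconds elapse.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

# Hypothetical usage with the objects defined above:
# ready = wait_until(
#     lambda: pc.describe_index(index_name).status['ready'],
#     timeout_s=120,
# )
```

Returning a boolean instead of raising leaves it to the caller to decide what a timeout means.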

## Load Dataset

We will use a question-answering dataset specific to the medical domain, to show how the hybrid index helps rank search results better when domain-specific terms are used. We load the dataset from the Hugging Face dataset hub as follows:

In [14]:
from datasets import load_dataset

# load the medical dataset from the Hugging Face dataset hub
pubmed = load_dataset('pubmed_qa', 'pqa_labeled', split='train')
pubmed

Dataset({
    features: ['pubid', 'question', 'context', 'long_answer', 'final_decision'],
    num_rows: 1000
})

Let's look at a set of contexts (which we will be storing).

In [15]:
contexts = []
# loop through the context passages
for record in pubmed['context']:
    # join context passages for each question and append to contexts list
    contexts.append('\n'.join(record['contexts']))
# view some of the contexts
for context in contexts[:2]:
    print(f"{context[:300]}...")

Programmed cell death (PCD) is the regulated death of cells within an organism. The lace plant (Aponogeton madagascariensis) produces perforations in its leaves through PCD. The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cel...
Assessment of visual acuity depends on the optotypes used for measurement. The ability to recognize different optotypes differs even if their critical details appear under the same visual angle. Since optotypes are evaluated on individuals with good visual acuity and without eye disorders, differenc...


## Sparse Vectors

We will use the BERT tokenizer to create sparse vectors: each token ID becomes an index, and the number of times it occurs in the passage becomes the value.

In [16]:
from transformers import BertTokenizerFast

# load bert tokenizer from huggingface
tokenizer = BertTokenizerFast.from_pretrained(
    'bert-base-uncased'
)

Let's create a sparse vector for a context passage.

In [17]:
# tokenize the context passage
inputs = tokenizer(
    contexts[0], padding=True, truncation=True,
    max_length=512
)
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [18]:
# extract the input ids
input_ids = inputs['input_ids']
input_ids

[101,
 16984,
 3526,
 2331,
 1006,
 7473,
 2094,
 1007,
 2003,
 1996,
 12222,
 2331,
 1997,
 4442,
 2306,
 2019,
 15923,
 1012,
 1996,
 12922,
 3269,
 1006,
 9706,
 17175,
 18150,
 2239,
 11934,
 27806,
 1007,
 7137,
 2566,
 29278,
 10708,
 1999,
 2049,
 3727,
 2083,
 7473,
 2094,
 1012,
 1996,
 3727,
 1997,
 1996,
 3269,
 8676,
 1997,
 1037,
 17779,
 6198,
 1997,
 20134,
 1998,
 18323,
 9607,
 4372,
 20464,
 18606,
 2024,
 29111,
 1012,
 7473,
 2094,
 5158,
 1999,
 1996,
 4442,
 2012,
 1996,
 2415,
 1997,
 2122,
 2024,
 29111,
 1998,
 22901,
 15436,
 2015,
 1010,
 7458,
 3155,
 2274,
 4442,
 2013,
 1996,
 12436,
 28817,
 20051,
 5397,
 1012,
 1996,
 2535,
 1997,
 10210,
 11663,
 15422,
 4360,
 2076,
 7473,
 2094,
 2038,
 2042,
 3858,
 1999,
 4176,
 1025,
 2174,
 1010,
 2009,
 2038,
 2042,
 2625,
 3273,
 2076,
 7473,
 2094,
 1999,
 4264,
 1012,
 1996,
 2206,
 3259,
 3449,
 14194,
 8524,
 4570,
 1996,
 2535,
 1997,
 23079,
 10949,
 2076,
 13908,
 2135,
 12222,
 7473,
 2094,
 1999,
 2426

We need to convert this into a dictionary that maps each token ID to its frequency.

In [19]:
from collections import Counter

# convert the input_ids list to a dictionary of token ID to frequency
# (counts are stored as floats, as expected for sparse vector values)
sparse_vec = {k: float(v) for k, v in Counter(input_ids).items()}
sparse_vec

{101: 1.0,
 16984: 1.0,
 3526: 2.0,
 2331: 2.0,
 1006: 10.0,
 7473: 13.0,
 2094: 13.0,
 1007: 10.0,
 2003: 2.0,
 1996: 13.0,
 12222: 2.0,
 1997: 13.0,
 4442: 7.0,
 2306: 2.0,
 2019: 1.0,
 15923: 1.0,
 1012: 14.0,
 12922: 2.0,
 3269: 3.0,
 9706: 1.0,
 17175: 1.0,
 18150: 1.0,
 2239: 1.0,
 11934: 2.0,
 27806: 2.0,
 7137: 1.0,
 2566: 3.0,
 29278: 2.0,
 10708: 2.0,
 1999: 11.0,
 2049: 1.0,
 3727: 4.0,
 2083: 1.0,
 8676: 1.0,
 1037: 8.0,
 17779: 1.0,
 6198: 1.0,
 20134: 1.0,
 1998: 7.0,
 18323: 1.0,
 9607: 1.0,
 4372: 1.0,
 20464: 2.0,
 18606: 1.0,
 2024: 3.0,
 29111: 2.0,
 5158: 1.0,
 2012: 1.0,
 2415: 1.0,
 2122: 2.0,
 22901: 1.0,
 15436: 1.0,
 2015: 1.0,
 1010: 7.0,
 7458: 1.0,
 3155: 1.0,
 2274: 1.0,
 2013: 1.0,
 12436: 1.0,
 28817: 1.0,
 20051: 1.0,
 5397: 1.0,
 2535: 2.0,
 10210: 2.0,
 11663: 1.0,
 15422: 1.0,
 4360: 1.0,
 2076: 4.0,
 2038: 2.0,
 2042: 2.0,
 3858: 1.0,
 4176: 1.0,
 1025: 2.0,
 2174: 1.0,
 2009: 1.0,
 2625: 1.0,
 3273: 1.0,
 4264: 1.0,
 2206: 1.0,
 3259: 1.0,
 3449: 1.

Let's write a function to do this in batches. Notice that we are removing some keys from the dictionary. These are special tokens from the tokenizer which we do not really need when creating sparse vectors.

In [32]:
# IDs of bert-base-uncased special tokens: [PAD]=0, [CLS]=101, [SEP]=102, [MASK]=103
SPECIAL_TOKEN_IDS = {0, 101, 102, 103}


def build_dict(input_batch):
    # store a batch of sparse embeddings
    sparse_emb = []
    # iterate through input batch
    for token_ids in input_batch:
        # count how often each token ID occurs, skipping special tokens
        # (padding would otherwise be counted as a very frequent "term")
        counts = Counter(t for t in token_ids if t not in SPECIAL_TOKEN_IDS)
        # store parallel lists of integer indices and float values
        sparse_emb.append({
            'indices': list(counts.keys()),
            'values': [float(c) for c in counts.values()]
        })
    # return sparse_emb list
    return sparse_emb


def generate_sparse_vectors(context_batch):
    # create batch of input_ids
    inputs = tokenizer(
        context_batch, padding=True,
        truncation=True,
        max_length=512,
    )['input_ids']
    # create sparse dictionaries
    sparse_embeds = build_dict(inputs)
    return sparse_embeds
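
To make the `indices`/`values` format concrete, here is a small self-contained sketch on made-up token IDs. It needs no tokenizer download. It drops the special-token IDs (`0`, `101`, `102`, `103` in the `bert-base-uncased` vocabulary), as described above; `toy_batch` and `to_sparse` are illustrative names only.

```python
from collections import Counter

# Made-up token IDs for two short "documents".
# 101/102 mimic [CLS]/[SEP] and 0 mimics [PAD].
toy_batch = [
    [101, 7473, 2094, 7473, 102, 0, 0],
    [101, 3526, 2331, 3526, 3526, 2331, 102],
]
special_ids = {0, 101, 102, 103}

def to_sparse(token_ids):
    # Count each remaining token ID; Counter keeps first-seen order
    counts = Counter(t for t in token_ids if t not in special_ids)
    return {
        'indices': list(counts.keys()),
        'values': [float(c) for c in counts.values()],
    }

toy_sparse = [to_sparse(ids) for ids in toy_batch]
print(toy_sparse[0])  # {'indices': [7473, 2094], 'values': [2.0, 1.0]}
print(toy_sparse[1])  # {'indices': [3526, 2331], 'values': [3.0, 2.0]}
```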

The `generate_sparse_vectors` function above tokenizes a batch of passages and passes the resulting token IDs to `build_dict`, so we can create sparse vectors in batches.

## Dense Vectors

Alongside the sparse vectors, we also need dense vectors. We create them with a sentence-transformers model:

In [24]:
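import torch
from sentence_transformers import SentenceTransformer

# use a GPU if one is available, otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load a sentence transformer model from huggingface
model = SentenceTransformer(
    'multi-qa-MiniLM-L6-cos-v1',
    device=device
)
model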

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

The model gives us a `384`-dimensional dense vector, which matches the `dimension=384` we set when creating the index.
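
The model summary ends with a `Normalize()` layer, so its embeddings have unit length. For unit-length vectors the dot product equals cosine similarity, which is why the `dotproduct` metric chosen for the index still behaves like cosine similarity on the dense side. The sketch below checks this identity on random vectors in plain Python; the vectors are synthetic stand-ins, not real model output.

```python
import math
import random

random.seed(0)
dim = 384  # same dimensionality as the model output
a = [random.gauss(0, 1) for _ in range(dim)]
b = [random.gauss(0, 1) for _ in range(dim)]

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def normalize(x):
    # Scale a vector to unit L2 norm
    norm = math.sqrt(dot(x, x))
    return [v / norm for v in x]

# Cosine similarity of the raw vectors
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
# Plain dot product of the normalized vectors
dot_of_units = dot(normalize(a), normalize(b))

print(math.isclose(cosine, dot_of_units, abs_tol=1e-12))  # True
```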

## Upsert Documents

Now we can go ahead and generate sparse and dense vectors for the full dataset and upsert them along with the metadata to the new hybrid index. We can do that easily as follows:

In [34]:
from tqdm.auto import tqdm

batch_size = 32

for i in tqdm(range(0, len(contexts), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(contexts))
    # extract batch
    context_batch = contexts[i:i_end]
    # create unique IDs
    ids = [str(x) for x in range(i, i_end)]
    # add context passages as metadata
    meta = [{'context': context} for context in context_batch]
    # create dense vectors
    dense_embeds = model.encode(context_batch).tolist()
    # create sparse vectors
    sparse_embeds = generate_sparse_vectors(context_batch)

    vectors = []
    # loop through the data and create dictionaries for uploading documents to pinecone index
    for _id, sparse, dense, metadata in zip(ids, sparse_embeds, dense_embeds, meta):
        vectors.append({
            'id': _id,
            'sparse_values': sparse,
            'values': dense,
            'metadata': metadata
        })

    # upload the documents to the new hybrid index
    index.upsert(vectors=vectors)

# show index description after uploading the documents
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1376}},
 'total_vector_count': 1376}
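
As a quick check of the batching arithmetic in the loop above, the sketch below reproduces the batch boundaries for 1000 passages with a batch size of 32. It assumes `len(contexts)` equals the 1000 rows reported for the dataset, and it performs no embedding or upsert.

```python
import math

n_items = 1000    # num_rows reported for the dataset above
batch_size = 32   # same value as in the upsert loop

# Same start/end logic as the upsert loop: range(...) plus min(...) for the tail
bounds = [
    (i, min(i + batch_size, n_items))
    for i in range(0, n_items, batch_size)
]

print(len(bounds))            # 32
print(bounds[0], bounds[-1])  # (0, 32) (992, 1000)
```

The last batch holds only 8 passages, which is why the loop clamps the end index with `min`.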

## Querying

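Now we can query the index, providing the sparse and dense vectors of a question along with a weight, `alpha`, that balances semantic and keyword relevance. `alpha=1` gives a purely semantic (dense) search and `alpha=0` a purely keyword-based (sparse) search; values in between blend the two. Note that the sparse vectors here are raw token counts, so the keyword side is a term-frequency match rather than a full BM25 score. The helper below has no default for `alpha`, so it must be passed explicitly; `0.5` weights both sides equally.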

Let's write a helper function to execute queries and after that run some queries.

In [66]:
sparse_vec[0]['indices']

[101,
 2064,
 9349,
 7066,
 2224,
 1996,
 6887,
 4160,
 1011,
 1023,
 2000,
 14358,
 6245,
 1999,
 2111,
 2007,
 4432,
 3279,
 1029,
 102]

In [85]:
import numpy as np

In [103]:
def hybrid_scale(dense, sparse, alpha: float):
    # check alpha value is in range
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    # scale sparse and dense vectors to create hybrid search vecs
    hsparse = {
        'indices': sparse['indices'],
        'values': [v * (1 - alpha) for v in sparse['values']]
    }
    # dense is a batch holding one embedding, so take the first (only) row
    hdense = [v * alpha for v in dense[0]]
    return hdense, hsparse


def hybrid_query(question, top_k, alpha):
    # convert the question into a sparse vector
    sparse_vec = generate_sparse_vectors([question])[0]
    # convert the question into a dense vector (a list holding one embedding)
    dense_vec = model.encode([question]).tolist()
    # scale alpha with hybrid_scale
    dense_vec, sparse_vec = hybrid_scale(
        dense_vec, sparse_vec, alpha
    )
    # query pinecone with the query parameters
    result = index.query(
        vector=dense_vec,
        sparse_vector=sparse_vec,
        top_k=top_k,
        include_metadata=True
    )
    # return the search results
    return result
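
Scaling only the query vectors is enough to weight the two signals because the dot product is linear. If we assume the index scores a hybrid query as the dense dot product plus the sparse dot product, then scaling the query by `alpha` and `1 - alpha` yields `alpha * dense_score + (1 - alpha) * sparse_score`. The sketch below checks this on toy vectors; all names and numbers are made up for illustration, and the scoring rule is an assumption about the index rather than something shown in this notebook.

```python
import math

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def sparse_dot(a, b):
    # Dot product of two sparse vectors in {'indices', 'values'} form
    b_map = dict(zip(b['indices'], b['values']))
    return sum(v * b_map.get(i, 0.0) for i, v in zip(a['indices'], a['values']))

def scale_query(dense, sparse, alpha):
    # Same scaling idea as hybrid_scale, for a single (non-batched) dense vector
    hdense = [v * alpha for v in dense]
    hsparse = {
        'indices': sparse['indices'],
        'values': [v * (1 - alpha) for v in sparse['values']],
    }
    return hdense, hsparse

# Toy query and document vectors (made up)
q_dense, d_dense = [0.6, 0.8], [0.8, 0.6]
q_sparse = {'indices': [7, 42], 'values': [1.0, 2.0]}
d_sparse = {'indices': [42, 99], 'values': [3.0, 1.0]}

alpha = 0.3
hq_dense, hq_sparse = scale_query(q_dense, q_sparse, alpha)

# Score with the scaled query vs. an explicit weighted blend of the two scores
score_scaled = dot(hq_dense, d_dense) + sparse_dot(hq_sparse, d_sparse)
score_blend = (alpha * dot(q_dense, d_dense)
               + (1 - alpha) * sparse_dot(q_sparse, d_sparse))

print(score_scaled, score_blend)  # both approximately 4.488
```

In this toy case the sparse score (6.0) is much larger than the dense score (0.96). Raw term counts are not normalized, so the sparse side can dominate the blend even at moderate `alpha` values.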

In [104]:
dense_vec

[[-0.019297383725643158,
  0.0654173195362091,
  -0.014061014167964458,
  -0.006335671059787273,
  -0.0376705639064312,
  0.071548230946064,
  -0.010632597841322422,
  0.03229121118783951,
  -0.06527982652187347,
  0.0011939911637455225,
  -0.032171279191970825,
  0.06502251327037811,
  0.012238657101988792,
  0.08890759199857712,
  -0.034776072949171066,
  0.024041704833507538,
  -0.0186057947576046,
  0.07089893519878387,
  -0.0018389503238722682,
  0.012076079845428467,
  -0.03415577858686447,
  0.010876793414354324,
  0.05791918560862541,
  0.007891638204455376,
  -0.0010522764641791582,
  0.026829976588487625,
  -0.00619586743414402,
  -0.06759925186634064,
  0.0448341928422451,
  0.09566035866737366,
  -0.0003155610174871981,
  0.042571406811475754,
  0.07465416938066483,
  0.012672827579081059,
  -0.024033185094594955,
  0.014316917397081852,
  -0.10230477899312973,
  0.053635768592357635,
  -0.06510622799396515,
  -0.011515851132571697,
  -0.01692943088710308,
  0.0069727380760

In [105]:
hdense = [v * 0.3 for v in dense_vec[0]]
hdense

[-0.005789215117692947,
 0.01962519586086273,
 -0.004218304250389338,
 -0.001900701317936182,
 -0.01130116917192936,
 0.021464469283819197,
 -0.0031897793523967266,
 0.009687363356351852,
 -0.019583947956562042,
 0.00035819734912365673,
 -0.009651383757591248,
 0.019506753981113432,
 0.0036715971305966376,
 0.026672277599573135,
 -0.01043282188475132,
 0.007212511450052261,
 -0.00558173842728138,
 0.02126968055963516,
 -0.0005516850971616804,
 0.0036228239536285397,
 -0.010246733576059342,
 0.003263038024306297,
 0.01737575568258762,
 0.002367491461336613,
 -0.00031568293925374744,
 0.008048992976546288,
 -0.001858760230243206,
 -0.02027977555990219,
 0.013450257852673531,
 0.028698107600212096,
 -9.466830524615943e-05,
 0.012771422043442726,
 0.022396250814199447,
 0.0038018482737243176,
 -0.0072099555283784865,
 0.004295075219124555,
 -0.03069143369793892,
 0.01609073057770729,
 -0.019531868398189545,
 -0.003454755339771509,
 -0.005078829266130924,
 0.0020918214228004216,
 -0.0004613

In [111]:
question = "Can clinicians use the PHQ-9 to assess depression in people with vision loss?"

First, we will do a pure semantic search by setting the alpha value as 1.

In [112]:
hybrid_query(question, top_k=3, alpha=1)

{'matches': [{'id': '305',
              'metadata': {'context': 'The gap between evidence-based '
                                      'treatments and routine care has been '
                                      'well established. Findings from the '
                                      'Sequenced Treatments Alternatives to '
                                      'Relieve Depression (STAR*D) emphasized '
                                      'the importance of measurement-based '
                                      'care for the treatment of depression as '
                                      'a key ingredient for achieving response '
                                      'and remission; yet measurement-based '
                                      'care approaches are not commonly used '
                                      'in clinical practice.\n'
                                      'The Nine-Item Patient Health '
                                      'Questionnaire (PHQ-

The most relevant result from above is the second document with id 711. Now let's try with an alpha value of 0.3.

In [110]:
hybrid_query(question, top_k=3, alpha=0.3)

{'matches': [{'id': '956',
              'metadata': {'context': 'The differential diagnosis between '
                                      "essential tremor (ET) and Parkinson's "
                                      'disease (PD) may be, in some cases, '
                                      'very difficult on clinical grounds '
                                      'alone. In addition, it is accepted that '
                                      'a small percentage of ET patients '
                                      'presenting symptoms and signs of '
                                      'possible PD may progress finally to a '
                                      'typical pattern of parkinsonism. '
                                      'Ioflupane, '
                                      'N-u-fluoropropyl-2a-carbomethoxy-3a-(4-iodophenyl) '
                                      'nortropane, also called FP-CIT, '
                                      'labelled with (123)I (comm

The most relevant document is now ranked the highest.
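
To see how the ranking shifts as `alpha` changes, it can help to collect the returned IDs for several values side by side. The helper below is a hypothetical sketch: `compare_alphas` is not part of the notebook, and it assumes the query result can be read as `result['matches']` with an `'id'` on each match, as in the output shown above. A stub query function stands in for `hybrid_query` so the example runs without an index.

```python
def compare_alphas(query_fn, question, alphas, top_k=3):
    """Return {alpha: [match ids]} using query_fn(question, top_k=..., alpha=...)."""
    ranking = {}
    for alpha in alphas:
        result = query_fn(question, top_k=top_k, alpha=alpha)
        ranking[alpha] = [match['id'] for match in result['matches']]
    return ranking

# Stub standing in for hybrid_query; the IDs and ordering are invented
def fake_query(question, top_k, alpha):
    ids = ['a', 'b', 'c'] if alpha >= 0.5 else ['c', 'a', 'b']
    return {'matches': [{'id': i} for i in ids[:top_k]]}

print(compare_alphas(fake_query, "example question", [0.0, 0.3, 1.0]))
# {0.0: ['c', 'a', 'b'], 0.3: ['c', 'a', 'b'], 1.0: ['a', 'b', 'c']}

# Hypothetical usage against the real index:
# compare_alphas(hybrid_query, question, [0.0, 0.3, 0.5, 1.0])
```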

## Delete the Index

In [None]:
pc.delete_index(index_name)

---