[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb)

# Semantic Search

In this walkthrough, we'll learn how to use Pinecone for semantic search using a multilingual translation dataset.

We'll grab English sentences and search over a corpus of related sentences, aiming to find the subset most relevant to our query.


Semantic search is a form of retrieval that allows you to find documents that are similar in meaning to a given query, irrespective of the exact words used in the query or the documents.

Semantic search is often contrasted with lexical search, where keywords are used to identify documents relevant to a given query, though the two can also be combined!

It's especially helpful for applications that require an understanding of a query's intent (such as when a user asks a question over a corpus), or when traditional lexical search doesn't work (such as in multimodal or multilingual applications).
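
To make the contrast concrete, here is a minimal sketch of a purely lexical scorer that counts shared words. The function names and the example sentences are made up for illustration and are not part of Pinecone or the dataset used later.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))

def lexical_overlap(query: str, document: str) -> int:
    """Count the distinct words that the query and the document share."""
    return len(tokenize(query) & tokenize(document))

# Hypothetical query and documents, chosen to show the failure mode
query = "Where can I leave my car?"
documents = [
    "You can park in front of the library.",  # relevant, but shares few words
    "I leave for work at eight.",             # irrelevant, but shares more words
]

for doc in documents:
    print(lexical_overlap(query, doc), doc)
```

The word-overlap score ranks the irrelevant sentence higher because it happens to share "I" and "leave" with the query. A semantic approach aims to rank by meaning instead.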


## Installation

To begin, let's install the following libraries:

In [1]:
!uv pip install -qU \
  pinecone~=7.3.0 \
  pinecone-notebooks==0.1.1 \
  numpy==2.0.2 \
  datasets==3.5.1

---

🚨 _Note: the above `uv pip install` is formatted for Colab Jupyter notebooks. If running elsewhere, you may need to drop the `!`. If you want to run without uv, remove `uv` from the command._

---

## Setting up

### Get and Set the Pinecone API Key

We'll first need a free Pinecone account and API key.

This cell will help you create an account if you don't have one and then create an API key and save it in your Colab environment.

Run the cell below, and click the Pinecone Connect button to create an account or log in, and follow the prompts to generate an API key:

In [2]:
from pinecone_notebooks.colab import Authenticate

Authenticate()

Now that our key is ready, we can retrieve it from our environment and proceed:

In [3]:
import os

from pinecone import Pinecone

# Read the API key saved by the authentication cell above
api_key = os.environ.get("PINECONE_API_KEY")

# Initialize client
pc = Pinecone(
    api_key=api_key,
    # You can remove source_tag for your own projects!
    source_tag="pinecone_examples:docs:semantic_search",
)
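
Note that `os.environ.get` returns `None` when the variable is missing. As a small optional sketch, a helper like the one below fails early with a clearer message. `require_env` is a hypothetical name introduced here, not part of the Pinecone SDK.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; run the authentication cell or export it first."
        )
    return value

# Usage (commented out so this sketch runs without credentials):
# api_key = require_env("PINECONE_API_KEY")
```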

### Creating a Pinecone Index with Integrated Inference

Typically, semantic search requires three pieces: a processed data source (chunks, or records in Pinecone), an embedding model, and a vector database.

Integrated Inference allows you to specify the creation of a Pinecone index with a specific Pinecone-hosted embedding model, which makes it easy to interact with the index. To learn more about Integrated Inference, including what other models are available, take a [look here](https://docs.pinecone.io/guides/get-started/overview#integrated-embedding).


Here, we specify a starter tier index with the [llama-text-embed-v2](https://docs.pinecone.io/models/llama-text-embed-v2) embedding model. We also specify a mapping for what field in our records we will embed with this model. Then, we grab the index we just created for embedding later.

Want to embed a multilingual subset instead? Use the [multilingual-e5-large model](https://docs.pinecone.io/models/multilingual-e5-large) and specify it in place of the previous model when creating the index.

In [4]:

index_name = "semantic-search"

if not pc.has_index(index_name):
    pc.create_index_for_model(
        name=index_name,
        cloud="aws",
        region="us-east-1",
        embed={
            # Use this if you want to instead embed non-english or a multilingual subset of the data
            #"model":"multilingual-e5-large",
            "model": "llama-text-embed-v2",
            "field_map":{"text": "chunk_text"}
        }
    )

# Initialize index client
index = pc.Index(name=index_name)

# View index stats
index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}
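
Since the `field_map` above tells the model to embed the `chunk_text` field, every record we upsert later needs that field. The following is a minimal sketch of a pre-upsert check in plain Python. The helper name and the sample records are made up for illustration, and the check runs locally rather than calling Pinecone.

```python
# Same mapping that was passed to create_index_for_model above
field_map = {"text": "chunk_text"}

def missing_embed_field(records: list[dict], field_map: dict[str, str]) -> list[str]:
    """Return the IDs of records that lack a non-empty value for the mapped text field."""
    source_field = field_map["text"]
    return [r.get("id", "<no id>") for r in records if not r.get(source_field)]

# Hypothetical records: the second one uses the wrong field name
sample_records = [
    {"id": "0", "chunk_text": "We went for a walk in the park.", "lang": "en"},
    {"id": "1", "text": "Wrong field name, so nothing to embed.", "lang": "en"},
]

print(missing_embed_field(sample_records, field_map))
```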

## Creating our dataset

We're working with a small subset of a large multilingual dataset called Tatoeba. Tatoeba consists of hundreds of thousands of sentence translation pairs, and sometimes serves as a benchmark for cross-lingual semantic search capabilities.

In this notebook, we're just testing semantic search, so we'll grab a subset of English sentences that include the word "park".

Why "park"? In English, park has multiple meanings which occur in different contexts. It could mean a place, such as a public park. Or, it could mean an action with a car (to park) or a place for a car (a parking spot). Semantic search using embedding models will naturally distinguish between these contexts, without intervention or labeling!

This is the key benefit of semantic search: a way to abstract and represent the meaning of user queries without any additional work.

And, since our embedding model is inherently multilingual, we can even do this semantic search across several languages without any additional work!

In [5]:
from datasets import load_dataset
# specify that we want the english-spanish translation pairs
tatoeba = load_dataset("Helsinki-NLP/tatoeba", lang1="en", lang2="es", trust_remote_code=True, split="train")


Let's take a quick look at a few data points:

In [6]:
tatoeba[0:5]

{'id': ['0', '1', '2', '3', '4'],
 'translation': [{'en': "Let's try something.", 'es': '¡Intentemos algo!'},
  {'en': "Let's try something.", 'es': 'Intentemos algo.'},
  {'en': "Let's try something.", 'es': 'Permíteme hacer algo.'},
  {'en': "Let's try something.", 'es': 'Permíteme intentarlo.'},
  {'en': 'I have to go to sleep.', 'es': 'Tengo que irme a dormir.'}]}

In [8]:
keywords = ["park"]

def simple_keyword_filter(sentence, keywords):
    # Return True if the sentence contains any of the keywords.
    # This is a case-sensitive substring match, so "park" also matches "parking" and "parked".
    # This is really just for making a toy example quickly, not useful for production.
    return any(keyword in sentence for keyword in keywords)

def transform_dataset_for_pinecone(dataset, use_filter=True):
    # Feel free to adjust this code to simulate a larger search!

    if use_filter:
        # filter for a list of keywords by sentence, helpful for building intuition on semantic search
        translation_pairs = dataset.filter(
            lambda x: simple_keyword_filter(sentence=x["translation"]["en"], keywords=keywords)
        )
    else:
        # use the full 200k+ dataset. Run only if you want to embed this many records!
        translation_pairs = dataset

    # preview the first few translation pairs
    print(translation_pairs[:3])

    # flatten and shuffle for ease of use
    translation_pairs = translation_pairs.flatten()
    translation_pairs = translation_pairs.shuffle(seed=1)

    # If you want to include the Spanish subset, simply repeat the below steps with "es" instead of "en"
    # Be sure to create your index with multilingual-e5-large as well in this case!
    english_sentences = translation_pairs.rename_column("translation.en", "text").remove_columns("translation.es")

    # add lang column to indicate embedding origin
    english_sentences = english_sentences.add_column("lang", ["en"] * len(english_sentences))

    # convert to record format: one record per sentence in the dataset
    # each record contains an ID, the chunk_text field we will embed,
    # and a lang metadata field which we can use to filter if desired
    records = []
    for idx, sentence in enumerate(english_sentences):
        records.append(
            {
                "id": str(idx),
                "chunk_text": sentence["text"],
                "lang": sentence["lang"],
            }
        )

    return records


records = transform_dataset_for_pinecone(tatoeba)

Filter:   0%|          | 0/214127 [00:00<?, ? examples/s]

{'id': ['1057', '1058', '1059'], 'translation': [{'en': 'Sir, you are not allowed to park your car here.', 'es': 'Señor, usted no puede estacionar su coche aquí.'}, {'en': 'Sir, you are not allowed to park your car here.', 'es': 'No puede estacionarse aquí.'}, {'en': 'Sir, you are not allowed to park your car here.', 'es': 'Señor, no puede estacionar su coche aquí.'}]}


Flattening the indices:   0%|          | 0/416 [00:00<?, ? examples/s]
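
The preview above shows the same English sentence repeated once per Spanish translation, so an English-only index will contain duplicate texts. If that is not what you want, one optional approach is to keep only the first record for each distinct sentence. This is a minimal sketch: `dedupe_by_text` is a hypothetical helper, and the sample records below are hand-written in the same shape as the notebook's records.

```python
def dedupe_by_text(records: list[dict], text_field: str = "chunk_text") -> list[dict]:
    """Keep the first record for each distinct text value, preserving order."""
    seen: set[str] = set()
    unique = []
    for record in records:
        text = record[text_field]
        if text not in seen:
            seen.add(text)
            unique.append(record)
    return unique

# Hand-written sample in the same shape as the notebook's records
sample = [
    {"id": "0", "chunk_text": "Sir, you are not allowed to park your car here.", "lang": "en"},
    {"id": "1", "chunk_text": "Sir, you are not allowed to park your car here.", "lang": "en"},
    {"id": "2", "chunk_text": "Let's try something.", "lang": "en"},
]

print([r["id"] for r in dedupe_by_text(sample)])
```

Leaving the duplicates in is also a reasonable choice for a demo. It simply means the same sentence can appear more than once in the search results.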

## Upserting data into the Pinecone index

Here, we embed and upsert the data into Pinecone. Each record we formatted above is passed through the embedding model we specified earlier, which produces a vector embedding. These embeddings are then stored in Pinecone in batches, along with the additional fields we specified, which are known as metadata.

Metadata is handy for filtering. For example, if you stored several languages in the same index, you could use the `lang` field to return results in just one of them. To learn more about metadata, take a [look here](https://docs.pinecone.io/guides/index-data/indexing-overview#metadata).

We also create a namespace called "english-sentences". Namespaces are a higher-level unit of organization within a Pinecone index.

Querying a namespace restricts the search to the records stored in that namespace, which has the nice effect of speeding up searches too.

To learn more about namespaces, [look here](https://docs.pinecone.io/guides/index-data/indexing-overview#namespaces).


In [None]:
from tqdm import tqdm

batch_size = 96
namespace = "english-sentences"


# We upsert in batches of 96 to avoid hitting the embedding model's rate limit.
# Libraries like backoff can be used here to handle large embedding jobs.

for start in tqdm(range(0, len(records), batch_size), desc="Upserting records batch: "):
    index.upsert_records(records=records[start:start+batch_size], namespace=namespace)

Upserting records batch: 100%|██████████| 5/5 [00:02<00:00,  1.91it/s]
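
For larger jobs, the comment in the cell above suggests a retry library such as backoff. As a rough standard-library alternative, the sketch below shows the same idea: split the records into batches and retry a failing call with exponentially growing delays. `batched` and `call_with_retries` are hypothetical helpers, and catching every exception is a simplification. Real code would likely retry only on rate-limit or transient network errors.

```python
import time
from typing import Callable, Iterator

def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def call_with_retries(
    fn: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fn, retrying on any exception with exponential backoff.

    Returns the number of attempts used. The delay doubles after each failure:
    base_delay, 2 * base_delay, 4 * base_delay, and so on.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            fn()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")

# Usage with the notebook's objects (commented out so the sketch runs offline):
# for batch in batched(records, 96):
#     call_with_retries(lambda b=batch: index.upsert_records(records=b, namespace=namespace))
```

With 416 records and a batch size of 96, `batched` yields five batches, which matches the five steps in the progress bar above.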


## Making Queries

Now that our index is populated we can begin making queries.

The tricky part about querying with semantic search is that we'd normally need to involve an embedding model here too!

But with Pinecone's Integrated Inference, we can just send the text we want to search with to the index we created. The search query is vectorized using the same embedding model we specified earlier, and the resulting vector is used to find and return the closest vectors in the database.

Neat!

Our goal here is to write a query sentence that uses one form of the word park, and find sentences that use park in a semantically similar manner. So, let's try this:


In [None]:
search_query = "I want to go to the park and relax"

results = index.search(
    namespace=namespace,
    query={
        "top_k": 10,
        "inputs": {
            'text': search_query
        }
    }
)

for result in results["result"]["hits"]:
    print(f'Sentence: {result["fields"]["chunk_text"]} Semantic Similarity Score: {result["_score"]}\n')

And now, let's use the other meaning of the word "park"!

In [None]:
search_query = "I need a place to park"

results = index.search(
    namespace=namespace,
    query={
        "top_k": 10,
        "inputs": {
            'text': search_query
        }
    }
)

for result in results["result"]["hits"]:
    print(f'Sentence: {result["fields"]["chunk_text"]} Semantic Similarity Score: {result["_score"]}\n')
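
Both queries return the top 10 hits regardless of how similar they are. If you only want hits above some similarity score, one option is to post-process the response. This is a minimal sketch that assumes the same response shape used in the cells above (`results["result"]["hits"]`, with `_score` and `fields` on each hit). The helper name, the mock response, its scores, and the threshold are all made up for illustration.

```python
def hits_above(results: dict, min_score: float) -> list[tuple[str, float]]:
    """Return (text, score) pairs for hits scoring at least min_score, best first."""
    hits = results["result"]["hits"]
    kept = [
        (hit["fields"]["chunk_text"], hit["_score"])
        for hit in hits
        if hit["_score"] >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hand-written mock response; the sentences and scores are not real query output
mock_results = {
    "result": {
        "hits": [
            {"_id": "3", "_score": 0.41, "fields": {"chunk_text": "Let's take a walk in the park."}},
            {"_id": "7", "_score": 0.18, "fields": {"chunk_text": "You can't park here."}},
            {"_id": "9", "_score": 0.36, "fields": {"chunk_text": "The park is full of children."}},
        ]
    }
}

for text, score in hits_above(mock_results, min_score=0.3):
    print(f"{score:.2f}  {text}")
```

A sensible threshold depends on the embedding model and the data, so it is worth inspecting real scores before choosing one.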

## Wait, how is this working?

When performing semantic search with Pinecone's vector database, you are asking the following question: Given this query vector, what are the closest vectors to it in the database?

Because of the way embedding models are trained, closeness in vector space corresponds to similarity in meaning. The metric used in our implementation is cosine similarity, which is the cosine of the angle between the query vector and a document vector: the smaller the angle, the higher the score. For a small number of vectors, this task is trivial, but what happens when you have hundreds of thousands, millions, or even billions? And what about query latency?
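
To make the metric concrete, here is a short pure-Python sketch of cosine similarity, defined as the dot product of two vectors divided by the product of their lengths. The toy two-dimensional vectors are illustrative only. Real embeddings from this index have 1024 dimensions, and the sketch assumes neither vector is all zeros.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, perpendicular, opposite)
```

The score depends only on direction, not length: vectors pointing the same way score 1, perpendicular vectors score 0, and opposite vectors score -1.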

The magic of Pinecone's vector database lies in advanced algorithms that can index and search billions of vectors quickly and effectively!
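
For intuition about why scale matters, here is a sketch of the naive alternative: an exact, brute-force search that scores the query against every stored vector. It is not how Pinecone works internally. The function names and the toy two-dimensional vectors are made up for illustration.

```python
import heapq
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def brute_force_top_k(
    query: list[float], vectors: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Score the query against every vector and return the k best (id, score) pairs."""
    scored = ((cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items())
    return [(vec_id, score) for score, vec_id in heapq.nlargest(k, scored)]

# Toy vectors: the first axis loosely stands for "green space", the second for "cars"
toy_vectors = {
    "picnic": [0.9, 0.1],
    "playground": [0.8, 0.3],
    "garage": [0.1, 0.9],
    "meter": [0.2, 0.8],
}

top = brute_force_top_k([1.0, 0.0], toy_vectors, k=2)
print(top)
```

Every query touches every stored vector, so the cost grows linearly with the size of the index. That is fine for four vectors and far too slow at billion scale, which is why purpose-built indexing matters.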

## Demo Cleanup

You can go ahead and ask more queries above. When you're done, delete the index to save resources.

Congrats, you've just implemented semantic search with Pinecone!


In [None]:
pc.delete_index(name=index_name)

---