# Semantic Similarity with Replicate Embedding Models

This notebook demonstrates how you can use embedding models on Replicate to power tasks like semantic search or clustering. We'll use one of Replicate's hosted versions of [MPNet](https://replicate.com/replicate/all-mpnet-base-v2/versions/f7565bcaa9b9ec3f3560e57421d85d7788d5402f5df305f599f8d5cda0a6d6bb), which is a pre-trained language model developed by Microsoft ([paper](https://arxiv.org/abs/2004.09297)).

You'll learn how to:

* Use the Replicate API to obtain embeddings for documents
* Set up a semantic similarity system using just `numpy` and Replicate

## Setup

First, we need to install `replicate`.

In [1]:
!pip install replicate

Collecting replicate
  Downloading replicate-0.8.1.tar.gz (22 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: replicate
  Building wheel for replicate (pyproject.toml) ... done
  Created wheel for replicate: filename=replicate-0.8.1-py3-none-any.whl size=21099 sha256=e64c5db250e0537cef80716207a530f874962fa53df59bda75fc219da3c25df6
  Stored in directory: /root/.cache/pip/wheels/12/86/e0/876cae2f7d3eabe6e3adabcab93b95d38f6d38843a7c311aeb
Successfully built replicate
Installing collected packages: replicate
Successfully installed replicate-0.8.1
You should consider upgrading via the '/root/.pyenv/versions/3.8.16/bin/python3.8 -m pip install --upgrade pip' command.

Next, we'll import the packages we'll rely on throughout the notebook.

In [59]:
import os
import replicate
import numpy as np
import json
from functools import lru_cache
from typing import List
import logging


We also need to specify our Replicate API token, which can be found in your Replicate Profile.

In [9]:
REPLICATE_API_TOKEN = input('Enter your Replicate API token here:')

# Set the environment variable
os.environ['REPLICATE_API_TOKEN'] = REPLICATE_API_TOKEN
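
If you would rather not have the token echoed into the notebook output, one option is to prompt for it with the standard library's `getpass` and only when the environment variable is not already set. This is a minimal sketch; `ensure_replicate_token` is a hypothetical helper name, not part of the Replicate client.

```python
import os
from getpass import getpass


def ensure_replicate_token(env_var="REPLICATE_API_TOKEN"):
    """Return the API token, prompting without echo only if it is not already set.

    Hypothetical helper for illustration; not part of the `replicate` package.
    """
    token = os.environ.get(env_var)
    if not token:
        # getpass hides the typed value, so the token never appears in cell output
        token = getpass("Enter your Replicate API token: ")
        os.environ[env_var] = token
    return token
```

Calling `ensure_replicate_token()` once at the top of the notebook is enough, because the Replicate client reads the token from the environment variable.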

# Calculating embeddings for documents

The model we're using is a sentence-transformer based on MPNet. 

Sentence-transformers are a specialized type of embedding model that generates a single numerical representation (embedding) for an entire document, such as a sentence or a paragraph. These models first calculate embeddings for individual tokens (words or subwords) in the document. Then, they perform mean-pooling over the token embeddings to generate a final document-level embedding. This representation captures the overall meaning of the document. By comparing the embeddings of different documents, you can measure their semantic similarity, which is useful for tasks like text classification, clustering, and search.
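
To make the pooling step concrete, here is a minimal NumPy sketch of mean-pooling over token embeddings. The hosted model does this internally, so you never need to run it yourself. The toy values, the padding mask, and the `mean_pool` helper are all assumptions made for illustration.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padding positions (mask == 0).

    Illustrative helper; the padding mask is an assumption about how
    variable-length inputs are typically handled.
    """
    token_embeddings = np.asarray(token_embeddings, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[:, None]  # shape (tokens, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    count = max(mask.sum(), 1e-9)  # guard against division by zero
    return summed / count


# Three toy 2-dimensional token embeddings; the third is padding
toy_tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]

doc_embedding = mean_pool(toy_tokens, toy_mask)
print(doc_embedding)  # [2. 1.]
```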

To obtain an embedding for a document, you can call the Replicate API with an input that specifies a `text` parameter. 

Note that you also need to specify the `model_version`, which, for the model we've selected, is `"replicate/all-mpnet-base-v2:f7565bcaa9b9ec3f3560e57421d85d7788d5402f5df305f599f8d5cda0a6d6bb"`.

In [77]:
# The model identifier is passed positionally, matching how it is passed later in this notebook
embedding = replicate.run(
  "replicate/all-mpnet-base-v2:f7565bcaa9b9ec3f3560e57421d85d7788d5402f5df305f599f8d5cda0a6d6bb",
  input={"text": "Map this into semantic space."}
)

print(len(embedding))

1


This will return a `list` containing a single `list` of floats, which constitute the embedding values of your input document.
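
Because the response wraps the embedding in an outer list, a typical next step is to unwrap it into a one-dimensional NumPy vector. The sketch below uses a made-up stand-in for the API response so it runs without a network call; real embeddings from this model have far more dimensions.

```python
import numpy as np

# Stand-in for the value returned by replicate.run(...) for a single `text` input.
# The numbers are illustrative only.
fake_response = [[0.12, -0.03, 0.88, 0.41]]

# Take the inner list and convert it to a 1-D vector
vector = np.array(fake_response[0])
print(vector.shape)  # (4,)
```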

## Batch Encoding

Often, when we need to obtain embeddings for a large number of documents, it's better to encode them in batches. When the model runs on a GPU, batching lets it process many documents in parallel. To obtain embeddings for a batch of documents, you can use the `text_batch` argument instead of the `text` argument.

`text_batch` expects a JSON-formatted list of documents, like this: 

In [75]:
candidates = ["This is a list of documents", "that will be processed as a batch."]
text_batch = json.dumps(candidates)
print("Here's our JSON-formatted list of documents:")
text_batch

Here's our JSON-formatted list of documents:


'["This is a list of documents", "that will be processed as a batch."]'

Now, to obtain embeddings, we just need to run:

In [76]:
# The model identifier is passed positionally, matching how it is passed later in this notebook
candidate_embeddings = replicate.run(
  "replicate/all-mpnet-base-v2:f7565bcaa9b9ec3f3560e57421d85d7788d5402f5df305f599f8d5cda0a6d6bb",
  input={"text_batch": text_batch}
)
print(len(candidate_embeddings))

2


As before, a `list` is returned. However, it now has a length of 2, because it contains embeddings for two documents.
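
For a corpus that is too large to send in one request, you might split it into chunks and JSON-encode each chunk separately. The sketch below shows only the chunking and encoding. The `json_batches` helper and the default batch size of 32 are assumptions, since nothing above states a limit for `text_batch`.

```python
import json
from typing import Iterator, List


def json_batches(documents: List[str], batch_size: int = 32) -> Iterator[str]:
    """Yield JSON-encoded chunks of `documents`, each with at most `batch_size` items.

    Hypothetical helper; the default batch size is an arbitrary choice.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(documents), batch_size):
        yield json.dumps(documents[start:start + batch_size])


docs = [f"document {i}" for i in range(5)]
payloads = list(json_batches(docs, batch_size=2))
print(len(payloads))  # 3
```

Each payload can then be passed as the `text_batch` input of its own API call, and the returned lists concatenated in order.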

# Building a Semantic Search PoC

Now we're ready to build out a Semantic Search PoC, which we'll implement as a simple Python class. During instantiation, our class will accept a list of candidate documents and it will compute and store their embeddings. We'll also design the `__call__` method so that calling an instance of the class with a query document will run a semantic search process against our candidate documents.




### Implementation

In [103]:

class SemanticSearch:

    def __init__(self, model_version, candidates):
        self.model_version = model_version
        self.candidates = candidates
        self.candidate_embeddings = self.encode_candidates(candidates)
    
    def encode_candidates(self, candidates: List[str]):
        """
        This function encodes the candidate documents into a `np.array` of embeddings.
        """
        
        print(f"Encoding {len(candidates)} docs...")

        text_batch = json.dumps(candidates)
        
        doc_embeddings = replicate.run(
            self.model_version,
            input={"text_batch": text_batch}
        )
        
        doc_embeddings = np.array(doc_embeddings)

        return doc_embeddings
    
    @lru_cache(maxsize=None)
    def encode_query(self, query):
        """
        This method encodes the query into a `np.array` of embeddings. It also uses a lru cache to avoid
        recomputing embeddings for identical queries. 
        """
        query_embedding = replicate.run(
            self.model_version,
            input={"text": query}
        )

        query_embedding = np.array(query_embedding[0])

        return query_embedding

    @staticmethod
    def _cos_sim(query_embedding, candidate_embeddings):
        """
        This function computes the cosine similarities between a query embedding and an array of candidate embeddings.
        """
        cosine_similarities = np.dot(candidate_embeddings, query_embedding) / (np.linalg.norm(candidate_embeddings, axis=1) * np.linalg.norm(query_embedding))
        return cosine_similarities

    
    def __call__(self, query, candidate_embeddings=None, candidates=None):
        """
        This method encodes `query` into an embedding, 
        calculates the cosine similarities between the query embedding and `candidate_embeddings`, and
        returns the index, score, and text of the document with the highest cosine similarity.
        """
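        # Compare against None explicitly: `not` on a multi-element NumPy array raises a ValueError
        if candidate_embeddings is None:
            candidate_embeddings = self.candidate_embeddings
        if candidates is None:
            candidates = self.candidates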

        # Get input embedding
        query_embedding = self.encode_query(query)

        # Compute the cosine similarity between the input embedding and all embeddings
        cosine_similarities = self._cos_sim(query_embedding, candidate_embeddings)
        
        # Get the index of the nearest neighbor
        indx = np.argsort(cosine_similarities)[-1]

        return {"id": indx,  "score": cosine_similarities[indx], "text": candidates[indx]}
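
The class above returns only the single nearest neighbor. If you want several results, a top-k variant is a small change. The sketch below is one way to do it on precomputed similarity scores. `top_k` is a hypothetical helper and the scores are toy values, not model output.

```python
import numpy as np


def top_k(cosine_similarities, candidates, k=3):
    """Return the k highest-scoring candidates, best first.

    Hypothetical helper that mirrors the result format used by SemanticSearch.
    """
    scores = np.asarray(cosine_similarities, dtype=float)
    k = min(k, len(scores))
    # argsort is ascending, so reverse it and keep the first k indices
    order = np.argsort(scores)[::-1][:k]
    return [
        {"id": int(i), "score": float(scores[i]), "text": candidates[i]}
        for i in order
    ]


toy_scores = [0.10, 0.75, 0.40]
toy_docs = ["doc a", "doc b", "doc c"]
results = top_k(toy_scores, toy_docs, k=2)
print([r["text"] for r in results])  # ['doc b', 'doc c']
```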
    
    

### Demonstration

Great! Now we're ready to test it out. For this exercise, I've just specified an assortment of 10 strings that we'll use as our candidate documents. We'll use them to instantiate our SemanticSearch instance.

In [104]:
model_id =   "replicate/all-mpnet-base-v2:f7565bcaa9b9ec3f3560e57421d85d7788d5402f5df305f599f8d5cda0a6d6bb"

candidates = [
    "The sun is shining and the birds are singing.",
    "Cats and dogs are popular pets.",
    "The ocean is deep and full of mysteries.",
    "I love eating pizza and drinking beer.",
    "Education is important for personal growth.",
    "The city skyline at night is beautiful.",
    "Dogs are loyal and loving companions.",
    "Music has the power to evoke emotions.",
    "Traveling to new places broadens your perspective.",
    "Rainy days are perfect for curling up with a good book."
]

search = SemanticSearch(model_id, candidates)

Encoding 10 docs...


Now, when we call our class instance with a query string, the following steps will be performed:

1. The query string will be encoded into an embedding
2. The query embedding will be compared against each candidate embedding via cosine similarity
3. The index and text of the candidate embedding with the highest cosine similarity will be returned
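
The comparison and selection steps can be traced by hand on toy data. The sketch below skips the encoding step and uses made-up 3-dimensional vectors in place of real embeddings, so it runs without calling the API.

```python
import numpy as np

# Toy stand-ins for the candidate embeddings and the encoded query (step 1)
candidate_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
query_embedding = np.array([2.0, 0.0, 0.0])

# Step 2: cosine similarity is the dot product divided by the product of the norms
scores = candidate_embeddings @ query_embedding / (
    np.linalg.norm(candidate_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Step 3: pick the candidate with the highest similarity
best = int(np.argmax(scores))
print(best)  # 0
```

The first candidate points in the same direction as the query, so its cosine similarity is 1 even though the two vectors differ in length. The orthogonal second candidate scores 0.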

In [105]:
search("What kinds of pets are popular?")

{'id': 1,
 'score': 0.7579522779442195,
 'text': 'Cats and dogs are popular pets.'}

In [106]:
search("Do you have any information on self-improvement?")

{'id': 4,
 'score': 0.4409378015110852,
 'text': 'Education is important for personal growth.'}

# Next steps...

Cool, we just built a simple, but effective semantic search PoC. What's next?

Well, while it's easy to build out a PoC, it can be a bit more difficult to ensure your users enjoy the best possible experience. And, unfortunately, there's no one-size-fits-all solution. However, here are some things to consider as you build out your system:

#### What was your model trained to do?

As always, it's crucial to understand what your model was trained to do. For example, the model we've selected for this tutorial was trained on sentence pairs: given one sentence from a pair, it learns to pick out its true partner from a set of other sentences (see [here](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#:~:text=seb.sbert.net-,Background,-The%20project%20aims) for more details). This is a _great_ way to develop a pre-trained model that can be used to calculate the semantic similarity between two sentences.

However, there are also some important implications to consider. For example, this model was not explicitly and exclusively trained to perform question answering. This means that you can meaningfully calculate semantic similarity for query strings that are not questions:

In [107]:
search("I love dogs.")

{'id': 6,
 'score': 0.6339098054211136,
 'text': 'Dogs are loyal and loving companions.'}

Another implication is that you may observe decreased accuracy if you're encoding long documents. This particular model was fine-tuned on _sentences_, and it may or may not be able to encode multi-sentence documents with comparable fidelity.
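
One common workaround, offered here as an assumption rather than something this notebook evaluates, is to split a long document into sentences and embed each sentence separately via `text_batch`. The sketch below covers only the splitting step. It uses a deliberately naive regular expression, and `split_sentences` is a hypothetical helper.

```python
import re
from typing import List


def split_sentences(document: str) -> List[str]:
    """Naively split on '.', '!' or '?' followed by whitespace.

    Hypothetical helper; it will mis-split abbreviations such as "e.g. this".
    """
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [part for part in parts if part]


long_doc = "Dogs are loyal. Cats are independent! Which do you prefer?"
sentences = split_sentences(long_doc)
print(sentences)
# ['Dogs are loyal.', 'Cats are independent!', 'Which do you prefer?']
```

Each sentence can then be scored against the query on its own, with the document ranked by its best-matching sentence. Whether that improves results for your data is something to measure rather than assume.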

#### Ensuring high-quality responses

Our PoC implementation simply returns the candidate document that received the highest similarity score. However, what happens if there simply isn't a good candidate document?

In [108]:
search("What is the tallest mountain in the world?")

{'id': 5,
 'score': 0.1848880883273737,
 'text': 'The city skyline at night is beautiful.'}

It would probably be better if we allowed our search system to abstain if no suitable matches were identified. 

We can implement a naive solution simply by adding a threshold to our scoring process.

In [109]:

class SemanticSearch:

    def __init__(self, model_version, candidates):
        self.model_version = model_version
        self.candidates = candidates
        self.candidate_embeddings = self.encode_candidates(candidates)
    
    def encode_candidates(self, candidates: List[str]):
        """
        This function encodes the candidate documents into a `np.array` of embeddings.
        """
        
        print(f"Encoding {len(candidates)} docs...")

        text_batch = json.dumps(candidates)
        
        doc_embeddings = replicate.run(
            self.model_version,
            input={"text_batch": text_batch}
        )
        
        doc_embeddings = np.array(doc_embeddings)

        return doc_embeddings
    
    @lru_cache(maxsize=None)
    def encode_query(self, query):
        """
        This method encodes the query into a `np.array` of embeddings. It also uses a lru cache to avoid
        recomputing embeddings for identical queries. 
        """
        query_embedding = replicate.run(
            self.model_version,
            input={"text": query}
        )

        query_embedding = np.array(query_embedding[0])

        return query_embedding

    @staticmethod
    def _cos_sim(query_embedding, candidate_embeddings):
        """
        This function computes the cosine similarities between a query embedding and an array of candidate embeddings.
        """
        cosine_similarities = np.dot(candidate_embeddings, query_embedding) / (np.linalg.norm(candidate_embeddings, axis=1) * np.linalg.norm(query_embedding))
        return cosine_similarities

    
    def __call__(self, query, candidate_embeddings=None, candidates=None, similarity_threshold=0.40):
        """
        This method encodes `query` into an embedding, 
        calculates the cosine similarities between the query embedding and `candidate_embeddings`, and
        returns the index, score, and text of the document with the highest cosine similarity,
        or an empty result if that similarity falls below `similarity_threshold`.
        """
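        # Compare against None explicitly: `not` on a multi-element NumPy array raises a ValueError
        if candidate_embeddings is None:
            candidate_embeddings = self.candidate_embeddings
        if candidates is None:
            candidates = self.candidates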

        # Get input embedding
        query_embedding = self.encode_query(query)

        # Compute the cosine similarity between the input embedding and all embeddings
        cosine_similarities = self._cos_sim(query_embedding, candidate_embeddings)
        
        # Get the index of the nearest neighbor
        indx = np.argsort(cosine_similarities)[-1]
        
        # Abstain if similarity is not high enough
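        if cosine_similarities[indx] < similarity_threshold:
            # Return an empty result with the same keys, matching the output shown below
            result = {"id": None, "score": None, "text": None}
        else:
            result = {"id": indx,  "score": cosine_similarities[indx], "text": candidates[indx]}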

        return result
    
    

In [110]:
model_id =   "replicate/all-mpnet-base-v2:f7565bcaa9b9ec3f3560e57421d85d7788d5402f5df305f599f8d5cda0a6d6bb"

candidates = [
    "The sun is shining and the birds are singing.",
    "Cats and dogs are popular pets.",
    "The ocean is deep and full of mysteries.",
    "I love eating pizza and drinking beer.",
    "Education is important for personal growth.",
    "The city skyline at night is beautiful.",
    "Dogs are loyal and loving companions.",
    "Music has the power to evoke emotions.",
    "Traveling to new places broadens your perspective.",
    "Rainy days are perfect for curling up with a good book."
]

search = SemanticSearch(model_id, candidates)
search("What is the tallest mountain in the world?")

Encoding 10 docs...


{'id': None, 'score': None, 'text': None}

Now our search system returns 