# Embeddings Encoder Customization

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/langkit/blob/main/langkit/examples/Custom_Encoder.ipynb)

Some modules in LangKit, such as the __themes__ and __input_output__ modules, use an encoder to convert text into embeddings. In this example, we will show how you can plug your own encoder into LangKit.

## Default Behavior

Let's use the themes module to illustrate the default behavior. If you import the module without additional configuration, LangKit uses its default encoder: the `all-MiniLM-L6-v2` model from the [Sentence Transformers](https://www.sbert.net/) library.


In [None]:
%pip install -U langkit[all] 
%pip install tensorflow tensorflow_hub

In [7]:
from langkit import themes

similarity_score = themes.group_similarity("Sorry, but I can't assist you with that.", "refusal")

print("The similarity score with default encoder is: ", similarity_score)

The similarity score with default encoder is:  0.9999999403953552


## Custom Encoder

You can also pass your own encoder function to be used by LangKit. Let's define a simple function that takes a list of strings and returns a list of embeddings, one for each input string. Then, we pass that function to the `themes` initializer as the `custom_encoder` parameter.

In [8]:
from typing import List

def embed(texts: List[str]) -> List[List[float]]:
    # Toy encoder: returns the same 2-dimensional vector for every input string
    return [[0.2, 0.2] for _ in texts]

themes.init(custom_encoder=embed)

Now, if we run the similarity calculation again, we see that the score is approximately 1.0. That makes sense: every string maps to the same embedding, and the cosine similarity between two identical vectors is 1.0.

In [3]:
similarity_score = themes.group_similarity("Sorry, but I can't assist you with that.", "refusal")

print("The similarity score with the custom encoder is: ", similarity_score)

The similarity score with the custom encoder is:  0.9999999403953552
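
To make the reasoning above concrete, here is a minimal sketch of the cosine similarity calculation applied to the constant `embed` function. The `cosine_similarity` helper is written here for illustration only and is not LangKit's internal implementation.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(texts: List[str]) -> List[List[float]]:
    # Same toy encoder as above: one constant vector per input string
    return [[0.2, 0.2] for _ in texts]


# Any two strings receive identical vectors, so the angle between them is zero
vec_a, vec_b = embed(["Sorry, but I can't assist you with that.", "any other text"])
score = cosine_similarity(vec_a, vec_b)
print(score)
```

The printed value is 1.0 up to floating-point rounding. A score slightly below 1.0, like the one reported above, is what you would expect if the computation runs in single precision, although that is an assumption about LangKit's internals.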


## Universal Sentence Encoder

Let's show another example with a real encoder. We will use Google's Universal Sentence Encoder to encode sentences into vectors. We will download the model from [TensorFlow Hub](https://www.tensorflow.org/hub) and pass the embed function into LangKit, just as before.

> If your local environment doesn't have the dependencies required by `tensorflow_hub`, we recommend running this example on Colab by clicking the button at the top of this example.

In [9]:
import tensorflow_hub as hub

# Load pre-trained universal sentence encoder model
use_embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

Let's use `input_output` as the example this time. It calculates the similarity score between a prompt and its response.

In [10]:
from langkit.input_output import init, prompt_response_similarity

init(custom_encoder=use_embed)

similarity_score = prompt_response_similarity({"prompt": ["What is the capital of France?"], "response": ["Mitochondria is the powerhouse of the cell"]})

print("The similarity score with the Universal Sentence Encoder is: ", similarity_score)

The similarity score with the Universal Sentence Encoder is:  [0.08993373811244965]
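
The cell above passes the TensorFlow Hub model object directly as the encoder. If you would rather have an encoder return plain Python lists, matching the `List[List[float]]` signature of the earlier `embed` example, a thin wrapper can do the conversion. The sketch below is an assumption-laden illustration: `to_list_encoder` and `fake_model` are hypothetical names that are not part of LangKit, and `fake_model` stands in for a real model so that the snippet runs without TensorFlow.

```python
from typing import Callable, List


def to_list_encoder(model: Callable) -> Callable[[List[str]], List[List[float]]]:
    """Wrap a model so that it returns one list of floats per input string."""

    def encode(texts: List[str]) -> List[List[float]]:
        raw = model(texts)
        # Eager TensorFlow tensors expose .numpy(); NumPy arrays expose .tolist()
        if hasattr(raw, "numpy"):
            raw = raw.numpy()
        if hasattr(raw, "tolist"):
            raw = raw.tolist()
        return [[float(v) for v in row] for row in raw]

    return encode


def fake_model(texts: List[str]):
    # Hypothetical stand-in for a real model: (character count, space count)
    return [(len(t), t.count(" ")) for t in texts]


encoder = to_list_encoder(fake_model)
vectors = encoder(["What is the capital of France?", "Hi"])
print(vectors)
```

With the real model, the wrapped encoder would be created as `to_list_encoder(use_embed)` and passed as `custom_encoder` in the same way as before. The wrapper is optional, since the example above already works with the model object passed directly.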
