# Create embeddings with Generative AI Hub
Like any other machine learning model, foundation models work only with numbers. In the context of generative AI, these numbers are embeddings: numerical representations of unstructured data such as text and images. OpenAI's text embedding model `text-embedding-3-small`, for example, turns your input text into 1536 numbers, that is, a vector with 1536 dimensions.

👉 Select the kernel again. Make sure to select the same virtual environment as in the previous exercise so that all your packages are installed.

In [1]:
import init_env

init_env.set_environment_variables()

from gen_ai_hub.proxy.native.openai import embeddings

# TODO assign the model name of the embedding model here, e.g. "text-embedding-3-small"
EMBEDDING_MODEL_NAME = "text-embedding-3-small"

## Create embeddings
Define the function **get_embedding()**.

In [2]:
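def get_embedding(input_text):
    # Request an embedding for the input text from the configured model
    response = embeddings.create(
        input=input_text,
        model_name=EMBEDDING_MODEL_NAME
    )
    # The response holds one entry per input; return the vector of the first one
    embedding = response.data[0].embedding
    return embedding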

Get embeddings for the words: **apple, orange, phone** and for the phrases: **I love dogs, I love animals, I hate cats.**

In [3]:
apple_embedding = get_embedding("apple")
orange_embedding = get_embedding("orange")
phone_embedding = get_embedding("phone")
dog_embedding = get_embedding("I love dogs")
animals_embedding = get_embedding("I love animals")
cat_embedding = get_embedding("I hate cats")

print(apple_embedding)

[0.01764063909649849, -0.016817327588796616, -0.04184354469180107, 0.019008787348866463, -0.0018100723391398787, -0.026902882382273674, 0.007264504674822092, 0.02287108078598976, -0.01952940970659256, -0.016732575371861458, 0.02187826670706272, 0.00507607264444232, -0.05138426274061203, -0.004758250433951616, 0.01920250616967678, -0.006774150300770998, -0.02586163580417633, -0.014589546248316765, -0.006586484145373106, -0.0010881621856242418, -0.017773820087313652, 0.04704977571964264, 0.003592902794480324, 0.04620224982500076, 0.0332956463098526, -0.011786658316850662, -0.019166184589266777, 0.015933480113744736, 0.007470332086086273, -0.0271934624761343, -0.03162480890750885, -0.04586323723196983, 0.04457984119653702, -0.011296304874122143, 0.0025577109772711992, 0.002215673914179206, 0.022822652012109756, -0.01965048536658287, 0.008130190894007683, -0.004068122711032629, -0.0077790734358131886, 0.0388408824801445, 0.06160299852490425, 0.039664193987846375, 0.03208489343523979, -0.05

## Calculate vector similarities
To calculate the cosine similarity of the vectors, we also need the [SciPy](https://scipy.org/) package. SciPy contains many fundamental algorithms for scientific computing.

Cosine similarity measures how closely two vectors point in the same direction: it is the cosine of the angle between them and ranges from -1 to 1. The higher the value, the more similar the embedded texts.
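To make the definition concrete, here is a minimal sketch that computes cosine similarity by hand using only the Python standard library. The function name `cosine_similarity` and the small two-dimensional vectors are illustrative assumptions, not part of the exercise; real embeddings have far more dimensions, but the arithmetic is the same.

```python
import math


def cosine_similarity(vector_1, vector_2):
    """Return dot(v1, v2) / (|v1| * |v2|) for two equal-length sequences."""
    if len(vector_1) != len(vector_2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vector_1, vector_2))
    norm_1 = math.sqrt(sum(a * a for a in vector_1))
    norm_2 = math.sqrt(sum(b * b for b in vector_2))
    return dot / (norm_1 * norm_2)


# Hypothetical toy vectors, chosen only to show the three extreme cases
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # right angle
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # opposite direction

print(same_direction, orthogonal, opposite)
```

Parallel vectors score close to 1, perpendicular vectors score 0, and opposite vectors score -1. Note that the length of a vector does not matter, only its direction.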

👉 Import the SciPy package and define the function **get_cosine_similarity()**.

In [4]:
from scipy import spatial

# spatial.distance.cosine returns the cosine distance (1 - similarity),
# so subtract it from 1 to get the cosine similarity
def get_cosine_similarity(vector_1, vector_2):
    return 1 - spatial.distance.cosine(vector_1, vector_2)

👉 Calculate similarities between the embeddings of the words and phrases from above and find the most similar vectors. You can follow the example below.

In [5]:
print("apple-orange")
print(get_cosine_similarity(apple_embedding, orange_embedding))

apple-orange
0.471337352219533
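Instead of comparing pairs one by one, you could score every pair and sort the results. The following is a minimal sketch under stated assumptions: the three-dimensional `toy_embeddings` are made-up stand-ins for the real embedding vectors, and `rank_pairs` and `cosine_similarity` are hypothetical helper names. In the notebook, you would pass a dictionary of the embeddings created above and could use `get_cosine_similarity()` as the scoring function.

```python
import math
from itertools import combinations


def cosine_similarity(vector_1, vector_2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vector_1, vector_2))
    norm_1 = math.sqrt(sum(a * a for a in vector_1))
    norm_2 = math.sqrt(sum(b * b for b in vector_2))
    return dot / (norm_1 * norm_2)


def rank_pairs(embeddings_by_label):
    """Return (score, label_a, label_b) tuples, most similar pair first."""
    scored = [
        (cosine_similarity(embeddings_by_label[a], embeddings_by_label[b]), a, b)
        for a, b in combinations(embeddings_by_label, 2)
    ]
    return sorted(scored, reverse=True)


# Made-up toy vectors, NOT real model output
toy_embeddings = {
    "apple": [0.9, 0.1, 0.0],
    "orange": [0.8, 0.2, 0.1],
    "phone": [0.1, 0.9, 0.2],
}

ranked = rank_pairs(toy_embeddings)
for score, label_a, label_b in ranked:
    print(f"{label_a}-{label_b}: {score:.3f}")
```

With these toy vectors, the apple-orange pair ranks first and apple-phone ranks last. The scores from the real embeddings will differ.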


[Next exercise](05-store-embeddings-hana.ipynb)