# Embeddings

<h3>What are embeddings?</h3>

- Embeddings are numerical representations of concepts: each concept is converted to a sequence of numbers (a vector), which makes it easy for computers to work out the relationships between concepts.
  Text embeddings measure the relatedness of text strings.

- Embeddings are useful for working with natural language and code, because they can be readily consumed and compared by other machine learning models and algorithms like clustering or search.

- Embeddings that are numerically similar are also semantically similar. 

[Embeddings by OpenAI](https://openai.com/blog/introducing-text-and-code-embeddings)


<h3>Embeddings are commonly used for:</h3>

- Search (where results are ranked by relevance to a query string)

- Clustering (where text strings are grouped by similarity)

- Recommendations (where items with related text strings are recommended)

- Anomaly detection (where outliers with little relatedness are identified)

- Diversity measurement (where similarity distributions are analyzed)

- Classification (where text strings are classified by their most similar label)
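
Most of the uses listed above reduce to the same operation: compare one embedding against others and act on the scores. As a minimal sketch of the search case, the snippet below ranks a few documents against a query. The 3-dimensional vectors, the document names, and the `cosine` helper are hypothetical stand-ins for illustration; a real embedding model produces vectors with hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (not produced by any real model).
documents = {
    "dog care guide": [0.9, 0.1, 0.0],
    "cat grooming tips": [0.8, 0.3, 0.1],
    "pizza recipes": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]  # stands in for the embedding of a pet-related query

# Search: sort documents by their similarity to the query, best match first.
ranked = sorted(
    documents,
    key=lambda name: cosine(query, documents[name]),
    reverse=True,
)
print(ranked)
```

Classification and recommendation follow the same pattern: replace the documents with label or item embeddings and take the top-scoring entry.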

<h3>Cosine similarity algorithm: Deep dive</h3>

Cosine similarity measures the similarity between two non-zero vectors of an inner product space as the cosine of the angle between them, giving a value between -1 and 1. A value of -1 means the vectors point in opposite directions, 0 means they are orthogonal, and 1 means they point in the same direction.
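
The three reference values can be checked with a short sketch on hand-picked 2-dimensional vectors. The `cosine` helper and the example vectors are illustrative assumptions and use only the standard library. Because of floating-point rounding, the results for parallel and opposite vectors are only approximately 1 and -1.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors, different lengths
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # same line, opposite direction

print(same_direction, orthogonal, opposite)
```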



<br>

<img src="./img/cosine-similarity.png" width="800px"/>

<br>

To compute the cosine similarity between vectors A and B, you can use the following formula:
<br><br>
<img src="./img/similarity-formula.png" width="400px"/>

<br>
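
For readers who cannot see the image, the standard definition for two n-dimensional vectors can be written out as follows, where θ is the angle between A and B:

$$
\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}}
$$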

Cosine similarity is often used in text analytics to compare documents and quantify how similar they are. Each document must first be represented as a vector in which every unique word is a dimension and the frequency (or weight) of that word in the document is the value along that dimension. Once documents are vectors, the comparison is straightforward: we measure the cosine of the angle between them. If the angle is small, the cosine is high and the documents are similar. If the angle is large, the cosine is low and the documents are dissimilar.

Cosine similarity considers the orientation of the vectors but not their magnitudes. For documents, this means that even texts of very different lengths can be considered similar if they are about the same topic.
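
The word-count representation described above can be sketched in a few lines. This is an illustrative example under simple assumptions: whitespace tokenization, raw counts as weights, and made-up example documents. The helper names `term_frequency_vector` and `cosine` are hypothetical. Repeating a document three times changes its length but not its direction, so its similarity to the original stays at approximately 1.

```python
import math
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


short_doc = "the dog chased the ball"
long_doc = " ".join([short_doc] * 3)   # same content, three times as long
other_doc = "pizza with extra cheese"  # shares no words with the others

# One dimension per unique word across all documents.
vocabulary = sorted(set(f"{short_doc} {long_doc} {other_doc}".split()))

short_vec = term_frequency_vector(short_doc, vocabulary)
long_vec = term_frequency_vector(long_doc, vocabulary)
other_vec = term_frequency_vector(other_doc, vocabulary)

same_topic = cosine(short_vec, long_vec)        # magnitude differs, direction does not
different_topic = cosine(short_vec, other_vec)  # no shared words

print(same_topic, different_topic)
```

Raw word counts only match identical words, which is why learned embeddings such as the ones in the following sections are better at capturing meaning.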

> Intuitive interpretation and versatility of the cosine similarity algorithm have found their way into various applications, spanning from text analysis and recommendation systems to complex graph databases. The algorithm's ability to capture the orientation of vectors makes it a robust measure of similarity, especially in high-dimensional spaces.

[source](https://memgraph.com/blog/cosine-similarity-python-scikit-learn)

## Words

In [6]:
!pip install spacy scikit-learn
!python -m spacy download en_core_web_md


In [8]:
import spacy

nlp = spacy.load("en_core_web_md")

dog_embedding = nlp.vocab["dog"].vector

type(dog_embedding)


print(dog_embedding.shape)


dog_embedding[0:10]


In [None]:
import numpy as np

def compute_cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Compute the cosine similarity between two vectors"""

    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

In [3]:
from sklearn.metrics.pairwise import cosine_similarity

nlp = spacy.load("en_core_web_md")

dog_embedding = nlp.vocab["dog"].vector
cat_embedding = nlp.vocab["cat"].vector
apple_embedding = nlp.vocab["apple"].vector
tasty_embedding = nlp.vocab["tasty"].vector
delicious_embedding = nlp.vocab["delicious"].vector
truck_embedding = nlp.vocab["truck"].vector

dog_embedding



In [4]:
print("cosine_similarity(dog, cat)")
print(cosine_similarity([dog_embedding], [cat_embedding])[0][0],"\n")

print("cosine_similarity(delicious, tasty)")
print(cosine_similarity([delicious_embedding], [tasty_embedding])[0][0],"\n")

print("cosine_similarity(apple, delicious)")
print(cosine_similarity([apple_embedding], [delicious_embedding])[0][0],"\n")

print("cosine_similarity(dog, apple)")
print(cosine_similarity([dog_embedding], [apple_embedding])[0][0],"\n")

print("cosine_similarity(truck, delicious)")
print(cosine_similarity([truck_embedding], [delicious_embedding])[0][0],"\n")


<h4>Illustration: building intuition for word similarity with embeddings</h4>

<img src="./img/2d-embeddings-ex.png" width="800px"/>

<img src="./img/cosine-similarity.png" width="800px"/>


## Sentences

In [20]:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [
         "The canine barked loudly.",
         "The dog made a noisy bark.",
         "He ate a lot of pizza.",
         "He devoured a large quantity of pizza pie.",
]

text_embeddings = model.encode(texts)

print(type(text_embeddings))


text_embeddings.shape

<class 'numpy.ndarray'>


(4, 384)

In [22]:
from sklearn.metrics.pairwise import cosine_similarity

text_embeddings_dict = dict(zip(texts, list(text_embeddings)))

dog_text_1 = "The canine barked loudly."
dog_text_2 = "The dog made a noisy bark."
pizza_text_1 = "He ate a lot of pizza."
pizza_text_2 = "He devoured a large quantity of pizza pie."


In [25]:
sim1 = cosine_similarity(
    [text_embeddings_dict[dog_text_1]],
    [text_embeddings_dict[dog_text_2]]
)

print(f"""
{dog_text_1}
{dog_text_2}
Similarity: {sim1[0][0]}
""")



sim2 = cosine_similarity(
    [text_embeddings_dict[pizza_text_1]],
    [text_embeddings_dict[pizza_text_2]]
)

print(f"""
{pizza_text_1}
{pizza_text_2}
Similarity: {sim2[0][0]}
""")


sim3 = cosine_similarity(
    [text_embeddings_dict[dog_text_1]],
    [text_embeddings_dict[pizza_text_1]]
)

print(f"""
{dog_text_1}
{pizza_text_1}
Similarity: {sim3[0][0]}
""")



The canine barked loudly.
The dog made a noisy bark.
Similarity: 0.7768615484237671


He ate a lot of pizza.
He devoured a large quantity of pizza pie.
Similarity: 0.7871338725090027


The canine barked loudly.
He ate a lot of pizza.
Similarity: 0.09128269553184509
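
Comparing sentences one pair at a time gets tedious as the list grows. A common alternative is to compute all pairwise similarities at once. The sketch below does this with the standard library on hypothetical 3-dimensional vectors that stand in for the four sentence embeddings; the `cosine` and `pairwise_cosine` helpers and the toy values are assumptions for illustration, not output of the model above. With scikit-learn installed, passing a single 2-D array to `cosine_similarity` (for example `cosine_similarity(text_embeddings)`) returns the same kind of matrix.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine(vectors):
    """Return a square matrix where entry [i][j] compares vectors i and j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Hypothetical toy vectors, not real sentence embeddings.
toy_embeddings = [
    [0.9, 0.1, 0.0],  # stands in for the first "dog" sentence
    [0.8, 0.2, 0.1],  # stands in for the second "dog" sentence
    [0.1, 0.9, 0.2],  # stands in for the first "pizza" sentence
    [0.0, 0.8, 0.3],  # stands in for the second "pizza" sentence
]

matrix = pairwise_cosine(toy_embeddings)
for row in matrix:
    print([round(value, 2) for value in row])
```

Each row shows how one sentence relates to every other: entries on the diagonal are approximately 1, and the two same-topic pairs score higher than the cross-topic pairs.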

