# Similarity Analysis & Sentence Embeddings.
To run a similarity analysis across documents, we need an embedding for each one. An embedding is a vector representation of the document. I will use sentences to demonstrate the concepts.

In [1]:
from sentence_transformers import SentenceTransformer
import numpy as np

### Cosine Similarity.
Sentence embeddings are vectors that represent the meaning of a sentence. The cosine similarity between two vectors is the cosine of the angle between them, so it always lies between -1 and 1.
- If the vectors point in nearly the same direction (similar meaning), the cosine similarity will be close to 1.
- If the vectors are orthogonal (unrelated), the cosine similarity will be 0.
- If the vectors point in opposite directions, the cosine similarity will be close to -1.

The cosine similarity is calculated as follows:
$$\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}$$
where $\mathbf{A} \cdot \mathbf{B}$ is the dot product of the vectors and $\|\mathbf{A}\|\|\mathbf{B}\|$ is the product of the magnitudes of the vectors.


In [2]:
def cosine_similarity(a, b):
    """The cosine similarity between two sentence embeddings."""
    return (a@b) / (np.linalg.norm(a) * np.linalg.norm(b))
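
As a quick sanity check of the three cases listed above, the sketch below applies the same formula to hand-picked 2-D vectors. The vectors and variable names are illustrative assumptions, not real sentence embeddings; the function is repeated only so the cell runs on its own.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Toy vectors chosen by hand (assumed for illustration only).
a = np.array([1.0, 2.0])
same_direction = np.array([2.0, 4.0])   # a scaled by 2
orthogonal = np.array([-2.0, 1.0])      # dot product with a is 0
opposite = np.array([-1.0, -2.0])       # a scaled by -1

sim_same = cosine_similarity(a, same_direction)
sim_orth = cosine_similarity(a, orthogonal)
sim_opp = cosine_similarity(a, opposite)

# Expect values of roughly 1, 0 and -1, up to floating-point rounding.
print(sim_same, sim_orth, sim_opp)
```

Note that scaling a vector does not change the result: only the direction matters, not the magnitude.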

### Sentence Transformers
I am using the `sentence-transformers` library, which provides a simple interface for generating sentence embeddings. It can be installed with the following command:
```bash
pip install sentence-transformers
```
I will use the all-MiniLM-L6-v2 model to generate embeddings for each document. Info on the model can be found [here](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

In [3]:
# Load the model we want to use. Most Hugging Face models are supported. all-MiniLM-L6-v2 is a small model that works well for demonstration.
model = SentenceTransformer('all-MiniLM-L6-v2')

In [4]:
# The stand-in data pool of plain text documents.
documents = [
    'We have 200 costumers around the world with 1000 products.',
    'We have costumers world wide with products in the thousands.',
    'Our products are known by costumers all over the world.',
    'Our company makes ice cream.',
]
doc_embeddings = model.encode(documents)

In [5]:
doc_embeddings.shape

(4, 384)

Now we create a dictionary of embedded documents to stand in for a vector database. 


In [6]:
# Map a key such as 'doc0' to the embedding of the document at that index.
embeddings_dict = {f'doc{i}': emb for i, emb in enumerate(doc_embeddings)}
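
A vector database is typically queried for the entries closest to a given vector. The sketch below shows how that lookup could work against a plain dictionary like the one above. The helper `rank_by_similarity` and the 3-D toy vectors are hypothetical and used only so the example runs without downloading a model; in the notebook the dictionary would hold the 384-dimensional embeddings.

```python
import numpy as np


def rank_by_similarity(query, embeddings, top_k=3):
    """Return the top_k (key, score) pairs, highest cosine similarity first.

    Hypothetical helper: a brute-force scan over every stored vector.
    """
    scores = {
        key: float(query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec)))
        for key, vec in embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Toy stand-in for embeddings_dict (assumed values, not model output).
toy_embeddings = {
    'doc0': np.array([1.0, 0.0, 0.0]),
    'doc1': np.array([0.9, 0.1, 0.0]),
    'doc2': np.array([0.0, 1.0, 0.0]),
}
query = np.array([1.0, 0.5, 0.0])

ranked = rank_by_similarity(query, toy_embeddings, top_k=2)
for key, score in ranked:
    print(key, round(score, 3))
```

A brute-force scan like this is fine for a handful of documents; dedicated vector databases exist because it does not scale well to very large collections.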

### Similarity Analysis.
Now that we have embeddings for all documents organized, it's easy to calculate the similarity between any two of them. The first three documents are similar in meaning, so the cosine similarity between each pair should be high.

In [7]:
cosine_similarity(embeddings_dict['doc0'], embeddings_dict['doc1'])

np.float32(0.88820124)

In [8]:
cosine_similarity(embeddings_dict['doc1'], embeddings_dict['doc2'])

np.float32(0.86992204)

In [9]:
cosine_similarity(embeddings_dict['doc2'], embeddings_dict['doc0'])

np.float32(0.72208786)

The last document, about ice cream, is dissimilar to the first three. Its cosine similarity with `doc0` is only around 0.24.

In [10]:
cosine_similarity(embeddings_dict['doc3'], embeddings_dict['doc0'])

np.float32(0.24467042)
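
Comparing documents one pair at a time gets tedious as the collection grows. One possible alternative, sketched below, is to normalize every embedding to unit length and compute all pairwise similarities with a single matrix product. The function name `pairwise_cosine_similarity` is hypothetical, and random vectors stand in for `doc_embeddings` so the cell runs without the model.

```python
import numpy as np


def pairwise_cosine_similarity(embeddings):
    """Return an (n, n) matrix whose [i, j] entry is the cosine similarity
    between row i and row j of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms   # each row now has length 1
    return unit @ unit.T        # dot products of unit vectors are cosines


# Random stand-in with the same shape as doc_embeddings (assumed data).
rng = np.random.default_rng(0)
toy = rng.normal(size=(4, 384))

sim_matrix = pairwise_cosine_similarity(toy)
print(sim_matrix.shape)
```

The diagonal of the resulting matrix is 1 (every vector compared with itself), and the matrix is symmetric, so only the entries above the diagonal carry new information.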