# Cosine Similarity calculation between sentences with Transformers

---


## Each token (i.e. each piece produced when the original words go through tokenization) produces a single vector of size 768.

Here, 768 is the number of columns.

Those 768 values are the numerical representation of a single token, which we can use as a contextual word embedding.

Because there is one such vector for each token (output by each encoder layer), the full output for one sentence is a tensor of size (number of tokens) by 768.
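
As a quick illustration of that shape, the sketch below builds a stand-in token-embedding matrix with NumPy. The values are random placeholders (an assumption for illustration, not real model output), and `num_tokens = 6` is an arbitrary choice.

```python
import numpy as np

hidden_size = 768   # width of each token vector
num_tokens = 6      # hypothetical sentence length after tokenization

rng = np.random.default_rng(0)
# One row per token, one column per hidden dimension (random stand-in values).
token_embeddings = rng.normal(size=(num_tokens, hidden_size))

print(token_embeddings.shape)      # (6, 768)
print(token_embeddings[0].shape)   # (768,)
```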

------

## BERT has a limit of 512 tokens.

Normally, for longer sequences, you just truncate to 512 tokens.

The limit is derived from the positional embeddings in the Transformer architecture, for which a maximum length needs to be imposed.

The size of this limit is tied to the memory needed to process a text: attention layers scale quadratically with the sequence length, which becomes a problem for long texts.
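
To make the quadratic growth concrete, here is a minimal sketch that counts the pairwise attention scores for a single head in a single layer. The helper `attention_score_count` is a hypothetical name introduced only for this illustration; real memory use also depends on the number of heads, layers, batch size, and data type.

```python
def attention_score_count(seq_len: int) -> int:
    """Number of pairwise attention scores for one head in one layer."""
    # Every token attends to every token, giving a seq_len x seq_len matrix.
    return seq_len * seq_len

for seq_len in (128, 256, 512, 1024):
    print(seq_len, attention_score_count(seq_len))
```

Doubling the sequence length from 512 to 1024 quadruples the number of scores.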

In [1]:
sentences = [
    "It had been sixteen days since the zombies first attacked.",
    
    "When confronted with a rotary dial phone the teenager was perplexed.",
    "His confidence would have bee admirable if it wasn't for his stupidity.",
    "I'm confused: when people ask me what's up, and I point, they groan.",
    "They called out her name time and again, but were met with nothing but silence.",
    "After the last zombie attack sixteen days back, they are taking control of the city",
]

In [2]:
from transformers import AutoTokenizer, AutoModel
import torch

In [3]:
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')

In [4]:
example_sentence = 'its a sunny morning'
# calling the tokenizer directly is the recommended equivalent of encode_plus
example_tokens = tokenizer(example_sentence, max_length=512, truncation=True, padding='max_length', return_tensors='pt')
example_tokens.keys()

dict_keys(['input_ids', 'attention_mask'])

In [5]:
tokens = {'input_ids': [], 'attention_mask': []}

for sentence in sentences:
    # tokenize one sentence, padded or truncated to exactly 512 tokens
    new_tokens = tokenizer(sentence, max_length=512, truncation=True,
                           padding='max_length', return_tensors='pt')
    # return_tensors='pt' adds a batch dimension of 1, so take row 0
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

# restructure each list of tensors into a single tensor of shape [6, 512]
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])


In [7]:
tokens['input_ids'].shape

torch.Size([6, 512])

In [8]:
outputs = model(**tokens)
outputs

BaseModelOutputWithPooling(last_hidden_state=tensor([[[-1.6771e-02,  2.7875e-02,  7.9733e-02,  ...,  8.5568e-02,
          -1.9001e-01,  3.5048e-02],
         [ 1.0377e-01,  1.3945e-01,  4.4118e-02,  ...,  1.5550e-01,
          -2.6237e-01,  2.6015e-02],
         [ 7.1797e-02,  9.0426e-02,  3.3001e-02,  ...,  9.4124e-02,
          -2.1211e-01,  2.2209e-02],
         ...,
         [-4.7748e-03,  9.3277e-02,  6.8232e-02,  ...,  2.2988e-01,
          -1.6626e-01,  1.0913e-01],
         [-4.7748e-03,  9.3277e-02,  6.8232e-02,  ...,  2.2988e-01,
          -1.6626e-01,  1.0913e-01],
         [-4.7748e-03,  9.3277e-02,  6.8232e-02,  ...,  2.2988e-01,
          -1.6626e-01,  1.0913e-01]],

        [[ 1.2549e-01, -1.0631e-01,  3.6856e-02,  ...,  6.0813e-02,
          -9.1067e-02,  3.6397e-02],
         [ 1.5688e-01, -2.4870e-01,  4.6818e-03,  ...,  8.2096e-02,
          -1.8744e-01,  7.9757e-02],
         [ 1.9245e-01, -1.4947e-01,  6.3418e-02,  ...,  1.9506e-01,
          -8.3352e-02,  5.4228e

In [9]:
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

In [10]:
embeddings = outputs.last_hidden_state
embeddings

tensor([[[-1.6771e-02,  2.7875e-02,  7.9733e-02,  ...,  8.5568e-02,
          -1.9001e-01,  3.5048e-02],
         [ 1.0377e-01,  1.3945e-01,  4.4118e-02,  ...,  1.5550e-01,
          -2.6237e-01,  2.6015e-02],
         [ 7.1797e-02,  9.0426e-02,  3.3001e-02,  ...,  9.4124e-02,
          -2.1211e-01,  2.2209e-02],
         ...,
         [-4.7748e-03,  9.3277e-02,  6.8232e-02,  ...,  2.2988e-01,
          -1.6626e-01,  1.0913e-01],
         [-4.7748e-03,  9.3277e-02,  6.8232e-02,  ...,  2.2988e-01,
          -1.6626e-01,  1.0913e-01],
         [-4.7748e-03,  9.3277e-02,  6.8232e-02,  ...,  2.2988e-01,
          -1.6626e-01,  1.0913e-01]],

        [[ 1.2549e-01, -1.0631e-01,  3.6856e-02,  ...,  6.0813e-02,
          -9.1067e-02,  3.6397e-02],
         [ 1.5688e-01, -2.4870e-01,  4.6818e-03,  ...,  8.2096e-02,
          -1.8744e-01,  7.9757e-02],
         [ 1.9245e-01, -1.4947e-01,  6.3418e-02,  ...,  1.9506e-01,
          -8.3352e-02,  5.4228e-02],
         ...,
         [ 1.3306e-01,  4

In [11]:
embeddings.shape

torch.Size([6, 512, 768])

-------------

## What is the basic concept behind pooling?

Pooling turns a variable-length sequence of token embeddings into a fixed-size sentence embedding. A pooling layer can also use the CLS token instead, if the underlying word embedding model returns one. Multiple pooling results can be concatenated together.
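
The sketch below illustrates these options on random stand-in data with NumPy. It assumes the first row plays the role of the CLS token and that there is no padding; the helper `pool` is a hypothetical name, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 768

def pool(token_embeddings):
    """Return CLS, mean and max pooled vectors for one sentence."""
    cls_vec = token_embeddings[0]              # first token, assumed to be CLS
    mean_vec = token_embeddings.mean(axis=0)   # average over tokens
    max_vec = token_embeddings.max(axis=0)     # element-wise max over tokens
    return cls_vec, mean_vec, max_vec

short_sentence = rng.normal(size=(5, hidden_size))    # 5 tokens
long_sentence = rng.normal(size=(40, hidden_size))    # 40 tokens

for sent in (short_sentence, long_sentence):
    cls_vec, mean_vec, max_vec = pool(sent)
    # Concatenating two poolings doubles the embedding width.
    combined = np.concatenate([mean_vec, max_vec])
    print(sent.shape, mean_vec.shape, combined.shape)
```

Whatever the number of tokens, each pooled vector has 768 values, and the concatenation has 1536.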

### We can also take the last hidden state, of shape [batch_size, seq_len, hidden_size], and average it across the seq_len dimension to get mean embeddings.

## Mean Pooling

BERT produces contextualized word embeddings for all input tokens in our text. As we want a fixed-size output representation (vector u), we need a pooling layer. Different pooling options are available; the most basic one is mean pooling, where we simply average all the contextualized word embeddings BERT gives us. This yields a fixed 768-dimensional output vector, regardless of how long the input text was.

---------

## From Sentence Transformer to Cosine Similarity Calculation

### To transform our **`last_hidden_state`** tensor into the desired vector, we use mean pooling.

### Each of these 512 tokens has its own 768 values. Mean pooling averages all token embeddings into a single 768-dimensional vector, producing a ‘sentence vector’.

That is why, after we have produced our dense vectors `embeddings`, we perform a *mean pooling* operation on them to create a single vector encoding (the **sentence embedding**).

--------------

### To do this mean pooling operation, we need to multiply each value in our `embeddings` tensor by its respective `attention_mask` value, so that we can mask out (ignore) non-real tokens. Only tokens that are not padding tokens should be taken into account when averaging.

To perform this operation, we first resize our `attention_mask` tensor:

In [12]:
attention_mask = tokens['attention_mask']
attention_mask.shape

torch.Size([6, 512])

In [13]:
resized_attention_mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
resized_attention_mask.shape

torch.Size([6, 512, 768])
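
The resize step can be pictured on a tiny example. This is a NumPy sketch with made-up sizes (2 sentences, 4 positions, hidden size 3), where `[:, :, None]` plays the role of `unsqueeze(-1)` and `np.broadcast_to` plays the role of `expand`.

```python
import numpy as np

batch, seq_len, hidden = 2, 4, 3   # small hypothetical sizes

attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 0, 0]], dtype=np.float32)
embeddings = np.ones((batch, seq_len, hidden), dtype=np.float32)

# Add a trailing axis: (2, 4) -> (2, 4, 1), like unsqueeze(-1).
mask_3d = attention_mask[:, :, None]
# Repeat that axis to match the embeddings: (2, 4, 3), like expand(...).
expanded = np.broadcast_to(mask_3d, embeddings.shape)

print(expanded.shape)
# Broadcasting gives the same product without the explicit expand.
print(np.array_equal(embeddings * expanded, embeddings * mask_3d))
```

Each mask value is simply copied across the hidden dimension, so a padded position zeroes out all of its embedding values at once.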

In [14]:
resized_attention_mask

tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 

In [15]:
resized_attention_mask[0][0].shape

torch.Size([768])

In [16]:
masked_embedding = embeddings * resized_attention_mask
masked_embedding.shape

torch.Size([6, 512, 768])

In [17]:
masked_embedding

tensor([[[-1.6771e-02,  2.7875e-02,  7.9733e-02,  ...,  8.5568e-02,
          -1.9001e-01,  3.5048e-02],
         [ 1.0377e-01,  1.3945e-01,  4.4118e-02,  ...,  1.5550e-01,
          -2.6237e-01,  2.6015e-02],
         [ 7.1797e-02,  9.0426e-02,  3.3001e-02,  ...,  9.4124e-02,
          -2.1211e-01,  2.2209e-02],
         ...,
         [-0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          -0.0000e+00,  0.0000e+00],
         [-0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          -0.0000e+00,  0.0000e+00],
         [-0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          -0.0000e+00,  0.0000e+00]],

        [[ 1.2549e-01, -1.0631e-01,  3.6856e-02,  ...,  6.0813e-02,
          -9.1067e-02,  3.6397e-02],
         [ 1.5688e-01, -2.4870e-01,  4.6818e-03,  ...,  8.2096e-02,
          -1.8744e-01,  7.9757e-02],
         [ 1.9245e-01, -1.4947e-01,  6.3418e-02,  ...,  1.9506e-01,
          -8.3352e-02,  5.4228e-02],
         ...,
         [ 0.0000e+00,  0

In [18]:
summed_masked_embeddings = torch.sum(masked_embedding, 1)
summed_masked_embeddings.shape

torch.Size([6, 768])

In [19]:
summed_masked_embeddings

tensor([[ 0.5913,  1.1332,  0.5017,  ...,  1.8387, -3.1667,  0.8035],
        [ 2.5646, -1.7509,  0.7735,  ...,  1.0171, -1.7503,  1.2193],
        [-0.5600,  3.2766,  0.5741,  ..., -2.1975,  0.9449,  0.7661],
        [ 1.5776, -3.6690, -0.8673,  ...,  2.7474,  2.0530, -3.2321],
        [ 4.1955,  0.8502, -0.0662,  ...,  2.1655, -2.1070,  1.7098],
        [-0.0775,  2.2288,  1.7956,  ...,  1.4763, -0.8778, -0.6049]],
       grad_fn=<SumBackward1>)

In [21]:
# count the real (non-padding) tokens per sentence; clamp avoids division by zero
count_of_one_in_mask_tensor = torch.clamp(resized_attention_mask.sum(1), min=1e-9)

count_of_one_in_mask_tensor.shape

torch.Size([6, 768])

In [22]:
count_of_one_in_mask_tensor

tensor([[13., 13., 13.,  ..., 13., 13., 13.],
        [16., 16., 16.,  ..., 16., 16., 16.],
        [19., 19., 19.,  ..., 19., 19., 19.],
        [23., 23., 23.,  ..., 23., 23., 23.],
        [19., 19., 19.,  ..., 19., 19., 19.],
        [18., 18., 18.,  ..., 18., 18., 18.]])
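
The `min=1e-9` clamp only matters when a row of the mask sums to zero. The NumPy sketch below uses a made-up batch whose second sequence is entirely padding (an assumption for illustration; it does not occur in the sentences above), with `np.clip` standing in for `torch.clamp`.

```python
import numpy as np

# Hypothetical batch of 2 sequences, 3 positions, hidden size 2.
mask = np.array([[1, 1, 0],
                 [0, 0, 0]], dtype=np.float32)   # second row is all padding
summed = np.array([[2.0, 4.0],
                   [0.0, 0.0]], dtype=np.float32)

counts = mask.sum(axis=1, keepdims=True)       # [[2.], [0.]]
safe_counts = np.clip(counts, 1e-9, None)      # same role as torch.clamp(min=1e-9)

# Without the clip, the second row would be 0 / 0 and produce NaN.
mean_pooled = summed / safe_counts
print(mean_pooled)
```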

In [23]:
summed_masked_embeddings.shape

torch.Size([6, 768])

In [24]:
count_of_one_in_mask_tensor.shape

torch.Size([6, 768])

In [25]:
mean_pooled = summed_masked_embeddings / count_of_one_in_mask_tensor

In [27]:
mean_pooled.shape

torch.Size([6, 768])
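
The mask, sum, count, and divide steps above can be condensed into one function. The following is a minimal NumPy sketch rather than the notebook's own code: `mean_pool` is a hypothetical helper and the inputs are random stand-in values. It also checks that padded positions do not affect the result.

```python
import numpy as np

def mean_pool(embeddings, attention_mask):
    """Average token vectors over real (non-padding) positions only."""
    mask = attention_mask[:, :, None].astype(embeddings.dtype)  # (batch, seq, 1)
    summed = (embeddings * mask).sum(axis=1)                    # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)              # (batch, 1)
    return summed / counts

rng = np.random.default_rng(0)
real_tokens = rng.normal(size=(1, 3, 4))   # 3 real tokens, hidden size 4
padding = rng.normal(size=(1, 2, 4))       # arbitrary values at padded positions
padded = np.concatenate([real_tokens, padding], axis=1)
mask = np.array([[1, 1, 1, 0, 0]])

pooled = mean_pool(padded, mask)
print(pooled.shape)                                    # (1, 4)
print(np.allclose(pooled, real_tokens.mean(axis=1)))   # True
```

A plain `padded.mean(axis=1)` would instead divide by all 5 positions and mix the padding values into the sentence vector.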

## Calculate the cosine similarity between sentence `0` and the other sentences:

In [28]:
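from sklearn.metrics.pairwise import cosine_similarity

# detach from the autograd graph and convert to a NumPy array for scikit-learn
mean_pooled = mean_pooled.detach().numpy()

# compare sentence 0 against each of the remaining five sentences
cosine_similarity([mean_pooled[0]], mean_pooled[1:])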

array([[ 0.12543598,  0.05873729, -0.00801761,  0.20136353,  0.6832719 ]],
      dtype=float32)
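
In the scores above, the last sentence is by far the closest to sentence `0` (about 0.68), which fits: both talk about a zombie attack and sixteen days. For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below writes that formula out in NumPy on toy 2-D vectors; `cosine_sim` is a hypothetical helper, not the scikit-learn function.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
print(cosine_sim(a, np.array([2.0, 0.0])))    # same direction
print(cosine_sim(a, np.array([0.0, 3.0])))    # orthogonal
print(cosine_sim(a, np.array([-1.0, 0.0])))   # opposite direction
```

The score depends only on direction, not on vector length, so it ranges from -1 (opposite) through 0 (unrelated) to 1 (same direction).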