I have been comparing the performance of several cosine similarity implementations and want to share the results.
 
As a benchmark I will use a 10_000 x 2_048 float32 matrix of index embeddings and a 2_000 x 2_048 float32 matrix of query embeddings.

The result is a 2_000 x 10_000 matrix whose entry in row *i* and column *j* is the cosine similarity between query *i* and index item *j*.
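To make the layout of that result matrix concrete, here is a minimal sketch on toy data. The function name `cosine_similarity_matrix` and the tiny 2-dimensional vectors are illustrative assumptions, not part of the benchmark; the explicit double loop is only meant to mirror the row/column definition.

```python
import numpy as np

def cosine_similarity_matrix(queries, index):
    """Entry [i, j] is the cosine similarity between queries[i] and index[j]."""
    result = np.zeros((queries.shape[0], index.shape[0]))
    for i, q in enumerate(queries):
        for j, x in enumerate(index):
            result[i, j] = np.dot(q, x) / (np.linalg.norm(q) * np.linalg.norm(x))
    return result

# Hypothetical toy data: 2 queries and 3 index vectors of dimension 2.
queries = np.array([[1.0, 0.0], [1.0, 1.0]])
index = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

toy_similarity = cosine_similarity_matrix(queries, index)
print(toy_similarity.shape)  # (2, 3): one row per query, one column per index vector
```

The first query is parallel to the first index vector (similarity 1), orthogonal to the second (0), and opposite to the third (-1).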

The query and index embeddings are real global features extracted with DELG and cached on disk.

The data does not all fit in memory at once.

I will try 3 implementations:  

* Cosine Similarity with Scipy (Sequential)
* Cosine Similarity with Scipy (Batch on CPU)
* Cosine Similarity with TensorFlow (Batch on GPU)

Let's go!

In [None]:
import numpy as np 
import joblib

In [None]:
def batch_generator(embeddings, batch_size=1000):
    """Yield consecutive row batches of `embeddings`; the last batch may be shorter."""
    for start in range(0, embeddings.shape[0], batch_size):
        yield embeddings[start:start + batch_size]
        
        
train_embeddings = joblib.load("../input/glr2020-data-for-cosine-similarity/train_embeddings.joblib")
test_embeddings = joblib.load("../input/glr2020-data-for-cosine-similarity/test_embeddings.joblib")

print(f'train_embeddings:{train_embeddings.shape}, test_embeddings:{test_embeddings.shape}')


## Scipy (Sequential CPU)

This is the slow "for loop" way. For each test query we compute the cosine similarity to every element of the index, one query per iteration, using the scipy function `spatial.distance.cdist`. With the `'cosine'` metric it returns the cosine distance, so we subtract the result from 1 to get the similarity.

The advantage of this method is its low memory footprint: each iteration only needs one query vector plus the index as input, i.e. 2048 * (1 + len(train_embeddings)) float32 values.
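As a back-of-the-envelope check of that footprint, here is a small sketch. It assumes the benchmark shapes stated at the top (10_000 index vectors of dimension 2_048, float32) and counts only the input vectors touched per iteration, not the output row or any temporaries scipy may allocate internally.

```python
# Rough input footprint of one iteration of the sequential loop.
# Assumption: shapes from the benchmark description, 4 bytes per float32 value.
embedding_dim = 2_048
n_index = 10_000
bytes_per_float32 = 4

values_per_iteration = embedding_dim * (1 + n_index)  # one query + the whole index
bytes_per_iteration = values_per_iteration * bytes_per_float32

print(f"{bytes_per_iteration:,} bytes = {bytes_per_iteration / 2**20:.1f} MiB")
# 81,928,192 bytes = 78.1 MiB
```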

In [None]:
%%time
from scipy import spatial

scipy_similarity = np.zeros ((test_embeddings.shape[0], train_embeddings.shape[0]))

for test_index in range(test_embeddings.shape[0]):
    scipy_similarity[test_index] = 1 - spatial.distance.cdist(
        test_embeddings[np.newaxis, test_index, :], train_embeddings,
        'cosine')[0]
    


## Scipy (Batch CPU)

Here we let `spatial.distance.cdist` work on a whole batch of queries at once, relying on its vectorized implementation instead of a Python loop over single queries.

We have to split the calculation into several batches because the data does not all fit in memory.
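When the number of rows is not a multiple of the batch size, the last batch is shorter than the others, so the position of each batch in the output matrix is best derived from its start offset and its actual length. A minimal sketch of that bookkeeping follows; the helper name `batch_slices` is hypothetical and not part of the notebook.

```python
def batch_slices(n_rows, batch_size):
    """Yield (start, stop) row ranges that cover n_rows in order."""
    for start in range(0, n_rows, batch_size):
        # Clamp the end so a shorter final batch does not run past n_rows.
        yield start, min(start + batch_size, n_rows)

slices = list(batch_slices(10, 4))
print(slices)  # [(0, 4), (4, 8), (8, 10)]
```

With the benchmark sizes (2_000 queries in batches of 500, 10_000 index vectors in batches of 1_000) every batch is full, so the ragged case only matters for other shapes.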

In [None]:
%%time

from scipy import spatial


batch_scipy_similarity = np.zeros ((test_embeddings.shape[0], train_embeddings.shape[0]))


test_batch_size=500
train_batch_size=1000
for i, test_emb in enumerate(batch_generator (test_embeddings, batch_size=test_batch_size)):
    for j, train_emb in enumerate(batch_generator (train_embeddings, batch_size=train_batch_size)):
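        # Offsets use the nominal batch size; slice lengths use the actual batch shape,
        # so a shorter final batch is still written to the right place.
        row, col = i * test_batch_size, j * train_batch_size
        batch_scipy_similarity[row:row + test_emb.shape[0],
                               col:col + train_emb.shape[0]] = 1 - spatial.distance.cdist(
                                        test_emb, train_emb,
                                        'cosine')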


In [None]:
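# Check the difference: sum of absolute element-wise differences.
# Taking abs() of the difference keeps positive and negative errors from cancelling out.
np.sum(np.abs(scipy_similarity - batch_scipy_similarity))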

## TensorFlow (Batch GPU)

Cosine similarity is the cosine of the angle between two vectors, which equals the inner product of the two vectors after each has been normalized to length 1. We can compute it with TensorFlow using `tf.reduce_sum` and `tf.norm`.

This way the computation runs on the GPU managed by TensorFlow instead of on the CPU.
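The equivalence above can be checked on small data. The sketch below uses NumPy as a stand-in for TensorFlow so it runs without a GPU, and the toy shapes are assumptions. It compares a broadcast-and-sum formulation with a normalize-then-matrix-multiply formulation, and estimates the size of the intermediate tensor that the broadcast form would build for a 500 x 1000 batch of 2_048-dimensional float32 vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 8)).astype(np.float32)  # stands in for a query batch
b = rng.normal(size=(6, 8)).astype(np.float32)  # stands in for an index batch

# Broadcast form: builds a (5, 6, 8) intermediate, then sums over the last axis.
broadcast = (a[:, np.newaxis] * b).sum(axis=-1)
broadcast /= np.linalg.norm(a[:, np.newaxis], axis=-1) * np.linalg.norm(b, axis=-1)

# Matmul form: normalize each row to unit length, then take inner products.
a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
matmul = a_unit @ b_unit.T

max_abs_diff = float(np.abs(broadcast - matmul).max())

# Size of the broadcast intermediate for a 500 x 1000 batch of 2048-dim float32 vectors.
intermediate_bytes = 500 * 1000 * 2048 * 4
print(f"intermediate: {intermediate_bytes / 2**30:.2f} GiB")
```

Both forms give the same values up to floating point rounding, but the matmul form never materializes the three-dimensional intermediate. In TensorFlow the same idea could presumably be expressed with `tf.math.l2_normalize` and `tf.matmul`; that variant is not benchmarked here.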

In [None]:
%%time

import tensorflow as tf
import numpy as np


tf_similarity = np.zeros ((test_embeddings.shape[0], train_embeddings.shape[0]))


test_batch_size=500
train_batch_size=1000
for i, test_emb in enumerate(batch_generator (test_embeddings, batch_size=test_batch_size)):
    for j, train_emb in enumerate(batch_generator (train_embeddings, batch_size=train_batch_size)):
        

        b = tf.constant (train_emb)
        a = tf.constant (test_emb)

        similarity = tf.reduce_sum(a[:, tf.newaxis] * b, axis=-1)
        similarity /= tf.norm(a[:, tf.newaxis], axis=-1) * tf.norm(b, axis=-1)
        
        # Use the actual batch shape for the slice length so a shorter final batch still fits.
        row, col = i * test_batch_size, j * train_batch_size
        tf_similarity[row:row + test_emb.shape[0],
                      col:col + train_emb.shape[0]] = similarity.numpy()




In [None]:
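# Check the difference: sum of absolute element-wise differences.
# Taking abs() of the difference keeps positive and negative errors from cancelling out.
np.sum(np.abs(scipy_similarity - tf_similarity))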