## Comparing document embedding models

* We have 2 different document embedding models: $A$ and $B$. 
* We have a corpus of $N \gg 1$ documents. 

We would like to test whether the 2 embedding models are functionally related in a statistical sense, that is, whether they induce similar nearest-neighbor rankings over the corpus.

### Solution: 

1. For each document $D_n$ compute its corresponding vector embeddings $E_n := A(D_n)$ and $F_n := B(D_n)$.
2. For embedding model $A$ determine an appropriate value of $K$ as follows:
   1. Randomly select  $M \gg 1$ documents $\{D_m\}$ from the corpus. Let $\{E_m\}$ denote their corresponding vector embeddings.
   2. For each vector $E_m$ find its $P \ge 100$ closest neighbors $\{E_{m,p}\}$, **excluding** $E_m$ itself. Compute the cosine similarity between $E_m$ and each of its $P$ closest neighbors: $(\langle E_m,E_{m,p}\rangle)$. The inner product equals the cosine similarity when the embeddings are unit-normalized.
   3. Sort the cosine similarities as an ascending sequence $(cos_p)$.
   4. Find the elbow in the sequence of cosines $(cos_p)$ using an F-test to split the sequence into 2 subsequences. The size of the second subsequence is the corresponding value of $K_m$.
   5. We can either take $K= \max(\{K_m\})$ or keep the set of $K$ values $\{K_m\}$.
3. Now that we have a value of $K$ for each document $D_m$, we define the corresponding persistence parameter $p=0.01^{1/K}$, so that $p^K = 0.01$.
4. For each document $D_m$ we have a ranked list of $K$ documents $list\_a = (i_1,\ldots,i_K)$, where $i_k$ denotes the index of the document that is $k^{th}$ closest to $D_m$ according to model $A$. We generate a second list of $K$ ranked documents $list\_b = (j_1,\ldots, j_K)$ by proximity according to model $B$ (using the corresponding vectors $(F_n)$). As the probability distribution over model $A$'s ranked list, we use the truncated geometric sequence
$$prob = \frac{1-p}{1-p^{K}}\times (1,p,p^2, \ldots, p^{K-1})$$
5. Now we can use the `compute_recommender_test_statistic` function from `rbo_analytics` to perform hypothesis testing.
```python
import rbo_analytics

# lists_a is a list of ranked lists of neighbors for (D_m) according to recommender A.
# lists_b is a list of ranked lists of neighbors for (D_m) according to recommender B.
# probs is a list of list of probabilities, probs[m]  are the probabilities corresponding
# to a list of ranked neighbors lists_a[m] for document m according to A. 
Z = rbo_analytics.compute_recommender_test_statistic(lists_a, lists_b, probs, verbose=True)

print(f"Sigmage that the 2 document embedders are functionally related: {Z}")
print(f"The 2 document embedders are functionally related: {Z>=-2.33}")
```
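
Step 2.4 does not spell out which F-test is used to locate the elbow. One possible reading, sketched below, is a change-point split: for every candidate split of the sorted cosines, compare a two-mean model against a single-mean model with an F statistic and keep the split with the largest value. The names `find_elbow` and `min_size` are hypothetical and not part of `rbo_analytics`; the toy data is synthetic.

```python
import numpy as np

def find_elbow(cosines, min_size=3):
    """Return (split, K): the split index with the largest two-segment
    F statistic, and the size of the second (high-similarity) segment."""
    c = np.sort(np.asarray(cosines, dtype=float))   # ascending, as in step 2.3
    P = c.size
    rss_single = ((c - c.mean()) ** 2).sum()        # one-mean model
    best_split, best_f = None, -np.inf
    for s in range(min_size, P - min_size + 1):
        left, right = c[:s], c[s:]
        rss_two = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        # One extra parameter in the numerator, P - 2 residual degrees of freedom.
        f = (rss_single - rss_two) / (rss_two / (P - 2) + 1e-12)
        if f > best_f:
            best_split, best_f = s, f
    return best_split, P - best_split

# Synthetic example: 70 weakly similar neighbors and 30 strongly similar ones.
cos = np.concatenate([np.linspace(0.30, 0.35, 70), np.linspace(0.80, 0.85, 30)])
split, K_m = find_elbow(cos)
print(split, K_m)
```

Under this reading, $K_m$ counts the neighbors that sit above the jump in similarity. Other elbow criteria, such as a two-line piecewise regression, would fit the description in step 2.4 equally well.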


In [None]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np
from tqdm.autonotebook import tqdm
import pandas as pd
import rbo_analytics

In [None]:
from sentence_transformers import SentenceTransformer
import torch

# Define the device
# It will automatically use 'cuda' if available, otherwise 'cpu'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

# Load the model and explicitly specify the device
model_name = 'intfloat/multilingual-e5-base'
multilingual_e5_model = SentenceTransformer(model_name, device=device)

# # Example Usage
# sentences = ["This is a test sentence.", "Dies ist ein Testsatz."]
# embeddings = multilingual_e5_model.encode(sentences, show_progress_bar=True)

# # To verify, you can inspect the model's internal device:
# print(multilingual_e5_model.device)
# # This should print "cuda:0" or similar when a GPU is available


In [None]:
all_mpnet_base_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2', device=device)
paraphrase_multilingual_model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2', device=device)
# multilingual_e5_model was already loaded in the previous cell.

# Loading in the Usenet Newsgroup data set and embedding it
I'm going to embed the corpus with 3 different document embedding models so that I can measure the
pairwise similarity between them.


In [None]:
dataset = fetch_20newsgroups(subset='all')

data = dataset['data']

In [None]:
def encode_model(documents, model=all_mpnet_base_model, batch_size=4):
    """Embed the documents in small batches and return an (N, dim) array."""
    N = len(documents)
    results = []
    for t in tqdm(range(0, N, batch_size), desc='embedding documents'):
        input_texts = documents[t:t+batch_size]
        # model.encode tokenizes and embeds the batch, returning a NumPy array
        batch_embeddings = model.encode(input_texts)
        results.append(batch_embeddings)

    embeddings_array = np.vstack(results)
    return embeddings_array

In [None]:
multilingual_e5_embeddings = encode_model(data,model=multilingual_e5_model)
np.savetxt('embeddings_e5.numpy',multilingual_e5_embeddings)

In [None]:
embeddings_array = encode_model(data)
np.savetxt('all-mpnet.numpy',embeddings_array)

In [None]:
embeddings_array = encode_model(data,model=paraphrase_multilingual_model)
np.savetxt('paraphrase-multilingual.numpy',embeddings_array)

In [None]:
paraphrase_multilingual = np.loadtxt('paraphrase-multilingual.numpy')
all_mpnet = np.loadtxt('all-mpnet.numpy')
multilingual_e5 = np.loadtxt('embeddings_e5.numpy')

# 3 Way Comparison of all-mpnet-base-v2, paraphrase-multilingual-MiniLM-L12-v2, and multilingual-e5-base
Here we construct ranked neighbor lists for randomly selected documents in the corpus of newsgroup postings
and then compute the $Z$ statistic to test whether each pair of document embedders is functionally related.
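
The distance cells that follow all rely on the identity $\|x-y\|^2 = \|x\|^2 + \|y\|^2 - 2\langle x, y\rangle$. As a minimal sketch, the same computation can be wrapped in one helper so that each model's squared norms are always paired with its own Gram matrix. The helper name `pairwise_euclidean` is hypothetical, and the check runs on small random data rather than the real embeddings.

```python
import numpy as np

def pairwise_euclidean(E):
    """Euclidean distance matrix between the rows of E via the Gram matrix."""
    sq = (E ** 2).sum(axis=1).reshape(-1, 1)   # squared norm of each row
    d2 = sq + sq.T - 2.0 * E.dot(E.T)          # squared distances
    np.maximum(d2, 0.0, out=d2)                # clip tiny negatives from round-off
    return np.sqrt(d2)

# Sanity check against a brute-force computation on toy data.
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))
D = pairwise_euclidean(E)
brute = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=2)
print(np.allclose(D, brute, atol=1e-6))
```

If the embeddings are unit-normalized, $\|x-y\|^2 = 2 - 2\cos(x,y)$, so ranking neighbors by Euclidean distance gives the same order as ranking by cosine similarity. Note that a dense float64 distance matrix over the roughly 18,800 postings in this corpus takes close to 3 GB per model.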

In [None]:
# Pairwise Euclidean distances via ||x - y||^2 = ||x||^2 + ||y||^2 - 2<x, y>.
X = (paraphrase_multilingual**2).sum(axis=1).reshape(-1,1)
distances_paraphrase = X + X.T - 2*paraphrase_multilingual.dot(paraphrase_multilingual.T)
distances_paraphrase[distances_paraphrase<0.0] = 0.0  # clip round-off negatives
distances_paraphrase = distances_paraphrase**0.5

In [None]:
Y = (all_mpnet**2).sum(axis=1).reshape(-1,1)
distances_all_mpnet= Y + Y.T - 2*all_mpnet.dot(all_mpnet.T)
distances_all_mpnet[distances_all_mpnet<0.0] = 0.0
distances_all_mpnet =distances_all_mpnet**0.5

In [None]:
# Squared norms of the multilingual-e5 embeddings.
Z = (multilingual_e5**2).sum(axis=1).reshape(-1,1)
distances_multilingual_e5 = Z + Z.T - 2*multilingual_e5.dot(multilingual_e5.T)
distances_multilingual_e5[distances_multilingual_e5<0.0] = 0.0  # clip round-off negatives
distances_multilingual_e5 = distances_multilingual_e5**0.5

In [None]:
def generate_geometric_probs(K):
    """Generates a probability distribution, Pr(x = n) ~ p**n """

    p = (0.01)**(1/K)
    probs = p**np.arange(K)
    probs = probs/probs.sum()
    return probs

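I'm going to randomly select 100 postings and, for each posting, find the $K$ nearest neighbors according to each document embedder.
I'm going to use $K=30$ for each neighborhood. Because the slice `[1:K]` used below skips the posting itself, each ranked list holds $K-1=29$ neighbors, which is why the probabilities are generated for $K-1$ ranks.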
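
As a quick check of the weighting scheme, the sketch below verifies on toy data that the normalized geometric weights match the closed form $\frac{1-p}{1-p^K}(1, p, \ldots, p^{K-1})$, and shows how many entries the `argsort()[1:K]` slice returns. The random points and the query index `n` are made up for illustration only.

```python
import numpy as np

K = 30
p = 0.01 ** (1 / K)                              # persistence parameter, so p**K == 0.01
weights = p ** np.arange(K)                      # (1, p, p^2, ..., p^(K-1))
normalized = weights / weights.sum()             # same normalization as generate_geometric_probs
closed_form = (1 - p) / (1 - p ** K) * weights   # closed form of the geometric sum
print(np.allclose(normalized, closed_form))
print(normalized[-1] / normalized[0])            # equals p**(K-1), slightly above 0.01

# Toy neighbor lookup: position 0 of argsort is the query itself (distance 0).
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 4))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
n = 7
neighbors = dist[n].argsort()[1:K].tolist()      # drops the query, keeps K-1 neighbors
print(len(neighbors))
```

The last rank therefore carries about 1% of the weight of the first rank, which is the design intent behind choosing $p = 0.01^{1/K}$.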

In [None]:
N = X.shape[0]
np.random.seed(42) # For reproducible results.
ks = np.random.choice(N,size=100,replace=False) # random sampling of documents
Ks = np.ones(100,dtype=np.int32)*30 # how many neighbors per document to consider.

# Comparing paraphrase-multilingual-MiniLM-L12-v2 against all-mpnet-base-v2

In [None]:
lists_a = [((distances_paraphrase[n]).argsort()[1:Ks[t]]).tolist() for t,n in enumerate(ks)]
lists_b = [((distances_all_mpnet[n]).argsort()[1:Ks[t]]).tolist() for t,n in enumerate(ks)]

In [None]:
probs_a  = [generate_geometric_probs(K-1) for K in Ks]

In [None]:
Z = rbo_analytics.compute_recommender_test_statistic(lists_a, lists_b,probs_a,verbose=True)

print(f"Sigmage that the 2 Document embedders are functionally related: {Z}")

if Z >= -2.33:
    print("The 2 Document embedding algorithms are functionally related.")
else:
    print("The 2 Document embedding algorithms are *NOT* functionally related.")


I want to reverse the order of the lists to demonstrate that the $Z$ statistic is not symmetric in its arguments:

In [None]:
Z = rbo_analytics.compute_recommender_test_statistic(lists_b, lists_a,probs_a,verbose=True)

print(f"Sigmage that the 2 Document embedders are functionally related: {Z}")

if Z >= -2.33:
    print("The 2 Document embedding algorithms are functionally related.")
else:
    print("The 2 Document embedding algorithms are *NOT* functionally related.")


# Comparing paraphrase-multilingual-MiniLM-L12-v2 against multilingual-e5-base

In [None]:
lists_c = [((distances_multilingual_e5[n]).argsort()[1:Ks[t]]).tolist() for t,n in enumerate(ks)]

In [None]:
Z = rbo_analytics.compute_recommender_test_statistic(lists_a, lists_c,probs_a,verbose=True)

print(f"Sigmage that the 2 Document embedders are functionally related: {Z}")

if Z >= -2.33:
    print("The 2 Document embedding algorithms are functionally related.")
else:
    print("The 2 Document embedding algorithms are *NOT* functionally related.")


# Comparing all-mpnet-base-v2 against multilingual-e5-base

In [None]:
Z = rbo_analytics.compute_recommender_test_statistic(lists_b, lists_c,probs_a,verbose=True)

print(f"Sigmage that the 2 Document embedders are functionally related: {Z}")

if Z >= -2.33:
    print("The 2 Document embedding algorithms are functionally related.")
else:
    print("The 2 Document embedding algorithms are *NOT* functionally related.")


# Conclusion
Based on the $Z$ statistics, if I were to construct a similarity graph whose edge weights are the $Z$ statistics, I would conclude that
all-mpnet-base-v2 and multilingual-e5-base embed documents most similarly ($Z \approx 13.3$),
that paraphrase-multilingual-MiniLM-L12-v2 and multilingual-e5-base are strongly related ($Z \approx 4.4$), and that
all-mpnet-base-v2 is also strongly related to paraphrase-multilingual-MiniLM-L12-v2 ($Z \approx 4.7$).
All three values lie well above the $-2.33$ threshold used in the tests, so every pair is judged functionally related.