## The Model
I want to use [e5-small-v2](https://huggingface.co/intfloat/e5-small-v2) or [e5-base-v2](https://huggingface.co/intfloat/e5-base-v2) as the base model and compare it against [e5 multilingual](https://huggingface.co/intfloat/multilingual-e5-small), both before and after training an adaptive layer on it.

Model stats:

| Model | Parameters | Layers | Embedding size |
| --- | --- | --- | --- |
| e5-small-v2 | 33M | 12 | 384 |
| e5 multilingual | 118M | 12 | 384 |
| e5-base-v2 | 110M | 12 | 768 |
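
One practical detail from the E5 model cards: the models were trained with `query: ` and `passage: ` prefixes on their inputs, and the cards recommend keeping them at inference time. A minimal sketch of a helper that adds the prefix before tokenization (`add_e5_prefix` is a hypothetical name, not part of `transformers`):

```python
def add_e5_prefix(texts, prefix="query: "):
    """Prepend an E5-style prefix to each text unless it is already there."""
    if prefix not in ("query: ", "passage: "):
        raise ValueError("E5 prefixes are 'query: ' and 'passage: '")
    return [t if t.startswith(prefix) else prefix + t for t in texts]


prefixed = add_e5_prefix(["This is an example sentence", "query: already prefixed"])
print(prefixed)
```

For symmetric tasks such as comparing translated sentence pairs, the model cards suggest using `query: ` on both sides.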

## The Data
We want to use translated (parallel) datasets, probably only for the languages covered by e5-multilingual, or maybe just one language pair. The comparison is a bit odd because the models are different. I'm also curious how much the multilingual model differentiates between the same sentence in different languages.

The data is available [here](https://huggingface.co/datasets/allenai/nllb).
 


## Training the model

In [1]:
import torch
from transformers import AutoTokenizer, AutoModel
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print("Using device:", device)

# load in model folders from disk
# try:
#     small_v2 = AutoModel.from_pretrained("models/e5-small-v2")
#     multilingual = AutoModel.from_pretrained("models/multilingual-e5-small")
#     base_v2 = AutoModel.from_pretrained("models/e5-base-v2")
#     small_v2_tokenizer = AutoTokenizer.from_pretrained("tokenizers/e5-small-v2")
#     multilingual_tokenizer = AutoTokenizer.from_pretrained("tokenizers/multilingual-e5-small")
#     base_v2_tokenizer = AutoTokenizer.from_pretrained("tokenizers/e5-base-v2")
# except (FileNotFoundError, OSError):
#     small_v2 = AutoModel.from_pretrained("intfloat/e5-small-v2")
#     multilingual = AutoModel.from_pretrained("intfloat/multilingual-e5-small")
#     base_v2 = AutoModel.from_pretrained("intfloat/e5-base-v2")
#     small_v2_tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-small-v2")
#     multilingual_tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")
#     base_v2_tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")

#     # save to disk in new folders models and tokenizers
#     small_v2.save_pretrained("models/e5-small-v2")
#     multilingual.save_pretrained("models/multilingual-e5-small")
#     base_v2.save_pretrained("models/e5-base-v2")
#     small_v2_tokenizer.save_pretrained("tokenizers/e5-small-v2")
#     multilingual_tokenizer.save_pretrained("tokenizers/multilingual-e5-small")
#     base_v2_tokenizer.save_pretrained("tokenizers/e5-base-v2")

Using device: mps


Wrap each model in a class so it's easier to call:

In [2]:
class embedding_model:
    def __init__(self, model, tokenizer, device):
        self.model = model.to(device)
        self.tokenizer = tokenizer
        self.device = device

    @staticmethod
    def average_pool(last_hidden_states: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
        # zero out padding positions, then average over the real tokens only
        last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
        return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

    def __call__(self, data):
        tokens_and_mask = self.tokenizer(data, return_tensors='pt', padding=True, truncation=True, max_length=512).to(self.device)
        model_output = self.model(tokens_and_mask["input_ids"], attention_mask=tokens_and_mask["attention_mask"])
        embedding = embedding_model.average_pool(model_output.last_hidden_state, attention_mask=tokens_and_mask["attention_mask"])
        # normalize the embedding
        embedding = embedding / embedding.norm(dim=-1, keepdim=True)
        return embedding    



# create embedding models
# small_v2_model = embedding_model(small_v2, small_v2_tokenizer, device)
# multilingual_model = embedding_model(multilingual, multilingual_tokenizer, device)
# base_v2_model = embedding_model(base_v2, base_v2_tokenizer, device)
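
The pooling and normalization steps in the class above can be checked by hand on a toy input. This is a plain-Python sketch (lists instead of tensors, so it runs without PyTorch); the vectors are made up, and `average_pool_toy` and `l2_normalize` are illustrative names only:

```python
import math


def average_pool_toy(hidden_states, attention_mask):
    """Mean of the token vectors whose mask is 1 (padding is ignored)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
    n_real = sum(attention_mask)
    return [t / n_real for t in total]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


# One toy "sentence": two real tokens followed by one padding token.
hidden = [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = average_pool_toy(hidden, mask)  # the padding vector does not contribute
embedding = l2_normalize(pooled)
print(pooled, embedding)
```

Because every embedding leaves the wrapper with unit length, cosine similarity between two of them reduces to a dot product.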


In [4]:
from datasets import load_dataset
# note: esp-eng dataset is 38GB
test_dataset = load_dataset("allenai/nllb", "eng_Latn-spa_Latn", split="train", streaming=True).take(1000)

train_dataset = load_dataset("allenai/nllb", "eng_Latn-spa_Latn", split="train", streaming=True).skip(1000)
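
With `streaming=True`, `take(1000)` yields the first 1,000 records and `skip(1000)` yields everything after them, so the two splits should not overlap as long as the stream order is the same on both passes. The same idea on a plain Python iterator, using `itertools.islice` as a stand-in for the `datasets` methods (the `stream` generator is a made-up example):

```python
from itertools import islice


def stream():
    """Stand-in for a streamed dataset: yields records one at a time."""
    for i in range(10):
        yield {"id": i}


N_TEST = 3  # the notebook uses 1000

test_split = list(islice(stream(), N_TEST))         # like .take(N_TEST)
train_split = list(islice(stream(), N_TEST, None))  # like .skip(N_TEST)

test_ids = {r["id"] for r in test_split}
train_ids = {r["id"] for r in train_split}
print(sorted(test_ids), sorted(train_ids))
```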

## Question: How does the multilingual model treat translated text pairs?

In [6]:
# plot the reduced vectors
import matplotlib.pyplot as plt


# English sentences
sentences = [
    "This is an example sentence",
    "Each sentence is converted",
    "into a vector",
    "Such vectors can be compared",
    "similar sentences will be close",
    "while different sentences are expected to be far apart",
]
# Spanish sentences (translated from English)
sentences_es = [
    "Esta es una oración de ejemplo",
    "Cada oración se convierte",
    "en un vector",
    "Tales vectores se pueden comparar",
    "se espera que las oraciones similares estén cerca",
    "mientras que las oraciones diferentes deben estar muy separadas",
]

# en = torch.tensor(multilingual.encode(sentences))
# es = torch.tensor(multilingual.encode(sentences_es))
# en_and_es = torch.cat((en, es), dim=0)
# print(F.cosine_similarity(en, es, dim=1))
# map = UMAP(n_components=2, n_neighbors=5, min_dist=0.3, metric="cosine")
# reduced = map.fit(en_and_es)


# plt.figure(figsize=(10, 10))
# plt.scatter(reduced.embedding_[:6, 0], reduced.embedding_[:6, 1], c="blue")
# plt.scatter(reduced.embedding_[6:, 0], reduced.embedding_[6:, 1], c="red")
# plt.show()

## Answer: It doesn't seem to notice the language difference, at least not obviously. Language doesn't outweigh everything else in the embedding.
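
The comparison behind this answer is the cosine similarity between each English embedding and its Spanish translation, set against the similarity to unrelated sentences. A small sketch with made-up 3-dimensional vectors (not real model outputs) shows the check:

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings: row i of `en` is meant to be the translation of row i of `es`.
en = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
es = [[0.9, 0.2, 0.1], [0.1, 0.9, 0.3]]

paired = [cosine(e, s) for e, s in zip(en, es)]          # translation pairs
crossed = [cosine(en[0], es[1]), cosine(en[1], es[0])]   # unrelated pairs
print(paired, crossed)
```

If language identity dominated the embedding, translation pairs would score no higher than unrelated pairs.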

In [81]:
# make a trainable copy of the base model, wrapped so it can be called directly
dynamic_tokenizer = AutoTokenizer.from_pretrained("tokenizers/e5-base-v2")
dynamic = embedding_model(AutoModel.from_pretrained("models/e5-base-v2"), dynamic_tokenizer, device)

tensor([[1, 1, 1, 1, 1, 1, 1]], device='mps:0')
tensor(1., device='mps:0', grad_fn=<NormBackward1>)


tensor([14.1293], device='mps:0', grad_fn=<NormBackward1>)


In [19]:
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from tqdm import tqdm

DYNAMIC_SCALING = 0.1  # weight of the cross-lingual (native vs. foreign) term

class ContrastiveLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, native_fixed, native_dynamic, foreign_dynamic):
        # Keep the dynamic model's native-language embedding close to the fixed model's
        native_sim = F.cosine_similarity(native_fixed, native_dynamic)
        # Pull the dynamic model's native and foreign embeddings together
        dynamic_sim = F.cosine_similarity(native_dynamic, foreign_dynamic)
        loss_contrastive = torch.mean(1 - native_sim) + DYNAMIC_SCALING * torch.mean(1 - dynamic_sim)

        return loss_contrastive


def evaluate(dynamic_model, fixed_model, dataset, batch_size=32, criterion=None, langs=("eng_Latn", "spa_Latn"), device=device):
    """
    Evaluates the model on the given dataset.
    Returns two averages of cosine similarity (not losses):
    fixed vs. dynamic on the native language, and
    native vs. foreign within the dynamic model.
    `criterion` and `device` are accepted for symmetry with `train` but are unused.
    """
    dataset = dataset.with_format("torch")
    dataloader = DataLoader(dataset, batch_size=batch_size)

    inter_model_sim_total = 0
    intra_lang_sim_total = 0
    num_batches = 0

    # No gradients are needed for evaluation
    with torch.no_grad():
        for batch in dataloader:
            # Forward pass
            fixed_native = fixed_model(batch["translation"][langs[0]])
            dynamic_native = dynamic_model(batch["translation"][langs[0]])
            dynamic_foreign = dynamic_model(batch["translation"][langs[1]])

            inter_model_sim = torch.mean(F.cosine_similarity(fixed_native, dynamic_native))
            intra_lang_sim = torch.mean(F.cosine_similarity(dynamic_native, dynamic_foreign))

            inter_model_sim_total += inter_model_sim.item()
            intra_lang_sim_total += intra_lang_sim.item()
            num_batches += 1

    avg_inter_model_sim = inter_model_sim_total / num_batches
    avg_intra_lang_sim = intra_lang_sim_total / num_batches

    return avg_inter_model_sim, avg_intra_lang_sim

def train(fixed_model, dynamic_model, num_text_pairs, evals, train_dataset=train_dataset, test_dataset=test_dataset, lr=.0001, batch_size=32, criterion=ContrastiveLoss(), langs=["eng_Latn", "spa_Latn"], device=device):
    """
    langs[0] is the native language
    langs[1] is the foreign language
    """
    
    # Define the optimizer
    optimizer = torch.optim.Adam(dynamic_model.model.parameters(), lr=lr)
    
    # Convert the dataset to torch format and create a DataLoader
    train_dataset = train_dataset.with_format("torch")
    dataloader = DataLoader(train_dataset, batch_size=batch_size)

    for i, batch in enumerate(dataloader):
        # Forward pass (the fixed model is only a reference, so skip its gradients)
        with torch.no_grad():
            fixed_native = fixed_model(batch["translation"][langs[0]])
        dynamic_native = dynamic_model(batch["translation"][langs[0]])
        dynamic_foreign = dynamic_model(batch["translation"][langs[1]])
        loss = criterion(fixed_native, dynamic_native, dynamic_foreign)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if i * batch_size > num_text_pairs:
            break

        if i % 100 == 0:
            print(f'Epoch [{i+1}/{num_text_pairs/batch_size}]')
            print(f'Loss: {loss.item()}')
            test = evaluate(dynamic_model, fixed_model, test_dataset, batch_size=batch_size, criterion=criterion, langs=langs, device=device)
            print(f"Average inter-model similarity: {test[0]}")
            print(f"Average intra-lang similarity: {test[1]}")
            evals.append(test)
            #TODO don't hardcode this
            dynamic_model.model.save_pretrained(f"training/e5-base-v2-{i * batch_size}")

        # print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item()}')
    return loss.item(), evals

# dynamic = embedding_model(AutoModel.from_pretrained("models/e5-base-v2"), AutoTokenizer.from_pretrained("tokenizers/e5-base-v2"), device)
fixed = embedding_model(AutoModel.from_pretrained("models/e5-base-v2"), AutoTokenizer.from_pretrained("tokenizers/e5-base-v2"), device)
# multi = embedding_model(AutoModel.from_pretrained("models/multilingual-e5-small"), AutoTokenizer.from_pretrained("tokenizers/multilingual-e5-small"), device)
evals = []
train(fixed, dynamic, num_text_pairs=50_000, batch_size=64, evals=evals)

Epoch [1/781.25]
Loss: 0.010403558611869812
Average inter-model similarity: 0.9887597672641277
Average intra-lang similarity: 0.9559698179364204
Epoch [101/781.25]
Loss: 0.00917188823223114
Average inter-model similarity: 0.9959704056382179
Average intra-lang similarity: 0.9540795646607876
Epoch [201/781.25]
Loss: 0.008555032312870026
Average inter-model similarity: 0.9960418380796909
Average intra-lang similarity: 0.9550326094031334
Epoch [301/781.25]
Loss: 0.008802447468042374
Average inter-model similarity: 0.9956105276942253
Average intra-lang similarity: 0.9574507139623165
Epoch [401/781.25]
Loss: 0.008591266348958015
Average inter-model similarity: 0.9952964968979359
Average intra-lang similarity: 0.957591962069273
Epoch [501/781.25]
Loss: 0.008262112736701965
Average inter-model similarity: 0.99581703171134
Average intra-lang similarity: 0.9582218788564205
Epoch [601/781.25]
Loss: 0.007295450195670128
Average inter-model similarity: 0.9958999119699001
Average intra-lang similari

(0.008968481793999672,
 [(0.9887597672641277, 0.9559698179364204),
  (0.9959704056382179, 0.9540795646607876),
  (0.9960418380796909, 0.9550326094031334),
  (0.9956105276942253, 0.9574507139623165),
  (0.9952964968979359, 0.957591962069273),
  (0.99581703171134, 0.9582218788564205),
  (0.9958999119699001, 0.9587362334132195),
  (0.9958053007721901, 0.9597854018211365)])
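
The loss above has two terms: one keeps the dynamic model's native-language embedding close to the fixed model's, and the other, weighted by `DYNAMIC_SCALING`, pulls the dynamic model's native and foreign embeddings together. A plain-Python sketch of the same arithmetic, using made-up similarity values rather than an actual batch (`combined_loss` is an illustrative name):

```python
DYNAMIC_SCALING = 0.1  # the notebook's weight for the cross-lingual term (10e-2 == 0.1)


def combined_loss(native_sims, dynamic_sims, scale=DYNAMIC_SCALING):
    """mean(1 - native_sim) + scale * mean(1 - dynamic_sim)."""
    anchor_term = sum(1 - s for s in native_sims) / len(native_sims)
    align_term = sum(1 - s for s in dynamic_sims) / len(dynamic_sims)
    return anchor_term + scale * align_term


# Made-up per-example cosine similarities for a batch of two sentence pairs.
loss = combined_loss(native_sims=[0.99, 0.99], dynamic_sims=[0.95, 0.97])
print(loss)
```

With this weighting, a given drop in fixed-vs-dynamic similarity costs ten times as much as the same drop in cross-lingual similarity, which may be part of why the second number in each evaluation pair moves slowly.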

In [20]:
# store dynamic.model in a folder
dynamic.model.save_pretrained("trained/e5-base-v2-150k")

## Testing

In [21]:
# compare each English sentence with its Spanish translation under both models
for pair in zip(sentences, sentences_es):
    print(pair)
    fix = fixed(list(pair))    # row 0: English, row 1: Spanish
    dyn = dynamic(list(pair))
    print("fixed", F.cosine_similarity(fix[0].unsqueeze(0), fix[1].unsqueeze(0)).item())
    print("dynamic", F.cosine_similarity(dyn[0].unsqueeze(0), dyn[1].unsqueeze(0)).item())
    # how far the English embedding moved during training
    print("fixed v dynamic", F.cosine_similarity(fix[0].unsqueeze(0), dyn[0].unsqueeze(0)).item())


# embeddings of a Spanish/English word pair under the trained and the fixed model
a = dynamic(["espanol", "spanish"])
b = fixed(["espanol", "spanish"])

('This is an example sentence', 'Esta es una oración de ejemplo')
fixed 0.7975223660469055
dynamic 0.8925096988677979
fixed v dynamic 0.9899151921272278
('Each sentence is converted', 'Cada oración se convierte')
fixed 0.7486991882324219
dynamic 0.8775215744972229
fixed v dynamic 0.9962867498397827
('into a vector', 'en un vector')
fixed 0.9113370180130005
dynamic 0.9575779438018799
fixed v dynamic 0.9964996576309204
('Such vectors can be compared', 'Tales vectores se pueden comparar')
fixed 0.856069803237915
dynamic 0.9698941707611084
fixed v dynamic 0.9961637258529663
('similar sentences will be close', 'se espera que las oraciones similares estén cerca')
fixed 0.8379346132278442
dynamic 0.8799442648887634
fixed v dynamic 0.9934095144271851
('while different sentences are expected to be far apart', 'mientras que las oraciones diferentes deben estar muy separadas')
fixed 0.8365286588668823
dynamic 0.8870798945426941
fixed v dynamic 0.9920188188552856
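
Averaging the six results printed above gives a quick summary of the change. The values below are copied from that output and rounded to four decimals:

```python
from statistics import mean

# Cosine similarity between each English sentence and its Spanish translation.
fixed_sims = [0.7975, 0.7487, 0.9113, 0.8561, 0.8379, 0.8365]
dynamic_sims = [0.8925, 0.8775, 0.9576, 0.9699, 0.8799, 0.8871]

mean_fixed = mean(fixed_sims)
mean_dynamic = mean(dynamic_sims)
print(f"fixed:   {mean_fixed:.3f}")
print(f"dynamic: {mean_dynamic:.3f}")
print(f"gain:    {mean_dynamic - mean_fixed:.3f}")
```

Every pair moves up while the fixed-vs-dynamic similarity stays above 0.98, though six sentences are far too few to draw a firm conclusion.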


In [26]:
print(torch.cosine_similarity(dynamic("el niño triste"), dynamic("the sad boy"), dim=1))

tensor([0.9289], device='mps:0', grad_fn=<SumBackward1>)


In [27]:
torch.cosine_similarity(dynamic("I love you"), dynamic("I really love you"))

tensor([0.9369], device='mps:0', grad_fn=<SumBackward1>)

In [35]:
print(torch.cosine_similarity(a[0].unsqueeze(0), a[1].unsqueeze(0)))
print(torch.cosine_similarity(b[0].unsqueeze(0), b[1].unsqueeze(0)))

tensor([0.9073], device='mps:0', grad_fn=<SumBackward1>)
tensor([0.8750], device='mps:0', grad_fn=<SumBackward1>)


In [None]:
embed(multilingual.to(device), base_v2_tokenizer, ["espanol", "spanish"], device=device)

In [25]:
train_dataset

<datasets.iterable_dataset.IterableDataset at 0x7ff200778cd0>