# Embedding

## TODO Train Sentence Transformer, Find DB ?

Sentence embeddings are vectors (lists of numbers) that represent the meaning of entire sentences.

First we run the preprocessing in [preprocessing.ipynb](../datasets/preprocessing.ipynb). <br>
Afterward we create the embeddings using a sentence transformer.

## Notebook Overview

1. Utils - used across the board. <br>
2. Sentence Transformers - the why, what, and how of transformers. <br>
3. Refining Transformer - possible additions to get the most out of the sentence transformer. <br>

## Utils

### Imports

In [None]:
%load_ext autoreload
%autoreload 2

from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

from torch.utils.data import DataLoader
import torch
from datasets import load_dataset, Dataset

from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import pandas as pd
import random
import emoji
from urllib.parse import urlparse
import re
import os
import sys

module_path = os.path.abspath(os.path.join('../'))
if module_path not in sys.path:
    sys.path.append(module_path)

import data.data_utils as data_utils
import model.model_utils as model_utils

In [None]:
dataset_name = "jack_vs_calley_1000"
data = data_utils.get_comments(dataset_name)
tweets = data_utils.get_tweets()

In [None]:
st_model = model_utils.get_sentence_transformer_model()
all_mpnet_base_v2 = SentenceTransformer('all-mpnet-base-v2')
all_MiniLM_L6_v2 = SentenceTransformer('all-MiniLM-L6-v2')

### Create Embeddings

In [None]:
st_model = model_utils.get_sentence_transformer_model()
embeddings = st_model.encode(data)

### Hugging Face

In [None]:
from huggingface_hub import notebook_login

notebook_login()

### Augment Tweets

#### Strategy

<u>**Default Denoising Strategy**</u> <br>
The default noising strategy in DenoisingAutoEncoderDataset (random token deletion, with a default deletion ratio of 0.6) is a good starting point, but it is not optimized for tweets. <br>

<u>**Best Noising Strategy for Tweets**</u> <br>

**Conservative Token Deletion** - Delete tokens with a much lower probability than the default (e.g., 10-15% instead of 60%) because tweets are short, and deleting too many tokens can destroy the meaning.
Avoid deleting critical tokens like hashtags, mentions, URLs, and emojis to preserve tweet-specific structure. <br>

**Token Swapping** - Swap adjacent tokens with a small probability (e.g., 10%) to simulate minor word order variations, which can happen in informal writing. Ensure swaps don’t break hashtags, mentions, or URLs.

**Token Replacement (Simulate Typos)** - Replace tokens with similar-looking characters (e.g., "love" → "l0ve", "the" → "teh") to mimic common typos in tweets. Use a small probability (e.g., 5-10%) to avoid overwhelming the text with typos.

#### Helpers

In [None]:
def is_special_token(token):
    if token.startswith("#") or token.startswith("@"):
        return True
    elif contains_emoji(token):
        return True
    elif is_url(token):
        return True
    else:
        return False

def contains_emoji(token):
    for char in token:
        if char in emoji.EMOJI_DATA:
            return True
    return False

def is_url(token):
    try:
        result = urlparse(token)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

In [None]:
def get_tokens_delete_conservative(tokens, deletion_prob = 0.1):
    noisy_tokens = []
    for token in tokens:
        if is_special_token(token) or random.random() > deletion_prob:
            noisy_tokens.append(token)

    return noisy_tokens

In [None]:
def get_tokens_swapped(tokens, swap_prob = 0.1):
    noisy_tokens = tokens.copy()
    for i in range(len(noisy_tokens) - 1):
        if random.random() < swap_prob and not is_special_token(noisy_tokens[i]) and not is_special_token(noisy_tokens[i + 1]):
            noisy_tokens[i], noisy_tokens[i + 1] = noisy_tokens[i + 1], noisy_tokens[i]

    return noisy_tokens

In [None]:
# Helper function to introduce a typo
def introduce_typo(word):
    if len(word) < 2:
        return word

    idx = random.randint(0, len(word) - 2)
    return word[:idx] + word[idx + 1] + word[idx] + word[idx + 2:]

def get_typo_tokens(tokens, typo_prob = 0.05):
    noisy_tokens = []
    for token in tokens:
        if is_special_token(token) or random.random() > typo_prob:
            noisy_tokens.append(token)
        else:
            noisy_tokens.append(introduce_typo(token))

    return noisy_tokens

#### Create Train Examples

In [None]:
def add_noise(tweet, deletion_prob=0.1, swap_prob=0.1, typo_prob=0.05):
    tokens = tweet.split()
    tokens = get_tokens_delete_conservative(tokens, deletion_prob)
    tokens = get_tokens_swapped(tokens, swap_prob)
    tokens = get_typo_tokens(tokens, typo_prob)

    return " ".join(tokens)

# Test the noise function
tweet = "I love my new iPhone 📱 #TechLover @Apple https://apple.com"
noisy_tweet = add_noise(tweet)
print(f"Original: {tweet}")
print(f"Noisy: {noisy_tweet}")

In [None]:
train_examples = []
for tweet in tweets:
    augmented_tweet = add_noise(tweet)
    train_examples.append({"sentence1": tweet, "sentence2": augmented_tweet})

## Sentence Transformers

### Why ?

**Why are Transformers Used for This Task?** <br>
<u>Contextual Understanding</u> <br>
Transformers excel at understanding the context of words within a sentence. They use an "attention mechanism" that allows them to weigh the importance of different words in the sentence when determining its meaning. They can effectively capture long-range dependencies between words in a sentence. This means they can understand how words that are far apart in a sentence relate to each other, which is essential for understanding the overall meaning of the sentence. <br>

<u>Pre-training and Fine-tuning</u> <br>
Transformer models can be pre-trained on massive amounts of text data, allowing them to learn a deep understanding of language. This pre-trained knowledge can then be fine-tuned for specific tasks, such as generating sentence embeddings.  <br>

<u>Superior Performance</u> <br>
Transformer-based models have consistently achieved state-of-the-art results on a wide range of NLP tasks, including sentence embedding. Their ability to capture complex semantic relationships makes them ideal for this task.   

### What ?

#### Transformer Basics

##### Transformer Architecture

The Transformer is a deep learning architecture built around the attention mechanism.

* **Key Features:**
    * Attention Mechanism: This is the core of the Transformer, allowing the model to focus on relevant parts of the input.
    * Encoder-Decoder Structure: The original Transformer architecture consists of an encoder (to process the input) and a decoder (to generate the output).
    * Parallel Processing: Unlike recurrent neural networks (RNNs), Transformers can process input sequences in parallel, leading to faster training.

<img src="images/transformer-architecture.png">

##### Input Embedding

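First, every word (token) is mapped to a vector in an "embedding space". <br>
In Transformer models this embedding table is learned together with the rest of the network; standalone pre-trained examples of such spaces include GloVe (Global Vectors for Word Representation). <br>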
<img src="images/embedding-space.png" width="200" height="200">

##### Positional Encoder

A vector that gives context based on the position of the word in the sentence. <br>
The positional encodings have the same dimension d<sub>model</sub> as the embeddings, so that the two can be summed. <br>
In the formula below, pos is the position and i is the dimension index. <br>
<img src="images/positional-encoder.png" width="300" height="200">
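
The following is a minimal sketch of a positional encoding table, assuming the image above shows the sinusoidal formulation from the original Transformer paper (sine on even dimensions, cosine on odd ones). The function name and the small sizes are illustrative only. Models such as BERT learn their position embeddings instead of using fixed sinusoids.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for dim in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            i = dim // 2
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Hypothetical sizes, chosen only to keep the output readable.
pe = positional_encoding(num_positions=4, d_model=8)
for pos, row in enumerate(pe):
    print(pos, [round(x, 3) for x in row])
```

Each row has the same length as a token embedding, so it can be added element-wise to the embedding of the token at that position.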

##### Multi Head Attention

How relevant is the i-th word to the other words? <br>
For every word, an attention vector is generated that captures the contextual relationships between words. <br>
<img src="images/attention-vectors.png" width="400" height="200"> <br>

Each word vector is projected into three vectors of the same dimension. <br>
Q - query: what I'm looking for. <br>
K - key: what I can offer. <br>
V - value: what I actually have to offer. <br>
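
To make the roles of Q, K, and V concrete, here is a small sketch of scaled dot-product attention for a single head, written in plain Python. The toy matrices are made-up values, and the learned projections that produce Q, K, and V are left out.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # How well does this query match every key?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# Toy example: two tokens with 2-dimensional vectors (illustrative values).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = scaled_dot_product_attention(Q, K, V)
print(out)
```

Each output row is a mix of all value vectors, weighted toward the token whose key best matches the query.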

How are Q, K, V calculated? <br>
For each of the h attention heads, the model stores three weight matrices. <br>
W<sub>Q</sub><sup>(i)</sup>: The weight matrix for the Query projection. <br>
W<sub>K</sub><sup>(i)</sup>: The weight matrix for the Key projection. <br>
W<sub>V</sub><sup>(i)</sup>: The weight matrix for the Value projection. <br>
Each of these matrices has dimensions (d<sub>model</sub>, d<sub>k</sub>), where: <br>
d<sub>model</sub> is the dimension of the input embeddings (e.g., 768 for BERT base). <br>
d<sub>k</sub> is the dimension of the key/query/value vectors for each head (typically d<sub>model</sub>/h). <br>
Total number of weights: each head has d<sub>model</sub> * d<sub>k</sub> weights in each of the three matrices. For h heads, the total is 3 * h * d<sub>model</sub> * d<sub>k</sub>. Since d<sub>k</sub> is usually d<sub>model</sub>/h, this simplifies to 3 * d<sub>model</sub><sup>2</sup> (not counting biases or the output projection).
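
As a quick check of the arithmetic above, the count can be worked out for BERT base (d<sub>model</sub> = 768, h = 12). The function name is illustrative; biases and the output projection are deliberately excluded, matching the text.

```python
def qkv_parameter_count(d_model, num_heads):
    """Weights in the Q, K and V projections across all heads."""
    d_k = d_model // num_heads          # per-head dimension
    per_head = 3 * d_model * d_k        # W_Q, W_K, W_V for one head
    return num_heads * per_head

d_model, num_heads = 768, 12            # BERT base
total = qkv_parameter_count(d_model, num_heads)
print(total)                            # 1769472
print(total == 3 * d_model ** 2)        # True
```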

##### Add And Norm

Adds the sub-layer's input back to its output (a residual connection) and then applies layer normalization. <br>
This keeps a strong information signal flowing through deep networks. It is needed because of the vanishing gradient problem in backpropagation (gradients shrink toward 0 as they pass back through many layers). <br>
To prevent this, the residual connections re-inject the input signal at different parts of the network.
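
A rough sketch of the Add & Norm step for a single token vector follows. It assumes the standard formulation (residual addition followed by layer normalization) and leaves out the learned scale and shift parameters that real implementations include. All values are made up.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(sublayer_input, sublayer_output):
    # "Add": residual connection keeps the original signal.
    residual = [a + b for a, b in zip(sublayer_input, sublayer_output)]
    # "Norm": layer normalization (learned gain/bias omitted in this sketch).
    return layer_norm(residual)

x = [1.0, 2.0, 3.0, 4.0]            # token vector entering the attention block
attn_out = [0.5, -0.5, 0.5, -0.5]   # hypothetical attention output
y = add_and_norm(x, attn_out)
print([round(v, 3) for v in y])
```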

#### BERT (Bidirectional Encoder Representations from Transformers)

BERT is a specific Transformer-based model developed by Google.
* **Key Features:**
    * **Encoder-Only:** Uses only the Transformer's encoder.
    * **Bidirectional Training:** Trained to understand context from both directions.
    * **Pre-training and Fine-tuning:** Pre-trained on a massive dataset, then fine-tuned for specific tasks.


#### Sentence Transformers

While you can derive sentence embeddings from BERT (e.g., by averaging word embeddings), the resulting embeddings are often not optimal for tasks that require comparing the semantic similarity of sentences. <br>

Sentence Transformers are a framework or a methodology. They don't dictate a single architecture. Instead, they provide a way to train Transformer models to generate high-quality sentence embeddings.   
You can use various base Transformer architectures (like BERT, RoBERTa, MPNet, etc.) within the Sentence Transformers framework. <br>

Input Sentence --> Tokenization --> Transformer Encoder --> Pooling --> Sentence Embedding <br>

"all-mpnet-base-v2" <br>
A model name like "all-mpnet-base-v2" means that the MPNet base Transformer architecture was used and then fine-tuned using the Sentence Transformers methodology.   
It is therefore a sentence transformer that uses the MPNet architecture.

In [None]:
st_model

##### Pooling

The encoder outputs one vector per token, so two sentences with different numbers of tokens produce outputs of different shapes that can't be compared directly.
You need a way to summarize the token vectors of each sentence into a single, consistent representation.

The pooling layer produces a single set of numbers that represents the "summary" of the sentence.
This set of numbers is the sentence embedding.
Because the pooling layer always produces a fixed-size output, you can now easily compare the sentence embeddings of different sentences.
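
The sketch below illustrates mean pooling, one common pooling choice, followed by a cosine comparison of the pooled vectors. It assumes the attention mask marks real tokens with 1 and padding with 0; the token vectors are toy values, not model output.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, ignoring positions where the mask is 0."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    return [t / count for t in total]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two "sentences" with different token counts (last vector of the first is padding).
short_tokens = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
short_mask = [1, 1, 0]
long_tokens = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0], [2.0, 0.0]]
long_mask = [1, 1, 1, 1]

emb_short = mean_pool(short_tokens, short_mask)
emb_long = mean_pool(long_tokens, long_mask)
print(emb_short, emb_long, cosine(emb_short, emb_long))
```

Both sentences end up as vectors of the same size regardless of how many tokens they contain, which is what makes the comparison possible.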

### How ?

#### Similarity By Cosine

In [None]:
text_index_to_compare = 13
similarity_scores = cosine_similarity(embeddings, embeddings)[text_index_to_compare]
similarity_df = pd.DataFrame({'similarity': similarity_scores, 'sentence': data})
similarity_df = similarity_df.sort_values('similarity', ascending=False)
display(similarity_df.head(20))

#### Embedding - Preprocessing

<img src="images/preprocessing.png">

Sentence Transformers handle preprocessing in a specific way that's optimized for creating effective sentence embeddings. <br>

**Tokenization:** <br>
Sentence Transformers typically use sub-word tokenization algorithms like WordPiece or Byte-Pair Encoding (BPE). These algorithms break words into smaller, more frequent units, which helps handle out-of-vocabulary words and reduces the vocabulary size. <br>

**Lowercasing (Optional):** <br>
Some Sentence Transformer models might apply lowercasing as a preprocessing step. However, this is not always the case. <br>

**No Stemming or Lemmatization:** <br>
Sentence Transformers generally do not perform stemming or lemmatization.
The reason is that these techniques can sometimes lose semantic information, which is crucial for generating accurate sentence embeddings.

**No Stop Word Removal:** <br>
Sentence Transformers also typically do not remove stop words.
The context provided by stop words can be important for understanding the meaning of a sentence. <br>

**Normalization:** <br>
Sentence Transformers convert the input text into a numerical representation (embeddings), which is a form of normalization only in a loose sense.
Some models, such as all-mpnet-base-v2, also end with a normalization layer so that the output embeddings have unit length. <br>

##### Tokenization

Splitting text into smaller units called tokens. These tokens can be words, subwords, or even characters.

**Special Tokens** <br>
[CLS] -> Added at the beginning of each input sequence; its output is used for classification tasks (`<s>` in some tokenizers). <br>
[SEP] -> Separates two sentences in a sequence (`</s>` in some tokenizers). <br>
[UNK] -> Replaces a word that is not found in the vocabulary. <br>
`##` -> Prefix marking a sub-word piece that continues the previous token.
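
The `##` prefix comes from WordPiece-style tokenization. The following toy sketch shows the greedy longest-match idea behind it. The vocabulary is invented for illustration and is far smaller than a real one, and real tokenizers also handle casing, punctuation, and maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation of the previous piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]                 # nothing matched: whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical miniature vocabulary.
vocab = {"crazy", "##y", "##yy", "bit", "##coin", "##er"}
print(wordpiece_tokenize("bitcoiner", vocab))  # ['bit', '##coin', '##er']
print(wordpiece_tokenize("crazyyy", vocab))    # ['crazy', '##yy']
print(wordpiece_tokenize("xyz", vocab))        # ['[UNK]']
```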

**Padding** <br>
Within a batch, all input sequences must have the same length. <br>
Shorter sentences are padded with a special token, usually [PAD]. <br>
If a text is longer than the maximum length the model can handle, it is either truncated (cut off at the maximum length) or split into multiple segments.
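
Padding and truncation can be sketched on lists of token ids as follows. The pad id of 0 and the id values are assumptions for illustration. Real tokenizers also keep the closing special token when truncating, which this naive version does not.

```python
def pad_batch(batch_ids, max_length, pad_id=0):
    """Truncate to max_length, then pad every sequence to the longest one."""
    longest = min(max(len(ids) for ids in batch_ids), max_length)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:longest]                       # truncation
        padding = longest - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

padded = pad_batch(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```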

**Remove Numbers:** <br>
Sentence Transformers might keep numbers as tokens, as they can sometimes carry semantic meaning.
If you want to remove them, you would need to do it as a separate preprocessing step before feeding the text to the Sentence Transformer.

**Remove Punctuation:** <br>
Sentence Transformers' tokenizers often handle punctuation marks as separate tokens.
They might keep some punctuation, as it can contribute to meaning (e.g., question marks, exclamation points).
If you want to remove all punctuation, you would have to do it beforehand.

**Remove Special Characters and Symbols:** <br>
Similar to punctuation, Sentence Transformers might keep some special characters and symbols.
Again, you would need to remove them manually if desired.

**Remove Non-English Words:** <br>
Sentence Transformers are typically trained on multilingual or English corpora.
They might not explicitly remove non-English words, but they might not produce meaningful embeddings for them.
If you want to filter out non-English words, you would need to do it beforehand.

**Remove Words with Less Than Three Letters:** <br>
Sentence Transformers generally do not filter out short words.
You would need to do this manually if you want to remove them.

In [None]:
tokenizer = st_model.tokenizer
tokenized_texts = []
for text in data:
    tokens = tokenizer.tokenize(text)
    tokenized_texts.append(tokens)

# Print the tokenized texts
for tokens in tokenized_texts:
    print(tokens)

###### Tokenization Code Example

In [None]:
tokens = tokenizer.tokenize("This is f* **ed and crazyyyy")
print(tokens)

In [None]:
sentences = [
    "It was a great discussion",
    "Calley comes off shady AF.",
    "Total Jack count anyone?",
    "Wow is Calley tongue tied."
]
batch = tokenizer(sentences, padding=True)
print(batch)

In [None]:
for ids in batch["input_ids"]:
    print(tokenizer.decode(ids))

##### Lemmatization

Determines a word's base form, or lemma. It considers the context of the word and aims to produce actual words from the dictionary. <br>
Example: <br>
Original words: "better," "good," "best" <br>
Lemmatized words: "good," "good," "good"

Source: A systematic review of text stemming techniques.

##### Stop words

Common words in a language that are often filtered out in text analysis tasks because they are considered to carry little semantic meaning or contribute minimally to the overall understanding of a text.

Examples of common English stop words include:

* Articles: the, a, an
* Prepositions: in, on, at, with, for
* Conjunctions: and, or, but
* Pronouns: I, you, he, she, it, they, we
* Auxiliary Verbs: is, am, are, was, were, will, shall, can, could, may, might, must

**Why remove stop words?**

* Reduced Dimensionality: By removing stop words, we can reduce the dimensionality of the text data, making it easier to process and analyze.
* Improved Performance: Removing stop words can improve the performance of many NLP tasks, such as text classification, sentiment analysis, and information retrieval.
* Focus on Key Words: By filtering out stop words, we can focus on the most important words in the text, which can lead to more accurate and meaningful analysis.
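
For reference, stop word removal as a manual preprocessing step might look like the sketch below. The stop word set is a small hand-picked sample taken from the examples above, not a complete list.

```python
# Small illustrative subset; real stop word lists are much longer.
STOP_WORDS = {"the", "a", "an", "in", "on", "at", "with", "for",
              "and", "or", "but", "is", "was"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_words]

print(remove_stop_words("The cat is on the mat"))   # ['cat', 'mat']
print(remove_stop_words("This movie is not good"))  # ['this', 'movie', 'not', 'good']
```

A step like this would only be needed for classic bag-of-words pipelines; as noted earlier, Sentence Transformers typically keep stop words.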

##### Stemming

Reducing words to their root form by removing suffixes, prefixes, or other affixes. <br>
Example: <br>
Original words: "cats," "catlike," "catty"   
Stemmed words: "cat", "catlik", "catti"   

Source: Kaur, J.; Buttar, P.K. A systematic review on stopword removal algorithms. Int. J. Future Revolut. Comput. Sci. Commun. Eng. 2018, 4, 207–210.

## Optimization

### Maximum Sequence Length

In [None]:
word_counts = [len(sentence.split()) for sentence in data]

# Create the histogram
plt.figure(figsize=(15, 5))
plt.hist(word_counts, bins=200, alpha=0.7, label='sentences')

# Add title and labels
plt.title('Frequency of number of words per sentence')
plt.xlabel('Number of words')
plt.ylabel('Frequency')

# Add legend
plt.legend()

# Show the plot
plt.show()

In [None]:
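# Note: word counts are only a proxy. The model's maximum sequence length is measured
# in sub-word tokens, so a text can exceed the limit with fewer words than counted here.
over_512 = sum(1 for count in word_counts if count > 512)
over_256 = sum(1 for count in word_counts if count > 256)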

print(f"Number of sentences with over 512 words: {over_512}")
print(f"Number of sentences with over 256 words: {over_256}")

### Adding Vocabulary ?

In [None]:
tokenizer = st_model.tokenizer
tokenized_texts = []
for text in data:
    tokens = tokenizer.tokenize(text)
    tokenized_texts.append(tokens)

# Print the tokenized texts
for tokens in tokenized_texts:
    print(tokens)

As we can see, some words are not in the vocabulary and get split into sub-word pieces, for example: utube (youtube), bitcoiner, the name kally, psychpath, behaving, ...

### Training A Model ?

#### Comparing Training Methods


| **Method**                          | **How It Works**                                                                 | **Pros**                                                                 | **Cons**                                                                 |
|-------------------------------------|----------------------------------------------------------------------------------|--------------------------------------------------------------------------|--------------------------------------------------------------------------|
| **Standard Supervised Fine-Tuning** | Trains on labeled data with a task-specific loss (e.g., cosine similarity).     | High accuracy, task-specific optimization.                               | Requires labeled data, limited by dataset size/quality.                  |
| **Unsupervised Fine-Tuning**        | Uses unlabeled data with objectives like TSDAE (denoising) or SimCSE (contrastive). | No labeled data needed, adapts to domain.                                | Lower accuracy than supervised, depends on data quality.                 |
| **Self-Supervised (MLM)**           | Fine-tunes on unlabeled data by predicting masked words.                        | No labeled data needed, improves contextual understanding.               | Not task-specific, needs large data, requires further adaptation.        |
| **SetFit (Few-Shot Learning)**      | Trains on small labeled data with contrastive learning and a classification head (logistic regression by default). | Very efficient, needs little labeled data, fast training.                | Requires some labeled data, performance tied to dataset quality.         |
| **Zero-Shot Learning (LLM)**        | Uses an LLM to generate embeddings or classify via prompts, no fine-tuning.     | No training data needed, flexible, leverages LLM knowledge.              | Unpredictable performance, computationally expensive.                    |
| **Contrastive Learning (Hard Negatives)** | Trains with positive/negative pairs, often mining hard negatives (InfoNCE loss). | Improves fine-grained similarity, can be supervised or unsupervised.     | Requires careful pair selection, can be computationally intensive.       |
| **Knowledge Distillation**          | A teacher model (e.g., a larger bi-encoder) trains a smaller Sentence Transformer. | Transfers knowledge from a better model, efficient inference.            | Needs a trained teacher model, may lose some accuracy.                   |
| **Multi-Task Learning**             | Trains on multiple tasks (e.g., similarity, classification) simultaneously.     | Better generalization, leverages diverse data.                           | More complex to implement, requires diverse labeled data.                |

---

#### Contrastive Tension

An unsupervised method that uses two models. If the same sentence is passed to Model1 and Model2, the two sentence embeddings should get a large dot-product score. If different sentences are passed, the sentence embeddings should get a low score.

#### SimCSE (Simple Contrastive Learning of Sentence Embeddings)

##### SimCSE Explanation

SimCSE leverages contrastive learning. It takes a sentence and creates two slightly different versions of it, treating them as positive pairs.   
The core idea is to maximize the similarity between these positive pairs while minimizing the similarity between all other sentences (negative pairs) in the batch.   
The most common form of SimCSE uses dropout as the only form of data augmentation. Applying dropout twice to the same sentence yields two slightly different embeddings.
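
The contrastive objective can be sketched in plain Python as an InfoNCE-style loss with in-batch negatives. The two "views" stand in for the two dropout passes, the vectors are invented, and the temperature of 0.05 is an assumed value chosen for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce_loss(view_a, view_b, temperature=0.05):
    """Cross-entropy where view_b[i] is the positive for view_a[i];
    every other view_b[j] in the batch acts as a negative."""
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]   # -log softmax of the positive pair
    return total / n

# Two sentences, each encoded twice (toy vectors).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
shuffled_b = [view_b[1], view_b[0]]        # positives deliberately mismatched

loss_aligned = info_nce_loss(view_a, view_b)
loss_shuffled = info_nce_loss(view_a, shuffled_b)
print(loss_aligned, loss_shuffled)
```

The loss is small when each sentence is closest to its own second view and large when the pairs are mismatched.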

##### Shorter Training
  


Training initially looked like it was going to take 30 hours, so we took several steps to shorten it. <br>

If a GPU is available:
* move the model to CUDA
* use AMP (automatic mixed precision, float32 to float16)

CPU measures:
1. **Gradient Accumulation** - Instead of processing the entire batch at once, the batch is divided into smaller "micro-batches". The model performs a forward pass and a backward pass on each micro-batch.
The gradients are calculated for each micro-batch but are not immediately applied to update the model's weights. They are accumulated (added together), and only after all micro-batches have been processed are they used to update the model's weights. This mainly reduces peak memory for a given effective batch size.

2. **Use Smaller Model** - all-MiniLM-L6-v2 instead of all-mpnet-base-v2

3. **Use Faster Optimizer** - AdamW (Adam with decoupled weight decay) instead of Adam

Other considerations: num_workers is not needed (the data is small).
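
Gradient accumulation from the list above can be illustrated with a one-parameter linear model. This is a toy sketch, assuming equally sized micro-batches and a mean-squared-error loss. The data and learning rate are made up.

```python
def mse_gradient(w, samples):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)

def accumulated_gradient(w, samples, micro_batch_size):
    micro_batches = [samples[i:i + micro_batch_size]
                     for i in range(0, len(samples), micro_batch_size)]
    grad = 0.0
    for micro_batch in micro_batches:
        # Backward pass on one micro-batch; scale so the sum equals the batch mean.
        grad += mse_gradient(w, micro_batch) / len(micro_batches)
    return grad                      # weights are updated only after this loop

samples = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5
full = mse_gradient(w, samples)
accumulated = accumulated_gradient(w, samples, micro_batch_size=2)
w_new = w - 0.01 * accumulated       # single optimizer step
print(full, accumulated, w_new)
```

For a loss that averages over independent samples, the accumulated gradient matches the full-batch gradient. Losses that use in-batch negatives behave differently: each micro-batch only sees its own negatives, so accumulation does not increase the number of negatives per example.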

##### Training

In [None]:
model = all_MiniLM_L6_v2
model_dir = os.path.join(model_utils.models_dir, "trained_miniLM_twitter")

In [None]:
train_dataset = Dataset.from_list(train_examples)

# The loss must wrap the same model instance that is passed to the trainer.
train_loss = losses.MultipleNegativesRankingLoss(model)

# Define training arguments
training_args = SentenceTransformerTrainingArguments(
    output_dir=model_dir,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    learning_rate=2e-5,
    weight_decay=0.01,
    optim="adamw_torch", 
    logging_steps=10,
    save_strategy="steps",
    save_steps=500,
    fp16=False,  # Disable mixed precision (not supported on CPU)
)

In [None]:
trainer = SentenceTransformerTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    loss=train_loss
)

In [None]:
trainer.train()

#### TSDAE (Transformer-based Denoising Autoencoder)

**Problem - Most models are encoder-only and do not include a decoder**

If your data is significantly different from the general text that the sentence transformer was trained on, TSDAE could help improve the model's performance. <br>

First, corrupt a sentence in some way (e.g., randomly masking words, deleting words, or shuffling words). Then, train the model to minimize the difference between the reconstructed sentence and the original sentence. <br>

<u>AutoEncoder</u> <br>
A neural network that learns to copy its input to its output. <br>
The encoder compresses the input into a lower-dimensional representation (latent space).   <br>
The decoder reconstructs the original input from the latent representation.   <br>
The idea is that by forcing the network to compress and then reconstruct the input, it learns to capture the most important features of the data.

<u>Denoising AutoEncoder</u> <br>
A variation of an autoencoder where the input is corrupted or "noised" before being fed into the encoder.  <br> 
The decoder is then tasked with reconstructing the original, uncorrupted input.   <br>
This forces the network to learn more robust and generalizable representations, as it has to learn to "denoise" the input.   

### Gemini

In [None]:
import os

# Read the key from the environment instead of hard-coding a secret in the notebook.
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]

In [None]:
import google.generativeai as genai

genai.configure(api_key=GEMINI_API_KEY)

result = genai.embed_content(
        model="models/text-embedding-004",
        content="What is the meaning of life?")

print(str(result['embedding']))

### Evaluating Embedding

#### Sentence Transformer Evaluation Methods

For our task, short text clustering, semantic similarity is the most relevant.

| Evaluator                     | Task Focus                     | Dataset Example         | Metric                     | Relevance to Short Text Clustering |
|-------------------------------|--------------------------------|-------------------------|----------------------------|-----------------------------|
| **EmbeddingSimilarityEvaluator** | Semantic similarity           | STS-B                   | Spearman/Pearson correlation | **High**: Directly assesses embedding quality for similarity, crucial for clustering. |
| **BinaryClassificationEvaluator** | Binary classification         | MRPC, NLI (binarized)   | Accuracy, F1, AP           | **Moderate**: Can test if embeddings distinguish similar/dissimilar pairs, indirectly useful. |
| **TripletEvaluator**          | Triplet ranking               | AllNLI                  | Triplet accuracy           | **Moderate**: Checks that anchors sit closer to positives than to negatives, a property clustering also relies on. |
| **LabelAccuracyEvaluator**    | Classification                | Any labeled dataset     | Classification accuracy    | **Low**: Focuses on classification, not clustering. |
| **MSEEvaluator**              | Knowledge distillation        | Teacher/student sentence pairs | Mean squared error         | **Low**: Compares student embeddings against teacher embeddings, not directly tied to clustering quality. |
| **ParaphraseMiningEvaluator** | Paraphrase detection          | Quora Question Pairs    | Precision, Recall, F1      | **Moderate**: Useful if clustering involves grouping paraphrases. |
| **InformationRetrievalEvaluator** | Information retrieval       | MS MARCO, BEIR          | NDCG, MRR, Recall@k        | **Low**: Focuses on ranking, not clustering. |
| **TranslationEvaluator**      | Cross-lingual alignment       | Tatoeba                 | Translation matching accuracy | **Low**: Only relevant for cross-lingual clustering. |
| **RerankingEvaluator**        | Reranking                     | MS MARCO                | MAP, MRR                   | **Low**: Focuses on ordering, not clustering. |
| **SequentialEvaluator**       | Multiple evaluations          | Depends on evaluators   | Combined metrics           | **High (if used with relevant evaluators)**: Combines multiple relevant evaluations. |

#### STS (Semantic Textual Similarity)

##### Explain

STS benchmarks typically involve pairs of sentences and human-annotated scores that indicate their similarity.   
Models are evaluated by comparing their predicted similarity scores to these human-generated scores.
Common evaluation metrics include Pearson's correlation and Spearman's rank correlation.   
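
The two metrics can be sketched without external libraries as shown below. The human scores and model similarities are invented values for illustration only (human scores on a 0 to 5 scale). Spearman is computed as the Pearson correlation of the ranks.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

human_scores = [0.0, 1.0, 2.5, 4.0, 5.0]            # hypothetical annotations
model_scores = [0.10, 0.20, 0.35, 0.80, 0.95]       # hypothetical cosine similarities

print(round(pearson(human_scores, model_scores), 4))
print(round(spearman(human_scores, model_scores), 4))
```

Spearman only looks at the ordering of the pairs, so a model that ranks every pair correctly gets a perfect score even if its similarity values are not linearly related to the human scores.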


##### Set Up

In [None]:
eval_dataset = load_dataset("sentence-transformers/stsb", split="validation")

# Initialize the evaluator
dev_evaluator = EmbeddingSimilarityEvaluator(
    sentences1=eval_dataset["sentence1"],
    sentences2=eval_dataset["sentence2"],
    scores=eval_dataset["score"],
    main_similarity=SimilarityFunction.COSINE,
    name="sts-dev",
)

##### all-mpnet-base-v2

In [None]:
results_all_mpnet_base_v2 = dev_evaluator(all_mpnet_base_v2)

In [None]:
results_all_mpnet_base_v2

##### all-MiniLM-L6-v2

In [None]:
results_all_MiniLM_L6_v2 = dev_evaluator(all_MiniLM_L6_v2)

In [None]:
results_all_MiniLM_L6_v2

##### all-mpnet-base-v2 trained SimCSE