# Selecting and Speeding up your Sentence Transformer Model

This notebook walks you through selecting a [Sentence Transformer](https://www.sbert.net/) model by weighing the tradeoffs between speed and performance. You can start by browsing Hugging Face and using the [Massive Text Embedding Benchmark](https://huggingface.co/blog/mteb) to select a model that fits your task and size needs.
Once you have selected a model, we will show you how to speed it up with a few techniques, including batching, ONNX, and an optimized container.

There is also an accompanying short video that covers the same content: [Selecting and Speeding up your Sentence Transformer Model](https://www.youtube.com/watch?v=WQqAN4k3R4g&t)

In [None]:
!pip install -q sentence-transformers[train] model2vec snowflake-ml-python==1.6.3

I am going to run the models on a GPU, so we will need this ONNX library.

In [None]:
!pip install -q optimum[onnxruntime-gpu] ## Going to use GPU ONNX for some comparisons

In [None]:
from sentence_transformers.models import StaticEmbedding
from sentence_transformers import SentenceTransformer, util
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, confusion_matrix
from datasets import load_dataset
import time
import random
import pandas as pd

## Get some data
I am going to use two datasets:
- A synthetically generated dataset of 100,000 random sentences 
- A dataset of Amazon reviews

Ideally you will evaluate on your own task / dataset.

Generate a synthetic dataset of random sentences

In [None]:
random.seed(10)
def generate_random_sentences(num_sentences, vocab):
    return [' '.join(random.choices(vocab, k=random.randint(5, 15))) for _ in range(num_sentences)]

# Create a simple vocabulary for random sentence generation
vocab = ['apple', 'banana', 'car', 'dog', 'elephant', 'fruit', 'giraffe', 'hat', 'ice', 'jungle', 'kite', 
         'lemon', 'monkey', 'notebook', 'orange', 'pizza', 'queen', 'river', 'sun', 'tree', 'umbrella', 
         'violin', 'water', 'x-ray', 'yacht', 'zebra']

# Generate 100,000 random sentences for scale testing - you can always increase this
num_sentences = 100000
sentences = generate_random_sentences(num_sentences, vocab)

#if you want to test using dataframes
df = pd.DataFrame(sentences)
df.columns = ['SENTENCES']
df.head()

Amazon review dataset from [Hugging Face text classification](https://huggingface.co/datasets?task_categories=task_categories:text-classification&sort=trending) datasets. For this notebook, I limit this to 100k for test/train. If you want to test at larger scales, you can easily change the `num_samples` variable.

In [None]:
ds = load_dataset("fancyzhx/amazon_polarity")

num_samples = 100000

# Extract the train and test splits
train_data = ds['train'].select(range(min(len(ds['train']), num_samples)))
test_data = ds['test'].select(range(min(len(ds['test']), num_samples)))

# Extract text features and labels from the dataset
train_texts = train_data['content']  # Assuming 'content' contains the text data
train_labels = train_data['label']   # Assuming 'label' contains the labels
test_texts = test_data['content']
test_labels = test_data['label']

# Evaluate Models on Speed and Accuracy

I have a script here that will evaluate the models on speed and accuracy. It measures
- Time to encode 100,000 sentences
- Time to compute pairwise cosine similarity on a 10,000-sentence subset (this step is commented out in the code by default)
- Time to encode Amazon review dataset
- Text classification metrics for the Amazon review dataset 
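
To make the pairwise cosine similarity measurement concrete, here is a minimal sketch of the computation in plain Python on tiny hand-made vectors. The helper names `cosine_similarity` and `pairwise_cosine` and the toy values are illustrative assumptions; in the notebook the same quantity would come from the `util` helpers in sentence-transformers applied to real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pairwise_cosine(vectors):
    """Return the full n x n similarity matrix as nested lists."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Toy "embeddings" (hypothetical values, not model output)
toy_embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

similarity_matrix = pairwise_cosine(toy_embeddings)
for row in similarity_matrix:
    print([round(score, 3) for score in row])
```

The matrix has n x n entries, so 10,000 sentences already produce 100 million scores. That quadratic growth is why the similarity step runs on a subset rather than on all 100,000 sentences.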

I evaluate three different types of models here (but feel free to add more):
- Static Embedding Model (similar to Word2Vec)
- MiniLM - small language model fine-tuned for sentence transformers
- GTE large model
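
If you want to compare several models without repeating the timing boilerplate, one option is a small helper like the sketch below. The names `benchmark` and `fake_encode` are hypothetical and not part of any library; `fake_encode` is only a stand-in so the example runs without downloading a model.

```python
import time

def benchmark(encode_fn, texts):
    """Time one call to encode_fn(texts).

    Returns (result, elapsed_seconds, texts_per_second).
    """
    start = time.perf_counter()
    result = encode_fn(texts)
    elapsed = time.perf_counter() - start
    throughput = len(texts) / elapsed if elapsed > 0 else float("inf")
    return result, elapsed, throughput

def fake_encode(texts):
    """Stand-in encoder (hypothetical): one number per text, its character count."""
    return [[float(len(t))] for t in texts]

sample_texts = ["apple banana car", "dog elephant", "fruit"]
vectors, seconds, per_second = benchmark(fake_encode, sample_texts)
print(f"Encoded {len(vectors)} texts in {seconds:.6f} s")
```

In the notebook you would pass `model.encode` (or a small wrapper that sets `batch_size`) in place of `fake_encode` and call `benchmark` once per candidate model.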

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')
#model = SentenceTransformer("tomaarsen/static-bert-uncased-gooaq")
#model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)


# Measure encoding time for random sentences
start_time = time.time()
embeddings = model.encode(sentences, convert_to_tensor=True)
end_time = time.time()
#print(f"Time taken to encode {num_sentences} sentences: {end_time - start_time} seconds")

# Compute pairwise similarities on a subset of the embeddings for testing
start_time = time.time()
#similarity_scores = util.pytorch_cos_sim(embeddings[:10000], embeddings[:10000])
end_time = time.time()
#print(f"Time taken to compute similarity for 10000 pairs: {end_time - start_time} seconds")

# Measure encoding time for Amazon Review dataset
start_time = time.time()
train_embeddings_model = model.encode(train_texts, convert_to_tensor=True)
test_embeddings_model = model.encode(test_texts, convert_to_tensor=True)
end_time = time.time()
print(f"Time taken to encode {len(train_texts) + len(test_texts)} reviews (train + test): {end_time - start_time} seconds")

train_embeddings = train_embeddings_model.cpu().numpy()
test_embeddings = test_embeddings_model.cpu().numpy()
# Scale the embeddings (standardization)
scaler = StandardScaler()
train_embeddings_scaled = scaler.fit_transform(train_embeddings)
test_embeddings_scaled = scaler.transform(test_embeddings)

# Train a logistic regression classifier with increased iterations and different solver
classifier = LogisticRegression(max_iter=500, solver='saga')  # Increased iterations and different solver
classifier.fit(train_embeddings_scaled, train_labels)

# Predict on the test set
predictions = classifier.predict(test_embeddings_scaled)

# Evaluate accuracy
accuracy = accuracy_score(test_labels, predictions)
classification_rep = classification_report(test_labels, predictions)
roc_auc = roc_auc_score(test_labels, classifier.predict_proba(test_embeddings_scaled)[:, 1], multi_class='ovr')
conf_matrix = confusion_matrix(test_labels, predictions)

# Display metrics
print(f'Accuracy: {accuracy:.4f}')
print("\nClassification Report:")
print(classification_rep)
print(f'ROC AUC: {roc_auc:.4f}')
print("\nConfusion Matrix:")
print(conf_matrix)

The results - you can see the tradeoffs between speed and accuracy.

Static Embedding Model - tomaarsen/static-bert-uncased-gooaq  
Time taken to encode Amazon Reviews: 19 seconds   
Text Classification Accuracy: 0.8009  
ROC AUC: 0.8808

Sentence Transformers - all-MiniLM-L6-v2  
Time taken to encode Amazon Reviews: 71 seconds    
Text Classification Accuracy: 0.8279  
ROC AUC: 0.9062

Sentence Transformers - Alibaba-NLP/gte-large-en-v1.5  
Time taken to encode Amazon Reviews: 1384 seconds   
Text Classification Accuracy: 0.9620   
ROC AUC: 0.9909 
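
To put the tradeoff in numbers, the sketch below compares each model against the static embedding model using only the timings and accuracies reported above. The dictionary layout and variable names are my own; nothing here is measured again.

```python
# Encoding time (seconds) and accuracy copied from the results listed above
results = {
    "static-bert-uncased-gooaq": {"seconds": 19, "accuracy": 0.8009},
    "all-MiniLM-L6-v2": {"seconds": 71, "accuracy": 0.8279},
    "gte-large-en-v1.5": {"seconds": 1384, "accuracy": 0.9620},
}

baseline = results["static-bert-uncased-gooaq"]
comparison = {}
for name, r in results.items():
    comparison[name] = {
        # How many times longer encoding takes than the static model
        "slowdown": r["seconds"] / baseline["seconds"],
        # Absolute accuracy difference versus the static model
        "accuracy_gain": r["accuracy"] - baseline["accuracy"],
    }
    print(
        f"{name}: {comparison[name]['slowdown']:.1f}x the encoding time, "
        f"{comparison[name]['accuracy_gain'] * 100:+.1f} accuracy points"
    )
```

Whether the extra encoding time of the larger model is worth the accuracy gain depends on your task, data volume, and latency budget.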

# Speeding up the model

Here are a few easy techniques you can use to speed up the models:
- Use batching (a larger `batch_size`)
- Use ONNX
- Use an optimized container, for example the [Text Embeddings Inference container](https://github.com/huggingface/text-embeddings-inference)
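
As a rough illustration of the batching technique: `encode` already splits its input into batches (the default `batch_size` is 32, as far as I know), so "batching" here really means raising the batch size so the GPU does more work per forward pass. The sketch below only counts forward passes and shows the chunking idea; `make_batches` is a hypothetical helper, not the library's implementation.

```python
import math

def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

total_sentences = 100_000
forward_passes = {}
# 32 is assumed to be the library default; 256 is the larger size to compare against
for batch_size in (32, 256):
    forward_passes[batch_size] = math.ceil(total_sentences / batch_size)
    print(f"batch_size={batch_size}: {forward_passes[batch_size]} forward passes")

# Small demonstration of the chunking itself
demo_batches = make_batches(["a", "b", "c", "d", "e"], batch_size=2)
print(demo_batches)
```

Fewer, larger forward passes usually make better use of a GPU, up to the point where memory runs out, so it is worth trying a few batch sizes on your own hardware.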

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

# Measure encoding time for random sentences
start_time = time.time()
embeddings = model.encode(df['SENTENCES'], convert_to_tensor=True)
end_time = time.time()
print(f"Time taken to encode {num_sentences} sentences: {end_time - start_time} seconds")

model = SentenceTransformer('all-MiniLM-L6-v2')
# Measure encoding time for random sentences
start_time = time.time()
embeddings = model.encode(df['SENTENCES'], convert_to_tensor=True, batch_size=256, show_progress_bar=True)
end_time = time.time()
print(f"Time taken to encode {num_sentences} sentences using batches: {end_time - start_time} seconds")

model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
# Measure encoding time for random sentences
start_time = time.time()
embeddings = model.encode(df['SENTENCES'], convert_to_tensor=True, show_progress_bar=True)
end_time = time.time()
print(f"Time taken to encode {num_sentences} sentences with ONNX: {end_time - start_time} seconds")

model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
# Measure encoding time for random sentences
start_time = time.time()
embeddings = model.encode(df['SENTENCES'], convert_to_tensor=True, batch_size=256, show_progress_bar=True)
end_time = time.time()
print(f"Time taken to encode {num_sentences} sentences with ONNX in Batches: {end_time - start_time} seconds")


Results:
Time taken to encode 100000 sentences: 17.990296125411987 seconds  
Time taken to encode 100000 sentences using batches: 5.701428413391113 seconds   
Time taken to encode 100000 sentences with ONNX: 12.300400495529175 seconds  
Time taken to encode 100000 sentences with ONNX in Batches: 8.326634883880615 seconds
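
To read these timings as speedups over the default run, here is a short calculation using the numbers above rounded to two decimals. The labels are my own shorthand for the four runs.

```python
# Timings (seconds) copied from the results above, rounded to two decimals
timings = {
    "default": 17.99,
    "batch_size=256": 5.70,
    "ONNX": 12.30,
    "ONNX + batch_size=256": 8.33,
}

baseline_seconds = timings["default"]
# Speedup = default time divided by the time of each variant
speedups = {name: baseline_seconds / seconds for name, seconds in timings.items()}
for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x")
```

In this particular run, the larger batch size on the default backend gave the biggest gain, and ONNX with batches was slower than batching alone. These numbers come from a single run on one GPU, so expect them to vary with hardware and model.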

# Visualize Embeddings

Visualizing the embeddings is very useful for understanding how they behave. This is a simple example using UMAP. You should make visualization part of your workflow.

In [None]:
!pip install umap-learn

In [None]:
import umap.umap_ as umap
import matplotlib.pyplot as plt

reducer = umap.UMAP()
embeddings_2d = reducer.fit_transform(test_embeddings_scaled)

plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c=test_labels, cmap='plasma')
plt.colorbar()
plt.title('UMAP of Embeddings')
plt.show()

# Register Sentence Transformers Model in Snowflake

In [None]:
from snowflake.ml.registry import registry
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark import Session
from sentence_transformers import SentenceTransformer

session = Session.builder.configs(SnowflakeLoginOptions("connection_name")).create()
reg = registry.Registry(session=session, database_name='rajiv', schema_name='demos')

model = SentenceTransformer('all-MiniLM-L6-v2')

### Run Predictions Locally

In [None]:
# Measure encoding time for random sentences
start_time = time.time()
embeddings = model.encode(sentences, convert_to_tensor=True)
end_time = time.time()
print(f"Time taken to encode {num_sentences} sentences using Local Model: {end_time - start_time} seconds")

### Run Predictions in Warehouse

In [None]:
# Log the model with conda dependencies for warehouses
conda_forge_model = reg.log_model(
    model,
    model_name="sentence_transformer_minilm",
    version_name='conda_forge_force',
    sample_input_data=sentences,  # Needed for determining signature of the model
    conda_dependencies=["sentence-transformers", "pytorch", "transformers"]
)

In [None]:
# Measure encoding time for random sentences
conda_model = reg.get_model("sentence_transformer_minilm").version("conda_forge_force")
start_time = time.time()
conda_model.run(sentences, function_name="encode")
end_time = time.time()
print(f"Time taken to encode {num_sentences} sentences using Warehouse: {end_time - start_time} seconds")

### Run Predictions in SPCS Container

This process first generates the container and then runs the predictions in the container.

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

# Log the model with pip dependencies for containers
pip_model = reg.log_model(
    model,
    model_name="sentence_transformer_minilm",
    version_name='pip',
    sample_input_data=sentences,  # Needed for determining signature of the model
    pip_requirements=["sentence-transformers", "torch", "transformers"],  # If you want to run this model in the Warehouse, you can use conda_dependencies instead
)

In [None]:
pip_model = reg.get_model("sentence_transformer_minilm").version("pip")
pip_model.create_service(service_name="sentence_transformer_minilmV2",
                  service_compute_pool="NOTEBOOK_GPU_NV_S",
                  image_repo="rajiv.public.images",
                  build_external_access_integration="RAJ_OPEN_ACCESS_INTEGRATION",
                  gpu_requests="1",
                  ingress_enabled=True)

In [None]:
# Measure encoding time for random sentences
start_time = time.time()
spcs_prediction = pip_model.run(sentences, function_name='encode', service_name="sentence_transformer_minilmV2")
end_time = time.time()
print(f"Time taken to encode {num_sentences} sentences using Deploy: {end_time - start_time} seconds")

In [None]:
session.sql("DROP SERVICE RAJIV.DEMOS.sentence_transformer_minilmV2").collect()