<a href="https://colab.research.google.com/github/michaelwnau/ai_academy_notebooks/blob/main/Faster_Embeddings_with_Optimum.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Check the full benchmark report on [Optimum Benchmark x MTEB](https://github.com/huggingface/optimum-benchmark/tree/main/examples/fast-mteb) 📊
CPU benchmarks are coming soon!

<p align="center">
  <img src="https://raw.githubusercontent.com/huggingface/optimum-benchmark/main/examples/fast-mteb/artifacts/forward_latency_plot.png" alt="Latency" width="45%"/>
  <img src="https://raw.githubusercontent.com/huggingface/optimum-benchmark/main/examples/fast-mteb/artifacts/forward_throughput_plot.png" alt="Throughput" width="45%"/>
</p>

In [None]:
#@title We'll be using Optimum's OnnxRuntime support with `CUDAExecutionProvider` [because it's fast while also supporting dynamic shapes](https://github.com/huggingface/optimum-benchmark/tree/main/examples/fast-mteb#notes)

!pip install optimum[onnxruntime-gpu]

Collecting optimum[onnxruntime-gpu]
  Downloading optimum-1.13.2.tar.gz (300 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 301.0/301.0 kB 4.5 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting coloredlogs (from optimum[onnxruntime-gpu])
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46.0/46.0 kB 5.6 MB/s eta 0:00:00
Collecting transformers[sentencepiece]>=4.26.0 (from optimum[onnxruntime-gpu])
  Downloading transformers-4.33.2-py3-none-any.whl 

In [None]:
#@title [`optimum-cli`](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization#optimizing-a-model-during-the-onnx-export) makes it extremely easy to export a model to ONNX and apply SOTA graph optimizations/fusions

!optimum-cli export onnx \
  --model BAAI/bge-base-en-v1.5 \
  --task feature-extraction \
  --optimize O4 \
  --device cuda \
  bge_auto_opt_O4 # output folder

Framework not specified. Using pt to export to ONNX.
Using the export variant default. Available variants are:
	- default: The default ONNX variant.
Using framework PyTorch: 2.0.1+cu118
Overriding 1 configuration item(s)
	- use_cache -> False
verbose: False, log level: Level.ERROR

2023-09-23 16:36:57.549953904 [W:onnxruntime:, session_state.cc:1162 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-09-23 16:36:57.549980232 [W:onnxruntime:, session_state.cc:1164 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Overridding for_gpu=False to for_gpu=True as half precision is available only on GPU.
Optimizing model...
2023-09-23 16:37:00.599797730 [W:onnxruntime:, session_state.cc:1162 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned t

In [None]:
#@title Based on the example given in [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5#using-huggingface-transformers)

import torch
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

# Load the tokenizer and the optimized ONNX model from the local export folder
tokenizer = AutoTokenizer.from_pretrained('/content/bge_auto_opt_O4')
ort_model = ORTModelForFeatureExtraction.from_pretrained('/content/bge_auto_opt_O4', provider="CUDAExecutionProvider")

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt').to("cuda")
# For s2p (short query to long passage) retrieval, prepend an instruction to each query (passages need no instruction)
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt').to("cuda")

# Compute token embeddings
with torch.no_grad():
    model_output = ort_model(**encoded_input)
    # Perform pooling. In this case, cls pooling.
    sentence_embeddings = model_output[0][:, 0]
# L2-normalize the embeddings so that dot products equal cosine similarities
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:")
print(sentence_embeddings)

Sentence embeddings:
tensor([[ 0.0251,  0.0052,  0.0221,  ...,  0.0092, -0.0089, -0.0150],
        [-0.0125,  0.0130,  0.0137,  ...,  0.0215,  0.0258,  0.0107]],
       device='cuda:0')
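
The pooling and normalization steps in the cell above can be illustrated without a GPU. The following is a minimal pure-Python sketch, not the notebook's actual tensor code: the helper names (`cls_pool`, `l2_normalize`, `dot`) and the toy numbers are made up for illustration. It mirrors `model_output[0][:, 0]` (keep the first token of each sequence) and `torch.nn.functional.normalize(..., p=2, dim=1)` (scale each row to unit length), and shows that the dot product of two normalized vectors is their cosine similarity.

```python
import math


def cls_pool(token_embeddings):
    """Keep the first ([CLS]) token vector of every sequence in the batch."""
    return [sequence[0] for sequence in token_embeddings]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy batch: 2 sequences, 2 tokens each, hidden size 2.
toy_output = [
    [[3.0, 4.0], [9.0, 9.0]],
    [[0.0, 2.0], [7.0, 7.0]],
]

pooled = cls_pool(toy_output)  # first token of each sequence
normalized = [l2_normalize(v) for v in pooled]

# For unit-length vectors, the dot product is the cosine similarity.
similarity = dot(normalized[0], normalized[1])
print(normalized)
print(similarity)
```

With real embeddings the same idea applies row by row: once the sentence embeddings are normalized, a matrix product such as `sentence_embeddings @ sentence_embeddings.T` yields pairwise cosine similarities.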
