# Local Embeddings

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/llm-workshop/blob/main/embeddings/1_embeddings_local.ipynb)


We will try a few embedding models running locally and compare their performance.

![](../media/embeddings-2.png)

## References

- https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface.html#huggingfaceembedding
- Leaderboard : https://huggingface.co/spaces/mteb/leaderboard
- Explaining leaderboard: https://huggingface.co/blog/mteb

## Colab Setup

In [1]:
# are we running in Colab?
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT running in Colab")
   RUNNING_IN_COLAB = False

if RUNNING_IN_COLAB:
   ! pip install  --default-timeout=100 sentence_transformers

NOT running in Colab


## GPU

GPUs can really accelerate LLM / embedding operations.  Let's make sure we are using a GPU if one is available.

In [2]:
import os
import torch

## To disable GPU and experiment, uncomment the following line
## Normally, you would want to use GPU, if one is available.
# os.environ["CUDA_VISIBLE_DEVICES"]=""

print ("using CUDA/GPU: ", torch.cuda.is_available())

for i in range(torch.cuda.device_count()):
   print("device ", i , torch.cuda.get_device_properties(i).name)

using CUDA/GPU:  True
device  0 NVIDIA GeForce RTX 3090
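Once we know whether a GPU is visible, the device can also be chosen explicitly instead of relying on the default. The following is a minimal sketch, not part of the original notebook: `pick_device` is a hypothetical helper name, and the fallback to `"cpu"` when PyTorch is missing is an assumption made so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch installed we simply report CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("selected device:", device)

# Hypothetical usage (downloads the model on first run):
# from sentence_transformers import SentenceTransformer
# embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=device)
```

Passing `device` explicitly makes it easy to rerun the same benchmark on CPU and GPU without touching `CUDA_VISIBLE_DEVICES`.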


In [3]:
import os, sys

# Add the parent directory to sys.path so that modules located there can be imported
this_dir = os.path.abspath('')
parent_dir = os.path.dirname(this_dir)
sys.path.append (os.path.abspath (parent_dir))

In [4]:
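# Optional: download models through a Hugging Face mirror.
# Set this before importing sentence_transformers so the setting takes effect.
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'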


### Let's try a few embedding models

See the Hugging Face embedding models (sentence transformers) here: https://huggingface.co/models?library=sentence-transformers&sort=trending

Here are a few select models for comparison, taken from the leaderboard: https://huggingface.co/spaces/mteb/leaderboard

| model name                              | overall score | model size | model params | embedding length | License  | url                                                            |
|-----------------------------------------|---------------|------------|--------------|------------------|----------|----------------------------------------------------------------|
| intfloat/e5-mistral-7b-instruct         | 66.x          | 15 GB      | 7.11 B       | 4096             | MIT      | https://huggingface.co/intfloat/e5-mistral-7b-instruct         |
| BAAI/bge-large-en-v1.5                  | 64.x          | 1.34 GB    | 335 M        | 1024             | MIT      | https://huggingface.co/BAAI/bge-large-en-v1.5                  |
| BAAI/bge-small-en-v1.5                  | 62.x          | 133 MB     | 33.5 M       | 384              | MIT      | https://huggingface.co/BAAI/bge-small-en-v1.5                  |
| sentence-transformers/all-mpnet-base-v2 | 57.8          | 438 MB     |              | 768              | Apache 2 | https://huggingface.co/sentence-transformers/all-mpnet-base-v2 |
| sentence-transformers/all-MiniLM-L12-v2 | 56.x          | 134 MB     |              | 384              | Apache 2 | https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 |
| sentence-transformers/all-MiniLM-L6-v2  | 56.x          | 91 MB      |              | 384              | Apache 2 | https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2  |
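The embedding length column also determines how much space each stored vector takes. As a rough sketch, assuming vectors are kept as 32-bit floats (4 bytes per dimension) and using a hypothetical corpus of one million vectors chosen only for illustration, the footprint can be estimated like this:

```python
# Embedding lengths taken from the table above.
embedding_lengths = {
    "intfloat/e5-mistral-7b-instruct": 4096,
    "BAAI/bge-large-en-v1.5": 1024,
    "sentence-transformers/all-mpnet-base-v2": 768,
    "BAAI/bge-small-en-v1.5": 384,
}

BYTES_PER_FLOAT32 = 4      # assumption: vectors are stored as float32
NUM_VECTORS = 1_000_000    # hypothetical corpus size, for illustration only


def storage_mb(dim, num_vectors=NUM_VECTORS):
    """Approximate storage in MB (10^6 bytes) for num_vectors embeddings of length dim."""
    return dim * BYTES_PER_FLOAT32 * num_vectors / 1e6


for name, dim in embedding_lengths.items():
    print(f"{name:42s} {dim:5d} dims  ~{storage_mb(dim):,.0f} MB per million vectors")
```

Under these assumptions a 384-dimension model needs about 1.5 GB per million vectors, while a 4096-dimension model needs about 16 GB.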

### Benchmark

Benchmarks are a fun way to evaluate the speed of embedding models.

Here we use IPython's `%timeit` magic command.  It runs the code repeatedly and reports the mean and standard deviation of the execution time.

The code cell at the end of this section calculates the benchmark numbers.
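`%timeit` only works inside IPython or Jupyter. For a plain Python script, a similar measurement can be sketched with the standard library `timeit` module. In the sketch below, `time_call` and `fake_encode` are hypothetical names, and `fake_encode` is a stand-in workload so the snippet runs without downloading a model. Unlike `%timeit`, which picks the loop count automatically, the loop count here is fixed.

```python
import statistics
import timeit


def time_call(fn, number=100, repeat=7):
    """Time fn() in `repeat` runs of `number` loops each.

    Returns (mean, std_dev) of the per-loop time in milliseconds.
    """
    totals = timeit.repeat(fn, number=number, repeat=repeat)
    per_loop_ms = [t / number * 1000 for t in totals]
    return statistics.mean(per_loop_ms), statistics.stdev(per_loop_ms)


# Stand-in workload so the sketch runs without a model.
# In the notebook this would be: lambda: embed_model.encode("Hello World!")
def fake_encode():
    return sum(i * i for i in range(1000))


mean_ms, std_ms = time_call(fake_encode)
print(f"{mean_ms:.3f} ms ± {std_ms:.3f} ms per loop (mean ± std. dev. of 7 runs, 100 loops each)")
```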

Here are the benchmark numbers for encoding "Hello World" on my desktop:

- Ubuntu 22.04 with Nvidia CUDA drivers (Driver Version: 525.147.05   CUDA Version: 12.0)
- Intel 8 core CPU @ 3.6 GHz
- 32 GB Memory
- Nvidia GeForce 2070 with 8 GB memory

As you can see, larger models tend to take longer to execute.

You can also see that a GPU can really reduce execution time!

In the end, it is a trade-off between **accuracy and performance**.

| Model                                   | model size | embedding length | Execution time in ms (with GPU) | Execution time in ms (without GPU) |
|-----------------------------------------|------------|------------------|---------------------------------|------------------------------------|
| BAAI/bge-large-en-v1.5                  | 1.34 GB    | 1024             | 12.5                            | 67.7                               |
| BAAI/bge-small-en-v1.5                  | 133 MB     | 384              | 6.5                             | 9.4                                |
| sentence-transformers/all-mpnet-base-v2 | 438 MB     | 768              | 6.65                            | 20.5                               |
| sentence-transformers/all-MiniLM-L12-v2 | 134 MB     | 384              | 6.74                            | 9.4                                |
| sentence-transformers/all-MiniLM-L6-v2  | 91 MB      | 384              | 3.68                            | 4.8                                |
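To make the GPU effect easier to compare across models, the speedup can be computed as CPU time divided by GPU time. This is a small sketch using only the numbers from the table above; the variable names are illustrative.

```python
# (GPU ms, CPU ms) copied from the benchmark table above.
timings_ms = {
    "BAAI/bge-large-en-v1.5": (12.5, 67.7),
    "BAAI/bge-small-en-v1.5": (6.5, 9.4),
    "sentence-transformers/all-mpnet-base-v2": (6.65, 20.5),
    "sentence-transformers/all-MiniLM-L12-v2": (6.74, 9.4),
    "sentence-transformers/all-MiniLM-L6-v2": (3.68, 4.8),
}

# Speedup = time without GPU / time with GPU.
speedups = {name: cpu / gpu for name, (gpu, cpu) in timings_ms.items()}

# Print models from largest to smallest speedup.
for name, speedup in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:42s} {speedup:4.1f}x faster on GPU")
```

For this single short input, the largest model benefits the most (roughly 5.4x for `BAAI/bge-large-en-v1.5`), while the smallest model gains only about 1.3x.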

In [5]:
from sentence_transformers import SentenceTransformer
import time
import timeit

embedding_models = [
    'sentence-transformers/all-mpnet-base-v2' ,
    'sentence-transformers/all-MiniLM-L12-v2' ,
    'sentence-transformers/all-MiniLM-L6-v2' ,
    'BAAI/bge-small-en-v1.5' ,
    'BAAI/bge-large-en-v1.5' ,
]


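for model in embedding_models:
    print ("===== model : ", model)
    # Load the model (downloaded from Hugging Face on first use)
    embed_model = SentenceTransformer(model)

    # Encode once to read the embedding length; this also warms up the model
    embeddings = embed_model.encode("Hello World!")

    print(f'embeding_length={len(embeddings):,}', flush=True)
    # IPython magic: time repeated encode() calls
    %timeit (embed_model.encode("Hello World!"))
    print()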




===== model :  sentence-transformers/all-mpnet-base-v2
embeding_length=768
7.84 ms ± 72.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

===== model :  sentence-transformers/all-MiniLM-L12-v2
embeding_length=384
6.92 ms ± 32.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

===== model :  sentence-transformers/all-MiniLM-L6-v2
embeding_length=384
3.98 ms ± 138 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

===== model :  BAAI/bge-small-en-v1.5
embeding_length=384
6.57 ms ± 167 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

===== model :  BAAI/bge-large-en-v1.5
embeding_length=1,024
13.2 ms ± 512 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

