# Problems Regarding Tracing Model with Dense Layer in ONNX

## Background Context

We want to trace a pretrained sentence-transformer model into TorchScript and ONNX formats and upload it to the artifact server, so that users can deploy the model and generate embeddings with the following ml-commons API calls: `register` -> `deploy` -> `generate_embedding`.

```
import opensearch_py_ml as oml
from opensearchpy import OpenSearch
from opensearch_py_ml.ml_models import SentenceTransformerModel
from opensearch_py_ml.ml_commons import MLCommonClient

client = get_os_client()  # helper defined elsewhere (not shown) that returns an OpenSearch client
ml_client = MLCommonClient(client)

model_id = "sentence-transformers/distiluse-base-multilingual-cased-v1"
folder_path='sentence-transformers-onxx/distiluse-base-multilingual-cased-v1'

pre_trained_model = SentenceTransformerModel(model_id=model_id, folder_path=folder_path, overwrite=True)
model_path_onnx = pre_trained_model.save_as_onnx(model_id=model_id)

model_config_path_onnx = 'sentence-transformers-onxx/distiluse-base-multilingual-cased-v1/ml-commons_model_config.json'
ml_client.register_model(model_path_onnx, model_config_path_onnx, isVerbose=True)

input_sentences = ["first sentence", "second sentence"]
results = ml_client.generate_embedding("hv0c7IkBVsgBeq9g7M_J", input_sentences)
# Collect one embedding vector per input sentence from the response
onnx_embedding = [
    results["inference_results"][i]["output"][0]["data"]
    for i in range(len(input_sentences))
]

```

This should give the same output as the following Hugging Face `encode` call.
```
from sentence_transformers import SentenceTransformer
input_sentences = ["first sentence", "second sentence"]

huggingface_model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v1')
huggingface_embeddings = huggingface_model.encode(input_sentences)
print(huggingface_embeddings)

```

Note the distinction between our `SentenceTransformerModel` class and the Hugging Face `SentenceTransformer` class.

The problem is that each vector in `onnx_embedding` has shape `(768,)`, while each vector in `huggingface_embeddings` has shape `(512,)`. We do not see this mismatch with `torch_script_embedding`.

## Findings

Based on the mismatch in output shape and the model architecture below, the problem is likely because the `Dense` layer is not part of the `onnx` model.

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)
```

At first, I thought the problem was in `convert` from `transformers.convert_graph_to_onnx`, which we use to trace the `onnx` file. However, I realized that for other models whose last layer is a pooling layer, the `onnx` file does not include the pooling layer either. Moreover, [this line of code](https://github.com/opensearch-project/ml-commons/blob/7467a692e3cf3d7cdf2b5db0b21c11e67fcf5621/ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/text_embedding/ONNXSentenceTransformerTextEmbeddingTranslator.java#L83C19-L83C32) in ML-Commons shows that the `onnx` model relies on post-processing in ML-Commons to generate embeddings from the model output, while `torch_script` does not. (For an example of how others load an `onnx` model and generate embeddings by calling a pooling function on the model output, see [here](https://github.com/SidJain1412/sentence-transformers/blob/master/examples/onnx/onnx_example.ipynb).) Hence, the problem is that we have not applied the `dense` function. We should add it to the post-processing step, as we do for `pooling` and `normalize`.
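
To make the missing step concrete, here is a minimal NumPy sketch of what a dense post-processing step computes for this architecture: `tanh(pooled @ W^T + b)`, mapping 768 features to 512. The function name `apply_dense` is hypothetical, and the weight and bias values are random placeholders, not the trained parameters of the model.

```python
import numpy as np

def apply_dense(pooled, weight, bias):
    """Apply a linear layer followed by Tanh: tanh(pooled @ W^T + b).

    weight has shape (out_features, in_features), the same layout
    that torch.nn.Linear uses.
    """
    return np.tanh(pooled @ weight.T + bias)

rng = np.random.default_rng(0)
# Stand-in for the pooled ONNX output of 3 sentences (batch, 768)
pooled = rng.standard_normal((3, 768)).astype(np.float32)
# Placeholder parameters, NOT the trained weights of the model
weight = rng.standard_normal((512, 768)).astype(np.float32) * 0.01
bias = np.zeros(512, dtype=np.float32)

dense_out = apply_dense(pooled, weight, bias)
print(dense_out.shape)  # (3, 512)
```

The sketch only illustrates the shape change from 768 to 512; matching the Hugging Face output additionally requires the trained weights.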

## Experiments with Post-Processing in Python

Based on the above findings, I tried loading the `onnx` model and applying `pooling` and `dense` layers to its output.

### Case A: Models with Only Pooling — Working fine

Below is an overview of the code for tracing models with only a pooling layer in `onnx`. This works fine.

I. Get Inputs
```
from transformers import AutoTokenizer

input_sentences = ["first sentence", "second sentence", "very very long random sentence for testing"]
autotokenizer = AutoTokenizer.from_pretrained(model_id)
auto_features = autotokenizer(
            input_sentences, return_tensors="pt", padding=True, truncation=True
        )
```

II. Load `onnx` Model & Generate Outputs
```
from os import environ
from psutil import cpu_count
from onnxruntime import InferenceSession, SessionOptions, get_all_providers

environ["OMP_NUM_THREADS"] = str(cpu_count(logical=True))
environ["OMP_WAIT_POLICY"] = 'ACTIVE'


ort_session = InferenceSession(model_path, providers=["CPUExecutionProvider"])

def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

ort_inputs = {k: v.cpu().detach().numpy() for k, v in auto_features.items()}
ort_outs = ort_session.run(None, ort_inputs)
```

III. Add Pooling Layer to Get Sentence Embeddings
```
import torch
from sentence_transformers.models import Pooling

pooling_layer = Pooling(768, pooling_mode_cls_token=True, pooling_mode_mean_tokens=False)
features = {
    'token_embeddings':  torch.from_numpy(ort_outs[0]),
    'attention_mask': torch.from_numpy(ort_inputs['attention_mask'])
}
pooling_layer.forward(features)
sentence_embeddings = features['sentence_embedding']

embedding_data_onnx = [
            sentence_embeddings[i]
            for i in range(len(input_sentences))
        ]
```

IV. Verify Embedding with Embeddings Encoded with Hugging Face Model 
```
import numpy as np
from sentence_transformers import SentenceTransformer

original_pre_trained_model = SentenceTransformer(model_id) # From Huggingface
original_embedding_data = list(
    original_pre_trained_model.encode(input_sentences, convert_to_numpy=True)
)
        
for i in range(len(input_sentences)):
    print(i)
    print(np.testing.assert_allclose(original_embedding_data[i], embedding_data_onnx[i], rtol=1e-03, atol=1e-05))
```
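
Note that `np.testing.assert_allclose` returns `None` on success, so the `print` call above shows `None` when the embeddings match and raises `AssertionError` otherwise. As an optional alternative, the following sketch reports the maximum absolute difference instead of raising. The helper name `report_mismatch` is hypothetical and not part of any library.

```python
import numpy as np

def report_mismatch(reference, candidate, rtol=1e-3, atol=1e-5):
    """Return (is_close, max_abs_diff) for two embedding vectors."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    # A shape mismatch (e.g. 768 vs 512) can never be "close"
    if reference.shape != candidate.shape:
        return False, float("inf")
    max_abs_diff = float(np.max(np.abs(reference - candidate)))
    is_close = bool(np.allclose(reference, candidate, rtol=rtol, atol=atol))
    return is_close, max_abs_diff

a = np.array([0.1, 0.2, 0.3])
print(report_mismatch(a, a + 1e-7))     # within tolerance
print(report_mismatch(a, a + 1.0))      # outside tolerance
print(report_mismatch(a, np.zeros(2)))  # different shapes
```

Returning the difference rather than raising makes it easier to inspect every sentence in the loop, including the ones after the first failure.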

### Case B: Models with Pooling & Dense (First Attempt) — Failed

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)
```
Model: https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1

#### 0. Trace the Model

In [1]:
import os
import sys
sys.path.append(os.path.abspath(os.path.join('../../..')))

import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings("ignore", message="Unverified HTTPS request")
warnings.filterwarnings("ignore", message="TracerWarning: torch.tensor")
warnings.filterwarnings("ignore", message="using SSL with verify_certs=False is insecure.")

import opensearch_py_ml as oml
from opensearchpy import OpenSearch
from opensearch_py_ml.ml_models import SentenceTransformerModel
# import mlcommon to later register the model to OpenSearch Cluster
from opensearch_py_ml.ml_commons import MLCommonClient


In [2]:
import os
from pathlib import Path
from sentence_transformers import SentenceTransformer
from transformers.convert_graph_to_onnx import convert

model_id = "sentence-transformers/distiluse-base-multilingual-cased-v1"
folder_path='sentence-transformers-onxx/distiluse-base-multilingual-cased-v1'
model_name = str(model_id.split("/")[-1] + ".onnx")
model_path = os.path.join(folder_path, "onnx", model_name)

# model = SentenceTransformer(model_id)
# folder_path='sentence-transformers-onxx/distiluse-base-multilingual-cased-v1'

# model_name = str(model_id.split("/")[-1] + ".onnx")

# model_path = os.path.join(folder_path, "onnx", model_name)
        
# convert(
#     framework="pt",
#     model=model_id,
#     output=Path(model_path),
#     opset=15,
# )

pre_trained_model = SentenceTransformerModel(folder_path=folder_path, overwrite=True)
zip_path = pre_trained_model.save_as_onnx(model_id=model_id)

ONNX opset version set to: 15
Loading pipeline (model: sentence-transformers/distiluse-base-multilingual-cased-v1, tokenizer: sentence-transformers/distiluse-base-multilingual-cased-v1)
Creating folder sentence-transformers-onxx/distiluse-base-multilingual-cased-v1/onnx
Using framework PyTorch: 1.13.1+cu117
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch', 1: 'sequence'}
Ensuring inputs are in correct order
head_mask is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask']



model file is saved to  sentence-transformers-onxx/distiluse-base-multilingual-cased-v1/onnx/distiluse-base-multilingual-cased-v1.onnx
zip file is saved to  sentence-transformers-onxx/distiluse-base-multilingual-cased-v1/distiluse-base-multilingual-cased-v1.zip 



#### I. Get Inputs

In [3]:
from transformers import AutoTokenizer

input_sentences = ["first sentence", "second sentence", "very very long random sentence for testing"]
autotokenizer = AutoTokenizer.from_pretrained(model_id)
auto_features = autotokenizer(
            input_sentences, return_tensors="pt", padding=True, truncation=True
        )
auto_features

{'input_ids': tensor([[  101, 10422, 49219,   102,     0,     0,     0,     0,     0],
        [  101, 11132, 49219,   102,     0,     0,     0,     0,     0],
        [  101, 12558, 12558, 11695, 61952, 49219, 10142, 38306,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]])}

#### II. Load `onnx` Model & Generate Outputs

In [4]:
from os import environ
from psutil import cpu_count
from onnxruntime import InferenceSession, SessionOptions, get_all_providers

environ["OMP_NUM_THREADS"] = str(cpu_count(logical=True))
environ["OMP_WAIT_POLICY"] = 'ACTIVE'

ort_session = InferenceSession(model_path, providers=["CPUExecutionProvider"])

def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

ort_inputs = {k: v.cpu().detach().numpy() for k, v in auto_features.items()}
ort_outs = ort_session.run(None, ort_inputs)

In [5]:
ort_inputs

{'input_ids': array([[  101, 10422, 49219,   102,     0,     0,     0,     0,     0],
        [  101, 11132, 49219,   102,     0,     0,     0,     0,     0],
        [  101, 12558, 12558, 11695, 61952, 49219, 10142, 38306,   102]]),
 'attention_mask': array([[1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [6]:
print(len(ort_outs))
print(ort_outs[0].shape)

1
(3, 9, 768)


#### III. Add Pooling Layer (Mean Pooling)

In [7]:
import torch
from sentence_transformers.models import Pooling

pooling_layer = Pooling(768, pooling_mode_mean_tokens=True)
features = {
    'token_embeddings':  torch.from_numpy(ort_outs[0]),
    'attention_mask': torch.from_numpy(ort_inputs['attention_mask'])
}
pooling_layer.forward(features)
print(features.keys())
print(features['sentence_embedding'].shape)

dict_keys(['token_embeddings', 'attention_mask', 'sentence_embedding'])
torch.Size([3, 768])
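
For reference, mean pooling with an attention mask can be written in a few lines of NumPy. This is a sketch of the computation, assuming the usual masked-mean definition (sum of unmasked token embeddings divided by the number of unmasked tokens). The function name `mean_pool` and the toy inputs are illustrative only.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over the sequence axis, ignoring padding."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over features
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for fully masked rows
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy input: 2 sentences, 3 tokens each, 4 features per token
token_embeddings = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
attention_mask = np.array([[1, 1, 0],   # last token is padding
                           [1, 1, 1]])
pooled = mean_pool(token_embeddings, attention_mask)
print(pooled.shape)  # (2, 4)
print(pooled[0])     # mean of the first two tokens only
```

Applied to `ort_outs[0]` and `ort_inputs['attention_mask']`, the same function would turn the `(3, 9, 768)` token embeddings into a `(3, 768)` array.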


#### IV. Add Dense Layer

In [8]:
import torch
from sentence_transformers.models import Dense
dense_layer = Dense(768, 512, bias=True, activation_function=torch.nn.modules.activation.Tanh())
dense_layer.forward(features)

{'token_embeddings': tensor([[[-9.3187e-02,  4.3200e-02,  1.4325e-01,  ..., -1.1029e-01,
            9.4209e-02, -3.6948e-03],
          [-1.0118e-01,  2.4131e-02,  1.6269e-01,  ..., -7.6247e-02,
            5.1720e-02,  2.4384e-02],
          [-1.1288e-01,  1.0230e-01,  7.9914e-02,  ..., -9.0805e-02,
            1.2735e-01, -6.1911e-02],
          ...,
          [-3.1557e-02, -1.0514e-03,  3.4721e-02,  ..., -1.7142e-01,
            1.1406e-01,  2.1534e-02],
          [-1.5114e-02,  3.9489e-02,  5.6255e-02,  ..., -1.1204e-01,
            1.0426e-01, -1.5420e-04],
          [-1.2633e-02,  1.5307e-02,  2.2421e-02,  ..., -1.2982e-01,
            1.1010e-01,  9.1765e-03]],
 
         [[-4.6544e-02,  5.0517e-02,  9.9694e-02,  ..., -4.3155e-02,
            9.5299e-02,  3.0037e-02],
          [-7.2471e-02,  8.0396e-02,  8.7466e-02,  ..., -2.0926e-02,
            3.2088e-02,  3.8028e-02],
          [-6.9155e-02,  1.0407e-01,  3.6066e-02,  ..., -2.7982e-02,
            1.1376e-01, -1.7804e-02],

In [9]:
features['sentence_embedding'].shape

torch.Size([3, 512])

In [10]:
embedding_data_onnx = [
            features['sentence_embedding'][i].cpu().detach().numpy()
            for i in range(len(input_sentences))
        ]

#### V. Verify Embedding with Embeddings Encoded with Hugging Face Model 

In [11]:
import numpy as np
from sentence_transformers import SentenceTransformer

original_pre_trained_model = SentenceTransformer(model_id) # From Huggingface
original_embedding_data = list(
    original_pre_trained_model.encode(input_sentences, convert_to_numpy=True)
)
        
for i in range(len(input_sentences)):
    print(i)
    print(np.testing.assert_allclose(original_embedding_data[i], embedding_data_onnx[i], rtol=1e-03, atol=1e-05))

0


AssertionError: 
Not equal to tolerance rtol=0.001, atol=1e-05

Mismatched elements: 512 / 512 (100%)
Max absolute difference: 0.2255789
Max relative difference: 53.520542
 x: array([ 1.097567e-02,  6.483248e-02, -4.571173e-02,  9.350104e-02,
       -2.485733e-02, -3.051357e-02,  8.830560e-03,  1.258769e-02,
        8.662871e-03, -4.904142e-02,  5.009779e-04, -6.247674e-03,...
 y: array([ 0.029242,  0.044274, -0.048109,  0.005339,  0.048379, -0.061276,
        0.009061, -0.022101, -0.013753,  0.003824,  0.001493,  0.019158,
        0.051583,  0.003758, -0.041029,  0.108596,  0.037499, -0.06195 ,...

### Case B: Models with Pooling & Dense (Second Attempt) — Succeeded

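In this attempt, I changed how the `Dense` layer is initialized: instead of constructing it with `Dense(...)`, I load it from the saved `2_Dense` folder with `Dense.load()`.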

#### III. Add Pooling Layer (Mean Pooling) [No Change]

In [14]:
import torch
from sentence_transformers.models import Pooling

pooling_layer = Pooling(768, pooling_mode_mean_tokens=True)
features_2 = {
    'token_embeddings':  torch.from_numpy(ort_outs[0]),
    'attention_mask': torch.from_numpy(ort_inputs['attention_mask'])
}
pooling_layer.forward(features_2)
print(features_2.keys())
print(features_2['sentence_embedding'].shape)

dict_keys(['token_embeddings', 'attention_mask', 'sentence_embedding'])
torch.Size([3, 768])


#### IV. Add Dense Layer with `load()` method of `Dense` class

In [15]:
import torch
from sentence_transformers.models import Dense
loaded_dense_layer = Dense.load(folder_path + '/2_Dense')
loaded_dense_layer.forward(features_2)

In [16]:
features_2['sentence_embedding'].shape

torch.Size([3, 512])

In [19]:
embedding_data_onnx_2 = [
            features_2['sentence_embedding'][i].cpu().detach().numpy()
            for i in range(len(input_sentences))
        ]

#### V. Verify Embedding with Embeddings Encoded with Hugging Face Model  [No Change]

In [20]:
import numpy as np
from sentence_transformers import SentenceTransformer

original_pre_trained_model = SentenceTransformer(model_id) # From Huggingface
original_embedding_data = list(
    original_pre_trained_model.encode(input_sentences, convert_to_numpy=True)
)
        
for i in range(len(input_sentences)):
    print(i)
    print(np.testing.assert_allclose(original_embedding_data[i], embedding_data_onnx_2[i], rtol=1e-03, atol=1e-05))

0
None
1
None
2
None


## Questions 
* Why do we need to use `load()` method? Why isn't initializing `Dense()` with parameters sufficient?
* What is in [`2_Dense/pytorch_model.bin`](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1/tree/main/2_Dense)?
* Does this mean we should also upload `2_Dense/pytorch_model.bin` to the model hub, in addition to `.onnx` and `tokenizer.json`?

```
dense_layer = Dense(768, 512, bias=True, activation_function=torch.nn.modules.activation.Tanh())
loaded_dense_layer = Dense.load(folder_path + '/2_Dense')
```
Note:
- See the `Dense` class definition here: https://github.com/SidJain1412/sentence-transformers/blob/master/sentence_transformers/models/Dense.py
- The folder at `folder_path + '/2_Dense'` corresponds to https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1/tree/main/2_Dense

In [21]:
dense_layer

Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})

In [22]:
loaded_dense_layer

Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
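
One likely explanation, stated here as an assumption to verify rather than a confirmed answer: constructing `Dense(768, 512, ...)` creates a layer with randomly initialized weights, whereas `Dense.load()` restores the trained weights stored in `pytorch_model.bin`. The printed representation only shows the configuration, so the two layers look identical even though their parameters differ. The sketch below illustrates the same effect with plain `torch.nn.Linear` layers, used as stand-ins for the `Dense` layers.

```python
import torch

torch.manual_seed(0)
# Stand-in for the layer restored from 2_Dense/pytorch_model.bin
trained = torch.nn.Linear(768, 512, bias=True)
# Stand-in for a layer constructed from its configuration only
fresh = torch.nn.Linear(768, 512, bias=True)

# Same configuration, hence the same printed representation
same_repr = repr(trained) == repr(fresh)
# Different random initialization, hence different weights
same_weights_before = torch.equal(trained.weight, fresh.weight)

# Copying the state dict is what loading a checkpoint effectively does
fresh.load_state_dict(trained.state_dict())
same_weights_after = torch.equal(trained.weight, fresh.weight)

print(same_repr, same_weights_before, same_weights_after)  # True False True
print(list(trained.state_dict().keys()))  # ['weight', 'bias']
```

If this is indeed the cause, the trained dense weights would need to be available wherever the dense step is applied during post-processing.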

## Resources:

- https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/models/Dense.py#L32
- https://github.com/huggingface/notebooks/blob/main/examples/onnx-export.ipynb
- https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1