# Problems Regarding Tracing Model with Dense Layer in ONNX

## Background Context

We want to trace a pretrained sentence-transformer model to TorchScript and ONNX and upload it to the artifact server, so that users can deploy the model and generate embeddings through the ml-commons API: `register` -> `deploy` -> `generate_embedding`.

```
import opensearch_py_ml as oml
from opensearchpy import OpenSearch
from opensearch_py_ml.ml_models import SentenceTransformerModel
from opensearch_py_ml.ml_commons import MLCommonClient

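# get_os_client() is a helper defined elsewhere; it is assumed to return a configured OpenSearch client
client = get_os_client()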
ml_client = MLCommonClient(client)

model_id = "sentence-transformers/distiluse-base-multilingual-cased-v1"
folder_path='sentence-transformers-onxx/distiluse-base-multilingual-cased-v1'

pre_trained_model = SentenceTransformerModel(model_id=model_id, folder_path=folder_path, overwrite=True)
model_path_onnx = pre_trained_model.save_as_onnx(model_id=model_id)

model_config_path_onnx = 'sentence-transformers-onxx/distiluse-base-multilingual-cased-v1/ml-commons_model_config.json'
ml_client.register_model(model_path_onnx, model_config_path_onnx, isVerbose=True)

input_sentences = ["first sentence", "second sentence"]
embedding_output_onnx = ml_client.generate_embedding("hv0c7IkBVsgBeq9g7M_J", input_sentences)
onnx_embedding = [
    embedding_output_onnx["inference_results"][i]["output"][0]["data"]
    for i in range(len(input_sentences))
]

```

This should give the same output as the following Hugging Face `encode` call.
```
from sentence_transformers import SentenceTransformer
input_sentences = ["first sentence", "second sentence"]

huggingface_model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v1')
huggingface_embeddings = huggingface_model.encode(input_sentences)
print(huggingface_embeddings)

```

Note the distinction between our `SentenceTransformerModel` class and the Hugging Face `SentenceTransformer` class.

The problem we face is that each embedding in `onnx_embedding` has shape `(768,)`, while each embedding in `huggingface_embeddings` has shape `(512,)`. We do not face this problem with `torch_script_embedding`.

## Findings

Based on the mismatch in output shape and the model architecture below, the likely cause is that the `Dense` layer is not part of the `onnx` model.

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)
```

At first, I thought the problem was in `convert` from `transformers.convert_graph_to_onnx`, which we use to produce the `onnx` file. However, I then noticed that for other models whose last layer is a pooling layer, the `onnx` file does not include the pooling layer either. Moreover, [this line of code](https://github.com/opensearch-project/ml-commons/blob/7467a692e3cf3d7cdf2b5db0b21c11e67fcf5621/ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/text_embedding/ONNXSentenceTransformerTextEmbeddingTranslator.java#L83C19-L83C32) in ML-Commons shows that the `onnx` model relies on post-processing in ML-Commons to generate the embedding from the model output, while `torch_script` does not. (For an example of how others load an `onnx` model and generate embeddings by calling a pooling function on the model output, see [here](https://github.com/SidJain1412/sentence-transformers/blob/master/examples/onnx/onnx_example.ipynb).) Hence, the problem is that we have not applied the `dense` function. We should add it to the post-processing step, as we do for `pooling` and `normalize`.
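The post-processing order discussed here (pooling first, then the dense projection) can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the ML-Commons implementation: `mean_pool` and `dense` are hypothetical helper names, and the `weight` and `bias` values are random placeholders. In practice they would have to come from the model's saved dense module.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions (mask == 0)."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

def dense(x, weight, bias=None, activation=np.tanh):
    """Linear projection followed by an activation; weight has shape (out, in)."""
    y = x @ weight.T
    if bias is not None:
        y = y + bias
    return activation(y)

rng = np.random.default_rng(0)
# Stand-in for the ONNX model output: 3 sentences, 9 tokens, 768 features.
token_embeddings = rng.normal(size=(3, 9, 768)).astype(np.float32)
attention_mask = np.array([[1] * 4 + [0] * 5, [1] * 4 + [0] * 5, [1] * 9])
# Placeholder weights (assumption): real values must be loaded from the model.
weight = rng.normal(scale=0.02, size=(512, 768)).astype(np.float32)
bias = np.zeros(512, dtype=np.float32)

pooled = mean_pool(token_embeddings, attention_mask)
sentence_embeddings = dense(pooled, weight, bias)
print(pooled.shape, sentence_embeddings.shape)  # (3, 768) (3, 512)
```

The point of the sketch is the shape change: pooling alone yields 768-dimensional vectors, and only the dense step brings them down to 512.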

# Problems

Based on the above findings, I tried loading the `onnx` model and applying pooling and dense layers to its output, but there is still a mismatch.

### Case A: Models with Only Pooling — Working fine

Below is an overview of the code for models with only a pooling layer in `onnx`. This works fine.

I. Get Inputs
```
from transformers import AutoTokenizer

input_sentences = ["first sentence", "second sentence", "very very long random sentence for testing"]
autotokenizer = AutoTokenizer.from_pretrained(model_id)
auto_features = autotokenizer(
            input_sentences, return_tensors="pt", padding=True, truncation=True
        )
```

II. Load `onnx` Model & Generate Outputs
```
from os import environ
from psutil import cpu_count
from onnxruntime import InferenceSession, SessionOptions, get_all_providers

environ["OMP_NUM_THREADS"] = str(cpu_count(logical=True))
environ["OMP_WAIT_POLICY"] = 'ACTIVE'


ort_session = InferenceSession(model_path, providers=["CPUExecutionProvider"])

def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

ort_inputs = {k: v.cpu().detach().numpy() for k, v in auto_features.items()}
ort_outs = ort_session.run(None, ort_inputs)
```

III. Add Pooling Layer to Get Sentence Embeddings
```
import torch
from sentence_transformers.models import Pooling

pooling_layer = Pooling(768, pooling_mode_cls_token=True, pooling_mode_mean_tokens=False)
features = {
    'token_embeddings':  torch.from_numpy(ort_outs[0]),
    'attention_mask': torch.from_numpy(ort_inputs['attention_mask'])
}
pooling_layer.forward(features)
sentence_embeddings = features['sentence_embedding']

embedding_data_onnx = [
            sentence_embeddings[i]
            for i in range(len(input_sentences))
        ]
```

IV. Verify Embedding with Embeddings Encoded with Hugging Face Model 
```
import numpy as np
from sentence_transformers import SentenceTransformer

original_pre_trained_model = SentenceTransformer(model_id) # From Huggingface
original_embedding_data = list(
    original_pre_trained_model.encode(input_sentences, convert_to_numpy=True)
)
        
for i in range(len(input_sentences)):
    print(i)
    print(np.testing.assert_allclose(original_embedding_data[i], embedding_data_onnx[i], rtol=1e-03, atol=1e-05))
```
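`np.testing.assert_allclose` returns `None` on success, so the `print` calls above show `None` when the embeddings match and raise otherwise. When debugging a mismatch, a small report can be more informative. The helper below is a hypothetical sketch (the name `compare_embeddings` and the returned keys are my own, not part of any library):

```python
import numpy as np

def compare_embeddings(expected, actual, rtol=1e-3, atol=1e-5):
    """Summarize how two embedding vectors differ instead of only asserting."""
    expected = np.asarray(expected, dtype=np.float64)
    actual = np.asarray(actual, dtype=np.float64)
    if expected.shape != actual.shape:
        # A shape mismatch (e.g. 512 vs 768) points to a missing layer.
        return {"shape_match": False,
                "expected_shape": expected.shape,
                "actual_shape": actual.shape}
    cosine = float(expected @ actual /
                   (np.linalg.norm(expected) * np.linalg.norm(actual)))
    return {
        "shape_match": True,
        "max_abs_diff": float(np.max(np.abs(expected - actual))),
        "cosine": cosine,
        "allclose": bool(np.allclose(actual, expected, rtol=rtol, atol=atol)),
    }

a = np.array([0.1, -0.2, 0.3])
report_same = compare_embeddings(a, a + 1e-7)
report_shape = compare_embeddings(np.zeros(512), np.zeros(768))
print(report_same["allclose"], report_shape["shape_match"])  # True False
```

A cosine similarity near 1 with a large absolute difference would suggest a scaling or normalization issue, while a low cosine similarity suggests the weights or layers themselves differ.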

### Case B: Models with Pooling & Dense — Mismatch Output

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)
```
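https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1

Note: the cells below trace `sentence-transformers/clip-ViT-B-32-multilingual-v1` rather than the model above. Its `Dense` layer also maps 768 to 512 features, but with `bias=False` and an `Identity` activation.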

#### 0. Trace the Model

In [1]:
import os
from pathlib import Path
from sentence_transformers import SentenceTransformer
from transformers.convert_graph_to_onnx import convert

model_id = "sentence-transformers/clip-ViT-B-32-multilingual-v1"
folder_path = 'sentence-transformers-onxx/clip-ViT-B-32-multilingual-v1'
model_name = str(model_id.split("/")[-1] + ".onnx")
model_path = os.path.join(folder_path, "onnx", model_name)

model = SentenceTransformer(model_id)
        
convert(
    framework="pt",
    model=model_id,
    output=Path(model_path),
    opset=15,
)


ONNX opset version set to: 15
Loading pipeline (model: sentence-transformers/clip-ViT-B-32-multilingual-v1, tokenizer: sentence-transformers/clip-ViT-B-32-multilingual-v1)
Creating folder sentence-transformers-onxx/clip-ViT-B-32-multilingual-v1/onnx
Using framework PyTorch: 1.13.1+cu117
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch', 1: 'sequence'}
Ensuring inputs are in correct order
head_mask is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask']



#### I. Get Inputs

In [2]:
from transformers import AutoTokenizer

input_sentences = ["first sentence", "second sentence", "very very long random sentence for testing"]
autotokenizer = AutoTokenizer.from_pretrained(model_id)
auto_features = autotokenizer(
            input_sentences, return_tensors="pt", padding=True, truncation=True
        )
auto_features

{'input_ids': tensor([[  101, 10422, 49219,   102,     0,     0,     0,     0,     0],
        [  101, 11132, 49219,   102,     0,     0,     0,     0,     0],
        [  101, 12558, 12558, 11695, 61952, 49219, 10142, 38306,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]])}

#### II. Load `onnx` Model & Generate Outputs

In [3]:
from os import environ
from psutil import cpu_count
from onnxruntime import InferenceSession, SessionOptions, get_all_providers

environ["OMP_NUM_THREADS"] = str(cpu_count(logical=True))
environ["OMP_WAIT_POLICY"] = 'ACTIVE'

ort_session = InferenceSession(model_path, providers=["CPUExecutionProvider"])

def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

ort_inputs = {k: v.cpu().detach().numpy() for k, v in auto_features.items()}
ort_outs = ort_session.run(None, ort_inputs)

In [4]:
ort_inputs

{'input_ids': array([[  101, 10422, 49219,   102,     0,     0,     0,     0,     0],
        [  101, 11132, 49219,   102,     0,     0,     0,     0,     0],
        [  101, 12558, 12558, 11695, 61952, 49219, 10142, 38306,   102]]),
 'attention_mask': array([[1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [5]:
print(len(ort_outs))
print(ort_outs[0].shape)

1
(3, 9, 768)


#### III. Add Pooling Layer (Mean Pooling)

In [6]:
import torch
from sentence_transformers.models import Pooling

pooling_layer = Pooling(768, pooling_mode_mean_tokens=True)
features = {
    'token_embeddings':  torch.from_numpy(ort_outs[0]),
    'attention_mask': torch.from_numpy(ort_inputs['attention_mask'])
}
pooling_layer.forward(features)
print(features.keys())
print(features['sentence_embedding'].shape)

dict_keys(['token_embeddings', 'attention_mask', 'sentence_embedding'])
torch.Size([3, 768])


#### IV. Add Dense Layer

In [7]:
import torch
from sentence_transformers.models import Dense
dense_layer = Dense(768, 512, bias=False, activation_function=torch.nn.modules.linear.Identity())
dense_layer.forward(features)

{'token_embeddings': tensor([[[ 0.6112, -0.4034,  0.4854,  ..., -0.1686,  0.5235, -0.0819],
          [ 0.4675, -0.3926,  0.3994,  ..., -0.2330,  0.4120, -0.1260],
          [ 0.5767, -0.4248,  0.4461,  ..., -0.1511,  0.4863, -0.0758],
          ...,
          [ 0.5466, -0.2550,  0.2605,  ..., -0.0381,  0.4918,  0.0092],
          [ 0.5454, -0.2626,  0.3026,  ..., -0.0441,  0.4790,  0.0404],
          [ 0.5621, -0.2934,  0.3254,  ..., -0.0600,  0.4934,  0.0442]],
 
         [[ 0.5738, -0.3759,  0.5529,  ..., -0.1182,  0.5259, -0.0573],
          [ 0.3484, -0.3386,  0.4806,  ..., -0.1488,  0.4365, -0.0916],
          [ 0.5895, -0.3740,  0.5069,  ..., -0.1296,  0.5038, -0.0276],
          ...,
          [ 0.5081, -0.2275,  0.3310,  ..., -0.0255,  0.4946,  0.0071],
          [ 0.4917, -0.2309,  0.3524,  ..., -0.0208,  0.4764,  0.0318],
          [ 0.5092, -0.2576,  0.3779,  ..., -0.0224,  0.4903,  0.0379]],
 
         [[ 0.4235, -0.4610,  0.2390,  ..., -0.2002,  0.2846, -0.2286],
        

In [8]:
features['sentence_embedding'].shape

torch.Size([3, 512])

In [9]:
embedding_data_onnx = [
            features['sentence_embedding'][i].cpu().detach().numpy()
            for i in range(len(input_sentences))
        ]

#### V. Verify Embedding with Embeddings Encoded with Hugging Face Model 

In [10]:
import numpy as np
from sentence_transformers import SentenceTransformer

original_pre_trained_model = SentenceTransformer(model_id) # From Huggingface
original_embedding_data = list(
    original_pre_trained_model.encode(input_sentences, convert_to_numpy=True)
)
        
for i in range(len(input_sentences)):
    print(i)
    print(np.testing.assert_allclose(original_embedding_data[i], embedding_data_onnx[i], rtol=1e-03, atol=1e-05))

0


AssertionError: 
Not equal to tolerance rtol=0.001, atol=1e-05

Mismatched elements: 512 / 512 (100%)
Max absolute difference: 7.389585
Max relative difference: 1491.1345
 x: array([-3.352571e-01, -7.378923e-02, -2.863503e-01, -4.555894e-02,
       -2.281259e-01,  1.402846e-01, -2.240530e-01, -1.598666e+00,
        2.009055e-01,  1.197599e-01,  3.269672e-02,  2.298519e-01,...
 y: array([-4.161581e-02, -2.660530e-01, -2.662762e-01,  1.485295e-01,
       -1.738283e-02,  1.110369e-01, -3.970781e-02,  1.896576e-01,
        4.613394e-02, -5.534234e-03,  1.668322e-01,  2.779251e-02,...

# Resources:

https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/models/Dense.py#L32
https://github.com/huggingface/notebooks/blob/main/examples/onnx-export.ipynb
https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1

# Other Models with Dense Layer that I Can Trace in TorchScript, but Not ONNX
https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1

(There are more models with a dense layer, but I haven't tried tracing them in TorchScript.)

In [16]:
d_model = Dense.load(folder_path + '/2_Dense')

In [17]:
import torch
from sentence_transformers.models import Pooling

pooling_layer = Pooling(768, pooling_mode_mean_tokens=True)
features = {
    'token_embeddings':  torch.from_numpy(ort_outs[0]),
    'attention_mask': torch.from_numpy(ort_inputs['attention_mask'])
}
pooling_layer.forward(features)
print(features.keys())
print(features['sentence_embedding'].shape)

dict_keys(['token_embeddings', 'attention_mask', 'sentence_embedding'])
torch.Size([3, 768])


In [18]:
d_model.forward(features)

{'token_embeddings': tensor([[[ 0.6112, -0.4034,  0.4854,  ..., -0.1686,  0.5235, -0.0819],
          [ 0.4675, -0.3926,  0.3994,  ..., -0.2330,  0.4120, -0.1260],
          [ 0.5767, -0.4248,  0.4461,  ..., -0.1511,  0.4863, -0.0758],
          ...,
          [ 0.5466, -0.2550,  0.2605,  ..., -0.0381,  0.4918,  0.0092],
          [ 0.5454, -0.2626,  0.3026,  ..., -0.0441,  0.4790,  0.0404],
          [ 0.5621, -0.2934,  0.3254,  ..., -0.0600,  0.4934,  0.0442]],
 
         [[ 0.5738, -0.3759,  0.5529,  ..., -0.1182,  0.5259, -0.0573],
          [ 0.3484, -0.3386,  0.4806,  ..., -0.1488,  0.4365, -0.0916],
          [ 0.5895, -0.3740,  0.5069,  ..., -0.1296,  0.5038, -0.0276],
          ...,
          [ 0.5081, -0.2275,  0.3310,  ..., -0.0255,  0.4946,  0.0071],
          [ 0.4917, -0.2309,  0.3524,  ..., -0.0208,  0.4764,  0.0318],
          [ 0.5092, -0.2576,  0.3779,  ..., -0.0224,  0.4903,  0.0379]],
 
         [[ 0.4235, -0.4610,  0.2390,  ..., -0.2002,  0.2846, -0.2286],
        

In [19]:
embedding_data_onnx = [
            features['sentence_embedding'][i].cpu().detach().numpy()
            for i in range(len(input_sentences))
        ]

In [20]:
import numpy as np
from sentence_transformers import SentenceTransformer

original_pre_trained_model = SentenceTransformer(model_id) # From Huggingface
original_embedding_data = list(
    original_pre_trained_model.encode(input_sentences, convert_to_numpy=True)
)
        
for i in range(len(input_sentences)):
    print(i)
    print(np.testing.assert_allclose(original_embedding_data[i], embedding_data_onnx[i], rtol=1e-03, atol=1e-05))

0
None
1
None
2
None


In [21]:
d_model

Dense({'in_features': 768, 'out_features': 512, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})

In [23]:
dense_layer

Dense({'in_features': 768, 'out_features': 512, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})

In [24]:
d_model == dense_layer

False

In [28]:
d_model.state_dict == dense_layer.state_dict

False

In [31]:
d_model.activation_function

Identity()

In [32]:
dense_layer.activation_function

Identity()
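A note on the comparisons in cells `In [24]` and `In [28]`: for `torch.nn.Module` objects, `==` falls back to object identity, and `state_dict` is a method, so `d_model.state_dict == dense_layer.state_dict` compares two bound methods rather than the weights. Both are `False` for any two distinct modules, even if their weights are identical. The identical `repr` output also says nothing about the weights, since it only shows the configuration. The sketch below compares the actual tensors. It uses plain `torch.nn.Linear` layers as stand-ins, on the assumption that `Dense` wraps such a layer and that constructing one directly gives it freshly initialized random weights, which would explain why the directly constructed layer fails the check while the one from `Dense.load` passes. `same_weights` is a hypothetical helper name.

```python
import torch

def same_weights(a: torch.nn.Module, b: torch.nn.Module) -> bool:
    """True if both modules have the same parameter names and identical values."""
    sa, sb = a.state_dict(), b.state_dict()
    return sa.keys() == sb.keys() and all(torch.equal(sa[k], sb[k]) for k in sa)

torch.manual_seed(0)
loaded = torch.nn.Linear(768, 512, bias=False)    # stand-in for the layer from Dense.load
fresh = torch.nn.Linear(768, 512, bias=False)     # stand-in for Dense(768, 512, ...)
restored = torch.nn.Linear(768, 512, bias=False)
restored.load_state_dict(loaded.state_dict())     # copy the "trained" weights

print(loaded.state_dict == fresh.state_dict)  # False: two different bound methods
print(same_weights(loaded, fresh))            # False: independent random initialization
print(same_weights(loaded, restored))         # True: weights were copied
```

Applied to the cells above, the equivalent check would be `same_weights(d_model, dense_layer)`.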