In [0]:
%pip install onnx
dbutils.library.restartPython()

In [0]:
#https://huggingface.co/blog/convert-transformers-to-onnx
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
DIRECTORY='/Volumes/edav_prd_cdh/cdh_ml/metadata_data/onnx/'
# load model and tokenizer
# https://huggingface.co/docs/transformers/main/en/model_doc/distilbert#transformers.DistilBertForTokenClassification
model_user="sarahmiller137"
model_id = "distilbert-base-uncased-ft-ncbi-disease"
# the linked docs and the (batch_size, sequence) logits axes below assume a token-classification head
model = AutoModelForTokenClassification.from_pretrained(f"{model_user}/{model_id}")
tokenizer = AutoTokenizer.from_pretrained(f"{model_user}/{model_id}")
dummy_model_input = tokenizer( "HuggingFace is a company based in Paris and New York", add_special_tokens=False, return_tensors="pt")
onnxfile=f"{model_id}.onnx"
# export
torch.onnx.export(
    model, 
    tuple(dummy_model_input.values()),
    f=onnxfile,  
    input_names=['input_ids', 'attention_mask'], 
    output_names=['logits'], 
    dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence'}, 
                  'attention_mask': {0: 'batch_size', 1: 'sequence'}, 
                  'logits': {0: 'batch_size', 1: 'sequence'}}, 
    do_constant_folding=True, 
    opset_version=13, 
)


In [0]:
dbutils.fs.ls("")

![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace%20ONNX%20in%20Spark%20NLP%20-%20BertForTokenClassification.ipynb)

## Import ONNX BertForTokenClassification models from HuggingFace 🤗  into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models.
- `BertForTokenClassification` is only available in `Spark NLP 5.1.3` and later, so please make sure you have upgraded to the latest Spark NLP release.
- You can import BERT models trained/fine-tuned for token classification via `BertForTokenClassification` or `TFBertForTokenClassification`. These models are usually under the `Token Classification` category and have `bert` in their labels.
- Reference: [TFBertForTokenClassification](https://huggingface.co/transformers/model_doc/bert.html#tfbertfortokenclassification)
- Some [example models](https://huggingface.co/models?filter=bert&pipeline_tag=token-classification)

[ONNX on Azure](https://opensource.microsoft.com/blog/2023/10/04/accelerating-over-130000-hugging-face-models-with-onnx-runtime/)  
[Optimum transformers extension](https://huggingface.co/docs/optimum/index)  

## Export and Save HuggingFace model

Note: `optimum` supersedes `transformers` for the ONNX export.
- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.
- We lock `transformers` on version `4.29.1`. This doesn't mean it won't work with future releases.
- Albert uses SentencePiece, so we will have to install that as well.

In [0]:
#%pip install -q --upgrade NO NO NO transformers[onnx]==4.34.1 maybe sentencepiece 
#%pip install optimum[exporters]@git+https://github.com/huggingface/optimum.git
#%pip install optimum[onnxruntime]
%pip install --upgrade --upgrade-strategy eager optimum[exporters]
%pip install --upgrade tensorflow
%pip install --upgrade huggingface_hub
%pip install tf-keras
dbutils.library.restartPython()

- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.
- We'll use [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER) model from HuggingFace as an example
- In addition to `TFBertForTokenClassification`, we also need to save the `BertTokenizer`. This is the same for every model: these are the assets needed for tokenization inside Spark NLP.

In [0]:
%sh
cd /Volumes/edav_prd_cdh/cdh_ml/metadata_data
mkdir onnx_models

In [0]:
#from optimum.onnxruntime import ORTModelForTokenClassification
import tensorflow as tf
import huggingface_hub

[export transformer to onnx via cli](https://huggingface.co/docs/transformers/serialization)

In [0]:
%sh
cd /Volumes/edav_prd_cdh/cdh_ml/metadata_data
mkdir hfhub
cd hfhub
pip install huggingface_hub[hf_transfer]
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download alvaroalon2/biobert_diseases_ner

In [0]:
%sh
cd /Volumes/edav_prd_cdh/cdh_ml/metadata_data
#mkdir onnx_models
#cd onnx_models
#optimum-cli export onnx --help > onnx_help.txt
git clone https://huggingface.co/alvaroalon2/biobert_diseases_ner biobert_diseases_ner
#optimum-cli export onnx --model alvaroalon2/biobert_diseases_ner biobert_diseases_ner/

In [0]:


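from optimum.onnxruntime import ORTModelForTokenClassification

MODEL_NAME = 'dslim/bert-base-NER'
EXPORT_PATH = f"/Volumes/edav_prd_cdh/cdh_ml/metadata_data/onnx/{MODEL_NAME}"

# export=True converts the HuggingFace checkpoint to ONNX while loading it
ort_model = ORTModelForTokenClassification.from_pretrained(MODEL_NAME, export=True)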

# Save the ONNX model
ort_model.save_pretrained(EXPORT_PATH)

Let's have a look inside the export directory and see what we are dealing with:

In [0]:
%ls -l {EXPORT_PATH}

In [0]:
!mkdir {EXPORT_PATH}/assets

- As you can see, we need to move `vocab.txt` from the tokenizer to the `assets` folder, which is where Spark NLP will look for it
- We also need the `labels` and their `ids`, which are saved inside the model's config. We will save these inside `labels.txt`

In [0]:
# get label2id dictionary
labels = ort_model.config.label2id
# sort the dictionary based on the id
labels = sorted(labels, key=labels.get)

with open(EXPORT_PATH +'/assets/labels.txt', 'w') as f:
    f.write('\n'.join(labels))
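
To make the ordering step concrete, here is a small sketch using a hypothetical `label2id` mapping (the real one comes from `ort_model.config.label2id`, and its contents may differ). Sorting the keys by their id means that line `i` of `labels.txt` holds the label whose id is `i`:

```python
# Hypothetical mapping for illustration only; the real one is ort_model.config.label2id
label2id = {"B-PER": 3, "O": 0, "I-PER": 4, "B-MISC": 1, "I-MISC": 2}

# Sorting the keys by their id gives one label per line, in id order
labels = sorted(label2id, key=label2id.get)
labels_txt = "\n".join(labels)
print(labels_txt)

# Line i (zero-based) of the file content now holds the label whose id is i
assert all(label2id[label] == i for i, label in enumerate(labels))
```

This relies on the ids being contiguous and starting at 0; if they are not, the line index no longer matches the id.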

In [0]:
!mv {EXPORT_PATH}/vocab.txt {EXPORT_PATH}/assets

Voila! We have our `vocab.txt` and `labels.txt` inside the assets directory

In [0]:
!ls -lR {EXPORT_PATH}
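
If you prefer a programmatic check over reading the listing, the sketch below looks for the files prepared in the steps above. `missing_export_files` is a hypothetical helper written for this notebook, not part of Spark NLP or Optimum, and it assumes the exported model is any `*.onnx` file at the top level of the export directory:

```python
from pathlib import Path


def missing_export_files(export_path):
    """Return the expected files that are absent from an export directory.

    Hypothetical helper: the expected layout is taken from the steps above.
    """
    root = Path(export_path)
    missing = [
        rel
        for rel in ("assets/vocab.txt", "assets/labels.txt")
        if not (root / rel).is_file()
    ]
    # Assumption: any *.onnx file at the top level counts as the exported model
    if not any(root.glob("*.onnx")):
        missing.append("*.onnx")
    return missing
```

Calling `missing_export_files(EXPORT_PATH)` should return an empty list once the export and the two asset steps have completed.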

## Import and Save BertForTokenClassification in Spark NLP


- Let's install and set up Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [0]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Let's start Spark with Spark NLP included via our simple `start()` function

In [0]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()

print("Apache Spark version: {}".format(spark.version))

- Let's use the `loadSavedModel` function in `BertForTokenClassification`, which allows us to load the ONNX model we exported above
- Most params, like `setMaxSentenceLength`, can be set later when you load this model in `BertForTokenClassification` at runtime, so don't worry about what you set them to now
- `loadSavedModel` accepts two params: the first is the path to the exported model, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.




In [0]:
from sparknlp.annotator import *
from sparknlp.base import *

tokenClassifier = BertForTokenClassification\
  .loadSavedModel(EXPORT_PATH, spark)\
  .setInputCols(["document",'token'])\
  .setOutputCol("ner")\
  .setCaseSensitive(True)\
  .setMaxSentenceLength(128)

- Let's save it on disk so it is easier to move around and can be used later via the `.load` function

In [0]:
tokenClassifier.write().overwrite().save("./{}_spark_nlp_onnx".format(MODEL_NAME))
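
One detail worth noticing: `MODEL_NAME` contains a `/`, so the formatted save path is nested under a `dslim` directory. The sketch below only inspects the string with `pathlib`; the flattened name at the end is a hypothetical alternative, not something the notebook uses:

```python
from pathlib import PurePosixPath

MODEL_NAME = "dslim/bert-base-NER"

# Same formatting as the save call above
save_path = PurePosixPath("./{}_spark_nlp_onnx".format(MODEL_NAME))
print(save_path.parent)  # directory the model folder is nested under
print(save_path.name)    # name of the model folder itself

# Hypothetical alternative: replace the slash to get a single flat folder name
flat_name = "{}_spark_nlp_onnx".format(MODEL_NAME.replace("/", "_"))
print(flat_name)
```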

Let's clean up stuff we don't need anymore

In [0]:
!rm -rf {EXPORT_PATH}

Awesome 😎  !

This is your BertForTokenClassification model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀

In [0]:
! ls -l {MODEL_NAME}_spark_nlp_onnx

Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny BertForTokenClassification model 😊

In [0]:
tokenClassifier_loaded = BertForTokenClassification.load("./{}_spark_nlp_onnx".format(MODEL_NAME))\
  .setInputCols(["document",'token'])\
  .setOutputCol("ner")

You can see what labels were used to train this model via `getClasses` function:

In [0]:
# .getClasses was introduced in spark-nlp==3.4.0
tokenClassifier_loaded.getClasses()

This is how you can use your loaded classifier model in Spark NLP 🚀 pipeline:

In [0]:
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier_loaded
])

# couple of simple examples
example = spark.createDataFrame([["My name is Sarah and I live in London"],
                                 ['My name is Clara and I live in Berkeley, California.']]).toDF("text")

result = pipeline.fit(example).transform(example)

# result is a DataFrame
result.select("text", "ner.result").show()
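
The `ner.result` column holds one tag per token. If the model uses IOB-style tags such as `B-PER` and `I-PER`, consecutive tags can be merged into whole entities. The following is a minimal plain-Python sketch of that idea, using a hypothetical `group_entities` helper and made-up tags rather than actual model output; inside a pipeline, Spark NLP's `NerConverter` annotator is the usual way to do this:

```python
def group_entities(tokens, tags):
    """Merge IOB-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        inside = prefix in ("B", "I")
        # An I- tag continues the open entity only if the type matches
        continues = prefix == "I" and entity_type == current_type
        if not continues:
            # Close the entity in progress, if any
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if inside:
            current_tokens.append(token)
            current_type = entity_type
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Illustrative tokens and tags (assumed, not produced by the model here)
tokens = ["I", "live", "in", "New", "York", "with", "Sarah"]
tags = ["O", "O", "O", "B-LOC", "I-LOC", "O", "B-PER"]
print(group_entities(tokens, tags))
```

Tokens are joined with a single space, so the original spacing and punctuation are not preserved; that is acceptable for a quick look at the predictions.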

That's it! You can now go wild and use hundreds of `BertForTokenClassification` models from HuggingFace 🤗 in Spark NLP 🚀
