# Transfer learning with Huggingface using CodeFlare

In this notebook you will learn how to use the **[huggingface](https://huggingface.co/)** support in the Ray ecosystem to carry out a text classification task using transfer learning. We will reference the example **[here](https://huggingface.co/docs/transformers/tasks/sequence_classification)**.

The example carries out a text classification task on the **[imdb dataset](https://huggingface.co/datasets/imdb)** and tries to classify movie reviews as positive or negative. The Huggingface library provides an easy way to build the model and the dataset for this classification task. In this case we will use the **distilbert-base-uncased** model, which is a **BERT**-based model.

Huggingface has **[built-in support for the Ray ecosystem](https://docs.ray.io/en/releases-1.13.0/_modules/ray/ml/train/integrations/huggingface/huggingface_trainer.html)**, which allows the Huggingface trainer to run on CodeFlare and to distribute training across multiple GPUs, so training scales out as GPUs are added.


### Getting all the requirements in place

In [None]:
# Import pieces from codeflare-sdk
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
from codeflare_sdk.cluster.auth import TokenAuthentication

In [None]:
# Create authentication object for oc user permissions and login
auth = TokenAuthentication(
    token = "XXXXX",  # placeholder: supply your own token and never commit a real one
    server = "https://c130-e.us-south.containers.cloud.ibm.com:30202",
    skip_tls = True
)
auth.login()

Here, we want to define our cluster by specifying the resources we require for our batch workload. Below, we define our cluster object (which generates a corresponding AppWrapper).

In [None]:
# Create our cluster and submit appwrapper
cluster = Cluster(ClusterConfiguration(name='hfgputest', min_worker=1, max_worker=3, min_cpus=8, max_cpus=8, min_memory=16, max_memory=16, gpu=1, instascale=False))

Next, we want to bring our cluster up, so we call the `up()` function below to submit our cluster AppWrapper yaml onto the MCAD queue, and begin the process of obtaining our resource cluster.

In [None]:
cluster.up()

Now, we want to check on the initial status of our resource cluster, then wait until it is finally ready for use.

In [None]:
cluster.status()

In [None]:
cluster.wait_ready()

In [None]:
cluster.status()

Let's quickly verify that the specs of the cluster are as expected.

In [None]:
cluster.details()

In [None]:
ray_cluster_uri = cluster.cluster_uri()
print(ray_cluster_uri)

**NOTE**: Now we have our resource cluster with the desired GPUs, so we can interact with it to train the HuggingFace model.

In [None]:
#before proceeding make sure the cluster exists and the uri is not empty
assert ray_cluster_uri, "Ray cluster needs to be started and set before proceeding"

import ray
from ray.air.config import ScalingConfig

# Reset the Ray context in case one already exists.
ray.shutdown()

# Install the additional libraries required for this training on the cluster.
runtime_env = {"pip": ["scikit-learn", "accelerate", "transformers", "datasets", "evaluate", "pyarrow<7.0.0"]}

# Establish the connection to the Ray cluster.
ray.init(address=f'{ray_cluster_uri}', runtime_env=runtime_env)

print("Ray cluster is up and running: ", ray.is_initialized())

**NOTE**: This task needs additional pip packages. We install them on the cluster by passing them in the `runtime_env` variable.
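
The `runtime_env` dictionary can also be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named `build_runtime_env` (it is not part of Ray) that removes duplicate requirement strings while preserving their order:

```python
def build_runtime_env(packages):
    """Return a runtime_env-style dict with de-duplicated pip requirements.

    `build_runtime_env` is a hypothetical helper used only for illustration.
    """
    seen = set()
    unique = []
    for requirement in packages:
        requirement = requirement.strip()
        # Skip empty strings and requirements that were already listed.
        if requirement and requirement not in seen:
            seen.add(requirement)
            unique.append(requirement)
    return {"pip": unique}


runtime_env_example = build_runtime_env(
    ["scikit-learn", "accelerate", "transformers", "datasets",
     "evaluate", "pyarrow<7.0.0", "datasets"]
)
print(runtime_env_example)
```

The resulting dictionary has the same shape as the `runtime_env` passed to `ray.init` in this notebook.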

### Transfer learning code from huggingface

We are using the code based on the example **[here](https://huggingface.co/docs/transformers/tasks/sequence_classification)** . 

In [None]:
@ray.remote
def train_fn():
    from datasets import load_dataset
    import transformers
    from transformers import AutoTokenizer, TrainingArguments
    from transformers import AutoModelForSequenceClassification
    import numpy as np
    from datasets import load_metric
    import ray
    from ray import tune
    from ray.train.huggingface import HuggingFaceTrainer

    dataset = load_dataset("imdb")
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    tokenized_datasets = dataset.map(tokenize_function, batched=True)

    # Using a fraction of dataset but you can run with the full dataset
    # hmm, this does not limit to 1000 when we later use the ray.data.from_huggingface
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
    print(f"len of small_train_dataset {small_train_dataset} and small_eval_dataset {small_eval_dataset}")

    # Using a fraction of dataset - The limit here works
    ray_train_ds = ray.data.from_huggingface(small_train_dataset).random_shuffle(seed=42).limit(1000)
    ray_evaluation_ds = ray.data.from_huggingface(small_eval_dataset).random_shuffle(seed=42).limit(1000)
    
    # To use the full dataset instead, convert the full tokenized splits
    #ray_train_ds = ray.data.from_huggingface(tokenized_datasets["train"])
    #ray_evaluation_ds = ray.data.from_huggingface(tokenized_datasets["test"])
    print(f"len of ray_train_ds {ray_train_ds} and ray_evaluation_ds {ray_evaluation_ds}")
    
    def compute_metrics(eval_pred):
        metric = load_metric("accuracy")
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    def trainer_init_per_worker(train_dataset, eval_dataset, **config):
        model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)#, torchscript=True)

        #training_args = TrainingArguments("/tmp/hf_imdb/test", eval_steps=1, disable_tqdm=True, 
        #                                  num_train_epochs=2, skip_memory_metrics=True,
        #                                  learning_rate=2e-5,
        #                                  per_device_train_batch_size=16,
        #                                  per_device_eval_batch_size=16,                                
        #                                  weight_decay=0.01,)
        training_args = TrainingArguments("results", disable_tqdm=True, 
                                          num_train_epochs=3, skip_memory_metrics=True,
                                          learning_rate=2e-5,
                                          evaluation_strategy="steps",
                                          save_strategy = "no",
                                          logging_strategy = "steps",
                                          log_level = 'info',
                                          logging_first_step = True,
                                          logging_steps = 200,
                                          eval_steps = 200,
                                          per_device_train_batch_size=16,
                                          per_device_eval_batch_size=16,                                
                                          weight_decay=0.01,)
        
        return transformers.Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            compute_metrics=compute_metrics
        )

    scaling_config = ScalingConfig(num_workers=3, use_gpu=True) #num workers is the number of gpus

    # we are using the ray native HuggingFaceTrainer, but you can swap out to use non ray Huggingface Trainer. Both have the same method signature. 
    # the ray native HFTrainer has built in support for scaling to multiple GPUs
    trainer = HuggingFaceTrainer(
        trainer_init_per_worker=trainer_init_per_worker,
        scaling_config=scaling_config,
        datasets={"train": ray_train_ds, "evaluation": ray_evaluation_ds},
    )
    result = trainer.fit()
    print(f"metrics: {result.metrics}")
    print(f"checkpoint: {result.checkpoint}")
    print(f"log_dir: {result.log_dir}")
    return result.checkpoint
    #return result.log_dir

**NOTE:** This code will produce a lot of output and will run for **approximately 2 minutes.** As part of execution it downloads the `imdb` dataset and the `distilbert-base-uncased` model, then starts the transfer learning task that trains the model on this dataset.
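
To reason about how `num_workers`, `per_device_train_batch_size`, and `num_train_epochs` interact, the sketch below estimates the number of optimizer steps. It assumes plain data-parallel training in which each worker processes a distinct shard of the data. The helper `training_steps` is hypothetical and only does the arithmetic.

```python
import math


def training_steps(num_samples, per_device_batch_size, num_workers, num_epochs):
    """Estimate the global batch size and step counts under data parallelism."""
    # Each worker consumes its own batch, so one optimizer step covers
    # per_device_batch_size * num_workers samples in total.
    global_batch = per_device_batch_size * num_workers
    steps_per_epoch = math.ceil(num_samples / global_batch)
    return global_batch, steps_per_epoch, steps_per_epoch * num_epochs


# Values taken from this notebook: 1000 samples, batch size 16, 3 workers, 3 epochs.
global_batch, steps_per_epoch, total_steps = training_steps(1000, 16, 3, 3)
print(global_batch, steps_per_epoch, total_steps)  # 48 21 63
```

Under this assumption the run has roughly 63 steps, which is fewer than `logging_steps = 200` and `eval_steps = 200`. Step-based logging and evaluation may therefore not trigger beyond the first step, and lowering those values is one option if intermediate metrics are wanted.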

In [None]:
#call the above cell as a remote ray function
result=ray.get(train_fn.remote())

In [None]:
from ray.train.torch import TorchCheckpoint
checkpoint: TorchCheckpoint = result
path = checkpoint.to_directory()

In [None]:
print(path)
!ls {path}

In [None]:
!cp -r {path} ./checkpoint2
#log_dir=result.log_dir
#print(f"log_dir: {log_dir}")

In [None]:
#path="./checkpoint2"
path="/opt/app-root/src/huggingface-checkpoint"

# Check if GPU is enabled

In [None]:
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

import onnxruntime as rt
print(rt.get_device())

# Inference using the checkpoint

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
text1 = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
text2 = "This is a catastrophe. Each of the three movies had different actors that made it difficult to follow."
batch=[text1,text2]
inputs = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt")

In [None]:
print(inputs)

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(path,num_labels=2, id2label=id2label, label2id=label2id)
with torch.no_grad(): logits = model(**inputs).logits # For pytorch you have to unpack

In [None]:
print(logits)
print(torch.nn.Softmax(dim=1)(logits)) #tf.math.softmax(logits, axis=-1)

In [None]:
import numpy as np
print(np.array(logits))
predicted_class_id = np.array(logits).argmax(axis=1)
print(predicted_class_id)
print([model.config.id2label[i] for i in predicted_class_id])
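
The two cells above turn logits into probabilities with softmax and then into labels with argmax. As a dependency-free illustration of the same computation, here is a minimal sketch using only the standard library. The logit values are made up and `id2label_example` is a local copy of the mapping used in this notebook.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]  # subtract the max for stability
    total = sum(exps)
    return [value / total for value in exps]


id2label_example = {0: "NEGATIVE", 1: "POSITIVE"}
example_logits = [[-2.0, 2.0], [1.5, -1.5]]  # made-up logits for two reviews

probabilities = [softmax(row) for row in example_logits]
# argmax: index of the largest probability in each row
predicted = [id2label_example[row.index(max(row))] for row in probabilities]
print(predicted)  # ['POSITIVE', 'NEGATIVE']
```

Because softmax is monotonic, taking the argmax of the raw logits gives the same class as taking the argmax of the probabilities.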

# Test the pipeline

In [None]:
import transformers
import transformers.convert_graph_to_onnx as onnx_convert
from pathlib import Path

In [None]:
pipeline = transformers.pipeline("sentiment-analysis",model=model,tokenizer=tokenizer)
#pipeline = transformers.pipeline("text-classification",model=model,tokenizer=tokenizer)

In [None]:
result = pipeline("Both the music and visual were astounding, not to mention the actors performance.")
print(result)

In [None]:
from datasets import load_dataset
from transformers.pipelines.pt_utils import KeyDataset

dataset = load_dataset("imdb") #emotion
# Classify the first 10 test reviews. Calling transformers.pipeline(...) here avoids
# shadowing the `pipeline` object created above, which is needed later for ONNX conversion.
test_set = load_dataset("imdb", name="plain_text", split="test[:10]")
pipe = transformers.pipeline("text-classification", model=model, tokenizer=tokenizer)
for out in pipe(KeyDataset(test_set, "text"), batch_size=8, truncation="only_first"):
    print(out)

In [None]:
[i for i in KeyDataset(test_set, "text")]

# Convert the model to onnx with and without quantization

In [None]:
onnx_convert.convert_pytorch(pipeline, opset=11, output=Path("classifier.onnx"), use_external_format=False)

Due to current limitations in ONNX Runtime, quantized models cannot be used with `CUDAExecutionProvider`. See the **[Optimum GPU usage guide](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#use-cuda-execution-provider-with-quantized-models)** for details.

In [None]:
from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic("classifier.onnx", "classifier_int8.onnx", weight_type=QuantType.QUInt8)
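
To give an intuition for what 8-bit quantization does to a weight tensor, the sketch below applies a simple affine mapping from floats to unsigned 8-bit integers and back. It is a simplified illustration with made-up weights, not ONNX Runtime's actual implementation. The helpers `quantize_uint8` and `dequantize` are hypothetical.

```python
def quantize_uint8(values):
    """Map floats to integers in [0, 255] using a scale and a zero point."""
    low = min(min(values), 0.0)   # make sure 0.0 is inside the representable range
    high = max(max(values), 0.0)
    scale = (high - low) / 255.0 or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(-low / scale)
    quantized = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the 8-bit representation."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.62, -0.10, 0.0, 0.25, 0.91]  # made-up weights
q_weights, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q_weights, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q_weights, f"scale={scale:.5f}", f"zero_point={zero_point}", f"max_error={max_error:.5f}")
```

The rounding error per weight stays within one quantization step (`scale`), which is why a quantized model can lose a small amount of accuracy in exchange for a smaller file.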

# Test execution of converted onnx model using onnxruntime and with Quantization

In [None]:
import onnxruntime as ort
#providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
session_options = ort.SessionOptions()
session_options.log_severity_level = 0
session = ort.InferenceSession("classifier.onnx",providers=providers,session_options=session_options)
session_int8 = ort.InferenceSession("classifier_int8.onnx",providers=providers,session_options=session_options)

In [None]:
import numpy as np
inputs = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="np")
out = session.run(input_feed=dict(inputs),output_names=['output_0'])[0]
out_int8 = session_int8.run(input_feed=dict(inputs),output_names=['output_0'])[0]
print('Without quantization',np.argmax(np.array(out),axis=1))
print(out)
print(torch.nn.Softmax(dim=1)(torch.from_numpy(out)))
print('With quantization',np.argmax(np.array(out_int8),axis=1))
print(out_int8)
print(torch.nn.Softmax(dim=1)(torch.from_numpy(out_int8)))

# Test with the imdb test Dataset

In [None]:
from datasets import load_dataset
dataset = load_dataset("imdb") #emotion
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets)

In [None]:
from IPython.display import display, HTML
import random
import pandas as pd
import datasets

def show_random_elements(dataset, num_examples=10):
    #assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        #while pick in picks: pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    count=0
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel): df[column] = df[column].transform(lambda i: typ.names[i])
    df['row_num']=picks
    display(HTML(df.to_html()))

In [None]:
show_random_elements(dataset["train"],5)

In [None]:
from datasets import load_metric

metric = load_metric("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

## Check the accuracy of full test Dataset using GPU

In [None]:
%%time
import numpy as np
accuracy_sum=0.0  # renamed from `sum` to avoid shadowing the built-in
batches=len(tokenized_datasets["test"])//100
print(batches)
for i in range(batches):
    partial_eval_dataset = tokenized_datasets["test"][i*100:(i+1)*100]
    input_feed = {"input_ids": np.array(partial_eval_dataset['input_ids']),"attention_mask": np.array(partial_eval_dataset['attention_mask'])}
    out = session.run(input_feed=input_feed,output_names=['output_0'])[0]
    predictions = np.argmax(out, axis=-1)
    m=metric.compute(predictions=predictions, references=partial_eval_dataset['label'])['accuracy']
    #print(i*100,m)
    accuracy_sum+=m
out = None
print("Accuracy",accuracy_sum/batches)
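
Averaging per-batch accuracies, as in the cell above, matches the overall accuracy only when every batch has the same size. A minimal sketch of an alternative that counts correct predictions and also handles a final partial batch is shown below. The helper `accuracy_over_batches` and the sample data are hypothetical.

```python
def accuracy_over_batches(predictions, references, batch_size):
    """Accumulate correct predictions batch by batch, including a partial last batch."""
    correct = 0
    for start in range(0, len(references), batch_size):
        batch_pred = predictions[start:start + batch_size]
        batch_ref = references[start:start + batch_size]
        correct += sum(int(p == r) for p, r in zip(batch_pred, batch_ref))
    return correct / len(references)


sample_predictions = [1, 0, 1, 1, 0, 0, 1]  # made-up predictions
sample_references = [1, 0, 0, 1, 0, 1, 1]   # made-up labels
overall = accuracy_over_batches(sample_predictions, sample_references, batch_size=3)
print(overall)
```

With 7 samples and a batch size of 3, the last batch holds a single sample, and the result is still 5 correct out of 7.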

## Compare inference time of the original and quantized models on a portion of the test dataset (GPU)

In [None]:
full_eval_dataset = tokenized_datasets["test"][200:300]

In [None]:
import numpy as np
input_feed = {
    "input_ids": np.array(full_eval_dataset['input_ids']),
    "attention_mask": np.array(full_eval_dataset['attention_mask'])
}

In [None]:
%%time
# Original Model
out = session.run(input_feed=input_feed,output_names=['output_0'])[0]
predictions = np.argmax(out, axis=-1)
print(metric.compute(predictions=predictions, references=full_eval_dataset['label']))
out = None
predictions=None

In [None]:
%%time
# Quantized Model
out_int8 = session_int8.run(input_feed=input_feed,output_names=['output_0'])[0]
predictions_int8 = np.argmax(out_int8, axis=-1)
print(metric.compute(predictions=predictions_int8, references=full_eval_dataset['label']))
out_int8 = None
predictions_int8=None

## Compare inference time of the original and quantized models on a portion of the test dataset (CPU)

In [None]:
providers=['CPUExecutionProvider']
session_options = ort.SessionOptions()
session_options.log_severity_level = 0
session = ort.InferenceSession("classifier.onnx",providers=providers,session_options=session_options)
session_int8 = ort.InferenceSession("classifier_int8.onnx",providers=providers,session_options=session_options)

In [None]:
%%time
# Original Model
out = session.run(input_feed=input_feed,output_names=['output_0'])[0]
predictions = np.argmax(out, axis=-1)
print(metric.compute(predictions=predictions, references=full_eval_dataset['label']))
out = None
predictions=None

In [None]:
%%time
# Quantized Model
out_int8 = session_int8.run(input_feed=input_feed,output_names=['output_0'])[0]
predictions_int8 = np.argmax(out_int8, axis=-1)
print(metric.compute(predictions=predictions_int8, references=full_eval_dataset['label']))
out_int8 = None
predictions_int8=None
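
A single `%%time` measurement can be noisy, for example because the first call includes one-off setup cost. The sketch below shows one way to time a callable with a warm-up run and several repeats, using only the standard library. The helper `time_call` and the `dummy_workload` function are hypothetical stand-ins.

```python
import statistics
import time


def time_call(fn, repeats=5, warmup=1):
    """Return the median wall-clock duration of fn() in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not measured
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def dummy_workload():
    # Stand-in for an inference call.
    return sum(i * i for i in range(10_000))


median_seconds = time_call(dummy_workload)
print(f"median: {median_seconds:.6f} s")
```

In this notebook, `fn` could be something like `lambda: session.run(input_feed=input_feed, output_names=['output_0'])`, called once for each session being compared.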

# Preparing the model for Triton

In [None]:
# Load the model from the checkpoint path saved to earlier
model = AutoModelForSequenceClassification.from_pretrained(path,num_labels=2, id2label=id2label, label2id=label2id, torchscript=True)
text1 = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
text2 = "This is a catastrophe. Each of the three movies had different actors that made it difficult to follow."
batch=[text1,text2]
inputs = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt")

class Pytorch_to_TorchScript(torch.nn.Module):
    def __init__(self):
        super(Pytorch_to_TorchScript,self).__init__()
        #print(model)
        self.model = model
    def forward(self,input_ids,attention_mask=None):
        x=self.model(input_ids,attention_mask) #[0] #.get('logits')
        #x = {'logits': x}
        #return x
        return x
    
ptmodel=Pytorch_to_TorchScript().eval()
model_scripted = torch.jit.trace(ptmodel,(inputs['input_ids'],inputs['attention_mask']))#,strict=False) # Export to TorchScript
# Test the outputs
with torch.no_grad():
    output_model = ptmodel(inputs['input_ids'],inputs['attention_mask'])
output_traced_model = model_scripted(inputs['input_ids'],inputs['attention_mask'])
print('output_from_model = ' + str(output_model))
print('output_from_traced_model = ' + str(output_traced_model))

model_scripted.save('/opt/app-root/src/hfmodel/1/model.pt') # Save

In [None]:
print('model_scripted = ' + str(model_scripted))
print('model_scripted = ' + str(model_scripted.graph))

In [None]:
#def func(): return [v for k, v in x.items() if k in ["1"]]
#torch.jit.script(func)
#m = torch.jit.script(Pytorch_to_TorchScript())
#m.save('hfmodel_scripted.pt') # Save

Do not run this section

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(path,num_labels=2, id2label=id2label, label2id=label2id, torchscript=True)
text1 = "This wasn't a masterpiece. Not completely faithful to the books, but boring from beginning to end. Not my favorite of the three."
text2 = "This is a catastrophe. Each of the three movies had different actors that made it difficult to follow."
batch=[text1,text2]
inputs = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    output_model = model(input_ids=inputs['input_ids'],attention_mask=inputs['attention_mask'])
#output_traced_model = model_scripted(inputs['input_ids'],inputs['attention_mask'])
print('output_model = ' + str(output_model))
print('output_traced_model = ' + str(output_traced_model))
model_scripted = torch.jit.trace(model,(inputs['input_ids'],inputs['attention_mask']))#,strict=False) # Export to TorchScript
model_scripted.save('/opt/app-root/src/hfmodel/1/model.pt') # Save

# Verify that the exported model works

In [None]:
# Test the outputs by loading the exported TorchScript model
loaded_model = torch.jit.load("/opt/app-root/src/hfmodel/1/model.pt")
with torch.no_grad():
    output_model = loaded_model(inputs['input_ids'],inputs['attention_mask'])
output_traced_model = model_scripted(inputs['input_ids'],inputs['attention_mask'])
print('output_model = ' + str(output_model))
print('output_traced_model = ' + str(output_traced_model))

# Upload TorchScript model to S3

In [None]:
import os
import boto3
from boto3 import session

key_id = os.environ.get('AWS_ACCESS_KEY_ID')
secret_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
endpoint_url = os.environ.get('AWS_S3_ENDPOINT')
session = boto3.session.Session(aws_access_key_id=key_id, aws_secret_access_key=secret_key)
s3_client = boto3.client('s3', aws_access_key_id=key_id, aws_secret_access_key=secret_key,endpoint_url=endpoint_url,verify=False)
buckets=s3_client.list_buckets()
for bucket in buckets['Buckets']: print(bucket['Name'])

In [None]:
s3_client.upload_file("/opt/app-root/src/hfmodel/1/model.pt", bucket['Name'],"hfmodel/1/model.pt")
s3_client.upload_file("/opt/app-root/src/hfmodel/config.pbtxt", bucket['Name'],"hfmodel/config.pbtxt")

In [None]:
[item.get("Key") for item in s3_client.list_objects_v2(Bucket=bucket['Name']).get("Contents")]
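
The uploads above keep the directory layout `hfmodel/config.pbtxt` and `hfmodel/1/model.pt` in the bucket. The sketch below derives such object keys from a local directory so that the layout is preserved automatically. The helper `model_repository_keys` is hypothetical, and the files are placeholders created in a temporary directory.

```python
import os
import tempfile


def model_repository_keys(local_root, prefix):
    """Return (local_path, object_key) pairs that mirror the local directory layout."""
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for filename in filenames:
            local_path = os.path.join(dirpath, filename)
            relative = os.path.relpath(local_path, local_root)
            # Object keys always use "/" regardless of the local path separator.
            key = "/".join([prefix] + relative.split(os.sep))
            pairs.append((local_path, key))
    return sorted(pairs, key=lambda pair: pair[1])


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "1"))
    for relative in ("config.pbtxt", os.path.join("1", "model.pt")):
        with open(os.path.join(root, relative), "w") as handle:
            handle.write("placeholder")
    keys = [key for _path, key in model_repository_keys(root, "hfmodel")]

print(keys)  # ['hfmodel/1/model.pt', 'hfmodel/config.pbtxt']
```

Each pair could then be passed to `s3_client.upload_file(local_path, bucket_name, key)`, where `bucket_name` is an explicitly chosen bucket (a hypothetical variable here) rather than the loop variable left over from listing the buckets.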

Now manually deploy the model from Data Science Projects

# Convert to ONNX

In [None]:
torch.onnx.export(
    model, 
    tuple(inputs.values()),
    f="torch-model.onnx",  
    input_names=['input_ids', 'attention_mask'], 
    output_names=['logits'], 
    dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence'}, 
                  'attention_mask': {0: 'batch_size', 1: 'sequence'}, 
                  'logits': {0: 'batch_size'}},  # logits have shape (batch_size, num_labels), so only the batch axis is dynamic
    do_constant_folding=True, 
    opset_version=13, 
)

In [None]:
from datasets import load_dataset
dataset = load_dataset("imdb")

In [None]:
import onnx
import onnxruntime
import torch
import numpy as np

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print(tokenizer)

session = onnxruntime.InferenceSession('torch-model.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
text="This is a catastrophe."
inputs = tokenizer(text, return_tensors="np")
print(inputs)

result1 = session.run([i.name for i in session.get_outputs()], dict(inputs))
print(result1)

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
predicted_class_id = np.array(result1).argmax().item()
print(id2label[predicted_class_id])

In [None]:
#import tensorflow as tf
#predictions = tf.math.softmax(result, axis=-1)
print(torch.nn.Softmax(dim=1)(torch.tensor(result1[0])))

In [None]:
text1 = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
text2 = "This is a catastrophe."
batch=[text1,text2]
inputs = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="np")
print(inputs)
result2 = session.run([i.name for i in session.get_outputs()], dict(inputs))
print(result2)
torch.nn.Softmax(dim=1)(torch.tensor(result2[0]))
print(np.argmax(torch.nn.Softmax(dim=1)(torch.tensor(result2[0])),axis=1))
print([id2label[i.item()] for i in torch.argmax(torch.nn.Softmax(dim=1)(torch.tensor(result2[0])),axis=1)])
labels=[id2label[labelid] for labelid in torch.argmax(torch.nn.Softmax(dim=1)(torch.tensor(result2[0])),axis=1).tolist()]
print(labels)

# Upload the onnx model and quantized onnx model to S3 Bucket

In [None]:
import os
import boto3
from boto3 import session

key_id = os.environ.get('AWS_ACCESS_KEY_ID')
secret_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
endpoint_url = os.environ.get('AWS_S3_ENDPOINT')
session = boto3.session.Session(aws_access_key_id=key_id, aws_secret_access_key=secret_key)
s3_client = boto3.client('s3', aws_access_key_id=key_id, aws_secret_access_key=secret_key,endpoint_url=endpoint_url,verify=False)
buckets=s3_client.list_buckets()
for bucket in buckets['Buckets']: print(bucket['Name'])

In [None]:
#print(bucket['Name'])
#modelfile='torch-model.onnx'
#s3_client.upload_file(modelfile, bucket['Name'],'hf_model.onnx')
s3_client.upload_file("classifier.onnx", bucket['Name'],'classifier.onnx')
s3_client.upload_file("classifier_int8.onnx", bucket['Name'],'classifier_int8.onnx')

In [None]:
[item.get("Key") for item in s3_client.list_objects_v2(Bucket=bucket['Name']).get("Contents")]

Now manually deploy the model from Data Science Projects

---
# Submit inferencing request to Deployed model using HTTP

In [None]:
import requests
import json
URL='http://modelmesh-serving.huggingface.svc.cluster.local:8008/v2/models/hfmodel/infer' # underscore characters are removed
headers = {}
payload = {
        "inputs": [{ "name": "input_ids", "shape": inputs.get('input_ids').shape, "datatype": "INT32", "data": inputs.get('input_ids').tolist()},{ "name": "attention_mask", "shape": inputs.get('attention_mask').shape, "datatype": "INT8", "data": inputs.get('attention_mask').tolist()}]
    }
print(payload)
headers = {"content-type": "application/json"}
res = requests.post(URL, json=payload, headers=headers)
print(res)
print(res.text)

In [None]:
result=[np.array(res.json().get('outputs')[0].get('data')).reshape(res.json().get('outputs')[0].get('shape'))]
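
The request and response handling above can be factored into two small functions. The sketch below builds a payload with the same fields used in this notebook (`name`, `shape`, `datatype`, `data`) and reshapes a flat response into rows. The helpers `build_infer_payload` and `decode_output`, the token ids, and the response body are all made up for illustration, and a two-dimensional output is assumed.

```python
import json


def build_infer_payload(input_ids, attention_mask):
    """Build an inference request body from two equally shaped nested lists."""
    def tensor(name, rows, datatype):
        return {"name": name, "shape": [len(rows), len(rows[0])],
                "datatype": datatype, "data": rows}
    return {"inputs": [tensor("input_ids", input_ids, "INT32"),
                       tensor("attention_mask", attention_mask, "INT8")]}


def decode_output(response_json, index=0):
    """Reshape the flat `data` list of one output into rows (2-D outputs only)."""
    output = response_json["outputs"][index]
    rows, cols = (int(dim) for dim in output["shape"])  # dims may arrive as strings
    flat = output["data"]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


example_payload = build_infer_payload([[101, 2023, 102], [101, 102, 0]],
                                      [[1, 1, 1], [1, 1, 0]])
print(json.dumps(example_payload))

# Made-up response body shaped like the one decoded in this notebook.
fake_response = {"outputs": [{"name": "logits", "shape": ["2", "2"],
                              "datatype": "FP32", "data": [-1.2, 1.3, 2.1, -2.0]}]}
example_logits = decode_output(fake_response)
print(example_logits)
```

Converting the shape entries with `int` mirrors the batch cell later in this notebook, which also casts the returned shape before reshaping.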

In [None]:
torch.nn.Softmax(dim=1)(torch.tensor(result[0]))
print(np.argmax(torch.nn.Softmax(dim=1)(torch.tensor(result[0])),axis=1))
print('Using item',[id2label[i.item()] for i in torch.argmax(torch.nn.Softmax(dim=1)(torch.tensor(result[0])),axis=1)])
labels=[id2label[labelid] for labelid in torch.argmax(torch.nn.Softmax(dim=1)(torch.tensor(result[0])),axis=1).tolist()]
print('Using to_list',labels)

# Submit batches of inferencing requests to Deployed model using HTTP

In [None]:
%%time
import requests
import json
URL='http://modelmesh-serving.huggingface.svc.cluster.local:8008/v2/models/hfmodel/infer' # underscore characters are removed
headers = {}
import numpy as np
accuracy_sum=0.0  # renamed from `sum` to avoid shadowing the built-in
batch_size=250
batches=len(tokenized_datasets["test"])//batch_size
print(batches)
for batchnum in range(batches):
    partial_eval_dataset = tokenized_datasets["test"][batchnum*batch_size:(batchnum+1)*batch_size]
    payload = {
            "inputs": [{ "name": "input_ids", "shape": (batch_size,512), "datatype": "INT32", "data": partial_eval_dataset['input_ids']},
                       { "name": "attention_mask", "shape": (batch_size,512), "datatype": "INT8", "data": partial_eval_dataset['attention_mask']}]
        }
    #print(payload)
    headers = {"content-type": "application/json"}
    res = requests.post(URL, json=payload, headers=headers)
    #print(res.json())#.get["outputs"])#[0].get["data"])
    predictions = np.argmax(np.array(res.json()["outputs"][0]["data"]).reshape(*[int(i) for i in res.json()["outputs"][0]["shape"]]), axis=1)
    m=metric.compute(predictions=predictions, references=partial_eval_dataset['label'])['accuracy']
    print(batchnum,m)
    accuracy_sum+=m
print("Accuracy",accuracy_sum/batches)

In [None]:
import time

URL='http://modelmesh-serving.huggingface.svc.cluster.local:8008/v2/models/hfmodel/infer' # underscore characters are removed
headers = {}
eval_dataset = tokenized_datasets["test"]
length_eval = len(eval_dataset)
batch_size = 250
n_steps = int(length_eval/batch_size)
count=0

start = time.time()
with torch.no_grad():
    for i in range(0,n_steps):
        print (i,"/",n_steps)
        small_eval_dataset = eval_dataset.select(range(i*batch_size,(i+1)*batch_size))
        batch_x = small_eval_dataset['text']
        batch_y = small_eval_dataset['label']
        inputs = tokenizer(batch_x, padding=True, truncation=True, max_length=512, return_tensors="pt")
        
        payload = {
        "inputs": [{ "name": "input_ids", 
                    "shape": inputs.get('input_ids').shape, 
                    "datatype": "INT32", 
                    "data": inputs.get('input_ids').tolist()},
                   { "name": "attention_mask", 
                    "shape": inputs.get('attention_mask').shape, 
                    "datatype": "INT8", 
                    "data": inputs.get('attention_mask').tolist()}]}
        
        headers = {"content-type": "application/json"}
        res = requests.post(URL, json=payload, headers=headers, verify = False)


        result =[np.array(res.json().get('outputs')[0].get('data')).reshape(res.json().get('outputs')[0].get('shape'))]
        
        predicted_class_id = np.argmax(result[0],axis=1)
        #print(predicted_class_id==np.array(batch_y))
        count_batch = (predicted_class_id == np.array(batch_y)).sum()
        count = count + count_batch
        print (f'Batch {count_batch}/{batch_size} Total {count}/{batch_size*(i+1)}')
        
end = time.time()

print(end-start)

print ("Accuracy is ", count/length_eval)

# Submit batches of inferencing requests to Deployed model using gRPC

In [None]:
!pip install grpcio grpcio-tools==1.46.0

In [None]:
#!wget https://raw.githubusercontent.com/kserve/kserve/master/docs/predict-api/v2/grpc_predict_v2.proto
!wget https://raw.githubusercontent.com/kserve/modelmesh-serving/main/fvt/proto/kfs_inference_v2.proto
!python3 -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. ./kfs_inference_v2.proto

In [None]:
import time
import grpc
import kfs_inference_v2_pb2, kfs_inference_v2_pb2_grpc
from google.protobuf.json_format import MessageToDict
import struct
import base64

FLOAT = 'f'
grpc_url="modelmesh-serving.huggingface.svc.cluster.local:8033"
grpc_channel = grpc.insecure_channel(grpc_url)
grpc_stub = kfs_inference_v2_pb2_grpc.GRPCInferenceServiceStub(grpc_channel)

eval_dataset = tokenized_datasets["test"]
length_eval = len(eval_dataset)
batch_size = 440
n_steps = int(length_eval/batch_size)

count=0
start = time.time()
with torch.no_grad():
    for i in range(0,n_steps):
        print (i,"/",n_steps)
        small_eval_dataset = eval_dataset.select(range(i*batch_size,min((i+1)*batch_size,len(eval_dataset))))
        batch_x = small_eval_dataset['text']
        batch_y = small_eval_dataset['label']
        inputs = tokenizer(batch_x, padding=True, truncation=True, max_length=512, return_tensors="pt")
        
        payload = { "model_name": "hfmodel",
        "inputs": [{ "name": "input_ids", "shape": inputs.get('input_ids').shape, "datatype": "INT32", 
                     "contents": {"int_contents":[y for x in inputs.get('input_ids').tolist() for y in x]}},
                   { "name": "attention_mask", "shape": inputs.get('attention_mask').shape, "datatype": "INT8", 
                     "contents": {"int_contents":[y for x in inputs.get('attention_mask').tolist() for y in x]}}]
                  }
        request=kfs_inference_v2_pb2.ModelInferRequest(model_name="hfmodel",inputs=payload["inputs"])
        response = grpc_stub.ModelInfer(request)

        d = MessageToDict(response.outputs[0])
        #print(d)
        binary_data=bytes([x for x in response.raw_output_contents[0]])
        fmt = '<' + FLOAT * (len(binary_data) // struct.calcsize(FLOAT))
        numbers = struct.unpack(fmt, binary_data)
        #print('numbers',numbers)

        predicted_class_id = np.array(numbers).reshape(*[int(i) for i in d.get("shape")]).argmax(axis=1)
        #print('predicted_class_id',predicted_class_id,np.array(batch_y))
        #print(predicted_class_id,np.array(batch_y),predicted_class_id==np.array(batch_y))
        count_batch = (predicted_class_id == np.array(batch_y)).sum()
        count = count + count_batch
        print (f'Batch {count_batch}/{min((i+1)*batch_size,len(eval_dataset))-i*batch_size} Total {count}/{min(batch_size*(i+1),len(eval_dataset))}')
    
end = time.time()

print(end-start)

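# Only n_steps full batches were evaluated, so divide by the number of samples actually processed.
print ("Accuracy is ", count/(n_steps*batch_size))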
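
With a batch size that does not divide the dataset size evenly, `int(length_eval/batch_size)` drops the final partial batch. The sketch below computes index ranges that cover every sample. The helper `batch_ranges` is hypothetical, and 25000 is the size of the imdb test split.

```python
def batch_ranges(num_items, batch_size):
    """Return (start, end) index pairs covering all items, including a partial last batch."""
    return [(start, min(start + batch_size, num_items))
            for start in range(0, num_items, batch_size)]


ranges = batch_ranges(25000, 440)
print(len(ranges), ranges[-1])  # 57 (24640, 25000)
```

Iterating over these ranges yields 57 batches, the last one holding 360 samples, so the accuracy can be divided by the full dataset length.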

# Test single payload using gRPC

In [None]:
text1 = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
text2 = "This is a catastrophe."
batch=[text1,text2]
inputs = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="np")
print(inputs)
payload = { "model_name": "hfmodel",
        "inputs": [{ "name": "input_ids", "shape": inputs.get('input_ids').shape, "datatype": "INT32", 
                     "contents": {"int_contents":[y for x in inputs.get('input_ids').tolist() for y in x]}},
                   { "name": "attention_mask", "shape": inputs.get('attention_mask').shape, "datatype": "INT8", 
                     "contents": {"int_contents":[y for x in inputs.get('attention_mask').tolist() for y in x]}}]
    }
print(json.dumps(payload))

In [None]:
import grpc
import kfs_inference_v2_pb2, kfs_inference_v2_pb2_grpc
grpc_url="modelmesh-serving.huggingface.svc.cluster.local:8033"
request=kfs_inference_v2_pb2.ModelInferRequest(model_name="hfmodel",inputs=payload["inputs"])
grpc_channel = grpc.insecure_channel(grpc_url)
grpc_stub = kfs_inference_v2_pb2_grpc.GRPCInferenceServiceStub(grpc_channel)
response = grpc_stub.ModelInfer(request)

In [None]:
print(type(response.outputs),type(response.raw_output_contents))
from google.protobuf.json_format import MessageToDict
d = MessageToDict(response.outputs[0])
print(d)
binary_data=bytes([x for x in response.raw_output_contents[0]])

In [None]:
import struct
import base64
FLOAT = 'f'
fmt = '<' + FLOAT * (len(binary_data) // struct.calcsize(FLOAT))
numbers = struct.unpack(fmt, binary_data)
print(numbers)

In [None]:
np.array(numbers).reshape(*[int(i) for i in d.get("shape")])
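
The decoding steps above interpret `raw_output_contents` as little-endian 32-bit floats and reshape them using the shape reported in the response. The sketch below wraps those steps into one function and exercises it on simulated bytes. The helper `decode_fp32` is hypothetical, and the byte string is produced locally with `struct.pack` rather than by a server.

```python
import struct


def decode_fp32(raw_bytes, shape):
    """Unpack little-endian float32 bytes and reshape them into rows (2-D only)."""
    count = len(raw_bytes) // struct.calcsize("f")
    flat = struct.unpack("<" + "f" * count, raw_bytes)
    rows, cols = (int(dim) for dim in shape)  # shape entries may arrive as strings
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


# Simulated raw output: four float32 values for a (2, 2) logits tensor.
raw = struct.pack("<4f", -1.5, 1.5, 2.0, -2.0)
decoded_logits = decode_fp32(raw, ["2", "2"])
print(decoded_logits)  # [[-1.5, 1.5], [2.0, -2.0]]
```

The values in the example are exactly representable as 32-bit floats, so they round-trip without loss. Real logits would generally show small float32 rounding.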

# Conclusion
As shown in the above example, you can run your Huggingface transfer learning tasks natively on CodeFlare. You can scale them from 1 to n GPUs without significant code changes while still using the native Huggingface trainer.

Also refer to the additional notebooks that showcase other use cases. The next notebook, [./02_codeflare_workflows_encoding.ipynb], shows an sklearn example and how you can leverage workflows to run experiment pipelines and explore multiple pipelines in parallel on a CodeFlare cluster.

Finally, we bring our resource cluster down and release/terminate the associated resources, bringing everything back to the way it was before our cluster was brought up.

In [None]:
cluster.down()

In [None]:
auth.logout()