# Transfer learning with Huggingface using CodeFlare

In this notebook you will learn how to use the **[huggingface](https://huggingface.co/)** support in the Ray ecosystem to carry out a text classification task using transfer learning. We will follow the example **[here](https://huggingface.co/docs/transformers/tasks/sequence_classification)**.

The example carries out a text classification task on the **[imdb dataset](https://huggingface.co/datasets/imdb)** and tries to classify movie reviews as positive or negative. The Huggingface library provides an easy way to build the model and the dataset for this classification task. In this case we will be using the **distilbert-base-uncased** model, which is a **BERT**-based model.

Huggingface has **[built-in support for the Ray ecosystem](https://docs.ray.io/en/releases-1.13.0/_modules/ray/ml/train/integrations/huggingface/huggingface_trainer.html)**, which lets the Huggingface trainer run distributed training across multiple GPUs on CodeFlare, so the training can scale out as we add GPUs.


### Getting all the requirements in place

In [1]:
# Import pieces from codeflare-sdk
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
from codeflare_sdk.cluster.auth import TokenAuthentication

In [2]:
# Create authentication object for oc user permissions and login
auth = TokenAuthentication(
    token = "sha256~wTEk7b6J0jRiIGCCCl8f_uVRimPYqMjDjthEsQE5i9s",
    server = "https://api.mini2.mydomain.com:6443",
    skip_tls = True
)
auth.login()



Here, we want to define our cluster by specifying the resources we require for our batch workload. Below, we define our cluster object (which generates a corresponding AppWrapper).

In [3]:
# Create our cluster and submit appwrapper
cluster = Cluster(ClusterConfiguration(name='hfgputest', min_worker=1, max_worker=3, min_cpus=8, max_cpus=8, min_memory=16, max_memory=16, gpu=1, instascale=False))

Written to: hfgputest.yaml


Next, we want to bring our cluster up, so we call the `up()` function below to submit our cluster AppWrapper yaml onto the MCAD queue, and begin the process of obtaining our resource cluster.

In [4]:
cluster.up()

Now, we want to check on the initial status of our resource cluster, then wait until it is finally ready for use.

In [5]:
cluster.status()

(<CodeFlareClusterStatus.QUEUED: 3>, False)

In [6]:
cluster.wait_ready()

Waiting for requested resources to be set up...
Requested cluster up and running!


In [7]:
cluster.status()

(<CodeFlareClusterStatus.READY: 1>, True)
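The check-then-wait pattern above can be sketched as a small polling loop. This is only an illustration of the idea, not how `cluster.wait_ready()` is implemented: `wait_until_ready` is a hypothetical helper, and the fake status source below stands in for `cluster.status()`, which (as the outputs above show) returns a `(status, ready)` tuple.

```python
import time


def wait_until_ready(get_status, timeout_s=600.0, interval_s=5.0, sleep=time.sleep):
    """Poll get_status() until it reports ready, or raise after timeout_s.

    get_status is any zero-argument callable returning (status, ready),
    mirroring the tuple printed by cluster.status() above.
    """
    waited = 0.0
    while True:
        status, ready = get_status()
        if ready:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"not ready after {waited:.0f}s (last status: {status})")
        sleep(interval_s)
        waited += interval_s


# Hypothetical stand-in for cluster.status(): QUEUED twice, then READY.
_fake_states = iter([("QUEUED", False), ("QUEUED", False), ("READY", True)])
final_status = wait_until_ready(lambda: next(_fake_states), interval_s=0.0, sleep=lambda s: None)
print(final_status)  # READY
```

With a real cluster object, the call would presumably look like `wait_until_ready(cluster.status)`; the explicit timeout is the main reason to prefer a loop like this over waiting indefinitely.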

Let's quickly verify that the specs of the cluster are as expected.

In [8]:
cluster.details()

RayCluster(name='hfgputest', status=<RayClusterStatus.READY: 'ready'>, min_workers=1, max_workers=3, worker_mem_min=16, worker_mem_max=16, worker_cpu=8, worker_gpu=1, namespace='huggingface', dashboard='http://ray-dashboard-hfgputest-huggingface.apps.mini2.mydomain.com')

In [9]:
ray_cluster_uri = cluster.cluster_uri()
print(ray_cluster_uri)

ray://hfgputest-head-svc.huggingface.svc:10001


**NOTE**: Now we have our resource cluster with the desired GPUs, so we can interact with it to train the HuggingFace model.

In [10]:
#before proceeding make sure the cluster exists and the uri is not empty
assert ray_cluster_uri, "Ray cluster needs to be started and set before proceeding"

import ray
from ray.air.config import ScalingConfig

# reset the ray context in case there's already one.
ray.shutdown()

# install additional libraries that will be required for this training
runtime_env = {"pip": ["transformers", "datasets", "evaluate", "pyarrow<7.0.0"]}

# establish connection to ray cluster
ray.init(address=f'{ray_cluster_uri}', runtime_env=runtime_env)

print("Ray cluster is up and running: ", ray.is_initialized())

Ray cluster is up and running:  True


**NOTE**: This task needs additional pip packages on the cluster. We install them by passing them in the `runtime_env` argument of `ray.init()`.
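As a rough sketch of how such a `runtime_env` dictionary can be assembled, the helper below builds the same `pip` list and optionally adds environment variables through Ray's `env_vars` key. `build_runtime_env` is a hypothetical helper, not part of Ray or CodeFlare, and setting `TOKENIZERS_PARALLELISM` is an assumption based on the tokenizers warning that appears in the training log later in this notebook.

```python
def build_runtime_env(pip_packages, env_vars=None):
    """Return a dict shaped like the runtime_env argument passed to ray.init()."""
    env = {"pip": list(pip_packages)}
    if env_vars:
        # Environment variable names and values are passed as strings.
        env["env_vars"] = {str(k): str(v) for k, v in env_vars.items()}
    return env


example_env = build_runtime_env(
    ["transformers", "datasets", "evaluate", "pyarrow<7.0.0"],
    env_vars={"TOKENIZERS_PARALLELISM": "false"},
)
print(example_env)
```

Pinning versions in the `pip` list (as done for `pyarrow<7.0.0`) keeps the workers' environment reproducible across runs.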

### Transfer learning code from huggingface

We are using code based on the example **[here](https://huggingface.co/docs/transformers/tasks/sequence_classification)**.

In [11]:
@ray.remote
def train_fn():
    from datasets import load_dataset
    import transformers
    from transformers import AutoTokenizer, TrainingArguments
    from transformers import AutoModelForSequenceClassification
    import numpy as np
    from datasets import load_metric
    import ray
    from ray import tune
    from ray.train.huggingface import HuggingFaceTrainer

    dataset = load_dataset("imdb")
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    tokenized_datasets = dataset.map(tokenize_function, batched=True)

    #using a fraction of dataset but you can run with the full dataset
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(100))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))

    print(f"len of train {small_train_dataset} and test {small_eval_dataset}")

    ray_train_ds = ray.data.from_huggingface(small_train_dataset)
    ray_evaluation_ds = ray.data.from_huggingface(small_eval_dataset)

    def compute_metrics(eval_pred):
        metric = load_metric("accuracy")
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    def trainer_init_per_worker(train_dataset, eval_dataset, **config):
        model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

        training_args = TrainingArguments("/tmp/hf_imdb/test", eval_steps=1, disable_tqdm=True, 
                                          num_train_epochs=1, skip_memory_metrics=True,
                                          learning_rate=2e-5,
                                          per_device_train_batch_size=16,
                                          per_device_eval_batch_size=16,                                
                                          weight_decay=0.01,)
        return transformers.Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            compute_metrics=compute_metrics
        )

    scaling_config = ScalingConfig(num_workers=3, use_gpu=True) #num workers is the number of gpus

    # we are using the ray native HuggingFaceTrainer, but you can swap out to use non ray Huggingface Trainer. Both have the same method signature. 
    # the ray native HFTrainer has built in support for scaling to multiple GPUs
    trainer = HuggingFaceTrainer(
        trainer_init_per_worker=trainer_init_per_worker,
        scaling_config=scaling_config,
        datasets={"train": ray_train_ds, "evaluation": ray_evaluation_ds},
    )
    result = trainer.fit()
    print(f"metrics: {result.metrics}")
    print(f"checkpoint: {result.checkpoint}")
    print(f"log_dir: {result.log_dir}")
    return result.checkpoint
    #return result.log_dir

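**NOTE:** This code produces a lot of output and is expected to run for **approximately 2 minutes**, although the log recorded below reports a total run time of about 779 seconds (roughly 13 minutes), so the actual time varies with the cluster and the amount of data. During execution it downloads the `imdb` dataset and the `distilbert-base-uncased` model, then starts the transfer learning task that trains the model on this dataset.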
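To get a feel for how the training length scales with data size and GPU count, here is a small arithmetic sketch. It assumes plain data-parallel training in which every worker processes `per_device_train_batch_size` examples per step, with no gradient accumulation and the last partial batch kept; `steps_per_epoch` is a hypothetical helper, not a Ray or Huggingface API.

```python
import math


def steps_per_epoch(num_samples, per_device_batch_size, num_workers):
    """Optimizer steps needed for one pass over the data under the assumptions above."""
    global_batch = per_device_batch_size * num_workers
    return math.ceil(num_samples / global_batch)


# 100-example subset, batch size 16 per device, 3 workers (one GPU each)
print(steps_per_epoch(100, 16, 3))    # 3
# full 25,000-example IMDB training split with the same settings
print(steps_per_epoch(25000, 16, 3))  # 521
```

Under these assumptions, adding workers reduces the number of steps per epoch roughly in proportion, which is the scaling benefit the `ScalingConfig` setting is meant to provide.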

In [12]:
#call the above cell as a remote ray function
result=ray.get(train_fn.remote())

Downloading builder script: 100%|██████████| 4.31k/4.31k [00:00<00:00, 5.20MB/s]
Downloading metadata: 100%|██████████| 2.17k/2.17k [00:00<00:00, 3.29MB/s]
Downloading readme: 100%|██████████| 7.59k/7.59k [00:00<00:00, 10.8MB/s]


(train_fn pid=294) Downloading and preparing dataset imdb/plain_text to /home/ray/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0...


Downloading data:   0%|          | 0.00/84.1M [00:00<?, ?B/s]
Downloading data:   0%|          | 17.4k/84.1M [00:00<13:43, 102kB/s]
Downloading data:   0%|          | 82.9k/84.1M [00:00<04:23, 319kB/s]
Downloading data:   0%|          | 132k/84.1M [00:00<04:04, 343kB/s] 
Downloading data:   0%|          | 214k/84.1M [00:00<03:01, 462kB/s]
Downloading data:   0%|          | 312k/84.1M [00:00<02:28, 565kB/s]
Downloading data:   1%|          | 443k/84.1M [00:00<01:55, 727kB/s]
Downloading data:   1%|          | 591k/84.1M [00:00<01:37, 857kB/s]
Downloading data:   1%|          | 755k/84.1M [00:01<01:23, 1.00MB/s]
Downloading data:   1%|          | 935k/84.1M [00:01<01:14, 1.12MB/s]
Downloading data:   1%|▏         | 1.13M/84.1M [00:01<01:05, 1.26MB/s]
Downloading data:   2%|▏         | 1.38M/84.1M [00:01<00:56, 1.45MB/s]
Downloading data:   2%|▏         | 1.67M/84.1M [00:01<00:47, 1.73MB/s]
Downloading data:   3%|▎         | 2.16M/84.1M [00:01<00:34, 2.38MB/s]
Downloading data:   3%|▎    

(train_fn pid=294) Dataset imdb downloaded and prepared to /home/ray/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0. Subsequent calls will reuse this data.


Downloading (…)okenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 8.96kB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 483/483 [00:00<00:00, 126kB/s]
Downloading (…)solve/main/vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 6.44MB/s]
Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 8.52MB/s]
Map:   0%|          | 0/25000 [00:00<?, ? examples/s]
Map:   4%|▍         | 1000/25000 [00:00<00:17, 1333.84 examples/s]
Map:   8%|▊         | 2000/25000 [00:01<00:14, 1608.44 examples/s]
Map:  12%|█▏        | 3000/25000 [00:01<00:13, 1674.60 examples/s]
Map:  16%|█▌        | 4000/25000 [00:02<00:12, 1668.97 examples/s]
Map:  20%|██        | 5000/25000 [00:03<00:12, 1653.89 examples/s]
Map:  24%|██▍       | 6000/25000 [00:03<00:11, 1651.86 examples/s]
Map:  28%|██▊       | 7000/25000 [00:04<00:10, 1733.88 examples/s]
Map:  32%|███▏      | 8000/25000 [00:04<00:09, 1

(train_fn pid=294) len of train Dataset({
(train_fn pid=294)     features: ['text', 'label', 'input_ids', 'attention_mask'],
(train_fn pid=294)     num_rows: 100
(train_fn pid=294) }) and test Dataset({
(train_fn pid=294)     features: ['text', 'label', 'input_ids', 'attention_mask'],
(train_fn pid=294)     num_rows: 100
(train_fn pid=294) })
(train_fn pid=294) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
(train_fn pid=294) 	- Avoid using `tokenizers` before the fork if possible
(train_fn pid=294) 	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
(train_fn pid=294) == Status ==
(train_fn pid=294) Current time: 2023-04-16 12:36:17 (running for 00:00:04.89)
(train_fn pid=294) Memory usage on 

(RayTrainWorker pid=150, ip=10.130.0.183) 2023-04-16 12:36:20,196	INFO config.py:87 -- Setting up process group for: env:// [rank=0, world_size=3]


(train_fn pid=294) == Status ==
(train_fn pid=294) Current time: 2023-04-16 12:36:22 (running for 00:00:09.90)
(train_fn pid=294) Memory usage on this node: 69.1/503.7 GiB 
(train_fn pid=294) Using FIFO scheduling algorithm.
(train_fn pid=294) Resources requested: 1.0/26 CPUs, 3.0/3 GPUs, 0.0/52.15 GiB heap, 0.0/15.46 GiB objects
(train_fn pid=294) Result logdir: /home/ray/ray_results/HuggingFaceTrainer_2023-04-16_12-36-12
(train_fn pid=294) Number of trials: 1/1 (1 RUNNING)
(train_fn pid=294) +--------------------------------+----------+------------------+
(train_fn pid=294) | Trial name                     | status   | loc              |
(train_fn pid=294) |--------------------------------+----------+------------------|
(train_fn pid=294) | HuggingFaceTrainer_f1d17_00000 | RUNNING  | 10.130.0.183:117 |
(train_fn pid=294) +--------

Downloading (…)lve/main/config.json: 100%|██████████| 483/483 [00:00<00:00, 151kB/s]
Downloading pytorch_model.bin:   0%|          | 0.00/268M [00:00<?, ?B/s]
Downloading pytorch_model.bin:   4%|▍         | 10.5M/268M [00:00<00:11, 23.0MB/s]
Downloading pytorch_model.bin:   8%|▊         | 21.0M/268M [00:00<00:06, 37.2MB/s]
Downloading pytorch_model.bin:  12%|█▏        | 31.5M/268M [00:00<00:04, 52.0MB/s]
Downloading pytorch_model.bin:  20%|█▉        | 52.4M/268M [00:00<00:02, 75.5MB/s]
Downloading pytorch_model.bin:  27%|██▋       | 73.4M/268M [00:01<00:02, 89.0MB/s]
Downloading pytorch_model.bin:  35%|███▌      | 94.4M/268M [00:01<00:01, 98.1MB/s]
Downloading pytorch_model.bin:  43%|████▎     | 115M/268M [00:01<00:01, 104MB/s]  
Downloading (…)lve/main/config.json: 100%|██████████| 483/483 [00:00<00:00, 217kB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 483/483 [00:00<00:00, 177kB/s]
Downloading pytorch_model.bin:  51%|█████     | 136M/268M [00:01<00:01, 108MB/s]
Downloadi

Downloading pytorch_model.bin:  51%|█████     | 136M/268M [00:02<00:01, 88.4MB/s]
(RayTrainWorker pid=147, ip=10.128.0.73) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight']
(RayTrainWorker pid=147, ip=10.128.0.73) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=147, ip=10.128.0.73) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a Ber

Downloading pytorch_model.bin:  47%|████▋     | 126M/268M [00:07<00:05, 28.1MB/s]
Downloading pytorch_model.bin:  51%|█████     | 136M/268M [00:08<00:04, 29.8MB/s]
Downloading pytorch_model.bin:  55%|█████▍    | 147M/268M [00:08<00:03, 31.6MB/s]
Downloading pytorch_model.bin:  59%|█████▊    | 157M/268M [00:08<00:03, 33.0MB/s]
Downloading pytorch_model.bin:  63%|██████▎   | 168M/268M [00:08<00:02, 34.2MB/s]
Downloading pytorch_model.bin:  67%|██████▋   | 178M/268M [00:09<00:02, 35.5MB/s]
Downloading pytorch_model.bin:  70%|███████   | 189M/268M [00:09<00:02, 37.0MB/s]
Downloading pytorch_model.bin:  74%|███████▍  | 199M/268M [00:09<00:01, 38.8MB/s]
Downloading pytorch_model.bin:  78%|███████▊  | 210M/268M [00:09<00:01, 40.8MB/s]
Downloading pytorch_model.bin:  82%|████████▏ | 220M/268M [00:10<00:01, 43.1MB/s]
Downloading pytorch_model.bin:  86%|████████▌ | 231M/268M [00:10<00:00, 45.2MB/s]
Downloading pytorch_model.bin:  90%|█████████ | 241M/268M [00:10<00:00, 47.5MB/s]
Downloading pyto

(train_fn pid=294) == Status ==
(train_fn pid=294) Current time: 2023-04-16 12:48:44 (running for 00:12:31.75)
(train_fn pid=294) Memory usage on this node: 72.9/503.7 GiB 
(train_fn pid=294) Using FIFO scheduling algorithm.
(train_fn pid=294) Resources requested: 1.0/26 CPUs, 3.0/3 GPUs, 0.0/52.15 GiB heap, 0.0/15.46 GiB objects
(train_fn pid=294) Result logdir: /home/ray/ray_results/HuggingFaceTrainer_2023-04-16_12-36-12
(train_fn pid=294) Number of trials: 1/1 (1 RUNNING)
(train_fn pid=294) +--------------------------------+----------+------------------+--------+------------------+--------+-----------------+----------+
(train_fn pid=294) | Trial name                     | status   | loc              |   iter |   total time (s) |   loss |   learning_rate |    epoch |
(train_fn pid=294) |--------------------------------+----------+------------------+-------

(train_fn pid=294) Result for HuggingFaceTrainer_f1d17_00000:
(train_fn pid=294)   _time_this_iter_s: 31.07945704460144
(train_fn pid=294)   _timestamp: 1681674542
(train_fn pid=294)   _training_iteration: 2
(train_fn pid=294)   date: 2023-04-16_12-49-02
(train_fn pid=294)   done: true
(train_fn pid=294)   epoch: 1.0
(train_fn pid=294)   experiment_id: c144d505d0c149efa2a1c3c5a0f5bdc2
(train_fn pid=294)   experiment_tag: '0'
(train_fn pid=294)   hostname: hfgputest-worker-small-group-hfgputest-6wlvr
(train_fn pid=294)   iterations_since_restore: 2
(train_fn pid=294)   learning_rate: 0.0
(train_fn pid=294)   loss: 0.1855
(train_fn pid=294)   node_ip: 10.130.0.183
(train_fn pid=294)   pid: 117
(train_fn pid=294)   should_checkpoint: true
(train_fn pid=294)   step: 521

(train_fn pid=294) == Status ==
(train_fn pid=294) Current time: 2023-04-16 12:49:11 (running for 00:12:59.18)
(train_fn pid=294) Memory usage on this node: 70.6/503.7 GiB 
(train_fn pid=294) Using FIFO scheduling algorithm.
(train_fn pid=294) Resources requested: 0/26 CPUs, 0/3 GPUs, 0.0/52.15 GiB heap, 0.0/15.46 GiB objects
(train_fn pid=294) Result logdir: /home/ray/ray_results/HuggingFaceTrainer_2023-04-16_12-36-12
(train_fn pid=294) Number of trials: 1/1 (1 TERMINATED)
(train_fn pid=294) +--------------------------------+------------+------------------+--------+------------------+--------+-----------------+---------+
(train_fn pid=294) | Trial name                     | status     | loc              |   iter |   total time (s) |   loss |   learning_rate |   epoch |
(train_fn pid=294) |--------------------------------+------------+------------------+----

(train_fn pid=294) 2023-04-16 12:49:11,740	INFO tune.py:777 -- Total run time: 779.38 seconds (779.18 seconds for the tuning loop).


In [13]:
from ray.train.torch import TorchCheckpoint
checkpoint: TorchCheckpoint = result
path = checkpoint.to_directory()

In [14]:
print(path)
!ls {path}

/tmp/checkpoint_tmp_41cdabea109d46b1bde91192b77a3ecf
E0416 19:49:19.333066640     978 completion_queue.cc:738]              Kick failed: UNKNOWN:Bad file descriptor {created_time:"2023-04-16T19:49:19.332851369+00:00", errno:9, os_error:"Bad file descriptor", syscall:"eventfd_write"}
config.json		_preprocessor	   trainer_state.json
_current_checkpoint_id	pytorch_model.bin  training_args.bin
_metadata.meta.pkl	rng_state_0.pth
optimizer.pt		scheduler.pt


In [None]:
#log_dir=result.log_dir
#print(f"log_dir: {log_dir}")

# Inference using the checkpoint

In [15]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
#DistilbertTokenizerFast
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
model = AutoModelForSequenceClassification.from_pretrained(path,num_labels=2, id2label=id2label, label2id=label2id)
text1 = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
text2 = "This is a catastrophe. Each of the three movies had different actors that made it difficult to follow."
#inputs = tokenizer(text, return_tensors="pt")
batch=[text1,text2]
inputs = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt")

# Run inference without tracking gradients; ** unpacks the tokenizer output into keyword arguments
with torch.no_grad():
    logits = model(**inputs).logits

  from .autonotebook import tqdm as notebook_tqdm
Downloading (…)okenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 11.1kB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 483/483 [00:00<00:00, 217kB/s]
Downloading (…)solve/main/vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 5.37MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 8.14MB/s]


In [16]:
print(logits)
print(torch.nn.Softmax(dim=1)(logits)) #tf.math.softmax(logits, axis=-1)

tensor([[-1.6680,  2.4469],
        [ 1.0801, -0.9055]])
tensor([[0.0161, 0.9839],
        [0.8793, 0.1207]])
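The softmax step can be reproduced by hand to see where the probabilities come from. The sketch below is a plain NumPy version (the `softmax` helper is written here for illustration; the notebook itself uses `torch.nn.Softmax`) applied to the rounded logits printed above.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row maximum keeps the exponentials stable."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


logits_example = [[-1.6680, 2.4469], [1.0801, -0.9055]]
probs_example = softmax(logits_example)
print(np.round(probs_example, 4))
```

Each row sums to 1, and the values agree with the tensor printed above to four decimal places: about 0.98 for POSITIVE on the first review and about 0.88 for NEGATIVE on the second.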


In [17]:
import numpy as np
print(np.array(logits))
predicted_class_id = np.array(logits).argmax(axis=1)
print(predicted_class_id)
print([model.config.id2label[i] for i in predicted_class_id])

[[-1.6679788   2.4469006 ]
 [ 1.0801115  -0.90551084]]
[1 0]
['POSITIVE', 'NEGATIVE']
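The same argmax step is what `compute_metrics` in the training function relies on: predictions are the index of the largest logit, and accuracy is the fraction of predictions that match the reference labels. A minimal NumPy sketch follows; the `accuracy` helper and the reference labels are assumptions made for illustration, not values taken from the dataset.

```python
import numpy as np


def accuracy(logits, labels):
    """Fraction of rows whose largest logit sits at the reference label index."""
    predictions = np.argmax(np.asarray(logits), axis=-1)
    return float(np.mean(predictions == np.asarray(labels)))


example_logits = [[-1.6680, 2.4469], [1.0801, -0.9055]]
# Hypothetical reference labels: 1 = POSITIVE, 0 = NEGATIVE
print(accuracy(example_logits, [1, 0]))  # 1.0
print(accuracy(example_logits, [1, 1]))  # 0.5
```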


# Convert to ONNX

In [18]:
torch.onnx.export(
    model, 
    tuple(inputs.values()),
    f="torch-model.onnx",  
    input_names=['input_ids', 'attention_mask'], 
    output_names=['logits'], 
    dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence'}, 
                  'attention_mask': {0: 'batch_size', 1: 'sequence'}, 
                  # logits have shape (batch_size, num_labels), so only the batch axis is dynamic
                  'logits': {0: 'batch_size'}}, 
    do_constant_folding=True, 
    opset_version=13, 
)

  mask, torch.tensor(torch.finfo(scores.dtype).min)


verbose: False, log level: Level.ERROR



In [19]:
from datasets import load_dataset
dataset = load_dataset("imdb")

Downloading builder script: 100%|██████████| 4.31k/4.31k [00:00<00:00, 3.53MB/s]
Downloading metadata: 100%|██████████| 2.17k/2.17k [00:00<00:00, 823kB/s]
Downloading readme: 100%|██████████| 7.59k/7.59k [00:00<00:00, 1.70MB/s]


Downloading and preparing dataset imdb/plain_text to /opt/app-root/src/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0...


Downloading data: 100%|██████████| 84.1M/84.1M [00:20<00:00, 4.20MB/s]
                                                                                              

Dataset imdb downloaded and prepared to /opt/app-root/src/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 705.36it/s]


In [20]:
import onnx
import onnxruntime
import torch
import numpy as np

In [21]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [22]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print(tokenizer)

session = onnxruntime.InferenceSession('torch-model.onnx', None)
text="This is a catastrophe."
inputs = tokenizer(text, return_tensors="np")
print(inputs)

result1 = session.run([i.name for i in session.get_outputs()], dict(inputs))
print(result1)

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
predicted_class_id = np.array(result1).argmax().item()
print(id2label[predicted_class_id])

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)
{'input_ids': array([[  101,  2023,  2003,  1037, 25539,  1012,   102]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1]])}
[array([[ 0.46370822, -0.31326616]], dtype=float32)]
NEGATIVE


In [23]:
#import tensorflow as tf
#predictions = tf.math.softmax(result, axis=-1)
print(torch.nn.Softmax(dim=1)(torch.tensor(result1[0])))

tensor([[0.6850, 0.3150]])


In [24]:
text1 = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
text2 = "This is a catastrophe."
batch=[text1,text2]
inputs = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="np")
result2 = session.run([i.name for i in session.get_outputs()], dict(inputs))
print(result2)
# Compute the class probabilities once and reuse them below
probs = torch.nn.Softmax(dim=1)(torch.tensor(result2[0]))
print(np.argmax(probs, axis=1))
print([id2label[i.item()] for i in torch.argmax(probs, axis=1)])
labels = [id2label[labelid] for labelid in torch.argmax(probs, axis=1).tolist()]
print(labels)

[array([[-1.6679785 ,  2.4469004 ],
       [ 0.46370822, -0.31326616]], dtype=float32)]
tensor([1, 0])
['POSITIVE', 'NEGATIVE']
['POSITIVE', 'NEGATIVE']
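A quick way to confirm that the export preserved the model is to compare the ONNX Runtime logits with the PyTorch logits for the same input. The sketch below uses the values printed earlier for `text1`, copied by hand; the tolerance of `1e-5` is an assumption, not a value prescribed by either library. Only the first row is compared because `text2` differs between the PyTorch cell and the ONNX cell.

```python
import numpy as np

# Logits for text1 as printed by the PyTorch model and by the ONNX session above
pytorch_logits = np.array([-1.6679788, 2.4469006], dtype=np.float32)
onnx_logits = np.array([-1.6679785, 2.4469004], dtype=np.float32)

max_abs_diff = float(np.max(np.abs(pytorch_logits - onnx_logits)))
print(max_abs_diff)
print(np.allclose(pytorch_logits, onnx_logits, atol=1e-5))  # True
```

Differences at this scale are float32 rounding noise, so the two runtimes can be treated as producing the same predictions here.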


# Upload the model to S3 Bucket

In [26]:
import os
import boto3

# Read the S3 credentials and endpoint from environment variables
key_id = os.environ.get('AWS_ACCESS_KEY_ID')
secret_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
endpoint_url = os.environ.get('AWS_S3_ENDPOINT')

# verify=False disables TLS certificate verification; only use it with endpoints you trust
s3_client = boto3.client('s3', aws_access_key_id=key_id, aws_secret_access_key=secret_key, endpoint_url=endpoint_url, verify=False)

# List the available buckets; `bucket` keeps the last one listed and is reused in the next cells
buckets = s3_client.list_buckets()
for bucket in buckets['Buckets']:
    print(bucket['Name'])

spark-demo-07ae575f-1b38-4064-b358-c0f7e1a88ddb




In [27]:
print(bucket['Name'])
modelfile='torch-model.onnx'
s3_client.upload_file(modelfile, bucket['Name'],'hf_model.onnx')

spark-demo-07ae575f-1b38-4064-b358-c0f7e1a88ddb




In [28]:
[item.get("Key") for item in s3_client.list_objects_v2(Bucket=bucket['Name']).get("Contents")]



['hf_model.onnx',
 'iris/model.joblib',
 'mnist-model/1/saved_model.pb',
 'mnist-model/1/variables/variables.data-00000-of-00001',
 'mnist-model/1/variables/variables.index',
 'mnist-svm.joblib',
 'mnist.onnx',
 'mnist_model_from_saved_model.onnx',
 'model.joblib',
 'saved_model.pb',
 'yolov5n.onnx']
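One caveat with the listing cell above: for an empty bucket the `list_objects_v2` response has no `Contents` entry, so iterating over `.get("Contents")` would fail on `None`. A defensive version might look like the sketch below, where `keys_from_listing` is a hypothetical helper and the dictionaries are hand-written stand-ins for real responses.

```python
def keys_from_listing(response):
    """Extract object keys from a list_objects_v2-style response dict."""
    return [item["Key"] for item in response.get("Contents", [])]


# Hand-written stand-ins for S3 responses
print(keys_from_listing({"KeyCount": 0}))  # []
print(keys_from_listing({"Contents": [{"Key": "hf_model.onnx"}, {"Key": "mnist.onnx"}]}))
```

With the real client this would be called as `keys_from_listing(s3_client.list_objects_v2(Bucket=bucket['Name']))`. A single call returns at most 1,000 keys, so larger buckets need pagination.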

Now deploy the model manually from Data Science Projects.

---
# Submit inferencing request to Deployed model

In [None]:
import requests
import json
URL='http://modelmesh-serving.huggingface.svc.cluster.local:8008/v2/models/hfmodel/infer' # underscore characters are removed
# Build the inference request with one entry per model input
payload = {
    "inputs": [
        {"name": "input_ids", "shape": inputs.get('input_ids').shape, "datatype": "INT64", "data": inputs.get('input_ids').tolist()},
        {"name": "attention_mask", "shape": inputs.get('attention_mask').shape, "datatype": "INT64", "data": inputs.get('attention_mask').tolist()},
    ]
}

headers = {"content-type": "application/json"}
res = requests.post(URL, json=payload, headers=headers)
print(res)
print(res.text)

In [None]:
result=[np.array(res.json().get('outputs')[0].get('data')).reshape(res.json().get('outputs')[0].get('shape'))]

In [None]:
# Compute the class probabilities once and reuse them below
probs = torch.nn.Softmax(dim=1)(torch.tensor(result[0]))
print(np.argmax(probs, axis=1))
print('Using item', [id2label[i.item()] for i in torch.argmax(probs, axis=1)])
labels = [id2label[labelid] for labelid in torch.argmax(probs, axis=1).tolist()]
print('Using to_list', labels)
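The request and response handling above can be factored into two small helpers, which makes the round trip easier to test without a deployed model. This is a sketch under assumptions: `to_v2_inputs` and `first_output_as_array` are hypothetical helper names, and `fake_response` is a hand-written body shaped like the `outputs` field read in the cells above, filled with the logits printed earlier for "This is a catastrophe."

```python
import numpy as np


def to_v2_inputs(encoded):
    """Build the 'inputs' list of the request from a name -> integer array mapping."""
    return [
        {
            "name": name,
            "shape": list(array.shape),
            "datatype": "INT64",
            "data": array.tolist(),
        }
        for name, array in encoded.items()
    ]


def first_output_as_array(response_json):
    """Reshape the flat 'data' list of the first output to its declared shape."""
    out = response_json["outputs"][0]
    return np.array(out["data"]).reshape(out["shape"])


# Token ids and mask as printed by the tokenizer earlier in this notebook
encoded = {
    "input_ids": np.array([[101, 2023, 2003, 1037, 25539, 1012, 102]]),
    "attention_mask": np.array([[1, 1, 1, 1, 1, 1, 1]]),
}
payload_example = {"inputs": to_v2_inputs(encoded)}

# Hypothetical response body, not captured from a real server
fake_response = {
    "outputs": [
        {"name": "logits", "shape": [1, 2], "datatype": "FP32", "data": [0.46370822, -0.31326616]}
    ]
}
decoded = first_output_as_array(fake_response)
print(decoded.shape)                   # (1, 2)
print(int(decoded.argmax(axis=1)[0]))  # 0
```

Index 0 maps to NEGATIVE in the `id2label` dictionary used throughout this notebook.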

Finally, we bring our resource cluster down with `cluster.down()` and log out with `auth.logout()` (the last two cells of this notebook), which releases the associated resources and returns everything to the way it was before our cluster was brought up.

# Conclusion
As shown in the above example, you can run your Huggingface transfer learning tasks easily and natively on CodeFlare. You can scale them from 1 to n GPUs without making any significant code changes, while still using the native Huggingface trainer.

Also refer to the additional notebooks that showcase other use cases.
Our next notebook, [./02_codeflare_workflows_encoding.ipynb], shows an sklearn example and how you can use workflows to run experiment pipelines and explore multiple pipelines in parallel on a CodeFlare cluster.


In [None]:
cluster.down()

In [None]:
auth.logout()