# Report: Solving Task Clarity for LLM-based Agents

## Problem Statement

The challenge of achieving task clarity for LLM-based Agents involves enabling these language model-based agents to understand and execute user-provided tasks accurately.
This report explores a current solution and outlines future work in addressing this problem.
Especially for multi-languages

## Example:

- **Client Input:** "Find for me a product in a website."
- **Expected Output:**
  ```python
  {'task_type': 'Information retrieval', 'scores': > 0.2, 'inference_time': < 10 s}


## Outline

#### Sections:
1. **Current Solution: Zero-Shot Classification on a Trained Bert-NLI on Multilingual Dataset**

2. **Fine Tune**
   - Focuses on the manual fine-tuning process, including adjustments in model parameters, architecture modifications, and specific training procedures.
   
     2.1. **Solution Approach**

     2.2. **Dataset Preparation** 

     2.3. **Manual Fine Tuning**
     
     2.4. **Auto Fine Tuning**

3. **Save, Load, and Test Original vs. Manual Fine Tune vs. Auto Fine Tune Model**


## Current solution: Zero-Shot Classification on a trained Bert-NLI on Multilingual Dataset

*   model_id: "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"

In [1]:
# #install library
# ! pip install accelerate peft bitsandbytes guidance trl py7zr
# ! pip install git+https://github.com/huggingface/transformers

In [2]:
#logging Hugging Face
hf_key = "Your Hugging Face key"
!huggingface-cli login --token $hf_key

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
# install library
!pip install transformers[sentencepiece]~=4.33.0 -qq
!pip install datasets~=2.14.0 -qq
!pip install accelerate~=0.23.0 -qq
!pip install wandb~=0.15.0 -qq
!pip install mdutils~=1.6.0 -qq
!pip install scikit-learn~=1.2.0 -qq
!pip install sentencepiece
!pip install optuna~=3.3.0

Collecting optuna~=3.3.0
  Downloading optuna-3.3.0-py3-none-any.whl (404 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m404.2/404.2 kB[0m [31m7.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting alembic>=1.5.0 (from optuna~=3.3.0)
  Downloading alembic-1.12.1-py3-none-any.whl (226 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m226.8/226.8 kB[0m [31m11.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting cmaes>=0.10.0 (from optuna~=3.3.0)
  Downloading cmaes-0.10.0-py3-none-any.whl (29 kB)
Collecting colorlog (from optuna~=3.3.0)
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting Mako (from alembic>=1.5.0->optuna~=3.3.0)
  Downloading Mako-1.3.0-py3-none-any.whl (78 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m78.6/78.6 kB[0m [31m7.7 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: Mako, colorlog, cmaes, alembic, optuna
Successfully installed Mako-1.3.0 alembic-1.12.1 cmaes-0.10.0 colorlog-6.

In [4]:
from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7")

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device)


cuda


In [6]:
#experiments
import asyncio
import re
import time

input = "Write a pong game in python."
candidate_labels = ["Greeting", "Information retrieval", "Sentiment analysis", "Text generation", "Code generation", "Q&A", "Summarization", "Translation"]
time_execution = time.time()
output = classifier(input, candidate_labels, multi_label=False)
time_execution = round(time.time() - time_execution, 2)
print(f"task_type: {output['labels'][0]}, scores: {round(output['scores'][0],2)}, inference_time: {time_execution}")

task_type: Code generation, scores: 0.47, inference_time: 3.9


In [7]:
async def task_classifier(client_input: str, task_types: str):
    """Classify tasks for LLM-based gent"""
    candidate_labels = [label.strip() for label in task_types.split(",")]
    time_execution = time.time()
    output = classifier(str(client_input), candidate_labels, multi_label=False)
    # output = classifier(input, candidate_labels, multi_label=False)
    time_execution = round(time.time() - time_execution, 2)
    return {"task_type": output['labels'][0], "scores": round(output['scores'][0],2), "inference_time": time_execution}

### Test the solution & results

In [8]:
# List of test cases
test_cases = [
    "Summarize the latest scientific research.",
    "Translate this English text into French.",
    "What's the weather forecast for tomorrow?",
    "Generate Python code for a simple calculator.",
    "Please provide me with the most popular video currently trending on YouTube",
    "Hãy dịch giúp mình",
    "Chào các bác",
    "Tìm kiếm thông tin giúp tôi",
    "私に合った製品を見つける",
    "너무 사랑해요",
    "Рад встрече"
]

# Run the task_classifier function for each test case
for test_case in test_cases:
    result = await task_classifier(test_case, 'Greeting, Information retrieval, Sentiment analysis, Text generation, Code generation, Q&A, Summarization, Translation')
    print(f"Input: '{test_case}'")
    print(f"Output: {result}")
    print()

Input: 'Summarize the latest scientific research.'
Output: {'task_type': 'Summarization', 'scores': 0.97, 'inference_time': 2.33}

Input: 'Translate this English text into French.'
Output: {'task_type': 'Translation', 'scores': 0.92, 'inference_time': 3.87}

Input: 'What's the weather forecast for tomorrow?'
Output: {'task_type': 'Q&A', 'scores': 0.92, 'inference_time': 7.62}

Input: 'Generate Python code for a simple calculator.'
Output: {'task_type': 'Code generation', 'scores': 0.98, 'inference_time': 6.1}

Input: 'Please provide me with the most popular video currently trending on YouTube'
Output: {'task_type': 'Translation', 'scores': 0.27, 'inference_time': 3.91}

Input: 'Hãy dịch giúp mình'
Output: {'task_type': 'Summarization', 'scores': 0.4, 'inference_time': 3.14}

Input: 'Chào các bác'
Output: {'task_type': 'Greeting', 'scores': 0.84, 'inference_time': 4.77}

Input: 'Tìm kiếm thông tin giúp tôi'
Output: {'task_type': 'Information retrieval', 'scores': 0.61, 'inference_time':

## Finetune
The solution is referred to Mr. Laurer's solution, available at [this GitHub link](https://github.com/MoritzLaurer/summer-school-transformers-2023/blob/main/4_tune_bert_nli.ipynb) for related content and examples.


### Solution Approach

Fine-tuning a Large Language Model (LLM) like DeBERTa for a task classification problem, where the task is to classify user queries into specific categories, is a nuanced process. The approach of transforming the dataset into a Natural Language Inference (NLI) format can be highly effective, especially when using a model pre-trained on NLI tasks like mDeBERTa-v3-base-mnli-xnli.

### Dataset Preparation

In [9]:
# mount to drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
# process data
import pandas as pd
from sklearn.model_selection import train_test_split
import datasets

#dataset path
dataset_path = '/content/drive/MyDrive/Colab Notebooks/task_classification/LLM_NLI_Dataset_Balanced.csv'  # Replace with your transformed dataset path

# Load the dataset
df = pd.read_csv(dataset_path)
print(f'Original Data: {df.head()}')

# Numerical the labels
label_mapping = {'Entailment': 0, 'Neutral': 1, 'Contradiction': 2}
df['label'] = df['Label'].replace(label_mapping)


# Split the dataset
train, test = train_test_split(df, test_size=0.3, random_state=42) # 70% training, 30% for test

# convert pandas dataframes to Hugging Face dataset object to facilitate pre-processing
dataset = datasets.DatasetDict({
    "train": datasets.Dataset.from_pandas(train),
    "test": datasets.Dataset.from_pandas(test)
})
print(f'Train dataset:{dataset["train"].features}')
print(train.describe())

## load the BERT-NLI's tokenizer
# you can choose any of the NLI models here: https://huggingface.co/MoritzLaurer
model_name = "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"  # multilingual model: "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)

# tokenize
def tokenize_nli_format(examples):
  return tokenizer(examples["Premise"], examples["Hypothesis"], truncation=True, max_length=512)  # max_length can be reduced to e.g. 256 to increase speed, but long texts will be cut off

# Tokenize the dataset
dataset = dataset.map(tokenize_nli_format, batched=True)

# remove unnecessary columns for model training
dataset = dataset.remove_columns([
    'Task', '__index_level_0__', 'Label'])

# inspect the dataset
print("The overall structure of the pre-processed train and test sets:\n")
print(dataset)


Original Data:           Task                                            Premise  \
0  Translation  An LLM is used to translate a technical docume...   
1  Translation  An LLM is tasked with initiating a conversatio...   
2  Translation  A user queries an LLM for the latest research ...   
3  Translation  An LLM analyzes customer reviews to determine ...   
4  Translation  An LLM is used to write a short story in the s...   

                                          Hypothesis       Label  
0  LLMs can accurately translate text between lan...  Entailment  
1  LLMs can accurately translate text between lan...  Entailment  
2  LLMs can accurately translate text between lan...  Entailment  
3  LLMs can accurately translate text between lan...  Entailment  
4  LLMs can accurately translate text between lan...  Entailment  
Train dataset:{'Task': Value(dtype='string', id=None), 'Premise': Value(dtype='string', id=None), 'Hypothesis': Value(dtype='string', id=None), 'Label': Value(dtype='st

Map:   0%|          | 0/453 [00:00<?, ? examples/s]

Map:   0%|          | 0/195 [00:00<?, ? examples/s]

The overall structure of the pre-processed train and test sets:

DatasetDict({
    train: Dataset({
        features: ['Premise', 'Hypothesis', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 453
    })
    test: Dataset({
        features: ['Premise', 'Hypothesis', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 195
    })
})


#### Manual Fine Tune

In [11]:
# import libraries
import transformers
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
import pandas as pd
from datasets import Dataset
import torch

max_length = 512
# Load the transformed NLI dataset
max_length=512
SEED_GLOBAL = 2023


# Initialize the Model
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# use GPU (cuda) if available, otherwise use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
model.to(device);

Device: cuda


##### Setting training arguments / hyperparameters, metrics

The subsequent cell defines crucial hyperparameters. We selected settings that generally perform effectively to eliminate the necessity for hyperparameter exploration. Additionally, we offer code later on for conducting hyperparameter searches, allowing researchers the option to potentially enhance performance by a small margin.

In [12]:
from transformers import TrainingArguments, Trainer, logging

# Set the directory to write the fine-tuned model and training logs to.
# With google colab, this will create a temporary folder, which will be deleted once you disconnect.
# You can connect to your personal google drive to save models and logs properly.
training_directory = "BERT-nli-demo"

# FP16 is a hyperparameter which can increase training speed and reduce memory consumption, but only on GPU and if batch-size > 8, see here: https://huggingface.co/transformers/performance.html?#fp16
# FP16 does not work on CPU or for multilingual mDeBERTa models
#fp16_bool = True if torch.cuda.is_available() else False
#if "mdeberta" in model_name.lower(): fp16_bool = False  # multilingual mDeBERTa does not support FP16 yet: https://github.com/microsoft/DeBERTa/issues/77
# in case of hyperparameter search end the end: FP16 has to be set to False. The integrated hyperparameter search with the Hugging Face Trainer can lead to errors otherwise.
#fp16_bool = False

# Hugging Face tipps to increase training speed and decrease out-of-memory (OOM) issues: https://huggingface.co/transformers/performance.html?
# Overview of all training arguments: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
train_args = TrainingArguments(
    output_dir=f'./results/{training_directory}',
    logging_dir=f'./logs/{training_directory}',
    learning_rate=2e-5,
    per_device_train_batch_size=16,  # if you get an out-of-memory error, reduce this value to 8 or 4 and restart the runtime. Higher values increase training speed, but also increase memory requirements. Ideal values here are always a multiple of 8.
    per_device_eval_batch_size=80,  # if you get an out-of-memory error, reduce this value, e.g. to 40 and restart the runtime
    #gradient_accumulation_steps=4, # Can be used in case of memory problems to reduce effective batch size. accumulates gradients over X steps, only then backward/update. decreases memory usage, but also slightly speed. (!adapt/halve batch size accordingly)
    num_train_epochs=2,  # this can be increased, but higher values increase training time. Good values for NLI are between 3 and 20.
    warmup_ratio=0.25,  # a good normal default value is 0.06 for normal BERT-base models, but since we want to reuse prior NLI knowledge and avoid catastrophic forgetting, we set the value higher
    weight_decay=0.1,
    seed=SEED_GLOBAL,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    #fp16=fp16_bool,  # Can speed up training and reduce memory consumption, but only makes sense at batch-size > 8. loads two copies of model weights, which creates overhead. https://huggingface.co/transformers/performance.html?#fp16
    #fp16_full_eval=fp16_bool,
    evaluation_strategy="epoch", # options: "no"/"steps"/"epoch"
    #eval_steps=10_000,  # evaluate after n steps if evaluation_strategy!='steps'. defaults to logging_steps
    save_strategy = "epoch",  # options: "no"/"steps"/"epoch"
    #save_steps=10_000,              # Number of updates steps before two checkpoint saves.
    #save_total_limit=10,             # If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir
    #logging_strategy="steps",
    report_to=[],  # "all"  # logging
    #push_to_hub=False,
    #push_to_hub_model_id=f"{model_name}-finetuned-{task}",
)


In [13]:
# helper function to clean memory and reduce risk of out-of-memory error
import gc
def clean_memory():
  #del(model)
  if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
  gc.collect()

clean_memory()

**Explanation of different training arguments:**

You can find more arguments, explanations and examples in the Hugging Face [documentation](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments).

* **num_train_epochs**:
Specifies the number of times the entire training dataset is passed through the model. For example, num_train_epochs=3 means the trainer will iterate over the entire training dataset three times.

* **per_device_train_batch_size**:
The model does not learn from the entire dataset at once, but in batches of e.g. 16 texts. For example, if per_device_train_batch_size=16, then the model analyses 16 texts and sees how wrong it was on these 16 texts. After analysing these 16 texts, the model's parameters are updated/optimised to make the model less wrong on these texts. The degree to which the model's parameters are updated is called 'learning rate'.

* **learning_rate**:
The the "rate" or speed with which the model's parameters are updated by the optimisation algorithm. A smaller value makes the model's parameter updated more slowly after each batch, while a larger value updates the model's paramters more drastically. A good general value is 2e-5 (which means 0.00002).

* **per_device_eval_batch_size**:
The number of evaluation examples used in one batch during evaluation. This batch size is irrelevant for the model's learning. A higher value makes evaluation faster (more texts are processed at the same time), but higher values also cost memory and increase the risk of out-of-memory errors (OOM).

* **gradient_accumulation_steps**:
Indicates the number of steps before performing a backward/update pass. This means the loss is accumulated over gradient_accumulation_steps steps instead of updating after every step. Useful for training with larger effective batch sizes using limited memory.

* **warmup_ratio**:
Specifies the ratio of total training steps for which the learning rate will be linearly increased (warm-up phase) before it's decayed.

* **weight_decay**:
A regularization technique which penalizes large weights by adding a penalty term to the loss. It helps prevent overfitting.

* **seed**:
Sets a seed for reproducibility. Ensures that multiple runs with the same seed produce the same results.

* **load_best_model_at_end**:
Loads the best model (according to metric_for_best_model) at the end of training instead of the last model.

* **metric_for_best_model**:
Determines which metric to use for evaluating and determining the best model during training.

* **evaluation_strategy**:
Defines when to evaluate the model. For example, evaluation_strategy="epoch" evaluates the model after every epoch.

* **save_strategy**:
Specifies when to save the model. For example, save_strategy="epoch" saves the model after every epoch.

* **fp16**:
Enables mixed precision training if set to True, which can speed up training and reduce memory usage. But this does not work with every model and is only beneficial with a batch_size >= 16.


In [14]:
# Metrics
from sklearn.metrics import balanced_accuracy_score, precision_recall_fscore_support, accuracy_score, classification_report
import numpy as np

def compute_metrics_nli_binary(eval_pred, label_text_alphabetical=None):
    predictions, labels = eval_pred

    ### reformat model output to enable calculation of standard metrics
    # split in chunks with predictions for each hypothesis for one unique premise
    def chunks(lst, n):  # Yield successive n-sized chunks from lst. https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
        for i in range(0, len(lst), n):
            yield lst[i:i + n]

    # for each chunk/premise, select the most likely hypothesis
    softmax = torch.nn.Softmax(dim=1)
    prediction_chunks_lst = list(chunks(predictions, len(set(label_text_alphabetical)) ))
    hypo_position_highest_prob = []
    for i, chunk in enumerate(prediction_chunks_lst):
        hypo_position_highest_prob.append(np.argmax(np.array(chunk)[:, 0]))  # only accesses the first column of the array, i.e. the entailment/true prediction logit of all hypos and takes the highest one

    label_chunks_lst = list(chunks(labels, len(set(label_text_alphabetical)) ))
    label_position_gold = []
    for chunk in label_chunks_lst:
        label_position_gold.append(np.argmin(chunk))  # argmin to detect the position of the 0 among the 1s

    #print("Highest probability prediction per premise: ", hypo_position_highest_prob)
    #print("Correct label per premise: ", label_position_gold)

    ### calculate standard metrics
    precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(label_position_gold, hypo_position_highest_prob, average='macro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
    precision_micro, recall_micro, f1_micro, _ = precision_recall_fscore_support(label_position_gold, hypo_position_highest_prob, average='micro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
    acc_balanced = balanced_accuracy_score(label_position_gold, hypo_position_highest_prob)
    acc_not_balanced = accuracy_score(label_position_gold, hypo_position_highest_prob)
    metrics = {
        'accuracy': acc_not_balanced,
        'f1_macro': f1_macro,
        'accuracy_balanced': acc_balanced,
        'f1_micro': f1_micro,
        'precision_macro': precision_macro,
        'recall_macro': recall_macro,
        'precision_micro': precision_micro,
        'recall_micro': recall_micro,
        #'label_gold_raw': label_position_gold,
        #'label_predicted_raw': hypo_position_highest_prob
    }
    #print("Aggregate metrics: ", {key: metrics[key] for key in metrics if key not in ["label_gold_raw", "label_predicted_raw"]} )  # print metrics but without label lists
    #print("Detailed metrics: ", classification_report(label_position_gold, hypo_position_highest_prob, labels=np.sort(pd.factorize(label_text_alphabetical, sort=True)[0]), target_names=label_text_alphabetical, sample_weight=None, digits=2, output_dict=True,
    #                            zero_division='warn'), "\n")
    return metrics

# Create alphabetically ordered list of the original dataset classes/labels
# This is necessary to be sure that the ordering of the test set labels and predictions is the same. Otherwise there is a risk that labels and predictions are in a different order and resulting metrics are wrong.
label_text_alphabetical = np.sort(train.Hypothesis.unique())
print(label_text_alphabetical)


['LLMs can accurately identify and classify sentiments in text, recognizing subtle emotional cues and expressions.'
 'LLMs can accurately translate text between languages by understanding contextual nuances and idiomatic expressions, maintaining the original meaning and tone.'
 'LLMs can create coherent, contextually appropriate, and engaging text across various genres and styles.'
 'LLMs can curate and recommend content that aligns with user preferences and behaviors, enhancing user experience.'
 'LLMs can efficiently retrieve relevant information from a vast corpus, understanding and interpreting complex user queries.'
 'LLMs can generate appropriate and culturally sensitive greetings, adapting to various social contexts and user profiles.'
 'LLMs can generate syntactically correct and logically coherent code, aiding in software development tasks.'
 'LLMs can provide accurate, concise, and relevant answers to a wide range of questions, showcasing deep understanding and reasoning.'
 '

In [15]:
# Fine-tuning and evaluation

## Fine-tuning
# training
manual_trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=lambda eval_pred: compute_metrics_nli_binary(eval_pred, label_text_alphabetical=label_text_alphabetical)
)

manual_trainer.train()

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Accuracy Balanced,F1 Micro,Precision Macro,Recall Macro,Precision Micro,Recall Micro
1,No log,0.924523,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,No log,0.499559,0.090909,0.041667,0.066667,0.090909,0.125,0.025,0.090909,0.090909


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


TrainOutput(global_step=58, training_loss=1.1051797537968076, metrics={'train_runtime': 69.4023, 'train_samples_per_second': 13.054, 'train_steps_per_second': 0.836, 'total_flos': 26303077287630.0, 'train_loss': 1.1051797537968076, 'epoch': 2.0})

In [16]:
# Evaluate the fine-tuned model on the held-out test set
results = manual_trainer.evaluate()
print(f'Manual Fine-tuned Results:\n {results}')

Manual Fine-tuned Results:
 {'eval_loss': 0.4995587468147278, 'eval_accuracy': 0.09090909090909091, 'eval_f1_macro': 0.04166666666666667, 'eval_accuracy_balanced': 0.06666666666666667, 'eval_f1_micro': 0.09090909090909091, 'eval_precision_macro': 0.125, 'eval_recall_macro': 0.025, 'eval_precision_micro': 0.09090909090909091, 'eval_recall_micro': 0.09090909090909091, 'eval_runtime': 0.2833, 'eval_samples_per_second': 688.352, 'eval_steps_per_second': 10.59, 'epoch': 2.0}


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### Auto Fine Tune

##### Automatically setting training arguments / hyperparameters
- Enhance performance via hyperparameter search (hp-search) tailored to your task and dataset.
- Conduct hp-search on a subset of the training set (e.g., validation set) to prevent test set data leakage.
- Note: Small datasets might benefit from hp-search on two different train-validation splits (< 2000 data points).
- See documentation on hp-search with Hugging Face Transformers [here](https://huggingface.co/docs/transformers/main/hpo_train).


In [18]:
## train-validation split - test set should not be visible during hp-search
# https://huggingface.co/docs/datasets/v2.5.1/en/package_reference/main_classes#datasets.Dataset.train_test_split

# the ideal size of the validation set depends on the size of your training data. Each label should have at the very least a few dozen examples in the validation set (ideally several hundred)
validation_set_size = 0.4  # for a training data size of 1000 with 3 classes we use 40% of the training data for validating hyperparameters

# train-validation split for hp-search
dataset_hp = dataset["train"].train_test_split(test_size=validation_set_size, seed=SEED_GLOBAL, shuffle=True)
print(dataset_hp)

DatasetDict({
    train: Dataset({
        features: ['Premise', 'Hypothesis', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 271
    })
    test: Dataset({
        features: ['Premise', 'Hypothesis', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 182
    })
})


In [19]:
# Clean memory and reduce risk of out-of-memory error
clean_memory()

In [20]:
## Reinitialize trainer for hp-search
# https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10

def model_init():
  clean_memory()
  return AutoModelForSequenceClassification.from_pretrained(model_name).to(device)  # return_dict=True

auto_trainer = Trainer(
    model_init=model_init,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset_hp["train"],
    eval_dataset=dataset_hp["test"],
    compute_metrics=lambda eval_pred: compute_metrics_nli_binary(eval_pred, label_text_alphabetical=label_text_alphabetical)
)

In [21]:
# Define the hyperparameters you want to optimise
# Use Optuna for hp-search: https://optuna.readthedocs.io/en/stable/
def hp_space(trial):
    return {
        "learning_rate": trial.suggest_categorical("learning_rate", [9e-6, 2e-5, 4e-5]),
        "num_train_epochs": 4, #trial.suggest_int("num_train_epochs", 4, 24, log=False, step=4),   # increasing the maximum number of epochs here could increase performance but will take (much) longer to train
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.1, 0.6, log=True),
        "per_device_train_batch_size": 16,  # lower this value in case of out-of-memory errors and restart the runtime
        #"per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
        "evaluation_strategy": "no",
        "save_strategy": "no",
    }

In [22]:
import optuna

# number of differen hp configurations to test
numer_of_trials = 5  # increasing this value can lead to better hyperparameters, but will take longer
# chose the sampler for sampling hp configurations
optuna_sampler = optuna.samplers.TPESampler(
    seed=SEED_GLOBAL, consider_prior=True, prior_weight=1.0, consider_magic_clip=True,
    consider_endpoints=False, n_startup_trials=numer_of_trials/2, n_ei_candidates=24,
    multivariate=False, group=False, warn_independent_sampling=True, constant_liar=False
)  # https://optuna.readthedocs.io/en/stable/reference/generated/optuna.samplers.TPESampler.html#optuna.samplers.TPESampler

# Hugging Face Documentation: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.hyperparameter_search
best_run = auto_trainer.hyperparameter_search(
    n_trials=numer_of_trials,
    compute_objective=lambda metrics: metrics["eval_f1_macro"],
    direction="maximize",
    hp_space= hp_space,
    backend='optuna',
    **{"sampler": optuna_sampler}
)

[I 2023-11-17 05:51:09,934] A new study created in memory with name: no-name-684f2b85-2584-4b9b-b67d-49b92a6eee81


Step,Training Loss


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
[I 2023-11-17 05:51:30,179] Trial 0 finished with value: 0.03703703703703704 and parameters: {'learning_rate': 2e-05, 'warmup_ratio': 0.12546162504661698}. Best is trial 0 with value: 0.03703703703703704.


Step,Training Loss


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
[I 2023-11-17 05:51:44,361] Trial 1 finished with value: 0.07333333333333333 and parameters: {'learning_rate': 2e-05, 'warmup_ratio': 0.36806941169521934}. Best is trial 1 with value: 0.07333333333333333.


Step,Training Loss


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
[I 2023-11-17 05:51:59,686] Trial 2 finished with value: 0.03333333333333334 and parameters: {'learning_rate': 2e-05, 'warmup_ratio': 0.24555638780055672}. Best is trial 1 with value: 0.07333333333333333.


Step,Training Loss


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
[I 2023-11-17 05:52:16,829] Trial 3 finished with value: 0.030303030303030307 and parameters: {'learning_rate': 9e-06, 'warmup_ratio': 0.5331240839812664}. Best is trial 1 with value: 0.07333333333333333.


Step,Training Loss


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
[I 2023-11-17 05:52:29,592] Trial 4 finished with value: 0.1111111111111111 and parameters: {'learning_rate': 4e-05, 'warmup_ratio': 0.4233655282316589}. Best is trial 4 with value: 0.1111111111111111.


In [23]:
# show best hyperparameters based on hp-search
print(best_run)

BestRun(run_id='4', objective=0.1111111111111111, hyperparameters={'learning_rate': 4e-05, 'warmup_ratio': 0.4233655282316589}, run_summary=None)


In [24]:
# update the training arguments with the best hyperparameters
for k,v in best_run.hyperparameters.items():
    setattr(train_args, k, v)
print("\n", train_args)


 TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=True,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_mode

In [25]:
# Training
auto_trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset["train"],  #.shard(index=1, num_shards=100),  # https://huggingface.co/docs/datasets/processing.html#sharding-the-dataset-shard
    eval_dataset=dataset["test"],  #.shard(index=1, num_shards=100),
    compute_metrics=lambda eval_pred: compute_metrics_nli_binary(eval_pred, label_text_alphabetical=label_text_alphabetical)
)

auto_trainer.train()

Step,Training Loss


TrainOutput(global_step=116, training_loss=0.0957093567683779, metrics={'train_runtime': 15.9231, 'train_samples_per_second': 113.797, 'train_steps_per_second': 7.285, 'total_flos': 52510568689800.0, 'train_loss': 0.0957093567683779, 'epoch': 4.0})

In [26]:
## Evaluate the fine-tuned model on the held-out test set
results = auto_trainer.evaluate()
print(f'Automatical Fine-tuned Results:\n {results}')


Automatical Fine-tuned Results:
 {'eval_loss': 0.00012210608110763133, 'eval_accuracy': 0.09090909090909091, 'eval_f1_macro': 0.044444444444444446, 'eval_accuracy_balanced': 0.08333333333333333, 'eval_f1_micro': 0.09090909090909091, 'eval_precision_macro': 0.1111111111111111, 'eval_recall_macro': 0.027777777777777776, 'eval_precision_micro': 0.09090909090909091, 'eval_recall_micro': 0.09090909090909091, 'eval_runtime': 0.3148, 'eval_samples_per_second': 619.499, 'eval_steps_per_second': 9.531, 'epoch': 4.0}


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## Save, load and test Original vs Manual Fine Tune vs Auto Fine Tune model

### Save models

In [27]:
import os
# insert the path where you want to save the model
os.chdir("/content/drive/MyDrive/Colab Notebooks/task_classification")
print(os.getcwd())

/content/drive/MyDrive/Colab Notebooks/task_classification


In [28]:
### save best manual model to disk
directory_save_model = f"{training_directory}/"
model_name_custom = f"{model_name.split('/')[-1]}-manual-custom"
manual_mode_custom_path = directory_save_model + model_name_custom

# save the model to google drive
manual_trainer.save_model(output_dir=manual_mode_custom_path)

In [29]:
### save best manual model to disk
model_name_custom = f"{model_name.split('/')[-1]}-auto-custom"
auto_mode_custom_path = directory_save_model + model_name_custom

# save the model to google drive
auto_trainer.save_model(output_dir=auto_mode_custom_path)

### Load models

In [30]:
# load your models and tokenizer saved before from disk
manual_model = AutoModelForSequenceClassification.from_pretrained(manual_mode_custom_path)
manual_tokenizer = AutoTokenizer.from_pretrained(manual_mode_custom_path, use_fast=True, model_max_length=512)  # we load the tokenizer from the original BERT-NLI model

In [31]:
# load your models and tokenizer saved before from disk
auto_model = AutoModelForSequenceClassification.from_pretrained(auto_mode_custom_path)
auto_tokenizer = AutoTokenizer.from_pretrained(auto_mode_custom_path, use_fast=True, model_max_length=512)  # we load the tokenizer from the original BERT-NLI model

In [32]:
# create customized_classifier of the fine-tuned model
manual_fine_tune_classifier = pipeline(
    "zero-shot-classification",
    model=manual_model,  # if you have trained a model above, load_best_model_at_end in the training arguments has automatically replaced model with the fine-tuned model
    # or load a model from the Hugging Face hub, e.g. for 0-shot classification
    #model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-docnli-ling-2c",
    tokenizer=manual_tokenizer,
    framework="pt",
    device=device,
)


auto_fine_tune_classifier = pipeline(
    "zero-shot-classification",
    model=auto_model,  # if you have trained a model above, load_best_model_at_end in the training arguments has automatically replaced model with the fine-tuned model
    # or load a model from the Hugging Face hub, e.g. for 0-shot classification
    #model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-docnli-ling-2c",
    tokenizer=auto_tokenizer,
    framework="pt",
    device=device,
)

### Test Original vs Manual Fine Tune vs Auto Fine Tune model

In [33]:
async def task_classifier(client_input: str, task_types: str, classifier_func):
    """
    Classify tasks for LLM-based agent.

    Parameters:
    - client_input (str): The input text to be classified.
    - task_types (str): A comma-separated string of candidate labels for classification.
    - classifier_func (callable): The classifier function to be used for classification.

    Returns:
    A dictionary with the top task type, its score, and the inference time.
    """
    candidate_labels = [label.strip() for label in task_types.split(",")]
    time_execution_start = time.time()

    # Using the passed classifier function
    output = classifier_func(str(client_input), candidate_labels, multi_label=False)

    time_execution = round(time.time() - time_execution_start, 2)
    return {
        "task_type": output['labels'][0],
        "scores": round(output['scores'][0], 2),
        "inference_time": time_execution
    }


In [34]:
# List of test cases
task_types = 'Greeting, Information retrieval, Sentiment analysis, Text generation, Code generation, Q&A, Summarization, Translation'
test_cases = [
    # Expected LLM Task: Translation
    "Ein Benutzer bittet ein LLM, einen technischen Artikel vom Deutschen ins Chinesische zu übersetzen.", # A user asks an LLM to translate a technical article from German to Chinese.
    "ユーザーが、英語の文書を日本語に翻訳するようLLMに依頼する。", # A user requests an LLM to translate an English document into Japanese.
    "Пользователь просит LLM перевести техническую статью с русского на английский.", # A user asks an LLM to translate a technical article from Russian to English.
    "用户要求LLM将技术文件从中文翻译成越南语。", # A user requests an LLM to translate a technical document from Chinese to Vietnamese.
    "Người dùng yêu cầu LLM dịch một bài báo kỹ thuật từ tiếng Anh sang tiếng Việt.", # A user asks an LLM to translate a technical article from English to Vietnamese.

    # Expected LLM Task: Greeting
    "Ein LLM begrüßt einen japanischen Benutzer mit einem herzlichen 'Guten Morgen'.", # An LLM greets a Japanese user with a warm 'Good Morning'.
    "LLMは、ロシア語を話すユーザーに対して「こんにちは」と挨拶する。", # An LLM greets a Russian-speaking user with 'Hello'.
    "LLM chào mừng người dùng bằng cách nói 'Xin chào' bằng tiếng Việt.", # An LLM welcomes a user by saying 'Hello' in Vietnamese.
    "LLM开始与使用中文的用户进行对话，首先说'你好'。", # An LLM begins a conversation with a user in Chinese, starting with 'Hello'.
    "Пользователь входит в чат, и LLM приветствует его на русском языке.", # A user enters a chat, and the LLM greets them in Russian.

    # Expected LLM Task: Information Retrieval
    "ユーザーが、最新の量子コンピューティングに関する情報をLLMに問い合わせる。", # A user queries an LLM for the latest information on quantum computing in Japanese.
    "Ein Benutzer fragt ein LLM nach den neuesten Forschungsergebnissen in der Biotechnologie auf Deutsch.", # A user asks an LLM for the latest research findings in biotechnology in German.
    "用户向LLM查询有关可持续能源技术的最新信息。", # A user queries an LLM for the latest information on sustainable energy technology in Chinese.
    "Người dùng yêu cầu LLM tìm kiếm thông tin về công nghệ AI mới nhất bằng tiếng Việt.", # A user asks an LLM to search for the latest information on AI technology in Vietnamese.
    "Пользователь спрашивает у LLM о последних достижениях в космической отрасли на русском языке.", # A user inquires with an LLM about the latest advancements in the space industry in Russian.
]

# Run the task_classifier function for each test case
for test_case in test_cases:
    print(f"## Test case: '{test_case}' \n")
    print(f"# Default Model ")
    result = await task_classifier(test_case, task_types, classifier)
    print(f"Output: {result} \n")
    print(f"# Manual Fine-tuned Model ##")
    result = await task_classifier(test_case, task_types, manual_fine_tune_classifier)
    print(f"Output: {result} \n")
    print(f"# Auto Fine-tuned Model ##")
    result = await task_classifier(test_case, task_types, auto_fine_tune_classifier)
    print(f"Output: {result} \n")

Test case: 'Ein Benutzer bittet ein LLM, einen technischen Artikel vom Deutschen ins Chinesische zu übersetzen.' 

## Default Model ##
Output: {'task_type': 'Translation', 'scores': 0.59, 'inference_time': 2.69} 

## Manual Fine-tuned Model ##
Output: {'task_type': 'Q&A', 'scores': 0.15, 'inference_time': 0.26} 

## Auto Fine-tuned Model ##
Output: {'task_type': 'Q&A', 'scores': 0.15, 'inference_time': 0.21} 

Test case: 'ユーザーが、英語の文書を日本語に翻訳するようLLMに依頼する。' 

## Default Model ##
Output: {'task_type': 'Translation', 'scores': 0.62, 'inference_time': 2.3} 

## Manual Fine-tuned Model ##
Output: {'task_type': 'Q&A', 'scores': 0.35, 'inference_time': 0.22} 

## Auto Fine-tuned Model ##
Output: {'task_type': 'Q&A', 'scores': 0.35, 'inference_time': 0.21} 

Test case: 'Пользователь просит LLM перевести техническую статью с русского на английский.' 

## Default Model ##
Output: {'task_type': 'Translation', 'scores': 0.56, 'inference_time': 2.23} 

## Manual Fine-tuned Model ##
Output: {'task_typ



Output: {'task_type': 'Q&A', 'scores': 0.47, 'inference_time': 0.28} 

## Auto Fine-tuned Model ##
Output: {'task_type': 'Q&A', 'scores': 0.47, 'inference_time': 0.26} 

Test case: 'Ein Benutzer fragt ein LLM nach den neuesten Forschungsergebnissen in der Biotechnologie auf Deutsch.' 

## Default Model ##
Output: {'task_type': 'Translation', 'scores': 0.41, 'inference_time': 3.1} 

## Manual Fine-tuned Model ##
Output: {'task_type': 'Information retrieval', 'scores': 0.24, 'inference_time': 0.28} 

## Auto Fine-tuned Model ##
Output: {'task_type': 'Information retrieval', 'scores': 0.24, 'inference_time': 0.3} 

Test case: '用户向LLM查询有关可持续能源技术的最新信息。' 

## Default Model ##
Output: {'task_type': 'Information retrieval', 'scores': 0.79, 'inference_time': 3.11} 

## Manual Fine-tuned Model ##
Output: {'task_type': 'Information retrieval', 'scores': 0.52, 'inference_time': 0.35} 

## Auto Fine-tuned Model ##
Output: {'task_type': 'Information retrieval', 'scores': 0.52, 'inference_time': 0.35

In [35]:
#experiments
input = "An LLM functions as a virtual assistant for legal advice, requiring it to understand and respond accurately to complex legal queries while maintaining conversational flow."
candidate_labels = ["Greeting", "Information retrieval", "Sentiment analysis", "Text generation", "Code generation", "Q&A", "Summarization", "Translation", "Conversation and Chatbots"]

# Default model
time_execution = time.time()
output = classifier(input, candidate_labels, multi_label=False)
time_execution = round(time.time() - time_execution, 2)
print(f"Default Model: task_type: {output['labels'][0]}, scores: {round(output['scores'][0],2)}, inference_time: {time_execution}")

# Manual Fine-tuned Model
time_execution = time.time()
output = manual_fine_tune_classifier(input, candidate_labels, multi_label=False)
time_execution = round(time.time() - time_execution, 2)
print(f"Manual Fine-tuned Model: task_type: {output['labels'][0]}, scores: {round(output['scores'][0],2)}, inference_time: {time_execution}")

# Auto Fine-tuned Model
time_execution = time.time()
output = auto_fine_tune_classifier(input, candidate_labels, multi_label=False)
time_execution = round(time.time() - time_execution, 2)
print(f"Auto Fine-tuned Model: task_type: {output['labels'][0]}, scores: {round(output['scores'][0],2)}, inference_time: {time_execution}")

Default Model: task_type: Summarization, scores: 0.33, inference_time: 7.01
Manual Fine-tuned Model: task_type: Conversation and Chatbots, scores: 0.53, inference_time: 0.53
Auto Fine-tuned Model: task_type: Conversation and Chatbots, scores: 0.53, inference_time: 0.54
