<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/t5/T5_sequence_to_sequence_custom_mlflow_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLflow
https://mlflow.org/docs/latest/introduction/index.html


MLflow is an open-source platform that offers tools to streamline the ML lifecycle and foster collaboration among ML practitioners.

https://mlflow.org/docs/latest/llms/llm-evaluate/index.html

# MLflow transformers Guide
https://mlflow.org/docs/latest/llms/transformers/guide/index.html
https://mlflow.org/docs/latest/llms/transformers/tutorials/fine-tuning/transformers-fine-tuning.html

# T5ForConditionalGeneration
T5Model contains the encoder (stack of encoder layers) and decoder (stack of decoder layers) without any task specific heads. It returns the raw hidden states of the decoder as output.

T5ForConditionalGeneration also contains the encoder and decoder and adds an additional linear layer (lm_head) which takes the final hidden states of decoder and generates the next token.

To fine-tune the model for seq2seq generation, use T5ForConditionalGeneration; if you want to add a different task-specific head, you can use T5Model.
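
To make the role of `lm_head` concrete, here is a toy sketch of a bias-free linear projection from a decoder hidden state to one logit per vocabulary entry, followed by a greedy pick of the next token. The sizes and weights are invented for illustration and have nothing to do with the real T5 parameters.

```python
# Toy illustration only: sizes and weights are made up, not taken from T5.
hidden_state = [0.5, -1.0, 2.0]  # final decoder hidden state (d_model = 3)
lm_head = [                      # one weight row per vocabulary entry
    [1.0, 0.0, 0.0],  # token 0
    [0.0, 1.0, 0.0],  # token 1
    [0.0, 0.0, 1.0],  # token 2
]


def project(hidden, weights):
    """Linear layer without bias: returns one logit per vocabulary entry."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weights]


logits = project(hidden_state, lm_head)
# Greedy decoding: choose the vocabulary entry with the largest logit.
next_token = max(range(len(logits)), key=logits.__getitem__)
print(logits, next_token)
```

`T5Model` stops at the hidden state; `T5ForConditionalGeneration` adds the projection step, which is what makes text generation possible.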

# ngrok
ngrok connects localhost to the internet for testing applications and APIs. It brings secure connectivity to apps and APIs in localhost and dev/test environments with one command or function call. In this notebook it exposes the MLflow UI running inside Colab. Typical uses:
- Webhook testing
- Developer Previews
- Mobile backend testing

https://ngrok.com/


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install mlflow pyngrok evaluate  bitsandbytes accelerate datasets transformers==4.39.3 --quiet
get_ipython().system_raw("mlflow ui --port 5000 &")

In [5]:

from pyngrok import ngrok
from getpass import getpass

# Terminate open tunnels if exist
ngrok.kill()

In [6]:
from google.colab import userdata
NGROK_AUTH_TOKEN  = userdata.get('NGROK')

ngrok.set_auth_token(NGROK_AUTH_TOKEN)

# Open an HTTPS tunnel to http://localhost:5000, where the MLflow UI is listening
ngrok_tunnel = ngrok.connect(addr="5000", proto="http", bind_tls=True)
print("MLflow Tracking UI:", ngrok_tunnel.public_url)

MLflow Tracking UI: https://b94e-34-124-207-210.ngrok-free.app


## Fine-Tuning Transformers with MLflow for Enhanced Model Management



In [7]:
# Disable tokenizers warnings when constructing pipelines
%env TOKENIZERS_PARALLELISM=false

import warnings

# Disable a few less-than-useful UserWarnings from setuptools and pydantic
warnings.filterwarnings("ignore", category=UserWarning)

env: TOKENIZERS_PARALLELISM=false


### Preparing the Dataset and Environment for Fine-Tuning

#### Key Steps in this Section

1. **Loading the Dataset**: Reading a custom CSV dataset of documents labelled `cv` or `non-cv` from Google Drive.
2. **Splitting the Dataset**: Dividing the dataset into training and test sets with an 80/20 distribution.
3. **Importing Necessary Libraries**: Including libraries like `evaluate`, `mlflow`, `numpy`, and essential components from the `transformers` library.

Before fine-tuning, we set up the environment and prepare the data: load the dataset, split it into training and test sets, and import the Transformers components used later. The code cells below carry out these steps.

In [8]:
import evaluate
import numpy as np
from datasets import load_dataset, Dataset
from transformers import (
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    pipeline,
)

import mlflow
import pandas as pd
from sklearn.model_selection import train_test_split

In [9]:
data_path = "/content/drive/MyDrive/data/documents_final_cv.csv"
data = pd.read_csv(data_path, header=None)
data_path_test = "/content/drive/MyDrive/data/documents_test_cv.csv"
data_test= pd.read_csv(data_path_test,  header=None)
data.columns = ["label", "text"]
data_test.columns = ["label", "text"]

In [10]:
y =  data['label'].values
X = data.drop("label", axis=1)

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)


In [12]:
X_train['label'] = y_train
X_test['label'] = y_test

In [13]:
X_train.head()

| | text | label |
|---|---|---|
| 54 | profile resume capm ga 30022 404 372 2126 lead... | cv |
| 175 | evaluation document created security 98549 day... | non-cv |
| 78 | scott willard cumberland park dr fl 32821 febr... | cv |
| 100 | evaluation document created security 19 meadow... | non-cv |
| 158 | evaluation document created 55373 main safety ... | non-cv |


In [14]:
train_dataset = Dataset.from_pandas(X_train)
test_dataset= Dataset.from_pandas(X_test)
df_val = Dataset.from_pandas(data_test)


In [15]:
train_dataset = train_dataset.remove_columns(['__index_level_0__'])
train_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 160
})

In [16]:
test_dataset = test_dataset.remove_columns(['__index_level_0__'])
test_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 40
})

In [17]:
train_dataset[0]['text']

'profile resume capm ga 30022 404 372 2126 leader 15 years management experience leading large enterprise teams comprehensive global support service itsm itil methodology global environment focus itsm processes partnership itsm tooling working collaboratively executive leadership key stakeholders across hierarchal levels proper alignment technology initiatives business experience phases program initial analysis design support experience building solutions yield employee solid expertise ensuring operational excellence establishing quantifiable team management communications skills well team player analytical thinking service delivery management change release management methodologies management methodologies analysis business blueprinting system documentation sap templates erp configuration deployment platform management service delivery associate project management mysap srm mysap scm procurement computer science 2002 history atos solutions ga 2020 present servicenow platform service d

In [18]:
train_dataset[0]['label']

'cv'

In [19]:
train_dataset.to_pandas()['label'].value_counts()

label
cv        80
non-cv    80
Name: count, dtype: int64

In [20]:
test_dataset.to_pandas()['label'].value_counts()

label
cv        20
non-cv    20
Name: count, dtype: int64
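
The balanced counts above follow from `stratify=y` in the split. As a rough sketch of the idea (not scikit-learn's actual algorithm), the hypothetical helper below splits each label group separately so that both parts keep the same class proportions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), keeping each label's share in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction from every label group.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Same shape as the data in this notebook: 100 "cv" and 100 "non-cv" rows.
labels = ["cv"] * 100 + ["non-cv"] * 100
train_idx, test_idx = stratified_split(labels)
train_counts = {name: sum(labels[i] == name for i in train_idx) for name in ("cv", "non-cv")}
test_counts = {name: sum(labels[i] == name for i in test_idx) for name in ("cv", "non-cv")}
print(train_counts, test_counts)
```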

In [21]:
import evaluate
import nltk
import numpy as np
from typing import List, Tuple
from nltk.tokenize import sent_tokenize
from datasets import Dataset, concatenate_datasets
from huggingface_hub import HfFolder
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments
)

In [22]:
PROJECT = "FlanT5-Custom"
MODEL_NAME = 'google/flan-t5-base'
DATASET = "CVS-Premcloud"

In [23]:
MODEL_ID = "google/flan-t5-base"
# Load tokenizer of FLAN-t5
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

In [24]:

# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([train_dataset, test_dataset]).map(
    lambda x: tokenizer(x["text"], truncation=True), batched=True, remove_columns=['text', 'label']
)
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max source length: {max_source_length}")

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Max source length: 512
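
A maximum of exactly 512 suggests that at least one document reached the tokenizer's length cap and was cut. The sketch below shows what truncation and padding to a fixed length amount to. It is a simplified, hypothetical helper: it ignores special tokens such as the end-of-sequence marker, and `pad_id=0` is an assumption.

```python
def truncate_and_pad(token_ids, max_length, pad_id=0):
    """Cut to max_length, then right-pad. Returns (ids, attention_mask)."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)           # 1 marks a real token
    padding = max_length - len(ids)
    # Padded positions get pad_id and a 0 in the attention mask.
    return ids + [pad_id] * padding, mask + [0] * padding


short_ids, short_mask = truncate_and_pad([5, 6, 7], max_length=5)
long_ids, long_mask = truncate_and_pad([5, 6, 7, 8, 9, 10], max_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```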


In [25]:
# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([train_dataset, test_dataset]).map(
    lambda x: tokenizer(x["label"], truncation=True), batched=True, remove_columns=['text', 'label']
)
max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Max target length: 5


In [26]:
REPOSITORY_ID = f"{MODEL_ID.split('/')[1]}-text-classification"
REPOSITORY_ID

'flan-t5-base-text-classification'

### Model Initialization and Label Mapping

Next, we set up the label mappings and choose the model for our text classification task.

With the data prepared, the next step is to define how the labels in the dataset relate to what the model produces. Because T5 is a text-to-text model, it generates the label as text; the integer ids are only needed to compute metrics.

#### Setting Up Label Mappings

- **Defining Label Mappings**: Creating bi-directional mappings between integer labels and textual representations ("non-cv" and "cv"). They are defined inside `compute_metrics`.

#### Initializing the Model

- **Model Selection**: Choosing the `google/flan-t5-base` model.
- **Model Configuration**: Loading the model with `AutoModelForSeq2SeqLM`, so the class label is generated as a short target sequence.

The cell below defines the training arguments, the preprocessing function, and the metrics; the model itself is loaded further down.

In [27]:
# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=REPOSITORY_ID,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    fp16=False,     # Overflows with fp16
    learning_rate=1e-3,
    num_train_epochs=4,
    logging_dir=f"{REPOSITORY_ID}/logs",    # logging & evaluation strategies
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,

)

def preprocess_function(sample: Dataset, padding: str = "max_length") -> dict:
    """ Preprocess the dataset. """

    # No task prefix is added here; the raw text is passed to T5 unchanged
    inputs = list(sample["text"])

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["label"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


def postprocess_text(preds: List[str], labels: List[str]) -> Tuple[List[str], List[str]]:
    """ helper function to postprocess text"""
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects a newline after each sentence; the single-word labels used here pass through unchanged
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    # load metrics
    metric = evaluate.load("f1")
    metric1 = evaluate.load("roc_auc")
    metric2 = evaluate.load("recall")
    metric3 = evaluate.load("precision")

    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    id2label = {0: "non-cv", 1: "cv"}
    label2id = {"non-cv": 0, "cv": 1}

    decoded_preds_bin = np.array([label2id.get(x) for x in decoded_preds ])
    decoded_labels_bin = np.array([label2id.get(x) for x in decoded_labels ])
    result = metric.compute(predictions=decoded_preds_bin, references=decoded_labels_bin, average='macro')
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result['roc_auc'] = metric1.compute(prediction_scores=decoded_preds_bin, references=decoded_labels_bin, average='macro')['roc_auc']
    result['recall'] = metric2.compute(predictions=decoded_preds_bin, references=decoded_labels_bin, average='macro')['recall']
    result['precision'] = metric3.compute(predictions=decoded_preds_bin, references=decoded_labels_bin, average='macro')['precision']

    return result
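
One fragile spot in `compute_metrics` is `label2id.get(x)`: if the model generates anything other than "cv" or "non-cv", the lookup returns `None`, which the metric functions are unlikely to accept. A minimal sketch of a more defensive mapping follows. The sentinel value, the helper name, and the lowercasing are assumptions for illustration, not part of the original notebook.

```python
LABEL2ID = {"non-cv": 0, "cv": 1}
UNKNOWN_ID = -1  # hypothetical sentinel for generations that match no known label


def labels_to_ids(decoded, label2id=LABEL2ID, unknown_id=UNKNOWN_ID):
    """Map decoded strings to integer ids, falling back to unknown_id."""
    return [label2id.get(text.strip().lower(), unknown_id) for text in decoded]


ids = labels_to_ids(["cv", " non-cv ", "resume"])
print(ids)
```

Counting how many ids equal the sentinel gives a quick check on whether the model ever drifts away from the two expected strings.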

In [28]:
import datasets
df_train_test = datasets.DatasetDict({"train": train_dataset, "test": test_dataset})
df_train_test

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 160
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 40
    })
})

In [29]:
tokenized_dataset = df_train_test.map(preprocess_function, batched=True, remove_columns=['text', 'label'])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

Map:   0%|          | 0/160 [00:00<?, ? examples/s]

Map:   0%|          | 0/40 [00:00<?, ? examples/s]

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']


In [31]:
tokenized_dataset['train'][0]['labels']

[3, 75, 208, 1, -100]
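
The trailing `-100` is the padding position after masking. The sketch below reproduces that replacement and shows, with a toy loss, why it matters: positions labelled `-100` are left out of the average, which mirrors the default `ignore_index` of PyTorch's cross-entropy loss. `pad_id=0` is assumed to be the tokenizer's pad token id, and the log-probabilities are made up.

```python
import math

IGNORE_INDEX = -100


def mask_padding(label_ids, pad_id=0):
    """Replace padding ids with IGNORE_INDEX so they do not count in the loss."""
    return [tok if tok != pad_id else IGNORE_INDEX for tok in label_ids]


def mean_nll(token_log_probs, labels):
    """Average negative log-likelihood over positions that are not ignored."""
    kept = [-lp for lp, lab in zip(token_log_probs, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)


labels = mask_padding([3, 75, 208, 1, 0])
# Made-up log-probabilities: the last (padded) position is badly predicted,
# but it does not affect the loss because its label is ignored.
log_probs = [math.log(0.5)] * 4 + [math.log(0.01)]
loss = mean_nll(log_probs, labels)
print(labels, loss)
```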

In [30]:
tokenized_dataset['train'][0]['input_ids']

[3278, 4258, 2468, 51, 7922, 3147, 2884, 3, 25285, 220, 5865, 1401, 2688, 2488, 627, 203,
 758, 351, 1374, 508, 5399, 2323, 3452, 1252, 380, 313, 165, 51, 34, 173, 15663, 1252,
 1164, 992, 165, 51, 2842, 4696, 165, 51, 1464, 53, 464, 9642, 120, 4297, 2843, 843,
 10588, 640, 1382, 7064, 138, 1425, 2757, 14632, 748, 6985, 268, 351, 17258, 478, 2332, 1693,
 408, 380, 351, 740, 1275, 6339, 3490, 1973, 2980, 3, 5833, 7763, 8978, 3, 12585, 13500,
 99, 23, 179, 372, 758, 5030, 1098, 168, 372, 1959, 18355, 1631, 313, 1929, 758, 483,
 1576, 758, 25984, 758, 25984, 1693, 268, 28471, 53, 358, 7192, 16333, 7405, 3, 49, 102,
 5298, 12001, 1585, 758, 313, 1929, 7573, 516, 758, 82, 7, 9, 102, 3, 7, 52,
 51, 82, 7, 9, 102, 3, 7, 75, 51, 19339, 1218, 2056, 4407, 892, 44, 32,
 7, 1275, 7922, 6503, 915, 313, 7651, 1585, 313, 1929, 2743, 44, 32, 7, 1252, 44,

# Model loader with a sequence-to-sequence language modeling head
https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM

In [32]:
# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

In [33]:
#nltk.download("punkt")

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
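
`pad_to_multiple_of=8` asks the collator to round padded lengths up to the next multiple of 8. The arithmetic can be sketched as follows; the helper name is hypothetical.

```python
def padded_length(longest, multiple=8):
    """Round `longest` up to the nearest multiple of `multiple`."""
    return -(-longest // multiple) * multiple  # ceiling division, then scale back


lengths = {n: padded_length(n) for n in (5, 8, 512, 513)}
print(lengths)
```

Under this rule, the 5-token label sequences in this notebook would be padded to 8, while the 512-token inputs would stay at 512.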

In [34]:
# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)

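# Imported here so the cell runs; this configuration object is not passed to the trainer above
from accelerate.utils import DataLoaderConfiguration

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)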


In [35]:
# Point MLflow at the tracking server started earlier with `mlflow ui --port 5000`.
# Comment this line out to use MLflow's default local store, or change the URI for a remote tracking server.

mlflow.set_tracking_uri("http://127.0.0.1:5000")

### Integrating MLflow for Experiment Tracking

The final preparatory step before beginning the training process is to integrate MLflow for experiment tracking.
   
MLflow is a critical tool in our workflow, enabling us to log, monitor, and compare different runs of our model training.

#### Setting up the MLflow Experiment

- **Naming the Experiment**: We use `mlflow.set_experiment` to create a new experiment or assign the current run to an existing experiment. In this case, we name our experiment "T5 Custom Classifier seq_to_seq Training v2". This name should be descriptive and related to the task at hand, aiding in organizing and identifying experiments later.
- **Role of MLflow in Training**: By setting up an MLflow experiment, we can track various aspects of our model training, such as parameters, metrics, and outputs. This tracking is invaluable for comparing different models, tuning hyperparameters, and maintaining a record of our experiments.

#### Benefits of Experiment Tracking
Utilizing MLflow for experiment tracking offers several advantages:

- **Organization**: Keeps your training runs organized and easily accessible.
- **Comparability**: Allows for easy comparison of different training runs to understand the impact of changes in parameters or data.
- **Reproducibility**: Enhances the reproducibility of experiments by logging all necessary details.

With MLflow set up, we're now ready to begin the training process, keeping track of every important aspect along the way.

In the next code snippet, we'll set up our MLflow experiment for tracking the training of our CV classification model.

In [37]:
# Pick a name that you like and reflects the nature of the runs that you will be recording to the experiment.
mlflow.set_experiment("T5 Custom Classifier seq_to_seq Training v2")

2024/05/02 10:36:20 INFO mlflow.tracking.fluent: Experiment with name 'T5 Custom Classifier seq_to_seq Training v2' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/797910977184651261', creation_time=1714646180373, experiment_id='797910977184651261', last_update_time=1714646180373, lifecycle_stage='active', name='T5 Custom Classifier seq_to_seq Training v2', tags={}>

In [40]:
!rm -rf ./flan-T5-fine-tune

In [41]:
!mkdir ./flan-T5-fine-tune
custom_path = "./flan-T5-fine-tune/"

In [42]:
with mlflow.start_run() as run:
    train_results = trainer.train()
    print(train_results.metrics)
    trainer.model.save_pretrained(custom_path)
    trainer.data_collator.tokenizer.save_pretrained(custom_path)

    transformers_model = {"model": trainer.model, "tokenizer": trainer.data_collator.tokenizer}
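    # Kept as in the original run. T5ForConditionalGeneration is not a text-classification model
    # (see the warning in the output below); "text2text-generation" is the pipeline task that matches it.
    task = "text-classification"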
    model_info = mlflow.transformers.log_model(
        transformers_model=transformers_model,
        artifact_path="text_classifier",
        task=task,
    )
    print(model_info)


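| Epoch | Training Loss | Validation Loss | F1 | Gen Len | Roc Auc | Recall | Precision |
|---|---|---|---|---|---|---|---|
| 1 | 0.4965 | 0.008734 | 0.974984 | 4.525 | 0.975 | 0.975 | 0.97619 |
| 2 | 0.0004 | 0.002362 | 1.0 | 4.5 | 1.0 | 1.0 | 1.0 |
| 3 | 0.0 | 0.002515 | 1.0 | 4.5 | 1.0 | 1.0 | 1.0 |
| 4 | 0.0 | 0.002525 | 1.0 | 4.5 | 1.0 | 1.0 | 1.0 |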
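
The epoch 1 figures are what macro averaging yields when exactly one of the 40 evaluation documents is misclassified. The sketch below works through that case. The specific error (one "cv" document predicted as "non-cv") is a hypothetical choice; the mirrored error gives the same macro values.

```python
def per_class(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical outcome: 20 "cv" and 20 "non-cv" documents,
# with one "cv" document predicted as "non-cv".
cv = per_class(tp=19, fp=0, fn=1)
non_cv = per_class(tp=20, fp=1, fn=0)

# Macro averaging: unweighted mean of the per-class values.
macro_precision, macro_recall, macro_f1 = [(a + b) / 2 for a, b in zip(cv, non_cv)]
print(round(macro_precision, 5), round(macro_recall, 3), round(macro_f1, 6))
```

With hard 0/1 predictions, ROC AUC reduces to the mean of the two per-class recalls, which is why it equals the macro recall in the table.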


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


{'train_runtime': 113.7749, 'train_samples_per_second': 5.625, 'train_steps_per_second': 0.703, 'total_flos': 438244705566720.0, 'train_loss': 0.12422466695006733, 'epoch': 4.0}
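
The throughput figures in this dictionary can be cross-checked against the training setup. The sketch assumes a single device and no gradient accumulation, so each optimizer step consumes one batch of 8 samples.

```python
import math

train_samples, batch_size, epochs = 160, 8, 4
train_runtime = 113.7749  # seconds, from the metrics above

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

steps_per_second = round(total_steps / train_runtime, 3)
samples_per_second = round(train_samples * epochs / train_runtime, 3)
print(total_steps, steps_per_second, samples_per_second)
```

Both rates match the reported `train_steps_per_second` and `train_samples_per_second`, which is consistent with 80 optimizer steps in total.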


The model 'T5ForConditionalGeneration' is not supported for text-classification. Supported models are ['AlbertForSequenceClassification', 'BartForSequenceClassification', 'BertForSequenceClassification', 'BigBirdForSequenceClassification', 'BigBirdPegasusForSequenceClassification', 'BioGptForSequenceClassification', 'BloomForSequenceClassification', 'CamembertForSequenceClassification', 'CanineForSequenceClassification', 'LlamaForSequenceClassification', 'ConvBertForSequenceClassification', 'CTRLForSequenceClassification', 'Data2VecTextForSequenceClassification', 'DebertaForSequenceClassification', 'DebertaV2ForSequenceClassification', 'DistilBertForSequenceClassification', 'ElectraForSequenceClassification', 'ErnieForSequenceClassification', 'ErnieMForSequenceClassification', 'EsmForSequenceClassification', 'FalconForSequenceClassification', 'FlaubertForSequenceClassification', 'FNetForSequenceClassification', 'FunnelForSequenceClassification', 'GemmaForSequenceClassification', 'GPT2F

<mlflow.models.model.ModelInfo object at 0x7ec89c49c550>


In [43]:
run.to_dictionary()

{'info': {'artifact_uri': 'mlflow-artifacts:/797910977184651261/4e218f478e224c9eb8e7e76529a7f1b3/artifacts',
  'end_time': None,
  'experiment_id': '797910977184651261',
  'lifecycle_stage': 'active',
  'run_id': '4e218f478e224c9eb8e7e76529a7f1b3',
  'run_name': 'popular-toad-904',
  'run_uuid': '4e218f478e224c9eb8e7e76529a7f1b3',
  'start_time': 1714646261368,
  'status': 'RUNNING',
  'user_id': 'root'},
 'data': {'metrics': {},
  'params': {},
  'tags': {'mlflow.runName': 'popular-toad-904',
   'mlflow.source.name': '/usr/local/lib/python3.10/dist-packages/colab_kernel_launcher.py',
   'mlflow.user': 'root',
   'mlflow.source.type': 'LOCAL'}}}

In [44]:
run.data

<RunData: metrics={}, params={}, tags={'mlflow.runName': 'popular-toad-904',
 'mlflow.source.name': '/usr/local/lib/python3.10/dist-packages/colab_kernel_launcher.py',
 'mlflow.source.type': 'LOCAL',
 'mlflow.user': 'root'}>

The metrics and params are empty, and the status shown earlier is still `RUNNING`, because `run` holds the state captured when the run started. Calling `mlflow.get_run(run.info.run_id)` fetches the current state from the tracking server.

In [45]:
import transformers
from mlflow.models import infer_signature
from mlflow.transformers import generate_signature_output
import locale
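# Force UTF-8 as the preferred encoding; a workaround for Colab shell commands that fail with a "UTF-8 locale is required" error
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding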

In [47]:
model_info.signature

inputs: 
  [string (required)]
outputs: 
  ['label': string (required), 'score': double (required)]
params: 
  None

In [46]:
model_info.model_uri

'runs:/4e218f478e224c9eb8e7e76529a7f1b3/text_classifier'
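
A `runs:/` URI combines the run id with the `artifact_path` passed to `log_model`. The sketch below splits the URI by hand and rebuilds the storage location from the run's `artifact_uri` shown earlier. `parse_runs_uri` is a hypothetical helper written for illustration, not an MLflow function.

```python
def parse_runs_uri(model_uri):
    """Split 'runs:/<run_id>/<artifact_path>' into (run_id, artifact_path)."""
    prefix = "runs:/"
    if not model_uri.startswith(prefix):
        raise ValueError(f"not a runs:/ URI: {model_uri!r}")
    run_id, _, artifact_path = model_uri[len(prefix):].partition("/")
    return run_id, artifact_path


run_id, artifact_path = parse_runs_uri(
    "runs:/4e218f478e224c9eb8e7e76529a7f1b3/text_classifier"
)
# artifact_uri as reported by run.to_dictionary() in this notebook
artifact_uri = "mlflow-artifacts:/797910977184651261/4e218f478e224c9eb8e7e76529a7f1b3/artifacts"
resolved = f"{artifact_uri}/{artifact_path}"
print(run_id, artifact_path)
print(resolved)
```

The rebuilt location is the same path that MLflow logs as the resolved URI when the model is loaded.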

In [48]:
classification_components = mlflow.transformers.load_model(
    model_info.model_uri, return_type="components"
)


Downloading artifacts:   0%|          | 0/20 [00:00<?, ?it/s]

2024/05/02 10:51:18 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
2024/05/02 10:51:31 INFO mlflow.transformers: 'runs:/4e218f478e224c9eb8e7e76529a7f1b3/text_classifier' resolved as 'mlflow-artifacts:/797910977184651261/4e218f478e224c9eb8e7e76529a7f1b3/artifacts/text_classifier'


Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [49]:
type(classification_components['tokenizer'])

In [50]:
type(classification_components['model'])

### Loading and Testing the Model from MLflow

After logging our fine-tuned model to MLflow, we'll now load and test it.
    
#### Loading the Model from MLflow

- **Using mlflow.transformers.load_model**: We use this function to load the model stored in MLflow. This demonstrates how models can be retrieved and utilized post-training, ensuring they are accessible for future use.
- **Retrieving Model URI**: We use the `model_uri` obtained from logging the model to MLflow. This URI is the unique identifier for our logged model, allowing us to retrieve it accurately.

#### Testing the Model with Validation Text

- **Preparing Validation Data**: We use a separate CSV file of 284 labelled documents (142 `cv`, 142 `non-cv`) as the test set.
- **Evaluating Model Output**: By generating a label for each document and comparing it with the true label, we can measure precision, recall, and F1 on this data. This step checks that the model works as expected beyond the training split.

Testing the model after loading it from MLflow is essential for several reasons:

- **Validation of Logging Process**: It confirms that the model was logged and loaded correctly.
- **Practical Performance Assessment**: Provides a real-world assessment of the model's performance, which is critical for deployment decisions.
- **Demonstrating End-to-End Workflow**: Showcases a complete workflow from training, logging, loading, to using the model, which is vital for understanding the entire model lifecycle.

In the next code blocks, we'll load the test set and evaluate the model loaded from MLflow on it.

In [51]:
def load_dataset_test(data_path) -> Dataset:
    """ Load dataset. """

    dataset_ecommerce_pandas = pd.read_csv(data_path, header=None, names=['label', 'text'])
    dataset_ecommerce_pandas['label'] = dataset_ecommerce_pandas['label'].astype(str)
    dataset_ecommerce_pandas['text'] = dataset_ecommerce_pandas['text'].astype(str)
    dataset = Dataset.from_pandas(dataset_ecommerce_pandas)

    return dataset

In [52]:
datatest= load_dataset_test(data_path_test)

In [53]:
len(datatest)

284

In [54]:
datatest[0]

{'label': 'cv',
 'text': 'justin wa adkins thrives cultivating business workflows creating strategic key stakeholders simple effective delights customers key inflection points inside business focusing innovation bringing 8 years coding experience managing mission world better place supporting make central actuarial science may manager 2021 construction accounting consulting telecom erp related project data conversion ms dynamics solomon vista 3 months limited solomon cio role managing multiple systems bringing cohesion lead senior analyst 2016 april systems construction accounting battle erp related project project code vista etl modules reconcile jc beginning balances using managed implemented ms materials module batch creation multiple streamlined xml certified payroll report efficient washington equipment gps import allocation procedure built forecast software project managers upload capabilities directly viewpoint database lead eos level 10 weekly meetings mentored 6 various busine

In [55]:
import torch
from tqdm.auto import tqdm

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sklearn.metrics import classification_report

In [56]:
model = classification_components.get('model')
tokenizer = classification_components.get('tokenizer')

In [57]:
def classify(text_to_classify: str) -> str:
    """Classify a text using the model."""
    # Truncate to the 512-token source length used during training
    inputs = tokenizer(text_to_classify, padding='max_length', truncation=True, max_length=512, return_tensors='pt')
    # Keep the inputs on the same device as the model to avoid a device mismatch
    inputs = inputs.to(model.device)
    outputs = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=5)

    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return prediction


def evaluate_model() -> None:
    """Evaluate the model on the test dataset."""
    predictions_list, labels_list = [], []

    # Iterate over rows directly instead of re-reading a whole column on every step
    for sample in tqdm(datatest):
        predictions_list.append(classify(sample['text']))
        labels_list.append(str(sample['label']))

    report = classification_report(labels_list, predictions_list)
    print(report)

In [58]:
evaluate_model()

  0%|          | 0/284 [00:00<?, ?it/s]

              precision    recall  f1-score   support

          cv       0.97      1.00      0.99       142
      non-cv       1.00      0.97      0.99       142

    accuracy                           0.99       284
   macro avg       0.99      0.99      0.99       284
weighted avg       0.99      0.99      0.99       284
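
The report is rounded to two decimals, but with 142 documents per class the underlying error counts can be narrowed down by brute force. The sketch below searches for counts that reproduce the rounded precision and recall values. It assumes every prediction was either "cv" or "non-cv", which fits the report having only two class rows.

```python
SUPPORT = 142  # documents per class, from the report
REPORTED = ("0.97", "1.00", "1.00", "0.97")  # cv precision, cv recall, non-cv precision, non-cv recall


def two_dp(value):
    """Format to two decimals, as in the printed report."""
    return f"{value:.2f}"


def matches(cv_missed, non_cv_missed):
    """Check whether these error counts reproduce the reported values."""
    cv_tp = SUPPORT - cv_missed
    non_cv_tp = SUPPORT - non_cv_missed
    cv_precision = cv_tp / (cv_tp + non_cv_missed)
    non_cv_precision = non_cv_tp / (non_cv_tp + cv_missed)
    cv_recall = cv_tp / SUPPORT
    non_cv_recall = non_cv_tp / SUPPORT
    rounded = tuple(two_dp(v) for v in (cv_precision, cv_recall, non_cv_precision, non_cv_recall))
    return rounded == REPORTED


candidates = [
    (cv_missed, non_cv_missed)
    for cv_missed in range(SUPPORT)
    for non_cv_missed in range(SUPPORT)
    if matches(cv_missed, non_cv_missed)
]
print(candidates)
```

Under these assumptions the only fit is zero missed "cv" documents and four "non-cv" documents predicted as "cv", which corresponds to 280 of 284 correct.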



In [48]:
ngrok.kill()