<a href="https://www.kaggle.com/code/philanoe/nlp-transformer-training?scriptVersionId=102105898" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## 💻 UnpackAI DL201 Bootcamp - Week 3 - Training an NLP transformer

### 📕 Learning Objectives

* Getting working examples that perform the main NLP tasks
* Knowing that Hugging Face exists, and the strength of its pre-trained models and all-in-one pipelines

### 📖 Concepts map

* Pipeline
* Training

This code was tested on Kaggle, without Accelerator, on 2022/7/30.

In [1]:
# install the necessary libraries (need internet access)
!pip install -Uqq datasets # Hugging Face library used to prepare data for transformers
!pip install -Uqq wandb # used during transformer training, even if we disable it

In [2]:
# import the libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import json
import datasets #library used to adapt our data to the transformers
# the following libraries are here to download the pre-trained model, tokenize the training data, define the training arguments...
from transformers import DataCollatorWithPadding, AutoTokenizer, TrainingArguments, Trainer, AutoModelForSequenceClassification

import os
from pathlib import Path
for dirname, _, filenames in os.walk("../input/intent-recognition-chatbot-corpus-from-askubuntu"):
    for filename in filenames:
        print(os.path.join(dirname, filename))

../input/intent-recognition-chatbot-corpus-from-askubuntu/AskUbuntu Corpus.json


In [3]:
# environment preparation
os.environ["TOKENIZERS_PARALLELISM"]="true"
os.environ["WANDB_DISABLED"] = "true"
# Note: transformers warns that the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5.
# The warning suggests using the --report_to flag to control the logging integrations (for instance --report_to none).

In [4]:
#check if the environment variable was set correctly
print(os.environ.get('TOKENIZERS_PARALLELISM', ''))
print(os.environ.get('WANDB_DISABLED', ''))

true
true


# First part : train a sentence classifier with the entire available data

In [5]:
# read the JSON data as a dictionary
with open('../input/intent-recognition-chatbot-corpus-from-askubuntu/AskUbuntu Corpus.json', 'r') as f:
    data = json.load(f)
# the intent and text of each sample are stored under the "sentences" key
sentences = data["sentences"]
# collect the intents with a list comprehension over the sentences
labelList = [i["intent"] for i in sentences]
# collect the texts the same way
textList = [i["text"] for i in sentences]

In [6]:
# Create IntentDataFrame with label list and text list
DFData = {'label' : labelList, 'sentence' : textList}
IntentDataFrame = pd.DataFrame(data = DFData)

In [7]:
# Delete the samples with "None" as label
IntentDataFrame=IntentDataFrame[IntentDataFrame["label"]!="None"]
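The loading and filtering steps of cells 5 to 7 can also be sketched without pandas. The structure below (a `sentences` list whose items carry `text` and `intent` keys) is the one the cells above rely on. The two sample sentences are invented for illustration and are not taken from the dataset.

```python
import json

# Hypothetical miniature corpus with the same keys as the real file
# (the sentences themselves are made up for this example).
raw = '''
{"sentences": [
  {"text": "How do I install a printer driver?", "intent": "Setup Printer"},
  {"text": "Why is the sky blue?", "intent": "None"}
]}
'''

data = json.loads(raw)

# Keep (label, sentence) pairs, skipping samples whose intent is the string "None"
samples = [
    {"label": s["intent"], "sentence": s["text"]}
    for s in data["sentences"]
    if s["intent"] != "None"
]

print(samples)
```

Note that `"None"` is a label string in the corpus, not the Python `None` object, which is why the comparison is made against a string.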

In [8]:
# check whether the classes are reasonably balanced
IntentDataFrame["label"].value_counts()

Software Recommendation    57
Make Update                47
Shutdown Computer          27
Setup Printer              23
Name: label, dtype: int64
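The counts above can be turned into shares to make the imbalance easier to read. This is a small sketch that uses only the four counts printed by `value_counts()`. The "largest / smallest" ratio is just one possible summary and is not something the notebook computes.

```python
from collections import Counter

# Label counts copied from the value_counts() output above
counts = Counter({
    "Software Recommendation": 57,
    "Make Update": 47,
    "Shutdown Computer": 27,
    "Setup Printer": 23,
})

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label:<25} {share:.1%}")
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```

The largest class is roughly 2.5 times the size of the smallest one, so the data is moderately imbalanced rather than evenly spread.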

In [9]:
# replace the label strings with label numbers (this could be automated for larger label sets)
LabelToIndex = {"Software Recommendation":0,"Make Update":1,"Shutdown Computer":2,"Setup Printer":3}
IntentDataFrame["label"]=IntentDataFrame["label"].map(LabelToIndex)
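For larger label sets, the mapping could be built automatically instead of typed by hand. A minimal sketch, assuming the labels are available as a plain list of strings; the names `label_to_index` and `index_to_label` are illustrative, and the resulting numbering follows alphabetical order, so it differs from the hand-written `LabelToIndex` above.

```python
# Example label column (a few values in the style of the corpus)
labels = ["Make Update", "Setup Printer", "Make Update", "Software Recommendation"]

# Sort the distinct labels so the mapping is reproducible from run to run
unique_labels = sorted(set(labels))
label_to_index = {label: index for index, label in enumerate(unique_labels)}
index_to_label = {index: label for label, index in label_to_index.items()}

encoded = [label_to_index[label] for label in labels]
print(label_to_index)
print(encoded)
```

Keeping the inverse mapping next to the forward one avoids retyping the label names later, at inference time.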

In [10]:
# convert IntentDataFrame to a dataset so that it can be used by Hugging Face models and tokenizers
train_dataset=datasets.Dataset.from_pandas(IntentDataFrame)

In [11]:
print(train_dataset)

Dataset({
    features: ['label', 'sentence', '__index_level_0__'],
    num_rows: 154
})


In [12]:
# Remove the __index_level_0__ column because we do not need it for training
train_dataset=train_dataset.remove_columns(["__index_level_0__"])

In [13]:
# Load the tokenizer of the "distilbert-base-uncased" checkpoint with AutoTokenizer
my_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [14]:
# Define the tokenization function
def preprocess_function(Input_Dataset):
    return my_tokenizer(Input_Dataset["sentence"], truncation=True, padding=True)

In [15]:
# and use this function to tokenize the training dataset (batch by batch if it is too large)
tokenize_train=train_dataset.map(preprocess_function,batched=True)

  0%|          | 0/1 [00:00<?, ?ba/s]

In [16]:
# Choose the default data_collator to adapt our data to the model training
my_data_collator = DataCollatorWithPadding(tokenizer=my_tokenizer)
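Both `padding=True` in the tokenization function and `DataCollatorWithPadding` serve the same purpose: sentences have different lengths, so shorter ones are filled with a padding token until every sequence in a batch has the same length. The toy function below illustrates the idea in plain Python. It is a hypothetical helper with made-up token ids, not the Hugging Face implementation. The padding id `0` matches the `pad_token_id` of this checkpoint.

```python
def pad_batch(token_id_lists, pad_token_id=0):
    """Pad every sequence to the length of the longest one in the batch.

    Hypothetical helper for illustration; not the Hugging Face implementation.
    """
    longest = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for two sentences of different lengths
batch = pad_batch([[101, 2129, 102], [101, 2054, 2323, 1045, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model tell real tokens from padding, which is why the tokenizer returns it alongside `input_ids`.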

In [17]:
# Get the pre-trained model from the web (about 256 MB)
my_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=4)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifi

In [18]:
# Set the training parameters
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=7,
    weight_decay=0.01,
    #evaluation_strategy="epoch"
)

my_trainer = Trainer(
    model=my_model,
    args=training_args,
    train_dataset=tokenize_train,
    #eval_dataset=tokenize_test,  Here, we work with the entire dataset as training data
    #compute_metrics=compute_metrics,
    tokenizer=my_tokenizer,
    data_collator=my_data_collator,
)


my_trainer.train()

Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 154
  Num Epochs = 7
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 140


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=140, training_loss=0.5146885463169643, metrics={'train_runtime': 144.7218, 'train_samples_per_second': 7.449, 'train_steps_per_second': 0.967, 'total_flos': 9204225229584.0, 'train_loss': 0.5146885463169643, 'epoch': 7.0})
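The figures in the training log can be cross-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation (the log reports 1 accumulation step) and a final, smaller batch that is kept rather than dropped.

```python
import math

num_examples = 154        # "Num examples" in the log
batch_size = 8            # per_device_train_batch_size, on a single device
num_epochs = 7            # num_train_epochs
train_runtime = 144.7218  # seconds, from TrainOutput

# 154 / 8 = 19.25, so the last batch of each epoch is smaller than the others
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

This gives 20 steps per epoch and 140 steps in total, which matches `Total optimization steps = 140`, and the two throughput values agree with `train_samples_per_second` and `train_steps_per_second` in `TrainOutput`.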

# Second part : define the inference function, which predicts a label from an input sentence (string) using the model we fine-tuned above

In [19]:
# function made to deal with one sentence at a time. This could be re-engineered to deal with a 
# pandas series of sentences, or a list of sentences
def SentenceClassifier(InputSentence):
    """ Take a sentence as input, return the corresponding label
    
    dependencies : my_tokenizer, my_trainer(fine tuned pre-trained model), preprocess_function
    """
      
    # here, we are keeping the input as a Dataset, which could allow us to reuse the code
    # to answer many questions at once
    InputSentenceDFData = {'sentence' : [InputSentence]}
    InputSentenceDataFrame = pd.DataFrame(data = InputSentenceDFData)
    InputSentenceDataset = datasets.Dataset.from_pandas(InputSentenceDataFrame)
    Tokenised_InputSentence = InputSentenceDataset.map(preprocess_function, batched=False)
    
    LabelScores = my_trainer.predict(Tokenised_InputSentence)
    BestLabel = LabelScores.predictions.argmax(1)
    
    IndexToLabel = {0:"Software Recommendation",1:"Make Update",2:"Shutdown Computer",3:"Setup Printer"}
    OutputLabelName = IndexToLabel[BestLabel[0]]
    
    return OutputLabelName
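`SentenceClassifier` keeps only the index of the highest score. If a confidence value is also wanted, the scores can be passed through a softmax first. The sketch below assumes that `LabelScores.predictions` holds raw logits, as is usual for sequence classification models. The logits used here are made up, and the `softmax` helper is written in plain Python for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


IndexToLabel = {0: "Software Recommendation", 1: "Make Update",
                2: "Shutdown Computer", 3: "Setup Printer"}

# Made-up logits for one sentence; real values would come from LabelScores.predictions[0]
example_logits = [2.1, 0.3, -1.0, -0.5]

probabilities = softmax(example_logits)
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
print(IndexToLabel[best_index], round(probabilities[best_index], 3))
```

Softmax does not change which label wins, so the predicted class is the same as with `argmax` alone.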

In [20]:
InputSentence = "What should I use to cut pictures ?"
OutputLabel = SentenceClassifier(InputSentence)
print(f'Your question was : "{InputSentence}" it was classified as : "{OutputLabel}"')

  0%|          | 0/1 [00:00<?, ?ex/s]

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1
  Batch size = 8


Your question was : "What should I use to cut pictures ?" it was classified as : "Software Recommendation"


# Third part : learn how to save and reload a model, to avoid downloading the pre-trained model and training it again once we have a working model

## Save the model and tokenizer locally

In [21]:
ModelPath = "/kaggle/working/model/"
TokenizerPath = "/kaggle/working/tokenizer/"

In [22]:
for path_string in [ModelPath, TokenizerPath]:
    current_path = Path(path_string)
    if not current_path.is_dir():
        current_path.mkdir()
        print(f'Creation of the directory {path_string}')
    print(f'Folder existing ? : {current_path.is_dir()}')

Folder existing ? : True
Folder existing ? : True
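The existence check followed by `mkdir()` can be shortened with the `parents` and `exist_ok` arguments of `Path.mkdir`. A minimal sketch, in which a temporary directory stands in for `/kaggle/working` and the `nested` sub-folder is only there to show the effect of `parents=True`:

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for /kaggle/working in this example
with tempfile.TemporaryDirectory() as base:
    model_dir = Path(base) / "model"
    tokenizer_dir = Path(base) / "nested" / "tokenizer"

    for directory in (model_dir, tokenizer_dir):
        # parents=True creates missing parent folders,
        # exist_ok=True makes the call safe to repeat
        directory.mkdir(parents=True, exist_ok=True)
        directory.mkdir(parents=True, exist_ok=True)  # second call changes nothing

    created = [model_dir.is_dir(), tokenizer_dir.is_dir()]

print(created)
```

With these two arguments the cell can be re-run at any time without raising `FileExistsError` or `FileNotFoundError`.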


In [23]:
if Path(ModelPath).is_dir():
    my_model.save_pretrained(ModelPath)
    print("model ok")
if Path(TokenizerPath).is_dir():
    my_tokenizer.save_pretrained(TokenizerPath)
    print("tokenizer ok")

Configuration saved in /kaggle/working/model/config.json
Model weights saved in /kaggle/working/model/pytorch_model.bin
tokenizer config file saved in /kaggle/working/tokenizer/tokenizer_config.json
Special tokens file saved in /kaggle/working/tokenizer/special_tokens_map.json


model ok
tokenizer ok


## List the files we just saved

In [24]:
# option 1 : with os
print(os.listdir(ModelPath))
print(os.listdir(TokenizerPath))

['pytorch_model.bin', 'config.json']
['tokenizer.json', 'tokenizer_config.json', 'special_tokens_map.json', 'vocab.txt']


In [25]:
# option 2 : with pathlib
print([file.name for file in Path(ModelPath).iterdir()])
print([file.name for file in Path(TokenizerPath).iterdir()])

['pytorch_model.bin', 'config.json']
['tokenizer.json', 'tokenizer_config.json', 'special_tokens_map.json', 'vocab.txt']


## Load the local version of the model and tokenizer

In [26]:
LocalModel = AutoModelForSequenceClassification.from_pretrained(ModelPath, num_labels=4)
LocalTokenizer = AutoTokenizer.from_pretrained(TokenizerPath)

loading configuration file /kaggle/working/model/config.json
Model config DistilBertConfig {
  "_name_or_path": "/kaggle/working/model/",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.20.1",
  "vocab_size": 30522
}

loading weights file /kaggle/working/model/pytorch_model.bin
All model checkpoint weights were used wh
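The saved configuration lists the classes as `LABEL_0` to `LABEL_3` because no label names were given when the model was created. One possible way to store readable names, sketched below, is to build `id2label` and `label2id` dictionaries from the `LabelToIndex` mapping and pass them to `from_pretrained`. The call itself is left as a comment because it downloads the checkpoint.

```python
LabelToIndex = {"Software Recommendation": 0, "Make Update": 1,
                "Shutdown Computer": 2, "Setup Printer": 3}

# Invert the mapping used for training
id2label = {index: label for label, index in LabelToIndex.items()}
label2id = dict(LabelToIndex)

print(id2label)

# Possible use when creating the model (not run here, it downloads the checkpoint):
# my_model = AutoModelForSequenceClassification.from_pretrained(
#     "distilbert-base-uncased", num_labels=4,
#     id2label=id2label, label2id=label2id)
```

With the names stored in the configuration, a reloaded model carries its own label mapping, and the hand-written `IndexToLabel` dictionary in the classifier functions would no longer be needed.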

In [29]:
# check that the classifier works with the model and tokenizer reloaded from disk
def LocalSentenceClassifier(InputSentence):
    """ Take a sentence as input, return the corresponding label
    
    dependencies : LocalTokenizer, LocalModel, training_args
    We use LocalTokenizer and Local_Trainer instead of my_tokenizer and my_trainer
    to be sure that this function works with the data saved and loaded locally
    """
    
    # the Trainer is only used for its predict() method, so no training data is passed
    Local_Trainer = Trainer(
        model=LocalModel,
        args=training_args,
        tokenizer=LocalTokenizer,
        data_collator=DataCollatorWithPadding(tokenizer=LocalTokenizer),
    )
    
    # here, we are keeping the input as a Dataset, which could allow us to reuse the code
    # to answer many questions at once
    InputSentenceDFData = {'sentence' : [InputSentence]}
    InputSentenceDataFrame = pd.DataFrame(data = InputSentenceDFData)
    InputSentenceDataset = datasets.Dataset.from_pandas(InputSentenceDataFrame)
    # tokenize with the reloaded tokenizer (preprocess_function would use my_tokenizer)
    Tokenised_InputSentence = InputSentenceDataset.map(
        lambda example: LocalTokenizer(example["sentence"], truncation=True), batched=False)
    
    LabelScores = Local_Trainer.predict(Tokenised_InputSentence)
    BestLabel = LabelScores.predictions.argmax(1)
    
    IndexToLabel = {0:"Software Recommendation",1:"Make Update",2:"Shutdown Computer",3:"Setup Printer"}
    OutputLabelName = IndexToLabel[BestLabel[0]]
    
    return OutputLabelName

In [30]:
InputSentence = "How can I update Ubuntu ?"
OutputLabel = LocalSentenceClassifier(InputSentence)
print(f'Your question was : "{InputSentence}" it was classified as : "{OutputLabel}"')

  0%|          | 0/1 [00:00<?, ?ex/s]

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1
  Batch size = 8


Your question was : "How can I update Ubuntu ?" it was classified as : "Software Recommendation"
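Single test sentences say little about how often the classifier is right. The training cell leaves `eval_dataset` and `compute_metrics` commented out because all the data is used for training. If some labelled questions were held out, predictions could be compared with the true labels along the lines of the sketch below. The `accuracy_report` helper and the label lists are hypothetical and made up for illustration. They are not results from this notebook.

```python
from collections import Counter


def accuracy_report(true_labels, predicted_labels):
    """Return overall accuracy and a count of (true, predicted) mistakes."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both lists must have the same length")
    pairs = list(zip(true_labels, predicted_labels))
    correct = sum(1 for true, predicted in pairs if true == predicted)
    mistakes = Counter((true, predicted) for true, predicted in pairs if true != predicted)
    return correct / len(pairs), mistakes


# Made-up labels for four held-out questions (illustrative, not real results)
true_labels = ["Make Update", "Setup Printer", "Make Update", "Shutdown Computer"]
predicted_labels = ["Make Update", "Setup Printer", "Software Recommendation", "Shutdown Computer"]

accuracy, mistakes = accuracy_report(true_labels, predicted_labels)
print(accuracy)
print(mistakes)
```

Counting the mistakes per (true, predicted) pair shows which intents get confused with each other, which is more informative than the accuracy figure alone on a small, imbalanced dataset.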
