# Workshop AI Week: Fine-tuning our first NLP model to classify text

Your instructors for today are Vlad & Călin.

In this session, our goal is to fine-tune an existing model on a publicly available dataset that is related to the problem we are solving: CV content classification.

But before we move on to the actual stuff... 

    ...can you tell the difference between training and fine-tuning a model?

To use existing models, we'll explore the world of ![HuggingFace](https://huggingface.co/front/assets/huggingface_logo-noborder.svg)

In [None]:
# Let's get things started.
# We'll need to first install a few dependencies, which we will work with throughout this workshop.

# Note that we're actually executing a command in the terminal from within the notebook, by marking the line with !
# (pip is the python package manager)
!pip install -q transformers tokenizers datasets torch accelerate evaluate

In [None]:
# This is how we import dependencies in python
from datasets import load_dataset

## Data acquisition and preparation

In any ML process, the data heavily influences the final output, so we need to know what we are dealing with before feeding it to any ML model. Even small data inconsistencies that seem unimportant can have significant repercussions on the model.

Any ML model can be considered a function approximator. Recall this basic formula from maths:
$$ 
f(x) = y
$$ 
where in our case **x** is the input data and **y** is the model output.

    What's x & y for the task of text classification?* 

*The various NLP tasks can be consulted on the [HuggingFace website](https://huggingface.co/tasks).

Next, let's look at some [datasets](https://huggingface.co/datasets) suitable for text classification. 

In [None]:
hf_dataset = load_dataset("ganchengguang/resume_seven_class")
hf_dataset

### Data Exploration (just a little bit)
For this, we will use pandas, the most widely used python tool for data analysis.

You can think of pandas as a code-only Excel on steroids. The main object one works with in pandas is a DataFrame. It is similar to a spreadsheet, but built around a defined data schema.

In [None]:
cv_df = hf_dataset["train"].to_pandas()

# Head shows us the first few rows of the DataFrame
cv_df.head()

We see that y, our label, is prepended to the actual line of text (x). So let's split the two and create two new columns.

In [None]:
# n=1 splits on the first tab only, so a tab inside the text cannot create extra columns
cv_df[["label", "line_content"]] = cv_df["text"].str.split("\t", n=1, expand=True)
cv_df.head()
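
As a small illustration of this split, here is a sketch on made-up rows rather than the workshop dataset. The name `toy_df` and its contents are hypothetical. Passing `n=1` limits the split to the first tab, which guards against a tab appearing inside the text itself.

```python
import pandas as pd

# Hypothetical rows in a "label<TAB>text" layout (not taken from the real dataset)
toy_df = pd.DataFrame({"text": ["Exp\tSoftware engineer at ACME", "Edu\tBSc\tComputer Science"]})

# Split on the first tab only: everything after it stays in line_content
toy_df[["label", "line_content"]] = toy_df["text"].str.split("\t", n=1, expand=True)
print(toy_df[["label", "line_content"]])
```

Without `n=1`, the second row would produce three pieces and the two-column assignment would fail.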

Now let's inspect the data a bit.

In [None]:
# Show 3 examples of each label
cv_df.groupby("label").head(3).sort_values("label")

In [None]:
# Plot the distribution/cardinality of labels in this dataset
cv_df["label"].value_counts().plot(kind="bar")

### Data Cleaning

Some minimal data cleaning is required for 2 main reasons:
* To ensure we have the data in a format suitable to solve our problem, i.e. x & y adhere to the expected domain & co-domain
* To ensure we don't have erroneous or invalid data. Remember, in ML as in any other data-oriented system, the same principle applies: garbage in, garbage out.
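
A quick way to spot erroneous data is to count missing and blank values before doing anything else. The following is a minimal sketch on a made-up DataFrame; `toy_df` and its rows are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data with one fully missing row and one whitespace-only line
toy_df = pd.DataFrame({
    "label": ["Exp", None, "Edu"],
    "line_content": ["Built data pipelines", None, "   "],
})

# Missing values per column
missing_per_column = toy_df.isna().sum()

# Lines that are missing or contain only whitespace
blank_lines = toy_df["line_content"].fillna("").str.strip().eq("").sum()

print(missing_per_column)
print(f"Blank or missing lines: {blank_lines}")
```

Checks like these only detect the problem; what to do with the offending rows depends on the task.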

In [None]:
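# Merge the labels that we do not use into the "Oth" class.
# Only the label column is touched, so line contents that happen to match are left alone.
cv_df["label"] = cv_df["label"].replace(to_replace=["PI", "Sum", "Obj", "QC"], value="Oth")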

In [None]:
# TODO: As we could see earlier, there was an empty row. Write the code to remove it.

In [None]:
# Recompute labels just to ensure we have the expected co-domain.
labels = cv_df["label"].unique().tolist()
labels

In [None]:
# Now let's review the distribution of the labels
cv_df["label"].value_counts().plot(kind="bar")

In [None]:
# Now let's downsample every class to the size of the smallest class.
# This discards some data, but it is a common technique to ensure that we have a balanced dataset.
min_samples = cv_df["label"].value_counts().min()
cv_df = cv_df.groupby("label").sample(n=min_samples, random_state=42).reset_index(drop=True)
cv_df

In [None]:
cv_df["label"].value_counts()
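
To see the balancing idea in isolation, here is a sketch of downsampling on toy data. The DataFrame and the class names `A`, `B`, `C` are made up, and `GroupBy.sample` is just one way to do this in pandas.

```python
import pandas as pd

# Hypothetical imbalanced data: 5 x A, 2 x B, 3 x C
toy_df = pd.DataFrame({
    "label": ["A"] * 5 + ["B"] * 2 + ["C"] * 3,
    "line_content": [f"line {i}" for i in range(10)],
})

# Size of the smallest class
smallest = toy_df["label"].value_counts().min()

# Draw the same number of rows from every class
balanced_df = toy_df.groupby("label").sample(n=smallest, random_state=42).reset_index(drop=True)
print(balanced_df["label"].value_counts())
```

The trade-off is that rows from the larger classes are thrown away, so this works best when the smallest class is still reasonably large.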

In [None]:
# Let's see how long the sequences are, just to have a rough understanding of the data. We'll need this to understand how to set the max_length parameter in the tokenizer.
cv_df["line_content"].apply(lambda x: len(x.split(" "))).describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95, 0.99])

### Data Preparation (for training)
In a nutshell, ML models work with numeric values, so we'll need to convert these strings somehow ;) 
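
To make the string-to-number idea concrete, here is a deliberately simplified sketch that maps whole words to integer ids. It is an assumption-laden toy: the vocabulary, the `[UNK]` placeholder for unseen words, and the `encode` helper are hypothetical, and real tokenizers typically work on subword pieces rather than whole words.

```python
# Toy word-level encoding (not how a pre-trained tokenizer works internally)
sentences = ["python developer", "senior python engineer"]

# Build a vocabulary: each new word gets the next free integer id
vocab = {"[UNK]": 0}
for sentence in sentences:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

def encode(text):
    """Map each word to its id, falling back to [UNK] for unseen words."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

print(vocab)
print(encode("senior java developer"))
```

The word "java" never appeared while building the vocabulary, so it is mapped to the `[UNK]` id.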

In [None]:
# First, let's reorganize the data into a format that the HF Transformers library can understand.
# For our experiment, we will also split it into train and test sets.
from datasets import Dataset

hf_dataset = Dataset.from_pandas(cv_df[["line_content", "label"]], preserve_index=False)
hf_dataset = hf_dataset.class_encode_column("label")
hf_dataset = hf_dataset.train_test_split(test_size=0.2, stratify_by_column="label", seed=42)
hf_dataset

In [None]:
# How does a row look now?
ds_row = hf_dataset["train"][5]
ds_row

In [None]:
# Let's start the transformation with the domain (x).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize every split of hf_dataset, padding or truncating each line to max_length tokens
hf_dataset = hf_dataset.map(
    lambda example: tokenizer(example["line_content"], truncation=True, padding="max_length", max_length=128),
    batched=True,
    batch_size=64,
)
hf_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

In [None]:
# And how does it look now?
hf_dataset["train"][5]
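
The `input_ids` and `attention_mask` columns follow from padding and truncating to a fixed length. Here is a minimal sketch of that idea in plain Python; the `pad_or_truncate` helper and the token ids are hypothetical, and unlike a real tokenizer this toy version does not preserve the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # truncate if too long
    attention_mask = [1] * len(ids)       # 1 = real token
    padding = max_length - len(ids)       # how many pad tokens are needed
    return ids + [pad_id] * padding, attention_mask + [0] * padding  # 0 = padding

short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask tells the model which positions hold real tokens and which are padding to be ignored.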

In [None]:
# Also, what do the tokens look like? Let's take a look at the example row from earlier.
tokenizer.tokenize(ds_row["line_content"])

In [None]:
# TODO: Remember how we looked at the distribution of words per line earlier?
# Now that you've learned how to tokenize text, do the same for the tokenized sequences and check whether our max_length is set sensibly.

# You'll need to work with cv_df ;)

In [None]:
# The transformation for the co-domain is pretty straightforward.
labels = hf_dataset["train"].features["label"].names
print({idx: label for idx, label in enumerate(labels)})

## Model Training

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
   pretrained_model_name_or_path=model_name,
   num_labels=len(labels),
   label2id={label: idx for idx, label in enumerate(labels)},
   id2label={idx: label for idx, label in enumerate(labels)}
)

In [None]:
import numpy as np
import evaluate

# Load the metric once, rather than on every evaluation call.
acc_metric = evaluate.load("accuracy")

# To see how the model is learning, we'll also use this function to compute the accuracy of the model.
def compute_metrics(p):
    logits, labels = p

    # The predicted class is the index of the largest logit
    pred = np.argmax(logits, axis=-1)
    print(f"Pred labels: {pred[:20]}")
    print(f"TRUE labels: {labels[:20]}")

    return acc_metric.compute(predictions=pred, references=labels)
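
To see what the accuracy computation does with the model's raw outputs, here is a small sketch with made-up logits for four examples and three classes. The `toy_` names and values are hypothetical, and the accuracy is computed directly with NumPy rather than through the `evaluate` library.

```python
import numpy as np

# Hypothetical logits: one row per example, one column per class
toy_logits = np.array([
    [2.0, 0.1, -1.0],
    [0.2, 0.3, 1.5],
    [0.9, 1.1, 0.0],
    [3.0, 0.0, 0.5],
])
toy_labels = np.array([0, 2, 0, 0])

# The predicted class is the column with the largest logit
toy_pred = np.argmax(toy_logits, axis=-1)

# Accuracy is the fraction of predictions that match the true labels
toy_accuracy = float((toy_pred == toy_labels).mean())
print(toy_pred, toy_accuracy)
```

Three of the four predictions match here, so the accuracy is 0.75.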

In [None]:
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

# Define Trainer and corresponding arguments
args = TrainingArguments(
    output_dir="output",
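    # Evaluate and save once per epoch (eval_steps only applies to step-based evaluation, so it is not set).
    # Note: newer transformers releases rename this argument to `eval_strategy`.
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=7,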
    seed=42,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=hf_dataset["train"],
    eval_dataset=hf_dataset["test"],
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
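
The early stopping idea behind `early_stopping_patience` can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: the `epochs_until_stop` helper and the accuracy history are hypothetical, and the real callback has further options such as an improvement threshold.

```python
def epochs_until_stop(accuracies, patience):
    """Return how many evaluations run before patience-based stopping kicks in."""
    best = float("-inf")
    stale = 0  # consecutive evaluations without a new best score
    for epoch, acc in enumerate(accuracies, start=1):
        if acc > best:
            best = acc
            stale = 0
        else:
            stale += 1
        if stale >= patience:
            return epoch
    return len(accuracies)

# Hypothetical per-epoch accuracies
history = [0.70, 0.78, 0.77, 0.78, 0.76, 0.80, 0.81]
print(epochs_until_stop(history, patience=3))
```

In this made-up history, training would stop after epoch 5, before the later improvements, which shows how a small patience value can end training early.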


In [None]:
# Fine-tune the pre-trained model
trainer.train()

## Saving the model

In [None]:
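# Mount Google Drive so the saved model outlives the session (this import only works inside Google Colab)
from google.colab import drive
drive.mount('/content/drive')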

In [None]:
trainer.save_model("/content/drive/My Drive/_workshop_models/distilbert_cv_model")