### NLP Workshop Part 1

#### Goals:

This workshop shows how to apply different NLP techniques to verbal analytics of dialogue. It has three parts:
 - Part 1: Improve Dialogue Acts 
 - Part 2: Topic Modeling via LDA
 - Part 3: Unsupervised Transformers   

#### Part 1: Improve Dialogue Acts:

- The existing dialogue act package [DialogTag](https://pypi.org/project/DialogTag/) uses either the `bert-base-uncased` or the `distilbert-base-uncased` model, both of which are good at learning word-level associations. This is because these networks are trained on the masked language modeling (MLM) objective: during training, a random word in the input is masked and the network has to learn to predict it.
    - __Idea__: Instead of BERT/DistilBERT, can we train and use a transformer architecture that learns semantics at the sentence level? The [Sentence Transformers on Hugging Face](https://huggingface.co/sentence-transformers) can be fine-tuned and then used for inference.
        - We can consider fine-tuning a sentence transformer on the same dataset that the DialogTag authors used to train BERT/DistilBERT: the [Switchboard Corpus](https://catalog.ldc.upenn.edu/LDC97S62), which is available for download. After fine-tuning, we can use the trained transformer to predict dialogue tags from sentence-level semantics.
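
To make the masked language modeling objective concrete, here is a minimal toy sketch in plain Python. It masks a single whole word, whereas BERT's actual pre-training masks roughly 15% of subword tokens. The function `mask_one_token` and the example utterance are illustrative names, not part of DialogTag or Transformers.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder string used by BERT-style vocabularies


def mask_one_token(tokens, rng):
    """Replace one randomly chosen token with MASK_TOKEN.

    Returns the masked sequence, the masked position, and the original
    token the model would be trained to predict at that position.
    """
    position = rng.randrange(len(tokens))
    masked = list(tokens)
    target = masked[position]
    masked[position] = MASK_TOKEN
    return masked, position, target


rng = random.Random(0)  # fixed seed so the example is repeatable
utterance = "so what do you think about that".split()
masked, position, target = mask_one_token(utterance, rng)
print(masked)
print(f"position={position}, target={target!r}")
```

The training signal here is tied to one word position, which is why such models are strong on word-level associations; a sentence-level model is instead trained so that the representation of the whole utterance is useful.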

#### Code Demo: DialogTag improvement: Examples to showcase HuggingFace sentence-transformer usage for a task

In [None]:
#!pip install datasets
#!pip install transformers

# Note: if you run into compatibility issues, install these versions: datasets==2.4.0, transformers==4.11.3

In [None]:
import datasets
import random
import pandas as pd
import numpy as np

from datasets import load_dataset, load_metric
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from IPython.display import display, HTML

In [None]:
# The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training,
# evaluating, and analyzing natural language understanding systems.

# CoLA (Corpus of Linguistic Acceptability): each example is a sequence of words annotated with
# whether it is a grammatical English sentence.

actual_task = "cola"
dataset = load_dataset("glue", actual_task)
metric = load_metric('glue', actual_task)

In [None]:
dataset["train"][0]

In [None]:
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(dataset["train"])
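
The retry loop in `show_random_elements` draws indices until it has enough distinct ones. As a hedged alternative, the same effect can be had with `random.sample`, which draws without replacement. The helper name `pick_unique_indices` is hypothetical and only illustrates the sampling step, not the table display.

```python
import random


def pick_unique_indices(dataset_size, num_examples, rng=random):
    """Return num_examples distinct indices in [0, dataset_size)."""
    if num_examples > dataset_size:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample draws without replacement, so no retry loop is needed.
    return rng.sample(range(dataset_size), num_examples)


picks = pick_unique_indices(dataset_size=100, num_examples=10, rng=random.Random(0))
print(picks)
```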

In [None]:
metric
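
For CoLA, the GLUE metric reports the Matthews correlation coefficient, which ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right). The following is a minimal pure-Python sketch of the binary case, written only to show the formula; it is not the implementation the `metric` object uses, and `matthews_corrcoef` here is a locally defined illustrative function.

```python
import math


def matthews_corrcoef(predictions, references):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when a row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


labels = [1, 1, 0, 0, 1, 0]
perfect = matthews_corrcoef(labels, labels)
inverted = matthews_corrcoef([1 - y for y in labels], labels)
all_ones = matthews_corrcoef([1] * len(labels), labels)
print(perfect, inverted, all_ones)
```

Unlike plain accuracy, a model that always predicts the same class scores 0 here, which matters when the two classes are not equally frequent.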

In [None]:
from transformers import AutoTokenizer

# MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation
model_checkpoint = "sentence-transformers/all-MiniLM-L6-v2" 
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

# Other models that can be explored: https://www.sbert.net/docs/pretrained_models.html

In [None]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.");
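
A BERT-style tokenizer called on a sentence pair typically returns integer `input_ids` together with `token_type_ids` and an `attention_mask`, laid out as `[CLS] first [SEP] second [SEP]`. The toy sketch below mimics only that layout: whitespace splitting stands in for the real subword tokenizer, tokens are kept as strings instead of ids, and `encode_pair` is a hypothetical helper.

```python
def encode_pair(first, second=None):
    """Toy BERT-style layout: [CLS] first [SEP] (second [SEP]).

    Whitespace splitting stands in for the real subword tokenizer.
    """
    tokens = ["[CLS]"] + first.lower().split() + ["[SEP]"]
    token_type_ids = [0] * len(tokens)  # segment 0: first sentence
    if second is not None:
        second_tokens = second.lower().split() + ["[SEP]"]
        tokens += second_tokens
        token_type_ids += [1] * len(second_tokens)  # segment 1: second sentence
    attention_mask = [1] * len(tokens)  # 1 = real token, 0 would mark padding
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


encoded = encode_pair("Hello, this one sentence!", "And this sentence goes with it.")
print(encoded["tokens"])
print(encoded["token_type_ids"])
```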

In [None]:
task_to_keys = {
    "cola": ("sentence", None),
}

sentence1_key, sentence2_key = task_to_keys[actual_task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

In [None]:
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

In [None]:
preprocess_function(dataset['train'][:5]);

In [None]:
encoded_dataset = dataset.map(preprocess_function, batched=True);
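
With `batched=True`, the preprocessing function receives a dictionary that maps each column name to a list of values for a whole batch, rather than one example at a time. The sketch below imitates that calling convention in plain Python so the data flow is visible without downloading anything. `batched_map` and `toy_preprocess` are hypothetical stand-ins, and truncating whitespace tokens only approximates `truncation=True`.

```python
def batched_map(columns, function, batch_size=1000):
    """Apply `function` to slices of a dict-of-lists and add the columns it returns."""
    num_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        # Each batch keeps the dict-of-lists shape, just with shorter lists.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}


def toy_preprocess(examples, max_length=4):
    # Whitespace tokens cut to max_length stand in for tokenizer(..., truncation=True).
    return {"tokens": [sentence.split()[:max_length] for sentence in examples["sentence"]]}


toy = {
    "sentence": ["The cat sat on the mat .", "Cat the sat .", "Dogs bark ."],
    "label": [1, 0, 1],
}
encoded = batched_map(toy, toy_preprocess, batch_size=2)
print(encoded["tokens"])
```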

In [None]:
# Boilerplate code

num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

In [None]:
metric_name = "matthews_correlation"
model_name = model_checkpoint.split("/")[-1]
batch_size = 16

# The following parameters are recommended starting points and can be tuned.

# Batch size is set to 16 for demonstration purposes. When training on a GPU, larger
# batch sizes (e.g., 32, 64, 128) can be used.

# Weight decay, recommended range: [1e-4, 1e-2]
# Learning rate: [5e-5, 5e-4]
#     Question: Why do we use such a low learning rate?

# The number of epochs is set to 1, but higher values in the range [5, 10] are worth trying.

args = TrainingArguments(
    f"{model_name}-finetuned-{actual_task}",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)
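
To get a feel for what these arguments imply, the sketch below estimates the number of optimizer steps and traces a linearly decaying learning rate, which to my understanding is the Trainer's default schedule when no warmup is configured. The training-set size of 8,551 sentences is an assumption about CoLA; replace it with `len(dataset["train"])`. Both helper functions are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps on a single device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs


def linear_decay_lr(step, total_steps, base_lr):
    """Learning rate after `step` updates under linear decay to zero, no warmup."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


# Assumption: about 8,551 training sentences in CoLA.
steps = total_training_steps(num_examples=8551, batch_size=16, num_epochs=1)
print(steps)
for step in (0, steps // 2, steps):
    print(step, linear_decay_lr(step, steps, base_lr=2e-5))
```

Doubling the batch size roughly halves the number of updates per epoch, which is one reason batch size and learning rate are usually tuned together.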

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if actual_task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
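
For a classification task, the predictions handed to `compute_metrics` are per-class scores (logits), one row per example, and the argmax over each row gives the predicted label. A small pure-Python sketch with made-up logits shows that step; `argmax_rows` is an illustrative stand-in for `np.argmax(predictions, axis=1)`.

```python
def argmax_rows(logits):
    """Index of the largest value in each row (ties go to the first index)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Made-up scores for three sentences and two classes (unacceptable, acceptable).
toy_logits = [[2.1, -0.3], [-1.0, 0.4], [0.2, 0.9]]
toy_labels = [0, 1, 0]
toy_predictions = argmax_rows(toy_logits)
accuracy = sum(p == y for p, y in zip(toy_predictions, toy_labels)) / len(toy_labels)
print(toy_predictions, accuracy)
```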

In [None]:
# Boilerplate code

validation_key = "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
# The following cell takes approximately 5 minutes to run, and longer if the number of epochs
# is set to a higher value.
trainer.train()

In [None]:
trainer.evaluate()

#### Hyperparam search via Optuna / Ray\[Tune\]

In [None]:
# The Trainer supports hyperparameter search using Optuna or Ray Tune.
# This last section needs one of those libraries installed:
# uncomment the line you want in the next cell and run it.

In [None]:
# pip install optuna
# pip install ray[tune]

In [None]:
# Boilerplate code
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

In [None]:
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
# The method we call this time is hyperparameter_search. Note that it can take a long time to run on the
# full dataset. To look for good hyperparameters on a portion of the training data instead, build a
# shard holding 1/10th of the dataset:

train_dataset = encoded_dataset["train"].shard(index=1, num_shards=10)

# This assignment alone does not change the trainer created above. To search on the shard, pass
# train_dataset=train_dataset when constructing the Trainer in the previous cell. You can then run a
# full training with the best hyperparameters picked by the search.

best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

# The best run is the hyperparameter configuration that achieved the best objective value.

In [None]:
# The hyperparameter_search method returns a BestRun object, which contains the value of the
# objective that was maximized (by default the sum of all metrics) and the hyperparameters used for that run.
best_run

In [None]:
# You can customize the objective to maximize by passing along a compute_objective function to the
# hyperparameter_search method, and you can customize the search space by passing a hp_space argument
# to hyperparameter_search. See this forum post for some examples.

# To reproduce the best run, set its hyperparameters on the trainer's TrainingArguments
# and train again:

for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()
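
The `hp_space` and `compute_objective` arguments mentioned above can be sketched as follows for the Optuna backend. The ranges and the choice of hyperparameters are illustrative assumptions, not recommendations from the workshop. The metric key assumes the Trainer's usual `eval_` prefix on the name returned by the GLUE CoLA metric.

```python
def hp_space(trial):
    """Search space in the Optuna style: `trial` suggests one value per hyperparameter."""
    return {
        # Illustrative ranges only; log=True samples on a logarithmic scale.
        "learning_rate": trial.suggest_float("learning_rate", 5e-6, 5e-4, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-4, 1e-2, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [16, 32, 64]
        ),
    }


def compute_objective(metrics):
    """Optimize the Matthews correlation alone instead of a sum of metrics."""
    return metrics["eval_matthews_correlation"]


# Hypothetical usage with the trainer defined earlier (requires optuna to be installed):
# best_run = trainer.hyperparameter_search(
#     hp_space=hp_space,
#     compute_objective=compute_objective,
#     n_trials=10,
#     direction="maximize",
#     backend="optuna",
# )
```

Restricting the objective to a single metric avoids mixing quantities with different scales, such as a loss and a correlation, in one sum.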