# Demo 7B
The purpose of this demonstration is to reinforce the learning objectives from lectures 7.2-7.3. After completing this demonstration, you should feel comfortable:

- using transfer learning to fine-tune a pretrained deep learning model
- evaluating model performance
- applying a pretrained model to new data

### Transfer Learning in Python
Recall that transfer learning refers to the process of updating a pretrained deep learning model for a specific task. You can use transfer learning for NLP tasks including:

- text classification (sentiment, topics)
- masked language models (learning new embeddings)
- token classification
- question answering (text generation)

My experience with transfer learning is almost exclusively with the first item on this list, text classification. This is what we will focus on.

In Python, the de facto standard for sharing trained deep learning models is through a library called __[transformers](https://huggingface.co/docs/transformers/index)__. This library is maintained by a group called HuggingFace (🤗). If you navigate to the __[models](https://huggingface.co/models)__ section of the page, you'll see there are over 200,000 "models" (essentially sets of ANN parameters) available for download and use (directly or after fine-tuning). For instance, __[bert-base-uncased](https://huggingface.co/bert-base-uncased)__ is the original uncased BERT model, trained for masked language modeling. This model has about 110M parameters.

You can also filter the models to focus only on __[those meant for text classification](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads)__ to see the various tasks and corpora that have been used. For instance, there are models focused on __[tweet classification](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest)__ and __[various financial tasks](https://huggingface.co/models?other=financial-text-analysis)__. This latter model is discussed in __[Huang, Wang, and Yang 2023](https://onlinelibrary.wiley.com/doi/full/10.1111/1911-3846.12832)__.

Before we move to our exercise, I want to highlight an important consideration that differs from the approach we've taken in earlier exercises and an additional caveat. First, transformer models are generally designed to work on *shorter spans of text*, such as sentences or paragraphs. For instance, __[this general purpose classifier](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)__ has a max sequence length (`max_seq_length`) of 128. Therefore, for fine-tuning we usually need labeled data at this more granular level, and then we must aggregate back up to the document level if that is the level we require. For shorter texts, like product reviews, customer complaints, or social media posts, this generally doesn't cause an issue.
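
To make the "aggregate back up to the document level" step concrete, here is a minimal sketch. It assumes you already have one predicted label per sentence; the `doc_id` and `pred` column names and the values are hypothetical and are not part of the course data.

```python
import pandas as pd

# Hypothetical sentence-level predictions: one row per sentence.
# `doc_id` and `pred` are illustrative names, not columns in the course data.
sentence_preds = pd.DataFrame({
    "doc_id": ["A", "A", "A", "B", "B"],
    "pred":   [1, 0, 1, 0, 0],   # 1 = sentence classified as financial
})

# Roll up to the document level: sentence count and share of financial sentences
doc_level = (
    sentence_preds
    .groupby("doc_id")["pred"]
    .agg(n_sentences="count", share_financial="mean")
    .reset_index()
)
print(doc_level)
```

The share of flagged sentences is only one possible document-level measure; a count or an "any sentence flagged" indicator may suit other research questions better.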

Second, the caveat is that this coding is going to feel a little "heavier" than earlier tasks. I purposely did not include this topic in an experiential task because we don't really have sufficient time to learn all of the ins and outs of `transformers` and `pytorch`. So, if you feel a bit overwhelmed, know that I'm much more interested in you observing the process than being able to come up with this code independently. Hopefully with this basic understanding you can identify how to leverage these models in your own work as needed.

Let's get started with our exercise!

### The data
To demonstrate the power of transfer learning, we require a dataset of hand-labeled data that we can use to update the ANN weights in the transformer model. There are lots of datasets available, but for our demonstration I'm going to use a set of sentences my coauthors and I hand-labeled for a current working paper, available __[here](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3918531)__. In this paper, we examine how the quality of Seeking Alpha users' contributions varies with the company's CSR reputation.

In some more recent tests that we've added, we use fine-tuned classifiers to identify language in articles related to fundamental performance. This is the task we'll examine.

Let's load the data and inspect.

In [7]:
import pandas as pd
df = pd.read_csv("/storage/ice-shared/mgt8833/classdata/financial_labeled_data.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   label   2000 non-null   float64
 1   sent    5000 non-null   object 
dtypes: float64(1), object(1)
memory usage: 78.3+ KB


So this dataset has **5,000** rows/observations, of which **2,000** are labeled. In the data, the label equals `1` for sentences coded as financial. 

**PAUSE** and determine what percentage of sentences are classified as relating to financial performance.

In [8]:
#Your work goes here:

Note that it's easiest to use the default column names for our datasets, so we're going to rename "sent" to "sentence" and "label" to "labels" and then move on.

In [9]:
df = df.rename(columns={"sent":"sentence","label":"labels"})

### Fine-tuning with `transformers`
To fine-tune our model, we're going to follow a process very similar to this __[tutorial](https://huggingface.co/docs/transformers/training)__ available from Huggingface. We'll proceed in four steps:

1. Loading data and setting up datasets for training
2. Training (or fine-tuning) the model
3. Evaluating model performance
4. Applying to new data

Let's get started with setting up the data.

#### Step 1 - Data Set-up
##### Formatting the dataset

Unfortunately, `transformers` does not work with native dataframes, which is what we're used to using. There are two reasons for this:

- First, data must be represented as __[`tensors`](https://en.wikipedia.org/wiki/Tensor)__. Tensors are a broad class of mathematical representations that facilitate operations, such as gradient calculations.
- Second, often the volume of data used by these models is massive, so `transformers` has objects available to set up dataset "loaders", or processes to avoid putting all data in memory at once. 
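
To illustrate the "loader" idea in the second bullet, here is a minimal sketch in plain Python. It is not how `transformers` or `datasets` implement loading; `batch_iter` is a hypothetical helper that simply shows how records can be handed over one batch at a time.

```python
from typing import Iterable, Iterator, List

def batch_iter(records: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` records, one batch at a time."""
    batch: List[str] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch

# A generator expression: sentences are produced on demand, not stored up front
sentences = (f"sentence {i}" for i in range(10))

# Collected into a list here only so we can inspect the batch sizes
batches = list(batch_iter(sentences, batch_size=4))
print([len(b) for b in batches])  # → [4, 4, 2]
```

In a real training loop you would iterate over `batch_iter(...)` directly, so that only one batch is in memory at any time.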

We don't really need to worry about the second issue for this volume of data, and Huggingface's companion `datasets` library includes some methods for converting pandas data. See the documentation __[here](https://huggingface.co/docs/datasets/tabular_load)__. Note that we could have loaded directly from the CSV, but I started with a dataframe because I'm more comfortable manipulating data with `pandas`.

Let's start by isolating the 2,000 rows with labels. **PAUSE** and see if you can do that, saving the result in a dataframe called `labeled`.

In [10]:
#Your work goes here:

Next, we're going to recast the label to an integer. We will then create two datasets, a training and validation dataset. We'll use `train_test_split` to split the dataframe and then use `from_pandas` to convert the data. Finally, we'll store the datasets in a `DatasetDict`, which is just a convenient wrapper used for datasets formatted for `transformers`:

In [6]:
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict

# Convert label to integer
labeled['labels'] = labeled['labels'].astype(int)

# Train/Test split
train,test = train_test_split(labeled,train_size=0.80,random_state=123)
train_ds = Dataset.from_pandas(train,split='train',preserve_index=False)
test_ds  = Dataset.from_pandas(test,split='test',preserve_index=False)

ds = DatasetDict({
    'train': train_ds,
    'test':test_ds})
ds['train'][100]


##### Tokenizing the data
Similar to other NLP tasks, we're next going to tokenize the data. However, we can't use our standard `CountVectorizer` approach here. There are two primary reasons for this. 

First, since we are fine-tuning an existing model, our vocabulary is pre-established. That is, the original model was built with a specific vocabulary, and we must start with that. There are methods for adding new language to the vocabulary, but usually this isn't necessary. One exception would be adding emojis for social-media classification, but even that we can circumvent by starting with a model built for social media.

Second, recall that much of the power of transformers comes from the concept of attention, which requires information on sequence. Thus, our tokenizer needs to capture this information.
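
A quick way to see why a bag-of-words representation is not enough here: two sentences with different meanings can have identical word counts. The sentences below are made up for illustration.

```python
from collections import Counter

a = "profits beat expectations but revenue fell"
b = "revenue beat expectations but profits fell"

# Bag-of-words view: only the counts of each word survive
bow_a = Counter(a.split())
bow_b = Counter(b.split())

print(bow_a == bow_b)          # → True  (identical word counts)
print(a.split() == b.split())  # → False (different token sequences)
```

A count-based vectorizer would treat these two sentences as the same input, whereas a sequence-aware tokenizer keeps them distinct.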

Fortunately, Huggingface has made the tokenization process very easy by "attaching" the appropriate tokenizer to each model we'll use. There are many tokenization objects depending on the type of model you are fine-tuning, but `AutoTokenizer` available through Huggingface's __[Auto Classes](https://huggingface.co/docs/transformers/model_doc/auto)__ is a great choice for general tokenization.

To set up the tokenizer, we need to import a few objects and identify the model we're going to fine-tune. We're going to use a version of "DistilBERT", which is a simplified version of BERT (67M parameters instead of well over 100M). The specific version we'll use is called __[`distilbert-base-uncased-finetuned-sst-2-english`](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)__.

In [None]:
from transformers import AutoTokenizer

pretrained_model = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model,padding="max_length", truncation=True)

Next, we need to use this `tokenizer` to format our data. There are a few ways of doing this; we'll follow Huggingface's recommended approach in the tutorial I link to above. Specifically:

1. Write a function that accepts each record and returns tokenized data.
2. Use __[`map`](https://huggingface.co/docs/datasets/process)__ to apply that function to our formatted datasets.

In [None]:
def tokenize(record):
    return tokenizer(record['sentence'], padding='max_length', truncation=True)

tokenized_data = ds.map(tokenize, batched=True)

Let's recap a few things here:
- `padding`: We set this to "max_length", which is based on the pre-trained model. We could also make this corpus-specific (longest sequence), or not pad sequences at all (the default). I've always used fixed-length sequences in this context.
- `truncation`: We set this to `True`, which means any sequences longer than the maximum length will be truncated.
- `map`: The `dataset` method mentioned above.
- `batched`: Allows you to process the data in batches; we only have 2,000 records, so this probably isn't necessary, but this illustrates how you'd want to handle larger datasets.
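
To make padding and truncation concrete, here is a toy sketch in plain Python. It is not the Huggingface implementation: `pad_or_truncate` is a hypothetical helper, the token IDs are made up, and a real tokenizer would also keep the closing special token when it truncates.

```python
from typing import Dict, List

def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Force a sequence of token IDs to exactly `max_length` entries."""
    ids = token_ids[:max_length]      # truncation: drop anything past max_length
    n_pad = max_length - len(ids)     # padding: how many filler IDs are needed
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

short_seq = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_seq = pad_or_truncate([101, 7, 8, 9, 10, 11, 12, 102], max_length=6)

print(short_seq)
print(long_seq)
```

Both results have the same length, which is what lets sequences be stacked into a single rectangular tensor.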

One additional thing to note: the number of tokens generated by these tokenizers doesn't always equal the number of words (!!). This is for two primary reasons. First, these models insert special tokens, `[CLS]` and `[SEP]`. We won't see those tokens, but they are encoded to denote the beginning and end of sentences. Second, some tokenizers actually tokenize "sub-words", like separating "disagree" into "dis" and "##agree", to help better understand meaning.
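
Here is a toy sketch of sub-word tokenization, assuming a greedy longest-match rule and a tiny, made-up vocabulary. The functions `toy_wordpiece` and `toy_tokenize` are hypothetical and only approximate what a real tokenizer does; the actual splits depend on the vocabulary shipped with the model.

```python
from typing import List, Set

def toy_wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: fall back to the unknown token
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary, for illustration only
vocab = {"we", "dis", "##agree", "revenue", "grew"}

def toy_tokenize(sentence: str) -> List[str]:
    tokens = ["[CLS]"]                      # special start token
    for word in sentence.lower().split():
        tokens.extend(toy_wordpiece(word, vocab))
    tokens.append("[SEP]")                  # special end token
    return tokens

tokens = toy_tokenize("We disagree")
print(tokens)  # → ['[CLS]', 'we', 'dis', '##agree', '[SEP]']
```

Two words become five tokens here: two special tokens plus three word pieces.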

If you want to see what one element of tokenized data looks like, uncomment this line:

In [None]:
# tokenized_data['train'][0]

Now we're ready to train!

#### Step 2 - Training
If you read through the `transformers` documentation, you'll see there are numerous options, including a simple __[`Trainer`](https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/trainer#transformers.Trainer)__ option, training with native pytorch, or even with tensorflow if your original model was built in that architecture. In my experience, I haven't had to go beyond the most direct method of fine-tuning, the `Trainer` object, so that's what we will focus on.

Let's get our model set up. Note that the model must match the tokenizer, so we'll use the same `pretrained_model` variable we established above (and you are free to try other models by changing that variable). This may take a minute if you download the model from Huggingface. I'm also going to set several random seeds to try to make this as reproducible as possible, though it probably won't be perfect.

In [None]:
from transformers import AutoModelForSequenceClassification
import torch, numpy as np, random
# Set random seeds
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val) # don't really need this unless using GPU

model = AutoModelForSequenceClassification.from_pretrained(pretrained_model, num_labels=2,
                                                          ignore_mismatched_sizes=True)

We now have to set up our training arguments. We do this with a special object called `TrainingArguments` (__[docs](https://huggingface.co/docs/transformers/training#:~:text=Next%2C%20create%20a-,TrainingArguments,-class%20which%20contains)__). We will leave __[hyperparameters](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments)__ mostly as the defaults but adjust a few things:
- `output_dir` = "./trained_model"
- `learning_rate` = `1e-5` (matching the original model)
- `warmup_steps` = 5 (number of initial steps over which the learning rate ramps up to its target value)
- `num_train_epochs` = 5 (how many passes through the training data)
- `evaluation_strategy` = "steps" (when to evaluate: every `eval_steps` steps, rather than after each epoch)
- `per_device_train/eval_batch_size` = 16 / 64 (batch sizes)
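
To see how `warmup_steps`, `num_train_epochs`, and the batch size interact, here is a rough sketch of the learning rate over training. It assumes the default linear scheduler, a single device, no gradient accumulation, and the 1,600 training sentences from our 80/20 split; `linear_schedule_lr` is a hypothetical helper written for illustration, not a `transformers` function.

```python
import math

def linear_schedule_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

n_train = 1600                              # 80% of the 2,000 labeled sentences
steps_per_epoch = math.ceil(n_train / 16)   # per_device_train_batch_size = 16
total_steps = steps_per_epoch * 5           # num_train_epochs = 5

for step in [0, 1, 5, 250, total_steps]:
    lr = linear_schedule_lr(step, base_lr=1e-5, warmup_steps=5, total_steps=total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

Under these assumptions training runs for 500 optimizer steps, so `eval_steps=25` would trigger up to 20 evaluations if early stopping never fires.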

In [None]:
from transformers import TrainingArguments, Trainer
import os

outdir = "./trained_model"
logdir = "./logs"
for d in [outdir,logdir]:
    os.makedirs(d,exist_ok=True)

params = TrainingArguments(output_dir = outdir,
                           overwrite_output_dir = True,  # boolean, not the string 'True'
                           learning_rate=1e-5,
                           weight_decay=0.001,
                           warmup_steps = 5,
                           num_train_epochs=5,
                           evaluation_strategy="steps",
                           logging_dir=logdir,
                           per_device_train_batch_size=16,
                           per_device_eval_batch_size=64,
                           save_strategy='steps',
                           eval_steps=25)

As noted in the Huggingface documentation, the `Trainer` does not automatically evaluate model performance. There's a convenient library called __[evaluate](https://huggingface.co/docs/evaluate/index)__ that allows us to access various functions we can "plug-in" for evaluation as the model trains. However, we can also use the `sklearn` functions we are used to. We just have to set them up in a function.

This function is going to accept `pred` which is a set of predictions from the transformer (we'll see this again later). Note that transformers models always return "logits", which require conversion for interpretation. We'll use `argmax` to identify the predicted class.
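
As a small illustration of what "converting logits" means, the sketch below applies a softmax to get probabilities and `argmax` to get the predicted class. The logit values are made up; in practice they come from the model.

```python
import numpy as np

# Hypothetical logits for two sentences and two classes (values are made up)
logits = np.array([[ 2.0, -1.0],
                   [-0.5,  1.5]])

def softmax(x: np.ndarray) -> np.ndarray:
    """Convert logits to probabilities along the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

probs = softmax(logits)
pred_class = logits.argmax(-1)   # same answer as probs.argmax(-1)

print(probs.round(3))
print(pred_class)                # → [0 1]
```

Because softmax preserves the ordering of the logits, taking `argmax` on the raw logits is enough when we only need the predicted class.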

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
    }

We're now going to set up an early stopping callback. This is more involved than in `keras`: here we define our own callback class and pass it to the `Trainer` through its `callbacks` argument. We haven't ever defined a custom `class`, so I'm not going to go through the details here. Just recognize that this object sets up a callback to track accuracy on the fly and interrupt training if no further improvement is detected.
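
If the class feels opaque, the core "patience" logic can be sketched without any `transformers` machinery. The function `steps_until_stop` and the accuracy values below are hypothetical; the counter rule is meant to mirror the one used in this demo's callback (increment when accuracy falls below the best so far, otherwise reset).

```python
from typing import List, Optional

def steps_until_stop(metric_history: List[float], patience: int) -> Optional[int]:
    """Return the index of the evaluation at which training would stop, or None."""
    best = None
    counter = 0
    for i, metric in enumerate(metric_history):
        if best is None or metric >= best:
            best, counter = metric, 0   # new best (or tie): reset the counter
        else:
            counter += 1                # worse than the best so far
        if counter >= patience:
            return i
    return None

# Made-up validation accuracies, one per evaluation
history = [0.70, 0.74, 0.78, 0.77, 0.78, 0.76, 0.75, 0.77]

print(steps_until_stop(history, patience=1))  # → 3
print(steps_until_stop(history, patience=3))  # → 7
```

A larger patience tolerates short dips in accuracy at the cost of running more evaluations before stopping.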

In [None]:
from transformers import TrainerCallback, TrainerControl

class EarlyStoppingCallback(TrainerCallback):
    "A callback that implements early stopping."
    def __init__(self, early_stopping_patience=1):
        self.early_stopping_patience = early_stopping_patience
        self.early_stopping_counter = 0
        self.best_metric = None
        self.last_metric = None

    def on_log(self, args, state, control, logs=None, **kwargs):
        # We assume that 'eval_accuracy' is logged by the Trainer
        if 'eval_accuracy' in logs:
            self.last_metric = logs['eval_accuracy']

    def on_evaluate(self, args, state, control, metrics, **kwargs):
        # metric for early stopping
        metric = self.last_metric
        if self.best_metric is None:
            self.best_metric = metric
        # if current metric is worse than best_metric, increment counter
        if metric < self.best_metric:
            self.early_stopping_counter += 1
        else:  # else, reset counter and update best_metric
            self.early_stopping_counter = 0
            self.best_metric = metric
        # if counter has reached the patience limit, stop training
        if self.early_stopping_counter >= self.early_stopping_patience:
            control.should_training_stop = True

Next, we set up our `Trainer` with the objects established above.

In [None]:
trainer = Trainer(model=model,
                  args=params,
                  train_dataset=tokenized_data['train'],
                  eval_dataset=tokenized_data['test'],
                  compute_metrics=compute_metrics,
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] # set patience at 3
                 )

Finally, we can train the model with the `train()` method. This will take quite a while unless you utilize a GPU, which I have not done:

In [None]:
trainer.train()

This took quite a while, and accuracy ended up around 80% when I finished. That's not bad, but not great. We did a little better in the paper, though I think I started from a different model. In addition, this data is noisy: if you inspect several sentences, you'll see that some aren't very well formed.

Also, I ran this on my laptop's CPU. With GPUs, training time is *drastically* reduced. You could set up a GPU instance on ICE though your environment would need to be adjusted. Google's Colab is also a great resource for free GPUs for smaller tasks. 

We're going to save the model, but before we get there I want to make one adjustment. Since we are fine-tuning a pre-existing model, the original labels, "POSITIVE" and "NEGATIVE", are still in the model configuration. It won't be apparent unless we generate new predictions from the model, but let's go ahead and update this configuration before saving.

In [None]:
model.config.id2label = {0: "NON-FINANCIAL", 1: "FINANCIAL"}
model.config.label2id = {"NON-FINANCIAL": 0, "FINANCIAL": 1}

Now we will save the model before going any further.

In [None]:
trainer.save_model('final_fin_classifier')

#### Step 3 - Evaluating Model Performance
Next we're going to evaluate this model as we would other classifiers we've trained. Since I'm guessing many of you won't go through the full training procedure, let's load the saved model I've provided you.

In [None]:
model2 = AutoModelForSequenceClassification.from_pretrained("/storage/ice-shared/mgt8833/classdata/final_fin_classifier", num_labels=2,
                                                          ignore_mismatched_sizes=True)

To use this model to generate predictions, we have two options. 

First, we can take our training (or testing) dataset which we've already encoded and use `torch` to generate predictions. This is actually fairly involved because `torch` is a lower level framework than what we're used to (`sklearn`, `transformers`, etc.). I also have had issues keeping things aligned when generating batches in `torch`, so I usually avoid this approach.

Second, we can build a __[pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines)__ that accepts our original (textual) data, tokenizes, and then classifies on the fly. The beauty of this approach is we can use it with any text (as we'll see below).

So we'll go with this latter approach, and here are the specific steps we'll take:
1. Load both the **model** and **tokenizer** (we have already done this; we'll use `model2` and `tokenizer`).
2. Define a `pipeline`. 
3. **Generate predictions** for your data and **evaluate results**.

Let's start with **step 2** since we've already loaded our model and tokenizer:

**Step 3.2 - Build the pipeline**:

A __[pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines)__ provides a very flexible and intuitive approach to setting up a function-like object to process new text. The first (required) positional argument identifies the *type* of pipeline ("text-classification" for us, but see documentation for other options). After that, we provide the model and tokenizer. Different types of pipelines will require different objects.

In [None]:
from transformers import pipeline

# Create the pipeline
text_classification_pipeline = pipeline("text-classification", model=model2, tokenizer=tokenizer)

**Step 3.3 - Generate predictions and evaluate results**

Now, we can pass our original text to this pipeline and generate predictions. Let's focus on the evaluation data, which is in the dataframe `test`:

In [None]:
test

We can pass this list of sentences to our pipeline like so:

In [None]:
predictions = text_classification_pipeline(test['sentence'].tolist())
predictions[:10] # show first 10

Now we'll create a dataframe from this list, and convert our labels to the original "ids" (0 or 1) with the `label2id` dictionary stored in the model:

In [None]:
test_pred_df = pd.DataFrame(predictions)
test_pred_df['label'] = test_pred_df['label'].map(model2.config.label2id)
test_pred_df

Finally, we can use `classification_report` to get more insight into how this model performed:

In [None]:
from sklearn.metrics import classification_report

print(classification_report(test['labels'],test_pred_df['label']))
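
If the report is hard to read, it may help to see where precision and recall come from. Here is a small sketch on made-up labels (not our test set) that computes both from the confusion matrix for the positive class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true and predicted labels (1 = financial, 0 = non-financial)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the sentences flagged financial, how many truly are
recall = tp / (tp + fn)      # of the truly financial sentences, how many we flagged

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

These hand-computed values match `precision_score` and `recall_score`, which is what `classification_report` shows in the row for class `1`.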

#### Step 4: Applying to new data
Often, we use a transformer model like the one fine-tuned in this demonstration as part of a larger "pipeline" (a larger pipeline that includes the pipeline we defined above). In other words, we have some analytical procedure where we process large volumes of text, and part of that procedure relates to *coding* or *classifying* elements in the corpus according to some label. Perhaps your organization receives large volumes of customer feedback (e.g., complaints). You could use a fine-tuned transformer to efficiently direct complaints to the appropriate party or prioritize ones that are particularly impactful.
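
As a rough sketch of the routing idea, the code below sorts texts into queues using predictions shaped like the output of a text-classification pipeline (a list of dictionaries with `label` and `score`). Everything here is hypothetical: the labels, the scores, the 0.80 threshold, and the `route_by_prediction` helper are illustrative assumptions, and no model is actually called.

```python
from typing import Dict, List

def route_by_prediction(texts: List[str], preds: List[dict], min_score: float = 0.80) -> Dict[str, List[str]]:
    """Send each text to the queue for its predicted label, or to manual review."""
    queues: Dict[str, List[str]] = {"MANUAL_REVIEW": []}
    for text, pred in zip(texts, preds):
        if pred["score"] < min_score:
            queues["MANUAL_REVIEW"].append(text)   # low confidence: let a person decide
        else:
            queues.setdefault(pred["label"], []).append(text)
    return queues

complaints = [
    "I was charged twice.",
    "My package never arrived.",
    "Something feels off.",
]
# Stand-in for pipeline output; made-up values for illustration
fake_predictions = [
    {"label": "BILLING", "score": 0.97},
    {"label": "SHIPPING", "score": 0.91},
    {"label": "BILLING", "score": 0.55},
]

queues = route_by_prediction(complaints, fake_predictions)
print(queues)
```

The threshold is a judgment call: a higher value sends more items to manual review but reduces the chance of misrouting.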

To finish up this demonstration, we're going to look at how our model performs on *new text*. Recall at the beginning of this demo we created a dataframe called `labeled`, which excluded any sentences in the original dataframe (`df`) without labels. Let's go back to the original dataframe and grab a sample of unlabeled sentences:

In [None]:
new_sents_df = df.loc[df['labels'].isnull(),'sentence'].sample(5,random_state=321)
new_sents_df

Now, we'll convert this to a list and generate our predictions, just like we did above:

In [None]:
# Get predictions
predictions = text_classification_pipeline(new_sents_df.to_list())
predictions

In [None]:
# Print the labels and scores
for sentence, prediction in zip(new_sents_df, predictions):
    print(f"Sentence: {sentence}")
    print(f"Predicted label: {prediction['label']}")
    print(f"Confidence score: {prediction['score']}\n")

Not bad! These labels seem reasonable to me (and again, remember, this data was unlabeled).

#### Conclusion & Homework
Hopefully this demo gave you a reasonable handle on the power of transfer learning, particularly as it relates to text classification. Transformer-based models have revolutionized NLP in a variety of settings. While we focused on the base implementation of transformers (with `transformers`), there is a "wrapper" package available called __[simpletransformers](https://simpletransformers.ai/)__. I did not use this in our demo because it (1) is relatively new, (2) has been inconsistently maintained (though it seems better recently), and (3) hides some of the details I wanted to make explicit. In any case, I wanted to point it out in case you want to explore more on your own.

For your **homework**, I'd like you to identify a situation in your own professional experience where applying transfer learning could help add insights or make analyses more efficient. Discuss with your classmates on the discussion board.