# PSY 341K Text Analysis for Behavioral Data Science
##### Spring 2024; written by: Prof Desmond Ong (desmond.ong@utexas.edu)

## Assignment 3

In the tutorial we learnt how to train (from scratch) a Recurrent Neural Network (RNN) to classify text.

In this assignment we'll use a pre-trained contextual word embedding model (BERT) and fine-tune it on a sentiment classification example (see the slides for a high-level illustration of what this means). This approach is quite "standard" in text analysis nowadays, and many projects adopt it.


There are a lot of libraries out there to help make this easy. We will be using the `transformers` library provided by 🤗 `huggingface`, which provides a lot of functionality. Much of this code is taken from the `huggingface transformers` documentation/tutorials, including the [quicktour](https://huggingface.co/docs/transformers/main/en/quicktour), and the [preprocessing](https://huggingface.co/docs/transformers/main/en/preprocessing) and [fine-tuning](https://huggingface.co/docs/transformers/main/en/training) tutorials.

In [1]:
! pip install transformers
! pip install datasets
! pip install evaluate
! pip install xformers


The `transformers` library from 🤗 `huggingface` has an "all-in-one" command that can load several pre-trained models for certain tasks, using `pipeline`. So for example, if you wanted to just get some sentiment analysis done, you could call:

```pipeline(task="sentiment-analysis")```

and it will load its default sentiment analysis model for you. You could also specify one of many different models that it already has. (Or different tasks). Then you can just pass it any text you want. For example:

In [3]:
from transformers import pipeline

classifier = pipeline(task="sentiment-analysis")

classifier("We are very happy to show you the 🤗 Transformers library.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

Some models are really powerful too. For example, we talked about `zero-shot-classification` in class. Here's an example from the [pipeline tutorial](https://huggingface.co/docs/transformers/main/en/pipeline_tutorial), where you can import a pre-trained model that can try to infer which of a set of candidate labels apply to the input text. It is zero-shot in the sense that you can give it any new label you require, and it does not have to be trained on the label.

(Note these models take up some space, so I've left it in comments so you can run it if you're interested).

In [6]:
# ## This model is a `zero-shot-classification` model.
# ## It will classify text, except you are free to choose any label you might imagine
#classifier = pipeline(model="facebook/bart-large-mnli")
#classifier("I have a problem with my iphone that needs to be resolved asap!!",
#candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
#)

# Fine-tuning a pre-trained BERT model

In this assignment we want to fine-tune a pre-trained BERT model on another dataset. After this assignment you can apply this same approach to fine-tuning models to your particular dataset of interest.

Recall the basic deep learning pipeline, which is 1) Set Up Data, 2) Define Model, 3) Set up Training Parameters, 4) Run and 5) Evaluate the model.

The `huggingface` `transformers` library already does lots of things for us. Let's use the `bert-base-uncased` model, which is the "base" BERT model trained on uncased (i.e., all-lower-cased) data. 

In [7]:
import torch

model_name = "bert-base-uncased"

## Set up Data

We can use the `AutoTokenizer` class from the `transformers` library, which loads the tokenizer that corresponds to your desired model.

The important thing is to use the same pre-trained tokenizer that corresponds to the model you are using. Different models are pre-trained with different tokenizing rules (e.g., subwords). Also, `bert-base-uncased`, as its name suggests, is trained on uncased data, so part of what the tokenizer does is apply this transformation (lower-casing the text) as well.

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

In [9]:
# here is some example code showing how to use the tokenizer.
# note that the tokenizer adds a [CLS] token at the start and a [SEP] token at the end.
# these mark the start and end of the input, and are what BERT models expect.

encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")

tokenizer.decode(encoded_input["input_ids"])

'[CLS] do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'

In [10]:
# and for batches of sentences:
# we can see that the encoded inputs are tensors
# the input_ids are the id numbers of the words in the vocabulary. 
#    For example, "but" is word ID 2021      (CLS is 101)
# the options:
#    we pad short sentences and truncate too-long ones (limit determined by specific model)
#    and we ask it to return tensors that `pytorch` (pt) expects. 
#       (huggingface works with other frameworks besides pytorch too)

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_inputs = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_inputs)

{'input_ids': tensor([[  101,  2021,  2054,  2055,  2117,  6350,  1029,   102,     0,     0,
             0,     0,     0,     0],
        [  101,  2123,  1005,  1056,  2228,  2002,  4282,  2055,  2117,  6350,
          1010, 28315,  1012,   102],
        [  101,  2054,  2055,  5408, 14625,  1029,   102,     0,     0,     0,
             0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])}
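To make the `padding=True` behaviour concrete, here is a minimal pure-Python sketch of what padding and the attention mask amount to. It is an illustration only, not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, and the token IDs are copied from the output above with the trailing zeros removed.

```python
# Token IDs copied from the tokenizer output above, with the padding zeros removed.
unpadded = [
    [101, 2021, 2054, 2055, 2117, 6350, 1029, 102],
    [101, 2123, 1005, 1056, 2228, 2002, 4282, 2055, 2117, 6350, 1010, 28315, 1012, 102],
    [101, 2054, 2055, 5408, 14625, 1029, 102],
]
PAD_ID = 0  # the ID used for padding positions in the output above


def pad_batch(sequences, pad_id=PAD_ID):
    """Hypothetical helper: pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(unpadded)
for ids, mask in zip(input_ids, attention_mask):
    print(ids)
    print(mask)
```

The printed rows should line up with the `input_ids` and `attention_mask` tensors shown above: every sentence is padded to the length of the longest one (14 tokens), and the mask is 0 exactly where padding was added.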


For data, we'll use one of the existing datasets prepared by `huggingface`, loaded through the `datasets` library.

We'll use a dataset of Yelp Reviews, which is a **5-class** sentiment classification task (1-star to 5-stars). 


We'll use just 300 of those reviews for training (enough to get a sense of how it's done without killing your laptops).

- The full dataset is 650,000 in the training set and 50,000 in the evaluation set.
- With 300 training examples, each epoch of training takes about 7 mins on the instructor's laptop.
- With 1000 training examples, each epoch of training takes about 45 mins on the instructor's laptop.
- In general, the model's performance should improve with more training examples and more training epochs.

Normally deep learning projects should be run on a cluster rather than on your own computer, but here we'll keep it simple (and small!) so you can learn the basics.


In [11]:
from datasets import load_dataset 

dataset = load_dataset("yelp_review_full")
dataset["train"][100]

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, return_tensors="pt")


tokenized_datasets = dataset.map(tokenize_function, batched=True)

# These steps are necessary to convert the dataset into something that pytorch expects
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

Found cached dataset parquet (/Users/ruthcarter/.cache/huggingface/datasets/parquet/yelp_review_full-9c7006f5a2e02666/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


  0%|          | 0/2 [00:00<?, ?it/s]

Loading cached processed dataset at /Users/ruthcarter/.cache/huggingface/datasets/parquet/yelp_review_full-9c7006f5a2e02666/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-e585686224d0228d.arrow


Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [12]:
SAMPLE_SET_SIZE = 300

# creating a small, 300 example dataset to train and evaluate our model
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(SAMPLE_SET_SIZE))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(SAMPLE_SET_SIZE))


# using DataLoader to prep the training and evaluation datasets into something torch expects
from torch.utils.data import DataLoader

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

Loading cached shuffled indices for dataset at /Users/ruthcarter/.cache/huggingface/datasets/parquet/yelp_review_full-9c7006f5a2e02666/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-2e4d55fe882549ff.arrow
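The `shuffle(seed=42).select(range(SAMPLE_SET_SIZE))` pattern above draws a random but reproducible subset. The sketch below illustrates the same idea in plain Python. It is an analogy only: `sample_indices` is a hypothetical helper, and because the `datasets` library uses its own random number generator, the indices it picks will differ from the ones here.

```python
import random


def sample_indices(dataset_size, sample_size, seed=42):
    """Shuffle all row indices with a fixed seed, then keep the first `sample_size`."""
    indices = list(range(dataset_size))
    random.Random(seed).shuffle(indices)
    return indices[:sample_size]


TRAIN_SIZE = 650_000   # size of the full training set, as stated above
SAMPLE_SET_SIZE = 300

first_draw = sample_indices(TRAIN_SIZE, SAMPLE_SET_SIZE)
second_draw = sample_indices(TRAIN_SIZE, SAMPLE_SET_SIZE)

print(first_draw == second_draw)   # True: the same seed gives the same subset
print(len(set(first_draw)))        # 300: no review is selected twice
```

Fixing the seed means that everyone who runs the notebook trains and evaluates on the same 300 reviews, which makes results easier to compare.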


## Define Model

Next, we define our model, which we want to be a Sequence Classification model built on top of a `bert-base-uncased` pre-trained model. The `AutoModelForSequenceClassification` class from the `transformers` library lets us specify this easily, using several default choices (e.g., the number of hidden units; these can be modified).

Note that we are defining a model with **5 labels**. (This is because the Yelp Review dataset we are using has labels of 1-5 stars.) 

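Note that the warning that shows up says two things. First, some weights in the checkpoint were not used: these are the `cls.predictions` and `cls.seq_relationship` layers, i.e., the output heads used for BERT's pre-training tasks, which a sequence classifier does not need. Second, some weights were not initialized from the checkpoint: this is the new classification output layer, which is newly initialized and will be trained during fine-tuning. As the warning itself says, "This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model)."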

In [13]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Set up training / fine-tuning parameters

Just as in the tutorial, we define our optimizer, the learning rate, and the number of training epochs.

In [14]:
from torch.optim import AdamW
from transformers import get_scheduler

optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 2
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

# if you have a GPU, this sends the model to the GPU; otherwise it uses the CPU.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,
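To see what the training setup above implies numerically, here is a small pure-Python sketch of the step count and of a linear learning-rate decay with no warmup. It is a sketch under stated assumptions: `linear_lr` is a hypothetical stand-in written for illustration, not the `transformers` implementation, and the constants are copied from the cells above.

```python
import math

SAMPLE_SET_SIZE = 300   # training examples, from the data cell above
BATCH_SIZE = 8          # batch_size passed to the DataLoader
NUM_EPOCHS = 2
BASE_LR = 5e-5          # lr passed to AdamW

# The DataLoader keeps the final, partially filled batch, so we round up.
steps_per_epoch = math.ceil(SAMPLE_SET_SIZE / BATCH_SIZE)
num_training_steps = NUM_EPOCHS * steps_per_epoch


def linear_lr(step, base_lr=BASE_LR, total=num_training_steps, warmup=0):
    """Hypothetical stand-in for a linear schedule: ramp up, then decay to zero."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))


print(steps_per_epoch, num_training_steps)
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step))
```

With 300 examples and a batch size of 8 there are 38 batches per epoch (the last one holds only 4 examples), so 2 epochs give 76 optimizer steps. The learning rate starts at 5e-5 and shrinks linearly to 0 by the final step.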

## Fine-tune the model

Then we run our training loop! The progress bar will update regularly with the time-taken so far and allow you to estimate the time remaining.

(The code above, by default, will train the model for 2 epochs. On the instructor's laptop, it takes ~7 mins per epoch. You may choose to run your training for longer, to get better results.)

The evaluation step also takes a couple of minutes.

In [15]:
from tqdm.auto import tqdm

# providing a progress bar
progress_bar = tqdm(range(num_training_steps))

model.train() # switch the model to "training mode". 
# one of the reasons is that this activates dropout during training, 
# which is a regularization technique that helps the model avoid overfitting.

# training loop
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/76 [00:00<?, ?it/s]

In [17]:
import evaluate

metric = evaluate.load("accuracy")
model.eval() # switch the model to "evaluation mode". 
# this deactivates dropout during evaluation.

for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.5}
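As a reminder of what the accuracy metric measures, here is a minimal sketch of the calculation in plain Python. The `accuracy` function and the toy predictions are hypothetical and written only for illustration; they are not the `evaluate` library's code, and the toy values are not model outputs.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical toy values, NOT taken from the model above.
toy_predictions = [0, 3, 3, 4, 1]
toy_references = [0, 3, 2, 4, 4]
toy_accuracy = accuracy(toy_predictions, toy_references)

# Baseline for comparison: guessing uniformly at random among 5 classes.
chance_level = 1 / 5

print(toy_accuracy, chance_level)
```

When judging an accuracy value, it helps to compare it with a simple baseline such as random guessing, which for 5 equally likely classes would be right about 20% of the time.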

In [26]:
logits

tensor([[ 1.2818,  0.0533, -0.4202, -0.6646, -0.7030],
        [-1.1152, -0.0643,  0.1303,  0.5088,  0.3549],
        [-0.6903,  0.1139, -0.0249,  0.5008,  0.3043],
        [-1.2655, -0.3152, -0.0968,  0.6727,  0.6639]])
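The logits above are unnormalized scores, one per class, for the 4 reviews in the last evaluation batch. The sketch below, in plain Python rather than `torch`, shows how a softmax turns each row into probabilities and how taking the largest logit gives the predicted class. The values are copied from the output above; mapping class index 0-4 to 1-5 stars is an assumption about how this dataset's labels are encoded.

```python
import math

# Logits copied from the output above (one row per review, one column per class).
logits = [
    [1.2818, 0.0533, -0.4202, -0.6646, -0.7030],
    [-1.1152, -0.0643, 0.1303, 0.5088, 0.3549],
    [-0.6903, 0.1139, -0.0249, 0.5008, 0.3043],
    [-1.2655, -0.3152, -0.0968, 0.6727, 0.6639],
]


def softmax(row):
    """Convert a row of logits into probabilities that sum to 1."""
    largest = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = [softmax(row) for row in logits]
# argmax: the index of the largest logit in each row
predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]

for row_probs, cls in zip(probabilities, predicted):
    # Assumption: class index 0 means 1 star, ..., index 4 means 5 stars.
    print(f"predicted {cls + 1} star(s), probability {row_probs[cls]:.3f}")
```

Because softmax preserves the ordering of the logits, the class with the largest logit is also the class with the highest probability. The probabilities are mainly useful for seeing how confident the model is; in the last row, for instance, the two highest logits are nearly tied.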

Is the accuracy good?

### Your answer:

- An accuracy of 50% does not seem that impressive: the model predicts incorrectly about half of the time. That said, it is well above the roughly 20% we would expect from guessing at random among 5 classes.

## Manually evaluating the model

Ok now that we have a fine-tuned model, let's see how well it does. This model is fine-tuned on Yelp reviews. Can you pull a couple of product reviews (from the web; it does not have to be Yelp, please try some websites that you may be familiar with), or other types of text, feed them into the model, and then see what the output says? 

Try at least ten different reviews or texts, and include a short discussion of each. For example, some questions that you could ask:

- Does the model do well? 
- Is the model able to capture specific characteristics of each piece of text? 
- Do you agree with the model? If yes, why? If no, why not?

(You are not limited to these questions above, these are just examples. Try to ask your *own* questions about this model and the data!)


- Hint: try to get a variety of complex examples to test the model on. For example, we talked about aspect-based sentiment analysis in class (although this particular model we've trained isn't trained to separate different aspects). If you feed it a complex review with several different "parts" (that say, talk about different aspects), what does the model predict? Does it agree with your own intuitions?
- What about sarcasm? Can you find (or write) reviews that could be read as sarcastic by a human? Does the model think so too?
- What about other TYPES of reviews? Can the model generalize?
- What about other types of text? Can the model generalize?

Be creative and try to see how well this model does.

In [None]:
example_review = "UT is the best school ever"

## Task: 
## now that you have a trained model, 
## write some code to feed some hand-picked examples into the model, and get the output back out.
## see if the output class "agrees" with what you might expect.
## be sure to include a short discussion for each example





## Hint1: there is code in the assignment above to pass in example sentences.
##     you'll need to "encode" the sentences with the tokenizer
##     (convert the words into token IDs, returned as tensors)

## Hint2: then you'll need to call the model() function on the input. Something like model(**input).
##    (The model had to process the sentences in both the training and the evaluation steps...
##    Which part of the code contains the single line that you need?)

## Hint3: After getting the output of the model (... output = model(**input) ...)
##     you can get the logits using output.logits. For one example, this will give something like
##         tensor([ (logit of class 1), (logit of class 2), (logit of class 3), ... ])
##     i.e., the model is predicting an unnormalized score (logit) for each class;
##     a larger logit means the model considers that class more likely.
##     We can do something as simple as taking the class with the largest logit to be the label for that example
##    (There is an "argmax" function in the torch library which will be useful)


In [None]:
batch_reviews = [
    "

### Your answer:

In [29]:
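# Without return_tensors="pt" the tokenizer returns plain Python lists, and the model raises
# "AttributeError: 'list' object has no attribute 'size'". Asking for PyTorch tensors fixes this.
# .to(device) moves the inputs to the same device as the model.
model(**tokenizer("Hello, world! Howdy.", return_tensors="pt").to(device))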