# Homework 7: Fine-tuning Pre-trained Language Models

Due Date: July 17th, 2023 11:59PM ET

Total Points: 104 points

- **Overview**: In this assignment, we will explore how to take pre-trained language models and adapt them to do different natural language tasks. We'll cover:
   - Applying models to down-stream tasks without extra training (15 points)
   - Fine-tuning for sequence classification (20 points)
   - Fine-tuning for sequence pair classification (25 points)
   - Fine-tuning for token classification (40 points)
- **Deliverables**: This assignment has the following deliverables:
   - Code (this notebook) (Automatically Graded)
- **Grading**: We will use the auto-grading system called PennGrader. To complete the homework assignment, you should implement everything marked with `#TODO` and run every cell marked with a `#PennGrader` note.

Recommended Readings
- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. NAACL 2019.
- [Generalized Language Models](https://lilianweng.github.io/posts/2019-01-31-lm/) Lilian Weng. 2019
- [The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)](http://jalammar.github.io/illustrated-bert/) Jay Alammar. 2019


## To get started, make a copy of this colab notebook into your google drive!

Also make sure you're using the GPU by going to `Runtime` -> `Change Runtime Type` and making sure that `Hardware Accelerator` is set to `GPU`. This will save you a significant amount of time when training models.

In [1]:
!pip install -U datasets huggingface_hub fsspec

Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting huggingface_hub
  Downloading huggingface_hub-0.33.2-py3-none-any.whl.metadata (14 kB)
Collecting fsspec
  Downloading fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
Downloading huggingface_hub-0.33.2-py3-none-any.whl (515 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
Installing collected packages: fsspec, huggingface_hub, datasets
  Attempting uninstall: fsspec
    Found existing installati

In [2]:
%env CUBLAS_WORKSPACE_CONFIG=:4096:8

env: CUBLAS_WORKSPACE_CONFIG=:4096:8


In [3]:
%%capture
## DO NOT CHANGE ANYTHING, JUST RUN
!pip install penngrader-client

In [4]:
%%writefile notebook-config.yaml

grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'

Writing notebook-config.yaml


In [5]:
!cat notebook-config.yaml


grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'


In [6]:
from penngrader.grader import *

## TODO - Start
STUDENT_ID = 62502470 # YOUR PENN-ID GOES HERE AS AN INTEGER#
## TODO - End

SECRET = STUDENT_ID
grader = PennGrader('notebook-config.yaml', 'CIS5300_OL_23Su_HW7', STUDENT_ID, SECRET)

PennGrader initialized with Student ID: 62502470

Make sure this correct or we will not be able to store your grade


In [7]:
# check if the PennGrader is set up correctly
# do not change this cell, see if you get 4/4!
name_str = 'Rui Jiang'
grader.grade(test_case_id = 'name_test', answer = name_str)

Correct! You earned 4/4 points. You are a star!

Your submission has been successfully recorded in the gradebook.


In [11]:
# Comment this out if you want to enable huggingface warnings
import logging
logging.disable(logging.WARNING)

# This fixes colab's default encoding to match huggingface accelerate
import locale
locale.getpreferredencoding = lambda x=False: "UTF-8"

# Introduction: HuggingFace Transformers

In this homework we will be making use of the [HuggingFace](https://huggingface.co/) transformers library. This library provides support for downloading, running, and training language models and is an essential tool for professionals that want to deploy NLP technology.

We will be installing four huggingface libraries:

1. [Transformers](https://huggingface.co/docs/transformers/index) (Model inference): `!pip install transformers`
2. [Accelerate](https://huggingface.co/docs/accelerate/index) (Model training): `!pip install accelerate`
3. [Datasets](https://huggingface.co/docs/datasets/index) (Data processing): `!pip install datasets`
4. [Evaluate](https://huggingface.co/docs/evaluate/index) (Model evaluation): `!pip install evaluate`

We highly recommend that you refer to the documentation linked above frequently, as it will be a very valuable resource throughout this homework.

In [8]:
%%capture
!pip install datasets
!pip install transformers
!pip install --upgrade accelerate
!pip install evaluate

In [9]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForTokenClassification
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
import datasets
from datasets import load_dataset
from evaluate import evaluator
import evaluate
import numpy as np
import torch
import copy
from dill.source import getsource
from collections import defaultdict

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device_id = 0 if str(device) == 'cuda' else -1

In [10]:
print(device_id)

0


# Section 1: Zero-Shot Application of Pre-Trained Language Models (15 points)

In this homework we'll be exploring different ways of using pre-trained language models to accomplish natural language tasks. We will be using BERT as our pre-trained language model for this homework.

BERT was introduced in late 2018 by the landmark paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). It was trained via "Masked Language Modeling", a procedure that involves training the model to fill in `[MASK]` tokens in input sequences.
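To make the masking idea concrete, here is a minimal pure-Python sketch of how a masked training example could be built. It is a simplification under stated assumptions: tokens are plain whitespace-separated strings, the masked position is chosen by hand, and `mask_tokens` is a hypothetical helper, not part of any library. Actual BERT pre-training selects roughly 15% of positions at random and works on WordPiece tokens.

```python
# Simplified sketch of building a masked language modeling example.
# Assumption: tokens are plain strings and positions are picked by hand.
MASK = "[MASK]"

def mask_tokens(tokens, positions):
    """Return (masked_tokens, labels).

    labels holds the original token at each masked position and None elsewhere,
    so a model would only be scored on the positions it had to fill in.
    """
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i in positions:
        labels[i] = tokens[i]
        masked[i] = MASK
    return masked, labels

tokens = "I went to an Italian restaurant and ordered pasta .".split()
masked, labels = mask_tokens(tokens, positions=[4])
print(" ".join(masked))   # the model sees this
print(labels[4])          # and is trained to predict this
```

The model's input is the masked sequence, and the training target is the original token at each masked position.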

<!-- We will be using the "base" size BERT model (110 million parameters) and we will use the "cased" version (i.e. the model is trained to distinguish between upper and lower case letters).  -->

As you can see below, according to BERT, there is an 84.7% chance that the masked word is "Italian".

<!-- Note: We are running BERT via a [HuggingFace Pipeline](https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/pipelines) object. Here we are using a `fill-mask` pipeline but many others exist and we will be using them extensively in this assignment. While this is not the only way to query a model with HuggingFace, it is the most convenient. -->

In [12]:
# Download and query the bert-base-cased model
bert_model = pipeline('fill-mask', model='bert-base-cased', device=device_id)
print(bert_model("I went to an [MASK] restaurant and ordered pasta."))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

[{'score': 0.8473353981971741, 'token': 2169, 'token_str': 'Italian', 'sequence': 'I went to an Italian restaurant and ordered pasta.'}, {'score': 0.03583113104104996, 'token': 1890, 'token_str': 'Indian', 'sequence': 'I went to an Indian restaurant and ordered pasta.'}, {'score': 0.016542498022317886, 'token': 1385, 'token_str': 'old', 'sequence': 'I went to an old restaurant and ordered pasta.'}, {'score': 0.006616957951337099, 'token': 3427, 'token_str': 'empty', 'sequence': 'I went to an empty restaurant and ordered pasta.'}, {'score': 0.005878540687263012, 'token': 6210, 'token_str': 'Egyptian', 'sequence': 'I went to an Egyptian restaurant and ordered pasta.'}]


In [None]:
tmp = bert_model("I went to an [MASK] restaurant and ordered pasta.")

In [None]:
tmp[1]

{'score': 0.03583093360066414,
 'token': 1890,
 'token_str': 'Indian',
 'sequence': 'I went to an Indian restaurant and ordered pasta.'}

In *theory* we should be able to use BERT to do natural language tasks such as Question Answering, Language Identification, Part-of-Speech Tagging, and even Translation by formulating our task in the style of a fill-in-the-blank sentence.

In [None]:
# Language Identification
print(bert_model("I am currently speaking in the [MASK] language.")[0])

# Factual QA
print(bert_model("The Declaration of Independence was written in the year [MASK].")[0])

# Part of speech tagging
print(bert_model("The word run is a [MASK].")[0])

# Translation
print(bert_model("The French word amour translates to [MASK] in English.")[0])

{'score': 0.18785135447978973, 'token': 1483, 'token_str': 'English', 'sequence': 'I am currently speaking in the English language.'}
{'score': 0.10682458430528641, 'token': 14447, 'token_str': '1776', 'sequence': 'The Declaration of Independence was written in the year 1776.'}
{'score': 0.22859828174114227, 'token': 12464, 'token_str': 'verb', 'sequence': 'The word run is a verb.'}
{'score': 0.055166736245155334, 'token': 1567, 'token_str': 'love', 'sequence': 'The French word amour translates to love in English.'}


However, it turns out that BERT really isn't very good at doing these tasks without extra training (as you can see below). In Section 1 of this homework we'll evaluate BERT without extra training on sentiment analysis to establish a baseline for the base model. Then, in the next sections, we'll train the model and see how much we can improve the performance.

In [None]:
# Language Identification (failed)
print(bert_model("今[MASK]で話しています")[0])

# Factual QA (failed)
print(bert_model("The U.S.A. was founded in the year [MASK].")[0])

# Part of speech tagging (failed)
print(bert_model("The word golf is a [MASK].")[0])

# Translation (failed)
print(bert_model("The French word bonjour translates to [MASK] in English.")[0])

{'score': 0.5789163708686829, 'token': 100, 'token_str': '[UNK]', 'sequence': 'しています'}
{'score': 0.1736524999141693, 'token': 1196, 'token_str': 'before', 'sequence': 'The U. S. A. was founded in the year before.'}
{'score': 0.15804286301136017, 'token': 8155, 'token_str': 'joke', 'sequence': 'The word golf is a joke.'}
{'score': 0.016320660710334778, 'token': 6164, 'token_str': 'wolf', 'sequence': 'The French word bonjour translates to wolf in English.'}


### **Sequence Classification with pre-trained BERT**

In this section we'll be evaluating BERT for sentiment analysis without fine-tuning. This is purely for the purposes of demonstration as you'll be able to see the difference between what BERT does before and after fine-tuning.

To do a sequence classification task without fine-tuning BERT, you need two things:
1. A mask-filling template to append to the end of the sequence
2. A set of `targets` that consists of the class labels

Below you'll see an example of this procedure applied to sentiment analysis. For the sentence "I just saw a movie today and it was really great", BERT assigns a 2.78% chance to the `[MASK]` token being "positive" and a 1.65% chance to it being "negative", so we give the example a positive label.

In [None]:
bert_model("I just saw a movie today and it was really great. My opinion of the movie is [MASK].", targets=["positive", "negative"])

[{'score': 0.027819881215691566,
  'token': 3112,
  'token_str': 'positive',
  'sequence': 'I just saw a movie today and it was really great. My opinion of the movie is positive.'},
 {'score': 0.01648259349167347,
  'token': 4366,
  'token_str': 'negative',
  'sequence': 'I just saw a movie today and it was really great. My opinion of the movie is negative.'}]
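The decision rule described above is just an argmax over the target scores. The sketch below applies it to the two scores copied from the output above; `pick_label` and `renormalize` are hypothetical helper names introduced only for illustration. Renormalizing is optional, but it shows how the two small probabilities compare once every other vocabulary token is ignored.

```python
# Scores copied from the fill-mask output shown above.
results = [
    {"score": 0.027819881215691566, "token_str": "positive"},
    {"score": 0.01648259349167347, "token_str": "negative"},
]

def pick_label(results):
    """Return the target string with the highest [MASK] probability."""
    return max(results, key=lambda r: r["score"])["token_str"]

def renormalize(results):
    """Rescale the target scores so they sum to 1 across the targets only."""
    total = sum(r["score"] for r in results)
    return {r["token_str"]: r["score"] / total for r in results}

print(pick_label(results))
print(renormalize(results))
```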

Let's test BERT's ability to recognize sentiment using the [Yelp Reviews](https://huggingface.co/datasets/yelp_review_full) dataset. This dataset consists of 650,000 training reviews and 50,000 test reviews, each with its user-assigned star rating. The task is to determine how many stars (1 to 5) the user gave, given the text of their review. Note that the dataset's `label` column stores the rating as an integer from 0 to 4, where 0 corresponds to 1 star.

In [None]:
# Download the yelp dataset from huggingface
dataset = load_dataset("yelp_review_full")

README.md: 0.00B [00:00, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/299M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/23.5M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/650000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/50000 [00:00<?, ? examples/s]

### **Dataset Processing** (5 points)

There are many functions we can use to process a HuggingFace dataset object. One of the most useful is the [Map](https://huggingface.co/docs/datasets/process#map) function. This function applies some function `f` to every row of the dataset where `f` takes in a row of the dataset and returns a dictionary containing the new columns of the data (or any edits to existing columns). For example, given a dataset with a column `binary_label` that contains a binary label (0 or 1) you can create a new column with the opposite label by writing either one of the following functions:
```
def reverse_label(row):
  return {'reverse_label': int(not row["binary_label"])}

def reverse_label(row):
  row['reverse_label'] = int(not row["binary_label"])
  return row
```
and then calling map on the dataset with the function
```
new_data = dataset.map(reverse_label)
```
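As a rough mental model of what mapping does with the returned dictionary, the pure-Python sketch below merges each returned dictionary into a copy of the row. `toy_map` is a hypothetical stand-in written for this illustration, not the library's implementation, and it works on a plain list of dictionaries rather than a HuggingFace dataset.

```python
def toy_map(rows, f):
    """Apply f to every row and merge the returned dict into a copy of that row."""
    return [{**row, **f(dict(row))} for row in rows]

def reverse_label(row):
    # Returning only the new column is enough; existing columns are kept.
    return {"reverse_label": int(not row["binary_label"])}

rows = [{"binary_label": 0}, {"binary_label": 1}]
new_rows = toy_map(rows, reverse_label)
print(new_rows)
```

Because the returned dictionary is merged into the row, returning just the new column and returning the whole edited row give the same result in this sketch.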

### **Concatenating the Mask-Fill Template** (5 points)
Your first task will be to concatenate the mask-fill template onto every item in the `text` column of the dataset and add the output to the dataset as a new column named `input`.

Refer back to the documentation for the [Map](https://huggingface.co/docs/datasets/process#map) function if you're stuck.

In [None]:
def concatenate_mask(row):
  '''
    Pseudocode:
        1. Concatenate this template string to the 'text' field of the input row.
        2. Add the resulting concatenated string as a new field 'input' in the row dictionary.

    Input:
        row: A dictionary representing a single row of the dataset.
             It contains at least the key "text" which holds a string of text.

    Returns:
        The input row dictionary, but with an added key "input" that contains the text
        with the template appended.
  '''
  # TODO: Concatenate the template to each example and return in a new column "input"
  template = " I give it a score of [MASK] out of 5."
  tmp = row['text'] + template
  row['input'] = tmp
  return row

test_data = dataset['test'].map(concatenate_mask)

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [None]:
test_data[0]

{'label': 0,
 'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!',
 'input': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAI

In [None]:
test_data

Dataset({
    features: ['label', 'text', 'input'],
    num_rows: 50000
})

In [None]:
concatenate_mask

In [None]:
grader.grade(test_case_id = 'test_concat_mask', answer = getsource(concatenate_mask))

Correct! You earned 5/5 points. You are a star!

Your submission has been successfully recorded in the gradebook.


### **Intro to the BERT Tokenizer**
Before taking in a sequence of text, language models need to break up the sequence into tokens. Since BERT uses the [Transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) architecture with a fixed number of position embeddings, it can process at most 512 tokens at any one time, and that count includes the special `[CLS]` and `[SEP]` tokens added by the tokenizer. If you pass in a sequence of text longer than 512 tokens, BERT will throw an error:

In [None]:
# Giving BERT a piece of text with more than 512 tokens produces an error
try:
  bert_model("Hello " * 512 + " my name is [MASK]")
except RuntimeError as e:
  print(e)

The size of tensor a (518) must match the size of tensor b (512) at non-singleton dimension 1
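The 518 in the error message can be reproduced by hand. The sketch below counts whitespace-separated words and adds the two special tokens; it assumes each word in this particular string maps to exactly one token, which is consistent with the message above but does not hold for arbitrary text.

```python
# Rough token count for the string passed to BERT above.
# Assumption: every whitespace-separated word here is a single token.
text = "Hello " * 512 + " my name is [MASK]"
words = text.split()      # 512 x "Hello", then "my", "name", "is", "[MASK]"
special_tokens = 2        # [CLS] at the start and [SEP] at the end
estimated = len(words) + special_tokens
print(estimated)  # 518
```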


### **Filtering the data for BERT's max length** (5 points)
Below we have provided code to load the BERT tokenizer. To run the tokenizer you can call it on a piece of text like `tokenizer("Hello students")`. This will return a dictionary with three arrays, `input_ids`, `attention_mask` and `token_type_ids`. The length of all three arrays should be equal to the total number of tokens in the sequence.

Since we need to make sure that no example in our input is longer than 512 tokens, your assignment now is to filter the dataset to only include examples that have 512 or fewer tokens. You can do this using the [Filter](https://huggingface.co/docs/datasets/process#select-and-filter) function. It works like Map, except that the function you pass in returns a boolean, where `True` indicates we should keep the row and `False` indicates we should remove it.

NOTE: The maximum sequence length of any model can be found through the `tokenizer.model_max_length` variable. If you dislike magic numbers in your code, you can use this instead of hard-coding 512.
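The keep-or-drop logic can be sketched without any library. Everything in the block below is a hypothetical stand-in: `toy_filter` mimics filtering over a plain list of dictionaries, `toy_tokenize` splits on whitespace instead of using WordPiece, and `MAX_LEN` plays the role of `tokenizer.model_max_length` with a deliberately tiny value.

```python
MAX_LEN = 8  # hypothetical stand-in for tokenizer.model_max_length

def toy_tokenize(text):
    """Whitespace tokenizer that adds the two special tokens, like BERT does."""
    return ["[CLS]"] + text.split() + ["[SEP]"]

def short_enough(row):
    # True keeps the row, False drops it.
    return len(toy_tokenize(row["input"])) <= MAX_LEN

def toy_filter(rows, keep):
    """Keep only the rows for which keep(row) is True."""
    return [row for row in rows if keep(row)]

rows = [{"input": "great food"}, {"input": "a " * 20}]
kept = toy_filter(rows, short_enough)
print(len(kept), "of", len(rows), "rows kept")
```

Comparing against a named limit instead of a literal number is the same idea as using `tokenizer.model_max_length` in place of a hard-coded 512.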

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def filter_for_max_length(row):
  '''
    Pseudocode:
        1. Tokenize the "input" field in the row using the BERT tokenizer.
        2. Get the number of tokens in the tokenized input.
        3. Compare the number of tokens to the maximum length allowed by the BERT model.
        4. Return True if the number of tokens is less than or equal to the maximum length, otherwise return False.

    Input:
        row: A dictionary representing a single row of the dataset.
              It contains at least the key "input" which holds a string of text.

    Returns:
        A boolean value indicating whether the length of the tokenized "input" is
        less than or equal to the maximum length allowed by the BERT model.
  '''

  # TODO: Return True if the length of the input (in tokens) is less than or equal
  #        to the maximum length allowed by the BERT model
  # The tokenizer returns input_ids, token_type_ids, and attention_mask,
  # all of the same length, so counting input_ids is enough.
  num_tokens = len(tokenizer(row["input"])["input_ids"])
  return num_tokens <= tokenizer.model_max_length

filtered_data = test_data.filter(filter_for_max_length)

Filter:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [None]:
test_data[0]

{'label': 0,
 'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!',
 'input': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAI

In [None]:
dataset['test'][0]

{'label': 0,
 'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'}

In [None]:
tokens = tokenizer(test_data[2]['input'])
tokens

{'input_ids': [101, 1398, 146, 1169, 1474, 1110, 1103, 4997, 106, 1284, 1127, 1103, 1178, 123, 1234, 1107, 1103, 1282, 1111, 5953, 117, 1103, 1282, 1108, 13543, 1105, 8243, 1114, 4067, 12967, 106, 123, 25377, 117, 170, 188, 2528, 26950, 117, 1105, 1126, 4828, 9303, 11116, 1181, 1103, 7659, 1395, 119, 138, 3489, 4890, 1114, 22024, 117, 17393, 1183, 3602, 12668, 188, 3263, 13780, 1155, 1166, 1122, 1110, 1175, 1111, 1240, 22345, 119, 165, 183, 165, 183, 2346, 2149, 2094, 1338, 119, 119, 119, 1185, 1447, 1106, 3668, 117, 1185, 5679, 117, 5143, 4143, 2094, 119, 2096, 1736, 1157, 2504, 117, 1198, 1176, 1103, 1395, 117, 146, 1309, 1261, 1139, 6227, 1228, 106, 1109, 7463, 1132, 1315, 1353, 117, 1128, 2094, 16412, 1116, 1166, 2135, 1199, 3533, 118, 4044, 7072, 1112, 1128, 3465, 1107, 1240, 2423, 5624, 1149, 12960, 1946, 119, 1109, 15688, 1185, 27267, 1127, 1149, 1104, 170, 2884, 1105, 13392, 117, 1103, 23982, 1108, 182, 13148, 1183, 117, 1103, 15688, 7738, 1108, 3999, 3431, 119, 165, 183, 165, 

In [None]:
input_ids = tokens["input_ids"]

In [None]:
filtered_data[4]

{'label': 0,
 'text': "Food was NOT GOOD at all! My husband & I ate here a couple weeks ago for the first time. I ordered a salad & basil pesto cream pasta & my husband ordered the spinach & feta pasta. The salad was just a huge plate of spring mix (nothing else in it) with WAY to much vinegar dressing. My lettuce was drowning in the vinegar. My pesto pasta had no flavor (did not taste like a cream sauce to me) & the pesto was so runny/watery & way too much sauce not enough noodles. My husband's pasta had even less flavor than mine. We ate about a quarter of the food & couldn't even finish it. We took it home & it was so bad I didn't even eat my leftovers. And I hate wasting food!! Plus the prices are expensive for the amount of food you get & of course the poor quality. Don't waste your time eating here. There are much better Italian restaurants in Pittsburgh.",
 'input': "Food was NOT GOOD at all! My husband & I ate here a couple weeks ago for the first time. I ordered a salad & basi

In [None]:
test_data


Dataset({
    features: ['label', 'text', 'input'],
    num_rows: 50000
})

In [None]:
grader.grade(test_case_id = 'test_filtered_length', answer = len(filtered_data))

Correct! You earned 5/5 points. You are a star!

Your submission has been successfully recorded in the gradebook.


Here we'll be randomly selecting 100 examples from our filtered dataset for testing purposes. We'll be doing this throughout this assignment both for training and testing to ensure runtime stays fast. In the real world, given unlimited time, we would ideally like to use the full datasets.

In [None]:
# Randomly sample 100 examples from the dataset (do not change!)
sampled_data = filtered_data.shuffle(seed=42).select(range(100))

In [None]:
sampled_data[0]

{'label': 4,
 'text': "Probably has the best ddukboki on campus! It's made in such a way that you still want to pay for it 'cause even though you can make your own version at home, their version is that much better. Also, they get the rice right every time (it's the simple things that add up).",
 'input': "Probably has the best ddukboki on campus! It's made in such a way that you still want to pay for it 'cause even though you can make your own version at home, their version is that much better. Also, they get the rice right every time (it's the simple things that add up). I give it a score of [MASK] out of 5."}

### **Apply BERT to each example** (5 points)
As your final task for this section you must write a function that runs the BERT model on every sequence of text in the dataset's `input` column and outputs the most likely integer star rating (1-5) as predicted by BERT in a new column named `score`.

Refer back to the earlier parts of this Section for how to use BERT and make sure to use the `targets` field to specify the five possible star ratings `[1, 2, 3, 4, 5]`.

In [None]:
def apply_bert(row):
  '''
  Pseudocode:
      1. Use the BERT model to compute the probabilities for each possible score (1 to 5).
      2. Identify the score with the highest probability.
      3. Return the most likely score in a new dictionary with the key "score".

  Input:
      row: A dictionary representing a single row of the dataset.
            It contains at least the key "input" which holds a string of text.

  Returns:
      A dictionary with the key "score" holding the most likely score (1 to 5) as an integer.
  '''
  # TODO: Compute the probabilities for each score and return the value of the most likely score
  bert_results = bert_model(row["input"], targets=["1", "2", "3", "4", "5"])
  scores_dict = {}
  for item in bert_results:
    scores_dict[item['token_str']] = item['score']
  argmax_key = max(scores_dict, key=scores_dict.get)
  return {'score': int(argmax_key)}

predicted_data = sampled_data.map(apply_bert)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [None]:
predicted_data[9]

{'label': 4,
 'text': 'My sister and I visited this store while on vacation and we both had a wonderful experience! Curtis helped us and he was amazing! He was very friendly and helpful and explained the materials to us and also how the garments should fit. We both walked out of the store with a couple of items and a great recommendation to a wonderful cafe nearby that sells amazing vegan desserts.\\n\\nI signed up for a yelp account just to post this review for Curtis. If you are reading this, thanks again!',
 'input': 'My sister and I visited this store while on vacation and we both had a wonderful experience! Curtis helped us and he was amazing! He was very friendly and helpful and explained the materials to us and also how the garments should fit. We both walked out of the store with a couple of items and a great recommendation to a wonderful cafe nearby that sells amazing vegan desserts.\\n\\nI signed up for a yelp account just to post this review for Curtis. If you are reading th

In [None]:
grader.grade(test_case_id = 'test_predicted_data', answer = predicted_data['score'])

Correct! You earned 5/5 points. You are a star!

Your submission has been successfully recorded in the gradebook.


### **Evaluating the Model**

Below we've written code to evaluate the model's accuracy. If you've done everything correctly you should get somewhere around 24% accuracy. Given that this task is five-way classification, where random guessing would score about 20%, this is only *barely* better than random chance! Clearly BERT cannot do sentiment analysis without extra training. Let's see if fine-tuning will help.



In [None]:
def accuracy(outputs, reference):
  return sum([o == r for o, r in zip(outputs, reference)]) / len(reference)

print(accuracy(predicted_data["score"], predicted_data["label"]))

0.24


# Section 2: Training BERT for Sentiment Analysis (20 points)

In this section we'll be training BERT for the same Sentiment Analysis task from Section 1. We'll start by giving a short introduction to the concept of fine-tuning.

### **What is fine-tuning?**

Fine-tuning is the process of further training a pre-trained model to accomplish a particular down-stream task. This is typically done by swapping out the pre-trained model's last layer (its head) for a new randomly initialized head and training both model and head.

### **Model "heads"**

The final output layer of a language model is typically called a "head". The standard head used by models when pre-training is called a "language modeling head". This is a dense linear layer that projects the $D_{enc}$-dimensional encoding of each of the $L_{context}$ tokens to a probability distribution over the vocabulary. A single transformation matrix is learned and applied to all context tokens, which are treated as a batch dimension. The total size of this layer is thus $D_{enc} \times |V|$.
<!-- The head and the network are always trained *together*. We compute the cross-entropy loss $L = -\log(P(w))$ using the output of the head and backpropagate through the head to the rest of the network. -->

### **Adding on a "Classification head"**

In Section 1 we used BERT's language modeling head to perform the Sentiment Analysis task. While we *can* train the model with its original LM head, this is unnecessary, as the LM head outputs probabilities over all tokens rather than just our targets. Thus, in this section we're going to remove the language modeling head and replace it with a classification head.

A classification head is very similar to a language modeling head. It is a dense linear layer that projects a $D_{enc}$-dimensional encoding to a probability distribution over the *$n$ classes* rather than the whole vocabulary. For sequence classification, the layer is applied once per sequence, to a single pooled encoding (for BERT, one derived from the `[CLS]` token); for token classification, the same matrix is applied to each of the $L_{context}$ tokens, which are again treated as a batch dimension. The total size of this layer is thus $D_{enc} \times n$, which makes it significantly more efficient to train.
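To get a feel for the size difference between the two heads, the short calculation below plugs in numbers for `bert-base-cased`. The hidden size (768) and vocabulary size (28,996) are taken from the published model configuration and should be treated as assumptions here; the count covers only the projection matrix described above and ignores bias terms and any other layers in the actual head.

```python
# Sizes assumed from the bert-base-cased configuration.
D_ENC = 768        # encoding (hidden) size
VOCAB = 28996      # vocabulary size |V|
N_CLASSES = 5      # star ratings 1 to 5

lm_head_weights = D_ENC * VOCAB       # D_enc x |V|
cls_head_weights = D_ENC * N_CLASSES  # D_enc x n

print(f"language modeling head: {lm_head_weights:,} weights")
print(f"classification head:    {cls_head_weights:,} weights")
print(f"ratio: about {lm_head_weights // cls_head_weights}x smaller")
```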

### **Using HuggingFace to swap heads**

Loading a model with a particular head is easy using HuggingFace. All you do is load the model as an instance of a given class. The classes we're using in this homework are as follows:
1. [AutoModelForMaskedLM](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForMaskedLM) - Language modeling Head
2. [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSequenceClassification) - Classification Head
3. [AutoModelForTokenClassification](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForTokenClassification) - Token Classification Head

Feel free to read the [documentation](https://huggingface.co/docs/transformers/model_doc/auto#auto-classes) for a list of other possible heads.

In [None]:
# Load in the bert-base-cased model with a classification head (num_labels = number of classes)
classification_head_bert = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

### **Training the Model + Classification head**

Let's now train the model + classification head for sentiment analysis. We will use the HuggingFace [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class. We need to:
1. Tokenize and process the dataset
2. Create a `compute_metrics` function
3. Initialize the `Trainer` and specify the `TrainingArguments`

### **Step 1: Tokenize and Process the dataset** (5 points)

For this section you must implement the `tokenize_function`. This function should run the BERT tokenizer on every example in the dataset and add the output fields `input_ids`, `attention_mask`, and `token_type_ids` as new columns in the dataset.

This function should also pad all sequences shorter than 512 tokens to exactly 512 tokens using the special `[PAD]` token, and truncate sequences longer than 512 tokens to exactly 512. Fixed-length inputs help GPU parallelization, and both behaviors can be enabled by passing certain arguments to the tokenizer (check the [documentation](https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__)!)
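
As an illustration of what padding and truncation do to a sequence of ids, here is a small pure-Python sketch. It does not call the HuggingFace tokenizer; `pad_or_truncate` is a hypothetical helper, and the ids 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` ids visible in the tokenizer outputs in this notebook.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids seen in this notebook's outputs

def pad_or_truncate(input_ids, max_length):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids)
    if len(ids) > max_length:
        # Keep the first max_length - 1 ids and re-append [SEP] at the end.
        ids = ids[:max_length - 1] + [SEP_ID]
    attention_mask = [1] * len(ids)      # 1 = real token
    n_pad = max_length - len(ids)
    # 0 in the attention mask marks padding that the model should ignore.
    return ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

short_seq = [CLS_ID, 2809, 1699, SEP_ID]
long_seq = [CLS_ID] + list(range(1000, 1010)) + [SEP_ID]

short_ids, short_mask = pad_or_truncate(short_seq, max_length=8)
long_ids, long_mask = pad_or_truncate(long_seq, max_length=8)
```

In the assignment itself the tokenizer does this work when given the appropriate padding and truncation arguments, with a maximum length of 512 rather than 8.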

In [None]:
# Step 1: Tokenize and Process the dataset
dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(row):
  '''
  Pseudocode:
      1. Use the BERT tokenizer to tokenize the text in the "text" field of the input row.
      2. Ensure that the tokenized output is padded to the maximum length.
      3. Ensure that the tokenized output is truncated to the maximum length if it exceeds it.
      4. Return the tokenized output.

  Input:
      row: A dictionary representing a single row of the dataset.
            It contains at least the key "text" which holds a string of text.

  Returns:
      A dictionary with the tokenized output including keys "input_ids", "attention_mask", and "token_type_ids".
  '''
  # TODO: Tokenize the dataset (Remember to pad and truncate to max length)
  result = tokenizer(row["text"], padding="max_length", truncation=True, max_length=512)
  return result

# Randomly select 1000 examples from the train and test data and tokenize - do not change!
# (Note: We are subsampling our data just so that training doesn't take too long)
train_data = dataset["train"].shuffle(seed=42).select(range(1000)).map(tokenize_function)
eval_data = dataset["test"].shuffle(seed=42).select(range(1000)).map(tokenize_function)

In [None]:
dataset['train'][0]

{'label': 4,
 'text': "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."}

In [None]:
tokenizer(dataset["train"][0]["text"], padding="max_length", truncation=True, max_length=512)

{'input_ids': [101, 173, 1197, 119, 2284, 2953, 3272, 1917, 178, 1440, 1111, 1107, 170, 1704, 22351, 119, 1119, 112, 188, 3505, 1105, 3123, 1106, 2037, 1106, 1443, 1217, 10063, 4404, 132, 1119, 112, 188, 1579, 1113, 1159, 1107, 3195, 1117, 4420, 132, 1119, 112, 188, 6559, 1114, 170, 1499, 118, 23555, 2704, 113, 183, 9379, 114, 1134, 1139, 2153, 1138, 3716, 1106, 1143, 1110, 1304, 1696, 1107, 1692, 1380, 5940, 1105, 1128, 1444, 6059, 132, 1105, 1128, 1169, 1243, 5991, 16179, 1106, 1267, 18137, 1443, 1515, 1106, 1267, 1140, 1148, 119, 1541, 117, 1184, 1167, 1202, 1128, 1444, 136, 178, 112, 182, 2807, 1303, 1774, 1106, 1341, 1104, 1251, 11344, 178, 1138, 1164, 1140, 117, 1133, 178, 112, 182, 1541, 4619, 170, 9153, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [None]:
grader.grade(test_case_id = 'test_tokenized_data', answer = train_data[:500])

Correct! You earned 5/5 points. You are a star!

Your submission has been successfully recorded in the gradebook.


### **Step 2: Define a `compute_metrics` function** (10 points)

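For this function you'll be calculating accuracy between the predictions and true labels. The `probabilities` variable is a 2D numpy array of size ($n$ x $5$) containing one score per class label for each of the $n$ examples in the input dataset. (Strictly speaking, the `Trainer` passes the model's raw logits rather than normalized probabilities, but the highest logit and the highest probability always fall on the same class.)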

You must write code to select the index with the highest probability for each row in the `probabilities` array then calculate the accuracy of the model with respect to the ground truth label. (HINT: the [np.argmax](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html) function is really nice for this)
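
The following toy example walks through the arithmetic by hand in plain Python. The four rows of scores and the labels are made up for illustration, and the `argmax` helper stands in for `np.argmax`.

```python
def argmax(row):
    """Index of the largest value in row (first index wins on ties)."""
    return max(range(len(row)), key=lambda i: row[i])

probabilities = [
    [0.05, 0.05, 0.10, 0.20, 0.60],  # highest score at index 4
    [0.70, 0.10, 0.10, 0.05, 0.05],  # highest score at index 0
    [0.10, 0.20, 0.40, 0.20, 0.10],  # highest score at index 2
    [0.10, 0.50, 0.20, 0.10, 0.10],  # highest score at index 1
]
labels = [4, 0, 3, 1]

predicted = [argmax(row) for row in probabilities]
# Accuracy is the fraction of rows where the prediction matches the label.
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
```

Three of the four predictions match their labels, so the accuracy here is 0.75.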

In [None]:
# Step 2: Define a compute_metrics function
def compute_metrics(eval_pred):
  '''
  Pseudocode:
      1. Extract the predicted probabilities and the true labels from the evaluation predictions.
      2. Compute the predicted labels by taking the argmax of the probabilities along the last axis.
      3. Calculate the accuracy by comparing the predicted labels with the true labels.
      4. Return a dictionary containing the accuracy.

  Input:
      eval_pred: A tuple (probabilities, labels)
                  probabilities: A 2D numpy array of shape (num_examples, num_classes) representing the predicted probabilities for each class.
                  labels: A 1D numpy array of shape (num_examples,) representing the true labels.

  Returns:
      A dictionary with the key "accuracy" and the value being the calculated accuracy.
  '''
  # Get the true labels and predicted probabilities
  probabilities, labels = eval_pred

  # TODO: compute accuracy between predictions & true labels
  predicted_labels = np.argmax(probabilities, axis=1)
  accuracy = np.mean(predicted_labels == labels)
  return {'accuracy': accuracy}

In [None]:
grader.grade(test_case_id = 'test_compute_metrics', answer = getsource(compute_metrics))

Correct! You earned 10/10 points. You are a star!

Your submission has been successfully recorded in the gradebook.


### **Step 3: Initialize the `Trainer` and Specify `TrainingArguments`**

Here we specify various arguments to the trainer. We evaluate and log once per epoch (i.e. after every pass through the full training set), and we specify the output directory for checkpoints, the learning rate (`5e-05`), the number of epochs (`3`), and the batch size used for each gradient update (`8`). In addition, we set `full_determinism` for the sake of grading consistency.

In the `Trainer` we specify our model as the BERT model we loaded earlier, pass in our training arguments and the subsampled train and evaluation datasets, and finally pass in our `compute_metrics` function.
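
As a quick worked calculation, the settings above imply the following number of gradient updates. This sketch assumes a single GPU and no gradient accumulation, neither of which is stated explicitly in the notebook.

```python
import math

num_examples = 1000   # size of the subsampled training set
batch_size = 8        # per_device_train_batch_size
num_epochs = 3        # num_train_epochs

# One optimization step per batch; the last batch may be smaller.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
```

Under those assumptions training runs for 125 steps per epoch and 375 steps overall.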

In [None]:
# Step 3: Specify the TrainingArguments and Initialize the Trainer (Do not change!)
training_args = TrainingArguments(
    eval_strategy="epoch",
    logging_strategy="epoch",
    output_dir="yelp-training",
    learning_rate=5e-05,
    num_train_epochs=3.0,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    full_determinism=True
)

trainer = Trainer(
    model=classification_head_bert,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=compute_metrics,
)

### **Time to Train!**

Once all the above cells have finished running, it is time to train the model. On a GPU this should only take about 5 minutes. You can check if you're using the GPU by going to `Runtime` -> `Change Runtime Type` and making sure that `Hardware Accelerator` is set to `GPU`.

In [None]:
# Train the model!
trainer.train()
trainer.save_model('./yelpBERT')


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 1.3729        | 1.094293        | 0.516    |
| 2     | 0.9181        | 1.027383        | 0.572    |
| 3     | 0.5923        | 1.018113        | 0.603    |


In [None]:
# Back up the trained model to Google Drive (-r is required because ./yelpBERT is a directory)
from google.colab import drive
drive.mount('/content/drive')
!cp -r ./yelpBERT "/content/drive/MyDrive/CIS5300/HW7/"

In [None]:
# Restore the saved model from Google Drive in a later session
from google.colab import drive
drive.mount('/content/drive')
!cp -r "/content/drive/MyDrive/CIS5300/HW7/yelpBERT" ./

## Sanity-Checking our trained model

In order to run our trained model we need to load it into a `text-classification` pipeline. A quick sanity check should show us that our model outputs sensible results.

In [None]:
yelpBERT = pipeline('text-classification', model='./yelpBERT', tokenizer=tokenizer, device=device_id)

In [None]:
# Output sentiment for reviews (LABEL_4 = 5 stars, LABEL_0 = 1 star)
yelpBERT(["This place was amazing!",
          "This place was good",
          "This place was fine, there were good and bad parts.",
          "This place was pretty bad",
          "This place was awful"])

[{'label': 'LABEL_4', 'score': 0.924649715423584},
 {'label': 'LABEL_3', 'score': 0.7362259030342102},
 {'label': 'LABEL_2', 'score': 0.6407105326652527},
 {'label': 'LABEL_2', 'score': 0.5589599609375},
 {'label': 'LABEL_0', 'score': 0.8636391162872314}]
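
The pipeline reports class indices as `LABEL_k` strings. A small helper like the one below (hypothetical, not part of the assignment) converts them back to star ratings, using the mapping stated in the cell comment above (`LABEL_0` = 1 star, `LABEL_4` = 5 stars). The scores are rounded copies of three of the outputs above.

```python
def label_to_stars(label):
    """Convert a pipeline label such as 'LABEL_4' to a 1-5 star rating."""
    class_index = int(label.split("_")[-1])
    return class_index + 1

outputs = [
    {"label": "LABEL_4", "score": 0.92},
    {"label": "LABEL_2", "score": 0.64},
    {"label": "LABEL_0", "score": 0.86},
]
stars = [label_to_stars(o["label"]) for o in outputs]
```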

## Evaluating the model

In order to get the official evaluation results we will be using the [Evaluator](https://huggingface.co/docs/evaluate/v0.4.0/en/package_reference/evaluator_classes#evaluator) pipeline for a standardized evaluation environment. Please run the following code to get the accuracy of your model on the Yelp dataset. We will be grading your model's accuracy, which should be at least 55% to receive full credit for this section.

In [None]:
task_evaluator = evaluator("text-classification")

eval_results = task_evaluator.compute(
    model_or_pipeline=yelpBERT,
    data=eval_data,
    metric=evaluate.load("accuracy"),
    label_mapping={"LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2, "LABEL_3": 3, "LABEL_4": 4}
)

print(eval_results)

{'accuracy': 0.603, 'total_time_in_seconds': 27.857660229999965, 'samples_per_second': 35.89676920975216, 'latency_in_seconds': 0.027857660229999966}
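
The timing fields in the printed result appear to be simple ratios of the total time and the number of evaluated examples. The check below is only a sketch under that assumption (1,000 examples, as selected for `eval_data`); it copies the total time from the output above rather than calling the evaluator.

```python
num_examples = 1000                           # size of eval_data
total_time_in_seconds = 27.857660229999965    # copied from the output above

# Throughput: how many examples are processed per second.
samples_per_second = num_examples / total_time_in_seconds
# Latency: average time spent on one example.
latency_in_seconds = total_time_in_seconds / num_examples
```

Both values agree with the `samples_per_second` and `latency_in_seconds` fields reported above.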


In [None]:
grader.grade(test_case_id = 'test_yelp_accuracy', answer = eval_results)

Correct! You earned 5/5 points. You are a star!

Your submission has been successfully recorded in the gradebook.


# Section 3: Training BERT for Natural Language Inference (25 points)

In this section you will train BERT to perform a new task -- Natural Language Inference (NLI). NLI is the task of taking two pieces of text (a "premise" and a "hypothesis") and determining whether or not the hypothesis is entailed by the premise.


> Natural Language Inference is a task of determining whether the given “hypothesis” and “premise” logically follow (entailment) or unfollow (contradiction) or are undetermined (neutral) to each other. ~ [Oleh Loksyn](https://towardsdatascience.com/natural-language-inference-an-overview-57c0eecf6517)

This can be formulated as a sequence classification task where, given both sequences, we predict one of three labels:

(0 = entailment, 1 = neutral, 2 = contradiction)

### **Applying BERT to multiple sequences**

In order to encode multiple sequences of non-contiguous text with BERT we use the `[SEP]` token. This token is a special token (similar to `[MASK]`) that indicates to the model that the sequences on either side of the token are distinct. With the `[SEP]` token we're able to concatenate the premise and hypothesis together into one sequence allowing us to use a standard classification head.
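
To visualize the layout BERT expects for a pair of sequences, here is a toy sketch that uses whitespace splitting in place of the real tokenizer. The helper `build_pair_layout` is hypothetical. The segment ids show how BERT's pair format distinguishes the two sequences; they are an illustration of that format and not necessarily what plain string concatenation with `[SEP]` produces.

```python
def build_pair_layout(premise, hypothesis):
    """Toy whitespace 'tokenizer' showing the [CLS] A [SEP] B [SEP] layout."""
    first = ["[CLS]"] + premise.split() + ["[SEP]"]
    second = hypothesis.split() + ["[SEP]"]
    tokens = first + second
    # Segment 0 covers [CLS], the premise and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * len(first) + [1] * len(second)
    return tokens, segment_ids

tokens, segment_ids = build_pair_layout("My name is John.", "The sky is blue.")
```

The HuggingFace tokenizer can also be called with two texts at once, in which case it inserts the special tokens itself; this assignment instead asks you to build the concatenated string explicitly.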

### **Loading BERT**
Your first task is to load in BERT with a classification head. This should be identical to the loading code from Section 2, save for a different number of labels.

In [None]:
# TODO: Load in the bert-base-cased model with a classification head (remember the right number of labels!)
classification_head_bert = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)

### **The MNLI Dataset**
For this task we'll be using the [Multi-Genre NLI](https://huggingface.co/datasets/multi_nli) dataset. This dataset was collected by asking crowd workers to annotate 433,000 sentence pairs from various genres (Fiction, Travel, Telephone, Letters, Government) for their textual entailment information.

In [None]:
mnli = load_dataset("multi_nli")

README.md: 0.00B [00:00, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/214M [00:00<?, ?B/s]

(…)alidation_matched-00000-of-00001.parquet:   0%|          | 0.00/4.94M [00:00<?, ?B/s]

(…)dation_mismatched-00000-of-00001.parquet:   0%|          | 0.00/5.10M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/392702 [00:00<?, ? examples/s]

Generating validation_matched split:   0%|          | 0/9815 [00:00<?, ? examples/s]

Generating validation_mismatched split:   0%|          | 0/9832 [00:00<?, ? examples/s]

### **Tokenize and Process the Dataset** (15 points)

In this section you must implement the three processing functions `is_labeled`, `concatenate`, and `tokenize_function`, each worth 5 points. These should be fairly similar to the processing functions we've been dealing with so far.

In [None]:
def is_labeled(row):
  '''

  Input:
      row: A dictionary representing a single row of the dataset.
            It contains at least the key "label" which holds an integer.

  Returns:
      A boolean value indicating whether the "label" is not -1.
  '''
  # TODO: Filter out all examples that do not have an entailment label (i.e. label is -1)
  return row['label'] != -1

mnli = mnli.filter(is_labeled)

In [None]:
grader.grade(test_case_id = 'test_is_labeled', answer = getsource(is_labeled))

Correct! You earned 5/5 points. You are a star!

Your submission has been successfully recorded in the gradebook.


In [None]:
def concatenate(row):
  '''
  Input:
      row: A dictionary representing a single row of the dataset.
            It contains at least the keys "premise" and "hypothesis", both holding strings.

  Returns:
      The input row dictionary, but with an added key "concat" that contains the concatenated string.

  '''
  # TODO: Concatenate the text of the "premise" column to the "hypothesis" column
  #        using the [SEP] token. Must be in the order "<premise> [SEP] <hypothesis>"
  #        Add the output as a new column "concat" to the dataset
  row['concat'] = row['premise'] + ' [SEP] ' + row['hypothesis']
  return row

mnli = mnli.map(concatenate)

In [None]:
mnli["train"][12]

{'promptID': 32819,
 'pairID': '32819n',
 'premise': "It's not that the questions they asked weren't interesting or legitimate (though most did fall under the category of already asked and answered).",
 'premise_binary_parse': "( It ( ( ( ( 's not ) ( that ( ( ( the questions ) ( they asked ) ) ( ( were n't ) ( ( interesting or ) legitimate ) ) ) ) ) ( -LRB- ( ( though ( most ( ( did fall ) ( under ( ( ( the category ) ( of already ) ) ( ( asked and ) answered ) ) ) ) ) ) -RRB- ) ) ) . ) )",
 'premise_parse': "(ROOT (S (NP (PRP It)) (VP (VBZ 's) (RB not) (SBAR (IN that) (S (NP (NP (DT the) (NNS questions)) (SBAR (S (NP (PRP they)) (VP (VBD asked))))) (VP (VBD were) (RB n't) (ADJP (JJ interesting) (CC or) (JJ legitimate))))) (PRN (-LRB- -LRB-) (SBAR (IN though) (S (NP (JJS most)) (VP (VBD did) (NP (NN fall)) (PP (IN under) (NP (NP (DT the) (NN category)) (PP (IN of) (ADVP (RB already))) (VP (VBN asked) (CC and) (VBN answered))))))) (-RRB- -RRB-))) (. .)))",
 'hypothesis': 'All of the qu

In [None]:
grader.grade(test_case_id = 'test_concatenate', answer = getsource(concatenate))

Correct! You earned 5/5 points. You are a star!

Your submission has been successfully recorded in the gradebook.


In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(row):
  '''
  Input:
      row: A dictionary representing a single row of the dataset.
            It contains at least the key "concat" which holds a string of text.

  Returns:
      A dictionary with the tokenized output including keys "input_ids", "attention_mask", and "token_type_ids".
  '''
  # TODO: tokenize the concatenated data (make sure to pad and truncate to max length!)
  result = tokenizer(row["concat"], padding="max_length", truncation=True, max_length=512)
  row.update(result)
  return row

# Randomly select 1000 examples from train and validation and process them (Do not change!)
train_mnli = mnli["train"].shuffle(seed=42).select(range(1000)).map(tokenize_function)
eval_mnli = mnli["validation_matched"].shuffle(seed=42).select(range(1000)).map(tokenize_function)

In [None]:
grader.grade(test_case_id = 'test_tokenize_function', answer = train_mnli[:500])

Correct! You earned 5/5 points. You are a star!

Your submission has been successfully recorded in the gradebook.


### **Define the `compute_metrics` function** (5 points)
For this section you'll define the `compute_metrics` function. We'll be using accuracy again in this section. Feel free to reference your implementation from the previous section to help you out.

In [None]:
def compute_metrics(eval_pred):
  '''
  Input:
      eval_pred: A tuple (logits, labels)
                  logits: A 2D numpy array of shape (num_examples, num_classes) representing the raw predictions from the model.
                  labels: A 1D numpy array of shape (num_examples,) representing the true labels.

  Returns:
      A dictionary with the key "accuracy" and the value being the calculated accuracy.
  '''
  # TODO: extract the outputs of the model and compute accuracy
  logits, labels = eval_pred
  predictions = np.argmax(logits, axis=-1)
  accuracy = np.mean(predictions == labels)
  return {"accuracy": accuracy}

In [None]:
grader.grade(test_case_id = 'test_mnli_compute_metrics', answer = getsource(compute_metrics))

Correct! You earned 5/5 points. You are a star!

Your submission has been successfully recorded in the gradebook.


### **Initialize the `TrainingArguments` and `Trainer`**

In [None]:
training_args = TrainingArguments(
    eval_strategy="epoch",
    logging_strategy="epoch",
    output_dir="mnli-training",
    learning_rate=5e-05,
    num_train_epochs=3.0,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    full_determinism=True
)

trainer = Trainer(
    model=classification_head_bert,
    args=training_args,
    train_dataset=train_mnli,
    eval_dataset=eval_mnli,
    compute_metrics=compute_metrics,
)

### **Train the model**

In [None]:
trainer.train()
trainer.save_model('./mnliBERT')


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 1.3885        | 1.073122        | 0.528    |
| 2     | 0.8834        | 0.994093        | 0.574    |
| 3     | 0.6145        | 1.044535        | 0.606    |


In [None]:
# Back up the trained model to Google Drive (-r is required because ./mnliBERT is a directory)
from google.colab import drive
drive.mount('/content/drive')
!cp -r ./mnliBERT "/content/drive/MyDrive/CIS5300/HW7/"

In [None]:
# Reload the model from Google Drive in a later session
from google.colab import drive
drive.mount('/content/drive')
!cp -r "/content/drive/MyDrive/CIS5300/HW7/mnliBERT" ./


In [None]:
!ls ./mnliBERT

config.json  model.safetensors	training_args.bin


### **Sanity Check**

Once again, if trained properly, a quick sanity check should provide us with sensible results.

In [None]:
mnliBERT = pipeline('text-classification', model='./mnliBERT', tokenizer=tokenizer, device=device_id)

In [None]:
# Output entailment (LABEL_0 = entailment, LABEL_1 = neutral, LABEL_2 = contradiction)
mnliBERT(["I just graduated college. [SEP] The person is a college graduate.",
          "My name is John. [SEP] The sky is blue.",
          "Thank you so much for your help. [SEP] The person was not helped."])

[{'label': 'LABEL_1', 'score': 0.9395487904548645},
 {'label': 'LABEL_1', 'score': 0.9481647610664368},
 {'label': 'LABEL_2', 'score': 0.962492048740387}]

### **Evaluate the Model**

Please run the following code to evaluate your model on the MNLI dataset. Your model should be at least 55% accurate to receive full credit.

In [None]:
task_evaluator = evaluator("text-classification")

eval_results = task_evaluator.compute(
    model_or_pipeline=mnliBERT,
    data=eval_mnli,
    input_column="concat",
    label_column="label",
    metric=evaluate.load("accuracy"),
    label_mapping={"LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2}
)

print(eval_results)

{'accuracy': 0.55, 'total_time_in_seconds': 25.408805179999945, 'samples_per_second': 39.356435413465675, 'latency_in_seconds': 0.02540880517999995}


In [None]:
grader.grade(test_case_id = 'test_mnli_accuracy', answer = eval_results)

Correct! You earned 5/5 points. You are a star!

Your submission has been successfully recorded in the gradebook.


# Section 4: Training BERT for Named-Entity Recognition (40 points)

In this final section you will train BERT to perform [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) (NER).

NER is the task of tagging and identifying named entities (e.g. People, Organizations, Locations) in text. For example the sentence "Only France and Britain backed Fischler's proposal." would be tagged as

> Only (France, LOC) and (Britain, LOC) backed (Fischler's, PER) proposal.

NER is typically formulated as a token classification task. Given a sequence of tokens, we assign a class to each token. Tokens that are not named entities are given the null tag `O`, and tokens that are named entities get one of four classes (`ORG`, `PER`, `LOC`, or `MISC`). Thus the sentence from earlier would be given the labels

>(Only, O) (France, LOC) (and, O) (Britain, LOC) (backed, O) (Fischler's, PER) (proposal, O)

Giving every token in the sequence a label allows us to use a token classification head for this task.

### **Beginning-Inside-Outside (BIO) tags**

One downside to the tagging scheme described is that for two consecutive tokens with the same tag, we can't tell whether or not they belong to the same entity. For example:

> ("25-1", O), ("Barcelona", ORG), ("Real", ORG), ("Madrid", ORG)

We know that [Real Madrid](https://en.wikipedia.org/wiki/Real_Madrid_CF) and [Barcelona](https://en.wikipedia.org/wiki/FC_Barcelona) are distinct entities but our tag system combines them together. Thus we must distinguish between tags that *begin* an entity (`B-ORG`) and those that are *inside* an entity (`I-ORG`). With this new system we get

> ("25-1", O), ("Barcelona", B-ORG), ("Real", B-ORG), ("Madrid", I-ORG)

This allows us to distinguish between the two consecutive entities.
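
To see how BIO tags recover entity boundaries, here is a short sketch that groups tagged tokens into entities. The function `extract_entities` is a hypothetical helper written for this illustration; how it treats a stray `I-` tag (dropping it) is one possible design choice, not something the assignment prescribes.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

tokens = ["25-1", "Barcelona", "Real", "Madrid"]
tags = ["O", "B-ORG", "B-ORG", "I-ORG"]
entities = extract_entities(tokens, tags)
```

With the BIO tags from the example, the two clubs come out as separate entities: `Barcelona` and `Real Madrid`, both of type `ORG`.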

### **Training BERT + Token Classification Head**

In order to train BERT for NER we need to use a token classification head. Token classification heads are dense linear layers that output a distribution over $n$ classes *for each token* in the sequence. Conceptually this resembles having $L_{context}$ classification heads, one per position, except that they all share the same weights: the $L_{context}$ tokens are treated as a batch dimension, with the same transformation matrix applied to each token in our context window. The total size of a token classification head is therefore ($D_{enc}$ x $n$), independent of the sequence length.
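
The weight sharing can be sketched in a few lines of plain Python. The toy sizes and random weights below are assumptions for illustration only (`N_CLASSES = 9` corresponds to four entity types with `B-` and `I-` variants plus `O`), and `token_classification_head` is a hypothetical name.

```python
import random

D_ENC, N_CLASSES, L_CONTEXT = 4, 9, 6  # toy sizes

random.seed(0)
# A single (D_ENC x N_CLASSES) weight matrix shared by every position.
W = [[random.gauss(0, 0.1) for _ in range(N_CLASSES)] for _ in range(D_ENC)]

def token_classification_head(encodings):
    """Apply the same weight matrix to every token encoding in the sequence."""
    return [[sum(enc[i] * W[i][c] for i in range(D_ENC)) for c in range(N_CLASSES)]
            for enc in encodings]

encodings = [[random.gauss(0, 1) for _ in range(D_ENC)] for _ in range(L_CONTEXT)]
logits = token_classification_head(encodings)  # L_CONTEXT rows of N_CLASSES scores

num_weights = D_ENC * N_CLASSES  # does not depend on L_CONTEXT
```

The output has one row of class scores per token, while the number of weights stays fixed no matter how long the sequence is.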

### **Load the Model** (5 points)
Your first task is to load BERT with a token classification head. Take care that the `num_labels` and `id2label` parameters are set correctly.

HINT: Refer back to Section 2's discussion of model heads to find mention (and documentation) for the token classification head class.

In [14]:
label_names = {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8:'I-MISC'}
# TODO: Load the model with the right num_labels and set id2label to be label_names
token_classification_head_bert = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9, id2label=label_names)
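
For reference, the `label_names` mapping can be inverted and used to decode integer tags back into readable BIO tags. The sketch below is standalone plain Python. The name `label2id` mirrors the field of the same name on HuggingFace configs, but nothing here touches the model, and the example tags are the `ner_tags` of the sentence "Only France and Britain backed Fischler's proposal." from the CoNLL data.

```python
label_names = {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG',
               5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}

# Inverse mapping from tag name to integer id.
label2id = {name: idx for idx, name in label_names.items()}

ner_tags = [0, 5, 0, 5, 0, 1, 0, 0, 0]
decoded = [label_names[t] for t in ner_tags]
```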

In [15]:
attributes = copy.deepcopy(vars(token_classification_head_bert))
attributes['config'] = attributes['config'].__dict__
del attributes['_modules']
grader.grade(test_case_id = 'test_load_bert', answer = attributes)

Correct! You earned 5/5 points. You are a star!

Your submission has been successfully recorded in the gradebook.


In [None]:
help(AutoModelForTokenClassification.from_pretrained)

Help on method from_pretrained in module transformers.models.auto.auto_factory:

from_pretrained(*model_args, **kwargs) class method of transformers.models.auto.modeling_auto.AutoModelForTokenClassification
    Instantiate one of the model classes of the library (with a token classification head) from a pretrained model.
    
    The model class to instantiate is selected based on the `model_type` property of the config object (either
    passed as an argument or loaded from `pretrained_model_name_or_path` if possible), or when it's missing, by
    falling back to using pattern matching on `pretrained_model_name_or_path`:
    
        - **albert** -- [`AlbertForTokenClassification`] (ALBERT model)
        - **arcee** -- [`ArceeForTokenClassification`] (Arcee model)
        - **bert** -- [`BertForTokenClassification`] (BERT model)
        - **big_bird** -- [`BigBirdForTokenClassification`] (BigBird model)
        - **biogpt** -- [`BioGptForTokenClassification`] (BioGpt model)
        - 

### **The CoNLL Dataset**

We will use the [CoNLL](https://huggingface.co/datasets/conll2003) dataset as our training data for NER. This dataset consists of roughly 20,000 sentences from Reuters news articles that were manually annotated by participants for their named entities. The tags used are the four classes from earlier (`ORG`, `PER`, `LOC`, or `MISC`), each with `B-` and `I-` variants, plus the null tag `O`.

In [16]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [17]:
conll = load_dataset("conll2003", trust_remote_code=True)

README.md: 0.00B [00:00, ?B/s]

conll2003.py: 0.00B [00:00, ?B/s]

Downloading data:   0%|          | 0.00/983k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14041 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3453 [00:00<?, ? examples/s]

In [None]:
conll

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

### **Token-tag mismatch**
When processing data for a tagging task, we need to ensure that the tags properly match up to all the correct tokens. BERT's tokenizer is a WordPiece tokenizer (a sub-word scheme closely related to [Byte-Pair Encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding)) that tends to split up words into sub-word tokens. For example, take the sentence from earlier:
> "Only France and Britain backed Fischler's proposal"

We can see below that CoNLL's tokenization doesn't match the BERT tokenizer's: words such as `Fischler` are split into several sub-word tokens, so there are more tokens than tags. This creates a problem when processing the tags, as we need to keep the correspondence between tags and tokens. It will be up to you to fix this.

In [None]:
print(conll['train'][12]['ner_tags'])
print(tokenizer.convert_ids_to_tokens(tokenizer.encode("Only France and Britain backed Fischler's proposal"))[1:-1])

[0, 5, 0, 5, 0, 1, 0, 0, 0]
['Only', 'France', 'and', 'Britain', 'backed', 'Fi', '##sch', '##ler', "'", 's', 'proposal']


In [None]:
conll['train'][12]

{'id': '12',
 'tokens': ['Only',
  'France',
  'and',
  'Britain',
  'backed',
  'Fischler',
  "'s",
  'proposal',
  '.'],
 'pos_tags': [30, 22, 10, 22, 38, 22, 27, 21, 7],
 'chunk_tags': [11, 12, 12, 12, 21, 11, 11, 12, 0],
 'ner_tags': [0, 5, 0, 5, 0, 1, 0, 0, 0]}

In [None]:
tokenizer.encode("Only France and Britain backed Fischler's proposal")

[101, 2809, 1699, 1105, 2855, 5534, 17355, 9022, 2879, 112, 188, 5835, 102]

In [None]:
tokenizer.convert_ids_to_tokens(tokenizer.encode("Only France and Britain backed Fischler's proposal"))

['[CLS]',
 'Only',
 'France',
 'and',
 'Britain',
 'backed',
 'Fi',
 '##sch',
 '##ler',
 "'",
 's',
 'proposal',
 '[SEP]']

In [None]:
tokenizer.encode("")

[101, 102]

In [None]:
tokenizer.convert_ids_to_tokens(tokenizer.encode("Only France and Britain backed Fischler's proposal", padding="max_length", max_length=512))

['[CLS]',
 'Only',
 'France',
 'and',
 'Britain',
 'backed',
 'Fi',
 '##sch',
 '##ler',
 "'",
 's',
 'proposal',
 '[SEP]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 ...

### **Redistributing Tags to Sub-word Tokens** (10 points)

Your job is to implement the `tokenize_and_tag` function. This function should do three things:
1. Retokenize the tokens from CoNLL using the BERT tokenizer and add the outputs (`input_ids`, `attention_mask`, and `token_type_ids`) as new columns in the dataset.
2. Redistribute the NER tags with respect to the newly tokenized sequence and add the tags as a new `labels` column to the dataset.
3. Pad and truncate both the tokens and tags to length `tokenizer.model_max_length` (512).

If a token is broken up into multiple sub-tokens then all sub-tokens should be given the same NER tag as the original token.

Some example inputs and outputs are as follows (Reminder to consult the `label_names` dictionary for the mapping from number to NER tag):
```
-- Example #1 --
Inputs:
"tokens": ['Israel', 'approves', 'Arafat', "'s", 'flight', 'to', 'West', 'Bank', '.']
"ner_tags": [5, 0, 1, 0, 0, 0, 5, 6, 0]

Outputs:
"bert_tokens": ['[CLS]', 'Israel', 'approve', '##s', 'Ara', '##fa', '##t', "'", 's', 'flight', 'to', 'West', 'Bank', '.', '[SEP]', '[PAD]', '[PAD]', ...]
"input_ids": [101, 3103, 14942, 1116, 25692, 8057, 1204, 112, 188, 3043, 1106, 1537, 2950, 119, 102, 0, 0, ...]
"labels": [-100, 5, 0, 0, 1, 1, 1, 0, 0, 0, 0, 5, 6, 0, -100, -100, -100, ...]

-- Example #2 --
Inputs:
"tokens": ['66', 'Paul', 'Goydos', ',', 'Billy', 'Mayfair', ',', 'Hidemichi', 'Tanaka', '(', 'Japan', ')']
"ner_tags": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 5, 0]

Outputs:
"bert_tokens": ['[CLS]', '66', 'Paul', 'Go', '##yd', '##os', ',', 'Billy', 'May', '##fair', ',', 'Hi', '##de', '##mic', '##hi', 'Tanaka', '(', 'Japan', ')', '[SEP]', '[PAD]', '[PAD]', ...]
"input_ids": [101, 5046, 1795, 3414, 19429, 2155, 117, 4224, 1318, 19803, 117, 8790, 2007, 7257, 3031, 24128, 113, 1999, 114, 102, 0, 0, ...]
"labels": [-100, 0, 1, 2, 2, 2, 0, 1, 2, 2, 0, 1, 1, 1, 1, 2, 0, 5, 0, -100, -100, -100, ...]
```
**NOTE:** Make sure to use the label `-100` for all special tokens (`[CLS]`, `[SEP]`, `[PAD]`). Using this label will ensure that special tokens will be ignored by PyTorch and will not contribute to the loss calculation.
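The redistribution rule can be tried out without loading a tokenizer. The sketch below reproduces Example #1 in plain Python. The `EXAMPLE_SPLITS` table and the `redistribute_tags` helper are hypothetical stand-ins introduced only for illustration (the splits are copied from Example #1; in the notebook they would come from `tokenizer.tokenize`), and the short `max_length=17` is chosen so the output is readable.

```python
# Sub-word splits copied from Example #1; a real run would get these from the BERT tokenizer.
EXAMPLE_SPLITS = {
    "approves": ["approve", "##s"],
    "Arafat": ["Ara", "##fa", "##t"],
    "'s": ["'", "s"],
}

IGNORE_LABEL = -100  # label given to [CLS], [SEP] and [PAD]


def redistribute_tags(words, tags, max_length):
    """Repeat each word-level tag for every sub-token, then add special tokens and padding."""
    bert_tokens, labels = ["[CLS]"], [IGNORE_LABEL]
    for word, tag in zip(words, tags):
        pieces = EXAMPLE_SPLITS.get(word, [word])  # hypothetical stand-in for tokenizer.tokenize(word)
        bert_tokens.extend(pieces)
        labels.extend([tag] * len(pieces))
    # Truncate so that [SEP] always fits, then close the sequence.
    bert_tokens = bert_tokens[: max_length - 1] + ["[SEP]"]
    labels = labels[: max_length - 1] + [IGNORE_LABEL]
    # Pad both lists to the same fixed length.
    pad = max_length - len(bert_tokens)
    return bert_tokens + ["[PAD]"] * pad, labels + [IGNORE_LABEL] * pad


words = ['Israel', 'approves', 'Arafat', "'s", 'flight', 'to', 'West', 'Bank', '.']
ner_tags = [5, 0, 1, 0, 0, 0, 5, 6, 0]
bert_tokens, labels = redistribute_tags(words, ner_tags, max_length=17)
print(bert_tokens)
print(labels)
```

The printed labels line up with the first 17 entries of the `labels` row in Example #1: every sub-token inherits its word's tag, and every special token carries `-100`.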

In [18]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_tag(row):
  '''
  Pseudocode:
      1. Initialize empty lists for tokens and tags.
      2. For each word and its corresponding tag in the input row:
          a. Tokenize the word using the BERT tokenizer.
          b. Extend the tokens list with the tokenized word.
          c. Extend the tags list with the tag repeated for each sub-token.
      3. Tokenize the entire sequence of tokens using the BERT tokenizer with padding and truncation.
      4. Add the special label -100 to the start of the labels list and pad it to the maximum length.
      5. Add the labels list to the tokenized samples dictionary.
      6. Return the tokenized samples dictionary.

  Input:
      row: A dictionary representing a single row of the dataset.
            It contains at least the keys "tokens" and "ner_tags", both holding lists of strings and integers respectively.

  Returns:
      A dictionary with the tokenized output including keys "input_ids", "attention_mask", "token_type_ids", and "labels".
  '''
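  # TODO: Redistribute and tokenize the CoNLL dataset using the BERT tokenizer
  # Repeat each word-level tag once per sub-token produced for that word.
  tags = []
  for word, tag in zip(row['tokens'], row['ner_tags']):
    sub_tokens = tokenizer.tokenize(word)
    tags.extend([tag] * len(sub_tokens)) # a word may split into several sub-tokens

  # Tokenize the pre-split words; this adds [CLS]/[SEP] and pads/truncates to model_max_length.
  tokenized = tokenizer(row['tokens'], padding="max_length", truncation=True, is_split_into_words=True)
  max_len = len(tokenized['input_ids'])

  result = {}
  result["input_ids"] = tokenized['input_ids']
  result["attention_mask"] = tokenized['attention_mask']
  result["token_type_ids"] = tokenized['token_type_ids']
  result["bert_tokens"] = tokenizer.convert_ids_to_tokens(tokenized['input_ids'])

  # Special tokens ([CLS], [SEP], [PAD]) keep the label -100 so the loss ignores them.
  labels = [-100] * max_len
  for idx, label in enumerate(tags[:max_len - 2]): # leave room for [CLS] and [SEP]
    labels[idx+1] = label
  result["labels"] = labels
  return result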


# TODO: randomly select 1000 examples from train and validation and call tokenize_and_tag
train_conll = conll["train"].shuffle(seed=42).select(range(1000)).map(tokenize_and_tag)
eval_conll = conll["validation"].shuffle(seed=42).select(range(1000)).map(tokenize_and_tag)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [19]:
grader.grade(test_case_id = 'test_tokenize_and_tag', answer = conll["train"].select(range(350)).map(tokenize_and_tag)[:])

Map:   0%|          | 0/350 [00:00<?, ? examples/s]

Correct! You earned 10/10 points. You are a star!

Your submission has been successfully recorded in the gradebook.


In [None]:
row = conll["train"][0]
print(row['tokens'])
print(row['ner_tags'])
row

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
[3, 0, 7, 0, 0, 0, 7, 0, 0]


{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [None]:
tokens = []
tags = []
print(row["tokens"])
for token, tag in zip(row['tokens'], row['ner_tags']):
  tokenized = tokenizer.tokenize(token)
  tokens.extend(tokenized)
  tags.extend([tag] * len(tokenized))
print("tokens: ", tokens)
print("tags: ", tags)
tokenized = tokenizer(" ".join(tokens), padding="max_length", truncation=True)
print(tokenized)
# input_ids = tokenizer.encode(" ".join(tokens), padding="max_length", truncation=True, max_length=512)
# bert_tokens = tokenizer.convert_ids_to_tokens(input_ids)
# print(input_ids)
# print(bert_tokens)

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
tokens:  ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.']
tags:  [3, 0, 7, 0, 0, 0, 7, 0, 0, 0]
{'input_ids': [101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 108, 108, 182, 1830, 119, 102, 0, 0, 0, ...

In [None]:
len(tokenized["input_ids"])

512

In [None]:
# tokenizer.encode(" ".join(tokens), padding='max_length', max_length=20)
result = tokenizer(tokens, padding="max_length", truncation=True, max_length=512)
# result
# len(result['input_ids'][0]) # list of numbers
len(result["attention_mask"][0]) # list of numbers
# result["token_type_ids"] # list of numbers
# type(result)

512

In [None]:
result.word_ids(batch_index=1)

[None,
 0,
 None,
 None,
 ...]

In [None]:
help(result.word_ids)

Help on method word_ids in module transformers.tokenization_utils_base:

word_ids(batch_index: int = 0) -> list[typing.Optional[int]] method of transformers.tokenization_utils_base.BatchEncoding instance
    Return a list mapping the tokens to their actual word in the initial sentence for a fast tokenizer.
    
    Args:
        batch_index (`int`, *optional*, defaults to 0): The index to access in the batch.
    
    Returns:
        `list[Optional[int]]`: A list indicating the word corresponding to each token. Special tokens added by the
        tokenizer are mapped to `None` and other tokens are mapped to the index of their corresponding word
        (several tokens will be mapped to the same word index if they are parts of that word).
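The mapping that `word_ids` returns is enough to redistribute tags directly, as an alternative to tokenizing word by word. The sketch below assumes the encoding was produced from the pre-split words as a single sequence (for example with `is_split_into_words=True`); the call above passed the sub-token list as a batch, which is why `batch_index=1` reports a single word followed by padding. The `word_ids` list here is hand-written for the `'EU rejects ...'` row with `'lamb'` split into `'la'` and `'##mb'`, and `labels_from_word_ids` is a hypothetical helper name.

```python
IGNORE_LABEL = -100


def labels_from_word_ids(word_ids, word_tags):
    """Give every sub-token the tag of the word it came from; special tokens get -100."""
    return [IGNORE_LABEL if w is None else word_tags[w] for w in word_ids]


# Hand-written word_ids for
# ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
# with 'lamb' split into two sub-tokens, padded to a total length of 14.
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None, None, None]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

labels = labels_from_word_ids(word_ids, ner_tags)
print(labels)
```

Because `[CLS]`, `[SEP]`, and `[PAD]` all map to `None`, the same one-line rule covers the special-token labels and the padding.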



In [None]:
help(tokenizer.encode)

Help on method encode in module transformers.tokenization_utils_base:

encode(text: Union[str, list[str], list[int]], text_pair: Union[str, list[str], list[int], NoneType] = None, add_special_tokens: bool = True, padding: Union[bool, str, transformers.utils.generic.PaddingStrategy] = False, truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy, NoneType] = None, max_length: Optional[int] = None, stride: int = 0, padding_side: Optional[str] = None, return_tensors: Union[str, transformers.utils.generic.TensorType, NoneType] = None, **kwargs) -> list[int] method of transformers.models.bert.tokenization_bert_fast.BertTokenizerFast instance
    Converts a string to a sequence of ids (integer), using the tokenizer and vocabulary.
    
    Same as doing `self.convert_tokens_to_ids(self.tokenize(text))`.
    
    Args:
        text (`str`, `list[str]` or `list[int]`):
            The first sequence to be encoded. This can be a string, a list of strings (tokenized

### **Computing Macro-Averaged F1 Score** (20 points)

We want to compute the accuracy of our model when identifying named entities. However, most sequences consist mostly of `O` tags. This is bad for our evaluation, as it means that predicting a sequence of all `O`s will result in high accuracy despite not actually accomplishing the task.

To solve this problem we will be using the Macro-Averaged F1 Score as our metric. This is the unweighted average of the F1 scores for each individual token class. This is a common metric to use when given an unbalanced multi-class classification task. Feel free to read [this article](https://stephenallwright.com/micro-vs-macro-f1-score/) for a full explanation of Macro-F1 as it will be useful for your implementation.

You are to implement the `calculate_macro_f1` function. This function should:
1. Calculate the full confusion matrix (TP, TN, FP, FN) for each class
2. Compute the precision, recall, and f1 score for each class
3. Macro-Average together the F1 score for all nine classes to get the final score

**NOTE**: Typically for NER a correct entity prediction is considered to be one that exactly matches all `B-` and `I-` tags. However, to make the Macro-F1 calculation easier we are considering all of our nine different tag classes as fully separate.

**TIP**: You can use the [sklearn.metrics.f1_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) function from scikit-learn to check your implementation. To help debug, we have provided the handy `debug_macro_f1` function for you to use. This function generates random test cases and compares the output of your implementation to that of scikit-learn. If the two do not match, it prints out the failed example.
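To see why accuracy is misleading here, the following plain-Python sketch scores a "model" that predicts `O` for every token. The `macro_f1` helper is an illustrative simplification, not the graded implementation: it works on one flat list of tag ids and averages over the classes that appear in either list (an assumption made for this sketch).

```python
def macro_f1(preds, labels):
    """Unweighted mean of per-class F1 over every class seen in preds or labels."""
    classes = sorted(set(preds) | set(labels))
    f1_scores = []
    for c in classes:
        tp = sum(p == c and l == c for p, l in zip(preds, labels))
        fp = sum(p == c and l != c for p, l in zip(preds, labels))
        fn = sum(p != c and l == c for p, l in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)


# 18 'O' tokens plus one two-token person name (B-PER, I-PER); predictions are all 'O'.
labels = [0] * 18 + [1, 2]
preds = [0] * 20

accuracy = sum(p == l for p, l in zip(preds, labels)) / len(labels)
print(f"accuracy: {accuracy:.3f}")                  # accuracy: 0.900
print(f"macro-F1: {macro_f1(preds, labels):.3f}")   # macro-F1: 0.316
```

The `O` class scores well, but the two entity classes each contribute an F1 of 0, which pulls the unweighted average down even though 90% of the tokens are tagged correctly.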

In [20]:
def calculate_macro_f1(preds, labels):
  '''
  Pseudocode:
      1. Map prediction and label IDs to their corresponding tag names.
      2. Compute confusion matrices for each class (TP, TN, FP, FN).
      3. Compute precision, recall, and F1 score for each class.
      4. Macro-average the precision, recall, and F1 scores across all classes.
      5. Return a dictionary containing the macro-averaged precision, recall, and F1 scores.

  Input:
      preds: A list of lists, where each sublist contains predicted label IDs for a sequence.
      labels: A list of lists, where each sublist contains true label IDs for a sequence.

  Returns:
      A dictionary with keys "precision", "recall", and "macro-f1", representing the macro-averaged scores.
  '''
  # Filter out -100 and convert ids to tags
  label_map = {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8:'I-MISC'}
  true_preds = [
    [(label_map[p], label_map[l]) for (p, l) in zip(pred, label) if l in label_map]
    for pred, label in zip(preds, labels)
  ]
  import numpy as np
  from sklearn.metrics import confusion_matrix

  # TODO: Compute confusion matrices for each class
  label_index_map = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
  preds_flat = [label_index_map[pred] for sequence in true_preds for pred, label in sequence]
  labels_flat = [label_index_map[label] for sequence in true_preds for pred, label in sequence]

  unique_classes = set()
  for pred in preds_flat:
    unique_classes.add(pred)
  for label in labels_flat:
    unique_classes.add(label)
  label_list = list(unique_classes)
  label_list.sort()

  cm = confusion_matrix(labels_flat, preds_flat, labels=label_list)

  # TODO: Compute precision, recall, and f1 for each class
  precisions = []
  recalls = []
  f1s = []

  for i in range(len(label_list)):
    TP = cm[i, i]
    FP = cm[:, i].sum() - TP
    FN = cm[i, :].sum() - TP

    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    precisions.append(precision)
    recalls.append(recall)
    f1s.append(f1)

  return {
      "precision": float(np.mean(precisions)),
      "recall": float(np.mean(recalls)),
      "macro-f1": float(np.mean(f1s))
  }

In [21]:
grader.grade(test_case_id = 'test_macro_f1', answer = getsource(calculate_macro_f1))

Correct! You earned 20/20 points. You are a star!

Your submission has been successfully recorded in the gradebook.


In [22]:
from sklearn.metrics import f1_score

def debug_macro_f1(num_tests, len_examples):

  # For each random test case
  for i in range(num_tests):
    # Generate two random arrays for the predictions and labels
    rand_preds = np.random.randint(0,9,len_examples).tolist()
    rand_labels = np.random.randint(0,9,len_examples).tolist()
    print("rand_preds", rand_preds)
    print("rand_labels", rand_labels)

    # Calculate Macro-F1 score with your implementation and scikit-learn
    sklearn_f1 = f1_score(rand_labels, rand_preds, average='macro')
    your_f1 = calculate_macro_f1([rand_preds], [rand_labels])
    print("sklearn_f1: ", sklearn_f1)
    print("your_f1: ", your_f1)
    print("the difference between two f1 is ", abs(sklearn_f1 - your_f1['macro-f1']))

    # If the two implementations agree (within 0.001), print the example and both scores; otherwise flag the mismatch
    if abs(sklearn_f1 - your_f1['macro-f1']) <= 0.001:
      print(f'preds: {rand_preds}')
      print(f'labels: {rand_labels}')
      print(sklearn_f1, your_f1['macro-f1'])
    else:
      print("!!! The difference is too big")

debug_macro_f1(1, 10)

rand_preds [6, 2, 1, 8, 5, 6, 3, 3, 6, 3]
rand_labels [2, 3, 7, 2, 0, 8, 5, 0, 5, 7]
sklearn_f1:  0.0
your_f1:  {'precision': 0.0, 'recall': 0.0, 'macro-f1': 0.0}
the difference between two f1 is  0.0
preds: [6, 2, 1, 8, 5, 6, 3, 3, 6, 3]
labels: [2, 3, 7, 2, 0, 8, 5, 0, 5, 7]
0.0 0.0


In [None]:
preds = [5, 2, 4, 4, 7, 0, 8, 1, 2, 8]
labels = [0, 5, 6, 7, 5, 0, 0, 1, 2, 6]

In [None]:
preds_list = [preds]
labels_list = [labels]
labels_list

[[0, 5, 6, 7, 5, 0, 0, 1, 2, 6]]

In [None]:
rand_preds = [1, 0, 8, 1, 6, 5, 7, 7, 0, 0]
rand_labels = [3, 5, 0, 1, 1, 3, 0, 1, 7, 2]
result = calculate_macro_f1_v1([rand_preds], [rand_labels])
result


{'macro-f1': 0.044444444444444446,
 'precision': 0.05555555555555555,
 'recall': 0.037037037037037035}

In [None]:
unique_classes = set()
unique_classes.update(rand_preds)
unique_classes.update(rand_labels)
unique_classes

{0, 1, 2, 3, 5, 6, 7, 8}

In [None]:
label_list = list(unique_classes)
label_list.sort()
label_list

[0, 1, 2, 3, 5, 6, 7, 8]

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(rand_labels, rand_preds, labels=label_list)
cm

array([[0, 0, 0, 0, 0, 0, 1, 1],
       [0, 1, 0, 0, 0, 1, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0]])

In [None]:
precisions = []
recalls = []
f1s = []

for i in range(len(label_list)):
  TP = cm[i, i]
  FP = cm[:, i].sum() - TP
  FN = cm[i, :].sum() - TP

  precision = TP / (TP + FP) if (TP + FP) > 0 else 0
  recall = TP / (TP + FN) if (TP + FN) > 0 else 0
  f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

  precisions.append(precision)
  recalls.append(recall)
  f1s.append(f1)

print(f"precisions: {precisions}")
print(f"recalls: {recalls}")
print(f"f1s: {f1s}")
np.mean(f1s)

precisions: [np.float64(0.0), np.float64(0.5), 0, 0, np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0)]
recalls: [np.float64(0.0), np.float64(0.3333333333333333), np.float64(0.0), np.float64(0.0), np.float64(0.0), 0, np.float64(0.0), 0]
f1s: [0, np.float64(0.4), 0, 0, 0, 0, 0, 0]


np.float64(0.05)

In [None]:
label_map = {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8:'I-MISC'}
true_preds = [
    [(label_map[p], label_map[l]) for (p, l) in zip(pred, label) if l in label_map]
    for pred, label in zip(preds_list, labels_list)
  ]

true_preds


[[('B-LOC', 'O'),
  ('I-PER', 'B-LOC'),
  ('I-ORG', 'I-LOC'),
  ('I-ORG', 'B-MISC'),
  ('B-MISC', 'B-LOC'),
  ('O', 'O'),
  ('I-MISC', 'O'),
  ('B-PER', 'B-PER'),
  ('I-PER', 'I-PER'),
  ('I-MISC', 'I-LOC')]]

In [24]:
def compute_metrics(eval_pred):
  logits, labels = eval_pred
  preds = np.argmax(logits, axis=2)
  return calculate_macro_f1(preds, labels)

### **Training the model**

In [25]:
training_args = TrainingArguments(
    eval_strategy="epoch",
    logging_strategy="epoch",
    output_dir="conll-training",
    learning_rate=5e-05,
    num_train_epochs=3.0,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    full_determinism=True
)

trainer = Trainer(
    model=token_classification_head_bert,
    args=training_args,
    train_dataset=train_conll,
    eval_dataset=eval_conll,
    compute_metrics=compute_metrics
)

In [27]:
trainer.train()
trainer.save_model('./nerBERT')

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [30]:
!ls ./

conll-training	drive  nerBERT	notebook-config.yaml  sample_data  wandb


In [29]:
from google.colab import drive
drive.mount('/content/drive')
!cp -r ./nerBERT "/content/drive/MyDrive/CIS5300/HW7"

Mounted at /content/drive


### **Sanity Checking the NER Predictions**
In order to check our model we need to load it into a `token-classification` pipeline.

If your model has been trained correctly then you will see in the following example that it correctly predicts the "Pennsylvania" in "University of Pennsylvania" as an `ORG` but "Philadelphia" as a `LOC`.

In [31]:
nerBERT = pipeline('token-classification', model='./nerBERT', tokenizer=tokenizer, device=device_id)

In [32]:
nerBERT(["Chris Callison-Burch is a professor at the University of Pennsylvania in Philadelphia",
          "Joe Biden is the 46th President of the United States"], aggregation_strategy="simple")

[[{'entity_group': 'PER',
   'score': np.float32(0.94467056),
   'word': 'Chris Callison - Burch',
   'start': 0,
   'end': 20},
  {'entity_group': 'ORG',
   'score': np.float32(0.64913946),
   'word': 'University of Pennsylvania',
   'start': 43,
   'end': 69},
  {'entity_group': 'LOC',
   'score': np.float32(0.5393902),
   'word': 'Philadelphia',
   'start': 73,
   'end': 85}],
 [{'entity_group': 'PER',
   'score': np.float32(0.9054521),
   'word': 'Joe Biden',
   'start': 0,
   'end': 9},
  {'entity_group': 'MISC',
   'score': np.float32(0.19316396),
   'word': 'of',
   'start': 32,
   'end': 34},
  {'entity_group': 'MISC',
   'score': np.float32(0.20893331),
   'word': 'the',
   'start': 35,
   'end': 38},
  {'entity_group': 'LOC',
   'score': np.float32(0.40475094),
   'word': 'United',
   'start': 39,
   'end': 45},
  {'entity_group': 'LOC',
   'score': np.float32(0.40150294),
   'word': 'States',
   'start': 46,
   'end': 52}]]

### **Evaluating NER with SeqEval**

The official library for evaluating a model on the CoNLL task is [SeqEval](https://github.com/chakki-works/seqeval). We'll be downloading it and using it to officially score our model. The SeqEval `overall_f1` is closely related to your Macro-F1 implementation, with two caveats: it uses the traditional NER definition of a correct prediction (i.e. one that exactly matches all `B-` and `I-` tags of an entity), and it is aggregated over all entities (micro-averaged) rather than averaged over the nine tag classes. Thus, don't worry if the official result doesn't match your macro-f1 implementation.

For this task, in order to receive full credit, your implementation must get a Macro-F1 score of at least 0.82.
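The gap between token-level and entity-level scoring can be illustrated in a few lines of plain Python. This is a simplified sketch of exact-match entity scoring, not SeqEval's actual implementation: `extract_entities` and `entity_f1` are hypothetical helpers, and treating a stray `I-` tag as the start of a new entity is an assumption made here for simplicity.

```python
def extract_entities(tags):
    """Return the set of (entity_type, start, end) spans encoded by a BIO tag sequence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" flushes the last entity
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if current is not None and not continues:
            entities.add((current, start, i))  # close the entity that just ended
            current = None
        if prefix in ("B", "I") and not continues:
            start, current = i, etype  # open a new entity
    return entities


def entity_f1(true_tags, pred_tags):
    """F1 where a predicted entity counts only if its type and both boundaries match."""
    true_entities = extract_entities(true_tags)
    pred_entities = extract_entities(pred_tags)
    correct = len(true_entities & pred_entities)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


# One three-token ORG and one LOC; the prediction cuts the ORG one token short.
true_tags = ["B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
pred_tags = ["B-ORG", "I-ORG", "O", "O", "B-LOC"]

token_accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
print(token_accuracy)                   # 0.8
print(entity_f1(true_tags, pred_tags))  # 0.5
```

A single wrong token costs only one fifth of the token-level accuracy but invalidates the whole `ORG` entity, so entity-level scores are typically lower than token-level ones.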

In [33]:
!pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16162 sha256=dd1cba6c7d7bd596499657a281138ac1559e4d3b2cb30c41f8ebbd9a960143fa
  Stored in directory: /root/.cache/pip/wheels/bc/92/f0/243288f899c2eacdfa8c5f9aede4c71a9bad0ee26a01dc5ead
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [34]:
task_evaluator = evaluator("token-classification")

eval_results = task_evaluator.compute(
    model_or_pipeline=nerBERT,
    data=eval_conll,
    input_column="tokens",
    label_column="ner_tags",
    tokenizer=tokenizer,
    metric=evaluate.load("seqeval")
)

for key in eval_results:
  print(f"{key}: {eval_results[key]}")

Downloading builder script: 0.00B [00:00, ?B/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

LOC: {'precision': np.float64(0.7463556851311953), 'recall': np.float64(0.8888888888888888), 'f1': np.float64(0.8114104595879558), 'number': np.int64(576)}
MISC: {'precision': np.float64(0.3967611336032389), 'recall': np.float64(0.6950354609929078), 'f1': np.float64(0.5051546391752577), 'number': np.int64(282)}
ORG: {'precision': np.float64(0.7285714285714285), 'recall': np.float64(0.8073878627968337), 'f1': np.float64(0.7659574468085105), 'number': np.int64(379)}
PER: {'precision': np.float64(0.9509981851179673), 'recall': np.float64(0.940754039497307), 'f1': np.float64(0.9458483754512635), 'number': np.int64(557)}
overall_precision: 0.7150162715016272
overall_recall: 0.8573021181716833
overall_f1: 0.7797211660329529
overall_accuracy: 0.954639175257732
total_time_in_seconds: 73.13263082399999
samples_per_second: 13.673786772509068
latency_in_seconds: 0.07313263082399998


In [35]:
grader.grade(test_case_id = 'test_ner_score', answer = eval_results)

You earned 0/5 points.

But, don't worry, you can re-submit and we will keep only your latest score.


### Congratulations on finishing the homework! Here are the deliverables you need to submit to GradeScope
- This notebook and its py file: rename them to `homework7.ipynb` and `homework7.py`. You can download both by going to the top-left corner of this webpage, `File -> Download -> Download .ipynb/.py`