<a href="https://colab.research.google.com/github/qmeng222/transformers-for-NLP/blob/main/Seq2Seq/POS_tagger_with_custom_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
---

In a language translation scenario:
*   encoder: reads and understands the input sentence in one language and produces a compact, meaningful representation (context vector)
*   decoder: then takes this representation and generates the equivalent sentence in another language

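Example:
*   Encoder: processes the English sentence "Hello, how are you?" into a context vector.
*   Decoder: uses the context vector to generate the French translation "Bonjour, comment ça va ?"

This notebook uses only the encoder side: a pre-trained DistilBERT encoder with a token-classification head that predicts one part-of-speech (POS) tag per token.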

In [None]:
# install libraries:
!pip install transformers datasets
# `transformers` library: for using pre-trained models
# `datasets` library: to access a collection of high-quality datasets for NLP tasks



In [None]:
import nltk # import the NLTK library

nltk.download('universal_tagset') # download the standard set of POS tags
nltk.download('brown') # download the Brown Corpus (a popular dataset for NLP) to local machine
from nltk.corpus import brown # import the Brown Corpus
# `nltk.corpus` is a module in NLTK that contains various corpora, including the Brown Corpus

[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [None]:
# retrieve the tagged sentences from the Brown Corpus using the Universal Part-of-Speech tagset:
corpus = brown.tagged_sents(tagset='universal')
corpus

[[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'ADP'), ("Atlanta's", 'NOUN'), ('recent', 'ADJ'), ('primary', 'NOUN'), ('election', 'NOUN'), ('produced', 'VERB'), ('``', '.'), ('no', 'DET'), ('evidence', 'NOUN'), ("''", '.'), ('that', 'ADP'), ('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', '.')], [('The', 'DET'), ('jury', 'NOUN'), ('further', 'ADV'), ('said', 'VERB'), ('in', 'ADP'), ('term-end', 'NOUN'), ('presentments', 'NOUN'), ('that', 'ADP'), ('the', 'DET'), ('City', 'NOUN'), ('Executive', 'ADJ'), ('Committee', 'NOUN'), (',', '.'), ('which', 'DET'), ('had', 'VERB'), ('over-all', 'ADJ'), ('charge', 'NOUN'), ('of', 'ADP'), ('the', 'DET'), ('election', 'NOUN'), (',', '.'), ('``', '.'), ('deserves', 'VERB'), ('the', 'DET'), ('praise', 'NOUN'), ('and', 'CONJ'), ('thanks', 'NOUN'), ('of', 'ADP'), ('the', 'DET'), ('City

👆 `corpus` is a list of lists of tuples

In [None]:
corpus[3]

[('``', '.'),
 ('Only', 'ADV'),
 ('a', 'DET'),
 ('relative', 'ADJ'),
 ('handful', 'NOUN'),
 ('of', 'ADP'),
 ('such', 'ADJ'),
 ('reports', 'NOUN'),
 ('was', 'VERB'),
 ('received', 'VERB'),
 ("''", '.'),
 (',', '.'),
 ('the', 'DET'),
 ('jury', 'NOUN'),
 ('said', 'VERB'),
 (',', '.'),
 ('``', '.'),
 ('considering', 'ADP'),
 ('the', 'DET'),
 ('widespread', 'ADJ'),
 ('interest', 'NOUN'),
 ('in', 'ADP'),
 ('the', 'DET'),
 ('election', 'NOUN'),
 (',', '.'),
 ('the', 'DET'),
 ('number', 'NOUN'),
 ('of', 'ADP'),
 ('voters', 'NOUN'),
 ('and', 'CONJ'),
 ('the', 'DET'),
 ('size', 'NOUN'),
 ('of', 'ADP'),
 ('this', 'DET'),
 ('city', 'NOUN'),
 ("''", '.'),
 ('.', '.')]

*   each sub-list represents a sentence
*   each tuple contains a word with the corresponding tag

In [None]:
# separate the inputs and targets:
inputs = []
targets = []

for sentence_tag_pairs in corpus: # loop over sub-lists
  tokens = []
  target = []
  for token, tag in sentence_tag_pairs: # loop over tuples
    tokens.append(token)
    target.append(tag)
  inputs.append(tokens)
  targets.append(target)
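
The nested loop above can also be written with `zip(*...)`, which transposes a list of `(word, tag)` pairs into one tuple of words and one tuple of tags. The sketch below uses a small hand-written `toy_corpus` (an assumption, not the Brown Corpus itself) so it runs without downloading anything. Note that `zip(*sentence)` would fail to unpack on an empty sentence.

```python
# Toy stand-in with the same shape as `corpus`:
# a list of sentences, each a list of (word, tag) tuples.
toy_corpus = [
    [("The", "DET"), ("jury", "NOUN"), ("said", "VERB"), (".", ".")],
    [("Andy", "NOUN"), ("crumbled", "VERB"), ("the", "DET"), ("script", "NOUN")],
]

toy_inputs, toy_targets = [], []
for sentence in toy_corpus:
    # zip(*pairs) turns [(w1, t1), (w2, t2), ...] into (w1, w2, ...) and (t1, t2, ...)
    words, tags = zip(*sentence)
    toy_inputs.append(list(words))
    toy_targets.append(list(tags))

print(toy_inputs[0])   # ['The', 'jury', 'said', '.']
print(toy_targets[0])  # ['DET', 'NOUN', 'VERB', '.']
```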

In [None]:
# save data to json format

import json # import the json module for working with JSON (JavaScript Object Notation) data

with open('data.json', 'w') as f: # open the JSON file in write mode, ensuring that the file is properly closed after writing
  for x, y in zip(inputs, targets):
    j = {'inputs': x, 'targets': y} # create a Python dictionary (j)
    s = json.dumps(j) # Python dictionary (j) -> JSON-formatted string (s)
    f.write(f"{s}\n") # string `s` is written to the file `f` followed by a newline character `\n`
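
The file written above is in JSON Lines format: one JSON object per line. As a quick sanity check, the sketch below writes two hand-written records to a temporary file (the file name `toy.jsonl` and the records are made up for illustration) and reads them back with the standard library only.

```python
import json
import os
import tempfile

# Two illustrative records with the same keys as in data.json.
records = [
    {"inputs": ["The", "jury", "said", "."], "targets": ["DET", "NOUN", "VERB", "."]},
    {"inputs": ["Andy", "crumbled"], "targets": ["NOUN", "VERB"]},
]

path = os.path.join(tempfile.mkdtemp(), "toy.jsonl")

# Write: one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read back: parse each line independently.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == records)  # True
```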

In [None]:
from datasets import load_dataset # from the library, import the function

In [None]:
data = load_dataset("json", data_files='data.json')

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

👆 **Download dataset vs. load dataset**: in many cases, the process of loading a dataset might implicitly involve downloading it if the dataset is not already present on your system.
*   Downloading is the process of obtaining the raw dataset files
*   Loading involves preparing the dataset for use in the code by reading, parsing, and organizing the data

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['inputs', 'targets'],
        num_rows: 57340
    })
})

In [None]:
small = data["train"].shuffle(seed=42).select(range(20_000)) # create a smaller, shuffled subset of the training data
# .shuffle(): this method shuffles the training examples
# seed=42: the seed parameter is set to 42 to ensure reproducibility
# .select(range(20_000)): selects the first 20,000 examples (0-19,999)
small

Dataset({
    features: ['inputs', 'targets'],
    num_rows: 20000
})

In [None]:
# split the dataset into training and testing sets:
data = small.train_test_split(seed=42) # default test_size=0.25 (15,000 train / 5,000 test); the seed makes the split reproducible

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['inputs', 'targets'],
        num_rows: 15000
    })
    test: Dataset({
        features: ['inputs', 'targets'],
        num_rows: 5000
    })
})
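
To make the shuffle, select, and split steps concrete, here is a plain-Python analogue on 100 integer "rows". It is only a sketch of the idea under the assumption of a 25% test fraction; it does not reproduce the `datasets` implementation, so the actual rows chosen by the library will differ.

```python
import random

rows = list(range(100))   # stand-in for 100 dataset rows
rng = random.Random(42)   # fixed seed -> same order on every run

shuffled = rows[:]        # copy so the original order is untouched
rng.shuffle(shuffled)

subset = shuffled[:40]    # analogous to .select(range(40)) after shuffling

n_test = int(len(subset) * 0.25)   # hold out 25% for testing
test_rows = subset[:n_test]
train_rows = subset[n_test:]

print(len(train_rows), len(test_rows))  # 30 10
```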

In [None]:
data["train"]

Dataset({
    features: ['inputs', 'targets'],
    num_rows: 15000
})

In [None]:
# check the 10th example from the training set:
data["train"][9]

{'inputs': ['Andy', 'crumbled', 'the', 'script', 'in', 'his', 'fist', '.'],
 'targets': ['NOUN', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', '.']}

In [None]:
data["train"].features

{'inputs': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'targets': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}

👆 Both 'inputs' and 'targets' are sequences.

In [None]:
# create a set that contains unique elements:
target_set = set()
for target in targets:
  target_set = target_set.union(target) # only unique elements are retained (duplicates are excluded) in 'target_set'

target_set

{'.',
 'ADJ',
 'ADP',
 'ADV',
 'CONJ',
 'DET',
 'NOUN',
 'NUM',
 'PRON',
 'PRT',
 'VERB',
 'X'}

In [None]:
# map targets to ints (target <-> int)
# note: a set has no guaranteed iteration order, so the ids assigned here can differ between runs
target_list = list(target_set)
id2label = {k: v for k, v in enumerate(target_list)}
label2id = {v: k for k, v in id2label.items()}

In [None]:
id2label

{0: 'NUM',
 1: 'NOUN',
 2: 'VERB',
 3: 'DET',
 4: 'PRON',
 5: 'X',
 6: '.',
 7: 'ADP',
 8: 'CONJ',
 9: 'ADJ',
 10: 'ADV',
 11: 'PRT'}

In [None]:
label2id

{'NUM': 0,
 'NOUN': 1,
 'VERB': 2,
 'DET': 3,
 'PRON': 4,
 'X': 5,
 '.': 6,
 'ADP': 7,
 'CONJ': 8,
 'ADJ': 9,
 'ADV': 10,
 'PRT': 11}
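
Because the mapping above is built from a set of strings, the ids can change from one Python session to the next. One possible way to make the mapping reproducible is to sort the tags first. The sketch below is a suggestion rather than what this notebook does, and it uses separate names (`sorted_id2label`, `sorted_label2id`) so it does not overwrite the mappings used in the rest of the notebook.

```python
# The twelve universal POS tags found in the corpus (copied from `target_set` above).
tags = {'.', 'ADJ', 'ADP', 'ADV', 'CONJ', 'DET',
        'NOUN', 'NUM', 'PRON', 'PRT', 'VERB', 'X'}

# Sorting gives a stable order, so the ids are the same on every run.
sorted_id2label = dict(enumerate(sorted(tags)))
sorted_label2id = {label: i for i, label in sorted_id2label.items()}

print(sorted_label2id['.'], sorted_label2id['NOUN'], sorted_label2id['X'])  # 0 6 11
```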

In [None]:
from transformers import AutoTokenizer # import the class, enabling dynamic loading of tokenizer for a specific pre-trained model

checkpoint = "distilbert-base-cased" # model identifier (specify the name of a pre-trained model)
tokenizer = AutoTokenizer.from_pretrained(checkpoint) # automatically load the appropriate tokenizer

In [None]:
# tokenize a specific input sequence from the training data:
idx = 9
t = tokenizer(data["train"][idx]["inputs"], is_split_into_words=True) # `inputs` is a list of pre-tokenized words rather than a single string
t

{'input_ids': [101, 4827, 172, 5697, 11813, 1103, 5444, 1107, 1117, 7374, 119, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
type(t)

transformers.tokenization_utils_base.BatchEncoding

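👆 Not a plain `dict`: a `BatchEncoding` behaves like a dictionary but also provides helper methods such as `tokens()` and `word_ids()`.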

In [None]:
# scratch:
# check the 10th example from the training set:
data["train"][9]

{'inputs': ['Andy', 'crumbled', 'the', 'script', 'in', 'his', 'fist', '.'],
 'targets': ['NOUN', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', '.']}

In [None]:
# inspect the string tokens in a list format:
t.tokens()

['[CLS]',
 'Andy',
 'c',
 '##rum',
 '##bled',
 'the',
 'script',
 'in',
 'his',
 'fist',
 '.',
 '[SEP]']

In [None]:
# idx i -> the i-th word in the input sentence (counting from 0)
t.word_ids()

[None, 0, 1, 1, 1, 2, 3, 4, 5, 6, 7, None]
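
To see how the word ids relate subword tokens back to the original words, the sketch below groups the tokens by word id. The `tokens` and `word_ids` lists are copied from the outputs above so that the example runs without loading the tokenizer.

```python
from collections import defaultdict

# Copied from t.tokens() and t.word_ids() above.
tokens = ['[CLS]', 'Andy', 'c', '##rum', '##bled', 'the',
          'script', 'in', 'his', 'fist', '.', '[SEP]']
word_ids = [None, 0, 1, 1, 1, 2, 3, 4, 5, 6, 7, None]

pieces_per_word = defaultdict(list)
for token, word_id in zip(tokens, word_ids):
    if word_id is not None:   # skip special tokens such as [CLS] and [SEP]
        pieces_per_word[word_id].append(token)

print(pieces_per_word[1])  # ['c', '##rum', '##bled']  (the word "crumbled")
```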

In [None]:
# label2id (label ids), as printed above:
# {'NUM': 0,
#  'NOUN': 1,
#  'VERB': 2,
#  'DET': 3,
#  'PRON': 4,
#  'X': 5,
#  '.': 6,
#  'ADP': 7,
#  'CONJ': 8,
#  'ADJ': 9,
#  'ADV': 10,
#  'PRT': 11}

In [None]:
# data["train"][9]:
# {'inputs': ['Andy', 'crumbled', 'the', 'script', 'in', 'his', 'fist', '.'],
#  'targets': ['NOUN', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', '.']}

In [None]:
# For targets: word ids -> label ids
#          word_ids: [ None, 0, 1, 1, 1, 2, 3, 4, 5, 6, 7, None ]
# -> aligned_labels: [ -100, 1, 2, 2, 2, 3, 1, 7, 3, 1, 6, -100 ]
def align_targets(labels, word_ids):
  aligned_labels = []
  for word_id in word_ids:
    if word_id is None: # special token such as [CLS] or [SEP]; -100 is the index ignored by PyTorch's cross-entropy loss
      label = -100
    else: # it's a real word
      label = label2id[labels[word_id]]

    # add the label
    aligned_labels.append(label)

  return aligned_labels

In [None]:
# try it out:
labels = data['train'][idx]['targets']
word_ids = t.word_ids()
aligned_targets = align_targets(labels, word_ids)
aligned_targets

[-100, 1, 2, 2, 2, 3, 1, 7, 3, 1, 6, -100]
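
`align_targets` gives every subword of a word the same label. A common alternative, shown here only as a sketch and not as this notebook's method, labels just the first subword of each word and masks the rest with -100 so that each word contributes once to the loss. The function name `align_first_subword_only` and the copied `label2id_example` mapping are introduced here for illustration.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

# Copy of the label2id mapping printed earlier in this notebook.
label2id_example = {'NUM': 0, 'NOUN': 1, 'VERB': 2, 'DET': 3, 'PRON': 4, 'X': 5,
                    '.': 6, 'ADP': 7, 'CONJ': 8, 'ADJ': 9, 'ADV': 10, 'PRT': 11}

def align_first_subword_only(labels, word_ids, label2id):
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # special token, or a continuation subword of the same word
            aligned.append(IGNORE_INDEX)
        else:
            # first subword of a new word
            aligned.append(label2id[labels[word_id]])
        previous = word_id
    return aligned

example_labels = ['NOUN', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', '.']
example_word_ids = [None, 0, 1, 1, 1, 2, 3, 4, 5, 6, 7, None]
result = align_first_subword_only(example_labels, example_word_ids, label2id_example)
print(result)  # [-100, 1, 2, -100, -100, 3, 1, 7, 3, 1, 6, -100]
```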

In [None]:
# sanity check: map the aligned label ids back to tag names and print them next to the tokens
aligned_labels = [id2label[i] if i >= 0 else None for i in aligned_targets]
for x, y in zip(t.tokens(), aligned_labels):
  print(f"{x}\t{y}")

[CLS]	None
Andy	NOUN
c	VERB
##rum	VERB
##bled	VERB
the	DET
script	NOUN
in	ADP
his	DET
fist	NOUN
.	.
[SEP]	None


In [None]:
# tokenize both inputs and targets
def tokenize_fn(batch):
  # tokenize the input sequence first
  # this populates input_ids, attention_mask, etc.
  tokenized_inputs = tokenizer(
    batch['inputs'], truncation=True, is_split_into_words=True
  )

  labels_batch = batch['targets'] # original targets
  aligned_labels_batch = []
  for i, labels in enumerate(labels_batch):
    word_ids = tokenized_inputs.word_ids(i)
    aligned_labels_batch.append(align_targets(labels, word_ids))

  # the model expects the targets under the key 'labels'
  tokenized_inputs['labels'] = aligned_labels_batch

  return tokenized_inputs

In [None]:
# want to remove these from model inputs - they are neither inputs nor targets
data["train"].column_names

['inputs', 'targets']

In [None]:
# apply `tokenize_fn` to every split of `data` (in batches) and store the result as a new dataset (`tokenized_datasets`):
tokenized_datasets = data.map(
  tokenize_fn,
  batched=True, # apply tokenization function to examples in batches rather than individually
  remove_columns=data["train"].column_names, # specify the columns to be removed from the resulting tokenized dataset
)

Map:   0%|          | 0/15000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 15000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 5000
    })
})

In [None]:
from transformers import DataCollatorForTokenClassification # import the collator that pads inputs and labels to a common length within each batch

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer) # create an instance of the class
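
Sentences in a batch have different lengths, so they must be padded before they can be stacked. The sketch below shows roughly what happens to one batch in plain Python. `pad_batch` is a hypothetical helper, `pad_token_id=0` is an assumption for this example, and the real collator additionally returns tensors and takes its padding settings from the tokenizer.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every example to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)  # 0 = ignore padding
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)       # -100 = no loss on padding
    return batch

features = [
    {"input_ids": [101, 4827, 102], "attention_mask": [1, 1, 1], "labels": [-100, 1, -100]},
    {"input_ids": [101, 1103, 5444, 119, 102], "attention_mask": [1, 1, 1, 1, 1],
     "labels": [-100, 3, 1, 6, -100]},
]

padded = pad_batch(features)
print(padded["input_ids"][0])  # [101, 4827, 102, 0, 0]
print(padded["labels"][0])     # [-100, 1, -100, -100, -100]
```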

In [None]:
# https://stackoverflow.com/questions/11264684/flatten-list-of-lists
def flatten(list_of_lists):
  flattened = [val for sublist in list_of_lists for val in sublist]
  return flattened

In [None]:
import numpy as np
from sklearn.metrics import f1_score, accuracy_score

def compute_metrics(logits_and_labels):
  logits, labels = logits_and_labels
  preds = np.argmax(logits, axis=-1)

  # remove -100 from labels and predictions
  labels_jagged = [[t for t in label if t != -100] for label in labels]

  # do the same for predictions whenever true label is -100
  preds_jagged = [
      [p for p, t in zip(ps, ts) if t != -100]
      for ps, ts in zip(preds, labels)
  ]

  # flatten labels and preds
  labels_flat = flatten(labels_jagged)
  preds_flat = flatten(preds_jagged)

  acc = accuracy_score(labels_flat, preds_flat)
  f1 = f1_score(labels_flat, preds_flat, average='macro')

  return {
    'f1': f1,
    'accuracy': acc,
  }

In [None]:
labels = [[-100, 0, 0, 1, 2, 1, -100]]
logits = np.array([[
  [0.8, 0.1, 0.1],
  [0.8, 0.1, 0.1],
  [0.8, 0.1, 0.1],
  [0.1, 0.8, 0.1],
  [0.1, 0.8, 0.1],
  [0.1, 0.8, 0.1],
  [0.1, 0.8, 0.1],
]])
compute_metrics((logits, labels))

{'f1': 0.6, 'accuracy': 0.8}
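
The numbers above can be checked by hand. After dropping the two -100 positions, the true labels are `[0, 0, 1, 2, 1]` and the argmax predictions are `[0, 0, 1, 1, 1]`. The sketch below recomputes accuracy and macro F1 without scikit-learn; `f1_for_class` is a helper written just for this check.

```python
true = [0, 0, 1, 2, 1]   # labels with -100 positions removed
pred = [0, 0, 1, 1, 1]   # argmax of the logits at the same positions

def f1_for_class(c):
    tp = sum(t == c and p == c for t, p in zip(true, pred))
    fp = sum(t != c and p == c for t, p in zip(true, pred))
    fn = sum(t == c and p != c for t, p in zip(true, pred))
    if tp == 0:
        return 0.0   # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

classes = sorted(set(true) | set(pred))
accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
macro_f1 = sum(f1_for_class(c) for c in classes) / len(classes)

# per-class F1: class 0 -> 1.0, class 1 -> 0.8, class 2 -> 0.0
print(round(accuracy, 4), round(macro_f1, 4))  # 0.8 0.6
```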

In [None]:
from transformers import AutoModelForTokenClassification # import the class to assign a label to each token in a seq

# load a pre-trained model for token classification using the HF transformers library:
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, # identifier of the pre-trained checkpoint
    # the mappings help the model understand how to map between indices and label names:
    id2label=id2label,
    label2id=label2id,
)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Customize the training process:

In [None]:
!pip install "transformers[torch]"



In [None]:
from transformers import TrainingArguments # (import the class) for configuring and customizing the training process

training_args = TrainingArguments(
    "distilbert-finetuned-ner", # directory for saving model checkpoints and results
    evaluation_strategy="epoch", # run evaluation at the end of each epoch (recent transformers releases rename this argument to `eval_strategy`)
    save_strategy="epoch", # save model checkpoints after each epoch
    num_train_epochs=2,
)

In [None]:
from transformers import Trainer # (class) for training models

trainer = Trainer(
    model=model, # the pre-trained model to train
    args=training_args, # training arguments
    train_dataset=tokenized_datasets["train"], # training dataset (tokenized)
    eval_dataset=tokenized_datasets["test"], # evaluation dataset (tokenized)
    data_collator=data_collator, # for batching and collating the tokenized data during training
    compute_metrics=compute_metrics, # specify the function (compute_metrics) for evaluating and computing metrics on the validation set during training
    tokenizer=tokenizer, # the tokenizer to tokenize the input data
)

# initiate the training process:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.0674,0.057496,0.949043,0.983361
2,0.0261,0.053899,0.952781,0.985582


TrainOutput(global_step=3750, training_loss=0.06693797925313313, metrics={'train_runtime': 300.2973, 'train_samples_per_second': 99.901, 'train_steps_per_second': 12.488, 'total_flos': 386308980825984.0, 'train_loss': 0.06693797925313313, 'epoch': 2.0})
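
The `global_step=3750` in the output can be reproduced with a little arithmetic. The sketch assumes the default per-device batch size of 8 (it is not overridden in `TrainingArguments` above), a single device, and no gradient accumulation.

```python
import math

num_train_examples = 15_000
per_device_batch_size = 8   # assumed: TrainingArguments default, not set explicitly above
num_epochs = 2

steps_per_epoch = math.ceil(num_train_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 1875 3750
```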

# Save the fine-tuned model & reload as pipeline:

In [None]:
# save the trained model, including its architecture and learned parameters, to a specified directory:
trainer.save_model('my_saved_model')

In [None]:
from transformers import pipeline # import the 'pipeline' function to use pre-trained models

# create a pipeline:
pipe = pipeline(
  "token-classification", # specify the pipeline task
  model='my_saved_model', # path to the saved model
  device=0, # use GPU for inference
)

In [None]:
# run POS tagging on an input string:
s = "Bill Gates was the CEO of Microsoft in Seattle, Washington."
pipe(s)

[{'entity': 'NOUN',
  'score': 0.99981195,
  'index': 1,
  'word': 'Bill',
  'start': 0,
  'end': 4},
 {'entity': 'NOUN',
  'score': 0.99984956,
  'index': 2,
  'word': 'Gates',
  'start': 5,
  'end': 10},
 {'entity': 'VERB',
  'score': 0.99970144,
  'index': 3,
  'word': 'was',
  'start': 11,
  'end': 14},
 {'entity': 'DET',
  'score': 0.9998983,
  'index': 4,
  'word': 'the',
  'start': 15,
  'end': 18},
 {'entity': 'NOUN',
  'score': 0.9997161,
  'index': 5,
  'word': 'CEO',
  'start': 19,
  'end': 22},
 {'entity': 'ADP',
  'score': 0.99986744,
  'index': 6,
  'word': 'of',
  'start': 23,
  'end': 25},
 {'entity': 'NOUN',
  'score': 0.9998246,
  'index': 7,
  'word': 'Microsoft',
  'start': 26,
  'end': 35},
 {'entity': 'ADP',
  'score': 0.9997514,
  'index': 8,
  'word': 'in',
  'start': 36,
  'end': 38},
 {'entity': 'NOUN',
  'score': 0.9998536,
  'index': 9,
  'word': 'Seattle',
  'start': 39,
  'end': 46},
 {'entity': '.',
  'score': 0.99990535,
  'index': 10,
  'word': ',',
  '

🎉 The model tags each token with the expected POS label: "Bill" and "Gates" are tagged NOUN, "was" VERB, "the" DET, and so on.
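
Each prediction also carries `start` and `end` character offsets into the input string. A minimal sketch of turning the raw output into `(text, tag)` pairs is shown below. The `predictions` list holds the first four entries copied from the output above, with `score` and `index` omitted for brevity, so the example runs without the saved model.

```python
s = "Bill Gates was the CEO of Microsoft in Seattle, Washington."

# First four entries of the pipeline output above (score and index omitted).
predictions = [
    {"entity": "NOUN", "word": "Bill", "start": 0, "end": 4},
    {"entity": "NOUN", "word": "Gates", "start": 5, "end": 10},
    {"entity": "VERB", "word": "was", "start": 11, "end": 14},
    {"entity": "DET", "word": "the", "start": 15, "end": 18},
]

# Slice the original string with the offsets instead of relying on `word`,
# which holds the tokenizer's subword piece for words that get split.
tagged = [(s[p["start"]:p["end"]], p["entity"]) for p in predictions]
print(tagged)  # [('Bill', 'NOUN'), ('Gates', 'NOUN'), ('was', 'VERB'), ('the', 'DET')]
```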