<a href="https://colab.research.google.com/github/sahug/ds-bert/blob/main/BERT%20NLP%20-%20Session%2013%20-%20Named%20Entity%20Recognition%20or%20Token%20Classification%20using%20BERT..ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BERT NLP - Session 13 - Named Entity Recognition or Token Classification using BERT**

**Token classification** assigns a label to each individual token in a sentence. One of the most common token classification tasks is **Named Entity Recognition (NER)**, which labels every token with the type of entity it belongs to, such as a person, location, or organization, or marks it as belonging to no entity.

**Load Dataset**

In [35]:
%pip install -qq datasets

In [49]:
from datasets import load_dataset
ds = load_dataset("conll2003")
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14042
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3251
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3454
    })
})

**Preprocess Data**

In [50]:
# Remove Unwanted Features
ds = ds.remove_columns(["id", "pos_tags", "chunk_tags"])

# Rename Columns
ds = ds.rename_column("ner_tags", "labels")
ds = ds.rename_column("tokens", "words")

In [58]:
ds["train"]

Dataset({
    features: ['words', 'labels'],
    num_rows: 14042
})

**Labels**

- B- indicates the beginning of an entity.
- I- indicates a token inside an entity that continues from the previous token (e.g., the State token is part of an entity like Empire State Building).
- O indicates the token doesn't correspond to any entity.

In [59]:
ds["train"].features["labels"]

Sequence(feature=ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

In [62]:
ds["train"].features["labels"].feature.names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

**Dataset Preview**

In [63]:
print(ds["train"][0]["words"])
print(ds["train"][0]["labels"])

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
[3, 0, 7, 0, 0, 0, 7, 0, 0]
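
The integer labels are easier to read once they are mapped back to their names. The sketch below copies the label names and the first training example from the outputs above; the `decoded` variable is introduced here only for illustration.

```python
# Label names copied from the ClassLabel feature printed earlier
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# First training example, copied from the preview above
words = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
label_ids = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Pair each word with the string name of its label id
decoded = [(word, label_names[label_id]) for word, label_id in zip(words, label_ids)]
for word, name in decoded:
    print(f"{word:10s} {name}")
```

Here `EU` comes out as `B-ORG`, `German` and `British` as `B-MISC`, and every other word as `O`.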


**Tokenization**

In [38]:
%pip install -qq transformers

**DistilBERT**

In [65]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [66]:
inputs = tokenizer(ds["train"][0]["words"], is_split_into_words=True)
inputs.tokens()

['[CLS]',
 'eu',
 'rejects',
 'german',
 'call',
 'to',
 'boycott',
 'british',
 'lamb',
 '.',
 '[SEP]']

**Shift and Align Labels**

In [77]:
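def shift_label(label):
  # B-XXX labels have odd ids and the matching I-XXX label is the next id,
  # so adding 1 turns B-XXX into I-XXX. I-XXX and O labels are left unchanged.
  if label % 2 == 1:
    label += 1
  return label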

In [79]:
def align_labels_with_tokens(labels, word_ids):
  new_labels = []
  current_word = None
  for word_id in word_ids:
    if word_id is None:
      # Special tokens ([CLS], [SEP]) get -100 so the loss ignores them
      new_labels.append(-100)
    elif word_id != current_word:
      # First token of a new word keeps the word's label
      current_word = word_id
      new_labels.append(labels[word_id])
    else:
      # Later sub-tokens of the same word: B-XXX becomes I-XXX
      new_labels.append(shift_label(labels[word_id]))
  return new_labels
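
The first training sentence tokenizes into one token per word, so it never exercises the shifting branch. The sketch below repeats the two functions so it runs on its own and applies them to a hypothetical case: the word list, its labels, and the `word_ids` (a first word split into three sub-tokens) are made up for illustration and are not actual tokenizer output.

```python
def shift_label(label):
    # Odd ids are B-XXX labels; the matching I-XXX label is the next id
    if label % 2 == 1:
        label += 1
    return label


def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)  # special token, ignored by the loss
        elif word_id != current_word:
            current_word = word_id
            new_labels.append(labels[word_id])  # first sub-token of a word
        else:
            new_labels.append(shift_label(labels[word_id]))  # later sub-tokens
    return new_labels


# Hypothetical example: two words labelled B-PER (1) and O (0)
word_labels = [1, 0]
# Assumed sub-token layout: [CLS], three pieces of word 0, one piece of word 1, [SEP]
word_ids = [None, 0, 0, 0, 1, None]

aligned = align_labels_with_tokens(word_labels, word_ids)
print(aligned)  # [-100, 1, 2, 2, 0, -100]
```

Only the first sub-token keeps `B-PER` (1); the two continuation pieces become `I-PER` (2), and both special tokens receive -100.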

**Function for Token and Labels**

In [81]:
def tokenize_and_align_labels(examples):
  tokenized_inputs = tokenizer(examples["words"], truncation=True, is_split_into_words=True)

  new_labels = []

  for i, labels in enumerate(examples["labels"]):
    word_ids = tokenized_inputs.word_ids(batch_index=i)
    new_labels.append(align_labels_with_tokens(labels, word_ids))

  tokenized_inputs["labels"] = new_labels
  return tokenized_inputs

**Use Map to Apply above Function**

In [82]:
tokenized_ds = ds.map(tokenize_and_align_labels, batched=True)

In [83]:
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['words', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 14042
    })
    validation: Dataset({
        features: ['words', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 3251
    })
    test: Dataset({
        features: ['words', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 3454
    })
})

**DataCollatorForTokenClassification**

In [84]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf")

data_collator

DataCollatorForTokenClassification(tokenizer=PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}), padding=True, max_length=None, pad_to_multiple_of=None, label_pad_token_id=-100, return_tensors='tf')
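
The settings printed above (`padding=True`, `padding_side='right'`, `label_pad_token_id=-100`) describe dynamic padding: each batch is padded on the right to its longest sequence, and the labels are padded with -100. The following is a plain-Python sketch of that idea, not the library's implementation; `pad_batch` and the token ids in the example batch are hypothetical.

```python
def pad_batch(batch, pad_token_id=0, label_pad_id=-100):
    """Right-pad every example in the batch to the longest sequence length."""
    max_len = max(len(example["input_ids"]) for example in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for example in batch:
        length = len(example["input_ids"])
        n_pad = max_len - length
        padded["input_ids"].append(example["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        padded["attention_mask"].append([1] * length + [0] * n_pad)
        # Padded label positions get -100 so they do not contribute to the loss
        padded["labels"].append(example["labels"] + [label_pad_id] * n_pad)
    return padded


# Hypothetical batch of two already-tokenized examples with different lengths
batch = [
    {"input_ids": [101, 7, 8, 102], "labels": [-100, 3, 0, -100]},
    {"input_ids": [101, 9, 102], "labels": [-100, 1, -100]},
]

padded = pad_batch(batch)
print(padded["input_ids"][1])       # [101, 9, 102, 0]
print(padded["attention_mask"][1])  # [1, 1, 1, 0]
print(padded["labels"][1])          # [-100, 1, -100, -100]
```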

**Finetune**

In [85]:
tf_train_set = tokenized_ds["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator)

tf_validation_set = tokenized_ds["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator)

In [97]:
tf_train_set

<PrefetchDataset element_spec={'labels': TensorSpec(shape=(16, None), dtype=tf.int64, name=None), 'input_ids': TensorSpec(shape=(16, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(16, None), dtype=tf.int64, name=None)}>

**Create Optimizer**

In [108]:
from transformers import create_optimizer
import tensorflow as tf

batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_ds["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
optimizer, schedule

(<keras.optimizer_v2.adam.Adam at 0x7f77b50d7f50>,
 <keras.optimizer_v2.learning_rate_schedule.PolynomialDecay at 0x7f77b50d7250>)
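
To make the schedule concrete, the arithmetic can be reproduced by hand. This sketch assumes `create_optimizer` is left at its defaults, which as far as I know give a linear decay (power 1.0) down to a learning rate of 0; `linear_decay_lr` is a hypothetical helper, not part of the library.

```python
num_rows = 14042   # size of the training split shown earlier
batch_size = 16
num_epochs = 5
init_lr = 2e-5

batches_per_epoch = num_rows // batch_size           # 877
total_train_steps = batches_per_epoch * num_epochs   # 4385


def linear_decay_lr(step):
    """Learning rate after `step` optimizer steps under a linear decay to 0."""
    step = min(step, total_train_steps)
    return init_lr * (1 - step / total_train_steps)


print(total_train_steps)                   # 4385
print(linear_decay_lr(0))                  # 2e-05
print(linear_decay_lr(total_train_steps))  # 0.0
```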

**TFAutoModelForTokenClassification**

In [109]:
from transformers import TFAutoModelForTokenClassification

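# The classification head needs one output per NER label (9 for CoNLL-2003)
label_names = ds["train"].features["labels"].feature.names
model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_names))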

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForTokenClassification: ['activation_13', 'vocab_layer_norm', 'vocab_transform', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_139', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

**Compile**

In [110]:
import tensorflow as tf

model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour, please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


**Fit**

In [None]:
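# Train for num_epochs so the run matches the step count used to build the learning-rate schedule
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=num_epochs)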
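
Once training finishes, the model outputs one row of logits per token. A minimal sketch of turning those into label names, skipping the positions labelled -100, might look like the following; the logits are made-up numbers rather than real model output, and `decode_predictions` is a hypothetical helper.

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


def decode_predictions(logits, labels):
    """Map per-token logits to label names, ignoring positions labelled -100."""
    pred_names, true_names = [], []
    for token_logits, label in zip(logits, labels):
        if label == -100:
            continue  # special tokens and padding carry no label
        # Index of the largest logit is the predicted label id
        pred_id = max(range(len(token_logits)), key=token_logits.__getitem__)
        pred_names.append(label_names[pred_id])
        true_names.append(label_names[label])
    return pred_names, true_names


# Made-up logits for the tokens [CLS], eu, rejects, [SEP] (9 scores per token)
logits = [
    [0.1] * 9,
    [0.2, 0.1, 0.0, 3.1, 0.3, 0.0, 0.1, 0.2, 0.0],
    [4.0, 0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1, 0.0],
    [0.1] * 9,
]
labels = [-100, 3, 0, -100]

pred_names, true_names = decode_predictions(logits, labels)
print(pred_names)  # ['B-ORG', 'O']
print(true_names)  # ['B-ORG', 'O']
```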