<a href="https://colab.research.google.com/github/sahug/ds-bert/blob/main/BERT%20NLP%20-%20Named%20Entity%20Recognition%20or%20Token%20Classification%20using%20BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BERT NLP - Named Entity Recognition or Token Classification using BERT**

**Token classification** assigns a label to individual tokens in a sentence. One of the most common token classification tasks is **Named Entity Recognition (NER)**. NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.

**Load Dataset**

In [1]:
%pip install -qq datasets

In [2]:
from datasets import load_dataset
ds = load_dataset("conll2003")
ds


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14042
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3251
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3454
    })
})

**Preprocess Data**

In [3]:
# Remove Unwanted Features
ds = ds.remove_columns(["id", "pos_tags", "chunk_tags"])

# Rename Columns
ds = ds.rename_column("ner_tags", "labels")
ds = ds.rename_column("tokens", "words")

In [4]:
ds["train"], ds["validation"], ds["test"]

(Dataset({
     features: ['words', 'labels'],
     num_rows: 14042
 }), Dataset({
     features: ['words', 'labels'],
     num_rows: 3251
 }), Dataset({
     features: ['words', 'labels'],
     num_rows: 3454
 }))

**Labels**

- B - indicates the beginning of an entity.
- I - indicates a token is contained inside the same entity (e.g., the State token is a part of an entity like Empire State Building).
- O - indicates the token doesn’t correspond to any entity.
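
To make the B/I/O scheme concrete, here is a minimal sketch that groups tagged words back into entities. The helper name `extract_entities` and the example sentence and tags are illustrative assumptions, not part of the dataset or of any library.

```python
def extract_entities(words, tags):
    """Group BIO-tagged words into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_words.append(word)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Hypothetical sentence and tags, chosen to match the example in the list above.
words = ["The", "Empire", "State", "Building", "is", "in", "New", "York"]
tags = ["O", "B-LOC", "I-LOC", "I-LOC", "O", "O", "B-LOC", "I-LOC"]
entities = extract_entities(words, tags)
print(entities)
```

The three tokens of "Empire State Building" share one entity because only the first carries a B- tag and the following two carry matching I- tags.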

In [5]:
ds["train"].features["labels"]

Sequence(feature=ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

In [6]:
ds["train"].features["labels"].feature.names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

**Dataset Preview**

In [7]:
print(ds["train"][0])
print(ds["train"][0]["words"])
print(ds["train"][0]["labels"])

{'words': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'labels': [3, 0, 7, 0, 0, 0, 7, 0, 0]}
['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
[3, 0, 7, 0, 0, 0, 7, 0, 0]
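
The integer labels are easier to read once mapped back to their names. A small sketch, with the label names and the first training example copied from the outputs above:

```python
# Label names in the order reported by the dataset's ClassLabel feature.
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# First training example, copied from the preview above.
words = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
label_ids = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Pair every word with the name of its label id.
decoded = [(word, label_names[i]) for word, i in zip(words, label_ids)]
for word, name in decoded:
    print(f"{word:10s} {name}")
```

Here "EU" is tagged as the start of an organization, while "German" and "British" are tagged as miscellaneous entities.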


In [8]:
print(ds["train"][3]["words"])
print(ds["train"][3]["labels"])

['The', 'European', 'Commission', 'said', 'on', 'Thursday', 'it', 'disagreed', 'with', 'German', 'advice', 'to', 'consumers', 'to', 'shun', 'British', 'lamb', 'until', 'scientists', 'determine', 'whether', 'mad', 'cow', 'disease', 'can', 'be', 'transmitted', 'to', 'sheep', '.']
[0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


**Data Preparation**

We will reshape the data into the format the model accepts. Each row currently holds a list of words and a matching list of labels, so the data is a list of lists and cannot be fed to the model as it is.

In [9]:
import pandas as pd
df_train = pd.DataFrame(ds["train"])
df_test = pd.DataFrame(ds["test"])
df_valid = pd.DataFrame(ds["validation"])

In [10]:
df_train.head()

,words,labels
0,"[EU, rejects, German, call, to, boycott, Briti...","[3, 0, 7, 0, 0, 0, 7, 0, 0]"
1,"[Peter, Blackburn]","[1, 2]"
2,"[BRUSSELS, 1996-08-22]","[5, 0]"
3,"[The, European, Commission, said, on, Thursday...","[0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, ..."
4,"[Germany, 's, representative, to, the, Europea...","[5, 0, 0, 0, 0, 3, 4, 0, 0, 0, 1, 2, 0, 0, 0, ..."


In [11]:
df_test.head()

,words,labels
0,"[SOCCER, -, JAPAN, GET, LUCKY, WIN, ,, CHINA, ...","[0, 0, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0]"
1,"[Nadim, Ladki]","[1, 2]"
2,"[AL-AIN, ,, United, Arab, Emirates, 1996-12-06]","[5, 0, 5, 6, 6, 0]"
3,"[Japan, began, the, defence, of, their, Asian,...","[5, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 0, 0, 0, ..."
4,"[But, China, saw, their, luck, desert, them, i...","[0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [12]:
df_valid.head()

,words,labels
0,"[CRICKET, -, LEICESTERSHIRE, TAKE, OVER, AT, T...","[0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0]"
1,"[LONDON, 1996-08-30]","[5, 0]"
2,"[West, Indian, all-rounder, Phil, Simmons, too...","[7, 8, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, ..."
3,"[Their, stay, on, top, ,, though, ,, may, be, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, ..."
4,"[After, bowling, Somerset, out, for, 83, on, t...","[0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 5, 6, 0, 3, ..."


Flatten the data so that each row pairs a single word with its label.

In [13]:
def align_words_with_labels(df):
  """Flatten sentence-level rows into one (word, label) row per token."""
  word, label = [], []

  # Walk each sentence and pair every word with its label.
  for words, labels in zip(df["words"], df["labels"]):
    for w, l in zip(words, labels):
      word.append(w)
      label.append(l)

  return pd.DataFrame({"word": word, "label": label})
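
As an alternative sketch, the same flattening can be expressed with `DataFrame.explode`, assuming pandas 1.3 or newer (which accepts a list of columns). The tiny `nested` frame below is made up for illustration and mirrors the first two training rows.

```python
import pandas as pd

# Two sentence-level rows in the same shape as the training data.
nested = pd.DataFrame({
    "words": [["EU", "rejects"], ["Peter", "Blackburn"]],
    "labels": [[3, 0], [1, 2]],
})

# Explode both list columns together so words and labels stay paired.
flat = nested.explode(["words", "labels"], ignore_index=True)
flat = flat.rename(columns={"words": "word", "labels": "label"})

# explode leaves object dtype behind; cast the labels back to integers.
flat["label"] = flat["label"].astype(int)
print(flat)
```

Both list columns must have the same length in every row, which holds here because each word has exactly one label.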

In [14]:
df_train = align_words_with_labels(df_train)
df_test = align_words_with_labels(df_test)
df_valid = align_words_with_labels(df_valid)

In [15]:
df_train.head()

,word,label
0,EU,3
1,rejects,0
2,German,7
3,call,0
4,to,0


In [16]:
df_test.head()

,word,label
0,SOCCER,0
1,-,0
2,JAPAN,5
3,GET,0
4,LUCKY,0


In [17]:
df_valid.head()

,word,label
0,CRICKET,0
1,-,0
2,LEICESTERSHIRE,3
3,TAKE,0
4,OVER,0


In [18]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
# Fit on the training labels only, then reuse the same mapping for the test set.
df_train["label"] = label_encoder.fit_transform(df_train["label"])
df_test["label"] = label_encoder.transform(df_test["label"])

In [19]:
# The model takes the word (a string) as input and predicts its label.
y_train = df_train["label"]
x_train = df_train["word"]
y_test = df_test["label"]
x_test = df_test["word"]

**One-Hot Encoding (OHE)**

In [20]:
import tensorflow as tf
ohe = tf.keras.layers.CategoryEncoding(num_tokens=9, output_mode="one_hot")
y_train = ohe(y_train)
y_test = ohe(y_test)

In [21]:
y_train

<tf.Tensor: shape=(203621, 9), dtype=float32, numpy=
array([[0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]], dtype=float32)>
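
For intuition, one-hot encoding turns each label id into a vector of nine values with a single 1 at the label's position. A minimal NumPy sketch of the same idea, using the first three training labels (3, 0, 7) from the flattened data above:

```python
import numpy as np

num_classes = 9
label_ids = np.array([3, 0, 7])  # first three training labels

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype=np.float32)[label_ids]
print(one_hot)

# argmax along the class axis recovers the original ids.
recovered = one_hot.argmax(axis=-1)
```

This matches the first rows of the tensor shown above: the 1 for label 3 falls inside the elided middle columns, label 0 sets the first column, and label 7 sets the second-to-last column.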

**BERT**

In [22]:
%pip install -qq tensorflow_hub
%pip install -U -qq tensorflow_text

In [23]:
import tensorflow_hub as hub
import tensorflow_text as text

**Preprocess**

In [45]:
preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")

In [46]:
def pre_process(example):  
  return preprocess(example)
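
The preprocessing layer returns a dictionary with three fixed-length integer tensors, `input_word_ids`, `input_mask` and `input_type_ids`, each 128 positions long. The pure-Python sketch below only imitates that structure. The helper `toy_bert_inputs` is hypothetical, the token ids 7, 8, 9 are made up, and the special ids 101 (`[CLS]`), 102 (`[SEP]`) and 0 (padding) are assumed from the standard uncased BERT vocabulary.

```python
def toy_bert_inputs(token_ids, seq_length=128, cls_id=101, sep_id=102, pad_id=0):
    """Imitate the layout of BERT preprocessing output for one single-segment input."""
    # Reserve two positions for [CLS] and [SEP], truncating longer inputs.
    ids = [cls_id] + list(token_ids)[: seq_length - 2] + [sep_id]
    n_real = len(ids)
    n_pad = seq_length - n_real
    return {
        # Token ids, padded on the right up to seq_length.
        "input_word_ids": ids + [pad_id] * n_pad,
        # 1 for real tokens (including [CLS] and [SEP]), 0 for padding.
        "input_mask": [1] * n_real + [0] * n_pad,
        # A single segment, so every position has type 0.
        "input_type_ids": [0] * seq_length,
    }


features = toy_bert_inputs([7, 8, 9])  # hypothetical word-piece ids
print(features["input_word_ids"][:6])
print(sum(features["input_mask"]))
```

Because each input here is a single word, most of the 128 positions are padding, and the mask tells the encoder to ignore them.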

**Encoder**

In [48]:
encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

In [49]:
def encode_input(preprocessed_text):  
  return encoder(preprocessed_text)

In [51]:
encode_input(pre_process(["CRICKET"]))["pooled_output"]

<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
array([[-0.92026347, -0.40817994, -0.26689175,  0.7907777 ,  0.27910346,
        -0.08950261,  0.91574216,  0.3270939 , -0.29746428, -0.9999853 ,
        -0.37653992,  0.82170445,  0.9844829 , -0.03137169,  0.91738856,
        -0.6537362 , -0.46163368, -0.60614115,  0.38942084, -0.6955212 ,
         0.6009076 ,  0.9965876 ,  0.33782148,  0.2889657 ,  0.42098036,
         0.875868  , -0.7644791 ,  0.9221854 ,  0.9521526 ,  0.7387567 ,
        -0.62886405,  0.31471002, -0.98837394, -0.25009716, -0.17155838,
        -0.9919301 ,  0.40328586, -0.7852357 ,  0.03087327, -0.10034633,
        -0.8894082 ,  0.3517057 ,  0.9997983 ,  0.1274255 ,  0.16281755,
        -0.3905328 , -0.99999875,  0.3105498 , -0.8838598 ,  0.451916  ,
         0.29466912,  0.25107566,  0.1498308 ,  0.5030369 ,  0.43993154,
         0.12921676,  0.05952758,  0.22110145, -0.23393197, -0.59752804,
        -0.56546617,  0.4240009 , -0.41598165, -0.91628456,  0.51859635,
 

**Model**

In [28]:
import tensorflow as tf
from tensorflow import keras
inputs = keras.layers.Input(shape=(), dtype=tf.string, name="inputs")
# Use new names so the hub layers `preprocess` and `encoder` are not overwritten.
preprocessed = pre_process(inputs)
encoded = encode_input(preprocessed)

# Classification head on top of the pooled output
nn1 = keras.layers.Dropout(0.1, name="dropout")(encoded["pooled_output"])
nn1 = keras.layers.Dense(9, activation="softmax", name="output")(nn1)

# Final model
model = keras.Model(inputs=[inputs], outputs=[nn1])
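
The output layer produces nine softmax probabilities per input word, one for each label. As a rough NumPy sketch of how such an output maps back to a label name (the logits below are invented for illustration and do not come from the model):

```python
import numpy as np

label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical pre-softmax scores for one word.
logits = np.array([[0.1, 0.2, 0.0, 2.5, 0.3, 0.1, 0.0, 0.4, 0.2]])
probs = softmax(logits)

# The predicted label is the class with the highest probability.
predicted = [label_names[i] for i in probs.argmax(axis=-1)]
print(predicted)
```

With a trained model, `probs` would be replaced by the array returned by `model.predict`.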

In [29]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 inputs (InputLayer)            [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_mask': (Non  0           ['inputs[0][0]']                 
                                e, 128),                                                          
                                 'input_word_ids':                                                
                                (None, 128),                                                      
                                 'input_type_ids':                                                
                                (None, 128)}                                                  

In [38]:
METRICS = [
           # Targets are one-hot over 9 classes, so measure categorical accuracy.
           tf.keras.metrics.CategoricalAccuracy(name="Accuracy"),
           tf.keras.metrics.Precision(name="Precision"),
           tf.keras.metrics.Recall(name="Recall"),
]
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=METRICS)
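
With one-hot targets over nine classes, the choice of accuracy metric matters: element-wise binary accuracy and categorical accuracy can diverge sharply. The NumPy sketch below uses made-up, uninformative predictions to show the gap. It reimplements both metrics by hand under the usual definitions (a 0.5 threshold for binary accuracy, argmax agreement for categorical accuracy) rather than calling Keras.

```python
import numpy as np

# One-hot targets for three words with labels 3, 5 and 7.
y_true = np.eye(9)[[3, 5, 7]]

# An uninformative prediction: the same probability for every class.
y_pred = np.full((3, 9), 1.0 / 9.0)

# Binary accuracy: threshold each of the 27 outputs at 0.5 and compare element-wise.
binary_acc = np.mean((y_pred > 0.5) == (y_true == 1))

# Categorical accuracy: does the highest-scoring class match the true class?
categorical_acc = np.mean(y_pred.argmax(axis=-1) == y_true.argmax(axis=-1))

print(f"binary accuracy:      {binary_acc:.3f}")
print(f"categorical accuracy: {categorical_acc:.3f}")
```

Binary accuracy credits the prediction for all 24 zeros it gets right out of 27 outputs, even though no word receives the correct label.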

In [None]:
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3)