##### Copyright 2021 The TensorFlow Authors.

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# TensorFlow Addons Layers: CRF

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/addons/tutorials/layers_crf"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/addons/blob/master/docs/tutorials/layers_crf.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/addons/blob/master/docs/tutorials/layers_crf.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
      <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/addons/docs/tutorials/layers_crf.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

## Overview

This notebook demonstrates how to use the CRF (Conditional Random Field) layer in TensorFlow Addons.
It introduces the CRF layer by building a named entity recognition (NER) model.


## Setup

In [2]:
!pip install -U -q tensorflow-addons
!pip install -q tensorflow
!pip install -q datasets


In [3]:
import copy

import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa
import datasets

## Training data

We load the CoNLL 2003 dataset by using the datasets library.

In [4]:
conll_data = datasets.load_dataset("conll2003")


Inspect the data splits and features:

In [5]:
conll_data

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

Get a sample of train data and print it out:

In [6]:
for item in conll_data["train"]:
  sample_tokens = item['tokens']
  sample_tag_ids = item["ner_tags"]
  print(sample_tokens)
  print(sample_tag_ids)
  break

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
[3, 0, 7, 0, 0, 0, 7, 0, 0]


For our NER model, the inputs are the tokens, given as a list of strings. The outputs are the NER tags, which the dataset stores as tag ids.

The dataset also provides the mapping between NER tags and ids.

In [7]:
dataset_builder = datasets.load_dataset_builder('conll2003')
raw_tags = dataset_builder.info.features['ner_tags'].feature.names
print(raw_tags)

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


Let us decode the NER tag ids to tags.

In [8]:
sample_tags = [raw_tags[i] for i in sample_tag_ids]

print(sample_tokens)
print(sample_tags)

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']


These tags encode the named entities in a particular format. In this dataset, tags are encoded in [IOB](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) format.
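To make the tagging scheme concrete, here is a minimal sketch that groups IOB tags back into entity spans. The helper `iob_to_spans` is hypothetical (it is not part of the dataset or of TensorFlow Addons), and it assumes every entity starts with a `B-` tag, as in the sample above.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" closes the open entity. In this sketch an I- tag with no
            # matching open entity is treated like "O" and dropped.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
iob_tags = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
print(iob_to_spans(tokens, iob_tags))
# [('ORG', 'EU'), ('MISC', 'German'), ('MISC', 'British')]
```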

We add a special tag `<PAD>` to the tag set to represent padding in a sequence. In NLP, 0 is usually used to mark padding. This is the default setting for many functions in machine learning software (including TensorFlow).

We create a list so that we can look up tags by id.

In [9]:
tags = ['<PAD>'] + raw_tags
print(tags)

['<PAD>', 'O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
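Because `<PAD>` now occupies index 0, every tag id from the dataset has to be shifted up by one before it can index into the new list. The following plain-Python sketch checks that the shift preserves the decoded tags and shows how padding positions decode; the three padding positions are made up for illustration.

```python
raw_tags = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
tags = ['<PAD>'] + raw_tags

raw_ids = [3, 0, 7, 0, 0, 0, 7, 0, 0]      # ids as stored in the dataset
shifted_ids = [i + 1 for i in raw_ids]      # ids that index into `tags`

# The shift keeps the meaning of every id unchanged.
assert [tags[i] for i in shifted_ids] == [raw_tags[i] for i in raw_ids]

# Id 0 is now free to mark padding (three illustrative padding positions).
padded_ids = shifted_ids + [0, 0, 0]
print([tags[i] for i in padded_ids])
# ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O', '<PAD>', '<PAD>', '<PAD>']
```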


Define some constants that we will use later.

In [10]:
TAG_SIZE = len(tags)
VOCAB_SIZE = 20000

Build the vocabulary lookup layer for tokens.

In [11]:
train_tokens = tf.ragged.constant(conll_data["train"]["tokens"])
train_tokens = tf.map_fn(tf.strings.lower, train_tokens)

lookup_layer = tf.keras.layers.StringLookup(max_tokens=VOCAB_SIZE, mask_token="[MASK]", oov_token="[UNK]")
lookup_layer.adapt(train_tokens)

print(len(lookup_layer.get_vocabulary()))
print(lookup_layer.get_vocabulary()[:10])

20000
['[MASK]', '[UNK]', 'the', '.', ',', 'of', 'in', 'to', 'a', 'and']
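As a rough mental model of what the lookup layer does, the sketch below builds a frequency-ordered vocabulary in plain Python, reserving index 0 for the mask token and index 1 for the out-of-vocabulary token. The helpers `build_vocab` and `lookup` are hypothetical and only approximate the behaviour seen in the output above; they are not the Keras implementation.

```python
from collections import Counter


def build_vocab(sentences, max_tokens, mask_token="[MASK]", oov_token="[UNK]"):
    """Return a vocabulary list: special tokens first, then tokens by frequency."""
    counts = Counter(tok.lower() for sent in sentences for tok in sent)
    # Two of the `max_tokens` slots are taken by the special tokens.
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return [mask_token, oov_token] + most_common


def lookup(vocab, tokens, oov_index=1):
    """Map tokens to ids; unknown tokens fall back to the OOV index."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok.lower(), oov_index) for tok in tokens]


sentences = [["The", "EU", "rejects", "the", "call"], ["The", "call", "failed"]]
vocab = build_vocab(sentences, max_tokens=5)
print(vocab)
# ['[MASK]', '[UNK]', 'the', 'call', 'eu']
print(lookup(vocab, ["the", "call", "boycott"]))
# [2, 3, 1]
```

Index 0 is never produced for a real token, which is what lets the embedding layer later treat 0 as padding.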


Create the raw (not yet preprocessed) training dataset.

In [12]:
def create_data_generator(dataset):
  def data_generator():
    for item in dataset:
      yield item['tokens'], item['ner_tags']
  
  return data_generator

data_signature= (
        tf.TensorSpec(shape=(None,), dtype=tf.string),
        tf.TensorSpec(shape=(None, ), dtype=tf.int32)
)

train_data = tf.data.Dataset.from_generator(
    create_data_generator(conll_data["train"]),
    output_signature=data_signature
)

Create the preprocessed, batched training dataset that is used for training.

In [13]:
def dataset_preprocess(tokens, tag_ids):
    preprocessed_tokens = preprecess_tokens(tokens)

    # increase by 1 for all tag_ids,
    # because we add `<PAD>` as the first element in tags list
    preprocessed_tag_ids = tag_ids + 1

    return preprocessed_tokens, preprocessed_tag_ids

def preprecess_tokens(tokens):
    tokens = tf.strings.lower(tokens)
    return lookup_layer(tokens)

BATCH_SIZE = 2048

# With `padded_batch()`, each batch is padded to its own longest sequence,
# so different batches may have different lengths.
# shape: (batch_size, None)
train_dataset = (
    train_data.map(dataset_preprocess)
    .padded_batch(batch_size=BATCH_SIZE).cache()
)
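The effect of per-batch padding can be illustrated without TensorFlow. The sketch below is a simplified stand-in for `padded_batch()`; the generator `padded_batches` is a hypothetical helper that pads each batch with 0 up to the longest sequence in that batch.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Yield batches in which every sequence is padded to the batch maximum."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        max_len = max(len(seq) for seq in batch)
        yield [seq + [pad_value] * (max_len - len(seq)) for seq in batch]


# Made-up tag id sequences (already shifted by one, so 0 is free for padding).
example_tag_ids = [[4, 1, 8], [1], [2, 3, 1, 1, 6], [1, 1]]
for batch in padded_batches(example_tag_ids, batch_size=2):
    print(batch)
# [[4, 1, 8], [1, 0, 0]]
# [[2, 3, 1, 1, 6], [1, 1, 0, 0, 0]]
```

With 14041 training rows and a batch size of 2048, batching yields 7 batches per epoch.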

## Method one: Using the CRF layer in a custom training loop

### Creating model

Define the BiLSTM+CRF model using the `tfa.layers.CRF` layer.
The CRF layer outputs not only the CRF decoding result (`decode_sequence`) but also some internal variables (`potentials`, `sequence_length` and `kernel`). You will use these internal variables to compute the loss value later.

In [14]:
# Build the model
def build_embedding_bilstm_crf_model(
    vocab_size: int, embed_dims: int, lstm_unit: int, tag_size: int
) -> tf.keras.Model:
    x = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="x")
    y = tf.keras.layers.Embedding(vocab_size, embed_dims, mask_zero=True)(x)
    y = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(lstm_unit, return_sequences=True)
    )(y)
    decode_sequence, potentials, sequence_length, kernel = tfa.layers.CRF(tag_size)(y)

    return tf.keras.Model(
        inputs=x, outputs=[decode_sequence, potentials, sequence_length, kernel]
    )


model = build_embedding_bilstm_crf_model(VOCAB_SIZE, 32, 64, TAG_SIZE)
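To give an intuition for the decoding result, here is a plain-Python sketch of Viterbi decoding for a single sequence. It assumes per-step tag scores play the role of `potentials` and a tag-to-tag score matrix plays the role of `kernel`; the function `viterbi_decode` below is an illustrative helper written for this notebook, not the TensorFlow Addons implementation, and it ignores masking and boundary terms.

```python
def viterbi_decode(potentials, transitions):
    """Return (best_path, best_score) for one sequence.

    potentials[t][j]:  score of tag j at step t.
    transitions[i][j]: score of moving from tag i to tag j.
    """
    num_tags = len(potentials[0])
    scores = list(potentials[0])
    backpointers = []
    for t in range(1, len(potentials)):
        new_scores, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for ending in tag j at step t.
            best_prev = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + potentials[t][j])
        scores = new_scores
        backpointers.append(pointers)

    # Follow the back pointers from the best final tag.
    last = max(range(num_tags), key=lambda j: scores[j])
    best_score = scores[last]
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best_score


# Toy scores: 3 steps, 2 tags; staying on the same tag is rewarded.
toy_potentials = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_transitions = [[0.5, -1.0], [-1.0, 0.5]]
print(viterbi_decode(toy_potentials, toy_transitions))
# ([0, 0, 0], 3.0)
```

In this toy case the transition scores outweigh the middle step's preference for tag 1, which is the kind of sequence-level consistency a CRF adds on top of per-token scores.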



Run the model on a single batch of data, and inspect the output:

In [15]:
# preprocess
preprecessd_tokens = preprecess_tokens(sample_tokens)

# expand the tensor to shape [1, None], i.e. add the batch dim
inputs = tf.expand_dims(preprecessd_tokens, axis=0)

outputs, *_ = model(inputs)
print(outputs[0])

tf.Tensor([3 6 2 3 6 2 3 6 2], shape=(9,), dtype=int32)


  "CRF decoding models have serialization issues in TF >=2.5 . Please see isse #2476"


### Define CRF loss function

Using the real y and some internal variables of the CRF layer, you can compute the log-likelihood of the real y. Use the negative log-likelihood as the loss to optimize.

In [16]:
@tf.function
def crf_loss_func(potentials, sequence_length, kernel, y):
    crf_likelihood, _ = tfa.text.crf_log_likelihood(
        potentials, y, sequence_length, kernel
    )
    # likelihood to loss
    flat_crf_loss = -1 * crf_likelihood
    crf_loss = tf.reduce_mean(flat_crf_loss)

    return crf_loss
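The quantity being negated above can be worked out by hand for a tiny case. The sketch below is a plain-Python illustration of a linear-chain CRF negative log-likelihood for one unmasked sequence: the score of the gold path minus the log of the summed exponentiated scores of all paths, computed with the forward algorithm. All function names are hypothetical helpers, and the toy scores are made up; this is not the `tfa.text.crf_log_likelihood` implementation.

```python
import math
from itertools import product


def path_score(potentials, transitions, path):
    """Sum of per-step tag scores and tag-to-tag transition scores."""
    score = sum(potentials[t][tag] for t, tag in enumerate(path))
    score += sum(transitions[a][b] for a, b in zip(path, path[1:]))
    return score


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(potentials, transitions):
    """Forward algorithm: log of the summed exp-scores over all tag paths."""
    num_tags = len(potentials[0])
    alphas = list(potentials[0])
    for t in range(1, len(potentials)):
        alphas = [
            potentials[t][j]
            + log_sum_exp([alphas[i] + transitions[i][j] for i in range(num_tags)])
            for j in range(num_tags)
        ]
    return log_sum_exp(alphas)


def crf_negative_log_likelihood(potentials, transitions, path):
    return log_partition(potentials, transitions) - path_score(potentials, transitions, path)


toy_potentials = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_transitions = [[0.5, -1.0], [-1.0, 0.5]]
gold_path = [0, 1, 0]

nll = crf_negative_log_likelihood(toy_potentials, toy_transitions, gold_path)

# Cross-check the forward algorithm by enumerating all 2**3 paths.
brute_force = math.log(sum(
    math.exp(path_score(toy_potentials, toy_transitions, p))
    for p in product(range(2), repeat=3)
))
print(nll, brute_force - path_score(toy_potentials, toy_transitions, gold_path))
```

The two printed values should agree up to floating-point error. The forward algorithm obtains the same normaliser without enumerating every path, which is what makes the loss tractable for long sequences.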

### Define optimizer, metrics and train_step function

In [17]:
optimizer = tf.keras.optimizers.Adam(0.02)

train_loss = tf.keras.metrics.Mean(name="train_loss")

@tf.function(experimental_relax_shapes=True)
def train_step(x, y):
    with tf.GradientTape() as tape:
        decoded_sequence, potentials, sequence_length, kernel = model(x)
        crf_loss = crf_loss_func(potentials, sequence_length, kernel, y)
        loss = crf_loss + tf.reduce_sum(model.losses)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    train_loss(loss)

### Training model

In [18]:
EPOCHS = 10

for epoch in range(EPOCHS):
    # Reset the metrics at the start of each epoch
    train_loss.reset_states()

    for x, y in train_dataset:
        train_step(x, y)

    print(f"Epoch {epoch + 1}, " f"Loss: {train_loss.result()}")



Epoch 1, Loss: 17.154226303100586
Epoch 2, Loss: 8.932731628417969
Epoch 3, Loss: 5.642122745513916
Epoch 4, Loss: 4.217898845672607
Epoch 5, Loss: 3.090301275253296
Epoch 6, Loss: 1.8915987014770508
Epoch 7, Loss: 1.1709725856781006
Epoch 8, Loss: 0.81854248046875
Epoch 9, Loss: 0.6320579648017883
Epoch 10, Loss: 0.5205766558647156


### Making inference

Inspect the prediction result.

In [19]:
# print the inputs and expected outputs
print("raw inputs: ", sample_tokens)

# preprocess
preprocessed_inputs = preprecess_tokens(
    sample_tokens
)
# expand the batch dim
inputs = tf.reshape(preprocessed_inputs, shape=[1, -1])

outputs, *_ = model.predict(inputs)
prediction = [tags[i] for i in outputs[0]]

# Keypoint: EU -> B-ORG, German -> B-MISC, British -> B-MISC
print("ground truth tags: ", sample_tags)
print("predicted tags: ", prediction)

raw inputs:  ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ground truth tags:  ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
predicted tags:  ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']


## Method two: Using the CRF layer via model subclassing

### Creating the base model

Define the BiLSTM model (without the CRF layer) as the base model.

In [20]:
# Build the model
def build_embedding_bilstm_crf_model(
    vocab_size: int, embed_dims: int, lstm_unit: int
) -> tf.keras.Model:
    x = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="x")
    y = tf.keras.layers.Embedding(vocab_size, embed_dims, mask_zero=True)(x)
    y = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(lstm_unit, return_sequences=True)
    )(y)

    return tf.keras.Model(
        inputs=x, outputs=y
    )


base_model = build_embedding_bilstm_crf_model(VOCAB_SIZE, 32, 64)


Run the model on a single batch of data, and inspect the output:

In [21]:
# preprocess
preprecessd_tokens = preprecess_tokens(sample_tokens)

# expand the tensor to shape [1, None], i.e. add the batch dim
inputs = tf.expand_dims(preprecessd_tokens, axis=0)

outputs = base_model(inputs)
print(outputs[0])

tf.Tensor(
[[-0.00246301  0.00139769  0.00138668 ... -0.0066145  -0.00208775
  -0.00464685]
 [-0.00359022 -0.00279237  0.00176504 ... -0.00154561  0.00373552
  -0.00553161]
 [-0.00238984 -0.00648718  0.00698042 ...  0.00294961  0.00392122
  -0.00589067]
 ...
 [-0.0012426   0.00084688  0.00371827 ...  0.00063269  0.00224524
   0.00329087]
 [ 0.00435535  0.00248722  0.00835475 ...  0.00167271  0.00393107
   0.00167722]
 [ 0.00835333  0.0062149   0.00396942 ...  0.0044266   0.00098306
   0.00133276]], shape=(9, 128), dtype=float32)


### Creating CRF model wrapper

In [22]:
class CRFModelWrapper(tf.keras.Model):
    def __init__(
        self,
        model: tf.keras.Model,
        units: int,
        chain_initializer="orthogonal",
        use_boundary: bool = True,
        boundary_initializer="zeros",
        use_kernel: bool = True,
        **kwargs
    ):
        super().__init__()

        self.crf_layer = tfa.layers.CRF(
            units=units,
            chain_initializer=chain_initializer,
            use_boundary=use_boundary,
            boundary_initializer=boundary_initializer,
            use_kernel=use_kernel,
            **kwargs
        )

        self.base_model = model

    def unpack_training_data(self, data):
        # override this method if it does not suit your task
        if len(data) == 3:
            x, y, sample_weight = data
        else:
            x, y = data
            sample_weight = None
        return x, y, sample_weight

    def call(self, inputs, training=None, mask=None, return_crf_internal=False):
        # pass `training` and `mask` by keyword so they cannot be misassigned
        base_model_outputs = self.base_model(inputs, training=training, mask=mask)

        # change next line, if your model has more outputs
        crf_input = base_model_outputs

        decode_sequence, potentials, sequence_length, kernel = self.crf_layer(crf_input)

        # change the next line if your base model has more outputs.
        # Always keep `(potentials, sequence_length, kernel), decode_sequence`
        # as the first two outputs of the model,
        # because the current `self.train_step()` expects this layout.
        outputs = (potentials, sequence_length, kernel), decode_sequence

        if return_crf_internal:
            return outputs
        else:
            # outputs[0] is the crf internal, skip it
            output_without_crf_internal = outputs[1:]

            # it is nicer to return a tensor instead of a one-element tuple
            if len(output_without_crf_internal) == 1:
                return output_without_crf_internal[0]
            else:
                return output_without_crf_internal

    def compute_crf_loss(self, potentials, sequence_length, kernel, y, sample_weight=None):
        crf_likelihood, _ = tfa.text.crf_log_likelihood(
            potentials, y, sequence_length, kernel
        )
        # convert likelihood to loss
        flat_crf_loss = -1 * crf_likelihood
        if sample_weight is not None:
            flat_crf_loss = flat_crf_loss * sample_weight
        crf_loss = tf.reduce_mean(flat_crf_loss)

        return crf_loss

    def train_step(self, data):
        x, y, sample_weight = self.unpack_training_data(data)
        with tf.GradientTape() as tape:
            (potentials, sequence_length, kernel), decoded_sequence, *_ = self(
                x, training=True, return_crf_internal=True
            )
            crf_loss = self.compute_crf_loss(
                potentials, sequence_length, kernel, y, sample_weight
            )
            loss = crf_loss + tf.reduce_sum(self.losses)
        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))

        # Update metrics (includes the metric that tracks the loss)
        self.compiled_metrics.update_state(y, decoded_sequence)
        # Return a dict mapping metric names to current value
        results = {m.name: m.result() for m in self.metrics}
        results.update({"loss": loss, "crf_loss": crf_loss})  # append loss
        return results

    def test_step(self, data):
        x, y, sample_weight = self.unpack_training_data(data)
        (potentials, sequence_length, kernel), decode_sequence, *_ = self(
            x, training=False, return_crf_internal=True
        )
        crf_loss = self.compute_crf_loss(
            potentials, sequence_length, kernel, y, sample_weight
        )
        loss = crf_loss + tf.reduce_sum(self.losses)
        # Update metrics (includes the metric that tracks the loss)
        self.compiled_metrics.update_state(y, decode_sequence)
        # Return a dict mapping metric names to current value
        results = {m.name: m.result() for m in self.metrics}
        results.update({"loss": loss, "crf_loss": crf_loss})  # append loss
        return results


### Wrap the base model with the CRF model wrapper

In [23]:
model = CRFModelWrapper(base_model, TAG_SIZE)

### Training model

In [24]:
model.compile(optimizer=tf.keras.optimizers.Adam(0.02))

model.fit(train_dataset, epochs=10, verbose=2)

Epoch 1/10
7/7 - 9s - loss: 10.3882 - crf_loss: 10.3882
Epoch 2/10
7/7 - 1s - loss: 6.8143 - crf_loss: 6.8143
Epoch 3/10
7/7 - 1s - loss: 4.8042 - crf_loss: 4.8042
Epoch 4/10
7/7 - 1s - loss: 3.5163 - crf_loss: 3.5163
Epoch 5/10
7/7 - 1s - loss: 2.0655 - crf_loss: 2.0655
Epoch 6/10
7/7 - 1s - loss: 1.3112 - crf_loss: 1.3112
Epoch 7/10
7/7 - 1s - loss: 0.9268 - crf_loss: 0.9268
Epoch 8/10
7/7 - 1s - loss: 0.6875 - crf_loss: 0.6875
Epoch 9/10
7/7 - 1s - loss: 0.5071 - crf_loss: 0.5071
Epoch 10/10
7/7 - 1s - loss: 0.3823 - crf_loss: 0.3823


<keras.callbacks.History at 0x7f69392ffdd0>

### Making inference

Inspect the prediction result.

In [25]:
# print the inputs and expected outputs
print("raw inputs: ", sample_tokens)

# preprocess
preprocessed_inputs = preprecess_tokens(
    sample_tokens
)
# expand the batch dim
inputs = tf.reshape(preprocessed_inputs, shape=[1, -1])

outputs = model.predict(inputs)
prediction = [tags[i] for i in outputs[0]]

# Keypoint: EU -> B-ORG, German -> B-MISC, British -> B-MISC
print("ground truth tags: ", sample_tags)
print("predicted tags: ", prediction)

raw inputs:  ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ground truth tags:  ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
predicted tags:  ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
