##### Copyright 2021 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# TensorFlow Addons Layers: CRF

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/addons/tutorials/layers_crf"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/addons/blob/master/docs/tutorials/layers_crf.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/addons/blob/master/docs/tutorials/layers_crf.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/addons/docs/tutorials/layers_crf.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

## Overview

This notebook demonstrates how to use the Conditional Random Field (CRF) layer in TensorFlow. TensorFlow Addons provides a CRF layer (more precisely, a linear-chain CRF layer). Linear-chain CRFs are popular in Natural Language Processing (NLP) because each prediction can make use of information from its neighbors, which can improve the performance of the base model. For more information about CRFs, see [Wikipedia](https://en.wikipedia.org/wiki/Conditional_random_field).

In the following sections, you will learn how to use the CRF layer in two ways by building Named Entity Recognition (NER) models. NER is the task of tagging entities (e.g., "Winston Churchill", "London", "tomorrow") in text with their corresponding type (e.g., "Person", "City", "Date"). NER is an important and popular task in NLP.

## Setup

In addition to TensorFlow and TensorFlow Addons, you also need to install `datasets`. The `datasets` package (homepage: https://github.com/huggingface/datasets) is a community-driven open-source library of datasets from the Hugging Face team. A dataset from this package is used later in this tutorial.

In [None]:
!pip install -q tensorflow-addons  # version >= 0.15.0 is required
!pip install -q tensorflow
!pip install -q datasets

Import modules so you can use them later:

In [None]:
import copy

import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa
import datasets

## Training data

This tutorial uses the CoNLL 2003 dataset for training. CoNLL 2003 is a named entity recognition dataset released as part of the CoNLL 2003 shared task on language-independent named entity recognition. For more information, see https://huggingface.co/datasets/conll2003.

With the help of the `datasets` package, loading the data is straightforward:

In [None]:
conll_data = datasets.load_dataset("conll2003")

Inspect the data splits and features:

In [None]:
conll_data

Get one sample from the training data and print it:

In [None]:
for item in conll_data["train"]:
  sample_tokens = item['tokens']
  sample_tag_ids = item["ner_tags"]
  print(sample_tokens)
  print(sample_tag_ids)
  break

For the NER model, the inputs are the tokens, given as a list of strings. The outputs are the NER tags, which the dataset stores as tag ids.

The dataset also provides the mapping between NER tags and ids:

In [None]:
dataset_builder = datasets.load_dataset_builder('conll2003')
raw_tags = dataset_builder.info.features['ner_tags'].feature.names
print(raw_tags)

Let's decode the NER tag ids to tags:

In [None]:
sample_tags = [raw_tags[i] for i in sample_tag_ids]

print(list(zip(sample_tokens, sample_tags)))

These tags encode the named entities in a fixed scheme. In this dataset, the tags use the [IOB](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) format.
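
To make the tagging scheme concrete, here is a minimal sketch that groups IOB-tagged tokens back into entities. The helper `iob_to_spans` and the example sentence are illustrative assumptions and are not part of the dataset or of TensorFlow Addons. The sketch assumes every entity starts with a `B-` tag.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A `B-` tag always opens a new entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An `I-` tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # `O` closes any open entity. An `I-` tag that does not match
            # the open entity is treated the same way and its token is dropped.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hand-written example, not taken from the dataset.
example_tokens = ["Winston", "Churchill", "visited", "London", "."]
example_tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(iob_to_spans(example_tokens, example_tags))
```

The `B-` prefix marks the first token of an entity, `I-` marks a continuation, and `O` marks tokens outside any entity.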

Add a special tag `<PAD>` to the tag set to represent padding in the sequence. In NLP, 0 is commonly used to mark padding, and it is the default padding value for many functions in machine learning software (including TensorFlow).

Create a list to convert tag ids to tag text:

In [None]:
tags = ['<PAD>'] + raw_tags
print(tags)

Define some constants which will be used later:

In [None]:
TAG_SIZE = len(tags)
VOCAB_SIZE = 20000

Build a vocabulary lookup layer for the tokens:

In [None]:
train_tokens = tf.ragged.constant(conll_data["train"]["tokens"])
train_tokens = tf.map_fn(tf.strings.lower, train_tokens)

lookup_layer = tf.keras.layers.StringLookup(max_tokens=VOCAB_SIZE, mask_token="[MASK]", oov_token="[UNK]")
lookup_layer.adapt(train_tokens)

print(len(lookup_layer.get_vocabulary()))
print(lookup_layer.get_vocabulary()[:10])
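
As a rough mental model of what the lookup layer does, the following plain-Python sketch builds a frequency-ordered vocabulary with two reserved slots: index 0 for the mask token and index 1 for out-of-vocabulary tokens. The functions `build_vocabulary` and `lookup` are hypothetical helpers written for this illustration. They are not the `StringLookup` implementation, and tie-breaking among equally frequent tokens may differ from the real layer.

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(sentences: List[List[str]], max_tokens: int) -> List[str]:
    """Return a vocabulary whose first two slots are reserved."""
    counts = Counter(token.lower() for sentence in sentences for token in sentence)
    # Reserve index 0 for the mask/padding token and index 1 for unknown tokens.
    reserved = ["[MASK]", "[UNK]"]
    most_common = [token for token, _ in counts.most_common(max_tokens - len(reserved))]
    return reserved + most_common


def lookup(tokens: List[str], vocabulary: List[str]) -> List[int]:
    """Map tokens to ids, sending unseen tokens to the `[UNK]` id (1)."""
    token_to_id: Dict[str, int] = {token: i for i, token in enumerate(vocabulary)}
    return [token_to_id.get(token.lower(), 1) for token in tokens]


# Toy corpus written by hand: "the" appears 3 times, "sat" twice.
toy_sentences = [["The", "cat", "sat"], ["the", "dog", "sat"], ["The", "end"]]
toy_vocabulary = build_vocabulary(toy_sentences, max_tokens=4)
print(toy_vocabulary)
print(lookup(["The", "sat", "barked", "cat"], toy_vocabulary))
```

Because `max_tokens` caps the vocabulary size, rarer tokens such as "cat" fall outside the vocabulary and share the `[UNK]` id with tokens that were never seen at all.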

Let's create a raw (not yet preprocessed) training dataset:

In [None]:
def create_data_generator(dataset):
  def data_generator():
    for item in dataset:
      yield item['tokens'], item['ner_tags']
  
  return data_generator

data_signature = (
    tf.TensorSpec(shape=(None,), dtype=tf.string),
    tf.TensorSpec(shape=(None,), dtype=tf.int32),
)

train_data = tf.data.Dataset.from_generator(
    create_data_generator(conll_data["train"]),
    output_signature=data_signature
)
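
The factory pattern above matters because `from_generator` expects a callable that returns a fresh iterator every time the dataset is iterated, not a generator object that can only be consumed once. The following plain-Python sketch shows the difference. The two records in `toy_dataset` are hand-written stand-ins for the real data.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Example = Tuple[List[str], List[int]]


def create_data_generator(dataset: Iterable[Dict]) -> Callable[[], Iterator[Example]]:
    def data_generator() -> Iterator[Example]:
        for item in dataset:
            yield item["tokens"], item["ner_tags"]

    return data_generator


# Two hand-written records standing in for the real dataset.
toy_dataset = [
    {"tokens": ["EU", "rejects"], "ner_tags": [3, 0]},
    {"tokens": ["Peter", "Blackburn"], "ner_tags": [1, 2]},
]

make_generator = create_data_generator(toy_dataset)

# Calling the factory again yields a fresh iterator, so a second pass
# (for example, a second epoch) sees the full data again.
first_pass = list(make_generator())
second_pass = list(make_generator())
print(len(first_pass), len(second_pass))

# A generator object, by contrast, is exhausted after one pass.
single_use = make_generator()
print(len(list(single_use)), len(list(single_use)))
```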

Let's preprocess the training dataset:

In [None]:
def dataset_preprocess(tokens, tag_ids):
    preprocessed_tokens = preprecess_tokens(tokens)

    # increase by 1 for all tag_ids,
    # because `<PAD>` is added as the first element in tags list
    preprocessed_tag_ids = tag_ids + 1

    return preprocessed_tokens, preprocessed_tag_ids

def preprecess_tokens(tokens):
    tokens = tf.strings.lower(tokens)
    return lookup_layer(tokens)

BATCH_SIZE = 2048

# `padded_batch()` pads each batch to the length of its longest sequence,
# so different batches may have different lengths.
# shape: (batch_size, None)
train_dataset = (
    train_data.map(dataset_preprocess)
    .padded_batch(batch_size=BATCH_SIZE).cache()
)
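
The reason for shifting the tag ids becomes clearer with a small example. In the raw data, id 0 means `O`, but zero is also the value used to pad shorter sequences, so without the shift padding would be indistinguishable from `O`. The sketch below reproduces the idea in plain Python. The shortened tag set and the helpers `shift_tag_ids` and `pad_batch` are assumptions made for this illustration.

```python
from typing import List

RAW_TAGS = ["O", "B-PER", "I-PER"]  # hypothetical shortened tag set
TAGS = ["<PAD>"] + RAW_TAGS         # id 0 is now reserved for padding


def shift_tag_ids(tag_ids: List[int]) -> List[int]:
    """Move raw ids up by one so that 0 is free to mean padding."""
    return [tag_id + 1 for tag_id in tag_ids]


def pad_batch(sequences: List[List[int]], pad_value: int = 0) -> List[List[int]]:
    """Right-pad every sequence to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


raw_batch = [[1, 2, 0, 0], [0, 1]]  # raw ids, where 0 means "O"
padded = pad_batch([shift_tag_ids(seq) for seq in raw_batch])
print(padded)
print([[TAGS[i] for i in seq] for seq in padded])
```

After the shift, the trailing zeros in the second sequence decode to `<PAD>` rather than `O`.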

## Method one: Using the CRF layer in a custom training loop

A custom training loop gives you full control over how the model is trained, at the cost of writing the loop that `fit()` would otherwise provide. You can find more detailed information about custom training loops at https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch. In this section, you will learn how to use the CRF layer in a custom training loop.

### Creating model

Define a BiLSTM+CRF model using the `tfa.layers.CRF` layer.
The CRF layer outputs not only the CRF decoding result (`decode_sequence`), but also some internal variables (`potentials`, `sequence_length` and `kernel`). You will use those internal variables to compute the loss value later.

In [None]:
# Build the model
def build_embedding_bilstm_crf_model(
    vocab_size: int, embed_dims: int, lstm_unit: int, tag_size: int
) -> tf.keras.Model:
    x = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="x")
    y = tf.keras.layers.Embedding(vocab_size, embed_dims, mask_zero=True)(x)
    y = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(lstm_unit, return_sequences=True)
    )(y)
    decode_sequence, potentials, sequence_length, kernel = tfa.layers.CRF(tag_size)(y)

    return tf.keras.Model(
        inputs=x, outputs=[decode_sequence, potentials, sequence_length, kernel]
    )


model = build_embedding_bilstm_crf_model(VOCAB_SIZE, 32, 64, TAG_SIZE)


Run the model on a single batch of data, and inspect the output:

In [None]:
# preprocess
preprocessed_tokens = preprecess_tokens(sample_tokens)

# expand the tensor to shape [1, None], that is, add the batch dimension
inputs = tf.expand_dims(preprocessed_tokens, axis=0)

outputs, *_ = model(inputs)
print(outputs[0])
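
To build intuition for how a linear-chain CRF turns per-token scores into a tag sequence, here is a minimal plain-Python sketch of Viterbi decoding, the standard decoding algorithm for linear-chain CRFs. It assumes that `potentials` holds one score per tag per position and that the transition matrix is indexed as `[from_tag][to_tag]`. The function `viterbi_decode` and the toy numbers are illustrative and are not the TensorFlow Addons implementation.

```python
from typing import List


def viterbi_decode(potentials: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag sequence for one sentence.

    potentials[t][j]  : score of tag j at position t (from the network)
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    # best_score[j] is the best score of any path ending in tag j so far.
    best_score = list(potentials[0])
    backpointers: List[List[int]] = []

    for step_scores in potentials[1:]:
        new_score, pointers = [], []
        for j in range(num_tags):
            # Pick the best previous tag for the current tag j.
            best_prev = max(range(num_tags), key=lambda i: best_score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(best_score[best_prev] + transitions[best_prev][j] + step_scores[j])
        best_score = new_score
        backpointers.append(pointers)

    # Walk the backpointers from the best final tag to recover the path.
    last_tag = max(range(num_tags), key=lambda j: best_score[j])
    path = [last_tag]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy numbers chosen by hand: two tags, three positions.
toy_potentials = [[2.0, 1.0], [1.0, 1.5], [1.0, 1.2]]
toy_transitions = [[1.0, -2.0], [-2.0, 1.0]]  # staying on a tag is rewarded

greedy_path = [max(range(2), key=lambda j: scores[j]) for scores in toy_potentials]
print("greedy: ", greedy_path)
print("viterbi:", viterbi_decode(toy_potentials, toy_transitions))
```

In this toy case, picking the best tag independently at each position gives `[0, 1, 1]`, while Viterbi returns `[0, 0, 0]` because the transition scores penalize switching tags. This is the sense in which each prediction uses information from its neighbors.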

### Define CRF loss function

Using the true labels `y` and the internal variables of the CRF layer, you can compute the log-likelihood of `y`, which measures how probable the true label sequence is under the CRF layer. The function `tfa.text.crf_log_likelihood` performs this computation. Since the goal is for the model to assign a higher likelihood to the true labels, the negative log-likelihood is used as the loss to minimize.

In [None]:
def crf_loss_func(potentials, sequence_length, kernel, y):
    crf_likelihood, _ = tfa.text.crf_log_likelihood(
        potentials, y, sequence_length, kernel
    )
    # likelihood to loss
    flat_crf_loss = -1 * crf_likelihood
    crf_loss = tf.reduce_mean(flat_crf_loss)

    return crf_loss
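
For readers who want to see what a CRF log-likelihood involves, the following plain-Python sketch computes it for a single unpadded sentence: the score of the true path minus the log of the summed exponentiated scores of all paths, where the latter is obtained with the forward algorithm. The function names and toy numbers are assumptions made for this illustration. The sketch skips masking and the numerical-stability tricks that a real implementation would need.

```python
import itertools
import math
from typing import List, Sequence


def path_score(potentials: List[List[float]], transitions: List[List[float]],
               path: Sequence[int]) -> float:
    """Unnormalized score of one tag path: unary scores plus transition scores."""
    score = sum(potentials[t][tag] for t, tag in enumerate(path))
    score += sum(transitions[a][b] for a, b in zip(path, path[1:]))
    return score


def log_partition(potentials: List[List[float]], transitions: List[List[float]]) -> float:
    """Log of the summed exp-scores of all paths, via the forward algorithm."""
    num_tags = len(transitions)
    # alpha[j] is the log of the summed exp-scores of all paths ending in tag j.
    alpha = list(potentials[0])
    for step_scores in potentials[1:]:
        alpha = [
            step_scores[j]
            + math.log(sum(math.exp(alpha[i] + transitions[i][j]) for i in range(num_tags)))
            for j in range(num_tags)
        ]
    return math.log(sum(math.exp(a) for a in alpha))


def crf_negative_log_likelihood(potentials: List[List[float]],
                                transitions: List[List[float]],
                                path: Sequence[int]) -> float:
    return log_partition(potentials, transitions) - path_score(potentials, transitions, path)


# Toy numbers chosen by hand: two tags, three positions.
toy_potentials = [[2.0, 1.0], [1.0, 1.5], [1.0, 1.2]]
toy_transitions = [[1.0, -2.0], [-2.0, 1.0]]

# Sanity check: exp(-loss) over every possible path should sum to 1.
total_probability = sum(
    math.exp(-crf_negative_log_likelihood(toy_potentials, toy_transitions, path))
    for path in itertools.product(range(2), repeat=3)
)
print(round(total_probability, 6))
```

The sanity check at the end confirms that the per-path probabilities form a valid distribution, which is what makes the negative log-likelihood a meaningful loss.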

### Define optimizer, metrics and train_step function

In [None]:
optimizer = tf.keras.optimizers.Adam(0.02)

train_loss = tf.keras.metrics.Mean(name="train_loss")

@tf.function(experimental_relax_shapes=True)
def train_step(x, y):
    with tf.GradientTape() as tape:
        decoded_sequence, potentials, sequence_length, kernel = model(x)
        crf_loss = crf_loss_func(potentials, sequence_length, kernel, y)
        loss = crf_loss + tf.reduce_sum(model.losses)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    train_loss(loss)

### Training model

In [None]:
EPOCHS = 10

for epoch in range(EPOCHS):
    # Reset the metrics at the start of each epoch
    train_loss.reset_states()

    for x, y in train_dataset:
        train_step(x, y)

    print(f"Epoch {epoch + 1}, " f"Loss: {train_loss.result()}")


### Making inference

Inspect the predicted result:

In [None]:
# print the inputs and expected outputs
print("raw inputs: ", sample_tokens)

# preprocess
preprocessed_inputs = preprecess_tokens(
    sample_tokens
)
# expand the batch dim
inputs = tf.reshape(preprocessed_inputs, shape=[1, -1])

outputs, *_ = model.predict(inputs)
prediction = [tags[i] for i in outputs[0]]

# Key point: EU -> B-ORG, German -> B-MISC, British -> B-MISC
print("ground truth tags: ", sample_tags)
print("predicted tags: ", prediction)
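
When the two printed lists are long, a token-by-token comparison can be easier to read. The following sketch is one possible way to line up tokens, ground truth tags and predicted tags. The helper `compare_tags` is hypothetical, and the "predicted" tags below are made up for illustration and are not real model output.

```python
from typing import List, Tuple


def compare_tags(tokens: List[str], gold: List[str],
                 predicted: List[str]) -> Tuple[List[str], float]:
    """Return one report line per token and the token-level accuracy."""
    lines = []
    correct = 0
    for token, gold_tag, predicted_tag in zip(tokens, gold, predicted):
        is_match = gold_tag == predicted_tag
        marker = "" if is_match else "  <-- mismatch"
        correct += int(is_match)
        lines.append(f"{token:<10}{gold_tag:<8}{predicted_tag:<8}{marker}")
    return lines, correct / len(tokens)


# Hand-written tags for illustration; the "predicted" column is made up.
toy_tokens = ["EU", "rejects", "German", "call"]
toy_gold = ["B-ORG", "O", "B-MISC", "O"]
toy_predicted = ["B-ORG", "O", "O", "O"]

report, accuracy = compare_tags(toy_tokens, toy_gold, toy_predicted)
print("\n".join(report))
print(f"token accuracy: {accuracy:.2f}")
```

With the real model, you could pass `sample_tokens`, `sample_tags` and `prediction` to the same helper.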

## Method two: Using the CRF layer via CRF model wrapper

Writing a custom training loop can be difficult for developers who are new to TensorFlow, and it is error-prone even for experienced developers. TensorFlow Addons provides the CRF model wrapper to address this problem in most cases. You can find the API documentation at https://www.tensorflow.org/addons/api_docs/python/tfa/text/CRFModelWrapper. In this section, you will learn how to use the CRF model wrapper.

### Creating the base model

Define the BiLSTM model as the base model:

In [None]:
# Build the model
def build_embedding_bilstm_model(
    vocab_size: int, embed_dims: int, lstm_unit: int
) -> tf.keras.Model:
    x = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="x")
    y = tf.keras.layers.Embedding(vocab_size, embed_dims, mask_zero=True)(x)
    y = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(lstm_unit, return_sequences=True)
    )(y)

    return tf.keras.Model(
        inputs=x, outputs=y
    )


base_model = build_embedding_bilstm_model(VOCAB_SIZE, 32, 64)


Run the model on a single batch of data, and inspect the output:

In [None]:
# preprocess
preprocessed_tokens = preprecess_tokens(sample_tokens)

# expand the tensor to shape [1, None], that is, add the batch dimension
inputs = tf.expand_dims(preprocessed_tokens, axis=0)

outputs = base_model(inputs)
print(outputs[0])

### CRF model wrapper

Let's import the CRF model wrapper (tfa.text.crf_wrapper.CRFModelWrapper) from TensorFlow Addons:

In [None]:
from tensorflow_addons.text.crf_wrapper import CRFModelWrapper

### Wrap the base model with the CRF model wrapper

The CRF model wrapper wraps the base model. It applies the CRF layer to the output of the base model and computes the CRF loss. The wrapper takes the base model and the number of tags (used to initialize the CRF layer) as initialization parameters.

In [None]:
model = CRFModelWrapper(base_model, TAG_SIZE)

### Training model

Compiling the wrapped model is exactly the same as compiling a regular Keras model, except that there is no need to provide a loss function (the wrapped model computes the CRF loss internally).

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(0.02))

In [None]:
model.fit(train_dataset, epochs=10, verbose=2)

### Making inference

Inspect the predicted result:

In [None]:
# print the inputs and expected outputs
print("raw inputs: ", sample_tokens)

# preprocess
preprocessed_inputs = preprecess_tokens(
    sample_tokens
)
# expand the batch dim
inputs = tf.reshape(preprocessed_inputs, shape=[1, -1])

outputs = model.predict(inputs)
prediction = [tags[i] for i in outputs[0]]

# Key point: EU -> B-ORG, German -> B-MISC, British -> B-MISC
print("ground truth tags: ", sample_tags)
print("predicted tags: ", prediction)