# [KerasNLP] Position Embedding Techniques in Transformers

**Author:** [Usha Rengaraju](https://www.linkedin.com/in/usha-rengaraju-b570b7a2/)<br>
**Date created:** 2023/07/10<br>
**Last modified:** 2023/07/10<br>
**Description:** Position Embedding Techniques in Transformers using KerasNLP

## Overview

Embedding layers convert input tokens into dense vectors, often enriched with additional information such as position encodings. KerasNLP already implements several embedding layers that can be used out of the box.

In this guide we build a simple text classification pipeline and showcase various position embedding layers and their effects on performance.

## Imports & setup

This tutorial requires you to have KerasNLP installed:

```shell
pip install keras-nlp
```

We begin by importing all required packages:

In [None]:
!pip install -q keras-nlp einops


In [None]:
import os
import re
import json
import string
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import keras_nlp

## Data loading

This guide uses the
[IMDB review dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)
for demonstration purposes.

To get started, we first load the dataset:

In [None]:
import keras_nlp
import tensorflow_datasets as tfds

imdb_train, imdb_test = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    as_supervised=True,
    batch_size=16,
)


Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


## Preprocessing dataset

Next, we preprocess the dataset. We use the KerasNLP `WordPieceTokenizer` to tokenize the reviews and the KerasNLP `StartEndPacker` to add start and end tokens and pad every sequence to a fixed length. We then map this preprocessing over the datasets used to train the model.


In [None]:
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
    imdb_train.map(lambda x, y: x),
    vocabulary_size=20_000,
    lowercase=True,
    strip_accents=True,
    reserved_tokens=["[PAD]", "[START]", "[END]", "[MASK]", "[UNK]"],
)
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    lowercase=True,
    strip_accents=True,
    oov_token="[UNK]",
)
packer = keras_nlp.layers.StartEndPacker(
    start_value=tokenizer.token_to_id("[START]"),
    end_value=tokenizer.token_to_id("[END]"),
    pad_value=tokenizer.token_to_id("[PAD]"),
    sequence_length=512,
)


def preprocess(x, y):
    token_ids = packer(tokenizer(x))
    return token_ids, y


imdb_preproc_train_ds = imdb_train.map(
    preprocess, num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)
imdb_preproc_val_ds = imdb_test.map(
    preprocess, num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)
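
To make the packing step concrete, here is a minimal pure-Python sketch of what packing a single sequence amounts to. The `pack` helper is hypothetical and only illustrative; in particular, the exact truncation rule (keeping the first `sequence_length - 2` tokens) is an assumption rather than a statement about how `StartEndPacker` is implemented.

```python
def pack(token_ids, start_id, end_id, pad_id, sequence_length):
    """Add start/end ids, then truncate or pad to `sequence_length`."""
    # Leave room for the start and end ids (assumed truncation rule).
    body = token_ids[: sequence_length - 2]
    packed = [start_id] + body + [end_id]
    # Pad on the right until the sequence has the requested length.
    return packed + [pad_id] * (sequence_length - len(packed))


short = pack([7, 8, 9], start_id=1, end_id=2, pad_id=0, sequence_length=8)
long = pack(list(range(10, 20)), start_id=1, end_id=2, pad_id=0, sequence_length=6)
print(short)  # [1, 7, 8, 9, 2, 0, 0, 0]
print(long)   # [1, 10, 11, 12, 13, 2]
```

Every packed sequence has the same length, which is what lets the model treat a batch as a single dense tensor.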

## Model Building

For this simple classification pipeline, we first use the pretrained `BertClassifier` from KerasNLP.

In the rest of this guide, we build a simple classifier from the KerasNLP `TransformerEncoder` and a `Dense` layer, and swap the embedding layer each time to see the effects.

In [None]:
# Load a BERT model.
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased", num_classes=2
)
# Fine-tune on IMDb movie reviews.
classifier.fit(imdb_train, validation_data=imdb_test)
# Predict two new examples.
classifier.predict(["What an amazing movie!", "A total waste of my time."])



array([[-2.1772902,  1.5806383],
       [ 1.2048193, -0.9244135]], dtype=float32)

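## Position Embedding

This layer learns a position embedding for input sequences: one trainable vector per position, which we add to the token embeddings.

**sequence_length**: The maximum length of the dynamic sequence.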

**initializer**: The initializer to use for the embedding weights. Defaults to "glorot_uniform".

**seq_axis**: The axis of the input tensor where we add the embeddings.
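
As a rough illustration of the idea, the NumPy sketch below adds a per-position vector to a batch of token embeddings. The `add_learned_positions` helper and the randomly filled `position_table` are hypothetical stand-ins for the trainable weights; this is not KerasNLP's implementation.

```python
import numpy as np


def add_learned_positions(token_embeddings, position_table):
    """Add one vector per position to a (batch, seq_len, dim) array."""
    seq_len = token_embeddings.shape[1]
    # Slice the table to the actual sequence length; the result broadcasts
    # over the batch axis.
    return token_embeddings + position_table[:seq_len]


rng = np.random.default_rng(0)
max_len, dim = 512, 64
# Stand-in for trainable weights; a real layer would learn these values.
position_table = rng.uniform(-0.05, 0.05, size=(max_len, dim))
token_embeddings = rng.normal(size=(2, 10, dim))

out = add_learned_positions(token_embeddings, position_table)
print(out.shape)  # (2, 10, 64)
```

With `sequence_length=512` and an embedding width of 64, such a table holds 512 × 64 = 32,768 values, which matches the parameter count reported for the position embedding layer in the model summary.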

In [None]:
token_id_input = keras.Input(
    shape=(None,),
    dtype="int32",
    name="token_ids",
)
embed = keras.layers.Embedding(
    input_dim=len(vocab), output_dim=64
)(token_id_input)
outputs = keras_nlp.layers.PositionEmbedding(
    sequence_length=packer.sequence_length,
)(embed)
outputs = embed + outputs
outputs = keras_nlp.layers.TransformerEncoder(
    num_heads=2,
    intermediate_dim=128,
    dropout=0.1,
)(outputs)
outputs = keras.layers.Dense(2)(outputs[:, 0, :])
model = keras.Model(
    inputs=token_id_input,
    outputs=outputs,
)

model.summary()

Model: "model_5"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 token_ids (InputLayer)         [(None, None)]       0           []                               
                                                                                                  
 embedding_2 (Embedding)        (None, None, 64)     1226880     ['token_ids[0][0]']              
                                                                                                  
 position_embedding_11 (Positio  (None, None, 64)    32768       ['embedding_2[0][0]']            
 nEmbedding)                                                                                      
                                                                                                  
 tf.__operators__.add_1 (TFOpLa  (None, None, 64)    0           ['embedding_2[0][0]',      

In [None]:
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.experimental.AdamW(5e-5),
    metrics=keras.metrics.SparseCategoricalAccuracy(),
    jit_compile=True,
)
model.fit(
    imdb_preproc_train_ds,
    validation_data=imdb_preproc_val_ds,
    epochs=3,
)


## Sine Positional Embedding

This layer calculates the position encoding as a mix of sine and cosine functions with geometrically increasing wavelengths, as defined in *Attention Is All You Need*.

**max_wavelength**: The maximum angular wavelength of the sine/cosine curves, as described in *Attention Is All You Need*. Defaults to 10000.
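
The formula can be sketched in a few lines of NumPy. The `sinusoidal_encoding` helper below is hypothetical and follows the interleaved layout from the paper (sine on even channels, cosine on odd channels); the channel ordering used inside `SinePositionEncoding` may differ, so treat this as an illustration of the formula rather than of the layer's internals.

```python
import numpy as np


def sinusoidal_encoding(seq_len, dim, max_wavelength=10000):
    """Return a (seq_len, dim) array of sine/cosine position encodings."""
    positions = np.arange(seq_len)[:, None]    # (seq_len, 1)
    # One frequency per pair of channels; wavelengths grow geometrically.
    pair_index = np.arange(dim // 2)[None, :]  # (1, dim // 2)
    angles = positions / max_wavelength ** (2 * pair_index / dim)
    encoding = np.zeros((seq_len, dim))
    encoding[:, 0::2] = np.sin(angles)  # even channels
    encoding[:, 1::2] = np.cos(angles)  # odd channels
    return encoding


encoding = sinusoidal_encoding(seq_len=512, dim=64)
print(encoding.shape)  # (512, 64)
```

Because the encoding is a fixed function of the position, it has no trainable weights, which is why the model summary reports 0 parameters for this layer.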

In [None]:
token_id_input = keras.Input(
    shape=(None,),
    dtype="int32",
    name="token_ids",
)
embed = keras.layers.Embedding(
    input_dim=len(vocab), output_dim=64
)(token_id_input)
outputs = keras_nlp.layers.SinePositionEncoding()(embed)
outputs = embed + outputs
outputs = keras_nlp.layers.TransformerEncoder(
    num_heads=2,
    intermediate_dim=128,
    dropout=0.1,
)(outputs)
outputs = keras.layers.Dense(2)(outputs[:, 0, :])
model = keras.Model(
    inputs=token_id_input,
    outputs=outputs,
)

model.summary()

Model: "model_8"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 token_ids (InputLayer)         [(None, None)]       0           []                               
                                                                                                  
 embedding_3 (Embedding)        (None, None, 64)     1226880     ['token_ids[0][0]']              
                                                                                                  
 sine_position_encoding_2 (Sine  (None, None, 64)    0           ['embedding_3[0][0]']            
 PositionEncoding)                                                                                
                                                                                                  
 tf.__operators__.add_2 (TFOpLa  (None, None, 64)    0           ['embedding_3[0][0]',      

In [None]:
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.experimental.AdamW(5e-5),
    metrics=keras.metrics.SparseCategoricalAccuracy(),
    jit_compile=True,
)
model.fit(
    imdb_preproc_train_ds,
    validation_data=imdb_preproc_val_ds,
    epochs=3,
)


## Rotary embeddings

Rotary position embedding (RoPE) encodes absolute positional information with rotation matrices and, in doing so, naturally introduces an explicit relative position dependency into the self-attention formulation.

**rotary_ndims**: The number of feature dimensions to which the rotation is applied.

**max_wavelength**: The maximum angular wavelength of the sine/cosine curves, as described in *Attention Is All You Need*. Defaults to 10000.
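
The relative-position property can be checked numerically. The NumPy sketch below uses a hypothetical `rotate` helper that applies the same half-split rotation as the layer defined in this section to a single vector; it is a simplified illustration, not the layer itself.

```python
import numpy as np


def rotate(x, position, max_wavelength=10000):
    """Rotate a 1D vector `x` (even length) by angles proportional to `position`."""
    dim = x.shape[-1]
    # One inverse frequency per pair of channels.
    inverse_freq = 1.0 / max_wavelength ** (np.arange(0, dim, 2) / dim)
    angles = np.concatenate([position * inverse_freq] * 2)  # (dim,)
    # Channel k is paired with channel k + dim / 2.
    x1, x2 = np.split(x, 2)
    half_rotated = np.concatenate([-x2, x1])
    return x * np.cos(angles) + half_rotated * np.sin(angles)


rng = np.random.default_rng(0)
query, key = rng.normal(size=64), rng.normal(size=64)

# Shifting both positions by the same offset leaves the score unchanged.
score_near_start = rotate(query, 3) @ rotate(key, 7)
score_shifted = rotate(query, 103) @ rotate(key, 107)
print(np.isclose(score_near_start, score_shifted))  # True
```

The dot product between a rotated query and a rotated key depends only on the distance between their positions, which is the relative position dependency described above.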

In [None]:
class RotaryEmbedding(keras.layers.Layer):
    def __init__(self, rotary_ndims, max_wavelength=10000, **kwargs):
        super().__init__(**kwargs)
        self.rotary_ndims = int(rotary_ndims)
        self.max_wavelength = max_wavelength
        # Projects the input to a query and a key, each 2 * rotary_ndims wide.
        self.to_qk = keras.layers.Dense(
            units=self.rotary_ndims * 4, use_bias=False
        )

    def _apply_rotary_pos_emb(self, tensor, cos_emb, sin_emb):
        # `cos_emb` and `sin_emb` have shape (seq_len, rotary_ndims) and
        # broadcast over the batch axis of `tensor`.
        x1, x2 = tf.split(tensor, 2, axis=-1)
        half_rot_tensor = tf.concat((-x2, x1), axis=-1)
        return (tensor * cos_emb) + (half_rot_tensor * sin_emb)

    def _compute_cos_sin_embedding(self, x, seq_dim=1):
        seq_len = tf.shape(x)[seq_dim]
        # Named `dim_range` to avoid shadowing the built-in `range`.
        dim_range = tf.range(
            start=0, limit=self.rotary_ndims, delta=2, dtype="float32"
        )
        inverse_freq = 1.0 / (
            self.max_wavelength ** (dim_range / self.rotary_ndims)
        )
        positions = tf.range(seq_len, dtype=inverse_freq.dtype)
        freqs = tf.einsum("i, j -> ij", positions, inverse_freq)
        embedding = tf.concat((freqs, freqs), axis=-1)
        return tf.cos(embedding), tf.sin(embedding)

    def call(self, x):
        query, key = tf.split(self.to_qk(x), num_or_size_splits=2, axis=-1)

        # Only the first `rotary_ndims` features are rotated; the rest pass
        # through unchanged.
        query_rot, query_pass = (
            query[..., : self.rotary_ndims],
            query[..., self.rotary_ndims :],
        )
        key_rot, key_pass = (
            key[..., : self.rotary_ndims],
            key[..., self.rotary_ndims :],
        )
        cos_emb, sin_emb = self._compute_cos_sin_embedding(key_rot, seq_dim=1)
        query_emb = self._apply_rotary_pos_emb(query_rot, cos_emb, sin_emb)
        # The rotated key is computed but unused: this layer returns only
        # the rotated query.
        key_emb = self._apply_rotary_pos_emb(key_rot, cos_emb, sin_emb)
        query = tf.concat((query_emb, query_pass), axis=-1)

        return query

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "rotary_ndims": self.rotary_ndims,
                "max_wavelength": self.max_wavelength,
            }
        )

        return config




In [None]:
from einops import rearrange, repeat

token_id_input = keras.Input(
    shape=(None,),
    dtype="int32",
    name="token_ids",
)
embed = keras.layers.Embedding(
    input_dim=len(vocab), output_dim=128
)(token_id_input)
outputs = RotaryEmbedding(64)(embed)
outputs = outputs + embed
outputs = keras_nlp.layers.TransformerEncoder(
    num_heads=2,
    intermediate_dim=128,
    dropout=0.1,
)(outputs)
outputs = keras.layers.Dense(2)(outputs[:, 0, :])
model = keras.Model(
    inputs=token_id_input,
    outputs=outputs,
)

model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 token_ids (InputLayer)         [(None, None)]       0           []                               
                                                                                                  
 embedding_7 (Embedding)        (None, None, 128)    2453760     ['token_ids[0][0]']              
                                                                                                  
 rotary_embedding_37 (RotaryEmb  (None, None, 128)   32768       ['embedding_7[0][0]']            
 edding)                                                                                          
                                                                                                  
 tf.__operators__.add_3 (TFOpLa  (None, None, 128)   0           ['rotary_embedding_37[0][0]',

In [None]:
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.experimental.AdamW(5e-5),
    metrics=keras.metrics.SparseCategoricalAccuracy(),
    jit_compile=True,
)
model.fit(
    imdb_preproc_train_ds,
    validation_data=imdb_preproc_val_ds,
    epochs=3,
)


## ALiBi embeddings

ALiBi handles position without using actual position embeddings. Instead, when computing the attention between a given query and key, ALiBi penalises the attention score according to how far apart the query and key are from one another. As a result, the penalty is low when a key and query are close together and high when they are far apart.

The intuition behind this strategy is that nearby words matter more than words that are far away. The layer takes `heads` (the number of heads for which slopes are generated) and `total_heads` (the total number of attention heads) as input parameters.
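
The penalty can be sketched in NumPy as follows. The `alibi_slopes` and `alibi_bias` helpers are hypothetical, cover only power-of-two head counts, and use random numbers as stand-in attention logits. In this sketch the bias is added to the attention logits, as the description above suggests, whereas the model built in this section adds the bias to the token embeddings.

```python
import numpy as np


def alibi_slopes(num_heads):
    """Per-head slopes for a power-of-two head count: 2^(-8/n), 2^(-16/n), ..."""
    start = 2.0 ** (-8.0 / num_heads)
    return np.array([start ** (h + 1) for h in range(num_heads)])


def alibi_bias(query_len, key_len, num_heads):
    """Return a (num_heads, query_len, key_len) array of distance penalties."""
    # Align the last query position with the last key position.
    q_pos = np.arange(key_len - query_len, key_len)[:, None]
    k_pos = np.arange(key_len)[None, :]
    distance = np.abs(k_pos - q_pos)  # (query_len, key_len)
    return -alibi_slopes(num_heads)[:, None, None] * distance


# Stand-in attention logits for 8 heads over a length-5 sequence.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 5, 5))
bias = alibi_bias(5, 5, 8)
biased_logits = logits + bias
print(biased_logits.shape)  # (8, 5, 5)
```

Each head uses a different slope, so some heads penalise distance sharply while others attend almost uniformly over the sequence.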





In [None]:
import math
from einops import rearrange, repeat, reduce


class AlibiPositionalBias(layers.Layer):
    def __init__(self, heads, total_heads):
        super(AlibiPositionalBias,self).__init__()
        self.heads = heads
        self.total_heads = total_heads
        slopes = self._get_slopes(heads)
        slopes = tf.convert_to_tensor(slopes, dtype=tf.float32)
        slopes = rearrange(slopes, 'h -> h 1 1')
        self.slopes =slopes
        self.bias=None

    def get_bias(self, i, j):
        i_arange = tf.range(j - i, j)
        j_arange = tf.range(j)
        bias = -tf.math.abs(rearrange(j_arange, 'j -> 1 1 j') - rearrange(i_arange, 'i -> 1 i 1'))
        return bias
    @staticmethod
    def _get_slopes(heads):
        def get_slopes_power_of_2(n):
            start = (2**(-2**-(math.log2(n)-3)))
            ratio = start
            return [start*ratio**i for i in range(n)]

        if math.log2(heads).is_integer():
            return get_slopes_power_of_2(heads)

        closest_power_of_2 = 2 ** math.floor(math.log2(heads))
        return get_slopes_power_of_2(closest_power_of_2) + get_slopes_power_of_2(2 * closest_power_of_2)[0::2][:heads-closest_power_of_2]


    def call(self, i, j):
        h = self.total_heads

        if self.bias is not None and self.bias.shape[-1] >= j:
            return self.bias[..., :i, :j]

        bias = self.get_bias(i, j)
        bias = tf.cast(bias,dtype=tf.float32) * self.slopes
        self.bias = bias

        return self.bias

class LearnedAlibiPositionalBias(AlibiPositionalBias):
    def __init__(self, heads, total_heads):
        super(LearnedAlibiPositionalBias,self).__init__(heads, total_heads)
        log_slopes = tf.math.log(self.slopes)
        self.learned_logslopes = tf.Variable(log_slopes)

    def call(self, i, j):
        h = self.heads

        def get_slopes(param):
            return tf.math.exp(param)

        if self.bias is not None and self.bias.shape[-1] >= j:
            bias = self.bias[..., :i, :j]
        else:
            bias = self.get_bias(i, j)
            self.bias=bias

        slopes = get_slopes(self.learned_logslopes)
        bias = tf.cast(bias,dtype=tf.float32) * slopes

        return bias

In [None]:
token_id_input = keras.Input(
    shape=(None,),
    dtype="int32",
    name="token_ids",
)
embed = keras.layers.Embedding(
    input_dim=len(vocab), output_dim=64
)(token_id_input)
outputs = embed + LearnedAlibiPositionalBias(1, 32)(512, 64)
outputs = keras_nlp.layers.TransformerEncoder(
    num_heads=2,
    intermediate_dim=128,
    dropout=0.1,
)(outputs)
outputs = keras.layers.Dense(2)(outputs[:, 0, :])
model = keras.Model(
    inputs=token_id_input,
    outputs=outputs,
)

model.summary()

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 token_ids (InputLayer)      [(None, None)]            0         
                                                                 
 embedding_2 (Embedding)     (None, None, 64)          1226880   
                                                                 
 tf.__operators__.add_2 (TFO  (None, 512, 64)          0         
 pLambda)                                                        
                                                                 
 transformer_encoder_2 (Tran  (None, 512, 64)          33472     
 sformerEncoder)                                                 
                                                                 
 tf.__operators__.getitem_2   (None, 64)               0         
 (SlicingOpLambda)                                               
                                                           

In [None]:
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.experimental.AdamW(5e-5),
    metrics=keras.metrics.SparseCategoricalAccuracy(),
    jit_compile=True,
)
model.fit(
    imdb_preproc_train_ds,
    validation_data=imdb_preproc_val_ds,
    epochs=3,
)