## Set Transformer Model


This notebook implements the [Set Transformer](https://arxiv.org/abs/1810.00825) model to predict yield spreads. The model takes a sequence of recent trades as input and predicts the yield spread.

In [1]:
import pandas as pd
import numpy as np
from data_preparation import process_data


from google.cloud import bigquery
import os
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import LayerNormalization, Dense
from tensorflow.keras import activations
from tensorflow.keras import backend as K
from tensorflow.keras import initializers
from tensorflow import repeat

import matplotlib.pyplot as plt


from IPython.display import display, HTML

Setting the seed for the Keras layer initializer. Any layer that uses this initializer starts from the same weights on every run, which makes experiments easier to reproduce.

In [2]:
layer_initializer = initializers.RandomNormal(mean=0.0, stddev=0.1, seed=10)

Initializing the BigQuery client

In [3]:
bq_client = bigquery.Client()

### Checking if GPU is available

In [4]:
tf.executing_eagerly()

True

In [5]:
tf.test.is_gpu_available()

Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.


True

#### Hyper-parameters for the model
The batch size and learning rate affect how smoothly the model converges. The larger the batch size, the smoother the convergence. A larger batch size generally calls for a higher learning rate, and vice versa.
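One common heuristic for pairing the two is linear scaling: when the batch size is multiplied by some factor, the learning rate is multiplied by the same factor. The sketch below is only an illustration of that heuristic; `scale_learning_rate` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Linear scaling heuristic: the learning rate grows in proportion to the batch size."""
    return base_lr * batch_size / base_batch_size


# Reference point: the learning rate and batch size used in this notebook.
reference_lr = 0.001
reference_batch_size = 1000

for batch_size in (250, 1000, 4000):
    lr = scale_learning_rate(reference_lr, reference_batch_size, batch_size)
    print(f"batch_size={batch_size:5d} -> learning_rate={lr:.5f}")
```

The heuristic is a starting point for tuning rather than a guarantee; the best pairing still has to be found empirically.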


SECONDS_AGO_FEATURE decides the scale of the seconds-ago feature, which can be on either a logarithmic scale or a square-root scale. Setting the parameter to None removes the feature from the model input.
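To make the difference between the two scales concrete, the sketch below compares them on a few elapsed times. The function name `scale_seconds_ago`, the option strings, the clamping of negative values, and the use of `log1p` (so that zero seconds maps to zero) are assumptions for illustration; the actual transformation lives in data_preparation.py.

```python
import math
from typing import Optional


def scale_seconds_ago(seconds: float, mode: Optional[str]) -> Optional[float]:
    """Rescale an elapsed time in seconds; return None when the feature is disabled."""
    if mode is None:
        return None  # feature removed from the model input
    seconds = max(seconds, 0.0)  # assumption: negative gaps are clamped to zero
    if mode == "log":
        return math.log1p(seconds)
    if mode == "sqrt":
        return math.sqrt(seconds)
    raise ValueError(f"unknown mode: {mode!r}")


for seconds in (60, 3600, 86400):  # one minute, one hour, one day
    log_value = scale_seconds_ago(seconds, "log")
    sqrt_value = scale_seconds_ago(seconds, "sqrt")
    print(seconds, round(log_value, 2), round(sqrt_value, 2))
```

The logarithmic scale compresses long gaps much more strongly than the square-root scale, which matters when trade gaps range from seconds to months.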

In [6]:
# Training Parameters
TRAIN_TEST_SPLIT = 0.85
LEARNING_RATE = 0.001
NUM_EPOCHS = 200


# Model Parameters
SEQUENCE_LENGTH = 5
NUM_FEATURES= 8
EMBED_DIM = 100
SEED_DIM = 1
FF_DIM = 300
DROPOUT_RATE = 0.0
NUM_HEADS = 1
BATCH_SIZE = 1000
INDUCED_POINTS = 5

#### Query to fetch data from BigQuery

The SQL query uses the trade history with reference data view. The `recent` field is an aggregated array of the 32 most recent previous trades in the same CUSIP. The array contains the yield spreads, size, trade direction, and the seconds elapsed from the most recent trade.

In [7]:
DATA_QUERY = """SELECT
                    rtrs_control_number,
                    yield_spread,
                    par_traded,
                    trade_type,
                    recent
                FROM
                    `eng-reactor-287421.primary_views.trade_history_with_reference_data`
                WHERE
                    trade_date >= '2021-01-01'
                AND 
                    trade_date < '2021-04-01'
                AND 
                    yield_spread IS NOT NULL
                LIMIT
                    100000
                """

### Data Preparation

We grab the data from BigQuery and convert it into a format suitable for input to the model. The driver function for this is process_data. All the data processing functions have been moved to the data_preparation.py file.

The data query returns a table with a nested column, called recent, which contains the 32 most recent previous trades in the same CUSIP. The **fetch_data** function executes the query and loads the result from BigQuery as a data frame.

The aggregated arrays are stored as a list of dictionaries. The **tradeDictToList** function extracts the yield spread, the size, the type, and the seconds elapsed from each dictionary and stores them as a list. During extraction, we apply a few blunt normalizations: the yield spreads are converted from percentage points to basis points, the size of the trade is scaled down by dividing it by 1000, and the seconds ago are converted to a logarithmic scale. If the seconds ago are negative (because the trade time is after the publish time), the function adds a zero to the list instead.
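The normalizations described above can be sketched as follows. This is not the code from data_preparation.py: the function name `trade_to_features`, the dictionary keys, and the choice of the natural logarithm are assumptions, and the trade type is left out to keep the example short.

```python
import math


def trade_to_features(trade: dict) -> list:
    """Turn one trade record into [spread_bps, size_thousands, log_seconds_ago]."""
    spread_bps = trade["yield_spread"] * 100.0      # percentage points -> basis points
    size_thousands = trade["par_traded"] / 1000.0   # scale the trade size down
    seconds = trade["seconds_ago"]
    # Non-positive gaps (e.g. trade time after publish time) are replaced by zero.
    log_seconds = math.log(seconds) if seconds > 0 else 0.0
    return [spread_bps, size_thousands, log_seconds]


# Hypothetical trade record, with made-up values.
example = {"yield_spread": 0.25, "par_traded": 50000.0, "seconds_ago": 3600.0}
print(trade_to_features(example))
```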

Trades that do not have sufficient history are padded with zeros. The **pad_trade_history** function pads the beginning of the trade history with zeros so that the length of the list equals the required trade history length.
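A minimal sketch of this left-padding, assuming each trade is a flat list of features; `pad_history` is a hypothetical stand-in and not the notebook's **pad_trade_history** itself.

```python
def pad_history(history: list, sequence_length: int, num_features: int) -> list:
    """Left-pad a trade history with all-zero rows up to sequence_length."""
    missing = max(sequence_length - len(history), 0)
    padding = [[0.0] * num_features for _ in range(missing)]
    return padding + history


short_history = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
padded = pad_history(short_history, sequence_length=4, num_features=3)
print(padded)
```

Padding at the beginning keeps the real trades at the end of the list, so slicing the last few entries later still returns the most recent trades.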

| RTRS Control Number | trade_history                                          | yield_spread    |
|:-------------------:|---------------------------------------------------|-----------|
| 2021031700698100    | [[5.997612444822997, 50.0, 0.0, 0.0, 71.458333... | 20.281352 |

In [8]:
%%time
train_dataframe, test_dataframe = process_data(DATA_QUERY,bq_client, SEQUENCE_LENGTH, NUM_FEATURES, TRAIN_TEST_SPLIT)

Negative seconds ago
Negative seconds ago
Negative seconds ago
Negative seconds ago
Number of training Samples 60735


|   | rtrs_control_number | yield_spread | par_traded | trade_type | trade_history |
|:-:|:-------------------:|--------------|------------|:----------:|---------------|
| 0 | 2021031905029000 | -85.964339 | 10000.0 | D | [[-105.964338945754, 4000.0, 0.0, 1.0, 6.51471... |
| 1 | 2021022408792800 | -110.901862 | 20000.0 | S | [[-106.70186192700702, 50.0, 0.0, 0.0, 8.85537... |
| 3 | 2021010803866300 | 131.576567 | 250000.0 | S | [[131.57656694241902, 50.0, 0.0, 0.0, 7.638679... |
| 4 | 2021012107657000 | 23.645575 | 5000.0 | D | [[19.945575447686004, 60.0, 0.0, 1.0, 9.595534... |
| 5 | 2021020403211200 | 33.796133 | 500000.0 | S | [[40.59613290938899, 500.0, 0.0, 0.0, 6.553933... |


CPU times: user 30.1 s, sys: 1.49 s, total: 31.5 s
Wall time: 1min 14s


In [9]:
train_dataframe.to_pickle('temp.pkl')

In [10]:
# Shuffling the train set
train_dataframe = train_dataframe.sample(frac=1)
display(train_dataframe)

|   | rtrs_control_number | yield_spread | par_traded | trade_type | trade_history |
|:-:|:-------------------:|--------------|------------|:----------:|---------------|
| 50968 | 2021030304195600 | -1.158774 | 15000.000000000 | P | [[-23.276521795774997, 45.0, 1.0, 0.0, 16.2097... |
| 50009 | 2021021706957400 | -73.736782 | 100000.000000000 | S | [[40.89679316399801, 5.0, 0.0, 0.0, 15.8116282... |
| 60830 | 2021011200212700 | -78.169731 | 30000.000000000 | S | [[-56.597830429389006, 50.0, 1.0, 0.0, 15.3530... |
| 64373 | 2021021004647000 | -68.236276 | 120000.000000000 | P | [[-51.591652363448006, 10.0, 0.0, 0.0, 15.5271... |
| 93723 | 2021012102923100 | 507.245575 | 5000.000000000 | S | [[-60.94004651733801, 10.0, 0.0, 1.0, 15.52422... |
| ... | ... | ... | ... | ... | ... |
| 95957 | 2021032501092400 | -49.538834 | 25000.000000000 | D | [[-75.228787439416, 230.0, 0.0, 0.0, 16.029777... |
| 34279 | 2021010504015800 | -48.002388 | 40000.000000000 | D | [[0.0, 0.0, 0.0, 0.0, 0.0, 40.0, 0.0, 0.0], [0... |
| 40383 | 2021022307154600 | -99.171869 | 10000.000000000 | D | [[0.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0], [-... |
| 42090 | 2021032307080100 | -105.009387 | 50000.000000000 | S | [[-85.628787439416, 100.0, 1.0, 0.0, 16.013239... |


We keep only the SEQUENCE_LENGTH most recent trades from each trade history. By default, the views have 32 trades in the trade history.

In [11]:
train_dataframe.trade_history = train_dataframe.trade_history.apply(lambda x: x[-SEQUENCE_LENGTH:])
test_dataframe.trade_history = test_dataframe.trade_history.apply(lambda x:x[-SEQUENCE_LENGTH:])

Converting the dataframe columns to NumPy arrays so that the data can be fed into the input layer of the model

In [13]:
train_data = np.stack(train_dataframe.trade_history.to_numpy())
target = train_dataframe.yield_spread.to_numpy()

test_data = np.stack(test_dataframe.trade_history.to_numpy())
test_target =  test_dataframe.yield_spread.to_numpy()

In [14]:
print(train_data.shape)
print(target.shape)
print(test_data.shape)

(60735, 5, 8)
(60735,)
(10523, 5, 8)
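The slicing and stacking steps can be checked on toy data. The sketch below uses made-up arrays in place of the trade_history column and assumes, as in the cells above, that the most recent trades sit at the end of each history.

```python
import numpy as np

SEQUENCE_LENGTH = 5  # same values as in the hyper-parameter cell
NUM_FEATURES = 8

# Three toy histories of 32 trades each, standing in for the trade_history column.
histories = [
    np.arange(32 * NUM_FEATURES, dtype=float).reshape(32, NUM_FEATURES) + i
    for i in range(3)
]

# Keep the last SEQUENCE_LENGTH trades of each history, then stack along a new batch axis.
recent = [h[-SEQUENCE_LENGTH:] for h in histories]
batch = np.stack(recent)

print(batch.shape)
```

The resulting array has the layout (samples, sequence length, features), which is the same layout as the shapes printed above.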


## Set Transformers

Set Transformers are attention-based neural networks that are designed to model interactions among elements in the input sets. The model consists of an encoder and decoder, both of which rely on attention mechanisms. 

### MultiHead Attention

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. The multi-head attention layer linearly projects the queries, keys, and values $h$ times (once per head), using different learned linear projections to $d_k$, $d_k$, and $d_v$ dimensions, respectively. On each of these projected versions of the queries, keys, and values, the layer then applies the attention function, yielding $d_v$-dimensional output values. These are concatenated and projected once more, resulting in the final values.

The particular attention that multi-head attention calculates is called scaled dot-product attention. The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of each query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.

\begin{equation}
    \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\Bigg(\frac{QK^{\top}}{\sqrt{d_k}}\Bigg)V
\end{equation}
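The formula can be checked numerically on small matrices. The following is a NumPy sketch for a single unbatched, unmasked head; the helper names and the random inputs are illustrative only.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray):
    """Scaled dot-product attention for q [n, d_k], k [m, d_k], v [m, d_v]."""
    d_k = k.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k))  # [n, m], each row sums to 1
    return weights @ v, weights                # output is [n, d_v]


rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))  # 2 queries
k = rng.normal(size=(3, 4))  # 3 keys
v = rng.normal(size=(3, 5))  # 3 values

output, weights = attention(q, k, v)
print(output.shape, weights.shape)
print(weights.sum(axis=-1))
```

Each output row is a weighted average of the value rows, with one output row per query.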

In [15]:
def scaled_dot_product_attention(q, k, v, mask):
    '''
    Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v.

    The optional mask marks positions to ignore (for example, padding tokens).
    Masked positions receive a large negative logit, so their attention weight is close to zero.
    '''
    matmul_qk = tf.matmul(q, k, transpose_b=True)
    
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1) 
    output = tf.matmul(attention_weights, v)  

    return output, attention_weights


class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0

        self.depth = d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth).
        Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, q, k, v, mask=None):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)  
        k = self.wk(k)  
        v = self.wv(v)

        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size) 
        v = self.split_heads(v, batch_size)

        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention,
                                        perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

        concat_attention = tf.reshape(scaled_attention,
                                      (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output
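The reshape and transpose used to split and re-merge the heads can be traced in NumPy. The sizes below are arbitrary illustrative values, not the notebook's hyper-parameters.

```python
import numpy as np

batch_size, seq_len, d_model, num_heads = 2, 5, 12, 3
depth = d_model // num_heads  # per-head dimension

x = np.arange(batch_size * seq_len * d_model, dtype=float).reshape(batch_size, seq_len, d_model)

# Split: [b, n, d_model] -> [b, n, h, depth] -> [b, h, n, depth]
heads = x.reshape(batch_size, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

# Merge: [b, h, n, depth] -> [b, n, h, depth] -> [b, n, d_model]
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

print(heads.shape, merged.shape)
print(np.array_equal(merged, x))
```

Each head sees a contiguous slice of the feature dimension, and merging the heads recovers the original layout.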

RFF is a class that implements a row-wise feed-forward network, i.e., it processes each element of the set independently and identically. Here it consists of three dense layers with ReLU activations.

In [16]:
class RFF(tf.keras.layers.Layer):
    """
    Row-wise FeedForward layers.
    """

    def __init__(self, d):
        super(RFF, self).__init__()

        self.linear_1 = Dense(d, activation='relu')
        self.linear_2 = Dense(d, activation='relu')
        self.linear_3 = Dense(d, activation='relu')

    def call(self, x):
        """
        Arguments:
            x: a float tensor with shape [b, n, d].
        Returns:
            a float tensor with shape [b, n, d].
        """
        return self.linear_3(self.linear_2(self.linear_1(x)))
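"Row-wise" means that reordering the elements of the set simply reorders the outputs. The NumPy sketch below illustrates this with a single dense layer and random weights, which stand in for the three trained layers of RFF.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
weights = rng.normal(size=(d, d))
bias = rng.normal(size=(d,))


def row_wise_layer(x: np.ndarray) -> np.ndarray:
    """One dense layer with ReLU, applied to the last axis only."""
    return np.maximum(x @ weights + bias, 0.0)


x = rng.normal(size=(3, d))        # a set of 3 elements
permutation = np.array([2, 0, 1])

# Permuting the rows before or after the layer gives the same result.
same = np.allclose(row_wise_layer(x[permutation]), row_wise_layer(x)[permutation])
print(same)
```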

### Multiheaded Attention Block
The multiheaded attention block (MAB) is an adaptation of the encoder block of the Transformer without positional encoding and dropout. The implementation of the block is derived from the Keras implementation of the encoder block for the Transformer. MABs are the building blocks of the set attention model: we define the Set Attention Block and the Pooling Multiheaded Attention block using the MAB.

MABs can be defined as follows:

\begin{equation}
MAB(X,Y) = LayerNorm(H+FF(H))
\end{equation}
\begin{equation}
H = LayerNorm(X + MultiheadAttention(X,Y,Y))
\end{equation}

Here X provides the queries, and Y provides both the keys and the values, so the output has one row per element of X. FF in the above equation is a row-wise feed-forward network.

In [26]:
class MultiHeadAttentionBlock(tf.keras.layers.Layer):
    def __init__(self, embed_dim: int, num_heads: int, rff: RFF):
        super(MultiHeadAttentionBlock, self).__init__()
        self.multihead = MultiHeadAttention(embed_dim, num_heads)
        self.layer_norm1 = LayerNormalization(epsilon=1e-6, dtype='float32')
        self.layer_norm2 = LayerNormalization(epsilon=1e-6, dtype='float32')
        self.rff = rff

    def call(self, x, y):
        """
        Arguments:
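            x: a float tensor with shape [b, n, d], used as the queries.
            y: a float tensor with shape [b, m, d], used as the keys and values.
        Returns:
            a float tensor with shape [b, n, d].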
        """

        h = self.layer_norm1(x + self.multihead(x, y, y))
        return self.layer_norm2(h + self.rff(h))
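The two residual steps of the block can be traced in NumPy. This is a simplified sketch under stated assumptions: a single head, no learned query/key/value projections, layer normalization without learned scale and shift, and a single random dense layer `w_ff` in place of the RFF.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def mab(x, y, w_ff):
    """Simplified single-head MAB: x is [n, d] (queries), y is [m, d] (keys and values)."""
    attended = softmax(x @ y.T / np.sqrt(x.shape[-1])) @ y  # [n, d]
    h = layer_norm(x + attended)                            # attention residual + norm
    return layer_norm(h + np.maximum(h @ w_ff, 0.0))        # feed-forward residual + norm


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6))   # 2 query elements
y = rng.normal(size=(5, 6))   # 5 key/value elements
w_ff = rng.normal(size=(6, 6))

out = mab(x, y, w_ff)
print(out.shape)
```

The output has as many rows as `x`, regardless of how many elements `y` contains, and it does not depend on the order of the rows of `y`.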

### Set Attention Block
A Set Attention Block (SAB) takes a set and performs self-attention between the elements in the set, resulting in a set of equal size. The output of a SAB contains pairwise interactions between the elements of the set. We can stack multiple SABs to encode higher-order interactions.

SABs can be defined as

\begin{equation}
SAB(X) := MAB(X,X)
\end{equation}

In [27]:
class SetAttentionBlock(tf.keras.layers.Layer):
    def __init__(self, d: int, h: int, rff: RFF):
        super(SetAttentionBlock, self).__init__()
        self.mab = MultiHeadAttentionBlock(d, h, rff)

    def call(self, x):
        """
        Arguments:
            x: a float tensor with shape [b, n, d].
        Returns:
            a float tensor with shape [b, n, d].
        """
        return self.mab(x, x)
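Because a SAB attends from the set to itself, reordering the input set reorders the output in the same way (permutation equivariance). The sketch below checks this in NumPy for a stripped-down self-attention step with identity projections and no layer normalization or feed-forward part, which are simplifying assumptions.

```python
import numpy as np


def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention with identity projections; x is [n, d]."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return x + weights @ x  # residual connection; the set size stays n


rng = np.random.default_rng(1)
x = rng.normal(size=(4, 3))
permutation = rng.permutation(4)

# Reordering the set reorders the output rows in the same way.
equivariant = np.allclose(self_attention(x[permutation]), self_attention(x)[permutation])
print(self_attention(x).shape, equivariant)
```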

### Induced Set Attention Block

We could use SABs for the set encoder block, but they can be slow to train because the forward pass of a SAB is $O(n^2)$ in the set size $n$. We therefore use the Induced Set Attention Block (ISAB), which bypasses this problem. Along with the set $X \in \mathbb{R}^{n\times d}$, the ISAB defines $m$ $d$-dimensional vectors $I \in \mathbb{R}^{m\times d}$, which the authors call inducing points. Inducing points are part of the ISAB itself, and they are trainable parameters that are learned along with the other parameters of the network. An ISAB with $m$ inducing points $I$ is defined as:

\begin{equation}
ISAB_{m}(X) = MAB(X,H) \in \mathbb{R}^{n\times d}
\end{equation}
\begin{equation}
H = MAB(I,X) \in \mathbb{R}^{m\times d}
\end{equation}
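The cost difference can be made concrete by counting attention scores, that is, query-key pairs. This is a rough count under the assumption that the attention score matrices dominate the cost; the helper names are illustrative.

```python
def sab_score_count(n: int) -> int:
    """Attention scores computed by one SAB: every element attends to every element."""
    return n * n


def isab_score_count(n: int, m: int) -> int:
    """Attention scores computed by one ISAB: MAB(I, X) plus MAB(X, H)."""
    return m * n + n * m


m = 5  # number of inducing points, as in INDUCED_POINTS
for n in (5, 32, 1000):
    print(f"n={n:5d}  SAB={sab_score_count(n):8d}  ISAB={isab_score_count(n, m):8d}")
```

By this count the ISAB only computes fewer scores once $n > 2m$. With a sequence length of 5 and 5 inducing points, as configured in this notebook, a SAB computes fewer scores than an ISAB, so the saving would appear only with longer trade histories.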

In [28]:
class InducedSetAttentionBlock(tf.keras.layers.Layer):
    def __init__(self, d: int, m: int, h: int, rff1: RFF, rff2: RFF):
        """
        Arguments:
            d: an integer, input dimension.
            m: an integer, number of inducing points.
            h: an integer, number of heads.
            rff1, rff2: modules, row-wise feedforward layers.
                It takes a float tensor with shape [b, n, d] and
                returns a float tensor with the same shape.
        """
        super(InducedSetAttentionBlock, self).__init__()
        self.mab1 = MultiHeadAttentionBlock(d, h, rff1)
        self.mab2 = MultiHeadAttentionBlock(d, h, rff2)
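        # Register the inducing points as a trainable weight so the optimizer updates them;
        # a plain tensor from tf.random.normal would stay fixed during training.
        self.inducing_points = self.add_weight(name='inducing_points',
                                               shape=(1, m, d),
                                               initializer=initializers.RandomNormal(mean=0.0, stddev=1.0),
                                               trainable=True)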

    def call(self, x):
        """
        Arguments:
            x: a float tensor with shape [b, n, d].
        Returns:
            a float tensor with shape [b, n, d].
        """
        b = tf.shape(x)[0]
        p = self.inducing_points
        p = repeat(p, (b), axis=0)  # shape [b, m, d]

        h = self.mab1(p, x)  # shape [b, m, d]
        return self.mab2(x, h)

### Pooling Multihead Attention

Set Transformers aggregate features by applying multihead attention on a learnable set of $k$ seed vectors. Let $S \in \mathbb{R}^{k\times d}$ be the seed vectors and $Z \in \mathbb{R}^{n \times d}$ be the set of features constructed by the encoder. Then PMA is defined as follows, where ff is a row-wise feed-forward network applied to $Z$:

\begin{equation}
PMA_k(Z) = MAB(S,ff(Z))
\end{equation}

In [29]:
class PoolingMultiHeadAttention(tf.keras.layers.Layer):

    def __init__(self, embed_dim: int, num_seeds: int, num_heads: int, rff: RFF, rff_s: RFF):
        """
        Arguments:
            embed_dim: an integer, input dimension.
            num_seeds: an integer, number of seed vectors.
            num_heads: an integer, number of heads.
            rff: a module, row-wise feedforward layers used inside the MAB.
                It takes a float tensor with shape [b, n, d] and
                returns a float tensor with the same shape.
            rff_s: a module, row-wise feedforward layers applied to the
                input set before pooling.
        """
        super(PoolingMultiHeadAttention, self).__init__()
        self.mab = MultiHeadAttentionBlock(embed_dim, num_heads, rff)
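        # Register the seed vectors as a trainable weight so that they are learned.
        self.seed_vectors = self.add_weight(name='seed_vectors',
                                            shape=(1, num_seeds, embed_dim),
                                            initializer=initializers.RandomNormal(mean=0.0, stddev=1.0),
                                            trainable=True)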
        self.rff_s = rff_s

    @tf.function
    def call(self, z):
        """
        Arguments:
            z: a float tensor with shape [batch, seq_length, embed_dim].
        Returns:
            a float tensor with shape [batch, num_seeds, embed_dim]
        """
        b = tf.shape(z)[0]
        s = self.seed_vectors
        s = repeat(s, (b), axis=0)  # shape [b, k, d]
        return self.mab(s, self.rff_s(z))
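The pooling step reduces a set of n feature vectors to k vectors, and the result does not depend on the order of the set. The NumPy sketch below shows only the attention part of the PMA, with random seed vectors and without the feed-forward layers, residual connections, and layer normalization; these omissions are simplifying assumptions.

```python
import numpy as np


def pool_by_attention(seeds: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Attend from k seed vectors [k, d] over set features [n, d]; returns [k, d]."""
    scores = seeds @ z.T / np.sqrt(z.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ z


rng = np.random.default_rng(2)
seeds = rng.normal(size=(1, 4))  # k = 1 seed vector
z = rng.normal(size=(5, 4))      # n = 5 encoded trades

pooled = pool_by_attention(seeds, z)
shuffled = pool_by_attention(seeds, z[rng.permutation(5)])

print(pooled.shape, np.allclose(pooled, shuffled))
```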

### Combining the blocks 
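The encoder embeds each trade with a dense layer and then applies two induced set attention blocks: Encoder(X) = ISAB(ISAB(Dense(X)))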

In [30]:
class STEncoder(tf.keras.layers.Layer):
    def __init__(self, embed_dim=12, num_induct=6, num_heads=6):
        super(STEncoder, self).__init__()

        # Embedding part
        self.linear_1 = Dense(embed_dim, activation='relu')

        # Encoding part
        self.isab_1 = InducedSetAttentionBlock(embed_dim, num_induct, num_heads, RFF(embed_dim), RFF(embed_dim))
        self.isab_2 = InducedSetAttentionBlock(embed_dim, num_induct, num_heads, RFF(embed_dim), RFF(embed_dim))

    def call(self, x):
        return self.isab_2(self.isab_1(self.linear_1(x)))

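The decoder pools the encoded set, applies a SAB to the pooled vectors, and maps the result to the output: Decoder(Z) = FF(SAB(PMA(Z)))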

In [31]:
class STDecoder(tf.keras.layers.Layer):
    def __init__(self, out_dim, embed_dim=12, num_heads=2, num_seeds=8):
        super(STDecoder, self).__init__()

        self.PMA = PoolingMultiHeadAttention(embed_dim,
                                             num_seeds,
                                             num_heads,
                                             RFF(embed_dim),
                                             RFF(embed_dim))
        
        self.SAB = SetAttentionBlock(embed_dim,
                                     num_heads,
                                     RFF(embed_dim))
        
        self.output_mapper = Dense(out_dim)
        self.num_seeds, self.embed_dim= num_seeds, embed_dim

    def call(self, x):
        decoded_vec = self.SAB(self.PMA(x))
        decoded_vec = tf.reshape(decoded_vec, [-1, self.num_seeds * self.embed_dim])
        return tf.reshape(self.output_mapper(decoded_vec), (tf.shape(decoded_vec)[0],))


The SetTransformer class combines the encoder and decoder.

In [40]:
class SetTransformer(tf.keras.Model):
    def __init__(self, encoder_d=4, num_induct=3, encoder_h=2, out_dim=1, decoder_d=4, decoder_h=2, num_seeds=1):
        super(SetTransformer, self).__init__()
        self.basic_encoder = STEncoder(embed_dim=encoder_d, num_induct=num_induct, num_heads=encoder_h)
        self.basic_decoder = STDecoder(out_dim=out_dim, embed_dim=decoder_d, num_heads=decoder_h, num_seeds=num_seeds)

    def call(self, x):
        enc_output = self.basic_encoder(x)  # (batch_size, set_len, d_model)
        return self.basic_decoder(enc_output)

In [41]:
model = SetTransformer(encoder_d=EMBED_DIM,
                       num_induct=INDUCED_POINTS,
                       encoder_h=NUM_HEADS,
                       out_dim=1,
                       decoder_d=EMBED_DIM,
                       decoder_h=NUM_HEADS,
                       num_seeds=1)

In [34]:
model.compile(optimizer=keras.optimizers.Adam(learning_rate=LEARNING_RATE),
              loss=keras.losses.MeanSquaredError(),
              metrics=[keras.metrics.MeanAbsoluteError()])

In [35]:
%time history = model.fit(train_data, target, epochs=NUM_EPOCHS, batch_size=BATCH_SIZE,  verbose=2, validation_split=0.2)

Epoch 1/200
49/49 - 10s - loss: 85626.4375 - mean_absolute_error: 79.9746 - val_loss: 59929.3906 - val_mean_absolute_error: 73.2783
Epoch 2/200
49/49 - 1s - loss: 84529.1953 - mean_absolute_error: 74.6564 - val_loss: 59188.8711 - val_mean_absolute_error: 69.2395
Epoch 3/200
49/49 - 1s - loss: 83728.6016 - mean_absolute_error: 70.5281 - val_loss: 58427.7852 - val_mean_absolute_error: 64.9037
Epoch 4/200
49/49 - 1s - loss: 82913.8281 - mean_absolute_error: 66.1963 - val_loss: 57746.1367 - val_mean_absolute_error: 61.7698
Epoch 5/200
49/49 - 1s - loss: 82208.9766 - mean_absolute_error: 62.8082 - val_loss: 57138.0000 - val_mean_absolute_error: 60.3688
Epoch 6/200
49/49 - 1s - loss: 81882.6406 - mean_absolute_error: 60.2832 - val_loss: 56513.3711 - val_mean_absolute_error: 56.3256
Epoch 7/200
49/49 - 1s - loss: 80885.9453 - mean_absolute_error: 57.4510 - val_loss: 56051.9648 - val_mean_absolute_error: 54.0676
Epoch 8/200
49/49 - 1s - loss: 80295.9141 - mean_absolute_error: 55.9921 - val_los
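
The `49/49` in the log is the number of batches per epoch, and it follows from the sample count, the validation split, and the batch size used above. The arithmetic below is a quick check and is not part of the training code.

```python
import math

num_samples = 60735       # training rows reported by process_data
validation_split = 0.2
batch_size = 1000

# Keras holds out the trailing fraction of the arrays for validation.
num_train = int(num_samples * (1 - validation_split))
num_val = num_samples - num_train
steps_per_epoch = math.ceil(num_train / batch_size)

print(num_train, num_val, steps_per_epoch)
```

Because the held-out rows are taken from the end of the arrays, shuffling the training frame beforehand (as done earlier) matters for getting a representative validation set.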

In [36]:
_, mae = model.evaluate(test_data, test_target, verbose=1)
print(f"Test MAE: {round(mae, 3)}")

Test MAE: 31.47
