# Timeseries classification with a Transformer model

This is the Transformer architecture from [Attention Is All You Need](https://arxiv.org/abs/1706.03762), applied to timeseries instead of natural language.

This example requires TensorFlow 2.4 or higher.

## Load the dataset

We use the same dataset (FordA) and preprocessing as the TimeSeries Classification from Scratch example.

In [1]:
import numpy as np


def read_ucr_dataset(filename):
    root_url = "https://raw.githubusercontent.com/hfawaz/cd-diagram/master/FordA/"
    data = np.loadtxt(root_url + filename, delimiter="\t")
    y = data[:, 0]
    x = data[:, 1:]
    return x, y.astype(int)


x_train, y_train = read_ucr_dataset("FordA_TRAIN.tsv")
x_test, y_test = read_ucr_dataset("FordA_TEST.tsv")


### Standardise the data

Our timeseries already share a single length (500). However, the values of raw timeseries usually span different ranges, which is not ideal for a neural network; in general, input values should be normalized. For this specific dataset, the data is already z-normalized: each timeseries sample has a mean of zero and a standard deviation of one. This type of normalization is very common for timeseries classification problems; see Bagnall et al. (2016).
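For reference, here is a minimal sketch of per-sample z-normalization in NumPy. It is not needed for FordA, which ships already normalized; the `z_normalize` helper, the `eps` guard, and the toy data are assumptions made for illustration.

```python
import numpy as np


def z_normalize(series, eps=1e-8):
    """Scale each row (one timeseries) to zero mean and unit standard deviation."""
    mean = series.mean(axis=1, keepdims=True)
    std = series.std(axis=1, keepdims=True)
    # eps guards against division by zero for constant series.
    return (series - mean) / (std + eps)


# Toy data standing in for raw readings: 3 series of length 500.
rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=3.0, size=(3, 500))
normalized = z_normalize(raw)

print(normalized.mean(axis=1))  # close to 0 for every series
print(normalized.std(axis=1))   # close to 1 for every series
```

Normalizing each series on its own statistics, rather than on statistics of the whole training set, means no information leaks between the training and test splits.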

In [2]:
# Note that the timeseries data used here are univariate, meaning we only have one channel 
# per timeseries example. We will therefore transform the timeseries into a multivariate 
# one with one channel using a simple reshaping via numpy. This will allow us to construct 
# a model that is easily applicable to multivariate time series.
print(x_train.shape, x_test.shape)
x_train = x_train.reshape((x_train.shape[0], x_train.shape[1], 1))
x_test = x_test.reshape((x_test.shape[0], x_test.shape[1], 1))
print(x_train.shape, x_test.shape)

# Finally, in order to use `sparse_categorical_crossentropy`, we will have to count the 
# number of classes beforehand.
num_classes = len(np.unique(y_train))

# Now we shuffle the training set because we will be using the validation_split option 
# later when training.
idx = np.random.permutation(len(x_train))
x_train = x_train[idx]
y_train = y_train[idx]
print(x_train.shape, x_test.shape)

# Map the labels from {-1, 1} to {0, 1}: `sparse_categorical_crossentropy` expects class indices starting at 0.
y_train[y_train == -1] = 0
y_test[y_test == -1] = 0

x_train.shape, y_train.shape, x_test.shape, y_test.shape


(3601, 500) (1320, 500)
(3601, 500, 1) (1320, 500, 1)
(3601, 500, 1) (1320, 500, 1)


((3601, 500, 1), (3601,), (1320, 500, 1), (1320,))
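
Shuffling matters here because Keras's `validation_split` holds out the last fraction of the samples, in the order they appear in the arrays. The sketch below mimics that slicing in NumPy on toy data. `split_like_validation_split` is a hypothetical helper written for illustration, not a Keras function, and the exact rounding Keras applies may differ.

```python
import numpy as np


def split_like_validation_split(x, y, validation_split):
    """Hold out the last fraction of the samples, without shuffling."""
    num_val = int(round(len(x) * validation_split))
    split_at = len(x) - num_val
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])


# Toy labels sorted by class, as an unshuffled file might be.
x_toy = np.arange(10).reshape(10, 1)
y_toy = np.array([0] * 8 + [1] * 2)

(_, y_fit), (_, y_val) = split_like_validation_split(x_toy, y_toy, 0.2)
print(np.unique(y_fit), np.unique(y_val))
```

With sorted labels, the held-out part contains a single class, so the validation metrics would say little about the model. Permuting the training set first avoids this.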

## Build the model

Our model processes a tensor of shape `(batch size, sequence length, features)`, where `sequence length` is the number of time steps and `features` is the number of channels per time step (1 for this univariate dataset).

You can replace your classification RNN layers with this one: the inputs are fully compatible!
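
To see why the shapes are compatible, here is a rough NumPy sketch of single-head scaled dot-product self-attention. It is a simplification for illustration only: `keras.layers.MultiHeadAttention` uses several heads, biases, and dropout, and the random projection matrices below are placeholders for learned weights.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v, w_o):
    """Single-head scaled dot-product self-attention over the time axis."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v  # each (batch, steps, head_size)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (batch, steps, steps)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return (weights @ v) @ w_o  # back to (batch, steps, features)


rng = np.random.default_rng(0)
batch, steps, features, head_size = 2, 6, 1, 4
x = rng.normal(size=(batch, steps, features))
w_q, w_k, w_v = (rng.normal(size=(features, head_size)) for _ in range(3))
w_o = rng.normal(size=(head_size, features))

out = self_attention(x, w_q, w_k, w_v, w_o)
print(x.shape, out.shape)
```

The output has the same `(batch, steps, features)` shape as the input, which is what lets the block take the place of a recurrent layer that returns full sequences and be stacked on itself.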

We include residual connections, layer normalization, and dropout. The resulting layer can be stacked multiple times.

The projection layers are implemented through [`keras.layers.Conv1D`](https://keras.io/api/layers/convolution_layers/convolution1d/#conv1d-class).
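
A convolution with `kernel_size=1` looks at one time step at a time, so it applies the same dense projection at every step. The NumPy sketch below checks this equivalence on toy data; the helper name and the weight shapes are assumptions for illustration and do not reproduce Keras internals.

```python
import numpy as np


def conv1d_kernel_size_1(x, kernel, bias):
    """Convolve along time with a width-1 kernel, one time step at a time."""
    batch, steps, _ = x.shape
    out = np.empty((batch, steps, kernel.shape[-1]))
    for t in range(steps):
        # The window at step t holds only x[:, t, :], so the convolution
        # reduces to a matrix product with the same weights at every step.
        out[:, t, :] = x[:, t, :] @ kernel + bias
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 3))    # (batch, steps, features)
kernel = rng.normal(size=(3, 4))  # (features, filters)
bias = rng.normal(size=(4,))

looped = conv1d_kernel_size_1(x, kernel, bias)
dense = x @ kernel + bias  # a dense projection applied to the last axis
print(np.allclose(looped, dense))
```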


In [4]:
from tensorflow import keras


def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0):
    # Attention and Normalization
    x = keras.layers.MultiHeadAttention(
        key_dim=head_size,
        num_heads=num_heads,
        dropout=dropout
    )(inputs, inputs)
    x = keras.layers.Dropout(dropout)(x)
    x = keras.layers.LayerNormalization(epsilon=1e-6)(x)
    res = x + inputs

    # Feed Forward Part
    x = keras.layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(res)
    x = keras.layers.Dropout(dropout)(x)
    x = keras.layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    x = keras.layers.LayerNormalization(epsilon=1e-6)(x)
    return x + res



The main part of our model is now complete. We can stack several of these `transformer_encoder` blocks and then add the final Multi-Layer Perceptron classification head. Besides a stack of `Dense` layers, the head needs to reduce the output tensor of the encoder stack to a single vector of features for each data point in the current batch. A common way to achieve this is to use a pooling layer. For this example, a `GlobalAveragePooling1D` layer is sufficient.
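
The NumPy sketch below shows what global average pooling computes, assuming the documented Keras convention for `data_format`: `"channels_last"` reads the input as `(batch, steps, features)` and averages over the steps, while `"channels_first"` reads it as `(batch, features, steps)` and averages over the last axis. The toy array is illustrative and is not output from the model.

```python
import numpy as np

# Toy encoder output with the shapes of this example: (batch, steps, features).
encoded = np.random.default_rng(0).normal(size=(2, 500, 1))

# "channels_last": average over the steps axis, one value per feature.
pooled_over_steps = encoded.mean(axis=1)

# "channels_first": average over the last axis, one value per entry of axis 1.
pooled_over_last_axis = encoded.mean(axis=2)

print(pooled_over_steps.shape)      # (2, 1)
print(pooled_over_last_axis.shape)  # (2, 500)
```

Under that convention, the `data_format` argument decides which axis is collapsed, so it is worth checking the pooled shape in `model.summary()` against the shape the classification head is meant to receive.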


In [5]:
def build_model(
    input_shape,
    head_size,
    num_heads,
    ff_dim,
    num_transformer_blocks,
    mlp_units,
    dropout=0,
    mlp_dropout=0,
):
    inputs = keras.layers.Input(input_shape)
    x = inputs
    for _ in range(num_transformer_blocks):
        x = transformer_encoder(
            x, head_size, num_heads, ff_dim, dropout
        )
        
    x = keras.layers.GlobalAveragePooling1D(data_format="channels_first")(x)

    for dim in mlp_units:
        x = keras.layers.Dense(dim, activation="relu")(x)
        x = keras.layers.Dropout(mlp_dropout)(x)
    outputs = keras.layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)


## Train and evaluate


In [None]:
input_shape = x_train.shape[1:]

model = build_model(
    input_shape,
    head_size=256,
    num_heads=4,
    ff_dim=4,
    num_transformer_blocks=4,
    mlp_units=[128],
    mlp_dropout=0.4,
    dropout=0.25,
)

model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    metrics=["sparse_categorical_accuracy"],
)
model.summary()

callbacks = [keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)]

model.fit(
    x_train,
    y_train,
    validation_split=0.2,
    epochs=10,
    batch_size=64,
    callbacks=callbacks,
)

model.evaluate(x_test, y_test, verbose=1)




Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 500, 1)]     0           []                               
                                                                                                  
 multi_head_attention (MultiHea  (None, 500, 1)      7169        ['input_1[0][0]',                
 dAttention)                                                      'input_1[0][0]']                
                                                                                                  
 dropout (Dropout)              (None, 500, 1)       0           ['multi_head_attention[0][0]']   
                                                                                                  
 layer_normalization (LayerNorm  (None, 500, 1)      2           ['dropout[0][0]']            