## Abstract

- **[About Competition]**: The [cat-dog pet dataset](https://www.kaggle.com/c/petfinder-pawpularity-score/data) provides both image files and tabular metadata that can be used in training. The task is to predict the **Pawpularity** of pet photos from the raw images and metadata.
- **[About Hybrid External Multi-Head Transformer]**: 
    - In this notebook, we implement the **External Self-Attention Transformer (EAT)** in `TensorFlow.Keras` ([paper](https://arxiv.org/pdf/2105.02358.pdf), 2021). The implementation will also be maintained in the [External-Attention-TensorFlow](https://github.com/innat/External-Attention-TensorFlow) repo. The official PyTorch implementation is [here](https://github.com/MenghaoGuo/EANet). 
    - Specifically, we use **EAT** to build a **hybrid model** with an ImageNet model (diagram below; we choose `resnet` here, but you can use any). We import `resnet` from `tf.keras.applications` and build a new feature extraction model with 2 outputs: the final backbone output (`backbone.output`) and the `conv2_block3_out` layer of `resnet`. 
    - Next, the output of the `conv2_block3_out` layer is passed to the **EAT** model for further training. At the end, following the model architecture, we have 3 outputs to merge, followed by a classification layer. 
- **[About Input Format]**: Since this competition provides both **image data** and **structured data**, we build a multi-input model to use both. In the diagram below, `input 1` is the raw image and `input 2` is the structured data, which is handled by a simple MLP. FYI, for structured data, we can also try [TensorFlow Decision Forests](https://blog.tensorflow.org/2021/05/introducing-tensorflow-decision-forests.html).

![up](https://user-images.githubusercontent.com/17668390/141291222-b3184730-11ba-4e12-bd87-30a85d82e854.png)

- **[About Training]**: We train the model on **TPU** and then run inference on **GPU**. For training details with **TPU** settings, check **Version 2**; the model has already been trained and the weight file saved. Lastly, we inspect the **activation feature maps** from the **External-Transformer** blocks.
- **[About Inference]**: In the **Inference** section, we compute out-of-fold validation, run inference with all trained folds, and combine the results with a simple **Ensemble**. 
- **[(Optional): About RAPIDS SVR]**: Added RAPIDS SVR as a second-stage training approach. Reference [kernel](https://www.kaggle.com/cdeotte/rapids-svr-boost-17-8) - [Discussion](https://www.kaggle.com/c/petfinder-pawpularity-score/discussion/276724). 

In [None]:
import numpy as np 
import pandas as pd 

import os, random
import seaborn as sns
import matplotlib.pyplot as plt 
import random as python_random
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import tensorflow as tf; print(tf.__version__)
import tensorflow_hub as hub
import tensorflow_addons as tfa
from tensorflow.keras import Input, Model, Sequential, layers

# enable XLA JIT compilation
tf.config.optimizer.set_jit(True)
# control GPU RAM growth (skipped silently when no GPU is present)
physical_devices = tf.config.list_physical_devices('GPU')
try: tf.config.experimental.set_memory_growth(physical_devices[0], True)
except Exception: pass 

# for reproducibility
def seed_all(s):
    random.seed(s)  # `random` and `python_random` are the same module
    np.random.seed(s)
    tf.random.set_seed(s)
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
    os.environ['PYTHONHASHSEED'] = str(0)
    
# seed all
SEED  = 1994
sns.set(style="darkgrid")
seed_all(SEED)

In [None]:
# Set DEVICE = 'TPU' for training. Please check version 2. 
DEVICE = 'GPU' 

if DEVICE == "TPU":
    print("connecting to TPU...")
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
        print('Running on TPU ', tpu.master())
    except ValueError:
        print("Could not connect to TPU")
        tpu = None
    if tpu:
        try:
            print("initializing  TPU ...")
            tf.config.experimental_connect_to_cluster(tpu)
            tf.tpu.experimental.initialize_tpu_system(tpu)
            strategy = tf.distribute.experimental.TPUStrategy(tpu)
            print("TPU initialized")
        except Exception:
            print("failed to initialize TPU")
            # fall back so that `strategy` is still defined below
            DEVICE = "GPU"
    else:
        DEVICE = "GPU"

if DEVICE != "TPU":
    print("Using default strategy for CPU and single GPU")
    strategy = tf.distribute.get_strategy()

if DEVICE == "GPU":
    physical_devices = tf.config.list_physical_devices('GPU')
    print("Num GPUs Available: ", len(physical_devices))
    for gpu in physical_devices:
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
        except RuntimeError:
            # memory growth must be set before the GPU is initialized
            pass
    
    
AUTO     = tf.data.experimental.AUTOTUNE
REPLICAS = strategy.num_replicas_in_sync
print(f'REPLICAS: {REPLICAS}')

In [None]:
IMG_SIZE   = 896 
BATCH_SIZE = 10
EPOCHS     = 5

# Folders
if DEVICE == "TPU":
    from kaggle_datasets import KaggleDatasets
    DATA_DIR = '/kaggle/input/petfinder-pawpularity-score/'
    GCS_PATH  = KaggleDatasets().get_gcs_path('petfinder-pawpularity-score')
    TRAIN_DIR = GCS_PATH + '/train/'
    TEST_DIR  = GCS_PATH + '/test/'
else:
    DATA_DIR  = '/kaggle/input/petfinder-pawpularity-score/'
    TRAIN_DIR = DATA_DIR + 'train/'
    TEST_DIR  = DATA_DIR + 'test/'

# Configured Competition Data

In [None]:
from sklearn.model_selection import StratifiedKFold

FOLDS = 10
# Load Train Data
train_df = pd.read_csv(f'{DATA_DIR}train.csv')
train_df['Id'] = train_df['Id'].apply(lambda x: f'{TRAIN_DIR}{x}.jpg')
# Set a specific label to be able to perform stratification
train_df['stratify_label'] = pd.qcut(train_df['Pawpularity'], q = 30, labels = range(30))
# Label value to be used for feature model 'classification' training.
train_df['target_value'] = train_df['Pawpularity'] / 100.
# list of feature that will be second input along with image input
dense_features = [
    'Subject Focus', 'Eyes', 'Face', 'Near', 'Action', 'Accessory',
    'Group', 'Collage', 'Human', 'Occlusion', 'Info', 'Blur'
]
kfold = StratifiedKFold(n_splits = FOLDS,  shuffle = True, random_state = SEED)
train_df['kfold'] = -1
for fold, (train_index, val_index) in enumerate(kfold.split(train_df.index,
                                                            train_df['stratify_label'])):
    train_df.loc[val_index, 'kfold'] = fold 
    
# save or not
train_df.to_csv("train_df_folds.csv", index=False)
display(train_df.head(3))

# test set 
test_df = pd.read_csv(f'{DATA_DIR}test.csv')
test_df['Id'] = test_df['Id'].apply(lambda x: f'{TEST_DIR}{x}.jpg')
test_df['Pawpularity'] = 0
display(test_df.head(3))
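
The fold assignment above bins the continuous `Pawpularity` target into quantiles so that a stratified splitter can balance it across folds. The idea can be sketched with NumPy alone; this is a simplified stand-in on synthetic scores (round-robin dealing within each bin), not the algorithm `StratifiedKFold` actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(1, 101, size=1000)   # synthetic stand-in for Pawpularity
n_bins, n_folds = 30, 10

# Quantile edges give roughly equal-sized bins, similar in spirit to pd.qcut.
edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1)[1:-1])
bins = np.digitize(scores, edges)

# Deal the members of each bin round-robin across the folds.
fold = np.empty(len(scores), dtype=int)
for b in np.unique(bins):
    idx = rng.permutation(np.flatnonzero(bins == b))
    fold[idx] = np.arange(len(idx)) % n_folds

# Each fold's mean score should sit close to the global mean.
for f in range(n_folds):
    print(f, round(float(scores[fold == f].mean()), 2))
```

Because every fold draws from every score bin, each validation fold sees a similar target distribution.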

# `tf.data` API for Multi-Input

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE  
def build_augmenter(is_labelled):
    def augment(img):
        img = tf.image.random_flip_left_right(img)
        img = tf.image.random_flip_up_down(img)
        img = tf.image.random_saturation(img, 0.95, 1.05)
        img = tf.image.random_brightness(img, 0.05)
        img = tf.image.random_contrast(img, 0.95, 1.05)
        img = tf.image.random_hue(img, 0.05)
        return img
    
    def augment_with_labels(path, label):
        image, feature = path 
        augment_image = augment(image)
        return (augment_image, feature), label
    
    def augment_without_labels(path, _):
        image, feature = path 
        augment_image = augment(image)
        return (augment_image, feature), None
    
    return augment_with_labels if is_labelled else augment_without_labels

In [None]:
def build_decoder(is_labelled):
    def decode(path):
        file_bytes = tf.io.read_file(path)
        img = tf.image.decode_jpeg(file_bytes, channels = 3)
        img = tf.image.resize(img, (IMG_SIZE, IMG_SIZE)) 
        return tf.divide(img, 255.)
    
    def decode_with_labels(path, label):
        image, feature = path 
        decode_image = decode(image)
        return (decode_image, feature), label 
    
    def decode_without_labels(path, feature):
        decode_image = decode(path)
        return (decode_image, feature), None
    
    return decode_with_labels if is_labelled else decode_without_labels

def create_dataset(df, 
                   batch_size  = 32, 
                   is_labelled = False, 
                   augment     = False,
                   repeat      = False, 
                   shuffle     = False):
    
    decode_fn    = build_decoder(is_labelled)
    augmenter_fn = build_augmenter(is_labelled)
    
    # Create Dataset
    if is_labelled:
        dataset_sample = tf.data.Dataset.from_tensor_slices((df['Id'].values, df[dense_features].values))
        dataset_labels = tf.data.Dataset.from_tensor_slices(df['target_value'].values)
        dataset = tf.data.Dataset.zip((dataset_sample, dataset_labels))
    else:
        dataset = tf.data.Dataset.from_tensor_slices((df['Id'].values, df[dense_features].values))
        
    dataset = dataset.map(decode_fn, num_parallel_calls = AUTOTUNE)
    dataset = dataset.map(augmenter_fn, num_parallel_calls = AUTOTUNE) if augment else dataset
    dataset = dataset.repeat() if repeat else dataset
    dataset = dataset.shuffle(1024 * REPLICAS, reshuffle_each_iteration = True) if shuffle else dataset
    dataset = dataset.batch(batch_size * REPLICAS, drop_remainder=shuffle)
    dataset = dataset.prefetch(AUTOTUNE)
    return dataset

**Check Dataloader**

In [None]:
training_dataset = create_dataset(train_df, 
                                  batch_size  = BATCH_SIZE, 
                                  is_labelled = True, 
                                  augment     = True,
                                  repeat      = False, 
                                  shuffle     = False)
(sample_images, sample_feature), sample_labels = next(iter(training_dataset))
print(sample_images.shape, sample_feature.shape, sample_labels.shape)

import matplotlib.pyplot as plt 
plt.figure(figsize=(16, 10))
for i, (image, label) in enumerate(zip(sample_images[:8], sample_labels[:8])):
    ax = plt.subplot(3, 4, i + 1)
    plt.title(f'{label.numpy()} , Raw: {image.numpy().shape}')
    plt.imshow(image.numpy().squeeze())
    plt.axis("off")

# Modeling

As described at the beginning, we'll implement [External Attention](https://arxiv.org/pdf/2105.02358.pdf) in `TensorFlow.Keras` and then integrate it into an ImageNet model (here, we pick `resnet`). A full description of the model is beyond the scope of this code example. [Here](https://github.com/MenghaoGuo/EANet) is the official PyTorch implementation, which we'll try to rebuild.  

<img width="756" alt="ea" src="https://user-images.githubusercontent.com/17668390/141291708-7c3cd892-d508-4cca-8306-a8b06a38c158.png">
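
To build intuition before the Keras layers, here is a minimal single-head NumPy sketch of external attention with double normalization, following the order of operations in the official PyTorch code (softmax over the token axis, then L1 normalization over the memory axis). The matrices `m_k` and `m_v` are hypothetical random stand-ins for the learned linear layers; this is an illustration, not the notebook's implementation.

```python
import numpy as np

def external_attention(x, m_k, m_v, eps=1e-9):
    """Single-head external attention on a (num_tokens, dim) input.

    m_k: (dim, k) external key memory; m_v: (k, dim) external value memory.
    Both are hypothetical stand-ins for the learned Dense layers.
    """
    attn = x @ m_k                                        # (num_tokens, k)
    # First normalization: softmax over the token axis.
    attn = np.exp(attn - attn.max(axis=0, keepdims=True))
    attn = attn / attn.sum(axis=0, keepdims=True)
    # Second normalization: L1 over the memory axis.
    attn = attn / (eps + attn.sum(axis=1, keepdims=True))
    return attn @ m_v, attn

rng = np.random.default_rng(0)
tokens, dim, k = 6, 8, 4
x = rng.normal(size=(tokens, dim))
out, attn = external_attention(
    x, rng.normal(size=(dim, k)), rng.normal(size=(k, dim))
)
print(out.shape, attn.shape)
```

Unlike self-attention, the memories do not depend on the input, so the cost grows linearly with the number of tokens.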


## Implement the patch extraction and encoding layer

In [None]:
class PatchEmbed(tf.keras.layers.Layer):
    def __init__(self, img_size=(224, 224), patch_size=(4, 4),  embed_dim=96):
        super().__init__(name='patch_embed')
        patches_resolution = [img_size[0] // patch_size[0], 
                              img_size[1] // patch_size[1]]
        self.img_size = img_size
        self.patch_size = patch_size
        self.patches_resolution = patches_resolution
        self.num_patches = patches_resolution[0] * patches_resolution[1]
        self.embed_dim = embed_dim
        self.proj = layers.Conv2D(embed_dim, 
                                  kernel_size=patch_size, 
                                  strides=patch_size, name='proj')
        self.norm = layers.LayerNormalization(epsilon=1e-5, name='norm')
     
    def call(self, x):
        B, H, W, C = x.get_shape().as_list()
        x = self.proj(x)
        x = tf.reshape(
            x, shape=[-1, 
                      (H // self.patch_size[0]) * (W // self.patch_size[1]), 
                      self.embed_dim]
        )
        x = self.norm(x)
        return x
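
As a quick sanity check on sequence length, the arithmetic below traces the shapes from the input image to the token sequence. It assumes the `conv2_block3_out` feature map of ResNet50 is downsampled by a factor of 4 relative to the input; the other numbers are the notebook's settings.

```python
img_size = 896        # IMG_SIZE used in this notebook
backbone_stride = 4   # assumption: conv2_block3_out is 4x smaller than the input
patch_size = 4
embed_dim = 128

feature_hw = img_size // backbone_stride      # spatial size fed to PatchEmbed
tokens_per_side = feature_hw // patch_size    # patches along one side
num_tokens = tokens_per_side ** 2             # sequence length seen by the encoder

print((feature_hw, feature_hw), num_tokens, (num_tokens, embed_dim))
```

Under these assumptions each image becomes a sequence of 3136 tokens of width 128.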

## Implement the External Attention Block

In [None]:
class ExternalAttention(layers.Layer):
    def __init__(self, dim, num_heads, dim_coefficient = 4, 
                 attention_dropout = 0,  projection_dropout = 0, 
                 **kwargs):
        super(ExternalAttention, self).__init__(name= 'ExternalAttention', **kwargs)
        self.dim       = dim 
        self.num_heads = num_heads 
        self.dim_coefficient    = dim_coefficient
        self.attention_dropout  = attention_dropout
        self.projection_dropout = projection_dropout
        
        k = 256 // dim_coefficient
        self.trans_dims = layers.Dense(dim * dim_coefficient)
        self.linear_0 = layers.Dense(k)
        self.linear_1 = layers.Dense(dim * dim_coefficient // num_heads)
        self.proj = layers.Dense(dim)
    
        self.attn_drop  = layers.Dropout(attention_dropout)
        self.proj_drop  = layers.Dropout(projection_dropout)
        
    def call(self, inputs, return_attention_scores=False, training=None):
        num_patch = tf.shape(inputs)[1]
        channel   = tf.shape(inputs)[2]
        x = self.trans_dims(inputs)
        x = tf.reshape(x, shape=(-1, 
                                 num_patch, 
                                 self.num_heads,
                                 self.dim * self.dim_coefficient // self.num_heads))
        x = tf.transpose(x, perm=[0, 2, 1, 3])
        
        # a linear layer M_k
        attn = self.linear_0(x)
        # normalize attention map: softmax over the patch axis
        attn = tf.nn.softmax(attn, axis=2)
        # double normalization: L1-normalize over the memory axis
        attn = attn / (1e-9 + tf.reduce_sum(attn, axis=-1, keepdims=True))
        attn_drop = self.attn_drop(attn, training=training)
        
        # a linear layer M_v
        attn_dense = self.linear_1(attn_drop)
        x = tf.transpose(attn_dense, perm=[0, 2, 1, 3])
        x = tf.reshape(x, [-1, num_patch, self.dim * self.dim_coefficient])
        # a linear layer to project original dim
        x = self.proj(x)
        x = self.proj_drop(x, training=training)
  
        if return_attention_scores:
            return x, attn
        else:
            return x 
        
    def get_config(self):
        config = {
            'dim'                : self.dim,
            'num_heads'          : self.num_heads,
            'dim_coefficient'    : self.dim_coefficient,
            'attention_dropout'  : self.attention_dropout,
            'projection_dropout' : self.projection_dropout
        }
        base_config = super(ExternalAttention, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

## Implement the MLP block

In [None]:
class MLP(layers.Layer):
    def __init__(self, mlp_dim, embedding_dim=None, 
                 act_layer=tf.nn.gelu, drop_rate=0.2, **kwargs):
        super(MLP, self).__init__(name='MLP', **kwargs)
        self.fc1  = layers.Dense(mlp_dim, activation=act_layer)
        self.fc2  = layers.Dense(embedding_dim)
        self.drop = layers.Dropout(drop_rate)

    def call(self, inputs, training=None):
        x = self.fc1(inputs)
        x = self.drop(x, training=training)
        x = self.fc2(x)
        x = self.drop(x, training=training)
        return x

## Implement the Transformer block

In [None]:
patch_size       = 4
num_heads        = 4
embedding_dim    = 128
mlp_dim          = 64
dim_coefficient  = 4
num_patches      = (IMG_SIZE // 4 // patch_size) ** 2  # informational only: conv2_block3_out is 4x downsampled
attention_dropout   = 0.2
projection_dropout  = 0.2
num_ext_transformer_blocks = 10

In [None]:
class AttentionEncoder(layers.Layer):
    def __init__(self, embedding_dim, 
                 mlp_dim, num_heads, 
                 dim_coefficient,  
                 attention_dropout,  
                 projection_dropout, 
                 get_attention_matrix=False,
                 **kwargs):
        super(AttentionEncoder, self).__init__(**kwargs)
        self.embedding_dim = embedding_dim
        self.mlp_dim   = mlp_dim
        self.num_heads = num_heads
        self.dim_coefficient    = dim_coefficient
        self.attention_dropout  = attention_dropout
        self.projection_dropout = projection_dropout
        self.get_attention_matrix = get_attention_matrix
        self.mlp = MLP(mlp_dim, embedding_dim)
        
        self.etn = ExternalAttention(
            embedding_dim,
            num_heads,
            dim_coefficient,
            attention_dropout,
            projection_dropout
        )
    
    def call(self, inputs):
        residual_1 = inputs 
        x, ext_attention_scores = self.etn(inputs, return_attention_scores=True) 
        x = layers.add([x, residual_1])
        residual_2 = x
        x = self.mlp(x)
        x = layers.add([x, residual_2])
        
        if self.get_attention_matrix:
            return x, ext_attention_scores
        else:
            return x 
    
    def get_config(self):
        config = {
            'embedding_dim'     : self.embedding_dim,
            'mlp_dim'           : self.mlp_dim,
            'num_heads'         : self.num_heads,
            'dim_coefficient'   : self.dim_coefficient,
            'attention_dropout' : self.attention_dropout,
            'projection_dropout': self.projection_dropout,
            'get_attention_matrix': self.get_attention_matrix
        }
        base_config = super(AttentionEncoder, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

## ImageNet-Model + External Attention Transformer 

![up](https://user-images.githubusercontent.com/17668390/141291222-b3184730-11ba-4e12-bd87-30a85d82e854.png)

In [None]:
def get_model(plot_model, print_summary, with_compile):
    # multi-input 
    image_inputs  = layers.Input((IMG_SIZE, IMG_SIZE, 3))
    feature_input =  layers.Input((len(dense_features),))

    # base model: ImageNet backbone (replace with yours)
    # multi-output 
    backbone = tf.keras.applications.ResNet50(
        include_top=False,
        weights=None,
        input_tensor=layers.Input((IMG_SIZE, IMG_SIZE, 3)),
    )
    multi_op_backbone = Model(
        backbone.input, 
        [
            backbone.get_layer('conv2_block3_out').output, # for transformer blocks 
            backbone.output # for cnn blocks 
        ]
    )
    mid_y, last_y = multi_op_backbone(image_inputs)
    
    # Transformer Blocks 
    patchedx = PatchEmbed(
        img_size=(224, 224),
        patch_size=(patch_size, patch_size),
        embed_dim=embedding_dim
    )(mid_y)
    
    x = patchedx
    for _ in range(num_ext_transformer_blocks):
        x, attn_weight_matrix = AttentionEncoder(
            embedding_dim,
            mlp_dim,
            num_heads,
            dim_coefficient,
            attention_dropout,
            projection_dropout,
            get_attention_matrix = True
        )(x)

    # end layers : for transformer head 
    tail_1 = Sequential(
        [
            layers.GlobalAveragePooling1D(),
            layers.Dropout(0.5),
            layers.BatchNormalization()
        ], name='tail_1'
    )
    
    # end layers : for cnn head 
    tail_2 = Sequential(
        [
            layers.GlobalAveragePooling2D(),
            layers.Dropout(0.5),
        ], name='tail_2'
    )
    
    # end layers : head layers for feature input (simple mlp)
    tail_3 = Sequential(
        [
            layers.Dense(32,  activation='selu'),
            layers.Dense(64,  activation='selu'),
            layers.Dense(128, activation='selu'),
            layers.Dropout(0.2),
        ], name='tail_3'
    )
    
    # bring all together 
    cating = tf.concat(
        [
            tail_1(x), 
            tail_2(last_y), 
            tail_3(feature_input)
        ], 
        axis=-1
    )
    classifier = layers.Dense(1, activation = 'sigmoid')(cating)
    
    # construct the DAG graph 
    model = Model([image_inputs, feature_input], classifier)

    # plotting 
    if plot_model:
        display(tf.keras.utils.plot_model(model, 
                                          show_shapes=True, 
                                          show_layer_names=True,  
                                          expand_nested=False))
    # overall summary  
    if print_summary:
        model.summary()
        
    # compiling 
    if with_compile:
        model.compile(
            optimizer = optimizers.Adam(), 
            loss = losses.BinaryCrossentropy(), 
            metrics = [metrics.RootMeanSquaredError('rmse')])  

    return model 

In [None]:
model = get_model(plot_model    = True,  
                  print_summary = True, 
                  with_compile  = False)

# Training Modules

In [None]:
from tensorflow.keras import losses, optimizers , metrics
from tensorflow.keras import callbacks

def get_lr_callback(batch_size=8):
    lr_start   = 0.000005
    lr_max     = 0.00000125 * batch_size * REPLICAS
    lr_min     = 0.000001
    lr_ramp_ep = 5
    lr_sus_ep  = 0
    lr_decay   = 0.8
    def lrfn(epoch):
        if epoch < lr_ramp_ep:
            lr = (lr_max - lr_start) / lr_ramp_ep * epoch + lr_start
        elif epoch < lr_ramp_ep + lr_sus_ep:
            lr = lr_max
        else:
            lr = (lr_max - lr_min) * lr_decay**(epoch - lr_ramp_ep - lr_sus_ep) + lr_min
        return lr
    return callbacks.LearningRateScheduler(lrfn, verbose=True)


# Set Callbacks
def model_checkpoint(fold):
    return callbacks.ModelCheckpoint(f'feature_model_{fold}.h5',
                                              verbose = 1, 
                                              monitor = 'val_rmse', 
                                              mode  = 'min', 
                                              save_weights_only = True,
                                              save_best_only    = True)
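
The schedule in `get_lr_callback` can be inspected without Keras. The sketch below copies the same constants into a standalone function (the name `lr_schedule` and the `replicas=1` default are illustrative assumptions) and prints the learning rate per epoch.

```python
def lr_schedule(epoch, batch_size=10, replicas=1):
    """Linear ramp-up followed by exponential decay, mirroring lrfn above."""
    lr_start, lr_min = 5e-6, 1e-6
    lr_max = 1.25e-6 * batch_size * replicas
    ramp_epochs, sustain_epochs, decay = 5, 0, 0.8
    if epoch < ramp_epochs:
        return (lr_max - lr_start) / ramp_epochs * epoch + lr_start
    if epoch < ramp_epochs + sustain_epochs:
        return lr_max
    return (lr_max - lr_min) * decay ** (epoch - ramp_epochs - sustain_epochs) + lr_min

for epoch in range(8):
    print(epoch, f"{lr_schedule(epoch):.2e}")
```

Note that with `EPOCHS = 5` and a 5-epoch ramp, a run stays entirely in the ramp-up phase and stops just before the peak rate would be reached.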

In [None]:
training_fold = 0 # 1st fold training: total fold 10
print(f'\nFold {training_fold}\n')

df = pd.read_csv("./train_df_folds.csv")
df_train = df[df.kfold != training_fold].reset_index(drop=True)
df_valid = df[df.kfold == training_fold].reset_index(drop=True)

with strategy.scope():
    # Create Model
    model = get_model(plot_model=False, print_summary=False, with_compile=True)

training_dataset = create_dataset(df_train, 
                                  batch_size  = BATCH_SIZE, 
                                  is_labelled = True, 
                                  repeat      = True, 
                                  shuffle     = True)
validation_dataset = create_dataset(df_valid, 
                                    batch_size  = BATCH_SIZE, 
                                    is_labelled = True,
                                    repeat      = True, 
                                    shuffle     = False)

# Training runs only on TPU. 
# The current IMG_SIZE is too large for other devices.
if DEVICE == "TPU":
    # Fit Model
    history = model.fit(training_dataset,
                        epochs = EPOCHS,
                        steps_per_epoch  = df_train.shape[0] // (BATCH_SIZE * REPLICAS),
                        validation_steps = df_valid.shape[0] // (BATCH_SIZE * REPLICAS),
                        callbacks = [model_checkpoint(training_fold),  get_lr_callback(BATCH_SIZE)],
                        validation_data = validation_dataset,
                        verbose = 1)   
    # Validation Information
    best_val_rmse = min(history.history['val_rmse'])
    print(f'\nValidation RMSE: {best_val_rmse}\n')
else:
    # load trained weights 
    model.load_weights(f'../input/pet-test-wg/ext_attn_wg/feature_model_{training_fold}.h5')

# Plot Learning Curve

In [None]:
if DEVICE == "TPU":
    plt.figure(figsize=(19,6))

    plt.subplot(131)
    plt.plot(history.epoch, history.history["loss"], label="Train loss")
    plt.plot(history.epoch, history.history["val_loss"], label="Valid loss")
    plt.legend()

    plt.subplot(132)
    plt.plot(history.epoch, history.history["rmse"], label="Train RMSE")
    plt.plot(history.epoch, history.history["val_rmse"], label="Valid RMSE")
    plt.legend()

    plt.subplot(133)
    lr = history.history["lr"]
    plt.plot(history.epoch, lr, '-o')
    plt.xlabel('Epoch',size=14)
    plt.ylabel('Learning Rate',size=14)
    plt.show()

# Activation Maps of External MHA Transformer Blocks 

In [None]:
# build model with transformer blocks only 
transformer_feature_blocks = [layer.output for layer in model.layers if isinstance(layer, AttentionEncoder)]
trans_act_model = Model(inputs=model.inputs, outputs=transformer_feature_blocks)

# only transformer blocks
tf.keras.utils.plot_model(trans_act_model, show_shapes=True, show_layer_names=True)

In [None]:
(sample_img, sample_feat), sample_gt = next(iter(validation_dataset))
print(sample_img.shape, sample_feat.shape)

plt.figure(figsize=(10, 10))
plt.axis('off')
plt.imshow(sample_img[5])
plt.show()

In [None]:
feature_maps = trans_act_model.predict(
    (
        tf.expand_dims(sample_img[5],  axis=0),
        tf.expand_dims(sample_feat[5], axis=0)
    )
)

feats_maps  = np.array(list(zip(*feature_maps))[0]).squeeze(axis=1) ; print(feats_maps.shape)
attn_weight = np.array(list(zip(*feature_maps))[1]).squeeze(axis=1) ; print(attn_weight.shape)

for i, feature_map in enumerate(feats_maps):
    print('ExtTransformerBlock ', i)
    num_sequence, channel = feature_map.shape
    height = width = int(np.sqrt(num_sequence))
    feature_map = tf.reshape(feature_map, shape=(-1, height, width, channel))
    ix = 1
    plt.figure(figsize=(25, 25))
    for _ in range(64):
        ax = plt.subplot(10, 10, ix)
        ax.set_xticks([])
        ax.set_yticks([])
        plt.imshow(feature_map[0, :, :, ix - 1], cmap="viridis")
        ix += 1
    plt.show()

# OOF + Ensemble Inference 

- **Classification Head Models**
- **Let's try TTA later.**
- **We'll try [RAPIDS SVR](https://www.kaggle.com/cdeotte/rapids-svr-boost-17-8) next.**

In [None]:
all_weit = '../input/pet-test-wg/ext_attn_wg'
all_rmse = []
all_pred = []

for fold_ in range(df.kfold.nunique())[:len(os.listdir(all_weit))]:
    print('FOLD ', fold_)
    # Get fold-wise samples 
    df_valid = df[df.kfold == fold_].reset_index(drop=True)
    
    # load trained model 
    model = get_model(plot_model = False,  print_summary = False,  with_compile  = True)
    model.load_weights(f'{all_weit}/feature_model_{fold_}.h5')
    
    # get validation dataset to compute out-of-fold score 
    cb_val_set = create_dataset(df_valid, 
                                 batch_size  = BATCH_SIZE,
                                 is_labelled = True, 
                                 augment     = False,
                                 repeat      = False, 
                                 shuffle     = False)
    loss, rmse = model.evaluate(cb_val_set, verbose=1)
    # targets were scaled to [0, 1]; report RMSE on the original 0-100 scale
    all_rmse.append(rmse * 100)
    
    # Inference on test set 
    test_dataset = create_dataset(test_df, 
                                 batch_size  = BATCH_SIZE,
                                 is_labelled = False, 
                                 augment     = False, 
                                 repeat      = False, 
                                 shuffle     = False)
    fold_wise_pred = model.predict(test_dataset, verbose=1)
    fold_wise_pred = [x * 100 for x in fold_wise_pred]
    all_pred.append(fold_wise_pred)
    del model

In [None]:
print('Mean of all RMSE ', np.mean(all_rmse))
test_df['Pawpularity'] = np.mean(np.column_stack(all_pred), axis=1)
test_df = test_df[["Id", "Pawpularity"]].copy()  # copy to avoid SettingWithCopyWarning
test_df['Id'] = test_df.Id.apply(lambda x: x.split('/')[-1].split('.')[0])
test_df.to_csv("submission.csv", index=False)
test_df.head(15)
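
The fold averaging in the cell above can be checked on toy numbers. The sketch below uses made-up per-fold outputs shaped `(num_samples, 1)`, as a Keras `predict` call on a single-unit head would return, and rescales them from the [0, 1] training target back to the 0-100 scale.

```python
import numpy as np

# Made-up sigmoid outputs for 2 test samples from 3 folds.
fold_preds = [
    np.array([[0.31], [0.52]]),
    np.array([[0.35], [0.48]]),
    np.array([[0.33], [0.50]]),
]

# Flatten each fold to (num_samples,), rescale, and stack as columns.
stacked = np.column_stack([p.ravel() * 100 for p in fold_preds])
ensemble = stacked.mean(axis=1)   # simple mean over folds

print(stacked.shape)
print(ensemble)
```

A plain mean weights every fold equally; weighting folds by their validation score is a possible variation that this notebook does not use.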

# [Optional]: RAPIDS SVR 

[Reference.](https://www.kaggle.com/cdeotte/rapids-svr-boost-17-8) - [Discussion](https://www.kaggle.com/c/petfinder-pawpularity-score/discussion/276724). 

In [None]:
import cuml, pickle
from cuml.svm import SVR
print('RAPIDS version',cuml.__version__,'\n')

In [None]:
LOAD_SVR_FROM_PATH = None
df = pd.read_csv('../input/pet-test-wg/train_df_folds.csv')
print('Train shape:', df.shape )

test_df = pd.read_csv(f'{DATA_DIR}test.csv')
test_df['Id'] = test_df['Id'].apply(lambda x: f'{TEST_DIR}{x}.jpg')
test_df['Pawpularity'] = 0
print('Test shape:', test_df.shape )

In [None]:
dnn_test_preds = []
svr_test_preds = []

dnn_val_preds = []
svr_val_preds = []

val_true = []

for fold_ in range(df.kfold.nunique())[:len(os.listdir(all_weit))]:
    print('-'*10)
    print('FOLD',fold_)
    print('-'*10)
    
    # build model 
    model = get_model(plot_model = False,   print_summary = False,  with_compile  = True)
    model.load_weights(f'../input/pet-test-wg/ext_attn_wg/feature_model_{fold_}.h5')
    
    # expose the embedding and prediction layers as outputs 
    new_model = Model(model.input, [model.layers[-2].output, model.output])
    
    # get validation fold on current fold 
    df_valid = df[df.kfold == fold_].reset_index(drop=True)

    name = f"SVR_fold_{fold_}.pkl" 
    if LOAD_SVR_FROM_PATH is None:
        # extract embeddings 
        # get training fold on current fold 
        df_train = df[df.kfold != fold_].reset_index(drop=True)
        training_dataset = create_dataset(df_train,
                                          batch_size  = BATCH_SIZE, 
                                          is_labelled = False, 
                                          augment     = False,
                                          repeat      = False, 
                                          shuffle     = False)
        print('Extracting train embedding...')
        embed, _ = new_model.predict(training_dataset, verbose=1)
        
        print('Fitting SVR...')
        clf = SVR(C=20.0)
        clf.fit(embed.astype('float32'), df_train.Pawpularity.values.astype('int32'))
        pickle.dump(clf, open(name, "wb"))
    else:
        # LOAD RAPIDS SVR 
        print('Loading SVR...',LOAD_SVR_FROM_PATH+name)
        clf = pickle.load(open(LOAD_SVR_FROM_PATH+name, "rb"))
  
    # test set 
    test_dataset = create_dataset(test_df, 
                                 batch_size  = BATCH_SIZE,
                                 is_labelled = False, 
                                 augment     = False, 
                                 repeat      = False, 
                                 shuffle     = False)

    print('Predicting test...')
    # Step 1: Get Classification Prediction and Preceding layer feature / Embeddings 
    dnn_embed, dnn_test_pred = new_model.predict(test_dataset, verbose=1)
    dnn_test_pred = [x * 100 for x in dnn_test_pred]
    
    # Step 2: Pass The Embeddings to RAPIDS-SVR Model 
    svr_test_pred = clf.predict(dnn_embed)
    
    # Step 3: Save 
    dnn_test_preds.append(dnn_test_pred)
    svr_test_preds.append(svr_test_pred)

    # OOF 
    valid_dataset = create_dataset(df_valid,
                                   batch_size  = BATCH_SIZE, 
                                   is_labelled = False, 
                                   augment     = False,
                                   repeat      = False, 
                                   shuffle     = False)
    print('Predicting Out-of-Fold...')
    # Step 1: Get Classification Prediction and Preceding layer feature / Embeddings 
    dnn_embed, dnn_val_pred = new_model.predict(valid_dataset, verbose=1)
    dnn_val_pred = [x * 100 for x in dnn_val_pred]
    
    # Step 2: Pass The Embeddings to RAPIDS-SVR Model 
    svr_val_pred = clf.predict(dnn_embed)    
    
    # Step 3: Save 
    dnn_val_preds.append(dnn_val_pred)
    svr_val_preds.append(svr_val_pred)
    
    # Step 4: Save GT for computing OOF 
    val_true.append(df_valid['Pawpularity'].values)
 
    ##################
    # COMPUTE RMSE (flatten first so (N, 1) and (N,) arrays do not broadcast to (N, N))
    nn_pred  = np.asarray(dnn_val_preds[-1], dtype='float32').ravel()
    svr_pred = np.asarray(svr_val_preds[-1], dtype='float32').ravel()
    
    rmse = np.sqrt( np.mean( (val_true[-1] - nn_pred)**2.0 ) )
    print('NN RMSE =',rmse)
    
    rmse = np.sqrt( np.mean( (val_true[-1] - svr_pred)**2.0 ) )
    print('SVR RMSE =',rmse)
    
    w = 0.5
    oof2 = (1-w)*nn_pred + w*svr_pred
    rmse = np.sqrt( np.mean( (val_true[-1] - oof2)**2.0 ) )
    print('Ensemble RMSE =',rmse,'\n')
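
One detail worth checking when blending network and SVR outputs is array shape: subtracting an `(N, 1)` array from an `(N,)` array silently broadcasts to `(N, N)` and yields a misleading RMSE. The toy sketch below (all values made up, and the helper `rmse` is hypothetical) shows the pitfall and a small sweep over the blend weight `w`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """RMSE after flattening both inputs to 1-D."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([10.0, 50.0, 90.0])
dnn = np.array([[12.0], [48.0], [80.0]])   # (N, 1), like a Keras predict output
svr = np.array([8.0, 55.0, 95.0])          # (N,), like an SVR predict output

print((y_true - dnn).shape)                # (3, 3): silent broadcasting

for w in (0.0, 0.25, 0.5, 0.75, 1.0):
    blend = (1 - w) * dnn.ravel() + w * svr
    print(w, round(rmse(y_true, blend), 3))
```

In practice the weight should be chosen on out-of-fold predictions only, never on the test set.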

# Additional Resources
1. How to use it on my own dataset?
    - First, understand the competition task and its data format, and relate them to your own.
    - Second, run this notebook successfully on the competition data.
    - Lastly, replace the dataset with yours.
2. More Code Examples.
    - [TF.Keras: EfficientNet Hybrid Swin Transformer TPU](https://www.kaggle.com/ipythonx/tf-keras-efficientnet-hybrid-swin-transformer-tpu) - [Discussion](https://www.kaggle.com/c/petfinder-pawpularity-score/discussion/280531).
    - [[TF.Keras]:Learning to Resize Images for ViT Model](https://www.kaggle.com/ipythonx/tf-keras-learning-to-resize-images-for-vit-model) - [Discussion](https://www.kaggle.com/c/petfinder-pawpularity-score/discussion/280438).
    - [Discussion: DOLG Models in TensorFlow 2 (Keras) Implementation](https://www.kaggle.com/c/petfinder-pawpularity-score/discussion/281914) - [TF.Keras Code](https://github.com/innat/DOLG-TensorFlow)