# Image classification with modern MLP models

**Author:** [Khalid Salama](https://www.linkedin.com/in/khalid-salama-24403144/)<br>
**Date created:** 2021/05/30<br>
**Last modified:** 2023/08/03<br>
**Description:** Implementing the MLP-Mixer, FNet, and gMLP models for CIFAR-100 image classification.

## Introduction

This example implements three modern attention-free, multi-layer perceptron (MLP) based models for image
classification, demonstrated on the CIFAR-100 dataset:

1. The [MLP-Mixer](https://arxiv.org/abs/2105.01601) model, by Ilya Tolstikhin et al., based on two types of MLPs.
3. The [FNet](https://arxiv.org/abs/2105.03824) model, by James Lee-Thorp et al., based on unparameterized
Fourier Transform.
2. The [gMLP](https://arxiv.org/abs/2105.08050) model, by Hanxiao Liu et al., based on MLP with gating.

The purpose of the example is not to compare between these models, as they might perform differently on
different datasets with well-tuned hyperparameters. Rather, it is to show simple implementations of their
main building blocks.

## Setup

In [1]:
import numpy as np
import keras
from keras import layers
!pip install tf-keras~=2.16
!pip install keras-cv-attention-module
import keras.ops

[31mERROR: Could not find a version that satisfies the requirement keras-cv-attention-module (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for keras-cv-attention-module[0m[31m
[0m

## Prepare the data

In [2]:
num_classes = 5
input_shape = (256, 256, 3)

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
from PIL import Image
import glob
import numpy as np

filelist1 = glob.glob('/content/drive/MyDrive/TypeCoffee.v25i.folder/test/DuroRiadoRio/*.jpg')

xt_drr = np.array([np.array(Image.open(fname)) for fname in filelist1])
print(xt_drr.shape)
yt_drr = np.zeros((19,1),dtype=np.uint8)

filelist2 = glob.glob('/content/drive/MyDrive/TypeCoffee.v25i.folder/test/Mole/*.jpg')

xt_mole = np.array([np.array(Image.open(fname)) for fname in filelist2])
yt_mole = np.ones((19,1),dtype=np.uint8)

filelist3 = glob.glob('/content/drive/MyDrive/TypeCoffee.v25i.folder/test/Quebrado/*.jpg')

xt_q = np.array([np.array(Image.open(fname)) for fname in filelist3])
yt_q= np.full((20,1),2,dtype=np.uint8)

filelist4 = glob.glob('/content/drive/MyDrive/TypeCoffee.v25i.folder/test/RiadoRio/*.jpg')

xt_rr = np.array([np.array(Image.open(fname)) for fname in filelist4])
yt_rr= np.full ((22,1),3,dtype=np.uint8)

filelist5 = glob.glob('/content/drive/MyDrive/TypeCoffee.v25i.folder/test/RioFechado/*.jpg')

xt_rf = np.array([np.array(Image.open(fname)) for fname in filelist5])
yt_rf= np.full ((20,1),4,dtype=np.uint8)

x_test=np.concatenate((xt_drr,xt_mole,xt_q,xt_rr,xt_rf), axis=0)
y_test=np.concatenate((yt_drr,yt_mole,yt_q,yt_rr,yt_rf), axis=0)

print(x_test.shape)
print(y_test.shape)

(19, 256, 256, 3)
(100, 256, 256, 3)
(100, 1)


In [5]:
filelist6 = glob.glob('/content/drive/MyDrive/TypeCoffee.v25i.folder/train/DuroRiadoRio/*.jpg')

xtrain_drr = np.array([np.array(Image.open(fname)) for fname in filelist6])
ytrain_drr = np.zeros((210,1),dtype=np.uint8)

filelist7 = glob.glob('/content/drive/MyDrive/TypeCoffee.v25i.folder/train/Mole/*.jpg')

xtrain_mole = np.array([np.array(Image.open(fname)) for fname in filelist7])
ytrain_mole = np.ones((215,1),dtype=np.uint8)

filelist8 = glob.glob('/content/drive/MyDrive/TypeCoffee.v25i.folder/train/Quebrado/*.jpg')

xtrain_q = np.array([np.array(Image.open(fname)) for fname in filelist8])
ytrain_q= np.full((206,1),2,dtype=np.uint8)

filelist9 = glob.glob('/content/drive/MyDrive/TypeCoffee.v25i.folder/train/RiadoRio/*.jpg')

xtrain_rr = np.array([np.array(Image.open(fname)) for fname in filelist9])
ytrain_rr= np.full ((212,1),3,dtype=np.uint8)

filelist10 = glob.glob('/content/drive/MyDrive/TypeCoffee.v25i.folder/train/RioFechado/*.jpg')

xtrain_rf = np.array([np.array(Image.open(fname)) for fname in filelist10])
ytrain_rf= np.full ((206,1),4,dtype=np.uint8)

x_train=np.concatenate((xtrain_drr,xtrain_mole,xtrain_q,xtrain_rr,xtrain_rf), axis=0)
y_train=np.concatenate((ytrain_drr,ytrain_mole,ytrain_q,ytrain_rr,ytrain_rf), axis=0)

print(x_train.shape)
print(y_train.shape)

(1049, 256, 256, 3)
(1049, 1)


In [6]:
filelist11 = glob.glob('/content/drive/MyDrive/TypeCoffee.v25i.folder/val/DuroRiadoRio/*.jpg')

xv_drr = np.array([np.array(Image.open(fname)) for fname in filelist11])
yv_drr = np.zeros((13,1),dtype=np.uint8)

filelist12 = glob.glob('/content/drive/MyDrive/TypeCoffee.v25i.folder/val/Mole/*.jpg')

xv_mole = np.array([np.array(Image.open(fname)) for fname in filelist12])
yv_mole = np.ones((11,1),dtype=np.uint8)

filelist13 = glob.glob('/content/drive/MyDrive/TypeCoffee.v25i.folder/val/Quebrado/*.jpg')

xv_q = np.array([np.array(Image.open(fname)) for fname in filelist13])
yv_q= np.full((13,1),2,dtype=np.uint8)

filelist14 = glob.glob('/content/drive/MyDrive/TypeCoffee.v25i.folder/val/RiadoRio/*.jpg')

xv_rr = np.array([np.array(Image.open(fname)) for fname in filelist14])
yv_rr= np.full ((10,1),3,dtype=np.uint8)

filelist15 = glob.glob('/content/drive/MyDrive/TypeCoffee.v25i.folder/val/RioFechado/*.jpg')

xv_rf = np.array([np.array(Image.open(fname)) for fname in filelist15])
yv_rf= np.full ((13,1),4,dtype=np.uint8)

x_val=np.concatenate((xv_drr,xv_mole,xv_q,xv_rr,xv_rf), axis=0)
y_val=np.concatenate((yv_drr,yv_mole,yv_q,yv_rr,yv_rf), axis=0)

print(x_val.shape)
print(y_val.shape)

(60, 256, 256, 3)
(60, 1)


## Configure the hyperparameters

In [7]:
weight_decay = 0.0001
batch_size = 32
num_epochs = 50  # Recommended num_epochs = 50
dropout_rate = 0.2
image_size = 288  # ESTAVA 224 We'll resize input images to this size.
patch_size = 16 # ESTAVA 8 Size of the patches to be extracted from the input images.
num_patches = (image_size // patch_size) ** 2  # Size of the data array.
embedding_dim = 256  # Number of hidden units.
num_blocks = 4  # Number of blocks.

print(f"Image size: {image_size} X {image_size} = {image_size ** 2}")
print(f"Patch size: {patch_size} X {patch_size} = {patch_size ** 2} ")
print(f"Patches per image: {num_patches}")
print(f"Elements per patch (3 channels): {(patch_size ** 2) * 3}")

Image size: 288 X 288 = 82944
Patch size: 16 X 16 = 256 
Patches per image: 324
Elements per patch (3 channels): 768


## Build a classification model

We implement a method that builds a classifier given the processing blocks.

In [8]:

def build_classifier(blocks, positional_encoding=False):
    inputs = layers.Input(shape=input_shape)
    # Augment data.
    augmented = data_augmentation(inputs)
    # Create patches.
    patches = Patches(patch_size)(augmented)
    # Encode patches to generate a [batch_size, num_patches, embedding_dim] tensor.
    x = layers.Dense(units=embedding_dim)(patches)
    if positional_encoding:
        x = x + PositionEmbedding(sequence_length=num_patches)(x)
    # Process x using the module blocks.
    x = blocks(x)
    # Apply global average pooling to generate a [batch_size, embedding_dim] representation tensor.
    representation = layers.GlobalAveragePooling1D()(x)
    # Apply dropout.
    representation = layers.Dropout(rate=dropout_rate)(representation)
    # Compute logits outputs.
    logits = layers.Dense(num_classes)(representation)
    # Create the Keras model.
    return keras.Model(inputs=inputs, outputs=logits)


## Define an experiment

We implement a utility function to compile, train, and evaluate a given model.

In [9]:

def run_experiment(model):
    # Create Adam optimizer with weight decay.
    optimizer = keras.optimizers.AdamW(
        learning_rate=learning_rate,
        weight_decay=weight_decay,
    )
    # Compile the model.
    model.compile(
        optimizer=optimizer,
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[
            keras.metrics.SparseCategoricalAccuracy(name="acc"),
            keras.metrics.SparseTopKCategoricalAccuracy(5, name="top5-acc"),
        ],
    )
    # Create a learning rate scheduler callback.
    reduce_lr = keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=5
    )
    # Create an early stopping callback.
    early_stopping = keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, restore_best_weights=True
    )
    # Fit the model.
    history = model.fit(
        x=x_train,
        y=y_train,
        batch_size=batch_size,
        epochs=num_epochs,
        validation_split=0.1
        #callbacks=[early_stopping, reduce_lr],
    )

    _, accuracy, top_5_accuracy = model.evaluate(x_test, y_test)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")
    print(f"Test top 5 accuracy: {round(top_5_accuracy * 100, 2)}%")

    # Return history to plot learning curves.
    return history


## Use data augmentation

In [10]:
data_augmentation = keras.Sequential(
    [
        layers.Normalization(),
        layers.Resizing(image_size, image_size),
        layers.RandomFlip("horizontal"),
        layers.RandomZoom(height_factor=0.2, width_factor=0.2),
    ],
    name="data_augmentation",
)
# Compute the mean and the variance of the training data for normalization.
data_augmentation.layers[0].adapt(x_train)


## Implement patch extraction as a layer

In [11]:

class Patches(layers.Layer):
    def __init__(self, patch_size, **kwargs):
        super().__init__(**kwargs)
        self.patch_size = patch_size

    def call(self, x):
        patches = keras.ops.image.extract_patches(x, self.patch_size)
        batch_size = keras.ops.shape(patches)[0]
        num_patches = keras.ops.shape(patches)[1] * keras.ops.shape(patches)[2]
        patch_dim = keras.ops.shape(patches)[3]
        out = keras.ops.reshape(patches, (batch_size, num_patches, patch_dim))
        return out


## Implement position embedding as a layer

In [12]:

class PositionEmbedding(keras.layers.Layer):
    def __init__(
        self,
        sequence_length,
        initializer="glorot_uniform",
        **kwargs,
    ):
        super().__init__(**kwargs)
        if sequence_length is None:
            raise ValueError("`sequence_length` must be an Integer, received `None`.")
        self.sequence_length = int(sequence_length)
        self.initializer = keras.initializers.get(initializer)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "sequence_length": self.sequence_length,
                "initializer": keras.initializers.serialize(self.initializer),
            }
        )
        return config

    def build(self, input_shape):
        feature_size = input_shape[-1]
        self.position_embeddings = self.add_weight(
            name="embeddings",
            shape=[self.sequence_length, feature_size],
            initializer=self.initializer,
            trainable=True,
        )

        super().build(input_shape)

    def call(self, inputs, start_index=0):
        shape = keras.ops.shape(inputs)
        feature_length = shape[-1]
        sequence_length = shape[-2]
        # trim to match the length of the input sequence, which might be less
        # than the sequence_length of the layer.
        position_embeddings = keras.ops.convert_to_tensor(self.position_embeddings)
        position_embeddings = keras.ops.slice(
            position_embeddings,
            (start_index, 0),
            (sequence_length, feature_length),
        )
        return keras.ops.broadcast_to(position_embeddings, shape)

    def compute_output_shape(self, input_shape):
        return input_shape


## The MLP-Mixer model

The MLP-Mixer is an architecture based exclusively on
multi-layer perceptrons (MLPs), that contains two types of MLP layers:

1. One applied independently to image patches, which mixes the per-location features.
2. The other applied across patches (along channels), which mixes spatial information.

This is similar to a [depthwise separable convolution based model](https://arxiv.org/abs/1610.02357)
such as the Xception model, but with two chained dense transforms, no max pooling, and layer normalization
instead of batch normalization.

### Implement the MLP-Mixer module

In [13]:

class MLPMixerLayer(layers.Layer):
    def __init__(self, num_patches, hidden_units, dropout_rate, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.mlp1 = keras.Sequential(
            [
                layers.Dense(units=num_patches, activation="gelu"),
                layers.Dense(units=num_patches),
                layers.Dropout(rate=dropout_rate),
            ]
        )
        self.mlp2 = keras.Sequential(
            [
                layers.Dense(units=num_patches, activation="gelu"),
                layers.Dense(units=hidden_units),
                layers.Dropout(rate=dropout_rate),
            ]
        )
        self.normalize = layers.LayerNormalization(epsilon=1e-6)

    def build(self, input_shape):
        return super().build(input_shape)

    def call(self, inputs):
        # Apply layer normalization.
        x = self.normalize(inputs)
        # Transpose inputs from [num_batches, num_patches, hidden_units] to [num_batches, hidden_units, num_patches].
        x_channels = keras.ops.transpose(x, axes=(0, 2, 1))
        # Apply mlp1 on each channel independently.
        mlp1_outputs = self.mlp1(x_channels)
        # Transpose mlp1_outputs from [num_batches, hidden_dim, num_patches] to [num_batches, num_patches, hidden_units].
        mlp1_outputs = keras.ops.transpose(mlp1_outputs, axes=(0, 2, 1))
        # Add skip connection.
        x = mlp1_outputs + inputs
        # Apply layer normalization.
        x_patches = self.normalize(x)
        # Apply mlp2 on each patch independtenly.
        mlp2_outputs = self.mlp2(x_patches)
        # Add skip connection.
        x = x + mlp2_outputs
        return x


### Build, train, and evaluate the MLP-Mixer model

Note that training the model with the current settings on a V100 GPUs
takes around 8 seconds per epoch.

In [14]:
mlpmixer_blocks = keras.Sequential(
    [MLPMixerLayer(num_patches, embedding_dim, dropout_rate) for _ in range(num_blocks)]
)
learning_rate = 0.005
mlpmixer_classifier = build_classifier(mlpmixer_blocks)
history = run_experiment(mlpmixer_classifier)

Epoch 1/50
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m15s[0m 113ms/step - acc: 0.1880 - loss: 4.8992 - top5-acc: 1.0000 - val_acc: 0.0762 - val_loss: 3.2527 - val_top5-acc: 1.0000
Epoch 2/50
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 86ms/step - acc: 0.2833 - loss: 1.8571 - top5-acc: 1.0000 - val_acc: 0.0000e+00 - val_loss: 3.3413 - val_top5-acc: 1.0000
Epoch 3/50
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 86ms/step - acc: 0.3160 - loss: 1.6139 - top5-acc: 1.0000 - val_acc: 0.3429 - val_loss: 1.4922 - val_top5-acc: 1.0000
Epoch 4/50
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 86ms/step - acc: 0.3738 - loss: 1.4654 - top5-acc: 1.0000 - val_acc: 0.2190 - val_loss: 1.5296 - val_top5-acc: 1.0000
Epoch 5/50
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 86ms/step - acc: 0.3877 - loss: 1.4631 - top5-acc: 1.0000 - val_acc: 0.0381 - val_loss: 2.5472 - val_top5-acc: 1.0000
Epoch 6/50
[1m30/30[0m [32

In [15]:
!pip install scikit-learn # Install scikit-learn if you haven't already

from sklearn.metrics import precision_recall_fscore_support # Import the function
import numpy as np

y_true = []
y_pred = []
for images, labels in zip(x_test, y_test):
  predictions = mlpmixer_classifier.predict(np.expand_dims(images, axis=0))  # Predict on a single image
  predicted_label = np.argmax(predictions, axis=1)[0]  # Get the predicted label
  y_true.append(labels[0])  # Assuming label is a single-element array
  y_pred.append(predicted_label)

precision, recall, f1_score, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1_score)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 477ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 36ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3

In [16]:
mlpmixer_classifier.summary()

In [17]:
import os
import time
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Import the preprocess_input function
from keras.applications.imagenet_utils import preprocess_input

# Lista para armazenar tempos de inferência
tempos_de_inferencia = []

# Caminho das imagens para inferência
caminho_dados = "/content/drive/MyDrive/TypeCoffee.v28i.folder/train/Mole/"
imagens = [os.path.join(caminho_dados, nome) for nome in os.listdir(caminho_dados)]

# Make sure you've defined or loaded 'model' before this loop
model = mlpmixer_classifier

for imagem in imagens:
    inicio = time.time()  # Marca o início do processo
   # Load the image and preprocess it
    img = tf.keras.utils.load_img(imagem, target_size=(256, 256))
    img_array = tf.keras.utils.img_to_array(img)
    img_array = tf.expand_dims(img_array, 0)
    img_array = preprocess_input(img_array)  # Apply preprocessing if necessary

    predictions = model.predict(img_array)  # Make predictions

    # You can process 'predictions' here, such as getting the class with highest probability
    predicted_class_index = np.argmax(predictions[0])

    fim = time.time()

    # Calcula o tempo de inferência
    tempos_de_inferencia.append(fim - inicio)

# Cálculo da média dos tempos
media_tempo_inferencia = np.mean(tempos_de_inferencia)

print(f"Média do tempo de inferência: {media_tempo_inferencia:.4f} segundos")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 488ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 43ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 37ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 44ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3

In [29]:
import torch
torch.save(mlpmixer_classifier, 'meu_modelo_mlp_mixer.pt')

The MLP-Mixer model tends to have much less number of parameters compared
to convolutional and transformer-based models, which leads to less training and
serving computational cost.

As mentioned in the [MLP-Mixer](https://arxiv.org/abs/2105.01601) paper,
when pre-trained on large datasets, or with modern regularization schemes,
the MLP-Mixer attains competitive scores to state-of-the-art models.
You can obtain better results by increasing the embedding dimensions,
increasing the number of mixer blocks, and training the model for longer.
You may also try to increase the size of the input images and use different patch sizes.

## The FNet model

The FNet uses a similar block to the Transformer block. However, FNet replaces the self-attention layer
in the Transformer block with a parameter-free 2D Fourier transformation layer:

1. One 1D Fourier Transform is applied along the patches.
2. One 1D Fourier Transform is applied along the channels.

### Implement the FNet module

In [20]:

class FNetLayer(layers.Layer):
    def __init__(self, embedding_dim, dropout_rate, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.ffn = keras.Sequential(
            [
                layers.Dense(units=embedding_dim, activation="gelu"),
                layers.Dropout(rate=dropout_rate),
                layers.Dense(units=embedding_dim),
            ]
        )

        self.normalize1 = layers.LayerNormalization(epsilon=1e-6)
        self.normalize2 = layers.LayerNormalization(epsilon=1e-6)

    def call(self, inputs):
        # Apply fourier transformations.
        real_part = inputs
        im_part = keras.ops.zeros_like(inputs)
        x = keras.ops.fft2((real_part, im_part))[0]
        # Add skip connection.
        x = x + inputs
        # Apply layer normalization.
        x = self.normalize1(x)
        # Apply Feedfowrad network.
        x_ffn = self.ffn(x)
        # Add skip connection.
        x = x + x_ffn
        # Apply layer normalization.
        return self.normalize2(x)


### Build, train, and evaluate the FNet model

Note that training the model with the current settings on a V100 GPUs
takes around 8 seconds per epoch.

In [21]:
fnet_blocks = keras.Sequential(
    [FNetLayer(embedding_dim, dropout_rate) for _ in range(num_blocks)]
)
learning_rate = 0.001
fnet_classifier = build_classifier(fnet_blocks, positional_encoding=True)
history = run_experiment(fnet_classifier)

Epoch 1/50
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 98ms/step - acc: 0.2176 - loss: 1.7589 - top5-acc: 1.0000 - val_acc: 0.0000e+00 - val_loss: 1.6124 - val_top5-acc: 1.0000
Epoch 2/50
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 73ms/step - acc: 0.2302 - loss: 1.6072 - top5-acc: 1.0000 - val_acc: 0.0000e+00 - val_loss: 2.6479 - val_top5-acc: 1.0000
Epoch 3/50
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 71ms/step - acc: 0.2487 - loss: 1.5642 - top5-acc: 1.0000 - val_acc: 0.0000e+00 - val_loss: 2.3133 - val_top5-acc: 1.0000
Epoch 4/50
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 72ms/step - acc: 0.3651 - loss: 1.4611 - top5-acc: 1.0000 - val_acc: 0.3143 - val_loss: 1.3606 - val_top5-acc: 1.0000
Epoch 5/50
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 69ms/step - acc: 0.3786 - loss: 1.4052 - top5-acc: 1.0000 - val_acc: 0.0000e+00 - val_loss: 1.8361 - val_top5-acc: 1.0000
Epoch 6/50
[1m30/

In [22]:
!pip install scikit-learn # Install scikit-learn if you haven't already

from sklearn.metrics import precision_recall_fscore_support # Import the function
import numpy as np

y_true = []
y_pred = []
for images, labels in zip(x_test, y_test):
  predictions = fnet_classifier.predict(np.expand_dims(images, axis=0))  # Predict on a single image
  predicted_label = np.argmax(predictions, axis=1)[0]  # Get the predicted label
  y_true.append(labels[0])  # Assuming label is a single-element array
  y_pred.append(predicted_label)

precision, recall, f1_score, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1_score)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 377ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3

In [None]:
import os
import time
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Import the preprocess_input function
from keras.applications.imagenet_utils import preprocess_input

# Lista para armazenar tempos de inferência
tempos_de_inferencia = []

# Caminho das imagens para inferência
caminho_dados = "/content/drive/MyDrive/TypeCoffee.v28i.folder/train/Mole/"
imagens = [os.path.join(caminho_dados, nome) for nome in os.listdir(caminho_dados)]

# Make sure you've defined or loaded 'model' before this loop
model = fnet_classifier

for imagem in imagens:
    inicio = time.time()  # Marca o início do processo
   # Load the image and preprocess it
    img = tf.keras.utils.load_img(imagem, target_size=(256, 256))
    img_array = tf.keras.utils.img_to_array(img)
    img_array = tf.expand_dims(img_array, 0)
    img_array = preprocess_input(img_array)  # Apply preprocessing if necessary

    predictions = model.predict(img_array)  # Make predictions

    # You can process 'predictions' here, such as getting the class with highest probability
    predicted_class_index = np.argmax(predictions[0])

    fim = time.time()

    # Calcula o tempo de inferência
    tempos_de_inferencia.append(fim - inicio)

# Cálculo da média dos tempos
media_tempo_inferencia = np.mean(tempos_de_inferencia)

print(f"Média do tempo de inferência: {media_tempo_inferencia:.4f} segundos")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 329ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 35ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3

In [23]:
torch.save(fnet_classifier, 'meu_modelo_fnet.pt')

As shown in the [FNet](https://arxiv.org/abs/2105.03824) paper,
better results can be achieved by increasing the embedding dimensions,
increasing the number of FNet blocks, and training the model for longer.
You may also try to increase the size of the input images and use different patch sizes.
The FNet scales very efficiently to long inputs, runs much faster than attention-based
Transformer models, and produces competitive accuracy results.

## The gMLP model

The gMLP is a MLP architecture that features a Spatial Gating Unit (SGU).
The SGU enables cross-patch interactions across the spatial (channel) dimension, by:

1. Transforming the input spatially by applying linear projection across patches (along channels).
2. Applying element-wise multiplication of the input and its spatial transformation.

### Implement the gMLP module

In [24]:

class gMLPLayer(layers.Layer):
    def __init__(self, num_patches, embedding_dim, dropout_rate, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.channel_projection1 = keras.Sequential(
            [
                layers.Dense(units=embedding_dim * 2, activation="gelu"),
                layers.Dropout(rate=dropout_rate),
            ]
        )

        self.channel_projection2 = layers.Dense(units=embedding_dim)

        self.spatial_projection = layers.Dense(
            units=num_patches, bias_initializer="Ones"
        )

        self.normalize1 = layers.LayerNormalization(epsilon=1e-6)
        self.normalize2 = layers.LayerNormalization(epsilon=1e-6)

    def spatial_gating_unit(self, x):
        # Split x along the channel dimensions.
        # Tensors u and v will in the shape of [batch_size, num_patchs, embedding_dim].
        u, v = keras.ops.split(x, indices_or_sections=2, axis=2)
        # Apply layer normalization.
        v = self.normalize2(v)
        # Apply spatial projection.
        v_channels = keras.ops.transpose(v, axes=(0, 2, 1))
        v_projected = self.spatial_projection(v_channels)
        v_projected = keras.ops.transpose(v_projected, axes=(0, 2, 1))
        # Apply element-wise multiplication.
        return u * v_projected

    def call(self, inputs):
        # Apply layer normalization.
        x = self.normalize1(inputs)
        # Apply the first channel projection. x_projected shape: [batch_size, num_patches, embedding_dim * 2].
        x_projected = self.channel_projection1(x)
        # Apply the spatial gating unit. x_spatial shape: [batch_size, num_patches, embedding_dim].
        x_spatial = self.spatial_gating_unit(x_projected)
        # Apply the second channel projection. x_projected shape: [batch_size, num_patches, embedding_dim].
        x_projected = self.channel_projection2(x_spatial)
        # Add skip connection.
        return x + x_projected


### Build, train, and evaluate the gMLP model

Note that training the model with the current settings on a V100 GPUs
takes around 9 seconds per epoch.

In [25]:
gmlp_blocks = keras.Sequential(
    [gMLPLayer(num_patches, embedding_dim, dropout_rate) for _ in range(num_blocks)]
)
learning_rate = 0.003
gmlp_classifier = build_classifier(gmlp_blocks)
history = run_experiment(gmlp_classifier)

Epoch 1/50
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 115ms/step - acc: 0.2189 - loss: 2.3787 - top5-acc: 1.0000 - val_acc: 0.0000e+00 - val_loss: 1.6462 - val_top5-acc: 1.0000
Epoch 2/50
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 92ms/step - acc: 0.2883 - loss: 1.5757 - top5-acc: 1.0000 - val_acc: 0.0000e+00 - val_loss: 2.0635 - val_top5-acc: 1.0000
Epoch 3/50
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 88ms/step - acc: 0.3190 - loss: 1.4735 - top5-acc: 1.0000 - val_acc: 0.0000e+00 - val_loss: 2.1246 - val_top5-acc: 1.0000
Epoch 4/50
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 88ms/step - acc: 0.3563 - loss: 1.4545 - top5-acc: 1.0000 - val_acc: 0.0000e+00 - val_loss: 2.5435 - val_top5-acc: 1.0000
Epoch 5/50
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 86ms/step - acc: 0.3745 - loss: 1.4370 - top5-acc: 1.0000 - val_acc: 0.0000e+00 - val_loss: 2.3114 - val_top5-acc: 1.0000
Epoch 6/50
[

In [26]:
!pip install scikit-learn # Install scikit-learn if you haven't already

from sklearn.metrics import precision_recall_fscore_support # Import the function
import numpy as np

y_true = []
y_pred = []
for images, labels in zip(x_test, y_test):
  predictions = gmlp_classifier.predict(np.expand_dims(images, axis=0))  # Predict on a single image
  predicted_label = np.argmax(predictions, axis=1)[0]  # Get the predicted label
  y_true.append(labels[0])  # Assuming label is a single-element array
  y_pred.append(predicted_label)

precision, recall, f1_score, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1_score)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 437ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 81ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3

In [None]:
import os
import time
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Import the preprocess_input function
from keras.applications.imagenet_utils import preprocess_input

# Lista para armazenar tempos de inferência
tempos_de_inferencia = []

# Caminho das imagens para inferência
caminho_dados = "/content/drive/MyDrive/TypeCoffee.v28i.folder/train/Mole/"
imagens = [os.path.join(caminho_dados, nome) for nome in os.listdir(caminho_dados)]

# Make sure you've defined or loaded 'model' before this loop
model = gmlp_classifier

for imagem in imagens:
    inicio = time.time()  # Marca o início do processo
   # Load the image and preprocess it
    img = tf.keras.utils.load_img(imagem, target_size=(256, 256))
    img_array = tf.keras.utils.img_to_array(img)
    img_array = tf.expand_dims(img_array, 0)
    img_array = preprocess_input(img_array)  # Apply preprocessing if necessary

    predictions = model.predict(img_array)  # Make predictions

    # You can process 'predictions' here, such as getting the class with highest probability
    predicted_class_index = np.argmax(predictions[0])

    fim = time.time()

    # Calcula o tempo de inferência
    tempos_de_inferencia.append(fim - inicio)

# Cálculo da média dos tempos
media_tempo_inferencia = np.mean(tempos_de_inferencia)

print(f"Média do tempo de inferência: {media_tempo_inferencia:.4f} segundos")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 350ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3

As shown in the [gMLP](https://arxiv.org/abs/2105.08050) paper,
better results can be achieved by increasing the embedding dimensions,
increasing the number of gMLP blocks, and training the model for longer.
You may also try to increase the size of the input images and use different patch sizes.
Note that, the paper used advanced regularization strategies, such as MixUp and CutMix,
as well as AutoAugment.

In [28]:
torch.save(gmlp_classifier, 'meu_modelo_gmlp.pt')