# The Perceptron

## Fully Connected Feed Forward Network

### The Concept

Perceptrons belong to a class of algorithms known as Artificial Neural Networks, which sit within the Deep Learning paradigm: features and functions are learned from large datasets rather than designed by hand. Compared with traditional Machine Learning approaches, much less feature engineering and design is needed, because the network learns internal representations of the data during the training phase.

These algorithms use densely connected layers of weights, stored as NxM matrices of real numbers. The input parameters (the independent variables) are multiplied by these weights to predict the labeled outcome (the dependent variable).
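
As a rough illustration, a single dense layer can be sketched in NumPy. The sizes (4 examples, 3 input features, 2 output units) and the array names are assumptions chosen for the example, not values from the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 examples, N=3 input features, M=2 output units.
X = rng.normal(size=(4, 3))   # independent variables, one row per example
W = rng.normal(size=(3, 2))   # N x M weight matrix
b = np.zeros(2)               # one bias per output unit

# Each output unit is a weighted sum of the inputs plus its bias.
Z = X @ W + b
print(Z.shape)  # (4, 2)
```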

Layers are stacked so that the output of one layer becomes the input of the next, a process known as feed forward. Within a layer, the inputs are multiplied by the weights (a dot product for a single unit, or a matrix multiplication for the whole layer at once), and the result is then passed into what is known as an activation function.
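
Written as an equation, a feed-forward pass through $L$ layers can be summarized as follows. The notation is assumed here rather than taken from the notebook: $h^{(0)} = x$ is the input, $W^{(l)}$ and $b^{(l)}$ are the weights and bias of layer $l$, and $\sigma$ is the activation function.

```latex
h^{(l)} = \sigma\left(W^{(l)} h^{(l-1)} + b^{(l)}\right), \qquad l = 1, \dots, L, \qquad \hat{y} = h^{(L)}
```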

Activation Functions are non-linear functions applied element-wise to these weighted sums; the result is the layer's activation. The non-linearity matters because a stack of purely linear layers collapses into a single linear map, so without it the network could not model very sophisticated functions and mappings. The network is trained through a process known as back-propagation: the gradient of the cost / error function is passed back through the series of layers to modify the individual weights and biases so that the error between the prediction and the true label is minimized over time. This allows the network to model the underlying latent space of the examples, in other words the prototypical features or representations that define the example space represented by the data sample.
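
The training loop can be sketched for the smallest possible case: one sigmoid unit trained by gradient descent. This is an illustration under stated assumptions (a toy logical-OR dataset, a cross-entropy loss, a hypothetical learning rate of 0.5), not the notebook's training procedure. Full back-propagation applies the same chain rule through every layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (an assumption for illustration): the logical OR of two inputs.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 1.])

w = np.zeros(2)
b = 0.0
lr = 0.5  # hypothetical learning rate

def loss(w, b):
    # Mean binary cross-entropy between predictions and labels.
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

initial_loss = loss(w, b)
for _ in range(500):
    p = sigmoid(X @ w + b)       # forward pass
    grad_z = (p - y) / len(y)    # dLoss/dz for sigmoid + cross-entropy
    w -= lr * (X.T @ grad_z)     # chain rule: dLoss/dw = X^T dLoss/dz
    b -= lr * grad_z.sum()       # dLoss/db
final_loss = loss(w, b)

predictions = (sigmoid(X @ w + b) > 0.5).astype(float)
```

After training, `final_loss` is lower than `initial_loss`, which is the "error minimized over time" behaviour described above.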

## Activation Functions:

* Sigmoid: outputs values between 0 and 1
* Tanh: outputs values between -1 and 1
* ReLU: outputs values between 0 and infinity

* The non-linearity allows more complex functions to represent the space / plane that separates the class distributions
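
The three functions listed above can be written in a few lines of NumPy. The sample input `z` is an arbitrary choice for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # values strictly between 0 and 1
print(tanh(z))     # values strictly between -1 and 1
print(relu(z))     # [0. 0. 2.]
```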

## A Perceptron is composed of these Stages:
* Input (The raw data after any transformations)
* Weights (An NxM Matrix of coefficients to multiply the input by)
* Sum (The weighted inputs are summed and a bias term is added, which shifts the point at which the unit activates)
* Non-Linearity (A Non-Linear Activation Function is applied to the resulting sum of weighted inputs and biases)
* Output (Classification Probability Predictions or Continuous Value Predictions)
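
The stages above can be traced in a short sketch. The function name `perceptron_forward`, the input values, and the choice of a sigmoid activation are all assumptions made for the example.

```python
import numpy as np

def perceptron_forward(x, W, b):
    z = x @ W + b                     # Weights and Sum stages
    return 1.0 / (1.0 + np.exp(-z))   # Non-Linearity stage (sigmoid here)

x = np.array([[0.5, -1.0, 2.0]])      # Input stage: one example, 3 features
W = np.array([[0.1], [0.2], [0.3]])   # 3 x 1 weight matrix
b = np.array([0.0])                   # bias term
prob = perceptron_forward(x, W, b)    # Output stage: a value between 0 and 1
```

Here the weighted sum is 0.05 - 0.2 + 0.6 = 0.45, so `prob` is the sigmoid of 0.45.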

### Links to resources, tutorials:
* https://keras.io/examples/vision/mlp_image_classification/

In [1]:
import abc
import torch

import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as data


from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import euclidean_distances
from tensorflow import keras
from tensorflow.keras import layers

In [2]:
num_classes = 100
input_shape = (32, 32, 3)

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()

print(f"x_train shape: {x_train.shape} - y_train shape: {y_train.shape}")
print(f"x_test shape: {x_test.shape} - y_test shape: {y_test.shape}")


x_train shape: (50000, 32, 32, 3) - y_train shape: (50000, 1)
x_test shape: (10000, 32, 32, 3) - y_test shape: (10000, 1)


In [3]:
weight_decay = 0.0001
batch_size = 128
num_epochs = 5
dropout_rate = 0.2
image_size = 64  # We'll resize input images to this size.
patch_size = 8  # Size of the patches to be extracted from the input images.
num_patches = (image_size // patch_size) ** 2  # Number of patches per image.
embedding_dim = 256  # Number of hidden units.
num_blocks = 4  # Number of blocks.

print(f"Image size: {image_size} X {image_size} = {image_size ** 2}")
print(f"Patch size: {patch_size} X {patch_size} = {patch_size ** 2} ")
print(f"Patches per image: {num_patches}")
print(f"Elements per patch (3 channels): {(patch_size ** 2) * 3}")


Image size: 64 X 64 = 4096
Patch size: 8 X 8 = 64 
Patches per image: 64
Elements per patch (3 channels): 192


## Keras Classifier

In [4]:
def build_keras_classifier(blocks, 
                           positional_encoding=False):
    
    inputs = layers.Input(shape=input_shape)
    # Augment data
    augmented = data_augmentation(inputs)
    
    # Create image patches / sub-images
    patches = Patches(patch_size, num_patches)(augmented)
    
    # Project each flattened patch to embedding_dim features
    x = layers.Dense(units=embedding_dim)(patches)
    
    # Add the positional embeddings to the image patches
    if positional_encoding:
        positions = tf.range(start=0, limit=num_patches, delta=1)
        position_embedding = layers.Embedding(input_dim=num_patches, output_dim=embedding_dim)(positions)
        x = x + position_embedding
    
    # Process the encoded input using the Neural Network Blocks
    x = blocks(x)
    
    # Apply global average pooling over the patch axis to get one vector per image
    representation = layers.GlobalAvgPool1D()(x)
    
    # Apply dropout to reduce overfitting
    representation = layers.Dropout(rate=dropout_rate)(representation)
    
    # Compute logits: unnormalized class scores that a softmax turns into probabilities
    logits = layers.Dense(units=num_classes)(representation)
    
    return keras.Model(inputs=inputs, outputs=logits) 
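
The model returns logits rather than probabilities. A minimal NumPy sketch of how logits become probabilities and a per-example loss is shown below. The helper names and the sample values are illustrative assumptions, and the labels here are a flat array, whereas the notebook's `y_train` has shape `(N, 1)`.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy(logits, labels):
    # Negative log-probability assigned to the true class of each example.
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels])

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])  # integer class ids
probs = softmax(logits)
losses = sparse_categorical_crossentropy(logits, labels)
```

For the second row all three logits are equal, so each class gets probability 1/3 and the loss is ln(3).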

In [5]:
def run_experiment(model):
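    # Create the optimizer with decoupled weight decay (AdamW): at each step the weights are also
    # shrunk by a small multiple of their current value, separately from the gradient update.
    # Background: https://towardsdatascience.com/this-thing-called-weight-decay-a7cd4bcfccab
    # Note: learning_rate is a global that must be set before run_experiment is called.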
    optimizer = tfa.optimizers.AdamW(learning_rate=learning_rate,
                                     weight_decay=weight_decay)
    
    # Compile the model using the optimizer, metrics and loss
    model.compile(optimizer=optimizer,
                  loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=[keras.metrics.SparseCategoricalAccuracy(name='acc'),
                           keras.metrics.SparseTopKCategoricalAccuracy(name='top5-acc')])
    
    # Halve the learning rate whenever val_loss has not improved for 5 epochs, to help settle into a minimum
    learning_rate_schedule = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5)
    
    # Stop training when val_loss has not improved for 10 epochs, and restore the best weights seen so far
    # (with num_epochs = 5 neither patience value is reached; both callbacks matter for longer runs)
    early_stopping_callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
    
    # Fit the model on the dataset using the Optimizer, Loss Function, and other hyperparameters, IE: batch_size, epochs
    record = model.fit(x=x_train,
                       y=y_train,
                       batch_size=batch_size,
                       epochs=num_epochs,
                       validation_split=0.1,
                       callbacks=[early_stopping_callback,
                                  learning_rate_schedule])
    
    _, score, top_5_score = model.evaluate(x_test, y_test)
    print(f"Test accuracy: {round(score * 100, 2)}%")
    print(f"Test top 5 accuracy: {round(top_5_score * 100, 2)}%")

    # Return history to plot learning curves.
    return record
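
The idea behind decoupled weight decay can be sketched as follows. This is a simplification under stated assumptions: a plain gradient step stands in for the Adam update, and the function name `decoupled_weight_decay_step` is hypothetical. It is not the `tfa.optimizers.AdamW` implementation.

```python
import numpy as np

def decoupled_weight_decay_step(w, grad, learning_rate, weight_decay):
    # Gradient step (stand-in for the Adam update) plus a separate shrink of
    # the weights that does not pass through the loss gradient.
    return w - learning_rate * grad - weight_decay * w

w = np.array([1.0, -2.0])
zero_grad = np.zeros(2)

# With a zero gradient only the decay acts: every weight shrinks toward 0.
decayed = decoupled_weight_decay_step(w, zero_grad, learning_rate=0.005, weight_decay=0.0001)
```

With the notebook's `weight_decay = 0.0001`, each such step multiplies the weights by 0.9999 before the gradient contribution is taken into account.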

In [6]:

augmentations=[layers.Normalization(),
               layers.Resizing(image_size, image_size),
               layers.RandomFlip('horizontal'),
               layers.RandomZoom(height_factor=.2, width_factor=.2)]
# Wrap the layers in a Sequential model so that they are applied in order
data_augmentation = keras.Sequential(augmentations, name='data_augmentation')
# Compute the mean and the variance of the training data for the Normalization layer.
data_augmentation.layers[0].adapt(x_train)
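
What the `adapt` call computes can be approximated in NumPy: per-channel mean and variance over the data, which are then used to standardize every image. The random array below is a small stand-in for `x_train`, and the sketch assumes statistics are kept per channel (the last axis).

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for x_train: 8 random RGB images of 32 x 32 pixels.
images = rng.integers(0, 256, size=(8, 32, 32, 3)).astype("float32")

# One mean and one variance per channel, computed over batch, height and width.
mean = images.mean(axis=(0, 1, 2))
var = images.var(axis=(0, 1, 2))

# Standardize: each channel ends up with mean close to 0 and variance close to 1.
normalized = (images - mean) / np.sqrt(var)
```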
    


In [7]:
class Patches(layers.Layer):
    """
    Image Patch Extraction as a Layer extension
    """
    
    def __init__(self, patch_size, num_patches):
        super(Patches, self).__init__()
        self.patch_size = patch_size
        self.num_patches = num_patches
        
    def call(self, images):
        """Apply the patching / sub-sampling of the images and create the sub-images"""
        # Batch size is the first index of the shape [batch_size, image_height, image_width, image_channels]
        batch_size = tf.shape(images)[0]
        # Generate the sub-regions of the image to feed to the network in pieces
        patches = tf.image.extract_patches(images=images,
                                           sizes=[1, self.patch_size, self.patch_size, 1],
                                           strides=[1, self.patch_size, self.patch_size, 1],
                                           rates=[1, 1, 1, 1],
                                           padding='VALID')
        
        patch_dims = patches.shape[-1]
        # Reshape so that the image patches are sequenced
        patches = tf.reshape(patches, [batch_size, self.num_patches, patch_dims])
        return patches
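
For non-overlapping patches, the same result can be reproduced with reshapes in NumPy, which makes the output shape easier to see. The helper `extract_patches` below is a hypothetical stand-in and assumes the image height and width are exact multiples of `patch_size`.

```python
import numpy as np

def extract_patches(images, patch_size):
    batch, height, width, channels = images.shape
    rows, cols = height // patch_size, width // patch_size
    # Split height and width into (patch index, offset inside the patch).
    x = images.reshape(batch, rows, patch_size, cols, patch_size, channels)
    # Bring the two patch indices together, followed by the pixels of each patch.
    x = x.transpose(0, 1, 3, 2, 4, 5)
    # Flatten to one row per patch: [batch, num_patches, patch_size * patch_size * channels].
    return x.reshape(batch, rows * cols, patch_size * patch_size * channels)

images = np.zeros((2, 64, 64, 3), dtype="float32")
patches = extract_patches(images, patch_size=8)
print(patches.shape)  # (2, 64, 192)
```

The shape matches the numbers printed earlier: 64 patches per image and 192 elements per patch.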

## Multi-Layer Perceptron Architectures

### MLP-Mixer Model

* The MLP-Mixer is an architecture based exclusively on multi-layer perceptrons (MLPs). It contains two types of MLP layers:
    * One applied independently to image patches, which mixes the per-location features.
    * The other applied across patches (along channels), which mixes spatial information.

* This is similar to a depthwise separable convolution based model such as the Xception model, but with two chained dense transforms, no max pooling, and layer normalization instead of batch normalization.

* https://arxiv.org/abs/2105.01601

* The MLP-Mixer model tends to have far fewer parameters than convolutional and transformer-based models, which lowers the training and serving computational cost.

* As mentioned in the MLP-Mixer paper, when pre-trained on large datasets, or with modern regularization schemes, the MLP-Mixer attains scores competitive with state-of-the-art models. You can obtain better results by increasing the embedding dimensions, increasing the number of mixer blocks, and training the model for longer. You may also try to increase the size of the input images and use different patch sizes.
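
The difference between the two kinds of mixing can be seen in a tiny NumPy sketch. The sizes and the weight matrices `W_token` and `W_channel` are illustrative assumptions, with a single linear map standing in for each MLP.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embedding_dim = 4, 3   # tiny hypothetical sizes
x = rng.normal(size=(num_patches, embedding_dim))

W_token = rng.normal(size=(num_patches, num_patches))
W_channel = rng.normal(size=(embedding_dim, embedding_dim))

# Across patches: transpose so the transform acts along the patch axis,
# then transpose back. Every output patch depends on every input patch.
token_mixed = (x.T @ W_token).T

# Per patch: the transform acts along the feature axis of each patch,
# so each output patch depends only on the same input patch.
channel_mixed = x @ W_channel
```

Changing the features of one patch alters only that patch's row in `channel_mixed`, but alters every row of `token_mixed`.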

In [8]:

class MLPMixerLayer(layers.Layer):
    def __init__(self,
                 num_patches,
                 hidden_units,
                 dropout_rate,
                 embedding_dim,
                 *args,
                 **kwargs):
        super(MLPMixerLayer, self).__init__(*args, **kwargs)
        
        # Token-mixing MLP: applied to the transposed input, so it maps num_patches -> num_patches
        self.mlp1 = keras.Sequential([layers.Dense(units=num_patches),
                                      tfa.layers.GELU(), # Gaussian Error Linear Unit
                                      layers.Dense(units=num_patches),
                                      layers.Dropout(rate=dropout_rate)])
        
        # Channel-mixing MLP: applied to each patch, hidden width num_patches, output size embedding_dim
        # (the hidden_units argument is accepted but not used by either block)
        self.mlp2 = keras.Sequential([layers.Dense(units=num_patches),
                                      tfa.layers.GELU(),
                                      layers.Dense(units=embedding_dim),
                                      layers.Dropout(rate=dropout_rate)])
        # LayerNorm normalizes each patch over its features, so there is no dependency on batch_size;
        # this single instance is reused for both normalization steps in call()
        self.normalize = layers.LayerNormalization(epsilon=1e-6)
    
    def call(self, inputs):
        # Normalize the inputs
        x = self.normalize(inputs)
        
        # Transpose to [batch, embedding_dim, num_patches] so the Dense layers act along the patch axis
        x_channels = tf.linalg.matrix_transpose(x)
        
        # Token mixing: apply mlp1 across patches, independently for each channel
        mlp1_outputs = self.mlp1(x_channels)
        # Transpose back to [batch, num_patches, embedding_dim]
        mlp1_outputs = tf.linalg.matrix_transpose(mlp1_outputs)
        # Skip connection: add the block's input back to the token-mixing output
        x = mlp1_outputs + inputs
        # Normalize again, then channel mixing: apply mlp2 to each patch independently
        x_patches = self.normalize(x)
        mlp2_outputs = self.mlp2(x_patches)
        # Skip connection
        x = x + mlp2_outputs
        return x
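
Layer normalization itself is simple to sketch in NumPy. This version omits the learned scale and offset that `layers.LayerNormalization` also applies, and the function name `layer_norm` is an assumption for the example.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    # Statistics are taken over the last axis of each row on its own,
    # so the result does not depend on the other examples in the batch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
normalized = layer_norm(x)
```

Both rows normalize to almost the same values, because the second row is only a rescaled copy of the first.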

In [9]:
mlpmixer_blocks = keras.Sequential([MLPMixerLayer(num_patches, embedding_dim, dropout_rate, embedding_dim) for _ in range(num_blocks)])
learning_rate = 0.005
mlpmixer_classifier = build_keras_classifier(mlpmixer_blocks)
record = run_experiment(mlpmixer_classifier)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test accuracy: 30.93%
Test top 5 accuracy: 62.11%
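
Top-5 accuracy counts a prediction as correct when the true class is among the five highest-scoring classes. A minimal NumPy sketch follows. The helper `top_k_accuracy` and the small 4-class example are assumptions, with k lowered to 2 so the example stays readable.

```python
import numpy as np

def top_k_accuracy(logits, labels, k=5):
    # Indices of the k highest-scoring classes for each example.
    top_k = np.argsort(logits, axis=-1)[:, -k:]
    # An example is a hit when its true label appears among those indices.
    hits = (top_k == labels.reshape(-1, 1)).any(axis=1)
    return hits.mean()

logits = np.array([[0.1, 0.9, 0.3, 0.2],
                   [0.8, 0.1, 0.05, 0.6],
                   [0.2, 0.3, 0.4, 0.1]])
labels = np.array([2, 1, 3])

top1 = top_k_accuracy(logits, labels, k=1)
top2 = top_k_accuracy(logits, labels, k=2)
```

Only the first example has its true class in the top two, so `top2` is 1/3 while `top1` is 0.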

