In this notebook we'll introduce the basic layers and architectures used in deep neural networks.

The code will be written in PyTorch, but the concepts are applicable to any deep learning framework.

In [2]:
# import pytorch and related libraries
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


# Layers

## Linear layers

A linear layer is a layer that applies a linear transformation to its input. It is defined by a weight matrix and a bias vector. The output is computed as:

$$y = f(x; \theta) = Wx + b$$

where $x$ is the input, $W$ is the weight matrix, $b$ is the bias vector, and $\theta = (W, b)$ collects the layer's parameters.

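The linear layer is implemented in PyTorch as the function `F.linear`; the module `nn.Linear` wraps the same operation together with its learnable parameters.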



In [5]:
# define a linear layer by hand; inputs are row vectors, so x @ w + b is the transpose of Wx + b
def linear_layer(x, w, b):
    return x @ w + b


In [11]:
in_features = 28 * 28
out_features = 10

# test the linear layer
x = torch.randn(1, in_features)
w = torch.randn(in_features, out_features)
b = torch.randn(out_features)

# compare the output of the hand-written implementation with the PyTorch implementation
assert torch.allclose(linear_layer(x, w, b), F.linear(x, w.T, b))


Note that in PyTorch the weight matrix has shape (out\_features, in\_features), so we have to transpose `w` before passing it to `F.linear`.
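As a small illustration of this weight layout, the sketch below builds an `nn.Linear` module and checks that its stored weight has shape (out\_features, in\_features). The batch size of 4 and the variable names are arbitrary choices made for this example.

```python
import torch
import torch.nn as nn

in_features, out_features = 28 * 28, 10

# nn.Linear stores its weight as (out_features, in_features)
layer = nn.Linear(in_features, out_features)
weight_shape = tuple(layer.weight.shape)
bias_shape = tuple(layer.bias.shape)

# a batch of 4 row vectors (batch size chosen arbitrarily)
x = torch.randn(4, in_features)

# the module output should agree with the explicit x @ W^T + b
y_module = layer(x)
y_manual = x @ layer.weight.T + layer.bias

print(weight_shape, bias_shape, tuple(y_module.shape))
```

Because the weight is stored transposed relative to `x @ w`, the manual version needs `layer.weight.T`.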

## Non-linear activations

A deep network built only from linear layers gains nothing from its depth: a composition of linear layers is itself a single linear (affine) map, so the whole stack can be replaced by one linear layer. We need to add non-linear activations between the layers to make the network more expressive.
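The collapse of stacked linear layers can be checked numerically. The following is a minimal sketch with arbitrary layer sizes (8, 16, 4) and a fixed seed, both chosen only for this example: two linear layers are merged into one, and a ReLU placed between them breaks the equivalence.

```python
import torch

torch.manual_seed(0)
dtype = torch.float64  # double precision keeps the comparison tight

x = torch.randn(5, 8, dtype=dtype)
w1, b1 = torch.randn(8, 16, dtype=dtype), torch.randn(16, dtype=dtype)
w2, b2 = torch.randn(16, 4, dtype=dtype), torch.randn(4, dtype=dtype)

# two stacked linear layers, no activation in between
two_layers = (x @ w1 + b1) @ w2 + b2

# the same map written as a single linear layer
w_merged = w1 @ w2
b_merged = b1 @ w2 + b2
one_layer = x @ w_merged + b_merged

# with a non-linearity in between, the merged layer no longer matches
with_relu = torch.relu(x @ w1 + b1) @ w2 + b2

print(torch.allclose(two_layers, one_layer))
print(torch.allclose(with_relu, one_layer))
```

The merged weight is just the product `w1 @ w2`, which is why depth alone adds no expressive power without an activation.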

### GELU

The GELU activation function, which stands for Gaussian Error Linear Unit, is a popular activation function used in neural networks. It was introduced by Dan Hendrycks and Kevin Gimpel in their paper "Gaussian Error Linear Units (GELUs)" in 2016.

The GELU activation function is defined as follows:

$$\mathrm{GELU}(x) = x \Phi(x) = 0.5 x \left(1 + \mathrm{erf}\left(x/\sqrt{2}\right)\right)$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution and $\mathrm{erf}$ is the error function.

The GELU activation function has a similar shape to the widely used ReLU activation function, but with some key differences. Unlike ReLU, GELU is smooth everywhere and slightly non-monotonic: for negative inputs it produces small negative outputs with non-zero gradients instead of clamping them to exactly zero, which can help gradients keep flowing through units that ReLU would switch off. Additionally, GELU has been reported to outperform other activation functions in certain scenarios, such as language modeling tasks.

However, it should be noted that GELU is a relatively new activation function and may not always be the best choice for every application. As with any neural network component, it is important to experiment with different activation functions to find the one that works best for your specific problem.

In [14]:
# define a GELU layer by hand, using the tanh approximation of GELU
import math

def gelu_layer(x):
    return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

In [27]:
# test the GELU layer
x = torch.randn(1, 2)

print(gelu_layer(x), F.gelu(x))

tensor([[-0.1595, -0.1053]]) tensor([[-0.1594, -0.1053]])


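The two implementations do not give exactly the same results because `gelu_layer` uses the tanh approximation of GELU, while `F.gelu` computes the exact erf-based form by default. Passing `approximate='tanh'` to `F.gelu` selects the same approximation.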
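To see where the small discrepancy comes from, one can implement both forms of GELU by hand and compare each to the corresponding `F.gelu` mode. This is a sketch under two assumptions: the helper names `gelu_exact` and `gelu_tanh` are made up for this example, and the `approximate` argument requires a reasonably recent PyTorch (it was added in version 1.12).

```python
import math
import torch
import torch.nn.functional as F

def gelu_exact(x):
    # exact form: x * Phi(x), written with the error function
    return 0.5 * x * (1 + torch.erf(x / math.sqrt(2)))

def gelu_tanh(x):
    # tanh approximation of the same function
    return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

x = torch.linspace(-4, 4, steps=101, dtype=torch.float64)

# each hand-written form should match the corresponding F.gelu mode
exact_matches = torch.allclose(gelu_exact(x), F.gelu(x), atol=1e-6)
tanh_matches = torch.allclose(gelu_tanh(x), F.gelu(x, approximate="tanh"), atol=1e-6)

# the two forms themselves differ by a small but non-zero amount
max_gap = (gelu_exact(x) - gelu_tanh(x)).abs().max().item()

print(exact_matches, tanh_matches, max_gap)
```

The gap between the two forms is a property of the approximation itself, not of the precision of `torch.erf`.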

## Conv layers

## Residual blocks

## Normalization layers

## Dropout

## Attention

## Recurrent layers

## Multiplicative layers

## Implicit layers

# Architectures

## Feedforward Neural Networks

## Convolutional Neural Networks

## Autoencoders

## Recurrent Neural Networks

## Transformers

## Graph Neural Networks