# ANN basics

A neural network has the following components (source: https://medium.datadriveninvestor.com/neural-network-simplified-c28b6614add4):

1. Input layer $x$, with a bias unit whose value is 1. The bias is also referred to as the intercept.
2. Weights associated with each connection, organized in layers $w_l$.
3. One or more hidden layers, each hidden layer will have a bias unit $b_l$.
4. Output layer.
5. Activation function $\sigma$, which converts the input signal of a node into an output signal.
    
Weights are how neural networks learn. We adjust the weights to determine the strength of the signal.

## Forward propagation:



$$ \overrightarrow{a}_1 = \overrightarrow{x} $$
$$ \overrightarrow{z}_{l+1} = \overrightarrow{a}_l\overrightarrow{w}_l + \overrightarrow{b}_l $$
$$ \overrightarrow{a}_{l+1} = \sigma(\overrightarrow{z}_{l+1}) $$

where $\overrightarrow{x}$ is the input activation pattern, $\overrightarrow{a}_l$ is the activation pattern of the $l$-th layer, and $\sigma$ is the activation function.
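The three equations above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the logistic sigmoid is only one possible choice for $\sigma$, and the layer sizes, the random weights and the names `sigmoid` and `forward` are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, used here as one possible choice for sigma."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the activation pattern of every layer for the input x."""
    activations = [x]                       # a_1 = x
    for w, b in zip(weights, biases):
        z = activations[-1] @ w + b         # z_{l+1} = a_l w_l + b_l
        activations.append(sigmoid(z))      # a_{l+1} = sigma(z_{l+1})
    return activations

rng = np.random.default_rng(0)
# Hypothetical sizes: 3 inputs -> 4 hidden units -> 2 outputs.
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
biases = [np.zeros(4), np.zeros(2)]
x = np.array([0.5, -1.0, 2.0])

activations = forward(x, weights, biases)
print(activations[-1])                      # output of the last layer
```

The last element of `activations` is the network output; the intermediate elements are kept because back propagation needs them.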

## Back propagation:

Back propagation is an efficient algorithm for computing the gradients used in learning. It tells us how the cost function changes when we change the weights and biases. Without going into the detailed mathematics: in back propagation, we compute the partial derivatives of the cost with respect to each weight and each bias for every training example, and then average these partial derivatives over all the training examples.

For a single data point, we determine how much each of the weights and biases was responsible for the error. Based on this responsibility, we adjust all the weights simultaneously.

Weights can be updated once for the whole training set using batch gradient descent (GD), or once for each training example using stochastic gradient descent (SGD).

An epoch is one pass in which the complete dataset is used once for learning: one forward propagation and one backward propagation for every training example. We repeat the forward and backward propagation for multiple epochs until the cost converges to a minimum (which is not guaranteed to be the global one).
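The difference between batch GD and SGD can be sketched on the smallest possible model: a single linear neuron with a squared-error loss. The toy data, the learning rates, the number of epochs and the function names below are all assumptions chosen for illustration.

```python
import numpy as np

def gradients(w, b, x, y):
    """Gradients of the loss 0.5 * (w*x + b - y)**2 with respect to w and b."""
    error = w * x + b - y
    return error * x, error

def batch_epoch(w, b, xs, ys, lr):
    """Batch GD: average the gradients over all examples, then update once."""
    gw, gb = gradients(w, b, xs, ys)
    return w - lr * gw.mean(), b - lr * gb.mean()

def sgd_epoch(w, b, xs, ys, lr):
    """SGD: update the parameters after every single training example."""
    for x, y in zip(xs, ys):
        gw, gb = gradients(w, b, x, y)
        w, b = w - lr * gw, b - lr * gb
    return w, b

xs = np.linspace(-1.0, 1.0, 20)
ys = 3.0 * xs + 1.0                 # toy data generated with w = 3, b = 1

w, b = 0.0, 0.0
for epoch in range(500):            # one update per epoch
    w, b = batch_epoch(w, b, xs, ys, lr=0.5)

w_sgd, b_sgd = 0.0, 0.0
for epoch in range(200):            # 20 updates per epoch
    w_sgd, b_sgd = sgd_epoch(w_sgd, b_sgd, xs, ys, lr=0.1)

print(w, b, w_sgd, b_sgd)           # both runs approach w = 3, b = 1
```

In a real network the gradients come from back propagation instead of a closed-form expression, but the update schedule is the same.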

## Training

The repeated use of forward and back propagation over the training dataset, which allows the ANN to minimize the cost.

## Inference/prediction

A forward propagation step.

## Epoch

A pass through the entire set one time.


## What is learning rate?

Learning rate controls how much we should adjust the weights with respect to the loss gradient.
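As a sketch, the gradient descent update for a single weight $w$ can be written as follows, where $\eta$ (the learning rate) and $C$ (the cost) are symbols introduced here for illustration:

$$ w \leftarrow w - \eta \frac{\partial C}{\partial w} $$

A small $\eta$ gives small, slow steps; a large $\eta$ gives faster steps that may overshoot the minimum.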

## Regularization

A general name for all those techniques that prevent overfitting.

## Overfitting

The (undesired) specialization of the ANN to the training set, at the expense of its performance on unseen data.

## Dropout

Reduces overfitting by ignoring some neurons, chosen at random for each training example, while the ANN is trained.
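A minimal sketch of dropout applied to a layer's activations. The rescaling by the keep probability ("inverted dropout") is a common variant and an assumption of this example, as are the function name and the array sizes.

```python
import numpy as np

def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob (training only)."""
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob   # True = neuron kept
    # Inverted dropout (assumed variant): rescale the kept activations
    # so that their expected value is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((2, 8))            # hypothetical activations: 2 examples, 8 neurons
dropped = dropout(a, drop_prob=0.5, rng=rng)
print(dropped)                 # entries are either 0.0 or 2.0
```

At inference time no neurons are dropped and the activations are used as they are.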

## DropConnect

The same concept as dropout, but applied to the weights instead of the neurons.
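A minimal sketch of the idea, assuming a dense layer computed as `a @ w + b`: the random mask is applied to the weight matrix, so individual connections are dropped instead of whole neurons. The function name and sizes are hypothetical.

```python
import numpy as np

def dropconnect_forward(a, w, b, drop_prob, rng):
    """Pre-activation of a dense layer with randomly dropped connections."""
    mask = rng.random(w.shape) >= drop_prob    # True = connection kept
    return a @ (w * mask) + b

rng = np.random.default_rng(0)
a = np.array([1.0, 2.0, 3.0])
w = rng.normal(size=(3, 2))
b = np.array([0.1, -0.1])

z = dropconnect_forward(a, w, b, drop_prob=0.5, rng=rng)
z_none = dropconnect_forward(a, w, b, drop_prob=0.0, rng=rng)   # nothing dropped
z_all = dropconnect_forward(a, w, b, drop_prob=1.0, rng=rng)    # only the bias remains
```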

## Early stopping

One of the simplest methods to prevent overfitting of a network is to simply stop the training before overfitting has had a chance to occur.
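One common way to decide when to stop is to monitor the loss on a held-out validation set and stop once it has not improved for a number of epochs. The validation set, the `patience` parameter and the loss values below are assumptions of this sketch, not part of the definition above.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1     # never triggered: train to the end

# Hypothetical validation losses: they improve, then start to rise.
val_losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.61, 0.65]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch)                  # 5
```

In practice the weights from the best epoch (here epoch 3) are usually the ones kept.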

## Limiting the number of hidden units in each layer or limiting network depth

Another simple way to prevent overfitting is to limit the number of parameters, typically by limiting the number of hidden units in each layer or limiting network depth.

## Weight decay

A simple form of added regularizer is weight decay, which adds to the error a penalty term proportional to the sum of the absolute values of the weights (L1 norm) or to the squared magnitude of the weight vector (squared L2 norm).
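As a small numeric sketch, the two penalties can be computed like this. The regularization strength `lam`, the weight values and the data cost are made-up numbers for illustration.

```python
import numpy as np

def l1_penalty(w, lam):
    """lam times the sum of the absolute values of the weights."""
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    """lam times the squared magnitude of the weight vector."""
    return lam * np.sum(w ** 2)

w = np.array([0.5, -2.0, 1.0])     # hypothetical weights
data_cost = 0.3                    # hypothetical cost without regularization

total_l1 = data_cost + l1_penalty(w, lam=0.01)   # 0.3 + 0.01 * 3.5
total_l2 = data_cost + l2_penalty(w, lam=0.01)   # 0.3 + 0.01 * 5.25
```

Because the penalty grows with the weights, minimizing the total cost pushes the weights toward smaller values.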

## Max norm constraints



## Loss/Error function

Measures the error of the ANN for a single input. For example, the squared error.

## Cost function

Measures the error of the ANN for an input set. For example, the mean squared error.
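The distinction between loss and cost can be illustrated with the two examples named above. The predictions and targets are made-up values.

```python
import numpy as np

def squared_error(prediction, target):
    """Loss: error for a single input (applied element-wise to arrays)."""
    return (prediction - target) ** 2

def mean_squared_error(predictions, targets):
    """Cost: average of the per-input losses over the whole input set."""
    return np.mean(squared_error(predictions, targets))

predictions = np.array([2.5, 0.0, 2.0])
targets = np.array([3.0, -0.5, 2.0])

losses = squared_error(predictions, targets)      # one value per input
cost = mean_squared_error(predictions, targets)   # one value for the set
print(losses, cost)
```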

## Activation

The output of the neuron.

## Weight

The strength of the connection between two neurons in adjacent layers.

## Bias

An offset added to the weighted input of a neuron. It sets the minimal input level needed to activate the neuron.

# CNN

Convolutional neural networks are variants of multilayer perceptrons, designed to emulate the behavior of a visual cortex. In a CNN one or more hidden layers include layers that perform convolutions. In a 2D-CNN, the input $x$ is a tensor with a shape (number of inputs) x (input height) x (input width) x (input channels). After passing through a convolutional layer, the image becomes abstracted to a feature map with shape (number of inputs) x (feature map height) x (feature map width) x (feature map channels). Conceptually, each convolutional neuron processes data only for its receptive field. Distinct types of layers, both locally and completely connected, are stacked to form a CNN architecture. The layers of a CNN have neurons arranged in 3 dimensions: width, height and depth (number of feature detectors).

Unlike fully connected ANNs, CNNs provide:

1. Invariance to translation (an object is recognized as the same object when it appears at a different position in the image). Robustness to rotation, scale or illumination changes is not built in and is usually obtained through the training data or data augmentation.
2. A convolutional layer contains units whose receptive fields cover a patch of the previous layer. Thus, far fewer weights are needed for a given stimulus dimension. Convolution reduces the number of free parameters, allowing the network to be deeper, and helps mitigate the vanishing and exploding gradient problems seen during backpropagation in traditional neural networks.
3. Furthermore, convolutional neural networks are ideal for data with a grid-like topology (such as images) as spatial relations between separate features are taken into account during convolution and/or pooling.
4. In a convolutional layer, each neuron receives input from only a restricted area of the previous layer called the neuron's receptive field.
5. Except when using dilated layers, in successive convolutional layers each neuron takes input from a larger area of the input than neurons in the previous layers.


## Receptive field
Each neuron inside a convolutional layer is connected to only a small region of the layer before it, called a receptive field. CNNs use a receptive field-like layout in which each neuron receives connections only from a subset of neurons in the previous (lower) layer. The receptive field of a neuron in one of the lower layers encompasses only a small area of the image, while the receptive field of a neuron in subsequent (higher) layers involves a combination of receptive fields from several (but not all) neurons in the layer before (i.e. a neuron in a higher layer "looks" at a larger portion of the image than a neuron in a lower layer does). In this way, each successive layer is capable of learning increasingly abstract features of the original image.
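The growth of the receptive field with depth can be sketched with a small calculation. It assumes plain (non-dilated) layers described only by a kernel size and a stride; the function name and the layer lists are hypothetical.

```python
def receptive_field(layers):
    """Receptive field size (in input pixels) after each layer.

    layers: list of (kernel_size, stride) pairs, from first to last layer.
    """
    size, jump = 1, 1          # jump = distance in input pixels between neighbouring neurons
    sizes = []
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
        sizes.append(size)
    return sizes

# Three stacked 3x3 convolutions with stride 1.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))   # [3, 5, 7]
# A 2x2 layer with stride 2 in the middle makes the field grow faster.
print(receptive_field([(3, 1), (2, 2), (3, 1)]))   # [3, 4, 8]
```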


## Convolution

The mathematical operation that helps compute similarity of two signals. Closely related to cross correlation.
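The relation between the two operations can be checked on a small 1D example with NumPy: convolution flips the kernel before sliding it over the signal, cross correlation does not. The signal and kernel values are arbitrary.

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([1.0, 0.0, -1.0])

conv = np.convolve(signal, kernel, mode="valid")    # kernel is flipped
corr = np.correlate(signal, kernel, mode="valid")   # kernel is not flipped

print(conv)   # [2. 2.]
print(corr)   # [-2. -2.]

# Cross correlation equals convolution with the reversed kernel.
same = np.allclose(corr, np.convolve(signal, kernel[::-1], mode="valid"))
```

Since the kernel weights of a CNN are learned, the flip makes no practical difference, and many deep learning libraries implement the "convolution" layer as a cross correlation.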

## Convolutional layer

The convolutional layer is the core building block of a CNN. The layer's parameters consist of a set of learnable filters (or kernels), which have a small receptive field, but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the filter entries and the input, producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when it detects some specific type of feature at some spatial position in the input.

Stacking the activation maps for all filters along the depth dimension forms the full output volume of the convolution layer. Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.
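A naive sketch of the forward pass of a convolutional layer, written with explicit loops for readability. It assumes stride 1, no padding, a single input of shape (height, width, channels), and no kernel flip (i.e. cross correlation); the function name and sizes are hypothetical.

```python
import numpy as np

def conv_layer(volume, filters, biases):
    """volume: (H, W, C); filters: (F, K, K, C); biases: (F,).

    Returns the output volume of shape (H-K+1, W-K+1, F).
    """
    h, w, _ = volume.shape
    f, k, _, _ = filters.shape
    out = np.zeros((h - k + 1, w - k + 1, f))
    for n in range(f):                        # one activation map per filter
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                patch = volume[i:i + k, j:j + k, :]   # full depth of the input
                out[i, j, n] = np.sum(patch * filters[n]) + biases[n]
    return out

rng = np.random.default_rng(0)
volume = rng.normal(size=(6, 6, 3))           # e.g. a 6x6 image with 3 channels
filters = rng.normal(size=(4, 3, 3, 3))       # four 3x3 filters
output = conv_layer(volume, filters, np.zeros(4))
print(output.shape)                           # (4, 4, 4)
```

Each slice `output[:, :, n]` is the activation map of filter `n`; every entry of that map is computed with the same weights, which is the parameter sharing mentioned above.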

Three hyperparameters control the size of the output volume of the convolutional layer: the depth, stride and padding size. 
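For the spatial dimensions, the usual relation between input size, kernel size, stride and padding can be sketched as below. The function name is hypothetical, and rounding down when the stride does not divide evenly is an assumption (conventions differ between libraries). The depth of the output volume is simply the number of filters.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output width (or height) of a convolutional layer."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(32, 5))                       # 28
print(conv_output_size(32, 5, padding=2))            # 32, spatial size preserved
print(conv_output_size(32, 3, stride=2, padding=1))  # 16
```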

## Feature detector (convolution(al) kernel/filter)

The discrete signal that convolves the signal $a_l$. It is defined by a vector of weights and a bias, and represents a particular feature of the input. Each neuron in a CNN computes an output value by applying a specific function to the input values received from the receptive field in the previous layer. The function that is applied to the input values is determined by the vector of weights and the bias. Learning consists of iteratively adjusting these biases and weights.

Stacking many convolutional layers leads to feature detectors that become increasingly global (i.e. responsive to a larger region of pixel space) so that the network first creates representations of small parts of the input, then from them assembles representations of larger areas.

## Feature/Activation map

The output of a convolutional layer $a_l$ for a given feature detector. The set of activation maps, stacked together along the depth dimension, forms the output volume of the convolutional layer.

## Depth

The depth of the output volume of a convolutional layer controls the number of neurons in a layer that connect to the same region of the input volume. These neurons learn to activate for different features in the input.

## Stride

The step size used in the convolution. Controls the overlapping between the receptive fields.

## Padding

Sometimes, it is convenient to pad the input with zeros (or other values, such as the average of the region) on the border of the input volume. In particular, it is sometimes desirable to exactly preserve the spatial size of the input volume; this is commonly referred to as "same" padding.
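A minimal sketch of zero padding with `np.pad`. It assumes an odd kernel size and a stride of 1, in which case padding `(kernel_size - 1) // 2` pixels on each side preserves the spatial size; the image values are arbitrary.

```python
import numpy as np

image = np.arange(9.0).reshape(3, 3)       # hypothetical 3x3 input
kernel_size = 3
pad = (kernel_size - 1) // 2               # 1 pixel on each side

padded = np.pad(image, pad_width=pad, mode="constant", constant_values=0)
print(padded.shape)                        # (5, 5)

# A 3x3 kernel with stride 1 over the 5x5 padded input yields 3x3 outputs,
# the same spatial size as the original image.
output_size = padded.shape[0] - kernel_size + 1
```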

## Dilation layer


## Pooling layer

Helps with "translational invariance". Pooling is based on the idea that when we change the input by a small amount, the pooled outputs do not change. Pooling layers reduce the dimensions of the data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. Local pooling combines small clusters; tiling sizes such as 2 x 2 are commonly used. Global pooling acts on all the neurons of the feature map. There are two common types of pooling in popular use: max (max-pooling) and average.

The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters, memory footprint and amount of computation in the network, and hence to also control overfitting.

Usually, pooling does not operate along the depth dimension.
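Local 2 x 2 max pooling and global average pooling can be sketched with NumPy as follows. The sketch assumes a single input of shape (height, width, channels) with even height and width, and non-overlapping tiles; the function name is hypothetical.

```python
import numpy as np

def max_pool_2x2(feature_maps):
    """Max pooling over non-overlapping 2x2 tiles; depth is left untouched."""
    h, w, c = feature_maps.shape
    tiles = feature_maps.reshape(h // 2, 2, w // 2, 2, c)
    return tiles.max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4, 1)       # one 4x4 feature map
pooled = max_pool_2x2(x)
print(pooled[:, :, 0])
# [[ 5.  7.]
#  [13. 15.]]

global_avg = x.mean(axis=(0, 1))           # global average pooling: one value per map
print(global_avg)                          # [7.5]
```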

In CNNs, the connectivity between adjacent layers is sparser than in fully connected ANNs. For this reason, stochastic pooling (which samples the pooled activation from a multinomial distribution over each pooling region) is sometimes used in place of standard dropout.

## Activation function layer

Usually, the neurons of a convolutional layer do not incorporate an activation function; it is instead applied by a separate layer.

## Fully connected layer

In classification tasks, the last layers of a deep ANN are formed by a standard multilayer perceptron.

## Normalization

## Shift equivariance

All the neurons in a given convolutional layer respond to the same feature within their specific response to the input visual field. Replicating units in this way allows for the resulting activation map to be equivariant under shifts of the locations of input features in the visual field, i.e. they grant translational equivariance - given that the layer has a stride of one.

In addition to reducing the sizes of feature maps, the pooling operation grants a degree of local translational invariance to the features contained therein, allowing the CNN to be more robust to variations in their positions.

Parameter sharing contributes to the translation invariance of the CNN architecture.
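The difference between equivariance and invariance can be checked on a 1D toy signal: shifting the input shifts the activation map by the same amount (equivariance), while a global max pooling of the map does not change at all (invariance). The signal, the kernel and the use of `np.roll` to shift are assumptions of this sketch; the feature is kept away from the borders so that the wrap-around of `np.roll` does not matter.

```python
import numpy as np

signal = np.array([0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0])
kernel = np.array([1.0, 2.0, 1.0])         # the shared feature detector
shifted = np.roll(signal, 2)               # same feature, two positions later

out = np.correlate(signal, kernel, mode="valid")            # stride 1
out_shifted = np.correlate(shifted, kernel, mode="valid")

print(out)           # [1. 5. 8. 5. 1. 0. 0.]
print(out_shifted)   # [0. 0. 1. 5. 8. 5. 1.]

equivariant = np.array_equal(out_shifted, np.roll(out, 2))  # map shifts with the input
invariant = out.max() == out_shifted.max()                  # pooled value is unchanged
```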

## Data augmentation

Techniques used to increase the amount of data by adding slightly modified copies of already existing data, or synthetic data newly created from existing data. Data augmentation acts as a regularizer and helps reduce overfitting when training a machine learning model.
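A minimal sketch of two simple image augmentations, a random horizontal flip and a random horizontal shift. The function name, the shift range and the wrap-around shift done with `np.roll` are simplifying assumptions; real pipelines often pad or crop instead.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped and shifted copy of an (H, W) image."""
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)                   # horizontal flip
    shift = int(rng.integers(-2, 3))           # shift of -2..2 pixels
    return np.roll(out, shift, axis=1)         # wrap-around shift (simplification)

rng = np.random.default_rng(0)
image = np.arange(25.0).reshape(5, 5)          # hypothetical 5x5 image
augmented = augment(image, rng)

# Several modified copies of the same training example.
copies = [augment(image, rng) for _ in range(4)]
```

Each copy keeps the label of the original image, so the training set grows without any new labelling effort.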


