# <a id='toc1_'></a>[Neural Networks](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Neural Networks](#toc1_)    
    - [Types of Layers in Neural Networks](#toc1_1_1_)    
    - [Activation Functions](#toc1_1_2_)    
    - [Optimization Algorithm](#toc1_1_3_)    
    - [Optimization Algorithm Parameters](#toc1_1_4_)    
    - [Batch Size](#toc1_1_5_)    
    - [Epochs](#toc1_1_6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

### <a id='toc1_1_1_'></a>[Types of Layers in Neural Networks](#toc0_)

- **Definition**: A layer is a group of nodes (neurons) that sit at the same depth of a neural network. Each layer takes input from the previous layer (or, for the first layer, the input data), transforms it, and passes its output to the next layer.

- **Intuition**: Think of layers as filters of information. Each layer extracts some information from the input data, which is then passed on to the next layer for further processing.

- **Purpose**: The purpose of having different types of layers is to allow the neural network to learn different types of features from the data. For example, convolutional layers are good at learning spatial features in image data, while recurrent layers are good at learning temporal features in time-series data.

- **Formula**: There is no single formula for layers, since a layer is a structural concept. The transformation a given layer performs can usually be written down, though. For example, a fully connected layer computes \(y = f(Wx + b)\): it multiplies the input \(x\) by a weight matrix \(W\), adds a bias vector \(b\), and applies an activation function \(f\).

- **Code**: Here is an example of how to define a simple neural network with different types of layers in Python using the Keras library:

```python
from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPooling2D, Flatten

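model = Sequential()
# Convolution: 32 filters of size 3x3 over 64x64 RGB images
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(64, 64, 3)))
# Pooling: halve the height and width of each feature map
model.add(MaxPooling2D(pool_size=(2, 2)))
# Flatten the feature maps into one vector for the dense layers
model.add(Flatten())
model.add(Dense(128, activation='relu'))
# Single sigmoid unit: outputs a probability for binary classification
model.add(Dense(1, activation='sigmoid'))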
```
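
To see what each layer in the model above does to the data, the output shapes and parameter counts can be worked out by hand. The sketch below assumes the Keras defaults for `Conv2D` and `MaxPooling2D` (stride 1 and `'valid'` padding for the convolution, stride equal to the pool size for pooling). The helper names are made up for illustration and are not part of Keras.

```python
def conv2d_output(height, width, channels_in, filters, kernel):
    """Shape and parameter count of a stride-1, 'valid'-padded convolution."""
    out_h = height - kernel + 1
    out_w = width - kernel + 1
    # Each filter has kernel*kernel*channels_in weights plus one bias.
    params = (kernel * kernel * channels_in + 1) * filters
    return (out_h, out_w, filters), params


def maxpool_output(height, width, channels, pool):
    """Shape after non-overlapping pooling; pooling has no parameters."""
    return (height // pool, width // pool, channels), 0


def dense_output(inputs, units):
    """Shape and parameter count of a fully connected layer."""
    return (units,), inputs * units + units


conv_shape, conv_params = conv2d_output(64, 64, 3, filters=32, kernel=3)
pool_shape, _ = maxpool_output(*conv_shape, pool=2)
flat_size = pool_shape[0] * pool_shape[1] * pool_shape[2]
_, dense1_params = dense_output(flat_size, 128)
_, dense2_params = dense_output(128, 1)
total_params = conv_params + dense1_params + dense2_params

print(conv_shape, pool_shape, flat_size, total_params)
```

Almost all of the parameters sit in the first dense layer, because it connects every flattened feature to each of its 128 units.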

- **Limitations and Cautions**: The choice of layer type can greatly affect the performance of a neural network, so it should match the structure of the data. For example, a 2D convolutional layer is a natural fit for image data, whereas time-series data is more commonly handled with recurrent layers or one-dimensional convolutions.

- **Subconcepts**: Some of the common types of layers in a neural network include:
  - **Dense (or Fully Connected) Layers**: Every neuron in a dense layer is connected to every neuron in the previous layer.
  - **Convolutional Layers**: These layers apply a convolution operation to the input, passing the result to the next layer. This is especially effective for tasks like image recognition.
  - **Pooling Layers**: These layers reduce the spatial size of the feature maps produced by convolutional layers, which lowers the computational cost of the model.
  - **Recurrent Layers**: These layers keep a hidden state that is fed back into the layer at the next time step, so the output at the current step depends on both the current input and what the layer saw at previous steps.
  - **Normalization Layers**: These layers standardize the inputs to the layer, helping to stabilize the learning process and reduce the number of training epochs required.
  - **Dropout Layers**: These layers randomly set a fraction of input units to 0 at each update during training time, which helps prevent overfitting.
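
Two of the layer types listed above are simple enough to write out directly. The following is a minimal, dependency-free sketch of a dense layer (without the activation) and of 2x2 max pooling. The function names and the example values are illustrative assumptions, and real libraries implement these operations on tensors far more efficiently.

```python
def dense_forward(x, weights, biases):
    """Fully connected layer: one output per row of `weights`, y = W x + b."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(weights, biases)
    ]


def max_pool_2x2(grid):
    """Non-overlapping 2x2 max pooling over a 2-D list with even sides."""
    return [
        [
            max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, len(grid[0]), 2)
        ]
        for r in range(0, len(grid), 2)
    ]


x = [1.0, 2.0, 3.0]
W = [[0.5, -1.0, 0.0],   # weights of output neuron 0
     [1.0, 1.0, 1.0]]    # weights of output neuron 1
b = [0.1, -2.0]
dense_out = dense_forward(x, W, b)

feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [3, 2, 4, 6]]
pooled = max_pool_2x2(feature_map)

print(dense_out, pooled)
```

Every output of the dense layer depends on every input, while each pooled value depends only on its own 2x2 patch.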



### <a id='toc1_1_2_'></a>[Activation Functions](#toc0_)

- **Definition**: An activation function is a mathematical function applied to the weighted sum of a neuron's inputs to produce that neuron's output. It determines how strongly the neuron is activated (“fired”) for a given input, and therefore how much that neuron contributes to the model's prediction.

- **Intuition**: Activation functions are like the gatekeepers of the neural network. They decide how much information should proceed further through the network.

- **Purpose**: They introduce non-linearity into the output of a neuron. This matters because most real-world data is non-linear, and without a non-linear activation a stack of layers collapses into a single linear transformation.

- **Formula**: There are many types of activation functions, each with its own formula. Here are a few examples:
  - Sigmoid: \(f(x) = \frac{1}{1 + e^{-x}}\)
  - ReLU (Rectified Linear Unit): \(f(x) = \max(0, x)\)
  - Tanh: \(f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}\)

- **Code**: Here is an example of how to use activation functions in Python using the Keras library:

```python
from keras.models import Sequential
from keras.layers import Dense

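model = Sequential()
# Hidden layer: ReLU keeps positive values and zeroes out negative ones
model.add(Dense(64, activation='relu', input_dim=50))
# Output layer: sigmoid squashes the result into (0, 1)
model.add(Dense(1, activation='sigmoid'))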
```

- **Limitations and Cautions**: The choice of activation function can greatly affect the performance of a neural network. It's important to choose the right activation function for the specific task at hand. For example, the ReLU activation function is often a good choice for hidden layers, but it wouldn't be a good choice for the output layer of a binary classification problem, where a sigmoid activation function would be more appropriate.

- **Subconcepts**: Some of the common types of activation functions include:
  - **Sigmoid**: This function maps the input values to a range between 0 and 1, making it useful for output neurons in binary classification.
  - **ReLU (Rectified Linear Unit)**: This function sets all negative values in the input to 0 and leaves all positive values unchanged.
  - **Tanh**: This function maps the input values to a range between -1 and 1.
  - **Softmax**: This function is often used in the output layer of a multi-class classification neural network. It converts the outputs into probability values for each class.
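
The formulas above translate almost line for line into code. This is a plain-Python sketch for single numbers (and a list of scores for softmax), written for readability rather than speed or numerical robustness. The naive `sigmoid` and `tanh` here can overflow for inputs of very large magnitude, which library implementations guard against.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def softmax(scores):
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), relu(value), tanh(value))

probabilities = softmax([1.0, 2.0, 3.0])
print(probabilities, sum(probabilities))
```

Sigmoid stays between 0 and 1, tanh between -1 and 1, ReLU is never negative, and the softmax outputs sum to 1 so they can be read as class probabilities.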



### <a id='toc1_1_3_'></a>[Optimization Algorithm](#toc0_)

- **Definition**: Optimization algorithms in neural networks are used to minimize the error (loss function output) and improve the model's performance. They adjust the weights and biases of the model in order to minimize the output of the loss function.

- **Intuition**: Think of the optimization process as a hiker (the optimization algorithm) trying to find the bottom of a valley (the minimum of the loss function) while only being able to see a few feet ahead (the current batch of data).

- **Purpose**: The purpose of an optimization algorithm is to find the best set of weights and biases for the model that minimize the output of the loss function.

- **Formula**: There is no single formula for optimization algorithms, since each one updates the weights and biases in its own way. For example, the update rule for Stochastic Gradient Descent (SGD) is \(w \leftarrow w - \eta \nabla L\), where \(w\) is a weight, \(\eta\) is the learning rate, and \(\nabla L\) is the gradient of the loss function with respect to \(w\), computed on the current batch.

- **Code**: Here is an example of how to use an optimization algorithm in Python using the Keras library:

```python
from keras.models import Sequential
from keras.layers import Dense
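from keras.optimizers import SGD
from keras.optimizers.schedules import InverseTimeDecay

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=50))
model.add(Dense(1, activation='sigmoid'))

# Recent Keras versions take `learning_rate` instead of `lr`, and express the
# old `decay=1e-6` argument as a schedule: lr_t = 0.01 / (1 + 1e-6 * step).
lr_schedule = InverseTimeDecay(initial_learning_rate=0.01, decay_steps=1, decay_rate=1e-6)
sgd = SGD(learning_rate=lr_schedule, momentum=0.9, nesterov=True)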
model.compile(loss='binary_crossentropy', optimizer=sgd)
```
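
The SGD update rule itself fits in a few lines. The sketch below applies it to a toy one-dimensional loss, \(L(w) = (w - 3)^2\), chosen only for illustration. Because this loss involves no data, the loop is plain gradient descent. In real training the gradient would be estimated from a batch of examples, which is where the "stochastic" part comes from.

```python
def sgd_minimize(grad_fn, w, learning_rate=0.1, steps=100):
    """Repeatedly apply w <- w - learning_rate * gradient."""
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w


# Toy loss L(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
def toy_grad(w):
    return 2.0 * (w - 3.0)


w_final = sgd_minimize(toy_grad, w=0.0)
print(w_final)
```

Each step moves `w` a fraction of the way towards 3, so the result ends up very close to the minimum.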

- **Limitations and Cautions**: The choice of optimization algorithm can greatly affect how quickly, and how well, a neural network trains. For example, plain SGD is a solid general-purpose optimizer, but with a single fixed learning rate it can make slow progress on loss surfaces with plateaus, saddle points, or narrow ravines. Momentum and adaptive methods such as Adam often help in those cases.

- **Subconcepts**: Some of the common types of optimization algorithms include:
  - **Stochastic Gradient Descent (SGD)**: This is the most basic optimization algorithm. It updates each weight using the gradient of the loss function with respect to that weight, estimated from a single example or a small batch.
  - **Momentum**: This is a variant of SGD that takes the previous gradients into account to smooth out the update process.
  - **Adagrad**: This algorithm adapts the learning rate to the parameters, performing smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequent features.
  - **RMSprop**: This is an unpublished, adaptive learning rate method proposed by Geoff Hinton in his Coursera course. It divides the learning rate for each weight by a moving average of the magnitudes of that weight's recent gradients.
  - **Adam**: This algorithm combines the benefits of RMSprop and momentum by keeping moving averages of both the gradients and the squared gradients.
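
To make the relationship between momentum, RMSprop, and Adam concrete, here is a minimal sketch of the Adam update for a single weight. The default values follow the commonly used settings (`beta1=0.9`, `beta2=0.999`); the larger learning rate and the toy loss \(L(w) = (w - 3)^2\) are assumptions chosen so the example converges quickly.

```python
import math


def adam_minimize(grad_fn, w, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, steps=500):
    m = 0.0  # moving average of the gradient (momentum-like term)
    v = 0.0  # moving average of the squared gradient (RMSprop-like term)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction: both averages start at 0 and are too small early on.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w


# Toy loss L(w) = (w - 3)^2 with gradient 2 * (w - 3).
def toy_grad(w):
    return 2.0 * (w - 3.0)


w_adam = adam_minimize(toy_grad, w=0.0)
print(w_adam)
```

Dividing by the square root of `v_hat` is what makes the step size adapt to each weight, while `m_hat` supplies the smoothing that momentum provides.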



### <a id='toc1_1_4_'></a>[Optimization Algorithm Parameters](#toc0_)

- **Definition**: These are the parameters that define how the optimization algorithm works. For example, the learning rate is a common parameter that determines how much the weights are updated during training.

- **Intuition**: Think of these parameters as the settings on a machine. By adjusting these settings, you can change how the machine operates.

- **Purpose**: The purpose of these parameters is to control the behavior of the optimization algorithm.

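- **Formula**: These parameters are values chosen before training begins rather than quantities computed by a formula, but they do appear in the optimizer's update rule. For SGD with momentum, for example, the update is \(v \leftarrow \mu v - \eta \nabla L\) followed by \(w \leftarrow w + v\), where \(\eta\) is the learning rate, \(\mu\) is the momentum, and \(v\) is the running update (velocity).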

- **Code**: Here is an example of how to set the parameters of an optimization algorithm in Python using the Keras library:

```python
from keras.models import Sequential
from keras.layers import Dense
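from keras.optimizers import SGD
from keras.optimizers.schedules import InverseTimeDecay

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=50))
model.add(Dense(1, activation='sigmoid'))

# Recent Keras versions take `learning_rate` instead of `lr`, and express the
# old `decay=1e-6` argument as a schedule: lr_t = 0.01 / (1 + 1e-6 * step).
lr_schedule = InverseTimeDecay(initial_learning_rate=0.01, decay_steps=1, decay_rate=1e-6)
sgd = SGD(learning_rate=lr_schedule, momentum=0.9, nesterov=True)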
model.compile(loss='binary_crossentropy', optimizer=sgd)
```

- **Limitations and Cautions**: The choice of parameters can greatly affect the performance of the optimization algorithm. It's important to choose the right values for your specific task. For example, a learning rate that is too high can cause the algorithm to overshoot the minimum of the loss function, while a learning rate that is too low can cause the training process to be very slow.

- **Subconcepts**: Some of the common types of optimization algorithm parameters include:
  - **Learning Rate**: This is the size of the steps the algorithm takes towards the minimum of the loss function.
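  - **Momentum**: This is a value between 0 and 1 that controls how much of the previous update is carried over into the current one. It speeds up progress in directions where successive gradients agree and dampens oscillations where they do not.
  - **Decay**: This is a value that reduces the learning rate over time, helping the algorithm to settle at the minimum of the loss function. With the time-based decay used by older Keras versions, the learning rate after \(t\) updates is \(\eta / (1 + \text{decay} \cdot t)\).
  - **Nesterov Momentum**: This is a variant of momentum that evaluates the gradient after a look-ahead step in the direction of the accumulated momentum. It often performs slightly better in practice.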
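
The effect of these settings is easiest to see in a small experiment. The function below is a simplified, single-weight sketch of SGD with momentum, time-based decay, and a Nesterov option, applied to the toy loss \(L(w) = (w - 3)^2\). The function name, the toy loss, and the step counts are assumptions for illustration, and the Nesterov branch follows the formulation described in the Keras `SGD` documentation.

```python
def sgd_with_momentum(grad_fn, w, learning_rate=0.01, momentum=0.9,
                      decay=0.0, nesterov=False, steps=500):
    velocity = 0.0
    for t in range(steps):
        # Time-based decay: the step size shrinks as training progresses.
        lr_t = learning_rate / (1.0 + decay * t)
        g = grad_fn(w)
        # Momentum: keep a fraction of the previous update.
        velocity = momentum * velocity - lr_t * g
        if nesterov:
            # Nesterov: apply the momentum step once more, as a look-ahead.
            w = w + momentum * velocity - lr_t * g
        else:
            w = w + velocity
    return w


# Toy loss L(w) = (w - 3)^2 with gradient 2 * (w - 3).
def toy_grad(w):
    return 2.0 * (w - 3.0)


w_plain = sgd_with_momentum(toy_grad, 0.0, momentum=0.0)
w_momentum = sgd_with_momentum(toy_grad, 0.0, momentum=0.9, decay=1e-6)
w_nesterov = sgd_with_momentum(toy_grad, 0.0, momentum=0.9, decay=1e-6,
                               nesterov=True)
# A learning rate that is too high overshoots further on every step.
w_too_fast = sgd_with_momentum(toy_grad, 0.0, learning_rate=1.1,
                               momentum=0.0, steps=20)

print(w_plain, w_momentum, w_nesterov, w_too_fast)
```

The first three runs end close to the minimum at 3, while the run with the oversized learning rate moves further away with each update.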



### <a id='toc1_1_5_'></a>[Batch Size](#toc0_)

- **Definition**: Batch size is the number of training examples used in one iteration (one weight update). For instance, say you have 1000 training samples and set the batch size to 100. The algorithm takes the first 100 samples (1st to 100th) from the training dataset and trains the network on them. Next it takes the second 100 samples (101st to 200th) and trains the network again. This continues until all samples have been propagated through the network, which takes 10 iterations.

- **Intuition**: Larger batches give a smoother gradient estimate and make better use of parallel hardware, so each epoch finishes faster, but the weights are updated fewer times per epoch. Smaller batches produce noisier gradients and slower epochs, yet the more frequent updates can reach a good solution in fewer epochs. Which trade-off wins depends on the problem.

- **Purpose**: Processing the data in batches lets the model train without holding the whole dataset in memory at once. By adjusting the batch size, you can make sure each training step fits in your machine's memory.

- **Formula**: There isn't a specific formula for batch size, as it is a hyperparameter that you set before training the model.

- **Code**: Here is an example of how to set the batch size in Python using the Keras library:

```python
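# Assumes `model` is already compiled and X_train, Y_train hold the training data
model.fit(X_train, Y_train, epochs=10, batch_size=32)  # 32 samples per weight update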
```

- **Limitations and Cautions**: The choice of batch size can significantly affect the performance of your model. A batch size that is too large can lead to poor generalization (the model learns the training data too well and performs poorly on unseen data). On the other hand, a batch size that is too small can lead to slow convergence and a noisy gradient signal.
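
The splitting described in the definition can be sketched in a few lines. The helper name `iterate_minibatches` and the integer stand-ins for training samples are illustrative assumptions. Keras does this batching internally when `batch_size` is passed to `fit`.

```python
import math


def iterate_minibatches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(1, 1001))  # stand-in for 1000 training samples
batches = list(iterate_minibatches(samples, batch_size=100))
iterations_per_epoch = math.ceil(len(samples) / 100)

print(len(batches), batches[0][0], batches[0][-1], batches[1][0])

# When the dataset size is not a multiple of the batch size,
# the last batch is simply smaller.
uneven = list(iterate_minibatches(list(range(1050)), batch_size=100))
print(len(uneven), len(uneven[-1]))
```

The number of iterations per epoch is the dataset size divided by the batch size, rounded up.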



### <a id='toc1_1_6_'></a>[Epochs](#toc0_)

- **Definition**: An epoch is one complete pass of the learning algorithm over the entire training dataset. The number of epochs is therefore the number of such passes made during training. If the batch size equals the size of the whole dataset, each epoch consists of a single iteration, so the number of epochs equals the number of iterations.

- **Intuition**: More epochs means the learning algorithm has more opportunities to tune the weights of the network to better map inputs to outputs. But more training isn't always better. A point of diminishing returns can be reached.

- **Purpose**: The purpose of setting the number of epochs is to specify how long we want to train our neural network.

- **Formula**: There isn't a specific formula for epochs, as it is a hyperparameter that you set before training the model.

- **Code**: Here is an example of how to set the number of epochs in Python using the Keras library:

```python
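# Assumes `model` is already compiled and X_train, Y_train hold the training data
model.fit(X_train, Y_train, epochs=10, batch_size=32)  # 10 full passes over the data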
```

- **Limitations and Cautions**: The choice of the number of epochs is critical. Too few epochs can mean underfitting of the model, whereas too many epochs can mean overfitting of the model. It's important to choose a suitable number of epochs so that the model can learn the data well without overfitting.
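
How epochs and batches fit together is easiest to see in a hand-written training loop. The sketch below fits a one-weight model \(y = w x\) to synthetic data generated with \(w = 2\). The data, learning rate, and function name are assumptions for illustration, and the loop is a simplified stand-in for what `model.fit` does.

```python
import random


def train(xs, ys, epochs, batch_size, learning_rate=0.05, seed=0):
    """Fit y = w * x by mini-batch SGD; return w and the loss after each epoch."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    losses = []
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch
            errors = [(w * xs[i] - ys[i]) * xs[i] for i in batch]
            grad = 2.0 * sum(errors) / len(batch)
            w -= learning_rate * grad
        # One epoch is finished once every sample has been used exactly once.
        losses.append(sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))
    return w, losses


xs = [i / 10 for i in range(1, 21)]
ys = [2.0 * x for x in xs]
w_short, losses_short = train(xs, ys, epochs=1, batch_size=5)
w_long, losses_long = train(xs, ys, epochs=50, batch_size=5)

print(w_short, losses_short[-1])
print(w_long, losses_long[-1])
```

With a single epoch the weight has only moved part of the way towards 2, while 50 epochs bring it very close. This noise-free toy problem cannot overfit, so it illustrates only the underfitting side of the trade-off described above.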