# <a id='toc1_'></a>[Neural Networks](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Neural Networks](#toc1_)    
    - [Types of Layers in Neural Networks](#toc1_1_1_)    
    - [Activation Functions](#toc1_1_2_)    
    - [Optimization Algorithm](#toc1_1_3_)    
    - [Optimization Algorithm Parameters](#toc1_1_4_)    
    - [Batch Size](#toc1_1_5_)    
    - [Epochs](#toc1_1_6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_1_'></a>[Types of Layers in Neural Networks](#toc0_)

- **Definition**: In a neural network, a layer is a group of nodes (neurons) that sit at the same depth of the network. Each layer takes input from the previous layer (or from the input data), transforms it, and passes its output to the next layer.

- **Intuition**: Think of layers as filters of information. Each layer extracts some information from the input data, which is then passed on to the next layer for further processing.

- **Purpose**: The purpose of having different types of layers is to allow the neural network to learn different types of features from the data. For example, convolutional layers are good at learning spatial features in image data, while recurrent layers are good at learning temporal features in time-series data.

- **Formula**: There isn't a specific formula for layers in a neural network, as they are more of a structural concept. However, the transformations performed by a layer can often be represented mathematically. For example, a fully connected layer performs a matrix multiplication and adds a bias term.

- **Code**: Here is an example of how to define a simple neural network with different types of layers in Python using the Keras library:

```python
from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPooling2D, Flatten

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(64, 64, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
```

- **Limitations and Cautions**: The choice of layer type can greatly affect the performance of a neural network. It's important to choose the right type of layer for the specific task at hand. For example, using a convolutional layer for image data is generally a good idea, but using it for time-series data might not be as effective.

- **Subconcepts**: Some of the common types of layers in a neural network include:
  - **Dense (or Fully Connected) Layers**: Every neuron in a dense layer is connected to every neuron in the previous layer.
  - **Convolutional Layers**: These layers apply a convolution operation to the input, passing the result to the next layer. This is especially effective for tasks like image recognition.
  - **Pooling Layers**: These layers reduce the spatial size of the convolved feature, reducing the computational complexity of the model.
  - **Recurrent Layers**: These layers process a sequence one time step at a time. The output (or hidden state) from the previous time step is fed back in alongside the current input, so the prediction at each step can depend on what came before.
  - **Normalization Layers**: These layers standardize the inputs to the layer, helping to stabilize the learning process and reduce the number of training epochs required.
  - **Dropout Layers**: These layers randomly set a fraction of input units to 0 at each update during training time, which helps prevent overfitting.
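
The matrix multiplication plus bias performed by a fully connected layer can be sketched in plain Python. This is a minimal illustration rather than how Keras implements `Dense`; the function name `dense_forward` and the example weights are made up for the demonstration, and no activation function is applied.

```python
def dense_forward(inputs, weights, biases):
    """Compute the pre-activation output of one dense layer: W x + b.

    inputs  -- list of n input values
    weights -- list of m rows, one row of n weights per neuron
    biases  -- list of m bias values, one per neuron
    """
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Hypothetical example: 3 inputs feeding a layer with 2 neurons.
x = [1.0, 2.0, 3.0]
W = [[0.5, -1.0, 0.0],
     [1.0, 1.0, 1.0]]
b = [0.5, -1.0]

z = dense_forward(x, W, b)
print(z)  # one value per neuron
```

Every neuron sees every input, which is what "fully connected" means: the layer has `n * m` weights plus `m` biases.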



## <a id='toc1_1_2_'></a>[Activation Functions](#toc0_)

- **Definition**: Activation functions are mathematical functions that determine the output of each neuron. The function is applied to the neuron's weighted input and determines how strongly the neuron is activated (“fired”), based on how relevant that input is for the model’s prediction.

- **Intuition**: Activation functions are like the gatekeepers of the neural network. They decide how much information should proceed further through the network.

- **Purpose**: They introduce non-linearity into the output of a neuron. This matters because most real-world data is non-linear, and without a non-linear activation a stack of layers can only represent a linear function of its input.

- **Formula**: There are many types of activation functions, each with its own formula. Here are a few examples:
  - Sigmoid: $f(x) = 1 / (1 + e^{-x})$
  - ReLU (Rectified Linear Unit): $f(x) = \max(0, x)$
  - Tanh: $f(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$

- **Code**: Here is an example of how to use activation functions in Python using the Keras library:

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=50))
model.add(Dense(1, activation='sigmoid'))
```

- **Limitations and Cautions**: The choice of activation function can greatly affect the performance of a neural network. It's important to choose the right activation function for the specific task at hand. For example, the ReLU activation function is often a good choice for hidden layers, but it wouldn't be a good choice for the output layer of a binary classification problem, where a sigmoid activation function would be more appropriate.

- **Subconcepts**: Some of the common types of activation functions include:
  - **Sigmoid**: This function maps the input values to a range between 0 and 1, making it useful for output neurons in binary classification.
  - **ReLU (Rectified Linear Unit)**: This function sets all negative values in the input to 0 and leaves all positive values unchanged.
  - **Tanh**: This function maps the input values to a range between -1 and 1.
  - **Softmax**: This function is often used in the output layer of a multi-class classification neural network. It converts the outputs into probability values for each class.



### 1. Sigmoid Function
- **Formula**: $f(x) = \frac{1}{1 + e^{-x}}$
- **Intuition**: The sigmoid function maps any input into a range between 0 and 1, making it useful for outputting probabilities.
- **Use Case**: It is often used in the output layer of a binary classification problem where the output is expected to be a probability that gives the likelihood of the input belonging to a particular class.
- **Limitations**: The sigmoid function suffers from the vanishing gradient problem: the gradient becomes very small when the input is large in magnitude (strongly positive or strongly negative), which can slow down learning during backpropagation. Its output is also not zero-centered, which can lead to undesirable zig-zagging dynamics in the gradient updates for the weights.
- **Assumptions**: No specific assumptions are necessary for using the sigmoid function.
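
To make the vanishing gradient concrete, the sketch below evaluates the sigmoid's derivative, $f'(x) = f(x)(1 - f(x))$, at a few inputs. The helper names are illustrative only.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid, written in terms of its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and shrinks quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  gradient = {sigmoid_grad(x):.6f}")
```

The derivative never exceeds 0.25, so multiplying many such factors together during backpropagation can shrink the gradient reaching the early layers.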



### 2. Tanh Function
- **Formula**: $f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
- **Intuition**: The tanh function is similar to the sigmoid function but maps any input to a range between -1 and 1. This means that the output is zero-centered.
- **Use Case**: It is often used in hidden layers of a neural network as it can model both positive and negative input values.
- **Limitations**: Like the sigmoid function, the tanh function also suffers from the vanishing gradient problem.
- **Assumptions**: No specific assumptions are necessary for using the tanh function.



### 3. ReLU (Rectified Linear Unit) Function
- **Formula**: $f(x) = \max(0, x)$
- **Intuition**: The ReLU function outputs the input directly if it is positive, otherwise, it outputs zero. It introduces non-linearity in the network without affecting the receptive fields of convolution layers.
- **Use Case**: It is widely used in the hidden layers of neural networks. Because its gradient is exactly 1 for positive inputs, it helps mitigate the vanishing gradient problem while still letting the model learn complex patterns.
- **Limitations**: The ReLU function suffers from the "dying ReLU" problem, where neurons can sometimes be stuck in the negative state and always output zero, causing them to stop learning.
- **Assumptions**: No specific assumptions are necessary for using the ReLU function.



### 4. Leaky ReLU Function
- **Formula**: $f(x) = \max(0.01x, x)$
- **Intuition**: Leaky ReLU is a variant of ReLU that has a small slope for negative values instead of a flat zero, which helps to alleviate the dying ReLU problem.
- **Use Case**: It can be used in the hidden layers of neural networks, especially when the dying ReLU problem is a concern.
- **Limitations**: The value of the slope for negative inputs is a hyperparameter and needs to be manually tuned.
- **Assumptions**: No specific assumptions are necessary for using the Leaky ReLU function.



### 5. Softmax Function
- **Formula**: $f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$ for $i = 1, \dots, K$, where $K$ is the number of classes.
- **Intuition**: The softmax function outputs a vector that represents the probability distribution of a list of potential outcomes. It's the multiclass generalization of the sigmoid function.
- **Use Case**: It is often used in the output layer of a neural network for multiclass classification problems.
- **Limitations**: It can suffer from numerical instability due to the exponentials involved in its calculation.
- **Assumptions**: No specific assumptions are necessary for using the softmax function.
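
The formulas for ReLU, Leaky ReLU, and softmax can be written out directly. The sketch below uses only the standard library and is meant as an illustration, not a replacement for the Keras implementations. The softmax subtracts the largest score before exponentiating, a common way to avoid the numerical overflow mentioned above; this does not change the result because the shift cancels in the ratio.

```python
import math


def relu(x):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Leaky ReLU: small non-zero slope for negative inputs."""
    return x if x > 0 else slope * x


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# A naive math.exp(1000.0) would raise OverflowError; the shifted version is fine.
probs = softmax([1000.0, 1001.0, 1002.0])
print(probs)
print(sum(probs))
```

The standard library already provides `math.tanh` for the tanh function.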

## <a id='toc1_1_3_'></a>[Optimization Algorithm](#toc0_)

- **Definition**: Optimization algorithms in neural networks are used to minimize the error (loss function output) and improve the model's performance. They adjust the weights and biases of the model in order to minimize the output of the loss function.

- **Intuition**: Think of the optimization process as a hiker (the optimization algorithm) trying to find the bottom of a valley (the minimum of the loss function) while only being able to see a few feet ahead (the current batch of data).

- **Purpose**: The purpose of an optimization algorithm is to find the best set of weights and biases for the model that minimize the output of the loss function.

- **Formula**: There isn't a specific formula for optimization algorithms as a whole, as each algorithm has its own method of updating the weights and biases. For example, the update rule for Stochastic Gradient Descent (SGD) is: $w = w - \eta \nabla L$, where $w$ is the weight, $\eta$ is the learning rate, and $\nabla L$ is the gradient of the loss function.

- **Code**: Here is an example of how to use an optimization algorithm in Python using the Keras library:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=50))
model.add(Dense(1, activation='sigmoid'))

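# Recent Keras versions name this argument `learning_rate`; older versions
# accepted `lr` and a `decay` argument, which newer optimizers reject.
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)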
model.compile(loss='binary_crossentropy', optimizer=sgd)
```

- **Limitations and Cautions**: The choice of optimization algorithm can greatly affect the performance of a neural network. It's important to choose the right algorithm for the specific task at hand. For example, while SGD is a good general-purpose optimizer, it might struggle with problems where the loss function has many shallow minima.

- **Subconcepts**: Some of the common types of optimization algorithms include:
  - **Stochastic Gradient Descent (SGD)**: This is the most basic optimization algorithm. It updates the weights using the gradient of the loss function with respect to the weight.
  - **Momentum**: This is a variant of SGD that takes into account the previous gradients to smooth out the update process.
  - **Adagrad**: This algorithm adapts the learning rate to the parameters, performing smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequent features.
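  - **RMSprop**: This is an adaptive learning rate method proposed by Geoff Hinton in his Coursera course (it was never formally published). It divides the learning rate for each parameter by a moving average of the magnitudes of recent gradients for that parameter.
  - **Adam**: This algorithm combines the benefits of RMSprop and momentum by keeping moving averages of both the gradients and the squared gradients.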
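
As a rough illustration of the SGD update rule $w = w - \eta \nabla L$ and its momentum variant, the sketch below minimizes a toy one-dimensional loss $L(w) = (w - 3)^2$. The loss, the starting point, and the function names are assumptions chosen for the example; real training uses gradients computed from mini-batches of data.

```python
def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


def sgd(w, lr=0.1, momentum=0.0, steps=200):
    """Gradient descent with optional classical momentum."""
    velocity = 0.0
    for _ in range(steps):
        # Momentum keeps a fraction of the previous update and adds the new step.
        velocity = momentum * velocity - lr * grad(w)
        w = w + velocity
    return w


w_plain = sgd(0.0)                    # plain update: w = w - lr * grad
w_momentum = sgd(0.0, momentum=0.9)   # update smoothed by past gradients

print(w_plain, w_momentum)
```

With `momentum=0.0` the loop reduces to the plain update rule; both runs should end close to the minimum at 3.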



## <a id='toc1_1_4_'></a>[Optimization Algorithm Parameters](#toc0_)

- **Definition**: These are the parameters that define how the optimization algorithm works. For example, the learning rate is a common parameter that determines how much the weights are updated during training.

- **Intuition**: Think of these parameters as the settings on a machine. By adjusting these settings, you can change how the machine operates.

- **Purpose**: The purpose of these parameters is to control the behavior of the optimization algorithm.

- **Formula**: There isn't a specific formula for these parameters, as they are values that are set before the training process begins.

- **Code**: Here is an example of how to set the parameters of an optimization algorithm in Python using the Keras library:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=50))
model.add(Dense(1, activation='sigmoid'))

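# Recent Keras versions name this argument `learning_rate`; older versions
# accepted `lr` and a `decay` argument, which newer optimizers reject.
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)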
model.compile(loss='binary_crossentropy', optimizer=sgd)
```

- **Limitations and Cautions**: The choice of parameters can greatly affect the performance of the optimization algorithm. It's important to choose the right values for your specific task. For example, a learning rate that is too high can cause the algorithm to overshoot the minimum of the loss function, while a learning rate that is too low can cause the training process to be very slow.

- **Subconcepts**: Some of the common types of optimization algorithm parameters include:
  - **Learning Rate**: This is the size of the steps the algorithm takes towards the minimum of the loss function.
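  - **Momentum**: This is a value between 0 and 1 that controls how much of the previous update is carried over into the current one. When successive gradients point in the same direction, the steps build up speed; when they disagree, the updates are damped.
  - **Decay**: This is a value that reduces the learning rate over time, helping the algorithm to settle at the minimum of the loss function.
  - **Nesterov Momentum**: This is a variant of momentum that evaluates the gradient at the "looked-ahead" position the momentum step is about to reach, which often gives slightly better performance in practice.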
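
The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on $f(w) = w^2$ (minimum at 0) with three learning rates; the function, starting point, and step count are assumptions picked to keep the example small.

```python
def gradient_descent(lr, steps=50, w=1.0):
    """Minimize f(w) = w^2 (gradient 2w) with a fixed learning rate."""
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w


results = {lr: gradient_descent(lr) for lr in (0.01, 0.1, 1.1)}

for lr, w in results.items():
    print(f"lr = {lr:<5} final w = {w:.6g}")
```

The smallest rate is still far from the minimum after 50 steps, the middle one has essentially converged, and the largest overshoots on every step so that `w` grows instead of shrinking.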



## <a id='toc1_1_5_'></a>[Batch Size](#toc0_)

- **Definition**: Batch size is the number of training examples used in one iteration (one weight update). For instance, suppose you have 1000 training samples and set the batch size to 100. The algorithm takes the first 100 samples (1st to 100th) from the training dataset and trains the network. Next it takes the second 100 samples (101st to 200th) and trains the network again. This continues until every sample in the training set has been used.

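- **Intuition**: Larger batch sizes give a less noisy gradient estimate and get through each epoch in fewer steps, but they make fewer weight updates per epoch and don't always converge as fast. Smaller batch sizes take longer per epoch, but their more frequent, noisier updates can converge in fewer epochs. The best value is problem-dependent.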

- **Purpose**: The purpose of batch size is to allow the model to be trained using less memory space. By adjusting the batch size, you can ensure that your model is able to train on your machine's memory.

- **Formula**: There isn't a specific formula for batch size, as it is a hyperparameter that you set before training the model.

- **Code**: Here is an example of how to set the batch size in Python using the Keras library:

```python
model.fit(X_train, Y_train, epochs=10, batch_size=32)
```

- **Limitations and Cautions**: The choice of batch size can significantly affect the performance of your model. A batch size that is too large can lead to poor generalization (the model learns the training data too well and performs poorly on unseen data). On the other hand, a batch size that is too small can lead to slow convergence and a noisy gradient signal.
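
The 1000-sample example from the definition can be sketched as a small generator that slices the training set into consecutive batches. The names are illustrative, and the integer list stands in for real training samples; Keras additionally shuffles the data by default, which is omitted here.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive slices of `samples`, each at most `batch_size` long."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Stand-in for 1000 training samples, numbered 1..1000.
samples = list(range(1, 1001))
batches = list(iterate_batches(samples, batch_size=100))

print(len(batches))                     # iterations needed for one pass
print(batches[0][0], batches[0][-1])    # first and last sample of batch 1
print(batches[1][0], batches[1][-1])    # first and last sample of batch 2
```

If the dataset size is not a multiple of the batch size, the last batch is simply smaller.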



## <a id='toc1_1_6_'></a>[Epochs](#toc0_)

- **Definition**: An epoch is one complete pass of the learning algorithm over the entire training dataset; the number of epochs is how many such passes are made. If the batch size equals the size of the whole dataset, then the number of epochs equals the number of iterations.

- **Intuition**: More epochs means the learning algorithm has more opportunities to tune the weights of the network to better map inputs to outputs. But more training isn't always better. A point of diminishing returns can be reached.

- **Purpose**: The purpose of setting the number of epochs is to specify how long we want to train our neural network.

- **Formula**: There isn't a specific formula for epochs, as it is a hyperparameter that you set before training the model.

- **Code**: Here is an example of how to set the number of epochs in Python using the Keras library:

```python
model.fit(X_train, Y_train, epochs=10, batch_size=32)
```

- **Limitations and Cautions**: The choice of the number of epochs is critical. Too few epochs can mean underfitting of the model, whereas too many epochs can mean overfitting of the model. It's important to choose a suitable number of epochs so that the model can learn the data well without overfitting.
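
Epochs and batch size together determine how many weight updates a training run performs. The sketch below works this out for the `model.fit(..., epochs=10, batch_size=32)` call above, assuming a hypothetical training set of 1000 samples and that a final partial batch still counts as one update.

```python
import math


def total_weight_updates(num_samples, batch_size, epochs):
    """Number of iterations (weight updates) over a whole training run."""
    updates_per_epoch = math.ceil(num_samples / batch_size)
    return updates_per_epoch * epochs


# Hypothetical dataset of 1000 samples.
mini_batch_updates = total_weight_updates(1000, batch_size=32, epochs=10)
full_batch_updates = total_weight_updates(1000, batch_size=1000, epochs=10)

print(mini_batch_updates)
print(full_batch_updates)
```

In the full-batch case there is exactly one update per epoch, which is why the number of epochs and the number of iterations coincide.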

## Neural Net Hyperparameters

### Number of Hidden Layers

**Definition**: The number of hidden layers in a neural network is a hyperparameter that determines the depth of the network. Each hidden layer is composed of a set of neurons, where each neuron is a computational unit that takes in input from the previous layer, applies a transformation, and passes the output to the next layer.

**Intuition**: More hidden layers allow the network to learn more complex representations of the data. However, too many layers can lead to overfitting, where the model learns the training data too well and performs poorly on unseen data.

**Use Case**: Deep learning models, such as convolutional neural networks (CNNs) for image recognition or recurrent neural networks (RNNs) for sequence data, often have multiple hidden layers.

**Formula**: There is no specific formula for determining the optimal number of hidden layers. It is usually determined through experimentation and cross-validation.

**Limitations**: Adding more hidden layers increases the computational complexity of the model and the risk of overfitting. It also makes the model more difficult to train effectively, as gradients can vanish or explode in deep networks (a problem known as the vanishing/exploding gradients problem).

**Cautions**: It's important to balance the complexity of the model (number of hidden layers) with the amount and diversity of available training data. Regularization techniques, such as dropout or weight decay, can be used to prevent overfitting in deep networks.



### Number of Neurons in Hidden Layers

**Definition**: The number of neurons in a hidden layer is a hyperparameter that determines the width of the network. Each neuron in a layer takes in input from all neurons in the previous layer, applies a transformation, and passes the output to all neurons in the next layer.

**Intuition**: More neurons allow the layer to learn more complex representations of the data. However, too many neurons can lead to overfitting.

**Use Case**: The number of neurons in hidden layers is a key factor in the design of any neural network and is typically determined through experimentation and cross-validation.

**Formula**: There is no specific formula for determining the optimal number of neurons. It is usually determined through experimentation and cross-validation.

**Limitations**: Adding more neurons increases the computational complexity of the model and the risk of overfitting. It also increases the number of parameters in the model, making it more difficult to train effectively.

**Cautions**: It's important to balance the complexity of the model (number of neurons) with the amount and diversity of available training data. Regularization techniques, such as dropout or weight decay, can be used to prevent overfitting.
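
One way to see how depth and width affect model complexity is to count trainable parameters. The sketch below counts the weights and biases of a fully connected network from its layer sizes; the input size of 2 is an assumption made for the example, not something specified in these notes.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes -- e.g. [inputs, hidden_1, ..., hidden_k, outputs]
    """
    return sum(
        n_in * n_out + n_out  # weight matrix plus one bias per neuron
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Assumed 2 input features and 1 output in both cases.
deep_narrow = count_dense_parameters([2, 5, 5, 5, 5, 5, 1])  # five hidden layers of 5
shallow_wide = count_dense_parameters([2, 64, 1])            # one hidden layer of 64

print(deep_narrow, shallow_wide)
```

Parameter count grows with the product of adjacent layer widths, so widening layers adds parameters much faster than adding another narrow layer.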



### Learning Rate

**Definition**: The learning rate is a hyperparameter that determines the step size at each iteration while moving toward a minimum of a loss function. It controls how much we are adjusting the weights of our network with respect to the loss gradient.

**Intuition**: A smaller learning rate could get stuck in local minima, while a larger learning rate could overshoot the global minimum.

**Use Case**: The learning rate is a key factor in the training of any neural network and is typically determined through experimentation and cross-validation.

**Formula**: The learning rate is typically a constant, but it can also be adjusted dynamically during training (a technique known as learning rate scheduling).

**Limitations**: If the learning rate is too high, the model might converge too quickly to a suboptimal solution, or it might not converge at all. If the learning rate is too low, the model might take too long to converge, or it might get stuck in a local minimum.

**Cautions**: It's important to choose an appropriate learning rate for the specific problem and model. Techniques such as learning rate scheduling or adaptive learning rates can be used to adjust the learning rate during training.
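
Learning rate scheduling can be as simple as shrinking the rate by a fixed factor each epoch. The sketch below shows one such rule, exponential decay; the function name and the specific values are assumptions for illustration, and many other schedules exist.

```python
def exponential_decay(initial_lr, decay_rate, epoch):
    """Learning rate after `epoch` epochs: initial_lr * decay_rate ** epoch."""
    return initial_lr * decay_rate ** epoch


# Start at 0.1 and halve the learning rate every epoch.
schedule = [exponential_decay(0.1, 0.5, epoch) for epoch in range(4)]
print(schedule)
```

Large early steps make quick progress, while the smaller later steps help the optimizer settle near a minimum instead of bouncing around it.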



### Activation Function

**Definition**: The activation function is a mathematical function applied at each node in a layer, which determines the output of that node given an input or set of inputs.

**Intuition**: Different activation functions can model different types of relationships between input and output, and some may work better than others for a particular task.

**Use Case**: The choice of activation function can have a significant impact on the performance of a neural network. Common choices include ReLU, sigmoid, and tanh.

**Formula**: The formula for the activation function depends on the specific function used. For example, the ReLU function is defined as $f(x) = \max(0, x)$, and the sigmoid function is defined as $f(x) = \frac{1}{1 + e^{-x}}$.

**Limitations**: The choice of activation function can affect the ability of the network to converge and the speed of convergence. Some activation functions, like the sigmoid function, can suffer from the vanishing gradient problem, which can slow down training.

**Cautions**: It's important to choose an appropriate activation function for the specific problem and model. Different activation functions have different properties and are suitable for different types of tasks.

### Keras

Originally, TensorFlow was a complex, low-level library for building neural networks. To overcome this complexity, a group of people created a separate library, called Keras, which was an interface to make it easier to build sophisticated neural networks in TensorFlow and other neural network libraries. Keras was so popular and widely used that as of TensorFlow 2.X, Keras is integrated directly into TensorFlow and is the primary interface used to build neural networks in TensorFlow.

### Building a Neural Network Using Keras

The process of building a network using Keras can generally be broken down into four separate steps:
1. **Build the model**: This is the step where we will declare the structure of the network — primarily the types and sizes of the hidden layers.
2. **Compile the model**: This step allows us to customize some of the settings that will be used for training.
3. **Train the model**
4. **Evaluate the model and generate predictions**

### Example: 5 Hidden Layers with 5 Nodes Each

Let's take a closer look at how to build a neural network using TensorFlow by recreating the network we created previously using `scikit-learn`'s MLP Classifier.

<img src="https://drive.google.com/uc?export=view&id=1WfEXuqomB66DpKcMvBR57nXTI9XXFV9d" width=600 style="margin:20px 20px"/>

To start, let's import the required modules we will need to create this network using TensorFlow.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
```

We will use the same synthetic data from before to train our network. However, let's also generate a test set to allow us to further evaluate our model.

```python
X_train, y_train = generate_data()
X_test, y_test = generate_data(random_seed=1)  # generate test data with a different random seed
```



**Step 1.** Build the model:

```python
# random seeds for reproducibility
tf.random.set_seed(123)

# Create a new sequential model
model = keras.Sequential()

# Declare the hidden layers
model.add(layers.Dense(5, activation="relu"))
model.add(layers.Dense(5, activation="relu"))
model.add(layers.Dense(5, activation="relu"))
model.add(layers.Dense(5, activation="relu"))
model.add(layers.Dense(5, activation="relu"))

# Declare the output layer
model.add(layers.Dense(1, activation="sigmoid"))
```

**Step 2.** Compile the model

```python
model.compile(
    # Optimizer
    optimizer=keras.optimizers.Adam(),
    # Loss function to minimize
    loss=keras.losses.BinaryCrossentropy(),
    # Metric used to evaluate model
    metrics=[keras.metrics.BinaryAccuracy()]
)
```

**Step 3.** Train the Model

```python
history = model.fit(X_train, y_train, epochs=100, verbose=1)
```

```
Epoch 1/100
Epoch 2/100
...
```

**Step 4.** Evaluate the model using the test data and generate predictions

```python
# Evaluate the network
train_accuracy = history.history["binary_accuracy"][-1]
result = model.evaluate(X_test, y_test, verbose=0)

print(f"Train Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {result[1]:.4f}")

# Generate predictions
predictions = model.predict(X_test)
```

```
Train Accuracy: 0.9050
Test Accuracy: 0.8800
```
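
Because the output layer uses a sigmoid, each prediction is a probability between 0 and 1 rather than a class label. The sketch below shows one way to turn such probabilities into labels and score them, assuming the predictions have been flattened into a plain list. The probabilities, targets, and the 0.5 threshold are hypothetical values for illustration, not output from the model above.

```python
def to_class_labels(probabilities, threshold=0.5):
    """Map each predicted probability to class 1 or class 0."""
    return [1 if p > threshold else 0 for p in probabilities]


def accuracy(labels, targets):
    """Fraction of predicted labels that match the true targets."""
    correct = sum(1 for label, target in zip(labels, targets) if label == target)
    return correct / len(targets)


# Hypothetical sigmoid outputs and true classes for four samples.
example_probabilities = [0.91, 0.12, 0.55, 0.40]
example_targets = [1, 0, 0, 0]

example_labels = to_class_labels(example_probabilities)
print(example_labels)
print(accuracy(example_labels, example_targets))
```

Raising or lowering the threshold trades false positives against false negatives, so 0.5 is a common default rather than a requirement.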
