# Deep Learning

## Github Repos

- [Deep Learning](https://github.com/udacity/deep-learning)
- [Deep Learning with Pytorch](https://github.com/udacity/deep-learning-v2-pytorch)

## Content

1. [Introduction to Neural Networks](#Introduction-to-Neural-Networks)
2. [Implementing Gradient Descent](#Implementing-Gradient-Descent)
3. [Training Neural Network](#Training-Neural-Network)
4. [Deep Learning with TensorFlow](#Deep-Learning-with-TensorFlow)
5. [Deep Learning with PyTorch](#Deep-Learning-with-PyTorch)
6. [Convolutional Neural Networks](#Convolutional-Neural-Networks)
7. [Recurrent Neural Networks](#Recurrent-Neural-Networks)
8. [Generative Adversarial Networks](#Generative-Adversarial-Networks)
9. [Deploying a Model](#Deploying-a-Model)
10. [Projects](#Projects)

## Introduction to Neural Networks

### Gradient Descent
[Principles and the math behind the gradient descent algorithm](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Deep%20Learning/Introduction%20to%20Neural%20Networks/Gradient%20Descent.pdf)

#### Error Function

- The error function should be differentiable
- The error function should be continuous

### Activation Function

#### Gradient Descent Algorithm

- Sigmoid activation function

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

- Derivative of the sigmoid function

$$\sigma'(x)=\sigma(x)(1-\sigma(x))$$

- Output (prediction) formula

$$\hat{y} = \sigma(w_1 x_1 + w_2 x_2 + b)$$

- Error function

$$Error(y, \hat{y}) = - y \log(\hat{y}) - (1-y) \log(1-\hat{y})$$

- The function that updates the weights

$$ w_i \longrightarrow w_i + \alpha (y - \hat{y}) x_i$$

$$ b \longrightarrow b + \alpha (y - \hat{y})$$


```python
import numpy as np

# Activation (sigmoid) function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Prediction: sigmoid of the linear combination of the inputs
def output_formula(features, weights, bias):
    return sigmoid(np.dot(features, weights) + bias)

# Cross-entropy error for a single point
def error_formula(y, output):
    return - y*np.log(output) - (1 - y) * np.log(1-output)

# One gradient descent step on a single point (x, y)
def update_weights(x, y, weights, bias, learnrate):
    output = output_formula(x, weights, bias)
    d_error = y - output
    weights += learnrate * d_error * x
    bias += learnrate * d_error
    return weights, bias
```

### One-hot Encoding
Use the `get_dummies` function in Pandas in order to one-hot encode the data.

```python
# Make dummy variables for rank
one_hot_data = pd.concat([data, pd.get_dummies(data['rank'], prefix='rank')], axis=1)
```
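
To see what the one-liner above produces, here is a small self-contained sketch. The toy `data` frame and its `gre` column are made up for illustration; only the `rank` column name comes from the snippet above. Dropping the original `rank` column afterwards is an assumption about how the encoded data is used next.

```python
import pandas as pd

# Hypothetical toy data; only the 'rank' column name comes from the notes.
data = pd.DataFrame({"gre": [380, 660, 800], "rank": [3, 1, 3]})

# One dummy column is created per distinct rank value
dummies = pd.get_dummies(data["rank"], prefix="rank")
one_hot_data = pd.concat([data, dummies], axis=1)

# Assumption: the original categorical column is no longer needed
one_hot_data = one_hot_data.drop("rank", axis=1)

print(one_hot_data.columns.tolist())
```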

### Maximum Likelihood
- Taking the logarithm turns a product of probabilities into a sum: log(ab) = log(a) + log(b)
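
A quick numerical check of this identity, as a minimal sketch. The probability values are invented for illustration.

```python
import math

# Hypothetical probabilities that a model assigns to four independent points
probabilities = [0.7, 0.9, 0.8, 0.6]

# Likelihood of all points together: the product of the probabilities
product = math.prod(probabilities)

# The log of the product equals the sum of the individual logs
sum_of_logs = sum(math.log(p) for p in probabilities)

print(math.log(product), sum_of_logs)
```

Working with the sum is numerically safer than the product, which shrinks towards zero as more points are added.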

### Cross Entropy
A higher cross-entropy implies a lower probability for an event: cross-entropy is the negative logarithm of the total probability of an outcome, so it grows as that probability shrinks.

- A good model gives a low cross entropy
- A bad model gives a high cross entropy

$$
CE = - \sum_{i=1}^m \left[ y_i \ln(p_i) + (1-y_i) \ln (1-p_i) \right]
$$

#### Coding Cross-entropy
```python
# Y is for the category, and P is the probability.

import numpy as np

def cross_entropy(Y, P):
    # np.float_ was removed in NumPy 2.0; np.asarray works across versions
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))
```

### Logistic Regression
1. Start with random weights: $w_1, ... , w_n, b$
2. For every point $(x_1, ... , x_n)$: update $w_i, b$
3. Repeat until the error is small
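
The three steps can be sketched as a small training loop. This is an illustration only: the `train` helper, the toy data, and the fixed number of epochs (used here in place of an explicit "error is small" check) are assumptions, not part of the course material.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train(features, targets, epochs=100, learnrate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_features = features.shape[1]
    # Step 1: start with random weights
    weights = rng.normal(scale=1 / n_features**0.5, size=n_features)
    bias = 0.0
    # Step 3 (assumption): repeat for a fixed number of epochs
    for _ in range(epochs):
        # Step 2: for every point, update w_i and b
        for x, y in zip(features, targets):
            output = sigmoid(np.dot(x, weights) + bias)
            weights += learnrate * (y - output) * x
            bias += learnrate * (y - output)
    return weights, bias

# Hypothetical, linearly separable toy data (logical AND)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 0, 1])

weights, bias = train(X, y, epochs=2000, learnrate=0.5)
predictions = (sigmoid(X @ weights + bias) > 0.5).astype(int)
print(predictions)
```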

### Neural Network Architecture
- Input Layer
- Hidden Layer
- Output Layer

### Feedforward

### Backpropagation
- Doing a feedforward operation.
- Comparing the output of the model with the desired output.
- Calculating the error.
- Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
- Use this to update the weights, and get a better model.
- Continue this until we have a model that is good.

#### Backpropagate the error
$$ (y-\hat{y}) \sigma'(x) $$

```python
def error_term_formula(x, y, output):
    return (y - output)*sigmoid_prime(x)
```
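
The snippet above relies on a `sigmoid_prime` helper that is not shown in these notes. A minimal sketch of it, assuming `x` is the value fed into the sigmoid, could look like this:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid: sigma'(x) = sigma(x) * (1 - sigma(x))
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

def error_term_formula(x, y, output):
    return (y - output) * sigmoid_prime(x)

# Hypothetical values: linear combination h = 0, true label y = 1
h = 0.0
output = sigmoid(h)
error_term = error_term_formula(h, 1.0, output)
print(error_term)
```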

[Lab: Analyzing Student Data](../../notebooks/01%20Introduction%20to%20Neural%20Networks/StudentAdmissions.ipynb)

## Implementing Gradient Descent

### Mean Squared Error Function
$$
E=\frac{1}{2m}\sum_{\mu}(y^{\mu}-\hat{y}^{\mu})^2
$$
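
As a rough illustration of the formula, the sketch below computes the error with NumPy. The function name and the sample values are made up for the example.

```python
import numpy as np

def mean_squared_error(y, y_hat):
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    m = y.size  # number of data points
    # E = 1/(2m) * sum of squared differences
    return np.sum((y - y_hat) ** 2) / (2 * m)

# Hypothetical targets and predictions
error = mean_squared_error([1.0, 0.0, 1.0], [0.5, 0.5, 1.0])
print(error)
```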

- [Gradient Descent](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Deep%20Learning/02%20Implementing%20Gradient%20Descent/Gradient%20Descent.pdf)
- [Gradient Descent Code](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Deep%20Learning/02%20Implementing%20Gradient%20Descent/Gradient%20Descent%20Code.pdf)
- [Gradient Descent Implementing](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Deep%20Learning/02%20Implementing%20Gradient%20Descent/Gradient%20Descent%20Implementing.pdf)
- [Multilayer Perceptrons](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Deep%20Learning/02%20Implementing%20Gradient%20Descent/Multilayer%20Perceptrons.pdf)
- [Backpropagation](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Deep%20Learning/02%20Implementing%20Gradient%20Descent/Backpropagation.pdf)
- [Backpropagation Implementing](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Deep%20Learning/02%20Implementing%20Gradient%20Descent/Backpropagation%20Implementing.pdf)

Further reading
- From Andrej Karpathy: [Yes, you should understand backprop](https://karpathy.medium.com/yes-you-should-understand-backprop-e2f06eab496b#.vt3ax2kg9)
- Also from Andrej Karpathy, [a lecture from Stanford's CS231n course](https://www.youtube.com/watch?v=59Hbtz7XgjM)

## Training Neural Network

### Overfitting and Underfitting

- Overfitting -> high variance
- Underfitting -> high bias

![earlyStopping](./img/earlyStopping.png)

### Regularization
Large coefficients -> overfitting
- L1 Error Function: Good for feature selection
$$E = -\frac{1}{m} \sum_{i=1}^m \left[ y_i \ln(\hat{y}_i) + (1-y_i) \ln (1-\hat{y}_i) \right] + \lambda(|w_1|+...+|w_n|)$$
- L2 Error Function: Normally better for training models
$$E = -\frac{1}{m} \sum_{i=1}^m \left[ y_i \ln(\hat{y}_i) + (1-y_i) \ln (1-\hat{y}_i) \right] + \lambda(w_1^2+...+w_n^2)$$
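
The two penalty terms can be sketched in NumPy as follows. The helper names, weights, and the value of `lam` (standing in for $\lambda$) are hypothetical.

```python
import numpy as np

def l1_penalty(weights, lam):
    # lambda * (|w_1| + ... + |w_n|)
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    # lambda * (w_1^2 + ... + w_n^2)
    return lam * np.sum(np.square(weights))

# Hypothetical weights and regularization strength
weights = np.array([0.5, -2.0, 0.0])
lam = 0.1

l1 = l1_penalty(weights, lam)
l2 = l2_penalty(weights, lam)
print(l1, l2)
```

Either penalty would be added to the cross-entropy part of the error, so larger weights make the total error larger.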

### Dropout
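Randomly switch off nodes during training so that the network does not rely too heavily on any single node, which helps prevent overfitting.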
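
A minimal NumPy sketch of the idea is shown below. It uses the common "inverted dropout" convention, where the surviving activations are rescaled by `1 / keep_prob`; that rescaling is an assumption of this sketch and is not described in these notes.

```python
import numpy as np

def dropout(activations, drop_prob, rng):
    keep_prob = 1.0 - drop_prob
    # Each unit is kept independently with probability keep_prob
    mask = rng.random(activations.shape) < keep_prob
    # Assumption: rescale kept units so the expected activation is unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((4, 5))  # hypothetical layer output
dropped = dropout(activations, drop_prob=0.2, rng=rng)
print(dropped)
```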

### Random Restart
Restart gradient descent from several random starting points to jump out of local minima.

### Vanishing Gradient
- Hyperbolic tangent function
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

- Rectified Linear Unit (ReLU)
$$
\mathrm{relu}(x)=
\begin{cases}
x & \text{if } x\ge 0\\
0 & \text{if } x<0
\end{cases}
$$
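
To illustrate why these activations help, the sketch below compares derivatives at a few sample points. The sigmoid derivative never exceeds 0.25, so products of many such factors shrink quickly, while the ReLU derivative is 1 for positive inputs. The sample points are arbitrary.

```python
import numpy as np

def sigmoid_prime(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

def tanh_prime(x):
    return 1 - np.tanh(x) ** 2

def relu_prime(x):
    # Derivative for x > 0 is 1; the sample points below avoid x = 0
    return (x > 0).astype(float)

x = np.array([0.5, 2.0, 5.0])  # hypothetical pre-activation values

print("sigmoid'", sigmoid_prime(x))
print("tanh'   ", tanh_prime(x))
print("relu'   ", relu_prime(x))
```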

### Batch vs Stochastic Gradient Descent
Decrease training time

### Learning Rate Decay
Rule:
- If steep: long steps
- If flat: small steps

### Momentum
Helps solve the local minimum problem.
- STEP: average of previous steps
- $\beta$: momentum
- STEP(n) $\rightarrow$ STEP(n) + $\beta$ STEP(n-1) + $\beta^2$ STEP(n-2) + ...
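
The weighted sum above can be sketched in plain Python. The function name and the step values are hypothetical; steps are treated as scalars for simplicity.

```python
def momentum_step(previous_steps, beta):
    """Combine steps ordered oldest to newest into one momentum step."""
    total = 0.0
    # The newest step has weight 1, the one before beta, then beta^2, ...
    for age, step in enumerate(reversed(previous_steps)):
        total += (beta ** age) * step
    return total

# Hypothetical steps: STEP(n-2), STEP(n-1), STEP(n)
steps = [1.0, 2.0, 4.0]
effective = momentum_step(steps, beta=0.5)
print(effective)
```

With $\beta$ between 0 and 1, older steps matter less and less, which matches the formula above.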




## [Deep Learning with TensorFlow](https://github.com/udacity/intro-to-ml-tensorflow)

### Build Neural Network
[Part 1 Introduction to Neural Networks with TensorFlow](../../Deep%20Learning/04%20Deep%20Learning%20with%20TensorFlow/Notebooks/Part_1_Introduction_to_Neural_Networks_with_TensorFlow_(Solution).ipynb)

[Part 2 Neural networks with TensorFlow and Keras](../../Deep%20Learning/04%20Deep%20Learning%20with%20TensorFlow/Notebooks/Part_2_Neural_networks_with_TensorFlow_and_Keras_(Solution).ipynb)

- `tf.multiply()`: Performs element-wise multiplication on two inputs
- `tf.matmul()`: Performs matrix multiplication on two inputs
- `tf.reduce_sum()`: Computes the sum of elements across an input tensor's dimensions
- `tf.convert_to_tensor()`: Converts an ndarray to a TensorFlow tensor
- `tensor.numpy()`: Called on the tensor itself to convert it to an ndarray

There are [plenty of different datasets](https://www.tensorflow.org/datasets/catalog/overview) available from the `tensorflow_datasets` library, which we shortened in the code to `tfds`. Loading one of the datasets is simple with the `tfds.load()` function, which takes in the dataset name (in this case `mnist`), as well as some other optional arguments such as: 1) the dataset split to get (training, test, validation), 2) whether to shuffle the data, 3) if the data is to be used as part of a supervised learning algorithm (including labels), 4) whether to include metadata about the dataset itself, and [more](https://www.tensorflow.org/datasets/api_docs/python/tfds/load).

You can use the `.take()` function with an integer as an argument to get a certain number of images at once from the dataset.

#### Pipelines

- [Pipeline Performance](https://www.tensorflow.org/guide/data_performance)
- [Transformations](https://www.tensorflow.org/api_docs/python/tf/data/Dataset)

#### Softmax

To calculate this probability distribution, we often use the [**softmax** function](https://en.wikipedia.org/wiki/Softmax_function). Mathematically this looks like

$$
\Large \sigma(x_i) = \cfrac{e^{x_i}}{\sum_{k=1}^{K}{e^{x_k}}}
$$

TensorFlow also includes built-in softmax activation functions you can use. According to the TensorFlow API documentation, the following are available:

- `tf.nn.softmax`
- `tf.math.softmax`
- `tf.keras.activations.softmax`
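
For intuition, here is a plain NumPy sketch of the softmax formula. It is not how TensorFlow implements it; subtracting the row maximum is a standard trick assumed here to avoid overflow, and it does not change the result.

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    # Shift by the max of each row for numerical stability
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

# Hypothetical class scores for two examples
scores = np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
probs = softmax(scores)
print(probs)
```

Each row of `probs` sums to 1, so it can be read as a probability distribution over the classes.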

#### Neural Networks with TensorFlow

Keras helps further simplify working with neural networks running on TensorFlow under the hood. You can more easily stack layers with `tf.keras.Sequential`, making sure to feed an `input_shape` to the first layer of the network. You can also either add separate `Activation` layers, or feed an activation as an argument within certain layers, such as the `Dense` fully-connected layers.

Example:
```python
model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape = (28,28,1)),
        tf.keras.layers.Dense(256, activation = 'sigmoid'),
        tf.keras.layers.Dense(10, activation = 'softmax')
])
```

#### Subclassing
```python
class Network(tf.keras.Model):
    def __init__(self, num_classes = 2):
        super().__init__()
        self.num_classes = num_classes
    
        # Define layers 
        self.input_layer = tf.keras.layers.Flatten()
        self.hidden_layer = tf.keras.layers.Dense(256, activation = 'relu')
        self.output_layer = tf.keras.layers.Dense(self.num_classes, activation = 'softmax')
    
    # Define forward Pass   
    def call(self, input_tensor):
        x = self.input_layer(input_tensor)
        x = self.hidden_layer(x)
        x = self.output_layer(x)
    
        return x


# Create a model object
subclassed_model = Network(10)

# Build the model, i.e. initialize the model's weights and biases
subclassed_model.build((None, 28, 28, 1))

subclassed_model.summary()
```

#### Adding Layers with .add

Example:
```python
layer_neurons = [512, 256, 128, 56, 28, 14]

model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape = (28,28,1)))

for neurons in layer_neurons:
    model.add(tf.keras.layers.Dense(neurons, activation='relu'))
            
model.add(tf.keras.layers.Dense(10, activation='softmax'))
          
model.summary() 
```

#### Clearing the Graph

In order to avoid clutter from old models in the graph, we can use:

```python
tf.keras.backend.clear_session()
```

This command deletes the current `tf.keras` graph and creates a new one.


### Train Neural Network
[Part 3 Training Neural Networks](../../Deep%20Learning/04%20Deep%20Learning%20with%20TensorFlow/Notebooks/Part_3_Training_Neural_Networks_(Solution).ipynb)

Before we can train our model we need to set the parameters we are going to use to train it. We can configure our model for training using the `.compile` method. The main parameters we need to specify in the `.compile` method are:

* **Optimizer:** The algorithm that we'll use to update the weights of our model during training. Throughout these lessons we will use the [`adam`](http://arxiv.org/abs/1412.6980) optimizer. Adam is an extension of the stochastic gradient descent algorithm that adapts the learning rate of each parameter. For a full list of the optimizers available in `tf.keras` check out the [optimizers documentation](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/optimizers#classes).


* **Loss Function:** The loss function we are going to use during training to measure the difference between the true labels of the images in your dataset and the predictions made by your model. In this lesson we will use the `sparse_categorical_crossentropy` loss function. We use the `sparse_categorical_crossentropy` loss function when our dataset has labels that are integers, and the `categorical_crossentropy` loss function when our dataset has one-hot encoded labels. For a full list of the loss functions available in `tf.keras` check out the [losses documentation](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/losses#classes).


* **Metrics:** A list of metrics to be evaluated by the model during training. Throughout these lessons we will measure the `accuracy` of our model. The `accuracy` calculates how often our model's predictions match the true labels of the images in our dataset. For a full list of the metrics available in `tf.keras` check out the [metrics documentation](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/metrics#classes).

These are the main parameters we are going to set throughout these lessons. You can check out all the other configuration parameters in the [TensorFlow documentation](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model#compile).

Example:
```python
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```

#### Training the Model

Now let's train our model by using all the images in our training set. A note on nomenclature: one pass through the entire dataset is called an *epoch*. To train our model for a given number of epochs we use the `.fit` method, as seen below:

```python
EPOCHS = 5

history = model.fit(training_batches, epochs = EPOCHS)
```

The `.fit` method returns a `History` object which contains a record of training accuracy and loss values at successive epochs, as well as validation accuracy and loss values when applicable. We will discuss the history object in a later lesson. 

With our model trained, we can check out its predictions.

```python
## Build model
my_model = tf.keras.Sequential([
           tf.keras.layers.Flatten(input_shape = (28,28,1)),
           tf.keras.layers.Dense(128, activation = 'relu'),
           tf.keras.layers.Dense(64, activation = 'relu'),
           tf.keras.layers.Dense(32, activation = 'relu'),
           tf.keras.layers.Dense(10, activation = 'softmax')
])


my_model.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])


## Train model
EPOCHS = 5

history = my_model.fit(training_batches, epochs = EPOCHS)


## Predict model
for image_batch, label_batch in training_batches.take(1):
    ps = my_model.predict(image_batch)
    first_image = image_batch.numpy().squeeze()[0]
```


### Train Neural Network on Complex Dataset
[Part 4 Fashion MNIST](../../Deep%20Learning/04%20Deep%20Learning%20with%20TensorFlow/Notebooks/Part_4_Fashion_MNIST_(Solution).ipynb)

[Part 5 Inference and Validation](../../Deep%20Learning/04%20Deep%20Learning%20with%20TensorFlow/Notebooks/Part_5_Inference_and_Validation_(Solution).ipynb)

### Inference & Validation

We used `tfds.Split.ALL.subsplit` to make a 60/20/20 split for training, validation and test sets, although some TensorFlow datasets have these subsections already built in. Depending on the dataset, you may also want to make sure to shuffle the data at this point as well.

How can we avoid overfitting to the training data?
- Stop training when the training and validation curves start to diverge by a certain amount
- Save the model with the best validation accuracy seen during training
- Add layers like Dropout to help generalize the network
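
The first two ideas can be sketched without any framework. The helper below is hypothetical: it tracks validation loss rather than accuracy and stops after `patience` epochs without improvement, which is one common way to turn "the curves start to diverge" into a concrete rule.

```python
def best_epoch_with_early_stopping(val_losses, patience=2):
    """Return (epoch with the lowest validation loss, epoch where we stop)."""
    best_loss = float("inf")
    best_epoch = -1
    stopped_at = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        stopped_at = epoch
        if loss < best_loss:
            # New best model: this is where a checkpoint would be saved
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, stopped_at

# Hypothetical validation losses, one per epoch
val_losses = [0.90, 0.70, 0.60, 0.65, 0.72, 0.80]
best_epoch, stopped_at = best_epoch_with_early_stopping(val_losses, patience=2)
print(best_epoch, stopped_at)
```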


### Saving & Loading
[Part 6 Saving and Loading Models](../../Deep%20Learning/04%20Deep%20Learning%20with%20TensorFlow/Notebooks/Part_6_Saving_and_Loading_Models.ipynb)

In TensorFlow we can save our trained models in different formats. Here we will see how to save our models in TensorFlow's SavedModel format and as HDF5 files, which is the format used by Keras models.

#### Saving and Loading Models in HDF5 Format

To save our models in the format used by Keras models we use the `.save(filepath)` method. For example, to save a model called `my_model` in the current working directory with the name `test_model` we use:

```python
my_model.save('./test_model.h5')
```

It's important to note that we have to provide the `.h5` extension to the `filepath` in order to tell `tf.keras` to save our model as an HDF5 file.

The above command saves our model into a single HDF5 file that will contain:

* The model's architecture.
* The model's weight values which were learned during training.
* The model's training configuration, which corresponds to the parameters you passed to the `compile` method.
* The optimizer and its state. This allows you to resume training exactly where you left off.


In the cell below we save our trained `model` as an HDF5 file. The name of our HDF5 file will correspond to the current time stamp. This is useful if you are saving many models and want each of them to have a unique name. By default the `.save()` method will **silently** overwrite any existing file at the target location with the same name. If we want `tf.keras` to prompt us before overwriting a file with the same name, we can set the argument `overwrite=False` in the `.save()` method.

```python
import time

# Use the current time stamp as a unique file name
t = time.time()

saved_keras_model_filepath = './{}.h5'.format(int(t))

model.save(saved_keras_model_filepath)
```

Once a model has been saved, we can use `tf.keras.models.load_model(filepath)` to re-load our model. This command will also compile our model automatically using the saved training configuration, unless the model was never compiled in the first place.

```python
reloaded_keras_model = tf.keras.models.load_model(saved_keras_model_filepath)
```

#### Saving and Loading TensorFlow SavedModels

To export our models to the TensorFlow **SavedModel** format, we use the `tf.saved_model.save(model, export_dir)` function. For example, to save a model called `my_model` in a folder called `saved_models` located in the current working directory we use:

```python
tf.saved_model.save(my_model, './saved_models')
```

It's important to note that here we have to provide the path to the directory where we want to save our model, **NOT** the name of the file. This is because SavedModels are not saved in a single file. Rather, when you save your model as a SavedModel, the `tf.saved_model.save()` function will create an `assets` folder, a `variables` folder, and a `saved_model.pb` file inside the directory you provided.

The SavedModel files that are created contain:

* A TensorFlow checkpoint containing the model weights.
* A SavedModel proto containing the underlying TensorFlow graph. Separate graphs are saved for prediction (serving), training, and evaluation. If the model wasn't compiled before, then only the inference graph gets exported.
* The model's architecture configuration if available.

The SavedModel is a standalone serialization format for TensorFlow objects, supported by TensorFlow Serving as well as TensorFlow implementations other than Python. It does not require the original model building code to run, which makes it useful for sharing or deploying in different platforms, such as mobile and embedded devices (with TensorFlow Lite), servers (with TensorFlow Serving), and even web browsers (with TensorFlow.js).

In the cell below we save our trained model as a SavedModel. The name of the folder where we are going to save our model will correspond to the current time stamp. Again, this is useful if you are saving many models and want each of them to be saved in a unique directory.

```python
t = time.time()

savedModel_directory = './{}'.format(int(t))

tf.saved_model.save(model, savedModel_directory)
```

Once a model has been saved as a SavedModel, we can use `tf.saved_model.load(export_dir)` to re-load our model. 

```python
reloaded_SavedModel = tf.saved_model.load(savedModel_directory)
```

It's important to note that the object returned by `tf.saved_model.load` is **NOT** a Keras object. Therefore, it doesn't have `.fit`, `.predict`, `.summary`, etc. methods. It is 100% independent of the code that created it. This means that in order to make predictions with our `reloaded_SavedModel` we need to use a different method than the one used with the re-loaded Keras model.

To make predictions on a batch of images with a re-loaded SavedModel we have to use:

```python
reloaded_SavedModel(image_batch, training=False)
```

This will return a tensor with the predicted label probabilities for each image in the batch. Again, since we haven't done anything new to this re-loaded SavedModel, then both the `reloaded_SavedModel` and our original `model` should be identical copies. Therefore, they should make the same predictions on the same images.

We can also get back a full Keras model, from a TensorFlow SavedModel, by loading our SavedModel with the `tf.keras.models.load_model` function. 

```python
reloaded_keras_model_from_SavedModel = tf.keras.models.load_model(savedModel_directory)
```

#### Saving Models During Training

We have seen that when we train a model with a validation set, the value of the validation loss changes through the training process. Since the value of the validation loss is an indicator of how well our model will generalize to new data, it would be great if we could save our model at each step of the training process and then only keep the version with the lowest validation loss.

We can do this in `tf.keras` by using the following callback:

```python
tf.keras.callbacks.ModelCheckpoint('./best_model.h5', monitor='val_loss', save_best_only=True)
```
This callback will save the model as a Keras HDF5 file after every epoch. With the `save_best_only=True` argument, this callback will first check the validation loss of the latest model against the one previously saved. The callback will only save the latest model and overwrite the old one if the latest model has a lower validation loss than the one previously saved. This guarantees that we will end up with the version of the model that achieved the lowest validation loss during training.

### Loading Images with TensorFlow
[Part 7 Loading Image Data](../../Deep%20Learning/04%20Deep%20Learning%20with%20TensorFlow/Notebooks/Part_7_Loading_Image_Data_(Solution).ipynb)

### Data Augmentation
`tf.keras` offers many other transformations that we can apply to our images. You can take a look at all the available transformations in the [TensorFlow Documentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator#arguments)

* rotation_range
* width_shift_range
* height_shift_range
* shear_range
* zoom_range
* horizontal_flip
* fill_mode

### Creating a Validation Data Generator
Generally, we only apply data augmentation to our training data. Therefore, for the validation set we only need to normalize the pixel values of our images.

### Pre-Notebooks with GPU

### Transfer Learning

[Transfer Learning](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Deep%20Learning/04%20Deep%20Learning%20with%20TensorFlow/Transfer%20Learning.pdf)

[Part 8 Transfer Learning](../../Deep%20Learning/04%20Deep%20Learning%20with%20TensorFlow/Notebooks/Part_8_Transfer_Learning_(Solution).ipynb)

## Deep Learning with PyTorch

Calculate the output of single layer network using `torch.sum()` or `.sum()` and __matrix multiplication__.

### Watch those shapes
In general, you'll want to check that the tensors going through your model and other code are the correct shapes. Make use of the `.shape` attribute during debugging and development.

A few things to check if your network isn't training appropriately:

- Make sure you're clearing the gradients in the training loop with `optimizer.zero_grad()`.
- If you're doing a validation loop, be sure to set the network to evaluation mode with `model.eval()`, then back to training mode with `model.train()`.


### CUDA errors
Sometimes you'll see this error:

```
RuntimeError: Expected object of type torch.FloatTensor but found type torch.cuda.FloatTensor for argument #1 ‘mat1’
```

You'll notice the second type is `torch.cuda.FloatTensor`, which means it's a tensor that has been moved to the GPU. The operation is expecting a tensor with type `torch.FloatTensor` (no `.cuda` there), which means the tensor should be on the CPU. PyTorch can only perform operations on tensors that are on the same device, so either both CPU or both GPU. If you're trying to run your network on the GPU, check to make sure you've moved the model and all necessary tensors to the GPU with `.to(device)` where `device` is either `"cuda"` or `"cpu"`.


[Tutorial: Deep Learning in PyTorch](http://iamtrask.github.io/2017/01/15/pytorch-tutorial/)

[Notebooks](https://github.com/stephengineer/Machine-Learning/tree/main/Deep%20Learning/05%20Deep%20Learning%20with%20PyTorch)

## Convolutional Neural Networks

### Normalizing image inputs
Data normalization is an important pre-processing step. It ensures that each input (each pixel value, in this case) comes from a standard distribution. That is, the range of pixel values in one input image is the same as the range in another image. This standardization makes our model train and reach a minimum error faster!


### ReLU Activation Function
The purpose of an activation function is to scale the outputs of a layer so that they are a consistent, small value. Much like normalizing input values, this step ensures that our model trains efficiently!

A ReLU activation function stands for "Rectified Linear Unit" and is one of the most commonly used activation functions for hidden layers. It is an activation function, simply defined as the __positive__ part of the input, `x`. So, for an input image with any negative pixel values, this would turn all those values to `0`, black. You may hear this referred to as "clipping" the values to zero; meaning that is the lower bound.

![ReLU](./img/relu-ex.png)


### Cross-Entropy Loss
In the [PyTorch documentation](https://pytorch.org/docs/stable/nn.html#crossentropyloss), you can see that the cross entropy loss function actually involves two steps:

- It first applies a (log-)softmax function to any output it sees
- Then applies [NLLLoss](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html#torch.nn.NLLLoss); negative log likelihood loss

Then it returns the average loss over a batch of data. Since it applies a softmax function, we do not have to specify that in the `forward` function of our model definition, but we could do this another way.

#### Another approach
We could separate the softmax and NLLLoss steps.

- In the `forward` function of our model, we would explicitly apply a log-softmax activation function to the output, `x`.

```py
# a log-softmax layer to convert 10 outputs into log-probabilities over the classes
x = F.log_softmax(x, dim=1)

return x
```

- Then, when defining our loss criterion, we would apply NLLLoss

```py
# cross entropy loss combines softmax and nn.NLLLoss() in one single class
# here, we've separated them
criterion = nn.NLLLoss()
```

This separates the usual `criterion = nn.CrossEntropyLoss()` into two steps: log-softmax and NLLLoss, and is a useful approach should you want the output of a model to be class (log-)probabilities rather than raw class scores.
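
The equivalence of the two approaches can be checked with a small NumPy re-implementation. This is a sketch for intuition, not PyTorch's actual code; the function names mirror the PyTorch ones only loosely, and the scores and targets are made up.

```python
import numpy as np

def log_softmax(scores):
    # Shift by the row max for numerical stability
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))

def nll_loss(log_probs, targets):
    # Average negative log-probability of the correct class over the batch
    return -np.mean(log_probs[np.arange(len(targets)), targets])

def cross_entropy(scores, targets):
    # Cross-entropy on raw scores = log-softmax followed by NLL
    return nll_loss(log_softmax(scores), targets)

# Hypothetical class scores for a batch of two examples, three classes
scores = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
targets = np.array([0, 1])

loss = cross_entropy(scores, targets)
print(loss)
```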


### Validation Set: Takeaways

- Measure how well a model generalizes during training
- Tell us when to stop training a model: when the validation loss stops decreasing (and especially when the validation loss starts increasing while the training loss is still decreasing)

![imageClassificationSteps](./img/imageClassificationSteps.png)

### Filters
To detect changes in intensity in an image, you’ll be using and creating specific image filters that look at groups of pixels and react to alternating patterns of dark/light pixels. These filters produce an output that shows edges of objects and differing textures.

### Frequency in images
We have an intuition of what frequency means when it comes to sound. High-frequency is a high pitched noise, like a bird chirp or violin. And low frequency sounds are low pitch, like a deep voice or a bass drum. For sound, frequency actually refers to how fast a sound wave is oscillating; oscillations are usually measured in cycles/s ([Hz](https://en.wikipedia.org/wiki/Hertz)), and high pitches are made by high-frequency waves. Examples of low and high-frequency sound waves are pictured below. On the y-axis is amplitude, which is a measure of sound pressure that corresponds to the perceived loudness of a sound, and on the x-axis is time.

![frequency](./img/frequency.png)

#### High and low frequency
Similarly, frequency in images is a __rate of change__. But what does it mean for an image to change? Well, images change in space, and a high frequency image is one where the intensity changes a lot, with the level of brightness changing quickly from one pixel to the next. A low frequency image may be one that is relatively uniform in brightness or changes very slowly. This is easiest to see in an example.

![frequencyImage](./img/frequencyImage.png)

Most images have both high-frequency and low-frequency components. In the image above, on the scarf and striped shirt, we have a high-frequency image pattern; this part changes very rapidly from one brightness to another. Higher up in this same image, we see parts of the sky and background that change very gradually, which is considered a smooth, low-frequency pattern.

__High-frequency components also correspond to the edges of objects in images__, which can help us classify those objects.

#### Edge Handling
Kernel convolution relies on centering a pixel and looking at its surrounding neighbors. So, what do you do if there are no surrounding pixels like on an image corner or edge? Well, there are a number of ways to process the edges, which are listed below. It’s most common to use padding, cropping, or extension. In extension, the border pixels of an image are copied and extended far enough to result in a filtered image of the same size as the original image.

__Extend__ The nearest border pixels are conceptually extended as far as necessary to provide values for the convolution. Corner pixels are extended in 90° wedges. Other edge pixels are extended in lines.

__Padding__ The image is padded with a border of 0's, black pixels.

__Crop__ Any pixel in the output image which would require values from beyond the edge is skipped. This method can result in the output image being slightly smaller, with the edges having been cropped.
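
The three strategies can be sketched with NumPy. `np.pad` with `mode="edge"` plays the role of extension and `mode="constant"` the role of zero padding; the `filter_valid` helper and the averaging kernel are hypothetical. The helper slides the kernel without flipping it, which gives the same result as a convolution for this symmetric kernel.

```python
import numpy as np

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

# Extend: border pixels are copied outwards
extended = np.pad(image, pad_width=1, mode="edge")

# Padding: a border of 0's (black pixels) is added
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

def filter_valid(img, kernel):
    """Apply the kernel only where it fully fits inside the image (crop)."""
    kh, kw = kernel.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.ones((3, 3)) / 9.0  # hypothetical 3x3 averaging kernel

cropped = filter_valid(image, kernel)       # output smaller than the input
same_size = filter_valid(extended, kernel)  # output as large as the input

print(cropped.shape, same_size.shape)
```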

### Pooling layers
Some architectures choose to use [average pooling](https://pytorch.org/docs/stable/nn.html#avgpool2d), which averages the pixel values in a given window size. So in a 2x2 window, this operation will see 4 pixel values and return a single value, the average of those four, as output!

This kind of pooling is typically not used for image classification problems because maxpooling is better at noticing the most important details about edges and other features in an image, but you may see this used in applications for which smoothing an image is preferable.
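
To compare the two kinds of pooling on the same input, here is a small NumPy sketch. The `pool2x2` helper and the sample image are made up, and the helper assumes the image height and width are even.

```python
import numpy as np

def pool2x2(image, reduce_fn):
    """Apply reduce_fn over non-overlapping 2x2 windows (even sizes assumed)."""
    h, w = image.shape
    # blocks[i, a, j, b] == image[2*i + a, 2*j + b]
    blocks = image.reshape(h // 2, 2, w // 2, 2)
    return reduce_fn(blocks, axis=(1, 3))

# Hypothetical 4x4 grayscale image
image = np.array([[1, 3, 2, 0],
                  [5, 7, 4, 2],
                  [0, 0, 9, 1],
                  [0, 4, 1, 1]], dtype=float)

avg_pooled = pool2x2(image, np.mean)
max_pooled = pool2x2(image, np.max)

print(avg_pooled)
print(max_pooled)
```

In the bottom-right window, max pooling keeps the single bright pixel (9) while average pooling smooths it down to 3.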

### Padding
Padding is just adding a border of pixels around an image. In PyTorch, you specify the size of this border.

Why do we need padding?

When we create a convolutional layer, we move a square filter around an image, using a center-pixel as an anchor. So, this kernel cannot perfectly overlay the edges/corners of images. The nice feature of padding is that it will allow us to control the spatial size of the output volumes (most commonly as we’ll see soon we will use it to exactly preserve the spatial size of the input volume so the input and output width and height are the same).

The most common methods of padding are padding an image with all 0-pixels (zero padding) or padding them with the nearest pixel value. You can read more about calculating the amount of padding, given a kernel_size, [here](https://cs231n.github.io/convolutional-networks/#conv).

### Formula: Number of Parameters in a Convolutional Layer
The number of parameters in a convolutional layer depends on the supplied values of `filters/out_channels`, `kernel_size`, and `input_shape`. Let's define a few variables:

- `K` - the number of filters in the convolutional layer
- `F` - the height and width of the convolutional filters
- `D_in` - the depth of the previous layer
Notice that `K` = `out_channels`, and `F` = `kernel_size`. Likewise, `D_in` is the last value in the `input_shape` tuple, typically 1 or 3 (grayscale and RGB, respectively).

Since there are `F*F*D_in` weights per filter, and the convolutional layer is composed of `K` filters, the total number of weights in the convolutional layer is `K*F*F*D_in`. Since there is one bias term per filter, the convolutional layer has `K` biases. Thus, the __number of parameters__ in the convolutional layer is given by `K*F*F*D_in + K`.
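
The formula can be checked with a few lines of Python. The function name `conv_param_count` and the example layer sizes are made up for illustration.

```python
def conv_param_count(K, F, D_in):
    """Number of parameters in a convolutional layer: K*F*F*D_in + K."""
    weights = K * F * F * D_in  # F*F*D_in weights for each of the K filters
    biases = K                  # one bias term per filter
    return weights + biases


# Hypothetical layers: 16 filters of size 3x3 on an RGB input,
# followed by 32 filters of size 3x3 on the 16-channel output.
print(conv_param_count(K=16, F=3, D_in=3))
print(conv_param_count(K=32, F=3, D_in=16))
```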

### Formula: Shape of a Convolutional Layer
The shape of a convolutional layer depends on the supplied values of `kernel_size`, `input_shape`, `padding`, and `stride`. Let's define a few variables:

- `K` - the number of filters in the convolutional layer
- `F` - the height and width of the convolutional filters
- `S` - the stride of the convolution
- `P` - the padding
- `W_in` - the width/height (square) of the previous layer
Notice that `K = out_channels`, `F = kernel_size`, and `S = stride`. Likewise, `W_in` is the first and second value of the `input_shape` tuple.

The __depth__ of the convolutional layer will always equal the number of filters `K`.

The spatial dimensions of a convolutional layer can be calculated as: `(W_in−F+2P)/S+1`
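
A short sketch of this calculation follows. The function name `conv_output_shape` is hypothetical, and the use of floor division when the stride does not divide evenly is an assumption here (it matches how PyTorch rounds, but other frameworks may differ).

```python
def conv_output_shape(W_in, K, F, S=1, P=0):
    """Return (height, width, depth) of a conv layer for a square input."""
    # Assumption: round down when (W_in - F + 2P) is not divisible by S.
    spatial = (W_in - F + 2 * P) // S + 1
    return spatial, spatial, K


# 32x32 input, 16 filters of size 3x3, stride 1, padding 1: size preserved.
print(conv_output_shape(W_in=32, K=16, F=3, S=1, P=1))
# Same input without padding and with stride 2: the output shrinks.
print(conv_output_shape(W_in=32, K=16, F=3, S=2, P=0))
```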

### Optional Resources
- Check out the [AlexNet](https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf) paper!
- Read more about [VGGNet](https://arxiv.org/pdf/1409.1556.pdf) here.
- The [ResNet](https://arxiv.org/pdf/1512.03385v1.pdf) paper can be found here.
- Here's the [Keras documentation](https://keras.io/api/applications/) for accessing some famous CNN architectures.
- Read this [detailed treatment](http://neuralnetworksanddeeplearning.com/chap5.html) of the vanishing gradients problem.
- Here's a [GitHub repository](https://github.com/jcjohnson/cnn-benchmarks) containing benchmarks for different CNN architectures.
- Visit the [ImageNet Large Scale Visual Recognition Competition (ILSVRC)](https://image-net.org/challenges/LSVRC/) website.

### External Resource
[Deep learning eBook](https://www.deeplearningbook.org/) (2016) authored by Ian Goodfellow, Yoshua Bengio, and Aaron Courville; published by Cambridge: MIT Press


### 3. Transfer Learning
Transfer learning involves taking a pre-trained neural network and adapting the neural network to a new, different data set.

Depending on both:

- The size of the new data set, and
- The similarity of the new data set to the original data set

The approach for using transfer learning will be different. There are four main cases:

1. New data set is small, new data is similar to original training data.
2. New data set is small, new data is different from original training data.
3. New data set is large, new data is similar to original training data.
4. New data set is large, new data is different from original training data.

A large data set might have one million images. A small data set could have two thousand images. The dividing line between a large data set and a small data set is somewhat subjective. Overfitting is a concern when using transfer learning with a small data set.

Images of dogs and images of wolves would be considered similar; the images would share common characteristics. A data set of flower images would be different from a data set of dog images.

Each of the four transfer learning cases has its own approach. In the following sections, we will look at each case one by one.


### 4. Weight Initialization
It's key to include an element of uniqueness, or __randomness__! Better weights might be selected randomly from within a specified range. By adding variety and all unique weight values, we can ensure that backpropagation will have different activations to look at in the hidden layers, and it can respond to those differences.
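
As a minimal sketch of drawing weights at random from a specified range: the range used below, `[-1/sqrt(n), 1/sqrt(n)]` with `n` the number of inputs to the layer, is a common rule of thumb and an assumption here, not something stated in this section. The function name `init_weights` is hypothetical.

```python
import math
import random


def init_weights(n_inputs, n_outputs, seed=None):
    """Draw an n_inputs x n_outputs weight matrix uniformly from a small range."""
    rng = random.Random(seed)
    # Assumed range: +/- 1/sqrt(n_inputs), a common rule of thumb.
    limit = 1 / math.sqrt(n_inputs)
    return [[rng.uniform(-limit, limit) for _ in range(n_outputs)]
            for _ in range(n_inputs)]


weights = init_weights(n_inputs=4, n_outputs=3, seed=0)
for row in weights:
    print(row)
```

Because every weight is different, the hidden units start with different activations, so backpropagation can update them differently instead of treating them as identical copies.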

### Additional Material
- [Understanding the difficulty of training deep feedforward neural networks](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
- [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/pdf/1502.01852v1.pdf)
- [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167v2.pdf)


### 5. Autoencoders

Autoencoders are neural networks used for data compression, image de-noising, and dimensionality reduction.


### 6. Style Transfer
[Image Style Transfer Using Convolutional Neural Networks](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Gatys_Image_Style_Transfer_CVPR_2016_paper.pdf)

Content Loss
$$ L_{content} = \frac{1}{2} \sum (T_c - C_c)^2 $$

Style Loss
$$ L_{style} = a \sum_i w_i (T_{s,i} - S_{s,i})^2 $$

Total Loss
$$ \alpha L_{content} + \beta L_{style} $$

The smaller the alpha-beta ratio ($\frac{\alpha}{\beta}$), the more stylistic effect you will see.
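
The three formulas above can be sketched on plain lists of numbers. This is only an illustration of the arithmetic: the function names, the flattened per-layer representations, and the example values are all assumptions, and a real implementation would compute these terms from CNN feature maps.

```python
def content_loss(target_content, content):
    """L_content = 1/2 * sum((T_c - C_c)^2)."""
    return 0.5 * sum((t - c) ** 2 for t, c in zip(target_content, content))


def style_loss(target_style, style, layer_weights, a=1.0):
    """L_style = a * sum_i w_i * (T_{s,i} - S_{s,i})^2, summed per layer i."""
    total = 0.0
    for w, t_layer, s_layer in zip(layer_weights, target_style, style):
        total += w * sum((t - s) ** 2 for t, s in zip(t_layer, s_layer))
    return a * total


def total_loss(l_content, l_style, alpha, beta):
    """Total loss = alpha * L_content + beta * L_style."""
    return alpha * l_content + beta * l_style


# Hypothetical flattened representations for two style layers.
l_content = content_loss([1.0, 2.0], [0.0, 0.0])
l_style = style_loss([[1.0, 0.0], [2.0]], [[0.0, 0.0], [0.0]], [1.0, 0.5])
print(l_content, l_style)
print(total_loss(l_content, l_style, alpha=1, beta=10))
```

Raising `beta` relative to `alpha` makes the style term dominate the total, which is why a smaller ratio gives a more stylized result.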


### Project: Landmark Classification & Tagging for Social Media

Build a landmark classification and tagging system useful for social media!

### 8. Deep Learning in Cancer Detection

#### Sensitivity and Specificity

[Precision and Recall](https://en.wikipedia.org/wiki/Precision_and_recall)

Although similar, sensitivity and specificity are not the same as precision and recall. Here are the definitions:

In the cancer example, sensitivity and specificity are the following:

- Sensitivity: Of all the people **with** cancer, how many were correctly diagnosed?
- Specificity: Of all the people **without** cancer, how many were correctly diagnosed?

And precision and recall are the following:

- Recall: Of all the people who **have cancer**, how many did **we diagnose** as having cancer?
- Precision: Of all the people **we diagnosed** with cancer, how many actually **had cancer**?

From here we can see that Sensitivity is Recall, and the other two are not the same thing.

Trust me, we also have a hard time remembering which one is which, so here's a little trick. If you remember from Luis's Evaluation Metrics section, here is the confusion matrix:

![confusion-matrix](./img/confusion-matrix.png)

Now, sensitivity and specificity are the rows of this matrix. More specifically, if we label

- TP: (True Positives) Sick people that we **correctly** diagnosed as sick.
- TN: (True Negatives) Healthy people that we **correctly** diagnosed as healthy.
- FP: (False Positives) Healthy people that we **incorrectly** diagnosed as sick.
- FN: (False Negatives) Sick people that we **incorrectly** diagnosed as healthy.
then:

Sensitivity = $\frac{TP}{TP + FN}$

and

Specificity = $\frac{TN}{TN + FP}$

![sensitivity-specificity](./img/sensitivity-specificity.png)

<center>Sensitivity and Specificity</center>

And precision and recall are the top row and the left column of the matrix:

Recall = $\frac{TP}{TP + FN}$

and

Precision = $\frac{TP}{TP + FP}$

![precision-recall](./img/precision-recall.png)

<center>Precision and Recall</center>
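
The four metrics can be computed directly from the confusion-matrix counts. The counts used below are made-up numbers chosen only to illustrate the formulas.

```python
def sensitivity(tp, fn):
    """Of all sick people, the fraction correctly diagnosed as sick."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Of all healthy people, the fraction correctly diagnosed as healthy."""
    return tn / (tn + fp)


def precision(tp, fp):
    """Of all people diagnosed as sick, the fraction who really are sick."""
    return tp / (tp + fp)


# Recall is the same quantity as sensitivity.
recall = sensitivity

# Hypothetical counts: 100 sick patients and 1000 healthy patients.
tp, fn, tn, fp = 80, 20, 900, 100
print("sensitivity / recall:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
print("precision:", precision(tp, fp))
```

Note how precision is much lower than sensitivity in this example: because healthy patients are far more numerous, even a modest false-positive rate produces more false alarms than true detections.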

The graph below is a histogram of the predictions our model gives in a set of images of lesions, as follows:

- Each point in the horizontal axis is a value $p$ from 0 to 1.
- Over each value $p$, we locate all the lesions that our classifier predicted to have probability $p$ of being malignant.

![threshold](./img/threshold.png)

Here we have graphed the thresholds at 0.2, 0.5, and 0.8. Notice how:

- At 0.2, we classify every malignant lesion correctly, yet we also send a lot of benign lesions for more testing.
- At 0.5, we miss some malignant lesions (bad), and we send a few benign lesions for more testing.
- At 0.8, we correctly classify most of the benign lesions, but we miss many malignant lesions (very bad).

So in this case, it's arguable that 0.2 is better.
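
The trade-off can be sketched by counting mistakes at each threshold. The predicted probabilities below are hypothetical values invented for this example, not the data behind the histogram, and `classify_at_threshold` is a made-up helper name.

```python
def classify_at_threshold(malignant_scores, benign_scores, threshold):
    """Count malignant lesions missed and benign lesions sent for more tests."""
    missed_malignant = sum(1 for p in malignant_scores if p < threshold)
    benign_sent_for_testing = sum(1 for p in benign_scores if p >= threshold)
    return missed_malignant, benign_sent_for_testing


# Hypothetical predicted probabilities of malignancy.
malignant_scores = [0.25, 0.4, 0.6, 0.7, 0.9, 0.95]
benign_scores = [0.05, 0.1, 0.15, 0.3, 0.45, 0.55]

for threshold in (0.2, 0.5, 0.8):
    missed, extra_tests = classify_at_threshold(
        malignant_scores, benign_scores, threshold)
    print(f"threshold={threshold}: missed malignant={missed}, "
          f"benign sent for testing={extra_tests}")
```

A missed malignant lesion is far more costly than an unnecessary follow-up test, which is the argument for preferring the low threshold.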


#### [ROC Curves](https://www.youtube.com/watch?v=2Iw5TiGzJI4)

The curves have been introduced as follows, where in the horizontal axis we plot the True Positive Rate, and in the vertical axis we plot the False Positive Rate.

![roc-1](./img/roc-1.png)

![roc](./img/roc.png)

However, you'll see that in this section, I will use a different ROC Curve. The one I use looks like I flipped it sideways, like this:

![roc-curve](./img/roc-curve.png)


And there's a really cool reason why I use this one. And it's because it's the curve we get when we plot the sensitivity in the horizontal axis, and the specificity in the vertical axis!

Let me be more specific (yes pun intended). Let's use the same histogram as in the last section.

![threshold-1](./img/threshold-1.png)

Recall that the values in the horizontal axis are all the possible thresholds. For any threshold $p$ between 0 and 1, the verdict of the model will be the following: "*Any lesion to the left of this threshold will be considered benign, and any lesion to the right of this threshold will be considered malignant, and sent for more tests*."

Now, for this particular model, we calculate the sensitivity and specificity as follows:

- Sensitivity: Out of all the malignant lesions, what percentage are to the right of the threshold (correctly classified)?
- Specificity: Out of all the benign lesions, what percentage are to the left of the threshold (correctly classified)?

And we plot that point, where the coordinates are (Sensitivity, Specificity). If we plot all the points corresponding to each of the possible thresholds between 0% and 100%, we'll get the ROC curve that I drew above. Therefore, we can also refer to the ROC curve as the *Sensitivity-Specificity Curve*.
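
Here is a minimal sketch of tracing that curve by sweeping the threshold. The scores and the threshold grid are hypothetical, and the function name `sensitivity_specificity_curve` is made up for this example.

```python
def sensitivity_specificity_curve(malignant_scores, benign_scores, thresholds):
    """Return one (sensitivity, specificity) point per threshold."""
    points = []
    for t in thresholds:
        # Malignant lesions at or to the right of the threshold are caught.
        sens = sum(p >= t for p in malignant_scores) / len(malignant_scores)
        # Benign lesions to the left of the threshold are correctly cleared.
        spec = sum(p < t for p in benign_scores) / len(benign_scores)
        points.append((sens, spec))
    return points


# Hypothetical predicted probabilities of malignancy.
malignant_scores = [0.3, 0.6, 0.7, 0.9]
benign_scores = [0.1, 0.2, 0.4, 0.8]
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]

curve = sensitivity_specificity_curve(malignant_scores, benign_scores, thresholds)
for t, (sens, spec) in zip(thresholds, curve):
    print(f"threshold={t}: sensitivity={sens}, specificity={spec}")
```

At threshold 0 everything is flagged (sensitivity 1, specificity 0), and at threshold 1 nothing is flagged (sensitivity 0, specificity 1); the points in between trace the curve.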



#### Confusion Matrices

In Luis's Evaluation Metrics section, we learned about confusion matrices. If you need a refresher, watch the [video](https://www.youtube.com/watch?v=9GLNjmMUB_4).

##### Type 1 and Type 2 Errors

Sometimes in the literature, you'll see False Positives and False Negatives referred to as Type 1 and Type 2 errors. Here is the correspondence:

- **Type 1 Error (Error of the first kind, or False Positive):** In the medical example, this is when we misdiagnose a healthy patient as sick.
- **Type 2 Error (Error of the second kind, or False Negative):** In the medical example, this is when we misdiagnose a sick patient as healthy.

But confusion matrices can be much larger than 2 x 2. Here's an example of a larger one. Let's say we have three illnesses called A, B, C. And here is a confusion matrix:

![new-confusion-matrix](./img/new-confusion-matrix.png)

<center>A confusion matrix for three types of illnesses: A, B, and C</center>

As you can see, each entry in the i-th row and the j-th column will tell you the probability of the patient having illness i and getting diagnosed with illness j.

For example, from the entry on the second row and the first column, we can determine that if a patient has illness B, the probability of getting diagnosed with illness A is exactly 0.08.
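
A matrix of this kind can be built from lists of actual and diagnosed illnesses. The sketch below normalizes each row so that entry (i, j) is the fraction of patients with illness i who were diagnosed with illness j. The patient data are invented for this example and do not reproduce the numbers in the figure.

```python
def confusion_matrix(actual, predicted, labels):
    """Row-normalized confusion matrix: rows are actual, columns are diagnosed."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        counts[index[a]][index[p]] += 1
    # Divide each row by its total so the entries in a row sum to 1.
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in counts]


labels = ["A", "B", "C"]
# Hypothetical patients: true illness and the diagnosis they received.
actual    = ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C"]
predicted = ["A", "A", "A", "B", "B", "B", "A", "C", "C", "C"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

Reading the second row, a patient with illness B in this toy data set is diagnosed with A a quarter of the time, correctly with B half of the time, and with C a quarter of the time.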


## Recurrent Neural Networks


### 1. Recurrent Neural Networks

[Sketch RNN (demo here)](https://magenta.tensorflow.org/assets/sketch_rnn_demo/index.html) is a program that learns to complete a drawing, once you give it something (a line or circle, etc.) to start!


#### A bit of history
RNNs have a key flaw, as capturing relationships that span more than 8 or 10 steps back is practically impossible. This flaw stems from the "__vanishing gradient__" problem in which the contribution of information decays geometrically over time.

What does this mean?

As you may recall, while training our network we use __backpropagation__. In the backpropagation process we adjust our weight matrices with the use of a __gradient__. In the process, gradients are calculated by repeated multiplications of derivatives. The values of these derivatives may be so small that these repeated multiplications cause the gradient to practically "vanish".
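
A quick numeric sketch shows how fast this happens. It assumes, for simplicity, that every step contributes the same derivative, and uses 0.25, the largest value the sigmoid derivative $\sigma(x)(1-\sigma(x))$ can take. The function name `gradient_factor` is hypothetical.

```python
def gradient_factor(derivative, steps):
    """Product of `steps` identical derivative terms from the chain rule."""
    factor = 1.0
    for _ in range(steps):
        factor *= derivative
    return factor


# 0.25 is the maximum of the sigmoid derivative, reached at x = 0,
# so this is the best case for a chain of sigmoid units.
for steps in (1, 5, 10, 20):
    print(steps, gradient_factor(0.25, steps))
```

Even in this best case the contribution from 10 steps back is scaled by less than one millionth, which is the geometric decay described above.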

__LSTM__ is one option to overcome the Vanishing Gradient problem in RNNs.

Please use these resources if you would like to read more about the [Vanishing Gradient](https://en.wikipedia.org/wiki/Vanishing_gradient_problem) problem or understand further the concept of a [Geometric Series](https://socratic.org/algebra/exponents-and-exponential-functions/geometric-sequences-and-exponential-functions) and how its values may exponentially decrease.

If you are still curious, for more information on the important milestones mentioned here, please take a peek at the following links:

- [TDNN](https://en.wikipedia.org/wiki/Time_delay_neural_network)
- Here is the original [Elman Network](https://onlinelibrary.wiley.com/doi/abs/10.1207/s15516709cog1402_1) publication from 1990. This link is provided here as it's a significant milestone in the world of RNNs. To simplify things a bit, you can take a look at the following [additional info](https://en.wikipedia.org/wiki/Recurrent_neural_network#Elman_networks_and_Jordan_networks).
- In this [LSTM](http://www.bioinf.jku.at/publications/older/2604.pdf) link you will find the original paper written by [Sepp Hochreiter](https://en.wikipedia.org/wiki/Sepp_Hochreiter) and [Jürgen Schmidhuber](https://people.idsia.ch//~juergen/). Don't get into all the details just yet. We will cover all of this later!

As mentioned in the video, Long Short-Term Memory Cells (LSTMs) and Gated Recurrent Units (GRUs) give a solution to the vanishing gradient problem, by helping us apply networks that have temporal dependencies. In this lesson we will focus on RNNs and continue with LSTMs. We will not be focusing on GRUs. More information about GRUs can be found in the following blog. Focus on the overview titled: __GRUs__.


#### Applications
There are so many interesting applications, let's look at a few more!

- Are you into gaming and bots? Check out the [DotA 2 bot by Open AI](https://openai.com/blog/dota-2/)
- How about [automatically adding sounds to silent movies?](https://www.youtube.com/watch?time_continue=1&v=0FW99AQmMc8)
- Here is a cool tool for [automatic handwriting generation]()
- Amazon's voice to text using high quality speech recognition, [Amazon Lex](https://aws.amazon.com/lex/faqs/).
- Facebook uses RNN and LSTM technologies for [building language models](https://engineering.fb.com/2016/10/25/ml-applications/building-an-efficient-neural-language-model-over-a-billion-words/)
- Netflix also uses RNN models: [here is an interesting read](https://arxiv.org/pdf/1511.06939.pdf)


#### Feedforward Neural Network - A Reminder
The mathematical calculations needed for training RNN systems are fascinating. To deeply understand the process, we first need to feel confident with the vanilla FFNN system. We need to thoroughly understand the feedforward process, as well as the backpropagation process used in the training phases of such systems. The next few videos will cover these topics, which you are already familiar with. We will address the feedforward process as well as backpropagation, using specific examples. These examples will serve as extra content to help further understand RNNs later in this lesson.

The following couple of videos will give you a brief overview of the __Feedforward Neural Network (FFNN)__.

As mentioned before, when working with neural networks we have 2 primary phases: __Training__ and __Evaluation__.

During the __training__ phase, we take the data set (also called the training set), which includes many pairs of inputs and their corresponding targets (outputs). Our goal is to find a set of weights that would best map the inputs to the desired outputs. In the __evaluation__ phase, we use the network that was created in the training phase, apply our new inputs and expect to obtain the desired outputs.

The training phase will include two steps: __Feedforward__ and __Backpropagation__

We will repeat these steps as many times as we need until we decide that our system has reached the best set of weights, giving us the best possible outputs.

The next two videos will focus on the feedforward process.

You will notice that in these videos I use subscripts as well as superscripts as a numeric notation for the weight matrix.

For example:

- $W_k$ is weight matrix k
- $W_{ij}^{k}$ is the ij element of weight matrix k


#### Feedforward
In this section we will look closely at the math behind the feedforward process. With the use of basic Linear Algebra tools, these calculations are pretty simple!

If you are not feeling confident with linear combinations and matrix multiplications, you can use the following links as a refresher:

- [Linear Combination](http://linear.ups.edu/html/section-LC.html)
- [Matrix Multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication)

Assuming that we have a single hidden layer, we will need two steps in our calculations. The first will be calculating the value of the hidden states and the latter will be calculating the value of the outputs.

Notice that both the hidden layer and the output layer are displayed as vectors, as they are both represented by more than a single neuron.

##### Calculating the value of the hidden states

[Video](https://youtu.be/4rCfnWbx8-0)

The vector $\bar{h'}$ of the hidden layer is calculated by multiplying the input vector by the weight matrix $W^{1}$, using vector-by-matrix multiplication: $\bar{h'}=(\bar{x}W^1)$

After finding h' we need an activation function $\Phi$ to finalize the computation of the hidden layer's values. This activation function can be a Hyperbolic Tangent, a Sigmoid or a ReLU function. We can use the following two equations to express the final hidden vector $\bar{h}$: $\bar{h}=\Phi(\bar{x}W^1)$ or $\bar{h}=\Phi(h')$
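
The two steps can be sketched in plain Python. The input vector, the weight values, and the choice of the hyperbolic tangent as $\Phi$ are assumptions made for this example, and the helper names are hypothetical.

```python
import math


def vector_matrix_product(vector, matrix):
    """Row vector times matrix: result[j] = sum_i vector[i] * matrix[i][j]."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]


def hidden_layer(x, W1, activation=math.tanh):
    """Compute h = Phi(x W1) for a single hidden layer."""
    h_prime = vector_matrix_product(x, W1)          # h' = x W1
    return [activation(value) for value in h_prime]  # h = Phi(h')


# Hypothetical input with 2 features and a hidden layer with 3 neurons.
x = [1.0, 2.0]
W1 = [[0.5, -1.0, 0.0],
      [0.25, 0.5, 0.0]]

h_prime = vector_matrix_product(x, W1)
h = hidden_layer(x, W1)
print(h_prime)
print(h)
```

Note that `W1` has one row per input and one column per hidden neuron, which is what makes the row-vector-by-matrix product well defined.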


More information on the activation functions and how to use them can be found [here](https://github.com/Kulbear/deep-learning-nano-foundation/wiki/ReLU-and-Softmax-Activation-Functions)


##### Calculating the values of the outputs

[Video](https://youtu.be/kTYbTVh1d0k)

The process of calculating the output vector is mathematically similar to that of calculating the vector of the hidden layer. We use, again, a vector by matrix multiplication, which can be followed by an activation function. The vector is the newly calculated hidden layer and the matrix is the one connecting the hidden layer to the output.

Essentially, each new layer in a neural network is calculated by a vector by matrix multiplication, where the vector represents the inputs to the new layer and the matrix is the one connecting these new inputs to the next layer.

In our example, the input vector is $\bar{h}$ and the matrix is $W^2$, therefore $\bar{y}=\bar{h}W^2$. In some applications it can be beneficial to use a softmax function (if we want all output values to be between zero and 1, and their sum to be 1).
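
A minimal sketch of the output step, followed by an optional softmax, is shown below. The hidden vector and the weights in `W2` are made-up values, and the helper names are hypothetical.

```python
import math


def vector_matrix_product(vector, matrix):
    """Row vector times matrix: result[j] = sum_i vector[i] * matrix[i][j]."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]


def softmax(values):
    """Map values to positive numbers that sum to 1."""
    largest = max(values)
    # Subtracting the maximum does not change the result but avoids overflow.
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical hidden vector (2 neurons) and output weights (2 outputs).
h = [1.0, 0.5]
W2 = [[1.0, 0.0],
      [0.0, 2.0]]

y = vector_matrix_product(h, W2)  # y = h W2
probabilities = softmax(y)
print(y)
print(probabilities)
```

Here both raw outputs are equal, so softmax assigns each output the same probability.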

The two error functions that are most commonly used are the [Mean Squared Error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error) (usually used in regression problems) and the [cross entropy](https://www.ics.uci.edu/~pjsadows/notes.pdf) (usually used in classification problems).

In the above calculations we used a variation of the MSE.
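
For reference, the two error functions named above can be sketched as follows. These are the textbook forms and are not necessarily the exact variation used in the videos. The cross entropy here assumes one-hot targets and strictly positive predicted probabilities.

```python
import math


def mean_squared_error(targets, outputs):
    """Average of the squared differences between targets and outputs."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


def cross_entropy(targets, probabilities):
    """-sum(t * log(p)); assumes one-hot targets and probabilities > 0."""
    return -sum(t * math.log(p)
                for t, p in zip(targets, probabilities) if t > 0)


# Regression-style example with two outputs.
print(mean_squared_error([1.0, 2.0], [1.5, 2.0]))
# Classification-style example: the true class is the second one.
print(cross_entropy([0, 1, 0], [0.2, 0.5, 0.3]))
```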

The next few videos will focus on the backpropagation process, or what we also call stochastic gradient descent with the use of the chain rule.
 
 

### 2. Long Short-Term Memory Networks (LSTMs)


### 3. Implementation of RNN and LSTM


### 4. Hyperparameters


### 5. Embeddings & Word2Vec


### 6. Sentiment Prediction RNN


### Project: Generate TV Scripts


### 8. Attention

## Generative Adversarial Networks


### 1. Generative Adversarial Networks


### 2. Deep Convolutional GANs


### 3. Pix2Pix & CycleGAN


### 4. Implementing a CycleGAN


### Project: Generate Faces



## Deploying a Model


### 1. Introduction to Deployment


### 2. Building a Model using SageMaker


### 3. Deploying and Using Model


### 4. Hyperparameter Tuning


### 5. Updating a Model


### Project: Deploying a Sentiment Analysis Model




## Projects

- [Sentiment Analysis](https://github.com/stephengineer/Machine-Learning/tree/main/Deep%20Learning/Projects/00%20Sentiment%20Analysis)
- [ImageClassifier](https://github.com/stephengineer/Machine-Learning/tree/main/Deep%20Learning/Projects/01%20Image%20Classifier)
- [Predicting Bike-Sharing Patterns](https://github.com/stephengineer/Machine-Learning/tree/main/Deep%20Learning/Projects/02%20Bike-Sharing%20Patterns)
- [Landmark Classification](https://github.com/stephengineer/Machine-Learning/tree/main/Deep%20Learning/Projects/03%20Landmark%20Classification%20and%20Tagging%20for%20Social%20Media)
- [Dermatologist AI](https://github.com/udacity/dermatologist-ai)