![](pics/header.png)

# Deep Learning: Optimization

Kevin Walchko

---

Notes on `torch` loss and optimization functions. With neural nets, we want to train the weights and biases so that the network gives good results. Generally you have a data set with known truth (labels) that the network needs to learn.

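```python
import torch
from torch import nn, optim

# toy stand-ins so the loop runs: swap in your own model and data
model = nn.Linear(4, 3)  # 4 features in, 3 class scores out
dataset = [(torch.randn(8, 4), torch.randint(0, 3, (8,))) for _ in range(10)]

criterion = nn.CrossEntropyLoss()  # any loss class from the table below
optimizer = optim.SGD(model.parameters(), lr=0.001)  # note: parameters() is a call
for sample, labels in dataset:
    optimizer.zero_grad() # clear the gradients left over from the previous step
    x = model(sample) # forward pass:
                      #   compute predicted outputs
    loss = criterion(x, labels)
    loss.backward()   # backward pass:
                      #   compute gradient of the loss w.r.t. model parameters
    optimizer.step()  # perform a single optimization step (parameter update)
```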

## Loss Functions

| Loss | `torch.nn` | Description     |
|------|------------|-----------------|
| Negative Log Likelihood | `NLLLoss` | Useful for training a classification problem with C classes. It expects log-probabilities as input, which are easily obtained by adding a `LogSoftmax` layer as the last layer of your network. You may use `CrossEntropyLoss` instead if you prefer not to add an extra layer. |
| Cross Entropy | `CrossEntropyLoss` | Useful for training a classification problem with C classes. It takes raw, unnormalized scores as input. Performance is generally better when the target contains class indices (rather than per-class probabilities), as this allows for optimized computation. |
| Mean Squared Error | `MSELoss` | Measures the mean squared error (squared L2 norm) between each element in the input x and target y (label). A good choice when comparing pixel quantities rather than class probabilities. |
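
The relationship between the first two rows can be checked numerically. The following is a minimal sketch, assuming small made-up tensors (`logits`, `target`, `pred`, and `truth` are illustrative names and values, not from any real data set). It compares `CrossEntropyLoss` on raw scores with `NLLLoss` on `LogSoftmax` output, and compares `MSELoss` with a hand-written mean of squared differences.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(5, 3)               # 5 samples, C = 3 classes (raw scores)
target = torch.tensor([0, 2, 1, 1, 0])   # one class index per sample

# CrossEntropyLoss applied directly to the raw scores
ce = nn.CrossEntropyLoss()(logits, target)

# NLLLoss needs log-probabilities, so apply LogSoftmax over the class axis first
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)

# MSELoss compared against the mean of element-wise squared differences
pred = torch.tensor([1.0, 2.0, 3.0])
truth = torch.tensor([1.0, 0.0, 5.0])
mse = nn.MSELoss()(pred, truth)
mse_by_hand = ((pred - truth) ** 2).mean()

print("cross entropy == nll(log_softmax):", torch.allclose(ce, nll))
print("MSELoss == mean squared difference:", torch.allclose(mse, mse_by_hand))
```

Both comparisons should agree up to floating-point tolerance, which is why the extra `LogSoftmax` layer is only needed when you choose `NLLLoss`.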

## Optimizers

| Optim | `torch.optim`  | Description    |
|-------|----------------|----------------|
| Stochastic Gradient Descent | `SGD` | Plain stochastic gradient descent is extremely basic and is seldom used on its own now. One problem is that a single global learning rate is applied equally to every parameter. Also, stochastic gradient descent generally has a hard time escaping saddle points. |
| Nesterov | `SGD(nesterov=True)` | Implements gradient descent with Nesterov momentum. In `torch.optim.SGD`, `nesterov=True` also requires a nonzero `momentum`. With tuning, this method is capable of performing as well as Adam. |
| Adam (Adaptive Moment Estimation) | `Adam` | Adam is a stochastic optimization method that adapts the learning rate of each parameter. It is efficient on large problems and has modest memory requirements. It is also very common and has generally replaced SGD. |
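
To see how the three table entries are constructed, here is a small sketch that runs each optimizer on a toy problem. The objective `(w - 3)^2`, the helper `minimize`, the learning rate of `0.1`, and `momentum=0.9` are all assumptions chosen for illustration, not recommended settings.

```python
import torch
from torch import optim

def minimize(make_optimizer, steps=200):
    """Minimize the toy objective f(w) = (w - 3)^2 starting from w = 0."""
    w = torch.zeros(1, requires_grad=True)
    optimizer = make_optimizer([w])
    for _ in range(steps):
        optimizer.zero_grad()            # clear gradients from the previous step
        loss = ((w - 3.0) ** 2).sum()    # scalar loss with its minimum at w = 3
        loss.backward()                  # compute d(loss)/dw
        optimizer.step()                 # update w
    return w.item()

results = {
    "SGD": minimize(lambda p: optim.SGD(p, lr=0.1)),
    # nesterov=True raises an error unless momentum is nonzero
    "Nesterov": minimize(lambda p: optim.SGD(p, lr=0.1, momentum=0.9, nesterov=True)),
    "Adam": minimize(lambda p: optim.Adam(p, lr=0.1)),
}
for name, w in results.items():
    print(f"{name}: w = {w:.4f}")
```

On a problem this simple all three should end up close to `w = 3`. The differences described in the table only start to matter on larger, less well-behaved loss surfaces.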