# <font color = 'blue'>**Manual function calculations for loss and gradients in neural networks:**</font>

In [None]:
import torch
import torch.nn as nn

##  <font color = 'blue'>**Manual Gradient function calculation:**</font>
## $f(x,y) = \frac{x + \exp(y)}{\log(x) + (x-y)^3}$

In [None]:
def fxy(x,y):
    # numerator of f: x + exp(y)
    num = x + torch.exp(y)

    # denominator of f: log(x) + (x - y)^3
    den = torch.log(x) + (x - y)**3

    # Performing element-wise division of the numerator by the denominator
    return num/den

In [None]:
# Create example scalar tensors to evaluate the function and its gradients
x = torch.tensor(3.0, requires_grad = True)

# requires_grad=True tells autograd to track operations on the tensor,
# so that backward() can compute and store its gradient during backpropagation
y = torch.tensor(4.0, requires_grad = True)


In [None]:
# Call the function 'fxy' with the tensors 'x' and 'y' as arguments
# The result 'f' is also a tensor; because 'x' and 'y' have 'requires_grad=True', it carries a grad_fn that links it to the computation graph
f = fxy(x, y)
f

tensor(584.0868, grad_fn=<DivBackward0>)

In [None]:
# Perform backpropagation to compute the gradients of 'f' with respect to 'x' and 'y'
# Calling backward() on 'f' populates x.grad and y.grad
f.backward()


In [None]:
# Display the computed gradients of 'f' with respect to 'x' and 'y'
# These gradients are stored as attributes of x and y after the backward operation
# Print the gradients for x and y
print('x.grad =', x.grad)
print('y.grad =', y.grad)



x.grad = tensor(-19733.3965)
y.grad = tensor(18322.8477)
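
As a sanity check, the autograd values can be compared with partial derivatives worked out by hand. The sketch below assumes the quotient-rule results $\frac{\partial f}{\partial x} = \frac{D - N\left(\frac{1}{x} + 3(x-y)^2\right)}{D^2}$ and $\frac{\partial f}{\partial y} = \frac{e^y D + 3N(x-y)^2}{D^2}$, where $N$ and $D$ are the numerator and denominator of $f$. The names `dfdx` and `dfdy` are illustrative and not part of the original notebook.

```python
import math
import torch

def fxy(x, y):
    # Same function as above: (x + exp(y)) / (log(x) + (x - y)^3)
    return (x + torch.exp(y)) / (torch.log(x) + (x - y) ** 3)

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
fxy(x, y).backward()

# Hand-derived partial derivatives (quotient rule), evaluated with Python floats
xv, yv = 3.0, 4.0
num = xv + math.exp(yv)
den = math.log(xv) + (xv - yv) ** 3
dfdx = (den - num * (1.0 / xv + 3.0 * (xv - yv) ** 2)) / den ** 2
dfdy = (math.exp(yv) * den + 3.0 * num * (xv - yv) ** 2) / den ** 2

print('manual df/dx =', dfdx, '| autograd =', x.grad.item())
print('manual df/dy =', dfdy, '| autograd =', y.grad.item())
```

Small differences in the trailing digits are expected, since autograd works in `float32` here while the hand calculation uses double precision.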


## <font color = 'blue'>**Implementing the -log of the softmax function for classification probabilities:**</font>

We're going to break down the calculation of a classification probability using the softmax function, followed by the negative log, which gives us the loss for a model:

$$-\log\left(\frac{e^x}{e^x+e^y}\right)$$

So, what's happening here?

- First, the function calculates the softmax probability of the class whose logit is $x$, turning the two logits $x$ and $y$ into a probability.
- Next, we apply the natural logarithm to this probability.
- Finally, multiplying by `-1` gives us the negative log-likelihood, which is the **cross-entropy loss** when the class with logit $x$ is the true class.

This matters for classification problems because the loss measures how well (or how badly) the model predicts the true class: it approaches 0 as the predicted probability of the true class approaches 1, and it grows without bound as that probability approaches 0.


In [None]:
# Loss function definition
def log_exp(x,y):
    # defining the numerator
    num = torch.exp(x)

    # defining the denominator
    den = torch.exp(x)+torch.exp(y)

    # taking the log of the softmax probability num/den
    pos_log = torch.log(num/den)

    # don't forget to multiply by -1 to calculate loss
    return pos_log * (-1)

Test with normal inputs:

In [None]:
# Create tensors x and y with initial values 2.0 and 3.0, respectively
x, y = torch.tensor([2.0]), torch.tensor([3.0])

# Evaluate the function log_exp() for the given x and y, and store the output in z
z = log_exp(x, y)

# Display the computed value of z
z


tensor([1.3133])
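
As a cross-check, this value should agree with PyTorch's built-in cross-entropy when `[x, y]` is treated as a row of logits and class `0` (the class with logit `x`) is the target. The following is a minimal sketch under that assumption; `logits`, `target`, `manual`, and `builtin` are illustrative names.

```python
import torch
import torch.nn.functional as F

def log_exp(x, y):
    # Same loss as above: -log(exp(x) / (exp(x) + exp(y)))
    return -torch.log(torch.exp(x) / (torch.exp(x) + torch.exp(y)))

x, y = torch.tensor([2.0]), torch.tensor([3.0])

# One sample with two class logits: shape (1, 2)
logits = torch.stack([x, y], dim=1)
# The true class is index 0, i.e. the class whose logit is x
target = torch.tensor([0])

manual = log_exp(x, y)
builtin = F.cross_entropy(logits, target)
print('manual  =', manual)
print('builtin =', builtin)
```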

## <font color='blue'>**Function Explanation: Computing Gradients Using PyTorch Autograd**</font>

The function `grad` defined below takes two given tensors, `x` and `y`, and enables <font color='blue'>**gradient tracking**</font> for them by setting their `requires_grad` attribute to `True`. This is essential for PyTorch's automatic differentiation engine, <font color='blue'>**Autograd**</font>, to track operations on these tensors and compute gradients with respect to them.

The function then passes `x` and `y` through a user-defined `forward_func`, which computes a scalar output `z`. This forward function represents any differentiable operation or model that the user wishes to analyze.

After obtaining the output `z`, the function performs the <font color='blue'>**backward pass**</font> by calling `z.backward()`. This computes the <font color='blue'>**gradients**</font> of `z` with respect to all tensors that have `requires_grad=True` (in this case, `x` and `y`). The computed gradients are stored in the `.grad` attribute of each tensor.

The function then prints out the gradients `x.grad` and `y.grad` to display the results of the computation.

To prevent <font color='blue'>**gradient accumulation**</font> (the default behavior in PyTorch, where new gradients are added to any values already stored in the `.grad` attributes), the function resets the gradients of `x` and `y` back to zero using the `zero_()` method. This ensures that subsequent calls to the `grad` function start with fresh gradients, avoiding any unintended interference from previous computations.

**Key Points Demonstrated in the Function:**

- **Enabling Gradient Tracking:** Using `x.requires_grad_(True)` and `y.requires_grad_(True)` to allow <font color='blue'>**Autograd**</font> to monitor operations on `x` and `y`.

- **Forward Pass:** Passing the inputs through a customizable forward function `forward_func` to compute the output `z`.

- **Backward Pass:** Calling `z.backward()` to compute the <font color='blue'>**gradients**</font> of `z` with respect to `x` and `y`.

- **Accessing Gradients:** Printing `x.grad` and `y.grad` to access the computed gradients.

- **Preventing Gradient Accumulation:** Resetting the gradients using `x.grad.zero_()` and `y.grad.zero_()` to avoid accumulation in successive computations.

Together, these steps follow the usual <font color='blue'>**Autograd**</font> workflow: enable tracking, run the forward pass, call `backward()`, read `.grad`, and reset the gradients before the next computation. Note that `requires_grad_()` modifies the tensors in place, so `x` and `y` keep tracking gradients after `grad` returns.
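
To see why the reset matters, here is a small illustrative sketch of gradient accumulation. It uses a throwaway tensor `w` and the function $w^2$, neither of which appears elsewhere in this notebook.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 4
(w ** 2).sum().backward()
first = w.grad.clone()

# Second backward pass without resetting: the new gradient is added to the old one
(w ** 2).sum().backward()
accumulated = w.grad.clone()

# Reset in place, then recompute: the gradient is 4 again
w.grad.zero_()
(w ** 2).sum().backward()
after_reset = w.grad.clone()

print('first       =', first)
print('accumulated =', accumulated)
print('after reset =', after_reset)
```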

In [None]:
def grad(forward_func, x, y):

  # Enable gradient tracking for x and y by setting requires_grad in place
  x.requires_grad_(True)
  y.requires_grad_(True)

  # Evaluate the forward function to get the output 'z'
  z = forward_func(x, y)

  # Perform the backward pass to compute gradients
  z.backward()

  # Print the gradients for x and y
  print('x.grad =', x.grad)
  print('y.grad =', y.grad)

  # Reset the gradients for x and y to zero for the next iteration
  x.grad.zero_()
  y.grad.zero_()


Testing the function and output.

In [None]:
grad(log_exp, x, y)

x.grad = tensor([-0.7311])
y.grad = tensor([0.7311])


But now let's try some "hard" inputs

In [None]:
x, y = torch.tensor([50.0]), torch.tensor([100.0])

In [None]:
grad(log_exp, x, y)

x.grad = tensor([nan])
y.grad = tensor([nan])


In [None]:
torch.exp(torch.tensor([100.0]))

tensor([inf])

### <font color = 'blue'>Explanation of Numerical Overflow and the Log-Sum-Exp Trick</font>

The reason for these output values is a **numerical overflow**: $e^{100} \approx 2.688 \times 10^{43}$ is larger than the largest finite `float32` value (about $3.4 \times 10^{38}$), so `torch.exp` returns `inf`.  
That `inf` propagates through the division and the logarithm, and the backward pass then runs into indeterminate forms such as $\infty \cdot 0$, which evaluate to `nan`. To fix this issue, we need a more stable way to compute expressions that contain large exponentials.
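
The `float32` limit can be inspected directly with `torch.finfo`. This is a small sketch for illustration only; the variable names are not part of the original notebook.

```python
import torch

# Largest finite float32 value, roughly 3.4028e38
f32_max = torch.finfo(torch.float32).max
print('float32 max =', f32_max)

# exp(100) overflows in float32 but is representable in float64
overflow32 = torch.exp(torch.tensor(100.0, dtype=torch.float32))
ok64 = torch.exp(torch.tensor(100.0, dtype=torch.float64))
print('float32 exp(100) =', overflow32)
print('float64 exp(100) =', ok64)

# Largest input for which exp() stays finite in float32: log(f32_max), about 88.7
threshold = torch.log(torch.tensor(f32_max, dtype=torch.float64))
print('overflow threshold =', threshold)
```

Switching to `float64` only moves the overflow point further out, so it does not replace a numerically stable formulation.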

We can do this with standard **logarithm identities**.

- We know that $\log\left(\frac{\exp(x)}{\exp(x) + \exp(y)}\right) = \log(\exp(x)) - \log(\exp(x) + \exp(y))$
- Also, $\log(\exp(x)) = x$
- So the expression becomes $x - \log(\exp(x) + \exp(y))$
- Now we can use the log-sum-exp trick: $\log(\exp(x) + \exp(y)) = a + \log(\exp(x - a) + \exp(y - a))$, where $a = \max(x, y)$
- The final form is:  
  $\log\left(\frac{\exp(x)}{\exp(x) + \exp(y)}\right) = x - \left[ a + \log(\exp(x - a) + \exp(y - a)) \right]$
- The loss is the negative of this expression: $-\log\left(\frac{\exp(x)}{\exp(x) + \exp(y)}\right) = a + \log(\exp(x - a) + \exp(y - a)) - x$

<font color='blue'>**Key point**:</font> Subtracting $a = \max(x, y)$ makes the largest exponent $0$, so every exponential lies in $(0, 1]$ and cannot overflow.
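
The identity can be checked numerically against PyTorch's built-in `torch.logsumexp`. This is a minimal sketch using the "hard" inputs from above; `manual`, `builtin`, and `naive` are illustrative names.

```python
import torch

x, y = torch.tensor([50.0]), torch.tensor([100.0])

# Shifted form: a + log(exp(x - a) + exp(y - a)) with a = max(x, y)
a = torch.max(x, y)
manual = a + torch.log(torch.exp(x - a) + torch.exp(y - a))

# Built-in log-sum-exp over the two stacked values
builtin = torch.logsumexp(torch.stack([x, y]), dim=0)

# Direct evaluation overflows because exp(100) is inf in float32
naive = torch.log(torch.exp(x) + torch.exp(y))

print('manual  =', manual)
print('builtin =', builtin)
print('naive   =', naive)
```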


                                                       

In [None]:
def stable_log_exp(x, y):
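    # a = max(x, y), computed element-wise
    a = torch.max(x,y)

    # compute exp(x - a) and exp(y - a); both exponents are <= 0, so neither can overflow
    exp_x = torch.exp(x-a)

    exp_y = torch.exp(y-a)

    # log of the softmax denominator via the log-sum-exp trick
    log_denom = a + torch.log(exp_x + exp_y)

    # the log-probability is x - log_denom; multiply by -1 for the loss
    return (-1)*(x-log_denom)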


With the "hard" inputs, the original calculation overflows and returns `inf`:

In [None]:
log_exp(x, y)

tensor([inf], grad_fn=<MulBackward0>)

Here is the stable calculation, which returns a finite loss:

In [None]:
stable_log_exp(x, y)

tensor([50.], grad_fn=<MulBackward0>)

Calling the gradient function to calculate the gradients using the stable calculation.

In [None]:
grad(stable_log_exp, x, y)

x.grad = tensor([-1.])
y.grad = tensor([1.])
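
These gradients match what the algebra predicts. Since $-\log\left(\frac{e^x}{e^x + e^y}\right) = \log\left(1 + e^{y - x}\right)$, the loss is the softplus of $y - x$, and its derivatives are $-\sigma(y - x)$ with respect to $x$ and $\sigma(y - x)$ with respect to $y$, where $\sigma$ is the sigmoid. The sketch below checks this with `torch.nn.functional.softplus`. It is offered as an alternative cross-check, not as the method used in this notebook.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([50.0], requires_grad=True)
y = torch.tensor([100.0], requires_grad=True)

# softplus(y - x) = log(1 + exp(y - x)), evaluated in a numerically stable way
loss = F.softplus(y - x)
loss.backward()

# Analytical gradient magnitude: sigmoid(y - x)
expected = torch.sigmoid(y.detach() - x.detach())

print('loss     =', loss)
print('x.grad   =', x.grad)
print('y.grad   =', y.grad)
print('expected =', expected)
```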
