In [1]:
# Matplotlib figures generated in the notebook inline with the rest of the script
%matplotlib inline

import numpy as np # Import NumPy for arrays and basic linear algebra
from collections import namedtuple
import matplotlib.pyplot as plt # Import PyPlot for basic plotting.
import matplotlib as mpl # Import Matplotlib for advanced plotting.
from mpl_toolkits.mplot3d import Axes3D # Import Matplotlib 3D Plots

# from sknrf.model.settings import Settings
# from sknrf.devices.signal import *
# plt.style.use('bmh')

# Converting Numpy Array to PyTorch Tensors

PyTorch is a Python wrapper around the Torch C++ library that performs the following tasks:

1. Defines **torch.Tensor**, a multi-dimensional array that can be moved between the CPU and any GPU(s) that support CUDA.
2. Defines **torch.autograd.Function**, which records a graph of all operations applied to one or more tensors and then performs automatic differentiation (backward propagation) to determine the sensitivity of the output tensors to the input tensors.
3. Defines **torch.nn**, which allows you to design a neural network that accepts input tensors, chains multiple operations over hidden layers, and outputs tensors that are tuned using an optimization strategy.
4. Defines **torch.optim**, which supports various optimization algorithms based on the first-order gradient (scaled by the learning rate), optionally combined with a momentum term that behaves somewhat like a second-order method without computing the Hessian.

In a **neural network**, a **tensor** represents the state of an individual net, while a **function** performs math operations that connect the state of one net to another. At the output, an **optimization loss function** calculates the difference between the predicted output and the expected output, and this difference is used to modify the **tensors** in the hidden layers of the neural network for the following iteration.

A simplified class diagram of these components is presented below:
![alt text](./images/PNG/PyTorch_Structure.png "PyTorch Structure")

## torch.Tensor

### Data Types
#### Numpy Array
np.array(data, dtype=None) → ndarray  # Create a new array
ndarray.tolist() → list  # Convert NumPy array to Python list

#### PyTorch Tensor
torch.tensor(data, dtype=None, device=None, requires_grad=False) → Tensor  # Create a new tensor
tensor.item()  # Convert a one-element tensor to a Python scalar
tensor.tolist()  # Convert tensor to a Python list
tensor.numpy() → numpy.ndarray # Convert tensor to numpy array
torch.from_numpy(ndarray) → Tensor # Convert numpy array to tensor without copying data
torch.as_tensor(data, dtype=None, device=None)   # Convert any data (list, ndarray, tensor) to a tensor
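
The conversions listed above can be sketched as follows. This is a minimal illustration, not part of the original notebook; the variable names are made up for the example. It highlights which calls copy the data and which share memory with the NumPy array.

```python
import numpy as np
import torch

a = np.array([1.0, 2.0, 3.0])      # float64 NumPy array

t_copy = torch.tensor(a)           # always copies the data
t_view = torch.from_numpy(a)       # shares memory with `a`
t_as = torch.as_tensor(a)          # shares memory here (same dtype, CPU)

a[0] = 10.0                        # mutate the NumPy array in place
print(t_copy[0].item())            # the copy is unaffected
print(t_view[0].item())            # the shared tensors see the change

back = t_view.numpy()              # CPU tensor -> ndarray, no copy
scalar = t_view.sum().item()       # one-element tensor -> Python float
as_list = t_copy.tolist()          # tensor -> Python list
```

Because `torch.from_numpy` and `torch.as_tensor` avoid a copy, an in-place change on either side is visible on the other; use `torch.tensor` when an independent copy is wanted.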

#### Existing Envelope Signal
signal.EnvelopeSignal(array, indep_map=IndepDict()) → Tensor  # Create a new signal

#### Proposed Envelope Signal
sig.esignal(data, dtype=None, device=None, requires_grad=True)

Q. Does autograd allow us to implicitly track independent variables (using the graph)?
A. Signals are mainly altered using the following operations (this is the purpose of indep_map):
  - torch.index_select(input, dim, index, out=None) → Tensor, where index is a 1-D tensor of indices to extract along dim
  - torch.masked_select(input, mask, out=None) → Tensor, where mask is a boolean tensor with the same shape as input (or broadcastable to it)
  - torch.narrow(input, dim, start, length) → Tensor, where start and length behave like a slice with step = 1 and stop = start + length
  - torch.take(input, index) → Tensor, where index addresses input as if it were a flattened 1-D tensor
  
  **As long as these operations store a grad_fn in the graph, indep_map should not be needed.**
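
A quick way to check the claim above is to apply each selection operation to a tensor that requires gradients and inspect the result's `grad_fn`. The sketch below is illustrative only; the tensor `x` and the dictionary `selected` are names invented for the example, and it does not by itself show that independent variables such as freq and time can be recovered.

```python
import torch

# Leaf tensor that participates in the autograd graph
x = torch.arange(6.0).reshape(2, 3).requires_grad_()

selected = {
    "index_select": torch.index_select(x, 1, torch.tensor([0, 2])),
    "masked_select": torch.masked_select(x, x > 2),
    "narrow": torch.narrow(x, 1, 1, 2),
    "take": torch.take(x, torch.tensor([0, 5])),
}

# Each result carries a grad_fn node that links it back to x
for name, out in selected.items():
    print(name, type(out.grad_fn).__name__)

# Backpropagating through one selection marks which elements were picked
selected["take"].sum().backward()
print(x.grad)
```

The gradient of the `take` result is non-zero only at the selected positions, which is the kind of bookkeeping that indep_map performs explicitly.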

#### Exercise: Prove that we don't need indep_map:

1. Create freq, time tensors
2. Create a v1 tensor that implicitly stores a reference to freq, time tensors.
3. Prove that freq and time can be accessed from v1.

### Serialization

The save/load functions are similar to those for arrays, but you must also specify the hardware device where the tensor is stored:

- torch.save(obj, f)
- torch.load(f, **map_location=None**)

where **map_location** is a function, torch.device, string or a dict specifying how to remap storage locations.
A safe way to load a saved state on a different hardware configuration is to perform a two-step load:
1. torch.load(.., map_location='cpu')
2. load_state_dict()

This allows you to load the state to the CPU and then move all of the tensors to the available GPUs after. It is considered to be a more stable way to save and load the state of your application.
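
A minimal sketch of the two-step load is shown below. It assumes a small `nn.Linear` module as a stand-in for the application state and uses an in-memory `io.BytesIO` buffer instead of a file on disk; both are choices made for this example only.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(3, 2)                # stand-in for the real model
buffer = io.BytesIO()                  # in-memory substitute for a file
torch.save(model.state_dict(), buffer)

# Step 1: load every tensor onto the CPU, wherever it was saved from
buffer.seek(0)
state = torch.load(buffer, map_location="cpu")

# Step 2: copy the loaded state into a freshly constructed module
restored = nn.Linear(3, 2)
restored.load_state_dict(state)

# Afterwards, move the module to whatever device is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
restored.to(device)
```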

#### Exercise: Save and Load a tensor


### Limits
torch.clamp(input, min, max, out=None) → Tensor provides a good way to maintain safe operation

### Tolerances
torch.trunc(input, out=None) → Tensor, truncates floating point values toward zero
torch.allclose(input, other, rtol=1e-05, atol=1e-08, equal_nan=False) → bool, determines if tensors are almost equal
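
The three calls above can be tried on a small tensor. The values and limits below are arbitrary and chosen only for illustration.

```python
import torch

x = torch.tensor([-2.5, -0.4, 0.4, 2.5], dtype=torch.float64)

limited = torch.clamp(x, min=-1.0, max=1.0)   # values outside [-1, 1] are pinned
truncated = torch.trunc(x)                     # fractional part is dropped

# allclose passes when |input - other| <= atol + rtol * |other| element-wise
close_small = torch.allclose(x, x + 1e-9)      # difference below tolerance
close_large = torch.allclose(x, x + 1e-3)      # difference above tolerance
print(limited, truncated, close_small, close_large)
```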

#### Exercise: Apply limits and check tolerances


## torch.autograd.Function

### Forwards Propagation

In a neural network, a **function** is a mathematical operation that connects the values of a set of **input tensors** to the values of a set of **output tensors**. Therefore the forward propagation should perform the following tasks:

1. Calculate $outputs = f(inputs)$

After calculating the final output of the neural network, the prediction error is calculated using a loss function, and the derivative ($\frac{d_{loss}}{d_{input}}$) is used to update the value of each input tensor. The **chain rule** of differentiation allows us to connect the derivative $f' = \frac{df}{dg}$ of a function $f$ to the derivative $g' = \frac{dg}{dx}$ as follows:

$$\begin{aligned}
(f\circ g)'&=(f'\circ g)\cdot g' \\
(f(g(x)))'&=f'(g(x))\cdot g'(x) \\
\frac{df}{dx} &= \frac{df}{dg} \cdot \frac{dg}{dx} \\
\frac{df}{dx}\Big|_{x=c} &= \frac{df}{dg}\Big|_{g(c)} \cdot \frac{dg}{dx}\Big|_{x=c}
\end{aligned}$$

Each equation states the same rule; the additional notation in the later equations describes the added complexity of applying the chain rule (automatic differentiation) to derivatives that are evaluated numerically at a point rather than kept as analytical expressions. For example, the numerical derivative $\frac{df}{dg}\Big|_{g(c)}$ is only valid around the given point $g(c)$. This means that the derivative $\frac{df}{dx}\Big|_{x=c}$ depends on the derivatives of later function evaluations, which are unknown during forward propagation.
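
The last form of the chain rule can be checked numerically with autograd. The sketch below assumes $g(x) = x^2$ and $f(g) = \sin(g)$ evaluated at $c = 1.5$; these functions are picked only for illustration.

```python
import torch

c = torch.tensor(1.5, dtype=torch.float64, requires_grad=True)
g = c ** 2                # inner function g(x) = x^2
f = torch.sin(g)          # outer function f(g) = sin(g)
f.backward()              # autograd applies the chain rule at x = c

df_dg = torch.cos(g.detach())   # df/dg evaluated at g(c)
dg_dx = 2 * c.detach()          # dg/dx evaluated at x = c
manual = df_dg * dg_dx          # product of the two point-wise derivatives

print(c.grad.item(), manual.item())
```

Both numbers agree, which shows that each factor only needs to be known at the point where it is evaluated.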

The Function.forward method controls forward propagation as follows:

```python
@staticmethod
def forward(ctx, *args, **kwargs):
```

where ctx is a **context object** that saves computations from the forward propagation so that they can be recalled when computing derivatives in the backward propagation. Thus the forward propagation must perform the following tasks:

1. Calculate $outputs = f(inputs)$
2. Identify which input/output tensors (nets in the neural network) require derivatives.
3. Record in **ctx** the intermediate computation results that will be recalled during backward propagation.

### Backwards Propagation

The backward propagation method defined below calculates all of the partial derivatives that describe changes in the output as a function of changes in inputs and the values of hidden layers. 
```python
@staticmethod
def backward(ctx, *grad_outputs):
```
It is called back propagation because the chain rule shows that these **numerical derivatives** depend on the current value of the output tensors around which the grad_outputs are calculated. For example, $\frac{df}{dx}\Big|_{x=c}$ is a function of this Function's analytical derivative $\frac{dg}{dx}\Big|_{x=c}$ (evaluated at $x=c$), but also of the **grad_output** of the next function, $\frac{df}{dg}\Big|_{g(c)}$.

#### Example torch.autograd.Function

Let's define an autograd.Function for the following operation:

$$\begin{aligned}
f(x) &= e^x \\
\frac{df}{dx} &= e^x
\end{aligned}$$

```python
from torch.autograd import Function


class Exp(Function):

    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result
```

Even though this is a very simple derivative, we save the value of $e^x$ in **ctx** so that we can reuse it in the backward method (when the input x is no longer provided). We also note that Exp does not need to know what the next function in the chain is; it only needs the numerical value of that function's gradient (**grad_output**). The **chain rule** shows that the computation of numerical derivatives can be split into separate functions, each implemented in the **backward** method of a torch.autograd.Function.
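
A custom Function is invoked through its `apply` method rather than by calling `forward` directly. The sketch below repeats the Exp class so that it runs on its own, then compares its gradient against `torch.exp` and against the finite-difference check provided by `torch.autograd.gradcheck`. The input values are arbitrary.

```python
import torch
from torch.autograd import Function, gradcheck


class Exp(Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)    # keep e^x for the backward pass
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result      # chain rule: upstream grad times e^x


# gradcheck expects double-precision inputs that require gradients
x = torch.tensor([0.0, 1.0, 2.0], dtype=torch.float64, requires_grad=True)

y = Exp.apply(x)
y.sum().backward()
print(x.grad)                            # should match exp(x)

ok = gradcheck(Exp.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```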


#### Extension of Chain Rule to Higher-Order Derivatives

Higher-order derivatives are useful for nonlinear problems. Faà di Bruno's formula extends the chain rule to higher-order derivatives. 

$${d^n \over dx^n} f(g(x))=\sum \frac{n!}{m_1!\,1!^{m_1}\,m_2!\,2!^{m_2}\,\cdots\,m_n!\,n!^{m_n}}\cdot f^{(m_1+\cdots+m_n)}(g(x))\cdot \prod_{j=1}^n\left(g^{(j)}(x)\right)^{m_j}$$

Several examples of higher-order derivatives are explicitly provided below:

$$\begin{aligned}
\frac{dy}{dx} & = \frac{dy}{du} \frac{du}{dx} \\
\frac{d^2 y }{d x^2} & = \frac{d^2 y}{d u^2} \left(\frac{du}{dx}\right)^2
    + \frac{dy}{du} \frac{d^2 u}{dx^2} \\
\frac{d^3 y }{d x^3} & = \frac{d^3 y}{d u^3} \left(\frac{du}{dx}\right)^3
    + 3 \, \frac{d^2 y}{d u^2} \frac{du}{dx} \frac{d^2 u}{d x^2}
    + \frac{dy}{du} \frac{d^3 u}{d x^3} \\
\frac{d^4 y}{d x^4} & =\frac{d^4 y}{du^4} \left(\frac{du}{dx}\right)^4
    + 6 \, \frac{d^3 y}{d u^3} \left(\frac{du}{dx}\right)^2 \frac{d^2 u}{d x^2}
    + \frac{d^2 y}{d u^2} \left( 4 \, \frac{du}{dx} \frac{d^3 u}{dx^3}
    + 3 \, \left(\frac{d^2 u}{dx^2}\right)^2\right)
    + \frac{dy}{du} \frac{d^4 u}{dx^4}.
\end{aligned}$$

An important observation is that the higher order derivatives are computed using lower order derivatives. This means that no new methods need to be defined in each Function in order to calculate higher-order derivatives.
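
In PyTorch this observation corresponds to differentiating the gradient itself: passing `create_graph=True` to `torch.autograd.grad` records the first derivative in the graph so that it can be differentiated again. The sketch below assumes $u = x^3$ and $y = \sin(u)$, chosen only for illustration, and compares the result with the second-order formula listed above.

```python
import torch

x = torch.tensor(0.7, dtype=torch.float64, requires_grad=True)
u = x ** 3
y = torch.sin(u)

# First derivative, kept in the graph so it can be differentiated again
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)
# Second derivative obtained by differentiating the first
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)

# d2y/dx2 = y''(u) * (u')^2 + y'(u) * u''
xv = x.detach()
uv = xv ** 3
expected = -torch.sin(uv) * (3 * xv ** 2) ** 2 + torch.cos(uv) * (6 * xv)

print(d2y_dx2.item(), expected.item())
```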

####  Extension of Chain Rule to Inverse Function Derivatives

Suppose that $y = g(x)$ has an inverse function. Call its inverse function $f$ so that we have $x = f(y)$. If both **$g(x)$ and $f(y)$ are differentiable**, the derivative of the inverse function $f$ can be solved in terms of $g'$:

$$\begin{aligned}
f'(y) &= \frac{1}{g'(f(y))} \\
f' &= \frac{1}{g'\circ f} \\
f' &= inv \circ g'\circ f
\end{aligned}$$

While the Function definition must explicitly define an **inverse** function $f$, no new methods need to be defined to calculate the inverse derivative $f'$. Therefore, to formally support inverse functions and their derivatives, we must define the following properties in each Function:

* inverse_func - may be None, because some functions do not have an inverse
* the inverse derivative is available only if (self.inverse_func is not None and self.derivative_exists and self.inverse_func.derivative_exists)

The inverse derivative will define $\frac{d_{in}}{d_{out}}$, which simply describes feedback in the system.
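
The relation $f'(y) = 1 / g'(f(y))$ can be verified with existing autograd operations. The sketch below assumes $g(x) = e^x$ with inverse $f(y) = \ln(y)$; the pair is chosen only because both directions are already available in PyTorch, and the properties inverse_func and derivative_exists proposed above are not part of this example.

```python
import torch

y = torch.tensor(3.0, dtype=torch.float64, requires_grad=True)

# Derivative of the inverse function f(y) = log(y), taken directly
x = torch.log(y)
(f_prime,) = torch.autograd.grad(x, y)

# Derivative of the forward function g(x) = exp(x), evaluated at x = f(y)
x_in = x.detach().requires_grad_()
(g_prime,) = torch.autograd.grad(torch.exp(x_in), x_in)

print(f_prime.item(), (1.0 / g_prime).item())
```

Both values equal $1/y$, so the inverse derivative follows from the forward derivative without a dedicated method.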

#### Extension of Chain Rule to Higher Dimensions

A first-order derivative of a multidimensional vector is a Jacobian. Thus the **chain rule** can be reformulated as follows:

$$\begin{aligned}
(f\circ g)'&=(f'\circ g)\cdot g' \\
\nabla (f\circ g) &= J_g^\mathsf{T} \cdot (\nabla f \circ g)
\end{aligned}$$

where $J_g$ denotes the Jacobian matrix of function $g$:

$$\mathbf J = \begin{bmatrix}
    \dfrac{\partial \mathbf{g}}{\partial x_1} & \cdots & \dfrac{\partial \mathbf{g}}{\partial x_n} \end{bmatrix}
= \begin{bmatrix}
    \dfrac{\partial g_1}{\partial x_1} & \cdots & \dfrac{\partial g_1}{\partial x_n}\\
    \vdots & \ddots & \vdots\\
    \dfrac{\partial g_m}{\partial x_1} & \cdots & \dfrac{\partial g_m}{\partial x_n} \end{bmatrix}$$

or, component-wise:

$$\mathbf J_{ij} = \frac{\partial g_i}{\partial x_j}.$$

This implies that extending the **chain rule** to high-dimensional problems is accomplished by performing matrix operations of dimension (i, j) rather than scalar operations. This is already taken care of by NumPy's vectorized math functions.
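
The vector form of the chain rule can be checked with `torch.autograd.functional.jacobian`. The functions `g` (three inputs, two outputs) and `f` (two inputs, scalar output) below are hypothetical and chosen only to keep the matrices small.

```python
import torch
from torch.autograd.functional import jacobian


def g(x):
    # g: R^3 -> R^2
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])


def f(u):
    # f: R^2 -> R
    return (u ** 2).sum()


x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)

J_g = jacobian(g, x)                      # shape (2, 3)
grad_f = jacobian(f, g(x))                # gradient of f at g(x), shape (2,)
chain = J_g.T @ grad_f                    # J_g^T (grad f o g), shape (3,)
direct = jacobian(lambda v: f(g(v)), x)   # gradient of the composition

print(chain, direct)
```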

### Proposal
In theory we could extend the autograd.Function to support inverse derivative computational graphs. Some problems with this idea:

* Only One-to-One functions have an inverse.
* Functions with multiple inputs do not have an inverse.
* Some inverse functions may not have defined derivatives.

This suggests that an inverse derivative computational graph could have broken links. An inverse autograd extension of the existing autograd library would not deal well with broken links in the graph because each autograd Function is a sandboxed class. 

## torch.nn.Module

Describing relationships between operations is best left to how we define the structure of a neural network. For example, the forward/inverse propagation described above could be structured as a bidirectional RNN (see **torch.nn.GRU**). The **torch.nn.Module** allows us to describe problems using a custom combination of **autograd.Function**s and sub-**nn.Module**s.

![alt text](./images/PNG/rnn-bidirectional.png "Bi-directional RNN")

In this diagram, the input signal ($s_0$) could propagate through a series of forward transforms ($A$), while the inverse signal ($s'_0$) could propagate through the inverse transforms ($A'$). In this case, the $X_i$ terms represent known/unknown impedance stimuli, while the $Y_i$ terms represent predicted currents. This could suggest that internal nodes, which are not measurable, could be solved by training the network. 

A custom module can be defined as follows:

```python
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))
```

Notice that the **nn.Module** is a container of sub-Modules that can be organized in **nn.ModuleList**, or **nn.ModuleDict**. All Modules and sub-Modules inherit from **nn.Module**, which itself is a container that has the following notable methods.

```python
def buffers(self, recurse=True):
        r"""Returns an iterator over module buffers. Tensors that are not part of the neural network"""
        
def parameters(self, recurse=True):
        r"""Returns an iterator over module parameters. Wrapped tensors that are part of the neural network"""
        
def modules(self):
        r"""Returns an iterator over all modules in the network."""
        
def add_module(self, name, module):
        r"""Adds a child module to the current module. Hence each module is a container"""
        
def state_dict(self, destination=None, prefix='', keep_vars=False):
        r"""Returns a dictionary containing a whole state of the module."""
        
def load_state_dict(self, state_dict, strict=True):
        r"""Copies parameters and buffers from :attr:`state_dict` into this module and its descendants"""
        
def apply(self, fn):
        r"""Applies ``fn`` recursively to every submodule"""
        
def train(self, mode=True):
        r"""Sets the module in training mode."""
        
def eval(self):
        r"""Sets the module in evaluation mode."""
        
def zero_grad(self):
        r"""Sets gradients of all model parameters to zero."""
```
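
Several of these container methods can be exercised on a small module. The sketch below reuses the two-convolution model from above and adds a buffer named `scale`, which is a hypothetical addition made only to show how buffers differ from parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)
        # Hypothetical buffer: stored with the module but not trained
        self.register_buffer("scale", torch.ones(1))

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x)) * self.scale


model = Model()
out = model(torch.randn(1, 1, 28, 28))    # each 5x5 convolution trims 4 pixels

param_names = [name for name, _ in model.named_parameters()]
buffer_names = [name for name, _ in model.named_buffers()]
child_names = [name for name, _ in model.named_children()]
state_keys = list(model.state_dict().keys())

model.eval()                              # switch to evaluation mode
print(out.shape, param_names, buffer_names, child_names)
```

The state dictionary contains both the parameters and the buffer, which is why `load_state_dict` restores tensors that the optimizer never touches.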

Perhaps it would be a good idea to organize transforms in a **torch.nn.Sequential**.
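
One possible shape for this idea is sketched below. The `Gain` and `Offset` modules are hypothetical stand-ins for real transforms and are not part of any existing library; the point is only that `nn.Sequential` chains the modules in order and collects their parameters.

```python
import torch
import torch.nn as nn


class Gain(nn.Module):
    """Hypothetical transform: multiply the signal by a learnable gain."""

    def __init__(self, gain):
        super().__init__()
        self.gain = nn.Parameter(torch.tensor(float(gain)))

    def forward(self, v):
        return self.gain * v


class Offset(nn.Module):
    """Hypothetical transform: add a learnable offset to the signal."""

    def __init__(self, offset):
        super().__init__()
        self.offset = nn.Parameter(torch.tensor(float(offset)))

    def forward(self, v):
        return v + self.offset


# Transforms are applied in the order in which they are listed
transforms = nn.Sequential(Gain(2.0), Offset(-1.0))
v_out = transforms(torch.tensor([1.0, 2.0]))    # computes 2 * v - 1

n_params = len(list(transforms.parameters()))   # one parameter per transform
print(v_out, n_params)
```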

## torch.optim

An optimizer takes the tunable **parameters** of an **nn.Module** as inputs and, using the gradients of the loss function, provides the next iteration through an **Optimizer.step()** method that is specific to each optimization algorithm. The lr (learning rate) controls the first-order step, while momentum adds a velocity term that behaves somewhat like a second-order step. See an example below.

```python
import torch.optim as optim

# model, dataset and loss_fn are assumed to be defined elsewhere
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
```

Note that momentum is similar to a 2nd-order solver, but it is not as powerful because it does not include a Hessian matrix. The Hessian accounts for mixed partial derivatives, while the momentum factor just looks at second-order behavior with respect to each variable. The learning rate uses the gradient to change the position of the weight, whereas the momentum changes the velocity of the weight. By not using a Hessian matrix, we do not need to change the behavior of autograd.
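
The position/velocity picture can be made concrete by reproducing the update by hand. The sketch below assumes the loss $w^2$ and the update rule documented for `torch.optim.SGD` with default dampening (velocity = momentum × velocity + gradient, then weight = weight − lr × velocity), and compares a manual loop against the optimizer. The starting value and hyperparameters are arbitrary.

```python
import torch
import torch.optim as optim

lr, momentum = 0.1, 0.9

# Weight updated by the library optimizer
w = torch.tensor([1.0], dtype=torch.float64, requires_grad=True)
optimizer = optim.SGD([w], lr=lr, momentum=momentum)

# Same weight updated by hand
w_manual, velocity = 1.0, 0.0

for _ in range(3):
    optimizer.zero_grad()
    loss = (w ** 2).sum()
    loss.backward()
    optimizer.step()

    grad = 2.0 * w_manual                       # d(w^2)/dw
    velocity = momentum * velocity + grad       # momentum updates the velocity
    w_manual = w_manual - lr * velocity         # learning rate moves the position

print(w.item(), w_manual)
```

Only first-order gradients are used in both versions, which is consistent with the remark that no Hessian is required.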

## Proposed Work


1. TypeCast(function) -> TypeCast(autograd.Function)
    a. forward(signal, time, freq)
    b. backward(signal, time, freq)
    - This looks very complicated. I'm unable to foresee how a complicated data type could be transmitted between CPU/GPU/FPGA without things going wrong. This data type needs to move between vastly different hardware architectures.
    - A workaround is to store the type information in an Info() object inside AbstractModels, which only exist on the CPU
    - Another workaround is to use capital letters for fs signals, lowercase for es signals (V1 vs v1)
    - Another workaround is to store information in the variable name: v1_ft, V1_ff
    - Decision: Store type information in Info, use _ft, _ff, etc. when there is a need to describe the domain
    
2. DevicesModel(QObject) -> ErrorModel(nn.Module)
    - devices = no change
    - add_module(transforms = nn.Sequential())
    - add_buffer(uncorrected_signals)
    - add_parameter(corrected_signals, sweeps, time, freq)
The ErrorModel is responsible for assigning signals to devices and to the database

3. Measure(AbstractModel) -> Measure(AbstractModel)
    self.data = open(database)
    self.opt = optim.Optimizer()
    self.loss_func = torch.nn.functional
    for epoch in range(epochs):
        for b_in, a_in, g_in in self.data:
            b_out, a_out, g_out = self.model(xb)
            loss = self.loss_func(b_out, b_goal) + self.loss_func(g_out, g_goal)
            
            loss.backward()
            self.opt.step()
            self.opt.zero_grad()
    
4. Transforms(AbstractModel) -> Transforms(nn.Module)
    a. expected_type (esignal, fsignal, fssignal, etc)
    b. forward(v, i, z) or forward(b, a, g)
    c. inverse(v, i, z) or inverse(b, a, g)
    
5. Minimize(AbstractModel) -> optim.Optimizer
    a). Do nothing.
    
    
    
    

Measure
 - convert sweep_name: SweepPlan to sweep_name: Tensors
 - Register tensors in Error Model (current_measurement)
 - Register tensors in Database (entire sweep)
 