# latent.ipynb
The goal is to maximize the probability of the training data by sampling $z$ directly from a normal distribution and estimating $P(X|z;\theta)$ from those samples, in order to approximate  
$P(X) = \int P(X|z;\theta)P(z)dz \quad$   using  
$P(X|z_i; \theta) \approx \frac{1}{N} \sum_j \mathbf{1}(X_{ij}, X)$,  
where $X_{ij}$ is the $j$-th of $N$ samples generated from $z_i$, and $\mathbf{1}(X_{ij}, X) = 1$ if $X_{ij} = X$ and $\mathbf{1}(X_{ij}, X) = 0$ otherwise.
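
As a rough illustration of the integral above, the sketch below estimates $P(X)$ by averaging $P(X|z_i;\theta)$ over samples $z_i$ drawn from a standard normal. The model is a hypothetical 1-D toy, $P(X=1|z) = \mathrm{sigmoid}(z)$, chosen only because symmetry gives $P(X=1) = 0.5$ as a sanity check; `estimate_marginal` and `sigmoid` are illustrative names, not part of this notebook.

```python
import math
import random

def estimate_marginal(likelihood, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(X) = E_{z ~ N(0,1)}[P(X | z)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)   # z_i drawn from the prior P(z)
        total += likelihood(z)    # P(X | z_i; theta)
    return total / n_samples

def sigmoid(z):
    """Hypothetical toy likelihood: P(X = 1 | z) = sigmoid(z)."""
    return 1.0 / (1.0 + math.exp(-z))

p_x1 = estimate_marginal(sigmoid)
print(f"estimated P(X=1) = {p_x1:.3f}")
```

Because the $z_i$ are drawn from $P(z)$ itself, the average carries no extra $P(z_i)$ weight; that weight only appears when the $z_i$ come from a fixed grid.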

There are at least three ways to proceed:  
1) If $f(z, \theta)$ is deterministic, we take a grid of $M$ values of $z$ in the interval $[z_{min}, z_{max}]$, with spacing $\Delta z$. For each value $X_j$ in the training data we find the grid values $z_i$ such that $f(z_i, \theta) = X_j$. We then approximate $P(X_j) \approx \Delta z \sum_{i:\, f(z_i, \theta) = X_j} P(z_i)$. This is the same as taking each of the $M$ values $z_i$, computing $X_i = f(z_i, \theta)$, and updating $P(X_i) \leftarrow P(X_i) + \Delta z P(z_i)$. 
2) We initialize $P(X_j) = 0$ for all $X_j$ in the training set. We take a grid of M values of $z$ in the interval $[z_{min}, z_{max}]$. For each value $z_i$ in the grid we sample $X_i = f(z_i, \theta)$. Each time $X_i$ is equal to a value $X_j$ in the training set we increase $P(X_j)$ by an amount $P(X_j|z_i, \theta)P(z_i)$. At the end, we normalize all the $P(X_j)$ values so they add to 1.
3) Sample the $z_i$ stochastically from the normal distribution. Each time an $X_i$ is computed, $P(X_i)$ is increased by an amount $\epsilon$. After sampling, we normalize the $P(X_i)$ values so they add to 1. This may not be differentiable, since the counts do not vary smoothly with $\theta$.
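
A minimal sketch of the grid accumulation in option 1, under simplifying assumptions: the latent is 1-D rather than 2-D, and `f` is a hypothetical hand-written deterministic map (not the network used later) whose pattern probabilities are known in closed form. The helper names `normal_pdf` and `grid_marginals` are also illustrative.

```python
import math
from collections import defaultdict

def normal_pdf(z):
    """Standard normal density P(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def f(z):
    """Hypothetical deterministic map from a 1-D latent z to a binary pattern."""
    return (int(z > 0.0), int(abs(z) > 1.0))

def grid_marginals(f, z_min=-6.0, z_max=6.0, M=2001):
    """Accumulate P(X) <- P(X) + dz * P(z_i) over a grid of M z values."""
    dz = (z_max - z_min) / (M - 1)
    p_x = defaultdict(float)
    for i in range(M):
        z_i = z_min + i * dz
        p_x[f(z_i)] += dz * normal_pdf(z_i)
    return dict(p_x)

p_x = grid_marginals(f)
for pattern, prob in sorted(p_x.items()):
    print(pattern, round(prob, 4))
```

In this toy case the pattern `(1, 1)` corresponds to $z > 1$, so its accumulated mass should be close to the normal tail probability of about 0.159. Note that nothing here depends smoothly on parameters, which is why a differentiable variant needs $P(X_j|z_i, \theta)$ as in option 2.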

The training data consists of 10-element binary vectors, each with exactly two 1's at random locations. The latent space has dimension 2.  
We use a 3-layer feedforward perceptron to map from the $z$ to the $X$ values.
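
One way such training data could be generated is sketched below. The function name `make_two_hot`, the sample count, and the seed are assumptions for illustration; the notebook does not specify how the data is built.

```python
import torch

def make_two_hot(n_samples, n_bits=10, seed=0):
    """Return an (n_samples, n_bits) tensor of 0/1 rows with exactly two 1's each."""
    g = torch.Generator().manual_seed(seed)
    X = torch.zeros(n_samples, n_bits)
    for row in range(n_samples):
        # Pick two distinct positions uniformly at random.
        idx = torch.randperm(n_bits, generator=g)[:2]
        X[row, idx] = 1.0
    return X

X_train = make_two_hot(32)
print(X_train[:3])
```

There are only $\binom{10}{2} = 45$ distinct vectors of this kind, so a large sample will contain repeats.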

In [1]:
import torch
from torch import nn

In [None]:
class FFP(nn.Module):
    """ A feedforward perceptron. """
    def __init__(self, sizes, nltypes, bias=True):
        """
            sizes: list with size of each layer.
            nltypes: list with nonlinearity type for each inner or
                output layer. Entries are 'relu', 'sig', or 'tanh'.
            bias: whether the layers have a bias unit
        """
        assert len(sizes)-1 == len(nltypes), "length mismatch in nltypes, sizes"
        super(FFP, self).__init__()
        # Add activation functions
        self.nlfs = []
        for nltype in nltypes:
            if nltype == "relu":
                self.nlfs.append(nn.ReLU())
            elif nltype == "sig":
                self.nlfs.append(nn.Sigmoid())
            elif nltype == "tanh":
                self.nlfs.append(nn.Tanh())
            else:
                raise ValueError(f"unknown nonlinearity {nltype}")
        # create layers
        self.bias = bias
        self.sizes = sizes
        layers = []
        for lidx in range(1,len(sizes)):
            layers.append(nn.Linear(sizes[lidx-1], sizes[lidx], bias=bias))
        self.layers = nn.ModuleList(layers)
                
    def forward(self, x):
        for lidx, layer in enumerate(self.layers):
            x = self.nlfs[lidx](layer(x))
        return x

class standard_SGD():
    """ An SGD optimizer for my FFP module. """
    def __init__(self, model, lr=0.1):
        """
            model: an instance of the FFP class
            lr: learning rate
        """
        self.model = model
        self.lr = lr
        
    def step(self):
        """ Updates the model's parameters. """
        with torch.no_grad():
            for layer in self.model.layers:
                layer.weight -= self.lr * layer.weight.grad
                if self.model.bias:
                    layer.bias -= self.lr * layer.bias.grad
            
    def zero_grad(self):
        """ Resets the gradients; skips parameters that have no gradient yet. """
        for layer in self.model.layers:
            if layer.weight.grad is not None:
                layer.weight.grad.zero_()
            if self.model.bias and layer.bias.grad is not None:
                layer.bias.grad.zero_()

def loss(

In [2]:
# 1) Define the sample interval for z
# 2) for all z values
    # 2.1) Choose a z value
    # 2.2) Obtain the corresponding X value
    # 2.3) Update P(X) with P(z)
# 3) Test whether all X values have P(X) > 0
# 4) Change the parameters using the gradients for all P(X) functions


True
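
The four steps outlined in the comments above could be sketched as follows. This is only one possible reading, under several assumptions that are not in the notebook: the network outputs are treated as independent Bernoulli probabilities so that $P(X_j|z_i, \theta)$ is differentiable (as in option 2), `nn.Sequential` with a hidden width of 16 stands in for the `FFP` class, `torch.optim.SGD` replaces `standard_SGD`, and the three-vector training set, grid size, learning rate, and the helper `log_marginals` are all illustrative.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for FFP: 2-D latent -> 10 Bernoulli probabilities.
model = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 10), nn.Sigmoid())

# Toy training set: three 10-element vectors with exactly two 1's each.
X_train = torch.zeros(3, 10)
X_train[0, [0, 1]] = 1.0
X_train[1, [2, 5]] = 1.0
X_train[2, [7, 9]] = 1.0

# 1) Sample interval for z: an M x M grid on [z_min, z_max]^2.
z_min, z_max, M = -3.0, 3.0, 25
axis = torch.linspace(z_min, z_max, M)
dz = (z_max - z_min) / (M - 1)
z_grid = torch.cartesian_prod(axis, axis)                          # shape (M*M, 2)
p_z = torch.exp(-0.5 * (z_grid ** 2).sum(dim=1)) / (2 * math.pi)   # P(z_i)

def log_marginals(model, X, z_grid, p_z, dz):
    """log P(X_j) ~= log(dz^2 * sum_i P(X_j | z_i) P(z_i)) for each row X_j."""
    # 2) For all z values, obtain the corresponding output probabilities.
    probs = model(z_grid).clamp(1e-6, 1 - 1e-6)                    # (G, 10)
    # log P(X_j | z_i) under independent Bernoulli outputs: shape (N, G).
    log_lik = X @ torch.log(probs).T + (1 - X) @ torch.log(1 - probs).T
    log_joint = log_lik + torch.log(p_z) + 2 * math.log(dz)
    return torch.logsumexp(log_joint, dim=1)

opt = torch.optim.SGD(model.parameters(), lr=0.2)
losses = []
for step in range(200):
    opt.zero_grad()
    log_p = log_marginals(model, X_train, z_grid, p_z, dz)
    # 3) Check that every training vector has P(X) > 0 (finite log-probability).
    all_positive = bool(torch.all(torch.isfinite(log_p)))
    # 4) Change the parameters using the gradient of the negative log-likelihood.
    loss = -log_p.mean()
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Working with `logsumexp` rather than summing raw probabilities keeps the computation stable when $P(X_j|z_i, \theta)$ is very small at most grid points.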