# Discrete Flows 

In this notebook I explain the very basics of discrete flows and demonstrate how to use them. Discrete flows were introduced in:

*Tran, Dustin, et al. "Discrete flows: Invertible generative models of discrete data." Advances in Neural Information Processing Systems. 2019.*

In [1]:
# Make sure that everything necessary is installed
# !pip install tensorflow tensorflow_probability
# !pip install "git+https://github.com/google/edward2.git#egg=edward2"

In [2]:
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
import edward2 as ed

In [3]:
# tested with versions
tf.__version__, tfp.__version__

('2.1.0', '0.9.0')

## Operations on One-hot encoded vectors

Let's look at arithmetic on one-hot encoded vectors.

In [4]:
x = tf.constant([0., 0., 0., 1., 0.]) # x=3 (0-based index of the one)

In [5]:
mu = tf.constant([0., 0., 1., 0., 0.]) # shift = 2

I round the output tensor (x_transformed) because the one_hot_add implementation in Edward2 is numerically noisy. In practice it's actually better to use one_hot_minus!

In [6]:
x_transformed = ed.layers.utils.one_hot_add(x, mu) # x_transformed = x + mu
np.round(x_transformed, 4) # (3+2) mod 5 == 0

array([1., 0., 0., 0., 0.], dtype=float32)
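To make the arithmetic explicit, here is a minimal NumPy sketch of the same idea: adding two one-hot vectors modulo K is a circular convolution, and subtracting undoes it. The helper names `one_hot_add_np` and `one_hot_minus_np` are my own (they are not part of Edward2), and the double loop is written for readability, not speed. Edward2's own implementation may compute this differently.

```python
import numpy as np

def one_hot_add_np(x, shift):
    """Modular addition of two one-hot vectors via circular convolution."""
    k = x.shape[-1]
    out = np.zeros(k)
    for i in range(k):
        for j in range(k):
            # mass at index i shifted by j lands at (i + j) mod k
            out[(i + j) % k] += x[i] * shift[j]
    return out

def one_hot_minus_np(x, shift):
    """Modular subtraction: the inverse of one_hot_add_np for a fixed shift."""
    k = x.shape[-1]
    out = np.zeros(k)
    for i in range(k):
        for j in range(k):
            out[(i - j) % k] += x[i] * shift[j]
    return out

x = np.array([0., 0., 0., 1., 0.])      # one-hot for index 3
shift = np.array([0., 0., 1., 0., 0.])  # one-hot for a shift of 2

added = one_hot_add_np(x, shift)            # index (3 + 2) mod 5
restored = one_hot_minus_np(added, shift)   # back to index 3
print(added.argmax(), restored.argmax())
```

Because the result is computed with exact integer indexing, no rounding is needed in this version.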

## Learning transformations

Let's learn mu such that x_target = mu + x. Learning here is trivial: we train just one variable.

In [7]:
# Try different initializations!
np.random.seed(9661)

In [8]:
# where do we start
x = tf.constant([1., 0., 0., 0. , 0., 0.], name="x") # =0

# what we want to get
x_target = tf.constant([0., 0., 0., 0., 0., 1.], name="x_target") # =5

# trainable variable in unconstrained space
mu_logits = tf.Variable(np.random.randn(6), name="mu_logits") # random initialization
#mu_logits = tf.Variable([0.1, 0.1, 0.1, 0.1, 0.1, 0.1], name="mu_logits") # uniform initialization

# straight-through estimator of mu = argmax(mu_logits),
# where mu is the one-hot representation of a shift transformation;
# the temperature controls the bias of the gradients going through the argmax
mu = ed.layers.utils.one_hot_argmax(mu_logits, temperature=1.0)

In [9]:
# Preview tensors
x, x_target, mu_logits, mu

(<tf.Tensor: shape=(6,), dtype=float32, numpy=array([1., 0., 0., 0., 0., 0.], dtype=float32)>,
 <tf.Tensor: shape=(6,), dtype=float32, numpy=array([0., 0., 0., 0., 0., 1.], dtype=float32)>,
 <tf.Variable 'mu_logits:0' shape=(6,) dtype=float64, numpy=
 array([-0.42223113,  2.03345075, -1.60760614, -0.51772096,  1.21881057,
        -1.65165981])>,
 <tf.Tensor: shape=(6,), dtype=float64, numpy=array([0., 1., 0., 0., 0., 0.])>)
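As I understand the straight-through trick, the forward pass returns the hard one-hot argmax, while the backward pass pretends the output was softmax(logits / temperature) and uses that Jacobian. The NumPy sketch below only illustrates this idea. The function name `straight_through_argmax` and the rounded logits are my own choices, and it is not Edward2's code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def straight_through_argmax(logits, temperature=1.0):
    """Return the hard one-hot argmax and the surrogate (softmax) Jacobian."""
    soft = softmax(logits / temperature)
    hard = np.eye(len(logits))[soft.argmax()]  # value used in the forward pass
    # Jacobian of softmax(logits / temperature) w.r.t. logits,
    # used in place of the (zero almost everywhere) gradient of argmax
    jac = (np.diag(soft) - np.outer(soft, soft)) / temperature
    return hard, jac

logits = np.array([-0.4, 2.0, -1.6, -0.5, 1.2, -1.7])  # illustrative values
hard, jac = straight_through_argmax(logits, temperature=1.0)
print(hard)
```

A lower temperature makes the softmax closer to the hard argmax, so the surrogate gradient is less biased but also more peaked.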

We use gradient-based learning on mu_logits so that argmax(mu_logits) learns to move x to x_target. We observe how mu changes until x + mu == x_target.

In [10]:
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.1)  # `lr` is a deprecated alias

for i in range(30):
    with tf.GradientTape() as tape: 
        
        mu = ed.layers.utils.one_hot_argmax(mu_logits, temperature=1.0)
        x_transformed = ed.layers.utils.one_hot_add(x, mu)        
        loss = tf.reduce_sum((x_target-x_transformed)**2) # squared loss
        
        if i%1==0:
            #print(np.round(mu, 1).reshape(-1))
            print("iter=%i mu=%s x_transformed=%s" % 
                   (i, np.round(mu, 2), np.round(abs(x_transformed), 1)))        
            
    gradients = tape.gradient(loss, mu_logits)        
    optimizer.apply_gradients([(gradients, mu_logits)])

iter=0 mu=[0. 1. 0. 0. 0. 0.] x_transformed=[0. 1. 0. 0. 0. 0.]
iter=1 mu=[0. 1. 0. 0. 0. 0.] x_transformed=[0. 1. 0. 0. 0. 0.]
iter=2 mu=[0. 0. 0. 0. 1. 0.] x_transformed=[0. 0. 0. 0. 1. 0.]
iter=3 mu=[0. 1. 0. 0. 0. 0.] x_transformed=[0. 1. 0. 0. 0. 0.]
iter=4 mu=[0. 0. 0. 0. 1. 0.] x_transformed=[0. 0. 0. 0. 1. 0.]
iter=5 mu=[0. 1. 0. 0. 0. 0.] x_transformed=[0. 1. 0. 0. 0. 0.]
iter=6 mu=[0. 0. 0. 0. 1. 0.] x_transformed=[0. 0. 0. 0. 1. 0.]
iter=7 mu=[0. 1. 0. 0. 0. 0.] x_transformed=[0. 1. 0. 0. 0. 0.]
iter=8 mu=[0. 0. 0. 0. 1. 0.] x_transformed=[0. 0. 0. 0. 1. 0.]
iter=9 mu=[1. 0. 0. 0. 0. 0.] x_transformed=[1. 0. 0. 0. 0. 0.]
iter=10 mu=[0. 1. 0. 0. 0. 0.] x_transformed=[0. 1. 0. 0. 0. 0.]
iter=11 mu=[0. 0. 0. 1. 0. 0.] x_transformed=[0. 0. 0. 1. 0. 0.]
iter=12 mu=[0. 0. 0. 0. 1. 0.] x_transformed=[0. 0. 0. 0. 1. 0.]
iter=13 mu=[0. 1. 0. 0. 0. 0.] x_transformed=[0. 1. 0. 0. 0. 0.]
iter=14 mu=[0. 0. 0. 0. 0. 1.] x_transformed=[0. 0. 0. 0. 0. 1.]
iter=15 mu=[0. 0. 0. 0. 0. 1.] x_tr

In [11]:
# let's look at how the logits changed
mu_logits

<tf.Variable 'mu_logits:0' shape=(6,) dtype=float64, numpy=
array([1.09168971, 1.04800181, 0.51804335, 1.11984215, 1.14813777,
       1.18473996])>
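To get a feeling for why mu hops between indices during training, one update can be traced by hand. The sketch below is a simplified NumPy reconstruction under my own assumptions: it uses the initial logits rounded to two decimals, temperature 1, and the fact that x is the one-hot for index 0, so x + mu equals mu. It computes only the gradient, not the RMSprop update.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([-0.42, 2.03, -1.61, -0.52, 1.22, -1.65])  # rounded initial mu_logits
target = np.eye(6)[5]                                        # x_target = index 5

soft = softmax(logits)
hard = np.eye(6)[soft.argmax()]     # forward value of mu (index 1 here)
dloss_dmu = 2.0 * (hard - target)   # gradient of sum((target - mu)**2) w.r.t. mu
jac = np.diag(soft) - np.outer(soft, soft)
grad_logits = jac @ dloss_dmu       # straight-through: backprop through the softmax

print(np.sign(grad_logits))
```

The gradient is positive for the current argmax and negative for the target index, so a descent step lowers the former logit and raises the latter. The hard argmax only changes once some other logit overtakes the leader, which is consistent with the jumps seen in the log.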

## Transforming batches of N-dimensional samples

We sample a batch from an N-dimensional K-categorical distribution and pass it through a flow that models dependencies between dimensions using an autoregressive transformation:

In [12]:
N, K = 2, 3 # two variables with three categories

In [13]:
tf.random.set_seed(123) # ensure the same results every time

In [14]:
# base distribution
probs = [[0.1, 0.1, 0.8],[0.2,0.2,0.6]]
base = tfp.distributions.OneHotCategorical(probs=probs)
base.probs

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[0.1, 0.1, 0.8],
       [0.2, 0.2, 0.6]], dtype=float32)>

In [15]:
# sample batch of two samples from base distribution -> output dim = (batch, N, K)
sample = base.sample(2)
sample

<tf.Tensor: shape=(2, 2, 3), dtype=int32, numpy=
array([[[1, 0, 0],
        [0, 0, 1]],

       [[0, 0, 1],
        [0, 0, 1]]], dtype=int32)>
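For readers who want to see the shapes without TensorFlow Probability, here is a rough NumPy equivalent of drawing a one-hot batch of shape (batch, N, K). The seed and variable names are my own, and the drawn values will not match the TFP sample above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen for this sketch
probs = np.array([[0.1, 0.1, 0.8],
                  [0.2, 0.2, 0.6]])  # shape (N, K)
N, K = probs.shape
batch = 2

# draw a category index for every dimension independently -> shape (batch, N)
idx = np.stack([rng.choice(K, size=batch, p=p) for p in probs], axis=1)

# convert indices to one-hot vectors -> shape (batch, N, K)
one_hot = np.eye(K, dtype=np.int32)[idx]
print(one_hot.shape)
```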

In [16]:
# let's use a masked autoencoder (MADE) to model our transformation mu
mu = ed.layers.MADE(K, hidden_dims=[3,3], hidden_order="left-to-right")

In [17]:
# an autoregressive flow using transformation mu
flow = ed.layers.DiscreteAutoregressiveFlow(mu, temperature=0.1) 

In [18]:
# let's push forward our sample and see how values changed 
# (=ones were moved to other positions)
transformed_sample = flow(sample)
transformed_sample

<tf.Tensor: shape=(2, 2, 3), dtype=int32, numpy=
array([[[1, 0, 0],
        [1, 0, 0]],

       [[0, 0, 1],
        [0, 1, 0]]], dtype=int32)>

In [19]:
# and retrieve the original sample by pushing the transformed_sample back
restored_sample = flow.reverse(transformed_sample)
np.round(restored_sample, 4)

array([[[ 1.,  0., -0.],
        [ 0., -0.,  1.]],

       [[ 0., -0.,  1.],
        [-0., -0.,  1.]]], dtype=float32)

In [20]:
# compare the original sample with the retrieved one
np.round(restored_sample, 4)==np.round(sample, 4)

array([[[ True,  True,  True],
        [ True,  True,  True]],

       [[ True,  True,  True],
        [ True,  True,  True]]])
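The invertibility shown above follows from the structure of the transformation: each dimension is shifted modulo K by an amount that depends only on earlier dimensions, so the shift can be recomputed and subtracted. Here is a minimal sketch on category indices instead of one-hot vectors. The lookup table `shift_table` is a hand-picked stand-in for the MADE network and is purely hypothetical.

```python
import numpy as np

K = 3
# hypothetical "network": shift applied to dimension 2 given the value of dimension 1
shift_table = np.array([1, 0, 2])

def forward(x):
    """y_1 = x_1, y_2 = (x_2 + shift(y_1)) mod K."""
    y = x.copy()
    y[:, 1] = (x[:, 1] + shift_table[y[:, 0]]) % K
    return y

def reverse(y):
    """x_1 = y_1, x_2 = (y_2 - shift(y_1)) mod K."""
    x = y.copy()
    x[:, 1] = (y[:, 1] - shift_table[y[:, 0]]) % K
    return x

samples = np.array([[0, 2],
                    [2, 2]])  # category indices, shape (batch, N)
y = forward(samples)
x_back = reverse(y)
print(y)
print(np.array_equal(x_back, samples))
```

Since the reverse pass only needs values of y, which are all known, it can be evaluated for all dimensions at once.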

## Training larger transformation with MLE

Let's train the flow transformation so that base samples passed through the flow follow a distribution as close as possible to a target distribution.

In [21]:
# 'true' data generating distribution
target = tfp.distributions.OneHotCategorical(probs = [[0.7, 0.2, 0.1],[0.3,0.4,0.3]])
target.probs

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[0.7, 0.2, 0.1],
       [0.3, 0.4, 0.3]], dtype=float32)>

In [22]:
# our 'features'
target_samples = tf.cast(target.sample(100), 'float32') # cast to the right type
target_samples.shape

TensorShape([100, 2, 3])

In [23]:
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.1)  # `lr` is a deprecated alias

for i in range(200):
    with tf.GradientTape() as tape: 
        # move samples to the space where we know how to evaluate probabilities
        reversed_target_samples = flow.reverse(target_samples)
        
        # evaluate log-probs of the samples (output shape=batch x N)
        # (i.e., log_probs = base.log_prob(reversed_target_samples) )
        probs = tf.reduce_sum(reversed_target_samples*base.probs, -1)
        log_probs = tf.math.log(probs+1e-31)
        
        # independent variables -> we just sum up log-probs 
        # to get joint log prob of a N-dim sample
        log_probs = tf.reduce_sum(log_probs, -1) 
        
        # loss = minus average log-likelihood
        loss = -tf.reduce_mean(log_probs) 

        if i%10==0 or i<10:        
            print("iter=%i loss=%.3f" % (i, loss))
            
    gradients = tape.gradient(loss, flow.trainable_variables)        
    optimizer.apply_gradients(zip(gradients, flow.trainable_variables))

iter=0 loss=3.371
iter=1 loss=2.270
iter=2 loss=2.873
iter=3 loss=2.873
iter=4 loss=2.873
iter=5 loss=2.928
iter=6 loss=2.928
iter=7 loss=2.325
iter=8 loss=2.325
iter=9 loss=2.325
iter=10 loss=2.325
iter=20 loss=2.325
iter=30 loss=2.928
iter=40 loss=2.325
iter=50 loss=2.873
iter=60 loss=2.237
iter=70 loss=2.237
iter=80 loss=2.237
iter=90 loss=2.840
iter=100 loss=2.237
iter=110 loss=2.237
iter=120 loss=2.237
iter=130 loss=2.237
iter=140 loss=2.237
iter=150 loss=2.237
iter=160 loss=2.237
iter=170 loss=2.237
iter=180 loss=2.840
iter=190 loss=2.215
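The loss in the loop relies on two facts: multiplying a one-hot vector by the probability table and summing picks out the probability of the active category, and a discrete bijection needs no Jacobian term, so the log-likelihood of a sample is just the base log-probability of its reversed version. A small NumPy check of the first part, with hand-written inputs standing in for `reversed_target_samples`:

```python
import numpy as np

base_probs = np.array([[0.1, 0.1, 0.8],
                       [0.2, 0.2, 0.6]])  # shape (N, K)

# two already-reversed samples in one-hot form, shape (batch, N, K)
x = np.array([[[1, 0, 0], [0, 0, 1]],
              [[0, 0, 1], [0, 0, 1]]], dtype=float)

probs = np.sum(x * base_probs, axis=-1)     # (batch, N): probability of each active category
log_probs = np.log(probs).sum(axis=-1)      # (batch,): dimensions are independent in the base
loss = -log_probs.mean()                    # average negative log-likelihood
print(probs)
```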


In [24]:
# Let's retrieve the resulting distribution
#  'generated' by passing samples from the base through the flow
np.round( tf.reduce_mean(flow(tf.cast(base.sample(100000),'float32')),0), 2)

array([[0.8 , 0.1 , 0.1 ],
       [0.24, 0.56, 0.2 ]], dtype=float32)

In [25]:
# we compare it against the target distribution
target.probs

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[0.7, 0.2, 0.1],
       [0.3, 0.4, 0.3]], dtype=float32)>

In [26]:
# we note that it's not exactly the same,
# but let's take another look at the base distribution
base.probs

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[0.1, 0.1, 0.8],
       [0.2, 0.2, 0.6]], dtype=float32)>
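One reason the match cannot be exact is that a discrete flow only relocates the probability mass of the base; it cannot create new probability values. For a single variable with a constant shift, the model distribution is just a cyclic rotation of the base probabilities. The sketch below is a simplification that looks at the first dimension only and ignores dependence on other dimensions. It scores all K shifts by cross-entropy against the target.

```python
import numpy as np

K = 3
base_probs = np.array([0.1, 0.1, 0.8])    # first row of base.probs
target_probs = np.array([0.7, 0.2, 0.1])  # first row of target.probs

def model_probs(shift):
    """Distribution of y = (x + shift) mod K when x follows base_probs."""
    y = np.arange(K)
    return base_probs[(y - shift) % K]

def cross_entropy(shift):
    """Expected negative log-likelihood of target data under the shifted base."""
    return -np.sum(target_probs * np.log(model_probs(shift)))

losses = {s: cross_entropy(s) for s in range(K)}
best_shift = min(losses, key=losses.get)
print(best_shift, model_probs(best_shift))
```

The best shift moves the base mode onto the most likely target category, which gives [0.8, 0.1, 0.1], the same values as the first row of the generated distribution above.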

Note that the target distribution here was factorized, so the network mu did not actually need to learn any dependencies between dimensions. In the general case, however, the target could be any distribution.