# Understanding the VGAE and GAE Models

The pygcn code implements a plain two-layer GCN, so its model architecture is simpler than the one in vgae-pytorch, which implements a graph autoencoder (GAE) and its variational counterpart (VGAE). We'll compare the two code bases and relate the VGAE and GAE architectures to the math.

## pygcn Implementation (GCN only)

In [1]:
import torch.nn as nn
import torch.nn.functional as F
from pygcn.layers import GraphConvolution


class GCN(nn.Module):
    def __init__(self, nfeat, nhid, nclass, dropout):
        super(GCN, self).__init__()

        self.gc1 = GraphConvolution(nfeat, nhid)
        self.gc2 = GraphConvolution(nhid, nclass)
        self.dropout = dropout

    def forward(self, x, adj):
        x = F.relu(self.gc1(x, adj)) # notice that the ReLU is applied outside of the GCN layer definition
        x = F.dropout(x, self.dropout, training=self.training) # does vgae-pytorch use dropout? 
        x = self.gc2(x, adj)
        return F.log_softmax(x, dim=1) # note that the final output is using log-softmax


## vgae-pytorch Implementation 

### VGAE

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import os
import numpy as np

import args

class VGAE(nn.Module):
    def __init__(self, adj):
        super(VGAE, self).__init__()
        # GraphConvSparse is defined elsewhere in vgae-pytorch (not shown in this cell)
        self.base_gcn = GraphConvSparse(args.input_dim, args.hidden1_dim, adj)  # first layer
        # the second layer is instantiated twice, once for the mean and once for the log-std dev;
        # both heads read the same first-layer output, which matches the original VGAE paper,
        # where the mean and log-std dev GCNs share the first-layer parameters W0
        self.gcn_mean = GraphConvSparse(args.hidden1_dim, args.hidden2_dim, adj, activation=lambda x: x)
        self.gcn_logstddev = GraphConvSparse(args.hidden1_dim, args.hidden2_dim, adj, activation=lambda x: x)

    def encode(self, X):
        hidden = self.base_gcn(X)
        self.mean = self.gcn_mean(hidden)
        self.logstd = self.gcn_logstddev(hidden)
        # reparameterization trick: draw N x hidden2_dim standard Gaussian noise,
        # then scale it by the std dev and shift it by the mean
        gaussian_noise = torch.randn(X.size(0), args.hidden2_dim)
        sampled_z = gaussian_noise * torch.exp(self.logstd) + self.mean
        return sampled_z

    def forward(self, X):
        # a full pass encodes the input into a distribution, draws a random sample Z from it
        # (so Z is not exactly equal to the mean encoding of the input),
        # and returns sigmoid(Z Z^T) as the predicted adjacency matrix A_pred
        Z = self.encode(X)
        A_pred = dot_product_decode(Z)
        return A_pred


### Explanation of Decoder

According to Kipf and Welling, the generative part of the model (i.e. the decoder) can be written as follows:

$$p(\mathbf{\hat{A}} | \mathbf{Z}) = \prod_{i=1}^N \prod_{j=1}^N p(\hat{A}_{ij} | \mathbf{z}_i, \mathbf{z}_j)$$

where the probability of an edge between nodes $i$ and $j$ is:

$$p(\hat{A}_{ij} = 1 | \mathbf{z}_i, \mathbf{z}_j) = \sigma(\mathbf{z}_i^T \mathbf{z}_j)$$

and $\sigma(\cdot)$ is the logistic sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

We use the logistic sigmoid function because the reconstruction loss is binary cross entropy, which needs every entry of $\hat{\mathbf{A}}$ to be a probability between 0 and 1. The sigmoid is not tied to the KL divergence: the KL term acts on the latent distribution $q(\mathbf{Z}|\mathbf{X,A})$, not on the decoder output, so the VGAE uses the sigmoid decoder for the same reason as the GAE.

NOTE: I do not fully understand why this works. My guess is that training pushes the GCN layers to learn a latent space in which connected nodes have embeddings with large inner products, so that multiplying the embeddings in this way returns a good prediction.

In [5]:
def dot_product_decode(Z):
    # the prediction is the element-wise sigmoid of Z @ Z.T
    A_pred = torch.sigmoid(torch.matmul(Z, Z.t()))
    return A_pred
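
To make the decoder concrete, here is a small sketch in plain Python (no PyTorch, so it runs anywhere). The embeddings in `Z` are made-up values rather than learned ones, and `dot_product_decode_py` is a hypothetical helper that mirrors `dot_product_decode` for nested lists.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def dot_product_decode_py(Z):
    """Return A_pred[i][j] = sigmoid(z_i . z_j) for a list of embedding rows."""
    return [
        [sigmoid(sum(a * b for a, b in zip(zi, zj))) for zj in Z]
        for zi in Z
    ]


# Hypothetical 2-D embeddings for three nodes: nodes 0 and 1 point in a
# similar direction, node 2 points the opposite way.
Z = [[2.0, 1.0], [1.5, 1.0], [-2.0, -1.0]]
A_pred = dot_product_decode_py(Z)

for row in A_pred:
    print(["{:.3f}".format(p) for p in row])
```

Nodes 0 and 1 have a large positive inner product, so the predicted edge probability between them is close to 1, while node 2 gets a probability close to 0 with both. This is the geometry that training should encourage: embeddings of connected nodes line up, and embeddings of unconnected nodes do not.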

### GAE 

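This model removes the stochasticity and represents everything deterministically: there is no log-std dev branch and no sampling step, so the encoder output is used directly as $\mathbf{Z}$.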

In [7]:
class GAE(nn.Module):
    def __init__(self, adj):
        super(GAE, self).__init__()
        self.base_gcn = GraphConvSparse(args.input_dim, args.hidden1_dim, adj)
        # the second layer just represents the mean of the latent space
        self.gcn_mean = GraphConvSparse(args.hidden1_dim, args.hidden2_dim, adj, activation=lambda x: x)

    def encode(self, X):
        hidden = self.base_gcn(X)
        z = self.mean = self.gcn_mean(hidden)
        return z

    def forward(self, X):
        Z = self.encode(X)
        A_pred = dot_product_decode(Z)  # decoding is the same as for the VGAE
        return A_pred

## Loss Functions

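Now let's talk about the loss function used to train the (V)AE. According to Kipf and Welling again, the VAE is trained by maximizing the evidence lower bound (ELBO), which means the training code minimizes a rescaled negative version of it: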

$$\mathcal{L} = \mathbb{E}_{q(\mathbf{Z}|\mathbf{X,A})} [\log p(\mathbf{A}|\mathbf{Z})] - \text{KL}[q(\mathbf{Z}|\mathbf{X,A}) \,||\, p(\mathbf{Z})]$$

Here $\text{KL}[q(\cdot) || p(\cdot)]$ is the Kullback-Leibler divergence between $q(\cdot)$ and $p(\cdot)$. 

The paper says it also uses a Gaussian prior for $p(\mathbf{Z})$:

$$\begin{align} p(\mathbf{Z}) &= \prod_i p(\mathbf{z}_i) \\
&= \prod_i \mathcal{N}(\mathbf{z}_i | 0, \mathbf{I})\end{align}$$

After reviewing the code below, I believe we are using a specific closed form of the KL divergence, namely the "relative entropy between a diagonal multivariate normal and a standard normal distribution" (source: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence), where the diagonal multivariate normal is $q(\mathbf{Z}|\mathbf{X,A})$ and the standard normal is $p(\mathbf{Z})$:

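$$\begin{align}D_{KL}\bigg( \mathcal{N}\big((\mu_1, ... , \mu_k)^T, \text{diag}(\sigma_1^2, ..., \sigma_k^2) \big) || \mathcal{N}(0, \mathbf{I}) \bigg) &= \frac{1}{2} \sum_{i=1}^k \big(\sigma_i^2 + \mu_i^2 - 1 - \ln(\sigma_i^2) \big) \\
&= -\frac{1}{2} \sum_{i=1}^k \big(1 + 2 \ln(\sigma_i) - (e^{\ln(\sigma_i)})^2 - \mu_i^2 \big) \end{align}$$

The two rows are the same expression, since $2 \ln(\sigma_i) = \ln(\sigma_i^2)$ and $(e^{\ln(\sigma_i)})^2 = \sigma_i^2$. We write the second row this way because the log of the standard deviation is computed as one monolithic term by the second layer of the GCN.

The code evaluates the bracketed sum of the second row for every node $n$ (the `sum(1)` over the $k$ latent dimensions), averages the result over the $N$ nodes (the `mean()`), and scales it by $\frac{1}{2N}$:

$$\text{code term} = \frac{1}{2N} \, \text{mean}_{n}\Big(\sum_{i=1}^k \big(1 + 2 \ln(\sigma_{n,i}) - (e^{\ln(\sigma_{n,i})})^2 - \mu_{n,i}^2 \big)\Big)$$

At first I thought the mean might be redundant, but it is not: the sum runs over latent dimensions and the mean runs over nodes. Note also that the code term is the negative of the (rescaled) KL divergence, which is why the code subtracts it from the loss.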
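
As a sanity check on the signs and scaling, here is a minimal sketch in plain Python that compares the closed-form KL divergence with a mirror of the `kl_divergence` expression from the training loop. The function names and the toy `mean` and `logstd` values are hypothetical, chosen only for illustration.

```python
import math


def kl_diag_gaussian_vs_standard(mu, logstd):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for one node, with sigma = exp(logstd)."""
    return 0.5 * sum(
        math.exp(ls) ** 2 + m ** 2 - 1.0 - 2.0 * ls
        for m, ls in zip(mu, logstd)
    )


def code_kl_term(mean, logstd):
    """Plain-Python mirror of the `kl_divergence` expression in the training loop."""
    n = len(mean)  # number of nodes, i.e. A_pred.size(0)
    per_node = [
        sum(1.0 + 2.0 * ls - m ** 2 - math.exp(ls) ** 2 for m, ls in zip(mu_i, ls_i))
        for mu_i, ls_i in zip(mean, logstd)
    ]  # corresponds to .sum(1)
    return 0.5 / n * (sum(per_node) / n)  # corresponds to 0.5 / N * (...).mean()


# Made-up encoder outputs for 2 nodes and 2 latent dimensions.
mean = [[0.5, -1.0], [0.0, 0.0]]
logstd = [[0.1, -0.3], [0.0, 0.0]]

kl_per_node = [kl_diag_gaussian_vs_standard(m, ls) for m, ls in zip(mean, logstd)]
avg_kl = sum(kl_per_node) / len(kl_per_node)
term = code_kl_term(mean, logstd)

print("KL per node:", kl_per_node)
print("code term:", term, "which equals -avg_kl / N =", -avg_kl / len(mean))
```

The second node has zero mean and unit standard deviation, so its KL divergence to the standard normal prior is zero. The code term comes out as the negative of the node-averaged KL divergence, divided by the number of nodes.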

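The paper also says that if $\tilde{\mathbf{A}}$ is very sparse, then it can help to reweight the terms with $\tilde{\mathbf{A}}_{ij}=1$ in the loss function, $\mathcal{L}$. I could not see this at first, but in the training code at the end of this section it appears to be what `pos_weight` and `weight_tensor` do: entries of the label matrix that equal 1 get the weight `pos_weight` in the binary cross entropy.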
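
The reweighting can be sketched on a tiny graph. The example below uses plain Python lists and a made-up 4-node adjacency matrix; it applies the same `pos_weight` and `norm` formulas as the training code. As a simplification, it builds the weights from `adj` itself rather than from the label matrix with self-loops added.

```python
# Hypothetical 4-node undirected graph with two edges (0-1 and 2-3).
adj = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]

n = len(adj)
num_entries = n * n                      # 16 entries in total
num_pos = sum(sum(row) for row in adj)   # 4 nonzero entries
num_neg = num_entries - num_pos          # 12 zero entries

# Same formulas as in the training code, written with plain numbers.
pos_weight = float(num_neg) / num_pos
norm = num_entries / float(num_neg * 2)

# One weight per flattened entry: pos_weight for edges, 1 for non-edges.
labels = [a for row in adj for a in row]
weights = [pos_weight if a == 1 else 1.0 for a in labels]

weight_on_pos = sum(w for w, a in zip(weights, labels) if a == 1)
weight_on_neg = sum(w for w, a in zip(weights, labels) if a == 0)
print(pos_weight, norm, weight_on_pos, weight_on_neg)
```

With these weights the few edge entries carry the same total weight as the many non-edge entries, so the loss is not dominated by the zeros. In this simplified setting, `norm` times the average weight is exactly 1, so the rescaling keeps the weighted loss on the same scale as an unweighted mean.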

Finally, training is done with full-batch gradient descent, and the reparameterization trick (see here: https://sassafras13.github.io/ReparamTrick/) is used to handle the stochasticity of the VAE (not necessary for the AE). 
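
The sampling step in `VGAE.encode` can be sketched without PyTorch as follows. This is only an illustration under assumed values: `sample_z` is a hypothetical helper, and the `mean` and `logstd` numbers are made up. The point is that all randomness comes from the standard normal noise, while the mean and log-std dev enter through deterministic operations, which is what lets gradients reach them in the real model.

```python
import math
import random


def sample_z(mean, logstd, rng):
    """Reparameterized sample: z = mean + eps * exp(logstd), with eps ~ N(0, 1)."""
    return [
        [m + rng.gauss(0.0, 1.0) * math.exp(ls) for m, ls in zip(mu_i, ls_i)]
        for mu_i, ls_i in zip(mean, logstd)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# One node with two latent dimensions: std devs of 0.5 and 2.0.
mean = [[1.0, -2.0]]
logstd = [[math.log(0.5), math.log(2.0)]]

samples = [sample_z(mean, logstd, rng)[0] for _ in range(20000)]
emp_mean = [sum(s[k] for s in samples) / len(samples) for k in range(2)]
emp_std = [
    math.sqrt(sum((s[k] - emp_mean[k]) ** 2 for s in samples) / len(samples))
    for k in range(2)
]

print("empirical mean:", emp_mean)
print("empirical std: ", emp_std)
```

The empirical mean and standard deviation of the samples should land close to the values encoded in `mean` and `exp(logstd)`, confirming that scaling and shifting standard normal noise gives samples from the intended Gaussian.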

### What about the AE? 

Both models use binary cross entropy, aka the log-loss function (see here: https://sassafras13.github.io/BiCE/), as the reconstruction term. The AE is deterministic, so it has no KL divergence term and binary cross entropy is its entire loss. Cross entropy is closely related to KL divergence, since both measure the difference between two distributions (the cross entropy equals the KL divergence plus the entropy of the target distribution):

$$H_p(q) = - \frac{1}{N} \sum_{i=1}^N \big[ y_i \log(p(y_i)) + (1 - y_i)\log(1 - p(y_i)) \big]$$
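
As a worked example of the formula, the sketch below evaluates binary cross entropy in plain Python for two sets of made-up predictions. The helper name and the numbers are hypothetical. It is an unweighted version, so it leaves out the `weight` and `norm` factors used in the training code.

```python
import math


def binary_cross_entropy(y_true, p_pred):
    """Mean of -[y log p + (1 - y) log(1 - p)] over all entries."""
    terms = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, p_pred)
    ]
    return sum(terms) / len(terms)


# Flattened labels for four potential edges, plus two sets of predictions.
y = [1, 0, 0, 1]
good = [0.9, 0.1, 0.2, 0.8]  # close to the labels
bad = [0.4, 0.6, 0.7, 0.3]   # on the wrong side of 0.5 for every entry

loss_good = binary_cross_entropy(y, good)
loss_bad = binary_cross_entropy(y, bad)
print("good predictions:", round(loss_good, 4))
print("bad predictions: ", round(loss_bad, 4))
```

Predictions that agree with the labels give a small loss, and predictions on the wrong side of 0.5 give a much larger one, which is the behaviour we want from a reconstruction loss on the adjacency matrix.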

In [None]:
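import time
import scipy.sparse as sp
import torch
import torch.nn.functional as F
from torch.optim import Adam

# adj, adj_train, adj_norm, features, and the edge splits are assumed to be prepared earlier in the
# vgae-pytorch training script, which also provides args, model, sparse_to_tuple, get_acc, and get_scores

# Create Model
# pos_weight = (number of zero entries) / (number of nonzero entries) in adj;
# it up-weights the rare edge terms in the reconstruction loss
pos_weight = float(adj.shape[0] * adj.shape[0] - adj.sum()) / adj.sum()
# norm = N^2 / (2 * number of zero entries); it rescales the weighted loss
norm = adj.shape[0] * adj.shape[0] / float((adj.shape[0] * adj.shape[0] - adj.sum()) * 2)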


adj_label = adj_train + sp.eye(adj_train.shape[0])
adj_label = sparse_to_tuple(adj_label)



adj_norm = torch.sparse.FloatTensor(torch.LongTensor(adj_norm[0].T), 
                            torch.FloatTensor(adj_norm[1]), 
                            torch.Size(adj_norm[2]))
adj_label = torch.sparse.FloatTensor(torch.LongTensor(adj_label[0].T), 
                            torch.FloatTensor(adj_label[1]), 
                            torch.Size(adj_label[2]))
features = torch.sparse.FloatTensor(torch.LongTensor(features[0].T), 
                            torch.FloatTensor(features[1]), 
                            torch.Size(features[2]))

weight_mask = adj_label.to_dense().view(-1) == 1
weight_tensor = torch.ones(weight_mask.size(0)) 
weight_tensor[weight_mask] = pos_weight

# init model and optimizer
model = getattr(model,args.model)(adj_norm)
optimizer = Adam(model.parameters(), lr=args.learning_rate)

# train model
for epoch in range(args.num_epoch):
    t = time.time()

    A_pred = model(features)
    optimizer.zero_grad()
    # both models use (weighted) binary cross entropy as the reconstruction loss
    loss = log_lik = norm*F.binary_cross_entropy(A_pred.view(-1), adj_label.to_dense().view(-1), weight = weight_tensor)
    if args.model == 'VGAE':
        # closed-form KL term between the diagonal Gaussian q(Z|X,A) and the standard normal prior;
        # sum(1) adds over latent dimensions and mean() averages over nodes, so the mean is not redundant.
        # As written, this is the negative of the (rescaled) KL divergence, so subtracting it adds a penalty.
        kl_divergence = 0.5/ A_pred.size(0) * (1 + 2*model.logstd - model.mean**2 - torch.exp(model.logstd)**2).sum(1).mean()
        loss -= kl_divergence

    loss.backward()
    optimizer.step()

    train_acc = get_acc(A_pred,adj_label)

    val_roc, val_ap = get_scores(val_edges, val_edges_false, A_pred)
    print("Epoch:", '%04d' % (epoch + 1), "train_loss=", "{:.5f}".format(loss.item()),
          "train_acc=", "{:.5f}".format(train_acc), "val_roc=", "{:.5f}".format(val_roc),
          "val_ap=", "{:.5f}".format(val_ap),
          "time=", "{:.5f}".format(time.time() - t))


test_roc, test_ap = get_scores(test_edges, test_edges_false, A_pred)
print("End of training!", "test_roc=", "{:.5f}".format(test_roc),
      "test_ap=", "{:.5f}".format(test_ap))