In [4]:
import torch
import numpy as np
#from common.ffn.ffn_relu import ParametricReLUNet
from common.ffn.ffn_tanh import TanhNet

***
##### Theoretical values
***
Consider an FFN with input dimension ${n_0}$ in which all hidden layers and the output layer have width ${n}$; the width of layer $l$ is denoted by ${n^l}$. For the preactivation $z^{(l)}$ of layer $l$, the activation is $σ(z^{(l)})$, abbreviated $σ^{(l)}$. The weights are initialized from a centered normal distribution with variance $1/n$, except in the first layer, where the variance is $1/{n_0}$; the biases are identically zero. As mentioned, $z^{(l)}$ is the preactivation at layer $l$ for a training set with points $α_1...α_N∈D$. Consider the distribution:

$$p(z^{(l)})_{g,v}=\frac{1}{Z_{g,v}}exp(-\frac{1}{2}\sum \limits _{k=1} ^{n^l}\sum \limits _{α_1,α_2∈D}g^{α_1,α_2}_{(l)}z^{(l)}_{k,α_1}z^{(l)}_{k,α_2})(1+\frac{1}{8}\sum \limits _{k_1,k_2=1} ^{n^l}\sum \limits _{α_1,α_2,α_3,α_4∈D}v^{(α_1,α_2)(α_3,α_4)}_{(l)}z^{(l)}_{k_1,α_1}z^{(l)}_{k_1,α_2}z^{(l)}_{k_2,α_3}z^{(l)}_{k_2,α_4}) (1)$$

Here, $g^{α_1,α_2}_{(l)}$ and $v^{(α_1,α_2)(α_3,α_4)}_{(l)}$ are calculated via $G^{(l)}_{α_1,α_2}$ and $v^{(l)}_{(α_1,α_2)(α_3,α_4)}$:

$$G^{(l+1)}_{α_1,α_2}=<σ^{(l)}_{α_1}σ^{(l)}_{α_2}>_{g^{(l)}}+\frac{1}{8}\sum \limits _{β_1,β_2,β_3,β_4∈D}v^{(β_1,β_2)(β_3,β_4)}_{(l)}(<σ^{(l)}_{α_1}σ^{(l)}_{α_2}z^{(l)}_{β_1,β_2}(z^{(l)}_{β_3,β_4}+2ng^{(l)}_{β_3,β_4})>_{g^{(l)}}-2<σ^{(l)}_{α_1}σ^{(l)}_{α_2}>_{g^{(l)}}g^{(l)}_{β_1,β_3}g^{(l)}_{β_2,β_4}) (2)$$

Formula (2) matches (4.61) in [1]. In this formula $σ^{(l)}=σ(z^{(l)})$; $<⋅>_{g^{(l)}}$ denotes the Gaussian integral with covariance matrix $g^{(l)}$ over all variables with Greek indices that appear inside $<⋅>$; and $z^{(l)}_{β_1,β_2}=z^{(l)}_{β_1}z^{(l)}_{β_2}-g^{(l)}_{β_1,β_2}$.
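
For a single input the Gaussian averages $<⋅>$ reduce to one-dimensional integrals, which can be approximated by Gauss-Hermite quadrature. The following is a minimal sketch under stated assumptions, not the notebook's own implementation: the helper name `gaussian_expectation`, the variance value `K`, and the quadrature order are all hypothetical choices made for illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gaussian_expectation(f, variance, order=60):
    """Approximate E[f(z)] for z ~ N(0, variance) by Gauss-Hermite quadrature."""
    # hermegauss uses the weight exp(-x^2 / 2), so its weights sum to sqrt(2*pi).
    nodes, weights = hermegauss(order)
    z = np.sqrt(variance) * nodes
    return np.sum(weights * f(z)) / np.sqrt(2.0 * np.pi)

K = 1.5  # hypothetical variance of a single preactivation (assumption)

# <sigma sigma> for the tanh activation
sigma_sigma = gaussian_expectation(lambda z: np.tanh(z) ** 2, K)

# <sigma sigma (z z - K)>, the single-input analogue of <sigma sigma z_{beta1,beta2}>
sigma_sigma_z2 = gaussian_expectation(lambda z: np.tanh(z) ** 2 * (z ** 2 - K), K)

# sanity check: the second moment of z must reproduce the variance
second_moment = gaussian_expectation(lambda z: z ** 2, K)
```

For several inputs the same averages become multi-dimensional Gaussian integrals, for which Monte Carlo sampling is usually the simpler choice.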

$$v^{(l+1)}_{(α_1,α_2)(α_3,α_4)}=\frac{1}{n}(<σ^{(l)}_{α_1}σ^{(l)}_{α_2}σ^{(l)}_{α_3}σ^{(l)}_{α_4}>_{g^{(l)}}-<σ^{(l)}_{α_1}σ^{(l)}_{α_2}>_{g^{(l)}}<σ^{(l)}_{α_3}σ^{(l)}_{α_4}>_{g^{(l)}})+\frac{1}{4}\sum \limits _{β_1,β_2,β_3,β_4∈D}v^{(β_1,β_2)(β_3,β_4)}_{(l)}<σ^{(l)}_{α_1}σ^{(l)}_{α_2}z^{(l)}_{β_1,β_2}>_{g^{(l)}}<σ^{(l)}_{α_3}σ^{(l)}_{α_4}z^{(l)}_{β_3,β_4}>_{g^{(l)}} (3)$$

Formula (3) matches (4.90) in [1] if the subleading $1/n^2$ correction is neglected and the following substitutions are applied: $g^{α_1,α_2}=G^{α_1,α_2}+O(1/n), v^{(β_1,β_2)(β_3,β_4)}=\frac{1}{n}V^{(β_1,β_2)(β_3,β_4)}+O(1/n^2), v_{(β_1,β_2)(β_3,β_4)}=??$

From these values, $g^{α_1,α_2}_{(l+1)}$ and $v^{(α_1,α_2)(α_3,α_4)}_{(l+1)}$ can be computed:

$$g^{α_1,α_2}_{(l+1)}=G^{α_1,α_2}_{(l+1)}+\sum \limits _{β_1,β_2,β_3,β_4∈D}v^{(l+1)}_{(β_1,β_2)(β_3,β_4)}G^{α_1,β_1}_{(l+1)}(G^{β_2,β_3}_{(l+1)}G^{β_4,α_2}_{(l+1)}+\frac{n}{2}G^{β_2,α_2}_{(l+1)}G^{β_3,β_4}_{(l+1)}) (4)$$
$$v^{(α_1,α_2)(α_3,α_4)}_{(l+1)}=\sum \limits _{β_1,β_2,β_3,β_4∈D}G^{α_1,β_1}_{(l+1)}G^{α_2,β_2}_{(l+1)}G^{α_3,β_3}_{(l+1)}G^{α_4,β_4}_{(l+1)}v^{(l+1)}_{(β_1,β_2)(β_3,β_4)} (5)$$
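
The index contractions in (4) and (5) map directly onto `np.einsum`. The sketch below is illustrative only: the function name `raise_indices` and the array layout are assumptions, and the upper-index quantity $G^{α_1,α_2}_{(l+1)}$ is taken as a ready-made array `G_upper` (assumed here to be the matrix inverse of the lower-index $G^{(l+1)}_{α_1,α_2}$).

```python
import numpy as np

def raise_indices(G_upper, v_lower, n):
    """Evaluate (4) and (5) for given arrays.

    G_upper: (N, N) array holding G^{a1,a2} (assumed inverse of G_{a1,a2})
    v_lower: (N, N, N, N) array holding v_{(b1,b2)(b3,b4)}
    n:       layer width
    """
    # (5): contract each lower index of v with one factor of G_upper.
    v_upper = np.einsum('ai,bj,ck,dl,ijkl->abcd',
                        G_upper, G_upper, G_upper, G_upper, v_lower)

    # (4), first term in the bracket: G^{a1,b1} G^{b2,b3} G^{b4,a2}
    term_a = np.einsum('ijkl,ai,jk,lb->ab', v_lower, G_upper, G_upper, G_upper)
    # (4), second term in the bracket: G^{a1,b1} G^{b2,a2} G^{b3,b4}, weighted by n/2
    term_b = np.einsum('ijkl,ai,jb,kl->ab', v_lower, G_upper, G_upper, G_upper)

    g_upper = G_upper + term_a + 0.5 * n * term_b
    return g_upper, v_upper

# Single-point example with made-up numbers (not taken from the notebook).
G_upper = np.array([[2.0]])
v_lower = np.full((1, 1, 1, 1), 3.0)
g_upper, v_upper = raise_indices(G_upper, v_lower, n=100)
```

With a single training point every sum collapses to one term, which makes the result easy to check by hand against (4) and (5).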

The theorem is as follows: Suppose $q(z^{(l)})$ is the true preactivation distribution at layer $l$ of the FFN and $σ(x)=O(x^k)$ for some $k∈N$. Then, for any $S∈Ω$,

$$|\int_{z^{(l)}∈S}q(z^{(l)})dz^{(l)}-\int_{z^{(l)}∈S}p(z^{(l)})_{g,v}dz^{(l)}|=O(\frac{1}{n^{1.49}}) (6)$$

***
##### Case 1: theoretical values for 1-point train set; all α and β are equal to 1. Input width is 3, all layer widths are 100; number of layers is 5. 
***
The first-layer preactivation $z^{(1)}$ is Gaussian, because for each neuron it is a sum of independent Gaussian weights multiplied by fixed inputs:
$$z^{(1)}_{k,α}=b^{(1)}_k+\sum \limits _{s=1} ^{n^0}w^{(1)}_{k,s}z^{(0)}_{s,α} (7)$$
$$E(z^{(1)}_{k,α})=0 (8)$$
$$E(z^{(1)}_{k,α}z^{(1)}_{k,β})=\frac{1}{n^0}\sum \limits _{s=1} ^{n^0}z^{(0)}_{s,α}z^{(0)}_{s,β}=G^{(1)}_{α,β} (9)$$
When α=β=1,
$$E((z^{(1)}_{k,α})^2)=\frac{1}{n^0}\sum \limits _{s=1} ^{n^0}(z^{(0)}_{s,α})^2=G^{(1)}_{α,α} (10)$$
Because the first-layer distribution is exactly Gaussian, $v^{(1)}=0$ and $g_{(1)}=(G^{(1)})^{-1}$.
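
Equation (9) can be checked numerically by sampling first-layer weights many times and averaging products of preactivations. This is a minimal sketch that uses plain NumPy instead of the notebook's `TanhNet`; the sizes, the seed, and the number of trials are hypothetical, and the bias is set to zero as stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, nd = 3, 100, 2   # input width, first-layer width, number of points (assumed)
trials = 2000            # number of independent weight draws (assumed)

# Fixed inputs z^(0), one column per training point.
x = rng.normal(size=(n0, nd))

# Right-hand side of (9): G^(1) = (1/n0) * sum_s z^(0)_{s,a} z^(0)_{s,b}
G1_theory = x.T @ x / n0

# First-layer weights with variance 1/n0 and zero bias, one draw per trial.
W = rng.normal(scale=np.sqrt(1.0 / n0), size=(trials, n1, n0))
z1 = W @ x               # shape (trials, n1, nd)

# Left-hand side of (9): average z_{k,a} z_{k,b} over trials and neurons.
G1_empirical = np.einsum('tka,tkb->ab', z1, z1) / (trials * n1)

# Mean of the preactivations, which (8) predicts to be zero.
z1_mean = z1.mean(axis=(0, 1))
```

Comparing `G1_empirical` with `G1_theory` entry by entry shows how quickly the sample estimate approaches the closed-form value as `trials` grows.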

In [3]:
# n0: dimension of x
# nk: number of hidden nodes
# nl: dimension of y
# l: number of layers
# nd: number of points in the train set
n0,nk,nl,l=3,100,100,5
nd = 3
# slope_plus, slope_minus: slopes for ReLU (only needed for ParametricReLUNet)
# experiments_number: number of experiments
#slope_plus, slope_minus=1.0, 0.5
experiments_number = 200

testNet = TanhNet(n0=n0,nk=nk,nl=nl,l=l)#ParametricReLUNet(n0=n0,nk=nk,nl=nl,l=l)
testNet.set_log_level("info")
#testNet.set_slopes(slope_plus, slope_minus)
testNet.set_gmetric_recording_indices([(1,1),(1,2),(2,2)])

xx = np.random.normal(size=(n0, nd)).astype(np.float32)
# network outputs, one slice per experiment
yy = np.zeros((experiments_number, nl, nd))
#weights distribution variances are set as in (5.67)
cb, cw = 0, 1 #2.0/(slope_plus**2.0 + slope_minus**2.0)

G01_records = []
G00_records = []
G11_records = []

# for each experiment, re-initialise the weights and recompute the forward pass
for experiment_number in range(experiments_number):
    testNet.init_weights(cb, cw)
    res = testNet.forward(xx)
    yy[experiment_number] = res
    G00_records.append(testNet.get_gmetric(1,1).copy())
    G11_records.append(testNet.get_gmetric(2,2).copy())
    G01_records.append(testNet.get_gmetric(1,2).copy())
    
    print('-', end='')


FeedForwardNet created with n0=3, nk=100, nl=100, l=5, bias_on=False
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------