# Neural network analysis

In this notebook we use the trained neural networks to estimate the Koopman operators and implied timescales, run the Chapman-Kolmogorov test, and compute related quantities. It is set up to run in an environment with only TensorFlow and Keras, and no other packages.

In [16]:
# Contains the neural net definitions and imports
%run model.py

## Data
### Trajectories
Trajectories were acquired in five rounds of 1024 simulations each, totalling 5119 runs (one simulation failed to run) at 278 K in the $NVT$ ensemble. Postprocessing involved removing water, subsampling to 250 ps timesteps, and making molecules whole.

In [7]:
trajs = (sorted(glob("trajectories/r1/traj*.xtc")) +
         sorted(glob("trajectories/r2/traj*.xtc")) +
         sorted(glob("trajectories/r3/traj*.xtc")) +
         sorted(glob("trajectories/r4/traj*.xtc")) +
         sorted(glob("trajectories/r5/traj*.xtc")))
top = "trajectories/topol.gro"
KBT = 2.311420 # 278 K
traj_rounds = [1024, 2047, 3071, 4095, 5119]
nres = 42

# This is only really necessary for the residues in the plots
topo = md.load_topology(top)
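
The cumulative trajectory counts in `traj_rounds` and the value of `KBT` can be checked with a short calculation. This is a minimal sketch: it assumes the per-round counts are the ones passed to `sort_lengths` further down (one run missing in round 2) and that `KBT` is expressed in kJ/mol, which the notebook does not state explicitly.

```python
import numpy as np

# Successful simulations per round (assumed: round 2 lost one run)
runs_per_round = np.array([1024, 1023, 1024, 1024, 1024])

# Running total of trajectories available after each round
cumulative = np.cumsum(runs_per_round)
print(cumulative.tolist())  # [1024, 2047, 3071, 4095, 5119]

# Thermal energy at 278 K, assuming units of kJ/mol
R = 8.314462618e-3  # molar gas constant in kJ / (mol K)
kbt = R * 278.0
print(round(kbt, 4))  # 2.3114
```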

We use minimum distances as features for the neural network:

In [8]:
feat = pe.coordinates.featurizer(top)
feat.add_residue_mindist()
inpcon = pe.coordinates.source(trajs, feat)

lengths = sort_lengths(inpcon.trajectory_lengths(), [1024, 1023, 1024, 1024, 1024])
nframes = inpcon.trajectory_lengths().sum()



In [9]:
print("Trajectories: {0}".format(len(trajs)))
print("Frames: {0}".format(nframes))
print("Time: {0:5.3f} µs".format(inpcon.trajectory_lengths().sum() * 0.00025))

Trajectories: 5119
Frames: 1259172
Time: 314.793 µs
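
The reported simulation time follows directly from the frame count and the 250 ps frame spacing. The sketch below repeats that arithmetic with the numbers printed above; the mean trajectory length is a derived quantity added for illustration only.

```python
n_frames = 1259172      # total number of frames reported above
n_trajs = 5119          # number of trajectories reported above
timestep_ps = 250.0     # frame spacing after subsampling

# Aggregate simulation time: picoseconds to microseconds
total_us = n_frames * timestep_ps * 1e-6
print("Time: {0:5.3f} µs".format(total_us))  # Time: 314.793 µs

# Mean length of a single trajectory in nanoseconds
mean_ns = n_frames / n_trajs * timestep_ps * 1e-3
print("Mean trajectory length: {0:.1f} ns".format(mean_ns))
```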


## VAMPNet
VAMPNet [1] is composed of two lobes, one reading the system features $\mathbf{x}$ at a time point $t$ and the other reading them at $t + \tau$, after a lag time $\tau$. In this case the network reads the minimum inter-residue distances (780 values) and sends them through 5 layers with 256 nodes each. The final layer uses between 2 and 8 *softmax* outputs to yield a state assignment vector $\chi: \mathbb{R}^m \to \Delta^{n}$, where $\Delta^{n} = \{ s \in \mathbb{R}^n \mid 0 \le s_i \le 1, \sum_{i=1}^n s_i = 1 \}$ is the set of state assignment probabilities. Each lobe thus transforms a system configuration into a vector of state occupation probabilities. We can also read these values as a measure of confidence, i.e. how sure the network is that the system belongs to a given state. These outputs are then used as the input for the VAMP scoring function. We use the extended version with physical constraints [2], in particular the constraints for positive entries and reversibility.

[1] Mardt, A., Pasquali, L., Wu, H. & Noé, F. VAMPnets for deep learning of molecular kinetics. Nat Comms 1–11 (2017). doi:10.1038/s41467-017-02388-1

[2] Mardt, A., Pasquali, L., Noé, F. & Wu, H. Deep learning Markov and Koopman models with physical constraints. arXiv:1912.07392 [physics] (2019).
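
The mapping $\chi$ described above can be illustrated with a small NumPy forward pass. This is only a sketch with random, untrained weights. The helper names `make_lobe` and `chi` are hypothetical, the actual network is defined in `model.py`, and sharing one set of weights between the two lobes is an assumption made here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    # SELU activation; the negative branch is clipped to avoid overflow
    return scale * np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def softmax(z):
    # Numerically stable softmax over the last axis
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def make_lobe(n_in, width, depth, n_out):
    # Random weights with standard deviation sqrt(1 / fan_in), zero biases
    sizes = [n_in] + [width] * depth + [n_out]
    return [(rng.normal(0.0, np.sqrt(1.0 / a), size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def chi(x, params):
    # Hidden layers with SELU, final layer with softmax
    h = x
    for w, b in params[:-1]:
        h = selu(h @ w + b)
    w, b = params[-1]
    return softmax(h @ w + b)

params = make_lobe(n_in=780, width=256, depth=5, n_out=4)
x_t = rng.normal(size=(10, 780))     # synthetic features at time t
x_tau = rng.normal(size=(10, 780))   # synthetic features at time t + tau
chi_t, chi_tau = chi(x_t, params), chi(x_tau, params)
print(chi_t.shape, chi_t.sum(axis=1))
```

Every row of `chi_t` and `chi_tau` is non-negative and sums to one, so each frame is assigned a point on the simplex.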

### Data preparation
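We use minimum inter-residue distances as input for the neural network. For $N = 42$ residues there are $\frac{N(N-1)}{2} = 861$ residue pairs; we drop the pairs that are only one or two positions apart in sequence (the first two off-diagonals of the distance matrix), which leaves 780 input values: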

In [10]:
filename = "intermediate/mindist-780.npy"
if os.path.exists(filename):
    print("Loading existing file for ensemble: {0}".format(filename))
    input_flat = np.load(filename)
else:
    print("No mindist file for ensemble, calculating from scratch...")
    input_flat = np.vstack(inpcon.get_output())
    np.save(filename, input_flat)
input_data = unflatten(input_flat, lengths)

Loading existing file for ensemble: intermediate/mindist-780.npy
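
The input dimension of 780 can be reproduced by counting residue pairs. The sketch below assumes that the retained features are exactly the pairs separated by more than two positions in sequence. That assumption is consistent with the count, but the sketch does not verify it against the featurizer itself.

```python
import itertools
import numpy as np

nres = 42

# All unique residue pairs (i, j) with i < j
all_pairs = np.array(list(itertools.combinations(range(nres), 2)))

# Keep only pairs more than two positions apart in sequence
separation = all_pairs[:, 1] - all_pairs[:, 0]
kept = all_pairs[separation > 2]

print(len(all_pairs), len(kept))  # 861 780
```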


We also compute the minimum distances for all residue pairs, which some of the later analysis uses:

In [11]:
allpairs = np.asarray(list(itertools.combinations(range(nres), 2)))
filename = "intermediate/mindist-all.npy"
if os.path.exists(filename):
    print("Loading existing file for ensemble: {0}".format(filename))
    mindist_flat = np.load(filename)
else:
    print("No mindist file for ensemble, calculating from scratch...")
    feat = pe.coordinates.featurizer(top)
    feat.add_residue_mindist(residue_pairs=allpairs)
    inpmindist = pe.coordinates.source(trajs, feat)
    mindist_flat = np.vstack(inpmindist.get_output())
    np.save(filename, mindist_flat)
mindist = unflatten(mindist_flat, lengths)

Loading existing file for ensemble: intermediate/mindist-all.npy
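
For plotting or contact analysis, one frame of the flat all-pairs vector can be turned back into a symmetric residue-by-residue matrix using the same pair ordering as `allpairs`. This is a sketch on synthetic values. The helper `to_matrix` is hypothetical and not part of the notebook.

```python
import itertools
import numpy as np

nres = 42
allpairs = np.asarray(list(itertools.combinations(range(nres), 2)))

def to_matrix(frame, pairs, n):
    # Fill both triangles of an n x n matrix from a flat vector of pair values
    mat = np.zeros((n, n))
    mat[pairs[:, 0], pairs[:, 1]] = frame
    mat[pairs[:, 1], pairs[:, 0]] = frame
    return mat

# Synthetic distances standing in for one row of mindist_flat
frame = np.random.default_rng(0).uniform(0.3, 3.0, size=len(allpairs))
mat = to_matrix(frame, allpairs, nres)
print(mat.shape)  # (42, 42)
```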


### Neural network hyperparameters
To allow for a larger hyperparameter search space, we use the self-normalizing neural network approach by Klambauer *et al.* [3], which combines SELU units, `AlphaDropout`, and LeCun normal weight initialization (`lecun_normal`). The remaining hyperparameters are set in the cell below.

[3] Klambauer, G., Unterthiner, T., Mayr, A. & Hochreiter, S. Self-Normalizing Neural Networks. arXiv:1706.02515 [cs.LG] (2017).
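
The self-normalizing property can be checked numerically: with standardized inputs and weights drawn with variance $1/\text{fan-in}$, the SELU activations of a layer keep a mean close to 0 and a variance close to 1. The sketch below uses plain NumPy with synthetic Gaussian inputs. The layer sizes mirror the ones used here, but nothing else is taken from `model.py`.

```python
import numpy as np

ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(x):
    # Scaled exponential linear unit with the constants from Klambauer et al.
    return SCALE * np.where(x > 0, x, ALPHA * np.expm1(np.minimum(x, 0.0)))

rng = np.random.default_rng(0)
n_in, width = 780, 256

# Standardized synthetic inputs and LeCun-normal weights (variance 1 / fan_in)
x = rng.normal(size=(2000, n_in))
w = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, width))

activations = selu(x @ w)
print(activations.mean(), activations.var())
```

Both printed values should be close to 0 and 1 respectively, which is what allows deep stacks of such layers to train without explicit normalization.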

In [12]:
activation = "selu"              # NN activation function
init = "lecun_normal"            # NN weight initialization
lag = 50                         # Lag time
n_epoch = 100                    # Max. number of epochs
n_epoch_s = 10000                # Max. number of epochs for S optimization
n_batch = 5000                   # Training batch size
n_dims = input_data[0].shape[1]  # Input dimension
nres = 42                        # Number of residues
epsilon = 1e-7                   # Floating point noise
dt = 0.25                        # Trajectory timestep in ns
steps = 6                        # CK test steps
bs_frames = 900000               # Number of frames in the bootstrap sample
ratio = 0.9                      # Train-Test split ratio
attempts = 20                    # Number of times to run
width = 256                      # Layer width
depth = 5                        # Number of layers
learning_rate = 5e-2             # Learning rate for Chi layers
dropout = 0.0                    # Dropout for Chi layers
regularization = 1e-8            # L2 regularization strength for Chi layers

outsizes = np.array([2, 3, 4, 5, 6, 7, 8])
lags = np.array([1, 2, 5, 10, 20, 50, 100])
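
To relate these settings to physical time, the lag times can be multiplied by the trajectory timestep. The sketch assumes that `lag` and `lags` are given in frames, which the comments above suggest but do not state.

```python
import numpy as np

dt = 0.25                                   # trajectory timestep in ns
lag = 50                                    # network lag, assumed to be in frames
lags = np.array([1, 2, 5, 10, 20, 50, 100])

lags_ns = lags * dt
print(lags_ns.tolist())  # [0.25, 0.5, 1.25, 2.5, 5.0, 12.5, 25.0]
print(lag * dt)          # 12.5
```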

# Analysis

## Model validation
We load the previously trained neural network models and calculate the implied timescales, Chapman-Kolmogorov test, and the Koopman operators. This can take a long time, as the constraint vectors have to be re-estimated for every lag time, so we save the intermediate results.

In [80]:
# Takes ~5 days on a V100
with h5py.File("intermediate/data.hdf5", "w") as write:
    for i in range(attempts):
        # One HDF5 group per training attempt
        att = write.create_group(str(i))
        generator = DataGenerator.from_state(input_data, "models/model-idx-{0}.hdf5".format(i))
        for n in outsizes:
            print("Analysing n={0} i={1}...".format(n, i))
            # One subgroup per number of output states
            out = att.create_group(str(n))
            # Rebuild the model with the training hyperparameters, then load the saved weights
            koop = KoopmanModel(n=n, network_lag=lag, verbose=0, nnargs=dict(
                width=width, depth=depth, learning_rate=learning_rate,
                regularization=regularization, dropout=dropout,
                batchnorm=True, lr_factor=1e-2))
            koop.load("models/model-ve-{0}-{1}.hdf5".format(n, i))
            koop.generator = generator
            # Koopman operator at the training lag and implied timescales over all lags
            out.create_dataset("k", data=koop.estimate_koopman(lag=lag))
            out.create_dataset("mu", data=koop.mu)
            out.create_dataset("its", data=koop.its(lags))
            # Chapman-Kolmogorov test
            ckes, ckps = koop.cktest(steps)
            out.create_dataset("cke", data=ckes)
            out.create_dataset("ckp", data=ckps)
            # Network output for the bootstrap sample and for all frames
            out.create_dataset("bootstrap", data=koop.transform(koop.data.trains[0]))
            out.create_dataset("full", data=koop.transform(generator.data_flat))
            del ckes, ckps, koop

Analysing n=2 i=0...
Analysing n=3 i=0...


KeyboardInterrupt: 
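
The three validation quantities can be illustrated on synthetic data. The sketch below is not the constrained estimator from [2] that `KoopmanModel` uses. It applies the plain estimator $K(\tau) = C_{00}^{-1} C_{0\tau}$ to one-hot state assignments from a simulated 3-state Markov chain, computes implied timescales as $t_i = -\tau / \ln |\lambda_i|$, and compares $K(k\tau)$ with $K(\tau)^k$ as a Chapman-Kolmogorov check. The function names and the transition matrix `T` are made up for this example.

```python
import numpy as np

def estimate_koopman(chi, lag):
    """Unconstrained estimate K = C00^-1 C0t from state assignments chi."""
    x0, xt = chi[:-lag], chi[lag:]
    c00 = x0.T @ x0 / len(x0)   # instantaneous covariance
    c0t = x0.T @ xt / len(x0)   # time-lagged covariance
    return np.linalg.solve(c00, c0t)

def implied_timescales(k, lag, dt=1.0):
    """t_i = -lag * dt / ln|lambda_i|, skipping the stationary eigenvalue."""
    ev = np.sort(np.abs(np.linalg.eigvals(k)))[::-1]
    return -lag * dt / np.log(ev[1:])

# Synthetic 3-state Markov chain standing in for the network output
T = np.array([[0.98, 0.02, 0.00],
              [0.01, 0.98, 0.01],
              [0.00, 0.03, 0.97]])
rng = np.random.default_rng(1)
n_steps = 200_000
cum = T.cumsum(axis=1)
cum[:, -1] = 1.0                      # guard against round-off in the row sums
u = rng.random(n_steps)
states = np.zeros(n_steps, dtype=int)
for t in range(1, n_steps):
    states[t] = np.searchsorted(cum[states[t - 1]], u[t], side="right")
chi = np.eye(3)[states]               # one-hot rows lie on the simplex

# Koopman matrix and implied timescales at a lag of one step
k1 = estimate_koopman(chi, lag=1)
its = implied_timescales(k1, lag=1)

# Chapman-Kolmogorov check: estimate at k * lag versus k-th power of the estimate at lag
base_lag = 5
k_base = estimate_koopman(chi, lag=base_lag)
ck_err = []
for step in range(1, 4):
    estimated = estimate_koopman(chi, lag=step * base_lag)
    predicted = np.linalg.matrix_power(k_base, step)
    ck_err.append(np.abs(estimated - predicted).max())

print(np.round(k1, 3))
print(its, ck_err)
```

For Markovian data the estimated matrix should be close to `T` and the Chapman-Kolmogorov deviations should stay small. Large deviations would indicate that the chosen lag time or state definition does not give Markovian dynamics.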

### Convergence
We want to check how well converged the ensemble is with respect to the timescales and stationary distribution given by our model. To do so, we re-estimate the Koopman operator of the trained models on growing subsets of the trajectories, one subset per simulation round:

In [34]:
n = 4

In [42]:
filename = "intermediate/k-conv-{0}-t.npy".format(n)
# One n x n Koopman matrix per trajectory subset and per training attempt
k_conv = np.empty((len(traj_rounds), attempts, n, n))
for j, nt in enumerate(traj_rounds):
    for i in range(attempts):
        # Use only the first nt trajectories, i.e. all rounds up to round j + 1
        generator = DataGenerator(input_data[:nt])
        print("Analysing trajs={0} n={1} i={2}...".format(j, n, i), end="\r")
        koop = KoopmanModel(n=n, network_lag=lag, verbose=0, nnargs=dict(
            width=width, depth=depth, learning_rate=learning_rate,
            regularization=regularization, dropout=dropout,
            batchnorm=True, lr_factor=1e-2))
        koop.load("models/model-ve-{0}-{1}.hdf5".format(n, i))
        koop.generator = generator
        k_conv[j, i] = koop.estimate_koopman(lag=lag)
np.save(filename, k_conv)

Analysing trajs=4 n=4 i=19...
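
Once `k_conv` is available, convergence can be judged by how the stationary distribution changes with the number of trajectories and how much it varies across attempts. The sketch below shows one way to do this on a synthetic array of the same layout (rounds, attempts, states, states). The noise model, the 3-state `base` matrix, and the helper `stationary_distribution` are assumptions for illustration and do not come from the notebook.

```python
import numpy as np

def stationary_distribution(k):
    # Left eigenvector of a row-stochastic matrix for the largest eigenvalue
    evals, evecs = np.linalg.eig(k.T)
    pi = np.abs(evecs[:, np.argmax(evals.real)].real)
    return pi / pi.sum()

rng = np.random.default_rng(0)
traj_rounds = [1024, 2047, 3071, 4095, 5119]
attempts = 20
base = np.array([[0.90, 0.07, 0.03],
                 [0.05, 0.90, 0.05],
                 [0.02, 0.08, 0.90]])

# Synthetic stand-in for k_conv: noise shrinks as more trajectories are used
k_conv = np.empty((len(traj_rounds), attempts, 3, 3))
for j, nt in enumerate(traj_rounds):
    noisy = np.abs(base + rng.normal(scale=0.2 / np.sqrt(nt), size=(attempts, 3, 3)))
    k_conv[j] = noisy / noisy.sum(axis=-1, keepdims=True)

# Stationary distribution per round and attempt, then percentiles over attempts
pi = np.array([[stationary_distribution(k) for k in per_round] for per_round in k_conv])
lower, median, upper = np.percentile(pi, [2.5, 50, 97.5], axis=1)

for nt, med, lo_, hi_ in zip(traj_rounds, median, lower, upper):
    print(nt, np.round(med, 3), np.round(hi_ - lo_, 4))
```

A shrinking interval width and a stable median as the number of trajectories grows would suggest that the ensemble is converged with respect to the stationary distribution. The same pattern applies to the implied timescales computed from each matrix.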