# Voxel VAE-GAN Training

This notebook provides a holistic VAE-GAN training experience. You can adjust the model and training parameters through the Sacred configuration, view training progress in TensorBoard, and (work in progress) create reconstructions with the saved models.

References:

* https://github.com/anitan0925/vaegan/blob/master/examples/train.py
  * Trains the VAE and the GAN separately for 20 epochs, then the combined VAE-GAN for 200
* https://github.com/jlindsey15/VAEGAN/blob/master/main.py
  * Fairly clear code for the VAE-GAN paper
* https://arxiv.org/pdf/1512.09300.pdf
  * The VAE-GAN paper (Larsen et al., "Autoencoding beyond pixels using a learned similarity metric")
* https://github.com/timsainb/Tensorflow-MultiGPU-VAE-GAN
  * Best code yet!
  

## Setup

In [1]:
import env
from train_vaegan import train_vaegan
from data.thingi10k import Thingi10k
from data.modelnet10 import ModelNet10
from data import MODELNET10_TOILET_INDEX, MODELNET10_SOFA_INDEX, MODELNET10_SOFA_TOILET_INDEX
from models import MODEL_DIR


# plot things
%matplotlib inline
# autoreload modules
%load_ext autoreload
%autoreload 2

## Prepare Sacred Experiment

In [2]:
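import datetime
import os

from sacred import Experiment
from sacred.observers import FileStorageObserver

# interactive=True lets Sacred run an experiment that is defined inside a notebook
ex = Experiment(name='voxel_vaegan_notebook', interactive=True)
# Each run's config and captured output are stored under experiments_vaegan/
ex.observers.append(FileStorageObserver.create('experiments_vaegan'))

@ex.main
def run_experiment(cfg):
    # Sacred injects the 'cfg' config entry as this argument
    train_vaegan(cfg)

# Remembers the last model dir we trained into so the training cell can refuse to reuse it
last_model_dir = None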

## Prepare Model Config

The model dir name is generated from a timestamp. This keeps you from overwriting past results and gives each run its own log directory, so TensorBoard does not mix runs together.

But be warned! These model dirs can take up space, so you might need to periodically go back and delete ones you do not care about.

Also, if you ever train a model that you would really like to keep, I recommend moving it to a new directory with a special name like "best_model_ever".
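
To decide which model dirs to delete, it can help to see how much space each one takes. The following is a minimal sketch, not part of this project: `dir_size_bytes` and `list_run_dirs` are hypothetical helper names, and the sketch assumes the run directories are named with the sortable timestamp format used by `make_cfg`.

```python
import os


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def list_run_dirs(parent):
    """Return (name, size_bytes) for each run directory under parent.

    Names like 2019-03-18_13-12-53 sort chronologically, so the
    result is ordered oldest first.
    """
    runs = []
    for name in sorted(os.listdir(parent)):
        full_path = os.path.join(parent, name)
        if os.path.isdir(full_path):
            runs.append((name, dir_size_bytes(full_path)))
    return runs


# Example (hypothetical usage; MODEL_DIR comes from the models package):
# for name, size in list_run_dirs(os.path.join(MODEL_DIR, 'voxel_vaegan1/modelnet10')):
#     print('{}  {:.1f} MB'.format(name, size / 1e6))
```

The sketch only reports sizes; deleting a run is left as a manual step so that a model you want to keep is never removed by accident.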

In [3]:
DATASET_CLASS = 'ModelNet10'
#INDEX = MODELNET10_SOFA_TOILET_INDEX
#INDEX = MODELNET10_SOFA_INDEX
INDEX = MODELNET10_TOILET_INDEX

def make_cfg():
    model_dir = os.path.join(
        MODEL_DIR,
        'voxel_vaegan1/modelnet10/{}'.format(datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')))
    print(model_dir)
    os.makedirs(model_dir)

    cfg = {
        'cfg': {
            "dataset": {
                "class": DATASET_CLASS,
                "index": INDEX,
                #"tag": "animal",
                #"filter_id": 126660,
                #"pctile": 1.0,
                "splits": True
                #"splits": {
                #    "train": .8,
                #    "dev": .1,
                #    "test": .1
                #}
            },
            "generator": {
                "verbose": True,
                "pad": True
            }, 
            "model": {
                "ckpt_dir": model_dir,
                "voxels_dim": 32,
                "batch_size": 32,
                # Do 0.0001 for 1 epoch, then 0.001 for rest of training
                #"learning_rate": [(1, 0.0001), (None, 0.001)],
                #"learning_rate": 0.0001,
                "enc_lr": 0.0001,
                "dec_lr": 0.0001,
                "dis_lr": 0.0001,
                "epochs": 201,
                "keep_prob": 1.0,
                "kl_div_loss_weight": 100,
                "recon_loss_weight": 10000,
                "ll_weight": .0001,
                "dec_weight": 100,
                "latent_dim": 100,
                "verbose": True,
                "debug": False,
                "input_repeats": 1,
                "display_step": 1,
                #"example_stl_id": 126660,
                "voxel_prob_threshold": 0.065,
                "dev_step": 10,
                "save_step": 10,
                'launch_tensorboard': True,
                'tb_dir': 'tb',
                #'tb_compare': [('best_sofa_and_toilet', '/home/jcworkma/jack/3d-form/models/voxel_vaegan1/modelnet10/2019-03-15_17-08-43/tb')],
                #'tb_compare': [('best_vaegan', '/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-17_08-40-29/tb')],
                'tb_compare': [('vaegan_100epochs_toilets', '/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-18_13-12-53/tb')],
                'no_gan': False,
                'monitor_memory': True,
                # these settings control how often the components' optimizers are executed during the training loop
                'train_vae_cadence': 1,
                'train_gan_cadence': 1
            }
        }
    }
    
    return cfg
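
Two of the settings above are easier to reason about with a small example: the `train_vae_cadence` / `train_gan_cadence` values and the commented-out `learning_rate` schedule. How `train_vaegan` actually interprets them is not shown in this notebook, so the sketch below is only one plausible reading. It assumes a cadence of `n` means "run that optimizer every n-th epoch, starting at epoch 0", and that a schedule is a list of `(num_epochs, rate)` pairs where `None` means "all remaining epochs". The helper names `should_train` and `lr_for_epoch` are hypothetical.

```python
def should_train(epoch, cadence):
    """Return True if a component's optimizer should run in this epoch.

    Assumption: cadence n means every n-th epoch, starting at epoch 0.
    """
    return cadence > 0 and epoch % cadence == 0


def lr_for_epoch(schedule, epoch):
    """Resolve the learning rate for an epoch.

    schedule is either a single number or a list of (num_epochs, rate)
    pairs; num_epochs=None means 'for all remaining epochs'.
    """
    if isinstance(schedule, (int, float)):
        return float(schedule)
    start = 0
    for num_epochs, rate in schedule:
        if num_epochs is None or epoch < start + num_epochs:
            return rate
        start += num_epochs
    # Past the end of a schedule with no open-ended entry: keep the last rate
    return schedule[-1][1]


# With both cadences set to 1, both components train in every epoch
print(should_train(epoch=0, cadence=1), should_train(epoch=5, cadence=1))

# The commented-out schedule: 0.0001 for the first epoch, 0.001 afterwards
schedule = [(1, 0.0001), (None, 0.001)]
print(lr_for_epoch(schedule, 0), lr_for_epoch(schedule, 1))
```

Under this reading, setting `train_gan_cadence` to 2 would update the GAN side only on even-numbered epochs while the VAE side trains every epoch.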

## Tensorboard Prep

We launch TensorBoard through Python's `subprocess` module. Sometimes that process does not die with the rest of the experiment and lingers on as a system process. This becomes a problem when we try to start TensorBoard for the next experiment, because two instances cannot bind to the same port.

The `kill_tensorboard` function imported below is designed to solve this problem. It uses the Linux `pgrep` utility to find existing TensorBoard processes and kills them. Note that this probably won't work on Windows.
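
The body of `kill_tensorboard` lives in `utils` and is not shown in this notebook. As a rough illustration of the approach described above, here is a minimal sketch that assumes `pgrep` is on the `PATH`. The names `parse_pids` and `kill_processes_by_name` are hypothetical, and the printed message only imitates the output seen in this notebook.

```python
import os
import signal
import subprocess


def parse_pids(pgrep_output):
    """Turn pgrep's stdout (bytes, one PID per line) into a list of ints."""
    return [int(token) for token in pgrep_output.decode().split()]


def kill_processes_by_name(name, sig=signal.SIGTERM):
    """Find processes matching name with pgrep and send each one a signal.

    Returns the list of PIDs that pgrep reported.
    """
    cmd = ['pgrep', name]
    # pgrep exits with status 1 when nothing matches, so do not use check=True
    result = subprocess.run(cmd, stdout=subprocess.PIPE)
    print('{} yielded -> {}'.format(cmd, result.stdout))
    pids = parse_pids(result.stdout)
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            # The process exited between the pgrep call and the kill
            pass
    return pids


# Example (hypothetical usage):
# kill_processes_by_name('tensorboard')
```

Sending `SIGTERM` rather than `SIGKILL` is a choice made for this sketch; it gives the process a chance to shut down cleanly and release its port.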

In [4]:
from utils import kill_tensorboard

kill_tensorboard()

['pgrep', 'tensorboard'] yielded -> b''


## Training

We start with a check that we are not about to overwrite the last model dir. If you are blocked by the assert, re-execute the cell that calls `make_cfg()` to generate a new model dir. This will allow you to move ahead with training.

The Sacred experiment saves a copy of your experiment settings in the `experiments_vaegan` directory. You can go back to it later to retrieve the config of a particularly good run.

If `launch_tensorboard` is enabled, open TensorBoard at localhost:6006 or your_ip:6006
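
To retrieve the settings of an earlier run, you can read them back from the observer directory. This is a minimal sketch that assumes Sacred's `FileStorageObserver` layout, in which each run gets a subdirectory named after its run ID containing a `config.json`. `load_run_config` is a hypothetical helper, not part of this project.

```python
import json
import os


def load_run_config(observer_dir, run_id):
    """Load the config that Sacred stored for one run.

    Assumes the FileStorageObserver layout: <observer_dir>/<run_id>/config.json
    """
    path = os.path.join(observer_dir, str(run_id), 'config.json')
    with open(path) as f:
        return json.load(f)


# Example (hypothetical usage, with the run ID printed by Sacred at start-up):
# saved = load_run_config('experiments_vaegan', 217)
# print(saved['cfg']['model']['latent_dim'])
```

Because the config is stored as JSON, tuples such as the entries of `tb_compare` come back as lists. Also remember that the saved `ckpt_dir` points at the old model dir, so replace it before reusing the config for a new run.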
   

In [None]:
cfg = make_cfg()
model_dir = cfg['cfg']['model']['ckpt_dir']
kill_tensorboard()

/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-20_09-39-05
['pgrep', 'tensorboard'] yielded -> b''


In [None]:
# Refuse to train into the directory used by the previous run
assert last_model_dir != model_dir, "Don't overwrite! Re-run make_cfg() to get a new model dir."
last_model_dir = model_dir

ex.run(config_updates=cfg)

INFO - voxel_vaegan_notebook - Running command 'run_experiment'
INFO - voxel_vaegan_notebook - Started run with ID "217"


Logging to /home/jcworkma/jack/3d-form/src/logs/2019-03-20_09-39__root.log
Starting train_vaegan main
Numpy random seed: 639199290
Saved cfg: /home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-20_09-39-05/cfg.json
Dataset: <class 'data.modelnet10.ModelNet10'>
Using dataset index /home/jcworkma/jack/3d-form/src/../data/processed/modelnet10_toilet_index.csv and pctile None
Shuffling dataset
dataset n_input=7104
Splitting Datasets
Num input = 7104
Num batches per epoch = 222.00
Initializing VoxelVaegan
['tensorboard', '--logdir', 'current:/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-20_09-39-05/tb,vaegan_100epochs_toilets:/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-18_13-12-53/tb']
Epoch: 0, Elapsed Time: 0.03
Training VAE in this epoch
Training GAN in this epoch
Epoch: 0 / 201, Batch: 0 (0 / 32), Elapsed time: 0.03 mins
Enc Loss = 8.17, KL Divergence = 0.08, Reconstruction Loss = 0.11, ll_loss = 1.09, di

Memory Use (GB): 1.9497795104980469
Epoch: 0 / 201, Batch: 31 (0 / 1024), Elapsed time: 5.74 mins
Enc Loss = 4.64, KL Divergence = 0.01, Reconstruction Loss = 0.13, ll_loss = 40367.56, dis_Loss = 0.00, dec_Loss = 4.04, Elapsed time: 5.93 mins
Memory Use (GB): 1.90728759765625
Epoch: 0 / 201, Batch: 32 (0 / 1056), Elapsed time: 5.93 mins
Enc Loss = 3.51, KL Divergence = 0.01, Reconstruction Loss = 0.15, ll_loss = 29132.92, dis_Loss = 0.00, dec_Loss = 2.91, Elapsed time: 6.11 mins
Memory Use (GB): 1.9228401184082031
Epoch: 0 / 201, Batch: 33 (0 / 1088), Elapsed time: 6.11 mins
Enc Loss = 4.62, KL Divergence = 0.01, Reconstruction Loss = 0.12, ll_loss = 40344.05, dis_Loss = 0.00, dec_Loss = 4.03, Elapsed time: 6.29 mins
Memory Use (GB): 1.9741172790527344
Epoch: 0 / 201, Batch: 34 (0 / 1120), Elapsed time: 6.29 mins
Enc Loss = 3.62, KL Divergence = 0.01, Reconstruction Loss = 0.15, ll_loss = 30467.50, dis_Loss = 0.00, dec_Loss = 3.05, Elapsed time: 6.48 mins
Memory Use (GB): 2.09127044677

Memory Use (GB): 2.143524169921875
Epoch: 0 / 201, Batch: 65 (0 / 2112), Elapsed time: 12.00 mins
Enc Loss = 2.45, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 20302.35, dis_Loss = 0.00, dec_Loss = 2.03, Elapsed time: 12.19 mins
Memory Use (GB): 2.095867156982422
Epoch: 0 / 201, Batch: 66 (0 / 2144), Elapsed time: 12.19 mins
Enc Loss = 1.95, KL Divergence = 0.00, Reconstruction Loss = 0.22, ll_loss = 15098.62, dis_Loss = 0.00, dec_Loss = 1.51, Elapsed time: 12.37 mins
Memory Use (GB): 1.9308280944824219
Epoch: 0 / 201, Batch: 67 (0 / 2176), Elapsed time: 12.37 mins
Enc Loss = 2.19, KL Divergence = 0.00, Reconstruction Loss = 0.20, ll_loss = 17954.50, dis_Loss = 0.00, dec_Loss = 1.80, Elapsed time: 12.56 mins
Memory Use (GB): 2.1852684020996094
Epoch: 0 / 201, Batch: 68 (0 / 2208), Elapsed time: 12.56 mins
Enc Loss = 1.76, KL Divergence = 0.00, Reconstruction Loss = 0.23, ll_loss = 14005.86, dis_Loss = 0.00, dec_Loss = 1.40, Elapsed time: 12.74 mins
Memory Use (GB): 2.033

Memory Use (GB): 1.971282958984375
Epoch: 0 / 201, Batch: 99 (0 / 3200), Elapsed time: 18.24 mins
Enc Loss = 2.35, KL Divergence = 0.00, Reconstruction Loss = 0.22, ll_loss = 20631.74, dis_Loss = 0.00, dec_Loss = 2.06, Elapsed time: 18.43 mins
Memory Use (GB): 2.0241165161132812
Epoch: 0 / 201, Batch: 100 (0 / 3232), Elapsed time: 18.43 mins
Enc Loss = 2.48, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 21654.59, dis_Loss = 0.00, dec_Loss = 2.17, Elapsed time: 18.61 mins
Memory Use (GB): 2.0200462341308594
Epoch: 0 / 201, Batch: 101 (0 / 3264), Elapsed time: 18.61 mins
Enc Loss = 2.57, KL Divergence = 0.00, Reconstruction Loss = 0.20, ll_loss = 22962.78, dis_Loss = 0.00, dec_Loss = 2.30, Elapsed time: 18.80 mins
Memory Use (GB): 2.0233154296875
Epoch: 0 / 201, Batch: 102 (0 / 3296), Elapsed time: 18.80 mins
Enc Loss = 3.12, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 28355.51, dis_Loss = 0.00, dec_Loss = 2.84, Elapsed time: 18.98 mins
Memory Use (GB): 1.97

Enc Loss = 4.88, KL Divergence = 0.00, Reconstruction Loss = 0.27, ll_loss = 46257.49, dis_Loss = 0.00, dec_Loss = 4.63, Elapsed time: 24.48 mins
Memory Use (GB): 2.1381149291992188
Epoch: 0 / 201, Batch: 133 (0 / 4288), Elapsed time: 24.48 mins
Enc Loss = 2.37, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 21393.76, dis_Loss = 0.00, dec_Loss = 2.14, Elapsed time: 24.66 mins
Memory Use (GB): 2.1051177978515625
Epoch: 0 / 201, Batch: 134 (0 / 4320), Elapsed time: 24.66 mins
Enc Loss = 1.81, KL Divergence = 0.00, Reconstruction Loss = 0.21, ll_loss = 15865.39, dis_Loss = 0.00, dec_Loss = 1.59, Elapsed time: 24.84 mins
Memory Use (GB): 2.0679397583007812
Epoch: 0 / 201, Batch: 135 (0 / 4352), Elapsed time: 24.84 mins
Enc Loss = 3.25, KL Divergence = 0.00, Reconstruction Loss = 0.27, ll_loss = 29983.93, dis_Loss = 0.00, dec_Loss = 3.00, Elapsed time: 25.02 mins
Memory Use (GB): 1.9607582092285156
Epoch: 0 / 201, Batch: 136 (0 / 4384), Elapsed time: 25.02 mins
Enc Loss = 3.05,

Memory Use (GB): 1.9509773254394531
Epoch: 0 / 201, Batch: 166 (0 / 5344), Elapsed time: 30.55 mins
Enc Loss = 4.90, KL Divergence = 0.00, Reconstruction Loss = 0.23, ll_loss = 47300.03, dis_Loss = 0.00, dec_Loss = 4.73, Elapsed time: 30.74 mins
Memory Use (GB): 2.0061264038085938
Epoch: 0 / 201, Batch: 167 (0 / 5376), Elapsed time: 30.74 mins
Enc Loss = 3.69, KL Divergence = 0.00, Reconstruction Loss = 0.23, ll_loss = 35125.70, dis_Loss = 0.00, dec_Loss = 3.51, Elapsed time: 30.92 mins
Memory Use (GB): 1.8374214172363281
Epoch: 0 / 201, Batch: 168 (0 / 5408), Elapsed time: 30.93 mins
Enc Loss = 4.00, KL Divergence = 0.00, Reconstruction Loss = 0.27, ll_loss = 38040.08, dis_Loss = 0.00, dec_Loss = 3.80, Elapsed time: 31.11 mins
Memory Use (GB): 2.0280075073242188
Epoch: 0 / 201, Batch: 169 (0 / 5440), Elapsed time: 31.11 mins
Enc Loss = 4.92, KL Divergence = 0.00, Reconstruction Loss = 0.25, ll_loss = 47549.86, dis_Loss = 0.00, dec_Loss = 4.75, Elapsed time: 31.30 mins
Memory Use (GB):

Enc Loss = 4.06, KL Divergence = 0.00, Reconstruction Loss = 0.25, ll_loss = 39198.12, dis_Loss = 0.00, dec_Loss = 3.92, Elapsed time: 36.63 mins
Memory Use (GB): 2.0288658142089844
Epoch: 1 / 201, Batch: 27 (896 / 5504), Elapsed time: 36.63 mins
Enc Loss = 4.00, KL Divergence = 0.00, Reconstruction Loss = 0.27, ll_loss = 38752.12, dis_Loss = 0.00, dec_Loss = 3.88, Elapsed time: 36.81 mins
Memory Use (GB): 1.9299583435058594
Epoch: 1 / 201, Batch: 28 (928 / 5504), Elapsed time: 36.81 mins
Enc Loss = 3.06, KL Divergence = 0.00, Reconstruction Loss = 0.23, ll_loss = 29326.63, dis_Loss = 0.00, dec_Loss = 2.93, Elapsed time: 37.00 mins
Memory Use (GB): 2.007434844970703
Epoch: 1 / 201, Batch: 29 (960 / 5504), Elapsed time: 37.00 mins
Enc Loss = 4.68, KL Divergence = 0.00, Reconstruction Loss = 0.23, ll_loss = 45474.81, dis_Loss = 0.00, dec_Loss = 4.55, Elapsed time: 37.18 mins
Memory Use (GB): 1.8186187744140625
Epoch: 1 / 201, Batch: 30 (992 / 5504), Elapsed time: 37.18 mins
Enc Loss = 2.

Memory Use (GB): 2.022808074951172
Epoch: 1 / 201, Batch: 60 (1952 / 5504), Elapsed time: 42.70 mins
Enc Loss = 3.36, KL Divergence = 0.00, Reconstruction Loss = 0.29, ll_loss = 32414.71, dis_Loss = 0.00, dec_Loss = 3.24, Elapsed time: 42.88 mins
Memory Use (GB): 1.9810295104980469
Epoch: 1 / 201, Batch: 61 (1984 / 5504), Elapsed time: 42.88 mins
Enc Loss = 2.61, KL Divergence = 0.00, Reconstruction Loss = 0.25, ll_loss = 25062.87, dis_Loss = 0.00, dec_Loss = 2.51, Elapsed time: 43.06 mins
Memory Use (GB): 1.9648818969726562
Epoch: 1 / 201, Batch: 62 (2016 / 5504), Elapsed time: 43.06 mins
Enc Loss = 2.45, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 23485.79, dis_Loss = 0.00, dec_Loss = 2.35, Elapsed time: 43.25 mins
Memory Use (GB): 2.1352577209472656
Epoch: 1 / 201, Batch: 63 (2048 / 5504), Elapsed time: 43.25 mins
Enc Loss = 5.53, KL Divergence = 0.00, Reconstruction Loss = 0.26, ll_loss = 54237.61, dis_Loss = 0.00, dec_Loss = 5.42, Elapsed time: 43.43 mins
Memory Us

Epoch: 1 / 201, Batch: 93 (3008 / 5504), Elapsed time: 48.75 mins
Enc Loss = 3.40, KL Divergence = 0.00, Reconstruction Loss = 0.25, ll_loss = 33062.24, dis_Loss = 0.00, dec_Loss = 3.31, Elapsed time: 48.93 mins
Memory Use (GB): 1.9967041015625
Epoch: 1 / 201, Batch: 94 (3040 / 5504), Elapsed time: 48.93 mins
Enc Loss = 3.80, KL Divergence = 0.00, Reconstruction Loss = 0.25, ll_loss = 37020.72, dis_Loss = 0.00, dec_Loss = 3.70, Elapsed time: 49.11 mins
Memory Use (GB): 2.0927352905273438
Epoch: 1 / 201, Batch: 95 (3072 / 5504), Elapsed time: 49.11 mins
Enc Loss = 2.89, KL Divergence = 0.00, Reconstruction Loss = 0.26, ll_loss = 27894.65, dis_Loss = 0.00, dec_Loss = 2.79, Elapsed time: 49.29 mins
Memory Use (GB): 2.1462020874023438
Epoch: 1 / 201, Batch: 96 (3104 / 5504), Elapsed time: 49.29 mins
Enc Loss = 3.95, KL Divergence = 0.00, Reconstruction Loss = 0.21, ll_loss = 38656.95, dis_Loss = 0.00, dec_Loss = 3.87, Elapsed time: 49.48 mins
Memory Use (GB): 2.085369110107422
Epoch: 1 / 2

Epoch: 1 / 201, Batch: 126 (4064 / 5504), Elapsed time: 54.80 mins
Enc Loss = 3.60, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 35146.14, dis_Loss = 0.00, dec_Loss = 3.51, Elapsed time: 54.98 mins
Memory Use (GB): 1.9380378723144531
Epoch: 1 / 201, Batch: 127 (4096 / 5504), Elapsed time: 54.98 mins
Enc Loss = 3.20, KL Divergence = 0.00, Reconstruction Loss = 0.23, ll_loss = 31159.70, dis_Loss = 0.00, dec_Loss = 3.12, Elapsed time: 55.17 mins
Memory Use (GB): 1.9082908630371094
Epoch: 1 / 201, Batch: 128 (4128 / 5504), Elapsed time: 55.17 mins
Enc Loss = 5.11, KL Divergence = 0.00, Reconstruction Loss = 0.22, ll_loss = 50344.70, dis_Loss = 0.00, dec_Loss = 5.03, Elapsed time: 55.35 mins
Memory Use (GB): 2.0745506286621094
Epoch: 1 / 201, Batch: 129 (4160 / 5504), Elapsed time: 55.35 mins
Enc Loss = 3.25, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 31681.02, dis_Loss = 0.00, dec_Loss = 3.17, Elapsed time: 55.53 mins
Memory Use (GB): 2.167449951171875
Epoch

Epoch: 1 / 201, Batch: 159 (5120 / 5504), Elapsed time: 60.88 mins
Enc Loss = 4.48, KL Divergence = 0.00, Reconstruction Loss = 0.22, ll_loss = 44084.11, dis_Loss = 0.00, dec_Loss = 4.41, Elapsed time: 61.06 mins
Memory Use (GB): 2.0189285278320312
Epoch: 1 / 201, Batch: 160 (5152 / 5504), Elapsed time: 61.06 mins
Enc Loss = 5.54, KL Divergence = 0.00, Reconstruction Loss = 0.26, ll_loss = 54654.93, dis_Loss = 0.00, dec_Loss = 5.47, Elapsed time: 61.25 mins
Memory Use (GB): 2.0789108276367188
Epoch: 1 / 201, Batch: 161 (5184 / 5504), Elapsed time: 61.25 mins
Enc Loss = 3.30, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 32362.44, dis_Loss = 0.00, dec_Loss = 3.24, Elapsed time: 61.43 mins
Memory Use (GB): 2.1438980102539062
Epoch: 1 / 201, Batch: 162 (5216 / 5504), Elapsed time: 61.43 mins
Enc Loss = 2.70, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 26264.11, dis_Loss = 0.00, dec_Loss = 2.63, Elapsed time: 61.61 mins
Memory Use (GB): 2.0905418395996094
Epoc