# Voxel VAE-GAN Training

This notebook is designed to provide a holistic VAE-GAN training experience. You can adjust the model and training parameters through the Sacred configuration, view training progress in TensorBoard, and (work in progress) create reconstructions with the saved models.

References:

* https://github.com/anitan0925/vaegan/blob/master/examples/train.py
  * Runs 20 epochs on the VAE and the GAN separately, then 200 on the combined VAE-GAN
* https://github.com/jlindsey15/VAEGAN/blob/master/main.py
  * Reasonably clear code for the VAE-GAN paper
* https://arxiv.org/pdf/1512.09300.pdf
  * The VAE-GAN paper
* https://github.com/timsainb/Tensorflow-MultiGPU-VAE-GAN
  * Best code yet!
  

## Setup

In [1]:
import env
from train_vaegan import train_vaegan
from data.thingi10k import Thingi10k
from data.modelnet10 import ModelNet10
from data import MODELNET10_TOILET_INDEX, MODELNET10_SOFA_INDEX, MODELNET10_SOFA_TOILET_INDEX
from models import MODEL_DIR


# plot things
%matplotlib inline
# autoreload modules
%load_ext autoreload
%autoreload 2

## Prepare Sacred Experiment

In [2]:
import datetime
import os

from sacred import Experiment
from sacred.observers import FileStorageObserver

ex = Experiment(name='voxel_vaegan_notebook', interactive=True)
ex.observers.append(FileStorageObserver.create('experiments_vaegan'))

@ex.main
def run_experiment(cfg):
    train_vaegan(cfg)

# Remembers the model dir of the last run so the training cell can refuse to overwrite it
last_model_dir = None

## Prepare Model Config

The model dir is generated with a timestamp. This keeps you from overwriting past results and keeps runs separate so that TensorBoard does not mix them up.

But be warned! These model dirs can take up space, so you may need to go back periodically and delete the ones you no longer care about.

Also, if you ever train a model that you would really like to keep, I recommend moving it to a new directory with a special name like "best_model_ever".
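To help with that housekeeping, here is a minimal sketch that lists the timestamped run directories with their sizes and moves one to a memorable name. The helper names (`dir_size_mb`, `list_runs`, `keep_run`) are hypothetical and not part of this project; the demo runs against a temporary directory rather than the real MODEL_DIR.

```python
import os
import shutil
import tempfile


def dir_size_mb(path):
    """Total size of all files under path, in megabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)


def list_runs(base_dir):
    """Return (run_name, size_mb) pairs, oldest first.

    Names like 2019-03-19_18-52-57 sort chronologically when
    sorted as plain strings.
    """
    runs = sorted(
        d for d in os.listdir(base_dir)
        if os.path.isdir(os.path.join(base_dir, d))
    )
    return [(d, dir_size_mb(os.path.join(base_dir, d))) for d in runs]


def keep_run(base_dir, run_name, new_name):
    """Move a run directory to a memorable name and return the new path."""
    dst = os.path.join(base_dir, new_name)
    shutil.move(os.path.join(base_dir, run_name), dst)
    return dst


# Demo on a throwaway directory (stand-in for the real model dir)
with tempfile.TemporaryDirectory() as demo_base:
    os.makedirs(os.path.join(demo_base, '2019-03-18_13-12-53'))
    os.makedirs(os.path.join(demo_base, '2019-03-15_17-08-43'))
    for run_name, size_mb in list_runs(demo_base):
        print('{}  {:.1f} MB'.format(run_name, size_mb))
    kept = keep_run(demo_base, '2019-03-18_13-12-53', 'best_model_ever')
    kept_name = os.path.basename(kept)
```

If you move a run that another config refers to through `tb_compare`, remember to update that path as well.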

In [3]:
DATASET_CLASS = 'ModelNet10'
#INDEX = MODELNET10_SOFA_TOILET_INDEX
#INDEX = MODELNET10_SOFA_INDEX
INDEX = MODELNET10_TOILET_INDEX

def make_cfg():
    model_dir = os.path.join(
        MODEL_DIR,
        'voxel_vaegan1/modelnet10/{}'.format(datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')))
    print(model_dir)
    os.makedirs(model_dir)

    cfg = {
        'cfg': {
            "dataset": {
                "class": DATASET_CLASS,
                "index": INDEX,
                #"tag": "animal",
                #"filter_id": 126660,
                #"pctile": 1.0,
                "splits": True
                #"splits": {
                #    "train": .8,
                #    "dev": .1,
                #    "test": .1
                #}
            },
            "generator": {
                "verbose": True,
                "pad": True
            }, 
            "model": {
                "ckpt_dir": model_dir,
                "voxels_dim": 32,
                "batch_size": 32,
                # Do 0.0001 for 1 epoch, then 0.001 for rest of training
                #"learning_rate": [(1, 0.0001), (None, 0.001)],
                #"learning_rate": 0.0001,
                "enc_lr": 0.0002,
                "dec_lr": 0.0002,
                "dis_lr": 0.0002,
                "epochs": 201,
                "keep_prob": 1.0,
                "kl_div_loss_weight": 100,
                "recon_loss_weight": 10000,
                "ll_weight": .0001,
                "dec_weight": 100,
                "latent_dim": 100,
                "verbose": True,
                "debug": False,
                "input_repeats": 1,
                "display_step": 1,
                #"example_stl_id": 126660,
                "voxel_prob_threshold": 0.065,
                "dev_step": 10,
                "save_step": 10,
                'launch_tensorboard': True,
                'tb_dir': 'tb',
                #'tb_compare': [('best_sofa_and_toilet', '/home/jcworkma/jack/3d-form/models/voxel_vaegan1/modelnet10/2019-03-15_17-08-43/tb')],
                #'tb_compare': [('best_vaegan', '/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-17_08-40-29/tb')],
                'tb_compare': [('vaegan_100epochs_toilets', '/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-18_13-12-53/tb')],
                'no_gan': False,
                'monitor_memory': True,
                # these settings control how often the components' optimizers are executed during the training loop
                'train_vae_cadence': 2,
                'train_gan_cadence': 1
            }
        }
    }
    
    return cfg
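The `train_vae_cadence` and `train_gan_cadence` settings are easiest to picture with a small example. The training loop in `train_vaegan` is not shown in this notebook, so the sketch below is only one plausible reading: it assumes a cadence of n means "run that optimizer on every n-th batch". The function names `should_train` and `plan_steps` are hypothetical.

```python
def should_train(step, cadence):
    """True when a component's optimizer should run on this batch.

    Assumption: cadence n means "every n-th batch"; a cadence of
    0 or None disables the component.
    """
    return bool(cadence) and step % cadence == 0


def plan_steps(n_batches, train_vae_cadence, train_gan_cadence):
    """List which optimizers would run on each batch."""
    plan = []
    for step in range(n_batches):
        ops = []
        if should_train(step, train_vae_cadence):
            ops.append('vae')
        if should_train(step, train_gan_cadence):
            ops.append('gan')
        plan.append(ops)
    return plan


# With the values from the config above (2 and 1):
print(plan_steps(4, train_vae_cadence=2, train_gan_cadence=1))
# prints [['vae', 'gan'], ['gan'], ['vae', 'gan'], ['gan']]
```

Under this reading, the config above would update the GAN components on every batch and the VAE components on every other batch.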

## Tensorboard Prep

We launch TensorBoard with a call to the Python `subprocess` module. Sometimes that process does not die with the rest of the experiment and lingers on as a system process. This becomes a problem when we try to start TensorBoard for the next experiment, because the two instances cannot share the same port!

The function below is designed to solve this problem. It uses the Linux `pgrep` utility to search for existing TensorBoard processes and kill them. Note that this probably won't work on Windows.

In [4]:
from utils import kill_tensorboard

kill_tensorboard()

['pgrep', 'tensorboard'] yielded -> b''
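The source of `utils.kill_tensorboard` is not shown here, so the following is only a sketch of how such a helper could work, assuming it shells out to `pgrep` and then signals each PID it finds. The names `parse_pids`, `find_pids`, and `kill_by_name` are hypothetical. Nothing is killed when the block is defined; you would have to call `kill_by_name()` yourself.

```python
import os
import signal
import subprocess


def parse_pids(raw):
    """Turn pgrep's stdout (bytes, one PID per line) into a list of ints."""
    return [int(tok) for tok in raw.decode().split()]


def find_pids(name):
    """Return PIDs of processes whose name matches, using pgrep.

    pgrep exits with status 1 when nothing matches, so check=True
    is deliberately not passed.
    """
    cmd = ['pgrep', name]
    result = subprocess.run(cmd, stdout=subprocess.PIPE)
    print('{} yielded -> {}'.format(cmd, result.stdout))
    return parse_pids(result.stdout)


def kill_by_name(name='tensorboard'):
    """Send SIGTERM to every matching process; return the PIDs signalled."""
    pids = find_pids(name)
    for pid in pids:
        os.kill(pid, signal.SIGTERM)
    return pids
```

An empty result such as `b''` in the output above means no lingering process was found, so there was nothing to kill.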


## Training

We start with a check that we are not about to overwrite the last model directory. If you are blocked by the assert, re-execute the cfg code above to generate a new `model_dir`. This will allow you to move ahead with training.

The Sacred experiment saves a copy of your experiment settings in the experiments directory (`experiments_vaegan`). This can be accessed later in case we need to retrieve the config of an earlier run.

If TensorBoard is enabled, tune in at localhost:6006 or your_ip:6006
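As for retrieving saved settings later: Sacred's `FileStorageObserver` writes one subdirectory per run ID, and each contains a `config.json`. The sketch below reads one back. The helper name `load_run_config` is hypothetical, and the demo builds a stand-in directory instead of touching the real `experiments_vaegan` folder; it also assumes the config holds only plain JSON types, as the one in this notebook does.

```python
import json
import os
import tempfile


def load_run_config(base_dir, run_id):
    """Load the config that Sacred saved for the given run ID."""
    path = os.path.join(base_dir, str(run_id), 'config.json')
    with open(path) as f:
        return json.load(f)


# Demo: mimic the observer's layout in a temporary directory
with tempfile.TemporaryDirectory() as base_dir:
    run_dir = os.path.join(base_dir, '211')
    os.makedirs(run_dir)
    with open(os.path.join(run_dir, 'config.json'), 'w') as f:
        json.dump({'cfg': {'model': {'latent_dim': 100}}}, f)
    restored = load_run_config(base_dir, 211)

print(restored['cfg']['model']['latent_dim'])  # prints 100
```

Against the real directory, the call would look like `load_run_config('experiments_vaegan', 211)`, using the run ID that Sacred logs when the run starts.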
   

In [None]:
cfg = make_cfg()
model_dir = cfg.get('cfg').get('model').get('ckpt_dir')
kill_tensorboard()

/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-19_18-52-57
['pgrep', 'tensorboard'] yielded -> b''


In [None]:
# Refuse to train into the same directory twice; re-run make_cfg() to get a fresh one
assert last_model_dir != model_dir, "don't overwrite! generate a new model_dir first"
last_model_dir = model_dir

ex.run(config_updates=cfg)

INFO - voxel_vaegan_notebook - Running command 'run_experiment'
INFO - voxel_vaegan_notebook - Started run with ID "211"


Logging to /home/jcworkma/jack/3d-form/src/logs/2019-03-19_18-52__root.log
Starting train_vaegan main
Numpy random seed: 10445087
Saved cfg: /home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-19_18-52-57/cfg.json
Dataset: <class 'data.modelnet10.ModelNet10'>
Using dataset index /home/jcworkma/jack/3d-form/src/../data/processed/modelnet10_toilet_index.csv and pctile None
Shuffling dataset
dataset n_input=7104
Splitting Datasets
Num input = 7104
Num batches per epoch = 222.00
Initializing VoxelVaegan
['tensorboard', '--logdir', 'current:/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-19_18-52-57/tb,vaegan_100epochs_toilets:/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-18_13-12-53/tb']
Epoch: 0, Elapsed Time: 0.03
Epoch: 0 / 201, Batch: 0 (0 / 32), Elapsed time: 0.03 mins
Enc Loss = 0.66, KL Divergence = 0.01, Reconstruction Loss = 0.11, ll_loss = 0.00, dis_Loss = 0.69, dec_Loss = 0.69, Elapsed time: 0.12 mins

Memory Use (GB): 1.6328125
Epoch: 0 / 201, Batch: 31 (0 / 1024), Elapsed time: 2.89 mins
Enc Loss = 39.02, KL Divergence = 0.00, Reconstruction Loss = 0.22, ll_loss = 387669.31, dis_Loss = 0.02, dec_Loss = 38.78, Elapsed time: 2.99 mins
Memory Use (GB): 1.45318603515625
Epoch: 0 / 201, Batch: 32 (0 / 1056), Elapsed time: 2.99 mins
Enc Loss = 40.50, KL Divergence = 0.00, Reconstruction Loss = 0.20, ll_loss = 402275.78, dis_Loss = 0.00, dec_Loss = 40.23, Elapsed time: 3.08 mins
Memory Use (GB): 1.7978897094726562
Epoch: 0 / 201, Batch: 33 (0 / 1088), Elapsed time: 3.08 mins
Enc Loss = 35.48, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 351835.56, dis_Loss = 0.00, dec_Loss = 35.18, Elapsed time: 3.17 mins
Memory Use (GB): 1.7172660827636719
Epoch: 0 / 201, Batch: 34 (0 / 1120), Elapsed time: 3.17 mins
Enc Loss = 15.66, KL Divergence = 0.00, Reconstruction Loss = 0.20, ll_loss = 154690.72, dis_Loss = 0.00, dec_Loss = 15.47, Elapsed time: 3.26 mins
Memory Use (GB): 1.67231369

Enc Loss = 49.48, KL Divergence = 0.01, Reconstruction Loss = 0.23, ll_loss = 488669.38, dis_Loss = 0.00, dec_Loss = 48.87, Elapsed time: 6.02 mins
Memory Use (GB): 1.5512046813964844
Epoch: 0 / 201, Batch: 65 (0 / 2112), Elapsed time: 6.02 mins
Enc Loss = 40.69, KL Divergence = 0.01, Reconstruction Loss = 0.22, ll_loss = 400888.19, dis_Loss = 0.00, dec_Loss = 40.09, Elapsed time: 6.10 mins
Memory Use (GB): 1.7963294982910156
Epoch: 0 / 201, Batch: 66 (0 / 2144), Elapsed time: 6.10 mins
Enc Loss = 33.26, KL Divergence = 0.00, Reconstruction Loss = 0.23, ll_loss = 328718.62, dis_Loss = 0.00, dec_Loss = 32.87, Elapsed time: 6.20 mins
Memory Use (GB): 1.6642723083496094
Epoch: 0 / 201, Batch: 67 (0 / 2176), Elapsed time: 6.20 mins
Enc Loss = 20.60, KL Divergence = 0.00, Reconstruction Loss = 0.20, ll_loss = 203250.31, dis_Loss = 0.00, dec_Loss = 20.33, Elapsed time: 6.29 mins
Memory Use (GB): 1.7118797302246094
Epoch: 0 / 201, Batch: 68 (0 / 2208), Elapsed time: 6.29 mins
Enc Loss = 59.80

Memory Use (GB): 1.4910507202148438
Epoch: 0 / 201, Batch: 98 (0 / 3168), Elapsed time: 9.03 mins
Enc Loss = 78.57, KL Divergence = 0.01, Reconstruction Loss = 0.24, ll_loss = 776795.44, dis_Loss = 0.02, dec_Loss = 77.70, Elapsed time: 9.12 mins
Memory Use (GB): 1.64013671875
Epoch: 0 / 201, Batch: 99 (0 / 3200), Elapsed time: 9.12 mins
Enc Loss = 40.27, KL Divergence = 0.00, Reconstruction Loss = 0.23, ll_loss = 397757.19, dis_Loss = 0.00, dec_Loss = 39.78, Elapsed time: 9.21 mins
Memory Use (GB): 1.6768341064453125
Epoch: 0 / 201, Batch: 100 (0 / 3232), Elapsed time: 9.21 mins
Enc Loss = 112.84, KL Divergence = 0.01, Reconstruction Loss = 0.25, ll_loss = 1114976.12, dis_Loss = 0.00, dec_Loss = 111.50, Elapsed time: 9.30 mins
Memory Use (GB): 1.5687599182128906
Epoch: 0 / 201, Batch: 101 (0 / 3264), Elapsed time: 9.30 mins
Enc Loss = 89.58, KL Divergence = 0.01, Reconstruction Loss = 0.21, ll_loss = 884756.38, dis_Loss = 0.00, dec_Loss = 88.48, Elapsed time: 9.39 mins
Memory Use (GB):

Memory Use (GB): 1.753631591796875
Epoch: 0 / 201, Batch: 131 (0 / 4224), Elapsed time: 12.04 mins
Enc Loss = 136.98, KL Divergence = 0.00, Reconstruction Loss = 0.28, ll_loss = 1366344.75, dis_Loss = 0.00, dec_Loss = 136.63, Elapsed time: 12.13 mins
Memory Use (GB): 1.62847900390625
Epoch: 0 / 201, Batch: 132 (0 / 4256), Elapsed time: 12.13 mins
Enc Loss = 71.21, KL Divergence = 0.00, Reconstruction Loss = 0.23, ll_loss = 711007.25, dis_Loss = 0.00, dec_Loss = 71.10, Elapsed time: 12.23 mins
Memory Use (GB): 1.5124092102050781
Epoch: 0 / 201, Batch: 133 (0 / 4288), Elapsed time: 12.23 mins
Enc Loss = 230.57, KL Divergence = 0.00, Reconstruction Loss = 0.31, ll_loss = 2304378.50, dis_Loss = 0.00, dec_Loss = 230.44, Elapsed time: 12.32 mins
Memory Use (GB): 1.733062744140625
Epoch: 0 / 201, Batch: 134 (0 / 4320), Elapsed time: 12.32 mins
Enc Loss = 122.25, KL Divergence = 0.00, Reconstruction Loss = 0.23, ll_loss = 1219990.25, dis_Loss = 0.00, dec_Loss = 122.00, Elapsed time: 12.41 mins

Memory Use (GB): 1.4438629150390625
Epoch: 0 / 201, Batch: 164 (0 / 5280), Elapsed time: 15.05 mins
Enc Loss = 136.00, KL Divergence = 0.00, Reconstruction Loss = 0.25, ll_loss = 1355493.00, dis_Loss = 0.00, dec_Loss = 135.55, Elapsed time: 15.14 mins
Memory Use (GB): 1.5336494445800781
Epoch: 0 / 201, Batch: 165 (0 / 5312), Elapsed time: 15.15 mins
Enc Loss = 144.40, KL Divergence = 0.00, Reconstruction Loss = 0.25, ll_loss = 1440116.75, dis_Loss = 0.00, dec_Loss = 144.01, Elapsed time: 15.24 mins
Memory Use (GB): 1.6159744262695312
Epoch: 0 / 201, Batch: 166 (0 / 5344), Elapsed time: 15.24 mins
Enc Loss = 243.19, KL Divergence = 0.01, Reconstruction Loss = 0.25, ll_loss = 2425630.00, dis_Loss = 0.00, dec_Loss = 242.56, Elapsed time: 15.33 mins
Memory Use (GB): 1.7369766235351562
Epoch: 0 / 201, Batch: 167 (0 / 5376), Elapsed time: 15.33 mins
Enc Loss = 148.46, KL Divergence = 0.01, Reconstruction Loss = 0.23, ll_loss = 1478436.38, dis_Loss = 0.00, dec_Loss = 147.84, Elapsed time: 15.

Memory Use (GB): 1.7835273742675781
Epoch: 1 / 201, Batch: 24 (800 / 5504), Elapsed time: 17.99 mins
Enc Loss = 169.25, KL Divergence = 0.04, Reconstruction Loss = 0.30, ll_loss = 1652807.88, dis_Loss = 0.00, dec_Loss = 165.28, Elapsed time: 18.08 mins
Memory Use (GB): 1.606414794921875
Epoch: 1 / 201, Batch: 25 (832 / 5504), Elapsed time: 18.08 mins
Enc Loss = 156.95, KL Divergence = 0.04, Reconstruction Loss = 0.25, ll_loss = 1527462.50, dis_Loss = 0.00, dec_Loss = 152.75, Elapsed time: 18.17 mins
Memory Use (GB): 1.4954872131347656
Epoch: 1 / 201, Batch: 26 (864 / 5504), Elapsed time: 18.18 mins
Enc Loss = 157.56, KL Divergence = 0.03, Reconstruction Loss = 0.29, ll_loss = 1540940.00, dis_Loss = 0.00, dec_Loss = 154.09, Elapsed time: 18.27 mins
Memory Use (GB): 1.6170120239257812
Epoch: 1 / 201, Batch: 27 (896 / 5504), Elapsed time: 18.27 mins
Enc Loss = 230.54, KL Divergence = 0.06, Reconstruction Loss = 0.28, ll_loss = 2245686.75, dis_Loss = 0.00, dec_Loss = 224.57, Elapsed time: 

Enc Loss = 193.59, KL Divergence = 0.09, Reconstruction Loss = 0.27, ll_loss = 1850276.25, dis_Loss = 0.00, dec_Loss = 185.03, Elapsed time: 21.03 mins
Memory Use (GB): 1.3752098083496094
Epoch: 1 / 201, Batch: 57 (1856 / 5504), Elapsed time: 21.03 mins
Enc Loss = 204.90, KL Divergence = 0.12, Reconstruction Loss = 0.28, ll_loss = 1933334.62, dis_Loss = 0.00, dec_Loss = 193.33, Elapsed time: 21.13 mins
Memory Use (GB): 1.5048179626464844
Epoch: 1 / 201, Batch: 58 (1888 / 5504), Elapsed time: 21.13 mins
Enc Loss = 183.13, KL Divergence = 0.12, Reconstruction Loss = 0.28, ll_loss = 1711716.38, dis_Loss = 0.00, dec_Loss = 171.17, Elapsed time: 21.22 mins
Memory Use (GB): 1.5819664001464844
Epoch: 1 / 201, Batch: 59 (1920 / 5504), Elapsed time: 21.23 mins
Enc Loss = 219.25, KL Divergence = 0.12, Reconstruction Loss = 0.26, ll_loss = 2072939.50, dis_Loss = 0.00, dec_Loss = 207.29, Elapsed time: 21.32 mins
Memory Use (GB): 1.4967994689941406
Epoch: 1 / 201, Batch: 60 (1952 / 5504), Elapsed t

Memory Use (GB): 1.6908950805664062
Epoch: 1 / 201, Batch: 89 (2880 / 5504), Elapsed time: 23.97 mins
Enc Loss = 117.98, KL Divergence = 0.06, Reconstruction Loss = 0.24, ll_loss = 1124597.00, dis_Loss = 0.00, dec_Loss = 112.46, Elapsed time: 24.06 mins
Memory Use (GB): 1.6023445129394531
Epoch: 1 / 201, Batch: 90 (2912 / 5504), Elapsed time: 24.06 mins
Enc Loss = 262.10, KL Divergence = 0.14, Reconstruction Loss = 0.33, ll_loss = 2477524.50, dis_Loss = 0.00, dec_Loss = 247.75, Elapsed time: 24.15 mins
Memory Use (GB): 1.6532783508300781
Epoch: 1 / 201, Batch: 91 (2944 / 5504), Elapsed time: 24.15 mins
Enc Loss = 227.67, KL Divergence = 0.17, Reconstruction Loss = 0.25, ll_loss = 2103462.00, dis_Loss = 0.00, dec_Loss = 210.35, Elapsed time: 24.25 mins
Memory Use (GB): 1.7370452880859375
Epoch: 1 / 201, Batch: 92 (2976 / 5504), Elapsed time: 24.25 mins
Enc Loss = 141.39, KL Divergence = 0.11, Reconstruction Loss = 0.22, ll_loss = 1303396.38, dis_Loss = 0.00, dec_Loss = 130.34, Elapsed t

Enc Loss = 196.80, KL Divergence = 0.31, Reconstruction Loss = 0.25, ll_loss = 1661168.75, dis_Loss = 0.00, dec_Loss = 166.12, Elapsed time: 27.00 mins
Memory Use (GB): 1.60272216796875
Epoch: 1 / 201, Batch: 122 (3936 / 5504), Elapsed time: 27.00 mins
Enc Loss = 283.00, KL Divergence = 0.48, Reconstruction Loss = 0.29, ll_loss = 2346126.75, dis_Loss = 0.00, dec_Loss = 234.61, Elapsed time: 27.09 mins
Memory Use (GB): 1.5199699401855469
Epoch: 1 / 201, Batch: 123 (3968 / 5504), Elapsed time: 27.09 mins
Enc Loss = 147.96, KL Divergence = 0.24, Reconstruction Loss = 0.22, ll_loss = 1243101.75, dis_Loss = 0.00, dec_Loss = 124.31, Elapsed time: 27.18 mins
Memory Use (GB): 1.7417526245117188
Epoch: 1 / 201, Batch: 124 (4000 / 5504), Elapsed time: 27.19 mins
Enc Loss = 294.88, KL Divergence = 0.46, Reconstruction Loss = 0.28, ll_loss = 2485474.00, dis_Loss = 0.00, dec_Loss = 248.55, Elapsed time: 27.28 mins
Memory Use (GB): 1.6727523803710938
Epoch: 1 / 201, Batch: 125 (4032 / 5504), Elapsed