# Voxel VAE-GAN Training

This notebook is designed to provide a holistic VAE-GAN training experience. You can adjust the model and training parameters through the Sacred configuration, view training progress in TensorBoard, and (work in progress) create reconstructions with the saved models.

References:

* https://github.com/anitan0925/vaegan/blob/master/examples/train.py
  * Trains the VAE and the GAN separately for 20 epochs, then the combined VAE-GAN for 200
* https://github.com/jlindsey15/VAEGAN/blob/master/main.py
  * Mostly clear code for the VAE-GAN paper
* https://arxiv.org/pdf/1512.09300.pdf
  * The VAE-GAN paper ("Autoencoding beyond pixels using a learned similarity metric")
* https://github.com/timsainb/Tensorflow-MultiGPU-VAE-GAN
  * Best code yet!
  

## Setup

In [1]:
import env
from train_vaegan import train_vaegan
from data.thingi10k import Thingi10k
from data.modelnet10 import ModelNet10
from data import MODELNET10_TOILET_INDEX, MODELNET10_SOFA_INDEX, MODELNET10_SOFA_TOILET_INDEX
from models import MODEL_DIR


# plot things
%matplotlib inline
# autoreload modules
%load_ext autoreload
%autoreload 2

## Prepare Sacred Experiment

In [2]:
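import datetime
import os

from sacred import Experiment
from sacred.observers import FileStorageObserver

# interactive=True is needed to run a Sacred experiment from a notebook
ex = Experiment(name='voxel_vaegan_notebook', interactive=True)
# Store each run's config and captured output under experiments_vaegan/
ex.observers.append(FileStorageObserver.create('experiments_vaegan'))

@ex.main
def run_experiment(cfg):
    train_vaegan(cfg)

# Remembers the most recent model dir so we never train into it twice
last_model_dir = None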

## Prepare Model Config

The model dir name includes a timestamp, so every call to `make_cfg()` creates a fresh directory. This keeps you from overwriting past results and keeps each run's logs separate, which avoids confusing TensorBoard.

But be warned! These model dirs can take up space, so you might need to periodically go back and delete ones you do not care about.

Also, if you ever train a model that you would really like to keep, I recommend moving it to a new directory with a special name like "best_model_ever".
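To decide which old model dirs to delete, it can help to see how much space each one uses. The following is a minimal sketch, not part of this project: the helper names `dir_size_bytes` and `list_runs_by_size` are hypothetical, and it assumes all timestamped run directories sit directly under one parent directory.

```python
import os

def dir_size_bytes(path):
    """Sum the sizes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total

def list_runs_by_size(parent_dir):
    """Return (run_name, size_in_bytes) pairs, largest first."""
    runs = []
    for entry in os.scandir(parent_dir):
        if entry.is_dir():
            runs.append((entry.name, dir_size_bytes(entry.path)))
    return sorted(runs, key=lambda pair: pair[1], reverse=True)
```

For example, `list_runs_by_size(os.path.join(MODEL_DIR, 'voxel_vaegan1/modelnet10'))` would list the runs created by this notebook. A run worth keeping can then be renamed with `shutil.move(old_dir, new_dir)` before the rest are cleaned up.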

In [3]:
DATASET_CLASS = 'ModelNet10'
#INDEX = MODELNET10_SOFA_TOILET_INDEX
#INDEX = MODELNET10_SOFA_INDEX
INDEX = MODELNET10_TOILET_INDEX

def make_cfg():
    model_dir = os.path.join(
        MODEL_DIR,
        'voxel_vaegan1/modelnet10/{}'.format(datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')))
    print(model_dir)
    os.makedirs(model_dir)

    cfg = {
        'cfg': {
            "dataset": {
                "class": DATASET_CLASS,
                "index": INDEX,
                #"tag": "animal",
                #"filter_id": 126660,
                #"pctile": 1.0,
                "splits": True
                #"splits": {
                #    "train": .8,
                #    "dev": .1,
                #    "test": .1
                #}
            },
            "generator": {
                "verbose": True,
                "pad": True
            }, 
            "model": {
                "ckpt_dir": model_dir,
                "voxels_dim": 32,
                "batch_size": 32,
                # Do 0.0001 for 1 epoch, then 0.001 for rest of training
                #"learning_rate": [(1, 0.0001), (None, 0.001)],
                #"learning_rate": 0.0001,
                "enc_lr": 0.0002,
                "dec_lr": 0.0002,
                "dis_lr": 0.0002,
                "epochs": 201,
                "keep_prob": 1.0,
                "kl_div_loss_weight": 100,
                "recon_loss_weight": 10000,
                "ll_weight": 0.001,
                "dec_weight": 100,
                "latent_dim": 100,
                "verbose": True,
                "debug": False,
                "input_repeats": 1,
                "display_step": 1,
                #"example_stl_id": 126660,
                "voxel_prob_threshold": 0.065,
                "dev_step": 10,
                "save_step": 10,
                'launch_tensorboard': True,
                'tb_dir': 'tb',
                #'tb_compare': [('best_sofa_and_toilet', '/home/jcworkma/jack/3d-form/models/voxel_vaegan1/modelnet10/2019-03-15_17-08-43/tb')],
                #'tb_compare': [('best_vaegan', '/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-17_08-40-29/tb')],
                'tb_compare': [('vaegan_100epochs_toilets', '/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-18_13-12-53/tb')],
                'no_gan': False,
                'monitor_memory': True
            }
        }
    }
    
    return cfg
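The commented-out `learning_rate` entry in the config describes a piecewise schedule: `0.0001` for 1 epoch, then `0.001` for the rest of training. How `train_vaegan` would consume such a value is not shown here, so the sketch below is only one possible reading. It assumes each tuple is `(num_epochs, lr)`, that `None` means "all remaining epochs", and that epochs are counted from 0; the function name `lr_for_epoch` is hypothetical.

```python
def lr_for_epoch(schedule, epoch):
    """Resolve the learning rate for a zero-based epoch.

    schedule is either a single number or a list of (num_epochs, lr)
    tuples, where num_epochs=None means "all remaining epochs".
    This interpretation is an assumption, not the project's code.
    """
    if isinstance(schedule, (int, float)):
        return float(schedule)
    start = 0
    for num_epochs, lr in schedule:
        if num_epochs is None or epoch < start + num_epochs:
            return lr
        start += num_epochs
    # Past the end of a schedule with no open-ended segment: keep the last rate
    return schedule[-1][1]
```

Under these assumptions, `lr_for_epoch([(1, 0.0001), (None, 0.001)], 0)` gives `0.0001`, and every later epoch gives `0.001`.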

## Tensorboard Prep

We launch TensorBoard through Python's `subprocess` module. Sometimes that process does not die with the rest of the experiment and lingers on as a system process. This becomes a problem when we try to start TensorBoard for the next experiment, because two instances cannot share the same port!

The function below is designed to solve this problem. It uses the Linux `pgrep` utility to search for existing TensorBoard processes and kills them. Note that this probably won't work on Windows.

In [4]:
from utils import kill_tensorboard

kill_tensorboard()

['pgrep', 'tensorboard'] yielded -> b''
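The implementation of `kill_tensorboard` lives in `utils` and is not shown in this notebook. The sketch below is one way such a helper could work, assuming `pgrep` is on the `PATH`; the names `parse_pgrep_output` and `kill_processes_named` are hypothetical and the real function may differ.

```python
import os
import signal
import subprocess

def parse_pgrep_output(raw):
    """Turn pgrep's stdout bytes (one PID per line) into a list of ints."""
    return [int(token) for token in raw.decode().split()]

def kill_processes_named(name, sig=signal.SIGTERM):
    """Find processes whose name matches `name` and send them a signal."""
    cmd = ['pgrep', name]
    # pgrep exits non-zero when nothing matches, so do not use check=True
    result = subprocess.run(cmd, stdout=subprocess.PIPE)
    print('{} yielded -> {}'.format(cmd, result.stdout))
    pids = parse_pgrep_output(result.stdout)
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            # The process exited between pgrep and kill
            pass
    return pids
```

With no matching process, `pgrep` prints nothing, which corresponds to the empty `b''` in the output above.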


## Training

Before training, we check that we are not about to overwrite the last MODEL_DIR. If the assert blocks you, re-execute the `make_cfg()` cell to generate a new MODEL_DIR, then run the training cell again.

The Sacred experiment saves a copy of your experiment settings in the `experiments_vaegan` directory. You can go back to it later to recover the config of a run that worked well.

If TensorBoard is enabled, view it at `localhost:6006` or `your_ip:6006`.
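Recovering a saved config can be sketched as follows. Sacred's `FileStorageObserver` writes each run into its own subdirectory named after the run ID, with the settings in `config.json`. The helper name `load_run_config` is hypothetical, and this is a minimal sketch rather than part of the project.

```python
import json
import os

def load_run_config(base_dir, run_id):
    """Load the config that Sacred's FileStorageObserver saved for a run."""
    path = os.path.join(base_dir, str(run_id), 'config.json')
    with open(path) as f:
        return json.load(f)
```

For a run started from this notebook, `load_run_config('experiments_vaegan', 206)['cfg']` would return the nested `cfg` dict that was passed through `config_updates`, assuming run 206 completed its setup and the file exists.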
   

In [None]:
cfg = make_cfg()
model_dir = cfg.get('cfg').get('model').get('ckpt_dir')
kill_tensorboard()

/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-19_17-25-34
['pgrep', 'tensorboard'] yielded -> b''


In [None]:
# Refuse to train into the model dir used by the previous run
assert last_model_dir != model_dir, 'dont overwrite! re-run make_cfg() for a new model dir'
last_model_dir = model_dir

ex.run(config_updates=cfg)

INFO - voxel_vaegan_notebook - Running command 'run_experiment'
INFO - voxel_vaegan_notebook - Started run with ID "206"


Logging to /home/jcworkma/jack/3d-form/src/logs/2019-03-19_17-25__root.log
Starting train_vaegan main
Numpy random seed: 870960666
Saved cfg: /home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-19_17-25-34/cfg.json
Dataset: <class 'data.modelnet10.ModelNet10'>
Using dataset index /home/jcworkma/jack/3d-form/src/../data/processed/modelnet10_toilet_index.csv and pctile None
Shuffling dataset
dataset n_input=7104
Splitting Datasets
Num input = 7104
Num batches per epoch = 222.00
Initializing VoxelVaegan
['tensorboard', '--logdir', 'current:/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-19_17-25-34/tb,vaegan_100epochs_toilets:/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-18_13-12-53/tb']
Epoch: 0, Elapsed Time: 0.03
Epoch: 0 / 201, Batch: 0 (0 / 32), Elapsed time: 0.03 mins
Enc Loss = 12.94, KL Divergence = 0.04, Reconstruction Loss = 0.10, ll_loss = 9218.65, dis_Loss = 2.35, dec_Loss = 78.56, Elapsed time: 0.1

Memory Use (GB): 1.5297813415527344
Epoch: 0 / 201, Batch: 31 (0 / 1024), Elapsed time: 2.12 mins
Enc Loss = 14.83, KL Divergence = 0.00, Reconstruction Loss = 0.05, ll_loss = 14560.53, dis_Loss = 0.38, dec_Loss = 46.25, Elapsed time: 2.18 mins
Memory Use (GB): 1.4210739135742188
Epoch: 0 / 201, Batch: 32 (0 / 1056), Elapsed time: 2.18 mins
Enc Loss = 17.03, KL Divergence = 0.00, Reconstruction Loss = 0.05, ll_loss = 16751.88, dis_Loss = 0.36, dec_Loss = 44.88, Elapsed time: 2.26 mins
Memory Use (GB): 1.471588134765625
Epoch: 0 / 201, Batch: 33 (0 / 1088), Elapsed time: 2.26 mins
Enc Loss = 17.40, KL Divergence = 0.00, Reconstruction Loss = 0.05, ll_loss = 17119.13, dis_Loss = 0.45, dec_Loss = 41.73, Elapsed time: 2.32 mins
Memory Use (GB): 1.52838134765625
Epoch: 0 / 201, Batch: 34 (0 / 1120), Elapsed time: 2.32 mins
Enc Loss = 15.86, KL Divergence = 0.00, Reconstruction Loss = 0.05, ll_loss = 15581.92, dis_Loss = 0.29, dec_Loss = 37.19, Elapsed time: 2.39 mins
Memory Use (GB): 1.4845

Memory Use (GB): 1.631195068359375
Epoch: 0 / 201, Batch: 65 (0 / 2112), Elapsed time: 4.41 mins
Enc Loss = 16.17, KL Divergence = 0.00, Reconstruction Loss = 0.14, ll_loss = 15992.25, dis_Loss = 0.22, dec_Loss = 20.05, Elapsed time: 4.48 mins
Memory Use (GB): 1.74920654296875
Epoch: 0 / 201, Batch: 66 (0 / 2144), Elapsed time: 4.48 mins
Enc Loss = 16.27, KL Divergence = 0.00, Reconstruction Loss = 0.14, ll_loss = 16092.04, dis_Loss = 0.09, dec_Loss = 19.99, Elapsed time: 4.55 mins
Memory Use (GB): 1.5933761596679688
Epoch: 0 / 201, Batch: 67 (0 / 2176), Elapsed time: 4.55 mins
Enc Loss = 15.82, KL Divergence = 0.00, Reconstruction Loss = 0.15, ll_loss = 15638.19, dis_Loss = 0.21, dec_Loss = 19.41, Elapsed time: 4.61 mins
Memory Use (GB): 1.451416015625
Epoch: 0 / 201, Batch: 68 (0 / 2208), Elapsed time: 4.61 mins
Enc Loss = 15.87, KL Divergence = 0.00, Reconstruction Loss = 0.15, ll_loss = 15677.27, dis_Loss = 0.14, dec_Loss = 19.33, Elapsed time: 4.68 mins
Memory Use (GB): 1.66054153

Memory Use (GB): 1.791168212890625
Epoch: 0 / 201, Batch: 99 (0 / 3200), Elapsed time: 6.70 mins
Enc Loss = 15.83, KL Divergence = 0.00, Reconstruction Loss = 0.20, ll_loss = 15687.26, dis_Loss = 0.29, dec_Loss = 17.34, Elapsed time: 6.77 mins
Memory Use (GB): 1.693695068359375
Epoch: 0 / 201, Batch: 100 (0 / 3232), Elapsed time: 6.77 mins
Enc Loss = 16.02, KL Divergence = 0.00, Reconstruction Loss = 0.19, ll_loss = 15872.70, dis_Loss = 0.20, dec_Loss = 17.49, Elapsed time: 6.84 mins
Memory Use (GB): 1.6058387756347656
Epoch: 0 / 201, Batch: 101 (0 / 3264), Elapsed time: 6.84 mins
Enc Loss = 16.05, KL Divergence = 0.00, Reconstruction Loss = 0.20, ll_loss = 15908.55, dis_Loss = 0.26, dec_Loss = 17.49, Elapsed time: 6.91 mins
Memory Use (GB): 1.4795494079589844
Epoch: 0 / 201, Batch: 102 (0 / 3296), Elapsed time: 6.91 mins
Enc Loss = 16.40, KL Divergence = 0.00, Reconstruction Loss = 0.20, ll_loss = 16269.42, dis_Loss = 0.22, dec_Loss = 17.83, Elapsed time: 6.97 mins
Memory Use (GB): 1.

Enc Loss = 15.71, KL Divergence = 0.00, Reconstruction Loss = 0.21, ll_loss = 15593.17, dis_Loss = 0.30, dec_Loss = 16.61, Elapsed time: 9.01 mins
Memory Use (GB): 1.6659507751464844
Epoch: 0 / 201, Batch: 133 (0 / 4288), Elapsed time: 9.01 mins
Enc Loss = 15.82, KL Divergence = 0.00, Reconstruction Loss = 0.22, ll_loss = 15713.78, dis_Loss = 0.21, dec_Loss = 16.72, Elapsed time: 9.07 mins
Memory Use (GB): 1.6531333923339844
Epoch: 0 / 201, Batch: 134 (0 / 4320), Elapsed time: 9.07 mins
Enc Loss = 16.12, KL Divergence = 0.00, Reconstruction Loss = 0.21, ll_loss = 16004.53, dis_Loss = 0.21, dec_Loss = 17.00, Elapsed time: 9.14 mins
Memory Use (GB): 1.7148971557617188
Epoch: 0 / 201, Batch: 135 (0 / 4352), Elapsed time: 9.14 mins
Enc Loss = 15.80, KL Divergence = 0.00, Reconstruction Loss = 0.21, ll_loss = 15680.37, dis_Loss = 0.17, dec_Loss = 16.66, Elapsed time: 9.21 mins
Memory Use (GB): 1.7531776428222656
Epoch: 0 / 201, Batch: 136 (0 / 4384), Elapsed time: 9.21 mins
Enc Loss = 15.81

Memory Use (GB): 1.5655860900878906
Epoch: 0 / 201, Batch: 166 (0 / 5344), Elapsed time: 11.22 mins
Enc Loss = 15.87, KL Divergence = 0.00, Reconstruction Loss = 0.22, ll_loss = 15790.29, dis_Loss = 0.17, dec_Loss = 16.49, Elapsed time: 11.29 mins
Memory Use (GB): 1.7463760375976562
Epoch: 0 / 201, Batch: 167 (0 / 5376), Elapsed time: 11.29 mins
Enc Loss = 15.22, KL Divergence = 0.00, Reconstruction Loss = 0.22, ll_loss = 15132.77, dis_Loss = 0.08, dec_Loss = 15.82, Elapsed time: 11.36 mins
Memory Use (GB): 1.6009864807128906
Epoch: 0 / 201, Batch: 168 (0 / 5408), Elapsed time: 11.36 mins
Enc Loss = 16.21, KL Divergence = 0.00, Reconstruction Loss = 0.23, ll_loss = 16121.87, dis_Loss = 0.15, dec_Loss = 16.80, Elapsed time: 11.42 mins
Memory Use (GB): 1.5656700134277344
Epoch: 0 / 201, Batch: 169 (0 / 5440), Elapsed time: 11.42 mins
Enc Loss = 15.85, KL Divergence = 0.00, Reconstruction Loss = 0.23, ll_loss = 15766.54, dis_Loss = 0.37, dec_Loss = 16.44, Elapsed time: 11.49 mins
Memory U

Enc Loss = 16.08, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 16018.77, dis_Loss = 0.16, dec_Loss = 16.52, Elapsed time: 13.44 mins
Memory Use (GB): 1.6583709716796875
Epoch: 1 / 201, Batch: 27 (896 / 5504), Elapsed time: 13.44 mins
Enc Loss = 16.43, KL Divergence = 0.00, Reconstruction Loss = 0.23, ll_loss = 16372.79, dis_Loss = 0.11, dec_Loss = 16.87, Elapsed time: 13.50 mins
Memory Use (GB): 1.5489273071289062
Epoch: 1 / 201, Batch: 28 (928 / 5504), Elapsed time: 13.50 mins
Enc Loss = 15.20, KL Divergence = 0.00, Reconstruction Loss = 0.22, ll_loss = 15141.67, dis_Loss = 0.08, dec_Loss = 15.64, Elapsed time: 13.57 mins
Memory Use (GB): 1.7066459655761719
Epoch: 1 / 201, Batch: 29 (960 / 5504), Elapsed time: 13.57 mins
Enc Loss = 16.16, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 16103.25, dis_Loss = 0.02, dec_Loss = 16.59, Elapsed time: 13.64 mins
Memory Use (GB): 1.7536354064941406
Epoch: 1 / 201, Batch: 30 (992 / 5504), Elapsed time: 13.64 mins
Enc 

Enc Loss = 16.63, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 16585.85, dis_Loss = 0.09, dec_Loss = 16.98, Elapsed time: 15.65 mins
Memory Use (GB): 1.6886520385742188
Epoch: 1 / 201, Batch: 60 (1952 / 5504), Elapsed time: 15.65 mins
Enc Loss = 15.65, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 15596.16, dis_Loss = 0.05, dec_Loss = 15.99, Elapsed time: 15.72 mins
Memory Use (GB): 1.6765632629394531
Epoch: 1 / 201, Batch: 61 (1984 / 5504), Elapsed time: 15.72 mins
Enc Loss = 15.84, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 15798.00, dis_Loss = 0.16, dec_Loss = 16.18, Elapsed time: 15.79 mins
Memory Use (GB): 1.6601295471191406
Epoch: 1 / 201, Batch: 62 (2016 / 5504), Elapsed time: 15.79 mins
Enc Loss = 16.56, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 16508.04, dis_Loss = 0.11, dec_Loss = 16.89, Elapsed time: 15.85 mins
Memory Use (GB): 1.5261878967285156
Epoch: 1 / 201, Batch: 63 (2048 / 5504), Elapsed time: 15.85 mins


Enc Loss = 15.81, KL Divergence = 0.00, Reconstruction Loss = 0.25, ll_loss = 15769.34, dis_Loss = 0.20, dec_Loss = 16.08, Elapsed time: 17.87 mins
Memory Use (GB): 1.6916236877441406
Epoch: 1 / 201, Batch: 93 (3008 / 5504), Elapsed time: 17.87 mins
Enc Loss = 15.39, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 15341.28, dis_Loss = 0.21, dec_Loss = 15.65, Elapsed time: 17.94 mins
Memory Use (GB): 1.5498390197753906
Epoch: 1 / 201, Batch: 94 (3040 / 5504), Elapsed time: 17.94 mins
Enc Loss = 15.59, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 15539.67, dis_Loss = 0.06, dec_Loss = 15.85, Elapsed time: 18.00 mins
Memory Use (GB): 1.6092567443847656
Epoch: 1 / 201, Batch: 95 (3072 / 5504), Elapsed time: 18.00 mins
Enc Loss = 16.31, KL Divergence = 0.00, Reconstruction Loss = 0.25, ll_loss = 16266.66, dis_Loss = 0.17, dec_Loss = 16.58, Elapsed time: 18.07 mins
Memory Use (GB): 1.6137313842773438
Epoch: 1 / 201, Batch: 96 (3104 / 5504), Elapsed time: 18.07 mins


Epoch: 1 / 201, Batch: 125 (4032 / 5504), Elapsed time: 20.02 mins
Enc Loss = 16.17, KL Divergence = 0.00, Reconstruction Loss = 0.25, ll_loss = 16131.39, dis_Loss = 0.21, dec_Loss = 16.37, Elapsed time: 20.09 mins
Memory Use (GB): 1.5814933776855469
Epoch: 1 / 201, Batch: 126 (4064 / 5504), Elapsed time: 20.09 mins
Enc Loss = 16.21, KL Divergence = 0.00, Reconstruction Loss = 0.25, ll_loss = 16169.87, dis_Loss = 0.11, dec_Loss = 16.40, Elapsed time: 20.16 mins
Memory Use (GB): 1.487396240234375
Epoch: 1 / 201, Batch: 127 (4096 / 5504), Elapsed time: 20.16 mins
Enc Loss = 16.05, KL Divergence = 0.00, Reconstruction Loss = 0.25, ll_loss = 16013.93, dis_Loss = 0.13, dec_Loss = 16.24, Elapsed time: 20.23 mins
Memory Use (GB): 1.5380821228027344
Epoch: 1 / 201, Batch: 128 (4128 / 5504), Elapsed time: 20.23 mins
Enc Loss = 15.73, KL Divergence = 0.00, Reconstruction Loss = 0.24, ll_loss = 15688.72, dis_Loss = 0.08, dec_Loss = 15.91, Elapsed time: 20.29 mins
Memory Use (GB): 1.56035995483398

Memory Use (GB): 1.6002960205078125
Epoch: 1 / 201, Batch: 158 (5088 / 5504), Elapsed time: 22.24 mins
Enc Loss = 15.53, KL Divergence = 0.00, Reconstruction Loss = 0.26, ll_loss = 15501.62, dis_Loss = 0.08, dec_Loss = 15.67, Elapsed time: 22.31 mins
Memory Use (GB): 1.5849609375
Epoch: 1 / 201, Batch: 159 (5120 / 5504), Elapsed time: 22.31 mins
Enc Loss = 16.28, KL Divergence = 0.00, Reconstruction Loss = 0.26, ll_loss = 16253.58, dis_Loss = 0.11, dec_Loss = 16.43, Elapsed time: 22.38 mins
Memory Use (GB): 1.6749725341796875
Epoch: 1 / 201, Batch: 160 (5152 / 5504), Elapsed time: 22.38 mins
Enc Loss = 15.88, KL Divergence = 0.00, Reconstruction Loss = 0.25, ll_loss = 15850.67, dis_Loss = 0.17, dec_Loss = 16.02, Elapsed time: 22.45 mins
Memory Use (GB): 1.7564506530761719
Epoch: 1 / 201, Batch: 161 (5184 / 5504), Elapsed time: 22.45 mins
Enc Loss = 15.25, KL Divergence = 0.00, Reconstruction Loss = 0.26, ll_loss = 15213.68, dis_Loss = 0.16, dec_Loss = 15.38, Elapsed time: 22.52 mins
Me