# Voxel VAE-GAN Training

This notebook provides an end-to-end VAE-GAN training workflow. You can adjust the model and training parameters through the Sacred configuration, view training progress in TensorBoard, and (work in progress) create reconstructions with the saved models.

References:

* https://github.com/anitan0925/vaegan/blob/master/examples/train.py
  * Trains the VAE and the GAN separately for 20 epochs, then the combined VAE-GAN for 200
* https://github.com/jlindsey15/VAEGAN/blob/master/main.py
  * Fairly clear code for the VAE-GAN paper
* https://arxiv.org/pdf/1512.09300.pdf
  * The VAE-GAN paper
* https://github.com/timsainb/Tensorflow-MultiGPU-VAE-GAN
  * Best code yet!
  

## Setup

In [1]:
import env
from train_vaegan import train_vaegan
from data.thingi10k import Thingi10k
from data.modelnet10 import ModelNet10
from data import MODELNET10_TOILET_INDEX, MODELNET10_SOFA_INDEX, MODELNET10_SOFA_TOILET_INDEX
from models import MODEL_DIR


# plot things
%matplotlib inline
# autoreload modules
%load_ext autoreload
%autoreload 2

## Prepare Sacred Experiment

In [2]:
from sacred.observers import FileStorageObserver
from sacred import Experiment
import os

ex = Experiment(name='voxel_vaegan_notebook', interactive=True)
ex.observers.append(FileStorageObserver.create('experiments_vaegan'))

@ex.main
def run_experiment(cfg):
    train_vaegan(cfg)

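import datetime

# Remembers the model dir used by the most recent run so the training cell
# can refuse to train into the same directory twice.
last_model_dir = None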

## Prepare Model Config

The model directory name is generated from a timestamp. This prevents overwriting past results and keeps each run's logs separate, so TensorBoard does not mix them up.

But be warned! These model dirs can take up space, so you might need to periodically go back and delete ones you do not care about.

Also, if you ever train a model that you would really like to keep, I recommend moving it to a new directory with a special name like "best_model_ever".
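To decide which timestamped model dirs to delete, it can help to see how much space each one uses. The helper below is a minimal sketch and is not part of this repository: `dir_size_bytes` and `list_runs_by_size` are hypothetical names, and the sketch assumes every run lives in its own subdirectory of a common parent directory.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path` (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path) and not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def list_runs_by_size(runs_root):
    """Return (run_name, size_in_bytes) pairs for each subdirectory, largest first."""
    runs = []
    for name in sorted(os.listdir(runs_root)):
        path = os.path.join(runs_root, name)
        if os.path.isdir(path):
            runs.append((name, dir_size_bytes(path)))
    return sorted(runs, key=lambda run: run[1], reverse=True)
```

In this notebook it could be called as `list_runs_by_size(os.path.join(MODEL_DIR, 'voxel_vaegan1/modelnet10'))`, assuming that directory already exists.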

In [3]:
DATASET_CLASS = 'ModelNet10'
#INDEX = MODELNET10_SOFA_TOILET_INDEX
#INDEX = MODELNET10_SOFA_INDEX
INDEX = MODELNET10_TOILET_INDEX

def make_cfg():
    model_dir = os.path.join(
        MODEL_DIR,
        'voxel_vaegan1/modelnet10/{}'.format(datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')))
    print(model_dir)
    os.makedirs(model_dir)

    cfg = {
        'cfg': {
            "dataset": {
                "class": DATASET_CLASS,
                "index": INDEX,
                #"tag": "animal",
                #"filter_id": 126660,
                #"pctile": 1.0,
                "splits": True
                #"splits": {
                #    "train": .8,
                #    "dev": .1,
                #    "test": .1
                #}
            },
            "generator": {
                "verbose": True,
                "pad": True
            }, 
            "model": {
                "ckpt_dir": model_dir,
                "voxels_dim": 32,
                "batch_size": 32,
                # Do 0.0001 for 1 epoch, then 0.001 for rest of training
                #"learning_rate": [(1, 0.0001), (None, 0.001)],
                #"learning_rate": 0.0001,
                "enc_lr": 0.0001,
                "dec_lr": 0.0001,
                "dis_lr": 0.0001,
                "epochs": 201,
                "keep_prob": 0.8,
                "kl_div_loss_weight": 100,
                "recon_loss_weight": 10000,
                "ll_weight": .0001,
                "dec_weight": 100,
                "latent_dim": 100,
                "verbose": True,
                "debug": False,
                "input_repeats": 1,
                "display_step": 1,
                #"example_stl_id": 126660,
                "voxel_prob_threshold": 0.065,
                "dev_step": 10,
                "save_step": 10,
                'launch_tensorboard': True,
                'tb_dir': 'tb',
                #'tb_compare': [('best_sofa_and_toilet', '/home/jcworkma/jack/3d-form/models/voxel_vaegan1/modelnet10/2019-03-15_17-08-43/tb')],
                #'tb_compare': [('best_vaegan', '/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-17_08-40-29/tb')],
                #'tb_compare': [('vaegan_100epochs_toilets', '/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-18_13-12-53/tb')],
                'tb_compare': [('vaegan_1024_filter_discr', '/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-20_09-39-05/')],
                'no_gan': False,
                'monitor_memory': True,
                # these settings control how often the components' optimizers are executed during the training loop
                'train_vae_cadence': 1,
                'train_gan_cadence': 1,
                'dis_noise': 0.05,
                'adaptive_lr': False
            }
        }
    }
    
    return cfg
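
The commented-out `learning_rate` entry in the config above describes a piecewise schedule: `[(1, 0.0001), (None, 0.001)]` is annotated as "0.0001 for 1 epoch, then 0.001 for rest of training". How `train_vaegan` actually interprets this value is not shown in this notebook, so the function below is only a sketch of one plausible reading. It assumes that the first element of each pair is a number of epochs, that `None` means "all remaining epochs", and that epochs are counted from 0 as in the training log. The name `lr_for_epoch` is hypothetical.

```python
def lr_for_epoch(schedule, epoch):
    """Resolve the learning rate for a zero-based epoch.

    `schedule` is either a single number (constant rate) or a list of
    (n_epochs, lr) pairs, where n_epochs=None covers all remaining epochs.
    This format is an assumption based on the config comment.
    """
    if isinstance(schedule, (int, float)):
        return float(schedule)
    start = 0
    for n_epochs, lr in schedule:
        # A segment applies if it is open-ended or the epoch falls inside it.
        if n_epochs is None or epoch < start + n_epochs:
            return lr
        start += n_epochs
    raise ValueError('schedule does not cover epoch {}'.format(epoch))


schedule = [(1, 0.0001), (None, 0.001)]
first_epoch_lr = lr_for_epoch(schedule, 0)   # 0.0001
later_epoch_lr = lr_for_epoch(schedule, 5)   # 0.001
```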

## Tensorboard Prep

We launch TensorBoard through Python's `subprocess` module. Sometimes that process does not die with the rest of the experiment and lingers as a system process. This becomes a problem when we try to start TensorBoard for the next experiment, because two instances cannot share the same port.

The function below is designed to solve this problem. It uses the Linux `pgrep` utility to find existing TensorBoard processes and kills them. Note that this probably won't work on Windows.
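
The source of `utils.kill_tensorboard` is not shown in this notebook. As a rough illustration of the approach described above, the sketch below runs `pgrep` for a process name and sends `SIGTERM` to each PID it reports. The names `parse_pids` and `kill_by_name` are hypothetical, and the real helper may differ in its details.

```python
import os
import signal
import subprocess


def parse_pids(pgrep_stdout):
    """Convert pgrep's whitespace-separated byte output into a list of ints."""
    return [int(token) for token in pgrep_stdout.split()]


def kill_by_name(name, sig=signal.SIGTERM):
    """Signal every process that pgrep matches for `name`; return the PIDs found.

    Requires the `pgrep` executable, so this is Linux/macOS only.
    """
    # pgrep prints nothing (and exits non-zero) when no process matches.
    result = subprocess.run(['pgrep', name], stdout=subprocess.PIPE)
    pids = parse_pids(result.stdout)
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # the process exited between the lookup and the kill
    return pids
```

Calling `kill_by_name('tensorboard')` with no TensorBoard running would return an empty list, which corresponds to the empty `b''` result printed by the cell below.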

In [4]:
from utils import kill_tensorboard

kill_tensorboard()

['pgrep', 'tensorboard'] yielded -> b''


## Training

We start with a check that we are not about to overwrite the model directory used by the previous run. If you are blocked by the assert, re-run the `make_cfg()` cell to generate a new model directory. This will allow you to move ahead with training.

The Sacred experiment saves a copy of your experiment settings in the `experiments_vaegan` directory. You can go back to it later if you need to recover the configuration of an earlier run.
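
Sacred's `FileStorageObserver` writes each run into a subdirectory named after the run ID and stores the resolved configuration there as `config.json`. The helper below is a small sketch for reading that file back. `load_sacred_config` is a hypothetical name, and the default observer directory simply mirrors the `'experiments_vaegan'` argument used earlier in this notebook.

```python
import json
import os


def load_sacred_config(run_id, observer_dir='experiments_vaegan'):
    """Load the config.json that FileStorageObserver saved for `run_id`."""
    path = os.path.join(observer_dir, str(run_id), 'config.json')
    with open(path) as f:
        return json.load(f)
```

For example, `load_sacred_config(221)['cfg']['model']` would return the model settings of run 221, assuming that run directory exists. Because the file is JSON, tuples in the original config (such as the `tb_compare` entries) come back as lists.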

If TensorBoard is enabled, open it at `localhost:6006` or `your_ip:6006`.
   

In [None]:
cfg = make_cfg()
model_dir = cfg.get('cfg').get('model').get('ckpt_dir')
kill_tensorboard()

/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-21_12-02-27
['pgrep', 'tensorboard'] yielded -> b''


In [None]:
# Refuse to train into the same model directory twice.
assert last_model_dir != model_dir, \
    "Don't overwrite! Re-run make_cfg() to generate a new model_dir."
last_model_dir = model_dir

ex.run(config_updates=cfg)

INFO - voxel_vaegan_notebook - Running command 'run_experiment'
INFO - voxel_vaegan_notebook - Started run with ID "221"


Logging to /home/jcworkma/jack/3d-form/src/logs/2019-03-21_12-02__root.log
Starting train_vaegan main
Numpy random seed: 209999395
Saved cfg: /home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-21_12-02-27/cfg.json
Dataset: <class 'data.modelnet10.ModelNet10'>
Using dataset index /home/jcworkma/jack/3d-form/src/../data/processed/modelnet10_toilet_index.csv and pctile None
Shuffling dataset
dataset n_input=7104
Splitting Datasets
Num input = 7104
Num batches per epoch = 222.00
Initializing VoxelVaegan
['tensorboard', '--logdir', 'current:/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-21_12-02-27/tb,vaegan_100epochs_toilets:/home/jcworkma/jack/3d-form/src/../models/voxel_vaegan1/modelnet10/2019-03-18_13-12-53/tb']
Epoch: 0, Elapsed Time: 0.03
Training VAE in this epoch
Training GAN in this epoch
Epoch: 0 / 201, Batch: 0 (0 / 32), Elapsed time: 0.03 mins
Enc Loss = 7.48, KL Divergence = 0.07, Reconstruction Loss = 0.12, ll_loss = 1.36, di

Memory Use (GB): 2.021575927734375
Epoch: 0 / 201, Batch: 31 (0 / 1024), Elapsed time: 5.60 mins
Enc Loss = 1.41, KL Divergence = 0.01, Reconstruction Loss = 0.10, ll_loss = 95.25, dis_Loss = 0.28, dec_Loss = 0.29, Elapsed time: 5.78 mins
Memory Use (GB): 1.9760017395019531
Epoch: 0 / 201, Batch: 32 (0 / 1056), Elapsed time: 5.78 mins
Enc Loss = 1.72, KL Divergence = 0.02, Reconstruction Loss = 0.11, ll_loss = 84.94, dis_Loss = 0.26, dec_Loss = 0.27, Elapsed time: 5.96 mins
Memory Use (GB): 1.9238510131835938
Epoch: 0 / 201, Batch: 33 (0 / 1088), Elapsed time: 5.96 mins
Enc Loss = 1.77, KL Divergence = 0.02, Reconstruction Loss = 0.12, ll_loss = 80.32, dis_Loss = 0.25, dec_Loss = 0.26, Elapsed time: 6.14 mins
Memory Use (GB): 1.963226318359375
Epoch: 0 / 201, Batch: 34 (0 / 1120), Elapsed time: 6.14 mins
Enc Loss = 1.54, KL Divergence = 0.02, Reconstruction Loss = 0.11, ll_loss = 90.81, dis_Loss = 0.27, dec_Loss = 0.28, Elapsed time: 6.31 mins
Memory Use (GB): 1.9405708312988281
Epoch:

Epoch: 0 / 201, Batch: 65 (0 / 2112), Elapsed time: 11.68 mins
Enc Loss = 1.30, KL Divergence = 0.01, Reconstruction Loss = 0.11, ll_loss = 100.55, dis_Loss = 0.23, dec_Loss = 0.24, Elapsed time: 11.87 mins
Memory Use (GB): 1.9170112609863281
Epoch: 0 / 201, Batch: 66 (0 / 2144), Elapsed time: 11.87 mins
Enc Loss = 1.26, KL Divergence = 0.01, Reconstruction Loss = 0.11, ll_loss = 105.77, dis_Loss = 0.24, dec_Loss = 0.25, Elapsed time: 12.05 mins
Memory Use (GB): 1.872711181640625
Epoch: 0 / 201, Batch: 67 (0 / 2176), Elapsed time: 12.05 mins
Enc Loss = 1.27, KL Divergence = 0.01, Reconstruction Loss = 0.10, ll_loss = 109.31, dis_Loss = 0.24, dec_Loss = 0.25, Elapsed time: 12.23 mins
Memory Use (GB): 2.13275146484375
Epoch: 0 / 201, Batch: 68 (0 / 2208), Elapsed time: 12.23 mins
Enc Loss = 1.20, KL Divergence = 0.01, Reconstruction Loss = 0.11, ll_loss = 112.91, dis_Loss = 0.24, dec_Loss = 0.26, Elapsed time: 12.41 mins
Memory Use (GB): 2.0397605895996094
Epoch: 0 / 201, Batch: 69 (0 / 