## Model training
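Notebook for testing model training: a DistilBERT encoder is fine-tuned with a triplet loss on Reddit data, using a multi-GPU `MirroredStrategy`.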

### Imports

In [1]:
from reddit.utils import (load_tfrecord, pad_and_stack,
                          split_dataset)
from reddit.models import BatchTransformer
from reddit.losses import TripletLossBase
from reddit.training import Trainer
from transformers import TFDistilBertModel
import glob
from pathlib import Path
import tensorflow as tf
from official.nlp.optimization import create_optimizer

In [9]:
METRICS_PATH = Path('..') / 'logs' / '10anchor_1pos_1neg_ds1'
METRICS_PATH.mkdir(parents=True, exist_ok=True)

### Strategy

In [3]:
gpus = tf.config.list_physical_devices('GPU')
print("Num GPUs Available: ", len(gpus))

Num GPUs Available:  4


In [4]:
try:
    tf.config.experimental.set_visible_devices(gpus, 'GPU')
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
except RuntimeError as e:
    print(e)

4 Physical GPUs, 4 Logical GPU


In [5]:
strategy = tf.distribute.MirroredStrategy(devices=logical_gpus)

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')


### Dataset
Load dataset, pad to desired length, batch and distribute

In [12]:
ds_params = {'n_anchor': 10,
             'n_pos': 1,
             'n_neg': 1,
             'batch_size': 4}

In [13]:
fs = glob.glob('../reddit/data/datasets/triplet/1pos_1neg_random/*')
ds = load_tfrecord(fs)
ds = pad_and_stack(ds, pad_to=[ds_params['n_anchor'],
                               ds_params['n_pos'],
                               ds_params['n_neg']])
ds = ds.batch(ds_params['batch_size'], drop_remainder=True)

In [14]:
ds_train, ds_val, ds_test = split_dataset(ds, 
                                          perc_train=.7, 
                                          perc_val=.1,
                                          perc_test=.2)

Number of total examples: 426901
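
The 70/10/20 split above can be checked by hand. The sketch below is only an illustration: `split_sizes` is a hypothetical helper, and how `split_dataset` actually rounds the counts is not shown in this notebook, so truncating and giving the remainder to the test set is an assumption.

```python
def split_sizes(n_total, perc_train, perc_val):
    """Return (n_train, n_val, n_test); the test split takes the remainder.

    Assumption: counts are truncated to integers. The real split_dataset
    may round differently.
    """
    n_train = int(n_total * perc_train)
    n_val = int(n_total * perc_val)
    n_test = n_total - n_train - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(426901, perc_train=.7, perc_val=.1)
print(n_train, n_val, n_test)
```

With a `tf.data.Dataset`, sizes like these are typically applied with `take` and `skip`; whether `split_dataset` does so is not confirmed here.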


### Initialize training parameters

In [15]:
train_params = {'weights': 'distilbert-base-uncased',
                'model': TFDistilBertModel,
                'optimizer_learning_rate': 2e-5,
                'optimizer_n_train_steps': 426901 * 3,
                'optimizer_n_warmup_steps': 426901 // 10,  # integer step count
                'loss_margin': 1,
                'n_epochs': 3,
                'steps_per_epoch': 426901,
                'train_vars': ['losses','metrics', 
                               'dist_pos', 'dist_neg', 
                               'dist_anchor'],
                'test_vars': ['test_losses', 'test_metrics',
                              'test_dist_pos', 'test_dist_neg',
                              'test_dist_anchor'],
                'log_every': 1000}
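
The step counts in `train_params` are hard-coded from the dataset size. A minimal sketch of the same arithmetic follows, assuming (as the values above suggest) that the total number of optimizer steps is `steps_per_epoch * n_epochs` and that warmup lasts one tenth of an epoch. `optimizer_steps` is a hypothetical helper, not part of the `reddit` package.

```python
def optimizer_steps(steps_per_epoch, n_epochs, warmup_divisor=10):
    """Return (n_train_steps, n_warmup_steps) as integers.

    Assumption: warmup covers 1/warmup_divisor of a single epoch,
    mirroring the hard-coded values in train_params.
    """
    n_train_steps = steps_per_epoch * n_epochs
    n_warmup_steps = steps_per_epoch // warmup_divisor
    return n_train_steps, n_warmup_steps


n_train_steps, n_warmup_steps = optimizer_steps(426901, n_epochs=3)
print(n_train_steps, n_warmup_steps)
```

Deriving both values from one place keeps them consistent if the dataset size or the number of epochs changes. Note that 426901 is the count reported by `split_dataset` before splitting; whether it matches the number of training batches per epoch is not verified in this notebook.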

### Initialize optimizer, model, loss, and trainer object

Workaround: loading the PyTorch `DistilBertModel` once before building the TensorFlow model avoids an otherwise unexplained out-of-memory (OOM) error in TF.

In [16]:
%%capture
from transformers import DistilBertModel
DistilBertModel.from_pretrained('distilbert-base-uncased')

The actual initialization of everything needed for training: the optimizer, model, and loss are all created inside the strategy scope.

In [17]:
%%capture
with strategy.scope():
    optimizer = create_optimizer(train_params['optimizer_learning_rate'],
                                 num_train_steps=train_params['optimizer_n_train_steps'],
                                 num_warmup_steps=train_params['optimizer_n_warmup_steps'])
    model = BatchTransformer(train_params['model'], 
                             train_params['weights'])
    loss = TripletLossBase(train_params['loss_margin'],
                           n_pos=ds_params['n_pos'],
                           n_neg=ds_params['n_neg'])

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['activation_13', 'vocab_layer_norm', 'vocab_projector', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.
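
For reference, the loss configured above with `loss_margin=1` is a triplet loss. The sketch below shows the standard margin-based form for a single (anchor, positive, negative) triplet in plain Python. It is not the `TripletLossBase` implementation: the Euclidean distance and the function names are assumptions, and the real class also has to aggregate over `n_pos` and `n_neg` examples.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(d(a, p) - d(a, n) + margin, 0).

    Assumption: Euclidean distance; TripletLossBase may use another metric.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


# Negative is far from the anchor: the margin is satisfied, so the loss is zero.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Negative is only slightly farther than the positive: the loss is positive.
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])
print(easy, hard)
```

The loss only reaches zero once the negative is at least `margin` farther from the anchor than the positive, which is what pushes the encoder to separate the two.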


In [18]:
trainer = Trainer(model=model,
                  loss_object=loss,
                  optimizer=optimizer,
                  strategy=strategy, 
                  n_epochs=train_params['n_epochs'], 
                  steps_per_epoch=train_params['steps_per_epoch'], 
                  log_every=train_params['log_every'],
                  train_vars=train_params['train_vars'], 
                  test_vars=train_params['test_vars'], 
                  log_path=str(METRICS_PATH),
                  checkpoint_device=None,
                  distributed=True)

### Train!

In [None]:
trainer.train(dataset_train=ds_train, 
              dataset_test=ds_val)

Epoch 1/3
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
INFO:tensorflow:batch_all_reduce: 98 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GP

## Check metrics

In [13]:
import json
import numpy as np

In [9]:
# Load the test-only (baseline) log and the log after the first training epoch
base = ('../logs/sample_output/metrics/BatchTransformer-distilbert-base-uncased/'
        'triplet_loss_margin-1/')
with open(base + 'epoch-test_only/log.json') as f:
    dbaseline = json.load(f)
with open(base + 'epoch-0/log.json') as f:
    dftune = json.load(f)

In [19]:
# Check if higher baseline performance is due to subreddit

In [23]:
print('Baseline performance')
print(np.mean(dbaseline['test_metrics']))
print('Performance after epoch 0')
print(np.mean(dftune['test_metrics']))

Baseline performance
0.83525
Performance after epoch 0
0.849875
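
The comparison above can be wrapped in a small helper that reports the change directly. This is a sketch under the assumption that each log is a dict whose `'test_metrics'` entry is a flat list of numbers. `mean_metric` and `improvement` are hypothetical names, and the toy logs below use made-up values rather than the real JSON files.

```python
from statistics import fmean


def mean_metric(log, key="test_metrics"):
    """Average the values stored under `key` in a loaded log dict."""
    return fmean(log[key])


def improvement(baseline_log, finetuned_log, key="test_metrics"):
    """Mean metric after fine-tuning minus the baseline mean."""
    return mean_metric(finetuned_log, key) - mean_metric(baseline_log, key)


# Toy logs standing in for the loaded JSON files (values are made up).
toy_baseline = {"test_metrics": [0.5, 0.75, 1.0]}
toy_finetuned = {"test_metrics": [0.75, 1.0, 1.0]}
delta = improvement(toy_baseline, toy_finetuned)
print(round(delta, 4))
```

Applied to the numbers printed above, the difference is 0.849875 - 0.83525 = 0.014625.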
