> In assignment 3, we used a vanilla transformer for Spanish-English neural machine translation. It took considerably longer to train than the smaller models, so we will learn to use a profiler to identify exactly where the bottlenecks are and what could be improved.
>
> The first part is a quick demo of the TensorFlow Profiler. A couple of follow-up readings will guide you through different profiler use cases. You then need to complete the modification of the `Transformer` class so that it can use the `.fit()` method, which is required for profiler callbacks during training. After profiling the vanilla transformer, you will describe your profiling results to identify bottlenecks, then propose and run experiments to reduce or eliminate three of them, and discuss your experiments and results. Finally, you will rerun the unimproved version on ThetaGPU, compare the profiling differences, and answer some questions.



# Experiments & Write up

1. Describe your profiling results on Colab, according to your understanding of this particular transformer architecture design. 

2. Identify bottlenecks in the training phase using these diagrams. Choose three bottlenecks you think could be improved or eliminated by employing techniques learned from [Debugging and Optimization](https://colab.research.google.com/drive/1MwaOPAW8xfadGhFlsuziLRf80YfGcOnq?authuser=1#scrollTo=9VEjIwHO8Tv9) readings.

3. Carry out experiments (on Colab) testing your modifications to the pipeline / model, discuss:
- Why do you think it is an improvable bottleneck?
- How do you plan to modify it to make it better?
- Does your plan work out well? If not, what could be preventing the improvement?

4. Run the `unimproved version` notebook on the ThetaGPU single-gpu queue and report how the profiling results differ. Are the bottlenecks you identified still bottlenecks?

# Initialization

In [4]:
%%capture
!pip install -U tensorboard-plugin-profile

In [1]:
from google.colab import drive
drive.mount("/content/drive")
%cd "/content/drive/MyDrive/Courses/Fall 2021/dlsys/DeepLearningSystems-Fall2021/HW5"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Courses/Fall 2021/dlsys/DeepLearningSystems-Fall2021/HW5


# Data preparations

See the [data-setup.ipynb](https://github.com/tuanpham96/DeepLearningSystems-Fall2021/tree/main/HW5/notebooks/data-setup.ipynb) notebook

# ThetaGPU
See the [HW5-thetagpu.ipynb](https://github.com/tuanpham96/DeepLearningSystems-Fall2021/tree/main/HW5/notebooks/HW5-thetagpu.ipynb) notebook

# Colab

## Essentials

In [2]:
from datetime import datetime
from src.routines import *

In [3]:
# Define paths
DATA_PATH = 'data/eng_spa_translations'
OUTPUT_PATH = 'output'
TRAIN_FILENAME = 'spa.txt'
URL_NONBREAKING_FILES = ['nonbreaking_prefix.en', 'nonbreaking_prefix.es']

In [4]:
# Define configs
data_files = configure_datafiles(
    data_path               = DATA_PATH, 
    train_filename          = TRAIN_FILENAME, 
    nonbreaking_filenames   = URL_NONBREAKING_FILES
)

model_config = dict(    
    d_model                 = 512,
    n_layers                = 4,
    FFN_units               = 512,
    n_heads                 = 8,
    dropout_rate            = 0.1,
    act_fun                 = 'relu',
)


## Baseline

In [5]:
# Load and transform data
dataset, token_dset = load_datasets(data_files)

# Clean the session
tf.keras.backend.clear_session()  

In [6]:
# Model name 
model_name = 'transformer-ColabBaseline'
# Create model
transformer = Transformer(
    vocab_size_enc=token_dset['input']['num_words'], 
    vocab_size_dec=token_dset['target']['num_words'],
    **model_config
)
# Compile model 
compile_model(transformer, model_config)
# Fit with callbacks
fit_model_with_callbacks(transformer, dataset, model_name, num_epochs=2)

Epoch 1/2
Epoch 2/2
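
`fit_model_with_callbacks` is defined in `src.routines` and is not shown in this notebook. As a rough sketch of the kind of helper it might be, assuming it wraps the Keras `TensorBoard` callback with a `profile_batch` range (the name `make_profiler_callbacks`, the batch range, and the log path are illustrative and not taken from the repository):

```python
import tensorflow as tf

def make_profiler_callbacks(log_dir, profile_batch=(10, 20)):
    """Return callbacks that log to TensorBoard and profile a range of batches.

    Hypothetical helper: the real `fit_model_with_callbacks` may differ.
    """
    tensorboard_cb = tf.keras.callbacks.TensorBoard(
        log_dir=log_dir,
        histogram_freq=0,             # skip weight histograms to keep overhead low
        profile_batch=profile_batch,  # profile only these batches, not the whole run
    )
    return [tensorboard_cb]

callbacks = make_profiler_callbacks("logs/example-run")
# Usage (assumes a compiled Keras model and a tf.data dataset):
# model.fit(dataset, epochs=2, callbacks=callbacks)
```

Starting the profiled range a few batches into training is a common choice, since the first steps tend to include graph tracing and warm-up work that would otherwise dominate the trace.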


## Try 1: Set `TF_GPU_THREAD_MODE=gpu_private` and use `cache` for data


In [5]:
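import os  # explicit import in case `src.routines` does not re-export it
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private' # Give GPU kernels their own dedicated threads
dataset, token_dset = load_datasets(data_files, use_cache=True) # Cache the dataset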
tf.keras.backend.clear_session()  
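
`load_datasets` is defined in `src.routines`, so what `use_cache=True` does internally is not visible here. One plausible reading, sketched under that assumption, is that it adds `Dataset.cache()` ahead of shuffling and batching, so the examples are produced once and later epochs read them from memory. `build_pipeline` and its default sizes are hypothetical:

```python
import tensorflow as tf

def build_pipeline(inputs, targets, batch_size=64, buffer_size=20000, use_cache=False):
    """Build a tf.data input pipeline with optional caching (illustrative only)."""
    ds = tf.data.Dataset.from_tensor_slices((inputs, targets))
    if use_cache:
        # Cache before shuffling so each epoch still sees a new order
        ds = ds.cache()
    ds = ds.shuffle(buffer_size).batch(batch_size)
    # Overlap input preparation with the training step
    return ds.prefetch(tf.data.AUTOTUNE)

# Tiny synthetic example: 10 token sequences of length 4, batches of 4
toy_inputs = tf.zeros([10, 4], dtype=tf.int32)
toy_targets = tf.zeros([10, 4], dtype=tf.int32)
toy_ds = build_pipeline(toy_inputs, toy_targets, batch_size=4, buffer_size=10, use_cache=True)
num_batches = len(list(toy_ds))
```

With 10 examples and a batch size of 4, the toy dataset yields two full batches and one partial batch.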

In [8]:
# Model name 
model_name = 'transformer-ColabFlagNCache'
# Create model
transformer = Transformer(
    vocab_size_enc=token_dset['input']['num_words'], 
    vocab_size_dec=token_dset['target']['num_words'],
    **model_config
)
# Compile model 
compile_model(transformer, model_config)
# Fit with callbacks
fit_model_with_callbacks(transformer, dataset, model_name, num_epochs=2)

Epoch 1/2
Epoch 2/2


## Try 2: Minimize `cast` ops

In [6]:
# One example where replacing a `cast` op with `tf.where` gives a small speedup
a = tf.random.normal([10000,100])
%timeit -n 100 tf.cast(tf.math.equal(a, 0), tf.float32)
%timeit -n 100 tf.where(tf.math.equal(a, 0), 1.0, 0.0)

100 loops, best of 5: 127 µs per loop
100 loops, best of 5: 87.3 µs per loop
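
Speed aside, the swap is only safe if both expressions return the same tensor. A small check on a hand-written input (values chosen purely for illustration) suggests that they do for float data:

```python
import tensorflow as tf

# Zeros stand in for padding positions, as in a padding mask
a = tf.constant([[0.0, 1.5], [-2.0, 0.0]])

via_cast = tf.cast(tf.math.equal(a, 0), tf.float32)
via_where = tf.where(tf.math.equal(a, 0), 1.0, 0.0)

# True only if every element matches
same_values = bool(tf.reduce_all(tf.equal(via_cast, via_where)))
same_dtype = via_cast.dtype == via_where.dtype
```

Both variants mark the zero entries with `1.0` and everything else with `0.0`, in `float32`.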


In [8]:
from src.model_mincast import *

In [13]:
# Model name 
model_name = 'transformer-ColabMinCast'
# Create model
transformer = Transformer(
    vocab_size_enc=token_dset['input']['num_words'], 
    vocab_size_dec=token_dset['target']['num_words'],
    **model_config
)
# Compile model 
compile_model(transformer, model_config)
# Fit with callbacks
fit_model_with_callbacks(transformer, dataset, model_name, num_epochs=2)

Epoch 1/2
Epoch 2/2


## Try 3: Turn on `XLA` flags

In [15]:
os.environ['TF_XLA_FLAGS']='--tf_xla_auto_jit=2'

In [16]:
# reload the baseline model definitions instead of the mincast version
from src.model import * 

In [19]:
del scaled_dot_product_attention

In [20]:
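# Compile the attention computation with XLA explicitly.
# Caveat: names pulled in by `from src.model import *` are copies in the notebook
# namespace. If the layers in `src.model` look this function up as a module global,
# they will keep calling the original unless it is also rebound on that module.
@tf.function(jit_compile=True)
def scaled_dot_product_attention(queries, keys, values, mask):
    # Similarity between every query and every key
    product = tf.matmul(queries, keys, transpose_b=True)
    # Scale by sqrt(d_k), the size of the last key dimension
    scaled_product = product / tf.math.sqrt(tf.cast(tf.shape(keys)[-1], tf.float32))
    # Masked positions (mask == 1) get a large negative score, so softmax gives them ~0 weight
    scaled_product += (mask * -1e9)
    return tf.matmul(tf.nn.softmax(scaled_product, axis=-1), values)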
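
When a kernel is recompiled with XLA, it can help to have an independent reference to compare its output against. The following is a NumPy mirror of the same computation, written as a sketch: `attention_reference` is a hypothetical name, the shapes are made up, and unlike the cell above it also returns the attention weights so they can be inspected.

```python
import numpy as np

def attention_reference(queries, keys, values, mask):
    """NumPy version of scaled dot-product attention (mask == 1 hides a position)."""
    d_k = keys.shape[-1]
    scores = queries @ np.swapaxes(keys, -1, -2) / np.sqrt(np.float32(d_k))
    scores = scores + mask * -1e9
    # Subtract the row maximum for a numerically stable softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 3, 4)).astype(np.float32)  # (batch, query_len, depth)
k = rng.standard_normal((2, 5, 4)).astype(np.float32)  # (batch, key_len, depth)
v = rng.standard_normal((2, 5, 4)).astype(np.float32)
mask = np.zeros((2, 1, 5), dtype=np.float32)
mask[:, :, -1] = 1.0  # hide the last key position for every query

out, weights = attention_reference(q, k, v, mask)
```

The output has one row per query, each attention row sums to one, and the hidden key position receives essentially zero weight.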

In [21]:
# Model name 
model_name = 'transformer-ColabXLA'
# Create model
transformer = Transformer(
    vocab_size_enc=token_dset['input']['num_words'], 
    vocab_size_dec=token_dset['target']['num_words'],
    **model_config
)
# Compile model 
compile_model(transformer, model_config)
# Fit with callbacks
fit_model_with_callbacks(transformer, dataset, model_name, num_epochs=2)

Epoch 1/2
Epoch 2/2


# Tensorboard

In [2]:
# Load the TensorBoard notebook extension.
%load_ext tensorboard

In [None]:
# If needed, kill any running TensorBoard process and reload the extension
!ps aux | grep '[/]bin/tensorboard' | awk '{print $2}' | xargs kill
%reload_ext tensorboard

In [None]:
# Launch TensorBoard and navigate to the Profile tab to view performance profile
%tensorboard --logdir=logs
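
Comparing the four runs in the Profile tab is easier when each one writes to its own subdirectory of `logs`. How `fit_model_with_callbacks` actually names its log directories is not shown here; the sketch below assumes a `logs/<model_name>/<timestamp>` layout, and `make_log_dir` is a hypothetical helper:

```python
from datetime import datetime
from pathlib import Path

def make_log_dir(model_name, root="logs", now=None):
    """Build a per-run log directory path (assumed layout, for illustration)."""
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return Path(root) / model_name / stamp

# A fixed timestamp keeps the example reproducible
log_dir = make_log_dir("transformer-ColabBaseline", now=datetime(2021, 11, 1, 12, 0, 0))
print(log_dir.as_posix())  # logs/transformer-ColabBaseline/20211101-120000
```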