# TensorFlow Performance Boosts

## Going back from Eager Execution

- TensorFlow moved to eager execution by default as part of the migration to TF2. However, as with PyTorch, eager execution can be a major source of performance degradation.
- We can use tf.function to compile tensor ops into a graph, much as torch.compile does in PyTorch. It can be applied as a decorator or called directly on a callable such as a Keras model.
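
As a minimal sketch of the decorator form (the function `scaled_sum` and the `trace_count` counter are hypothetical names used only for illustration), the snippet below shows that the Python body runs only while tracing. Later calls with the same input shape and dtype should reuse the traced graph.

```python
import tensorflow as tf

trace_count = 0  # hypothetical counter, only used to observe tracing


@tf.function  # traces the Python function into a graph on the first call
def scaled_sum(x, y):
    global trace_count
    trace_count += 1  # Python side effect: executes during tracing only
    return tf.reduce_sum(x * 2.0 + y)


a = tf.ones([4])
b = tf.ones([4])

first = scaled_sum(a, b)   # traces, then runs the graph
second = scaled_sum(a, b)  # same shapes and dtypes: reuses the existing graph

print(float(first), float(second), trace_count)
```

Calling the function with a new shape or dtype would typically trigger another trace, which is one reason warm-up calls matter when timing.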

In [1]:
# Defining a simple model
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import mixed_precision

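# Default float type for Keras. With the mixed_float16 policy set below, layers
# already compute in float16 and keep float32 variables, so this is likely redundant.
tf.keras.backend.set_floatx('float16')

# Enable Tensor Cores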
prec_policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(prec_policy)

# Define a simple sequential model with convolutions and linear layers
def create_model_internal_activation():
    model = tf.keras.models.Sequential([
        layers.Conv2D(16, [3, 3], activation='relu', input_shape=(32, 32, 3)),
        layers.MaxPooling2D(),
        layers.Conv2D(32, [3, 3], activation='relu'),
        layers.Flatten(),
        layers.Dense(1024, activation='relu'),
        layers.Dense(10),
        layers.Activation('softmax', dtype='float32', name="preds")  # Keep the output softmax in float32 by passing dtype to the Activation constructor
    ])
    return model

internal_model = create_model_internal_activation()



In [2]:
def create_model_external_activation():
    model = tf.keras.models.Sequential([
        layers.Conv2D(16, [3, 3], input_shape=(32, 32, 3)),
        layers.Activation('relu'),
        layers.MaxPooling2D(),
        layers.Conv2D(32, [3, 3]),
        layers.Activation('relu'),
        layers.Flatten(),
        layers.Dense(1024),
        layers.Activation('relu'),
        layers.Dense(10),
        layers.Activation('softmax', dtype='float32', name="preds")  # Keep the output softmax in float32 by passing dtype to the Activation constructor
    ])
    return model

# Enable Tensor Cores (same global policy as above)
prec_policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(prec_policy)

external_model = create_model_external_activation()
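
The comments on the softmax layers say the output should be float32. A small check along the following lines can confirm which dtypes a layer actually uses under the mixed_float16 policy. This is a sketch on two standalone layers, not on the models above; the 2x8 input and the variable names are assumptions for illustration, and the previous global policy is restored at the end.

```python
import tensorflow as tf
from tensorflow.keras import layers, mixed_precision

# Remember the current policy so this check does not leak into other code
previous_policy = mixed_precision.global_policy()
mixed_precision.set_global_policy('mixed_float16')

dense = layers.Dense(10)                                  # picks up the global policy
softmax = layers.Activation('softmax', dtype='float32')   # overrides it for this layer

logits = dense(tf.ones([2, 8]))
probs = softmax(logits)

# compute dtype vs. variable dtype of the Dense layer
print(dense.compute_dtype, dense.variable_dtype)
# dtype of the intermediate logits vs. the final probabilities
print(logits.dtype, probs.dtype)

mixed_precision.set_global_policy(previous_policy)
```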

## Defining the profiler

We will use the TensorFlow experimental profiler API (tf.profiler.experimental), since we are only profiling inference.

For training, use the [tf.keras.callbacks.TensorBoard](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/TensorBoard) callback.
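
For the training case, a sketch of the callback setup might look like the following. The log directory, the batch range `(2, 4)`, and the helper `fit_with_profiling` are assumptions chosen for illustration, not values used elsewhere in this notebook.

```python
import tempfile

import tensorflow as tf

# Hypothetical log directory; point this at your own TensorBoard logdir
log_dir = tempfile.mkdtemp(prefix="logs_training_")

tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    profile_batch=(2, 4),  # profile batches 2 through 4, skipping the first batch
)


def fit_with_profiling(model, dataset, epochs=1):
    """Train a compiled Keras model with the profiling callback attached."""
    return model.fit(dataset, epochs=epochs, callbacks=[tb_callback])
```

Skipping the first batch keeps one-time graph building out of the profile, for the same reason the inference runs below are warmed up first.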

In [3]:
# Method 1: Using the profiler start/stop API - Not my preferred method

# tf.profiler.experimental.start('logs_internal_activation')
# internal_model.predict(tf.random.normal([1, 32, 32, 3]))
# tf.profiler.experimental.stop()

# Method 2: Using the profiler context manager - My preferred method

inp = tf.random.normal([2000, 32, 32, 3])
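# Cast to float16 to match the model's compute dtype. This does not move data:
# TensorFlow places eager tensors on the GPU by default when one is available.
inp = tf.cast(tf.constant(inp), tf.float16)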


In [4]:
# Run model once before profiling to ensure the model is built
internal_model.predict(inp)
external_model.predict(inp) 

# Internal Activation
with tf.profiler.experimental.Profile('logs_internal_activation'):
    internal_model.predict(inp)
    
# External Activation
with tf.profiler.experimental.Profile('logs_external_activation'):
    external_model.predict(inp)

I0000 00:00:1716822745.098845  282445 service.cc:145] XLA service 0x791ff0003120 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1716822745.098868  282445 service.cc:153]   StreamExecutor device (0): NVIDIA GeForce RTX 2070 Super, Compute Capability 7.5
2024-05-27 11:12:25.104773: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-05-27 11:12:25.136512: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8907


[1m 1/63[0m [37m━━━━━━━━━━━━━━━━━━━━[0m [1m38s[0m 618ms/step

I0000 00:00:1716822745.658746  282445 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


63/63 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step


2024-05-27 11:12:26.412675: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2024-05-27 11:12:26.412695: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2024-05-27 11:12:26.412710: I external/local_xla/xla/backends/profiler/gpu/cupti_tracer.cc:1239] Profiler found 1 GPUs
2024-05-27 11:12:26.575189: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:70] Profiler session collecting data.
2024-05-27 11:12:26.577339: I external/local_xla/xla/backends/profiler/gpu/cupti_tracer.cc:1364] CUPTI activity buffer flushed
2024-05-27 11:12:26.587318: I external/local_xla/xla/backends/profiler/gpu/cupti_collector.cc:540]  GpuTracer has collected 1025 callback api events and 1023 activity events. 
2024-05-27 11:12:26.596765: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2024-05-27 11:12:26.597908: I external/local_tsl/tsl/profiler/rpc/client/save_profile.cc:144] Co

63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step


2024-05-27 11:12:26.791729: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:70] Profiler session collecting data.
2024-05-27 11:12:26.793046: I external/local_xla/xla/backends/profiler/gpu/cupti_tracer.cc:1364] CUPTI activity buffer flushed
2024-05-27 11:12:26.802091: I external/local_xla/xla/backends/profiler/gpu/cupti_collector.cc:540]  GpuTracer has collected 1021 callback api events and 1022 activity events. 
2024-05-27 11:12:26.810860: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2024-05-27 11:12:26.810962: I external/local_tsl/tsl/profiler/rpc/client/save_profile.cc:144] Collecting XSpace to repository: logs_external_activation/plugins/profile/2024_05_27_11_12_26/thameem-GE66.xplane.pb


In [5]:
%load_ext tensorboard

In [6]:
# Wrap the eager Keras models in tf.function so that calls run as traced graphs

internal_model_func = tf.function(internal_model)
external_model_func = tf.function(external_model)

In [7]:
# Profile the tf.function models
# Call each function once before profiling so that tracing/graph building is excluded
internal_model_func(inp)
external_model_func(inp)

with tf.profiler.experimental.Profile('logs_internal_activation_func'):
    internal_model_func(inp)
    
with tf.profiler.experimental.Profile('logs_external_activation_func'):
    external_model_func(inp)

2024-05-27 11:12:27.167293: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2024-05-27 11:12:27.167317: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2024-05-27 11:12:27.175107: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:70] Profiler session collecting data.
2024-05-27 11:12:27.176165: I external/local_xla/xla/backends/profiler/gpu/cupti_tracer.cc:1364] CUPTI activity buffer flushed
2024-05-27 11:12:27.180193: I external/local_xla/xla/backends/profiler/gpu/cupti_collector.cc:540]  GpuTracer has collected 26 callback api events and 26 activity events. 
2024-05-27 11:12:27.180464: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2024-05-27 11:12:27.180535: I external/local_tsl/tsl/profiler/rpc/client/save_profile.cc:144] Collecting XSpace to repository: logs_internal_activation_func/plugins/profile/2024_05_27_11_12_27/thameem-GE66.xplane.pb

### Rough results

Simply wrapping the model in tf.function drops the profiled time from over 45 ms to under 1.5 ms. Both numbers come from a repeated run, which excludes first-run discrepancies: tf.function(model) is traced lazily, so the first call includes the graph build/compile time and would throw off the results.
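
To cross-check profiler numbers like these with plain wall-clock timing, a small harness along these lines can help. It is a sketch, not the method used for the numbers above: `time_call` and `toy_step` are hypothetical names, and the toy matmul stands in for the model.

```python
import statistics
import time

import tensorflow as tf


def time_call(fn, *args, warmup=1, repeats=10):
    """Return per-call wall times in milliseconds, excluding warm-up calls."""
    for _ in range(warmup):
        # The first call of a tf.function includes tracing/graph building
        tf.nest.map_structure(lambda t: t.numpy(), fn(*args))
    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        out = fn(*args)
        # Copying the result to the host forces the device work to finish
        tf.nest.map_structure(lambda t: t.numpy(), out)
        timings_ms.append((time.perf_counter() - start) * 1e3)
    return timings_ms


@tf.function
def toy_step(x):
    return tf.reduce_sum(tf.matmul(x, x, transpose_b=True))


x = tf.random.normal([256, 64])
timings = time_call(toy_step, x)
median_ms = statistics.median(timings)
print(f"median: {median_ms:.3f} ms over {len(timings)} runs")
```

The median is reported rather than the mean so that a single slow run has less influence on the result.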

## Suggestions from TF Profiler

Recommendation for Next Step:

- No step time measured. Therefore we cannot tell where the performance bottleneck is.

Tool troubleshooting / FAQ:

- Refer to the TF2 Profiler FAQ

Next tools to use for reducing the input time:

- input_pipeline_analyzer (especially Section 3 for the breakdown of input operations on the Host)
- tf_data_bottleneck_analysis (find the bottleneck in the tf.data input pipeline)
- trace_viewer (look at the activities on the timeline of each Host Thread near the bottom of the trace view)

Next tools to use for reducing the Device time:

- framework_op_stats (identify the time-consuming operations executed on the GPU)
- trace_viewer (look at the activities on the timeline of each GPU in the trace view)

Other useful resources:

- Analyze tf.data performance with the TF Profiler
- Better performance with the tf.data API
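
The "No step time measured" message likely appears because the profiled calls carry no step annotations. One way to mark steps by hand is the tf.profiler.experimental.Trace context manager with a `step_num` argument. The sketch below assumes a toy function and a temporary log directory; `toy_inference`, `num_steps`, and `completed_steps` are hypothetical names.

```python
import tempfile

import tensorflow as tf


@tf.function
def toy_inference(x):
    return tf.nn.relu(tf.matmul(x, x, transpose_b=True))


batch = tf.random.normal([64, 32])
toy_inference(batch)  # warm-up call so tracing is not profiled

log_dir = tempfile.mkdtemp(prefix="logs_step_markers_")  # hypothetical log dir
num_steps = 5
completed_steps = 0

with tf.profiler.experimental.Profile(log_dir):
    for step in range(num_steps):
        # step_num and _r=1 mark this block as one step for the profiler
        with tf.profiler.experimental.Trace("inference", step_num=step, _r=1):
            result = toy_inference(batch)
        completed_steps += 1
```

Whether the overview page then reports a step time may depend on the TensorFlow and profiler plugin versions, so treat this as something to verify rather than a guaranteed fix.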

In [10]:
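# Converting to concrete functions
# Note: convert_variables_to_constants_v2 lives under tensorflow.python, which is not part of the public API
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

# Trace the function for this exact input shape and dtype
input_spec = tf.TensorSpec(
    inp.shape, inp.dtype
)
internal_model_func_concrete = internal_model_func.get_concrete_function(input_spec)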

In [11]:
const_graph = convert_variables_to_constants_v2(internal_model_func_concrete)

2024-05-27 11:35:50.602977: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 1
2024-05-27 11:35:50.603049: I tensorflow/core/grappler/clusters/single_machine.cc:361] Starting new session

In [12]:
# Profile the concrete function

out = const_graph(inp)

with tf.profiler.experimental.Profile('logs_internal_activation_func_concrete'):
    const_graph(inp)

2024-05-27 11:36:14.333387: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2024-05-27 11:36:14.333409: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2024-05-27 11:36:14.341940: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:70] Profiler session collecting data.
2024-05-27 11:36:14.342952: I external/local_xla/xla/backends/profiler/gpu/cupti_tracer.cc:1364] CUPTI activity buffer flushed
2024-05-27 11:36:14.346917: I external/local_xla/xla/backends/profiler/gpu/cupti_collector.cc:540]  GpuTracer has collected 14 callback api events and 14 activity events. 
2024-05-27 11:36:14.347216: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2024-05-27 11:36:14.347402: I external/local_tsl/tsl/profiler/rpc/client/save_profile.cc:144] Collecting XSpace to repository: logs_internal_activation_func_concrete/plugins/profile/2024_05_27_11_36_14/thameem-GE66.xpla

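## Why concrete functions?

- A concrete function is a single traced graph for one input signature, so any non-tensor (Python) values used in the graph are baked in as constants. Freezing it with convert_variables_to_constants_v2 additionally replaces the model's variables with constants.
- Experimentally speaking, it does a much better job of eliminating idle time on the device/GPU. This may vary based on the data load pattern. I'll likely make that its own topic/blog.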
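
To see what freezing changes, the sketch below compares the op types in a concrete function's graph before and after convert_variables_to_constants_v2. The tiny `linear` function, the variable `w`, and the helper `op_types` are hypothetical stand-ins for the model above.

```python
import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import (
    convert_variables_to_constants_v2,
)

w = tf.Variable([[1.0], [2.0]])  # stand-in for model weights


@tf.function
def linear(x):
    return tf.matmul(x, w)


# One traced graph for one input signature
concrete = linear.get_concrete_function(tf.TensorSpec([None, 2], tf.float32))
frozen = convert_variables_to_constants_v2(concrete)


def op_types(func):
    """Set of op types present in the function's graph."""
    return {node.op for node in func.graph.as_graph_def().node}


before = op_types(concrete)  # includes the variable read
after = op_types(frozen)     # variable replaced by a constant

x = tf.constant([[3.0, 4.0]])
frozen_out = tf.nest.flatten(frozen(x))[0]
print(sorted(before - after), sorted(after - before))
```

The frozen function no longer reads the variable at run time, so later updates to `w` would not be reflected in its output.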

## Additional Performance Notes

1. Make dimensions a multiple of 8 where you can. This has been shown to make a difference on Tensor Cores.
2. Use bfloat16 if your device supports it. bfloat16 has the same exponent range as float32, so its dynamic range is far wider than float16's and overflow/underflow is much less of a concern. The trade-off is fewer mantissa bits, so it is less precise than float16.
3. Validate the effect of the input pipeline. More often than not, the pipeline does not supply data fast enough to keep the device/GPU busy.
4. The previous point can extend to custom layers if they are not written well.
5. Profile before you dive in.
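
The first two notes can be made concrete with a little arithmetic. The sketch below derives the largest finite value and machine epsilon of float16 (5 exponent bits, 10 mantissa bits) and bfloat16 (8 exponent bits, 7 mantissa bits) from their bit layouts, and rounds a layer size up to the next multiple of 8. The helper names are hypothetical.

```python
def format_limits(exponent_bits, mantissa_bits):
    """Largest finite value and machine epsilon of an IEEE-style binary float."""
    max_exponent = 2 ** (exponent_bits - 1) - 1
    largest = (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent
    epsilon = 2.0 ** -mantissa_bits
    return largest, epsilon


def round_up_to_multiple(size, multiple=8):
    """Smallest multiple of `multiple` that is >= size."""
    return ((size + multiple - 1) // multiple) * multiple


fp16_max, fp16_eps = format_limits(exponent_bits=5, mantissa_bits=10)
bf16_max, bf16_eps = format_limits(exponent_bits=8, mantissa_bits=7)

print(f"float16 : max={fp16_max:.4g}, eps={fp16_eps:.3g}")
print(f"bfloat16: max={bf16_max:.4g}, eps={bf16_eps:.3g}")
print(round_up_to_multiple(10), round_up_to_multiple(1024))
```

For example, a 10-unit output layer like the one in the models above would round up to 16 units, at the cost of slicing off the padding afterwards.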

## Additional potentially dangerous optimizations

1. You can disable 