Tracing and profiling tool for TensorFlow. Supports the CUDA, HIP and SYCL versions of TensorFlow.
- Install LTTng : https://lttng.org/
- Install Babeltrace and its Python bindings : https://github.com/pzins/babeltrace (check out the "conversion_atp_to_ctf" branch)
- Go to the tracers folder and run make
The first step is to install an instrumented version of TensorFlow :
- TensorFlow 1.6 CUDA : https://github.com/pzins/tensorflow
- TensorFlow 1.3 HIP/ROCM : https://github.com/pzins/tensorflow-rocm
- TensorFlow 1.0 HIP/ROCM : https://github.com/pzins/hiptensorflow
- TensorFlow 1.6 SYCL : https://github.com/pzins/tensorflow-sycl
- Check out the lttng branch of the repository you cloned.
- Follow the classic instructions to build TensorFlow.
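The clone and checkout steps above might look like the following sketch. The `run` and `fetch_instrumented` helpers and the `DRY_RUN` variable are hypothetical and not part of this project; only the repository URLs and the lttng branch name come from the list above.

```shell
# Hypothetical helper: print the command when DRY_RUN is set, otherwise run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: clone one of the instrumented forks listed above
# and switch the working copy to its lttng branch.
fetch_instrumented() {
    local url="$1"
    local dir
    dir="$(basename "$url")"          # e.g. tensorflow-rocm
    run git clone "$url" "$dir" || return 1
    run git -C "$dir" checkout lttng
}

# Example:
#   fetch_instrumented https://github.com/pzins/tensorflow-rocm
```

Setting `DRY_RUN=1` before calling the function only prints the git commands, which is a cheap way to check them before cloning a large repository.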
Nothing additional is needed.
Install the ROCm platform : https://rocm.github.io/ROCmInstall.html
- Clone, checkout lttng branch and build ROCR-Runtime : https://github.com/pzins/ROCR-Runtime
- Replace /opt/rocm/hsa/lib/libhsa-runtime64.so with the newly built libhsa-runtime64.so
It's also possible to profile the HSA API with interception libraries.
- Clone, checkout lttng branch and build HIP : https://github.com/pzins/HIP
The first possibility is to rebuild an instrumented version of HC.
- Clone, checkout lttng branch and build HC : https://github.com/pzins/hcc
When rebuilding HC is not possible, there are two other options :
- Using log output from HC and parsing it (automated in scripts)
- Using interception libraries
These libraries can be loaded with LD_PRELOAD to collect the following information :
- HSA API
- GPU kernels begin/end
- Performance counters
Build instructions :
- Go to tensorflow-profiler/interception-libraries
- make
- Output libraries are in tensorflow-profiler/interception-libraries/lib/
- Before running your application, add the libraries you want to LD_PRELOAD
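As a minimal sketch of the last step, the helper below joins every .so file found in a directory into a colon-separated list that can be assigned to LD_PRELOAD. The function name `preload_list` and the application name in the example are hypothetical; the library file names are not listed here, so the sketch simply globs the output directory.

```shell
# Hypothetical helper: build a colon-separated list of all .so files in a
# directory, suitable for use as the value of LD_PRELOAD.
preload_list() {
    local dir="$1" list="" lib
    for lib in "$dir"/*.so; do
        [ -e "$lib" ] || continue     # skip the unexpanded pattern when nothing matches
        list="${list:+$list:}$lib"    # add a ':' separator only between entries
    done
    echo "$list"
}

# Example (my_model.py is a placeholder for your application):
#   LD_PRELOAD="$(preload_list tensorflow-profiler/interception-libraries/lib)" python my_model.py
```

This preloads every library in the directory. To collect only some of the information listed above, set LD_PRELOAD by hand to the paths of the libraries you need.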
There are several ways to profile an application.
tf_tracer.sh
Use the scripts in scripts/ :
- start_tracing.sh : start lttng tracing
- stop_tracing.sh : stop lttng tracing
- set_env.sh : set up the environment before tracing (needed when using HIP/ROCm)
- post_process.sh : post processing script
- Move all the asynchronous events to their correct position in the trace
- Match all the GPU kernels with the corresponding TensorFlow operation
- trace_analysis : get textual statistics of a trace
- perfcounters_analysis.py : parse RCP performance counters and match the values with the callstack trace
- perfcounters_interception_analysis.py : parse the performance counters trace obtained with the interception libraries and write it to a CSV file
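For orientation, the kind of session that start_tracing.sh and stop_tracing.sh manage can be sketched with the stock lttng command-line client. This is a generic LTTng userspace session and an assumption about the workflow, not the contents of those scripts; the `run` and `trace_command` helpers, the `DRY_RUN` variable and the session name are hypothetical.

```shell
# Hypothetical helper: print the command when DRY_RUN is set, otherwise run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: record all userspace tracepoints while a command runs.
# Usage: trace_command <session-name> <command> [args...]
trace_command() {
    local session="$1"
    shift
    run lttng create "$session" || return 1
    run lttng enable-event --userspace --all   # enable every userspace tracepoint
    run lttng start
    run "$@"                                   # the application being traced
    run lttng stop
    run lttng destroy "$session"               # closes the session, the trace files remain on disk
}

# Example (train.py is a placeholder for your application):
#   trace_command tf_session python train.py
```

The resulting trace still needs the post processing step described above before the asynchronous events and GPU kernels line up with the TensorFlow operations.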
Only basic "in model" parallelism is supported, where the graph is split between two machines, a worker and a master.
The instrumentation is available only with :
- TensorFlow 1.0 ROCM/HIP
- TensorFlow 1.3 ROCM/HIP
- TensorFlow 1.6 SYCL
Scripts :
- fabfile.py : Fabric file to automate an execution
- scripts/grpc_worker.sh : Used by the fabfile, for the worker computer
- scripts/grpc_master.sh : Used by the fabfile, for the master computer