# GPU efficiency

## GPU compute efficiency methodology

Separate the kernels into two groups:
- kernels with a zero FLOPs value
- kernels with a non-zero FLOPs value

The GPU compute efficiency is the total duration of the kernels with a non-zero FLOPs value divided by the total duration of all kernels.
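
As a minimal sketch of this split, the snippet below applies the definition to a handful of made-up kernel records. The kernel names, durations and FLOPs counts are illustrative assumptions, not measurements, and `toy_compute_efficiency` is a hypothetical helper rather than the function used later in this notebook.

```python
# Hypothetical (kernel name, duration in ns, FLOPs) records; values are illustrative only.
toy_kernels = [
    ('conv', 600, 1200),
    ('memcpy', 300, 0),
    ('conv', 200, 400),
    ('relu', 100, 0),
]

def toy_compute_efficiency(kernels):
    """Share of kernel time (in %) spent in kernels that report FLOPs."""
    compute_ns = sum(ns for _, ns, flops in kernels if flops != 0)
    total_ns = sum(ns for _, ns, _ in kernels)
    return compute_ns / total_ns * 100

toy_result = toy_compute_efficiency(toy_kernels)
print(f'{toy_result:.2f} %')
```

Here 800 ns out of 1200 ns are spent in kernels that report FLOPs, so the toy efficiency is about 66.67 %. The sketch tests each record individually, whereas the notebook first sums per kernel name; for this toy data both give the same result.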

## GPU FLOPS efficiency methodology

GPU: NVIDIA Tesla K40m  
FP32 theoretical flops: 4.29 TFLOPS

Compute the total FLOPs of all the kernels.  
Compute the total duration of all the kernels.  
Compute the FLOPs per second of the GPU: `total_FLOPs / total_duration`  
Compute the FLOPS efficiency: `FLOPs_per_sec_GPU / theoretical_flops`  

This methodology is equivalent to averaging the FLOPs per second of each kernel, weighted by each kernel's share of the total duration.
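
The equivalence can be checked with a short derivation. The symbols are notation introduced here for illustration: $F_i$ is the FLOPs count of kernel $i$, $t_i > 0$ its duration, and $T = \sum_i t_i$ the total duration.

$$
\sum_i \frac{t_i}{T} \cdot \frac{F_i}{t_i} \;=\; \frac{1}{T} \sum_i F_i \;=\; \frac{\text{total\_FLOPs}}{\text{total\_duration}}
$$

The durations $t_i$ cancel inside the sum, so the duration-weighted average of the per-kernel rates is the same number as the ratio of the totals.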

## GPU bandwidth efficiency methodology

GPU: NVIDIA Tesla K40m  
Memory bandwidth: 288 GB/s

Compute the bandwidth of each kernel: `bytes_in_and_out / duration_of_kernel`<br/>
Compute the duration weight of each kernel: `duration_of_kernel / sum_durations_of_all_kernels`<br/>
Determine the bandwidth of the GPU by computing the weighted average of the bandwidth of each kernel.<br/>
Compute the bandwidth efficiency: `bandwidth_GPU / theoretical_bandwidth`<br/>
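
The steps above can be sketched on toy numbers. The byte counts and durations below are hypothetical and only serve to illustrate the arithmetic; the sketch also computes the bandwidth directly from the totals, which should give the same value as the duration-weighted average.

```python
# Hypothetical (bytes moved, duration in ns) per kernel; values are illustrative only.
toy_transfers = [(4_000, 20), (1_000, 10), (9_000, 30)]
TOY_PEAK_GBPS = 288  # theoretical bandwidth quoted above, in GB/s

total_ns = sum(ns for _, ns in toy_transfers)

# Duration-weighted average of the per-kernel bandwidth.
# Bytes per nanosecond is numerically equal to GB/s (1e9 bytes per second).
weighted_gbps = sum((nbytes / ns) * (ns / total_ns) for nbytes, ns in toy_transfers)

# Same quantity computed directly from the totals.
direct_gbps = sum(nbytes for nbytes, _ in toy_transfers) / total_ns

toy_bandwidth_efficiency = weighted_gbps / TOY_PEAK_GBPS * 100  # in %
print(weighted_gbps, direct_gbps, toy_bandwidth_efficiency)
```

Both routes give roughly 233.3 GB/s for this toy data, because each kernel's duration cancels between its bandwidth and its weight.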

In [1]:
import pandas as pd
import glob
import itertools
import numpy as np
pd.set_option('display.max_rows', 90)

In [2]:
THEORETICAL_FLOPS = 4.29 # in TFLOPS
THEORETICAL_BANDWIDTH = 288 # in GB/s

In [3]:
# select configurations
c_num_nodes = [1, 8, 16, 32]
c_gpus_per_node = [2]
c_network_backend = ['ib']
c_profile_level = ['nvprof']
c_workers = [2, 8]
c_neural_network = ['resnet50']
c_data_loader = ['dali-gpu', 'dali-cpu-to-gpu']
c_batch_size_per_gpu = [32, 64]
c_grad_precision = ['fp16', 'fp32']
c_compute_precision = ['fp32']

def configurations():
    
    def to_str(l):
        return [str(elem) for elem in l]

    confs = [
            to_str(c_num_nodes),
            to_str(c_gpus_per_node),
            c_network_backend,
            c_profile_level,
            to_str(c_workers),
            c_neural_network,
            c_data_loader,
            to_str(c_batch_size_per_gpu),
            c_grad_precision,
            c_compute_precision
            ]
    
    return itertools.product(*confs)

In [4]:
def compute_efficiency(df_csv):
    """Return the share of kernel time (in %) spent in kernels that report FLOPs."""
    # All kernels are used; the profile is not filtered by 'Direction' (e.g. 'fprop').
    per_kernel = df_csv[['Kernel', 'Sil(ns)', 'FLOPs']].groupby('Kernel').sum()
    no_compute = per_kernel.loc[per_kernel['FLOPs'] == 0, 'Sil(ns)'].sum()
    compute = per_kernel.loc[per_kernel['FLOPs'] != 0, 'Sil(ns)'].sum()
    return compute / (no_compute + compute) * 100 # In %

In [5]:
def flops_per_sec(df_csv):
    total_duration = df_csv['Sil(ns)'].sum() * 1E-9 # in seconds
    total_FLOPs = df_csv['FLOPs'].sum()
    return total_FLOPs / total_duration * 1E-12 # in teraFLOPs per second (TFLOPS)

In [6]:
def bandwidth(df_csv):
    # Use local Series so that the caller's DataFrame is not modified.
    kernel_bandwidth = 1E9 * df_csv['Bytes'] / df_csv['Sil(ns)'] # Bandwidth in bytes/sec
    weight = df_csv['Sil(ns)'] / df_csv['Sil(ns)'].sum() # Share of the total kernel duration
    return (kernel_bandwidth * weight).sum() * 1E-9 # Weighted average, converted to GB/s

In [7]:
rows = []
for conf in configurations():
    node, gpu, network, profile, workers, nn, data_loader, batch_size, grad, comp = conf
    file = f'data/pcm/pyprof_kernels/run_0_config_{node}_{gpu}_{network}_{profile}_{workers}_{nn}_{data_loader}_{batch_size}_{grad}_{comp}_ret_0_0.gzip'

    df = pd.read_parquet(file)
    row = [*conf, compute_efficiency(df), flops_per_sec(df), bandwidth(df)]
    rows.append(row)
#     break

efficiency_df = pd.DataFrame(rows, columns=['nodes', 'gpus_per_node', 'network_backend', 'profile_level', 'workers', 'nn', 'data_loader', 'batch_size', 'grad', 'comp', 'compute_efficiency (%)', 'TFLOPs per sec', 'bandwidth(GB/s)'])

In [8]:
efficiency_df['bandwidth_efficiency (%)'] = efficiency_df['bandwidth(GB/s)'] / THEORETICAL_BANDWIDTH * 100 # in %
efficiency_df['FLOPS_efficiency (%)'] = efficiency_df['TFLOPs per sec'] / THEORETICAL_FLOPS * 100 # in %
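# numeric_only=True skips the string configuration columns, which recent pandas versions refuse to average
group_df = efficiency_df.groupby(['batch_size', 'nodes']).mean(numeric_only=True)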
group_df

| batch_size | nodes | compute_efficiency (%) | TFLOPs per sec | bandwidth(GB/s) | bandwidth_efficiency (%) | FLOPS_efficiency (%) |
|---|---|---|---|---|---|---|
| 32 | 1 | 68.791497 | 0.347732 | 25.976727 | 9.019697 | 8.105632 |
| 32 | 16 | 69.840852 | 0.353306 | 26.044084 | 9.043085 | 8.235571 |
| 32 | 32 | 69.807003 | 0.353268 | 26.052055 | 9.045853 | 8.234676 |
| 32 | 8 | 69.936754 | 0.354724 | 26.109891 | 9.065934 | 8.26863 |
| 64 | 1 | 59.959489 | 0.386201 | 28.071793 | 9.74715 | 9.002347 |
| 64 | 16 | 61.145733 | 0.392053 | 28.096251 | 9.755643 | 9.138773 |
| 64 | 32 | 61.107659 | 0.391818 | 28.079357 | 9.749777 | 9.133285 |
| 64 | 8 | 61.224527 | 0.392876 | 28.143853 | 9.772171 | 9.157942 |


# Kernel-level analysis

For the configuration loaded below, the most efficient kernel reaches a FLOPS efficiency of 34.81 %. However, most kernels report no FLOPs and therefore have a 0 % efficiency.

In [11]:
file = 'data/pcm/pyprof_kernels/run_0_config_8_2_ib_nvprof_2_resnet50_dali-gpu_32_fp16_fp32_ret_0_0.gzip'
df = pd.read_parquet(file)

In [12]:
# Kernel-level analysis
# df_forward = df[df['Direction'] == 'fprop'][['Kernel', 'Sil(ns)', 'FLOPs']]
df_forward = df[['Kernel', 'Sil(ns)', 'FLOPs']]
df_forward_sum = df_forward.groupby('Kernel').sum()
duration_kernels = df_forward_sum.sort_values(by='Sil(ns)', ascending=False)

In [13]:
duration_kernels['FLOPs per sec'] = duration_kernels['FLOPs'] / duration_kernels['Sil(ns)'] * 1E9
duration_kernels['FLOPS efficiency (%)'] = duration_kernels['FLOPs per sec'] * 1E-12 / THEORETICAL_FLOPS * 100
final = duration_kernels.sort_values(by=['Sil(ns)', 'FLOPS efficiency (%)'], ascending=False)
final.sort_values(by='FLOPS efficiency (%)', ascending=False)

| Kernel | Sil(ns) | FLOPs | FLOPs per sec | FLOPS efficiency (%) |
|---|---|---|---|---|
| `cudnn_convolve_sgemm_sm35_ldg_nn_64x16x64x16x16` | 529095040 | 790176500000.0 | 1493449000000.0 | 34.812331 |
| `cudnn::detail::implicit_convolve_sgemm` | 5755177265 | 8142019000000.0 | 1414729000000.0 | 32.977376 |
| `cudnn_convolve_sgemm_sm35_ldg_nn_32x16x64x8x16` | 5097958 | 6576669000.0 | 1290059000000.0 | 30.071315 |
| `cudnn::detail::dgrad_engine` | 3396871034 | 2157969000000.0 | 635281500000.0 | 14.808427 |
| `cudnn::detail::wgrad_alg0_engine` | 5908700802 | 3559365000000.0 | 602393800000.0 | 14.041815 |
| `cudnn::winograd::winograd3x3Kernel` | 52062179 | 29595010000.0 | 568455100000.0 | 13.250701 |
| `sgemm_sm_heavy_nt_ldg` | 9172068 | 3276800000.0 | 357258600000.0 | 8.327706 |
| `fermiPlusCgemmLDS128_batched` | 226900852 | 55079600000.0 | 242747400000.0 | 5.658448 |
| `cudnn::detail::bn_fw_tr_1C11_kernel_NCHW` | 1041324312 | 111366600000.0 | 106947100000.0 | 2.49294 |
| `cudnn::detail::bn_fw_tr_1C11_singleread` | 355409374 | 33737540000.0 | 94925860000.0 | 2.212724 |


# Scaled FLOPS and bandwidth efficiencies

## Methodology

Sum the FLOPs (or the transferred bytes) of all kernels and divide by the total duration of the profiled batches, that is, the number of profiled batches times the duration of a single batch. The duration of a single batch is taken as the minimum of the tau batch duration, the nvprof batch duration, and the pyprof batch duration.
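
A minimal sketch of this computation on hypothetical numbers follows. The three batch durations and the FLOPs and byte totals are made up for illustration, and the count of 25 profiled batches is an assumption taken from the code further down in this notebook.

```python
# Hypothetical per-batch duration estimates in seconds (illustrative only).
toy_tau_s, toy_nvprof_s, toy_pyprof_s = 0.52, 0.50, 0.55
toy_batch_s = min(toy_tau_s, toy_nvprof_s, toy_pyprof_s)

TOY_PROFILED_BATCHES = 25  # batches covered by the kernel trace (assumed)
TOY_PEAK_TFLOPS = 4.29     # theoretical FP32 peak quoted above, in TFLOPS
TOY_PEAK_GBPS = 288        # theoretical memory bandwidth quoted above, in GB/s

toy_total_flops = 1.0e13   # hypothetical sum of FLOPs over all kernels
toy_total_bytes = 6.0e11   # hypothetical sum of transferred bytes over all kernels

# Duration of all profiled batches, in seconds.
wall_s = TOY_PROFILED_BATCHES * toy_batch_s

toy_flops_eff = toy_total_flops / wall_s * 1e-12 / TOY_PEAK_TFLOPS * 100  # in %
toy_bw_eff = toy_total_bytes / wall_s * 1e-9 / TOY_PEAK_GBPS * 100        # in %
print(toy_flops_eff, toy_bw_eff)
```

With these toy values the 25 batches last 12.5 s, giving 0.8 TFLOPS and 48 GB/s, or roughly 18.6 % and 16.7 % of the theoretical peaks.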

## Tau batch duration

In [18]:
df_init_times = pd.read_parquet('data/mpi/batch1_5runs_interval.gzip')

def get_init_time(conf):
    nodes, gpus, network, profile, workers, nn, data_loader, batch_size, grad, comp = conf
    
    init = df_init_times[
        (df_init_times['nodes'] == int(nodes)) & 
        (df_init_times['workers'] == int(workers)) & 
        (df_init_times['data_loader'] == data_loader) & 
        (df_init_times['batch_size_per_gpu'] == int(batch_size)) & 
        (df_init_times['grad_precision'] == grad) &
        (df_init_times['gpu'] == 0) &
        (df_init_times['thread'] == 0) &
        (df_init_times['function'] == '.tau application')
    ].inc_time.values
    init_time = np.median(init) * 1E-6 # in seconds
    
    return init_time

In [19]:
tau_file = 'data/mpi/interval.gzip'
df_tau = pd.read_parquet(tau_file)
NUM_BATCHES = 50

def tau_batch_duration(conf):
    node, gpu, network, profile, workers, nn, data_loader, batch_size, grad, comp = conf
    
    df_conf = df_tau[
        (df_tau['nodes']== int(node)) & 
        (df_tau['workers'] == int(workers)) & 
        (df_tau['data_loader'] == data_loader) & 
        (df_tau['batch_size_per_gpu'] == int(batch_size)) &
        (df_tau['grad_precision'] == grad) &
        (df_tau['gpu'] == 0) &
        (df_tau['thread'] == 0) &
        (df_tau['function'] == '.tau application')
    ]
    
    exp_duration = np.max(df_conf.inc_time.values) * 1E-6 - get_init_time(conf)
    batch_duration_tau = exp_duration / NUM_BATCHES
    
    return batch_duration_tau

## Nvprof batch duration

In [20]:
def nvprof_batch_duration(conf):    
    node, gpu, network, profile, workers, nn, data_loader, batch_size, grad, comp = conf
    
    nvprof_marker_file = f'data/pcm/nvprof_markers/markers_run_0_config_{node}_{gpu}_{network}_{profile}_{workers}_{nn}_{data_loader}_{batch_size}_{grad}_{comp}_ret_0_0.csv'
    df_nvprof = pd.read_csv(nvprof_marker_file)
    
    end_batch_markers = df_nvprof[df_nvprof['name_str'] == 'End of Batch'].timestamp.values
    batch_duration_nvprof = np.median(np.diff(end_batch_markers) * 1E-9)
    
    return batch_duration_nvprof

## Pyprof batch duration

In [21]:
def pyprof_batch_duration(conf):
    node, gpu, network, profile, workers, nn, data_loader, batch_size, grad, comp = conf
    
    nvprof_marker_file = f'data/pcm/nvprof_markers/markers_run_0_config_{node}_{gpu}_{network}_{profile}_{workers}_{nn}_{data_loader}_{batch_size}_{grad}_{comp}_ret_0_0.csv'
    df_nvprof = pd.read_csv(nvprof_marker_file)
    
    forward_markers = df_nvprof[df_nvprof['name_str'] == 'Forward Pass'].timestamp.values # pyprof batches do not have 'End of Batch' markers
    batch_duration_pyprof = np.median(np.diff(forward_markers[26:]) * 1E-9) # Only last 25 batches have pyprof enabled
    
    return batch_duration_pyprof

In [29]:
def batch_duration(conf):
    return min(tau_batch_duration(conf), nvprof_batch_duration(conf), pyprof_batch_duration(conf))

## Scaled efficiencies

In [23]:
data = []

for conf in configurations():
#     print(conf)
    node, gpu, network, profile, workers, nn, data_loader, batch_size, grad, comp = conf
    file = f'data/pcm/pyprof_kernels/run_0_config_{node}_{gpu}_{network}_{profile}_{workers}_{nn}_{data_loader}_{batch_size}_{grad}_{comp}_ret_0_0.gzip'
    df = pd.read_parquet(file)
    
    batch = batch_duration(conf)
    
    bandwidth_efficiency = df['Bytes'].sum() / (25 * batch) * 1E-9 / THEORETICAL_BANDWIDTH * 100
    flop_efficiency = df['FLOPs'].sum() / (25 * batch) * 1E-12 / THEORETICAL_FLOPS * 100
    
    data.append((*conf, flop_efficiency, bandwidth_efficiency))

# The column order must match the appended tuples: FLOPS efficiency first, then bandwidth efficiency.
df_efficiencies = pd.DataFrame(data, columns=['nodes', 'gpu_per_nodes', 'network_backend', 'profile', 'workers', 'nn', 'data_loader', 'batch_size', 'grad', 'comp', 'FLOPS_efficiency (%)', 'bandwidth_efficiency (%)'])

In [24]:
df_efficiencies

| | nodes | gpu_per_nodes | network_backend | profile | workers | nn | data_loader | batch_size | grad | comp | FLOPS_efficiency (%) | bandwidth_efficiency (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | ib | nvprof | 2 | resnet50 | dali-gpu | 32 | fp16 | fp32 | 15.714534 | 17.460536 |
| 1 | 1 | 2 | ib | nvprof | 2 | resnet50 | dali-gpu | 32 | fp32 | fp32 | 15.082657 | 16.758452 |
| 2 | 1 | 2 | ib | nvprof | 2 | resnet50 | dali-gpu | 64 | fp16 | fp32 | 18.254345 | 19.67458 |
| 3 | 1 | 2 | ib | nvprof | 2 | resnet50 | dali-gpu | 64 | fp32 | fp32 | 17.532566 | 18.896645 |
| 4 | 1 | 2 | ib | nvprof | 2 | resnet50 | dali-cpu-to-gpu | 32 | fp16 | fp32 | 12.927701 | 14.364065 |
| 5 | 1 | 2 | ib | nvprof | 2 | resnet50 | dali-cpu-to-gpu | 32 | fp32 | fp32 | 13.101616 | 14.557303 |
| 6 | 1 | 2 | ib | nvprof | 2 | resnet50 | dali-cpu-to-gpu | 64 | fp16 | fp32 | 15.56221 | 16.772991 |
| 7 | 1 | 2 | ib | nvprof | 2 | resnet50 | dali-cpu-to-gpu | 64 | fp32 | fp32 | 14.208194 | 15.500853 |
| 8 | 1 | 2 | ib | nvprof | 8 | resnet50 | dali-gpu | 32 | fp16 | fp32 | 15.911288 | 17.893018 |
| 9 | 1 | 2 | ib | nvprof | 8 | resnet50 | dali-gpu | 32 | fp32 | fp32 | 14.994187 | 16.660152 |


### Gradient precision

In [25]:
df_efficiencies.groupby(['grad', 'nodes']).mean(numeric_only=True)

| grad | nodes | FLOPS_efficiency (%) | bandwidth_efficiency (%) |
|---|---|---|---|
| fp16 | 1 | 15.039161 | 16.469099 |
| fp16 | 16 | 18.667964 | 20.168049 |
| fp16 | 32 | 19.079883 | 20.620989 |
| fp16 | 8 | 19.568244 | 21.108341 |
| fp32 | 1 | 14.477612 | 15.903495 |
| fp32 | 16 | 17.160375 | 18.57669 |
| fp32 | 32 | 17.184887 | 18.60448 |
| fp32 | 8 | 17.763889 | 19.232685 |


### Data loader

In [26]:
df_efficiencies.groupby(['data_loader', 'nodes']).mean(numeric_only=True)

| data_loader | nodes | FLOPS_efficiency (%) | bandwidth_efficiency (%) |
|---|---|---|---|
| dali-cpu-to-gpu | 1 | 12.794795 | 14.037066 |
| dali-cpu-to-gpu | 16 | 18.153467 | 19.582588 |
| dali-cpu-to-gpu | 32 | 18.192446 | 19.65986 |
| dali-cpu-to-gpu | 8 | 18.694219 | 20.16695 |
| dali-gpu | 1 | 16.721979 | 18.335528 |
| dali-gpu | 16 | 17.674872 | 19.162151 |
| dali-gpu | 32 | 18.072324 | 19.565608 |
| dali-gpu | 8 | 18.637914 | 20.174076 |


### Batch size

In [27]:
df_efficiencies.groupby(['batch_size', 'nodes']).mean(numeric_only=True)

| batch_size | nodes | FLOPS_efficiency (%) | bandwidth_efficiency (%) |
|---|---|---|---|
| 32 | 1 | 13.835465 | 15.399422 |
| 32 | 16 | 16.179725 | 17.769747 |
| 32 | 32 | 16.635425 | 18.272546 |
| 32 | 8 | 17.27231 | 18.939843 |
| 64 | 1 | 15.681308 | 16.973172 |
| 64 | 16 | 19.648614 | 20.974992 |
| 64 | 32 | 19.629345 | 20.952922 |
| 64 | 8 | 20.059823 | 21.401183 |


### Workers

In [28]:
df_efficiencies.groupby(['workers', 'nodes']).mean(numeric_only=True)

| workers | nodes | FLOPS_efficiency (%) | bandwidth_efficiency (%) |
|---|---|---|---|
| 2 | 1 | 15.297978 | 16.748178 |
| 2 | 16 | 17.674905 | 19.162242 |
| 2 | 32 | 18.259636 | 19.731943 |
| 2 | 8 | 18.650103 | 20.190598 |
| 8 | 1 | 14.218796 | 15.624416 |
| 8 | 16 | 18.153434 | 19.582497 |
| 8 | 32 | 18.005134 | 19.493525 |
| 8 | 8 | 18.68203 | 20.150428 |
