# G4Hive performance analysis

Let's look at some measurements of G4Hive jobs for different numbers of threads and make some plots. We want to see how memory and throughput scale with the number of threads, and how the algorithms in the job are timed.

In [1]:
import os
import numpy as np
from matplotlib import pyplot as plt
import matplotlib.patches as mpatch
%matplotlib notebook

In [2]:
# Local imports
from utils.prep import parse_job_results, load_job_results
from utils.timing import (get_job_time, get_evloop_time,
                          get_initialization_time, get_finalization_time,
                          print_timing_summary, get_throughput, get_avg_throughput,
                          calc_alg_timings, get_alg_duration_map)
from utils.memory import get_max_mem, get_mem_data

## Prepare the data 

The results come in the form of log files. A memory monitor runs alongside the job to measure its memory consumption as a function of time. The job also dumps a timeline log which shows the start and end times of every algorithm per thread and event slot. From these files we can extract everything we need.

In [3]:
results_dir = 'results_endeavour_1mu'

In [4]:
ls $results_dir/

log.100_0_100000.log       mem.248_0_248000.csv
log.104_0_104000.log       mem.24_0_24000.csv
log.108_0_108000.log       mem.256_0_256000.csv
log.10_0_10000.log         mem.26_0_26000.csv
log.112_0_112000.log       mem.28_0_28000.csv
log.116_0_116000.log       mem.2_0_2000.csv
log.11_0_11000.log         mem.30_0_30000.csv
log.120_0_120000.log       mem.32_0_32000.csv
log.124_0_124000.log       mem.34_0_34000.csv
log.128_0_128000.log       mem.36_0_36000.csv
log.12_0_12000.log         mem.38_0_38000.csv
log.132_0_132000.log       mem.3_0_3000.csv
log.136_0_136000.log       mem.40_0_40000.csv
log.13_0_13000.log         mem.42_0_42000.csv
log.140_0_140000.log       mem.44_0_44000.csv
log.144_0_144000.log       mem.46_0_46000.csv
log.148_0_148000.log       mem.48_0_48000.csv
log.14_0_14000.log         mem.4_0_4000.csv
log.152_0_152000.log       mem.52_0_52000.csv
log.15_0_15000.log         mem.56_0_56000.csv
log.160_0_160000.log       mem.5_0_5000.csv
log.168_0_168000.

Parse the log files to get a list of JobResult objects, or load them from a pre-processed pickle file.

In [5]:
# Use a pre-processed pickle file
use_pickle = False

# Load or parse the results
if use_pickle:
    job_results = load_job_results(os.path.join(results_dir, 'results.pickle'))
else:
    job_results = parse_job_results(results_dir, verbose=True)

Using results directory: results_endeavour_1mu
215 total files
71 memory log files
71 timeline log files
Processed 100 thread 0 proc 100000 event
Processed 104 thread 0 proc 104000 event
Processed 108 thread 0 proc 108000 event
Processed 10 thread 0 proc 10000 event
Processed 112 thread 0 proc 112000 event
Processed 116 thread 0 proc 116000 event
Processed 11 thread 0 proc 11000 event
Processed 120 thread 0 proc 120000 event
Processed 124 thread 0 proc 124000 event
Processed 128 thread 0 proc 128000 event
Processed 12 thread 0 proc 12000 event
Processed 132 thread 0 proc 132000 event
Processed 136 thread 0 proc 136000 event
Processed 13 thread 0 proc 13000 event
Processed 140 thread 0 proc 140000 event
Processed 144 thread 0 proc 144000 event
Processed 148 thread 0 proc 148000 event
Processed 14 thread 0 proc 14000 event
Processed 152 thread 0 proc 152000 event
Processed 15 thread 0 proc 15000 event
Processed 160 thread 0 proc 160000 event
Processed 168 thread 0 proc 168000 event
Proce
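
The file names and the "Processed ..." messages above suggest that each results file encodes the job configuration as `<threads>_<procs>_<events>`. The sketch below shows one way such names could be decoded and sorted numerically. It is not the code in `utils.prep`: the `JobConfig` tuple, `FILE_PATTERN` and `parse_result_filename` are hypothetical names introduced for illustration, and the pattern itself is an assumption inferred from the listing.

```python
import re
from collections import namedtuple

# Hypothetical container for illustration; the real JobResult lives in utils.prep
JobConfig = namedtuple('JobConfig', ['kind', 'n_thread', 'n_proc', 'n_event'])

# Assumed naming scheme: <kind>.<threads>_<procs>_<events>.<extension>
FILE_PATTERN = re.compile(r'^(log|mem)\.(\d+)_(\d+)_(\d+)\.(?:log|csv)$')

def parse_result_filename(filename):
    """Return a JobConfig for a results file name, or None if it does not match."""
    match = FILE_PATTERN.match(filename)
    if match is None:
        return None
    kind, n_thread, n_proc, n_event = match.groups()
    return JobConfig(kind, int(n_thread), int(n_proc), int(n_event))

# File names taken from the directory listing above
names = ['log.100_0_100000.log', 'log.10_0_10000.log', 'log.104_0_104000.log']

# Sort by thread count as a number rather than as a string
sorted_names = sorted(names, key=lambda n: parse_result_filename(n).n_thread)
print(sorted_names)
```

Sorting on the parsed thread count avoids the lexical ordering seen in the listing, where 100 and 104 come before 10.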

## Job timing
Let's look at some general timing info about the jobs.

In [6]:
print_timing_summary(job_results)

Threads Events Job-time Init-time Loop-time Final-time
      1   1000   5068.7     888.5    4176.6        3.7
      2   2000   5410.3     889.0    4517.5        3.8
      3   3000   5024.0     893.8    4126.2        4.0
      4   4000   5314.8     892.9    4417.8        4.2
      5   5000   5380.7     894.7    4481.6        4.4
      6   6000   4992.1     897.1    4090.0        5.0
      7   7000   5020.8     898.7    4117.4        4.7
      8   8000   5394.5     901.1    4488.1        5.3
      9   9000   4939.6     901.3    4033.3        5.0
     10  10000   5471.8     918.6    4548.0        5.2
     11  11000   4903.9     906.1    3992.5        5.4
     12  12000   5264.1     906.7    4351.8        5.5
     13  13000   5090.4     910.9    4173.8        5.7
     14  14000   5265.1     914.3    4344.2        6.5
     15  15000   5498.0     914.7    4577.2        6.0
     16  16000   5271.6     917.3    4347.6        6.7
     18  18000   5413.6     917.3    4489.7        6.5
     20  2

Let's visualize the initialization and finalization times in plots.

In [7]:
init_times = [get_initialization_time(j) for j in job_results]
final_times = [get_finalization_time(j) for j in job_results]
nThreads = np.array([j.nThread for j in job_results])

In [8]:
plt.figure(figsize=(12,5))
plt.subplot(121)
plt.plot(nThreads, init_times, 'ko')
plt.title('Job initialization time')
plt.ylim(ymin=0)
plt.xlabel('Number of threads')
plt.ylabel('Initialization time [s]')
plt.subplot(122)
plt.plot(nThreads, final_times, 'ko')
plt.title('Job finalization time')
plt.ylim(ymin=0)
plt.xlabel('Number of threads')
plt.ylabel('Finalization time [s]');


## Event throughput

Event throughput is arguably the most important result, so let's see how it scales with the number of threads. We calculate it by considering only the time in the event loop and the number of events processed. Then ideally the throughput should scale linearly with the number of threads.
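
As a rough illustration of that calculation, the sketch below computes the average throughput and a scaling efficiency relative to the 1-thread job for a few rows of the timing summary printed earlier. The helper names `throughput` and `scaling_efficiency` are hypothetical and are not the functions in `utils.timing`; an efficiency of 1 corresponds to ideal linear scaling.

```python
def throughput(n_event, loop_time):
    """Average event throughput [events/s] over the event loop."""
    return n_event / loop_time

def scaling_efficiency(n_thread, thruput, baseline_thruput):
    """Measured throughput divided by the ideal n_thread * baseline value."""
    return thruput / (n_thread * baseline_thruput)

# (threads, events, loop time [s]) rows copied from the timing summary
rows = [(1, 1000, 4176.6), (8, 8000, 4488.1), (16, 16000, 4347.6)]

# Use the 1-thread job as the baseline for ideal scaling
baseline = throughput(rows[0][1], rows[0][2])

for n_thread, n_event, loop_time in rows:
    tp = throughput(n_event, loop_time)
    eff = scaling_efficiency(n_thread, tp, baseline)
    print('{0:3d} threads: {1:.3f} events/s, efficiency {2:.2f}'.format(n_thread, tp, eff))
```

Because every job here processes 1000 events per thread, the efficiency reduces to the ratio of the 1-thread loop time to the job's loop time.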

In [9]:
thruPuts = np.array([get_avg_throughput(j) for j in job_results])
plt.figure()
plt.title('Event Throughput')
plt.plot(nThreads, thruPuts, 'ko', label='Data')
plt.xlabel('Number of threads')
plt.ylabel('Events / s')

# Draw ideal-scaling line, assuming 1-thread job as baseline
num_cores = 64
ideal_threads = np.array([0, num_cores])
ideal_thruput = ideal_threads * thruPuts[0]
plt.plot(ideal_threads, ideal_thruput, '--r', label='Ideal scaling')

# Draw vertical line at number of physical cores
cores_x, cores_y = [num_cores, num_cores], [0, 25]
plt.plot(cores_x, cores_y, '--b')

plt.legend(loc=2, numpoints=1);


The throughput is _terrible_ above ~180 threads. Let's remake the plot with only the jobs below 192 threads, so the scaling at lower thread counts is easier to read.

In [10]:
good_jobs = [j for j in job_results if j.nThread < 192]
good_nThreads = np.array([j.nThread for j in good_jobs])
good_thruPuts = np.array([get_avg_throughput(j) for j in good_jobs])
plt.figure()
plt.title('Event Throughput')
plt.plot(good_nThreads, good_thruPuts, 'ko', label='Data')
plt.xlabel('Number of threads')
plt.ylabel('Events / s')

# Draw ideal-scaling line, assuming 1-thread job as baseline
num_cores = 64
ideal_threads = np.array([0, num_cores])
ideal_thruput = ideal_threads * thruPuts[0]
plt.plot(ideal_threads, ideal_thruput, '--r', label='Ideal scaling')

# Draw vertical line at number of physical cores
cores_x, cores_y = [num_cores, num_cores], [0, 25]
plt.plot(cores_x, cores_y, '--b')

plt.legend(loc=2, numpoints=1);


## Memory scaling

Using the helper functions imported from `utils.memory`, plot the maximum memory footprint as a function of the number of threads, then the memory in each job as a function of time.

In [11]:
maxMems = np.array([get_max_mem(j) for j in job_results])

# Fit a line to the data
fit = np.polyfit(nThreads, maxMems, 1)
fit_fn = np.poly1d(fit)

plt.figure()
plt.title('Maximum memory consumption')
plt.plot(nThreads, maxMems, 'ko', nThreads, fit_fn(nThreads), '--r')
plt.xlabel('Number of threads')
plt.ylabel('Memory [GB]')

print('Memory fit: {0:.2f} GB + {1:.2f} MB/thread'.format(fit[1], fit[0]*1e3))


Memory fit: 1.44 GB + 36.95 MB/thread
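
To make the meaning of the two fit coefficients explicit, here is a minimal sketch of the same kind of straight-line fit on synthetic data. The numbers are made up for illustration and are not measurements, and `predicted_mem_gb` is a hypothetical helper. `np.polyfit` returns the highest-degree coefficient first, so for a degree-1 fit the slope (GB per thread) comes before the intercept (GB).

```python
import numpy as np

# Synthetic data, NOT measurements: 1.5 GB baseline plus 40 MB per thread
n_threads = np.array([1, 2, 4, 8, 16, 32, 64])
max_mem_gb = 1.5 + 0.040 * n_threads

# Degree-1 fit: coefficients are ordered [slope, intercept]
slope, intercept = np.polyfit(n_threads, max_mem_gb, 1)

def predicted_mem_gb(n_thread):
    """Memory predicted by the linear model for a given thread count."""
    return intercept + slope * n_thread

print('Fit: {0:.2f} GB + {1:.2f} MB/thread'.format(intercept, slope * 1e3))
print('Predicted at 128 threads: {0:.2f} GB'.format(predicted_mem_gb(128)))
```

Applied to the fit quoted above, the same linear model would predict roughly 1.44 + 64 × 0.037 ≈ 3.8 GB for a 64-thread job.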


In [12]:
# Show memory as a function of job time
plt.figure()
plt.title('Memory consumption during the job')
for j in job_results[::10]:
    label = '%i threads' % j.nThread
    times, mems = get_mem_data(j)
    # Last point is sometimes iffy, so I exclude it
    plt.plot(times[:-1], mems[:-1], label=label)
plt.xlabel('Job time [s]')
plt.ylabel('Memory [GB]')
plt.legend(loc=4);


## Algorithm analysis

G4Hive currently has four algorithms:
* SGInputLoader populates the whiteboard with initial data
* BeamEffectsAlg applies some smearing effects to the generated event
* G4AtlasAlg runs Geant4 simulation on the smeared generated event
* StreamHITS writes the hit collections to output

Let's take a look at how the job breaks down by algorithm. We'd like to know how much time is spent in each algorithm and what the timing distribution looks like for each one.

In [13]:
# Prepare the alg timing results now
for job in job_results:
    calc_alg_timings(job)

Let's start with histograms of the duration of each algorithm. I want to see how the alg-time distribution varies with number of threads. I suspect the algorithms are taking longer with more threads because of some lock contention.
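
Alongside the histograms, a few summary statistics per thread count can make a shift in the distributions easier to quantify. The following is a minimal sketch under stated assumptions: `summarize_durations` is a hypothetical helper, and the sample durations are synthetic values chosen only to illustrate the comparison.

```python
import numpy as np

def summarize_durations(durations):
    """Return the mean, median and 95th percentile of an array of durations."""
    durations = np.asarray(durations, dtype=float)
    return {'mean': durations.mean(),
            'median': np.median(durations),
            'p95': np.percentile(durations, 95)}

# Synthetic durations [s] keyed by thread count, NOT measurements
samples = {1: [3.9, 4.1, 4.0, 4.2, 4.3],
           64: [4.4, 4.9, 4.6, 5.8, 5.1]}

for n_thread in sorted(samples):
    stats = summarize_durations(samples[n_thread])
    print('{0:3d} threads: mean {mean:.2f} s, median {median:.2f} s, '
          'p95 {p95:.2f} s'.format(n_thread, **stats))
```

If contention were responsible, one would expect the median and especially the upper percentiles to grow with the thread count, though a slowdown alone does not identify the cause.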

In [14]:
alg_duration_maps = [get_alg_duration_map(j) for j in job_results]
g4alg_times = [m['G4AtlasAlg'] for m in alg_duration_maps]
loaderalg_times = [m['SGInputLoader'] for m in alg_duration_maps]
streamalg_times = [m['StreamHITS'] for m in alg_duration_maps]
beamalg_times = [m['BeamEffectsAlg'] for m in alg_duration_maps]

In [15]:
# Plot the histograms
plt.figure(figsize=(12, 10))

# 'density' replaces the 'normed' argument, which was removed from newer matplotlib versions
common_args = {'histtype': 'stepfilled',
               'alpha' : 0.4, 'linewidth' : 1.5,
               'density' : True}

# The G4AtlasAlg timings
plt.subplot(221)
plt.title('G4AtlasAlg execution times')
skip = 6
for thread, times in zip(nThreads[0::skip], g4alg_times[0::skip]):
    label = '{0:d} threads'.format(thread)
    plt.hist(times, bins=50, range=(0,10), label=label, **common_args)
plt.xlabel('Time [s]')
plt.ylabel('Normalized counts')
#plt.ylim(0, 0.007)
plt.legend()

# The StreamHITS timings
plt.subplot(222)
plt.title('StreamHITS execution times')
#skip=4
for thread, times in zip(nThreads[::skip], streamalg_times[::skip]):
    label = '{0:d} threads'.format(thread)
    plt.hist(times*1e3, bins=50, range=(0,50), label=label, **common_args)
plt.xlabel('Time [ms]')
plt.ylabel('Normalized counts')
#plt.ylim(0, 0.04)
plt.legend()

# The SGInputLoader timings
plt.subplot(223)
plt.title('SGInputLoader execution times')
#skip=10
for thread, times in zip(nThreads[::skip], loaderalg_times[::skip]):
    label = '{0:d} threads'.format(thread)
    plt.hist(times*1e6, bins=50, range=(0,120), label=label, **common_args)
plt.xlabel('Time [µs]')
plt.ylabel('Normalized counts')
#plt.ylim(0, 0.8)
plt.legend()

# The BeamEffectsAlg timings
plt.subplot(224)
plt.title('BeamEffectsAlg execution times')
#skip=4
for thread, times in zip(nThreads[::skip], beamalg_times[::skip]):
    label = '{0:d} threads'.format(thread)
    plt.hist(times*1e3, bins=50, range=(0,4), label=label, **common_args)
plt.xlabel('Time [ms]')
plt.ylabel('Normalized counts')
plt.legend();


For the next plot, I want to show how the total time in the event loop is broken down into algorithms and non-algorithmic time, where the latter includes scheduler overhead and waiting time. I think a stacked bar graph will serve nicely here. Let's sum times across threads but normalize to the number of events. Because the sum runs over all threads, the total is the number of threads divided by the throughput, i.e. the inverse of the per-thread throughput.

Ok, so how do I get these results? I will likely want to break down the numbers in terms of each algorithm. I may need to restructure how I do the histograms above and the timeline below to reduce the amount of code and computation.
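
One possible way to do the accounting is sketched below: the total thread-time per event is the loop time multiplied by the number of threads and divided by the number of events, and whatever is not covered by the summed algorithm times is attributed to non-algorithmic time. The function `per_event_breakdown` and the input numbers are hypothetical and synthetic.

```python
import numpy as np

def per_event_breakdown(loop_time, n_thread, n_event, alg_durations):
    """Split the per-event thread-time into algorithm and non-algorithm parts.

    alg_durations maps an algorithm name to an array of per-execution durations [s].
    """
    # Total thread-time available in the event loop, per event
    total = loop_time * n_thread / n_event
    # Time spent inside each algorithm, per event
    alg_times = {alg: float(np.sum(durs)) / n_event
                 for alg, durs in alg_durations.items()}
    # Remainder: scheduler overhead, waiting, idle threads
    non_alg = total - sum(alg_times.values())
    return total, alg_times, non_alg

# Synthetic example, NOT measurements: 2 threads, 4 events, 10 s event loop
durations = {'G4AtlasAlg': np.array([4.0, 4.5, 4.0, 4.5]),
             'StreamHITS': np.array([0.25, 0.25, 0.25, 0.25])}
total, alg_times, non_alg = per_event_breakdown(10.0, 2, 4, durations)
print(total, alg_times, non_alg)
```

A large non-algorithmic share at high thread counts would point at scheduling or waiting rather than at the algorithms themselves.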

In [16]:
# A color map for the algorithms
alg_color_map = {'SGInputLoader' : 'yellow',
                 'BeamEffectsAlg' : 'blue',
                 'G4AtlasAlg' : 'red',
                 'StreamHITS' : 'green',
                 #'AthOutSeq' : 'yellow',
                 #'AthRegSeq' : 'purple',
                }

def get_time_sum_map(job_results, alg_duration_maps):
    """For each job, calculate the total time spent in each alg.
    Normalize by the number of events and organize the results
    into a list per alg in a dict."""
    time_sum_map = {}
    for j, dur_map in zip(job_results, alg_duration_maps):
        for alg, durs in dur_map.items():
            alg_time = durs.sum() / j.nEvent
            time_sum_map.setdefault(alg, []).append(alg_time)
    return time_sum_map

In [17]:
# Get the map of summed alg times
time_sum_map = get_time_sum_map(job_results, alg_duration_maps)
# Get the normalized total time in each job
total_time_sums = [get_evloop_time(j)*j.nThread/j.nEvent for j in job_results]

In [18]:
# Stacked bars: the 'bottom' arg is the absolute y-position of each bar's base, so it
# must be the running sum of all the bars already drawn, not just the previous alg's times.

plt.figure(figsize=(12, 6))
algs = ['G4AtlasAlg', 'StreamHITS', 'SGInputLoader', 'BeamEffectsAlg']
colors = [alg_color_map[alg] for alg in algs]
leg_items = []
# Running height of the stack at each thread count
bottom = np.zeros(len(nThreads))
for alg, color in zip(algs, colors):
    heights = np.asarray(time_sum_map[alg])
    x = plt.bar(nThreads, heights, bottom=bottom, color=color, align='center')
    leg_items.append(x[0])
    bottom = bottom + heights
x = plt.plot(nThreads, total_time_sums, 'sk', label='Total')
plt.ylabel('Alg time / nEvent  [s]')
plt.xlabel('Number of threads')
plt.ylim(ymax=20)
plt.legend(leg_items + [x[0]], algs + ['Total'], loc=0, numpoints=1);


## Event loop timeline

For the timeline plot, we'll split the results by thread in a bar graph.
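
The per-thread split can be sketched with boolean masks, as below. This is a simplified illustration on synthetic arrays, not the notebook's own implementation: `group_by_thread` and `busy_fraction` are hypothetical helpers. Each group is an N x 2 array of (start, duration) pairs, which is the layout that `broken_barh` expects for its x-ranges.

```python
import numpy as np

def group_by_thread(tids, starts, durations):
    """Map each thread ID to an N x 2 array of (start, duration) pairs."""
    tids = np.asarray(tids)
    starts = np.asarray(starts, dtype=float)
    durations = np.asarray(durations, dtype=float)
    groups = {}
    for tid in np.unique(tids):
        mask = tids == tid  # select the algorithm executions run by this thread
        groups[int(tid)] = np.column_stack((starts[mask], durations[mask]))
    return groups

def busy_fraction(bars, loop_time):
    """Fraction of the event loop during which a thread was running algorithms."""
    return bars[:, 1].sum() / loop_time

# Synthetic timeline, NOT measurements: two threads, two executions each
tids = [7, 9, 7, 9]
starts = [0.0, 0.0, 2.0, 3.0]
durations = [2.0, 3.0, 1.5, 1.0]

groups = group_by_thread(tids, starts, durations)
for tid, bars in groups.items():
    print(tid, bars.tolist(), busy_fraction(bars, loop_time=4.0))
```

The busy fraction gives a single number per thread that complements the visual impression of gaps in the timeline.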

In [19]:
class TimelineThreadData():
    """Simple struct for holding relevant timeline data in one thread"""
    def __init__(self, tid):
        self.tid = tid

def get_timeline_thread_data(job):
    """Get the processed timeline results per thread"""
    # Get the unique thread IDs
    tids = job.timeline_results['tids']
    unique_tids = np.unique(tids)
    assert len(unique_tids) == job.nThread # sanity check
    # Create and fill the per-thread timeline data
    ttds = [TimelineThreadData(tid) for tid in unique_tids]
    for ttd in ttds:
        # Select the algorithm executions that ran on this thread
        indices = tids == ttd.tid
        algs = job.timeline_results['algs'][indices]
        ttd.colors = np.array([alg_color_map.get(alg, 'black') for alg in algs])
        starts = job.alg_starts[indices]
        durations = job.alg_durations[indices]
        ttd.times = np.column_stack((starts, durations))
    return ttds

In [20]:
# For the timeline plot, we'll look at just one job for now
j = job_results[-2]

# Prepare timeline data split by thread ID
tldata_by_thread = get_timeline_thread_data(j)
unique_tids = np.unique(j.timeline_results['tids'])

In [21]:
# Prepare the plot
plt.figure(figsize=(12, 40))
plt.title('Event loop timeline')
bar_thickness = 0.8
for i, tldata in enumerate(tldata_by_thread):
    ylow = (i + 1.) - bar_thickness/2
    plt.broken_barh(tldata.times, [ylow, bar_thickness], facecolors=tldata.colors, linewidth=0)
# Fake bar objects to populate the legend
legbars = [mpatch.Rectangle((0, 0), 1, 1, fc=c) for c in alg_color_map.values()]
plt.xlabel('Event loop time [s]')
plt.ylabel('Thread')
plt.yticks(range(1, len(unique_tids)+1))
plt.ylim(ymax=len(unique_tids)+1.5)
plt.xlim(xmin=0)
#plt.xlim(9, 9.1)
plt.legend(legbars, alg_color_map.keys(), loc=2);
