<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Compute-Scaling-Study" data-toc-modified-id="Compute-Scaling-Study-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Compute Scaling Study</a></span><ul class="toc-item"><li><span><a href="#Objectives:" data-toc-modified-id="Objectives:-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Objectives:</a></span></li><li><span><a href="#Strong-Scaling" data-toc-modified-id="Strong-Scaling-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Strong Scaling</a></span></li><li><span><a href="#Weak-Scaling" data-toc-modified-id="Weak-Scaling-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Weak Scaling</a></span></li></ul></li></ul></div>

# Compute Scaling Study 

## Objectives:

We measure the performance of Dask's distributed scheduler for a variety of operations commonly performed in the geosciences: `climatology`, `anomaly`, `spatial` (formerly `temporal`), and `temporal` (formerly `global`) reductions. We measure performance as both dataset size and cluster size increase.
In this study, we increase cluster size by adding HPC nodes to the cluster. Each HPC node runs one Dask worker with 1 thread, and each Dask worker holds 10 chunks.


During this study, we vary our computations in the following ways:

- Varying chunk size
- Varying cluster size (number of HPC nodes)
- Varying chunking scheme
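
To make the setup concrete, here is a minimal sketch of how the total dataset size would follow from chunk size and node count. It assumes the dataset is built so that every worker holds exactly 10 chunks, as in the weak-scaling runs. The function name and the 128 MB chunk size are hypothetical and chosen only for illustration.

```python
# Setup described above: one Dask worker per node, 10 chunks per worker.
CHUNKS_PER_WORKER = 10
WORKERS_PER_NODE = 1


def total_dataset_bytes(chunk_bytes, num_nodes):
    """Total dataset size when every worker holds CHUNKS_PER_WORKER chunks.

    Hypothetical helper, not part of the benchmarking code.
    """
    return chunk_bytes * CHUNKS_PER_WORKER * WORKERS_PER_NODE * num_nodes


# Hypothetical chunk size of 128 MB (decimal megabytes).
chunk_bytes = 128 * 10**6
sizes = {n: total_dataset_bytes(chunk_bytes, n) for n in (1, 2, 4, 8, 16)}

for nodes, size in sizes.items():
    print(f"{nodes:>2} node(s): {size / 1e9:.2f} GB")
```

Under this assumption the dataset grows linearly with the number of nodes, so the amount of data per worker stays fixed.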

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import dask.dataframe as dd
import hvplot.pandas
from distributed.utils import format_bytes, parse_bytes

In [None]:
df = dd.read_csv('results/hal24/2019-08-22/*.csv').compute()
df['chunk_size'] = df['chunk_size'].map(lambda x: format_bytes(parse_bytes(x)))
df['dataset_size'] = df['dataset_size'].map(
    lambda x: format_bytes(parse_bytes(x)))

In [None]:
len(df)

In [None]:
df.head()

In [None]:
def get_clean_df(df,
                 groupby=[
                     'chunk_size', 'dataset_size', 'chunking_scheme',
                     'operation', 'num_nodes'
                 ]):
    clean_df = df.groupby(groupby).median().reset_index()
    clean_df['nodes'] = clean_df['num_nodes']
    clean_df['chunk_scheme'] = clean_df['chunking_scheme']
    clean_df = clean_df.drop(columns=[
        'worker_per_node', 'maxcore_per_node', 'spil', 'threads_per_worker',
        'num_nodes', 'chunking_scheme'
    ])

    return clean_df

In [None]:
def log_linear_plot(df,
                    loglog=False,
                    plot_kind='line',
                    x='nodes',
                    y='runtime',
                    by=['operation'],
                    subplots=False,
                    groupby=['size', 'chunk_scheme']):
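    # Sort once by the grouping columns and then by x, so that each line is
    # drawn in increasing x order (a second, separate sort would discard the first).
    df = df.sort_values([*groupby, x])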
    if loglog:
        title = f'{y} vs {x} -- Log scale'
    else:
        title = f'{y} vs {x} -- Linear scale'

    if subplots:
        fig = df.hvplot(x=x,
                        y=y,
                        by=by,
                        groupby=groupby,
                        height=300,
                        width=500,
                        rot=45,
                        loglog=loglog,
                        kind=plot_kind,
                        title=title,
                        ylabel='Runtime (seconds)',
                        dynamic=False,
                        legend='top',
                        use_index=False,
                        shared_axes=False).layout().cols(1)
    else:
        fig = df.hvplot(x=x,
                        y=y,
                        by=by,
                        groupby=groupby,
                        height=300,
                        width=500,
                        rot=45,
                        loglog=loglog,
                        kind=plot_kind,
                        title=title,
                        ylabel='Runtime (seconds)',
                        dynamic=False,
                        legend='top')
    return fig

## Strong Scaling


[Strong scaling](https://en.wikipedia.org/wiki/Scalability#Weak_versus_strong_scaling) is how the time to solution varies with processor count for a fixed total problem size.


In an ideal case (e.g., problems/algorithms with O(N) time complexity), **we expect the time to solution to decrease in inverse proportion to the number of processors**: doubling the processor count halves the runtime.
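
One common way to quantify how close a run comes to this ideal is to compute speedup and parallel efficiency relative to the smallest node count. The sketch below is illustrative only: the function name is hypothetical and the runtimes are made-up values that scale perfectly, not measurements from this study.

```python
def strong_scaling_metrics(runtimes):
    """Speedup and efficiency for a fixed problem size.

    `runtimes` maps node count -> runtime in seconds. The smallest node
    count is used as the baseline. Hypothetical helper for illustration.
    """
    base_nodes = min(runtimes)
    base_time = runtimes[base_nodes]
    metrics = {}
    for nodes in sorted(runtimes):
        speedup = base_time / runtimes[nodes]
        ideal_speedup = nodes / base_nodes
        metrics[nodes] = {
            "speedup": speedup,
            # Efficiency of 1.0 means perfect (ideal) strong scaling.
            "efficiency": speedup / ideal_speedup,
        }
    return metrics


# Made-up runtimes that halve every time the node count doubles.
ideal_runtimes = {1: 80.0, 2: 40.0, 4: 20.0, 8: 10.0}
strong = strong_scaling_metrics(ideal_runtimes)

for nodes, m in strong.items():
    print(nodes, m["speedup"], m["efficiency"])
```

An efficiency below 1.0 at higher node counts would indicate overhead (for example scheduling or communication) that grows with cluster size.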



In [None]:
df2 = get_clean_df(df).drop(columns='chunk_size')
df2['size'] = df2['dataset_size']
df2 = df2.drop(columns='dataset_size')
df3 = df2[df2['size'] == '20.48 GB']

log_linear_plot(df3,
                subplots=True,
                by=['operation'],
                groupby=['size', 'chunk_scheme'])

In [None]:
log_linear_plot(df2, subplots=False)

## Weak Scaling


[Weak scaling](https://en.wikipedia.org/wiki/Scalability#Weak_versus_strong_scaling) is how the time to solution varies with processor count for a fixed problem size per processor.


In an ideal case (e.g., problems/algorithms with O(N) time complexity), **we expect to observe a constant time to solution**, independent of the total number of processors in the system. 
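
Weak-scaling efficiency can be summarized as the baseline runtime divided by the runtime at each node count, so a value of 1.0 corresponds to the constant time to solution described above. This is a minimal sketch with a hypothetical function name and made-up runtimes, not results from this study.

```python
def weak_scaling_efficiency(runtimes):
    """Weak-scaling efficiency relative to the smallest node count.

    `runtimes` maps node count -> runtime in seconds, where the problem
    size per node is held fixed. Hypothetical helper for illustration.
    """
    base_time = runtimes[min(runtimes)]
    return {nodes: base_time / runtimes[nodes] for nodes in sorted(runtimes)}


# Made-up runtimes: flat at first, then degrading at larger node counts.
example_runtimes = {1: 10.0, 2: 10.0, 4: 12.5, 8: 20.0}
weak = weak_scaling_efficiency(example_runtimes)

for nodes, eff in weak.items():
    print(f"{nodes} node(s): efficiency {eff:.2f}")
```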


In [None]:
df1 = get_clean_df(df).drop(columns='dataset_size')
df1['size'] = df1['chunk_size']
df1 = df1.drop(columns='chunk_size')

log_linear_plot(df1, subplots=True)

In [None]:
log_linear_plot(df1, subplots=False)