# LazyFox Analysis

This notebook showcases the analysis of a LazyFox run on the Eu-core dataset with a threadcount of 1. The data is equivalent to the output of the example run in the `LazyFox Workflow.ipynb` notebook.

## Setup

Change the `run_directory` below if you want to analyse the output created by the `LazyFox Workflow.ipynb` notebook instead.

In [None]:
import os.path

run_directory = "./example_data/run_eu_with_1"

if not os.path.exists(run_directory):
    raise ValueError(f"No such run directory '{run_directory}'")


The input data provided by SNAP does not always contain the graph properties 'node count' and 'edge count', so they are hard-coded. You therefore need to specify which dataset you used.

In [None]:
dataset = "eu"

The following class bundles many useful methods for analysing the result of a LazyFox run. It is defined in `BenchmarkRun.py`. If you are not interested in the inner workings of the analysis, you can skip that file.

However, if you want to run on a new dataset apart from Eu-core, DBLP or LiveJournal, you will have to add the node count and the edge count of the new dataset to the file!
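
As a rough illustration of what such an addition could look like, the sketch below keeps the hard-coded graph properties in a dictionary. The names `GRAPH_PROPERTIES` and `graph_properties` and the `"my_dataset"` entry are hypothetical and not taken from `BenchmarkRun.py`; check that file for the structure it actually uses. The Eu-core counts are the ones SNAP lists for email-Eu-core.

```python
# Hypothetical lookup table; BenchmarkRun.py may organise this differently.
GRAPH_PROPERTIES = {
    # Values as listed by SNAP for email-Eu-core.
    "eu": {"node_count": 1005, "edge_count": 25571},
    # Placeholder entry for a new dataset: replace the zeros with the real counts.
    "my_dataset": {"node_count": 0, "edge_count": 0},
}


def graph_properties(dataset):
    """Return (node_count, edge_count) for a known dataset key."""
    if dataset not in GRAPH_PROPERTIES:
        raise ValueError(f"Unknown dataset '{dataset}'; add its node and edge count first")
    props = GRAPH_PROPERTIES[dataset]
    return props["node_count"], props["edge_count"]


node_count, edge_count = graph_properties("eu")
```

Failing early with a `ValueError` for an unknown key mirrors the check on `run_directory` above, so a missing dataset entry is noticed before any statistics are computed.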

In [None]:
from BenchmarkRun import BenchmarkRun

## Analysis

To analyse a LazyFox run, we create a `BenchmarkRun` object. It only needs the dataset (to look up the node count and edge count) and the run directory.


In [None]:
run = BenchmarkRun(dataset, run_directory)

In [None]:
from pandas import DataFrame

cluster_stats = run.cluster_stats()
DataFrame([cluster_stats.values()], columns=cluster_stats.keys())

In [None]:
performance_stats = run.performance_stats()
DataFrame([performance_stats.values()], columns=performance_stats.keys())

## Comparative Analysis
If you run LazyFox multiple times, you can create one `BenchmarkRun` object per run and then compare their results.

We provide the output of multiple LazyFox runs on the Eu-core dataset to showcase a more advanced analysis below.

In [None]:
example_data_directory = "./example_data"

runs_by_threadcount = {}

threadcounts = [1, 2, 4, 8, 16, 32, 64, 128, 256]
for threadcount in threadcounts:
    run_directory = os.path.join(example_data_directory, f"run_eu_with_{threadcount}")

    runs_by_threadcount[threadcount] = BenchmarkRun("eu", run_directory)

### Create DataFrames

In [None]:
# Cluster Stats
def display_diff(value, org, threadcount):
    """Format the difference to the single-threaded baseline with an explicit sign."""
    new = round(value - org, 2)
    if threadcount != 1 and new >= 0:
        return "+" + str(new)
    return str(new)

# Baseline: the run with threadcount 1
org = runs_by_threadcount[1].cluster_stats()

data_rows = []
for threadcount in threadcounts:
    run = runs_by_threadcount[threadcount]
    cluster_stats = run.cluster_stats()

    data_row = [threadcount] + [display_diff(value, org[key], threadcount) for key, value in cluster_stats.items()]
    data_rows.append(data_row)

cluster_stat_df = DataFrame(data_rows, columns=["threadcount"] + list(cluster_stats.keys()))
cluster_stat_df.style.set_caption("EU Dataset - Cluster Stat Changes over different threadcounts")

In [None]:
# Absolute Performance Stats
def display_diff(value, org, threadcount):
    """Format the difference to the single-threaded baseline with an explicit sign."""
    new = round(value - org, 2)
    if threadcount != 1 and new >= 0:
        return "+" + str(new)
    return str(new)

# Baseline: the run with threadcount 1
org = runs_by_threadcount[1].performance_stats()

data_rows = []
for threadcount in threadcounts:
    run = runs_by_threadcount[threadcount]
    performance_stats = run.performance_stats()

    data_row = [threadcount] + [display_diff(value, org[key], threadcount) for key, value in performance_stats.items()]
    data_rows.append(data_row)

performance_stat_df = DataFrame(data_rows, columns=["threadcount"] + list(performance_stats.keys()))
performance_stat_df.style.set_caption("EU Dataset - Performance Stat Changes over different threadcounts")

In [None]:
# Relative Performance Stats
baselines_stats = runs_by_threadcount[1].performance_stats()

data_rows = []
for threadcount in threadcounts:
    run = runs_by_threadcount[threadcount]
    performance_stats = run.performance_stats()
    # Make the stats relative to threadcount 1.
    # A new dict is built so that the baseline stats are never modified in place.
    relative_stats = {key: value / baselines_stats[key] for key, value in performance_stats.items()}

    data_row = [threadcount] + list(relative_stats.values())
    data_rows.append(data_row)

rel_performance_stat_df = DataFrame(data_rows, columns=["threadcount"] + list(performance_stats.keys()))
rel_performance_stat_df.style.set_caption("EU Dataset - Relative Performance Stat Changes over different threadcounts")

### Plot Performance Stats

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(20, 5))
fig.suptitle("EU Core - Relative Performance Compared to FOX")

for i, column in enumerate(["ram peak", "total time", "avg time per iteration"]):
    axes[i].set_title(f"{column}")
    axes[i].plot(rel_performance_stat_df["threadcount"], rel_performance_stat_df[column], label="eu")
    axes[i].set_xticks(threadcounts)
    axes[i].set_xlabel("Threadcount")
    axes[i].set_ylabel("Relative Change")
    axes[i].legend()

We can see that the peak RAM consumption scales approximately linearly with the threadcount. The total runtime, however, drops roughly exponentially as the threadcount grows, which is caused by the speedup of the individual iterations.
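
To put numbers on such a trend, a relative runtime can be converted into a speedup and a parallel efficiency. The following is a minimal sketch using made-up relative times, not results from the provided runs; the helper name `speedup_rows` is an assumption of this example.

```python
def speedup_rows(threadcounts, relative_times):
    """Derive speedup and parallel efficiency from runtimes relative to one thread."""
    rows = []
    for threads, rel_time in zip(threadcounts, relative_times):
        speedup = 1.0 / rel_time        # how many times faster than the 1-thread run
        efficiency = speedup / threads  # 1.0 would be perfect linear scaling
        rows.append({"threadcount": threads, "speedup": speedup, "efficiency": efficiency})
    return rows


# Illustrative values only, not measured data.
example_rows = speedup_rows([1, 2, 4], [1.0, 0.625, 0.5])
for row in example_rows:
    print(row)
```

In this notebook, the `threadcount` and `total time` columns of `rel_performance_stat_df` could be passed in place of the example lists, and the resulting rows wrapped in a `DataFrame` for display.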


### Plot WCC Development

In [None]:
# WCC Diff over iteration

iterations = {}
for threadcount in threadcounts:
    r = runs_by_threadcount[threadcount]
    iterations[threadcount] = [threadcount] + [i["wcc_diff"] for i in r.iterations]

df = DataFrame(list(iterations.values())).transpose()
df.columns = list(map(int, df.iloc[0]))
df = df.drop(df.index[0])

In [None]:
import matplotlib.pyplot as plt
from pandas import DataFrame

fig, ax = plt.subplots(1,1, figsize=(10,7))
fig.suptitle("EU Core - WCC Development")
# Plot wcc diff over iterations
df.plot(
    xlabel="#iteration",
    ylabel="wcc change (log)",
    logy=True,
    marker='o',
    xticks=list(df.index),
    title="",
    ax=ax,
)

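# Draw reference lines at wcc diffs of 0.01 to 0.04
for wcc_diff in [0.01 * x for x in range(1, 5)]:
    plt.axhline(y=wcc_diff, color='r', linestyle=':')
    # Format with two decimals so float rounding noise never shows up in the label
    plt.text(df.index[-1] + 1.5, wcc_diff, f"{wcc_diff:.2f}", color="red")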
plt.legend(title="Threadcount")

We can see that the WCC changes depend on the threadcount. Note that this is an exception caused by the low node count of the Eu-core dataset used here. Refer to the report of this project for further details.
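
The red reference lines in the plot mark fixed WCC-change values. Assuming such a value is used as a threshold for deciding when a run has settled (an assumption of this sketch; see the project report for the actual termination rule), the first iteration that falls below it can be read off programmatically. The helper `first_iteration_below` and the sample values are hypothetical.

```python
def first_iteration_below(wcc_diffs, threshold):
    """Return the 1-based index of the first iteration whose WCC change
    is below the threshold, or None if no iteration gets that low."""
    for iteration, diff in enumerate(wcc_diffs, start=1):
        if diff < threshold:
            return iteration
    return None


# Made-up WCC changes per iteration, for illustration only.
sample_diffs = [0.20, 0.08, 0.03, 0.015, 0.004]
result = first_iteration_below(sample_diffs, threshold=0.01)
print(result)
```

In this notebook, the per-iteration values of a single run could be taken from one column of `df`, for example `df[1].dropna()` for the run with threadcount 1.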