# Real Traces Plotting/Analysis

The traces were obtained from [GitHub](https://github.com/Azure/AzurePublicDataset/blob/master/AzureFunctionsDataset2019.md). See `dataset/README.md`.

In [None]:
# Common imports.
from pathlib import Path

%matplotlib ipympl
import base

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

Global options of the notebook:

* `data_file`: the full path of the data (a CSV file)

In [None]:
data_file = Path(
    "/home/emanuele/marl-dfaas/dataset/data/invocations_per_function_md.anon.d01.csv"
)

## Structure of an invocation file

In [None]:
invocations = pd.read_csv(data_file)  # Read the data, takes time.

In [None]:
print("Shape (rows, columns) =", invocations.shape)
print("First 10 columns =", list(invocations.columns[:10]))
print("Last 10 columns =", list(invocations.columns[-10:]))

The first four columns identify each function. The three hash columns are consistent across all files:

* `HashOwner`: owner of the application. One owner can have multiple applications.
* `HashApp`: application. An application has exactly one owner but can have many functions. Note that two identical applications (and their functions) get different hashes when they belong to different owners.
* `HashFunction`: the single function.
* `Trigger`: what causes the function execution.

The remaining columns (`1` to `1440`) hold the number of invocations for each minute of a single 24-hour day.
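To make the layout described above concrete without loading the full CSV, here is a minimal sketch on a synthetic frame. The name `toy`, the short hashes, and the random counts are all made up for illustration; only the column names follow the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one dataset file: 3 functions, 1440 minute columns.
# Hashes and counts are invented; only the column names mirror the dataset.
minute_cols = [str(m) for m in range(1, 1441)]
rng = np.random.default_rng(seed=0)
toy = pd.DataFrame(rng.integers(0, 5, size=(3, 1440)), columns=minute_cols)

# Prepend the identifying columns in the same order as the real file.
toy.insert(0, "Trigger", ["http", "timer", "http"])
toy.insert(0, "HashFunction", ["f1", "f2", "f3"])
toy.insert(0, "HashApp", ["a1", "a1", "a2"])
toy.insert(0, "HashOwner", ["o1", "o1", "o2"])

# Label-based slices split the identifying columns from the per-minute counts.
toy_header = toy.loc[:, :"Trigger"]
toy_counts = toy.loc[:, "1":]

print("Header shape =", toy_header.shape)
print("Counts shape =", toy_counts.shape)
```

The same two label slices are used later in the notebook on the real data.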

## Plot of a generic trace

Plots a single trace from the given data file. The trace can be selected by its hash (full or partial) or by the index inside the data file.
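A partial hash can match more than one function, which is why the selection must be checked for uniqueness. The sketch below shows this on made-up hashes; `select_by_prefix` is a hypothetical helper, not part of the notebook, and it matches on the hash prefix only.

```python
import pandas as pd

# Made-up, shortened hashes for illustration only.
toy_funcs = pd.DataFrame({"HashFunction": ["115ca7a2", "115cb000", "9f3e0d11"]})


def select_by_prefix(df, prefix):
    """Return the rows whose HashFunction starts with the given prefix."""
    return df[df["HashFunction"].str.startswith(prefix)]


ambiguous = select_by_prefix(toy_funcs, "115c")  # Matches two functions.
unique = select_by_prefix(toy_funcs, "115ca")  # Matches exactly one.

print("Matches for '115c' =", len(ambiguous))
print("Matches for '115ca' =", len(unique))
```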

In [None]:
function_hash = "115ca7a2b5bc290052c3da74cd0347d19c3c67b7d5aa66e9a975e427f25fc7ed"
function_index = 101
select_by_hash = False

if select_by_hash:
    trace = invocations[invocations["HashFunction"].str.contains(function_hash)]
else:
    trace = invocations.iloc[[function_index]]

assert trace.shape[0] == 1, "Trace must be unique"

# Extract the first columns and the invocations columns ("1" -> "1440")
owner, app, func, trigger = trace.iloc[0][:"Trigger"]
invocs = trace.iloc[0]["1":]

In [None]:
print("Owner =", owner)
print("Application =", app)
print("Function =", func)
print("Trigger =", trigger)

In [None]:
fig = plt.figure(layout="constrained")
fig.canvas.header_visible = False
ax = fig.subplots()

minutes_idx = np.arange(1, len(invocs) + 1)
ax.bar(minutes_idx, invocs)

ax.set_title(f"Function invocations (function = {func[:10]})")
ax.set_ylabel("Invocations")
ax.set_xlabel("Minute")

ax.grid(axis="both")
ax.set_axisbelow(True)  # Draw the grid behind the bars (by default it is drawn over them).

## Trigger distribution

There are many triggers supported by Azure Functions, but in the dataset they are grouped into the following groups:

* `http` (HTTP)
* `timer` (Timer)
* `event` (Event Hub, Event Grid)
* `queue` (Service Bus, Queue Storage, RabbitMQ, Kafka, MQTT)
* `storage` (Blob Storage, CosmosDB, Redis, File)
* `orchestration` (Durable Functions: activities, orchestration)
* `others` (all other triggers)

Note that I'm only interested in functions triggered by `http` requests. The analysis of these is in the next section.
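As a rough illustration of how the share of `http` functions can be read from the trigger column, here is a small sketch on invented labels. `toy_triggers` is not taken from the dataset, so the resulting percentage says nothing about the real distribution.

```python
import pandas as pd

# Invented trigger labels standing in for the "Trigger" column.
toy_triggers = pd.Series(
    ["http", "timer", "http", "queue", "http", "timer", "event", "others"],
    name="Trigger",
)

# normalize=True returns fractions of the total instead of absolute counts.
toy_share = toy_triggers.value_counts(normalize=True)
http_share = toy_share["http"]

print(f"http share = {http_share:.1%}")
```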

In [None]:
# Calculate triggers count.
trigger_count = invocations.loc[:, "Trigger"].value_counts()
print(trigger_count)

In [None]:
# Plot data.
fig = plt.figure(layout="constrained")
fig.canvas.header_visible = False
ax = fig.subplots()

ax.bar(trigger_count.index, trigger_count)

ax.set_title("Trigger distribution")
ax.set_ylabel("Functions")
ax.set_xlabel("Trigger")

ax.grid(axis="both")
ax.set_axisbelow(True)  # Draw the grid behind the bars (by default it is drawn over them).

## Sum, mean and std of invocations

In [None]:
http = invocations[invocations["Trigger"] == "http"]

# Extract the first four columns (the "header").
header = http.loc[:, :"Trigger"]

# Calculate per-function stats over the 1440 minute columns.
values = http.loc[:, "1":].agg(["sum", "mean", "std"], axis=1)

stats = header.join(values)  # Rebuild the dataframe.

In [None]:
for metric in ["sum", "mean", "std"]:
    fig = plt.figure(layout="constrained")
    fig.canvas.header_visible = False
    ax = fig.subplots()

    func_idx = np.arange(http.shape[0])

    # Required since there is too much variation between functions.
    ax.set_yscale("log")

    ax.bar(func_idx, stats[metric])

    ax.set_title(f"{metric.capitalize()} of invocations per http function")
    # The label follows the plotted metric (sum, mean or std).
    ax.set_ylabel(f"{metric.capitalize()} of invocations")
    ax.set_xlabel("Function index")

    ax.grid(axis="both")
    ax.set_axisbelow(True)  # Draw the grid behind the bars (by default it is drawn over them).