# Benchmarks of Strategies for Selecting Outdated Items
This notebook contains the benchmarks related to the selection strategies for context data, which we report in our paper.
Context data are selected from the processed data and included in the next progressive computation step, such that its result approximates that of a _non-progressive_ computation over the processed data.
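
The idea of context data can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation benchmarked in this notebook: the functions `score` and `score_chunk_with_context`, the distance-to-mean measure, and the toy values are all hypothetical.

```python
import numpy as np

def score(values):
    """Hypothetical stand-in measure: absolute distance to the mean of `values`."""
    return np.abs(values - values.mean())

def score_chunk_with_context(chunk, processed, n_context, rng):
    """Score `chunk` together with `n_context` items sampled from `processed`."""
    n = min(n_context, len(processed))
    context = rng.choice(processed, size=n, replace=False)
    combined = np.concatenate([chunk, context])
    # only the scores of the new chunk are returned; context items merely inform them
    return score(combined)[: len(chunk)]

rng = np.random.default_rng(0)
processed = np.array([1.0, 2.0, 3.0, 4.0])  # items from earlier chunks (toy data)
chunk = np.array([100.0, 101.0, 102.0])     # the next chunk (toy data)

without_context = score(chunk)
with_context = score_chunk_with_context(chunk, processed, n_context=2, rng=rng)
print(without_context, with_context)
```

With context, the new items are compared against a sample of what was seen before, so their scores move closer to what a computation over all processed data would yield.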

## Benchmark Configuration

We use the following configuration in our benchmarks:
### Test cases 
- full computation over the entire dataset (upper baseline)
- progressive computation without optimization (lower baseline)
- full computation of processed data
- progressive computation using optimization strategies

### Dataset
- NYC taxis dataset (10 million items), stored in a compressed CSV file, loaded with DuckDB

### Variables
- dependent variables: runtime, prediction error
- independent variables: 

## Benchmarks

### Defining the DOI function
In our experiments, we will use the DOI function below, which measures outlierness (i.e., how unique a datum is compared to the rest of the data).
Outlierness cannot be determined by looking at an individual data item; it can only be expressed within the context of other data. A purely progressive computation of this DOI measure will therefore be inherently inaccurate compared to a "monolithic" computation that looks at all data at once.
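
To illustrate this inaccuracy, the sketch below uses a simple z-score-based outlierness as a stand-in. It is not the `OutliernessComponent` used in the benchmarks; the function `zscore_outlierness` and the synthetic values are assumptions made for illustration only.

```python
import numpy as np

def zscore_outlierness(values: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in DOI: absolute z-score of each value within `values`."""
    std = values.std()
    if std == 0:
        return np.zeros_like(values, dtype=float)
    return np.abs(values - values.mean()) / std

# synthetic data (assumption): two chunks drawn from different value ranges
chunk_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
chunk_b = np.array([101.0, 102.0, 103.0, 104.0, 105.0])
all_data = np.concatenate([chunk_a, chunk_b])

# progressive: each chunk is scored in isolation
progressive = np.concatenate([zscore_outlierness(chunk_a), zscore_outlierness(chunk_b)])
# monolithic: all items are scored together
monolithic = zscore_outlierness(all_data)

# the same item receives a different score depending on the data it is compared against
print(progressive[0], monolithic[0])
```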


### Finding an appropriate `chunk size`
The number of items in each chunk dictates the computation time per chunk: the more items we process, the longer the DOI computation takes.
Therefore, the first consideration in our benchmarks is to find the maximum number of items for which the computation time remains immediate.
Prior work (see Card et al., 1991) has shown this limit to be about one second.

In the cell below, we thus compute the maximum number of items for which the DOI function still returns values within one second:

In [None]:
chunk_sizes = [10, 100, 1000, 10000]
reset()
for size in chunk_sizes:
  print(size)
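
One way to carry out this search is sketched below. It is a hedged example rather than the procedure used for the paper: the helpers `timed` and `largest_chunk_size_within`, the function `placeholder_doi`, and the random two-column data are hypothetical stand-ins for the actual DOI computation and dataset.

```python
import time
import numpy as np

def timed(fn, data) -> float:
    """Return the wall-clock time in seconds that `fn(data)` takes."""
    start = time.perf_counter()
    fn(data)
    return time.perf_counter() - start

def largest_chunk_size_within(fn, sizes, limit_s=1.0, seed=0):
    """Return the largest size in `sizes` for which `fn` finishes within `limit_s`, or None."""
    rng = np.random.default_rng(seed)
    best = None
    for size in sorted(sizes):
        data = rng.random((size, 2))  # synthetic stand-in for one chunk
        if timed(fn, data) > limit_s:
            break  # assumes that the runtime grows with the chunk size
        best = size
    return best

def placeholder_doi(data):
    # hypothetical stand-in for the DOI computation
    return np.abs(data - data.mean(axis=0)).sum(axis=1)

print(largest_chunk_size_within(placeholder_doi, [10, 100, 1000, 10000]))
```

A single measurement per size is noisy; repeating each measurement and taking the median would give a more stable estimate.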

### Baseline: Chunk-based computation without any optimizations

In [2]:
import os
from sys import path
cwd = os.getcwd()
path.append(f"{cwd}/..")

from database import initialize_db, drop_tables
from benchmark_test_case import *
import numpy as np
import pandas as pd

from doi_component.outlierness_component import OutliernessComponent
outlierness = OutliernessComponent(["ratio", "duration"])

n_dims = 17
total_size = 99999
chunk_size = 1000
chunks = round(total_size / chunk_size)

data_path = "../data/nyc_taxis.shuffled_full.csv.gz"

def reset():
  drop_tables()
  initialize_db(data_path)

from outdated_item_selection_strategy.no_update import *
from outdated_item_selection_strategy.oldest_chunks_update import *
from outdated_item_selection_strategy.last_n_chunks_update import *
from outdated_item_selection_strategy.regular_interval_update import *
from outdated_item_selection_strategy.outdated_bin_update import *

update_strategies = [
  ("no chunk", NoUpdate(n_dims=n_dims)),
  ("oldest n chunks", OldestChunksUpdate(n_dims=n_dims, n_chunks=3, max_age=10)),
  ("last n chunks", LastNChunksUpdate(n_dims=n_dims, n_chunks=3)),
  ("regular intervals", RegularIntervalUpdate(n_dims=n_dims,interval=2, max_age=10)),
  ("outdated bins", OutdatedBinUpdate(n_dims=n_dims))
]

from context_item_selection_strategy.no_context import * 
from context_item_selection_strategy.chunk_based_context import *
from context_item_selection_strategy.sampling_based_context import *
from context_item_selection_strategy.clustering_based_context import *

context_strategies = [
  ("no context", NoContext(n_dims=n_dims)),
  ("chunk based", RandomChunkBasedContext(n_dims=n_dims, n_chunks=3)),
  ("sampling based", RandomSamplingBasedContext(n_dims=n_dims, n_samples=chunk_size)),
  ("clustering based", ClusteringBasedContext(n_dims=n_dims, n_clusters=chunk_size))
]

####################################################################################################
####################################################################################################

# create the directory for storing the benchmark results if it does not exist yet
path = f"./out/{total_size}/{chunk_size}"
os.makedirs(path, exist_ok=True)

### Running the test cases

In [None]:
test_cases = []

for c_strat in context_strategies:
  for u_strat in update_strategies:
    print(f"context: {c_strat[0]}, update: {u_strat[0]}")
    
    # check if already completed
    label = f"{c_strat[0]}-{u_strat[0]}"
    if os.path.isfile(f"{path}/doi/{label}.csv") or os.path.isfile(f"{path}/times/{label}.csv"):
      print("skipping test case because already completed.")
      continue

    reset()
    test_case = BenchmarkTestCase(label, outlierness, c_strat[1], u_strat[1], chunk_size, chunks)
    test_case.run(doi_csv_path=f"{path}/doi/", times_csv_path=f"{path}/times/")
    test_cases += [test_case]
    print(f"done: {test_case.total_time}s")

### DOI histograms per test case

In [48]:
import altair as alt

charts = []

# load all data from the out directory into one dataframe and add a column that indicates the context
# and update strategies used in this particular use case
available_test_cases = os.listdir(f"{path}/doi")
available_test_cases

all_doi_values_df = pd.DataFrame()

# build one big dataframe containing all doi scores and label each based on the strategies that were
# used to generate them
for c_strat in context_strategies:
  for u_strat in update_strategies:
    # check if that test case exists
    test_case = f"{c_strat[0]}-{u_strat[0]}.csv"
    if test_case not in available_test_cases:
      continue

    df = pd.read_csv(f"{path}/doi/{test_case}")
    df["context_strategy"] = c_strat[0]
    df["update_strategy"] = u_strat[0]
    # DataFrame.append() was removed in pandas 2.0, so we concatenate instead
    all_doi_values_df = pd.concat([all_doi_values_df, df], ignore_index=True)

# approach1: manually compute the histogram over all groups in the data, then visualize those bins
# histogram = np.histogram(all_doi_values_df["doi"])

# approach2: use altair to do the grouping and binning.
alt.data_transformers.disable_max_rows()
alt.Chart(all_doi_values_df).mark_bar().encode(
  x=alt.X("doi:Q", bin=True),
  y=alt.Y("count()", axis=None),
  row="context_strategy:N",
  column="update_strategy:N"
).properties(
  width=100,
  height=100
)


### Time series per test case

In [None]:
import altair as alt

charts = []

# load all data from the out directory into one dataframe and add a column that indicates the context
# and update strategies used in this particular use case
available_test_cases = os.listdir(f"{path}/times")
available_test_cases

all_doi_values_df = pd.DataFrame()

# build one big dataframe containing all time measurements and label each based on the strategies
# that were used to generate them (the variable name is kept from the cell above)
for c_strat in context_strategies:
  for u_strat in update_strategies:
    # check if that test case exists
    test_case = f"{c_strat[0]}-{u_strat[0]}.csv"
    if test_case not in available_test_cases:
      continue

    df = pd.read_csv(f"{path}/times/{test_case}")
    df["context_strategy"] = c_strat[0]
    df["update_strategy"] = u_strat[0]
    # DataFrame.append() was removed in pandas 2.0, so we concatenate instead
    all_doi_values_df = pd.concat([all_doi_values_df, df], ignore_index=True)

# plot the total time per chunk on a log scale: one line per context strategy,
# one column per update strategy
alt.data_transformers.disable_max_rows()
alt.Chart(all_doi_values_df).mark_line().encode(
  x="chunk:Q",
  y={"field": "total_time", "type": "quantitative", "scale": {"type": "log"}},
  column={"field": "update_strategy"},
  color="context_strategy:N",
).properties(
  width=200,
  height=500
)

### Getting the ground truth: Monolithic computation
The ground truth for our strategies is a full computation over the entire dataset without any chunking.
This computation naturally takes a long time to complete, which is why the progressive scenario is so much more effective from a user perspective: we get to see the data much faster.

In the context of the `BenchmarkTestCase` class, the monolithic computation corresponds to running a progression with a single chunk that contains all items.

In [3]:
reset()
ground_truth_test_case = BenchmarkTestCase("ground_truth", outlierness, NoContext(n_dims), NoUpdate(n_dims), chunk_size=total_size, chunks=1)
ground_truth_test_case.run(doi_csv_path=f"{path}/doi/full/")

### Evaluation

In [None]:
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import jaccard_score, r2_score

rus = RandomUnderSampler(random_state=0)

def evaluate_test_case(test_case: np.ndarray, ground_truth: np.ndarray):
  # score = jaccard_score(test_case, ground_truth, average="weighted")
  score = r2_score(ground_truth, test_case)
  return score

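# NOTE: results_full, results_context and results_chunked are not defined in this notebook.
# They are assumed to hold the DOI results of the respective test cases, in the same item order.
ground_truth = results_full["doi"]
context_test_case = results_context["doi"]
baseline_test_case = results_chunked["doi"]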


evaluate_test_case(baseline_test_case, ground_truth), evaluate_test_case(context_test_case, ground_truth)
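
To make the score easier to interpret, the sketch below computes the coefficient of determination by hand, using the same definition that `r2_score` applies to a single output. The function `r2` and the DOI values are made up for illustration and are not benchmark results.

```python
import numpy as np

def r2(ground_truth: np.ndarray, predicted: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((ground_truth - predicted) ** 2)
    ss_tot = np.sum((ground_truth - ground_truth.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# made-up DOI values; both arrays must list the same items in the same order
truth = np.array([0.1, 0.4, 0.35, 0.8])
close = np.array([0.1, 0.5, 0.3, 0.8])  # a good approximation
far = np.array([0.8, 0.1, 0.9, 0.2])    # a poor approximation

print(r2(truth, truth))
print(r2(truth, close))
print(r2(truth, far))
```

A score of 1 means the progressive result reproduces the ground truth exactly, while a negative score means it predicts the ground truth worse than the mean DOI value would.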