# Benchmarks of Strategies for Selecting Outdated Items
This notebook contains the benchmarks related to the selection strategies for context data, which we report in our paper.
Context data are selected from the processed data and included in the next progressive computation step, such that its result approximates that of a _non-progressive_ computation over the processed data.

## Benchmark Configuration

We use the following configuration in our benchmarks:
### Test cases 
- full computation over the entire dataset (upper baseline)
- progressive computation without optimization (lower baseline)
- full computation of processed data
- progressive computation using optimization strategies

### Dataset
- NYC taxis dataset (10 Million items), stored in a compressed CSV file, loaded with DuckDB 

### Variables
- dependent variables: runtime, prediction error
- independent variables: 

## Benchmarks

### Defining the DOI function
In our experiments, we will use the DOI function below, which measures outlierness (i.e., how unique a datum is compared to the rest of the data).
Outlierness cannot be expressed by looking at an individual data item; it can only be expressed within the context of other data. A purely progressive computation of this DOI measure will therefore be inherently inaccurate compared to a "monolithic" computation that looks at all data at once.
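
To make this concrete, the following is a minimal sketch that uses only the Python standard library. It does not use `OutliernessComponent`: `toy_outlierness` is a hypothetical stand-in (an absolute z-score), and `toy_data`, `toy_chunk_size`, and the random context sample are assumptions chosen purely for illustration. It scores the same items three ways: monolithically, chunk by chunk in isolation, and chunk by chunk together with a small sample of already processed items as context.

```python
import random
from statistics import fmean, pstdev

def toy_outlierness(values):
    """Hypothetical stand-in for a DOI function: absolute z-score of each value."""
    mean, std = fmean(values), pstdev(values)
    if std == 0:
        return [0.0] * len(values)
    return [abs(v - mean) / std for v in values]

# Illustrative data: two chunks of values near 1, then one chunk of values near 5.
toy_data = [1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 1.0, 5.0, 5.2, 4.8, 5.0]
toy_chunk_size = 4
rng = random.Random(0)  # fixed seed so the context sample is reproducible

# Monolithic: every item is scored against the entire dataset.
monolithic = toy_outlierness(toy_data)

progressive = []   # each chunk scored in isolation
with_context = []  # each chunk scored together with a sample of earlier items
for start in range(0, len(toy_data), toy_chunk_size):
    chunk = toy_data[start:start + toy_chunk_size]
    processed = toy_data[:start]
    progressive += toy_outlierness(chunk)

    # Assumption: context is a uniform random sample of the processed items.
    context = rng.sample(processed, min(len(processed), toy_chunk_size))
    scores = toy_outlierness(context + chunk)
    with_context += scores[len(context):]  # keep only the new chunk's scores

i = 8  # the value 5.0: typical within its own chunk, unusual overall
print(f"monolithic={monolithic[i]:.2f}, "
      f"progressive={progressive[i]:.2f}, "
      f"with context={with_context[i]:.2f}")
```

In this toy setting, the item at index 8 looks entirely ordinary inside its own chunk, while the monolithic score marks it as unusual. Adding context items moves the progressive score closer to the monolithic one.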


In [None]:
from doi_component.outlierness_component import OutliernessComponent

outlierness = OutliernessComponent(["ratio", "duration"])

In [1]:
import os
from sys import path
cwd = os.getcwd()
path.append(f"{cwd}/..")

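from database import initialize_db, drop_tables

# path to the dataset; reset() needs it, and it is the same file used in the benchmark cell
data_path = "../data/nyc_taxis_sampled100k_shuffled.csv.gz"

def reset():
  # drop all tables and reload the dataset, so that every run starts from a clean database
  drop_tables()
  initialize_db(data_path)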

### Finding an appropriate `chunk size`
The number of items in each chunk dictates the computation time per chunk: the more items we process, the longer the DOI computation takes.
Therefore, the first consideration in our benchmarks is to find the maximum number of items for which the computation time remains immediate.
Prior work (see Card et al., 1991) has shown this limit to be about one second.

In the cell below, we thus compute the maximum number of items for which the DOI function still returns values within one second:

In [None]:
chunk_sizes = [10, 100, 1000, 10000]
reset()
for size in chunk_sizes:
  
  print(size)
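
One way such a measurement could look is sketched below. This is not the notebook's actual procedure: `toy_doi` is a hypothetical stand-in with quadratic cost (not `OutliernessComponent`), and `largest_chunk_within_budget`, its parameters, and the synthetic items are assumptions made for illustration. The sketch times the function for increasing chunk sizes and keeps the largest size that stays within the time budget.

```python
import time

def toy_doi(values):
    """Hypothetical stand-in for a DOI function; its cost grows quadratically."""
    n = len(values)
    return [sum(abs(v - w) for w in values) / n for v in values]

def largest_chunk_within_budget(doi_fn, candidate_sizes, budget_seconds=1.0):
    """Return the largest candidate size whose DOI computation fits the budget.

    Returns None if even the smallest candidate exceeds the budget.
    """
    best = None
    for size in sorted(candidate_sizes):
        items = [float(i % 97) for i in range(size)]  # synthetic placeholder items
        start = time.perf_counter()
        doi_fn(items)
        elapsed = time.perf_counter() - start
        if elapsed > budget_seconds:
            break  # larger sizes are assumed to take at least as long
        best = size
    return best

print(largest_chunk_within_budget(toy_doi, [10, 100, 1000], budget_seconds=1.0))
```

The result depends on the machine, so a single run is only a rough indication. Repeating each measurement and taking the median would give a more stable estimate.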

### Baseline: Chunk-based computation without any optimizations

In [9]:
import os
from sys import path
cwd = os.getcwd()
path.append(f"{cwd}/..")

from database import initialize_db, drop_tables, get_next_chunk_from_db
from benchmark_test_case import *
import numpy as np
import pandas as pd
import time

from doi_component.outlierness_component import OutliernessComponent
outlierness = OutliernessComponent(["ratio", "duration"])

n_dims = 17
total_items = 99999
chunk_size = 1000
chunks = round(total_items / chunk_size)

data_path = "../data/nyc_taxis_sampled100k_shuffled.csv.gz"

def reset():
  drop_tables()
  initialize_db(data_path)

from outdated_item_selection_strategy.no_update import *
from outdated_item_selection_strategy.oldest_chunks_update import *
from outdated_item_selection_strategy.last_n_chunks_update import *
from outdated_item_selection_strategy.regular_interval_update import *
from outdated_item_selection_strategy.outdated_bin_update import *

update_strategies = [
  ("no chunk", NoUpdate(n_dims=n_dims)),
  ("oldest n chunks", OldestChunksUpdate(n_dims=n_dims, n_chunks=3, max_age=10)),
  ("last n chunks", LastNChunksUpdate(n_dims=n_dims, n_chunks=3)),
  ("regular intervals", RegularIntervalUpdate(n_dims=n_dims,interval=2, max_age=10)),
  ("outdated bins", OutdatedBinUpdate(n_dims=n_dims))
]

from context_item_selection_strategy.no_context import * 
from context_item_selection_strategy.chunk_based_context import *
from context_item_selection_strategy.sampling_based_context import *
from context_item_selection_strategy.clustering_based_context import *

context_strategies = [
  ("no context", NoContext(n_dims=n_dims)),
  ("chunk based", RandomChunkBasedContext(n_dims=n_dims, n_chunks=3)),
  ("sampling based", RandomSamplingBasedContext(n_dims=n_dims, n_samples=chunk_size)),
  ("clustering based", ClusteringBasedContext(n_dims=n_dims, n_clusters=chunk_size))
]

####################################################################################################
####################################################################################################

test_cases = []

path = f"./out/{total_items}/{chunk_size}"
# create the output folder and its parents; this is a no-op if they already exist
os.makedirs(path, exist_ok=True)


for c_strat in context_strategies:
  for u_strat in update_strategies:
    print(f"context: {c_strat[0]}, update: {u_strat[0]}")
    
    # check if already completed
    label = f"{c_strat[0]}-{u_strat[0]}"
    if os.path.isfile(f"{path}/doi/{label}.csv") or os.path.isfile(f"{path}/times/{label}.csv"):
      print("skipping test case because already completed.")
      continue

    reset()
    test_case = BenchmarkTestCase(label, outlierness, c_strat[1], u_strat[1], chunk_size, chunks)
    test_case.run(doi_csv_path=f"{path}/doi/", times_csv_path=f"{path}/times/")
    test_cases += [test_case]
    print(f"done: {test_case.total_time}s")


for tc in test_cases: 
  tc.doi_histogram()



context: no context, update: no chunk
done: 197.07497572898865s
context: no context, update: oldest n chunks
done: 501.54559350013733s
context: no context, update: last n chunks
done: 1565.0280170440674s
context: no context, update: regular intervals
done: 510.384170293808s
context: no context, update: outdated bins
done: 213.77439284324646s
context: chunk based, update: no chunk
done: 481.0443522930145s
context: chunk based, update: oldest n chunks
done: 786.4391601085663s
context: chunk based, update: last n chunks
done: 1954.5901024341583s
context: chunk based, update: regular intervals
done: 828.6719510555267s
context: chunk based, update: outdated bins
done: 539.5761635303497s
context: sampling based, update: no chunk
done: 342.1771938800812s
context: sampling based, update: oldest n chunks
done: 664.3325853347778s
context: sampling based, update: last n chunks
done: 1697.6373136043549s
context: sampling based, update: regular intervals
done: 673.5458197593689s
context: sampling b

PermissionError: [Errno 13] Permission denied: './out/99999/1000/doi/'

In [None]:
for tc in test_cases: 
  tc.doi_histogram()

### Configuration

In [None]:
n_dims = 17
total_items = 99999
chunk_size = 1000
chunks = round(total_items / chunk_size)
outlierness = OutliernessComponent(["ratio", "duration"])

In [None]:
reset()

# upper baseline: full computation over all chunk_size * chunks items at once.
start = time.time()
data = get_next_chunk_from_db(chunk_size * chunks, as_df=True)
upper_bound_result = outlierness.compute_doi(data)
time_upper = time.time() - start

print(f"# upper bound: {time_upper}")
upper_bound_result

In [None]:
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import jaccard_score, r2_score

rus = RandomUnderSampler(random_state=0)

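def evaluate_test_case(test_case: np.ndarray, ground_truth: np.ndarray):
  # R² of the test case's DOI values against the ground truth; 1.0 means a perfect match.
  # Note that r2_score expects the ground truth as its first argument.
  # score = jaccard_score(test_case, ground_truth, average="weighted")
  score = r2_score(ground_truth, test_case)
  return score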

ground_truth = results_full["doi"]
context_test_case = results_context["doi"]
baseline_test_case = results_chunked["doi"]


evaluate_test_case(baseline_test_case, ground_truth), evaluate_test_case(context_test_case, ground_truth)
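
For readers unfamiliar with the score used above, here is a small, dependency-free sketch of the coefficient of determination (R²) that `r2_score` reports. The function `toy_r2` and the three value lists are made up for illustration and are not benchmark results. Unlike the library function, this sketch does not handle a constant ground truth.

```python
def toy_r2(ground_truth, prediction):
    """R² = 1 - SS_res / SS_tot, with the ground truth as the reference.

    Illustrative helper only; it divides by zero if the ground truth is constant.
    """
    mean_truth = sum(ground_truth) / len(ground_truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(ground_truth, prediction))
    ss_tot = sum((t - mean_truth) ** 2 for t in ground_truth)
    return 1.0 - ss_res / ss_tot

# Made-up DOI values, purely for illustration.
toy_truth = [0.1, 0.4, 0.5, 0.9]
toy_close = [0.1, 0.5, 0.5, 0.8]   # small deviations from the ground truth
toy_far = [0.9, 0.1, 0.8, 0.2]     # large deviations from the ground truth

print(toy_r2(toy_truth, toy_close), toy_r2(toy_truth, toy_far))
```

A score of 1.0 means the progressive result reproduces the ground truth exactly. The score becomes negative once the result is worse than always predicting the mean of the ground truth, which is why the argument order matters.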