# Benchmarks of Strategies for Selecting Outdated Items
This notebook contains the benchmarks related to the selection strategies for context data, which we report in our paper.
Context data are selected from the processed data and included in the next progressive computation step, such that its result approximates that of a _non-progressive_ computation over the processed data.

## Benchmark Configuration

We use the following configuration in our benchmarks:

### Test cases
- full computation over the entire dataset (upper baseline)
- progressive computation without optimization (lower baseline)
- full computation of processed data
- progressive computation using optimization strategies

### Dataset
- NYC taxis dataset (10 million items), stored in a compressed CSV file, loaded with DuckDB

### Variables
- dependent variables: runtime, prediction error
- independent variables: 

## Benchmarks

### Finding an appropriate `chunk size`
The number of items in each chunk dictates the computation time per chunk: the more items we process, the longer the DOI computation takes.
Therefore, the first consideration in our benchmarks is to find the maximum number of items for which the computation time still feels immediate.
Prior work (see Card et al., 1991) has shown this limit to be about one second.
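
As a minimal sketch of this search, the helper below times a workload for increasing candidate sizes and keeps the largest one that stays within the budget. The names `largest_size_within_budget` and `toy_workload` are hypothetical and not part of our benchmark code; the toy workload merely stands in for loading, processing, and scoring a chunk, and the early exit assumes that runtime grows with the chunk size.

```python
from time import perf_counter
from typing import Callable, Iterable, Optional


def largest_size_within_budget(
  sizes: Iterable[int],
  run: Callable[[int], object],
  budget_s: float = 1.0,
) -> Optional[int]:
  """Return the largest candidate size whose run finishes within budget_s."""
  best = None
  for size in sorted(sizes):
    start = perf_counter()
    run(size)  # hypothetical stand-in for processing one chunk of `size` items
    elapsed = perf_counter() - start
    if elapsed > budget_s:
      break  # assumes runtime grows with size, so larger sizes are skipped
    best = size
  return best


def toy_workload(size: int) -> int:
  # Toy workload (an assumption for illustration): cost grows with size.
  return sum(range(size))


print(largest_size_within_budget([10, 100, 1000, 10000], toy_workload))
```

A single measurement per size is noisy, so in practice one would repeat each measurement and compare, for example, the median against the budget.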

In the cell below, we try different chunk sizes to find the maximum number of items we can pass to the DOI function while keeping the computation under 1s.

In [None]:
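from time import perf_counter
from setup import *  # assumed to provide reset() and doi, as in the cells below
from database import get_from_data, process_chunk

chunk_sizes = [10, 100, 1000, 10000]
reset()
for size in chunk_sizes:
  # time loading, processing, and the DOI computation for one chunk of `size` items
  before = perf_counter()
  data = get_from_data([f"TRUE LIMIT {size}"], as_df=True)
  data = process_chunk(data)
  doi.compute_doi(data)
  print(f"{size}: {perf_counter() - before}s")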


### Computing the Baselines
#### Baseline 1: Monolithic computation
The ground truth for our strategies is a full computation over the entire dataset without any chunking.
This computation naturally takes a long time to complete, which is why the progressive scenario is so much more effective from a user perspective: we get to see the data much faster.

In the context of the `BenchmarkTestCase` class, the monolithic computation corresponds to running a progression with a single chunk.

#### Baseline 2: Bigger chunks
In addition to the full computation, another important baseline is a computation that does not use any strategies, but instead spends the entire `chunk time` on computing a whole new chunk.
The idea is to check whether the context and outdated-item computations are actually valuable, or whether we could simply use all resources to process new data instead.
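
To make the arithmetic behind this baseline concrete, the sketch below computes how many items each bigger chunk holds and how many such chunks fit into the dataset. The helper `bigger_chunk_plan` and all sizes are hypothetical values chosen for illustration, not the ones used in our benchmarks. Note that floor division drops any items that do not fill a complete chunk.

```python
def bigger_chunk_plan(total_size: int, chunk_size: int, context_size: int, update_size: int):
  """Return (items per bigger chunk, number of full chunks, leftover items)."""
  per_step = chunk_size + context_size + update_size
  n_chunks = total_size // per_step  # only complete chunks are counted
  leftover = total_size - n_chunks * per_step
  return per_step, n_chunks, leftover


# Hypothetical sizes, for illustration only.
per_step, n_chunks, leftover = bigger_chunk_plan(
  total_size=10_000, chunk_size=1_000, context_size=300, update_size=200
)
print(per_step, n_chunks, leftover)  # 1500 6 1000
```

With these toy numbers, 1000 of the 10000 items are never processed, which is worth keeping in mind when comparing this baseline against the ground truth.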

In [None]:
from setup import *
from benchmark_test_case import *

reset()
ground_truth_test_case = BenchmarkTestCase(
  name="__ground_truth__", 
  doi=doi, 
  storage_strategy=NoStorage(), 
  context_strategy=NoContext(n_dims, None), 
  update_strategy=NoUpdate(n_dims, None), 
  chunk_size=total_size, 
  chunks=1
)
ground_truth_test_case.run(doi_csv_path=f"{path}/doi/")

reset()
bigger_chunks_test_case = BenchmarkTestCase(
  name="__bigger_chunks__",
  doi=doi,
  storage_strategy=NoStorage(),
  context_strategy=NoContext(n_dims, None),
  update_strategy=NoUpdate(n_dims, None),
  chunk_size=chunk_size + update_size + context_size,
  # floor division already yields an int, so no rounding is needed
  chunks=total_size // (chunk_size + update_size + context_size)
)
bigger_chunks_test_case.run(doi_csv_path=f"{path}/doi/")

### Running the test cases

In [None]:
import os

from setup import *
from benchmark_test_case import *


test_cases = []
total = len(context_strategies) * len(update_strategies)
i = 0
doi_ = doi
print(f"data: {data_label}, {short_test_case_title}\n####")
for c_strat in context_strategies:
  for u_strat in update_strategies:
    i += 1
    print(f"({i}/{total}) context: {c_strat[0]}, update: {u_strat[0]}")

    # check if already completed
    label = f"{c_strat[0]}-{u_strat[0]}"
    if os.path.isfile(f"{path}/doi/{label}.csv") and os.path.isfile(f"{path}/times/{label}.csv"):
      print("skipping test case because already completed.")
      continue

    reset()
    test_case = BenchmarkTestCase(
      label, 
      doi, 
      WindowingStorage(max_size=total_size), 
      c_strat[1](), 
      u_strat[1](), 
      chunk_size=chunk_size, 
      context_size=context_size,
      update_size=update_size,
      chunks=chunks
    )

    test_case.run(doi_csv_path=f"{path}/doi/", times_csv_path=f"{path}/times/")
    # test_case.run(
    #   doi_csv_path=f"{path}/doi/", 
    #   times_csv_path=f"{path}/times/", 
    #   update_interval=update_interval
    # )

    test_cases.append(test_case)
    print(f"done: {test_case.total_time}s")

### Evaluation
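
The evaluation below scores each test case against the ground truth with `r2_score`, whose signature is `r2_score(y_true, y_pred)`. As a minimal sketch of what this score measures, the pure-Python function `r2` below (a hypothetical helper, not part of our code) computes the coefficient of determination on made-up toy values. It also shows that the score is not symmetric, so the ground truth has to be passed first.

```python
from typing import Sequence


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
  """Coefficient of determination: 1 - SS_res / SS_tot."""
  mean_true = sum(y_true) / len(y_true)
  ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
  ss_tot = sum((t - mean_true) ** 2 for t in y_true)
  # Unlike sklearn's r2_score, this sketch does not handle ss_tot == 0.
  return 1.0 - ss_res / ss_tot


# Toy values, made up for illustration only.
truth = [0.0, 1.0, 2.0, 3.0]
approx = [0.0, 1.0, 2.0, 2.0]

print(r2(truth, approx))  # ground truth first, as in r2_score(y_true, y_pred)
print(r2(approx, truth))  # swapped arguments give a different score
```

A score of 1 means that the progressive result matches the ground truth exactly, and lower scores indicate a larger prediction error.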

In [None]:
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import jaccard_score, r2_score

rus = RandomUnderSampler(random_state=0)

def evaluate_test_case(test_case: np.ndarray, ground_truth: np.ndarray):
  # score = jaccard_score(test_case, ground_truth, average="weighted")
  score = r2_score(ground_truth, test_case)
  return score

ground_truth = results_full["doi"]
context_test_case = results_context["doi"]
baseline_test_case = results_chunked["doi"]


evaluate_test_case(baseline_test_case, ground_truth), evaluate_test_case(context_test_case, ground_truth)