# Benchmarks of Strategies for Selecting Outdated Items
This notebook contains the benchmarks of the selection strategies for context data that we report in our paper.
Context data are items selected from the already processed data and included in the next progressive computation step, so that the result of that step approximates the result of a _non-progressive_ computation over the processed data.
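
To make the role of context data concrete, the following is a minimal sketch on toy data, not the code used in the benchmarks. It assumes the computation is a plain mean and that the reference is a recomputation over everything seen so far; the function names and the "every 10th item" selection are hypothetical and only serve as illustration.

```python
from statistics import fmean


def non_progressive(processed, new_chunk):
    """Reference result: recompute over all data seen so far (assumed target)."""
    return fmean(processed + new_chunk)


def progressive_step(new_chunk, context_items=()):
    """One progressive step: only the new chunk plus the selected context items."""
    return fmean(list(context_items) + new_chunk)


# Toy data with a drift, so that the newest chunk alone is not representative.
processed = [float(v) for v in range(1000)]
new_chunk = [float(v) for v in range(1000, 1100)]

# Hypothetical selection strategy: keep every 10th processed item as context.
context = processed[::10]

reference = non_progressive(processed, new_chunk)
error_without = abs(progressive_step(new_chunk) - reference)
error_with = abs(progressive_step(new_chunk, context) - reference)

print("error without context items:", error_without)
print("error with context items:   ", error_with)
```

In this toy setting the step that includes context items lands closer to the reference than the step that sees only the new chunk, while still touching far fewer items than the full recomputation.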

## Benchmark Configuration

We use the following configuration in our benchmarks:

### Test cases
- full computation over the entire dataset (upper baseline)
- progressive computation without optimization (lower baseline)
- full computation of processed data
- progressive computation using optimization strategies

### Dataset
- NYC taxis dataset (10 million items), stored in a compressed CSV file and loaded with DuckDB

### Variables
- dependent variables: runtime, prediction error
- independent variables: 

## Setup

In [1]:
import os
from sys import path

# Make the modules in the parent directory importable from this notebook.
cwd = os.getcwd()
path.append(os.path.join(cwd, ".."))

## Benchmarks

In [2]:
from context_item_selection_strategy.chunk_based_context import *
from context_item_selection_strategy.sampling_based_context import *
from context_item_selection_strategy.clustering_based_context import *

n_dims = 17
chunk_size = 100

strategies = [
  ("chunk based", ChunkBasedContext(n_dims=n_dims, n_chunks=3)),
  ("sampling based", SamplingBasedContext(n_dims=n_dims, n_samples=chunk_size)),
  ("clustering based", ClusteringBasedContext(n_dims=n_dims, n_clusters=chunk_size))
]
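
As an illustration of what a sampling-based selection could look like, the sketch below keeps a uniform random sample of fixed size over all chunks seen so far, using reservoir sampling. This is an assumption for illustration only: `ReservoirSample` is a hypothetical class, it is not part of `context_item_selection_strategy`, and `SamplingBasedContext` may work differently.

```python
import random


class ReservoirSample:
    """Hypothetical helper: fixed-size uniform sample over a stream of chunks."""

    def __init__(self, n_samples, seed=None):
        self.n_samples = n_samples
        self.items = []
        self.n_seen = 0
        self._rng = random.Random(seed)

    def add_chunk(self, chunk):
        for item in chunk:
            self.n_seen += 1
            if len(self.items) < self.n_samples:
                # Fill the reservoir with the first n_samples items.
                self.items.append(item)
            else:
                # Keep the new item with probability n_samples / n_seen,
                # replacing a uniformly chosen slot of the reservoir.
                j = self._rng.randrange(self.n_seen)
                if j < self.n_samples:
                    self.items[j] = item

    def get_context_items(self):
        return list(self.items)


# Toy usage: 15 chunks of 100 integer "items" each.
reservoir = ReservoirSample(n_samples=100, seed=0)
for chunk_index in range(15):
    chunk = list(range(chunk_index * 100, (chunk_index + 1) * 100))
    reservoir.add_chunk(chunk)

context_items = reservoir.get_context_items()
print(f"kept {len(context_items)} of {reservoir.n_seen} items")
```

The appeal of this design is that memory stays bounded by `n_samples` no matter how many chunks have been processed, and every processed item has the same chance of being kept as context.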

In [3]:
from database import get_next_chunk_from_db, initialize_db, drop_tables
import time

chunks = 15
chunk_size = 100

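# Reset the database and initialize it from the sampled dataset file.
start = time.time()
drop_tables()
initialize_db("../data/nyc_taxis_sampled100k_shuffled.csv.gz")
print("initialization took", time.time() - start)

# Fetch the data chunk by chunk, as a progressive computation would.
start = time.time()
for _ in range(chunks):
  chunk = get_next_chunk_from_db(chunk_size)
print("loading the data in chunks took", time.time() - start)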

initialization took 0.10671401023864746
loading the data in chunks took 1.4261481761932373
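
The timings in this notebook are taken with `time.time()` around each block. As a possible alternative, the sketch below wraps the pattern in a small context manager based on `time.perf_counter()`, which is a monotonic, high-resolution clock intended for measuring short durations. The helper name `timed` and the `results` dictionary are hypothetical and not part of the benchmark code.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Hypothetical helper: time the enclosed block with a monotonic clock."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            # Collect the measurement so it can be compared or plotted later.
            results[label] = elapsed
        print(f"{label} took {elapsed:.4f} s")


runtimes = {}
with timed("summing 100k integers", runtimes):
    total = sum(range(100_000))
```

Collecting the measurements in a dictionary keeps the runtime of each test case available for later comparison instead of only printing it.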


In [4]:
current_chunk = chunks
# Run every strategy once and report the selected context items and the runtime in seconds.
for name, strategy in strategies:
  start = time.time()
  print("#", name)
  context_items = strategy.get_context_items(current_chunk)
  print(f"found {len(context_items)} context items:")
  print(context_items)
  print(time.time() - start)
  print("\n")

# chunk based
found 200 context items:
[[52792702 2 Timestamp('2018-10-06 11:40:12') ... 0.0 0.3 32.76]
 [66930640 2 Timestamp('2018-11-26 23:09:53') ... 5.76 0.3 32.14]
 [37619021 1 Timestamp('2018-08-08 23:01:25') ... 0.0 0.3 8.8]
 ...
 [31466697 2 Timestamp('2018-07-16 18:06:22') ... 0.0 0.3 7.82]
 [73877432 2 Timestamp('2018-12-20 20:35:12') ... 0.0 0.3 15.96]
 [61365300 1 Timestamp('2018-11-05 18:49:57') ... 0.0 0.3 23.92]]
0.2743861675262451


# sampling based
found 100 context items:
[[75736837 1 Timestamp('2018-12-29 18:49:57') ... 0.0 0.3 7.3]
 [42552891 2 Timestamp('2018-08-29 07:20:23') ... 0.0 0.3 61.56]
 [3063068 1 Timestamp('2018-05-20 06:35:00') ... 0.0 0.3 23.15]
 ...
 [11419837 1 Timestamp('2018-06-18 21:15:46') ... 0.0 0.3
  7.5600000000000005]
 [54511485 2 Timestamp('2018-10-12 17:47:23') ... 0.0 0.3 40.56]
 [53541945 2 Timestamp('2018-10-09 12:06:06') ... 0.0 0.3 29.75]]
0.2671966552734375


# clustering based
found 100 context items:
[[61726465 2 Timestamp('2018-11