# Benchmarks of Strategies for Selecting Outdated Items
This notebook contains the benchmarks of the selection strategies for outdated data that we report in our paper.
Outdated data are elements of the processed data that have "aged" since they were last computed, i.e., they may no longer be valid and need to be updated.
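
To make the notion of "aging" concrete, the following minimal sketch assumes that each processed item remembers the chunk at which it was last computed, and that an item counts as outdated once more than `max_age` chunks have passed since then. The names `ProcessedItem`, `last_computed_chunk`, and `is_outdated` are hypothetical and are not taken from the benchmarked code.

```python
from dataclasses import dataclass


@dataclass
class ProcessedItem:
    """Hypothetical record: an item id plus the chunk at which it was last computed."""
    item_id: int
    last_computed_chunk: int


def is_outdated(item: ProcessedItem, current_chunk: int, max_age: int) -> bool:
    """Assumed rule: an item is outdated if its age in chunks exceeds max_age."""
    age = current_chunk - item.last_computed_chunk
    return age > max_age


items = [ProcessedItem(1, 0), ProcessedItem(2, 9), ProcessedItem(3, 14)]
outdated = [it.item_id for it in items if is_outdated(it, current_chunk=15, max_age=10)]
print(outdated)
```

Under this assumed rule, only the item last computed at chunk 0 has an age above 10 at chunk 15, so it is the only one selected.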

## Benchmark Configuration

We use the following configuration in our benchmarks:

### Test cases
- full computation over the entire dataset (upper baseline)
- progressive computation without optimization (lower baseline)
- full computation of processed data
- progressive computation using optimization strategies

### Dataset
- NYC taxis dataset (10 million items), stored in a compressed CSV file and loaded with DuckDB

### Variables
- dependent variables: runtime, prediction error
- independent variables: 

## Setup

In [1]:
import os
from sys import path
cwd = os.getcwd()
path.append(f"{cwd}/..")

## Benchmarks

In [2]:
from outdated_item_selection_strategy.no_update import *
from outdated_item_selection_strategy.oldest_chunks_update import *
from outdated_item_selection_strategy.last_n_chunks_update import *
from outdated_item_selection_strategy.regular_interval_update import *
from outdated_item_selection_strategy.outdated_bin_update import *

n_dims = 17

strategies = [
  ("no chunk", NoUpdate(n_dims=n_dims)),
  ("oldest n chunks", OldestChunksUpdate(n_dims=n_dims, n_chunks=3, max_age=10)),
  ("last n chunks", LastNChunksUpdate(n_dims=n_dims, n_chunks=3)),
  ("regular intervals", RegularIntervalUpdate(n_dims=n_dims, interval=2, max_age=10)),
  ("outdated bins", OutdatedBinUpdate(n_dims=n_dims))
]
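
The strategy classes are imported from the project, and their implementations are not shown in this notebook. As a rough illustration of the idea behind a "last n chunks" selection, the sketch below assumes that processed items are kept in a dictionary keyed by chunk index. `select_last_n_chunks` and `processed` are hypothetical names and not the project's API.

```python
from typing import Dict, List


def select_last_n_chunks(
    processed: Dict[int, List[int]], current_chunk: int, n_chunks: int
) -> List[int]:
    """Assumed rule: return the items of the n_chunks most recent chunks before current_chunk."""
    first = max(0, current_chunk - n_chunks)
    selected: List[int] = []
    for chunk_id in range(first, current_chunk):
        selected.extend(processed.get(chunk_id, []))
    return selected


# Example data: 15 chunks with 100 item ids each.
processed = {c: list(range(c * 100, (c + 1) * 100)) for c in range(15)}
selected = select_last_n_chunks(processed, current_chunk=15, n_chunks=3)
print(len(selected))
```

With 15 chunks of 100 items and `n_chunks=3`, this sketch selects the 300 items of chunks 12 to 14.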

In [3]:
from database import get_next_chunk_from_db, initialize_db, drop_tables
import time

chunks = 15  # number of chunks to load
chunk_size = 100  # number of items per chunk

# Drop any existing tables, then initialize the database from the compressed CSV.
start = time.time()
drop_tables()
initialize_db("../data/nyc_taxis_sampled100k_shuffled.csv.gz")
print("initialization took", time.time() - start)

# Load the data chunk by chunk and time the whole loop.
s = time.time()
for i in range(chunks):
  chunk = get_next_chunk_from_db(chunk_size)
print("loading the data in chunks took", time.time() - s)


initialization took 0.118682861328125
loading the data in chunks took 1.7711987495422363
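
The timings above are taken with `time.time()`, which reads the system wall clock. For short intervals such as these, `time.perf_counter()` is generally the more suitable clock, since it is monotonic and uses the highest available resolution. The following sketch wraps it in a small context manager. `timed` and `timings` are hypothetical helpers, not part of this notebook.

```python
import time
from contextlib import contextmanager

timings = {}  # hypothetical store: label -> elapsed seconds


@contextmanager
def timed(label):
    """Measure the duration of the enclosed block with a monotonic clock."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the elapsed time even if the block raises.
        timings[label] = time.perf_counter() - start


with timed("sum"):
    total = sum(range(100_000))

print(f"sum took {timings['sum']:.6f} s")
```

Collecting the durations in a dictionary rather than printing them directly makes it easier to compare strategies afterwards.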


In [4]:
current_chunk = chunks
# Ask each strategy which items it considers outdated at the current chunk, and time the call.
for i, strategy in enumerate(strategies):
  start = time.time()
  print("#", strategy[0])
  outdated_items = strategy[1].get_outdated_items(current_chunk)
  print(f"found {len(outdated_items)} outdated items:")
  print(outdated_items)
  print(time.time() - start)  # elapsed time in seconds
  print("\n")

# no chunk
found 0 outdated items:
[]
0.000997781753540039


# oldest n chunks
oldest chunk is 0
found 100 outdated items:
[[31852922 2 Timestamp('2018-07-18 06:17:11') ... 0.0 0.3 9.96]
 [16785706 2 Timestamp('2018-01-09 02:35:11') ... 10.5 0.3 76.62]
 [66379394 2 Timestamp('2018-11-24 17:07:34') ... 0.0 0.3 6.8]
 ...
 [52000532 2 Timestamp('2018-10-03 18:05:25') ... 0.0 0.3 11.8]
 [4868680 1 Timestamp('2018-05-26 16:58:54') ... 0.0 0.3 9.95]
 [33632368 2 Timestamp('2018-07-24 21:37:31') ... 0.0 0.3 20.47]]
0.2972254753112793


# last n chunks
getting from processed took 0.0009975433349609375
15 3
found 300 outdated items:
[[21012485 2 Timestamp('2018-01-23 15:50:50') ... 0.0 0.3 20.76]
 [31709419 1 Timestamp('2018-07-17 15:14:44') ... 0.0 0.3 17.3]
 [5437981 2 Timestamp('2018-05-29 11:36:59') ... 0.0 0.3 10.56]
 ...
 [31466697 2 Timestamp('2018-07-16 18:06:22') ... 0.0 0.3 7.82]
 [73877432 2 Timestamp('2018-12-20 20:35:12') ... 0.0 0.3 15.96]
 [61365300 1 Timestamp('2018-11-05 18:49: