# Progressively Updating DOI values
Another challenge when computing DOI functions in chunks over large data is that the computed values "age".
The more data we process, the more representative of the entire dataset our predictions become, so a value computed early on may differ from the value the same item would receive if it were recomputed hundreds of iterations later.
We therefore need a mechanism to update outdated DOI values.

These updates can happen in different ways:
0. No update: the user's interest lies in the latest data, so old interest values "decay" towards 0.
1. At regular intervals, recompute the DOI of old items together with the next chunk.
2. Whenever some metric exceeds a threshold, recompute the DOI of old items together with the next chunk.
3. Predict the interest of "old" items from the latest DOI information (train a regression model, interpolate between the k nearest neighbors, or use the buckets of a quadtree in view space).

## Prepare the DOI function

In [None]:
import pandas as pd
import numpy as np
import sys
sys.path.append("../") # make the local modules imported below discoverable
from outlierness_component import OutliernessComponent

outlierness = OutliernessComponent()

def doi(X: np.ndarray):
  df = pd.DataFrame(X)
  return outlierness.compute_doi(df)

## Prepare the data

In [None]:
from sklearn.datasets import make_blobs
from progressive_bin_sampler import ProgressiveBinSampler

N = 10000
features = 2
chunk_size = 1000

chunks = N // chunk_size

blobs_params = {"n_samples": N, "n_features": features}

X = make_blobs(centers=6, cluster_std=1, **blobs_params)[0]
y = doi(X)

sampler = ProgressiveBinSampler(n_dims=features)

## Progressive Update Strategies

### No Update: Decaying Interest Values

In [None]:
# start with empty arrays so that every entry corresponds to exactly one processed item
doi_value_chunk = np.empty(0, dtype=int) # stores the chunk at which the data was first processed
old_doi_values = np.empty(0) # stores the original doi value from when the chunk was processed
updated_doi_values = np.empty(0) # stores the doi values after the update

sampler.reset_reservoirs()

for i in range(chunks):
  chunk = X[i*chunk_size:(i+1)*chunk_size]
  progressive_sample = sampler.get_current_sample()
  X_ = np.append(progressive_sample, chunk, axis=0)
  y_ = doi(X_)

  new_doi_values = y_[-chunk_size:]

  sampler.add_chunk(chunk, new_doi_values, i*chunk_size)
  old_doi_values = np.append(old_doi_values, new_doi_values)
  age = np.full_like(new_doi_values, fill_value=i)
  doi_value_chunk = np.append(doi_value_chunk, age)

  # exponential ("radioactive") decay with a half-life of one chunk; the newest values remain unchanged
  updated_doi_values = old_doi_values * (0.5**(i - doi_value_chunk))

[(j, updated_doi_values[doi_value_chunk == j].mean()) for j in range(chunks)]
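
To make the decay rule above concrete, the following minimal sketch applies it to four items with identical original DOI that were first seen in chunks 0 to 3. The helper `decay_doi` and its `half_life` parameter are assumptions introduced for illustration only; with `half_life=1` it should match the `0.5**(i - doi_value_chunk)` factor used in the cell above.

```python
import numpy as np

def decay_doi(original_doi, first_seen_chunk, current_chunk, half_life=1.0):
    """Halve each DOI value once per `half_life` chunks since it was first computed.

    `half_life` is a hypothetical knob, not part of the notebook's code.
    """
    age = current_chunk - np.asarray(first_seen_chunk)
    return np.asarray(original_doi, dtype=float) * 0.5 ** (age / half_life)

original_doi = np.array([0.8, 0.8, 0.8, 0.8])
first_seen_chunk = np.array([0, 1, 2, 3])

# after processing chunk 3, the oldest value has been halved three times
decayed = decay_doi(original_doi, first_seen_chunk, current_chunk=3)
print(decayed)
```

A larger `half_life` would keep old values relevant for longer, at the cost of showing outdated interest for more iterations.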

### Recomputation: Recompute Interest Values
Mix the "old" data into the computation of new values based on some decision rule.

#### Recompute in Regular Intervals

In [None]:
# start with empty arrays so that every entry corresponds to exactly one processed item
doi_value_chunk = np.empty(0, dtype=int) # stores the chunk at which the data was seen first
updated_doi_values = np.empty(0) # stores the doi values after the update

max_age = 2 # update the doi of each chunk every max_age iterations

sampler.reset_reservoirs()

for i in range(chunks):
  chunk = X[i*chunk_size:(i+1)*chunk_size]
  progressive_sample = sampler.get_current_sample()
  sample_size = len(progressive_sample)
  X_ = np.append(progressive_sample, chunk, axis=0)

  # add all earlier chunks whose age has reached a multiple of max_age:
  added_index = {}
  for j in range(i):
    if (i - j) % max_age == 0:
      added_index[j] = len(X_) # store the beginning index for this chunk to find it later
      X_ = np.append(X_, X[j*chunk_size:(j+1)*chunk_size], axis=0)

  y_ = doi(X_)

  new_doi_values = y_[sample_size:sample_size + chunk_size]

  sampler.add_chunk(chunk, new_doi_values, i*chunk_size)
  updated_doi_values = np.append(updated_doi_values, new_doi_values)

  age = np.full_like(new_doi_values, fill_value=i)
  doi_value_chunk = np.append(doi_value_chunk, age)

  # overwrite the doi of every chunk that was recomputed in this iteration
  for j, index in added_index.items():
    updated_doi_values[doi_value_chunk == j] = y_[index:index+chunk_size]

[(j, updated_doi_values[doi_value_chunk == j].mean()) for j in range(chunks)]
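
One way to decide which chunks are due for recomputation is to treat a chunk's age as the number of iterations since it was first seen and to recompute it whenever that age is a positive multiple of `max_age`. The sketch below isolates this decision rule; the helper name `chunks_due_for_update` is an assumption made for illustration and is not part of the notebook.

```python
def chunks_due_for_update(current_chunk, max_age):
    """Return the indices of earlier chunks whose age is a positive multiple of max_age."""
    return [j for j in range(current_chunk) if (current_chunk - j) % max_age == 0]

# list, per iteration, which earlier chunks would be mixed into the computation
schedule = {i: chunks_due_for_update(i, max_age=2) for i in range(5)}
for i, due in schedule.items():
    print(i, due)
```

Under this rule the number of recomputed chunks grows with `i`, so the cost per iteration grows roughly linearly unless `max_age` is increased over time.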

#### Recompute based on a Bin-based Metric

In [None]:
# start with empty arrays so that every entry corresponds to exactly one processed item
doi_value_chunk = np.empty(0, dtype=int) # stores the chunk at which the data was seen first
updated_doi_values = np.empty(0) # stores the doi values after the update

sampler.reset_reservoirs()

for i in range(chunks):
  chunk = X[i*chunk_size:(i+1)*chunk_size]
  progressive_sample = sampler.get_current_sample()
  X_ = np.append(progressive_sample, chunk, axis=0)
  y_ = doi(X_)

  # TODO: compute the per-bin mean
  # TODO: detect mean shift per bin
  # TODO: find all items that belong to this bin and add them to the recomputation
  # TODO: update the doi of all items that were affected by that metric

  new_doi_values = y_[-chunk_size:]

  sampler.add_chunk(chunk, new_doi_values, i*chunk_size)
  age = np.full_like(new_doi_values, fill_value=i)
  doi_value_chunk = np.append(doi_value_chunk, age)
  updated_doi_values = np.append(updated_doi_values, new_doi_values)

[(j, updated_doi_values[doi_value_chunk == j].mean()) for j in range(chunks)]
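
The TODO steps in the cell above could be approached roughly as follows. This is only a sketch under stated assumptions: it uses a fixed, regular 2D grid defined by `edges` instead of the bins managed by `ProgressiveBinSampler`, and the helpers `per_bin_mean`, `shifted_bins` and `bin_index` as well as the threshold value are hypothetical.

```python
import numpy as np

def per_bin_mean(points, values, edges):
    """Mean of `values` per 2D grid cell; cells without points are NaN."""
    x, y = points[:, 0], points[:, 1]
    sums, _, _ = np.histogram2d(x, y, bins=[edges, edges], weights=values)
    counts, _, _ = np.histogram2d(x, y, bins=[edges, edges])
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts

def shifted_bins(old_mean, new_mean, threshold):
    """Boolean grid marking cells whose mean DOI moved by more than `threshold`."""
    diff = np.abs(new_mean - old_mean)
    # cells that are empty in either grid yield NaN and are treated as "no shift"
    return np.nan_to_num(diff, nan=0.0) > threshold

def bin_index(points, edges):
    """Grid cell index (per dimension) of every point, clipped to the grid."""
    return np.clip(np.digitize(points, edges) - 1, 0, len(edges) - 2)

edges = np.array([0.0, 1.0, 2.0])  # 2x2 grid over [0, 2] x [0, 2]
old_points = np.array([[0.5, 0.5], [1.5, 0.5], [0.5, 1.5], [1.5, 1.5]])
old_doi = np.array([0.2, 0.4, 0.6, 0.8])
new_points = np.array([[0.4, 0.6], [1.6, 0.4]])
new_doi = np.array([0.25, 0.9])

shifted = shifted_bins(per_bin_mean(old_points, old_doi, edges),
                       per_bin_mean(new_points, new_doi, edges),
                       threshold=0.2)

# old items that fall into a shifted bin are candidates for recomputation
idx = bin_index(old_points, edges)
stale_items = np.flatnonzero(shifted[idx[:, 0], idx[:, 1]])
print(stale_items)
```

The items in `stale_items` would then be appended to `X_` before calling `doi`, in the same way the interval-based strategy appends whole chunks.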

### Prediction-based: Estimate previous Interest Values from latest Results

Option 1: Use the nearest neighbor found in a KDTree

In [None]:
from sklearn.neighbors import KDTree

# start with empty arrays so that index k always refers to item X[k]
doi_value_chunk = np.empty(0, dtype=int) # stores the chunk at which the data was seen first
updated_doi_values = np.empty(0) # stores the doi values after the update

sampler.reset_reservoirs()

for i in range(chunks):
  chunk = X[i*chunk_size:(i+1)*chunk_size]
  progressive_sample = sampler.get_current_sample()
  X_ = np.append(progressive_sample, chunk, axis=0)
  y_ = doi(X_)

  new_doi_values = y_[-chunk_size:]
  updated_doi_values = np.append(updated_doi_values, new_doi_values)

  kdtree = KDTree(X_)

  if i > 0:
    # for every previously seen point, find its nearest neighbor among the current sample and chunk
    # query() returns an array of shape (n, 1) for k=1, so flatten it to 1D indices
    knn = kdtree.query(X[:i*chunk_size], return_distance=False).reshape(-1, )
    updated_doi_values[:i*chunk_size] = y_[knn]

  # reservoir sample all new doi values
  sampler.add_chunk(chunk, new_doi_values, i*chunk_size)
  age = np.full_like(new_doi_values, fill_value=i)
  doi_value_chunk = np.append(doi_value_chunk, age)

[(j, updated_doi_values[doi_value_chunk == j].mean()) for j in range(chunks)]
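
The idea behind this option can be illustrated without the sampler: every old point simply inherits the DOI of its nearest neighbor among the points that were just evaluated. The sketch below uses a brute-force NumPy search as a stand-in for the `KDTree` query, which is only practical for small inputs; the helper name `propagate_doi` is an assumption made for illustration.

```python
import numpy as np

def propagate_doi(old_points, reference_points, reference_doi):
    """Give each old point the DOI of its nearest reference point (k=1)."""
    # pairwise differences, shape (n_old, n_reference, n_features)
    diff = old_points[:, None, :] - reference_points[None, :, :]
    # index of the reference point with the smallest squared Euclidean distance
    nearest = np.argmin((diff ** 2).sum(axis=2), axis=1)
    return reference_doi[nearest]

reference_points = np.array([[0.0, 0.0], [10.0, 10.0]])
reference_doi = np.array([0.1, 0.9])
old_points = np.array([[1.0, 1.0], [9.0, 8.0], [0.0, 2.0]])

updated = propagate_doi(old_points, reference_points, reference_doi)
print(updated)
```

With k=1 the propagated values are piecewise constant; averaging over several neighbors would smooth them, at the cost of blurring sharp DOI boundaries.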

Option 2: Predict from sampled items

In [None]:
from sklearn.tree import DecisionTreeRegressor

# start with empty arrays so that every entry corresponds to exactly one processed item
doi_value_chunk = np.empty(0, dtype=int) # stores the chunk at which the data was seen first
updated_doi_values = np.empty(0) # stores the doi values after the update

sampler.reset_reservoirs()
tree = DecisionTreeRegressor()

for i in range(chunks):
  chunk = X[i*chunk_size:(i+1)*chunk_size]
  y_ = doi(chunk)

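  if i > 0:
    # fit the regressor on the items currently held in the sample and their stored doi values
    X_sample, y_sample = sampler.get_current_sample(return_labels=True)
    tree.fit(X_sample, y_sample)
    # average the chunk-only doi with the prediction learned from previously seen data
    y2 = tree.predict(chunk)
    y_ = np.mean([y_, y2], axis=0)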

  new_doi_values = y_
  updated_doi_values = np.append(updated_doi_values, new_doi_values)

  # reservoir sample all new doi values
  sampler.add_chunk(chunk, new_doi_values, i*chunk_size)
  age = np.full_like(new_doi_values, fill_value=i)
  doi_value_chunk = np.append(doi_value_chunk, age)

[(j, updated_doi_values[doi_value_chunk == j].mean()) for j in range(chunks)]

## Benchmarking