# Scagnostics DOI Component Tests
These tests exercise the scagnostics DOI component, which measures the interestingness of an item by its impact on the individual scagnostics measures: if the item is removed from the sample of data items, how much does a particular score change?
The idea is that an item that yields a high DOI value in one of the metrics increases the overall score, while an "uninteresting" item lowers it.

The general approach is to first compute the scagnostics measures for all items in the sample, and then to recompute them for each subset obtained by leaving out one item in turn.
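The leave-one-out idea can be sketched without the scagnostics package. The snippet below is only an illustration: `leave_one_out_impact` and `abs_correlation` are hypothetical names, and the absolute Pearson correlation is a stand-in for a real scagnostics measure.

```python
import numpy as np

def leave_one_out_impact(x, y, score):
    """Return |score(all items) - score(all items except i)| for every item i."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    baseline = score(x, y)
    impact = np.empty(len(x))
    for i in range(len(x)):
        # np.delete returns a copy without the element at position i
        impact[i] = abs(baseline - score(np.delete(x, i), np.delete(y, i)))
    return impact

def abs_correlation(x, y):
    """Stand-in measure (assumption): absolute Pearson correlation."""
    return abs(np.corrcoef(x, y)[0, 1])

# Five points on a line plus one far-off point that breaks the trend
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 2.0])
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 10.0])

impact = leave_one_out_impact(x, y, abs_correlation)
most_influential = int(np.argmax(impact))  # index of the item with the largest impact
```

Removing the off-trend point changes the stand-in score the most, so it receives the highest impact. Each item requires one full recomputation, so the cost grows with the sample size.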

## Setup of the pyscagnostics package

In [None]:
from pyscagnostics import scagnostics
import numpy as np
import pandas as pd
from random import random, randint

## Test 1: Scagnostics on random data

Generate a random dataset and compute its scagnostics:

In [None]:
size = 1000

x = np.array([random() for _ in range(size)])
y = np.array([random() for _ in range(size)])

measures, _ = scagnostics(x, y)
measures

Next, exclude one "random" item from the dataset. Compute the scagnostics of that subset.

In [None]:
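random_index = randint(0, size - 1)  # randint includes both endpoints, so cap at size - 1
random_index

# Boolean mask that keeps every item except the one at random_index
mask = np.ones(size, dtype=bool)
mask[random_index] = False

x_ = x[mask]
y_ = y[mask]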

measures_, _ = scagnostics(x_, y_)
measures_
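To see the effect of the removed item, the two results can be compared measure by measure. The sketch below assumes that the measures behave like dictionaries mapping a measure name to a float. `measure_deltas` is a hypothetical helper, and the numbers are made up for illustration rather than real scagnostics output.

```python
def measure_deltas(before, after):
    """Absolute change per measure between two dictionaries of scores."""
    return {name: abs(after[name] - before[name]) for name in before}

# Hypothetical values for illustration only, not real scagnostics output
before = {"Outlying": 0.10, "Skewed": 0.50, "Clumpy": 0.02}
after = {"Outlying": 0.05, "Skewed": 0.50, "Clumpy": 0.03}

deltas = measure_deltas(before, after)
largest_change = max(deltas, key=deltas.get)  # name of the most affected measure
```

In the notebook, `measure_deltas(measures, measures_)` would give the per-measure impact of the excluded item, provided both results share the same keys.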

## Test 2: Scagnostics on three-dimensional data

In [None]:
# Simulate data for example
x = np.random.uniform(0, 1, 10)
y = np.random.uniform(0, 1, 10)
z = np.random.uniform(0, 1, 10)
df = pd.DataFrame({
    'x': x,
    'y': y,
    'z': z
})

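def compute_mean(generator):
    """Average each scagnostics measure over all column pairs yielded by the generator."""
    all_results = []
    # The first two values of each tuple are unused here; the underscore names avoid shadowing the x and y arrays
    for _x, _y, result in generator:
        measures, _ = result
        all_results.append(measures)

    return pd.DataFrame(all_results).mean()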

results = scagnostics(df)
all_mean = compute_mean(results)
    
pd.DataFrame(all_mean).mean()

df[df.index != 1]

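# For each item, run scagnostics on the data frame without that item
gens = df.apply(lambda item: scagnostics(df[df.index != item.name]), axis=1)
per_item = gens.apply(compute_mean)

# Absolute mean change across all measures per item, min-max normalized to [0, 1]
total = (per_item - all_mean).mean(axis=1).abs()
total_min = total.min()  # avoid shadowing the built-ins min and max
total_max = total.max()
(total - total_min) / (total_max - total_min)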
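The min-max normalization above divides by zero when every item has the same impact. A minimal sketch of a guarded variant follows; `min_max_normalize` is a hypothetical helper, and returning zeros for constant input is one possible convention, not necessarily what the component does.

```python
import numpy as np

def min_max_normalize(values):
    """Scale values linearly to [0, 1]; return zeros when all values are equal."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # No spread in the data: every item is equally (un)interesting
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

normalized = min_max_normalize([0.2, 0.4, 1.0])
constant = min_max_normalize([0.3, 0.3, 0.3])
```

The smallest value maps to 0 and the largest to 1, while the constant input yields all zeros instead of `NaN`.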

## Test 3: Using the Scagnostics Module

Next, we import the scagnostics_component that implements the doi_component interface.

In [None]:
import sys
sys.path.append('../')  # make the parent directory importable from this notebook
from scagnostics_component import ScagnosticsComponent

We then generate several synthetic datasets (blobs, moons, and uniform noise) and measure the prediction error of the scagnostics component on each of them.

In [None]:
from sklearn.datasets import make_moons, make_blobs

# Example settings
n_samples = 1000 # gets too slow at around 2,000 samples
outliers_fraction = 0.15
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers

# Define datasets
blobs_params = dict(random_state=0, n_samples=n_inliers, n_features=2)
datasets = [
    make_blobs(centers=[[0, 0], [0, 0]], cluster_std=0.5,
               **blobs_params)[0],
    make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[0.5, 0.5],
               **blobs_params)[0],
    make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[1.5, .3],
               **blobs_params)[0],
    4. * (make_moons(n_samples=n_samples, noise=.05, random_state=0)[0] -
          np.array([0.5, 0.25])),
    14. * (np.random.RandomState(42).rand(n_samples, 2) - 0.5)]


scagn = ScagnosticsComponent()

for d in datasets:
    X_train = pd.DataFrame(d[:n_samples//2, :])
    X_train["id"] = X_train.index
    scagn.train(X_train)
    print("done training")
    
    X_test = pd.DataFrame(d[n_samples//2:, :])
    X_test["id"] = X_test.index
    err = scagn.get_prediction_error(X_test)
    print("mean error in prediction:", err.mean())