# Data-OOB on the MagicTelescope dataset
- This Jupyter notebook demonstrates Data-OOB alongside existing data valuation methods: leave-one-out, KNNShap, DataShap, and BetaShap.
- We use the "MagicTelescope" dataset. Instructions for downloading other datasets are available in `dataoob/preprocess`.

In [1]:
import pickle
import numpy as np
import sys
sys.path.append('../dataoob')
import datasets
from data_valuation import DataValuation

import warnings
warnings.filterwarnings('ignore')

In [2]:
dataset = 'magictelescope'
problem = 'clf'
dargs = {
    'n_data_to_be_valued': 300,
    'n_val': 30,
    'n_test': 3000,
    'n_trees': 800,
    'openml_clf_path': '../dataoob/preprocess/tmp/dataset_clf_openml',
    'is_noisy': 0.1,
    'model_family': 'Tree',
    'run_id': 0,
}

# Load dataset and prepare DataValuation engine

In [3]:
# Load the dataset
(X, y), (X_val, y_val), (X_test, y_test), noisy_index = datasets.load_data(problem, dataset, **dargs)

# Instantiate the data valuation engine
data_valuation_engine = DataValuation(X=X, y=y,
                                      X_val=X_val, y_val=y_val,
                                      problem=problem, dargs=dargs)

------------------------------
{'n_data_to_be_valued': 300, 'n_val': 30, 'n_test': 3000, 'n_trees': 800, 'openml_clf_path': '../dataoob/preprocess/tmp/dataset_clf_openml', 'is_noisy': 0.1, 'model_family': 'Tree', 'run_id': 0}
--------------------------------------------------
MagicTelescope
--------------------------------------------------
Train X: (300, 10)
Val X: (30, 10)
Test X: (3000, 10)
------------------------------


# Compute data values
- `compute_marginal_contribution_based_methods` computes the marginal-contribution-based methods: leave-one-out, DataShap, KNNShap, and BetaShap. If the marginal contribution computation takes too long, you can skip it by passing `betashap_run=False`.
- `compute_oob_and_ame` computes Data-OOB and AME.
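
To give a feel for what an out-of-bag data value measures, here is a minimal NumPy-only sketch. It is not the implementation behind `compute_oob_and_ame`. It assumes the usual out-of-bag idea: train many weak learners on bootstrap samples and score each training point by how often the learners that did *not* see it predict its label correctly. The function name `data_oob_values`, the nearest-centroid weak learner (a stand-in for the tree learners the configuration above suggests), and the toy two-blob dataset are all hypothetical choices made for illustration.

```python
import numpy as np

def data_oob_values(X, y, n_learners=200, seed=0):
    """Toy out-of-bag data values: mean OOB correctness per training point.

    Hypothetical helper for illustration; not part of the dataoob package.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    correct_sum = np.zeros(n)  # correct OOB predictions per point
    oob_count = np.zeros(n)    # number of learners for which the point was OOB

    for _ in range(n_learners):
        boot = rng.integers(0, n, size=n)       # bootstrap sample, with replacement
        oob = np.setdiff1d(np.arange(n), boot)  # points left out of this sample
        if oob.size == 0:
            continue
        # Weak learner (assumption): nearest class centroid fitted on the bootstrap sample
        Xb, yb = X[boot], y[boot]
        classes = np.unique(yb)
        centroids = np.stack([Xb[yb == c].mean(axis=0) for c in classes])
        dists = np.linalg.norm(X[oob][:, None, :] - centroids[None, :, :], axis=2)
        pred = classes[dists.argmin(axis=1)]
        correct_sum[oob] += (pred == y[oob])
        oob_count[oob] += 1

    # Points that were never out-of-bag get NaN instead of a value
    return np.divide(correct_sum, oob_count,
                     out=np.full(n, np.nan), where=oob_count > 0)

# Toy data: two Gaussian blobs with 10% of the labels flipped
rng = np.random.default_rng(0)
n = 200
y_clean = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2)) + 3.0 * y_clean[:, None]
noisy_idx = rng.choice(n, size=20, replace=False)
y = y_clean.copy()
y[noisy_idx] = 1 - y[noisy_idx]

values = data_oob_values(X, y)
clean_mask = np.ones(n, dtype=bool)
clean_mask[noisy_idx] = False
print("mean value, flipped labels:", np.nanmean(values[noisy_idx]))
print("mean value, clean labels:  ", np.nanmean(values[clean_mask]))
```

In this toy setting the points with flipped labels should receive clearly lower values than the clean points, which is the property the evaluation below relies on.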

In [None]:
%%time
# compute data values
data_valuation_engine.compute_marginal_contribution_based_methods(betashap_run=True)
data_valuation_engine.compute_oob_and_ame()

Start: KNN_Shapley computation
Done: KNN_Shapley computation
Start: LOO computation
Done: LOO computation
Start: Beta_Shapley computation
Start: marginal contribution computation


# Evaluate the quality of data values

In [None]:
data_valuation_engine.evaluate_data_values(noisy_index, X_test, y_test, removal_run=True)

In [None]:
# Larger values are better
data_valuation_engine.noisy_detect_dict
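
For readers who want to see how a "larger is better" noisy-label detection score can be computed from data values, the sketch below flags the lowest-valued points as mislabeled and reports an F1 score against the known noisy indices. This is an assumption made for illustration and not necessarily how `evaluate_data_values` fills `noisy_detect_dict`; the function `noisy_detection_f1` and its `n_flag` parameter are hypothetical.

```python
import numpy as np

def noisy_detection_f1(values, noisy_index, n_flag):
    """F1 score when the n_flag lowest-valued points are flagged as mislabeled.

    Hypothetical helper for illustration; not part of the dataoob package.
    """
    flagged = np.argsort(values)[:n_flag]                # lowest data values first
    true_pos = np.intersect1d(flagged, noisy_index).size # flagged points that are truly noisy
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_flag
    recall = true_pos / len(noisy_index)
    return 2 * precision * recall / (precision + recall)

# Tiny worked example: points 1 and 3 have the lowest values
toy_values = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.95])
f1_perfect = noisy_detection_f1(toy_values, np.array([1, 3]), n_flag=2)
f1_partial = noisy_detection_f1(toy_values, np.array([1, 4]), n_flag=2)
print(f1_perfect, f1_partial)
```

A valuation method that ranks mislabeled points lowest scores close to 1 under this metric, while one that ranks points at random scores near the noise rate.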