# BBC dataset 
**Caution:** Many data valuation methods require training a large number of models to obtain reliable estimates. With a model of this size, **this is extremely slow**. We recommend working with precomputed embeddings instead.
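The reasoning behind that recommendation can be sketched in a few lines. This is a toy illustration, not `opendataval` code: `expensive_encoder` is a hypothetical stand-in for a frozen language model, and the loop only mimics the repeated model fits a valuation method performs. The point is that the encoder runs once per document, no matter how many models are trained afterwards.

```python
import hashlib
from typing import List

encoder_calls = 0  # counts how often the "expensive" encoder runs


def expensive_encoder(text: str) -> List[float]:
    """Hypothetical stand-in for a frozen language model: text -> 4-dim vector."""
    global encoder_calls
    encoder_calls += 1
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:4]]


texts = ["stocks rally", "team wins final", "new phone released"]
num_models = 10

# Embed every document once, up front.
embeddings = [expensive_encoder(t) for t in texts]

# Each of the many models a valuation method trains reuses the cached vectors;
# a cheap classifier would be fitted on `features` here.
for _ in range(num_models):
    features = embeddings

print(encoder_calls)  # 3: one pass per document, independent of num_models
```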

In [1]:
# Imports
import torch

# Opendataval
from opendataval.dataloader import mix_labels
from opendataval.dataval import (
    AME,
    DataBanzhaf,
    DataOob,
    InfluenceSubsample,
    RandomEvaluator,
)
from opendataval.experiment import ExperimentMediator




## [Step 1] Set up an environment
`ExperimentMediator` is the central object for setting up an `opendataval` experiment. It lets users configure the dataset, the type of synthetic noise, the prediction model, and the training hyperparameters. Once configured, the same `ExperimentMediator` can run several data valuation algorithms through a single interface.

The following code cell demonstrates how to set up `ExperimentMediator` with a pre-registered dataset and a prediction model.
- Dataset: bbc
- Model: `BertClassifier`, built on `DistilBertModel` from the `transformers` library
- Metric: Classification accuracy

In [2]:
dataset_name = "bbc" 
train_count, valid_count, test_count = 1000, 100, 500
noise_rate = 0.1
noise_kwargs = {'noise_rate': noise_rate}
model_name = "BertClassifier"
metric_name = "accuracy"
train_kwargs = {"epochs": 2, "batch_size": 50}
# Prefer a CUDA GPU, then Apple's MPS backend, and fall back to the CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

exper_med = ExperimentMediator.model_factory_setup(
    dataset_name=dataset_name,
    cache_dir="../data_files/",  
    force_download=False,
    train_count=train_count,
    valid_count=valid_count,
    test_count=test_count,
    add_noise=mix_labels,
    noise_kwargs=noise_kwargs,
    train_kwargs=train_kwargs,
    device=device,
    model_name=model_name,
    metric_name=metric_name
)

Downloading:: 232it [00:00, 573.31it/s]


Base line model metric_name='accuracy': perf=0.8399999737739563
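The setup above passes `mix_labels` with `noise_rate=0.1`, so roughly one training label in ten is corrupted before the baseline model is trained. The sketch below shows the general idea of label noise on plain Python lists. It is an illustration under assumptions, not the `opendataval` implementation: `flip_labels` is a hypothetical helper, and the rule of always switching to a different class is a choice made for this example.

```python
import random
from typing import List, Sequence, Tuple


def flip_labels(
    labels: Sequence[int], num_classes: int, noise_rate: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return a noisy copy of `labels` and the sorted indices that were changed."""
    rng = random.Random(seed)
    num_noisy = round(noise_rate * len(labels))
    noisy_indices = sorted(rng.sample(range(len(labels)), num_noisy))
    noisy = list(labels)
    for i in noisy_indices:
        # Pick uniformly among the classes other than the current one.
        choices = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(choices)
    return noisy, noisy_indices


# 1000 training points and 5 classes (the BBC dataset has five topics).
clean = [i % 5 for i in range(1000)]
noisy, flipped = flip_labels(clean, num_classes=5, noise_rate=0.1)
print(len(flipped))  # 100
```

A good data valuation method should assign low values to the points listed in `flipped`, which is what makes synthetic noise useful for benchmarking.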


## [Step 2] Compute data values
`opendataval` provides a range of state-of-the-art data valuation algorithms. `ExperimentMediator.compute_data_values()` takes a list of evaluators and computes data values with each of them.

In [3]:
# Evaluators that are commented out are either too slow or not suited to this setup.
# Import them from opendataval.dataval before enabling them.
data_evaluators = [
    RandomEvaluator(),
    # LeaveOneOut(),  # leave-one-out; slow
    InfluenceSubsample(num_models=10),  # influence function
    # DVRL(rl_epochs=10),  # data valuation using reinforcement learning; inappropriate here
    # KNNShapley(k_neighbors=valid_count),  # KNN-Shapley; inappropriate here
    # DataShapley(gr_threshold=1.05, mc_epochs=300, cache_name="cached"),  # Data-Shapley; slow
    # BetaShapley(gr_threshold=1.05, mc_epochs=300, cache_name="cached"),  # Beta-Shapley; slow
    DataBanzhaf(num_models=10),  # Data-Banzhaf
    AME(num_models=10),  # Average Marginal Effects
    DataOob(num_models=10),  # Data-OOB
    # LavaEvaluator(),
    # RobustVolumeShapley(mc_epochs=300),
]

In [4]:
%%time
# compute data values.
## Training multiple DistilBERT models is extremely slow. We recommend using embeddings.
exper_med = exper_med.compute_data_values(data_evaluators=data_evaluators)

Elapsed time RandomEvaluator(): 0:00:00.002190



100%|██████████| 10/10 [00:36<00:00,  3.69s/it]


Elapsed time InfluenceSubsample(num_models=10): 0:00:36.914067


100%|██████████| 10/10 [00:31<00:00,  3.13s/it]


Elapsed time DataBanzhaf(num_models=10): 0:00:31.357946


100%|██████████| 10/10 [00:14<00:00,  1.43s/it]
100%|██████████| 10/10 [00:23<00:00,  2.32s/it]
100%|██████████| 10/10 [00:34<00:00,  3.46s/it]
100%|██████████| 10/10 [00:42<00:00,  4.23s/it]


Elapsed time AME(num_models=10): 0:01:54.492089


100%|██████████| 10/10 [00:58<00:00,  5.88s/it]

Elapsed time DataOob(num_models=10): 0:00:58.788613
CPU times: user 2min 24s, sys: 14.7 s, total: 2min 38s
Wall time: 4min 1s
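To make concrete what an evaluator such as `DataOob(num_models=10)` estimates, here is a minimal sketch of an out-of-bag style valuation on toy one-dimensional data. It is not the `opendataval` implementation: a nearest-centroid rule stands in for `BertClassifier`, and `fit_centroids`, `predict`, and `oob_values` are hypothetical helpers. Each point is scored by how often models that did not train on it still predict its label.

```python
import random
from typing import Dict, List, Tuple

Point = Tuple[float, int]  # (feature, label)


def fit_centroids(sample: List[Point]) -> Dict[int, float]:
    """Nearest-centroid 'model': mean feature per class present in the sample."""
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for x, y in sample:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids: Dict[int, float], x: float) -> int:
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def oob_values(data: List[Point], num_models: int, seed: int = 0) -> List[float]:
    """Share of bootstrap models that did not train on a point
    but still predict its label correctly."""
    rng = random.Random(seed)
    n = len(data)
    correct = [0] * n
    oob_count = [0] * n
    for _ in range(num_models):
        drawn = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        centroids = fit_centroids([data[i] for i in drawn])
        in_bag = set(drawn)
        for i in range(n):
            if i not in in_bag:
                x, y = data[i]
                oob_count[i] += 1
                correct[i] += int(predict(centroids, x) == y)
    # Points that were never out-of-bag get 0.0 by convention in this sketch.
    return [c / k if k else 0.0 for c, k in zip(correct, oob_count)]


data = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 0),
        (10.0, 1), (11.0, 1), (12.0, 1), (13.0, 1),
        (1.5, 1)]  # last point: a class-0 feature carrying the wrong label
values = oob_values(data, num_models=500)
print([round(v, 2) for v in values])
```

In this toy setting the mislabeled last point should receive a value close to 0 while the clean points score close to 1. The sketch also shows why the run time above scales with `num_models`: every bootstrap round trains a fresh model.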





## [Step 3] Store data values

In [5]:
from opendataval.experiment.exper_methods import save_dataval

# Saving the results
output_dir = f"../tmp/{dataset_name}_{noise_rate=}/"
exper_med.set_output_directory(output_dir)
output_dir

'../tmp/bbc_noise_rate=0.1/'

In [6]:
exper_med.evaluate(save_dataval, save_output=True)

| evaluator | indices | data_values |
|---|---|---|
| RandomEvaluator() | 2025 | 0.030026 |
| RandomEvaluator() | 2061 | 0.946807 |
| RandomEvaluator() | 777 | 0.40385 |
| RandomEvaluator() | 940 | 0.784179 |
| RandomEvaluator() | 643 | 0.985042 |
| ... | ... | ... |
| DataOob(num_models=10) | 1285 | 1.0 |
| DataOob(num_models=10) | 695 | 1.0 |
| DataOob(num_models=10) | 1194 | 1.0 |
| DataOob(num_models=10) | 943 | 1.0 |
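A common next step with stored values is to list the lowest-valued points per evaluator, since those are the candidates for mislabeled data. The sketch below does this with the standard library on the five `RandomEvaluator()` rows displayed above. The column name `evaluator` for the unnamed first column and the helper `lowest_valued` are assumptions made for this example, and no claim is made about the file name or layout `opendataval` writes to the output directory.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the displayed output; "evaluator" is a hypothetical
# name for the unnamed first column.
raw = """evaluator,indices,data_values
RandomEvaluator(),2025,0.030026
RandomEvaluator(),2061,0.946807
RandomEvaluator(),777,0.40385
RandomEvaluator(),940,0.784179
RandomEvaluator(),643,0.985042
"""


def lowest_valued(csv_text, k):
    """Map each evaluator to its k lowest-valued data indices."""
    by_evaluator = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        pair = (float(row["data_values"]), int(row["indices"]))
        by_evaluator[row["evaluator"]].append(pair)
    return {
        name: [idx for _, idx in sorted(pairs)[:k]]
        for name, pairs in by_evaluator.items()
    }


suspects = lowest_valued(raw, k=2)
print(suspects)  # {'RandomEvaluator()': [2025, 777]}
```

For `RandomEvaluator()` the ranking is meaningless by construction; it serves as the baseline against which the other evaluators are compared.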
