# JENGA 

A Framework to Study the Impact of Data Errors on the Predictions of Machine Learning Models

- Sebastian Schelter (University of Amsterdam)
- Tammo Rukat (Amazon)
- Felix Biessmann (Einstein Center Berlin, Beuth University, Amazon)

[https://github.com/schelterlabs/jenga](https://github.com/schelterlabs/jenga)

# Why?

* Software systems are tested (Unit tests, integration tests, user tests, ...)
* **ML applications are difficult to test**
    * ML models depend on data
    * Real test data is limited, models are overparameterized
        * Google's underspecification paper ([D'Amour et al., 2020](https://arxiv.org/abs/2011.03395))
        * Stochastic Parrots paper ([Bender et al., 2021](https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf))

## ML Testing is becoming a thing

![ML testing publications](figs/ml-testing-publications.png)
[Zhang et al., Machine Learning Testing: Survey, Landscapes and Horizons, 2020](https://arxiv.org/pdf/1906.10742.pdf)

# How to test ML Systems?

- Many data sets available through a convenient API ([OpenML, Vanschoren et al. 2014](https://dl.acm.org/doi/10.1145/2641190.2641198))
- Data corruptions ([Schelter, Rukat, Biessmann, SIGMOD, 2020](https://dl.acm.org/doi/10.1145/3318464.3380604))

Jenga leverages both to ensure:
- automation
- reproducibility

# Jenga - Testing with data corruptions

![Jenga flowchart](figs/workflow.png)



# Jenga's core API components:

- Tasks:
    - binary/multiclass classification
    - regression
- Corruptions:
    - Text
    - Images
    - Tabular data
- Evaluators:
    - Apply corruptions and test an ML model on a task


# Jenga enables easy ML testing

Example: Binary Classification on OpenML data set


In [1]:
from jenga.tasks.openml import OpenMLBinaryClassificationTask
import numpy as np

First, let's instantiate a binary OpenML task (id 1471)

In [2]:
task = OpenMLBinaryClassificationTask(1471)

Each task has a baseline model (here: simple sklearn pipeline)

In [3]:
task_model = task.fit_baseline_model()

print(f"Baseline ROC/AUC score: {task.get_baseline_performance()}")

Baseline ROC/AUC score: 0.609494255820072
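
For context on the score above: ROC AUC can be read as the probability that a randomly chosen positive example is ranked above a randomly chosen negative one, so 0.5 corresponds to random ranking and 1.0 to a perfect one. The following is a minimal pure-Python sketch of that definition. The function name `roc_auc` and the toy inputs are illustrative assumptions; this is not the code Jenga uses to compute the baseline score.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one example of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise loop is quadratic in the number of examples; it is written for readability, not speed.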


# Data Corruptions

It's simple to extend Jenga's corruption API. 

These are already implemented:

- For text: leetspeak
- For images: standard augmentations
- For structured data (tables):
    - missing data
        - missing completely at random
        - missing at random
        - missing not at random
    - swapping columns
    - numerical data:
        - additive Gaussian noise
        - scaling
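
The three missingness mechanisms in the list differ only in how the affected rows are chosen: completely at random (MCAR) ignores the data, at random (MAR) depends on another observed column, and not at random (MNAR) depends on the values of the corrupted column itself. The sketch below illustrates this on a plain list of dicts. The helper `sample_rows`, its signature, and the choice of a contiguous block of the sorted values are assumptions made for illustration; they are not taken from Jenga's source.

```python
import random


def sample_rows(rows, column, fraction, sampling, rng):
    """Return the indices of the rows to corrupt in `column`.

    rows:     list of dicts, one dict per table row
    sampling: 'MCAR', 'MAR' or 'MNAR'
    """
    n = int(len(rows) * fraction)
    indices = list(range(len(rows)))

    if sampling == "MCAR":
        # Completely at random: every row is equally likely.
        return sorted(rng.sample(indices, n))

    if sampling == "MNAR":
        # Not at random: selection depends on the column's own values.
        key_column = column
    elif sampling == "MAR":
        # At random: selection depends on another, observed column.
        others = sorted(c for c in rows[0] if c != column)
        key_column = rng.choice(others)
    else:
        raise ValueError(f"unknown sampling mechanism: {sampling}")

    # Assumption: take a contiguous block of n rows in the ordering by key_column.
    ordered = sorted(indices, key=lambda i: rows[i][key_column])
    start = rng.randint(0, len(rows) - n)
    return sorted(ordered[start:start + n])


rng = random.Random(0)
table = [{"V3": float(i), "V4": float(100 - i)} for i in range(10)]
mcar = sample_rows(table, "V3", 0.4, "MCAR", rng)
mnar = sample_rows(table, "V3", 0.4, "MNAR", rng)
mar = sample_rows(table, "V3", 0.4, "MAR", rng)
```

In this toy table `V3` equals the row index, so the MNAR sample is always a run of four neighbouring rows, whereas the MCAR sample is scattered.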

## Defining Custom Jenga Corruptions: Missing values

It's easy to build your own corruptions: subclass `TabularCorruption` and implement `transform`.

```python
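import numpy as np

# Import path assumed from the layout of the Jenga repository.
from jenga.basis import TabularCorruption


class MissingValues(TabularCorruption):

    def __init__(self,
                 column,
                 fraction,
                 na_value=np.nan,
                 missingness='MCAR'):
        '''
        Corruption for structured data: marks sampled cells as missing.
        Input:
        column:      name of the column to perturb, string
        fraction:    fraction of rows to corrupt, float in (0,1)
        na_value:    value written into the corrupted cells (default: NaN)
        missingness: sampling mechanism, string in ['MCAR', 'MAR', 'MNAR']
        '''
        self.column = column
        self.fraction = fraction
        self.sampling = missingness
        self.na_value = na_value

    def transform(self, data):
        # Work on a deep copy so the caller's DataFrame stays unchanged.
        corrupted_data = data.copy(deep=True)
        # sample_rows is not defined here; it comes from the base class.
        rows = self.sample_rows(corrupted_data)
        corrupted_data.loc[rows, [self.column]] = self.na_value
        return corrupted_data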
```
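
The same copy, sample, modify pattern carries over to the other corruptions listed earlier. As a rough illustration, here is a dependency-free sketch of an additive Gaussian noise corruption. `GaussianNoiseSketch` is a hypothetical stand-in that works on a list of dicts and does not subclass anything from Jenga; the `noise_scale` parameter (a multiple of the column's standard deviation) and the completely-at-random row sampling are assumptions chosen to keep the example short.

```python
import copy
import random
import statistics


class GaussianNoiseSketch:
    """Hypothetical stand-in, not Jenga's class: adds noise to one column."""

    def __init__(self, column, fraction, noise_scale=1.0, seed=None):
        self.column = column
        self.fraction = fraction
        self.noise_scale = noise_scale  # multiple of the column's std (assumption)
        self.rng = random.Random(seed)

    def sample_rows(self, data):
        # Completely-at-random sampling keeps the sketch short.
        n = int(len(data) * self.fraction)
        return self.rng.sample(range(len(data)), n)

    def transform(self, data):
        # Work on a copy so the clean data stays untouched.
        corrupted = copy.deepcopy(data)
        std = statistics.pstdev([row[self.column] for row in data])
        for i in self.sample_rows(corrupted):
            corrupted[i][self.column] += self.rng.gauss(0.0, self.noise_scale * std)
        return corrupted


clean = [{"V4": 4100.0 + i, "V5": 4300.0} for i in range(20)]
noisy = GaussianNoiseSketch("V4", fraction=0.5, seed=42).transform(clean)
changed = [i for i in range(len(clean)) if noisy[i]["V4"] != clean[i]["V4"]]
```

Only the sampled rows of `V4` differ between `clean` and `noisy`; every other cell, and the original `clean` list, stay as they were.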

## Example: Missing values and Scaling

Let's define some corruptions:
- Replace 40% of the values in column ``V3`` with NaNs, sampling the affected rows not at random (``MNAR``)
- Scale 30% of the values in column ``V4``, again sampling the affected rows not at random (``NAR``)

In [4]:
from jenga.corruptions.generic import MissingValues

missingness = MissingValues(column='V3', 
                            fraction=0.4, 
                            missingness='MNAR',
                            na_value=np.nan)

corrupted_df = missingness.transform(task.test_data)
corrupted_df.sample(n=5)

| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5376 | 4346.67 | 3998.97 | 4284.1 | 4123.08 | 4355.38 | 4628.72 | 4087.18 | 4614.36 | 4204.62 | 4255.9 | 4228.72 | 4293.33 | 4641.54 | 4402.05 |
| 6658 | 4415.9 | 4058.46 | 4292.82 | 4135.9 | 4330.77 | 4609.23 | 4059.49 | 4623.08 | 4201.54 | 4249.23 | 4249.74 | 4315.9 | 4687.18 | 4483.08 |
| 7906 | 4298.97 | 4020.51 | NaN | 4120.0 | 4333.85 | 4611.28 | 4044.62 | 4616.92 | 4196.41 | 4235.38 | 4204.1 | 4270.77 | 4599.49 | 4350.26 |
| 6774 | 4314.36 | 3978.97 | 4276.41 | 4107.18 | 4329.23 | 4617.44 | 4083.08 | 4633.33 | 4210.77 | 4238.97 | 4207.69 | 4291.28 | 4592.82 | 4380.0 |
| 4663 | 4290.77 | 3990.77 | NaN | 4122.56 | 4332.31 | 4616.41 | 4065.13 | 4618.97 | 4200.51 | 4227.69 | 4198.97 | 4281.54 | 4584.1 | 4355.38 |


In [5]:
from jenga.corruptions.numerical import Scaling

scaling = Scaling(column='V4',
                  fraction=0.3,
                  sampling='NAR')

corrupted_df = scaling.transform(corrupted_df)
corrupted_df.sample(n=5)

| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14814 | 4287.18 | 4024.62 | 4247.69 | 411590.0 | 4334.87 | 4610.77 | 4066.15 | 4618.46 | 4203.08 | 4228.21 | 4212.31 | 4278.46 | 4616.41 | 4352.31 |
| 7987 | 4291.28 | 3992.82 | 4251.79 | 4107.18 | 4336.92 | 4621.03 | 4064.1 | 4627.18 | 4206.15 | 4232.82 | 4175.9 | 4265.64 | 4595.38 | 4351.79 |
| 5241 | 4420.0 | 4055.9 | 4295.38 | 4151.28 | 4359.49 | 4627.69 | 4087.69 | 4614.87 | 4216.92 | 4270.77 | 4243.08 | 4314.87 | 4680.51 | 4470.77 |
| 1382 | 4314.87 | 3971.79 | NaN | 4091.28 | 4334.87 | 4603.08 | 4094.36 | 4624.62 | 4215.9 | 4251.79 | 4244.62 | 4294.36 | 4671.28 | 4420.0 |
| 6705 | 4350.26 | 4022.05 | 4276.92 | 4136.92 | 4319.49 | 4600.51 | 4056.92 | 4616.41 | 4196.92 | 4229.74 | 4217.44 | 4284.62 | 4626.67 | 4407.69 |


Now let's evaluate our baseline model on the above task with the specified corruptions.

We repeat the experiment 5 times to get robust statistics. 

In [6]:
from jenga.evaluation.corruption_impact import CorruptionImpactEvaluator

num_repetitions = 5

task_evaluator = CorruptionImpactEvaluator(task)

results = task_evaluator.evaluate(task_model, 
                                  num_repetitions, 
                                  missingness,
                                  scaling)

0/10 (0.0448059999999999)


In [7]:
print(f'ROC on clean data: \t{results[0].baseline_score:0.2}')
print(f'ROC on corrupted data: \t{np.mean(results[0].corrupted_scores):0.2}')

ROC on clean data: 	0.61
ROC on corrupted data: 	0.54
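
Conceptually, the evaluation above scores the model once on clean test data and then several times on freshly corrupted copies, because each corruption samples its rows randomly. The following is a simplified, dependency-free sketch of that loop. The function `evaluate_impact`, the toy threshold model, and the toy `scale_some` corruption are all hypothetical; only the result field names mirror the `baseline_score` and `corrupted_scores` attributes used in the cell above. It is not the implementation of `CorruptionImpactEvaluator`.

```python
import random
import statistics


def evaluate_impact(predict, rows, labels, corruptions, num_repetitions, score):
    """Score `predict` on clean rows and on repeated corruptions of them.

    Returns one dict per corruption with the clean score and the list of
    scores obtained on the corrupted copies.
    """
    baseline = score(labels, [predict(r) for r in rows])
    results = []
    for corrupt in corruptions:
        corrupted_scores = []
        for _ in range(num_repetitions):
            corrupted_rows = corrupt(rows)  # randomized, so repeat it
            predictions = [predict(r) for r in corrupted_rows]
            corrupted_scores.append(score(labels, predictions))
        results.append({"baseline_score": baseline,
                        "corrupted_scores": corrupted_scores})
    return results


def accuracy(labels, predictions):
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)


rng = random.Random(0)


def scale_some(rows, fraction=0.3, factor=100.0):
    """Toy corruption: multiply `x` by `factor` in a random subset of rows."""
    hit = set(rng.sample(range(len(rows)), int(len(rows) * fraction)))
    return [{"x": r["x"] * factor} if i in hit else dict(r)
            for i, r in enumerate(rows)]


def predict(row):
    """Toy model: a fixed decision threshold on `x`."""
    return int(row["x"] > 10)


rows = [{"x": float(i)} for i in range(1, 21)]
labels = [int(r["x"] > 10) for r in rows]

results = evaluate_impact(predict, rows, labels, [scale_some], 5, accuracy)
mean_corrupted = statistics.mean(results[0]["corrupted_scores"])
```

Averaging over repetitions, as `mean_corrupted` does here, smooths out the luck of which rows happened to be hit in any single run.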


# Summary

- ML model testing is important
- JENGA provides a convenient and extensible API for testing ML models:
    - predefined tasks, errors and evaluators
    - declarative error specification
    - realistic error sampling
- Future work: more error generators, and more realistic ones
- Interested in how to use JENGA? Check out our 2020 SIGMOD paper:
    [Schelter, Rukat, Biessmann, "Learning to Validate the Predictions of Black Box Classifiers on Unseen Data"](https://dl.acm.org/doi/abs/10.1145/3318464.3380604)