# JENGA 

A Framework to Study the Impact of Data Errors on the Predictions of Machine Learning Models

- Sebastian Schelter (University of Amsterdam)
- Tammo Rukat (Amazon)
- Felix Biessmann (Einstein Center Berlin, Beuth University, Amazon)

[https://github.com/schelterlabs/jenga](https://github.com/schelterlabs/jenga)

# Why?

* Software systems are tested (unit tests, integration tests, user tests, ...)
* **ML systems are difficult to test**
    * ML models depend on data
    * Real test data is limited, and models are overparameterized
        * Google's underspecification paper ([D'Amour et al., 2020](https://arxiv.org/abs/2011.03395))
        * Stochastic Parrots paper ([Bender et al., 2021](https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf))

## ML Testing is becoming a thing

![ML testing publications](figs/ml-testing-publications.png)
[Zhang et al., Machine Learning Testing: Survey, Landscapes and Horizons, 2020](https://arxiv.org/pdf/1906.10742.pdf)

# How to test ML Systems?

- Many data sets are available through a convenient API ([OpenML, Vanschoren et al. 2014](https://dl.acm.org/doi/10.1145/2641190.2641198))
- Data corruptions ([Schelter, Rukat, Biessmann, SIGMOD, 2020](https://dl.acm.org/doi/10.1145/3318464.3380604))

Jenga leverages both to ensure:
- automation
- scalability
- reproducibility

# Jenga's core API components:

- Tasks:
    - binary/multiclass classification
    - regression
- Corruptions:
    - Text
    - Images
    - Tabular data
- Evaluators:
    - Apply corruptions to the data and test an ML model on a task
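
The interplay of the three components can be illustrated with a minimal, dependency-free sketch. All names here (`ThresholdTask`, `GaussianNoise`, `evaluate`) are hypothetical stand-ins invented for illustration; they are not Jenga's classes and make no claim about how Jenga implements them.

```python
import random
from statistics import mean

# Hypothetical stand-ins for illustration only; these are not Jenga's classes.


class ThresholdTask:
    """A toy binary task: the label is 1 when the single feature exceeds 0.5."""

    def __init__(self, n=1000, seed=0):
        rng = random.Random(seed)
        self.test_data = [rng.random() for _ in range(n)]
        self.test_labels = [int(x > 0.5) for x in self.test_data]

    def score(self, predictions):
        # Accuracy of the predictions against the held-out labels.
        return mean(int(p == y) for p, y in zip(predictions, self.test_labels))


class GaussianNoise:
    """A toy corruption: add zero-mean Gaussian noise to a fraction of the values."""

    def __init__(self, fraction, sigma, seed=0):
        self.fraction = fraction
        self.sigma = sigma
        self.rng = random.Random(seed)

    def transform(self, data):
        return [
            x + self.rng.gauss(0.0, self.sigma) if self.rng.random() < self.fraction else x
            for x in data
        ]


def evaluate(task, model, corruption):
    """Score the model on the clean and on the corrupted test data."""
    clean = task.score([model(x) for x in task.test_data])
    corrupted = task.score([model(x) for x in corruption.transform(task.test_data)])
    return clean, corrupted


task = ThresholdTask()
model = lambda x: int(x > 0.5)  # predicts every clean example correctly
clean_score, corrupted_score = evaluate(task, model, GaussianNoise(fraction=0.5, sigma=0.3))
print(f"clean: {clean_score:.3f}, corrupted: {corrupted_score:.3f}")
```

The point of the separation is that the task owns the data and the metric, the corruption only transforms data, and the evaluator combines the two without knowing the details of either.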


# Jenga enables easy ML testing

Example: binary classification on an OpenML data set


In [1]:
from jenga.tasks.openml import OpenMLBinaryClassificationTask
from jenga.corruptions.generic import MissingValues
from jenga.evaluation.corruption_impact import CorruptionImpactEvaluator

import numpy as np

First, let's instantiate an OpenML binary classification task (id 1471).

In [8]:
binary_task = OpenMLBinaryClassificationTask(1471)


Each task has a baseline model (here: simple sklearn pipeline)

In [8]:
binary_task_model = binary_task.fit_baseline_model()

print(f"Baseline ROC/AUC score: {binary_task.get_baseline_performance()}")

Baseline ROC/AUC score: 0.66617072988402
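
For readers unfamiliar with the metric, the sketch below shows one way to compute ROC AUC from labels and scores in plain Python. It reads the score as the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. The function `roc_auc` and the toy inputs are illustrative assumptions, not the code Jenga uses to compute the baseline score.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    # Count pairwise wins of positives over negatives; a tie counts as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy inputs for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 3 of the 4 pairs are ranked correctly -> 0.75
```

A score of 0.5 corresponds to random ranking, which helps put a baseline of roughly 0.67 into perspective.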


# Data Corruptions

It's simple to extend Jenga's corruption API. 

These are already implemented:

- For text: leetspeak
- For images: standard augmentations
- For structured data (tables):
    - missing data
        - missing completely at random
        - missing at random
        - missing not at random
    - swapping columns
    - numerical data:
        - additive Gaussian noise
        - scaling
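
The three missingness mechanisms differ only in how the affected rows are chosen. The sketch below shows one common way to select them, using plain Python and a list of dictionaries as a stand-in for a table. The helper names (`mcar_rows`, `mar_rows`, `mnar_rows`, `with_missing`) and the column names are hypothetical, and the selection strategy is an assumption for illustration rather than a description of Jenga's implementation.

```python
import math
import random


def mcar_rows(rows, fraction, rng):
    """MCAR: every row is equally likely to lose its value."""
    k = int(fraction * len(rows))
    return set(rng.sample(range(len(rows)), k))


def mar_rows(rows, fraction, rng, depends_on):
    """MAR: missingness depends on another, fully observed column.

    Rows are sorted by that column and a random contiguous range is selected.
    """
    k = int(fraction * len(rows))
    order = sorted(range(len(rows)), key=lambda i: rows[i][depends_on])
    start = rng.randint(0, len(rows) - k)
    return set(order[start:start + k])


def mnar_rows(rows, fraction, rng, column):
    """MNAR: missingness depends on the very values that go missing."""
    return mar_rows(rows, fraction, rng, depends_on=column)


def with_missing(rows, column, selected):
    """Return a copy of the table with `column` set to NaN in the selected rows."""
    return [
        {**row, column: math.nan} if i in selected else row
        for i, row in enumerate(rows)
    ]


# Synthetic table for illustration only.
rng = random.Random(42)
rows = [{"age": rng.randint(18, 80), "income": rng.gauss(50_000, 15_000)} for _ in range(100)]

mcar = with_missing(rows, "income", mcar_rows(rows, 0.5, rng))
mar = with_missing(rows, "income", mar_rows(rows, 0.5, rng, depends_on="age"))
mnar = with_missing(rows, "income", mnar_rows(rows, 0.5, rng, column="income"))

for name, corrupted in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, sum(math.isnan(row["income"]) for row in corrupted))
```

All three variants remove the same number of values, but under MAR and MNAR the remaining values are no longer a representative sample, which is why they tend to be harder for imputation and for downstream models.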

## Example: Missing values

Let's replace 50% of all values in column V3 with NaNs

In [11]:
binary_task_corruption = MissingValues(column='V3', 
                                       fraction=0.5, 
                                       na_value=np.nan)

Now let's evaluate our baseline model on the above task with the specified corruption.

We repeat the experiment 5 times to get robust statistics. 

In [12]:
num_repetitions = 5

binary_task_evaluator = CorruptionImpactEvaluator(binary_task)

binary_task_results = binary_task_evaluator.evaluate(binary_task_model, 
                                                     num_repetitions, 
                                                     binary_task_corruption)

0/5 (0.057767000000000124)
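
Each repetition yields one score on corrupted data, so the results can be summarized by their mean and spread and compared against the baseline. The sketch below is a generic illustration. The function `summarize` is hypothetical, the numbers are placeholders rather than Jenga output, and the structure of the object returned by `evaluate` is deliberately not assumed here.

```python
from statistics import mean, stdev


def summarize(baseline_score, corrupted_scores):
    """Summarize repeated scores on corrupted data relative to the baseline."""
    avg = mean(corrupted_scores)
    return {
        "mean": avg,
        # The sample standard deviation needs at least two repetitions.
        "std": stdev(corrupted_scores) if len(corrupted_scores) > 1 else 0.0,
        "relative_drop": (baseline_score - avg) / baseline_score,
    }


# Placeholder numbers for illustration only; these are not Jenga results.
summary = summarize(0.8, [0.5, 0.6, 0.7])
print(summary)
```

Reporting the spread alongside the mean shows whether an observed drop is larger than the variation caused by the random choice of corrupted rows.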
