# ``jenga`` Corruptions

This notebook illustrates some of the corruptions implemented in `jenga`.

Most corruptions operate on tabular data, and each one selects the rows to corrupt with one of several sampling mechanisms.

Inspired by research on missing values, we consider three sampling schemes for corrupting tabular data. We keep the 'missingness' terminology throughout, even for corruptions that do something other than discard data:

* **MCAR** Missing completely at random - corruption is independent of the values in the data
* **MAR** Missing at random - corruption is conditioned on the values in another column
* **MNAR** Missing not at random - corruption is conditioned on the values in the column to which it is applied
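
As a rough illustration of how the three schemes differ, the sketch below selects the rows to corrupt with plain pandas. It is not jenga's implementation: the function name `sample_rows` and the choice of a contiguous block of sorted rows for MAR and MNAR are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def sample_rows(df, column, fraction, sampling, rng):
    """Return the index labels of the rows to corrupt (illustrative only)."""
    n_rows = int(len(df) * fraction)
    if sampling == 'MCAR':
        # Rows are drawn uniformly at random, ignoring all values.
        return df.index[rng.choice(len(df), size=n_rows, replace=False)]
    if sampling == 'MAR':
        # Condition on a different, randomly chosen column.
        others = [c for c in df.columns if c != column]
        depends_on = others[rng.integers(len(others))]
    elif sampling == 'MNAR':
        # Condition on the corrupted column itself.
        depends_on = column
    else:
        raise ValueError(f'unknown sampling scheme: {sampling}')
    # Assumed mechanism: take a contiguous block of rows in the sorted
    # order of the conditioning column.
    start = rng.integers(0, len(df) - n_rows + 1)
    ordered = df[depends_on].sort_values(kind='stable').index
    return ordered[start:start + n_rows]


rng = np.random.default_rng(0)
toy = pd.DataFrame({'A': np.arange(10) / 10, 'B': list('abcabcabca')})
mnar_rows = sample_rows(toy, 'B', 0.5, 'MNAR', rng)
mcar_rows = sample_rows(toy, 'B', 0.5, 'MCAR', rng)
```

With this mechanism the MNAR rows share neighbouring values of `B`, whereas the MCAR rows are spread over the whole frame.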


In [24]:
%load_ext autoreload
%autoreload 2

import random
import pandas as pd
import numpy as np

from jenga.corruptions.generic import MissingValues



In [39]:
def new_df(N=20):
    """Build a toy frame with one numeric column (A) and two categorical columns (B, C)."""
    return pd.DataFrame({
        'A': np.arange(0, N) / N,
        'B': [random.choice(['a', 'b', 'c']) for _ in range(N)],
        'C': [random.choice(['foo', 'bah']) for _ in range(N)],
    })

## Missing Values

Below are some examples of the different row sampling strategies, applied to a corruption that introduces missing values.

In [40]:
from jenga.corruptions.generic import MissingValues

df = new_df()

df['B_MCAR'] = MissingValues(column='B', fraction=.5, sampling='MCAR').transform(df)['B']
df['B_MAR'] = MissingValues(column='B', fraction=.5, sampling='MAR').transform(df)['B']
df['B_MNAR'] = MissingValues(column='B', fraction=.5, sampling='MNAR').transform(df)['B']

df

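| | A | B | C | B_MCAR | B_MAR | B_MNAR |
|---|---|---|---|---|---|---|
| 0 | 0.0 | c | bah | c | NaN | NaN |
| 1 | 0.05 | c | foo | c | NaN | c |
| 2 | 0.1 | c | foo | c | NaN | c |
| 3 | 0.15 | a | bah | NaN | NaN | a |
| 4 | 0.2 | a | foo | a | a | NaN |
| 5 | 0.25 | b | bah | b | b | NaN |
| 6 | 0.3 | b | bah | NaN | NaN | NaN |
| 7 | 0.35 | c | bah | NaN | NaN | c |
| 8 | 0.4 | c | bah | NaN | NaN | c |
| 9 | 0.45 | c | foo | NaN | NaN | c |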


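If we sort by the values in the column to which the corruption is applied, we see that the 'MNAR' condition works as intended: in the rows shown, the missing entries of `B_MNAR` form a contiguous run of the sorted rows rather than being scattered at random.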

In [41]:
df.sort_values('B')


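| | A | B | C | B_MCAR | B_MAR | B_MNAR |
|---|---|---|---|---|---|---|
| 3 | 0.15 | a | bah | NaN | NaN | a |
| 4 | 0.2 | a | foo | a | a | NaN |
| 11 | 0.55 | a | foo | NaN | a | NaN |
| 16 | 0.8 | b | bah | NaN | b | NaN |
| 5 | 0.25 | b | bah | b | b | NaN |
| 6 | 0.3 | b | bah | NaN | NaN | NaN |
| 0 | 0.0 | c | bah | c | NaN | NaN |
| 17 | 0.85 | c | bah | NaN | c | NaN |
| 15 | 0.75 | c | bah | NaN | c | NaN |
| 14 | 0.7 | c | foo | NaN | c | NaN |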
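
To make this check less dependent on eyeballing the table, one can compute the share of missing entries for each original value of `B`. The sketch below uses a small hand-written frame rather than jenga's output, and the helper name `missing_share_by_value` is an assumption for this example.

```python
import numpy as np
import pandas as pd


def missing_share_by_value(original, corrupted):
    """Fraction of missing entries in `corrupted`, grouped by the value in `original`."""
    return corrupted.isna().groupby(original).mean()


toy = pd.DataFrame({
    'B': ['a', 'a', 'b', 'b', 'c', 'c'],
    # Hand-made MNAR-like pattern: every 'b' and one 'c' is missing.
    'B_MNAR': ['a', 'a', np.nan, np.nan, np.nan, 'c'],
})
share = missing_share_by_value(toy['B'], toy['B_MNAR'])
print(share)
```

With enough rows, the shares should be roughly equal across values under MCAR and should differ markedly under MNAR.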


## Swapping Values

Swapping also works between numeric and non-numeric columns, and supports the different sampling schemes.

In [62]:
from jenga.corruptions.generic import SwappedValues
df = new_df()
df['C_swapped_MAR'] = SwappedValues(column='B', fraction=.5, sampling='MAR').transform(df)['C']
df['C_swapped_MCAR'] = SwappedValues(column='B', fraction=.5, sampling='MCAR').transform(df)['C']
df['A_swapped_MCAR'] = SwappedValues(column='A', fraction=.5, sampling='MCAR').transform(df)['A']
df

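| | A | B | C | C_swapped_MAR | C_swapped_MCAR | A_swapped_MCAR |
|---|---|---|---|---|---|---|
| 0 | 0.0 | a | foo | foo | foo | 0 |
| 1 | 0.05 | b | bah | bah | bah | 0.05 |
| 2 | 0.1 | b | foo | foo | foo | 0.1 |
| 3 | 0.15 | b | bah | bah | bah | bah |
| 4 | 0.2 | a | foo | foo | foo | 0.2 |
| 5 | 0.25 | b | bah | bah | bah | bah |
| 6 | 0.3 | c | bah | bah | bah | bah |
| 7 | 0.35 | c | bah | bah | bah | 0.35 |
| 8 | 0.4 | a | bah | bah | bah | bah |
| 9 | 0.45 | a | bah | bah | bah | bah |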
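
The mixed numbers and strings in `A_swapped_MCAR` can be reproduced with a small pandas sketch that exchanges the entries of two columns on a given set of rows. The helper `swap_values` and its arguments are hypothetical and only illustrate the idea; they are not part of jenga.

```python
import pandas as pd


def swap_values(df, column, other, rows):
    """Swap the entries of two columns on the given rows (illustrative only)."""
    # Cast both columns to object so numbers and strings can share a column.
    out = df.astype({column: object, other: object})
    tmp = out.loc[rows, column].copy()
    out.loc[rows, column] = out.loc[rows, other]
    out.loc[rows, other] = tmp
    return out


toy = pd.DataFrame({
    'A': [0.0, 0.25, 0.5, 0.75],
    'C': ['foo', 'bah', 'foo', 'bah'],
})
swapped = swap_values(toy, 'A', 'C', rows=[1, 3])
```

Casting to `object` is the design choice that matters here: without it, writing a string into a float column would either fail or silently change the column's dtype, depending on the pandas version.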


## Messing up categorical Values

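These corruptions permute the histogram of values in a categorical column. Numeric columns are left unchanged, as the message printed below for column `A` shows.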

In [80]:
from jenga.corruptions.generic import CategoricalShift
df = new_df()
df['C_permuted_MAR'] = CategoricalShift(column='B', fraction=.5, sampling='MAR').transform(df)['C']
df['C_permuted_MCAR'] = CategoricalShift(column='B', fraction=.5, sampling='MCAR').transform(df)['C']
df['A_permuted_MCAR'] = CategoricalShift(column='A', fraction=.5, sampling='MCAR').transform(df)['A']
df

CategoricalShift implemented only for categorical variables


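| | A | B | C | C_permuted_MAR | C_permuted_MCAR | A_permuted_MCAR |
|---|---|---|---|---|---|---|
| 0 | 0.0 | a | bah | bah | bah | 0.0 |
| 1 | 0.05 | a | bah | bah | bah | 0.05 |
| 2 | 0.1 | c | bah | bah | bah | 0.1 |
| 3 | 0.15 | b | foo | foo | foo | 0.15 |
| 4 | 0.2 | a | bah | bah | bah | 0.2 |
| 5 | 0.25 | b | bah | bah | bah | 0.25 |
| 6 | 0.3 | b | bah | bah | bah | 0.3 |
| 7 | 0.35 | c | foo | foo | foo | 0.35 |
| 8 | 0.4 | c | bah | bah | bah | 0.4 |
| 9 | 0.45 | c | bah | bah | bah | 0.45 |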
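
One way to read "permuting the histogram" is that the category labels are reassigned by a random permutation, so the same counts end up attached to different labels. The sketch below implements that reading; the helper `permute_categories` is an assumption for this example, and jenga's `CategoricalShift` may differ in detail.

```python
import numpy as np
import pandas as pd


def permute_categories(series, rng):
    """Relabel categories by a random permutation (illustrative only)."""
    categories = series.value_counts().index.to_numpy()
    # Map every category to a randomly chosen category, one-to-one.
    mapping = dict(zip(categories, rng.permutation(categories)))
    return series.map(mapping)


rng = np.random.default_rng(0)
toy = pd.Series(['foo'] * 6 + ['bah'] * 3 + ['baz'] * 1)
shifted = permute_categories(toy, rng)
print(toy.value_counts())
print(shifted.value_counts())
```

Because the mapping is one-to-one, the multiset of counts is preserved while the label carrying each count may change.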


## Adding Gaussian Noise



In [89]:
from jenga.corruptions.numerical import GaussianNoise
df = new_df()
df['A_GaussNoise_MAR'] = GaussianNoise(column='A', fraction=.5, sampling='MAR').transform(df)['A']
df['A_GaussNoise__MCAR'] = GaussianNoise(column='A', fraction=.5, sampling='MCAR').transform(df)['A']
df

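| | A | B | C | A_GaussNoise_MAR | A_GaussNoise__MCAR |
|---|---|---|---|---|---|
| 0 | 0.0 | c | bah | 0.0 | 0.0 |
| 1 | 0.05 | a | foo | 0.05 | 0.615024 |
| 2 | 0.1 | c | foo | 0.1 | -0.123286 |
| 3 | 0.15 | c | foo | 0.15 | -0.730584 |
| 4 | 0.2 | a | bah | 0.2 | 0.2 |
| 5 | 0.25 | b | bah | 0.25 | 0.25 |
| 6 | 0.3 | c | bah | -0.582742 | 0.3 |
| 7 | 0.35 | c | foo | 0.834302 | 0.162061 |
| 8 | 0.4 | b | foo | 1.847041 | 0.4 |
| 9 | 0.45 | a | foo | 1.500384 | 0.779006 |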
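
A minimal sketch of this kind of corruption adds zero-mean Gaussian noise to a chosen set of rows and leaves the rest untouched. The helper `add_gaussian_noise` and its `scale` parameter are hypothetical; in particular, tying the noise level to the column's standard deviation is an assumption, not a statement about how jenga's `GaussianNoise` picks it.

```python
import numpy as np
import pandas as pd


def add_gaussian_noise(series, rows, rng, scale=1.0):
    """Add zero-mean Gaussian noise to the selected rows (illustrative only)."""
    out = series.astype(float).copy()
    # Assumed noise level: a multiple of the column's standard deviation.
    stddev = scale * series.std()
    out.loc[rows] = out.loc[rows] + rng.normal(0.0, stddev, size=len(rows))
    return out


rng = np.random.default_rng(0)
toy = pd.Series(np.arange(10) / 10)
rows = [6, 7, 8, 9]
noisy = add_gaussian_noise(toy, rows, rng)
```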


## Scaling numeric values



In [91]:
from jenga.corruptions.numerical import Scaling
df = new_df()
df['A_Scaled_MAR'] = Scaling(column='A', fraction=.5, sampling='MAR').transform(df)['A']
df['A_Scaled_MCAR'] = Scaling(column='A', fraction=.5, sampling='MCAR').transform(df)['A']
df

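| | A | B | C | A_Scaled_MAR | A_Scaled_MCAR |
|---|---|---|---|---|---|
| 0 | 0.0 | a | foo | 0.0 | 0.0 |
| 1 | 0.05 | a | foo | 0.05 | 5.0 |
| 2 | 0.1 | a | foo | 0.1 | 0.1 |
| 3 | 0.15 | b | foo | 150.0 | 0.15 |
| 4 | 0.2 | a | foo | 0.2 | 0.2 |
| 5 | 0.25 | a | foo | 0.25 | 0.25 |
| 6 | 0.3 | a | bah | 0.3 | 30.0 |
| 7 | 0.35 | a | foo | 0.35 | 35.0 |
| 8 | 0.4 | c | foo | 400.0 | 0.4 |
| 9 | 0.45 | c | foo | 450.0 | 0.45 |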
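
In the output above, every corrupted entry within one column is multiplied by the same power of ten (1000 for `A_Scaled_MAR`, 100 for `A_Scaled_MCAR`). The sketch below mimics that behaviour; the helper `scale_values` and the candidate set `factors=(10, 100, 1000)` are assumptions for this example rather than jenga's documented defaults.

```python
import numpy as np
import pandas as pd


def scale_values(series, rows, rng, factors=(10, 100, 1000)):
    """Multiply the selected rows by one randomly drawn factor (illustrative only)."""
    out = series.astype(float).copy()
    # A single factor is drawn per call and applied to all selected rows.
    factor = rng.choice(factors)
    out.loc[rows] = out.loc[rows] * factor
    return out, factor


rng = np.random.default_rng(0)
toy = pd.Series(np.arange(10) / 10)
rows = [3, 8, 9]
scaled, factor = scale_values(toy, rows, rng)
```

This kind of corruption imitates a unit mismatch, for example values recorded in metres in some rows and in millimetres in others.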
