In [1]:
import numpy as np

### Instantiate the prediction task

In [3]:
from jenga.tasks.income import IncomeEstimationTask

task = IncomeEstimationTask(seed=42)

### Task details

In this task, we try to predict a person's income level (more or less than 50K dollars per year) from demographic and work-related data. The task is often used as a proxy to study automated decision making for loan applications.

The original data is available in the UCI machine learning repository at https://archive.ics.uci.edu/ml/datasets/adult

In [4]:
task.train_data

Unnamed: 0,workclass,occupation,marital_status,education,hours_per_week,age
5514,Local-gov,Prof-specialty,Never-married,Bachelors,50,33
19777,Private,Exec-managerial,Married-civ-spouse,Assoc-voc,50,36
10781,Self-emp-not-inc,Craft-repair,Separated,9th,40,58
32240,Private,Farming-fishing,Married-civ-spouse,Assoc-voc,46,21
9876,Private,Other-service,Divorced,Some-college,40,27
...,...,...,...,...,...,...
29802,Private,Craft-repair,Married-civ-spouse,Bachelors,40,47
5390,Private,Other-service,Divorced,12th,21,31
860,Private,Adm-clerical,Never-married,11th,20,18
15795,Self-emp-not-inc,Farming-fishing,Married-civ-spouse,HS-grad,84,50


### Train the provided baseline model

Jenga allows us to easily train and evaluate a logistic regression model for this task. Have a look at https://github.com/schelterlabs/jenga/blob/master/jenga/tasks/income.py if you want to know the details.


In [5]:
model = task.fit_baseline_model(task.train_data,task.train_labels)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.5s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:    3.1s finished


In [10]:
y_pred = model.predict_proba(task.test_data)

f"The ROC AUC score on the test data is {task.score_on_test_data(y_pred)}"

'The ROC AUC score on the test data is 0.8820613837253065'

### Let's have a look at the test data

In [12]:
task.test_data

Unnamed: 0,workclass,occupation,marital_status,education,hours_per_week,age
14160,Private,Adm-clerical,Divorced,Some-college,38,27
27048,State-gov,Exec-managerial,Married-civ-spouse,HS-grad,40,45
28868,Private,Exec-managerial,Married-civ-spouse,Bachelors,55,29
5667,Private,Machine-op-inspct,Never-married,Bachelors,40,30
7827,Self-emp-not-inc,Craft-repair,Divorced,Some-college,50,29
...,...,...,...,...,...,...
1338,Private,Tech-support,Divorced,Bachelors,16,71
24534,Local-gov,Prof-specialty,Married-civ-spouse,Some-college,40,55
18080,Private,Prof-specialty,Married-civ-spouse,Prof-school,48,47
10354,Private,Adm-clerical,Never-married,Bachelors,40,27


### Corruptions

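Jenga provides a set of predefined data corruptions that we can use to simulate errors in the test data. We will simulate 'implicit missing values' in a column, meaning that entries are replaced by a placeholder value such as 'BROKEN' rather than by an explicit null, and see how this impacts the prediction quality.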

Jenga supports many more predefined corruptions; have a look at https://github.com/schelterlabs/jenga/tree/master/jenga/corruptions.
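
To make the idea of implicit missing values concrete, here is a minimal sketch in plain pandas of what such a corruption might do. The helper `inject_missing_values` and the toy DataFrame are hypothetical and not part of Jenga; Jenga's own `MissingValues` class may sample rows differently.

```python
import numpy as np
import pandas as pd


def inject_missing_values(df, column, fraction, na_value, seed=0):
    """Hypothetical helper: replace a random fraction of `column` with `na_value`."""
    rng = np.random.default_rng(seed)
    corrupted = df.copy(deep=True)  # leave the input data untouched
    num_rows_to_corrupt = int(round(fraction * len(corrupted)))
    # Pick rows uniformly at random, independent of their values
    # (missing completely at random)
    positions = rng.choice(len(corrupted), size=num_rows_to_corrupt, replace=False)
    column_position = corrupted.columns.get_loc(column)
    corrupted.iloc[positions, column_position] = na_value
    return corrupted


# Toy data, made up for illustration only
toy = pd.DataFrame({
    "marital_status": ["Divorced", "Never-married", "Separated", "Divorced"] * 25,
    "age": list(range(20, 120)),
})

corrupted_toy = inject_missing_values(
    toy, column="marital_status", fraction=0.5, na_value="BROKEN"
)
num_broken = int((corrupted_toy["marital_status"] == "BROKEN").sum())
print(f"{num_broken} of {len(toy)} entries were replaced")
```

The placeholder is a regular string, so downstream code will not detect it with a null check; the model simply sees an unknown category.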


In [11]:
from jenga.corruptions.generic import MissingValues

marital_status_corruption = MissingValues(column='marital_status', fraction=0.99, na_value='BROKEN')


In [13]:
corrupted_test_data = marital_status_corruption.transform(task.test_data)
corrupted_test_data

Unnamed: 0,workclass,occupation,marital_status,education,hours_per_week,age
14160,Private,Adm-clerical,BROKEN,Some-college,38,27
27048,State-gov,Exec-managerial,BROKEN,HS-grad,40,45
28868,Private,Exec-managerial,BROKEN,Bachelors,55,29
5667,Private,Machine-op-inspct,BROKEN,Bachelors,40,30
7827,Self-emp-not-inc,Craft-repair,BROKEN,Some-college,50,29
...,...,...,...,...,...,...
1338,Private,Tech-support,BROKEN,Bachelors,16,71
24534,Local-gov,Prof-specialty,BROKEN,Some-college,40,55
18080,Private,Prof-specialty,BROKEN,Prof-school,48,47
10354,Private,Adm-clerical,BROKEN,Bachelors,40,27


In [14]:
y_pred = model.predict_proba(corrupted_test_data)

f"The ROC AUC score on the corrupted test data is {task.score_on_test_data(y_pred)}"

'The ROC AUC score on the corrupted test data is 0.8071567161891435'
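
To put the two scores side by side, the short calculation below computes the absolute and relative drop in AUC from the values reported above (a drop of roughly 0.075, or about 8.5% of the clean score). The variable names are chosen for illustration only.

```python
# Scores copied from the outputs of the cells above
clean_auc = 0.8820613837253065
corrupted_auc = 0.8071567161891435

absolute_drop = clean_auc - corrupted_auc
relative_drop = absolute_drop / clean_auc

print(f"Absolute drop in AUC: {absolute_drop:.4f}")
print(f"Relative drop in AUC: {relative_drop:.1%}")
```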

### Jenga's evaluators

Jenga provides a set of evaluators that automate measuring the impact of given data corruptions on a model's predictive performance.
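
Conceptually, such an evaluator repeats the manual steps from the previous section for every corruption. The sketch below shows that loop under simplifying assumptions: `evaluate_corruption_impact`, the toy model, and the toy corruption are all hypothetical, and Jenga's actual `CorruptionImpactEvaluator` may be implemented differently.

```python
import numpy as np


def evaluate_corruption_impact(predict, score, test_data, corruptions, num_repetitions):
    """Hypothetical sketch: score a model on clean and repeatedly corrupted data."""
    baseline_score = score(predict(test_data))
    results = []
    for corrupt in corruptions:
        # Corruptions are randomized, so apply each one several times
        corrupted_scores = [
            score(predict(corrupt(test_data))) for _ in range(num_repetitions)
        ]
        results.append({
            "baseline_score": baseline_score,
            "corrupted_scores": corrupted_scores,
        })
    return results


# Toy setup, made up for illustration: a threshold "model" scored by accuracy
rng = np.random.default_rng(0)
features = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7])
labels = np.array([0, 1, 0, 1, 0, 1])


def predict(x):
    return (x > 0.5).astype(int)


def score(y_pred):
    return float(np.mean(y_pred == labels))


def zero_out_half(x):
    # Overwrite a random half of the values with 0.0
    corrupted = x.copy()
    positions = rng.choice(len(x), size=len(x) // 2, replace=False)
    corrupted[positions] = 0.0
    return corrupted


toy_results = evaluate_corruption_impact(
    predict, score, features, [zero_out_half], num_repetitions=3
)
for result in toy_results:
    print(result["baseline_score"], np.mean(result["corrupted_scores"]))
```

Repeating each corruption several times and averaging the scores smooths out the randomness in which rows get corrupted.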

In [17]:
from jenga.evaluation.corruption_impact import CorruptionImpactEvaluator

evaluator = CorruptionImpactEvaluator(task)

In [15]:
corruptions = [
    MissingValues(column='marital_status', fraction=0.99, na_value='BROKEN'),
    MissingValues(column='age', fraction=0.05, na_value=-999),
]

### Run the evaluation of the corruptions with 10 repetitions 

In [24]:
num_repetitions = 10
results = evaluator.evaluate(model, num_repetitions, *corruptions)

0/20 (0.018007553000000343)
10/20 (0.15351217600000044)


### Investigate the impact on the predictive performance of the model

In [25]:
# Compare the score on clean data with the mean score over all repetitions
for validation_result in results:
    print(validation_result.corruption)
    print(f"""
     Score (AUC) on 
      clean data:     {validation_result.baseline_score}
      corrupted data: {np.mean(validation_result.corrupted_scores)}
     """)
    print("\n")

MissingValues: {'column': 'marital_status', 'fraction': 0.99, 'na_value': 'BROKEN', 'missingness': 'MCAR'}

     Score (AUC) on 
      clean data:     0.8820613837253065
      corrupted data: 0.8090494098184386
     


MissingValues: {'column': 'age', 'fraction': 0.05, 'na_value': -999, 'missingness': 'MCAR'}

     Score (AUC) on 
      clean data:     0.8820613837253065
      corrupted data: 0.8444987378736565
     


