## **Automatic stratification of Tabular data using MDSS**

The goal of AutoStrat is to identify sub-populations who, as a group, have outcomes that significantly diverge from the overall population.

There are $\prod_{m=1}^{M}\left(2^{|X_{m}|}-1\right)$ unique subgroups in a dataset with $M$ features, where feature $m$ has $|X_{m}|$ discretized values. A subgroup is any $M$-dimensional Cartesian product of non-empty subsets of feature values, one subset per feature. MDSS mitigates this computational hurdle by approximately identifying the most statistically divergent subgroup in linear rather than exponential time.
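
To get a feel for the size of this search space, the short sketch below evaluates the product formula for a few feature cardinalities. The function name `count_subgroups` and the cardinalities are illustrative and are not taken from the dataset.

```python
from math import prod

def count_subgroups(cardinalities):
    """Number of subgroups: product over features of (2^|X_m| - 1)."""
    return prod(2 ** k - 1 for k in cardinalities)

# Three hypothetical features with 2, 3 and 5 discretized values
print(count_subgroups([2, 3, 5]))  # 3 * 7 * 31 = 651
```

Adding a single feature with 10 values multiplies this count by 1023, which is why an exhaustive search quickly becomes impractical.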




In this example, we use a Census-Income dataset of ~16K records. Each record describes an individual by their census data, and the observed outcome is whether or not they earn more than $50k/yr.

Since the outcome is binary, we can use a Bernoulli scoring function. Other scoring functions, e.g. `Poisson`, may be appropriate for your dataset depending on the parametric assumptions about the outcome. In scenarios where we do not wish to make any parametric assumptions, a scoring function like `BerkJones` may be more appropriate.
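
As a rough illustration of what a Bernoulli scoring function measures, the sketch below uses one common form of the log-likelihood-ratio score from the subset-scanning literature. For a subset with observed outcomes $y_i$ and expected probabilities $p_i$, the odds of every record are multiplied by a factor $q$, and the score $\sum_i \left[ y_i \log q - \log(1 - p_i + q p_i) \right]$ is maximized over $q$. The function names and the grid search over $q$ are assumptions made for illustration; the library's own implementation may differ in detail.

```python
import math

def bernoulli_score(observed, expected, q):
    """Log-likelihood ratio of 'odds multiplied by q' versus 'odds unchanged'."""
    return sum(y * math.log(q) - math.log(1 - p + q * p)
               for y, p in zip(observed, expected))

def best_bernoulli_score(observed, expected, q_grid):
    """Maximize the score over a (hypothetical) grid of candidate q values."""
    return max((bernoulli_score(observed, expected, q), q) for q in q_grid)

# Toy subset: 3 of 4 records are positive, but each was expected at 0.25
observed = [1, 1, 1, 0]
expected = [0.25] * 4
q_grid = [1 + 0.5 * i for i in range(1, 40)]  # q > 1: outcomes higher than expected
score, q = best_bernoulli_score(observed, expected, q_grid)
print(round(score, 3), q)
```

In this toy case the score peaks at $q = 9$, which moves the expected odds of $1/3$ to the observed odds of $3$.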

Import the `bias_scan` function from the MDSS detector module, along with NumPy and pandas.

In [1]:


from aif360.detectors.mdss_detector import bias_scan

import numpy as np
import pandas as pd

In [2]:
np.random.seed(0)

We'll load the CSV file containing the data from GitHub. Note that the data contains categorical features, and continuous features have been binned into 10 bins. When applying AutoStrat, it is important that categorical features are not one-hot encoded, as this may result in subsets that contain individuals belonging to two different categories. The cardinality of each feature also influences run time.
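
A minimal sketch of this preparation step, assuming a hypothetical continuous column `hours`: quantile binning with `pandas.qcut` turns it into a single categorical column with at most 10 levels. The column name, the synthetic values and the choice of quantile rather than equal-width bins are assumptions; the CSV used here was binned beforehand, and its exact binning procedure is not shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'hours': rng.integers(1, 100, size=1000)})  # synthetic data

# Quantile-based binning into at most 10 bins; duplicates='drop' merges
# bins whose edges coincide instead of raising an error
df['hours_bin'] = pd.qcut(df['hours'], q=10, duplicates='drop').astype(str)

# One categorical column per feature, not one indicator column per level
print(df['hours_bin'].nunique())
```

Keeping the bins in one column lets the scan choose a subset of levels for that feature, which is what the subgroup definition above relies on.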

### Data

In [3]:
dff = pd.read_csv('https://gist.githubusercontent.com/Viktour19/b690679802c431646d36f7e2dd117b9e/raw/d8f17bf25664bd2d9fa010750b9e451c4155dd61/adult_autostrat.csv')
dff.head()

Unnamed: 0,workclass,education,marital_status,occupation,relationship,race,sex,native_country,age_bin,education_num_bin,hours_per_week_bin,capital_gain_bin,capital_loss_bin,observed,expectation
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States,17-27,1-8,40-44,0,0,0,0.236226
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States,37-47,9,45-99,0,0,0,0.236226
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States,28-36,12-16,40-44,0,0,1,0.236226
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States,37-47,10-11,40-44,7298-7978,0,1,0.236226
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States,17-27,10-11,1-39,0,0,0,0.236226


Notice that all the records have the same expectation, which corresponds to the mean income status of the entire population.

### MDSS API and parameters

The `scoring` parameter specifies the scoring function to be used in the scan. These can be one of `Bernoulli`, `Gaussian`, `Poisson` or `BerkJones`.

The `overpredicted` parameter specifies whether we want to scan for a subgroup that the model favors or disfavors. If `True`, we scan for a subgroup whose predictions are favourable in comparison with the actual outcomes, i.e. the expectations are higher than what is observed. If `False`, the converse is true.

The `penalty` coefficient allows us to adjust the complexity of the highest-scoring subset. It can be thought of as a regularization constant.

In each iteration, we optimize over subsets of all the attributes and randomly initialize the values of each attribute. `num_iters` specifies the number of iterations and thus the number of random initializations.
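
To make the role of `penalty` concrete, the sketch below assumes the penalty is charged once for every feature value included in the subset, so that the penalized score is the raw score minus `penalty` times the number of selected values. This charging scheme, the function name and the raw scores are assumptions for illustration, not output from the scan.

```python
def penalized_score(raw_score, subset, penalty):
    """Raw score minus a complexity charge per selected feature value."""
    n_values = sum(len(values) for values in subset.values())
    return raw_score - penalty * n_values

# Two hypothetical candidate subsets with made-up raw scores
simple = {'education_num_bin': ['1-8', '9']}
complex_ = {'education_num_bin': ['1-8', '9'],
            'capital_gain_bin': ['0', '2407-3273', '3325-4416']}

for penalty in (2, 20):
    s = penalized_score(100.0, simple, penalty)
    c = penalized_score(130.0, complex_, penalty)
    print(penalty, 'complex' if c > s else 'simple')
```

With the low penalty the more complex subset wins; with the high penalty the simpler subset wins even though its raw score is lower.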

We will scan for individuals with income levels significantly lower and higher than the overall average of `0.24`. First, we separate the observed outcomes from the features.

In [4]:
observed = dff['observed']
data = dff.drop(['observed','expectation'], axis=1)


We'll first scan for individuals with a significantly higher income level than the overall average of `0.24`. We start with a low penalty value and observe the subset we find.

In [5]:
# overpredicted=False: look for a subgroup whose observed rate exceeds the expectation
privileged_subset = bias_scan(data=data, observations=observed, scoring='Bernoulli', overpredicted=False, penalty=2)
print(privileged_subset)

({'marital_status': [' Married-AF-spouse', ' Married-civ-spouse'], 'education_num_bin': ['10-11', '12-16'], 'occupation': [' Adm-clerical', ' Armed-Forces', ' Craft-repair', ' Exec-managerial', ' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support', ' Transport-moving'], 'age_bin': ['28-36', '37-47', '48-90']}, 1339.0438)


In [6]:
dff = data.copy()
dff['observed'] = observed
dff['probabilities'] = observed.mean()

to_choose = dff[privileged_subset[0].keys()].isin(privileged_subset[0]).all(axis=1)
temp_df = dff.loc[to_choose]

print("Our detected privileged group has a size of {}, we observe {} as the average probability of earning >50k, but our population mean is {}"\
.format(len(temp_df), np.round(temp_df['observed'].mean(),4), np.round(temp_df['probabilities'].mean(),4)))

group_obs = temp_df['observed'].mean()
group_prob = temp_df['probabilities'].mean()

# Odds ratio: odds of the outcome within the group divided by the population odds
odds_mul = (group_obs / (1 - group_obs)) / (group_prob / (1 - group_prob))
print()
print("This is a multiplicative increase in the odds by {}"\
.format(odds_mul))


Our detected privileged group has a size of 3401, we observe 0.6601 as the average probability of earning >50k, but our population mean is 0.2362

This is a multiplicative increase in the odds by 6.279065608991142
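
The row-selection idiom used in the cell above can be wrapped in a small helper. This is a sketch on a toy frame; `subset_mask` is a hypothetical name, not part of AIF360.

```python
import pandas as pd

def subset_mask(df, subset):
    """Boolean mask of rows whose value is allowed for every feature in `subset`."""
    cols = list(subset.keys())
    # DataFrame.isin with a dict checks each column against its own value list
    return df[cols].isin(subset).all(axis=1)

toy = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Female'],
    'age_bin': ['17-27', '28-36', '28-36', '48-90'],
    'observed': [0, 1, 1, 0],
})
subset = {'sex': ['Male'], 'age_bin': ['28-36', '48-90']}

mask = subset_mask(toy, subset)
print(toy.loc[mask])  # only the third row satisfies both conditions
```

A record belongs to the subgroup only if it matches one of the listed values for every feature, which is why the mask uses `all(axis=1)`.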


Next, we'll scan for individuals with a significantly lower income level than the overall average of `0.24`. We start with a low penalty value and observe the subset we find.

In [7]:
# overpredicted=True: look for a subgroup whose observed rate falls below the expectation
unprivileged_subset = bias_scan(data=data, observations=observed, scoring='Bernoulli', overpredicted=True, penalty=2)
print(unprivileged_subset)


({'capital_gain_bin': ['0', '114-2354', '2407-3273', '3325-4416'], 'education': [' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' HS-grad', ' Preschool', ' Some-college'], 'relationship': [' Not-in-family', ' Other-relative', ' Own-child', ' Unmarried'], 'capital_loss_bin': ['0', '1672-1876']}, 1312.9051)


In [8]:
to_choose = dff[unprivileged_subset[0].keys()].isin(unprivileged_subset[0]).all(axis=1)
temp_df = dff.loc[to_choose]

print("Our detected unprivileged group has a size of {}, we observe {} as the average probability of earning >50k, but our population mean is {}"\
.format(len(temp_df), np.round(temp_df['observed'].mean(),4), np.round(temp_df['probabilities'].mean(),4)))

group_obs = temp_df['observed'].mean()
group_prob = temp_df['probabilities'].mean()

# Odds ratio: odds of the outcome within the group divided by the population odds
odds_mul = (group_obs / (1 - group_obs)) / (group_prob / (1 - group_prob))
print()
print("This is a multiplicative decrease in the odds by {}"\
.format(1/odds_mul))


Our detected unprivileged group has a size of 8074, we observe 0.0307 as the average probability of earning >50k, but our population mean is 0.2362

This is a multiplicative decrease in the odds by 9.760041246741118
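
The odds computation repeated in the two cells above can be written as a small function. The name `odds_ratio` is hypothetical, and the inputs below are the rounded rates printed above, so the results agree with the notebook output only approximately.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

def odds_ratio(group_rate, baseline_rate):
    """Odds of the outcome in the group relative to the baseline."""
    return odds(group_rate) / odds(baseline_rate)

# Rounded rates from the two scans above, against the population mean
print(odds_ratio(0.6601, 0.2362))      # privileged group: ratio above 1
print(1 / odds_ratio(0.0307, 0.2362))  # unprivileged group: reported as a decrease
```

Wrapping each odds term in its own function call avoids the operator-precedence mistakes that chained divisions invite.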


## Bias scan using Multi-Dimensional Subset Scan (MDSS)

"Identifying Significant Predictive Bias in Classifiers" https://arxiv.org/abs/1611.08292

The goal of bias scan is to identify a subgroup (or subgroups) that has significantly more predictive bias than would be expected from an unbiased classifier. There are $\prod_{m=1}^{M}\left(2^{|X_{m}|}-1\right)$ unique subgroups in a dataset with $M$ features, where feature $m$ has $|X_{m}|$ discretized values. A subgroup is any $M$-dimensional Cartesian product of non-empty subsets of feature values, one subset per feature. Bias scan mitigates this computational hurdle by approximately identifying the most statistically biased subgroup in linear rather than exponential time.




In [9]:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score


#### Data

In [10]:
np.random.seed(0)


In [11]:
#transform data to numeric for the model
y = dff['observed']
features = dff.drop(['observed'], axis = 1)
X = features.copy()

for feature in X.columns:
    X[feature] = X[feature].astype('category').cat.codes
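
As a quick illustration of what this encoding does, `cat.codes` replaces each label by the position of that label in the sorted list of categories. The toy series below is made up for the example.

```python
import pandas as pd

s = pd.Series(['Private', 'Local-gov', 'Private', '?'])
cat = s.astype('category')

print(list(cat.cat.categories))  # categories are sorted: ['?', 'Local-gov', 'Private']
print(cat.cat.codes.tolist())    # [2, 1, 2, 0]
```

The integer codes impose an arbitrary order on the categories. Tree-based models such as the gradient boosting classifier used here can still split on them, but the codes should not be read as magnitudes.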



### Training the model

We train a simple gradient boosting classifier to predict the outcome. For simplicity, the model is fit and evaluated on the same data.

In [12]:
np.random.seed(0)

In [13]:
def report(actual, predicted):
    print('Precison: ', precision_score(actual, predicted))
    print('Recall: ', recall_score(actual, predicted))
    print('F1: ', f1_score(actual, predicted))
    print('AUC-ROC: ', roc_auc_score(actual, predicted))

model = GBC()

model.fit(X, y)
preds = pd.Series(model.predict(X))

print('Global Results:')
report(y, preds)



Global Results:
Precison:  0.7693075898801598
Recall:  0.6008840353614144
F1:  0.6747445255474452
AUC-ROC:  0.7725771202138797
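
Because `report` passes hard 0/1 predictions to `roc_auc_score`, the ROC curve has a single operating point and the reported AUC-ROC reduces to the average of recall and specificity (balanced accuracy). The sketch below checks this identity by hand on made-up labels; the helper name is hypothetical.

```python
def auc_from_hard_labels(actual, predicted):
    """Area under the one-point ROC curve: (recall + specificity) / 2."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    pos = sum(actual)
    neg = len(actual) - pos
    recall = tp / pos
    specificity = tn / neg
    return (recall + specificity) / 2

actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 1]
print(auc_from_hard_labels(actual, predicted))  # (2/3 + 4/5) / 2
```

Passing predicted probabilities (for example from `predict_proba`) instead of labels would give the usual threshold-free AUC.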


### Bias scan
In bias scan, we use the predictions from the model as the expectations.

In [14]:
subset, _ = bias_scan(data=features, observations=y, 
          expectations=preds, scoring='Bernoulli', 
          overpredicted=False, penalty=5)

print()
print('Anomalous Subgroup:\n', subset)


Anomalous Subgroup:
 {'education_num_bin': ['1-8', '10-11', '9'], 'capital_gain_bin': ['0', '2407-3273', '3325-4416', '4508-5178', '5455-7262']}


In [15]:
to_choose = data[subset.keys()].isin(subset).all(axis=1)
_y = y.loc[to_choose].copy()
_preds = preds.loc[to_choose].copy()

print()
print('Subgroup Results:')
report(_y, _preds)


Subgroup Results:
Precison:  0.6510851419031719
Recall:  0.2498398462524023
F1:  0.3611111111111111
AUC-ROC:  0.6142479460020183


In [16]:

print("Our detected subgroup has a size of {}, we observe {} as the mean outcome, but our model predicts {}"\
.format(len(_y), _y.mean(), _preds.mean()))

group_obs = _y.mean()
group_prob = _preds.mean()

# Ratio of predicted odds to observed odds; each odds term is parenthesized
# so that the chained divisions are not evaluated left to right
odds_mul = (group_prob / (1 - group_prob)) / (group_obs / (1 - group_obs))
print("This is a multiplicative decrease in the odds by {}"\
.format(np.round(1/odds_mul, 2)))

Our detected subgroup has a size of 11353, we observe 0.13749669690830618 as the mean outcome, but our model predicts 0.05276138465603805
This is a multiplicative decrease in the odds by 2.86


In order to get a simpler subset, we can increase the penalty. Think of this penalty as a regularization parameter: the higher the value, the simpler our subset gets, but the less extreme the bias it exhibits.

In [17]:
subset, _ = bias_scan(data=features, observations=y, 
          expectations=preds, scoring='Bernoulli', 
          overpredicted=False, penalty=20)

print()
print('Anomalous Subgroup:\n', subset)


Anomalous Subgroup:
 {'education_num_bin': ['1-8', '10-11', '9']}


In [18]:
to_choose = data[subset.keys()].isin(subset).all(axis=1)
_y = y.loc[to_choose].copy()
_preds = preds.loc[to_choose].copy()

print()
print('Subgroup Results:')
report(_y, _preds)


Subgroup Results:
Precison:  0.7430639324487334
Recall:  0.34471180749860103
F1:  0.4709480122324159
AUC-ROC:  0.6616167689303029


In [19]:
print("Our detected subgroup has a size of {}, we observe {} as the mean outcome, but our model predicts {}"\
.format(len(_y), _y.mean(), _preds.mean()))

group_obs = _y.mean()
group_prob = _preds.mean()

# Ratio of predicted odds to observed odds; each odds term is parenthesized
# so that the chained divisions are not evaluated left to right
odds_mul = (group_prob / (1 - group_prob)) / (group_obs / (1 - group_obs))
print("This is a multiplicative decrease in the odds by {}"\
.format(np.round(1/odds_mul, 2)))

Our detected subgroup has a size of 11704, we observe 0.15268284347231717 as the mean outcome, but our model predicts 0.07083048530416952
This is a multiplicative decrease in the odds by 2.36


#### Summary
This notebook demonstrates how to discover systematic deviations in data or models for a binary classification case. There are more examples for other use cases in `demo_mdss_detectors.ipynb`, found in the examples folder of the [AIF360 repo](https://github.com/Trusted-AI/AIF360).