# N-Naive Bayes

This notebook runs the tests for the N-naive-Bayes project. The source for the classifiers is in the `models/` directory.

 - `models/nnb_base`: The base class for NNB, containing fitting and prediction generation
 - `models/nnb_parity`: The statistical parity version of NNB
 - `models/nnb_df`: The differential fairness version of NNB
 - `models/two_naive_bayes`: A scikit-learn implementation of the original CV2NB
 - `models/gaussian_sub`: The Gaussian naive Bayes sub-estimator

Supporting code:
 - `dataset.py`: Classes for interacting with the US Census data used for testing. See the [folktables library](https://github.com/zykls/folktables).
 - `scoring.py`: Scoring functions implementing various popular group-fairness measures
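To make the group-fairness measures concrete, here is a minimal sketch of three common ones: the statistical parity gap, the disparate impact ratio, and an unsmoothed empirical differential fairness ε. The function names are hypothetical and are not taken from `scoring.py`, whose implementation may differ (for example by smoothing the rates). The sketch assumes binary predictions and that every group has a positive rate strictly between 0 and 1.

```python
import numpy as np


def positive_rates(y_pred, groups):
    """Return {group: P(y_pred == 1 | group)} for binary predictions."""
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    return {g: float(np.mean(y_pred[groups == g])) for g in np.unique(groups)}


def parity_gap(y_pred, groups):
    """Largest difference in positive rate between any two groups."""
    rates = np.array(list(positive_rates(y_pred, groups).values()))
    return float(rates.max() - rates.min())


def disparate_impact(y_pred, groups):
    """Ratio of the lowest to the highest group positive rate."""
    rates = np.array(list(positive_rates(y_pred, groups).values()))
    return float(rates.min() / rates.max())


def edf_epsilon(y_pred, groups):
    """Unsmoothed empirical differential fairness.

    The largest |log ratio| of outcome probabilities between any two
    groups, taken over both outcomes (positive and negative).
    """
    pos = np.array(list(positive_rates(y_pred, groups).values()))
    eps = 0.0
    for rates in (pos, 1.0 - pos):
        eps = max(eps, float(np.log(rates.max() / rates.min())))
    return eps


# Toy data: three groups with positive rates 0.5, 0.25 and 0.75.
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0]
groups = ["a"] * 4 + ["b"] * 4 + ["c"] * 4

gap = parity_gap(y_pred, groups)
di = disparate_impact(y_pred, groups)
eps = edf_epsilon(y_pred, groups)
print(gap, di, eps)
```

A perfectly fair classifier under these definitions has a parity gap of 0, a disparate impact of 1, and ε of 0.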


## Setup

In [9]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

import sys
import logging
from IPython.display import display
logging.basicConfig(
    format="%(asctime)s [%(levelname)s] %(message)s"
)
sns.set_theme(style="darkgrid")

from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB

In [13]:
sys.path.append('models')

%load_ext autoreload
%autoreload 1
%aimport dataset, scoring, plots, nnb_parity, nnb_df, two_naive_bayes

from scoring import split_preserve_groups, score_table, score_means, split_ds
from dataset import Income, Employment, init_data_src, SensitiveAttr
from plots import compare_groups, group_comparison_barplot, group_comparison_multiplot
from nnb_parity import NNB_Parity
from nnb_df import NNB_DF
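from two_naive_bayes import TwoNaiveBayes

# The cells below still use the older names NNB10 and NNB12. This mapping is
# an assumption inferred from the plot labels ("NNB-Parity", "NNB-DF").
NNB10 = NNB_Parity
NNB12 = NNB_DF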


In [21]:
try:
    DATA = {
        '2018':init_data_src('2018', False)
    }
except FileNotFoundError:
    warnings.warn('If on Windows, you must download the file manually: https://www2.census.gov/programs-surveys/acs/data/pums/')

In [4]:
def log(level=logging.DEBUG):
    """Set the log level of every classifier module's logger."""
    for name in ("nnb_parity", "nnb_df", "two_naive_bayes"):
        logging.getLogger(name).setLevel(level)

We will store the classification tasks in these globals:

In [None]:
income_race = Income().load(DATA['2018'])
income_racesex = Income(sensitive=SensitiveAttr.RACESEX).load(DATA['2018'])
employment_race = Employment().load(DATA['2018'])
employment_racesex = Employment(sensitive=SensitiveAttr.RACESEX).load(DATA['2018'])

# Testing

For each dataset, we pick a sensitive attribute: Race, Sex, or both (Race-Sex). Each dataset paired with a sensitive attribute defines one classification task.

For each classification task, we train a baseline (Gaussian NB), Two-naive-Bayes, and the N-naive-Bayes variants. We present the mean and variance of the scores achieved over `n_splits` (defined below) random train-test splits.

In [None]:
n_splits=10
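The repeated-split protocol can be sketched as follows. This is only an illustration on synthetic data: `mean_var_over_splits` is a hypothetical helper, not the notebook's `score_means`, and it stratifies on the label rather than preserving sensitive groups as `split_preserve_groups` presumably does.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def mean_var_over_splits(clf, X, y, n_splits=10, test_size=0.25):
    """Fit a fresh copy of clf on n_splits random splits.

    Returns the mean and variance of the test accuracy.
    """
    scores = []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed, stratify=y
        )
        model = clone(clf).fit(X_tr, y_tr)  # clone so splits do not share state
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    return float(np.mean(scores)), float(np.var(scores))


# Synthetic stand-in for the census data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
mean_acc, var_acc = mean_var_over_splits(GaussianNB(), X, y, n_splits=10)
print(f"accuracy: mean={mean_acc:.4f}, var={var_acc:.6f}")
```

Using a different `random_state` per split keeps the runs reproducible while still varying the partition.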

## Income-Race

Data: Age, class of worker, marital status, relationship, educational attainment, occupation, place of birth, usual hours worked per week.

Aim: To predict whether each individual's income is above $50,000

Sensitive Feature(s) Being Tested: Race

In [None]:
log(logging.ERROR)
score_means(income_race, n_splits=n_splits, classifiers_cat = {
    "GaussianNB":GaussianNB(),
    "NNB10":NNB10(delta=0.05, max_iter=1000, disc_threshold=1e-3),
    "NNB12":NNB12(delta=0.05, max_iter=1000, disc_threshold=1e-3),
}, classifiers_bin = {
    "GaussianNB_Binary":GaussianNB(),
    "CV2NB":NNB10(delta=0.05, max_iter=1000, disc_threshold=1e-3, use_old_balancing=True)
}, output_order=["Perfect", "GaussianNB_Binary", "CV2NB", "GaussianNB", "NNB10", "NNB12"], include_perfect=True)

In [None]:
compare_groups(
    *split_ds(income_race), 
    classifiers_cat={
        "NNB-Parity": NNB10(delta=0.05, max_iter=1000, disc_threshold=1e-2), 
        "NNB-DF": NNB12(delta=0.05, max_iter=1000, disc_threshold=1e-2), 
    }, classifiers_bin={
        "2NB": NNB10(delta=0.05, max_iter=1000, disc_threshold=1e-2, use_old_balancing=True),
    },
    rows=2, cols=4, include_actual=True, axis_lim=(0.0, 0.8))

## Employment-Race

Data: Age, educational attainment, marital status, relationship, disability, employment status of parents, citizenship status, mobility status, military service, ancestry, nativity, hearing difficulty, vision difficulty, cognitive difficulty.

Aim: To predict whether each individual is currently employed

Sensitive Feature(s) Being Tested: Race

In [None]:
log(logging.ERROR)
score_means(employment_race, n_splits=n_splits, classifiers_cat = {
    "GaussianNB":GaussianNB(),
    "NNB10":NNB10(delta=0.05, max_iter=1000, disc_threshold=1e-3),
    "NNB12":NNB12(delta=0.05, max_iter=1000, disc_threshold=1e-3),
}, classifiers_bin = {
    "GaussianNB_Binary":GaussianNB(),
    "CV2NB":NNB10(delta=0.05, max_iter=1000, disc_threshold=1e-3, use_old_balancing=True)
}, output_order=["Perfect", "GaussianNB_Binary", "CV2NB", "GaussianNB", "NNB10", "NNB12"], include_perfect=True)

In [None]:
compare_groups(
    *split_ds(employment_race), 
    classifiers_cat={
        "NNB10": NNB10(delta=0.05, max_iter=1000, disc_threshold=1e-2), 
        "NNB12": NNB12(delta=0.05, max_iter=1000, disc_threshold=1e-2), 
    }, 
    rows=2, cols=3, include_actual=True)

## Income-Race-Sex

Data: Age, class of worker, marital status, relationship, educational attainment, occupation, place of birth, usual hours worked per week.

Aim: To predict whether each individual's income is above $50,000

Sensitive Feature(s) Being Tested: Race, Sex
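As an illustration of what an intersectional task involves, the sketch below builds a combined Race-Sex group label and the binary income target on a toy frame. The column names `RAC1P`, `SEX` and `PINCP` are the ACS PUMS variable names; the toy values are made up, and whether `dataset.py` builds its groups this way is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows standing in for ACS PUMS person records (values are invented).
df = pd.DataFrame({
    "RAC1P": [1, 1, 2, 2, 6, 6],   # race code
    "SEX": [1, 2, 1, 2, 1, 2],     # sex code
    "PINCP": [62000, 48000, 51000, 30000, 75000, 50000],  # personal income
})

# One label per (race, sex) combination, then encode it as an integer group id.
race_sex = df["RAC1P"].astype(str) + "_" + df["SEX"].astype(str)
encoder = LabelEncoder()
groups = encoder.fit_transform(race_sex)

# Binary target: income strictly above $50,000.
y = (df["PINCP"] > 50_000).astype(int)

print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
print(groups.tolist(), y.tolist())
```

With N race categories and two sex categories this yields up to 2N groups, which is why a classifier that balances more than two groups is needed for these tasks.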

In [None]:
log(logging.ERROR)
score_means(income_racesex, n_splits=n_splits, classifiers_cat = {
    "GaussianNB":GaussianNB(),
    "NNB10":NNB10(delta=0.05, max_iter=1000, disc_threshold=1e-3),
    "NNB12":NNB12(delta=0.05, max_iter=1000, disc_threshold=1e-3),
}, classifiers_bin = {
    "GaussianNB_Binary":GaussianNB(),
    "CV2NB":NNB10(delta=0.05, max_iter=1000, disc_threshold=1e-3, use_old_balancing=True)
}, output_order=["Perfect", "GaussianNB_Binary", "CV2NB", "GaussianNB", "NNB10", "NNB12"], include_perfect=True)

In [None]:
compare_groups(
    *split_ds(income_racesex), 
    classifiers_cat={
        "NNB-Parity": NNB10(delta=0.05, max_iter=1000, disc_threshold=1e-2), 
        "NNB-DF": NNB12(delta=0.05, max_iter=1000, disc_threshold=1e-2), 
    }, classifiers_bin={
        "2NB": NNB10(delta=0.05, max_iter=1000, disc_threshold=1e-2, use_old_balancing=True)
    }, 
    rows=2, cols=4, include_actual=True, axis_lim=(0.0, 0.8))

## Employment-Race-Sex

Data: Age, educational attainment, marital status, relationship, disability, employment status of parents, citizenship status, mobility status, military service, ancestry, nativity, hearing difficulty, vision difficulty, cognitive difficulty.

Aim: To predict whether each individual is currently employed

Sensitive Feature(s) Being Tested: Race, Sex

In [11]:
log(logging.ERROR)
score_means(employment_racesex, n_splits=n_splits, classifiers_cat = {
    "GaussianNB":GaussianNB(),
    "NNB10":NNB10(delta=0.05, max_iter=1000, disc_threshold=1e-3),
    "NNB12":NNB12(delta=0.2, max_iter=1000, disc_threshold=1e-3),
}, classifiers_bin = {
    "GaussianNB_Binary":GaussianNB(),
    "CV2NB":NNB10(delta=0.05, max_iter=1000, disc_threshold=1e-3, use_old_balancing=True)
}, output_order=["Perfect", "GaussianNB_Binary", "CV2NB", "GaussianNB", "NNB10", "NNB12"], include_perfect=True)

Mean scores per classifier (the Var columns are empty in this output):

| Metric | Perfect | GaussianNB_Binary | CV2NB | GaussianNB | NNB10 | NNB12 |
|---|---|---|---|---|---|---|
| AUC | 1.0 | 0.81491 | 0.80747 | 0.81491 | 0.80752 | 0.80881 |
| Accuracy | 1.0 | 0.72785 | 0.72694 | 0.72785 | 0.72695 | 0.72181 |
| DIAvgAll | 0.8761 | 1.0878 | 0.9934 | 1.0878 | 0.99335 | 0.99481 |
| EDF-amp | 0.0 | -0.04811 | -0.12566 | -0.04811 | -0.12561 | -0.12708 |
| EDF-amp-R | 0.0 | -0.04811 | -0.12566 | -0.04811 | -0.12561 | -0.12708 |
| EDF-ratio | 0.8761 | 0.91928 | 0.9934 | 0.91928 | 0.99335 | 0.99481 |
| EDF-ratio-R | 0.8761 | 1.0878 | 0.9934 | 1.0878 | 0.99335 | 0.99481 |
| EDF-ε | 0.13227 | 0.08416 | 0.00662 | 0.08416 | 0.00667 | 0.0052 |
| EDF-ε-R | 0.13227 | 0.08416 | 0.00662 | 0.08416 | 0.00667 | 0.0052 |
| Parity | 0.06155 | 0.05432 | 0.00446 | 0.05432 | 0.0045 | 0.00355 |
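The EDF rows above appear to be related to each other: numerically, EDF-ratio matches exp(-ε), and EDF-amp matches the classifier's ε minus the ε of the "Perfect" column (the true labels). These relationships are read off the reported numbers, not confirmed against `scoring.py`. A quick check, with the values copied from the Mean columns:

```python
import math

# (EDF-ε, EDF-ratio, EDF-amp) per classifier, copied from the output above.
rows = {
    "Perfect": (0.13227, 0.8761, 0.0),
    "GaussianNB_Binary": (0.08416, 0.91928, -0.04811),
    "CV2NB": (0.00662, 0.9934, -0.12566),
    "GaussianNB": (0.08416, 0.91928, -0.04811),
    "NNB10": (0.00667, 0.99335, -0.12561),
    "NNB12": (0.0052, 0.99481, -0.12708),
}

eps_data = rows["Perfect"][0]  # ε measured on the true labels
tol = 1e-4                     # the table is rounded to five decimals

checks = {}
for name, (eps, ratio, amp) in rows.items():
    ratio_ok = math.isclose(math.exp(-eps), ratio, abs_tol=tol)
    amp_ok = math.isclose(eps - eps_data, amp, abs_tol=tol)
    checks[name] = ratio_ok and amp_ok

print(checks)
```

Under this reading, a negative EDF-amp means the classifier's predictions are less disparate across groups than the labels themselves.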


In [None]:
log(logging.ERROR)
compare_groups(
    *split_ds(employment_racesex), 
    classifiers_cat={
        "NNB10": NNB10(delta=0.05, max_iter=1000, disc_threshold=1e-2), 
        "NNB12": NNB12(delta=0.2, max_iter=1000, disc_threshold=1e-2), 
    }, classifiers_bin={
        "2NB": NNB10(delta=0.05, max_iter=1000, disc_threshold=1e-3, use_old_balancing=True)
    }, 
    rows=2, cols=3, include_actual=True)