# How special are you? Density estimation on Census data

Density estimation can flag unusual demographic profiles in Census data: the rarer a combination of attributes, the lower its estimated density. We'll use the `adult.data` Census extract hosted in the SHAP repository (exact provenance unknown):

In [1]:
import pandas as pd
import pydens
import shap

def load_SHAP_census_data():
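    '''Load the 'adult' Census dataset hosted in the SHAP repository.

    The column names and dtypes are adapted from SHAP's own parsing code:
    https://github.com/slundberg/shap/blob/master/shap/datasets.py
    Only United States rows are kept; the result has 'Target' renamed to 'Income'.
    '''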
    dtypes = [
        ("Age", "float32"), 
        ("Workclass", "category"), 
        ("fnlwgt", "float32"),
        ("Education", "category"), 
        ("Education-Num", "float32"), 
        ("Marital Status", "category"),
        ("Occupation", "category"), 
        ("Relationship", "category"), 
        ("Race", "category"),
        ("Sex", "category"), 
        ("Capital Gain", "float32"), 
        ("Capital Loss", "float32"),
        ("Hours per week", "float32"), 
        ("Country", "category"), 
        ("Target", "category")
    ]
    df = pd.read_csv(
        "https://github.com/slundberg/shap/raw/master/data/adult.data",
        names=[d[0] for d in dtypes],
        na_values="?",
        dtype=dict(dtypes)
    )
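    # Values in the raw file keep a leading space (fields are separated by ", ").
    df = df[df.Country == ' United-States'].copy()
    # Drop the now-constant Country column, the Education label (already
    # encoded by Education-Num), and the fnlwgt sampling weight.
    df.drop(['Country', 'Education', "fnlwgt"], axis=1, inplace=True)
    df.rename({'Target': 'Income'}, axis=1, inplace=True)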
    return df

df = load_SHAP_census_data()

We'll integer-code the categorical variables because CADE currently requires numeric inputs. Then we train CADE on the census data and sort the original data according to the fitted density:

In [2]:
categorical_cols = [col for col in df.columns if df[col].dtype.name == 'category']
num_data = df.copy()
for col in categorical_cols:
    num_data[col] = num_data[col].cat.codes

classifier = pydens.classifiers.lightgbm.Lgbm(categorical_features=categorical_cols)
cade = pydens.cade.Cade(
    initial_density=pydens.models.JointDensity(),
    classifier=classifier
)
cade.train(num_data, diagnostics=True)
cade.diagnostics['auc']

0.9487577641122965
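
The number printed above is the AUC diagnostic. Assuming it measures how well the classifier separates real rows from rows sampled from the initial density (the pydens internals are not shown in this notebook, so this is an assumption), a value near 0.5 would mean the initial density already resembles the data, while a value near 0.95 means the classifier contributes a large correction. A minimal pure-Python sketch of AUC as a rank statistic, using made-up scores rather than anything from the census model:

```python
def auc(real_scores, fake_scores):
    """Probability that a random real row outscores a random synthetic row.

    Ties count as half a win. This is the pairwise (Mann-Whitney) form of AUC.
    """
    wins = 0.0
    for r in real_scores:
        for f in fake_scores:
            if r > f:
                wins += 1.0
            elif r == f:
                wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))

# Hypothetical classifier outputs, P(row is real); illustrative values only.
real = [0.9, 0.8, 0.7, 0.4]
fake = [0.6, 0.3, 0.2, 0.1]
print(auc(real, fake))
```

The pairwise loop is quadratic, which is fine for illustration; a rank-based formula would be the usual choice on real data.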

In [3]:
df['density'] = cade.density(num_data)
df.sort_values('density', inplace=True)
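
For intuition about what `cade.density` returns: classifier-adjusted density estimation multiplies the initial density f0(x) by the classifier's odds p(x) / (1 - p(x)) that x is real rather than synthetic. The sketch below checks that identity in one dimension, with a Bayes-optimal classifier written in closed form instead of a trained LightGBM model. All function names are illustrative and none of this is pydens code.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # The "true" data density in this toy example.
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def initial_density(x, lo=-5.0, hi=5.0):
    # A deliberately crude initial guess: uniform on [lo, hi].
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def ideal_classifier(x):
    # P(real | x) when real and synthetic points are equally numerous.
    f, f0 = normal_pdf(x), initial_density(x)
    return f / (f + f0)

def cade_density(x):
    # Initial density times the classifier's odds of "real".
    p = ideal_classifier(x)
    return initial_density(x) * p / (1.0 - p)

for x in (-2.0, 0.0, 1.5):
    print(x, cade_density(x), normal_pdf(x))
```

With a perfect classifier the adjusted density recovers the true density exactly; in practice the quality of the estimate depends on how well the trained classifier approximates those odds.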

## Examples of unusual demographics

The most unusual profiles are those with the lowest estimated density. Because `df` is now sorted by ascending density, `df.head()` shows the five rarest:

In [4]:
df.head()

| | Age | Workclass | Education-Num | Marital Status | Occupation | Relationship | Race | Sex | Capital Gain | Capital Loss | Hours per week | Income | density |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10114 | 43.0 | State-gov | 16.0 | Married-spouse-absent | Prof-specialty | Unmarried | White | Male | 25236.0 | 0.0 | 64.0 | >50K | 6.499724e-21 |
| 16984 | 75.0 | ? | 9.0 | Married-AF-spouse | ? | Wife | White | Female | 2653.0 | 0.0 | 14.0 | <=50K | 3.119675e-20 |
| 17609 | 79.0 | Self-emp-inc | 8.0 | Widowed | Sales | Not-in-family | White | Male | 18481.0 | 0.0 | 45.0 | >50K | 3.944482e-20 |
| 2906 | 81.0 | Private | 5.0 | Widowed | Priv-house-serv | Not-in-family | Black | Female | 2062.0 | 0.0 | 5.0 | <=50K | 6.956402e-20 |
| 13107 | 67.0 | Local-gov | 14.0 | Never-married | Exec-managerial | Other-relative | White | Female | 15831.0 | 0.0 | 72.0 | >50K | 9.353933e-20 |
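
Densities this small are hard to compare by eye, so a log scale helps. Using two values copied from this notebook's outputs (the lowest density above, and the 0.000104 displayed for the highest-density rows in the next section), a quick calculation shows that the fitted densities span about 16 orders of magnitude:

```python
import math

lowest = 6.499724e-21   # density of the first row in the table above
highest = 0.000104      # density of the highest-density rows, as displayed (rounded)

log_lowest = math.log10(lowest)
log_highest = math.log10(highest)
span = log_highest - log_lowest  # orders of magnitude between the two

print(round(log_lowest, 2), round(log_highest, 2), round(span, 1))
```

In the notebook itself, a log-density column could be added in the same spirit before plotting or binning the values.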


## The modal American

The most common adult American in this sample (in terms of the features that the Census extract records) is a married white man, about 36 years old, who works 40 hours per week for a private employer in a 'Craft-repair' occupation, has a high school diploma, reports no capital gains or losses, and earns at most $50K. The five highest-density rows are identical:

In [5]:
df.tail()

| | Age | Workclass | Education-Num | Marital Status | Occupation | Relationship | Race | Sex | Capital Gain | Capital Loss | Hours per week | Income | density |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25269 | 36.0 | Private | 9.0 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0.0 | 0.0 | 40.0 | <=50K | 0.000104 |
| 65 | 36.0 | Private | 9.0 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0.0 | 0.0 | 40.0 | <=50K | 0.000104 |
| 22093 | 36.0 | Private | 9.0 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0.0 | 0.0 | 40.0 | <=50K | 0.000104 |
| 17824 | 36.0 | Private | 9.0 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0.0 | 0.0 | 40.0 | <=50K | 0.000104 |
| 9437 | 36.0 | Private | 9.0 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0.0 | 0.0 | 40.0 | <=50K | 0.000104 |
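
Because the highest-density rows are exact duplicates of one another, a plain frequency count over identical feature combinations is a simple cross-check on the mode. The sketch below uses `collections.Counter` on a handful of made-up rows with only four of the features (hypothetical values, not the census data):

```python
from collections import Counter

# Illustrative rows: (Age, Workclass, Occupation, Hours per week).
rows = [
    (36.0, "Private", "Craft-repair", 40.0),
    (36.0, "Private", "Craft-repair", 40.0),
    (43.0, "State-gov", "Prof-specialty", 64.0),
    (36.0, "Private", "Craft-repair", 40.0),
    (81.0, "Private", "Priv-house-serv", 5.0),
]

# Count exact duplicates and pick the most frequent combination.
counts = Counter(rows)
most_common_row, n = counts.most_common(1)[0]
print(most_common_row, n)
```

Exact-match counting only agrees with a density estimate near the mode; unlike the fitted density, it gives no credit to near-duplicates such as the same profile at age 35 or 37.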


In [6]:
# import shmistogram as shmist
# shm = shmist.Shmistogram(df.Age.values)
# shm.plot()