# Identifying the common and the rare in Census data

We can use density estimation to identify a "prototypical American" in Census data. Similarly, we can identify some highly unusual demographics. We'll use the Census data that the SHAP package makes available (exact provenance unknown):

In [1]:
from empdens import models
from empdens.cade import Cade
from empdens.classifiers import lightgbm
from empdens.data import load_SHAP_census_data

df = load_SHAP_census_data()

We'll integer-code the categorical variables because CADE currently requires numeric inputs. Then we train CADE on the census data and sort the original data according to the fitted density:

In [2]:
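# Columns stored with the pandas "category" dtype.
categoricals = [col for col in df.columns if df[col].dtype.name == "category"]

# CADE is built from an initial density and a classifier.
classifier = lightgbm.Lgbm()
cade = Cade(
    initial_density=models.JointDensity(),
    classifier=classifier,
)
cade.train(df, diagnostics=True)
cade.diagnostics["auroc"]  # AUROC recorded by the training diagnostics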

np.float64(0.9460550739293473)
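
As a minimal sketch of what integer-coding categorical columns can look like in pandas, the example below uses a small toy frame rather than the census data. The values and the names `toy`, `coded`, and `labels` are illustrative only and are not part of empdens; how (or whether) empdens performs this step internally is not shown in this notebook.

```python
import pandas as pd

# Toy frame standing in for the census data (hypothetical values).
toy = pd.DataFrame(
    {
        "Age": [36.0, 43.0, 75.0],
        "Workclass": pd.Categorical(["Private", "State-gov", "Private"]),
        "Sex": pd.Categorical(["Male", "Male", "Female"]),
    }
)

# Same selection rule as in the notebook cell.
categoricals = [col for col in toy.columns if toy[col].dtype.name == "category"]

coded = toy.copy()
for col in categoricals:
    # .cat.codes maps each value to its position in .cat.categories.
    coded[col] = coded[col].cat.codes

# Keep the code-to-label mapping so coded rows can be translated back.
labels = {col: dict(enumerate(toy[col].cat.categories)) for col in categoricals}
```

Keeping `labels` around makes it possible to report results with the original category names after modelling on the integer codes.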

In [3]:
ddf = df.drop_duplicates().copy()
ddf["density"] = cade.density(ddf)
ddf = ddf.sort_values("density")
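
For intuition about what a classifier-adjusted density represents, here is a small NumPy sketch of the usual formulation: a classifier is trained to separate real rows from rows sampled from the initial density, and the initial density is then rescaled by the classifier's odds. That empdens's `Cade` follows exactly this recipe is an assumption; the one-dimensional setup, `normal_pdf`, and the uniform initial density are hypothetical and chosen only so the result can be checked by hand.

```python
import numpy as np


def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of a normal distribution, written out to avoid extra dependencies."""
    z = (x - mean) / sd
    return np.exp(-0.5 * z**2) / (sd * np.sqrt(2.0 * np.pi))


x = np.linspace(-3.0, 3.0, 13)

true_density = normal_pdf(x)            # stands in for the unknown data density f
initial_density = np.full_like(x, 0.1)  # Uniform(-5, 5): a deliberately crude guess g

# With equal numbers of real and synthetic rows, an ideal classifier outputs
# P(real | x) = f(x) / (f(x) + g(x)).
p_real = true_density / (true_density + initial_density)

# Classifier adjustment: scale the initial density by the odds of "real".
adjusted = initial_density * p_real / (1.0 - p_real)
```

With an ideal classifier the adjusted density equals the true density exactly. In practice the classifier is only an approximation, so the quality of the density estimate depends on how well it is fitted.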

## A typical American

The most common adult American demographic (among the sampled participants, and in terms of the features that the census collects) is a married white man in his early-to-mid 30s with a high school diploma who earns at most 50k, works 40 hours per week for a private employer in 'craft-repair', and reports no capital gains or losses.

In [4]:
ddf.tail().sort_values("density", ascending=False)

| | Age | Workclass | Education-Num | Marital Status | Occupation | Relationship | Race | Sex | Capital Gain | Capital Loss | Hours per week | Income | density |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 65 | 36.0 | Private | 9.0 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0.0 | 0.0 | 40.0 | <=50K | 0.000102 |
| 3277 | 35.0 | Private | 9.0 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0.0 | 0.0 | 40.0 | <=50K | 0.000101 |
| 2675 | 34.0 | Private | 9.0 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0.0 | 0.0 | 40.0 | <=50K | 0.0001 |
| 5086 | 33.0 | Private | 9.0 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0.0 | 0.0 | 40.0 | <=50K | 9.9e-05 |
| 6266 | 31.0 | Private | 9.0 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0.0 | 0.0 | 40.0 | <=50K | 9.8e-05 |


## Examples of rare demographics

The rarest demographics are those at points of lowest density. Here are a few examples:

In [5]:
ddf.head()

| | Age | Workclass | Education-Num | Marital Status | Occupation | Relationship | Race | Sex | Capital Gain | Capital Loss | Hours per week | Income | density |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10114 | 43.0 | State-gov | 16.0 | Married-spouse-absent | Prof-specialty | Unmarried | White | Male | 25236.0 | 0.0 | 64.0 | >50K | 8.213052e-21 |
| 16984 | 75.0 | ? | 9.0 | Married-AF-spouse | ? | Wife | White | Female | 2653.0 | 0.0 | 14.0 | <=50K | 1.386095e-20 |
| 2906 | 81.0 | Private | 5.0 | Widowed | Priv-house-serv | Not-in-family | Black | Female | 2062.0 | 0.0 | 5.0 | <=50K | 1.135171e-19 |
| 17609 | 79.0 | Self-emp-inc | 8.0 | Widowed | Sales | Not-in-family | White | Male | 18481.0 | 0.0 | 45.0 | >50K | 1.194463e-19 |
| 13107 | 67.0 | Local-gov | 14.0 | Never-married | Exec-managerial | Other-relative | White | Female | 15831.0 | 0.0 | 72.0 | >50K | 1.632015e-19 |
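
Because the fitted densities span many orders of magnitude, it can be easier to compare them as ratios on a log scale. The sketch below uses only the two extreme values printed in this notebook's output (rows 65 and 10114); the variable names are illustrative.

```python
import math

typical_density = 0.000102     # highest density shown (row 65)
rarest_density = 8.213052e-21  # lowest density shown (row 10114)

# How many times denser the most typical row is than the rarest one.
ratio = typical_density / rarest_density

# The same comparison expressed in orders of magnitude.
orders_of_magnitude = math.log10(ratio)
```

Under the fitted model, the most typical row is roughly 16 orders of magnitude denser than the rarest one.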
