# Identifying the common and the rare in Census data

We can use density estimation to identify a "prototypical American" in Census data. Similarly, we can identify some highly unusual demographics. We'll use the Census data that the SHAP package makes available (exact provenance unknown):

In [1]:
from empdens import models
from empdens.cade import Cade
from empdens.classifiers import lightgbm
from empdens.data import load_SHAP_census_data

df = load_SHAP_census_data()

We'll integer-code the categorical variables because CADE currently requires numeric inputs. Then we train CADE on the census data and sort the original data according to the fitted density:

In [2]:
categoricals = [col for col in df.columns if df[col].dtype.name == "category"]
classifier = lightgbm.Lgbm()
cade = Cade(
    initial_density=models.JointDensity(),
    classifier=lightgbm.Lgbm(),
)
cade.train(df, diagnostics=True)
cade.diagnostics["auc"]

np.float64(1.0)

In [3]:
df["density"] = cade.density(df)
ddf = df.copy().drop_duplicates()
ddf.sort_values("density", inplace=True)

## A typical American

The most common adult American demographic (in terms of the sample participants and the features that the census collects) is a mid-30s married white male who leverages a high school diploma to earn less than 50k while working 40 hours per week for a private employer in 'craft-repair' and accrues no large capital gains or losses.

In [4]:
ddf.tail()

Unnamed: 0,Age,Workclass,Education-Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Income,density
3490,69.0,Private,5.0,Married-civ-spouse,Other-service,Husband,White,Male,1424.0,0.0,35.0,<=50K,3.3465270000000003e-22
2694,27.0,Private,10.0,Married-civ-spouse,Craft-repair,Husband,White,Male,2829.0,0.0,70.0,<=50K,4.662935000000001e-22
20907,44.0,Private,10.0,Married-civ-spouse,Craft-repair,Husband,White,Male,4386.0,0.0,55.0,<=50K,5.979317e-22
29748,42.0,Private,9.0,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0.0,1579.0,42.0,<=50K,6.814572e-22
29731,21.0,Private,10.0,Never-married,Adm-clerical,Own-child,White,Female,0.0,1721.0,35.0,<=50K,8.611145e-22


## Examples of rare demographics

The rarest demographics are those at points of lowest density. Here are a few examples:

In [5]:
df.head()

Unnamed: 0,Age,Workclass,Education-Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Income,density
0,39.0,State-gov,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,<=50K,4.9312110000000003e-29
1,50.0,Self-emp-not-inc,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,<=50K,2.848834e-30
2,38.0,Private,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,<=50K,4.395886e-31
3,53.0,Private,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,<=50K,2.459196e-31
5,37.0,Private,14.0,Married-civ-spouse,Exec-managerial,Wife,White,Female,0.0,0.0,40.0,<=50K,3.736858e-28


In [6]:
# import shmistogram as shmist
# shm = shmist.Shmistogram(df.Age.values)
# shm.plot()