# Identifying the common and the rare in Census data

We can use density estimation to identify a "prototypical American" in Census data. Similarly, we can identify some highly unusual demographics. We'll use the Census data that the SHAP package makes available (exact provenance unknown):

In [1]:
import empdens
from empdens.data import load_SHAP_census_data
from empdens.classifiers import lightgbm
from empdens.cade import Cade
from empdens import models

df = load_SHAP_census_data()

We'll integer-code the categorical variables because CADE currently requires numeric inputs. Then we train CADE on the census data and sort the original data according to the fitted density:

In [2]:
categorical_cols = [col for col in df.columns if df[col].dtype.name == "category"]
num_data = df.copy()
for col in categorical_cols:
    num_data[col] = num_data[col].cat.codes

classifier = lightgbm.Lgbm(categorical_features=categorical_cols)
cade = Cade(initial_density=models.JointDensity(), classifier=classifier)
cade.train(num_data, diagnostics=True)
cade.diagnostics["auc"]



The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  xdf.is_loner.replace(False, True, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  xdf.is_loner.replace(False, True, inplace=True)


np.float64(1.0)

In [3]:
df["density"] = cade.density(num_data)
ddf = df.copy().drop_duplicates()
ddf.sort_values("density", inplace=True)

## A typical American

The most common adult American demographic (in terms of the sample participants and the features that the census collects) is a mid-30s married white male who leverages a high school diploma to earn less than 50k while working 40 hours per week for a private employer in 'craft-repair' and accrues no large capital gains or losses.

In [4]:
ddf.tail()

Unnamed: 0,Age,Workclass,Education-Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Income,density
7719,69.0,Private,10.0,Widowed,Adm-clerical,Not-in-family,White,Female,2050.0,0.0,24.0,<=50K,8.942537e-52
27029,60.0,Self-emp-not-inc,9.0,Divorced,Sales,Not-in-family,Black,Male,2597.0,0.0,55.0,<=50K,1.7807799999999998e-51
29731,21.0,Private,10.0,Never-married,Adm-clerical,Own-child,White,Female,0.0,1721.0,35.0,<=50K,3.162745e-51
28840,21.0,Private,9.0,Never-married,Priv-house-serv,Not-in-family,White,Female,0.0,0.0,25.0,<=50K,8.293535e-51
20109,44.0,Private,11.0,Never-married,Priv-house-serv,Not-in-family,White,Male,594.0,0.0,25.0,<=50K,2.63289e-49


## Examples of rare demographics

The rarest demographics are those at points of lowest density. Here are a few examples:

In [5]:
df.head()

Unnamed: 0,Age,Workclass,Education-Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Income,density
0,39.0,State-gov,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,<=50K,9.324532e-58
1,50.0,Self-emp-not-inc,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,<=50K,1.184076e-59
2,38.0,Private,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,<=50K,3.335201e-60
3,53.0,Private,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,<=50K,3.335201e-60
5,37.0,Private,14.0,Married-civ-spouse,Exec-managerial,Wife,White,Female,0.0,0.0,40.0,<=50K,3.1287419999999996e-57


In [6]:
# import shmistogram as shmist
# shm = shmist.Shmistogram(df.Age.values)
# shm.plot()