# Basic usage and testing

We'll demonstrate the application of CADE on simulated data and compare its performance with several other density estimators.

## Simulate data

Define a problem by simulating some data from a bivariate distribution:

In [1]:
import numpy as np
import pandas as pd

from empdens import cade, classifiers, evaluation, models, simulators
from empdens.wrappers.fast_kde import FastKDE
from empdens.wrappers.sklearn_isolation_forest import SklearnIsolationForest
from empdens.wrappers.sklearn_kde import SklearnKDE

np.random.seed(0)
sz = simulators.Zena()
df = sz.rvs(1000)
df.describe()

        gaussian  triangular
count     1000.0      1000.0
mean    0.050142    1.038288
std     0.951613    0.741677
min    -1.990214    0.000551
25%    -0.628853    0.410971
50%    -0.017278     0.92043
75%     0.653287    1.586806
max     3.516386    2.884844


## Train CADE

CADE works by first fitting a naive initial joint density model, then
improving its density estimates with a classifier that
tries to distinguish the real data from fake data sampled
from the initial density model:

In [2]:
# All arguments can be omitted; the defaults are shown here to be explicit:
cc = cade.Cade(initial_density=models.JointDensity(), classifier=classifiers.lightgbm.Lgbm())
cc.train(df, diagnostics=True)
print("CADE real-vs-fake classifier AUROC = " + str(cc.diagnostics["auc"])[:6])

CADE real-vs-fake classifier AUROC = 0.9112
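
To see why a real-vs-fake classifier can correct a density estimate, consider the idealized case with equal numbers of real and fake points. The best possible classifier then assigns a point the probability p = f / (f + f0) of being real, where f is the true density and f0 the initial one, so multiplying f0 by the odds p / (1 - p) recovers f. The sketch below checks that identity with two made-up normal densities. All function names are hypothetical, and whether `empdens` applies the adjustment in exactly this form is an assumption; a trained classifier only approximates the ideal one.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def true_density(x):
    # Stand-in for the unknown density of the real data (illustrative choice).
    return normal_pdf(x, mu=1.0, sigma=0.5)


def initial_density(x):
    # Stand-in for a naive initial model that the fake data is sampled from.
    return normal_pdf(x, mu=0.0, sigma=2.0)


def ideal_classifier(x):
    # With equal numbers of real and fake points, the Bayes-optimal
    # probability that x is real is f(x) / (f(x) + f0(x)).
    f, f0 = true_density(x), initial_density(x)
    return f / (f + f0)


def adjusted_density(x):
    # Multiply the initial density by the classifier odds p / (1 - p).
    p = ideal_classifier(x)
    return initial_density(x) * p / (1.0 - p)


for x in (-1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  true={true_density(x):.6f}  adjusted={adjusted_density(x):.6f}")
```

The two printed columns agree up to floating-point error. In practice the quality of the adjusted estimate depends on how closely the trained classifier approaches this ideal.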


The AUROC (area under the receiver operating characteristic curve, also called AUC) has both theoretical and practical interpretations. A score much greater than 0.5 indicates substantial differences between the fake and real data, and so reflects how much the classifier improves on the initial density estimate. An extremely high AUROC, however, is a warning flag: if the classifier achieves near-perfect separation between the real and fake data, it may be achieving that separation "too easily", without taking all of the structure in the data into consideration. Todo: How high is too high?
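
One way to build intuition for these thresholds: the AUROC equals the probability that a randomly chosen real point receives a higher classifier score than a randomly chosen fake point, with ties counted as one half. The following minimal sketch computes it by brute force over all pairs. The function name `auroc` and the score lists are made up for illustration and are not part of `empdens`.

```python
def auroc(real_scores, fake_scores):
    """Fraction of (real, fake) pairs in which the real point scores higher.

    Ties count as half a win, which makes a constant classifier score 0.5.
    """
    wins = 0.0
    for r in real_scores:
        for f in fake_scores:
            if r > f:
                wins += 1.0
            elif r == f:
                wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))


# Made-up classifier outputs, purely for illustration.
uninformative = auroc([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])  # classifier learned nothing
separated = auroc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1])      # perfect separation
overlapping = auroc([0.9, 0.6, 0.4], [0.7, 0.3, 0.2])    # partial separation

print(uninformative, separated, overlapping)
```

The pairwise loop is quadratic in the sample size, so it is only suitable for small illustrations.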

## Train other density estimators

Let's also train fastKDE (`pip install fastkde`), sklearn's KernelDensity, and sklearn's Isolation Forest (technically an anomaly scorer rather than a density estimator):

In [3]:
estimators = [
    FastKDE(),
    SklearnKDE(),
    SklearnIsolationForest(),
]

for e in estimators:
    e.train(df)

## Performance evaluation

Let's compare the performance of the estimators on new data from the same simulation:

In [4]:
new_df = sz.rvs(1000)
estimators_dict = {type(e).__name__: e.density(new_df) for e in [cc] + estimators}
ev = evaluation.Evaluation(estimators=estimators_dict, truth=sz.density(new_df))
pd.set_option("display.precision", 4)
print(ev.evaluate())

                          Cade  FastKDE  SklearnKDE  SklearnIsolationForest
mean_absolute_error     0.6651   0.0182      0.0652                  0.3782
mean_squared_error      0.9129   0.0008      0.0071                  0.1452
rank-order correlation  0.8623   0.9525      0.7847                  0.7910
pearson correlation     0.8187   0.9388      0.7839                  0.7471
mean density            0.7884   0.1155      0.0648                  0.5061


Here `fastKDE` dominates. This is not too surprising considering that this simulation

- is low-dimensional,
- has a smooth and simple structure,
- and includes only numeric features.

CADE, however, is competitive with the other estimators on the correlation metrics (second only to `fastKDE` on both), although its absolute and squared errors are the largest in this comparison. It is also able to handle categorical features in addition to continuous ones (see `census_demographics.ipynb` for an example).
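
For readers who want to reproduce the comparison by hand, the sketch below shows one plausible way to compute metrics with the same names as the rows of the table. It assumes each metric compares a list of estimated densities against the true densities at the same points, and that "mean density" is the average of the estimates. These are assumptions about `evaluation.Evaluation`, not a description of its implementation, and the helper names are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def compare(estimate, truth):
    errors = [e - t for e, t in zip(estimate, truth)]
    return {
        "mean_absolute_error": mean([abs(d) for d in errors]),
        "mean_squared_error": mean([d * d for d in errors]),
        # Spearman's rank correlation is Pearson's correlation of the ranks.
        "rank-order correlation": pearson(average_ranks(estimate), average_ranks(truth)),
        "pearson correlation": pearson(estimate, truth),
        "mean density": mean(estimate),
    }


# Made-up densities: the estimate is exactly twice the truth.
truth = [0.1, 0.2, 0.4, 0.8]
estimate = [0.2, 0.4, 0.8, 1.6]
metrics = compare(estimate, truth)
for name, value in metrics.items():
    print(f"{name:24s}{value:.4f}")
```

In this toy example both correlations equal 1 even though the absolute error is large, which illustrates that the error rows and the correlation rows measure different things: the former are sensitive to the overall scale of the estimates, the latter only to how well the estimates track the truth.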