# Basic usage and testing

We'll demonstrate the application of CADE on simulated data and compare its performance with several other density estimators.

## Simulate data

Define a problem by simulating some data from a bivariate distribution:

In [2]:
import numpy as np
import pandas as pd
import pydens

np.random.seed(0)
sz = pydens.simulators.Zena()
df = sz.rvs(1000)
df.head()
# Some example values:

   gaussian  triangular
0  0.148588    1.085822
1  0.587739    0.015134
2  0.283965    0.828005
3  0.138861    1.381029
4 -0.159206    0.066705


## Train CADE

CADE works by first fitting a naive initial joint density model, then
improving its density estimates with a classifier trained to
distinguish real data from fake data sampled
from that initial density model:

In [2]:
# All arguments can be omitted; the defaults are shown here to be explicit:
cade = pydens.cade.Cade(
    initial_density=pydens.models.JointDensity(),
    classifier=pydens.classifiers.lightgbm.Lgbm()
)
cade.train(df, diagnostics=True)
print("CADE real-vs-fake classifier AUROC = " + str(cade.diagnostics['auc'])[:6])

CADE real-vs-fake classifier AUROC = 0.7491


The AUROC score (area under the receiver operating characteristic curve, also called AUC) has both theoretical and practical interpretations. A score much greater than 0.5 indicates substantial differences between the real data and the fake data sampled from the initial density model, and so reflects how much the classifier improves on the initial density estimate. However, an extremely high AUROC is a warning flag: if the classifier achieves near-perfect separation between real and fake data, it may be doing so "too easily", without taking all of the structure of the data into account. TODO: How high is too high?
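To make the real-vs-fake idea concrete, here is a minimal one-dimensional sketch. It relies on the standard density-ratio identity: if a calibrated classifier outputs c(x), the probability that x is real, then the real-to-fake density ratio is the odds c(x) / (1 - c(x)), rescaled by the ratio of sample sizes. Everything here is an assumption made for illustration: the uniform initial model, the histogram "classifier", and all function names are stand-ins, not the pydens implementation of `JointDensity` or `Lgbm`. The sketch uses only the standard library.

```python
import math
import random

random.seed(0)
N = 50_000                        # samples drawn per class
LO, HI, N_BINS = -4.0, 4.0, 40
WIDTH = (HI - LO) / N_BINS

def true_density(x):
    """Standard normal pdf: the 'unknown' target of this toy problem."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def initial_density(x):
    """Deliberately naive initial model: uniform on [LO, HI]."""
    return 1.0 / (HI - LO) if LO <= x <= HI else 0.0

def bin_index(x):
    return min(int((x - LO) / WIDTH), N_BINS - 1)

# Real data (the rare points outside [LO, HI] are dropped) and fake data
# sampled from the initial model.
real = [x for x in (random.gauss(0.0, 1.0) for _ in range(N)) if LO <= x <= HI]
fake = [random.uniform(LO, HI) for _ in range(N)]

def counts(xs):
    out = [0] * N_BINS
    for x in xs:
        out[bin_index(x)] += 1
    return out

real_counts, fake_counts = counts(real), counts(fake)

def classifier(x):
    """Toy classifier: fraction of real points in the bin containing x."""
    b = bin_index(x)
    return real_counts[b] / (real_counts[b] + fake_counts[b])

def corrected_density(x):
    """Initial density times the classifier odds (rescaled by class sizes)."""
    c = classifier(x)
    odds = c / (1.0 - c) * len(fake) / len(real)
    return initial_density(x) * odds

def auroc(pos, neg):
    """P(pos score > neg score) + 0.5 * P(tie), computed from sorted scores."""
    labeled = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg])
    wins, neg_below, i = 0.0, 0, 0
    while i < len(labeled):
        j, pos_tie, neg_tie = i, 0, 0
        while j < len(labeled) and labeled[j][0] == labeled[i][0]:
            pos_tie += labeled[j][1]
            neg_tie += 1 - labeled[j][1]
            j += 1
        wins += pos_tie * (neg_below + 0.5 * neg_tie)
        neg_below += neg_tie
        i = j
    return wins / (len(pos) * len(neg))

# In-sample AUROC of the toy classifier (real = positive class).
auc = auroc([classifier(x) for x in real], [classifier(x) for x in fake])

# Compare both density estimates against the truth at the bin centers.
centers = [LO + (b + 0.5) * WIDTH for b in range(N_BINS)]
err_initial = sum(abs(initial_density(x) - true_density(x)) for x in centers) / N_BINS
err_corrected = sum(abs(corrected_density(x) - true_density(x)) for x in centers) / N_BINS

print("AUROC:", round(auc, 4))
print("mean abs. error, initial:  ", round(err_initial, 4))
print("mean abs. error, corrected:", round(err_corrected, 4))
```

In this toy setting the AUROC is well above 0.5 because the uniform initial model is a poor fit, and the classifier odds absorb most of that misfit. If the initial model already matched the data, the classifier would hover near 0.5 and the correction factor would stay close to 1.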

## Train other density estimators

Let's also train fastKDE (installed with `pip install fastkde`), sklearn's KernelDensity, and sklearn's IsolationForest (which technically produces an anomaly score rather than a density estimate):

In [3]:
estimators = [
    pydens.wrappers.FastKDE(),
    pydens.wrappers.SklearnKDE(),
    pydens.wrappers.SklearnIsolationForest()
]

for e in estimators:
    e.train(df)

## Performance evaluation

Let's compare the performance of the estimators on new data from the same simulation:

In [4]:
new_df = sz.rvs(1000)
estimators_dict = {type(e).__name__: e.density(new_df) for e in [cade] + estimators}
ev = pydens.evaluation.Evaluation(
    estimators=estimators_dict,
    truth=sz.density(new_df)
)
pd.set_option('display.precision', 4)
print(ev.evaluate())

                          Cade  FastKDE  SklearnKDE  SklearnIsolationForest
mean_absolute_error     0.0299   0.0181      0.0652                  0.3854
mean_squared_error      0.0017   0.0008      0.0071                  0.1506
rank-order correlation  0.8564   0.9525      0.7847                  0.8359
pearson correlation     0.8362   0.9388      0.7839                  0.7796
mean density            0.1236   0.1155      0.0648                  0.5134
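
For readers who want to see what the rows of this table measure, the following is a rough sketch of each metric using only the standard library. It assumes the usual definitions: "rank-order correlation" is taken to be Spearman's correlation (Pearson correlation of the ranks), and "mean density" is taken to be the average of the estimated densities. These are assumptions about `pydens.evaluation.Evaluation`, not a copy of its code, and the helper names are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def ranks(xs):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = average_rank
        i = j + 1
    return out

def evaluate(estimate, truth):
    """Compare estimated densities with true densities at the same points."""
    errors = [e - t for e, t in zip(estimate, truth)]
    return {
        "mean_absolute_error": mean([abs(d) for d in errors]),
        "mean_squared_error": mean([d * d for d in errors]),
        "rank-order correlation": pearson(ranks(estimate), ranks(truth)),
        "pearson correlation": pearson(estimate, truth),
        "mean density": mean(estimate),
    }

# Made-up values: the estimate is a monotone but non-linear distortion of the truth.
truth = [0.1, 0.2, 0.4, 0.8]
estimate = [t * t for t in truth]
scores = evaluate(estimate, truth)
for name, value in scores.items():
    print(f"{name:24s}{value:.4f}")
```

The made-up example illustrates why the table reports both correlations: a monotone distortion of the true density keeps the rank-order correlation at its maximum while lowering the Pearson correlation, which matters for an estimator such as an isolation forest whose output is a score rather than a calibrated density.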
