# Basic usage and testing

The demo compares the performance of CADE, fastKDE, and sklearn's KernelDensity on a simulated data set.

## Simulate data

Define a problem by simulating some data from a bivariate distribution:

In [1]:
import numpy as np
import pandas as pd
import pydens

np.random.seed(0)
sz = pydens.simulators.bivariate.Zena()
df = sz.rvs(1000)
df.head()
# Some example values:

   gaussian  triangular
0  0.148588    1.085822
1  0.587739    0.015134
2  0.283965    0.828005
3  0.138861    1.381029
4 -0.159206    0.066705


## Train CADE

Use Cade to estimate the density of the data. Cade works by
first fitting an initial naive joint density model and then
improving the initial density estimates with a classifier that
tries to distinguish between real data and fake data sampled
from the initial density model:

In [5]:
# All arguments can be omitted; displaying defaults here to be explicit:
cade = pydens.cade.Cade(
    initial_density=pydens.models.JointDensity(),
    classifier=pydens.classifiers.lightgbm.Lgbm()
)
cade.train(df, diagnostics=True)
print("CADE real-vs-fake classifier AUROC = " + str(cade.diagnostics['auc'])[:6])

CADE real-vs-fake classifier AUROC = 0.7495
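
The classifier adjustment described above rests on a standard identity: when the classifier sees equal numbers of real and fake samples, its ideal output is P(real | x) = p_real(x) / (p_real(x) + p_fake(x)), so p_real(x) = p_fake(x) * P / (1 - P). The following is a minimal one-dimensional sketch of that idea using only NumPy. It is not the pydens implementation: the Gaussian "initial model", the hand-rolled logistic regression, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def sigmoid(z):
    # Clip the logits to avoid overflow in exp for extreme inputs.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def featurize(x):
    """Features [1, x, x^2]; the log-ratio of two Gaussians is quadratic in x."""
    return np.column_stack([np.ones_like(x), x, x ** 2])

def fit_logistic(features, labels, n_iter=25, ridge=1e-6):
    """Logistic regression fitted with Newton's method (illustrative stand-in classifier)."""
    w = np.zeros(features.shape[1])
    for _ in range(n_iter):
        p = sigmoid(features @ w)
        grad = features.T @ (p - labels)
        hess = (features * (p * (1.0 - p))[:, None]).T @ features
        hess += ridge * np.eye(len(w))
        w -= np.linalg.solve(hess, grad)
    return w

rng = np.random.default_rng(0)
n = 5000
real = rng.normal(0.0, 1.0, size=n)      # "real" data; its density is treated as unknown
init_mean, init_std = 0.0, 2.0           # deliberately poor initial density model
fake = rng.normal(init_mean, init_std, size=n)  # fake data sampled from the initial model

# Train the real-vs-fake classifier on a balanced data set (label 1 = real, 0 = fake).
x_train = np.concatenate([real, fake])
y_train = np.concatenate([np.ones(n), np.zeros(n)])
w = fit_logistic(featurize(x_train), y_train)

def adjusted_density(x):
    """Initial density multiplied by the classifier odds p / (1 - p)."""
    p_real = sigmoid(featurize(x) @ w)
    return gaussian_pdf(x, init_mean, init_std) * p_real / (1.0 - p_real)

# Compare both estimates with the true density on a small grid.
grid = np.linspace(-2.0, 2.0, 9)
true_density = gaussian_pdf(grid, 0.0, 1.0)
initial_error = np.mean(np.abs(gaussian_pdf(grid, init_mean, init_std) - true_density))
adjusted_error = np.mean(np.abs(adjusted_density(grid) - true_density))
print(f"mean abs error, initial model:  {initial_error:.4f}")
print(f"mean abs error, after adjusting: {adjusted_error:.4f}")
```

In this toy setting the classifier family contains the exact log density ratio, so the adjustment can recover the true density almost perfectly. With real data and a flexible classifier, the quality of the adjustment depends on how well the classifier's probabilities are calibrated.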


The AUROC score (area under the receiver operating characteristic curve, also called AUC) has both theoretical and practical interpretations. An AUROC substantially greater than 0.5 indicates substantial differences between the real data and the fake data sampled from the initial density model, and so reflects the degree to which the classifier improves upon the initial density estimate. However, an extremely high AUROC is a warning flag: if the classifier achieves near-perfect separation between the real and fake data, there is a risk that it does so without taking all the features into consideration, which would render the classifier adjustment useless.
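
One way to read these numbers is through the pairwise interpretation of AUROC: it equals the probability that a randomly chosen real sample receives a higher classifier score than a randomly chosen fake sample, with ties counted as one half. The sketch below computes it directly from that definition in plain Python. The function name and the example scores are made up for illustration and are not taken from pydens.

```python
def auroc(scores_real, scores_fake):
    """Fraction of (real, fake) pairs in which the real sample scores higher.

    Ties contribute 0.5, so a classifier that cannot tell the two groups
    apart scores 0.5 and a perfect separator scores 1.0.
    """
    wins = 0.0
    for r in scores_real:
        for f in scores_fake:
            if r > f:
                wins += 1.0
            elif r == f:
                wins += 0.5
    return wins / (len(scores_real) * len(scores_fake))

# Perfect separation: every real score exceeds every fake score.
perfect = auroc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1])

# No information: all scores are identical.
chance = auroc([0.5, 0.5], [0.5, 0.5])

# Partial overlap: 6 of the 9 pairs are ordered correctly.
partial = auroc([0.8, 0.6, 0.4], [0.7, 0.5, 0.3])

print(perfect, chance, round(partial, 4))
```

The double loop is quadratic in the sample size and is meant only to make the definition concrete; library implementations use a sort-based computation instead.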

## Train other density estimators

Let's also train fastKDE (pip install fastkde) and sklearn's KernelDensity:

In [6]:
fkde = pydens.wrappers.FastKDE() # pass params = <a dictionary of fastKDE arguments> to use non-default values
fkde.train(df)

skde = pydens.wrappers.SklearnKDE()
skde.train(df)

## Performance evaluation

Compare the performance of the estimators on new data from the same simulation:

In [4]:
new_df = sz.rvs(1000)
ev = pydens.evaluation.Evaluation(
    estimators={type(e).__name__: e.density(new_df) for e in [cade, fkde, skde]},
    truth=sz.density(new_df)
)
pd.set_option('display.precision', 3)
print(ev.evaluate())

                                    Cade  FastKDE  SklearnKDE
rank-order correlation with truth  0.837    0.957       0.782
mean density                       0.124    0.117       0.065
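
To make the first row of the table more concrete, here is a minimal sketch of a rank-order correlation, assuming the metric is a Spearman-style correlation (the Pearson correlation of the ranks). How pydens computes it internally is not shown in this demo, so the function below and its example values are illustrative assumptions only. The sketch does not handle tied values.

```python
import numpy as np

def rank_order_correlation(estimate, truth):
    """Pearson correlation of the ranks of two equally long arrays (no tie handling)."""
    def ranks(values):
        values = np.asarray(values, dtype=float)
        r = np.empty(len(values))
        r[np.argsort(values)] = np.arange(len(values))  # rank 0 = smallest value
        return r
    return np.corrcoef(ranks(estimate), ranks(truth))[0, 1]

truth = [0.1, 0.4, 0.2, 0.8]

# Same ordering as the truth but on a different scale: correlation is 1.
rescaled = [1.0, 4.0, 2.0, 8.0]
corr_rescaled = rank_order_correlation(rescaled, truth)

# Two neighbouring points swapped relative to the truth: correlation drops to 0.8.
swapped = [0.1, 0.2, 0.4, 0.8]
corr_swapped = rank_order_correlation(swapped, truth)

print(round(corr_rescaled, 3), round(corr_swapped, 3))
```

Because only the ordering matters, a rank-order correlation says nothing about whether the estimated densities are on the right scale.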
