# Learning univariate, model-free and deterministic distributions

This tutorial walks through the reasoning, the math, and the usage of learning univariate, model-free, deterministic distributions from data.

In most machine learning scenarios the distribution of the data is unknown. Learning a distribution is then usually done by assuming a functional form and fitting its parameters to the data.
This is called parametric learning.

In this tutorial we will learn a distribution without assuming a functional form. This is called non-parametric learning.

Let $\mathcal{D}$ be a dataset of $N$ samples $\mathcal{D} = \{d_1, d_2, \dots, d_N\}$, where $d_i \in \mathbb{R}$.

In [3]:
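import numpy as np

np.random.seed(69)  # fix the seed for reproducibility
# Sample from a two-component Gaussian mixture: 1000 points from N(0, 1) and 1000 points from N(5, 0.5^2)
dataset = np.concatenate((np.random.normal(0, 1, size=1000), np.random.normal(5, 0.5, size=1000)))
dataset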

array([ 0.9155071 , -0.60354197,  1.16229517, ...,  4.99355308,
        5.14493969,  5.50211339])

The idea is to create as many components in a mixture of uniform distributions as are needed to achieve a good fit to the data. The quality of the fit is measured by the average likelihood of the data under the model.

$$
f_{average}(\mathcal{D} | \boldsymbol{\Theta}) = \frac{1}{N} \sum_{i=1}^N p(d_i | \boldsymbol{\Theta}),
$$
where $\boldsymbol{\Theta}$ represents the parameters of the model.
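
To make the average likelihood concrete, here is a minimal sketch that evaluates it for the simplest possible model, a single uniform distribution. The function name `average_likelihood_uniform` and the toy samples are illustrative assumptions and are not part of the library.

```python
import numpy as np


def average_likelihood_uniform(data: np.ndarray, lower: float, upper: float) -> float:
    """Average density of `data` under a uniform distribution on [lower, upper].

    Hypothetical helper for illustration only.
    """
    density = 1.0 / (upper - lower)
    # Points outside the interval have density zero.
    inside = (data >= lower) & (data <= upper)
    return float(np.mean(np.where(inside, density, 0.0)))


samples = np.array([0.0, 1.0, 2.0, 4.0])
# All samples lie inside [0, 4], so every sample contributes the density 1/4.
print(average_likelihood_uniform(samples, samples.min(), samples.max()))
```

If the interval covers all samples, the average likelihood is simply the density of the uniform distribution, which is the baseline a split has to beat.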

We will solve this problem using a recursive partitioning scheme. First, we sort the data and remove duplicates.
Next, we iterate through every possible partition of the sorted data into two datasets

$$
\begin{aligned}
\mathcal{D}_{left} &= \{d_1, d_2, \dots, d_k\} \\
\mathcal{D}_{right} &= \{d_{k+1}, d_{k+2}, \dots, d_N\}
\end{aligned}
$$

The split value is the midpoint between $d_k$ and $d_{k+1}$, which is the point that maximizes the distance to both neighboring samples.

$$
d_{split} = \frac{d_k + d_{k+1}}{2}
$$
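
The preprocessing and the candidate split values can be sketched with NumPy as follows. The toy array and the variable names are assumptions made for illustration.

```python
import numpy as np

data = np.array([3.0, 1.0, 2.0, 2.0, 6.0])

# np.unique returns the sorted array with duplicates removed.
sorted_unique = np.unique(data)

# One candidate split between each pair of neighboring samples: the midpoint.
split_candidates = (sorted_unique[:-1] + sorted_unique[1:]) / 2

print(sorted_unique)     # [1. 2. 3. 6.]
print(split_candidates)  # [1.5 2.5 4.5]
```

With $N$ unique samples there are $N - 1$ candidate splits, one for each $k \in \{1, \dots, N-1\}$.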

The likelihood of such a split is given by assuming a uniform distribution on the left and right side of the split. 
This constructs a deterministic mixture of uniform distributions where the weights are given by the relative number of samples on the left and right side of the split.
Hence, the following function has to be evaluated

$$
\begin{aligned}
f_{average}(\mathcal{D} | \boldsymbol{\Theta}, d_{split}) &= \frac{1}{N} \left( \sum_{i=1}^{k} \frac{k}{N} \mathcal{U}(d_1, d_{split}) + \sum_{i=k+1}^{N} \frac{N-k}{N} \mathcal{U}(d_{split}, d_N) \right) \\
&= \frac{1}{N^2} \left( k^2 \cdot \mathcal{U}(d_1, d_{split}) + (N-k)^2 \cdot \mathcal{U}(d_{split}, d_N) \right),
\end{aligned}
$$
where $\mathcal{U}(a, b) = \frac{1}{b - a}$ is the value of the probability density function of the uniform distribution over the interval $[a, b]$.
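
The closed form above can be evaluated for all candidate splits at once. The following is a minimal vectorized sketch under the assumption that the input is already sorted and free of duplicates; `split_likelihoods` is a hypothetical helper and not a function of the library.

```python
import numpy as np


def split_likelihoods(sorted_unique: np.ndarray):
    """Average likelihood of every candidate split of a sorted, duplicate-free array.

    Returns the split values and the corresponding average likelihoods.
    Hypothetical helper for illustration only.
    """
    n = len(sorted_unique)
    k = np.arange(1, n)  # number of samples on the left side of each split
    splits = (sorted_unique[:-1] + sorted_unique[1:]) / 2
    left_density = 1.0 / (splits - sorted_unique[0])    # U(d_1, d_split)
    right_density = 1.0 / (sorted_unique[-1] - splits)  # U(d_split, d_N)
    likelihoods = (k**2 * left_density + (n - k) ** 2 * right_density) / n**2
    return splits, likelihoods


splits, likelihoods = split_likelihoods(np.array([0.0, 1.0, 9.0, 10.0]))
for split, likelihood in zip(splits, likelihoods):
    print(f"split at {split}: average likelihood {likelihood:.4f}")
```

Splits very close to the boundary can score highly because one side becomes very narrow and therefore very dense. Requiring a minimum number of samples on each side is one way to counteract this.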

The most likely split is selected and the process is repeated recursively on the left and right side of the split until the best split no longer improves the likelihood by a sufficient margin.
The baseline to compare against is the average likelihood without a split, which is given by the following equation

$$
f_{average}(\mathcal{D} | \boldsymbol{\Theta}) = \frac{1}{N} \sum_{i=1}^N \mathcal{U}(d_1, d_N) = \mathcal{U}(d_1, d_N)
$$

A split is accepted only if the following condition holds

$$
\max_{d_{split}} f_{average}(\mathcal{D} | \boldsymbol{\Theta}, d_{split}) > (1 + \xi) \cdot f_{average}(\mathcal{D} | \boldsymbol{\Theta})
$$

In simpler terms, the induction terminates as soon as the best split no longer improves the likelihood by more than a relative margin of $\xi$ (for example, $\xi = 0.001$ corresponds to an improvement of $0.1\,\%$). This parameter is referred to as `min_likelihood_improvement` in the code.
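
The complete induction can be sketched as a short recursive function. This is a simplified illustration of the scheme described above and not the library's implementation. The names `induce` and `fit_uniform_mixture` are hypothetical, and the `min_samples` parameter (a minimum number of samples on each side of a split) is an assumption that is only loosely modeled on `min_samples_per_quantile`.

```python
import numpy as np


def induce(points, lower, upper, xi, min_samples):
    """Recursively partition sorted, duplicate-free `points` lying in [lower, upper].

    Returns a list of (lower, upper, sample_count) leaves from left to right.
    """
    n = len(points)
    likelihood_without_split = 1.0 / (upper - lower)

    # Allowed sizes of the left side; both sides keep at least `min_samples` points.
    k = np.arange(min_samples, n - min_samples + 1)
    if len(k) == 0:
        return [(lower, upper, n)]

    splits = (points[k - 1] + points[k]) / 2
    likelihoods = (k**2 / (splits - lower) + (n - k) ** 2 / (upper - splits)) / n**2
    best = int(np.argmax(likelihoods))

    # Terminate if the best split does not improve the likelihood by more than xi.
    if likelihoods[best] <= (1 + xi) * likelihood_without_split:
        return [(lower, upper, n)]

    k_best, split = int(k[best]), float(splits[best])
    return (induce(points[:k_best], lower, split, xi, min_samples)
            + induce(points[k_best:], split, upper, xi, min_samples))


def fit_uniform_mixture(data, xi=0.001, min_samples=1):
    """Return (lower, upper, weight) triples describing a mixture of uniforms."""
    points = np.unique(np.asarray(data, dtype=float))
    leaves = induce(points, points[0], points[-1], xi, min_samples)
    return [(lo, hi, count / len(points)) for lo, hi, count in leaves]


rng = np.random.default_rng(0)
toy_data = np.concatenate((rng.normal(0, 1, 500), rng.normal(5, 0.5, 500)))
components = fit_uniform_mixture(toy_data, xi=0.001, min_samples=50)
print(len(components), "uniform components")
```

Each leaf becomes one uniform component whose weight is the fraction of samples it contains, so the weights sum to one and the intervals tile the range of the data without gaps.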

This algorithm is implemented in the [NygaDistribution](https://probabilistic-model.readthedocs.io/en/latest/autoapi/probabilistic_model/learning/nyga_distribution/index.html#probabilistic_model.learning.nyga_distribution.NygaDistribution).

Finally, if we apply the algorithm to the dataset we get the following result.
We can see that it looks very similar to the Gaussian mixture we sampled from.

In [4]:
from probabilistic_model.learning.nyga_distribution import NygaDistribution
from random_events.variables import Continuous
import plotly.graph_objects as go

distribution = NygaDistribution(Continuous("x"), min_samples_per_quantile=50, min_likelihood_improvement=0.001)
distribution.fit(dataset.tolist())
fig = go.Figure(distribution.plot())
fig.update_layout(
    title="Nyga distribution",
    xaxis_title=distribution.variable.name)
fig.show()