# Introduction to PyDaddy

`pydaddy` is a Python toolbox to derive stochastic differential equations (SDEs) from time-series data. Given samples of a time-series $x(t)$, `pydaddy` attempts to fit functions $f$ and $g^2$ such that

$$ \frac{dx}{dt} = f(x) + g(x) \cdot \eta(t) $$

where $\eta(t)$ is uncorrelated white noise. The function $f$ is called the _drift_, and governs the deterministic part of the dynamics. $g^2$ is called the _diffusion_ and governs the stochastic part of the dynamics.

PyDaddy estimates the drift function $f$ directly. For diffusion, PyDaddy estimates $g^2$ and not $g$.
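
To make the roles of drift and diffusion concrete, the following minimal sketch integrates an SDE of the above form with the Euler-Maruyama scheme, using only NumPy. It is not part of PyDaddy: the helper `simulate_sde`, the example functions `f` and `g2`, and the Itô interpretation of the noise term are all assumptions made for illustration.

```python
import numpy as np

def simulate_sde(f, g2, x0, dt, n_steps, seed=0):
    """Euler-Maruyama integration of dx/dt = f(x) + g(x) * eta(t).

    f  : drift function
    g2 : diffusion function (g squared), matching what PyDaddy estimates
    """
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    x[0] = x0
    for i in range(1, n_steps):
        g = np.sqrt(g2(x[i - 1]))
        # Deterministic step from the drift, random step scaled by sqrt(dt)
        x[i] = x[i - 1] + f(x[i - 1]) * dt + g * np.sqrt(dt) * rng.standard_normal()
    return x

# Hypothetical example: linear drift pulling x toward 0, constant diffusion
f = lambda x: -x
g2 = lambda x: 0.5

x = simulate_sde(f, g2, x0=1.0, dt=0.01, n_steps=200_000)
print("mean:", x.mean(), "variance:", x.var())
```

For this particular choice the stationary variance is $g^2 / 2 = 0.25$, so the printed variance gives a quick sanity check on the simulation.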

In [None]:
# Execute this cell to set up PyDaddy in your Colab environment.
%pip install git+https://github.com/tee-lab/PyDaddy.git

In [None]:
import pydaddy

## Initializing a `pydaddy` object

To start analysis, we need to create a `pydaddy` object with our dataset. This will compute the drift and diffusion parts, and generate a summary plot. To initialize a `pydaddy` object, we need to provide the following arguments:
 - `data`: the time-series data, either one- or two-dimensional. This example deals with 1D data; see [Getting Started with Vector Data](./2_getting_started_vector.ipynb) for a 2D example. `pydaddy` assumes that the samples are evenly spaced. `data` should be a list of NumPy arrays: one array for the scalar case, and two arrays for the vector case.
 - `t`: either a scalar, denoting the time interval between samples, or an array denoting the timestamp of each sample.
 - `bins`: The number of bins to use for computing the average drift and diffusion. Binning is only done for visualization purposes.
 
There are also other optional arguments: see [documentation](https://pydaddy.readthedocs.io/api.html) for detailed descriptions of all arguments.
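
As a rough illustration of what bin-wise averaging of drift and diffusion means, the sketch below computes binned estimates from the increments of a synthetic scalar series. This is a simplified stand-in, not PyDaddy's implementation: the synthetic data, the variable names, and the plain conditional-moment estimators are assumptions.

```python
import numpy as np

# Synthetic scalar series (assumed model: drift -x, diffusion 0.5)
rng = np.random.default_rng(1)
dt, n = 0.01, 100_000
x = np.empty(n)
x[0] = 0.0
noise = rng.standard_normal(n - 1)
for i in range(n - 1):
    x[i + 1] = x[i] - x[i] * dt + np.sqrt(0.5 * dt) * noise[i]

data = [x]  # scalar case: a list holding one NumPy array
t = dt      # scalar t: the time interval between samples

bins = 20
dx = np.diff(x)
edges = np.linspace(x.min(), x.max(), bins + 1)
centers = 0.5 * (edges[:-1] + edges[1:])
# Bin index of each sample (the last sample has no increment)
idx = np.clip(np.digitize(x[:-1], edges) - 1, 0, bins - 1)
counts = np.bincount(idx, minlength=bins)

drift = np.full(bins, np.nan)
diffusion = np.full(bins, np.nan)
for b in range(bins):
    mask = idx == b
    if mask.any():
        drift[b] = dx[mask].mean() / dt             # average of dx / dt
        diffusion[b] = (dx[mask] ** 2).mean() / dt  # average of dx^2 / dt

print(np.column_stack([centers, counts, drift, diffusion]))
```

Bins near the edges of the data range contain few samples, so their estimates are noisy; this is one reason the binned averages are best treated as a visual aid.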

This example uses a sample dataset, loaded using a helper function. For details about data formats and loading/saving data, see [Exporting Data](./5_exporting_data.ipynb).

In [None]:
data, t = pydaddy.load_sample_dataset('model-data-scalar-pairwise')
ddsde = pydaddy.Characterize(data, t, bins=20)

`pydaddy.Characterize` initializes a `ddsde` object which can be used for further analysis. It also produces summary plots, showing the time-series, histograms, and the estimated drift and diffusion functions.

`pydaddy` can automatically try to fit polynomial functions if called with argument `fit_functions=True`. However, for best results, it is recommended to do the fitting separately, with some level of manual intervention. See [Advanced Function Fitting](./3_advanced_function_fitting.ipynb)  for more details.

## Recovering functional forms for $f$ and $g^2$

`pydaddy` has a `fit()` function which can recover functional forms for the drift and diffusion functions, using sparse regression. By default, `pydaddy` fits polynomials (of a specified degree), but it is possible to fit arbitrary functions by specifying a custom library (see [Fitting Non-Polynomial Functions](./6_non_poly_function_fitting.ipynb)).

Two parameters need to be specified during fitting:
 - `order`: The maximum degree of the polynomial to be fitted (see [Advanced Function Fitting](./3_advanced_function_fitting.ipynb) for some tips on how to choose the correct order).
 - `threshold`: a _sparsification threshold_ that governs the level of sparsity (i.e. the number of terms in the polynomial). For `threshold=theta`, the fitted polynomial will only keep terms whose coefficients have magnitude greater than `theta`.
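
To show how a sparsification threshold prunes terms, here is a minimal sketch of sequentially thresholded least squares, one common sparse-regression scheme. Whether it matches PyDaddy's internal procedure is an assumption; the function `stlsq` and the synthetic data are hypothetical and exist only for this illustration.

```python
import numpy as np

def stlsq(library, target, threshold, n_iter=10):
    """Least squares with repeated removal of small coefficients."""
    coef = np.linalg.lstsq(library, target, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(coef) < threshold
        coef[small] = 0.0
        keep = ~small
        if not keep.any():
            break
        # Refit using only the terms that survived thresholding
        coef[keep] = np.linalg.lstsq(library[:, keep], target, rcond=None)[0]
    return coef

# Synthetic target: y = x - x^3 plus a little noise
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 2000)
y = x - x**3 + 0.05 * rng.standard_normal(x.size)

# Polynomial library up to order 3: columns are 1, x, x^2, x^3
library = np.column_stack([x**k for k in range(4)])
coef = stlsq(library, y, threshold=0.1)
print(coef)
```

A larger threshold removes more terms and gives a sparser polynomial; a threshold of 0 reduces to ordinary least squares.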
 
We can ask `pydaddy` to try to automatically find an appropriate sparsification threshold by calling `fit()` with argument `tune=True`. 

In [None]:
# Fitting with automatic threshold tuning
F = ddsde.fit('F', order=3, tune=True)
print(F)

In [None]:
G = ddsde.fit('G', order=3, tune=True)
print(G)

In the above example, automatic model selection (`tune=True`) successfully found the correct threshold. If the data is too noisy, or if `order` is too high, automatic model selection can give poor results. In such cases, good results can be obtained with some manual intervention: see [Advanced Function Fitting](./3_advanced_function_fitting.ipynb) for more details.

In [None]:
ddsde.fit('F', order=3, threshold=0.01)

The fitted functions can also be printed individually.

In [None]:
print(ddsde.F)

In [None]:
print(ddsde.G)

`ddsde.F` and `ddsde.G` are, in fact, callable functions: you can call `ddsde.F` or `ddsde.G` with a desired argument to evaluate the drift or diffusion at that value.

In [None]:
ddsde.F(0.2)

## Interactive plots for drift and diffusion

To get interactive plots for the drift and diffusion functions, use `ddsde.drift()` or `ddsde.diffusion()`. These will be particularly useful for the 2-D case, where the drift and diffusion plots will be 3-D plots (see [Getting Started with Vector Data](./2_getting_started_vector.ipynb)).

In [None]:
ddsde.drift()

In [None]:
ddsde.diffusion()

## Diagnostics

For a drift-diffusion model fit to be valid, the data should satisfy some underlying assumptions. `ddsde.noise_diagnostics()` allows us to verify whether the data satisfies these assumptions.
The function produces 4 plots:

- The distribution of the residuals, which should be a Gaussian.
- QQ plot of the residual distribution, against a theoretical Gaussian distribution of the same mean and variance. Ideally (i.e. if the residuals are Gaussian distributed), all points of this plot should fall on a straight line of slope 1.
- Autocorrelation plot of the residuals. Ideally, the residuals should be uncorrelated, i.e. autocorrelation time should be close to 0.
- The plot of the 2nd and 4th Kramers-Moyal coefficients. For Gaussian noise, theory dictates that $\text{KM}(4)$ should equal $3 \cdot \text{KM}(2)^2$, i.e. plotting $\text{KM}(4)$ against $3 \cdot \text{KM}(2)^2$ should give a straight line of slope 1.
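
The quantities behind these plots can be sketched numerically. The example below is a simplified illustration, not what `noise_diagnostics()` computes internally: it uses synthetic data, assumes the drift is known exactly, and uses global moments rather than binned Kramers-Moyal coefficients. It checks the Gaussian moment relation (fourth moment equal to three times the squared second moment) and the lag-1 autocorrelation of the residuals.

```python
import numpy as np

# Synthetic series (assumed model: drift -x, diffusion 0.5)
rng = np.random.default_rng(3)
dt, n = 0.01, 100_000
x = np.empty(n)
x[0] = 0.0
noise = rng.standard_normal(n - 1)
for i in range(n - 1):
    x[i + 1] = x[i] - x[i] * dt + np.sqrt(0.5 * dt) * noise[i]

drift = lambda x: -x  # assumed known here; in practice, the fitted drift

# Residuals: what remains of each increment after removing the drift
residuals = np.diff(x) - drift(x[:-1]) * dt

# Gaussian check: the fourth moment should be about 3 * (second moment)^2
m2 = np.mean(residuals**2)
m4 = np.mean(residuals**4)
ratio = m4 / (3 * m2**2)

# Lag-1 autocorrelation of the residuals; should be close to 0
r = residuals - residuals.mean()
lag1 = np.dot(r[:-1], r[1:]) / np.dot(r, r)

print("m4 / (3 m2^2):", ratio)
print("lag-1 autocorrelation:", lag1)
```

A ratio far from 1 or a clearly non-zero autocorrelation suggests that the noise is not Gaussian white noise, and that a drift-diffusion model may not describe the data well.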


In [None]:
ddsde.noise_diagnostics()

The `model_diagnostics()` function checks if the model is self-consistent.

To do this, a simulated time series, with the same length and sampling time as the original time series, is generated by integrating the discovered SDE. The drift and diffusion functions are now re-estimated from this simulated time series, with the same fitting parameters as the original fit. If the model is self-consistent, the re-estimated drift and diffusion functions should match the original drift and diffusion.
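
The self-consistency procedure described above can be sketched in a few lines. This is an illustrative stand-in rather than PyDaddy's implementation: the helpers `simulate` and `estimate` are hypothetical, the drift is fitted by ordinary polynomial least squares, and the diffusion is assumed constant.

```python
import numpy as np

def simulate(drift_coef, g2, dt, n, rng):
    """Euler-Maruyama with polynomial drift and constant diffusion g2."""
    x = np.empty(n)
    x[0] = 0.0
    noise = rng.standard_normal(n - 1)
    for i in range(n - 1):
        x[i + 1] = x[i] + np.polyval(drift_coef, x[i]) * dt + np.sqrt(g2 * dt) * noise[i]
    return x

def estimate(x, dt, order=1):
    """Fit a polynomial drift and a constant diffusion from increments."""
    dx = np.diff(x)
    drift_coef = np.polyfit(x[:-1], dx / dt, order)
    resid = dx - np.polyval(drift_coef, x[:-1]) * dt
    return drift_coef, np.mean(resid**2) / dt

rng = np.random.default_rng(4)
dt, n = 0.01, 50_000

# Stand-in for the observed data (assumed model: drift -x, diffusion 0.5)
x_obs = simulate([-1.0, 0.0], 0.5, dt, n, rng)

# 1. Fit a model to the observed series
coef_fit, g2_fit = estimate(x_obs, dt)
# 2. Simulate a series of the same length and sampling time from the fit
x_sim = simulate(coef_fit, g2_fit, dt, n, rng)
# 3. Re-estimate with the same fitting parameters and compare
coef_re, g2_re = estimate(x_sim, dt)

print("drift coefficients:", coef_fit, "vs re-estimated:", coef_re)
print("diffusion:", g2_fit, "vs re-estimated:", g2_re)
```

If the re-estimated coefficients differ from the fitted ones by much more than the sampling noise, the fitted model does not reproduce its own statistics and should be revisited.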

In [None]:
ddsde.model_diagnostics()