[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tee-lab/PyDaddy/blob/colab/notebooks/3_advanced_function_fitting.ipynb)

# Advanced Function Fitting

(This notebook assumes that you have gone through the [Getting Started](./1_getting_started.ipynb) notebook. The explanations and examples in this notebook are mostly based on vector datasets, but the ideas remain largely unchanged for the scalar case.)

The _Getting Started_ notebook introduced how to fit functional forms for drift and diffusion terms. In those examples, the automatic model selection worked well. However, in many cases, some amount of manual intervention may be required to get the best results. 

There are two key parameters in (polynomial) function fitting: the _maximum degree_ of the polynomial, and the _sparsification threshold_. This notebook explores how to choose these parameters appropriately.

As an experimental feature, `pydaddy` also allows us to fit functions other than polynomials, by providing a custom library. For more details on this, see [Fitting Non-polynomial Functions](./6_non_poly_function_fitting.ipynb).

In [None]:
import pydaddy

## Choosing the correct polynomial degree

For this notebook, we will use the group polarization time-series for a school of fish. Assuming there are no preferred directions for the school, we can expect symmetries between $F_1$ and $F_2$ and between $G_{11}$ and $G_{22}$.

The first step is to choose the right order for the polynomial. This can be done by visually inspecting the drift and diffusion plots.

In [None]:
data, t = pydaddy.load_sample_dataset('model-data-vector-ternary')
ddsde = pydaddy.Characterize(data, t, bins=20)

In [None]:
ddsde.drift()

In [None]:
ddsde.diffusion()

From the drift and diffusion plots, it looks like the diffusion is quadratic and the drift is cubic.
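
As a rough numerical complement to visual inspection, one can fit polynomials of increasing degree and check where the residual stops improving. The sketch below uses plain NumPy on a synthetic cubic curve; the data, the noise level, and the variable names are assumptions made for illustration, and none of this is part of the `pydaddy` API.

```python
import numpy as np

# Synthetic stand-in for a binned drift estimate (an assumption, not PyDaddy output):
# a cubic curve with a small amount of noise.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 41)
y = x - x**3 + 0.01 * rng.standard_normal(x.size)

# Residual sum of squares (RSS) for polynomial fits of increasing degree.
rss = {}
for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)
    rss[degree] = float(np.sum((np.polyval(coeffs, x) - y) ** 2))

for degree, err in rss.items():
    print(f"degree {degree}: RSS = {err:.5f}")
```

The degree at which the RSS drops sharply and then flattens out is a reasonable candidate for `order`. Here that happens at degree 3, which matches the cubic shape of the synthetic curve.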

## Setting the sparsification threshold

The automatic model selection algorithm of `pydaddy` is a good place to start, and usually works well. This can be done by calling `ddsde.fit()` with argument `tune=True`.

The model selection algorithm estimates polynomial fits for a wide range of threshold values, computes the _cross validation error_ for each model, and picks the model that achieves the best drop in cross validation error. By default, the range of thresholds to search over is determined automatically, but it can be specified manually if required, using the `thresholds` parameter: see the [documentation](http://pydaddy.readthedocs.io/api.html#pydaddy.daddy.Daddy.fit) for more information.

In [None]:
F1 = ddsde.fit('F1', order=3, tune=True)
print(F1)

In [None]:
F2 = ddsde.fit('F2', order=3, tune=True)
print(F2)

In [None]:
G11 = ddsde.fit('G11', order=2, tune=True)
print(G11)

In [None]:
G22 = ddsde.fit('G22', order=2, tune=True)
print(G22)

## Evaluating and fine-tuning fits

The goodness of the fit can be examined qualitatively by visualizing the fit against the points using `ddsde.drift()` or `ddsde.diffusion()`.

In [None]:
ddsde.drift()

In [None]:
ddsde.diffusion()

For this example, automatic model selection (with the correct degree) worked well for $F_1$, $F_2$ and $G_{11}$, but failed for $G_{22}$. The automatic model selection can sometimes fail like this: it may eliminate too many terms or return a polynomial with many spurious terms. In these cases, some manual intervention can improve results.

The sparse regression algorithm used for fitting relies on a sparsity threshold, which can either be specified manually or be tuned automatically (as above). The sparsity threshold can be interpreted as follows: for a given value of the threshold, the fitted polynomial will only contain coefficients with magnitudes higher than the threshold value. Therefore, if you set the threshold too high, most or all terms will be zeroed out. If the threshold is too low, the result will have too many terms.

When there are spurious terms present, one intuitive approach is to look for terms that are clearly spurious, i.e. terms with coefficients an order of magnitude smaller than the others. We can eliminate these terms by setting a threshold high enough to kill them off.
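
To make the role of the threshold concrete, here is a minimal sketch of sequentially thresholded least squares, a common sparse regression scheme. This is an illustration under stated assumptions, not `pydaddy`'s actual implementation: the function name `stlsq`, the synthetic data, and the threshold values are all hypothetical.

```python
import numpy as np

def stlsq(library, target, threshold):
    """Sequentially thresholded least squares (illustrative sketch only)."""
    n_terms = library.shape[1]
    keep = np.ones(n_terms, dtype=bool)   # terms still in the model
    coeffs = np.zeros(n_terms)
    while keep.any():
        # Refit using only the terms that survived so far.
        coeffs[:] = 0.0
        coeffs[keep] = np.linalg.lstsq(library[:, keep], target, rcond=None)[0]
        small = keep & (np.abs(coeffs) < threshold)
        if not small.any():
            break                         # every remaining term is above threshold
        keep &= ~small                    # drop small terms and refit
    coeffs[~keep] = 0.0
    return coeffs

# Synthetic data (an assumption): y = 0.5 x - 0.5 x^3 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 0.5 * x - 0.5 * x**3 + 0.01 * rng.standard_normal(x.size)
library = np.vander(x, 4, increasing=True)   # columns: 1, x, x^2, x^3

low = stlsq(library, y, threshold=1e-4)      # too low: spurious terms may survive
mid = stlsq(library, y, threshold=0.05)      # keeps only the two genuine terms
high = stlsq(library, y, threshold=1.0)      # too high: everything is zeroed out
for name, c in [("low", low), ("mid", mid), ("high", high)]:
    print(name, np.round(c, 4))
```

Every coefficient that survives is, by construction, at least as large in magnitude as the threshold, which is the interpretation given above.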

In [None]:
# Threshold too small
print(ddsde.fit('F1', order=3, threshold=0.01))

In [None]:
# Threshold too high
print(ddsde.fit('F1', order=3, threshold=0.1))

In [None]:
# Threshold just right
print(ddsde.fit('F1', order=3, threshold=0.05))

A more systematic approach is to look at the cross-validation curve and identify whether there is a clearly suitable threshold value, where the CV error is either lowest or shows a significant jump. To do this, call `ddsde.fit()` with `tune=True` and `plot=True`. This will plot the cross validation error and the level of sparsity (i.e. the number of terms) for each threshold value, helping you choose the right threshold.

In [None]:
ddsde.fit('F1', order=3, tune=True, plot=True)

In [None]:
ddsde.fit('G11', order=3, tune=True, plot=True)

In the above example, for $F_1$, thresholds between ~0.02 and ~0.06 seem to give the best trade-off, with 3 terms. Decreasing the threshold further drastically increases the number of terms, with no significant improvement in CV error. Increasing the threshold above 0.6 kills off all the terms and makes the CV error very high.

Similarly, for $G_{11}$, thresholds between ~0.01 and ~0.04 achieve the best trade-off.
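
The same trade-off between number of terms and cross validation error can be tabulated by hand. The sketch below is a self-contained NumPy illustration and not what `pydaddy` does internally: the helper names `stlsq` and `cv_error`, the 5-fold split, the threshold grid, and the synthetic data are all assumptions.

```python
import numpy as np

def stlsq(library, target, threshold):
    """Sequentially thresholded least squares (illustrative sketch only)."""
    keep = np.ones(library.shape[1], dtype=bool)
    coeffs = np.zeros(library.shape[1])
    while keep.any():
        coeffs[:] = 0.0
        coeffs[keep] = np.linalg.lstsq(library[:, keep], target, rcond=None)[0]
        small = keep & (np.abs(coeffs) < threshold)
        if not small.any():
            break
        keep &= ~small
    coeffs[~keep] = 0.0
    return coeffs

def cv_error(library, target, threshold, folds):
    """Mean squared prediction error on held-out folds."""
    errors = []
    for test_idx in folds:
        train = np.ones(len(target), dtype=bool)
        train[test_idx] = False
        coeffs = stlsq(library[train], target[train], threshold)
        residual = library[test_idx] @ coeffs - target[test_idx]
        errors.append(np.mean(residual ** 2))
    return float(np.mean(errors))

# Synthetic data (an assumption): y = 0.5 x - 0.5 x^3 plus a little noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 400)
y = 0.5 * x - 0.5 * x**3 + 0.01 * rng.standard_normal(x.size)
library = np.vander(x, 6, increasing=True)        # columns: 1, x, ..., x^5
folds = np.array_split(rng.permutation(x.size), 5)

results = {}
for threshold in [0.001, 0.01, 0.1, 1.0]:
    n_terms = int(np.count_nonzero(stlsq(library, y, threshold)))
    results[threshold] = (n_terms, cv_error(library, y, threshold, folds))
    print(f"threshold={threshold:<6} terms={n_terms}  CV error={results[threshold][1]:.6f}")
```

A good threshold is one where the number of terms has dropped but the CV error has not yet jumped. In this synthetic case the jump occurs once the threshold exceeds the magnitude of the genuine coefficients.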

In [None]:
print(ddsde.fit('F1', order=3, threshold=0.03))
print(ddsde.fit('G11', order=2, threshold=0.03))

## Handling outliers using regularized regression

The `fit()` function has an `alpha` parameter which is a ridge regularization parameter for the polynomial fitting. When a non-zero value of alpha is given, ridge regression is used in the fitting process. This will give best results when the data is noisy or has outliers. 

In cases where a non-zero `alpha` is required, fairly high values may be needed: as a very rough rule of thumb, try `alpha=100` to `alpha=1e6` (your mileage may vary according to your specific dataset). However, be aware that alpha has the overall effect of shrinking your model parameters: if alpha is too high, your estimated coefficients can be biased towards values that are too small.

As an example, see how fitting the diffusion function with a very large degree gives rise to overfitting, even when the threshold is fairly high.
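
The shrinkage effect is easy to see with the closed-form ridge solution $(X^T X + \alpha I)^{-1} X^T y$. The following is a minimal NumPy sketch on synthetic data; the helper `ridge`, the data, and the alpha grid are assumptions for illustration, and `pydaddy`'s own scaling of `alpha` may differ.

```python
import numpy as np

def ridge(library, target, alpha):
    """Closed-form ridge regression: solve (X^T X + alpha I) beta = X^T y."""
    n_terms = library.shape[1]
    gram = library.T @ library + alpha * np.eye(n_terms)
    return np.linalg.solve(gram, library.T @ target)

# Synthetic data (an assumption): y = 0.5 x - 0.5 x^3 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)
y = 0.5 * x - 0.5 * x**3 + 0.01 * rng.standard_normal(x.size)
library = np.vander(x, 4, increasing=True)   # columns: 1, x, x^2, x^3

alphas = [0.0, 100.0, 1e4, 1e6]
coeffs = {alpha: ridge(library, y, alpha) for alpha in alphas}
norms = [float(np.linalg.norm(coeffs[alpha])) for alpha in alphas]

for alpha, norm in zip(alphas, norms):
    print(f"alpha={alpha:<10g} coefficients={np.round(coeffs[alpha], 4)}  norm={norm:.4f}")
```

The coefficient norm decreases as `alpha` grows, and for very large `alpha` the coefficients collapse towards zero, which is the bias described above.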

In [None]:
ddsde.fit('G11', order=10, threshold=0.03)

Setting `alpha` to be fairly high and re-doing the regression eliminates the overfitting.

In [None]:
ddsde.fit('G11', order=10, threshold=0.02, alpha=1000)

Notice that the coefficients from the ridge-regression-based fit are slightly smaller than the coefficients we obtained in our earlier fits: this is a side effect of ridge regularization. Therefore, a non-zero alpha should only be used as a last resort, when other attempts fail.

For more information on ridge regression, see:

https://en.wikipedia.org/wiki/Ridge_regression

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ridge_regression.html