User-defined functions (#18)
trevorstephens committed Sep 3, 2016
1 parent cdfe6bf commit 5d8c7d2
Showing 14 changed files with 1,515 additions and 915 deletions.
28 changes: 28 additions & 0 deletions doc/changelog.rst
@@ -0,0 +1,28 @@
.. currentmodule:: gplearn
.. _changelog:

Release History
===============

Version 0.2.0
-------------

- Allow users to define their own functions for use in genetic programs.
Supported by the :func:`functions.make_function()` factory function. Using
this, a user may define any mathematical relationship with any number of
arguments and grow totally customized programs. This also required modifying
the API with the deprecation of the ``comparison``, ``transformer`` and
``trigonometric`` arguments to the :class:`genetic.SymbolicRegressor` and
:class:`genetic.SymbolicTransformer` classes in favor of the new
``function_set``, where any combination of preset and user-defined functions
can be supplied. To restore previous behavior, initialize the estimator with
``function_set=['add2', 'sub2', 'mul2', 'div2', 'sqrt1', 'log1', 'abs1',
'neg1', 'inv1', 'max2', 'min2']``.
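
For example, a minimal sketch of restoring the previous defaults (assuming an
otherwise default configuration)::

from gplearn.genetic import SymbolicRegressor

est = SymbolicRegressor(function_set=['add2', 'sub2', 'mul2', 'div2',
                                      'sqrt1', 'log1', 'abs1', 'neg1',
                                      'inv1', 'max2', 'min2'])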


Version 0.1.0
-------------

Initial public release supporting symbolic regression tasks through the
:class:`genetic.SymbolicRegressor` class for regression problems and the
:class:`genetic.SymbolicTransformer` class for automated feature engineering.
2 changes: 1 addition & 1 deletion doc/conf.py
@@ -60,7 +60,7 @@

# General information about the project.
project = u'gplearn'
copyright = u'2015, Trevor Stephens'
copyright = u'2016, Trevor Stephens'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
131 changes: 108 additions & 23 deletions doc/examples.rst
@@ -3,16 +3,20 @@
Examples
========

The code used to generate these examples can be
`found here <http://nbviewer.ipython.org/github/trevorstephens/gplearn/blob/master/doc/gp_examples.ipynb>`_
as an IPython Notebook.

.. currentmodule:: gplearn.genetic

Example 1: Symbolic Regressor
-----------------------------

This example demonstrates using the :class:`SymbolicRegressor` to fit a
symbolic relationship.

Let's create some synthetic data based on the relationship
:math:`y = X_0^{2} - X_1^{2} + X_1 - 1`::

x0 = np.arange(-1, 1, 1/10.)
x1 = np.arange(-1, 1, 1/10.)
@@ -22,7 +26,8 @@ Let's create some synthetic data based on the relationship :math:`y = X_0^{2} -
ax = plt.figure().gca(projection='3d')
ax.set_xlim(-1, 1)
ax.set_ylim(-1, 1)
surf = ax.plot_surface(x0, x1, y_truth, rstride=1, cstride=1,
color='green', alpha=0.5)
plt.show()

.. image:: images/ex1_fig1.png
@@ -40,11 +45,17 @@ We can create some random training and test data that lies on this surface too::
X_test = rng.uniform(-1, 1, 100).reshape(50, 2)
y_test = X_test[:, 0]**2 - X_test[:, 1]**2 + X_test[:, 1] - 1

Now let's consider how to fit our :class:`SymbolicRegressor` to this data. Since it's a fairly small dataset, we can probably use a large population since training time will still be pretty fast. We'll evolve 20 generations unless the error falls below 0.01. Examining the equation, it doesn't appear that the ``transformer`` or ``comparison`` function sets are useful, so we'll turn those off. Let's bump up the amount of mutation and subsample so that we can watch the OOB error evolve. We'll also increase the parsimony coefficient to keep our solutions small, since we know the truth is a pretty simple equation::
Now let's consider how to fit our :class:`SymbolicRegressor` to this data.
It's a fairly small dataset, so we can probably use a large population and
training time will still be pretty fast. We'll evolve 20 generations unless the
error falls below 0.01. Examining the equation, it looks like the default
function set of addition, subtraction, multiplication and division will cover
us. Let's bump up the amount of mutation and subsample so that we can watch
the OOB error evolve. We'll also increase the parsimony coefficient to keep our
solutions small, since we know the truth is a pretty simple equation::

est_gp = SymbolicRegressor(population_size=5000,
generations=20, stopping_criteria=0.01,
p_crossover=0.7, p_subtree_mutation=0.1,
p_hoist_mutation=0.05, p_point_mutation=0.1,
max_samples=0.9, verbose=1,
@@ -65,13 +76,18 @@ Now let's consider how to fit our :class:`SymbolicRegressor` to this data. Since
8 8.02 1.02643443398 11 0.043612562970 0.043612562970 1.08m
9 9.07 1.22732144371 11 0.000781474035 0.0007814740353 59.43s

The evolution process stopped early, as the error of the best program in the
9th generation was better than 0.01. It also appears that the parsimony
coefficient was just about right, as the average length of the programs
fluctuated a bit before settling at a pretty reasonable size. Let's look at
what our solution was::

print est_gp._program
sub(add(-0.999, X1), mul(sub(X1, X0), add(X0, X1)))

Interestingly, this does not have the same structure as our target function.
But let's expand the mathematics out:

.. math::
y = (-0.999 + X_1) - ((X_1 - X_0) \times (X_0 + X_1))
@@ -82,7 +98,8 @@ Interestingly, this does not have the same structure as our target function. But
.. math::
y = X_0^{2} - X_1^{2} + X_1 - 0.999
Despite representing an interaction of :math:`X_0` and :math:`X_1`, these terms
cancel and we're left with the (almost) exact relationship we were seeking!

Great, but let's compare with some other non-linear models to see how they do::
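
# A minimal sketch only, assuming scikit-learn's tree-based regressors and
# the X_train, y_train arrays created in the training-data step above:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

est_tree = DecisionTreeRegressor()
est_tree.fit(X_train, y_train)
est_rf = RandomForestRegressor()
est_rf.fit(X_train, y_train)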

@@ -120,11 +137,15 @@ We can plot the decision surfaces of all three to visualize each one::
.. image:: images/ex1_fig2.png
:align: center

Not bad :class:`SymbolicRegressor`! We were able to fit a very smooth function
to the data, while the tree-based estimators created very "blocky" decision
surfaces. The Random Forest appears to have smoothed out some of the wrinkles,
but in both cases the tree models fit the training data very well yet did
worse on out-of-sample data.

We can also inspect the program that the :class:`SymbolicRegressor` found::

graph = pydot.graph_from_dot_data(est_gp._program.export_graphviz())
graph = pydotplus.graphviz.graph_from_dot_data(est_gp._program.export_graphviz())
Image(graph.create_png())

.. image:: images/ex1_child.png
@@ -140,23 +161,27 @@ And check out who its parents were::
'donor_idx': 116,
'donor_nodes': [0, 1, 2, 6]}

This dictionary tells us which evolution operation was performed to get our
new individual, the parents from the prior generation, and any nodes that were
removed from them during that operation (in this case, crossover).

Plotting the parents shows how the genetic material from them combined to form
our winning program::

idx = est_gp._program.parents['donor_idx']
fade_nodes = est_gp._program.parents['donor_nodes']
graph = est_gp._programs[-2][idx].export_graphviz(fade_nodes=fade_nodes)
graph = pydot.graph_from_dot_data(graph)
graph = pydotplus.graphviz.graph_from_dot_data(graph)
Image(graph.create_png())

.. image:: images/ex1_fig3.png
:align: center

Example 2: Symbolic Transformer
-------------------------------

This example demonstrates using the :class:`SymbolicTransformer` to generate
new non-linear features automatically.

Let's load up the Boston housing dataset and randomly shuffle it::

@@ -166,36 +191,96 @@ Let's load up the Boston housing dataset and randomly shuffle it::
boston.data = boston.data[perm]
boston.target = boston.target[perm]

We'll use Ridge Regression for this example, training our regressor on the
first 300 samples and seeing how it performs on the unseen final 200 samples.
The benchmark to beat is simply Ridge running on the dataset as-is::

est = Ridge()
est.fit(boston.data[:300, :], boston.target[:300])
print est.score(boston.data[300:, :], boston.target[300:])
0.759145222183


So now we'll train our transformer on the same first 300 samples to generate
some new features. Let's use a large population of 2000 individuals over 20
generations. We'll select the best 100 of these for the ``hall_of_fame``, and
then use the least-correlated 10 as our new features. A little parsimony should
control bloat, but we'll leave the rest of the evolution options at their
defaults. The default ``metric='pearson'`` is appropriate here since we are
using a linear model as the estimator. If we were going to use a tree-based
estimator, the Spearman correlation might be interesting to try out too::

function_set = ['add', 'sub', 'mul', 'div',
'sqrt', 'log', 'abs', 'neg', 'inv',
'max', 'min']
gp = SymbolicTransformer(generations=20, population_size=2000,
hall_of_fame=100, n_components=10,
function_set=function_set,
parsimony_coefficient=0.0005,
max_samples=0.9, verbose=1,
random_state=0, n_jobs=3)
gp.fit(boston.data[:300, :], boston.target[:300])
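
Had we instead planned to pair these features with a tree-based estimator,
switching to the rank-based metric would be a one-argument change (a
hypothetical variant, not run in this example)::

gp_spearman = SymbolicTransformer(generations=20, population_size=2000,
                                  hall_of_fame=100, n_components=10,
                                  function_set=function_set,
                                  metric='spearman',
                                  parsimony_coefficient=0.0005,
                                  max_samples=0.9, verbose=1,
                                  random_state=0, n_jobs=3)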

We will then apply our trained transformer to the entire Boston dataset
(remember, it still hasn't seen the final 200 samples) and concatenate this to
the original data::

gp_features = gp.transform(boston.data)
new_boston = np.hstack((boston.data, gp_features))

Now we train the Ridge regressor on the first 300 samples of the transformed
dataset and see how it performs on the final 200 again::

est = Ridge()
est.fit(new_boston[:300, :], boston.target[:300])
print est.score(new_boston[300:, :], boston.target[300:])
0.853618353633

Great! We have improved the :math:`R^{2}` score by a significant margin. It
looks like the linear model was able to take advantage of some new non-linear
features to fit the data even better.

.. currentmodule:: gplearn

Example 3: Customizing Your Programs
------------------------------------

This example demonstrates modifying the function set with your own user-defined
functions using the :func:`functions.make_function()` factory function.

First you need to define a function that returns a numpy array of the correct
shape. Most numpy operations will do this automatically. The factory will
perform some basic checks on your function to ensure it complies with this.
The function must also protect against zero division and invalid floating
point operations (such as taking the log of a negative number).
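
For instance, a protected division that guards against dividing by values near
zero might look like this (a minimal sketch, not the library's own
implementation)::

import numpy as np

def protected_div(x1, x2):
    # Return x1 / x2 element-wise, emitting 1.0 wherever |x2| is near zero
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.where(np.abs(x2) > 0.001, np.divide(x1, x2), 1.)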

For this example we will implement a logical operation that compares two
arguments and returns a third value if the first is larger, or a fourth value
otherwise::

import numpy as np

def logic(x1, x2, x3, x4):
    return np.where(x1 > x2, x3, x4)

To make this into a ``gplearn``-compatible function we use the factory, giving
it a name for display purposes and declaring the arity of the function, which
must match the number of arguments that your function expects::

from gplearn.functions import make_function

logical = make_function(function=logic,
                        name='logical',
                        arity=4)

This can then be added to a ``gplearn`` estimator like so::

gp = SymbolicTransformer(function_set=['add', 'sub', 'mul', 'div', logical])

After fitting, you will see that some of your programs use your own customized
functions, for example::

mul(logical(X0, mul(-0.629, X3), X7, sub(0.790, X7)), X9)

.. image:: images/ex3_fig1.png
:align: center

Next up, :ref:`explore the full API reference <reference>` or just skip ahead
and :ref:`install the package <installation>`!
1,226 changes: 609 additions & 617 deletions doc/gp_examples.ipynb


Binary file added doc/images/ex3_fig1.png
2 changes: 1 addition & 1 deletion doc/index.rst
@@ -52,4 +52,4 @@ Contents:
examples
reference
installation

changelog
5 changes: 3 additions & 2 deletions doc/installation.rst
@@ -3,7 +3,9 @@
Installation
============

`gplearn` requires a recent version of scikit-learn (which requires numpy and
scipy). So first you will need to `follow their installation instructions <http://scikit-learn.org/dev/install.html>`_
to get the dependencies.

Now that you have scikit-learn installed, you can install gplearn using pip::
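
pip install gplearn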

@@ -26,4 +28,3 @@ or::
python setup.py install --user

and you're done!
