Commit
Fix second issue from #42 (#60)
* fix issues with metric evaluation in transformer
* improve transformer selection algorithm to maintain fittest program
* update documentation for the new behaviour
* update changelog with changes in random sampling
trevorstephens committed Nov 16, 2017
1 parent 6ab0bbc commit dfbae86
Showing 9 changed files with 164 additions and 151 deletions.
2 changes: 1 addition & 1 deletion doc/advanced.rst
@@ -63,7 +63,7 @@ This can then be added to a ``gplearn`` estimator like so::
 After fitting, you will see some of your programs will have used your own
 customized functions, for example::

-    mul(logical(X0, mul(-0.629, X3), X7, sub(0.790, X7)), X9)
+    sub(logical(X6, add(X11, 0.898), X10, X2), X5)

 .. image:: images/ex3_fig1.png
     :align: center
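The ``logical`` calls above come from a user-defined function registered through :func:`functions.make_function`. A minimal sketch of that setup, assuming the four-argument if-greater-then-else ``logical`` this documentation section describes (the ``function_set`` shown is illustrative)::

    import numpy as np
    from gplearn.functions import make_function
    from gplearn.genetic import SymbolicTransformer

    def _logical(x1, x2, x3, x4):
        # If x1 is greater than x2, return x3; otherwise return x4.
        return np.where(x1 > x2, x3, x4)

    logical = make_function(function=_logical, name='logical', arity=4)

    # Custom functions join the built-ins via the function_set parameter.
    gp = SymbolicTransformer(function_set=['add', 'sub', 'mul', 'div', logical])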
17 changes: 15 additions & 2 deletions doc/changelog.rst
@@ -4,12 +4,25 @@
 Release History
 ===============

-Version 0.2.1
+Version 0.3.0
 -------------

+- Fixed two bugs in :class:`genetic.SymbolicTransformer` where the final
+  solution selection logic was incorrect and suboptimal. This fix will change
+  the solutions returned by all previous versions of `gplearn`. Thanks to
+  `iblasi <https://github.com/iblasi>`_ for diagnosing the problem and helping
+  craft the solution.
+- Fixed a bug in :class:`genetic.SymbolicRegressor` where a custom fitness
+  measure created with :func:`fitness.make_fitness()` and the parameter
+  `greater_is_better=True` was ignored during final solution selection; see
+  the sketch after this list. This change will alter the results from previous
+  releases where `greater_is_better=True` was set in a custom fitness measure.
+  By `sun ao <https://github.com/eggachecat>`_.
 - Increase minimum required version of ``scikit-learn`` to 0.18.1. This allows
   streamlining the test suite and removal of many utilities to reduce future
-  technical debt.
+  technical debt. **Please note that due to this change, previous versions
+  may have different results** due to a change in random sampling noted
+  `here <http://scikit-learn.org/stable/whats_new.html#version-0-18-1>`_.
 - Drop support for Python 2.6 and add support for Python 3.5 and 3.6 in order
   to support the latest release of ``scikit-learn`` 0.19 and avoid future test
   failures. By `hugovk <https://github.com/hugovk>`_.
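The second entry above concerns the pattern sketched here: a custom metric built with :func:`fitness.make_fitness` and ``greater_is_better=True``, which final solution selection previously ignored. A minimal example, assuming a negated mean absolute error as the custom measure (mirroring the new test added in this commit)::

    from gplearn.fitness import make_fitness
    from gplearn.genetic import SymbolicRegressor
    from sklearn.metrics import mean_absolute_error

    def _neg_mae(y, y_pred, sample_weight):
        # Negated MAE: larger values are better, hence greater_is_better=True.
        return -mean_absolute_error(y, y_pred, sample_weight=sample_weight)

    neg_mae = make_fitness(_neg_mae, greater_is_better=True)

    # Before this fix, the final program was chosen as if smaller raw
    # fitness were better, which could return a suboptimal solution.
    est = SymbolicRegressor(metric=neg_mae, stopping_criteria=-0.000001)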
27 changes: 12 additions & 15 deletions doc/examples.rst
@@ -65,16 +65,13 @@ solutions small, since we know the truth is a pretty simple equation::
     |    Population Average    |             Best Individual              |
    ---- ------------------------- ------------------------------------------ ----------
     Gen   Length        Fitness  Length         Fitness     OOB Fitness  Time Left
-      0    38.13    386.19117972      7  0.331580808730   0.470286152255     55.15s
-      1     9.91   1.66832489614      5  0.335361761359   0.488347149514      1.25m
-      2     7.76     1.888657267      7  0.260765934398   0.565517599814      1.45m
-      3     5.37   1.00018638338     17  0.223753461954   0.274920433701      1.42m
-      4     4.69  0.878161643513     17  0.145095322600   0.158359554221      1.35m
-      5      6.1   0.91987274474     11  0.043612562970   0.043612562970      1.31m
-      6     7.18   1.09868887802     11  0.043612562970   0.043612562970      1.23m
-      7     7.65   1.96650325011     11  0.043612562970   0.043612562970      1.18m
-      8     8.02   1.02643443398     11  0.043612562970   0.043612562970      1.08m
-      9     9.07   1.22732144371     11  0.000781474035  0.0007814740353     59.43s
+      0    38.13    458.57768152      5  0.320665972828   0.556763539274      1.28m
+      1     9.97   1.70232723129      5  0.320201761523   0.624787148042     57.78s
+      2     7.72   1.94456344674     11  0.239536660154   0.533148180489     46.35s
+      3     5.41  0.990156815469      7  0.235676349446   0.719906258051     37.93s
+      4     4.66  0.894443363616     11  0.103946413589   0.103946413589     32.20s
+      5     5.41  0.940242380405     11  0.060802040427   0.060802040427     28.15s
+      6     6.78    1.0953592564     11  0.000781474035   0.000781474035     24.85s

 The evolution process stopped early as the error of the best program in the 9th
 generation was better than 0.01. It also appears that the parsimony coefficient
@@ -154,12 +151,12 @@ We can also inspect the program that the :class:`SymbolicRegressor` found::
 And check out who its parents were::

     print est_gp._program.parents

     {'method': 'Crossover',
-     'parent_idx': 374,
+     'parent_idx': 1555,
      'parent_nodes': [1, 2, 3],
-     'donor_idx': 116,
-     'donor_nodes': [0, 1, 2, 6]}
+     'donor_idx': 78,
+     'donor_nodes': []}

 This dictionary tells us what evolution operation was performed to get our new
 individual, as well as the parents from the prior generation, and any nodes
@@ -235,7 +232,7 @@ dataset and see how it performs on the final 200 again::
     est.fit(new_boston[:300, :], boston.target[:300])
     print est.score(new_boston[300:, :], boston.target[300:])
-    0.853618353633
+    0.841750404385

 Great! We have improved the :math:`R^{2}` score by a significant margin. It
 looks like the linear model was able to take advantage of some new non-linear
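For reference, the ``new_boston`` matrix in the excerpt above is presumably built, earlier in this example, by stacking the transformer's evolved features onto the original features — roughly (``gp`` being the already-fitted :class:`SymbolicTransformer`)::

    import numpy as np
    from sklearn.datasets import load_boston

    boston = load_boston()
    # gp is assumed to be a SymbolicTransformer fitted on the first 300 rows;
    # transform() returns its evolved feature columns for all samples.
    gp_features = gp.transform(boston.data)
    new_boston = np.hstack((boston.data, gp_features))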
188 changes: 71 additions & 117 deletions doc/gp_examples.ipynb

Large diffs are not rendered by default.

Binary file modified doc/images/ex1_fig1.png
Binary file modified doc/images/ex1_fig2.png
Binary file modified doc/images/ex1_fig3.png
15 changes: 10 additions & 5 deletions gplearn/genetic.py
@@ -480,23 +480,28 @@ def fit(self, X, y, sample_weight=None):
         if isinstance(self, TransformerMixin):
             # Find the best individuals in the final generation
             fitness = np.array(fitness)
-            hall_of_fame = fitness.argsort()[:self.hall_of_fame]
+            if self._metric.greater_is_better:
+                hall_of_fame = fitness.argsort()[::-1][:self.hall_of_fame]
+            else:
+                hall_of_fame = fitness.argsort()[:self.hall_of_fame]
             evaluation = np.array([gp.execute(X) for gp in
                                    [self._programs[-1][i] for
                                     i in hall_of_fame]])
             if self.metric == 'spearman':
                 evaluation = np.apply_along_axis(rankdata, 1, evaluation)

-            # Iteratively remove the worst individual of the worst pair
             with np.errstate(divide='ignore', invalid='ignore'):
                 correlations = np.abs(np.corrcoef(evaluation))
             np.fill_diagonal(correlations, 0.)
             components = list(range(self.hall_of_fame))
             indices = list(range(self.hall_of_fame))
+            # Iteratively remove least fit individual of most correlated pair
             while len(components) > self.n_components:
-                worst = np.unravel_index(np.argmax(correlations),
-                                         correlations.shape)
-                worst = worst[np.argmax(np.sum(correlations[worst, :], 1))]
+                most_correlated = np.unravel_index(np.argmax(correlations),
+                                                   correlations.shape)
+                # The correlation matrix is sorted by fitness, so identifying
+                # the least fit of the pair is simply getting the higher index
+                worst = max(most_correlated)
                 components.pop(worst)
                 indices.remove(worst)
                 correlations = correlations[:, indices][indices, :]
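To make the behavioural change concrete, here is a self-contained NumPy sketch of the new pruning loop on synthetic data. The reindexing is simplified with ``np.ix_`` rather than the library's ``indices`` bookkeeping, and ``hall_of_fame=4`` / ``n_components=2`` are illustrative::

    import numpy as np

    rng = np.random.RandomState(0)
    hall_of_fame, n_components = 4, 2

    # Rows stand in for hall-of-fame program outputs, already sorted
    # best-first, so a higher row index always means a less fit program.
    evaluation = rng.uniform(size=(hall_of_fame, 50))

    correlations = np.abs(np.corrcoef(evaluation))
    np.fill_diagonal(correlations, 0.)
    components = list(range(hall_of_fame))
    while len(components) > n_components:
        most_correlated = np.unravel_index(np.argmax(correlations),
                                           correlations.shape)
        # New logic: of the most correlated pair, drop the less fit member
        # (the higher index) so the fittest program always survives. The
        # old logic dropped whichever member correlated more with the rest,
        # which could discard the overall best program.
        worst = max(most_correlated)
        components.pop(worst)
        keep = [i for i in range(correlations.shape[0]) if i != worst]
        correlations = correlations[np.ix_(keep, keep)]

    print(components)  # indices of the surviving, fitness-sorted programs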
66 changes: 55 additions & 11 deletions gplearn/tests/test_genetic.py
@@ -829,7 +829,7 @@ def test_transformer_iterable():
     est.fit(X, y)
     fitted_len = len(est)
     fitted_iter = [gp.length_ for gp in est]
-    expected_iter = [15, 19, 19, 12, 9, 10, 7, 14, 6, 21]
+    expected_iter = [8, 12, 2, 29, 9, 33, 9, 8, 4, 22]

     assert_true(fitted_len == 10)
     assert_true(fitted_iter == expected_iter)
@@ -1022,29 +1022,73 @@ def test_warm_start():
     assert_equal(cold_program, warm_program)


-def test_customizied_regressor_metrics():
-    """Check whether parameter greater_is_better works fine"""
+def test_customized_regressor_metrics():
+    """Check whether greater_is_better works for SymbolicRegressor."""

     x_data = rng.uniform(-1, 1, 100).reshape(50, 2)
     y_true = x_data[:, 0] ** 2 + x_data[:, 1] ** 2

-    est_gp = SymbolicRegressor(metric='mean absolute error', stopping_criteria=0.000001, random_state=415,
-                               parsimony_coefficient=0.001, verbose=0, init_method='full', init_depth=(2, 4))
+    est_gp = SymbolicRegressor(metric='mean absolute error',
+                               stopping_criteria=0.000001, random_state=415,
+                               parsimony_coefficient=0.001, init_method='full',
+                               init_depth=(2, 4))
     est_gp.fit(x_data, y_true)
     formula = est_gp.__str__()
-    assert_equal("add(mul(X1, X1), mul(X0, X0))", formula, True)
+    assert_equal('add(mul(X1, X1), mul(X0, X0))', formula, True)

     def neg_mean_absolute_error(y, y_pred, sample_weight):
         return -1 * mean_absolute_error(y, y_pred, sample_weight)

-    customizied_fitness = make_fitness(neg_mean_absolute_error, greater_is_better=True)
+    customizied_fitness = make_fitness(neg_mean_absolute_error,
+                                       greater_is_better=True)

-    c_est_gp = SymbolicRegressor(metric=customizied_fitness, stopping_criteria=-0.000001, random_state=415,
-                                 parsimony_coefficient=0.001, verbose=0, init_method='full', init_depth=(2, 4))
+    c_est_gp = SymbolicRegressor(metric=customizied_fitness,
+                                 stopping_criteria=-0.000001, random_state=415,
+                                 parsimony_coefficient=0.001, verbose=0,
+                                 init_method='full', init_depth=(2, 4))
     c_est_gp.fit(x_data, y_true)
     c_formula = c_est_gp.__str__()

-    assert_equal("add(mul(X1, X1), mul(X0, X0))", c_formula, True)
+    assert_equal('add(mul(X1, X1), mul(X0, X0))', c_formula, True)
+
+
+def test_customized_transformer_metrics():
+    """Check whether greater_is_better works for SymbolicTransformer."""
+
+    est_gp = SymbolicTransformer(generations=2, population_size=100,
+                                 hall_of_fame=10, n_components=1,
+                                 metric='pearson', random_state=415)
+    est_gp.fit(boston.data, boston.target)
+    for program in est_gp:
+        formula = program.__str__()
+    expected_formula = ('sub(div(mul(X4, X12), div(X9, X9)), '
+                        'sub(div(X11, X12), add(X12, X0)))')
+    assert_equal(expected_formula, formula, True)
+
+    def _neg_weighted_pearson(y, y_pred, w):
+        """Calculate the weighted Pearson correlation coefficient."""
+        with np.errstate(divide='ignore', invalid='ignore'):
+            y_pred_demean = y_pred - np.average(y_pred, weights=w)
+            y_demean = y - np.average(y, weights=w)
+            corr = ((np.sum(w * y_pred_demean * y_demean) / np.sum(w)) /
+                    np.sqrt((np.sum(w * y_pred_demean ** 2) *
+                             np.sum(w * y_demean ** 2)) /
+                            (np.sum(w) ** 2)))
+        if np.isfinite(corr):
+            return -1 * np.abs(corr)
+        return 0.
+
+    neg_weighted_pearson = make_fitness(function=_neg_weighted_pearson,
+                                        greater_is_better=False)
+
+    c_est_gp = SymbolicTransformer(generations=2, population_size=100,
+                                   hall_of_fame=10, n_components=1,
+                                   stopping_criteria=-1,
+                                   metric=neg_weighted_pearson,
+                                   random_state=415)
+    c_est_gp.fit(boston.data, boston.target)
+    for program in c_est_gp:
+        c_formula = program.__str__()
+    assert_equal(expected_formula, c_formula, True)


 if __name__ == "__main__":
