.cv_results_ does not include info from first generation #27

Closed
ClimbsRocks opened this issue Jun 27, 2017 · 12 comments

@ClimbsRocks
Contributor

I think there's a fenceposting/off-by-one error somewhere.

When I pass in generations_number = 1, it's actually 0-indexed, and gives me 2 generations. Similarly, if I pass in 2 generations, I actually get 3.

Then, when I examined the cv_results_ property, I noticed that I only get the results from the generations after the first one (the 0-indexed generation).

This is most apparent if you set generations_number = 1.

I looked through the code quickly, but didn't see any obvious source of it. Hopefully someone who knows the library can find it more easily!
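In case it helps reproduce, here's roughly what I'm running. This is only a minimal sketch: the estimator, dataset, and param grid are placeholders, and the constructor arguments follow the style shown in the README.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from evolutionary_search import EvolutionaryAlgorithmSearchCV

X, y = load_iris(return_X_y=True)
paramgrid = {"kernel": ["rbf"],
             "C": np.logspace(-3, 3, 7),
             "gamma": np.logspace(-3, 3, 7)}

cv = EvolutionaryAlgorithmSearchCV(estimator=SVC(),
                                   params=paramgrid,
                                   scoring="accuracy",
                                   cv=StratifiedKFold(n_splits=3),
                                   population_size=10,
                                   generations_number=1,
                                   verbose=1)
cv.fit(X, y)

# Assuming cv_results_ is a dict of equal-length columns (like GridSearchCV's),
# any one column's length is the number of recorded candidates. It comes out
# smaller than expected because the initial (0-indexed) generation is missing.
n_recorded = len(next(iter(cv.cv_results_.values())))
print(n_recorded)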

@ClimbsRocks
Contributor Author

@rsteca or even @ryanpeach - any thoughts on how to get the first generation included in .cv_results_?

@ryanpeach
Contributor

I'll take a look.

@ryanpeach
Contributor

In the _fit function, where most of the work is done, the history object is updated after the mate+mutate step. As such, there's a trade-off: either we keep it the way it is and lose the first generation's information, or we move it above the mate+mutate step and lose the last generation. Maybe we should have a special "first run" condition which saves the first-generation data.
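To make the trade-off concrete, the control flow is roughly the following. This is only a sketch with trivial stand-in operators, not the actual _fit code; the point is just that if the snapshot only happens after mate+mutate, generation 0 is never captured, and the proposed "first run" save would capture it without dropping the last generation.

import random

history = []  # stand-in for the DEAP History/Statistics objects

def record(pop):
    history.append(list(pop))  # snapshot one generation

def mate_and_mutate(pop):
    return [x + random.choice([-1, 0, 1]) for x in pop]  # stand-in variation step

def evolve(population, generations_number):
    for gen in range(generations_number):
        if gen == 0:
            record(population)  # proposed "first run" save of the initial generation
        offspring = mate_and_mutate(population)
        record(offspring)  # existing save point, after the variation step
        population = sorted(offspring)[:len(population)]  # stand-in selection
    return population

evolve([0, 0, 0], generations_number=1)
print(len(history))  # 2 snapshots: the initial population plus one generation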

@ClimbsRocks
Contributor Author

That sounds good to me. Personally, I'm more interested in the first run than the last run (the first run is where we try all the crazy ideas and see the most variance across different hyperparameter combinations, while the last run is generally a safer, more boring set of combinations we've already tried before).

But I like your idea; it sounds like a pretty simple bit of code to get all the data people would expect. Thanks for finding that!

@ryanpeach
Contributor

Someone test this branch #29

@ryanpeach
Contributor

I basically discovered that we just hadn't included the evaluation step of the population in the history logger. I've now added both the evaluation and selection steps, but they need testing.

@ryanpeach
Contributor

Hey, so I think we have a misunderstanding. cv_results_ does not include "generation information"; it includes all generated individuals from all generations. It's a pretty big table...
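For example, you can dump it into pandas to see the size (assuming cv_results_ follows the same dict-of-columns convention as GridSearchCV.cv_results_; cv here is a fitted EvolutionaryAlgorithmSearchCV):

import pandas as pd

results = pd.DataFrame(cv.cv_results_)
print(results.shape)  # one row per recorded individual, across all generations
print(results.head())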

@ClimbsRocks
Contributor Author

@ryanpeach Yeah, I understand that we're including individuals in .cv_results_, not generation information. But from what I can tell, we're not including any of the individuals from the first generation right now.

I ran into this issue when I ran a pretty small search space that was only two generations, and the second generation was primarily just re-picking candidates from the first generation.

Try setting generations_number=1, and I think you'll see the issue I'm talking about.

Thanks for looking into this! It's a really cool project, and a pretty big improvement over grid search.

@ryanpeach
Contributor

ryanpeach commented Aug 16, 2017

@ClimbsRocks Great, ok just being clear. Wasn't sure.

I'm actually not super familiar with how DEAP works (which is the framework we use). I am following the code referenced here:

http://deap.readthedocs.io/en/master/api/tools.html

history = History()

# Decorate the variation operators
toolbox.decorate("mate", history.decorator)
toolbox.decorate("mutate", history.decorator)

# Create the population and populate the history
population = toolbox.population(n=POPSIZE)
history.update(population)

# Do the evolution, the decorators will take care of updating the
# history
# [...]

import matplotlib.pyplot as plt
import networkx

graph = networkx.DiGraph(history.genealogy_tree)
graph = graph.reverse()     # Make the graph top-down
colors = [toolbox.evaluate(history.genealogy_history[i])[0] for i in graph]
networkx.draw(graph, node_color=colors)
plt.show()

Here, in our _fit:

pop = toolbox.population(n=self.population_size)
hof = tools.HallOfFame(1)

# Stats
stats = tools.Statistics(lambda ind: ind.fitness.values)
stats.register("avg", np.nanmean)
stats.register("min", np.nanmin)
stats.register("max", np.nanmax)

# History
hist = tools.History()
toolbox.decorate("mate", hist.decorator)
toolbox.decorate("mutate", hist.decorator)
hist.update(pop)

And here

idxs, individuals, each_scores = zip(*[
    (idx, indiv, np.mean(indiv.fitness.values))
    for idx, indiv in list(gen.genealogy_history.items())
    if indiv.fitness.valid and not np.all(np.isnan(indiv.fitness.values))
])

Just for reference.

I'm not totally clear on how the history object works, but I think it contains all individuals ever added to the population, and the decorator commands then "decorate" the registered operators so that genealogy information (such as who was selected, or who mated with whom) gets recorded as they run. The evaluation step, I think, is saved in the history automatically.

I'll keep looking I guess, just thinking out loud.
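Here is a tiny standalone experiment I put together to check that understanding (not our code, just DEAP's History in isolation; the toy individual and operators are arbitrary): update() records whatever individuals you hand it, and the decorator records whatever the decorated operator returns, linking each child back to its parents in the genealogy.

from deap import base, creator, tools

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("individual", tools.initRepeat, creator.Individual, lambda: 1, n=3)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("mutate", tools.mutFlipBit, indpb=1.0)

history = tools.History()
toolbox.decorate("mutate", history.decorator)  # offspring are recorded when mutate runs

pop = toolbox.population(n=3)
history.update(pop)  # the starting individuals are recorded here
print(len(history.genealogy_history))  # 3

child, = toolbox.mutate(toolbox.clone(pop[0]))  # the decorated call records the child too
print(len(history.genealogy_history))  # 4
print(history.genealogy_tree[child.history_index])  # the parent's history index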

@ryanpeach
Contributor

@ClimbsRocks

Hey, so I did what you said and I'm just not replicating the results. In the test.ipynb notebook (use my fork), if you set generations_number to 1 you still get some individuals. Note that they won't match the number population_size indicates, because if two individuals are functionally the same they are treated as one (so a population of three "111" individuals just shows up in the history as a single "111" individual). Are you sure you aren't just miscounting?

If you are sure this is still an issue, please provide an example jupyter notebook. Thanks!

@ryanpeach
Contributor

And... now I'm seeing it. I swear it worked just a minute ago...

@ryanpeach
Contributor

Nope, never mind, it works as expected. Here is a link to my notebook:

https://github.com/ryanpeach/sklearn-deap/blob/test_issue27/test.ipynb
