Random seed and model stability #114

Open
Antoine-Cate opened this issue Jan 19, 2017 · 16 comments

@Antoine-Cate
Contributor

Hi everyone,

As everyone has seen, the random seed can have a significant effect on the prediction scores. This is because most of us are using algorithms with a random component (e.g., random forest, extra trees).
The effect is probably amplified by the fact that the dataset we are working on is small and non-stationary.

Matt has been handling this by testing a series of random seeds and taking the best. This avoids discarding a model just because of a "bad" random seed. However, it might favor the most unstable models: a very stable model will yield scores in a small range across several random seeds, while an unstable model will yield a wide range of scores. So an unstable model is likely to get a very high score if enough random seeds are tested, but that does not mean it will be good at predicting new test data.

A possible solution would be to test 10 (or another number of) random seeds and take the median score as the prediction score. It would require us to include this directly in our scripts to avoid further work for Matt: we could just make 10 predictions using 10 random seeds and export them in a single CSV file.
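
A minimal sketch of what that export could look like (assuming a scikit-learn style model and pandas; make_model, the column names, and X_train/y_train/X_test are hypothetical stand-ins for each team's own code):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical helper: build the submitted model with its own hyperparameters.
def make_model(seed):
    return RandomForestClassifier(n_estimators=100, random_state=seed, n_jobs=-1)

predictions = {}
for seed in range(10):
    np.random.seed(seed)
    clf = make_model(seed)
    clf.fit(X_train, y_train)                                   # training wells
    predictions['seed_{}'.format(seed)] = clf.predict(X_test)   # blind wells

# One CSV with a column of predicted facies per seed.
pd.DataFrame(predictions).to_csv('predictions_10_seeds.csv', index=False)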

What do you guys (and especially Matt) think about that?

@geckya

geckya commented Jan 19, 2017

Great suggestion!

@LukasMosser
Contributor

I think it's a great suggestion. I've been seeing this as well in our attempts.
Would 10 be enough to be significant? What's the cost of running 100 predictions? I assume not too much.

@kwinkunks
Member

I agree, a good suggestion.

For most models, it wouldn't take much for me to implement this, and usually the model instantiation is separate enough from the CV workflow that I can make it fast enough to do many realizations. But some models have been tricky to reproduce, either because of the workflow or because of how the seeds are set (e.g., we had trouble getting seeds to work properly with Keras/TensorFlow).

I'll take a look at the top 3 now, since I know I can work with those, and report back.

@kwinkunks
Member

kwinkunks commented Jan 20, 2017

OK, here's a description of 100 realizations for the current HouMath model:

[images: distribution of validation scores over the 100 realizations]

Median: 0.619

So it looks like, indeed, my lazy 'method' was rather favourable. In my defence, I used the same approach for everyone, so I hope there's been no unfairness. Either way, I'll take a look at some more models now.

To make sure they see this conversation, I'll cc @dalide @ar4 @bestagini @gccrowther @lperozzi @thanish @mycarta @alexcombessie

@Antoine-Cate
Contributor Author

I guess a standard deviation of 0.007 would not be a big deal in an industrial application (the number of misclassifications does not change dramatically). But given how close to each other we are in the contest, it is significant.
Thanks @kwinkunks!

@kwinkunks
Member

Result from @ar4's submission:

[image: distribution of validation scores over the realizations]

@bestagini
Contributor

Hi everybody!

I also agree that taking the average, median, or some other statistic over multiple random seeds could be a good option.

This could also solve another problem: working with Keras (TensorFlow or Theano backend), I am having issues fixing a given seed for reproducibility. Hopefully, averaged results will be more representative of the proposed method.
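
For reference, the kind of seed-setting usually attempted with a TensorFlow backend looks roughly like this (a sketch assuming TF 1.x, current at the time; in practice it does not guarantee fully reproducible Keras runs, which is part of the problem described above):

import random
import numpy as np
import tensorflow as tf

seed = 0
random.seed(seed)         # Python's built-in RNG
np.random.seed(seed)      # NumPy's global RNG
tf.set_random_seed(seed)  # TF 1.x graph-level random seed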

@kwinkunks
Member

@bestagini Ah yes, of course the 'unreproducible' results are dealt with here too... I was thinking they'd be a problem, but of course that's the whole point: this is exactly the problem we're fixing :)

Here's the same treatment of your own entry:

[image: distribution of validation scores over the realizations]

So implementing this will indeed change the order of the 2nd and 3rd entries, as things stand.

Side note to @alexcombessie — I can't reproduce your workflow, so I only have your submission to go on. I will have another crack at it. @Antoine-Cate I am working on yours now.

@kwinkunks
Member

kwinkunks commented Jan 20, 2017

Here's geoLEARN's result:

[image: distribution of validation scores over the realizations]

cc @Antoine-Cate @lperozzi @mablou

Rather than soaking up this thread, maybe I'll just start putting the validation scores (all realizations) into another folder, so everyone can see the data etc. Stay tuned.

Fearing for the rest of my day, I might adopt the following strategy:

  • Do more or less what I've been doing until 31 January, perhaps without searching too hard for a maximum.
  • If you want a 'stochastic score', please make the realizations explicitly and give me all the ones you want me to score, in a CSV or NumPy file or similar.
  • On 1 February, I will validate the final top (5? 10? I guess it depends how close people are at the end) in the way I've done here (see below).
  • This might mean that some scores will change after the contest closes.

For the record, here's how I'm getting the realizations (generic example):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

y_pred = []
for seed in range(100):
    np.random.seed(seed)
    # <hyperparams> stands for the entry's own hyperparameters, elided here
    clf = RandomForestClassifier(random_state=seed, n_jobs=-1)
    clf.fit(X, y)
    y_pred.append(clf.predict(X_test))
    print('.', end='')
np.save('100_realizations.npy', y_pred)
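
And a sketch of how a saved realizations file like that might then be summarized (assuming the blind labels are available as y_blind, and using micro-averaged F1 as a stand-in for the contest's scoring function):

import numpy as np
from sklearn.metrics import f1_score

y_pred = np.load('100_realizations.npy')   # shape: (n_realizations, n_samples)
scores = np.array([f1_score(y_blind, p, average='micro') for p in y_pred])

print('median: {:.3f}  mean: {:.3f}  std: {:.3f}'.format(
    np.median(scores), scores.mean(), scores.std()))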

@ar4
Contributor

ar4 commented Jan 20, 2017

Excellent idea. One thing to consider is how to handle cases (such as my own) where there are two fits (PE, and then Facies). Would two loops be a good approach: an outer loop of 10 iterations that picks the seed for the PE fit, and an inner loop of another 10 iterations around the Facies fit? Edit: Or one outer loop of 100 iterations that picks two random seeds each time.

@kwinkunks
Member

kwinkunks commented Jan 20, 2017

@ar4 Just to be clear: I'm just getting the results from 100 seeds, and averaging the scores those results achieve. So there's no optimization going on. You probably got this, just checking :)

I made a new workflow, bringing the PE part into the seed-setting loop in a super hacky way. I checked this in so you can see it HERE... Please check it!

The score is now like this:

[image: distribution of validation scores over the realizations]

Did I understand what you were asking??

@kwinkunks
Member

By the way everyone, the results from realizations are now in the Stochastic_validations directory.

@ar4
Contributor

ar4 commented Jan 20, 2017

Ah, I wondered for a moment why you wanted to clarify that there was no optimization going on, and finally realised that my choice of the phrase "picks the seed" was problematic. Now that would be an example of overfitting! ;-) (I just meant that a new seed is picked/set for each loop iteration.)

Your modification seems to be approximately my second proposal (one outer loop), but I see you use the same seed for both the PE and Facies steps. It's probably not a problem, but two random seeds - one for each - seems like it might be a bit safer.

@kwinkunks
Member

I ran it again keeping the same loop, but giving the PE generator seed+100. So it's not the same as the Latin square arrangement you were thinking of, but it should nonetheless be a better answer. I think. Right?
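
In loop form, that offset scheme is roughly the following (a sketch; fit_pe and fit_facies, and the train/test objects, are hypothetical stand-ins for the two steps of the actual workflow):

import numpy as np

y_pred = []
for seed in range(100):
    np.random.seed(seed)
    pe_model = fit_pe(train, random_state=seed + 100)               # hypothetical PE regression step
    facies_model = fit_facies(train, pe_model, random_state=seed)   # hypothetical facies classification step
    y_pred.append(facies_model.predict(test))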

[image: distribution of validation scores over the realizations]

@LukasMosser
Contributor

Since this was referenced here: what our team was trying to show is that using an "ensemble" of results would give an improved result. Full credit to the top four; our score will stay at 0.568 (until we submit our own work). We did expect this "meta-submission" to perform better than it did, though.
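
For anyone curious, a per-sample majority vote across submissions might look something like this (a sketch; the file names and the Facies column name are hypothetical):

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical submission files to combine.
files = ['team_a.csv', 'team_b.csv', 'team_c.csv', 'team_d.csv']
preds = np.column_stack([pd.read_csv(f)['Facies'].values for f in files])

# Majority vote, sample by sample, across the submissions.
vote, _ = stats.mode(preds, axis=1)
pd.DataFrame({'Facies': vote.ravel()}).to_csv('ensemble_submission.csv', index=False)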

@kwinkunks
Member

kwinkunks commented Jan 24, 2017

FYI all, this is the stochastic result of the new leader, SHandPR:

[image: distribution of validation scores over the realizations]
