Predictions from RandomForestClassifier completely unstable between different machines #7366
Comments
You didn't set `random_state`?
@amueller No, I used the default option. My configuration is `RandomForestClassifier(n_jobs=1, n_estimators=200, warm_start=True)`. Should I use `random_state`?
Yes, otherwise it's not deterministic, even on the same machine.
@amueller OK, I will try it and send feedback. Should I use a specific value, e.g. `random_state=1`, tune it with grid search, or is it irrelevant to the final result? Also, `LinearSVC` and `LogisticRegression` have the same parameter; I also used the default option there, but the results are the same. Should I feel lucky, or is it less important for these classifiers?
Just use a fixed value. On linear models it's less important, and not all solvers use it.
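To illustrate the advice above, here is a minimal sketch (synthetic data and hyperparameters are illustrative, not from the issue) showing that a fixed `random_state` makes two fits on the same machine identical:

```python
# Sketch: fixing random_state so repeated fits are deterministic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Any fixed integer works; the value itself does not need tuning.
clf_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
clf_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

assert np.array_equal(clf_a.predict(X), clf_b.predict(X))
```

Note this only guarantees determinism within one environment; it does not by itself rule out cross-machine differences.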
Hi there @amueller, I tried to solve the issue using the following configuration (with `random_state`) as you suggested, first setting the `rs` variable to a fixed integer, and also using a numpy `RandomState`. The predictions in both cases are still different between machines, and are always the same on a single machine: TEST RESULTS (LAPTOP)
TEST RESULTS (VM)
I'm just copying the saved files from the VM to my laptop and loading them using `joblib.load()`.
Oh sorry, I didn't read correctly. So you pickle on one machine and unpickle on the other?
@amueller Yes, that's what I want to do: save with joblib on the VM and load the classifier on my laptop. Both machines seem to be 64-bit... The VM returns:
My laptop returns:
Wow, that is strange. Ping @jmschrei. Can you print `rf.estimators_[0].tree_.values_` on both machines? Also, can you please provide the full code for training, predicting, storing and loading? Thanks.
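For reference, a sketch of inspecting a fitted tree's node arrays (note the public attribute is actually `tree_.value`, singular with no trailing underscore — the spelling in the comment above raises an `AttributeError`); the dataset here is synthetic and illustrative:

```python
# Sketch: dump the first tree's node statistics so they can be
# compared across machines.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)
rf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

tree = rf.estimators_[0].tree_
print(tree.node_count)      # number of nodes in the first tree
print(tree.threshold[:5])   # split thresholds (float comparisons are the
                            # usual suspect for cross-machine differences)
print(tree.value[:5])       # per-node class counts
```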
I placed the following command in the validation code:
Your suggestion, if I understood it correctly,
gave me: `AttributeError: 'sklearn.tree.tree.Tree' object has no attribute 'values'`. So with my own code the printout is as follows: VM
LAPTOP
I'm really sorry, but I would like to keep the code private :)
Are the values identical?
It'd be really useful if you could look at the full tree, which would let you see the overall structure, thresholds, and values together easily. It seems like your values are the same, though. Here is how to plot your trees: http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html Hopefully it shouldn't make a huge difference, but can you make sure that your dataset is 64-bit floats in both cases by explicitly casting it before the prediction step?
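A minimal sketch of the two suggestions above — exporting a tree for visual inspection and casting the input to 64-bit floats before predicting (file name and data are illustrative placeholders):

```python
# Sketch: export the first tree to Graphviz format and cast inputs
# to float64 so both machines see identical dtypes at predict time.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

X, y = make_classification(n_samples=100, random_state=0)
rf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Write the first tree as a .dot file (render with `dot -Tpng tree.dot`).
export_graphviz(rf.estimators_[0], out_file="tree.dot")

# Explicitly cast before prediction.
X64 = np.asarray(X, dtype=np.float64)
preds = rf.predict(X64)
```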
I suppose another possible issue is that joblib has a bug. Can you try training your model, evaluating the testing/validation data on the VM, saving the model, loading the model back up on your VM, and checking that the results are the same?
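The round-trip check suggested above can be sketched like this (synthetic data; on scikit-learn 0.17 joblib was imported as `sklearn.externals.joblib`):

```python
# Sketch: dump and reload on the same machine and verify the
# predictions are bit-for-bit identical.
import joblib  # sklearn.externals.joblib on 0.17-era releases
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)
rf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
before = rf.predict(X)

joblib.dump(rf, "rf.pkl")
after = joblib.load("rf.pkl").predict(X)

# If this fails on one machine alone, persistence itself is the problem;
# if it passes but cross-machine results differ, suspect the environment.
assert np.array_equal(before, after)
```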
@jmschrei This is what I actually do. First I run a Python file (File 1) on the VM, which has the following code:
Then I run another Python file (File 2) on both the VM and my laptop, which has the following code:
Both the VM and my laptop run the same piece of code (File 2) to print the evaluation report, and they produce the different results I already mentioned. The whole project is in a Git repository, so both machines run exactly the same code. I follow the same process with LogisticRegression and LinearSVC and have had no such problem so far... Thanks for your replies!
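The actual code was kept private, but a minimal runnable sketch of the two-file workflow described above might look like this, with synthetic data standing in for the real dataset and all file names hypothetical:

```python
# Sketch of the train-then-evaluate-elsewhere workflow.
import joblib  # sklearn.externals.joblib on 0.17-era releases
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=500, random_state=0)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

# "File 1" (runs on the VM): train and persist the model.
rf = RandomForestClassifier(n_jobs=1, n_estimators=200,
                            warm_start=True, random_state=1)
rf.fit(X_train, y_train)
joblib.dump(rf, "rf_model.pkl")

# "File 2" (runs on both machines): load the model and print the report.
loaded = joblib.load("rf_model.pkl")
print(classification_report(y_test, loaded.predict(X_test)))
```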
The line where you load the model implies the files are local — so you copy the saved classifier files from the VM to your laptop?
@notmatthancock Yes, I download the classifier files from the VM and place them in the models directory on my laptop.
Try comparing the attributes of the underlying trees, e.g. `t = clf.estimators_[0].tree_`, on both machines.
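A sketch of that comparison (here the two models are a fitted estimator and its reloaded copy, so they are trivially equal; in the real scenario each side would be a model loaded on a different machine):

```python
# Sketch: compare the fitted tree attributes of two models node by node.
import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)
rf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
joblib.dump(rf, "rf_cmp.pkl")

a, b = joblib.load("rf_cmp.pkl"), rf
for est_a, est_b in zip(a.estimators_, b.estimators_):
    ta, tb = est_a.tree_, est_b.tree_
    assert np.array_equal(ta.feature, tb.feature)      # split features
    assert np.allclose(ta.threshold, tb.threshold)     # split thresholds
    assert np.allclose(ta.value, tb.value)             # per-node counts
```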
@notmatthancock, I am having the same issue. What would you do if you saw differences in the `t` values as described above? In my scenario I have trained on vm1, dumped the model to the filesystem of vm1, then loaded the model back into a notebook on vm1. At this point I get exactly the same predictions that were made right after the model was first fit in memory on vm1. However, when I take the model from vm1 and put it into a Docker container that runs on vm1, or onto another set of VMs running on k8s nodes, I get vastly different predictions than when the model is loaded into an ipynb.
We would need a reproducer to be able to answer here. I can think of some potential architecture issues, and maybe some tie-breaking that behaves differently as well.
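A self-contained reproducer for this kind of report could be as small as the sketch below: fixed synthetic data, a fixed seed, and a checksum of the predictions that each environment can print and compare (everything here is illustrative):

```python
# Sketch: a cross-machine reproducer that fingerprints the predictions.
import hashlib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

digest = hashlib.md5(rf.predict(X).astype(np.int64).tobytes()).hexdigest()
print(digest)  # should match on every machine if behaviour is identical
```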
Text classification - Entity Recognition
I am currently running some text classification experiments using windows of word embeddings plus some hand-crafted features. I have already used LinearSVC and LogisticRegression, and I want to finalize my experiments with RandomForestClassifier.
Use of VM - Dataset (train,validation,test) - Weird behaviour
Because my datasets are pretty big (around 15-35 GB), I do my training on a VM and then produce a classification report on my test set on my laptop.
So far so good with LinearSVC and LogisticRegression, but RandomForestClassifier has pretty weird behaviour.
After training the classifier, I produce a classification report over a validation set (the last 20% of my initial samples). Of course I don't use those samples during training, so we have 20% of fresh samples to measure the classifier. With RandomForestClassifier I get the following report:
VALIDATION RESULTS
Then I downloaded the classifier, loaded it, and ran the classification report over the test set on my laptop, which gives:
TEST RESULTS (LAPTOP)
This doesn't look good at all. There should be no way to get such different results between validation and test from the same dataset (both unseen during training). So I tried to think about what could be going wrong from a theoretical point of view, and couldn't find anything that would make such a great difference. So I ran the same test on the VM I use for the training experiments.
TEST RESULTS (VM)
So it seems (in my case?) RandomForestClassifier behaves really differently from one machine to another. Every other classifier (SVC, LR) gives the same validation report on both machines.
I would be grateful for an explanation of this strange behavior so I can fix the issue. Maybe I'm missing something...
Extra Information
VM :
OS: Ubuntu 14.04 LTS
Python 3.5
scikit-learn 0.17.1
numpy 1.11.1
scipy 0.17.1
LAPTOP :
OS: Ubuntu 16.04 LTS
Python 3.5
scikit-learn 0.17.1
numpy 1.11.1
scipy 0.17.1
RandomForestClassifier(n_jobs=1, n_estimators=200, warm_start=True)
I was using n_jobs=-1, but then fixed it to 1 to be sure the problem wasn't caused by the multi-thread difference (4 threads vs 8 threads).
I'm saving - loading classifiers as suggested on scikit-learn.org ( http://scikit-learn.org/stable/modules/model_persistence.html )
Default scikit-learn classification_report (metrics.classification_report)
LinearSVC and LogisticRegression are each saved as 1 .pkl file and 4 extra .npy files. RandomForestClassifier is saved as 1 .pkl file and 801(!!!) extra .npy files.