GridSearchCV parallel execution with own scorer freezes #2889

Closed
adverley opened this issue Feb 24, 2014 · 99 comments
@adverley

I have been searching for hours on this problem and can consistently replicate it:

clf = GridSearchCV( sk.LogisticRegression(),
                            tuned_parameters,
                            cv = N_folds_validation,
                            pre_dispatch='6*n_jobs', 
                            n_jobs=4,
                            verbose = 1,
                            scoring=metrics.make_scorer(metrics.scorer.f1_score, average="macro")
                        )

This snippet crashes because of scoring=metrics.make_scorer(metrics.scorer.f1_score, average="macro"), where metrics refers to the sklearn.metrics module. If I comment out the scoring=... line, the parallel execution works. If I want to use the f1 score as the evaluation method, I have to disable parallel execution by setting n_jobs = 1.

Is there a way I can define another scoring method without losing the possibility of parallel execution?

Thanks

@jnothman
Member

This is surprising, so we'll have to work out what the problem is and make sure it works!

Can you please provide a little more detail:

  • What do you mean by "crashes"?
  • What version of scikit-learn is this? If it's 0.14, does it still happen in the current development version?
  • Multiprocessing has platform-specific issues. What platform are you on? (e.g. import platform; platform.platform())
  • Have you tried it on different datasets?

FWIW, my machine has no problem fitting iris with this snippet on the development version of sklearn.

@adverley
Author

Thank you for your fast reply.

By crashing I actually mean freezing. It doesn't continue any more, and there is no more activity to be seen for the Python processes in the Windows Task Manager. The processes are still there and consume a constant amount of RAM but use no processing time.

This is scikit-learn version 0.14, last updated and run using Enthought Canopy.

I am on platform "Windows-7-6.1.7601-SP1".

I will go more into depth by providing a generic example of the problem. I think it has to do with GridSearchCV being placed in a for loop. (To not waste too much of your time, you should probably start at the run_tune_process() method, which is called at the bottom of the code and calls the method containing GridSearchCV() in a for loop.)

Code:

import sklearn.metrics as metrics
from sklearn.grid_search import GridSearchCV
import numpy as np
import os
from sklearn import datasets
from sklearn import svm as sk


def tune_hyperparameters(trainingData, period):
    allDataTrain = trainingData

    # Define hyperparameters and construct a dictionary of them
    amount_kernels = 2
    kernels = ['rbf','linear']
    gamma_range =   10. ** np.arange(-5, 5)
    C_range =       10. ** np.arange(-5, 5)
    tuned_parameters = [
                        {'kernel': ['rbf'],     'gamma': gamma_range , 'C': C_range},
                        {'kernel': ['linear'],  'C': C_range}
                       ]

    print("Tuning hyper-parameters on period = " + str(period) + "\n")

    clf = GridSearchCV( sk.SVC(), 
                        tuned_parameters,
                        cv=5,
                        pre_dispatch='4*n_jobs', 
                        n_jobs=2,
                        verbose = 1,
                        scoring=metrics.make_scorer(metrics.scorer.f1_score, average="macro")
                        )
    clf.fit(allDataTrain[:,1:], allDataTrain[:,0:1].ravel())

    # other code will output some data to files, graphs and will save the optimal model with joblib package


    #   Eventually we will return the optimal model
    return clf

def run_tune_process(hyperparam_tuning_method, trainingData, testData):
    for period in np.arange(0, 100, 10):
        clf = hyperparam_tuning_method(trainingData, period)

        y_real = testData[:, 0:1].ravel()
        y_pred = clf.predict(testData[:, 1:])

# import some data to play with
iris = datasets.load_iris()
X_training = iris.data[0:100,:]  
Y_training = (iris.target[0:100]).reshape(100,1)
trainingset = np.hstack((Y_training, X_training))

X_test = iris.data[100:150,:]  
Y_test = (iris.target[100:150]).reshape(50,1)
testset = np.hstack((Y_test, X_test))

run_tune_process(tune_hyperparameters,trainingset,testset)

Once again, this code works on my computer only when I change n_jobs to 1 or when I don't define a scoring= argument.

@jnothman
Member

Generally multiprocessing in Windows encounters a lot of problems. But I don't know why this should be correlated with a custom metric. There's nothing about the average=macro option in 0.14 that suggests it should be more likely to hang than the default average (weighted). At the development head, this completes in 11s on my macbook, and in 7s at version 0.14 (that's something to look into!)

Are you able to try this out in the current development version, to see if it's still an issue?

@jnothman
Member

(As a side point, @ogrisel, I note there seems to be a lot more joblib parallelisation overhead in master -- on OS X at least -- that wasn't there in 0.14...)

@larsmans
Member

This has nothing to do with custom scorers. This is a well-known feature of Python multiprocessing on Windows: you have to run everything that uses n_jobs=-1 in an if __name__ == '__main__' block or you'll get freezes/crashes. Maybe we should document this somewhere prominently, e.g. in the README?
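
Roughly like this (a minimal sketch of the pattern, reusing the snippet from above; not something I have run on Windows myself):

from sklearn import datasets, svm
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in later versions
from sklearn.metrics import make_scorer, f1_score

def run_search():
    iris = datasets.load_iris()
    scorer = make_scorer(f1_score, average="macro")
    clf = GridSearchCV(svm.SVC(), {'C': [0.1, 1, 10]}, scoring=scorer, n_jobs=-1)
    clf.fit(iris.data, iris.target)
    return clf

if __name__ == '__main__':
    # without this guard, the worker processes re-import the script on Windows
    # and each one starts spawning workers of its own
    run_search()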

@GaelVaroquaux
Member

you have to run everything that uses n_jobs=-1 in an if __name__ == '__main__' block or you'll get freezes/crashes.

Well, the good news is that nowadays joblib gives a meaningful error message on such a crash, rather than a fork bomb.

@larsmans
Member

@GaelVaroquaux does current scikit-learn give that error message? If so, the issue can be considered fixed, IMHO.

@GaelVaroquaux
Member

@GaelVaroquaux does current scikit-learn give that error message? If so, the issue can be considered fixed, IMHO.

It should do. The only way to be sure is to check. I am on the move right now, and I cannot boot up a Windows VM to do that.

@larsmans
Member

I'm not going to install a C compiler on Windows just for this. Sorry, but I really don't do Windows :)

@GaelVaroquaux
Member

I'm not going to install a C compiler on Windows just for this. Sorry, but I really don't do Windows :)

I have a Windows VM. I can check. It's just a question of finding a little bit of time to do it.

@adverley
Author

@larsmans, you are completely right. The custom scorer object was a mistake on my part; the problem indeed lies in multiprocessing on Windows. I tried the same code on Linux and it runs fine.

I don't get any error messages because it doesn't crash, it just stops doing anything meaningful.

@larsmans
Member

@adverley Could you try the most recent version from GitHub on your Windows box?

@amueller
Member

Closing because of lack of feedback; it is probably a known issue that is fixed in newer joblib.

@hirak99

hirak99 commented Mar 27, 2015

Not sure if this is related, but it does seem to be.

On Windows, a custom scorer still freezes. I found this thread on Google, removed the scorer, and the grid search works.

When it freezes, it shows no error message. There are 3 Python processes spawned (because I set n_jobs=3), but CPU utilization remains at 0 for all of them. I am using IPython Notebook.

@amueller
Member

amueller commented Apr 1, 2015

Can you share the code of the scorer? It seems a bit unlikely.

@amueller
Member

amueller commented Apr 1, 2015

Does your scorer use joblib / n_jobs anywhere? It shouldn't, and that could maybe cause problems (though I think joblib should detect that).

@hirak99

hirak99 commented Apr 1, 2015

Sure - here's the full code - http://pastebin.com/yUE26SNs

The scorer function is "score_model"; it doesn't use joblib.

This runs from the command prompt, but not from IPython Notebook. The error message is:
AttributeError: Can't get attribute 'score_model' on <module '__main__' (built-in)>;

Then IPython and all the spawned Python instances become idle, silently, and don't respond to any Python code anymore until I restart them.

@amueller
Member

amueller commented Apr 1, 2015

Fix the attribute error, then it'll work.
Do you do pylab imports in IPython notebook? Otherwise everything should be the same.

@hirak99

hirak99 commented Apr 1, 2015

Well, I do not know what causes the AttributeError. It is most likely related to joblib, since it happens only when n_jobs is more than 1; it runs fine with n_jobs=1.

The error complains about the attribute score_model missing from __main__, whether or not I have an if __name__ == '__main__' guard in the IPython Notebook.

(I realized that the error line was pasted incorrectly above; I have edited the post.)

I don't use pylab.

Here's the full extended error message - http://pastebin.com/23y5uHT2

@amueller
Member

amueller commented Apr 2, 2015

Hmm, that is likely related to issues with multiprocessing on Windows. Maybe @GaelVaroquaux or @ogrisel can help.
I don't know what the notebook makes of __name__ == "__main__".
Try defining the metric not in the notebook but in a separate file, and import it. I'd think that would fix it.
This is not really related to GridSearchCV, but rather to some interesting interaction between Windows multiprocessing, IPython Notebook and joblib.
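
Something like this (an untested sketch; my_scorers.py and the function names are arbitrary, and estimator / param_grid stand in for your own):

# my_scorers.py -- a separate file next to the notebook, not defined inside it
from sklearn.metrics import make_scorer, f1_score

def macro_f1(y_true, y_pred):
    return f1_score(y_true, y_pred, average='macro')

macro_f1_scorer = make_scorer(macro_f1)

and in the notebook:

from my_scorers import macro_f1_scorer

grid = GridSearchCV(estimator, param_grid, scoring=macro_f1_scorer, n_jobs=-1)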

@alwaysandeep

alwaysandeep commented Nov 5, 2016

Guys, thanks for the thread. I should have checked this thread before; I wasted 5 hours of my time on this, trying to run parallel processing. Thanks a lot :)
TO ADD SOME FEEDBACK: it's still freezing. I faced the same issue in the presence of my own make_scorer cost function; my system starts freezing. When I did not use a custom cost function, I did not face these freezes in parallel processing.

@lesteve
Member

lesteve commented Nov 8, 2016

The best way of turning these 5 hours into something useful for the project would be to provide us with a stand-alone example reproducing the problem.

@vosilov

vosilov commented Dec 21, 2016

I was experiencing the same issue on Windows 10 working in Jupyter notebook trying to use a custom scorer within a nested cross-validation and n_jobs=-1. I was getting the AttributeError: Can't get attribute 'custom_scorer' on <module '__main__' (built-in)>; message.
As @amueller suggested, importing the custom scorer instead of defining it in the notebook works.

@martinxtm

I have the exact same problem on OSX 10.10.5

@boazsh

boazsh commented Aug 4, 2017

Same here.
OSX 10.12.5

@jnothman
Member

jnothman commented Aug 6, 2017

Please give a reproducible code snippet. We'd love to get to the bottom of this. It is hard to understand without code, including data, that shows us the issue.

@boazsh

boazsh commented Aug 8, 2017

Just run these lines in a Python shell:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_predict

np.random.seed(1234)
X = np.random.sample((1000, 100))
Y = np.random.sample((1000)) > 0.5
svc_pipeline = Pipeline([('pca', PCA(n_components=95)), ('svc', SVC())])
predictions = cross_val_predict(svc_pipeline, X, Y, cv=30, n_jobs=-1)
print classification_report(Y, predictions)

Note that removing the PCA step from the pipeline solves the issue.

More info:

Darwin-16.6.0-x86_64-i386-64bit
('Python', '2.7.13 (default, Apr 4 2017, 08:47:57) \n[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.38)]')
('NumPy', '1.12.1')
('SciPy', '0.19.1')
('Scikit-Learn', '0.18.2')

@jnothman
Member

jnothman commented Aug 8, 2017 via email

@lesteve
Member

lesteve commented Oct 6, 2017

@KaisJM I think it would be more useful if you started from your freezing script, managed to simplify it, and posted a fully stand-alone snippet that freezes for you.

@KaisJM

KaisJM commented Oct 6, 2017

@lesteve Agreed. I created a new python2 environment like the one I had before installing Gensim. Code ran fine, NO freeze with n_jobs=-1. What's more, Numpy is using OpenBLAS and has the same config as the environment that exhibits the freeze (the one where Gensim was installed). So it seems that openblas is not the cause of this freeze.

numpy.__config__.show()
lapack_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_lapack_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_mkl_info:
  NOT AVAILABLE

@paulaceccon

@KaisJM I'm running the same snippet here (windows) and it freezes.

from sklearn.datasets import make_classification
X, y = make_classification()

from sklearn.ensemble import RandomForestClassifier
clf_rf_params = {
    'n_estimators': [400, 600, 800],
    'min_samples_leaf' : [5, 10, 15],
    'min_samples_split' : [10, 15, 20],
    'criterion': ['gini', 'entropy'],
    'class_weight': [{0: 0.51891309,  1: 13.71835531}]
}

import numpy as np
def ginic(actual, pred):
    actual = np.asarray(actual) # In case, someone passes Series or list
    n = len(actual)
    a_s = actual[np.argsort(pred)]
    a_c = a_s.cumsum()
    giniSum = a_c.sum() / a_s.sum() - (n + 1) / 2.0
    return giniSum / n
 
def gini_normalizedc(a, p):
    if p.ndim == 2:  # Required for sklearn wrapper
        p = p[:,1]   # If proba array contains proba for both 0 and 1 classes, just pick class 1
    return ginic(a, p) / ginic(a, a)

from sklearn import metrics
gini_sklearn = metrics.make_scorer(gini_normalizedc, True, True)  # greater_is_better=True, needs_proba=True

from sklearn.model_selection import GridSearchCV

clf_rf = RandomForestClassifier()
grid = GridSearchCV(clf_rf, clf_rf_params, scoring=gini_sklearn, cv=3, verbose=1, n_jobs=-1)
grid.fit(X, y)

print (grid.best_params_)

I know that it's awkward, but it didn't freeze when running without the custom metric.

@snovik75

I have a similar problem. I have been running the same code and simply wanted to update the model with the new month's data, and it stopped running. I believe sklearn got updated in the meantime to 0.19.

@thomberg1

Running GridSearchCV or RandomizedSearchCV in a loop with n_jobs > 1 would hang silently in Jupyter & IntelliJ:

for trial in tqdm(range(NUM_TRIALS)):
    ...
    gscv = GridSearchCV(estimator=estimator, param_grid=param_grid,
                          scoring=scoring, cv=cv, verbose=1, n_jobs=-1)
    gscv.fit(X_data, y_data)

    ...

Followed @lesteve's recommendation, checked the environment, and removed the numpy installed with pip:

Darwin-16.6.0-x86_64-i386-64bit
Python 3.6.1 |Anaconda custom (x86_64)| (default, May 11 2017, 13:04:09)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.13.1
SciPy 0.19.1
Scikit-Learn 0.19.0

$conda list | grep numpy
gnumpy 0.2 pip
numpy 1.13.1 py36_0
numpy 1.13.3 pip
numpydoc 0.6.0 py36_0

$pip uninstall numpy

$conda list | grep numpy
gnumpy 0.2 pip
numpy 1.13.1 py36_0
numpydoc 0.6.0 py36_0

$conda install numpy -f // most likely unnecessary

$conda list | grep numpy
gnumpy 0.2 pip
numpy 1.13.1 py36_0
numpydoc 0.6.0 py36_0

Fixed my problem.

@thomberg1

@paulaceccon your problem is related to

https://stackoverflow.com/questions/36533134/cant-get-attribute-abc-on-module-main-from-abc-h-py
If you declare the pool prior to declaring the function you are trying to use in parallel it will throw this error. Reverse the order and it will no longer throw this error.

The following will run your code:

import multiprocessing

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')

    from external import *

    from sklearn.datasets import make_classification
    X, y = make_classification()

    from sklearn.ensemble import RandomForestClassifier
    clf_rf_params = {
        'n_estimators': [400, 600, 800],
        'min_samples_leaf' : [5, 10, 15],
        'min_samples_split' : [10, 15, 20],
        'criterion': ['gini', 'entropy'],
        'class_weight': [{0: 0.51891309,  1: 13.71835531}]
    }

    from sklearn.model_selection import GridSearchCV

    clf_rf = RandomForestClassifier()
    grid = GridSearchCV(clf_rf, clf_rf_params, scoring=gini_sklearn, cv=3, verbose=1, n_jobs=-1)
    grid.fit(X, y)

    print (grid.best_params_)

with external.py

import numpy as np
def ginic(actual, pred):
    actual = np.asarray(actual) # In case, someone passes Series or list
    n = len(actual)
    a_s = actual[np.argsort(pred)]
    a_c = a_s.cumsum()
    giniSum = a_c.sum() / a_s.sum() - (n + 1) / 2.0
    return giniSum / n

def gini_normalizedc(a, p):
    if p.ndim == 2:  # Required for sklearn wrapper
        p = p[:,1]   # If proba array contains proba for both 0 and 1 classes, just pick class 1
    return ginic(a, p) / ginic(a, a)

from sklearn import metrics
gini_sklearn = metrics.make_scorer(gini_normalizedc, True, True)

Results running on 8 cores

Fitting 3 folds for each of 54 candidates, totalling 162 fits
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 7.1s
[Parallel(n_jobs=-1)]: Done 162 out of 162 | elapsed: 30.5s finished
{'class_weight': {0: 0.51891309, 1: 13.71835531}, 'criterion': 'gini', 'min_samples_leaf': 10, 'min_samples_split': 20, 'n_estimators': 400}

@xtosis

xtosis commented Feb 12, 2018

The issue is still there, guys. I am using a custom scorer and it keeps going forever when I set n_jobs to anything. When I don't specify n_jobs at all it works fine, but otherwise it freezes.

@lesteve
Member

lesteve commented Feb 12, 2018

Can you provide a stand-alone snippet to reproduce the problem? Please read https://stackoverflow.com/help/mcve for more details.

@paulaceccon

Still facing this problem with the same sample code.

Windows-10-10.0.15063-SP0
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)]
NumPy 1.14.1
SciPy 1.0.0
Scikit-Learn 0.19.1

@glemaitre
Member

Can you provide a stand-alone snippet to reproduce the problem? Please read https://stackoverflow.com/help/mcve for more details.

@jnothman
Member

jnothman commented Mar 18, 2018 via email

@chi18000

chi18000 commented Apr 11, 2018

I tested the code in thomberg1's #2889 (comment).

OS: Windows 10 x64 10.0.16299.309
Python package: WinPython-64bit-3.6.1
numpy (1.14.2)
scikit-learn (0.19.1)
scipy (1.0.0)

It worked fine in Jupyter Notebook and command-line.

@siideffect

siideffect commented Apr 20, 2018

Hi, I'm having the same issue, so I did not want to open a new one, which would lead to an almost identical thread.

  • macOS
  • Anaconda
  • scikit-learn 0.19.1
  • scipy 1.0.1
  • numpy 1.14.2

# MLP for Pima Indians Dataset with grid search via sklearn
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
import numpy

# Function to create model, required for KerasClassifier
def create_model(optimizer='rmsprop', init='glorot_uniform'):
  # create model
  model = Sequential()
  model.add(Dense(12, input_dim=8, kernel_initializer=init, activation='relu'))
  model.add(Dense(8, kernel_initializer=init, activation='relu'))
  model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))
  # Compile model
  model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
  return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]


# create model
model = KerasClassifier(build_fn=create_model, verbose=0)
# grid search epochs, batch size and optimizer
optimizers = ['rmsprop', 'adam']
init = ['glorot_uniform', 'normal', 'uniform']
epochs = [50, 100, 150]
batches = [5, 10, 20]
param_grid = dict(optimizer=optimizers, epochs=epochs, batch_size=batches, init=init)
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
  print("%f (%f) with: %r" % (mean, stdev, param))

Code is from a tutorial : https://machinelearningmastery.com/use-keras-deep-learning-models-scikit-learn-python/
I tried changing the n_jobs parameter to 1, -1, but neither of these worked. Any hint?

@thomberg1

It runs if I add the multiprocessing import and the if statement as shown below - I don't work with Keras so I don't have more insight.

import multiprocessing

if __name__ == '__main__':

    # MLP for Pima Indians Dataset with grid search via sklearn
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.wrappers.scikit_learn import KerasClassifier
    from sklearn.model_selection import GridSearchCV
    import numpy

    # Function to create model, required for KerasClassifier
    def create_model(optimizer='rmsprop', init='glorot_uniform'):
      # create model
      model = Sequential()
      model.add(Dense(12, input_dim=8, kernel_initializer=init, activation='relu'))
      model.add(Dense(8, kernel_initializer=init, activation='relu'))
      model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))
      # Compile model
      model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
      return model

    # fix random seed for reproducibility
    seed = 7
    numpy.random.seed(seed)
    # load pima indians dataset
    dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
    # split into input (X) and output (Y) variables
    X = dataset[:,0:8]
    Y = dataset[:,8]


    # create model
    model = KerasClassifier(build_fn=create_model, verbose=0)
    # grid search epochs, batch size and optimizer
    optimizers = ['rmsprop', 'adam']
    init = ['glorot_uniform', 'normal', 'uniform']
    epochs = [5]
    batches = [5, 10, 20]
    param_grid = dict(optimizer=optimizers, epochs=epochs, batch_size=batches, init=init)
    grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=12, verbose=1)
    grid_result = grid.fit(X, Y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
      print("%f (%f) with: %r" % (mean, stdev, param))

Fitting 3 folds for each of 18 candidates, totalling 54 fits

[Parallel(n_jobs=12)]: Done 26 tasks | elapsed: 18.4s
[Parallel(n_jobs=12)]: Done 54 out of 54 | elapsed: 23.7s finished
Best: 0.675781 using {'batch_size': 5, 'epochs': 5, 'init': 'glorot_uniform', 'optimizer': 'adam'}
0.621094 (0.036225) with: {'batch_size': 5, 'epochs': 5, 'init': 'glorot_uniform', 'optimizer': 'rmsprop'}
0.675781 (0.006379) with: {'batch_size': 5, 'epochs': 5, 'init': 'glorot_uniform', 'optimizer': 'adam'}
...
0.651042 (0.025780) with: {'batch_size': 20, 'epochs': 5, 'init': 'uniform', 'optimizer': 'adam'}


version info if needed
sys 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 12:04:33)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
numpy 1.14.2
pandas 0.22.0
sklearn 0.19.1
torch 0.4.0a0+9692519
IPython 6.2.1
keras 2.1.5

compiler : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
system : Darwin
release : 17.5.0
machine : x86_64
processor : i386
CPU cores : 24
interpreter: 64bit

@siideffect

siideffect commented Apr 30, 2018

Thank you @thomberg1 , but adding

import multiprocessing
if __name__ == '__main__':

did not help. The problem is still the same

@byrony

byrony commented May 20, 2018

Same problem on my machine when using a customized scoring function in GridSearchCV.
python 3.6.4,
scikit-learn 0.19.1,
Windows 10,
CPU cores: 24

@amueller
Member

@byrony can you provide code to reproduce? did you use if __name__ == "__main__"?

@Pazitos10

Pazitos10 commented May 25, 2018

I've experienced a similar problem multiple times on my machine when using n_jobs=-1 or n_jobs=8 as an argument for GridSearchCV, but with the default scorer argument.

  • Python 3.6.5,
  • scikit-learn 0.19.1,
  • Arch Linux,
  • CPU cores: 8.

Here is the code I used:

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.utils import shuffle
from sklearn.neural_network import MLPClassifier
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


def main():
    
    df = pd.read_csv('../csvs/my_data.csv', nrows=4000000)    
    
    X = np.array(list(map(lambda a: np.fromstring(a[1:-1] , sep=','), df['X'])))
    y = np.array(list(map(lambda a: np.fromstring(a[1:-1] , sep=','), df['y'])))
    
    scalerX = MinMaxScaler()
    scalerY = MinMaxScaler()
    X = scalerX.fit_transform(X)
    y = scalerY.fit_transform(y)
   
    grid_params = {
        'beta_1': [ .1, .2, .3, .4, .5, .6, .7, .8, .9 ],
        'activation': ['identity', 'logistic', 'tanh', 'relu'],
        'learning_rate_init': [0.01, 0.001, 0.0001]
    }
    
    estimator = MLPClassifier(random_state=1, 
                              max_iter=1000, 
                              verbose=10,
                              early_stopping=True)
    
    gs = GridSearchCV(estimator, 
                      grid_params, 
                      cv=5,
                      verbose=10, 
                      return_train_score=True,
                      n_jobs=8)
    
    X, y = shuffle(X, y, random_state=0)
    
    y = y.astype(np.int16)    

    gs.fit(X, y.ravel())
    
    print("GridSearchCV Report \n\n")
    print("best_estimator_ {}".format(gs.best_estimator_))
    print("best_score_ {}".format(gs.best_score_))
    print("best_params_ {}".format(gs.best_params_))
    print("best_index_ {}".format(gs.best_index_))
    print("scorer_ {}".format(gs.scorer_))
    print("n_splits_ {}".format(gs.n_splits_))
    
    print("Exporting")
    results = pd.DataFrame(data=gs.cv_results_)
    results.to_csv('../csvs/gs_results.csv')


if __name__ == '__main__':
    main()

I know it is a big dataset, so I expected it to take some time to get results, but after 2 days of running it just stopped working (the script keeps executing but is not using any resources apart from RAM and swap).

[screenshots: 2018-05-25 17-53-11, 2018-05-25 17-54-59]

Thanks in advance!

@byrony

byrony commented May 26, 2018

@amueller I didn't use the if __name__ == "__main__" guard. Below is my code; it only works when n_jobs=1.

def neg_mape(true, pred):
    true, pred = np.array(true)+0.01, np.array(pred)
    return -1*np.mean(np.absolute((true - pred)/true))

xgb_test1 = XGBRegressor(
    #learning_rate =0.1,
    n_estimators=150,
    max_depth=3,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective= 'reg:linear',
    nthread=4,
    scale_pos_weight=1,
    seed=123,
)

param_test1 = {
    'learning_rate':[0.01, 0.05, 0.1, 0.2, 0.3],
}

gsearch1 = GridSearchCV(estimator = xgb_test1, param_grid = param_test1, scoring=neg_mape, n_jobs=4, cv = 5)

@amueller
Member

You're using XGBoost. I don't know what they do internally; it's very possible that's the issue. Can you try to see if adding the if __name__ guard helps?
Otherwise I don't think there's a fix for that yet.
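
For reference, roughly like this (an untested sketch; wrapping with make_scorer is also needed because the scoring argument expects a scorer object rather than a plain metric function, and X_train / y_train stand in for your data):

from sklearn.metrics import make_scorer

if __name__ == "__main__":
    gsearch1 = GridSearchCV(estimator=xgb_test1, param_grid=param_test1,
                            scoring=make_scorer(neg_mape), n_jobs=4, cv=5)
    gsearch1.fit(X_train, y_train)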

@amueller
Member

@Pazitos10 can you reproduce with synthetic data and/or smaller data? I can't reproduce without your data, and it would be good to reproduce it in a shorter time.

@Pazitos10

Pazitos10 commented May 26, 2018

@amueller Ok, I will run it again with 500k rows and will post the results. Thanks!

@Pazitos10

@amueller, running the script with 50k rows works as expected. The script ends correctly, showing the results as follows (sorry, I meant 50k not 500k):

[screenshots: 2018-05-26 13-09-00, 2018-05-26 13-09-51]

The problem is that I don't know if these results are going to be the best for my whole dataset. Any advice?

@amueller
Member

Seems like you're running out of RAM. Maybe try using Keras instead; it's likely a better solution for large-scale neural nets.
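
Something along these lines, roughly (an untested sketch of a comparable MLP in Keras; the layer sizes, output layer and training settings are placeholders, and X / y are your scaled arrays from above):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=X.shape[1]))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# trains in mini-batches; no joblib worker processes involved
model.fit(X, y.ravel(), epochs=10, batch_size=256, validation_split=0.1)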

@Pazitos10

@amueller Oh, ok. I will try using Keras instead. Thank you again!

@PGTBoos

PGTBoos commented Oct 9, 2018

This has nothing to do with custom scorers. This is a well-known feature of Python multiprocessing on Windows: you have to run everything that uses n_jobs=-1 in an if __name__ == '__main__' block or you'll get freezes/crashes. Maybe we should document this somewhere prominently, e.g. in the README?

Would it perhaps be an idea for scikit-learn, in the case of Windows, to alter the function and use queues to feed tasks to a collection of worker processes and collect the results?
As described here: https://docs.python.org/2/library/multiprocessing.html#windows
and for 3.6 here: https://docs.python.org/3.6/library/multiprocessing.html#windows
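
Just to illustrate the pattern from those docs (a rough sketch; fit_one and param_grid are placeholders for whatever fits and scores a single candidate and for the list of candidate settings):

import multiprocessing as mp

def worker(task_queue, result_queue):
    # pull parameter settings until the 'STOP' sentinel arrives
    for params in iter(task_queue.get, 'STOP'):
        result_queue.put(fit_one(params))

if __name__ == '__main__':
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(tasks, results)) for _ in range(4)]
    for p in procs:
        p.start()
    for params in param_grid:
        tasks.put(params)
    for _ in procs:
        tasks.put('STOP')
    scores = [results.get() for _ in range(len(param_grid))]
    for p in procs:
        p.join()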

@amueller
Member

@PGTBoos this is fixed in scikit-learn 0.20.0
