GridSearchCV parallel execution with own scorer freezes #2889

Closed
adverley opened this issue Feb 24, 2014 · 99 comments
@adverley

I have been searching for hours on this problem and can consistently replicate it:

clf = GridSearchCV( sk.LogisticRegression(),
                            tuned_parameters,
                            cv = N_folds_validation,
                            pre_dispatch='6*n_jobs', 
                            n_jobs=4,
                            verbose = 1,
                            scoring=metrics.make_scorer(metrics.scorer.f1_score, average="macro")
                        )

This snippet crashes because of scoring=metrics.make_scorer(metrics.scorer.f1_score, average="macro"), where metrics refers to the sklearn.metrics module. If I comment out the scoring=... line, the parallel execution works. If I want to use the f1 score as the evaluation method, I have to disable parallel execution by setting n_jobs = 1.

Is there a way I can define another scoring method without losing the possibility of parallel execution?

Thanks

@jnothman
Member

This is surprising, so we'll have to work out what the problem is and make sure it works!

Can you please provide a little more detail:

  • What do you mean by "crashes"?
  • What version of scikit-learn is this? If it's 0.14, does it still happen in the current development version?
  • Multiprocessing has platform-specific issues. What platform are you on? (e.g. import platform; platform.platform())
  • Have you tried it on different datasets?

FWIW, my machine has no problem fitting iris with this snippet on the development version of sklearn.

@adverley
Author

Thank you for your fast reply.

By crashing I actually mean freezing. It doesn't continue any more, and there is no more activity to be seen for the Python processes in the Windows Task Manager. The processes are still there and consume a constant amount of RAM but use no processing time.

This is scikit-learn version 0.14, last updated and run using Enthought Canopy.

I am on platform "Windows-7-6.1.7601-SP1".

I will go more into depth by providing a generic example of the problem. I think it has to do with GridSearchCV being placed in a for loop. (To not waste too much of your time, you should probably start at the run_tune_process() method, which is called at the bottom of the code and calls the method containing GridSearchCV() in a for loop.)

Code:

import sklearn.metrics as metrics
from sklearn.grid_search import GridSearchCV
import numpy as np
import os
from sklearn import datasets
from sklearn import svm as sk


def tune_hyperparameters(trainingData, period):
    allDataTrain = trainingData

    # Define hyperparameters and construct a dictionary of them
    amount_kernels = 2
    kernels = ['rbf','linear']
    gamma_range =   10. ** np.arange(-5, 5)
    C_range =       10. ** np.arange(-5, 5)
    tuned_parameters = [
                        {'kernel': ['rbf'],     'gamma': gamma_range , 'C': C_range},
                        {'kernel': ['linear'],  'C': C_range}
                       ]

    print("Tuning hyper-parameters on period = " + str(period) + "\n")

    clf = GridSearchCV( sk.SVC(), 
                        tuned_parameters,
                        cv=5,
                        pre_dispatch='4*n_jobs', 
                        n_jobs=2,
                        verbose = 1,
                        scoring=metrics.make_scorer(metrics.scorer.f1_score, average="macro")
                        )
    clf.fit(allDataTrain[:,1:], allDataTrain[:,0:1].ravel())

    # other code will output some data to files, graphs and will save the optimal model with joblib package


    #   Eventually we will return the optimal model
    return clf

def run_tune_process(hyperparam_tuning_method, trainingData, testData):
    for period in np.arange(0, 100, 10):
        clf = hyperparam_tuning_method(trainingData, period)

        y_real = testData[:, 0:1].ravel()
        y_pred = clf.predict(testData[:, 1:])

# import some data to play with
iris = datasets.load_iris()
X_training = iris.data[0:100,:]  
Y_training = (iris.target[0:100]).reshape(100,1)
trainingset = np.hstack((Y_training, X_training))

X_test = iris.data[100:150,:]  
Y_test = (iris.target[100:150]).reshape(50,1)
testset = np.hstack((Y_test, X_test))

run_tune_process(tune_hyperparameters,trainingset,testset)

Once again, this code works on my computer only when I change n_jobs to 1 or when I don't define a scoring= argument.

@jnothman
Member

Generally multiprocessing in Windows encounters a lot of problems. But I don't know why this should be correlated with a custom metric. There's nothing about the average=macro option in 0.14 that suggests it should be more likely to hang than the default average (weighted). At the development head, this completes in 11s on my macbook, and in 7s at version 0.14 (that's something to look into!)

Are you able to try this out in the current development version, to see if it's still an issue?

@jnothman
Member

(As a side point, @ogrisel, I note there seems to be a lot more joblib parallelisation overhead in master -- on OS X at least -- that wasn't there in 0.14...)

@larsmans
Member

This has nothing to do with custom scorers. This is a well-known feature of Python multiprocessing on Windows: you have to run everything that uses n_jobs=-1 in an if __name__ == '__main__' block or you'll get freezes/crashes. Maybe we should document this somewhere prominently, e.g. in the README?
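
Roughly like this (a minimal sketch of the pattern, reusing the snippet from above; not something I have run on Windows myself):

from sklearn import datasets, svm
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in later versions
from sklearn.metrics import make_scorer, f1_score

def run_search():
    iris = datasets.load_iris()
    scorer = make_scorer(f1_score, average="macro")
    clf = GridSearchCV(svm.SVC(), {'C': [0.1, 1, 10]}, scoring=scorer, n_jobs=-1)
    clf.fit(iris.data, iris.target)
    return clf

if __name__ == '__main__':
    # without this guard, the worker processes re-import the script on Windows
    # and each one starts spawning workers of its own
    run_search()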

@GaelVaroquaux
Member

you have to run everything that uses n_jobs=-1 in an if __name__ == '__main__' block or you'll get freezes/crashes.

Well, the good news is that nowadays joblib gives a meaningful error message on such a crash, rather than a fork bomb.

@larsmans
Member

@GaelVaroquaux does current scikit-learn give that error message? If so, the issue can be considered fixed, IMHO.

@GaelVaroquaux
Member

@GaelVaroquaux does current scikit-learn give that error message? If so, the issue can be considered fixed, IMHO.

It should do. The only way to be sure is to check. I am on the move right now, and I cannot boot up a Windows VM to do that.

@larsmans
Member

I'm not going to install a C compiler on Windows just for this. Sorry, but I really don't do Windows :)

@GaelVaroquaux
Member

I'm not going to install a C compiler on Windows just for this. Sorry, but I really don't do Windows :)

I have a Windows VM. I can check. It's just a question of finding a little bit of time to do it.

@adverley
Author

@larsmans, you are completely right. The custom scorer object was a mistake on my part; the problem indeed lies in multiprocessing on Windows. I tried the same code on Linux and it runs fine.

I don't get any error messages because it doesn't crash, it just stops doing anything meaningful.

@larsmans
Member

@adverley Could you try the most recent version from GitHub on your Windows box?

@amueller
Member

Closing because of lack of feedback; it is probably a known issue that is fixed in newer joblib.

@hirak99

hirak99 commented Mar 27, 2015

Not sure if this is related, but it does seem to be.

On Windows, a custom scorer still freezes. I found this thread on Google, removed the scorer, and the grid search works.

When it freezes, it shows no error message. There are 3 Python processes spawned (because I set n_jobs=3), but CPU utilization remains at 0 for all of them. I am using IPython Notebook.

@amueller
Member

amueller commented Apr 1, 2015

Can you share the code of the scorer? It seems a bit unlikely.

@amueller
Member

amueller commented Apr 1, 2015

Does your scorer use joblib / n_jobs anywhere? It shouldn't, and that could maybe cause problems (though I think joblib should detect that).

@hirak99

hirak99 commented Apr 1, 2015

Sure - here's the full code - http://pastebin.com/yUE26SNs

The scorer function is "score_model"; it doesn't use joblib.

This runs from the command prompt, but not from IPython Notebook. The error message is:
AttributeError: Can't get attribute 'score_model' on <module '__main__' (built-in)>;

Then IPython and all the spawned Python instances become idle, silently, and don't respond to any Python code anymore until I restart them.

@amueller
Member

amueller commented Apr 1, 2015

Fix the attribute error, then it'll work.
Do you do pylab imports in IPython notebook? Otherwise everything should be the same.

@hirak99

hirak99 commented Apr 1, 2015

Well, I do not know what causes the AttributeError. It is most likely related to joblib, since it happens only when n_jobs is more than 1; it runs fine with n_jobs=1.

The error complains about the attribute score_model missing from __main__, whether or not I have an if __name__ == '__main__' guard in the IPython Notebook.

(I realized that the error line was pasted incorrectly above; I have edited the post.)

I don't use pylab.

Here's the full extended error message - http://pastebin.com/23y5uHT2

@amueller
Member

amueller commented Apr 2, 2015

Hmm, that is likely related to issues with multiprocessing on Windows. Maybe @GaelVaroquaux or @ogrisel can help.
I don't know what the notebook makes of __name__ == "__main__".
Try defining the metric not in the notebook but in a separate file, and import it. I'd think that would fix it.
This is not really related to GridSearchCV, but rather to some interesting interaction between Windows multiprocessing, IPython Notebook and joblib.
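
Something like this (an untested sketch; my_scorers.py and the function names are arbitrary, and estimator / param_grid stand in for your own):

# my_scorers.py -- a separate file next to the notebook, not defined inside it
from sklearn.metrics import make_scorer, f1_score

def macro_f1(y_true, y_pred):
    return f1_score(y_true, y_pred, average='macro')

macro_f1_scorer = make_scorer(macro_f1)

and in the notebook:

from my_scorers import macro_f1_scorer

grid = GridSearchCV(estimator, param_grid, scoring=macro_f1_scorer, n_jobs=-1)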

@alwaysandeep

alwaysandeep commented Nov 5, 2016

Guys, thanks for the thread. I should have checked this thread before; I wasted 5 hours of my time on this, trying to run parallel processing. Thanks a lot :)
TO ADD SOME FEEDBACK: it's still freezing. I faced the same issue in the presence of my own make_scorer cost function; my system starts freezing. When I did not use a custom cost function, I did not face these freezes in parallel processing.

@lesteve
Member

lesteve commented Nov 8, 2016

The best way of turning these 5 hours into something useful for the project would be to provide us with a stand-alone example reproducing the problem.

@vosilov

vosilov commented Dec 21, 2016

I was experiencing the same issue on Windows 10 working in Jupyter notebook trying to use a custom scorer within a nested cross-validation and n_jobs=-1. I was getting the AttributeError: Can't get attribute 'custom_scorer' on <module '__main__' (built-in)>; message.
As @amueller suggested, importing the custom scorer instead of defining it in the notebook works.

@martinxtm

I have the exact same problem on OSX 10.10.5

@boazsh

boazsh commented Aug 4, 2017

Same here.
OSX 10.12.5

@jnothman
Member

jnothman commented Aug 6, 2017

Please give a reproducible code snippet. We'd love to get to the bottom of this. It is hard to understand without code, including data, that shows us the issue.

@boazsh

boazsh commented Aug 8, 2017

Just run these lines in a Python shell:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_predict

np.random.seed(1234)
X = np.random.sample((1000, 100))
Y = np.random.sample((1000)) > 0.5
svc_pipeline = Pipeline([('pca', PCA(n_components=95)), ('svc', SVC())])
predictions = cross_val_predict(svc_pipeline, X, Y, cv=30, n_jobs=-1)
print classification_report(Y, predictions)

Note that removing the PCA step from the pipeline solves the issue.

More info:

Darwin-16.6.0-x86_64-i386-64bit
('Python', '2.7.13 (default, Apr 4 2017, 08:47:57) \n[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.38)]')
('NumPy', '1.12.1')
('SciPy', '0.19.1')
('Scikit-Learn', '0.18.2')

@jnothman
Member

jnothman commented Aug 8, 2017 via email

@lesteve
Member

lesteve commented Oct 6, 2017

@KaisJM I think it would be more useful if you started from your freezing script, managed to simplify it, and posted a fully stand-alone snippet that freezes for you.

@KaisJM

KaisJM commented Oct 6, 2017

@lesteve Agreed. I created a new python2 environment like the one I had before installing Gensim. Code ran fine, NO freeze with n_jobs=-1. What's more, Numpy is using OpenBLAS and has the same config as the environment that exhibits the freeze (the one where Gensim was installed). So it seems that openblas is not the cause of this freeze.

numpy.__config__.show()
lapack_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_lapack_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_mkl_info:
  NOT AVAILABLE

@paulaceccon

@KaisJM I'm running the same snippet here (windows) and it freezes.

from sklearn.datasets import make_classification
X, y = make_classification()

from sklearn.ensemble import RandomForestClassifier
clf_rf_params = {
    'n_estimators': [400, 600, 800],
    'min_samples_leaf' : [5, 10, 15],
    'min_samples_split' : [10, 15, 20],
    'criterion': ['gini', 'entropy'],
    'class_weight': [{0: 0.51891309,  1: 13.71835531}]
}

import numpy as np
def ginic(actual, pred):
    actual = np.asarray(actual) # In case, someone passes Series or list
    n = len(actual)
    a_s = actual[np.argsort(pred)]
    a_c = a_s.cumsum()
    giniSum = a_c.sum() / a_s.sum() - (n + 1) / 2.0
    return giniSum / n
 
def gini_normalizedc(a, p):
    if p.ndim == 2:  # Required for sklearn wrapper
        p = p[:,1]   # If proba array contains proba for both 0 and 1 classes, just pick class 1
    return ginic(a, p) / ginic(a, a)

from sklearn import metrics
gini_sklearn = metrics.make_scorer(gini_normalizedc, True, True)  # greater_is_better=True, needs_proba=True

from sklearn.model_selection import GridSearchCV

clf_rf = RandomForestClassifier()
grid = GridSearchCV(clf_rf, clf_rf_params, scoring=gini_sklearn, cv=3, verbose=1, n_jobs=-1)
grid.fit(X, y)

print (grid.best_params_)

I know that it's awkward, but it didn't freeze when running without the custom metric.

@snovik75

I have a similar problem. I have been running the same code and simply wanted to update the model with the new month's data, and it stopped running. I believe sklearn got updated in the meantime to 0.19.

@thomberg1

Running GridSearchCV or RandomizedSearchCV in a loop with n_jobs > 1 would hang silently in Jupyter & IntelliJ:

for trial in tqdm(range(NUM_TRIALS)):
    ...
    gscv = GridSearchCV(estimator=estimator, param_grid=param_grid,
                          scoring=scoring, cv=cv, verbose=1, n_jobs=-1)
    gscv.fit(X_data, y_data)

    ...

Followed @lesteve's recommendation, checked the environment, and removed the numpy installed with pip:

Darwin-16.6.0-x86_64-i386-64bit
Python 3.6.1 |Anaconda custom (x86_64)| (default, May 11 2017, 13:04:09)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.13.1
SciPy 0.19.1
Scikit-Learn 0.19.0

$conda list | grep numpy
gnumpy 0.2 pip
numpy 1.13.1 py36_0
numpy 1.13.3 pip
numpydoc 0.6.0 py36_0

$pip uninstall numpy

$conda list | grep numpy
gnumpy 0.2 pip
numpy 1.13.1 py36_0
numpydoc 0.6.0 py36_0

$conda install numpy -f // most likely unnecessary

$conda list | grep numpy
gnumpy 0.2 pip
numpy 1.13.1 py36_0
numpydoc 0.6.0 py36_0

Fixed my problem.

@thomberg1

@paulaceccon your problem is related to

https://stackoverflow.com/questions/36533134/cant-get-attribute-abc-on-module-main-from-abc-h-py
If you declare the pool prior to declaring the function you are trying to use in parallel it will throw this error. Reverse the order and it will no longer throw this error.

The following will run your code:

import multiprocessing

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')

    from external import *

    from sklearn.datasets import make_classification
    X, y = make_classification()

    from sklearn.ensemble import RandomForestClassifier
    clf_rf_params = {
        'n_estimators': [400, 600, 800],
        'min_samples_leaf' : [5, 10, 15],
        'min_samples_split' : [10, 15, 20],
        'criterion': ['gini', 'entropy'],
        'class_weight': [{0: 0.51891309,  1: 13.71835531}]
    }

    from sklearn.model_selection import GridSearchCV

    clf_rf = RandomForestClassifier()
    grid = GridSearchCV(clf_rf, clf_rf_params, scoring=gini_sklearn, cv=3, verbose=1, n_jobs=-1)
    grid.fit(X, y)

    print (grid.best_params_)

with external.py

import numpy as np
def ginic(actual, pred):
    actual = np.asarray(actual) # In case, someone passes Series or list
    n = len(actual)
    a_s = actual[np.argsort(pred)]
    a_c = a_s.cumsum()
    giniSum = a_c.sum() / a_s.sum() - (n + 1) / 2.0
    return giniSum / n

def gini_normalizedc(a, p):
    if p.ndim == 2:  # Required for sklearn wrapper
        p = p[:,1]   # If proba array contains proba for both 0 and 1 classes, just pick class 1
    return ginic(a, p) / ginic(a, a)

from sklearn import metrics
gini_sklearn = metrics.make_scorer(gini_normalizedc, True, True)

Results running on 8 cores

Fitting 3 folds for each of 54 candidates, totalling 162 fits
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 7.1s
[Parallel(n_jobs=-1)]: Done 162 out of 162 | elapsed: 30.5s finished
{'class_weight': {0: 0.51891309, 1: 13.71835531}, 'criterion': 'gini', 'min_samples_leaf': 10, 'min_samples_split': 20, 'n_estimators': 400}

@xtosis

xtosis commented Feb 12, 2018

The issue is still there, guys. I am using a custom scorer and it keeps going forever when I set n_jobs to anything. When I don't specify n_jobs at all it works fine, but otherwise it freezes.

@lesteve
Member

lesteve commented Feb 12, 2018

Can you provide a stand-alone snippet to reproduce the problem? Please read https://stackoverflow.com/help/mcve for more details.

@paulaceccon

Still facing this problem with the same sample code.

Windows-10-10.0.15063-SP0
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)]
NumPy 1.14.1
SciPy 1.0.0
Scikit-Learn 0.19.1

@glemaitre
Member

Can you provide a stand-alone snippet to reproduce the problem? Please read https://stackoverflow.com/help/mcve for more details.

@jnothman
Member

jnothman commented Mar 18, 2018 via email

@chi18000

chi18000 commented Apr 11, 2018

I tested the code in thomberg1's #2889 (comment).

OS: Windows 10 x64 10.0.16299.309
Python package: WinPython-64bit-3.6.1
numpy (1.14.2)
scikit-learn (0.19.1)
scipy (1.0.0)

It worked fine in Jupyter Notebook and command-line.

@siideffect

siideffect commented Apr 20, 2018

Hi, I'm having the same issue, so I did not want to open a new one, which would lead to an almost identical thread.

  • macOS
  • Anaconda
  • scikit-learn 0.19.1
  • scipy 1.0.1
  • numpy 1.14.2

# MLP for Pima Indians Dataset with grid search via sklearn
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
import numpy

# Function to create model, required for KerasClassifier
def create_model(optimizer='rmsprop', init='glorot_uniform'):
  # create model
  model = Sequential()
  model.add(Dense(12, input_dim=8, kernel_initializer=init, activation='relu'))
  model.add(Dense(8, kernel_initializer=init, activation='relu'))
  model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))
  # Compile model
  model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
  return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]


# create model
model = KerasClassifier(build_fn=create_model, verbose=0)
# grid search epochs, batch size and optimizer
optimizers = ['rmsprop', 'adam']
init = ['glorot_uniform', 'normal', 'uniform']
epochs = [50, 100, 150]
batches = [5, 10, 20]
param_grid = dict(optimizer=optimizers, epochs=epochs, batch_size=batches, init=init)
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
  print("%f (%f) with: %r" % (mean, stdev, param))

Code is from a tutorial : https://machinelearningmastery.com/use-keras-deep-learning-models-scikit-learn-python/
I tried changing the n_jobs parameter to 1, -1, but neither of these worked. Any hint?

@thomberg1

It runs if I add the multiprocessing import and the if statement as shown below - I don't work with Keras so I don't have more insight.

import multiprocessing

if __name__ == '__main__':

    # MLP for Pima Indians Dataset with grid search via sklearn
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.wrappers.scikit_learn import KerasClassifier
    from sklearn.model_selection import GridSearchCV
    import numpy

    # Function to create model, required for KerasClassifier
    def create_model(optimizer='rmsprop', init='glorot_uniform'):
      # create model
      model = Sequential()
      model.add(Dense(12, input_dim=8, kernel_initializer=init, activation='relu'))
      model.add(Dense(8, kernel_initializer=init, activation='relu'))
      model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))
      # Compile model
      model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
      return model

    # fix random seed for reproducibility
    seed = 7
    numpy.random.seed(seed)
    # load pima indians dataset
    dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
    # split into input (X) and output (Y) variables
    X = dataset[:,0:8]
    Y = dataset[:,8]


    # create model
    model = KerasClassifier(build_fn=create_model, verbose=0)
    # grid search epochs, batch size and optimizer
    optimizers = ['rmsprop', 'adam']
    init = ['glorot_uniform', 'normal', 'uniform']
    epochs = [5]
    batches = [5, 10, 20]
    param_grid = dict(optimizer=optimizers, epochs=epochs, batch_size=batches, init=init)
    grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=12, verbose=1)
    grid_result = grid.fit(X, Y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
      print("%f (%f) with: %r" % (mean, stdev, param))

Fitting 3 folds for each of 18 candidates, totalling 54 fits

[Parallel(n_jobs=12)]: Done 26 tasks | elapsed: 18.4s
[Parallel(n_jobs=12)]: Done 54 out of 54 | elapsed: 23.7s finished
Best: 0.675781 using {'batch_size': 5, 'epochs': 5, 'init': 'glorot_uniform', 'optimizer': 'adam'}
0.621094 (0.036225) with: {'batch_size': 5, 'epochs': 5, 'init': 'glorot_uniform', 'optimizer': 'rmsprop'}
0.675781 (0.006379) with: {'batch_size': 5, 'epochs': 5, 'init': 'glorot_uniform', 'optimizer': 'adam'}
...
0.651042 (0.025780) with: {'batch_size': 20, 'epochs': 5, 'init': 'uniform', 'optimizer': 'adam'}


version info if needed
sys 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 12:04:33)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
numpy 1.14.2
pandas 0.22.0
sklearn 0.19.1
torch 0.4.0a0+9692519
IPython 6.2.1
keras 2.1.5

compiler : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
system : Darwin
release : 17.5.0
machine : x86_64
processor : i386
CPU cores : 24
interpreter: 64bit

@siideffect

siideffect commented Apr 30, 2018

Thank you @thomberg1 , but adding

import multiprocessing
if __name__ == '__main__':

did not help. The problem is still the same

@byrony

byrony commented May 20, 2018

Same problem on my machine when using a customized scoring function in GridSearchCV.
python 3.6.4,
scikit-learn 0.19.1,
Windows 10,
CPU cores: 24

@amueller
Member

@byrony can you provide code to reproduce? did you use if __name__ == "__main__"?

@Pazitos10

Pazitos10 commented May 25, 2018

I've experienced a similar problem multiple times on my machine when using n_jobs=-1 or n_jobs=8 as an argument for GridSearchCV, but with the default scorer argument.

  • Python 3.6.5,
  • scikit-learn 0.19.1,
  • Arch Linux,
  • CPU cores: 8.

Here is the code I used:

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.utils import shuffle
from sklearn.neural_network import MLPClassifier
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


def main():
    
    df = pd.read_csv('../csvs/my_data.csv', nrows=4000000)    
    
    X = np.array(list(map(lambda a: np.fromstring(a[1:-1] , sep=','), df['X'])))
    y = np.array(list(map(lambda a: np.fromstring(a[1:-1] , sep=','), df['y'])))
    
    scalerX = MinMaxScaler()
    scalerY = MinMaxScaler()
    X = scalerX.fit_transform(X)
    y = scalerY.fit_transform(y)
   
    grid_params = {
        'beta_1': [ .1, .2, .3, .4, .5, .6, .7, .8, .9 ],
        'activation': ['identity', 'logistic', 'tanh', 'relu'],
        'learning_rate_init': [0.01, 0.001, 0.0001]
    }
    
    estimator = MLPClassifier(random_state=1, 
                              max_iter=1000, 
                              verbose=10,
                              early_stopping=True)
    
    gs = GridSearchCV(estimator, 
                      grid_params, 
                      cv=5,
                      verbose=10, 
                      return_train_score=True,
                      n_jobs=8)
    
    X, y = shuffle(X, y, random_state=0)
    
    y = y.astype(np.int16)    

    gs.fit(X, y.ravel())
    
    print("GridSearchCV Report \n\n")
    print("best_estimator_ {}".format(gs.best_estimator_))
    print("best_score_ {}".format(gs.best_score_))
    print("best_params_ {}".format(gs.best_params_))
    print("best_index_ {}".format(gs.best_index_))
    print("scorer_ {}".format(gs.scorer_))
    print("n_splits_ {}".format(gs.n_splits_))
    
    print("Exporting")
    results = pd.DataFrame(data=gs.cv_results_)
    results.to_csv('../csvs/gs_results.csv')


if __name__ == '__main__':
    main()

I know it is a big dataset, so I expected it to take some time to get results, but after 2 days of running it just stopped working (the script keeps executing but is not using any resources apart from RAM and swap).

[screenshots: 2018-05-25 17-53-11, 2018-05-25 17-54-59]

Thanks in advance!

@byrony

byrony commented May 26, 2018

@amueller I didn't use the if __name__ == "__main__" guard. Below is my code; it only works when n_jobs=1.

def neg_mape(true, pred):
    true, pred = np.array(true)+0.01, np.array(pred)
    return -1*np.mean(np.absolute((true - pred)/true))

xgb_test1 = XGBRegressor(
    #learning_rate =0.1,
    n_estimators=150,
    max_depth=3,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective= 'reg:linear',
    nthread=4,
    scale_pos_weight=1,
    seed=123,
)

param_test1 = {
    'learning_rate':[0.01, 0.05, 0.1, 0.2, 0.3],
}

gsearch1 = GridSearchCV(estimator = xgb_test1, param_grid = param_test1, scoring=neg_mape, n_jobs=4, cv = 5)

@amueller
Member

You're using XGBoost. I don't know what they do internally; it's very possible that's the issue. Can you try to see if adding the if __name__ guard helps?
Otherwise I don't think there's a fix for that yet.
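
For reference, roughly like this (an untested sketch; wrapping with make_scorer is also needed because the scoring argument expects a scorer object rather than a plain metric function, and X_train / y_train stand in for your data):

from sklearn.metrics import make_scorer

if __name__ == "__main__":
    gsearch1 = GridSearchCV(estimator=xgb_test1, param_grid=param_test1,
                            scoring=make_scorer(neg_mape), n_jobs=4, cv=5)
    gsearch1.fit(X_train, y_train)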

@amueller
Member

@Pazitos10 can you reproduce with synthetic data and/or smaller data? I can't reproduce without your data, and it would be good to reproduce it in a shorter time.

@Pazitos10

Pazitos10 commented May 26, 2018

@amueller Ok, I will run it again with 500k rows and will post the results. Thanks!

@Pazitos10

@amueller, running the script with 50k rows works as expected. The script ends correctly, showing the results as follows (sorry, I meant 50k not 500k):

[screenshots: 2018-05-26 13-09-00, 2018-05-26 13-09-51]

The problem is that I don't know if these results are going to be the best for my whole dataset. Any advice?

@amueller
Member

Seems like you're running out of RAM. Maybe try using Keras instead; it's likely a better solution for large-scale neural nets.
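
Something along these lines, roughly (an untested sketch of a comparable MLP in Keras; the layer sizes, output layer and training settings are placeholders, and X / y are your scaled arrays from above):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=X.shape[1]))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# trains in mini-batches; no joblib worker processes involved
model.fit(X, y.ravel(), epochs=10, batch_size=256, validation_split=0.1)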

@Pazitos10

@amueller Oh, ok. I will try using Keras instead. Thank you again!

@PGTBoos

PGTBoos commented Oct 9, 2018

This has nothing to do with custom scorers. This is a well-known feature of Python multiprocessing on Windows: you have to run everything that uses n_jobs=-1 in an if __name__ == '__main__' block or you'll get freezes/crashes. Maybe we should document this somewhere prominently, e.g. in the README?

Would it perhaps be an idea for scikit-learn, in the case of Windows, to alter the function and use queues to feed tasks to a collection of worker processes and collect the results?
As described here: https://docs.python.org/2/library/multiprocessing.html#windows
and for 3.6 here: https://docs.python.org/3.6/library/multiprocessing.html#windows
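
Just to illustrate the pattern from those docs (a rough sketch; fit_one and param_grid are placeholders for whatever fits and scores a single candidate and for the list of candidate settings):

import multiprocessing as mp

def worker(task_queue, result_queue):
    # pull parameter settings until the 'STOP' sentinel arrives
    for params in iter(task_queue.get, 'STOP'):
        result_queue.put(fit_one(params))

if __name__ == '__main__':
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(tasks, results)) for _ in range(4)]
    for p in procs:
        p.start()
    for params in param_grid:
        tasks.put(params)
    for _ in procs:
        tasks.put('STOP')
    scores = [results.get() for _ in range(len(param_grid))]
    for p in procs:
        p.join()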

@amueller
Member

@PGTBoos this is fixed in scikit-learn 0.20.0
