Joblib-saved classifier: slow prediction #1593

Closed
jaganadhg opened this Issue Jan 18, 2013 · 6 comments

@jaganadhg

As per the discussion in this thread (http://comments.gmane.org/gmane.comp.python.scikit-learn/5716), I am filing this issue.
Code to reproduce it: https://gist.github.com/4564287

Best regards

Jaggu

mrjbq7 (Contributor) commented Feb 4, 2013

I can't seem to reproduce this using a slightly modified version of your script:

import time

from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.externals import joblib

twenty = fetch_20newsgroups()

t0 = time.time()
X = CountVectorizer().fit_transform(twenty.data)
print 'CountVectorizer.fit_transform() took %.3f seconds' % (time.time() - t0)

t0 = time.time()
X = TfidfTransformer().fit_transform(X)
print 'TfidfTransformer.fit_transform() took %.3f seconds' % (time.time() - t0)

clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])

t0 = time.time()
_ = clf.fit(twenty.data, twenty.target)
print 'Pipeline.fit() took %.3f seconds' % (time.time() - t0)

t0 = time.time()
clf.predict(["this is a good sentence for debugging this code I think. what do you think"])
print 'Pipeline.predict() took %.3f seconds' % (time.time() - t0)

_ = joblib.dump(clf, "test_speed.model", compress=9)
joblib_clf = joblib.load("test_speed.model")

t0 = time.time()
joblib_clf.predict(["this is a good sentence for debugging this code I think. what do you think"])
print 'Joblib.predict() took %.3f seconds' % (time.time() - t0)

The output is:

CountVectorizer.fit_transform() took 7.670 seconds
TfidfTransformer.fit_transform() took 0.743 seconds
Pipeline.fit() took 8.554 seconds
Pipeline.predict() took 0.005 seconds
Joblib.predict() took 0.007 seconds

It seems like the pipeline takes about the same amount of time as the individual steps run separately, and the classifier dumped and then loaded by joblib is just as fast?

amueller (Owner) commented Feb 11, 2013

I think the real issue was that tfidf was very slow but we could not reproduce with 20news.
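If the tf-idf step is the suspect, one way to check is to time each stage of the loaded pipeline separately. A minimal sketch (not from the original thread), assuming the test_speed.model file written by the script above:

import time

from sklearn.externals import joblib

# Load the pipeline dumped above and push one document through each
# stage in turn, timing the stages individually.
clf = joblib.load("test_speed.model")
X = ["this is a good sentence for debugging this code I think. what do you think"]

for name, step in clf.steps:
    t0 = time.time()
    if hasattr(step, 'transform'):
        X = step.transform(X)   # vectorizer and tf-idf stages
    else:
        step.predict(X)         # final classifier
    print '%s took %.3f seconds' % (name, time.time() - t0)

If the tf-idf stage dominated on the original data but not on 20newsgroups, that would point to a data-dependent slowdown rather than anything joblib-specific.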

amueller (Owner) commented Feb 11, 2013

@jaganadhg could you maybe give some more details on what exactly the problem was?

jaganadhg commented Feb 14, 2013

Hi Team,

I used this data set (http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip) to train the classifier on AWS (a micro instance, I think; Ubuntu 12.04, Python 2.7).

I trained the classifier on a training set of about 1.6 million (16 lakh) examples. The pipeline contained:

1. CountVectorizer
2. TfidfTransformer
3. Naive Bayes / SVM

The serialized classifier took around one minute to generate a prediction. That is the issue I posted on the mailing list.

Best regards

JAGANADH G
http://jaganadhg.in
ILUGCBE
http://ilugcbe.org.in
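One thing the report above does not separate is whether the minute is spent in joblib.load() or in predict() itself; a CountVectorizer fit on roughly 1.6 million tweets can produce a large vocabulary that makes the serialized model slow to deserialize even when prediction is fast. A rough sketch for splitting the two (the filename sentiment.model is a placeholder, not from the thread):

import time

from sklearn.externals import joblib

# Time deserialization and prediction separately; a second predict()
# call shows whether the first one was paying a one-off cost.
t0 = time.time()
clf = joblib.load("sentiment.model")
print 'joblib.load() took %.3f seconds' % (time.time() - t0)

doc = ["a single short tweet to classify"]

t0 = time.time()
clf.predict(doc)
print 'first predict() took %.3f seconds' % (time.time() - t0)

t0 = time.time()
clf.predict(doc)
print 'second predict() took %.3f seconds' % (time.time() - t0)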

amueller modified the milestones: 0.15.1, 0.15 (Jul 18, 2014)

amueller (Owner) commented Jul 18, 2014

I think it is reasonable that the estimator takes a minute to generate the prediction on that dataset on that instance.
I ran your gist, and the prediction takes:
100 loops, best of 3: 10.5 ms per loop
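The "100 loops, best of 3" line is IPython %timeit output; a plain-Python equivalent using the standard timeit module (a sketch, assuming the test_speed.model file from the earlier script) would look like:

import timeit

# Load the model once in the setup, then time 3 runs of 100 predictions
# each and report the best run, mirroring %timeit's output format.
setup = """
from sklearn.externals import joblib
clf = joblib.load("test_speed.model")
doc = ["this is a good sentence for debugging this code I think"]
"""

times = timeit.repeat("clf.predict(doc)", setup=setup, repeat=3, number=100)
print '100 loops, best of 3: %.1f ms per loop' % (min(times) / 100.0 * 1000)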

amueller (Owner) commented Jul 18, 2014

I'm closing this for now. If you have a script that reproduces the problem, please reopen.

amueller closed this Jul 18, 2014
