
extend score to take an array of shorttext #131

Closed
rja172 opened this issue Aug 21, 2022 · 6 comments

@rja172

rja172 commented Aug 21, 2022

Currently, score takes only a single input, so the method is very slow when you are trying to classify thousands of examples. Is there a way to generate scores for 10K+ samples at the same time?
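Right now the only option I see is a plain Python loop, one score call per row, along these lines (sampleDF and classifier are my objects):

# One-row-at-a-time scoring: every call tokenizes and runs its own
# forward pass, which is why 10K+ samples take so long.
predictions = [classifier.score(text) for text in sampleDF['product_name']]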

@stephenhky
Owner

Vectorizing the input might work.

@rja172
Author

rja172 commented Aug 22, 2022

Do you have an example? If I do

predictions = classifier.score(sampleDF['product_name'].values)

I get the following error:

AttributeError                            Traceback (most recent call last)
Input In [10], in <cell line: 2>()
      1 # predictions = model.predict(sampleDF['product_name'].head(100).values)
----> 2 predictions = model.classifier.score(sampleDF['product_name'].values)

File ~/.venv/category-predictor-e7zd-_vR-py3.9/lib/python3.9/site-packages/shorttext/classifiers/embed/nnlib/VarNNEmbedVecClassification.py:229, in VarNNEmbeddedVecClassifier.score(self, shorttext)
    226     raise e.ModelNotTrainedException()
    228 # retrieve vector
--> 229 matrix = np.array([self.shorttext_to_matrix(shorttext)])
    231 # classification using the neural network
    232 predictions = self.model.predict(matrix)

File ~/.venv/category-predictor-e7zd-_vR-py3.9/lib/python3.9/site-packages/shorttext/classifiers/embed/nnlib/VarNNEmbedVecClassification.py:205, in VarNNEmbeddedVecClassifier.shorttext_to_matrix(self, shorttext)
    193 def shorttext_to_matrix(self, shorttext):
    194     """ Convert the short text into a matrix with word-embedding representation.
    195 
    196     Given a short sentence, it converts all the tokens into embedded vectors according to
   (...)
    203     :rtype: numpy.ndarray
    204     """
--> 205     tokens = tokenize(shorttext)
    206     matrix = np.zeros((self.maxlen, self.vecsize))
    207     for i in range(min(self.maxlen, len(tokens))):

File ~/.venv/category-predictor-e7zd-_vR-py3.9/lib/python3.9/site-packages/shorttext/utils/textpreprocessing.py:9, in <lambda>(s)
      6 import snowballstemmer
      8 # tokenizer
----> 9 tokenize = lambda s: s.split(' ')
     12 # stemmer
     13 stemmer = snowballstemmer.stemmer('porter')

AttributeError: 'numpy.ndarray' object has no attribute 'split'

If I try the np.vectorize command instead (example: predictions = np.vectorize(lambda x: model.classifier.score(x))(sampleDF['product_name'])), it doesn't improve the time by much.
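That seems expected: np.vectorize is essentially a Python-level for loop, so something like the following (model and sampleDF as above) still calls score once per row:

import numpy as np

# np.vectorize only wraps the per-element call; each row still goes through
# tokenization and its own forward pass inside score.
score_one = np.vectorize(lambda x: model.classifier.score(x))
predictions = score_one(sampleDF['product_name'])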

@rja172
Author

rja172 commented Aug 22, 2022

As suggested, vectorizing with np.vectorize(self.classifier.score)(data['X']) did help speed up the process. However, inference is still very slow. I think having the internal functions handle multiple short texts at the same time might significantly speed things up. It is taking about 1 hour to resolve 100K short texts.
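Something like the sketch below is what I mean by handling multiple short texts at once: build the embedding matrices for a whole batch and make a single model.predict call. (shorttext_to_matrix and model.predict appear in the traceback above; classlabels is my guess at the attribute holding the trained class labels.)

import numpy as np

def score_batch(classifier, shorttexts):
    # Convert each short text to its (maxlen, vecsize) embedding matrix.
    matrices = np.array([classifier.shorttext_to_matrix(t) for t in shorttexts])
    # One forward pass over the whole batch instead of one per string.
    predictions = classifier.model.predict(matrices)
    # Map each row of scores back onto the class labels (attribute name assumed).
    return [dict(zip(classifier.classlabels, row)) for row in predictions]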

@stephenhky
Owner

Because the procedure involves tokenizing the string, it takes some time. Yes, doing it in parallel can speed it up. I will need to make some careful changes to the code.

@ragrawal
Contributor

This PR addresses it: #134

@stephenhky
Owner

Released in 1.5.6.
