
extend score to take an array of shorttext #131

Closed
rja172 opened this issue Aug 21, 2022 · 6 comments

@rja172

rja172 commented Aug 21, 2022

Currently, score takes only a single input, so the method is very slow when you are trying to classify thousands of examples. Is there a way to generate scores for 10K+ samples at the same time?
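Right now the only option I see is a plain Python loop, one score call per row, along these lines (sampleDF and classifier are my objects):

# One-row-at-a-time scoring: every call tokenizes and runs its own
# forward pass, which is why 10K+ samples take so long.
predictions = [classifier.score(text) for text in sampleDF['product_name']]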

@stephenhky
Owner

Vectorizing the input might work.

@rja172
Author

rja172 commented Aug 22, 2022

Do you have an example? If I do

predictions = classifier.score(sampleDF['product_name'].values)

I get the following error:

AttributeError                            Traceback (most recent call last)
Input In [10], in <cell line: 2>()
      1 # predictions = model.predict(sampleDF['product_name'].head(100).values)
----> 2 predictions = model.classifier.score(sampleDF['product_name'].values)

File ~/.venv/category-predictor-e7zd-_vR-py3.9/lib/python3.9/site-packages/shorttext/classifiers/embed/nnlib/VarNNEmbedVecClassification.py:229, in VarNNEmbeddedVecClassifier.score(self, shorttext)
    226     raise e.ModelNotTrainedException()
    228 # retrieve vector
--> 229 matrix = np.array([self.shorttext_to_matrix(shorttext)])
    231 # classification using the neural network
    232 predictions = self.model.predict(matrix)

File ~/.venv/category-predictor-e7zd-_vR-py3.9/lib/python3.9/site-packages/shorttext/classifiers/embed/nnlib/VarNNEmbedVecClassification.py:205, in VarNNEmbeddedVecClassifier.shorttext_to_matrix(self, shorttext)
    193 def shorttext_to_matrix(self, shorttext):
    194     """ Convert the short text into a matrix with word-embedding representation.
    195 
    196     Given a short sentence, it converts all the tokens into embedded vectors according to
   (...)
    203     :rtype: numpy.ndarray
    204     """
--> 205     tokens = tokenize(shorttext)
    206     matrix = np.zeros((self.maxlen, self.vecsize))
    207     for i in range(min(self.maxlen, len(tokens))):

File ~/.venv/category-predictor-e7zd-_vR-py3.9/lib/python3.9/site-packages/shorttext/utils/textpreprocessing.py:9, in <lambda>(s)
      6 import snowballstemmer
      8 # tokenizer
----> 9 tokenize = lambda s: s.split(' ')
     12 # stemmer
     13 stemmer = snowballstemmer.stemmer('porter')

AttributeError: 'numpy.ndarray' object has no attribute 'split'

If I try the np.vectorize command instead (example: predictions = np.vectorize(lambda x: model.classifier.score(x))(sampleDF['product_name'])), it doesn't improve the time by much.
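That seems expected: np.vectorize is essentially a Python-level for loop, so something like the following (model and sampleDF as above) still calls score once per row:

import numpy as np

# np.vectorize only wraps the per-element call; each row still goes through
# tokenization and its own forward pass inside score.
score_one = np.vectorize(lambda x: model.classifier.score(x))
predictions = score_one(sampleDF['product_name'])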

@rja172
Author

rja172 commented Aug 22, 2022

As suggested, vectorizing with np.vectorize(self.classifier.score)(data['X']) did help speed up the process. However, inference is still very slow. I think having the internal functions handle multiple short texts at the same time might significantly speed things up. It is taking about 1 hour to resolve 100K short texts.
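Something like the sketch below is what I mean by handling multiple short texts at once: build the embedding matrices for a whole batch and make a single model.predict call. (shorttext_to_matrix and model.predict appear in the traceback above; classlabels is my guess at the attribute holding the trained class labels.)

import numpy as np

def score_batch(classifier, shorttexts):
    # Convert each short text to its (maxlen, vecsize) embedding matrix.
    matrices = np.array([classifier.shorttext_to_matrix(t) for t in shorttexts])
    # One forward pass over the whole batch instead of one per string.
    predictions = classifier.model.predict(matrices)
    # Map each row of scores back onto the class labels (attribute name assumed).
    return [dict(zip(classifier.classlabels, row)) for row in predictions]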

@stephenhky
Owner

Because the procedure involves tokenizing the string, it takes some time. Yes, doing it in parallel can speed it up. I will need to make some careful changes to the code.

@ragrawal
Contributor

This PR addresses it: #134

@stephenhky
Owner

Released in 1.5.6.
