Re-evaluate memory usage in CountVectorizer #13062

rth · 2019-01-28T21:19:23Z

In #7272 (a mix of different PRs aiming to optimize the text vectorizers), among other things, the intermediate storage of X.indices (where X is the document-term matrix) was moved from array.array('i') to List[int].

There were benchmarks suggesting that, overall, less peak memory was used to vectorize documents after that PR. However, I now don't understand how that is possible, since:

- one element of array.array('i') should be around 8x smaller than a Python int
- storage of the indices of the output sparse arrays should be in large part responsible for the overall memory footprint
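
As a rough sanity check of the first point, the per-element costs can be inspected directly. This is a minimal sketch using only the standard library; the variable names and the choice of `n` are illustrative and not taken from scikit-learn. Note that a list stores one pointer per element while the int objects live separately, so if those int objects are already referenced elsewhere, the marginal cost per list element may be closer to the pointer size than to the full size of an int object.

```python
import sys
from array import array

n = 1_000_000  # illustrative size, not taken from any benchmark
# Use values above CPython's small-int cache so each int is a distinct object.
values = range(1000, 1000 + n)

arr = array('i', values)
lst = list(values)

# Bytes per element of the typed array (size of a C int on this platform).
arr_bytes_per_item = arr.itemsize

# Bytes per list slot: one pointer per element, excluding the int objects.
list_ptr_bytes = (sys.getsizeof(lst) - sys.getsizeof([])) / n

# Size of one Python int object, which the list slot points to.
int_obj_bytes = sys.getsizeof(lst[0])

print(f"array('i') item:  {arr_bytes_per_item} bytes")
print(f"list slot:        {list_ptr_bytes:.1f} bytes")
print(f"int object:       {int_obj_bytes} bytes")
print(f"slot + object:    {list_ptr_bytes + int_obj_bytes:.1f} bytes")
```

The exact numbers depend on the platform and Python build, so they should be read as an estimate rather than as the answer to the question raised here.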

I can't see any obvious mistakes in my benchmarks back then, but something doesn't make sense here.

Re-measuring the peak memory usage of CountVectorizer.fit_transform, with and without the above change, would be a good place to start.
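
One possible way to do such a measurement is with `tracemalloc` from the standard library. The sketch below is only an illustration: `peak_bytes`, `fill_array` and `fill_list` are hypothetical helpers, and the two fill functions are toy stand-ins for the two storage strategies, not the actual vectorizer code.

```python
import tracemalloc
from array import array


def peak_bytes(func):
    """Return the peak number of bytes traced while running func()."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def fill_array(n):
    # Toy stand-in: accumulate indices in a typed array of C ints.
    out = array('i')
    for i in range(n):
        out.append(i % 50_000)
    return out


def fill_list(n):
    # Toy stand-in: accumulate indices in a plain Python list.
    out = []
    for i in range(n):
        out.append(i % 50_000)
    return out


peak_array = peak_bytes(lambda: fill_array(200_000))
peak_list = peak_bytes(lambda: fill_list(200_000))
print(f"array('i') peak: {peak_array / 1e6:.2f} MB")
print(f"list peak:       {peak_list / 1e6:.2f} MB")
```

For the real comparison, the same helper could be called as `peak_bytes(lambda: CountVectorizer().fit_transform(docs))` on each version of the code, where `docs` is whatever corpus is used for the benchmark. `tracemalloc` only sees allocations made through Python's memory allocators, so a process-level measurement may still be worth running alongside it.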

Related discussion in #13045 (comment)

rth added the help wanted label Jan 28, 2019

cmarmo added the module:feature_extraction label Aug 10, 2020
