In #7272 (which was a mix of different PRs aiming to optimize text vectorizers), among other things, the intermediary storage of `X.indices` (where `X` is the document-term matrix) was moved from `array.array('i')` to `List[int]`.
There were benchmarks suggesting that overall, after that PR, less peak memory was used to vectorize documents. However, I now don't understand how that is possible, since:

- one element of `array.array('i')` should be around 8x smaller than a Python `int`
- storage of the indices of the output sparse arrays should account for a large part of the overall memory footprint
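The ~8x claim can be sanity-checked with `sys.getsizeof`: an `array.array('i')` stores raw 4-byte C ints in a contiguous buffer, while a list stores an 8-byte pointer per element, each pointing at a full `int` object (~28 bytes for small values). A minimal sketch (the element count is an arbitrary choice, and summing `getsizeof` over list elements slightly over-counts the cached small ints 0–256):

```python
import array
import sys

n = 100_000

# array.array('i'): raw 4-byte C ints packed into one buffer
arr = array.array('i', range(n))
arr_bytes = sys.getsizeof(arr)

# list[int]: 8-byte pointers plus a full int object per element
lst = list(range(n))
lst_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst)

print(f"array.array('i'): {arr_bytes / n:.1f} bytes/element")
print(f"list[int]:        {lst_bytes / n:.1f} bytes/element")
```

On CPython this comes out to roughly 4 bytes per element for the array versus mid-30s for the list, consistent with the ~8x figure above.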
I can't see any obvious mistakes in my benchmarks back then, but something doesn't add up here.
Re-measuring the peak memory usage of `CountVectorizer.fit_transform`, with and without the above change, would be a good place to start.
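The stdlib `tracemalloc` module is one way to do that re-measurement. A full benchmark would call `CountVectorizer.fit_transform` on a real corpus; the toy sketch below (the index stream and sizes are made up for illustration) only compares the peak memory of the two intermediary representations in isolation:

```python
import array
import tracemalloc

def build_indices_array(n):
    # Intermediary storage as array.array('i') (pre-#7272 style):
    # each appended value is stored as a raw C int.
    idx = array.array('i')
    for j in range(n):
        idx.append(j % 1000)
    return idx

def build_indices_list(n):
    # Intermediary storage as a plain list (post-#7272 style):
    # each appended value stays alive as a full Python int object.
    idx = []
    for j in range(n):
        idx.append(j % 1000)
    return idx

def peak_bytes(fn, n):
    # Measure peak traced allocations while fn runs.
    tracemalloc.start()
    fn(n)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

n = 200_000
pa = peak_bytes(build_indices_array, n)
pl = peak_bytes(build_indices_list, n)
print("array.array peak:", pa)
print("list peak:       ", pl)
```

In this isolated comparison the list variant peaks several times higher, which is what makes the benchmarked end-to-end result so puzzling; measuring the real `fit_transform` path would show whether some other allocation dominates.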
Related discussion in #13045 (comment)