Re-evaluate memory usage in CountVectorizer #13062

Open
rth opened this issue Jan 28, 2019 · 0 comments

rth commented Jan 28, 2019

In #7272 (a mix of different PRs aiming to optimize text vectorizers), among other things, the intermediate storage of X.indices (where X is the document-term matrix) was moved from array.array('i') to List[int].

Benchmarks at the time suggested that, overall, less peak memory was used to vectorize documents after that PR. However, I now don't understand how that is possible, since:

  • one element of array.array('i') should be around 8x smaller than that of a Python int
  • storage of indices of the output sparse arrays should be in large part responsible for the overall memory footprint
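
To make the first point concrete, here is a rough sketch that compares the per-element cost of the two containers on CPython. The function name `per_element_bytes` and the synthetic index values are just for illustration and are not taken from the scikit-learn code. It assumes every list slot points to its own int object, which is the worst case for the list.

```python
import sys
from array import array


def per_element_bytes(n=100_000):
    """Return (array_bytes, list_bytes) per stored index, for n distinct indices."""
    # Hypothetical stand-in for X.indices; values start above CPython's
    # small-int cache so every list element is a separate int object.
    values = range(1_000, 1_000 + n)

    as_array = array("i", values)
    as_list = list(values)

    # array.array stores raw C ints contiguously.
    array_total = as_array.itemsize * len(as_array)
    # A list stores one pointer per slot, plus the int objects themselves.
    # Assumption: no int object is shared between slots (worst case).
    list_total = sys.getsizeof(as_list) + sum(sys.getsizeof(v) for v in as_list)

    return array_total / n, list_total / n


if __name__ == "__main__":
    array_b, list_b = per_element_bytes()
    print(f"array.array('i'): {array_b:.1f} bytes/element")
    print(f"list of int:      {list_b:.1f} bytes/element")
    print(f"ratio:            {list_b / array_b:.1f}x")
```

Note that if the appended ints are objects that already exist elsewhere (for example as values of a vocabulary dict), the list only adds one pointer per element, so the real ratio could be smaller than this estimate. That might be worth checking as part of the re-evaluation.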

I can't see any obvious mistakes in my benchmarks back then, but something doesn't make sense here.

Re-measuring peak memory usage of CountVectorizer.fit_transform, with and without the above change, would be a good place to start.
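
One possible way to do that re-measurement is with `tracemalloc` from the standard library. This is only a sketch: the helper names `peak_memory_bytes` and `fit_transform_peak` are made up for this example, and `tracemalloc` only sees allocations reported to it, so the number will not necessarily match the process RSS reported by other tools.

```python
import tracemalloc


def peak_memory_bytes(func, *args, **kwargs):
    """Call func and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) since start().
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def fit_transform_peak(docs):
    """Peak traced memory of CountVectorizer.fit_transform on docs."""
    # Imported here so the helper above stays usable without scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer

    X, peak = peak_memory_bytes(CountVectorizer().fit_transform, docs)
    return X.shape, peak
```

Running `fit_transform_peak` on the same corpus in two environments, one with the scikit-learn version before the change and one after, would give comparable numbers for the two storage strategies.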

Related discussion in #13045 (comment)
