ENH: optimize StringEncoder#1248
Conversation
For memory and speed
|
FYI, the examples can be seen here: https://output.circle-artifacts.com/output/job/f25f287d-9bba-4153-977a-bbd5fbc2f3ca/artifacts/0/doc/auto_examples/02_text_with_string_encoders.html#sphx-glr-auto-examples-02-text-with-string-encoders-py |
jeromedockes
left a comment
There was a problem hiding this comment.
LGTM but it would be nice to cover the shape[1] == n_components case in a test, LMK if you don't have time now I can push a small commit for it
| elif X_out.shape[1] == self.n_components: | ||
| result = X_out.toarray() |
There was a problem hiding this comment.
does the 'arpack' algorithm not like it when p == n_components? or is skipping the tsvd in that case an optimization? if the latter, I guess the case where the number of discovered ngrams is exactly equal to n_components might be too rare to warrant it? and we might want a test for that branch
There was a problem hiding this comment.
I guess when n_components matches the number of dimensions of the vector, running the SVD doesn't make sense? We could coalesce this unlikely condition with the else statement below, though.
There was a problem hiding this comment.
It's both, actually.
I think that needed this for the tests to pass. At least my first implementation hit a corner case in the tests (good tests!)
| # Therefore, self.n_components_ below stores the resulting | ||
| # number of dimensions of result. | ||
| result = X_out[:, : self.n_components].toarray() | ||
| result = result.copy() |
There was a problem hiding this comment.
I suppose the copy() is that otherwise the reference to the slice would prevent X_out from being garbage collected; might be worth a short comment
It would be helpful. Thanks! Let's also wait for a systematic evaluation before merging |
|
@rcap107 : super useful. Thanks!!! |
Co-authored-by: Jerome Dockes <jerome@dockes.org>
Co-authored-by: Jerome Dockes <jerome@dockes.org>



For memory and speed: