Generate a sparse tf-idf matrix from a tokenlist in a recipe step #258

Open
jd4ds opened this issue Nov 15, 2023 · 2 comments
Labels
feature (a feature request or enhancement), Long Term (We will do this, but it will take a while before we do)

Comments

@jd4ds

jd4ds commented Nov 15, 2023

Hello,
I am currently using textrecipes in a project as part of our NLP pipeline together with {tidymodels}, and I have run into a problem for which I have not yet found a solution. The step textrecipes::step_tfidf apparently only generates a dense matrix (in the form of a tibble) and not a sparse matrix (dgCMatrix), which leads to an object so large that I cannot process it in memory. The details section of the documentation for this function also says that step_tokenfilter should be run beforehand for this reason. I would be very reluctant to do this, however, because I assume that in a sparse format, which I use as a blueprint in the modelling workflow anyway, the resulting object would be sufficiently small. Meanwhile, tidymodels also seems to be able to cope with sparse matrices as input.
So my question is: is there a way to convert from a tokenlist representation to a sparse tf-idf (or other dtm) representation in a recipe step, or to use another low-memory format as an intermediate step (such as the format from {tidytext})?
It would also be interesting to know whether this is currently only a technical restriction, or whether the idea behind it is that there is no legitimate modelling scenario in which we could not manage (better) with a token filter or a word embedding.

Many thanks in advance and best regards!

@EmilHvitfeldt
Member

Hello @jd4ds 👋

Short answer: No.

Long answer: Right now the recipe forces each step to return a tibble. This works fine for the tokenized state, as it uses a custom class to store the tokens, but once we turn them into numbers, such as with step_tfidf(), we are forced to make it a tibble, hence the dense format. So it is technically correct that tidymodels supports sparse input, but recipes carries the data in a dense format before turning it sparse, which we know is a blocker for some people.

Good news: We are planning how to deal with this in the long term, as we know that this is something that is really missing. Sneak peek here: https://github.com/EmilHvitfeldt/sparsevctrs/pull/1/files.
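
Editorial sketch, not part of the original comment: until recipes can carry sparse data, one possible workaround is to build the sparse tf-idf matrix outside the recipe. The example below assumes the tokens are available as a plain list of character vectors and uses only `Matrix::sparseMatrix()` and `Matrix::Diagonal()`. The helper `sparse_tfidf()` and the toy `tokens` object are hypothetical names introduced for illustration; they are not part of textrecipes. The weighting is the plain textbook form (term frequency times `log(n_docs / doc_freq)`), so the values may differ from what `step_tfidf()` produces with its own defaults.

```r
library(Matrix)

# Hypothetical toy input: one character vector of tokens per document.
tokens <- list(
  c("sparse", "matrix", "sparse"),
  c("dense", "matrix"),
  c("sparse", "tibble", "recipe")
)

# Hypothetical helper (not a textrecipes function):
# list of token vectors -> sparse tf-idf matrix.
sparse_tfidf <- function(tokens) {
  all_tokens <- unlist(tokens)
  vocab <- sort(unique(all_tokens))
  doc_id <- rep(seq_along(tokens), lengths(tokens))
  term_id <- match(all_tokens, vocab)

  # Duplicate (i, j) pairs are summed, which yields raw term counts
  # without ever allocating a dense documents-by-terms table.
  counts <- sparseMatrix(
    i = doc_id, j = term_id, x = rep(1, length(doc_id)),
    dims = c(length(tokens), length(vocab))
  )

  # Term frequency: each row divided by its document length.
  tf <- Diagonal(x = 1 / pmax(rowSums(counts), 1)) %*% counts

  # Inverse document frequency: log(n_docs / n_docs_containing_term).
  idf <- log(nrow(counts) / unname(colSums(counts > 0)))

  out <- tf %*% Diagonal(x = idf)
  colnames(out) <- vocab
  out
}

tfidf <- sparse_tfidf(tokens)
```

Both scaling steps multiply by diagonal matrices, so only the non-zero entries are touched and the result stays sparse. Whether the resulting matrix can be passed directly to a given model depends on the engine, which is outside the scope of this sketch.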

EmilHvitfeldt added the feature (a feature request or enhancement) and Long Term (We will do this, but it will take a while before we do) labels Nov 15, 2023
@jd4ds
Author

jd4ds commented Nov 16, 2023

Hey Emil,
thanks for the quick reply. Interesting to hear that a solution for this is already being worked on and how. However, it seems to me that this is a project that will take some time before it is finalised and can be used productively. Do I see that correctly? Until then I will probably not be able to work with {tidymodels}, at least not for this specific project.
Thanks again and best regards!
