Generate a sparse tf-idf matrix from a tokenlist in a recipe step #258

jd4ds · 2023-11-15T17:03:33Z

Hello,
I am currently trying to use textrecipes in a project as part of our NLP pipeline in connection with {tidymodels}. At one point I came across a problem for which I have not yet found a solution. My problem is that the step textrecipes::step_tfidf apparently only generates a dense matrix (in the form of a tibble) and not a sparse matrix (dgCMatrix) and this leads to such a large object that I cannot process it in memory. The details in the documentation for this function also describe that step_tokenfilter should be executed in advance for this purpose. I would be very reluctant to do this, however, as I assume that in a sparse format - which I use as a blueprint in the modelling workflow anyway - the resulting object is sufficiently small. Meanwhile, tidymodels also seems to be able to cope with sparse matrices as input.
So my question is: is there a way to convert from a tokenlist representation to a sparse tf-idf (or other dtm) representation in a recipe step, or to use another low-memory format as an intermediate step (such as the format from {tidytext})?
It would also be interesting to know whether this is currently only a technical restriction, or whether the idea behind it is that there is no legitimate modelling scenario that could not be handled (better) with a token filter or a word embedding.

Many thanks in advance and best regards!

EmilHvitfeldt · 2023-11-15T17:27:27Z

Hello @jd4ds 👋

Short answer: No.

Long answer: Right now, recipes forces each step to return a tibble. This works fine for the tokenized state, as it uses a custom class to store the tokens, but once we turn it into numbers, such as with step_tfidf(), we are forced to make it a tibble, hence the dense format. So it is technically correct that tidymodels supports sparse input, but recipes carries the data in a dense format before turning it sparse, which we know is a blocker for some people.

Good news: We are planning how to deal with this in the long term, as we know that this is something that is really missing. Sneak peek here: https://github.com/EmilHvitfeldt/sparsevctrs/pull/1/files.

jd4ds · 2023-11-16T13:21:16Z

Hey Emil,
thanks for the quick reply. Interesting to hear that a solution for this is already being worked on and how. However, it seems to me that this is a project that will take some time before it is finalised and can be used productively. Do I see that correctly? Until then I will probably not be able to work with {tidymodels}, at least not for this specific project.
Thanks again and best regards!

EmilHvitfeldt added the labels "feature" (a feature request or enhancement) and "Long Term" (We will do this, but it will take a while before we do) on Nov 15, 2023
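
For readers who hit the same memory limit: the {tidytext} route mentioned in the question can be sketched roughly as below. This is only an illustration, not something proposed in the thread, and it runs outside of a recipe, so it is not a drop-in replacement for step_tfidf(). The toy data and the names docs, tfidf_long and tfidf_sparse are made up for the example. The long one-row-per-document-term table stays small because absent terms are simply not stored, and cast_sparse() turns it into a dgCMatrix.

```r
library(tibble)
library(dplyr)
library(tidytext)

# Toy corpus (hypothetical data, for illustration only)
docs <- tibble(
  id   = c("a", "b", "c"),
  text = c("sparse matrices save memory",
           "dense matrices use memory",
           "recipes return tibbles")
)

# Long format: one row per (document, term) pair that actually occurs
tfidf_long <- docs |>
  unnest_tokens(word, text) |>   # tokenize into one word per row
  count(id, word) |>             # term counts per document
  bind_tf_idf(word, id, n)       # adds tf, idf and tf_idf columns

# Cast to a sparse document-term matrix without a dense intermediate
tfidf_sparse <- cast_sparse(tfidf_long, id, word, tf_idf)

class(tfidf_sparse)
dim(tfidf_sparse)
```

Note that the idf weights here are computed from whatever data is passed in, so applying the same weighting to new data would need extra bookkeeping that a recipe step normally handles.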
