Token filtering step based on relative frequency (length of source document) #259

Malichot · 2024-01-09T15:46:02Z

Hello,
First of all, thank you very much for developing this package!

I was wondering whether the token filtering step (step_tokenfilter()) could support filtering on a token's relative frequency rather than its absolute frequency. For example, here is my code:

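library(textrecipes)  # also attaches recipes
library(tibble)

data_train <- tibble(sentence = c("This is words", "They are nice !", "Pretty pretty pretty good !", "Another sentence that is in doc2"),
                     doc = c("doc1", "doc2", "doc2", "doc2"))

# stopwords_list is a character vector of stop words defined elsewhere.
# data_train has no outcome column, so the formula has no left-hand side.
data_rec <- recipe(~ sentence, data = data_train) %>%
  step_tokenize(sentence) %>%
  step_stopwords(sentence, custom_stopword_source = stopwords_list) %>%
  step_tokenfilter(sentence, max_tokens = 1000) %>%
  step_tfidf(sentence)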

The absolute-frequency filter is strongly influenced by the length of the document the sentences come from: the longer the text, the more often a given token is likely to appear. It would be useful to filter tokens by relative frequency (a token's count divided by the total number of words in each value of the doc variable in the example) before computing the tf-idf on the retained tokens, wouldn't it?
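To make the idea concrete, here is a rough base R sketch of what I mean by relative frequency. It is not textrecipes code and not a proposed implementation: the naive tokenizer, the min_rel_freq threshold, and all variable names are assumptions made up for illustration.

```r
# Hypothetical sketch: keep tokens by their relative frequency within each
# document group (the doc variable), instead of by absolute counts.
sentences <- c("This is words", "They are nice !",
               "Pretty pretty pretty good !", "Another sentence that is in doc2")
docs <- c("doc1", "doc2", "doc2", "doc2")

# Naive tokenizer (an assumption): lowercase, split on whitespace,
# and drop tokens that contain no letter or digit (e.g. "!").
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "\\s+")
  lapply(tokens, function(tok) tok[grepl("[[:alnum:]]", tok)])
}

tokens <- tokenize(sentences)

# One row per token occurrence, tagged with its source document.
long <- data.frame(
  doc = rep(docs, lengths(tokens)),
  token = unlist(tokens),
  stringsAsFactors = FALSE
)

# Count each token per document, then divide by the document's total tokens.
counts <- as.data.frame(table(doc = long$doc, token = long$token),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
doc_totals <- tapply(counts$Freq, counts$doc, sum)
counts$rel_freq <- counts$Freq / as.numeric(doc_totals[counts$doc])

# Keep tokens that reach the (hypothetical) threshold in at least one document.
min_rel_freq <- 0.2
keep_tokens <- unique(counts$token[counts$rel_freq >= min_rel_freq])
print(keep_tokens)
```

In this toy data, doc1 has 3 tokens and doc2 has 13, so a token that occurs once in doc1 has a relative frequency of 1/3, while "pretty" needs its three occurrences in doc2 to reach 3/13. An absolute count would rank them the other way around, which is the effect I would like the filter to account for.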

Thx in advance ;)


EmilHvitfeldt added the "feature" label (a feature request or enhancement) on Jan 9, 2024
