Token filtering step based on relative frequency (length of source document) #259

Open
Malichot opened this issue Jan 9, 2024 · 0 comments
Labels
feature a feature request or enhancement

Malichot commented Jan 9, 2024

Hello,
First of all, thank you very much for developing this package!

I was wondering whether the token filtering step (step_tokenfilter()) could support filtering on a token's relative frequency rather than its absolute frequency. For example, here's my code:

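library(tidymodels)
library(textrecipes)

data_train <- tibble(
  sentence = c("This is words", "They are nice !", "Pretty pretty pretty good !", "Another sentence that is in doc2"),
  doc = c("doc1", "doc2", "doc2", "doc2")
)
stopwords_list <- c("is", "are", "that", "in")  # placeholder; the real list is not shown here

# data_train has no outcome column, so the formula has no left-hand side
data_rec <- recipe(~ sentence, data = data_train) %>%
  step_tokenize(sentence) %>%
  step_stopwords(sentence, custom_stopword_source = stopwords_list) %>%
  step_tokenfilter(sentence, max_tokens = 1000) %>%
  step_tfidf(sentence)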


The absolute-frequency filter is heavily influenced by the length of the document the sentences come from: the longer the text, the more often a token is likely to appear. It would be useful to filter tokens by relative frequency (a token's count divided by the total number of words in each value of the doc variable in the example) before computing tf-idf on the retained tokens, wouldn't it?
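To make the idea concrete, here is a minimal base R sketch of what relative-frequency filtering could look like outside the recipe. It is not an existing textrecipes feature. The tokenize() helper, the min_rel_freq threshold, and the keep_tokens vector are hypothetical names chosen for illustration, and the tokenizer is deliberately simplistic, so its output may differ from what step_tokenize() produces.

```r
# Toy data from the example in this issue
data_train <- data.frame(
  sentence = c("This is words", "They are nice !",
               "Pretty pretty pretty good !", "Another sentence that is in doc2"),
  doc = c("doc1", "doc2", "doc2", "doc2"),
  stringsAsFactors = FALSE
)

# Hypothetical, simplistic tokenizer: lowercase, split on non-alphanumerics
tokenize <- function(x) {
  tokens <- unlist(strsplit(tolower(x), "[^a-z0-9]+"))
  tokens[nzchar(tokens)]
}

# Pool the tokens of all sentences belonging to the same doc
tokens_by_doc <- lapply(split(data_train$sentence, data_train$doc), tokenize)

# Relative frequency = token count in a doc / total number of tokens in that doc
rel_freq <- lapply(tokens_by_doc, function(tok) {
  counts <- table(tok)
  setNames(as.numeric(counts) / length(tok), names(counts))
})

# Hypothetical threshold: keep a token if its relative frequency
# reaches min_rel_freq in at least one doc
min_rel_freq <- 0.2
keep_tokens <- unique(unlist(lapply(rel_freq, function(rf) {
  names(rf)[rf >= min_rel_freq]
})))

print(rel_freq)
print(keep_tokens)
```

A vector like keep_tokens could then perhaps be fed back into the recipe, for instance through step_stopwords(sentence, custom_stopword_source = keep_tokens, keep = TRUE), which as far as I understand retains only the listed tokens. That is only a workaround, though, and a built-in option in step_tokenfilter() would be much cleaner.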

Thx in advance ;)

@EmilHvitfeldt EmilHvitfeldt added the feature a feature request or enhancement label Jan 9, 2024