textrecipes tuning parameters #16

Closed
EmilHvitfeldt opened this issue Nov 15, 2018 · 15 comments

Comments

@EmilHvitfeldt
Member

In accordance with tidymodels/textrecipes#14, here are my thoughts on what should be tunable.

step_texthash: num_terms integer.

step_tf: weight numeric.

step_tokenfilter: max numeric. min numeric. max_tokens integer.

Question.
Would something like weight_scheme in step_tf be tunable as it takes a couple of different (method as characters) values?

@topepo
Member

topepo commented Nov 19, 2018

Could you change min and max to something like min_occurance or min_times or something more specific?

> Would something like weight_scheme in step_tf be tunable as it takes a couple of different (method as characters) values?

Sure. We have qualitative parameters in other models too.

@EmilHvitfeldt
Member Author

> Could you change min and max to something like min_occurance or min_times or something more specific?

Done. Changed to *_times.

> Sure. We have qualitative parameters in other models too.

Then I have these additions.

step_tf: weight_scheme takes the following values: "binary", "raw count", "term frequency", "log normalization", "double normalization".

step_tokenize: token takes the following values: "characters", "character_shingle", "lines", "ngrams", "paragraphs", "ptb", "regex", "sentences", "skip_ngrams", "tweets", "words", "word_stems".

@EmilHvitfeldt
Member Author

Would a whole step be dialable?

@topepo
Member

topepo commented Dec 8, 2018

I've thought about the issue of including a step or not. We could add an eval_step option that is logical and add a tuning parameter for it that way.

@EmilHvitfeldt
Member Author

That would be great!

@topepo
Member

topepo commented Dec 8, 2018

Should weight be between [0, 1]?

@EmilHvitfeldt
Member Author

Mainly weight should be positive. But I think it is reasonable to bound it in [0, 1].

@topepo
Member

topepo commented Dec 8, 2018

Mind if I default weight to be on the log scale?

@EmilHvitfeldt
Member Author

That should be fine.

topepo added a commit that referenced this issue Dec 8, 2018
@topepo
Member

topepo commented Dec 8, 2018

Take a look at this commit and let me know if the default ranges (or anything else) should be changed.

@EmilHvitfeldt
Member Author

Looks good.

@topepo
Member

topepo commented Dec 8, 2018

Gak. I think that we need large numbers instead of Inf:

> max_times
Maximum Token Frequency  (quantitative)
Range: [1, Inf]
> grid_random(max_times, size = 5)
 Error in min(unlist(object$range)):max(unlist(object$range)) : 
  result would be too long a vector 

What should we put in? We could do:

> .Machine$integer.max
[1] 2147483647
> library(dials)
> max_times
Maximum Token Frequency  (quantitative)
Range: [1, 2147483647]
> grid_random(max_times, size = 5)
# A tibble: 5 x 1
   max_times
       <int>
1 1024987753
2 2080355927
3 1342632065
4   48813909
5   85432412

Maybe something smaller like as.integer(10^5)?
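
The error above is consistent with base R refusing to expand the sequence 1:Inf. A minimal sketch, assuming only base R (the name `upper` is hypothetical and is not a dials object), that reproduces the failure and shows that a finite cap can be sampled:

```r
# Base R sketch; `upper` is a hypothetical name, not part of dials.

# An infinite upper bound cannot be expanded into an integer sequence,
# so the attempt raises an error, which is captured here as a message.
msg <- tryCatch(
  {
    1:Inf
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# A finite cap makes the range enumerable, so random draws work.
upper <- as.integer(10^5)
set.seed(123)
draws <- sample(seq_len(upper), size = 5)
print(draws)
```

Any finite integer cap would avoid the error; as.integer(10^5) is just the candidate mentioned above.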

@EmilHvitfeldt
Member Author

So in essence, Inf was taken to mean 'do not remove, no matter how many times it appears'. But we can use as.integer(10^5). I haven't been able to research the subject, but I feel this parameter would work well on a log scale: I think 10, 100, 1000, 10000 is better than 2500, 5000, 7500, 10000.
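
To make the comparison concrete, here is a rough base R sketch of an evenly spaced grid versus a log10-spaced grid, both up to as.integer(10^5). The names `upper`, `linear_grid` and `log_grid` are hypothetical, and this is not how dials builds its grids:

```r
# Base R sketch; all names here are hypothetical illustrations.
upper <- as.integer(10^5)

# Evenly spaced candidates sit in the upper part of the range.
linear_grid <- as.integer(seq(upper / 4, upper, length.out = 4))

# Candidates evenly spaced on the log10 scale cover each order of magnitude.
log_grid <- as.integer(round(10^seq(1, log10(upper), length.out = 5)))

print(linear_grid)
print(log_grid)
```

The log-spaced grid puts most of its candidates at small cutoffs, which the evenly spaced grid never tries.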

@topepo
Member

topepo commented Dec 8, 2018

merged PR

@topepo topepo closed this as completed Dec 8, 2018
@github-actions

github-actions bot commented Mar 7, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 7, 2021