tokens_lookup should not match sequences of dictionary values #836
Comments
If you want to remove |
I do suggest we remove the However we should always match a sequence of tokens matching a value that is a multi-word pattern separated by whitespace. So dictionary(country = c("Zimbabwe", "United States of America")) should always match
> tokens("United States of America")
tokens from 1 document.
Component 1 :
[1] "United" "States" "of" "America"
but dictionary(country = c("United", "States", "of", "America")) should never match the same tokens. Given this scheme, I don't think we need a |
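The rule proposed above can be sketched in base R without quanteda. This is only an illustration of the principle, not how tokens_lookup is implemented: `match_value` and `match_lengths` are hypothetical helpers, and the sketch assumes exact, case-sensitive matching with a single space as the separator inside a value.

```r
# Hypothetical helper (not part of quanteda): return the start positions at
# which one dictionary value occurs in a token vector. Whitespace inside the
# value separates consecutive tokens.
match_value <- function(toks, value) {
  pattern <- strsplit(value, " ", fixed = TRUE)[[1]]
  n <- length(pattern)
  if (n == 0 || length(toks) < n) return(integer(0))
  starts <- seq_len(length(toks) - n + 1)
  is_hit <- vapply(starts,
                   function(i) all(toks[i:(i + n - 1)] == pattern),
                   logical(1))
  starts[is_hit]
}

# Hypothetical helper: for every match of every value, report how many tokens
# the match spans. Values are matched one at a time and never chained.
match_lengths <- function(toks, values) {
  unlist(lapply(values, function(v) {
    n <- length(strsplit(v, " ", fixed = TRUE)[[1]])
    rep(n, length(match_value(toks, v)))
  }))
}

toks <- c("United", "States", "of", "America")

# One multi-word value: a single match spanning all four tokens
multi <- match_lengths(toks, c("Zimbabwe", "United States of America"))

# Four separate values: four single-token matches, never one four-token match
separate <- match_lengths(toks, c("United", "States", "of", "America"))
```

Here `multi` is `4` and `separate` is `1 1 1 1`, so the order of values within a key has no effect on what counts as a sequence.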
It does not match the same tokens more than once already. See an example from test-tokens_lookup.R. We get only one 'Countries' in d1.
txt <- c(d1 = "The United States of America is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt)
dict <- dictionary(list(Countries = c("United States", "America", "United States of America"),
oceans = c("Ocean")))
tokens_lookup(toks, dict, valuetype = "glob")
# tokens from 2 documents.
# d1 :
# [1] "Countries" "oceans" "oceans"
#
# d2 :
# [1] "Countries"
|
Yet setting multiword = FALSE does not solve this. We should get two "country" matches, but both calls return one.
toks_uni <- as.tokens(list(d1 = c("The", "United States of America", "and",
                                  "the", "United", "States", "of", "America",
                                  "are", "the", "same", "country")))
tokens_lookup(toks_uni, multiword = FALSE,
              dictionary = dictionary(country = c("China", "United States of America")))
tokens_lookup(toks_uni, multiword = TRUE,
              dictionary = dictionary(country = c("China", "United States of America")))
In order to get the right result, list(c('United States of America'), c('United', 'States', 'of', 'America') |
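The expected count of two can be illustrated with a small base R sketch that tries each value both ways: as one literal token and as a whitespace-split sequence. `count_matches` is a hypothetical helper, not a quanteda function, and it assumes the compound token was joined with a single space, as in the example above.

```r
# Hypothetical helper (not part of quanteda): count matches of one dictionary
# value, either as a single literal token or as a sequence of tokens obtained
# by splitting the value on spaces.
count_matches <- function(toks, value) {
  # Literal match: the whole value is one token, e.g. a compound
  literal <- sum(toks == value)

  # Sequence match: only relevant when the value has more than one word
  pattern <- strsplit(value, " ", fixed = TRUE)[[1]]
  n <- length(pattern)
  sequence <- 0L
  if (n > 1 && length(toks) >= n) {
    for (i in seq_len(length(toks) - n + 1)) {
      if (all(toks[i:(i + n - 1)] == pattern)) sequence <- sequence + 1L
    }
  }
  literal + sequence
}

toks_uni <- c("The", "United States of America", "and",
              "the", "United", "States", "of", "America",
              "are", "the", "same", "country")

n_usa <- count_matches(toks_uni, "United States of America")  # 2
n_china <- count_matches(toks_uni, "China")                   # 0
```

The compound at position 2 and the four-token sequence at positions 5 to 8 are each counted once, which is the result the comment above says the lookup should produce.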
And what about the case of token sequences including a matching sequence of compounds, such as c( This is tricky. |
This is too complex to automate, so we have to leave it to users, giving them options (an argument or a wrapper). |
How about this for a set of principles:
|
1-3 can be done by generating |
Sounds good. I assume here you mean internally - no need for a user to set any option or wrap a dictionary in |
Yes, it is internal. We should also make |
Much simpler if it's a literal match of a value For the default concatenator, I prefer the |
The only behaviour should be what is now matched when multiword = FALSE. We should hardwire multiword = FALSE and remove it as an option. We don't want to allow sequences of separate values to generate matches because their order in the dictionary values should not matter.
This is what I would expect:
Since we thought - wrongly - that the alphabetical order of types might matter (it does not), I also tried this with a "Z" word. Same result so this is not something to be worried about.