Nested (overlapped) elements to tokens_lookup() #1375
Can dictionaries be assembled so that categories follow logical commands to ignore hits by neighboring categories?
This really should be an SO question, but anyway: the easiest way would be to compound the tokens you want treated as single items. You can also look up sequences and match them using dictionaries.

```r
library("quanteda")
## Package version: 1.4.3

txt <- c("I spent one guinea while visiting Equatorial Guinea and then went to Guinea Bissau")
toks <- tokens(txt)

# method 1: compound the tokens first
tokens_compound(toks, phrase(c("equatorial guinea", "guinea bissau")))
## tokens from 1 document.
## text1 :
##  [1] "I"                 "spent"             "one"
##  [4] "guinea"            "while"             "visiting"
##  [7] "Equatorial_Guinea" "and"               "then"
## [10] "went"              "to"                "Guinea_Bissau"

# method 2: define just the compound word dictionaries
dict <- dictionary(list(
  Equatorial_Guinea = "equatorial guinea",
  Guinea_Bissau = "guinea bissau"
))
tokens_lookup(toks, dict, exclusive = FALSE)
## tokens from 1 document.
## text1 :
##  [1] "I"                 "spent"             "one"
##  [4] "guinea"            "while"             "visiting"
##  [7] "EQUATORIAL_GUINEA" "and"               "then"
## [10] "went"              "to"                "GUINEA_BISSAU"
```
Thanks for clearing that up, @kbenoit.
`tokens_lookup()` returns more than one value if a dictionary has overlapping elements (see koheiw/newsmap#13 (comment)).

There is a workaround that works:

1. Run `tokens_lookup()` with the option `valuetype = "glob"`.
2. Run `tokens_lookup()` with the option `valuetype = "fixed"`.
3. Then drop the other values from (1).

However, an option such as the one @koheiw suggested might be more efficient and easier to implement.
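The two-pass workaround above can be sketched roughly as follows. This is a minimal illustration with a made-up dictionary (`dict`, `txt`, and the key names are hypothetical, not from the original issue) in which a glob pattern overlaps a fixed multi-word value:

```r
library("quanteda")

txt <- "I visited Guinea and Guinea Bissau"
toks <- tokens(txt)

# Hypothetical overlapping dictionary: the glob pattern "guinea*" also
# matches the first token of the multi-word value "guinea bissau".
dict <- dictionary(list(
  Guinea        = "guinea*",        # glob pattern
  Guinea_Bissau = "guinea bissau"   # exact multi-word value
))

# Pass 1: glob matching, which can produce hits from both keys
# where the patterns overlap
toks_glob <- tokens_lookup(toks, dict, valuetype = "glob")

# Pass 2: fixed matching, which keeps only literal matches
# (wildcards are treated as plain characters)
toks_fixed <- tokens_lookup(toks, dict, valuetype = "fixed")

# Final step (not shown): drop from the pass-1 result the values
# superseded by the exact matches found in pass 2.
```

The final drop step is left as a comment because how to reconcile the two results depends on the application; the point is only that running the lookup twice with different `valuetype` settings lets you distinguish exact hits from glob hits.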