Nested (overlapped) elements to tokens_lookup() #1375

Closed
R01010010R opened this issue Jun 2, 2018 · 4 comments

@R01010010R

tokens_lookup() returns more than one value if a dictionary has overlapping elements (see koheiw/newsmap#13 (comment)).
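
A minimal sketch of the overlap, assuming a hypothetical two-key dictionary in which one value ("guinea") is also the first word of another ("guinea bissau"). The text, object names, and key names are illustrative and are not taken from the linked newsmap thread:

```r
library("quanteda")

# Hypothetical dictionary: "guinea" is nested inside "guinea bissau"
dict_overlap <- dictionary(list(
  guinea = "guinea",
  guinea_bissau = "guinea bissau"
))

toks_overlap <- tokens("They flew to Guinea Bissau")

# Both keys match the same stretch of text, so the single mention of
# Guinea Bissau produces a hit for each category
hits <- tokens_lookup(toks_overlap, dict_overlap, valuetype = "glob")
print(hits)
```

The second hit is the unwanted one: the text mentions only Guinea Bissau, yet the guinea category is counted as well.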

One workaround does the job:

  1. Use tokens_lookup() with the option valuetype = 'glob'.
  2. Run tokens_lookup() again with the option valuetype = 'fixed' and check whether the same tokens return at least one of those values. If they do, drop the other values from step 1.

However, an option such as the one @koheiw suggested might be more efficient and easier to implement.

@kbenoit kbenoit added this to the v2.0 milestone Jan 30, 2019
@plugrafico

Can dictionaries be assembled so that categories follow logical commands to ignore hits by neighboring categories?
For instance (just an illustration; I am not sure what the logical expressions should look like):

dict <- dictionary(list(
  guinea_bissau = "guinea bissau",
  equatorial_guinea = "equatorial guinea",
  guinea = "guinea(NOT 'guinea_bissau' NOT 'equatorial_guinea')"
))

@kbenoit
Collaborator

kbenoit commented Jun 12, 2019

Really should be a SO question, but anyway:

The easiest way would be to compound the tokens you want treated as single items. You can also look up sequences and match them using dictionaries.

library("quanteda")
## Package version: 1.4.3

txt <- c("I spent one guinea while visiting Equatorial Guinea and then went to Guinea Bissau")
toks <- tokens(txt)

# method 1: compound the tokens first
tokens_compound(toks, phrase(c("equatorial guinea", "guinea bissau")))
## tokens from 1 document.
## text1 :
##  [1] "I"                 "spent"             "one"              
##  [4] "guinea"            "while"             "visiting"         
##  [7] "Equatorial_Guinea" "and"               "then"             
## [10] "went"              "to"                "Guinea_Bissau"

# method 2: define just the compound word dictionaries
dict <- dictionary(list(
  Equatorial_Guinea = "equatorial guinea",
  Guinea_Bissau = "guinea bissau"
))
tokens_lookup(toks, dict, exclusive = FALSE)
## tokens from 1 document.
## text1 :
##  [1] "I"                 "spent"             "one"              
##  [4] "guinea"            "while"             "visiting"         
##  [7] "EQUATORIAL_GUINEA" "and"               "then"             
## [10] "went"              "to"                "GUINEA_BISSAU"

Set capkeys = FALSE to turn off the capitalization.
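
To also count a standalone "guinea" category, as in the question above, the two methods can be combined. The following is a sketch under stated assumptions rather than a built-in feature: it compounds the phrases first, then looks up the compounded forms, relying on the default "_" concatenator of tokens_compound(). The lowercase key names and the object names are illustrative.

```r
library("quanteda")

txt <- c("I spent one guinea while visiting Equatorial Guinea and then went to Guinea Bissau")
toks <- tokens(txt)

# Step 1: join the multi-word country names into single tokens
toks_comp <- tokens_compound(toks, phrase(c("equatorial guinea", "guinea bissau")))

# Step 2: match the compounded tokens; a pattern without wildcards only
# matches a whole token, so "guinea" no longer hits "Guinea_Bissau"
dict_comp <- dictionary(list(
  guinea = "guinea",
  equatorial_guinea = "equatorial_guinea",
  guinea_bissau = "guinea_bissau"
))

# capkeys = FALSE keeps the keys as written instead of upper-casing them
res <- tokens_lookup(toks_comp, dict_comp, capkeys = FALSE)
print(res)
```

Each mention is then assigned to exactly one category, because after compounding no token can match more than one of the three patterns.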

koheiw added a commit that referenced this issue Jun 13, 2019
@plugrafico

Thanks for clearing that up, @kbenoit

@kbenoit
Collaborator

kbenoit commented Jun 13, 2019

No problem. @koheiw and I are working on more built-in solutions too (#1708).

kbenoit added a commit that referenced this issue Jul 1, 2019
@kbenoit kbenoit closed this as completed Jul 1, 2019