Nested (overlapped) elements to tokens_lookup() #1375

R01010010R · 2018-06-02T11:44:28Z

tokens_lookup() returns more than one value if a dictionary has overlapping elements (see koheiw/newsmap#13 (comment)).

There is a workaround:

1. Use tokens_lookup() with the option valuetype = 'glob'.
2. Check whether the tokens also return at least one of those values when tokens_lookup() is used with the option valuetype = 'fixed', and then drop the other values from (1).
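One possible reading of these two steps is sketched below. The dictionary, the texts, and the helper prefer_fixed() are hypothetical and not taken from the linked newsmap issue; the sketch also assumes the check is done per document rather than per token.

```r
library("quanteda")

# Hypothetical overlapping dictionary: the glob "niger*" also matches "Nigeria"
dict <- dictionary(list(niger = "niger*", nigeria = "nigeria"))
toks <- tokens(c(text1 = "Nigeria exports oil",
                 text2 = "Nigerien officials met"))

# Step 1: glob lookup, which can return several keys for a single token
glob_keys <- as.list(tokens_lookup(toks, dict, valuetype = "glob"))

# Step 2: fixed lookup, where "niger*" is treated as a literal string
fixed_keys <- as.list(tokens_lookup(toks, dict, valuetype = "fixed"))

# Hypothetical helper: if any glob key is confirmed by the fixed lookup,
# keep only the confirmed keys; otherwise fall back to the glob keys
prefer_fixed <- function(glob, fixed) {
  both <- intersect(glob, fixed)
  if (length(both) > 0) both else glob
}

resolved <- Map(prefer_fixed, glob_keys, fixed_keys)
print(resolved)
```

This runs the lookup twice, which is part of why a built-in option would be more efficient.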

However, an option such as the one @koheiw suggested might be more efficient and easier to implement.

plugrafico · 2019-06-12T18:47:15Z

Can dictionaries be assembled so that categories follow logical commands to ignore hits by neighboring categories?
For instance (just an illustration, I am not sure what the logical expressions should look like):

dict <- dictionary(list(
  guinea_bissau = "guinea bissau",
  equatorial_guinea = "equatorial guinea",
  guinea = "guinea(NOT 'guinea_bissau' NOT 'equatorial_guinea')"
))

kbenoit · 2019-06-12T20:06:19Z

Really should be a SO question, but anyway:

The easiest way would be to compound the tokens you want treated as single items. You can also look up sequences and match them using dictionaries.

library("quanteda")
## Package version: 1.4.3

txt <- c("I spent one guinea while visiting Equatorial Guinea and then went to Guinea Bissau")
toks <- tokens(txt)

# method 1: compound the tokens first
tokens_compound(toks, phrase(c("equatorial guinea", "guinea bissau")))
## tokens from 1 document.
## text1 :
##  [1] "I"                 "spent"             "one"              
##  [4] "guinea"            "while"             "visiting"         
##  [7] "Equatorial_Guinea" "and"               "then"             
## [10] "went"              "to"                "Guinea_Bissau"

# method 2: define just the compound word dictionaries
dict <- dictionary(list(
  Equatorial_Guinea = "equatorial guinea",
  Guinea_Bissau = "guinea bissau"
))
tokens_lookup(toks, dict, exclusive = FALSE)
## tokens from 1 document.
## text1 :
##  [1] "I"                 "spent"             "one"              
##  [4] "guinea"            "while"             "visiting"         
##  [7] "EQUATORIAL_GUINEA" "and"               "then"             
## [10] "went"              "to"                "GUINEA_BISSAU"

Set capkeys = FALSE to turn off the capitalization.
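A minimal sketch of that option, rebuilding the objects from method 2 so that it runs on its own:

```r
library("quanteda")

txt <- c("I spent one guinea while visiting Equatorial Guinea and then went to Guinea Bissau")
toks <- tokens(txt)

dict <- dictionary(list(
  Equatorial_Guinea = "equatorial guinea",
  Guinea_Bissau = "guinea bissau"
))

# capkeys = FALSE keeps the keys as they are spelled in the dictionary
# instead of converting them to upper case
toks_keys <- tokens_lookup(toks, dict, exclusive = FALSE, capkeys = FALSE)
print(toks_keys)
```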

plugrafico · 2019-06-13T14:33:06Z

Thanks for clearing that up, @kbenoit.

kbenoit · 2019-06-13T14:46:12Z

No problem. @koheiw and I are working on more built-in solutions too (#1708).
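A sketch of what such a built-in solution can look like, assuming a quanteda release in which tokens_lookup() accepts a nested_scope argument. That name comes from later quanteda documentation, not from this thread, where the commit calls it an overlap argument, and it is not available in version 1.4.3 as used above.

```r
library("quanteda")

txt <- c("I spent one guinea while visiting Equatorial Guinea and then went to Guinea Bissau")
toks <- tokens(txt)

# The overlapping dictionary from the question, without the NOT syntax
dict <- dictionary(list(
  guinea_bissau = "guinea bissau",
  equatorial_guinea = "equatorial guinea",
  guinea = "guinea"
))

# Scope "key": nested matches are only resolved within each key, so the
# "Guinea" inside "Equatorial Guinea" is also matched by the guinea key
by_key <- tokens_lookup(toks, dict, nested_scope = "key")

# Scope "dictionary": a longer match under any key suppresses the shorter
# match nested inside it
by_dict <- tokens_lookup(toks, dict, nested_scope = "dictionary")

print(by_key)
print(by_dict)
```

With the dictionary-wide scope, the standalone "guinea" should be the only token counted under the guinea key.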

kbenoit added this to the v2.0 milestone Jan 30, 2019

kbenoit added the question label Jun 12, 2019

koheiw added a commit that referenced this issue Jun 13, 2019

Add overlap argument for #1375

0884a15

kbenoit added a commit that referenced this issue Jul 1, 2019

Merge pull request #1708 from quanteda/issue-1375

868afa6

kbenoit modified the milestones: v2.0 desirables, v1.5 desirables Jul 1, 2019

kbenoit closed this as completed Jul 1, 2019
