
tokens_lookup should not match sequences of dictionary values #836

Closed
kbenoit opened this issue Jul 6, 2017 · 11 comments

kbenoit commented Jul 6, 2017

The only behaviour should be what is now matched when multiword = FALSE. We should hardwire multiword = FALSE and remove it as an option. We don't want to allow sequences of separate values to generate matches because their order in the dictionary values should not matter.

(toks_uni2 <- as.tokens(list(d1 = c("The", "United States of America", 
                                    "China", "is", "not", "a", "country"))))
## tokens from 1 document.
## d1 :
## [1] "The"                      "United States of America" "China"                   
## [4] "is"                       "not"                      "a"                       
## [7] "country"                 

## WRONG: SHOULD BE TWO MATCHES
tokens_lookup(toks_uni2, 
              dictionary = dictionary(country = c("China", "United States of America")))
## tokens from 1 document.
## d1 :
## [1] "country"

This is what I would expect:

## CORRECT
tokens_lookup(toks_uni2, multiword = FALSE,
              dictionary = dictionary(country = c("China", "United States of America")))
## tokens from 1 document.
## d1 :
## [1] "country" "country"

Since we thought, wrongly, that the alphabetical order of types might matter (it does not), I also tried this with a "Z" word. The result is the same, so this is not something to worry about.

(toks_uni3 <- as.tokens(list(d1 = c("The", "United States of America", "Zimbabwe", "is", "not", "a", "country"))))
## tokens from 1 document.
## d1 :
## [1] "The"                      "United States of America" "Zimbabwe"                
## [4] "is"                       "not"                      "a"                       
## [7] "country"   

## WRONG: SHOULD BE TWO MATCHES
tokens_lookup(toks_uni3, 
              dictionary = dictionary(country = c("Zimbabwe", "United States of America")))
## tokens from 1 document.
## d1 :
## [1] "country"

## CORRECT
tokens_lookup(toks_uni3, multiword = FALSE,
              dictionary = dictionary(country = c("Zimbabwe", "United States of America")))
## tokens from 1 document.
## d1 :
## [1] "country" "country"

koheiw commented Jul 12, 2017

If you want to remove the multiword argument, we have to let users wrap a dictionary with phrase(). To do so, I will add a multiword attribute to dictionary objects.


kbenoit commented Jul 12, 2017

I do suggest we remove the multiword argument, but I don't think we should set any attribute for a dictionary about this. Since the order of dictionary values is undefined as part of the definition of a dictionary object, we should not be matching tokens based on their order. In addition, because sequence length is undefined, we then get into nesting and overlapping issues. Better to avoid them all and require a pattern match based on a single whitespace-separated dictionary value only.

However, we should always match a sequence of tokens against a value that is a multi-word pattern separated by whitespace. So

dictionary(country = c("Zimbabwe", "United States of America"))

should always match

> tokens("United States of America")
tokens from 1 document.
Component 1 :
[1] "United"  "States"  "of"      "America"

but

dictionary(country = c("United", "States", "of", "America"))

should never match the same tokens.

Given this scheme, I don't think we need a multiword attribute for a dictionary.
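
A minimal runnable sketch of this scheme (assuming quanteda is attached; the results in the comments are what the proposed behaviour would produce):

library("quanteda")

toks <- tokens("The United States of America is one country")

## a single multi-word value: matches the token sequence once
tokens_lookup(toks, dictionary(list(country = "United States of America")))
## expected:
## [1] "country"

## separate single-word values: each value matches its own token
## independently, but the sequence is never matched as a unit
tokens_lookup(toks, dictionary(list(country = c("United", "States", "of", "America"))))
## expected:
## [1] "country" "country" "country" "country"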


koheiw commented Jul 12, 2017

It already does not match the same tokens more than once. See this example from test-tokens_lookup.R, where we get only one 'Countries' in d1.

txt <- c(d1 = "The United States of America is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt)
dict <- dictionary(list(Countries = c("United States", "America", "United States of America"),
                        oceans = c("Ocean")))
tokens_lookup(toks, dict, valuetype = "glob")

# tokens from 2 documents.
# d1 :
#     [1] "Countries" "oceans"    "oceans"   
# 
# d2 :
#     [1] "Countries"


koheiw commented Jul 12, 2017

Yet setting multiword = FALSE does not solve the problem. We should get two "country" matches, but both calls return one.

toks_uni <- as.tokens(list(d1 = c("The", "United States of America", "and", 
                                   "the", "United", "States", "of", "America",
                                   "are", "the", "same", "country")))

tokens_lookup(toks_uni, multiword = FALSE,
              dictionary = dictionary(country = c("China", "United States of America")))

tokens_lookup(toks_uni, multiword = TRUE,
              dictionary = dictionary(country = c("China", "United States of America")))

In order to get the right result, tokens_lookup() should match both the multiword = TRUE and the multiword = FALSE forms. When "United States of America" is given as a dictionary value, it should be converted to the pattern:

list(c('United States of America'), c('United', 'States', 'of', 'America'))
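
A hypothetical helper showing this conversion (illustrative only; expand_value is not a quanteda function):

## expand a multi-word dictionary value into the two patterns to look up:
## the value as-is (matching a single compounded token) and the
## whitespace-split unigram sequence
expand_value <- function(value) {
  list(value, strsplit(value, " ", fixed = TRUE)[[1]])
}

expand_value("United States of America")
## [[1]]
## [1] "United States of America"
##
## [[2]]
## [1] "United"  "States"  "of"      "America"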


kbenoit commented Jul 12, 2017

And what about the case of token sequences including a matching sequence of compounds, such as c("United States", "of", "America")?

This is tricky.


koheiw commented Jul 12, 2017

This is too complex to automate, so we have to leave the issue with users, giving them options (an argument or a wrapper).


kbenoit commented Jul 12, 2017

How about this for a set of principles:

  1. Dictionary values can consist of multi-word patterns, separated by whitespace. Values are unordered and we never match sequences of values to sequences of tokens.
  2. dictionary(key = c("A B C")) will match tokens "A B C" (once).
  3. dictionary(key = c("A B C")) will match tokens "A", "B", "C" (once).
  4. dictionary(key = c("A B C")) will not match tokens "A", "B C". If users want that, then as you say, it's up to them to get it working with some transformations first. If this becomes an issue, we can deal with it later. (See the sketch below.)
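
A sketch of cases 2 to 4 in code (the results in comments are what would be expected under the behaviour agreed here, assuming quanteda is attached):

library("quanteda")
dict <- dictionary(list(key = "A B C"))

## case 2: a single compounded token "A B C" gives one match
tokens_lookup(as.tokens(list(d1 = "A B C")), dict)
## expected: [1] "key"

## case 3: the separate tokens "A", "B", "C" give one match
tokens_lookup(as.tokens(list(d1 = c("A", "B", "C"))), dict)
## expected: [1] "key"

## case 4: the mixed segmentation "A", "B C" gives no match
tokens_lookup(as.tokens(list(d1 = c("A", "B C"))), dict)
## expected: no match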


koheiw commented Jul 12, 2017

1-3 can be done by generating list(c('United States of America'), c('United', 'States', 'of', 'America')) (easy). I am fine with leaving 4 unsolved.


kbenoit commented Jul 12, 2017

Sounds good. I assume here you mean internally: there is no need for a user to set any option or wrap a dictionary in phrase(), since this behaviour always occurs with tokens_lookup(). For dfm_lookup(), of course, we only match case 2.


koheiw commented Jul 12, 2017

Yes, it is internal. We should also make concatenator = ' ' the default in tokens_compound() and tokens_ngrams() to make it work.
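
For example (assuming tokens_compound() with its concatenator argument and the phrase() wrapper):

library("quanteda")
toks <- tokens("The United States of America")

## with concatenator = " ", the compounded token is literally
## "United States of America", so it matches the dictionary value as-is
tokens_compound(toks, phrase("United States of America"), concatenator = " ")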


kbenoit commented Jul 12, 2017

Much simpler if it's a literal match of a value "A B C" to a token "A B C", but we're saying here that we also need a way to match a value "A B C" to a token "A_B_C" if that tokens object's concatenator setting is "_"? That's how we would match case 2 in the general case. But that means that even if a user sets the concatenator to "_", we still need to handle that case, so we are not relying on a default setting of concatenator = " ".

For the default concatenator, I prefer the "_" since it really makes a single "token" from multiple tokens, and makes it really explicit that they have been compounded. Especially when a dfm is inspected, it's nice to see the compounded tokens clearly.
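
One way the internal handling could work regardless of the user's setting (a sketch; value_as_compound is an illustrative name, not quanteda code):

## rewrite a whitespace-separated dictionary value using the tokens
## object's concatenator, so that "A B C" also matches a compounded "A_B_C"
value_as_compound <- function(value, concatenator = "_") {
  gsub(" ", concatenator, value, fixed = TRUE)
}

value_as_compound("United States of America")
## [1] "United_States_of_America"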

koheiw mentioned this issue Jul 12, 2017