
tokens_lookup should not match sequences of dictionary values #836

Closed
kbenoit opened this issue Jul 6, 2017 · 11 comments

kbenoit commented Jul 6, 2017

The only behaviour should be what is now matched when multiword = FALSE. We should hardwire multiword = FALSE and remove it as an option. We don't want to allow sequences of separate values to generate matches because their order in the dictionary values should not matter.

(toks_uni2 <- as.tokens(list(d1 = c("The", "United States of America", 
                                    "China", "is", "not", "a", "country"))))
## tokens from 1 document.
## d1 :
## [1] "The"                      "United States of America" "China"                   
## [4] "is"                       "not"                      "a"                       
## [7] "country"                 

## WRONG: SHOULD BE TWO MATCHES
tokens_lookup(toks_uni2, 
              dictionary = dictionary(country = c("China", "United States of America")))
## tokens from 1 document.
## d1 :
## [1] "country"

This is what I would expect:

## CORRECT
tokens_lookup(toks_uni2, multiword = FALSE,
              dictionary = dictionary(country = c("China", "United States of America")))
## tokens from 1 document.
## d1 :
## [1] "country" "country"

Since we thought, wrongly, that the alphabetical order of types might matter (it does not), I also tried this with a "Z" word. The result is the same, so this is not something to worry about.

(toks_uni3 <- as.tokens(list(d1 = c("The", "United States of America", "Zimbabwe", "is", "not", "a", "country"))))
## tokens from 1 document.
## d1 :
## [1] "The"                      "United States of America" "Zimbabwe"                
## [4] "is"                       "not"                      "a"                       
## [7] "country"   

## WRONG: SHOULD BE TWO MATCHES
tokens_lookup(toks_uni3, 
              dictionary = dictionary(country = c("Zimbabwe", "United States of America")))
## tokens from 1 document.
## d1 :
## [1] "country"

## CORRECT
tokens_lookup(toks_uni3, multiword = FALSE,
              dictionary = dictionary(country = c("Zimbabwe", "United States of America")))
## tokens from 1 document.
## d1 :
## [1] "country" "country"

koheiw commented Jul 12, 2017

If you want to remove the multiword argument, we have to let users wrap a dictionary with phrase(). To do so, I will add a multiword attribute to dictionary objects.


kbenoit commented Jul 12, 2017

I do suggest we remove the multiword argument, but I don't think we should set any attribute for a dictionary about this. Since the order of dictionary values is undefined as part of the definition of a dictionary object, we should not be matching tokens based on their order. In addition, because sequence length is undefined, we then get into nesting and overlapping issues. Better to avoid them all and require a pattern match based on a single whitespace-separated dictionary value only.

However, we should always match a sequence of tokens against a value that is a multi-word pattern separated by whitespace. So

dictionary(country = c("Zimbabwe", "United States of America"))

should always match

> tokens("United States of America")
tokens from 1 document.
Component 1 :
[1] "United"  "States"  "of"      "America"

but

dictionary(country = c("United", "States", "of", "America"))

should never match the same tokens.

Given this scheme, I don't think we need a multiword attribute for a dictionary.
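
A minimal runnable sketch of this scheme (assuming quanteda is attached; the results in the comments are what the proposed behaviour would produce):

library("quanteda")

toks <- tokens("The United States of America is one country")

## a single multi-word value: matches the token sequence once
tokens_lookup(toks, dictionary(list(country = "United States of America")))
## expected:
## [1] "country"

## separate single-word values: each value matches its own token
## independently, but the sequence is never matched as a unit
tokens_lookup(toks, dictionary(list(country = c("United", "States", "of", "America"))))
## expected:
## [1] "country" "country" "country" "country"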


koheiw commented Jul 12, 2017

It already does not match the same tokens more than once. See this example from test-tokens_lookup.R, where we get only one 'Countries' in d1.

txt <- c(d1 = "The United States of America is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt)
dict <- dictionary(list(Countries = c("United States", "America", "United States of America"),
                        oceans = c("Ocean")))
tokens_lookup(toks, dict, valuetype = "glob")

# tokens from 2 documents.
# d1 :
#     [1] "Countries" "oceans"    "oceans"   
# 
# d2 :
#     [1] "Countries"


koheiw commented Jul 12, 2017

Yet setting multiword = FALSE does not solve the problem. We should get two "country" matches, but both calls return one.

toks_uni <- as.tokens(list(d1 = c("The", "United States of America", "and", 
                                   "the", "United", "States", "of", "America",
                                   "are", "the", "same", "country")))

tokens_lookup(toks_uni, multiword = FALSE,
              dictionary = dictionary(country = c("China", "United States of America")))

tokens_lookup(toks_uni, multiword = TRUE,
              dictionary = dictionary(country = c("China", "United States of America")))

In order to get the right result, tokens_lookup() should match both the multiword = TRUE and the multiword = FALSE forms. When "United States of America" is given as a dictionary value, it should be converted to the pattern:

list(c('United States of America'), c('United', 'States', 'of', 'America'))
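
A hypothetical helper showing this conversion (illustrative only; expand_value is not a quanteda function):

## expand a multi-word dictionary value into the two patterns to look up:
## the value as-is (matching a single compounded token) and the
## whitespace-split unigram sequence
expand_value <- function(value) {
  list(value, strsplit(value, " ", fixed = TRUE)[[1]])
}

expand_value("United States of America")
## [[1]]
## [1] "United States of America"
##
## [[2]]
## [1] "United"  "States"  "of"      "America"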


kbenoit commented Jul 12, 2017

And what about the case of token sequences including a matching sequence of compounds, such as c("United States", "of", "America")?

This is tricky.


koheiw commented Jul 12, 2017

This is too complex to automate, so we have to leave the issue with users, giving them options (an argument or a wrapper).


kbenoit commented Jul 12, 2017

How about this for a set of principles:

  1. Dictionary values can consist of multi-word patterns, separated by whitespace. Values are unordered and we never match sequences of values to sequences of tokens.
  2. dictionary(key = c("A B C")) will match tokens "A B C" (once).
  3. dictionary(key = c("A B C")) will match tokens "A", "B", "C" (once).
  4. dictionary(key = c("A B C")) will not match tokens "A", "B C". If users want that, then as you say, it's up to them to get it working with some transformations first. If this becomes an issue, we can deal with it later. (See the sketch below.)
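
A sketch of cases 2 to 4 in code (the results in comments are what would be expected under the behaviour agreed here, assuming quanteda is attached):

library("quanteda")
dict <- dictionary(list(key = "A B C"))

## case 2: a single compounded token "A B C" gives one match
tokens_lookup(as.tokens(list(d1 = "A B C")), dict)
## expected: [1] "key"

## case 3: the separate tokens "A", "B", "C" give one match
tokens_lookup(as.tokens(list(d1 = c("A", "B", "C"))), dict)
## expected: [1] "key"

## case 4: the mixed segmentation "A", "B C" gives no match
tokens_lookup(as.tokens(list(d1 = c("A", "B C"))), dict)
## expected: no match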


koheiw commented Jul 12, 2017

1-3 can be done by generating list(c('United States of America'), c('United', 'States', 'of', 'America')) (easy). I am fine with leaving 4 unsolved.


kbenoit commented Jul 12, 2017

Sounds good. I assume here you mean internally: there is no need for a user to set any option or wrap a dictionary in phrase(), since this behaviour always occurs with tokens_lookup(). For dfm_lookup(), of course, we only match case 2.


koheiw commented Jul 12, 2017

Yes, it is internal. We should also make concatenator = ' ' the default in tokens_compound() and tokens_ngrams() to make it work.
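
For example (assuming tokens_compound() with its concatenator argument and the phrase() wrapper):

library("quanteda")
toks <- tokens("The United States of America")

## with concatenator = " ", the compounded token is literally
## "United States of America", so it matches the dictionary value as-is
tokens_compound(toks, phrase("United States of America"), concatenator = " ")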


kbenoit commented Jul 12, 2017

Much simpler if it's a literal match of a value "A B C" to a token "A B C", but we're saying here that we also need a way to match a value "A B C" to a token "A_B_C" if that tokens object's concatenator setting is "_"? That's how we would match case 2 in the general case. But that means that even if a user sets the concatenator to "_", we still need to handle that case, so we are not relying on a default setting of concatenator = " ".

For the default concatenator, I prefer the "_" since it really makes a single "token" from multiple tokens, and makes it really explicit that they have been compounded. Especially when a dfm is inspected, it's nice to see the compounded tokens clearly.
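
One way the internal handling could work regardless of the user's setting (a sketch; value_as_compound is an illustrative name, not quanteda code):

## rewrite a whitespace-separated dictionary value using the tokens
## object's concatenator, so that "A B C" also matches a compounded "A_B_C"
value_as_compound <- function(value, concatenator = "_") {
  gsub(" ", concatenator, value, fixed = TRUE)
}

value_as_compound("United States of America")
## [1] "United_States_of_America"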

koheiw mentioned this issue Jul 12, 2017