
Reimplement features/sequence/keywords arguments more consistently #820

Merged
merged 74 commits into master from issue-787-ken on Jul 6, 2017

Conversation

@kbenoit
Collaborator

kbenoit commented Jun 27, 2017

Implements consistent handling of feature matching.

@kbenoit
Collaborator Author

kbenoit commented Jul 3, 2017

And @koheiw, please make sure to pull first before committing. I just had to revert a commit you made that was based on an older version from before we modified the textstat_collocations() function. The branch issue-787-ken now contains the recent changes that are in master.

@koheiw
Collaborator

koheiw commented Jul 4, 2017

I understand that phrase() splits a whitespace-delimited character vector into a list of character vectors, just like the current features2list(), but I don't think this is a good idea, because it does not work for dictionary objects as an input to tokens_lookup().

We need to allow users to decide in tokens_lookup() whether dictionary entries like "United States" and "Irish Sea" should be treated as unigrams (typically after tokens_compound() has been applied) or as ngrams.

To do this with phrase(), the code would look like the following:

dict <- dictionary(list('US' = list(
                            Countries = c("United States"),
                            oceans = c("Atlantic", "Pacific")),
                        'Europe' = list(
                            Countries = c("Britain", "Ireland"),
                            oceans = list(west = "Irish Sea", east = "English Channel"))))

tokens_lookup(tok, dictionary = phrase(dict), level = 2)

but this code does not work.

To keep the dictionary's hierarchical structure, phrase() should only add an attribute to the dictionary, based on which tokens_lookup() would split the entries internally.

phrase <- function(x) {
    # mark the object as containing multi-word phrases, to be split
    # internally by tokens_lookup()
    attr(x, 'phrase') <- TRUE
    return(x)
}

I also fear that the change to not splitting character strings on whitespace will cause confusion in dictionary() as well as kwic(). It is not hard to imagine users trying to make a dictionary in this (wrong) way:

dict <- dictionary(list('US' = list(
                            Countries = phrase("United States"),
                            oceans = c("Atlantic", "Pacific")),
                        'Europe' = list(
                            Countries = c("Britain", "Ireland"),
                            oceans = list(west = phrase("Irish Sea"), east = phrase("English Channel")))))

@koheiw
Collaborator

koheiw commented Jul 4, 2017

To my mind, there are three options at the moment, but none of them is perfect. We probably need to think harder.

Option 1

Stop splitting features on whitespace, and create phrase(), which only adds a phrase attribute to the input object. This solves the inconsistency between dfm_select() and tokens_select(), but introduces a new one between dfm_select()/tokens_select() on one side and dictionary()/kwic() on the other. This solution also comes with special objects that are used only as feature inputs.

Option 2

Add phrase as an argument to the relevant functions, set to TRUE by default. This is the most straightforward, but users would have to set phrase = FALSE to get the same results from dfm_select() and tokens_select().

Option 3

Create tokens_concatenate() or as.unigram(), and split features on whitespace only when the tokens are unigrams. dfm_select() and tokens_select() become consistent when the tokens are ngrams (i.e. the output of tokens_ngrams()), but users would always need to be aware of the type of their tokens objects. (All three options are sketched below.)
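
For concreteness, roughly how each option might look in user code (every argument and helper here is a sketch of the proposal, not existing API, and toks stands for any tokens object):

# Option 1: phrase() marks its input; nothing is split by default
tokens_select(toks, phrase("United States"))

# Option 2: a phrase argument on the relevant functions, TRUE by default
tokens_select(toks, "United States", phrase = FALSE)  # hypothetical argument

# Option 3: splitting happens only when the tokens are unigrams
toks_uni <- as.unigram(toks)              # hypothetical marker function
tokens_select(toks_uni, "United States")  # split, because tokens are unigrams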

@kbenoit
Collaborator Author

kbenoit commented Jul 4, 2017

I'm afraid I prefer Option 4: keep the current behaviour, which makes sense to me and is consistent. Using your example above and adding tok for testing:

This works fine for me, getting the multi-word matches as expected:

tok <- tokens("The United States is bordered by the Atlantic Ocean, not the Irish Sea.")
tokens_lookup(tok, dictionary = dict, level = 2)
# tokens from 1 document.
# Component 1 :
# [1] "Countries" "oceans"    "oceans"

The following fails, but should fail, since we should not allow a phrase-ified dictionary as an argument to a _lookup function. (We could issue a friendlier error message though.)

tokens_lookup(tok, dictionary = phrase(dict), level = 2)
# Error in tokens_lookup.tokens(tok, dictionary = phrase(dict), level = 2) : 
#  dictionary must be a dictionary object 
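
A minimal sketch of such a friendlier check, assuming phrase() marks its input with a 'phrase' attribute as proposed earlier (the attribute name is an assumption):

# sketch: inside tokens_lookup(), before any matching is done
if (isTRUE(attr(dictionary, "phrase")))
    stop("phrase() is not allowed in tokens_lookup(); ",
         "pass the dictionary itself instead")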

This behaviour is consistent because if we form a dfm, the dfm will not have multi-word features unless these were explicitly formed. So this should match:

dfm2 <- dfm(tokens_ngrams(tok, n = 2, concatenator = " "))
# Document-feature matrix of: 1 document, 14 features (0% sparse).
# 1 x 14 sparse Matrix of class "dfmSparse"
#        features
# docs    the united united states states is is bordered bordered by by the the atlantic atlantic ocean
#   text1          1             1         1           1           1      1            1              1
#        features
# docs    ocean , , not not the the irish irish sea sea .
#   text1       1     1       1         1         1     1
dfm_lookup(dfm2, dictionary = dict, level = 2)
# Document-feature matrix of: 1 document, 2 features (0% sparse).
# 1 x 2 sparse Matrix of class "dfmSparse"
#        features
# docs    Countries oceans
#   text1         1      1

and this should error, or just return no matches:

dfm_lookup(dfm(tok), dictionary = dict, level = 2)
# Error in dfm_lookup(dfm(tok), dictionary = dict, level = 2) : 
#  dfm_lookup not implemented for ngrams > 1 and multi-word dictionary values

If we change how phrase() works with dictionaries, then the following will not work, and this is the behaviour I consider consistent:

tokens_select(tok, dict)
# tokens from 1 document.
# Component 1 :
# [1] "Atlantic"

tokens_select(tok, phrase(dict))
# tokens from 1 document.
# Component 1 :
# [1] "United"   "States"   "Atlantic" "Irish"    "Sea"     

@koheiw
Collaborator

koheiw commented Jul 4, 2017

I agree that phrase() works with the *_select() functions, but my question is a bit different: how can users apply this dictionary to get the same result from the first and the second tokens objects below?

dict <- dictionary(list(US=list(
    Countries = c("United States"),
    oceans = c("Atlantic", "Pacific")),
    Europe=list(
        Countries = c("Britain", "Ireland"),
        oceans = list(west = "Irish Sea", east = "English Channel"))))

# tokens are all unigrams
toks <- tokens("The United States is bordered by the Atlantic Ocean, not the Irish Sea.")
tokens_lookup(toks, dictionary = dict)

# tokens contain bigrams (compounds)
toks2 <- tokens_compound(toks, list(c('United', 'States'), c('Atlantic', 'Ocean'), c('Irish', 'Sea')), 
                         concatenator = ' ')
tokens_lookup(toks2, dictionary = dict)

Anyway, the idea of "consistency" seems rather subjective, and a matter of personal preference. We should remember why there are so many user-interface settings in Mac, Windows, and Linux. While @kbenoit prefers this branch's current behaviour, I am most comfortable with master's behaviour. So I think we should add something like split_by_whitespace to quanteda_options(), and keep master's behaviour when it is TRUE (FALSE by default).
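
A sketch of how that could work, assuming split_by_whitespace were added to quanteda_options() (the option name is just the proposal above, not an existing option):

# sketch: restore master's behaviour globally
quanteda_options(split_by_whitespace = TRUE)

# inside each affected function, features would then be split conditionally
if (isTRUE(quanteda_options("split_by_whitespace")))
    features <- strsplit(features, " ")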

@kbenoit
Collaborator Author

kbenoit commented Jul 4, 2017

That's a very good point! We want a match for the token sequence "United", "States" but also want to match the compound "United States" (as it would have to work for a dfm feature). But we don't want to double-match. How's this for a workaround: we first compound the tokens using the "phrased" dictionary values, then perform the lookup. This works the same for both cases and will not double-count.

require(magrittr)
tokens_compound(toks, phrase(dict), concatenator = " ") %>%
    tokens_lookup(dictionary = dict)
# ---- should return matches for:
## tokens from 1 document.
## Component 1 :
## [1] "US.Countries"       "US.oceans"          "Europe.oceans.west"  

tokens_compound(toks2, phrase(dict), concatenator = " ") %>%
    tokens_lookup(dictionary = dict)
# ---- this will now be the same

(Note: I'm proposing to use this inside the tokens_lookup() function, not to require it every time from the user!)
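
Roughly, the internal wrapping might look like this (a sketch assuming phrase() and the whitespace concatenator behave as above, not the final implementation):

tokens_lookup_sketch <- function(x, dictionary, ...) {
    # compound multi-word dictionary values first, so that unigram and
    # pre-compounded tokens produce the same matches without double-counting
    x <- tokens_compound(x, phrase(dictionary), concatenator = " ")
    tokens_lookup(x, dictionary = dictionary, ...)
}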

On the idea of adding more options, I really want to avoid that, and strive to keep this as simple and as consistent as possible. My starting point is the dfm, and the idea that whitespace should not be privileged in a feature label. To make this consistent for tokens, the same has to apply to tokens. I have proposed to require this for kwic() patterns too, where it will be slightly less convenient but consistent (since the user will have to type phrase() around whitespace-separated patterns).

The only exception has to do with multi-word dictionary values: since a dictionary is already a list, and we don't want one-level dictionaries to require a two-level list, we allow multi-word matches to be separated by whitespace. In tokens_lookup() only, we allow these to match multi-word sequences.
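
In other words, under this exception a plain dictionary value containing whitespace would still match a token sequence in tokens_lookup(), with no phrase() wrapper needed (the example text here is made up):

dict_mw <- dictionary(list(country = "United States"))  # value contains whitespace
tokens_lookup(tokens("The United States is a country."), dict_mw)
# expected: one match for the key "country", from the sequence "United" "States"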

@koheiw
Collaborator

koheiw commented Jul 5, 2017

I do not think compounding is a good solution, because tokens_lookup() does not work with ngram tokens in the same way as with unigram tokens.

txt <- "The United States of America is bordered by the Atlantic Ocean and the Pacific Ocean."
toks <- tokens(txt)

comp <- list(c("United", "States", "of", "America"))
toks_uni <- tokens_compound(toks, comp, concatenator = ' ')

dict <- dictionary(list(country = c("United States of America", "China"),
                           region = c("America", "Asia")))

(out1 <- tokens_lookup(toks, dict))
(out2 <- tokens_lookup(toks_uni, dict, multiword = FALSE)) 
identical(out1, out2) # FALSE

The question here is how to switch the multiword argument (the following code is for the master branch):

# manually   -----------------------------------
tokens_lookup(toks, dict, multiword = TRUE)
tokens_lookup(toks_uni, dict, multiword = FALSE)

# based on attributes of dictionary -------------
tokens_lookup_new <- function(x, dictionary) {
    tokens_lookup(x, dictionary, multiword = is.null(attr(dictionary, 'phrase')))
}
phrase <- function(x) {
    attr(x, 'phrase') <- TRUE
    return(x)
}
tokens_lookup_new(toks, phrase(dict))
tokens_lookup_new(toks_uni, dict)

# based on attributes of tokens ----------------
tokens_lookup_new2 <- function(x, dictionary) {
    tokens_lookup(x, dictionary, multiword = max(attr(x, 'ngrams')) == 1)
}
attr(toks_uni, 'ngrams') <- 1:max(lengths(comp)) # compounded tokens are ngrams
tokens_lookup_new2(toks, dict)
tokens_lookup_new2(toks_uni, dict)

The manual approach is not bad, actually.

@kbenoit
Collaborator Author

kbenoit commented Jul 5, 2017

Ah, I'd forgotten about the multiword argument in tokens_lookup(). I'm suggesting we get rid of that now.

EDITED

In my view the behaviour should be:

# works if the compounding is done first
tokens_lookup(toks, dict)
# tokens from 1 document.
# Component 1 :
# [1] "country" "region"

# should generate only a match for US of A, not its fragment
tokens_lookup(toks_uni, dict)
# tokens from 1 document.
# Component 1 :
# [1] "country" 

# should generate an error, and maybe a message warning against sending
# phrase(dictionary) to a _lookup function
tokens_lookup(toks, phrase(dict))
tokens_lookup(toks_uni, phrase(dict))

So to sum up:

  1. We want to return matches for both country and region, since these are in separate keys. It does not matter that one is nested in the other.
  2. If the key were country = c("United States of America", "America"), then we would want to match just once, in other words not count twice a match that is nested within another match. (This happens in some of the LIWC definitions, where a wildcarded match also matches a fixed pattern; illustrated at the end of this comment.) If this is behaviour you would like to offer the user as an option, I'm open to that.
  3. phrase(any_dictionary) should not be sent to a _lookup function, and we should simply disallow it.

Are there any use cases I have missed? Not trying to restrict choice, just provide consistency.
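
To illustrate point 2 concretely (using the key from point 2; the example text is made up):

# a fixed value nested within a longer value of the same key
dict_nested <- dictionary(list(country = c("United States of America", "America")))
toks_nested <- tokens("The United States of America is a country.")
tokens_lookup(toks_nested, dict_nested)
# desired: a single match for "country"; the nested "America" match
# inside "United States of America" should not be counted twice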

@koheiw
Collaborator

koheiw commented Jul 5, 2017

We should allow users to choose whether overlapping values are counted or not. This is your package, but you cannot dictate how people use it in the name of 'consistency'.

@koheiw
Collaborator

koheiw commented Jul 5, 2017

I made the current tokens_lookup() to meet @BobMuenchen's request, using a special internal data structure. Please see #517.

@kbenoit
Collaborator Author

kbenoit commented Jul 5, 2017

I edited my comment above, since I got this totally wrong: I was confusing dfm_lookup with dfm_compound (easy to do, since so many functions are part of this discussion). See the edited version. I don't think we need multiword, and eliminating it will not restrict user options. But please make sure that you agree with (especially) my points 2 and 3 above (I think we agree on 1; I had this wrong earlier).

On tokens_compound and #517, I'm happy to keep the join argument, but that's a different issue. It's about how overlapping patterns match and, unless I am not seeing this right, not about how we count dictionary keys when matching dictionary values, if we follow my suggestions 1-3 above.

Merge branch 'master' into issue-787-ken

# Conflicts:
#	R/RcppExports.R
@koheiw
Collaborator

koheiw commented Jul 6, 2017

I don't think tokens_lookup() should return the same result from the two calls below (I also do not know how that could be done):

tokens_lookup(toks, dict)
tokens_lookup(toks_uni, dict)

Users should be aware that compounded tokens are totally different from the original tokens, and that they have to change their input values accordingly. It is absolutely natural for users to get different outputs from different inputs.

For the same reason, it is OK for these two functions to select different features, because the inputs are different (mt != toks):

dfm_select(mt, dict)
tokens_select(toks, dict)

However, it is confusing when these two select different features, because the inputs are identical:

dfm(toks, select = dict)
tokens_select(toks, features = dict)

I think 'consistency' matters only in the last case. My definition of consistency here is "users get the same results from the same inputs across functions". To address this kind of 'inconsistency', we should apply tokens_select() internally in dfm() (currently dfm_select() is used).
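
A sketch of that fix, assuming dfm() keeps a select argument (heavily simplified; the real dfm() does much more):

dfm_sketch <- function(x, select = NULL, ...) {
    # select on the tokens before building the matrix, so that
    # dfm(toks, select = dict) agrees with tokens_select(toks, dict)
    if (!is.null(select))
        x <- tokens_select(x, select)
    dfm(x, ...)
}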

@kbenoit
Collaborator Author

kbenoit commented Jul 6, 2017

Here's another problem: what if we have multi-word dictionary values, and sequences? Example:

(toks_uni2 <- as.tokens(list(d1 = c("The", "United States of America", "China", "is", "not", "a", "country"))))
## tokens from 1 document.
## d1 :
## [1] "The"                      "United States of America" "China"                   
## [4] "is"                       "not"                      "a"                       
## [7] "country"                 

tokens_lookup(toks_uni2, 
              dictionary = dictionary(list(country = c("China", "United States of America"))))
## tokens from 1 document.
## d1 :
## [1] "country"

# BUT
(toks_uni3 <- as.tokens(list(d1 = c("The", "United States of America", "Zimbabwe", "is", "not", "a", "country"))))
## tokens from 1 document.
## d1 :
## [1] "The"                      "United States of America" "Zimbabwe"                
## [4] "is"                       "not"                      "a"                       
## [7] "country"                 

tokens_lookup(toks_uni3, 
              dictionary = dictionary(list(country = c("Zimbabwe", "United States of America"))))
## tokens from 1 document.
## d1 :
## [1] "country"

Moved to #836.

@kbenoit
Collaborator Author

kbenoit commented Jul 6, 2017

BTW, regarding your comment above: I did not know what mt was.

@codecov

codecov bot commented Jul 6, 2017

Codecov Report

Merging #820 into master will decrease coverage by 0.34%.
The diff coverage is 90.9%.

@@            Coverage Diff             @@
##           master     #820      +/-   ##
==========================================
- Coverage   76.91%   76.57%   -0.35%     
==========================================
  Files         104      105       +1     
  Lines        7816     7828      +12     
==========================================
- Hits         6012     5994      -18     
- Misses       1804     1834      +30

@kbenoit kbenoit merged commit f69fc94 into master Jul 6, 2017
@kbenoit kbenoit deleted the issue-787-ken branch July 6, 2017 10:26