
Reimplement features/sequence/keywords arguments more consistently #820

Merged
merged 74 commits into master from issue-787-ken on Jul 6, 2017

Conversation

@kbenoit
Collaborator

kbenoit commented Jun 27, 2017

Implements consistent handling of feature matching.

@kbenoit
Collaborator Author

kbenoit commented Jul 3, 2017

And @koheiw, please make sure to pull first before committing. I just had to revert a commit you made that was based on an older version from before we modified the textstat_collocations() function. The branch issue-787-ken now contains the recent changes that are in master.

@koheiw
Collaborator

koheiw commented Jul 4, 2017

I understand that phrase() splits a whitespace-delimited character vector into a list of character vectors, just like the current features2list(), but I don't think this is a good idea, because it does not work for dictionary objects as an input to tokens_lookup().

We need to allow users to decide in tokens_lookup() whether dictionary entries like "United States" and "Irish Sea" should be treated as unigrams (typically after tokens_compound() has been applied) or as ngrams.

To do this with phrase(), the code would look like the following:

dict <- dictionary(list('US' = list(
                            Countries = c("United States"),
                            oceans = c("Atlantic", "Pacific")),
                        'Europe' = list(
                            Countries = c("Britain", "Ireland"),
                            oceans = list(west = "Irish Sea", east = "English Channel"))))

tokens_lookup(tok, dictionary = phrase(dict), level = 2)

but this code does not work.

To keep the dictionary's hierarchical structure, phrase() should only add an attribute to the dictionary, based on which tokens_lookup() would split the entries internally.

phrase <- function(x) {
    # mark the object as containing multi-word phrases, to be split
    # internally by tokens_lookup()
    attr(x, 'phrase') <- TRUE
    return(x)
}

I also fear that the change to not splitting character strings on whitespace will cause confusion in dictionary() as well as kwic(). It is not hard to imagine users trying to make a dictionary in this (wrong) way:

dict <- dictionary(list('US' = list(
                            Countries = phrase("United States"),
                            oceans = c("Atlantic", "Pacific")),
                        'Europe' = list(
                            Countries = c("Britain", "Ireland"),
                            oceans = list(west = phrase("Irish Sea"), east = phrase("English Channel")))))

@koheiw
Collaborator

koheiw commented Jul 4, 2017

To my mind, there are three options at the moment, but none of them is perfect. We probably need to think harder.

Option 1

Stop splitting features on whitespace, and create phrase(), which only adds a phrase attribute to the input object. This solves the inconsistency between dfm_select() and tokens_select(), but introduces a new one between dfm_select()/tokens_select() on one side and dictionary()/kwic() on the other. This solution also comes with special objects that are used only as feature inputs.

Option 2

Add phrase as an argument to the relevant functions, set to TRUE by default. This is the most straightforward, but users would have to set phrase = FALSE to get the same results from dfm_select() and tokens_select().

Option 3

Create tokens_concatenate() or as.unigram(), and split features on whitespace only when the tokens are unigrams. dfm_select() and tokens_select() become consistent when the tokens are ngrams (i.e. the output of tokens_ngrams()), but users would always need to be aware of the type of their tokens objects. (All three options are sketched below.)
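
For concreteness, roughly how each option might look in user code (every argument and helper here is a sketch of the proposal, not existing API, and toks stands for any tokens object):

# Option 1: phrase() marks its input; nothing is split by default
tokens_select(toks, phrase("United States"))

# Option 2: a phrase argument on the relevant functions, TRUE by default
tokens_select(toks, "United States", phrase = FALSE)  # hypothetical argument

# Option 3: splitting happens only when the tokens are unigrams
toks_uni <- as.unigram(toks)              # hypothetical marker function
tokens_select(toks_uni, "United States")  # split, because tokens are unigrams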

@kbenoit
Collaborator Author

kbenoit commented Jul 4, 2017

I'm afraid I prefer Option 4: keep the current behaviour, which makes sense to me and is consistent. Using your example above and adding tok for testing:

This works fine for me, getting the multi-word matches as expected:

tok <- tokens("The United States is bordered by the Atlantic Ocean, not the Irish Sea.")
tokens_lookup(tok, dictionary = dict, level = 2)
# tokens from 1 document.
# Component 1 :
# [1] "Countries" "oceans"    "oceans"

The following fails, but should fail, since we should not allow a phrase-ified dictionary as an argument to a _lookup function. (We could issue a friendlier error message though.)

tokens_lookup(tok, dictionary = phrase(dict), level = 2)
# Error in tokens_lookup.tokens(tok, dictionary = phrase(dict), level = 2) : 
#  dictionary must be a dictionary object 
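
A minimal sketch of such a friendlier check, assuming phrase() marks its input with a 'phrase' attribute as proposed earlier (the attribute name is an assumption):

# sketch: inside tokens_lookup(), before any matching is done
if (isTRUE(attr(dictionary, "phrase")))
    stop("phrase() is not allowed in tokens_lookup(); ",
         "pass the dictionary itself instead")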

This behaviour is consistent because if we form a dfm, the dfm will not have multi-word features unless these were explicitly formed. So this should match:

dfm2 <- dfm(tokens_ngrams(tok, n = 2, concatenator = " "))
# Document-feature matrix of: 1 document, 14 features (0% sparse).
# 1 x 14 sparse Matrix of class "dfmSparse"
#        features
# docs    the united united states states is is bordered bordered by by the the atlantic atlantic ocean
#   text1          1             1         1           1           1      1            1              1
#        features
# docs    ocean , , not not the the irish irish sea sea .
#   text1       1     1       1         1         1     1
dfm_lookup(dfm2, dictionary = dict, level = 2)
# Document-feature matrix of: 1 document, 2 features (0% sparse).
# 1 x 2 sparse Matrix of class "dfmSparse"
#        features
# docs    Countries oceans
#   text1         1      1

and this should error, or just return no matches:

dfm_lookup(dfm(tok), dictionary = dict, level = 2)
# Error in dfm_lookup(dfm(tok), dictionary = dict, level = 2) : 
#  dfm_lookup not implemented for ngrams > 1 and multi-word dictionary values

If we change how phrase() works with dictionaries, then the following will not work, and this is the behaviour I consider consistent:

tokens_select(tok, dict)
# tokens from 1 document.
# Component 1 :
# [1] "Atlantic"

tokens_select(tok, phrase(dict))
# tokens from 1 document.
# Component 1 :
# [1] "United"   "States"   "Atlantic" "Irish"    "Sea"     

@koheiw
Collaborator

koheiw commented Jul 4, 2017

I agree that phrase() works with the *_select() functions, but my question is a bit different: how can users apply this dictionary to get the same result from the first and the second tokens objects below?

dict <- dictionary(list(US=list(
    Countries = c("United States"),
    oceans = c("Atlantic", "Pacific")),
    Europe=list(
        Countries = c("Britain", "Ireland"),
        oceans = list(west = "Irish Sea", east = "English Channel"))))

# tokens are all unigrams
toks <- tokens("The United States is bordered by the Atlantic Ocean, not the Irish Sea.")
tokens_lookup(toks, dictionary = dict)

# tokens contain bigrams (compounds)
toks2 <- tokens_compound(toks, list(c('United', 'States'), c('Atlantic', 'Ocean'), c('Irish', 'Sea')), 
                         concatenator = ' ')
tokens_lookup(toks2, dictionary = dict)

Anyway, the idea of "consistency" seems rather subjective, and a matter of personal preference. We should remember why there are so many user-interface settings in Mac, Windows, and Linux. While @kbenoit prefers this branch's current behaviour, I am most comfortable with master's behaviour. So I think we should add something like split_by_whitespace to quanteda_options(), and keep master's behaviour when it is TRUE (FALSE by default).
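
A sketch of how that could work, assuming split_by_whitespace were added to quanteda_options() (the option name is just the proposal above, not an existing option):

# sketch: restore master's behaviour globally
quanteda_options(split_by_whitespace = TRUE)

# inside each affected function, features would then be split conditionally
if (isTRUE(quanteda_options("split_by_whitespace")))
    features <- strsplit(features, " ")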

@kbenoit
Collaborator Author

kbenoit commented Jul 4, 2017

That's a very good point! We want a match for the token sequence "United", "States" but also want to match the compound "United States" (as it would have to work for a dfm feature). But we don't want to double-match. How's this for a workaround: we first compound the tokens using the "phrased" dictionary values, then perform the lookup. This works the same for both cases and will not double-count.

require(magrittr)
tokens_compound(toks, phrase(dict), concatenator = " ") %>%
    tokens_lookup(dictionary = dict)
# ---- should return matches for:
## tokens from 1 document.
## Component 1 :
## [1] "US.Countries"       "US.oceans"          "Europe.oceans.west"  

tokens_compound(toks2, phrase(dict), concatenator = " ") %>%
    tokens_lookup(dictionary = dict)
# ---- this will now be the same

(Note: I'm proposing to use this inside the tokens_lookup() function, not to require it every time from the user!)
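
Roughly, the internal wrapping might look like this (a sketch assuming phrase() and the whitespace concatenator behave as above, not the final implementation):

tokens_lookup_sketch <- function(x, dictionary, ...) {
    # compound multi-word dictionary values first, so that unigram and
    # pre-compounded tokens produce the same matches without double-counting
    x <- tokens_compound(x, phrase(dictionary), concatenator = " ")
    tokens_lookup(x, dictionary = dictionary, ...)
}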

On the idea of adding more options, I really want to avoid that, and strive to keep this as simple and as consistent as possible. My starting point is the dfm, and the idea that whitespace should not be privileged in a feature label. To make this consistent for tokens, the same has to apply to tokens. I have proposed to require this for kwic() patterns too, where it will be slightly less convenient but consistent (since the user will have to type phrase() around whitespace-separated patterns).

The only exception has to do with multi-word dictionary values: since a dictionary is already a list, and we don't want one-level dictionaries to require a two-level list, we allow multi-word matches to be separated by whitespace. In tokens_lookup() only, we allow these to match multi-word sequences.
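
In other words, under this exception a plain dictionary value containing whitespace would still match a token sequence in tokens_lookup(), with no phrase() wrapper needed (the example text here is made up):

dict_mw <- dictionary(list(country = "United States"))  # value contains whitespace
tokens_lookup(tokens("The United States is a country."), dict_mw)
# expected: one match for the key "country", from the sequence "United" "States"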

@koheiw
Collaborator

koheiw commented Jul 5, 2017

I do not think compounding is a good solution, because tokens_lookup() does not work with ngram tokens in the same way as with unigram tokens.

txt <- "The United States of America is bordered by the Atlantic Ocean and the Pacific Ocean."
toks <- tokens(txt)

comp <- list(c("United", "States", "of", "America"))
toks_uni <- tokens_compound(toks, comp, concatenator = ' ')

dict <- dictionary(list(country = c("United States of America", "China"),
                           region = c("America", "Asia")))

(out1 <- tokens_lookup(toks, dict))
(out2 <- tokens_lookup(toks_uni, dict, multiword = FALSE)) 
identical(out1, out2) # FALSE

The question here is how to switch the multiword argument (the following code is for the master branch):

# manually   -----------------------------------
tokens_lookup(toks, dict, multiword = TRUE)
tokens_lookup(toks_uni, dict, multiword = FALSE)

# based on attributes of dictionary -------------
tokens_lookup_new <- function(x, dictionary) {
    tokens_lookup(x, dictionary, multiword = is.null(attr(dictionary, 'phrase')))
}
phrase <- function(x) {
    attr(x, 'phrase') <- TRUE
    return(x)
}
tokens_lookup_new(toks, phrase(dict))
tokens_lookup_new(toks_uni, dict)

# based on attributes of tokens ----------------
tokens_lookup_new2 <- function(x, dictionary) {
    tokens_lookup(x, dictionary, multiword = max(attr(x, 'ngrams')) == 1)
}
attr(toks_uni, 'ngrams') <- 1:max(lengths(comp)) # compounded tokens are ngrams
tokens_lookup_new2(toks, dict)
tokens_lookup_new2(toks_uni, dict)

The manual approach is not bad, actually.

@kbenoit
Collaborator Author

kbenoit commented Jul 5, 2017

Ah, I'd forgotten about the multiword argument in tokens_lookup(). I'm suggesting we get rid of that now.

EDITED

In my view the behaviour should be:

# works if the compounding is done first
tokens_lookup(toks, dict)
# tokens from 1 document.
# Component 1 :
# [1] "country" "region"

# should generate only a match for US of A, not its fragment
tokens_lookup(toks_uni, dict)
# tokens from 1 document.
# Component 1 :
# [1] "country" 

# should generate an error, and maybe a message warning against sending
# phrase(dictionary) to a _lookup function
tokens_lookup(toks, phrase(dict))
tokens_lookup(toks_uni, phrase(dict))

So to sum up:

  1. We want to return matches for both country and region, since these are in separate keys. It does not matter that one is nested in the other.
  2. If the key were country = c("United States of America", "America"), then we would want to match just once, in other words not count twice a match that is nested within another match. (This happens in some of the LIWC definitions, where a wildcarded match also matches a fixed pattern; illustrated at the end of this comment.) If this is behaviour you would like to offer the user as an option, I'm open to that.
  3. phrase(any_dictionary) should not be sent to a _lookup function, and we should simply disallow it.

Are there any use cases I have missed? Not trying to restrict choice, just provide consistency.
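
To illustrate point 2 concretely (using the key from point 2; the example text is made up):

# a fixed value nested within a longer value of the same key
dict_nested <- dictionary(list(country = c("United States of America", "America")))
toks_nested <- tokens("The United States of America is a country.")
tokens_lookup(toks_nested, dict_nested)
# desired: a single match for "country"; the nested "America" match
# inside "United States of America" should not be counted twice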

@koheiw
Collaborator

koheiw commented Jul 5, 2017

We should allow users to choose whether overlapping values are counted or not. This is your package, but you cannot dictate how people use it in the name of 'consistency'.

@koheiw
Collaborator

koheiw commented Jul 5, 2017

I made the current tokens_lookup() to meet @BobMuenchen's request, using a special internal data structure. Please see #517.

@kbenoit
Collaborator Author

kbenoit commented Jul 5, 2017

I edited my comment above, since I got this totally wrong: I was confusing dfm_lookup with dfm_compound (easy to do, since so many functions are part of this discussion). See the edited version. I don't think we need multiword, and eliminating it will not restrict user options. But please make sure that you agree with (especially) my points 2 and 3 above (I think we agree on 1; I had this wrong earlier).

On tokens_compound and #517, I'm happy to keep the join argument, but that's a different issue. It's about how overlapping patterns match and, unless I am not seeing this right, not about how we count dictionary keys when matching dictionary values, if we follow my suggestions 1-3 above.

Merge branch 'master' into issue-787-ken

# Conflicts:
#	R/RcppExports.R
@koheiw
Collaborator

koheiw commented Jul 6, 2017

I don't think tokens_lookup() should return the same result from the two calls below (I also do not know how that could be done):

tokens_lookup(toks, dict)
tokens_lookup(toks_uni, dict)

Users should be aware that compounded tokens are totally different from the original tokens, and that they have to change their input values accordingly. It is absolutely natural for users to get different outputs from different inputs.

For the same reason, it is OK for these two functions to select different features, because the inputs are different (mt != toks):

dfm_select(mt, dict)
tokens_select(toks, dict)

However, it is confusing when these two select different features, because the inputs are identical:

dfm(toks, select = dict)
tokens_select(toks, features = dict)

I think 'consistency' matters only in the last case. My definition of consistency here is "users get the same results from the same inputs across functions". To address this kind of 'inconsistency', we should apply tokens_select() internally in dfm() (currently dfm_select() is used).
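
A sketch of that fix, assuming dfm() keeps a select argument (heavily simplified; the real dfm() does much more):

dfm_sketch <- function(x, select = NULL, ...) {
    # select on the tokens before building the matrix, so that
    # dfm(toks, select = dict) agrees with tokens_select(toks, dict)
    if (!is.null(select))
        x <- tokens_select(x, select)
    dfm(x, ...)
}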

@kbenoit
Collaborator Author

kbenoit commented Jul 6, 2017

Here's another problem: what if we have multi-word dictionary values, and sequences? Example:

(toks_uni2 <- as.tokens(list(d1 = c("The", "United States of America", "China", "is", "not", "a", "country"))))
## tokens from 1 document.
## d1 :
## [1] "The"                      "United States of America" "China"                   
## [4] "is"                       "not"                      "a"                       
## [7] "country"                 

tokens_lookup(toks_uni2, 
              dictionary = dictionary(list(country = c("China", "United States of America"))))
## tokens from 1 document.
## d1 :
## [1] "country"

# BUT
(toks_uni3 <- as.tokens(list(d1 = c("The", "United States of America", "Zimbabwe", "is", "not", "a", "country"))))
## tokens from 1 document.
## d1 :
## [1] "The"                      "United States of America" "Zimbabwe"                
## [4] "is"                       "not"                      "a"                       
## [7] "country"                 

tokens_lookup(toks_uni3, 
              dictionary = dictionary(list(country = c("Zimbabwe", "United States of America"))))
## tokens from 1 document.
## d1 :
## [1] "country"

Moved to #836.

@kbenoit
Collaborator Author

kbenoit commented Jul 6, 2017

BTW, regarding your comment above: I did not know what mt was.

@codecov

codecov bot commented Jul 6, 2017

Codecov Report

Merging #820 into master will decrease coverage by 0.34%.
The diff coverage is 90.9%.

@@            Coverage Diff             @@
##           master     #820      +/-   ##
==========================================
- Coverage   76.91%   76.57%   -0.35%     
==========================================
  Files         104      105       +1     
  Lines        7816     7828      +12     
==========================================
- Hits         6012     5994      -18     
- Misses       1804     1834      +30

@kbenoit kbenoit merged commit f69fc94 into master Jul 6, 2017
@kbenoit kbenoit deleted the issue-787-ken branch July 6, 2017 10:26