Reimplement features/sequence/keywords arguments more consistently #820
Conversation
Some time ago, this was renamed to featnames()
Merge branch 'master' into issue-787 # Conflicts: # R/RcppExports.R
- add tokens_compound() and a consideration of multi-word sequences.
Remove warning from featuresvector
And @koheiw, please make sure to pull first before committing - I just had to revert a commit you made that was based on an older version before we modified the …
I understand that we need to allow users to decide whether dictionary entries like "United States" or "Irish Sea" are matched as unigrams (typically after tokens_compound() is applied) or as ngrams. To do this with:

```r
dict <- dictionary(list('US' = list(
                          Countries = c("United States"),
                          oceans = c("Atlantic", "Pacific")),
                        'Europe' = list(
                          Countries = c("Britain", "Ireland"),
                          oceans = list(west = "Irish Sea", east = "English Channel"))))
tokens_lookup(tok, dictionary = phrase(dict), level = 2)
```

but this code does not work. To keep the dictionary's hierarchical structure, phrase() would have to be something like:

```r
phrase <- function(x) {
  attr(x, 'phrase') <- TRUE
  return(x)
}
```

I also fear that a change to not splitting character strings on whitespace would cause confusion in:

```r
dict <- dictionary(list('US' = list(
                          Countries = phrase("United States"),
                          oceans = c("Atlantic", "Pacific")),
                        'Europe' = list(
                          Countries = c("Britain", "Ireland"),
                          oceans = list(west = phrase("Irish Sea"), east = phrase("English Channel")))))
```
To my mind, there are three options at the moment, but none of them is perfect. We probably need to think harder.

- Option 1: stop splitting features on whitespace, and create …
- Option 2: add …
- Option 3: create …
I'm afraid I prefer Option 4: keep the current behaviour, which makes sense to me and is consistent. Using your above example and adding a test text, this works fine for me, getting the multi-word matches as expected:

```r
tok <- tokens("The United States is bordered by the Atlantic Ocean, not the Irish Sea.")
tokens_lookup(tok, dictionary = dict, level = 2)
# tokens from 1 document.
# Component 1 :
# [1] "Countries" "oceans"    "oceans"
```

The following fails, but should fail, since we should not allow a phrase-ified dictionary as an argument to a _lookup function:

```r
tokens_lookup(tok, dictionary = phrase(dict), level = 2)
# Error in tokens_lookup.tokens(tok, dictionary = phrase(dict), level = 2) :
#   dictionary must be a dictionary object
```

This behaviour is consistent because if we form a dfm, the dfm will not have multi-word features unless these were explicitly formed. So this should match:

```r
dfm2 <- dfm(tokens_ngrams(tok, n = 2, concatenator = " "))
# Document-feature matrix of: 1 document, 14 features (0% sparse).
# 1 x 14 sparse Matrix of class "dfmSparse"
#        features
# docs    the united united states states is is bordered bordered by by the the atlantic atlantic ocean
#   text1        1             1         1           1           1      1            1              1
#        features
# docs    ocean , , not not the the irish irish sea sea .
#   text1       1     1       1         1         1     1
dfm_lookup(dfm2, dictionary = dict, level = 2)
# Document-feature matrix of: 1 document, 2 features (0% sparse).
# 1 x 2 sparse Matrix of class "dfmSparse"
#        features
# docs    Countries oceans
#   text1         1      1
```

and this should error, or just return no matches:

```r
dfm_lookup(dfm(tok), dictionary = dict, level = 2)
# Error in dfm_lookup(dfm(tok), dictionary = dict, level = 2) :
#   dfm_lookup not implemented for ngrams > 1 and multi-word dictionary values
```

If we change how phrase() works with dictionaries, then this will not work, and this is what I think is consistent:

```r
tokens_select(tok, dict)
# tokens from 1 document.
# Component 1 :
# [1] "Atlantic"
tokens_select(tok, phrase(dict))
# tokens from 1 document.
# Component 1 :
# [1] "United"   "States"   "Atlantic" "Irish"    "Sea"
```
I agree that …

```r
dict <- dictionary(list(US = list(
                          Countries = c("United States"),
                          oceans = c("Atlantic", "Pacific")),
                        Europe = list(
                          Countries = c("Britain", "Ireland"),
                          oceans = list(west = "Irish Sea", east = "English Channel"))))
# tokens are all unigrams
toks <- tokens("The United States is bordered by the Atlantic Ocean, not the Irish Sea.")
tokens_lookup(toks, dictionary = dict)
# tokens have bigrams
toks2 <- tokens_compound(toks, list(c('United', 'States'), c('Atlantic', 'Ocean'), c('Irish', 'Sea')),
                         concatenator = ' ')
tokens_lookup(toks2, dictionary = dict)
```

Anyway, the idea of "consistency" seems rather subjective, and a matter of personal preference. We have to remember why there are so many user-interface settings in Mac, Windows and Linux. While @kbenoit prefers this branch's current behavior, I am most comfortable with master's behavior. So I think we should add something like …
That's a very good point! We want a match for the token sequence …

```r
require(magrittr)
tokens_compound(toks, phrase(dict), concatenator = " ") %>%
    tokens_lookup(dictionary = dict)
# ---- should return matches for:
## tokens from 1 document.
## Component 1 :
## [1] "US.Countries"       "US.oceans"          "Europe.oceans.west"
tokens_compound(toks2, phrase(dict), concatenator = " ") %>%
    tokens_lookup(dictionary = dict)
# ---- this will now be the same
```

(Note: I'm proposing to use this inside the … .)

On the idea of adding more options, I really want to avoid that, and strive to keep this as simple and as consistent as possible. My starting point is the dfm, and the idea that whitespace should not be privileged in a feature label. To make this consistent for tokens, the same has to apply to tokens. I have proposed to require this for …

The only exception has to do with multi-word dictionary values: since a dictionary is already a list, and we don't want one-level dictionaries to require a two-level list, we allow multi-word matches to be separated by whitespace. When applied to …
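Both sides of this exchange hinge on what splitting a dictionary value on whitespace actually does. As a language-agnostic sketch (Python here, not quanteda's implementation; `as_phrases` is a made-up name), treating each whitespace-separated value as a sequence of patterns amounts to:

```python
def as_phrases(dictionary):
    """Split each whitespace-separated dictionary value into a list of
    tokens, so "Irish Sea" becomes the two-token sequence ["Irish", "Sea"]."""
    return {key: [value.split() for value in values]
            for key, values in dictionary.items()}

dict_ = {"Countries": ["United States", "Britain"],
         "oceans": ["Atlantic", "Irish Sea"]}
print(as_phrases(dict_))
# {'Countries': [['United', 'States'], ['Britain']],
#  'oceans': [['Atlantic'], ['Irish', 'Sea']]}
```

A single-word value simply becomes a length-one sequence, which is why the unigram and multi-word cases can share one matching code path.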
I do not think compounding is a good solution, because …

```r
txt <- "The United States of America is bordered by the Atlantic Ocean and the Pacific Ocean."
toks <- tokens(txt)
comp <- list(c("United", "States", "of", "America"))
toks_uni <- tokens_compound(toks, comp, concatenator = ' ')
dict <- dictionary(list(country = c("United States of America", "China"),
                        region = c("America", "Asia")))
(out1 <- tokens_lookup(toks, dict))
(out2 <- tokens_lookup(toks_uni, dict, multiword = FALSE))
identical(out1, out2)  # FALSE
```

The question here is how to switch …

```r
# manually -----------------------------------
tokens_lookup(toks, dict, multiword = TRUE)
tokens_lookup(toks_uni, dict, multiword = FALSE)

# based on attributes of dictionary -------------
tokens_lookup_new <- function(x, dictionary) {
    tokens_lookup(x, dictionary, multiword = is.null(attr(dictionary, 'phrase')))
}
phrase <- function(x) {
    attr(x, 'phrase') <- TRUE
    return(x)
}
tokens_lookup_new(toks, phrase(dict))
tokens_lookup_new(toks_uni, dict)

# based on attributes of tokens ----------------
tokens_lookup_new2 <- function(x, dictionary) {
    tokens_lookup(x, dictionary, multiword = max(attr(x, 'ngrams')) == 1)
}
attr(toks_uni, 'ngrams') <- 1:max(lengths(comp))  # compounded tokens are ngrams
tokens_lookup_new2(toks, dict)
tokens_lookup_new2(toks_uni, dict)
```

The manual approach is not bad, actually.
Ah, I'd forgotten about the …

EDITED: In my view the behaviour should be:

```r
# works if the compounding is done first
tokens_lookup(toks, dict)
# tokens from 1 document.
# Component 1 :
# [1] "country" "region"

# should generate only a match for US of A, not its fragment
tokens_lookup(toks_uni, dict)
# tokens from 1 document.
# Component 1 :
# [1] "country"

# should generate an error, and maybe a message warning against sending
# phrase(dictionary) to a _lookup function
tokens_lookup(toks, phrase(dict))
tokens_lookup(toks_uni, phrase(dict))
```

So to sum up: …

Are there any use cases I have missed? Not trying to restrict choice, just provide consistency.
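The intended semantics can be modelled outside of R. This is a toy Python sketch, not quanteda code: each dictionary value matches either a run of consecutive unigram tokens or one pre-compounded token exactly, so a fragment like "America" does not match inside the compounded token "United States of America".

```python
def lookup(tokens, dictionary):
    """Toy model of the intended lookup semantics: each dictionary value
    matches either one pre-compounded token exactly, or a run of
    consecutive unigram tokens."""
    keys = []
    for key, values in dictionary.items():
        for value in values:
            if value in tokens:          # exact match on a compounded token
                keys.append(key)
                continue
            seq = value.split()          # sequence match on unigram tokens
            if any(tokens[i:i + len(seq)] == seq
                   for i in range(len(tokens) - len(seq) + 1)):
                keys.append(key)
    return keys

dict_ = {"country": ["United States of America", "China"],
         "region": ["America", "Asia"]}
toks = "The United States of America is bordered by the Atlantic Ocean".split()
toks_uni = ["The", "United States of America", "is", "bordered",
            "by", "the", "Atlantic", "Ocean"]
print(lookup(toks, dict_))      # ['country', 'region']
print(lookup(toks_uni, dict_))  # ['country']
```

The two calls reproduce the unigram-vs-compounded difference discussed above: on unigram tokens both "country" and "region" match, while on compounded tokens only the whole value "United States of America" matches.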
We should allow users to choose whether overlapped values are counted or not. This is your package, but you cannot dictate how people use the package in the name of 'consistency'.
I made the current …
I edited my comment above, since I got this totally wrong - I was confusing …

On tokens_compound() and #517, I'm happy to keep the …
Merge branch 'master' into issue-787-ken # Conflicts: # R/RcppExports.R
I don't think …

```r
tokens_lookup(toks, dict)
tokens_lookup(toks_uni, dict)
```

Users should be aware that compounded tokens are totally different from the original tokens, and that they have to change their input values accordingly. It is absolutely natural for users to get different outputs from different inputs. For the same reason, it is OK for those two functions to select different features, because the inputs are different (mt != toks):

```r
dfm_select(mt, dict)
tokens_select(toks, dict)
```

However, it is confusing when those two select different features, because the inputs are identical:

```r
dfm(toks, select = dict)
tokens_select(toks, features = dict)
```

I think 'consistency' matters only in the last case. My definition of consistency here is "users get the same results from the same inputs across functions". To address this kind of 'inconsistency', we should apply …
Here's another problem: what if we have multi-word dictionary values, and sequences? Example:

```r
(toks_uni2 <- as.tokens(list(d1 = c("The", "United States of America", "China", "is", "not", "a", "country"))))
## tokens from 1 document.
## d1 :
## [1] "The"                      "United States of America" "China"
## [4] "is"                       "not"                      "a"
## [7] "country"
tokens_lookup(toks_uni2,
              dictionary = dictionary(country = c("China", "United States of America")))
## tokens from 1 document.
## d1 :
## [1] "country"

# BUT
(toks_uni3 <- as.tokens(list(d1 = c("The", "United States of America", "Zimbabwe", "is", "not", "a", "country"))))
## tokens from 1 document.
## d1 :
## [1] "The"                      "United States of America" "Zimbabwe"
## [4] "is"                       "not"                      "a"
## [7] "country"
tokens_lookup(toks_uni3,
              dictionary = dictionary(country = c("Zimbabwe", "United States of America")))
## tokens from 1 document.
## d1 :
## [1] "country"
```

Moved to #836.
BTW, on your comment above, I did not know what …
Codecov Report
```
@@            Coverage Diff             @@
##           master     #820      +/-   ##
==========================================
- Coverage   76.91%   76.57%   -0.35%
==========================================
  Files         104      105       +1
  Lines        7816     7828      +12
==========================================
- Hits         6012     5994      -18
- Misses       1804     1834      +30
```
Implements consistent handling for feature matching:

- Replaces the `features`, `keywords`, and `sequences` arguments with `pattern`
- Adds a `phrase()` wrapper that treats multi-word characters as sequences of patterns
- (issue-787)