tokens_compound should not compound patterns that nest inside one another #837
I'm risking undoing the careful scheme I'd outlined, but... what behaviour should we expect for a collocations object as an argument to tokens_compound()?

cols <- textstat_collocations("capital gains taxes are worse than inheritance taxes",
                              size = 2, min_count = 1)
toks <- tokens("The new law included capital gains taxes and inheritance taxes.")
tokens_compound(toks, cols)
# Error in sequences$collocation :
#   object of type 'closure' is not subsettable
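A workaround consistent with the rest of this thread (a sketch, assuming the collocations sit in the collocation column of the data frame returned by textstat_collocations(), and that phrase() is available) is to pass the collocation strings explicitly as phrases:

# pass the collocation strings, not the whole collocations object
tokens_compound(toks, phrase(cols$collocation))
# expected: the matched bigrams (e.g. "capital_gains", "inheritance_taxes")
# compounded in toks, rather than the error above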
---

It is OK to treat a collocations object as phrases (it works this way in master). If someone wants to use the object in a different way, he or she can extract the collocation column. However, handling dictionary objects is much more difficult. We have to provide an interface that works for any combination of a dictionary and a tokens object. Given

dict1 <- dictionary(key1 = 'a b', key2 = 'c d', concatenator = ' ')
dict2 <- dictionary(key1 = 'a b', key2 = 'c d', concatenator = '_')
dict3 <- dictionary(key1 = 'a_b', key2 = 'c_d', concatenator = ' ')
dict4 <- dictionary(key1 = 'a_b', key2 = 'c_d', concatenator = '_')

toks1 <- as.tokens(list(c('a', 'b', 'c d', 'e f', 'g', 'e')))
toks2 <- as.tokens(list(c('a', 'b', 'c_d', 'e_f', 'g', 'e')))

it should be able to produce all of those:

# using tokens_compound()
# with toks1
toks1_comp1 <- as.tokens(list(c('a b', 'c d', 'e f', 'g', 'e')))
toks1_comp2 <- as.tokens(list(c('a_b', 'c d', 'e f', 'g', 'e')))
# with toks2
toks2_comp3 <- as.tokens(list(c('a b', 'c_d', 'e f', 'g', 'e')))
toks2_comp4 <- as.tokens(list(c('a_b', 'c_d', 'e f', 'g', 'e')))

# using tokens_lookup()
# with toks1
toks1_look1 <- as.tokens(list(c('key1', 'key2')))
toks1_look2 <- as.tokens(list(c('key1')))
toks1_look3 <- as.tokens(list(c('key2')))
# with toks2
toks2_look1 <- as.tokens(list(c('key1', 'key2')))
toks2_look2 <- as.tokens(list(c('key1')))
toks2_look3 <- as.tokens(list(c('key2')))

The simplest way is [...]
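To make one of these mappings concrete (a sketch of the intended behaviour, not the elided proposal): in toks1 the only multi-word match for dict1's values is the sequence ('a', 'b'), since 'c d' is a single token there, so compounding with the default "_" connector should give toks1_comp2:

tokens_compound(toks1, phrase(dict1))
# expected: c('a_b', 'c d', 'e f', 'g', 'e')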
---

Good point. Here's an alternative: get rid of [...]. Can you see any advantage to retaining a [...]?
---

Maybe what you meant by your option 1 above was exactly what I've suggested: just make a space the intra-pattern connector in dictionary values, always? And if we import a format that uses underscores, we just convert them to spaces?
---

We should keep the [...]. The second point is that, if we normalize dictionary values in that way, we need a function to change white space to other characters ("_", "+", "-") for [...].
---

OK, so on the first point, the [...]. On the second point, this could happen if we are looking up multi-word tokens created using tokens_ngrams(), for instance. In that case, I think that since all dictionaries will only have white space separating the patterns in their multi-word values, we only need an internal function to convert the white space into the concatenator.
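A minimal sketch of what such an internal converter might look like (replace_separator() is a hypothetical name, not an existing quanteda function): it rewrites the white-space separators inside multi-word dictionary values to whatever concatenator the tokens use.

replace_separator <- function(values, concatenator = "_") {
    # collapse any run of white space inside each value into the concatenator
    vapply(values, function(v) gsub("\\s+", concatenator, v),
           character(1), USE.NAMES = FALSE)
}

replace_separator(c("low skilled", "under paid"), "-")
# [1] "low-skilled" "under-paid"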
---

Yes, the first point is exactly what I meant. As for the second point, I am not entirely sure that the concatenator can be determined automatically (from an attribute of the tokens object), because tokens may not always be concatenated by quanteda's functions. For example, users have to choose '-' manually here:

dict <- dictionary(immig = c('low skilled', 'under paid'))
toks <- tokens("low-skilled workers are usualy under-paid in the UK", what = 'fastestword')
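With what = 'fastestword', the tokenizer splits on white space only, so the hyphenated forms stay single tokens (a sketch; the exact print method varies by quanteda version):

toks
# [1] "low-skilled" "workers" "are" "usualy" "under-paid" "in" "the" "UK"

# the values "low skilled" / "under paid" can therefore only match if their
# internal white space is rewritten with the '-' concatenator, for example
# (dict_hyphen is an illustrative object, not from the thread):
dict_hyphen <- dictionary(immig = c('low-skilled', 'under-paid'))
tokens_lookup(toks, dict_hyphen)
# expected: "immig" "immig"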
---

OK, I'll modify the code to remove [...].

On the "low-skilled" example, that's an interesting and tricky case, since if we specified the hyphen as the concatenator, we would miss [...]. With remove_hyphens = TRUE, however, this works:

dict <- dictionary(immig = c('low skilled', 'under paid'))
toks <- tokens("low-skilled workers are usualy under-paid in the UK",
               remove_hyphens = TRUE)
toks
# tokens from 1 document.
# Component 1 :
# [1] "low" "skilled" "workers" "are" "usualy" "under" "paid" "in" "the"
# [10] "UK"
tokens_compound(toks, phrase(dict))
# tokens from 1 document.
# Component 1 :
# [1] "low_skilled" "workers" "are" "usualy" "under_paid" "in"
# [7] "the" "UK"
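For contrast, the hyphen-as-concatenator route alluded to above would look like this (a sketch; concatenator is a real tokens_compound() argument, but the output shown is only the expected result):

tokens_compound(toks, phrase(dict), concatenator = "-")
# expected:
# [1] "low-skilled" "workers" "are" "usualy" "under-paid" "in" "the" "UK"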
---

I can imagine some people doing tokenization with some kind of POS tagger and not wanting to modify the tokens at all. quanteda has to be ready for them.
---

Agreed. This does not work either:

(toks <- tokens("low-skilled workers are low skilled"))
# tokens from 1 document.
# Component 1 :
# [1] "low" "-" "skilled" "workers" "are" "low" "skilled"

seqs <- c("low [-]{0,1} skilled")
tokens_compound(toks, phrase(seqs), valuetype = "regex")
# tokens from 1 document.
# Component 1 :
# [1] "low_-_skilled" "workers" "are" "low" "skilled"

dict <- dictionary("low-skilled" = "low * skilled")
tokens_lookup(toks, dict, exclusive = FALSE, capkeys = FALSE)
# tokens from 1 document.
# Component 1 :
# [1] "low-skilled" "workers" "are" "low" "skilled"

The [...]
---

Implement new phrase behaviour for tokens_compound (#837)

If the pattern is c("a b", "a b c"), and the tokens are "a", "b", "c", then the compounded version should be just "a_b_c", not "a_b_c", "a_b". Compounding should never increase the number of tokens, just join them.

This is a separate issue from joining overlapping sequences, determined by join = TRUE, which is an argument we want to keep. This provides a slight exception to the "never increase the total tokens" behaviour, but this is the only way to do this correctly, since otherwise we have an indeterminate set of compounds to form, and their order will affect what is formed. In the following example (reconstructed below), for instance, "c" gets counted twice (in two compounds), but it needs to be, otherwise we would have "b_c" or "c_d" but not both, and there is no deterministic rule for choosing which.
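A reconstruction of that example (a sketch under stated assumptions; the exact printed output depends on the quanteda version):

toks <- as.tokens(list(c("a", "b", "c", "d", "e")))
# the patterns "b c" and "c d" overlap on the token "c"
tokens_compound(toks, phrase(c("b c", "c d")), join = FALSE)
# expected: "a" "b_c" "c_d" "e"   ("c" appears in both compounds)
tokens_compound(toks, phrase(c("b c", "c d")), join = TRUE)
# expected: "a" "b_c_d" "e"       (overlapping sequences joined into one)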
This points to an argument for making join = TRUE the default, since this is the behaviour most people will expect.

(Once PR #820 is merged), these tests need to pass by removing the skip_ functions from test-tokens_compound.R: