tokens_compound should not compound patterns that nest inside one another #837

Closed
kbenoit opened this issue Jul 6, 2017 · 11 comments


kbenoit commented Jul 6, 2017

If the pattern is c("a b", "a b c"), and the tokens are "a", "b", "c", then the compounded version should be just "a_b_c", not "a_b_c", "a_b". Compounding should never increase the number of tokens, just join them.

This is a separate issue from joining overlapping sequences, determined by join = TRUE, which is an argument we want to keep. Overlapping sequences provide a slight exception to the "never increase the total tokens" behaviour, but this is the only way to handle them correctly, since otherwise we have an indeterminate set of compounds to form, and their order would affect what is formed. In the following example, for instance, "c" gets counted twice (in two compounds), but it needs to be; otherwise we would have "b_c" or "c_d" but not both, and there is no deterministic rule for choosing which.

tokens_compound(tokens("a b c d e"), 
                list(c("b", "c"), c("c", "d")),
                join = FALSE)
## tokens from 1 document.
## Component 1 :
## [1] "a"   "b_c" "c_d" "e" 

This points to an argument for making join = TRUE the default, since this is the behaviour most people will expect.
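
A sketch (not run) of what the same call should produce with join = TRUE, given that overlapping sequences are merged:

tokens_compound(tokens("a b c d e"), 
                list(c("b", "c"), c("c", "d")),
                join = TRUE)
## tokens from 1 document.
## Component 1 :
## [1] "a"     "b_c_d" "e"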

Once PR #820 is merged, these tests will need to pass after removing the skip_ functions from test-tokens_compound.R:

test_that("tokens_compound works as expected with nested tokens", {
    skip_on_appveyor()
    skip_on_travis()
    expect_equal(
        as.character(tokens_compound(tokens("a b c d"), phrase(c("a b", "a b c")), 
                     join = FALSE)),
        c("a_b_c", "d")
    )
    expect_equal(
        as.character(tokens_compound(tokens("a b c d"), phrase(c("a b", "a b c")), 
                     join = TRUE)),
        c("a_b_c", "d")
    )
})

test_that("tokens_compound works as expected with nested and overlapping tokens", {
    skip_on_appveyor()
    skip_on_travis()
    expect_equal(
        as.character(tokens_compound(tokens("a b c d e"), 
                                     phrase(c("a b", "a b c", "c d")),
                                     join = FALSE)),
        c("a_b_c", "c_d", "e")
    )
    expect_equal(
        as.character(tokens_compound(tokens("a b c d e"), 
                                     phrase(c("a b", "a b c", "c d")),
                                     join = TRUE)),
        c("a_b_c_d", "e")
    )
})

kbenoit commented Jul 11, 2017

I'm risking undoing the careful scheme I'd outlined, but... what behaviour should we expect for collocations as an argument to tokens_compound()? I just uncommented the example for this, and added tests to test-tokens_compound.R, but these only work when a collocations object is wrapped in phrase().

  • Question: Should collocations objects work with tokens_compound() as if they were phrases, automatically? This would make sense to me, since collocations are always compounds. So do we just wrap it in phrase() inside tokens_compound() if it's a collocations object? Or is this inconsistent?
  • If we do not wrap it, then at the least we need to trap the condition so that we do not see the following, as we currently do:
cols <- textstat_collocations("capital gains taxes are worse than inheritance taxes",
                              size = 2, min_count = 1)
toks <- tokens("The new law included capital gains taxes and inheritance taxes.")
tokens_compound(toks, cols)
#  Error in sequences$collocation : 
#   object of type 'closure' is not subsettable 
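
A minimal sketch of what that trap/auto-wrap could look like inside tokens_compound(), assuming we dispatch on the class of the pattern argument and that the space-separated phrases live in the collocation column (as the error above suggests); the internal argument name sequences is taken from that error and may differ:

# sketch only: inside tokens_compound(), before matching
if (inherits(sequences, "collocations")) {
    # treat each collocation as a space-separated phrase
    sequences <- phrase(sequences$collocation)
}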


koheiw commented Jul 11, 2017

It is OK to treat a collocations object as phrases (it already works this way in master). If someone wants to use the object in a different way, they can extract cols$collocation and use it as a feature input (if they can figure this out easily).

However, handling dictionary objects is much more difficult. We have to provide an interface that works with any combination of a dictionary and a tokens object

dict1 <- dictionary(key1 = 'a b', key2 = 'c d', concatenator = ' ')
dict2 <- dictionary(key1 = 'a b', key2 = 'c d', concatenator = '_')
dict3 <- dictionary(key1 = 'a_b', key2 = 'c_d', concatenator = ' ')
dict4 <- dictionary(key1 = 'a_b', key2 = 'c_d', concatenator = '_')

toks1 <- as.tokens(list(c('a', 'b', 'c d', 'e f', 'g', 'e')))
toks2 <- as.tokens(list(c('a', 'b', 'c_d', 'e_f', 'g', 'e')))

to produce all of the following:

# using tokens_compound()

# with toks1
toks1_comp1 <- as.tokens(list(c('a b', 'c d', 'e f', 'g', 'e')))
toks1_comp2 <- as.tokens(list(c('a_b', 'c d', 'e f', 'g', 'e')))

# with toks2
toks2_comp3 <- as.tokens(list(c('a b', 'c_d', 'e f', 'g', 'e')))
toks2_comp4 <- as.tokens(list(c('a_b', 'c_d', 'e f', 'g', 'e')))

# using tokens_lookup()

# with toks1
toks1_look1 <- as.tokens(list(c('key1', 'key2')))
toks1_look2 <- as.tokens(list(c('key1')))
toks1_look3 <- as.tokens(list(c('key2')))

# with toks2
toks2_look1 <- as.tokens(list(c('key1', 'key2')))
toks2_look2 <- as.tokens(list(c('key1')))
toks2_look3 <- as.tokens(list(c('key2')))

The simplest way is

  1. normalize the concatenators in the dictionary constructor (dictionary values will all be white-space segmented);
  2. create a function to substitute the white space in dictionary values with something else (a rough sketch follows below);
  3. wrap the dictionary in either phrase() or the new function before passing it to tokens_lookup() or tokens_compound().
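
A rough sketch of the step-2 function, operating on a plain named list of values rather than on the dictionary class itself (the function name and the flat-list representation are assumptions):

# sketch: substitute the white space in dictionary values with a concatenator
concatenate_values <- function(values, concatenator = "_") {
    lapply(values, function(x) gsub(" ", concatenator, x, fixed = TRUE))
}

concatenate_values(list(key1 = "a b", key2 = "c d"))
## $key1
## [1] "a_b"
## 
## $key2
## [1] "c_d"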


kbenoit commented Jul 11, 2017

Good point.

Here's an alternative: get rid of concatenator in dictionary() and in the dictionary class object, and make this always a whitespace. Any scheme that uses an alternative (and there is really only "_") we just convert to whitespace. Then the rules for multi-word matches apply out of the box.

Can you see any advantage to retaining a concatenator option?


kbenoit commented Jul 11, 2017

Maybe what you meant by your option 1 above was exactly what I've suggested: always make a space the intra-pattern connector in dictionary values? And if we import a format that uses an underscore, we just convert it to a space?


koheiw commented Jul 11, 2017

We should keep the concatenator in the dictionary constructor, because dictionary formats are so diverse, but dictionary objects will always be white-space segmented once created. The new dictionary object therefore no longer has concatenator as an attribute.


koheiw commented Jul 11, 2017

The second point is that, if we normalize dictionary values in that way, we need a function to change the white space to other characters ("_", "+", "-") for tokens_lookup() when the tokens are already concatenated.


kbenoit commented Jul 11, 2017

OK, so on the first point, the concatenator argument is simply there to specify manually what the separator is in the import format. We use that to always convert multi-word values to space-separated values. No need for a concatenator slot anymore in the dictionary class object. Correct?

On the second point, this could happen if we are looking up multi-word tokens created using tokens_ngrams(), for instance. In that case, since all dictionaries will have only white space separating the patterns in their multi-word values, I think we only need an internal function to convert the white space into the concatenator attribute of the tokens object. For the user this will be automatic. So we can remove the concatenator argument from the tokens_lookup() and dfm_lookup() functions too?
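
A sketch of that internal conversion, again on a plain list of values; reading the concatenator with attr(toks, "concatenator") is an assumption about how it is stored on the tokens object:

# sketch: convert white space in dictionary values to the tokens object's
# concatenator before matching, so the user never has to specify it
lookup_values <- function(values, toks) {
    conc <- attr(toks, "concatenator")  # assumed storage of the concatenator
    lapply(values, function(x) gsub(" ", conc, x, fixed = TRUE))
}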


koheiw commented Jul 11, 2017

Yes, the first point is exactly what I meant.

As for the second point, I am not entirely sure that the concatenator can be determined automatically (from an attribute of the tokens object), because tokens may not always have been concatenated by quanteda's functions.

For example, users have to choose '-' manually here:

dict <- dictionary(immig = c('low skilled', 'under paid'))
toks <- tokens("low-skilled workers are usualy under-paid in the UK", what = 'fastestword')


kbenoit commented Jul 11, 2017

OK, I'll modify the code to remove concatenator, on this branch.

On the "low-skilled" example, that's an interesting and tricky case, since if we specified the hyphen as concatenator, we would miss "low skilled". Probably the safest is to remove the hyphens and then use the standard dictionary to detect sequence with hyphens removed. e.g.

dict <- dictionary(immig = c('low skilled', 'under paid'))
toks <- tokens("low-skilled workers are usualy under-paid in the UK", 
               remove_hyphens = TRUE)

toks
# tokens from 1 document.
# Component 1 :
# [1] "low"     "skilled" "workers" "are"     "usualy"  "under"   "paid"    "in"      "the"    
# [10] "UK"  

tokens_compound(toks, phrase(dict))
# tokens from 1 document.
# Component 1 :
# [1] "low_skilled" "workers"     "are"         "usualy"      "under_paid"  "in"         
# [7] "the"         "UK" 


koheiw commented Jul 11, 2017

I can imagine some people doing tokenization with some kind of POS tagger and not wanting to modify the tokens at all. quanteda has to be ready for them.


kbenoit commented Jul 11, 2017

Agreed. This does not work either:

(toks <- tokens("low-skilled workers are low skilled"))
# tokens from 1 document.
# Component 1 :
# [1] "low"     "-"       "skilled" "workers" "are"     "low"     "skilled"

seqs <- c("low [-]{0,1} skilled")
tokens_compound(toks, phrase(seqs), valuetype = "regex")
# tokens from 1 document.
# Component 1 :
# [1] "low_-_skilled" "workers"       "are"           "low"           "skilled"      

dict <- dictionary("low-skilled" = "low * skilled")
tokens_lookup(toks, dict, exclusive = FALSE, capkeys = FALSE)
# tokens from 1 document.
# Component 1 :
# [1] "low-skilled" "workers"     "are"         "low"         "skilled"      

The tokens_lookup() call almost works, but appears not to because the middle * does not match a "non-token". The obvious problem there is that we cannot determine how many "non-tokens" to count!

kbenoit added a commit that referenced this issue Jul 11, 2017
Implement new phrase behaviour for tokens_compound (#837)
kbenoit closed this as completed Jul 11, 2017