tokens_compound should not compound patterns that nest inside one another #837

Closed
kbenoit opened this issue Jul 6, 2017 · 11 comments


kbenoit commented Jul 6, 2017

If the pattern is c("a b", "a b c"), and the tokens are "a", "b", "c", then the compounded version should be just "a_b_c", not "a_b_c", "a_b". Compounding should never increase the number of tokens, just join them.

This is a separate issue from joining overlapping sequences, determined by join = TRUE, which is an argument we want to keep. Overlapping sequences provide a slight exception to the "never increase the total tokens" behaviour, but this is the only way to handle them correctly, since otherwise we have an indeterminate set of compounds to form, and their order would affect what is formed. In the following example, for instance, "c" gets counted twice (in two compounds), but it needs to be; otherwise we would have "b_c" or "c_d" but not both, and there is no deterministic rule for choosing which.

tokens_compound(tokens("a b c d e"), 
                list(c("b", "c"), c("c", "d")),
                join = FALSE)
## tokens from 1 document.
## Component 1 :
## [1] "a"   "b_c" "c_d" "e" 

This points to an argument for making join = TRUE the default, since this is the behaviour most people will expect.
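
A sketch (not run) of what the same call should produce with join = TRUE, given that overlapping sequences are merged:

tokens_compound(tokens("a b c d e"), 
                list(c("b", "c"), c("c", "d")),
                join = TRUE)
## tokens from 1 document.
## Component 1 :
## [1] "a"     "b_c_d" "e"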

Once PR #820 is merged, these tests will need to pass after removing the skip_ functions from test-tokens_compound.R:

test_that("tokens_compound works as expected with nested tokens", {
    skip_on_appveyor()
    skip_on_travis()
    expect_equal(
        as.character(tokens_compound(tokens("a b c d"), phrase(c("a b", "a b c")), 
                     join = FALSE)),
        c("a_b_c", "d")
    )
    expect_equal(
        as.character(tokens_compound(tokens("a b c d"), phrase(c("a b", "a b c")), 
                     join = TRUE)),
        c("a_b_c", "d")
    )
})

test_that("tokens_compound works as expected with nested and overlapping tokens", {
    skip_on_appveyor()
    skip_on_travis()
    expect_equal(
        as.character(tokens_compound(tokens("a b c d e"), 
                                     phrase(c("a b", "a b c", "c d")),
                                     join = FALSE)),
        c("a_b_c", "c_d", "e")
    )
    expect_equal(
        as.character(tokens_compound(tokens("a b c d e"), 
                                     phrase(c("a b", "a b c", "c d")),
                                     join = TRUE)),
        c("a_b_c_d", "e")
    )
})

kbenoit commented Jul 11, 2017

I'm risking undoing the careful scheme I'd outlined, but... what behaviour should we expect for collocations as an argument to tokens_compound()? I just uncommented the example for this, and added tests to test-tokens_compound.R, but these only work when a collocations object is wrapped in phrase().

  • Question: Should collocations objects work with tokens_compound() as if they were phrases, automatically? This would make sense to me, since collocations are always compounds. So do we just wrap it in phrase() inside tokens_compound() if it's a collocations object? Or is this inconsistent?
  • If we do not wrap it, then at the least we need to trap the condition so that we do not see the following, as we currently do:
cols <- textstat_collocations("capital gains taxes are worse than inheritance taxes",
                              size = 2, min_count = 1)
toks <- tokens("The new law included capital gains taxes and inheritance taxes.")
tokens_compound(toks, cols)
#  Error in sequences$collocation : 
#   object of type 'closure' is not subsettable 
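
A minimal sketch of what that trap/auto-wrap could look like inside tokens_compound(), assuming we dispatch on the class of the pattern argument and that the space-separated phrases live in the collocation column (as the error above suggests); the internal argument name sequences is taken from that error and may differ:

# sketch only: inside tokens_compound(), before matching
if (inherits(sequences, "collocations")) {
    # treat each collocation as a space-separated phrase
    sequences <- phrase(sequences$collocation)
}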


koheiw commented Jul 11, 2017

It is OK to treat a collocations object as phrases (it already works this way in master). If someone wants to use the object in a different way, they can extract cols$collocation and use it as a feature input (if they can figure this out easily).

However, handling dictionary objects is much more difficult. We have to provide an interface that works with any combination of a dictionary and a tokens object

dict1 <- dictionary(key1 = 'a b', key2 = 'c d', concatenator = ' ')
dict2 <- dictionary(key1 = 'a b', key2 = 'c d', concatenator = '_')
dict3 <- dictionary(key1 = 'a_b', key2 = 'c_d', concatenator = ' ')
dict4 <- dictionary(key1 = 'a_b', key2 = 'c_d', concatenator = '_')

toks1 <- as.tokens(list(c('a', 'b', 'c d', 'e f', 'g', 'e')))
toks2 <- as.tokens(list(c('a', 'b', 'c_d', 'e_f', 'g', 'e')))

to produce all of the following:

# using tokens_compound()

# with toks1
toks1_comp1 <- as.tokens(list(c('a b', 'c d', 'e f', 'g', 'e')))
toks1_comp2 <- as.tokens(list(c('a_b', 'c d', 'e f', 'g', 'e')))

# with toks2
toks2_comp3 <- as.tokens(list(c('a b', 'c_d', 'e f', 'g', 'e')))
toks2_comp4 <- as.tokens(list(c('a_b', 'c_d', 'e f', 'g', 'e')))

# using tokens_lookup()

# with toks1
toks1_look1 <- as.tokens(list(c('key1', 'key2')))
toks1_look2 <- as.tokens(list(c('key1')))
toks1_look3 <- as.tokens(list(c('key2')))

# with toks2
toks2_look1 <- as.tokens(list(c('key1', 'key2')))
toks2_look2 <- as.tokens(list(c('key1')))
toks2_look3 <- as.tokens(list(c('key2')))

The simplest way is

  1. normalize the concatenators in the dictionary constructor (dictionary values will all be white-space segmented);
  2. create a function to substitute the white space in dictionary values with something else (a rough sketch follows below);
  3. wrap the dictionary in either phrase() or the new function before passing it to tokens_lookup() or tokens_compound().
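
A rough sketch of the step-2 function, operating on a plain named list of values rather than on the dictionary class itself (the function name and the flat-list representation are assumptions):

# sketch: substitute the white space in dictionary values with a concatenator
concatenate_values <- function(values, concatenator = "_") {
    lapply(values, function(x) gsub(" ", concatenator, x, fixed = TRUE))
}

concatenate_values(list(key1 = "a b", key2 = "c d"))
## $key1
## [1] "a_b"
## 
## $key2
## [1] "c_d"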


kbenoit commented Jul 11, 2017

Good point.

Here's an alternative: get rid of concatenator in dictionary() and in the dictionary class object, and make this always a whitespace. Any scheme that uses an alternative (and there is really only "_") we just convert to whitespace. Then the rules for multi-word matches apply out of the box.

Can you see any advantage to retaining a concatenator option?


kbenoit commented Jul 11, 2017

Maybe what you meant by your option 1 above was exactly what I've suggested: always make a space the intra-pattern connector in dictionary values? And if we import a format that uses an underscore, we just convert it to a space?


koheiw commented Jul 11, 2017

We should keep the concatenator in the dictionary constructor, because dictionary formats are so diverse, but dictionary objects will always be white-space segmented once created. The new dictionary object therefore no longer has concatenator as an attribute.


koheiw commented Jul 11, 2017

The second point is that, if we normalize dictionary values in that way, we need a function to change the white space to other characters ("_", "+", "-") for tokens_lookup() when the tokens are already concatenated.


kbenoit commented Jul 11, 2017

OK, so on the first point, the concatenator argument is simply there to specify manually what the separator is in the import format. We use that to always convert multi-word values to space-separated values. No need for a concatenator slot anymore in the dictionary class object. Correct?

On the second point, this could happen if we are looking up multi-word tokens created using tokens_ngrams(), for instance. In that case, since all dictionaries will have only white space separating the patterns in their multi-word values, I think we only need an internal function to convert the white space into the concatenator attribute of the tokens object. For the user this will be automatic. So we can remove the concatenator argument from the tokens_lookup() and dfm_lookup() functions too?
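
A sketch of that internal conversion, again on a plain list of values; reading the concatenator with attr(toks, "concatenator") is an assumption about how it is stored on the tokens object:

# sketch: convert white space in dictionary values to the tokens object's
# concatenator before matching, so the user never has to specify it
lookup_values <- function(values, toks) {
    conc <- attr(toks, "concatenator")  # assumed storage of the concatenator
    lapply(values, function(x) gsub(" ", conc, x, fixed = TRUE))
}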


koheiw commented Jul 11, 2017

Yes, the first point is exactly what I meant.

As for the second point, I am not entirely sure that the concatenator can be determined automatically (from an attribute of the tokens object), because tokens may not always have been concatenated by quanteda's functions.

For example, users have to choose '-' manually here:

dict <- dictionary(immig = c('low skilled', 'under paid'))
toks <- tokens("low-skilled workers are usualy under-paid in the UK", what = 'fastestword')


kbenoit commented Jul 11, 2017

OK, I'll modify the code to remove concatenator, on this branch.

On the "low-skilled" example, that's an interesting and tricky case, since if we specified the hyphen as concatenator, we would miss "low skilled". Probably the safest is to remove the hyphens and then use the standard dictionary to detect sequence with hyphens removed. e.g.

dict <- dictionary(immig = c('low skilled', 'under paid'))
toks <- tokens("low-skilled workers are usualy under-paid in the UK", 
               remove_hyphens = TRUE)

toks
# tokens from 1 document.
# Component 1 :
# [1] "low"     "skilled" "workers" "are"     "usualy"  "under"   "paid"    "in"      "the"    
# [10] "UK"  

tokens_compound(toks, phrase(dict))
# tokens from 1 document.
# Component 1 :
# [1] "low_skilled" "workers"     "are"         "usualy"      "under_paid"  "in"         
# [7] "the"         "UK" 


koheiw commented Jul 11, 2017

I can imagine some people doing tokenization with some kind of POS tagger and not wanting to modify the tokens at all. quanteda has to be ready for them.


kbenoit commented Jul 11, 2017

Agreed. This does not work either:

(toks <- tokens("low-skilled workers are low skilled"))
# tokens from 1 document.
# Component 1 :
# [1] "low"     "-"       "skilled" "workers" "are"     "low"     "skilled"

seqs <- c("low [-]{0,1} skilled")
tokens_compound(toks, phrase(seqs), valuetype = "regex")
# tokens from 1 document.
# Component 1 :
# [1] "low_-_skilled" "workers"       "are"           "low"           "skilled"      

dict <- dictionary("low-skilled" = "low * skilled")
tokens_lookup(toks, dict, exclusive = FALSE, capkeys = FALSE)
# tokens from 1 document.
# Component 1 :
# [1] "low-skilled" "workers"     "are"         "low"         "skilled"      

The tokens_lookup() call almost works, but appears not to because the middle * does not match a "non-token". The obvious problem there is that we cannot determine how many "non-tokens" to count!

kbenoit added a commit that referenced this issue Jul 11, 2017
Implement new phrase behaviour for tokens_compound (#837)
kbenoit closed this as completed Jul 11, 2017