Add tokens_chunk() #1520

Merged: 18 commits from dev-tokens_chunck merged into master on Dec 20, 2018

Conversation

@koheiw (Collaborator) commented Dec 11, 2018

Fairly fast already

Unit: milliseconds
                        expr       min        lq      mean    median        uq      max neval
       tokens_chunk(toks, 5)  39.24824  45.25473  81.37482  49.94681  77.87797 345.7807   100
 tokens_chunk(toks, 5, TRUE) 205.85162 293.30307 371.14174 340.42323 422.35138 707.4850   100
      tokens_ngrams(toks, 5) 232.01304 250.11479 287.26929 269.02318 297.44102 572.6281   100
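
For reference, a call along these lines would have produced the table above (a sketch, not the exact script; toks is assumed to be a large tokens object such as tokens(data_corpus_inaugural), which is used in the benchmarks further down):

library("quanteda")
library("microbenchmark")

toks <- tokens(data_corpus_inaugural)  # assumed input; any sizeable tokens object works

microbenchmark(
  tokens_chunk(toks, 5),        # fixed-size chunks
  tokens_chunk(toks, 5, TRUE),  # with the draft overlap behaviour enabled
  tokens_ngrams(toks, 5),       # ngrams of the same length, for comparison
  times = 100
)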

codecov bot commented Dec 11, 2018

Codecov Report

Merging #1520 into master will increase coverage by 0.08%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1520      +/-   ##
==========================================
+ Coverage   89.79%   89.87%   +0.08%     
==========================================
  Files         103      105       +2     
  Lines        7752     7816      +64     
==========================================
+ Hits         6961     7025      +64     
  Misses        791      791

@kbenoit (Collaborator) commented Dec 11, 2018

I added an argument to truncate the (uneven) remainder, and put my version of the code in (not exported) for comparison. Not too bad...! It's the insert_values plus tokens_segment() that is doing the work.

toks <- tokens(data_corpus_inaugural)

microbenchmark::microbenchmark(
  tokens_chunk(toks, size = 10),
  quanteda:::tokens_chunk2(toks, size = 10)
)
## Unit: milliseconds
##                                       expr      min       lq     mean
##              tokens_chunk(toks, size = 10) 15.92971 18.84621 20.85797
##  quanteda:::tokens_chunk2(toks, size = 10) 15.58881 18.45565 21.38188
##    median       uq     max neval
##  19.88384 22.04475  39.724   100
##  19.59377 21.92013 132.733   100

microbenchmark::microbenchmark(
  tokens_chunk(toks, size = 100),
  quanteda:::tokens_chunk2(toks, size = 100)
)
## Unit: milliseconds
##                                        expr       min       lq     mean
##              tokens_chunk(toks, size = 100) 10.360761 12.50879 16.61262
##  quanteda:::tokens_chunk2(toks, size = 100)  9.856674 12.33062 15.31867
##    median       uq      max neval
##  13.59635 16.30358 34.88291   100
##  13.32331 15.31451 36.81684   100

microbenchmark::microbenchmark(
  tokens_chunk(toks, size = 1000),
  quanteda:::tokens_chunk2(toks, size = 1000)
)
## Unit: milliseconds
##                                         expr      min       lq     mean
##              tokens_chunk(toks, size = 1000) 5.800675 6.397960 9.573177
##  quanteda:::tokens_chunk2(toks, size = 1000) 5.983992 6.370365 8.502899
##    median       uq      max neval
##  7.634302 9.864545 100.3539   100
##  7.593575 9.506097  19.9341   100

@kbenoit self-requested a review December 12, 2018 07:07
@kbenoit (Collaborator) left a review comment:

Looks good (and fast!) and while I am still not really sure about the overlap use-case - which seems to be a special case of the sort of word shingles created by ngrams - I'm fine to leave it, as long as it's turned off by default.

I added a test that should pass, which traps an error caused when size > any(lengths(x)). When discard_remainder = TRUE, then for a document whose length is less than size, we should return an empty tokens set for that document, as this does:

tokens_remove(tokens("a a"), "a")
## tokens from 1 document.
## text1 :
## character(0)

@koheiw (Collaborator, Author) commented Dec 12, 2018

That is easy, but I'm not sure about the logic behind this:

    expect_identical(
        as.list(tokens_chunk(toks, size = 4, discard_remainder = TRUE)),
        list(d1 = c("a", "b", "c", "d"),
             d2 = character(0))
    )

It should be

    expect_identical(
        as.list(tokens_chunk(toks, size = 4, discard_remainder = TRUE)),
        list(d1 = c("a", "b", "c", "d"))
    )

or

    expect_identical(
        as.list(tokens_chunk(toks, size = 4, discard_remainder = TRUE)),
        list(d1.1 = c("a", "b", "c", "d"),
             d1.2 = character(0),
             d2.1 = character(0))
    )
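
(These expectations match the two-document input that appears later in the thread; quoting it here is an assumption, since the test's own setup is not shown in this excerpt.)

# assumed fixture: d1 leaves a one-token remainder ("e") at size = 4,
# and d2 is shorter than size altogether
toks <- tokens(c(d1 = "a b c d e", d2 = "a b c"))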

@kbenoit (Collaborator) commented Dec 12, 2018

I'd be ok with the last one. We should not be letting this function drop documents, meaning that it would change the unique entries in the _docid docvar.

@koheiw (Collaborator, Author) commented Dec 12, 2018

Then shouldn't discard_remainder be called pad_remainder?

@kbenoit (Collaborator) commented Dec 12, 2018

Chunking for a fixed size implies that each chunk will be of that size. By default the uneven remainder would be discarded. Padding, at least for me, implies adding something in between what exists or is left, as we do with removed tokens. Plus, pad_remainder would have an opposite logical value from discard_remainder.

@koheiw (Collaborator, Author) commented Dec 12, 2018

OK, "padding" is not the word, but why not "remove"? There is no "discard" anywhere else, as far as I can remember. Does "discard" suggest an action to empty documents? Then we could (possibly, but not really) make tokens_discard(), which returns empty documents that match certain criteria. This can actually be done by tokens_remove(x, window = max(lengths(x))).

@kbenoit (Collaborator) commented Dec 12, 2018

I thought "discard" was more accurate than "remove", since "remove" suggests getting rid of something that exists or is searched for, while "discard" means getting rid of something if and only if it exists. But that's not clear.

This is a very specific case that comes from forming fixed-length chunks, so I think it's ok to make it specific here. So how about changing it to keep_remainder = FALSE (which involves flipping the logical value of the default, of course), which uses a verb found elsewhere ("keep") and, in my view, makes the purpose of the argument very clear.

@kbenoit previously approved these changes Dec 13, 2018

@kbenoit (Collaborator) left a review comment:

I'm happy with this now if you are.

tests/testthat/test-tokens_chunk.R
@@ -71,19 +71,19 @@ test_that("tokens_chunk works", {
     expect_is(tokens_chunk(toks, size = 3), "tokens")
     expect_equivalent(
         as.list(tokens_chunk(toks, 3, discard_remainder = TRUE)),
-        list(c("a", "b", "c"), c("d", "e", "f"), c("a", "a", "b"))
+        list(c("a", "b", "c"), c("d", "e", "f"), c("a", "a", "b"), character())
Collaborator:

I don't think we should have an empty character at the end - it should simply be discarded.

So discard_remainder = TRUE should mean that the remainder is simply gone. The exception would be when the document's total tokens are shorter than size, in which case there would be an empty document (empty in terms of tokens) returned.

Slightly inconsistent, but there is no use for empty character pads that I can think of.

An alternative to make it consistent is dropping documents, but that could cause us problems in other areas, since only a few tokens functions alter documents (sample, subset).

Collaborator Author:

This sounds like the old DFM construction method, where you were inserting artificial tokens for empty documents. The right solution is not to discard anything. Why do you want discard_remainder?

Collaborator:

Because if you are applying a method that requires equally-sized chunks, then you don't want an unequal chunk at the end (for the Mean Segmental TTR, for instance). As per the example of segmenting a three-token document with size = 5: with keep_remainder = FALSE, the chunk has length similar to

> 3 %/% 4
[1] 0

Collaborator:

But it might make sense to set the default to keep_remainder = TRUE.

Collaborator Author:

Where is the code for "Mean Segmental TTR"?

Collaborator:

Yes but the point is that the length of the split elements should be the same. That does not happen when some segments are remainders. Let's keep keep_remainder but set the default to TRUE.

Collaborator Author:

Why don't you do it using lengths(), which is pretty fast?

Collaborator Author:

You can also let textstat_lexdiv(x, "MTTR") return NA for documents shorter than a certain size.
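
A minimal post-processing sketch of that idea (toks_chunked and size are placeholder names for the chunked tokens and the chunk size; textstat_lexdiv() itself is left unchanged):

# score every chunk, then set scores for chunks shorter than size to NA,
# rather than dropping them or averaging them in
lex <- textstat_lexdiv(toks_chunked, "TTR")
lex$TTR[ntoken(toks_chunked) < size] <- NA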

Collaborator:

Yes, it should, but it also should not average in segments shorter than the specified size.

Anyway more generally on chunks: it’s fundamentally an integer division operation, and integer divisions have a quotient and a modulus. The keep_remainder is to keep the modulus. Why would we not provide that option?
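
The integer-division framing in plain R, for illustration:

ntok <- 11     # tokens in a document
size <- 4      # chunk size
ntok %/% size  # 2: number of full-size chunks (the quotient)
ntok %%  size  # 3: leftover tokens (the modulus, i.e. the remainder)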

Collaborator Author:

I don't think we need that because I cannot imagine any other use-case, and it can be done just by textstat_lexdiv(toks[lengths(toks) >= window], "TTR").

@kbenoit (Collaborator) commented Dec 15, 2018

proposal: tokens_reshape()

OK, been thinking about this more, and it occurs to me that if we have trouble designing a function, then maybe there is something that needs rethinking. I've also always been a bit unsure about defining a new function that purports to operate on tokens, but then redefines documents. We have just two functions that do that (except select and sample functions): *_segment() and *_reshape(). We have a tokens_segment() that is analogous to corpus_segment(), which splits tokens into new documents based on pattern matches. This keeps track of the document ID through a special docvar, and has the potential to be reshaped back to the "documents" unit using this special docvar. But we never defined a tokens_reshape(), probably because once a document is tokenized, we no longer have units easily identifiable as sentences, paragraphs, etc.

However, we could implement the tokens_reshape() units in terms of "shingles" or sequence sizes, including a variable overlap argument. This would avoid our debate about whether a remainder should be discarded or kept, and make it unnecessary to have empty remainders versus no remainders. In tokens_reshape(), we would always keep the remainder, along with the identifying information about which document and segment it is.

tokens_reshape(x, size = NULL, overlap = 0, to = c("segments", "documents"), use_docvars = TRUE)

Where overlap < size, so you could get successive segments with ngram-type overlaps, e.g. "a b c d" becomes "a b c" "b c d" for W-shingles, etc., or with overlap = 1 we get the behaviour of the current draft overlap = TRUE.
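
A plain-R sketch of that windowing logic (chunk_with_overlap() is a hypothetical helper, not the quanteda implementation), showing where the sub-size trailing windows come from:

# successive windows of size tokens, each advancing by (size - overlap) positions
chunk_with_overlap <- function(x, size, overlap = 0) {
  stopifnot(overlap < size)
  starts <- seq(1, length(x), by = size - overlap)
  lapply(starts, function(i) x[i:min(i + size - 1, length(x))])
}

chunk_with_overlap(c("a", "b", "c", "d"), size = 3, overlap = 2)

With size = 3 and overlap = 2 this reproduces the "a b c" / "b c d" shingles from the example, plus the trailing windows "c d" and "d", which are exactly the sub-size remainders discussed above.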

size is required except when to = "documents".

For MSTTR we just segment to size and tokens_subset(x, ntoken(x) == size) to discard the remainders.

Note that this also provides a solution for the MATTR (see #1508), which is a moving window across tokens. Other uses could be special similarity measures of the type that motivated adding the overlap = TRUE argument, and other statistical measures where we need segments or moving-window samples across tokens.

And as a bonus, if we use a tokenizer that also records sentence IDs (e.g. spaCy's tokenizer/parser), we could add "sentences" to the reshape target list (as we already have for corpus_reshape()).

What do you think?

@koheiw (Collaborator, Author) commented Dec 15, 2018

I think the name tokens_chunk() is not great, but still better than tokens_reshape(). corpus_reshape() is a function that I want to get rid of in the future by providing more flexible token selection functions, so I don't want to model this one on it.

I also think dfm_group() is enough for reconstructing original document units. Concatenating overlapped documents (even if duplicates are removed) with tokens_reshape() does not make sense to me either.

@koheiw (Collaborator, Author) commented Dec 16, 2018

I like these.

tokens_chunk(x, size, overlap = 0, use_docvars = TRUE)
# or 
tokens_chunk(x, size, move = size, use_docvars = TRUE)

@koheiw (Collaborator, Author) commented Dec 18, 2018

I am not sure why @kbenoit approved this PR before seeing a version that reflects the discussion. The latest version is

tokens_chunk(x, size, overlap = 0, use_docvars = TRUE)

I will click the green button if you are OK with this.

@kbenoit dismissed their stale review December 18, 2018 07:06: "Function has changed since approval"

@kbenoit (Collaborator) left a review comment:

Let's discuss this first, since it has the potential to be a useful new general function, but it needs to fit in consistently with existing functions. It brings us back to the tricky debate we had before 1.0, when we tried to define the difference between corpus_reshape() and corpus_segment(), because both redefined document units. tokens_chunk() effectively does the same as those two, which is to split the existing document units into more documents.

tokens/corpus_segment(x, extract_pattern = TRUE) not only segments/chunks the existing documents but also changes their content, by removing the pattern and putting it into a docvar. That cannot be recovered, unlike the reshaping of units using corpus_reshape(x, to = "documents").

With the latest state of tokens_chunk(), we have the potential to radically redefine the units when overlap > 0, or mildly when remainders are dropped.

Issues then in my mind to resolve are:

  • what to call it and how, if at all, to integrate it with the existing _segment() or _reshape() framework. (And BTW, I like corpus_reshape() and use it all the time!)

  • I would like to restore the keep_remainder = TRUE option. This not only makes it explicit that the chunks are likely to be uneven at the end, but also lets us keep the operation to keep or discard the smaller-than-size remainder within the function, reducing user effort and potential error. We have lots of such arguments elsewhere, and the cost is virtually nil in terms of added complexity. It also allows us to structure the return values and thereby enforce any reshaping etc. that we might want to perform, since no document would be dropped, even if it is empty (see next point).

  • For the return value when keep_remainder = FALSE, I think the behaviour should be the following:

toks <- tokens(c(d1 = "a b c d e", d2 = "a b c"))
tokens_chunk(toks, size = 4, keep_remainder = FALSE)
# tokens from 2 documents.
# d1.1 :
# [1] "a" "b" "c" "d"
# 
# d2.1 :
# character(0)

@koheiw (Collaborator, Author) commented Dec 18, 2018

I consider this function to be for advanced users who have no problem with removing short objects. I also cannot think of any use case for keep_remainder = TRUE other than textstat_lexdiv(). Are short chunks remainders? Probably not, especially when chunks are overlapping.

@kbenoit (Collaborator) commented Dec 18, 2018

I will try to work around this in my aggregation functions for lexdiv, and probably can, but just selecting the chunked tokens on their lengths means some documents will disappear entirely.

To get the behaviour requested in my last comment above, you need to be a pretty advanced user!

size <- 4
toks <- tokens(c(d1 = "a b c d e", d2 = "a b c")) %>%
  tokens_chunk(size = 4)

library("data.table")
remove_remainders <- function(x, size) {
  # figure out remainders and non-end remainders
  dt <- data.table(
    docnames = docnames(x),
    attr(x, "docvars")
  )
  dt[, remainder := (lengths(x) < size)]
  dt[, onlyremainder := (.N == 1), by = "_docid"]

  # select non-end remainders
  dt <- dt[!remainder | onlyremainder]
  x <- x[dt[, docnames]]

  # delete tokens from the only remainder ends
  att <- attributes(x)
  x <- unclass(x)
  x[[ dt[(onlyremainder), docnames] ]] <- integer(0)
  attributes(x) <- att
  x
}

toks %>%
  remove_remainders(size)
## tokens from 2 documents.
## d1.1 :
## [1] "a" "b" "c" "d"
## 
## d2.1 :
## character(0)

@kbenoit (Collaborator) left a review comment:

Last question: should we call the second argument overlap_size? I think size is ok, as we also use it in other functions, although it could be chunk_size.

@jiongweilua (Collaborator) commented:

I have not chimed in on this as I think the conversation above requires more contextual knowledge about how quanteda's functions have historically developed than I possess.

But @kbenoit, regarding your final question about argument names, my input as a relatively new user is that the combination of chunk_size and overlap_size will be most unambiguous.

tokens_chunk(x, chunk_size = 3, overlap_size = 0, use_docvars = TRUE)

@koheiw (Collaborator, Author) commented Dec 19, 2018

@kbenoit I thought you understood what I meant by "use split with factor levels", but probably you did not. Here is the code:

txt <- c(long = "a a b s e c d d e a b a b e s", short = "a b")
toks <- tokens(txt)
toks_seg <- tokens_chunk(toks, 3)
toks_seg <- toks_seg[lengths(toks_seg) >= 3] # drop short segments
lex <- textstat_lexdiv(toks_seg, "TTR")
lex <- split(lex$TTR, factor(attr(toks_seg, "docvars")[["_document"]], 
                                levels = docnames(toks)))
lex
# $long
# [1] 0.6666667 1.0000000 0.6666667 0.6666667 1.0000000
# 
# $short
# numeric(0)

"short" document is still there even if its segments are all removed from toks_seg.

@kbenoit (Collaborator) commented Dec 19, 2018

Yes, of course that's a good workaround for the lexdiv issue, but I was illustrating the complexity of the general way to get the outcome of chunking I'd pointed to above. Anyway, I've approved the PR as is now.

Merge if you are ok with the argument names, but give it a think.

@kbenoit merged commit 612718b into master Dec 20, 2018
@kbenoit deleted the dev-tokens_chunck branch December 20, 2018 07:46