Add tokens_chunk() #1520

Merged: 18 commits from dev-tokens_chunck merged into master on Dec 20, 2018

Conversation

@koheiw (Collaborator) commented Dec 11, 2018

Fairly fast already

Unit: milliseconds
                        expr       min        lq      mean    median        uq      max neval
       tokens_chunk(toks, 5)  39.24824  45.25473  81.37482  49.94681  77.87797 345.7807   100
 tokens_chunk(toks, 5, TRUE) 205.85162 293.30307 371.14174 340.42323 422.35138 707.4850   100
      tokens_ngrams(toks, 5) 232.01304 250.11479 287.26929 269.02318 297.44102 572.6281   100
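
For reference, a call along these lines would have produced the table above (a sketch, not the exact script; toks is assumed to be a large tokens object such as tokens(data_corpus_inaugural), which is used in the benchmarks further down):

library("quanteda")
library("microbenchmark")

toks <- tokens(data_corpus_inaugural)  # assumed input; any sizeable tokens object works

microbenchmark(
  tokens_chunk(toks, 5),        # fixed-size chunks
  tokens_chunk(toks, 5, TRUE),  # with the draft overlap behaviour enabled
  tokens_ngrams(toks, 5),       # ngrams of the same length, for comparison
  times = 100
)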

codecov bot commented Dec 11, 2018

Codecov Report

Merging #1520 into master will increase coverage by 0.08%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1520      +/-   ##
==========================================
+ Coverage   89.79%   89.87%   +0.08%     
==========================================
  Files         103      105       +2     
  Lines        7752     7816      +64     
==========================================
+ Hits         6961     7025      +64     
  Misses        791      791

@kbenoit (Collaborator) commented Dec 11, 2018

I added an argument to truncate the (uneven) remainder, and put my version of the code in (not exported) for comparison. Not too bad...! It's the insert_values plus tokens_segment() that is doing the work.

toks <- tokens(data_corpus_inaugural)

microbenchmark::microbenchmark(
  tokens_chunk(toks, size = 10),
  quanteda:::tokens_chunk2(toks, size = 10)
)
## Unit: milliseconds
##                                       expr      min       lq     mean
##              tokens_chunk(toks, size = 10) 15.92971 18.84621 20.85797
##  quanteda:::tokens_chunk2(toks, size = 10) 15.58881 18.45565 21.38188
##    median       uq     max neval
##  19.88384 22.04475  39.724   100
##  19.59377 21.92013 132.733   100

microbenchmark::microbenchmark(
  tokens_chunk(toks, size = 100),
  quanteda:::tokens_chunk2(toks, size = 100)
)
## Unit: milliseconds
##                                        expr       min       lq     mean
##              tokens_chunk(toks, size = 100) 10.360761 12.50879 16.61262
##  quanteda:::tokens_chunk2(toks, size = 100)  9.856674 12.33062 15.31867
##    median       uq      max neval
##  13.59635 16.30358 34.88291   100
##  13.32331 15.31451 36.81684   100

microbenchmark::microbenchmark(
  tokens_chunk(toks, size = 1000),
  quanteda:::tokens_chunk2(toks, size = 1000)
)
## Unit: milliseconds
##                                         expr      min       lq     mean
##              tokens_chunk(toks, size = 1000) 5.800675 6.397960 9.573177
##  quanteda:::tokens_chunk2(toks, size = 1000) 5.983992 6.370365 8.502899
##    median       uq      max neval
##  7.634302 9.864545 100.3539   100
##  7.593575 9.506097  19.9341   100

@kbenoit self-requested a review December 12, 2018 07:07
@kbenoit (Collaborator) left a review comment:

Looks good (and fast!) and while I am still not really sure about the overlap use-case - which seems to be a special case of the sort of word shingles created by ngrams - I'm fine to leave it, as long as it's turned off by default.

I added a test that should pass, which traps an error caused when size > any(lengths(x)). When discard_remainder = TRUE, then for a document whose length is less than size, we should return an empty tokens set for that document, as this does:

tokens_remove(tokens("a a"), "a")
## tokens from 1 document.
## text1 :
## character(0)

@koheiw (Collaborator, Author) commented Dec 12, 2018

That is easy, but I'm not sure about the logic behind this:

    expect_identical(
        as.list(tokens_chunk(toks, size = 4, discard_remainder = TRUE)),
        list(d1 = c("a", "b", "c", "d"),
             d2 = character(0))
    )

It should be

    expect_identical(
        as.list(tokens_chunk(toks, size = 4, discard_remainder = TRUE)),
        list(d1 = c("a", "b", "c", "d"))
    )

or

    expect_identical(
        as.list(tokens_chunk(toks, size = 4, discard_remainder = TRUE)),
        list(d1.1 = c("a", "b", "c", "d"),
             d1.2 = character(0),
             d2.1 = character(0))
    )
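
(These expectations match the two-document input that appears later in the thread; quoting it here is an assumption, since the test's own setup is not shown in this excerpt.)

# assumed fixture: d1 leaves a one-token remainder ("e") at size = 4,
# and d2 is shorter than size altogether
toks <- tokens(c(d1 = "a b c d e", d2 = "a b c"))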

@kbenoit (Collaborator) commented Dec 12, 2018

I'd be ok with the last one. We should not be letting this function drop documents, meaning that it would change the unique entries in the _docid docvar.

@koheiw (Collaborator, Author) commented Dec 12, 2018

Then shouldn't discard_remainder be called pad_remainder?

@kbenoit (Collaborator) commented Dec 12, 2018

Chunking for a fixed size implies that each chunk will be of that size. By default the uneven remainder would be discarded. Padding, at least for me, implies adding something in between what exists or is left, as we do with removed tokens. Plus, pad_remainder would have an opposite logical value from discard_remainder.

@koheiw (Collaborator, Author) commented Dec 12, 2018

OK, "padding" is not the word, but why not "remove"? There is no "discard" anywhere else, as far as I can remember. Does "discard" suggest an action to empty documents? Then we could (possibly, but not really) make tokens_discard(), which returns empty documents that match certain criteria. This can actually be done by tokens_remove(x, window = max(lengths(x))).

@kbenoit (Collaborator) commented Dec 12, 2018

I thought "discard" was more accurate than "remove", since "remove" suggests getting rid of something that exists or is searched for, while "discard" means getting rid of something if and only if it exists. But that's not clear.

This is a very specific case that comes from forming fixed-length chunks, so I think it's ok to make it specific here. So how about changing it to keep_remainder = FALSE (which involves flipping the logical value of the default, of course), which uses a verb found elsewhere ("keep") and, in my view, makes the purpose of the argument very clear.

@kbenoit previously approved these changes Dec 13, 2018

@kbenoit (Collaborator) left a review comment:

I'm happy with this now if you are.

tests/testthat/test-tokens_chunk.R
@@ -71,19 +71,19 @@ test_that("tokens_chunk works", {
     expect_is(tokens_chunk(toks, size = 3), "tokens")
     expect_equivalent(
         as.list(tokens_chunk(toks, 3, discard_remainder = TRUE)),
-        list(c("a", "b", "c"), c("d", "e", "f"), c("a", "a", "b"))
+        list(c("a", "b", "c"), c("d", "e", "f"), c("a", "a", "b"), character())
Collaborator:

I don't think we should have an empty character at the end - it should simply be discarded.

So discard_remainder = TRUE should mean that the remainder is simply gone. The exception would be when the document's total tokens are shorter than size, in which case there would be an empty document (empty in terms of tokens) returned.

Slightly inconsistent, but there is no use for empty character pads that I can think of.

An alternative to make it consistent is dropping documents, but that could cause us problems in other areas, since only a few tokens functions alter documents (sample, subset).

Collaborator Author:

This sounds like the old DFM construction method, where you were inserting artificial tokens for empty documents. The right solution is not to discard anything. Why do you want discard_remainder?

Collaborator:

Because if you are applying a method that requires equally-sized chunks, then you don't want an unequal chunk at the end (for the Mean Segmental TTR, for instance). As per the example of segmenting a three-token document with size = 5: with keep_remainder = FALSE, the chunk has length similar to

> 3 %/% 4
[1] 0

Collaborator:

But it might make sense to set the default to keep_remainder = TRUE.

Collaborator Author:

Where is the code for "Mean Segmental TTR"?

Collaborator:

Yes but the point is that the length of the split elements should be the same. That does not happen when some segments are remainders. Let's keep keep_remainder but set the default to TRUE.

Collaborator Author:

Why don't you do it using lengths(), which is pretty fast?

Collaborator Author:

You can also let textstat_lexdiv(x, "MTTR") return NA for documents shorter than a certain size.
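
A minimal post-processing sketch of that idea (toks_chunked and size are placeholder names for the chunked tokens and the chunk size; textstat_lexdiv() itself is left unchanged):

# score every chunk, then set scores for chunks shorter than size to NA,
# rather than dropping them or averaging them in
lex <- textstat_lexdiv(toks_chunked, "TTR")
lex$TTR[ntoken(toks_chunked) < size] <- NA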

Collaborator:

Yes, it should, but it also should not average in segments shorter than the specified size.

Anyway more generally on chunks: it’s fundamentally an integer division operation, and integer divisions have a quotient and a modulus. The keep_remainder is to keep the modulus. Why would we not provide that option?
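
The integer-division framing in plain R, for illustration:

ntok <- 11     # tokens in a document
size <- 4      # chunk size
ntok %/% size  # 2: number of full-size chunks (the quotient)
ntok %%  size  # 3: leftover tokens (the modulus, i.e. the remainder)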

Collaborator Author:

I don't think we need that because I cannot imagine any other use-case, and it can be done just by textstat_lexdiv(toks[lengths(toks) >= window], "TTR").

@kbenoit (Collaborator) commented Dec 15, 2018

proposal: tokens_reshape()

OK, been thinking about this more, and it occurs to me that if we have trouble designing a function, then maybe there is something that needs rethinking. I've also always been a bit unsure about defining a new function that purports to operate on tokens, but then redefines documents. We have just two functions that do that (except select and sample functions): *_segment() and *_reshape(). We have a tokens_segment() that is analogous to corpus_segment(), which splits tokens into new documents based on pattern matches. This keeps track of the document ID through a special docvar, and has the potential to be reshaped back to the "documents" unit using this special docvar. But we never defined a tokens_reshape(), probably because once a document is tokenized, we no longer have units easily identifiable as sentences, paragraphs, etc.

However, we could implement the tokens_reshape() units in terms of "shingles" or sequence sizes, including a variable overlap argument. This would avoid our debate about whether a remainder should be discarded or kept, and make it unnecessary to have empty remainders versus no remainders. In tokens_reshape(), we would always keep the remainder, along with the identifying information about which document and segment it is.

tokens_reshape(x, size = NULL, overlap = 0, to = c("segments", "documents"), use_docvars = TRUE)

Where overlap < size, so you could get successive segments with ngram-type overlaps, e.g. "a b c d" becomes "a b c" "b c d" for W-shingles, etc., or with overlap = 1 we get the behaviour of the current draft overlap = TRUE.
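
A plain-R sketch of that windowing logic (chunk_with_overlap() is a hypothetical helper, not the quanteda implementation), showing where the sub-size trailing windows come from:

# successive windows of size tokens, each advancing by (size - overlap) positions
chunk_with_overlap <- function(x, size, overlap = 0) {
  stopifnot(overlap < size)
  starts <- seq(1, length(x), by = size - overlap)
  lapply(starts, function(i) x[i:min(i + size - 1, length(x))])
}

chunk_with_overlap(c("a", "b", "c", "d"), size = 3, overlap = 2)

With size = 3 and overlap = 2 this reproduces the "a b c" / "b c d" shingles from the example, plus the trailing windows "c d" and "d", which are exactly the sub-size remainders discussed above.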

size is required except when to = "documents".

For MSTTR we just segment to size and tokens_subset(x, ntoken(x) == size) to discard the remainders.

Note that this also provides a solution for the MATTR (see #1508), which is a moving window across tokens. Other uses could be special similarity measures of the type that motivated adding the overlap = TRUE argument, and other statistical measures where we need segments or moving-window samples across tokens.

And as a bonus, if we use a tokenizer that also records sentence IDs (e.g. spaCy's tokenizer/parser), we could add "sentences" to the reshape target list (as we already have for corpus_reshape()).

What do you think?

@koheiw (Collaborator, Author) commented Dec 15, 2018

I think the name tokens_chunk() is not great, but still better than tokens_reshape(). corpus_reshape() is a function that I want to get rid of in the future by providing more flexible token selection functions, so I don't want to model this one on it.

I also think dfm_group() is enough for reconstructing original document units. Concatenating overlapped documents (even if duplicates are removed) with tokens_reshape() does not make sense to me either.

@koheiw (Collaborator, Author) commented Dec 16, 2018

I like these.

tokens_chunk(x, size, overlap = 0, use_docvars = TRUE)
# or 
tokens_chunk(x, size, move = size, use_docvars = TRUE)

@koheiw (Collaborator, Author) commented Dec 18, 2018

I am not sure why @kbenoit approved this PR before seeing a version that reflects the discussion. The latest version is

tokens_chunk(x, size, overlap = 0, use_docvars = TRUE)

I will click the green button if you are OK with this.

@kbenoit dismissed their stale review December 18, 2018 07:06: "Function has changed since approval"

@kbenoit (Collaborator) left a review comment:

Let's discuss this first, since it has the potential to be a useful new general function, but it needs to fit in consistently with existing functions. It brings us back to the tricky debate we had before 1.0, when we tried to define the difference between corpus_reshape() and corpus_segment(), because both redefined document units. tokens_chunk() effectively does the same as those two, which is to split the existing document units into more documents.

tokens/corpus_segment(x, extract_pattern = TRUE) not only segments/chunks the existing documents but also changes their content, by removing the pattern and putting it into a docvar. That cannot be recovered, unlike the reshaping of units using corpus_reshape(x, to = "documents").

With the latest state of tokens_chunk(), we have the potential to radically redefine the units when overlap > 0, or mildly when remainders are dropped.

Issues then in my mind to resolve are:

  • what to call it and how, if at all, to integrate it with the existing _segment() or _reshape() framework. (And BTW, I like corpus_reshape() and use it all the time!)

  • I would like to restore the keep_remainder = TRUE option. This not only makes it explicit that the chunks are likely to be uneven at the end, but also lets us keep the operation to keep or discard the smaller-than-size remainder within the function, reducing user effort and potential error. We have lots of such arguments elsewhere, and the cost is virtually nil in terms of added complexity. It also allows us to structure the return values and thereby enforce any reshaping etc. that we might want to perform, since no document would be dropped, even if it is empty (see next point).

  • For the return value when keep_remainder = FALSE, I think the behaviour should be the following:

toks <- tokens(c(d1 = "a b c d e", d2 = "a b c"))
tokens_chunk(toks, size = 4, keep_remainder = FALSE)
# tokens from 2 documents.
# d1.1 :
# [1] "a" "b" "c" "d"
# 
# d2.1 :
# character(0)

@koheiw (Collaborator, Author) commented Dec 18, 2018

I consider this function to be for advanced users who have no problem with removing short objects. I also cannot think of any use case for keep_remainder = TRUE other than textstat_lexdiv(). Are short chunks remainders? Probably not, especially when chunks are overlapping.

@kbenoit (Collaborator) commented Dec 18, 2018

I will try to work around this in my aggregation functions for lexdiv, and probably can, but just selecting the chunked tokens on their lengths means some documents will disappear entirely.

To get the behaviour requested in my last comment above, you need to be a pretty advanced user!

size <- 4
toks <- tokens(c(d1 = "a b c d e", d2 = "a b c")) %>%
  tokens_chunk(size = 4)

library("data.table")
remove_remainders <- function(x, size) {
  # figure out remainders and non-end remainders
  dt <- data.table(
    docnames = docnames(x),
    attr(x, "docvars")
  )
  dt[, remainder := (lengths(x) < size)]
  dt[, onlyremainder := (.N == 1), by = "_docid"]

  # select non-end remainders
  dt <- dt[!remainder | onlyremainder]
  x <- x[dt[, docnames]]

  # delete tokens from the only remainder ends
  att <- attributes(x)
  x <- unclass(x)
  x[[ dt[(onlyremainder), docnames] ]] <- integer(0)
  attributes(x) <- att
  x
}

toks %>%
  remove_remainders(size)
## tokens from 2 documents.
## d1.1 :
## [1] "a" "b" "c" "d"
## 
## d2.1 :
## character(0)

@kbenoit (Collaborator) left a review comment:

Last question: should we call the second argument overlap_size? I think size is ok, as we also use it in other functions, although it could be chunk_size.

@jiongweilua (Collaborator) commented:

I have not chimed in on this as I think the conversation above requires more contextual knowledge about how quanteda's functions have historically developed than I possess.

But @kbenoit, regarding your final question about argument names, my input as a relatively new user is that the combination of chunk_size and overlap_size will be most unambiguous.

tokens_chunk(x, chunk_size = 3, overlap_size = 0, use_docvars = TRUE)

@koheiw (Collaborator, Author) commented Dec 19, 2018

@kbenoit I thought you understood what I meant by "use split with factor levels", but probably you did not. Here is the code:

txt <- c(long = "a a b s e c d d e a b a b e s", short = "a b")
toks <- tokens(txt)
toks_seg <- tokens_chunk(toks, 3)
toks_seg <- toks_seg[lengths(toks_seg) >= 3] # drop short segments
lex <- textstat_lexdiv(toks_seg, "TTR")
lex <- split(lex$TTR, factor(attr(toks_seg, "docvars")[["_document"]], 
                                levels = docnames(toks)))
lex
# $long
# [1] 0.6666667 1.0000000 0.6666667 0.6666667 1.0000000
# 
# $short
# numeric(0)

"short" document is still there even if its segments are all removed from toks_seg.

@kbenoit (Collaborator) commented Dec 19, 2018

Yes, of course that's a good workaround for the lexdiv issue, but I was illustrating the complexity of the general way to get the outcome of chunking I'd pointed to above. Anyway, I've approved the PR as is now.

Merge if you are ok with the argument names, but give it a think.

@kbenoit merged commit 612718b into master Dec 20, 2018
@kbenoit deleted the dev-tokens_chunck branch December 20, 2018 07:46