Add tokens_chunk() #1520
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #1520      +/-   ##
==========================================
+ Coverage   89.79%   89.87%   +0.08%
==========================================
  Files         103      105       +2
  Lines        7752     7816      +64
==========================================
+ Hits         6961     7025      +64
  Misses        791      791
Add (mostly) R version for comparison
I added an argument to truncate the (uneven) remainder, and put my version of the code in (not exported) for comparison. Not too bad...! It's the insert_values plus …
toks <- tokens(data_corpus_inaugural)
microbenchmark::microbenchmark(
tokens_chunk(toks, size = 10),
quanteda:::tokens_chunk2(toks, size = 10)
)
## Unit: milliseconds
##                                       expr      min       lq     mean   median       uq     max neval
##             tokens_chunk(toks, size = 10) 15.92971 18.84621 20.85797 19.88384 22.04475  39.724   100
## quanteda:::tokens_chunk2(toks, size = 10) 15.58881 18.45565 21.38188 19.59377 21.92013 132.733   100
microbenchmark::microbenchmark(
tokens_chunk(toks, size = 100),
quanteda:::tokens_chunk2(toks, size = 100)
)
## Unit: milliseconds
##                                        expr       min       lq     mean   median       uq      max neval
##             tokens_chunk(toks, size = 100) 10.360761 12.50879 16.61262 13.59635 16.30358 34.88291   100
## quanteda:::tokens_chunk2(toks, size = 100)  9.856674 12.33062 15.31867 13.32331 15.31451 36.81684   100
microbenchmark::microbenchmark(
tokens_chunk(toks, size = 1000),
quanteda:::tokens_chunk2(toks, size = 1000)
)
## Unit: milliseconds
##                                         expr      min       lq     mean   median       uq      max neval
##             tokens_chunk(toks, size = 1000) 5.800675 6.397960 9.573177 7.634302 9.864545 100.3539   100
## quanteda:::tokens_chunk2(toks, size = 1000) 5.983992 6.370365 8.502899 7.593575 9.506097  19.9341   100
Looks good (and fast!). While I am still not really sure about the overlap use case - which seems to be a special case of the sort of word shingles created by ngrams - I'm fine to leave it, as long as it's turned off by default.
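For intuition on the overlap point, a rough sketch comparing the two (the overlap argument follows this PR's discussion and may not match the merged API; tokens_ngrams() is the existing quanteda function):
library("quanteda")
toks <- tokens(c(d1 = "a b c d e"))
# hypothetical overlapping chunks using the argument proposed in this PR:
# tokens_chunk(toks, size = 3, overlap = 2)  # intended: "a b c", "b c d", "c d e" as separate documents
# the analogous word shingles from the existing ngram function, where each
# window is concatenated into a single token within the same document:
tokens_ngrams(toks, n = 3, concatenator = " ")
## d1: "a b c" "b c d" "c d e"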
I added a test that should pass, which traps an error caused when size > any(lengths(x)). When discard_remainder = TRUE, then for a document whose length is less than size, we should return an empty tokens set for that document, as this does:
tokens_remove(tokens("a a"), "a")
## tokens from 1 document.
## text1 :
## character(0)
That is easy, but I'm not sure about the logic behind this:
expect_identical(
    as.list(tokens_chunk(toks, size = 4, discard_remainder = TRUE)),
    list(d1 = c("a", "b", "c", "d"),
         d2 = character(0))
)
It should be
expect_identical(
    as.list(tokens_chunk(toks, size = 4, discard_remainder = TRUE)),
    list(d1 = c("a", "b", "c", "d"))
)
or
expect_identical(
    as.list(tokens_chunk(toks, size = 4, discard_remainder = TRUE)),
    list(d1.1 = c("a", "b", "c", "d"),
         d1.2 = character(0),
         d2.1 = character(0))
)
I'd be ok with the last one. We should not be letting this function drop documents, meaning it changes the unique entries in the …
Then shouldn't …?
Chunking to a fixed size implies that each chunk will be of that size. By default the uneven remainder would be discarded. Padding, at least for me, implies adding something in place of what exists or is left, as we do with removed tokens. Plus, …
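For comparison, this is what padding currently does in quanteda when removing tokens (a small illustration):
library("quanteda")
# padding = TRUE leaves an empty-string placeholder where each removed token was,
# so token positions and document lengths are preserved
as.list(tokens_remove(tokens("a b c"), "b", padding = TRUE))
## $text1
## [1] "a" ""  "c"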
OK "padding" is not the word, but why not "remove". There is no "discard" anywhere else as far as I can remember. Is "discard" suggest an action to empty documents? Then we could (possible but not really) make |
I thought "discard" was more accurate than remove, since remove suggests to get rid of something that exists or is searched for, while discard means to get rid of something if and only if it exists. But that's not clear. This is a very specific case that comes from forming fixed-length chunks, so I think it's ok to make it specific here. So how about changing it to |
I'm happy with this now if you are.
tests/testthat/test-tokens_chunk.R
@@ -71,19 +71,19 @@ test_that("tokens_chunk works", {
     expect_is(tokens_chunk(toks, size = 3), "tokens")
     expect_equivalent(
         as.list(tokens_chunk(toks, 3, discard_remainder = TRUE)),
-        list(c("a", "b", "c"), c("d", "e", "f"), c("a", "a", "b"))
+        list(c("a", "b", "c"), c("d", "e", "f"), c("a", "a", "b"), character())
I don't think we should have an empty character at the end - it should be simply discarded.
So discard_remainder = TRUE should mean that the remainder is simply gone. The exception would be when the document's total tokens are fewer than size, in which case there would be an empty document (empty in terms of tokens) returned.
Slightly inconsistent, but there is no use for empty character pads that I can think of.
An alternative to make it consistent is dropping documents, but that could cause us problems in other areas, since only a few tokens functions alter documents (sample, subset).
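A sketch of the behaviour being described, using the argument names from this discussion (illustrative only; not necessarily the merged API):
library("quanteda")
toks <- tokens(c(d1 = "a b c d e f g", d2 = "a b"))
# hypothetical: tokens_chunk(toks, size = 3, discard_remainder = TRUE)
# d1 (7 tokens) -> two full chunks "a b c" and "d e f"; the 1-token remainder "g" is dropped
# d2 (2 tokens) -> shorter than size, so a single empty (0-token) document is kept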
This sounds like the old DFM construction method where you were inserting artificial tokens for empty documents. The right solution is not discarding anything. Why do you want discard_remainder?
Because if you are applying a method that requires equally sized chunks, then you don't want an unequal chunk at the end (for the Mean Segmental TTR, for instance). As per the example of segmenting a three-token document with size = 5, with keep_remainder = FALSE the number of chunks is analogous to integer division:
> 3 %/% 4
[1] 0
but it might make sense to set the default to keep_remainder = TRUE.
Where is the code for "Mean Segmental TTR"?
Yes, but the point is that the length of the split elements should be the same. That does not happen when some segments are remainders. Let's keep keep_remainder but set the default to TRUE.
Why don't you do it using lengths(), which is pretty fast?
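(A tokens object is a list of token vectors under the hood, so base lengths() returns the per-document token counts directly:)
library("quanteda")
lengths(tokens(c(d1 = "a b c", d2 = "a")))
## d1 d2
##  3  1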
You can also let textstat_lexdiv(x, "MTTR") return NA for documents shorter than a certain size.
Yes, it should, but it also should not average in segments shorter than the specified size.
Anyway, more generally on chunks: it's fundamentally an integer division operation, and integer divisions have a quotient and a modulus. keep_remainder is to keep the modulus. Why would we not provide that option?
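In plain R terms, with made-up numbers:
ntok <- 10    # tokens in a document
size <- 3     # chunk size
ntok %/% size # 3 full chunks (the quotient)
ntok %%  size # 1 leftover token (the modulus), which keep_remainder would keep or drop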
I don't think we need that, because I cannot imagine any other use case, and it can be done just by textstat_lexdiv(toks[lengths(toks) >= window], "TTR").
proposal:
…
I think the name … I also think …
I like these:
tokens_chunk(x, size, overlap = 0, use_docvars = TRUE)
# or
tokens_chunk(x, size, move = size, use_docvars = TRUE)
Add use_docvars
Change overlap to integer
I am not sure why @kbenoit approved this PR before seeing a version that reflects the discussion. The latest version is
tokens_chunk(x, size, overlap = 0, use_docvars = TRUE)
I will click the green button if you are OK with this.
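For reference, a usage sketch of that signature (argument names taken from this thread; the merged function may differ, so the calls are shown only as hedged comments):
library("quanteda")
toks <- tokens(c(d1 = "a b c d e f g"))
# hypothetical calls under the signature discussed above:
# tokens_chunk(toks, size = 3)                       # consecutive 3-token chunks: d1.1, d1.2, ...
# tokens_chunk(toks, size = 3, overlap = 1)          # adjacent chunks share 1 token
# tokens_chunk(toks, size = 3, use_docvars = FALSE)  # do not copy document variables to the chunks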
Let's discuss this first, since it has the potential to be a useful new general function, but it needs to fit in consistently with existing functions. It brings us back to the tricky debate we had before 1.0, when we tried to define the difference between corpus_reshape() and corpus_segment(), because both redefined document units. tokens_chunk() effectively does the same as those two, which is to split the existing document units into more documents.
tokens/corpus_segment(x, extract_pattern = TRUE) not only segments/chunks the existing documents but also changes their content, by removing the pattern and putting it into a docvar. That cannot be recovered, unlike the reshaping of units using corpus_reshape(x, to = "documents").
With the latest state of tokens_chunk(), we have the potential to radically redefine the units when overlap > 0, or mildly when remainders are dropped.
Issues then in my mind to resolve are:
- what to call it and how, if at all, to integrate it with the existing _segment() or _reshape() framework (and BTW I like corpus_reshape() and use it all the time!)
- I would like to restore the keep_remainder = TRUE option. This not only makes it explicit that the chunks are likely to be uneven at the end, but also lets us keep the operation to keep or discard the smaller-than-size remainder within the function, reducing user effort and potential error. We have lots of such arguments elsewhere, and the cost in added complexity is virtually nil. It also allows us to structure the return values and thereby enforce any reshaping etc. that we might want to perform, since no document would be dropped, even if it is empty (see next point).
- For the return value when keep_remainder = FALSE, I think the behaviour should be the following:
toks <- tokens(c(d1 = "a b c d e", d2 = "a b c"))
tokens_chunk(toks, size = 4, keep_remainder = FALSE)
# tokens from 3 documents.
# d1.1 :
# [1] "a" "b" "c" "d"
#
# d2.1 :
# character(0)
I consider this function to be for advanced users who have no problem with removing short objects. I also cannot think of any use case of …
I will try to work around this in my aggregation functions for lexdiv, and probably can, but just selecting the chunked tokens on their lengths means some documents will disappear entirely. To get the behaviour requested in my last comment above, you need to be a pretty advanced user!
size <- 4
toks <- tokens(c(d1 = "a b c d e", d2 = "a b c")) %>%
    tokens_chunk(size = 4)
library("data.table")
remove_remainders <- function(x, size) {
    # figure out remainders and non-end remainders
    dt <- data.table(
        docnames = docnames(x),
        attr(x, "docvars")
    )
    dt[, remainder := (lengths(x) < size)]
    dt[, onlyremainder := (.N == 1), by = "_docid"]
    # select non-end remainders
    dt <- dt[!remainder | onlyremainder]
    x <- x[dt[, docnames]]
    # delete tokens from the only remainder ends
    att <- attributes(x)
    x <- unclass(x)
    x[[ dt[(onlyremainder), docnames] ]] <- integer(0)
    attributes(x) <- att
    x
}
toks %>%
    remove_remainders(size)
## tokens from 2 documents.
## d1.1 :
## [1] "a" "b" "c" "d"
##
## d2.1 :
## character(0)
Last question: should we call the second argument overlap_size? I think size is ok, as we also use it in other functions, although it could be chunk_size.
I have not chimed in on this, as I think the conversation above requires more contextual knowledge about how quanteda's functions have historically developed than I possess. But @kbenoit, regarding your final question about argument names, my input as a relatively new user is that the combination of
tokens_chunk(x, chunk_size = 3, overlap_size = 0, use_docvars = TRUE)
…
@kbenoit I thought you understood what I meant by "use …":
txt <- c(long = "a a b s e c d d e a b a b e s", short = "a b")
toks <- tokens(txt)
toks_seg <- tokens_chunk(toks, 3)
toks_seg <- toks_seg[lengths(toks_seg) >= 3] # drop short segments
lex <- textstat_lexdiv(toks_seg, "TTR")
lex <- split(lex$TTR, factor(attr(toks_seg, "docvars")[["_document"]],
                             levels = docnames(toks)))
lex
# $long
# [1] 0.6666667 1.0000000 0.6666667 0.6666667 1.0000000
#
# $short
# numeric(0)
The "short" document is still there even if its segments are all removed from …
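The part that keeps the empty group is plain base R: split() retains unused factor levels by default, which is why passing levels = docnames(toks) preserves documents whose segments were all dropped. A minimal illustration:
# split() keeps all factor levels even when no values map to them
split(numeric(0), factor(character(0), levels = c("long", "short")))
## $long
## numeric(0)
##
## $short
## numeric(0)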
Yes, of course that's a good workaround for the lexdiv issue, but I was illustrating the complexity of the general way to get the outcome of chunking I'd pointed to above. Anyway, I've approved the PR as is now. Merge if you are ok with the argument names, but give it a think.
Fairly fast already