-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 1840 #2045
Issue 1840 #2045
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2045 +/- ##
==========================================
+ Coverage 95.17% 95.25% +0.08%
==========================================
Files 71 71
Lines 3979 4028 +49
==========================================
+ Hits 3787 3837 +50
+ Misses 192 191 -1
Continue to review full report at Codecov.
|
Overall my sense is that if the kwic object now includes the full set of tokens as an attribute. Isn't this pretty wasteful in terms of space? There are ways to minimise the break with previous behaviours, by letting the user specify
And see my comment above about |
Thinking about this more, the PR essentially builds a token position index for the search terms, and includes the entire token set. This is like a single-use indexing, but means that with two kwic objects - common for instance in our tutorials and examples where these are plotted together - each kwic object is greater than the original tokens object. This is just one of the features of the functional, pass-by-value approach of R and quanteda. In another language, we might attach searched terms to the tokens object, and make these available, but that's very different from our existing approach in the package. This could be solved if we returned to an idea we discussed early on, which is that a corpus could be processed to include tokens and indexes to all types in the document. Then processing a kwic would be fast since it would just find the indexed tokens and extract their context. It would be slow, but only once, when the corpus is created (or modified) or optionally, processed following creation of the existing type of corpus we have. And "creating" a tokens object would then not involve the tokenisation. |
The tokens object is a subset of original: attr(result, "tokens") <- x[unique(result$docname)] As far as I know, R does not make copy unless changes are made, so attaching the original |
Looks like it does make a full copy, see below. Is there a performance gain by copying the full tokens for documents with a match? library("quanteda")
## Package version: 2.9.9000
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
library("pryr")
## Registered S3 method overwritten by 'pryr':
## method from
## print.bytes Rcpp
toks <- tokens(data_corpus_inaugural)
object_size(toks)
## 1.32 MB
kw <- kwic(toks, "a")
object_size(kw)
## 1.37 MB
object_size(attr(kw, "tokens"))
## 1.32 MB
object_size(toks, kw)
## 2.06 MB Created on 2021-02-10 by the reprex package (v1.0.0) |
If full tokens is attached to kwic, the attribute only point to the original. If tokens is sliced, new object is created. Unless kwic object is saved in RDS, the former approach is more efficient. > require(pryr)
> toks <- tokens(data_corpus_inaugural)
> kw <- kwic(toks, "a")
>
> address(toks)
[1] "0x5605cfe02630"
> attr(kw, "tokens") <- toks
> toks_out <- attr(kw, "tokens")
> address(toks_out)
[1] "0x5605cfe02630"
>
> attr(kw, "tokens") <- toks[1:10]
> toks_out2 <- attr(kw, "tokens")
> address(toks_out2)
[1] "0x5605d1711988" |
OK, looks like this is done, so here are my comments on the current PR. Overall, this looks fine, I can see the usefulness, although it's unclear how much of a performance difference this creates. To the extent that However this breaks too much, including our own code, much user code, and one of our packages. We can virtually eliminate this incompatibility however through some small changes.
|
I am OK with restoring corpus.kwic() for deprecation and restore the arguments. I would be nice to add |
- adds separator and window to kwic - adds tests - restore corpus.kwic() and tests
- also adds tests
OK updates done, with some small improvements and stronger tests. I'm happy to add |
#' | ||
#' corp <- corpus(txt) | ||
#' kwic(corp, c("is", "a"), valuetype = "fixed", separator = "", remove_separators = FALSE) | ||
#' print(kwic(corp, "is"), separator = "") | ||
#' @export | ||
kwic.tokens <- function(x, pattern, window = 5, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we deprecate window and separator to make clear that kwic()
is only for locating patters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say no, simply because this is the way that users have gotten used to using kwic since it was first written. It's also more natural to put this into the function that creates the kwic. We can make it clear that this can be changed using as.data.frame()
or print()
, but most people using quanteda probably don't want to type print()
to see output, they just want to see default output. My experience with kwic()
in particular is that this is a function most used by our most qualitative and interactive users.
When I use kwic()
or teach this, I also prefer to set it at the time we call it. And this is not like what we've discussed about removing the convenience options from dfm()
or removing dfm.character()
, since those are a form of bad practice, skipping a step that the user should be taking more direct responsibility for. Rather, the window
and separator
options belong to kwic()
and are natural to associate with creating a kwic()
, but our implementation has now disassociated itself from needing to set this at creation time. But that's more about our current implementation than what the action implemented by the function calls for.
One more reason: If we do change kwic()
again in the future, it might make more sense to set these at creation time.
So I think it makes sense to keep them without warnings. Most users will happily never know the difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes more sense to have window
in kwic()
but I am not comfortable with the same argument in the different places. How about let users to call kwic.kwck()
to change window
or separator
? max_nrow
should still be in print()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kwic.kwic()
makes no sense unless it would be possible to search a kwic()
for keywords - although we could make it do so since we have the tokens stored inside it. I think that's an edge case however that will create more confusion than the benefit that might come from using such a method to (re)set window or separator.
I think it's ok to keep window and separator in both the constructor and the print and conversion methods. We just have to make it clear that the default is set when the kwic is created, but can be changed at the print or data.frame stage. My guess is that it will be a very rare user that wants to do so, so in the vast majority of cases for the vast majority of users, the kwic functions will simply function as they have before.
For print.kwic()
I'd suggest that our default be max_nrow = -1
. Want me to implement that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's take out window
or separator
from print for now. My understanding is that people do not want to run kwic()
multiple times because it was really slow when there are many matches, but it is no loner so was far as the resulting object is subset. Let set max_nrow = 1000
or a bit more so that kwic()
does not freeze.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about why we have to make a decision to add these argument to either or both kwic()
or print()
... I think my design to make print.kwic()
was the mistake. kwic()
should be the print function that creates the concordance view.
What I should have done is creating a function that might be call search()
or locate()
that returns the positions of matches (the same as what kwic.tokens()
does in this branch) and calling this function in kwic()
. If we add a special class to this object (like "position"), we can make a kwic.position()
. The object can be useful for information retrieval applications too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would work too. But the current structure is fine. Let's do this:
- Keep
kwic()
as in this PR. - Remove
window
andseparator
fromprint.kwic()
, but keep these foras.data.frame.kwic()
. - Add
max_nrow = 1000
toprint.kwic()
and make it work like the other object print functions (with the "x more rows..." output). - Not implement yet the other functions you mention above, but keep in mind that these can be added when and if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. That is the middle way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I did all of the above in 13fa3ea. This implements the same print options now for kwic as for other core objects, including show_summary
. The arguments window
and separator
are kept in as.data.frame.kwic()
but removed from print.kwic()
(but I left the internal code for these intact in case we want to add them back).
We can remove the .corpus and .character methods in a separate PR, when we address #2058 more directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I did all of the above in 13fa3ea. This implements the same print options now for kwic as for other core objects, including show_summary
. The arguments window
and separator
are kept in as.data.frame.kwic()
but removed from print.kwic()
(but I left the internal code for these intact in case we want to add them back).
We can remove the .corpus and .character methods in a separate PR, when we address #2058.
Merge branch 'master' into issue-1840 # Conflicts: # R/kwic.R # R/pattern2fixed.R # R/utils.R # man/is_regex.Rd # man/pattern2list.Rd
I checked how View(kwic(corp, pattern = "secure*", valuetype = "glob", window = 3)) # shows only keywords and positions
View(print(kwic(corp, pattern = "secure*", valuetype = "glob", window = 3))) # shows nothing
View(as.data.frame(kwic(corp, pattern = "secure*", valuetype = "glob", window = 3))) # shows concordance view It is not necessary to define
|
That all sounds reasonable, but hard to follow as this PR keeps shifting. To get View.data.frame to work with a raw kwic object, we really only need to add the pre and post columns to the kwic object. If you want them viewable from the kwic object, just add the columns from the window at the kwic creation step. We could keep the tokens attribute so that these could still be changed in How do you want to proceed? |
I prefer to make different (exported) function only to locate pattern matches in C++, and class the object differently from "kwic" (say "matches"). Then we make In this way, we can fulfill requests like in #2063, and avoid conflict with old KWIC objects. We can move this discussion to an issue page and create a new branch if you like. |
OK - if that's based on this branch, then I suggest we merge this branch, and then work on a new branch for the changes you suggest. We have a lot of improvements in this PR including a better approach, in my view, for separating the construction of the kwic from its conversion and printing functions. However I'm happy to add the pre and post columns to the data.frame part of the kwic object for now, as suggested above (#2045 (comment)), if you think we should add that back to the kwic object in this branch before merging. |
Separated text search (
kwic
) and concordance view generationprint.kwic()
to address #2038 and #1840. This patch needs final adjustment, but difficult part is already done.