Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 1840 #2045

Merged
merged 41 commits into from
Feb 25, 2021
Merged

Issue 1840 #2045

merged 41 commits into from
Feb 25, 2021

Conversation

koheiw
Copy link
Collaborator

@koheiw koheiw commented Feb 2, 2021

Separated text search (kwic) and concordance view generation print.kwic() to address #2038 and #1840. This patch needs final adjustment, but difficult part is already done.

@codecov
Copy link

codecov bot commented Feb 2, 2021

Codecov Report

Merging #2045 (cac9e38) into master (5af26da) will increase coverage by 0.08%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2045      +/-   ##
==========================================
+ Coverage   95.17%   95.25%   +0.08%     
==========================================
  Files          71       71              
  Lines        3979     4028      +49     
==========================================
+ Hits         3787     3837      +50     
+ Misses        192      191       -1     
Impacted Files Coverage Δ
R/dictionaries.R 93.78% <ø> (ø)
R/docvars.R 95.04% <ø> (ø)
R/meta.R 87.00% <ø> (ø)
R/pattern2fixed.R 93.79% <ø> (ø)
R/utils.R 98.46% <ø> (ø)
R/corpus.R 100.00% <100.00%> (ø)
R/kwic.R 98.94% <100.00%> (+3.11%) ⬆️
R/quanteda_options.R 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5af26da...1b3ec18. Read the comment docs.

@koheiw koheiw requested a review from kbenoit February 2, 2021 08:58
R/corpus.R Show resolved Hide resolved
@kbenoit
Copy link
Collaborator

kbenoit commented Feb 6, 2021

Overall my sense is that if the kwic object now includes the full set of tokens as an attribute. Isn't this pretty wasteful in terms of space?

There are ways to minimise the break with previous behaviours, by letting the user specify separator and window in kwic(), and passing these through via attributes to print.kwic() and as.data.frame.kwic() as defaults.

print.kwic() needs to be documented.

View.kwic() is no longer functioning correctly.

And see my comment above about corpus.kwic().

@kbenoit
Copy link
Collaborator

kbenoit commented Feb 7, 2021

Thinking about this more, the PR essentially builds a token position index for the search terms, and includes the entire token set. This is like a single-use indexing, but means that with two kwic objects - common for instance in our tutorials and examples where these are plotted together - each kwic object is greater than the original tokens object. This is just one of the features of the functional, pass-by-value approach of R and quanteda. In another language, we might attach searched terms to the tokens object, and make these available, but that's very different from our existing approach in the package.

This could be solved if we returned to an idea we discussed early on, which is that a corpus could be processed to include tokens and indexes to all types in the document. Then processing a kwic would be fast since it would just find the indexed tokens and extract their context. It would be slow, but only once, when the corpus is created (or modified) or optionally, processed following creation of the existing type of corpus we have. And "creating" a tokens object would then not involve the tokenisation.

@koheiw
Copy link
Collaborator Author

koheiw commented Feb 9, 2021

The tokens object is a subset of original:

attr(result, "tokens") <- x[unique(result$docname)]

As far as I know, R does not make copy unless changes are made, so attaching the original x would actually save RAM space.

@kbenoit
Copy link
Collaborator

kbenoit commented Feb 10, 2021

Looks like it does make a full copy, see below.

Is there a performance gain by copying the full tokens for documents with a match?

library("quanteda")
## Package version: 2.9.9000
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library("pryr")
## Registered S3 method overwritten by 'pryr':
##   method      from
##   print.bytes Rcpp

toks <- tokens(data_corpus_inaugural)
object_size(toks)
## 1.32 MB

kw <- kwic(toks, "a")
object_size(kw)
## 1.37 MB
object_size(attr(kw, "tokens"))
## 1.32 MB

object_size(toks, kw)
## 2.06 MB

Created on 2021-02-10 by the reprex package (v1.0.0)

@koheiw
Copy link
Collaborator Author

koheiw commented Feb 15, 2021

If full tokens is attached to kwic, the attribute only point to the original. If tokens is sliced, new object is created. Unless kwic object is saved in RDS, the former approach is more efficient.

> require(pryr)
> toks <- tokens(data_corpus_inaugural)
> kw <- kwic(toks, "a")
> 
> address(toks)
[1] "0x5605cfe02630"
> attr(kw, "tokens") <- toks
> toks_out <- attr(kw, "tokens")
> address(toks_out)
[1] "0x5605cfe02630"
> 
> attr(kw, "tokens") <- toks[1:10]
> toks_out2 <- attr(kw, "tokens")
> address(toks_out2)
[1] "0x5605d1711988"

@kbenoit
Copy link
Collaborator

kbenoit commented Feb 16, 2021

OK, looks like this is done, so here are my comments on the current PR.

Overall, this looks fine, I can see the usefulness, although it's unclear how much of a performance difference this creates. To the extent that as.data.frame() on a kwic object returns what used to be the main output object, I think this is fine, and it gives us more flexibility to redefine the kwic structure further if we want, since we are no longer required to maintain this as a data.frame object, but can produce that through coercion. This is a good move and consistent with our approach elsewhere.

However this breaks too much, including our own code, much user code, and one of our packages. We can virtually eliminate this incompatibility however through some small changes.

  1. Add back window and separator to kwic(). These won't do anything in the object but can be passed to print.kwic() and as.data.frame.kwic() as defaults, with the print methods being able to override them, so we get the best of this PR's new approach but compatibility with and the convenience of the former approach. This means we don't have to change the examples either. (The print() looks unwieldy and reminds me too much of Python.)

  2. This broke textplots::textplot_xray(), but I've already managed to fix this in a way that works for the pre-v3 kwic objects as well as the v3 kwic objects. But as a note, we always need to think about what we are breaking in our own quanteda-verse packages when we change the core package.

  3. I'd like to fix and undelete corpus.kwic(), which I don't see as doing any harm. If we decide to sentence it to death then it should at least be deprecated and made internal before removing it.

@koheiw
Copy link
Collaborator Author

koheiw commented Feb 16, 2021

I am OK with restoring corpus.kwic() for deprecation and restore the arguments. I would be nice to add max_nrow to print.kwic(). I will leave these with you. have done my part on this branch.

@kbenoit
Copy link
Collaborator

kbenoit commented Feb 17, 2021

OK updates done, with some small improvements and stronger tests. kwic() now works exactly as before, from the user's standpoint, but with the new machinery. (See the added tests for details.) window and separator can be specified upon creation but can be changed in either print.kwic() or as.data.frame.kwic().

I'm happy to add max_nrow to kwic. This would make it work as do our other print methods, but those by default are pretty
conservative. We'd probably want to decide what is a suitable default, since most people are going to want to see the whole thing. I suggest we merge this PR and make this a new issue.

R/kwic.R Show resolved Hide resolved
R/kwic.R Outdated Show resolved Hide resolved
#'
#' corp <- corpus(txt)
#' kwic(corp, c("is", "a"), valuetype = "fixed", separator = "", remove_separators = FALSE)
#' print(kwic(corp, "is"), separator = "")
#' @export
kwic.tokens <- function(x, pattern, window = 5,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we deprecate window and separator to make clear that kwic() is only for locating patters?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say no, simply because this is the way that users have gotten used to using kwic since it was first written. It's also more natural to put this into the function that creates the kwic. We can make it clear that this can be changed using as.data.frame() or print(), but most people using quanteda probably don't want to type print() to see output, they just want to see default output. My experience with kwic() in particular is that this is a function most used by our most qualitative and interactive users.

When I use kwic() or teach this, I also prefer to set it at the time we call it. And this is not like what we've discussed about removing the convenience options from dfm() or removing dfm.character(), since those are a form of bad practice, skipping a step that the user should be taking more direct responsibility for. Rather, the window and separator options belong to kwic() and are natural to associate with creating a kwic(), but our implementation has now disassociated itself from needing to set this at creation time. But that's more about our current implementation than what the action implemented by the function calls for.

One more reason: If we do change kwic() again in the future, it might make more sense to set these at creation time.

So I think it makes sense to keep them without warnings. Most users will happily never know the difference.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes more sense to have window in kwic() but I am not comfortable with the same argument in the different places. How about let users to call kwic.kwck() to change window or separator? max_nrow should still be in print().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kwic.kwic() makes no sense unless it would be possible to search a kwic() for keywords - although we could make it do so since we have the tokens stored inside it. I think that's an edge case however that will create more confusion than the benefit that might come from using such a method to (re)set window or separator.

I think it's ok to keep window and separator in both the constructor and the print and conversion methods. We just have to make it clear that the default is set when the kwic is created, but can be changed at the print or data.frame stage. My guess is that it will be a very rare user that wants to do so, so in the vast majority of cases for the vast majority of users, the kwic functions will simply function as they have before.

For print.kwic() I'd suggest that our default be max_nrow = -1. Want me to implement that?

Copy link
Collaborator Author

@koheiw koheiw Feb 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's take out window or separator from print for now. My understanding is that people do not want to run kwic() multiple times because it was really slow when there are many matches, but it is no loner so was far as the resulting object is subset. Let set max_nrow = 1000 or a bit more so that kwic() does not freeze.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about why we have to make a decision to add these argument to either or both kwic() or print()... I think my design to make print.kwic() was the mistake. kwic() should be the print function that creates the concordance view.

What I should have done is creating a function that might be call search() or locate() that returns the positions of matches (the same as what kwic.tokens() does in this branch) and calling this function in kwic(). If we add a special class to this object (like "position"), we can make a kwic.position(). The object can be useful for information retrieval applications too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would work too. But the current structure is fine. Let's do this:

  • Keep kwic() as in this PR.
  • Remove window and separator from print.kwic(), but keep these for as.data.frame.kwic().
  • Add max_nrow = 1000 to print.kwic() and make it work like the other object print functions (with the "x more rows..." output).
  • Not implement yet the other functions you mention above, but keep in mind that these can be added when and if needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. That is the middle way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I did all of the above in 13fa3ea. This implements the same print options now for kwic as for other core objects, including show_summary. The arguments window and separator are kept in as.data.frame.kwic() but removed from print.kwic() (but I left the internal code for these intact in case we want to add them back).

We can remove the .corpus and .character methods in a separate PR, when we address #2058 more directly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I did all of the above in 13fa3ea. This implements the same print options now for kwic as for other core objects, including show_summary. The arguments window and separator are kept in as.data.frame.kwic() but removed from print.kwic() (but I left the internal code for these intact in case we want to add them back).

We can remove the .corpus and .character methods in a separate PR, when we address #2058.

Merge branch 'master' into issue-1840

# Conflicts:
#	R/kwic.R
#	R/pattern2fixed.R
#	R/utils.R
#	man/is_regex.Rd
#	man/pattern2list.Rd
@koheiw
Copy link
Collaborator Author

koheiw commented Feb 23, 2021

I checked how kwic() works. I don't think users enjoy wrapping the function by as.data.frame() always to see the concordance view in Data Viewer.

View(kwic(corp, pattern = "secure*", valuetype = "glob", window = 3)) # shows only keywords and positions
View(print(kwic(corp, pattern = "secure*", valuetype = "glob", window = 3))) # shows nothing 
View(as.data.frame(kwic(corp, pattern = "secure*", valuetype = "glob", window = 3))) # shows concordance view

It is not necessary to define View.kwic() if kwic() returns concordance view (as I argued above).

What I should have done is creating a function that might be call search() or locate() that returns the positions of matches (the same as what kwic.tokens() does in this branch) and calling this function in kwic(). If we add a special class to this object (like "position"), we can make a kwic.position(). The object can be useful for information retrieval applications too.

@kbenoit
Copy link
Collaborator

kbenoit commented Feb 23, 2021

That all sounds reasonable, but hard to follow as this PR keeps shifting.

To get View.data.frame to work with a raw kwic object, we really only need to add the pre and post columns to the kwic object. If you want them viewable from the kwic object, just add the columns from the window at the kwic creation step.

We could keep the tokens attribute so that these could still be changed in print.kwic() or as.data.frame.kwic() since the original tokens are included, as in this PR. Or, if you no longer want to use that approach, we no longer need that attribute, but then we are not far from where we started before this PR.

How do you want to proceed?

@koheiw
Copy link
Collaborator Author

koheiw commented Feb 23, 2021

I prefer to make different (exported) function only to locate pattern matches in C++, and class the object differently from "kwic" (say "matches"). Then we make kwic.matches(), that has window and max_nrow (or page) to limit rows in the concordance view and returns the KWIC object classed "kwic" (the same as v2.0).

In this way, we can fulfill requests like in #2063, and avoid conflict with old KWIC objects.

We can move this discussion to an issue page and create a new branch if you like.

@kbenoit
Copy link
Collaborator

kbenoit commented Feb 23, 2021

OK - if that's based on this branch, then I suggest we merge this branch, and then work on a new branch for the changes you suggest. We have a lot of improvements in this PR including a better approach, in my view, for separating the construction of the kwic from its conversion and printing functions.

However I'm happy to add the pre and post columns to the data.frame part of the kwic object for now, as suggested above (#2045 (comment)), if you think we should add that back to the kwic object in this branch before merging.

@koheiw koheiw mentioned this pull request Feb 24, 2021
@kbenoit kbenoit merged commit 69444af into master Feb 25, 2021
@kbenoit kbenoit deleted the issue-1840 branch March 5, 2021 18:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants