Issue 1840 #2045

koheiw · 2021-02-02T08:46:54Z

Separated text search (kwic) and concordance view generation print.kwic() to address #2038 and #1840. This patch needs final adjustment, but difficult part is already done.

codecov · 2021-02-02T08:53:01Z

Codecov Report

Merging #2045 (cac9e38) into master (5af26da) will increase coverage by 0.08%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #2045      +/-   ##
==========================================
+ Coverage   95.17%   95.25%   +0.08%     
==========================================
  Files          71       71              
  Lines        3979     4028      +49     
==========================================
+ Hits         3787     3837      +50     
+ Misses        192      191       -1

Impacted Files	Coverage Δ
R/dictionaries.R	`93.78% <ø> (ø)`
R/docvars.R	`95.04% <ø> (ø)`
R/meta.R	`87.00% <ø> (ø)`
R/pattern2fixed.R	`93.79% <ø> (ø)`
R/utils.R	`98.46% <ø> (ø)`
R/corpus.R	`100.00% <100.00%> (ø)`
R/kwic.R	`98.94% <100.00%> (+3.11%)`	⬆️
R/quanteda_options.R	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5af26da...1b3ec18. Read the comment docs.

R/corpus.R

kbenoit · 2021-02-06T18:11:07Z

Overall my sense is that if the kwic object now includes the full set of tokens as an attribute. Isn't this pretty wasteful in terms of space?

There are ways to minimise the break with previous behaviours, by letting the user specify separator and window in kwic(), and passing these through via attributes to print.kwic() and as.data.frame.kwic() as defaults.

print.kwic() needs to be documented.

View.kwic() is no longer functioning correctly.

And see my comment above about corpus.kwic().

kbenoit · 2021-02-07T12:23:33Z

Thinking about this more, the PR essentially builds a token position index for the search terms, and includes the entire token set. This is like a single-use indexing, but means that with two kwic objects - common for instance in our tutorials and examples where these are plotted together - each kwic object is greater than the original tokens object. This is just one of the features of the functional, pass-by-value approach of R and quanteda. In another language, we might attach searched terms to the tokens object, and make these available, but that's very different from our existing approach in the package.

This could be solved if we returned to an idea we discussed early on, which is that a corpus could be processed to include tokens and indexes to all types in the document. Then processing a kwic would be fast since it would just find the indexed tokens and extract their context. It would be slow, but only once, when the corpus is created (or modified) or optionally, processed following creation of the existing type of corpus we have. And "creating" a tokens object would then not involve the tokenisation.

koheiw · 2021-02-09T02:47:49Z

The tokens object is a subset of original:

attr(result, "tokens") <- x[unique(result$docname)]

As far as I know, R does not make copy unless changes are made, so attaching the original x would actually save RAM space.

kbenoit · 2021-02-10T21:30:57Z

Looks like it does make a full copy, see below.

Is there a performance gain by copying the full tokens for documents with a match?

library("quanteda")
## Package version: 2.9.9000
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library("pryr")
## Registered S3 method overwritten by 'pryr':
##   method      from
##   print.bytes Rcpp

toks <- tokens(data_corpus_inaugural)
object_size(toks)
## 1.32 MB

kw <- kwic(toks, "a")
object_size(kw)
## 1.37 MB
object_size(attr(kw, "tokens"))
## 1.32 MB

object_size(toks, kw)
## 2.06 MB

^{Created on 2021-02-10 by the reprex package (v1.0.0)}

koheiw · 2021-02-15T13:48:28Z

If full tokens is attached to kwic, the attribute only point to the original. If tokens is sliced, new object is created. Unless kwic object is saved in RDS, the former approach is more efficient.

> require(pryr)
> toks <- tokens(data_corpus_inaugural)
> kw <- kwic(toks, "a")
> 
> address(toks)
[1] "0x5605cfe02630"
> attr(kw, "tokens") <- toks
> toks_out <- attr(kw, "tokens")
> address(toks_out)
[1] "0x5605cfe02630"
> 
> attr(kw, "tokens") <- toks[1:10]
> toks_out2 <- attr(kw, "tokens")
> address(toks_out2)
[1] "0x5605d1711988"

kbenoit · 2021-02-16T11:03:11Z

OK, looks like this is done, so here are my comments on the current PR.

Overall, this looks fine, I can see the usefulness, although it's unclear how much of a performance difference this creates. To the extent that as.data.frame() on a kwic object returns what used to be the main output object, I think this is fine, and it gives us more flexibility to redefine the kwic structure further if we want, since we are no longer required to maintain this as a data.frame object, but can produce that through coercion. This is a good move and consistent with our approach elsewhere.

However this breaks too much, including our own code, much user code, and one of our packages. We can virtually eliminate this incompatibility however through some small changes.

Add back window and separator to kwic(). These won't do anything in the object but can be passed to print.kwic() and as.data.frame.kwic() as defaults, with the print methods being able to override them, so we get the best of this PR's new approach but compatibility with and the convenience of the former approach. This means we don't have to change the examples either. (The print() looks unwieldy and reminds me too much of Python.)
This broke textplots::textplot_xray(), but I've already managed to fix this in a way that works for the pre-v3 kwic objects as well as the v3 kwic objects. But as a note, we always need to think about what we are breaking in our own quanteda-verse packages when we change the core package.
I'd like to fix and undelete corpus.kwic(), which I don't see as doing any harm. If we decide to sentence it to death then it should at least be deprecated and made internal before removing it.

koheiw · 2021-02-16T11:35:35Z

I am OK with restoring corpus.kwic() for deprecation and restore the arguments. I would be nice to add max_nrow to print.kwic(). I will leave these with you. have done my part on this branch.

- adds separator and window to kwic - adds tests - restore corpus.kwic() and tests

- also adds tests

kbenoit · 2021-02-17T13:07:36Z

OK updates done, with some small improvements and stronger tests. kwic() now works exactly as before, from the user's standpoint, but with the new machinery. (See the added tests for details.) window and separator can be specified upon creation but can be changed in either print.kwic() or as.data.frame.kwic().

I'm happy to add max_nrow to kwic. This would make it work as do our other print methods, but those by default are pretty
conservative. We'd probably want to decide what is a suitable default, since most people are going to want to see the whole thing. I suggest we merge this PR and make this a new issue.

R/kwic.R

koheiw · 2021-02-18T01:38:33Z

R/kwic.R

 #'
 #' corp <- corpus(txt)
-#' kwic(corp, c("is", "a"), valuetype = "fixed", separator = "", remove_separators = FALSE)
+#' print(kwic(corp, "is"), separator = "")
 #' @export
 kwic.tokens <- function(x, pattern, window = 5,


Shall we deprecate window and separator to make clear that kwic() is only for locating patters?

I would say no, simply because this is the way that users have gotten used to using kwic since it was first written. It's also more natural to put this into the function that creates the kwic. We can make it clear that this can be changed using as.data.frame() or print(), but most people using quanteda probably don't want to type print() to see output, they just want to see default output. My experience with kwic() in particular is that this is a function most used by our most qualitative and interactive users.

When I use kwic() or teach this, I also prefer to set it at the time we call it. And this is not like what we've discussed about removing the convenience options from dfm() or removing dfm.character(), since those are a form of bad practice, skipping a step that the user should be taking more direct responsibility for. Rather, the window and separator options belong to kwic() and are natural to associate with creating a kwic(), but our implementation has now disassociated itself from needing to set this at creation time. But that's more about our current implementation than what the action implemented by the function calls for.

One more reason: If we do change kwic() again in the future, it might make more sense to set these at creation time.

So I think it makes sense to keep them without warnings. Most users will happily never know the difference.

It makes more sense to have window in kwic() but I am not comfortable with the same argument in the different places. How about let users to call kwic.kwck() to change window or separator? max_nrow should still be in print().

kwic.kwic() makes no sense unless it would be possible to search a kwic() for keywords - although we could make it do so since we have the tokens stored inside it. I think that's an edge case however that will create more confusion than the benefit that might come from using such a method to (re)set window or separator.

I think it's ok to keep window and separator in both the constructor and the print and conversion methods. We just have to make it clear that the default is set when the kwic is created, but can be changed at the print or data.frame stage. My guess is that it will be a very rare user that wants to do so, so in the vast majority of cases for the vast majority of users, the kwic functions will simply function as they have before.

For print.kwic() I'd suggest that our default be max_nrow = -1. Want me to implement that?

Let's take out window or separator from print for now. My understanding is that people do not want to run kwic() multiple times because it was really slow when there are many matches, but it is no loner so was far as the resulting object is subset. Let set max_nrow = 1000 or a bit more so that kwic() does not freeze.

Thinking about why we have to make a decision to add these argument to either or both kwic() or print()... I think my design to make print.kwic() was the mistake. kwic() should be the print function that creates the concordance view.

What I should have done is creating a function that might be call search() or locate() that returns the positions of matches (the same as what kwic.tokens() does in this branch) and calling this function in kwic(). If we add a special class to this object (like "position"), we can make a kwic.position(). The object can be useful for information retrieval applications too.

That would work too. But the current structure is fine. Let's do this:

Keep kwic() as in this PR.

Remove window and separator from print.kwic(), but keep these for as.data.frame.kwic().

Add max_nrow = 1000 to print.kwic() and make it work like the other object print functions (with the "x more rows..." output).

Not implement yet the other functions you mention above, but keep in mind that these can be added when and if needed.

OK. That is the middle way.

Alright, I did all of the above in 13fa3ea. This implements the same print options now for kwic as for other core objects, including show_summary. The arguments window and separator are kept in as.data.frame.kwic() but removed from print.kwic() (but I left the internal code for these intact in case we want to add them back).

We can remove the .corpus and .character methods in a separate PR, when we address #2058 more directly.

Alright, I did all of the above in 13fa3ea. This implements the same print options now for kwic as for other core objects, including show_summary. The arguments window and separator are kept in as.data.frame.kwic() but removed from print.kwic() (but I left the internal code for these intact in case we want to add them back).

We can remove the .corpus and .character methods in a separate PR, when we address #2058.

…o issue-1840

Merge branch 'master' into issue-1840 # Conflicts: # R/kwic.R # R/pattern2fixed.R # R/utils.R # man/is_regex.Rd # man/pattern2list.Rd

koheiw · 2021-02-23T08:27:40Z

I checked how kwic() works. I don't think users enjoy wrapping the function by as.data.frame() always to see the concordance view in Data Viewer.

View(kwic(corp, pattern = "secure*", valuetype = "glob", window = 3)) # shows only keywords and positions
View(print(kwic(corp, pattern = "secure*", valuetype = "glob", window = 3))) # shows nothing 
View(as.data.frame(kwic(corp, pattern = "secure*", valuetype = "glob", window = 3))) # shows concordance view

It is not necessary to define View.kwic() if kwic() returns concordance view (as I argued above).

What I should have done is creating a function that might be call search() or locate() that returns the positions of matches (the same as what kwic.tokens() does in this branch) and calling this function in kwic(). If we add a special class to this object (like "position"), we can make a kwic.position(). The object can be useful for information retrieval applications too.

kbenoit · 2021-02-23T09:02:06Z

That all sounds reasonable, but hard to follow as this PR keeps shifting.

To get View.data.frame to work with a raw kwic object, we really only need to add the pre and post columns to the kwic object. If you want them viewable from the kwic object, just add the columns from the window at the kwic creation step.

We could keep the tokens attribute so that these could still be changed in print.kwic() or as.data.frame.kwic() since the original tokens are included, as in this PR. Or, if you no longer want to use that approach, we no longer need that attribute, but then we are not far from where we started before this PR.

How do you want to proceed?

koheiw · 2021-02-23T09:43:19Z

I prefer to make different (exported) function only to locate pattern matches in C++, and class the object differently from "kwic" (say "matches"). Then we make kwic.matches(), that has window and max_nrow (or page) to limit rows in the concordance view and returns the KWIC object classed "kwic" (the same as v2.0).

In this way, we can fulfill requests like in #2063, and avoid conflict with old KWIC objects.

We can move this discussion to an issue page and create a new branch if you like.

kbenoit · 2021-02-23T10:19:33Z

OK - if that's based on this branch, then I suggest we merge this branch, and then work on a new branch for the changes you suggest. We have a lot of improvements in this PR including a better approach, in my view, for separating the construction of the kwic from its conversion and printing functions.

However I'm happy to add the pre and post columns to the data.frame part of the kwic object for now, as suggested above (#2045 (comment)), if you think we should add that back to the kwic object in this branch before merging.

koheiw added 13 commits February 2, 2021 08:39

Stop generating character strings

a88ce79

Rename vars

a637ef2

Rename objects

98a37ca

Return patterns as a character vector

76dc0ab

Remove window and separator

34ef875

Move concordance view generator to print.kwic

ce91d35

Fix extractor

403cadd

Add as.data.frame.kiwc

1af9ee0

Add as.data.frame()

400efce

Update tests

f32f50a

Update MAN

2b3d7d8

Remove corpus.kwic

3bb83c1

Update tests

9d419e2

Remove corpus.kwic example

43efb6b

koheiw requested a review from kbenoit February 2, 2021 08:58

kbenoit reviewed Feb 2, 2021

View reviewed changes

R/corpus.R Show resolved Hide resolved

koheiw and others added 2 commits February 9, 2021 11:53

Merge branch 'master' into issue-1840

66bde9c

Merge branch 'master' into issue-1840

814330f

Merge branch 'master' into issue-1840

98b6ac8

koheiw added 5 commits February 15, 2021 22:48

Merge branch 'master' into issue-1840

1386b34

Merge branch 'master' into issue-1840

7f9a3f4

Update MAN

fb3742e

Name argument

0ad8a74

Don't use as.data.frame

93e63a3

kbenoit mentioned this pull request Feb 16, 2021

Fixes for quanteda v3 quanteda/quanteda.textplots#12

Merged

kbenoit added 10 commits February 16, 2021 13:33

Merge branch 'master' into issue-1840

97b56fa

Fix Rd issues

631b01a

Merge branch 'master' of https://github.com/quanteda/quanteda

dfebc72

Minor documentation improvements

33d0267

Update kwic functions

11145ce

- adds separator and window to kwic - adds tests - restore corpus.kwic() and tests

Fix plural/singular in output

3f8ff95

- also adds tests

Merge branch 'master' of https://github.com/quanteda/quanteda

864e392

Remove use of print() in examples

403f0d5

Add failing test

6777b81

Fix bug in kwic

a4741b1

kbenoit approved these changes Feb 17, 2021

View reviewed changes

R/kwic.R Show resolved Hide resolved

Merge branch 'master' into issue-1840

e102a9d

kbenoit mentioned this pull request Feb 17, 2021

Consider removing "shortcut" methods #2058

Closed

koheiw commented Feb 18, 2021

View reviewed changes

R/kwic.R Outdated Show resolved Hide resolved

koheiw commented Feb 18, 2021

View reviewed changes

kbenoit added 4 commits February 20, 2021 15:17

Implement print options for kwic

13fa3ea

Merge branch 'issue-1840' of https://github.com/quanteda/quanteda int…

4767c55

…o issue-1840

Merge branch 'master' of https://github.com/quanteda/quanteda

2d8cacc

merge with master

1b3ec18

Merge branch 'master' into issue-1840 # Conflicts: # R/kwic.R # R/pattern2fixed.R # R/utils.R # man/is_regex.Rd # man/pattern2list.Rd

koheiw mentioned this pull request Feb 24, 2021

Issue 1840 2 #2065

Merged

kbenoit merged commit 69444af into master Feb 25, 2021

kbenoit deleted the issue-1840 branch March 5, 2021 18:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue 1840 #2045

Issue 1840 #2045

koheiw commented Feb 2, 2021 •

edited

Loading

codecov bot commented Feb 2, 2021 •

edited

Loading

kbenoit commented Feb 6, 2021

kbenoit commented Feb 7, 2021

koheiw commented Feb 9, 2021

kbenoit commented Feb 10, 2021

koheiw commented Feb 15, 2021

kbenoit commented Feb 16, 2021

koheiw commented Feb 16, 2021

kbenoit commented Feb 17, 2021

koheiw Feb 18, 2021

kbenoit Feb 18, 2021

koheiw Feb 18, 2021

kbenoit Feb 18, 2021

koheiw Feb 18, 2021 •

edited

Loading

koheiw Feb 18, 2021

kbenoit Feb 19, 2021

koheiw Feb 19, 2021

kbenoit Feb 20, 2021

kbenoit Feb 20, 2021

koheiw commented Feb 23, 2021

kbenoit commented Feb 23, 2021

koheiw commented Feb 23, 2021

kbenoit commented Feb 23, 2021

Issue 1840 #2045

Issue 1840 #2045

Conversation

koheiw commented Feb 2, 2021 • edited Loading

codecov bot commented Feb 2, 2021 • edited Loading

Codecov Report

kbenoit commented Feb 6, 2021

kbenoit commented Feb 7, 2021

koheiw commented Feb 9, 2021

kbenoit commented Feb 10, 2021

koheiw commented Feb 15, 2021

kbenoit commented Feb 16, 2021

koheiw commented Feb 16, 2021

kbenoit commented Feb 17, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

koheiw Feb 18, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

koheiw commented Feb 23, 2021

kbenoit commented Feb 23, 2021

koheiw commented Feb 23, 2021

kbenoit commented Feb 23, 2021

koheiw commented Feb 2, 2021 •

edited

Loading

codecov bot commented Feb 2, 2021 •

edited

Loading

koheiw Feb 18, 2021 •

edited

Loading