Fix 1514 #1515

kbenoit · 2018-12-07T23:14:49Z

Corrects the problem where the keyword pattern was not displayed correctly at the top of the textplot_xray() panel. This was caused by the kwic attribute `keywords` not being set properly. Because this differed for character vectors versus dictionaries, I used a bit of a hack to choose unique values (dictionary) or unique patterns (character vector).

The rest of the changes are just linting.


codecov · 2018-12-08T00:25:29Z

Codecov Report

Merging #1515 into master will increase coverage by <.01%.
The diff coverage is 96.73%.

@@            Coverage Diff             @@
##           master    #1515      +/-   ##
==========================================
+ Coverage   89.87%   89.88%   +<.01%     
==========================================
  Files         105      105              
  Lines        7816     7818       +2     
==========================================
+ Hits         7025     7027       +2     
  Misses        791      791

codecov · 2018-12-08T00:25:29Z

Codecov Report

Merging #1515 into master will decrease coverage by <.01%.
The diff coverage is 95.65%.

@@            Coverage Diff             @@
##           master    #1515      +/-   ##
==========================================
- Coverage   89.76%   89.76%   -0.01%     
==========================================
  Files         103      103              
  Lines        7721     7727       +6     
==========================================
+ Hits         6931     6936       +5     
- Misses        790      791       +1

After fixing #1514, the article needs to be rebuilt, as it contains several textplot_xray() plots.


koheiw · 2018-12-12T02:00:28Z

Firstly, I think labels should be keys when pattern is a dictionary, so special handling of dictionary is not necessary.

toks <- tokens(data_corpus_inaugural)
textplot_xray(kwic(head(toks), dictionary(list(A = "citizen*", B = "government*"))))

However, I was not sure how textplot_xray() understands which match is for which pattern. As it turned out, it does not know. This function needs re-engineering.

textplot_xray(kwic(head(toks), c("citizen*", "government*")))

textplot_xray(kwic(head(toks), c("government*", "citizen*")))

kbenoit · 2018-12-12T04:00:27Z

Totally agree on the first point. In fixing the bug, I changed the behaviour to something that it should not be. I'll fix this.

Will look into the second and fix that too. Good eye! And that's why we get a second pair of them on PRs.


- I removed a test from test-kwic.R that specified the wrong behaviour.

kbenoit · 2018-12-12T06:30:11Z

See #1521, which I think we should discuss there, then fix in this PR.

R/kwic.R

- All character patterns get coerced to a list
- The keywords attribute is returned as the pattern or dictionary key matching the actual keyword matched. This works fine except in cases of dictionaries or lists that have intersecting values, in which case the first key match only is associated with the keyword matched.
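The intersecting-values limitation can be illustrated in base R without quanteda. This is only a minimal sketch, assuming a plain named list stands in for a dictionary and that matching is exact; `dict`, `lookup` and `matched` are hypothetical names and not part of the PR.

```r
# Toy stand-in for a dictionary: two keys that share the value "tax"
dict <- list(economy = c("tax", "budget"), politics = c("tax", "vote"))

# Flatten to a lookup table of value -> key
lookup <- data.frame(
  key = rep(names(dict), lengths(dict)),
  value = unlist(dict, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Keywords as they might come back from a search (hypothetical)
matched <- c("tax", "vote", "budget")

# match() returns the position of the FIRST hit only, so "tax" is
# attributed to "economy" and its "politics" membership is lost
keys <- lookup$key[match(matched, lookup$value)]
print(keys)
```

Here "tax" belongs to both keys, but only the first key is reported, which is the behaviour described in the note above.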

kbenoit · 2018-12-21T11:33:59Z

@koheiw let me know what you think via a review.

koheiw

You cannot match temp$keyword against type when patterns are a phrase:

txt <- c("This is a test",
          "This is it.",
          "What is in a train?",
          "Is it a question?",
          "Sometimes you don't know if this is it.",
          "Is it a bird or a plane or is it a train?")

attr(kwic(txt, c("is", "a"), valuetype = "fixed"), "keywords") # works
# [1] is a  is is a  is a  is is a  a  is a 
# Levels: is a

attr(kwic(txt, phrase("is a"), valuetype = "fixed"), "keywords") # does  not
# [1] <NA>
# Levels:

This is what makes this fix non-trivial.

kbenoit · 2018-12-21T12:44:09Z

True. Maybe we could match the first word according to its type index, to get the phrase as the keywords entry that matches the corresponding kwic row. We just need a workaround for this case, and right now it's only used for the xray plot facet label.
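Why the phrase case returns NA, and what the first-word idea would look like, can be sketched in base R. This assumes a simplified picture in which types are single tokens and a phrase keyword comes back as one space-joined string; `types` and the other names are hypothetical.

```r
# Single-token types, roughly as a tokens object would store them
types <- c("This", "is", "a", "test")

# Keywords returned for single-word patterns can be looked up directly
single <- match(c("is", "a"), types)

# A phrase match comes back as one string spanning two tokens,
# which is not itself a type, so the lookup yields NA
phrase_kw <- match("is a", types)

# The first-word idea: split the keyword and look up its first token only
first_token <- vapply(strsplit("is a", " ", fixed = TRUE), `[`, character(1), 1)
first_idx <- match(first_token, types)
```

The first-token lookup succeeds, but it cannot tell apart two phrases that start with the same word.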

koheiw · 2018-12-21T13:04:06Z

Another difficult case is

> stringi::stri_split_fixed(kwic(txt, phrase(c("is i*", "is it", "is in")))$keyword, " ")
[[1]]
[1] "is" "it"

[[2]]
[1] "is" "in"

[[3]]
[1] "Is" "it"

[[4]]
[1] "is" "it"

[[5]]
[1] "Is" "it"

[[6]]
[1] "is" "it"

It is impossible to match the patterns and keywords one-to-one here. I think these cases require too much work just for textplot_xray().
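The ambiguity can be checked in base R. This is a rough sketch that treats each keyword as a single string and converts the glob patterns with utils::glob2rx(); quanteda's own matching works token by token, so this only illustrates the overlap and is not the package's algorithm.

```r
patterns <- c("is i*", "is it", "is in")
keywords <- c("is it", "is in")

# Convert each glob to an anchored regular expression
regexes <- utils::glob2rx(patterns)

# Rows are keywords, columns are patterns; TRUE where the pattern matches
hits <- sapply(regexes, function(rx) grepl(rx, keywords, ignore.case = TRUE))
dimnames(hits) <- list(keywords, patterns)

# Count how many patterns claim each keyword
n_hits <- rowSums(hits)
print(hits)
```

Each keyword is matched by more than one pattern, so the keyword alone cannot identify the pattern that produced it.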

kbenoit · 2018-12-21T15:34:52Z

OK, I tried but have given up for now trying to solve that issue.

We've fixed all but the sequences. Since this PR fixes some clear errors and represents a general improvement, I suggest we merge it, and outline the problems you identify above as a new issue to be solved separately. Those problems existed before this PR, so I don't see why they should hold us up from solving other problems. See the warning notes I added to textplot_xray.Rd.

Plus the only real consequence is for the facet labels in textplot_xray(). If a user really wanted to fix these, it's possible to do so by modifying the ggplot object.

koheiw · 2018-12-21T20:42:46Z

Partial fixes can make the problem even more complex. Why don't you remove the pattern label entirely from the plot as you proposed initially? I don't think we can solve the problem without fundamentally changing the kwic object, considering the nature of phrasal matches.

kbenoit · 2018-12-21T21:43:40Z

Well, dictionaries need facets, and it works for them in the PR (except for phrases) and is broken and wrong in master. As I mentioned, what's broken in the PR was also broken in master, but at least the PR fixes the broken bits for non-phrases.

koheiw · 2018-12-21T22:32:52Z

If the purpose is adding dictionary keys as labels, you could loop over keys and rbind() the output from qatd_cpp_kwic(). I was doing something similar earlier for tokens_lookup().

quanteda/R/tokens_lookup.R

Lines 112 to 117 in 4ee90db

    
    for (h in seq_along(dictionary)) {
        values <- split_dictionary_values(dictionary[[h]], attr(x, 'concatenator'))
        values_temp <- pattern2id(values, index = index)
        values_id <- c(values_id, values_temp)
        keys_id <- c(keys_id, rep(h, length(values_temp)))
    }

This is a slow method but the most accurate. kwic() cannot be used for inspecting a giant tokens object anyway. We could make qatd_cpp_kwic() more like qatd_cpp_tokens_lookup() in the future.

kbenoit · 2018-12-22T09:31:23Z

That's a good solution. Essentially, it's an lapply of kwic over the elements of the list or dictionary keys, and reassembling the results (including attributes) to output the single kwic. I'll work on that when I emerge from the 3-day holiday tunnel that starts this evening.

This allows us to reliably match the pattern with the returned keyword match, including for phrases.
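The "iterate over keys" idea can be sketched in base R. This is not the code from the PR: `toy_kwic()` is a hypothetical stand-in for the real search, and the tokens and dictionary are made up for illustration.

```r
# Hypothetical toy search: positions where any of `values` occurs in `tokens`
toy_kwic <- function(tokens, values) {
  pos <- which(tolower(tokens) %in% values)
  data.frame(position = pos, keyword = tokens[pos], stringsAsFactors = FALSE)
}

tokens <- c("Citizens", "trust", "government", "when",
            "government", "serves", "citizens")
dict <- list(A = "citizens", B = "government")

# Run one search per key, tagging every row with the key that produced it
parts <- lapply(names(dict), function(key) {
  res <- toy_kwic(tokens, dict[[key]])
  res$pattern <- rep(key, nrow(res))
  res
})

# Reassemble into a single result, ordered by position in the text
result <- do.call(rbind, parts)
result <- result[order(result$position), ]
rownames(result) <- NULL
print(result)
```

Because each row is tagged at the moment it is produced, there is no need to reverse-engineer the pattern from the keyword afterwards. The cost is one search per key.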

kbenoit · 2019-01-01T18:34:16Z

OK, I implemented the "iterate over keys" approach, but did this in a way that is consistent for all different pattern types for kwic. It preserves the existing behaviours but now correctly associates the "keyword" attribute as the pattern that was matched, and passes this to textplot_xray() for correct faceting. Maybe not the most efficient approach, but it's more transparent to keep the manipulations required to do this on the R side, and it makes the C++ workhorse parts of kwic more flexible if they can be kept general and then used in kwic.tokens() in high-level ways to get the results and objects we want.

Take a look and let me know if you agree that it's ready to merge. I'm eager to see the end of this one!

- Remove flatten argument
- Add keep_nomatch

- Reduce use of slow functions
- Move pattern to column

koheiw · 2019-01-02T09:46:49Z

I changed the code to make it simpler and faster (along with upgrading core functions). The main structural change is moving the matched pattern from the keyword attribute to the pattern column. By doing this, users can more easily combine KWIC objects. We should let them rbind() KWIC objects instead of passing them via ... in textplot_xray().
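A rough base R sketch of why a pattern column makes combining easier. Plain data frames stand in for KWIC objects here, and the column names and values are assumptions for illustration only.

```r
# Two hypothetical results, each already carrying its own pattern column
kw_a <- data.frame(docname = c("d1", "d2"), from = c(3L, 8L),
                   keyword = c("citizen", "citizens"), pattern = "citizen*",
                   stringsAsFactors = FALSE)
kw_b <- data.frame(docname = "d1", from = 5L,
                   keyword = "government", pattern = "government*",
                   stringsAsFactors = FALSE)

# Because the pattern travels with each row, combining is a plain rbind()
combined <- rbind(kw_a, kw_b)

# A plotting function could then facet on the column, e.g. by splitting rows
by_pattern <- split(combined$from, combined$pattern)
print(by_pattern)
```

With the pattern stored per row rather than in an object-level attribute, nothing has to be reconciled when two results are stacked.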

kbenoit · 2019-01-02T13:19:11Z

R/dictionaries.R

@@ -93,7 +93,7 @@ split_values <- function(dict, concatenator_dictionary, concatenator_tokens) {
                                                 concatenator_tokens))
        names(result) <- key
    }
-    result
+    return(result)


This makes no functional difference; however, according to our own style guide, we should not use return() to send an object's evaluation back to the parent environment at the end of a function. I'm not bothered by it either way.

I consider this one of the strangest R conventions. We should return values explicitly, as in other (proper) programming languages. For me, the Google R Style Guide is much more agreeable than tidyverse's.

They're about 90% compatible but the tidyverse guide is a bit more complete. We can't agree with https://google.github.io/styleguide/Rguide.xml#object for quanteda at least. (Sometimes with R you have to embrace a slight bit of chaos!)

We already have local exceptions (to both), so let's add the suggestion of an explicit return(finalresult) to the quanteda Style Guide.
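For reference, the two styles under discussion behave identically. The functions below are hypothetical examples, not code from quanteda.

```r
# Implicit return: the value of the last evaluated expression is returned
square_implicit <- function(x) {
  x^2
}

# Explicit return: same result, but the exit point is spelled out
square_explicit <- function(x) {
  result <- x^2
  return(result)
}

print(identical(square_implicit(3), square_explicit(3)))
```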

kbenoit · 2019-01-02T13:36:37Z

All looks good to me. I also thought about adding pattern as a column to the kwic object. A good addition.

Ready to merge I'd say.

kbenoit · 2019-01-02T21:56:18Z

🎉🎉 🎉


kbenoit added 2 commits December 8, 2018 10:12

Linting improvements

a98205e

kbenoit added bug textplot kwic Keywords in context issues labels Dec 7, 2018

kbenoit requested a review from koheiw December 7, 2018 23:14

kbenoit commented Dec 7, 2018


Update .Rd - forgot to knit

1186535

stefan-mueller and others added 5 commits December 8, 2018 11:13

Change references in plotting article to from TeX to RMarkdown format

a97de94

After fixing #1514, the article needs to be rebuilt, as it contains several textplot_xray() plots.

Merge branch 'fix-1514' of https://github.com/quanteda/quanteda into …

e737e50

…fix-1514

Merge branch 'master' into fix-1514

3220595

Merge branch 'fix-1514' of https://github.com/quanteda/quanteda into …

bda09c9

…fix-1514

Merge branch 'master' into fix-1514

00e46b4

kbenoit added 2 commits December 12, 2018 15:01

Merge branch 'fix-1514' of https://github.com/quanteda/quanteda into …

7ac9ffa

…fix-1514

Make keywords attr the keys if pattern is a dictionary

3cb03d1

- I removed a test from test-kwic.R that specified the wrong behaviour.

kbenoit mentioned this pull request Dec 12, 2018

textplot_xray() fails with pattern length > 1 #1521

Closed

Add tests related to #1521

07df0c4

kbenoit added 3 commits December 18, 2018 15:31

Merge branch 'master' into fix-1514

d905131

Merge branch 'master' into fix-1514

616f5fa

Start to fix the keyword issue

940594d

kbenoit commented Dec 20, 2018


kbenoit commented Dec 20, 2018


koheiw requested changes Dec 21, 2018


kbenoit added 2 commits December 21, 2018 15:30

Add tests to be solved in a future fix

44fb0d4

Add warning about textplot_xray

39b46f5

kbenoit added 2 commits January 1, 2019 14:33

Merge branch 'master' into fix-1514

e1fd124

Implement kwic as a series of individual element kwics

2cb2fc3

This allows us to reliably match the pattern with the returned keyword match, including for phrases.

koheiw added 2 commits January 2, 2019 18:35

Make output always flat

458f9c8

- Remove flatten argument
- Add keep_nomatch

Simplify the code

ace3a39

- Reduce use of slow functions
- Move pattern to column

Build man

dc17f3b

kbenoit commented Jan 2, 2019


picky linting

ccf9c5e

koheiw approved these changes Jan 2, 2019


koheiw merged commit b2d03fc into master Jan 2, 2019

koheiw deleted the fix-1514 branch January 2, 2019 21:54

stefan-mueller mentioned this pull request Jan 8, 2019

Update "Digital Humanities" replication #1548

Merged

stefan-mueller added a commit that referenced this pull request Jan 10, 2019

Rebuild pkgdown website after fixing #1515 and #1549 and updating Dit…

47f78df

…igal Humanities page

jiongweilua pushed a commit that referenced this pull request Jan 21, 2019

Rebuild pkgdown website after fixing #1515 and #1549 and updating Dit…

37813aa

…igal Humanities page

kbenoit mentioned this pull request May 9, 2019

textplot_xray() from kwic() based on dictionary pattern #1684

Closed
