Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 1514 #1515

Merged
merged 23 commits into from
Jan 2, 2019
Merged

Fix 1514 #1515

merged 23 commits into from
Jan 2, 2019

Conversation

kbenoit
Copy link
Collaborator

@kbenoit kbenoit commented Dec 7, 2018

Corrects the problem where the keyword pattern was not displaying correctly at the top of the textplot_xray panel. This was caused by the kwic attribute keywords not getting properly set. Because this was different for characters versus dictionaries, I used a bit of hack to choose unique values (dictionary) or unique patterns (character vector).

The rest of the changes are just linting.

Corrects the problem where the keyword pattern was not displaying correctly at the top of the textplot_xray panel.  This was caused by the kwic attribute `keywords` not getting properly set.  Because this was different for characters versus dictionaries, I used a bit of hack to choose unique values (dictionary) or unique patterns (character vector).

The rest of the changes are just linting.
@kbenoit kbenoit added bug textplot kwic Keywords in context issues labels Dec 7, 2018
@kbenoit kbenoit requested a review from koheiw December 7, 2018 23:14
R/kwic.R Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Dec 8, 2018

Codecov Report

Merging #1515 into master will increase coverage by <.01%.
The diff coverage is 96.73%.

@@            Coverage Diff             @@
##           master    #1515      +/-   ##
==========================================
+ Coverage   89.87%   89.88%   +<.01%     
==========================================
  Files         105      105              
  Lines        7816     7818       +2     
==========================================
+ Hits         7025     7027       +2     
  Misses        791      791

@codecov
Copy link

codecov bot commented Dec 8, 2018

Codecov Report

Merging #1515 into master will decrease coverage by <.01%.
The diff coverage is 95.65%.

@@            Coverage Diff             @@
##           master    #1515      +/-   ##
==========================================
- Coverage   89.76%   89.76%   -0.01%     
==========================================
  Files         103      103              
  Lines        7721     7727       +6     
==========================================
+ Hits         6931     6936       +5     
- Misses        790      791       +1

@koheiw
Copy link
Collaborator

koheiw commented Dec 12, 2018

Firstly, I think labels should be keys when pattern is a dictionary, so special handling of dictionary is not necessary.

toks <- tokens(data_corpus_inaugural)
textplot_xray(kwic(head(toks), dictionary(list(A = "citizen*", B = "government*"))))

rplot01

However, I was not sure how textplot_xlay() understands which match is for which pattern. As it turned out, it does not know. This function needs re-engineering.

textplot_xray(kwic(head(toks), c("citizen*", "government*")))

rplot02

textplot_xray(kwic(head(toks), c("government*", "citizen*")))

rplot03

@kbenoit
Copy link
Collaborator Author

kbenoit commented Dec 12, 2018

Totally agree on the first point. In fixing the bug, I changed the behaviour to something that it should not be. I'll fix this.

Will look into the second and fix that too. Good eye! And that's why we get a second pair of them on PRs.

@kbenoit
Copy link
Collaborator Author

kbenoit commented Dec 12, 2018

See #1521, which I think we should discuss there, then fix in this PR.

R/kwic.R Outdated Show resolved Hide resolved
R/kwic.R Outdated Show resolved Hide resolved
- All character patterns get coerced to a list
- The keywords attribute is returned as the pattern or dictionary key matching the actual keyword matched.  This works fine except in cases of dictionaries or lists that have intersecting values, in which case the first key match only is associated with the keyword matched.
@kbenoit
Copy link
Collaborator Author

kbenoit commented Dec 21, 2018

@koheiw let me know what you think via a review.

Copy link
Collaborator

@koheiw koheiw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot match temp$keyword against type when patterns are a phrase:

txt <- c("This is a test",
          "This is it.",
          "What is in a train?",
          "Is it a question?",
          "Sometimes you don't know if this is it.",
          "Is it a bird or a plane or is it a train?")

attr(kwic(txt, c("is", "a"), valuetype = "fixed"), "keywords") # works
# [1] is a  is is a  is a  is is a  a  is a 
# Levels: is a

attr(kwic(txt, phrase("is a"), valuetype = "fixed"), "keywords") # does  not
# [1] <NA>
# Levels: 

This is what makes this fix non-trivial.

@kbenoit
Copy link
Collaborator Author

kbenoit commented Dec 21, 2018

True. Maybe we could match the first word according to its type index, to get the phrase as the keywords entry that matches the corresponding kwic row. We just need a workaround for this case, and right now it's only used for the xray plot facet label.

@koheiw
Copy link
Collaborator

koheiw commented Dec 21, 2018

Another difficult case is

> stringi::stri_split_fixed(kwic(txt, phrase(c("is i*", "is it", "is in")))$keyword, " ")
[[1]]
[1] "is" "it"

[[2]]
[1] "is" "in"

[[3]]
[1] "Is" "it"

[[4]]
[1] "is" "it"

[[5]]
[1] "Is" "it"

[[6]]
[1] "is" "it"

It is impossible to 1-1 match the patterns and keywords here. I think these require too much work only for textplot_xlay().

@kbenoit
Copy link
Collaborator Author

kbenoit commented Dec 21, 2018

OK, I tried but have given up for now trying to solve that issue.

We've fixed all but the sequences. Since PR fixes some clear errors and represents a general improvement, I suggest we merge it, and outline the problems you identify above as a new issue to be solved separately. Those problems existed before this PR, so I don't see why they should hold up us solving other problems. See the warning notes I added to textplot_xray.Rd.

Plus the only real consequence is for the facet labels in textplot_xray(). If a user really wanted to fix these, it's possible to do so by modifying the ggplot object.

@koheiw
Copy link
Collaborator

koheiw commented Dec 21, 2018

Partial fixes can make problem even more complex. Why don't you remove the pattern label entirely from the plot as you proposed initially? I don't think we can solve the problem without fundamentally changing the kwic object considering the nature of phrasal matches.

@kbenoit
Copy link
Collaborator Author

kbenoit commented Dec 21, 2018

Well, dictionaries need facets, and it works for them in the PR (except for phrases) and is broken and wrong in master. As I mentioned, what's broken in the PR was also broken in master, but at least the PR fixes the broken bits for non-phrases.

@koheiw
Copy link
Collaborator

koheiw commented Dec 21, 2018

If the purpose is adding dictionary keys as labels, you could loop over keys and rbind() the output from
qatd_cpp_kwic(). I was doing something similar earlier for tokens_lookup().

quanteda/R/tokens_lookup.R

Lines 112 to 117 in 4ee90db

for (h in seq_along(dictionary)) {
values <- split_dictionary_values(dictionary[[h]], attr(x, 'concatenator'))
values_temp <- pattern2id(values, index = index)
values_id <- c(values_id, values_temp)
keys_id <- c(keys_id, rep(h, length(values_temp)))
}

This is a slow method but the most accurate. kwic() cannot be use for inspecting giant tokens object anyway.
We could make qatd_cpp_kwic() more like qatd_cpp_tokens_lookup() in the future.

@kbenoit
Copy link
Collaborator Author

kbenoit commented Dec 22, 2018

That's a good solution. Essentially, it's an lapply of kwic over the elements of the list or dictionary keys, and reassembling the results including (attributes) to output the single kwic. I'll work on that when I emerge from the 3-day holiday tunnel that starts this evening.

This allows us to reliably match the pattern with the returned keyword match, including for phrases.
@kbenoit
Copy link
Collaborator Author

kbenoit commented Jan 1, 2019

OK, I implemented the "iterate over keys" approach, but did this in a way that is consistent for all different pattern types for kwic. It preserves the existing behaviours but now correctly associates the "keyword" attribute as the pattern that was matches, and passes this to textplot_xray() for correct faceting. Maybe not the most efficient approach but it's more transparent to keep the manipulations required to do this on the R side, and makes the C++ workhorse parts of kwic more flexible if they can be kept general and then used in kwic.tokens() in high-level ways to get the results and objects we want.

Take a look and let me know if you agree that it's ready to merge. I'm eager to see the end of this one!

- Remove flatten argument
- Add keep_nomatch
- Reduce use of slow functions
- Move patten to column
@koheiw
Copy link
Collaborator

koheiw commented Jan 2, 2019

I changed the code to make it simpler and faster (along with upgrading of core functions). The main structural change is moving matched pattern in the keyword attribute to the pattern column. By doing this, users can more easily combine KWIC objects. We should let them to rbind() KWIC objects instead of passing them via ... in textplot_xlay().

@@ -93,7 +93,7 @@ split_values <- function(dict, concatenator_dictionary, concatenator_tokens) {
concatenator_tokens))
names(result) <- key
}
result
return(result)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes no functional difference, however according to our own style guide we should not use return() to send an object's evaluation back to the parent environment at the end of a function. I'm not bothered by it either way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consider this one of the most strange R conventions. We should return values explicitly like in other (proper) programing languages. For me, Google R Style Guide is much more agreeable than tidyverse's.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're about 90% compatible but the tidyverse guide is a bit more complete. We can't agree with https://google.github.io/styleguide/Rguide.xml#object for quanteda at least. (Sometimes with R you have to embrace a slight bit of chaos!)

We already have local exceptions (to both), so let's add the suggestion of an explicit return(finalresult) to the quanteda Style Guide.

@kbenoit
Copy link
Collaborator Author

kbenoit commented Jan 2, 2019

All looks good to me. I also thought about adding pattern as a column to the kwic object. A good addition.

Ready to merge I'd say.

@koheiw koheiw merged commit b2d03fc into master Jan 2, 2019
@koheiw koheiw deleted the fix-1514 branch January 2, 2019 21:54
@kbenoit
Copy link
Collaborator Author

kbenoit commented Jan 2, 2019

🎉🎉 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug kwic Keywords in context issues textplot
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants