-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix 1514 #1515
Conversation
Corrects the problem where the keyword pattern was not displaying correctly at the top of the textplot_xray panel. This was caused by the kwic attribute `keywords` not getting properly set. Because this was different for characters versus dictionaries, I used a bit of hack to choose unique values (dictionary) or unique patterns (character vector). The rest of the changes are just linting.
Codecov Report
@@ Coverage Diff @@
## master #1515 +/- ##
==========================================
+ Coverage 89.87% 89.88% +<.01%
==========================================
Files 105 105
Lines 7816 7818 +2
==========================================
+ Hits 7025 7027 +2
Misses 791 791 |
Codecov Report
@@ Coverage Diff @@
## master #1515 +/- ##
==========================================
- Coverage 89.76% 89.76% -0.01%
==========================================
Files 103 103
Lines 7721 7727 +6
==========================================
+ Hits 6931 6936 +5
- Misses 790 791 +1 |
After fixing #1514, the article needs to be rebuilt, as it contains several textplot_ray() plots.
Firstly, I think labels should be keys when toks <- tokens(data_corpus_inaugural)
textplot_xray(kwic(head(toks), dictionary(list(A = "citizen*", B = "government*")))) However, I was not sure how textplot_xray(kwic(head(toks), c("citizen*", "government*"))) textplot_xray(kwic(head(toks), c("government*", "citizen*"))) |
Totally agree on the first point. In fixing the bug, I changed the behaviour to something that it should not be. I'll fix this. Will look into the second and fix that too. Good eye! And that's why we get a second pair of them on PRs. |
- I removed a text from test-kwic.R that specified the wrong behaviour.
See #1521, which I think we should discuss there, then fix in this PR. |
- All character patterns get coerced to a list - The keywords attribute is returned as the pattern or dictionary key matching the actual keyword matched. This works fine except in cases of dictionaries or lists that have intersecting values, in which case the first key match only is associated with the keyword matched.
@koheiw let me know what you think via a review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You cannot match temp$keyword
against type when patterns are a phrase:
txt <- c("This is a test",
"This is it.",
"What is in a train?",
"Is it a question?",
"Sometimes you don't know if this is it.",
"Is it a bird or a plane or is it a train?")
attr(kwic(txt, c("is", "a"), valuetype = "fixed"), "keywords") # works
# [1] is a is is a is a is is a a is a
# Levels: is a
attr(kwic(txt, phrase("is a"), valuetype = "fixed"), "keywords") # does not
# [1] <NA>
# Levels:
This is what makes this fix non-trivial.
True. Maybe we could match the first word according to its type index, to get the phrase as the |
Another difficult case is > stringi::stri_split_fixed(kwic(txt, phrase(c("is i*", "is it", "is in")))$keyword, " ")
[[1]]
[1] "is" "it"
[[2]]
[1] "is" "in"
[[3]]
[1] "Is" "it"
[[4]]
[1] "is" "it"
[[5]]
[1] "Is" "it"
[[6]]
[1] "is" "it" It is impossible to 1-1 match the patterns and keywords here. I think these require too much work only for |
OK, I tried but have given up for now trying to solve that issue. We've fixed all but the sequences. Since PR fixes some clear errors and represents a general improvement, I suggest we merge it, and outline the problems you identify above as a new issue to be solved separately. Those problems existed before this PR, so I don't see why they should hold up us solving other problems. See the warning notes I added to Plus the only real consequence is for the facet labels in |
Partial fixes can make problem even more complex. Why don't you remove the pattern label entirely from the plot as you proposed initially? I don't think we can solve the problem without fundamentally changing the kwic object considering the nature of phrasal matches. |
Well, dictionaries need facets, and it works for them in the PR (except for phrases) and is broken and wrong in master. As I mentioned, what's broken in the PR was also broken in master, but at least the PR fixes the broken bits for non-phrases. |
If the purpose is adding dictionary keys as labels, you could loop over keys and Lines 112 to 117 in 4ee90db
This is a slow method but the most accurate. kwic() cannot be use for inspecting giant tokens object anyway.We could make qatd_cpp_kwic() more like qatd_cpp_tokens_lookup() in the future.
|
That's a good solution. Essentially, it's an lapply of kwic over the elements of the list or dictionary keys, and reassembling the results including (attributes) to output the single kwic. I'll work on that when I emerge from the 3-day holiday tunnel that starts this evening. |
This allows us to reliably match the pattern with the returned keyword match, including for phrases.
OK, I implemented the "iterate over keys" approach, but did this in a way that is consistent for all different pattern types for kwic. It preserves the existing behaviours but now correctly associates the "keyword" attribute as the pattern that was matches, and passes this to Take a look and let me know if you agree that it's ready to merge. I'm eager to see the end of this one! |
- Remove flatten argument - Add keep_nomatch
- Reduce use of slow functions - Move patten to column
I changed the code to make it simpler and faster (along with upgrading of core functions). The main structural change is moving matched pattern in the |
@@ -93,7 +93,7 @@ split_values <- function(dict, concatenator_dictionary, concatenator_tokens) { | |||
concatenator_tokens)) | |||
names(result) <- key | |||
} | |||
result | |||
return(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes no functional difference, however according to our own style guide we should not use return()
to send an object's evaluation back to the parent environment at the end of a function. I'm not bothered by it either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I consider this one of the most strange R conventions. We should return values explicitly like in other (proper) programing languages. For me, Google R Style Guide is much more agreeable than tidyverse's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're about 90% compatible but the tidyverse guide is a bit more complete. We can't agree with https://google.github.io/styleguide/Rguide.xml#object for quanteda at least. (Sometimes with R you have to embrace a slight bit of chaos!)
We already have local exceptions (to both), so let's add the suggestion of an explicit return(finalresult)
to the quanteda Style Guide.
All looks good to me. I also thought about adding Ready to merge I'd say. |
🎉🎉 🎉 |
Corrects the problem where the keyword pattern was not displaying correctly at the top of the textplot_xray panel. This was caused by the kwic attribute
keywords
not getting properly set. Because this was different for characters versus dictionaries, I used a bit of hack to choose unique values (dictionary) or unique patterns (character vector).The rest of the changes are just linting.