Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

common: Cache Highlight/Ignore/nick rules, port to QRegularExpression, fixes #415

Merged

Conversation

digitalcircuit
Copy link
Contributor

@digitalcircuit digitalcircuit commented Aug 17, 2018

How do I use this?

Check out the Quassel wiki page on pattern matching.

Added since this has been merged.

In short

  • Add ExpressionMatch class, unifying regular expression handling
    • Supports Phrase, MultiPhrase, Wildcard, MultiWildcard, and RegEx patterns
    • Condenses patterns into a single regular expression, reusing it
    • Fixes inability to escape ?, *, !, and ; in Wildcard/MultiWildcard
    • Backed by partial tests in ExpressionMatchTests
  • Add NickHighlightMatcher class, unifying and caching nickname highlights
    • Track nickname per network, automatically rebuild pattern on change
    • Remove cache on network removal
  • Port local/core-side highlights, ignore rules to above new classes
    • Use ExpressionMatch for all, NickHighlightMatcher for highlights
    • Should significantly improve performance when many rules are enabled
  • Fix issue with invalid whitespace-only CTCP ignore rules
    • Check length of sender types before attempting to use value
  • Add documentation to Quassel FAQ on new escape patterns
  • Clean up
    • Properly handle ignore rule scope with newlines
    • Fix invalid resource path in settings page
    • Remove broken scopeMatch() function from util.cpp
Criteria Rank Reason
Impact ★★★ 3/3 User-facing performance, ability to escape *, !, etc
Risk ★★★ 3/3 Manual migrations, possible broken highlight/ignore rules
Intrusiveness ★★☆ 2/3 Modifes highlight/ignore processing, may conflict with other changes

Thanks to @sandsmark for the initial QRegularExpression efforts!

Rationale

Quassel processes potentially hundreds of messages at a time to apply highlight and ignore rules, some retroactively (ignore rules, local highlights). To improve performance, Quassel should compile down regular expressions to the simplest format, and cache them for reuse until changed. Similarly, on Qt 5, porting to QRegularExpression provides further speedups.

Some IRC channels can begin with !. Quassel scope matching needs a way to escape ! to match it literally, e.g. for !channel.

Spam messages can contain * and ? characters, such as **** ATTENTION ****. Quassel should provide a way to escape these characters to match them literally.

Breaking changes

  • Wildcard no longer supports character classes, like [abc] to match a, b, or c
    • For highlight rules and ignore rule contents, manually switch to RegEx mode and rewrite the rule
    • For ignore rule scope, no workaround exists as Regular expression does not affect the scope (legacy reasons)
  • \ in Wildcard rules must now be escaped
    • \* translates to literally matching *, \? to literal ?
    • \\[...] translates to literally matching \[...]
    • Necessary to support escaping *, ?, and ;
  • ! at the start of Wildcard rules now inverts the rule
    • E.g. !*some message* sets a rule for everything that does not contain some message
    • \![...] escapes the ! to literally match ![...]
    • \\![...] escapes the \ to literally match \![...]

Implementation

ExpressionMatch

ExpressionMatch implements an iterative parser, tracking state while looping through the list. This is necessary to provide escaping sequences for Wildcard and MultiWildcard, akin to how Qt implemented their algorithm but dropping the character class support, and adding handling ; as a separator.

ExpressionMatch will take any input pattern and simplify it into a single regular expression, optimizing for speed.

To avoid separating documentation from code, I've put most of the implementation remarks into the comments of the code itself.

It is not easily possible to implement escape sequences with simple string-to-regex substitution.

Test cases are implemented as opt-in Q_ASSERT() via ExpressionMatchTests::runTests(). These are not run by default. A quick edit to quassel.cpp will add them in for development tests:

--- a/src/common/quassel.cpp
+++ b/src/common/quassel.cpp
@@ -54,6 +54,8 @@
 
 #include "../../version.h"
 
+#include "expressionmatchtests.h"
+
 Quassel *Quassel::instance()
 {
     static Quassel instance;
@@ -124,6 +126,8 @@ bool Quassel::init()
         return false;
     }
 
+    ExpressionMatchTests::runTests();
+
     // Don't keep a debug log on the core
     return instance()->logger()->setup(runMode() != RunMode::CoreOnly);
 }

This is designed to be adaptable to a proper test framework whenever Quassel adopts one.

NickHighlightMatcher

NickHighlightMatcher builds upon ExpressionMatch, using the MultiPhrase mode, and stores nickname-based expression match objects per network.

Whenever a nickname set changes, or a network is removed, the cache is deleted and built anew on demand.

Examples

Legend

Symbol Meaning
🏁 Behavior in 0.12.5 stable release
☑️ Behavior in 0.13rc1
🆕 New or changed in this pull request

Modes of operation

Nickname matching always uses MatchMode::MultiPhrase.

Regular expressions / RegEx is NOT checked

Item MatchMode for patterns Status
Highlight rule contents Phrase 🏁 In stable
Highlight rule sender/channel MultiWildcard ☑️ In 0.13rc1
Ignore rule contents Wildcard 🏁 In stable
Ignore rule scope MultiWildcard 🏁 In stable

Regular expressions / RegEx is checked

Item MatchMode for patterns Status
Highlight rule contents RegEx 🏁 In stable
Highlight rule sender/channel RegEx 🏁 In stable
Ignore rule contents RegEx 🏁 In stable
Ignore rule scope MultiWildcard 🏁 In stable

Ignore rule scope is fixed to MultiWildcard for backwards compatibility.

Phrase (Phrase)

This is used for highlight rules when RegEx is not checked

🏁 In stable

word
  • Matches
    • word
    • A word.
    • has a word in it
  • Does not match
    • wording

🆕 New to this PR
There's a leading space; this can be used for more exact matches

 spaced
  • Matches
    • , spaced
    • is spaced out
  • Does not match
    • ;spaced

Multiple phrases (MultiPhrase)

This is used for nickname matching.

🏁 In stable

nick
othernick
test
  • Matches
    • nick
    • nick[something]
    • othernick: ...
    • pinging test hi
  • Does not match
    • nicks
    • tests

Quassel 0.12.5 also would match nick[something]; if this isn't desirable, it can be changed, albeit with some complexity, needing a new Nickname matching mode. Quassel currently assumes non-word characters are not valid in nicknames as per the \W in phrase regular expression.

Wildcard (Wildcard)

This is used for ignore rule contents when Regular expression is not checked

🆕 Changed in this PR

*Asking questions\?  Nope?
  • Matches
    • Asking questions? Nopea
    • Basking questions? Nope.
  • Does not match
    • Asking questions. Nope.
    • Asking questions? Nopes.

🆕 New to this PR
Implicit wildcard is supported, too

!*filter*
  • Matches
    • Everything, unless it contains something from the below list
  • Does not match
    • filter
    • filtering
    • #nofilter yo

🆕 New to this PR
Exclamation points can be escaped at beginning (not required elsewhere)

\!*filter*
  • Matches
    • !filter
    • !yes filtering
  • Does not match
    • filter

🆕 New to this PR
Escape character can be escaped

\\!*filter*
  • Matches
    • \!filter
    • \!yes filtering
  • Does not match
    • filter
    • !filter

Multiple wildcards (MultiWildcard)

This is used for ignore rule scope regardless of Regular expression checkbox, and for highlight rule sender/channel when RegEx is not checked

🆕 Changed in this PR

Alice!*; Bob!*@example.com; Carol*!*; !Caroline!*
Dan!*; escaped \; separator; \!not-inverted; \\!slash-prefixed

Newlines and ; are interchangeable in ignore rule scope; highlight rules treat Enter as submit

  • Matches
    • Alice![...]
    • Bob![...]@example.com
    • Carol[...]![...] except as noted below
    • Dan![...]
    • escaped ; separator
    • !not-inverted
    • \!slash-prefixed
  • Does not match
    • Caroline![...]
    • Malice![...]
    • John!

☑️ In 0.13rc1
Implicit wildcard is supported, too

!Announce*!*; !Wheatley!aperture@*
  • Matches
    • Everything, unless it contains something from the below list
  • Does not match
    • Announce[...]![...]
    • Wheatley!aperture@[...]

Regular expression (RegEx)

This is used for ignore rule contents and highlight rule contents/sender/channel when RegEx is checked

🏁 In stable

simple.\*escape-match.*
  • Matches
    • simpleA*escape-match
    • simpleA*escape-matchBBBB
  • Does not match
    • not above
    • simpleA*escape-mat
    • simple*escape-match
    • simpleABBBBescape-matchBBBB

🏁 In stable
Inverted rules are supported, too

!invert.\*escape-match.*
  • Matches
    • Everything, unless it matches something from the below list
  • Does not match
    • invertA*escape-match
    • invertA*escape-matchBBBB

🆕 Changed in this PR
Exclamation points can be escaped at beginning (not required elsewhere)

\!simple.\*escape-match.*
  • Matches
    • !simpleA*escape-matchBBBB
  • Does not match
    • simpleA*escape-matchBBBB

🆕 New to this PR
Escape character can be escaped at beginning

\\!simple.\*escape-match.*
  • Matches
    • \!simpleA*escape-matchBBBB
  • Does not match
    • !simpleA*escape-matchBBBB

Though I checked, I may still have made mistakes in the examples above. If you want some samples to test with, try out the above.

@digitalcircuit digitalcircuit force-pushed the ft-regex-have-your-cache-and-eat-it-too branch 4 times, most recently from 07fb3d8 to 3b2cfdc Compare August 23, 2018 04:57
@digitalcircuit
Copy link
Contributor Author

digitalcircuit commented Aug 23, 2018

Outdated notes to self on work-in-progress version
Click to expand.

Notes to self: I'll need to switch this to drop backtracking and use looping instead. Two dedicated loops should be fine, one for handling escapable ; splitting in IgnoreListEditItemSomethingDlg, and one for Wildcard up-conversion in ExpressionMatch that handles ; splitting and \* + \? reg-ex escape fix-ups.

Also ignore greedy nature, * is greedy in shell globbing by default. That performance can be profiled/fixed later, should be much smaller difference than combining and caching regex instances.

@digitalcircuit digitalcircuit force-pushed the ft-regex-have-your-cache-and-eat-it-too branch from 3b2cfdc to 3d1a6d9 Compare August 24, 2018 06:26
@digitalcircuit
Copy link
Contributor Author

digitalcircuit commented Aug 24, 2018

Outdated notes to self on work-in-progress version
Click to expand.

Notes to self:

  • Maybe splitting first, ignoring the QRegExp::escape(), then wildcardToRegEx upconverting each individual part will work better after all.
    • ! at start inverts, \! at start converts to !, ! elsewhere stays !, \! elsewhere stays \!.
      ; splits, \; doesn't split and gets converted to ;, \\; splits and \\ stays as \\ (to be converted into \ in wildcardToRegex()), \[...] stays as \[...] where [...] represents any other character (not ;)
    • \n always splits.
  • And the trimming logic of IgnoreListEditDlg::widgetHasChanged() will always ignore ! (treat it as the same, don't remove it), and never splits on \n (as that's splitted separately).
  • Add a ::Wildcard vs. ::MultiWildcard mode, so Ignore Rules won't have splitting (just the direct wildcard convert), but the Ignore Rule/Highlight rule scope will.
  • Also need to double-check for space at the start of ; rules due to rule1; rule2 having a space before rule2. Probably need to trim that off.
  • Also need to fix util::scopeMatch()
  • Also check CTCP ignores with scope of Sender

@digitalcircuit digitalcircuit force-pushed the ft-regex-have-your-cache-and-eat-it-too branch 2 times, most recently from 6f6f1fd to 461a2a5 Compare August 25, 2018 06:42
@digitalcircuit
Copy link
Contributor Author

digitalcircuit commented Aug 25, 2018

Tentative working draft is here! Just need to do all the documentation and testing...

Aside, the qDebug() << "Regenerated nickname matching cache for network ID" << netId; could be left out.. I'm not sure what the policies are around Quassel's noisiness at debug levels.

@digitalcircuit digitalcircuit force-pushed the ft-regex-have-your-cache-and-eat-it-too branch 8 times, most recently from c412436 to 226117c Compare August 26, 2018 03:23
@genius3000
Copy link
Contributor

Haven't been able to get fully into testing this but gave it a quick try with both client/core updated. Also to note, my test core doesn't have enough backlog/large enough channels to really test the time to load backlog issue. I'll try to remedy that for more in-depth testing.

Initial good findings:

  • Ignore List

    • Regex of .*(test|foo|bar).* works
    • Wildcard of meow*, meow, meow\*, meow\? work (matching wildcard or literal escaped character, as expected)
  • Remote Highlights

    • Simple regexes (ie. .*meow.*, meow.*) work
    • Literal matching (non-regex) of meow, meow* work
  • Local Highlights (currently same as Remote Highlights)

Noted concerns (minor/thoughts):

  • Ignore list manager logs a Warning of trying to set an empty rule, when changing options in the widget (and the rule is empty):
    [Warn ] void IgnoreListManager::IgnoreListItem::setContents(const QString&) Asked to set an ignore rule to empty
    Example: Create new rule, change from Dynamic to Permanent, Sender to Message, non-regex to regex, etc.
    At a glance, it looks like it would be caused by this line which calls .setContents() whenever the widget is updated. Is it trying to save the rule with every change rather than just upon clicking Ok?

  • Local/Remote Highlights
    I was actually surprised that these didn't support wildcard matching when set non-regex. Had to double check with current 0.13-rc1 to verify intended behavior. I've always used a few regex rules to match/not match specific criteria and never noticed this before. I realize the match allows for a trailing ';', ',' and maybe others. Would it be worth allowing wildcard matching here or just rely on the regex option when people want that ability?

 
Thanks for the awesome work so far with this, it should definitely improve things and I'll try to get more testing done before too late.

Fix ignorelisteditdlg.ui's reference to oxygen_icons.qrc, pointing it
to the new location.
@digitalcircuit digitalcircuit force-pushed the ft-regex-have-your-cache-and-eat-it-too branch from 226117c to cf53160 Compare September 1, 2018 21:44
@digitalcircuit
Copy link
Contributor Author

@genius3000 I've addressed the warning issue (by removing it). It's not a crashing issue, it just means that the ignore rule won't ever match. Quassel never warned about this in the past, so it should be fine to leave as is.

Supporting wildcard for local/remote highlights is a nice idea, and now's the time to do any backwards-incompatible protocol breaking and string changes before 0.13 becomes stable, to provide a Phrase, Wildcard, and RegEx dropdown. That said, regular expressions work, too, so I feel it might not be worth the extra complexity. Either way, it'd be a follow-up pull request to this 🙂

Thank you for your initial testing! I've got my own testing and documentation to write, then this should be okay...

@digitalcircuit digitalcircuit force-pushed the ft-regex-have-your-cache-and-eat-it-too branch from cf53160 to e8933e0 Compare September 2, 2018 00:21
@digitalcircuit digitalcircuit changed the title [WIP] Cache Highlight/Ignore rules, port to QRegularExpression [WIP] common: Cache Highlight/Ignore/nick rules, port to QRegularExpression, fixes Sep 2, 2018
@digitalcircuit digitalcircuit force-pushed the ft-regex-have-your-cache-and-eat-it-too branch 4 times, most recently from 3e48e36 to 666cf88 Compare September 2, 2018 04:30
@digitalcircuit
Copy link
Contributor Author

2018-9-1 - More testing is welcome!

I tested the basics (highlight match, nick match, ignore rules), and did more piecemeal tests during development, but I haven't extensively tried patterns outside of ExpressionMatchTests.

@digitalcircuit digitalcircuit changed the title [WIP] common: Cache Highlight/Ignore/nick rules, port to QRegularExpression, fixes common: Cache Highlight/Ignore/nick rules, port to QRegularExpression, fixes Sep 2, 2018
@digitalcircuit digitalcircuit force-pushed the ft-regex-have-your-cache-and-eat-it-too branch 2 times, most recently from 98bd576 to 8475e2a Compare September 2, 2018 07:44
@digitalcircuit
Copy link
Contributor Author

digitalcircuit commented Sep 2, 2018

Made a small typo in a test case and pull request documentation (regex \\!test* matches \!test[...]), no impact on matching code. Fixed.

Edit 2018-9-2: Fixed typos in comments, no change to functionality or tests (fixed description of wildcard setting of internal regex objects).

Add ExpressionMatch class to unify handling of various search
expressions in Quassel, including automatically caching the regular
expression instances and providing Qt 4/5 compatibility.

The source expression depends on the matching mode:
* ExpressionMatch::MatchPhrase
Match the entire phrase, looking for whitespace or beginning/end
around either side of the phrase.  No further processing.
* ExpressionMatch::MatchMultiPhrase
Same as MatchPhrase, but split the expression on newlines ("\n") and
treat as match successful if any phrase is found.  This avoids having
to create multiple ExpressionMatch classes just to match multiple
phrases.
* ExpressionMatch::MatchWildcard
Split on ";" and newlines ("\n"), and apply basic wildcard globbing,
with "*" representing any characters and "?" a single character.
Prefixing a section with "!" turns it into an invert-match, negating
any other matching rules.  If only invert-match rules exist, matching
is true unless an invert-rule applies.  "\[...]" escapes the given
character.
* ExpressionMatch::MatchRegEx
Treat expression as a regular expression, inverting if prefixed with
"!" (and not escaped as "\!").

Cached regular expression objects are updated whenever changing any
parameters.

When Qt 4 support is dropped, the QT_VERSION macros can be adjusted.

This lays the foundation for performance and readibility improvements
in future commits.
Add ExpressionMatchTests class to test the functionality of
ExpressionMatch, helping ensure it works across Qt 4 and Qt 5.

This is implemented as a series of Q_ASSERT()'s as Quassel does not
yet have an actual testing framework.  Once a test framework is
settled on, this class can be migrated to unit tests.

To use ExpressionMatchTests:

1.  Include it in "quassel.cpp"

2.  Call ExpressionMatchTests::runTests() near end of Quassel::init()
// DEBUG: Run tests!
ExpressionMatchTests::runTests();
Handle "\n" and ";" as separator in scope rules.  This fixes using
newlines in the Configure Ignore Rule dialog.

Make use of ExpressionMatch::trimMultiWildcardWhitespace() to handle
all of the arcane details, unifying code into one place.
Port HighlightRule objects to ExpressionMatch class, providing easy
caching and simplifying expression handling.

Migrate HighlightRule struct into a full-blown class for easier
management and greater assurance over automatic internal cache
management.

Port HighlightRuleManager to ExpressionMatch for nickname matching,
providing easy caching and simplifying expression handling.
(Noticing a theme?)

Add tons of documentation comments, too, and fix up line lengths.
Port IgnoreListItem objects to ExpressionMatch class, providing easy
caching and simplifying expression handling.

Migrate IgnoreListItem struct into a full-blown class for easier
management and greater assurance over automatic internal cache
management.

Add tons of documentation comments, too, and fix up line lengths.

Thanks to @sandsmark for the initial efforts towards the
QRegularExpression migration; it helped a lot!
When constructing a CTCP ignore, check that splitting on whitespace
results in non-empty list before taking from the list.  If it's
empty, fall back to assuming any valid sender, i.e. "*"/".*"
depending on whether the rule is set as wildcard or regex.

This fixes issues when receiving a CTCP with an invalid CTCP ignore
rule such as "  ".
Port QtUiMessageProcessor HighlightRule objects to ExpressionMatch
class, providing easy caching and simplifying expression handling.

Migrate HighlightRule struct into a full-blown class for easier
management and greater assurance over automatic internal cache
management.

Port QtUiMessageProcessor to ExpressionMatch for nickname matching,
providing easy caching and simplifying expression handling.

Add tons of documentation comments, too, and fix up line lengths.

NOTE: Legacy highlight rules lack support for "sender", "ID", and
inverse rules.  There's no technical reason for this, just lack of
developer time.  Feel free to add support in the future!  Just make
sure to not miss any part - I'd suggest simply replacing the legacy
LegacyHighlightRule class in QtUiMessageProcessor with the full one
from HighlightRuleManager, and don't forget to fix
> QtUiMessageProcessor::HighlightRule::operator!=()
...and...
> QtUiMessageProcessor::HighlightRule::determineExpressions()
Remove scopeMatch() from util.cpp - not only is it no longer needed,
superseded by ExpressionMatch, the scopeMatch behavior broke previous
Quassel functionality.

Details: Channel names can begin with "!", and scopeMatch() does not
support escaping "!" in the beginning of rules.  Escaping support
needs to be added.. but ExpressionMatch can now handle this.  Let us
keep the ugly parser loops confined to one place...
Add NickHighlightMatcher class to unify handling of nick highlight
matching in Quassel, including automatically updating the expression
matcher instance as needed per network.

Cached ExpressionMatch objects are updated on demand after any change
in nickname configuration or active/configured nicks.

This lays the foundation for performance and readibility improvements
in future commits.
Port HighlightRuleManager nick highlights to NickHighlightMatcher
class, providing easy caching and simplifying expression handling.

This fixes nickname caching being reset when switching between
networks.

Add SIGNAL/SLOT traversal to pass on information about network
removal to clean up per-network nickname highlight caches, avoiding
memory leaks.
Port QtUiMessageProcessor nick highlights to NickHighlightMatcher
class, providing easy caching and simplifying expression handling.

This fixes nickname caching being reset when switching between
networks.

Add SIGNAL/SLOT traversal to pass on information about network
removal to clean up per-network nickname highlight caches, avoiding
memory leaks.
@digitalcircuit digitalcircuit force-pushed the ft-regex-have-your-cache-and-eat-it-too branch from 8475e2a to 899ca3a Compare September 2, 2018 16:52
@digitalcircuit
Copy link
Contributor Author

With the unifying changes in this pull request, I have been wondering how little (or much) work is left to give Local Highlights all the functionality of Remote Highlights - add a settings minor version upgrade, copy the settings page table UI and code (minus import button), copy the HighlightRule class, add a few more message checks, change the import mechanics...

We'll see if/when this is merged, maybe.

@Sput42 Sput42 merged commit 4c4fe2b into quassel:master Sep 3, 2018
@digitalcircuit digitalcircuit deleted the ft-regex-have-your-cache-and-eat-it-too branch September 3, 2018 22:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants