Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized regexp for matching tags. #206

Closed
wants to merge 1 commit into from
Closed

Conversation

@emorozov
Copy link

@emorozov emorozov commented Nov 9, 2019

A tiny optimization for the SQLite regexp matching removing no-op .* before and after the regexp matching tags. I'm not sure if it affects performance noticeably, but it's an inaccurate regexp because .* will match anything and doesn't really have to be included.


  • I have signed the CLA
@emorozov
Copy link
Author

@emorozov emorozov commented Nov 10, 2019

Second commit adds ability to support non-ascii tags. My blog is in Russian and I was all the time wondering why only tags in English work. So I did study writefreely sources and some of the Golang docs, and it seems that Golang regexp fully supports only English language. For example, \b works only for English, it won't match word boundary for any word written in non-English alphabet.

I've added a very simple workaround that attempts to match #tag_text[whitespace|punctuation] when searching for a tag. While it doesn't solve all the issues with poor handling of non-English languages by Golang and SQLite (e.g. LOWER(content) part of the query converts only English text to lower case) but it is good enough for the most cases.

@robjloranger robjloranger requested a review from thebaer Dec 4, 2019
@thebaer thebaer linked an issue that may be closed by this pull request Feb 6, 2020
@emorozov emorozov force-pushed the emorozov:develop branch from 338620d to 7b02667 Feb 16, 2020
@emorozov emorozov force-pushed the emorozov:develop branch from 7b02667 to b899ba3 Feb 16, 2020
@thebaer
Copy link
Member

@thebaer thebaer commented Mar 12, 2020

Thanks for contributing this. While I think this is a good stopgap for instances that need it, I'd prefer we implement a more permanent fix that supports all possible character sets on the front and back end. Part of this work will likely involve larger database changes, including tracking hashtags associated with posts instead of doing things with regular expressions.

I'll leave this open for now, but ideally we can fix #219 with a more robust system than what we have today.

@emorozov
Copy link
Author

@emorozov emorozov commented Mar 12, 2020

Isn't it better to have a solution that is not perfect but works today while waiting for a better system? It's your choice, of course, but this PR replaces one regexp with another regexp, so while it's not a perfect solution, but original regexp isn't optimal either. E.g. .* at the beginning and end is just a NOP that wastes regexp engine time.

@thebaer
Copy link
Member

@thebaer thebaer commented Mar 13, 2020

I agree and am fine with removing the .* patterns, but I'm not a fan of introducing new regressions in the word boundary matching just to fix this (e.g. it breaks on #tag; or #tag)). If we can get SQLite matching closer to the POSIX word boundary we use in the MySQL query, I'll be happy to merge that.

@thebaer
Copy link
Member

@thebaer thebaer commented May 29, 2020

Closing now since there hasn't been any progress. If you want to make those improvements, please feel free to reopen this!

@thebaer thebaer closed this May 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

2 participants
You can’t perform that action at this time.