Optimized regexp for matching tags. #206

emorozov · 2019-11-09T19:55:41Z

A tiny optimization for the SQLite regexp matching: it removes the no-op .* before and after the regexp that matches tags. I'm not sure it affects performance noticeably, but the pattern is sloppy as written, because .* matches anything and doesn't need to be included.
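To illustrate why the surrounding .* is redundant, here is a minimal sketch. It assumes the SQLite REGEXP function is backed by an unanchored Go `regexp` match, as the discussion suggests; the `#golang\b` pattern and the `agree` helper are hypothetical and not taken from the WriteFreely source.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// padded mimics a pattern wrapped in ".*" on both sides (hypothetical example pattern).
	padded = regexp.MustCompile(`.*#golang\b.*`)
	// bare is the same pattern without the wrapping.
	bare = regexp.MustCompile(`#golang\b`)
)

// agree reports whether both patterns give the same yes/no answer for s.
func agree(s string) bool {
	return padded.MatchString(s) == bare.MatchString(s)
}

func main() {
	samples := []string{
		"i like #golang a lot",
		"#golang",
		"no tags here",
		"#golangish",
		"line one\n#golang on line two",
	}
	for _, s := range samples {
		fmt.Printf("%q agree=%v\n", s, agree(s))
	}
}
```

Because `MatchString` looks for a match anywhere in the input, both patterns accept and reject the same strings; the wrapped form only gives the engine more work to do.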

I have signed the CLA

emorozov · 2019-11-10T04:25:04Z

The second commit adds support for non-ASCII tags. My blog is in Russian, and I kept wondering why only English tags work. So I studied the WriteFreely sources and some of the Go docs, and it seems that Go's regexp fully supports only ASCII text. For example, \b is an ASCII-only word boundary: it won't match at the edge of a word written in a non-Latin alphabet.
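The \b behaviour can be reproduced with a short sketch. In Go's `regexp` syntax, \b is defined as an ASCII word boundary, so Cyrillic letters count as non-word characters. The `tagMatches` helper and the sample tags are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagMatches builds "#<tag>\b" and reports whether it matches text.
// The helper name and pattern shape are illustrative assumptions.
func tagMatches(text, tag string) bool {
	re := regexp.MustCompile(`#` + regexp.QuoteMeta(tag) + `\b`)
	return re.MatchString(text)
}

func main() {
	// ASCII tag: "g" is a word character and the space is not, so \b matches.
	fmt.Println(tagMatches("#golang rocks", "golang"))
	// Cyrillic tag: neither "т" nor the space is an ASCII word character,
	// so there is no boundary for \b to match.
	fmt.Println(tagMatches("#привет мир", "привет"))
}
```

The first call reports a match and the second does not, even though both tags are followed by a space.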

I've added a very simple workaround that attempts to match #tag_text[whitespace|punctuation] when searching for a tag. It doesn't solve all the issues with poor handling of non-English languages by Go and SQLite (e.g. the LOWER(content) part of the query converts only English text to lower case), but it is good enough for most cases.
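One way to read "#tag_text[whitespace|punctuation]" is sketched below. This is not the pattern from the commit: the character class `[\s\p{P}]`, the end-of-text alternative, and the `tagFollowedBySpaceOrPunct` helper are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagFollowedBySpaceOrPunct reports whether "#<tag>" appears in text and is
// followed by whitespace, Unicode punctuation, or the end of the text.
// Hypothetical stand-in for the workaround described above.
func tagFollowedBySpaceOrPunct(text, tag string) bool {
	re := regexp.MustCompile(`#` + regexp.QuoteMeta(tag) + `(?:[\s\p{P}]|$)`)
	return re.MatchString(text)
}

func main() {
	fmt.Println(tagFollowedBySpaceOrPunct("#привет мир", "привет"))   // followed by a space
	fmt.Println(tagFollowedBySpaceOrPunct("#привет, мир", "привет"))  // followed by a comma
	fmt.Println(tagFollowedBySpaceOrPunct("#приветствие", "привет")) // prefix of a longer word
}
```

The first two calls match and the third does not, because the text continues with another letter. Characters outside the whitespace and punctuation classes, such as math symbols, would not end a tag under this reading.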

thebaer · 2020-03-12T19:29:57Z

Thanks for contributing this. While I think this is a good stopgap for instances that need it, I'd prefer we implement a more permanent fix that supports all possible character sets on the front and back end. Part of this work will likely involve larger database changes, including tracking hashtags associated with posts instead of doing things with regular expressions.
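As a rough sketch of what tracking hashtags per post could look like, the tags might be extracted once when a post is saved and stored separately, so lookups no longer depend on matching post content. Everything here is hypothetical: the `extractTags` helper, the `hashtagRe` pattern, and the choice to normalize with `strings.ToLower` are assumptions, not WriteFreely's design.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// hashtagRe is a hypothetical pattern: '#' followed by one or more
// Unicode letters, digits, or underscores.
var hashtagRe = regexp.MustCompile(`#([\p{L}\p{N}_]+)`)

// extractTags returns the distinct hashtags found in content, lowercased
// with Unicode-aware case mapping, in order of first appearance.
func extractTags(content string) []string {
	seen := make(map[string]bool)
	var tags []string
	for _, m := range hashtagRe.FindAllStringSubmatch(content, -1) {
		tag := strings.ToLower(m[1])
		if !seen[tag] {
			seen[tag] = true
			tags = append(tags, tag)
		}
	}
	return tags
}

func main() {
	fmt.Println(extractTags("Учу #Golang и #Привет, снова #golang; (#tag)"))
}
```

The resulting list could then be written to a post-to-tag table at save time. The sketch does not try to skip a `#` that appears inside a URL or in the middle of a word, which a real implementation would need to decide on.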

I'll leave this open for now, but ideally we can fix #219 with a more robust system than what we have today.

emorozov · 2020-03-12T19:47:48Z

Isn't it better to have a solution that is not perfect but works today while waiting for a better system? It's your choice, of course, but this PR replaces one regexp with another, and while it's not a perfect solution, the original regexp isn't optimal either. E.g. the .* at the beginning and end is just a no-op that wastes regexp engine time.

thebaer · 2020-03-13T19:59:14Z

I agree and am fine with removing the .* patterns, but I'm not a fan of introducing new regressions in the word boundary matching just to fix this (e.g. it breaks on #tag; or #tag)). If we can get SQLite matching closer to the POSIX word boundary we use in the MySQL query, I'll be happy to merge that.
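One possible direction for a boundary that handles both `#tag;` and `#tag)` as well as non-ASCII tags is to treat any character that is not a Unicode letter, digit, or underscore as the end of the tag. The sketch below is an assumption about how that could be written with Go's `regexp`; the `hasTag` helper is hypothetical, and it makes no claim to reproduce the MySQL word-boundary semantics exactly.

```go
package main

import (
	"fmt"
	"regexp"
)

// hasTag reports whether "#<tag>" appears in text and is followed by a
// character that is not a Unicode letter, digit, or underscore, or by the
// end of the text. This emulates a trailing word boundary (hypothetical).
func hasTag(text, tag string) bool {
	re := regexp.MustCompile(`#` + regexp.QuoteMeta(tag) + `(?:[^\p{L}\p{N}_]|$)`)
	return re.MatchString(text)
}

func main() {
	for _, s := range []string{"#tag;", "(#tag)", "#tagged"} {
		fmt.Printf("%q tag=%v\n", s, hasTag(s, "tag"))
	}
	for _, s := range []string{"#тег)", "#тегом"} {
		fmt.Printf("%q тег=%v\n", s, hasTag(s, "тег"))
	}
}
```

With this pattern `#tag;`, `(#tag)`, and `#тег)` match, while `#tagged` and `#тегом` do not, since the text continues with a letter.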

thebaer · 2020-05-29T09:51:58Z

Closing now since there hasn't been any progress. If you want to make those improvements, please feel free to reopen this!

emorozov mentioned this pull request Dec 4, 2019

Only hashtags with English/ASCII names are supported #219

Open

ghost requested a review from thebaer December 4, 2019 15:26

thebaer linked an issue Feb 6, 2020 that may be closed by this pull request

Only hashtags with English/ASCII names are supported #219

Open

emorozov force-pushed the develop branch from 338620d to 7b02667 Compare February 16, 2020 09:51

Rudimentary support for non-ascii tags.

b899ba3

emorozov force-pushed the develop branch from 7b02667 to b899ba3 Compare February 16, 2020 09:53

thebaer closed this May 29, 2020
