pyzet grep: two-way support for searching for non-ASCII characters
#34
Labels: enhancement (New feature or request)
Alphabets that use non-ASCII characters are annoying to grep for, and there should be a way to enable convenient search patterns that deal with this problem.
Problem statement
A user used a non-ASCII character in the zettel content and would like to find it using only ASCII characters.
A user used an ASCII character in the zettel content (for any reason: laziness, a mistake, copied text) and would like to find it also when searching for its non-ASCII counterpart.
Example
E.g. for Polish we have:
Of course, capital letters should also be supported.
Behaviors
grepping for "zolta ges" should find "żółta gęś"
grepping for "żółta gęś" should find "zolta ges" (use case: we want to find text copied from someone who hasn't used diacritics)
This is probably controlled with a special flag or even multiple flags (i.e. there can be different modes: a single two-way mode or two one-way modes).
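A minimal sketch of the three modes mentioned above, assuming they are modelled as an enum. The names Mode, TO_ASCII and alternatives are hypothetical and not part of pyzet, the table is only a partial Polish mapping, and how the modes would be exposed as flags is left open.

```python
from enum import Enum


class Mode(Enum):
    ASCII_TO_DIACRITIC = "ascii-to-diacritic"  # "z" in the query also matches "ż", "ź"
    DIACRITIC_TO_ASCII = "diacritic-to-ascii"  # "ż" in the query also matches "z"
    TWO_WAY = "two-way"                        # both of the above


# Hypothetical, partial Polish table: non-ASCII letter -> ASCII counterpart.
TO_ASCII = {"ż": "z", "ź": "z", "ó": "o", "ł": "l", "ę": "e", "ś": "s"}


def alternatives(ch, mode):
    """Return the characters that `ch` from the query is allowed to match."""
    result = [ch]
    if mode in (Mode.ASCII_TO_DIACRITIC, Mode.TWO_WAY):
        # An ASCII letter widens to every non-ASCII letter that maps to it.
        result += [d for d, a in TO_ASCII.items() if a == ch]
    if mode in (Mode.DIACRITIC_TO_ASCII, Mode.TWO_WAY):
        # A non-ASCII letter widens only to its own ASCII counterpart.
        if ch in TO_ASCII:
            result.append(TO_ASCII[ch])
    return result
```

With this split, each one-way mode leaves the other direction untouched: in the diacritic-to-ASCII mode a plain "z" in the query still matches only "z".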
Implementation
The git grep pattern should probably be modified in such a way that it looks for strings with OR parts wherever one or the other character should match. However, multiple non-ASCII chars can map to a single ASCII char, e.g. both ż and ź map to z. In such a case, all three should be detected when grepping for z, but only two when grepping for ż or ź (because ż and ź shouldn't be treated as the same letter).

There are many languages, so hard-coding these rules for Polish doesn't seem like the best idea under the sun. I would prefer to create some kind of abstraction layer, so the rules can be added independently for each language. Maybe it could even be part of a config file for custom mappings (how YAML handles non-ASCII remains to be checked), but I think that built-in support for given languages can be included.
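A rough sketch of the pattern rewriting described above, assuming a per-language table of non-ASCII letters and their ASCII counterparts. MAPPINGS and build_pattern are hypothetical names, the Polish table is my own illustration, and the escaping is done with Python's re.escape, so whether the result can be handed to git grep unchanged (for example with -E) would still need checking.

```python
import re

# Hypothetical built-in tables: language -> {non-ASCII letter: ASCII counterpart}.
MAPPINGS = {
    "pl": {"ą": "a", "ć": "c", "ę": "e", "ł": "l", "ń": "n",
           "ó": "o", "ś": "s", "ź": "z", "ż": "z"},
}


def build_pattern(query, lang):
    """Expand each character of `query` into an OR group of its variants."""
    to_ascii = dict(MAPPINGS[lang])
    # Capital letters follow the same rules as their lowercase forms.
    to_ascii.update({k.upper(): v.upper() for k, v in MAPPINGS[lang].items()})

    parts = []
    for ch in query:
        if ch in to_ascii:
            # Non-ASCII letter: itself or its ASCII counterpart, never its siblings.
            variants = [ch, to_ascii[ch]]
        else:
            # ASCII letter: itself plus every non-ASCII letter that maps to it.
            variants = [ch] + sorted(k for k, v in to_ascii.items() if v == ch)
        if len(variants) == 1:
            parts.append(re.escape(ch))
        else:
            parts.append("(" + "|".join(re.escape(v) for v in variants) + ")")
    return "".join(parts)


# "z" widens to three letters, "ż" only to two.
print(build_pattern("z", "pl"))  # (z|ź|ż)
print(build_pattern("ż", "pl"))  # (ż|z)
```

Adding another language would then only mean adding another entry to the table, which is the kind of abstraction layer I have in mind.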
Above, I only considered the situation where we have a char-to-char mapping. But there are cases where multiple ASCII characters map to a single non-ASCII char (e.g. German ß maps to ss). I'm not sure if it is trivial to extend the approach like that.
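One possible way to handle the multi-character case, sketched under the assumption that the query is split into tokens with the longest match tried first. TO_ASCII and build_pattern are hypothetical names, and the greedy left-to-right matching is a simplification that may not be right for every language.

```python
import re

# Hypothetical table where the ASCII side may be longer than one character.
TO_ASCII = {"ß": "ss"}


def build_pattern(query, to_ascii):
    """Expand single characters and multi-character sequences into OR groups."""
    variants = {}
    for non_ascii, ascii_seq in to_ascii.items():
        # The non-ASCII char matches itself or its ASCII spelling.
        variants.setdefault(non_ascii, [non_ascii]).append(ascii_seq)
        # The ASCII spelling matches itself or any char that maps to it.
        variants.setdefault(ascii_seq, [ascii_seq]).append(non_ascii)
    # Try longer tokens first so that "ss" wins over a plain "s".
    tokens = sorted(variants, key=len, reverse=True)

    parts, i = [], 0
    while i < len(query):
        for tok in tokens:
            if query.startswith(tok, i):
                group = "|".join(re.escape(v) for v in variants[tok])
                parts.append("(" + group + ")")
                i += len(tok)
                break
        else:
            # No mapping applies at this position: match the character literally.
            parts.append(re.escape(query[i]))
            i += 1
    return "".join(parts)


print(build_pattern("strasse", TO_ASCII))  # stra(ss|ß)e
print(build_pattern("straße", TO_ASCII))   # stra(ß|ss)e
```

Because the groups use alternation rather than single-character classes, the same code path covers both char-to-char and sequence-to-char rules.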