`pyzet grep` two-way support for searching for non-ASCII characters #34

tpwo · 2022-04-15T22:53:06Z

Alphabets that use non-ASCII characters are annoying to grep for, and there should be a way to enable convenient search patterns that deal with this problem.

Problem statement

A user wrote a non-ASCII character in the zettel content and would like to find it using only ASCII characters.
A user wrote an ASCII character in the zettel content (for any reason: laziness, a mistake, copied text) and would like to find it when searching for its non-ASCII counterpart as well.

Example

E.g. for Polish we have:

ą -- a
ć -- c
ę -- e
ł -- l
ń -- n
ó -- o
ś -- s
ź -- z
ż -- z

Of course, capital letters should also be supported.

Behaviors

grepping for zolta ges should find żółta gęś
grepping for żółta gęś should find zolta ges -- (use case: we want to find text copied from someone who didn't use diacritics)
probably controlled with a special flag or even multiple flags (i.e. there can be different modes: a single two-way mode or two one-way modes)

Implementation

The git grep pattern should probably be modified so that it contains OR parts wherever either of two characters should match.
However, multiple non-ASCII characters can map to a single ASCII character, e.g. both ż and ź map to z. In that case, all three should be matched when grepping for z, but only two when grepping for ż or ź: the letter itself and z (because ż and ź shouldn't be treated as the same letter).
There are many languages, so hard-coding these rules for Polish doesn't seem like a good idea. I would prefer some kind of abstraction layer so that the rules can be added independently for each language. Maybe custom mappings could even be part of a config file (how YAML handles non-ASCII is still to be checked), but I think built-in support for selected languages can be included.
Above, I only considered the case of a char-to-char mapping. But there are cases where multiple ASCII characters map to a single non-ASCII character (e.g. German ß maps to ss). I'm not sure whether it is trivial to extend the approach to cover this.
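
To make the OR idea concrete, here is a minimal sketch of the pattern expansion. It is not pyzet code: the `POLISH` table and the `expand_pattern` helper are hypothetical names, and the result has only been checked against Python's `re` module, so whether it can be handed to git grep unchanged is an assumption that would need testing.

```python
import re

# Hypothetical per-language table: non-ASCII letter -> ASCII counterpart.
POLISH = {
    "ą": "a", "ć": "c", "ę": "e", "ł": "l", "ń": "n",
    "ó": "o", "ś": "s", "ź": "z", "ż": "z",
}


def expand_pattern(query, mapping):
    """Return a regex that matches `query` with or without diacritics."""
    # Reverse view: ASCII letter -> every non-ASCII letter that maps to it.
    reverse = {}
    for accented, plain in mapping.items():
        reverse.setdefault(plain, []).append(accented)

    parts = []
    for ch in query:
        if ch in mapping:
            # ż matches ż or z, but never ź.
            alternatives = [ch, mapping[ch]]
        elif ch in reverse:
            # z matches z, ź and ż.
            alternatives = [ch] + reverse[ch]
        else:
            alternatives = [ch]

        if len(alternatives) == 1:
            parts.append(re.escape(ch))
        else:
            escaped = "|".join(re.escape(a) for a in alternatives)
            parts.append("(" + escaped + ")")
    return "".join(parts)


pattern = expand_pattern("zolta ges", POLISH)
print(bool(re.search(pattern, "To jest żółta gęś.")))
```

Using `(a|b)` groups rather than character classes means a forward entry such as `"ß": "ss"` would also work, because an alternative may be longer than one character. The reverse direction (typing `ss` and finding `ß`) is not covered, since the query is scanned one character at a time. Capital letters are not handled here either; a case-insensitive search flag is one possible way to cover them.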


tpwo · 2022-08-12T20:04:55Z

Regarding conversions from ß to ss, Python actually implements a function based on the Unicode standard that can convert such characters automatically:

https://docs.python.org/3/library/stdtypes.html#str.casefold

I need to do some research, because converting single non-ASCII letters to ASCII seems even simpler, and there is probably a ready-to-use solution available.

The opposite direction, however, would probably be much harder. A brute-force solution is to use casefold() on the zettels' contents, but that would probably be very slow in bigger ZK repos.
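
For illustration, the brute-force idea could look roughly like this. The `casefold_search` helper is a hypothetical name, and the zettel contents are assumed to be already loaded as a list of lines; this is a sketch, not how pyzet reads its repository.

```python
def casefold_search(query, lines):
    """Return the lines containing `query`, comparing casefolded text."""
    needle = query.casefold()
    # Every line is folded again on every search, which is why this
    # approach is expected to be slow on large repositories.
    return [line for line in lines if needle in line.casefold()]


lines = ["Die Straße ist lang", "Nothing to see here"]
print(casefold_search("strasse", lines))
print(casefold_search("STRASSE", lines))
```

Note that `casefold()` only performs Unicode case folding: it turns ß into ss, but it leaves letters such as ż or ó untouched, so it does not solve the diacritics part on its own.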

EDIT: in the database world there is a collation called Latin1_General_CI_AI, which is used to compare and order strings consisting of different character sets. It seems like a solution to the problem I want to solve here.

Basically, if two characters look similar (e.g. o and ó), this collation treats them as the same. AI stands for 'Accent Insensitive'. There is also a sensitive variant: AS.
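
A rough approximation of a case- and accent-insensitive comparison can be built from the standard library alone. This is only a sketch of the idea, not an implementation of the database collation: `ai_key` and `ai_equal` are hypothetical names, and the approach is to decompose characters with `unicodedata.normalize` and drop the combining marks.

```python
import unicodedata


def ai_key(text):
    """Build a comparison key that ignores case and combining accents."""
    # NFKD splits e.g. ó into o + COMBINING ACUTE ACCENT.
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    # Drop the combining marks, keep the base characters.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def ai_equal(a, b):
    """Compare two strings ignoring case and combining accents."""
    return ai_key(a) == ai_key(b)


print(ai_equal("o", "ó"))
print(ai_equal("GĘŚ", "ges"))
print(ai_equal("ł", "l"))
```

One caveat for Polish: ł has no Unicode decomposition, so it survives this key unchanged and would still need an explicit mapping entry.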

tpwo added the enhancement New feature or request label Apr 15, 2022
