pyzet grep two-way support for searching for non-ASCII characters #34

Open
tpwo opened this issue Apr 15, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

tpwo commented Apr 15, 2022

Alphabets that use non-ASCII characters are annoying to grep for, and there should be a way to enable convenient search patterns that deal with this problem.

Problem statement

  • A user used a non-ASCII character in the zettel content and would like to find it using only ASCII characters

  • A user used an ASCII character in the zettel content (for any reason: laziness, a mistake, copied text) and would like to find it when searching for its non-ASCII counterpart as well

Example

E.g. for Polish we have:

ą -- a
ć -- c
ę -- e
ł -- l
ń -- n
ó -- o
ś -- s
ź -- z
ż -- z

Of course, capital letters should also be supported.
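
The table above can be written down as data. A minimal sketch, assuming a plain Python dict is an acceptable representation; the names `POLISH_TO_ASCII`, `with_capitals`, and `to_ascii` are hypothetical and not part of pyzet:

```python
# Hypothetical mapping table; the names here are illustrative only.
POLISH_TO_ASCII = {
    "ą": "a", "ć": "c", "ę": "e", "ł": "l", "ń": "n",
    "ó": "o", "ś": "s", "ź": "z", "ż": "z",
}


def with_capitals(mapping):
    """Return a copy of the mapping extended with upper-case pairs."""
    extended = dict(mapping)
    for src, dst in mapping.items():
        extended[src.upper()] = dst.upper()
    return extended


FULL_MAPPING = with_capitals(POLISH_TO_ASCII)
# str.maketrans accepts a dict whose keys are single characters.
TRANSLATION = str.maketrans(FULL_MAPPING)


def to_ascii(text):
    """Replace every mapped character with its ASCII counterpart."""
    return text.translate(TRANSLATION)


print(to_ascii("żółta gęś"))   # zolta ges
print(to_ascii("ŻÓŁTA GĘŚ"))   # ZOLTA GES
```

Deriving the capital letters from the lower-case table keeps the per-language data small, but it only works for alphabets where `upper()` yields a single character.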

Behaviors

  • grepping for zolta ges should find żółta gęś

  • grepping for żółta gęś should find zolta ges -- (use case: we want to find text copied from someone who hasn't used diacritics)

  • probably controlled with a special flag or even multiple flags (i.e. there can be different modes: a single two-way mode or two one-way modes)

Implementation

  • the git grep pattern should probably be modified so that it contains alternatives (OR parts) wherever either of two characters should match

  • However, multiple non-ASCII chars can map to a single ASCII char, e.g. both ż and ź map to z. In such a case, all three should be matched when grepping for z, but only two when grepping for ż (ż and z) or ź (ź and z), because ż and ź shouldn't be treated as the same letter

  • There are many languages, so hard-coding these rules for Polish doesn't seem like the best idea. I would prefer to create some kind of abstraction layer so that the rules can be added independently for each language. Maybe custom mappings could even be part of a config file (how YAML handles non-ASCII still needs to be checked), but I think built-in support for selected languages can be included as well.

  • Above, I only considered char-to-char mappings. But there are cases where multiple ASCII characters map to a single non-ASCII char (e.g. German ß maps to ss). I'm not sure whether extending the approach to cover them is trivial.
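
The OR-based pattern described in this list could be sketched as follows. This is only an illustration under assumptions: the `RULES` table and the `build_pattern` function are hypothetical, capital letters are not handled, and the result is checked here with Python's `re` module only. Whether the same pattern behaves identically when passed to `git grep -E` would still need to be verified.

```python
import re

# Hypothetical rule table: non-ASCII character -> ASCII counterpart.
RULES = {
    "ą": "a", "ć": "c", "ę": "e", "ł": "l", "ń": "n",
    "ó": "o", "ś": "s", "ź": "z", "ż": "z",
    "ß": "ss",  # one character mapping to two ASCII characters
}


def build_pattern(query, rules=RULES):
    """Build a regex that matches `query` in both directions."""
    # Reverse index: ASCII string -> every non-ASCII character mapping to it.
    reverse = {}
    for non_ascii, ascii_form in rules.items():
        reverse.setdefault(ascii_form, []).append(non_ascii)
    # Multi-character ASCII forms such as "ss", longest first.
    multi = sorted((k for k in reverse if len(k) > 1), key=len, reverse=True)

    def group(alternatives):
        return "(" + "|".join(alternatives) + ")"

    def single(ch):
        if ch in rules:      # ż -> (ż|z), but never ź
            return group([re.escape(ch), re.escape(rules[ch])])
        if ch in reverse:    # z -> (z|ź|ż)
            return group([re.escape(c) for c in [ch, *reverse[ch]]])
        return re.escape(ch)

    parts, i = [], 0
    while i < len(query):
        for key in multi:
            if query.startswith(key, i):
                # "ss" may be ß, or each s expanded on its own.
                spelled_out = "".join(single(c) for c in key)
                parts.append(
                    group([re.escape(c) for c in reverse[key]] + [spelled_out])
                )
                i += len(key)
                break
        else:
            parts.append(single(query[i]))
            i += 1
    return "".join(parts)


print(bool(re.search(build_pattern("zolta ges"), "żółta gęś")))  # True
print(bool(re.search(build_pattern("ż"), "ź")))                  # False
```

Because the rules are plain data, a per-language table (or a user-supplied one from a config file) could be swapped in without touching the pattern builder.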

@tpwo tpwo added the enhancement New feature or request label Apr 15, 2022

tpwo commented Aug 12, 2022

Regarding conversions from ß to ss, Python actually implements a method based on the Unicode standard that converts such characters automatically:

https://docs.python.org/3/library/stdtypes.html#str.casefold

I need to do some research, because converting single-letter non-ASCII characters to ASCII seems even simpler, and there is probably a ready-to-use solution available.

The opposite, however, would probably be much harder. A brute-force solution is to use casefold() on the zettels' contents, but it would probably be very slow in bigger ZK repos.
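
One candidate for such a ready-made solution is the standard `unicodedata` module. The sketch below is an assumption about what could work rather than a decided approach: it decomposes the text (NFD) and drops the combining marks. The helper name `strip_accents` is hypothetical. Note that ł has no Unicode decomposition, so it survives this step and would still need an explicit mapping.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print("Straße".casefold())          # strasse
print(strip_accents("żółta gęś"))   # zołta ges (ł is left untouched)
```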


EDIT: in the database world there is a collation called Latin1_General_CI_AI (known from SQL Server), which defines how strings are compared and ordered. It seems like a solution to the problem I want to solve here.

Basically, if two characters look similar (e.g., o and ó), this collation treats them as the same. CI stands for 'Case Insensitive' and AI for 'Accent Insensitive'. There is also an accent-sensitive variant: AS.
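
A rough Python approximation of such a comparison, for illustration only: this is not how a database implements collations, and the names `collation_key` and `same` are hypothetical. The `accent_sensitive` parameter mimics the AI/AS distinction.

```python
import unicodedata


def collation_key(text, accent_sensitive=False):
    """Return a key that ignores case and, by default, accents."""
    folded = unicodedata.normalize("NFD", text.casefold())
    if accent_sensitive:
        return folded
    return "".join(ch for ch in folded if not unicodedata.combining(ch))


def same(a, b, accent_sensitive=False):
    """Compare two strings by their collation keys."""
    return collation_key(a, accent_sensitive) == collation_key(b, accent_sensitive)


print(same("o", "Ó"))                         # True  (CI_AI-like)
print(same("o", "ó", accent_sensitive=True))  # False (CI_AS-like)
```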
