Remove Telugu normalization of vu వు to ma మ from IndicNormalizer #14659

Open
@Trey314159

Description

Telugu vu వు and ma మ are visually similar, akin to English "rn" and "m", but they should not be conflated. Names like వెంకటరామ (Venkatarama) and వెంకటరావు (Venkatarao), and words like మండే and వుండే (both have entries on Telugu Wiktionary), are distinct.

It's like conflating "rn" and "m" to merge burn/bum and corn/com. It could happen when reading quickly or with poor handwriting, but it is not something that should happen for search indexing.
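
To make the collision concrete, here is a minimal sketch of what folding వు into మ does to the examples above. It is only an illustration: `conflate_vu_ma` is a hypothetical stand-in written for this issue, not the actual IndicNormalizer code, and the words are spelled out as code-point escapes so the comparison does not depend on how the text was copied.

```python
# Illustrative stand-in only: this is NOT Lucene's IndicNormalizer code.
VU = "\u0C35\u0C41"  # వు: letter VA (U+0C35) + vowel sign U (U+0C41)
MA = "\u0C2E"        # మ: letter MA (U+0C2E)


def conflate_vu_ma(text: str) -> str:
    """Hypothetical helper that rewrites every వు as మ."""
    return text.replace(VU, MA)


# The two word pairs from the description, spelled out as code points.
VENKATARAMA = "\u0C35\u0C46\u0C02\u0C15\u0C1F\u0C30\u0C3E\u0C2E"       # వెంకటరామ
VENKATARAO = "\u0C35\u0C46\u0C02\u0C15\u0C1F\u0C30\u0C3E\u0C35\u0C41"  # వెంకటరావు
MANDE = "\u0C2E\u0C02\u0C21\u0C47"        # మండే
VUNDE = "\u0C35\u0C41\u0C02\u0C21\u0C47"  # వుండే

# Report, for each pair, whether the words differ before the mapping
# and whether they become identical after it.
for a, b in [(VENKATARAMA, VENKATARAO), (MANDE, VUNDE)]:
    print(a, b,
          "| distinct before:", a != b,
          "| same after:", conflate_vu_ma(a) == conflate_vu_ma(b))
```

Each pair starts out as two different strings and ends up as one, which is the same merge a search index would see if this mapping were applied at analysis time.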

I notice that some of the Telugu elements of IndicNormalizer are in TeluguNormalizer, but this mapping is not—which is good!

(Sorry for the botched pull request. Obviously this change would also affect some tests, which need to be updated or re-evaluated.)

Version and environment details

My version:
"distribution" : "opensearch",
"number" : "1.3.20",
"lucene_version" : "8.10.1"

Running on x86_64 GNU/Linux in Docker 4.15.0 on macOS 13.6.3.
