Improve Turkish implementation #534

gorkemgoknar · 2023-09-18T11:54:57Z

Expected Behaviour

For Turkish, the number words are joined together without spaces.
This is acceptable for government purposes (tax filing etc.), but it does not reflect real wording and will often fail in TTS.
For floating-point numbers, e.g. 1.5, it uses "virgül" (comma) as the decimal separator. This is expected and correct, though in a mathematical sense there should also be an option to pronounce it as "nokta" (point).
Floating point only works for 2 decimal places; it should go to at least 7 decimal places (English goes to 15).

*To avoid changing the default behaviour, there should be an override that adds spaces between words (the joined form fails on TTS).
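A minimal sketch of what such an override could look like, for integers only. This is not the num2words implementation: `tr_int_to_words` and its `joiner` parameter are hypothetical names used for illustration. The default (`joiner=""`) keeps the current joined output, and passing `joiner=" "` gives the spaced form.

```python
ONES = ["", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli", "altmış", "yetmiş", "seksen", "doksan"]
SCALES = ["", "bin", "milyon", "milyar", "trilyon"]


def _group_tokens(n):
    """Words for one three-digit group, 0 < n < 1000."""
    hundreds, rest = divmod(n, 100)
    tens, ones = divmod(rest, 10)
    tokens = []
    if hundreds:
        if hundreds > 1:  # 100 is "yüz", not "bir yüz"
            tokens.append(ONES[hundreds])
        tokens.append("yüz")
    if tens:
        tokens.append(TENS[tens])
    if ones:
        tokens.append(ONES[ones])
    return tokens


def tr_int_tokens(n):
    """Split a non-negative integer into a list of Turkish number words."""
    if n == 0:
        return ["sıfır"]
    if not 0 < n < 10 ** (3 * len(SCALES)):
        raise ValueError("number out of range for this sketch")
    groups = []
    while n:
        n, group = divmod(n, 1000)
        groups.append(group)
    tokens = []
    for idx in range(len(groups) - 1, -1, -1):
        group = groups[idx]
        if group == 0:
            continue
        if not (idx == 1 and group == 1):  # 1000 is "bin", not "bir bin"
            tokens.extend(_group_tokens(group))
        if idx:
            tokens.append(SCALES[idx])
    return tokens


def tr_int_to_words(n, joiner=""):
    """Join the words; joiner="" keeps the legacy joined form (hypothetical option)."""
    return joiner.join(tr_int_tokens(n))


print(tr_int_to_words(12455544))              # onikimilyondörtyüzellibeşbinbeşyüzkırkdört
print(tr_int_to_words(12455544, joiner=" "))  # on iki milyon dört yüz elli beş bin beş yüz kırk dört
```

Keeping the words as a token list until the last step means the spaced and joined forms come from the same conversion, so existing users would see no change unless they opt in.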

Actual Behaviour

num2words(12455544, lang="tr")
'onikimilyondörtyüzellibeşbinbeşyüzkırkdört'

num2words(1.5, lang="tr")
'birvirgülelli'

num2words(1.2455544, lang="tr")
'birvirgülyirmidört' >> the digits after the second decimal place are missing

In English:
num2words(1.2455544, lang="en")
'one point two four five five five four four'

Steps to reproduce

from num2words import num2words

num2words(12455544, lang="tr")
'onikimilyondörtyüzellibeşbinbeşyüzkırkdört'

num2words(1.5, lang="tr")
'birvirgülelli'

num2words(1.2455544, lang="tr")
'birvirgülyirmidört'

gorkemgoknar · 2023-09-18T12:19:10Z

VNLP handles this better; maybe adopt that implementation (it offers a choice between comma and dot as the decimal separator).

https://github.com/vngrs-ai/vnlp/blob/b5011692c997b9d110827421d491bb3492d3b5dd/vnlp/normalizer/normalizer.py#L200

from vnlp import Normalizer
normalizer = Normalizer()

normalizer.convert_numbers_to_words(["1.523233351"], decimal_seperator=".")
#['bir', 'virgül', 'beş', 'yüz', 'yirmi', 'üç', 'bin', 'iki', 'yüz', 'otuz', 'üç']

# Incorrect by default: the input is read as one very big number. It would be better to spell out the separator,
# so num2words should maybe adopt this implementation (with an option to join the words with or without spaces).
normalizer.convert_numbers_to_words(["1.523233351"])
#['bir', 'katrilyon', 'beş', 'yüz', 'yirmi', 'üç', 'milyon', 'iki', 'yüz', 'otuz', 'üç', 'bin', 'üç', 'yüz', 'elli', 'bir']
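To illustrate why the separator choice matters, here is a small sketch of splitting a numeric string with a configurable decimal separator. It is not VNLP's code: `split_number` and its `decimal_separator` parameter are hypothetical names. The assumption is that whichever of "," and "." is not the decimal separator gets treated as a thousands separator and dropped.

```python
def split_number(text, decimal_separator=","):
    """Return (integer_digits, fraction_digits) for a numeric string.

    Hypothetical helper: the character that is not the decimal separator
    is assumed to be a thousands separator and is removed.
    """
    if decimal_separator not in (",", "."):
        raise ValueError("decimal_separator must be ',' or '.'")
    thousands_separator = "." if decimal_separator == "," else ","
    integer_part, _, fraction_part = text.partition(decimal_separator)
    integer_part = integer_part.replace(thousands_separator, "")
    return integer_part, fraction_part


print(split_number("1.523233351"))                         # ('1523233351', '')
print(split_number("1.523233351", decimal_separator="."))  # ('1', '523233351')
```

With the comma default, the dot disappears and the whole string becomes one large integer, which matches the kind of surprise shown in the second call above.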

gorkemgoknar · 2023-09-18T12:51:00Z

From TDK (Turkish Language Association):
https://tdk.gov.tr/icerik/yazim-kurallari/sayilarin-yazilisi/#:~:text=Birden%20fazla%20kelimeden%20olu%C5%9Fan%20say%C4%B1lar,35%20(alt%C4%B1y%C3%BCzelliTL%2Cotuzbe%C5%9Fkr.)

Numbers consisting of more than one word are written separately: iki yüz, üç yüz altmış beş, bin iki yüz elli bir, etc.
In monetary transactions and in commercial documents such as promissory notes and cheques, numbers are written joined: 650,35 (altıyüzelliTL,otuzbeşkr.)

So the Turkish output of num2words follows only the "monetary/currency" convention (clause 3), not the way numbers are actually spelled as words.

Edit: This needs a fix too: the zero after the decimal point is not spelled out.
num2words(84003.01, lang='tr')
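One way to keep leading zeros and longer fractions is to spell the fractional part digit by digit from the number's string form, as the English output above does. A minimal sketch under that assumption; `tr_fraction_tokens` and `point_word` are hypothetical names, not part of num2words, and the integer part is left to whatever converter is already in use.

```python
from decimal import Decimal

DIGITS = ["sıfır", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]


def tr_fraction_tokens(value, point_word="virgül"):
    """Return the separator word followed by one word per fractional digit.

    Going through str() and Decimal avoids scientific notation and keeps
    zeros directly after the decimal point.
    """
    text = format(Decimal(str(value)), "f")
    _, _, fraction = text.partition(".")
    if not fraction:
        return []  # no fractional part, nothing to spell
    return [point_word] + [DIGITS[int(digit)] for digit in fraction]


print(tr_fraction_tokens(84003.01))
# ['virgül', 'sıfır', 'bir']
print(tr_fraction_tokens("1.2455544", point_word="nokta"))
# ['nokta', 'iki', 'dört', 'beş', 'beş', 'beş', 'dört', 'dört']
```

Passing the value as a string sidesteps float rounding for inputs with many decimal places.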

mrodriguezg1991 · 2023-09-20T19:41:36Z

Hello, I am currently maintaining the project; however, I don't speak the language. If you have the time, you could submit an MR and we can include it in the next release.
Thanks

gorkemgoknar · 2023-09-21T13:38:01Z

Sure, I will improve it when I have time.
One question though: I think it would be wise to make these changes overridable/optional, as I guess some people are using it as is, though I cannot confirm.
