Improve Turkish implementation #534

Open
gorkemgoknar opened this issue Sep 18, 2023 · 4 comments
Comments

@gorkemgoknar

gorkemgoknar commented Sep 18, 2023

Expected Behaviour

  • For Turkish, the words of a number are joined together without spaces.
    This is acceptable for government purposes (tax filing, etc.), but it does not reflect how numbers are normally written and will often fail in a TTS system.

  • For floating-point numbers, e.g. 1.5, the decimal separator is read as "virgül" (comma). This is expected and correct, but in a mathematical context there should also be an option to pronounce it as "nokta" (point).

  • Floating-point conversion only handles 2 decimal places; it should go to at least 7 (English goes to 15 decimal places).

* To avoid changing the default behaviour, there should be an override that adds spaces between words (the joined output fails in TTS).
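
As a rough illustration of the override requested above, here is a minimal standalone sketch. It is not the num2words implementation: `tr_int_tokens`, `tr_int_words`, and the `separator` parameter are hypothetical names, and only non-negative integers below 10^15 are handled. The default `separator=""` reproduces the joined output, so existing callers would see no change.

```python
ONES = ["", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli", "altmış", "yetmiş", "seksen", "doksan"]
SCALES = ["", "bin", "milyon", "milyar", "trilyon"]


def _group_tokens(n):
    """Return the words for one three-digit group, 0 < n < 1000."""
    hundreds, rest = divmod(n, 100)
    tens, ones = divmod(rest, 10)
    tokens = []
    if hundreds:
        # 100 is "yüz", not "bir yüz".
        if hundreds > 1:
            tokens.append(ONES[hundreds])
        tokens.append("yüz")
    if tens:
        tokens.append(TENS[tens])
    if ones:
        tokens.append(ONES[ones])
    return tokens


def tr_int_tokens(n):
    """Hypothetical helper: list of Turkish words for a non-negative integer."""
    if n < 0 or n >= 1000 ** len(SCALES):
        raise ValueError("only 0 <= n < 10**15 is supported in this sketch")
    if n == 0:
        return ["sıfır"]
    groups = []
    while n:
        n, group = divmod(n, 1000)
        groups.append(group)
    tokens = []
    for index in range(len(groups) - 1, -1, -1):
        group = groups[index]
        if group == 0:
            continue
        # 1000 is "bin", not "bir bin"; larger scales keep "bir" (bir milyon).
        if not (index == 1 and group == 1):
            tokens.extend(_group_tokens(group))
        if SCALES[index]:
            tokens.append(SCALES[index])
    return tokens


def tr_int_words(n, separator=""):
    """Join the words; the default keeps the current joined behaviour."""
    return separator.join(tr_int_tokens(n))


print(tr_int_words(12455544))
# onikimilyondörtyüzellibeşbinbeşyüzkırkdört
print(tr_int_words(12455544, separator=" "))
# on iki milyon dört yüz elli beş bin beş yüz kırk dört
```

Returning a token list first and joining at the end keeps the spacing decision in one place, which is one possible way to make the change opt-in.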

Actual Behaviour

num2words(12455544, lang="tr")
'onikimilyondörtyüzellibeşbinbeşyüzkırkdört'

num2words(1.5, lang="tr")
'birvirgülelli'

num2words(1.2455544, lang="tr")
'birvirgülyirmidört'  # the digits after the second decimal place are missing

In English:
num2words(1.2455544, lang="en")
'one point two four five five five four four'
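
For comparison with the English output above, a digit-by-digit reading of the fractional part could be sketched as follows. This is only an illustration: `spell_fraction_tr` and its `point_word` parameter are hypothetical names, not part of num2words, and whether digit-by-digit reading is the right convention for Turkish is an assumption carried over from the English behaviour. The number is passed as text so that no digits are lost to float formatting.

```python
DIGITS_TR = ["sıfır", "bir", "iki", "üç", "dört", "beş",
             "altı", "yedi", "sekiz", "dokuz"]


def spell_fraction_tr(number_text, point_word="virgül"):
    """Hypothetical helper: words for the separator and each fractional digit.

    Returns an empty list when the text has no fractional part.
    """
    _, dot, fraction = number_text.partition(".")
    if not dot or not fraction:
        return []
    if not (fraction.isascii() and fraction.isdigit()):
        raise ValueError("fractional part must contain only digits 0-9")
    # One word per digit, so leading zeros and long fractions are preserved.
    return [point_word] + [DIGITS_TR[int(ch)] for ch in fraction]


print(" ".join(spell_fraction_tr("1.2455544")))
# virgül iki dört beş beş beş dört dört
print(" ".join(spell_fraction_tr("1.5", point_word="nokta")))
# nokta beş
```

The words for the integer part would be placed in front of this list; the choice between "virgül" and "nokta" stays with the caller.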

Steps to reproduce

num2words(12455544, lang="tr")
'onikimilyondörtyüzellibeşbinbeşyüzkırkdört'

num2words(1.5, lang="tr")
'birvirgülelli'

num2words(1.2455544, lang="tr")
'birvirgülyirmidört'

@gorkemgoknar
Author

gorkemgoknar commented Sep 18, 2023

VNLP handles this better; maybe that implementation could be adopted (it offers a choice between comma and dot as the decimal separator).

https://github.com/vngrs-ai/vnlp/blob/b5011692c997b9d110827421d491bb3492d3b5dd/vnlp/normalizer/normalizer.py#L200

from vnlp import Normalizer
normalizer = Normalizer()

normalizer.convert_numbers_to_words(["1.523233351"], decimal_seperator=".")
#['bir', 'virgül', 'beş', 'yüz', 'yirmi', 'üç', 'bin', 'iki', 'yüz', 'otuz', 'üç']

# The default is incorrect for this input: it would be better to spell out the separator than to read one very big number.
# num2words should maybe have this implementation (with an option to join the words with or without spaces).
normalizer.convert_numbers_to_words(["1.523233351"])
#['bir', 'katrilyon', 'beş', 'yüz', 'yirmi', 'üç', 'milyon', 'iki', 'yüz', 'otuz', 'üç', 'bin', 'üç', 'yüz', 'elli', 'bir']


@gorkemgoknar
Author

gorkemgoknar commented Sep 18, 2023

From TDK (Türk Dil Kurumu, the Turkish Language Association):
https://tdk.gov.tr/icerik/yazim-kurallari/sayilarin-yazilisi/#:~:text=Birden%20fazla%20kelimeden%20olu%C5%9Fan%20say%C4%B1lar,35%20(alt%C4%B1y%C3%BCzelliTL%2Cotuzbe%C5%9Fkr.)

  1. Numbers consisting of more than one word are written separately: iki yüz, üç yüz altmış beş, bin iki yüz elli bir, etc.

  2. Numbers appearing in monetary transactions and in commercial documents such as promissory notes and cheques are written joined: 650,35 (altıyüzelliTL,otuzbeşkr.)

So for Turkish, num2words is currently suited only to the "monetary/currency" use case (clause 3), not to how numbers are actually written out in words.

Edit: This needs a fix too: a zero right after the decimal point is not spelled out.
num2words(84003.01, lang='tr')
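
One way to keep such leading zeros is to extract the fractional digits as text before converting them to words. The sketch below is an assumption about a possible approach, not a description of what num2words does internally; `fractional_digits` is a hypothetical helper. It goes through `decimal.Decimal` so that floats printed in exponent notation are handled as well.

```python
from decimal import Decimal


def fractional_digits(value):
    """Hypothetical helper: digits after the decimal point, as a string.

    Leading zeros are kept, e.g. 84003.01 gives "01".
    """
    # str() yields the shortest round-tripping form of a float
    # (e.g. '84003.01'), which Decimal then parses exactly.
    _, digits, exponent = Decimal(str(value)).as_tuple()
    if not isinstance(exponent, int) or exponent >= 0:
        return ""  # integers, or special values such as NaN / Infinity
    n_frac = -exponent
    # Take the last n_frac digits and pad on the left with zeros,
    # which restores zeros that sit directly after the point.
    return "".join(str(d) for d in digits[-n_frac:]).rjust(n_frac, "0")


print(fractional_digits(84003.01))  # 01
print(fractional_digits(1.5))       # 5
```

Each character of the returned string can then be spelled individually, so the "0" in "01" is read as "sıfır" instead of being dropped.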

@mrodriguezg1991
Contributor

Hello, I am currently maintaining the project, but I don't speak the language. If you have the time, you could submit an MR and we can include it in the next release.
Thanks

@gorkemgoknar
Author

> Hello, I am currently maintaining the project, but I don't speak the language. If you have the time, you could submit an MR and we can include it in the next release. Thanks

Sure, I will improve it when I have time.
One question though: I think it would be wise to make these changes overridable/optional, as I guess some people are using it as is, though I cannot confirm that.
