bug in converting some hanzi to numbers #114

Closed
xumingkuan opened this issue Dec 18, 2019 · 11 comments · Fixed by #413
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@xumingkuan

一千一十萬埃 is converted to 1.0090000000000005e-13.
一千一萬埃 is converted to 9.990000000000006e-14.
負一萬埃 is converted to 1.0000000000000006e-16.
一萬埃 is converted to -1.0000000000000006e-16.
And currently there is even assert.equal(hanzi2num("三千萬埃"), 2.9990000000000027e-13); in src/test/test.js.
None of them seem to be correct.

@carlyu99

Also, there are problems in converting numbers to hanzi.
For example, the output of 吾有一數曰一又一絲書之 is 一又九忽九微.
In particular, num2hanzi(1.0001) results in 一又九忽九微.

@LingDong-
Member

I believe some of the oddities are due to floating-point precision problems.
The problem with 萬埃 is indeed a bug, I think.
Because 萬埃 (10^4埃) is 塵. The parser did not expect it. But I guess it should be allowed. Similarly 一萬萬 一億億 should also be allowed. But 一千百 doesn't sound right.
I'll look into it! Thanks a lot for spotting the issue.
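
A minimal sketch of how floating-point precision alone could turn 1.0001 into 一又九忽九微. The helper `naiveFractionDigits` is hypothetical and is not taken from the project's code; it only assumes that decimal digits are peeled off by repeated multiplication by 10.

```javascript
// Hypothetical digit extraction, for illustration only.
function naiveFractionDigits(frac, count) {
  const digits = [];
  let rest = frac;
  for (let i = 0; i < count; i++) {
    rest *= 10;
    const d = Math.floor(rest);
    digits.push(d);
    rest -= d;
  }
  return digits;
}

// 1.0001 is stored as the nearest double, which lies slightly below 1.0001,
// so the fractional part comes out slightly below 0.0001.
const frac = 1.0001 - 1;
console.log(frac < 0.0001);                // true
console.log(naiveFractionDigits(frac, 6)); // [ 0, 0, 0, 0, 9, 9 ]

// Letting the runtime format the decimal string avoids the drift.
console.log((1.0001).toFixed(6));          // 1.000100
```

The digits 0, 0, 0, 0, 9, 9 correspond to nothing for 分 through 絲, then 九忽九微, which matches the reported output.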

LingDong- added the bug label on Dec 18, 2019
@pinxue

pinxue commented Dec 20, 2019

The output seems inconsistent:

吾有一數曰三萬三書之;吾有一數曰三萬三千三書之

三萬零三
三萬三千三百

@LingDong-
Member

@pinxue Thanks for spotting the bug. I'll fix it. Meanwhile if someone good with numbers can take a look at hanzi2num.js it would be very much appreciated.

LingDong- added the help wanted label on Dec 20, 2019
@farteryhr

farteryhr commented Dec 20, 2019

i once came across an accurate Chinese-to-number converter (i came up with many test cases and it handled them well) with very good handling of 零, an omitted last unit, and mixed Arabic numerals; it also supports a decimal point. it works up to nearly 1e16 because it uses double-precision numbers, failing on 九千零七兆一千九百九十二亿五千四百七十四万零九百九十三.
oops, the original website is down, but the archived version works.
https://web.archive.org/web/20190410215628/https://toshuo.com/chinese-tools/chinese-number-tool/#

most others only deal with the 1e12 range and 0.01 precision.
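
The failing input above is 9007199254740993, which is 2^53 + 1, the first integer a double cannot represent. A short sketch of that limit follows; using `BigInt` here is only one possible workaround, not something either converter is known to do.

```javascript
// 2^53 + 1, the value of the hanzi number that the double-based tool fails on.
const text = "9007199254740993";

const asDouble = Number(text);
console.log(asDouble === 9007199254740992);  // true: rounded to 2^53
console.log(Number.isSafeInteger(asDouble)); // false

// BigInt keeps every digit of an integer of any size.
const asBigInt = BigInt(text);
console.log(asBigInt.toString());            // 9007199254740993
```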

i wrote an arbitrary-precision, configurable (but integer-only for now) number-to-Chinese converter. it may serve as a reference and a test case generator, so i think it's appropriate to advertise it here.
http://farter.cn/number
test cases are mostly alternating 1 and 0's.
some thinking before (and some addition after) i actually wrote this: https://www.zhihu.com/question/22629654/answer/54295979

the current number-to-hanzi converter seems to lack 零 on:
百兆百億百萬百

omitting the last unit is not compatible with omitting all 零s. 一百一 is the minimal example (101/110). since wenyan is written Chinese and omitting the last unit is colloquial, and omitting 零 is frequent in wenyan, i suggest that we disallow omitting the last unit.

allowing 一萬萬, 一萬億, 一億億 etc. while 兆, 京 etc. are also enabled may be troublesome. consider 一萬億, 一萬零一億 (the toshuo one seems to simply replace "萬億" with "兆", so it fails on this), 一萬零一萬, 一億零一萬億 etc. refer to “最高用万” ("highest unit is 万") and “最高用亿” ("highest unit is 亿") in my converter.

i still don't have a clear idea about allowing 十百千萬 for small number units, but i think it may be troublesome too. btw according to wikipedia and many other sources, 分厘毫絲忽微纖沙塵埃渺漠 are all a factor of 10 apart. why do you say "Because 萬埃 (10^4埃) is 塵."?

@LingDong-
Member

LingDong- commented Dec 20, 2019

@farteryhr Wow! Thanks for all the resources. I'll peek into them and learn from those implementations.

why do you say "Because 萬埃 (10^4埃) is 塵."?

In fact I wrote the hanzi2num program quite a while ago for fun, and I am re-using the code for this project. At the time I read some source (can't find it now, so probably not a very reliable source :P) that says 塵埃渺漠 are 10^4 units while 忽微纖沙 are 10^1 units. (If I'm not mistaken, different authors of different dynasties define these very large/small numbers differently, e.g. 萬萬塵曰沙 from 算學啟蒙). But now checking wikipedia it seems that 10^1 makes more sense. I'll correct for that.

Thanks again.
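
A minimal sketch of the power-of-ten reading agreed on above, assuming the unit order 分厘毫絲忽微纖沙塵埃渺漠 quoted earlier in the thread. The names `FRACTION_UNITS` and `fractionUnitExponent` are hypothetical and do not come from hanzi2num.js.

```javascript
// Assumed order, each unit one tenth of the previous one: 分 = 10^-1 ... 漠 = 10^-12.
const FRACTION_UNITS = "分厘毫絲忽微纖沙塵埃渺漠";

// Returns the power of ten for a single fractional unit character.
function fractionUnitExponent(ch) {
  const index = FRACTION_UNITS.indexOf(ch);
  if (ch.length !== 1 || index < 0) {
    throw new RangeError(`not a fractional unit: ${ch}`);
  }
  return -(index + 1);
}

console.log(fractionUnitExponent("絲")); // -4
console.log(fractionUnitExponent("塵")); // -9
console.log(fractionUnitExponent("埃")); // -10
```

Under this assumed table, 一萬埃 would be 10^(4-10) = 10^-6, which is the value of 一微 rather than 一塵.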

@pinxue

pinxue commented Dec 22, 2019

For the fractional part, 納(10^-9)、皮(10^-12)、飛(10^-15) are more popular than other names.

Here is a full list: https://baike.baidu.com/item/数字/6204#8

@statementreply
Contributor

statementreply commented Dec 22, 2019

For floating point accuracy problems, I suggest that we only convert to/from its decimal string representation (in fixed or scientific notation), and let JavaScript/target language do the string to/from number conversion.

Conversion between binary floating point and decimal representation is extremely tricky. It took several decades of research to get a fully correct algorithm (that is not too slow), and it would require tens of dev-months to implement that from scratch. (Thus I also suggest not to implement things like Number.toFixed in pure wenyan)
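
The string-based suggestion above could look roughly like the following sketch. The function `digitsToNumber` and its `{ digit, exp }` input shape are assumptions made for illustration; they are not part of the project.

```javascript
// Hypothetical helper: place each decimal digit at its power-of-ten position
// in a string, then let the runtime parse the string once.
function digitsToNumber(terms) {
  // terms: array of { digit: 0..9, exp: integer power of ten }
  const maxExp = Math.max(0, ...terms.map((t) => t.exp));
  const minExp = Math.min(0, ...terms.map((t) => t.exp));
  const cells = new Array(maxExp - minExp + 1).fill(0);
  for (const { digit, exp } of terms) {
    cells[maxExp - exp] = digit;
  }
  const intPart = cells.slice(0, maxExp + 1).join("");
  const fracPart = cells.slice(maxExp + 1).join("");
  return Number(fracPart ? `${intPart}.${fracPart}` : intPart);
}

// 三分 computed arithmetically picks up a rounding error...
console.log(3 * 0.1 === 0.3);                                  // false
// ...while the string route yields the nearest double to 0.3.
console.log(digitsToNumber([{ digit: 3, exp: -1 }]) === 0.3);  // true
// 一又一絲
console.log(digitsToNumber([{ digit: 1, exp: 0 }, { digit: 1, exp: -4 }])); // 1.0001
```

The point of the design is that only `Number()` ever touches binary floating point, so the converter itself works on exact decimal digits.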

For Chinese-to-number conversion, there are lots of edge cases:

一千零一百 = Error
一千零一十 = 1010
一千一百 = 1100
一千一十 = 1010
一千零百 = 1000
一千零十 = 1010
一千百 = 1100
一千十 = 1010
一億零一萬零一百 = 1 0001 0100
一億零一百零一萬 = 1 0101 0000
一萬零一億零一百 = 1 0001 0000 0100
一萬零一百零一億 = 1 0101 0000 0000
一百零一億零一萬 = 101 0001 0000
一百零一萬零一億 = 101 0001 0000 0000
億萬百 = 1 0001 0100
億百萬 = 1 0100 0000
萬億百 = 1 0000 0000 0100
萬百億 = 1 0100 0000 0000
百億萬 = 100 0001 0000
百萬億 = 100 0000 0000 0000
一兆零一億零一萬 = 1 0001 0001 0000
一兆零一萬零一億 = Error
一億零一兆零一萬 = 1 0000 0001 0000 0001 0000
一億零一萬零一兆 = 1 0001 0001 0000 0000 0000
一萬零一兆零一億 = 1 0001 0001 0000 0000
一萬零一億零一兆 = 1 0001 0000 0001 0000 0000 0000
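
A few of the well-formed cases above can be reproduced with a small left-to-right parser. This is a simplified sketch, not the project's hanzi2num.js: the name `parseHanziInteger` is hypothetical, it handles integers with 十 百 千 萬 億 only, it requires an explicit digit before every unit except 十, and it does not detect the Error cases in the list.

```javascript
const DIGIT_CHARS = "零一二三四五六七八九";
const SMALL_UNITS = new Map([["十", 10], ["百", 100], ["千", 1000]]);
const BIG_UNITS = new Map([["萬", 1e4], ["億", 1e8]]);

function parseHanziInteger(text) {
  let total = 0;          // value already closed off by 萬 or 億
  let section = 0;        // value of the current group below the next big unit
  let digit = 0;          // digit still waiting for its unit
  let lastBig = Infinity; // most recent big unit seen

  for (const ch of text) {
    const value = DIGIT_CHARS.indexOf(ch);
    if (value > 0) {
      digit = value;
    } else if (value === 0) {
      // 零 is treated as a pure placeholder in this sketch.
    } else if (SMALL_UNITS.has(ch)) {
      if (digit === 0 && ch !== "十") {
        throw new SyntaxError(`bare unit ${ch} is outside this sketch`);
      }
      section += (digit || 1) * SMALL_UNITS.get(ch);
      digit = 0;
    } else if (BIG_UNITS.has(ch)) {
      const unit = BIG_UNITS.get(ch);
      section += digit;
      if (section === 0) {
        throw new SyntaxError(`bare unit ${ch} is outside this sketch`);
      }
      // A big unit larger than the previous one multiplies everything read so
      // far (一萬零一億 = 10001 億); otherwise it only scales its own group.
      total = unit > lastBig ? (total + section) * unit : total + section * unit;
      section = 0;
      digit = 0;
      lastBig = unit;
    } else {
      throw new SyntaxError(`unexpected character ${ch}`);
    }
  }
  return total + section + digit;
}

console.log(parseHanziInteger("一千零一十"));       // 1010
console.log(parseHanziInteger("一億零一萬零一百")); // 100010100
console.log(parseHanziInteger("一萬零一億零一百")); // 1000100000100
```

A trailing digit is read as ones here (三萬三千三 gives 33003), so the colloquial omitted-last-unit reading discussed earlier is deliberately not supported.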

@farteryhr
Copy link

For the fractional part, 納(10^-9)、皮(10^-12)、飛(10^-15) are more popular than other names.

Here is a full list: https://baike.baidu.com/item/数字/6204#8

It's the SI scheme, where all levels differ by 10^3; the corresponding big units are 千 兆 吉 太 拍 etc. my converter also supports reading big integers in the SI scheme in Chinese (which sounds weird).

@kaiyuan01
Contributor

Added test cases for 渺、埃、尘、沙、纤、微. The cases for 纤 and 尘 failed, so the compiler needs a code change.

@kaiyuan01
Contributor

Just wanted to add: only 纤 and 尘 failed. If we support simplified Chinese, we may need to change the compiler code (their traditional Chinese counterparts work as expected).
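
One way such a change could look is to normalize simplified characters before parsing. This is a sketch under that assumption, not the fix that was merged; `normalizeNumerals` is a hypothetical name, and the table lists only the two characters reported as failing.

```javascript
// Only the two failing characters from the report; more pairs could be added.
const SIMPLIFIED_TO_TRADITIONAL = new Map([
  ["纤", "纖"],
  ["尘", "塵"],
]);

// Replaces each known simplified character and leaves everything else as-is.
function normalizeNumerals(text) {
  return Array.from(text, (ch) => SIMPLIFIED_TO_TRADITIONAL.get(ch) || ch).join("");
}

console.log(normalizeNumerals("三纤")); // 三纖
console.log(normalizeNumerals("一尘")); // 一塵
```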

7 participants