Word and character counts for Chinese #1657
Comments
Yes, I'm aware there are some issues here. I'm happy to improve CJK support in general, as long as someone can assist with how to best implement it. Currently, word count is the primary metric used in various places, with character and paragraph as secondary metrics. I'm wondering if adding a CJK flag in Preferences that treats text stats differently is the way to go? Then we can redefine the GUI elements and various logic according to this. I would need someone to define what changes this mode should make though, and where. |
Incidentally, I'm redesigning the Project Details tool as well, where I want to add some more text analysis features. I know CJK character count has been requested before, and it's one of the additions I have in mind. Basically, I want to define a tool box of text analysis and stats tools that the user can select from to generate a report for a given novel folder. This is something where user contributions may be very helpful. I will create a framework for it, and adding language-specific ones shouldn't be a problem. |
On a side note, if you have the time to update the remaining missing translations (6% of the text), I can include a complete translation in the 2.2.1 release I'm planning really soon. Only complete translations are updated. |
Yes, it might be a good solution to set different character count rules for different languages in the preferences. Besides, during my use in these past few days, I have also found some Chinese translations that do not fit well in the Chinese context. I hope I can re-translate all the content into Chinese again. |
I don't know if you are familiar with languages like Chinese and Japanese, but I can give you a simple example:
This sentence means: "This is very good software, I like it very much." By the usual Chinese convention, this sentence contains 13 characters. In English, words are separated by spaces, so a counter that is not designed for Chinese reports this sentence as 2 words, even though it has 13 characters. In Chinese, the number of words is usually not considered at all. Perhaps for Chinese and similar languages the counter does not need to count words, only characters. In the current novelWriter, the 'characters' count is indeed the number of Chinese characters, but the 'words' count is not the number of vocabulary words; it is the number of sentences. For example,
|
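To illustrate the point made in the comment above, here is a minimal sketch of why a whitespace-based word counter gives misleading numbers for Chinese. It is not novelWriter's actual counter; the function names and the sample strings are hypothetical, and the Chinese sample is not the sentence from the comment.

```python
def whitespace_word_count(text: str) -> int:
    """Count words by splitting on whitespace, as is typical for English."""
    return len(text.split())


def non_space_char_count(text: str) -> int:
    """Count every character that is not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


# Hypothetical samples, chosen only for illustration.
english = "This is very good software."
chinese = "这是一个很好的软件。我很喜欢。"

print(whitespace_word_count(english))  # words separated by spaces
print(whitespace_word_count(chinese))  # no spaces, so the whole text is one "word"
print(non_space_char_count(chinese))   # characters, including punctuation
```

Because the Chinese sample contains no spaces, the whitespace-based counter sees a single token, which is why the 'words' figure is of little use there.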
Thanks for the clarification. I have a few questions to follow up:
I propose that CJK mode will use character counts instead of word counts in the project tree, status bar, and in the writing statistics. Any other changes we need to make? I would also like some input from @longqzh here too, the original translator of the Chinese translation. |
Currently, @longqzh has access to approve Chinese translations. @hebekeg is in charge of the Japanese translation. |
For Chinese, there are a few concepts that might need some explanation:
Regarding your questions:
|
If users submit their work to a platform after completion, they will find that different platforms have different counting standards. Some platforms use the word count (Chinese characters only), while others use the character count (Chinese characters plus symbols). Therefore, for Chinese users, 'Word Count' and 'Character Count' might be the primary indicators for text statistics. |
The regular expression for pure Chinese characters (excluding symbols) is as follows: [\u4e00-\u9fa5]. |
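As a rough sketch of the two counting standards described above, the regular expression from the previous comment can be used with Python's standard `re` module. The function names and the sample string are hypothetical, and treating "character count" as every non-whitespace character is an assumption, since platforms may define it differently.

```python
import re

# Pattern taken from the comment above: Chinese characters only, no symbols.
HAN_PATTERN = re.compile(r"[\u4e00-\u9fa5]")


def han_count(text: str) -> int:
    """Count Chinese characters only (the 'word count' of some platforms)."""
    return len(HAN_PATTERN.findall(text))


def char_count_with_symbols(text: str) -> int:
    """Count all non-whitespace characters (assumed meaning of 'character count')."""
    return sum(1 for ch in text if not ch.isspace())


sample = "你好，世界！Hello 123"  # hypothetical sample text
print(han_count(sample))
print(char_count_with_symbols(sample))
```

Note that the range ends at U+9FA5, while the CJK Unified Ideographs block itself extends to U+9FFF, so a few later additions to that block fall outside this pattern.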
There is no code in novelWriter to handle this. It likely comes from the Qt library or the system itself. What operating system are you using? |
Is this what's sometimes called CJK count? |
Windows 10,
It should be. Currently in novelWriter, 'characters' corresponds to 'Character Count' (字符数),
|
I plan to switch novelWriter to Qt6 some time this year. Probably in release 2.4. Hopefully that will help, but a bit of searching reveals the positioning of the candidate box is a problem for many apps. This requires a bit of research. I suggest you make a bug report from your comment so we can track it separately. |
Great. Then it may be as simple as adding an option to change the word count algorithm to use CJK count instead. I know how to make such a counter. Would you still want to label it "Word Count" in the English translation on the user interface, or would it be clearer if it said "CJK Count"? Also, should the CJK Count vs Word Count setting be a per project setting, or for the whole app? Either is possible. |
I consulted ChatGPT, and it might be an issue with the Qt5 package. Here is its response:
|
ChatGPT is not a reliable source. It is mixing up a bunch of different issues in that answer too. I'm familiar with the Linux issues with positioning. I will do some proper research. |
Thank you very much, I am not sure if this issue exists in the Linux system, as I am using the Windows 10 operating system. |
The Linux issue mostly has to do with determining the (0,0) coordinate of the screen on multi-screen setups where the monitors have different resolutions or pixel scaling. AFAIK this is related to Wayland. This is much better supported in Qt6. On Windows, I'm not sure what the issue is. But in general, the way Qt computes coordinates for pop-up boxes, including menus, depends on getting information from the window manager in the operating system. The Qt5 library is not up to date with various such features, including high DPI scaling, light/dark OS settings, etc. A lot of this has been fixed in Qt6. Anyway, please make a proper bug ticket on this so I can track it as a bug. It will be lost here as this one is about the word count. |
@ruixuan658, @vkbo As a native Chinese and Korean speaker, I'd like to redefine some concepts in order to avoid confusion.
No, because in Korean there are spaces between words, but in Chinese there are none. So Concept C makes no sense in Chinese. And currently
For Chinese, Concept A and Concept B are enough. For Korean, Concept C is also necessary. I don't think we need any other new method.
If we don't change the "word count" algorithm, in Chinese it naturally ends up meaning sentence count. But I don't think that is a useful metric; I have never seen it in other common editors (though I'm not a user of professional editors).
I'm not sure. To be honest, maybe we should require users to put double line breaks between paragraphs.
Generally speaking, I prefer to just show |
I really want to avoid adding an external library here. Both because I like to avoid dependencies whenever possible, and because the word count algorithm is performance critical, and I need to have control over how it's optimised. So my real question is, will a CJK character count suffice? It's a fairly easy counter to implement.
How necessary? The CJK count will ignore spaces. I propose to use CJK count as a drop-in replacement for word count, and just remove word count when it is enabled. Basically I would like to add a new setting in Preferences called "Primary Counter" or something, with the following options:
With an option like above, the user could select their preferred counter. I'm also working on adding text analysis tools which will include a bunch of different counter methods, so CJK count will always be available, but this is mostly about which metric is used for collecting statistics and displaying in the project tree. |
We also need to discuss how such a counter is implemented. Python uses utf-8, so the absolute simplest option is to count Unicode values between … (see Wikipedia: CJK Unified Ideographs). Edit: There is also the list at the bottom of this wiki page. I am not familiar enough with these languages to know what would be the best approach, or if this is even the right track. |
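A counter of the kind discussed above could be sketched as follows, using only built-ins and no external library. This is not novelWriter's implementation. The choice of which Unicode blocks to include is an assumption made for illustration, and the names `CJK_RANGES`, `is_cjk` and `cjk_count` are hypothetical.

```python
# Assumed set of blocks; which ranges belong in a "CJK count" is the open
# question in this thread, so treat this tuple as a placeholder.
CJK_RANGES = (
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(ch: str) -> bool:
    """Return True if the character's code point is in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def cjk_count(text: str) -> int:
    """Count characters in the selected ranges; spaces and punctuation are ignored."""
    return sum(1 for ch in text if is_cjk(ch))


print(cjk_count("你好, world"))        # only the two Chinese characters count
print(cjk_count("ひらがな カタカナ"))  # kana count, the space does not
```

Keeping the ranges in a plain tuple makes it easy to add or remove blocks once the right set has been agreed on. A performance-critical version would likely want something faster than a per-character Python loop.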
For Chinese and Korean, character count is enough. I also think word counting is pointless, especially given the complexity of the implementation :)
I agree with you. Character count is much more important than word count. |
Just to clarify, CJK character count is not the same as the current character count. |
To clarify, Python uses Unicode to represent all characters in the Python virtual machine (see the official docs). |
Python 3 strings default to utf-8 always. This is what the count is run on. This is not an issue. The question is what Unicode ranges we need to check for CJK count. |
In any case, I have full control over the encoding in novelWriter. Both novelWriter and the Qt framework use Unicode (utf-8 and utf-16 respectively). What I need input on is what Unicode ranges will cover the correct symbols needed to make a proper count. I don't know the differences between the ones listed on that wiki page. |
novelWriter is very user-friendly and also has some parts translated into Chinese. I recently discovered this project and liked it very much, and I contributed some Chinese translations. However, there is an issue for Chinese users:
Chinese, unlike English and other alphabetic languages, does not have the concept of letters, and the smallest unit for counting text in Chinese is a character. In the current version of novelWriter, the text statistics include 'characters' and 'words'. In practice, to count Chinese text one needs to look at 'characters', as 'words' does not make sense for counting Chinese characters. Additionally, in the current counting feature, each Chinese sentence is counted as one 'word'. I hope novelWriter can enhance the experience for Chinese users by improving this aspect.