Update and extend the Norwegian rules #470
Can you rebase this pull request? |
This uses "figure space" as the thousands separator. It is not clear whether "narrow non-breaking space" should be used instead. The latter is claimed in an article in Digi. It is not clear where this claim originates, but the article is a reprint from the Danish news site Version2. I have found no clear preference for which space should be used. |
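For reference, the two candidate separators are U+2007 FIGURE SPACE and U+202F NARROW NO-BREAK SPACE. A minimal sketch of grouping digits with either one follows; the helper name `groupThousands` is hypothetical and is not part of jquery.ime or of this pull request, and it only handles plain non-negative integers.

```javascript
// Hypothetical helper, not part of jquery.ime: groups the digits of a
// non-negative integer in threes using a configurable separator.
const FIGURE_SPACE = '\u2007';          // U+2007 FIGURE SPACE
const NARROW_NO_BREAK_SPACE = '\u202F'; // U+202F NARROW NO-BREAK SPACE

function groupThousands(digits, separator = FIGURE_SPACE) {
  // Insert the separator at every position that is followed by a whole
  // number of three-digit groups, counted from the right.
  return String(digits).replace(/\B(?=(\d{3})+(?!\d))/g, separator);
}
```

Making the separator a parameter keeps the choice between the two spaces open, which fits the fact that no clear preference has been found.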
There are lots of cool features in here, and I can see they would be useful in some contexts. However we should consider whether they might be an inconvenience in other contexts. For example:
We should consider the tradeoff between these cool features and the current, more basic Norwegian transliteration that just supports å / æ / ø (i.e. just targeting the specific features in Norwegian that a non-Norwegian keyboard doesn't support). |
Note that there will always be cases where a specific pattern creates problems. The question is rather which problem hurts most, and being able to choose to turn a pattern on (or off) can be the right solution.

Underscores on Wikipedia are virtually never used except in links, and those are then copy-pasted. Subscripts are hard to create on several platforms. Subscripts are, however, not as troublesome as superscripts, as those are used in composite units. What I would really like is a high-level state describing the allowed type of parsing. Right now the state is only a few characters long, which blocks detection of troublesome contexts.

If the platform doesn't support \u2007, then it probably doesn't support Unicode. Whether it should use "narrow non-breaking space" or "figure space" is more important. The former gives a better layout, but most sources say the latter should be used. This is a kind of substitution that should be added as an option; the all-or-nothing method now in use is a bit awkward.

The prefix "da" is not common in Norwegian; it is usually interpreted as "dekar" (1000 m²). Although it is a valid form, I have never seen "dag" used for "dekagram". When I wrote the rules I wondered whether I should drop the prefix, because it is so uncommon and could create problems. We don't write "1 dag" in Norwegian, we write "en dag". We write out the numbers up to twelve, but from seven days on we use weeks, with the exception of two weeks, where we can use 14 days. In this case the number of false positives is lowered by reverting the space if whatever follows is not a valid SI unit. That means "1 dag" will be handled as an SI unit, but not "2 dager". The problem with "1 dag" will occur very seldom; in fact it has only 39 occurrences in the whole Norwegian Bokmål Wikipedia. It is important to check how often the actual problem will arise.

The current jquery.ime implementation has a problem with opaque rules. What goes on is a mystery for the user, and it is very difficult to establish a valid mental model. I'm not sure, but perhaps the user should see what the IME is acting upon, or at least get a hint when something is changed. Anyhow, I wrote the first Norwegian, Swedish and Danish transliterations, and the way they are now, they are not used. The normforms create a lot of false positives in Norwegian, both Bokmål and Nynorsk, and the tildeforms are pretty awkward. Nobody uses them unless they have to. |
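The revert idea described above (keep the substituted space only when a valid SI unit follows the number) could be sketched roughly as below. This is an illustration, not the rule set from this pull request: the function names, the tiny prefix and unit lists, and the use of U+00A0 as the number-to-unit space are all assumptions.

```javascript
// Illustrative sketch only; names, lists and the choice of U+00A0 are
// assumptions, not the actual rules from the pull request.
const NBSP = '\u00A0'; // assumed space between a number and its unit

// A deliberately tiny sample of SI prefixes and unit symbols.
const PREFIXES = ['da', 'k', 'c', 'm', ''];
const UNITS = ['g', 'm', 'l', 'W'];

// True if the word is exactly one sample prefix followed by one sample unit.
function isSiUnit(word) {
  return PREFIXES.some(
    (prefix) => word.startsWith(prefix) && UNITS.includes(word.slice(prefix.length))
  );
}

// Replace the ordinary space in "<digit> <word>" only when <word> is a
// recognised unit; otherwise the original text, space included, is kept.
function fixNumberUnitSpace(text) {
  return text.replace(/(\d) (\p{L}+)/gu, (match, digit, word) =>
    isSiUnit(word) ? digit + NBSP + word : match
  );
}
```

With these sample lists, `fixNumberUnitSpace('1 dag')` gets the special space because "dag" parses as "da" + "g", while `fixNumberUnitSpace('2 dager')` is returned unchanged, which mirrors the "1 dag" versus "2 dager" behaviour described in the comment.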
Unless something happens, I'm going to close this. |
This version for Bokmål and Nynorsk reuses a common baseform, which provides transliteration of some non-character entities.
This change set depends on #469 and replaces the somewhat messy #461.