Update and extend the Norwegian rules #470

jeblad · 2017-02-13T05:10:57Z

This version for Bokmål and Nynorsk reuses a common base form, which provides transliteration of some non-character entities.

This change set depends on #469 and replaces the somewhat messy #461.

kartikm · 2017-04-14T03:55:55Z

Can you rebase this pull request?

jeblad · 2017-04-15T11:59:00Z

This uses "figure space" as the thousands separator. It is not clear whether "narrow non-breaking space" should be used instead. The latter is claimed in an article in Digi. It is not clear where this claim originates, but the article is a reprint from the Danish news site Version2. I have found no clear preference for which space should be used.
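To make the two candidates concrete, here is a minimal sketch of grouping digits with either separator. The function name `groupThousands` is hypothetical and is not part of jquery.ime; only the two code points (U+2007 and U+202F) are taken from the discussion.

```javascript
// The two candidate separators discussed above.
const FIGURE_SPACE = "\u2007";
const NARROW_NO_BREAK_SPACE = "\u202F";

// Hypothetical helper: insert a separator between groups of three digits.
// Assumes `digits` is a plain string of decimal digits.
function groupThousands(digits, separator) {
  return digits.replace(/\B(?=(\d{3})+(?!\d))/g, separator);
}

const withFigureSpace = groupThousands("1234567", FIGURE_SPACE);
const withNarrowSpace = groupThousands("1234567", NARROW_NO_BREAK_SPACE);
```

The two results look nearly identical on screen but differ in code point and in rendered width, which is why the choice between them matters.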

divec · 2017-04-16T04:28:35Z

There are lots of cool features in here, and I can see they would be useful in some contexts. However, we should consider whether they might be an inconvenience in other contexts. For example:

- A Wikipedia editor probably needs the underscore more than subscripted numbers.
- The (invisible) substitution \u0020 -> \u2007 might cause problems that are hard to debug if the platform doesn't support \u2007 very well.
- There could be misinterpretations, e.g. "1 dag" means one day as well as one decagram (and it's strange if we insert "1\u2007dag" for one day but "1 natt" for one night).
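The debugging concern can be illustrated with a small helper that makes the otherwise invisible space characters visible. This is a sketch only; `revealSpaces` is a hypothetical name and not something jquery.ime provides.

```javascript
// Hypothetical debugging helper: replace each space-like character
// with a visible <U+XXXX> token so that substitutions can be inspected.
function revealSpaces(text) {
  return text.replace(/[\u0020\u00A0\u2007\u202F]/g, (ch) => {
    const hex = ch.charCodeAt(0).toString(16).toUpperCase().padStart(4, "0");
    return `<U+${hex}>`;
  });
}

revealSpaces("1\u2007dag"); // "1<U+2007>dag"
revealSpaces("1 natt");     // "1<U+0020>natt"
```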

We should consider the tradeoff between these cool features and the current, more basic Norwegian transliteration that just supports å / æ / ø (i.e. just targeting the specific features in Norwegian that a non-Norwegian keyboard doesn't support).

jeblad · 2017-04-16T10:16:23Z

Note that there will always be cases where a specific pattern creates problems. The question is rather which problem hurts most, and being able to choose to turn a pattern on (or off) can be the right solution.

Underscores on Wikipedia are virtually never used except in links, and those are then copy-pasted. Subscripts are hard to create on several platforms. Subscripts are, however, not as troublesome as superscripts, as the latter are used in composite units.

What I really would like is to have a high-level state describing the allowed type of parsing. Now the state is only a few chars long, which blocks detection of troublesome contexts.
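As one possible illustration of such a context check, the sketch below looks at all the text before the cursor rather than only the last few characters, and reports whether the cursor sits inside an unclosed wiki link. The function `insideWikiLink` is hypothetical; jquery.ime does not expose anything like it in this thread.

```javascript
// Hypothetical context check: true when the text before the cursor
// contains a "[[" that has not yet been closed by "]]".
function insideWikiLink(textBeforeCursor) {
  const lastOpen = textBeforeCursor.lastIndexOf("[[");
  const lastClose = textBeforeCursor.lastIndexOf("]]");
  return lastOpen !== -1 && lastOpen > lastClose;
}

insideWikiLink("see [[Some_pa");         // true
insideWikiLink("see [[Some page]] H_2"); // false
```

A rule that turns underscores into subscripts could then be skipped while this returns true, which a state of only a few characters cannot decide.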

If the platform doesn't support \u2007 then it is probably not supporting Unicode. Whether it should use "narrow non-breaking space" or "figure space" is more important. The former gives a better layout, but most sources say the latter should be used.

This is a kind of substitution that should be added as an option; the all-or-nothing method now in use is a bit awkward.
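A rough sketch of what per-substitution options could look like follows. The group names, the patterns in them, and `buildPatterns` are all illustrative assumptions; they are not the rules from this change set and not an existing jquery.ime API.

```javascript
// Hypothetical rule groups, each a list of [input, output] pairs.
const RULE_GROUPS = {
  letters: [["aa", "å"], ["ae", "æ"], ["oe", "ø"]],
  subscripts: [["_2", "₂"]],
  quotes: [["<<", "«"], [">>", "»"]],
};

// Concatenate only the groups the user has switched on,
// preserving the order in which the groups are declared.
function buildPatterns(enabled) {
  return Object.keys(RULE_GROUPS)
    .filter((name) => enabled[name])
    .flatMap((name) => RULE_GROUPS[name]);
}

const basic = buildPatterns({ letters: true });
const extended = buildPatterns({ letters: true, quotes: true });
```

With this shape, the current basic behaviour is just the case where only one group is enabled.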

The prefix "da" is not common in Norwegian; it is usually interpreted as "dekar" (1000 m²). Although it is a valid form, I have never seen "dag" used for "dekagram". When I wrote the rules I wondered whether I should drop the prefix, because it is so uncommon and could create problems.

We don't write "1 dag" in Norwegian, we write "en dag". We write out the numbers up to twelve, but from seven onward we use weeks, with the exception of two weeks, where we can use 14 days.

In this case the number of false positives is lowered by reverting the space if whatever follows is not a valid SI unit. That means "1 dag" will be handled as an SI unit, but "2 dager" will not.
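The revert step can be sketched roughly as below. This is a batch version over a finished string, whereas the actual rules act while the user types; the unit list is a small illustrative subset and `revertNonUnitSpaces` is a hypothetical name.

```javascript
// Illustrative subset of unit symbols; the real list is not shown in this thread.
const SI_UNITS = new Set(["m", "km", "g", "kg", "dag", "s", "l"]);

// Wherever a digit is followed by a figure space and a word,
// keep the figure space only if the word is a known unit symbol.
// Otherwise fall back to an ordinary space.
function revertNonUnitSpaces(text) {
  return text.replace(/(\d)\u2007(\p{L}+)/gu, (match, digit, word) =>
    SI_UNITS.has(word) ? match : `${digit} ${word}`
  );
}

revertNonUnitSpaces("1\u2007dag");   // unchanged, "dag" is in the unit list
revertNonUnitSpaces("2\u2007dager"); // "2 dager"
```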

The problem with "1 dag" will occur very seldom; in fact, it has only 39 occurrences in the whole Norwegian Bokmål Wikipedia. It is important to check how often the actual problem will arise.

The current jquery.ime implementation has a problem with opaque rules. What goes on is a mystery for the user, and it is very difficult to establish a valid mental model. I'm not sure, but perhaps the user should see what the IME is acting upon, or at least get a hint when something is changed.

Anyhow, I wrote the first Norwegian, Swedish and Danish transliterations, and the way they are now, they are not used. The normforms create a lot of false positives in Norwegian, both Bokmål and Nynorsk, and the tildeforms are pretty awkward. Nobody uses them unless they have to.

jeblad · 2017-09-15T19:20:54Z

Unless something happens, I'm going to close this.

jeblad added 3 commits February 13, 2017 04:24

Changed to concatenation of extended patterns

4c97429

Updated rules for Norwegian

391ca07

Updated load sequence

54c335d

jeblad changed the title ~~Update and extension of Norwegian~~ Update and extend the Norwegian rules Feb 13, 2017

jeblad mentioned this pull request Jun 2, 2017

Add functionality for base rules? #491

Closed

jeblad closed this Sep 20, 2017
