Update and extend the Norwegian rules #470

Closed
wants to merge 3 commits into from

Conversation

jeblad
Contributor

@jeblad jeblad commented Feb 13, 2017

This version for Bokmål and Nynorsk reuses a common base form, which provides transliteration of some non-character entities.

This change set depends on #469 and replaces the somewhat messy #461.

@jeblad jeblad changed the title from "Update and extension of Norwegian" to "Update and extend the Norwegian rules" Feb 13, 2017
@kartikm
Member

kartikm commented Apr 14, 2017

Can you rebase this pull request?

@jeblad
Contributor Author

jeblad commented Apr 15, 2017

This uses "figure space" as the thousands separator. It is not clear whether "narrow non-breaking space" should be used instead. The latter is claimed in an article in Digi. It is not clear where this claim originates, but the article is a reprint from the Danish news site Version2. I have found no clear preference for which space should be used.
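To make the choice concrete, here is a minimal sketch of thousands grouping with a swappable separator. It is not the rule set from this pull request; the function name `group_thousands` and both constants are hypothetical names introduced only for illustration.

```python
# Illustrative sketch only; names are hypothetical, not taken from jquery.ime.
FIGURE_SPACE = "\u2007"  # U+2007 FIGURE SPACE
NARROW_NBSP = "\u202f"   # U+202F NARROW NO-BREAK SPACE


def group_thousands(digits: str, separator: str = FIGURE_SPACE) -> str:
    """Insert `separator` between groups of three digits, counted from the right."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-empty string of ASCII digits")
    groups = []
    while digits:
        groups.append(digits[-3:])  # take the last three digits
        digits = digits[:-3]        # and drop them from the remainder
    return separator.join(reversed(groups))
```

Calling `group_thousands("1234567")` groups with the figure space, while `group_thousands("1234567", NARROW_NBSP)` gives the alternative, so the two candidates can be compared side by side in a given font.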

@divec
Contributor

divec commented Apr 16, 2017

There are lots of cool features in here, and I can see they would be useful in some contexts. However we should consider whether they might be an inconvenience in other contexts. For example:

  • A Wikipedia editor probably needs the underscore more than subscripted numbers
  • The (invisible) substitution \u0020 -> \u2007 might cause problems that are hard to debug if the platform doesn't support \u2007 very well
  • There could be misinterpretations, e.g. "1 dag" means one day as well as one decagram (and it's strange if we insert "1\u2007dag" for one day but "1 natt" for one night).
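The debugging concern in the second bullet can be illustrated with a small helper that makes invisible space substitutions visible. This is only a sketch using Python's standard `unicodedata` module; the function name `describe_spaces` is hypothetical.

```python
import unicodedata


def describe_spaces(text):
    """List (index, code point, Unicode name) for each space-like character
    other than the ordinary U+0020 space."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        # "Zs" is the Unicode general category "Separator, space"
        if ch != " " and unicodedata.category(ch) == "Zs"
    ]
```

Running it on text produced by the input method shows where a U+2007 was inserted even though the result looks identical to an ordinary space on screen.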

We should consider the tradeoff between these cool features and the current more basic Norwegian transliteration that just supports å / æ / ø (i.e. just targeting the specific features in Norwegian that a non-Norwegian keyboard doesn't support).

@jeblad
Contributor Author

jeblad commented Apr 16, 2017

Note that there will always be cases where a specific pattern creates problems. The issue is rather which problem hurts most, and being able to choose to turn a pattern on (or off) can be the right solution.

Underscores at Wikipedia are virtually never used except in links, and they are then copy-pasted. Subscripts are hard to create on several platforms. Subscripts are, however, not as troublesome as superscripts, as those are used in composite units.

What I really would like is a high-level state describing the allowed type of parsing. At present the state is only a few characters long, which blocks detection of troublesome contexts.

If the platform doesn't support \u2007 then it probably does not support Unicode. Whether it should use "narrow non-breaking space" or "figure space" is more important. The former gives a better layout, but most sources say the latter should be used.

This is a kind of substitution that should be added as an option; the all-or-nothing method now in use is a bit awkward.
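As a rough sketch of what per-pattern options could look like (this is not how jquery.ime is structured; the `Rule` class, the rule names, and the two example patterns are all hypothetical):

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str         # identifier the user can switch on or off
    pattern: str      # regular expression to look for
    replacement: str  # text substituted for each match


# Hypothetical rules, chosen only to illustrate the idea.
RULES = [
    Rule("aa-to-aring", r"aa", "å"),
    # Ordinary space between a digit and a group of exactly three digits.
    Rule("figure-space", r"(?<=\d) (?=\d{3}\b)", "\u2007"),
]


def transliterate(text, enabled):
    """Apply only the rules whose names appear in `enabled`."""
    for rule in RULES:
        if rule.name in enabled:
            text = re.sub(rule.pattern, rule.replacement, text)
    return text
```

With this shape, a Wikipedia editor could enable `"aa-to-aring"` alone and leave the space substitution off, instead of accepting or rejecting the whole rule set.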

The prefix "da" is not common in Norwegian; it is usually interpreted as "dekar" (1000 m²). Although it is a valid form, I have never seen "dag" used for "dekagram". When I wrote the rules I wondered if I should drop the prefix, because it is so uncommon and could create problems.

We don't write "1 dag" in Norwegian, we write "en dag". We write out the numbers up to twelve, but from seven we use weeks, with the exception of two weeks, where we can use 14 days.

In this case the number of false positives is lowered by reverting the space if whatever follows is not a valid SI unit. That means "1 dag" will be handled as an SI unit, but not "2 dager".
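The effect described here can be sketched as follows. Note that this is a single-pass approximation of the end result, not the substitute-then-revert mechanism of the actual rules, and the unit list and the name `space_before_units` are hypothetical.

```python
import re

FIGURE_SPACE = "\u2007"  # U+2007 FIGURE SPACE

# Hypothetical, deliberately tiny unit list; a real one would be much longer.
SI_UNITS = frozenset({"m", "km", "g", "kg", "dag", "s"})

# A digit, an ordinary space, then a whole run of letters.
_NUMBER_WORD = re.compile(r"(\d) ([^\W\d_]+)")


def space_before_units(text, units=SI_UNITS):
    """Use a figure space between a number and a following word only
    when that whole word is a known unit; otherwise keep the plain space."""
    def replace(match):
        digit, word = match.group(1), match.group(2)
        separator = FIGURE_SPACE if word in units else " "
        return f"{digit}{separator}{word}"

    return _NUMBER_WORD.sub(replace, text)
```

Because the whole following word is compared against the unit list, "1 dag" gets the figure space while "2 dager" and "1 natt" keep the ordinary space.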

The problem with "1 dag" will happen very seldom, in fact it has only 39 occurrences in the whole Norwegian Bokmål Wikipedia. It is important to check how often the actual problem will arise.

The current jquery.ime implementation has a problem with opaque rules. What goes on is a mystery for the user, and it is very difficult to establish a valid mental model. I'm not sure, but perhaps the user should see what the ime is acting upon, or at least get a hint when something is changed.

Anyhow, I wrote the first Norwegian, Swedish and Danish transliterations, and the way they are now they are not used. The normforms create a lot of false positives in Norwegian, both Bokmål and Nynorsk, and the tildeforms are pretty awkward. Nobody uses them unless they have to.

@jeblad
Contributor Author

jeblad commented Sep 15, 2017

Unless something happens I'm going to close this.

@jeblad jeblad closed this Sep 20, 2017

3 participants