UTF-8 downcase is not working correctly #2053

Closed
hron84 opened this Issue Nov 24, 2012 · 1 comment


hron84 commented Nov 24, 2012

Given the following failing RSpec:

# -*- encoding:utf-8 -*-
require 'rspec/autorun'

$KCODE = 'UTF8' if not defined?(Rubinius) and $KCODE == 'NONE'

describe String do
  it 'current encoding is UTF-8' do
    if defined?(Rubinius)
      __ENCODING__.name.should eq 'UTF-8'
    else
      $KCODE.should eq 'UTF8'
    end
  end

  context "with US-ASCII string" do
    before do
      @string = "The quick brown fox jumps over the lazy dog"
    end

    it 'should downcase it correctly' do
      @string.downcase.should eq 'the quick brown fox jumps over the lazy dog'
    end

    it 'should upcase it correctly' do
      @string.upcase.should eq 'THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG'
    end
  end

  context "with UTF-8 string" do
    before do
      @string = "Egy hűtlen vejét fülöncsípő, dühös mexikói úr ázik Quitóban"
    end

    it 'should downcase it correctly' do
      @string.downcase.should eq 'egy hűtlen vejét fülöncsípő, dühös mexikói úr ázik quitóban'
    end

    it 'should upcase it correctly' do
      @string.upcase.should eq 'EGY HŰTLEN VEJÉT FÜLÖNCSÍPŐ, DÜHÖS MEXIKÓI ÚR ÁZIK QUITÓBAN'
    end
  end
end

The problem this highlights is that accented characters (that is, non-ASCII UTF-8 characters) are not converted to lowercase/uppercase correctly (in both 1.8 and 1.9 mode).

In fact, this mirrors the corresponding MRI behavior. However, I think it is a bug in MRI too, so Rubinius could decide not to mimic it, because nobody relies on this bug (I strongly hope).

I used Hungarian characters to make the problem clear: every accented character should be replaced with its uppercase counterpart carrying the same accent. In reality this does not happen; Rubinius leaves all accented characters intact (lowercase)...
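As a stopgap, the accented letters can be mapped explicitly before calling the built-in methods. The following is only a minimal sketch, assuming Ruby 1.9 or newer with UTF-8 source encoding; `HU_LOWER`, `HU_UPPER`, `hu_downcase` and `hu_upcase` are hypothetical names made up for illustration, not part of Ruby or Rubinius, and the table covers only the Hungarian accented vowels.

```ruby
# -*- encoding: utf-8 -*-

# Hypothetical lookup tables (illustration only): the nine accented
# Hungarian vowels, in the same order in both strings.
HU_LOWER = 'áéíóöőúüű'
HU_UPPER = 'ÁÉÍÓÖŐÚÜŰ'

# Hypothetical helper: map the accented capitals by hand with String#tr,
# then let the built-in downcase handle A-Z.
def hu_downcase(str)
  str.tr(HU_UPPER, HU_LOWER).downcase
end

# Hypothetical helper: the same idea in the other direction.
def hu_upcase(str)
  str.tr(HU_LOWER, HU_UPPER).upcase
end

puts hu_upcase('hűtlen vejét')
puts hu_downcase('ÁZIK QUITÓBAN')
```

This does not address the general problem, since every other alphabet would need its own table; it only shows what the spec above expects for this particular sentence.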

Sorry for the RSpec spec, but I do not know your spec system well enough to write a correct spec.

The third line can be commented out or left in; it has no effect on this bug.

Owner

dbussink commented Nov 24, 2012

The biggest problem with this is that capitalization is often language dependent. Different languages capitalize letters in different ways. I suspect this is also why MRI has the same behavior as Rubinius, because there is not a straightforward way to do this right.
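Turkish is a commonly cited example of this language dependence: the lowercase of "I" is the dotless "ı", and the uppercase of "i" is the dotted "İ". The sketch below assumes Ruby 2.4 or newer, where `String#downcase` and `String#upcase` accept a `:turkic` option; that option did not exist in the 1.8 and 1.9 modes discussed in this issue. `turkic_downcase` and `turkic_upcase` are hypothetical helper names used only for illustration.

```ruby
# -*- encoding: utf-8 -*-
# Assumption: Ruby 2.4 or newer, where case-mapping options are available.

# Hypothetical helper: lowercase using the Turkic (Turkish/Azeri) rules.
def turkic_downcase(str)
  str.downcase(:turkic)
end

# Hypothetical helper: uppercase using the Turkic (Turkish/Azeri) rules.
def turkic_upcase(str)
  str.upcase(:turkic)
end

puts 'I'.downcase          # default mapping: plain "i"
puts turkic_downcase('I')  # Turkic mapping: dotless "ı" (U+0131)
puts turkic_upcase('i')    # Turkic mapping: dotted "İ" (U+0130)
```

The same input letter gives a different result depending on the language rules chosen, which is why a single universal mapping cannot be correct for every caller.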

If you feel this is a bug in Ruby, you should open an issue with ruby-core at https://bugs.ruby-lang.org. There may already have been discussion of this topic there, so search for it first. We are not going to change Rubinius unilaterally in this regard, so what happens depends on the discussion there.

dbussink closed this Nov 24, 2012
