Regression since ruby 2.4 - `unknown encoding name - bom|utf-8` #23

ShockwaveNN · 2018-02-14T09:50:57Z

Hi there I update my app to ruby-2.5 and have some problems with sending encoding bom|utf-8 to csv parser.

I create a sample Dockerfiles for this:
With ruby-2.4.3 everything working fine:

FROM ruby:2.4.3

RUN echo 'test' > test.csv
RUN echo "require 'csv'\ncsv = CSV.read('test.csv', encoding: 'bom|utf-8')\n p csv[0][0]" > script.rb
CMD ruby script.rb

output is "test"

But for ruby-2.5.0 encoding error is happend

FROM ruby:2.5.0

RUN echo 'test' > test.csv
RUN echo "require 'csv'\ncsv = CSV.read('test.csv', encoding: 'bom|utf-8')\n p csv[0][0]" > script.rb
CMD ruby script.rb

Error is:

/usr/local/lib/ruby/2.5.0/csv.rb:1532:in `find': unknown encoding name - bom|utf-8 (ArgumentError)
        from /usr/local/lib/ruby/2.5.0/csv.rb:1532:in `initialize'
        from /usr/local/lib/ruby/2.5.0/csv.rb:1280:in `new'
        from /usr/local/lib/ruby/2.5.0/csv.rb:1280:in `open'
        from /usr/local/lib/ruby/2.5.0/csv.rb:1346:in `read'
        from script.rb:2:in `<main>'

The text was updated successfully, but these errors were encountered:

ShockwaveNN · 2018-02-14T09:55:52Z

Is it a good practice to post bugs here or better post in on https://bugs.ruby-lang.org/ ?

stevendaniels · 2018-02-14T10:15:52Z

I can confirm this issue. It looks like the problem started in this commit: d4dd5d1

-    @encoding = raw_encoding(nil) ||
-                ( if encoding = options.delete(:internal_encoding)
-                    case encoding
-                    when Encoding; encoding
-                    else Encoding.find(encoding)
-                    end
-                  end ) ||
-                ( case encoding = options.delete(:encoding)
-                  when Encoding; encoding
-                  when /\A[^:]+/; Encoding.find($&)
-                  end ) ||
+    internal_encoding = Encoding.find(internal_encoding) if internal_encoding
+    if encoding
+      encoding, = encoding.split(":") if encoding.is_a?(String)
+      encoding = Encoding.find(encoding)
+    end
+    @encoding = raw_encoding(nil) || internal_encoding || encoding ||

On current master, the relevant section looks like this:

  # honor the IO encoding if we can, otherwise default to ASCII-8BIT
  internal_encoding = Encoding.find(internal_encoding) if internal_encoding
  external_encoding = Encoding.find(external_encoding) if external_encoding
  if encoding
    encoding, = encoding.split(":", 2) if encoding.is_a?(String)
    encoding = Encoding.find(encoding)
  end
  @encoding = raw_encoding(nil) || internal_encoding || encoding ||
              Encoding.default_internal || Encoding.default_external

Beyond the issue that @ShockwaveNN raised, I also noticed we aren't using the external_encoding argument anywhere.

stevendaniels · 2018-02-14T10:46:16Z

@kou, @hsbt:
It looks like change in the order of operations is what broke things. raw_encoding(nil) returns <Encoding:UTF-8> because @io.external_encoding == <Encoding:UTF-8>

  def raw_encoding(default = Encoding::ASCII_8BIT)
    if @io.respond_to? :internal_encoding
      @io.internal_encoding || @io.external_encoding
    elsif @io.is_a? StringIO
      @io.string.encoding
    elsif @io.respond_to? :encoding
      @io.encoding
    else
      default
    end
  end

To me, it looks like raw_encoding will always return an encoding, so
raw_encoding(nil) || internal_encoding || encoding || Encoding.default_internal || Encoding.default_external will never reach internal_encoding

BTW, neither Ruby 2.4.3 or 2.5.0 recognize bom|utf-8 as a valid encoding:

irb(main):001:0> RUBY_VERSION
=> "2.4.3"
irb(main):002:0>  Encoding.find("bom|utf-8")
ArgumentError: unknown encoding name - bom|utf-8
	from (irb):2:in `find'
	from (irb):2
	from /Users/steven/.rbenv/versions/2.4.3/bin/irb:11:in `<main>'

It doesn't look like we use the internal_encoding, encoding, external_encoding arguments. Do we still need them?

@encoding

This fixes ruby#23 and makes the following modification: + @encoding can be set to external_encoding. Previously, it ignored external_encoding. To me, it looks like the encoding option is usually ignored. `CSV.read|CSV.open` will use the encoding options provided by a user if: 1. the user passes in an `IO`-like object where `respond_to?(:internal_encoding) == true && (io.internal_encoding || io.external_encoding).nil?` 2. the user passes in a `StringIO`-like object where the string has no encoding. 3. the user passes in an `IO`-like object where`respond_to?(:encoding) == true && io.encoding.nil?` As far as I know, the above situations can only occur if the user is passing in a custom IO object that doesn't have encoding information. This analysis is based off of the `raw_encoding` method ```ruby def raw_encoding(default = Encoding::ASCII_8BIT) if @io.respond_to? :internal_encoding @io.internal_encoding || @io.external_encoding elsif @io.is_a? StringIO @io.string.encoding elsif @io.respond_to? :encoding @io.encoding else default end end ```

kou · 2018-02-22T07:30:42Z

@ShockwaveNN Thanks for your report. I've fixed it.
You can use here.

tricknotes · 2018-05-02T11:11:23Z

@kou 4d13339 is so cool!
Could you release a new version including this commit?
I want to use this fix as a released version.

kou · 2018-05-02T11:38:19Z

OK.
Can you update news.md? We can release a new version after we have a release note for the next version in new.md.

tricknotes · 2018-05-02T14:06:59Z

Of course!
Thanks for contribution chance for me.

I sent a PR for news.md. #36

kou · 2018-05-02T22:02:15Z

Great!

Issue with CSV library not liking the encoding option: ruby/csv#23 Bump to use newer version of CSV and not the version that ships with jruby

radar · 2018-09-05T06:46:00Z

I am still having issues trying to read a CSV string that has a byte order mark prepended to it:

require 'csv'
puts "CSV VERSION #{CSV::VERSION}" # Shows 3.0.0

bom_character = 65_279
contents = "first_name\nRyan".codepoints.unshift(bom_character).pack("U*")
csv = CSV.parse(contents, headers: true, encoding: 'bom|utf-8')
csv.each do |row|
  p row.to_h.keys.first.codepoints
  p "ROW FIRST NAME IS #{row["first_name"]}"
end

This outputs nil for the first name and indicates that the key also contains the BOM. What am I doing wrong here?

kou · 2018-09-06T06:06:42Z

BOM is for opening a file not parse target string.

What am I doing wrong here?

You should not reuse closed issue. You should open a new issue.

ShockwaveNN changed the title ~~unknown encoding name - bom|utf-8~~ Regression since ruby 2.4 - unknown encoding name - bom|utf-8 Feb 14, 2018

stevendaniels mentioned this issue Feb 16, 2018

Fix Unknown Encoding issue #24

Closed

kou closed this as completed in 4d13339 Feb 22, 2018

tricknotes mentioned this issue May 2, 2018

Add news for v1.0.2 #36

Merged

awead mentioned this issue May 9, 2019

Downgrade to Ruby 2.4.6 psu-libraries/scholarsphere-3#1579

Merged

khiav223577 mentioned this issue Oct 28, 2021

CSV.parse not removing BOM character #43

Closed

rstueven mentioned this issue Jul 14, 2022

ArgumentError: unknown encoding name - iso-8859-1|utf-8 #254

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression since ruby 2.4 - `unknown encoding name - bom|utf-8` #23

Regression since ruby 2.4 - `unknown encoding name - bom|utf-8` #23

ShockwaveNN commented Feb 14, 2018 •

edited

Loading

ShockwaveNN commented Feb 14, 2018

stevendaniels commented Feb 14, 2018 •

edited

Loading

stevendaniels commented Feb 14, 2018 •

edited

Loading

kou commented Feb 22, 2018

tricknotes commented May 2, 2018

kou commented May 2, 2018

tricknotes commented May 2, 2018

kou commented May 2, 2018

radar commented Sep 5, 2018 •

edited

Loading

kou commented Sep 6, 2018 •

edited

Loading

Regression since ruby 2.4 - unknown encoding name - bom|utf-8 #23

Regression since ruby 2.4 - unknown encoding name - bom|utf-8 #23

Comments

ShockwaveNN commented Feb 14, 2018 • edited Loading

ShockwaveNN commented Feb 14, 2018

stevendaniels commented Feb 14, 2018 • edited Loading

stevendaniels commented Feb 14, 2018 • edited Loading

kou commented Feb 22, 2018

tricknotes commented May 2, 2018

kou commented May 2, 2018

tricknotes commented May 2, 2018

kou commented May 2, 2018

radar commented Sep 5, 2018 • edited Loading

kou commented Sep 6, 2018 • edited Loading

Regression since ruby 2.4 - `unknown encoding name - bom|utf-8` #23

Regression since ruby 2.4 - `unknown encoding name - bom|utf-8` #23

ShockwaveNN commented Feb 14, 2018 •

edited

Loading

stevendaniels commented Feb 14, 2018 •

edited

Loading

stevendaniels commented Feb 14, 2018 •

edited

Loading

radar commented Sep 5, 2018 •

edited

Loading

kou commented Sep 6, 2018 •

edited

Loading