Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression since ruby 2.4 - unknown encoding name - bom|utf-8 #23

Closed
ShockwaveNN opened this issue Feb 14, 2018 · 10 comments
Closed

Regression since ruby 2.4 - unknown encoding name - bom|utf-8 #23

ShockwaveNN opened this issue Feb 14, 2018 · 10 comments

Comments

@ShockwaveNN
Copy link
Contributor

ShockwaveNN commented Feb 14, 2018

Hi there I update my app to ruby-2.5 and have some problems with sending encoding bom|utf-8 to csv parser.

I create a sample Dockerfiles for this:
With ruby-2.4.3 everything working fine:

FROM ruby:2.4.3

RUN echo 'test' > test.csv
RUN echo "require 'csv'\ncsv = CSV.read('test.csv', encoding: 'bom|utf-8')\n p csv[0][0]" > script.rb
CMD ruby script.rb

output is "test"

But for ruby-2.5.0 encoding error is happend

FROM ruby:2.5.0

RUN echo 'test' > test.csv
RUN echo "require 'csv'\ncsv = CSV.read('test.csv', encoding: 'bom|utf-8')\n p csv[0][0]" > script.rb
CMD ruby script.rb

Error is:

/usr/local/lib/ruby/2.5.0/csv.rb:1532:in `find': unknown encoding name - bom|utf-8 (ArgumentError)
        from /usr/local/lib/ruby/2.5.0/csv.rb:1532:in `initialize'
        from /usr/local/lib/ruby/2.5.0/csv.rb:1280:in `new'
        from /usr/local/lib/ruby/2.5.0/csv.rb:1280:in `open'
        from /usr/local/lib/ruby/2.5.0/csv.rb:1346:in `read'
        from script.rb:2:in `<main>'
@ShockwaveNN ShockwaveNN changed the title unknown encoding name - bom|utf-8 Regression since ruby 2.4 - unknown encoding name - bom|utf-8 Feb 14, 2018
@ShockwaveNN
Copy link
Contributor Author

Is it a good practice to post bugs here or better post in on https://bugs.ruby-lang.org/ ?

@stevendaniels
Copy link
Contributor

stevendaniels commented Feb 14, 2018

I can confirm this issue. It looks like the problem started in this commit: d4dd5d1

-    @encoding = raw_encoding(nil) ||
-                ( if encoding = options.delete(:internal_encoding)
-                    case encoding
-                    when Encoding; encoding
-                    else Encoding.find(encoding)
-                    end
-                  end ) ||
-                ( case encoding = options.delete(:encoding)
-                  when Encoding; encoding
-                  when /\A[^:]+/; Encoding.find($&)
-                  end ) ||
+    internal_encoding = Encoding.find(internal_encoding) if internal_encoding
+    if encoding
+      encoding, = encoding.split(":") if encoding.is_a?(String)
+      encoding = Encoding.find(encoding)
+    end
+    @encoding = raw_encoding(nil) || internal_encoding || encoding ||

On current master, the relevant section looks like this:

  # honor the IO encoding if we can, otherwise default to ASCII-8BIT
  internal_encoding = Encoding.find(internal_encoding) if internal_encoding
  external_encoding = Encoding.find(external_encoding) if external_encoding
  if encoding
    encoding, = encoding.split(":", 2) if encoding.is_a?(String)
    encoding = Encoding.find(encoding)
  end
  @encoding = raw_encoding(nil) || internal_encoding || encoding ||
              Encoding.default_internal || Encoding.default_external

Beyond the issue that @ShockwaveNN raised, I also noticed we aren't using the external_encoding argument anywhere.

@stevendaniels
Copy link
Contributor

stevendaniels commented Feb 14, 2018

@kou, @hsbt:
It looks like change in the order of operations is what broke things. raw_encoding(nil) returns <Encoding:UTF-8> because @io.external_encoding == <Encoding:UTF-8>

  def raw_encoding(default = Encoding::ASCII_8BIT)
    if @io.respond_to? :internal_encoding
      @io.internal_encoding || @io.external_encoding
    elsif @io.is_a? StringIO
      @io.string.encoding
    elsif @io.respond_to? :encoding
      @io.encoding
    else
      default
    end
  end

To me, it looks like raw_encoding will always return an encoding, so
raw_encoding(nil) || internal_encoding || encoding || Encoding.default_internal || Encoding.default_external will never reach internal_encoding

BTW, neither Ruby 2.4.3 or 2.5.0 recognize bom|utf-8 as a valid encoding:

irb(main):001:0> RUBY_VERSION
=> "2.4.3"
irb(main):002:0>  Encoding.find("bom|utf-8")
ArgumentError: unknown encoding name - bom|utf-8
	from (irb):2:in `find'
	from (irb):2
	from /Users/steven/.rbenv/versions/2.4.3/bin/irb:11:in `<main>'

It doesn't look like we use the internal_encoding, encoding, external_encoding arguments. Do we still need them?

stevendaniels added a commit to stevendaniels/csv that referenced this issue Feb 16, 2018
This fixes ruby#23 and makes the following modification:

+ @encoding can be set to external_encoding. Previously, it ignored external_encoding.

To me, it looks like the encoding option is usually ignored. `CSV.read|CSV.open` will use the encoding options provided by a user if:

1. the user passes in an `IO`-like object where `respond_to?(:internal_encoding) == true && (io.internal_encoding || io.external_encoding).nil?`
2. the user passes in a `StringIO`-like object where the string has no encoding.
3. the user passes in an `IO`-like object where`respond_to?(:encoding) == true && io.encoding.nil?`

As far as I know, the above situations can only occur if the user is passing in a custom IO object
that doesn't have encoding information. This analysis is based off of the `raw_encoding` method

```ruby
def raw_encoding(default = Encoding::ASCII_8BIT)
  if @io.respond_to? :internal_encoding
    @io.internal_encoding || @io.external_encoding
  elsif @io.is_a? StringIO
    @io.string.encoding
  elsif @io.respond_to? :encoding
    @io.encoding
  else
    default
  end
end
```
@kou kou closed this as completed in 4d13339 Feb 22, 2018
@kou
Copy link
Member

kou commented Feb 22, 2018

@ShockwaveNN Thanks for your report. I've fixed it.
You can use here.

@tricknotes
Copy link
Contributor

@kou 4d13339 is so cool!
Could you release a new version including this commit?
I want to use this fix as a released version.

@kou
Copy link
Member

kou commented May 2, 2018

OK.
Can you update news.md? We can release a new version after we have a release note for the next version in new.md.

@tricknotes
Copy link
Contributor

Of course!
Thanks for contribution chance for me.

I sent a PR for news.md. #36

@kou
Copy link
Member

kou commented May 2, 2018

Great!

cfitz added a commit to cfitz/archivesspace that referenced this issue Jun 20, 2018
Issue with CSV library not liking the encoding option:
ruby/csv#23

Bump to use newer version of CSV and not the version that ships with
jruby
@radar
Copy link

radar commented Sep 5, 2018

I am still having issues trying to read a CSV string that has a byte order mark prepended to it:

require 'csv'
puts "CSV VERSION #{CSV::VERSION}" # Shows 3.0.0

bom_character = 65_279
contents = "first_name\nRyan".codepoints.unshift(bom_character).pack("U*")
csv = CSV.parse(contents, headers: true, encoding: 'bom|utf-8')
csv.each do |row|
  p row.to_h.keys.first.codepoints
  p "ROW FIRST NAME IS #{row["first_name"]}"
end

This outputs nil for the first name and indicates that the key also contains the BOM. What am I doing wrong here?

@kou
Copy link
Member

kou commented Sep 6, 2018

BOM is for opening a file not parse target string.

What am I doing wrong here?

You should not reuse closed issue. You should open a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

5 participants