ArgumentError: invalid byte sequence in US-ASCII #314

Closed
mperham opened this Issue Apr 18, 2012 · 33 comments

9 participants

@mperham
mperham commented Apr 18, 2012

Rubygems 1.8.22

Trying to bundle install when pointing to a git repo. Looks like a gemspec is being converted to YAML:

# -*- encoding: utf-8 -*-
require File.expand_path("../lib/transitions/version", __FILE__)

Gem::Specification.new do |s|
  s.name        = "transitions"
  s.version     = Transitions::VERSION
  s.platform    = Gem::Platform::RUBY
  s.authors     = ["Jakub Kuźma", "Timo Rößner"]
  s.email       = "timo.roessner@googlemail.com"
  s.homepage    = "http://github.com/troessner/transitions"
  s.summary     = "State machine extracted from ActiveModel"
  ...
end

Notice the wacky characters in the author names. Crazy europeans. Does YAML not track the encoding?

/usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/site_ruby/1.9.1/rubygems/specification.rb:1952:in `gsub'
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/site_ruby/1.9.1/rubygems/specification.rb:1952:in `to_yaml'
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/site_ruby/1.9.1/rubygems/builder.rb:79:in `block (2 levels) in write_package'
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/site_ruby/1.9.1/rubygems/package/tar_output.rb:73:in `block (3 levels) in add_gem_contents'
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/site_ruby/1.9.1/rubygems/package/tar_writer.rb:83:in `new'
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/site_ruby/1.9.1/rubygems/package/tar_output.rb:67:in `block (2 levels) in add_gem_contents'
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/site_ruby/1.9.1/rubygems/package/tar_output.rb:65:in `wrap'
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/site_ruby/1.9.1/rubygems/package/tar_output.rb:65:in `block in add_gem_contents'
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/site_ruby/1.9.1/rubygems/package/tar_writer.rb:113:in `add_file'
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/site_ruby/1.9.1/rubygems/package/tar_output.rb:63:in `add_gem_contents'
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/site_ruby/1.9.1/rubygems/package/tar_output.rb:31:in `open'
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/site_ruby/1.9.1/rubygems/package.rb:44:in `open'
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/site_ruby/1.9.1/rubygems/builder.rb:78:in `block in write_package'
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/open-uri.rb:35:in `open'
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/open-uri.rb:35:in `open'
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/site_ruby/1.9.1/rubygems/builder.rb:77:in `write_package'
 ** [out :: rc.theclymb.com] 
 ** [out :: rc.theclymb.com] /usr/local/rvm/rubies/ruby-1.9.3-p125/lib/ruby/site_ruby/1.9.1/rubygems/builder.rb:39:in `build'
 ** [out :: rc.theclymb.com] /usr/local/rvm/gems/ruby-1.9.3-p125@global/gems/bundler-1.1.3/lib/bundler/source.rb:443:in `block in generate_bin'
 ** [out :: rc.theclymb.com] /usr/local/rvm/gems/ruby-1.9.3-p125@global/gems/bundler-1.1.3/lib/bundler/source.rb:443:in `chdir'
 ...
@luislavena
RubyGems member

Hello @mperham indeed crazy europeans

  • RubyGems will serialize Gem::Specification as YAML inside the gem. That is the portable way
  • Problems with accented characters can depend on whether Psych is available to parse the YAML, and on when the gemspec was generated or read.

Just tried locally like this:

Gem::Specification.new do |s|
  s.name        = "transitions"
  s.version     = "0.1"
  s.platform    = Gem::Platform::RUBY
  s.authors     = ["Jakub Kuźma", "Timo Rößner"]
  s.email       = "timo.roessner@googlemail.com"
  s.homepage    = "http://github.com/troessner/transitions"
  s.summary     = "State machine extracted from ActiveModel"
  s.description = "<something for description>"
  s.files       = ["README"]
end
$ gem build myspec.gemspec 
  Successfully built RubyGem
  Name: transitions
  Version: 0.1
  File: transitions-0.1.gem

$ gem spec transitions-0.1.gem authors
---
- Jakub Kuźma
- Timo Rößner

Seems to work. My env:

$ gem env
RubyGems Environment:
  - RUBYGEMS VERSION: 1.8.21
  - RUBY VERSION: 1.9.3 (2012-02-16 patchlevel 125) [x86_64-darwin10.8.0]
...

With Psych as YAML engine.
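
The same check can be sketched from Ruby instead of the gem CLI. This is a minimal illustration, assuming a RubyGems version whose Gem::Specification#to_yaml is encoding-aware; sample_spec is a hypothetical helper and its field values simply mirror the ones above.

```ruby
require "rubygems"

# Hypothetical helper: a throwaway spec with non-ASCII author names,
# mirroring the gemspec discussed in this thread.
def sample_spec
  Gem::Specification.new do |s|
    s.name    = "transitions"
    s.version = "0.1"
    s.authors = ["Jakub Kuźma", "Timo Rößner"]
    s.summary = "State machine extracted from ActiveModel"
    s.files   = []
  end
end

# Serialize the spec the way RubyGems stores it inside a .gem file.
yaml = sample_spec.to_yaml
puts yaml.encoding
puts yaml.include?("Jakub Kuźma")
```

Running this under both LANG=C and a UTF-8 locale is a quick way to see whether a given RubyGems install is affected.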

@mperham
mperham commented Apr 18, 2012

I removed the non-ascii characters and now my bundle works. I'm using bundler 1.1.3 and 1.9.3-p125.

@luislavena
RubyGems member

@mperham do you have Psych (libyaml) for your Ruby 1.9.3? Also, what version of RubyGems?

@luislavena
RubyGems member

Because I can read the gem without issues:

$ gem fetch transitions
Fetching: transitions-0.0.14.gem (100%)
Downloaded transitions-0.0.14
luis@seyori:~/sandbox 
$ gem spec transitions-0.0.14.gem authors
---
- Jakub Kuźma
- Timo Rößner
@mperham
mperham commented Apr 18, 2012

I guess my point is that your codepath is different because I can do the exact same thing:

$ gem fetch transitions
Fetching: transitions-0.0.14.gem (100%)
Downloaded transitions-0.0.14
[mikep@moxley-dev current]$ gem spec transitions-0.0.14.gem authors
---
- Jakub Kuźma
- Timo Rößner

$ ruby -v
ruby 1.9.3p125 (2012-02-16) [x86_64-linux]
$ gem -v
1.8.22

I have no idea which YAML I'm using or how to get you that info. Let me know how to do that.

@luislavena
RubyGems member

I guess my point is that your codepath is different because I can do the exact same thing

Sorry, I'm trying to determine whether the issue is coming from RubyGems, rubygems.org or Bundler. Since Bundler patches a fair amount of RubyGems, I first need to make sure that RubyGems on its own can process the YAML from a gem it fetched, and also accented characters in a Gem::Specification.

$ ruby -v -ryaml -e "puts YAML::ENGINE.yamler"
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin10.8.0]
psych
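
A script-friendly variant of the one-liner above, as a rough sketch. It assumes that on Rubies where YAML::ENGINE is not defined, YAML is simply an alias of Psych; yaml_engine_name is a hypothetical helper name.

```ruby
require "yaml"

# Hypothetical helper: report the YAML backend name.
# YAML::ENGINE exists on older Rubies such as the 1.9.3 used in this thread;
# where it is absent, this assumes YAML is an alias of Psych.
def yaml_engine_name
  if defined?(YAML::ENGINE)
    YAML::ENGINE.yamler
  else
    YAML.name.downcase
  end
end

puts yaml_engine_name
```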

Sorry I omitted that I've tried to bundle it without problems:

$ cat Gemfile
source :rubygems

gem "transitions", :git => "https://github.com/troessner/transitions.git"


$ bundle check
https://github.com/troessner/transitions.git (at master) is not checked out. Please run `bundle install`

$ bundle install
Fetching https://github.com/troessner/transitions.git
remote: Counting objects: 590, done.
remote: Compressing objects: 100% (301/301), done.
remote: Total 590 (delta 337), reused 485 (delta 243)
Receiving objects: 100% (590/590), 78.62 KiB | 77 KiB/s, done.
Resolving deltas: 100% (337/337), done.
Fetching gem metadata from http://rubygems.org/.......
Using transitions (0.0.16) from https://github.com/troessner/transitions.git (at master) 
Using bundler (1.1.3) 
Your bundle is complete! Use `bundle show [gemname]` to see where a bundled gem is installed.

$ bundle exec gem list

*** LOCAL GEMS ***

bundler (1.1.3)
transitions (0.0.16)

So there must be something in your particular environment that is affecting this (more specifically: Syck versus Psych as the YAML engine).

@evanphx
RubyGems member
evanphx commented Apr 18, 2012

@mperham What is the output of ruby -e 'p Encoding.default_external' in the terminal where you had the problem? It appears that StringIO (which is used in #to_yaml) sets the encoding of its internal buffer to the default_external, which can result in raw UTF-8 sequences showing up in a US-ASCII string. When a string like this is #gsub!'d you get the exception you saw.
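
The failure mode described here can be sketched without StringIO at all. The snippet below is an illustration only: it labels UTF-8 bytes as US-ASCII by hand, and the regexp is an arbitrary ASCII pattern, not necessarily the one RubyGems uses. mislabelled_yaml and gsub_error_message are hypothetical helper names.

```ruby
# Hypothetical helper: UTF-8 bytes tagged as US-ASCII, which is what a
# US-ASCII buffer ends up holding once raw UTF-8 is written into it.
def mislabelled_yaml
  "---\n- Jakub Kuźma\n".dup.force_encoding(Encoding::US_ASCII)
end

# Hypothetical helper: run a regexp substitution and return the
# ArgumentError message, or nil if no error was raised.
def gsub_error_message(str)
  str.gsub(/ !!null \n/, " \n")
  nil
rescue ArgumentError => e
  e.message
end

broken = mislabelled_yaml
puts broken.valid_encoding?
puts gsub_error_message(broken).inspect
```

The string is not valid in its declared encoding, so the regexp engine refuses to scan it even though the pattern itself is plain ASCII.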

@mperham
mperham commented Apr 18, 2012

@evanphx I'm seeing #<Encoding:UTF-8>.

@luislavena Note that transitions 0.0.16 has removed the special characters from the gemspec. That YAML code prints:

ruby 1.9.3p125 (2012-02-16) [x86_64-linux]
psych
malloc_limit=59000000 (8000000)
heap_min_slots=600000 (10000)
@evanphx
RubyGems member
evanphx commented Apr 18, 2012

To see the bug, download https://gist.github.com/2417140 and run LANG=C ruby file.rb vs LANG=UTF-8 ruby file.rb.

@luislavena
RubyGems member

@mperham did you find the issue?

@mperham
@voxik
voxik commented Jul 12, 2012

@evanphx I can reproduce the issue with the gist you posted above:

$ LANG=en_US.utf-8 ruby file.rb 
#<Encoding:UTF-8>
"---\n- Jakub Kuźma\n- Timo Rößner\n"
#<Encoding:UTF-8>
$ LANG=C ruby file.rb 
#<Encoding:UTF-8>
"---\n- Jakub Ku\xC5\xBAma\n- Timo R\xC3\xB6\xC3\x9Fner\n"
#<Encoding:US-ASCII>
file.rb:22:in `gsub': invalid byte sequence in US-ASCII (ArgumentError)
    from file.rb:22:in `<main>'

I am facing the same issue, when I am trying to build Bundler for Fedora and running

$ LANG=C gem build bundler.gemspec 
ERROR:  While executing gem ... (ArgumentError)
    invalid byte sequence in US-ASCII

fails while the

$ LANG=en_US.utf-8 gem build bundler.gemspec 
  Successfully built RubyGem
  Name: bundler
  Version: 1.1.4
  File: bundler-1.1.4.gem

works. Fedora uses LANG=C for its build system, and since "gem build" is part of the test suite, this issue makes Bundler's test suite fail.
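
To check what a build environment like this does to Ruby's defaults, one option is to ask a child Ruby process directly. This is a sketch under the assumption that the locale variables are the only thing influencing the result (no -E switch in RUBYOPT, for example); default_external_under is a hypothetical helper name.

```ruby
require "rbconfig"

# Hypothetical helper: start a child Ruby under the given locale and
# return the name of its Encoding.default_external.
def default_external_under(locale)
  env = { "LANG" => locale, "LC_ALL" => locale, "LC_CTYPE" => locale }
  script = "print Encoding.default_external.name"
  IO.popen(env, [RbConfig.ruby, "-e", script], &:read)
end

puts default_external_under("C")
```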

@evanphx evanphx was assigned Jul 12, 2012
@luislavena
RubyGems member

@voxik can you tell us what Encoding.default_internal, Encoding.default_external and puts StringIO.new.string.encoding returns?

$ ruby -v -rstringio -e "puts Encoding.default_internal, Encoding.default_external, StringIO.new.string.encoding"
@voxik
voxik commented Jul 13, 2012
$ ruby -v -rstringio -e "puts Encoding.default_internal, Encoding.default_external, StringIO.new.string.encoding"
ruby 1.9.3p194 (2012-04-20 revision 35410) [i386-linux]

US-ASCII
US-ASCII
$ LANG=en_US.utf-8 ruby -v -rstringio -e "puts Encoding.default_internal, Encoding.default_external, StringIO.new.string.encoding"
ruby 1.9.3p194 (2012-04-20 revision 35410) [i386-linux]

UTF-8
UTF-8
$ LANG=C ruby -v -rstringio -e "puts Encoding.default_internal, Encoding.default_external, StringIO.new.string.encoding"
ruby 1.9.3p194 (2012-04-20 revision 35410) [i386-linux]

US-ASCII
US-ASCII

Please note that Encoding.default_internal is always nil, hence the empty line:

$ ruby -v -e "p Encoding.default_internal"
ruby 1.9.3p194 (2012-04-20 revision 35410) [i386-linux]
nil
@luislavena
RubyGems member

@voxik thanks, Encoding.default_internal returns nil on pretty much every platform.

@evanphx and @tenderlove might be better placed to comment on this, but it seems to me that StringIO ignores the encoding of a string you write into it, but not the encoding of a string you construct it with:

>> a = "ABC".encode("US-ASCII")
=> "ABC"
>> a.encoding
=> #<Encoding:US-ASCII>

>> n = StringIO.new(a)
=> #<StringIO:0x00000101047368>
>> n.string.encoding
=> #<Encoding:US-ASCII>

>> o = StringIO.new
=> #<StringIO:0x000001010451d0>
>> o.write(a)
=> 3
>> o.string.encoding
=> #<Encoding:UTF-8>

Am I missing something? Did I get it wrong?

@evanphx
RubyGems member
evanphx commented Jul 13, 2012

This is a bug in ruby so I'm not sure what we can even do about it honestly.

@luislavena
RubyGems member

@evanphx thanks for confirming it wasn't me :tongue:

Found a reference to this:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/388092

@voxik can you try from Evan's sample change io = StringIO.new to io = StringIO.new("") ?

I just tested on Ubuntu and it seems to work:

$ LANG=C /app/bin/ruby -v file2.rb
ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]
#<Encoding:UTF-8>
"---\n- Jakub Ku\u017Ama\n- Timo R\u00F6\u00DFner\n"
#<Encoding:UTF-8>
@voxik
voxik commented Jul 16, 2012

@luislavena Yes, that works. Would be nice to use this workaround in RubyGems. Thank you.

@luislavena luislavena was assigned Jul 16, 2012
@luislavena
RubyGems member

@voxik awesome, will take a look if I can create a test scenario and commit a fix around this.

@tenderlove

@luislavena @evanphx I don't think this StringIO behavior is a bug in Ruby. The StringIO is using the encoding of Encoding.default_external, like a File would do:

$ ruby -rstringio -e'p [Encoding.default_external, StringIO.new.string.encoding]'
[#<Encoding:UTF-8>, #<Encoding:UTF-8>]
$ ruby -Eascii:ascii -rstringio -e'p [Encoding.default_external, StringIO.new.string.encoding]'
[#<Encoding:US-ASCII>, #<Encoding:US-ASCII>]
$ LANG=C ruby -rstringio -e'p [Encoding.default_external, StringIO.new.string.encoding]'
[#<Encoding:US-ASCII>, #<Encoding:US-ASCII>]

Since the YAML output will always be UTF-8, we should ensure the encoding of the internal string is also UTF-8. It's lame, but you can do it by passing an encoded string to the constructor:

io = StringIO.new(''.encode('utf-8'))

I know it looks bad, but it should fix the bug. AFAIK, there's no other way to pass an encoding to the StringIO object.
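
A slightly fuller sketch of the constructor trick, for illustration only. utf8_string_io and dump_authors are hypothetical helpers, and the hand-written YAML lines stand in for whatever the real emitter would produce.

```ruby
require "stringio"

# Hypothetical helper: a buffer whose backing string is UTF-8 regardless of
# Encoding.default_external, using the constructor trick described above.
def utf8_string_io
  StringIO.new("".encode("utf-8"))
end

# Hypothetical helper: write a tiny YAML-like list into the buffer and
# return the accumulated string.
def dump_authors(io, authors)
  io.write("---\n")
  authors.each { |author| io.write("- #{author}\n") }
  io.string
end

out = dump_authors(utf8_string_io, ["Jakub Kuźma", "Timo Rößner"])
puts out.encoding
puts out.gsub(/ !!null \n/, " \n")
```

Because the buffer starts out tagged as UTF-8, the later gsub sees a string that is valid in its declared encoding.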

@tenderlove

Forgot to mention, I forked evan's gist, and patched it with the "encode" trick. It should work regardless of the LANG you pass on the command line.

@tenderlove

Also, calling IO#set_encoding seems to work too, which is curious because it looks like Rubygems already uses that on master. I think d781b0a needs to be backported to the 1.8.x branch. It looks like that commit (which would fix this bug) was never released.
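
For comparison, the set_encoding variant might look like the sketch below. It is not the code from d781b0a, just an illustration of the call; utf8_buffer_via_set_encoding is a hypothetical helper name.

```ruby
require "stringio"

# Hypothetical helper: create an empty StringIO and retag its buffer as
# UTF-8, instead of passing a pre-encoded string to the constructor.
def utf8_buffer_via_set_encoding
  io = StringIO.new
  io.set_encoding(Encoding::UTF_8)
  io
end

io = utf8_buffer_via_set_encoding
io.write("- Timo Rößner\n")
puts io.string.encoding
puts io.string.valid_encoding?
```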

@drbrain drbrain was assigned Jul 18, 2012
@luislavena
RubyGems member

@drbrain can I backport d781b0a into 1.8 branch? Let me know and I will happily do it :)

@luislavena luislavena was assigned Jul 18, 2012
@evanphx
RubyGems member
evanphx commented Jul 18, 2012

@luislavena Go right ahead!

@luislavena luislavena added a commit that referenced this issue Jul 18, 2012
@luislavena luislavena Manually backport encoding-aware YAML gemspec
Gem::Specification was not serialized properly with Ruby 1.9

This solves issue #314, which was not fully backported into 1.8 branch.

Since code and test organization differs between branches, I've modified
the test to fit 1.8 branch organization.
a1827da
@luislavena
RubyGems member

@voxik @mperham landed in 1.8 branch at a1827da

Can you guys test it out just to confirm so I can close this out?

Thanks for all the patience and sorry for all the back and forth!

@t0d0r
t0d0r commented Oct 13, 2012

I have a similar problem with the anemone gem. The problem was in uri/common.rb; here is my monkey patch.

Adding str.force_encoding(Encoding::BINARY) to the following method fixes the problem:

class URI::Parser
  def escape(str, unsafe = @regexp[:UNSAFE])
    unless unsafe.kind_of?(Regexp)
      # perhaps unsafe is String object
      unsafe = Regexp.new("[#{Regexp.quote(unsafe)}]", false)
    end
    str.force_encoding(Encoding::BINARY) # FIX
    str.gsub(unsafe) do
      us = $&
      tmp = ''
      us.each_byte do |uc|
        tmp << sprintf('%%%02X', uc)
      end
      tmp
    end.force_encoding(Encoding::US_ASCII)
  end
end
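
The same idea can be sketched as a standalone function that neither reopens URI::Parser nor changes the encoding of the caller's string. This is an illustration, not URI's API: percent_escape and its default unsafe pattern are assumptions made for the example.

```ruby
# Hypothetical helper: percent-escape every byte matched by `unsafe`.
# Works on a binary copy so the caller's string keeps its own encoding.
def percent_escape(str, unsafe = /[^\-_.~a-zA-Z\d]/)
  bytes = str.dup.force_encoding(Encoding::BINARY)
  escaped = bytes.gsub(unsafe) do |match|
    match.each_byte.map { |byte| format("%%%02X", byte) }.join
  end
  escaped.force_encoding(Encoding::US_ASCII)
end

puts percent_escape("Kuźma")
```

Treating the input as binary means the regexp never has to validate the bytes against a text encoding, which is what avoids the ArgumentError.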
@pjg
pjg commented Nov 25, 2012

I've had the misfortune to run into this issue while upgrading Ruby to 1.9.3-p327. Some previous version of RubyGems allowed me to bundle and run a Rails app with a gem whose gemspec contained UTF-8 characters (the author's name). With RubyGems 1.8.24 I can bundle such a gem, I can run 'rails server', and I can even run passenger-standalone just fine, but when I try to fire up the app using Passenger (doesn't matter if 3.0.12 or 3.0.18 or latest 3.9.1 beta) with apache2 I run into the dreaded:

invalid byte sequence in US-ASCII (ArgumentError)
***/.rvm/rubies/ruby-1.9.3-p327/lib/ruby/site_ruby/1.9.1/rubygems/specification.rb:578:in `normalize_yaml_input'

I then replaced RubyGems 1.8.24 with the latest version from the 1.8 branch (995f66c), but that unfortunately did not solve the issue, so I guess that the patch a1827da by @luislavena doesn't fully work. Unfortunately, running the latest HEAD version of RubyGems won't let me bundle anything (NoMethodError(s)), so I could not test it.

Perhaps the issue lies somewhere between bundler, rubygems and passenger.

@drbrain
RubyGems member
drbrain commented Nov 27, 2012

According to the comments, this was fixed by @a1827da

If this is not the case, please reopen.

@drbrain drbrain closed this Nov 27, 2012
@pjg
pjg commented Dec 1, 2012

@drbrain Uhm, I guess you did not read my comment...

@drbrain
RubyGems member
drbrain commented Dec 1, 2012

I guess that the patch @a1827da by @luislavena doesn't fully work. Unfortunately running latest HEAD version of rubygems won't let me bundle anything (NoMethodError(s)), so I could not test it.

Since you did not test it and are only guessing it is still broken I have no evidence it is still broken, so I closed this ticket aggressively. If you have evidence it is broken you will need to test it and show the results.

Don't test RubyGems through bundler, it adds too much confusion. Just use gem install.

@pjg
pjg commented Dec 1, 2012

Uhm, it's still broken on the 1.8 branch...

And the error I've experienced had to do with bundler + passenger + rubygems (testing 1.8.24 without bundler worked just fine). Anyway, I've now changed my gemspecs to be without any UTF-8 chars, so I won't run into this issue again. Perhaps someone else runs into this and adds something to this discussion.

@drbrain
RubyGems member
drbrain commented Dec 1, 2012

I doubt there will be further releases of the 1.8 branch. You need to test against master.

@eyaleizenberg

Removing the accented characters solved it for me. Thanks!

@fabiode fabiode pushed a commit to locomotivapro/boleto_bancario that referenced this issue Dec 2, 2014
@nelsonmhjr nelsonmhjr Removing utf-8 chars on gem summary because of yaml bug 3bbe354